Joint work performed by Gautam Muralidhar and Regunathan Radhakrishnan of Pivotal’s Data Science Labs.
Digital images and videos are ubiquitous online. The number of images and videos captured by humans using digital cameras is staggering. It has been estimated that an average of 350 million photos are uploaded to Facebook daily, and about 100 hours of video are uploaded to YouTube every minute. This has resulted in enormous “image and video lakes,” a term we use to describe these ever-growing collections of images and videos. Facebook alone is currently estimated to hold more than 250 billion photographs. Such gigantic image lakes are not unique to consumer services—they are also found in domains such as healthcare and astronomy.
With large image lakes comes the challenge of managing them efficiently. Image management at this scale requires strategies for optimal automated tagging, which is important for efficient indexing, searching, and browsing of the files in these large image lakes. Manual tagging is infeasible for image databases of this size, and is prone to errors due to users’ subjective opinions. This is where a good content-based image retrieval (CBIR) system becomes important.
A CBIR system takes a query image as input and returns the images whose content is most similar to it (Figure 1). Given a query image, a CBIR system can potentially be used to auto-tag (label) similar images in the collection, with the assigned label being the object category or scene description. This technology also has an important role to play in a number of non-consumer domains. In healthcare, CBIR systems can support case-based diagnosis: a common example is retrieving past cases and their associated diagnosis reports based on the content of a query medical image. A more sophisticated use of CBIR systems is to introduce autonomous agents that mine these gigantic image lakes and learn to recognize objects and scenes in real time.
Figure 1: Illustration of a typical CBIR system
An effective CBIR system requires several components: a large image collection, a reliable feature extractor, and machine learning/computer vision models that mine similar images. Further, these components need to run efficiently, both for the offline processing of images to extract a compact feature vector, or signature, for each image, and for retrieving similar images when presented with a query image. In this post, we will demonstrate how a CBIR system can be easily and efficiently realized using Pivotal’s Hadoop® Distribution (Pivotal HD) with HAWQ, our SQL engine for Hadoop®.
The CBIR system we present here is largely based on the work described in the 2007 CIVR paper, “Image Retrieval on Large-Scale Image Databases.” The main idea is that each image is modeled as a “bag” of visual words, from which a collection of latent topics is estimated. The similarity between a pair of images is then computed as the similarity between the corresponding topic collections. The analogy is a typical natural language document retrieval system, where the documents and the words in the documents are observed, while the high-level topics that label the documents (e.g., sports, science, art, etc.) are not explicitly observed.
To apply this approach to images, the following components are needed:
1) A vocabulary of visual words, over which an image is modeled as a bag-of-words;
2) A mathematical model that takes a bag-of-words as input and generates a collection of latent topics; and
3) An algorithm that can compute the similarity between topic collections.
Unlike natural language documents, which are composed of words drawn from a language corpus or dictionary, a visual word vocabulary has to be generated. A common approach is to identify a set of interest points (pixels) in each image and describe each point using a suitable feature descriptor. Popular interest point detectors and feature descriptors include the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and oriented FAST and rotated BRIEF (ORB), among several others, with the choice depending on the specific application.
The detected interest points are clustered using an algorithm such as k-means to generate a vocabulary comprising as many words as there are cluster centroids (k, in the case of k-means). Each interest point is then assigned to one of the k clusters to generate a bag-of-words representation for each image. Once we have a bag-of-words representing each image, a mathematical model such as Latent Dirichlet Allocation (LDA) can be used to estimate a compact latent topic collection (vector) for each image.
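The assignment step above can be sketched in a few lines of numpy. This is a minimal illustration, not the in-database implementation described later: it assumes the vocabulary (cluster centroids) has already been learned, and uses tiny 2-D “descriptors” in place of 128-D SIFT vectors. The function name `bag_of_words` is our own.

```python
import numpy as np

def bag_of_words(descriptors, vocabulary):
    """Assign each local feature descriptor to its nearest visual word
    and return a word-count histogram for the image.

    descriptors: (n, d) array of feature vectors (e.g., 128-D SIFT).
    vocabulary:  (k, d) array of cluster centroids from k-means.
    """
    # Squared Euclidean distance from every descriptor to every centroid.
    dists = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)  # nearest centroid per descriptor
    return np.bincount(words, minlength=len(vocabulary))

# Toy example: 2-D "descriptors" and a 3-word vocabulary.
vocab = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
desc = np.array([[0.1, 0.2], [9.8, 10.1], [0.2, 9.7], [0.0, 0.1]])
print(bag_of_words(desc, vocab))  # -> [2 1 1]
```

The resulting histogram is the image’s bag-of-words vector: word order and spatial layout are discarded, only word frequencies remain.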
The visual word vocabulary and topic collections are computed from the images contained in the repository. For a new query image, the pre-computed visual word vocabulary is used to generate a bag-of-words representation, which is then fed to the already trained LDA model to estimate the topic collection for the query image. Similar images from the database are then retrieved based on similarities between the topic collections. The image retrieval algorithm can be based on a number of supervised and unsupervised machine learning techniques, but for simplicity, we use the exhaustive k-nearest neighbor approach. The system overview is illustrated in Figure 2.
Figure 2: CBIR System Design Overview
In this section, we describe the implementation of the various components of the CBIR system using the Pivotal stack. We begin with a collection of images stored in the Hadoop® distributed file system (HDFS).
1. Feature extraction using Pivotal’s Hadoop® Distribution (Pivotal HD)
Interest point detection and feature extraction are performed on each image independently, so this task is well suited to a simple map job on Pivotal HD. Prior to running this job, another simple map job packs all the images stored on HDFS into a single Hadoop® sequence file, since the Hadoop® ecosystem is best suited to processing large files. For interest point detection and feature representation, we use SIFT features, although any other popular feature descriptor can easily be incorporated into the framework. The map job outputs a collection of SIFT feature vectors for each image (each SIFT vector is a 128-D vector of real numbers) to a CSV file, and the feature vectors are then loaded into a table in a Pivotal HAWQ database.
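The hand-off from the map job to HAWQ hinges on a simple convention: one CSV row per descriptor, keyed by image. A minimal sketch of that serialization step, with an illustrative column layout of our own choosing (image id, descriptor index, then the feature values), might look like this:

```python
import csv
import io

def sift_rows_to_csv(image_id, descriptors):
    """Format one image's feature descriptors as CSV rows of
    (image_id, descriptor_index, feature values...) suitable for
    bulk-loading into a database table. Column layout is illustrative.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    for i, vec in enumerate(descriptors):
        writer.writerow([image_id, i] + list(vec))
    return buf.getvalue()

# Toy example with 3-D "descriptors" standing in for 128-D SIFT vectors.
out = sift_rows_to_csv("img_001", [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
print(out)  # first row: img_001,0,0.1,0.2,0.3
```

In the real pipeline each row would carry 128 feature values, and the CSV files on HDFS would be loaded into the HAWQ table with a bulk-load mechanism rather than row-by-row inserts.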
2. In-Database Visual Word Vocabulary Creation
K-means clustering is performed in-database (in HAWQ) on the feature vectors from Step 1 using the Pivotal MADlib machine-learning library. Figure 3 illustrates this step:
Figure 3: In-database k-means clustering of SIFT feature vectors using MADlib
Once the visual word vocabulary is created, each interest point is assigned to the closest visual word, thereby generating a bag-of-words representation for every image, as illustrated in Figure 4.
Figure 4: Bag-of-words representation
3. In-Database Topic Modeling and K-Nearest Neighbor Retrieval
Once we have a bag-of-words representation for each image, we apply a topic model to discover latent topics that pervade the collection of images. For each word in an image, a topic model assigns the most probable topic the word belongs to, yielding a topic vector, or vector of topic proportions, for the image. To discover the topics, we employ the LDA model and run this in-database using the MADlib library. Figure 5 illustrates the process of topic modeling.
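The step from per-word topic assignments to a per-image topic vector is just counting and normalizing. A minimal sketch (our own helper, standing in for what the MADlib LDA output provides in-database):

```python
import numpy as np

def topic_proportions(word_topic_assignments, num_topics):
    """Turn the per-word topic assignments produced by an LDA pass over
    one image into that image's topic-proportion vector: topic counts
    normalized to sum to 1."""
    counts = np.bincount(word_topic_assignments, minlength=num_topics)
    return counts / counts.sum()

# Toy example: 6 visual words in one image, a 4-topic model.
assignments = np.array([0, 0, 2, 2, 2, 3])
print(topic_proportions(assignments, 4))  # -> [0.333..., 0, 0.5, 0.166...]
```

The resulting vector (50-dimensional in our system, one entry per topic) is the compact signature used for all subsequent similarity comparisons.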
Figure 5: Topic modeling illustration
Once the topic collections have been estimated, the same trained topic model can be applied to a previously unseen query image to infer its topic vector. An exhaustive k-nearest neighbor search over the database topic vectors then retrieves the top k similar images. Here again, the implementation is in-database (in HAWQ), and topic similarity is measured using the cosine distance.
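The retrieval step can be sketched as follows. This is an out-of-database illustration of the same logic, assuming topic vectors are available as numpy arrays; the function name `knn_by_cosine` is our own.

```python
import numpy as np

def knn_by_cosine(query_topics, db_topics, k=3):
    """Return indices of the k database images whose topic vectors are
    most similar to the query's, ranked by cosine similarity."""
    q = query_topics / np.linalg.norm(query_topics)
    db = db_topics / np.linalg.norm(db_topics, axis=1, keepdims=True)
    sims = db @ q                 # cosine similarity per database image
    return np.argsort(-sims)[:k]  # highest similarity first

# Toy example with 3 topic dimensions instead of 50.
db = np.array([[0.8, 0.1, 0.1],
               [0.1, 0.8, 0.1],
               [0.7, 0.2, 0.1]])
query = np.array([0.9, 0.05, 0.05])
print(knn_by_cosine(query, db, k=2))  # -> [0 2]
```

The scan is exhaustive, which is fine at our scale; for much larger collections an approximate nearest-neighbor index would be the natural replacement.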
Implementation Details and Retrieval Analyses
The current system was built with a subset of 1133 images from the 30,606 images contained in Caltech-256, a popular image dataset. The 1133 images were split across eight object categories: galaxy, tennis racket, t-shirt, gorilla, teddy bear, bear, fire truck, and golden gate bridge. For each category, 90% of images were set aside as a training set (part of the image collection), and the remaining 10% were used as a test, or query, set. The visual word vocabulary size (k in k-means clustering) was set to 2000, the number of topics to 50, and the top 15 similar images were retrieved using the k-nearest neighbor implementation.
Figures 6 and 7 illustrate examples of query images and the top four images retrieved. The top-performing category was “galaxy,” with a precision of 0.76.
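The per-category precision figures quoted here are computed in the usual precision-at-k way: the fraction of the top k retrievals whose category matches the query’s. A minimal sketch (helper name is our own):

```python
def precision_at_k(retrieved_labels, query_label, k):
    """Fraction of the top-k retrieved images whose category label
    matches the query image's category."""
    top_k = retrieved_labels[:k]
    return sum(1 for label in top_k if label == query_label) / k

# Toy example: the top 4 retrievals for a "galaxy" query.
print(precision_at_k(["galaxy", "galaxy", "bear", "galaxy"], "galaxy", 4))  # -> 0.75
```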
Figure 6: Example of a galaxy query image and the top four image retrievals
Figure 7: Example of a gorilla query image and the top four image retrievals
Since the system is completely unsupervised, confusions can occur during retrieval. An example of confusion is illustrated in Figure 8:
Figure 8: An example of confusion in retrieval
In this blog post, we have demonstrated how a CBIR system can be easily built using the various components of the Pivotal stack. The stack allows plug-and-play experimentation with different feature descriptors and machine-learning algorithms. Further, Pivotal HD with HAWQ enables fast processing of images and easy experimentation with the various in-database MADlib machine-learning algorithms. And this is just the beginning: similar frameworks and systems for image and computer vision analytics on large-scale photo collections can be easily built using Pivotal technologies, which we will cover in future posts.
Horster, E., et al., “Image Retrieval on Large-Scale Image Databases,” CIVR 2007, pp. 17-24.
SIFT, first described in: Lowe, D. G., “Object recognition from local scale-invariant features,” ICCV 1999, pp. 1150-1157.
SURF, first described in: Bay, H., et al., “SURF: speeded up robust features,” CVIU 2008, 110(3), pp. 346-359.
ORB, first described in: Rublee, E., et al., “ORB: an efficient alternative to SIFT or SURF,” ICCV 2011, pp. 2564-2571.
Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.