Sketch-based Image Retrieval from Millions of Images under Rotation, Translation and Scale Variations
Sarthak Parui · Anurag Mittal
Received: date / Accepted: date
Abstract
Proliferation of touch-based devices has made sketch-based image retrieval practical. While many methods exist for sketch-based object detection/image retrieval on small datasets, relatively less work has been done on large (web)-scale image retrieval. In this paper, we present an efficient approach for image retrieval from millions of images based on user-drawn sketches. Unlike existing methods for this problem, which are sensitive to even translation or scale variations, our method handles rotation, translation, scale (i.e. a similarity transformation) and small deformations. The object boundaries are represented as chains of connected segments and the database images are pre-processed to obtain such chains that have a high chance of containing the object. This is accomplished using two approaches in this work: a) extracting long chains in contour segment networks and b) extracting boundaries of segmented object proposals. These chains are then represented by similarity-invariant variable-length descriptors. Descriptor similarities are computed by a fast Dynamic Programming-based partial matching algorithm. This matching mechanism is used to generate a hierarchical k-medoids based indexing structure for the extracted chains of all database images in an offline process, which is used to efficiently retrieve a small set of possible matched images for query chains. Finally, a geometric verification step is employed to test the geometric consistency of multiple chain matches to improve results. Qualitative and quantitative results clearly demonstrate the superiority of the approach over existing methods.
Keywords
Sketch-based Retrieval · Image Retrieval · Shape Representation · Indexing
Sarthak Parui · Anurag Mittal
Computer Vision Lab, Dept. of Computer Science & Engineering
Indian Institute of Technology Madras
Tel.: +91-44-22575352
E-mail: sarthak,[email protected]
The explosive growth of digital images on the web has substantially increased the need for an accurate, efficient and user-friendly large-scale image retrieval system. With the growing popularity of touch-based smart computing devices and the consequent ease and simplicity of querying images via hand-drawn sketches on touch screens (Lee et al., 2011), sketch-based image retrieval has emerged as an interesting application. The standard mechanism of text-based querying could be imprecise due to wide demographic variations. It also faces the issue of availability, authenticity and ambiguity in the tag and text information surrounding an image (Sigurbjörnsson and Van Zwol, 2008; Schroff et al., 2011), which necessitates us to exploit image content for better search. Although various popular image search engines such as Google (https://images.google.com/) and TinEye provide an interface for similar image search using an exemplar image as the query, a user may not have access to such an image every time. Instead, a hand-drawn sketch may be used for querying, since sketching is a fundamental mechanism for humans to conceptualize and render visual content, as also suggested by various Neuroscience studies (Marr, 1982; Landay and Myers, 2001; Walther et al., 2011). Thus, sketch-based image retrieval, being a far more expressive way of image search, either alone or in conjunction with other retrieval mechanisms such as text, may yield better results, which makes it an important and interesting problem to study.

Apart from online web-scale image retrieval, an efficient sketch-based image retrieval mechanism has numerous other applications. For instance, it can be used to efficiently retrieve intended images from a constrained dataset, viz. a user's personal photo-album in a touch-sensitive camera/tablet for which no text/tag information is available. Observing the expressive power of free-hand user sketches, a few methods have been proposed for searching/designing apparels (King and Lau, 1996; Tseng et al., 2009; Zhang et al., 2012; Kondo et al., 2014), accessories (Zeng et al., 2014) or generic 3D objects such as home appliances (Eitz et al., 2012b) using user sketches. These applications require efficient shape representation and fast sketch-to-image matching to facilitate a smooth user experience. Furthermore, sketch-based retrieval can also be used to improve existing text-based image retrieval systems. For instance, it may be possible to build a sketch in an on-line manner using the first few results of a text query system (Lee and Grauman, 2009; Bagon et al., 2010; Marvaniya et al., 2012) and use this sketch for retrieving images that may not have any associated tag information. Image tag information may also be improved for a database in an off-line process using sketch-based retrieval.

Sketch-based Object Detection:
Several approaches have been considered in the literature for describing shapes and measuring their similarity. Basic approaches for measuring the similarity of rigid shapes include Chamfer Matching (Barrow et al., 1977), in which the binary template of a target shape is efficiently matched in an edge-image by calculating the distances to the nearest edge pixels. This process can be sped up using a pre-computed Distance Transform of such an edge-image. Huttenlocher et al. (1993) used a closely related point correspondence measure called the Hausdorff distance, which examines the fraction of points in one set that lie within some ε distance of points in the other set (and vice versa). Liu et al. (2010) improved the accuracy and efficiency of edgemap alignment substantially by incorporating image edge-orientation information and using a three-dimensional distance transform to match over possible locations and orientations.

To handle non-rigidity of shapes and speed up matching, Belongie et al. (2002) proposed Shape Context, in which a shape is described by a set of descriptors, each of which captures the spatial distributions of other points along the shape contour in a log-polar space. Ling and Jacobs (2007) extend
Shape Context by using the inner-distance instead of the Euclidean distance, which restricts the paths between any two contour points to remain within the shape, thus making the descriptors quite robust to articulations. Gopalan et al. (2010) further normalize each part affinely before computing the inner distance for achieving even greater invariance to shape variations, especially in the case of perspective distortions. Shape Context-based methods have shown good promise for clutter-free shape-to-shape matching. However, they are difficult to apply for matching shapes in images that contain many extra and/or clutter edges.

Many methods have been proposed in the literature for more sophisticated sketch-based Object Detection/Retrieval in images that handle more shape variations, although the running time for detection and retrieval is often compromised. Felzenszwalb and Schwartz (2007) use a shape-tree to form a hierarchical structure of contour segments for representing each object, which helps in capturing local geometric properties along with the global shape structure. Deformation is allowed at individual nodes of the tree and an efficient Dynamic Programming-based matching algorithm is used to match two curves. Ferrari et al. (2006) create a Contour Segment Network (CSN) by connecting nearby edge pixels in an edge-map. For matching a sketch to an image, they find paths through the constructed CSN that resemble the outline of the sketch. In a later approach, Ferrari et al. (2008) propose groups of a small number k of adjacent approximately straight segments (kAS) as a local feature and describe them in a scale-insensitive way. These kAS are extracted for a large number of image windows and finally the object boundary is traced by linking individual small matched kAS in a multi-scale detection phase. In further work, they learn a codebook of Pairs of Adjacent Segments (PAS) (Ferrari et al., 2010), which is used in combination with Hough-based centroid voting and non-rigid thin-plate spline matching for detecting the sketched object in cluttered images. In contrast to kAS matching, Ravishankar et al. (2008) propose a multi-stage contour-based detection approach, where Dynamic Programming is used to match segments directly to the edge pixels.

Recently, several approaches have shown good Object Detection performance using self-contained angle-based descriptors for representing the object contours. Lu et al. (2009) propose a shape descriptor based on a three-dimensional histogram of angles and distances for triples of consecutive sample points along the object contours. To explicitly handle varying local shape distortions, they exploit a particle filter framework to jointly solve the contour fragment grouping and matching problems. Riemenschneider et al. (2010) sample a contour into a fixed number of points and calculate the angles between a line connecting any two sampled points and a line to a third point relative to the position of the first two points. This representation is
insensitive to translation and rotation but not scale. They employ a partial matching mechanism between two such angle-based descriptors by efficiently choosing the range of consecutive points using an integral image-based approach. To achieve robustness with respect to scale, object detection is performed over a range of scales.

All of these methods and many other state-of-the-art methods for Object Detection and Retrieval (Scott and Nowak, 2006; Kokkinos and Yuille, 2011; Ma and Latecki, 2011; Yarlagadda and Ommer, 2012) employ expensive online matching operations based on complex shape features to enhance the detection performance and typically show results on relatively small-sized datasets such as ETHZ (Ferrari et al., 2006) and MPEG-7 (Latecki et al., 2000) containing only a few hundreds or thousands of images, while taking a considerable time to parse through each image at detection/search time. However, for a dataset with millions of images with a desired retrieval time of at most a few seconds, these methods are inapplicable/insufficient. Efficient image pre-processing and a mechanism for fast online retrieval are necessary for large (web)-scale Image Retrieval.
Large-Scale Image Retrieval using Sketches:
Only a few attempts exist in the literature for the problem of sketch-based image retrieval on large databases. Eitz et al. (2009) decompose an image or sketch into different spatial regions and measure the correlation between the direction of strokes in the sketch and the direction of gradients in the image by proposing two types of descriptors, viz. an Edge Histogram Descriptor and a Tensor Descriptor. Histograms of prominent gradient orientations are encoded in the Edge Histogram Descriptor, whereas the Tensor Descriptor determines a single representative vector per cell that captures the main orientation of the image gradients of that cell. Descriptors at corresponding positions in the sketch and the image are correlated for matching. Due to this strong spatial assumption, they fail to retrieve images if the sketched object is present at a different scale, orientation and/or position. To determine similar images corresponding to a sketch, a linear scan over all database images is performed. This further limits the scalability of the method for large databases.

To address the issue of scalability, Cao et al. (2011) propose an indexing-friendly raw contour based method. Given a sketch query, the primary objective of this method is to retrieve database images which closely correlate with the shape and the position of the sketched object. For every possible image location and a few orientations, they generate an inverted list of images that have edge pixels (edgels) at that particular location and orientation. For a sketch query, similar images are determined by counting the number of similar edgels in both the sketch and images, which makes this method susceptible to scale, translation and rotation changes. Bozas and Izquierdo (2012) introduce a hashing-based framework with a strong assumption that a user only wants spatially consistent images as the search result. They extract HoG features (Dalal and Triggs, 2005) for overlapping spatial patches in an image and represent them using binary vectors by thresholding the HoG responses. The similarity between corresponding patches in the sketch and the image is estimated using the
Min-hash algorithm (Chum et al., 2008) that exploits the set-overlap similarity of these binarized descriptors. Similar to Cao et al. (2011), a reverse indexing structure on the hash keys is built to facilitate fast retrieval.
In order to encapsulate the spatial structure, Hu et al. (2010) describe both images and sketches using
Gradient field HoG (GF-HOG), which encodes a sparse orientation field computed from the gradients of the edge pixels. To facilitate retrieval, a Bag of Visual Words model is used and sketch-to-image similarity is measured by computing the distance between corresponding frequency histograms representing the distribution of GF-HOG derived “visual words”. However, this representation is noisy in the presence of even a small amount of background clutter. Moreover, since the experiments were performed only on small datasets, the usability of the method for large-scale retrieval is not very clear.

Riemenschneider et al. (2011) extend their prior idea (Riemenschneider et al., 2010) to large-scale retrieval by building a vocabulary tree (Nister and Stewenius, 2006) on the descriptors. They extract heavily overlapping contour fragments (allowing only one point shift) from a sketch and the edge-map of an image. In their approach, each contour is composed of a fixed number of points L and described as a matrix of $\binom{L}{2}$ angles that denote the orientation of the lines joining any two such points with respect to a vertical line. This makes their method sensitive to scale and orientation changes. Furthermore, due to the use of dense descriptors, the computational complexity is still very high for large datasets. Moreover, the experiments were performed only on very small datasets and therefore the applicability of this method at large scale is again not very clear. Zhou et al. (2012) determine the most “salient” object in the image and measure image similarity based on a descriptor built on the object. However, determining saliency is a very hard problem and the accuracy of even the state-of-the-art saliency methods in natural images is low (Li et al., 2014), thus rendering the method possibly quite unreliable.

Proposed method in context:
In this paper, we develop a system for large-scale sketch-based image retrieval that can handle scale, translation and rotation (similarity) variations without compromising on efficiency, which we believe has not been addressed earlier in the literature. First, the essential shape information of all the database images is captured by extracting sequences/chains of contour segments in an offline process using two efficient and often complementary methods: (a) finding long connected segments in contour segment networks (Sec. 2.1) and (b) using boundaries of segmented object proposals (Sec. 2.2). Such chains are represented using similarity-invariant variable-length descriptors (Sec. 3). These chain descriptors are matched using an efficient Dynamic Programming-based approximate substring matching algorithm. Note that the variability in the length of the descriptors makes the formulation unique and more challenging. Furthermore, partial matching is allowed to accommodate intra-class variations, small occlusions and the presence of non-object portions in the chains (Sec. 4). A hierarchical indexing tree structure of the chain descriptors of the entire image database is built offline to facilitate fast online search by matching the chains down the tree (Sec. 5). Finally, a geometric verification scheme is used for refining the retrievals by considering the geometric consistency among multiple chain matchings (Sec. 6). Results on several datasets indicate superior performance and advantages of our approach compared to prior work (Sec. 7).
In this section, we describe offline preprocessing of database images with an objective of having a compact representation which can be used to efficiently match the images with a query sketch. We first note that a user typically draws an object along its boundary (Cole et al., 2008) and a sketch of the object boundaries can more or less capture the distinctive object shape information (Cole et al., 2009). Thus, an image representation based on contour information of the object boundaries would be quite appropriate in this scenario. Ferrari et al. (2008) construct an unweighted
Contour Segment Network (CSN) which links nearby edge segments, and then extract groups of k adjacent segments (kAS) from such a network. Full contour segment networks (Ferrari et al., 2008) capture a good amount of information about the image edge segments, but are difficult to represent compactly and match efficiently in a large database. Shape descriptors can be utilized, but they can be quite noisy, especially in the presence of image clutter. In this work, we represent shape information using (long) chains of contour segments, which we believe contain a good amount of information for capturing the shape perspective and are at the same time efficient for storage and matching. While extraction of contour chains in sketches is trivial, doing so in database images is non-trivial due to the exponential number of chain possibilities. In this work, we devise two efficient and often complementary methods for extracting and encoding the essential object boundary information in a database image by (a) finding long chains in contour segment networks, and (b) using boundaries of segmented object proposals.

2.1 Finding Long Chains in Contour Segment Network

It has been observed that long sequences of contour segments typically have a good amount of intersection with the important object boundaries. Therefore, in our approach, we try to extract long sequences of segments or chains. Furthermore, these segments must be connected to each other strongly, i.e. their connecting end points should be close to each other for them to be considered in the same chain. To this end, we extract a set of salient contours for each image that have been experimentally found to be good candidates for object boundaries and then propose a technique to group these contours into meaningful sequences that have a good chance of overlapping with the boundaries of the objects present in an image.
At first, in order to use fixed parameters, all the database images are normalized to a standard size (we consider the size of the longest side as 256 pixels). Then, the Berkeley Edge Detector (Martin et al., 2004) is used to generate a probabilistic edge-map corresponding to the object boundaries in the image. This gives superior object boundaries compared to traditional edge detection approaches such as Canny (Canny, 1986) by considering texture along with color and brightness in an image and specifically learning for object boundaries. However, such a boundary-map typically still
Fig. 1: Grouping Salient Contours into Chains: (a) Image, (b) Edgemap (Martin et al., 2004), (c) Salient contours (Zhu et al., 2007), (d) Illustrative snapshot of a constructed graph, (e) Maximum Spanning Tree for one component, (f) A long chain

contains a lot of clutter (Fig. 1(b)). Therefore, an intelligent grouping of edge pixels is done to yield better contours that have a higher chance of belonging to an object boundary. The method proposed by Zhu et al. (2007) groups edge pixels by considering long connected edge sequences that have as few bends as possible, especially at the junction points, as it has been found that the object boundaries mostly follow a straight path at the junction points. Contours that satisfy such a constraint are called salient contours in their work and this method is used by us to extract a set of salient contours from a database image (Fig. 1(c)).
Given the salient contours of an image, we consider bends in them (which remain since the object shape has bends). Some articulation should be allowed at such bends since it has been observed that the object shape perspective remains relatively unchanged under articulations at such bend points (Basri et al., 1998). These bend points along the contour are determined as the local maxima of the curvature. Although curvature has been defined in the literature in many ways (Cohen et al., 1992; Basri et al., 1998; Chetverikov, 2003), we use a simple formulation that is fast and robust. The curvature of a point $p_c$ is obtained using $m$ points on either side of it as:

$$\kappa_{p_c} = \sum_{i=1}^{m} w_i \cdot \angle p_{c-i}\, p_c\, p_{c+i} \qquad (1)$$

where $w_i$ is the weight defined by a Gaussian function centered at $p_c$. This function robustly estimates the curvature at point $p_c$ at a given scale $m$. The salient contours thus obtained are split into multiple segments at such high curvature points and, as a result, a set of straight line-like segments is obtained for an image.
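As an illustration of this step, the following is a minimal sketch (not the authors' implementation) of the curvature estimate of Eq. (1) and the splitting of a contour at its curvature maxima. The window size m, the Gaussian width and the split threshold are illustrative assumptions, and the angle at each point is taken as its deviation from a straight (π) configuration so that sharp bends receive high curvature values.

```python
import numpy as np

def angle(a, b, c):
    """Angle at vertex b formed by the points a-b-c, in radians."""
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def curvature(points, m=5, sigma=2.0):
    """Gaussian-weighted curvature (in the spirit of Eq. 1) at interior contour points."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    w = np.exp(-0.5 * (np.arange(1, m + 1) / sigma) ** 2)   # Gaussian weights w_i
    kappa = np.zeros(n)
    for c in range(m, n - m):
        kappa[c] = sum(w[i - 1] * abs(np.pi - angle(points[c - i], points[c], points[c + i]))
                       for i in range(1, m + 1))
    return kappa

def split_at_bends(points, m=5, thresh=0.8):
    """Split a contour into straight line-like segments at curvature maxima."""
    points = np.asarray(points, dtype=float)
    kappa = curvature(points, m)
    cuts = [c for c in range(m, len(points) - m)
            if kappa[c] > thresh and kappa[c] == kappa[c - m:c + m + 1].max()]
    bounds = [0] + cuts + [len(points) - 1]
    return [points[a:b + 1] for a, b in zip(bounds[:-1], bounds[1:])]
```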
Fig. 2: Graph of Joints: (a) Set of segments after breaking the salient contours (shown in different colors) at high curvature points (e.g. joint b is a high curvature point). (b) Nearby endpoints of two different segments are merged if they are sufficiently close and a graph of joints (j) is created.

Given a set of straight line-like segments in an image, we try to connect them. The connectivity among the segments suggests an underlying graph structure. Thus, a weighted graph/contour segment network is constructed where each end of a contour segment is considered as a vertex/joint and the edge weight between any two vertices is equal to the length of the segment. Vertices from two different contour segments are merged if they are spatially close. Fig. 2(b) shows the graph corresponding to the illustrative set of straight line-like segments in Fig. 2(a).

The weight of an edge in the graph/contour segment network represents the spatial extent of the segments. Therefore, a long path in the graph based on the edge weights (higher weight is better) relates to a long connected sequence of contour segments, or chain, in an image. As the graph may contain cycles, to get non-cyclical paths, the maximum spanning tree is constructed for each connected component in the contour segment network (Fig. 1(e)) using a standard minimum spanning tree algorithm (Cormen et al., 2009) (the maximum spanning tree of a graph can be computed by negating the edge weights and computing the minimum spanning tree), and paths are extracted from these trees. Note that the Maximum Spanning Tree algorithm removes the minimum-weight edges in any cycle (Fig. 3(a)). Since we consider only long chains, minimum-weight edges, which correspond to the small segments, would typically not be picked in a long chain anyway. For instance, in Fig. 3(a), a long chain passing through the joints of the induced cycle follows the path that avoids the smallest segment of the cycle, and therefore the removal of that segment does not lead to any change in the extracted long chain. Fig. 3(b) illustrates possible chains for the contour segment network in Fig. 3(a), while Fig. 1(f) shows an extracted chain for an image (Fig. 1(a)).

A single long chain from this tree typically cannot cover the distinctive portion of an object (Fig. 4(a)). Therefore, to extract informative chains, the best $N_{OC}$ chains are determined by allowing certain overlaps among the chains.
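A minimal sketch of this construction is given below, assuming the split segments are available as point arrays; the endpoint-merging distance is an illustrative parameter, and networkx is used only for convenience (its maximum spanning tree is equivalent to negating the weights and taking the minimum spanning tree).

```python
import networkx as nx
import numpy as np

def build_segment_network(segments, merge_dist=5.0):
    """Weighted graph of joints: vertices are (merged) segment endpoints,
    edge weights are the segment lengths (their spatial extent)."""
    G = nx.Graph()
    joints = []                                    # list of joint coordinates
    for s_id, seg in enumerate(segments):
        seg = np.asarray(seg, dtype=float)
        ids = []
        for p in (seg[0], seg[-1]):
            jid = next((j for j, q in enumerate(joints)
                        if np.linalg.norm(p - q) < merge_dist), None)
            if jid is None:                        # create a new joint
                jid = len(joints)
                joints.append(p)
            ids.append(jid)
        length = float(np.sum(np.linalg.norm(np.diff(seg, axis=0), axis=1)))
        G.add_edge(ids[0], ids[1], weight=length, segment=s_id)
    return G, joints

def maximum_spanning_forest(G):
    """One maximum spanning tree per connected component of the network."""
    return [nx.maximum_spanning_tree(G.subgraph(c), weight="weight")
            for c in nx.connected_components(G)]
```

Long chains then correspond to heavy leaf-to-leaf paths in these trees, which are scored and pruned as described next.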
Fig. 3: (a) An illustrative snapshot of a network of joints. The red colored edge, being the least-weight edge in the induced cycle, is removed in the Maximum Spanning Tree of the network. (b) Two possible long paths/chains in the constructed tree.

Fig. 4: (a) A single chain does not capture the object boundary, which can be addressed by considering multiple overlapping chains (b).

The object boundary is typically smooth. Therefore, in the case of multiple possibilities at the junction points, we measure the smoothness of the possible paths at that junction and prefer an almost straight path compared to a curve. Fig. 5 illustrates two possibilities at the junction v, and the sequence ⟨...uvw...⟩ is preferred to ⟨...uvw′...⟩. To this end, the angle between three successive joints u, v and w is calculated. Since a straight line is preferred, the deviation of ∠uvw from 180° is determined as a measure of smoothness. Furthermore, to use this measure only in the case of multiple possibilities at junction points, the smoothness at a joint of a chain is normalized by the smoothness terms of all possible chains at that joint. Let t, u, v, w be four consecutive joints along a chain C and d(u, v) be the distance between any two joints u and v. Then the score of the chain C is determined as:

$$\text{Score}(C) = \sum_{u,v \in C} d(u,v) \cdot \left[ \frac{\lambda_l \cdot \exp\left(-\lambda_s \cdot |\pi - \angle tuv|\right)}{\sum_{x \mid t,u,x \in C'} \exp\left(-\lambda_s \cdot |\pi - \angle tux|\right)} + \frac{\lambda_l \cdot \exp\left(-\lambda_s \cdot |\pi - \angle uvw|\right)}{\sum_{y \mid u,v,y \in C''} \exp\left(-\lambda_s \cdot |\pi - \angle uvy|\right)} \right] \qquad (2)$$

Fig. 5: ⟨...uvw...⟩, being comparatively smoother, is preferred to ⟨...uvw′...⟩ for scoring possible paths through the junction v.

Here $\lambda_l$ and $\lambda_s$ are two scalar constants. $C'$ and $C''$ are possible chains through the joints t, u and u, v respectively. Negative exponential functions are used since only the values close to the desired value (i.e. π) can be considered a good candidate and anything beyond a certain limit should be given a low score. Note that the smoothness term is weighted by the length of the segment in order to achieve robustness with respect to the number of intermediate joints. A tree with $N_l$ leaves has $\binom{N_l}{2}$ paths and there is a substantial overlap of joints among many paths. We exploit this and use a Least Common Ancestor-based implementation (Cormen et al., 2009) to efficiently score all $\binom{N_l}{2}$ paths and to sequentially select some top $N_{OC}$ (= 5) chains such that the relative overlap among them is less than $\lambda_{chain}^{thresh}$ (= 60%). Fig. 4(b) illustrates the usefulness of considering multiple overlapping chains, where only the third chain encapsulates the informative shape information of a swan. Quite reasonable results may be achieved using such long chains in a contour segment network alone (Parui and Mittal, 2014). We next propose another technique that can often provide complementary chains in case this algorithm fails.
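Before moving on, a simplified, hedged sketch of the chain score of Eq. (2) is given below. `chain` is a list of joint coordinates (numpy arrays) along a candidate chain, and `alternatives(a, b)` is an assumed helper returning the joints reachable after the pair (a, b) in the network, including the chain's own continuation; λ_l and λ_s are illustrative values.

```python
import numpy as np

def turn(a, b, c):
    """|pi - angle(a, b, c)|: deviation from a straight path at joint b."""
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
    return abs(np.pi - np.arccos(np.clip(cosang, -1.0, 1.0)))

def smoothness(a, b, c, lam_s=2.0):
    return np.exp(-lam_s * turn(a, b, c))

def chain_score(chain, alternatives, lam_l=1.0, lam_s=2.0):
    """Length-weighted, locally normalized smoothness score of one candidate chain."""
    chain = [np.asarray(p, dtype=float) for p in chain]
    score = 0.0
    for i in range(1, len(chain) - 2):
        t, u, v, w = chain[i - 1], chain[i], chain[i + 1], chain[i + 2]
        # Smoothness at u, normalized over all branches following (t, u) ...
        s_u = lam_l * smoothness(t, u, v, lam_s) / sum(smoothness(t, u, x, lam_s)
                                                       for x in alternatives(t, u))
        # ... and at v, normalized over all branches following (u, v).
        s_v = lam_l * smoothness(u, v, w, lam_s) / sum(smoothness(u, v, y, lam_s)
                                                       for y in alternatives(u, v))
        score += np.linalg.norm(u - v) * (s_u + s_v)
    return score
```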
2.2 Using Segmented Object Proposals

The primary objective of popular segmented object proposal techniques (Carreira and Sminchisescu, 2012; Uijlings et al., 2013; Arbelaez et al., 2014) is to provide locations and boundaries of the possible objects in an image. Since we need to extract boundary information for millions of images, we use a very fast method called Geodesic Object Proposals (GOP) (Krähenbühl and Koltun, 2014) for extracting a set of possible object regions. This method typically produces many overlapping scored object regions. To limit the number of chains for each image, we consider only some top $N_{GOP}$ (= 20) proposals based on their scores.

Note that while it is possible to use image segmentation techniques such as Shi and Malik (2000) and Arbelaez et al. (2011) to obtain good connected regions, they are typically very slow (Uijlings et al., 2013). Furthermore, for the purpose of object boundary representation, the unit of interest is a single object region for each object present in an image. Therefore, Geodesic Object Proposals (Krähenbühl and Koltun, 2014) are used in this work. The method is based on superpixel growing and each proposal corresponds to a segment in the image. Thus, the object shape information can be easily extracted by considering the boundaries of the proposed segments.
Fig. 6: Limitation of considering only the top $N_{OC}$ chains from the contour segment network: (a) None of the chains encloses sufficient object boundary, (b) A distinctive chain is extracted (below) considering segmented object proposals (Krähenbühl and Koltun, 2014) (above).

We remove the segments that mostly touch the image boundary, as such segments have incomplete boundaries and a possible object is only partially present in the image. Furthermore, to extract only distinctive object boundaries, very small regions are also discarded (Fig. 6(b) and Fig. 7(a)). Finally, the boundaries of the remaining regions from the top $N_{GOP}$ proposals are taken as chains. Fig. 6(b) shows a chain successfully obtained from the segmented proposals where the chains extracted using the contour segment network were inferior (Fig. 6(a)).

Fig. 7: Limitation of extracting chains only from boundaries of segmented object proposals.

Fig. 7 illustrates a reverse example where the top object proposals do not contain informative object boundary information, while the overlapping chains extracted from the contour segment network cover the desired object boundary. Therefore, we consider both approaches for extracting chains for database images in an offline process.

Fig. 8: The entire chain creation framework.

Fig. 9: Chains extracted for some images. Different chains are represented using different colors.

Fig. 8 demonstrates the entire chain creation framework and Fig. 9 shows the chains thus obtained for some common images. Note that in our framework, it is easy to adopt other, possibly better, boundary extraction mechanisms as well due to the flexibility of the framework in terms of chain length and the number of chains.
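A hedged sketch of turning segmented object proposals into boundary chains is shown below; proposals are assumed to be available as binary masks sorted by score, OpenCV (version 4 or later) is used only for contour tracing, and the area and border-contact thresholds are illustrative.

```python
import cv2
import numpy as np

def proposal_chains(masks, n_top=20, min_area_frac=0.01, border_frac=0.05):
    """Boundary chains (point arrays) from the top-scoring proposal masks."""
    chains = []
    for mask in masks[:n_top]:
        mask = mask.astype(np.uint8)
        h, w = mask.shape
        if mask.sum() < min_area_frac * h * w:            # discard very small regions
            continue
        border = np.concatenate([mask[0], mask[-1], mask[:, 0], mask[:, -1]])
        if border.sum() > border_frac * border.size:      # region mostly touches the border
            continue
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if contours:
            outer = max(contours, key=cv2.contourArea)    # outer boundary of the region
            chains.append(outer.reshape(-1, 2))
    return chains
```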
In order to efficiently match two chains in a similarity-invariant way, we require a compact descriptor that captures the shape information of the extracted chains. Towards this goal, the local shape information is captured at the joints in a scale, in-plane rotation and translation invariant way. For the i-th joint of chain k ($J^k_i$), the segment length ratio $\gamma_i = l_{seg_i} / l_{seg_{i+1}}$ ($l_{seg_i}$ denotes the length of the i-th segment) and the anti-clockwise angle $\theta_i$ (range: [0, 2π]) between the adjacent pair of segments $seg_i$ and $seg_{i+1}$ are determined, as shown in Fig. 10. The descriptor $\Psi^k$ for a chain k with N segments is then defined as an ordered sequence of such
similarity-invariant quantities:

$$\Psi^k = \left\langle \gamma_i,\, \theta_i \;\middle|\; i \in \{1 \ldots N-1\} \right\rangle \qquad (3)$$

Fig. 10: The chain for the curve SE is composed of three line segments. The descriptor for this chain is $\Psi = \left\langle \gamma_i = \frac{l_{seg_i}}{l_{seg_{i+1}}},\, \theta_i \;\middle|\; i \in \{1, 2\} \right\rangle$.

Note that Riemenschneider et al. (2010) also use joint information by measuring the relative angles among all pairs of sampled points along a contour. However, their representation is not scale-invariant, which leads to a costly online multi-scale matching phase. In contrast, our proposed descriptor is insensitive to similarities and is suitable for efficiently representing and matching contour chain information in millions of images.

Having extracted chains from images and compactly represented them in a similarity-invariant way, we next describe an approach for efficiently matching two such chains.

Fig. 11: (a) A match when fragmented skips are allowed. (b) A match when only almost-contiguous matches are allowed. Matched joints are shown with the same marker in the sketch and the image. Unmatched portions of the chains are indicated by dashed lines.
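The descriptor of Eq. (3) can be computed directly from the joint coordinates of a chain. A minimal, hedged sketch follows, where the anti-clockwise angle is taken as the turning angle from the incoming to the outgoing segment (one possible reading of the convention in Fig. 10).

```python
import numpy as np

def chain_descriptor(joints):
    """joints: (N+1) x 2 array of joint coordinates of a chain with N segments.
    Returns the N-1 (gamma_i, theta_i) pairs of Eq. (3)."""
    joints = np.asarray(joints, dtype=float)
    desc = []
    for i in range(1, len(joints) - 1):
        prev_seg = joints[i] - joints[i - 1]
        next_seg = joints[i + 1] - joints[i]
        gamma = np.linalg.norm(prev_seg) / (np.linalg.norm(next_seg) + 1e-12)
        # Anti-clockwise angle between the adjacent segments, mapped to [0, 2*pi).
        theta = (np.arctan2(next_seg[1], next_seg[0])
                 - np.arctan2(prev_seg[1], prev_seg[0])) % (2 * np.pi)
        desc.append((gamma, theta))
    return desc
```

Because only length ratios and relative angles are stored, the descriptor is unchanged by translating, rotating or uniformly scaling the chain.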
Standard vectorial types of distance measures are not applicable for matching two chains due to the variability in the lengths of the chains in our case. Further, note that the object boundary is typically captured by only a portion of the chain in the database image (Fig. 8). Therefore, a partial matching strategy for such chains needs to be devised which can be smoothly integrated with an indexing structure to efficiently determine object shape similarity.

Since image chains are typically noisy, it is not uncommon to obtain a chain that captures an object boundary and has non-object contour segments on either side of the object boundary portion. Furthermore, we assume that the object boundary is typically captured by a more or less contiguous portion of the chain without large gaps in between. Although such large split-ups may occur in certain circumstances, allowing such matches leads to a lot of false matches of images due to too much relaxation of the matching criteria. This is illustrated in Fig. 11(a), where the split matches are individually good matches but put together do not match the intended shape structure at all. Thus, in our work, the similarity between two chains is measured by determining the maximum (almost) contiguous matching portions of the sequences while leaving out the non-matching portions on either side from consideration (Fig. 11(b)). This is quite similar to the Longest Common Substring problem (Cormen et al., 2009) (a substring, unlike a subsequence, does not allow gaps between successive tokens), with some modifications, and can be solved efficiently using Dynamic Programming. We first consider the individual scores for matching two joints across two chains.

4.1 Joint Similarity

Since exact correspondence of the joints does not capture the deformation that an object may undergo, we provide a slack while matching and score the match between a pair of joints based on the deviation from exact correspondence. The score $S_{jnt}(x, y)$ for matching the x-th joint of chain $C_1$ to the y-th joint of chain $C_2$ is taken to be the product of two constituent scores:

$$S_{jnt}(x, y) = S_{lr}(x, y) \cdot S_{ang}(x, y) \qquad (4)$$

$S_{lr}(x, y)$ is the closeness in the segment length ratio of the two adjacent segments at the x-th and y-th joints of the two descriptors:

$$S_{lr}(x, y) = \exp\left( -\lambda_{lr} \cdot \left( 1 - \Omega\left( \gamma^{C_1}_x, \gamma^{C_2}_y \right) \right) \right) \qquad (5)$$

where $\gamma_x = l_{seg_x} / l_{seg_{x+1}}$ is as defined in Sec. 3, $\Omega(a, b) = \min(a/b,\, b/a)$, $a, b \in \mathbb{R}_{>0}$, measures the relative similarity between two ratios ($\Omega(a, b) \in (0, 1]$) and $\lambda_{lr}$ is a scalar constant. $S_{ang}(x, y)$ determines the closeness of the angles at the x-th and y-th joints and is defined as:

$$S_{ang}(x, y) = \exp\left( -\lambda_{ang} \cdot \left| \theta^{C_1}_x - \theta^{C_2}_y \right| \right) \qquad (6)$$

where $\lambda_{ang}$ (= 2) is a constant. These two components measure the structural similarity between a pair of joints. Due to the consideration of length ratios and relative angles, the joint matching score $S_{jnt}(\cdot, \cdot)$ is also invariant to scale, translation and rotation.
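A hedged sketch of the joint similarity of Eqs. (4)–(6) is given below, operating on the (γ, θ) descriptors from the previous sketch; the value of λ_lr is illustrative (it is not fully legible in the source), while λ_ang = 2 follows the text.

```python
import numpy as np

def omega(a, b):
    """Relative similarity of two positive ratios, in (0, 1]."""
    return min(a / b, b / a)

def joint_similarity(d1, x, d2, y, lam_lr=0.5, lam_ang=2.0):
    """S_jnt(x, y) for the x-th joint of descriptor d1 and the y-th joint of d2."""
    g1, t1 = d1[x]
    g2, t2 = d2[y]
    s_lr = np.exp(-lam_lr * (1.0 - omega(g1, g2)))       # length-ratio closeness (Eq. 5)
    s_ang = np.exp(-lam_ang * abs(t1 - t2))               # angle closeness (Eq. 6)
    return s_lr * s_ang
```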
4.2 Handling Skips

Given the scoring mechanism between a pair of joints, the match score between two chains can be determined by calculating the cumulative joint matching score of contiguous portions in the two chains. Although exact matching of such portions can be considered, due to intra-class shape variations, small partial occlusions or noise, a few non-object joints may occur in the object boundary portion of the chain. To handle these non-object portions, some skips need to be allowed. Thus, the problem is formulated as one that finds the longest almost-contiguous matching portion of the two chains that are to be matched. Since only descriptors are available at this stage, this matching is performed in the space of chain descriptors.

To this end, a skip penalty α is considered for the skipped joints. Note that the loss of shape information due to a skip depends on the complexity of the skipped joints. It has been observed that a sharper angle captures more shape information than a smoother one. Hence, a skipped joint with a sharper angle should be penalized more. The sharpness ($S_x$) of any joint x can be calculated by taking the deviation of the joint angle ($\theta_x$) from 180° (Fig. 12):

$$S_x = 1 - \exp\left( -\left| \pi - \theta_x \right| \right) \qquad (7)$$

Fig. 12: Sharpness ($S_x$) of joint x is calculated by determining the difference of the joint angle ($\theta_x$) from π.

Furthermore, lengthier skips typically cause more loss of shape information. Therefore, to penalize a skip based on its complexity, along with the sharpness of the skipped angle, the skip penalty ($\omega_x$) is also weighted by the average length of the segments on either side of a skipped joint x:

$$\omega_x = S_x + \lambda_{skc} \cdot \left( l_{seg_x} + l_{seg_{x+1}} \right) \qquad (8)$$

where, to determine the penalty of the skipped segments relative to the chain, the length of each segment is normalized by the length of the chain. $\lambda_{skc}$ is a scalar constant that determines the relative effect of the two components.

4.3 Matching using Dynamic Programming

Towards finding almost-contiguous matches, one can formulate the match score $M(p_1, q_1, p_2, q_2)$ for the portion of the chain between joints $p_1$ and $q_1$ in chain $C_1$ and joints $p_2$ and $q_2$ in chain $C_2$. Let the sets $J_1$ and $J_2$ denote the sets of joints of chains $C_1$ and $C_2$ respectively in this interval. Also let JM be a matching between $J_1$ and $J_2$ in this interval. We restrict JM to obey the order constraint on the matches, i.e., if the joints $a_1$ and $b_1$ of the first chain are matched to the joints $a_2$ and $b_2$ respectively in the second chain, then $a_1$ occurring before $b_1$ implies that $a_2$ also occurs before $b_2$ and vice versa. Also let $X(JM) = \{x \mid (x, y) \in JM\}$ and $Y(JM) = \{y \mid (x, y) \in JM\}$ be the sets of joints covered by JM. Then $M(p_1, q_1, p_2, q_2)$ is defined as:

$$M(p_1, q_1, p_2, q_2) = \max_{\substack{JM \in \text{ ordered matchings in the}\\ \text{intervals } (p_1, q_1) \text{ and } (p_2, q_2)}} \left[ \sum_{(x,y) \in JM} S_{jnt}(x, y) \;-\; \sum_{x \in J_1 \setminus X(JM)} \omega_x\, \alpha_1 \;-\; \sum_{y \in J_2 \setminus Y(JM)} \omega_y\, \alpha_2 \right] \qquad (9)$$

Note that $\alpha_1$ and $\alpha_2$ may be different since, while matching a sketch chain to an image chain, more penalty is given to a skip in the sketch chain ($\alpha_1$), as it is considered cleaner and relatively more free from clutter compared to an image chain ($\alpha_2$). Now, the maximum matching score ending at the joint $q_1$ of $C_1$ and $q_2$ of $C_2$, from any pair of starting joints, is defined as:

$$M(q_1, q_2) = \max_{p_1, p_2} M(p_1, q_1, p_2, q_2) \qquad (10)$$

We also take the matching score of a null set ($p_1 > q_1$ or $p_2 > q_2$) as zero, which constrains $M(q_1, q_2)$ to take only non-negative values. Then, it is not difficult to prove that M can be rewritten using the following recurrence relation:

$$M(q_1, q_2) = \begin{cases} 0, & \text{if } q_1 = 0 \text{ or } q_2 = 0 \\[4pt] \max\left\{ \begin{array}{l} M(q_1 - 1, q_2 - 1) + S_{jnt}(q_1, q_2), \\ M(q_1 - 1, q_2) - \omega_{q_1}\, \alpha_1, \\ M(q_1, q_2 - 1) - \omega_{q_2}\, \alpha_2 \end{array} \right\}, & \text{otherwise} \end{cases} \qquad (11)$$

Fig. 13: Partial matching of chains with small skips. Matched joints are indicated by the same marker and colored the same in the table (Best viewed in color).
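The recurrence of Eq. (11) translates directly into a small Dynamic Programming routine. The following hedged sketch uses the joint similarity shown earlier, a skip penalty in the spirit of Eq. (8) with `lens1`/`lens2` holding the chains' segment lengths, and illustrative values of α₁ and α₂ (the exact constants are not legible in the source).

```python
import numpy as np

def skip_penalty(desc, seg_lens, x, lam_skc=0.5):
    """omega_x (Eq. 8): sharpness of the skipped joint plus its relative segment length."""
    _, theta = desc[x]
    sharpness = 1.0 - np.exp(-abs(np.pi - theta))
    rel_len = (seg_lens[x] + seg_lens[x + 1]) / (sum(seg_lens) + 1e-12)
    return sharpness + lam_skc * rel_len

def match_chains(d1, lens1, d2, lens2, alpha1=0.6, alpha2=0.3):
    """Best almost-contiguous matching score between two chain descriptors."""
    n1, n2 = len(d1), len(d2)
    M = np.zeros((n1 + 1, n2 + 1))
    best = 0.0
    for q1 in range(1, n1 + 1):
        for q2 in range(1, n2 + 1):
            M[q1, q2] = max(
                M[q1 - 1, q2 - 1] + joint_similarity(d1, q1 - 1, d2, q2 - 1),
                M[q1 - 1, q2] - skip_penalty(d1, lens1, q1 - 1) * alpha1,
                M[q1, q2 - 1] - skip_penalty(d2, lens2, q2 - 1) * alpha2,
                0.0,   # the empty match, which keeps the score non-negative
            )
            best = max(best, M[q1, q2])
    return best        # Chain Matching Score, before flip handling and the GAC weighting
```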
This formulation immediately leads to an efficient Dynamic Programming solution that computes M for all possible values of $q_1$ and $q_2$, starting from the first joints to the last ones. A search for the largest value of $M(q_1, q_2)$ over all possible $q_1$ and $q_2$ will then give us the best almost-contiguous matched portions between two chains $C_1$ and $C_2$ in terms of the highest matching score. Fig. 13 visually illustrates the Dynamic Programming-based matching procedure, where the chains are partially matched and a few joints are skipped while matching. This approach helps us to efficiently obtain a matching score between a pair of chains. Furthermore, to handle an object flip, we match by flipping one of the chains as well and determine the best matching score as the one that gives the highest score between the two directions. We call the final score between two chains $C_1$ and $C_2$ the Chain Matching Score $CMS(C_1, C_2)$.

The entire operation of matching two chains takes $O(n_{C_1} \cdot n_{C_2})$ time, where $n_{C_1}$ and $n_{C_2}$ are the number of joints in chains $C_1$ and $C_2$ respectively. It has been observed that a chain typically consists of 12-17 joints, leading to a running time of approximately 100-400 units of joint matching, which is not very high. Note that this DP formulation is similar to the Smith-Waterman algorithm (SW) (Smith and Waterman, 1981), which aligns two protein sequences based on a fixed alphabet-set and predefined matching costs. Meltzer and Soatto (2008) use SW to perform matching between two images under wide-baseline viewpoint changes. Our method is a slight variation from this since it performs matching based on a continuous-space formulation that measures the deviation from exact correspondence to handle deformation.

Fig. 14: Skipping of important joints can lead to a false positive match. However, the angles at the matched joints of the sketch chain and the corresponding angles in the image chain are highly dissimilar, leading to a low Global Angle Consistency score for these two falsely matched chains.

However, matching two chains by determining local joint correspondence alone sometimes leads to a globally inconsistent match, as both deformation and skips of individual joints are allowed while matching. In Fig. 14, most of the joints are locally matched correctly in a similarity-invariant way, but a few joints are skipped in the sketch chain and in the database chain, which leads to a globally inconsistent matching. This necessitates consideration of the global consistency of the matched portions of chains for improved matching.

4.4 Global Angle Consistency of the Matched Chains

The angle that any two consecutive matched joints make with respect to some global reference point will be similar for two correctly matched chains and different for falsely-matched chains. The centroid of the matched portion of a chain is a robust point that can be used as a reference. Thus, to determine the Global Angle Consistency (GAC) between any two matched chains $C_1$ and $C_2$, we consider the centroid of the matched portion of the chain ($\bar{C}$) as the reference point and calculate the differences in the angles that any two consecutive joints make:

$$GAC(C_1, C_2) = \exp\left( -\frac{\lambda_{ac}}{N_J} \sum_{i=1}^{N_J - 1} \left| \angle J^i_1\, \bar{C}_1\, J^{i+1}_1 \;-\; \angle J^i_2\, \bar{C}_2\, J^{i+1}_2 \right| \right) \qquad (12)$$

where $N_J$ is the total number of matched joints between $C_1$ and $C_2$ and $\lambda_{ac}$ is a scalar constant.
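A corresponding hedged sketch of the Global Angle Consistency of Eq. (12) follows; it takes the coordinates of the matched joints of the two chains in matching order, and λ_ac is an illustrative value.

```python
import numpy as np

def _angle(a, b, c):
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def global_angle_consistency(matched1, matched2, lam_ac=2.0):
    """GAC of Eq. (12) for two equally long lists of matched joint coordinates."""
    m1, m2 = np.asarray(matched1, float), np.asarray(matched2, float)
    c1, c2 = m1.mean(axis=0), m2.mean(axis=0)    # centroids of the matched portions
    n = len(m1)
    diff = sum(abs(_angle(m1[i], c1, m1[i + 1]) - _angle(m2[i], c2, m2[i + 1]))
               for i in range(n - 1))
    return np.exp(-lam_ac * diff / n)
```

The final chain similarity of Eq. (13) below is then simply the product of this value with the Chain Matching Score returned by `match_chains`.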
A higher value of $\lambda_{ac}$ indicates a harder constraint on the global shape similarity. Fig. 14 shows an example where a false match is rejected due to a low global angle consistency score. Note that the computation of the global angle consistency is quite fast, as it involves only the matched joints and can be done from the descriptor directly without referring back to the corresponding images.

The chain matching score (CMS) is weighted by the global angle consistency score (GAC) to obtain the final chain similarity score of a pair of chains $C_1$ and $C_2$:

$$CS(C_1, C_2) = GAC(C_1, C_2) \cdot CMS(C_1, C_2) \qquad (13)$$

This chain-to-chain matching strategy is used to match two image chains during indexing as well as a sketch chain to an image chain during image retrieval. Offline image indexing for faster retrieval is considered next.

Given a chain descriptor, matching it online with all chains obtained from millions of images will take a considerable amount of time. Therefore, for fast retrieval of images from a large dataset, an indexing mechanism is required. Different indexing techniques have been considered in the literature for content-based image retrieval, viz. tree-based approaches using the kd-tree and its variants (Friedman et al., 1977; Vlachos et al., 2005; Muja and Lowe, 2009), hierarchical k-means (Nister and Stewenius, 2006), hashing (Indyk and Motwani, 1998; Deng et al., 2011; Jegou et al., 2011) etc. These approaches exploit the vectorial representation of the extracted features and perform either exact or approximate nearest neighbor search. However, in our representation, the length of each chain is not fixed. Furthermore, the matching score cannot be obtained as a direct accumulation of the scores of individual dimensions. It is also not possible to use metric-based indexing techniques in our case due to a violation of the triangle inequality (Keogh and Ratanamahatana, 2005). These considerations rule out most of the possibilities such as the kd-tree, hashing etc. Therefore, in this work, an approach similar to hierarchical k-means (Muja and Lowe, 2009; Nister and Stewenius, 2006) but using medoids instead of means is used, which has been found to perform comparably to the state-of-the-art indexing techniques (Muja and Lowe, 2009).

Fig. 15: Similar chains are clustered at the leaf nodes of the hierarchical k-medoids-based indexing structure.

All the database chains are considered for indexing and a hierarchical structure is constructed by splitting the set of extracted chains into k different clusters using the k-medoids algorithm (Toyama and Blake, 2002; Opelt et al., 2008). At first, k chains are chosen as the cluster centroids probabilistically using the initialization mechanism of k-means++ (Arthur and Vassilvitskii, 2007), which increases both speed and accuracy. The remaining chains are matched to each medoid chain using the Dynamic Programming-based partial matching algorithm and assigned to the closest one based on the matching score (Eqn. 13). However, due to partial matching, it is possible to get a high matching score for more than one medoid. Therefore, a chain is assigned to all the medoids for which the matching score is greater than some threshold
$Th_{ms}$ (= 80%) of the score of the closest medoid. This operation is then recursively performed on the individual clusters to determine the clusters at different levels of the tree. A leaf node of such a tree maintains a list of images of which at least one chain matches the corresponding medoid chain. Note that, since an image has multiple chains and even one chain can belong to multiple nodes in our approach, the same image can be present at multiple leaves (Fig. 15). Given such a hierarchical chain tree constructed offline during indexing, we next discuss how to search the image database given a query sketch.
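A hedged sketch of the hierarchical k-medoids construction follows. `chain_sim` stands for the chain similarity CS of Eq. (13); the assignment threshold of 0.8 corresponds to Th_ms, while the branching factor, the leaf size and the plain random seeding (instead of k-means++-style seeding) are simplifying assumptions.

```python
import random

def build_tree(chains, chain_sim, k=10, max_leaf=50, th_ms=0.8):
    """Recursively cluster chains around k medoid chains."""
    if len(chains) <= max_leaf:
        return {"leaf": True, "chains": chains}
    medoids = random.sample(chains, k)
    clusters = [[] for _ in range(k)]
    for c in chains:
        scores = [chain_sim(c, m) for m in medoids]
        best = max(scores)
        for j, s in enumerate(scores):
            if s >= th_ms * best:            # assign to every sufficiently close medoid
                clusters[j].append(c)
    children = []
    for cl in clusters:
        if 0 < len(cl) < len(chains):        # recurse only on strictly smaller clusters
            children.append(build_tree(cl, chain_sim, k, max_leaf, th_ms))
        else:
            children.append({"leaf": True, "chains": cl})
    return {"leaf": False, "medoids": medoids, "children": children}
```

In practice each chain would also carry the identifier of its source image, so that a leaf yields a list of candidate images rather than bare chains.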
A user typically draws an object along its boundary (Cole et al., 2008). On a touch-based device, the input order of the contour points of the object boundary is usually available. Therefore, sketch chains can be trivially obtained in an online retrieval system by breaking them at turns in the drawing. Offline line drawings can be decomposed in a manner similar to the edge-detected images (Sec. 2.1.3). However, chains with fewer than some $Th_{nj}$ (= 5) joints are discarded as they are very simple and can match to any non-informative portion of another chain. Finally, descriptors are determined in a manner similar to image chains (Eqn. 3).

6.1 Search in the Hierarchical Chain Tree

For each of these sketch chain descriptors, a search is performed in the hierarchical k-medoids tree. At every level of the tree, the query chain is matched with all the medoid chains and then the subtree of the best-matched medoid is explored in a best-bin-first manner (Muja and Lowe, 2009). At first, a single traversal is performed through the tree following the best-matched medoids at every level. This yields a small set of images corresponding to the best-matched leaf medoid. Since at every level the query chain can get a good match with more than one medoid, to consider those possible matches, all the unexplored branches along the path are added to a priority queue. After the first traversal, the branch closest to the query chain is extracted from this priority queue and explored further. The search procedure stops once a pre-determined number of database images are retrieved. For all these retrieved images, at least one chain of each image matches the query chain. Note that, for multiple sketch chains, we get multiple sets of images from the leaf nodes of the search tree, all of which are taken to the next step.

Fig. 16: Considering only individual chain matches without a global consistency check can lead to a false positive retrieval.

Given a set of retrieved images with corresponding matched chains, we devise a sketch-to-image matching strategy to rank the images. Since all chain matchings between a sketch and an image may not be retrieved from the hierarchical tree due to low similarity scores, we try to match the remaining chains of the sketch also with other chains of a shortlisted image to obtain the complete chain-matching information between the corresponding sketch and image. The matching score of an image for a given sketch is then calculated based on the cumulative matching scores of individual matched chain pairs between the sketch and the image. However, the actual object boundary may be split across multiple chains. Therefore, it is necessary to consider all such matchings while determining the match score between a sketch and an image. However, such multiple matches may not be geometrically consistent with each other. Fig. 16 shows a case where two chains individually match well in both the sketch and the image, but the matches are not geometrically consistent with each other. This necessitates us to consider the geometric consistency of the matched chains to discard false positive retrievals. Although such geometric consistency has been studied previously in the literature (Philbin et al., 2007; Sattler et al., 2009; Tsai et al., 2010), it is considered in a new context in this work.
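The best-bin-first traversal of Sec. 6.1 can be sketched with a priority queue over unexplored branches, ordered by the similarity of their medoid to the query chain; the node layout matches the `build_tree` sketch above and the stopping count is an illustrative parameter.

```python
import heapq

def search_tree(root, query, chain_sim, n_wanted=200):
    """Retrieve chains (and hence their source images) from the best-matching leaves."""
    retrieved, queue, tie = [], [(0.0, 0, root)], 0
    while queue and len(retrieved) < n_wanted:
        _, _, node = heapq.heappop(queue)
        if node["leaf"]:
            retrieved.extend(node["chains"])
            continue
        for medoid, child in zip(node["medoids"], node["children"]):
            tie += 1
            score = chain_sim(query, medoid)
            heapq.heappush(queue, (-score, tie, child))   # best branches pop first
    return retrieved[:n_wanted]
```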
Fig. 17: Pairwise geometric consistency of the matched portions of a chain pair $p = (C_S, C_I)$ with respect to $p' = (C'_S, C'_I)$ uses (i) the distances $d(\bar{C}_S, \bar{C}'_S)$ and $d(\bar{C}_I, \bar{C}'_I)$ between their centroids ($\bar{C}$) and (ii) the difference of angles $\left| \phi^{C_S}_i - \phi^{C_I}_i \right|$.

6.2 Geometric consistency across multiple matched chains

The geometric consistency of the matched portions of a pair of chains $p = (m(C_S), m(C_I))$ with respect to that of another chain pair $p' = (m(C'_S), m(C'_I))$, where $C_S$ and $C'_S$ are the sketch chains and $C_I$, $C'_I$ are the image chains, is measured based on two factors: a) distance-consistency $G_d(p, p')$ and b) angular-consistency $G_a(p, p')$.

The centroids of the matched chain portions can be obtained in a manner that is relatively robust to the presence of noise. Therefore, $G_d(p, p')$ is defined in terms of the closeness of the distances between the chain centroids $d(m(C_S), m(C'_S))$ in the sketch and $d(m(C_I), m(C'_I))$ in the database image (Fig. 17). These distances are normalized by the total length of the matched portions of the corresponding chains in order to achieve scale insensitivity:

$$G_d(p, p') = \exp\left( -\lambda_c \cdot \left( 1 - \Omega\left( \frac{d(m(C_S), m(C'_S))}{L_S},\; \frac{d(m(C_I), m(C'_I))}{L_I} \right) \right) \right) \qquad (14)$$

where $L_S = \text{length}(m(C_S)) + \text{length}(m(C'_S))$, $L_I = \text{length}(m(C_I)) + \text{length}(m(C'_I))$, $\lambda_c$ (= 1) is a scalar constant and $\Omega$ is defined in Eqn. 5.

The next factor, $G_a$, measures angular consistency. To achieve rotational invariance, the line joining the corresponding chain centers is considered as the reference axis and the relative angle difference at the i-th joint is determined (Fig. 17). $G_a(p, p')$ is defined using the average difference of such relative angles over all the individual matched joints in a chain:

$$G_a(p, p') = \exp\left( -\frac{\lambda_a}{N_{J_p}} \sum_{i=1}^{N_{J_p}} \left| \phi^{C_S}_i - \phi^{C_I}_i \right| \right) \qquad (15)$$

where $N_{J_p}$ is the number of matched joints between $C_S$ and $C_I$ and $\lambda_a$ (= 2) is a scalar constant. Since both $G_d$ and $G_a$ should be high for consistent matching, we consider the pairwise geometric consistency $G(p, p')$ as a product of the constituent factors: $G(p, p') = G_d(p, p') \cdot G_a(p, p')$.

Erroneously matched chains are typically geometrically inconsistent with others, and one may have both geometrically consistent and inconsistent pairs in a group of matched pairs between a sketch and an image. Therefore, the geometric consistency $GC(p)$ for a matched pair p is taken to be the maximum of $G(p, p')$ with respect to all other matched pairs $p'$: $GC(p) = \max_{p'} G(p, p')$. The max operator allows us to neglect the falsely matched pairs while considering only the consistent matched pairs. Finally, the similarity score of a database image I with respect to a sketch query S is determined as:

$$Score(S, I) = \sum_{p \in P} GC(p) \cdot CS(p) \qquad (16)$$

where $CS(p)$ is the Chain Score for the chain pair p (Eqn. 13) and P is the set of all matched pairs of chains between a sketch S and an image I.
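Finally, a hedged sketch of the pairwise geometric consistency and the image score of Eqs. (14)–(16) is given below. Each matched chain pair is represented by the coordinates of its matched joints in the sketch and in the image; λ_c = 1 and λ_a = 2 follow the text, and the relative angle φ is taken with respect to the line joining the two matched-portion centroids, which is one plausible reading of Fig. 17.

```python
import numpy as np

def _poly_len(pts):
    return np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))

def _omega(a, b):
    return min(a / b, b / a)

def geometric_consistency(pair, other, lam_c=1.0, lam_a=2.0):
    """G(p, p') = G_d * G_a for two matched chain pairs (Eqs. 14 and 15)."""
    (s, im), (s2, im2) = [tuple(np.asarray(x, float) for x in p) for p in (pair, other)]
    cs, ci, cs2, ci2 = s.mean(0), im.mean(0), s2.mean(0), im2.mean(0)
    # Distance consistency, normalized by the total matched lengths (Eq. 14).
    L_s = _poly_len(s) + _poly_len(s2)
    L_i = _poly_len(im) + _poly_len(im2)
    g_d = np.exp(-lam_c * (1 - _omega(np.linalg.norm(cs - cs2) / L_s,
                                      np.linalg.norm(ci - ci2) / L_i)))
    # Angular consistency relative to the centroid-joining line (Eq. 15).
    ref_s, ref_i = np.arctan2(*(cs2 - cs)[::-1]), np.arctan2(*(ci2 - ci)[::-1])
    phi_s = np.array([np.arctan2(*(p - cs)[::-1]) - ref_s for p in s])
    phi_i = np.array([np.arctan2(*(p - ci)[::-1]) - ref_i for p in im])
    g_a = np.exp(-lam_a * np.mean(np.abs(phi_s - phi_i)))
    return g_d * g_a

def image_score(matched_pairs, chain_scores):
    """Eq. (16): chain scores weighted by the best pairwise consistency."""
    total = 0.0
    for a, p in enumerate(matched_pairs):
        gc = max((geometric_consistency(p, q)
                  for b, q in enumerate(matched_pairs) if b != a), default=0.0)
        total += gc * chain_scores[a]
    return total
```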
Since erroneously matched chains get a very low consistency score, effectively only the geometrically consistent chains are given weight when scoring an image. This score is used to determine the final ranking of the database images, which can be used for a ranked display of such images.

Fig. 18: The entire Retrieval Framework.

Fig. 18 shows the complete retrieval framework. Results of experiments are considered next.

To evaluate the performance of our system, we created a database of 1.2 million images, which contains 1 million Flickr images taken from the MIRFLICKR-1M image collection (Huiskes and Lew, 2008). In addition, we included 0.2 million images from the Imagenet (Deng et al., 2009) database in order to have some common object images in our database. In the experiments, the hierarchical index for the 1.2 million images is generated with a fixed branching factor and a maximum leaf node size, which leads to a maximum tree depth of 6. Using this tree, we obtain a shortlist of similar images for a given sketch, for which geometric consistency (Eqn. 16) is applied to finally rank the list of images retrieved from the chain tree.

The whole operation for a given sketch typically takes a few seconds on a single thread running on an Intel Core i7-3770 3.40GHz CPU. The running time typically depends on the number of chains in the sketch and most of the processing time is consumed by the geometric verification phase. However, this geometric consistency check can be trivially parallelized.

The hierarchical index for our dataset required only a modest amount of memory. For fast online access of database chain descriptors during geometric verification and ranking of retrievals, all the descriptors for the 1.2 million images are loaded a priori into memory, which additionally required a few GB of memory. Note that the chain descriptors can be distributed across multiple CPUs if such a geometric consistency check is performed in parallel. Furthermore, to make our approach work in a memory-constrained environment, for every sketch, only the descriptors corresponding to the shortlisted images may be loaded each time into memory at runtime, although this may slow down the process somewhat due to online disk access.

Visual results for sketches of different categories of varying complexity are shown in Fig. 19. These clearly indicate the insensitivity of our approach to similarity transforms (e.g. positive retrievals of the swan sketch). Furthermore, due to our partial matching scheme, an object is retrieved even under a viewpoint change if a portion of the distinguishing shape structure of the object is matched (e.g. one of the retrieved images for the swan). Global invariance to similarities as well as matching with flipped objects can be seen in the results for the sketches of the swan and the bicycle (several of the retrieved images for swan and bicycle). It can be easily observed that the performance of our approach depends on the complexity/distinctiveness of the shape structure. False matches (e.g. a cross for the sketch of an airplane; two adjacent clocks/cups for the sketch of spectacles; a keychain and a brush for the sketch of a bottle in Fig. 19) typically occur due to some shape similarity between the sketch and an object in the image, the probability of which is higher when the sketch is simple and/or contains only one chain (e.g. bottle).

To understand the characteristics of the missed retrievals as well in a controlled dataset, we also tested our system on the ETHZ extended shape dataset, consisting of images from several different categories with significant scale, translation and rotation variations.
Visual results for sketches of different categories and of varying complexity are shown in Fig. 19. These clearly indicate the insensitivity of our approach to similarity transforms (e.g. the positive retrievals for the swan sketch). Furthermore, due to our partial matching scheme, an object is retrieved even under a viewpoint change if a portion of the distinguishing shape structure of the object is matched (e.g. one of the retrieved images for the swan). Global invariance to similarities as well as matching with flipped objects can be seen in the results for the sketches of the swan and the bicycle. It can easily be observed that the performance of our approach depends on the complexity/distinctiveness of the shape structure. False matches (e.g. a cross for the sketch of an airplane; two adjacent clocks/cups for the sketch of spectacles; a keychain or a brush for the sketch of a bottle in Fig. 19) typically occur due to some shape similarity between the sketch and an object in the image, the probability of which is higher when the sketch is simple and/or contains only one chain (e.g. the bottle).

To understand the characteristics of the missed retrievals as well in a controlled setting, we also tested our system on the ETHZ extended shape dataset (Schindler and Suter, 2008), which contains images of several categories with significant scale, translation and rotation variations. Fig. 20 shows the top retrieval results from this dataset for a few sketches. It can be observed that the accuracy of retrieval depends heavily on the quality of the sketch. In Fig. 20, a significant portion of the first swan sketch is circular and it therefore matches locally circular shapes; for a relatively better sketch of a swan, the number of positive retrievals is higher. Sometimes the matched portions of two different shapes appear to be globally similar, which again leads to false positives at a few top positions (e.g. the matched portions of a giraffe and a mug for the sketch of a hat in Fig. 20).

Even when the object shape is well captured in both the sketch and the image, our approach sometimes retrieves an incorrect object if the distinctive portion(s) of the object boundary get skipped. Fig. 21(a) illustrates such a situation, where skipping very important object boundary portions in the image of a "star" leads to a wrong retrieval. Furthermore, the matched portions of the corresponding chains are globally similar, so even the global angular consistency check fails to identify the false retrieval in this case. This case cannot be easily addressed if we are to allow skips in order to handle noise in boundary extraction. Note that, for a sketch of a "star", an image of an apple nevertheless does not get a high matching score because of the asymmetry in our chain matching criteria: we assume that the sketch chains are much less noisy than the image chains, and hence skips are penalized more heavily in the sketch chains. This also explains the very good retrievals obtained for objects whose shape is (almost) unique. Fig. 21(b) shows another example where allowing local deformation and skipping an informative joint lead to a false positive.

Quantitative measurement of the performance of a large-scale retrieval system is not easy due to the difficulty of obtaining ground truth, which is currently unavailable for a web-scale dataset. Common metrics for measuring retrieval performance (F-measure, Mean Average Precision (Manning et al., 2008), etc.) use recall, which is impossible to compute without a full annotation of the dataset. Therefore, to evaluate the performance of our approach quantitatively, we use the Precision-at-K measure at different rank levels K for the retrieval results (Rubner et al., 2000). This is an acceptable measure since an end-user of a large-scale image retrieval system typically cares only about the top results, which must be good.
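As an illustration of this measure and of the per-category best/worst/average aggregation used in the tables below, a minimal Precision-at-K computation might look as follows; the rankings and relevance labels are hypothetical inputs, with the labels assumed to come from the manual judgement of the top retrievals described in the next subsection.

from typing import Dict, List, Sequence, Set, Tuple

def precision_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved images that are true positives."""
    return sum(1 for img in ranked_ids[:k] if img in relevant) / float(k)

def best_worst_avg(precisions: List[float]) -> Tuple[float, float, float]:
    """Best / worst / average over the sketches of one category, as reported
    per column group (B, W, A) in Tables 2 and 3."""
    return max(precisions), min(precisions), sum(precisions) / len(precisions)

def evaluate_category(rankings: Dict[str, List[str]],
                      labels: Dict[str, Set[str]], k: int = 50):
    # rankings[s]: ranked image ids returned for sketch s;
    # labels[s]: manually judged correct images for s (hypothetical inputs).
    precs = [precision_at_k(rankings[s], labels[s], k) for s in rankings]
    return best_worst_avg(precs)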
First, we separately evaluate the major components of our method to understand their effects. Then, we discuss the retrieval performance of the proposed algorithm in comparison to prior work on our large dataset. To study the properties of our algorithm in detail, the proposed algorithm is also evaluated on the ETHZ shape dataset (Ferrari et al., 2006), on which it is possible to compute recall as well. Finally, using these evaluations, we discuss the strengths and weaknesses of our approach.

7.1 Evaluation of major components

User sketches are highly subjective (Eitz et al., 2012a) and the retrieval performance depends on the quality of the user sketch. Therefore, to obtain a robust estimate of the performance, the system must be tested using a diverse set of user sketches of varying complexity. To this end, we use a dataset of user-drawn sketches (Bhattacharjee and Mittal, 2015) containing multiple sketches for each of the five shape categories of the ETHZ shape dataset (Ferrari et al., 2006), viz. applelogo, bottle, giraffe, mug and swan. This dataset provides a wide variety in terms of sketch quality and is therefore an appropriate choice for evaluation purposes. For testing these sketches, we add the images of the ETHZ shape dataset (Ferrari et al., 2006) to our dataset of 1.2 million images. Finally, for each of these sketches, we determine the number of correct matches in the top 50 retrievals; this counting has to be done manually as we do not possess any prior categorical information for the dataset images.

Table 1: Percentage of true-positive images in the top 50 retrievals, with the corresponding standard deviation, for different chain extraction mechanisms. NOC: Non-Overlapping Chains (Parui and Mittal, 2014); OC: Chains with Overlap allowed; GOP: Chains extracted using Geodesic Object Proposals (Krähenbühl and Koltun, 2014). OC+GOP uses chains extracted by both methods; GC indicates performance with geometric verification. The last row gives the performance when the ETHZ models (Ferrari et al., 2006) are used as queries.
Method                     Applelogo    Bottle    Giraffe    Mug    Swan
NOC+GC                     . ± .    . ± .    . ± .    . ± .    . ± .
OC+GC                      . ± .    . ± .    . ± .    . ± .    . ± .
GOP+GC                     . ± .    . ± .    . ± .    . ± .    . ± .
OC+GOP                     . ± .    . ± .    . ± .    . ± .    . ± .
OC+GOP+GC                  . ± . 83 14    . ± .    . ± .    . ± .    . ± .
OC+GOP+GC (ETHZ Models)    54    16    20    94    80
Two chain extraction strategies are used in our work to capture the object shape information: a) overlapping long chains from contour segment networks and b) the boundaries of segmented object proposals. To understand their relative benefits, we first measure the retrieval performance using only a single chain extraction method for all the database images at a time. Table 1 shows the retrieval scores for the different object categories. Allowing overlaps while extracting long chains from contour segment networks (OC+GC), compared to using only non-overlapping chains (NOC+GC) (Parui and Mittal, 2014), increases the chance of covering the entire object boundary and hence gives superior results. When the chains are extracted from segmented object proposals (GOP+GC) (Krähenbühl and Koltun, 2014), improved results can be observed for a few categories. For some objects, segmentation produces better results, and if the entire object is covered by a single segmented proposal, an accurate chain corresponding to the object boundary can be extracted. Therefore, for such categories, viz. applelogo, swan etc., one obtains good accuracy when chains extracted from segmented proposals are used. For other categories (giraffe, bottle), however, the top proposals cover only small and/or ambiguous portions of the object(s), and a better retrieval score is obtained when the chains are extracted from the contour segment networks. To exploit the benefits of both mechanisms, the chains extracted by the two methods are combined for all the database images (OC+GOP+GC), and a significant improvement in retrieval performance can be observed compared to either idea in isolation or to using only the non-overlapping chains (Parui and Mittal, 2014).

Table 1 also shows the importance of considering the geometric consistency of the matched chains (OC+GOP+GC vs. OC+GOP). Although a significant performance gain is observed after applying geometric consistency, this step cannot be applied when only one chain is matched between a sketch and a database image. Due to this, for some of the categories (viz. bottle, swan), applying geometric consistency does not make much difference in the retrieval score.

Object categories vary in the complexity and uniqueness of their shape information. For highly deformable objects, viz. the giraffe, the performance is poor since our approach can only handle a similarity transform. Furthermore, there is considerable texture variation and background clutter in the giraffe images of the ETHZ shape dataset (Ferrari et al., 2006), making it hard to extract good chains for this category. For simpler shapes, viz. apple, bottle etc., many false positives get a good matching score as these shapes are relatively simple and thus easy to match. Typically, our approach performs well for object categories with a distinctive shape from which chains can be extracted easily. The influence of sketch quality is also evident from the standard deviation of the retrieval accuracies. Table 1 also lists the performance of our approach for the different categories using the fairly good quality ETHZ model sketches (Ferrari et al., 2006), for which much better performance was obtained.

Next, we show more comprehensive retrieval results of the proposed algorithm and compare them to prior work.

7.2 Comparisons with Prior Work
To evaluate our system for large-scale retrieval on different object categories, we asked several subjects to draw sketches for a variety of objects on a touch-based tablet and collected their sketches. These sketches, along with sketches from a crowd-sourced sketch database (Eitz et al., 2012a), covering a number of different categories in total, are used for retrieval. The non-availability of a public implementation of any prior work makes a comparative study difficult. Even though a Windows Phone app (SketchMatch) based on Cao et al. (2011) is available, its database is not available, which prevents a fair comparison with other algorithms. Hence, we re-implemented this algorithm (Cao et al., 2011) (EI) as well as that of Eitz et al. (2009) (TENSOR) and tested them on our database for the purpose of comparison. Zhou et al. (2012) did not provide complete implementation details in their publication, and it is not trivial to make the method proposed by Riemenschneider et al. (2011) run efficiently on a very large database. Furthermore, Riemenschneider et al. (2011) did not show any result on a large-scale dataset and Zhou et al. (2012) showed results for only 3 sketches. Hence, these methods were not compared against.

Table 2 shows the performance of our algorithm in comparison with TENSOR (Eitz et al., 2009) and EI (Cao et al., 2011) at different retrieval levels. First, the precision is computed for all sketches of a given object category; then the best, worst and average retrieval scores of the different categories are averaged over all categories. The significant deviation between the best and the worst retrieval performance indicates the diversity in the quality of the user sketches and the system's response to it. It can be observed from Table 2 that our method significantly outperforms the other two methods on this large dataset. Both TENSOR (Eitz et al., 2009) and EI (Cao et al., 2011) consider edge matches at approximately the same location in the image as in the sketch; therefore, the images retrieved by these systems contain the sketched shape only at the same position, scale and orientation, while images containing the sketched object at a different scale, orientation and/or position are missed, leading to false retrievals in the top few matches (Fig. 22).

Table 2: Precision (expressed as the percentage of true positives) at different ranks for retrieval tasks over several categories on a dataset of 1.2 million images. B: Best, W: Worst, A: Average; the performances are computed among the sketches of each category and then averaged over the categories.

Method    Top 5    Top 10    Top 25    Top 50    Top 100    Top 250
B W A B W A B W A B W A B W A B W A
TENSOR (Eitz et al., 2009)    30.8 7.5 14.7    30 7.1 13.7    24.8 7 12.9    20.8 7 12.3    16.5 5.8 10.2    9.4 3 5.7
EI (Cao et al., 2011)    36.7 20.8 23.4    34.2 17.9 21.5    30 15.3 19.5    27 13.8 17.5    22.2 11.2 14.8    15.7 7.8 10.5
NOC+GC (Parui and Mittal, 2014)    80.8 42.5 60.8    72.5 38.3 53.6    54.7 29.3 39.5    40.3 20.7 28.5    31.8 16.3 22.2    23 12.5 16.5
OC+GOP+GC (Ours)
Similar performance was observed by us with the SketchMatch app (SketchMatch), although a direct comparison with it is inappropriate since the databases are different. It can also be observed that using the two types of chain extraction strategies and considering global angular consistency improves the performance compared to using only the non-overlapping chains (Parui and Mittal, 2014). Note that, due to the non-availability of a fully annotated dataset of over a million images, it is extremely hard to use an automated parameter learning algorithm. Hence, the parameters are chosen empirically by trying out a few variations; better parameter learning/tuning could possibly improve the results further.
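As an illustration of the kind of empirical tuning referred to above, a small grid search over the two scoring constants λ_c and λ_a (Eqns. 14 and 15) could be set up as follows. The grids and the callables retrieve and relevant_for are hypothetical placeholders; nothing here is specified by the method itself.

from itertools import product
from typing import Callable, List, Sequence, Set, Tuple

def tune_lambdas(sketches: Sequence[str],
                 retrieve: Callable[[str, float, float], List[str]],
                 relevant_for: Callable[[str], Set[str]],
                 k: int = 50) -> Tuple[Tuple[float, float], float]:
    """Pick (lambda_c, lambda_a) maximizing the mean precision-at-k.
    retrieve(sketch, lam_c, lam_a) returns a ranked list of image ids and
    relevant_for(sketch) the manually judged true positives; both are
    assumed to be supplied by the caller."""
    grid_c = [0.5, 1.0, 2.0]      # candidate values for lambda_c (illustrative)
    grid_a = [1.0, 2.0, 4.0]      # candidate values for lambda_a (illustrative)
    best_params, best_prec = (grid_c[0], grid_a[0]), -1.0
    for lam_c, lam_a in product(grid_c, grid_a):
        precs = []
        for s in sketches:
            top = retrieve(s, lam_c, lam_a)[:k]
            rel = relevant_for(s)
            precs.append(sum(1 for img in top if img in rel) / float(k))
        mean_prec = sum(precs) / len(precs)
        if mean_prec > best_prec:
            best_params, best_prec = (lam_c, lam_a), mean_prec
    return best_params, best_prec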
To provide comparisons on a standard dataset and to study the recall characteristics, which is difficult for a large dataset, we also tested our system on the ETHZ shape dataset (Ferrari et al., 2006), using the user-drawn sketches of Bhattacharjee and Mittal (2015) for evaluation. Although standard sketch-to-image matching algorithms for object detection that perform time-consuming online processing would certainly perform better than our approach on this small dataset, such a comparison would be unfair since the objectives are different. Hence, we compare only against TENSOR (Eitz et al., 2009) and EI (Cao et al., 2011). On this dataset, we measure the percentage of positive retrievals among the top retrieved results, which also gives an idea of the recall of the various approaches since the number of true positives is fixed. Table 3 shows the best, worst and average performance over the different sketches of a category (as for the previous dataset). It can be seen that our method performs much better than the other methods on this dataset as well. The retrieval performance with the ETHZ models (Ferrari et al., 2006) further indicates the substantial advantage of using very good sketches.
Table 3: Percentage of true-positive retrievals among the top retrieved results, using the sketches of Bhattacharjee and Mittal (2015) and the ETHZ models (Ferrari et al., 2006), on the ETHZ dataset (Ferrari et al., 2006).

Method    Our Sketches    ETHZ Models
Best Worst Avg (Ferrari et al. , 2006)
TENSOR (Eitz et al., 2009)    17    10    13.5    13.6
EI (Cao et al., 2011)    46    6    26.7    27.9
NOC+GC (Parui and Mittal, 2014)    60    28    42.9    49.3
OC+GOP+GC (Ours)    76    37    56.3    76.4
8 Conclusion

We have proposed an efficient approach for retrieving images from large datasets via hand-drawn sketches. To the best of our knowledge, this is the first major work in the field of large-scale sketch-based image retrieval that handles rotation, translation, scale and small variations of the object shape even for a dataset consisting of millions of images. This is accomplished by representing the images using chains of contour segments that have a high probability of containing the object boundary. A similarity-invariant, variable-length descriptor is proposed and used to partially match two chains within a hierarchical indexing framework. We also proposed a geometric verification scheme for improving the search accuracy. Experimental results on different datasets clearly indicate the benefits of our approach compared to existing methods.

Due to the similarity-invariance of our approach as compared to other relevant work, our method can be used to efficiently search large natural image databases, which typically exhibit a lot of variation. The proposed method could also open the window to efficient search in constrained image databases, viz. personal photo albums, which typically do not contain any tag/text information. Furthermore, our method could be augmented with other techniques, such as text, for tagging images in an offline fashion.

One major issue with our approach is the difficulty of extracting "good" representative chains in the presence of considerable background clutter. Newer and better boundary extraction mechanisms can easily be adapted to our framework to improve the quality of the chains. Furthermore, the proposed sketch-to-image matching approach is only similarity-invariant and fails to handle substantial deformation and major viewpoint changes. A sophisticated affine- or even projective-invariant matching mechanism could possibly help retrieve images with such variations as well and can be considered in future work.
References
Arbelaez P, Maire M, Fowlkes C, Malik J (2011) Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(5):898–916
Arbelaez P, Pont-Tuset J, Barron J, Marques F, Malik J (2014) Multiscale combinatorial grouping. In: IEEE International Conference on Computer Vision and Pattern Recognition
Arthur D, Vassilvitskii S (2007) k-means++: The advantages of careful seeding. In: ACM-SIAM Symposium on Discrete Algorithms
Bagon S, Brostovski O, Galun M, Irani M (2010) Detecting and sketching the common. In: IEEE International Conference on Computer Vision and Pattern Recognition
Barrow H, Tenenbaum J, Bolles R, Wolf H (1977) Parametric correspondence and chamfer matching: two new techniques for image matching. In: International Joint Conference on Artificial Intelligence
Basri R, Costa L, Geiger D, Jacobs D (1998) Determining the similarity of deformable shapes. Vision Research 38(15):2365–2385
Belongie S, Malik J, Puzicha J (2002) Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4):509–522
Bhattacharjee SD, Mittal A (2015) Part-based Deformable Object Detection with a Single Sketch (SUBMITTED to Computer Vision and Image Understanding)
Bozas K, Izquierdo E (2012) Large scale sketch based image retrieval using patch hashing. In: Advances in Visual Computing
Canny J (1986) A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (6):679–698
Cao Y, Wang C, Zhang L, Zhang L (2011) Edgel index for large-scale sketch-based image search. In: IEEE International Conference on Computer Vision and Pattern Recognition
Carreira J, Sminchisescu C (2012) CPMC: Automatic object segmentation using constrained parametric min-cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(7):1312–1328
Chetverikov D (2003) A simple and efficient algorithm for detection of high curvature points in planar curves. In: Computer Analysis of Images and Patterns
Chum O, Philbin J, Zisserman A (2008) Near Duplicate Image Detection: min-Hash and tf-idf Weighting. In: British Machine Vision Conference
Cohen I, Ayache N, Sulger P (1992) Tracking Points on Deformable Objects Using Curvature Information. In: European Conference on Computer Vision
Cole F, Golovinskiy A, Limpaecher A, Barros HS, Finkelstein A, Funkhouser T, Rusinkiewicz S (2008) Where do people draw lines? In: ACM Transactions on Graphics (Proceedings of SIGGRAPH)
Cole F, Sanik K, DeCarlo D, Finkelstein A, Funkhouser T, Rusinkiewicz S, Singh M (2009) How well do line drawings depict shape? In: ACM Transactions on Graphics (Proceedings of SIGGRAPH)
Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to Algorithms, 3rd edn. The MIT Press
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: IEEE International Conference on Computer Vision and Pattern Recognition
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: A large-scale hierarchical image database. In: IEEE International Conference on Computer Vision and Pattern Recognition
Deng J, Berg AC, Fei-Fei L (2011) Hierarchical semantic indexing for large scale image retrieval. In: IEEE International Conference on Computer Vision and Pattern Recognition
Eitz M, Hildebrand K, Boubekeur T, Alexa M (2009) A descriptor for large scale image retrieval based on sketched feature lines. In: Eurographics Symposium on Sketch-Based Interfaces and Modeling
Eitz M, Hays J, Alexa M (2012a) How do humans sketch objects? In: ACM Transactions on Graphics (Proceedings of SIGGRAPH)
Eitz M, Richter R, Boubekeur T, Hildebrand K, Alexa M (2012b) Sketch-Based Shape Retrieval. In: ACM Transactions on Graphics (Proceedings of SIGGRAPH)
Felzenszwalb PF, Schwartz JD (2007) Hierarchical matching of deformable shapes. In: IEEE International Conference on Computer Vision and Pattern Recognition
Ferrari V, Tuytelaars T, Van Gool L (2006) Object detection by contour segment networks. In: European Conference on Computer Vision
Ferrari V, Fevrier L, Jurie F, Schmid C (2008) Groups of adjacent contour segments for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(1):36–51
Ferrari V, Jurie F, Schmid C (2010) From images to shape models for object detection. International Journal of Computer Vision 87(3):284–303
Friedman JH, Bentley JL, Finkel RA (1977) An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software 3(3):209–226
Gopalan R, Turaga P, Chellappa R (2010) Articulation-invariant representation of non-planar shapes. In: European Conference on Computer Vision
Hu R, Barnard M, Collomosse J (2010) Gradient field descriptor for sketch based retrieval and localization. In: International Conference on Image Processing
Huiskes MJ, Lew MS (2008) The MIR Flickr retrieval evaluation. In: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval
Huttenlocher DP, Klanderman GA, Rucklidge WJ (1993) Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(9):850–863
Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the 30th Annual ACM Symposium on Theory of Computing
Jegou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(1):117–128
Keogh E, Ratanamahatana CA (2005) Exact indexing of dynamic time warping. Knowledge and Information Systems 7(3):358–386
King I, Lau TK (1996) A feature-based image retrieval database for the fashion, textile, and clothing industry in Hong Kong. In: International Symposium Multi-Technology Information Processing
Kokkinos I, Yuille A (2011) Inference and learning with hierarchical shape models. International Journal of Computer Vision 93(2):201–225
Kondo Si, Toyoura M, Mao X (2014) Sketch based skirt image retrieval. In: ACM Joint Symposium on Computational Aesthetics, Non-Photorealistic Animation and Rendering, and Sketch-Based Interfaces and Modeling
Krähenbühl P, Koltun V (2014) Geodesic object proposals. In: European Conference on Computer Vision
Landay JA, Myers BA (2001) Sketching interfaces: Toward more human interface design. IEEE Computer 34(3):56–64
Latecki LJ, Lakamper R, Eckhardt T (2000) Shape descriptors for non-rigid shapes with a single closed contour. In: IEEE International Conference on Computer Vision and Pattern Recognition
Lee YJ, Grauman K (2009) Shape discovery from unlabeled image collections. In: IEEE International Conference on Computer Vision and Pattern Recognition
Lee YJ, Zitnick CL, Cohen MF (2011) Shadowdraw: real-time user guidance for freehand drawing. In: ACM Transactions on Graphics (Proceedings of SIGGRAPH)
Li Y, Hou X, Koch C, Rehg JM, Yuille AL (2014) The secrets of salient object segmentation. In: IEEE International Conference on Computer Vision and Pattern Recognition
Ling H, Jacobs DW (2007) Shape classification using the inner-distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(2):286–299
Liu MY, Tuzel O, Veeraraghavan A, Chellappa R (2010) Fast directional chamfer matching. In: IEEE International Conference on Computer Vision and Pattern Recognition
Lu C, Latecki LJ, Adluru N, Yang X, Ling H (2009) Shape guided contour grouping with particle filters. In: IEEE International Conference on Computer Vision
Ma T, Latecki LJ (2011) From partial shape matching through local deformation to robust global shape similarity for object detection. In: IEEE International Conference on Computer Vision and Pattern Recognition
Manning CD, Raghavan P, Schütze H (2008) Introduction to Information Retrieval, vol 1. Cambridge University Press
Marr D (1982) Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Henry Holt and Co. Inc.
Martin DR, Fowlkes CC, Malik J (2004) Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(5):530–549
Marvaniya S, Bhattacharjee S, Manickavasagam V, Mittal A (2012) Drawing an automatic sketch of deformable objects using only a few images. In: European Conference on Computer Vision, Workshops and Demonstrations
Meltzer J, Soatto S (2008) Edge descriptors for robust wide-baseline correspondence. In: IEEE International Conference on Computer Vision and Pattern Recognition
Muja M, Lowe DG (2009) Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. In: International Conference on Computer Vision Theory and Application
Nister D, Stewenius H (2006) Scalable recognition with a vocabulary tree. In: IEEE International Conference on Computer Vision and Pattern Recognition
Opelt A, Pinz A, Zisserman A (2008) Learning an alphabet of shape and appearance for multi-class object detection. International Journal of Computer Vision 80(1):16–44
Parui S, Mittal A (2014) Similarity-Invariant Sketch-Based Image Retrieval in Large Databases. In: European Conference on Computer Vision
Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching. In: IEEE International Conference on Computer Vision and Pattern Recognition
Ravishankar S, Jain A, Mittal A (2008) Multi-stage contour based detection of deformable objects. In: European Conference on Computer Vision
Riemenschneider H, Donoser M, Bischof H (2010) Using partial edge contour matches for efficient object category localization. In: European Conference on Computer Vision
Riemenschneider H, Donoser M, Bischof H (2011) Image retrieval by shape-focused sketching of objects. In: Computer Vision Winter Workshop
Rubner Y, Tomasi C, Guibas LJ (2000) The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40(2):99–121
Sattler T, Leibe B, Kobbelt L (2009) SCRAMSAC: Improving RANSAC's efficiency with a spatial consistency filter. In: IEEE International Conference on Computer Vision and Pattern Recognition
Schindler K, Suter D (2008) Object detection by global contour shape. Pattern Recognition 41(12):3736–3748
Schroff F, Criminisi A, Zisserman A (2011) Harvesting image databases from the web. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(4):754–766
Scott C, Nowak R (2006) Robust contour matching via the order-preserving assignment problem. IEEE Transactions on Image Processing 15(7):1831–1838
Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8):888–905
Sigurbjörnsson B, Van Zwol R (2008) Flickr tag recommendation based on collective knowledge. In: International Conference on World Wide Web
SketchMatch (2011) http://research.microsoft.com/en-us/projects/sketchmatch
Smith TF, Waterman MS (1981) Identification of common molecular subsequences. Journal of Molecular Biology 147(1):195–197
Toyama K, Blake A (2002) Probabilistic tracking with exemplars in a metric space. International Journal of Computer Vision 48(1):9–19
Tsai SS, Chen D, Takacs G, Chandrasekhar V, Vedantham R, Grzeszczuk R, Girod B (2010) Fast geometric re-ranking for image-based retrieval. In: International Conference on Image Processing
Tseng CH, Hung SS, Tsay JJ, Tsaih D (2009) An efficient garment visual search based on shape context. WSEAS Transactions on Computers 8(7):1195–1204
Uijlings JR, van de Sande KE, Gevers T, Smeulders AW (2013) Selective search for object recognition. International Journal of Computer Vision 104(2):154–171
Vlachos M, Vagena Z, Yu PS, Athitsos V (2005) Rotation invariant indexing of shapes and line drawings. In: ACM International Conference on Information and Knowledge Management
Walther DB, Chai B, Caddigan E, Beck DM, Fei-Fei L (2011) Simple line drawings suffice for functional MRI decoding of natural scene categories. Proceedings of the National Academy of Sciences 108(23):9661–9666
Yarlagadda P, Ommer B (2012) From meaningful contours to discriminative object shape. In: European Conference on Computer Vision
Zeng L, Liu Yj, Wang J, Zhang Dl, Yuen MMF (2014) Sketch2Jewelry: Semantic feature modeling for sketch-based jewelry design. Computers & Graphics 38:69–77
Zhang W, Antunez E, Gokturk S, Sumengen B (2012) Apparel silhouette attributes recognition. In: Workshop on Applications of Computer Vision
Zhou R, Chen L, Zhang L (2012) Sketch-based image retrieval on a large scale database. In: ACM International Conference on Multimedia
Zhu Q, Song G, Shi J (2007) Untangling cycles for contour grouping. In: IEEE International Conference on Computer Vision
Fig. 19: Top retrieved images for 14 sketches from 1.2 million images. The retrieved images indicate the insensitivity of our approach to similarity transforms and its handling of deformation. Chains are embedded on the retrieved images to illustrate the locations of the matches; multiple matched chains are shown in different colors. Correct, similar and false matches are indicated by green, yellow and red boxes respectively (best viewed in color).
Fig. 20: Top retrievals from the ETHZ extended shape dataset (Schindler and Suter, 2008) for a few sketches.
Fig. 21: Skipping important joints and allowing local deformations lead to false-positive retrievals: (a) and (b) show two such cases. Matched joints are numbered identically in the sketch and in the image.
Fig. 22: Top 4 results by (b) Eitz et al. (2009) and (c) Cao et al. (2011).