Efficient Large-scale Approximate Nearest Neighbor Search on the GPU
Patrick Wieschollek, Oliver Wang, Alexander Sorkine-Hornung, Hendrik P.A. Lensch
University of Tübingen · Adobe Systems Inc. · Disney Research · Max Planck Institute for Intelligent Systems, Tübingen
Abstract
We present a new approach for efficient approximate nearest neighbor (ANN) search in high dimensional spaces, extending the idea of Product Quantization. We propose a two-level product and vector quantization tree that reduces the number of vector comparisons required during tree traversal. Our approach also includes a novel, highly parallelizable re-ranking method for candidate vectors that efficiently reuses already computed intermediate values. Due to its small memory footprint during traversal, the method lends itself to an efficient, parallel GPU implementation. This Product Quantization Tree (PQT) approach significantly outperforms recent state-of-the-art methods for high dimensional nearest neighbor queries on standard reference datasets. Ours is the first work that demonstrates GPU performance superior to CPU performance on high dimensional, large-scale ANN problems in time-critical real-world applications, like loop-closing in videos.
1. Introduction
Finding the nearest neighbors (NN) of a query vector in a high dimensional space is a fundamental task in computer vision. For a given query vector y ∈ R^D, the nearest neighbor problem consists of finding an element N(y) ∈ X from a predefined fixed set X ⊂ R^D which minimizes a distance metric, most commonly the Euclidean distance d(x, y) := ‖x − y‖. This NN problem can be written as

N(y) = arg min_{x ∈ X} d(y, x).   (1)

In many computer vision tasks these query vectors represent visual descriptors, meaning both D and |X| are often large, leading to NN searches being a significant computational bottleneck in many applications. This is due to the necessity of computing exact distances in a high dimensional space between many pairs of vectors, a problem exacerbated by the phenomenon known as the curse of dimensionality. More precisely, consider a D-dimensional unit hypercube enclosing X. To explore a fraction ν of its volume, one has to visit a fraction ν^{1/D} of each hypercube edge. For SIFT vectors (D = 128), even exploring a tiny fraction of the volume therefore requires searching an interval covering almost the entire range of possible values per coordinate.

Due to this computational complexity, most applications rely on approximate nearest neighbor (ANN) search techniques, which try to find the nearest neighbor with a high probability. Many CPU approaches for computing ANNs exist in the literature, the most common of which are KD-trees [6], which hierarchically subdivide the vector space. While these methods are widely used in graphics and vision, it has been shown that KD-trees are no more efficient than brute force searches when D is large [9]. The FLANN software package [16] proposes randomized KD-forests and K-Means trees, which prune the overall search space by identifying small regions around the query vectors, yielding better performance with higher dimensional vectors. Another family of approaches is based on Locality Sensitive Hashing (LSH) [5].
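To make Eq. (1) concrete, a minimal brute-force baseline looks as follows (a sketch with toy random data standing in for a descriptor database; the function name is ours, not from the paper's code). It performs exactly the O(n · D) work per query that becomes prohibitive at scale:

```python
import numpy as np

def nearest_neighbor(y, X):
    # Exact NN per Eq. (1): one squared Euclidean distance per database
    # vector, i.e. O(n * D) work for a single query.
    d2 = np.sum((X - y) ** 2, axis=1)
    return int(np.argmin(d2))

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 128))         # toy stand-in for the database
y = X[42] + 1e-3 * rng.standard_normal(128)  # a query close to vector 42
```

For SIFT1B (n = 10^9, D = 128) this linear scan is exactly what the indexing structures below are designed to avoid.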
These methods hash database vectors with a number of random projections, and perform nearest distance checks only on vectors that are hashed to the same bin. The speed and accuracy of such methods depend on the hashing function used. Andoni and Indyk [1] describe a family of hashing functions which are near-optimal. These ideas have since been extended into the image domain for patch-based nearest neighbor computation [13]. While these methods work well, they have not yet achieved the same performance as space partitioning methods [16].

While leveraging GPU parallelism seems obvious, in practice accelerating ANN search techniques using GPU parallelism is notoriously difficult, largely due to the memory restrictions of GPUs when compared to the amount of RAM available to CPUs. As a result, existing GPU-based methods often implement brute force approaches, are limited to small datasets of up to 2^25 candidate neighbors [18], or can handle only 3-dimensional vectors [16], making these approaches unsuited for many vision problems. Our method is designed to be highly parallel, and can be easily implemented on a GPU for significant improvements in query time, while even a CPU version is competitive with previous methods.

In general, ANN searches are composed of an offline phase, containing all query-independent computations like building an indexing structure of the dataset, and an online phase, which comprises all query-dependent computations. However, reported results from previous state-of-the-art methods [11, 9, 3, 2] excluded expensive query-dependent pre-computations from the timing of their online phase. This does not give a reasonable expectation of running time, as in nearly all applications the query vector y is unknown until the query request, and cannot be precomputed in an offline phase.
Therefore, we include all query-dependent computation steps in our timing results, in order to give a better indication of the running times achievable in real applications.

Vector Quantization (VQ) [14] is a simple method that clusters the search space into a number of bins based on the distance to the cluster centroid. If a query vector is quantized to a bin, all other vectors in that bin are likely to be good candidates for being the nearest neighbor. Unfortunately, if a query lies at the edge of a bin, one has to consider all neighboring bins as well, and the number of neighbors of each Voronoi cell increases exponentially with the dimension D of the space.

The concept of Product Quantization (PQ) was introduced in [12] and made popular in the computer vision community by Jegou et al. [9]. Several state-of-the-art ANN approaches extend these ideas, such as locally optimized product quantization [11] and the inverted multi-index [4]. These methods currently provide the most efficient techniques for ANN search on high dimensional data, in terms of speed, accuracy, and memory requirements. In general, PQ-based approaches consist of the following three main steps: (1) a robust proposal mechanism is used to identify a list of nearest neighbor candidates in the database (similar to the vectors from a bin in the VQ example); (2) a re-ranking step then sorts these candidates according to their ascending approximate distances to the query vector; finally, the approximate k-nearest vectors after re-ranking are further sorted using (3) an exact distance calculation.

We present an extension to the family of PQ methods called the Product Quantization Tree (PQT).
The main contributions of our approach are: a two-level product and vector quantization tree that requires significantly fewer exact distance tests than prior work; a relaxation of Dijkstra's algorithm for an effective bin traversal order; a fast re-ranking step to approximate exact distances between query and database points in constant time O(P), where a query vector is split into P parts; and a highly optimized, GPU-based open-source implementation.

In this work, we compare our method on a common benchmark, BigANN [9], which consists of 1 billion 128-dimensional SIFT vectors and 10000 query vectors. This dataset is challenging due to the infeasibility of an exhaustive search, as well as the sheer size of the data (just storing the database vectors requires 132 GB of memory). At comparable approximation quality, the GPU implementation achieves a significant speed-up over prior work. We provide source code of our approach to encourage the development of new applications that require high-performance ANN queries.
2. Background
Our approach builds on PQ [9], which we describe here, followed by a description of the work most related to ours. Let X = {x_1, ..., x_n} be a finite set of database vectors x_i ∈ R^D. Without loss of generality we consider the Euclidean space (X, d); however, our approach can be used with any arbitrary metric d.

In vector quantization (VQ), each vector x ∈ R^D is encoded by a codebook C = {c_1, ..., c_k} of k centroids with the mapping

c : X → C,  x ↦ c(x) := arg min_{c ∈ C} d(x, c).

In other words, each vector is represented by its closest centroid in the codebook. The set C_k = {x ∈ R^D | c(x) = c_k} is called the cluster or bin for centroid c_k. This quantization of vectors introduces an approximation error, but allows for quick retrieval of a similar set of vectors C_k, i.e., all those that are quantized to the same bin as the query. Classical Lloyd iterations [15] can be used on a subset of the original data to efficiently find a good codebook C.

In PQ, the high dimensional vector space is transformed into a product space, whose subspaces are then quantized using VQ. Under the assumption that D = P · m for some P, m ∈ N, we can write x ∈ R^{P·m} as the concatenation of P vector parts x = ([x]_1, [x]_2, ..., [x]_P)^T with [x]_p ∈ R^m. This allows for exponentially large codebook generation by encoding x ∈ R^D into a Cartesian product of subspaces C = C_1 × C_2 × ··· × C_P with k^P bins, while only requiring space for k · P centroids (see Figure 1d). Increasing the number of bins enables a much finer granularity for the query process, and so the vectors in each single bin have a significantly higher correlation. The canonical projection maps each part [x]_p independently,

c_p : X → C_p,  x ↦ c_p(x) := arg min_{c ∈ C_p} d([x]_p, c),   (2)

to its nearest part-centroid. The nearest centroid c(x) ∈ C for x ∈ R^D is the concatenation of the sub-centroids

c(x) = (c_1(x), c_2(x), ..., c_P(x))^T.   (3)

Finding a good quantizer c(·) for X is distributed into finding P codebooks C_1, C_2, ..., C_P independently, which can also be done using Lloyd iterations. Note that when setting P = 1, Product Quantization becomes Vector Quantization.

Figure 1: Different quantization schemes with k = 32 clusters. Vector Quantization (a) represents vectors by their closest centroids. Product Quantization performs the clustering in subspaces (here axes) (b). A tree structure can be used to build a hierarchy of clusters on each axis (c). Our method (d) uses a hierarchy of two quantization levels, first using PQ with a low number of centroids, and then a second layer of PQ within these bins; each first-level cluster is split again into 4 finer clusters (2 on each axis).

While it is indeed easy to produce exponentially many (in terms of P) clusters using PQ, most will be empty because of the dataset distribution (see supplementary material). However, if we assume that the set of query vectors has a similar distribution as the set of database vectors, we can expect that most queries will also correspond to non-empty clusters. Nonetheless, we still must be able to deal with clusters of highly diverse cardinality, as illustrated in Figure 1.

For a better quantization, the authors of [7] proposed to augment PQ using the mapping

c : X → C,  x ↦ c(x) := arg min_{c ∈ C} d(Rx, c),   (4)

where R ∈ R^{D×D} is a rotation matrix. However, this requires a high dimensional matrix multiplication for rotating the query for each visited cluster, which is expensive even for a GPU implementation. We note that the authors of [11] pre-compute all possible projections Ry_i of the query vectors in the offline phase, but this approach is only practical when query vectors are known beforehand.
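The encode/decode mappings of Eqs. (2)-(3) can be sketched as follows. This is a minimal illustration with random codebooks standing in for trained Lloyd/k-means codebooks, and the helper names (`pq_encode`, `pq_decode`) are ours:

```python
import numpy as np

def pq_encode(x, codebooks):
    # Eqs. (2)-(3): quantize each of the P parts independently against
    # its own codebook of k per-part centroids.
    P = len(codebooks)
    parts = np.split(x, P)  # x = ([x]_1, ..., [x]_P)
    return tuple(int(np.argmin(np.sum((C - p) ** 2, axis=1)))
                 for p, C in zip(parts, codebooks))

def pq_decode(code, codebooks):
    # Reconstruct the approximate vector from its P centroid ids.
    return np.concatenate([C[i] for i, C in zip(code, codebooks)])

rng = np.random.default_rng(1)
P, m, k = 4, 8, 16  # D = P*m = 32 dimensions, k centroids per part
codebooks = [rng.standard_normal((k, m)) for _ in range(P)]
x = pq_decode((3, 1, 4, 1), codebooks)  # a vector lying exactly on centroids
```

Note that the k · P stored centroids address k^P = 16^4 = 65536 bins, which is the key memory advantage of PQ over flat VQ.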
For a query vector y ∈ R^D, the approach of [9] proposes an inverted index system with an asymmetric distance computation. This consists of an initial VQ step that acts as a coarse quantizer with a codebook of k centroids to extract a set of candidate vectors (k = 8192 clusters are used in [9]). The number of candidates is empirically set to 0.05% of the database size to achieve a high recall by visiting w clusters; this corresponds to 500000 candidate vectors for each query in the BigANN database. This approach requires k exact D-dimensional distance calculations for each query vector y to identify reasonable clusters. The centroids are then sorted based on their distances, and the w best clusters are chosen, giving a list of database vectors L_c ⊂ X which has a high chance of containing the nearest neighbor. These vectors are again sorted in a re-ranking step based on PQ of the expensive residual computation r_w = y − c_w to the identified cluster c_w, which is precomputed in [9, 11] and stored in a distance lookup table. Again, this pre-computation is only possible when query vectors are known in advance.

The distance between the query vector y ∈ R^D and each nearest neighbor candidate x ∈ L_c can be approximated by quantizing the residual using a second PQ codebook with k' words. Re-ordering the list L_c → L_s and considering only the first few vectors L'_s ⊂ L_s, an exhaustive search in L'_s becomes feasible.

The lookup and re-ranking steps when visiting w clusters require k + w · k' exact distance calculations at query time. With typical values of k = 8192, w = 64, k' = 256, this implies 24576 distance calculations. Our hierarchical approach reduces the total number of exact distance calculations to less than 200. On a modern NVIDIA Titan X, computing 16k exact distances is 62 times slower (0.13 ms) than 256 exact distance computations (0.0021 ms). Hence, the number of exact vector comparisons would become a bottleneck.
A complete query in our algorithm takes only about 0.02 ms, due to an efficient index structure and the parallel nature of our approach.

The inverted multi-index [4] exploits PQ rather than VQ for an indexing structure over all database vectors, which reduces the number of centroid distance calculations for cluster proposals, or, vice versa, increases the number of bins: given the part distances to the k codebook vectors, for each part [y]^p of the query vector this approach sorts the corresponding k centroids by ascending distance,

[y]^1 → {i^1_1, i^1_2, i^1_3, ..., i^1_k} = I^1
[y]^2 → {i^2_1, i^2_2, i^2_3, ..., i^2_k} = I^2
 ...
[y]^P → {i^P_1, i^P_2, i^P_3, ..., i^P_k} = I^P,

where, e.g., i^1_3 is the ID of the 3rd nearest cluster for part 1. The combined cluster IDs of all parts encode a bin ID via a multi-index

i ∈ I^1 × I^2 × ··· × I^P.   (5)

For NN search, starting with bin B_i, i = (i^1_1, i^2_1, ..., i^P_1), a heuristic is needed to traverse all bins B_i in the vicinity. The authors of [4] make use of a priority queue to dynamically select the next closest, not yet visited bin until sufficiently many bins/vectors are proposed. All vectors stored in each visited bin B_i are then examined in an exhaustive search using PQ-based re-ranking of the residual to each bin centroid.

Both methods [9, 4] achieve state-of-the-art precision but have efficiency issues when making a query. VQ-based indexing requires a very large number of full-dimensionality codebook vector comparisons, and even for PQ-based indexing the number is still large. PQ-based indexing is further hindered by a slow enumeration of the next best bin [4]. Additionally, both methods require quantizing the residual within each bin for re-ranking; this is accelerated by pre-computing the residual quantization for each part in [4], however with unknown query vectors this optimization cannot be made.

We address these issues by introducing a Product Quantization Tree (Sec. 3).
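The per-part sorted index lists I^1, ..., I^P described above, and the best bin of Eq. (5), can be sketched as follows (toy sizes, untrained random codebooks; `part_index_lists` is our own helper name):

```python
import numpy as np

def part_index_lists(y, codebooks):
    # For each part [y]^p, sort the k per-part centroid ids by ascending
    # distance -- the lists I^1, ..., I^P of the inverted multi-index.
    parts = np.split(y, len(codebooks))
    return [np.argsort(np.sum((C - p) ** 2, axis=1))
            for p, C in zip(parts, codebooks)]

rng = np.random.default_rng(2)
codebooks = [rng.standard_normal((8, 4)) for _ in range(2)]  # P=2, k=8, m=4
y = rng.standard_normal(8)
I = part_index_lists(y, codebooks)
best_bin = tuple(int(l[0]) for l in I)  # Eq. (5): first entry of each list
```

Enumerating the neighboring bins of `best_bin` in a near-optimal order is exactly the part that [4] solves with a sequential priority queue and that Sec. 3.2 replaces with a fixed, parallel heuristic.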
Our approach presents an efficient heuristic for proposing bins (Sec. 3.2), as well as a novel re-ranking method based on projections to quantized lines (Sec. 3.3). Our re-ranking step is especially efficient, as it can simply reuse distance calculations computed during the tree traversal. Finally, we demonstrate that our approach can be efficiently implemented on a GPU using CUDA (Sec. 4).
3. Product Quantization Tree
The Product Quantization Tree (PQT) is built upon a combination of the inverted multi-index and hierarchical PQ. The main idea is that product quantization is performed using a hierarchical VQ-tree [16] for each part rather than a flat codebook. The tree structure on the centroids speeds up the query (online) and the sorting into the database (offline), and indexes considerably more bins than the inverted multi-index. Additionally, it is designed to enable the reuse of computed values for fast re-ranking.

Figure 2: Both parts of ([x]_1, [x]_2) ∈ R^D are quantized by a VQ tree with k1 = 4 clusters in the first level and k2 = 5 finer clusters in the second level. During traversal, only the w closest clusters of the first level are refined. The search space obtained by extending w = 2 clusters is illustrated as the gray area.

Sorting each database vector x ∈ X into a bin B_ℓ gives disjoint sets, X = ⋃_{ℓ=1}^K B_ℓ. We describe how to effectively map a vector into a bin, m : X → I^1 × I^2 × ··· × I^P. The indexing structure is a tree which consists of two levels of quantizers. The first level is a traditional P-parts product quantizer with k1 centroids for each part. Each resulting part cluster is then independently refined by one additional vector quantizer with k2 centroids, as illustrated in Figure 2. The bins are addressed by any combination of the per-part child node centroids. This gives K = (k1 · k2)^P bins in total.

Training the codebook.
Constructing the VQ-trees is done independently for each part, first by constructing a VQ codebook (level 1) using Lloyd iterations in the fashion of the Linde-Buzo-Gray algorithm [14], and then quantizing all sub-vectors (level 2) assigned to each first-level cluster. While the inverted multi-index approach [4] also uses two levels of product quantization, the second is used exclusively for re-ranking. As opposed to this, we use both levels for indexing. While additional tree levels are possible, we empirically found this configuration to be optimal in terms of balancing the number of bins to check against the reduction of candidate vectors.
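The two-level training for a single part can be sketched as follows. This uses a minimal Lloyd iteration in place of a tuned k-means implementation, and all sizes are toy values rather than the paper's configuration:

```python
import numpy as np

def lloyd(X, k, iters=20, seed=0):
    # Minimal Lloyd/k-means: initialize centroids from data points, then
    # alternate assignment and mean updates.
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        a = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(a == j):
                C[j] = X[a == j].mean(axis=0)
    return C, a

def train_pqt_part(Xp, k1, k2):
    # One part of the two-level tree: a level-1 codebook of k1 centroids,
    # then an independent level-2 codebook of k2 centroids inside every
    # level-1 cluster.
    C1, assign = lloyd(Xp, k1)
    C2 = [lloyd(Xp[assign == j], k2, seed=j + 1)[0] for j in range(k1)]
    return C1, C2

rng = np.random.default_rng(3)
Xp = rng.standard_normal((2000, 8))  # toy sub-vectors of one part
C1, C2 = train_pqt_part(Xp, k1=4, k2=3)
```

With P parts this yields (k1 · k2)^P addressable bins while storing only P · k1 · (1 + k2) centroids.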
A query now consists of four steps: tree traversal, bin proposal, vector proposal, and re-ranking. The tree traversal is carried out as described below, producing an ordered list of pairs (i, d)_p for the best subset of level-2 clusters.

Tree Traversal.
The tree reduces the number of exact distance computations required during traversal by pruning. After comparing against all k1 first-level codebook vectors, the distances are sorted, and only the w best clusters are refined with further distance calculations against the level-2 codebook. Let y ∈ R^D be the query vector; the distances d([y]_p, [c]_p) to the k1 centroids in the first level are computed separately for each part. This step returns a set of ID and distance pairs

{(i, d')_p | d' = d([y]_p, [c_i]_p)}   (6)

for each part p and each level-1 centroid. From these possible per-part clusters, we only use the closest w centroids for further processing, i.e., we compute the distances d([y]_p, [c]_p) only to those level-2 centroids whose corresponding parent c is in the best set. The level-2 distances are ordered to find the best cluster indices for each part. Finally, combining the best indices of the individual parts identifies the best bin as in Equation 5.

A typical configuration might consist of four parts (P = 4, k1 = 16, k2 = 16, w = 4), amounting to only 16 + 4 · 16 = 80 full vector distance calculations to address (16 · 16)^4 ≈ 4.3 · 10^9 bins. For practical purposes we applied modulo-hashing, using unsigned integers to represent the index.

Bin Proposal Heuristic.
Given the best bin as determined by the index i = (i^1, i^2, ..., i^P), one has to find a sequence of neighbor bins to check such that a sufficient number of vectors for re-ranking is generated. The priority queue used by Babenko and Lempitsky [4] would yield the optimal sequence, but it requires a re-sorting operation for each proposed bin, which is expensive and sequential by nature. Instead, we propose to choose a fixed traversal heuristic. The simplest order would be to compute all id-vectors v ∈ {0, ..., r}^P and sort them according to their distance ‖v‖ from the origin. However, this yields an isotropic bin traversal heuristic, as depicted in Figure 3 (blue line), compared to the optimal sequence from [4] (green line) and our proposed anisotropic traversal heuristic (red line). The anisotropic version with flexible slope produces a better approximation of the Dijkstra ordering. Hereby, we pre-compute bin orderings for 10 slopes. Each slope describes the balance of progress on one part-pair: a slope of 1 treats both parts equally, while a smaller slope allows more bin combinations with higher ids in the second part (see red line in Figure 3).

Figure 3: Merging the independent lookups from different parts p, p' to find the best bin combination requires sorting all combinations by the per-part distances ‖[x]_p − [c]_p‖ and ‖[x]_{p'} − [c]_{p'}‖. A Dijkstra-based traversal [4] (green) cannot be evaluated on a GPU due to its sequential nature, though it yields the optimal sequence (7 bins in this example). Possible parallel approximations are a naive (blue, 10 bins) or an anisotropic (red, 8 bins) heuristic.

In the index structure, each database vector is quantized to its nearest bin with a quantization error ∆_i. To find the best vectors in a bin, they need to be sorted based on their distance to the query vector. However, full D-dimensional distance calculations for each vector are too expensive. Similarly, re-ranking based on product-quantized residuals [9] requires comparison against yet another codebook. Inspired by the Johnson-Lindenstrauss lemma [8], we propose line quantization, where some of the information gathered during traversal is reused.

Offline computation
Each vector is quantized to its nearest projection onto any line through the level-1 centroids of each part, see Figure 4. For multiple parts, this quantization effectively spans hyper-planes. The distance of the query point to the line-quantized vector can be evaluated exactly and efficiently using only one 2D triangle calculation per part.

In order to disambiguate the database vectors x in these bins, we approximate each vector part [x]_p by its projection into the linear subspace span([c_i]_p, [c_j]_p), c_i, c_j ∈ C, defined by the first-level centroids [c_i]_p and [c_j]_p, as illustrated in Figure 4. We choose [c_i]_p, [c_j]_p such that the quantization error δ_p is minimized. Therefore, when approximating each database vector part [x]_p by (1 − λ_p)[c_i]_p + λ_p[c_j]_p, calculating the distance d(y, x) does not rely on the values of x but uses existing information from the tree structure. Using this approach, we store λ_p and the index pair (i_p, j_p) for each database vector part using 1+1 bytes in our implementation. In fact, all the information about a database vector x_i ∈ X that we need for the complete algorithm is encoded in the 3 · P tuple

x_i ↔ (λ_1, ..., λ_P, i_1, ..., i_P, j_1, ..., j_P),   (7)

which can be heavily compressed to about 2 bytes per part. While this scheme does not have the same compression rate as previous methods, it is the first to allow an efficient parallel re-ranking on the GPU by look-up of already computed values without any computational overhead. We only need to store one small global lookup table of P × k1 × k1 precomputed distances between all pairs of level-1 centroids, i.e., ‖[c_s]_p − [c_t]_p‖ for all p, s, t. This can be computed in the offline phase, as it is independent of the query vectors.

Figure 4: Line quantization. In traditional PQ, each database vector x_i is quantized to the bin centroid, yielding an approximation error ∆_i; vectors in a bin would be indistinguishable w.r.t. a query y. We instead project x_i onto the nearest line through two level-1 centroids, which gives an approximation error δ_i.

Figure 5: Exact query-to-line calculation. The database vector part [x]_p is projected onto the line ([c_i]_p, [c_j]_p) with error δ_p. When re-ranking, the exact distance h_p between the query [y]_p and the quantized database point is obtained by triangulation. All necessary values are known, as they are computed during tree traversal (a_p, b_p) or during database construction (c_p, λ_p, (i, j)).

Online computation
During tree traversal, all distances between a query point y ∈ R^D and all level-1 centroids have already been computed as a list of pairs (i, d)_p. The approximate distance to a database vector x is then computed given the triple (λ_p, i_p, j_p) by looking up a_p and b_p in the query's list, and c_p in the precomputed centroid distance table. With a_p = d([y]_p, [c_j]_p), b_p = d([y]_p, [c_i]_p), and c_p = d([c_i]_p, [c_j]_p), the squared distance between y and x is approximated by

d(y, x)^2 = Σ_{p=1}^P d([y]_p, [x]_p)^2 ≈ Σ_{p=1}^P h_p(y, x)^2   (8)
          = Σ_{p=1}^P ( b_p^2 + λ_p^2 · c_p^2 + λ_p · (a_p^2 − b_p^2 − c_p^2) ).   (9)

Note that this triangulation computes the distance between a query and a database vector exactly up to the projection error δ_i, as illustrated in Figure 5. In practice, we use different numbers of parts for the tree (P_tree, e.g. 2 or 4) and for the line quantization (P_line, e.g. 8 to 32) for sufficiently precise re-ranking. Using exactly the same level-1 codebook with P_tree parts, we split each centroid part into finer parts to obtain P_line parts and compute the distances by aggregating the components.
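A sketch of the line quantization and the triangulation of Eq. (9): offline we keep only the scalar λ_p (plus the centroid pair), and online the squared distance to the projected point is recovered from three already-known squared distances. Names and sizes here are illustrative, not the paper's implementation:

```python
import numpy as np

def line_quantize(x, ci, cj):
    # Offline: position of the orthogonal projection of x onto the line
    # through centroids ci, cj (lambda = 0 at ci, 1 at cj).
    u = cj - ci
    return float(np.dot(x - ci, u) / np.dot(u, u))

def line_distance2(lam, a2, b2, c2):
    # Online re-ranking, cf. Eq. (9): squared query-to-projection distance
    # from a2 = d(y, cj)^2, b2 = d(y, ci)^2 (known from tree traversal)
    # and c2 = d(ci, cj)^2 (small precomputed table).
    return b2 + lam * lam * c2 + lam * (a2 - b2 - c2)

rng = np.random.default_rng(4)
ci, cj = rng.standard_normal(8), rng.standard_normal(8)
x, y = rng.standard_normal(8), rng.standard_normal(8)
lam = line_quantize(x, ci, cj)
xhat = (1 - lam) * ci + lam * cj  # the line-quantized database point
h2 = line_distance2(lam,
                    np.sum((y - cj) ** 2),
                    np.sum((y - ci) ** 2),
                    np.sum((ci - cj) ** 2))
```

Expanding ‖y − (ci + λ(cj − ci))‖^2 confirms the identity used in `line_distance2`, so per part the online cost is a handful of multiply-adds with no access to the original vector x.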
4. GPU Implementation
Our approach is well suited to take advantage of GPU parallelism, and we implemented it in CUDA. There are two levels of granularity of parallelism: the first is processing multiple vectors in parallel, each vector with one block of threads; the second is processing vector elements in parallel across all threads within a block.

Database bins are represented by one long, sorted array containing all vector IDs, and a pointer array indicating where the vectors of each bin start. The pointer array is assembled by first computing a histogram of vectors over all bins and then computing the prefix sum [17]. In order to deal with a possibly excessive number of bins, we hash the bins using a simple modulus. As many bins contain zero vectors, collisions are simply ignored.

One kernel computes the distances to all level-1 centroids and sorts them using bitonic sort in shared memory. The second kernel does the same for the selected level-2 clusters based on the previous output. These two kernels are used both for sorting vectors into the bins and for kNN queries. For database vectors, a special kernel computes the optimal line projection (Sec. 3.3). For each query vector, we then generate an ordered list of bins following the heuristic of Sec. 3.2. Each thread in a block computes and stores one bin ID. All empty bins are removed. Then the kernel produces a list of potential vector IDs; each thread is responsible for copying all vector IDs of one bin. The final kernel calculates the distances for the re-ranking using the line quantization and outputs the re-ranked list of IDs. Here, the re-ranking of one vector is executed by one warp each.

method           ms     R@1   R@10  R@100  speedup
FLANN [16]       5.32   0.97  -     -      ×9.6
LOPQ [11]        51.1   0.51  0.93  0.97   ×1.0
IVFADC*          11.2   0.28  0.70  0.93   ×4.6
PQT (CPU)        4.89   0.45  0.86  0.98   ×10.4
PQT† (CPU)       5.74   0.98 (exact re-ranking)  ×8.9
PQT (GPU)        0.02   0.51  0.83  0.86   ×2555
GPU brute force  23.7   1     1     1      ×2.2
Table 1: Performance on the SIFT1M dataset using different methods. Reported query times include query + re-ranking times. The GPU implementation re-ranks the first vectors from the proposed bins; the reported CPU performance is based on a larger number of proposed bins. Speedup is reported relative to the slowest method. PQT† is PQT with additional exact re-ranking. (*) indicates that the timing was reported by the authors. R@n means the correct vector is within the first n vectors returned by the algorithm.
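The bin layout of Sec. 4 (one flat array of vector IDs plus a pointer array built from a histogram and a prefix sum) can be sketched on the CPU as follows; the CUDA version parallelizes the histogram and the scan, and `build_bin_layout` is our own illustrative name:

```python
import numpy as np

def build_bin_layout(bin_ids, n_bins):
    # Histogram of vectors per bin, then an exclusive prefix sum gives
    # the pointer array; a stable sort groups vector ids by bin.
    counts = np.bincount(bin_ids, minlength=n_bins)
    starts = np.concatenate(([0], np.cumsum(counts)))  # pointer array
    order = np.argsort(bin_ids, kind="stable")         # vector ids by bin
    return order, starts

bin_ids = np.array([2, 0, 2, 1, 0, 2])        # toy assignment of 6 vectors
order, starts = build_bin_layout(bin_ids, 4)  # bin 3 stays empty
# the vectors of bin b are order[starts[b] : starts[b + 1]]
```

Empty bins simply get zero-length ranges, which matches the paper's observation that hash collisions between mostly-empty bins can be ignored.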
5. Results
We now present the results of the PQT evaluated on several standard benchmark sets. All reported CPU query times were obtained from a single-threaded C++ implementation using SSE2 instructions. Results of our GPU implementation are obtained with an NVIDIA GTX Titan X. We use the publicly available SIFT1M and SIFT1B benchmark datasets [10].

We compared our implementation with the available implementations of [11] and [4]. Due to the approximate nature of these algorithms and the discrete parameter space, it is not trivial to find parameter settings which produce the same accuracy for timing comparisons. Therefore, we chose a highly optimized GPU-based exhaustive search as a strong baseline method. Accuracy is measured as recall R@x, the fraction of nearest neighbors found among the first x proposed vectors after re-ranking.

Table 1 gives the average query time in milliseconds obtained on the same machine using publicly available code. Compared to all PQ-based approaches [4, 11], our approach (P_line = 32) is faster on the CPU at similar accuracy. Allowing [11] to use more memory for re-ranking slows down the query process. Note that the reported time of [11] excludes expensive operations like the multiplication of the query vector with a D × D rotation matrix, which was pre-computed.

method     avg. (ms)  R@100
SH [5]     22.7       0.132
IVFADC     65.9       0.744
FLANN      not possible
PQT (CPU)  63         0.83

Table 2: Performance on the GIST1M dataset using different methods. PQT uses P_line parts for re-ranking.

Figure 6: Recall rates on the SIFT1B data set (P = 4, k1 = 32, k2 = 16, w = 8) with ordering of bins. The recall of PQT is reported without a re-ranking step. Even with significantly lower query time, our approach is comparable in quality to the inverted multi-index.

On the GPU, sorting the SIFT1M vectors into the bins takes 1051 ms; performing the line quantization for these 1M vectors takes about 458 ms (P = 4, k1 = 16, k2 = 8, w = 8). The processing time for one query is roughly 39 microseconds, split into tree traversal, bin selection, vector proposal, and re-ranking. In our implementation, the maximum number of sortable vectors on the GPU per query is currently limited to 4096 during re-ranking. With different sorting algorithms, this restriction could be removed.

With the right configuration of bins, high recall values can be achieved even on the SIFT1B data set (Figure 6). Because the full data set did not fit on the GPU, the database was built in waves of 1M vectors, aggregating the information on the CPU. With file I/O this took about 144 min. On an NVIDIA GTX Titan X with 12 GB of RAM, one can upload the resulting DB structure, i.e., bin sizes and vector IDs per bin. For the SIFT1B dataset it was essential to re-sort the proposed bins by the actual distance, which slowed down the query time without re-ranking. The recall rate of our approach is R@10 = 0.35. For the re-ranking, we directly accessed the CPU main memory from the GPU, resulting in a total query time of 0.15 ms.

Figure 7: Line quantization. The different curves show the recall on the SIFT1M dataset for varying values of P_line, using P = 2, k1 = 16, k2 = 8, w = 4. Query times on a single CPU grow with P_line (cf. Table 3).

P_line  min distor.  max distor.  avg. distor.  time (ms)
2       10874.9      179870       30534.6       2.0
4       8967.8       166722       26257.9       2.6
8       6709.2       145082       19719.4       3.6
16      3318.3       84640        10509.7       5.3
32      1035.3       39143        3686.71       8.5

Table 3: Squared line-quantization error (distortion δ) obtained by projecting each x ∈ X onto a line, using P = 2, k1 = 16, k2 = 8, w = 4 on the SIFT1M data set. The last column gives the average query time.

Memory is the limiting factor for the maximum number of actual bins. We apply hashing to 100M bins. Increasing the number of parts P or introducing a further level into the tree would further boost the number of bins; at the same time, the number of bins to be visited in the vicinity would drastically increase and slow down the system.

We tested the performance of encoding each database vector x ∈ X by its projection onto a line for different numbers of parts used for line quantization (see Figure 7 and Table 3). The recall rate increases with the number of line parts P_line. Low quantization errors with moderate computational and storage effort are obtained with P_line = 16. Note that the necessary data for each query vector is directly assembled during the tree traversal, without the need for any further quantization computation. See the supplementary material for re-ranking results on MNIST (784 dimensions).
6. Conclusion
We have introduced a new method for efficient similarity search on large, high-dimensional datasets. We propose a two-level Product Quantization Tree for quickly indexing large numbers of bins with minimal memory and computation overhead, and combine it with a novel re-ranking method based on closest-line projections and a bin-ordering heuristic. The tree structure provides all intermediate values needed to accelerate the re-ranking procedure.

Our prototype implementation demonstrates improvements in accuracy and speed over state-of-the-art methods for ANN queries. We showed performance competitive with FLANN [16] on small benchmark sets (SIFT1M), and outperform state-of-the-art methods on the challenging BigANN benchmark containing one billion vectors of dimension 128, as well as on data sets of high dimensionality (GIST1M). By construction, the proposed approach can easily be implemented on the GPU, which yields a significant speedup over previous methods.

While our method worked well on the examples and data sets we tried, there are many avenues for future research. For example, other tree structures featuring different combinations of PQ and VQ, or combinations of these methods with different approaches such as KD-trees or LSH, would be an interesting area of future research. Additionally, performing efficient on-the-fly updates to the database vectors and the resulting tree structure would be another area for future work.
Acknowledgement
This work has been partially supported by the DFG Emmy Noether fellowship Le 1341/1-1 and an NVIDIA hardware grant.
References

[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117, Jan. 2008.
[2] A. Babenko and V. Lempitsky. Additive quantization for extreme vector compression. June 2014.
[3] A. Babenko and V. Lempitsky. Tree quantization for large-scale similarity search and classification. June 2015.
[4] A. Babenko and V. S. Lempitsky. The inverted multi-index. In CVPR, pages 3069–3076. IEEE, 2012.
[5] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG '04, pages 253–262, New York, NY, USA, 2004. ACM.
[6] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw., 3(3):209–226, Sept. 1977.
[7] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization for approximate nearest neighbor search. In CVPR 2013. IEEE Computer Society, June 2013.
[8] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, pages 604–613, New York, NY, USA, 1998. ACM.
[9] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128, 2011.
[10] H. Jégou, R. Tavenard, M. Douze, and L. Amsaleg. Searching in one billion vectors: re-rank with source coding. CoRR, abs/1102.3828, 2011.
[11] Y. Kalantidis and Y. Avrithis. Locally optimized product quantization for approximate nearest neighbor search. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, Ohio, June 2014. IEEE.
[12] D. S. Kim and N. B. Shroff. Quantization based on a novel sample-adaptive product quantizer (SAPQ). IEEE Transactions on Information Theory, 45(7):2306–2320, 1999.
[13] S. Korman and S. Avidan. Coherency sensitive hashing. Pages 1607–1614, Nov. 2011.
[14] Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantizer design. IEEE Transactions on Communications, 28:84–95, 1980.
[15] S. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theory, 28(2):129–137, 1982.
[16] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36, 2014.
[17] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, GH '07, pages 97–106, Aire-la-Ville, Switzerland, 2007. Eurographics Association.
[18] Y. Tsai, M. Steinberger, D. Pajak, and K. Pulli. Fast ANN for high-quality collaborative filtering. In