Vector and Line Quantization for Billion-scale Similarity Search on GPUs
Wei Chen a, Jincai Chen a,b,∗, Fuhao Zou c,∗, Yuan-Fang Li d, Ping Lu a,b, Qiang Wang a, Wei Zhao b

a Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China
b Key Laboratory of Information Storage System of Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
c School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
d Faculty of Information Technology, Monash University, Clayton 3800, Australia
Abstract
Billion-scale high-dimensional approximate nearest neighbour (ANN) search has become an important problem for searching similar objects among the vast amount of images and videos available online. Existing ANN methods are usually characterized by their specific indexing structures, including the inverted index and the inverted multi-index structure. The inverted index structure is amenable to GPU-based implementations, and state-of-the-art systems such as Faiss are able to exploit the massive parallelism offered by GPUs. However, the inverted index requires high memory overhead to index the dataset effectively. The inverted multi-index structure is difficult to implement on GPUs, and is also ineffective in dealing with databases with different data distributions. In this paper we propose a novel hierarchical inverted index structure generated by vector and line quantization methods. Our quantization method improves both search efficiency and accuracy, while maintaining comparable memory consumption. This is achieved by reducing the search space and increasing the number of indexed regions.

We introduce a new ANN search system, VLQ-ADC, that is based on the proposed inverted index, and perform extensive evaluation on two public billion-scale benchmark datasets, SIFT1B and DEEP1B. Our evaluation shows that VLQ-ADC significantly outperforms the state-of-the-art GPU- and CPU-based systems in terms of both accuracy and search speed. The source code of VLQ-ADC is available at https://github.com/zjuchenwei/vector-line-quantization.

∗ Corresponding author
Email addresses: [email protected] (Jincai Chen), [email protected] (Fuhao Zou)
Preprint submitted to Future Generation Computer Systems, April 19, 2019

Keywords:
Quantization; Billion-scale similarity search; High-dimensional data; Inverted index; GPU
1. Introduction
In the age of the Internet, the amount of images and videos available online increases incredibly fast and has grown to an unprecedented scale. Google processes over 40,000 queries per second, and handles more than 400 hours of YouTube video uploads every minute [1]. Every day, more than 100 million photos/videos are uploaded to Instagram, more than 300 million are uploaded to Facebook, and a total of 50 billion photos have been shared to Instagram. As a result, scalable and efficient search for similar images and videos at the billion scale has become an important problem, and it has been under intense investigation.

As online images and videos are unstructured and usually unlabeled, it is hard to compare them directly. A feasible solution is to represent images and videos by real-valued, high-dimensional vectors, and to compare the distances between the vectors to find the nearest ones. Due to the curse of dimensionality [2], it is impractical for multimedia applications to perform exhaustive search in billion-scale datasets. Thus, as an alternative, many approximate nearest neighbor (ANN) search algorithms are now employed to tackle the billion-scale search problem for high-dimensional data. Recent best-performing billion-scale retrieval systems [3–8] typically rely on two main processes: indexing and encoding.

To avoid expensive exhaustive search, these systems use index structures that partition the dataset space into a large number of disjoint regions, and the search process only collects points from the regions that are closest to the query point. The collected points then form a short list of candidates. To reduce memory consumption, the dataset points are encoded into a compressed representation. Encoding has also proven to be critical for memory-limited devices such as GPUs, which excel at handling data-parallel tasks. A high-performance CPU like the Intel Xeon Platinum 8180 (2.5 GHz, 28 cores) delivers 1.12 TFLOP/s of single-precision peak performance.
In contrast, GPUs like the NVIDIA Tesla P100 can provide up to 10 TFLOP/s of single-precision peak performance, and are good choices for high-performance similarity search systems. Many encoding methods have been proposed, including hashing methods and quantization methods. Hashing methods encode data points into compact binary codes through a hash function [9, 10], while quantization methods, typically product quantization (PQ), map data points to a set of centroids and use the indices of the centroids to encode the data points [11, 12]. With hashing methods, the distance between two data points can be approximated by the Hamming distance between their binary codes. With quantization methods, the Euclidean distance between the query and the compressed points can be computed efficiently. It has been shown in the literature that quantization encoding can be more accurate than various hashing methods [11, 13, 14].

Jégou et al. [11] first introduced an index structure that is able to handle billion-scale datasets efficiently. It is based on the inverted index structure that partitions the high-dimensional vector space into Voronoi regions for a set of centroids obtained by a quantization method called vector quantization (VQ) [15]. This system, called IVFADC, achieves reasonable recall rates in several tens of milliseconds. However, the VQ-based index structure needs to store a large set of full-dimensional centroids to produce a huge number of regions, which requires a large amount of memory.

An improved inverted index structure called the inverted multi-index (IMI) was later proposed by Babenko and Lempitsky [16]. The IMI is based on product quantization (PQ), which divides the point space into several orthogonal subspaces and clusters the subspaces into Voronoi regions independently.

https://images.nvidia.com/content/tesla/pdf/nvidia-tesla-p100-PCIe-datasheet.pdf

Figure 1: Three different quantization methods. Vector and product quantization are both shown with k = 64 clusters. The red dots in plots (a) and (b) denote the centroids and the grey dots denote the dataset points. Vector quantization (a) maps the dataset points to the closest centroids. Product quantization (b) performs clustering in each subspace independently (here the two axes). In plot (c), a 2-dimensional point x (red dot) is projected on the line l(c_i, c_j) with the anchor point q_l(x) (black dot). Here a, b and c denote the values of ‖x − c_i‖, ‖x − c_j‖ and ‖c_i − c_j‖ respectively, and the parameter λ represents the value of ‖c_i − q_l(x)‖ / c. The anchor point q_l(x) can be represented by c_i, c_j and λ, and the distance from x to l(c_i, c_j) can be calculated from a, b, c and λ.
2. Related work
In this section, we briefly introduce some quantization methods and several retrieval systems related to our approach. Table 1 summarizes the common notations used throughout this paper. For example, we assume that X = {x_1, . . . , x_N} ⊂ R^D is a finite set of N data points of dimension D.

Table 1: Commonly used notations.

Notation | Description
x_i, D | data points and their dimension
X, N | a set of data points and its size, X = {x_1, . . . , x_N} ⊂ R^D
c, s, l(c, s) | centroids, nodes and edges
m | encoding length
k | the number of first-level centroids
n | the number of edges of each first-level centroid
w | the number of first-level nearest regions for a query
α | the portion of the nearest of the w · n second-level regions
w′ | the number of second-level nearest regions for a query, w′ = α · w · n
λ | a scalar parameter for line quantization
r | displacement from data points to the approximate points

In vector quantization [15] (Figure 1a), a quantizer is a function q_v that maps a D-dimensional vector x to a vector q_v(x) ∈ C, where C is a finite subset of R^D of k vectors. Each vector c ∈ C is called a centroid, and C is a codebook of size k. We can use Lloyd iterations [19] to efficiently obtain a codebook C on a subset of the dataset. For a finite dataset X, q_v(x) induces a quantization error E:

E = Σ_{x∈X} ‖x − q_v(x)‖².    (1)

According to Lloyd's first condition, to minimize the quantization error a quantizer should map each vector x to its nearest codebook centroid:

q_v(x) = arg min_{c∈C} ‖x − c‖.    (2)

Hence, the set of points X_i = {x ∈ R^D | q_v(x) = c_i} is called a cluster or a region for centroid c_i.

The inverted index structure based on VQ [11] can split the dataset space into k regions that correspond to the k centroids of the codebook. Since the ratio of regions to centroids is 1:1, it requires a large amount of space to store the D-dimensional centroids when k is large.
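A minimal sketch of the quantizer of Equations 1 and 2, with an illustrative (untrained) 2-D codebook and a handful of toy points:

```python
# Vector quantization: map each point to its nearest codebook centroid
# (Equation 2) and accumulate the quantization error of Equation 1.
# The codebook here is illustrative, not a trained one.

def sq_dist(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def q_v(x, codebook):
    """Nearest-centroid quantizer: returns (region index, centroid)."""
    i = min(range(len(codebook)), key=lambda j: sq_dist(x, codebook[j]))
    return i, codebook[i]

codebook = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
points = [(1.0, 1.0), (9.0, 1.0), (2.0, 8.0)]

E = 0.0
regions = {}
for x in points:
    i, c = q_v(x, codebook)
    regions.setdefault(i, []).append(x)  # inverted list for region i
    E += sq_dist(x, c)                   # squared error, as in Equation 1

print(sorted(regions))  # → [0, 1, 2]
print(E)                # → 12.0 (2 + 2 + 8)
```

The `regions` dictionary plays the role of the inverted lists: one list of points per centroid, which is exactly the structure the VQ-based inverted index maintains.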
Such a large codebook has a negative effect on the performance of the retrieval system. Our hierarchical index structure based on VLQ increases the ratio by a factor of n, i.e., n times more regions can be generated by our indexing structure with the same number of centroids as the VQ-based indexing structure.

Product quantization (Figure 1b) is an extension of vector quantization. Assuming that the dimension D is a multiple of m, any vector x ∈ R^D can be regarded as a concatenation (x^1, · · · , x^m) of m sub-vectors, each of dimension D/m. Suppose that C^1, · · · , C^m are m codebooks of the subspace R^{D/m}, each owning k D/m-dimensional sub-centroids. The codebook of a product quantizer q_p is thus the Cartesian product of the sub-codebooks:

C = C^1 × · · · × C^m.    (3)

Hence the codebook C contains a total of k^m centroids, each of the form c = (c^1, · · · , c^m), where each sub-centroid c^i ∈ C^i for i ∈ M = {1, · · · , m}. A product quantizer q_p should minimize the quantization error E defined in Formula 1. Hence, for x ∈ R^D, the nearest centroid in codebook C is

q_p(x) = (q_p^1(x^1), · · · , q_p^m(x^m)),    (4)

where q_p^i is a sub-quantizer of q_p and q_p^i(x^i) is the nearest sub-centroid for sub-vector x^i, i.e., the nearest centroid q_p(x) for x is the concatenation of the nearest sub-centroids for the sub-vectors x^i.

The inverted multi-index structure (IMI) applies the idea of PQ to indexing and can generate k^m regions with m codebooks of k sub-centroids each. The benefit of the inverted multi-index is thus that it can easily generate a much larger number of regions than the VQ-based inverted index structure with moderate values of m and k.
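The sub-vector machinery of Equations 3 and 4 can be sketched in a few lines. The sub-codebooks below are made up for illustration (m = 2 subspaces of R², k = 3 sub-centroids each, so k^m = 9 implicit full-dimensional centroids without ever storing them):

```python
# Product quantization with m = 2 sub-codebooks over R^4 = R^2 x R^2.
# Sub-codebooks are illustrative, not trained.

def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

sub_codebooks = [
    [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)],  # C^1 for the first two dims
    [(1.0, 1.0), (4.0, 0.0), (0.0, 4.0)],  # C^2 for the last two dims
]

def pq_encode(x):
    """q_p(x): index of the nearest sub-centroid in each subspace (Eq. 4)."""
    halves = [x[:2], x[2:]]
    return tuple(
        min(range(len(cb)), key=lambda j: sq_dist(h, cb[j]))
        for h, cb in zip(halves, sub_codebooks)
    )

def pq_decode(code):
    """Concatenate the chosen sub-centroids back into a full vector."""
    out = []
    for j, cb in zip(code, sub_codebooks):
        out.extend(cb[j])
    return tuple(out)

x = (0.2, 4.8, 3.9, 0.3)
code = pq_encode(x)
print(code)             # → (2, 1): nearest sub-centroid per subspace
print(pq_decode(code))  # → (0.0, 5.0, 4.0, 0.0)
```

Only the two small sub-codebooks are stored, yet the pair of indices addresses one of the 9 combined centroids of the Cartesian product of Equation 3.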
The drawback of the IMI is that it produces a lot of empty regions when the distributions of the subspaces are not independent [5]. This affects the system's performance when handling datasets that have significant correlations between different subspaces, such as datasets of CNN-produced feature points [5].

The PQ-based indexing structure has later been improved by OPQ [20] and LOPQ [12]. OPQ rotates the dataset points by a global D × D rotation matrix, and LOPQ rotates the points belonging to the same cell by a shared local D × D rotation matrix, in order to minimize the correlations between the two subspaces [20]. OPQ and LOPQ can both improve the indexing efficiency of PQ, but they also slow down the query speed by a large margin.

Additionally, PQ can also be used to compress datasets. Typically each sub-codebook of PQ contains 256 sub-centroids and each vector x is mapped to a concatenation of m sub-centroids (c^1_{j_1}, · · · , c^m_{j_m}), where each j_i is a value between 1 and 256. Hence the vector x can be encoded into an m-byte code of sub-centroid indices (j_1, · · · , j_m). With the approximate representation by PQ, the Euclidean distances between the query vector and a large number of compressed vectors can be computed efficiently. According to the ADC procedure [11], the computation is performed using lookup tables:

‖y − x‖² ≈ ‖y − q_p(x)‖² = Σ_{i=1}^{m} ‖y^i − c^i_{j_i}‖²,    (5)

where y^i is the i-th sub-vector of a query y. The Euclidean distances between the query sub-vector y^i and each sub-centroid c^i_{j_i} can be precomputed and stored in lookup tables, which reduces the complexity of the distance computation from O(D) to O(m). Due to its high compression quality and efficient distance computation, PQ is considered the top choice for compact representation of large-scale datasets [3, 7, 12, 14, 20].

Line quantization (LQ) [4] owns a codebook C of k centroids like VQ.
As shown in Figure 1(c), given any two different centroids c_i, c_j ∈ C, a line is formed and denoted by l(c_i, c_j). A line quantizer q_l quantizes a point x to the nearest line as follows:

q_l(x) = arg min_{l(c_i, c_j)} d(x, l(c_i, c_j)),    (6)

where d(x, l(c_i, c_j)) is the Euclidean distance from x to the line l(c_i, c_j), and the set X_{i,j} = {x ∈ R^D | q_l(x) = l(c_i, c_j)} is called a cluster or a region for the line l(c_i, c_j). The squared distance d²(x, l(c_i, c_j)) can be calculated as follows:

d²(x, l(c_i, c_j)) = (1 − λ)‖x − c_i‖² + (λ² − λ)‖c_j − c_i‖² + λ‖x − c_j‖².    (7)

Because the values of ‖x − c_j‖², ‖x − c_i‖² and ‖c_j − c_i‖² can be precomputed between x and all centroids, Equation 7 can be evaluated efficiently. The anchor point of x is represented by (1 − λ) · c_i + λ · c_j, where λ is a scalar parameter that can be computed as follows:

λ = 0.5 · (‖x − c_i‖² + ‖c_j − c_i‖² − ‖x − c_j‖²) / ‖c_j − c_i‖².    (8)

When x is quantized to a region of l(c_i, c_j), the displacement of x from l(c_i, c_j) can be computed as follows:

r_{q_l}(x) = x − ((1 − λ) · c_i + λ · c_j).    (9)

Here we regard l(c_i, c_j) and l(c_j, c_i) as two different lines, so an LQ-based indexing structure can produce k · (k − 1) regions with a codebook of k centroids. The benefit of an LQ-based indexing structure is that it can produce many more regions than a VQ-based one. However, it is considerably more expensive to find the nearest line for a point x when k is large, so we use LQ as an indexing approach with a codebook of only a few lines.

Table 2: A summary of current state-of-the-art retrieval systems based on quantization methods. N is the size of the dataset X, m is the number of sub-vectors in product quantization (PQ), k is the size of the codebook, and n is the number of second-level regions. In the last column of each row, the first term is the complexity for encoding, and the second term is the complexity for indexing.

System | Index structure | Encoding | CPU/GPU | Space complexity
Faiss [6] | VQ | PQ | GPU | O(N · m) + O(k · D)
Ivf-hnsw [7] | 2-level VQ | PQ | CPU | O(N · m) + O(k · (D + n))
Multi-D-ADC [16] | IMI (PQ) | PQ | CPU | O(N · m) + O(k · (D + k))
VLQ-ADC (our system) | VLQ | PQ | GPU | O(N · m) + O(k · (D + n))

In this subsection we introduce several billion-scale similarity retrieval systems that apply VQ- or PQ-based indexing structures and encode with PQ, and discuss their strengths and weaknesses. All the systems discussed below are best-performing, state-of-the-art systems for billion-scale high-dimensional ANN search. Their indexing structures and encoding methods are summarized in Table 2. Since all these systems employ the same encoding method based on PQ, we mainly focus on their indexing structures in the discussion below.

Faiss [6] is a very efficient GPU-based retrieval approach that realizes the idea of IVFADC [11] on GPUs. Faiss uses the inverted index based on VQ [21] for non-exhaustive search and compresses the dataset by PQ. The inverted index of IVFADC owns a vector quantizer q_v with a codebook of k centroids. Thus there are k regions for the data space.
Each point x ∈ X is quantized to a region corresponding to a centroid by the VQ quantizer q_v. The displacement of each point from the centroid of the region it belongs to is defined as

r_{q_v}(x) = x − q_v(x),    (10)

where the displacement r_{q_v}(x) is encoded by PQ with m codebooks shared by all regions. For each region, an inverted list of data points is maintained, along with the PQ-encoded displacements.

The search process of Faiss/IVFADC proceeds as follows:
1. A query point y is quantized to its w nearest regions, extracting a list of candidates L_c ⊂ X which have a high probability of containing the nearest neighbor.
2. The displacement of the query point y from the centroid of each region is computed as r_{q_v}(y).
3. The distances between r_{q_v}(y) and the PQ-encoded displacements in L_c are then computed according to Formula 5.
4. The list L_c is sorted into L_s based on the distances computed above. The first points in L_s are returned as the search result for query point y.

Ivf-hnsw [7] is a retrieval system based on a two-level inverted index structure. Ivf-hnsw first splits the data space into k regions like IVFADC. Then each region is further split into several sub-regions that correspond to n sub-centroids. Each sub-centroid of a region can be represented by the centroid of the region and a centroid of a neighboring region. Assuming that each region has n neighboring regions, each region can thus be split into n sub-regions. Each data point is first quantized to a region and then further quantized to a sub-region of that region. The displacement of each point from the sub-centroid of the sub-region it belongs to is encoded by PQ. An inverted list of data points is maintained for each sub-region, similar to IVFADC.

The search process of Ivf-hnsw proceeds as follows:
1. A query point y is quantized to its w nearest first-level regions, giving w · n sub-regions.
2. Among the w · n sub-regions, y is then quantized to a fixed fraction of the nearest sub-regions, generating a list of candidates L_c ⊂ X.
3. The displacement of the query point y from the sub-centroid of each sub-region is computed as r_q(y).
4. The distances between r_q(y) and the PQ-encoded displacements in L_c are then computed according to Formula 5.
5. The re-ordering process of Ivf-hnsw is similar to IVFADC/Faiss.

Multi-D-ADC [16] is based on the inverted multi-index, which is currently the state-of-the-art indexing method for high-dimensional large-scale datasets. An inverted multi-index of Multi-D-ADC usually owns a product quantizer with two sub-quantizers q¹, q² for the subspace R^{D/2}, each with k sub-centroids. A region in the D-dimensional space is now a Cartesian product of two corresponding subspace regions, so the IMI can produce k² regions. For each point x = (x¹, x²) ∈ X, the sub-vectors x¹, x² ∈ R^{D/2} are separately quantized to the subspace regions of q¹(x¹), q²(x²) respectively, and x is then quantized to the region of (q¹(x¹), q²(x²)). The displacement of each point x from the centroid (q¹(x¹), q²(x²)) is also encoded by PQ, and an inverted list of points is again maintained for each region.

The search process of Multi-D-ADC proceeds as follows:
1. For a query point y = (y¹, y²), the Euclidean distances of the sub-vectors y¹, y² to all sub-centroids of q¹, q² are computed respectively. The distance of y to a region can then be computed according to Formula 5 with m = 2.
2. Regions are traversed in ascending order of distance to y by the multi-sequence algorithm [16] to generate a list of candidates L_c ⊂ X.
3. The displacement of the query point y from the centroid (c¹, c²) of each region is computed as r_q(y) as well.
4. The re-ordering process of Multi-D-ADC is similar to IVFADC/Faiss.

The VQ-based indexing structure requires a large full-dimensional codebook to produce many regions when k is large. The PQ-based indexing structure is not suitable for all datasets, especially for those produced by convolutional neural networks (CNN) [7].
The novel VQ-based indexing structure proposed by Ivf-hnsw can produce more regions than the prior VQ-based indexing structure. However, its performance with a small codebook is not good enough; we discuss this in Sec. 5. In comparison, our indexing structure is efficient with a small codebook, which accelerates query speed, and at the same time it is suitable for any dataset irrespective of the presence or absence of correlations between subspaces.
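As a concrete illustration of the line-quantization primitives of Equations 6–9, the sketch below computes λ, the squared point-to-line distance, and the displacement for a toy 2-D configuration (the centroids and the point are chosen arbitrarily):

```python
# Line quantization (Equations 6-9) on a toy 2-D configuration.
# lam is the scalar of Equation 8; the anchor point is (1-lam)*c_i + lam*c_j.

def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

def lq_lambda(x, ci, cj):
    a2, b2, c2 = sq_dist(x, ci), sq_dist(x, cj), sq_dist(ci, cj)
    return 0.5 * (a2 + c2 - b2) / c2  # Equation 8

def lq_sq_dist(x, ci, cj):
    lam = lq_lambda(x, ci, cj)
    a2, b2, c2 = sq_dist(x, ci), sq_dist(x, cj), sq_dist(ci, cj)
    return (1 - lam) * a2 + (lam * lam - lam) * c2 + lam * b2  # Equation 7

def lq_displacement(x, ci, cj):
    lam = lq_lambda(x, ci, cj)
    anchor = tuple((1 - lam) * a + lam * b for a, b in zip(ci, cj))
    return tuple(xa - aa for xa, aa in zip(x, anchor))  # Equation 9

ci, cj, x = (0.0, 0.0), (4.0, 0.0), (1.0, 2.0)
lam = lq_lambda(x, ci, cj)      # projection falls at (1, 0), so lam = 0.25
d2 = lq_sq_dist(x, ci, cj)      # squared distance to the line = 4.0
r = lq_displacement(x, ci, cj)  # x minus its anchor point = (0.0, 2.0)
assert abs(lam - 0.25) < 1e-12 and abs(d2 - 4.0) < 1e-12
```

Note that only squared point-to-centroid and centroid-to-centroid distances enter Equations 7 and 8, which is exactly why the quantizer can reuse precomputed tables instead of touching the full-dimensional vectors.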
3. The VLQ-ADC System
In this section we introduce our GPU-based similarity retrieval system, VLQ-ADC, which combines a two-level hierarchical indexing structure based on vector and line quantization with an asymmetric distance computation method. VLQ-ADC incorporates a novel index structure that can index the dataset points efficiently (Sec. 3.1). The indexing and encoding processes are presented in Sec. 3.2, and the querying process is discussed in Sec. 3.3. Compared with the existing systems above, one major advantage of our system is that our indexing structure generates a shorter and more accurate candidate list for the query point, which accelerates query speed by a large margin. Another advantage is that the improved asymmetric distance computation method based on the PQ encoding provides higher search accuracy. In the remainder of this section we use Figure 2 to illustrate our framework. We recall that commonly used notations are summarized in Table 1.
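The memory footprint of the two-level index discussed in Sec. 3.1 can be estimated with a short calculation. The parameters below (k = 2^16 centroids, n = 32 edges per centroid, D = 128, 4-byte ints and floats) are the typical values used in that section; treat them as an assumed configuration rather than a fixed requirement of the system.

```python
# Back-of-the-envelope memory footprint of the two-level index, under an
# assumed typical configuration (k = 2**16, n = 32, D = 128, 4-byte
# ints/floats), chosen to match the estimates given in Sec. 3.1.

k, n, D, sizeof_int, sizeof_float = 2 ** 16, 32, 128, 4, 4

# First-level codebook: k full-dimensional centroids (same as plain VQ).
first_level = k * D * sizeof_float                  # 32 MiB

# Second level, VLQ: per centroid, n neighbour ids + n distances.
graph = k * n * (sizeof_int + sizeof_float)         # 16 MiB

# Second level if VQ were used again: k*n full-dimensional sub-centroids.
vq_alternative = k * n * D * sizeof_float           # 1024 MiB

print(graph // 2 ** 20)            # → 16 (MiB)
print(vq_alternative // 2 ** 20)   # → 1024 (MiB)
print(vq_alternative // graph)     # → 64: the 1/64 ratio of Sec. 3.1
```

The second-level graph stores only indices and scalar distances, never full-dimensional points, which is where the factor of 64 over a naive two-level VQ comes from.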
For billion-scale datasets with a moderate number of regions (e.g., 2^16) produced by vector quantization (VQ), the number of data points in most regions is too large, which negatively affects search accuracy. To alleviate this problem, we propose a hierarchical indexing structure in which each inverted list is split into several shorter lists, i.e., each region is divided into several subregions, using line quantization (LQ).

Our indexing structure is a two-layer hierarchy that consists of two levels of quantizers. The first level contains a vector quantizer q_v with a codebook of k centroids, which partitions the data point space X into k regions. The second level contains a line quantizer q_l with an n-nearest-neighbor (n-NN) graph. The n-NN graph is a directed graph in which the nodes are the first-level centroids and edges connect each centroid to its n nearest neighbors. In each first-level region, the line quantizer q_l then quantizes each data point to the closest edge in the n-NN graph, thus splitting the region into n second-level regions.

As an example, in the right side of Figure 2, given n = 4, the top left first-level region is further divided into 4 subregions by q_l, enclosed by solid lines and denoted 1, 2, 3 and 4. Each subregion contains all the data points that are closest to a given edge of the n-NN graph, as calculated by the line quantizer q_l.

Figure 2: A comparison of the indexing structure and search process of the VQ-based indexing structure (left) and our VLQ-based indexing structure (right) on data points (small blue dots) of dimension 2 (D = 2). The large red dots denote the same (first-level) centroids in both figures. Left: the 4 shaded areas represent the first-level regions, one for each centroid, and they make up the areas that need to be traversed for the query point q. Right: for each centroid, n = 4 nearest neighboring centroids are found; the n-NN graph thus consists of all the centroids and the edges (thick dashed lines) between them. Each first-level region consists of 4 second-level regions, each of which represents the data points closest to the corresponding edge in the n-NN graph, as determined by the line quantizer q_l. Given the query point q and the parameter α = 0.5, only half of the second-level subregions (shaded in blue) need to be traversed. As can be seen, VLQ allows search to process substantially smaller regions of the dataset than a VQ-based approach.

Training the codebook. We use Lloyd iterations in the fashion of the Linde-Buzo-Gray algorithm [15] to obtain the codebook of the VQ quantizer q_v. The n-NN graph is then built on the centroids of the codebook.

Memory overhead of the indexing structure. One advantage of our indexing structure is its ability to produce substantially more subregions with little additional memory consumption. As in VQ, our first-layer codebook needs k · D · sizeof(float) bytes. In addition, for second-level indexing, for each of the k first-layer centroids the n-NN graph only needs to store (1) the indices of its n nearest neighbors and (2) the distances to its n nearest neighbors, which amounts to k · n · (sizeof(int) + sizeof(float)) bytes. Note that we do not need to store any full-dimensional points. For typical values of k = 2^16 centroids and n = 32 subcentroids, the additional memory overhead for storing the graph is 2^16 · 32 · (32 + 32) bits (16 MB), which is acceptable for billion-scale datasets.

One way to produce the subregions is to apply vector quantization (VQ) again in each region. However, that would require storing full-dimensional subcentroids and thus consume too much additional memory. For the same configuration (k = 2^16 centroids and n = 32 subcentroids) and a dimension of D = 128, the additional memory overhead for a VQ-based hierarchical indexing structure would be 2^16 · 32 · 128 · sizeof(float) bytes (1,024 MB). As can be seen, our VLQ-based hierarchical indexing structure is substantially more compact, consuming only 1/64 of the memory required by a VQ-based approach for the second-level codebook.

We note that the PQ-based indexing structure requires O(k · (D + k)) memory to maintain the indexing structure (Table 2), which is memory-inefficient as it is quadratic in k. This is a limitation of the PQ-based indexing structure. In contrast, the space complexity of our hierarchical indexing structure is O(k · (D + n)), where typically n ≪ k (n is much smaller than k), hence making our index much more memory-efficient.

Algorithm 1 VLQ-ADC batch indexing process
function Index([x_1, . . . , x_N])
  for t ← 1 to N do
    x_t ↦ q_v(x_t) = arg min_{c ∈ C} ‖x_t − c‖                         // VQ
    S_i = n-arg min_{c ∈ C} ‖c − c_i‖                                   // construct the n-NN graph
    x_t ↦ q_l(x_t) = arg min_{l(c_i, s_ij), s_ij ∈ S_i} d(x_t, l(c_i, s_ij))   // LQ
  end for
end function

In this subsection, we describe the indexing and encoding processes, summarized in Algorithms 1 and 2 respectively. For our two-level index structure, the indexing process comprises two different quantization procedures, one for each layer. Similar to the IVFADC scheme, each dataset point is quantized by the vector quantizer q_v to the first-level regions surrounded by the dotted lines in Figure 2. These regions form a set of inverted lists as search candidates.

Algorithm 2 VLQ-ADC batch encoding process
function Encode([x_1, . . . , x_N])
  for t ← 1 to N do
    r_{q_l}(x_t) = x_t − ((1 − λ_ij) · c_i + λ_ij · s_ij)   // Equation 12
    let r_t = r_{q_l}(x_t)                                   // displacement
    r_t = [r_t^1, . . . , r_t^m]                             // divide r_t into m subvectors
    for p ← 1 to m do
      r_t^p ↦ c^p_{j_p} = arg min_{c^p ∈ C^p} ‖r_t^p − c^p‖
    end for
    Code_t = (j_1, . . . , j_m)
  end for
end function

We describe the second-level indexing process as follows. Let X_i be a region of points {x_1, . . . , x_l} that corresponds to a centroid c_i, for i ∈ {1, . . . , k}. In constructing the n-NN graph, let S_i = {s_i1, . . . , s_in} denote the set of the n centroids closest to c_i, and let l(c_i, s_ij) denote an edge between c_i and s_ij, for j ∈ {1, . . . , n}. The points in X_i are quantized to subregions by a line quantizer q_l with a codebook E_i of n edges {l(c_i, s_i1), . . . , l(c_i, s_in)}. Thus the region X_i is split into n subregions {X_i1, . . . , X_in} and each point x ∈ X_i is quantized to a second-level subregion X_ij. So the entire space X is divided into k × n second-level subregions.
X_ij = {x ∈ X_i | q_l(x) = l(c_i, s_ij)}, for all i ∈ {1, . . . , k}, j ∈ {1, . . . , n}.    (11)

Each data point in the dataset X is assigned to one of the k · n cells. When a data point x is quantized to the subregion of edge l(c_i, s_ij), then according to Equations 9 and 8 the displacement of x from the corresponding anchor point can be computed as follows:

r_{q_l}(x) = x − ((1 − λ_ij) · c_i + λ_ij · s_ij), where    (12)

λ_ij = −0.5 · (‖x − s_ij‖² − ‖x − c_i‖² − ‖s_ij − c_i‖²) / ‖s_ij − c_i‖².    (13)

As shown in Algorithm 2, the value of r_{q_l}(x) is first computed by Equation 12 and then encoded into m bytes using PQ [11]. The PQ codebooks are denoted by C^1, . . . , C^m, each containing 256 sub-centroids. The vector r_{q_l}(x) is mapped to a concatenation of m sub-centroids (c^1_{j_1}, · · · , c^m_{j_m}), where each j_i is a value between 1 and 256. Hence the vector r_{q_l}(x) is encoded into an m-byte code of sub-centroid indices (j_1, · · · , j_m). In Figure 1(c), assuming that c_i is the closest centroid to x, we can observe that the anchor point of x is closer to x than c_i is. So the dataset points can be encoded more accurately with the same code length. This improves the recall rate of search, as can be seen in our evaluation in Section 5.

From Equation 13, the value of λ_ij for each point can be computed. It is a float value and would require 4 bytes per data point. To further improve memory efficiency, we quantize it into 256 values and encode it with one byte. Empirically we find that the encoded λ_ij still yields high recall rates.

One important advantage of our indexing structure is that at query time, a specific query point only needs to traverse a small number of cells whose edges are closest to the query point, as shown in the right side of Figure 2. There are three steps in query processing: (1) region traversal, (2) distance computation and (3) re-ranking.
The region traversal process consists of two steps: first-level regionstraversal and second-level regions traversal. During first-level regions traver-sal, a query point y is quantized to its w nearest first-level regions, whichcorrespond to w · n second-level regions produced by quantizer q v . Thesubregions traversal is performed within only the w · n second-level regions.Moreover, y is quantized again to w nearest second-level regions by quan-tizer q l . Then the candidate list of y is formed by the data points only withinthe w nearest second-level regions. Because the w second-level regions isobviously smaller than the w first-level regions, the candidate list producedby our VLQ-based indexing structure is shorter than that produced by theVQ-based indexing structure. This will result in a faster query speed.We use parameter α to determine the percentage of w · n second-levelregions to be traversed give a query, such that w = α · w · n . We conduct aseries of experiments in Section 5 to discuss the performance of our systemwith different values of α . Distance computation is a prerequisite condition for re-ranking. In thissection, we describe how to compute the approximate distance between a16uery point y to a candidate point x . According to [11], the distance from y to x can be evaluated by asymmetric distance computation (ADC) as follows: (cid:107) y − q ( x ) − q ( x − q ( x )) (cid:107) (14)where q ( x ) = (1 − λ ij ) · c i + λ ij · s ij and q ( · · · ) is the PQ approximation ofthe x i displacement.Expression 14 can be further decomposed as follows [3]: (cid:107) y − q ( x ) (cid:107) + (cid:107) q ( · · · ) (cid:107) +2 (cid:104) q ( x ) , q ( · · · ) (cid:105)− (cid:104) y, q ( · · · ) (cid:105) . 
(15)

where ⟨·, ·⟩ denotes the inner product of two vectors.

If l(c_i, s_ij) is the closest edge to x, i.e., q_l(x) = (1 − λ_ij)c_i + λ_ij s_ij, Expression 15 can be transformed in the following way:

‖y − ((1 − λ_ij)c_i + λ_ij s_ij)‖² (term1) + ‖q_r(·)‖² (term2) + 2(1 − λ_ij)⟨c_i, q_r(·)⟩ (term3) + 2λ_ij⟨s_ij, q_r(·)⟩ (term4) − 2⟨y, q_r(·)⟩ (term5).   (16)

According to Equation 7, term1 in Expression 16 can be computed in the following way:

‖y − ((1 − λ_ij)c_i + λ_ij s_ij)‖² = (1 − λ_ij)‖y − c_i‖² (term6) + (λ_ij² − λ_ij)‖c_i − s_ij‖² (term7) + λ_ij‖y − s_ij‖² (term8).   (17)

In Expression 16 and Equation 17, several quantities can be computed in advance and stored in lookup tables:

• Term2, term3, term4 and term7 are independent of the query and can be precomputed from the codebooks. Term2 is the squared norm of the displacement approximation and can be stored in a table of size 256 × m. Term7 is the squared length of the edge that the point x belongs to and is already computed during the codebook learning process. Term3 and term4 are scalar products of the PQ sub-centroids and the corresponding first-level centroid subvectors and can be stored in a table of size k × 256 × m.

• Term6 and term8 are the squared distances from the query point to the first-level centroids. They are a by-product of first-level traversal.

• Term5 is the scalar product of the PQ sub-centroids and the corresponding query subvectors and can be computed once per query before the search.
Its computation costs 256 × D multiply-adds [6].

The proposed decomposition simplifies the distance computation. With the lookup tables, the distance computation only requires 256 × D multiply-adds and 2 × m lookup-adds; in comparison, the classic IVFADC distance computation requires 256 × D multiply-adds and m lookup-adds [6]. The additional m lookup-adds in our framework improve distance computation accuracy at a moderate increase in time overhead. We discuss this trade-off in detail in Section 5.

Re-ranking, the last step of query processing, re-sorts the candidate list of data points according to their distances to the query point. Its purpose is to find the nearest neighbours of the query point among the candidate points by distance comparison. We apply the fast sorting algorithm of [6] in our re-ranking step. Due to the shorter candidate list and more accurate approximate distances, the re-ranking step of our system is both faster and more accurate than that of Faiss.
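The decomposition in Expression 16 and Equation 17 is an exact algebraic identity, which can be checked numerically. In the sketch below (our own naming), r_hat stands in for the PQ approximation q_r(·) and lam for λ_ij:

```python
import numpy as np

rng = np.random.default_rng(0)
D, lam = 8, 0.4
y, c, s, r_hat = rng.normal(size=(4, D))   # query, centroid, neighbour, PQ residual
q_l = (1 - lam) * c + lam * s              # anchor point q_l(x)

direct = np.sum((y - q_l - r_hat) ** 2)    # Expression 14, computed directly

# term1 expanded via Equation 17 (term6 + term7 + term8)
term1 = ((1 - lam) * np.sum((y - c) ** 2)
         + (lam * lam - lam) * np.sum((c - s) ** 2)
         + lam * np.sum((y - s) ** 2))
term2 = np.sum(r_hat ** 2)                 # precomputed, table of size 256 x m
term3 = 2 * (1 - lam) * np.dot(c, r_hat)   # precomputed, table of size k x 256 x m
term4 = 2 * lam * np.dot(s, r_hat)
term5 = 2 * np.dot(y, r_hat)               # computed once per query
assert np.isclose(direct, term1 + term2 + term3 + term4 - term5)
```

The identity holds for any λ, not just the projection value, which is what allows the five terms to be assembled purely from precomputed tables plus the traversal by-products.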
4. GPU Implementation
One advantage of our VLQ-ADC framework is that it is amenable to implementation on GPUs, mainly because the search and distance computation performed at query time can be efficiently parallelized. In this work we have implemented our framework in CUDA.

There are three levels of parallelism granularity on a GPU: threads, blocks and grids. A block is composed of multiple threads, and a grid is composed of multiple blocks. Furthermore, there are three memory types on a GPU. Global memory is typically 4–32 GB in size with 5–10× higher bandwidth than CPU main memory [6], and can be shared by different blocks. Shared memory is similar to a CPU L1 cache in terms of speed and is shared only by threads within the same block. The GPU register file has the highest bandwidth, and its size is much larger than that of a CPU register file [6].

Algorithm 3 VLQ-ADC batch search process
function Search([y¹, ..., y^{n_q}], L_1, ..., L_{k×n})
  for t ← 1 to n_q do
    C^t ← w-argmin_{c ∈ C} ‖y^t − c‖
    L^t_LQ ← w′-argmin_{c_i ∈ C^t, s_ij ∈ S_i} ‖y^t − (1 − λ_ij)·c_i − λ_ij·s_ij‖   // described in Sec. 3.3.1
    Store the values of ‖y^t − c‖
  end for
  for t ← 1 to n_q do
    L^t ← []
    Compute ⟨y^t, q_r(·)⟩   // see term5 in Equation 16
    for L in L^t_LQ do
      for t′ in L do
        d ← ‖y^t − q_l(x_{t′}) − q_r(x_{t′} − q_l(x_{t′}))‖²   // distance evaluation described in Sec. 3.3.2
        Append (d, L, t′) to L^t
      end for
    end for
    R^t ← K-smallest distance-index pairs (d, t′) from L^t   // re-ranking
  end for
  return R¹, ..., R^{n_q}
end function

VLQ-ADC is able to utilize the GPU efficiently for indexing and search. For example, we use blocks to process D-dimensional query points and the threads of a block to traverse the inverted lists.
We use global memory to store the indexing structures and the compressed dataset, which are shared by all blocks and grids, and load parts of the lookup tables into shared memory to accelerate distance computation. As the GPU register file is very large, we store structured data in it to increase the performance of the sorting algorithm.

Algorithm 3 summarizes our search process as implemented on the GPU. We use four arrays to store the information of the inverted index lists: the first array stores the length of each list, the second stores the sorted vector IDs of each list, and the third and fourth store the corresponding codes and λ values of each list, respectively. For an NVIDIA GTX Titan X GPU with 12 GB of RAM, we load parts of the dataset indexing structure into global memory for different kernels, i.e., region sizes, compressed point codes and λ values of each list. A kernel is the unit of work (an instruction stream with arguments) scheduled by the host CPU and executed by the GPU [6]. We keep the vector IDs on the CPU side, because vector IDs are resolved only once the re-ranking step determines K-nearest membership. This lookup produces a few sparse memory reads in a large array, so storing the IDs on the CPU incurs only a tiny performance cost.

Our implementation makes use of some basic functions from the Faiss library, including matrix multiplication and the K-selection algorithm, to improve the performance of our approach.
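The four flat arrays can be packed on the CPU in the CSR-like layout commonly used for GPU inverted lists. A small sketch under our own naming (the paper does not publish this routine):

```python
import numpy as np

def build_inverted_arrays(assignments, codes, lams, num_lists):
    """Pack points into four flat arrays: list lengths, sorted IDs, codes, lambdas.

    assignments: (N,) cell index per point; codes: (N, m) PQ bytes; lams: (N,) bytes.
    """
    order = np.argsort(assignments, kind="stable")
    lengths = np.bincount(assignments, minlength=num_lists)  # array 1: list lengths
    ids = order                                              # array 2: vector IDs
    packed_codes = codes[order]                              # array 3: PQ codes
    packed_lams = lams[order]                                # array 4: lambda bytes
    offsets = np.concatenate(([0], np.cumsum(lengths)))      # start of each list
    return lengths, ids, packed_codes, packed_lams, offsets
```

With this layout, list i occupies the contiguous slice `[offsets[i] : offsets[i+1]]` of each packed array, so a block of threads can scan it with coalesced reads.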
K-selection.

The K-selection algorithm is a high-performance GPU-based sorting method proposed in Faiss [6] and GSKNN [8]. K-selection keeps intermediate data in the register file and exchanges register data using the warp shuffle instruction, enabling warp-wide parallelism and storage. A warp is a 32-wide vector of GPU threads; each thread in a warp has up to 255 32-bit registers in a shared register file, and all threads in the same warp can exchange register data using the warp shuffle instruction.
List search.
We use two kernels for inverted list search. The first kernel quantizes each query point to its w nearest first-level regions (line 3 in Algorithm 3). The second kernel finds the w′ nearest second-level regions for the query point (line 4 in Algorithm 3). The distances between each query point and its w nearest centroids are stored for later use. In both kernels we use a block of threads to process one query point, so a batch of n_q query points can be processed concurrently.
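A CPU-side, brute-force sketch of what these two kernels compute (our own naming; the actual implementation parallelizes this per query block in CUDA):

```python
import numpy as np

def two_level_traverse(y, centroids, neighbours, w, alpha):
    """Return the w' = alpha*w*n closest second-level regions for query y.

    centroids: (k, D) first-level codebook; neighbours: (k, n) edge endpoint indices.
    """
    k, n = neighbours.shape
    d1 = np.sum((centroids - y) ** 2, axis=1)
    first = np.argsort(d1)[:w]                   # kernel 1: w nearest first-level regions
    scored = []
    for i in first:                              # kernel 2: score the w*n candidate edges
        c = centroids[i]
        for j in range(n):
            s = centroids[neighbours[i, j]]
            edge_sq = np.sum((s - c) ** 2)
            lam = -0.5 * (np.sum((y - s) ** 2) - np.sum((y - c) ** 2) - edge_sq) / edge_sq
            anchor = (1 - lam) * c + lam * s
            scored.append((np.sum((y - anchor) ** 2), i, j))
    scored.sort()
    w2 = max(1, int(alpha * w * n))              # keep only the w' closest edges
    return scored[:w2]
```

Note that the distances ‖y − c‖² computed in the first stage are exactly the term6/term8 values reused later, so the second kernel gets them for free.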
Distance computation and re-ranking.

After the inverted lists L_i of each query point are collected, there are up to n_q × w′ × max|L_i| candidate points to process. During distance computation and re-ranking, processing all query points in a batch yields high parallelism but can exceed the available GPU global memory. Hence, we choose a tile size t_q < n_q based on the amount of available memory, bounding the memory complexity by O(t_q × w′ × max|L_i|). The source code will be released upon publication.
We use one kernel to compute the distances from each query point to its candidate points according to Expression 16, and sort the distances via the K-selection algorithm in a separate kernel. The lookup tables are stored in global memory. In the distance computation kernel, we use one block to scan all w′ inverted lists for a single query point; the significant portion of the runtime is the 2 × w′ × m lookups in the lookup tables and the linear scan of the lists L_i from global memory.

In the re-ranking kernel, we follow Faiss in using a two-pass K-selection: we first reduce the t_q × w′ × max|L_i| distances to t_q × τ × K partial results, where τ is a subdivision factor, and then reduce the partial results again via K-selection to the final t_q × K results.

Due to the limited amount of GPU memory, an index instance with a long encoding length that cannot fit in the memory of a single GPU cannot be processed efficiently on that GPU. Our framework therefore supports multi-GPU parallelism for index instances with long encoding lengths. For b GPUs, we split the index instance into b parts, each of which fits in the memory of a single GPU. We then run the local search of the n_q queries on each GPU, and finally join the partial results on one GPU. Our multi-GPU system is based on MPI and can easily be extended to multiple GPUs on multiple servers.
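The two-pass reduction described above can be illustrated on the CPU with plain heaps instead of Faiss's warp-level selection (names are ours; this is a sketch of the idea, not the kernel):

```python
import heapq

def two_pass_kselect(dists, K, tau=4):
    """Select the K smallest values in two passes, as in the re-ranking kernel.

    Pass 1 reduces each of tau chunks to at most K partial results;
    pass 2 merges the tau*K survivors into the final K.
    """
    step = max(1, (len(dists) + tau - 1) // tau)
    partial = []
    for i in range(0, len(dists), step):          # pass 1: per-chunk top-K
        partial.extend(heapq.nsmallest(K, dists[i:i + step]))
    return heapq.nsmallest(K, partial)            # pass 2: final reduction
```

The result is exact because every global top-K element is necessarily among the top K of its own chunk; the split simply bounds the working-set size of each pass.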
5. Experiments and Evaluation
In this section, we evaluate the performance of our system, VLQ-ADC, and compare it to three state-of-the-art billion-scale retrieval systems that are based on different indexing structures and implemented on CPUs or GPUs: Faiss [6], Ivf-hnsw [7] and Multi-D-ADC [16]. All systems are evaluated on the standard metrics, accuracy and query time, with different code lengths. All experiments are conducted on a machine with two 2.1 GHz Intel Xeon E5-2620 v4 CPUs and two NVIDIA GTX Titan X GPUs with 12 GB memory each.

The evaluation is performed on two public benchmark datasets that are commonly used to evaluate billion-scale ANN search: SIFT1B [22] and DEEP1B. We use a sample of vectors from each dataset for learning all the trainable parameters. We evaluate search accuracy by Recall@K, the rate of queries for which the nearest neighbour is among the top K results.

We choose nprobe = 64 for all the inverted indexing systems (Faiss, Ivf-hnsw and VLQ-ADC), as 64 is a typical value for nprobe in the Faiss system. The parameter max_codes, which bounds the number of candidate data points for a query, is only applicable to CPU-based systems (we set max_codes to 100,000); for the GPU-based systems Faiss and VLQ-ADC, max_codes is not configured. Instead, we compute the distances from the query point to all the data points contained in the neighbouring regions.

In experiment 1, we evaluate the index quality of each retrieval system. We compare three different inverted index structures and two inverted multi-index schemes with different codebook sizes, without the re-ranking step.

1.
Faiss. We build a codebook of k = 2 centroids by k-means, and retrieve the proposed inverted lists of each query with Faiss.
2. Ivf-hnsw. We use a codebook of k = 2 centroids built by k-means, and set 64 sub-centroids for each first-level centroid.
3. Multi-D-ADC. We use two IMI schemes with two codebook sizes, k = 2 and k = 2, and choose the implementation from the Faiss library for all the experiments.
4. VLQ-ADC. For our approach, we use the same codebook as Ivf-hnsw, and a 64-edge k-NN graph, with indexing and querying as described in Sections 3.2 and 3.3.

The recall curves of each indexing approach are presented in Figure 3. On both datasets, our proposed system VLQ-ADC (blue curve) outperforms the other two inverted index systems and the Multi-D-ADC scheme with small codebooks (k = 2) over the whole reasonable range of K. Compared with the Multi-D-ADC scheme with a larger codebook (k = 2), our system performs better on DEEP1B and almost equally well on SIFT1B.

On the DEEP1B dataset, the recall rate of our system is consistently higher than that of all the other indexing structures, and much better than the other inverted index structures.

As shown in Figure 3, for the SIFT1B dataset, the IMI with k = 2 generates better candidate lists than the inverted indexing structures, while for the DEEP1B dataset the performance of the IMI falls behind that of the inverted indexing structures. The reason is that SIFT vectors are histogram-based, and their subvectors correspond to different subspaces describing disjoint image parts, which have weak correlations in the subspace distributions. The DEEP vectors, on the other hand, are produced by a CNN and exhibit strong correlations between the subspaces. It can be observed that the performance of our indexing structure is consistent across the two datasets, which demonstrates its suitability for different data distributions.

(We use the implementation of Ivf-hnsw that is available online at https://github.com/dbaranchuk/ivf-hnsw for all the experiments. The VLQ-ADC source code is available at https://github.com/zjuchenwei/vector-line-quantization.)
Figure 3: Recall rate comparison of our system, VLQ-ADC, without the re-ranking step, against two inverted index systems, Faiss and Ivf-hnsw, and one inverted multi-index scheme, Multi-D-ADC (with two different codebook sizes), on DEEP1B and SIFT1B.

In experiment 2, we evaluate the recall rates with the re-ranking step. In all systems the dataset points are encoded in the same two-stage way: indexing and encoding. (1)
Indexing: displacements from data points to the nearest cell centroids are calculated; for VLQ-ADC, the displacements are calculated from data points to the nearest anchor points on the lines. (2)
Encoding: the residual values are encoded into 8 or 16 bytes by PQ, with the same codebooks shared by all cells. We compare the same four retrieval systems as in experiment 1, with the same configurations. For the GPU-based systems, we evaluate performance with 8-byte codes on 1 GPU and with 16-byte codes on 2 GPUs.

The Recall@K values for K = 1/10/100 and the average query times on both datasets, in milliseconds (ms), are presented in Table 3. From Table 3 we can make the following important observations.
Overall best recall performance.
Our system VLQ-ADC achieves the best recall performance for both datasets and both code lengths (8-byte and 16-byte) in most cases. Of the twelve recall values (Recall@1/10/100 × two code lengths × two datasets), VLQ-ADC achieves the best value in nine cases and the second best in two. The second-best system is Faiss, obtaining the best results in two cases. Multi-D-ADC (with k = 2 × regions) obtains the best result in one case.

Substantial speedup.
VLQ-ADC is consistently and significantly faster than all the other systems in all experiments. Across all configurations, VLQ-ADC's query time stays within 0.054–0.068 milliseconds, while the other systems' query times vary greatly. In the most extreme case, VLQ-ADC is 125× faster than Multi-D-ADC (0.068 ms vs 8.54 ms). At the same time, VLQ-ADC is also consistently faster than the second fastest system, the GPU-based Faiss, with an average 5× speedup.

Comparison with Faiss.
VLQ-ADC outperforms the current state-of-the-art GPU-based system Faiss in terms of both accuracy and query time by a large margin, except in three out of sixteen cases (R@10 with 16-byte codes for SIFT1B, and R@100 with 16-byte codes for SIFT1B and DEEP1B). For example, VLQ-ADC outperforms Faiss in accuracy by 17%, 14% and 4% for R@1, R@10 and R@100 respectively on the SIFT1B dataset with 8-byte codes, while its query time is consistently and significantly lower, with a speedup of up to 5×. Faiss outperforms VLQ-ADC in recall in three cases, all with 16-byte codes; however, the difference is negligible. Moreover, although our codebook is only a fraction of the size of the Faiss codebook, our system produces many more regions than Faiss. Therefore, our system achieves better accuracy as well as better memory and runtime efficiency than Faiss.

Comparison with Multi-D-ADC.
The proposed system also outperforms the IMI-based system Multi-D-ADC in terms of both accuracy and query time on both datasets. For example, VLQ-ADC leads Multi-D-ADC with codebooks k = 2 by 14.2%, 7.4% and 1.3% for R@1, R@10 and R@100 respectively on the SIFT1B dataset with 8-byte codes, with up to a 6× speedup. On the DEEP1B dataset, the advantage of our system is even more pronounced. VLQ-ADC outperforms the Multi-D-ADC scheme with smaller codebooks k = 2 even more significantly, especially in query time, where VLQ-ADC consistently achieves speedups of at least one order of magnitude while obtaining better recall values.

Comparison with Ivf-hnsw.
Similarly, VLQ-ADC outperforms Ivf-hnsw, another CPU-based retrieval system, in both recall and query time. Although Ivf-hnsw can also produce more regions from a small codebook, it still cannot outperform a VQ-based indexing structure with a larger codebook.
Effects on recall of indexing and encoding.
The improvement in R@10 and R@100 shows that the second-level line quantization provides a more accurate short-list of candidates than the previous inverted index structure, and the improvement in R@1 shows that it also improves encoding accuracy.
Multi-D-ADC.
From Table 3, we can also observe that the Multi-D-ADC scheme with k = 2 outperforms the scheme with k = 2 in query time by a large margin. This is mainly because Multi-D-ADC with larger codebooks produces more regions, which extract more concise and accurate short-lists of candidates.

Table 3: Performance comparison of VLQ-ADC (with the re-ranking step) against three other state-of-the-art retrieval systems, reporting Recall@1/10/100 and retrieval time on two public datasets. For each system the total number of regions is specified beneath the system's name. VLQ-ADC consistently achieves higher recall values and significantly lower query time than all other systems. The best result in each column is bolded and the second best is underlined. For the two GPU-based systems, Faiss and VLQ-ADC, experiments are performed on 1 GPU for the 8-byte encoding length and on 2 GPUs for the 16-byte encoding length.
[Table 3: for each system (Faiss, Ivf-hnsw, Multi-D-ADC, VLQ-ADC), columns R@1, R@10, R@100 and t (ms) for 8-byte and 16-byte codes on SIFT1B and DEEP1B.]
Figure 4: The distributions of data points in regions produced by different indexing structures on SIFT1B and DEEP1B. The x axis shows five categories representing the discretized numbers of data points in each region (0, 1–100, 101–300, 301–500 and >500).

5.3. Data point distributions of different indexing structures

The space and time efficiency of an indexing structure is affected by the distribution of data points it produces. To analyse the distributions produced by the structures studied in this paper, we plot in Figure 4 the percentage of regions in each discretized occupancy category.

As shown in Figure 4, the portion of empty regions produced by the inverted indexing structures (Faiss, Ivf-hnsw and VLQ-ADC) is much smaller than that produced by the inverted multi-index structure (Multi-D-ADC): Multi-D-ADC produces 38% empty regions on SIFT1B and 58% on DEEP1B (leftmost group in each plot). This result empirically validates the space inefficiency of the inverted multi-index structure [16].

For Faiss, which is based on the inverted indexing structure using VQ, over 98% and 93% of regions contain more than 500 data points for SIFT1B and DEEP1B respectively. This can produce long candidate lists for queries, negatively impacting query speed. For VLQ-ADC (and Ivf-hnsw), the regions are much more evenly distributed: the majority of regions on both datasets contain fewer than 500 data points, with the 101–300 category being the most common. This is a main reason why VLQ-ADC provides shorter candidate lists and thus faster queries.
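The occupancy buckets of Figure 4 can be computed directly from the list-length array; a small sketch (bucket edges taken from the caption, function name ours):

```python
import numpy as np

def region_occupancy(list_lengths):
    """Portion of regions per bucket: 0, 1-100, 101-300, 301-500, >500."""
    edges = [0, 1, 101, 301, 501, np.inf]
    counts, _ = np.histogram(list_lengths, bins=edges)
    return counts / counts.sum()
```

A heavy leftmost bucket signals wasted (empty) regions, while a heavy rightmost bucket signals long candidate lists; an even spread across the middle buckets is what makes traversal cheap.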
Figure 5: Comparison of Recall@10 and average query time between VLQ-ADC and Faiss under different dataset scales, with an 8-byte encoding length, on SIFT1B and DEEP1B. The x axis indicates five data scales (1M/10M/100M/300M/1000M); the left y axis is the Recall@10 value (bars) and the right y axis is the average query time in ms (lines).

5.4. Performance comparison under different dataset scales

In this section we evaluate the performance of our system under different dataset scales. Figure 5 shows, for SIFT1B and DEEP1B, the recall and query time of Faiss and VLQ-ADC on subsets of different sizes: 1M, 10M, 100M, 300M and 1000M (the full dataset). As can be seen in the figure, the recall of VLQ-ADC is higher than that of Faiss at every dataset scale. When the dataset scale is under 300M, the query speed of Faiss is slightly faster than that of VLQ-ADC; at 300M and above, the query speed of VLQ-ADC matches that of Faiss.

It can also be observed that on the full SIFT1B and DEEP1B datasets (1000M), Faiss takes 0.31 ms and 0.32 ms per query respectively (see also Table 3). Compared to the 100M subsets of these two datasets, Faiss suffers an approximately 15× slowdown when the data scale grows 10×. In contrast, VLQ-ADC takes 0.054 ms and 0.059 ms respectively, representing only an approximately 2× slowdown for the same 10× growth. The superior scalability and robustness of VLQ-ADC over Faiss is evident from this experiment.
Figure 6: The performance of VLQ-ADC with different numbers of centroids k. The results are collected on the same two datasets with an 8-byte encoding length and n = 32 edges per centroid. The right plot shows the average search time for different values of k.

Number of centroids k and edges n. We evaluate the performance of VLQ-ADC for different k and n values with 8-byte codes. We first fix the value of n to 64 and compare the performance of our system for different numbers of centroids k. In Figure 6, we present the evaluation of VLQ-ADC for different values of k.
Figure 7: The performance of VLQ-ADC with different numbers of graph edges n = 32/64/128 and k = 2 centroids. The right plot shows the average search time for different values of n.
Figure 8: The performance of VLQ-ADC for different values of the parameter α, with the values of k, n and w fixed at k = 2, n = 64, w = 64. The results are collected on the same two datasets with an 8-byte encoding length and 64 edges per centroid. The right plot shows the average search time for different values of α.

Then we fix k = 2 and increase the number of edges n from 32 to 64 and 128. In Figure 7, we present the evaluation of VLQ-ADC for the different edge counts. From Figures 6 and 7 we observe that increasing the number of centroids and edges improves search accuracy while slightly increasing query time. This is because an indexing scheme with more centroids and more edges represents the dataset points more accurately and hence provides more accurate short inverted lists.

Value of portion α. We now discuss how to determine the value of the parameter α used for sub-region pruning, as described in Section 3.3.1. As shown in Figure 8, we test several values of α on both datasets. A lower α value means fewer sub-regions are traversed, and hence a lower query time. At the same time, we observe that higher α values only moderately increase recall while significantly increasing query time (by up to 3×). Hence we choose a small value of α.

Time and memory consumption.
Because a billion-scale dataset does not fit on the GPU, the database is built in batches of 2M vectors, with the information then aggregated on the CPU. Including file I/O, it takes about 150 minutes to build the whole database on a single GPU.

Here we analyse the memory consumption of each system. As shown in Table 2, for a database of N = 10⁹ points, the basic memory consumption of all systems is 4·N bytes for the integer point IDs and m·N bytes for the point codes. In addition, Multi-D-ADC consumes 4·k bytes to store the region boundaries. Faiss consumes 4·k·D bytes for the codebooks and 4·k·m·
256 bytes for the lookup tables. Ivf-hnsw requires N bytes for the quantized norm items and 4·k·(D + n) bytes for its indexing structure [7]. Our system requires N bytes for the quantized λ values and 4·k·(D + 2n + m·256) bytes for the n-NN graph and the lookup tables. We summarize the total memory consumption of all systems in Table 4, with an 8-byte encoding length, on both datasets.

As presented in Table 4, the memory consumption of our system is less than that of Faiss, and about 10% more than that of Multi-D-ADC with the 2 codebook, which is acceptable for most realistic setups.

Table 4: The memory consumption of all systems for SIFT1B (10⁹ points).

System (codebook size) | Memory consumption (GB)
Faiss (2) | 14
Ivf-hnsw (2) | 13.04
Multi-D-ADC (2 ×) | 12.25
VLQ-ADC (2) | 13.55
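The per-system totals can be reproduced approximately with a back-of-envelope calculation. The sketch below follows the accounting above for VLQ-ADC; the index term is our simplification, and the parameter values in the example call (including the codebook size 2^17) are illustrative assumptions, not the paper's exact configuration:

```python
def vlq_adc_memory_gb(N, m, k, D, n):
    """Rough memory estimate (GB) for a VLQ-ADC index over N points.

    m: PQ bytes per point; k: first-level centroids; D: dimension; n: edges.
    The index term (codebook + n-NN graph + lookup tables) is assumed.
    """
    ids = 4 * N                              # 4-byte integer point IDs
    codes = m * N                            # m-byte PQ codes
    lams = N                                 # 1-byte quantized lambda per point
    index = 4 * k * (D + 2 * n + m * 256)    # codebook, graph, lookup tables
    return (ids + codes + lams + index) / 2 ** 30
```

With N = 10⁹ and m = 8, the per-point arrays alone account for roughly 12 GB, which shows why the index-side overhead stays a small fraction of the total.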
6. Conclusion
Billion-scale approximate nearest neighbour (ANN) search has become an important task as massive amounts of visual data become available online. In this work, we proposed VLQ-ADC, a simple yet scalable indexing structure and retrieval system capable of handling billion-scale datasets. VLQ-ADC combines line quantization with vector quantization to create a hierarchical indexing structure; the search space is further pruned to a portion of the closest regions, further improving ANN search performance. The proposed indexing structure can partition a billion-scale database into a large number of regions with a moderately sized codebook, which addresses the drawback of prior VQ-based indexing structures.

We performed a comprehensive evaluation on two billion-scale benchmark datasets, SIFT1B and DEEP1B, against three state-of-the-art ANN search systems: Multi-D-ADC, Ivf-hnsw and Faiss. Our evaluation shows that VLQ-ADC consistently outperforms all three systems on both recall and query time. VLQ-ADC achieves a recall improvement over Faiss, the state-of-the-art GPU-based system, of up to 17% and a query-time speedup of up to 5×.

Moreover, VLQ-ADC takes the data distribution into account in its indexing structure and, as a result, performs well on datasets with different distributions. Our evaluation shows that VLQ-ADC is the best performer on both SIFT1B and DEEP1B, demonstrating its robustness to data with different distributions.

We conclude by pointing out a number of directions for future work. We plan to investigate further improvements to the indexing structure. Moreover, a more systematic and principled method for hyperparameter selection is worthy of investigation.

Acknowledgment
This work is supported in part by the National Natural Science Foundation of China under Grants No. 61672246, No. 61272068 and No. 61672254, and by the Fundamental Research Funds for the Central Universities under Grant HUST:2016YXMS018. In addition, we gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPUs used for this research. The authors appreciate the valuable suggestions from the anonymous reviewers and the editors.