Experimental Analysis of Locality Sensitive Hashing Techniques for High-Dimensional Approximate Nearest Neighbor Searches
Omid Jafari and Parth Nagarkar

New Mexico State University, Las Cruces, USA
{ojafari, nagarkar}@nmsu.edu

Abstract.
Finding nearest neighbors in high-dimensional spaces is a fundamental operation in many multimedia retrieval applications. Exact tree-based indexing approaches are known to suffer from the notorious curse of dimensionality for high-dimensional data. Approximate searching techniques sacrifice some accuracy while returning good enough results for faster performance. Locality Sensitive Hashing (LSH) is a very popular technique for finding approximate nearest neighbors in high-dimensional spaces. Apart from providing theoretical guarantees on the query results, one of the main benefits of LSH techniques is their good scalability to large datasets because they are external memory based. The most dominant costs for existing LSH techniques are the algorithm time and the index I/Os required to find candidate points. Existing works do not compare both of these dominant costs in their evaluation. In this experimental survey paper, we show the impact of both these costs on the overall performance of the LSH technique. We compare three state-of-the-art techniques on six real-world datasets, and show that, in contrast to recent works, C2LSH is still the state-of-the-art algorithm in terms of performance while achieving similar accuracy as its recent competitors.
Keywords: Locality Sensitive Hashing · High-Dimensional Spaces · Approximate Nearest Neighbor.
1 Introduction

Many large multimedia retrieval applications require efficient processing of nearest neighbor queries in high-dimensional spaces. Exact tree-based indexing structures, such as the KD-tree, SR-tree, etc., work well for low-dimensional spaces (< 10 dimensions) but suffer from the notorious curse of dimensionality in high-dimensional spaces, where they are often outperformed by brute-force linear scans [4]. One solution to this problem is to search for good enough approximate results instead. Approximate techniques sacrifice some accuracy for a significant improvement in the overall processing time. In many applications where 100% accuracy is not needed, this tradeoff is very useful in saving time. The goal of the approximate version of the nearest neighbor problem, also called c-approximate Nearest Neighbor search, is to return points that are within $c \times R$ distance from the query point. Here, $c > 1$ is a user-defined approximation ratio and $R$ denotes the distance between the query point and its nearest neighbor (e.g., if the nearest neighbor is at distance $R = 5$ and $c = 2$, any returned point must be within distance 10 of the query).

1.1 Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) [9] is one of the most popular techniques for finding approximate nearest neighbors in high-dimensional spaces. LSH was first introduced in [9] for the Hamming distance, but was later extended to several distances, such as the popular Euclidean distance [7]. LSH uses random hash projections to map the original high-dimensional space to a projected low-dimensional space. The main idea behind LSH is that nearby points in the original high-dimensional space will map to the same hash buckets in the low-dimensional space with a higher probability than far away points. Since LSH was first proposed in [9], there have been several works that have focused on improving the search accuracy and/or performance [3,8,10,17,19,24,16,5].
1.2 Benefits of LSH

Locality Sensitive Hashing (LSH) is known for two main advantages: its sub-linear query performance (in terms of the data size) and its theoretical guarantees on the query accuracy. Additionally, LSH uses random hash functions which are data-independent (i.e. data properties such as the data distribution are not needed to generate these random hash functions), so generating them is a simple process that takes negligible time. Hence, in applications where the data is changing or where newer data is arriving, these hash functions do not require any change during runtime. While the original LSH index structure suffered from large index sizes (in order to obtain a high query accuracy) [3,19], state-of-the-art LSH techniques [8,10] have alleviated this issue by using advanced methods such as Collision Counting and Virtual Rehashing. In addition to their fast index maintenance, fast query performance, and theoretical guarantees on the query accuracy, LSH algorithms are easy to implement as external memory-based algorithms, and hence are more scalable than in-memory algorithms (such as graph-based ANN algorithms) [16].
1.3 Motivation

Locality Sensitive Hashing techniques have two dominant costs for finding nearest neighbors: 1) the cost of reading the index files from the external memory into the main memory (which we call Index I/Os), and 2) the cost of finding candidates and removing false positives (which we call Algorithm time). As mentioned in Section 1.2, one of the benefits of LSH is that it is a scalable algorithm. However, some of the existing LSH techniques (e.g. C2LSH [8] and QALSH [10]) are not entirely external memory-based (i.e. even though the indexes are stored on disk, their implementations require that the entire data and indexes fit into the main memory during the index creation phase). Thus, existing works (such as [1]) do not compare their results with C2LSH and QALSH on large datasets, since those datasets do not fit in the main memory. Additionally, some recent works (such as [16]) only compare the Index I/Os without comparing the important Algorithm time. This leads other recent papers (such as [15,14,26]) to unfairly compare their Algorithm time with QALSH or I-LSH [16], since those are deemed the state-of-the-art LSH techniques.
1.4 Contributions

In this paper, we carefully present a detailed experimental analysis of three state-of-the-art LSH algorithms: C2LSH [8], QALSH [10], and I-LSH [16]. Our contributions are as follows:
– We modify the implementations of C2LSH and QALSH to create fully external memory-based implementations, such that neither the entire dataset nor the entire index needs to be in the main memory during index generation or query processing. (These implementations will be made public.)
– We show the importance of experimentally analyzing and comparing the Index I/Os and Algorithm time of all algorithms.
– We compare these three algorithms on real datasets with different characteristics under differing system parameters.
To the best of our knowledge, ours is the first work to present a detailed analysis of these three state-of-the-art LSH techniques, namely C2LSH [8], QALSH [10], and I-LSH [16].
2 Related Work

The Nearest Neighbor problem is an important problem for multimedia applications in many diverse domains, such as multimedia retrieval, image processing, machine learning, etc. Since tree-based index structures can be outperformed by a linear scan in high-dimensional spaces, due to the curse of dimensionality, approximate techniques are preferred for their fast performance at the expense of some accuracy. Due to the importance of the nearest neighbor problem, several diverse techniques have been proposed. These techniques can be broadly classified into three main categories: Hashing-based methods, Partition-based methods, and Graph-based methods; we refer the reader to a recent survey [15] for an in-depth treatment of these categories. Hashing-based methods can be further classified into learning-based hashing techniques and random hashing techniques. The benefits of random hashing techniques, such as Locality Sensitive Hashing [9], are that they are easy to construct, need no training data, and are easy to maintain and update. Additionally, LSH provides sub-linear (in terms of the data size) query performance and theoretical guarantees on the query accuracy.
Locality Sensitive Hashing and its variants:
The main idea of Locality Sensitive Hashing is to create random projections and hash data points in these random projections such that nearby data points in the original high-dimensional space will be mapped to the same hash bucket with a high probability (and conversely, data points that are far apart from each other in the original high-dimensional space will be mapped to the same hash bucket with a low probability). It was originally proposed in [9] for the Hamming distance and later extended to the popular Euclidean distance [7]. In this original work on the Euclidean distance (E2LSH), instead of a single hash function (or projection), a hash table consisted of several hash functions (represented by Compound Hash Keys) in order to reduce false positives. But this also generated false negatives. Hence, several hash tables had to be used to reduce the number of false positives and false negatives, while keeping the accuracy of the query high. The main drawbacks of this approach were the size of the index structure (since a large number of hash tables were required to return the desired number of results with a high accuracy) and the need to determine the width of the hash bucket during index creation (a larger width returned enough results but with a potential of too many false positives, whereas a smaller width had a potential of misses resulting in insufficient results). This user-defined width, which is mainly dependent on the data distribution, often had to be determined through a trial-and-error process.

LSH-Forest [3] was proposed, where the Compound Hash Keys are stored hierarchically such that the algorithm can stop at a higher level in the tree if more results are needed. In Multi-probe LSH [19], the authors proposed a technique to probe neighboring buckets when more results are needed. The intuition is that neighboring buckets are more likely to contain nearby points. Hence, if the bucket width was underestimated (which is better than overestimation, which can lead to significant wasteful processing), neighboring buckets are probed to find the desired number of results.

Later, C2LSH [8] introduced the two main concepts of Collision Counting and Virtual Rehashing, which solved the two main drawbacks of E2LSH [7]. In C2LSH, the authors proposed to create m base hash functions and choose candidate points based on how many times a data point collides with the query point (hence, instead of creating several hash tables of several hash functions each, only one table of m base hash functions is needed), which reduced the size of the index structure. Additionally, in Virtual Rehashing, the neighboring buckets in each hash function are read incrementally when a sufficient number of results is not found.

In SK-LSH [17], the authors propose a linear ordering on the Compound Hash Keys (using a space-filling curve) such that nearby Compound Hash Keys are stored on the same (or a nearby) page on the disk, thus reducing the total number of I/Os. The design of SK-LSH is still built on the original E2LSH, and hence suffers from the parameter tuning problem, where the user is expected to enter important parameters such as the number of hash functions and the radius at which k results will be found. A wrong choice of parameters can negatively affect the accuracy and efficiency of the algorithm.

QALSH [10] was later proposed, which builds query-aware hash functions such that the hash value of the query point is considered the anchor bucket during query processing; this solves the issue of points close to a query being partitioned into different buckets when the query is near the bucket boundaries. Additionally, B+-trees are built on each hash function for efficient lookups into neighboring buckets (which translate to range queries). QALSH utilizes the concepts of Collision Counting and Virtual Rehashing.

HD-Index [1] generates Hilbert keys of the dataset points and also stores the distances of the points to each other to efficiently prune the results based on distance filters. HD-Index stores the Hilbert keys using modified B+-trees, called RDB-trees. Due to its reliance on space-filling curves (Hilbert curves) and B+-trees, HD-Index cannot scale to moderately high-dimensional datasets [1].

SRS [23] uses the Euclidean distance between two points in the projected space to estimate their distance in the original space. In order to find the next nearest neighbor in the projected space, SRS uses an R-tree to index the points in the projected space. This incremental finding of the NN is similar to I-LSH. The main goal of SRS is to introduce a very lightweight index structure to solve the ANN problem. SRS has been shown to suffer from memory leaks and slow running times as compared with C2LSH [1], and is hence not included in our work.

Recently, I-LSH [16], which is considered to be the state-of-the-art LSH technique [14], was proposed to improve the Virtual Rehashing process of QALSH (where the ranges of the lookups are incremented exponentially). In I-LSH, the authors propose to increase the range of the lookups based on the distance to the nearest point (in the projected space) instead of increasing the range exponentially. While this strategy results in fewer disk I/Os, it also leads to high disk seeks (random I/Os) and algorithm time, as we show in Section 5.

Very recently, PM-LSH [26] was proposed, where the idea is to estimate the Euclidean distance based on a tunable confidence interval value such that the overall query processing time is reduced. (The code of PM-LSH was not released before the submission date of SISAP.)

3 Key Concepts

In this section, we describe the key concepts behind LSH. We primarily use the terminologies and formulations introduced in E2LSH [7] and C2LSH [8].
Hash Functions:
A hash function family $H$ is $(R, cR, p_1, p_2)$-sensitive if it satisfies the following conditions for any two points $x$ and $y$ in a $d$-dimensional dataset $D \subset \mathbb{R}^d$:
– if $|x - y| \le R$, then $Pr[h(x) = h(y)] \ge p_1$, and
– if $|x - y| > cR$, then $Pr[h(x) = h(y)] \le p_2$.
Here, $p_1$ and $p_2$ are probabilities and $c$ is an approximation ratio. LSH requires that $c > 1$ and $p_1 > p_2$. The above definition states that the two points $x$ and $y$ are hashed to the same bucket with a high probability ($\ge p_1$) if they are close to each other (i.e. the distance between the two points is less than or equal to $R$), and if they are not close to each other (i.e. the distance between the two points is greater than $cR$), then they will be hashed to the same bucket with a low probability ($\le p_2$). In the original LSH scheme for the Euclidean distance, each hash function is defined as $h_{\vec{a},b}(x) = \left\lfloor \frac{\vec{a} \cdot x + b}{w} \right\rfloor$, where $\vec{a}$ is a $d$-dimensional random vector with entries chosen independently from the standard normal distribution $N(0, 1)$ and $b$ is a real number chosen uniformly from $[0, w)$, such that $w$ is the width of the hash bucket [7]. This leads to the following collision probability function [7], which states that if $\|x, y\| = r$, then the probability that $x$ and $y$ map to the same hash bucket for a given hash function $h_{\vec{a},b}(x)$ is: $P(r) = \int_0^w \frac{2}{\sqrt{2\pi}\, r}\, e^{-\frac{t^2}{2r^2}} \left(1 - \frac{t}{w}\right) dt$. Here, the collision probability $P(r)$ is decreasing in $r$ for a given $w$. For $t$, which is the largest absolute value of a coordinate of a point in $D$, for every $b$ uniformly drawn from the interval $[0, c^{\lceil \log_c(td) \rceil} w]$, and for $R = c^n$ for some $n \le \lceil \log_c(td) \rceil$, we have that $h^R(x) = \left\lfloor \frac{h_{\vec{a},b}(x)}{R} \right\rfloor$ is $(R, cR, p_1, p_2)$-sensitive, where $p_1 = P(1)$ and $p_2 = P(c)$ [8].
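To make this construction concrete, the following is a minimal C++11 sketch of one such hash function; it follows the definition above (the vector $\vec{a}$ is drawn from $N(0,1)$ and the offset $b$ uniformly from $[0, w)$) and is our illustration, not the code of any of the evaluated implementations.

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// One E2LSH-style hash function h_{a,b}(x) = floor((a.x + b) / w), following
// the definition above. A sketch for illustration, not the evaluated code.
struct E2LSHFunction {
    std::vector<double> a;  // random projection vector, entries ~ N(0, 1)
    double b;               // random offset, drawn uniformly from [0, w)
    double w;               // width of the hash bucket

    E2LSHFunction(std::size_t dim, double width, std::mt19937& gen)
        : a(dim), w(width) {
        std::normal_distribution<double> normal(0.0, 1.0);
        std::uniform_real_distribution<double> uniform(0.0, width);
        for (std::size_t i = 0; i < dim; ++i) a[i] = normal(gen);
        b = uniform(gen);
    }

    // Bucket id of point x: nearby points land in the same bucket with the
    // collision probability P(r) given above.
    long hash(const std::vector<double>& x) const {
        double dot = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) dot += a[i] * x[i];
        return static_cast<long>(std::floor((dot + b) / w));
    }
};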
4 State-of-the-Art LSH Techniques

In Section 2, we explained the benefits and drawbacks of different LSH techniques. In this paper, we experimentally analyze the three state-of-the-art external memory-based LSH techniques: C2LSH [8], QALSH [10], and I-LSH [16]. In this section, we introduce the concepts behind these techniques.

C2LSH [8] introduced the concepts of Collision Counting and Virtual Rehashing. In [8], the authors theoretically show that two close points $x$ and $y$ collide in at least $l$ hash layers with a probability $1 - \delta$, when the total number $m$ of hash layers is equal to: $m = \left\lceil \frac{\ln(1/\delta)}{2(p_1 - p_2)^2} (1 + z)^2 \right\rceil$. Here, $z = \sqrt{\ln(1/\beta) / \ln(1/\delta)}$, where $\beta$ is the allowed false positive percentage (i.e. the allowed fraction of points whose distance to the query point is greater than $cR$). C2LSH sets $\beta = 100/n$, where $n$ is the cardinality of the dataset. Further, only those points are chosen as candidates that collide at least $l$ times, where $l$ is the collision count threshold, calculated as $l = \lceil \alpha \times m \rceil$, with the collision threshold percentage $\alpha = \frac{z p_1 + p_2}{1 + z}$. C2LSH creates only one hash function per hash table, and hence the number of hash functions is equal to the number of hash tables.

Instead of assuming a magic radius (which traditional LSH methods did), C2LSH sets the initial radius $R$ to 1. It is possible that with $R = 1$ there are not enough results for a top-$k$ query to be returned. C2LSH increases the radius of the query in the following sequence: $R = 1, c, c^2, c^3, \dots$. If at level-$R$ enough candidates are not found, the radius is increased until enough query results are found. This exponential expansion process is called Virtual Rehashing. Moreover, C2LSH uses two terminating conditions to stop the algorithm when they are met. These conditions specify that 1) at the end of each virtual rehashing round, at least $k$ candidates have been found whose Euclidean distance to the query is less than or equal to $cR$, and 2) at any point, $k + \beta n$ candidates have been found.

QALSH [10] introduces query-aware hash functions $h_{\vec{a}}(x) = \vec{a} \cdot x$. For a query $q$, once the query projection is found by computing $h_{\vec{a}}(q)$, QALSH uses the query as the "anchor" of the anchor bucket of width $w$, i.e., the interval $[h_{\vec{a}}(q) - w/2,\, h_{\vec{a}}(q) + w/2]$. If the projected location of a point $o$ falls in the same anchor bucket as $q$, i.e., $|h_{\vec{a}}(o) - h_{\vec{a}}(q)| \le w/2$, then QALSH considers that $o$ has collided with $q$ under $h_{\vec{a}}$. QALSH also utilizes the concepts of Collision Counting and Virtual Rehashing. Another main difference of QALSH is that it uses B+-trees to represent the hash tables; an exponential expansion in each hash table is thus the same as a range query on a B+-tree. By using query-aware hash functions and B+-trees, QALSH improves the theoretical bounds by reducing the total number of hash functions required to satisfy the quality guarantee. Additionally, QALSH can work for any approximation ratio $c$ greater than 1, while C2LSH can only work for $c \ge 2$.
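To illustrate how Collision Counting and Virtual Rehashing interact, the following is a simplified in-memory C++11 sketch of candidate generation, under our reading of the scheme; the real implementations are external memory-based, and pointsInBucket() is a hypothetical helper (not part of the original codebases) returning the ids of points whose level-R bucket in a given hash table matches the query's.

#include <vector>

// Hypothetical helper (assumed here for illustration): ids of points whose
// level-R bucket in hash table t equals the given bucket id.
std::vector<int> pointsInBucket(int t, long bucket);

// One round of Collision Counting at radius R over all m hash functions.
// A point becomes a candidate once it collides with the query in at least
// l of the m hash layers at the current radius. Integer division stands in
// for the floor-based level-R bucket id of the sensitivity statement above.
std::vector<int> collisionCounting(const std::vector<long>& queryHash,
                                   int m, int l, long R, int n) {
    std::vector<int> collisions(n, 0);
    for (int t = 0; t < m; ++t)
        for (int id : pointsInBucket(t, queryHash[t] / R))
            ++collisions[id];
    std::vector<int> candidates;
    for (int id = 0; id < n; ++id)
        if (collisions[id] >= l) candidates.push_back(id);
    return candidates;
}

// Virtual Rehashing: retry with exponentially growing radius R = 1, c, c^2,...
// The real algorithm stops using the two terminating conditions described
// above; this sketch only checks the candidate count.
std::vector<int> virtualRehashing(const std::vector<long>& queryHash,
                                  int m, int l, int c, int n, int k) {
    std::vector<int> candidates;
    for (long R = 1; static_cast<int>(candidates.size()) < k; R *= c)
        candidates = collisionCounting(queryHash, m, l, R, n);
    return candidates;
}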
5 Experimental Evaluation

In this section, we first explain our carefully designed experimental evaluation plan. We experimentally analyze C2LSH, QALSH, and I-LSH on different datasets and report the results for varying criteria. All experiments were run on the nodes of the Bigdat cluster with the following specifications: two Intel Xeon E5-2695 processors, 256 GB RAM, and the CentOS 6.5 operating system. All code was written in C++11 and compiled with gcc v4.7.2 with the -O3 optimization flag. As mentioned in Section 1.4, we extend the implementations of C2LSH and QALSH to be completely external memory-based (i.e. neither the entire dataset nor the index files need to be in the main memory in order to construct the LSH indexes).

5.1 Datasets

We use the following six diverse high-dimensional datasets with varying cardinality and dimensionality:
– P53 [6] consists of 31,002 5409-dimensional points which are generated based on the biophysical features of mutant p53 proteins and can be used to predict p53 transcriptional activity. The values of this dataset are normalized between zero and 10,000, and duplicate rows are removed.
– LabelMe [20] consists of 181,093 512-dimensional points which were generated by running the GIST feature extraction algorithm on 30,369 annotated images belonging to 183 categories. There are no duplicates in the dataset and values range between zero and 58,104.
– Sift1M [11] consists of 1,000,000 128-dimensional points that were created by running the SIFT feature extraction algorithm on real images. The values of this dataset are integers between zero and 218.
– Deep1M consists of 1,000,000 96-dimensional points sampled from the Deep1B dataset introduced in [2]. These points are extracted from the last layers of convolutional neural networks for images.
– Mnist8M [18] This dataset, also known as the InfiMNIST dataset, contains 8,100,000 784-dimensional points that represent grayscale images of the digits 0 to 9 of size 28 × 28.
– Tiny80M [25] This dataset contains 79,302,017 384-dimensional points generated using the GIST feature extraction algorithm on 80 million 32 × 32 colored images; its values are normalized between zero and 255.
All datasets are normalized to contain only integers, since C2LSH requires the data format to be integers [8]; a sketch of this preprocessing step follows this list.
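As a concrete illustration of this preprocessing step, the following is a minimal sketch of a per-dimension min-max rescaling to integers in [0, maxValue]; the exact scaling applied per dataset may differ, so treat this as one possible realization rather than our exact pipeline.

#include <algorithm>
#include <cstddef>
#include <vector>

// Rescale real-valued features to integers in [0, maxValue] per dimension
// (e.g. maxValue = 10000 for P53). A sketch of one possible normalization.
std::vector<std::vector<int> > normalizeToIntegers(
        const std::vector<std::vector<double> >& data, int maxValue) {
    std::vector<std::vector<int> > out(
        data.size(),
        std::vector<int>(data.empty() ? 0 : data[0].size(), 0));
    if (data.empty()) return out;
    for (std::size_t j = 0; j < data[0].size(); ++j) {
        double lo = data[0][j], hi = data[0][j];
        for (std::size_t i = 0; i < data.size(); ++i) {
            lo = std::min(lo, data[i][j]);
            hi = std::max(hi, data[i][j]);
        }
        double range = (hi > lo) ? (hi - lo) : 1.0;  // avoid division by zero
        for (std::size_t i = 0; i < data.size(); ++i)
            out[i][j] = static_cast<int>((data[i][j] - lo) / range * maxValue);
    }
    return out;
}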
5.2 Evaluation Criteria and Parameters

The goal of our paper is to present a detailed analysis of the performance of the state-of-the-art LSH techniques. We also compare the accuracy of these algorithms. We randomly choose 50 queries from each dataset and report the average of the results of these 50 queries. We use the same parameters suggested in the respective papers (w = 2.781 for QALSH and w = 2.184 for C2LSH). We choose δ as suggested in the original papers, and c = 2 (since C2LSH cannot give guarantees for c < 2). We break the overall query processing cost into the following components, which together constitute the Query Processing Time (QPT):
– Index Read Cost: LSH techniques need to read index files (from the external memory) in order to find the candidates. This dominant cost of reading index files can be further broken down into the number of disk seeks (i.e. random I/Os) and the total amount of data read. Following [16], we consider both the number of disk seeks and the amount of data read in our cost formulation.
– Algorithm Time: Another dominant cost in LSH processing is the processing of the index files once they are read into the main memory. LSH techniques need to find the points that are considered candidates. Techniques such as Collision Counting (explained in Section 4) are included in this cost.
– False Positive Removal Cost: Once a point is deemed a candidate, the LSH technique brings the actual data point into the main memory (resulting in a random seek) to calculate its Euclidean distance to the query point. Since the state-of-the-art LSH techniques have an upper bound on the number of candidates that are generated (which is set to k + 100), this cost is negligible as compared to the previous two costs.

It is well known that random I/Os are much more expensive than sequential I/Os [13]. Additionally, the difference in cost changes significantly depending on whether the external storage medium is an HDD or an SSD. The difference between the costs of random I/Os and sequential I/Os is significantly larger on HDDs than on SSDs (mainly because random disk seeks are faster on SSDs than on HDDs) [12]. We noticed that the numbers of disk seeks differ significantly among these state-of-the-art LSH techniques, due to their strategies for finding neighboring points in the projected spaces. Hence, we model the overall Query Processing Time (QPT) for both HDDs and SSDs. For an HDD, we use the reported benchmarks for a Seagate Barracuda HDD with 7200 RPM and 1 TB: an average disk seek requires 8.5 ms and the average data read rate is 0.156 MB/ms [22]. Similarly, for an SSD, we use the reported benchmarks for the Seagate Barracuda 120 SSD with 1 TB storage: an average disk seek requires 0.01 ms and the average data read rate is 0.56 MB/ms [21].

We use the same accuracy measure, the overall ratio, used in several prior works [8,10,17,16]: $\frac{1}{k} \sum_{i=1}^{k} \frac{\|o_i, q\|}{\|o_i^*, q\|}$. Here, $o_i$ is the $i$-th point returned by the technique and $o_i^*$ is the true $i$-th nearest point from $q$ (ground truth). A ratio of 1 means the returned results have the same distances from the query as the ground truth; the closer the ratio is to 1, the higher the accuracy of the LSH technique.
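Concretely, the following sketch (our illustration, using the drive benchmarks cited above) shows how the measured counters combine into an overall QPT estimate, together with the overall ratio computation:

#include <cstddef>
#include <vector>

// Drive profiles from the benchmarks cited above ([22] for the HDD,
// [21] for the SSD). Times are in milliseconds, data sizes in MB.
struct DriveProfile {
    double seekMs;       // average cost of one random disk seek
    double readMBperMs;  // average sequential data read rate
};
const DriveProfile kHDD = {8.5, 0.156};
const DriveProfile kSSD = {0.01, 0.56};

// Overall Query Processing Time: index read cost (random seeks plus
// sequential reads) plus the algorithm time and the (negligible) false
// positive removal time.
double queryProcessingTimeMs(const DriveProfile& d, std::size_t diskSeeks,
                             double dataReadMB, double algorithmTimeMs,
                             double falsePositiveRemovalMs) {
    double indexReadMs = diskSeeks * d.seekMs + dataReadMB / d.readMBperMs;
    return indexReadMs + algorithmTimeMs + falsePositiveRemovalMs;
}

// Overall ratio: (1/k) * sum_i ||o_i, q|| / ||o_i*, q||. returnedDists holds
// the distances of the k returned points to the query; trueDists holds the
// ground-truth k-NN distances.
double overallRatio(const std::vector<double>& returnedDists,
                    const std::vector<double>& trueDists) {
    double sum = 0.0;
    for (std::size_t i = 0; i < returnedDists.size(); ++i)
        sum += returnedDists[i] / trueDists[i];
    return sum / static_cast<double>(returnedDists.size());
}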
Number of Disk Seeks: Figure 1 shows the number of disk seeks (random I/Os) required by the evaluated techniques. The interesting observation is that I-LSH performs the best for the P53, LabelMe, Sift, and Deep datasets. However, its performance degrades as the dataset size becomes large (i.e. greater than approximately one million points). This is because I-LSH needs to find the closest projected point each time the radius needs to be expanded, which further requires reading the indexed points from the disk several times. We also observe that QALSH performs better than C2LSH for smaller datasets (i.e. P53), but as the dataset size (number of points) increases, its number of seeks becomes significantly higher than that of C2LSH and I-LSH. This happens because the search radii of QALSH are larger than those of C2LSH on larger datasets, which results in more radius expansions, which in turn results in more disk seeks.

[Fig. 1: Number of Disk Seeks (Y axis) for different k (X axis) on the six datasets (P53, LabelMe, Sift, Deep, Mnist, Tiny), comparing C2LSH, QALSH, and I-LSH.]
[Fig. 2: Amount of Data Read (in MB) (Y axis) for different k (X axis) on the six datasets, comparing C2LSH, QALSH, and I-LSH.]

Amount of Data Read:
Figure 2 shows the total amount of data read from the index files. I-LSH always has the least amount of data read for all datasets, because it incrementally searches for the nearest points in the projections instead of using buckets with fixed widths. However, we later show that these I/O savings are offset by the processing time of finding these nearest points. C2LSH reads more data than QALSH for most datasets (except Mnist) because it has more projections to process (QALSH uses fewer hash projections because they are query-aware).
Algorithm Time:
Figure 3 shows the time needed by an algorithm to find the candidates (excluding the I/O times). This figure shows the huge overhead of I-LSH, which is caused by its incremental search for the nearest projected neighbors. Also, since I-LSH and QALSH both use B+-trees, which become huge for the larger datasets, their performance degrades heavily in these cases while searching for candidates. Since C2LSH does not have the overhead of additional index structures (such as B+-trees), it has the least Algorithm Time for all datasets. In terms of Algorithm Time, I-LSH is faster than QALSH (except for the P53 dataset, which is the smallest dataset in our experiments), mainly because it has to process fewer hash functions than QALSH [16].

[Fig. 3: Algorithm Time (in s) (Y axis) for different k (X axis) on the six datasets, comparing C2LSH, QALSH, and I-LSH.]

[Fig. 4: HDD Query Processing Time (in s) (Y axis) for different k (X axis) on the six datasets, comparing C2LSH, QALSH, and I-LSH.]

False Positive Removal Time:
We also analyzed the time it takes to read the actual data points from the external memory in order to calculate their Euclidean distances to the query (for removing false positives). Since all three algorithms have an upper bound on the number of candidates they produce (k + 100), all algorithms took similar time, which was less than 0.5 ms. Due to space limitations, we do not show these results.

[Fig. 5: SSD Query Processing Time (in s) (Y axis) for different k (X axis) on the six datasets, comparing C2LSH, QALSH, and I-LSH.]

[Fig. 6: Accuracy Ratio (Y axis) for different k (X axis) on the six datasets, comparing C2LSH, QALSH, and I-LSH.]

Query Processing Time (on HDD):
Figure 4 shows the overall time required to solve a given k-NN query on a Hard Disk Drive. I-LSH performs the best for the smaller datasets (P53 and LabelMe) because its Algorithm Time overhead is small, but as the dataset size increases, the Algorithm Time overhead offsets the savings in disk seeks, and I-LSH performs worse than C2LSH (but better than QALSH). Except for the smallest dataset (P53), QALSH is the slowest of the three algorithms: it works well for smaller datasets but does not scale well to moderate and large datasets. For larger datasets, C2LSH is always the fastest technique, since it has a better algorithm time and fewer disk seeks than the other two algorithms.
Query Processing Time (on SSD):
Figure 5 shows the overall time required to solve a given k-NN query on a Solid State Drive. On SSDs, I/O operations are much faster and the overall Query Processing Time is mainly dominated by the algorithm time. Therefore, C2LSH (which has the best Algorithm Time) always performs the best on SSDs (for all datasets), followed by I-LSH (except for the smallest dataset, P53).
Accuracy Ratio:
Figure 6 shows the accuracy of the compared techniques; a ratio of 1 corresponds to the highest accuracy. Except for the Mnist dataset, C2LSH produces the best accuracy among the three algorithms. QALSH is more accurate than I-LSH, which we believe is mainly because it uses more hash functions than I-LSH. Except for C2LSH's accuracy on the Mnist dataset, all three algorithms produce accurate results for all datasets.

Overall, we find that C2LSH can find k-NN results faster than QALSH and I-LSH. Additionally, all three algorithms produce accurate results (with C2LSH producing slightly more accurate results than QALSH and I-LSH for most datasets).
6 Conclusion

Approximate similarity search in high-dimensional spaces is an important problem in many diverse domains. In this paper, we focused on Locality Sensitive Hashing based techniques and presented a detailed experimental analysis of three well-known LSH algorithms: C2LSH, QALSH, and I-LSH. For this analysis, we used datasets of various sizes and several important evaluation metrics. The results showed that a technique that performs well on smaller datasets may not scale and work well on larger datasets. We also observed that improvements in one portion of the LSH process (e.g. I/O operations) do not necessarily result in overall improvements. Thus, trade-offs and different evaluation metrics should always be considered when comparing different techniques. In the future, we plan to also analyze the effect of changing the user-defined parameters on the performance of the different techniques.
References
1. Arora, A., et al., "HD-Index: Pushing the scalability-accuracy boundary for approximate kNN search in high-dimensional spaces," VLDB 2018.
2. Babenko, A., Lempitsky, V., "Efficient indexing of billion-scale datasets of deep descriptors," CVPR 2016.
3. Bawa, M., et al., "LSH Forest: Self-tuning indexes for similarity search," WWW 2005.
4. Böhm, C., et al., "Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases," CSUR 2001.
5. SISAP.
6. Danziger, S.A., et al., "Predicting positive p53 cancer rescue regions using Most Informative Positive (MIP) active learning," PLoS Computational Biology 2009.
7. Datar, M., et al., "Locality-sensitive hashing scheme based on p-stable distributions," SOCG 2004.
8. Gan, J., et al., "Locality-sensitive hashing scheme based on dynamic collision counting," SIGMOD 2012.
9. Gionis, A., et al., "Similarity search in high dimensions via hashing," VLDB 1999.
10. Huang, Q., et al., "Query-aware locality-sensitive hashing for approximate nearest neighbor search," VLDB 2015.
11. Jégou, H., et al., "Product quantization for nearest neighbor search," TPAMI 2011.
12. HILDA.
13. VLDB.
14. ICDE.
15. Li, W., et al., "Approximate nearest neighbor search on high dimensional data — experiments, analyses, and improvement," TKDE 2019.
16. Liu, W., et al., "I-LSH: I/O efficient c-approximate nearest neighbor search in high-dimensional space," ICDE 2019.
17. Liu, Y., et al., "SK-LSH: An efficient index structure for approximate nearest neighbor search," VLDB 2014.
18. Loosli, G., et al., "Training invariant support vector machines using selective sampling," Large Scale Kernel Machines 2007.
19. Lv, Q., et al., "Multi-probe LSH: Efficient indexing for high-dimensional similarity search," VLDB 2007.
20. Russell, B.C., et al., "LabelMe: A database and web-based tool for image annotation," IJCV 2008.
21. Seagate Barracuda 120 SSD Manual.
22. Seagate ST2000DM001 Manual.
23. Sun, Y., et al., "SRS: Solving c-approximate nearest neighbor queries in high dimensional Euclidean space with a tiny index," VLDB 2014.
24. TODS.
25. Torralba, A., et al., "80 million tiny images: A large data set for nonparametric object and scene recognition," TPAMI 2008.
26. Zheng, B., et al., "PM-LSH: A fast and accurate LSH framework for high-dimensional approximate NN search," VLDB 2020.