A Survey on Locality Sensitive Hashing Algorithms and their Applications
Omid Jafari, Preeti Maurya, Parth Nagarkar, Khandker Mushfiqul Islam, Chidambaram Crushev
OMID JAFARI, New Mexico State University, USA
PREETI MAURYA, New Mexico State University, USA
PARTH NAGARKAR, New Mexico State University, USA
KHANDKER MUSHFIQUL ISLAM, New Mexico State University, USA
CHIDAMBARAM CRUSHEV, New Mexico State University, USA
Finding nearest neighbors in high-dimensional spaces is a fundamental operation in many diverse application domains. Locality Sensitive Hashing (LSH) is one of the most popular techniques for performing approximate nearest neighbor searches in high-dimensional spaces. The main benefits of LSH are its sub-linear query performance and theoretical guarantees on the query accuracy. In this survey paper, we provide a review of state-of-the-art LSH and Distributed LSH techniques. Most importantly, unlike any prior survey, we present how Locality Sensitive Hashing is utilized in different application domains.

CCS Concepts: •
General and reference → Surveys and overviews.

Additional Key Words and Phrases: Locality Sensitive Hashing, Approximate Nearest Neighbor Search, High-Dimensional Similarity Search, Indexing
Finding nearest neighbors in high-dimensional spaces is an important problem in several diverse applications, such as multimedia retrieval, machine learning, biological and geological sciences, etc. For low-dimensional spaces, traditional index structures are effective, but in high-dimensional spaces they suffer from the curse of dimensionality (where the performance of these index structures is often out-performed even by linear scans) [21]. Instead of searching for exact results, one solution to address the curse of dimensionality problem is to look for approximate results. In many applications where 100% accuracy is not needed, searching for results that are close enough is much faster than searching for exact results [30]. Approximate solutions trade off accuracy for much faster performance. The goal of the approximate version of the nearest neighbor problem, also called c-approximate Nearest Neighbor search, is to return objects that are within 𝑐 × 𝑅 distance from the query object (where 𝑐 > 1 is a user-defined approximation ratio and 𝑅 is the distance of the query object from its nearest neighbor). Locality Sensitive Hashing [30] is one of the most popular solutions for the approximate nearest neighbor (ANN) problem in high-dimensional spaces. Locality Sensitive Hashing (LSH) maps high-dimensional data to lower-dimensional representations by using random hash functions. Data points are assigned to individual hash buckets in each hash function. The idea behind this approach is that data points that are close in the original high-dimensional space will be mapped to the same hash buckets in the lower-dimensional projected space with a high probability. Since it was first
introduced in [38], many variants of Locality Sensitive Hashing have been proposed [11, 36, 46, 73, 75, 77, 98, 125] that mainly focus on improving the search accuracy and/or the search performance of the given queries. The performance/accuracy trade-off of a query is determined by a user-provided success guarantee (a high success guarantee returns results with high accuracy at the expense of slower query performance, and vice-versa).
Locality Sensitive Hashing (LSH) is known for two main advantages: its sub-linear query performance (in terms of the data size) and theoretical guarantees on the query accuracy. Additionally, LSH uses random hash functions that are data-independent (i.e. data properties such as the data distribution are not needed to generate these hash functions). Hence, in applications where data is changing or where newer data is coming in, these hash functions do not require any change during runtime. While the original LSH index structure suffered from large index sizes (in order to obtain a high query accuracy) [11, 77], state-of-the-art LSH techniques [36, 46] have alleviated this issue by using advanced methods such as
Collision Counting and Virtual Rehashing. Hence, owing to their small index sizes, fast index maintenance, fast query performance, and theoretical guarantees on the query accuracy, Locality Sensitive Hashing is still considered an important technique for solving the Approximate Nearest Neighbor problem.
LSH-based algorithms are used in several application domains such as content-based multimedia retrieval systems, computational biological/medical studies, earth sciences, information retrieval tasks, etc. Most of these works base their methods on the original Euclidean distance based LSH (E2LSH [30]). However, E2LSH has several drawbacks, and one goal of this survey paper is to show the workflow of several other LSH-based algorithms. This work will help these application domains improve their efficiency by changing their base algorithm. There have been several surveys on approximate nearest neighbor search methods, such as [7, 17, 24, 72, 105, 106], that have reviewed some of the hashing-based algorithms. In [105], the authors review hashing-based and quantization-based methods for solving similarity search problems. Moreover, for each of the methods, various aspects such as the hash functions, distance measures, and search techniques are also reviewed. [24] reviews hashing-based techniques used in domains such as information systems. Moreover, it categorizes the techniques into two major groups: data-oriented hashing and security-oriented hashing. [106] reviews hashing-based methods, specifically the learning-to-hash and quantization-based solutions to similarity search problems. Learning-to-hash methods are data-dependent techniques that aim to learn hash functions from a specific given dataset. [7] presents a tool for benchmarking in-memory approximate nearest neighbor search algorithms. Moreover, several graph-based, tree-based, and LSH-based algorithms are experimentally compared using real datasets. In [72], the authors conduct an experimental study over several LSH-based, learning-to-hash, partition-based, and graph-based algorithms.
Another experimental study is presented in [17] that compares tree-based, hashing-based, quantization-based, and graph-based methods using real datasets. As mentioned earlier, previous works have reviewed some of the LSH-based techniques; however, they do not provide an extensive review of LSH-based techniques. Different from previous works, in this survey paper, we focus only on LSH-based techniques for solving the ANN problem, and we review the latest advances in LSH-based techniques. Moreover, in this survey paper, we review two aspects that, to the best of our knowledge, are not included in any other survey papers: Distributed LSH frameworks and applications of LSH-based techniques in various diverse domains.
In this paper, we present an in-depth review of the recent advances in Locality Sensitive Hashing techniques. Our contributions are listed as follows:
• We perform an in-depth review of LSH-based techniques by categorizing them based on the hash family that they use and explaining their workflow.
• Different distributed frameworks have been proposed to improve the processing speed of LSH algorithms. In this paper, we review these distributed frameworks and present an overview of their architecture.
• LSH-based algorithms are used in various application domains. In this survey, we categorize the application domains and explain how LSH-based algorithms are utilized in each of them.
The remainder of the paper is organized as follows: In Section 2, we provide background information related to LSH. Section 3 presents a detailed review of LSH-based algorithms that are proposed to solve and improve the approximate nearest neighbor search problem. Section 4 presents a detailed review of the distributed frameworks for LSH-based algorithms. In Section 5, we present the works in different application domains that use Locality Sensitive Hashing. Finally, we conclude the paper in Section 6.
In this section, we describe the key concepts behind Approximate Nearest Neighbor (ANN) search. Note that, while there are several space partitioning and graph-based methods that also tackle the ANN problem, our focus in this paper is specifically on Locality Sensitive Hashing-based methods. We refer the reader to [24] for discussions on non-LSH based methods. Given a dataset D with 𝑛 points and 𝑑 dimensions and a query point 𝑞 in the same space as the dataset, the goal of 𝑐-ANN search (where 𝑐 > 1) is to return a point 𝑜 ∈ D such that 𝑑𝑖𝑠𝑡(𝑜, 𝑞) ≤ 𝑐 × 𝑑𝑖𝑠𝑡(𝑜∗, 𝑞), where 𝑜∗ is the true nearest neighbor of 𝑞 in D and 𝑑𝑖𝑠𝑡 is the distance between the two points. Similarly, 𝑐-𝑘-ANN search aims at returning the top-𝑘 points such that 𝑑𝑖𝑠𝑡(𝑜𝑖, 𝑞) ≤ 𝑐 × 𝑑𝑖𝑠𝑡(𝑜∗𝑖, 𝑞), where 1 ≤ 𝑖 ≤ 𝑘. Hashing-based methods try to find the nearest neighbors in high-dimensional datasets by projecting them into one or more low-dimensional spaces using hash functions. LSH is a well-known hashing-based method that creates the low-dimensional projections such that the localities of the original space are preserved in them (i.e. two nearby points in the original space are also nearby in the projected space). For two points 𝑥 and 𝑦 in a 𝑑-dimensional dataset 𝐷 ⊂ R𝑑, we say a hash function 𝐻 is (𝑅, 𝑐𝑅, 𝑝1, 𝑝2)-sensitive if it satisfies the following two conditions:
• if |𝑥 − 𝑦| ≤ 𝑅, then 𝑃𝑟[𝐻(𝑥) = 𝐻(𝑦)] ≥ 𝑝1, and
• if |𝑥 − 𝑦| > 𝑐𝑅, then 𝑃𝑟[𝐻(𝑥) = 𝐻(𝑦)] ≤ 𝑝2.
Here, 𝑐 is an approximation ratio and 𝑝1 and 𝑝2 are probabilities. In order for this definition to work, 𝑐 > 1 and 𝑝1 > 𝑝2. The definition states that two points 𝑥 and 𝑦 are hashed to the same bucket in the projection with a high probability (≥ 𝑝1) if they are close to each other, and if they are not close to each other, then they will be hashed to the same bucket with a low probability (≤ 𝑝2).
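As a concrete check of this definition, the bit-sampling family for the Hamming distance (described in the next section) can be simulated: sampling a random coordinate gives a collision probability of 1 − 𝑟/𝑑, so points within 𝑅 collide noticeably more often than points beyond 𝑐𝑅. This is an illustrative sketch, not code from any of the surveyed papers.

```python
import random

# Bit-sampling LSH for the Hamming distance: H(x) = x[i] for a random coordinate i.
# For points at Hamming distance r in d dimensions, Pr[H(x) = H(y)] = 1 - r/d,
# which makes the family (R, cR, 1 - R/d, 1 - cR/d)-sensitive.

def collision_rate(x, y, trials=20000, seed=0):
    rng = random.Random(seed)
    d = len(x)
    hits = 0
    for _ in range(trials):
        i = rng.randrange(d)        # pick a random hash function H(x) = x[i]
        hits += (x[i] == y[i])
    return hits / trials

d = 32
x = [0] * d
near = [1] * 4 + [0] * (d - 4)      # Hamming distance r = 4:  expect about 1 - 4/32  = 0.875
far = [1] * 16 + [0] * (d - 16)     # Hamming distance r = 16: expect about 1 - 16/32 = 0.5

print(round(collision_rate(x, near), 2), round(collision_rate(x, far), 2))
```

The measured collision rates closely track 1 − 𝑟/𝑑, which is exactly the gap between 𝑝1 and 𝑝2 that LSH amplifies by combining hash functions.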
Next, we present the popular hash function families for the Hamming, Minkowski, Angular, and Jaccard distances. For the Hamming metric, [48] defined the LSH function as 𝐻(𝑥) = 𝑥𝑖, where 𝑥𝑖 is the 𝑖-th dimension of the point 𝑥 (𝑖 ∈ [1, 𝑑]). Therefore, for two points 𝑥 and 𝑦 with a Hamming distance of 𝑟, the probability that they have the same hash value is 𝑃𝑟[𝐻(𝑥) = 𝐻(𝑦)] = 1 − 𝑟/𝑑. For the Minkowski metric, [30] defined the LSH functions as 𝐻𝑎,𝑏(𝑥) = ⌊(𝑎 · 𝑥 + 𝑏)/𝑤⌋, where 𝑎 is a 𝑑-dimensional random vector chosen from the standard 𝑝-stable distribution and 𝑏 is a real number chosen uniformly from [0, 𝑤), such that 𝑤 is the width of the hash bucket. To generate 𝑎, the Cauchy, Gaussian (Normal), and Lévy distributions are used for 𝑝 = 1, 𝑝 = 2, and 𝑝 = 1/2, respectively. Therefore, for two points 𝑥 and 𝑦 with a Minkowski distance of 𝑟, the probability that they have the same hash value is 𝑃𝑟[𝐻(𝑥) = 𝐻(𝑦)] = ∫₀^𝑤 (1/𝑟) 𝑓𝑝(𝑡/𝑟)(1 − 𝑡/𝑤) 𝑑𝑡. Here, 𝑓𝑝 is the density function of the 𝑝-stable distribution. For the Angular metric, [20] defined the LSH functions as 𝐻(𝑥) = 𝑠𝑔𝑛(𝑎 · 𝑥), where 𝑠𝑔𝑛 is the sign function and 𝑎 is a random vector drawn from the Normal distribution. In this case, for two points 𝑥 and 𝑦 with 𝜃 defined as the angle between them, the probability that they have the same hash value is 𝑃𝑟[𝐻(𝑥) = 𝐻(𝑦)] = 1 − 𝜃/𝜋. Finally, for the Jaccard metric, [14] defined the LSH functions as 𝐻(𝑥) = 𝑚𝑖𝑛{𝜋(𝑥𝑖)}, where 𝑥𝑖 ∈ 𝑥 and 𝜋 is a random permutation from the set of all possible permutations. Therefore, for two points 𝑥 and 𝑦 with a Jaccard similarity of 𝐽, the probability that they have the same hash value is 𝑃𝑟[𝐻(𝑥) = 𝐻(𝑦)] = 𝐽.

In this section, we present our detailed review of LSH techniques. The main benefits of LSH are its sub-linear query time and the theoretical guarantees provided on the query accuracy. We classify different LSH techniques based on the distance hash function family (Section 2), in particular, the Hamming distance, Minkowski distance, Angular distance, and Jaccard distance.
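To make the 2-stable (Euclidean) family defined above concrete, the following sketch builds an E2LSH-style index: each of 𝐿 tables is keyed by a compound hash 𝑔(𝑥) = (ℎ1(𝑥), ..., ℎ𝑘(𝑥)) of 𝑘 base hash functions, and a query retrieves candidates from its own bucket in every table. All names and parameter values here are illustrative, not taken from the E2LSH package.

```python
import math
import random
from collections import defaultdict

def pstable_hash(d, w, rng):
    """One 2-stable hash H_{a,b}(x) = floor((a.x + b) / w), a ~ Normal, b ~ U[0, w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)
    return lambda x: math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / w)

class E2LSHIndex:
    def __init__(self, d, k=4, L=8, w=4.0, seed=0):
        rng = random.Random(seed)
        # L compound hash functions, each a list of k base hash functions
        self.funcs = [[pstable_hash(d, w, rng) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def insert(self, pid, x):
        for g, table in zip(self.funcs, self.tables):
            table[tuple(h(x) for h in g)].append(pid)

    def candidates(self, q):
        # union of the query's buckets across all L tables
        out = set()
        for g, table in zip(self.funcs, self.tables):
            out.update(table.get(tuple(h(q) for h in g), []))
        return out

index = E2LSHIndex(d=2)
index.insert("near", [0.1, 0.1])
index.insert("far", [100.0, 100.0])
print(index.candidates([0.0, 0.0]))  # the nearby point is almost surely retrieved
```

Concatenating 𝑘 hash functions lowers the collision probability of far points dramatically, while using 𝐿 independent tables restores a high collision probability for near points; this trade-off is what drives the large index sizes discussed above.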
Locality Sensitive Hashing was first proposed in [48] for the Hamming distance to solve the (𝑅, 𝑐)-near neighbor search problem. The proposed method uses multiple hash functions and hash tables to be able to guarantee a good search quality. Moreover, the authors theoretically find the optimal number of hash functions and hash tables in order to have constant hashing probabilities. Boosted LSH (BLSH) is proposed in [58]. This method trains linear classifiers sequentially using an adaptive boosting paradigm, which reduces the redundancy in LSH projections. Further, BLSH is experimentally shown to have comparable performance against other machine/deep learning approaches in speech enhancement applications.

E2LSH [30] is the first LSH method proposed for the Euclidean distance. The main idea of E2LSH is to use multiple hash tables and compound hash functions to increase the collision chance of two nearby points. By using multiple hash tables and multiple hash functions, E2LSH reduces the number of false positives and false negatives while keeping the accuracy high. In [77], hash-perturbation LSH is proposed, which perturbs the hash values on the projections to reduce the large number of required hash tables in the basic LSH. Multi-probe LSH [78] uses the intuition that nearest neighbors are more likely to be hashed to close-by buckets and intelligently looks into multiple neighboring buckets. Moreover, it assigns a distance score to each bucket, and later, the buckets are accessed in increasing order of their scores. By using this strategy, Multi-probe LSH requires fewer hash tables while achieving the same accuracy. The authors in [54] improve on multi-probe LSH [78] and utilize probabilistic approaches instead of likelihood. This is done by estimating the distribution of the neighboring points of a query.
Using this approach, a probabilistic score is created that is used in the multi-probe search. In [50], the authors propose a query-adaptive hashing method that, in the query processing phase, uses an expected accuracy measurement to choose the hash functions that are most appropriate for the given query. By using this query-adaptive strategy, the proposed technique improves the accuracy of the results. In [18], the authors propose a framework that learns from dataset characteristics (i.e. density) to intelligently choose LSH hash functions and thus improve the accuracy. The idea is to have fine/smaller buckets in the dense areas of the dataset and coarse/larger buckets in the sparse areas. To this end, dynamic programming is used at each dataset dimension to repeatedly divide the dataset points into left and right bins until the desired size is reached. [125] proposes a data-dependent technique that uses Principal Components Analysis to lower the dimensions of the dataset such that a uniform dataset is generated. The proposed method generates projections that are more uniform than random projections. The authors in [29] use Hadamard transformations to better estimate the distances in the Euclidean and angular spaces and reduce the running time of LSH methods. They propose two methods called ACHash and DHHash, where the former uses one Hadamard transformation and the latter uses two Hadamard transformations. In C2LSH [36], there is only one hash table and 𝑚 random hash functions (also called projections). Each projection is divided into buckets of width 𝑤 and dataset points are hashed into each projection using the hash functions. Moreover, an approach called “collision counting" is proposed that counts the number of times a hashed point is mapped to the same bucket as the query point, and as soon as 𝑙 collisions occur for any point, that point is considered a candidate.
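The collision counting idea can be sketched in a few lines: each of the 𝑚 projections yields one bucket id per point, and a point qualifies as a candidate once it shares the query's bucket in at least 𝑙 projections. This is an illustrative sketch of the idea, not the C2LSH implementation.

```python
import math
import random

def make_projections(d, m, w, seed=0):
    """m independent 2-stable projections, each a (random vector, random offset) pair."""
    rng = random.Random(seed)
    return [([rng.gauss(0.0, 1.0) for _ in range(d)], rng.uniform(0.0, w))
            for _ in range(m)]

def bucket(x, proj, w):
    a, b = proj
    return math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / w)

def collision_count(x, q, projections, w):
    """Number of projections in which x falls into the same bucket as q."""
    return sum(bucket(x, p, w) == bucket(q, p, w) for p in projections)

projs = make_projections(d=2, m=20, w=4.0)
q = [0.0, 0.0]
near_count = collision_count([0.2, 0.1], q, projs, 4.0)
far_count = collision_count([50.0, 50.0], q, projs, 4.0)
print(near_count, far_count)  # the near point collides in far more projections
```

A point whose count reaches the threshold 𝑙 would be added to the candidate set; when too few candidates are found, the search radius is widened (equivalently, adjacent buckets are merged) and counting continues.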
C2LSH has two stopping conditions: 1) 𝑘 + 𝛽𝑛 candidates are found, and 2) 𝑘 true positives are found. The true positives are found by calculating the Euclidean distance of the candidates to the query and checking whether the distance is less than 𝑐𝑅. If the stopping conditions are not met, the algorithm increases the projection search radius exponentially each time (e.g. looking for collisions in two buckets instead of one bucket). Bi-level LSH [84] is proposed to improve the accuracy and runtime of searches. It first partitions the dataset into random sub-groups using an RP-tree; then, it creates a single hash table for each sub-group and a hierarchical structure based on space-filling curves (Z-order curves). Boundary-expanding locality sensitive hashing [108] works on the problem of nearest points being hashed into the boundaries of different buckets. It overcomes this problem by expanding the boundaries of each bucket, thus increasing the collision probability of neighbors. The motivation of [66] is that sometimes points are projected to the boundaries of the buckets, which makes them false negatives or false positives depending on the query bucket. To solve this issue, the authors introduce the concept of projecting a point into more than one bucket using three hash functions that map the point to the 1) current bucket, 2) bucket to the left, and 3) bucket to the right. In [39], the authors focus on the false positive removal process, which requires Euclidean distance computations. To make the false positive removal faster in cases where many candidates are generated, the authors use the triangle inequality and a pivot-based algorithm to further prune the candidates. Finally, they experimentally show the speedup of their proposed method. Dynamic Multi-probe LSH [117] is proposed to optimize I/O efficiency by dynamically changing the number of hash functions of each bucket such that the bucket fits a single disk page.
A B+-tree is used for this process, and as a result of this technique, buckets with varying granularity are generated. In [124], a series of data-adaptive projections is generated: linear projections are used first to reduce the dimensionality of the dataset, and then the distribution of the lower-dimensional data is used to generate the final hash functions. Further, an improved multi-probe strategy is proposed to improve the performance. In [10], a data-dependent approach is proposed for LSH. This approach projects one dataset point to multiple random projections, and then uses the data distribution to learn a lower number of projections capable of approximating the projections created in the previous step. This approach is proposed for the Euclidean metric and is experimentally shown to perform better than E2LSH. In [113], the K-means method is adopted to cluster the dataset into multiple groups, and then E2LSH is performed on each cluster to construct LSH tables. Finally, the benefits are experimentally shown. In [5], a data-dependent hashing method is proposed. The proposed method uses two levels of hash functions to further prune the projections. Moreover, although the two-level hashing is data dependent, the hash functions are chosen randomly without any dependency on the data. Further, the time and space complexities of the proposed method are theoretically proven to be optimal. In [107], the authors use the same logic as Bi-level LSH and build a bi-level LSH method. During the first level, they use the K-means algorithm to partition the dataset into several clusters, and then, in the second level, they apply E2LSH [30] to each cluster. The motivation of this work is that items belonging to the same cluster will have a more uniform distribution; thus, E2LSH can perform better on them. SRS [98] uses R-tree index structures to estimate the original distances of the points from their projected distances.
Moreover, it uses an incremental search strategy to look for the nearest neighbors of a given query. The main contribution of SRS is to use small indexes while maintaining the same theoretical guarantees as traditional LSH methods. SK-LSH [75] focuses on reducing random I/Os by efficiently placing nearby projected points on the same or close disk pages. In order to do this, SK-LSH uses a space-filling curve and a new distance measure between the compound hash keys to estimate the distance between the points in the original space. A linear order is then used to sort the compound hash keys and store them on the disk. QALSH [46] creates query-aware hash functions with the intuition that when the query is near the bucket boundaries, its near neighbors might fall into another bucket; in order to prevent that, QALSH considers the query point as the anchor of the buckets. Moreover, QALSH uses B+-trees for each hash function to improve the lookup time and make range queries more efficient. LazyLSH [130] notes that the chance of two nearby points being close in two different 𝑙𝑝 spaces is high. Therefore, in the indexing phase, LazyLSH builds an index for the 𝑙1 space and calls it the base space. In the query processing phase, when the query is in another 𝑙𝑝 space where 𝑝 ∈ (0, 1), it uses transformations to find nearest neighbors with a high accuracy. The authors in [45] extend QALSH [46] to work with 𝑙𝑝 norms where 𝑝 ∈ (0, 2]. They specifically use the Lévy, Cauchy, and Gaussian distributions for the 𝑙1/2, 𝑙1, and 𝑙2 norms, respectively. Moreover, the authors present a heuristic-based method called QALSH+ that uses two-level indexing and a KD-Tree structure to speed up the search process in datasets with large cardinalities. I-LSH [73] focuses on the radius expansion (Virtual Rehashing) process of LSH, where the search radius in the projections is increased in order to look for candidates in the neighboring buckets of the query.
Previous methods increase the radius exponentially; however, I-LSH uses an incremental way of increasing the radius. To incrementally increase the radius, I-LSH looks for the nearest point in the projection based on its distance to the query. This operation results in fewer disk I/Os since it prevents reading unnecessary buckets from the disk. Nevertheless, as we show in our experimental paper [49], this incremental strategy leads to high computation costs. In [74], the authors extend I-LSH and introduce EI-LSH, which features an aggressive early termination condition in order to stop the algorithm when good enough candidates are found and save processing time. Considering that 𝑅𝑚𝑖𝑛 is the radius of the closest neighbor, 𝑅 is the current search radius in the original space, and 𝑟 is the current search radius in the projected space, EI-LSH changes 𝑅𝑚𝑖𝑛 ≤ 𝑐𝑅, which is the stopping condition of I-LSH, to 𝑅𝑚𝑖𝑛 ≤ 𝜆𝑟. Here, 𝜆 is a pre-computed parameter that relies on the number of projections (𝑚), the approximation ratio (𝑐), and the success probability (𝛿). PM-LSH [129] notes that previous methods cannot estimate distances accurately because of their coarse-grained index structures, and as a result, they have to use larger search radii. Therefore, the authors use PM-Trees to index the data and improve the query processing time. Moreover, PM-LSH uses a tunable confidence interval to obtain better distance estimations and offer a higher accuracy of the results. The authors in [76] propose a novel two-dimensional method called R2LSH. Instead of using one-dimensional projections, R2LSH uses two-dimensional projections and maps dataset points into those projections. Furthermore, in the indexing phase, it builds B+-Trees for each of the two-dimensional projections. Later, in the query processing phase, R2LSH uses a query-centric ball to search the neighboring areas of the query and save I/O costs.
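The contrast between the two radius-expansion strategies can be sketched in a few lines; the function names and values here are illustrative, not from the I-LSH implementation:

```python
# Exponential expansion (virtual rehashing, as in C2LSH): the projected search
# radius doubles each round, so whole new bucket ranges are read at once.
def exponential_radii(start=1, rounds=5):
    return [start * 2 ** i for i in range(rounds)]

# Incremental expansion (as in I-LSH): the radius grows only up to the distance
# of the next closest point in the projection, so no unnecessary buckets are read.
def incremental_radii(query_proj, point_projs, rounds=5):
    dists = sorted(abs(p - query_proj) for p in point_projs)
    return dists[:rounds]

print(exponential_radii())                                  # [1, 2, 4, 8, 16]
print(incremental_radii(0.0, [0.3, -0.5, 2.0, 7.5, 9.0]))   # [0.3, 0.5, 2.0, 7.5, 9.0]
```

The incremental schedule reads at most one new point per step, which explains both the I/O savings and the higher per-step computation cost noted above.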
Super-bit LSH (SBLSH) [51] focuses on the angular distance and notes that previous methods suffer from a large variance in their angular similarity estimations. Therefore, SBLSH divides the LSH random projections into multiple sub-projections, and then it orthogonalizes the random projections within each sub-projection. The result of this strategy is several projections called super-bits, and it is experimentally shown that using these super-bits results in a smaller estimation variance when the angle to estimate is within (0, 𝜋/2]. In [52], the authors propose batch-orthogonal locality-sensitive hashing (BOLSH), which uses batches of orthogonal projections instead of independent random projections for the angular similarity measure. Since these batches of projections partition the data space into regular regions, they improve accuracy. Further, the authors show the benefit of BOLSH both experimentally and theoretically.

LSH Forest [11] creates a prefix tree on each hash table and stores the compound hash keys in the prefix trees. In the query processing phase, a top-down and a bottom-up search is performed to find the points with the largest prefix match with the hash code of the query. This hierarchical search strategy makes it possible to stop in the middle of traversing the trees when enough results are found.
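The estimator whose variance SBLSH and BOLSH reduce is easy to see in the plain sign-random-projection family: each hash agrees with probability 1 − 𝜃/𝜋, so the disagreement rate over 𝐾 independent projections is an unbiased (but noisy) estimator of 𝜃/𝜋. A minimal sketch with i.i.d. (non-orthogonalized) projections:

```python
import math
import random

def sign_hashes(x, vecs):
    # H(x) = sgn(a.x): one bit per random vector a
    return [1 if sum(a * xi for a, xi in zip(v, x)) >= 0 else 0 for v in vecs]

rng = random.Random(7)
K, d = 2000, 3
vecs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K)]

x, y = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]  # true angle between x and y is pi/4
disagree = sum(hx != hy for hx, hy in zip(sign_hashes(x, vecs), sign_hashes(y, vecs)))
est = math.pi * disagree / K             # estimate of the angle theta
print(round(est, 2))                     # close to pi/4 (about 0.785)
```

With i.i.d. projections the estimator is binomial, so its variance shrinks only as 1/𝐾; orthogonalizing groups of projections, as SBLSH does, decorrelates the bits and lowers the variance at the same 𝐾.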
The following papers work with a combination of the hash function families or other application-specific families.
BayesLSH is proposed in [94]. The motivation of BayesLSH is that the false positive removal process in E2LSH [30] is expensive since the Euclidean distance needs to be computed. The authors use Bayesian statistics to find the probability distribution of the similarities between the query and the candidates, given the distribution of collisions in the projections. This way, the Euclidean distance is estimated and the false positive removal process becomes faster. BayesLSH can be applied to the Euclidean, Angular, and Jaccard metrics, and the authors show the benefit of their method both theoretically and experimentally. The authors in [86] observe that LSH techniques such as E2LSH [30] and C2LSH [36] show poor performance on some datasets while working well on others. They mention that these techniques are significantly affected by the characteristics of the datasets. In their paper, the hashed values of the points are called signatures and projection buckets are called signature regions. In some cases, the sizes of the signature regions differ, and points in a large signature region are not good candidates since they can be far from the query. Therefore, the proposed method (S2LSH) tries to distinguish between the signature regions and only use the important ones, defined by two criteria: 1) smaller regions, and 2) regions whose center the query is near. Moreover, S2LSH can work with the Angular and Euclidean distances. In [119], the authors mention that LSH algorithms are slow (linear time complexity) when used on query workloads consisting of a large number of queries (e.g. using LSH to process similarity joins on two large datasets). The main idea in this paper is to choose a small representative set from the query dataset to decrease the number of LSH lookups and thus improve processing time. Several methods are proposed to solve the minimum set coverage problem (i.e. selecting the representative query set), and the parameters of these methods are determined based on an error analysis of the final results. For the query processing, the authors use an optimized version of QALSH [46] that uses compound hash keys and R-trees. Moreover, the authors show that their method can be applied to the Euclidean-based, Jaccard-based, and Hamming-based versions of LSH.

[65] mentions that most LSH methods require the distribution and embedding of the input data to be explicitly known. In many scenarios, kernel functions are employed and the embedding of the data is not explicitly known. Therefore, the authors propose a method to apply LSH functions to any kernel function. The proposed method constructs LSH random projections using only a given kernel function and a sparse set of samples from the dataset. In [19], the authors improve on BayesLSH [94] by adding support for arbitrary kernel similarity measures. The authors use hyperplane rounding and sketch generation algorithms to adapt BayesLSH to kernel spaces and call their method K-BayesLSH.
In [67], LSH is modified to work on categorical data. The idea is to first find a similarity matrix between all categorical values, and then use agglomerative hierarchical clustering to cluster the categorical values. Each categorical value can then be mapped to a cluster ID, and the cluster IDs are projected to different buckets using LSH. In [32], machine learning models are used to learn from dataset characteristics. In the offline processing phase, the NN search problem is mapped to a graph whose vertices are the dataset points and each vertex is connected to its nearest neighbors. Then, a balanced graph partitioning algorithm is used to partition the graph into several bins such that the number of edges crossing between different bins is as small as possible. Finally, a machine learning model is trained that, given any point, predicts the graph bins that the point can belong to. In the query processing phase, first, the graph bins of the query are predicted using the model, and then the nearest neighbors are found from those bins only. The proposed method works with the Edit and Optimal Transport distance metrics.
In this section, we review the distributed frameworks that have been proposed for LSH. There are several independent tasks in many LSH techniques (such as the multi-probing process in [78]), and researchers have used this observation to build distributed frameworks for LSH techniques in order to further improve their performance. [87] improved the time and space complexities using a distributed implementation of Multi-Probe LSH. The authors developed their system using the master-slave architecture, where the slave nodes are responsible for the computational tasks, such as creating and maintaining the hash tables, query searching, query ranking, and communication with the master node. On the other hand, the master node is responsible for splitting and distributing the large dataset, sending the query to the slave nodes, and aggregating the ranked results. This work uses the master-slave architecture to search the LSH buckets and improve performance and scalability while maintaining the accuracy. In [40], a different distributed LSH architecture is proposed to improve the efficiency of KNN searches and range queries in high-dimensional datasets. In order to map the LSH buckets to the peers of the distributed system, a two-level mapping strategy is used. This two-level strategy ensures that: 1) LSH buckets that are holding similar data are mapped to the same peers, and 2) load balancing between the peers is fair. The proposed method is followed by experiments showing that it not only meets the two requirements but also minimizes the network I/Os. Layered LSH, proposed in [9], uses Apache Hadoop for its disk-based version and Active DHT for its in-memory version. The authors provide theoretical guarantees for the network operations in the single hash table setting. However, for the multiple hash tables setting, no theoretical guarantees are provided.
Layered LSH works in the Euclidean space and uses Entropy LSH [85] as its base LSH method. Parallel LSH (PLSH) [99] introduces an in-memory, multi-core, distributed variant of LSH that can be used to perform KNN searches on large and streaming data. PLSH uses a caching strategy to improve the online index construction time for streaming data. Moreover, it uses insert-optimized delta tables to hold the indexes of newly incoming data while merging them with the main index structures periodically. Furthermore, a bitmap-based strategy is used to eliminate duplicate data fetched from different hash tables. Additionally, PLSH is designed to work only on the angular distance. An LSH-based distributed framework is proposed in [55] that improves scalability in the privacy-preserving record linkage domain. The framework utilizes the Map-Reduce paradigm and uses LSH to find the Minhash signatures of the records in the Map phase. The signatures are then encoded into Bloom filters, and the Bloom filters are distributed over the Reduce tasks. CLSH [115] uses the K-Means clustering algorithm to split the original dataset into multiple clusters. Later, it distributes these clusters over different compute nodes, where each one creates its local indexes. In the query processing phase, the given query is compared to the cluster centers, and the nearest compute nodes based on the Euclidean distance metric are chosen to run LSH on. Finally, the intermediate results are combined and the nearest points are chosen as the final results. In [68], the authors introduce a naive distributed version of LSH on Apache Spark called SLSH. In the indexing phase of SLSH, each worker node loads a subset of the dataset and calculates the hash values of the points in that subset. Later, in the query phase, all worker nodes load the hash functions and hash tables, and the query is sent to all of them for query processing.
The main downside of SLSH is that it requires several data shuffles across the worker nodes, which result in heavy network costs. To overcome this issue, the authors present a more efficient version called SES-LSH. SES-LSH uses a hashing scheme (called BKDR) to partition the data such that points belonging to the same hash tables are placed on the same worker nodes. Therefore, since the location of the data points is known in advance, computations can be performed locally, and it is not required to send the query to all worker nodes.

[71] proposes a generic distributed platform, called LoSHa, that can be used to easily implement distributed versions of different LSH methods with different distance metrics. LoSHa lowers programming costs while achieving high efficiency and performance. Internally, LoSHa uses the Map-Reduce paradigm and offers several user-friendly APIs to ease the process of converting an LSH algorithm into a distributed version. Furthermore, by using different optimization techniques such as bucket compression and data de-duplication, LoSHa improves the efficiency of the implemented method.

C2Net [70] focuses on the collision counting operation of LSH methods such as C2LSH, since its authors believe collision counting is the most time-consuming operation compared to the hash value calculation and false positive removal operations. C2Net utilizes minimum spanning trees and spectral clustering to partition LSH buckets and distribute them over mapper tasks in a Map-Reduce framework. Moreover, C2Net supports virtual rehashing by running two rounds of Map-Reduce and merging different bucket blocks.

[95] and its extended version [109] focus on the problem of load balancing in the indexing phase of distributed LSH-based methods. They propose two theoretical models that can predict the data distribution of a single hash table. Later, using the two theoretical models along with CDF-based and virtual node methods, a dynamic load balancing strategy is proposed.
Finally, the proposed method is evaluated experimentally using the Gini coefficient metric in order to demonstrate its scalability.

A distributed approach, named RDH, is proposed in [33] to improve the speed of performing similarity search for images. RDH randomly splits and distributes the dataset over different compute nodes, and each node runs LSH indexing locally on its local subset of the data. Furthermore, the authors show that the accuracy does not change significantly if the same hash functions are used on all compute nodes.
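The common thread in several of these systems ([40], SES-LSH, C2Net) is routing each LSH bucket to a fixed compute node, so that colliding points, and later the queries that hash to them, can be processed locally instead of being broadcast. The following minimal Python sketch illustrates this idea; the p-stable hash family, bucket width, and worker count are illustrative assumptions, not the parameters of any of the systems above:

```python
import random

def make_e2lsh_fn(dim, w=4.0, seed=0):
    # One p-stable (Gaussian) hash function: h(v) = floor((a . v + b) / w).
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(dim)]
    b = rng.uniform(0, w)
    return lambda v: int((sum(ai * vi for ai, vi in zip(a, v)) + b) // w)

def assign_worker(bucket_key, num_workers):
    # Deterministically route a bucket to one worker, so every point that
    # falls into this bucket is stored on (and queried from) the same node.
    return bucket_key % num_workers

h = make_e2lsh_fn(dim=3)
points = [(0.1, 0.2, 0.3), (0.11, 0.19, 0.31), (5.0, -2.0, 7.0)]
partitions = {}
for p in points:
    partitions.setdefault(assign_worker(h(p), num_workers=4), []).append(p)
# A query is hashed the same way and sent only to the worker(s) owning
# its bucket, instead of being broadcast to every node.
```

Because the bucket-to-worker mapping is a pure function of the bucket key, the query router needs no directory of where each point lives; this is what lets SES-LSH avoid sending the query to all worker nodes.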
In this section, we present different works in diverse application domains that utilize Locality Sensitive Hashing. Note that we counted more than 1000 application papers that utilize LSH in their application workflow. Here, we have carefully categorized the most recent and popular papers (in terms of their citations). For each paper, we present the particular problem that it tries to solve and how LSH is utilized in its methodology. Additionally, we also note the specific LSH algorithm and hash family used by the paper (if the paper mentions it). The goal of this section is to give the reader an understanding of how LSH is used in these diverse domains.

[92] presents a query-by-humming method for music retrieval using LSH. The pitch vectors are extracted for a music database, and an index structure is constructed. For retrieval, the query transcription technique is employed to produce notes for the song, and then pitch vectors are extracted in the same way as in the index construction phase. Later, the E2LSH [30] method is used to return the nearest neighbors and finalize the list of similar songs. Moreover, the Euclidean distance metric is used to find the distances between the pitch vectors.

In [120], Order Statistics LSH (OS-LSH) is proposed to improve the scalability of content-based retrieval of audio tracks in music databases. For an audio query, Chroma sequences are calculated, and then a Multi-Probe Histogram (MPH) is generated for the sequences. Later, OS-LSH maps the MPH into hash keys. Finally, in the post-filtering step, the histograms associated with the keys in the hash table are compared against the query, and similar items are returned. The histogram representation makes the proposed method more storage-efficient and scalable.

[41] reduces the computational complexity of finding similar music tracks from their content-based extracted features using LSH.
After extracting the Mel-Frequency Cepstral Coefficients (MFCC) and Time Histogram (TH) features from the music, the K-Means clustering algorithm is performed to obtain a Bag of Words (BoW) representation. Later, LSH is used to find similar clusters instead of a linear search approach, which would otherwise take O(𝑁) time.

[122] speeds up the retrieval of similar songs based on melody similarity in a large database. Initially, a Support Vector Machine (SVM) model is trained by taking the generated Chord Progressions (CP) into consideration for a set of audio tracks. The Chord Progression Histograms (CPH) are computed for the audio tracks and organized into one single hash table with a tree structure, with the CP as the hash key. In the query processing phase, the CPHs of the CPs are computed, and similar songs are retrieved from the corresponding hash buckets.

[83] proposes an identification method that uses LSH for Raga, which is a quintessential component of classical Indian music. The new music recordings are stored in the database. Then, the pitch vectors are extracted and stored along with their labels. Later, LSH is used to index the pitch vectors. Similarly, the pitch vectors are extracted for the query set and then compared with the indexed pitch vectors using the Euclidean distance metric in order to find the Top-k Ragas for the query.

In [64], the authors extend the scope of LSH to arbitrary kernel functions while preserving the algorithm's sub-linear query time, and propose KLSH. In KLSH, the random projections required by LSH are computed only in the kernel space and for a limited number of database objects in order to find the set of images similar to the query. Moreover, KLSH utilizes the Gaussian RBF kernel to retrieve the images.

[128] introduces a new approach for video anomaly detection using LSH filters. The training data points are hashed into a set of buckets using LSH.
With the help of Bloom filters, a test data point is detected as abnormal if it falls into a different bucket, and as normal if it falls into the same bucket as the training data. The Hamming distance measure is used to find the distances between the training buckets and the test bucket. Furthermore, the Particle Swarm Optimization (PSO) method is used to search for optimal hash functions, which further improves the detection quality.

In [111], the authors propose a method to support content-based image retrieval over encrypted images in cloud applications. Initially, feature vectors are extracted for the corresponding input images, and the k-NN method is employed for the encryption of the feature vectors. Then, pre-filtered tables are created using LSH. E2LSH [30] is used to construct the tables. In order to further enhance security, watermark-based protocols are used to prevent the illegal distribution of images by legal users. Furthermore, the Euclidean distance between the feature vectors is used to find the similarity of images.

[81] accelerates the clustering process of large malware datasets by using the LSH technique. The malware samples are represented as sets of features, and the distance between two samples is computed using the Jaccard distance. Then, MinHash functions are computed efficiently to find the set of similar malware samples by using the banding approach. Later, the malware samples are clustered by using the Single Linkage (SLINK) algorithm.

[82] introduces a new online system to detect malicious spam emails by using a Resource Allocating Network with LSH (RAN-LSH). LSH is used to select the training data that has to be learned by the RAN-LSH classifier to detect the spam emails. For the test data, the hash table is looked up to find the same or similar spam emails.

In [127], the authors improve the scalability of local recoding (a technique used to anonymize data and preserve privacy) in big data applications.
A semantic distance metric is proposed in order to measure the similarity between data points. Later, MinHash LSH and Map-Reduce are used to split the data into several partitions that contain similar records. The anonymization of these partitions is done via a recursive agglomerative k-member clustering method.

In [8], the authors introduce a novel authentication system, called ai.lock, for mobile devices, which uses an imaging sensor for authentication. To extract invariant features for image-based authentication, LSH is used along with deep neural networks and Principal Component Analysis (PCA). The architecture processes the input image through the neural networks, and LSH is employed to map it to a binary image print. It also uses a classifier to identify the ideal error tolerance threshold for locking and matching image prints. Moreover, this work uses the Hamming distance metric.

[131] presents a blockchain scheme for image copyrights that provides the copyrights over the network under distribution constraints. First, the image feature vector is calculated to represent the image content of the input images; then, the feature vector is added to the blockchain. When a user wants to use a photo/image that is present in the blockchain network, E2LSH [30] is employed to find the copyright owner by using information such as the authority item, image index item, usage item, and ownership change item present in the network as data blocks.
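The MinHash-plus-banding pattern used by works such as [81] and [127] can be sketched as follows; the signature length, band layout, and the CRC-based hash family are illustrative assumptions, not the exact choices of either paper:

```python
import random
import zlib

def minhash_signature(tokens, num_hashes=16, seed=7):
    # Row i is the minimum of the i-th salted hash over the token set;
    # the probability that two signatures agree on a row equals the
    # Jaccard similarity of the underlying sets.
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(zlib.crc32(t.encode()) ^ s for t in tokens) for s in salts]

def band_keys(signature, bands=4):
    # Split the signature into bands; records sharing any band key fall
    # into the same partition (e.g. the same Map-Reduce key) and become
    # candidates for the same cluster.
    rows = len(signature) // bands
    return [(i, tuple(signature[i * rows:(i + 1) * rows])) for i in range(bands)]

rec_a = minhash_signature({"alice", "smith", "1984", "nyc"})
rec_b = minhash_signature({"alice", "smyth", "1984", "nyc"})
est_jaccard = sum(x == y for x, y in zip(rec_a, rec_b)) / len(rec_a)
candidates = set(band_keys(rec_a)) & set(band_keys(rec_b))  # shared partitions
```

Records whose token sets overlap heavily agree on most signature rows, so they are likely to share at least one band key and land in the same partition, which is exactly the grouping that [127] anonymizes and [81] clusters.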
In [90], the authors reduce the running time of similarity list creation for nouns gathered from a web corpus. Noun clustering is a well-known task in Natural Language Processing, and creating the similarity matrix is an expensive operation in noun clustering. Therefore, the authors use LSH with the cosine similarity metric to create an approximate similarity matrix and speed up the operation.

[62] uses LSH in the single-link hierarchical clustering technique to approximate the distances and reduce the time complexity of finding the nearest clusters. The authors show experimentally and theoretically that using LSH in their proposed method reduces the time complexity significantly. It is also worth mentioning that the Euclidean distance is used as the distance metric.

Often, researchers use sampling techniques, while removing outliers from the initial sample, to perform association rule mining on large datasets in a reasonable amount of time. [22] mentions that the initial sample may contain multiple clusters, and that performing data clustering on the initial sample can result in an increase in accuracy. Therefore, the authors use LSH and the Euclidean distance metric to first cluster the initial data sample and then remove outliers from the buckets.

LSHiForest [126] uses an LSH forest to propose a framework for ensemble anomaly analysis. LSHiForest is built upon iForest, which is an isolation-based anomaly detection forest. Moreover, the proposed framework has the ability to use any distance metric and any type of LSH family, such as Euclidean-based, angle-based, and kernelized LSH. Finally, the authors show experimentally that LSHiForest beats other methods in terms of time efficiency, anomaly detection quality, and robustness.

Direct Robust Matrix Factorization (DRMF) is a technique used in the anomaly detection domain. [112] argues that although DRMF is robust and accurate, it involves expensive computations.
To speed up traffic anomaly detection, [112] proposes a multi-layer LSH table that maps origin-destination pairs to different layers with different similarity levels (based on the Euclidean metric). Later, an adaptive strategy is proposed to search the generated layers.

[53] presents a new method called Semi-Supervised SimHash (S3H) to search for similar documents in high-dimensional spaces. Since it is semi-supervised, it learns the optimal feature weights, and these weights are used to find the query results, since similar objects have similar fingerprints. Initially, the dataset is mapped into an L-dimensional Hamming space using LSH, and then the fingerprints are generated, which are used to find the similar documents.

In [37], the authors propose a method to identify misspelled names and near-duplicates using LSH. First, the data is transformed, and similar names (candidates) are produced from the transformed data using LSH with the Jaccard distance. The candidate pairs are then filtered using the full Damerau-Levenshtein distance. Similar names are aggregated into sets of names by utilizing a graph.

[69] proposes a method for large-scale document reduction based on domain ontology and LSH. In the first step, the features are extracted using a Semantic Vector Space Model (SVSM). Then, the index of the SVSM is obtained using E2LSH [30], and each document is mapped into the hash tables. Later, the candidate set of similar documents is obtained by taking the union of the buckets that contain similar documents. Lastly, the Euclidean distances between the query and the documents in the candidate set are computed to retrieve the true similar documents.

[104] generates online LSH signatures in order to process large text collections. The authors improve upon the offline generation of LSH signatures and propose an algorithm that is suitable for streaming applications.
It is also space-efficient, since the method does not need an explicit representation of the feature vectors or random matrices. For every feature, it uses a fixed pool of random values rather than creating a unique value for each feature. Furthermore, the proposed method uses the cosine similarity of the feature vectors.

LSH-ALL-PAIRS [15] efficiently compares genomic DNA sequences with the goal of finding conserved genome features across different species. LSH-ALL-PAIRS converts the sequences to shingles. Then, LSH is applied to all shingles, and the shingles with the same hash value are grouped together into a class. Finally, a pair-wise comparison is performed only within a specific class to find similar shingles.

[13] introduces the MinHash Alignment Process (MHAP) to detect overlaps between noisy, long reads of microbial genomes. MHAP uses MinHash to create small fingerprints of sequencing reads with the goal of dimensionality reduction. To do this, MHAP decomposes DNA sequences into multiple shingles, and then the shingles are converted into integer fingerprints using multiple random hash functions. Finally, the Hamming distance of the fingerprints is used to approximate the Jaccard distance of two shingles, which helps determine the overlaps between them.

MASH [80] facilitates the use of MinHash in data-intensive problems in genomics by proposing a general-purpose toolkit to construct, manipulate, and compare MinHash fingerprints from genomic data. Moreover, MASH derives a significance test and proposes a new distance metric, called the Mash distance, that estimates the mutation rate of two sequences. The Mash distance can easily be computed using only the MinHash fingerprints.

Molecular fingerprints are often used to describe, compare, and benchmark organic molecules. MinHash Fingerprint, up to six bonds (MHFP6) [88], is a fingerprint that adopts LSH to improve the performance of nearest neighbor searches in benchmarking studies.
In order to do this, MHFP6 applies MinHash to molecular substrings and generates multiple fingerprints. The fingerprints are then indexed by LSH Forest [11] to efficiently retrieve nearest neighbors.

[121] improves the scalability of geo-fencing applications by processing hundreds of polygons and points in real time. Initially, an R-tree is used to quickly detect whether a point is present in a minimum bounding rectangle. Then, an edge-based LSH technique is used for large-scale pairing between points and polygons for INSIDE and WITHIN detection of points, followed by a probing method to find all the geo-edges close to a target point.

In [89], the authors speed up the construction of roadmaps, without sacrificing quality, by using LSH. Centroid-based hashing is employed to search for nearest neighbors during the construction phase. The centroids are initialized first, and each centroid corresponds to the region of a Voronoi cell. An arbitrary point is associated with one of the centroids by calculating the Euclidean distance of the point to all of the centroids and then selecting the nearest one. To retrieve a set of nearest points, it is sufficient to retrieve them from the points in the Voronoi cells.

In [79], the authors speed up the process of searching for patterns similar to a target one in 2D and 3D image training libraries. To efficiently search for a pattern in the training images, LSH is first used to filter the patterns that are similar to a given data event. Then, an exhaustive search using a Run-Length Encoding (RLE) compression technique is used to calculate the similarity among the filtered patterns. The Euclidean distance measure is used for continuous images, and the Hamming distance measure for categorical images.

[6] reduces the computation costs of nearest neighbor search, distance estimation, clustering, and classification of GPS trajectory data.
LSH is applied to two distance measures, namely the Hausdorff and Fréchet distances, which are used in the neighbor search. Furthermore, a data structure called the Multi-Resolution Trajectory Sketch (MRTS) is built to compactly represent the dataset. This also enables fast insertion of trajectories into the database.

[34] utilizes LSH and proposes a symbol spotting approach for graphical documents. The proposed method uses the critical points of the graphical documents as the nodes and the lines joining those critical points as the edges of a graph. Later, the graph is decomposed into multiple graph paths, and finally, the shape descriptors of the graph paths are mapped to hash tables using LSH.

[123] focuses on similarity search in undirected vertex-labeled simple graphs that have no self-loops or multiple edges. Given a query graph with a semantic class label, the goal is to find the nearest graphs that have the same class as the query graph. The authors propose a vectorial representation method that is used to convert the graphs to high-dimensional vectors. Finally, Multi-Probe LSH [78] and the Euclidean distance are used to query the generated high-dimensional vectors.

Basic Weighted Graph Summarization (BWGS) is a method used to compress graphs that requires finding similar nodes within a graph. [57] uses Min-Hash to approximately find the required similar nodes and, as a result, speed up the process of graph compression. Moreover, for graphs that contain edge weights, a Weight Oriented LSH (WOLSH) strategy is proposed to increase the chance of generating similar Min-Hash values from the subset of neighbors that have higher edge weights.
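Multi-Probe LSH [78], which appears here and in several systems above, reduces the number of hash tables needed by also inspecting buckets adjacent to the query's bucket. The sketch below generates a simplified probing sequence; the real algorithm in [78] orders perturbations by their estimated success probability, whereas the ±1 scheme here is an illustrative simplification:

```python
from itertools import combinations, product

def probe_sequence(bucket, max_perturb=1):
    # Yield the query's own bucket first, then every bucket whose key
    # differs by +/-1 in up to `max_perturb` coordinates.
    yield tuple(bucket)
    for n in range(1, max_perturb + 1):
        for idxs in combinations(range(len(bucket)), n):
            for signs in product((-1, 1), repeat=n):
                probed = list(bucket)
                for i, s in zip(idxs, signs):
                    probed[i] += s
                yield tuple(probed)

probes = list(probe_sequence((3, -1, 0)))
# probes[0] is the exact bucket (3, -1, 0); the remaining six probes each
# differ from it in exactly one coordinate, e.g. (2, -1, 0) and (4, -1, 0).
```

Because near neighbors that just miss the query's bucket usually land in one of these adjacent buckets, probing them recovers recall that would otherwise require building many more hash tables.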
Maximum Inner Product Search (MIPS) is the problem of finding the dataset point that has the maximum inner product with a given query point. Mathematically, this problem can be converted to a near neighbor search problem when the norm of every dataset point is constant. However, this is not the case in many applications, and the MIPS problem cannot easily be solved by using near neighbor search techniques. In [96], the authors propose an asymmetric LSH (ALSH) method that uses different hash functions for bucket creation and bucket probing. ALSH can easily be applied to MIPS to improve the performance, and it is based on E2LSH [30], which uses the Euclidean distance metric.

In [23], the authors solve the scalability issue of applying large trained models to huge non-annotated media collections. A trained linear Support Vector Machine (SVM) classifier requires a weight vector, a feature vector, and a bias in the formula ℎ(𝑥) = 𝑠𝑔𝑛(𝑤 · 𝑥 + 𝑏) to classify the feature vector. As a result of using LSH, the approximate ℎ(𝑥) value can be found by a range query in the Hamming cube anchored around the hashed value of 𝑤. Hence, they build an approximate classifier using LSH by hashing the weight vector and the features, knowing that the dot product of 𝑤 and 𝑥 can be estimated by the Hamming distance of their hashed values.

Automatic Speech Recognition (ASR) is a method used to convert speech into text. One strategy to improve the performance of ASR is to apply it to features derived from a manifold-learning-based approach. However, this process requires computing pair-wise distances between the feature vectors to construct the nearest neighborhood graphs that are required in manifold learning techniques. LPDA-LSH [103] is proposed to solve this issue using a modified version of LSH. In this modified version, pairwise distances between all hashed values are calculated for each hash bucket with the goal of creating candidate sets for all the points in that bucket.
Later, the candidate sets are concatenated based on class labels, and within-class and inter-class neighborhood graphs are created.

[116] mentions that the state-of-the-art hashing-based technique for MIPS uses a normalization strategy with the maximum 2-norm in the dataset, and that this strategy suffers from performance issues when used on real datasets that have long tails in their distributions. Therefore, the authors introduce NORM-RANGING LSH, which divides the dataset based on the percentiles of the 2-norm distribution. Later, the state-of-the-art technique is applied individually to each dataset partition. Moreover, a new similarity metric is proposed to define how to probe from the different partitions of the dataset.

H2ALSH [47] presents a novel transformation method to reduce the error that is caused by the asymmetric strategies applied to MIPS. The novel transformation method is called Query Normalized First (QNF) and converts the MIPS problem to a nearest neighbor search problem. H2ALSH first divides the dataset into multiple subsets and uses QNF to transform the subsets. Later, it uses QALSH [46] to build the indexes for the subsets. Finally, the union of the results from the subsets is reported as the MIPS result.

Stratified LSH (SLSH) [60] presents a detection system for high-dimensional physiological data using LSH. SLSH, which is a type of multi-level LSH, predicts the critical events of a patient. First, the training data is stratified with the help of LSH by using the 𝑙 distance. Then, Cosine distance LSH (COSLSH) is applied at the inner level on each bucket. To retrieve the approximate nearest neighbors of the query, the same two levels of LSH are applied, and a linear search is performed within the candidate set.
Finally, the prediction is made by the majority vote technique.

[114] demonstrates an LSH-based approach to retrieve bone scan images by using the SIFT-based Fly Locality Sensitive Hashing (FLSH) technique. First, the Difference of Gaussians (DoG) is calculated for the input images to detect potential minimum points as key points. Then, the value of the normalized Laplacian function at each key point is evaluated, and the key point is assigned one or more orientations if it meets a certain threshold. Finally, a 128-dimensional SIFT feature vector is generated for each image, and hash codes are produced for the feature vector.

[4] improves the prediction accuracy of critical events for a patient in a medical database. In the indexing phase (also called the training step) of the proposed method, LSH is used to hash and index the features that are extracted from the ECG signals of a subject. In the query processing phase (also called the testing step), the extracted features of a new ECG signal are also hashed into the LSH buckets, and the class of this new signal is determined based on the class of the majority of its near points.

[25] presents two novel schemes for near-duplicate image and video shot detection. The first scheme uses a hierarchical tiled color histogram for image representation and uses the Euclidean distance to compute the similarity. Then, image retrieval is achieved using LSH. The second approach uses a sparse set of visual words for image representation. Then, the set overlap measure is used to compute the similarity, and the Min-Hash algorithm is employed for the efficient retrieval of images.

SimPair LSH [35] solves the problem of near-duplicate detection for high-dimensional data points incrementally and efficiently. Initially, a certain number of pair-wise similar distance sets that meet a threshold for the existing data points are stored in memory.
Later, in the query processing phase, if one of the points from a similar pair appears in the same bucket as the query, it is very probable that the other point also appears in that bucket. Therefore, SimPair LSH can avoid computing distances for the points that are similar to each other in the first place; thus, it saves on the overall processing time. Additionally, SimPair LSH is an in-memory index structure that uses the Euclidean distance metric.

[97] presents a technique to detect similar documents using SimHash. Probabilistic SimHash Matching (PSM) is proposed, which incorporates a proposed algorithm called Volatility Ordered Set Heap (VOSH). VOSH randomly flips the bits without repetitions. PSM finds similar documents using online and batch operation modes. The online mode of operation is adopted for the documents that fit into the main memory, and the batch operation mode is preferred for the documents that are kept on disk.

[2] proposes a method for Copy-Move Forgery detection in images using K-Means clustering and LSH. First, the image is divided into non-overlapping blocks, and features are extracted from these blocks. Later, a vector array is formed from these features, and the PCA method is used for the dimensionality reduction of the vectors into two dimensions. In the next step, the reduced vectors are clustered into multiple clusters using the K-Means clustering method. Finally, LSH is deployed to find the matching pairs of blocks based on the Euclidean distance, which in turn results in a list of candidate pairs for the forgery.

In [44], the authors introduce an efficient method for near-duplicate detection using a neural network and a load-balanced LSH approach. The neural network extracts the features for the detection process. Then, LSH is used to build an index for the extracted features. A load-balanced LSH method is proposed to map images into buckets in a balanced manner and to find the relevant number of neighboring buckets to detect duplicates.
The load-balanced LSH uses the Euclidean distance metric.

[43] integrates a semantic searching method based on LSH into Mobile P2P networks. Initially, for a document vector, the semantic indexing method is used to hash the document vector into a key by using entropy-based LSH [85], and the key is stored in the associated mobile node (mobiles, laptops, etc.) in the network. Moreover, there are some super nodes, called stationary nodes, present in the network, which act as access points. When any of the mobile nodes wants to query a document, the query is sent to the access points, and then the super nodes are responsible for choosing various points randomly from the neighborhood of the query.

NearBucket-LSH [63] solves the similarity search problem in P2P online social networks. Initially, LSH is used to map the users into a collection of buckets. Therefore, when a query is received, LSH limits the searching process to the buckets to which the query is mapped. NearBucket-LSH is based on Multi-Probe LSH [78] and also uses a Content Addressable Network (CAN). The CAN allows mapping the users to the buckets and storing the buckets in distributed nodes across the network. Furthermore, the CAN is used to update and locate the required buckets in the query processing phase.

In [3], the authors use a neural network model to detect Distributed Denial of Service (DDoS) cyber attacks by using the Resource Allocating Network with LSH (RAN-LSH) classifier. First, in the pre-processing step, the transformation of darknet packets into feature vectors is carried out. In order to train the neural network, LSH is used to select only a certain amount of data to be trained on by the model, which in turn accelerates the learning time of the model.

[102] presents a network congestion detection method for the Signal Safety Data Network (SSDN). Initially, the data flow of the SSDN is decoded and fed into LSH as input. LSH is used to determine whether there is network congestion or not.
If the hash buckets overflow, then there is network congestion, and vice versa. If there is congestion in the network, the data is pre-processed by extracting the features and then normalizing them. Later, the XGBoost algorithm is used to determine the congestion type of the pre-processed data.

Tree Locality-Sensitive Hash (TLSH) [16] uses Min-Hash to map trees that are similar to each other, under the tree edit distance metric, to the same hash value. TLSH is applied to the path constraints of the symbolic execution states of software to identify similar states and help find bugs more efficiently.

[26] uses Min-Hash to match equivalent terms in different big ontologies. Class labels are used to create a string-based alignment. Two strategies are adopted to make the alignment usable in Min-Hash. In the first strategy, class labels are shingled, and a hierarchical method is used to merge the shingles. In the second strategy, a class is represented using the different tokens from the class label. The representation is then fed into the Min-Hash algorithm. Finally, to compare two ontologies, one of them is used as the dataset and the other one is used as the query.

[27] uses LSH Forest [11] to find the right context of the environment for placing new knowledge tokens. Moreover, the Random Hyperplane Hashing (RHH) method is used along with LSH Forest to utilize the angular distance metric.

[28] uses Vector Space Models (VSM) to represent user profiles in social media and Random Hyperplane Hashing (RHH) to create an index for the user profiles. RHH is an LSH family that uses the cosine distance. Moreover, the authors experimentally show the benefits of using RHH to develop similarity search systems for social media profiles.

Tag Assignments Stream Clustering (TASC) [110] is a method that uses LSH for community detection in social tagging systems.
To detect the communities, TASC generates user profile vectors based on the users' interests. Later, LSH and the angular distance are used to find the similarities between the users and create user clusters. In real-time systems, the clusters are updated over time, and the similarities between new users and the current users are recalculated.

[1] proposes an LSH-based method to match papers in the Web of Science and Scopus bibliographic databases. First, LSH is used to find similar papers from the two databases in a reasonable amount of time. Later, heuristic-based approaches are used to remove the false positives and obtain the exact matches. The utilized LSH method uses the cosine distance metric and is implemented on Spark to further improve the speed through distributed processing.
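Random Hyperplane Hashing, used by [27], [28], and the cosine-based methods above, produces one signature bit per random hyperplane; two vectors agree on a bit with probability 1 − 𝜃/𝜋, where 𝜃 is the angle between them. A minimal sketch (the signature length, seed, and the profile vectors are arbitrary choices for illustration):

```python
import random

def rhh_signature(vec, num_planes=16, seed=42):
    # One bit per random Gaussian hyperplane: the sign of the projection.
    # Vectors separated by a small angle agree on most bits.
    rng = random.Random(seed)
    bits = []
    for _ in range(num_planes):
        plane = [rng.gauss(0, 1) for _ in range(len(vec))]
        bits.append(1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0)
    return bits

profile_u = [1.0, 2.0, 0.5]  # hypothetical user-profile vectors
profile_v = [2.0, 4.0, 1.0]  # same direction as profile_u, twice the length
# Scaling a vector does not change its angle, so the two signatures are
# identical, which is why RHH approximates cosine (angular) similarity.
```

The Hamming distance between two such signatures is an unbiased estimate of the angle (scaled by 𝜋/num_planes), so candidate retrieval reduces to comparing short bit strings instead of full profile vectors.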
The authors in [61] use two LSH methods to predict critical events from physiological time series. L1LSH [38], which is based on the Hamming distance, and E2LSH [30], which is based on the Euclidean distance, are used to find the top nearest neighbors of a given query. Later, the class label of the query is chosen using a majority vote of its nearest neighbors.

[59] mentions that using a generic LSH method for physiological time series has two problems: 1) not being able to use multiple distance metrics at a time, and 2) expensive distance computations on the candidate set in order to remove the false positives. Therefore, in their thesis, the author proposes Stratified LSH (SLSH) to solve the first problem and Collision Frequency LSH (CFLSH) to solve the second problem. SLSH uses a multi-level hierarchical approach that supports using a different distance metric at each level. CFLSH uses a collision counting strategy with the intuition that the more similar a point is to a given query, the more times it will collide with the query across multiple hash tables.

In [91], the authors use LSH to identify potential earthquakes by finding similar seismic time series. First, a fingerprint extraction strategy is used to convert the input time series into compact binary vectors. Later, Min-Hash is applied to the generated binary fingerprints in order to identify all similar fingerprints. Moreover, the authors also study the effect of parallel processing (using multiple CPU threads) on the processing times.

[31] reduces the cost of detecting an Acute Hypotensive Episode (AHE), given the vital signals of ICU patients in the form of multivariate time series, using LSH.
First, Bidirectional Sequence-to-Sequence (BSS) autoencoders, Hierarchical Sequence-to-Sequence (HSS) autoencoders, and a combination of the two are used to encode the input time series into context vectors. Stratified LSH (SLSH) [60] is then used to find similar context vectors.
[118] mentions that finding similar time series is gaining importance due to advances in mobile devices and sensors. Therefore, the authors use LSH to find candidate similar time series, and then use the hash values to estimate the original distances and prune the candidate sets. Moreover, the authors choose appropriate LSH parameters by performing an error analysis. The proposed method uses QALSH [46] with both the Dynamic Time Warping (DTW) and Euclidean distance metrics.
[101] improves the scalability of the Monte Carlo Localization (MCL) algorithm, which is used for global localization and position tracking in robotic systems. E2LSH [30] is employed to select the features required to build the scalable MCL framework. To build the features database incrementally, the mapper robot hashes each new feature using E2LSH and adds its location to the corresponding buckets.
LSH-RANSAC [93] solves the problem of feature-based robot localization in large-size maps. For appearance-based localization, an incremental database is built using the incremental maps of iLSH [101]. For a new feature, the real-world location is computed, the feature is hashed using E2LSH [30], and the real-world location is then associated with the hashed values. For position-based localization, iRANSAC [100], a map-matching scheme, is employed. Both the iLSH and iRANSAC techniques are used to solve real-time localization problems.
In [42], the authors improve the processing speed of high-resolution stereo images in robotics. To establish dense image correspondences, the approximation of one image with another image has to be computed. LSH is used to solve this approximation problem by operating on dense binary strings of the image pixels.
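Several of the methods above, such as L1LSH and binary pixel descriptors, operate in Hamming space. A minimal sketch of the classic bit-sampling LSH family for Hamming distance is shown below; the helper name and the toy bit strings are illustrative assumptions, not code from the cited papers:

```python
import random

def make_bit_sampler(num_bits, k, seed=0):
    """Bit-sampling LSH for Hamming distance: a hash function is just k
    randomly chosen bit positions. Two binary strings collide iff they
    agree on all sampled positions, which happens with probability of
    roughly (1 - d/n)**k for Hamming distance d over n bits."""
    rng = random.Random(seed)
    positions = rng.sample(range(num_bits), k)
    def h(bits):
        return tuple(bits[p] for p in positions)
    return h

# Toy usage with 16-bit strings standing in for binary pixel descriptors.
h = make_bit_sampler(num_bits=16, k=4)
x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y = list(x); y[2] ^= 1            # Hamming distance 1 from x
z = [b ^ 1 for b in x]            # Hamming distance 16 from x
# x and y agree on most positions, so they usually share a bucket;
# x and z disagree everywhere, so they never can.
```

Because evaluating a hash function is just k array lookups, this family is attractive when the descriptors are already binary, as with the image-pixel strings above.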
In this survey paper, we reviewed the recent advances in Locality Sensitive Hashing (LSH) techniques and categorized them based on the hash function families that they utilize. Additionally, we reviewed the distributed frameworks proposed for LSH techniques and explained their architectures. Finally, we categorized different application papers and presented how Locality Sensitive Hashing is utilized in each of them.
REFERENCES
[1] Mehmet Ali Abdulhayoglu and Bart Thijs. 2018. Use of locality sensitive hashing (LSH) algorithm to match Web of Science and Scopus. Scientometrics.
[2] In Information Science and Applications (ICISA) 2016. Springer, 663–672.
[3] Siti Hajar Aminah Ali, Seiichi Ozawa, Tao Ban, Junji Nakazato, and Jumpei Shimamura. 2016. A neural network model for detecting DDoS attacks using darknet traffic features. In . IEEE, 2979–2985.
[4] Turky N Alotaiby, Alanoud Alhakbani, Nujood Alwhibi, Gaseb Alotaibi, and Saleh A Alshebeili. 2019. Locality Sensitive Hashing for ECG-based Subject Identification. In . IEEE, 1–4.
[5] Alexandr Andoni, Piotr Indyk, Huy L Nguyen, and Ilya Razenshteyn. 2014. Beyond locality-sensitive hashing. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms. SIAM, 1018–1028.
[6] Maria Astefanoaei, Paul Cesaretti, Panagiota Katsikouli, Mayank Goswami, and Rik Sarkar. 2018. Multi-resolution sketches and locality sensitive hashing for fast trajectory processing. In Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. 279–288.
[7] Martin Aumüller, Erik Bernhardsson, and Alexander Faithfull. 2017. ANN-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. In International Conference on Similarity Search and Applications. Springer, 34–49.
[8] Mozhgan Azimpourkivi, Umut Topkara, and Bogdan Carbunar. 2017. A secure mobile authentication alternative to biometrics. In Proceedings of the 33rd Annual Computer Security Applications Conference. 28–41.
[9] Bahman Bahmani, Ashish Goel, and Rajendra Shinde. 2012. Efficient distributed locality sensitive hashing. In Proceedings of the 21st ACM international conference on Information and knowledge management. 2174–2178.
[10] Xiao Bai, Haichuan Yang, Jun Zhou, Peng Ren, and Jian Cheng. 2014. Data-dependent hashing based on p-stable distribution. IEEE Transactions on Image Processing 23, 12 (2014), 5033–5046.
[11] Mayank Bawa, Tyson Condie, and Prasanna Ganesan. 2005. LSH forest: self-tuning indexes for similarity search. In Proceedings of the 14th international conference on World Wide Web. 651–660.
[12] Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (1975), 509–517.
[13] Konstantin Berlin, Sergey Koren, Chen-Shan Chin, James P Drake, Jane M Landolin, and Adam M Phillippy. 2015. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nature biotechnology 33, 6 (2015), 623–630.
[14] Andrei Z Broder, Moses Charikar, Alan M Frieze, and Michael Mitzenmacher. 2000. Min-wise independent permutations. J. Comput. System Sci. 60, 3 (2000), 630–659.
[15] Jeremy Buhler. 2001. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics 17, 5 (2001), 419–428.
[16] Camdon J Cady. 2017. A Tree Locality-Sensitive Hash for Secure Software Testing. (2017).
[17] Deng Cai. 2019. A revisit of hashing algorithms for approximate nearest neighbor search. IEEE Transactions on Knowledge and Data Engineering (2019).
[18] Lawrence Cayton and Sanjoy Dasgupta. 2008. A learning framework for nearest neighbor search. In Advances in Neural Information Processing Systems. 233–240.
[19] Aniket Chakrabarti, Venu Satuluri, Atreya Srivathsan, and Srinivasan Parthasarathy. 2015. A bayesian perspective on locality sensitive hashing with extensions for kernel methods. ACM Transactions on Knowledge Discovery from Data (TKDD) 10, 2 (2015), 1–32.
[20] Moses S Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing. 380–388.
[21] Edgar Chávez, Gonzalo Navarro, Ricardo Baeza-Yates, and José Luis Marroquín. 2001. Searching in metric spaces. ACM computing surveys (CSUR) 33, 3 (2001), 273–321.
[22] Chyouhwa Chen, Shi-Jinn Horng, and Chin-Pin Huang. 2011. Locality sensitive hashing for sampling-based algorithms in association rule mining. Expert Systems with Applications 38, 10 (2011), 12388–12397.
[23] Dandan Chen. 2016. Structural Nonparallel Support Vector Machine Based on LSH for Large-Scale Prediction. In . IEEE, 839–846.
[24] Lianhua Chi and Xingquan Zhu. 2017. Hashing techniques: A survey and taxonomy. ACM Computing Surveys (CSUR) 50, 1 (2017), 1–36.
[25] Ondřej Chum, James Philbin, Michael Isard, and Andrew Zisserman. 2007. Scalable near identical image and shot detection. In Proceedings of the 6th ACM international conference on Image and video retrieval. 549–556.
[26] Michael Cochez. 2014. Locality-sensitive hashing for massive string-based ontology matching. In , Vol. 1. IEEE, 134–140.
[27] Michael Cochez, Vagan Terziyan, and Vadim Ermolayev. 2017. Large scale knowledge matching with balanced efficiency-effectiveness using lsh forest. In Transactions on Computational Collective Intelligence XXVI. Springer, 46–66.
[28] Rodolfo da Silva Villaca, Luciano Bernardes de Paula, Rafael Pasquini, and Mauricio Ferreira Magalhaes. 2013. A similarity search system based on the hamming distance of social profiles. In . IEEE, 90–93.
[29] Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. 2011. Fast locality-sensitive hashing. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. 1073–1081.
[30] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry. 253–262.
[31] Jwala Dhamala, Emmanuel Azuh, Abdullah Al-Dujaili, Jonathan Rubin, and Una-May O’Reilly. 2018. Multivariate time-series similarity assessment via unsupervised representation learning and stratified locality sensitive hashing: Application to early acute hypotensive episode detection. IEEE Sensors Letters 3, 1 (2018), 1–4.
[32] Yihe Dong, Piotr Indyk, Ilya Razenshteyn, and Tal Wagner. 2019. Learning Space Partitions for Nearest Neighbor Search. arXiv preprint arXiv:1901.08544 (2019).
[33] Osman Durmaz and Hasan Sakir Bilge. 2019. Fast image similarity search by distributed locality sensitive hashing. Pattern Recognition Letters.
[34] Pattern Recognition 46, 3 (2013), 752–768.
[35] Marco Fisichella, Fan Deng, and Wolfgang Nejdl. 2010. Efficient incremental near duplicate detection based on locality sensitive hashing. In International Conference on Database and Expert Systems Applications. Springer, 152–166.
[36] Junhao Gan, Jianlin Feng, Qiong Fang, and Wilfred Ng. 2012. Locality-sensitive hashing scheme based on dynamic collision counting. In Proceedings of the 2012 ACM SIGMOD international conference on management of data. 541–552.
[37] Fernando Turrado García, Luis Javier García Villalba, Ana Lucila Sandoval Orozco, Francisco Damián Aranda Ruiz, Andrés Aguirre Juárez, and Tai-Hoon Kim. 2019. Locating similar names through locality sensitive hashing and graph theory. Multimedia Tools and Applications 78, 21 (2019), 29853–29866.
[38] Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. 1999. Similarity search in high dimensions via hashing. In VLDB, Vol. 99. 518–529.
[39] Xiaoguang Gu, Yongdong Zhang, Lei Zhang, Dongming Zhang, and Jintao Li. 2013. An improved method of locality sensitive hashing for indexing large-scale and high-dimensional features. Signal Processing 93, 8 (2013), 2244–2255.
[40] Parisa Haghani, Sebastian Michel, and Karl Aberer. 2009. Distributed similarity search in high dimensions using locality sensitive hashing. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology. 744–755.
[41] Byeong-jun Han, Hyunwoo Kim, Ziwon Hyung, Kyogu Lee, and Sheayun Lee. 2011. A content-based music similarity retrieval scheme by using BoW representation and LSH-based retrieval.
[42] Philipp Heise, Brian Jensen, Sebastian Klose, and Alois Knoll. 2015. Fast dense stereo correspondences by binary locality sensitive hashing. In . IEEE, 105–110.
[43] Xiang-song Hou, Cao Yuan-da, and Zhi-tao Guan. 2008. A Semantic Search Model based on Locality-sensitive Hashing in mobile P2P. In , Vol. 3. IEEE, 1635–1640.
[44] Weiming Hu, Yabo Fan, Junliang Xing, Liang Sun, Zhaoquan Cai, and Stephen Maybank. 2018. Deep constrained siamese hash coding network and load-balanced locality-sensitive hashing for near duplicate image detection. IEEE Transactions on Image Processing 27, 9 (2018), 4452–4464.
[45] Qiang Huang, Jianlin Feng, Qiong Fang, Wilfred Ng, and Wei Wang. 2017. Query-aware locality-sensitive hashing scheme for lp norm. The VLDB Journal 26, 5 (2017), 683–708.
[46] Qiang Huang, Jianlin Feng, Yikai Zhang, Qiong Fang, and Wilfred Ng. 2015. Query-aware locality-sensitive hashing for approximate nearest neighbor search. Proceedings of the VLDB Endowment 9, 1 (2015), 1–12.
[47] Qiang Huang, Guihong Ma, Jianlin Feng, Qiong Fang, and Anthony KH Tung. 2018. Accurate and fast asymmetric locality-sensitive hashing scheme for maximum inner product search. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1561–1570.
[48] Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing. 604–613.
[49] Omid Jafari and Parth Nagarkar. 2021. Experimental Analysis of Locality Sensitive Hashing Techniques for High-Dimensional Approximate Nearest Neighbor Searches. In Databases Theory and Applications. Springer International Publishing, 62–73.
[50] Hervé Jégou, Laurent Amsaleg, Cordelia Schmid, and Patrick Gros. 2008. Query adaptative locality sensitive hashing. In . IEEE, 825–828.
[51] Jianqiu Ji, Jianmin Li, Shuicheng Yan, Bo Zhang, and Qi Tian. 2012. Super-bit locality-sensitive hashing. In Advances in neural information processing systems. Citeseer, 108–116.
[52] Jianqiu Ji, Shuicheng Yan, Jianmin Li, Guangyu Gao, Qi Tian, and Bo Zhang. 2014. Batch-orthogonal locality-sensitive hashing for angular similarity. IEEE transactions on pattern analysis and machine intelligence 36, 10 (2014), 1963–1974.
[53] Qixia Jiang and Maosong Sun. 2011. Semi-supervised simhash for efficient document similarity search. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. 93–101.
[54] Alexis Joly and Olivier Buisson. 2008. A posteriori multi-probe locality sensitive hashing. In Proceedings of the 16th ACM international conference on Multimedia. 209–218.
[55] Dimitrios Karapiperis and Vassilios S Verykios. 2013. A distributed framework for scaling up LSH-based computations in privacy preserving record linkage. In Proceedings of the 6th Balkan Conference in Informatics. 102–109.
[56] Norio Katayama and Shin’ichi Satoh. 1997. The SR-tree: An index structure for high-dimensional nearest neighbor queries. ACM Sigmod Record 26, 2 (1997), 369–380.
[57] Kifayat Ullah Khan, Batjargal Dolgorsuren, Tu Nguyen Anh, Waqas Nawaz, and Young-Koo Lee. 2017. Faster compression methods for a weighted graph using locality sensitive hashing. Information Sciences 421 (2017), 237–253.
[58] Sunwoo Kim, Haici Yang, and Minje Kim. 2020. Boosted Locality Sensitive Hashing: Discriminative Binary Codes for Source Separation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 106–110.
[59] Yongwook Bryce Kim. 2017. Physiological time series retrieval and prediction with locality-sensitive hashing. Ph.D. Dissertation. Massachusetts Institute of Technology.
[60] Yongwook Bryce Kim, Erik Hemberg, and Una-May O’Reilly. 2016. Stratified locality-sensitive hashing for accelerated physiological time series retrieval. In . IEEE, 2479–2483.
[61] Yongwook Bryce Kim and Una-May O’Reilly. 2016. Analysis of locality-sensitive hashing for fast critical event prediction on physiological time series. In . IEEE, 783–787.
[62] Hisashi Koga, Tetsuo Ishibashi, and Toshinori Watanabe. 2007. Fast agglomerative hierarchical clustering algorithm using Locality-Sensitive Hashing. Knowledge and Information Systems 12, 1 (2007), 25–53.
[63] Naama Kraus, David Carmel, Idit Keidar, and Meni Orenbach. 2016. NearBucket-LSH: Efficient similarity search in P2P networks. In International conference on similarity search and applications. Springer, 236–249.
[64] Brian Kulis and Kristen Grauman. 2009. Kernelized locality-sensitive hashing for scalable image search. In . IEEE, 2130–2137.
[65] Brian Kulis and Kristen Grauman. 2011. Kernelized locality-sensitive hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[66] Applied Mechanics and Materials, Vol. 263. Trans Tech Publ, 1341–1346.
[67] Kyung Mi Lee and Keon Myung Lee. 2013. A locality sensitive hashing technique for categorical data. In Applied Mechanics and Materials, Vol. 241. Trans Tech Publ, 3159–3164.
[68] Dongsheng Li, Wanxin Zhang, Siqi Shen, and Yiming Zhang. 2017. SES-LSH: shuffle-efficient locality sensitive hashing for distributed similarity search. In . IEEE, 822–827.
[69] Hongmei Li, Wenning Hao, Gang Chen, and Xianglin Liao. 2014. Large-scale documents reduction based on domain ontology and E2LSH. In Proceedings of the 11th IEEE International Conference on Networking, Sensing and Control. IEEE, 24–29.
[70] Hangyu Li, Sarana Nutanong, Hong Xu, Foryu Ha, et al. 2018. C2Net: A network-efficient approach to collision counting LSH similarity join. IEEE Transactions on Knowledge and Data Engineering 31, 3 (2018), 423–436.
[71] Jinfeng Li, James Cheng, Fan Yang, Yuzhen Huang, Yunjian Zhao, Xiao Yan, and Ruihao Zhao. 2017. Losha: A general framework for scalable locality sensitive hashing. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 635–644.
[72] Wen Li, Ying Zhang, Yifang Sun, Wei Wang, Mingjie Li, Wenjie Zhang, and Xuemin Lin. 2019. Approximate nearest neighbor search on high dimensional data—experiments, analyses, and improvement. IEEE Transactions on Knowledge and Data Engineering 32, 8 (2019), 1475–1488.
[73] Wanqi Liu, Hanchen Wang, Ying Zhang, Wei Wang, and Lu Qin. 2019. I-LSH: I/O efficient c-approximate nearest neighbor search in high-dimensional space. In . IEEE, 1670–1673.
[74] Wanqi Liu, Hanchen Wang, Ying Zhang, Wei Wang, Lu Qin, and Xuemin Lin. 2020. EI-LSH: An early-termination driven I/O efficient incremental c-approximate nearest neighbor search. The VLDB Journal (2020), 1–21.
[75] Yingfan Liu, Jiangtao Cui, Zi Huang, Hui Li, and Heng Tao Shen. 2014. SK-LSH: an efficient index structure for approximate nearest neighbor search. Proceedings of the VLDB Endowment 7, 9 (2014), 745–756.
[76] Kejing Lu and Mineichi Kudo. 2020. R2LSH: A Nearest Neighbor Search Scheme Based on Two-dimensional Projected Spaces. In . IEEE, 1045–1056.
[77] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. [n.d.]. A Time-Space Efficient Locality Sensitive Hashing Method for Similarity Search in High Dimensions. Technical Report.
[78] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. 2007. Multi-probe LSH: efficient indexing for high-dimensional similarity search. In . Association for Computing Machinery, Inc, 950–961.
[79] Pedro Moura, Eduardo Laber, Hélio Lopes, Daniel Mesejo, Lucas Pavanelli, João Jardim, Francisco Thiesen, and Gabriel Pujol. 2017. LSHSIM: a locality sensitive hashing based method for multiple-point geostatistics. Computers & Geosciences 107 (2017), 49–60.
[80] Brian D Ondov, Todd J Treangen, Páll Melsted, Adam B Mallonee, Nicholas H Bergman, Sergey Koren, and Adam M Phillippy. 2016. Mash: fast genome and metagenome distance estimation using MinHash. Genome biology 17, 1 (2016), 1–14.
[81] Ciprian Oprişa, Marius Checicheş, and Adrian Năndrean. 2014. Locality-sensitive hashing optimizations for fast malware clustering. In . IEEE, 97–104.
[82] Seiichi Ozawa, Junji Nakazato, Tao Ban, Jumpei Shimamura, et al. 2015. An online malicious spam email detection system using resource allocating network with locality sensitive hashing. Journal of intelligent learning systems and applications 7, 02 (2015), 42.
[83] G Padmasundari and Hema A Murthy. 2017. Raga identification using locality sensitive hashing. In . IEEE, 1–6.
[84] Jia Pan and Dinesh Manocha. 2012. Bi-level locality sensitive hashing for k-nearest neighbor computation. In . IEEE, 378–389.
[85] Rina Panigrahy. 2005. Entropy based nearest neighbor search in high dimensions. arXiv preprint cs/0510019 (2005).
[86] Youngki Park, Heasoo Hwang, and Sang-goo Lee. 2015. A Fast k-Nearest Neighbor Search Using Query-Specific Signature Selection. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. 1883–1886.
[87] A. Patil. 2007. Distributed Multi-Probe LSH: Tackling Real World Data.
[88] Daniel Probst and Jean-Louis Reymond. 2018. A probabilistic molecular fingerprint for big data settings. Journal of cheminformatics 10, 1 (2018), 1–12.
[89] Mika T Rantanen and Martti Juhola. 2015. Speeding up probabilistic roadmap planners with locality-sensitive hashing. Robotica 33, 7 (2015), 1491.
[90] Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. 2005. Randomized algorithms and NLP: Using locality sensitive hash functions for high speed noun clustering. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05). 622–629.
[91] Kexin Rong, Clara E Yoon, Karianne J Bergen, Hashem Elezabi, Peter Bailis, Philip Levis, and Gregory C Beroza. 2018. Locality-sensitive hashing for earthquake detection: A case study of scaling data-driven science. arXiv preprint arXiv:1803.09835 (2018).
[92] Matti Ryynanen and Anssi Klapuri. 2008. Query by humming of midi and audio using locality sensitive hashing. In . IEEE, 2249–2252.
[93] Kenichi Saeki, Kanji Tanaka, and Takeshi Ueda. 2009. Lsh-ransac: An incremental scheme for scalable localization. In . IEEE, 3523–3530.
[94] Venu Satuluri and Srinivasan Parthasarathy. 2011. Bayesian locality sensitive hashing for fast similarity search. arXiv preprint arXiv:1110.1328 (2011).
[95] Lu Shen, Jiagao Wu, Yongrong Wang, and Linfeng Liu. 2018. Towards load balancing for LSH-based distributed similarity indexing in high-dimensional space. In . IEEE, 384–391.
[96] Anshumali Shrivastava and Ping Li. 2014. Asymmetric LSH (ALSH) for sublinear time Maximum Inner Product Search (MIPS). Advances in Neural Information Processing Systems 3, January (2014), 2321–2329.
[97] Sadhan Sood and Dmitri Loguinov. 2011. Probabilistic near-duplicate detection using simhash. In Proceedings of the 20th ACM international conference on Information and knowledge management. 1117–1126.
[98] Yifang Sun, Wei Wang, Jianbin Qin, Ying Zhang, and Xuemin Lin. 2014. SRS: solving c-approximate nearest neighbor queries in high dimensional euclidean space with a tiny index. Proceedings of the VLDB Endowment (2014).
[99] Narayanan Sundaram, Aizana Turmukhametova, Nadathur Satish, Todd Mostak, Piotr Indyk, Samuel Madden, and Pradeep Dubey. 2013. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proceedings of the VLDB Endowment 6, 14 (2013), 1930–1941.
[100] Kanji Tanaka and Eiji Kondo. 2006. Incremental ransac for online relocation in large dynamic environments. In Proceedings 2006 IEEE International Conference on Robotics and Automation (ICRA 2006). IEEE, 68–75.
[101] Kanji Tanaka and Eiji Kondo. 2008. A scalable localization algorithm for high dimensional features and multi robot systems. In . IEEE, 920–925.
[102] Kaiyuan Tian, Jian Wang, Yuanyuan Liao, Dengke Xu, and Baigen Cai. 2020. LSH-XGBoost based Network Congestion Detection Method for SSDN. In Journal of Physics: Conference Series, Vol. 1549. IOP Publishing, 052069.
[103] Vikrant Singh Tomar and Richard C Rose. 2013. Efficient manifold learning for speech recognition using locality sensitive hashing. In . IEEE, 6995–6999.
[104] Benjamin Van Durme and Ashwin Lall. 2010. Online generation of locality sensitive hash signatures. In Proceedings of the ACL 2010 conference short papers. 231–235.
[105] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. 2014. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927 (2014).
[106] Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. 2017. A survey on learning to hash. IEEE transactions on pattern analysis and machine intelligence 40, 4 (2017), 769–790.
[107] Peng Wang, Dong Yin, and Tao Sun. 2014. Bi-Level Locality Sensitive Hashing Index Based on Clustering. In Applied Mechanics and Materials, Vol. 556. Trans Tech Publ, 3804–3808.
[108] Qiang Wang, Zhiyuan Guo, Gang Liu, and Jun Guo. 2012. Boundary-expanding locality sensitive hashing. In . IEEE, 358–362.
[109] Jiagao Wu, Lu Shen, and Linfeng Liu. 2020. LSH-based distributed similarity indexing with load balancing in high-dimensional space. The Journal of Supercomputing 76, 1 (2020), 636–665.
[110] Zhenyu Wu and Ming Zou. 2014. An incremental community detection method for social tagging systems using locality-sensitive hashing. Neural networks 58 (2014), 14–28.
[111] Zhihua Xia, Xinhui Wang, Liangao Zhang, Zhan Qin, Xingming Sun, and Kui Ren. 2016. A privacy-preserving and copy-deterrence content-based image retrieval scheme in cloud computing. IEEE transactions on information forensics and security 11, 11 (2016), 2594–2608.
[112] Gaogang Xie, Kun Xie, Jun Huang, Xin Wang, Yuxiang Chen, and Jigang Wen. 2017. Fast low-rank matrix approximation with locality sensitive hashing for quick anomaly detection. In IEEE INFOCOM 2017 - IEEE Conference on Computer Communications. IEEE, 1–9.
[113] Hongtao Xie, Zhineng Chen, Yizhi Liu, Jianlong Tan, and Li Guo. 2014. Data-dependent locality sensitive hashing. In Pacific Rim Conference on Multimedia. Springer, 284–293.
[114] Kuan Xu, Yu Qiao, Xiaoguang Niu, Xinzui Fang, Yuan Han, and Jie Yang. 2018. Bone scintigraphy retrieval using sift-based fly local sensitive hashing. In . IEEE, 735–740.
[115] Xiangyang Xu, Tongwei Ren, and Gangshan Wu. 2014. Clsh: Cluster-based locality-sensitive hashing. In Proceedings of International Conference on Internet Multimedia Computing and Service. 144–147.
[116] Xiao Yan, Jinfeng Li, Xinyan Dai, Hongzhi Chen, and James Cheng. 2018. Norm-Ranging LSH for Maximum Inner Product Search. Advances in Neural Information Processing Systems 31 (2018), 2952–2961.
[117] Shaoyi Yin, Mehdi Badr, and Dan Vodislav. 2013. Dynamic multi-probe lsh: An i/o efficient index structure for approximate nearest neighbor search. In International Conference on Database and Expert Systems Applications. Springer, 48–62.
[118] Chenyun Yu, Lintong Luo, Leanne Lai-Hang Chan, Thanawin Rakthanmanon, and Sarana Nutanong. 2019. A fast LSH-based similarity search method for multivariate time series. Information Sciences 476 (2019), 337–356.
[119] Chenyun Yu, Sarana Nutanong, Hangyu Li, Cong Wang, and Xingliang Yuan. 2016. A generic method for accelerating LSH-based similarity join processing. IEEE Transactions on Knowledge and Data Engineering 29, 4 (2016), 712–726.
[120] Yi Yu, Michel Crucianu, Vincent Oria, and Ernesto Damiani. 2010. Combining multi-probe histogram and order-statistics based lsh for scalable audio content retrieval. In Proceedings of the 18th ACM international conference on Multimedia. 381–390.
[121] Yi Yu, Suhua Tang, and Roger Zimmermann. 2013. Edge-based locality sensitive hashing for efficient geo-fencing application. In Proceedings of the 21st ACM SIGSPATIAL international conference on advances in geographic information systems. 576–579.
[122] Yi Yu, Roger Zimmermann, Ye Wang, and Vincent Oria. 2013. Scalable content-based music retrieval using chord progression histogram and tree-structure LSH. IEEE Transactions on Multimedia 15, 8 (2013), 1969–1981.
[123] Boyu Zhang, Xianglong Liu, and Bo Lang. 2015. Fast graph similarity search via locality sensitive hashing. In Pacific Rim Conference on Multimedia. Springer, 623–633.
[124] Lei Zhang, Yongdong Zhang, Dongming Zhang, and Qi Tian. 2013. Distribution-aware locality sensitive hashing. In Advances in Multimedia Modeling. Springer, 395–406.
[125] Wei Zhang, Ke Gao, Yong-dong Zhang, and Jin-tao Li. 2010. Data-oriented locality sensitive hashing. In Proceedings of the 18th ACM international conference on Multimedia. 1131–1134.
[126] Xuyun Zhang, Wanchun Dou, Qiang He, Rui Zhou, Christopher Leckie, Ramamohanarao Kotagiri, and Zoran Salcic. 2017. LSHiForest: A generic framework for fast tree isolation based ensemble anomaly analysis. In . IEEE, 983–994.
[127] Xuyun Zhang, Christopher Leckie, Wanchun Dou, Jinjun Chen, Ramamohanarao Kotagiri, and Zoran Salcic. 2016. Scalable local-recoding anonymization using locality sensitive hashing for big data privacy preservation. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. 1793–1802.
[128] Ying Zhang, Huchuan Lu, Lihe Zhang, Xiang Ruan, and Shun Sakai. 2016. Video anomaly detection based on locality sensitive hashing filters. Pattern Recognition 59 (2016), 302–311.
[129] Bolong Zheng, Zhao Xi, Lianggui Weng, Nguyen Quoc Viet Hung, Hang Liu, and Christian S Jensen. 2020. PM-LSH: A fast and accurate LSH framework for high-dimensional approximate NN search. Proceedings of the VLDB Endowment 13, 5 (2020), 643–655.
[130] Yuxin Zheng, Qi Guo, Anthony KH Tung, and Sai Wu. 2016. Lazylsh: Approximate nearest neighbor search for multiple distance functions with a single index. In Proceedings of the 2016 International Conference on Management of Data. 2023–2037.
[131] Aleksei Zhuvikin. 2018. A Blockchain Of Image Copyrights Using Robust Image Features And Locality-Sensitive Hashing [J].