Dynamic Similarity Search on Integer Sketches
Shunsuke Kanda
RIKEN Center for Advanced Intelligence Project
Tokyo, Japan ([email protected])
Yasuo Tabei
RIKEN Center for Advanced Intelligence Project
Tokyo, Japan ([email protected])
Abstract—Similarity-preserving hashing is a core technique for fast similarity searches; it randomly maps data points in a metric space to strings of discrete symbols (i.e., sketches) in the Hamming space. While traditional hashing techniques produce binary sketches, recent ones produce integer sketches for preserving various similarity measures. However, most similarity search methods are designed for binary sketches and are inefficient for integer sketches. Moreover, most methods are either inapplicable or inefficient for dynamic datasets, although modern real-world datasets are updated over time. We propose the dynamic filter trie (DyFT), a dynamic similarity search method for both binary and integer sketches. An extensive experimental analysis using large real-world datasets shows that DyFT performs superiorly with respect to scalability, time performance, and memory efficiency. For example, on a huge dataset of 216 million data points, DyFT performs a similarity search 6,000 times faster than a state-of-the-art method while using one-thirteenth of the memory.
I. INTRODUCTION
Similarity search of vectorial data in databases has been a fundamental task in recent data analysis and has various applications such as near-duplicate detection in collections of web pages [1], context-based retrieval in images [2], and functional analysis of molecules [3]. In recent years, databases for these applications have become larger, and the vectorial data in these databases have become higher dimensional, making it difficult to apply existing similarity search methods to such large databases. Therefore, it is necessary to develop much more powerful similarity search methods to analyze large databases efficiently.

Similarity-preserving hashing is a powerful technique that approximates a similarity measure by randomly mapping data points in a metric space to strings of discrete symbols (i.e., sketches) in the Hamming space. Similarity search problems for various similarity measures can be approximately solved as the
Hamming distance problem for sketches (i.e., computation of the number of positions at which the corresponding integers of two sketches differ). Thus far, many hashing techniques producing binary sketches have been developed, as reviewed in [4]; accordingly, quite a few similarity search methods especially for binary sketches have been proposed over the decades (e.g., [5]–[8]). In recent years, many types of hashing techniques producing integer sketches have been developed for various similarity measures. Examples are b-bit minwise hashing for Jaccard similarity [9], 0-bit consistent weighted sampling (CWS) for min-max kernels [10], and 0-bit CWS for generalized min-max kernels [11]. There is a strong need to develop efficient solutions for the general Hamming distance problem for not only binary sketches but also integer sketches; however, few similarity search methods designed for the general problem have been proposed [12]–[14].

Modern real-world datasets are updated over time, a scenario we shall call the dynamic setting. For example, search engines often receive a large number of new web pages containing images and text data, which arrive in the data center every day. Dynamic similarity search methods that can efficiently insert new data points into a dataset and delete data points from the dataset are essential in modern data mining and information retrieval. However, most state-of-the-art methods have drawbacks: (i) limitations to static settings [5,6,13,14] or (ii) inefficiency in dynamic settings [7,12]. Although Eghbali et al. [8] recently proposed the Hamming weight tree (HWT) to address this problem, it is applicable only to binary sketches and its performance degrades on large datasets. Consequently, an important open challenge is to develop a fast, scalable, and dynamic similarity search method for the general Hamming distance problem.

Our main contributions in this paper are as follows:
• We propose the dynamic filter trie (DyFT), a dynamic similarity search method for both binary and integer sketches using an edge-labeled tree called a trie [15]. DyFT grows the data structure based on a search cost model to maintain fast similarity searches. It also reduces memory consumption by omitting redundant trie nodes (Section IV).
• We design an implementation for DyFT, called the modified adaptive radix tree (MART), in which the data structure changes adaptively depending on the configuration of DyFT nodes. MART always enables DyFT to perform well for any input parameter of similarity-preserving hashing (Section V).
• We present an extensive experimental analysis showing that DyFT performs superiorly compared to state-of-the-art similarity search methods for both binary and integer sketches with respect to scalability, time performance, and memory efficiency. For example, on a huge dataset of 216 million sketches, DyFT performs a similarity search 6,000 times faster than HWT while using one-thirteenth of the memory (Section VI).
II. PROBLEM STATEMENT
TABLE I
SUMMARY OF SIMILARITY SEARCH METHODS.

Method              | Sketch type | Data structure | Search time                                   | Update time | Memory
MIH [7]             | Binary      | Hash table     | O(q(m/q)^{r/q} · max(1, n/2^{m/q})) + V_mih   | O(q)        | O(mn)
HWT [8]             | Binary      | Search tree    | O(m log m (log n)^r) + V_hwt                  | O(m log m)  | O(mn)
HmSearch (HSV) [12] | Integer     | Hash table     | O(r · max(1, mnσ/σ^{m−m/r})) + V_hsv          | O(mσ/r)     | O(mnσ)
HmSearch (HSD) [12] | Integer     | Hash table     | O(m · max(1, mn/σ^{m−m/r+1})) + V_hsd         | O(m/r)      | O(mn)
GV [5]              | Integer     | Hash table     | O((m + r) · max(1, n/σ^{m/(r+2)})) + V_gv     | O(m)        | O(mn)
DyFT (this study)   | Integer     | MART           | O(m^{r+2}) + V_dyft                           | O(m)        | O(mn)
DyFT+ (this study)  | Integer     | MART           | O(q(m/q)^{r/q+2}) + V_dyft+                   | O(m)        | O(mn)

Note: V denotes the verification time for the candidates obtained from each similarity search method.

Sketch x of length m is an m-dimensional vector of non-negative integers from alphabet Σ = {0, 1, ..., σ − 1} of alphabet size σ, i.e., x ∈ Σ^m. The i-th element of x is denoted by x[i]. The Hamming distance between sketches x and y is the number of positions at which the corresponding elements are different, formally defined as

  H(x, y) = Σ_{i=1}^{m} f_i, where f_i = 1 if x[i] ≠ y[i] and f_i = 0 if x[i] = y[i].

We assume m = O(w) for word size w. Then, H(x, y) can be computed in O(log σ) time by performing ⌈log₂ σ⌉ sets of bitwise XOR and popcount operations [12].

A database X = {x_1, x_2, ..., x_n} is a dynamic set consisting of n sketches, and it supports the insertion of a new sketch x_i and the deletion of a sketch x_i. The general Hamming distance problem for a given sketch y and radius r is to find all the sketches in X whose Hamming distance to y is at most r, i.e., R = {x_i ∈ X : H(x_i, y) ≤ r}.
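As an illustration of the XOR-and-popcount computation (a minimal sketch of ours, not the authors' implementation), suppose m = 64 and σ = 16, with a sketch stored in ⌈log₂ σ⌉ = 4 bit-planes of one 64-bit word each; a position differs whenever any of its bits differs:

#include <array>
#include <cstdint>

// A sketch of m = 64 symbols over Σ = {0, ..., 15} stored in 4 bit-planes:
// plane j holds bit j of every symbol, one symbol per bit position.
using Sketch16 = std::array<uint64_t, 4>;

// Hamming distance over symbols: the number of positions whose symbols differ.
int hamming(const Sketch16& x, const Sketch16& y) {
    uint64_t diff = 0;
    for (int j = 0; j < 4; ++j) {       // ⌈log2 σ⌉ = 4 XOR rounds
        diff |= x[j] ^ y[j];            // mark positions differing in bit j
    }
    return __builtin_popcountll(diff);  // count the marked positions
}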
III. LITERATURE REVIEW

Many similarity search methods on Hamming distance have been proposed over the decades. Several recent studies have focused on static settings [5,6,13,14]. Theoretical aspects have also been studied [16]–[18]. In this section, we briefly review state-of-the-art similarity search methods for binary and integer sketches that are applicable to dynamic datasets. Table I summarizes the state of the art.

A seminal work for binary sketches is multi-index hashing (MIH) developed by Norouzi et al. [7]. MIH is based on the multi-index approach [19], and it enables quick similarity searches even with large r. A key observation is that two similar sketches must have similar parts. Thus, MIH partitions each sketch into q blocks of short sketches and builds q hash tables from the short sketches in each block. The similarity search first obtains a set of candidate solutions R′ ⊇ R by retrieving each hash table with the small radius ⌊r/q⌋ and then removes false positives from R′ by computing the Hamming distances.

The number of blocks offering the best search performance is determined by the configuration of the dataset. Norouzi et al. [7] empirically demonstrated that the best performance of MIH is often achieved when q = m/log₂ n. They also showed that setting q to a number far from m/log₂ n significantly degrades performance. Thus, MIH is unsuitable for dynamic problem settings where database size n varies.

The Hamming weight tree (HWT) developed by Eghbali et al. [8] is a state-of-the-art similarity search method that addresses this issue of MIH. Instead of using hash tables, HWT uses a search tree constructed based on Hamming weight (i.e., the number of ones appearing in a binary sketch). However, the similarity search takes O(m log m (log n)^r) time and slows down dramatically for a database with large n. In addition, these similarity search methods were designed for binary sketches and are not necessarily suitable for integer sketches.

HmSearch developed by Zhang et al. [12] is a multi-index similarity search method designed for integer sketches. HmSearch reduces the general Hamming distance problem with radius r to small problems with radius one by tuning the number of blocks and preregistering candidate solutions in hash tables. However, this approach of preregistering candidate solutions consumes a large amount of memory and requires a large amount of update time.

Gog and Venturini [5] proposed an idea that defines ⌊r/2⌋ + 1 blocks to produce small problems with radius one and bypasses preregistering candidate solutions in hash tables. They presented a simple variant of HmSearch, which is referred to as GV in this paper. The similarity search is performed with the same algorithm as that of MIH. Thus, GV can be considered a simple modification of MIH for integer sketches and has the same issue as MIH, which causes inefficiency in dynamic problem settings where n is variable. Moreover, GV's search speed is much slower than HmSearch's, as experimentally demonstrated in Section VI.

Despite the importance of dynamic similarity search methods for the general Hamming distance problem, there is no efficient method. The main reason is that most methods rely on the multi-index approach using hash tables, which requires setting the appropriate number of blocks depending on the variable parameter n. Although HWT attempts to address that issue using a tree structure, it is inefficient for large datasets and is inapplicable to integer sketches.

IV. DYNAMIC FILTER TRIE
DyFT is a dynamic similarity search method for the general Hamming distance problem. As with HWT, DyFT is built on a tree-based data structure. In contrast to HWT, DyFT employs a trie data structure [15], which enables quick similarity searches for integer sketches. In this section, we first introduce the trie data structure and the design motivation of DyFT; then, we present DyFT's data structure and complexity analyses.
A. Preliminaries
A trie is an edge-labeled tree storing a set of sketches. Each node is associated with the common prefix of a subset of the sketches, and each leaf is associated with a particular sketch in the database. Each edge has an integer label for organizing sketches. All outgoing edges of an inner node are labeled with distinct integers. The downward path from the root to each leaf corresponds to the sketch associated with the leaf.

The exact search for a given sketch y traverses trie nodes from the root by using the integers of y. If we reach a leaf, y is stored in the trie. A simple extension of the exact search implements the similarity search with radius r. The similarity search traverses trie nodes from the root by using the integers of y with at most r errors allowed. In other words, we count the number of errors from the root to each node v visited in the traversal and, if the number exceeds r, stop traversing down to the descendants of v. The solution R is the set of all sketches associated with leaves reachable within r errors. A more detailed description of the similarity search algorithm using a trie is presented in [13, Sect. IV-B]. The similarity search can prune unnecessary portions of the search space and can be performed quickly for a small radius r. The time complexity is O(m^{r+2}) [20]. (Although Arslan and Eğecioğlu [20] derived the complexity assuming σ = 2, it does not vary for any σ.)

Each inner node in a trie is implemented as a mapping from edge labels to child pointers. A trie storing a large database X maintains many pointers and consumes a large amount of memory. A well-known technique for substantially reducing memory consumption is to omit nodes around the leaves. Thus far, a number of memory-efficient trie data structures have been developed by leveraging this technique, e.g., [21]–[24]. However, these data structures were designed for exact string searches and are inefficient for similarity searches.

There is no dynamic and scalable trie data structure for similarity searches. In the remainder of this section, we present DyFT, which omits many nodes while maintaining fast similarity searches. DyFT's performance also depends on the implementation of the mapping at each inner node. In Section V, we introduce an efficient node implementation for DyFT.
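As a concrete illustration of the radius-r traversal described above, the following C++ sketch (ours; a plain pointer-based trie without DyFT's posting lists, with hypothetical names) descends with an error budget and prunes when the budget is exhausted:

#include <cstdint>
#include <map>
#include <vector>

// A minimal (non-DyFT) trie over integer sketches, for illustration only.
struct TrieNode {
    std::map<uint8_t, TrieNode*> children;  // edge label -> child
    int sketch_id = -1;                     // id of the sketch at a leaf, if any
};

// Collect the ids of all sketches within Hamming distance r of y.
// Call with level = 0 and budget = r on the root.
void search(const TrieNode* v, const std::vector<uint8_t>& y,
            size_t level, int budget, std::vector<int>& out) {
    if (budget < 0) return;                 // more than r errors: prune
    if (v->sketch_id != -1) {               // leaf: reached within r errors
        out.push_back(v->sketch_id);
        return;
    }
    for (const auto& [label, child] : v->children) {
        // descending along a mismatching edge consumes one error
        int cost = (label == y[level]) ? 0 : 1;
        search(child, y, level + 1, budget - cost, out);
    }
}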
B. Approach

The basic idea is to allow false positives and store only some of the trie nodes around the root. In other words, DyFT exploits the trie search algorithm for filtering out dissimilar sketches and aims to obtain solution candidates. Figure 1 shows an example of DyFT for eight sketches. A leaf v at level ℓ reached by a sub-sketch x′ ∈ Σ^ℓ is associated with all sketches in X starting with x′. For example, in Figure 1, the leaf reached by "03" is associated with the sketches x_3 and x_8 starting with "03". Every leaf v has a posting list of associated sketches. We denote the posting list by L_v and its length by |L_v|.

Fig. 1. Example of DyFT for the database X = {x_1 = 111020, x_2 = 001020, x_3 = 032021, x_4 = 113021, x_5 = 333110, x_6 = 330110, x_7 = 311020, x_8 = 030120} with levels ℓ = 0, 1, 2. The search for y = 111020 with r = 1 traverses nodes along the blue dashed arrows and reaches the posting lists containing x_1, x_2, and x_7. The solution R = {x_1, x_7} is obtained by verifying H(x_1, y) = 0, H(x_2, y) = 2, and H(x_7, y) = 1.

Algorithm 1:
Search and insertion algorithms of DyFT

function Search(y, r):                        ▷ traverse DyFT nodes using a stack
    R ← ∅; S_stack ← {(v_root, 0, r)}          ▷ v_root: DyFT's root
    while S_stack ≠ ∅:
        pop (v, ℓ, r′) from S_stack            ▷ r′ is r minus the number of errors at node v of level ℓ
        if v is a leaf:                        ▷ verify the candidates in L_v
            for x_i ∈ L_v:
                if H(x_i, y) ≤ r: append x_i to R
            continue
        if r′ > 0:                             ▷ check all the children of v
            for each child u of v:
                if u's edge label is y[ℓ + 1]: push (u, ℓ + 1, r′) to S_stack
                else: push (u, ℓ + 1, r′ − 1) to S_stack
        else if r′ = 0:                        ▷ look up the child of v
            if v has a child u with edge label y[ℓ + 1]: push (u, ℓ + 1, r′) to S_stack
    return R

procedure Insert(y):
    v ← v_root; ℓ ← 0
    while v is not a leaf:                     ▷ traverse DyFT nodes
        if v does not have a child with edge label y[ℓ]: insert a new child of v with edge label y[ℓ]
        v ← the child of v with edge label y[ℓ]; ℓ ← ℓ + 1
    append y to L_v
    if |L_v| > τ:                              ▷ split v and create new leaves from v
        for c ∈ {x_i[ℓ] : x_i ∈ L_v}:
            insert a new leaf u from v with edge label c
            create a new posting list L_u = {x_i ∈ L_v : x_i[ℓ] = c}
        remove the old posting list L_v

The similarity search for given y and r traverses DyFT nodes in the aforementioned manner. For a leaf v reached within r errors, each sketch x_i ∈ L_v is verified by checking whether H(x_i, y) ≤ r. Figure 1 shows a search example, and Algorithm 1 shows the search algorithm.

We now present the insertion algorithm. Initially, the DyFT structure for an empty X consists only of the root with an empty posting list. Given a sketch x_i, we traverse DyFT nodes using x_i and visit the deepest reachable node v. If v is an inner node, we insert a new leaf u from v and associate with it a new posting list L_u storing x_i. If v is a leaf, we append x_i to L_v; then, DyFT determines whether leaf v should be split. If |L_v| is longer than a threshold τ, we create new leaves from v and split L_v into disjoint short lists (see Figure 2). Algorithm 1 shows the insertion algorithm.

Fig. 2. Example of inserting x = 030110 into L_v in Figure 1. If |L_v| is more than τ, we split v into two leaves.

The deletion algorithm is symmetric to that of insertion. We remove x_i from L_v for the leaf v reached by x_i. If L_v becomes empty, we remove leaf v from DyFT.

C. Optimal Threshold
The search performance of DyFT is affected by the threshold τ. If τ is large, the verification time for L_v becomes large. If τ is small, DyFT defines many nodes and the traversal time becomes large. Thus, we need to set a reasonable value of τ. Such a value can be determined according to the configuration of X and given parameters such as n, σ, and r; however, it is impossible to search for such a value in dynamic settings. To address this issue, we first construct a search cost model assuming that sketches are uniformly distributed in the Hamming space, and we then determine an optimal threshold τ* minimizing the search cost.

Fixing r and σ, we consider the reach probability of node v at level ℓ, which is the probability of reaching v within r errors using a random sketch x ∈ Σ^ℓ drawn from a uniform distribution. Let v be traversed from the root using sketch φ(v) ∈ Σ^ℓ. The set of all sketches reaching v within r errors is {x ∈ Σ^ℓ : H(x, φ(v)) ≤ r}, whose cardinality is

  N(ℓ) = Σ_{k=0}^{r} (ℓ choose k) (σ − 1)^k.

As the number of all possible sketches of length ℓ is σ^ℓ, the reach probability of a node at level ℓ is

  P(ℓ) = 1 for ℓ ≤ r, and P(ℓ) = N(ℓ)/σ^ℓ for ℓ > r.

It holds that P(ℓ) > P(ℓ + 1) for ℓ ≥ r.

We define the search cost of node v at level ℓ for a random sketch x ∈ Σ^ℓ by multiplying the reach probability by the computational cost. When we visit an inner node v at level ℓ during the similarity search, we try to descend to the children of v. There are two cases: (i) H(x, φ(v)) < r or (ii) H(x, φ(v)) = r. In case (i), we check all the children in O(σ) time (Lines 10–14 in Algorithm 1). In case (ii), we look up the child in O(1) time (Lines 16–17 in Algorithm 1). Case (ii) occurs for sketches in {x ∈ Σ^ℓ : H(x, φ(v)) = r}, whose cardinality is

  N1(ℓ) = (ℓ choose r) (σ − 1)^r.

The occurrence probability of case (ii) is N1(ℓ)/N(ℓ), and the computational cost of v is

  F_in(ℓ) = (1 − N1(ℓ)/N(ℓ)) × σ + N1(ℓ)/N(ℓ) × 1.

Thus, the search cost of inner node v at level ℓ is C_in(v) = P(ℓ) × F_in(ℓ). When we visit a leaf v at level ℓ, we verify all sketches associated with L_v, and the search cost is C_leaf(v) = P(ℓ) × |L_v| × ⌈log₂ σ⌉.

We fix the optimal threshold τ* based on the search cost model. After appending a new sketch to L_v, τ* can be used to determine whether to split v, depending on |L_v|, so as to maintain the smaller search cost. If v is not split, the search cost is C_leaf(v). If v is split into k new leaves u_1, u_2, ..., u_k, the new search cost is

  C_in(v) + Σ_{i=1}^{k} C_leaf(u_i).

We assume that node v is at level ℓ ≥ r. Since the total length of L_{u_1}, L_{u_2}, ..., L_{u_k} is |L_v|, it holds that

  Σ_{i=1}^{k} C_leaf(u_i) = P(ℓ + 1) × |L_v| × ⌈log₂ σ⌉.

Thus, splitting v can maintain the smaller search cost if

  |L_v| > P(ℓ) / (P(ℓ) − P(ℓ + 1)) × F_in(ℓ) / ⌈log₂ σ⌉ =: τ*.   (1)
Given r and σ, the optimal thresholds τ* are determined for each level ℓ and can be precomputed. Figure 3 shows the optimal thresholds τ* for various parameters r and σ.

Fig. 3. Optimal thresholds τ* for various parameters (σ = 2 and σ = 16, each with r = 1, 2, 3, 4).
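The cost model and Eq. (1) can be evaluated directly. The following C++ sketch (our illustration, not the authors' code) precomputes τ* for a level ℓ ≥ r:

#include <cmath>

// Binomial coefficient (n choose k) in floating point.
double binom(int n, int k) {
    double b = 1.0;
    for (int i = 0; i < k; ++i) b = b * (n - i) / (i + 1);
    return b;
}

// N(ℓ) = Σ_{k=0}^{r} (ℓ choose k)(σ−1)^k: sketches reaching a node within r errors.
double N(int ell, int r, int sigma) {
    double s = 0.0;
    for (int k = 0; k <= r; ++k) s += binom(ell, k) * std::pow(sigma - 1, k);
    return s;
}

// Reach probability P(ℓ).
double P(int ell, int r, int sigma) {
    return (ell <= r) ? 1.0 : N(ell, r, sigma) / std::pow(sigma, ell);
}

// Optimal threshold τ* of Eq. (1), valid for ℓ ≥ r (the divisor is nonzero).
double tau_star(int ell, int r, int sigma) {
    double n = N(ell, r, sigma);
    double n1 = binom(ell, r) * std::pow(sigma - 1, r);     // case (ii) count N1(ℓ)
    double f_in = (1.0 - n1 / n) * sigma + (n1 / n) * 1.0;  // F_in(ℓ)
    double c = std::ceil(std::log2((double)sigma));         // ⌈log2 σ⌉
    return P(ell, r, sigma) / (P(ell, r, sigma) - P(ell + 1, r, sigma)) * f_in / c;
}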
Exception Case. We need to address the exception when ℓ < r, because the divisor of τ* becomes zero, i.e., P(ℓ) = P(ℓ + 1). The occurrence of the exception is intuitively correct because the search always traverses all nodes at level ℓ ≤ r, and splitting a leaf at level ℓ < r just generates redundant nodes locally.

We fix τ* to zero for ℓ < r since we cannot determine τ* by Eq. (1). Instead, we incrementally compute the total search cost of DyFT, defined as

  C_trie = Σ_{v ∈ V_in} C_in(v) + Σ_{v ∈ V_leaf} C_leaf(v),

where V_in and V_leaf are the sets of inner nodes and leaves in DyFT, respectively. In the search phase, we compare the current cost C_trie with the computational cost of a linear search over X, i.e., C_ls = n × ⌈log₂ σ⌉. If C_ls ≤ C_trie, we perform a linear search over x_i ∈ X to avoid redundant node traversals; otherwise, we perform Search in Algorithm 1. Algorithm 2 shows the modified search algorithm. This switching approach enables us to select the faster search algorithm depending on the configuration of DyFT.

Algorithm 2: Modified search algorithm of DyFT

function Search*(y, r):
    if C_ls ≤ C_trie:                   ▷ perform a linear search
        R ← ∅
        for x_i ∈ X:
            if H(x_i, y) ≤ r: append x_i to R
        return R
    else:                               ▷ perform the trie search of Algorithm 1
        return Search(y, r)
Weighting Factor.
In practice, the computational costs of C_in and C_leaf depend on the implementation of DyFT and the configuration of the computing machine. To address the gap between the theoretical and practical costs, we introduce a weighting factor W_in for inner nodes and adjust the search cost of inner node v to W_in × C_in(v). We search for a value of W_in that supports fast searches by using a synthetic dataset of random sketches generated from a uniform distribution.

D. Complexities
We simply assume that τ is constant and derive the complexities shown in Table I. The similarity search consists of traversing DyFT nodes, accessing posting lists, and verifying candidates. The number of traversed nodes is bounded by O(m^{r+2}) when assuming the complete σ-ary trie [20]; thus, the traversal time is O(m^{r+2}). The access time for each posting list is O(1) because the length of each posting list is bounded by the constant τ. Therefore, the search time complexity is O(m^{r+2}) + V_dyft, where V_dyft is the verification time for the obtained candidates.

Insertion is performed by traversing DyFT nodes in O(m) time and splitting the posting list in O(1) time. Deletion is also performed by traversing DyFT nodes in O(m) time and removing a leaf in O(1) time. Thus, the update time is O(m). The memory complexity is O(mn) since the number of nodes is bounded by O(mn).

Multi-index Variant DyFT+. The similarity search of DyFT is inefficient for large r because the complexity is exponential in r. We can relax the time using the multi-index approach [19]. In the same manner as MIH, we define q DyFT structures, one for each block. We call this multi-index variant
DyFT+. The similarity search is performed on the q small DyFT structures with block length m/q and threshold ⌊r/q⌋. The time complexity is O(q(m/q)^{r/q+2}) + V_dyft+, where V_dyft+ is the verification time for the obtained candidates. The update time and memory complexities are the same as those of DyFT.
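The following C++ sketch (our illustration; a brute-force scan stands in for each per-block DyFT structure, and all names are ours) shows the multi-index search pattern:

#include <cstdint>
#include <functional>
#include <set>
#include <vector>

// Stand-in for a per-block index: scans the j-th block of every sketch.
struct BlockIndex {
    std::vector<std::vector<uint8_t>> blocks;  // the j-th block of every sketch
    std::vector<int> search(const std::vector<uint8_t>& qb, int rad) const {
        std::vector<int> ids;
        for (int id = 0; id < (int)blocks.size(); ++id) {
            int d = 0;
            for (size_t i = 0; i < qb.size(); ++i) d += (blocks[id][i] != qb[i]);
            if (d <= rad) ids.push_back(id);
        }
        return ids;
    }
};

// y_blocks: the query split into q blocks; verify(id) checks H(x_id, y) ≤ r.
std::vector<int> multi_index_search(const std::vector<BlockIndex>& index,
                                    const std::vector<std::vector<uint8_t>>& y_blocks,
                                    int r, const std::function<bool(int)>& verify) {
    std::set<int> cand;                  // deduplicate candidates across blocks
    int q = (int)index.size();
    for (int j = 0; j < q; ++j)          // pigeonhole: a true match agrees with
        for (int id : index[j].search(y_blocks[j], r / q))  // some block within ⌊r/q⌋ errors
            cand.insert(id);
    std::vector<int> out;
    for (int id : cand)
        if (verify(id)) out.push_back(id);  // remove false positives
    return out;
}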
V. NODE IMPLEMENTATION

The node implementation is also significant for enhancing the performance of DyFT. This section presents the modified adaptive radix tree (MART), an efficient node implementation for DyFT. We first give observations on node implementations and then present our scheme for implementing MART. Subsequently, we describe the data structure of MART.
A. Observation and Implementation Scheme
We consider a data structure for an inner node that maps edge labels to child pointers. A simple data structure referred to as the array form is a pointer array of length σ whose c-th element holds the child pointer with edge label c. The array form can directly obtain the child pointer for a given c. Using the array form as a baseline, we provide the following observations on node implementations.

Observation A.
For binary sketches (i.e., σ = 2), the array form is memory-efficient because most inner nodes have two children and most elements of the array are used. By chunking bits in binary sketches and suppressing the height of DyFT, we can reduce the cache misses caused by node-to-node traversals and enhance time performance, as observed in prior studies [24]–[26].

Observation B.
For integer sketches with large σ, inner nodes around the root have many children, but those around the leaves have few children. The array form is inefficient for nodes with few children because most elements of the array are empty. Memory efficiency can be improved by introducing several data structures depending on the number of children, as suggested in prior studies [13,23,24]. Although the adaptive radix tree (ART) [24] is a successful data structure in this approach, it was designed for byte edge labels and lacks generality with respect to σ.

Scheme.
We assume σ ≤ 16, following practical settings of similarity-preserving hashing techniques [9,11,27]. MART reorganizes integer sketches into byte sketches to suppress DyFT's height (from Observation A) and represents DyFT nodes over the byte sketches using a modified ART data structure (from Observation B). Sections V-B and V-C present the former and latter approaches, respectively.

B. Byte Packing and Fast Computation
To efficiently handle integer sketches as byte sketches, we pack z = ⌊log_σ 256⌋ integers c_1, c_2, ..., c_z into a byte b = Σ_{i=1}^{z} c_i σ^{i−1} < σ^z ≤ 256. In this manner, we convert an integer sketch x = (c_1, c_2, ..., c_m) ∈ Σ^m into a byte sketch x′ = (b_1, b_2, ..., b_{m′}) of length m′ = ⌈m/z⌉. In what follows, H(b, b′) denotes the Hamming distance between the two integer sequences c_1, c_2, ..., c_z and c′_1, c′_2, ..., c′_z packed in two bytes b and b′, respectively.

Through the packing, we build a DyFT structure from byte sketches and perform the similarity search using a given byte sketch. When we visit an inner node v during the search, we face the small problem corresponding to Lines 9–17 in Algorithm 1.

Fig. 4. Example of Problem 1 with r′ = 1, b = 0231, and Σ = {0, 1, 2, 3}. The byte labels are denoted in unpacked format. The two children whose labels a satisfy H(a, b) = 1 and H(a, b) = 0 are the solution, while the two with H(a, b) = 2 and H(a, b) = 3 are not.
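To make the packing concrete, the following C++ sketch (ours, not the authors' code) packs symbols for σ = 4, where z = ⌊log_4 256⌋ = 4 symbols fit in one byte, and evaluates H(b, b′) on packed bytes:

#include <cstdint>
#include <vector>

static const int SIGMA = 4, Z = 4;  // σ = 4, z = 4 symbols per byte (example)

// b = Σ c_i σ^{i-1}: pack z consecutive symbols into one byte each.
std::vector<uint8_t> pack(const std::vector<uint8_t>& x) {
    std::vector<uint8_t> out((x.size() + Z - 1) / Z);
    for (size_t i = 0; i < x.size(); ++i) {
        unsigned weight = 1;                       // σ^{i mod z}
        for (size_t j = 0; j < i % Z; ++j) weight *= SIGMA;
        out[i / Z] += x[i] * weight;
    }
    return out;
}

// Hamming distance H(b, b') between the z symbols packed in two bytes.
int packed_hamming(uint8_t b1, uint8_t b2) {
    int d = 0;
    for (int i = 0; i < Z; ++i) {
        d += (b1 % SIGMA) != (b2 % SIGMA);  // compare the i-th packed symbol
        b1 /= SIGMA;
        b2 /= SIGMA;
    }
    return d;
}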
Fig. 5. MART representations (NodeS, NodeD, and NodeF) for node v with k children u_1, u_2, ..., u_k. The child pointer to the child u with edge label 3 is stored in Ptr[1] such that Key[1] = 3 in NodeS, in Ptr[Idx[3] = 1] in NodeD, and in Ptr[3] in NodeF.

Problem 1.
Given an inner node v, a byte label b, and a radius r′, find the children u_1, u_2, ..., u_k of v whose edge byte labels a_1, a_2, ..., a_k satisfy H(a_i, b) ≤ r′.

Figure 4 shows an example of Problem 1. If r′ = 0, we simply look up the child with edge label b. If r′ > 0, we have two approaches: LinearScan visits all children of v and computes the Hamming distances to their edge labels; BruteForce generates the set of all byte labels A = {a : H(a, b) ≤ r′} and looks up the children of v with edge labels a ∈ A. MART performs one of these approaches according to the configuration of the given inner node, as presented in Section V-C.

To quickly perform these approaches without unpacking byte labels, we introduce two σ^z × σ^z tables H and A. H is used in LinearScan; its b-th row stores the Hamming distances between b and all byte labels a, i.e., H[b, a] := H(a, b). A is used in BruteForce; its b-th row stores all byte labels a sorted in ascending order of H(a, b). We can simply generate the set A by scanning the elements A[b, i] for i = 0, 1, ... until we encounter H[b, A[b, i]] > r′. Both H and A are implemented as simple tables of byte elements and can be precomputed. Thus, H and A contribute to quickly solving Problem 1 with only up to 128 KB of memory and without unpacking byte labels.
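As a sketch of this precomputation (assuming the hypothetical packed_hamming helper from the earlier example and σ^z = 256), the two tables can be built as follows; each occupies 64 KB, 128 KB in total:

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Precompute H[b][a] := H(a, b) and A[b] := byte labels sorted by H(a, b).
std::vector<std::vector<uint8_t>> Htab(256, std::vector<uint8_t>(256));
std::vector<std::vector<uint8_t>> Atab(256, std::vector<uint8_t>(256));

void build_tables() {
    for (int b = 0; b < 256; ++b) {
        for (int a = 0; a < 256; ++a)
            Htab[b][a] = (uint8_t)packed_hamming((uint8_t)a, (uint8_t)b);
        std::iota(Atab[b].begin(), Atab[b].end(), (uint8_t)0);  // 0, 1, ..., 255
        std::stable_sort(Atab[b].begin(), Atab[b].end(),
                         [&](uint8_t u, uint8_t v) { return Htab[b][u] < Htab[b][v]; });
    }
}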
C. Adaptive Data Structure for Inner Nodes

Although ART [24] is a space-efficient data structure for representing inner nodes with byte labels, its design targets standard trie structures and is redundant for DyFT. For example, the path-compression technique of ART is not necessary for DyFT. MART simply modifies ART and represents the inner nodes of DyFT.
MART uses the following three types of data structures depending on the number of children. Let us consider representing an inner node v with k children. The three types of data structures are illustrated in Figure 5, and their algorithms for Problem 1 are presented in Algorithm 3.

Algorithm 3: MART search algorithms for Problem 1
Input: Inner node v, byte label b, and radius r′
Output: Set of child pointers T

function NodeSearchS(v, b, r′):           ▷ NodeS
    T ← ∅
    if r′ = 0:                            ▷ instead, a SIMD search can be used as presented in [24]
        for i = 0, 1, ..., v.k − 1:
            if v.Key[i] = b:
                append v.Ptr[i] to T; break
    else:                                 ▷ LinearScan
        for i = 0, 1, ..., v.k − 1:
            if H[b, v.Key[i]] ≤ r′: append v.Ptr[i] to T
    return T

function NodeSearchD(v, b, r′):           ▷ NodeD
    T ← ∅
    if r′ = 0:
        if v.Idx[b] ≠ K + 1: append v.Ptr[v.Idx[b]] to T
    else:                                 ▷ BruteForce
        for i = 0, 1, ..., σ^z − 1:
            if H[b, A[b, i]] > r′: break
            else if v.Idx[A[b, i]] ≠ K + 1: append v.Ptr[v.Idx[A[b, i]]] to T
    return T

function NodeSearchF(v, b, r′):           ▷ NodeF
    T ← ∅
    if r′ = 0:
        if v.Ptr[b] ≠ nullptr: append v.Ptr[b] to T
    else:                                 ▷ BruteForce
        for i = 0, 1, ..., σ^z − 1:
            if H[b, A[b, i]] > r′: break
            else if v.Ptr[A[b, i]] ≠ nullptr: append v.Ptr[A[b, i]] to T
    return T

NodeS (NodeSparse) is a data structure for storing node v with at most K children, where K is a constant parameter. It consists of two arrays Key and Ptr.
Key is a byte array of length K that stores the edge labels from v. Ptr is a pointer array of length K such that Ptr[i] stores the child pointer with edge label Key[i]. We maintain the arrays such that only the first k elements are used. Problem 1 is solved by performing LinearScan over the first k elements of Key. If r′ = 0, modern CPUs can quickly search the elements in parallel using SIMD instructions, as presented in [24]. NodeSearchS shows the algorithm.

NodeD (NodeDense) is a data structure for storing node v with at most K children. It consists of two arrays Idx and Ptr. Idx is a byte array of length 256 indicating positions in Ptr. Ptr is a pointer array of length K such that Ptr[Idx[b]] stores the child pointer with edge label b. Idx[b] = K + 1 indicates that there is no child pointer with label b. Problem 1 is solved by performing BruteForce over Idx. NodeSearchD shows the algorithm.

NodeF (NodeFull) is a data structure for very large k and consists of a pointer array Ptr of length 256 such that Ptr[b] stores the child pointer with edge label b. The data structure is identical to the array form. Problem 1 is solved by performing BruteForce over Ptr. NodeSearchF shows the algorithm.

Every data structure has a one-byte header storing the value of k. Let w be the word size in bits, such as 32 or 64. NodeS consumes 8K + wK bits, NodeD consumes 8 · 256 + wK bits, and NodeF consumes 256w bits. NodeS is the most memory-efficient but uses LinearScan, taking O(k) time. NodeD is more memory-efficient than NodeF when K < 256 − 2048/w. With respect to time and space, NodeS is efficient for small K, and NodeD is efficient for large K. We define NodeS with K = 2, 4, 8, 16, 32 and NodeD with K = 64, 128. An inner node with at most 128 children is represented in NodeS or NodeD such that K is the smallest parameter no less than k. An inner node with more than 128 children is represented in NodeF. This adaptive selection allows child pointers to be stored space-efficiently.
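The following C++ layouts (a minimal sketch of ours; the real implementation details may differ) outline the three node types:

#include <cstdint>

struct Node;  // a DyFT node (inner node or leaf), left opaque here

// NodeS: at most K children; Key is linearly scanned (or SIMD-searched).
template <int K>
struct NodeS {
    uint8_t k = 0;   // one-byte header: the number of children
    uint8_t Key[K];  // Key[i] is the edge label of the i-th child
    Node*   Ptr[K];  // Ptr[i] is the child pointer for Key[i]
};

// NodeD: byte-indexed positions into Ptr; Idx[b] = K + 1 means "no child b".
template <int K>
struct NodeD {
    uint8_t k = 0;
    uint8_t Idx[256];
    Node*   Ptr[K];
};

// NodeF: identical to the array form.
struct NodeF {
    uint8_t k = 0;
    Node*   Ptr[256];  // Ptr[b] is the child with edge label b, or nullptr
};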
D. Compact Implementation for Leaves
Finally, we briefly present a compact implementation of leaves. Each leaf is represented as a pointer to its posting list. We compress the pointers using a sparse direct address table [7] that groups g pointers by concatenating the g posting lists, reducing the number of pointers by a factor of g. Given a leaf, the sparse direct address table can access the corresponding posting list using the leaf's identifier in O(g/w) time. DyFT sets g = w to perform the access in constant time. The implementation details are presented in [7, Sect. 6].
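The following C++ sketch (our rough reading of the grouping idea in [7, Sect. 6]; names and layout are ours) illustrates the g = 64 case, where one mask and one popcount map a leaf identifier to its posting list:

#include <cstdint>
#include <vector>

// Sparse direct-address table with group size g = 64.
struct SparseAddressTable {
    std::vector<uint64_t> bits;     // bit (id % 64) of bits[id / 64]: leaf id is present
    std::vector<uint32_t> offsets;  // number of present leaves before each group

    // Position of leaf id's posting list in the concatenated list storage;
    // assumes leaf id is present. O(g/w) = O(1) time for g = w = 64.
    uint32_t position(uint64_t id) const {
        uint64_t below = bits[id / 64] & ((1ULL << (id % 64)) - 1ULL);
        return offsets[id / 64] + (uint32_t)__builtin_popcountll(below);
    }
};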
VI. EXPERIMENTS

We evaluated the performances of DyFT and DyFT+ using three real-world vector datasets. Text1M consists of 999,994 pre-trained continuous word vectors from English Wikipedia 2017 built with fastText [28], where each vector is a real-valued vector of 300 dimensions.
Review13M consists of 12,886,488 book reviews in English from Amazon [29]. Each review is represented as a 9,253,464-dimensional binary fingerprint in which each dimension represents the presence or absence of a word.
CP216M consists of 216,121,626 compound-protein pairs in the STITCH database [30], where each pair is represented as a 3,621,623-dimensional binary fingerprint.

We converted the real-valued vectors in
Text1M into binary sketches using Charikar's simhash algorithm [31] and into integer sketches using the GCWS algorithm [11].

Fig. 6. Results for the optimal threshold τ* and the fixed thresholds τ = 1, 10, 100 on CP216M (four panels: m = 32, σ ∈ {2, 16}, r ∈ {2, 4}). The charts show the average search time in milliseconds for varying numbers of sketches n, plotted in logarithmic scale.

We converted the binary vectors in Review13M and
CP216M into binary or integer sketches using Li's b-bit minwise hashing algorithm [9].

We constructed an index for each similarity search method by inserting the sketches of a dataset in random order. We measured the elapsed insertion time and the memory usage required for the construction. We produced a query set by randomly sampling 1,000 sketches from each dataset and measured the average similarity search time per query.

We evaluated σ = 16 for integer sketches following the practical considerations in [9,11]. We evaluated DyFT and HWT (without the multi-index approach) using short sketches of m = 32 and small radii r ≤ 4. We evaluated DyFT+, MIH, HmSearch, and GV (with the multi-index approach) using long sketches of m = 64 and large radii. We fixed the value of W_in based on experiments using a dataset of 10 million random sketches.

We conducted all experiments on one core of a quad-core Intel Xeon CPU E5-2680 v2 clocked at 2.8 GHz in a machine with 256 GB of RAM running the 64-bit version of CentOS 6.10 based on Linux 2.6. We implemented all data structures in C++17 and compiled the source code using g++ version 7.3.0 with the optimization flags -O3 and -march=native. The code used in our experiments is available at https://github.com/kampersanda/dyft.

A. Analysis for Optimal Threshold τ*

We analyzed DyFT's performance with the optimal threshold τ* and fixed thresholds τ = 1, 10, 100. Figure 6 shows the results of search time on CP216M for r = 2, 4. The search time with τ* was the fastest in most cases. The effectiveness of τ* was observed especially when σ = 16 and r = 4. The search times with fixed τ reversed according to n, i.e., setting τ = 1 provided faster searches for large n while setting τ = 100 provided faster searches for small n. This demonstrates that a fixed τ is not efficient in dynamic settings where n varies.
On the other hand, τ* maintained the fastest similarity search speed even when n varied.

TABLE II
RESULTS FOR NODE IMPLEMENTATIONS ON Review13M (m = 32): search time (ms) per query, insertion time (sec), and memory usage (MB) of Array, ART, and MART for σ = 2 (binary) and σ = 16 (integer) with r = 1, 2, 3, 4.

B. Analysis for Node Implementations
We compared the performances of MART, the array form (Array), and the original ART [24]. We evaluated each data structure when implementing the inner nodes of DyFT. Neither Array nor ART applies the byte-packing technique. The aim of the comparison with ART is to observe the effectiveness of the byte-packing technique; hence, we did not implement unnecessary techniques of ART such as path compression. Table II shows the results of search time, insertion time, and memory usage on Review13M.
The results demonstrate the validity of our observations in Section V-A. The search time of MART was the fastest in all cases. Compared to Array, MART was at most 6.3× faster for binary sketches and at most 1.5× faster for integer sketches. This suggests that suppressing DyFT's height with the byte-packing technique provides fast retrieval, in line with Observation A. Similarly, the insertion time of MART was the fastest for binary sketches due to the byte-packing technique, although Array was the fastest for integer sketches due to its simpler data structure. With respect to memory usage, Array was the smallest for binary sketches but the largest for integer sketches, in line with Observations A and B; ART and MART were the smallest for integer sketches, in line with Observation B. Overall, MART achieved relevant space-time trade-offs for both binary and integer sketches.

C. Analysis for DyFT on Binary Sketches
We compared the performances of DyFT and HWT. HWT is the state-of-the-art method designed for dynamic similarity searches on binary sketches [8]. We implemented HWT using the original source code available at https://github.com/sepehr3pehr/hwt. Figure 7 shows the results of search time, insertion time, and memory usage on CP216M.
Fig. 7. Results for DyFT and HWT on CP216M. The upper charts show the average search time in milliseconds for varying numbers of sketches n when r = 2, 4. The bottom-left chart shows the insertion time in minutes for varying n. The bottom-right chart shows the memory usage in GB for varying n. All charts are plotted in logarithmic scale.
As n increased, the search time of DyFT became increasingly faster than that of HWT. This result is consistent with the search time complexities of DyFT and HWT, as HWT's complexity contains the (log n)^r factor. When r = 2, DyFT was at most 6,000× faster than HWT. Although HWT's insertion time complexity O(m log m) is worse than DyFT's complexity O(m), the measured insertion times were not much different because m was not significant. Although the memory complexities of DyFT and HWT are the same, DyFT was at most 13× more memory-efficient than HWT because of the node-omitting approach and MART.

D. Analysis for DyFT+ on Binary Sketches

We compared the performances of DyFT+, MIH, and HSV on binary sketches. MIH is an early similarity search method using the multi-index approach [7]. We implemented MIH using the original source code available at https://github.com/norouzi/mih. HSV is a variant of HmSearch optimized for binary sketches [12]. We implemented HSV applicable to dynamic settings using std::unordered_map. We tested q = 2, 4 for DyFT+ and MIH to observe the effect of the number of blocks on performance. Note that the only difference between DyFT+ and MIH is whether a DyFT or a hash-table structure is used to implement the index.

Figure 8 shows the results of search time, insertion time, and memory usage. Since HSV was not competitive, we focus only on DyFT+ and MIH. We first consider the average search time for varying r (the leftmost column). The search times of DyFT+ and MIH were not much different when all sketches in the dataset were inserted. Both DyFT+ and MIH with q = 2 performed superiorly when the dataset had large n. We now consider the average search time for varying n (the second leftmost column). As reviewed in Section III, the performance of MIH significantly degraded depending on n. MIH with q = 2 was fast when n was large but very slow when n was small.

Fig. 8. Comparison results for DyFT+, MIH, and HSV on binary sketches (Text1M, Review13M, and CP216M; m = 64, σ = 2). The leftmost column shows the average search time in milliseconds for varying radius r.
The second leftmost, third leftmost, and rightmost columns respectively show the average search time in milliseconds, the insertion time in seconds, and the memory usage in GB, for varying numbers of input sketches n. The search time of HSV on CP216M when r = 2 is not plotted since we could not construct the complete index within 256 GB of memory. All charts are plotted in logarithmic scale.

DyFT+ maintained faster searches even when n was small. For insertion time and memory usage (the two rightmost columns), MIH with q = 2 was significantly worse when n was small. These results demonstrate that DyFT+ with q = 2 is an excellent similarity search method if the dataset is dynamic and expected to become large.

E. Analysis for DyFT+ on Integer Sketches

We compared the performances of DyFT+, GV, and HSD on integer sketches. GV is a simple modification of MIH based on the idea in [5]. HSD is a variant of HmSearch optimized for integer sketches [12]. We implemented GV and HSD applicable to dynamic settings using std::unordered_map. The only difference between DyFT+ and GV is whether a DyFT or a hash-table structure is used to implement the index. To fairly compare DyFT+ with GV, we set q = ⌊r/2⌋ + 1 in DyFT+ in the same manner as GV.

Fig. 9. Comparison results for DyFT+, GV, and HSD on integer sketches (Text1M, Review13M, and CP216M; m = 64, σ = 16). The leftmost column shows the average search time in milliseconds for varying radius r. The second leftmost, third leftmost, and rightmost columns respectively show the average search time in milliseconds, the insertion time in seconds, and the memory usage in GB, for varying numbers of input sketches n. Some results of HSD on CP216M are not plotted since we could not complete constructing the index within 256 GB of memory. All charts are plotted in logarithmic scale.

Figure 9 shows the results of search time, insertion time, and memory usage. We first consider the average search time (the two leftmost columns). GV was not competitive with DyFT+ and HSD. DyFT+ outperformed HSD in most cases. We now consider the insertion time and memory usage (the two rightmost columns). HSD was not competitive with DyFT+ and GV, as reviewed in Section III. The insertion time of GV was the fastest because of its very simple data structure. The memory usage of DyFT+ was the smallest because of the node-omitting approach and MART. These results demonstrate that DyFT+ is a fast, scalable, and dynamic similarity search method on integer sketches.

VII. CONCLUSION
We presented a dynamic similarity search method called DyFT and its multi-index variant DyFT+ for the general Hamming distance problem. Our experimental analyses using real-world datasets demonstrated that DyFT and DyFT+ outperform state-of-the-art similarity search methods.

ACKNOWLEDGMENTS
This work was supported by JST AIP-PRISM (grant number JPMJCR18Y5). We thank the anonymous reviewers for their helpful comments.
REFERENCES

[1] M. Henzinger, "Finding near-duplicate web pages: A large-scale evaluation of algorithms," in SIGIR, 2006, pp. 284–291.
[2] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen, "Inter-media hashing for large-scale retrieval from heterogeneous data sources," in SIGMOD, 2013, pp. 785–796.
[3] J.-I. Ito, Y. Tabei, K. Shimizu, K. Tsuda, and K. Tomii, "PoSSuM: A database of similar protein–ligand binding and putative pockets," Nucleic Acids Res., vol. 40, pp. D541–D548, 2012.
[4] Y. Cao, H. Qi, W. Zhou, J. Kato, K. Li, X. Liu, and J. Gui, "Binary hashing for approximate nearest neighbor search on big data: A survey," IEEE Access, vol. 6, pp. 2039–2054, 2018.
[5] S. Gog and R. Venturini, "Fast and compact Hamming distance index," in SIGIR, 2016, pp. 285–294.
[6] J. Qin, C. Xiao, Y. Wang, and W. Wang, "Generalizing the pigeonhole principle for similarity search in Hamming space," IEEE Trans. Knowl. Data Eng., 2019.
[7] M. Norouzi, A. Punjani, and D. J. Fleet, "Fast exact search in Hamming space with multi-index hashing," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 6, pp. 1107–1119, 2014.
[8] S. Eghbali, H. Ashtiani, and L. Tahvildari, "Online nearest neighbor search using Hamming weight trees," IEEE Trans. Pattern Anal. Mach. Intell., 2019.
[9] P. Li and C. König, "b-Bit minwise hashing," in WWW, 2010, pp. 671–680.
[10] P. Li, "0-bit consistent weighted sampling," in SIGKDD, 2015, pp. 665–674.
[11] P. Li, "Linearized GMM kernels and normalized random Fourier features," in SIGKDD, 2017, pp. 315–324.
[12] X. Zhang, J. Qin, W. Wang, Y. Sun, and J. Lu, "HmSearch: An efficient Hamming distance query processing algorithm," in SSDBM, 2013, p. 19.
[13] S. Kanda and Y. Tabei, "b-Bit sketch trie: Scalable similarity search on integer sketches," in BigData, 2019, pp. 810–819.
[14] S. Kanda, K. Takeuchi, K. Fujii, and Y. Tabei, "Succinct trit-array trie for scalable trajectory similarity search," in SIGSPATIAL, 2020, to appear.
[15] E. Fredkin, "Trie memory," Commun. ACM, vol. 3, no. 9, pp. 490–499, 1960.
[16] D. Belazzougui and R. Venturini, "Compressed string dictionary look-up with edit distance one," in CPM, 2012, pp. 280–292.
[17] R. Cole, L.-A. Gottlieb, and M. Lewenstein, "Dictionary matching and indexing with errors and don't cares," in STOC, 2004, pp. 91–100.
[18] H.-L. Chan, T.-W. Lam, W.-K. Sung, S.-L. Tam, and S.-S. Wong, "Compressed indexes for approximate string matching," Algorithmica, vol. 58, no. 2, pp. 263–281, 2010.
[19] D. Greene, M. Parnas, and F. Yao, "Multi-index hashing for information retrieval," in FOCS, 1994, pp. 722–731.
[20] A. N. Arslan and Ö. Eğecioğlu, "Dictionary look-up within small edit distance," in COCOON, 2002, pp. 127–136.
[21] N. Askitis and R. Sinha, "Engineering scalable, cache and space efficient tries for strings," The VLDB Journal, vol. 19, no. 5, pp. 633–660, 2010.
[22] S. Heinz, J. Zobel, and H. E. Williams, "Burst tries: A fast, efficient data structure for string keys," ACM Trans. Inf. Syst., vol. 20, no. 2, pp. 192–223, 2002.
[23] H. Zhang, H. Lim, V. Leis, D. G. Andersen, M. Kaminsky, K. Keeton, and A. Pavlo, "SuRF: Practical range query filtering with fast succinct tries," in SIGMOD, 2018, pp. 323–336.
[24] V. Leis, A. Kemper, and T. Neumann, "The adaptive radix tree: ARTful indexing for main-memory databases," in ICDE, 2013, pp. 38–49.
[25] R. Binna, E. Zangerle, M. Pichl, G. Specht, and V. Leis, "HOT: A height optimized trie index for main-memory database systems," in SIGMOD, 2018, pp. 521–534.
[26] M. Boehm, B. Schlegel, P. B. Volk, U. Fischer, D. Habich, and W. Lehner, "Efficient in-memory indexing with generalized prefix trees," in BTW, 2011, pp. 227–246.
[27] Y. Tabei and K. Tsuda, "SketchSort: Fast all pairs similarity search for large databases of molecular fingerprints," Mol. Inf., vol. 30, no. 9, pp. 801–807, 2011.
[28] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin, "Advances in pre-training distributed word representations," in LREC, 2018.
[29] J. McAuley and J. Leskovec, "Hidden factors and hidden topics: Understanding rating dimensions with review text," in RecSys, 2013, pp. 165–172.
[30] M. Kuhn, D. Szklarczyk, A. Franceschini, M. Campillos, C. von Mering, L. J. Jensen, A. Beyer, and P. Bork, "STITCH 2: An interaction network database for small molecules and proteins," Nucleic Acids Res., vol. 38, no. suppl 1, pp. D552–D556, 2009.
[31] M. S. Charikar, "Similarity estimation techniques from rounding algorithms," in STOC, 2002, pp. 380–388.