Convolutional Embedding for Edit Distance
Xinyan Dai, Xiao Yan, Kaiwen Zhou, Yuxuan Wang, Han Yang, James Cheng
The Chinese University of Hong Kong
ABSTRACT
Edit-distance-based string similarity search has many applications such as spell correction, data de-duplication, and sequence alignment. However, computing edit distance is known to have high complexity, which makes string similarity search challenging for large datasets. In this paper, we propose a deep learning pipeline (called CNN-ED) that embeds edit distance into Euclidean distance for fast approximate similarity search. A convolutional neural network (CNN) is used to generate fixed-length vector embeddings for a dataset of strings and the loss function is a combination of the triplet loss and the approximation error. To justify our choice of using CNN instead of other structures (e.g., RNN) as the model, theoretical analysis is conducted to show that some basic operations in our CNN model preserve edit distance. Experimental results show that CNN-ED outperforms data-independent CGK embedding and RNN-based GRU embedding in terms of both accuracy and efficiency by a large margin. We also show that string similarity search can be significantly accelerated using CNN-based embeddings, sometimes by orders of magnitude.
CCS CONCEPTS
• Information systems → Nearest-neighbor search; Texting; Top-k retrieval in databases.

KEYWORDS
Edit distance; string similarity search; convolutional neural network; metric embedding
ACM Reference Format:
Xinyan Dai, Xiao Yan, Kaiwen Zhou, Yuxuan Wang, Han Yang, and James Cheng. 2020. Convolutional Embedding for Edit Distance. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20), July 25–30, 2020, Virtual Event, China. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3397271.3401045

∗ Corresponding author.
1 INTRODUCTION

Given two strings s_x and s_y, their edit distance Δ_e(s_x, s_y) is the minimum number of edit operations (i.e., insertion, deletion and substitution) required to transform s_x into s_y (or s_y into s_x). As a metric, edit distance is widely used to evaluate the similarity between strings. Edit-distance-based string similarity search has many important applications including spell correction, data de-duplication, entity linking and sequence alignment [8, 14, 29].

The high computational complexity of edit distance is the main obstacle for string similarity search, especially for large datasets with long strings. For two strings of length l, computing their edit distance takes O(l²/log(l)) time using the best algorithm known so far [18]. There is evidence that this complexity cannot be further improved [2]. Pruning-based solutions have been used to avoid unnecessary edit distance computation [3, 16, 20, 25, 27]. However, it is reported that pruning-based solutions are inefficient when a string and its most similar neighbor have a large edit distance [30], which is common for datasets with long strings.

Metric embedding has been shown to be successful in bypassing distances with high computational complexity (e.g., Wasserstein distance [6]). For edit distance, a metric embedding model can be defined by an embedding function f(·) and a distance measure d(·, ·) such that the distance in the embedding space approximates the true edit distance, i.e., Δ_e(s_x, s_y) ≈ d(f(s_x), f(s_y)). A small approximation error (|Δ_e(s_x, s_y) − d(f(s_x), f(s_y))|) is crucial for metric embedding. For similarity search applications, we also want the embedding to preserve the order of edit distance. That is, for a triplet of strings s_x, s_y and s_z with Δ_e(s_x, s_y) < Δ_e(s_x, s_z), it should ensure that d(f(s_x), f(s_y)) < d(f(s_x), f(s_z)). In this paper, we evaluate the accuracy of the embedding methods using both approximation error and order-preserving ability.

Several methods have been proposed for edit distance embedding. Ostrovsky and Rabani embed edit distance into ℓ_1 with a distortion of 2^{O(√(log l · log log l))} [19], but the algorithm is too complex for practical implementation. (An embedding method is said to have a distortion of γ if there exists a positive constant λ that satisfies λ Δ_e(s_x, s_y) ≤ d(f(s_x), f(s_y)) ≤ γ λ Δ_e(s_x, s_y), in which λ is a scaling factor [6].) The CGK algorithm embeds edit distance into Hamming distance and the distortion is O(Δ_e) [4], in which Δ_e is the true edit distance. CGK is simple to implement and has been shown to be effective when incorporated into a string similarity search pipeline. Both Ostrovsky and Rabani's method and CGK are data-independent, while learning-based methods can provide better embeddings by considering the structure of the underlying dataset. GRU [32] trains a recurrent neural network (RNN) to embed edit distance into Euclidean distance. Although GRU outperforms CGK, its RNN structure makes training and inference inefficient.
Moreover, its output vector (i.e., f(s_x)) has a high dimension, which results in costly distance computation and high memory consumption. As our main baselines, we discuss CGK and GRU in more detail in Section 2.

To tackle the problems of GRU, we propose CNN-ED, which embeds edit distance into Euclidean distance using a convolutional neural network (CNN). The CNN structure allows more efficient training and inference than RNN, and we constrain the output vector to have a relatively short length (e.g., 128). The loss function is a weighted combination of the triplet loss and the approximation error, which enforces accurate edit distance approximation and preserves the order of edit distance at the same time. We also conduct theoretical analysis to justify our choice of CNN as the model structure, which shows that the operations in CNN preserve edit distance to some extent. In contrast, similar analytical results are not known for RNN. Indeed, we observe that for some datasets a randomly initialized CNN (without any training) already provides better embeddings than CGK and a fully trained GRU.

We conducted extensive experiments on 5 datasets with various cardinalities and string lengths. The results show that CNN-ED outperforms both CGK and GRU in approximation accuracy, computation efficiency, and memory consumption. The approximation error of CNN-ED can be only 50% of that of GRU even though CNN-ED uses an output vector that is two orders of magnitude shorter than GRU's. For training and inference, the speedup of CNN-ED over GRU is up to 30x and 200x, respectively. Using the embeddings for string similarity join, CNN-ED outperforms EmbedJoin [30], a state-of-the-art method. For threshold-based string similarity search, CNN-ED reaches a recall of 0.9 up to 200x faster than HSsearch [23]. Moreover, CNN-ED is shown to be robust to hyper-parameters such as the output dimension and the number of layers.

To summarize, we make three contributions in this paper. First, we propose a CNN-based pipeline for edit distance embedding, which outperforms existing methods by a large margin. Second, theoretical evidence is provided for using CNN as the model for edit distance embedding. Third, extensive experiments are conducted to validate the performance of the proposed method.

The rest of the paper is organized as follows. Section 2 introduces the background of string similarity search and two edit distance embedding algorithms, i.e., CGK and GRU. Section 3 presents our CNN-based pipeline and conducts theoretical analysis to justify using CNN as the model. Section 4 provides experimental results about the accuracy, efficiency, robustness and similarity search performance of the CNN embedding. The concluding remarks are given in Section 5.
2 BACKGROUND

In this part, we introduce two string similarity search problems, and then discuss two existing edit distance embedding methods, i.e., CGK [4] and GRU [32].
Algorithm 1 CGK Embedding

Input: A string s ∈ D^l for some l ≤ L, and a random matrix R ∈ {0, 1}^{3L×|D|}
Output: An embedding sequence y ∈ (D ∪ {⊥})^{3L}

  Interpret R as 3L functions π_1, π_2, ..., π_{3L}, with π_j(c_k) = R_{jk}, where c_k denotes the k-th character in D
  Initialize i = 1 and y = ∅
  for j = 1, 2, ..., 3L do
    if i ≤ l then
      y = y ⊙ s[i]        ▷ ⊙ means concatenation
      i = i + π_j(s[i])
    else
      y = y ⊙ ⊥           ▷ pad with the special character ⊥
    end if
  end for
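For concreteness, the following is a minimal Python sketch of Algorithm 1. It is our own illustration, not the authors' implementation: the padding symbol '_' stands in for ⊥ and the helper names are ours.

```python
import numpy as np

def cgk_embed(s, alphabet, R):
    """Minimal sketch of Algorithm 1 (CGK embedding).

    s        -- input string over `alphabet`, with len(s) <= L
    alphabet -- the list of characters D
    R        -- binary matrix of shape (3L, |D|); R[j, k] plays the role of pi_j(c_k)
    Returns a sequence of length 3L over the alphabet plus the padding symbol '_'.
    """
    char_index = {c: k for k, c in enumerate(alphabet)}
    y, i = [], 0                                  # 0-based counterpart of i = 1
    for j in range(R.shape[0]):                   # j = 1, 2, ..., 3L
        if i < len(s):
            y.append(s[i])                        # copy the current character
            i += int(R[j, char_index[s[i]]])      # advance by 0 or 1 at random
        else:
            y.append('_')                         # pad once the input is exhausted
    return y

# Two strings embedded with the same R can be compared by Hamming distance.
L, alphabet = 8, ['A', 'C', 'G', 'T']
rng = np.random.default_rng(0)
R = rng.integers(0, 2, size=(3 * L, len(alphabet)))
y1, y2 = cgk_embed("CATT", alphabet, R), cgk_embed("CAT", alphabet, R)
hamming = sum(a != b for a, b in zip(y1, y2))
```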
There are two well-known string similarity search problems, similarity join [3, 16, 27] and threshold search [23]. For a dataset S = {s_1, s_2, ..., s_n} containing n strings, similarity join finds all pairs (s_i, s_j) of strings with Δ_e(s_i, s_j) ≤ τ and i < j, in which τ is a threshold for the edit distance between similar pairs. A number of methods [3, 5, 9, 16, 24, 27, 28, 30, 31] have been developed for similarity join but they are shown to be inefficient when the strings are long and τ is large. EmbedJoin [30] utilizes the CGK embedding [4] and is currently the state-of-the-art method for similarity join on long strings. For a given query string q, threshold search [8, 15, 21, 22, 25, 33] finds all strings s ∈ S that satisfy Δ_e(q, s) ≤ τ. HSsearch [23] is a state-of-the-art method for threshold search, and outperforms Adapt [25], QChunk [20] and Bed-tree [33]. Similarity join is usually evaluated by the time it takes to find all similar pairs (called end-to-end time), while threshold search is evaluated by the average query processing time.

Algorithm 1 describes the CGK algorithm [4], which embeds edit distance into Hamming distance. It assumes that the longest string in the dataset S has a length of L and the characters in the strings come from a known alphabet D. R is a random binary matrix in which each entry is 0 or 1 with equal probability, and ⊥ ∉ D is a special character used for padding. Denote the CGK embeddings of two strings s_i and s_j as y_i and y_j, respectively. The following relation holds with high probability:

Δ_e(s_i, s_j) ≤ d_H(y_i, y_j) ≤ O(Δ_e(s_i, s_j)²),   (1)

in which d_H(y_i, y_j) = Σ_{k=1}^{3L} 1[y_i(k) ≠ y_j(k)] is the Hamming distance between y_i and y_j.

RNN is used to embed edit distance into Euclidean distance in GRU [32]. The network structure of GRU is shown in Figure 1, which consists of two layers of gated recurrent units (GRU) and a linear layer. A string s is first padded to a length of L (the length of the longest string in the dataset) and then each of its elements is fed into the network per step. The outputs of the L steps are concatenated as the final embedding.

Figure 1: The model architecture of GRU

The embedding function of GRU can be expressed as follows:

h_i^(1) = GRU(s[i], h_{i-1}^(1)),
h_i^(2) = GRU(h_i^(1) + δ_i, h_{i-1}^(2)),
f_GRU^i = W h_i^(2) + b,
f_GRU(s) = [f_GRU^1, f_GRU^2, ..., f_GRU^L].   (2)

As GRU uses the concatenation of the outputs, the embedding has a high dimension and takes up a large amount of memory. The network is trained with a three-phase procedure and a different loss function is used in each phase.

3 CNN-BASED EMBEDDING

We now present our CNN-based model for edit distance embedding. We first introduce the details of the learning pipeline, including input preparation, network structure, loss function and training method. Then we report an interesting phenomenon: a random CNN without training already matches or even outperforms GRU, which serves as strong empirical evidence that CNN is suitable for edit distance embedding. Finally, we justify this phenomenon with theoretical analysis, which shows that the operations in CNN preserve a bound on edit distance.
We assume that there is a training set S = {s_1, s_2, ..., s_n} with n strings. The strings (including the training set, the base dataset and possible queries) that we are going to apply our model on have a maximum length of L, and their characters come from a known alphabet D with size |D|. c_j denotes the j-th character in D. For two vectors x and y, we use ∥x − y∥ to denote their Euclidean distance.
One-hot embedding as input. For each training string s_x, we generate a one-hot embedding matrix X of size |D| × L as the input for the model as follows:

X = [X_1, ..., X_|D|]^⊺ with X_j ∈ {0, 1}^L for 1 ≤ j ≤ |D|, and X_j[l] = 1[s_x[l] = c_j] for 1 ≤ l ≤ L.   (3)

For example, for D = {'A', 'G', 'C', 'T'}, s_x = "CATT" and L = 4,

X = [[0 1 0 0], [0 0 0 0], [1 0 0 0], [0 0 1 1]].

Intuitively, each row of X (e.g., X_j) encodes a character (e.g., c_j) in D, and if that character appears in a certain position of s_x (e.g., l), we mark the corresponding position in that row as 1 (e.g., X_j[l] = 1). In the example, X_4 (X_4 = [0 0 1 1]) encodes the fourth character (i.e., 'T'), and X_4[3] = X_4[4] = 1 as 'T' appears in the 3rd and 4th position of s_x. If string s_x has a length L' < L, the last L − L' columns of X are filled with 0. In this way, we generate fixed-size input for the CNN.
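A short NumPy sketch of this input construction (the helper name is ours, not from the paper):

```python
import numpy as np

def one_hot_matrix(s, alphabet, L):
    """One-hot input X of shape (|D|, L): X[j, l] = 1 iff s[l] == alphabet[j]."""
    X = np.zeros((len(alphabet), L), dtype=np.float32)
    for l, ch in enumerate(s):
        X[alphabet.index(ch), l] = 1.0
    return X  # columns beyond len(s) stay all-zero (padding)

# Reproduces the example from the text: D = ['A', 'G', 'C', 'T'], s = "CATT", L = 4.
print(one_hot_matrix("CATT", ['A', 'G', 'C', 'T'], 4))
```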
Network structure. The network structure of CNN-ED is shown in Figure 2, which starts with several one-dimensional convolution and pooling layers. The convolution is conducted on the rows of X and always uses a kernel size of 3. By default, there are 8 kernels in each convolutional layer and 10 convolutional layers. The last layer is a linear layer that maps the intermediate representations to a pre-specified output dimension d (128 by default). The one-dimensional convolution layers allow the same character in different positions to interact with each other, which corresponds to insertion and deletion in edit distance computation. As we will show in Section 3.2, max-pooling preserves a bound on edit distance. The linear layer allows the representations of different characters to interact with each other. Our network is typically small, with fewer than 45k parameters for the DBLP dataset.
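The PyTorch sketch below illustrates a network of this shape. It is our own approximation under stated assumptions (ReLU activations, max-pooling of size 2 after every convolution, and a flatten-then-linear head), not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class CNNED(nn.Module):
    """Illustrative sketch of a CNN-ED-style network: 1D convolutions over the rows
    of the one-hot matrix, max-pooling, and a final linear layer to dimension d."""
    def __init__(self, alphabet_size, max_len, out_dim=128, channels=8, num_conv=10):
        super().__init__()
        blocks, in_ch = [], alphabet_size
        for _ in range(num_conv):
            blocks += [nn.Conv1d(in_ch, channels, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool1d(kernel_size=2, ceil_mode=True)]
            in_ch = channels
        self.convs = nn.Sequential(*blocks)
        with torch.no_grad():                      # infer the flattened size with a dummy pass
            flat = self.convs(torch.zeros(1, alphabet_size, max_len)).numel()
        self.linear = nn.Linear(flat, out_dim)

    def forward(self, x):                          # x: (batch, |D|, L) one-hot input
        return self.linear(self.convs(x).flatten(start_dim=1))

model = CNNED(alphabet_size=4, max_len=5000)
emb = model(torch.zeros(32, 4, 5000))              # -> (32, 128)
```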
Loss function. We use the following combination of the triplet loss [12] and the approximation error as the loss function:

L(s_acr, s_pos, s_neg) = L_t(s_acr, s_pos, s_neg) + α L_p(s_acr, s_pos, s_neg),

in which L_t is the triplet loss and L_p is the approximation error. (s_acr, s_pos, s_neg) is a randomly sampled string triplet, in which s_acr is the anchor string and s_pos is the positive neighbor that has smaller edit distance to s_acr than the negative neighbor s_neg. The weight α is usually set as 0.1. The triplet loss is defined as

L_t(s_acr, s_pos, s_neg) = max{0, ∥y_acr − y_pos∥ − ∥y_acr − y_neg∥ − η},

in which η = Δ_e(s_acr, s_pos) − Δ_e(s_acr, s_neg) is a margin that is specific to each triplet, and y_acr is the embedding of s_acr. Intuitively, the triplet loss forces the distance gap in the embedding space (∥y_acr − y_neg∥ − ∥y_acr − y_pos∥) to be larger than the edit distance gap (Δ_e(s_acr, s_neg) − Δ_e(s_acr, s_pos)), which helps to preserve the relative order of edit distance. The approximation error is defined as

L_p(s_acr, s_pos, s_neg) = w(s_acr, s_pos) + w(s_acr, s_neg) + w(s_pos, s_neg),

in which w(s_1, s_2) = |∥y_{s_1} − y_{s_2}∥ − Δ_e(s_1, s_2)| measures the difference between the Euclidean distance and the edit distance for a string pair. Intuitively, the approximation error encourages the Euclidean distance to match the edit distance.
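A PyTorch sketch of this combined loss (the batching and function names are our own; the true edit distances are assumed to be precomputed for each triplet):

```python
import torch
import torch.nn.functional as F

def cnn_ed_loss(y_acr, y_pos, y_neg, d_ap, d_an, d_pn, alpha=0.1):
    """Combined loss: triplet term plus approximation error.
    y_* are batches of embeddings; d_ap, d_an, d_pn hold the true edit distances
    of the (anchor, pos), (anchor, neg) and (pos, neg) pairs."""
    e_ap = F.pairwise_distance(y_acr, y_pos)       # Euclidean distances
    e_an = F.pairwise_distance(y_acr, y_neg)
    e_pn = F.pairwise_distance(y_pos, y_neg)
    eta = d_ap - d_an                              # per-triplet (negative) margin
    triplet = torch.clamp(e_ap - e_an - eta, min=0.0)
    approx = (e_ap - d_ap).abs() + (e_an - d_an).abs() + (e_pn - d_pn).abs()
    return (triplet + alpha * approx).mean()
```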
Training and sampling. The network is trained using mini-batch SGD and we sample 64 triplets for each mini-batch. To obtain a triplet, a random string is sampled from the training set as s_acr. Then two of its top-k neighbors (k = 100 by default) are sampled, and the one having smaller edit distance to s_acr is used as s_pos while the other one is used as s_neg. For a training set with cardinality n, we call it an epoch when n triplets are used in training.
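A sketch of this triplet sampling step, assuming the top-k neighbor lists have been precomputed offline (the dictionary layout and helper names are ours):

```python
import random

def sample_triplet(train_strings, topk_neighbors, edit_distance, k=100):
    """Pick a random anchor and two of its top-k neighbors; the one closer to the
    anchor in edit distance becomes the positive sample, the other the negative.
    `topk_neighbors[s]` is assumed to hold the precomputed neighbor strings of s."""
    acr = random.choice(train_strings)
    a, b = random.sample(topk_neighbors[acr][:k], 2)
    if edit_distance(acr, a) <= edit_distance(acr, b):
        return acr, a, b     # (anchor, positive, negative)
    return acr, b, a
```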
Using CNN embedding in similarity search. The most straightforward application of the embedding is to use it to filter unnecessary edit distance computations. We demonstrate this application in Algorithm 2 for approximate threshold search. The idea is to use low-cost distance computation in the embedding space to avoid expensive edit distance computation. More sophisticated designs to better utilize the embedding are possible but are beyond the scope of this paper. For example, the embeddings can also be used to generate candidates for similarity search following the methodology of EmbedJoin, which builds multiple hash tables using CGK embedding and locality sensitive hashing (LSH) [1, 7, 26]. To avoid computing all-pair distances in the embedding space, approximate Euclidean distance similarity search methods such as vector quantization [11, 13] and proximity graphs [10, 17] can be used. Finally, it is possible to utilize multiple sets of embeddings trained with different initializations to provide diversity and improve performance.

Figure 2: The model architecture of CNN-ED
Algorithm 2 Using Embedding for Approximate Threshold Search

Input: A query string q, a string dataset S = {s_1, s_2, ..., s_n}, the embeddings of the strings Y = {y_1, y_2, ..., y_n}, a model f(·), a threshold K and a blow-up factor µ > 1
Output: Strings with Δ_e(q, s) ≤ K

  Initialize the candidate set S′ = ∅ and the result set C = ∅
  Compute the embedding of the query string y_q = f(q)
  for each embedding y_i in Y do
    if ∥y_q − y_i∥ ≤ µ · K then
      S′ = S′ ∪ s_i
    end if
  end for
  for each string s_i in S′ do
    if Δ_e(q, s_i) ≤ K then
      C = C ∪ s_i
    end if
  end for
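A compact Python rendering of Algorithm 2. The function signature and the NumPy-based filtering are our own choices, and `edit_distance` stands for any exact routine (e.g., the standard dynamic program):

```python
import numpy as np

def approximate_threshold_search(q, strings, embeddings, embed, edit_distance, K, mu=2.0):
    """Sketch of Algorithm 2.

    embed         -- the trained model f(.), mapping a string to a 1-D numpy vector
    embeddings    -- array of shape (n, d) holding f(s_i) for every s_i in `strings`
    edit_distance -- an exact edit distance routine
    K, mu         -- edit distance threshold and blow-up factor (mu > 1)
    """
    y_q = embed(q)
    dists = np.linalg.norm(embeddings - y_q, axis=1)   # cheap filter in the embedding space
    candidates = [s for s, d in zip(strings, dists) if d <= mu * K]
    return [s for s in candidates if edit_distance(q, s) <= K]  # exact verification
```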
Performance of random CNN. In Figure 3 and Figure 4, we compare the performance of CGK and GRU with a randomly initialized CNN, which has not been trained. The CNN contains 8 convolutional layers and uses max-pooling. The recall-item curve is defined in Section 4 and higher recall means better performance. The statistics of the datasets can be found in Table 1. The results show that a random CNN already outperforms CGK on all datasets and for different values of k. The random CNN also outperforms the fully trained GRU on Trec and Gen50ks, and is comparable to GRU on Uniref. Although the random CNN does not perform as well as GRU on DBLP, the performance gap is not large. On the Enron dataset, the random CNN slightly outperforms GRU for different values of k.

Figure 3: Recall-item curve comparison for random CNN (denoted as RND), CGK and GRU on the Enron dataset

Figure 4: Recall-item curve comparison for random CNN (denoted as RND), CGK and GRU on more datasets

This phenomenon suggests that the CNN structure may have some properties that suit edit distance embedding. This is against common sense as strings are sequences and RNN should be good at handling sequences. To better understand this phenomenon, we analyze how the operations in our CNN model affect edit distance approximation. Basically, the results show that one-hot embedding and max-pooling preserve bounds on edit distance.

Theorem 1 (One-Hot Deviation Bound). Given two strings s_x ∈ D^M, s_y ∈ D^N and their corresponding one-hot embeddings X = [X_1, ..., X_|D|]^⊺ and Y = [Y_1, ..., Y_|D|]^⊺, define the binary edit distance as Δ̄_e(s_x, s_y) ≜ Σ_{i=1}^{|D|} Δ_e(X_i, Y_i). Then

|D| Δ_e(s_x, s_y) − (|D| − 1)(M + N) ≤ Δ̄_e(s_x, s_y) ≤ |D| Δ_e(s_x, s_y).

Proof. For the upper bound, note that by modifying the operations in a shortest edit sequence that changes s_x into s_y to binary operations, we can use this sequence to transform X_i into Y_i, for any i ∈ [|D|]. (An edit sequence between two strings is a sequence of operations that transforms one string into the other; a shortest edit sequence is one with minimum length, i.e., its length equals the edit distance.) Since a substitution in the original sequence may turn into a trivial binary operation (e.g., '0 → 0'), we have Δ_e(X_i, Y_i) ≤ Δ_e(s_x, s_y). Summing this bound over i = 1, ..., |D| gives the upper bound.

For the lower bound, let s_x^{c_i} be the string obtained from s_x by replacing every character that is not c_i with a special character ⊥ ∉ D, where c_i is the i-th character in the alphabet. We can conclude that

Δ_e(s_x, s_x^{c_i}) = M − |c_i|_{s_x},   Δ_e(s_x^{c_i}, s_y^{c_i}) = Δ_e(X_i, Y_i),

where |c_i|_{s_x} is the number of occurrences of character c_i in s_x. Using the triangle inequality of edit distance, for any i ∈ [|D|] we have

Δ_e(s_x, s_y) ≤ Δ_e(s_x, s_x^{c_i}) + Δ_e(s_x^{c_i}, s_y)
            ≤ Δ_e(s_x, s_x^{c_i}) + Δ_e(s_x^{c_i}, s_y^{c_i}) + Δ_e(s_y^{c_i}, s_y)
            = M + N − |c_i|_{s_x} − |c_i|_{s_y} + Δ_e(X_i, Y_i).

Summing this inequality over i = 1, ..., |D| and using Σ_{i=1}^{|D|} |c_i|_{s_x} = M and Σ_{i=1}^{|D|} |c_i|_{s_y} = N, we obtain

|D| Δ_e(s_x, s_y) ≤ |D|(M + N) − M − N + Δ̄_e(s_x, s_y).

Rearranging this inequality completes the proof. □

Note that the bound in Theorem 1 can be tightened by choosing D as supp(s_x) ∪ supp(s_y). Theorem 1 essentially shows that a bound on the true edit distance Δ_e(s_x, s_y) can be constructed from the sum of the edit distances of |D| binary sequences. These binary sequences are exactly the rows of the one-hot embedding matrices X and Y. This justifies our choice of using one-hot embedding as the input for the network.

Theorem 2 (Max-Pooling Deviation Bound). Given two binary vectors x ∈ {0, 1}^M, y ∈ {0, 1}^N and a max-pooling operation P(·) on x, y with stride K and size K, and assuming that M and N are divisible by K, the following holds:

max{ Δ_e(x, y) − (K − 1)(M + N)/K,  [Δ_e(x, y) − (K − 1)(|1|_{P(x)} + |1|_{P(y)})]/K } ≤ Δ_e(P(x), P(y)) ≤ Δ_e(x, y) + (K − 1)(M + N)/K,

where |1|_{P(x)} denotes the number of 1s in P(x).

Proof. Using the triangle inequality of edit distance, we have

Δ_e(x, y) ≤ Δ_e(x, P(x)) + Δ_e(P(x), y)
          ≤ Δ_e(x, P(x)) + Δ_e(P(x), P(y)) + Δ_e(P(y), y)
          ≤ (K − 1)(M + N)/K + Δ_e(P(x), P(y)).

Applying this inequality again for Δ_e(P(x), P(y)), we obtain

|Δ_e(x, y) − Δ_e(P(x), P(y))| ≤ (K − 1)(M + N)/K.   (4)
Denote by A(x) the string obtained by replicating each bit of P(x) K times. For the substitutions, insertions and deletions in the edit sequence of Δ_e(P(x), P(y)), we can repeat these operations for the corresponding replicated bits in A(x), which transforms A(x) into A(y). Thus, we conclude that Δ_e(A(x), A(y)) ≤ K Δ_e(P(x), P(y)). Using the triangle inequality, it holds that

Δ_e(x, y) ≤ Δ_e(x, A(x)) + Δ_e(A(x), A(y)) + Δ_e(A(y), y)
          ≤ Δ_e(x, A(x)) + K Δ_e(P(x), P(y)) + Δ_e(A(y), y).

For Δ_e(x, A(x)), if a bit is 0 in P(x), its corresponding window in A(x) and x must be all 0; if the bit is 1, the number of differing bits in the corresponding window of A(x) and x is upper-bounded by K − 1, which implies that Δ_e(x, A(x)) ≤ (K − 1)|1|_{P(x)}, where |1|_{P(x)} denotes the number of 1s in P(x). Thus,

Δ_e(x, y) ≤ (K − 1)(|1|_{P(x)} + |1|_{P(y)}) + K Δ_e(P(x), P(y)).

Rearranging this lower bound and combining it with the bound (4) completes the proof. □

Theorem 2 shows that max-pooling preserves a bound on the edit distance of binary vectors. Combined with Theorem 1, it also shows that max-pooling preserves a bound on the true edit distance Δ_e(s_x, s_y). Our randomly initialized network can be viewed as a stack of multiple max-pooling layers, which explains its good performance shown in Figure 3 and Figure 4. However, similar analysis is difficult for RNN as an input character passes through the network in many time steps and the influence on edit distance is hard to capture.

Table 1: Dataset statistics
DataSet: UniRef, DBLP, Trec, Gen50ks, Enron
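As a quick numerical sanity check of Theorem 1, the sketch below (our own code, only illustrating the statement) computes the true edit distance and the binary row-wise edit distance for a small example and verifies the bound:

```python
def edit_distance(a, b):
    """Standard dynamic-programming edit distance (insertion, deletion, substitution)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def binary_edit_distance(sx, sy, alphabet):
    """Sum of the row-wise edit distances of the one-hot matrices (Theorem 1)."""
    row = lambda s, c: ''.join('1' if ch == c else '0' for ch in s)
    return sum(edit_distance(row(sx, c), row(sy, c)) for c in alphabet)

alphabet = ['A', 'C', 'G', 'T']
sx, sy = "CATT", "CAT"
d, d_bar = edit_distance(sx, sy), binary_edit_distance(sx, sy, alphabet)
D, M, N = len(alphabet), len(sx), len(sy)
# Theorem 1: |D|*d - (|D|-1)*(M+N) <= d_bar <= |D|*d
assert D * d - (D - 1) * (M + N) <= d_bar <= D * d
```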
4 EXPERIMENTAL RESULTS

We conduct extensive experiments to evaluate the performance of CNN-ED. Two existing edit distance embedding methods, CGK and GRU, are used as the main baselines. We first introduce the experiment settings, and evaluate the quality of the embeddings generated by CNN-ED. Then, we assess the efficiency of the embedding methods in terms of both computation and storage costs. To demonstrate the benefits of vector embedding, we also test the performance of CNN-ED when used for similarity join and threshold search. Finally, we test the influence of the hyper-parameters (e.g., output dimension, network structure, loss function) on performance. For conciseness, we use CNN to denote CNN-ED in this section. The source code is available at https://github.com/xinyandai/string-embed.

We conduct the experiments with the five datasets in Table 1, which have diverse cardinalities and string lengths. As GRU cannot handle very long strings, we truncate the strings longer than 5,000 in UniRef and Enron to a length of 5,000, following the GRU paper [32]. Moreover, as the memory consumption of GRU is too high for datasets with large cardinality, we sample 50,000 items from each dataset for comparisons that involve GRU. In experiments that do not involve GRU, the entire dataset is used. By default, CNN-ED uses 10 one-dimensional convolutional layers with a kernel size of 3 and one linear layer. The dimension of the output embedding is 128.

All experiments are conducted on a machine equipped with a GeForce RTX 2080 Ti GPU, a 2.10GHz E5-2620 Intel(R) Xeon(R) CPU (16 physical cores), and 48GB RAM. The neural network training and inference experiments are conducted on the GPU while the rest of the experiments are conducted on the CPU. By default, the CPU experiments are conducted using a single thread. For GRU and CNN-ED, we partition each dataset into three disjoint sets, i.e., a training set, a query set and a base set. Both the training set and the query set contain 1,000 items and the other items go to the base set.
We use only the training set to tune the models and the performance of the models is evaluated on the other two sets.
GRU is trained for 500 epochs as suggested in its code, while CNN-ED is trained for 50 epochs.

Table 2: Average edit distance estimation error

DataSet   UniRef   DBLP     Trec     Gen50ks   Enron
CGK       0.590    63.602   6.856    0.452     0.873
GRU       0.275    0.175    46.840   0.419     0.126
CNN
We assess the quality of the embedding generated by CNN from two aspects, i.e., approximation error and the ability to preserve edit distance order.

To provide an intuitive illustration of the approximation error of the CNN embeddings, we plot the true edit distance and the estimated edit distance of 1,000 randomly sampled query-item pairs in Figure 5. The estimated edit distance of a string pair (s_i, s_j) is computed using a linear function of the Euclidean distance ∥f(s_i) − f(s_j)∥. The linear function is introduced to account for possible translation and scaling between the two distances, and it is fitted on the training set without information from the base and query sets. The results show that the distance pairs locate closely around the y = x line (the black one), which suggests that CNN embeddings provide good edit distance approximation.

To quantitatively compare the approximation error of the embedding methods, we report the average edit distance estimation error in Table 2. The estimation error for a string pair is defined as e = |g(d(f(s_i), f(s_j))) − Δ_e(s_i, s_j)| / Δ_e(s_i, s_j), in which Δ_e(s_i, s_j) is the true edit distance and g(d(f(s_i), f(s_j))) is the edit distance estimated from the embeddings. The distance function d(f(s_i), f(s_j)) is Hamming distance for CGK and Euclidean distance for GRU and CNN. g(·) is a function used to estimate edit distance from distance in the embedding space, and it is fitted on the training set. We set g(·) as a linear function for GRU and CNN, and a quadratic function for CGK as the theoretical guarantee of CGK in Equation (1) has a quadratic form. The reported estimation error is the average over all possible query-item pairs. The results show that CNN has the smallest estimation error on all five datasets, while overall CGK has the largest estimation error. This is because CGK is data-independent, while GRU and CNN use machine learning to fit the data. The performance of GRU is poor on the Trec dataset and a similar phenomenon is also reported in its original paper [32].

To evaluate the ability of the embeddings to preserve edit distance order, we plot the recall-item curves in Figure 6 and Figure 7. The recall-item curve is widely used to evaluate the performance of metric embedding. To plot the curve, we first find the top-k most similar strings for each query in the base set using linear scan. Then, for each query, items in the base set are ranked according to their distance to the query in the embedding space. If the items ranking in the top T contain k' of the true top-k neighbors, the recall is k'/k. For each value of T, we report the average recall of the 1,000 queries. Intuitively, a good embedding should ensure that a neighbor with a high rank in edit distance (i.e., having smaller edit distance than most items) also has a high rank in embedding distance. In this case, the recall is high for a relatively small T. The results show that CNN consistently outperforms CGK and GRU on all five datasets and for different values of k.
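The two quality measures above can be computed as in the following sketch (our own code; `np.polyfit` plays the role of fitting the linear g on training pairs):

```python
import numpy as np

def fit_linear_g(train_euclid_d, train_true_ed):
    """Least-squares fit of g(d) = a*d + b on training pairs."""
    a, b = np.polyfit(train_euclid_d, train_true_ed, deg=1)
    return lambda d: a * d + b

def estimation_error(euclid_d, true_ed, g):
    """e = |g(d) - Delta_e| / Delta_e, averaged over string pairs."""
    return np.mean(np.abs(g(euclid_d) - true_ed) / true_ed)

def recall_at_T(embed_dists, true_topk_ids, T):
    """Fraction of the true top-k neighbors found in the top-T by embedding distance."""
    ranked = np.argsort(embed_dists)[:T]
    return len(set(ranked) & set(true_topk_ids)) / len(true_topk_ids)
```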
Figure 6: Recall-item curve comparison among CGK, GRU, CNN for top- k search on the Enron dataset R e c a ll DBLPTOP-1CNNGRUCGK R e c a ll TRECTOP-1CNNGRUCGK R e c a ll GEN50KSTOP-1
CNNGRUCGK R e c a ll UNIREFTOP-1CNNGRUCGK R e c a ll DBLPTOP-10CNNGRUCGK R e c a ll TRECTOP-10CNNGRUCGK R e c a ll GEN50KSTOP-10
CNNGRUCGK R e c a ll UNIREFTOP-10CNNGRUCGK
Figure 7: Recall-item curve comparison among CGK, GRU, CNN for top- k search on more datasets different values of k . The recall-item performance also agrees withthe estimation error in Table 2. CNN has the biggest advantage inestimation error on Trec and its item-recall performance is alsosignificantly better than CGK and GRU on this dataset. On Gen50ks,GRU and CNN have similar estimation error, and the item-recallperformance of CNN is only slightly better than GRU. We compare the efficiency of the embedding algorithms from var-ious aspects in Table 3.
Train time is the time to train the model on the training set, and embed time is the average time to compute the embedding for a string (also called inference). Compute time is the average time to compute the distance between a pair of strings in the embedding space. Embed size is the memory consumption for storing the embeddings of a dataset, and Raw is the size of the original dataset. Note that the embed time of GRU and CNN is measured on GPU, while the embed time of CGK is measured on CPU as the CGK algorithm is difficult to parallelize on GPU. For GRU, the embedding size of the entire dataset is estimated using a sample of 50,000 strings.

The results show that CNN is more efficient than GRU in all aspects. CNN trains and computes string embeddings at least 2.6x and 13.7x faster than GRU, respectively. Moreover, CNN takes more than 290x less memory to store the embedding, and computes distance in the embedding space over 400x faster. When compared with CGK, CNN also has very attractive efficiency. CNN computes approximate edit distance at least 14x faster than CGK and uses at least an order of magnitude less memory. We found that CNN is more efficient than GRU and CGK mainly because it has a much smaller output dimension. For example, on the Gen50ks dataset, the output dimensions of CGK and GRU are 322x and 121x that of CNN, respectively. Note that even with a much smaller output dimension, CNN embedding still provides more accurate approximation for edit distance than CGK and GRU, as we have shown in Section 4.2. CGK embeds strings faster than both GRU and CNN as the two learning-based models need to conduct neural network inference while CGK follows a simple random procedure.
Table 3: Embedding efficiency comparison

DataSet             Method        UniRef   DBLP     Trec     Gen50ks   Enron
Train Time (s)      GRU           31.8     13.2     26.3     31.3      34.9
                    CNN           4.31     4.96     5.19     1.61      5.63
Embed Time (µs)     CGK
                    GRU           8332     2340     7654     12067     7650
                    CNN
Compute Time (µs)   CGK           1.72     0.60     1.36     1.65      1.71
                    GRU           123.7    47.2     129.1    18.0      177.7
                    CNN (×10⁻²)   4.6      4.2      4.2      4.2       4.5
Embed Size          Raw           170MB    140MB    280MB    238MB     207MB
                    CGK           5.59GB   6.45GB   4.86GB   0.70GB    3.43GB
                    GRU           372GB    621GB    378GB    7GB       338GB
                    CNN           195MB    676MB    169MB    24MB      119MB
In this part, we test the performance of CNN when used for the two string similarity search problems discussed in Section 2, threshold search and similarity join.

For threshold search, model training and dataset embedding are conducted before query processing. When a query comes, we first calculate its embedding, then use the distances in the embedding space to rank the items, and finally conduct exact edit distance computation in the ranked order. Following [30], the thresholds for UniRef, DBLP, Trec, Gen50ks and Enron are set as 100, 40, 40, 100 and 40, respectively. We compare with HSsearch, which supports threshold search with a hierarchical segment index. For CNN, we measure the average query processing time when reaching a certain recall, where recall is defined as the number of returned similar string pairs over the total number of ground truth similar pairs.

The results in Table 4 show that when approximate search is acceptable, CNN can achieve a very significant speedup over HSsearch. At a recall of 0.6, the speedup over HSsearch is at least 6x and can be as much as 227x. In principle, CNN is not designed for exact threshold search as there are errors in its edit distance approximation. However, it also outperforms HSsearch when the recall is 1, which means all ground truth similar pairs are returned, and the speedup is at least 1.44x for the five datasets.

To demonstrate the advantage of the accurate and efficient embedding provided by CNN, we compare CNN and GRU for threshold search in Table 5. The dataset is a sample with 50,000 items from the DBLP dataset (different from the entire DBLP dataset used in Table 4). We conduct sampling because the GRU embedding for the whole dataset does not fit into memory. The results show that CNN can be up to 500x faster than GRU for attaining the same recall.
Table 4: Average query time for threshold search (in ms)

DataSet        UniRef   DBLP   Trec   Gen50ks   Enron
HSsearch       4333     6907   222    393       76
CNN (R=0.6)    26       263    37     1.73      12
CNN (R=0.8)    66       478    44     1.74      13
CNN (R=0.9)    143      1574   58     1.74      15
CNN (R=0.95)   254      2296   80     1.75      15
CNN (R=0.99)   1068     3560   93     1.77      21
CNN (R=1)      3007     4321   116    1.79      22
Table 5: Average query time for GRU and CNN (in ms)

Recall   0.6      0.8      0.9      0.95     0.99     1
CNN      5.3      6.7      8.6      14.3     60.2     91.0
GRU      2980.5   3012.1   3059.6   3059.6   3590.2   3590.2

Detailed profiling finds that distance computation is inefficient with the long GRU embedding (as shown in Table 3) and that the CNN embedding better preserves the order of edit distance (as shown in Figure 7).

We compare CNN with EmbedJoin and PassJoin for similarity join in Figure 8. PassJoin partitions a string into a set of segments and creates an inverted index for the segments, then generates similar string pairs using the inverted index. EmbedJoin uses the CGK embedding and locality sensitive hashing (LSH) to filter unnecessary edit distance computations, and is the state-of-the-art method for string similarity join. Note that PassJoin is an exact method while EmbedJoin is an approximate method. Different from threshold search, the common practice for similarity join is to report the end-to-end time, which includes both pre-processing time (e.g., index building in EmbedJoin) and edit distance computation time. Therefore, we include the time for training the model and embedding the dataset in the results of CNN. For Gen50ks and Trec, the thresholds for similar string pairs are set as 150 and 40, respectively, following the EmbedJoin paper. For EmbedJoin and CNN, we report the time taken to reach a recall of 0.99.

The results show that EmbedJoin outperforms CNN on the Gen50ks dataset but CNN performs better than EmbedJoin on the Trec dataset. To investigate the reason behind the results, we decompose the running time of CNN into training time, embedding time and search time in Table 6. On the smaller Gen50ks dataset (with 50,000 items), CNN takes 160.1s, 8.6s and 48.8s for training, embedding and search, respectively, while EmbedJoin takes 52.8 seconds in total. The results suggest that CNN performs poorly on Gen50ks because the dataset is small and the long training time cannot be amortized. On the larger Trec dataset (with 347,949 items), the training time (and embedding time) is negligible (only 5% of the total time) and CNN is 1.76x faster than EmbedJoin due to its high quality embedding. Therefore, we believe CNN will have a bigger advantage over EmbedJoin on larger datasets. We have tried to run the algorithms on the DBLP dataset but both PassJoin and EmbedJoin fail.
Figure 8: Time comparison for similarity join on (a) Gen50ks (K=150) and (b) Trec (K=40)

Table 6: Time decomposition for similarity join (in seconds)
Dataset   CNN-Train   CNN-Embed   CNN-Search   EmbedJoin
Gen50ks   160.1       8.6         48.8         52.8
Trec      510.7       190.2       8924.6       16944.0

This set of experiments shows that CNN embeddings provide promising results for string similarity search. We believe that more sophisticated designs are possible to further improve performance, e.g., using multiple sets of independent embeddings to introduce diversity, combining with Euclidean distance similarity search methods such as vector quantization and proximity graphs, and incorporating the various pruning rules used in existing string similarity search work.
We evaluate the influence of the hyper-parameters on the performance of the CNN embedding in Figure 9. The dataset is Enron and we use the recall-item curve as the performance measure, for which higher recall means better performance.

Figure 9a shows that the quality of the embedding improves quickly in the initial stage of training and stabilizes after 50 epochs, which suggests that CNN is easy to train. In Figure 9b, we test the performance of CNN using different output dimensions. The results show that the performance improves considerably when increasing the output dimension from 8 to 32 but does not change much afterwards. This suggests that a small output dimension is sufficient for CNN, while CGK and GRU need to use a large output dimension, which slows down distance computation and takes up a large amount of memory.

Figure 9c shows the performance of CNN when increasing the number of convolutional layers from 8 to 12. The results show that the improvement in performance is marginal with more layers and thus there is no need to use a large number of layers. This is favorable as using a large number of layers makes training and inference inefficient. We show the performance of using different numbers of convolution kernels in a layer in Figure 9d. The results show that performance improves when we increase the number of kernels to 8 but drops afterwards.
Figure 9: Influence of the hyper-parameters on item-recall performance for the Enron dataset: (a) epoch count, (b) output dimension, (c) number of layers, (d) number of kernels, (e) loss function, (f) pooling function (best viewed in color)
We report the performance of CNN using different loss functions in Figure 9e. Recall that we use a combination of the triplet loss and the approximation error to train CNN. In Figure 9e, Triplet Loss means using only the triplet loss while Pairwise Loss means using only the approximation error. The results show that using a combination of the two loss terms performs better than using a single loss term. The performance of maximum pooling and average pooling is shown in Figure 9f. The results show that average pooling performs better than maximum pooling. Therefore, it will be interesting to extend our analysis of maximum pooling in Section 3 to more pooling methods.
5 CONCLUSION

In this paper, we proposed CNN-ED, a model that uses a convolutional neural network (CNN) to embed edit distance into Euclidean distance. A complete pipeline (including input preparation, loss function and sampling method) is formulated to train the model end to end, and theoretical analysis is conducted to justify choosing CNN as the model structure. Extensive experimental results show that CNN-ED outperforms existing edit distance embedding methods in terms of both accuracy and efficiency. Moreover, CNN-ED shows promising performance for edit distance similarity search and is robust to different hyper-parameter configurations. We believe that incorporating CNN embeddings to design efficient string similarity search frameworks is a promising future direction.
Acknowledgments.
We thank the reviewers for their valuable comments. This work was partially supported by GRF 14208318 from the RGC and ITF 6904945 from the ITC of HKSAR.
REFERENCES
[1] Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya P. Razenshteyn, and Ludwig Schmidt. 2015. Practical and Optimal LSH for Angular Distance. In NeurIPS. 1225–1233. http://papers.nips.cc/paper/5893-practical-and-optimal-lsh-for-angular-distance
[2] Arturs Backurs and Piotr Indyk. 2015. Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false). In STOC. 51–58. https://doi.org/10.1145/2746539.2746612
[3] Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. 2007. Scaling up all pairs similarity search. In WWW. 131–140. https://doi.org/10.1145/1242572.1242591
[4] Diptarka Chakraborty, Elazar Goldenberg, and Michal Koucký. 2016. Streaming algorithms for embedding and computing edit distance in the low distance regime. In STOC. 712–725. https://doi.org/10.1145/2897518.2897577
[5] Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. 2006. A Primitive Operator for Similarity Joins in Data Cleaning. In ICDE. 5. https://doi.org/10.1109/ICDE.2006.9
[6] Nicolas Courty, Rémi Flamary, and Mélanie Ducoffe. 2018. Learning Wasserstein Embeddings. In ICLR. https://openreview.net/forum?id=SJyEH91A-
[7] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In ACM Symposium on Computational Geometry. 253–262. https://doi.org/10.1145/997817.997857
[8] Dong Deng, Guoliang Li, and Jianhua Feng. 2014. A pivotal prefix based filtering algorithm for string similarity search. In SIGMOD. 673–684. https://doi.org/10.1145/2588555.2593675
[9] Dong Deng, Guoliang Li, Shuang Hao, Jiannan Wang, and Jianhua Feng. 2014. MassJoin: A mapreduce-based method for scalable string similarity joins. In ICDE. 340–351. https://doi.org/10.1109/ICDE.2014.6816663
[10] Cong Fu, Chao Xiang, Changxu Wang, and Deng Cai. 2019. Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph. PVLDB.
[11] In CVPR. 2946–2953. https://doi.org/10.1109/CVPR.2013.379
[12] Alexander Hermans, Lucas Beyer, and Bastian Leibe. 2017. In Defense of the Triplet Loss for Person Re-Identification. CoRR abs/1703.07737. arXiv:1703.07737 http://arxiv.org/abs/1703.07737
[13] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product Quantization for Nearest Neighbor Search. TPAMI 33, 1 (2011), 117–128. https://doi.org/10.1109/TPAMI.2010.57
[14] Yu Jiang, Guoliang Li, Jianhua Feng, and Wen-Syan Li. 2014. String Similarity Joins: An Experimental Evaluation. PVLDB 7, 8 (2014), 625–636. https://doi.org/10.14778/2732296.2732299
[15] Chen Li, Bin Wang, and Xiaochun Yang. 2007. VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. In VLDB.
[16] PVLDB 5, 3 (2011), 253–264. https://doi.org/10.14778/2078331.2078340
[17] Yury A. Malkov and D. A. Yashunin. 2016. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. CoRR abs/1603.09320. arXiv:1603.09320 http://arxiv.org/abs/1603.09320
[18] William J. Masek and Mike Paterson. 1980. A Faster Algorithm Computing String Edit Distances. J. Comput. Syst. Sci. 20, 1 (1980), 18–31. https://doi.org/10.1016/0022-0000(80)90002-1
[19] Rafail Ostrovsky and Yuval Rabani. 2007. Low distortion embeddings for edit distance. J. ACM 54, 5 (2007), 23. https://doi.org/10.1145/1284320.1284322
[20] Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, and Xuemin Lin. 2011. Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD. 1033–1044. https://doi.org/10.1145/1989323.1989431
[21] Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, and Xuemin Lin. 2011. Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD. 1033–1044. https://doi.org/10.1145/1989323.1989431
[22] Ji Sun, Zeyuan Shang, Guoliang Li, Zhifeng Bao, and Dong Deng. 2019. Balance-Aware Distributed String Similarity-Based Query Processing System. PVLDB.
[23] In ICDE. 519–530. https://doi.org/10.1109/ICDE.2015.7113311
[24] Jiannan Wang, Guoliang Li, and Jianhua Feng. 2010. Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints. PVLDB 3, 1 (2010), 1219–1230. https://doi.org/10.14778/1920841.1920992
[25] Jiannan Wang, Guoliang Li, and Jianhua Feng. 2012. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD. 85–96. https://doi.org/10.1145/2213836.2213847
[26] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. 2014. Hashing for Similarity Search: A Survey. CoRR abs/1408.2927. arXiv:1408.2927 http://arxiv.org/abs/1408.2927
[27] Chuan Xiao, Wei Wang, and Xuemin Lin. 2008. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 1, 1 (2008), 933–944. https://doi.org/10.14778/1453856.1453957
[28] Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu. 2008. Efficient similarity joins for near duplicate detection. In WWW. 131–140. https://doi.org/10.1145/1367497.1367516
[29] Minghe Yu, Guoliang Li, Dong Deng, and Jianhua Feng. 2016. String similarity search and join: a survey. Frontiers Comput. Sci. 10, 3 (2016), 399–417. https://doi.org/10.1007/s11704-015-5900-5
[30] Haoyu Zhang and Qin Zhang. 2017. EmbedJoin: Efficient Edit Similarity Joins via Embeddings. In SIGKDD. 585–594. https://doi.org/10.1145/3097983.3098003
[31] Haoyu Zhang and Qin Zhang. 2019. MinJoin: Efficient Edit Similarity Joins via Local Hash Minima. In SIGKDD. 1093–1103. https://doi.org/10.1145/3292500.3330853
[32] Xiyuan Zhang, Yang Yuan, and Piotr Indyk. 2020. Neural Embeddings for Nearest Neighbor Search Under Edit Distance. (2020). https://openreview.net/forum?id=HJlWIANtPH
[33] Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, and Divesh Srivastava. 2010. Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In SIGMOD.