GB-KMV: An Augmented KMV Sketch for Approximate Containment Similarity Search
Yang Yang†, Ying Zhang§, Wenjie Zhang†, Zengfeng Huang†
†University of New South Wales, §University of Technology Sydney
{yang.yang, zhangw}@cse.unsw.edu.au, [email protected]

Abstract—In this paper, we study the problem of approximate containment similarity search. Given two records Q and X, the containment similarity between Q and X with respect to Q is |Q ∩ X|/|Q|. Given a query record Q and a set of records S, containment similarity search finds the records in S whose containment similarity with regard to Q is not less than a given threshold. This problem has many important applications in commercial and scientific fields such as record matching and domain search. The existing solution relies on an asymmetric LSH method that transforms containment similarity to the well-studied Jaccard similarity. In this paper, we use an inherently different framework that transforms containment similarity to set intersection. We propose a novel augmented KMV sketch technique, namely GB-KMV, which is data-dependent and can achieve a much better trade-off between sketch size and accuracy. We provide a set of theoretical analyses to underpin the proposed augmented KMV sketch technique, and show that it outperforms the state-of-the-art technique LSH-E in terms of estimation accuracy under practical assumptions. Our comprehensive experiments on real-life datasets verify that GB-KMV is superior to LSH-E in terms of the space-accuracy trade-off, the time-accuracy trade-off, and sketch construction time. For instance, with similar estimation accuracy (F-1 score), GB-KMV is over 100 times faster than LSH-E on several real-life datasets.
I. INTRODUCTION
In many applications such as information retrieval, data cleaning, machine learning and user recommendation, an object (e.g., a document, image, web page or user) is described by a set of elements (e.g., words, q-grams, or items). One of the most critical components in these applications is to define the set similarity between two objects and to develop the corresponding similarity query processing techniques. Given two records (objects) X and Y, a variety of similarity functions/metrics have been identified in the literature for different scenarios (e.g., [28], [15]). Many indexing techniques have been developed to support efficient exact and approximate lookups and joins based on these similarity functions.

Many of the set similarity functions studied are symmetric, i.e., f(X, Y) = f(Y, X), including the widely used Jaccard similarity and Cosine similarity. In recent years, much research attention has been given to asymmetric set similarity functions, which are more appropriate in some applications. Containment similarity (a.k.a. Jaccard containment similarity) is one of the representative asymmetric set similarity functions, where the similarity between two records X and Y is defined as f(X, Y) = |X ∩ Y|/|X|, in which |X ∩ Y| and |X| are the intersection size of X and Y and the size of X, respectively.

Compared with symmetric similarity such as Jaccard similarity, containment similarity gives special consideration to the query size, which makes it more suitable in some applications. As shown in [35], containment similarity is useful in record matching. Consider two text descriptions of restaurants X and Y represented as "set of words" records: {five, guys, burgers, and, fries, downtown, brooklyn, new, york} and {five, kitchen, berkeley}, respectively. Suppose the query Q is {five, guys}. The Jaccard similarity, f(Q, X) = |Q ∩ X|/|Q ∪ X|, of Q and X (resp. Y) is 2/9 ≈ 0.22 (resp. 1/4 = 0.25). Based on Jaccard similarity, record Y matches query Q better, but intuitively X should be the better choice. This is because Jaccard similarity unnecessarily favors short records. On the other hand, containment similarity leads to the desired order, with f(Q, X) = 2/2 = 1.0 and f(Q, Y) = 1/2 = 0.5.

Containment similarity search can also support online error-tolerant search for matching user queries against addresses (map services) and products (product search). This is because regular keyword search is usually based on containment search, and containment similarity search provides a natural error-tolerant alternative [5]. In [44], Zhu et al. show that containment similarity search is essential in domain search, which enables users to effectively search Open Data.

Containment similarity is also of interest to applications that compute the fraction of values of one column that are contained in another column. In a dataset, the discovery of all inclusion dependencies is a crucial part of data profiling efforts, with many applications such as foreign-key detection and data integration (e.g., [22], [31], [8], [33], [30]).
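To make the contrast between the two measures concrete, the following minimal Go snippet (an illustration, not part of the paper's released code [1]) computes both similarities exactly for the record-matching example above:

package main

import "fmt"

func toSet(words []string) map[string]bool {
	s := make(map[string]bool, len(words))
	for _, w := range words {
		s[w] = true
	}
	return s
}

func intersectionSize(a, b map[string]bool) int {
	n := 0
	for e := range a {
		if b[e] {
			n++
		}
	}
	return n
}

// jaccard returns |A ∩ B| / |A ∪ B| (symmetric).
func jaccard(a, b map[string]bool) float64 {
	i := intersectionSize(a, b)
	return float64(i) / float64(len(a)+len(b)-i)
}

// containment returns |Q ∩ X| / |Q| (asymmetric: normalized by the query).
func containment(q, x map[string]bool) float64 {
	return float64(intersectionSize(q, x)) / float64(len(q))
}

func main() {
	x := toSet([]string{"five", "guys", "burgers", "and", "fries", "downtown", "brooklyn", "new", "york"})
	y := toSet([]string{"five", "kitchen", "berkeley"})
	q := toSet([]string{"five", "guys"})
	fmt.Printf("J(Q,X)=%.2f  J(Q,Y)=%.2f\n", jaccard(q, x), jaccard(q, y)) // 0.22 vs 0.25: Y ranks higher
	fmt.Printf("C(Q,X)=%.2f  C(Q,Y)=%.2f\n", containment(q, x), containment(q, y)) // 1.00 vs 0.50: X ranks higher
}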
Challenges. The problem of containment similarity search has been intensively studied in recent years (e.g., [5], [35], [44]). The key challenges come from three aspects. (i) The number of elements (i.e., the vocabulary size) may be very large. For instance, the vocabulary blows up quickly when higher-order shingles are used [35]. Moreover, queries and records may contain many elements. To deal with the sheer volume of the data, it is desirable to use sketch techniques that provide effective and efficient approximate solutions. (ii) The data distribution (e.g., record size and element frequency) in real-life applications may be highly skewed, which may lead to poor performance in practice for data-independent sketch methods. (iii) A subtle difficulty of the approximate solution comes from the asymmetric property of containment similarity: it is shown in [34] that there cannot exist any locality sensitive hashing (LSH) function family for containment similarity search.

To handle large-scale data and provide quick responses, most existing solutions for containment similarity search resort to approximate solutions. Although the use of LSH is restricted, a novel asymmetric LSH method was designed in [34] to address the issue with padding techniques. Enhancements of asymmetric LSH were proposed in follow-up work by introducing different functions (e.g., [35]). Observing that the performance of these solutions is sensitive to the skewness of the record sizes, Zhu et al. [44] propose a partition-based method based on the MinHash LSH function. By using an optimal partition strategy based on the size distribution of the records, the new approach achieves a much better time-accuracy trade-off.

We notice that all existing approximate solutions rely on LSH functions and transform containment similarity to the well-studied Jaccard similarity. That is,

  |Q ∩ X| / |Q| = (|Q ∩ X| / |Q ∪ X|) × |Q ∪ X| × (1 / |Q|)

As the query size is usually readily available, the estimation error comes from the computation of the Jaccard similarity and the union size of Q and X. Although the union size can be derived from the Jaccard similarity [44], the large variance caused by combining two estimations remains. This motivates us to use a different framework that transforms containment similarity to set intersection size estimation, where the error is contributed only by the estimation of |Q ∩ X|. The well-known KMV sketch [11] has been widely used to estimate set intersection size and can be immediately applied to our problem. However, this method is data-independent and hence cannot handle the skewed distributions of record size and element frequency that are common in real-life applications. Intuitively, records with larger size and elements with higher frequency should be allocated more resources. In this paper, we theoretically show that the existing KMV sketch technique cannot accommodate these two perspectives with simple heuristics, e.g., explicitly allocating more resources to records with large size. Consequently, we develop an augmented KMV sketch that exploits both the record size distribution and the element frequency distribution for better space-accuracy and time-accuracy trade-offs. Two techniques are proposed: (i) we impose a global threshold on the KMV sketch, namely the G-KMV sketch, to achieve better estimation accuracy; as discussed in Section IV-A(2), this technique cannot be extended to MinHash LSH. (ii) We introduce an extra buffer for each record to take advantage of the skewness of the element frequency. A cost model is proposed to carefully choose the buffer size to optimize accuracy for a given total space budget and data distribution.
Contributions. Our principal contributions are summarized as follows.

• We propose a new augmented KMV sketch technique, namely GB-KMV, for the problem of approximate containment similarity search. By imposing a global threshold and an extra buffer on the KMV sketches of the records, we significantly enhance performance, as the new method can better exploit the data distributions.

• We provide theoretical underpinnings to justify the design of the GB-KMV method. We also theoretically show that GB-KMV outperforms the state-of-the-art technique LSH-E in terms of accuracy under realistic assumptions on the data distributions.

• Our comprehensive experiments on real-life set-valued data from various applications demonstrate the effectiveness and efficiency of the proposed method.
Road Map.
The rest of the paper is organized as follows. Section II presents the preliminaries. Section III introduces the existing solutions. Our approach, the GB-KMV sketch, is devised in Section IV. Extensive experiments are reported in Section V, followed by related work in Section VI. Section VII concludes the paper.

TABLE I. THE SUMMARY OF NOTATIONS

Notation     Definition
S            a collection of records
X, Q         record, query record
x, q         record size of X, query size of Q
J(Q,X), s    Jaccard similarity between query Q and set X
C(Q,X), t    containment similarity of query Q in set X
s*           Jaccard similarity threshold
t*           containment similarity threshold
L_X          the KMV signature (i.e., hash values) of record X
h(X)         all hash values of the elements in record X
H_X          the buffer of record X
b            sketch space budget, measured by the number of signatures (i.e., hash values or elements)
τ            the global threshold for hash values
r            the buffer size (in bits) of the GB-KMV sketch
m            number of records in dataset S
n            number of distinct elements in dataset S

II. PRELIMINARIES
In this section, we first formally present the problem of containment similarity search and then introduce some preliminary knowledge. In Table I, we summarize the important mathematical notations appearing throughout this paper.
A. Problem Definition
In this paper, the element universe is E = {e_1, e_2, ..., e_n}. Let S be a collection of records (sets) {X_1, X_2, ..., X_m}, where each X_i (1 ≤ i ≤ m) is a set of elements from E. Before giving the definition of containment similarity, we first introduce the Jaccard similarity.

Definition 1 (Jaccard Similarity). Given two records X and Y from S, the Jaccard similarity between X and Y is defined as the size of the intersection divided by the size of the union:

  J(X, Y) = |X ∩ Y| / |X ∪ Y|  (1)

Similar to the Jaccard similarity, the containment similarity (a.k.a. Jaccard containment similarity) is defined as follows.

Definition 2 (Containment Similarity). Given two records X and Y from S, the containment similarity of X in Y, denoted by C(X, Y), is the size of the intersection divided by the record size |X|:

  C(X, Y) = |X ∩ Y| / |X|  (2)

Note that by replacing the union size |X ∪ Y| in Equation 1 with the size |X|, we obtain the containment similarity. It is easy to see that Jaccard similarity is symmetric while containment similarity is asymmetric.

In this paper, we focus on the problem of containment similarity search, which looks up the records whose containment similarity towards a given query record is not smaller than a given threshold. The formal definition is as follows.

Definition 3 (Containment Similarity Search). Given a query Q and a threshold t* ∈ [0, 1] on the containment similarity, search for the records {X : X ∈ S} from a dataset S such that

  C(Q, X) ≥ t*  (3)

Fig. 1. A four-record dataset and a query Q; C(Q, X_i) is the containment similarity of Q in X_i.

Next, we give an example of containment similarity search.

Example 1. Fig. 1 shows a dataset with four records {X_1, X_2, X_3, X_4} over the element universe E. Given the query Q shown in Fig. 1, which contains six elements, and a containment similarity threshold t*, the records X_i satisfying C(Q, X_i) ≥ t* are returned; with the threshold used in the figure, two of the four records qualify.

Problem Statement. In this paper, we investigate the problem of approximate containment similarity search. For a dataset S with a large number of records, we aim to build a synopsis of the dataset such that it (i) can efficiently support containment similarity search with high accuracy, (ii) can handle records of large size, and (iii) has a compact index size.
B. Minwise Hashing

Minwise hashing (MinHash) was proposed by Broder in [13], [14] for estimating the Jaccard similarity of two records X and Y. Let h be a hash function that maps the elements of X and Y to distinct integers, and define h_min(X) and h_min(Y) to be the minimum hash values of records X and Y, respectively. Assuming no hash collisions, Broder [13] showed that the Jaccard similarity of X and Y equals the probability that the two minimum hash values coincide: Pr[h_min(X) = h_min(Y)] = J(X, Y). Applying k different independent hash functions h_1, h_2, ..., h_k to a record X (Y, resp.), the MinHash signature of X (Y, resp.) keeps the k values h^i_min(X) (h^i_min(Y), resp.) for the k functions. Let n_i, i = 1, 2, ..., k, be the indicator such that

  n_i = 1 if h^i_min(X) = h^i_min(Y), and n_i = 0 otherwise.  (4)

Then the Jaccard similarity between records X and Y can be estimated as

  ŝ = Ĵ(X, Y) = (1/k) Σ_{i=1}^{k} n_i  (5)

Let s = J(X, Y) be the Jaccard similarity of sets X and Y. The expectation of ŝ is

  E(ŝ) = s  (6)

and the variance of ŝ is

  Var(ŝ) = s(1 − s)/k  (7)
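As an illustration, the following Go sketch estimates J(X, Y) via Equation 5. Simulating the k independent hash functions by seeding a single FNV hash with the function index is an implementation shortcut assumed here, not a construction from the paper:

package main

import (
	"fmt"
	"hash/fnv"
)

// minHash returns the minimum hash value of a record under the i-th hash
// function, simulated by prefixing the element with the function index.
func minHash(rec []string, seed uint32) uint32 {
	best := ^uint32(0)
	for _, e := range rec {
		h := fnv.New32a()
		h.Write([]byte{byte(seed), byte(seed >> 8), byte(seed >> 16), byte(seed >> 24)})
		h.Write([]byte(e))
		if v := h.Sum32(); v < best {
			best = v
		}
	}
	return best
}

// estimateJaccard implements Equation 5: the fraction of the k hash
// functions on which the two minima collide.
func estimateJaccard(x, y []string, k int) float64 {
	match := 0
	for i := 0; i < k; i++ {
		if minHash(x, uint32(i)) == minHash(y, uint32(i)) {
			match++
		}
	}
	return float64(match) / float64(k)
}

func main() {
	x := []string{"a", "b", "c", "d"}
	y := []string{"b", "c", "d", "e"}
	fmt.Printf("estimated J = %.2f (exact 0.60)\n", estimateJaccard(x, y, 256))
}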
C. KMV Sketch

The k minimum values (KMV) technique, introduced by Beyer et al. in [11], estimates the number of distinct elements in a large dataset. Given a collision-free hash function h that maps elements to the range [0, 1], a KMV synopsis of a record X, denoted by L_X, keeps the k minimum hash values of X. The number of distinct elements |X| can then be estimated by (k − 1)/U^(k), where U^(k) is the k-th smallest hash value. By h(X), we denote the hash values of all elements in the record X.

In [11], Beyer et al. also methodically analyze the problem of distinct-element estimation under multiset operations. For the union operation, consider two records X and Y with corresponding KMV synopses L_X and L_Y of sizes k_X and k_Y, respectively. In [11], L_X ⊕ L_Y represents the set consisting of the k smallest hash values in L_X ∪ L_Y, where

  k = min(k_X, k_Y)  (8)

Then the KMV synopsis of X ∪ Y is L = L_X ⊕ L_Y. An unbiased estimator for the number of distinct elements in X ∪ Y, denoted by D_∪ = |X ∪ Y|, is

  D̂_∪ = (k − 1)/U^(k)  (9)

For the intersection operation, the KMV synopsis is L = L_X ⊕ L_Y with k = min(k_X, k_Y). Let K_∩ = |{v ∈ L : v ∈ L_X ∩ L_Y}|, i.e., K_∩ is the number of common distinct hash values of L_X and L_Y within L. Then the number of distinct elements in X ∩ Y, denoted by D_∩, can be estimated as

  D̂_∩ = (K_∩/k) × (k − 1)/U^(k)  (10)

The variance of D̂_∩, as shown in [11], is

  Var[D̂_∩] = D_∩(kD_∪ − k² − D_∪ + k + D_∩) / (k(k − 2))  (11)
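The estimators of Equations 8-10 are straightforward to implement. The Go sketch below is an illustration (not the paper's released code [1]); it assumes a collision-free hash into (0, 1]:

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashUnit maps an element into (0, 1] with a single shared hash function;
// collisions are assumed away, as in [11].
func hashUnit(e string) float64 {
	h := fnv.New64a()
	h.Write([]byte(e))
	return (float64(h.Sum64()) + 1) / float64(^uint64(0))
}

// kmv returns the k smallest hash values of a record, ascending.
func kmv(rec []string, k int) []float64 {
	vs := make([]float64, 0, len(rec))
	for _, e := range rec {
		vs = append(vs, hashUnit(e))
	}
	sort.Float64s(vs)
	if len(vs) > k {
		vs = vs[:k]
	}
	return vs
}

func contains(sorted []float64, v float64) bool {
	i := sort.SearchFloat64s(sorted, v)
	return i < len(sorted) && sorted[i] == v
}

// intersectionEstimate implements Equations 8-10: merge the two sketches,
// keep the k = min(kX, kY) smallest distinct values, count the collisions
// K∩ among them, and scale the estimate (k-1)/U(k) by K∩/k.
func intersectionEstimate(lx, ly []float64) float64 {
	k := len(lx)
	if len(ly) < k {
		k = len(ly)
	}
	merged := append(append([]float64{}, lx...), ly...)
	sort.Float64s(merged)
	var uniq []float64
	for _, v := range merged {
		if len(uniq) == 0 || v != uniq[len(uniq)-1] {
			uniq = append(uniq, v)
		}
	}
	uniq = uniq[:k] // L = L_X (+) L_Y
	kInter := 0
	for _, v := range uniq {
		if contains(lx, v) && contains(ly, v) {
			kInter++
		}
	}
	return float64(kInter) / float64(k) * float64(k-1) / uniq[k-1]
}

func main() {
	x := []string{"a", "b", "c", "d", "e", "f"}
	y := []string{"c", "d", "e", "f", "g", "h"}
	fmt.Printf("|X∩Y| estimate: %.1f (exact 4)\n", intersectionEstimate(kmv(x, 4), kmv(y, 4)))
}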
III. EXISTING SOLUTIONS

In this section, we present the state-of-the-art technique for approximate containment similarity search, followed by a theoretical analysis of the limits of the existing solution.

A. LSH Ensemble Method

The LSH Ensemble technique, LSH-E for short, was proposed by Zhu et al. in [44] to tackle the problem of approximate containment similarity search. The key idea is to (1) transform containment similarity search to the well-studied Jaccard similarity search, and (2) partition the data by length and then apply the LSH Forest [9] technique to each individual partition.
Similarity Transformation.
Consider a record X with size x = |X|, a query Q with size q = |Q|, containment similarity t = C(Q, X) and Jaccard similarity s = J(Q, X). The transformations back and forth are

  s = t / (x/q + 1 − t),  t = (x/q + 1) s / (1 + s)  (12)

Given a containment similarity search threshold t* for the query Q, we can compute its corresponding Jaccard similarity threshold s* by Equation 12. A straightforward solution is to apply an existing approximate Jaccard similarity search technique to each individual record X ∈ S with the Jaccard similarity threshold s* (e.g., compute the Jaccard similarity between the query Q and a set X based on their MinHash signatures). In order to take advantage of efficient indexing techniques (e.g., LSH Forest [9]), LSH-E partitions the dataset S.
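For instance, the threshold transformation can be computed as below; the numbers in main are hypothetical and only illustrate how the partition upper bound u (introduced next) loosens the Jaccard threshold:

package main

import "fmt"

// jaccardThreshold implements Equation 12: the Jaccard threshold s*
// corresponding to containment threshold t for record size x and query
// size q, s* = t / (x/q + 1 - t).
func jaccardThreshold(t, x, q float64) float64 {
	return t / (x/q + 1 - t)
}

func main() {
	// Hypothetical sizes. Replacing x by the partition upper bound u
	// (Equation 13) can only lower the threshold, admitting false positives.
	fmt.Printf("s* with x = 100: %.3f\n", jaccardThreshold(0.8, 100, 50))
	fmt.Printf("s* with u = 200: %.3f\n", jaccardThreshold(0.8, 200, 50))
}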
Data Partition. By partitioning the dataset S according to record size, LSH-E can replace x in Equation 12 with its upper bound u (i.e., the largest record size in the partition) as an approximation. That is, for a given containment similarity threshold t* we have

  s* = t* / (u/q + 1 − t*)  (13)

The use of the upper bound u leads to false positives. In [44], an optimal partition method is designed to minimize the total number of false positives brought by the use of the upper bound in each partition. Assuming that the record size distribution follows a power law and that similarity values are uniformly distributed, it is shown that the optimal partition is achieved by giving each partition an equal number of records (i.e., equal-depth partitioning).
Containment Similarity Search. For each partition S_i of the data, LSH-E applies a dynamic LSH technique (e.g., LSH Forest [9]). In particular, the records in S_i are indexed by a MinHash LSH with parameters (b, r), where b is the number of bands used by the LSH index and r is the number of hash values in each band. For a given query Q, the b and r values are carefully chosen by considering the corresponding numbers of false positives and false negatives with regard to the existing records. The candidate records in each partition are then retrieved from the MinHash index according to the corresponding Jaccard similarity threshold obtained by Equation 13. The union of the candidate records from all partitions is returned as the result of the containment similarity search.
B. Analysis

One of LSH-E's advantages is that it converts the containment similarity problem to a Jaccard similarity search problem, which can be solved by the mature and efficient MinHash LSH method. Also, LSH-E carefully considers the record size distribution and partitions the records by record size. In this sense, LSH-E is a data-dependent method, and it is reported that LSH-E significantly outperforms existing asymmetric LSH based solutions [34], [35] (i.e., data-independent methods), as LSH-E can exploit the data distribution by partitioning the dataset. However, this benefit is offset by the fact that the upper bound brings extra false positives, in addition to the error from the MinHash technique.

Below we theoretically analyse the performance of LSH-E by studying the expectation and variance of its estimator. Using the same notation as above, let s = J(Q, X) be the Jaccard similarity between query Q and set X, and t = C(Q, X) the containment similarity of Q in X. By Equation 5, given the MinHash signatures of Q and X, an unbiased estimator ŝ of the Jaccard similarity s = J(Q, X) is the ratio of collisions in the signatures, and the variance of ŝ is Var[ŝ] = s(1 − s)/k, where k is the signature size of each record. Then, by the transformation in Equation 12, the estimator t̂ of the containment similarity t = C(Q, X) by MinHash LSH is

  t̂ = (x/q + 1) ŝ / (1 + ŝ)  (14)

where q = |Q| and x = |X|. The estimator t̂′ of t = C(Q, X) by LSH-E is

  t̂′ = (u/q + 1) ŝ / (1 + ŝ)  (15)

where u is the upper bound of |X| in the partition.

Next, we use Taylor expansions to approximate the expectation and variance of a function of one random variable [26]. We first give a lemma.

Lemma 1. Given a random variable X with expectation E[X] and variance Var[X], the expectation of f(X) can be approximated as

  E[f(X)] ≈ f(E[X]) + (f″(E[X])/2) Var[X]  (16)

and the variance of f(X) can be approximated as

  Var[f(X)] ≈ [f′(E[X])]² Var[X] − (1/4)[f″(E[X])]² Var[X]²  (17)

According to Equation 14, let t̂ = f(ŝ) = α ŝ/(1 + ŝ), where α = x/q + 1. The estimator t̂ is thus a function of ŝ, with f′(ŝ) = α/(1 + ŝ)² and f″(ŝ) = −2α/(1 + ŝ)³. Based on Lemma 1, the expectation and variance of t̂ are approximated as

  E[t̂] ≈ t (1 − (1 − s) / (k(1 + s)²))  (18)

  Var[t̂] ≈ D_∩² (1 − s) [k(1 + s)² − s(1 − s)] / (q² k² s (1 + s)⁴)  (19)

Similarly, the expectation and variance of the LSH-E estimator t̂′ can be approximated as

  E[t̂′] ≈ t ((u + q)/(x + q)) (1 − (1 − s)/(k(1 + s)²))  (20)

  Var[t̂′] ≈ ((u + q)/(x + q))² D_∩² (1 − s) [k(1 + s)² − s(1 − s)] / (q² k² s (1 + s)⁴)  (21)

The computation details are in the technical report [41]. Since u is an upper bound of x, the variance Var[t̂′] of the LSH-E estimator is larger than that of the MinHash LSH estimator. Also, by Equations 18 and 20, both estimators are biased, and the LSH-E method is quite sensitive to the setting of the upper bound u (Equation 20). Because the upper bound u inflates the estimator above the true value, LSH-E favours recall while precision deteriorates: the larger the upper bound u, the worse the precision. Our empirical study shows that LSH-E cannot achieve a good trade-off between accuracy and space compared with our proposed method.
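The inflation factor ((u + q)/(x + q))² can be made tangible by plugging sample values into Equations 19 and 21, as the Go snippet below does; all parameter values here are hypothetical:

package main

import "fmt"

// varMinHash evaluates Equation 19, the approximate variance of the MinHash
// estimator of C(Q, X); varLSHE evaluates Equation 21, which adds the
// ((u+q)/(x+q))^2 inflation caused by the partition upper bound u.
func varMinHash(dInter, s, q, k float64) float64 {
	num := dInter * dInter * (1 - s) * (k*(1+s)*(1+s) - s*(1-s))
	den := q * q * k * k * s * (1 + s) * (1 + s) * (1 + s) * (1 + s)
	return num / den
}

func varLSHE(dInter, s, q, x, u, k float64) float64 {
	f := (u + q) / (x + q)
	return f * f * varMinHash(dInter, s, q, k)
}

func main() {
	dInter, s, q, x, u, k := 40.0, 0.25, 50.0, 100.0, 400.0, 128.0
	fmt.Printf("Var[t^]  = %.5f\n", varMinHash(dInter, s, q, k))
	fmt.Printf("Var[t^'] = %.5f (inflation x%.0f)\n",
		varLSHE(dInter, s, q, x, u, k), ((u+q)/(x+q))*((u+q)/(x+q)))
}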
Fig. 2. The KMV sketch of the dataset in Example 1; each signature consists of element-hash value pairs. k_i is the signature size of X_i.

IV. OUR APPROACH
In this section, we introduce an augmented KMV sketch technique that achieves a better space-accuracy trade-off for approximate containment similarity search. Section IV-A briefly introduces the motivation and the main techniques of our method, namely GB-KMV. The detailed implementation is presented in Section IV-B, followed by extensive theoretical analysis in Section IV-C.

A. Motivation and Techniques

The key idea of our method is a data-dependent indexing technique that exploits the distribution of the data (i.e., the record size distribution and the element frequency distribution) for better containment similarity search performance. We augment the existing KMV technique by introducing a global threshold for sample size allocation and a buffer for frequent elements, namely GB-KMV, to achieve a better trade-off between synopsis size and accuracy. We then apply existing set similarity join/search indexing techniques to speed up the containment similarity search.

Below we outline the motivation of the key techniques used in this paper. Detailed algorithms and theoretical analysis are introduced in Sections IV-B and IV-C, respectively.

(1) Directly Apply the KMV Sketch

Given a query Q and a threshold t* on containment similarity, the goal is to find the records X from dataset S such that

  |Q ∩ X| / |Q| ≥ t*  (22)

Applying a simple transformation to Equation 22, we get

  |Q ∩ X| ≥ t* |Q|  (23)

Let θ = t*|Q|; the containment similarity search problem is then converted into finding the records X whose intersection size with the query Q is not smaller than θ, i.e., |Q ∩ X| ≥ θ.

Therefore, we can directly apply the KMV method introduced in Section II-C. Given the KMV signatures of a record X and a query Q, we can estimate their intersection size |Q ∩ X| according to Equation 10. The containment similarity of Q in X is then immediately available given the query size |Q|. Below, we show an example of applying the KMV method to containment similarity search.
Example 2.
Fig. 2 shows the KMV sketch of the dataset in Example 1. Given the KMV signature of Q (with k_Q = 4 hash values) and of X_1 (with k_1 = 3 hash values), we have k = min{k_Q, k_1} = 3, so the size-k KMV synopsis of Q ∪ X_1 is L = L_Q ⊕ L_{X_1}; the k-th smallest hash value is U^(k) = 0.33, and the intersection size of L_Q and L_{X_1} within L is K_∩ = |{v : v ∈ L_Q ∩ L_{X_1}, v ∈ L}| = 2. The intersection size of Q and X_1 is then estimated as D̂_∩ = (K_∩/k) × (k − 1)/U^(k) = (2/3) × (2/0.33) ≈ 4.0, and the containment similarity is t̂ = D̂_∩/|Q| ≈ 0.67. X_1 is returned if the given containment similarity threshold t* is at most this value.
In [44], the size of the query is approximated by the MinHash signature of Q; a KMV sketch can serve the same purpose. However, the exact query size is used in their implementation for performance evaluation. Since in practice the query size is readily available, we assume the query size is given throughout the paper.
Optimization of the KMV Sketch. Given a space budget b, we can keep a size-k_i KMV signature (i.e., the k_i minimal hash values) for each record X_i, with Σ_{i=1}^{m} k_i = b. A natural question is how to allocate the resource (i.e., how to set the k_i values) to achieve the best overall estimation accuracy. Intuitively, more resources should be allocated to records with more frequent elements or larger size, i.e., a larger k_i for a record with larger size. However, Theorem 1 (Section IV-C2) shows that the optimal resource allocation strategy in terms of estimation variance is to use the same signature size for every record. This is because the minimum of the two k values is used in Equation 8, and hence the best solution is to allocate the resource evenly. Thus we obtain the KMV sketch based method for approximate containment similarity search: for a given budget b, we keep the k_i = ⌊b/m⌋ minimal hash values for each record X_i.

(2) Impose a Global Threshold on the KMV Sketch (G-KMV)

The above analysis of the optimal KMV sketch suggests an equal-size allocation strategy; that is, each record is associated with a signature of the same size. Intuitively, we should assign more resources (i.e., larger signatures) to records with large size because they are more likely to appear in the results. However, the estimation accuracy of KMV for the intersection size of two sets is determined by the smaller sketch, since we choose k = min(k_1, k_2) for the KMV signatures of X_1 and X_2 when estimating D_∪ and D_∩ (Equations 9 and 10); it is therefore useless to give more resources to only one of the records. Before we introduce the global threshold, we further explain the reason with the following example, based on the KMV sketch shown in Fig. 2.
Example 3.
Suppose L_Q contains four hash values and L_X contains two. Although there are four hash values in L_Q ∪ L_X, by Equation 8 we may only consider the k = min{k_Q, k_X} = 2 smallest hash values of L_Q ∪ L_X, and the k-th (k = 2) minimum hash value is the one used in Equation 9. We cannot use k = 4 (i.e., U^(4)) to estimate |Q ∪ X|, because the 4-th smallest hash value in L_Q ∪ L_X may not be the 4-th smallest hash value in h(Q ∪ X): the unseen 3-rd smallest hash value of X might be smaller than it. Recall that h(Q ∪ X) denotes the hash values of all elements in Q ∪ X.

Nevertheless, if we know that all hash values smaller than a global threshold are kept for every record, we can safely use the 4-th hash value of L_Q ∪ L_X for the estimation, because then the 4-th smallest hash value in L_Q ∪ L_X must be the 4-th smallest hash value in h(Q ∪ X).

Fig. 3. The G-KMV sketch of the dataset in Example 1 with a global hash value threshold τ.

Inspired by the above observation, we can carefully choose a global threshold τ for a given space budget b, and ensure that all hash values smaller than τ are kept in the KMV sketch of every record. By imposing a global threshold, we can identify a better (i.e., larger) k value for estimation, compared with Equation 8.

Given a record X and a global threshold τ, the sketch of X is obtained as L_X = {h(e) : h(e) ≤ τ, e ∈ X}, where h is the hash function. The sketch L_Q of Q is defined in the same way. In this paper, we say a KMV sketch is a G-KMV sketch if we impose a global threshold when generating the KMV sketch. We then set the k value of the KMV estimation as follows:

  k = |L_Q ∪ L_X|  (24)

Meanwhile, we have K_∩ = |L_Q ∩ L_X|. Let U^(k) be the k-th minimal hash value in L_Q ∪ L_X; the overlap size of Q and X can then be estimated as

  D̂^GKMV_∩ = (K_∩/k) × (k − 1)/U^(k)  (25)

and the containment similarity of Q in X is

  Ĉ = D̂^GKMV_∩ / q  (26)

where q is the query size. We remark that, as a by-product, the global threshold favours records with large size, because all elements with hash values smaller than τ are kept for each record.
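A minimal Go sketch of the G-KMV estimator follows; it is an illustration that takes the per-record hash values and the global threshold τ as given and assumes a collision-free hash into (0, 1]:

package main

import (
	"fmt"
	"sort"
)

// gkmvSketch keeps every hash value of a record that is at most the global
// threshold tau.
func gkmvSketch(hashes []float64, tau float64) []float64 {
	var l []float64
	for _, v := range hashes {
		if v <= tau {
			l = append(l, v)
		}
	}
	sort.Float64s(l)
	return l
}

// gkmvIntersection implements Equations 24-25: because both sketches are
// complete below tau, k can be set to |L_Q ∪ L_X| instead of min(k_Q, k_X).
func gkmvIntersection(lq, lx []float64) float64 {
	inQ := make(map[float64]bool, len(lq))
	for _, v := range lq {
		inQ[v] = true
	}
	union := append([]float64{}, lq...)
	kInter := 0
	for _, v := range lx {
		if inQ[v] {
			kInter++
		} else {
			union = append(union, v)
		}
	}
	sort.Float64s(union)
	k := len(union)
	return float64(kInter) / float64(k) * float64(k-1) / union[k-1]
}

func main() {
	tau := 0.5
	lq := gkmvSketch([]float64{0.05, 0.12, 0.31, 0.77}, tau)
	lx := gkmvSketch([]float64{0.05, 0.31, 0.44, 0.91}, tau)
	d := gkmvIntersection(lq, lx)
	fmt.Printf("D∩ = %.2f, C(Q,X) = %.2f for |Q| = 4\n", d, d/4)
}

Below is an example of computing the containment similarity based on the G-KMV sketch.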
Example 4.
Fig. 3 shows the G-KMV sketch of the dataset in Example 1 under a global threshold τ. Given the signatures L_Q and L_{X_1}, the G-KMV sketch of Q ∪ X_1 is L = L_Q ∪ L_{X_1} with k = |L| = 4; the k-th (k = 4) smallest hash value is U^(k), and the intersection size of L_Q and L_{X_1} within L is K_∩ = |{v : v ∈ L_Q ∩ L_{X_1}, v ∈ L}| = 2. The intersection size of Q and X_1 is then estimated as D̂_∩ = (K_∩/k) × (k − 1)/U^(k) = (2/4) × (3/U^(k)), and the containment similarity is t̂ = D̂_∩/|Q|. X_1 is returned if the given containment similarity threshold t* is at most this value.

Correctness of the G-KMV sketch. Theorem 2 in Section IV-C3 shows the correctness of the G-KMV sketch.
Comparison with KMV. In Theorem 3 (Section IV-C4), we theoretically show that G-KMV achieves better accuracy than KMV.

Fig. 4. The GB-KMV sketch of the dataset in Example 1; each record keeps a buffer of high-frequency elements (L^H) and a G-KMV sketch (L^GKMV).
Remark 2.
Note that the global threshold technique cannot be applied to MinHash based techniques. In MinHash LSH, the k minimum hash values correspond to k different independent hash functions, while in the KMV sketch, the k-value sketch is obtained under one hash function. Thus we can only impose the global threshold on the single hash function of the KMV sketch based method.

(3) Use a Buffer for the KMV Sketch (GB-KMV)

In addition to the skewness of the record size, it is also worthwhile to exploit the skewness of the element frequency. Intuitively, more resources should be assigned to high-frequency elements because they are more likely to appear in the records. However, due to the nature of the hash function used by the KMV sketch, the hash value of an element is independent of its frequency; that is, all elements have the same opportunity to contribute to the KMV sketch.

One possible solution is to divide the elements into multiple disjoint groups according to their frequency (e.g., low-frequency and high-frequency ones) and then apply a KMV sketch to each individual group. The intersection size between two records Q and X can be computed within each group and then summed up. However, our initial experiments suggest that this leads to poor accuracy because of the summation of intersection size estimations. In Theorem 4 (Section IV-C5), our theoretical analysis shows that combining the estimated results is very likely to make the overall accuracy worse.

To avoid combining multiple estimation results, we use a bitmap buffer of size r for each record to exactly keep track of the r most frequent elements, denoted by E_H. We then apply the G-KMV technique to the remaining elements, resulting in a new augmented sketch, namely GB-KMV.
Now we can estimate |Q ∩ X| by combining the intersection of the bitmap buffers (exact solution) with that of the KMV sketches (estimated solution). As shown in Fig. 4, suppose E_H contains the two most frequent elements and a global threshold τ is used; the sketch of each record then consists of two parts, L^H and L^GKMV. That is, for each record we use a bitmap to keep the elements of E_H, and we store the remaining elements whose hash values are smaller than τ.

Example 5. Given the signatures of Q and X, the intersection of the high-frequency part is H_Q ∩ H_X = E_H, with intersection size 2. Next we consider the G-KMV part. Similar to Example 4, we compute the intersection of the L^GKMV parts: the merged sketch is L′ = L′_Q ∪ L′_X with k = |L′| = 3. According to Equation 24, the k-th (k = 3) smallest hash value is U^(k), and the size of the intersection of L′_Q and L′_X within L′ is K_∩ = 1. The intersection size of Q and X in the G-KMV part is then estimated as D̂_∩ = (K_∩/k) × (k − 1)/U^(k) = (1/3) × (2/U^(k)); together with the high-frequency part, the intersection size of Q and X is estimated as 2 + D̂_∩, and the containment similarity is t̂ = (2 + D̂_∩)/|Q|. X is returned if the given containment similarity threshold t* is at most this value.
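The combined estimate of Equation 27 (Section IV-B) reduces to a popcount on the buffers plus the G-KMV estimate. The Go fragment below illustrates this; the 64-bit buffer width and the numbers in main are assumptions for illustration only:

package main

import (
	"fmt"
	"math/bits"
)

// Buffer is a toy 64-bit bitmap over the r most frequent elements: bit i is
// set when frequent element i occurs in the record.
type Buffer uint64

// overlapEstimate combines the exact overlap on the buffers (bitwise AND,
// then popcount) with the G-KMV estimate on the remaining elements.
func overlapEstimate(hq, hx Buffer, gkmvPart float64) float64 {
	exact := bits.OnesCount64(uint64(hq & hx))
	return float64(exact) + gkmvPart
}

func main() {
	// Hypothetical values: both records contain the two buffered elements
	// (bits 0 and 1), and the G-KMV parts estimate 1.5 common infrequent
	// elements.
	est := overlapEstimate(Buffer(0b11), Buffer(0b11), 1.5)
	fmt.Printf("|Q∩X| = %.1f, C(Q,X) = %.2f for |Q| = 6\n", est, est/6)
}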
Optimal Buffer Size. The key challenge is how to set the size of the bitmap buffer for the best expected performance of the GB-KMV sketch. In Section IV-C6, we provide a theoretical analysis, which is verified in our performance evaluation.

Comparison with G-KMV. As G-KMV is the special case of GB-KMV with buffer size 0, and we carefully choose the buffer size with our cost model, the accuracy of GB-KMV is not worse than that of G-KMV.

Comparison with LSH-E. Through theoretical analysis, we show that the performance (i.e., the variance of the estimator) of GB-KMV always outperforms that of LSH-E; see Theorem 5 (Section IV-C7).

B. Implementation of GB-KMV
In this section, we introduce the technical details of our proposed GB-KMV method. We first show how to build the GB-KMV sketch on the dataset S and then present the containment similarity search algorithm.

GB-KMV Sketch Construction. For each record X ∈ S, its GB-KMV sketch consists of two components: (1) a buffer which exactly keeps the high-frequency elements, denoted by H_X; and (2) a G-KMV sketch, i.e., a KMV sketch with a global threshold value, denoted by L_X.

Algorithm 1: GB-KMV Index Construction
Input: S: dataset; b: space budget; h: a hash function
Output: L_S: the GB-KMV index of dataset S
1: compute the buffer size r based on the distribution statistics of S and the space budget b
2: E_H ← the top-r most frequent elements
3: E_K ← E \ E_H
4: τ ← the global threshold for hash values
5: for each record X ∈ S do
6:   H_X ← the elements of X in E_H
7:   L_X ← the hash values of the elements e of X in E_K with h(e) ≤ τ

Algorithm 1 illustrates the construction of the GB-KMV sketch. Let the element universe be E = {e_1, e_2, ..., e_n}, where each element is associated with its frequency in the dataset S. Line 1 calculates a buffer size r for all records based on the skewness of the record sizes and element frequencies, as well as the space budget b in terms of elements; the details are introduced in Section IV-C6. We use E_H to denote the set of the top-r most frequent elements (Line 2), which are kept in the buffer of each record, and E_K the remaining elements (Line 3). Line 4 identifies the maximal possible global threshold τ for the elements in E_K such that the total size of the GB-KMV sketch meets the space budget b: letting n_X denote the number of elements of X in E_K with hash values less than τ, we require Σ_{X∈S}(r + n_X) ≤ b. Lines 5-7 then build the buffer H_X and the G-KMV sketch L_X for every record X ∈ S. Theorem 2 in Section IV-C3 shows the correctness of our sketch.
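A compact Go rendering of Algorithm 1 is given below. The buffer size r and the global threshold τ are taken as inputs here; deriving them from the cost model (Line 1) and the space budget (Line 4) is omitted, so this is a sketch of the construction loop rather than the paper's full procedure:

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Sketch is the GB-KMV synopsis of one record.
type Sketch struct {
	H map[int]bool // buffered frequent elements (H_X)
	L []float64    // hash values <= tau of the remaining elements (L_X)
}

// hashUnit maps an element id into (0, 1] with one shared hash function.
func hashUnit(e int) float64 {
	h := fnv.New64a()
	h.Write([]byte{byte(e), byte(e >> 8), byte(e >> 16), byte(e >> 24)})
	return (float64(h.Sum64()) + 1) / float64(^uint64(0))
}

// buildIndex mirrors Lines 2-3 and 5-7 of Algorithm 1.
func buildIndex(records [][]int, r int, tau float64) []Sketch {
	freq := map[int]int{}
	for _, rec := range records {
		for _, e := range rec {
			freq[e]++
		}
	}
	elems := make([]int, 0, len(freq))
	for e := range freq {
		elems = append(elems, e)
	}
	sort.Slice(elems, func(i, j int) bool { return freq[elems[i]] > freq[elems[j]] })
	if r > len(elems) {
		r = len(elems)
	}
	eh := map[int]bool{} // E_H: the top-r most frequent elements
	for _, e := range elems[:r] {
		eh[e] = true
	}
	idx := make([]Sketch, len(records))
	for i, rec := range records {
		s := Sketch{H: map[int]bool{}}
		for _, e := range rec {
			if eh[e] {
				s.H[e] = true // Line 6: buffer exactly
			} else if v := hashUnit(e); v <= tau {
				s.L = append(s.L, v) // Line 7: G-KMV part
			}
		}
		sort.Float64s(s.L)
		idx[i] = s
	}
	return idx
}

func main() {
	recs := [][]int{{1, 2, 3, 4}, {1, 2, 5}, {2, 6, 7, 8}}
	for i, s := range buildIndex(recs, 2, 0.5) {
		fmt.Printf("X%d: |H|=%d |L|=%d\n", i+1, len(s.H), len(s.L))
	}
}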
Containment Similarity Search. Given the GB-KMV sketches of the query record Q and of the dataset S, we can conduct approximate containment similarity search as illustrated in Algorithm 2. Given a query Q with size q and a similarity threshold t*, let θ = t* · q (Lines 1-2). With the GB-KMV sketch {H_Q, L_Q}, we can calculate the containment similarity based on

  |Q ∩ X|^ = |H_Q ∩ H_X| + D̂^GKMV_∩  (27)

where D̂^GKMV_∩ is the estimate of the overlap size of Q and X in the G-KMV part, calculated by Equation 25 in Section IV-A, and |H_Q ∩ H_X| is the number of common elements of Q and X in E_H.

Algorithm 2: Containment Similarity Search
Input: Q: a query record; t*: containment similarity threshold
Output: R: the records {X} with estimated C(Q, X) ≥ t*
1: q ← |Q|; θ ← t* · q
2: for each record X ∈ S do
3:   |Q ∩ X|^ ← |H_Q ∩ H_X| + D̂^GKMV_∩
4:   if |Q ∩ X|^ ≥ θ then S_candidate ← S_candidate ∪ {X}
5: return S_candidate

Implementation of Containment Similarity Search. In our implementation, we use a bitmap of size r to keep the elements in the buffer, where each bit is reserved for one frequent element. We can thus use the bitwise AND operator to efficiently compute |H_Q ∩ H_X| in Line 3 of Algorithm 2. Recall that the G-KMV estimator of the overlap size in Equation 25 is D̂^GKMV_∩ = (K_∩/k)(k − 1)/U^(k). For the computation of |Q ∩ X|^, we transform the condition |H_Q ∩ H_X| + D̂^GKMV_∩ ≥ θ into K_∩ ≥ o_1, where o_1 = (k U^(k)/(k − 1))(θ − o_2) and o_2 = |H_Q ∩ H_X|. Since K_∩ is an overlap size, we can use PPjoin* [40] to speed up the search. Note that in order to make PPjoin*, which is designed for the similarity join problem, applicable to the similarity search problem, we partition the dataset S by record size, and in each partition we search for the records which satisfy K_∩ ≥ o_1, where the overlap bound is adjusted using the lower bound of the corresponding partition.
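The filtering condition can be inverted once per query as just described. The Go helper below performs that algebra; the function and its sample numbers are illustrative assumptions:

package main

import (
	"fmt"
	"math"
)

// collisionBound inverts the candidate test derived from Equation 27:
// o2 + (K∩/k)(k-1)/U(k) >= theta  becomes  K∩ >= k·U(k)·(theta - o2)/(k-1).
func collisionBound(theta float64, o2, k int, uk float64) int {
	need := (theta - float64(o2)) * uk * float64(k) / float64(k-1)
	if need < 0 {
		return 0
	}
	return int(math.Ceil(need))
}

func main() {
	// Hypothetical: t* = 0.7, |Q| = 100, 20 buffered collisions, k = 512,
	// U(k) = 0.5.
	fmt.Println("K∩ must be at least", collisionBound(0.7*100, 20, 512, 0.5))
}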
Remark 3. Note that the size-aware overlap set similarity join algorithm in [25] cannot be applied to our GB-KMV method, because it would have to construct the c-subset inverted lists online for each incoming query, which is very inefficient.
Processing Dynamic Data. Our algorithm can be modified to process dynamic data. In particular, when new records arrive, we compute the new global threshold τ under the fixed space budget as in Line 4 of Algorithm 1, and with the new global threshold we maintain the sketch of each record as in Lines 5-7 of Algorithm 1.

C. Theoretical Analysis
In this section, we provide theoretical underpinnings for the claims and observations in this paper.

1) Background: We need some reasonable assumptions on the record size distribution, the element frequency distribution and the query workload for a comprehensive analysis. The following three assumptions are widely used in the literature (e.g., [6], [29], [27], [18], [16], [44], [34]):

• The element frequency in the dataset follows a power-law distribution, p(x) = c_1 x^(−α_1).
• The record size in the dataset follows a power-law distribution, p(x) = c_2 x^(−α_2).
• The query Q is randomly chosen from the records.

Throughout the paper, we use the variance to evaluate the goodness of an estimator. Regarding the KMV based sketch techniques (KMV, G-KMV and GB-KMV), we have:
Lemma 2.
In KMV sketch based methods, the larger the k value used in Equation 8 and Equation 24 is, the smaller the variance will be.

The lemma is easy to verify by taking the derivative of Equation 11 with respect to the variable k. Thus, in the following analysis of KMV based sketch techniques, we use the k value (i.e., the sketch size used for estimation) to evaluate the goodness of the estimation: the larger, the better.

2) Optimal KMV Signature Scheme: In this part, we give the optimal resource allocation strategy for the KMV sketch method in similarity search.
Theorem 1.
Given a space budget b, let each set be associated with a size-k_i KMV signature with Σ_{i=1}^{m} k_i = b. For KMV sketch based containment similarity search, the optimal signature scheme is to keep the ⌊b/m⌋ minimal hash values for each set X_i.

Proof: Given a query Q and a dataset S = {X_1, ..., X_m}, an optimal signature scheme for containment similarity search minimizes the average variance between Q and the X_i, i = 1, ..., m. Considering the query Q with size-k_q KMV sketch L_Q and a set X_i with size-k_i sketch L_{X_i}, the sketch size used in the estimation is k = min{k_q, k_i} according to Equation 8. By Lemma 2, an optimal signature scheme maximizes the total k value (say T), which yields the following optimization goal:

  max T = Σ_{i=1}^{m} min{k_q, k_i}
  s.t. b = Σ_{i=1}^{m} k_i, k_i > 0, i = 1, 2, ..., m

Rank the k_i in increasing order and, w.l.o.g., let k_1, k_2, ..., k_m be the sketch size sequence after reordering. Let k_l be the last element in the sequence with k_l ≤ k_q; then T = k_1 + ... + k_l + (m − l)k_q = b − Σ_{i=l+1}^{m}(k_i − k_q). To maximize T, we set k_i = k_q for i = l + 1, ..., m. Then, by b = Σ_{i=1}^{m} k_i, we have k_1 + ... + k_l + k_q(m − l) = b. Since k_i ≤ k_q for i = 1, ..., l, we must have k_i = k_q for i = 1, ..., l. As Q is randomly selected from the dataset S, it follows that all the k_i, i = 1, ..., m, are equal and k_i = ⌊b/m⌋.
3) Correctness of the G-KMV Sketch: In this section, we show that the G-KMV sketch is a valid KMV sketch.
Theorem 2.
Given two records X and Y, let L_X and L_Y be the G-KMV sketches of X and Y, respectively. Let k = |L_X ∪ L_Y|; then the size-k KMV synopsis of X ∪ Y is L = L_X ∪ L_Y.

Proof: We show that L = L_X ∪ L_Y is a valid KMV sketch of X ∪ Y. Let k = |L_X ∪ L_Y| and let v_k be the k-th smallest hash value in L_X ∪ L_Y. To prove that L_X ∪ L_Y is valid, we show that v_k corresponds to the element with the k-th minimal hash value in X ∪ Y. If not, there would exist an element e′ such that h(e′) < v_k, e′ ∈ X ∪ Y, and h(e′) ∉ L_X ∪ L_Y. But v_k ≤ τ implies h(e′) < τ, so h(e′) is included in L_X ∪ L_Y, which contradicts the statement above.
4) G-KMV: A Better KMV Sketch: In this part, we show that by imposing a global threshold on the KMV sketch we can achieve better accuracy. Let L^KMV_X and L^KMV_Y be the KMV sketches of X and Y, respectively, with k_1 = |L^KMV_X| and k_2 = |L^KMV_Y|; the sketch size k is then set by Equation 8. Similarly, let L^GKMV_X and L^GKMV_Y be the G-KMV sketches of X and Y, respectively; the sketch size k is then set by Equation 24.

Theorem 3. With a fixed index space budget, for containment similarity search the G-KMV sketch method is better than the KMV method in terms of accuracy when the power-law exponent of the element frequency satisfies α_1 ≤ 3.4.

Proof: Let x_j = |X_j|, j = 1, 2, ..., m, be the set sizes and k_j the signature size of record X_j; let f_i be the frequency of element e_i, and b the index space budget.

For the KMV sketch based method, by Theorem 1 the optimal signature scheme is k = min(k_j, k_l) = ⌊b/m⌋ given the index space budget b, so the average k value over all pairs of sets is

  k̄_KMV = (1/m²) Σ_{j=1}^{m} Σ_{l=1}^{m} min(k_j, k_l) = ⌊b/m⌋  (28)

For the G-KMV sketch based method, let τ be the hash value threshold. The probability that the hash value h(e_i) is included in the signature L^GKMV_{X_j} is Pr[h(e_i) ∈ L^GKMV_{X_j}] = τ f_i x_j / N, where f_i is the frequency of element e_i and N = Σ_{i=1}^{n} f_i is the total number of elements. The expected size of L^GKMV_{X_j} is thus l_j = Σ_{i=1}^{n} Pr[h(e_i) ∈ L^GKMV_{X_j}] = τ x_j, so the total index space is b = Σ_{j=1}^{m} l_j = τN and the hash value threshold is τ = b/N. Next we compute the average sketch size k of the G-KMV method. The expected intersection size of L_{X_j} and L_{X_l} is

  |L_{X_j} ∩ L_{X_l}| = Σ_{i=1}^{n} (τ f_i x_j / N)(τ f_i x_l / N) = τ² x_j x_l f_n  (29)

where f_n = Σ_{i=1}^{n} f_i²/N². The k value of the G-KMV method according to Equation 24 is

  |L_{X_j} ∪ L_{X_l}| = τ x_j + τ x_l − τ² x_j x_l f_n  (30)

Then the average k value over all pairs of sets is

  k̄_GKMV = (1/m²) Σ_{j=1}^{m} Σ_{l=1}^{m} |L_{X_j} ∪ L_{X_l}| = 2b/m − (b/m)² f_n  (31)

Setting k̄_GKMV ≥ k̄_KMV yields f_n ≤ m/b. Evaluating f_n under the power-law element frequency distribution p(x) = c_1 x^(−α_1) translates this into a feasible range for α_1; for common settings of m/b, this reduces to α_1 ≤ 3.4. The result makes sense since the power-law (Zipf's law) exponent of the element frequency is usually less than 3.4 for real datasets.
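Equations 28 and 31 are easy to compare numerically, as the Go snippet below does for a few hypothetical skew levels f_n (the budget and dataset size are also hypothetical): G-KMV dominates whenever f_n ≤ m/b.

package main

import "fmt"

// With budget b over m records, plain KMV yields an average k of b/m
// (Equation 28), while the global threshold yields 2b/m - (b/m)^2·fn
// (Equation 31), where fn = Σ fi²/N² measures element-frequency skew.
func main() {
	b, m := 1_000_000.0, 10_000.0
	for _, fn := range []float64{1e-6, 5e-3, 1e-2} {
		kKMV := b / m
		kGKMV := 2*b/m - (b/m)*(b/m)*fn
		fmt.Printf("fn=%.0e: kKMV=%.0f kGKMV=%.0f\n", fn, kKMV, kGKMV)
	}
}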
5) Partitioning the KMV Sketch Is Not Promising: In this part, we show that it is difficult to improve the performance of KMV by dividing the elements into multiple groups according to their frequency and applying the KMV estimation to each group individually. W.l.o.g., we consider dividing the elements into two groups.

We divide the sorted element universe E into two disjoint parts E_{H1} and E_{H2}. Let X and Y be two sets from dataset S with KMV sketches L_X and L_Y, respectively, and let k_X = |L_X| and k_Y = |L_Y|. The estimator of the containment similarity is Ĉ = D̂_∩/q, where D̂_∩ is the estimator of the intersection size D_∩ and q is the query size (x or y).

Corresponding to E_{H1} and E_{H2}, we divide X (Y, resp.) into two parts X_1 and X_2 (Y_1 and Y_2, resp.). Note that X_1 ∩ X_2 = ∅ and Y_1 ∩ Y_2 = ∅. Also, letting D_∩ = |X ∩ Y| and D_∪ = |X ∪ Y|, we have D_∩ = |X_1 ∩ Y_1| + |X_2 ∩ Y_2| and D_∪ = |X_1 ∪ Y_1| + |X_2 ∪ Y_2|, since E_{H1} and E_{H2} are disjoint. For simplicity, let D_∩^1 = |X_1 ∩ Y_1|, D_∪^1 = |X_1 ∪ Y_1|, D_∩^2 = |X_2 ∩ Y_2| and D_∪^2 = |X_2 ∪ Y_2|. For X_1, X_2, Y_1 and Y_2, the KMV sketches are L_{X_1}, L_{X_2}, L_{Y_1} and L_{Y_2} with sizes k_{X_1}, k_{X_2}, k_{Y_1} and k_{Y_2}, respectively. Based on this, we give another estimator Ĉ′ = (D̂_∩^1 + D̂_∩^2)/q, where D̂_∩^1 (D̂_∩^2, resp.) is the estimator of the intersection size D_∩^1 (D_∩^2, resp.). Next, we compare the variances of Ĉ and Ĉ′.

Theorem 4.
After dividing the element universe into two groups and applying a KMV sketch in each group, with the same index space budget, the variance of Ĉ′ is larger than that of Ĉ.

Proof: By the properties of the KMV sketch, E(D̂_∩^1) + E(D̂_∩^2) = D_∩^1 + D_∩^2 = D_∩. Because the two element groups are disjoint, D̂_∩^1 and D̂_∩^2 are independent; thus Var[Ĉ′] = (Var[D̂_∩^1] + Var[D̂_∩^2])/q². Next, we show Var[D̂_∩^1] + Var[D̂_∩^2] ≥ q² Var[Ĉ].

Consider the KMV sketch for sets X and Y; the sketch size according to Equation 8 is k = min{k_X, k_Y}. Similarly, for X_1 and Y_1 we have the sketch size k_1 = min{k_{X_1}, k_{Y_1}}, and for X_2 and Y_2 the sketch size k_2 = min{k_{X_2}, k_{Y_2}}. Since the index space is fixed, k_X = k_{X_1} + k_{X_2} and k_Y = k_{Y_1} + k_{Y_2}; hence k_1 + k_2 = min{k_{X_1}, k_{Y_1}} + min{k_{X_2}, k_{Y_2}} ≤ min{k_X, k_Y} = k.

Let ∆ = Var[D̂_∩^1] + Var[D̂_∩^2] − q² Var[Ĉ]. After some calculation (keeping the dominant terms),

  ∆ = ((D_∩^1)²/k_1 + (D_∩^2)²/k_2 − D_∩²/k) + (D_∩^1 D_∪^1/k_1 + D_∩^2 D_∪^2/k_2 − D_∩ D_∪/k) = ∆_1 + ∆_2

We first show ∆_1 ≥ 0. Let k_1 = αk and k_2 = βk with α + β ≤ 1 and α, β > 0. Then

  (D_∩^1)²/k_1 + (D_∩^2)²/k_2 − D_∩²/k = ((D_∩^1)²/α + (D_∩^2)²/β − D_∩²)/k

and by the Cauchy-Schwarz inequality, (D_∩^1)²/α + (D_∩^2)²/β ≥ (D_∩^1 + D_∩^2)²/(α + β) ≥ D_∩²; hence ∆_1 ≥ 0.

For ∆_2, taking k = k_1 + k_2, some computation gives

  ∆_2 = (k_2 D_∩^1 − k_1 D_∩^2)(k_2 D_∪^1 − k_1 D_∪^2) / (k k_1 k_2)

Consider the two parts after dividing the element universe: if the part with the larger union size also has the larger intersection size, then ∆_2 ≥ 0. This case is realized since one of the two groups is made of the high-frequency elements, which results in both a large intersection size and a large union size under a proper choice of k, k_1, k_2.

6) Optimal Buffer Size r: In this part, we show how to find the optimal buffer size r by analysing the variance of the GB-KMV method. Given the space budget b, we first show that the variance of the GB-KMV sketch is a function f(r, α_1, α_2, b), and then give a method to appropriately choose r. Below are some notations.

Given two sets X and Y with G-KMV sketches L_X and L_Y, respectively, the containment similarity of Q in X is computed by Equation 26 as Ĉ_GKMV = D̂^GKMV_∩/q, where D̂^GKMV_∩ = (K_∩/k) × (k − 1)/U^(k) is the overlap size estimator. For the GB-KMV method on sets X and Y, with sketches H_X ∪ L_X and H_Y ∪ L_Y respectively, the containment similarity of Q in X is computed by Equation 27 as Ĉ_GBKMV = (|H_Q ∩ H_X| + D̂^GKMV_∩)/q, where |H_Q ∩ H_X| is the number of common elements in the E_H part. It is easy to verify that Ĉ_GBKMV is an unbiased estimator.
Also, the variance of the GB-KMV estimator is Var[Ĉ_GBKMV] = Var[D̂^GKMV_∩]/q², where Var[D̂^GKMV_∩] is the variance of the G-KMV part of the GB-KMV sketch. Next, with the same space budget b, we compute the average variance of the GB-KMV method.

Consider the GB-KMV index construction introduced in Section IV-B (Algorithm 1). Let N be the total number of elements and b the space budget, in terms of elements, for index construction. Assume that we keep the r most frequent elements in the bitmap buffers; they account for N_1 = Σ_{j=1}^{m} |H_{X_j}| = Σ_{i=1}^{r} f_i elements and occupy index space T_1 proportional to m·r. The total number of elements left for the G-KMV sketch is then N_2 = N − N_1, and the index space for the G-KMV sketch is T_2 = b − T_1.

Given two sets X_j and X_l, the variance of the overlap size estimator in Equation 11 is

  Var[D̂_∩] = D_∩(kD_∪ − k² − D_∪ + k + D_∩) / (k(k − 2))  (32)

where D_∪ = |X_j ∪ X_l|, D_∩ = |X_j ∩ X_l| and k is the sketch size. Since the variance depends on the union size D_∪, the intersection size D_∩ and the signature size k, we first calculate these three quantities and then compute the variance.

Consider two sets X_j and X_l from dataset S with GB-KMV sketches H_{X_j} ∪ L_{X_j} and H_{X_l} ∪ L_{X_l}, respectively. Element e_i is associated with frequency f_i, and the probability of element e_i appearing in record X_j is f_i x_j / N. Given a hash value threshold τ, the expected G-KMV signature size of set X_j is k_j = τ(x_j − |H_{X_j}|). The total index space of the G-KMV part is Σ_{j=1}^{m} k_j = T_2 = b − T_1, which determines τ = (b − T_1)/(N − N_1).

Similar to Equations 29 and 30, the sketch size k for the GB-KMV sketch is k = τ(x_j + x_l) − τ² x_j x_l (f_n − f_r), where f_n = Σ_{i=1}^{n} f_i²/N² and f_r = Σ_{i=1}^{r} f_i²/N². The intersection and union sizes of X_j and X_l restricted to the G-KMV part are D_∩ = x_j x_l (f_n − f_r) and D_∪ = (x_j + x_l)(1 − f̄_r) − x_j x_l (f_n − f_r), where f̄_r = Σ_{i=1}^{r} f_i / N. The variance of the GB-KMV method by Equation 32 then takes the form

  Var[Ĉ_GBKMV] = ((x_j + x_l) x_j x_l / (k x_j²)) F_1 + ((x_j x_l)² / (k x_j²)) F_2 + (x_j x_l / x_j²) F_3

where F_1 = f_n − f_r, F_2 = −(f_n − f_r)² and F_3 = −(f_n − f_r). The average variance of the GB-KMV method over all pairs, Var_GBKMV = (1/m²) Σ_{j=1}^{m} Σ_{l=1}^{m} Var[Ĉ_GBKMV], is

  Var_GBKMV = L_1 F_1 + L_2 F_2 + L_3 F_3

where L_1 = (1/m²) Σ_j Σ_l (x_j + x_l) x_j x_l / (k x_j²), L_2 = (1/m²) Σ_j Σ_l (x_j x_l)² / (k x_j²) and L_3 = (1/m²) Σ_j Σ_l x_j x_l / x_j².

Note that F_1, F_2 and F_3 concern the element frequency and can be computed from the distribution p(x) = c_1 x^(−α_1); L_1, L_2 and L_3 relate to the record size and can be computed from p(x) = c_2 x^(−α_2); and k is determined by the index budget b and the buffer size r. Substituting the power-law distributions yields a closed form for Var_GBKMV whose coefficients depend only on r, α_1, α_2 and b (the lengthy closed-form coefficients are given in the technical report [41]). Hence the variance Var_GBKMV can be regarded as a function f of r, α_1, α_2 and b, i.e.,

  Var_GBKMV = f(r, α_1, α_2, b)  (33)

Similarly, for the G-KMV sketch based method the variance can be calculated as

  Var[Ĉ_GKMV] = ((x_j + x_l) x_j x_l / (k x_j²)) F′_1 + ((x_j x_l)² / (k x_j²)) F′_2 + (x_j x_l / x_j²) F′_3

where F′_1 = f_n, F′_2 = −f_n², F′_3 = −f_n and k = (b/N)(x_j + x_l) − (b/N)² x_j x_l f_n. Let ∆Var = Var[Ĉ_GBKMV] − Var[Ĉ_GKMV]; then, over all pairs X_j, X_l, the average of ∆Var is V_∆ = (1/m²) Σ_j Σ_l ∆Var, which can be rewritten as V_∆ = L_1(F′_1 − F_1) + L_2(F′_2 − F_2) + L_3(F′_3 − F_3).

Eventually, in order to find the optimal r, i.e., the number of high-frequency elements buffered by the GB-KMV method, we pose the optimization goal

  min_r Var_GBKMV = f(r, α_1, α_2, b),  s.t. V_∆ < 0

To solve this optimization problem, we would like to extract the roots of the first derivative of Equation 33 (i.e., f(r, α_1, α_2, b)) with respect to r. However, the derivative is a polynomial function whose degree in r is larger than four; according to Abel's impossibility theorem [39], there is no general algebraic solution, so we resort to a numerical solution.

Recall that we use a bitmap to keep the r high-frequency elements. Given the space budget b and the element frequency and record size distributions with power-law exponents α_1 and α_2, respectively, the optimization goal can be considered as minimizing the function f(r, b, α_1, α_2) over r. Given a dataset S and the space budget b, we obtain the power-law exponents α_1, α_2, then assign r = 0, 1, 2, ... and evaluate f(r, b, α_1, α_2). In this way, we obtain a good guide for the choice of r.
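This numerical search is simple to realize. In the Go sketch below, varianceModel is an explicitly hypothetical stand-in for the closed form f(r, α_1, α_2, b) of Equation 33; only the grid-search pattern is the point:

package main

import "fmt"

// varianceModel is a placeholder for f(r, alpha1, alpha2, b); it only mimics
// the shape in which buffering helps until the budget left for the G-KMV
// part becomes too small.
func varianceModel(r int, alpha1, alpha2, b float64) float64 {
	g := b - 100*float64(r) // space left for the G-KMV part
	if g <= 0 {
		return 1e18
	}
	return 1/g + 0.001*alpha1*alpha2/float64(r+1)
}

// bestBufferSize evaluates f on a grid r = 0, 1, 2, ... and keeps the best.
func bestBufferSize(alpha1, alpha2, b float64, rMax int) int {
	best, bestVar := 0, varianceModel(0, alpha1, alpha2, b)
	for r := 1; r <= rMax; r++ {
		if v := varianceModel(r, alpha1, alpha2, b); v < bestVar {
			best, bestVar = r, v
		}
	}
	return best
}

func main() {
	fmt.Println("chosen r =", bestBufferSize(1.14, 4.95, 1e6, 1024))
}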
7) GB-KMV Provides Better Accuracy than the LSH-E Method: In Section III-B, we showed that the variance of the LSH-E estimator (Equation 21) is larger than that of the MinHash LSH estimator (Equation 19). Note that the G-KMV sketch is the special case of the GB-KMV sketch with buffer size r = 0. By choosing an optimal buffer size r as in Section IV-C6, we can guarantee that the performance of GB-KMV is not worse than that of G-KMV. Below, we show that G-KMV outperforms MinHash LSH in terms of estimation accuracy.
Theorem 5.
The variance of the G-KMV method is smaller than that of the MinHash LSH method given the same sketch size.

Proof: Suppose that the MinHash LSH method applies k′ hash functions to the dataset; then the total sketch size is T = mk′. Let τ be the global threshold of the G-KMV method; for the same total sketch size we have τ = mk′/N, where N is the total number of elements in the dataset.

We first consider the G-KMV method. Similar to Equations 29 and 30, the intersection size of X_j and X_l is D_∩ = x_j x_l f_n and the union size is D_∪ = x_j + x_l − x_j x_l f_n, where f_n = Σ_{i=1}^{n} f_i²/N². By Equation 11, the variance of the G-KMV method for estimating the containment similarity of X_j in X_l can be rewritten as

  V_G-KMV = ((x_j + x_l) x_j x_l / (k x_j²)) F_1 + ((x_j x_l)² / (k x_j²)) F_2 + (x_j x_l / x_j²) F_3  (34)

where F_1 = f_n, F_2 = −f_n² and F_3 = −f_n. The k value of the sketch corresponding to the intersection of X_j and X_l is, by Equation 24, k = τ(x_j + x_l) − τ² x_j x_l f_n. The average variance over all pairs, V_1 = (1/m²) Σ_{j=1}^{m} Σ_{l=1}^{m} V_G-KMV, then equals (1/m²)(L_1 F_1 + L_2 F_2 + L_3 F_3), with L_1, L_2, L_3 defined as in Section IV-C6; substituting the power-law distributions yields a closed form in terms of 1/k′, f_n, the largest and smallest set sizes x_t and x_0, and the number of distinct elements d.

Next we consider the MinHash LSH method. Given two sets X_j and X_l, by Equation 19 the variance of the MinHash LSH estimator for the containment similarity of X_j in X_l is V_minH = (1/k′)(a_1 f_n + a_2 f_n² + a_3 f_n³ + a_4 f_n⁴), where the coefficients a_1, ..., a_4 are polynomials in x_j and x_l. The average variance over all pairs is V_2 = (1/(k′ m²))(A_1 f_n + A_2 f_n² + A_3 f_n³ + A_4 f_n⁴), where the A_i again follow from the record size distribution (the lengthy closed-form coefficients are given in the technical report [41]).

Note that f_n is computed from the element frequency distribution p(x) = c_1 x^(−α_1) and the sums over set sizes from the record size distribution p(x) = c_2 x^(−α_2), so V_1 and V_2 depend on α_1 and α_2. Comparing the two, we obtain V_1 < V_2 for all α_1 > 0 and α_2 > 0.

Finally, we analyse the performance of the two methods when the dataset follows a uniform distribution (i.e., α_1 = 0, α_2 = 0). The average variances V′_1 (G-KMV) and V′_2 (LSH-E) are computed analogously, with the power-law sums replaced by their uniform counterparts (the resulting expressions involve ln x_t − ln x_0 terms, where x_t and x_0 are again the largest and smallest set sizes and d is the number of distinct elements). Similarly, we obtain V′_1 < V′_2.
We have shown that the variance of GB-KMV is smaller than that of LSH-E. By Chebyshev's inequality, $\Pr(|X - \mu| \geq \epsilon\sigma) \leq 1/\epsilon^2$, where $\mu$ is the expectation, $\sigma$ is the standard deviation and $\epsilon > 0$ is a constant, we can bound the probability that an estimate lies outside the interval $[\mu - \epsilon\sigma, \mu + \epsilon\sigma]$, i.e., deviates from its expectation. By Theorem 5, the standard deviation $\sigma$ of GB-KMV is smaller than that of LSH-E; hence, for an interval of the same width, the constant $\epsilon$ for GB-KMV is larger than that for LSH-E, and the probability of falling outside the interval is smaller for GB-KMV. In other words, the estimate of GB-KMV is more concentrated around its expected value than that of LSH-E.
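To spell out the concentration comparison, the short derivation below (added here for illustration; the numerical standard deviations are hypothetical) fixes an interval half-width $w$ and applies Chebyshev's inequality to both estimators:

```latex
% Fix an interval [\mu - w, \mu + w]. Writing w = \epsilon\sigma, Chebyshev's
% inequality yields a tail bound that scales with the variance:
\Pr\bigl(|X - \mu| \ge w\bigr) \;\le\; \frac{\sigma^2}{w^2}.
% With hypothetical standard deviations \sigma_{GB} = 0.05 (GB-KMV) and
% \sigma_{LSH} = 0.10 (LSH-E), and half-width w = 0.2:
\Pr\bigl(|X_{GB} - \mu| \ge 0.2\bigr) \le \frac{0.05^2}{0.2^2} = \frac{1}{16},
\qquad
\Pr\bigl(|X_{LSH} - \mu| \ge 0.2\bigr) \le \frac{0.10^2}{0.2^2} = \frac{1}{4}.
```

In this hypothetical setting, the smaller variance translates directly into a four-times-smaller tail bound for the same interval.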
V. PERFORMANCE STUDIES
In this section, we empirically evaluate the performance of our proposed GB-KMV method, using LSH Ensemble [44] as the baseline. We also compare our approximate GB-KMV method with exact containment similarity search methods. All experiments are conducted on PCs with Intel Xeon CPUs running Debian Linux, and the source code of GB-KMV is made available [1].
A. Experimental Setup

Fig. 5. Effect of Buffer Size: F-1 score and average variance versus buffer size on (a) NETFLIX and (b) ENRON.
Approximate Algorithms. In the experiments, the approximate algorithms evaluated are as follows.
• GB-KMV. Our approach proposed in Section IV-B.
• LSH-E. The state-of-the-art approximate containment similarity search method proposed in [44].
Both algorithms are implemented in the Go programming language. We obtained the source code of LSH-E from [44] and follow its parameter settings.
Exact Algorithms. To better evaluate the proposed method, we also compare our approximate method GB-KMV with the following two exact containment similarity search methods.
• PPjoin*. We extend the prefix-filtering based method from [40] to tackle the containment similarity search problem.
• FrequentSet (FreqSet). The state-of-the-art exact containment similarity search method proposed in [5].

| Dataset | Abbrev. | Type | Record | # Records | Avg. Record Size | # Distinct Elements | α (eleFreq) | α (recSize) |
|---|---|---|---|---|---|---|---|---|
| Netflix [12] | NETFLIX | Rating | Movie | 480,189 | 209.25 | 17,770 | 1.14 | 4.95 |
| Delicious [2] | DELIC | Folksonomy | User | 833,081 | 98.42 | 4,512,099 | 1.14 | 3.05 |
| CaOpenData [44] | COD | Folksonomy | User | 65,553 | 6,284 | 111,011,807 | 1.09 | 1.81 |
| Enron [3] | ENRON | Text | Email | 517,431 | 133.57 | 1,113,219 | 1.16 | 3.10 |
| Reuters [4] | REUTERS | Folksonomy | User | 833,081 | 77.6 | 283,906 | 1.32 | 6.61 |
| Webspam [38] | WEBSPAM | Text | Text | 350,000 | 3,728 | 16,609,143 | 1.33 | 9.34 |
| WDC Web Table [44] | WDC | Text | Text | 262,893,406 | 29.2 | 111,562,175 | 1.08 | 2.4 |
TABLE II. CHARACTERISTICS OF DATASETS

Fig. 6. GB-KMV, G-KMV, KMV comparison (F-score versus space used) on (a) NETFLIX, (b) DELIC, (c) COD, (d) ENRON, (e) REUTERS, (f) WEBSPAM, (g) WDC.
Remark 5.
A novel size-aware overlap set similarity join algorithm has recently been proposed in [25]. Although containment similarity search relies on set overlap, their technique cannot be trivially applied here: the size-aware algorithm must build a c-subset inverted list for the given overlap threshold c, whereas in our setting the threshold c corresponds to |Q| · t∗, where |Q| is the query size and t∗ is the similarity threshold. With a different query size |Q| we would therefore need a different (|Q| · t∗)-subset inverted list; for instance, with t∗ = 0.9, a query of size 10 requires 9-subset inverted lists while a query of size 100 requires 90-subset inverted lists. Building such lists for every possible query size is very inefficient.

Datasets. We deploy seven real-life datasets with different data properties. Records with size less than 10 are discarded, and stop words (e.g., "the") are removed. Table II shows the detailed characteristics of the datasets: the dataset type, the representation of a record, the number of records, the average record size, and the number of distinct elements. We also report the power-law exponents (skewness) of the element frequency and of the record size, quantified using the framework in [18]. The Canadian Open Data dataset is the one used to evaluate the state-of-the-art algorithm LSH-E [44].
Settings. Following the evaluation of LSH-E in [44], we use the F_α score (α = 1, 0.5) to evaluate the accuracy of containment similarity search. Given a query Q randomly selected from the dataset S and a containment similarity threshold t∗, we define T = {X : t(Q, X) ≥ t∗, X ∈ S} as the ground-truth set and A as the collection of records returned by a search algorithm. The precision and recall are Precision = |T ∩ A| / |A| and Recall = |T ∩ A| / |T|, respectively. The F_α score is defined as follows:

  $F_\alpha = \frac{(1 + \alpha^2) \cdot Precision \cdot Recall}{\alpha^2 \cdot Precision + Recall}$    (35)

We report the F_0.5 score in addition to the F_1 score because LSH-E favours recall [44]. We use the datasets from Table II and randomly choose 200 queries from the dataset. As to the default values, the similarity threshold is set to t∗ = 0.5. We measure the space used as the ratio of the space budget to the total dataset size; for our GB-KMV method it is set to 10% (see Table III). For the LSH-E method, we use the default values from [44], where the signature size of each record is 256 and the number of partitions is 32; by varying the number of hash functions, we change the space used by LSH-E.
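For concreteness, the Go snippet below (our own illustrative sketch, not code from the released implementation; the record IDs are hypothetical) computes precision, recall, and the F_α score of Equation 35 from a ground-truth set and a result set:

```go
package main

import "fmt"

// fAlpha computes Precision, Recall, and the F_alpha score (Equation 35)
// given the ground-truth record IDs T and the returned record IDs A.
func fAlpha(T, A map[int]bool, alpha float64) (precision, recall, f float64) {
	inter := 0
	for id := range A {
		if T[id] {
			inter++
		}
	}
	if len(A) > 0 {
		precision = float64(inter) / float64(len(A))
	}
	if len(T) > 0 {
		recall = float64(inter) / float64(len(T))
	}
	if denom := alpha*alpha*precision + recall; denom > 0 {
		f = (1 + alpha*alpha) * precision * recall / denom
	}
	return
}

func main() {
	T := map[int]bool{1: true, 2: true, 3: true, 4: true} // ground truth
	A := map[int]bool{2: true, 3: true, 5: true}          // returned records
	p, r, f1 := fAlpha(T, A, 1.0)
	_, _, f05 := fAlpha(T, A, 0.5)
	fmt.Printf("precision=%.2f recall=%.2f F1=%.2f F0.5=%.2f\n", p, r, f1, f05)
}
```

Note that α = 0.5 weights precision more heavily than recall, which is why F_0.5 is the natural complement to F_1 for a recall-favouring baseline.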
B. Performance Tuning
As shown in Section IV-C6, we can use the variance estimation function to identify a good buffer size r for the GB-KMV method, based on the skewness of the record size and element frequency as well as the space budget. In Fig. 5, we use NETFLIX and ENRON to evaluate this function by comparing the trend of the estimated variance with the achieved accuracy. Varying the buffer size r, Fig. 5 reports the estimated variance (right y-axis), computed with the variance function of Section IV-C6, together with the F_1 score (left y-axis) of the corresponding GB-KMV sketch with buffer size r. Fig. 5(a) shows that the buffer size minimizing the estimated variance (smaller is better) nearly coincides with the buffer size maximizing the F_1 score (larger is better); the two values become 220 and 230, respectively, in Fig. 5(b). This suggests that our variance estimation function is quite reliable for identifying a good buffer size. In the following experiments, the GB-KMV method uses the buffer size suggested by this procedure instead of manual tuning.
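A minimal sketch of this tuning loop, assuming a hypothetical estimatedVariance(r) function standing in for the closed-form variance of Section IV-C6, could look as follows:

```go
package main

import (
	"fmt"
	"math"
)

// estimatedVariance is a hypothetical stand-in for the variance function of
// Section IV-C6; the real function would also take the dataset skewness
// parameters and the space budget. Here a toy convex curve is used purely
// so the example runs.
func estimatedVariance(r int) float64 {
	return math.Pow(float64(r)-220, 2)/1e6 + 0.01
}

// pickBufferSize scans candidate buffer sizes and returns the one that
// minimizes the estimated variance (smaller variance is better).
func pickBufferSize(rMin, rMax, step int) int {
	best, bestVar := rMin, math.Inf(1)
	for r := rMin; r <= rMax; r += step {
		if v := estimatedVariance(r); v < bestVar {
			best, bestVar = r, v
		}
	}
	return best
}

func main() {
	fmt.Println("suggested buffer size:", pickBufferSize(0, 500, 10))
}
```

Because the variance function is cheap to evaluate, this scan costs far less than re-building the sketch and re-running queries for every candidate r.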
Fig. 7. Accuracy versus Space on COD (F-1 score, precision, recall, and F-0.5 score for GB-KMV and LSH-E).

Fig. 8. Accuracy versus Space on DELIC.

Fig. 9. Accuracy versus Space on ENRON.

Fig. 10. Accuracy versus Space on NETFLIX.
We also compare the performance of the KMV, G-KMV, and GB-KMV methods in Fig. 6 to evaluate the effectiveness of the global threshold and of the buffer. The results show that the new KMV estimator with a global threshold (i.e., Equation 26) significantly improves the search accuracy, and that using a buffer whose size is suggested by the system further enhances performance under the same space budget. In the following experiments, we use GB-KMV for the performance comparison with the state-of-the-art technique LSH-E.
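The following minimal Go sketch (our own illustration under simplifying assumptions, not the released implementation; the paper's estimator in Equation 26 is more refined) shows the mechanics of a KMV-style sample under a single global threshold τ, and a naive containment estimate from two such samples:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// hash01 maps an element to a pseudo-random value in [0, 1) using one
// shared hash function, as in a KMV-style sketch.
func hash01(elem string) float64 {
	h := fnv.New64a()
	h.Write([]byte(elem))
	return float64(h.Sum64()) / float64(math.MaxUint64)
}

// gkmvSketch keeps every element of the record whose hash value falls below
// the global threshold tau (an illustrative take on a global-threshold KMV
// sample; the full GB-KMV additionally keeps a buffer of frequent elements).
func gkmvSketch(record []string, tau float64) map[string]bool {
	s := make(map[string]bool)
	for _, e := range record {
		if hash01(e) < tau {
			s[e] = true
		}
	}
	return s
}

// estimateContainment estimates |Q ∩ X| / |Q| from the two samples: under a
// shared threshold each element survives with probability tau, so the sample
// intersection size over the query sample size tracks the containment.
func estimateContainment(sq, sx map[string]bool) float64 {
	if len(sq) == 0 {
		return 0
	}
	inter := 0
	for e := range sq {
		if sx[e] {
			inter++
		}
	}
	return float64(inter) / float64(len(sq))
}

func main() {
	Q := []string{"five", "guys", "burgers", "and", "fries"}
	X := []string{"five", "guys", "burgers", "and", "fries", "downtown", "brooklyn"}
	tau := 0.8 // large tau only for this tiny example; real budgets are much smaller
	fmt.Printf("estimated containment: %.2f\n",
		estimateContainment(gkmvSketch(Q, tau), gkmvSketch(X, tau)))
}
```

Because every record is filtered by the same threshold, the samples of any record pair are directly comparable, which is the property the global threshold buys over per-record KMV sketches.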
C. Space versus Accuracy
An important measurement for a sketch technique is the trade-off between space and accuracy. We evaluate the space-accuracy trade-offs of the GB-KMV and LSH-E methods in Figs. 7-13 by varying the space usage on the seven datasets NETFLIX, DELIC, COD, ENRON, REUTERS, WEBSPAM and WDC, using the F_1 score, F_0.5 score, precision and recall to measure accuracy. The space used by LSH-E is tuned by changing its number of hash functions. The results show that GB-KMV beats LSH-E in terms of the space-accuracy trade-off by a big margin under all settings. We also plot the distribution of accuracy (i.e., the minimum, maximum and average values) to compare GB-KMV with LSH-E in Fig. 14. Meanwhile, the F_1 score under varying similarity thresholds is reported in Fig. 15: across all thresholds, GB-KMV consistently outperforms LSH-E. We further evaluate the space-accuracy trade-offs on synthetic datasets with 100K records in Fig. 16, where the record size and the element frequency follow Zipf distributions. On datasets with different record-size and element-frequency skewness, GB-KMV again consistently outperforms LSH-E.
D. Time versus Accuracy
Another important measurement for a sketch technique is the trade-off between time and accuracy: ideally, the sketch should complete the search quickly with good accuracy. We tune the index size of GB-KMV to expose the trade-off; for LSH-E, we tune the number of hash functions. Time is reported as the average search time per query. In Fig. 17, we evaluate the time-accuracy trade-offs of GB-KMV and LSH-E on the datasets COD, NETFLIX, DELIC and ENRON, among others, with accuracy measured by the F_1 score. With similar accuracy (F_1 score), GB-KMV is significantly faster than LSH-E; on COD, DELIC and ENRON, GB-KMV can be 100 times faster than LSH-E at the same F_1 score. We also observe that the accuracy (F_1 score) of the LSH-E algorithm improves very slowly compared with GB-KMV. This is because LSH-E favours recall, and its precision remains quite poor even with a large number of hash functions, resulting in a poor F_1 score, which accounts for both precision and recall.
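Measuring the average per-query search time can be done with a simple harness like the one below (an illustrative sketch; the search function stands in for whichever index, GB-KMV or LSH-E, is being evaluated):

```go
package main

import (
	"fmt"
	"time"
)

// avgQueryTime runs search over all queries and returns the average
// wall-clock time per query, matching how search time is reported here.
func avgQueryTime(queries [][]int, search func([]int) []int) time.Duration {
	start := time.Now()
	for _, q := range queries {
		_ = search(q) // result set discarded; we only time the lookup
	}
	return time.Since(start) / time.Duration(len(queries))
}

func main() {
	// A trivial stand-in index: returns nothing, just to make the harness run.
	dummySearch := func(q []int) []int { return nil }
	queries := make([][]int, 200) // 200 random queries, as in the experiments
	for i := range queries {
		queries[i] = []int{i, i + 1}
	}
	fmt.Println("avg time per query:", avgQueryTime(queries, dummySearch))
}
```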
Fig. 11. Accuracy versus Space on REUTERS.

Fig. 12. Accuracy versus Space on WEBSPAM.

Fig. 13. Accuracy versus Space on WDC.
| Dataset | GB-KMV | LSH-E |
|---|---|---|
| NETFLIX | 10 | 118 |
| DELIC | 10 | 211 |
| COD | 10 | 4 |
| ENRON | 10 | 185 |
| REUTERS | 10 | 329 |
| WEBSPAM | 10 | 7 |
| WDC | 10 | 109 |

TABLE III. THE SPACE USAGE (%)

E. Sketch Construction Time
In this part, we compare the sketch construction time of GB-KMV and LSH-E on the different datasets under the default settings (Fig. 18). As expected, GB-KMV needs much less sketch construction time than LSH-E, since the GB-KMV sketch requires only one hash function, while LSH-E needs many hash functions for decent accuracy. Notably, for the internet-scale dataset WDC, the index construction time of GB-KMV is measured in minutes, while that of LSH-E is far longer. We also give the space usage of the two methods on each dataset in Table III. The space usage of GB-KMV is 10%, as mentioned in the Settings. For LSH-E, the space on some datasets exceeds 100% because many records have size less than the number of hash functions (256).
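The construction-cost gap can be seen directly in code: a KMV-style sketch hashes each element once, whereas a MinHash signature of k hash functions evaluates k hashes per element (an illustrative sketch; seeded FNV hashing stands in for the real hash families):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashWithSeed hashes an element under one member of a simple seeded hash
// family (an illustrative stand-in for the hash functions used by sketches).
func hashWithSeed(elem string, seed uint64) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d:%s", seed, elem)
	return h.Sum64()
}

// buildKMVStyle performs one hash evaluation per element: O(|record|).
func buildKMVStyle(record []string) []uint64 {
	out := make([]uint64, len(record))
	for i, e := range record {
		out[i] = hashWithSeed(e, 0)
	}
	return out
}

// buildMinHash evaluates k distinct hash functions per element and keeps the
// minimum under each: O(k * |record|), which dominates construction time.
func buildMinHash(record []string, k int) []uint64 {
	sig := make([]uint64, k)
	for i := range sig {
		sig[i] = ^uint64(0)
		for _, e := range record {
			if v := hashWithSeed(e, uint64(i)); v < sig[i] {
				sig[i] = v
			}
		}
	}
	return sig
}

func main() {
	rec := []string{"five", "guys", "burgers"}
	fmt.Println("KMV-style hash evaluations:", len(rec))
	fmt.Println("MinHash evaluations (k=256):", 256*len(rec),
		"signature length:", len(buildMinHash(rec, 256)))
	_ = buildKMVStyle(rec)
}
```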
F. Supplementary Experiment
Evaluation on Uniform Distribution. In Theorem 5, we have theoretically shown that when the dataset follows a uniform distribution (i.e., α₁ = 0 and α₂ = 0), our GB-KMV method can outperform the LSH-E method. Here we illustrate this experimentally. We generate a synthetic dataset of 100K records whose sizes are uniformly distributed and whose elements are chosen uniformly at random from a fixed universe of distinct elements. Fig. 19(a) shows the time-accuracy trade-off of GB-KMV and LSH-E on this dataset: to achieve the same accuracy (F_1 score), GB-KMV consumes much less time than LSH-E.

Comparison with Exact Algorithms.
We also compare the running time of our proposed method GB-KMV with the two exact containment similarity search methods PPjoin* [40] and FreqSet [5]. Experiments are conducted on the WEBSPAM dataset, which consists of 350,000 records with an average record size of about 3,728. We partition the data into groups based on record size, with increasing size boundaries. As expected, Fig. 19(b) shows that the running time of our approximate algorithm is not sensitive to the growth of the record size, because a fixed number of samples is used for a given budget. GB-KMV outperforms the two exact algorithms by a big margin, especially when the record size is large, while retaining decent accuracy (the F_1 score and recall are always larger than 0.8 and 0.9, respectively, under all settings).

G. Discussion Summary
In the accuracy comparison between GB-KMV and LSH-E, it is striking that the accuracy (i.e., F_1 score) is very low on some datasets. We offer the following discussion. First, we point out that in [44] the accuracy of LSH-E is evaluated on only one dataset, COD, on which both our GB-KMV method and LSH-E achieve decent accuracy. As mentioned in Section III-A, the LSH-E method first transforms containment similarity to Jaccard similarity; then, to make use of efficient index techniques, LSH-E partitions the dataset and uses an upper bound to approximate the record size in each partition, which favours recall but introduces extra false positives, as analysed in Section III-B. Moreover, LSH-E does not provide a partition scheme adapted to different data distributions, and its fixed setting (e.g., 256 hash functions and 32 partitions) does not perform well on some datasets.

VI. RELATED WORK
In this section, we review two closely related categories of work on set containment similarity search.

Fig. 14. The distribution of Accuracy on (a) NETFLIX, (b) DELIC, (c) COD, (d) ENRON, (e) REUTERS, (f) WEBSPAM, (g) WDC.
Exact Set Similarity Queries.
Exact set similarity queries have been widely studied in the literature. Existing solutions are mainly based on the filtering-verification framework and fall into two categories: prefix-filter based methods and partition-filter based methods. The prefix-filter based method was first introduced by Bayardo et al. [10]. Xiao et al. [40] further improve the prefix filter by exploring positional-filter and suffix-filter techniques. In [32], Mann et al. introduce an efficient candidate verification algorithm that significantly improves efficiency compared with the other prefix-filter algorithms. Wang et al. [36] consider the relations among records in query processing to improve performance. Deng et al. [23] present an efficient similarity search method where each object is a collection of sets. For partition-based methods, Arasu et al. [7] devise a two-level algorithm which uses partition and enumeration techniques to search for exactly similar records. Deng et al. [24] develop a partition-based method which can effectively prune the candidate size at the cost of higher filtering overhead. In [43], Zhang et al. propose an efficient framework for exact set similarity search based on tree index structures. In [25], Deng et al. present a size-aware algorithm which divides all sets into small and large ones by size and processes them separately. Regarding exact containment similarity search, Agrawal et al. [5] build inverted lists on the token sets and consider string transformations.
Approximate Set Similarity Queries.

Fig. 15. Accuracy versus Similarity threshold on (a) NETFLIX, (b) DELIC, (c) COD, (d) ENRON, (e) REUTERS, (f) WEBSPAM, (g) WDC.

Fig. 16. EleFreq z-value varying from 0.4 to 1.2 with recSize z-value 1.0; recSize z-value varying from 0.8 to 1.4 with eleFreq z-value 0.8.

The approximate set similarity queries mostly adopt Locality Sensitive
Hashing (LSH) [28] techniques. For Jaccard similarity, MinHash [14] is used for approximate similarity search. Asymmetric minwise hashing is a technique for approximate containment similarity search [35]. This method applies a vector transformation, padding values into sets so that all sets in the index have the same cardinality as the largest set. After the transformation, the near neighbours of the transformed sets with respect to Jaccard similarity are the same as the near neighbours of the original sets under containment similarity. Thus, MinHash
LSH can be used to index the transformed sets, so that sets with larger containment similarity scores are returned with higher probability. In [35], the authors show that asymmetric minwise hashing is advantageous for containment similarity search over datasets such as news articles and emails, while Zhu et al. [44] find that for datasets with a very skewed set-size distribution, asymmetric minwise hashing reduces the recall.
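As a concrete illustration of the padding transformation (our own sketch of the idea in [35]; the padding token scheme is an assumption for illustration, not the paper's exact construction):

```go
package main

import "fmt"

// padToMax illustrates the asymmetric transformation of [35]: every indexed
// set receives dummy tokens, unique to that set, until it reaches the
// cardinality of the largest set. The query itself is left unpadded (hence
// "asymmetric"); since padding tokens never intersect the query, Jaccard
// similarity on the padded sets tracks containment of the query.
func padToMax(sets [][]string, maxLen int) [][]string {
	padded := make([][]string, len(sets))
	for i, s := range sets {
		p := append([]string{}, s...)
		for j := len(s); j < maxLen; j++ {
			p = append(p, fmt.Sprintf("pad:%d:%d", i, j))
		}
		padded[i] = p
	}
	return padded
}

func main() {
	sets := [][]string{
		{"five", "kitchen", "berkeley"},
		{"five", "guys", "burgers", "and", "fries"},
	}
	for _, p := range padToMax(sets, 5) {
		fmt.Println(p)
	}
}
```

The skewness caveat of [44] is visible here: when set sizes are very skewed, small sets consist almost entirely of padding, which hurts recall.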
Fig. 17. Time versus Accuracy on (a) COD, (b) DELIC, (c) ENRON, (d) NETFLIX, (e) REUTERS, (f) WEBSPAM, (g) WDC.

Fig. 18. Sketch Construction Time (running time of GB-KMV and LSH-E on COD, DELIC, ENRON, NETFLIX, REUTERS and WEBSPAM).

The KMV sketch technique has been widely used to estimate the cardinality of record sets [42], [20], [37]. The idea of imposing a global threshold on the
KMV sketch was first proposed in [37] in the context of term-pattern size estimation; however, no theoretical analysis of the estimation performance was given. In [17], Christiani et al. give a data structure for approximate similarity search under Braun-Blanquet similarity, which has a one-to-one mapping to Jaccard similarity when all record sizes are fixed. In [19], Cohen et al. introduce a new estimator for set intersection size, but it is still based on the MinHash technique.
Fig. 19. Supplementary experiments: (a) Time versus Accuracy on the uniform synthetic dataset; (b) Running time versus record size (GB-KMV, FreqSet, PPjoin*).
In [21], Dahlgaard et al. develop a new sketch method which has the alignment property and the same concentration bounds as MinHash.

VII. CONCLUSION
In this paper, we study the problem of approximate containment similarity search. The existing solutions to this problem are based on the MinHash LSH technique. We develop an augmented KMV sketch technique, namely GB-KMV, which is data-dependent and can effectively exploit the distributions of record size and element frequency. We provide a thorough theoretical analysis to justify the design of GB-KMV, and show that the proposed method can outperform the state-of-the-art technique in terms of the space-accuracy trade-off. Extensive experiments on real-life set-valued datasets from a variety of applications demonstrate the superior performance of the
GB-KMV method compared with the state-of-the-art technique.R ∼ enron.[4] http://trec.nist.gov/data/reuters/reuters.html.[5] P. Agrawal, A. Arasu, and R. Kaushik. On indexing error-tolerant setcontainment. In SIGMOD , pages 927–938. ACM, 2010.[6] R. Albert. R. albert, h. jeong, and a.-l. barab´asi, nature (london) 401,130 (1999).
Nature (London) , 401:130, 1999.[7] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarityjoins. In
Proceedings of the VLDB Endowment , pages 918–929. VLDBEndowment, 2006.[8] J. Bauckmann, U. Leser, and F. Naumann. Efficiently computinginclusion dependencies for schema discovery. In
Data EngineeringWorkshops, 2006. Proceedings. 22nd International Conference on ,pages 2–2. IEEE, 2006.[9] M. Bawa, T. Condie, and P. Ganesan. Lsh forest: self-tuning indexesfor similarity search. In
WWW , pages 651–660. ACM, 2005.[10] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similaritysearch. In
WWW , pages 131–140. ACM, 2007.[11] K. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. Onsynopses for distinct-value estimation under multiset operations. In
SIGMOD , pages 199–210, 2007.[12] P. Bouros, N. Mamoulis, S. Ge, and M. Terrovitis. Set containment joinrevisited.
Knowledge and Information Systems , 49(1):375–402, 2016.[13] A. Z. Broder. On the resemblance and containment of documents. In
Compression and Complexity of Sequences 1997. Proceedings , pages21–29. IEEE, 1997.[14] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In
STOC , pages 327–336. ACM, 1998.[15] M. S. Charikar. Similarity estimation techniques from rounding algo-rithms. In
STOC , 2002.[16] E. Cho, S. A. Myers, and J. Leskovec. Friendship and mobility: usermovement in location-based social networks. In
SIGKDD , pages 1082–1090. ACM, 2011.[17] T. Christiani and R. Pagh. Set similarity search beyond minhash. arXivpreprint arXiv:1612.07710 , 2016.[18] A. Clauset, C. R. Shalizi, and M. E. Newman. Power-law distributionsin empirical data.
SIAM review , 51(4):661–703, 2009.[19] R. Cohen, L. Katzir, and A. Yehezkel. A minimal variance estimatorfor the cardinality of big data set intersection. In
SIGKDD , 2017.[20] G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopsesfor massive data: Samples, histograms, wavelets, sketches.
Foundationsand Trends in Databases , 4(1–3):1–294, 2012.[21] S. Dahlgaard, M. B. T. Knudsen, and M. Thorup. Fast similaritysketching. arXiv preprint arXiv:1704.04370 , 2017.[22] F. De Marchi and J.-M. Petit. Zigzag: a new algorithm for mining largeinclusion dependencies in databases. In
Data Mining, 2003. ICDM2003. Third IEEE International Conference on , pages 27–34. IEEE,2003.23] D. Deng, A. Kim, S. Madden, and M. Stonebraker. Silkmoth: Anefficient method for finding related sets with maximum matchingconstraints. arXiv preprint arXiv:1704.04738 , 2017.[24] D. Deng, G. Li, H. Wen, and J. Feng. An efficient partition basedmethod for exact set similarity joins.
Proceedings of the VLDBEndowment , 9(4):360–371, 2015.[25] D. Deng, Y. Tao, and G. Li. Overlap set similarity joins with theoreticalguarantees. 2018.[26] A. Esmaili. Probability models in engineering and science, 2006.[27] M. L. Goldstein, S. A. Morris, and G. G. Yen. Problems with fittingto the power-law distribution.
The European Physical Journal B-Condensed Matter and Complex Systems , 41(2):255–258, 2004.[28] P. Indyk and R. Motwani. Approximate nearest neighbors: towardsremoving the curse of dimensionality. In
STOC , 1998.[29] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barab´asi. Thelarge-scale organization of metabolic networks.
Nature , 407(6804):651,2000.[30] S. Kruse, T. Papenbrock, C. Dullweber, M. Finke, M. Hegner, M. Zabel,C. Zo?llner, and F. Naumann. Fast approximate discovery of inclusiondependencies.
Datenbanksysteme f¨ur Business, Technologie und Web(BTW 2017) , 2017.[31] S. Lopes, J.-M. Petit, and F. Toumani. Discovering interesting inclusiondependencies: application to logical database tuning.
InformationSystems , 27(1):1–19, 2002.[32] W. Mann, N. Augsten, and P. Bouros. An empirical evaluation ofset similarity join techniques.
Proceedings of the VLDB Endowment ,9(9):636–647, 2016.[33] T. Papenbrock, S. Kruse, J.-A. Quian´e-Ruiz, and F. Naumann. Divide& conquer-based inclusion dependency discovery.
Proceedings of theVLDB Endowment , 8(7):774–785, 2015.[34] A. Shrivastava and P. Li. Asymmetric lsh (alsh) for sublinear timemaximum inner product search (mips). In
NIPS , pages 2321–2329,2014.[35] A. Shrivastava and P. Li. Asymmetric minwise hashing for indexingbinary inner products and set containment. In
WWW , 2015.[36] X. Wang, L. Qin, X. Lin, Y. Zhang, and L. Chang. Leveragingset relations in exact set similarity join.
Proceedings of the VLDBEndowment , 10(9):925–936, 2017.[37] X. Wang, Y. Zhang, W. Zhang, X. Lin, and W. Wang. Selectivityestimation on streaming spatio-textual data using local correlations.
Proceedings of the VLDB Endowment , 8(2):101–112, 2014.[38] S. Webb, J. Caverlee, and C. Pu. Introducing the webb spam corpus:Using email spam to identify web spam automatically. In
CEAS , 2006.[39] E. W. Weisstein. Abels impossibility theorem.
FromMathWorld–A Wolfram Web Resource. http://mathworld. wolfram.com/AbelsImpossibilityTheorem. html .[40] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarityjoins for near-duplicate detection.
TODS ∼ yingz/GBKMV.pdf, 2018.[42] M. Zhang, M. Hadjieleftheriou, B. C. Ooi, C. M. Procopiuc, andD. Srivastava. On multi-column foreign key discovery. Proceedingsof the VLDB Endowment , 3(1-2):805–814, 2010.[43] Y. Zhang, X. Li, J. Wang, Y. Zhang, C. Xing, and X. Yuan. An efficientframework for exact set similarity search using tree structure indexes.In
ICDE , pages 759–770. IEEE, 2017.[44] E. Zhu, F. Nargesian, K. Q. Pu, and R. J. Miller. Lsh ensemble:internet-scale domain search.