A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets
Pinghui Wang, Yiyan Qi, Yuanming Zhang, Qiaozhu Zhai, Chenxu Wang, John C.S. Lui, Xiaohong Guan
NSKEYLAB, Xi'an Jiaotong University, Xi'an, China
Shenzhen Research Institute of Xi'an Jiaotong University, Shenzhen, China
The Chinese University of Hong Kong, Hong Kong
Department of Automation and NLIST Lab, Tsinghua University, Beijing, China
{phwang,qzzhai,cxwang,xhguan}@mail.xjtu.edu.cn, {qiyiyan,zhangyuanming}@stu.xjtu.edu.cn, [email protected]
ABSTRACT
Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating the Jaccard similarity of sets and has been successfully used for many applications such as similarity search and large-scale learning. Its two compressed versions, b-bit MinHash and Odd Sketch, can significantly reduce the memory usage of the original MinHash method, especially for estimating high similarities (i.e., similarities around 1). Although MinHash can be applied to static sets as well as streaming sets, whose elements arrive in a streaming fashion and whose cardinality is unknown or even infinite, unfortunately, b-bit MinHash and Odd Sketch fail to deal with streaming data. To solve this problem, we design a memory-efficient sketch method, MaxLogHash, to accurately estimate Jaccard similarities in streaming sets. Compared to MinHash, our method uses smaller registers (each register consists of less than 7 bits) to build a compact sketch for each set. We also provide a simple yet accurate estimator for inferring Jaccard similarity from MaxLogHash sketches. In addition, we derive formulas for bounding the estimation error and determine the smallest necessary memory usage (i.e., the number of registers used for a MaxLogHash sketch) for a desired accuracy. We conduct experiments on a variety of datasets, and the experimental results show that our method MaxLogHash is about 5 times more memory efficient than MinHash with the same accuracy and computational cost for estimating high similarities.

⋆ Pinghui Wang and Yiyan Qi contributed equally to this work.
∗ Corresponding author.
KDD '19, August 4–8, 2019, Anchorage, AK, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6201-6/19/08...$15.00
https://doi.org/10.1145/3292500.3330825
CCS CONCEPTS
• Mathematics of computing → Probabilistic algorithms; • Information systems → Similarity measures; • Theory of computation → Sketching and sampling.

KEYWORDS
Streaming algorithms; Sketch; Jaccard coefficient similarity
ACM Reference Format:
Pinghui Wang, Yiyan Qi, Yuanming Zhang, Qiaozhu Zhai, Chenxu Wang, John C.S. Lui, and Xiaohong Guan. 2019. A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets. In The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '19), August 4–8, 2019, Anchorage, AK, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3292500.3330825
1 INTRODUCTION
Data streams are ubiquitous in nature. Examples range from financial transactions to Internet of things (IoT) data, network traffic, call logs, and trajectory logs. Because these applications involve massive volumes of data, it is prohibitive to collect entire data streams, especially when computational and storage resources are limited [1]. Therefore, it is important to develop memory-efficient methods such as sampling and sketching techniques for mining large streaming data.
Many datasets can be viewed as collections of sets, and computing set similarities is fundamental for a variety of applications in areas such as databases, machine learning, and information retrieval. For example, one can view each mobile device's trajectory as a set, where each element corresponds to a tuple of a time t and the physical location of the device at time t. Mining devices with similar trajectories is then useful for identifying friends or devices belonging to the same person. Other examples are datasets encountered in computer networks, mobile phone networks, and online social networks (OSNs), where learning user similarities over the sets of users' visited websites, connected phone numbers, and friends on OSNs is fundamental for applications such as link prediction and friendship recommendation.
One of the most popular set similarity measures is the Jaccard similarity coefficient, which is defined as |A ∩ B| / |A ∪ B| for two sets A and B. To handle large sets, MinHash (or minwise hashing) [2] is a powerful set similarity estimation technique, which uses an array of k registers to build a sketch for each set. Its accuracy depends only on the value of k and on the Jaccard similarity of the two sets of interest, and it is independent of the sizes of the two sets.
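In code, the Jaccard similarity coefficient just defined is simply:

```python
def jaccard(a, b):
    # Jaccard similarity coefficient: |A ∩ B| / |A ∪ B|.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

For example, jaccard({1, 2, 3}, {2, 3, 4}) returns 2/4 = 0.5.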
MinHash has been successfully used for a variety of applications, such as similarity search [3], compressing social networks [4], advertising diversification [5], large-scale learning [6], and web spam detection [7]. Many of these applications focus on estimating similarity values close to 1. Take similar-document search in a sufficiently large corpus as an example. There may be thousands of documents similar to the query document; therefore, the goal is not just to find similar documents, but also to provide a short list (e.g., top-10) and a ranking of the most similar documents. For such applications, we need methods that are very accurate and memory-efficient for estimating high similarities. To achieve this goal, two compressed MinHash methods, b-bit MinHash [8] and Odd Sketch [9], were proposed in the past few years; they further reduce the memory usage of the original MinHash by dozens of times while providing comparable estimation accuracy, especially for large similarity values. However, we observe that these two methods fail to handle data streams (details are given in Section 3).
To address this challenge, Yu and Weber [10] recently developed a method, HyperMinHash. HyperMinHash consists of k registers, where each register has two parts, an FM (Flajolet-Martin) sketch [11] and a b-bit string. The b-bit string is computed from the fingerprints (i.e., hash values) of the set elements that are mapped to the register. Based on the HyperMinHash sketches of two sets A and B, HyperMinHash first estimates |A ∪ B| and then infers the Jaccard similarity of A and B from the number of collisions of b-bit strings given |A ∪ B|. Later in our experiments, we demonstrate that HyperMinHash not only exhibits a large bias for high similarities, but is also computationally expensive for estimating similarities, which results in a large estimation error and a long delay in querying highly similar sets.
More importantly, it is difficult to analytically derive the estimation bias and variance of HyperMinHash, which are of great value in practice: the bias and variance can be used to bound an estimate's error and to determine the smallest necessary sampling budget (i.e., k) for a desired accuracy. In this paper, we develop a novel memory-efficient method, MaxLogHash, to estimate Jaccard similarities in streaming sets. Similar to MinHash, MaxLogHash uses a list of k registers to build a compact sketch for each set. Unlike MinHash, which uses a 64-bit (resp. 32-bit) register to store the minimum hash value of 64-bit (resp. 32-bit) set elements, MaxLogHash uses only a 7-bit (resp. 6-bit) register to approximately record the logarithm of the minimum hash value, which results in a 9-fold (resp. 5-fold) reduction in memory usage. Another attractive property is that a MaxLogHash sketch can be computed incrementally; therefore, MaxLogHash is able to handle streaming sets. Given the MaxLogHash sketches of any two sets, we provide a simple yet accurate estimator for their Jaccard similarity, and we derive exact formulas for bounding the estimation error. We conduct experiments on a variety of publicly available datasets, and the experimental results show that MaxLogHash reduces the memory required by MinHash 5-fold while achieving the same accuracy and computational cost.
The rest of this paper is organized as follows. The problem formulation is presented in Section 2. Section 3 introduces preliminaries used in this paper. Section 4 presents our method MaxLogHash. The performance evaluation and testing results are presented in Section 5. Section 6 summarizes related work. Concluding remarks then follow.

2 PROBLEM FORMULATION
For ease of reading and comprehension, we say that each set belongs to a user, and that the elements in the set are items (e.g., products) that the user connects to. Let U denote the set of users and I denote the set of all items.
Let Π = e(1) e(2) · · · e(t) · · · denote the user-item stream of interest, where e(t) = (u(t), i(t)) is the element of Π occurring at discrete time t, and u(t) ∈ U and i(t) ∈ I are the element's user and item, representing a connection from user u(t) to item i(t). We assume that Π contains no duplicate user-item pairs, that is, e(i) ≠ e(j) when i ≠ j. Let I_u^(t) ⊂ I be the item set of user u ∈ U, which consists of the items that user u connects to before and including time t. Let ∪^(t)(u1, u2) denote the union of the two sets I_u1^(t) and I_u2^(t), that is, ∪^(t)(u1, u2) = I_u1^(t) ∪ I_u2^(t). Similarly, we define the intersection of I_u1^(t) and I_u2^(t) as ∩^(t)(u1, u2) = I_u1^(t) ∩ I_u2^(t). Then, the Jaccard similarity of the sets I_u1^(t) and I_u2^(t) is defined as

J_{u1,u2}^(t) = |∩^(t)(u1, u2)| / |∪^(t)(u1, u2)|,

which reflects the similarity between users u1 and u2. In this paper, we aim to develop a fast and accurate method to estimate J_{u1,u2}^(t) for any two users u1 and u2 over time, and to detect pairs of highly similar users. When no confusion arises, we omit the superscript (t) to ease exposition.

3 PRELIMINARIES
In this section, we first introduce MinHash [2]. Then, we elaborate on two state-of-the-art memory-efficient methods, b-bit MinHash [8] and Odd Sketch [9], which decrease the memory usage of the original MinHash method. Finally, we demonstrate that both b-bit MinHash and Odd Sketch fail to handle streaming sets.
Given a random permutation (or hash function) π from elements in I to elements in I, i.e., a function that maps the integers in I to distinct integers in I at random, Broder et al. [2] observed that the Jaccard similarity of two sets A, B ⊆ I equals

J_{A,B} = |∩(A, B)| / |∪(A, B)| = P(min(π(A)) = min(π(B))),

where π(A) = {π(w) : w ∈ A}. Therefore, MinHash uses a sequence of k independent permutations π1, . . .
, πk and estimates J_{A,B} as

Ĵ_{A,B} = (1/k) Σ_{i=1}^{k} 1(min(πi(A)) = min(πi(B))),

where 1(P) is an indicator function that equals 1 when the predicate P is true and 0 otherwise (MinHash assumes no hash collisions). Note that Ĵ_{A,B} is an unbiased estimator of J_{A,B}, i.e., E(Ĵ_{A,B}) = J_{A,B}, and its variance is

Var(Ĵ_{A,B}) = J_{A,B} (1 − J_{A,B}) / k.

Therefore, instead of storing a set A in memory, one can compute and store its MinHash sketch S_A, i.e.,

S_A = (min(π1(A)), . . . , min(πk(A))),

which reduces the memory usage when |A| > k (a 32- or 64-bit register is used to store each min(πi(A)), i = 1, . . . , k). The Jaccard similarity of any two sets can then be accurately and efficiently estimated from their MinHash sketches.
Li and König [8] proposed b-bit MinHash to further reduce the memory usage. b-bit MinHash reduces the memory required to store a MinHash sketch S_A from 32k or 64k bits to bk bits. The idea behind b-bit MinHash is that equal hash values have the same lowest b bits, while two different hash values have the same lowest b bits only with a small probability 1/2^b. Formally, let min^(b)(π(A)) denote the lowest b bits of min(π(A)) for a permutation π. Define the b-bit MinHash sketch of a set A as

S_A^(b) = (min^(b)(π1(A)), . . . , min^(b)(πk(A))).

To mine set similarities, Li and König [8] first compute S_A for each set A and then store only its b-bit MinHash sketch S_A^(b). The Jaccard similarity J_{A,B} is estimated as

Ĵ_{A,B}^(b) = (Σ_{i=1}^{k} 1(min^(b)(πi(A)) = min^(b)(πi(B))) − k/2^b) / (k (1 − 1/2^b)).

Ĵ_{A,B}^(b) is also an unbiased estimator of J_{A,B}, and its variance is

Var(Ĵ_{A,B}^(b)) = ((1 − J_{A,B}) / k) (J_{A,B} + 1/(2^b − 1)).

Mitzenmacher et al. [9] developed Odd Sketch, which is more memory efficient than b-bit MinHash when mining sets of high similarity. Odd Sketch uses a hash function h that maps each tuple (i, min(πi(A))), i = 1, . . . , k, to an integer in {1, . . . , z} at random. For a set A, its odd sketch S_A^(Odd) consists of z bits. Function h maps the tuples (1, min(π1(A))), . . . , (k, min(πk(A))) into the z bits of S_A^(Odd) at random, and S_A^(Odd)[j], 1 ≤ j ≤ z, is the parity of the number of tuples mapped to the j-th bit of S_A^(Odd). Formally, S_A^(Odd)[j] is computed as

S_A^(Odd)[j] = ⊕_{i=1,...,k} 1(h(i, min(πi(A))) = j),  1 ≤ j ≤ z.

The Jaccard similarity J_{A,B} is then estimated as

Ĵ_{A,B}^(Odd) = 1 + (z/(4k)) ln(1 − 2 Σ_{j=1}^{z} (S_A^(Odd)[j] ⊕ S_B^(Odd)[j]) / z).

Mitzenmacher et al. demonstrate that Ĵ_{A,B}^(Odd) is more accurate than Ĵ_{A,B}^(b) under the same memory usage (refer to [9] for the error analysis of Ĵ_{A,B}^(Odd)). MinHash can be directly applied to stream data (note that duplicate user-item pairs can be easily checked and filtered using fast and memory-efficient techniques such as Bloom filters [12]).
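To ground the MinHash and b-bit MinHash estimators above, here is a small illustrative sketch (the SHA-1-based hash construction is ours, standing in for random permutations):

```python
import hashlib

def _h(i, x):
    # Deterministic stand-in for the i-th random hash/permutation.
    d = hashlib.sha1(f"{i}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big")

def minhash_sketch(s, k):
    # S_A = (min(pi_1(A)), ..., min(pi_k(A))).
    return [min(_h(i, x) for x in s) for i in range(k)]

def estimate_jaccard(sa, sb):
    # Plain MinHash estimator: fraction of registers that collide.
    return sum(a == b for a, b in zip(sa, sb)) / len(sa)

def estimate_jaccard_bbit(sa, sb, b):
    # b-bit MinHash: compare only the lowest b bits of each register,
    # then correct for the 1/2^b probability of accidental collisions.
    k, mask = len(sa), (1 << b) - 1
    c = sum((x & mask) == (y & mask) for x, y in zip(sa, sb))
    return (c - k / 2**b) / (k * (1 - 1 / 2**b))
```

Both estimators agree with the true Jaccard similarity up to sampling noise on the order of 1/sqrt(k).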
We can easily see that a MinHash sketch can be computed incrementally. That is, one can compute the MinHash sketch of the set A ∪ {v} from the MinHash sketch of A as

min(π(A ∪ {v})) = min(min(π(A)), π(v)).

The variants b-bit MinHash and Odd Sketch, however, cannot be used to handle streaming sets. Let π^(b)(v) denote the lowest b bits of π(v). Then, one can easily show that

min^(b)(π(A ∪ {v})) ≠ min(min^(b)(π(A)), π^(b)(v)).

That is, computing min^(b)(π(A ∪ {v})) requires the hash value π(w) of every w ∈ A ∪ {v}. In addition, we observe that min^(b)(π(A)) cannot be approximated by min_{w∈A} π^(b)(w), which could be computed incrementally, because min_{w∈A} π^(b)(w) equals 0 with high probability when |A| ≫ 2^b. Similarly, we cannot compute the odd sketch of a set incrementally. Therefore, both b-bit MinHash and Odd Sketch fail to deal with streaming sets.

4 OUR METHOD: MAXLOGHASH
Let h be a function that maps any element v in I to a random number in the range (0, 1), i.e., h(v) ~ Uniform(0, 1). Define the log-rank of v with respect to hash function h as r(v) ← ⌊−log2 h(v)⌋. We compute and store

MaxLog(h(A)) = max_{v∈A} r(v) = max_{v∈A} ⌊−log2 h(v)⌋.

We now develop a simple yet accurate method to estimate the Jaccard similarity of streaming sets based on the following properties of the function MaxLog(h(A)).

Observation 1.
MaxLog(h(A)) can be represented by an integer of no more than ⌈log2 log2 |I|⌉ bits with high probability. For each v ∈ I, we have h(v) ~ Uniform(0, 1), and thus r(v) ~ Geometric(1/2), supported on the set {0, 1, 2, . . .}; that is,

P(r(v) = j) = 1/2^{j+1},  P(r(v) < j) = 1 − 1/2^j,  j ∈ {0, 1, 2, . . .}.

Then, one easily finds that

P(MaxLog(h(A)) ≤ 2^{⌈log2 log2 |I|⌉} − 1) = (1 − 2^{−2^{⌈log2 log2 |I|⌉}})^{|A|}.

For example, when A ⊆ {1, . . . , 2^64} and |A| ≤ 2^54, we only require 6 bits to store MaxLog(h(A)) with probability at least 0.999.

Observation 2.
MaxLog(h(A)) can be computed incrementally. This is because

MaxLog(h(A ∪ {v})) = max(MaxLog(h(A)), ⌊−log2 h(v)⌋).

Observation 3. J_{A,B} can be easily estimated from MaxLog(h(A)) and MaxLog(h(B)) with a little additional information. We find that

γ = P(MaxLog(h(A)) ≠ MaxLog(h(B)))
  = Σ_{j=0}^{+∞} (|A \ B| / 2^{j+1}) (1 − 1/2^{j+1})^{|A\B|−1} (1 − 1/2^j)^{|B|}
  + Σ_{j=0}^{+∞} (|B \ A| / 2^{j+1}) (1 − 1/2^{j+1})^{|B\A|−1} (1 − 1/2^j)^{|A|}.

Due to limited space, we omit the details of how γ is derived. Similar to MinHash, we have P(max(h(A)) ≠ max(h(B))) = 1 − J_{A,B}; therefore, γ < 1 − J_{A,B}. Although γ can be estimated, as in MinHash, using k hash functions h1, . . . , hk via

γ̂ = (1/k) Σ_{i=1}^{k} 1(MaxLog(hi(A)) ≠ MaxLog(hi(B))),  E(γ̂) = γ,

it is unfortunately difficult to compute J_{A,B} from γ. To solve this problem, we observe that

P(MaxLog(h(A)) ≠ MaxLog(h(B)) ∧ δ_{A,B} = 1) ≈ 0.72 (1 − J_{A,B}),

where δ_{A,B} = 1 indicates that there exists one and only one element of A ∪ B whose log-rank equals MaxLog(h(A ∪ B)). Based on the above three observations, we propose to incrementally and accurately estimate the value of P(MaxLog(h(A)) ≠ MaxLog(h(B)) ∧ δ_{A,B} = 1) using k hash functions h1, . . . , hk, and then to easily infer the value of J_{A,B}.
The MaxLogHash sketch of a user u, i.e., S_u, consists of k bit-strings, where each bit-string S_u[i], 1 ≤ i ≤ k, has two components, s_u[i] and m_u[i], i.e., S_u[i] = s_u[i] ∥ m_u[i].
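Observations 1 and 2 are easy to check numerically; a minimal sketch (helper names ours):

```python
import math
import random

def log_rank(hv):
    # Log-rank r(v) = floor(-log2 h(v)) for a hash value h(v) in (0, 1);
    # it is geometrically distributed: P(r = j) = 1 / 2^(j+1).
    return math.floor(-math.log2(hv))

def max_log(hash_values):
    # MaxLog(h(A)): the maximum log-rank over a set's hash values.
    return max(log_rank(hv) for hv in hash_values)

def max_log_incremental(current, new_hash_value):
    # Observation 2: MaxLog(h(A ∪ {v})) = max(MaxLog(h(A)), r(v)).
    return max(current, log_rank(new_hash_value))
```

Feeding a stream of hash values through max_log_incremental yields exactly the same value as computing max_log over the whole set at once.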
At any time t, m_u[i] records the maximum log-rank of the items in I_u^(t) with respect to the hash function r_i(·) = ⌊−log2 h_i(·)⌋, i.e., m_u[i] = max_{w∈I_u^(t)} r_i(w), where I_u^(t) is the set of items that user u connected to before and including time t; s_u[i] consists of 1 bit, and its value indicates whether there exists one and only one item w ∈ I_u such that r_i(w) = m_u[i]. As mentioned above, ⌈log2 log2 |I|⌉ bits suffice to record the value of m_u[i] with high probability (very close to 1). When m_u[i] ≥ 2^{⌈log2 log2 |I|⌉}, we use a hash table to record the tuples (u, i, m_u[i]) for all users.
For each user u ∈ U, when u first connects to an item w in the stream Π, we initialize the MaxLogHash sketch of user u as S_u[i] = 1 ∥ r_i(w), i = 1, . . . , k, where r_i(w) = ⌊−log2 h_i(w)⌋; that is, we set the indicator s_u[i] = 1 and m_u[i] = r_i(w). For any other item v that user u connects to after the first item w (i.e., a user-item pair (u, v) occurring in Π after the pair (u, w)), we update the sketch as follows. We first compute the log-rank of item v, i.e., r_i(v) = ⌊−log2 h_i(v)⌋, i = 1, . . . , k. When r_i(v) is smaller than m_u[i], no further operation is required for the pair (u, v). When r_i(v) = m_u[i], at least two items in I_u have log-rank value m_u[i]; therefore, we simply set s_u[i] =
0. When r_i(v) > m_u[i], we set S_u[i] = 1 ∥ r_i(v).
Define the variables

χ_{u1,u2}[i] = 1(m_{u1}[i] ≠ m_{u2}[i]),  i = 1, . . . , k,

ψ_{u1,u2}[i] = s_{u1}[i] if m_{u1}[i] > m_{u2}[i];  s_{u2}[i] if m_{u1}[i] < m_{u2}[i];  −1 if m_{u1}[i] = m_{u2}[i].

Let δ_{u1,u2}[i] = 1(χ_{u1,u2}[i] = 1) · 1(ψ_{u1,u2}[i] = 1). Note that δ_{u1,u2}[i] = 1 indicates that there is one and only one item in ∪(u1, u2) whose log-rank equals max_{w∈∪(u1,u2)} r_i(w) with respect to the function r_i. Then, we have the following theorem.

Theorem 1. For non-empty sets I_u1 and I_u2, we have P(δ_{u1,u2}[i] = 1) = 0, i = 1, . . . , k, when |∪(u1, u2)| = 1. Otherwise, we have

P(δ_{u1,u2}[i] = 1) = α_{|∪(u1,u2)|} (1 − J_{u1,u2}),  i = 1, . . . , k,

where α_n = n Σ_{j=0}^{+∞} (1/2^{j+1}) (1 − 1/2^j)^{n−1}, n ≥ 2.

Proof.
Let r* be the maximum log-rank over all items in ∪(u1, u2). When two items w and v in I_u1 or in I_u2 have log-rank value r*, we easily find that ψ_{u1,u2}[i] = 0. When only one item w in I_u1 and only one item v in I_u2 have log-rank value r*, we easily find that χ_{u1,u2}[i] = 0. Let Δ(u1, u2) = (I_u1 \ I_u2) ∪ (I_u2 \ I_u1) = ∪(u1, u2) \ ∩(u1, u2). Then, the event χ_{u1,u2}[i] = 1 ∧ ψ_{u1,u2}[i] = 1 (i.e., δ_{u1,u2}[i] = 1) occurs only when one item w in Δ(u1, u2) has a log-rank larger than those of all items in ∪(u1, u2) \ {w}. For any item v ∈ I, we have h_i(v) ~ Uniform(0, 1) and so r_i(v) ~ Geometric(1/2), supported on {0, 1, 2, . . .}. Based on the above observations, when |∪(u1, u2)| ≥ 2, we have

P(δ_{u1,u2}[i] = 1 ∧ r* = j) = Σ_{w∈Δ(u1,u2)} P(r_i(w) = j) Π_{v∈∪(u1,u2)\{w}} P(r_i(v) < j)
= (|Δ(u1, u2)| / 2^{j+1}) (1 − 1/2^j)^{|∪(u1,u2)|−1}.

Therefore, we have

P(δ_{u1,u2}[i] = 1) = Σ_{j=0}^{+∞} P(δ_{u1,u2}[i] = 1 ∧ r* = j)
= Σ_{j=0}^{+∞} (|Δ(u1, u2)| / 2^{j+1}) (1 − 1/2^j)^{|∪(u1,u2)|−1}
= Σ_{j=0}^{+∞} (|Δ(u1, u2)| / |∪(u1, u2)|) · (|∪(u1, u2)| / 2^{j+1}) (1 − 1/2^j)^{|∪(u1,u2)|−1}
= α_{|∪(u1,u2)|} (1 − J_{u1,u2}),

where the last equation holds because |Δ(u1, u2)| = |∪(u1, u2)| − |∩(u1, u2)|. □

Figure 1: Value of α_n, n = 2, 3, . . . , where |α_2 − 0.72| ≈ 0.05 and |α_n − 0.72| ≤ 0.01 when n ≥ 3.

Define the variable k̂ = Σ_{i=1}^{k} 1(δ_{u1,u2}[i] = 1). From Theorem 1, the expectation of k̂ is

E(k̂) = E(Σ_{i=1}^{k} 1(δ_{u1,u2}[i] = 1)) = Σ_{i=1}^{k} E(1(δ_{u1,u2}[i] = 1)) = k α_{|∪(u1,u2)|} (1 − J_{u1,u2}).   (1)

Therefore, we have

J_{u1,u2} = 1 − E(k̂) / (k α_{|∪(u1,u2)|}).

Note that the cardinality of the set ∪(u1, u2) (i.e., |∪(u1, u2)|) is unknown. To solve this challenge, we find that

α_n = n Σ_{j=0}^{+∞} (1/2^{j+1}) (1 − 1/2^j)^{n−1}
= n Σ_{j=0}^{+∞} (1/2^{j+1}) Σ_{l=0}^{n−1} C(n−1, l) (−1/2^j)^{n−l−1}
= n Σ_{l=0}^{n−1} (−1)^{n−l−1} C(n−1, l) Σ_{j=0}^{+∞} (1/2^{j+1}) 2^{−j(n−l−1)}
= n Σ_{l=0}^{n−1} (−1)^{n−l−1} C(n−1, l) · 2^{n−l−1} / (2^{n−l} − 1).

Figure 1 plots the values of α_n, n = 2, 3, . . . . We easily find that α_n ≈ α = 1/(2 ln 2) ≈ 0.72 for n ≥
2. Therefore, we estimate J_{u1,u2} as

Ĵ_{u1,u2} = 1 − k̂ / (k α).

The error of our method MaxLogHash is characterized by the following theorem.
Theorem 2. For any users u1, u2 ∈ U, we have

E(Ĵ_{u1,u2}) − J_{u1,u2} = (1 − β_{|∪(u1,u2)|}) (1 − J_{u1,u2}),

where β_n = α_n / α. The variance of Ĵ_{u1,u2} is

Var(Ĵ_{u1,u2}) = β_{|∪(u1,u2)|} (1 − J_{u1,u2}) (α^{−1} − β_{|∪(u1,u2)|} (1 − J_{u1,u2})) / k.

When |∪(u1, u2)| ≥ 3, we have |β_{|∪(u1,u2)|} − 1| ≤ 0.01, and so

E(Ĵ_{u1,u2}) ≈ J_{u1,u2},  Var(Ĵ_{u1,u2}) ≈ (1 − J_{u1,u2})(J_{u1,u2} + 0.386) / k.

Proof.
From equation (1), we easily have

E(Ĵ_{u1,u2}) = E(1 − k̂/(kα)) = 1 − k α_{|∪(u1,u2)|} (1 − J_{u1,u2}) / (kα) = β_{|∪(u1,u2)|} J_{u1,u2} + 1 − β_{|∪(u1,u2)|}.

To derive Var(Ĵ_{u1,u2}), we first compute

E(k̂²) = E((Σ_{i=1}^{k} 1(δ_{u1,u2}[i] = 1))²)
= Σ_{i=1}^{k} E(1(δ_{u1,u2}[i] = 1)²) + Σ_{i≠j, 1≤i,j≤k} E(1(δ_{u1,u2}[i] = 1) 1(δ_{u1,u2}[j] = 1))
= k α_{|∪(u1,u2)|} (1 − J_{u1,u2}) + k(k−1) α²_{|∪(u1,u2)|} (1 − J_{u1,u2})².

Then, we have

Var(k̂) = E(k̂²) − (E(k̂))² = k α_{|∪(u1,u2)|} (1 − J_{u1,u2}) (1 − α_{|∪(u1,u2)|} (1 − J_{u1,u2})).   (2)

From the definition of Ĵ_{u1,u2}, we have

Var(Ĵ_{u1,u2}) = Var(1 − k̂/(kα)) = Var(k̂) / (k² α²).

A closed-form formula for Var(Ĵ_{u1,u2}) then follows from equation (2). □

Inspired by OPH (one permutation hashing) [13], which significantly reduces the time complexity of MinHash for processing each element of a set, we can use a hash function that splits the items of I_u into k registers at random, where each register S_u[i], 1 ≤ i ≤ k, records MaxLog(h({v : v ∈ I_u and v is mapped to register i})) together with the value of the indicator s_u[i], as in the regular MaxLogHash method. We name this extension MaxLogOPH. MaxLogOPH reduces the time complexity of processing each item from O(k) to O(1). When |I_u1 ∪ I_u2| ≫ k, our experiments demonstrate that MaxLogOPH is comparable to MaxLogHash in terms of accuracy.

5 EVALUATION
The algorithms are implemented in Python and run on a computer with a Quad-Core Intel(R) Xeon(R) E3-1226 v3 3.30GHz processor. To support the reproducibility of the experimental results, we make our source code publicly available.
http://nskeylab.xjtu.edu.cn/dataset/phwang/code/MaxLog.zip

Figure 2: Estimation error (RMSE and bias) of our method MaxLogHash in comparison with MinHash and HyperMinHash on both balanced and unbalanced set-pairs.
For simplicity, we assume that the elements of sets are 32-bit numbers, i.e., I = {0, 1, . . . , 2^32 − 1}. We evaluate the performance of our method MaxLogHash on a variety of datasets.
1) Synthetic datasets.
Our synthetic datasets consist of set-pairs A and B with various cardinalities and Jaccard similarities. We conduct our experiments under the following two settings:
• Balanced set-pairs (i.e., |A| = |B|). We set |A| = |B| = n and vary J_{A,B} from 0.8 to 0.99. Specifically, we generate set A by randomly selecting n distinct numbers from I, and we generate set B by randomly selecting |A ∩ B| = 2 J_{A,B} |A| / (1 + J_{A,B}) distinct numbers from set A and n − |A ∩ B| distinct numbers from the set I \ A. In our experiments, we set n = 10,000 by default.
• Unbalanced set-pairs (i.e., |A| ≠ |B|). We set |A| = n and |B| = J_{A,B} n, where we vary J_{A,B} from 0.8 to 0.99. Specifically, we generate set A by randomly selecting n distinct numbers from I and generate set B by selecting J_{A,B} n distinct elements from A.
2) Real-world datasets.
Similar to [9], we evaluate the performance of our method on the detection of item-pairs (e.g., pairs of products) that always appear together in the same records (e.g., transactions). We conduct experiments on two real-world datasets, MUSHROOM and CONNECT (http://fimi.ua.ac.be/data/), which are also used in [9]. We generate a stream of item-record pairs for each dataset, where a record can be viewed as a transaction and items in the same record can be viewed as products bought together. For each record x in the dataset of interest and every item w in x, we append an element (w, x) to the stream of item-record pairs. In summary, MUSHROOM and CONNECT have 8,124 and 67,557 records, 119 and 127 distinct items, and 186,852 and 2,904,951 item-record pairs, respectively.
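The stream construction just described can be sketched as follows (function name ours):

```python
def item_record_stream(records):
    # For each record x and every item w in x, emit the element (w, x),
    # yielding the stream of item-record pairs described above.
    for record_id, record in enumerate(records):
        for item in record:
            yield (item, record_id)
```

Each item's record-set then consists of the record ids with which the item appears in the stream.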
Our methods use k registers for each set. We compare them with the following baseline methods:
• MinHash [2]. MinHash builds a sketch for each set; a MinHash sketch consists of k 32-bit registers.
• HyperLogLog [15]. A HyperLogLog sketch consists of k 5-bit registers and estimates a set's cardinality. One can obtain a sketch of A ∪ B by merging the HyperLogLog sketches of sets A and B and then use the merged sketch to estimate |A ∪ B|. Therefore, HyperLogLog can also be used to estimate J_{A,B} by approximating (|A| + |B| − |A ∪ B|) / |A ∪ B|.
• HyperMinHash [10]. A HyperMinHash sketch consists of k q-bit registers and k r-bit registers. The first k q-bit registers can be viewed as a HyperLogLog sketch. To guarantee the performance for large sets (including up to 2^64 elements), we set q = 6.
We evaluate both the efficiency and the effectiveness of our methods in comparison with the above baseline methods. For efficiency, we measure the running time of all methods; specifically, we study the time for updating each set element and for estimating set similarities, respectively. The update time determines the maximum throughput a method can handle, and the estimation time determines the delay in querying the similarity of set-pairs. For effectiveness, we evaluate the error of an estimate Ĵ with respect to its true value J using two metrics, bias and root mean square error (RMSE), i.e., Bias(Ĵ) = E(Ĵ) − J and RMSE(Ĵ) = sqrt(E((Ĵ − J)²)).

Figure 3: Estimation error of our method MaxLogHash in comparison with HyperLogLog on synthetic set-pairs A and B with the same memory space of m bits, where |A| = |B| = n.

Our experimental results are empirically computed from 1,000 independent runs by default. We further evaluate our method on the detection of association rules, and we use precision and recall to evaluate the performance.
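As a concrete reference for what these experiments measure, a minimal MaxLogHash implementation following Section 4 might look like this (class and hash construction are ours, with SHA-1 standing in for the k random hash functions; it assumes each item reaches a set at most once, as the paper ensures by filtering duplicates):

```python
import hashlib
import math

ALPHA = 1 / (2 * math.log(2))  # alpha ~= 0.7213; alpha_n ~= alpha for n >= 3

class MaxLogHash:
    def __init__(self, k):
        self.k = k
        self.m = [-1] * k   # m_u[i]: maximum log-rank seen by hash i
        self.s = [0] * k    # s_u[i]: 1 iff the maximum is achieved by exactly one item

    def _rank(self, i, item):
        # Log-rank r_i(item) = floor(-log2 h_i(item)) with h_i uniform in (0, 1].
        d = hashlib.sha1(f"{i}:{item}".encode()).digest()
        u = (int.from_bytes(d[:8], "big") + 1) / 2**64
        return math.floor(-math.log2(u))

    def update(self, item):
        # Incremental update; assumes `item` has not been seen before.
        for i in range(self.k):
            r = self._rank(i, item)
            if r > self.m[i]:
                self.m[i], self.s[i] = r, 1
            elif r == self.m[i]:
                self.s[i] = 0

    def estimate(self, other):
        # k_hat counts registers where the max log-ranks differ and the larger
        # side's maximum is achieved by exactly one item (delta[i] = 1).
        k_hat = 0
        for i in range(self.k):
            if self.m[i] != other.m[i]:
                k_hat += self.s[i] if self.m[i] > other.m[i] else other.s[i]
        return 1 - k_hat / (self.k * ALPHA)
```

The estimator is symmetric in its two arguments, and its error shrinks as 1/sqrt(k), per Theorem 2.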
MaxLogHash vs. MinHash and HyperMinHash. From Figures 2 (a)-(d), we see that our method MaxLogHash gives results comparable to MinHash and to HyperMinHash with r = 4. Specifically, the RMSEs of these three methods differ by less than 0.006 and continually decrease as the similarity increases. In contrast, the RMSE of HyperMinHash with r = 1 increases as J_{A,B} increases. We observe that this large estimation error occurs because HyperMinHash exhibits a large estimation bias. Figures 2 (e)-(h) show the bias of our method MaxLogHash in comparison with MinHash and HyperMinHash. We see that the empirical biases of MaxLogHash and MinHash are both very small, and no systematic biases can be observed. However, the bias of HyperMinHash with r = 1 grows from −0.06 to −0.089 as the similarity increases from 0.80 to 0.99. One can increase r to reduce the bias of HyperMinHash, but HyperMinHash with a larger r requires more memory space; increasing r from 1 to 4, for example, adds 3 bits to each of the k registers.
Figure 4: Estimation error of our methods MaxLogHash and MaxLogOPH on both balanced and unbalanced synthetic set-pairs A and B with the same number of registers, k = 128, and fixed J_{A,B}. (a) |A| = |B| = n. (b) |A| = J_{A,B} |B| = n.

MaxLogHash vs. HyperLogLog. To make a fair comparison, we allocate the same amount of memory space, m bits, to each of MaxLogHash and HyperLogLog. As discussed in Section 4, an attractive property of our method MaxLogHash is that its estimation error is almost independent of the cardinalities of the sets A and B, which does not hold for HyperLogLog. Figure 3 shows the RMSEs of MaxLogHash and HyperLogLog on sets of different sizes. We see that the RMSE of our method MaxLogHash is almost constant. Figures 3 (a) and (b) show that the performance of HyperLogLog suddenly degrades when the cardinalities of A and B are around 200, because HyperLogLog uses two different estimators for cardinalities within two different ranges [15]. As a result, our method MaxLogHash decreases the RMSE of HyperLogLog by up to 36%. Similarly, as shown in Figures 3 (c) and (d), the RMSE of our method MaxLogHash is about 2.5 times smaller than that of HyperLogLog when the cardinalities of A and B are around 500.

MaxLogHash vs. MaxLogOPH. As discussed in Section 4.6, the estimation error of MaxLogOPH is comparable to that of MaxLogHash when k is far smaller than the cardinalities of the two sets of interest. To provide some insight, we compare MaxLogOPH with MaxLogHash on sets with increasing cardinalities. As shown in Figure 4, MaxLogOPH exhibits relatively large estimation errors for small cardinalities. When k =
128 and the cardinality increasesto 200 (about 2 k ), we see that MaxLogOPH achieves similar ac-curacy to MaxLogHash. Later in Section 5.6, MaxLogOPH signif-icantly accelerates the speed of updating elements compared withMaxLogHash. In this experiment, we evaluate the performance of our methodMaxLogHash, MinHash, and HyperMinHash on the detection ofitems (e.g., products) that almost always appear together in thesame records (e.g., transactions). We conduct the experiments onreal-world datasets: MUSHROOM and CONNECT. We first esti-mate all pairwise similarities among items’ record-sets, and retrieveevery pair of record-sets with similarity J > J . As discussed pre-viously (results in Figure 3), HyperLogLog is not robust, because itxhibits large estimation errors for sets of particular sizes. There-fore, in what follows we compare our method MaxLogHash onlywith MinHash and HyperMinHash. As shown in Figure 5, MaxLogHashgives comparable precision and recall to MinHash and HyperMin-Hash with r =
4. We note that MaxLogHash gives up to 5 . . We further evaluate the efficiency of our method MaxLogHashand its extension MaxLogOPH in comparison with MinHash andHyperLogLog. Specially, we present the time for updating eachcoming element and computing Jaccard similarity, respectively. Weconduct experiments on synthetic balanced datasets. We omit thesimilar results for real-world datasets and synthetic unbalanceddatasets. Figure 6 (a) shows that the update time of MaxLogOPHand HyperLogLog is almost a constant and our method outper-forms other baselines. The update time of HyperMinHash is almostirrelevant to its parameter r and thus we only plot the curve for r =
1. Specially, MaxLogOPH is about 2 and 420 times faster thanHyperMinHash and MinHash. Figure 6 (b) shows that our meth-ods MaxLogHash and MaxLogOPH have estimation time similar toMinHash, while they are about 10 times faster than HyperLogLogand 4 to 5 orders of magnitude faster than HyperMinHash.
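The update-time gap reported above comes from MinHash touching all k registers for every arriving element. The minimal sketch below (the class name and per-register seeding scheme are our own, not the authors' implementation) illustrates a streaming MinHash with its O(k) update and the standard collision-based Jaccard estimator:

```python
import random

class StreamingMinHash:
    """Minimal illustrative MinHash sketch for a streaming set.

    Each of the k registers keeps the minimum hash value seen so far
    under an independent (seeded) hash function, so every arriving
    element costs O(k) work -- the per-element cost the update-time
    experiments measure for MinHash.
    """
    def __init__(self, k, seed=0):
        rng = random.Random(seed)
        self.seeds = [rng.getrandbits(64) for _ in range(k)]
        self.regs = [float("inf")] * k

    def update(self, element):
        for i, s in enumerate(self.seeds):  # O(k) per element
            h = hash((s, element)) & 0xFFFFFFFFFFFFFFFF
            if h < self.regs[i]:
                self.regs[i] = h

def jaccard_estimate(sk_a, sk_b):
    """The fraction of colliding registers estimates J(A, B)."""
    matches = sum(a == b for a, b in zip(sk_a.regs, sk_b.regs))
    return matches / len(sk_a.regs)

a, b = StreamingMinHash(256), StreamingMinHash(256)
for x in range(1000):
    a.update(x)
for x in range(100, 1100):
    b.update(x)
print(jaccard_estimate(a, b))  # estimate of the true J = 900/1100 ≈ 0.82
```

One-permutation-style schemes such as MaxLogOPH avoid the inner loop over all k registers, which is why their per-element update time is nearly constant in the experiments above.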
RELATED WORK

Jaccard similarity estimation for static sets.
Broder et al. [2] proposed the first sketch method, MinHash, to compute the Jaccard similarity of sets; it builds a sketch consisting of k registers for each set. To reduce the amount of memory required by MinHash, [8, 9] developed b-bit MinHash and Odd Sketch, which are dozens of times more memory efficient than the original MinHash. The basic idea behind b-bit MinHash and Odd Sketch is to use probabilistic methods such as sampling and bitmap sketching to build a compact digest of each set's MinHash sketch. Recently, several methods [13, 16–18] were proposed to reduce the time complexity of processing each element in a set from O(k) to O(1).

Weighted similarity estimation for static vectors.
SimHash (or sign normal random projections) [19] was developed for approximating the angular similarity (i.e., cosine similarity) of weighted vectors. CWS [20, 21], ICWS [22], 0-bit CWS [23], CCWS [24], Weighted MinHash [25], PCWS [26], and BagMinHash [27] were developed for approximating the generalized Jaccard similarity of weighted vectors, and Datar et al. [28] developed an LSH method using p-stable distributions for estimating the l_p distance between weighted vectors, where 0 < p ≤ 2. Campagna and Pagh [29] developed a biased sampling method for estimating a variety of set similarity measures beyond Jaccard similarity.
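For reference, the generalized Jaccard similarity these weighted-vector methods approximate is J(x, y) = Σ_j min(x_j, y_j) / Σ_j max(x_j, y_j) for non-negative vectors. A direct (non-sketch) computation, with a function name of our own choosing:

```python
def generalized_jaccard(x, y):
    """Generalized Jaccard similarity of two equal-length non-negative
    vectors: sum_j min(x_j, y_j) / sum_j max(x_j, y_j)."""
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return num / den if den > 0 else 1.0  # convention: two all-zero vectors are identical

print(generalized_jaccard([1.0, 0.5, 0.0], [0.5, 0.5, 1.0]))  # 1.0 / 2.5 = 0.4
```

On 0/1 indicator vectors this reduces to the ordinary Jaccard similarity of sets, which is why these methods generalize MinHash-style estimation.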
Similarity estimation for data streams.
The above weighted similarity estimation methods fail to deal with streaming weighted vectors, where the elements of a vector arrive in a streaming fashion. To solve this problem, Kutzkov et al. [30] extended the AMS sketch [31] to estimate cosine similarity and Pearson correlation over streaming weighted vectors. Yang et al. [32] developed a streaming method, HistoSketch, for approximating Jaccard similarity under concept drift. (The generalized Jaccard similarity between two positive real-valued vectors x = (x_1, x_2, ..., x_p) and y = (y_1, y_2, ..., y_p) is defined as J(x, y) = Σ_{1≤j≤p} min(x_j, y_j) / Σ_{1≤j≤p} max(x_j, y_j).) Set intersection cardinality (i.e., the number of common elements in two sets) is also a popular metric for evaluating the similarity of sets. A variety of sketch methods such as LPC [33], FM [11], LogLog [34], HyperLogLog [15], HLL-TailCut+ [35], and MinCount [36] were proposed to estimate the stream cardinality (i.e., the number of distinct elements in a stream), and they can easily be extended to estimate |A ∪ B| by merging the sketches of sets A and B. One can then approximate |A ∩ B| via the identity |A ∩ B| = |A| + |B| − |A ∪ B|. To further improve the estimation accuracy, Cohen et al. [37] developed a method combining MinHash and HyperLogLog to estimate set intersection cardinalities. Our experiments reveal that these sketch methods have large errors when one first estimates |A ∩ B| and |A ∪ B| and then approximates the Jaccard similarity J_{A,B}. As mentioned in Section 3, MinHash can easily be extended to handle streaming sets, but its two compressed versions, b-bit MinHash and Odd Sketch, fail to handle data streams. To solve this problem, Yu and Weber [10] developed HyperMinHash, which can be viewed as a combination of HyperLogLog and b-bit MinHash. A HyperMinHash sketch consists of k registers, where each register has two parts, an FM sketch and a b-bit string.
The b-bit string is computed based on the fingerprints (i.e., hash values) of the set elements that map to the register. HyperMinHash first estimates |A ∪ B| and then infers the Jaccard similarity of sets A and B from the number of collisions of the b-bit strings given |A ∪ B|. Our experiments demonstrate that HyperMinHash exhibits a large bias for high similarities and is several orders of magnitude slower than our methods when estimating the similarity.

CONCLUSION

We develop a memory-efficient sketch method, MaxLogHash, to estimate the similarity of two sets given in a streaming fashion. We provide a simple yet accurate estimator for Jaccard similarity, and derive exact formulas for the estimator's bias and variance. Experimental results demonstrate that MaxLogHash reduces the memory required by MinHash by around a factor of 5 at the same accuracy and computational cost. Compared with our method MaxLogHash, the state-of-the-art method HyperMinHash exhibits a larger estimation bias and its estimation time is 4 to 5 orders of magnitude larger. Although HyperLogLog can be extended to estimate Jaccard similarity, its estimation error (resp. estimation time) is about 2.5 times (resp. 10 times) larger than that of our methods. In the future, we plan to extend MaxLogHash to weighted streaming vectors and to fully dynamic streaming sets that include both element insertions and deletions.
ACKNOWLEDGMENT
The research presented in this paper is supported in part by the National Key R&D Program of China (2018YFC0830500), the National Natural Science Foundation of China (U1736205, 61603290), the Shenzhen Basic Research Grant (JCYJ20170816100819428), and the Natural Science Basic Research Plan in Shaanxi Province of China (2019JM-159). The work of John C.S. Lui is supported in part by the GRF R4032-18.
[Figure 5 comprises eight panels: precision and recall versus k on MUSHROOM (panels a-d) and CONNECT (panels e-h) for MaxLogHash, MinHash, and HyperMinHash with r = 1 and r = 4.]

Figure 5: Precision and recall of our method MaxLogHash in comparison with MinHash and HyperMinHash on datasets MUSHROOM and CONNECT.

[Figure 6 comprises two panels: (a) update time and (b) estimation time in seconds versus k for HyperMinHash, MaxLogHash, MaxLogOPH, HyperLogLog, and MinHash.]
Figure 6: Computational cost of our methods MaxLogHashand MaxLogOPH in comparison with MinHash, Hyper-LogLog and HyperMinHash.
REFERENCES
[1] Kaiyu Li and Guoliang Li. Approximate query processing: What is new and where to go? Data Science and Engineering, 2018.
[2] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60(3):630–659, June 2000.
[3] A. Broder. On the resemblance and containment of documents. In SEQUENCES, 1997.
[4] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Michael Mitzenmacher, Alessandro Panconesi, and Prabhakar Raghavan. On compressing social networks. In KDD, pages 219–228, 2009.
[5] Sreenivas Gollapudi and Aneesh Sharma. An axiomatic approach for result diversification. In WWW, pages 381–390.
[6] Ping Li, Anshumali Shrivastava, Joshua L. Moore, and Arnd Christian König. Hashing algorithms for large-scale learning. In NIPS, pages 2672–2680, 2011.
[7] Tanguy Urvoy, Emmanuel Chauveau, Pascal Filoche, and Thomas Lavergne. Tracking web spam with HTML style similarities. ACM Trans. Web, 2(1), March 2008.
[8] Ping Li and Arnd Christian König. b-bit minwise hashing. In WWW, pages 671–680, 2010.
[9] Michael Mitzenmacher, Rasmus Pagh, and Ninh Pham. Efficient estimation for high similarities using odd sketches. In WWW, pages 109–118, 2014.
[10] Yun William Yu and Griffin Weber. HyperMinHash: Jaccard index sketching in loglog space. CoRR, abs/1710.08436, 2017.
[11] Philippe Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182–209, October 1985.
[12] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. CACM, 13(7):422–426, 1970.
[13] Ping Li, Art B. Owen, and Cun-Hui Zhang. One permutation hashing. In NIPS, pages 3122–3130, 2012.
[14] Edith Cohen, Mayur Datar, Shinji Fujiwara, Aristides Gionis, Piotr Indyk, Rajeev Motwani, Jeffrey D. Ullman, and Cheng Yang. Finding interesting associations without support pruning. IEEE Trans. on Knowl. and Data Eng., 13(1):64–78, January 2001.
[15] Philippe Flajolet, Eric Fusy, Olivier Gandouet, and Frederic Meunier. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In AOFA, 2007.
[16] Anshumali Shrivastava and Ping Li. Improved densification of one permutation hashing. In UAI, pages 732–741, 2014.
[17] Anshumali Shrivastava and Ping Li. Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, pages 557–565, 2014.
[18] Anshumali Shrivastava. Optimal densification for fast and accurate minwise hashing. In ICML, pages 3154–3163, 2017.
[19] Moses Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, 2002.
[20] Mark Manasse, Frank McSherry, and Kunal Talwar. Consistent weighted sampling. Technical report, June 2010.
[21] Bernhard Haeupler, Mark S. Manasse, and Kunal Talwar. Consistent weighted sampling made fast, small, and easy. CoRR, abs/1410.4266, 2014.
[22] Sergey Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In ICDM, pages 246–255, 2010.
[23] Ping Li. 0-bit consistent weighted sampling. In KDD, pages 665–674, 2015.
[24] Wei Wu, Bin Li, Ling Chen, and Chengqi Zhang. Canonical consistent weighted sampling for real-value weighted min-hash. In ICDM, pages 1287–1292, 2016.
[25] Anshumali Shrivastava. Simple and efficient weighted minwise hashing. In NIPS, pages 1498–1506, 2016.
[26] Wei Wu, Bin Li, Ling Chen, and Chengqi Zhang. Consistent weighted sampling made more practical. In WWW, pages 1035–1043, 2017.
[27] Otmar Ertl. BagMinHash - minwise hashing algorithm for weighted sets. CoRR, abs/1802.03914, 2018.
[28] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SOCG, pages 253–262, 2004.
[29] Andrea Campagna and Rasmus Pagh. Finding associations and computing similarity via biased pair sampling. Knowl. Inf. Syst., 31(3):505–526, 2012.
[30] Konstantin Kutzkov, Mohamed Ahmed, and Sofia Nikitaki. Weighted similarity estimation in data streams. In CIKM, pages 1051–1060, 2015.
[31] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In STOC, pages 20–29, 1996.
[32] Dingqi Yang, Bin Li, Laura Rettig, and Philippe Cudré-Mauroux. HistoSketch: Fast similarity-preserving sketching of streaming histograms with concept drift. In ICDM, pages 545–554, 2017.
[33] Kyu-Young Whang, Brad T. Vander-Zanden, and Howard M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Transactions on Database Systems, 15(2):208–229, June 1990.
[34] Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities. In ESA, pages 605–617, 2003.
[35] Qingjun Xiao, You Zhou, and Shigang Chen. Better with fewer bits: Improving the performance of cardinality estimation of large data streams. In INFOCOM, pages 1–9, 2017.
[36] Frederic Giroire. Order statistics and estimating cardinalities of massive data sets. Discrete Applied Mathematics, 157(2):406–427, 2009.
[37] Reuven Cohen, Liran Katzir, and Aviv Yehezkel. A minimal variance estimator for the cardinality of big data set intersection. In