Hardness of Bichromatic Closest Pair with Jaccard Similarity

Abstract

Consider collections $A$ and $B$ of red and blue sets, respectively. Bichromatic Closest Pair is the problem of finding a pair from $A \times B$ that has similarity higher than a given threshold according to some similarity measure. Our focus here is the classic Jaccard similarity $|a \cap b| / |a \cup b|$ for $(a, b) \in A \times B$. We consider the approximate version of the problem where we are given thresholds $j_1 > j_2$ and wish to return a pair from $A \times B$ that has Jaccard similarity higher than $j_2$ if there exists a pair in $A \times B$ with Jaccard similarity at least $j_1$. The classic locality sensitive hashing (LSH) algorithm of Indyk and Motwani (STOC '98), instantiated with the MinHash LSH function of Broder et al., solves this problem in $\tilde{O}(n^{2-\delta})$ time if $j_1 \ge j_2^{1-\delta}$. In particular, for $\delta = \Omega(1)$, the approximation ratio $j_1/j_2 = 1/j_2^{\delta}$ increases polynomially in $1/j_2$. In this paper we give a corresponding hardness result. Assuming the Orthogonal Vectors Conjecture (OVC), we show that there cannot be a general solution that solves the Bichromatic Closest Pair problem in $O(n^{2-\Omega(1)})$ time for $j_1/j_2 = 1/j_2^{o(1)}$. Specifically, assuming OVC, we prove that for any $\delta > 0$ there exists an $\varepsilon > 0$ such that Bichromatic Closest Pair with Jaccard similarity requires time $\Omega(n^{2-\delta})$ for any choice of thresholds $j_2 < j_1 < 1 - \delta$ that satisfy $j_1 \le j_2^{1-\varepsilon}$.


Rasmus Pagh

BARC and IT University of Copenhagen

Nina Mesing Stausholm

BARC and IT University of Copenhagen

Mikkel Thorup

BARC and University of Copenhagen

Theory of computation → Problems, reductions and completeness

Keywords and phrases fine-grained complexity, set similarity search, bichromatic closest pair, Jaccard similarity


Funding

This work was supported by Investigator Grant 16582, Basic Algorithms Research Copenhagen (BARC), from the VILLUM Foundation.

Rasmus Pagh: This research has received funding from the European Research Council under the European Union's 7th Framework Programme (FP7/2007-2013) / ERC grant agreement no. 614331.

Acknowledgements

We want to thank A. Rubinstein for helping us understand the background of his results in [8].

© Rasmus Pagh, Nina M. Stausholm and Mikkel Thorup; licensed under Creative Commons License CC-BY. 27th Annual European Symposium on Algorithms (ESA 2019). Editors: Michael A. Bender, Ola Svensson, and Grzegorz Herman; Article No. 80; pp. 80:1–80:17. Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany.

Twitter is a well-known social network in which a user can connect to other users by following them [5]. Users can read and write messages, called tweets, of up to 280 characters. An important service that Twitter provides is helping users discover other users that they might like to follow, by making suggestions. This service is called the "You might also want to follow" service and is better known as the WTF (Who To Follow) recommender system [6]. In order to suggest connections that the user might like, they should be similar to the user's existing connections. As an example, if a user is already connected to Cristiano Ronaldo, Twitter might suggest Lionel Messi as a new connection, since the connection to Ronaldo hints that the user likes famous soccer players. Hence, we need a way to decide if a connection is similar to an existing connection. We might for instance suggest a new connection if its tweets are similar to the tweets of an existing connection or if the connection has many of the same followers as an existing connection.

The main challenge is to find similar connections when the number of user accounts increases drastically, and the task is particularly difficult when the similarity does not need to be significant, i.e., when we look for connections that have only little in common with existing ones, while they may still be of interest to the particular user [5]. This leads us to the notion of similarity search, which concerns the general problem of searching for similar objects in a collection of objects. Often we consider these objects as sets representing some concept or entity. An object could for example be a document that is represented by a set of words. Hence, we talk about set similarity search.

There are several versions of the problem addressing different situations.
In this paper we consider a batched version of set similarity search, namely Bichromatic Closest Pair, which can be informally described as follows:

Suppose we are given collections $A$ and $B$, each of $n$ sets from a universe of size $O(\log n)$. We refer to the sets in $A$ as red and the sets in $B$ as blue. Bichromatic Closest Pair is the problem of finding the pair consisting of a red and a blue set that is closest with respect to some distance or similarity measure. We will concern ourselves with Jaccard similarity, which is defined for a pair of sets $(a, b) \in A \times B$ as

$$J(a, b) = \frac{|a \cap b|}{|a \cup b|} = \frac{|a \cap b|}{|a| + |b| - |a \cap b|}. \tag{1}$$

In particular, we consider the following decision version of Bichromatic Closest Pair with Jaccard similarity: decide whether there exists a pair $(a, b) \in A \times B$ such that $J(a, b) \ge j_1$ or whether all pairs $(a, b) \in A \times B$ have $J(a, b) < j_2$, for given thresholds $j_1$ and $j_2$.

It is well known that we can solve Bichromatic Closest Pair with Jaccard similarity for thresholds satisfying $j_1 \ge j_2^{1-\delta}$ in time $O(n^{2-\delta})$ (see Section 1.1). In particular, for $\delta = \Omega(1)$, the approximation ratio $j_1/j_2 = 1/j_2^{\delta}$ increases polynomially in $1/j_2$. In this paper, we will present a corresponding hardness result. The hardness is conditioned on one of the most well-known and widely believed hypotheses, namely the Orthogonal Vectors Conjecture [10].

Conjecture 1 (Orthogonal Vectors Conjecture (OVC)). For every $\delta > 0$ there exists $c = c(\delta)$ such that given two collections $A, B \subset \{0, 1\}^m$ of cardinality $n$, where $m = c \log n$, deciding if there is a pair $(a, b) \in A \times B$ such that $a \cdot b = 0$ requires time $\Omega(n^{2-\delta})$.

Assuming OVC, we show that there cannot be a general solution that solves the Bichromatic Closest Pair problem with Jaccard similarity in $O(n^{2-\Omega(1)})$ time for $j_1/j_2 = 1/j_2^{o(1)}$. More specifically, we show

Theorem 2.

Assuming the Orthogonal Vectors Conjecture (OVC), the following holds: for any $\delta > 0$, there exists an $\varepsilon > 0$ such that for any given $j_2 < j_1 < 1 - \delta$ satisfying $j_1 \le j_2^{1-\varepsilon}$, solving Bichromatic Closest Pair with Jaccard similarity for $n$ red and $n$ blue sets from a universe of size $\ln(n)/j_2^{O(\log(1/j_2))}$ for thresholds $j_1$ and $j_2$ requires time $\Omega(n^{2-\delta})$.

The dependence of $\varepsilon$ on $\delta$ is unspecified because the function $c(\delta)$ in OVC is not specified; see the discussion in Appendix B.

Similarity search can be performed in several ways. A popular technique is Locality Sensitive Hashing (LSH) [7], which attempts to collect similar items in buckets in order to reduce the number of sets that similarity must be checked against. We can for example use Broder's MinHash [1] with locality sensitive hashing to solve Bichromatic Closest Pair with Jaccard similarity in time $\tilde{O}(n^{2-\varepsilon})$ when $j_1 \ge j_2^{1-\varepsilon}$ for any $\varepsilon$. This is done by ensuring that the collision probability for pairs with similarity $j_2$ is $1/n$ and the collision probability for pairs with similarity $j_1$ is $1/n^{1-\varepsilon}$. Hashing $n^{1-\varepsilon}$ times means that we find a pair with similarity $j_1$ if one exists. The ChosenPath method presented in [4] also uses the LSH framework to solve Bichromatic Closest Pair with Braun-Blanquet similarity in time $\tilde{O}(n^{2-\varepsilon})$ for thresholds $j_1 \ge j_2^{1-\varepsilon}$.

The proof of Theorem 2 will be based on a result by Rubinstein [8]: Assuming the Orthogonal Vectors Conjecture, a $(1 + \varepsilon)$-approximation to Bichromatic Closest Pair with Hamming, Edit or Euclidean distance requires time $\Omega(n^{2-\delta})$. The required approximation factor $1 + \varepsilon$ depends on $\delta$, and tends to 1 as $\delta$ tends to zero.
We translate this into an equivalent conditional lower bound for Jaccard similarity for certain constants $j_1$ and $j_2$. In order to handle smaller, subconstant values of $j_1$ and $j_2$ we use a technique that we call squaring, which allows us to increase the gap in similarities between pairs with high Jaccard similarity and pairs with low Jaccard similarity by computing the cartesian product of a binary vector with itself. A similar technique is used by Valiant in [9]. His technique is called tensoring and is used to amplify the gap between small and large inner products of vectors. We also see a similar technique in the LSH framework with MinHash, where we use concatenation of hash values (which are sampled set elements) to amplify the difference in collision probability, and hence in the Jaccard similarity.

Combining two simple reductions with the above squaring we show that for any $\delta$, we can always find $\varepsilon$ such that Bichromatic Closest Pair with Jaccard similarity cannot be solved in time $O(n^{2-\delta})$ for any pair $j_1, j_2 < 1 - \delta$ when $j_1 \le j_2^{1-\varepsilon}$. Contrast this with the above LSH upper bound of $\tilde{O}(n^{2-\delta})$ for $j_1 \ge j_2^{1-\delta}$. We also know that there are parts of the parameter space where $j_1 = j_2^{1-\delta}$ that can be solved in $\tilde{O}(n^{2-\delta-\Omega(1)})$ time; see the discussion in [4]. While LSH with MinHash is not the fastest possible algorithm in terms of the exponent achieved, it has been unclear how far from optimal it might be.
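As a rough illustration of the quantities used above, the following Python sketch computes the Jaccard similarity of equation (1) and estimates it through MinHash collisions. It is only a sketch under stated assumptions: the helper names (`jaccard`, `random_permutations`, `minhash_signature`, `estimate_jaccard`), the small universe of size 100, and the use of explicit random permutations are illustrative choices, not the constructions of [1] or [7].

```python
import random

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / (|a| + |b| - |a ∩ b|), as in equation (1)."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def random_permutations(universe_size, k, seed=0):
    """k independent random permutations; rank[e] is the position of element e."""
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        rank = list(range(universe_size))
        rng.shuffle(rank)
        perms.append(rank)
    return perms

def minhash_signature(s, perms):
    """One MinHash value per permutation: the smallest rank among the elements of s."""
    return [min(rank[e] for e in s) for rank in perms]

def estimate_jaccard(a, b, perms):
    """Fraction of permutations on which the two MinHash values collide."""
    sig_a = minhash_signature(a, perms)
    sig_b = minhash_signature(b, perms)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(perms)

# Hypothetical example sets over the universe {0, ..., 99}.
a = set(range(0, 30))
b = set(range(20, 60))
perms = random_permutations(universe_size=100, k=2000)
exact = jaccard(a, b)                      # 10 / (30 + 40 - 10)
approx = estimate_jaccard(a, b, perms)     # close to exact for large k
```

Under a uniformly random permutation, two sets receive the same MinHash value with probability equal to their Jaccard similarity, which is why concatenating several values drives the collision probabilities of $j_1$-pairs and $j_2$-pairs apart.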

Other related work

Very recently, Chen and Williams [3] showed that, assuming the OVC, we cannot additively approximate our Bichromatic Closest Pair problem with Jaccard similarity. It might be possible to use the result of Chen and Williams as a base for showing our main theorem, but this would require reductions quite different from the ones presented in this paper.

An earlier result of Chen [2] shows that it is not possible (under OVC) to compute a $(d/\log n)^{o(1)}$-approximation to Maximum Inner Product (Max-IP) with two sets of $n$ vectors from $\{0, 1\}^d$ in time $O(n^{2-\Omega(1)})$.

We will occasionally consider a set, $x$, from a finite universe $U = \{u_1, \ldots, u_{|U|}\}$ as a vector $v$ of dimension $|U|$ such that $v_i = [u_i \in x]$, in Iverson notation. We call this vector the characteristic vector for $x$. Hence, we refer to the set of indexes and the universe interchangeably. We denote the Hamming weight of a binary vector $v$ by $|v|$. In the following, we will not only index vectors with integers, but also with vectors of integers. Hence, we will consider vectors of dimension $d^2$ with entries $v_{ij}$, for $i = (i_1, \ldots, i_d)$ and $j = (j_1, \ldots, j_d)$.

Recall Jaccard similarity as defined in (1). We define Bichromatic Closest Pair with Jaccard similarity for thresholds $t_1$ and $t_2$ as follows: Let $U$ be a universe of size $O(\log n)$. Given collections $A$ and $B$, each of $n$ sets from $U$, and thresholds $t_2 < t_1 < 1$, we will consider the problem of finding a pair of sets $(a, b) \in A \times B$ with $J(a, b) \ge t_2$ if there exists a pair $(a^*, b^*) \in A \times B$ with $J(a^*, b^*) \ge t_1$. If all pairs have $J(a, b) < t_2$, we must not return any pair of sets.

The following lemma corresponds to Theorem 4.1 in [8] and will form the basis of our results. It includes the important properties of the instances constructed in the proof of the theorem, which we will use actively to prove our own Theorem 2.

Lemma 3.

Assume OVC. Given $\delta > 0$, there exist $\varepsilon > 0$ and values $h_1, h_2$ where $h_2 = (1 + \varepsilon) h_1$ such that Bichromatic Closest Pair with Hamming distance for thresholds $h_1$ and $h_2$ requires time $\Omega(n^{2-\delta})$ for instances with $n$ red and $n$ blue sets from a universe of size $O(\log n)$. There are instances that require this time with the following properties, where we let $T = O(1/\varepsilon)$ and $m = O(\log n)$:

All red sets have size $Tm$ and all blue sets have size $m$.
The thresholds $h_1$ and $h_2$ are $m(T - 1)$ and $mT$, respectively.
All sets in the instance come from a universe of size $2Tm$.

In particular, the lemma states that we cannot compute a $(1 + \varepsilon)$-approximation to Bichromatic Closest Pair with Hamming distance in truly subquadratic time. We will extend this result in a few steps, using the properties of the hard instances, to achieve Theorem 2.

In order to prove Theorem 2, we need the following lemma, which extends Lemma 3 in the natural way to Jaccard similarity.

Lemma 4.

Assuming OVC, we have the following: For any $\delta > 0$ there exist $j_1, j_2$ with $j_1 = 2 \cdot j_2$ such that Bichromatic Closest Pair with Jaccard similarity with thresholds $j_1$ and $j_2$ requires time $\Omega(n^{2-\delta})$.

Proof.

We use instances as described in Lemma 3. First, note that

$$J(a, b) = \frac{|a \cap b|}{|a \cup b|} = \frac{\frac{|a| + |b| - d_H(a, b)}{2}}{|a| + |b| - \frac{|a| + |b| - d_H(a, b)}{2}} = \frac{|a| + |b| - d_H(a, b)}{|a| + |b| + d_H(a, b)},$$

which implies that letting

$$j_1 = \frac{Tm + m - m(T - 1)}{Tm + m + m(T - 1)} = \frac{1}{T} \quad \text{and} \quad j_2 = \frac{Tm + m - Tm}{Tm + m + Tm} = \frac{1}{2T + 1},$$

we cannot solve Bichromatic Closest Pair with Jaccard similarity in time $O(n^{2-\delta})$. Since $T = O(1/\varepsilon)$, as mentioned in Lemma 3, we get a lower bound for the approximation factor:

$$\frac{1/T}{1/(2T + 1)} = \frac{2T + 1}{T} = 2 + \frac{1}{T} = 2 + \Omega(\varepsilon).$$

In particular, we achieve hardness of a 2-approximation.
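The translation from Hamming distance to Jaccard similarity used in this proof can be checked on a small example. The sketch below is illustrative only: the values $T = 4$ and $m = 6$ and the concrete sets are assumptions chosen to match the set sizes and thresholds of Lemma 3, and `jaccard_from_hamming` is a hypothetical helper name.

```python
from fractions import Fraction

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    inter = len(a & b)
    return Fraction(inter, len(a) + len(b) - inter)

def jaccard_from_hamming(size_a, size_b, d_h):
    """Jaccard similarity expressed through set sizes and Hamming distance."""
    return Fraction(size_a + size_b - d_h, size_a + size_b + d_h)

T, m = 4, 6                                   # assumed toy parameters
red = set(range(T * m))                       # |a| = Tm
blue_close = set(range(m))                    # |b| = m, contained in the red set
blue_far = set(range(m // 2)) | set(range(100, 100 + m - m // 2))  # half of b inside a

d_close = len(red ^ blue_close)               # Hamming distance = symmetric difference
d_far = len(red ^ blue_far)

j1 = jaccard(red, blue_close)
j2 = jaccard(red, blue_far)
```

Here the close pair sits at Hamming distance $m(T-1)$ and the far pair at distance $mT$, so the two Jaccard values are exactly the thresholds $1/T$ and $1/(2T+1)$ derived above.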

We prove Theorem 2 by combining several reductions into one. So let $(A, B)$ be any instance of Bichromatic Closest Pair with Jaccard similarity as described in Lemma 3. We give a brief introduction to each of these reductions; note that all reductions are self-reductions. We give the details of the proof and the use of each reduction in Section 5. Further details can be found in Appendix B.

Adding common elements to sets: Adding common elements to all sets in collections $A$ and $B$ increases the Jaccard similarity between any pair of red and blue sets.

Adding different elements to sets: Adding elements to all sets in $A$ decreases the Jaccard similarity between any pair of red and blue sets.

Squaring: Consider all sets by their characteristic vector. We define squaring as follows: given a vector $a = (a_1, \ldots, a_d)$, the squared vector $a^2$ has entries $a_{ij} = a_i \cdot a_j$ for $i, j \in \{1, \ldots, d\}$. The resulting vector $a^2$, which is the characteristic vector for $a \times a$, has dimension $d^2$ as described in Section 2.1. Vector $a^2$ can equivalently be considered as a set from a universe of size $d^2$. We will use this reduction iteratively to reduce the Jaccard similarity between any pair of vectors in the instance of Bichromatic Closest Pair.

Sampling: We will use sampling to reduce the size of the universe after each step of squaring. Hence, we consider squaring and sampling as a single reduction which first squares the vectors and then samples from the resulting vectors. We will use the squaring-and-sampling reduction iteratively.
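The squaring step can be made concrete on characteristic vectors. The following Python sketch is an illustration under assumptions: the two 6-dimensional vectors are made-up examples and the helper names (`square`, `weight`, `jaccard_vec`) are hypothetical.

```python
from fractions import Fraction

def square(v):
    """Squared vector: entry (i, j) equals v[i] * v[j], flattened to length d^2."""
    return [vi * vj for vi in v for vj in v]

def weight(v):
    """Hamming weight of a binary vector."""
    return sum(v)

def jaccard_vec(u, v):
    """Jaccard similarity of the sets described by two characteristic vectors."""
    inter = sum(x * y for x, y in zip(u, v))
    return Fraction(inter, weight(u) + weight(v) - inter)

# Assumed example: d = 6, |a| = 4, |b| = 3, |a ∩ b| = 2.
a = [1, 1, 1, 1, 0, 0]
b = [0, 0, 1, 1, 1, 0]
a2, b2 = square(a), square(b)

before = jaccard_vec(a, b)     # 2 / (4 + 3 - 2)
after = jaccard_vec(a2, b2)    # 4 / (16 + 9 - 4)
```

Set sizes and the intersection size are all squared, so the similarity of every pair drops, and pairs that started lower drop faster.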

In the proof of Theorem 2 we will take any instance of Bichromatic Closest Pair with Jaccard similarity with the properties described in Lemma 3 and use the squaring reduction described in Section 3 to decrease the Jaccard similarity of every pair of sets in the instance. We will argue that a solution for the new instance also provides a solution for the original instance. When squaring all sets, the Jaccard similarity between any pair of sets will decrease, so we need to capture this change in the thresholds, such that a solution for the new instance implies a solution for the initial instance. When squaring the sets in $A$ and $B$, the size of the sets will be squared and it is easy to see that so will the size of the intersection. Hence, the Jaccard similarity of a pair $(a, b)$ after squaring $i$ times, $(a^{2^i}, b^{2^i})$, is

$$J(a^{2^i}, b^{2^i}) = \frac{|a \cap b|^{2^i}}{|a|^{2^i} + |b|^{2^i} - |a \cap b|^{2^i}}. \tag{2}$$

In order to keep down the size of the universe, we need to sample after each step of squaring. This might incur a small error in the Jaccard similarity. The next few sections will bound this error. From this point, we will denote the squaring-and-sampling reduction by $f$. Hence, applying the reduction $f$ to a set $v$ a total of $i$ times will yield a set $f(v, i)$.
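To see how repeated squaring widens the gap between the two thresholds, equation (2) can be evaluated directly. This is a small sketch with assumed numbers: the sizes $|a| = 24$, $|b| = 6$ and the intersection sizes 6 and 3 correspond to a toy instance with $T = 4$ and $m = 6$, and `jaccard_after_squaring` is a hypothetical helper name.

```python
from fractions import Fraction

def jaccard_after_squaring(x, y, z, i):
    """Equation (2): similarity after i rounds of squaring, where
    y = |a|, z = |b| and x = |a ∩ b| are the sizes before squaring."""
    k = 2 ** i
    return Fraction(x ** k, y ** k + z ** k - x ** k)

y, z = 24, 6               # assumed red and blue set sizes
x_high, x_low = 6, 3       # intersection sizes of a close and a far pair

gaps = []
for i in range(3):
    high = jaccard_after_squaring(x_high, y, z, i)
    low = jaccard_after_squaring(x_low, y, z, i)
    gaps.append(high / low)   # ratio between upper and lower similarity
```

In this toy instance the ratio between the two similarities grows with every round, which is the effect the reduction exploits; sampling, discussed next, only keeps the universe from growing as $d^{2^i}$.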

We bound the error incurred in each of $|a \cap b|$, $|a|$ and $|b|$ and combine these with a union bound to get a bound on the error in the Jaccard similarity. We shall see that when sampling sufficiently many elements from the universe the sets are taken from, a solution for the constructed instance will, with high probability, provide a valid solution for the original instance.

The following lemmas will help us show that sampling after squaring will not distort the similarity of the resulting vectors too much.

Lemma 5.

Let $0 < m_1 < m_2 < 1$ and let $p$ be a set from a universe of size $s^2$ for an integer $s$. Assume that $(m_1 \cdot s)^2 \le |p| \le (m_2 \cdot s)^2$. Sample $s'$ elements from the universe uniformly at random, $z$, thus generating sample set $p \cap z$. We have

$$(1 - \gamma) \cdot m_1^2 \cdot s' \le |p \cap z| \le (1 + \gamma) \cdot m_2^2 \cdot s'$$

with probability at least $1 - 2n^{-10}$ when sampling $s' \ge \frac{30 \ln(n)}{\gamma^2 m_1^2}$ elements.

Proof.

The result is an immediate consequence of the Chernoff bound: when we sample $s' \ge \frac{20 \ln(n)}{\gamma^2 m_1^2}$ elements, we have with probability at least $1 - n^{-10}$ that $(1 - \gamma)(m_1 \cdot s)^2 \cdot \frac{s'}{s^2} \le |p \cap z|$. A similar result gives the upper bound on $|p \cap z|$ for $s' \ge \frac{30 \ln(n)}{\gamma^2 m_2^2}$. As $m_1 \le m_2$, we maximize $s'$ by $\frac{30 \ln(n)}{\gamma^2 m_1^2}$ and thus ensure both bounds with probability at least $1 - 2n^{-10}$ using a union bound.

We are generally going to use $\gamma$ as the same fixed parameter (to be determined later) every time we invoke the sampling of Lemma 5.

We will use Lemma 5 to show that sampling after squaring will not distort the Jaccard similarity of a pair of vectors too much, and hence we get the benefits of squaring without the exploding vector dimensions. We start by bounding the resulting sizes for each of $|a|$, $|b|$ and $|a \cap b|$ for any choice of $(a, b) \in A \times B$ from squaring and sampling $i$ times.

Lemma 6.

Let $v$ be a set from a universe of size $d$ or the intersection of two such sets. Let $f(v, i)$ denote the resulting set after running $i$ iterations of the squaring-and-sampling reduction on set $v$ for $i \ge 1$. We have

$$(1 - \gamma)^{2^i} \frac{|v|^{2^i}}{d^{2^i}} s_i \le |f(v, i)| \le (1 + \gamma)^{2^i} \frac{|v|^{2^i}}{d^{2^i}} s_i$$

with probability at least $1 - 2in^{-10}$, where $s_i \ge \frac{30 \ln(n) d^{2^i}}{\gamma^2 (1 - \gamma)^{2^i - 2} |v|^{2^i}}$.

Proof.

Let $v$ be as described. We show the lemma by induction on $i$. Clearly, when squaring the vector $v$ once, i.e., for $i = 1$, the resulting vector has Hamming weight $|v|^2$ and dimension $d^2$. Hence, by Lemma 5 we have

$$(1 - \gamma) \frac{|v|^2}{d^2} \cdot s_1 \le |f(v, 1)| \le (1 + \gamma) \frac{|v|^2}{d^2} \cdot s_1$$

with probability at least $1 - 2n^{-10}$ for our choice of $s_1$. Assume now that after $i - 1$ iterations we have

$$(1 - \gamma)^{2^{i-1} - 1} \frac{|v|^{2^{i-1}}}{d^{2^{i-1}}} s_{i-1} \le |f(v, i - 1)| \le (1 + \gamma)^{2^{i-1} - 1} \frac{|v|^{2^{i-1}}}{d^{2^{i-1}}} s_{i-1}. \tag{3}$$

Then Lemma 5 gives that after $i$ iterations of the squaring-and-sampling reduction, we have

$$(1 - \gamma)^{2^i - 1} \frac{|v|^{2^i} s_{i-1}^2}{d^{2^i}} \cdot \frac{s_i}{s_{i-1}^2} \le |f(v, i)| \le (1 + \gamma)^{2^i - 1} \frac{|v|^{2^i} s_{i-1}^2}{d^{2^i}} \cdot \frac{s_i}{s_{i-1}^2}$$

with probability at least $1 - 2n^{-10}$ for $s_i \ge \frac{30 \ln(n) d^{2^i}}{\gamma^2 (1 - \gamma)^{2^i - 2} |v|^{2^i}}$. This particularly means that

$$(1 - \gamma)^{2^i} \frac{|v|^{2^i}}{d^{2^i}} \cdot s_i \le |f(v, i)| \le (1 + \gamma)^{2^i} \frac{|v|^{2^i}}{d^{2^i}} \cdot s_i.$$

Now, to ensure these bounds, we assumed that $|f(v, i - 1)|$ satisfies certain bounds (see (3)). So in order to ensure that $f(v, i)$ satisfies the given bounds, we need $f(v, j)$ to satisfy similar bounds for every $1 \le j \le i$. By a union bound, we see that $|f(v, j)|$ satisfies both upper and lower bounds for all $1 \le j \le i$ (simultaneously) with probability at least $1 - 2in^{-10}$ when sampling $s_j \ge \frac{30 \ln(n) d^{2^j}}{\gamma^2 (1 - \gamma)^{2^j - 2} |v|^{2^j}}$ at step $j$. Hence, $|f(v, i)|$ satisfies the given bound with probability at least $1 - 2in^{-10}$.

The next section will use Lemma 6 to bound the Jaccard similarity after $i$ iterations of the squaring/sampling reduction.

For a given pair of vectors $a$ and $b$, Lemma 6 gives upper and lower bounds on the Jaccard similarity $J = J(f(a, i), f(b, i))$. We claim that with probability at least $1 - 6in^{-10}$:

$$J \ge \frac{(1 - \gamma)^{2^i - 1} \frac{|a \cap b|^{2^i}}{d^{2^i}} s_i}{(1 + \gamma)^{2^i - 1} \frac{|a|^{2^i}}{d^{2^i}} s_i + (1 + \gamma)^{2^i - 1} \frac{|b|^{2^i}}{d^{2^i}} s_i - (1 - \gamma)^{2^i - 1} \frac{|a \cap b|^{2^i}}{d^{2^i}} s_i} \ge \frac{(1 - \gamma)^{2^i} |a \cap b|^{2^i}}{(1 + \gamma)^{2^i} \left(|a|^{2^i} + |b|^{2^i}\right) - (1 - \gamma)^{2^i} |a \cap b|^{2^i}}$$

$$J \le \frac{(1 + \gamma)^{2^i - 1} \frac{|a \cap b|^{2^i}}{d^{2^i}} s_i}{(1 - \gamma)^{2^i - 1} \frac{|a|^{2^i}}{d^{2^i}} s_i + (1 - \gamma)^{2^i - 1} \frac{|b|^{2^i}}{d^{2^i}} s_i - (1 + \gamma)^{2^i - 1} \frac{|a \cap b|^{2^i}}{d^{2^i}} s_i} \le \frac{(1 + \gamma)^{2^i} |a \cap b|^{2^i}}{(1 - \gamma)^{2^i} \left(|a|^{2^i} + |b|^{2^i}\right) - (1 + \gamma)^{2^i} |a \cap b|^{2^i}}$$

This is easily seen by taking a union bound over the probabilities that each of $|a|$, $|b|$ and $|a \cap b|$ violate either the upper or the lower bound. Next, we claim that these bounds imply:

$$J \ge \frac{(1 - \gamma)^{2^i} |a \cap b|^{2^i}}{(1 + 4\gamma)^{2^i} \left(|a|^{2^i} + |b|^{2^i} - |a \cap b|^{2^i}\right)} \ge \frac{(1 - \gamma)^{2^i} |a \cap b|^{2^i}}{(1 + \gamma)^{2^i} \left(|a|^{2^i} + |b|^{2^i}\right) - (1 - \gamma)^{2^i} |a \cap b|^{2^i}}$$

$$J \le \frac{(1 + \gamma)^{2^i} |a \cap b|^{2^i}}{(1 - \gamma)^{2^i} \left(|a|^{2^i} + |b|^{2^i}\right) - (1 + \gamma)^{2^i} |a \cap b|^{2^i}} \le \frac{(1 + \gamma)^{2^i} |a \cap b|^{2^i}}{(1 - \gamma)^{2^i} \left(|a|^{2^i} + |b|^{2^i} - |a \cap b|^{2^i}\right)}.$$

The argument can be found in Appendix A. In particular, we have argued for the following lemma. We ignore the sample size for now and discuss it in Section 4.3.


Lemma 7.

Let $A$ and $B$ be an instance of Bichromatic Closest Pair with Jaccard similarity. After applying the squaring-and-sampling mapping, $f$, $i$ times as previously described to each set in $A$ and $B$, we have for all $n^2$ pairs $(a, b) \in A \times B$ in the instance that:

$$\left(\frac{1 - \gamma}{1 + \gamma}\right)^{2^i} \frac{|a \cap b|^{2^i}}{|a|^{2^i} + |b|^{2^i} - |a \cap b|^{2^i}} \le J\left(f(a, i), f(b, i)\right) \le \left(\frac{1 + \gamma}{1 - \gamma}\right)^{2^i} \frac{|a \cap b|^{2^i}}{|a|^{2^i} + |b|^{2^i} - |a \cap b|^{2^i}}$$

with probability at least $1 - 6in^{-8}$.

Hence, with high probability none of the Jaccard similarities diverge too much from (2) due to sampling. This was exactly what we wanted, as this allows us to reduce the dimension by sampling.
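A single round of the squaring-and-sampling reduction can be simulated to see that the sampled Jaccard similarity stays close to the value predicted by (2). This is a minimal sketch under assumptions: the universe size 40, the two example sets, the sample size 800 and the fixed seed are arbitrary illustrative choices rather than the parameters required by Lemma 6, and `square_and_sample` is a hypothetical helper name.

```python
import random
from itertools import product

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def square_and_sample(sets, universe, sample_size, rng):
    """One round of f: square every set, then keep only the elements that fall
    into one shared random sample z of the squared universe."""
    squared_universe = list(product(universe, repeat=2))
    z = set(rng.sample(squared_universe, sample_size))
    new_sets = [{p for p in product(s, repeat=2) if p in z} for s in sets]
    return new_sets, sorted(z)   # the sample becomes the new universe

rng = random.Random(1)
universe = list(range(40))
a = set(range(0, 30))            # |a| = 30
b = set(range(15, 35))           # |b| = 20, |a ∩ b| = 15

(fa, fb), new_universe = square_and_sample([a, b], universe, 800, rng)
predicted = 15 ** 2 / (30 ** 2 + 20 ** 2 - 15 ** 2)   # equation (2) with i = 1
sampled = jaccard(fa, fb)
```

Because all sets are restricted to the same sample $z$, the intersection of the sampled sets is exactly the sample of the squared intersection, so the three sizes $|a|$, $|b|$ and $|a \cap b|$ can be bounded separately as in Lemma 6.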

Recall that in our setting we reduce from instances where the set sizes of all red and blue sets are fixed. We now describe thresholds such that solving the instances constructed by the reduction $f$ cannot be done in truly subquadratic time.

Lemma 8.

Let $A$ and $B$ be two collections of $n$ sets from a universe of dimension $d$, where all sets in $A$ have size $y$ and all sets in $B$ have size $z$. Assume that $(A, B)$ is taken from a family of instances of Bichromatic Closest Pair with Jaccard similarity which require time $\Omega(n^{2-\delta})$ for thresholds $t_1 = \frac{x_1}{y + z - x_1}$ and $t_2 = \frac{x_2}{y + z - x_2}$. The reduction which applies $f$ $i$ times to each set $s \in A \cup B$ for $i \ge 1$ constructs an instance of Bichromatic Closest Pair with Jaccard similarity which requires time $\Omega(n^{2-\delta})$ for thresholds

$$t_1' = \left(\frac{1 - \gamma}{1 + \gamma}\right)^{2^i} \frac{x_1^{2^i}}{y^{2^i} + z^{2^i} - x_1^{2^i}}, \quad \text{and} \quad t_2' = \left(\frac{1 + \gamma}{1 - \gamma}\right)^{2^i} \frac{x_2^{2^i}}{y^{2^i} + z^{2^i} - x_2^{2^i}},$$

whose solution provides a valid solution to the original instance with high probability when sampling $s_j > \frac{30 \ln(n) d^{2^j}}{\gamma^2 (1 - \gamma)^{2^j - 2} x_2^{2^j}}$ at each step $1 \le j \le i$.

Proof.

Lemma 7 ensures that with high probability a solution to the constructed instance provides a valid solution to the original instance, since no pair of sets is likely to have a Jaccard similarity that deviates beyond the chosen thresholds.

In Lemma 7 we skipped the discussion of the sample size at each iteration; we will argue for it now. From Lemma 6, it is easily seen that we maximize the needed sample size for all of $|a|$, $|b|$ or $|a \cap b|$ for any choice of $a$ and $b$ in iteration $i$ by

$$s_i > \frac{30 \ln(n) d^{2^i}}{\gamma^2 (1 - \gamma)^{2^i - 2} \min_{(a, b) \in A \times B} \{|a \cap b|\}^{2^i}}.$$

Hence, sampling $s_i$ elements from the universe will ensure that each of the upper and lower bounds for either $|a|$, $|b|$ or $|a \cap b|$ will fail with probability at most $n^{-10}$ in that iteration. As $\min_{(a, b) \in A \times B} \{|a \cap b|\}$ is unknown, we instead use $x_2$, which was the intersection size for a pair with Jaccard similarity $t_2$. Such a pair need not exist, but as the set sizes are fixed, $x_2$ can be easily computed.

We have left to argue that the pairs with intersection smaller than $x_2$ also satisfy the bounds in Lemma 7 with high probability. The main observation is that they only need to satisfy the upper bound, as the resulting Jaccard similarities need only stay below the lower threshold, $t_2'$; the Jaccard similarities can become arbitrarily small without affecting the result.

By bounding the size of each term as we did in Lemma 6 using the chosen $s_i$, we see that the error probabilities are still at most $n^{-10}$ for each of $|a|$, $|b|$ and $|a \cap b|$ for any choice of $(a, b) \in A \times B$.

We are now ready to prove Theorem 2. We first give some intuition behind the proof and state a few lemmas to ease the proof. For convenience we restate Theorem 2.

Theorem 2.

Assuming the Orthogonal Vectors Conjecture (OVC), the following holds: for any $\delta > 0$, there exists an $\varepsilon > 0$ such that for any given $j_2 < j_1 < 1 - \delta$ satisfying $j_1 \le j_2^{1-\varepsilon}$, solving Bichromatic Closest Pair with Jaccard similarity for $n$ red and $n$ blue sets from a universe of size $\ln(n)/j_2^{O(\log(1/j_2))}$ for thresholds $j_1$ and $j_2$ requires time $\Omega(n^{2-\delta})$.

The proof of Theorem 2 reduces instances of Bichromatic Closest Pair as described in Section 2.3 by composing three reductions that together construct instances of Bichromatic Closest Pair with Jaccard similarity which require time $\Omega(n^{2-\delta})$ for the given thresholds $j_1$ and $j_2$ and some $\varepsilon$. A short description of each of the reductions can be found in Section 3. Below, we give three lemmas showing that these reductions preserve hardness.

The first lemma states that adding common elements to all sets in the instance will preserve hardness. This reduction increases the Jaccard similarity of all pairs of red and blue sets, and by choice of the number of added elements, we ensure that pairs of sets that initially had Jaccard similarity higher than the lower threshold will get Jaccard similarity greater than $1 - \delta$. Hence, we get hardness for thresholds that are greater than $1 - \delta$. From this point we can decrease the thresholds using two other reductions to achieve the given thresholds, which by assumption are less than $1 - \delta$.

The second lemma states that the squaring-and-sampling reduction, discussed in detail in Section 4, preserves hardness. The squaring-and-sampling reduction allows us to decrease the thresholds, so they come close to $j_1$ and $j_2$. Finally, the third lemma states that the reduction which adds elements to only red sets still preserves hardness. This reduction ensures that we can decrease the Jaccard similarity further. We will use it in such a way that we effectively multiply the upper bound by a well-chosen $\alpha$ that ensures that the upper threshold is $j_1$ after this reduction. The proof ends by picking an $\varepsilon$ such that $j_2$ is strictly greater than the current lower threshold, and thus preserves hardness for the thresholds $j_1$ and $j_2$.

In the following, assume that $A$ and $B$ are collections of $n$ red and $n$ blue sets from a universe $U$, respectively.

Lemma 9.

Let $0 < \delta \le 1$ be given and let $(A, B)$ be any instance of Bichromatic Closest Pair with Jaccard similarity as described in Lemma 3. Define $\ell := \max_{q \in A \cup B} \{|q|\} \cdot (1/\delta - 1)$ and $x := \{x_1, \ldots, x_\ell\}$ such that $x \cap (A \cup B) = \emptyset$, and further define the mapping $g : A \cup B \to A' \cup B'$ by $g(v) = v \cup x$, where $A' = A \cup x$ and equivalently $B' = B \cup x$. The reduction that applies $g$ to every element of $A$ and $B$ generates an instance $(A', B')$ of Bichromatic Closest Pair with Jaccard similarity that requires time $\Omega(n^{2-\delta})$ for some thresholds $t_1', t_2' \ge 1 - \delta$.

Proof.

First, note that if $v \in A$, then $g(v) \in A'$ and similarly if $v \in B$ then $g(v) \in B'$. We recall that instances of Bichromatic Closest Pair as described in Lemma 3 are constructed such that all red sets have the same size and all blue sets have the same size. We also have $\max_{q \in A \cup B} \{|q|\} = |a|$ for any $a \in A$, since the sets in $A$ are larger than the sets in $B$. It is easy to see that hardness is preserved under the reduction.

We finally argue that the resulting thresholds are larger than $1 - \delta$: Let $(a, b)$ be any pair from $A \times B$ which has Jaccard similarity at least $t_2$ and let $a' = g(a)$ and $b' = g(b)$. We argue that any such pair satisfies $|a \cap b| \ge |b|/2$: Note that with these particular instances of Bichromatic Closest Pair and from the proof of Lemma 4, we have

$$J(a, b) = \frac{|a \cap b|}{|a \cup b|} = \frac{|a \cap b|}{Tm + m - |a \cap b|} \ge t_2 = \frac{t_1}{2} = \frac{1}{2T}.$$

Since $|b| = m \ge |a \cap b|$, this implies

$$|a \cap b| \ge \frac{m}{2} + \frac{m}{2T} - \frac{|a \cap b|}{2T} \;\Rightarrow\; |a \cap b| \ge m/2 = |b|/2.$$

We will consider the Jaccard similarity of $a'$ and $b'$:

$$J(a', b') = \frac{|a \cap b| + |a|(1/\delta - 1)}{|a| + |a|(1/\delta - 1) + |b| + |a|(1/\delta - 1) - \left(|a \cap b| + |a|(1/\delta - 1)\right)} = \frac{|a \cap b| + |a|(1/\delta - 1)}{|a|/\delta + |b| - |a \cap b|}.$$

By assumption $|a \cap b| \ge |b|/2$, so:

$$\frac{|a \cap b| + |a|(1/\delta - 1)}{|a|/\delta + |b| - |a \cap b|} \ge \frac{|b|/2 + |a|(1/\delta - 1)}{|a|/\delta + |b|/2} \ge 1 - \delta \;\Leftrightarrow\; \frac{|b|}{2} + |a|(1/\delta - 1) \ge |a|(1/\delta - 1) + \frac{|b|}{2} - \frac{|b| \delta}{2}.$$

Hence $J(a', b') \ge 1 - \delta$ for any choice of $\delta > 0$, and so we construct an instance where every pair with Jaccard similarity higher than $t_2$ will have Jaccard similarity higher than $1 - \delta$. Thus, there are thresholds that are greater than $1 - \delta$ that make the constructed instance hard.

Lemma 10.

Let $0 < \delta \le 1$ be given and consider any instance of Bichromatic Closest Pair with Jaccard similarity, $(A, B)$, from a family of instances which require time $\Omega(n^{2-\delta})$ for thresholds $t_1$ and $t_2$. Using the reduction $f$ defined in Section 4 on each $v \in A \cup B$ for $i$ iterations where $i \ge 1$, we construct, with high probability, a valid instance of Bichromatic Closest Pair with Jaccard similarity which requires time $\Omega(n^{2-\delta})$ for thresholds that are decreasing functions of $i$.

Proof.

The lemma follows immediately from Lemma 8.

Lemma 11.

Let $0 < \delta \le 1$ be given and consider any instance of Bichromatic Closest Pair with Jaccard similarity, $(A, B)$, from a family of instances which require time $\Omega(n^{2-\delta})$ for thresholds $t_1$ and $t_2$. Define $\ell := \max_{q \in A \cup B} \{|q|\} \cdot (1/\alpha - 1)$ and $y := \{y_1, \ldots, y_\ell\}$ such that $y \cap (A \cup B) = \emptyset$. Define the mapping $h : A \to A'$, where $A' = A \cup y$, by $h(a) = a \cup y$. The reduction that applies $h$ to every element of $A$ generates an instance $(A', B)$ of Bichromatic Closest Pair with Jaccard similarity that requires time $\Omega(n^{2-\delta})$ for some thresholds $t_1', t_2'$.

Proof.

Clearly, hardness is preserved under the reduction that simply adds new elements to all red sets. In particular, this reduction decreases the thresholds by decreasing the similarity between red and blue pairs.
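The effect of the two padding reductions of Lemma 9 and Lemma 11 can be traced on a toy instance. The sketch below rests on assumptions: $T = 4$, $m = 6$, $\delta = 1/2$ and $\alpha = 1/2$ are arbitrary example values, the tagged tuples `("x", k)` and `("y", k)` stand in for the fresh elements $x_1, \ldots, x_\ell$ and $y_1, \ldots, y_\ell$, and all function names are hypothetical.

```python
from fractions import Fraction

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    inter = len(a & b)
    return Fraction(inter, len(a) + len(b) - inter)

def pad_size(red, blue, factor):
    """l = max set size * (1/factor - 1), for factor = delta or alpha."""
    largest = max(len(q) for q in red + blue)
    return int(largest * (1 / factor - 1))

def add_common_elements(red, blue, delta):
    """Lemma 9 style padding: the same fresh elements go into every set."""
    pad = {("x", k) for k in range(pad_size(red, blue, delta))}
    return [s | pad for s in red], [s | pad for s in blue]

def add_red_elements(red, blue, alpha):
    """Lemma 11 style padding: fresh elements go into the red sets only."""
    pad = {("y", k) for k in range(pad_size(red, blue, alpha))}
    return [s | pad for s in red], list(blue)

T, m = 4, 6
delta, alpha = Fraction(1, 2), Fraction(1, 2)
a = set(range(T * m))                                       # red set of size Tm
b_high = set(range(m))                                      # |a ∩ b| = m
b_low = set(range(m // 2)) | set(range(100, 100 + m // 2))  # |a ∩ b| = m/2

red1, blue1 = add_common_elements([a], [b_high, b_low], delta)
red2, blue2 = add_red_elements(red1, blue1, alpha)

after_common = [jaccard(red1[0], b) for b in blue1]
after_red = [jaccard(red2[0], b) for b in blue2]
```

In this example both pairs end up with similarity at least $1 - \delta$ after the common padding, and the red-only padding then scales the upper similarity by exactly $\alpha$, which is how the proof below steers the upper threshold to $j_1$.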

Proof.

For simplicity and readability we leave out most of the calculations; details can be found in Appendix B.

Let $\delta > 0$ and $j_1, j_2$ be given such that $j_2 < j_1 < 1 - \delta$. Take any instance of Bichromatic Closest Pair with Jaccard similarity satisfying the properties described in Lemma 3. Recall from this lemma that $T = O(1/\varepsilon)$.

First apply the reduction from Lemma 9 to achieve an instance which requires time $\Omega(n^{2-\delta})$ for thresholds greater than $1 - \delta$. We wish to reduce to an instance that is hard for smaller thresholds $j_1$ and $j_2$. The reduction from Lemma 10 is used to decrease the thresholds, where we pick the largest $i$ such that the resulting upper threshold $t_1'$ is no smaller than $j_1$, i.e., $t_1' \ge j_1$. This reduction decreases the thresholds until the upper threshold is only slightly greater than $j_1$. Now, let $\alpha = j_1 / t_1'$ and apply the reduction from Lemma 11 to ensure that the resulting upper threshold is now equal to $j_1$. This eventually gives an instance of Bichromatic Closest Pair with Jaccard similarity which cannot be solved in time $O(n^{2-\delta})$ for thresholds

$$t_1 = \alpha \left(\frac{1 - \gamma}{1 + \gamma}\right)^{2^i} \left(\frac{\delta}{T} + 1 - \delta\right)^{2^i}$$

$$t_2 = \left(\frac{1 + \gamma}{1 - \gamma}\right)^{2^i} \frac{\left(\frac{\delta}{2T} + 1 - \delta\right)^{2^i}}{\frac{1}{\alpha} + \left(\frac{\delta}{T} + 1 - \delta\right)^{2^i} - \left(\frac{\delta}{2T} + 1 - \delta\right)^{2^i}}$$

where we observe that by construction $t_1 = \alpha \cdot t_1' = j_1$. We refer to Appendix B for the calculations. So we have constructed an instance which is hard for thresholds $j_1$ and $t_2$.

Set $t^* = \left(\frac{1 + \gamma}{1 - \gamma}\right)^{2^i} \left(\frac{\delta}{2T} + 1 - \delta\right)^{2^i}$. Then $t_2 < \alpha t^*$ and so the hardness for $t_1 = j_1$ and $t_2$ implies hardness for $t_1 = j_1$ and $\alpha t^*$. We show that there is an $\varepsilon$ that only depends on $\delta$ such that $\alpha t^* < j_2$. Then the hardness for $t_1 = j_1$ and $\alpha t^*$ implies hardness for the given $j_1$ and $j_2$.

Note that we have chosen $\alpha \ge t_1'$, since otherwise $i$ could not be maximal. So we have:

$$\frac{\log(j_1)}{\log(\alpha t^*)} = \frac{\log(\alpha t_1')}{\log(\alpha t^*)} \le \frac{\log\left(t_1'^2\right)}{\log\left(t_1' \cdot \left(\frac{1 + \gamma}{1 - \gamma}\right)^{2^i} \cdot \left(\frac{\delta}{2T} + 1 - \delta\right)^{2^i}\right)} = \frac{2 \cdot 2^i \cdot \log\left(\left(\frac{1 - \gamma}{1 + \gamma}\right) \cdot (\delta/T + 1 - \delta)\right)}{2^i \cdot \log\left(\left(\frac{1 - \gamma}{1 + \gamma}\right) \cdot (\delta/T + 1 - \delta) \left(\frac{1 + \gamma}{1 - \gamma}\right) \cdot \left(\frac{\delta}{2T} + 1 - \delta\right)\right)}.$$

We need to show that this expression is bounded by $1 - \varepsilon$ for some $\varepsilon$ that depends on $\delta$, but not on $j_1$ and $j_2$. Observe that the factors $2^i$ cancel out and we may pick $\gamma$ small enough that it can essentially be ignored. We show in Appendix B that it suffices to take $\gamma < 1/2^{i+1}$ and below a bound proportional to $\delta/T$. Then for given $\delta$, there exists an $\varepsilon$ such that the expression is bounded by $1 - \varepsilon$, since $T$ can be considered a constant for a fixed $\delta$. Recall that $T$ was defined in Lemma 3. By the assumption $j_1 \le j_2^{1-\varepsilon}$ we then have $\alpha t^* < j_2$. Then the hardness for $t_1$ and $\alpha t^*$, where $t_1 = j_1$ and $\alpha t^* < j_2$, implies the desired hardness for the given $j_1$ and $j_2$.

We finally argue about the size of the universe of the instance constructed by the composition of the reductions described. In the following, $d$ is the size of the universe of the initial instance of Bichromatic Closest Pair with Jaccard similarity. In the proof of Lemma 8, we argued that we could use $x$, the size of the intersection for a pair with Jaccard similarity $j_2$, in the sample size $s_i$, which means that

$$s_i \ge \frac{30 \ln(n)\, d^{2^i}}{\gamma^2 (1-\gamma)^{2^i} x^{2^i}} = \frac{30 \ln(n)\, d^{2^i}}{\gamma^2 (1-\gamma)^{2^i} \left(j_2 (|a| + |b| - x)\right)^{2^i}} = \frac{30 \ln(n)}{\gamma^2 (1-\gamma)^{2^i} j_2^{2^i}} \cdot \left(\frac{\delta + 1}{1 + \frac{\delta}{2T}}\right)^{2^i}.$$

Again, the calculations can be found in Appendix B. Hence, the sets constructed by the composition of reductions come from a universe whose size is bounded by

$$|U| \le s_i + s_i (1/\alpha - 1) = \frac{s_i}{\alpha} \le \frac{30 \ln(n)}{\gamma^2 j_2^{2^i}} \left(\frac{\delta + 1}{\left(\frac{\delta}{T} + 1 - \delta\right)\left(\frac{\delta}{2T} + 1\right)}\right)^{2^i} \left(\frac{1 + 4\gamma}{(1-\gamma)^2}\right)^{2^i}.$$

By the choice of $i$ as maximal, the upper threshold obtained with $i + 1$ in place of $i$ is smaller than $j_1$, while the upper threshold obtained with $i$ is at least $j_1$. This implies that $2^i = O\left(\frac{\log j_1}{\log c}\right) = O\left(\log \frac{1}{j_1}\right)$ for a constant $c < 1$. Hence, we conclude that the size of the universe is $\ln(n) / j_2^{O(\log(1/j_2))}$. This finishes the proof of Theorem 2. ◀

On a final note, we remark that one can obtain a result similar to Theorem 2 for Braun-Blanquet similarity. Recall that we define Braun-Blanquet similarity for a pair of sets $(a, b) \in A \times B$ as

$$BB(a, b) = \frac{|a \cap b|}{\max\{|a|, |b|\}} \in [0, 1].$$

[...] $\varepsilon$ is an arbitrary constant between 0 and 1. Our techniques only work when $\varepsilon$ is sufficiently small.

References

[1] Andrei Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings, pages 21–29. IEEE, 1997.

[2] Lijie Chen. On the hardness of approximate and exact (bichromatic) maximum inner product. Pages 14:1–14:45, 2018. URL: https://doi.org/10.4230/LIPIcs.CCC.2018.14

[3] Lijie Chen and Ryan Williams. An equivalence class for orthogonal vectors. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 21–40, 2019. URL: https://doi.org/10.1137/1.9781611975482.2

[4] Tobias Christiani and Rasmus Pagh. Set similarity search beyond minhash. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017, pages 1094–1107, 2017. URL: https://doi.org/10.1145/3055399.3055443

[5] Ashish Goel, Aneesh Sharma, Dong Wang, and Zhijun Yin. Discovering similar users on twitter. 2013.

[6] Pankaj Gupta, Ashish Goel, Jimmy J. Lin, Aneesh Sharma, Dong Wang, and Reza Zadeh. WTF: the who to follow service at twitter. Pages 505–514, 2013. URL: https://doi.org/10.1145/2488388.2488433

[7] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, May 23-26, 1998, pages 604–613, 1998. URL: https://doi.org/10.1145/276698.276876

[8] Aviad Rubinstein. Hardness of approximate nearest neighbor search. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 1260–1268, 2018. URL: https://doi.org/10.1145/3188745.3188916

[9] Gregory Valiant. Finding correlations in subquadratic time, with applications to learning parities and the closest pair problem. J. ACM, 62(2):13:1–13:45, 2015. URL: https://doi.org/10.1145/2728167

[10] Virginia Vassilevska Williams. Some open problems in fine-grained complexity. SIGACT News, 49(4):29–35, 2018. URL: https://doi.org/10.1145/3300150.3300158

A Bounds on Jaccard similarity after squaring-and-sampling reduction

We show that the bounds given in Section 4.2 hold. We want to show the upper bound

$$\frac{(1+\gamma)^{2^i} |a \cap b|^{2^i}}{(1-\gamma)^{2^i} \left(|a|^{2^i} + |b|^{2^i}\right) - (1+\gamma)^{2^i} |a \cap b|^{2^i}} \le \frac{(1+\gamma)^{2^i} |a \cap b|^{2^i}}{(1-4\gamma)^{2^i} \left(|a|^{2^i} + |b|^{2^i} - |a \cap b|^{2^i}\right)} \qquad (4)$$

and similarly the lower bound

$$\frac{(1-\gamma)^{2^i} |a \cap b|^{2^i}}{(1+\gamma)^{2^i} \left(|a|^{2^i} + |b|^{2^i}\right) - (1-\gamma)^{2^i} |a \cap b|^{2^i}} \ge \frac{(1-\gamma)^{2^i} |a \cap b|^{2^i}}{(1+4\gamma)^{2^i} \left(|a|^{2^i} + |b|^{2^i} - |a \cap b|^{2^i}\right)}. \qquad (5)$$

Observe that (4) is true when

$$(1-\gamma)^{2^i} |a|^{2^i} + (1-\gamma)^{2^i} |b|^{2^i} - (1+\gamma)^{2^i} |a \cap b|^{2^i} \ge (1-4\gamma)^{2^i} \left(|a|^{2^i} + |b|^{2^i} - |a \cap b|^{2^i}\right),$$

$$\left((1-\gamma)^{2^i} - (1-4\gamma)^{2^i}\right) \left(|a|^{2^i} + |b|^{2^i}\right) \ge \left((1+\gamma)^{2^i} - (1-4\gamma)^{2^i}\right) |a \cap b|^{2^i},$$

$$|a|^{2^i} + |b|^{2^i} \ge \frac{(1+\gamma)^{2^i} - (1-4\gamma)^{2^i}}{(1-\gamma)^{2^i} - (1-4\gamma)^{2^i}} \cdot |a \cap b|^{2^i},$$

which, since $|a|, |b| \ge |a \cap b|$, in particular holds if (note that this is a loose bound)

$$\frac{(1+\gamma)^{2^i} - (1-4\gamma)^{2^i}}{(1-\gamma)^{2^i} - (1-4\gamma)^{2^i}} \le 2 \quad\Leftrightarrow\quad (1+\gamma)^{2^i} \le 2(1-\gamma)^{2^i} - (1-4\gamma)^{2^i}.$$

Similarly we see that (5) is true when

$$(1+\gamma)^{2^i} \left(|a|^{2^i} + |b|^{2^i}\right) - (1-\gamma)^{2^i} |a \cap b|^{2^i} \le (1+4\gamma)^{2^i} \left(|a|^{2^i} + |b|^{2^i} - |a \cap b|^{2^i}\right),$$

$$\left((1+4\gamma)^{2^i} - (1-\gamma)^{2^i}\right) |a \cap b|^{2^i} \le \left((1+4\gamma)^{2^i} - (1+\gamma)^{2^i}\right) \left(|a|^{2^i} + |b|^{2^i}\right),$$

$$\frac{(1+4\gamma)^{2^i} - (1-\gamma)^{2^i}}{(1+4\gamma)^{2^i} - (1+\gamma)^{2^i}} \cdot |a \cap b|^{2^i} \le |a|^{2^i} + |b|^{2^i},$$

which in particular holds if

$$\frac{(1+4\gamma)^{2^i} - (1-\gamma)^{2^i}}{(1+4\gamma)^{2^i} - (1+\gamma)^{2^i}} \le 2 \quad\Leftrightarrow\quad 2(1+\gamma)^{2^i} \le (1+4\gamma)^{2^i} + (1-\gamma)^{2^i}.$$

So we want to ensure that for any choice of $i$ there exists a $\gamma$ which satisfies

$$(1+\gamma)^{2^i} \le 2(1-\gamma)^{2^i} - (1-4\gamma)^{2^i}, \qquad (6)$$

$$2(1+\gamma)^{2^i} \le (1+4\gamma)^{2^i} + (1-\gamma)^{2^i}. \qquad (7)$$

We begin with (6): we see that we can bound the right-hand side as follows using Bernoulli's inequality,

$$(1+\gamma)^{2^i} \le 2(1 - 2^i \gamma) - (1 - 4 \cdot 2^i \cdot \gamma) < 2(1-\gamma)^{2^i} - (1-4\gamma)^{2^i},$$

and so (4) holds if we can show that

$$(1+\gamma)^{2^i} \le 2(1 - 2^i \gamma) - (1 - 4 \cdot 2^i \cdot \gamma) = 2 - 2 \cdot 2^i \gamma - 1 + 4 \cdot 2^i \cdot \gamma = 1 + 2^{i+1} \gamma.$$

Now consider (7): as before, we can bound the right-hand side by

$$2(1+\gamma)^{2^i} \le (1 + 4 \cdot 2^i \cdot \gamma) + (1 - 2^i \cdot \gamma) \le (1+4\gamma)^{2^i} + (1-\gamma)^{2^i},$$

and so (5) holds if we can show that

$$(1+\gamma)^{2^i} \le 1 + \tfrac{3}{2} \cdot 2^i \cdot \gamma. \qquad (8)$$

Note that both (4) and (5) are satisfied if $(1+\gamma)^{2^i} \le 1 + \tfrac{3}{2} \cdot 2^i \cdot \gamma$, which holds for $\gamma < \frac{1}{2^{i+1}}$ for every choice of $i$.

B Proof details for Theorem 2

We here give the calculations at each step of the proof of Theorem 2. Given $\delta$, we construct instances of Bichromatic Closest Pair with Jaccard similarity which require time $\Omega(n^{2-\delta})$ for certain thresholds. We will mainly give the thresholds after each reduction, as the ultimate goal is to show that there exists $\varepsilon$ such that the constructed instance is hard for any choice of $j_1$ and $j_2$ with $j_1 \le j_2^{1-\varepsilon}$. Hence we construct instances which preserve hardness under varying thresholds and finally obtain an instance for which we can find $\varepsilon$, and thus ensure hardness for any given thresholds $j_1$ and $j_2$ which satisfy $j_1 \le j_2^{1-\varepsilon}$. Each of the thresholds is indexed by a number (1 for upper thresholds and 2 for lower thresholds) and a letter (d for the thresholds after the reduction adding common elements to all sets, s for the thresholds after the squaring reduction, and a for the thresholds after adding elements to the red sets only).

Let $\delta > 0$ and $j_2 < j_1 < 1 - \delta$ be given. Let the initial thresholds be as in the proof of Lemma 4. Then Bichromatic Closest Pair with Jaccard similarity cannot be solved in time $O(n^{2-\delta})$ for these thresholds. We will preserve this hardness under a series of reductions.

Now add $\max_{x \in A \cup B}\{|x|\}(1/\delta - 1)$ common values to all sets in $A$ and $B$. Since all sets in $A$ have size $Tm$ and all sets in $B$ have size $m$, we get an instance which requires time $\Omega(n^{2-\delta})$ for thresholds

$$j_{1d} = \frac{m + Tm(1/\delta - 1)}{Tm + Tm(1/\delta - 1) + m + Tm(1/\delta - 1) - (m + Tm(1/\delta - 1))} = \frac{m + Tm/\delta - Tm}{Tm + Tm/\delta - Tm} = \frac{\delta(1 + T/\delta - T)}{T} = \delta/T + 1 - \delta > 1 - \delta,$$

$$j_{2d} = \frac{m/2 + Tm(1/\delta - 1)}{Tm + Tm(1/\delta - 1) + m + Tm(1/\delta - 1) - (m/2 + Tm(1/\delta - 1))} = \frac{m/2 + Tm/\delta - Tm}{Tm/\delta + m + Tm/\delta - Tm - m/2 - Tm/\delta + Tm} = \frac{\frac{1}{2T} + 1/\delta - 1}{1/\delta + \frac{1}{2T}} = \frac{\frac{\delta}{2T} + 1 - \delta}{1 + \frac{\delta}{2T}} > 1 - \delta.$$

Note that both thresholds are greater than $1 - \delta$. Clearly, Bichromatic Closest Pair with Jaccard similarity still requires time $\Omega(n^{2-\delta})$ for thresholds $j_{1d}$ and $j_{2d}$.

We now use the squaring-and-sampling reduction on each set in the current instance to reduce the Jaccard similarity between all pairs of sets. We let the thresholds be as follows, where $i$ is maximal such that $j_{1s} \ge j_1$. In order to satisfy (8) we require that $\gamma \le 1/2^{i+1}$.

$$j_{1s} = \frac{\left(\frac{1-\gamma}{1+4\gamma}\right)^{2^i} (m + Tm(1/\delta - 1))^{2^i}}{(Tm + Tm(1/\delta - 1))^{2^i} + (m + Tm(1/\delta - 1))^{2^i} - (m + Tm(1/\delta - 1))^{2^i}} = \left(\frac{1-\gamma}{1+4\gamma}\right)^{2^i} \frac{(m + Tm/\delta - Tm)^{2^i}}{(Tm/\delta)^{2^i}} = \left(\frac{1-\gamma}{1+4\gamma}\right)^{2^i} (\delta/T + 1 - \delta)^{2^i},$$

$$j_{2s} = \frac{\left(\frac{1+\gamma}{1-4\gamma}\right)^{2^i} (m/2 + Tm(1/\delta - 1))^{2^i}}{(Tm + Tm(1/\delta - 1))^{2^i} + (m + Tm(1/\delta - 1))^{2^i} - (m/2 + Tm(1/\delta - 1))^{2^i}} = \left(\frac{1+\gamma}{1-4\gamma}\right)^{2^i} \frac{(1/2 + T/\delta - T)^{2^i}}{(T/\delta)^{2^i} + (1 + T/\delta - T)^{2^i} - (1/2 + T/\delta - T)^{2^i}}$$

$$= \left(\frac{1+\gamma}{1-4\gamma}\right)^{2^i} \frac{\left(\frac{1}{2T} + 1/\delta - 1\right)^{2^i}}{(1/\delta)^{2^i} + (1/T + 1/\delta - 1)^{2^i} - \left(\frac{1}{2T} + 1/\delta - 1\right)^{2^i}}.$$

Hence, using the squaring-and-sampling reduction, we constructed an instance of Bichromatic Closest Pair with Jaccard similarity which requires time $\Omega(n^{2-\delta})$ for thresholds $j_{1s}$ and $j_{2s}$. Note that $j_{1s}$ is only slightly larger than $j_1$. The next step will decrease the upper threshold to become equal to $j_1$, which will allow us to argue about $\varepsilon$.

The final reduction adds $\max_{v \in A \cup B}\{|v|\}(1/\alpha - 1)$ elements to all red sets, where we let $\alpha = j_1 / j_{1s}$, since then $j_1 = j_{1a} = \alpha j_{1s}$. The number of elements that we add to the red sets ensures that we get an instance of Bichromatic Closest Pair which is hard for upper threshold $j_{1a} = \alpha j_{1s} = j_1$ and some lower threshold $j_{2a}$:

$$j_{1a} = \frac{\left(\frac{1-\gamma}{1+4\gamma}\right)^{2^i} (m + Tm/\delta - Tm)^{2^i}}{(Tm/\delta)^{2^i} + (Tm/\delta)^{2^i}(1/\alpha - 1) + (m + Tm/\delta - Tm)^{2^i} - (m + Tm/\delta - Tm)^{2^i}} = \left(\frac{1-\gamma}{1+4\gamma}\right)^{2^i} \frac{\alpha (m + Tm/\delta - Tm)^{2^i}}{(Tm/\delta)^{2^i}} = \alpha \left(\frac{1-\gamma}{1+4\gamma}\right)^{2^i} (\delta/T + 1 - \delta)^{2^i} = \alpha j_{1s},$$

$$j_{2a} = \frac{\left(\frac{1+\gamma}{1-4\gamma}\right)^{2^i} (m/2 + Tm/\delta - Tm)^{2^i}}{(Tm/\delta)^{2^i} + (Tm/\delta)^{2^i}(1/\alpha - 1) + (m + Tm/\delta - Tm)^{2^i} - (m/2 + Tm/\delta - Tm)^{2^i}} = \left(\frac{1+\gamma}{1-4\gamma}\right)^{2^i} \frac{\left(\frac{1}{2T} + 1/\delta - 1\right)^{2^i}}{\frac{(1/\delta)^{2^i}}{\alpha} + \left(\frac{1}{T} + 1/\delta - 1\right)^{2^i} - \left(\frac{1}{2T} + 1/\delta - 1\right)^{2^i}}.$$

Observe further that

$$j_{2a} < \alpha \left(\frac{1+\gamma}{1-4\gamma}\right)^{2^i} \left(\frac{\delta}{2T} + 1 - \delta\right)^{2^i} = \alpha j^*.$$

So Bichromatic Closest Pair with thresholds $j_1$ and $\alpha j^*$ still requires time $\Omega(n^{2-\delta})$, for a well-chosen $\gamma$. A simple calculation shows that any $\gamma$ below a bound proportional to $\delta/T$ suffices. Hence, we choose $\gamma$ below both this bound and $1/2^{i+1}$, in order to also satisfy (8).

We will now find $\varepsilon$ such that $j_1 > (\alpha j^*)^{1-\varepsilon}$ and so, by the assumption that $j_1 \le j_2^{1-\varepsilon}$, we conclude that $j_2 > \alpha j^*$. Hence, the constructed instance of Bichromatic Closest Pair is also hard for thresholds $j_1$ and $j_2$.

Note that $\alpha \ge j_{1s}$: otherwise $\alpha = j_1 / j_{1s} < j_{1s}$ and so $j_1 < j_{1s}^2$, which contradicts the assumption that $i$ was maximal. We observe that there exists an $\varepsilon > 0$ such that

$$\frac{\log j_1}{\log(\alpha j^*)} = \frac{\log(\alpha j_{1s})}{\log\left(\alpha \left(\frac{1+\gamma}{1-4\gamma}\right)^{2^i} \left(\frac{\delta}{2T} + 1 - \delta\right)^{2^i}\right)} < \frac{\log\left(j_{1s}^2\right)}{\log\left(j_{1s} \left(\frac{1+\gamma}{1-4\gamma}\right)^{2^i} \left(\frac{\delta}{2T} + 1 - \delta\right)^{2^i}\right)}$$

$$= \frac{\log\left(\left(\frac{1-\gamma}{1+4\gamma} (\delta/T + 1 - \delta)\right)^{2 \cdot 2^i}\right)}{\log\left(\left(\frac{1-\gamma}{1+4\gamma} \left(\frac{\delta}{T} + 1 - \delta\right) \frac{1+\gamma}{1-4\gamma} \left(\frac{\delta}{2T} + 1 - \delta\right)\right)^{2^i}\right)} = \frac{2 \log\left(\frac{1-\gamma}{1+4\gamma} (\delta/T + 1 - \delta)\right)}{\log\left(\frac{1-\gamma}{1+4\gamma} \left(\frac{\delta}{T} + 1 - \delta\right) \frac{1+\gamma}{1-4\gamma} \left(\frac{\delta}{2T} + 1 - \delta\right)\right)} = 1 - \varepsilon.$$

For $\gamma$ chosen as above we have that $\varepsilon = \Theta(1/T)$.

Since $\frac{\log j_1}{\log(\alpha j^*)} < 1 - \varepsilon$ and $\frac{\log j_1}{\log j_2} \ge 1 - \varepsilon$ by assumption, we have $\alpha j^* < j_2$. So the hardness for $j_{1a} = j_1$ and $\alpha j^*$ implies hardness for $j_1$ and $j_2$, as we wanted.

We finally discuss the size of the universe.
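This size depends on the number of elements sampled in the squaring-and-sampling reduction. Recall that we use the squaring-and-sampling reduction on an instance of Bichromatic Closest Pair where we have already added common elements to all sets in $A$ and $B$ and which is hard for thresholds $j_{1d}$ and $j_{2d}$.

Let $x$ be the size of the intersection between a red and a blue set that have Jaccard similarity $j_{2d}$. Note that we can easily compute

$$x = \frac{|a| + |b| - d_H(a, b)}{2} = \frac{(Tm + Tm(1/\delta - 1)) + (m + Tm(1/\delta - 1)) - Tm}{2} = m/2 + Tm(1/\delta - 1).$$

The sample size then satisfies

$$s_i \ge \frac{30 \ln(n)\, d^{2^i}}{\gamma^2 (1-\gamma)^{2^i} x^{2^i}} = \frac{30 \ln(n)\, d^{2^i}}{\gamma^2 (1-\gamma)^{2^i} \left(j_{2d} (|a| + |b| - x)\right)^{2^i}}$$

$$= \frac{30 \ln(n)}{\gamma^2 (1-\gamma)^{2^i} j_{2d}^{2^i}} \left(\frac{2Tm + Tm(1/\delta - 1)}{Tm + Tm(1/\delta - 1) + m + Tm(1/\delta - 1) - (m/2 + Tm(1/\delta - 1))}\right)^{2^i} = \frac{30 \ln(n)}{\gamma^2 (1-\gamma)^{2^i} j_{2d}^{2^i}} \cdot \left(\frac{\delta + 1}{1 + \frac{\delta}{2T}}\right)^{2^i}.$$

Recalling that $\alpha \ge j_{1s}$, we conclude that the universe has size

$$|U| \le s_i + s_i(1/\alpha - 1) = \frac{s_i}{\alpha} \le \frac{30 \ln(n)}{\gamma^2 (1-\gamma)^{2^i} j_{2d}^{2^i}} \cdot \left(\frac{\delta + 1}{1 + \frac{\delta}{2T}}\right)^{2^i} \cdot \frac{1}{\left(\frac{1-\gamma}{1+4\gamma}\right)^{2^i} \left(\frac{\delta}{T} + 1 - \delta\right)^{2^i}} = \frac{30 \ln(n)}{\gamma^2 j_{2d}^{2^i}} \left(\frac{\delta + 1}{\left(\frac{\delta}{T} + 1 - \delta\right)\left(\frac{\delta}{2T} + 1\right)}\right)^{2^i} \left(\frac{1 + 4\gamma}{(1-\gamma)^2}\right)^{2^i}.$$

Remark.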

Since $\varepsilon = \Theta(1/T)$ we get the same dependence of $\varepsilon$ on $\delta$ as that of Rubinstein, discussed in [8, Remark 1.4]. We note that $\varepsilon$ also depends on the unspecified function $c(\delta)$ in OVC, so we are not able to express it as a function of $\delta$, but the value is at least exponentially decreasing in $1/\delta$.
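The closed forms in this appendix can be sanity-checked numerically. The sketch below is illustrative only and rests on assumptions read off the derivation: red sets of size $Tm$, blue sets of size $m$, intersections of size $m$ and $m/2$ at the upper and lower threshold, and $Tm(1/\delta - 1)$ common elements added to every set. The function names and the concrete values of $m$, $T$, $\delta$ are hypothetical, and `epsilon_limit` ignores $\gamma$ entirely (the limit $\gamma \to 0$).

```python
from fractions import Fraction
from math import log


def padded_thresholds(m, T, delta):
    """Jaccard thresholds after adding T*m*(1/delta - 1) common elements to all sets."""
    pad = T * m * (1 / delta - 1)        # number of common elements added
    red, blue = T * m + pad, m + pad     # set sizes after padding

    def jac(inter):
        # Jaccard similarity from the two set sizes and the intersection size.
        return inter / (red + blue - inter)

    # Upper threshold: intersection m; lower threshold: intersection m/2 (assumed).
    return jac(m + pad), jac(Fraction(m, 2) + pad)


def epsilon_limit(T, delta):
    """Value of 1 - 2*log(A)/log(A*B) with gamma ignored (illustrative only)."""
    A = float(delta / T + 1 - delta)
    B = float(delta / (2 * T) + 1 - delta)
    return 1 - 2 * log(A) / log(A * B)


# Hypothetical parameter values chosen only to exercise the formulas.
m, T, delta = 10, 4, Fraction(1, 5)
j1d, j2d = padded_thresholds(m, T, delta)
eps = epsilon_limit(T, delta)
```

With exact rational arithmetic the two thresholds agree with $\delta/T + 1 - \delta$ and $(\delta/(2T) + 1 - \delta)/(1 + \delta/(2T))$, and both exceed $1 - \delta$.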