Optimal Hashing-based Time-Space Trade-offs for Approximate Near Neighbors
Alexandr Andoni, Thijs Laarhoven, Ilya Razenshteyn, Erik Waingarten
OOptimal Hashing-based Time–Space Trade-offs forApproximate Near Neighbors ∗ Alexandr AndoniColumbia Thijs LaarhovenIBM Research Zürich Ilya RazenshteynMIT CSAIL Erik WaingartenColumbiaMay 23, 2017
Abstract
We show tight upper and lower bounds for time–space trade-offs for the c -Approximate NearNeighbor Search problem. For the d -dimensional Euclidean space and n -point datasets, wedevelop a data structure with space n ρ u + o (1) + O ( dn ) and query time n ρ q + o (1) + dn o (1) forevery ρ u , ρ q ≥ c √ ρ q + ( c − √ ρ u = p c − . (1)For example, for the approximation c = 2 we can achieve: • Space n . ... and query time n o (1) , significantly improving upon known data structuresthat support very fast queries [IM98, KOR00]; • Space n . ... and query time n . ... , matching the optimal data-dependent Locality-Sensitive Hashing (LSH) from [AR15]; • Space n o (1) and query time n . ... , making significant progress in the regime of near-linearspace, which is arguably of the most interest for practice [LJW + every approximation factor c >
1, improving upon [Kap15]. The data structure is a culmination ofa long line of work on the problem for all space regimes; it builds on spherical Locality-SensitiveFiltering [BDGL16] and data-dependent hashing [AINR14, AR15].Our matching lower bounds are of two types: conditional and unconditional. First, we provetightness of the whole trade-off (1) in a restricted model of computation, which captures allknown hashing-based approaches. We then show unconditional cell-probe lower bounds for oneand two probes that match (1) for ρ q = 0, improving upon the best known lower bounds from[PTW10]. In particular, this is the first space lower bound (for any static data structure) fortwo probes which is not polynomially smaller than the corresponding one-probe bound. To showthe result for two probes, we establish and exploit a connection to locally-decodable codes . ∗ This paper merges two arXiv preprints: [Laa15c] (appeared online on November 24, 2015) and [ALRW16] (appearedonline on May 9, 2016), and subsumes both of these articles. An extended abstract of this paper appeared in theproceedings of 28th Annual ACM–SIAM Symposium on Discrete Algorithms (SODA ’2017). a r X i v : . [ c s . D S ] M a y ontents Introduction
The Near Neighbor Search problem (NNS) is a basic and fundamental problem in computationalgeometry, defined as follows. We are given a dataset P of n points from a metric space ( X, d X ) anda distance threshold r >
0. The goal is to preprocess P in order to answer near neighbor queries :given a query point q ∈ X , return a dataset point p ∈ P with d X ( q, p ) ≤ r , or report that there is nosuch point. The d -dimensional Euclidean ( R d , ‘ ) and Manhattan/Hamming ( R d , ‘ ) metric spaceshave received the most attention. Besides classical applications to similarity search over many typesof data (text, audio, images, etc; see [SDI06] for an overview), NNS has been also recently used forcryptanalysis [MO15, Laa15a, Laa15b, BDGL16] and optimization [DRT11, HLM15, ZYS16].The performance of an NNS data structure is primarily characterized by two key metrics: • space: the amount of memory a data structure occupies, and • query time: the time it takes to answer a query.All known time-efficient data structures for NNS (e.g., [Cla88, Mei93]) require space exponentialin the dimension d , which is prohibitively expensive unless d is very small. To overcome this so-called curse of dimensionality , researchers proposed the ( c, r )- Approximate
Near Neighbor Search problem,or ( c, r )-ANN. In this relaxed version, we are given a dataset P and a distance threshold r >
0, aswell as an approximation factor c >
1. Given a query point q with the promise that there is atleast one data point in P within distance at most r from q , the goal is to return a data point p ∈ P within a distance at most cr from q .ANN does allow efficient data structures with a query time sublinear in n , and only polyno-mial dependence in d in all parameters [IM98, GIM99, KOR00, Ind01a, Ind01b, Cha02, CR04,DIIM04, Pan06, AI08, TT07, AC09, AINR14, Kap15, AR15, Pag16, BDGL16, ARN17, ANRW17].In practice, ANN algorithms are often successful even when one is interested in exact nearestneighbors [ADI +
06, AIL + A classic technique for ANN is
Locality-Sensitive Hashing (LSH), introduced in 1998 by Indyk andMotwani [IM98, HIM12]. The main idea is to use random space partitions , for which a pair of closepoints (at distance at most r ) is more likely to belong to the same part than a pair of far points (atdistance more than cr ). Given such a partition, the data structure splits the dataset P accordingto the partition, and, given a query, retrieves all the data points which belong to the same part1s the query. In order to return a near-neighbor with high probability of success, one maintainsseveral partitions and checks all of them during the query stage. LSH yields data structures withspace O ( n ρ + dn ) and query time O ( dn ρ ), where ρ is the key quantity measuring the quality ofthe random space partition for a particular metric space and approximation c ≥
1. Usually, ρ = 1for c = 1 and ρ → c → ∞ .Since the introduction of LSH in [IM98], subsequent research established optimal values of theLSH exponent ρ for several metrics of interest, including ‘ and ‘ . For the Manhattan distance ( ‘ ),the optimal value is ρ = c ± o (1) [IM98, MNP07, OWZ14]. For the Euclidean metric ( ‘ ), it is ρ = c ± o (1) [IM98, DIIM04, AI08, MNP07, OWZ14].More recently, it has been shown that better bounds on ρ are possible if random space partitionsare allowed to depend on the dataset . That is, the algorithm is based on an observation that everydataset has some structure to exploit. This more general framework of data-dependent LSH yields ρ = c − + o (1) for the ‘ distance, and ρ = c − + o (1) for ‘ [AINR14, Raz14, AR15]. Moreover,these bounds are known to be tight for data-dependent LSH [AR16]. Since the early results on LSH, a natural question has been whether one can obtain query time vs.space trade-offs for a fixed approximation c . Indeed, data structures with polynomial space and poly-logarithmic query time were introduced [IM98, KOR00] simultaneously with LSH.In practice, the most important regime is that of near-linear space, since space is usually a harderconstraint than time: see, e.g., [LJW + +
07, Kap15, AIL + +
07, AIL + all approximations c >
1. For example, the best currentlyknown algorithm of [Kap15] obtained query time of roughly n / ( c +1) , which becomes trivial for c < √ Lower bounds for NNS and ANN have also received considerable attention. Such lower boundsare ideally obtained in the cell-probe model [MNSW98, Mil99], where one measures the number ofmemory cells the query algorithm accesses. Despite a number of success stories, high cell-probe lowerbounds are notoriously hard to prove. In fact, there are few techniques for proving high cell-probelower bounds, for any (static) data structure problem. For ANN in particular, we have no viable Let us note that the idea of data-dependent random space partitions is ubiquitous in practice, see, e.g., [WSSJ14,WLKC15] for a survey. But the perspective in practice is that the given datasets are not “worst case” and hence it ispossible to adapt to the additional “nice” structure. ω (log n ) query time lower bounds. Due to this state of affairs, one may rely on restricted models of computation, which nevertheless capture existing algorithmic approaches.Early lower bounds for NNS were obtained for data structures in exact or deterministic set-tings [BOR99, CCGL99, BR02, Liu04, JKKR04, CR04, PT06, Yin16]. [CR04, LPY16] obtainedan almost tight cell-probe lower bound for the randomized Approximate Nearest
Neighbor Searchunder the ‘ distance. In that problem, there is no distance threshold r , and instead the goal is tofind a data point that is not much further than the closest data point. This twist is the main sourceof hardness, so the result is not applicable to the ANN problem as introduced above.There are few results that show lower bounds for randomized data structures for ANN. The firstsuch result [AIP06] shows that any data structure that solves (1 + ε, r )-ANN for ‘ or ‘ using t cellprobes requires space n Ω(1 /tε ) . This result shows that the algorithms of [IM98, KOR00] are tightup to constants in the exponent for t = O (1).In [PTW10] (following up on [PTW08]), the authors introduce a general framework for provinglower bounds for ANN under any metric. They show that lower bounds for ANN are implied bythe robust expansion of the underlying metric space. Using this framework, [PTW10] show that( c, r )-ANN using t cell probes requires space n /tc ) for the Manhattan distance and n /tc ) for the Euclidean distance (for every c > ‘ ∞ distance, [ACP08] showa lower bound for deterministic ANN data structures. This lower bound was later generalizedto randomized data structures [PTW10, KP12]. A recent result [AV15] adapts the frameworkof [PTW10] to Bregman divergences.To prove higher lower bounds, researchers resorted to lower bounds for restricted models. Theseexamples include: decision trees [ACP08] (the corresponding upper bound [Ind01b] is in the samemodel), LSH [MNP07, OWZ14, AIL +
15] and data-dependent LSH [AR16].
We give an algorithm obtaining the entire range of time–space tradeoffs, obtaining sublinear querytime for all c >
1, for the entire space R d . Our main theorem is the following: Theorem 1.1 (see Sections 3 and 4) . For every c > , r > , ρ q ≥ and ρ u ≥ such that c √ ρ q + (cid:0) c − (cid:1) √ ρ u ≥ p c − , (2) there exists a data structure for ( c, r ) -ANN for the Euclidean space R d , with space n ρ u + o (1) + O ( dn ) and query time n ρ q + o (1) + dn o (1) . This algorithm has optimal exponents for all hashing-based algorithms, as well as one- andtwo-probe data structures, as we prove in later sections. In particular, Theorem 1.1 recovers orimproves upon all earlier results on ANN in the entire time-space trade-off. For the near-linear The correct dependence on 1 /ε requires the stronger Lopsided Set Disjointness lower bound from [Pˇat11]. ρ u = 0, we obtain space n o (1) with query time n c − c + o (1) , which is sublinearfor every c >
1. For ρ q = ρ u , we recover the best data-dependent LSH bound from [AR15], withspace n c − + o (1) and query time n c − + o (1) . Finally, setting ρ q = 0, we obtain query time n o (1) and space n (cid:16) c c − (cid:17) + o (1) , which, for c = 1 + ε with ε →
0, becomes n / (4 ε )+ ... .Using a reduction from [Ngu14], we obtain a similar trade-off for the ‘ p spaces for 1 ≤ p < c replaced with c p . In particular, for the ‘ distance we get: c √ ρ q + (cid:0) c − (cid:1) √ ρ u ≥ √ c − . Our algorithms can support insertions/deletions with only logarithmic loss in space/query time,using the dynamization technique for decomposable search problems from [OvL81], achieving updatetime of dn ρ u + o (1) . To apply this technique, one needs to ensure that the preprocessing time isnear-linear in the space used, which is the case for our data structure. We now describe the proof of Theorem 1.1 at a high level. It consists of two major stages. In thefirst stage, we give an algorithm for random
Euclidean instances (introduced formally in Section 2).In the random Euclidean instances, we generate a dataset uniformly at random on a unit sphere S d − ⊂ R d and plant a query at random within distance √ /c from a randomly chosen data point.In the second stage, we show the claimed result for the worst-case instances by combining ideasfrom the first stage with data-dependent LSH from [AINR14, AR15]. Data-independent partitions.
To handle random instances, we use a certain data-independent random process, which we briefly introduce below. It can be seen as a modification of sphericalLocality-Sensitive Filtering from [BDGL16], and is related to a cell-probe upper bound from [PTW10].While this data-independent approach can be extended to worst case instances, it gives a boundsignificantly worse than (2).We now describe the random process which produces a decision tree to solve an instance of ANNon a
Euclidean unit sphere S d − ⊂ R d . We take our initial dataset P ⊂ S d − and sample T i.i.d.standard Gaussian d -dimensional vectors z , z , . . . , z T . The sets P i ⊆ P (not necessarily disjoint)are defined for each z i as follows: P i = { p ∈ P | h z i , p i ≥ η u } . We then recurse and repeat the above procedure for each non-empty P i . We stop the recursiononce we reach depth K . The above procedure generates a tree of depth K and degree at most T , where each leaf explicitly stores the corresponding subset of the dataset. To answer a query q ∈ S d − , we start at the root and descend into (potentially multiple) P i ’s for which h z i , q i ≥ η q .When we eventually reach the K -th level, we iterate through all the points stored in the accessed4eaves searching for a near neighbor.The parameters T , K , η u and η q depend on the distance threshold r , the approximation factor c ,as well as the desired space and query time exponents ρ u and ρ q . The special case of η u = η q corresponds to the “LSH regime” ρ u = ρ q ; η u < η q corresponds to the “fast queries” regime ρ q < ρ u (the query procedure is more selective); and η u > η q corresponds to the “low memory” regime ρ u < ρ q .The analysis of this algorithm relies on bounds on the Gaussian area of certain two-dimensionalsets [AR15, AIL + r = √ c . Second, we obtain an inferior trade-off for worst-case instancesof ( c, r )-ANN over a unit sphere S d − . Namely, we get:( c + 1) √ ρ q + ( c − √ ρ u ≥ c. (3)Even though it is inferior to the desired bound from (2) , it is already non-trivial. In particular, (3)is better than all the prior work on time–space trade-offs for ANN, including the most recenttrade-off [Kap15]. Moreover, using a reduction from [Val15], we achieve the bound (3) for the whole R d as opposed to just the unit sphere. Let us formally record it below: Theorem 1.2.
For every c > , r > , ρ q ≥ and ρ u ≥ such that (3) holds, there existsa data structure for ( c, r ) -ANN for the whole R d with space n ρ u + o (1) + O ( dn ) and query time n ρ q + o (1) + dn o (1) . Data-dependent partitions.
We then improve Theorem 1.2 for worst-case instances and obtainthe final result, Theorem 1.1. We build on the ideas of data-dependent LSH from [AINR14, AR15].Using the reduction from [Val15], we may assume that the dataset and queries lie on a unitsphere S d − .If pairwise distances between data points are distributed roughly like a random instance, we couldapply the data-independent procedure. In absence of such a guarantee, we manipulate the datasetin order to reduce it to a random-looking case. Namely, we look for low-diameter clusters thatcontain many data points. We extract these clusters, and we enclose each of them in a ball of radiusnon-trivially smaller than one, and we recurse on each cluster. For the remaining points, which donot lie in any cluster, we perform one step of the data-independent algorithm: we sample T Gaussianvectors, form T subsets of the dataset, and recurse on each subset. Overall, we make progressin two ways: for the clusters, we make them a bit more isotropic after re-centering, which, afterseveral re-centerings, makes the instance amenable to the data-independent algorithm, and for theremainder of the points, we can show that the absence of dense clusters makes the data-independentalgorithm work for a single level of the tree (though, when recursing into P i ’s, dense clusters mayre-appear, which we will need to extract).While the above intuition is very simple and, in hindsight, natural, the actual execution requiresa good amount of work. For example, we need to formalize “low-diameter”, “lots of points”, “more See Figure 2 for comparison for the case c = 2. triples of points. While thiswas necessary in [AR15], we can avoid that analysis here, which makes the overall argument muchcleaner. The algorithm still requires fine tuning of many moving parts, and we hope that it will befurther simplified in the future.Let us note that prior work suggested that time–space trade-offs might be possible with data-dependent partitions. To quote [Kap15]: “ It would be very interesting to see if similar [. . . to[AINR14] . . . ] analysis can be used to improve our tradeoffs ”. We show new cell-probe and restricted lower bounds for ( c, r )-ANN matching our upper bounds. Allour lower bounds rely on a certain canonical hard distribution for the Hamming space (defined laterin Section 2). Via a standard reduction [LLR94], we obtain similar hardness results for ‘ p with1 < p ≤ c being replaced by c p ). First, we show a tight lower bound on the space needed to solve ANN for a random instance, forquery algorithms that use a single cell probe. More formally, we prove the following theorem:
Theorem 1.3 (see Section 6.2) . Any data structure that: • solves ( c, r ) -ANN for the Hamming random instance (as defined in Section 2) with probabilityat least / , • operates on memory cells of size n o (1) , • for each query, looks up a single cell,must use at least n ( cc − ) − o (1) words of memory. The space lower bound matches: • Our upper bound for random instances that can be made single-probe; • Our upper bound for worst-case instances with query time n o (1) .The previous best lower bound from [PTW10] for a single probe are weaker by a polynomial factor.We prove Theorem 1.3 by computing tight bounds on the robust expansion of a hypercube {− , } d as defined in [PTW10]. Then, we invoke a result from [PTW10], which yields the desiredcell probe lower bound. We obtain estimates on the robust expansion via a combination ofthe hypercontractivity inequality and Hölder’s inequality [O’D14]. Equivalently, one could obtainthe same bounds by an application of the Generalized Small-Set Expansion Theorem for {− , } d from [O’D14]. 6 .6.2 Two cell probes To state our results for two cell probes, we first define the decision version of ANN (first introducedin [PTW10]). Suppose that with every data point p ∈ P we associate a bit x p ∈ { , } . A new goalis: given a query q ∈ {− , } d which is within distance r from a data point p ∈ P , if P \ { p } isat distance at least cr from q , return x p with probability at least 2 /
3. It is easy to see that anyalgorithm for ( c, r )-ANN would solve this decision version.We prove the following lower bound for data structures making only two cell probes per query.
Theorem 1.4 (see Section 8) . Any data structure that: • solves the decision ANN for the random instance (Section 2) with probability / , • operates on memory cells of size o (log n ) , • accesses at most two cells for each query,must use at least n ( cc − ) − o (1) words of memory. Informally speaking, Theorem 1.4 shows that the second cell probe cannot improve the spacebound by more than a subpolynomial factor. To the best of our knowledge, this is the first lowerbound on the space of any static data structure problem without a polynomial gap between t = 1and t ≥ t = 2, we must depart from the framework of [PTW10].Our proof establishes a connection between two-query data structures (for the decision versionof ANN), and two-query locally-decodable codes (LDC) [Yek12]. A possibility of such a connectionwas suggested in [PTW08]. In particular, we show that any data structure violating the lower boundfrom Theorem 1.4 implies a too-good-to-be-true two-query LDC, which contradicts known LDClower bounds from [KdW04, BRdW08].The first lower bound for unrestricted two-query LDCs was proved in [KdW04] via a quantum argument. Later, the argument was simplified and made classical in [BRdW08]. It turns out that,for our lower bound, we need to resort to the original quantum argument of [KdW04] since it hasa better dependence on the noise rate a code is able to tolerate. During the course of our proof,we do not obtain a full-fledged LDC, but rather an object which can be called an LDC on average .For this reason, we are unable to use [KdW04] as a black box but rather adjust their proof to theaverage case.Finally, we point out an important difference with Theorem 1.3: in Theorem 1.4 we allow wordsto be merely of size o (log n ) (as opposed to n o (1) ). Nevertheless, for the decision version of ANNfor random instances our upper bounds hold even for such “tiny” words. In fact, our techniquesdo not allow us to handle words of size Ω(log n ) due to the weakness of known lower bounds fortwo-query LDC for large alphabets . In particular, our argument can not be pushed beyond word7ize 2 e Θ( √ log n ) in principle , since this would contradict known constructions of two-query LDCs overlarge alphabets [DG15]! Finally, we prove conditional lower bound on the entire time–space trade-off matching our upperbounds that up to n o (1) factors. Note that—since we show polynomial query time lower bounds—proving similar lower bounds unconditionally is far beyond the current reach of techniques. Anysuch statement would constitute a major breakthrough in cell probe lower bounds.Our lower bounds are proved in the following model, which can be loosely thought of comprisingall hashing-based frameworks we are aware of: Definition 1.5. A list-of-points data structure for the ANN problem is defined as follows: • We fix (possibly random) sets A i ⊆ {− , } d , for ≤ i ≤ m ; also, with each possible querypoint q ∈ {− , } d , we associate a (random) set of indices I ( q ) ⊆ [ m ] ; • For a given dataset P , the data structure maintains m lists of points L , L , . . . , L m , where L i = P ∩ A i ; • On query q , we scan through each list L i for i ∈ I ( q ) and check whether there exists some p ∈ L i with k p − q k ≤ cr . If it exists, return p .The total space is defined as s = m + P mi =1 | L i | and the query time is t = | I ( q ) | + P i ∈ I ( q ) | L i | . For this model, we prove the following theorem.
Theorem 1.6 (see Section 7) . Consider any list-of-points data structure for ( c, r ) -ANN for randominstances of n points in the d -dimensional Hamming space with d = ω (log n ) , which achieves a totalspace of n ρ u , and has query time n ρ q − o (1) , for / success probability. Then it must hold that: c √ ρ q + ( c − √ ρ u ≥ √ c − . (4)We note that our model captures the basic hashing-based algorithms, in particular most ofthe known algorithms for the high-dimensional ANN problem [KOR00, IM98, Ind01b, Ind01a,GIM99, Cha02, DIIM04, Pan06, AC09, AI08, Pag16, Kap15], including the recently proposedLocality-Sensitive Filters scheme from [BDGL16]. The only data structures not captured are thedata-dependent schemes from [AINR14, Raz14, AR15]; we conjecture that the natural extension ofthe list-of-point model to data-dependent setting would yield the same lower bound. In particular,Theorem 1.6 uses the random instance as a hard distribution, for which being data-dependent seemsto offer no advantage. Indeed, a data-dependent lower bound in the standard LSH regime (where ρ q = ρ u ) has been recently shown in [AR16], and matches (4) for ρ q = ρ u .8 .7 Related work: past and concurrent There have been many recent algorithmic advances on high-dimensional similarity search. Theclosest pair problem, which can seen as the off-line version of NNS/ANN, has received muchattention recently [Val15, AW15, KKK16, KKKÓ16, ACW16]. ANN solutions with n ρ u space(and preprocessing), and n ρ q query time imply closest pair problem with O ( n ρ u + n ρ q ) time(implying that the balanced, LSH regime is most relevant). Other work includes locality-sensitivefilters [BDGL16] and LSH without false negatives [GPY94, Ind00, AGK06, Pag16, PP16]. A steptowards bridging the data-depending hashing to the practical algorithms has been made in [ARS17].See also the surveys [AI08, AI17]. Relation to the article of [Chr17].
The article of [Chr17] has significant intersection with thispaper (and, in particular, with the arXiv preprints [Laa15c, ALRW16] that are now merged to givethis paper), as we explain next. In November 2015, [Laa15c] announced the optimal trade-off (i.e.,Theorem 1.1) for random instances. As mentioned earlier, it is possible to extend this result to theentire Euclidean space, albeit with an inferior trade-off, from Theorem 1.2; for this, one can use astandard reduction á la [Val15] (this extension was not discussed in [Laa15c]). On May 9, 2016,both [Chr17] and [ALRW16] have been announced on arXiv. In [Chr17], the author also obtainsan upper bound similar to Theorem 1.2 (trade-offs for the entire R d , but which are suboptimal),using a different (data- independent ) reduction from the worst-case to the spherical case. Besidesthe upper bound, the author of [Chr17] also proved a conditional lower bound, similar to our lowerbound from Theorem 1.6. This lower bound of [Chr17] is independent of our work in [ALRW16](which is now a part of the current paper). We compile a list of exciting open problems: • While our upper bounds are optimal (at least, in the hashing framework), the most generalalgorithms are, unfortunately, impractical. Our trade-offs for random instances on the spheremay well be practical (see also [BDGL16, Laa15a] for an experimental comparison withe.g. [Cha02, AIL +
15] for ρ q = ρ u ), but a specific bottleneck for the extension to worst-caseinstances in R d is the clustering step inherited from [AR15]. Can one obtain simple andpractical algorithms that achieve the optimal time–space trade-off for these instances as well?For the balanced regime ρ q = ρ u , a step in this direction was taken in [ARS17]. • The constructions presented here are optimal when ω (log n ) ≤ d ≤ n o (1) . Do the sameconstructions give optimal algorithms in the d = Θ(log n ) regime? • Our new algorithms for the Euclidean case come tantalizingly close to the best known datastructure for the ‘ ∞ distance [Ind01b]. Can we unify them and extend in a smooth way tothe ‘ p spaces for 2 < p < ∞ ? 9 Can we improve the dependence on the word size in the reduction from ANN data structuresto LDCs used in the two-probe lower bound? As discussed above, the word size can not bepushed beyond 2 e Θ( √ log n ) due to known constructions [DG15]. • A more optimistic view is that LDCs may provide a way to avoid the barrier posed byhashing-based approaches. We have shown that ANN data structures can be used to buildweak forms of LDCs, and an intriguing open question is whether known LDC constructionscan help with designing even more efficient ANN data structures.
In this section, we introduce the random instances of ANN for the Hamming and Euclidean spaces.These instances play a crucial role for both upper bounds (algorithms) and the lower bounds in allthe subsequent sections (as well as some prior work). For upper bounds, we focus on the Euclideanspace, since algorithms for ‘ yield the algorithms for the Hamming space using standard reductions.For the lower bounds, we focus on the Hamming space, since these yield lower bounds for theEuclidean space. Hamming distance.
We now describe a distribution supported on dataset-query pairs (
P, q ),where P ⊂ {− , } d and q ∈ {− , } d . Random instances of ANN for the Hamming space will bedataset-query pairs drawn from this distribution. • A dataset P ⊂ {− , } d is given by n points, where each point is drawn independently anduniformly from {− , } d , where d = ω (log n ); • A query q ∈ {− , } d is drawn by first picking a dataset point p ∈ P uniformly at random,and then flipping each coordinate of p independently with probability c . • The goal of the data structure is to preprocess P in order to recover the data point p fromthe query point q .The distribution defined above is similar to the classic distribution introduced for the light bulbproblem in [Val88], which can be seen as the off-line setting of ANN. This distribution has served asthe hard distribution in many of the lower bounds for ANN mentioned in Section 1.4. Euclidean distance.
Now, we describe the distribution supported on dataset-query pairs (
P, q ),where P ⊂ S d − and q ∈ S d − . Random instances of ANN for Euclidean space will be instancesdrawn from this distribution. • A dataset P ⊂ S d − is given by n unit vectors, where each vector is drawn independentlyand uniformly at random from S d − . We assume that d = ω (log n ), so pairwise distances aresufficiently concentrated around √
2. 10
A query q ∈ S d − is drawn by first choosing a dataset point p ∈ P uniformly at random, andthen choosing q uniformly at random from all points in S d − within distance √ c from p . • The goal of the data structure is to preprocess P in order to recover the data point p fromthe query point q .Any data structure for (cid:16) c + o (1) , √ c (cid:17) -ANN over ‘ must handle this instance. [AR15] showedhow to reduce any ( c, r )-ANN instance to several pseudo -random instances without increasing querytime and space too much. These pseudo-random instances have the necessary properties of therandom instance above in order for the data-independent algorithms (which are designed with therandom instance in mind) to achieve optimal bounds. Similarly to [AR15], a data structure forthese instances will lie at the core of our algorithm. For 0 < s <
2, let α ( s ) = 1 − s be the cosine of the angle between two points on a unit Euclideansphere S d − with distance s between them, and β ( s ) = p − α ( s ) be the sine of the same angle.We introduce two functions that will be useful later. First, for η >
0, let F ( η ) = Pr z ∼ N (0 , d [ h z, u i ≥ η ] , where u ∈ S d − is an arbitrary point on the unit sphere, and N (0 , d is a distribution over R d ,where coordinates of a vector are distributed as i.i.d. standard Gaussians. Note that F ( η ) does notdepend on the specific choice of u due to the spherical symmetry of Gaussians.Second, for 0 < s < η, σ >
0, let G ( s, η, σ ) = Pr z ∼ N (0 , d [ h z, u i ≥ η and h z, v i ≥ σ ] , where u, v ∈ S d − are arbitrary points from the unit sphere with k u − v k = s . As with F , thevalue of G ( s, η, σ ) does not depend on the specific points u and v ; it only depends on the distance k u − v k between them. Clearly, G ( s, η, σ ) is non-increasing in s , for fixed η and σ .We state two useful bounds on F ( · ) and G ( · , · , · ). The first is a standard tail bound for N (0 , +
15] for a proof).
Lemma 3.1.
For η → ∞ , F ( η ) = e − (1+ o (1)) · η . Lemma 3.2. If η, σ → ∞ , then, for every s , one has: G ( s, η, σ ) = e − (1+ o (1)) · η σ − α ( s ) ησ β s ) . d =Θ(log n · log log n ) incurring distortion at most 1 + Ω(1) log n . Now we formulate the main result of Section 3, which we later significantly improve in Section 4.
Theorem 3.3.
For every c > , r > , ρ q ≥ and ρ u ≥ such that cr < and (cid:0) − α ( r ) α ( cr ) (cid:1) √ ρ q + (cid:0) α ( r ) − α ( cr ) (cid:1) √ ρ u ≥ β ( r ) β ( cr ) , (5) there exists a data structure for ( c, r ) -ANN on a unit sphere S d − ⊂ R d with space n ρ u + o (1) andquery time n ρ q + o (1) . We instantiate Theorem 3.3 for two important cases. First, we get a single trade-off between ρ q and ρ u for all r > at the same time by observing that (5) is the worst when r →
0. Thus,we get a bound on ρ q and ρ u that depends on the approximation c only, which then can easily betranslated to a result for the whole R d using a reduction from [Val15]. Corollary 3.4.
For every c > , r > , ρ q ≥ and ρ u ≥ such that (cid:0) c + 1 (cid:1) √ ρ q + (cid:0) c − (cid:1) √ ρ u ≥ c, (6) there exists a data structure for ( c, r ) -ANN for the whole R d with space n ρ u + o (1) and query time n ρ q + o (1) .Proof. We will show that we may transform an instance of ( c, r )-ANN on R d to an instance of( c + o (1) , r )-ANN on the sphere with r →
0. When r →
0, we have:1 − α ( r ) α ( cr ) = ( c + 1) r O c ( r ) ,α ( r ) − α ( cr ) = ( c − r O c ( r ) ,β ( r ) β ( cr ) = cr + O c ( r ) . Substituting these estimates into (5), we get (6).Now let us show how to reduce ANN over R d to the case, when all the points and queries lie ona unit sphere.We first rescale all coordinates so as to assume r = 1. Now let us partition the whole space R d into randomly shifted cubes with the side length s = 10 · √ d and consider each cube separately. Forany query q ∈ R d , with near neighbor p ∈ P ,Pr[ p and q are in different cubes] ≤ d X i =1 | p i − q i | s = k p − q k s ≤ √ d · k p − q k s ≤ . ‘ diameter of a single cube is d . Consider one particular cube C , where we first translatepoints so x ∈ C have k x k ≤ d . We let π : C → R d +1 where π ( x ) = ( x, R ) , where we append coordinate R (cid:29) d as the ( d + 1)-th coordinate. For any point x ∈ C , (cid:13)(cid:13)(cid:13)(cid:13) π ( x ) − (cid:18) R k π ( x ) k (cid:19) · π ( x ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ k x k R and for any two points x, y ∈ C , k x − y k = k π ( x ) − π ( y ) k ; thus, (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) R k π ( x ) k (cid:19) π ( x ) − (cid:18) R k π ( y ) k (cid:19) π ( y ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ d R + k x − y k . In addition, since (cid:16) R k π ( x ) k (cid:17) π ( x ) lies in a sphere of radius R for each point x ∈ C . Thus, letting R = d · log log n ≤ O (log n · log log n ) (which is without loss of generality by the Johnson–Lindenstrauss Lemma), we get that an instance of ( c, r )-ANN on R d corresponds to an instanceof ( c + o (1) , d log log n )-ANN on the surface of the unit sphere S d ⊂ R d +1 , where we lose in thesuccess probability due to the division into disjoint cubes. Applying Theorem 3.3, we obtain thedesired bound.If we instantiate Theorem 3.3 with inputs (dataset and query) drawn from the random instancesdefined in Section 2 (corresponding to the case r = √ c ), we obtain a significantly better tradeoffthan (6). By simply applying Theorem 3.3, we give a trade-off for random instances matching thetrade-off promised in Theorem 1.1. Corollary 3.5.
For every c > , ρ q ≥ and ρ u ≥ such that c √ ρ q + (cid:0) c − (cid:1) √ ρ u ≥ p c − , (7) there exists a data structure for (cid:16) c, √ c (cid:17) -ANN on a unit sphere S d − ⊂ R d with space n ρ u + o (1) and query time n ρ q + o (1) . In particular, this data structure is able to handle random instances asdefined in Section 2.Proof. Follows from (5) and that α ( √
2) = 0 and β ( √
2) = 1.Figure 2 plots the time-space trade-off in (6) and (7) for c = 2. Note that (7) is much betterthan (6), especially when ρ q = 0, where (6) gives space n . ... , while (7) gives much better space n . ... . In Section 4, we show how to get best of both worlds: we obtain the trade-off (7) for worst-case instances. The remainder of the section is devoted to proving Theorem 3.3.13 .3 Data structure Fix K and T to be positive integers, we determine their exact value later. Our data structure is a single rooted tree where each node corresponds to a spherical cap. The tree consists of K + 1 levelsof nodes where each node has out-degree at most T . We will index the levels by 0, 1, . . . , K , wherethe 0-th level consists of the root denoted by v , and each node up to the ( K − T children. Therefore, there are at most T K nodes at the K -th level.For every node v in the tree, let L v be the set of nodes on the path from v to the root v excluding the root (but including v ). Each node v , except for the root, stores a random Gaussianvector z v ∼ N (0 , d . For each node v , we define the following subset of the dataset P v ⊆ P : P v = (cid:8) p ∈ P | ∀ v ∈ L v h z v , p i ≥ η u (cid:9) , where η u > v , P v = P , since L v = ∅ . Intuitively, each set P v corresponds to a subset ofthe dataset lying in the intersection of spherical caps centered around z v for all v ∈ L v . Every leaf ‘ at the level K stores the subset P ‘ explicitly.We build the tree recursively. For a given node v in levels 0, . . . , K −
1, we first sample T i.i.d.Gaussian vectors g , g , . . . , g T ∼ N (0 , d . Then, for every i such that { p ∈ P v | h g i , p i ≥ η u } isnon-empty, we create a new child v with z v = g i and recursively process v . At the K -th level, eachnode v stores P v as a list of points.In order to process a query q ∈ S d − , we start from the root v and descend down the tree. Weconsider every child v of the root for which h z v , q i ≥ η q , where η q > . After identifying all the children, we proceed down the children recursively. If wereach leaf ‘ at level K , we scan through all the points in P ‘ and compute their distance to thequery q . If a point lies at a distance at most cr from the query, we return it and stop.We provide pseudocode for the data structure above in Figure 1. The procedure Build ( P , 0, ⊥ ) builds the data structure for dataset P and returns the root of the tree, v . The procedure Query ( q , v ) queries the data structure with root v at point q . We first analyze the probability of success of the data structure. Weassume that a query q has some p ∈ P where k p − q k ≤ r . The data structure succeeds when Query ( q , v ) returns some point p ∈ P with k q − p k ≤ cr . Lemma 3.6. If T ≥ G ( r, η u , η q ) , Note that η u may not be equal to η q . It is exactly this discrepancy that will govern the time–space trade-off. unction Build ( P , l , z )create a tree node v store l as v.l store z as v.z if l = K then store P as v.P elsefor i ← . . . T do sample a Gaussian vector z ∼ N (0 , d P ← { p ∈ P | h z , p i ≥ η u } if P = ∅ then add Build ( P , l + 1, z ) as a child of v return v function Query ( q , v ) if v.l = K thenfor p ∈ v.P doif k p − q k ≤ cr thenreturn p elsefor v : v is a child of v doif h v .z, q i ≥ η q then p ← Query ( q, v ) if p = ⊥ thenreturn p return ⊥ Figure 1: Pseudocode for data-independent partitions then with probability at least . , Query ( q , v ) finds some point within distance cr from q .Proof. We prove the lemma by induction on the depth of the tree. Let q ∈ S d − be a query pointand p ∈ P its near neighbor. Suppose we are within the recursive call Query ( q , v ) for some node v in the tree. Suppose we have not yet failed, that is, p ∈ P v . We would like to prove that—if thecondition of the lemma is met—the probability that this call returns some point within distance cr is at least 0 . v is a node in the last level K , the algorithm enumerates P v and, since we assume p ∈ P v ,some good point will be discovered (though not necessarily p itself). Therefore, this case is trivial.Now suppose that v is not from the K -th level. Using the inductive assumption, suppose that thestatement of the lemma is true for all T potential children of v , i.e., if p ∈ P v , then with probability0.9, Query ( q, v ) returns some point within distance cr from q . Then,Pr[failure] ≤ T Y i =1 (cid:18) − Pr z vi [ h z v i , p i ≥ η u and h z v i , q i ≥ η q ] · . (cid:19) ≤ (1 − G ( r, η u , η q ) · . T ≤ . , where the first step follows from the inductive assumption and independence between the childrenof v during the preprocessing phase. The second step follows by monotonicity of G ( s, ρ, σ ) in s , andthe third step is due to the assumption of the lemma. Space
We now analyze the space consumption of the data structure.
Lemma 3.7.
The expected space consumption of the data structure is at most n o (1) · K · (cid:0) T · F ( η u ) (cid:1) K . Proof.
We compute the expected total size of the sets P ‘ for leaves ‘ at K -th level. There are atmost T K such nodes, and for a fixed point p ∈ P and a fixed leaf ‘ the probability that p ∈ P ‘ isequal to F ( η u ) K . Thus, the expected total size is at most n · (cid:0) T · F ( η u ) (cid:1) K . Since we only store a15ode v if P v is non-empty, the number of nodes stored is at most K + 1 times the number of pointsstored at the leaves. The Gaussian vectors stored at each node require space d , which is at most n o (1) . Query time
Finally, we analyze the query time.
Lemma 3.8.
The expected query time is at most n o (1) · T · ( T · F ( η q )) K + n o (1) · ( T · G ( cr, η u , η q )) K . (8) Proof.
First, we compute the expected query time spent going down the tree, without scanning theleaves. The expected number of nodes the query procedure reaches is:1 + T · F ( η q ) + ( T · F ( η q )) + . . . + ( T · F ( η q )) K = O (1) · ( T · F ( η q )) K , since we will set T so T · F ( η q ) ≥ n o (1) · T . The product of thetwo expressions gives the first term in (8).The expected time spent scanning points in the leaves is at most n o (1) times the number ofpoints scanned at the leaves reached. The number of points scanned is always at most one morethan the number of far points, i.e., lying a distance greater than cr from q , that reached the sameleaf. There are at most n − T K leaves. For each far point p and each leaf ‘ theprobability that both p and q end up in P ‘ is at most G ( cr, η u , η q ) K . For each such pair, we spendtime at most n o (1) processing the corresponding p . This gives the second term in (8). We end the section by describing how to set parameters T , K , η u and η q to prove Theorem 3.3.First, we set K ∼ √ ln n . In order to satisfy the requirement of Lemma 3.6, we set T = 100 G ( r, η u , η q ) . (9)Second, we (approximately) balance the terms in the query time (8). Toward this goal, we aim tohave F ( η q ) K = n · G ( cr, η u , η q ) K . (10)If we manage to satisfy these conditions, then we obtain space n o (1) · ( T · F ( η u )) K and querytime n o (1) · ( T · F ( η q )) K .Let F ( η u ) K = n − σ and F ( η q ) K = n − τ . By Lemma 3.1, Lemma 3.2 and (10), we have that, upto o (1) terms, τ = σ + τ − α ( cr ) · √ στβ ( cr ) − , Other terms from the query time are absorbed into n o (1) due to our choice of K . (cid:12)(cid:12) √ σ − α ( cr ) √ τ (cid:12)(cid:12) = β ( cr ) , (11)since α ( cr ) + β ( cr ) = 1. We have, by Lemma 3.1, Lemma 3.2 and (9), T K = n σ + τ − α ( r ) √ στβ r ) + o (1) . Thus, the space bound is n o (1) · ( T · F ( η u )) K = n σ + τ − α ( r ) √ στβ r ) − σ + o (1) = n
1+ ( α ( r ) √ σ −√ τ ) β r ) + o (1) and query time is n o (1) · ( T · F ( η q )) K = n σ + τ − α ( r ) √ στβ r ) − τ + o (1) = n ( √ σ − α ( r ) √ τ ) β r ) + o (1) . In other words, ρ q = (cid:0) √ σ − α ( r ) √ τ (cid:1) β ( r ) , and ρ u = (cid:0) α ( r ) √ σ − √ τ (cid:1) β ( r )where τ is set so that (11) is satisfied. Combining these identities, we obtain (5).Namely, we set √ σ = α ( cr ) √ τ + β ( cr ) to satisfy (11). Then, √ τ can vary between: α ( r ) β ( cr )1 − α ( r ) α ( cr ) , which corresponds to ρ u = 0 and β ( cr ) α ( r ) − α ( cr ) , which corresponds to ρ q = 0.This gives a relation: √ τ = β ( cr ) − β ( r ) √ ρ q α ( r ) − α ( cr ) = α ( r ) β ( cr ) + β ( r ) √ ρ u − α ( r ) α ( cr ) , which gives the desired trade-off (5). We remark that there is an alternative method to the algorithm described above, using
SphericalLocality-Sensitive Filtering introduced in [BDGL16]. As argued in [BDGL16], this method maynaturally extend to the d = O (log n ) case with better trade-offs between ρ q , ρ u than in (2) (indeed,17uch better exponents were obtained in [BDGL16] for the “LSH regime” of ρ u = ρ q ).For spherical LSF, partitions are formed by first dividing R d into K blocks ( R d = R d/K × · · · × R d/K ), and then generating a spherical code C ⊂ S d/K − ⊂ R d/K of vectors sampled uniformlyat random from the lower-dimensional unit sphere S d/K − . For any vector p ∈ R d , we write p (1) , . . . , p ( K ) for the K blocks of d/K coordinates in the vector p .The tree consists of K levels, and the | C | children of a node v at level ‘ are defined by thevectors (0 , . . . , , z i , , . . . , ‘ -th block of d/K entries is potentially non-zero and isformed by one of the | C | code words. The subset P of a child then corresponds to the subset P ofthe parent, intersected with the spherical cap corresponding to the child, where P = { p ∈ P : h z i , p (1) i + · · · + h z i K , p ( K ) i ≥ K · η u } . (12)Decoding each of the K blocks separately with threshold η u was shown in [BDGL16] to be asymp-totically equivalent to decoding the entire vector with threshold K · η u , as long as K does not growtoo fast as a function of d and n . The latter joint decoding method based on the sum of the partialinner products is then used as the actual decoding method. In this section we prove the main upper bound theorem, Theorem 1.1, which we restate below:
Theorem 4.1.
For every c > , r > , ρ q ≥ and ρ u ≥ such that c √ ρ q + (cid:0) c − (cid:1) √ ρ u ≥ p c − , (13) there exists a data structure for ( c, r ) -ANN for the whole R d with space n ρ u + o (1) + O ( dn ) andquery time n ρ q + o (1) + dn o (1) . This theorem achieves “the best of both worlds” in Corollary 3.4 and Corollary 3.5. LikeCorollary 3.4, our data structure works for worst-case datasets; however, we improve upon thetrade-off between time and space complexity from Corollary 3.4 to that of random instances inCorollary 3.5. See Figure 2 for a comparison of both trade-offs for c = 2. We achieve the improvementby combining the result of Section 3 with the techniques from [AR15].As in [AR15], the resulting data structure is a decision tree. However, there are several notabledifferences from [AR15]: • The whole data structure is a single decision tree, while [AR15] considers a collection of n Θ(1) trees. • Instead of Spherical LSH used in [AR15], we use the partitioning procedure from Section 3. • In [AR15], the algorithm continues partitioning the dataset until all parts contain less than n o (1) points. We change the stopping criterion slightly to ensure the number of “non-cluster”18 ll lll [AI08][AR15] r q r u Data−independent boundLSH regimeRandom instances
Figure 2: Trade-offs between query time n ρ q + o (1) and space n ρ u + o (1) for the Euclidean distanceand approximation c = 2. The green dashed line corresponds to the simple data-independent boundfor worst-case instances from Corollary 3.4. The red solid line corresponds to the bound for random instances from Corollary 3.5, which we later extend to worst-case instances in Section 4. The bluedotted line is ρ q = ρ u , which corresponds to the “LSH regime”. In particular, the intersection ofthe dotted and the dashed lines matches the best data-independent LSH from [AI08], while theintersection with the solid line matches the best data-dependent
LSH from [AR15].19 √ − ε ) R (1 − Θ( ε )) R Figure 3: Covering a spherical cap of radius ( √ − ε ) R nodes on any root-leaf branch is the same (this value will be around √ ln n to reflect the settingof K in Section 3). • Unlike [AR15], our analysis does not require the “three-point property”, which is necessaryin [AR15]. This is related to the fact that the probability success of a single tree is constant,unlike [AR15], where it is polynomially small. • In [AR15], the algorithm reduces the general case to the “bounded ball” case using LSHfrom [DIIM04]. While the cost associated with this procedure is negligible in the LSH regime,the cost becomes too high in certain parts of the time–space trade-off. Instead, we use astandard trick of imposing a randomly shifted grid, which reduces an arbitrary dataset toa dataset of diameter e O (log n ) (see the proof of Corollary 6 and [IM98]). Then, we invokean upper bound from Section 3 together with a reduction from [Val15] which happens to beenough for this case. We start with a high-level overview. Consider a dataset P of n points. We may assume r = 1by rescaling. We may further assume the dataset lies in the Euclidean space of dimension d =Θ(log n · log log n ); one can always reduce the dimension to d by applying the Johnson–Lindenstrausslemma [JL84, DG03] which reduces the dimension and distorts pairwise distances by at most1 ± / (log log n ) Ω(1) with high probability. We may also assume the entire dataset P and a querylie on a sphere ∂B (0 , R ) of radius R = e O (log n ) (see the proof of Corollary 6).We partition P into various components: s dense components, denoted by C , C , . . . , C s , andone pseudo-random component, denoted by e P . The partition is designed to satisfy the followingproperties. Each dense component C i satisfies | C i | ≥ τ n and can be covered by a spherical cap ofradius ( √ − ε ) R (see Figure 3). Here τ, ε > C i as clusters consisting of n − o (1) points which are closer than random points would be.The pseudo-random component e P consists of the remaining points without any dense clusters inside.We proceed separately for each C i and e P . We enclose every dense component C i in a smallerball E i of radius (1 − Ω( ε )) R (see Figure 3). For simplicity, we first ignore the fact that C i does20ot necessarily lie on the boundary ∂E i . Once we enclose each dense cluster with a smaller ball, werecurse on each resulting spherical instance of radius (1 − Ω( ε )) R . We treat the pseudo-randomcomponent e P similarly to the random instance from Section 2 described in Section 3. Namely, wesample T Gaussian vectors z , z , . . . , z T ∼ N (0 , d , and form T subsets of e P : e P i = { p ∈ e P | h z i , p i ≥ η u R } , where η u > e P i . Note that after we recurse, new dense clusters may appear in some e P i since it becomes easier to satisfy the minimum size constraint.During the query procedure, we recursively query each C i with the query point q . For thepseudo-random component e P , we identify all i ’s such that h z i , q i ≥ η q R , and query all correspondingchildren recursively. Here T , η u > η q > − Ω( ε )). Ideally, we have that initially R = e O (log n ), so in O (log log n/ε ) iterations of removing dense clusters, we arrive at the case of R ≤ c/ √
2, whereCorollary 3.5 begins to apply. For the pseudo-random component e P , most points will lie a distanceat least ( √ − ε ) R from each other. In particular, the ratio of R to a typical inter-point distance isapproximately 1 / √
2, exactly like in a random case. For this reasonm we call e P pseudo-random. Inthis setting, the data structure from Section 3 performs well.We now address the issue deferred in the above high-level description: that a dense component C i does not generally lie on ∂E i , but rather can occupy the interior of E i . In this case, we partition E i into very thin annuli of carefully chosen width and treat each annulus as a sphere. This discretizationof a ball adds to the complexity of the analysis, but is not fundamental from the conceptual pointof view. We are now ready to describe the data structure formally. It depends on the (small positive)parameters τ , ε and δ , as well as an integer parameter K ∼ √ ln n . We also need to chooseparameters T , η u > η q > Preprocessing.
Our preprocessing algorithm consists of the following functions: • Process ( P ) does the initial preprocessing. In particular, it performs the rescaling so that r =1 as well as the dimension reduction to d = Θ(log n log log n ) with the Johnson–Lindenstrausslemma [JL84, DG03]. In addition, we partition into randomly shifted cubes, translate thepoints, and think of them as lying on a sphere of radius R = e O (log n ) (see the proof ofCorollary 6 for details). Then we call ProcessSphere .21
Project ( R , R , r ) R R S S Figure 4: The definition of
Project • ProcessSphere ( P , r , r , o , R , l ) builds the data structure for a dataset P lying on a sphere ∂B ( o, R ), assuming we need to solve ANN with distance thresholds r and r . Moreover,we are guaranteed that queries will lie on ∂B ( o, R ). The parameter l counts the number ofnon-cluster nodes in the recursion stack we have encountered so far. Recall that we stop assoon as we encounter K of them. • ProcessBall ( P , r , r , o , R , l ) builds the data structure for a dataset P lying inside theball B ( o, R ), assuming we need to solve ANN with distance thresholds r and r . Unlike ProcessSphere , here queries can be arbitrary. The parameter l has the same meaning as in ProcessSphere . • Project ( R , R , r ) is an auxiliary function allowing us to project points on a ball to verythin annuli. Suppose we have two spheres S and S with a common center and radii R and R . Suppose there are points p ∈ S and p ∈ S with k p − p k = r . Project ( R , R , r )returns the distance between p and the point f p that lies on S and is the closest to p (seeFigure 4). This is implemented by a formula as in [AR15].We now elaborate on the above descriptions of ProcessSphere and
ProcessBall , since theseare the crucial components of our analysis. We will refer to line number of the pseudocode fromFigure 5.
ProcessSphere.
We consider three base cases.1. If l = K , we stop and store P explicitly. This corresponds to having reached a leaf in thealgorithm from Section 3. This case is handled in lines 2–4 of Figure 5.2. If r ≥ R , then we may only store one point, since any point in P is a valid answer to anyquery made on a sphere of radius R containing P . This trivial instance is checked in lines 5–7of Figure 5.3. The last case occurs when the algorithm from Section 3 can give the desired point on thetime–space trade-off. In this case, we may simply proceed as in the algorithm from Section 3.22e choose η u , η q > T appropriately and build a single level of the tree from Section 3with l increased by 1. We check for this last condition using (5) in line 9 of Figure 5, and if so,we may skip lines 10–18 of Figure 5.If none of the above three cases apply, we proceed in lines 10–18 of Figure 5 by removing the densecomponents and then handling the pseudo-random remainder. The dense components are clustersof at least τ | P | points lying in a ball of radius ( √ − ε ) R with its center on ∂B ( o, R ). These ballscan be enclosed by smaller balls of radius e R ≤ (1 − Ω( ε )) R . In each of these smaller balls, weinvoke ProcessBall with the same l . Finally, we build a single level of the tree in Section 3 forthe remaining pseudo-random points. We pick the appropriate η u , η q > T and recurse on eachpart with ProcessSphere with l increased by 1. ProcessBall.
Similarly to
ProcessSphere , if r + 2 R ≤ r , then any point from B ( o, R ) is avalid answer to any query in B ( o, R + r ). We handle this trivial instance in lines 25–27 of Figure 5.If we are not in the trivial setting above, we reduce the ball to the spherical case via a discretizationof the ball B ( o, R ) into thin annuli of width δr . First, we round all distances from points to o to amultiple of δr in line 28 of Figure 5. This rounding can change the distance between any pair ofpoints by at most 2 δr by the triangle inequality.Then, we handle each non-empty annuli separately. In particular, for a fixed annuli at distance δir from o , a possible query can lie at most a distance δjr from o , where δr | i − j | ≤ r + 2 δr .For each such case, we recursively build a data structure with ProcessSphere . However, whenprojecting points, the distance thresholds of r and r change, and this change is computed using Project in lines 34 and 35 of Figure 5.Overall, the preprocessing creates a decision tree. The root corresponds to the procedure
Process , and subsequent nodes correspond to procedures
ProcessSphere and
ProcessBall .We refer to the tree nodes correspondingly, using the labels in the description of the query algorithmfrom below.
Query algorithm.
Consider a query point q ∈ R d . We run the query on the decision tree, startingwith the root which executes Process , and applying the following algorithms depending on thelabel of the nodes: • In ProcessSphere we first recursively query the data structures corresponding to the clusters.Then, we locate q in the spherical caps (with threshold η q , like in Section 3), and query datastructures we built for the corresponding subsets of P . When we encounter a node with pointsstored explicitly, we simply scan the list of points for a possible near neighbor. This happenswhen l = K . • In ProcessBall , we first consider the base case, where we just return the stored point if itis close enough. In general, we check whether k q − o k ≤ R + r . If not, we return with no23eighbor, since each dataset point lies within a ball of radius R from o , but the query point isat least R + r away from o . If k q − o k ≤ R + r , we round q so the distance from o to q isa multiple of δr and enumerate all possible distances from o to the potential near neighborwe are looking for. For each possible distance, we query the corresponding ProcessSphere children after projecting q on the sphere with a tentative near neighbor using, Project . We complete the description of the data structure by setting the remaining parameters. Recall thatthe dimension is d = Θ(log n · log log n ). We set ε, δ, τ as follows: • ε = n ; • δ = exp (cid:0) − (log log log n ) C (cid:1) ; • τ = exp (cid:0) − log / n (cid:1) ,where C is a sufficiently large positive constant.Now we specify how to set η u , η q > T for each pseudo-random remainder. The idea will beto try to replicate the parameter settings of Section 3.3.3 corresponding to the random instance. Theimportant parameter will be r ∗ , which acts as the “effective” r . In the case that r ≥ √ R , thenwe have more flexibility than in the random setting, so we let r ∗ = r . In the case that r < √ R ,then we let r ∗ = √ R . In particular, we let T = 100 G ( r /R, η u , η q )in order to achieve a constant probability of success. Then we let η u and η q such that • F ( η u ) /G ( r /R, η u , η q ) ≤ n ( ρ u + o (1)) /K • F ( η q ) /G ( r /R, η u , η q ) ≤ n ( ρ q + o (1)) /K • G ( r ∗ /R, η u , η q ) /G ( r /R, η u , η q ) ≤ n ( ρ q − o (1)) /K which correspond to the parameter settings achieving the tradeoff of Section 3.3.3.A crucial relation between the parameters is that τ should be much smaller than n − /K = 2 −√ log n .This implies that the “large distance” is effectively equal to √ R , at least for the sake of a singlestep of the random partition.We collect some basic facts from the data structure which will be useful for the analysis. Thesefacts follow trivially from the pseudocode in Figure 5. • Process is called once at the beginning and has one child corresponding to one call to
ProcessSphere . In the analysis, we will disregard this node.
Process does not takeup any significant space or time. Thus, we refer to the root of the tree as the first call to
ProcessSphere . 24
The children to
ProcessSphere may contain at most τ many calls to ProcessBall ,corresponding to cluster nodes, and T calls to ProcessSphere . Each
ProcessBall callof
ProcessSphere handles a disjoint subset of the dataset. Points can be replicated in thepseudo-random remainder, when a point lies in the intersection of two or more caps. • ProcessBall has many children, all of which are
ProcessSphere which do not increment l . Each of these children corresponds to a call on a specific annulus of width δr around thecenter as well as possible distance for a query. For each annulus, there are at most δ + 4notable distances; after rounding by δr , a valid query can be at most r + 2 δr away from aparticular annulus in both directions, thus, each point gets duplicated at most δ + 4 manytimes. • For each possible point p ∈ P , we may consider the subtree of nodes which process thatparticular point. We make the distinction between two kinds of calls to ProcessSphere : callswhere p lies in a dense cluster, and calls where p lies in a pseudo-random remainder. If p liesin a dense cluster, l will not be incremented; if p lies in the pseudo-random remainder, l willbe incremented. The point p may be processed by various rounds of calls to ProcessBall and
ProcessSphere without incrementing l ; however, there will be a moment when p is notin a dense cluster and will be part of the pseudo-random remainder. In that setting, p will beprocessed by a call to ProcessSphere which increments l . Lemma 4.2.
The following invariants hold. • At any moment one has r r ≥ c · (1 − o (1)) and r ≤ c · (1 + o (1)) . • At any moment the number of calls to
ProcessBall in the recursion stack is at most e O (log log n ) .Proof. Our proof will proceed by keeping track of two quantities, γ and ξ as the algorithm proceedsdown the tree. We will be able to analyze how these values change as the algorithm executes thesubroutines ProcessSphere and
ProcessBall . We will then combine these two measures to get apotential function which always increases by a multiplicative factor of (1 + Ω( ε )). By giving overallbounds on γ and ξ , we will deduce an upper bound on the depth of the tree. For any particularnode of the tree v (where v may correspond to a call to ProcessSphere or ProcessBall ), we let γ v = r r and ξ v = r R where the values of r , r , and R are given by the procedure call at v . We will often refer to γ v and ξ v as γ and ξ , respectively, when there is no confusion. Additionally, we will often refer to how γ and ξ changes; in particular, if ˜ v is a child of v , then we let ˜ γ and ˜ ξ be the values of γ ˜ v and ξ ˜ v .25 : function ProcessSphere ( P , r , r , o , R , l )2: if l = K then . base case 13: store P explicitly4: return if r ≥ R then . base case 26: store any point from P return r ∗ ← r if (cid:0) − α (cid:0) r R (cid:1) α (cid:0) r R (cid:1)(cid:1) √ ρ q + (cid:0) α (cid:0) r R (cid:1) − α (cid:0) r R (cid:1)(cid:1) √ ρ u < β (cid:0) r R (cid:1) β (cid:0) r R (cid:1) then . base case 310: m ← | P | b R ← ( √ − ε ) R while ∃ x ∈ ∂B ( o, R ) : | B ( x, b R ) ∩ P | ≥ τ m do . remove dense clusters13: B ( e o, e R ) ← the seb for P ∩ B ( x, b R )14: ProcessBall ( P ∩ B ( x, b R ), r , r , e o , e R , l )15: P ← P \ B ( x, b R )16: r ∗ ← √ R
17: choose η u and η q such that: . data independent portion • F ( η u ) /G ( r /R, η u , η q ) ≤ n ( ρ u + o (1)) /K ; • F ( η q ) /G ( r /R, η u , η q ) ≤ n ( ρ q + o (1)) /K ; • G ( r ∗ /R, η u , η q ) /G ( r /R, η u , η q ) ≤ n ( ρ q − o (1)) /K .18: T ← /G ( r /R, η u , η q )19: for i ← . . . T do
20: sample z ∼ N (0 , d P ← { p ∈ P | h z, p i ≥ η u R } if P = ∅ then ProcessSphere ( P , r , r , o , R , l + 1)24: function ProcessBall ( P , r , r , o , R , l )25: if r + 2 R ≤ r then . trivial instance of ProcessBall
26: store any point from P return P ← { o + δr d k p − o k δr e · p − o k p − o k | p ∈ P } for i ← . . . d Rδr e do e P ← { p ∈ P : k p − o k = δir } if e P = ∅ then for j ← . . . d R + r +2 δr δr e do if δ | i − j | ≤ r + 2 δr then e r ← Project ( δir , δjr , r + 2 δr ) . computing e r and e r for projected instance35: e r ← Project ( δir , δjr , r − δr )36: ProcessSphere ( e P , e r , e r , o , δir , l )37: function Project ( R , R , r )38: return p R ( r − ( R − R ) ) /R Figure 5: Pseudocode of the data structure ( seb stands for smallest enclosing ball )26 laim 4.3.
Initially, γ = c , and it only changes when ProcessBall calls
ProcessSphere .Letting ∆ R = δr | i − j | , there are two cases: • If ≤ ∆ R r ≤ δλ , then ˜ γγ ≥ − δ . • If ∆ R r ≥ δλ , then ˜ γγ ≥ R r · λ − δ .for λ = 1 − ( c +1 ) > . Note that initially, γ = c since r = 1 and r = c . Now, ˜ γ changes when Process-Ball ( P, r , r , o, R, l ) calls ProcessSphere ( e P , e r , e r , o, δir , l ) in line 36 of Figure 5. When thisoccurs: ˜ γγ = Project ( δir , δjr , r − δr ) / Project ( δir , δjr , r + 2 δr ) r /r = ( r − δr ) − ∆ R r · r ( r + 2 δr ) − ∆ R = (1 − δr r ) − ∆ R r (1 + 2 δ ) − ∆ R r ≥ R ( r − r ) − δ (1 + 2 δ ) − ∆ R r = 1 + ∆ R r · λ − δ (1 + 2 δ ) − ∆ R r assuming that r ( c +12 ) ≤ r (we will actually show the much tighter bound of r · c · (1 − o (1)) ≤ r toward the end of the proof) and setting λ = 1 − ( c +1 ) , where λ ∈ (0 , ∆ R r ≤ (1 + 2 δ ) . Now, consider two cases: • Case 1: 0 ≤ ∆ R r ≤ δλ . In this case, we have:˜ γγ ≥ ∆ R r · λ − δ (1 + 2 δ ) − ∆ R r ≥ − δ (1 + 2 δ ) − ∆ R r ≥ − δ (1 / . Thus, the multiplicative decrease is at most (1 − δ ) since δ = o (1).27 Case 2: ∆ R r ≥ δλ . In this case: ˜ γγ ≥ ∆ R r · λ − δ (1 + 2 δ ) − ∆ R r ≥ ∆ R r · λ − δ
2= 1 + ∆ R r · λ − δ. Claim 4.4.
Initially, ξ ≥ e Ω c log n ! . ξ changes only when ProcessBall calls
ProcressSphere ,or vice-versa. When
ProcessBall calls
ProcessSphere , and some later
ProcessSphere calls
ProcessBall , letting ∆ R = δr | i − j | : ˜ ξξ ≥ (cid:16) ε ) (cid:17) (1 − δ ) − ∆ R r (1 − λ ) !
11 + ∆ RR ! , for λ = 1 − ( c +1 ) > . The relevant procedure calls in Claim 4.4 are:1.
ProcessBall ( P, r , r , o, R, l ) calls ProcessSphere ( e P , e r , e r , o, δir , l ) in line 36 of Figure 5.2. After possibly some string of calls to ProcessSphere , ProcessSphere ( P , e r , e r , o, δir , l )calls ProcessBall ( P ∩ B ( e o, e R ) , e r , e r , e o, l ) in line 14 of Figure 5.Since both calls to ProcessBall identified above have no
ProcessBall calls in their path in thetree, we have the following relationships between the parameters: • e r = Process ( δir , δjr , r + 2 δ ), • e r = Process ( δir , δjr , r − δ ), • e R ≤ (1 − Ω( ε )) · δir ,Using these setting of parameters,˜ ξξ = (cid:16) ε ) (cid:17) · Process ( δir , δjr , r − δr ) /δ i r r /R = (cid:16) ε ) (cid:17) (cid:18) − δr r (cid:19) − ∆ R r ! · R δ ijr ≥ (cid:16) ε ) (cid:17) (1 − δ ) − ∆ R r · (1 − λ ) !
11 + ∆ RR ! , r ( c +12 ) ≤ r , and that δir ≤ R and δjr ≤ R + ∆ R .Note that the lower bound is always positive since ∆ R r ≤ δ , δ = o (1), and λ ∈ (0 ,
1) is someconstant.We consider the following potential function:Φ = γ M · ξ, and we set M = √ · λ · δ . Claim 4.5.
In every iteration of
ProcessBall calling
ProcessSphere which at some point calls
ProcessBall again, Φ increases by a multiplicative factor of ε ) . We simply compute the multiplicative change in Φ by using Claim 4.3 and Claim 4.4. We willfirst apply the first case of Claim 4.3, where 0 < ∆ R r ≤ δλ . e ΦΦ ≥ (1 − δ ) M · (cid:16) ε ) (cid:17) · (1 − δ ) − ∆ R r (1 − λ ) ! ·
11 + ∆ RR ! ≥ (cid:16) ε ) (cid:17) · − δM − δ − ∆ R r − ∆ RR ! ≥ (cid:16) ε ) (cid:17) · − δM − δ − δλ − √ · δ √ λ ! , where the third inequality, we used ∆ R r ≤ δλ and r ≤ r ≤ R by line 5 of Figure 5. Finally, wenote that ε (cid:29) δM − δ − δλ − √ δ √ λ , so in this case, e ΦΦ ≥ (cid:16) ε ) (cid:17) . We now proceed to the second case, when ∆ R r > δλ . Using case 2 of Claim 4.3, we have˜ΦΦ ≥ R r · λ ! M · (cid:16) ε ) (cid:17) · (1 − δ ) − ∆ R r (1 − λ ) ! ·
11 + ∆ RR ! . (14)We claim the above expression is at least 1 + Ω( ε ). This follows from three observations: • ∆ R r ≥ δλ implies that √ λ √ · δ ≥ r ∆ R . • Since r ≤ r ≤ R (by line 5 of Figure 5), r ≥ R , so ∆ R r · √ λ √ · δ ≥ Rr ≥ ∆ RR . • Thus, if M = √ · λ · δ , ∆ R r · λ · M ≥ · ∆ R r · · √ λ √ · δ ≥ · ∆ RR . ∆ R r · λ · M (cid:29) δ + ∆ R r , which means that in this case, e ΦΦ ≥ (cid:16) ε ) (cid:17) . Having lower bounded the multiplicative increase in Φ, we note that initially,Φ = e Ω c M +2 log n ! . Claim 4.6.
At all moments in the algorithm γ ≤ log n . Note that before reaching r r ≥ log n , line 9 of Figure 5 will always evaluate to false, and thealgorithm will continue to proceed in a data-independent fashion without further changes to r , r or R . Another way to see this is that when r r ≥ √ log n , then the curve in Figure 2 correspondingto [AI08] will give a data structure with runtime n o (1) and space n o (1) .Additionally, line 5 of Figure 5 enforces that all moments in the algorithm, ξ ≤
4. Thus, at allmoments, Φ ≤ O (log M n ) . Thus, the number of times that
ProcessBall appears in the stack is e O (log log n ). We will nowshow the final part of the proof which we stated earlier: Claim 4.7.
For the first e O (log log n ) many iterations, r · c · (1 − o (1)) ≤ r . Note that showing this will imply η ≥ (cid:16) c +12 (cid:17) . From Claim 4.3, η ≥ c (1 − δ ) N , where N = e O (log log n ), which is in fact, always at most c (1 − o (1)). In order to show that r ≤ c · (1+ o (1)),note that r only increases by a multiplicative factor of (1 + 2 δ ) in each call of ProcessBall . Thisfinishes the proof of all invariants.
Lemma 4.8.
During the algorithm we can always be able to choose η u and η q such that: • F ( η u ) /G ( r /R, η u , η q ) ≤ n ( ρ u + o (1)) /K ; • F ( η q ) /G ( r /R, η u , η q ) ≤ n ( ρ q + o (1)) /K ; • G ( r ∗ /R, η u , η q ) /G ( r /R, η u , η q ) ≤ n ( ρ q − o (1)) /K .Proof. We will focus on the the part of
ProcessSphere where we find settings for η u and η q .There are two important cases: • r ∗ = r . This happens when the third “if” statement evaluates to false. In other words, wehave that (cid:18) − α (cid:18) r R (cid:19) α (cid:18) r R (cid:19)(cid:19) √ ρ q + (cid:18) α (cid:18) r R (cid:19) − α (cid:18) r R (cid:19)(cid:19) √ ρ u ≥ β (cid:18) r R (cid:19) β (cid:18) r R (cid:19) . (15)30ince in a call to ProcessSphere , all points are on the surface of a sphere of radius R , theexpression corresponds to the expression from Theorem 3.3. Thus, as described in Section 3.3.3,we can set η u and η q to satisfy the three conditions. • r ∗ = √ R . This happens when the third “if” statement evaluates to true. We have byLemma 4.2 r r ≥ c − o (1). Since (15) does not hold, thus r < √ R . Hence, r ≤ √ Rc − o (1).If this is the case since r ≤ r ∗ /c − o (1), we are instantiating parameters as in Subsection 3.3.3where r = r R and cr = r ∗ R . Lemma 4.9.
The probability of success of the data structure is at least . .Proof. In all the cases, except for the handling of the pseudo-random remainder, the data structureis deterministic. Therefore, the proof follows in exactly the same way as Lemma 3.6. In this case,we also have at each step that T = G ( r /R,η u ,η q ) , and the induction is over the number of times wehandle the pseudo-random remainder. Lemma 4.10.
The total space the data structure occupies is at most n ρ u + o (1) in expectation.Proof. We will prove that the total number of explicitly stored points (when l = K ) is at most n ρ u + o (1) . We will count the contribution from each point separately, and use linearity of expectationto sum up the contributions. In particular, for a point p ∈ P , we want to count the number of listswhere p appears in the data structure. Each root to leaf path of the tree has at most K calls to ProcessSphere which increment l , and at most e O (log log n ) calls to ProcessBall , and thus e O (log log n ) calls to ProcessSphere which do not increment l . Thus, once we count the numberof lists, we may multiply by K + e O (log log n ) = n o (1) to count the size of the whole tree.For each point, we will consider the subtree of the data structure where the point was processed.In particular, we may consider the tree corresponding to calls to ProcessSphere and
ProcessBall which process p . As discussed briefly in Section 4.3, we distinguish between calls to ProcessSphere which contain p in a dense cluster, and calls to ProcessSphere which contain p in the pseudo-random remainder. We increment l only when p lies in the pseudo-random remainder. Claim 4.11.
It suffices to consider the data structure where each node is a function call to
ProcessSphere which increments l , i.e., when p lies in the pseudo-random remainder, since thetotal amount of duplication of points corresponding to other nodes is n o (1) . We will account for the duplication of points in calls to
ProcessBall and calls to
Pro-cessSphere which do not increment l . Consider the first node v in a path from the root whichdoes not increment l , this corresponds to a call to ProcessSphere which had p in some densecluster. Consider the subtree consisting of descendants of v where the leaves correspond to the firstoccurrence of ProcessSphere which increments l . We claim that every internal node of the treecorresponds to alternating calls to ProcessBall and
ProcessSphere which do not increment l . Note that calls to ProcessSphere which do not increment l never replicate p . Each call to31 rocessBall replicates p in b := r (1+2 δ ) δ many times. Since r ≤ r ≤ c + o (1) by Lemma 4.2, b = O ( δ − ). We may consider contracting the tree and at edge, multiplying by the number of timeswe encounter ProcessBall .Note that p lies in a dense cluster if and only if it does not lie in the pseudo-random remainder.Thus, our contracted tree looks like a tree of K levels, each corresponding to a call to ProcessSphere which contained p in the pseudo-random remainder.The number of children of some nodes may be different; however, the number of times Process-Ball is called in each branch of computation is U := ˜ O (log log n ), the total amount of duplicationof points due to ProcessBall is at most b U = n o (1) . Now, the subtree of nodes processing p contains K levels with each T children, exactly like the data structure for Section 3. Claim 4.12.
A node v corresponding to ProcessSphere ( P, r , r , o, R, l ) has in expectation, p appearing in n (( K − l ) ρ u + o (1)) /K many lists in the subtree of v . The proof is an induction over the value of l in a particular node. For our base case, considersome node v corresponding to a function call of ProcessSphere which is a leaf, so l = K , in thiscase, each point is only stored at most once, so the claim holds.Suppose for the inductive assumption the claim holds for some l , then for a particular node atlevel l −
1, consider the point when p was part of the pseudo-random remainder. In this case, p isduplicated in T · F ( η u ) = 100 · F ( η u ) G ( r /R, η u , η q ) ≤ n ( ρ u + o (1)) /K many children, and in each child, the point appears n (( K − l ) ρ u + o (1)) /K many times. Therefore, in anode v , p appears in n (( K − l +1) ρ u + o (1)) /K many list in its subtree. Letting l = 0 for the root givesthe desired outcome. Lemma 4.13.
The expected query time is at most n ρ q + o (1) .Proof. We need to bound the expected number of nodes we traverse as well as the number of pointswe enumerate for nodes with l = K .We first bound the number of nodes we traverse. Let A ( u, l ) be an upper bound on theexpected number of visited nodes when we start in a ProcessSphere node such that there are u ProcessBall nodes in the stack and l non-cluster nodes. By Lemma 4.2, u ≤ U := e O (log log n ) , and from the description of the algorithm, we have l ≤ K . We will prove A (0 , ≤ n ρ q + o (1) , whichcorresponds to the expected number of nodes we touch starting from the root.We claim A ( u, l ) ≤ exp(log / o (1) n ) · A ( u + 1 , l ) + n ( ρ q + o (1)) /K · A ( u, l + 1) . (16)32here are at most 1 /τ = exp(log / n ) cluster nodes, and in each node, we recurse on r (1+2 δ ) δ =exp(log o (1) n ) possible annuli with calls to ProcessSphere nodes where u increased by 1 and l remains the same. On the other hand, there are T · F ( η q ) = 100 · F ( η q ) G ( r /R, η u , η q ) ≤ n ( ρ q + o (1)) /K caps, where the query falls, in expectation. Each calls ProcessSphere where u remains the sameand l increased by 1.Solving (16): A (0 , ≤ U + KK ! exp( U · log / o (1) n ) · n ρ q + o (1) ≤ n ρ q + o (1) . We now give an upper bound on the number of points the query algorithm will test at level K .Let B ( u, l ) be an upper bound on the expected fraction of the dataset in the current node that thequery algorithm will eventually test at level K (where we count multiplicities). u and l have thesame meaning as discussed above.We claim B ( u, l ) ≤ τ · B ( u + 1 , l ) + n ( ρ q − o (1)) /K · B ( u, l + 1)The first term comes from recursing down dense clusters. The second term is a bit more subtle. Inparticular, suppose r = r ∗ , then the expected fraction of points is T · G ( r /R, η u , η q ) · B ( u, l + 1) = 100 · G ( r /R, η u , η q ) · B ( u, l + 1) G ( r /R, η u , η q ) ≤ n ( ρ q − o (1)) /K · B ( u, l + 1)by the setting of η u and η q . On the other hand, there is the other case when r ∗ = √ R , whichoccurs after having removed some clusters. In that case, consider a particular cap containing thepoints e P i . For points with distance to the query at most ( √ − ε ) R , there are at most a τ n of them.For the far points, e P i a G ( √ − ε, η u , η q ) fraction of the points in expectation. T · F ( η q ) · (cid:16) τ + G ( √ − ε, η u , η q ) (cid:17) · B ( u, l + 1) = 100 · F ( η q ) · (cid:16) τ + G ( √ − ε, η u , η q ) (cid:17) · B ( u, l + 1) G ( r /R, η u , η q ) ≤ · F ( η q ) · G ( √ , η u , η q ) · B ( u, l + 1) G ( r /R, η u , η q ) ≤ n ( ρ q − o (1)) /K · B ( u, l + 1)Where we used that τ (cid:28) G ( √ , η u , η q ) and G ( √ − ε, η u , η q ) ≤ G ( √ , η u , η q ) · n o (1) /K (since ε = o (1)),and that r ∗ = √ R . Unraveling the recursion, we note that u ≤ U and l ≤ K ∼ √ ln n . Additionally,33e have that B ( u, K ) ≤
1, since we do not store duplicates in the last level. Therefore, B (0 , ≤ U + KU ! (cid:18) τ (cid:19) U · (cid:16) n ( ρ q − o (1)) /K (cid:17) K = n ρ q − o (1) . We introduce a few techniques and concepts to be used primarily for our lower bounds. We start bydefining the approximate nearest neighbor search problem.
Definition 5.1.
The goal of the ( c, r ) -approximate nearest neighbor problem with failure probability δ is to construct a data structure over a set of points P ⊂ {− , } d supporting the following query:given any point q such that there exists some p ∈ P with k q − p k ≤ r , report some p ∈ P where k q − p k ≤ cr with probability at least − δ . We introduce a few definitions from [PTW10] to setup the lower bounds for the ANN.
Definition 5.2 ([PTW10]) . In the
Graphical Neighbor Search problem (GNS), we are given abipartite graph G = ( U, V, E ) where the dataset comes from U and the queries come from V . Thedataset consists of pairs P = { ( p i , x i ) | p i ∈ U, x i ∈ { , } , i ∈ [ n ] } . On query q ∈ V , if there exists aunique p i with ( p i , q ) ∈ E , then we want to return x i . We will sometimes use the GNS problem to prove lower bounds on ( c, r )-ANN as follows: webuild a GNS graph G by taking U = V = {− , } d , and connecting two points u ∈ U, v ∈ V iff theirHamming distance most r (see details in [PTW10]). We will also ensure q is not closer than cr toother points apart from the near neighbor.[PTW10] showed lower bounds for ANN are intimately tied to the following quantity of a metricspace. Definition 5.3 (Robust Expansion [PTW10]) . Consider a GNS graph G = ( U, V, E ) , and fix adistribution e on E ⊂ U × V , where µ is the marginal distribution on U and η is the marginaldistribution on V . For δ, γ ∈ (0 , , the robust expansion Φ r ( δ, γ ) is: Φ r ( δ, γ ) = min A ⊂ V : η ( A ) ≤ δ min B ⊂ U : e ( A × B ) e ( A × V ) ≥ γ µ ( B ) η ( A ) . Our 2-probe lower bounds uses results on Locally-Decodable Codes (LDCs). We present the standarddefinitions and results on LDCs below, although in Section 8, we will use a weaker definition ofLDCs for our 2-query lower bound. 34 efinition 5.4. A ( t, δ, ε ) locally-decodable code (LDC) encodes n -bit strings x ∈ { , } n into m -bitcodewords C ( x ) ∈ { , } m such that, for each i ∈ [ n ] , the bit x i can be recovered with probability + ε while making only t queries into C ( x ) , even if the codeword is arbitrarily modified (corrupted)in δm bits. We will use the following lower bound on the size of the LDCs.
Theorem 5.5 (Theorem 4 from [KdW04]) . If C : { , } n → { , } m is a (2 , δ, ε ) -LDC, then m ≥ Ω( δε n ) . The goal of this section is to compute tight bounds for the robust expansion Φ r ( δ, γ ) in the Hammingspace of dimension d , as defined in the preliminaries. We use these bounds for all of our lowerbounds in the subsequent sections.We use the following model for generating dataset points and queries corresponding to therandom instance of Section 2. Definition 6.1.
For any x ∈ {− , } n , N σ ( x ) is a probability distribution over {− , } n representingthe neighborhood of x . We sample y ∼ N σ ( x ) by choosing y i ∈ {− , } for each coordinate i ∈ [ d ] independently; with probability σ , we set y i = x i , and with probability − σ , y i is set uniformly atrandom.Given any Boolean function f : {− , } n → R , the function T σ f : {− , } n → R is T σ f ( x ) = E y ∼ N σ ( x ) [ f ( y )] (17)In the remainder of this section, will work solely on the Hamming space V = {− , } d . We let σ = 1 − c d = ω (log n )and µ will refer to the uniform distribution over V .A query is generated as follows: we sample a dataset point x uniformly at random and thengenerate the query y by sampling y ∼ N σ ( x ). From the choice of σ and d , d ( x, y ) ≤ d c (1 + o (1))with high probability. In addition, for every other point in the dataset x = x , the pair ( x , y )is distributed as two uniformly random points (even though y ∼ N σ ( x ), because x is randomlydistributed). Therefore, by taking a union-bound over all dataset points, we can conclude that withhigh probability, d ( x , y ) ≥ d (1 − o (1)) for each x = x .Given a query y generated as described above, we know there exists a dataset point x whosedistance to the query is d ( x, y ) ≤ d c (1 + o (1)). Every other dataset point lies at a distance d ( x , y ) ≥ d (1 − o (1)). Therefore, the two distances are a factor of c − o (1) away.35he following lemma is the main result of this section, and we will reference this lemma insubsequent sections. Lemma 6.2 (Robust expansion) . Consider the Hamming space equipped with the Hamming norm.For any p, q ∈ [1 , ∞ ) where ( q − p −
1) = σ , any γ ∈ [0 , , and m ≥ , Φ r (cid:18) m , γ (cid:19) ≥ γ q m qp − q . The robust expansion comes from a straight forward application from small-set expansion. Infact, one can easily prove tight bounds on robust expansion via the following lemma:
Theorem 6.3 (Generalized Small-Set Expansion Theorem, [O’D14]) . Let ≤ σ ≤ . Let A, B ⊂{− , } n have volumes exp( − a ) and exp( − b ) and assume ≤ σa ≤ b ≤ a . Then Pr ( x,y ) σ − correlated [ x ∈ A, y ∈ B ] ≤ exp − a − σab + b − σ ! . We compute the robust expansion via an application of the Bonami-Beckner Inequality andHölder’s inequality. This computation gives us more flexibility with respect to parameters whichwill become useful in subsequent sections. We now recall the necessary tools.
Theorem 6.4 (Bonami-Beckner Inequality [O’D14]) . Fix ≤ p ≤ q and ≤ σ ≤ p ( p − / ( q − .Any Boolean function f : {− , } n → R satisfies k T σ f k q ≤ k f k p . Theorem 6.5 (Hölder’s Inequality) . Let f : {− , } n → R and g : {− , } n → R be arbitraryBoolean functions. Fix s, t ∈ [1 , ∞ ) where s + t = 1 . Then h f, g i ≤ k f k s k g k t . We will let f and g be indicator functions for two sets A and B , and use a combination of theBonami-Beckner Inequality and Hölder’s Inequality to lower bound the robust expansion. Theoperator T σ applied to f will measure the neighborhood of set A . We compute an upper bound onthe correlation of the neighborhood of A and B (referred to as γ ) with respect to the volumes of A and B , and the expression will give a lower bound on robust expansion.We also need the following lemma. Lemma 6.6.
Let p, q ∈ [1 , ∞ ) , where ( p − q −
1) = σ and f, g : {− , } d → R be two Booleanfunctions. Then h T σ f, g i ≤ k f k p k g k q . Proof.
We first apply Hölder’s Inequality to split the inner-product into two parts, apply the36onami-Beckner Inequality to each part. h T σ f, f i = h T √ σ f, T √ σ g i ≤ k T √ σ f k s k T √ σ g k t . We pick the parameters s = p − σ + 1 and t = ss − s + t = 1. Note that p ≤ s because σ < p ≥ p − q −
1) = σ ≤ σ . We have q ≤ σp − t. In addition, s p − s − √ σ s q − t − q ( q − s −
1) = s ( q − p − σ = √ σ. We finally apply the Bonami-Beckner Inequality to both norms to obtain k T √ σ f k s k T √ σ g k t ≤ k f k p k g k q . We are now ready to prove Lemma 6.2.
Proof of Lemma 6.2.
We use Lemma 6.6 and the definition of robust expansion. For any two sets
A, B ⊂ V , let a = d | A | and b = d | B | be the measure of set A and B with respect to the uniformdistribution. We refer to χ A : {− , } d → { , } and χ B : {− , } d → { , } as the indicatorfunctions for A and B . Then, γ = Pr x ∼ µ,y ∼ N σ ( x ) [ x ∈ B | y ∈ A ] = 1 a h T σ χ A , χ B i ≤ a p − b q . (18)Therefore, γ q a q − qp ≤ b . Let A and B be the minimizers of ba satisfying (18) and a ≤ m .Φ r (cid:18) m , γ (cid:19) = ba ≥ γ q a q − qp − ≥ γ q m qp − q . In this section, we prove Theorem 1.3. Our proof relies on the main result of [PTW10] for the GNSproblem:
Theorem 6.7 (Theorem 1.5 [PTW10]) . There exists an absolute constant γ such that the followingholds. Any randomized cell-probe data structure making t probes and using m cells of w bits for a eakly independent instance of GNS which is correct with probability greater than must satisfy m t wn ≥ Φ r (cid:18) m t , γt (cid:19) . Proof of Theorem 1.3.
The lower bound follows from a direct application of Lemma 6.2 to Theo-rem 6.7. Setting t = 1 in Theorem 6.7, we obtain mw ≥ n · Φ r (cid:18) m , γ (cid:19) ≥ nγ q m qp − q for some p, q ∈ [1 , ∞ ) and ( p − q −
1) = σ . Rearranging the inequality and letting p = 1 + log log n log n ,and q = 1 + σ n log log n , we obtain m ≥ γ pp − n ppq − q w ppq − q ≥ n σ − o (1) . Since σ = 1 − c and w = n o (1) , we obtain the desired result. Corollary 6.8.
Any 1 cell probe data structure with cell size n o (1) for c -approximate nearestneighbors on the sphere in ‘ needs n c − c − − o (1) many cells.Proof. Each point in the Hamming space {− , } d (after scaling by √ d ) can be thought of as lyingon the unit sphere. If two points are a distance r apart in the Hamming space, then they are 2 √ r apart on the sphere with ‘ norm. Therefore a data structure for a c -approximation on the spheregives a data structure for a c -approximation in the Hamming space. In this section, we prove Theorem 1.6, i.e., a tight lower bound against data structure that fallinside the “list-of-points” model, as defined in Def. 1.5.
Theorem 7.1 (Restatement of Theorem 1.6) . Let D be a list-of-points data structure which solves ( c, r ) -ANN for n points in the d -dimensional Hamming space with d = ω (log n ) . Suppose D isspecified by a sequence of m sets { A i } mi =1 and a procedure for outputting a subset I ( q ) ⊂ [ m ] usingexpected space s = n ρ u , and expected query time n ρ q − o (1) with success probability . Then c √ ρ q + ( c − √ ρ u ≥ √ c − . We will prove the lower bound by giving a lower bound on list-of-points data structures whichsolve the random instance for the Hamming space defined in Section 2. The dataset consists of n points { u i } ni =1 where each u i ∼ V drawn uniformly at random, and a query v is drawn from theneighborhood of a random dataset point. Thus, we may assume D is a deterministic data structure.38ix a data structure D , where A i ⊂ V specifies which dataset points are placed in L i . Additionally,we may define B i ⊂ V which specifies which query points scan L i , i.e., B i = { v ∈ V | i ∈ I ( v ) } .Suppose we sample a random dataset point u ∼ V and then a random query point v from theneighborhood of u . Let γ i = Pr[ v ∈ B i | u ∈ A i ]represent the probability that query v scans the list L i , conditioned on u being in L i . Additionally,we write s i = µ ( A i ) as the normalized size of A . The query time for D is given by the followingexpression: T = m X i =1 χ B i ( v ) n X j =1 χ A i ( u j ) E [ T ] = m X i =1 µ ( B i ) + m X i =1 γ i µ ( A i ) + ( n − m X i =1 µ ( B i ) µ ( A i ) ≥ m X i =1 Φ r ( s i , γ i ) s i + m X i =1 s i γ i + ( n − m X i =1 Φ r ( s i , γ i ) s i . (19)Since the data structure succeeds with probability γ , m X i =1 s i γ i ≥ γ = Pr j ∼ [ n ] ,v ∼ N ( u j ) [ ∃ i ∈ [ m ] : v ∈ B i , u j ∈ A i ] . (20)Additionally, since D uses at most s space, n m X i =1 s i ≤ O ( s ) . (21)Using the two constraints in (20) and (21), we will use the estimates of robust expansion in order tofind a lower bound for (19). From Lemma 6.2, for any p, q ∈ [1 , ∞ ) where ( p − q −
1) = σ where σ = 1 − c , E [ T ] ≥ m X i =1 s q − qp i γ qi + ( n − m X i =1 s q − qp +1 i γ qi + γγ ≤ m X i =1 s i γ i O (cid:18) sn (cid:19) ≥ m X i =1 s i . We set S = { i ∈ [ m ] : s i = 0 } and for i ∈ S , we write v i = s i γ i . Then E [ T ] ≥ X i ∈ S v qi (cid:18) s − qp i + ( n − s − qp +1 i (cid:19) ≥ X i ∈ S (cid:18) γ | S | (cid:19) q (cid:18) s − qp i + ( n − s − qp +1 i (cid:19) (22)39here we used the fact q ≥
1. Consider F = X i ∈ S (cid:18) s − qp i + ( n − s − qp +1 i (cid:19) . (23)We analyze three cases separately: • < ρ u ≤ c − • c − < ρ u ≤ c − c − • ρ u = 0.For the first two cases, we let q = 1 − σ + σβ p = ββ − σ β = s − σ ρ u (24)Since 0 < ρ u ≤ c − c − , one can verify β > σ and both p and q are at least 1. Lemma 7.2.
When ρ u ≤ c − , and s = n ρ u , E [ T ] ≥ Ω( n ρ q ) where ρ q and ρ u satisfy Equation 4.Proof. In this setting, p and q are constants, and q ≥ p . Therefore, qp ≥ F can be viewed asconsisting of the contributions of each s i ’s in Equation 23, constrained by (21). One can easilyverify that F minimized when s i = O ( sn | S | ), so substituting in (22), E [ T ] ≥ Ω γ q s − q/p +1 n q/p | S | q − q/p ! ≥ Ω( γ q s − q n q/p )since q − q/p > | S | ≤ s . In addition, p , q and γ are constants, and note the fact s = n ρ u ,and (24), we let n ρ q be the best query time we can achieve. Combining these facts, along with thelower bound for ρ q in (7), we obtain the following relationship between ρ q and ρ u : ρ q = (1 + ρ u )(1 − q ) + qp = (1 + ρ u )( σ − σβ ) + (1 − σ + σβ )( β − σ ) β = (cid:16)p − σ − √ ρ u σ (cid:17) = √ c − c − √ ρ u · ( c − c ! . emma 7.3. When ρ u > c − , E [ T ] ≥ Ω( n ρ q ) where ρ q and ρ u satisfy Equation 4.Proof. We follow a similar pattern to Lemma 7.2. ∂F∂s i = (cid:18) − qp (cid:19) s − qp − i + (cid:18) − qp + 1 (cid:19) ( n − s − qp i . Consider the case when each ∂F∂s i ( s i ) = 0, by setting s i = q ( p − q )( n −
1) . Since q < p , this valueis positive and P i ∈ S s i ≤ O (cid:0) mn (cid:1) for large enough n . Thus, F is minimized at this point, and E [ T ] ≥ (cid:16) γ | S | (cid:17) q | S | (cid:16) q ( p − q )( n − (cid:17) − qp . Since q ≥ | S | ≤ s , E [ T ] ≥ (cid:18) γs (cid:19) q s (cid:18) q ( p − q )( n − (cid:19) − qp . Since p , q and γ are constants, E [ T ] ≥ Ω( n ρ q ), ρ q = (1 + ρ u )(1 − q ) + qp which is the same expression for ρ q as in Lemma 7.2. Lemma 7.4.
When ρ u = 0 (so s = O ( n ) ), E [ T ] ≥ n ρ q − o (1) where ρ q = 2 c − c = 1 − σ .Proof. In this case, we let q = 1 + σ · log n log log n p = 1 + log log n log n . Since q > p , we have E [ T ] = Ω( γ q s − q n qp ) = n − σ − o (1) , which is the desired expression. In this section, we prove the cell probe lower bound for t = 2 cell probes stated in Theorem 1.4.We follow the framework in [PTW10] and prove lower bounds for GNS when U = V withmeasure µ (see Def. 5.2). We assume there is an underlying graph G with vertex set V . For anypoint p ∈ V , we write p ’s neighborhood , N ( p ), as the set of points with an edge incident on p in G .41n the 2-probe GNS problem, we are given a dataset P = { p i } ni =1 ⊂ V of n points as well as abit-string x ∈ { , } n . The goal is to build a data structure supporting the following types of queries:given a point q ∈ V , if there exists a unique neighbor p i ∈ N ( q ) ∩ P , return x i with probability atleast after making two cell-probes.We let D denote a data structure with m cells of w bits each. D will depend on the dataset P as well as the bit-string x . We will prove the following theorem. Theorem 8.1.
There exists a constant γ > such that any non-adaptive GNS data structureholding a dataset of n ≥ points which succeeds with probability using two cell probes and m cellsof w bits satisfies m log m · O ( w ) n ≥ Ω (cid:18) Φ r (cid:18) m , γ (cid:19)(cid:19) . Theorem 1.4 will follow from Theorem 8.1 together with the robust expansion bound fromLemma 6.2 for the special case of non-adaptive probes. We will later show how to reduce adaptivealgorithms losing a sub-polynomial factor in the space for w = o (log n ) in Section 8.6.3. We nowproceed to proving Theorem 8.1.At a high-level, we show that a “too-good-to-be-true”, 2-probe data structure implies a weakernotion of 2-query locally-decodable code (LDC) with small noise rate using the same amount ofspace . Even though our notion of LDC is weaker than Def. 5.4, we adapt the tools for showing2-query LDC lower bounds from [KdW04]. These arguments, using quantum information theory,are very robust and work well with the weaker 2-query LDC we construct.We note that [PTW08] was the first to suggest the connection between ANN and LDCs. Thiswork represents the first concrete connection which gives rise to better lower bounds. Proof structure.
The proof of Theorem 8.1 proceeds in six steps.1. First we use Yao’s principle to focus on deterministic non-adaptive data structures for GNSwith two cell-probes. We provide distributions over n -point datasets P , as well as bit-strings x and a query q , and assume the existence of a deterministic data structure succeeding withprobability at least .2. We simplify the deterministic data structure in order to get “low-contention” data structures.These are data structures which do not rely on any single cell too much (similar to Def. 6.1 in[PTW10]).3. We use ideas from [PTW10] to understand how queries neighboring particular dataset pointsprobe various cells of the data structure. We fix an n -point dataset P with a constant fractionof the points satisfying the following condition: many possible queries in the neighborhood ofthese points probe disjoint pairs of cells. A 2-query LDC corresponds to LDCs which make two probes to their memory contents. Even though there is aslight ambiguity with the data structure notion of query, we say “2-query LDCs” in order to be consistent with theLDC literature.
42. For the fixed dataset P , we show that we can recover a constant fraction of bits of x withsignificant probability even if we corrupt the contents of some cells.5. We reduce to data structures with 1-bit words in order to apply the LDC arguments from[KdW04].6. Finally, we design an LDC with weaker guarantees and use the arguments in [KdW04] toprove lower bounds on the space of the weak LDC. Definition 8.2.
A non-adaptive randomized algorithm R for the GNS problem making two cell-probes is an algorithm specified by the following two components:1. A procedure which preprocess a dataset P = { p i } ni =1 of n points, as well as a bit-string x ∈ { , } n in order to output a data structure D ∈ ( { , } w ) m .2. An algorithm R that given a query q , chooses two indices ( i, j ) ∈ [ m ] and specifies a function f q : { , } w × { , } w → { , } .We require the data structure D and the algorithm R satisfy Pr R,D [ f q ( D j , D k ) = x i ] ≥ whenever q ∈ N ( p i ) and p i is the unique such neighbor. Note that the procedure which outputs the data structure does not depend on the query q , andthat the algorithm R does not depend on the dataset P or bit-string x . Definition 8.3.
We define the following distributions: • Let P be the uniform distribution supported on n -point datasets from V . • Let X be the uniform distribution over { , } n . • Let Q ( P ) be the distribution over queries given by first drawing a dataset point p ∈ P uniformlyat random and then drawing q ∈ N ( p ) uniformly at random. Lemma 8.4.
Assume R is a non-adaptive randomized algorithm for GNS using two cell-probes.Then, there exists a non-adaptive deterministic algorithm A for GNS using two cell-probes succeedingwith probability at least when the dataset P ∼ P , the bit-string x ∼ X , and q ∼ Q ( P ) .Proof. We apply Yao’s principle to the success probability of the algorithm. By assumption, thereexists a distribution over algorithms which can achieve probability of success at least for any singlequery. Therefore, for the fixed distributions P , X , and Q , there exists a deterministic algorithmachieving at least the same success probability. 43n order to simplify notation, we let A D ( q ) denote output of the algorithm A . We assume that A ( q ) outputs a pair of indices ( j, k ) as well as the function f q : { , } w × { , } w → { , } , andthus, we use A D ( q ) as the output of f q ( D j , D k ). For any fixed dataset P = { p i } ni =1 and bit-string x ∈ { , } n , Pr q ∼ N ( p i ) [ A D ( q ) = x i ] = Pr q ∼ N ( p i ) [ f q ( D j , D k ) = x i ] . This notation allows us to succinctly state the probability of correctness when the query is a neighborof p i .For the remainder of the section, we let A denote a non-adaptive deterministic algorithmsucceeding with probability at least using m cells of width w . The success probability is takenover the random choice of the dataset P ∼ P , x ∼ X and q ∼ Q ( P ). For any t ∈ { , } and j ∈ [ m ], let A t,j be the set of queries which probe cell j at the t -th probeof algorithm A . Since A is deterministic, the indices ( i, j ) ∈ [ m ] which A outputs are completelydetermined by two collections A = { A ,j } j ∈ [ m ] and A = { A ,j } j ∈ [ m ] which independently partitionthe query space V . On query q , if q ∈ A ,i and q ∈ A ,j , algorithm A outputs the indices ( i, j ).We now define the notion of low-contention data structures, which requires the data structureto not rely on any one particular cell too much by ensuring no A t,j is too large. Definition 8.5.
A deterministic non-adaptive algorithm A using m cells has low contention ifevery set µ ( A t,j ) ≤ m for t ∈ { , } and j ∈ [ m ] . We now use the following lemma to argue that up to a small increase in space, a data structurecan be made low-contention.
Lemma 8.6.
Let A be a deterministic non-adaptive algorithm for GNS making two cell-probes using m cells. There exists a deterministic non-adaptive algorithm A for GNS making two cell-probesusing m cells which has low contention and succeeds with the same probability.Proof. Suppose µ ( A t,j ) ≥ m for some j ∈ [ m ]. We partition A t,j into enough sets { A ( j ) t,k } k of measure m and at most one set with measure between 0 and m . For each of set A ( j ) t,k , we make a new cell j k with the same contents as cell j . When a query lies inside A ( j ) t,k the t -th probe is made to the newcell j k instead of cell j .We apply the above transformation on all sets with µ ( A t,j ) ≥ m . In the resulting data structure,in each partition A and A , there can be at most m cells of measure m and at most m sets withmeasure less than m . Therefore, the transformed data structure has at most 3 m cells. Since thecontents remain the same, the data structure succeeds with the same probability.Given Lemma 8.6, we assume that A is a deterministic non-adaptive algorithm for GNS withtwo cell-probes using m cells which has low contention. The extra factor of 3 in the number of cellsis absorbed in the asymptotic notation. 44 .3 Datasets which shatter We fix some γ >
Definition 8.7 (Weak-shattering [PTW10]) . We say a partition A , . . . , A m of V ( K, γ ) - weaklyshatters a point p if X i ∈ [ m ] (cid:18) µ ( A i ∩ N ( p )) − K (cid:19) + ≤ γ, where the operator ( · ) + : R → R + is the identity on positive real numbers and zero otherwise. Lemma 8.8 (Shattering [PTW10]) . Let A , . . . , A k collection of disjoint subsets of measure at most m . Then Pr p ∼ µ [ p is ( K, γ ) -weakly shattered ] ≥ − γ for K = Φ r (cid:16) m , γ (cid:17) · γ . For the remainder of the section, we let K = Φ r m , γ ! · γ . We are interested in dataset points which are shattered with respect to the collections A and A . Intuitively, queries which are near-neighbors of these dataset points probe various disjoint cellsin the data structure, so their corresponding bit is stored in many cells. Definition 8.9.
Let p ∈ V be a dataset point which is ( K, γ ) -weakly shattered by A and A . Let β , β ⊂ N ( p ) be arbitrary subsets where each j ∈ [ m ] satisfies µ ( A ,j ∩ N ( p ) \ β ) ≤ K and µ ( A ,j ∩ N ( p ) \ β ) ≤ K Since p is ( K, γ ) -weakly shattered, we can pick β and β with measure at most γ each. We willrefer to β ( p ) = β ∪ β . For a fixed dataset point p ∈ P , we refer to β ( p ) as the set holding the slack in the shatteringof measure at most 2 γ . For a given collection A , let S ( A , p ) be the event that the collection A ( K, γ )-weakly shatters p . Note that Lemma 8.8 implies that Pr p ∼ µ [ S ( A , p )] ≥ − γ . Lemma 8.10.
With high probability over the choice of n -point dataset, at most γn points do notsatisfy S ( A , p ) and S ( A , p ) .Proof. The expected number of points p which do not satisfy S ( A , p ) and S ( A , p ) is at most 2 γn .Therefore via a Chernoff bound, the probability that more than 4 γn points do not satisfy S ( A , p )and S ( A , p ) is at most exp (cid:16) − γn (cid:17) . 45e call a dataset good if there are at most 4 γn dataset points which are not ( K, γ )-weaklyshattered by A and A . Lemma 8.11.
There exists a good dataset P = { p i } ni =1 where Pr x ∼X ,q ∼Q ( P ) [ A D ( q ) = x i ] ≥ − o (1) Proof.
For any fixed dataset P = { p i } ni =1 , let P = Pr x ∼X ,q ∼ Q ( p ) [ A D ( q ) = x i ] . Then, 23 ≤ E P ∼P [ P ]= (1 − o (1)) · E P ∼P [ P | P is good] + o (1) · E P ∼P [ P | P is not good]23 − o (1) ≤ (1 − o (1)) · E P ∼P [ P | P is good] . Therefore, there exists a dataset which is not shattered by at most 4 γn and Pr x ∼X ,q ∼Q ( P ) [ A D ( y ) = x i ] ≥ − o (1). In the rest of the proof, we fix the dataset P = { p i } ni =1 satisfying the conditions of Lemma 8.11, i.e., P satisfies Pr x ∼X ,q ∼Q ( P ) [ A D ( q ) = x i ] ≥ − o (1) . (25)We now introduce the notion of corruption of the data structure cells D , which parallels thenotion of noise in locally-decodable codes. Remember that, after fixing some bit-string x , thealgorithm A produces some data structure D ∈ ( { , } w ) m . Definition 8.12.
We call D ∈ ( { , } w ) m a corrupted version of D at k cells if D and D differon at most k cells, i.e., if |{ i ∈ [ m ] : D i = D i }| ≤ k . Definition 8.13.
For a fixed x ∈ { , } n , let c x ( i ) = Pr q ∼ N ( p i ) [ A D ( q ) = x i ] (26) denote the recovery probability of bit i . Note that from the definitions of Q ( P ) , E [ c x ( i )] ≥ − o (1) ,where the expectation is taken over x ∼ X and i ∈ [ n ] .
46n this section, we show there exist a subset S ⊂ [ n ] of size Ω( n ) where each i ∈ S has constantrecovery probability averaged over x ∼ X , even if the algorithm probes a corrupted version of datastructure. We let ε > Lemma 8.14.
Fix a vector x ∈ { , } n , and let D ∈ ( { , } w ) m be the data structure that algorithm A produces on dataset P and bit-string x . Let D be a corruption of D at εK cells. For every i ∈ [ n ] where events S ( A , p i ) and S ( A , p i ) occur, Pr q ∼ N ( p i ) [ A D ( q ) = x i ] ≥ c x ( i ) − γ − ε. Proof.
Note that c x ( i ) represents the probability that algorithm A run on a uniformly chosen queryfrom the neighborhood of p i returns the correct answer, i.e. x i . We denote the subset C ⊂ N ( p ) ofqueries that when run on A return x i ; so, µ ( C ) = c x ( i ) by definition.By assumption, p i is ( K, γ )-weakly shattered by A and A , so by Def. 8.9, we specify some β ( p ) ⊂ N ( p ) where µ ( C ∩ β ( p )) ≤ µ ( β ( p )) ≤ γ . Let C = C \ β ( p ), where µ ( C ) ≥ c i ( x ) − γ .Again, by assumption that p i is ( K, γ )-weakly shattered, each j ∈ [ m ] and t ∈ { , } satisfy µ ( C ∩ A t,j ) ≤ K . Let ∆ ⊂ [ m ] be the set of εK cells where D and D differ, and let C ⊂ C begiven by C = C \ [ j ∈ ∆ ( A ,j ∪ A ,j ) . Thus, µ ( C ) ≥ µ ( C ) − X j ∈ ∆ ( µ ( C ∩ A ,j ) + µ ( C ∩ A ,j )) ≥ c i ( x ) − γ − ε. If q ∈ C , then on query q , algorithm A probes cells outside of ∆, so A D ( q ) = A D ( q ) = x i . Lemma 8.15.
There exists a set S ⊂ [ n ] of size Ω( n ) with the following property. If i ∈ S , thenevents S ( A , p i ) and S ( A , p i ) occur, and E x ∼X [ c x ( i )] ≥
12 + ν, where ν is a constant. Proof.
For i ∈ [ n ], let E i be the event that S ( A , p i ) and S ( A , p i ) occur and E x ∼X [ c x ( i )] ≥ + ν .Additionally, let P = Pr i ∈ [ n ] [ E i ] . One can think of ν as around .
47e set S = { i ∈ [ n ] | E i } , so it remains to show that P = Ω(1). To this end,23 − o (1) ≤ E x ∼X ,i ∈ [ n ] [ c x ( i )] (by Equations 25 and 26) ≤ γ + P + (cid:18)
12 + ν (cid:19) · (1 − P ) (since P is good)16 − o (1) − γ − ν ≤ P · (cid:18) − ν (cid:19) . Fix the set S ⊂ [ n ] satisfying the conditions of Lemma 8.15. We combine Lemma 8.14 andLemma 8.15 to obtain the following condition on the dataset. Lemma 8.16.
Whenever i ∈ S , E x ∼X " Pr q ∼ N ( p i ) [ A D ( q ) = x i ] ≥
12 + η where η = ν − γ − ε and D differs from D in εK cells.Proof. Whenever i ∈ S , p i is ( K, γ )-weakly shattered. By Lemma 8.15, A outputs x i with probability + ν on average when probing the data structure D on input q ∼ N ( p i ), i.e E x ∼X " Pr q ∼ N ( p i ) [ A D ( q ) = x i ] ≥
12 + ν. Therefore, from Lemma 8.14, if A probes D which is a corruption of D in any εK cells, A willrecover x i with probability at least + ν − γ − ε averaged over all x ∼ X where q ∼ N ( p i ). Inother words, E x ∼X " Pr q ∼ N ( p i ) [ A D ( q ) = x i ] ≥
12 + ν − γ − ε. Summarizing the results of the section, we conclude with the following theorem.
Theorem 8.17.
There exists a two-probe algorithm and a subset S ⊆ [ n ] of size Ω( n ) , satisfyingthe following property. When i ∈ S , we can recover x i with probability at least + η over a randomchoice of x ∼ X , even if we probe a corrupted version of the data structure at εK cells.Proof. We describe how one can recover bit x i from a data structure generated by algorithm A . Inorder to recover x i , we generate a random query q ∼ N ( p i ) and probe the data structure at the cellsspecified by A . From Lemma 8.16, there exists a set S ⊂ [ n ] of size Ω( n ) for which the describedalgorithm recovers x i with probability at least + η , where the probability is taken on average overall possible x ∈ { , } n . 48ince we fixed the dataset P = { p i } ni =1 satisfying the conditions of Lemma 8.11, we will abusea bit of notation, and refer to algorithm A as the algorithm which recovers bits of x described inTheorem 8.17. We say that x ∈ { , } n is an input to algorithm A in order to initialize the datastructure with dataset P = { p i } ni =1 and x i is the bit associated with p i . In order to apply the lower bounds of 2-query locally-decodable codes, we reduce to the case whenthe word size w is one bit. Lemma 8.18.
There exists a deterministic non-adaptive algorithm A which on input x ∈ { , } n builds a data structure D using m · w cells of bit. For any i ∈ S as well as any corruption C which differs from D in at most εK cells satisfies E x ∈{ , } n " Pr q ∼ N ( p i ) [ A C ( q ) = x i ] ≥
12 + η w . Proof.
Given algorithm A which constructs the data structure D ∈ ( { , } w ) m on input x ∈ { , } n ,construct the following data structure D ∈ ( { , } ) m · w . For each cell D j ∈ { , } w , make 2 w cellscontaining all parities of the w bits in D j . This procedure increases the size of the data structureby a factor of 2 w .Fix i ∈ S and q ∈ N ( p i ) be a query. If the algorithm A produces a function f q : { , } w ×{ , } w → { , } which succeeds with probability at least + ζ over x ∈ { , } n , then there exists asigned parity on some input bits which equals f q in at least + ζ w inputs x ∈ { , } n . Let S j be theparity of the bits of cell j and S k be the parity of the bits of cell k . Let f q : { , } × { , } → { , } denote the parity or the negation of the parity which equals f q on + ζ w possible input strings x ∈ { , } n .Algorithm A will evaluate f q at the cell containing the parity of the S j bits in cell j and theparity of S k bits in cell k . Let I S j , I S k ∈ [ m · w ] be the indices of these cells. If C is a sequence of m · w cells which differ in εK many cells from D , then E x ∈{ , } n " Pr q ∼ N ( p i ) [ f q ( C I Sj , C I Sk ) = x i ] ≥
12 + η w whenever i ∈ S .For the remainder of the section, we will prove a version of Theorem 8.1 for algorithms with1-bit words. Given Lemma 8.18, we will modify the space to m · w and the probability to + η w to obtain the answer. So for the remainder of the section, assume algorithm A has 1 bit words. To complete the proof of Theorem 8.1, it remains to prove the following lemma.49 emma 8.19.
Let A be a non-adaptive deterministic algorithm which makes cell probes to a datastructure D of m cells of bit and recover x i with probability + η on random input x ∈ { , } n even after εK cells are corrupted whenever i ∈ S for some fixed S of size Ω( n ) . Then the followingmust hold: m log mn ≥ Ω (cid:16) εKη (cid:17) . The proof of the lemma uses [KdW04] and relies heavily on notions from quantum computing.In particular, quantum information theory applied to LDC lower bounds.
We introduce a few concepts from quantum computing that are necessary in our subsequentarguments. The quantum state of a qubit is described by a unit-length vector in C . We write thequantum state as a linear combination of the basis states ( ) = | i and ( ) = | i . The quantumstate α = ( α α ) can be written | α i = α | i + α | i where we refer to α and α as amplitudes and | α | + | α | = 1. The quantum state of an m - qubit system is a unit vector in the tensor product C ⊗ · · · ⊗ C of dimension 2 m . The basisstates correspond to all 2 m bit-strings of length m . For j ∈ [2 m ], we write | j i as the basis state | j i ⊗ | j i ⊗ · · · ⊗ | j m i where j = j j . . . j m is the binary representation of j . We will write the m -qubit quantum state | φ i as unit-vector given by linear combination over all 2 m basis states. So | φ i = P j ∈ [2 m ] φ j | j i . As a shorthand, h φ | corresponds to the conjugate transpose of a quantum state.A mixed state { p i , | φ i i} is a probability distribution over quantum states. In this case, we thequantum system is in state | φ i i with probability p i . We represent mixed states by a density matrix P p i | φ i i h φ i | .A measurement is given by a family of Hermitian positive semi-definite operators which sum tothe identity operator. Given a quantum state | φ i and a measurement corresponding to the familyof operators { M ∗ i M i } i , the measurement yields outcome i with probability k M i | φ i k and results instate M i | φ ik M i | φ ik , where the norm k · k is the ‘ norm. We say the measurement makes the observation M i .Finally, a quantum algorithm makes a query to some bit-string y ∈ { , } m by starting with thestate | c i | j i and returning ( − c · y j | c i | j i . One can think of c as the control qubit taking values 0 or1; if c = 0, the state remains unchanged by the query, and if c = 1 the state receives a ( − y j in itsamplitude. The queries may be made in superposition to a state, so the state P c ∈{ , } ,j ∈ [ m ] α cj | c i | j i becomes P c ∈{ , } ,j ∈ [ m ] ( − c · y j α cj | c i | j i . C : { , } n → { , } m is a (2 , δ, η ) -LDC if there exists a randomized decodingalgorithm making at most queries to an m -bit string y non-adaptively, and for all x ∈ { , } n , ∈ [ n ] , and y ∈ { , } m where d ( y, C ( x )) ≤ δm , the algorithm can recover x i from the two queriesto y with probability at least + η . In their paper, [KdW04] prove the following result about 2-query LDCs.
Theorem 8.21 (Theorem 4 in [KdW04]) . If C : { , } n → { , } m is a (2 , δ, η ) -LDC, then m ≥ Ω( δη n ) . The proof of Theorem 8.21 proceeds as follows. They show how to construct a 1-query quantum-LDC from a classical 2-query LDC. From a 1-query quantum-LDC, [KdW04] constructs a quantumrandom access code which encodes n -bit strings in O (log m ) qubits. Then they apply a quantuminformation theory lower bound due to Nayak [Nay99]: Theorem 8.22 (Theorem 2 stated in [KdW04] from Nayak [Nay99]) . For any encoding x → ρ x of n -bit strings into m -qubit states, such that a quantum algorithm, given query access to ρ x , candecode any fixed x i with probability at least / η , it must hold that m ≥ (1 − H (1 / η )) n . Our proof will follow a pattern similar to the proof of Theorem 8.21. We assume the existenceof a GNS algorithm A which builds a data structure D ∈ { , } m .Our algorithm A from Theorem 8.17 does not satisfy the strong properties of an LDC, preventingus from applying 8.21 directly. However, it does have some LDC- ish guarantees. In particular, wecan recover bits in the presence of εK corruptions to D . In the LDC language, this means that wecan tolerate a noise rate of δ = εKm . Additionally, we cannot necessarily recover every coordinate x i ,but we can recover x i for i ∈ S , where | S | = Ω( n ). Also, our success probability is + η over therandom choice of i ∈ S and the random choice of the bit-string x ∈ { , } n . Our proof follows byadapting the arguments of [KdW04] to this weaker setting. Lemma 8.23.
Let r = δa where δ = εKm and a ≤ is a constant. Let D be the data structurefrom above (i.e., satisfying the hypothesis of Lemma 8.19). Then there exists a quantum algorithmthat, starting from the r (log m + 1) -qubit state with r copies of | U ( x ) i , where | U ( x ) i = 1 √ m X c ∈{ , } ,j ∈ [ m ] ( − c · D j | c i | j i can recover x i for any i ∈ S with probability + Ω( η ) (over a random choice of x ). Assuming Lemma 8.23, we can complete the proof of Lemma 8.19.
Proof of Lemma 8.19.
The proof is similar to the proof of Theorem 2 of [KdW04]. Let ρ x representthe s -qubit system consisting of the r copies of the state | U ( x ) i , where s = r (log m + 1); ρ x is anencoding of x . Using Lemma 8.23, we can assume we have a quantum algorithm that, given ρ x , canrecover x i for any i ∈ S with probability α = + Ω( η ) over the random choice of x ∈ { , } n .We will let H ( A ) be the Von Neumann entropy of A , and H ( A | B ) be the conditional entropyand H ( A : B ) the mutual information. 51et XM be the ( n + s )-qubit system12 n X x ∈{ , } n | x i h x | ⊗ ρ x . The system corresponds to the uniform superposition of all 2 n strings concatenated with theirencoding ρ x . Let X be the first subsystem corresponding to the first n qubits and M be the secondsubsystem corresponding to the s qubits. We have H ( XM ) = n + 12 n X x ∈{ , } n H ( ρ x ) ≥ n = H ( X ) H ( M ) ≤ s, since M has s qubits. Therefore, the mutual information H ( X : M ) = H ( X ) + H ( M ) − H ( XM ) ≤ s .Note that H ( X | M ) ≤ P ni =1 H ( X i | M ). By Fano’s inequality, if i ∈ S , H ( X i | M ) ≤ H ( α )where we are using the fact that Fano’s inequality works even if we can recover x i with probability α averaged over all x ’s. Additionally, if i / ∈ S , H ( X i | M ) ≤
1. Therefore, s ≥ H ( X : M ) = H ( X ) − H ( X | M ) ≥ H ( X ) − n X i =1 H ( X i | M ) ≥ n − | S | H ( α ) − ( n − | S | )= | S | (1 − H ( α )) . Furthermore, 1 − H ( α ) ≥ Ω( η ) since, and | S | = Ω( n ), we have2 ma εK (log m + 1) ≥ Ω (cid:16) nη (cid:17) m log mn ≥ Ω (cid:16) εKη (cid:17) . It remains to prove Lemma 8.23, which we proceed to do in the rest of the section. We firstshow that we can simulate our GNS algorithm with a 1-query quantum algorithm.
Lemma 8.24.
Fix an x ∈ { , } n and i ∈ [ n ] . Let D ∈ { , } m be the data structure produced byalgorithm A on input x . Suppose Pr q ∼ N ( p i ) [ A D ( q ) = x i ] = + b for b > . Then there exists aquantum algorithm which makes one quantum query (to D ) and succeeds with probability + b tooutput x i . roof. We use the procedure in Lemma 1 of [KdW04] to determine the output algorithm A oninput x at index i . The procedure simulates two classical queries with one quantum query.Without loss of generality, all quantum algorithms which make 1-query to D can be specified inthe following manner: there is a quantum state | Q i i , where | Q i i = X c ∈{ , } ,j ∈ [ m ] α cj | c i | j i which queries D . After querying D , the resulting quantum state is | Q i ( x ) i , where | Q i ( x ) i = X c ∈{ , } ,j ∈ [ m ] ( − c · D j α cj | c i | j i . There is also a quantum measurement { R, I − R } such that, after the algorithm obtains the state | Q i ( x ) i , it performs the measurement { R, I − R } . If the algorithm observes R , it outputs 1 and ifthe algorithm observes I − R , it outputs 0.From Lemma 8.24, we know there exist a state | Q i i and a measurement { R, I − R } where ifalgorithm A succeeds with probability + η on random x ∼ { , } n , then the quantum algorithmsucceeds with probability + η on random x ∼ { , } n .In order to simplify notation, we write p ( φ ) as the probability of making observation R fromstate | φ i . Since R is a positive semi-definite matrix, R = M ∗ M and so p ( φ ) = k M | φ i k .In exactly the same way as [KdW04], we can remove parts of the quantum state | Q i ( x ) i where α cj > √ δm = √ εK . If we let L = { ( c, j ) | α cj ≤ √ εK } , after keeping only the amplitudes in L , weobtain the quantum state a | A i ( x ) i , where | A i ( x ) i = X ( c,j ) ∈ L ( − c · D j α cj | c i | j i , a = s X ( c,j ) ∈ L α cj . Lemma 8.25.
Lemma 8.25. Fix $i \in S$. The quantum state $|A_i(x)\rangle$ satisfies
\[ \mathbb{E}_{x \in \{0,1\}^n}\!\left[ p\!\left(\tfrac{1}{a} A_i(x)\right) \,\Big|\, x_i = 1 \right] - \mathbb{E}_{x \in \{0,1\}^n}\!\left[ p\!\left(\tfrac{1}{a} A_i(x)\right) \,\Big|\, x_i = 0 \right] \;\ge\; \frac{8\eta}{7a^2}. \]

Proof.
Note that since $|Q_i(x)\rangle$ and $\{R, I - R\}$ simulate $A$ and succeed with probability at least $\frac{1}{2} + \frac{4\eta}{7}$ on a random $x \in \{0,1\}^n$, we have
\[ \frac{1}{2}\,\mathbb{E}_{x \in \{0,1\}^n}\big[ p(Q_i(x)) \mid x_i = 1 \big] + \frac{1}{2}\,\mathbb{E}_{x \in \{0,1\}^n}\big[ 1 - p(Q_i(x)) \mid x_i = 0 \big] \;\ge\; \frac{1}{2} + \frac{4\eta}{7}, \]
which we can simplify to say
\[ \mathbb{E}_{x \in \{0,1\}^n}\big[ p(Q_i(x)) \mid x_i = 1 \big] - \mathbb{E}_{x \in \{0,1\}^n}\big[ p(Q_i(x)) \mid x_i = 0 \big] \;\ge\; \frac{8\eta}{7}. \]
Since $|Q_i(x)\rangle = |A_i(x)\rangle + |B_i(x)\rangle$ and $|B_i(x)\rangle$ contains at most $\varepsilon K$ parts, if all probes to $D$ in $|B_i(x)\rangle$ had corrupted values, the algorithm would still succeed with the same probability on random inputs $x$. Therefore, the following two inequalities hold:
\[ \mathbb{E}_{x}\big[ p(A_i(x) + B_i(x)) \mid x_i = 1 \big] - \mathbb{E}_{x}\big[ p(A_i(x) + B_i(x)) \mid x_i = 0 \big] \;\ge\; \frac{8\eta}{7} \tag{27} \]
\[ \mathbb{E}_{x}\big[ p(A_i(x) - B_i(x)) \mid x_i = 1 \big] - \mathbb{E}_{x}\big[ p(A_i(x) - B_i(x)) \mid x_i = 0 \big] \;\ge\; \frac{8\eta}{7} \tag{28} \]
Note that $p(\phi \pm \psi) = p(\phi) + p(\psi) \pm \big( \langle \phi | R | \psi \rangle + \langle \psi | R | \phi \rangle \big)$ and $p(c\,\phi) = c^2\, p(\phi)$. One can verify by averaging the two inequalities (27) and (28) that we get the desired expression.
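The only algebraic fact needed in the averaging is that the cross terms in $p(\phi \pm \psi)$ cancel. Here is a random-instance check (with a hypothetical PSD matrix $R$, purely illustrative) of the identity $p(\phi + \psi) + p(\phi - \psi) = 2p(\phi) + 2p(\psi)$:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
M = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
R = M.conj().T @ M                                  # R = M*M is positive semi-definite

def p(phi):
    return (phi.conj() @ R @ phi).real              # p(phi) = <phi|R|phi> = ||M phi||^2

phi = rng.normal(size=d) + 1j * rng.normal(size=d)
psi = rng.normal(size=d) + 1j * rng.normal(size=d)

# The cross terms <phi|R|psi> + <psi|R|phi> enter with opposite signs and cancel:
print(np.isclose(p(phi + psi) + p(phi - psi), 2 * p(phi) + 2 * p(psi)))   # True
```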
Lemma 8.26. Fix $i \in S$. There exists a quantum algorithm that, starting from the quantum state $\frac{1}{a}|A_i(x)\rangle$, can recover the value of $x_i$ with probability $\frac{1}{2} + \frac{2\eta}{7a^2}$ over random $x \in \{0,1\}^n$.

Proof. The algorithm and argument are almost identical to Theorem 3 in [KdW04]; we just check that it works under the weaker assumptions. Let
\[ q_1 = \mathbb{E}_{x \in \{0,1\}^n}\!\left[ p\!\left(\tfrac{1}{a} A_i(x)\right) \,\Big|\, x_i = 1 \right], \qquad q_0 = \mathbb{E}_{x \in \{0,1\}^n}\!\left[ p\!\left(\tfrac{1}{a} A_i(x)\right) \,\Big|\, x_i = 0 \right]. \]
From Lemma 8.25, we know $q_1 - q_0 \ge \frac{8\eta}{7a^2}$. In order to simplify notation, let $b = \frac{2\eta}{7a^2}$, so that $q_1 - q_0 \ge 4b$; we want a quantum algorithm which, starting from the state $\frac{1}{a}|A_i(x)\rangle$, can recover $x_i$ with probability $\frac{1}{2} + b$ on random $x \in \{0,1\}^n$. Assume $q_1 \ge \frac{1}{2} + 2b$, since otherwise $q_0 \le \frac{1}{2} - 2b$ and the same argument works with the roles of 0 and 1 flipped. Also, assume $q_0 + q_1 \ge 1$, since otherwise simply outputting 1 on observation $R$ and 0 on observation $I - R$ will work.

The algorithm works in the following way: it outputs 0 with probability $1 - \frac{1}{q_0 + q_1}$, and otherwise makes the measurement $\{R, I - R\}$ on the state $\frac{1}{a}|A_i(x)\rangle$. If the observation made is $R$, then the algorithm outputs 1; otherwise, it outputs 0. The probability of success over a random input $x \in \{0,1\}^n$ is
\[ \mathbb{E}_{x}\big[\Pr[\text{returns correctly}]\big] = \frac{1}{2}\,\mathbb{E}_{x}\big[\Pr[\text{returns } 1] \mid x_i = 1\big] + \frac{1}{2}\,\mathbb{E}_{x}\big[\Pr[\text{returns } 0] \mid x_i = 0\big]. \tag{29} \]
Writing $w = 1 - \frac{1}{q_0 + q_1}$: when $x_i = 1$, the probability the algorithm returns correctly is $(1 - w)\, p\!\left(\tfrac{1}{a} A_i(x)\right)$, and when $x_i = 0$, the probability the algorithm returns correctly is $w + (1 - w)\big(1 - p\!\left(\tfrac{1}{a} A_i(x)\right)\big)$. So, simplifying (29),
\[ \mathbb{E}_{x}\big[\Pr[\text{returns correctly}]\big] = \frac{1}{2}(1 - w)\, q_1 + \frac{1}{2}\big( w + (1 - w)(1 - q_0) \big) = \frac{q_1}{q_0 + q_1} = \frac{1}{2} + \frac{q_1 - q_0}{2(q_0 + q_1)} \;\ge\; \frac{1}{2} + \frac{q_1 - q_0}{4} \;\ge\; \frac{1}{2} + b. \]
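To double-check that the pre-biased measurement converts the gap $q_1 - q_0$ into an advantage of at least $(q_1 - q_0)/4$, here is a brute-force grid check over all $q_0, q_1$ satisfying the proof's assumptions (a numerical sketch, not part of the argument):

```python
import numpy as np

ok = True
for q0 in np.linspace(0.0, 1.0, 201):
    for q1 in np.linspace(0.0, 1.0, 201):
        if q0 + q1 >= 1 and q1 >= q0:      # the assumptions made in the proof
            success = q1 / (q0 + q1)        # value of (1/2)(1-w)q1 + (1/2)(w + (1-w)(1-q0))
            ok = ok and success >= 0.5 + (q1 - q0) / 4 - 1e-12
print(ok)                                   # True: the advantage is at least (q1 - q0)/4
```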
Now we can finally complete the proof of Lemma 8.23.

Proof of Lemma 8.23.
Again, the proof is exactly the same as the finishing argument of Theorem 3 in [KdW04], and we simply check that the weaker conditions give the desired outcome. On input $i \in [n]$ and access to $r$ copies of the state $|U(x)\rangle$, the algorithm applies the measurement $\{M_i^* M_i,\, I - M_i^* M_i\}$, where
\[ M_i = \sqrt{\varepsilon K} \sum_{(c,j) \in L} \alpha_{cj}\, |c, j\rangle\langle c, j|. \]
This measurement is designed to yield the state $\frac{1}{a}|A_i(x)\rangle$ from $|U(x)\rangle$ when the measurement makes the observation $M_i^* M_i$. The fact that the amplitudes of $|A_i(x)\rangle$ are not too large makes $\{M_i^* M_i,\, I - M_i^* M_i\}$ a valid measurement. The probability of observing $M_i^* M_i$ is
\[ \langle U(x) | M_i^* M_i | U(x) \rangle = \delta a^2, \qquad \text{where we used that } \delta = \frac{\varepsilon K}{m}. \]
So the algorithm repeatedly applies the measurement, to fresh copies of $|U(x)\rangle$, until observing outcome $M_i^* M_i$. If it never makes the observation, the algorithm outputs 0 or 1 uniformly at random. If the algorithm does observe $M_i^* M_i$, it returns the output of the algorithm of Lemma 8.26. The following simple calculation (done in [KdW04]) gives the desired probability of success on random input:
\[ \mathbb{E}_{x}\big[\Pr[\text{returns correctly}]\big] \;\ge\; \Big(1 - (1 - \delta a^2)^r\Big)\left(\frac{1}{2} + \frac{2\eta}{7a^2}\right) + (1 - \delta a^2)^r \cdot \frac{1}{2} \;\ge\; \frac{1}{2} + \frac{\eta}{7a^2}. \]
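The final inequality only needs the measurement to fire at least once with probability at least $1/2$; a quick check for representative (hypothetical) values of $\delta a^2$ confirms that $r = O(1/(\delta a^2))$ copies suffice:

```python
import math

# success >= (1 - (1 - delta*a^2)^r) * (1/2 + 2b) + (1 - delta*a^2)^r * (1/2),
# which is at least 1/2 + b as soon as the "never fires" probability is <= 1/2.
for delta_a2 in [1e-1, 1e-3, 1e-5]:          # hypothetical values of delta * a^2
    r = math.ceil(math.log(2) / delta_a2)     # r = O(1/(delta * a^2)) copies suffice
    miss = (1 - delta_a2) ** r                # probability the measurement never fires
    print(f"delta*a^2={delta_a2:.0e}  r={r}  miss={miss:.3f}  miss<=1/2: {miss <= 0.5}")
```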
We can extend our lower bounds from the non-adaptive to the adaptive setting.
Lemma 8.27.
If there exists a deterministic data structure which makes two cell probes adaptively and succeeds with probability at least $\frac{1}{2} + \eta$, then there exists a deterministic data structure which makes the two probes non-adaptively and succeeds with probability at least $\frac{1}{2} + \frac{\eta}{2^w}$.

Proof. The algorithm guesses the contents of the first cell probe and simulates the adaptive algorithm with the guess. After knowing which two probes to make, we probe the data structure non-adaptively. If the algorithm guessed the contents of the first cell probe correctly, then we output the value of the non-adaptive algorithm; otherwise, we output a random value. This algorithm is non-adaptive, and since a $w$-bit cell is guessed correctly with probability $2^{-w}$, it succeeds with probability at least
\[ \left(1 - \frac{1}{2^w}\right) \cdot \frac{1}{2} + \frac{1}{2^w}\left(\frac{1}{2} + \eta\right) = \frac{1}{2} + \frac{\eta}{2^w}. \]

Applying Lemma 8.27 to an adaptive algorithm succeeding with probability $\frac{2}{3}$, we obtain a non-adaptive algorithm succeeding with probability $\frac{1}{2} + \Omega(2^{-w})$. This value is lower than the intended $\frac{2}{3}$, but we may still reduce to a weak LDC, where we now require $\gamma = \Theta(2^{-w})$, $\varepsilon = \Theta(2^{-w})$, and $|S| = \Omega(2^{-w} n)$. With these minor changes to the parameters in Subsections 8.1 through 8.6, one can easily verify that
\[ \frac{m \log m \cdot \Theta(w)}{n} \;\ge\; \Omega\!\left( \Phi_r\!\left( \frac{1}{m},\, \gamma \right) \right). \]
This inequality yields tight lower bounds (up to sub-polynomial factors) for the Hamming space when $w = o(\log n)$.
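A small Monte Carlo simulation of the guessing reduction (illustrative only; the adaptive decoder is abstracted as its success probability) reproduces the computed advantage of $\eta/2^w$:

```python
import random

def simulate(w: int, eta: float, trials: int = 200_000) -> float:
    """Guess the w-bit contents of the first probed cell; a correct guess (prob 2^-w)
    lets us run the adaptive decoder faithfully (success prob 1/2 + eta), and on a
    wrong guess we output a uniformly random bit (success prob 1/2)."""
    wins = 0
    for _ in range(trials):
        guessed_right = random.randrange(2 ** w) == 0
        p_success = 0.5 + eta if guessed_right else 0.5
        wins += random.random() < p_success
    return wins / trials

random.seed(0)
print(simulate(w=3, eta=0.3), "expected:", 0.5 + 0.3 / 2 ** 3)   # both close to 0.5375
```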
In the case of the Hamming space, we can compute the robust expansion in a similar fashion to Theorem 1.3. In particular, for any $p, q \in [1, \infty)$ with $(p - 1)(q - 1) = \sigma^2$, we have
\[ \frac{m \log m \cdot O(w)}{n} \;\ge\; \Omega\!\left( \gamma^q\, m^{1 + q/p - q} \right), \]
so that
\[ m^{q - q/p + o(1)} \;\ge\; n^{1 - o(1)}\, \gamma^q \qquad \text{and hence} \qquad m \;\ge\; n^{\frac{p}{pq - q} - o(1)}\, \gamma^{\frac{p}{p - 1} - o(1)}. \]
Let $p = 1 + \frac{w f(n)}{\log n}$ and $q = 1 + \frac{\sigma^2 \log n}{w f(n)}$, where we require that $w f(n) = o(\log n)$ and $f(n) \to \infty$ as $n \to \infty$; note that $(p - 1)(q - 1) = \sigma^2$ and
\[ \frac{p}{pq - q} = \frac{p}{q(p - 1)} = \frac{1 + w f(n)/\log n}{w f(n)/\log n + \sigma^2} \;\longrightarrow\; \frac{1}{\sigma^2}. \]
Since $\gamma = \Theta(2^{-w})$ and $\frac{p}{p - 1} = 1 + \frac{\log n}{w f(n)}$, we have $\gamma^{p/(p-1)} \ge n^{-O(1/f(n)) - o(1)} = n^{-o(1)}$, and therefore
\[ m \;\ge\; n^{\frac{1}{\sigma^2} - o(1)}\, \gamma^{1 + \frac{\log n}{w f(n)}} \;\ge\; n^{\frac{1}{\sigma^2} - o(1)}. \]

Acknowledgments

We would like to thank Jop Briët for helping us to navigate the literature about LDCs. We also thank Omri Weinstein for useful discussions. Thanks to Adam Bouland for educating us on the topic of quantum computing. Thijs Laarhoven is supported by the SNSF ERC Transfer Grant CRETP2-166734 FELICITY. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-16-44869. It is also supported in part by NSF CCF-1617955 and a Google Research Award.
References

[AC09] Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302–322, 2009.
[ACP08] Alexandr Andoni, Dorian Croitoru, and Mihai Pătraşcu. Hardness of nearest neighbor under L-infinity. In FOCS, pages 424–433, 2008.
[ACW16] Josh Alman, Timothy M. Chan, and Ryan Williams. Polynomial representations of threshold functions with applications. In FOCS, 2016.
[ADI+06] Alexandr Andoni, Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Nearest Neighbor Methods for Learning and Vision: Theory and Practice, Neural Processing Information Series, MIT Press, 2006.
[AGK06] Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. Efficient exact set-similarity joins. In Proceedings of the 32nd International Conference on Very Large Data Bases, pages 918–929. VLDB Endowment, 2006.
[AI08] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1):117–122, 2008.
[AI17] Alexandr Andoni and Piotr Indyk. Nearest neighbors in high-dimensional spaces. In Jacob E. Goodman, Joseph O'Rourke, and Csaba Toth, editors, Handbook of Discrete and Computational Geometry (third edition). CRC Press LLC, 2017.
[AIL+15] Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. Practical and optimal LSH for angular distance. In NIPS, 2015. Full version available at http://arxiv.org/abs/1509.02897.
[AINR14] Alexandr Andoni, Piotr Indyk, Huy L. Nguyen, and Ilya Razenshteyn. Beyond locality-sensitive hashing. In SODA, 2014. Full version at http://arxiv.org/abs/1306.1547.
[AIP06] Alexandr Andoni, Piotr Indyk, and Mihai Pătraşcu. On the optimality of the dimensionality reduction method. In FOCS, pages 449–458, 2006.
[ALRW16] Alexandr Andoni, Thijs Laarhoven, Ilya Razenshteyn, and Erik Waingarten. Lower bounds on time–space trade-offs for approximate near neighbors. CoRR, abs/1605.02701, 2016.
[And09] Alexandr Andoni. Nearest Neighbor Search: the Old, the New, and the Impossible. PhD thesis, MIT, 2009.
[ANRW17] Alexandr Andoni, Aleksandar Nikolov, Ilya Razenshteyn, and Erik Waingarten. Approximate near neighbors for general symmetric norms. In Proceedings of the 49th Annual ACM Symposium on the Theory of Computing (STOC '2017), 2017.
[AR15] Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In STOC, 2015. Full version at http://arxiv.org/abs/1501.01062.
[AR16] Alexandr Andoni and Ilya Razenshteyn. Tight lower bounds for data-dependent locality-sensitive hashing. In Proceedings of the 32nd International Symposium on Computational Geometry, 2016. Available at http://arxiv.org/abs/1507.04299.
[ARN17] Alexandr Andoni, Ilya Razenshteyn, and Negev Shekel Nosatzki. LSH forest: Practical algorithms made theoretical. In Proceedings of the 28th Annual ACM–SIAM Symposium on Discrete Algorithms (SODA '2017), 2017.
[ARS17] Alexandr Andoni, Ilya Razenshteyn, and Negev Shekel Nosatzki. LSH forest: Practical algorithms made theoretical. In SODA, 2017.
[AV15] Amirali Abdullah and Suresh Venkatasubramanian. A directed isoperimetric inequality with application to Bregman near neighbor lower bounds. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing (STOC 2015), pages 509–518, 2015.
[AW15] Josh Alman and Ryan Williams. Probabilistic polynomials and Hamming nearest neighbors. In FOCS, 2015.
[BDGL16] Anja Becker, Léo Ducas, Nicolas Gama, and Thijs Laarhoven. New directions in nearest neighbor searching with applications to lattice sieving. In SODA, 2016.
[BOR99] Allan Borodin, Rafail Ostrovsky, and Yuval Rabani. Lower bounds for high dimensional nearest neighbor search and related problems. In Proceedings of the Symposium on Theory of Computing, 1999.
[BR02] Omer Barkol and Yuval Rabani. Tighter bounds for nearest neighbor search and related problems in the cell probe model. J. Comput. Syst. Sci., 64(4):873–896, 2002. Previously appeared in STOC '00.
[BRdW08] Avraham Ben-Aroya, Oded Regev, and Ronald de Wolf. A hypercontractive inequality for matrix-valued functions with applications to quantum computing and LDCs. In FOCS, pages 477–486, 2008.
[CCGL99] Amit Chakrabarti, Bernard Chazelle, Benjamin Gum, and Alexey Lvov. A lower bound on the complexity of approximate nearest-neighbor searching on the Hamming cube. In STOC, 1999.
[Cha02] Moses Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, 2002.
[Chr17] Tobias Christiani. A framework for similarity search with space-time tradeoffs using locality-sensitive filtering. In SODA, 2017.
[Cla88] Ken Clarkson. A randomized algorithm for closest-point queries. SIAM Journal on Computing, 17:830–847, 1988.
[CR04] Amit Chakrabarti and Oded Regev. An optimal randomised cell probe lower bound for approximate nearest neighbor searching. In FOCS, 2004.
[DG03] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
[DG15] Zeev Dvir and Sivakanth Gopi. 2-server PIR with sub-polynomial communication. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing (STOC 2015), pages 577–584, 2015.
[DIIM04] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th Annual Symposium on Computational Geometry, 2004.
[DRT11] Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Nearest neighbor based greedy coordinate descent. In Advances in Neural Information Processing Systems 24 (NIPS 2011), pages 2160–2168, 2011.
[GIM99] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB), 1999.
[GPY94] Daniel H. Greene, Michal Parnas, and F. Frances Yao. Multi-index hashing for information retrieval. In FOCS, pages 722–731, 1994.
[HIM12] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(1):321–350, 2012.
[HLM15] Thomas Hofmann, Aurélien Lucchi, and Brian McWilliams. Neighborhood watch: Stochastic gradient descent with neighbors. CoRR, abs/1506.03662, 2015.
[IM98] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC, pages 604–613, 1998.
[Ind00] Piotr Indyk. Dimensionality reduction techniques for proximity problems. In Proceedings of the Ninth ACM–SIAM Symposium on Discrete Algorithms, 2000.
[Ind01a] Piotr Indyk. High-dimensional computational geometry. PhD thesis, Department of Computer Science, Stanford University, 2001.
[Ind01b] Piotr Indyk. On approximate nearest neighbors in ℓ∞ norm. J. Comput. Syst. Sci., 63(4):627–638, 2001. Preliminary version appeared in FOCS '98.
[JKKR04] T. S. Jayram, Subhash Khot, Ravi Kumar, and Yuval Rabani. Cell-probe lower bounds for the partial match problem. Journal of Computer and Systems Sciences, 69(3):435–447, 2004. See also STOC '03.
[JL84] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
[Kap15] Michael Kapralov. Smooth tradeoffs between insert and query complexity in nearest neighbor search. In PODS, pages 329–342, 2015.
[KdW04] Iordanis Kerenidis and Ronald de Wolf. Exponential lower bound for 2-query locally decodable codes via a quantum argument. Journal of Computer and System Sciences, 69(3):395–420, 2004.
[KKK16] Matti Karppa, Petteri Kaski, and Jukka Kohonen. A faster subquadratic algorithm for finding outlier correlations. In SODA, 2016. Available at http://arxiv.org/abs/1510.03895.
[KKKÓ16] Matti Karppa, Petteri Kaski, Jukka Kohonen, and Padraig Ó Catháin. Explicit correlation amplifiers for finding outlier correlations in deterministic subquadratic time. In Proceedings of the 24th European Symposium on Algorithms (ESA '2016), 2016.
[KOR00] Eyal Kushilevitz, Rafail Ostrovsky, and Yuval Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM J. Comput., 30(2):457–474, 2000. Preliminary version appeared in STOC '98.
[KP12] Michael Kapralov and Rina Panigrahy. NNS lower bounds via metric expansion for ℓ∞ and EMD. In ICALP, pages 545–556, 2012.
[Laa15a] Thijs Laarhoven. Search problems in cryptography: From fingerprinting to lattice sieving. PhD thesis, Eindhoven University of Technology, 2015.
[Laa15b] Thijs Laarhoven. Sieving for shortest vectors in lattices using angular locality-sensitive hashing. In Advances in Cryptology - CRYPTO 2015, Part I, pages 3–22, 2015.
[Laa15c] Thijs Laarhoven. Tradeoffs for nearest neighbors on the sphere. CoRR, abs/1511.07527, 2015.
[Liu04] Ding Liu. A strong lower bound for approximate nearest neighbor searching in the cell probe model. Information Processing Letters, 92:23–29, 2004.
[LJW+07] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe LSH: efficient indexing for high-dimensional similarity search. In VLDB, 2007.
[LLR94] Nathan Linial, Eran London, and Yuri Rabinovich. The geometry of graphs and some of its algorithmic applications. In FOCS, pages 577–591, 1994.
[LPY16] Mingmou Liu, Xiaoyin Pan, and Yitong Yin. Randomized approximate nearest neighbor search with limited adaptivity. CoRR, abs/1602.04421, 2016.
[Mei93] Stefan Meiser. Point location in arrangements of hyperplanes. Information and Computation, 106:286–303, 1993.
[Mil99] Peter Bro Miltersen. Cell probe complexity - a survey. In Proceedings of the 19th Conference on the Foundations of Software Technology and Theoretical Computer Science, Advances in Data Structures Workshop, page 2, 1999.
[MNP07] Rajeev Motwani, Assaf Naor, and Rina Panigrahy. Lower bounds on locality sensitive hashing. SIAM Journal on Discrete Mathematics, 21(4):930–935, 2007. Previously in SoCG '06.
[MNSW98] Peter B. Miltersen, Noam Nisan, Shmuel Safra, and Avi Wigderson. Data structures and asymmetric communication complexity. Journal of Computer and System Sciences, 1998.
[MO15] Alexander May and Ilya Ozerov. On computing nearest neighbors with applications to decoding of binary linear codes. In EUROCRYPT, 2015.
[Nay99] Ashwin Nayak. Optimal lower bounds for quantum automata and random access codes. In FOCS, pages 369–376, 1999.
[Ngu14] Huy L. Nguyên. Algorithms for High Dimensional Data. PhD thesis, Princeton University, 2014. Available at http://arks.princeton.edu/ark:/88435/dsp01b8515q61f.
[O'D14] Ryan O'Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.
[OvL81] Mark H. Overmars and Jan van Leeuwen. Some principles for dynamizing decomposable searching problems. Information Processing Letters, 12(1):49–53, 1981.
[OWZ14] Ryan O'Donnell, Yi Wu, and Yuan Zhou. Optimal lower bounds for locality sensitive hashing (except when q is tiny). Transactions on Computation Theory, 6(1):5, 2014. Previously in ICS '11.
[Pag16] Rasmus Pagh. Locality-sensitive hashing without false negatives. In SODA, 2016. Available at http://arxiv.org/abs/1507.03225.
[Pan06] Rina Panigrahy. Entropy-based nearest neighbor algorithm in high dimensions. In SODA, 2006.
[Păt11] Mihai Pătraşcu. Unifying the landscape of cell-probe lower bounds. SIAM Journal on Computing, 40(3):827–847, 2011. See also FOCS '08, arXiv:1010.3783.
[PP16] Ninh Pham and Rasmus Pagh. Scalability and total recall with fast CoveringLSH. CoRR, abs/1602.02620, 2016.
[PT06] Mihai Pătraşcu and Mikkel Thorup. Higher lower bounds for near-neighbor and further rich problems. In FOCS, 2006.
[PTW08] Rina Panigrahy, Kunal Talwar, and Udi Wieder. A geometric approach to lower bounds for approximate near-neighbor search and partial match. In FOCS, pages 414–423, 2008.
[PTW10] Rina Panigrahy, Kunal Talwar, and Udi Wieder. Lower bounds on near neighbor search via metric expansion. In FOCS, pages 805–814, 2010.
[Raz14] Ilya Razenshteyn. Beyond Locality-Sensitive Hashing. Master's thesis, MIT, 2014.
[SDI06] Gregory Shakhnarovich, Trevor Darrell, and Piotr Indyk, editors. Nearest Neighbor Methods in Learning and Vision. Neural Processing Information Series, MIT Press, 2006.
[TT07] Kengo Terasawa and Yuzuru Tanaka. Spherical LSH for approximate nearest neighbor search on unit hypersphere. In Workshop on Algorithms and Data Structures, 2007.
[Val88] Leslie G. Valiant. Functionality in neural nets. In First Workshop on Computational Learning Theory, pages 28–39, 1988.
[Val15] Gregory Valiant. Finding correlations in subquadratic time, with applications to learning parities and the closest pair problem. J. ACM, 62(2):13, 2015. Previously in FOCS '12.
[WLKC15] Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang. Learning to hash for indexing big data: a survey. Available at http://arxiv.org/abs/1509.05472, 2015.
[WSSJ14] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. Hashing for similarity search: A survey. CoRR, abs/1408.2927, 2014.
[Yek12] Sergey Yekhanin. Locally decodable codes. Foundations and Trends in Theoretical Computer Science, 6(3):139–255, 2012.
[Yin16] Yitong Yin. Simple average-case lower bounds for approximate near-neighbor from isoperimetric inequalities. CoRR, abs/1602.05391, 2016.
[ZYS16] Zeyuan Allen Zhu, Yang Yuan, and Karthik Sridharan. Exploiting the structure: Stochastic gradient methods using raw clusters. CoRR, abs/1602.02151, 2016.