A Sub-linear Time Algorithm for Approximating k-Nearest-Neighbor with Full Quality Guarantee
Hengzhao Ma and Jianzhong Li [email protected] [email protected] Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
Abstract.
In this paper we propose an algorithm for the approximate k-Nearest-Neighbor problem. According to existing research, there are two kinds of approximation criteria: the distance criterion and the recall criterion. All former algorithms suffer from the problem that there is no theoretical guarantee for either of the two criteria. The algorithm proposed in this paper unifies the two kinds of approximation criteria and has a full theoretical guarantee. Furthermore, the query time of the algorithm is sub-linear. As far as we know, it is the first algorithm that achieves both sub-linear query time and a full theoretical approximation guarantee.
Keywords:
Computational Geometry · Approximate k-Nearest-Neighbors
1 Introduction

The k-Nearest-Neighbor (kNN) problem is a well-known problem in theoretical computer science and applications. Let (U, D) be a metric space. For an input set P ⊆ U of elements and a query element q ∈ U, the kNN problem is to find the k elements of P with the smallest distance to q. Since the exact results are expensive to compute when the size of the input is large [18], and approximate results serve as well as the exact ones in many applications [29], the approximate kNN, kANN for short, has drawn more research effort in recent years. There are two kinds of approximation criteria for the kANN problem, namely, the distance criterion and the recall criterion. The distance criterion requires that the ratio between the distance from the approximate results to the query and the distance from the exact results to the query is no more than a given threshold. The recall criterion requires that the size of the intersection of the approximate result set and the exact result set is no less than a given threshold. The formal descriptions will be given in Section 2. Next we briefly review the existing algorithms for the kANN problem to see how these two criteria are treated by former researchers.

The algorithms for the kANN problem can be categorized into four classes. The first class is the tree-based methods. The main idea of these methods is to recursively partition the metric space into sub-spaces, and organize them into a tree structure. The K-D tree [6] is the representative structure in this category. It is efficient in low-dimensional spaces, but its performance drops rapidly as the number of dimensions grows. The Vantage Point tree (VP-tree) [30] is another data structure with a better partition strategy and better performance. The FLANN [24] method is a recent work with improved performance in high-dimensional spaces, but it is reported that this method may yield sub-optimal results [19].
To the best of our knowledge, the tree-based methods can satisfy neither the distance nor the recall criterion theoretically.

The second class is the permutation-based methods. The idea is to choose a set of pivot points, and represent each data element by a permutation of the pivots sorted by their distance to it. In such a representation, close objects have similar permutations. Methods using the permutation idea include the MI-File [2] and the PP-Index [13]. Unfortunately, as far as we know, the permutation-based methods can satisfy neither the distance nor the recall criterion theoretically.

The third class is the Locality Sensitive Hashing (LSH) based methods. LSH was first introduced by Indyk et al. [18] for the kANN problem where k = 1. Soon after, Datar et al. [11] proposed the first practical LSH function, and since then there has been a burst of theoretical and applied research on the LSH framework. For example, Andoni et al. proved the lower bound on the time-space complexities of LSH based algorithms [3], and devised the optimal LSH function which meets this lower bound [4]. On the other hand, Gao et al. [15] proposed an algorithm that aims to close the gap between the LSH theory and kANN search applications. See [28] for a survey. The basic LSH based method can satisfy only the distance criterion, and only when k = 1 [18]. Some existing algorithms have made progress. The C2LSH algorithm [14] solves the kANN problem with the distance criterion, but it has the constraint that the approximation factor must be the square of an integer. The SRS algorithm [27] is another one aimed at the distance criterion. However, it only has a partial guarantee, that is, the results satisfy the distance criterion only when the algorithm terminates on a specific condition.

The fourth class is the graph-based methods. The specific kind of graphs used in these methods is the proximity graphs, where the edges are defined by the geometric relationships of the points.
See [22] for a survey. The graph-based kANN algorithms usually conduct a navigating process on a proximity graph. This process selects a vertex in the graph as the start point, and moves towards the destination following some specific navigating strategy. For example, Paredes et al. [26] used the kNN graph, Ocsa et al. [25] used the Relative Neighborhood Graph (RNG), and Malkov et al. [21] used the Navigable Small World graph (NSW). None of these algorithms has a theoretical guarantee on the two approximation criteria.

In summary, most of the existing algorithms have no theoretical guarantee on either of the two approximation criteria. The recall criterion is only used as a measurement of the experimental results, and the distance criterion is only partially satisfied by a few algorithms [14,27]. In this paper, we propose a sub-linear time algorithm for the kANN problem that unifies the two kinds of approximation criteria, which overcomes the disadvantages of the existing algorithms. The contributions of this paper are listed below.

1. We propose an algorithm that unifies the distance criterion and the recall criterion for the approximate k-Nearest-Neighbor problem. The result returned by the algorithm satisfies at least one criterion in any situation. This is a major progress compared to the existing algorithms.
2. Assuming the input point set follows the spatial Poisson process, the algorithm takes O(n log n) preprocessing time, O(n log n) space, and answers a query in O(dn^{1/d} log n + kn^ρ log n) time, where ρ < 1 is a constant.

2 Preliminaries

The problem studied in this paper is the approximate k-Nearest-Neighbor problem, denoted as kANN for short. In this paper the problem is restricted to the Euclidean space. The input is a set P of points, where each p ∈ P is a d-dimensional vector (p^(1), p^(2), ..., p^(d)).
The distance between two points p and p′ is defined by D(p, p′) = √(Σ_{i=1}^{d} (p^(i) − p′^(i))²), which is the well-known Euclidean distance. Before giving the definition of the kANN problem, we first introduce the exact kNN problem.

Definition 2.1 (kNN).
Given the input point set P ⊂ R^d and a query point q ∈ R^d, define kNN(q, P) to be the set of k points in P that are nearest to q. Formally,
1. kNN(q, P) ⊆ P, and |kNN(q, P)| = k;
2. D(p, q) ≤ D(p′, q) for ∀p ∈ kNN(q, P) and ∀p′ ∈ P \ kNN(q, P).

Next we give the definitions of the approximate kNN. There are two kinds of definitions, based on the two different approximation criteria.
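Before moving to the approximate versions, note that the exact set of Definition 2.1 can be computed by a plain linear scan; a minimal Python sketch (function and variable names are ours, not from the paper):

```python
import math

def knn(q, P, k):
    """Exact kNN(q, P) of Definition 2.1: the k points of P nearest to q."""
    return sorted(P, key=lambda p: math.dist(p, q))[:k]

# Toy example in the plane
P = [(0.0, 0.0), (1.0, 0.0), (3.0, 3.0), (0.0, 2.0)]
print(knn((0.0, 0.0), P, 2))  # -> [(0.0, 0.0), (1.0, 0.0)]
```

This linear-scan baseline is exactly what becomes too expensive for large inputs, motivating the sub-linear algorithm of this paper.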
Definition 2.2 (kANN_c). Given the input point set P ⊂ R^d, a query point q ∈ R^d, and an approximation factor c > 1, find a point set kANN_c(q, P) which satisfies:
1. kANN_c(q, P) ⊆ P, and |kANN_c(q, P)| = k;
2. let T_k(q, P) = max_{p ∈ kNN(q,P)} D(p, q); then D(p′, q) ≤ c · T_k(q, P) holds for ∀p′ ∈ kANN_c(q, P).

Remark 2.1. The second requirement in Definition 2.2 is called the distance criterion.
Definition 2.3 (kANN_δ). Given the input point set P ⊂ R^d, a query point q ∈ R^d, and an approximation factor δ < 1, find a point set kANN_δ(q, P) which satisfies:
1. kANN_δ(q, P) ⊆ P, and |kANN_δ(q, P)| = k;
2. |kANN_δ(q, P) ∩ kNN(q, P)| ≥ δ · k.

Remark 2.2. If a kANN algorithm returns a set S, the value |S ∩ kNN(q, P)| / |kNN(q, P)| is usually called the recall of the set S. It is widely used in many works to evaluate the quality of a kANN algorithm. Thus we call the second requirement in Definition 2.3 the recall criterion.

Next we give the definition of the problem studied in this paper, which unifies the two different criteria.

Definition 2.4.
Given the input point set P ⊂ R^d, a query point q ∈ R^d, and approximation factors c > 1 and δ < 1, find a point set kANN_{c,δ}(q, P) which satisfies:
1. kANN_{c,δ}(q, P) ⊆ P, and |kANN_{c,δ}(q, P)| = k;
2. kANN_{c,δ}(q, P) satisfies at least one of the distance criterion and the recall criterion. Formally, either D(p′, q) ≤ c · T_k(q, P) holds for ∀p′ ∈ kANN_{c,δ}(q, P), or |kANN_{c,δ}(q, P) ∩ kNN(q, P)| ≥ δ · k.

According to Definition 2.4, the output of the algorithm is required to satisfy one of the two criteria, but not necessarily both. It is our future work to devise an algorithm that satisfies both criteria.

In the rest of this section we introduce some concepts and algorithms that will be used in our proposed algorithm.
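As a concrete reading of Definition 2.4, the unified requirement is cheap to verify for any given candidate set; a brute-force checker in Python (all names are our own, and the exact set is recomputed by linear scan):

```python
import math

def satisfies_kann(Res, q, P, k, c, delta):
    """Check Definition 2.4: Res passes the distance OR the recall criterion."""
    dist = lambda p: math.dist(p, q)
    exact = sorted(P, key=dist)[:k]                 # kNN(q, P)
    T_k = max(dist(p) for p in exact)               # distance of the k-th nearest neighbor
    distance_ok = all(dist(p) <= c * T_k for p in Res)
    recall_ok = len(set(Res) & set(exact)) >= delta * k
    return distance_ok or recall_ok
```

Such a checker is only a validation aid: it costs a full linear scan, which the actual query algorithm is designed to avoid.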
The d-dimensional sphere is the generalization of the circle in the 2-dimensional case. Let c be the center and r be the radius. A d-dimensional sphere, denoted as S(c, r), is the set S(c, r) = {x ∈ R^d | D(x, c) ≤ r}. Note that the boundary is included. If q ∈ S(c, r), we say that q falls inside the sphere S(c, r), or that the sphere encloses q. A sphere S(c, r) is said to pass through a point p iff D(c, p) = r. Given a set P of points, the minimum enclosing sphere (MES) of P is the d-dimensional sphere enclosing all points in P that has the smallest possible radius. It is known that the MES of a given finite point set in R^d is unique, and can be calculated by a quadratic programming algorithm [31]. Next we introduce the approximate minimum enclosing sphere.

Definition 2.5 (AMES).
Given a set of points P ⊂ R^d and an approximation factor ε < 1, the approximate minimum enclosing sphere of P, denoted as AMES(P, ε), is a d-dimensional sphere S(c, r) satisfying:
1. p ∈ S(c, r) for ∀p ∈ P;
2. r < (1 + ε)r*, where r* is the radius of the exact MES of P.

The following algorithm, given in [5], calculates the AMES in O(n/ε²) time.

Algorithm 1:
Compute AMES
Input: a point set P, and an approximation factor ε
Output: AMES(P, ε)
c_0 ← an arbitrary point in P;
for i = 1 to ⌈1/ε²⌉ do
    p_i ← the point in P farthest away from c_{i−1};
    c_i ← c_{i−1} + (1/i)(p_i − c_{i−1});
end

The following lemma gives the complexity of Algorithm 1.
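Algorithm 1 translates almost line for line into code; a sketch assuming points are numeric tuples (computing the final enclosing radius from the last center is our own packaging of the output):

```python
import math

def ames(P, eps):
    """Algorithm 1 (after [5]): (1+eps)-approximate minimum enclosing sphere."""
    c = list(P[0])                                   # c_0: an arbitrary point of P
    for i in range(1, math.ceil(1 / eps ** 2) + 1):
        p = max(P, key=lambda x: math.dist(x, c))    # p_i: farthest point from c_{i-1}
        c = [cj + (pj - cj) / i for cj, pj in zip(c, p)]
    r = max(math.dist(c, x) for x in P)              # smallest radius enclosing P from c
    return tuple(c), r

# Unit square: the exact MES has radius sqrt(2)/2 ~ 0.707
center, r = ames([(0, 0), (1, 0), (0, 1), (1, 1)], eps=0.1)
```

With eps = 0.1 the loop runs 100 iterations and the center converges to the square's midpoint, so r stays within the (1 + ε) bound of Definition 2.5.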
Lemma 2.1 ([5]).
For given ε and P where |P| = n, Algorithm 1 calculates AMES(P, ε) in O(n/ε²) time.

The Delaunay Triangulation (DT) is a fundamental data structure in computational geometry. The definition is given below.
Definition 2.6 (DT).
Given a set of points P ⊂ R^d, the Delaunay Triangulation is a graph DT(P) = (V, E) which satisfies:
1. V = P;
2. for ∀p, p′ ∈ P, (p, p′) ∈ E iff there exists a d-dimensional sphere passing through p and p′ such that no other p′′ ∈ P is inside it.

The Delaunay Triangulation is the natural dual of the Voronoi diagram. We omit the details of their relationship since it is not the focus of this paper. There is extensive research on the Delaunay triangulation. An important problem is to find the expected properties of the DT built on random point sets. Here we focus on point sets that follow the spatial Poisson process in d-dimensional Euclidean space. In this model, for any region R ⊂ R^d, the probability that R contains k points follows the Poisson distribution. See [1] for more details. We cite one important property of the spatial Poisson process in the following lemma.

Lemma 2.2 ([1]).
Let S ⊂ R^d be a point set following the spatial Poisson process. Suppose there are two regions B ⊆ A ⊂ R^d. For any point p ∈ S, if p falls inside A, then the probability that p falls inside B is the ratio between the volumes of B and A. Formally,

Pr[p ∈ B | p ∈ A] = volume(B) / volume(A).

Further, we cite some important properties of the Delaunay triangulation built on point sets which follow the spatial Poisson process.
Lemma 2.3 ([7]).
Let S ⊂ R^d be a point set following the spatial Poisson process, and let ∆(G) = max_{p ∈ V(G)} |{(p, q) ∈ E(G)}| be the maximum degree of a graph G. Then the expected maximum degree of DT(S) is O(log n / log log n).

Lemma 2.4 ([9]).
Let S ⊂ R^d be a point set following the spatial Poisson process. The expected time to construct DT(S) is O(n log n).

Given a Delaunay Triangulation DT, the points and edges of DT form a set of simplices. Given a query point q, a natural problem is to find which simplex of DT the point q falls in. The class of algorithms that tackle this problem is called Walking. A Walking algorithm starts at some simplex, and walks to the destination by moving to adjacent simplices step by step. There are several kinds of walking strategies, including Jump&Walk [23], Straight Walk [8] and Stochastic Walk [12], etc. Some of these strategies are only applicable to 2 or 3 dimensions, while Straight Walk generalizes to higher dimensions. As Figure 1 shows, the Straight Walk strategy only considers the simplices that intersect the line segment from the start point to the destination. The following lemma gives the complexity of this walking strategy.

Lemma 2.5 ([10]).
Given a Delaunay Triangulation DT of a point set P ⊂ R^d, and two points p and p′ in R^d as the start and destination points, walking from p to p′ using Straight Walk takes O(n^{1/d}) expected time.

(c, r)-NN

The Approximate Near Neighbor problem was introduced in [18] for solving the kANN_c problem with k = 1. It is usually denoted as (c, r)-NN since there are two input parameters c and r. The definition is given below. The idea of using (c, r)-NN to solve 1ANN_c is via Turing reduction, that is, using (c, r)-NN as an oracle or sub-procedure. The details can be found in [18,16,17,20].

Definition 2.7.
Given a point set P, a query point q, and two query parameters c > 1 and r > 0, the output of the (c, r)-NN problem should satisfy:
1. if ∃p* ∈ S(q, r) ∩ P, then output a point p′ ∈ S(q, c · r) ∩ P;
2. if D(p, q) > c · r for ∀p ∈ P, then output No.

Fig. 1. Illustration of the Straight Walk

Since we aim to solve the kANN problem in this paper, we need the following definition of (c, r)-kNN.
Definition 2.8.
Given a point set P, a query point q, and two query parameters c and r, the output of the (c, r)-kNN problem is a set kNN_{(c,r)}(q, P) which satisfies:
1. if |P ∩ S(q, r)| ≥ k, then output a set Q ⊆ P ∩ S(q, c · r) where |Q| = k;
2. if |P ∩ S(q, c · r)| < k, then output ∅.

It can easily be seen that the (c, r)-kNN problem is a natural generalization of the (c, r)-NN problem. Recently, several algorithms have been proposed to solve this problem. The following Lemma 2.6 gives the complexity of the (c, r)-kNN algorithm; it will be proved in Appendix A.
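For intuition, the contract of Definition 2.8 can be emulated by a brute-force oracle; the sub-linear LSH machinery behind Lemma 2.6 is deliberately not modeled in this sketch, and the naming is ours:

```python
import math

def cr_knn(P, q, k, c, r):
    """Brute-force stand-in for a (c, r)-kNN oracle (Definition 2.8)."""
    dist = lambda p: math.dist(p, q)
    outer = [p for p in P if dist(p) <= c * r]     # P ∩ S(q, c·r)
    if len(outer) < k:
        return None                                # case 2: output the empty answer
    # case 1 (|P ∩ S(q, r)| >= k), and the in-between case where either answer is valid:
    return sorted(outer, key=dist)[:k]             # k points within distance c·r of q
```

Note the asymmetry of the two cases: when the r-ball holds fewer than k points but the c·r-ball holds at least k, the oracle is free to answer either way, which is what the query algorithm later exploits.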
Lemma 2.6.
There is an algorithm that solves the (c, r)-kNN problem in O(kn^ρ) time, requiring O(kn^ρ log n) preprocessing time and O(kn^ρ) space. The parameter ρ is a constant depending on the LSH function used in the algorithm, and ρ < 1 always holds.

3 The Proposed Algorithm

The proposed algorithm consists of two phases, i.e., the preprocessing phase and the query phase. The preprocessing phase builds a data structure, which will be used to guide the search in the query phase. Next we describe the algorithms of the two phases in detail.
Before describing the details of the preprocessing algorithm, we first introduce several concepts that will be used in the following discussion.
Axis Parallel Box.
An axis parallel box B in R^d is defined to be the Cartesian product of d intervals, i.e., B = I_1 × I_2 × ··· × I_d. The following is the definition of the Minimum Bounding Box.

Definition 3.1.
Given a point set P, the Minimum Bounding Box, denoted as MBB(P), is the axis parallel box satisfying the following two requirements:
1. MBB(P) encloses all points in P, and
2. there exist points p and p′ in P such that p^(i) = a_i and p′^(i) = b_i for each interval I_i = (a_i, b_i) defining MBB(P), 1 ≤ i ≤ d.

Median Split
Given a point set P and its minimum bounding box MBB(P), we introduce an operation that splits P into two subsets, which is called the median split. The operation first finds the longest interval I_i among the intervals defining MBB(P). Then it finds the median of the set {p^(i) | p ∈ P}, that is, the median of the i-th coordinates of the points in P. This median is denoted as med_i(P). Finally, P is split into two subsets P_1 = {p ∈ P | p^(i) ≤ med_i(P)} and P_2 = {p ∈ P | p^(i) > med_i(P)}. Here we assume that no two points share the same coordinate in any dimension. This assumption can be assured by adding a small random shift to the original coordinates.

Median Split Tree
By recursively conducting the median split operation, a point set P can be organized into a tree structure, which is called the Median Split Tree (MST). The definition of the MST is given below.

Definition 3.2.
Given the input point set P, a Median Split Tree (MST) based on P, denoted as MST(P), is a tree structure satisfying the following requirements:
1. the root of MST(P) is P, and the other nodes in MST(P) are subsets of P;
2. each interior node N ∈ MST(P) has two child nodes, which are generated by conducting a median split on N;
3. each leaf node contains only one point.

Balanced Median Split Tree
The depth of a node N in a tree T, denoted as dep_T(N), is defined to be the number of edges on the path from N to the root of T. Notice that the leaf nodes of an MST may have different depths. So we introduce the Balanced Median Split Tree (BMST), where all leaf nodes have the same depth.

Let L_T(i) = {N ∈ T | dep_T(N) = i} be the set of nodes in the i-th layer of a tree T, and let |N| be the number of points included in a node N. For a median split tree MST(P), it can easily be proved that either |N| = ⌈n/2^i⌉ or |N| = ⌊n/2^i⌋ for ∀N ∈ L_{MST(P)}(i). Given MST(P), the BMST(P) is constructed as follows. Find the smallest i such that ⌊n/2^i⌋ ≤ 3; then, for each node N ∈ L_{MST(P)}(i), all the nodes in the sub-tree rooted at N are directly connected to N.

Hierarchical Delaunay Graph
Given a point set P, we introduce the most important concept for the preprocessing algorithm in this paper, the Hierarchical Delaunay Graph (HDG). This structure is constructed by adding edges between nodes in the same layer of BMST(P). The additional edges are called graph edges, in contrast to the tree edges of BMST(P). The definition of the HDG is given below. Here Cen(N) denotes the center of AMES(N).

Definition 3.3.
Given a point set P and the balanced median split tree BMST(P), a Hierarchical Delaunay Graph HDG(P) is a layered graph based on BMST(P), where each layer is a Delaunay triangulation. Formally, for each N, N′ ∈ HDG(P), there is a graph edge between N and N′ iff
1. dep_{BMST(P)}(N) = dep_{BMST(P)}(N′), and
2. there exists a d-dimensional sphere S passing through Cen(N) and Cen(N′), and there is no N′′ ∈ HDG(P) in the same layer as N and N′ such that Cen(N′′) falls inside S. That is, the graph edges connecting the nodes of one layer form the Delaunay Triangulation of the AMES centers of that layer.

The preprocessing algorithm
Next we describe the preprocessing algorithm, which builds the HDG. The algorithm can be divided into three steps.

Step 1, Split and build the tree. The first step recursively splits P into smaller sets using the median split operation, and the median split tree is built. Finally, the nodes near the leaf layer are adjusted to satisfy the definition of the balanced median split tree.

Step 2, Compute spheres. In this step, the algorithm goes over the tree and computes the AMES of each node using Algorithm 1.

Step 3, Construct the HDG. In this step, an algorithm given in [9], which satisfies Lemma 2.4, is invoked to compute the Delaunay triangulation of each layer.

The pseudocode of the preprocessing algorithm is given in Algorithm 2.
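Step 1 can be sketched compactly in Python; the dict-based tree nodes and both helper names are our own choices, not the paper's:

```python
def median_split(P):
    """Split P at the median of its longest MBB dimension (the median split)."""
    d = len(P[0])
    # i: the dimension whose MBB interval is longest
    i = max(range(d), key=lambda j: max(p[j] for p in P) - min(p[j] for p in P))
    med = sorted(p[i] for p in P)[(len(P) - 1) // 2]          # med_i(P), lower median
    return [p for p in P if p[i] <= med], [p for p in P if p[i] > med]

def split_tree(P):
    """Recursive median splits build the Median Split Tree of Definition 3.2."""
    node = {"points": P, "left": None, "right": None}
    if len(P) > 1:                                            # leaves hold a single point
        P1, P2 = median_split(P)
        node["left"], node["right"] = split_tree(P1), split_tree(P2)
    return node
```

The recursion assumes distinct coordinates in the split dimension (as the text does), so both halves are always non-empty and the tree bottoms out at singleton leaves.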
The query algorithm takes the HDG built by the preprocessing algorithm, and executes the following three steps.

The first is the descending step. The algorithm goes down the tree and stops at the level i such that k ≤ n/2^i < 2k. At each level, the child node whose AMES center is closest to the query is chosen to be visited at the next level.

The second is the navigating step. The algorithm marches towards the locally nearest AMES center by moving along the edges of the HDG.

Algorithm 2: Preprocessing Algorithm
Input: a point set P
Output: a hierarchical Delaunay graph HDG(P)
T ← SplitTree(P);
Modify T into a BMST;
ComputeSpheres(T);
HierarchicalDelaunay(T);

Procedure SplitTree(N):
    Conduct a median split on N, generating two sets N_1 and N_2;
    T_1 ← SplitTree(N_1);
    T_2 ← SplitTree(N_2);
    Let T_1 be the left sub-tree of N, and T_2 be the right sub-tree of N;
end

Procedure ComputeSpheres(T):
    foreach N ∈ T do
        Call AMES(N, ε) for a fixed constant ε (Algorithm 1);
    end
end

Procedure HierarchicalDelaunay(T):
    Let dl be the depth of the leaf nodes in T;
    for i = 0 to dl do
        Delaunay(L_T(i)) (Lemma 2.4);
    end
end

The third step is the answering step. The algorithm finds the answer of kANN_{c,δ}(q, P) by invoking the (c, r)-kNN query. The answer satisfies the distance criterion or the recall criterion according to the different return results of the (c, r)-kNN query. Algorithm 3 describes the above process in pseudocode, where Cen(N) and Rad(N) are the center and radius of the AMES of node N, respectively.

4 Analysis

The analysis in this section assumes that the input point set P follows the spatial Poisson process.

Lemma 4.1. If Algorithm 3 terminates when i = 0, then the returned point set Res is a δ-kNN of q in P with probability at least 1 − e^{−(n−k)/n^d}.

Algorithm 3:
Query
Input: a query point q, a point set P, approximation factors c > 1 and δ < 1, and HDG(P)
Output: kANN_{c,δ}(q, P)
N ← the root of HDG(P);
while |N| ≥ 2k do
    Lc ← the left child of N, Rc ← the right child of N;
    if D(q, Cen(Lc)) < D(q, Cen(Rc)) then N ← Lc;
    else N ← Rc;
end
while ∃N′ ∈ Nbr(N) s.t. D(q, Cen(N′)) < D(q, Cen(N)) do
    N ← argmin_{N′ ∈ Nbr(N)} {D(q, Cen(N′))};
end
for i = 0 to log_c n do
    Invoke the (c, r)-kNN query with r = (D(q, Cen(N)) + Rad(N)) · c^i / n;
    if the query returns a set Res then
        return Res as the final result;
    end
end
Let R = D ( q, Cen ( N )) + Rad ( N ), R = R/n , t ∈ [0 , k ] be an integer. Wedefine the following three events. A = {| P ∩ S ( q, R ) | ≥ k } B = {| P ∩ S ( q, R ) | ≥ k + t } C = { Res ∩ kN N ( q, P ) | ≤ δ · k } The lemma states the situation that the algorithm returns at i = 0, whichimplies that event A happens. Event C represents the situation that Res is a δ -kNN set. Then it is easy to see that the desired probability is in this lemma is1 − Pr[ C | A ]. By the formula of conditional probability,Pr[ C | A ] = Pr[ C | B, A ] Pr[ B | A ] ≤ Pr[ B | A ] . Thus in the rest of the proof we focus on calculate Pr[ B | A ].To calculate Pr[ B | A ] we need the probability that a single point p falls in S ( q, R ). We have the following calculations.Pr[ p ∈ S ( q, R )] = Pr[ p ∈ S ( q, R ) | p ∈ S ( q, R )] · Pr[ p ∈ S ( q, R )] ≤ Pr[ p ∈ S ( q, R ) | p ∈ S ( q, R )]= 1 /n d The last equation is based on Lemma 2.2.On the other hand, the number of points in S ( q, R ) is at most n . Here weuse the trivial upper bound of n since it is sufficient to the proof. Denote P =Pr[ p ∈ S ( q, R )], we have the following equations.Pr[ B | A ] ≤ (cid:18) n − kt (cid:19) P t (1 − P ) n − k − t ≤ e − P ( n − k ) P t ( n − k ) t t !By the property of the Poisson Distribution, the above equation achieves themaximum when t = (cid:98) ( n − k ) P (cid:99) = (cid:98) ( n − k ) /n d (cid:99) = 0. Thus we have Pr[ B | A ] ≤ e − n − knd .Finally, combining the above analysis, we achieve the result that Res is a δ -kNN set with at least 1 − e − n − knd probability. (cid:117)(cid:116) Lemma 4.2.
If Algorithm 3 returns at some i > 0, then the returned point set Res is a kANN_c of q in P.

Proof. Let R_{i−1} = (D(q, Cen(N)) + Rad(N)) · c^{i−1}/n and R_i = (D(q, Cen(N)) + Rad(N)) · c^i/n, which are the radius parameters of the (i−1)-th and the i-th invocations of the (c, r)-kNN query. The lemma considers the situation that the algorithm returns at the i-th loop, which implies that the (c, r)-kNN query returned the empty set in the (i−1)-th loop. By Definition 2.8, the number of points in the d-dimensional sphere S(q, c · R_{i−1}) = S(q, R_i) is less than k. Denote T_k(q, P) = max_{p ∈ kNN(q,P)} D(q, p); then it can be deduced that T_k(q, P) ≥ R_i. On the other hand, the algorithm returns at the i-th loop, which implies that the (c, r)-kNN query returned a subset of P ∩ S(q, c · R_i). Thus we have D(p, q) ≤ c · R_i for each p in the result. Finally, D(q, p)/T_k(q, P) ≤ c · R_i/R_i = c, which exactly satisfies the distance criterion of Definition 2.2. ⊓⊔

Theorem 4.1.
The result of Algorithm 3 satisfies the requirements of kANN_{c,δ}(q, P) with probability at least 1 − e^{−(n−k)/n^d}.

Proof. The result can be directly deduced by combining Lemmas 4.1 and 4.2. ⊓⊔
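Before the complexity analysis, it may help to see the navigating step of Algorithm 3 as plain code: it is a greedy walk over the layer's Delaunay edges. The adjacency-list representation and the names below are our assumptions:

```python
import math

def navigate(q, start, centers, nbr):
    """Greedy walk of Algorithm 3: repeatedly move to the neighbor whose
    AMES center is closest to q; stop at a local minimum."""
    dist = lambda v: math.dist(centers[v], q)
    v = start
    while True:
        best = min(nbr[v], key=dist, default=v)
        if dist(best) < dist(v):       # a neighbor strictly improves the distance
            v = best
        else:
            return v                   # no neighbor is closer: local minimum

# Toy layer: four nodes on a line, centers at x = 0, 1, 2, 3
centers = {0: (0.0,), 1: (1.0,), 2: (2.0,), 3: (3.0,)}
nbr = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(navigate((2.8,), 0, centers, nbr))  # -> 3
```

Each iteration scans the current node's neighbors, which is exactly the degree term that Lemma 2.3 bounds in the cost analysis that follows.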
For ease of understanding, we first analyze the complexity of the individual steps of Algorithms 2 and 3.
Lemma 4.3.
The first step of Algorithm 2 takes O(n log n) time.

Proof. The following recursion formula can be easily deduced from the pseudocode of Algorithm 2:

T(n) = 2T(n/2) + O(n).

The O(n) term comes from the time of splitting and computing the median. This recursion can be solved by the standard process, and the result is T(n) = O(n log n). ⊓⊔

Lemma 4.4.
The second step of Algorithm 2 takes O(n log n) time.

Proof. According to Lemma 2.1, the time to compute the AMES of a point set is proportional to the number of points in the set. Thus we have the recursion formula

T(n) = 2T(n/2) + O(n),

whose solution is also T(n) = O(n log n). ⊓⊔

Lemma 4.5.
The third step of Algorithm 2 takes O(n log n) time.

Proof. According to the definition and the building process of the HDG, there are 2^i nodes in the i-th layer. By Lemma 2.4, the time to build the Delaunay triangulation of the i-th layer is O(2^i log 2^i). Thus, the time complexity of the third step is

Σ_{i=0}^{log n} 2^i log 2^i = Σ_{i=0}^{log n} i · 2^i = O(n log n). ⊓⊔

Lemma 4.6.
The navigating step (the second step) of Algorithm 3 needs O(dn^{1/d} log n) time.

Proof. From Lemma 2.5 we know that the navigating step passes at most O(n^{1/d}) simplices. A d-dimensional simplex has d + 1 vertices, so the number of nodes visited during the walk is O(dn^{1/d}). While a node is visited, the process goes over the neighbors of this node, and from Lemma 2.3 we know that the expected maximum degree of each node is O(log n). Thus the total time of the navigating step is O(dn^{1/d} log n). ⊓⊔

Now we are ready to present the final results about the complexities.

Theorem 4.2.
The expected time complexity of Algorithm 2, which is the preprocessing time complexity, is O(n log n).

Proof. The preprocessing consists of three steps, whose time complexities are given in Lemmas 4.3, 4.4 and 4.5. Adding them up, we get the desired conclusion. ⊓⊔
Theorem 4.3.
The space complexity of Algorithm 2 is O(n log n).

Proof. The space needed to store the HDG is proportional to the total number of graph edges and tree edges. The number of graph edges connected to each node is O(log n) according to Lemma 2.3, and the number of tree edges per node is constant. On the other hand, the number of nodes in the HDG is O(n). Thus the space complexity is O(n log n). ⊓⊔

Theorem 4.4.
The time complexity of Algorithm 3, which is the query complexity, is O(dn^{1/d} log n + kn^ρ log n), where ρ < 1 is a constant.

Proof. The time complexity of Algorithm 3 consists of three parts. The first part, the descending step, clearly takes O(log(n/k)) time. The time complexity of the second part is given by Lemma 4.6, which is O(dn^{1/d} log n). The third part invokes the (c, r)-kNN query log_c n times. By Lemma 2.6 each invocation of (c, r)-kNN needs O(kn^ρ) time where ρ < 1, and since c is a constant, log_c n = O(log n). Thus the third step needs O(kn^ρ log n) time. Adding the three parts, the desired result is achieved. ⊓⊔

5 Conclusion

In this paper we proposed an algorithm for the approximate k-Nearest-Neighbor problem. We observed that there are two kinds of approximation criteria in this research area, called the distance criterion and the recall criterion in this paper, and that the existing works have no theoretical guarantees on these criteria. We gave a new definition of the approximate k-Nearest-Neighbor problem which unifies the distance criterion and the recall criterion, and proposed an algorithm that solves the new problem. The result of the algorithm satisfies at least one of the two criteria. In future work, we will try to devise new algorithms that can satisfy both criteria.
References
1. Poisson Point Process, https://wikimili.com/en/Poisson_point_process
2. Amato, G., Gennaro, C., Savino, P.: MI-File: using inverted files for scalable approximate similarity search. Multimedia Tools and Applications (3), 1333–1362 (2014). https://doi.org/10.1007/s11042-012-1271-1
3. Andoni, A., Laarhoven, T., Razenshteyn, I., Waingarten, E.: Optimal hashing-based time-space trade-offs for approximate near neighbors. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 47–66. SIAM, Philadelphia, PA (2017). https://doi.org/10.1137/1.9781611974782.4
4. Andoni, A., Razenshteyn, I.: Optimal data-dependent hashing for approximate near neighbors. In: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (STOC '15), pp. 793–801. ACM, New York (2015). https://doi.org/10.1145/2746539.2746553
5. Bădoiu, M., Clarkson, K.L.: Smaller core-sets for balls. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 801–802 (2003)
6. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9), 509–517 (1975). https://doi.org/10.1145/361002.361007
7. Bern, M., Eppstein, D., Yao, F.: The expected extremes in a Delaunay triangulation. International Journal of Computational Geometry & Applications 1(1), 79–91 (1991). https://doi.org/10.1142/S0218195991000074
8. Computational Geometry (2), 89–105 (2007). https://doi.org/10.1016/j.comgeo.2006.05.005
9. Buchin, K., Mulzer, W.: Delaunay triangulations in O(sort(n)) time and more. In: Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 139–148 (2009). https://doi.org/10.1109/FOCS.2009.53
10. de Castro, P.M.M., Devillers, O.: Simple and efficient distribution-sensitive point location in triangulations. In: Proceedings of the Thirteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 127–138. SIAM, Philadelphia, PA (2011). https://doi.org/10.1137/1.9781611972917.13
11. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the Twentieth Annual Symposium on Computational Geometry (SCG '04), pp. 253–262. ACM, New York (2004). https://doi.org/10.1145/997817.997857
12. Devillers, O., Pion, S., Teillaud, M.: Walking in a triangulation. pp. 181–199 (2006)
13. Esuli, A.: Use of permutation prefixes for efficient and scalable approximate similarity search. Information Processing & Management (5), 889–902 (2012). https://doi.org/10.1016/j.ipm.2010.11.011
14. Gan, J., Feng, J., Fang, Q., Ng, W.: Locality-sensitive hashing scheme based on dynamic collision counting. In: Proceedings of the 2012 International Conference on Management of Data (SIGMOD '12), pp. 541–552. ACM, New York (2012). https://doi.org/10.1145/2213836.2213898
15. Gao, J., Jagadish, H., Ooi, B.C., Wang, S.: Selective hashing: Closing the gap between radius search and k-NN search. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15), pp. 349–358. ACM, New York (2015). https://doi.org/10.1145/2783258.2783284
16. Har-Peled, S.: A replacement for Voronoi diagrams of near linear size. In: Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, pp. 94–103. IEEE (2001). https://doi.org/10.1109/SFCS.2001.959884
17. Har-Peled, S., Indyk, P., Motwani, R.: Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing 8(1), 321–350 (2012). https://doi.org/10.4086/toc.2012.v008a014
18. Indyk, P., Motwani, R.: Approximate nearest neighbors: Towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing (STOC '98), pp. 604–613. ACM, New York (1998). https://doi.org/10.1145/276698.276876
19. Lin, P.C., Zhao, W.L.: Graph based nearest neighbor search: Promises and failures. arXiv preprint (2019), http://arxiv.org/abs/1904.02077
20. Ma, H.Z., Li, J.: An algorithm for reducing approximate nearest neighbor to approximate near neighbor with O(log n) query time. In: Kim, D., Uma, R.N., Zelikovsky, A. (eds.) Combinatorial Optimization and Applications - 12th International Conference (COCOA)
21. Malkov, Y., Ponomarenko, A., Logvinov, A., Krylov, V.: Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems 45, 61–68 (2014). https://doi.org/10.1016/j.is.2013.10.006
22. Mitchell, J.S., Mulzer, W.: Proximity algorithms. In: Handbook of Discrete and Computational Geometry, Third Edition, pp. 849–874 (2017). https://doi.org/10.1201/9781315119601
23. Mücke, E.P., Saias, I., Zhu, B.: Fast randomized point location without preprocessing in two- and three-dimensional Delaunay triangulations. Computational Geometry (1-2), 63–83 (1999). https://doi.org/10.1016/S0925-7721(98)00035-2
24. Muja, M., Lowe, D.G.: Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence (11), 2227–2240 (2014)
https://doi.org/10.1109/TPAMI.2014.2321376, http://elk.library.ubc.ca/handle/2429/44402http://ieeexplore.ieee.org/document/6809191/25. Ocsa, A., Bedregal, C., Cuadros-vargas, E., Society, P.C.: A new approach for sim-ilarity queries using proximity graphs. Simp´osio Brasileiro de Banco de Dados pp.131–142 (2007), http://socios.spc.org.pe/ecuadros/papers/Ocsa2007RNG-SBBD.pdf26. Paredes, R., Ch´avez, E.: Using the k-Nearest Neighbor Graph for Proximity Search-ing in Metric Spaces. In: International Symposium on String Processing and Infor-mation Retrieval. pp. 127–138 (2005).27. Sun, Y., Wang, W., Qin, J., Zhang, Y., Lin, X.: SRS. Proceedings of the VLDBEndowment (1), 1–12 (sep 2014). https://doi.org/10.14778/2735461.2735462,https://doi.org/10.14778/2735461.2735462http://dl.acm.org/doi/10.14778/2735461.273546228. Wang, J., Shen, H.T., Song, J., Ji, J.: Hashing for Similarity Search: A Survey(2014), http://arxiv.org/abs/1408.292729. Weber, R., Schek, H.J., Blott, S.: A Quantitative Analysis and Performance Studyfor Similarity-Search Methods in High-Dimensional Spaces. In: Proceedings of 24rdInternational Conference on Very Large Data Bases. pp. 194–205 (1998)30. Yianilos, P.N.: Data Structures and Algorithms for Nearest Neighbor Search inGeneral Metric Spaces. In: Proceedings of the Fourth Annual ACM-SIAM Sympo-sium on Discrete Algorithms. pp. 311–321. SODA ’93, Society for Industrial andApplied Mathematics, USA (1993). https://doi.org/10.5555/313559.31378931. Yildirim, E.A.: Two Algorithms for the Minimum Enclosing Ball Prob-lem. SIAM Journal on Optimization (3), 1368–1391 (jan 2008).https://doi.org/10.1137/070690419, http://epubs.siam.org/doi/10.1137/070690419pproximate k-NN with Full Approximation Guarantee 17 Appendix A Proof of Lemma 2.6
Proof.
The algorithm for (c, r)-kNN is adapted from the standard LSH algorithm for (c, r)-NN; see [11] for more details. Briefly speaking, let H be a family of LSH functions, where p_2 denotes the collision probability of two points at distance at least cr. Let h_i^j be LSH functions uniformly drawn from H, 1 ≤ i ≤ M, 1 ≤ j ≤ L, and let G_j(p) = (h_1^j(p), ..., h_M^j(p)) be a composition of M LSH functions. The algorithm stores each element p of the input set P in the hash bucket G_j(p) for each 1 ≤ j ≤ L. Given a query point q, the algorithm scans the buckets G_j(q), 1 ≤ j ≤ L, and collects the scanned points that lie in B(q, cr). If the algorithm collects k points in B(q, cr), it returns these k points. If it has scanned 3L points before collecting enough points, it returns No.

Now we prove that the algorithm succeeds with constant probability. The algorithm succeeds if the following two conditions hold. Here we call the points outside B(q, cr) outer points.
A. If there exist p_1, ..., p_k ∈ B(q, r), then for each p_i there exists some G_j such that G_j(p_i) = G_j(q), and
B. the algorithm encounters at most 3L outer points.

First we introduce the following two probabilities. Let P_1 = Pr[G_j(p') = G_j(q) | D(p', q) ≥ cr]. Apparently P_1 ≤ p_2^M, so setting M = log_{1/p_2} n yields P_1 ≤ 1/n. Hence, for each composition G_j, the expected number of outer points p' satisfying G_j(p') = G_j(q) is at most n × n^{-1} = 1, and at most L in total over all L compositions.
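The query procedure described in this proof can be sketched in code. The following is a minimal illustration, not the paper's implementation: it assumes Euclidean space and the p-stable hash family h(p) = ⌊(a·p + b)/w⌋ of Datar et al. [11], and the bucket width `w` and class name are illustrative choices of ours. Points within distance cr are collected across the L tables, and the scan aborts with "No" (`None`) after 3L points.

```python
import random
from collections import defaultdict

class CRkNN:
    """Sketch of the (c, r)-kNN procedure from the proof of Lemma 2.6,
    using the p-stable LSH family of Datar et al. [11] (an assumption;
    the proof works for any LSH family for the metric at hand)."""

    def __init__(self, points, r, c, M, L, w=4.0):
        self.points, self.r, self.c, self.L = points, r, c, L
        d = len(points[0])
        # L compositions G_j, each concatenating M random functions
        # h(p) = floor((a . p + b) / w), a ~ N(0, 1)^d, b ~ U[0, w)
        self.funcs = [[(tuple(random.gauss(0, 1) for _ in range(d)),
                        random.uniform(0, w), w)
                       for _ in range(M)]
                      for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for idx, p in enumerate(points):
            for j in range(L):
                self.tables[j][self._g(j, p)].append(idx)

    def _g(self, j, p):
        # Evaluate the composed hash G_j(p) as a tuple of M bucket ids
        return tuple(int((sum(x * y for x, y in zip(a, p)) + b) // w)
                     for a, b, w in self.funcs[j])

    def query(self, q, k):
        dist = lambda p: sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
        found, scanned = {}, 0
        for j in range(self.L):
            for idx in self.tables[j][self._g(j, q)]:
                scanned += 1
                if dist(self.points[idx]) <= self.c * self.r:
                    found[idx] = self.points[idx]
                    if len(found) == k:
                        return list(found.values())  # k points in B(q, cr)
                if scanned >= 3 * self.L and len(found) < k:
                    return None                      # report "No"
        return list(found.values()) if len(found) >= k else None
```

With M = log_{1/p_2} n, the analysis above bounds the expected number of outer points per table by 1, which is what makes the 3L scanning cutoff safe to use with constant success probability.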