CSD: Discriminance with Conic Section for Improving Reverse k Nearest Neighbors Queries
Yang Li∗†, Gang Liu∗†, Mingyuan Bai‡, Junbin Gao‡, Lixin Ye∗† and Zi Ming§
∗School of Computer Science, China University of Geosciences, Wuhan, China
†Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China
‡The University of Sydney Business School, The University of Sydney, Sydney, NSW, Australia
§School of Economics and Management, Hubei University of Technology, Wuhan, China
{liyang cs, liugang}@cug.edu.cn, {mbai8854@uni., junbin.gao@}sydney.edu.au, [email protected], [email protected]

Abstract—The reverse k nearest neighbor (RkNN) query finds all points that have the query point as one of their k nearest neighbors (kNN), where the kNN query finds the k closest points to its query point. Based on the characteristics of conic sections, we propose a discriminance, named CSD (Conic Section Discriminance), to determine whether points belong to the RkNN set without issuing any queries with non-constant computational complexity. Using CSD, we also implement an efficient RkNN algorithm, CSD-RkNN, with a computational complexity of O(k^1.5 · log k). Comparative experiments are conducted between CSD-RkNN and two state-of-the-art RkNN algorithms, SLICE and VR-RkNN. The experimental results indicate that the efficiency of CSD-RkNN is significantly higher than that of its competitors.
Index Terms —R k NN, conic section, Voronoi, Delaunay
I. INTRODUCTION
As a variant of the nearest neighbor (NN) query, the RNN query was first introduced by Korn and Muthukrishnan [1]. A direct generalization of the NN query is the reverse k nearest neighbors (RkNN) query, which finds all points that have the query point as one of their k closest points. Since its appearance, RkNN has received extensive attention [2], [3], [4], [5], [6], [7] and has become prominent in various scientific fields, including machine learning, decision support, intelligent computation and geographic information systems.

At first glance, RkNN and kNN queries appear to be equivalent, meaning that the results of RkNN and kNN may be the same for the same query point. However, RkNN is not as simple as it seems. Although their results are similar in many cases, it is a very different kind of query from kNN. So far, RkNN remains an expensive query with a computational complexity of O(k²) [6], whereas the computational complexity of kNN queries has been reduced to O(k · log k) [7].

In order to solve the RNN/RkNN problem, a large number of approaches have been proposed. Some early methods [8], [1], [9] speed up RNN/RkNN queries by pre-computation. Their disadvantage is that it is difficult to support queries on dynamic data sets. Therefore, many RkNN algorithms without pre-computation have been proposed.
Corresponding author: Gang Liu (email: [email protected]).
Most existing non-pre-computation RkNN algorithms have two phases: the filtering phase and the refining phase (also known as the pruning phase and the verification phase). In the pruning phase, the majority of points that do not belong to the RkNN set should be filtered out. The main goal of this phase is to generate a candidate set that is as small as possible. In the verification phase, each candidate point is verified as to whether it belongs to the RkNN set or not. For most algorithms, the candidate points are verified by issuing kNN queries or range queries, which are computationally expensive. The state-of-the-art RkNN technique, SLICE, provides a more efficient verification method with a computational complexity of O(k) per candidate. The size of the candidate set of SLICE varies from 2k to 6k. However, it is still time-consuming to perform such a verification for each candidate point.

There seems to be a consensus in past studies that, for an RkNN technique, the number of verified points cannot be smaller than the size of the result set. Such an idea, however, limits our understanding of the RkNN problem. Hence we revisit this assumption and conjecture that a point could be directly determined to belong to the RkNN set according to its location. Given the query point q, intuition suggests that if a point p is closer to q than a point p⁺ belonging to the RkNNs of q, then p is highly likely to also belong to the RkNNs of q. Conversely, if p is further away from q than a point p⁻ that does not belong to the RkNN set of q, then p is probably not a member of the RkNN set. Such a conjecture is true in many cases, but it is too broad and vague, and lacks the rigorous mathematical proof needed to be practical. Following this idea, we further study the problem and obtain a set of discriminant methods for RkNN queries that withstand mathematical verification. We name this method CSD (Conic Section Discriminance). With CSD, we can use a reference point that has already been verified to determine whether another point belongs to the RkNN set, without issuing any query with non-constant computational complexity. An efficient RkNN algorithm, named CSD-RkNN, is also implemented using CSD.

Table I shows the comparison of computational complexity among VR-RkNN, SLICE and CSD-RkNN.
TABLE I
COMPARISON OF COMPUTATIONAL COMPLEXITY

Operation             | VR-RkNN        | SLICE           | CSD-RkNN
Generate candidates   | O(k · log k)   | O(k · log k)    | O(k · log k)
Verify a candidate    | O(k · log k)   | O(k)            | O(k · log k)
|Verified candidates| | O(k) (= 6k)    | O(k) (2k ~ 6k)  | O(√k) (≤ 7.1√k)
Overall               | O(k² · log k)  | O(k²)           | O(k^1.5 · log k)

It can be seen that the bottleneck of both VR-RkNN and SLICE is the verification phase. The computational complexity of verifying a candidate in CSD-RkNN is O(k · log k), which is higher than that of SLICE. However, the number of candidates verified by CSD-RkNN is only about 7.1√k, which is much smaller than that of SLICE. As a result, the overall computational complexity of CSD-RkNN is much lower than that of SLICE.

The rest of the paper is organized as follows. In Section 2, we review the major related work on RkNN since its appearance. In Section 3, we formally define the RkNN problem and introduce the concepts related to our approach. Our approach and its principles are described in Section 4. Section 5 provides a detailed theoretical analysis. The experimental evaluation is presented in Section 6. The last two sections are the conclusions and acknowledgements.

II. RELATED WORK
A. RNN-tree
Reverse nearest neighbor (RNN) queries were first introduced by Korn and Muthukrishnan, who implemented RNN queries by preprocessing the data [1]. For each point p in the database, a circle with p as the center and the distance from p to its nearest neighbor as the radius is pre-calculated, and these circles are indexed by an R-tree. The RNN set of a query point q includes all the points whose circles contain q. With the R-tree, the RNN set of any query point can be found efficiently. Soon after, several techniques [10], [11] were proposed to improve this work.

B. Six-regions
The Six-regions algorithm [2], proposed by Stanoi et al., is the first approach that does not need any pre-computation. It divides the space into six equal segments using six rays starting at the query point, so that the angle between the two boundary rays of each segment is 60°. The authors show that only the nearest neighbor (NN) of the query point in each of the six segments may belong to the RNN set. The algorithm first performs six NN queries to find the closest point to the query point q in each segment. Then it launches an NN query for each of these six points to verify whether q is its NN. Finally, the RNN set of q is obtained.

Generalizing this theory to RkNN queries leads to the corollary that only the k nearest neighbors of the query point in each segment can possibly belong to the RkNN set. This corollary is widely adopted in the pruning phase of several RkNN techniques.
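To make the regions-based idea concrete, the following is a minimal Python sketch of six-region candidate generation: it assigns every point to one of six 60° segments around the query point and keeps the k closest points per segment. The function name and the brute-force NumPy approach are our own assumptions; this is an illustration, not the index-based procedure used by the algorithms discussed here.

import numpy as np

def six_region_candidates(points, q, k):
    """Return up to 6k candidate indices: the k closest points to q
    in each of the six 60-degree segments around q (brute force)."""
    points = np.asarray(points, dtype=float)
    q = np.asarray(q, dtype=float)
    d = points - q
    dist = np.hypot(d[:, 0], d[:, 1])
    # Segment index 0..5 from the polar angle of each point around q.
    seg = (np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)) // (np.pi / 3)
    candidates = []
    for s in range(6):
        idx = np.where(seg == s)[0]
        # Keep the k nearest points of this segment.
        candidates.extend(idx[np.argsort(dist[idx])][:k].tolist())
    return candidates

# Illustrative usage: at most 6 * k = 18 candidates are returned.
# rng = np.random.default_rng(0); pts = rng.random((1000, 2))
# print(len(six_region_candidates(pts, [0.5, 0.5], k=3)))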
C. TPL
TPL [3], proposed by Tao et al., is one of the best-known algorithms for RkNN queries. This technique prunes the space using the bisectors between the query point and other points. The perpendicular bisector between a point p and the query point q is denoted by B_{p:q}. B_{p:q} divides the space into two half-spaces: the half-space that contains p is denoted as H_{p:q}, and the other as H_{q:p}. If a point p′ lies in H_{p:q}, p′ must be closer to p than to q. Then p′ cannot be an RNN of q, and we say that p prunes p′. If a point is pruned by at least k other points, it cannot belong to the RkNN set of q. An area that is the intersection of any combination of k half-spaces can be pruned, and the total pruned area corresponds to the union of the regions pruned by all such combinations of k bisectors (in total C(m, k) combinations, where C(·,·) is the binomial coefficient). TPL also offers an alternative, computationally cheaper pruning method with less pruning power: all points are sorted by their Hilbert values, and only the combinations of k consecutive points are used to prune the space (in total m combinations).

D. FINCH
FINCH is another well-known RkNN algorithm, proposed by Wu et al. [4]. The authors of FINCH argue that it is too computationally costly to use combinations of k bisectors to prune the points. Instead of using bisectors directly, they prune the points with a convex polygon that approximates the unpruned region: all points lying outside the polygon are pruned. Since the containment test can be performed in logarithmic time for convex polygons, the pruning of FINCH is more efficient than that of TPL. However, the computational complexity of computing the approximate unpruned convex polygon is O(m²), where m is the number of points considered for pruning.

E. InfZone
The previous techniques can reduce the candidate set to some extent through different pruning methods, but their verification of candidates is inefficient. It is computationally costly to issue such a verification for each point in a candidate set of size O(k). To overcome this issue, a novel RkNN technique named InfZone was proposed by Cheema et al. [5]. The authors introduce the concept of the influence zone (denoted as Z_k), which can also be called the RkNN region. The influence zone of a query point q is a region such that a point p belongs to the RkNN set of q if and only if it lies in the Z_k of q. The influence zone is always a star-shaped polygon and the query point is its kernel point. A number of properties are detailed, aimed at shrinking the number of points that are crucial for computing the influence zone. The authors propose an influence-zone computing algorithm with a computational complexity of O(k · m), where m is the number of points accessed during the construction of the influence zone. Every point that lies inside the influence zone is accessed in the pruning phase, since it cannot be ignored during the construction of the influence zone. Namely, all the potential members of the RkNN set are accessed during the pruning phase. Hence, for monochromatic RkNN queries, InfZone does not need to verify the candidates. It has been shown that the expected size of the RkNN set is k. Evidently, the size of the RkNN set cannot be greater than m, i.e., k ≤ m. Therefore, the computational complexity of InfZone is no less than O(k²).

F. SLICE
SLICE [6] is the state-of-the-art approach for RkNN queries. In recent years, several well-known techniques (e.g., FINCH [4], InfZone [5]) have been proposed to address the limitations of half-space pruning [3], while few researchers have carried out further research based on the idea of Six-regions [2]. Yang et al. argue that the regions-based pruning approach of Six-regions has great potential and propose the efficient RkNN algorithm SLICE [6]. SLICE uses a more powerful and flexible pruning approach that prunes a much larger area than Six-regions at almost the same computational complexity. Furthermore, it significantly improves the verification phase by computing a list of significant points, named a sigList, for each segment. Each candidate can then be verified by accessing the sigLists instead of issuing a range query. Therefore, SLICE is significantly more efficient than the other existing algorithms.
G. VR-RkNN

For most RkNN algorithms, data points are indexed by an R-tree [12]. However, the R-tree was originally designed primarily for range queries. Although several approaches [13], [3], [14], [15] were later proposed to make it suitable for NN queries and their variants, NN-derived queries remain at a disadvantage: when answering an NN-derived query, all nodes in the R-tree intersecting the local neighborhood (search region) of the query point need to be accessed to find all members of the result set. Once the candidate set of the query is large, the cost of accessing the nodes can also become very large. In order to improve the performance of the R-tree on NN-derived queries, Sharifzadeh and Shahabi propose a composite index structure composed of an R-tree and a Voronoi diagram, named the VoR-tree [7]. The VoR-tree benefits from both the neighborhood-exploration capability of Voronoi diagrams and the hierarchical structure of the R-tree. Utilizing the VoR-tree, they propose VR-RkNN to answer the RkNN query. Similar to the filtering phase of Six-regions [2], VR-RkNN divides the space into six equal segments and selects k candidate points from each segment to form a candidate set of size 6k. During the refining phase, each candidate point is verified to be a member of the RkNN set by issuing a kNN query (VR-kNN). The expected computational complexity of VR-RkNN is O(k² · log k).

III. PRELIMINARIES
First, we formally define the problem in Section 3.1. Then we present some concepts and knowledge to enhance the understanding of our methodologies in Section 3.2 and Section 3.3.
A. Problem definition
Definition 1. Euclidean Distance:
Given two points A = (a_1, a_2, ..., a_d) and B = (b_1, b_2, ..., b_d) in R^d, the Euclidean distance between A and B, dist(A, B), is defined as follows:

dist(A, B) = √( Σ_{i=1}^{d} (a_i − b_i)² ).    (1)

Definition 2. kNN Queries: A kNN query finds the k closest points to the query point from a certain point set. Mathematically, this query in Euclidean space can be stated as follows. Given a set P of points in R^d and a query point q ∈ R^d,

kNN(q) = { p ∈ P | dist(p, q) ≤ dist(p_k, q) },    (2)

where p_k is the k-th closest point to q in P.

Definition 3. RkNN Queries:
An RkNN query retrieves all the points that have the query point as one of their k nearest neighbors from a certain point set. Formally, given a set P of points in R^d and a query point q ∈ P, the RkNN set of q in P can be defined as

RkNN(q) = { p ∈ P | q ∈ kNN(p) }.    (3)

Since the vast majority of spatial data in GIS applications is two-dimensional and most applications of RkNN queries are in location-based services, our study, like the existing approaches [2], [4], [5], [6], focuses on the RkNN query in the 2D setting. In the following sections, unless otherwise specified, the problem is discussed in a two-dimensional environment by default.
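As a concrete reading of Definitions 2 and 3, the following Python sketch evaluates both queries by brute force. The function names and the quadratic-time approach are our own; the sketch only makes the definitions executable and does not reflect the indexed algorithms discussed later.

import numpy as np

def knn(points, q, k):
    """Indices of the k closest points to q in `points` (Definition 2)."""
    d = np.linalg.norm(np.asarray(points) - np.asarray(q), axis=1)
    return np.argsort(d)[:k]

def rknn(points, q_index, k):
    """Indices p such that the point q_index is among the kNN of p (Definition 3)."""
    points = np.asarray(points)
    result = []
    for i in range(len(points)):
        if i == q_index:
            continue
        # i is an RkNN of q iff q appears among the k nearest neighbors of i,
        # computed over P \ {i}.
        others = np.delete(np.arange(len(points)), i)
        d = np.linalg.norm(points[others] - points[i], axis=1)
        if q_index in others[np.argsort(d)[:k]]:
            result.append(i)
    return result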
B. Voronoi diagram & Delaunay graph
Fig. 1. a) Voronoi Diagram, b) Delaunay Graph
The Voronoi diagram [16], proposed by René Descartes in 1644, is a spatial partition structure widely applied in many scientific domains, especially spatial databases and computational geometry. In a Voronoi diagram of n points, the space is divided into n regions corresponding to these points, which are called Voronoi cells. For each of these n points, the corresponding Voronoi cell consists of all locations closer to that point than to any other. In other words, each point is the nearest neighbor of all the locations in its corresponding Voronoi cell. Formally, the above description can be stated as follows.

Definition 4. Voronoi cell & Voronoi diagram: Given a set P of n points, the Voronoi cell of a point p ∈ P, denoted as V(P, p) or V(p) for short, is defined as Equation (4):

V(P, p) = { q | ∀ p′ ∈ P \ {p} : dist(p, q) ≤ dist(p′, q) },    (4)

and the Voronoi diagram of P, denoted as VD(P), is defined as Equation (5):

VD(P) = { V(P, p) | p ∈ P }.    (5)

The Voronoi diagram of a given set P of points, VD(P), is unique.

Definition 5. Voronoi neighbor:
Given the Voronoi diagram of P, the Voronoi neighbors of a point p are the points in P whose Voronoi cells share an edge with V(P, p). This set is denoted as VN(P, p), or VN(p) for short. Note that the nearest point in P to p is among VN(p).

Lemma 1.
Let p_k be the k-th nearest neighbor of q; then p_k is a Voronoi neighbor of at least one of the (k−1) nearest neighbors of q (where k > 1).

Proof. See [7].
Lemma 2.
For a Voronoi diagram, the expected number of Voronoi neighbors of a generator point does not exceed 6.

Proof.
Let n, n_e and n_v be the number of generator points, Voronoi edges and Voronoi vertices of a Voronoi diagram in R², respectively, and assume n ≥ 3. According to Euler's formula,

n + n_v − n_e = 1.    (6)

Every Voronoi vertex has at least 3 Voronoi edges and each Voronoi edge belongs to two Voronoi vertices. Hence the number of Voronoi edges is not less than 3(n_v + 1)/2, i.e.,

n_e ≥ (3/2)(n_v + 1).    (7)

According to Equation (6) and Equation (7), the following relationship holds:

n_e ≤ 3n − 6.    (8)

When the number of generator points is large enough, the average number of Voronoi edges per Voronoi cell of a Voronoi diagram in R^d is a constant depending only on d. When d = 2, every Voronoi edge is shared by two Voronoi cells, so the average number of Voronoi edges per Voronoi cell does not exceed 6, i.e., 2 · n_e / n ≤ (6n − 12)/n = 6 − 12/n ≤ 6.

For a set of points, the dual graph of its Voronoi diagram is its Delaunay graph (the graph of the Delaunay triangulation) [17].

Definition 6. Delaunay triangulation & Delaunay graph:
For a set P of discrete points in a plane, the Delaunay triangulation DT(P) is a triangulation such that no point in P lies inside the circumcircle of any triangle of DT(P). The graph of DT(P) is called the Delaunay graph of P and denoted as DG(P). Note that a Delaunay graph is a connected graph, i.e., any two vertices in the graph are connected by a path. For a set of points, its nearest neighbor graph is a subgraph of its Delaunay graph.

Definition 7. Delaunay graph distance:
Given the Delaunay graph DG(P), the Delaunay graph distance between two vertices p and p′ of DG(P) is the minimum number of edges connecting p and p′ in DG(P). It is denoted as dist_DG(p, p′).

Lemma 3.
Given the query point q, if a point p belongs to RkNN(q), then dist_DG(p, q) ≤ k in the Delaunay graph DG(P).

Proof. See [7].
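The Voronoi/Delaunay machinery above is readily available in common libraries. Below is a small Python sketch, assuming SciPy is available, that extracts Voronoi (Delaunay) neighbors from a Delaunay triangulation and computes the Delaunay graph distance of Definition 7 by breadth-first search; the helper names are ours.

from collections import deque
import numpy as np
from scipy.spatial import Delaunay

def voronoi_neighbors(points):
    """Map each point index to the indices of its Voronoi neighbors,
    i.e., the vertices adjacent to it in the Delaunay triangulation."""
    tri = Delaunay(np.asarray(points))
    indptr, indices = tri.vertex_neighbor_vertices
    return {i: set(indices[indptr[i]:indptr[i + 1]]) for i in range(len(points))}

def delaunay_graph_distance(neighbors, src, dst):
    """Minimum number of Delaunay edges between src and dst (Definition 7)."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        v, d = frontier.popleft()
        if v == dst:
            return d
        for u in neighbors[v]:
            if u not in seen:
                seen.add(u)
                frontier.append((u, d + 1))
    return None  # unreachable; should not happen, the Delaunay graph is connected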
C. Conic section
In mathematics, a conic section (or simply a conic) is a curve obtained as the intersection of a right circular conical surface with a plane. Conic curves include the ellipse, the hyperbola and the parabola. Some properties of ellipses and hyperbolas are used in our work, so we introduce them below.
Definition 8. Ellipse:
An ellipse is a closed curve in a plane such that the sum of the distances from any point on the curve to two fixed points p_1 and p_2 is a constant C. Formally, it is denoted as E^C_{p_1:p_2} and defined as follows:

E^C_{p_1:p_2} = { p | dist(p, p_1) + dist(p, p_2) = C }.    (9)

Definition 9. Hyperbola:
A hyperbola is a geometric figure such that the absolute difference between the distances from any point on the figure to two fixed points p_1 and p_2 is a constant C. Formally, it is denoted as H^C_{p_1:p_2} and defined as follows:

H^C_{p_1:p_2} = { p | |dist(p, p_1) − dist(p, p_2)| = C }.    (10)
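Only the membership tests induced by these two curves are needed later (the interior of an ellipse and one of the regions bounded by a hyperbola branch). A minimal Python sketch of these predicates, with names of our own choosing, is:

from math import dist  # Euclidean distance between two 2D points (Python 3.8+)

def inside_ellipse(p, f1, f2, c):
    """True if p lies inside (or on) the ellipse with foci f1, f2 and
    constant sum of distances c (Definition 8)."""
    return dist(p, f1) + dist(p, f2) <= c

def beyond_hyperbola_branch(p, f_near, f_far, c):
    """True if p lies in the region where dist(p, f_far) - dist(p, f_near) > c,
    i.e., beyond the hyperbola branch that opens around f_near (Definition 9)."""
    return dist(p, f_far) - dist(p, f_near) > c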
IV. METHODOLOGIES

A. Discriminance with Conic Section (CSD)
Fig. 2. k NN region
Definition 10. k NN region:
Given a query point q, the kNN region of q is the interior of C_{q:dist(q,p_k)}, i.e., the circle with q as the center and dist(q, p_k) as the radius, where p_k is the k-th closest point to q. This region is denoted as RG_kNN(q). The radius of RG_kNN(q) is called the kNN radius of q and is denoted as r_q.

Note that a point p must be one of kNN(q) if it lies in RG_kNN(q), i.e., the kNN region of q. Conversely, if a point p′ lies outside RG_kNN(q), it cannot be one of kNN(q). In Figure 2, q is the query point and the gray region within the circle centered on q represents RG_kNN(q). Three of the points lie inside RG_kNN(q), so we can determine that they belong to kNN(q), while the other two lie outside and are therefore not members of kNN(q).

Lemma 4.
Given a query point q, a point p must be one of RkNN(q) if it satisfies

dist(p, q) ≤ r_p.    (11)

Conversely, a point p′ cannot be one of RkNN(q) if it satisfies

dist(p′, q) > r_{p′}.    (12)

Simply put, for a point p, if the query point q lies in its kNN region, p must be one of RkNN(q); otherwise it does not belong to RkNN(q).

Proof. The lemma follows directly from the definitions of kNN and RkNN; see Equation (2) and Equation (3).

According to Lemma 4, we can determine whether a point p belongs to the RkNN set of the query point q by calculating the kNN region of p. Obviously, q lying in RG_kNN(p) is a necessary and sufficient condition for p to be one of RkNN(q). In the refining phase of some RkNN algorithms, the candidates are verified by this discriminant condition. This discriminance requires the kNN region, so a kNN query must be issued; since the computational complexity of the state-of-the-art kNN algorithm is O(k · log k), the computational complexity of the discriminance based on Lemma 4 is also O(k · log k). For most RkNN algorithms, the size of the candidate set is several times that of the result set. Therefore, issuing a verification of complexity O(k · log k) for each candidate is obviously expensive. In order to reduce the computational cost of the refining phase of RkNN queries, we introduce several more efficient verification approaches in the following.
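For reference, a direct Python sketch of the Lemma 4 test follows: a candidate is verified by computing its kNN radius with a k-d tree. The SciPy usage and function names are our own assumptions.

import numpy as np
from scipy.spatial import cKDTree

def knn_radius(tree, p, k):
    """kNN radius r_p of point p over the indexed data set: the distance
    from p to its k-th nearest neighbor (p itself excluded)."""
    # Query k+1 neighbors because p itself is in the tree at distance 0.
    d, _ = tree.query(p, k=k + 1)
    return d[-1]

def is_rknn_by_lemma4(tree, p, q, k):
    """Lemma 4: p is an RkNN of q iff dist(p, q) <= r_p."""
    return np.linalg.norm(np.asarray(p) - np.asarray(q)) <= knn_radius(tree, p, k)

# Illustrative usage:
# pts = np.random.rand(10000, 2); tree = cKDTree(pts)
# is_rknn_by_lemma4(tree, pts[0], q=[0.5, 0.5], k=10)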
Lemma 5.
Given a query point q and a point p⁺ ∈ RkNN(q), a point p must be one of RkNN(q) if it satisfies

dist(p, q) + dist(p, p⁺) ≤ r_{p⁺}.    (13)

Proof.
As shown in Figure 3, the larger circle takes p⁺ as the center and r_{p⁺} as the radius; it represents the kNN region of p⁺. L_{p^×,p⁺} is a line segment of length r_{p⁺} passing through the point p, where p^× denotes its endpoint on the larger circle. The smaller circle takes p as the center and dist(p, p^×) as the radius. Let p′ be an arbitrary point inside C_{p:dist(p,p^×)}; then it must satisfy

dist(p, p′) ≤ dist(p, p^×).    (14)

According to the triangle inequality, we can obtain

dist(p′, p⁺) ≤ dist(p, p′) + dist(p, p⁺).    (15)

Fig. 3. Lemma 5
Combining Inequality (14) and Inequality (15), we obtain

dist(p′, p⁺) ≤ dist(p, p^×) + dist(p, p⁺) = dist(p^×, p⁺) = r_{p⁺}.    (16)

From the above, any point lying in C_{p:dist(p,p^×)} must belong to kNN(p⁺). Consequently, the number of points lying in C_{p:dist(p,p^×)} cannot be greater than k, i.e., the size of kNN(p⁺). Equivalently, there are no more than k points closer to p than p^×. Thus, p_k (the k-th closest point to p) cannot be closer to p than p^×, and therefore dist(p, p^×) ≤ dist(p, p_k) = r_p. Suppose Inequality (13) holds; then

dist(p, q) ≤ r_{p⁺} − dist(p, p⁺) = dist(p^×, p⁺) − dist(p, p⁺) = dist(p, p^×) ≤ r_p.    (17)

From Lemma 4 and Inequality (17), we can deduce that p ∈ RkNN(q). Therefore, Lemma 5 is proved.

Lemma 5 provides a sufficient but not necessary condition for determining that a point belongs to RkNN(q), where q is the query point. That means if a point p satisfies Inequality (13), it can be determined to be one of RkNN(q) without issuing a kNN query. When r_{p⁺} is known, we can verify whether Inequality (13) holds by calculating only the Euclidean distances from p to q and to p⁺. Calculating the Euclidean distance between two points can be regarded as an atomic operation, hence the computational complexity of the discriminance corresponding to Lemma 5 is O(1).

Definition 11. Positive discriminant region:
Given the query point q and a point p, the positive discriminant region of p is the interior of E^{r_p}_{p:q}. Formally, it is denoted as RG⁺_disc(p) and defined as follows:

RG⁺_disc(p) = { p′ | dist(p′, q) + dist(p′, p) ≤ r_p }.    (18)

From the triangle inequality, it can be shown that

dist(p′, q) + dist(p′, p) ≥ dist(p, q).    (19)

Fig. 4. Positive discriminant region

If p ∉ RkNN(q), i.e., dist(p, q) > r_p, then

dist(p′, q) + dist(p′, p) > r_p,    (20)

and hence RG⁺_disc(p) = ∅. Therefore, if RG⁺_disc(p) ≠ ∅, p must belong to RkNN(q). In consequence, from Lemma 5 we can draw the corollary that, for any point p, if RG⁺_disc(p) is not empty, all the points lying inside RG⁺_disc(p) must belong to RkNN(q).

As shown in Figure 4, q is the query point, the interior of the circle C_{p:r_p} indicates RG_kNN(p), and the gray region within the ellipse E^{r_p}_{p:q} is RG⁺_disc(p). The points lying in RG⁺_disc(p) can immediately be determined to belong to RkNN(q), whereas for the points lying outside RG⁺_disc(p) we cannot directly determine by Lemma 5 whether or not they belong to RkNN(q).
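A hedged Python sketch of the O(1) positive discriminance of Lemma 5 (the function name is ours; r_plus is the known kNN radius of a point p⁺ already verified to be in RkNN(q)):

from math import dist

def positive_discriminate(p, q, p_plus, r_plus):
    """Lemma 5: if dist(p, q) + dist(p, p_plus) <= r_plus, where p_plus is a
    verified RkNN member of q with kNN radius r_plus, then p is in RkNN(q).
    Returns True when p can be positively discriminated, False when the
    test is inconclusive (p may still belong to RkNN(q))."""
    return dist(p, q) + dist(p, p_plus) <= r_plus

Note that a False result is not a rejection; it only means Lemma 5 cannot decide, and another test or a kNN query is still needed.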
Lemma 6. Given a query point q and a point p⁻ ∉ RkNN(q), a point p cannot be one of RkNN(q) if it satisfies

dist(p, q) − dist(p, p⁻) > r_{p⁻}.    (21)

Fig. 5. Lemma 6
Proof.
As shown in Figure 5, the smaller circle takes p⁻ as the center and r_{p⁻} as the radius; it represents the kNN region of p⁻. The point p^× is the intersection of the extension of L_{p,p⁻} (the line segment between p and p⁻) with C_{p⁻:r_{p⁻}}. The larger circle takes p as the center and dist(p, p^×) as the radius. Let p′ be an arbitrary point inside C_{p⁻:r_{p⁻}}; then it must satisfy

dist(p⁻, p′) ≤ r_{p⁻} = dist(p⁻, p^×).    (22)

According to the triangle inequality, we can obtain

dist(p, p′) ≤ dist(p, p⁻) + dist(p⁻, p′).    (23)

From Inequality (22) and Inequality (23), we get

dist(p, p′) ≤ dist(p, p⁻) + dist(p⁻, p^×) = dist(p, p^×).    (24)

Hence all the points lying in RG_kNN(p⁻) must lie inside C_{p:dist(p,p^×)}; namely, the number of points lying inside C_{p:dist(p,p^×)} is no less than k, i.e., the number of points lying in RG_kNN(p⁻). That is to say, there exist at least k points no further than p^× away from p. Equivalently, dist(p, p^×) ≥ dist(p, p_k) = r_p (where p_k is the k-th closest point to p). If the condition of Inequality (21) is satisfied, then

dist(p, q) > dist(p, p⁻) + r_{p⁻} = dist(p, p⁻) + dist(p⁻, p^×) = dist(p, p^×) ≥ r_p.    (25)

From Lemma 4 and Inequality (25), we can deduce that p ∉ RkNN(q). Therefore, Lemma 6 is proved.

From Lemma 6 we know that, if a point is determined not to be one of RkNN(q) and its kNN radius is known, then some other points may be sufficiently determined not to belong to RkNN(q) without performing a kNN query, but merely by performing two simple Euclidean distance calculations. That means the computational complexity of the discriminance based on Lemma 6 is O(1).

Fig. 6. Negative discriminant region
Definition 12. Negative discriminant region:
Given the query point q and a point p, H^{r_p}_{p:q} divides the space into three regions, of which the one containing p is the negative discriminant region of p. Formally, this region is denoted as RG⁻_disc(p) and defined as follows:

RG⁻_disc(p) = { p′ | dist(p′, q) − dist(p′, p) > r_p }.    (26)

For an arbitrary point p′, from the triangle inequality in △pqp′ it is known that

dist(p′, p) + dist(p, q) ≥ dist(p′, q).    (27)

If p ∈ RkNN(q), i.e., dist(p, q) ≤ r_p, then

dist(p′, q) − dist(p′, p) ≤ dist(p, q) ≤ r_p,    (28)

and hence RG⁻_disc(p) = ∅. Therefore, if RG⁻_disc(p) is not empty, p must not belong to RkNN(q). Hence from Lemma 6 we can draw the corollary that, for an arbitrary point p, if RG⁻_disc(p) is not empty, any point lying inside RG⁻_disc(p) cannot belong to RkNN(q).

As shown in Figure 6, q is the query point, the region within the circle centered on p represents RG_kNN(p), and the gray region on the right, separated by the hyperbola H^{r_p}_{p:q}, represents RG⁻_disc(p). The points lying inside RG⁻_disc(p) can be determined not to belong to RkNN(q), whereas for the points lying outside RG⁻_disc(p) we cannot tell by Lemma 6 whether they belong to RkNN(q) or not.

Definition 13. Positive/Negative discriminant point:
Given the query point q and two other points p and p′, if p′ lies in RG⁺_disc(p), we say that p is a positive discriminant point of p′ and that p can positively discriminate p′, denoted as p →+disc p′. Similarly, if p′ lies in RG⁻_disc(p), we say that p is a negative discriminant point of p′ and that p can negatively discriminate p′, denoted as p →−disc p′. If not specified, both types of points may be collectively referred to as discriminant points, and we write p →disc p′ to express that p can discriminate p′.

Thus, for either case, whether a point belongs to the RkNN set of the query point or not, a corresponding discriminant method with low computational complexity has been provided. However, when performing the discriminance of Lemma 5 or Lemma 6, the distances from the point to be determined to the query point and to the positive/negative discriminant point must both be calculated. In order to further improve the verification efficiency of some points, we propose Lemma 7.
Lemma 7.
Given a query point q, a point p must be one of RkNN(q) if it satisfies dist(p, q) ≤ r_q / 2.

Fig. 7. Lemma 7
Proof.
In Figure 7, there are three circles, two of which are centered on q and have r_q and r_q/2 as their radii, respectively. The third circle takes p as the center and dist(p, q) as the radius, where p lies in C_{q:r_q/2}, i.e., dist(q, p) ≤ r_q/2. Let p′ be an arbitrary point inside C_{p:dist(p,q)}; then it must satisfy

dist(p, p′) ≤ dist(q, p).    (29)

From the triangle inequality in △pqp′, it can be obtained that

dist(q, p′) ≤ dist(q, p) + dist(p, p′).    (30)

Then we get

dist(q, p′) ≤ 2 · dist(q, p).    (31)

Because dist(q, p) ≤ r_q/2,

dist(q, p′) ≤ 2 · r_q/2 = r_q.    (32)

That means any point lying in C_{p:dist(p,q)} must belong to kNN(q). Therefore, the number of points lying in C_{p:dist(p,q)} cannot be greater than k, i.e., the size of kNN(q), which means there are no more than k points closer to p than q. Hence p_k (the k-th closest point to p) cannot be closer to p than q. Then

dist(p, q) ≤ dist(p, p_k) = r_p.    (33)

According to Lemma 4, p ∈ RkNN(q), and Lemma 7 is proved.

Fig. 8. Semi-kNN region
Definition 14. Semi- k NN region:
Given the query point q, the semi-kNN region of q is the interior of C_{q:r_q/2}. Formally, it is denoted as SRG_kNN(q) and defined as Equation (34):

SRG_kNN(q) = { p | dist(p, q) ≤ r_q / 2 }.    (34)

As shown in Figure 8, q is the query point, the region within the larger circle represents RG_kNN(q), and the gray region within the smaller circle represents SRG_kNN(q). The points lying in the gray region can be determined to be members of RkNN(q) by Lemma 7, whereas for the points lying outside it Lemma 7 cannot decide whether they belong to RkNN(q) or not.

With Lemma 4, Lemma 5, Lemma 6 and Lemma 7, we can find all the points in the RkNN set of the query point while verifying only a small portion of the candidates with kNN queries. We combine these four lemmas into a complete RkNN verification method named CSD (Conic Section Discriminance).
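Putting the four lemmas together, a candidate can be screened by a few constant-time tests and only falls back to a kNN query when none of them applies. The following Python sketch is our own illustration of that decision order; it is not the exact Algorithm 1 given later, which additionally exploits Voronoi neighbors to locate discriminant points.

from math import dist

def csd_check(p, q, r_q, pos_refs, neg_refs):
    """Screen candidate p for membership in RkNN(q) using Lemmas 5-7.
    pos_refs: [(p_plus, r_p_plus), ...] verified RkNN members with kNN radii.
    neg_refs: [(p_minus, r_p_minus), ...] verified non-members with kNN radii.
    Returns True/False when decidable, or None when a kNN query (Lemma 4)
    is still required for this candidate."""
    if dist(p, q) <= r_q / 2:                     # Lemma 7: semi-kNN region
        return True
    for p_plus, r_plus in pos_refs:               # Lemma 5: positive discriminance
        if dist(p, q) + dist(p, p_plus) <= r_plus:
            return True
    for p_minus, r_minus in neg_refs:             # Lemma 6: negative discriminance
        if dist(p, q) - dist(p, p_minus) > r_minus:
            return False
    return None                                   # undecided: verify via Lemma 4

Candidates left undecided are verified with a kNN query as in Lemma 4 and then become new discriminant points.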
B. Selection of discriminant points

Theoretically, when using CSD to verify the candidates of an RkNN query, any point that belongs to the RkNN set can serve as a positive discriminant point; similarly, any point that is not a member of the RkNN set can serve as a negative discriminant point. In other words, all points in the candidate set are eligible to be selected as discriminant points. Our aim is to issue as few kNN queries as possible during an RkNN query, that is, to use as few discriminant points as possible to discriminate all the other points in the candidate set. Therefore, the selection of discriminant points is very important for improving the efficiency of RkNN queries. Which points should be selected as discriminant points is what we scrutinize next.
Definition 15. Discriminant set:
For an RkNN query with candidate set S_cnd, a discriminant set, denoted as S_disc, is a subset of S_cnd such that the following condition is satisfied:

∀ p ∈ S_cnd \ S_disc, ∃ p′ ∈ S_disc : p′ →disc p.    (35)

Because it is not certain how many and which points need to be selected as discriminant points, the total number of selection schemes can be as large as Σ_{i=1}^{|S_cnd|} C(|S_cnd|, i), where |S_cnd| is the number of candidates and C(·,·) denotes the binomial coefficient. Hence the computational complexity of finding the absolutely optimal scheme is as much as O(k!). However, it is not difficult to come up with a relatively good selection scheme in which the size of the discriminant set |S_disc| is only about O(√k).

For a positive discriminant point, most of the points in its discriminant region are closer to the query point than the discriminant point itself; conversely, a negative discriminant point is closer to the query point than most of the points in its own discriminant region. Therefore, a point belonging to the RkNN set can rarely be discriminated by a point closer to the query point than itself, and the probability that a point not belonging to the RkNN set can be discriminated by a point further from the query point than itself is also very low. Consequently, the points extremely close to the boundary of the RkNN region (i.e., the influence zone [5]) are rarely able to be discriminated by other points, and these points should be selected as discriminant points preferentially. However, it is impossible to directly find the points near this boundary without pre-computing the RkNN region, and computing the RkNN region is very costly, with computational complexity O(k²). In contrast, the kNN region of the query point is easy to obtain by issuing a single kNN query. Assuming the points are uniformly distributed, the kNN region and the RkNN region of a query point are very close to each other and the difference between them is negligible. Hence, to some extent, it is a good strategy to preferentially select the points near the boundary of the kNN region as discriminant points.

Fig. 9. Discriminant set

As shown in Figure 9, the region inside the circle centered on q represents the kNN region of q. In general, only the points near the boundary of RG_kNN(q) need to be selected as discriminant points, and all the other candidate points can be discriminated by them. In other words, if the points are evenly distributed, the points near the boundary of RG_kNN(q) are enough to form a valid discriminant set for q. Because the distribution of points is not guaranteed to be absolutely uniform, it is not always reliable to take only the points near the boundary of the kNN region of the query point as discriminant points for an RkNN query.

In order to ensure the reliability of the selection, we propose a strategy to dynamically construct the discriminant set while verifying the candidate points. First, the candidate points belonging to kNN(q) are accessed in descending order of distance to q. Then the other candidate points are accessed in ascending order of distance to q. During this process, once the currently accessed point cannot be discriminated by any point in the discriminant set, this point is selected as a discriminant point and put into the discriminant set.
Otherwise, we can use a corresponding point in the discriminant set to determine whether it belongs to the RkNN set or not.
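A minimal Python sketch of this verification order (our own illustration): candidates inside kNN(q) are processed from the boundary inwards, the remaining candidates from the boundary outwards, and every point left undecided by the lemmas is promoted to a discriminant point.

import numpy as np

def order_candidates(candidates, q, r_q):
    """Candidates in CSD verification order: points inside RG_kNN(q) by
    descending distance to q, then the rest by ascending distance to q."""
    pts = np.asarray(candidates, dtype=float)
    d = np.linalg.norm(pts - np.asarray(q, dtype=float), axis=1)
    inner = np.where(d <= r_q)[0]
    outer = np.where(d > r_q)[0]
    order = np.concatenate([inner[np.argsort(-d[inner])],
                            outer[np.argsort(d[outer])]])
    return pts[order]

Each point taken in this order is first screened by the lemmas; only the points that no existing discriminant point can decide are verified with a kNN query and appended to the discriminant set.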
C. Matching candidate points with discriminant points
Under the above strategy, it is sufficient to ensure that any point not belonging to S_disc can be discriminated by at least one point in S_disc. Since the expected size of S_disc is O(√k) (see Section V), the computational complexity of finding a discriminant point for a point by exhaustively searching the discriminant set is O(√k). Obviously, it is not a good idea to match candidate points with their discriminant points in this way. Therefore, we propose a method based on Voronoi diagrams to improve the efficiency of this process.

Given a Voronoi diagram VD(P) of a point set P and a continuous region RG, the vast majority of points in RG have at least one Voronoi neighbor lying in RG [18]. For any discriminant point, its discriminant region is a continuous region (an ellipse region or a hyperbola region). So for a non-discriminant point, there is a high probability that at least one of its Voronoi neighbors can discriminate it or shares a discriminant point with it. Therefore, when accessing a candidate point, if the point can be discriminated by one of its Voronoi neighbors or by the discriminant point of one of its Voronoi neighbors, we can determine whether this point belongs to the RkNN set. Otherwise, we say that this point is almost impossible to discriminate by any known discriminant point, and it should be marked as a discriminant point. Recall Lemma 2: in two dimensions, the expected number of Voronoi neighbors per point is at most 6, a constant. By using the above approach we can find a discriminant point for a non-discriminant point with a computational complexity of O(1).
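The following Python sketch illustrates this O(1) matching step in isolation (names and data structures are ours, assuming the visited set, the discriminant set and a dictionary mapping already-verified points to their discriminant points, as in Algorithm 1 below):

def find_discriminant_point(p, voronoi_neighbors, visited, disc_set, disc_of):
    """Return a candidate discriminant point for p: a visited Voronoi neighbor
    of p that is itself a discriminant point, or the recorded discriminant
    point of such a neighbor. Returns None if no such point is known."""
    for n in voronoi_neighbors[p]:
        if n not in visited:
            continue
        if n in disc_set:
            return n               # the neighbor itself is a discriminant point
        if n in disc_of:
            return disc_of[n]      # reuse the neighbor's discriminant point
    return None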
D. Algorithm

In the previous three subsections, we introduced a discriminance with conic sections (CSD) for improving the efficiency of RkNN queries and explained its principle. In this subsection, we introduce the implementation of the RkNN algorithm based on CSD.

The pseudocode for CSD is shown in Algorithm 1. When verifying a point, we first try to determine whether the point belongs to the RkNN set by Lemma 7 (line 2). If this fails, we visit the Voronoi neighbors of the point and try to discriminate it using Lemma 5 or Lemma 6 (line 10 and line 13). If none of these three lemmas applies to this point, we issue a kNN query for it and use Lemma 4 to verify it (line 18).

Using CSD, we implement an efficient RkNN algorithm, shown in Algorithm 2. First we generate the candidate set in the same way as VR-RkNN [7], where the size of the candidate set is 6k (line 1). Next, the candidate set is sorted in ascending order of distance to the query point (line 2). Then the first k elements of the candidate set and the remaining elements are divided into two groups, which are verified one by one from back to front and from front to back, respectively (line 8 and line 11). After all candidate points are verified, the RkNN set of the query point is obtained.

We use the same algorithm as VR-RkNN to generate the candidate set and do not improve it. The core of this algorithm is still that of Six-regions [2]. In addition, it uses a Voronoi diagram to find the candidate points incrementally according to Lemma 1. By Lemma 3, only the points whose Delaunay graph distance to the query point is not larger than k are eligible to be selected as candidate points; hence the number of points accessed for finding candidates is guaranteed to be no more than O(k²). The pseudocode of the candidate-generation algorithm is presented in Algorithm 3. As generating the candidate set is not the focus of our study, we do not describe Algorithm 3 in detail; see [7] for specific instructions.
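Before the theoretical analysis, here is a compact Python sketch of the overall flow described above. It mirrors Algorithm 2 only loosely, uses brute-force helpers, and assumes the hypothetical six_region_candidates and csd_check sketches from the earlier sections are in scope; every helper name is our own.

import numpy as np
from scipy.spatial import cKDTree

def csd_rknn(points, q_idx, k):
    """Sketch of the CSD-RkNN flow: 6k candidates, CSD screening, and a
    kNN query (Lemma 4) only for points promoted to discriminant points."""
    pts = np.asarray(points, dtype=float)
    q = pts[q_idx]
    tree = cKDTree(pts)
    r_q = tree.query(q, k=k + 1)[0][-1]                   # kNN radius of q
    cand = [i for i in six_region_candidates(pts, q, k) if i != q_idx]
    ordered = sorted(cand, key=lambda i: np.linalg.norm(pts[i] - q))
    ordered = ordered[:k][::-1] + ordered[k:]              # verification order of Algorithm 2
    result, pos_refs, neg_refs = [], [], []
    for i in ordered:
        p = pts[i]
        decision = csd_check(p, q, r_q, pos_refs, neg_refs)
        if decision is None:                               # fall back to Lemma 4
            r_p = tree.query(p, k=k + 1)[0][-1]
            decision = np.linalg.norm(p - q) <= r_p
            (pos_refs if decision else neg_refs).append((p, r_p))
        if decision:
            result.append(i)
    return result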
V. THEORETICAL ANALYSIS

In this section, we analyze the expected size of the discriminant set, the expected number of accessed points, and the computational complexity of CSD-RkNN. We assume that the data points are uniformly distributed in a unit space.
A. Expected size of discriminant set
Algorithm 1: CSD(p, q, k, r_q, S_v, S_disc, D_disc)
Input: the point p to be verified; the query point q; the parameter k; the kNN radius r_q of q; the set S_v of points that have been visited; the discriminant set S_disc; the dictionary D_disc that records the corresponding discriminant point of each non-discriminant point.
Output: whether p ∈ RkNN(q).
1:  S_v.add(p)
2:  if dist(p, q) ≤ r_q / 2 then                 /* Lemma 7 */
3:      return true
4:  foreach p_n ∈ VN(p) do
5:      if p_n ∈ S_v then
6:          if p_n ∈ S_disc then
7:              p_disc ← p_n
8:          else
9:              p_disc ← D_disc[p_n]
10:         if p_disc ∈ RkNN(q) and dist(p, q) + dist(p, p_disc) ≤ r_{p_disc} then   /* Lemma 5 */
11:             D_disc[p] ← p_disc
12:             return true
13:         if p_disc ∉ RkNN(q) and dist(p, q) − dist(p, p_disc) > r_{p_disc} then   /* Lemma 6 */
14:             D_disc[p] ← p_disc
15:             return false
16: r_p ← calculate the kNN radius of p
17: S_disc.add(p)
18: if r_p ≥ dist(p, q) then                      /* Lemma 4 */
19:     return true
20: else
21:     return false

Algorithm 2: CSD-RkNN(q)
Input: the query point q.
Output: RkNN(q).
1:  S_cnd ← generateCandidates(q, k)
2:  sort S_cnd in ascending order of distance to q
3:  r_q ← calculate the kNN radius of q
4:  S_v ← ∅
5:  S_disc ← ∅
6:  D_disc ← an empty dictionary
7:  S_RkNN ← ∅
8:  for i ← k downto 1 do
9:      if CSD(S_cnd[i], q, k, r_q, S_v, S_disc, D_disc) then
10:         S_RkNN.add(S_cnd[i])
11: for i ← k + 1 to 6k do
12:     if CSD(S_cnd[i], q, k, r_q, S_v, S_disc, D_disc) then
13:         S_RkNN.add(S_cnd[i])
14: return S_RkNN

Algorithm 3: generateCandidates(q, k)
Input: the query point q and the parameter k.
Output: the candidates of RkNN(q).
1:  H ← MinHeap(); Visited ← ∅
2:  for i ← 1 to 6 do
3:      S_cnd[i] ← MinHeap()
4:  foreach p ∈ VN(q) do
5:      H.push([1, p]); Visited.add(p)
6:  while |H| > 0 do
7:      [dist_DG(p), p] ← H.pop()
8:      for i ← 1 to 6 do
9:          if Segment_i contains p then
10:             if |S_cnd[i]| > 0 then
11:                 p_n ← the last point in S_cnd[i]
12:             else
13:                 p_n ← a point infinitely far from q
14:             if dist_DG(p) ≤ k and dist(q, p) ≤ dist(q, p_n) then
15:                 S_cnd[i].push([dist(p, q), p])
16:                 foreach p′ ∈ VN(p) do
17:                     if p′ ∉ Visited then
18:                         dist_DG(p′) ← dist_DG(p) + 1
19:                         H.push([dist_DG(p′), p′])
20:                         Visited.add(p′)
21: Candidates ← ∅
22: for i ← 1 to 6 do
23:     for j ← 1 to k do
24:         Candidates.add(S_cnd[i].pop())
25: return Candidates

The query point is q, the number of points in RkNN(q) is |RkNN|, and the number of points near the boundary of RG_kNN(q) is |S_b|. The area and the circumference (total length of the boundary) of RG_kNN(q) are denoted as A_RkNN(q) and C_RkNN(q), respectively. The expected size of the discriminant set of q is |S_disc|.

It has been shown that the expected value of |RkNN| is k [5]. Thus, the radius of the circle approximating RG_kNN(q) is equal to r_q. Then

A_RkNN(q) = π · r_q²,    (36)
C_RkNN(q) = 2π · r_q.    (37)

The following equation can be obtained from Equation (36) and Equation (37):

C_RkNN(q) = 2√(π · A_RkNN(q)).    (38)

The points around the boundary of RG_kNN(q) consist of two sets, one lying inside RG_kNN(q) and the other outside, and the size of each set is to |RkNN| what C_RkNN(q) is to A_RkNN(q); i.e.,

|S_b| = 2 · 2√(π · |RkNN|) = 4√(π · k) ≈ 7.1√k.    (39)

If all the points near the boundary were selected as discriminant points, there would be some redundancy, i.e., the discriminant regions of some points would overlap.
Hence the size of the discriminant set generated under our strategy is less than the number of points near the boundary of the RkNN region, i.e., |S_disc| ≤ 7.1√k.

B. Expected number of accessed points
For an RkNN query of q, the candidate points are distributed in an approximately circular region RG_cnd(q) centered around q, with area A_cnd(q) and circumference C_cnd(q). The expected number of accessed points is |S_ac|. In the filtering phase of CSD-RkNN, the accessed points include all the candidate points and their Voronoi neighbors. Apart from the points in the candidate set, the other accessed points are distributed outside RG_cnd(q) and adjacent to its boundary. Hence |S_ac| − |S_cnd| is to |S_cnd| what C_cnd(q) is to A_cnd(q), i.e.,

|S_ac| − |S_cnd| = 2√(π · |S_cnd|),    (40)
|S_ac| = |S_cnd| + 2√(π · |S_cnd|) = 6k + 2√(π · 6k) ≈ 6k + 8.7√k.    (41)

Therefore, if the points are distributed uniformly, the expected number of accessed points is approximately 6k + 8.7√k. When the points are distributed unevenly, |S_ac| becomes larger. However, it has an upper bound: recalling Lemma 3, only the points whose Delaunay graph distance to q is not larger than k are eligible to be selected as candidate points. Then

|S_ac| ≤ Σ_{i=1}^{k} 2π · i = (k² + k)π.    (42)

C. Computational complexity
The expected computational complexity of the filtering phase of CSD-RkNN is O(k · log k) [7]. In the refining phase, a kNN query with O(k · log k) computational complexity has to be issued for each discriminant point, and the size of the discriminant set is about 7.1√k; the other candidates only need to be verified by CSD in O(1) time each. Thus, the computational complexity of the refining phase is O(k^1.5 · log k), and the overall computational complexity of CSD-RkNN is O(k^1.5 · log k).

VI. EXPERIMENTS
In the previous section, we discussed the theoretical performance of CSD-RkNN. In this section, we evaluate its performance through comparative experiments.

A. Experimental settings
In the experiments, we let VR-RkNN [7] and the state-of-the-art RkNN approach SLICE [6] be the competitors of our method.

The settings of our experimental environment are as follows. The experiments are conducted on a personal computer with Python 2.7; the CPU is an Intel Core i5-4308U 2.80 GHz and the RAM is 8 GB DDR3. To be fair, all three methods in the experiments are implemented in Python, with six partitions in the pruning phase. We use two types of experimental data sets: a simulated data set and a real data set. To decrease the error of the experiments, we repeat each experiment 30 times and report the average of the results. The query point for each run of the experiment is randomly generated.

Our experiments are designed as four sets. The first set of experiments is used to evaluate the effect of the data size on the time cost of the RkNN algorithms; the data size varies while the value of k is fixed at 200. The remaining sets are used to evaluate the effect of the value of k on the time cost, the number of verified points and the number of accessed points of the RkNN algorithms, respectively. For these three sets of experiments, the size of the simulated data is fixed, the size of the real data is 49,601, and the value of k varies from 10 to 10,000.

B. Experimental results
TABLE II
TOTAL TIME COST (IN MS) OF DIFFERENT RkNN ALGORITHMS WITH VARIOUS SIZES OF DATA SETS

Algorithm | Time (ms) for increasing data sizes
VR-RkNN   | 510 | 725 | 728 | 732
SLICE     | 232 | 397 | 438 | 441
CSD-RkNN  |  59 |  65 |  69 |  72
Fig. 10. Effect of data size on efficiency of R k NN queries
Figure 10 shows the time cost of the three RkNN algorithms with various data sizes. As we can see, when the number of points in the database is significantly larger than k, the impact of the data size on the time cost of RkNN queries is very limited. If the number of points in the database is small enough to be on the same order of magnitude as k, all points in the database become candidate points; then the smaller the database, the lower the time cost of the RkNN query. When the number of points in the database is above 10,000 and the value of k is fixed at 200, the time cost of CSD-RkNN is consistently around 84% and 90% less than that of SLICE and VR-RkNN, respectively. The detailed experimental results are presented in Table II.
TABLE III
TOTAL TIME COST (IN MS) OF RkNN QUERIES WITH VARIOUS VALUES OF k
(for each value of k: VR-RkNN, SLICE and CSD-RkNN on the simulated data and on the real data)

Fig. 11. Effect of k on efficiency of RkNN queries
Figure 11 shows the influence of k on the efficiency of the three RkNN algorithms, where sub-figures (a) and (b) show the time cost of RkNN queries on the simulated data and on the real data, respectively. As k varies from 10 to 10,000, the time cost of all three algorithms increases. On both the synthetic data and the real data, the query efficiency of CSD-RkNN is significantly higher than that of the other two competitors, and this advantage becomes more obvious as k increases. When k is 10,000, the time cost of CSD-RkNN is only about 1/10 of that of the state-of-the-art algorithm SLICE. The detailed experimental results are presented in Table III.
TABLE IV
NUMBER OF CANDIDATES VERIFIED BY RkNN ALGORITHMS WITH VARIOUS VALUES OF k
(for each value of k: VR-RkNN, SLICE and CSD-RkNN on the simulated data and on the real data)

Fig. 12. Effect of k on the number of candidates verified

Figure 12 reflects the relationship between k and the number of candidate points verified by the three algorithms in the experiments. Sub-figures (a) and (b) show the results on the simulated data and on the real data, respectively; both also show the theoretical number of verified candidate points for different values of k. During the execution of CSD-RkNN, only the points in the discriminant set are verified by issuing kNN queries, so the number of candidates verified equals the size of the discriminant set. As discussed in Section V-A, the size of the discriminant set is theoretically not larger than 7.1√k; consequently, the theoretical number of verified candidates in Figure 12 is 7.1√k. It can be seen from the figure that the actual number of verified points is slightly less than this theoretical value, which indicates that the experimental results are consistent with our analysis. It is also obvious from the figure that the number of verified candidate points of CSD-RkNN is much smaller than that of the other two algorithms. The detailed experimental results are presented in Table IV.

Fig. 13. Effect of k on the number of points accessed

Figure 13 shows the number of points accessed by the three algorithms in the experiments, together with the theoretical number of accessed points of CSD-RkNN for various values of k, which indirectly reflects their IO cost. As can be seen from sub-figure (a), the numbers of accessed points of the three algorithms are almost equal in magnitude, and so is the theoretical value for CSD-RkNN; specifically, the number of accessed points of CSD-RkNN is slightly smaller than that of SLICE. As shown in sub-figure (b), CSD-RkNN needs to access more points than SLICE. The reason is that the distribution of the real data is very uneven, and CSD-RkNN is more sensitive to the distribution of data than SLICE. Note that CSD-RkNN and VR-RkNN use the same candidate-set generation method, so they have almost the same number of accessed points. The detailed experimental results are presented in Table V.
TABLE V
NUMBER OF ACCESSED POINTS OF RkNN QUERIES WITH VARIOUS VALUES OF k
(for each value of k: VR-RkNN, SLICE and CSD-RkNN on the simulated data and on the real data)

From the above three experiments, it can be seen that RkNN query efficiency is little affected by the data size but greatly affected by the value of k. CSD-RkNN is significantly more efficient than the other algorithms because it requires far less verification of candidate points. For data sets with a very uneven distribution of points, the candidate set of CSD-RkNN is relatively large, which affects the IO cost to some extent. However, the main time cost of an RkNN query is caused by the large number of verification operations rather than by IO. Therefore, the distribution of points has little impact on the overall performance of CSD-RkNN.

VII. CONCLUSIONS AND FUTURE WORKS
In this paper, we propose CSD, a discriminant method to determine whether points belong to the RkNN set without issuing any queries with non-constant computational complexity. An efficient RkNN algorithm, named CSD-RkNN, is also implemented using CSD. Comparative experiments are conducted between CSD-RkNN and two other state-of-the-art RkNN algorithms. The experimental results show that CSD-RkNN significantly outperforms its competitors in various aspects, except that CSD-RkNN needs to access more points to generate the candidate set when the distribution of points is very uneven. However, CSD-RkNN does not require a costly verification of each candidate point, so the distribution of data has very limited impact on its overall performance.

As an efficient discriminant method for improving RkNN queries, CSD has great potential and can be further developed. Our plan for the future is to extend the application of CSD to other variants of RkNN, including multidimensional RkNN, continuous RkNN, constrained RkNN, etc.

ACKNOWLEDGMENTS
The work is supported by the National Natural Science Foundation of China (U1711267, 41572314, 41972306), the Hubei Province Innovation Group Project (2019CFA023) and the Fundamental Research Funds for the Central Universities, China University of Geosciences (Wuhan) (CUGCJ1810).
REFERENCES

[1] Flip Korn and S. Muthukrishnan. Influence sets based on reverse nearest neighbor queries. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, pages 201–212, 2000.
[2] Ioana Stanoi, Divyakant Agrawal, and Amr El Abbadi. Reverse nearest neighbor queries for dynamic databases, pages 44–53, 2000.
[3] Yufei Tao, Dimitris Papadias, and Xiang Lian. Reverse kNN search in arbitrary dimensionality. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, VLDB '04, pages 744–755. VLDB Endowment, 2004.
[4] Wei Wu, Fei Yang, Chee Yong Chan, and Kian-Lee Tan. FINCH: Evaluating reverse k-nearest-neighbor queries on location data. PVLDB, 1(1):1056–1067, 2008.
[5] Muhammad Aamir Cheema, Xuemin Lin, Wenjie Zhang, and Ying Zhang. Influence zone: Efficiently processing reverse k nearest neighbors queries. In Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany, pages 577–588, 2011.
[6] Shiyu Yang, Muhammad Aamir Cheema, Xuemin Lin, and Ying Zhang. SLICE: Reviving regions-based pruning for reverse k nearest neighbors queries. In IEEE 30th International Conference on Data Engineering, ICDE 2014, Chicago, IL, USA, March 31 - April 4, 2014, pages 760–771, 2014.
[7] Mehdi Sharifzadeh and Cyrus Shahabi. VoR-tree: R-trees with Voronoi diagrams for efficient processing of spatial nearest neighbor queries. Proc. VLDB Endow., 3(1-2):1231–1242, September 2010.
[8] Anil Maheshwari, Jan Vahrenhold, and Norbert Zeh. On reverse nearest neighbor queries. In Proceedings of the 14th Canadian Conference on Computational Geometry, University of Lethbridge, Alberta, Canada, August 12-14, 2002, pages 128–132, 2002.
[9] Congjun Yang and King-Ip Lin. An index structure for efficient reverse nearest neighbor queries. In Proceedings of the 17th International Conference on Data Engineering, April 2-6, 2001, Heidelberg, Germany, pages 485–492, 2001.
[10] Congjun Yang and King-Ip Lin. An index structure for efficient reverse nearest neighbor queries. In Proceedings of the 17th International Conference on Data Engineering, April 2-6, 2001, Heidelberg, Germany, pages 485–492, 2001.
[11] King-Ip Lin, Michael Nolen, and Congjun Yang. Applying bulk insertion techniques for dynamic reverse nearest neighbor problems, pages 290–297, 2003.
[12] Antonin Guttman. R-trees: A dynamic index structure for spatial searching. In Beatrice Yormark, editor, SIGMOD'84, Proceedings of Annual Meeting, Boston, Massachusetts, USA, June 18-21, 1984, pages 47–57. ACM Press, 1984.
[13] Gísli R. Hjaltason and Hanan Samet. Distance browsing in spatial databases. ACM Trans. Database Syst., 24(2):265–318, 1999.
[14] Dimitris Papadias, Yufei Tao, Kyriakos Mouratidis, and Chun Kit Hui. Aggregate nearest neighbor queries in spatial databases. ACM Trans. Database Syst., 30(2):529–576, 2005.
[15] Dimitris Papadias, Yufei Tao, Greg Fu, and Bernhard Seeger. Progressive skyline computation in database systems. ACM Trans. Database Syst., 30(1):41–82, 2005.
[16] Cyrus Shahabi and Mehdi Sharifzadeh. Voronoi diagrams for query processing. In Encyclopedia of GIS, pages 2446–2452. Springer, 2017.
[17] B. Delaunay. Sur la sphère vide. A la mémoire de Georges Voronoï. Bulletin de l'Académie des Sciences de l'URSS, Classe des Sciences Mathématiques et Naturelles, 6:793–800, 1934.
[18] Yang Li. Area queries based on Voronoi diagrams.