CSD: Discriminance with Conic Section for Improving Reverse k Nearest Neighbors Queries
Yang Li∗†, Gang Liu∗†, Mingyuan Bai‡, Junbin Gao‡, Lixin Ye∗† and Zi Ming§
∗School of Computer Science, China University of Geosciences, Wuhan, China
†Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China
‡The University of Sydney Business School, The University of Sydney, Sydney, NSW, Australia
§School of Economics and Management, Hubei University of Technology, Wuhan, China
{liyang cs, liugang}@cug.edu.cn, {mbai8854@uni., junbin.gao@}sydney.edu.au, [email protected], [email protected]

Abstract—The reverse k nearest neighbor (RkNN) query finds all points that have the query point as one of their k nearest neighbors (kNN), where the kNN query finds the k closest points to its query point. Based on the characteristics of conic sections, we propose a discriminance, named CSD (Conic Section Discriminance), to determine whether points belong to the RkNN set without issuing any queries with non-constant computational complexity. Using CSD, we also implement an efficient RkNN algorithm, CSD-RkNN, with a computational complexity of O(k^1.5 · log k). Comparative experiments are conducted between CSD-RkNN and two state-of-the-art RkNN algorithms, SLICE and VR-RkNN. The experimental results indicate that the efficiency of CSD-RkNN is significantly higher than that of its competitors.
Index Terms —R k NN, conic section, Voronoi, Delaunay
I. INTRODUCTION
As a variant of the nearest neighbor (NN) query, the RNN query was first introduced by Korn and Muthukrishnan [1]. A direct generalization of the NN query is the reverse k nearest neighbors (RkNN) query, which finds all points that have the query point as one of their k closest points. Since its appearance, RkNN has received extensive attention [2], [3], [4], [5], [6], [7] and has become prominent in various scientific fields, including machine learning, decision support, intelligent computation and geographic information systems.

At first glance, RkNN and kNN queries appear to be equivalent, meaning that the results of RkNN and kNN may be the same for the same query point. However, RkNN is not as simple as it seems. Although their results are similar in many cases, it is a very different kind of query from kNN. So far, RkNN remains an expensive query with a computational complexity of O(k²) [6], whereas the computational complexity of kNN queries has been reduced to O(k · log k) [7].

In order to solve the RNN/RkNN problem, a large number of approaches have been proposed. Some early methods [8], [1], [9] speed up RNN/RkNN queries by pre-computation. Their disadvantage is that it is difficult to support queries on dynamic data sets. Therefore, many RkNN algorithms without pre-computation have been proposed.
Corresponding author: Gang Liu (email: [email protected]).
Most existing non-pre-computation RkNN algorithms have two phases: the filtering phase and the refining phase (also known as the pruning phase and the verification phase). In the pruning phase, the majority of points that do not belong to the RkNN set should be filtered out. The main goal of this phase is to generate a candidate set that is as small as possible. In the verification phase, each candidate point is verified as to whether it belongs to the RkNN set or not. For most algorithms, the candidate points are verified by issuing kNN queries or range queries, which are computationally expensive. The state-of-the-art RkNN technique, SLICE, provides a more efficient verification method with a computational complexity of O(k) per candidate. The size of the candidate set of SLICE varies from 2k to 6k. However, it is still time-consuming to perform such a verification for each candidate point.

There seems to be a consensus in past studies that, for an RkNN technique, the number of verified points cannot be smaller than the size of the result set. Such an idea, however, limits our understanding of the RkNN problem. Hence we revisit this assumption and conjecture that a point could be directly determined to belong to the RkNN set according to its location. Given the query point q, intuition suggests that if a point p is closer to q than a point p⁺ belonging to the RkNNs of q, then p is highly likely to also belong to the RkNNs of q. Conversely, if p is further away from q than a point p⁻ that does not belong to the RkNN set of q, then p is probably not a member of the RkNN set. Such a conjecture is true in many cases, but it is too broad and vague, and lacks the rigorous mathematical proof needed to be practical. Following this idea, we further study the problem and obtain a set of discriminant methods for RkNN queries that withstand mathematical verification. We name this method CSD (Conic Section Discriminance). With CSD, we can use a reference point that has already been verified to determine whether another point belongs to the RkNN set, without issuing any query with non-constant computational complexity. An efficient RkNN algorithm, named CSD-RkNN, is also implemented using CSD.

Table I shows the comparison of computational complexity among VR-RkNN, SLICE and CSD-RkNN.
TABLE I
COMPARISON OF COMPUTATIONAL COMPLEXITY

Operation             | VR-RkNN        | SLICE           | CSD-RkNN
Generate candidates   | O(k · log k)   | O(k · log k)    | O(k · log k)
Verify a candidate    | O(k · log k)   | O(k)            | O(k · log k)
|Verified candidates| | O(k) (= 6k)    | O(k) (2k ~ 6k)  | O(√k) (≤ 7.1√k)
Overall               | O(k² · log k)  | O(k²)           | O(k^1.5 · log k)

It can be seen that the bottleneck of both VR-RkNN and SLICE is the verification phase. The computational complexity of verifying a candidate in CSD-RkNN is O(k · log k), which is higher than that of SLICE. However, the number of candidates verified by CSD-RkNN is only about 7.1√k, which is much smaller than that of SLICE. As a result, the overall computational complexity of CSD-RkNN is much lower than that of SLICE.

The rest of the paper is organized as follows. In Section 2, we review the major related work on RkNN since its appearance. In Section 3, we formally define the RkNN problem and introduce the concepts related to our approach. Our approach and its principles are described in Section 4. Section 5 provides a detailed theoretical analysis. The experimental evaluation is presented in Section 6. The last two sections are the conclusions and acknowledgements.

II. RELATED WORK
A. RNN-tree
Reverse nearest neighbor (RNN) queries were first introduced by Korn and Muthukrishnan, who implemented RNN queries by preprocessing the data [1]. For each point p in the database, a circle with p as the center and the distance from p to its nearest neighbor as the radius is pre-calculated, and these circles are indexed by an R-tree. The RNN set of a query point q includes all the points whose circles contain q. With the R-tree, the RNN set of any query point can be found efficiently. Soon after, several techniques [10], [11] were proposed to improve this work.

B. Six-regions
The Six-regions algorithm [2], proposed by Stanoi et al., is the first approach that does not need any pre-computation. It divides the space into six equal segments using six rays starting at the query point, so that the angle between the two boundary rays of each segment is 60°. The authors show that only the nearest neighbor (NN) of the query point in each of the six segments may belong to the RNN set. The algorithm first performs six NN queries to find the closest point to the query point q in each segment. Then it launches an NN query for each of these six points to verify whether q is its NN. Finally, the RNN set of q is obtained.

Generalizing this theory to RkNN queries leads to the corollary that only the k nearest neighbors of the query point in each segment can possibly belong to the RkNN set. This corollary is widely adopted in the pruning phase of several RkNN techniques.
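To make the regions-based idea concrete, the following is a minimal Python sketch of six-region candidate generation: it assigns every point to one of six 60° segments around the query point and keeps the k closest points per segment. The function name and the brute-force NumPy approach are our own assumptions; this is an illustration, not the index-based procedure used by the algorithms discussed here.

import numpy as np

def six_region_candidates(points, q, k):
    """Return up to 6k candidate indices: the k closest points to q
    in each of the six 60-degree segments around q (brute force)."""
    points = np.asarray(points, dtype=float)
    q = np.asarray(q, dtype=float)
    d = points - q
    dist = np.hypot(d[:, 0], d[:, 1])
    # Segment index 0..5 from the polar angle of each point around q.
    seg = (np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)) // (np.pi / 3)
    candidates = []
    for s in range(6):
        idx = np.where(seg == s)[0]
        # Keep the k nearest points of this segment.
        candidates.extend(idx[np.argsort(dist[idx])][:k].tolist())
    return candidates

# Illustrative usage: at most 6 * k = 18 candidates are returned.
# rng = np.random.default_rng(0); pts = rng.random((1000, 2))
# print(len(six_region_candidates(pts, [0.5, 0.5], k=3)))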
C. TPL
TPL [3], proposed by Tao et al., is one of the best-known algorithms for RkNN queries. This technique prunes the space using the bisectors between the query point and other points. The perpendicular bisector between a point p and the query point q is denoted by B_{p:q}. B_{p:q} divides the space into two half-spaces: the half-space that contains p is denoted as H_{p:q}, and the other as H_{q:p}. If a point p′ lies in H_{p:q}, p′ must be closer to p than to q. Then p′ cannot be an RNN of q, and we say that p prunes p′. If a point is pruned by at least k other points, it cannot belong to the RkNN set of q. An area that is the intersection of any combination of k half-spaces can be pruned, and the total pruned area corresponds to the union of the regions pruned by all such combinations of k bisectors (in total C(m, k) combinations, where C(·,·) is the binomial coefficient). TPL also offers an alternative, computationally cheaper pruning method with less pruning power: all points are sorted by their Hilbert values, and only the combinations of k consecutive points are used to prune the space (in total m combinations).

D. FINCH
FINCH is another well-known RkNN algorithm, proposed by Wu et al. [4]. The authors of FINCH argue that it is too computationally costly to use combinations of k bisectors to prune the points. Instead of using bisectors directly, they prune the points with a convex polygon that approximates the unpruned region: all points lying outside the polygon are pruned. Since the containment test can be performed in logarithmic time for convex polygons, the pruning of FINCH is more efficient than that of TPL. However, the computational complexity of computing the approximate unpruned convex polygon is O(m²), where m is the number of points considered for pruning.

E. InfZone
The previous techniques can reduce the candidate set to some extent through different pruning methods, but their verification of candidates is inefficient. It is computationally costly to issue such a verification for each point in a candidate set of size O(k). To overcome this issue, a novel RkNN technique named InfZone was proposed by Cheema et al. [5]. The authors introduce the concept of the influence zone (denoted as Z_k), which can also be called the RkNN region. The influence zone of a query point q is a region such that a point p belongs to the RkNN set of q if and only if it lies in the Z_k of q. The influence zone is always a star-shaped polygon and the query point is its kernel point. A number of properties are detailed, aimed at shrinking the number of points that are crucial for computing the influence zone. The authors propose an influence-zone computing algorithm with a computational complexity of O(k · m), where m is the number of points accessed during the construction of the influence zone. Every point that lies inside the influence zone is accessed in the pruning phase, since it cannot be ignored during the construction of the influence zone. Namely, all the potential members of the RkNN set are accessed during the pruning phase. Hence, for monochromatic RkNN queries, InfZone does not need to verify the candidates. It has been shown that the expected size of the RkNN set is k. Evidently, the size of the RkNN set cannot be greater than m, i.e., k ≤ m. Therefore, the computational complexity of InfZone is no less than O(k²).

F. SLICE
SLICE [6] is the state-of-the-art approach for RkNN queries. In recent years, several well-known techniques (e.g., FINCH [4], InfZone [5]) have been proposed to address the limitations of half-space pruning [3], while few researchers have carried out further research based on the idea of Six-regions [2]. Yang et al. argue that the regions-based pruning approach of Six-regions has great potential and propose the efficient RkNN algorithm SLICE [6]. SLICE uses a more powerful and flexible pruning approach that prunes a much larger area than Six-regions at almost the same computational complexity. Furthermore, it significantly improves the verification phase by computing a list of significant points, named a sigList, for each segment. Each candidate can then be verified by accessing the sigLists instead of issuing a range query. Therefore, SLICE is significantly more efficient than the other existing algorithms.
G. VR-RkNN

For most RkNN algorithms, data points are indexed by an R-tree [12]. However, the R-tree was originally designed primarily for range queries. Although several approaches [13], [3], [14], [15] were later proposed to make it suitable for NN queries and their variants, NN-derived queries remain at a disadvantage: when answering an NN-derived query, all nodes in the R-tree intersecting the local neighborhood (search region) of the query point need to be accessed to find all members of the result set. Once the candidate set of the query is large, the cost of accessing the nodes can also become very large. In order to improve the performance of the R-tree on NN-derived queries, Sharifzadeh and Shahabi propose a composite index structure composed of an R-tree and a Voronoi diagram, named the VoR-tree [7]. The VoR-tree benefits from both the neighborhood-exploration capability of Voronoi diagrams and the hierarchical structure of the R-tree. Utilizing the VoR-tree, they propose VR-RkNN to answer the RkNN query. Similar to the filtering phase of Six-regions [2], VR-RkNN divides the space into six equal segments and selects k candidate points from each segment to form a candidate set of size 6k. During the refining phase, each candidate point is verified to be a member of the RkNN set by issuing a kNN query (VR-kNN). The expected computational complexity of VR-RkNN is O(k² · log k).

III. PRELIMINARIES
First, we formally define the problem in Section 3.1. Then we present some concepts and knowledge to enhance the understanding of our methodologies in Section 3.2 and Section 3.3.
A. Problem definition
Definition 1. Euclidean Distance:
Given two points A = (a_1, a_2, ..., a_d) and B = (b_1, b_2, ..., b_d) in R^d, the Euclidean distance between A and B, dist(A, B), is defined as follows:

dist(A, B) = √( Σ_{i=1}^{d} (a_i − b_i)² ).    (1)

Definition 2. kNN Queries: A kNN query finds the k closest points to the query point from a certain point set. Mathematically, this query in Euclidean space can be stated as follows. Given a set P of points in R^d and a query point q ∈ R^d,

kNN(q) = { p ∈ P | dist(p, q) ≤ dist(p_k, q) },    (2)

where p_k is the k-th closest point to q in P.

Definition 3. RkNN Queries:
An RkNN query retrieves all the points that have the query point as one of their k nearest neighbors from a certain point set. Formally, given a set P of points in R^d and a query point q ∈ P, the RkNN set of q in P can be defined as

RkNN(q) = { p ∈ P | q ∈ kNN(p) }.    (3)

Since the vast majority of spatial data in GIS applications is two-dimensional and most applications of RkNN queries are in location-based services, our study, like the existing approaches [2], [4], [5], [6], focuses on the RkNN query in the 2D setting. In the following sections, unless otherwise specified, the problem is discussed in a two-dimensional environment by default.
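As a concrete reading of Definitions 2 and 3, the following Python sketch evaluates both queries by brute force. The function names and the quadratic-time approach are our own; the sketch only makes the definitions executable and does not reflect the indexed algorithms discussed later.

import numpy as np

def knn(points, q, k):
    """Indices of the k closest points to q in `points` (Definition 2)."""
    d = np.linalg.norm(np.asarray(points) - np.asarray(q), axis=1)
    return np.argsort(d)[:k]

def rknn(points, q_index, k):
    """Indices p such that the point q_index is among the kNN of p (Definition 3)."""
    points = np.asarray(points)
    result = []
    for i in range(len(points)):
        if i == q_index:
            continue
        # i is an RkNN of q iff q appears among the k nearest neighbors of i,
        # computed over P \ {i}.
        others = np.delete(np.arange(len(points)), i)
        d = np.linalg.norm(points[others] - points[i], axis=1)
        if q_index in others[np.argsort(d)[:k]]:
            result.append(i)
    return result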
B. Voronoi diagram & Delaunay graph
Fig. 1. a) Voronoi Diagram, b) Delaunay Graph
The Voronoi diagram [16], proposed by René Descartes in 1644, is a spatial partition structure widely applied in many scientific domains, especially spatial databases and computational geometry. In a Voronoi diagram of n points, the space is divided into n regions corresponding to these points, which are called Voronoi cells. For each of these n points, the corresponding Voronoi cell consists of all locations closer to that point than to any other. In other words, each point is the nearest neighbor of all the locations in its corresponding Voronoi cell. Formally, the above description can be stated as follows.

Definition 4. Voronoi cell & Voronoi diagram: Given a set P of n points, the Voronoi cell of a point p ∈ P, denoted as V(P, p) or V(p) for short, is defined as Equation (4):

V(P, p) = { q | ∀ p′ ∈ P \ {p} : dist(p, q) ≤ dist(p′, q) },    (4)

and the Voronoi diagram of P, denoted as VD(P), is defined as Equation (5):

VD(P) = { V(P, p) | p ∈ P }.    (5)

The Voronoi diagram of a given set P of points, VD(P), is unique.

Definition 5. Voronoi neighbor:
Given the Voronoi diagram of P, the Voronoi neighbors of a point p are the points in P whose Voronoi cells share an edge with V(P, p). This set is denoted as VN(P, p), or VN(p) for short. Note that the nearest point in P to p is among VN(p).

Lemma 1.
Let p_k be the k-th nearest neighbor of q; then p_k is a Voronoi neighbor of at least one of the (k−1) nearest neighbors of q (where k > 1).

Proof. See [7].
Lemma 2.
For a Voronoi diagram, the expected number of Voronoi neighbors of a generator point does not exceed 6.

Proof.
Let n, n_e and n_v be the number of generator points, Voronoi edges and Voronoi vertices of a Voronoi diagram in R², respectively, and assume n ≥ 3. According to Euler's formula,

n + n_v − n_e = 1.    (6)

Every Voronoi vertex has at least 3 Voronoi edges and each Voronoi edge belongs to two Voronoi vertices. Hence the number of Voronoi edges is not less than 3(n_v + 1)/2, i.e.,

n_e ≥ (3/2)(n_v + 1).    (7)

According to Equation (6) and Equation (7), the following relationship holds:

n_e ≤ 3n − 6.    (8)

When the number of generator points is large enough, the average number of Voronoi edges per Voronoi cell of a Voronoi diagram in R^d is a constant depending only on d. When d = 2, every Voronoi edge is shared by two Voronoi cells, so the average number of Voronoi edges per Voronoi cell does not exceed 6, i.e., 2 · n_e / n ≤ (6n − 12)/n = 6 − 12/n ≤ 6.

For a set of points, the dual graph of its Voronoi diagram is its Delaunay graph (the graph of the Delaunay triangulation) [17].

Definition 6. Delaunay triangulation & Delaunay graph:
For a set P of discrete points in a plane, the Delaunay triangulation DT(P) is a triangulation such that no point in P lies inside the circumcircle of any triangle of DT(P). The graph of DT(P) is called the Delaunay graph of P and denoted as DG(P). Note that a Delaunay graph is a connected graph, i.e., any two vertices in the graph are connected by a path. For a set of points, its nearest neighbor graph is a subgraph of its Delaunay graph.

Definition 7. Delaunay graph distance:
Given the Delaunay graph DG(P), the Delaunay graph distance between two vertices p and p′ of DG(P) is the minimum number of edges connecting p and p′ in DG(P). It is denoted as dist_DG(p, p′).

Lemma 3.
Given the query point q, if a point p belongs to RkNN(q), then dist_DG(p, q) ≤ k in the Delaunay graph DG(P).

Proof. See [7].
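The Voronoi/Delaunay machinery above is readily available in common libraries. Below is a small Python sketch, assuming SciPy is available, that extracts Voronoi (Delaunay) neighbors from a Delaunay triangulation and computes the Delaunay graph distance of Definition 7 by breadth-first search; the helper names are ours.

from collections import deque
import numpy as np
from scipy.spatial import Delaunay

def voronoi_neighbors(points):
    """Map each point index to the indices of its Voronoi neighbors,
    i.e., the vertices adjacent to it in the Delaunay triangulation."""
    tri = Delaunay(np.asarray(points))
    indptr, indices = tri.vertex_neighbor_vertices
    return {i: set(indices[indptr[i]:indptr[i + 1]]) for i in range(len(points))}

def delaunay_graph_distance(neighbors, src, dst):
    """Minimum number of Delaunay edges between src and dst (Definition 7)."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        v, d = frontier.popleft()
        if v == dst:
            return d
        for u in neighbors[v]:
            if u not in seen:
                seen.add(u)
                frontier.append((u, d + 1))
    return None  # unreachable; should not happen, the Delaunay graph is connected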
C. Conic section
In mathematics, a conic section (or simply a conic) is a curve obtained as the intersection of a right circular conical surface with a plane. Conic curves include the ellipse, the hyperbola and the parabola. Some properties of ellipses and hyperbolas are used in our work, so we introduce them below.
Definition 8. Ellipse:
An ellipse is a closed curve in a plane such that the sum of the distances from any point on the curve to two fixed points p_1 and p_2 is a constant C. Formally, it is denoted as E^C_{p_1:p_2} and defined as follows:

E^C_{p_1:p_2} = { p | dist(p, p_1) + dist(p, p_2) = C }.    (9)

Definition 9. Hyperbola:
A hyperbola is a geometric figure such that the absolute difference between the distances from any point on the figure to two fixed points p_1 and p_2 is a constant C. Formally, it is denoted as H^C_{p_1:p_2} and defined as follows:

H^C_{p_1:p_2} = { p | |dist(p, p_1) − dist(p, p_2)| = C }.    (10)
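Only the membership tests induced by these two curves are needed later (the interior of an ellipse and one of the regions bounded by a hyperbola branch). A minimal Python sketch of these predicates, with names of our own choosing, is:

from math import dist  # Euclidean distance between two 2D points (Python 3.8+)

def inside_ellipse(p, f1, f2, c):
    """True if p lies inside (or on) the ellipse with foci f1, f2 and
    constant sum of distances c (Definition 8)."""
    return dist(p, f1) + dist(p, f2) <= c

def beyond_hyperbola_branch(p, f_near, f_far, c):
    """True if p lies in the region where dist(p, f_far) - dist(p, f_near) > c,
    i.e., beyond the hyperbola branch that opens around f_near (Definition 9)."""
    return dist(p, f_far) - dist(p, f_near) > c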
IV. METHODOLOGIES

A. Discriminance with Conic Section (CSD)
Fig. 2. k NN region
Definition 10. k NN region:
Given a query point q, the kNN region of q is the interior of C_{q:dist(q,p_k)}, i.e., the circle with q as the center and dist(q, p_k) as the radius, where p_k is the k-th closest point to q. This region is denoted as RG_kNN(q). The radius of RG_kNN(q) is called the kNN radius of q and is denoted as r_q.

Note that a point p must be one of kNN(q) if it lies in RG_kNN(q), i.e., the kNN region of q. Conversely, if a point p′ lies outside RG_kNN(q), it cannot be one of kNN(q). In Figure 2, q is the query point and the gray region within the circle centered on q represents RG_kNN(q). Three of the points lie inside RG_kNN(q), so we can determine that they belong to kNN(q), while the other two lie outside and are therefore not members of kNN(q).

Lemma 4.
Given a query point q, a point p must be one of RkNN(q) if it satisfies

dist(p, q) ≤ r_p.    (11)

Conversely, a point p′ cannot be one of RkNN(q) if it satisfies

dist(p′, q) > r_{p′}.    (12)

Simply put, for a point p, if the query point q lies in its kNN region, p must be one of RkNN(q); otherwise it does not belong to RkNN(q).

Proof. The lemma follows directly from the definitions of kNN and RkNN; see Equation (2) and Equation (3).

According to Lemma 4, we can determine whether a point p belongs to the RkNN set of the query point q by calculating the kNN region of p. Obviously, q lying in RG_kNN(p) is a necessary and sufficient condition for p to be one of RkNN(q). In the refining phase of some RkNN algorithms, the candidates are verified by this discriminant condition. This discriminance requires the kNN region, so a kNN query must be issued; since the computational complexity of the state-of-the-art kNN algorithm is O(k · log k), the computational complexity of the discriminance based on Lemma 4 is also O(k · log k). For most RkNN algorithms, the size of the candidate set is several times that of the result set. Therefore, issuing a verification of complexity O(k · log k) for each candidate is obviously expensive. In order to reduce the computational cost of the refining phase of RkNN queries, we introduce several more efficient verification approaches in the following.
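For reference, a direct Python sketch of the Lemma 4 test follows: a candidate is verified by computing its kNN radius with a k-d tree. The SciPy usage and function names are our own assumptions.

import numpy as np
from scipy.spatial import cKDTree

def knn_radius(tree, p, k):
    """kNN radius r_p of point p over the indexed data set: the distance
    from p to its k-th nearest neighbor (p itself excluded)."""
    # Query k+1 neighbors because p itself is in the tree at distance 0.
    d, _ = tree.query(p, k=k + 1)
    return d[-1]

def is_rknn_by_lemma4(tree, p, q, k):
    """Lemma 4: p is an RkNN of q iff dist(p, q) <= r_p."""
    return np.linalg.norm(np.asarray(p) - np.asarray(q)) <= knn_radius(tree, p, k)

# Illustrative usage:
# pts = np.random.rand(10000, 2); tree = cKDTree(pts)
# is_rknn_by_lemma4(tree, pts[0], q=[0.5, 0.5], k=10)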
Lemma 5.
Given a query point q and a point p⁺ ∈ RkNN(q), a point p must be one of RkNN(q) if it satisfies

dist(p, q) + dist(p, p⁺) ≤ r_{p⁺}.    (13)

Proof.
As shown in Figure 3, the larger circle takes p⁺ as the center and r_{p⁺} as the radius; it represents the kNN region of p⁺. L_{p^×,p⁺} is a line segment of length r_{p⁺} passing through the point p, where p^× denotes its endpoint on the larger circle. The smaller circle takes p as the center and dist(p, p^×) as the radius. Let p′ be an arbitrary point inside C_{p:dist(p,p^×)}; then it must satisfy

dist(p, p′) ≤ dist(p, p^×).    (14)

According to the triangle inequality, we can obtain

dist(p′, p⁺) ≤ dist(p, p′) + dist(p, p⁺).    (15)

Fig. 3. Lemma 5
Combining Inequality (14) and Inequality (15), we obtain

dist(p′, p⁺) ≤ dist(p, p^×) + dist(p, p⁺) = dist(p^×, p⁺) = r_{p⁺}.    (16)

From the above, any point lying in C_{p:dist(p,p^×)} must belong to kNN(p⁺). Consequently, the number of points lying in C_{p:dist(p,p^×)} cannot be greater than k, i.e., the size of kNN(p⁺). Equivalently, there are no more than k points closer to p than p^×. Thus, p_k (the k-th closest point to p) cannot be closer to p than p^×, and therefore dist(p, p^×) ≤ dist(p, p_k) = r_p. Suppose Inequality (13) holds; then

dist(p, q) ≤ r_{p⁺} − dist(p, p⁺) = dist(p^×, p⁺) − dist(p, p⁺) = dist(p, p^×) ≤ r_p.    (17)

From Lemma 4 and Inequality (17), we can deduce that p ∈ RkNN(q). Therefore, Lemma 5 is proved.

Lemma 5 provides a sufficient but not necessary condition for determining that a point belongs to RkNN(q), where q is the query point. That means if a point p satisfies Inequality (13), it can be determined to be one of RkNN(q) without issuing a kNN query. When r_{p⁺} is known, we can verify whether Inequality (13) holds by calculating only the Euclidean distances from p to q and to p⁺. Calculating the Euclidean distance between two points can be regarded as an atomic operation, hence the computational complexity of the discriminance corresponding to Lemma 5 is O(1).

Definition 11. Positive discriminant region:
Given the query point q and a point p, the positive discriminant region of p is the interior of E^{r_p}_{p:q}. Formally, it is denoted as RG⁺_disc(p) and defined as follows:

RG⁺_disc(p) = { p′ | dist(p′, q) + dist(p′, p) ≤ r_p }.    (18)

From the triangle inequality, it can be shown that

dist(p′, q) + dist(p′, p) ≥ dist(p, q).    (19)

Fig. 4. Positive discriminant region

If p ∉ RkNN(q), i.e., dist(p, q) > r_p, then

dist(p′, q) + dist(p′, p) > r_p,    (20)

and hence RG⁺_disc(p) = ∅. Therefore, if RG⁺_disc(p) ≠ ∅, p must belong to RkNN(q). In consequence, from Lemma 5 we can draw the corollary that, for any point p, if RG⁺_disc(p) is not empty, all the points lying inside RG⁺_disc(p) must belong to RkNN(q).

As shown in Figure 4, q is the query point, the interior of the circle C_{p:r_p} indicates RG_kNN(p), and the gray region within the ellipse E^{r_p}_{p:q} is RG⁺_disc(p). The points lying in RG⁺_disc(p) can immediately be determined to belong to RkNN(q), whereas for the points lying outside RG⁺_disc(p) we cannot directly determine by Lemma 5 whether or not they belong to RkNN(q).
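A hedged Python sketch of the O(1) positive discriminance of Lemma 5 (the function name is ours; r_plus is the known kNN radius of a point p⁺ already verified to be in RkNN(q)):

from math import dist

def positive_discriminate(p, q, p_plus, r_plus):
    """Lemma 5: if dist(p, q) + dist(p, p_plus) <= r_plus, where p_plus is a
    verified RkNN member of q with kNN radius r_plus, then p is in RkNN(q).
    Returns True when p can be positively discriminated, False when the
    test is inconclusive (p may still belong to RkNN(q))."""
    return dist(p, q) + dist(p, p_plus) <= r_plus

Note that a False result is not a rejection; it only means Lemma 5 cannot decide, and another test or a kNN query is still needed.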
Lemma 6. Given a query point q and a point p⁻ ∉ RkNN(q), a point p cannot be one of RkNN(q) if it satisfies

dist(p, q) − dist(p, p⁻) > r_{p⁻}.    (21)

Fig. 5. Lemma 6
Proof.
As shown in Figure 5, the smaller circle takes p⁻ as the center and r_{p⁻} as the radius; it represents the kNN region of p⁻. The point p^× is the intersection of the extension of L_{p,p⁻} (the line segment between p and p⁻) with C_{p⁻:r_{p⁻}}. The larger circle takes p as the center and dist(p, p^×) as the radius. Let p′ be an arbitrary point inside C_{p⁻:r_{p⁻}}; then it must satisfy

dist(p⁻, p′) ≤ r_{p⁻} = dist(p⁻, p^×).    (22)

According to the triangle inequality, we can obtain

dist(p, p′) ≤ dist(p, p⁻) + dist(p⁻, p′).    (23)

From Inequality (22) and Inequality (23), we get

dist(p, p′) ≤ dist(p, p⁻) + dist(p⁻, p^×) = dist(p, p^×).    (24)

Hence all the points lying in RG_kNN(p⁻) must lie inside C_{p:dist(p,p^×)}; namely, the number of points lying inside C_{p:dist(p,p^×)} is no less than k, i.e., the number of points lying in RG_kNN(p⁻). That is to say, there exist at least k points no further than p^× away from p. Equivalently, dist(p, p^×) ≥ dist(p, p_k) = r_p (where p_k is the k-th closest point to p). If the condition of Inequality (21) is satisfied, then

dist(p, q) > dist(p, p⁻) + r_{p⁻} = dist(p, p⁻) + dist(p⁻, p^×) = dist(p, p^×) ≥ r_p.    (25)

From Lemma 4 and Inequality (25), we can deduce that p ∉ RkNN(q). Therefore, Lemma 6 is proved.

From Lemma 6 we know that, if a point is determined not to be one of RkNN(q) and its kNN radius is known, then some other points may be sufficiently determined not to belong to RkNN(q) without performing a kNN query, but merely by performing two simple Euclidean distance calculations. That means the computational complexity of the discriminance based on Lemma 6 is O(1).

Fig. 6. Negative discriminant region
Definition 12. Negative discriminant region:
Given the query point q and a point p, H^{r_p}_{p:q} divides the space into three regions, of which the one containing p is the negative discriminant region of p. Formally, this region is denoted as RG⁻_disc(p) and defined as follows:

RG⁻_disc(p) = { p′ | dist(p′, q) − dist(p′, p) > r_p }.    (26)

For an arbitrary point p′, from the triangle inequality in △pqp′ it is known that

dist(p′, p) + dist(p, q) ≥ dist(p′, q).    (27)

If p ∈ RkNN(q), i.e., dist(p, q) ≤ r_p, then

dist(p′, q) − dist(p′, p) ≤ dist(p, q) ≤ r_p,    (28)

and hence RG⁻_disc(p) = ∅. Therefore, if RG⁻_disc(p) is not empty, p must not belong to RkNN(q). Hence from Lemma 6 we can draw the corollary that, for an arbitrary point p, if RG⁻_disc(p) is not empty, any point lying inside RG⁻_disc(p) cannot belong to RkNN(q).

As shown in Figure 6, q is the query point, the region within the circle centered on p represents RG_kNN(p), and the gray region on the right, separated by the hyperbola H^{r_p}_{p:q}, represents RG⁻_disc(p). The points lying inside RG⁻_disc(p) can be determined not to belong to RkNN(q), whereas for the points lying outside RG⁻_disc(p) we cannot tell by Lemma 6 whether they belong to RkNN(q) or not.

Definition 13. Positive/Negative discriminant point:
Given the query point q and two other points p and p′, if p′ lies in RG⁺_disc(p), we say that p is a positive discriminant point of p′ and that p can positively discriminate p′, denoted as p →+disc p′. Similarly, if p′ lies in RG⁻_disc(p), we say that p is a negative discriminant point of p′ and that p can negatively discriminate p′, denoted as p →−disc p′. If not specified, both types of points may be collectively referred to as discriminant points, and we write p →disc p′ to express that p can discriminate p′.

Thus, for either case, whether a point belongs to the RkNN set of the query point or not, a corresponding discriminant method with low computational complexity has been provided. However, when performing the discriminance of Lemma 5 or Lemma 6, the distances from the point to be determined to the query point and to the positive/negative discriminant point must both be calculated. In order to further improve the verification efficiency of some points, we propose Lemma 7.
Lemma 7.
Given a query point q, a point p must be one of RkNN(q) if it satisfies dist(p, q) ≤ r_q / 2.

Fig. 7. Lemma 7
Proof.
In Figure 7, there are three circles, two of which are centered on q and have r_q and r_q/2 as their radii, respectively. The third circle takes p as the center and dist(p, q) as the radius, where p lies in C_{q:r_q/2}, i.e., dist(q, p) ≤ r_q/2. Let p′ be an arbitrary point inside C_{p:dist(p,q)}; then it must satisfy

dist(p, p′) ≤ dist(q, p).    (29)

From the triangle inequality in △pqp′, it can be obtained that

dist(q, p′) ≤ dist(q, p) + dist(p, p′).    (30)

Then we get

dist(q, p′) ≤ 2 · dist(q, p).    (31)

Because dist(q, p) ≤ r_q/2,

dist(q, p′) ≤ 2 · r_q/2 = r_q.    (32)

That means any point lying in C_{p:dist(p,q)} must belong to kNN(q). Therefore, the number of points lying in C_{p:dist(p,q)} cannot be greater than k, i.e., the size of kNN(q), which means there are no more than k points closer to p than q. Hence p_k (the k-th closest point to p) cannot be closer to p than q. Then

dist(p, q) ≤ dist(p, p_k) = r_p.    (33)

According to Lemma 4, p ∈ RkNN(q), and Lemma 7 is proved.

Fig. 8. Semi-kNN region
Definition 14. Semi- k NN region:
Given the query point q, the semi-kNN region of q is the interior of C_{q:r_q/2}. Formally, it is denoted as SRG_kNN(q) and defined as Equation (34):

SRG_kNN(q) = { p | dist(p, q) ≤ r_q / 2 }.    (34)

As shown in Figure 8, q is the query point, the region within the larger circle represents RG_kNN(q), and the gray region within the smaller circle represents SRG_kNN(q). The points lying in the gray region can be determined to be members of RkNN(q) by Lemma 7, whereas for the points lying outside it Lemma 7 cannot decide whether they belong to RkNN(q) or not.

With Lemma 4, Lemma 5, Lemma 6 and Lemma 7, we can find all the points in the RkNN set of the query point while verifying only a small portion of the candidates with kNN queries. We combine these four lemmas into a complete RkNN verification method named CSD (Conic Section Discriminance).
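Putting the four lemmas together, a candidate can be screened by a few constant-time tests and only falls back to a kNN query when none of them applies. The following Python sketch is our own illustration of that decision order; it is not the exact Algorithm 1 given later, which additionally exploits Voronoi neighbors to locate discriminant points.

from math import dist

def csd_check(p, q, r_q, pos_refs, neg_refs):
    """Screen candidate p for membership in RkNN(q) using Lemmas 5-7.
    pos_refs: [(p_plus, r_p_plus), ...] verified RkNN members with kNN radii.
    neg_refs: [(p_minus, r_p_minus), ...] verified non-members with kNN radii.
    Returns True/False when decidable, or None when a kNN query (Lemma 4)
    is still required for this candidate."""
    if dist(p, q) <= r_q / 2:                     # Lemma 7: semi-kNN region
        return True
    for p_plus, r_plus in pos_refs:               # Lemma 5: positive discriminance
        if dist(p, q) + dist(p, p_plus) <= r_plus:
            return True
    for p_minus, r_minus in neg_refs:             # Lemma 6: negative discriminance
        if dist(p, q) - dist(p, p_minus) > r_minus:
            return False
    return None                                   # undecided: verify via Lemma 4

Candidates left undecided are verified with a kNN query as in Lemma 4 and then become new discriminant points.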
B. Selection of discriminant points

Theoretically, when using CSD to verify the candidates of an RkNN query, any point that belongs to the RkNN set can serve as a positive discriminant point; similarly, any point that is not a member of the RkNN set can serve as a negative discriminant point. In other words, all points in the candidate set are eligible to be selected as discriminant points. Our aim is to issue as few kNN queries as possible during an RkNN query, that is, to use as few discriminant points as possible to discriminate all the other points in the candidate set. Therefore, the selection of discriminant points is very important for improving the efficiency of RkNN queries. Which points should be selected as discriminant points is what we scrutinize next.
Definition 15. Discriminant set:
For an RkNN query with candidate set S_cnd, a discriminant set, denoted as S_disc, is a subset of S_cnd such that the following condition is satisfied:

∀ p ∈ S_cnd \ S_disc, ∃ p′ ∈ S_disc : p′ →disc p.    (35)

Because it is not certain how many and which points need to be selected as discriminant points, the total number of selection schemes can be as large as Σ_{i=1}^{|S_cnd|} C(|S_cnd|, i), where |S_cnd| is the number of candidates and C(·,·) denotes the binomial coefficient. Hence the computational complexity of finding the absolutely optimal scheme is as much as O(k!). However, it is not difficult to come up with a relatively good selection scheme in which the size of the discriminant set |S_disc| is only about O(√k).

For a positive discriminant point, most of the points in its discriminant region are closer to the query point than the discriminant point itself; conversely, a negative discriminant point is closer to the query point than most of the points in its own discriminant region. Therefore, a point belonging to the RkNN set can rarely be discriminated by a point closer to the query point than itself, and the probability that a point not belonging to the RkNN set can be discriminated by a point further from the query point than itself is also very low. Consequently, the points extremely close to the boundary of the RkNN region (i.e., the influence zone [5]) are rarely able to be discriminated by other points, and these points should be selected as discriminant points preferentially. However, it is impossible to directly find the points near this boundary without pre-computing the RkNN region, and computing the RkNN region is very costly, with computational complexity O(k²). In contrast, the kNN region of the query point is easy to obtain by issuing a single kNN query. Assuming the points are uniformly distributed, the kNN region and the RkNN region of a query point are very close to each other and the difference between them is negligible. Hence, to some extent, it is a good strategy to preferentially select the points near the boundary of the kNN region as discriminant points.

Fig. 9. Discriminant set

As shown in Figure 9, the region inside the circle centered on q represents the kNN region of q. In general, only the points near the boundary of RG_kNN(q) need to be selected as discriminant points, and all the other candidate points can be discriminated by them. In other words, if the points are evenly distributed, the points near the boundary of RG_kNN(q) are enough to form a valid discriminant set for q. Because the distribution of points is not guaranteed to be absolutely uniform, it is not always reliable to take only the points near the boundary of the kNN region of the query point as discriminant points for an RkNN query.

In order to ensure the reliability of the selection, we propose a strategy to dynamically construct the discriminant set while verifying the candidate points. First, the candidate points belonging to kNN(q) are accessed in descending order of distance to q. Then the other candidate points are accessed in ascending order of distance to q. During this process, once the currently accessed point cannot be discriminated by any point in the discriminant set, this point is selected as a discriminant point and put into the discriminant set.
Otherwise, we can use a corresponding point in the discriminant set to determine whether it belongs to the RkNN set or not.
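A minimal Python sketch of this verification order (our own illustration): candidates inside kNN(q) are processed from the boundary inwards, the remaining candidates from the boundary outwards, and every point left undecided by the lemmas is promoted to a discriminant point.

import numpy as np

def order_candidates(candidates, q, r_q):
    """Candidates in CSD verification order: points inside RG_kNN(q) by
    descending distance to q, then the rest by ascending distance to q."""
    pts = np.asarray(candidates, dtype=float)
    d = np.linalg.norm(pts - np.asarray(q, dtype=float), axis=1)
    inner = np.where(d <= r_q)[0]
    outer = np.where(d > r_q)[0]
    order = np.concatenate([inner[np.argsort(-d[inner])],
                            outer[np.argsort(d[outer])]])
    return pts[order]

Each point taken in this order is first screened by the lemmas; only the points that no existing discriminant point can decide are verified with a kNN query and appended to the discriminant set.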
C. Matching candidate points with discriminant points
Under the above strategy, it is sufficient to ensure that any point not belonging to S_disc can be discriminated by at least one point in S_disc. Since the expected size of S_disc is O(√k) (see Section V), the computational complexity of finding a discriminant point for a point by exhaustively searching the discriminant set is O(√k). Obviously, it is not a good idea to match candidate points with their discriminant points in this way. Therefore, we propose a method based on Voronoi diagrams to improve the efficiency of this process.

Given a Voronoi diagram VD(P) of a point set P and a continuous region RG, the vast majority of points in RG have at least one Voronoi neighbor lying in RG [18]. For any discriminant point, its discriminant region is a continuous region (an ellipse region or a hyperbola region). So for a non-discriminant point, there is a high probability that at least one of its Voronoi neighbors can discriminate it or shares a discriminant point with it. Therefore, when accessing a candidate point, if the point can be discriminated by one of its Voronoi neighbors or by the discriminant point of one of its Voronoi neighbors, we can determine whether this point belongs to the RkNN set. Otherwise, we say that this point is almost impossible to discriminate by any known discriminant point, and it should be marked as a discriminant point. Recall Lemma 2: in two dimensions, the expected number of Voronoi neighbors per point is at most 6, a constant. By using the above approach we can find a discriminant point for a non-discriminant point with a computational complexity of O(1).
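The following Python sketch illustrates this O(1) matching step in isolation (names and data structures are ours, assuming the visited set, the discriminant set and a dictionary mapping already-verified points to their discriminant points, as in Algorithm 1 below):

def find_discriminant_point(p, voronoi_neighbors, visited, disc_set, disc_of):
    """Return a candidate discriminant point for p: a visited Voronoi neighbor
    of p that is itself a discriminant point, or the recorded discriminant
    point of such a neighbor. Returns None if no such point is known."""
    for n in voronoi_neighbors[p]:
        if n not in visited:
            continue
        if n in disc_set:
            return n               # the neighbor itself is a discriminant point
        if n in disc_of:
            return disc_of[n]      # reuse the neighbor's discriminant point
    return None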
D. Algorithm

In the previous three subsections, we introduced a discriminance with conic sections (CSD) for improving the efficiency of RkNN queries and explained its principle. In this subsection, we introduce the implementation of the RkNN algorithm based on CSD.

The pseudocode for CSD is shown in Algorithm 1. When verifying a point, we first try to determine whether the point belongs to the RkNN set by Lemma 7 (line 2). If this fails, we visit the Voronoi neighbors of the point and try to discriminate it using Lemma 5 or Lemma 6 (line 10 and line 13). If none of these three lemmas applies to this point, we issue a kNN query for it and use Lemma 4 to verify it (line 18).

Using CSD, we implement an efficient RkNN algorithm, shown in Algorithm 2. First we generate the candidate set in the same way as VR-RkNN [7], where the size of the candidate set is 6k (line 1). Next, the candidate set is sorted in ascending order of distance to the query point (line 2). Then the first k elements of the candidate set and the remaining elements are divided into two groups, which are verified one by one from back to front and from front to back, respectively (line 8 and line 11). After all candidate points are verified, the RkNN set of the query point is obtained.

We use the same algorithm as VR-RkNN to generate the candidate set and do not improve it. The core of this algorithm is still that of Six-regions [2]. In addition, it uses a Voronoi diagram to find the candidate points incrementally according to Lemma 1. By Lemma 3, only the points whose Delaunay graph distance to the query point is not larger than k are eligible to be selected as candidate points; hence the number of points accessed for finding candidates is guaranteed to be no more than O(k²). The pseudocode of the candidate-generation algorithm is presented in Algorithm 3. As generating the candidate set is not the focus of our study, we do not describe Algorithm 3 in detail; see [7] for specific instructions.
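Before the theoretical analysis, here is a compact Python sketch of the overall flow described above. It mirrors Algorithm 2 only loosely, uses brute-force helpers, and assumes the hypothetical six_region_candidates and csd_check sketches from the earlier sections are in scope; every helper name is our own.

import numpy as np
from scipy.spatial import cKDTree

def csd_rknn(points, q_idx, k):
    """Sketch of the CSD-RkNN flow: 6k candidates, CSD screening, and a
    kNN query (Lemma 4) only for points promoted to discriminant points."""
    pts = np.asarray(points, dtype=float)
    q = pts[q_idx]
    tree = cKDTree(pts)
    r_q = tree.query(q, k=k + 1)[0][-1]                   # kNN radius of q
    cand = [i for i in six_region_candidates(pts, q, k) if i != q_idx]
    ordered = sorted(cand, key=lambda i: np.linalg.norm(pts[i] - q))
    ordered = ordered[:k][::-1] + ordered[k:]              # verification order of Algorithm 2
    result, pos_refs, neg_refs = [], [], []
    for i in ordered:
        p = pts[i]
        decision = csd_check(p, q, r_q, pos_refs, neg_refs)
        if decision is None:                               # fall back to Lemma 4
            r_p = tree.query(p, k=k + 1)[0][-1]
            decision = np.linalg.norm(p - q) <= r_p
            (pos_refs if decision else neg_refs).append((p, r_p))
        if decision:
            result.append(i)
    return result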
V. THEORETICAL ANALYSIS

In this section, we analyze the expected size of the discriminant set, the expected number of accessed points, and the computational complexity of CSD-RkNN. We assume that the data points are uniformly distributed in a unit space.
A. Expected size of discriminant set
Algorithm 1: CSD(p, q, k, r_q, S_v, S_disc, D_disc)
Input: the point p to be verified; the query point q; the parameter k; the kNN radius r_q of q; the set S_v of points that have been visited; the discriminant set S_disc; the dictionary D_disc that records the corresponding discriminant point of each non-discriminant point.
Output: whether p ∈ RkNN(q).
1:  S_v.add(p)
2:  if dist(p, q) ≤ r_q / 2 then                 /* Lemma 7 */
3:      return true
4:  foreach p_n ∈ VN(p) do
5:      if p_n ∈ S_v then
6:          if p_n ∈ S_disc then
7:              p_disc ← p_n
8:          else
9:              p_disc ← D_disc[p_n]
10:         if p_disc ∈ RkNN(q) and dist(p, q) + dist(p, p_disc) ≤ r_{p_disc} then   /* Lemma 5 */
11:             D_disc[p] ← p_disc
12:             return true
13:         if p_disc ∉ RkNN(q) and dist(p, q) − dist(p, p_disc) > r_{p_disc} then   /* Lemma 6 */
14:             D_disc[p] ← p_disc
15:             return false
16: r_p ← calculate the kNN radius of p
17: S_disc.add(p)
18: if r_p ≥ dist(p, q) then                      /* Lemma 4 */
19:     return true
20: else
21:     return false

Algorithm 2: CSD-RkNN(q)
Input: the query point q.
Output: RkNN(q).
1:  S_cnd ← generateCandidates(q, k)
2:  sort S_cnd in ascending order of distance to q
3:  r_q ← calculate the kNN radius of q
4:  S_v ← ∅
5:  S_disc ← ∅
6:  D_disc ← an empty dictionary
7:  S_RkNN ← ∅
8:  for i ← k downto 1 do
9:      if CSD(S_cnd[i], q, k, r_q, S_v, S_disc, D_disc) then
10:         S_RkNN.add(S_cnd[i])
11: for i ← k + 1 to 6k do
12:     if CSD(S_cnd[i], q, k, r_q, S_v, S_disc, D_disc) then
13:         S_RkNN.add(S_cnd[i])
14: return S_RkNN

Algorithm 3: generateCandidates(q, k)
Input: the query point q and the parameter k.
Output: the candidates of RkNN(q).
1:  H ← MinHeap(); Visited ← ∅
2:  for i ← 1 to 6 do
3:      S_cnd[i] ← MinHeap()
4:  foreach p ∈ VN(q) do
5:      H.push([1, p]); Visited.add(p)
6:  while |H| > 0 do
7:      [dist_DG(p), p] ← H.pop()
8:      for i ← 1 to 6 do
9:          if Segment_i contains p then
10:             if |S_cnd[i]| > 0 then
11:                 p_n ← the last point in S_cnd[i]
12:             else
13:                 p_n ← a point infinitely far from q
14:             if dist_DG(p) ≤ k and dist(q, p) ≤ dist(q, p_n) then
15:                 S_cnd[i].push([dist(p, q), p])
16:                 foreach p′ ∈ VN(p) do
17:                     if p′ ∉ Visited then
18:                         dist_DG(p′) ← dist_DG(p) + 1
19:                         H.push([dist_DG(p′), p′])
20:                         Visited.add(p′)
21: Candidates ← ∅
22: for i ← 1 to 6 do
23:     for j ← 1 to k do
24:         Candidates.add(S_cnd[i].pop())
25: return Candidates

The query point is q, the number of points in RkNN(q) is |RkNN|, and the number of points near the boundary of RG_kNN(q) is |S_b|. The area and the circumference (total length of the boundary) of RG_kNN(q) are denoted as A_RkNN(q) and C_RkNN(q), respectively. The expected size of the discriminant set of q is |S_disc|.

It has been shown that the expected value of |RkNN| is k [5]. Thus, the radius of the circle approximating RG_kNN(q) is equal to r_q. Then

A_RkNN(q) = π · r_q²,    (36)
C_RkNN(q) = 2π · r_q.    (37)

The following equation can be obtained from Equation (36) and Equation (37):

C_RkNN(q) = 2√(π · A_RkNN(q)).    (38)

The points around the boundary of RG_kNN(q) consist of two sets, one lying inside RG_kNN(q) and the other outside, and the size of each set is to |RkNN| what C_RkNN(q) is to A_RkNN(q); i.e.,

|S_b| = 2 · 2√(π · |RkNN|) = 4√(π · k) ≈ 7.1√k.    (39)

If all the points near the boundary were selected as discriminant points, there would be some redundancy, i.e., the discriminant regions of some points would overlap.
Hence the size of the discriminant set generated under our strategy is less than the number of points near the boundary of the RkNN region, i.e., |S_disc| ≤ 7.1√k.

B. Expected number of accessed points
For an RkNN query of q, the candidate points are distributed in an approximately circular region RG_cnd(q) centered around q, with area A_cnd(q) and circumference C_cnd(q). The expected number of accessed points is |S_ac|. In the filtering phase of CSD-RkNN, the accessed points include all the candidate points and their Voronoi neighbors. Apart from the points in the candidate set, the other accessed points are distributed outside RG_cnd(q) and adjacent to its boundary. Hence |S_ac| − |S_cnd| is to |S_cnd| what C_cnd(q) is to A_cnd(q), i.e.,

|S_ac| − |S_cnd| = 2√(π · |S_cnd|),    (40)
|S_ac| = |S_cnd| + 2√(π · |S_cnd|) = 6k + 2√(π · 6k) ≈ 6k + 8.7√k.    (41)

Therefore, if the points are distributed uniformly, the expected number of accessed points is approximately 6k + 8.7√k. When the points are distributed unevenly, |S_ac| becomes larger. However, it has an upper bound: recalling Lemma 3, only the points whose Delaunay graph distance to q is not larger than k are eligible to be selected as candidate points. Then

|S_ac| ≤ Σ_{i=1}^{k} 2π · i = (k² + k)π.    (42)

C. Computational complexity
The expected computational complexity of the filtering phase of CSD-RkNN is O(k · log k) [7]. In the refining phase, a kNN query with O(k · log k) computational complexity has to be issued for each discriminant point, and the size of the discriminant set is about 7.1√k; the other candidates only need to be verified by CSD in O(1) time each. Thus, the computational complexity of the refining phase is O(k^1.5 · log k), and the overall computational complexity of CSD-RkNN is O(k^1.5 · log k).

VI. EXPERIMENTS
In the previous section, we discussed the theoretical performance of CSD-RkNN. In this section, we evaluate its performance through comparative experiments.

A. Experimental settings
In the experiments, we let VR-RkNN [7] and the state-of-the-art RkNN approach SLICE [6] be the competitors of our method.

The settings of our experimental environment are as follows. The experiments are conducted on a personal computer with Python 2.7; the CPU is an Intel Core i5-4308U 2.80 GHz and the RAM is 8 GB DDR3. To be fair, all three methods in the experiments are implemented in Python, with six partitions in the pruning phase. We use two types of experimental data sets: a simulated data set and a real data set. To decrease the error of the experiments, we repeat each experiment 30 times and report the average of the results. The query point for each run of the experiment is randomly generated.

Our experiments are designed as four sets. The first set of experiments is used to evaluate the effect of the data size on the time cost of the RkNN algorithms; the data size varies while the value of k is fixed at 200. The remaining sets are used to evaluate the effect of the value of k on the time cost, the number of verified points and the number of accessed points of the RkNN algorithms, respectively. For these three sets of experiments, the size of the simulated data is fixed, the size of the real data is 49,601, and the value of k varies from 10 to 10,000.

B. Experimental results
TABLE II
TOTAL TIME COST (IN MS) OF DIFFERENT RkNN ALGORITHMS WITH VARIOUS SIZES OF DATA SETS

Algorithm | Time (ms) for increasing data sizes
VR-RkNN   | 510 | 725 | 728 | 732
SLICE     | 232 | 397 | 438 | 441
CSD-RkNN  |  59 |  65 |  69 |  72
Fig. 10. Effect of data size on efficiency of R k NN queries
Figure 10 shows the time cost of the three RkNN algorithms with various data sizes. As we can see, when the number of points in the database is significantly larger than k, the impact of the data size on the time cost of RkNN queries is very limited. If the number of points in the database is small enough to be on the same order of magnitude as k, all points in the database become candidate points; then the smaller the database, the lower the time cost of the RkNN query. When the number of points in the database is above 10,000 and the value of k is fixed at 200, the time cost of CSD-RkNN is consistently around 84% and 90% less than that of SLICE and VR-RkNN, respectively. The detailed experimental results are presented in Table II.
TABLE III
TOTAL TIME COST (IN MS) OF RkNN QUERIES WITH VARIOUS VALUES OF k
(for each value of k: VR-RkNN, SLICE and CSD-RkNN on the simulated data and on the real data)

Fig. 11. Effect of k on efficiency of RkNN queries
Figure 11 shows the influence of k on the efficiency of the three RkNN algorithms, where sub-figures (a) and (b) show the time cost of RkNN queries on the simulated data and on the real data, respectively. As k varies from 10 to 10,000, the time cost of all three algorithms increases. On both the synthetic data and the real data, the query efficiency of CSD-RkNN is significantly higher than that of the other two competitors, and this advantage becomes more obvious as k increases. When k is 10,000, the time cost of CSD-RkNN is only about 1/10 of that of the state-of-the-art algorithm SLICE. The detailed experimental results are presented in Table III.
TABLE IV
NUMBER OF CANDIDATES VERIFIED BY RkNN ALGORITHMS WITH VARIOUS VALUES OF k
(for each value of k: VR-RkNN, SLICE and CSD-RkNN on the simulated data and on the real data)

Fig. 12. Effect of k on the number of candidates verified

Figure 12 reflects the relationship between k and the number of candidate points verified by the three algorithms in the experiments. Sub-figures (a) and (b) show the results on the simulated data and on the real data, respectively; both also show the theoretical number of verified candidate points for different values of k. During the execution of CSD-RkNN, only the points in the discriminant set are verified by issuing kNN queries, so the number of candidates verified equals the size of the discriminant set. As discussed in Section V-A, the size of the discriminant set is theoretically not larger than 7.1√k; consequently, the theoretical number of verified candidates in Figure 12 is 7.1√k. It can be seen from the figure that the actual number of verified points is slightly less than this theoretical value, which indicates that the experimental results are consistent with our analysis. It is also obvious from the figure that the number of verified candidate points of CSD-RkNN is much smaller than that of the other two algorithms. The detailed experimental results are presented in Table IV.

Fig. 13. Effect of k on the number of points accessed

Figure 13 shows the number of points accessed by the three algorithms in the experiments, together with the theoretical number of accessed points of CSD-RkNN for various values of k, which indirectly reflects their IO cost. As can be seen from sub-figure (a), the numbers of accessed points of the three algorithms are almost equal in magnitude, and so is the theoretical value for CSD-RkNN; specifically, the number of accessed points of CSD-RkNN is slightly smaller than that of SLICE. As shown in sub-figure (b), CSD-RkNN needs to access more points than SLICE. The reason is that the distribution of the real data is very uneven, and CSD-RkNN is more sensitive to the distribution of data than SLICE. Note that CSD-RkNN and VR-RkNN use the same candidate-set generation method, so they have almost the same number of accessed points. The detailed experimental results are presented in Table V.
TABLE V
NUMBER OF ACCESSED POINTS OF RkNN QUERIES WITH VARIOUS VALUES OF k
(for each value of k: VR-RkNN, SLICE and CSD-RkNN on the simulated data and on the real data)

From the above three experiments, it can be seen that RkNN query efficiency is little affected by the data size but greatly affected by the value of k. CSD-RkNN is significantly more efficient than the other algorithms because it requires far less verification of candidate points. For data sets with a very uneven distribution of points, the candidate set of CSD-RkNN is relatively large, which affects the IO cost to some extent. However, the main time cost of an RkNN query is caused by the large number of verification operations rather than by IO. Therefore, the distribution of points has little impact on the overall performance of CSD-RkNN.

VII. CONCLUSIONS AND FUTURE WORKS
In this paper, we propose CSD, a discriminant method to determine whether points belong to the RkNN set without issuing any queries with non-constant computational complexity. An efficient RkNN algorithm, named CSD-RkNN, is also implemented using CSD. Comparative experiments are conducted between CSD-RkNN and two other state-of-the-art RkNN algorithms. The experimental results show that CSD-RkNN significantly outperforms its competitors in various aspects, except that CSD-RkNN needs to access more points to generate the candidate set when the distribution of points is very uneven. However, CSD-RkNN does not require a costly verification of each candidate point, so the distribution of data has very limited impact on its overall performance.

As an efficient discriminant method for improving RkNN queries, CSD has great potential and can be further developed. Our plan for the future is to extend the application of CSD to other variants of RkNN, including multidimensional RkNN, continuous RkNN, constrained RkNN, etc.

ACKNOWLEDGMENTS
The work is supported by the National Natural Science Foundation of China (U1711267, 41572314, 41972306), the Hubei Province Innovation Group Project (2019CFA023) and the Fundamental Research Funds for the Central Universities, China University of Geosciences (Wuhan) (CUGCJ1810).
REFERENCES

[1] Flip Korn and S. Muthukrishnan. Influence sets based on reverse nearest neighbor queries. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, pages 201–212, 2000.
[2] Ioana Stanoi, Divyakant Agrawal, and Amr El Abbadi. Reverse nearest neighbor queries for dynamic databases, pages 44–53, 2000.
[3] Yufei Tao, Dimitris Papadias, and Xiang Lian. Reverse kNN search in arbitrary dimensionality. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, VLDB '04, pages 744–755. VLDB Endowment, 2004.
[4] Wei Wu, Fei Yang, Chee Yong Chan, and Kian-Lee Tan. FINCH: Evaluating reverse k-nearest-neighbor queries on location data. PVLDB, 1(1):1056–1067, 2008.
[5] Muhammad Aamir Cheema, Xuemin Lin, Wenjie Zhang, and Ying Zhang. Influence zone: Efficiently processing reverse k nearest neighbors queries. In Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany, pages 577–588, 2011.
[6] Shiyu Yang, Muhammad Aamir Cheema, Xuemin Lin, and Ying Zhang. SLICE: Reviving regions-based pruning for reverse k nearest neighbors queries. In IEEE 30th International Conference on Data Engineering, ICDE 2014, Chicago, IL, USA, March 31 - April 4, 2014, pages 760–771, 2014.
[7] Mehdi Sharifzadeh and Cyrus Shahabi. VoR-tree: R-trees with Voronoi diagrams for efficient processing of spatial nearest neighbor queries. Proc. VLDB Endow., 3(1-2):1231–1242, September 2010.
[8] Anil Maheshwari, Jan Vahrenhold, and Norbert Zeh. On reverse nearest neighbor queries. In Proceedings of the 14th Canadian Conference on Computational Geometry, University of Lethbridge, Alberta, Canada, August 12-14, 2002, pages 128–132, 2002.
[9] Congjun Yang and King-Ip Lin. An index structure for efficient reverse nearest neighbor queries. In Proceedings of the 17th International Conference on Data Engineering, April 2-6, 2001, Heidelberg, Germany, pages 485–492, 2001.
[10] Congjun Yang and King-Ip Lin. An index structure for efficient reverse nearest neighbor queries. In Proceedings of the 17th International Conference on Data Engineering, April 2-6, 2001, Heidelberg, Germany, pages 485–492, 2001.
[11] King-Ip Lin, Michael Nolen, and Congjun Yang. Applying bulk insertion techniques for dynamic reverse nearest neighbor problems, pages 290–297, 2003.
[12] Antonin Guttman. R-trees: A dynamic index structure for spatial searching. In Beatrice Yormark, editor, SIGMOD'84, Proceedings of Annual Meeting, Boston, Massachusetts, USA, June 18-21, 1984, pages 47–57. ACM Press, 1984.
[13] Gísli R. Hjaltason and Hanan Samet. Distance browsing in spatial databases. ACM Trans. Database Syst., 24(2):265–318, 1999.
[14] Dimitris Papadias, Yufei Tao, Kyriakos Mouratidis, and Chun Kit Hui. Aggregate nearest neighbor queries in spatial databases. ACM Trans. Database Syst., 30(2):529–576, 2005.
[15] Dimitris Papadias, Yufei Tao, Greg Fu, and Bernhard Seeger. Progressive skyline computation in database systems. ACM Trans. Database Syst., 30(1):41–82, 2005.
[16] Cyrus Shahabi and Mehdi Sharifzadeh. Voronoi diagrams for query processing. In Encyclopedia of GIS, pages 2446–2452. Springer, 2017.
[17] B. Delaunay. Sur la sphère vide. A la mémoire de Georges Voronoï. Bulletin de l'Académie des Sciences de l'URSS, Classe des Sciences Mathématiques et Naturelles, 6:793–800, 1934.
[18] Yang Li. Area queries based on Voronoi diagrams.