Spatial Interpolation-based Learned Index for Range and kNN Queries
Songnian Zhang
Faculty of Computer Science, University of New Brunswick
Fredericton, NB, Canada
[email protected]

Suprio Ray
Faculty of Computer Science, University of New Brunswick
Fredericton, NB, Canada
[email protected]

Rongxing Lu
Faculty of Computer Science, University of New Brunswick
Fredericton, NB, Canada
[email protected]

Yandong Zheng
Faculty of Computer Science, University of New Brunswick
Fredericton, NB, Canada
[email protected]
Abstract—A corpus of recent work has revealed that the learned index can improve query performance while reducing the storage overhead. It potentially offers an opportunity to address the spatial query processing challenges caused by the surge in location-based services. Although several learned indexes have been proposed to process spatial data, the main idea behind these approaches is to utilize existing one-dimensional learned models, which requires either converting the spatial data into one-dimensional data or applying the learned model on each dimension separately. As a result, these approaches cannot fully exploit the information about the spatial distribution of the original spatial data. To this end, in this paper, we exploit this information by using a spatial (multi-dimensional) interpolation function as the learned model, which can be directly employed on the spatial data. Specifically, we design an efficient SPatial inteRpolation functIon based Grid index (SPRIG) to process range and kNN queries. Detailed experiments are conducted on real-world datasets, and the results indicate that our proposed learned index can significantly improve performance in comparison with traditional spatial indexes and a state-of-the-art multi-dimensional learned index.
Index Terms—Learned index, Spatial interpolation function, Range query, kNN query
I. INTRODUCTION
As location-based services (LBS) have been widely deployed and have become highly popular, spatial query processing has attracted considerable interest in the research community. Although several spatial indexes, such as the R-tree and the k-d tree, have been proposed to facilitate spatial query performance, it is still challenging to process spatial queries efficiently due to the rapidly growing volume of spatial data. Recently, Kraska et al. [1] suggested substituting traditional indexes with machine-learning-based indexes (also called learned indexes). Since then, several follow-up research projects [2], [3], [4], [5], [6] have shown that the learned index can indeed improve query performance by learning data distributions and query workload patterns.

Typically, a learned index involves two main aspects: a learned model and a local search. The former is trained and used to quickly locate the approximate position of a search key, while the latter is responsible for refining it to the exact position. Since the latter can be achieved by performing a local binary or exponential search, the fundamental yet challenging problem is to find a suitable learned model and employ it as the learned index. Existing learned indexes are constructed mainly on one of two categories of learned models: machine learning models [1] and piecewise linear functions [3]. However, to the best of our knowledge, both of these learned models can only be applied to one-dimensional data. As a result, current spatial learned indexes either transform multi-dimensional data into one-dimensional data before introducing the learned model [7], [8] or apply a learned model to each dimension separately [6]. For this reason, the question arises: “
Is there a learned model that can be directly applied to spatial (multi-dimensional) data and achieve better performance?”

Aiming to address the above question, in this paper we explore how to utilize spatial (two-dimensional) interpolation functions as learned models to directly predict the position of a spatial search key. Based on this idea, we propose a SPatial inteRpolation functIon based Grid index (SPRIG) to support range and kNN queries over spatial data. In particular, we sample the spatial data to construct an adaptive grid and use the sample data as inputs to fit a spatial interpolation function. Given a spatial search key, we first use the fitted spatial interpolation function to predict the approximate position of the key. Then, around the estimated position, we conduct a local binary search to find the target key. However, this entails a new challenge: how to guarantee that the target key is in the local search range. To address this issue, we introduce an error guarantee based on the maximum estimation error, which is derived from the query workload. Furthermore, we propose efficient range and kNN query execution strategies using our proposed index. In these strategies, we take full advantage of the properties of the adaptive grid to facilitate query execution, and a pivot based filtering technique is introduced to improve kNN query performance.

We conduct extensive experiments to evaluate our learned index, SPRIG. First, we evaluate five spatial interpolation functions and choose the bilinear interpolation function, which has the best performance and estimation accuracy, as our learned model. Then, we compare SPRIG against the state-of-the-art multi-dimensional learned index Flood [6], along with a few spatial indexes. The experimental results on real-world datasets show that: 1) SPRIG outperforms the alternative spatial indexes on range queries and is competitive on kNN queries in terms of execution time.
In the best case, SPRIG is 3× faster than Flood on range queries and 9× faster than Flood on kNN queries; 2) SPRIG consumes less storage while achieving favorable execution performance compared with the traditional indexes. Our evaluations demonstrate that SPRIG can reduce the storage footprint of traditional spatial indexes by orders of magnitude.

The remainder of this paper is organized as follows. In Section II, we discuss the related work. Then, we introduce the spatial interpolation function, error guarantee, and pivot based filtering in Section III. After that, we present SPRIG in Section IV, followed by the performance evaluation in Section V. Finally, we draw our conclusion in Section VI.

II. RELATED WORK
Kraska et al. [1] presented the idea of the learned index, which is based on learning the relationship between keys and their positions in a sorted array. They adopted a machine learning based technique as the learned model and built a recursive model index (RMI), which predicts the position of a search key within a known error bound. Since then, a variety of learned indexes have been proposed to handle one-dimensional data. Recently, Tang et al. [2] proposed XIndex, a scalable learned index based on RMI, which focuses on handling concurrent writes without affecting query performance. Very differently, Galakatos et al. [3] exploited the piecewise linear function as the learned model to build FITing-tree, a data-aware index that replaces the leaf nodes of a B+-tree with learned piecewise linear functions. Unlike FITing-tree, Ferragina et al. [4] introduced PGM-index, a pure learned index that does not mix a traditional data structure with the learned model. However, their work still focuses on one-dimensional data and uses the existing linear learned model.

Naturally, the idea of the learned index has been extended to spatial and multi-dimensional data. Wang et al. [7] proposed ZM-index, a learned index for spatial queries. In that work, the authors utilized the Z-order curve to convert two-dimensional data into one-dimensional values, and then applied a machine learning model to predict a key's position in the one-dimensional data. Qi et al. [5] refined the idea of ZM-index and built a recursive spatial model index (RSMI). Before applying the Z-order curve, their work adopts a rank space-based transformation technique to mitigate the uneven-gap problem. LISA [9] is a disk-based spatial learned index that achieves low storage consumption and I/O cost. In that work, the authors used a mapping function to map spatial keys into one-dimensional values, and a monotone shard prediction function, similar to a piecewise linear function, to predict the shard id for a given mapped value. Extending to multi-dimensional data, ML-index [8] is an RMI based learned index. It first converts the multi-dimensional data into one dimension by employing the i-Distance technique. Based on the one-dimensional data,
ML-index uses the RMI to estimate the approximate position of a search key. The recently proposed index Flood [6] can also support multi-dimensional data and is very relevant to our work.

Fig. 1. System Architecture of SPRIG.

Flood adopts the FITing-tree as the building block to predict a key's position in a single dimension. By integrating the positions from the d dimensions, where d is the number of dimensions, Flood can locate the cell that covers the search key.

III. BACKGROUND
Before delving into the details of SPRIG, in this section we introduce three basic concepts: 1) Spatial Interpolation Function; 2) Error Guarantee; and 3) Pivot Based Filtering, which serve as the building blocks of the proposed index.
A. Spatial Interpolation Function
Given a set of 2-dimensional sample points {(x_i, y_i) | 1 ≤ i ≤ n} and their corresponding values {v_i = f(x_i, y_i) | 1 ≤ i ≤ n}, one can construct a spatial (two-dimensional) interpolation function F_in = f(x, y) that passes through all these sample points [10]. Afterward, given any point (x, y), it is easy to estimate the value of f(x, y) with the interpolation function. Borrowing the idea from the learned index [1], if we treat v_i as the position of the point (x_i, y_i), we can use F_in to quickly estimate the position of any given point. Moreover, if the sample points are not random but represent the distribution of the original spatial dataset, we can use fewer sample points to fit the spatial interpolation function for estimating positions. This indicates that the spatial interpolation function can learn the spatial position distribution with a low storage overhead, which fits well with the goal of the learned index. Therefore, it is feasible and promising to exploit the spatial interpolation function F_in as the learned model.

B. Error Guarantee
For a learned index, it is essential to conduct a local search after predicting the position of a given point. Generally, the local search range is [pos − eg, pos + eg], where pos is the predicted position and eg is the estimation error, also called the error guarantee. Consequently, eg is an essential concept in a learned index. Different from FITing-tree [3] (used in Flood [6]), we adopt the maximum estimation error as the error guarantee in our scheme, which is determined by a query workload W:

eg = max |F_in(q_x, q_y) − f(q_x, q_y)|,   (1)

where (q_x, q_y) ∈ W. For a spatial point, its spatial position is determined by its x and y coordinates. Therefore, in our scheme, we project the estimated spatial position onto the x dimension and the y dimension and obtain two error guarantees, i.e.,

eg_x = max |P_x(F_in(q_x, q_y)) − P_x(f(q_x, q_y))|;
eg_y = max |P_y(F_in(q_x, q_y)) − P_y(f(q_x, q_y))|,

where P_x(·) and P_y(·) project a spatial position onto the x dimension and the y dimension, respectively.

C. Pivot Based Filtering
Assume that a set D contains n spatial points, D = {p_i = (x_i, y_i) | 1 ≤ i ≤ n}. Also, we define a distance-based range query Q_c = (q_c, r), where q_c is a point and r is a radius. Launching a query Q_c over D means finding the points in D that satisfy d(p_i, q_c) ≤ r, where d(·) is the Euclidean distance between two spatial points. The intuitive solution is to scan and check the points in D one by one; however, it is inefficient. Although we can build tree based index structures, such as the k-d tree and the M-tree [11], to speed up query processing, they incur much extra storage. In order to improve the performance of Q_c without introducing too much storage overhead, we adopt a pivot based filtering technique. The key idea is to select a virtual pivot p_v for D and calculate the distance d(p_i, p_v) for each point p_i. Then, D is sorted according to the corresponding d(p_i, p_v). When performing a query Q_c over D, we only need to calculate the distance d(q_c, p_v) and check the points whose distances d(p_i, p_v) lie in [d(q_c, p_v) − r, d(q_c, p_v) + r], instead of all points in D. This optimization is based on the triangle inequality, and its correctness follows from:

|d(p_i, p_v) − d(q_c, p_v)| ≤ d(p_i, q_c) ≤ r
⇒ −r ≤ d(p_i, p_v) − d(q_c, p_v) ≤ r
⇒ d(q_c, p_v) − r ≤ d(p_i, p_v) ≤ d(q_c, p_v) + r.

The above inequality indicates that if p_i falls within Q_c, it must satisfy d(p_i, p_v) ∈ [d(q_c, p_v) − r, d(q_c, p_v) + r]. Thus, we can narrow down the scan range from the whole dataset D to the points whose distances d(p_i, p_v) lie in [d(q_c, p_v) − r, d(q_c, p_v) + r].

IV. OUR PROPOSED INDEX - SPRIG

In this section, we present the details of our proposed index, SPRIG. Fig. 1 depicts the system architecture of our index, which comprises two parts: index building and query processing. In the following, we first discuss how to build the learned index.
Then, we describe the detailed query processing on our index.
A. Index Building
Our SPRIG mainly consists of three components: 1) an n × m grid layout G_{n×m}, where n is the number of columns along the x dimension and m is the number along the y dimension; 2) a table T; and 3) a learned model based on a spatial interpolation function F_in, as shown in Fig. 1. Some of the parameters used by our index are further discussed in Section IV-C. To build the n × m grid layout, we first find n − 1 boundaries on the x dimension to generate n columns of non-equal size, and add these boundaries into a set B_x. Our goal is to make the data records evenly distributed across columns, i.e., each column contains a roughly equal number of records (in this paper, we use "point" and "record" interchangeably). For the y dimension, there will be m − 1 boundaries in a set B_y. We denote the minimum and maximum values in the x dimension as {x_min, x_max}, and those in the y dimension as {y_min, y_max}. After adding {x_min, x_max} into B_x and {y_min, y_max} into B_y, we separately sort B_x and B_y in increasing order, so that |B_x| = n + 1 and |B_y| = m + 1. In total, there are n × m cells in the grid, and we have G_{n×m} = (B_x, B_y). Algorithm 2 shows the process of building the grid.

Next, we allocate the integers in the range [0, n × m − 1] as cell ids along the x dimension and define a 2-dimensional array C_id to index these cell ids: C_id[i][j] = j · n + i, for 0 ≤ i < n and 0 ≤ j < m. Afterward, we build a table T to map each cell id to the records it covers, in which the key is the cell id and the value is a pair (firstAddress, size) indicating the pointer to the first record and the number of records in the cell.
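The grid construction and cell-id layout described above can be sketched as follows. This is a simplified, quantile-style sketch, not the paper's exact Algorithm 2 (which counts duplicate coordinate values via sorted maps); the function names are illustrative.

```python
def build_boundaries(coords, parts):
    """Split one dimension into `parts` ranges holding roughly equal
    numbers of records; returns the parts + 1 sorted boundaries."""
    coords = sorted(coords)
    step = len(coords) / parts
    bounds = [coords[0]]                      # the dimension's minimum
    for k in range(1, parts):
        bounds.append(coords[int(k * step)])  # quantile-style boundary
    bounds.append(coords[-1])                 # the dimension's maximum
    return bounds

def build_grid(points, n, m):
    """Build the adaptive grid layout G_{n×m} = (Bx, By) from 2-D points."""
    Bx = build_boundaries([x for x, _ in points], n)
    By = build_boundaries([y for _, y in points], m)
    return Bx, By

def cell_id(i, j, n):
    """Row-major cell id, mirroring C_id[i][j] = j * n + i."""
    return j * n + i
```

For example, 100 points whose x coordinates are 0..99 split into n = 4 columns yield boundaries at the 25th, 50th, and 75th values, so each column covers about 25 records, matching the even-distribution goal above.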
Based on G_{n×m} and C_id, we can fit a spatial interpolation function F_in. In particular, we treat {B_x, B_y} as the inputs and C_id as the desired estimation values when determining F_in, i.e., C_id ← F_in(B_x, B_y). Besides, to speed up distance-related searches, for example kNN queries, we employ the pivot based filtering technique, which requires selecting a pivot p_v in each cell and sorting the records (within a cell) according to the distances d(p_i, p_v). In this paper, we adopt k-means clustering to select the pivot for each cell. It is worth noting that this technique does not incur much extra storage beyond the pivot points.

B. Query Processing
In this paper, we focus on range and kNN queries, in which the point query is implicitly included. Next, we describe our approaches.
Range Query. For spatial data, the range query identifies the records p_i that fall within Q_r = (q_b, q_t), where q_b = [b_x, b_y] and q_t = [t_x, t_y] are the bottom-left and top-right points of the query window, respectively. The main idea of processing a range query is to convert it into locating the cell ids of q_b and q_t. Here, we take locating q_b's cell id as an example; the steps are as follows.

• Step-1. Given q_b, we use F_in to obtain a predicted cell id pid_b. It is simple to calculate the locations of pid_b in the sets B_x and B_y: l_px = pid_b mod n and l_py = pid_b / n.
• Step-2. Given a pair of error guarantees (eg_x, eg_y), by applying local binary searches on B_x and B_y, we obtain the real x and y locations of the search point, denoted as l_rx and l_ry. The search range on B_x is [l_px − eg_x, l_px + eg_x], while it is [l_py − eg_y, l_py + eg_y] on B_y.
• Step-3. Finally, we obtain the real cell id of q_b, namely rid_b = l_ry · n + l_rx.

Algorithm 1 formally outlines the above steps to obtain the real cell id of a search point.

Fig. 2. Processing queries on SPRIG. (a) Range Query, on a grid with n = 5 and the fitted F_in; assuming (eg_x = 2, eg_y = 1), we take finding rid_b as an example. (b) kNN Query. Spread the search range to the outer-layer adjacent cells. The hollow circles represent the closest points of cells A and B to q_k. Assuming that cell B has t records, we use pivot based filtering to process the records in the cell.

Algorithm 1 GetRealCellId
Input: A given point (x, y); the grid layout G_{n×m} = (B_x, B_y); a predicted cell id pid of the given point; trained error guarantees eg = (eg_x, eg_y).
Output: The real cell id rid of the given point (x, y).
  n ← |B_x| − 1
  l_px ← pid mod n; l_py ← pid / n
  l_rx ← BinarySearch(B_x, x, l_px − eg_x, l_px + eg_x)
  l_ry ← BinarySearch(B_y, y, l_py − eg_y, l_py + eg_y)
  return l_ry · n + l_rx

Similarly, we can get the real cell id rid_t of q_t. After that, the range query result can be collected with the following strategy:
1) For cells intersected by the query window Q_r, we scan the records in these cells and put the records that fall within Q_r into the result.
2) For cells contained inside the query window Q_r, we directly add all the records covered by these cells into the result.

As shown in Fig. 2(a), the intersected cells must lie on the vertical and horizontal lines through rid_b and rid_t. Therefore, it is easy to determine the intersected cells (gray area) and the contained cells (blue area).

kNN Query. In this paper, we denote the kNN query as Q_kNN = (q_k, k), where q_k is a point and k is the number of nearest neighbors. The formal definition is: ∀p_i ∈ S, ∀p_j ∈ D \ S, d(q_k, p_j) ≥ d(q_k, p_i), where S is the result set of the kNN query. For kNN queries, our solution is to locate the real cell id of the query point q_k. Then, starting from that cell, we recursively spread the search range and incrementally check the records in the outer-layer adjacent cells until the result is complete. To improve performance, we employ two pruning techniques. One is the closest-point pruning technique, which is used to determine whether a cell should be checked. A cell may have different closest points with respect to different query points. However, in our scheme, since it is simple to obtain the locations (l_rx, l_ry) of the located real cell, we can easily get the closest point of a cell, denoted as p_c, with the help of B_x and B_y. In Fig. 2(b), cell A has the closest point (B_x[l_rx], B_y[l_ry + 1]), and cell B's closest point is (B_x[l_rx + 1], q_k.x).
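As an illustration, the cell-location procedure of Algorithm 1 can be sketched in Python. The `bilinear_predict` helper below is an assumption of ours, not the paper's fitting procedure: it fits the bilinear surface only through the four corner cell ids (with row-major ids the cross term vanishes), whereas the paper fits F_in over all grid nodes.

```python
import bisect

def bilinear_predict(x, y, Bx, By):
    """Illustrative stand-in for the fitted F_in: a bilinear surface
    through the four corner cell ids of the grid."""
    n, m = len(Bx) - 1, len(By) - 1
    u = (x - Bx[0]) / (Bx[-1] - Bx[0])   # normalized x position
    v = (y - By[0]) / (By[-1] - By[0])   # normalized y position
    return round((n - 1) * u) + n * round((m - 1) * v)

def bounded_search(bounds, v, lo, hi):
    """Binary search for i with bounds[i] <= v < bounds[i+1], restricted
    to the error-guarantee window [lo, hi] (Algorithm 1's BinarySearch).
    If the true index lies outside the window, the result is clamped;
    the trained (eg_x, eg_y) are meant to prevent that case."""
    lo, hi = max(lo, 0), min(hi, len(bounds) - 2)
    i = bisect.bisect_right(bounds, v, lo, hi + 2) - 1
    return max(lo, min(i, hi))

def get_real_cell_id(x, y, pid, Bx, By, eg):
    """Refine the predicted cell id `pid` into the real cell id."""
    n = len(Bx) - 1
    lpx, lpy = pid % n, pid // n         # predicted column / row
    lrx = bounded_search(Bx, x, lpx - eg[0], lpx + eg[0])
    lry = bounded_search(By, y, lpy - eg[1], lpy + eg[1])
    return lry * n + lrx
```

For example, with Bx = [0, 2, 5, 9, 10] and By = [0, 3, 6, 10] (a 4 × 3 grid), the point (6, 4) lies in column 2 and row 1, so its real cell id is 1 · 4 + 2 = 6; the bounded search only has to inspect the few boundaries inside the error window rather than all of B_x and B_y.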
The other pruning technique is the pivot based filtering, which can filter out records in a cell by the principle of the triangle inequality.

Algorithm 2 Build Grid
Input: A spatial dataset D; the number of columns along the x dimension, n; the number of columns along the y dimension, m.
Output: A grid layout G_{n×m} = (B_x, B_y).
  mapX ← ∅; mapY ← ∅
  for each entry in D do
    cntX ← (mapX.get(entry.x) is ∅) ? 1 : mapX.get(entry.x) + 1
    mapX.put(entry.x, cntX)
    cntY ← (mapY.get(entry.y) is ∅) ? 1 : mapY.get(entry.y) + 1
    mapY.put(entry.y, cntY)
  (x_min, y_min) ← findMin(D); (x_max, y_max) ← findMax(D)
  mapX.sortByKey(); mapY.sortByKey()
  avgX ← D.size / n; avgY ← D.size / m
  B_x ← getBoundary(mapX, avgX, x_min, x_max)
  B_y ← getBoundary(mapY, avgY, y_min, y_max)
  return (B_x, B_y)
  function getBoundary(map, avg, min, max)
    B ← ∅; cnt ← 0; pre ← 0
    B.add(min)
    for each entry in map do
      singleCnt ← entry.value
      if singleCnt > avg then
        B.add(entry.key + pre)
        pre ← entry.key
        cnt ← 0
        continue
      cnt ← cnt + singleCnt
      if cnt > avg then
        B.add(entry.key + pre)
        cnt ← 0
      else
        pre ← entry.key
    B.add(max)
    return B

By employing the spreading outwards strategy, closest-point pruning, and pivot based filtering, our kNN solution works as follows and is formally depicted in Algorithm 3:

• Step-1. Locate the real cell of q_k. With F_in and a local binary search, we obtain the real cell id rid of the query point q_k, as shown in lines 1 and 2.
• Step-2. Calculate the distances from q_k to the borders of the located cell. We denote them as c_t, c_b, c_l, and c_r, as shown in Fig. 2(b). Then, we obtain a radius r = min(c_t, c_b, c_l, c_r). See lines 6-8 for this step.

Algorithm 3 kNN Query
Input: A kNN query Q_kNN = (q_k, k); our index, G_{n×m} = (B_x, B_y), T, and F_in; trained error guarantees eg = (eg_x, eg_y).
Output: A priority queue queue containing the k closest points.
1: pid ← F_in(q_k.x, q_k.y)
2: rid ← getRealCellId(q_k.x, q_k.y, pid, B_x, B_y, eg)
3: l_rx ← rid mod n; l_ry ← rid / n
4: e_c ← 0; k_cnt ← 0; queue ← ∅
5: while k_cnt < k do
6:   c_b ← q_k.y − B_y[l_ry − e_c]; c_t ← B_y[l_ry + 1 + e_c] − q_k.y
7:   c_l ← q_k.x − B_x[l_rx − e_c]; c_r ← B_x[l_rx + 1 + e_c] − q_k.x
8:   r ← min(c_b, c_t, c_l, c_r)
9:   if e_c > 0 then
10:     cells ← collectAdjacent(q_k, G_{n×m}, T, e_c)
11:     for each cell in cells do
12:       if d(q_k, cell.p_c) < queue.peek().dist then
13:         records ← PivotFilter(cell, q_k, queue)
14:         for each entry in records do
15:           putQueue(entry, queue, q_k, k_cnt, r, k)
16:   else
17:     for each entry in T.get(rid) do
18:       putQueue(entry, queue, q_k, k_cnt, r, k)
19:   e_c ← e_c + 1
20: end while
21: function putQueue(entry, queue, q_k, k_cnt, r, k)
22:   dist ← getDistance(q_k, entry)
23:   if queue.size < k then
24:     queue.offer(entry)
25:     if dist ≤ r then
26:       k_cnt ← k_cnt + 1
27:   else if queue.peek().dist > dist then
28:     queue.poll()
29:     queue.offer(entry)
30:     if dist ≤ r then
31:       k_cnt ← k_cnt + 1

• Step-3. Scan the records in the cell. After filtering the records by the pivot based filtering technique, we put a record p_i into a priority queue if the queue's size is less than k or d(p_i, q_k) < σ, where σ is the distance between queue.peek() and q_k. Besides, we use a counter k_cnt to record the result size and increase it if d(p_i, q_k) ≤ r. In Algorithm 3, we use a separate function putQueue() (lines 21-31) to show the details of this step.
• Step-4. If k_cnt < k, we expand the search range and calculate a new radius r = min(c_t, c_b, c_l, c_r), where c_t, c_b, c_l, and c_r are the distances from q_k to the borders of the outer-layer adjacent cells. For each adjacent cell, we first check whether d(p_c, q_k) > σ. If yes, we skip the cell; otherwise, we perform Step-3. As shown in Fig. 2(b), we would skip cell A and further process cell B.

We repeat Step-4 and Step-3 until k_cnt ≥ k. Eventually, the priority queue holds the result of the kNN query. Note that, since the collectAdjacent() and PivotFilter() functions are straightforward, we do not present them in Algorithm 3 due to the limited space.
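The pivot based filtering at the heart of the per-cell scan (the PivotFilter() step above) can be sketched as follows; the function signature is illustrative, with the per-record pivot distances assumed to be precomputed at index-building time as described in Section IV-A.

```python
import math

def pivot_filter(records, pivot_dists, pivot, q, r):
    """Pivot based filtering (Section III-C): keep only records whose
    precomputed distance to the cell pivot falls inside
    [d(q, pv) - r, d(q, pv) + r]. By the triangle inequality,
    |d(p, pv) - d(q, pv)| <= d(p, q), so no point within distance r
    of q is ever discarded (false positives remain possible)."""
    dq = math.dist(q, pivot)
    lo, hi = dq - r, dq + r
    return [p for p, dp in zip(records, pivot_dists) if lo <= dp <= hi]
```

Since the records in a cell are stored sorted by their pivot distance, the surviving window [d(q, p_v) − r, d(q, p_v) + r] can in practice be located with two binary searches instead of the linear scan shown here.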
C. Cost Model
In this section, we build a cost model, which can be used to determine the number of columns in the x dimension and the y dimension, i.e., the values of n and m. Motivated by [6], we model the execution time of performing a query over our index. Here, we take the range query workload as an example. From Section IV-B, we know that the query time of a range query consists of four parts: 1) prediction; 2) local binary search; 3) retrieving cells; and 4) scanning and checking the data points in the intersected cells. Clearly, different n and m will generate different B_x and B_y; consequently, the execution times of F_in and the local binary search will be affected. For ease of description, we denote the execution time of the spatial interpolation function as T(F_in^{n×m}) and that of the local binary search as T(B^{n×m}). Assume there are N_i intersected cells and N_c contained cells falling within the range query Q_r. Meanwhile, we define N_p as the number of data points in the intersected cells. Through simulations, we can obtain the average times of retrieving a cell and scanning a data point, denoted as T_r and T_s, respectively. Putting these four parts together, the time cost of performing a range query is modeled as:

Time = T(F_in^{n×m}) + T(B^{n×m}) + T_r · (N_i + N_c) + T_s · N_p.

Given a dataset D and a query workload W, where Q_r ∈ W, we seek the layout parameters n × m that minimize the average value of Time.
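The layout search implied by the cost model can be sketched as below. This is a hedged illustration: the per-layout timings and per-query cell counts are assumed to have been measured (or simulated) offline for each candidate layout, as the paper describes.

```python
def range_query_time(t_fin, t_bin, t_r, t_s, n_i, n_c, n_p):
    """Modeled cost of one range query:
    Time = T(F_in) + T(B) + T_r * (N_i + N_c) + T_s * N_p."""
    return t_fin + t_bin + t_r * (n_i + n_c) + t_s * n_p

def pick_layout(stats):
    """Pick the (n, m) layout with the smallest average modeled time.
    `stats` maps each candidate layout to a list of per-query cost
    parameter tuples measured under the workload W."""
    def avg(nm):
        qs = stats[nm]
        return sum(range_query_time(*q) for q in qs) / len(qs)
    return min(stats, key=avg)
```

A coarser grid lowers the prediction and search terms but inflates the scan term N_p, while a finer grid does the opposite; the minimization balances the two.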
V. EVALUATION
In this section, we experimentally evaluate the performance of our index and compare it with alternative schemes in processing range and kNN queries. Specifically, we first explore the efficiency and accuracy of typical spatial interpolation functions. Then, we compare SPRIG with traditional indexes and Flood [6], a recently proposed multi-dimensional learned index, in terms of range and kNN query time and storage overhead. We implemented all indexes in Java and evaluated their in-memory versions. To analyze the impact of dataset size on index performance, we adopt three Twitter datasets [12] consisting of tweets with their locations:
Tweet200k, Tweet2M, and Tweet20M, which have 200k, 2M, and 20M spatial points, respectively. All experiments are conducted on a machine with 16 GB memory and a 3.4 GHz Intel(R) Core(TM) i7-3770 processor, running Ubuntu 16.04.
Efficiency and Accuracy of Spatial Interpolation Functions. We evaluate five spatial (two-dimensional) interpolation functions on adaptive grids: bilinear interpolation, bicubic interpolation, piecewise bicubic interpolation, Shepard interpolation, and radial basis function (RBF) interpolation [13]. Given a query workload, we evaluate the average execution time and maximum estimation errors of these five functions while varying the grid layout from 10 × 10 to 200 × 200. Note that, for ease of comparison, we adopt eg (Eq. (1)) instead of eg_x and eg_y to evaluate the accuracy of F_in. All of these interpolation functions are evaluated on Tweet200k. Fig. 5(a) shows the average execution time for three interpolation functions, bilinear, bicubic, and piecewise bicubic, while Fig. 5(b) presents their accuracy. We exclude the Shepard and RBF interpolation functions from the figures because they have much larger execution times and estimation errors; we still list their average execution times and estimation errors in Table I. From Fig. 5(a), Fig. 5(b), and Table I, we can see that the bilinear interpolation function has the best efficiency and accuracy across all grid layouts. Therefore, in the following comparisons, we apply the bilinear interpolation function in our index. Interestingly, the maximum estimation error of the bilinear interpolation function is always n + 1, which makes it quite suitable for reducing the execution time of the local search.

Fig. 3. Average query time of range queries over different datasets.

Fig. 4. Average query time of kNN queries over different datasets. Note that the average query time of Flood is shown separately in Table II due to its significant execution time.

TABLE I: SHEPARD AND RBF INTERPOLATION FUNCTIONS

Metrics             | Function | 10×10 | 20×20 | 50×50  | 100×100 | 200×200
Execution time (ms) | Shepard  | 6.7   | 25.8  | 161.2  | 646.3   | 2595.7
Execution time (ms) | RBF      | 6.1   | 10.3  | 60.9   | 243.3   | Out of memory
Estimation Error    | Shepard  | 60    | 223   | 1322   | 5195    | 20587
Estimation Error    | RBF      | 1815  | 6497  | 30813  | 116772  | Out of memory

Fig. 5. Spatial Interpolation Functions. (a) Average Execution Time. (b) Maximum Estimation Error.

Comparison on Range Query.
For range queries, we compare our proposed index with other multi-dimensional indexes that support range queries over the aforementioned three datasets. We choose the R-tree, the k-d tree, and Flood as the competing indexes: the first two are representative traditional indexes, and the third is a state-of-the-art learned index similar to our work. In addition, we consider five sets of queries with different selectivities {0.1%, 0.5%, 1.0%, 1.5%, 2.0%}. Aggregating these query sets into one large query workload, we tune our index with the cost model (see Section IV-C) to obtain the best grid layout. To be fair, we also train Flood with the approach proposed in [6], which obtains its best layout and error guarantee. Similarly, the traditional indexes, R-tree and k-d tree, are tuned to obtain their best parameter, i.e., the maximum number of children per node. Therefore, all indexes in the comparison are evaluated with their best parameters. From Fig. 3, we can see that SPRIG is always significantly faster than the traditional indexes at all selectivities, and is more advantageous on larger datasets. For example, on the Tweet20M dataset (Fig. 3(c)), our index achieves up to an order of magnitude better performance than the k-d tree for range queries. Besides, our index outperforms Flood in most cases. This benefit comes from the fact that Flood only learns the distribution of one dimension, while SPRIG learns the spatial distribution, which allows us to do more fine-grained filtering.
Comparison on kNN Query.
For kNN queries, we replace R-tree with M-tree [11], because R-tree is much slower than the other indexes for kNN queries, while M-tree is a typical kNN-supporting index. Since Flood does not provide any detail about how to deal with kNN queries, we implement it with a kNN search strategy similar to that of our index. In the kNN query comparison, we consider k = {4, 8, 16, 32, 64}. As with the range queries, we tune all the competing indexes to obtain their best parameters. From Fig. 4, we can see that M-tree is more expensive than our index when executing kNN queries. Our index also outperforms Flood on all datasets. Since Flood is much slower than the other indexes on the Tweet20M dataset, we exclude it from Fig. 4(c) and list its performance separately.

TABLE II
AVERAGE QUERY TIME OF FLOOD ON Tweet20M

k                  4     8     16    32    64
Query Time (ms)    4590  4796  4818  4778  5073

As shown in Fig. 4 and Table II, our index is around 2x, 5x, and 9x faster than Flood on the Tweet200k, Tweet2M, and Tweet20M datasets, respectively. Compared with k-d tree, our index is at least 2x faster on the Tweet200k dataset and slightly faster on the Tweet2M dataset. On the Tweet20M dataset, although our index still has a large advantage over M-tree and Flood in query performance, it is not as fast as k-d tree. However, k-d tree requires a significantly higher storage footprint to achieve such query performance. We compare the storage overheads of these indexes in the next section.
Storage Overhead.
Due to limited space, we only compare the storage overheads of the different indexes on the Tweet20M dataset; the relationship between the indexes is similar on the other two datasets. In the range and kNN query comparisons, we use the corresponding query workload to tune these indexes for their best parameters. For example, our index's best grid layout for the range query workload differs from that for the kNN query workload. Since different parameters of one index may lead to different storage overheads, we compare the storage overheads of these indexes on the range and kNN query workloads separately, as shown in Table III and Table IV, respectively. To obtain the storage footprint of a tree index, we evaluate the necessary storage of one node, e.g., the minimum bounding rectangle (MBR) of an R-tree node, and count the total number of internal nodes by traversing the tree. Flood has two components: a FITing-tree on one dimension and a table that maps each cell to the covered data records. Our index has three components: a grid layout G_{n x m}, a table T for managing the cells, and a spatial interpolation function F_in. The storage consumption of these two learned indexes depends on their grid layouts. Table III and Table IV show that both learned indexes have lower storage overheads than the traditional indexes on the range and kNN query workloads. Although our index consumes more storage than Flood, it achieves better query performance with an acceptable storage overhead. Recall that in the kNN query time comparison, our index is slower than k-d tree on Tweet20M. However, as shown in Table IV, the storage overhead of k-d tree is three orders of magnitude larger than that of our index, which makes it challenging to use in practice. Thus, our index is more practical than k-d tree on big datasets.

TABLE III
STORAGE OF INDEXES ON RANGE QUERY WORKLOAD

Index                    R-tree  k-d tree  Flood  SPRIG
Storage Overhead (MB)    7.95    305.17    0.11   6.30

TABLE IV
STORAGE OF INDEXES ON kNN QUERY WORKLOAD

Index                    M-tree  k-d tree  Flood  SPRIG
Storage Overhead (MB)    65.89   305.17    0.05   0.60

VI. CONCLUSION
In this paper, we have proposed a new learned model that can learn the spatial distribution of spatial data directly. Based on this learned model, we have built a novel learned spatial index, SPRIG, and designed range and kNN query execution strategies over it. Our experimental results suggest that 1) the bilinear interpolation function is the best option for the spatial learned model compared with other spatial interpolation functions; and 2) our index SPRIG is efficient with a relatively small storage footprint. In future work, we expect to further reduce the average query time of kNN queries on big datasets and to make our index more flexible.

REFERENCES

[1] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis, "The case for learned index structures," in SIGMOD, 2018, pp. 489-504.
[2] C. Tang, Y. Wang, Z. Dong, G. Hu, Z. Wang, M. Wang, and H. Chen, "XIndex: a scalable learned index for multicore data storage," in SIGPLAN, 2020, pp. 308-320.
[3] A. Galakatos, M. Markovitch, C. Binnig, R. Fonseca, and T. Kraska, "FITing-tree: A data-aware index structure," in SIGMOD, 2019, pp. 1189-1206.
[4] P. Ferragina and G. Vinciguerra, "The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds," VLDB, vol. 13, no. 8, pp. 1162-1175, 2020.
[5] J. Qi, G. Liu, C. S. Jensen, and L. Kulik, "Effectively learning spatial indices," VLDB, vol. 13, no. 12, pp. 2341-2354, 2020.
[6] V. Nathan, J. Ding, M. Alizadeh, and T. Kraska, "Learning multi-dimensional indexes," in SIGMOD, 2020, pp. 985-1000.
[7] H. Wang, X. Fu, J. Xu, and H. Lu, "Learned index for spatial queries," in . IEEE, 2019, pp. 569-574.
[8] A. Davitkova, E. Milchevski, and S. Michel, "The ML-index: A multidimensional, learned index for point, range, and nearest-neighbor queries," in EDBT, 2020, pp. 407-410.
[9] P. Li, H. Lu, Q. Zheng, L. Yang, and G. Pan, "LISA: A learned index structure for spatial data," in SIGMOD, 2020, pp. 2119-2133.
[10] L. Mitas and H. Mitasova, "Spatial interpolation," Geographical Information Systems: Principles, Techniques, Management and Applications, 1999.
[11] P. Ciaccia, M. Patella, and P. Zezula, "M-tree: An efficient access method for similarity search in metric spaces," in VLDB, 1997, pp. 426-435.
[12] https://developer.twitter.com/en, 2018.
[13] D. E. Myers, "Spatial interpolation: an overview,"