Spatial Interpolation-based Learned Index for Range and kNN Queries
Songnian Zhang
Faculty of Computer Science, University of New Brunswick
Fredericton, NB, Canada
[email protected]

Suprio Ray
Faculty of Computer Science, University of New Brunswick
Fredericton, NB, Canada
[email protected]

Rongxing Lu
Faculty of Computer Science, University of New Brunswick
Fredericton, NB, Canada
[email protected]

Yandong Zheng
Faculty of Computer Science, University of New Brunswick
Fredericton, NB, Canada
[email protected]
Abstract—A corpus of recent work has revealed that the learned index can improve query performance while reducing the storage overhead. It potentially offers an opportunity to address the spatial query processing challenges caused by the surge in location-based services. Although several learned indexes have been proposed to process spatial data, the main idea behind these approaches is to utilize existing one-dimensional learned models, which requires either converting the spatial data into one-dimensional data or applying the learned model on each dimension separately. As a result, these approaches cannot fully exploit the information about the spatial distribution of the original spatial data. To this end, in this paper, we exploit this information by using a spatial (multi-dimensional) interpolation function as the learned model, which can be directly employed on the spatial data. Specifically, we design an efficient SPatial inteRpolation functIon based Grid index (SPRIG) to process range and kNN queries. Detailed experiments are conducted on real-world datasets, and the results indicate that our proposed learned index can significantly improve performance in comparison with traditional spatial indexes and a state-of-the-art multi-dimensional learned index.
Index Terms—Learned index, Spatial interpolation function, Range query, kNN query
I. INTRODUCTION
As location-based services (LBS) have been widely deployed and have become highly popular, spatial query processing has attracted considerable interest in the research community. Although several spatial indexes, such as the R-tree and the k-d tree, have been proposed to facilitate spatial query performance, it is still challenging to process spatial queries efficiently due to the rapidly growing volume of spatial data. Recently, Kraska et al. [1] suggested substituting traditional indexes with machine-learning-based indexes (also called learned indexes). Since then, several follow-up research projects [2], [3], [4], [5], [6] have shown that the learned index can indeed improve query performance by learning data distributions and query workload patterns.

Typically, a learned index involves two main aspects: a learned model and a local search. The former is trained and used to quickly locate the approximate position of a search key, while the latter is responsible for refining it to the exact position. Since the latter can be achieved by performing a local binary or exponential search, the fundamental yet challenging problem is to find a suitable learned model and employ it as the learned index. Existing learned indexes are constructed mainly on one of two categories of learned models: machine learning models [1] and piecewise linear functions [3]. However, to the best of our knowledge, both of these learned models can only be applied to one-dimensional data. As a result, current spatial learned indexes either transform multi-dimensional data into one-dimensional data before introducing the learned model [7], [8] or apply a learned model to each dimension separately [6]. For this reason, the question arises: “
Is there a learned model that can be directly applied to spatial (multi-dimensional) data and achieve better performance?”

Aiming to address the above question, in this paper we explore how to utilize spatial (two-dimensional) interpolation functions as learned models to directly predict the position of a spatial search key. Based on this idea, we propose a SPatial inteRpolation functIon based Grid index (SPRIG) to support range and kNN queries over spatial data. In particular, we sample the spatial data to construct an adaptive grid and use the sample data as inputs to fit a spatial interpolation function. Given a spatial search key, we first use the fitted spatial interpolation function to predict the approximate position of the key. Then, around the estimated position, we conduct a local binary search to find the target key. However, this entails a new challenge: how to guarantee that the target key is in the local search range. To address this issue, we introduce an error guarantee based on the maximum estimation error, which is derived from the query workload. Furthermore, we propose efficient range and kNN query execution strategies using our proposed index. In these strategies, we take full advantage of the properties of the adaptive grid to facilitate query execution, and a pivot based filtering technique is introduced to improve kNN query performance.

We conduct extensive experiments to evaluate our learned index, SPRIG. First, we evaluate five spatial interpolation functions and choose the bilinear interpolation function, which has the best performance and estimation accuracy, as our learned model. Then, we compare SPRIG against the state-of-the-art multi-dimensional learned index Flood [6], along with a few spatial indexes. The experimental results on real-world datasets show that: 1) SPRIG outperforms the alternative spatial indexes on range queries and is competitive on kNN queries in terms of execution time.
In the best case, SPRIG is 3× faster than Flood on range queries and 9× faster than Flood on kNN queries; 2) SPRIG consumes less storage while achieving favorable execution performance compared with the traditional indexes. Our evaluations demonstrate that SPRIG can reduce the storage footprint of traditional spatial indexes by orders of magnitude.

The remainder of this paper is organized as follows. In Section II, we discuss the related work. Then, we introduce the spatial interpolation function, error guarantee, and pivot based filtering in Section III. After that, we present SPRIG in Section IV, followed by the performance evaluation in Section V. Finally, we draw our conclusion in Section VI.

II. RELATED WORK
Kraska et al. [1] presented the idea of the learned index, which is based on learning the relationship between keys and their positions in a sorted array. They adopted a machine learning based technique as the learned model and built a recursive model index (RMI), which predicts the position of a search key within a known error bound. Since then, a variety of learned indexes have been proposed to handle one-dimensional data. Recently, Tang et al. [2] proposed XIndex, a scalable learned index based on RMI, which focuses on handling concurrent writes without affecting query performance. Very differently, Galakatos et al. [3] exploited the piecewise linear function as the learned model to build FITing-tree, a data-aware index that replaces the leaf nodes of a B+-tree with learned piecewise linear functions. Unlike FITing-tree, Ferragina et al. [4] introduced PGM-index, a pure learned index that does not mix a traditional data structure with the learned model. However, their work still focuses on one-dimensional data and uses the existing linear learned model.

Naturally, the idea of the learned index has been extended to spatial and multi-dimensional data. Wang et al. [7] proposed ZM-index, a learned index for spatial queries. In that work, the authors utilized the Z-order curve to convert two-dimensional data into one-dimensional values, and then applied a machine learning model to predict a key's position in the one-dimensional data. Qi et al. [5] refined the idea of ZM-index and built a recursive spatial model index (RSMI). Before applying the Z-order curve, their work adopts a rank space-based transformation technique to mitigate the uneven-gap problem. LISA [9] is a disk-based spatial learned index that achieves low storage consumption and I/O cost. In that work, the authors used a mapping function to map spatial keys into one-dimensional values, and a monotone shard prediction function, similar to a piecewise linear function, to predict the shard id for a given mapped value. Extending to multi-dimensional data, ML-index [8] is an RMI based learned index. It first converts the multi-dimensional data into one dimension by employing the i-Distance technique. Based on the one-dimensional data,
ML-index uses the RMI to estimate the approximate position of a search key. The recently proposed index Flood [6] can also support multi-dimensional data and is very relevant to our work.

Fig. 1. System Architecture of SPRIG.

Flood adopts the FITing-tree as the building block to predict a key's position in a single dimension. By integrating the positions from the d dimensions, where d is the number of dimensions, Flood can locate the cell that covers the search key.

III. BACKGROUND
Before delving into the details of SPRIG, in this section we introduce three basic concepts: 1) Spatial Interpolation Function; 2) Error Guarantee; and 3) Pivot Based Filtering, which serve as the building blocks of the proposed index.
A. Spatial Interpolation Function
Given a set of 2-dimensional sample points {(x_i, y_i) | 1 ≤ i ≤ n} and their corresponding values {v_i = f(x_i, y_i) | 1 ≤ i ≤ n}, one can construct a spatial (two-dimensional) interpolation function F_in = f(x, y) that passes through all these sample points [10]. Afterward, given any point (x, y), it is easy to estimate the value of f(x, y) with the interpolation function. Borrowing the idea from the learned index [1], if we treat v_i as the position of the point (x_i, y_i), we can use F_in to quickly estimate the position of any given point. Moreover, if the sample points are not random but represent the distribution of the original spatial dataset, we can use fewer sample points to fit the spatial interpolation function for estimating positions. This indicates that the spatial interpolation function can learn the spatial position distribution with a low storage overhead, which fits well with the goal of the learned index. Therefore, it is feasible and promising to exploit the spatial interpolation function F_in as the learned model.

B. Error Guarantee
For a learned index, it is essential to conduct a local search after predicting the position of a given point. Generally, the local search range is [pos − eg, pos + eg], where pos is the predicted position and eg is the estimation error, also called the error guarantee. Consequently, eg is an essential concept in a learned index. Different from FITing-tree [3] (used in Flood [6]), we adopt the maximum estimation error as the error guarantee in our scheme, which is determined by a query workload W:

eg = max |F_in(q_x, q_y) − f(q_x, q_y)|,   (1)

where (q_x, q_y) ∈ W. For a spatial point, its spatial position is determined by its x and y coordinates. Therefore, in our scheme, we project the estimated spatial position onto the x dimension and the y dimension and obtain two error guarantees, i.e.,

eg_x = max |P_x(F_in(q_x, q_y)) − P_x(f(q_x, q_y))|;
eg_y = max |P_y(F_in(q_x, q_y)) − P_y(f(q_x, q_y))|,

where P_x(·) and P_y(·) project a spatial position onto the x dimension and the y dimension, respectively.

C. Pivot Based Filtering
Assume that a set D contains n spatial points, D = {p_i = (x_i, y_i) | 1 ≤ i ≤ n}. Also, we define a distance-based range query Q_c = (q_c, r), where q_c is a point and r is a radius. Launching a query Q_c over D means finding the points in D that satisfy d(p_i, q_c) ≤ r, where d(·) is the Euclidean distance between two spatial points. The intuitive solution is to scan and check the points in D one by one; however, it is inefficient. Although we can build tree based index structures, such as the k-d tree and the M-tree [11], to speed up query processing, they incur much extra storage. In order to improve the performance of Q_c without introducing too much storage overhead, we adopt a pivot based filtering technique. The key idea is to select a virtual pivot p_v for D and calculate the distance d(p_i, p_v) for each point p_i. Then, D is sorted according to the corresponding d(p_i, p_v). When performing a query Q_c over D, we only need to calculate the distance d(q_c, p_v) and check the points whose distances d(p_i, p_v) lie in [d(q_c, p_v) − r, d(q_c, p_v) + r], instead of all points in D. This optimization is based on the triangle inequality, and its correctness follows from:

|d(p_i, p_v) − d(q_c, p_v)| ≤ d(p_i, q_c) ≤ r
⇒ −r ≤ d(p_i, p_v) − d(q_c, p_v) ≤ r
⇒ d(q_c, p_v) − r ≤ d(p_i, p_v) ≤ d(q_c, p_v) + r.

The above inequality indicates that if p_i falls within Q_c, it must satisfy d(p_i, p_v) ∈ [d(q_c, p_v) − r, d(q_c, p_v) + r]. Thus, we can narrow down the scan range from the whole dataset D to the points whose distances d(p_i, p_v) lie in [d(q_c, p_v) − r, d(q_c, p_v) + r].

IV. OUR PROPOSED INDEX - SPRIG

In this section, we present the details of our proposed index, SPRIG. Fig. 1 depicts the system architecture of our index, which comprises two parts: index building and query processing. In the following, we first discuss how to build the learned index.
Then, we describe the detailed query processing on our index.
A. Index Building
Our SPRIG mainly consists of three components: 1) an n × m grid layout G_{n×m}, where n is the number of columns along the x dimension and m is the number along the y dimension; 2) a table T; and 3) a learned model based on a spatial interpolation function F_in, as shown in Fig. 1. Some of the parameters used by our index are further discussed in Section IV-C. To build the n × m grid layout, we first find n − 1 boundaries on the x dimension to generate n columns of non-equal size, and add these boundaries into a set B_x. Our goal is to make the data records evenly distributed across columns, i.e., each column contains a roughly equal number of records (in this paper, we use "point" and "record" interchangeably). For the y dimension, there will be m − 1 boundaries in a set B_y. We denote the minimum and maximum values in the x dimension as {x_min, x_max}, and those in the y dimension as {y_min, y_max}. After adding {x_min, x_max} into B_x and {y_min, y_max} into B_y, we separately sort B_x and B_y in increasing order, so that |B_x| = n + 1 and |B_y| = m + 1. In total, there are n × m cells in the grid, and we have G_{n×m} = (B_x, B_y). Algorithm 2 shows the process of building the grid.

Next, we allocate the integers in the range [0, n × m − 1] as cell ids along the x dimension and define a 2-dimensional array C_id to index these cell ids: C_id[i][j] = j · n + i, for 0 ≤ i < n and 0 ≤ j < m. Afterward, we build a table T to map each cell id to the records it covers, in which the key is the cell id and the value is a pair (firstAddress, size) indicating the pointer to the first record and the number of records in the cell.
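The grid construction and cell-id layout described above can be sketched as follows. This is a simplified, quantile-style sketch, not the paper's exact Algorithm 2 (which counts duplicate coordinate values via sorted maps); the function names are illustrative.

```python
def build_boundaries(coords, parts):
    """Split one dimension into `parts` ranges holding roughly equal
    numbers of records; returns the parts + 1 sorted boundaries."""
    coords = sorted(coords)
    step = len(coords) / parts
    bounds = [coords[0]]                      # the dimension's minimum
    for k in range(1, parts):
        bounds.append(coords[int(k * step)])  # quantile-style boundary
    bounds.append(coords[-1])                 # the dimension's maximum
    return bounds

def build_grid(points, n, m):
    """Build the adaptive grid layout G_{n×m} = (Bx, By) from 2-D points."""
    Bx = build_boundaries([x for x, _ in points], n)
    By = build_boundaries([y for _, y in points], m)
    return Bx, By

def cell_id(i, j, n):
    """Row-major cell id, mirroring C_id[i][j] = j * n + i."""
    return j * n + i
```

For example, 100 points whose x coordinates are 0..99 split into n = 4 columns yield boundaries at the 25th, 50th, and 75th values, so each column covers about 25 records, matching the even-distribution goal above.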
Based on G_{n×m} and C_id, we can fit a spatial interpolation function F_in. In particular, we treat {B_x, B_y} as the inputs and C_id as the desired estimation values when determining F_in, i.e., C_id ← F_in(B_x, B_y). Besides, to speed up distance-related searches, for example kNN queries, we employ the pivot based filtering technique, which requires selecting a pivot p_v in each cell and sorting the records (within a cell) according to the distances d(p_i, p_v). In this paper, we adopt k-means clustering to select the pivot for each cell. It is worth noting that this technique does not incur much extra storage beyond the pivot points.

B. Query Processing
In this paper, we focus on range and kNN queries, in which the point query is implicitly included. Next, we describe our approaches.
Range Query. For spatial data, the range query identifies the records p_i that fall within Q_r = (q_b, q_t), where q_b = [b_x, b_y] and q_t = [t_x, t_y] are the bottom-left and top-right points of the query window, respectively. The main idea of processing a range query is to convert it into locating the cell ids of q_b and q_t. Here, we take locating q_b's cell id as an example; the steps are as follows.

• Step-1. Given q_b, we use F_in to obtain a predicted cell id pid_b. It is simple to calculate the locations of pid_b in the sets B_x and B_y: l_px = pid_b mod n and l_py = pid_b / n.
• Step-2. Given a pair of error guarantees (eg_x, eg_y), by applying local binary searches on B_x and B_y, we obtain the real x and y locations of the search point, denoted as l_rx and l_ry. The search range on B_x is [l_px − eg_x, l_px + eg_x], while it is [l_py − eg_y, l_py + eg_y] on B_y.
• Step-3. Finally, we obtain the real cell id of q_b, namely rid_b = l_ry · n + l_rx.

Algorithm 1 formally outlines the above steps to obtain the real cell id of a search point.

Fig. 2. Processing queries on SPRIG. (a) Range Query, on a grid with n = 5 and the fitted F_in; assuming (eg_x = 2, eg_y = 1), we take finding rid_b as an example. (b) kNN Query. Spread the search range to the outer-layer adjacent cells. The hollow circles represent the closest points of cells A and B to q_k. Assuming that cell B has t records, we use pivot based filtering to process the records in the cell.

Algorithm 1 GetRealCellId
Input: A given point (x, y); the grid layout G_{n×m} = (B_x, B_y); a predicted cell id pid of the given point; trained error guarantees eg = (eg_x, eg_y).
Output: The real cell id rid of the given point (x, y).
  n ← |B_x| − 1
  l_px ← pid mod n; l_py ← pid / n
  l_rx ← BinarySearch(B_x, x, l_px − eg_x, l_px + eg_x)
  l_ry ← BinarySearch(B_y, y, l_py − eg_y, l_py + eg_y)
  return l_ry · n + l_rx

Similarly, we can get the real cell id rid_t of q_t. After that, the range query result can be collected with the following strategy:
1) For cells intersected by the query window Q_r, we scan the records in these cells and put the records that fall within Q_r into the result.
2) For cells contained inside the query window Q_r, we directly add all the records covered by these cells into the result.

As shown in Fig. 2(a), the intersected cells must lie on the vertical and horizontal lines through rid_b and rid_t. Therefore, it is easy to determine the intersected cells (gray area) and the contained cells (blue area).

kNN Query. In this paper, we denote the kNN query as Q_kNN = (q_k, k), where q_k is a point and k is the number of nearest neighbors. The formal definition is: ∀p_i ∈ S, ∀p_j ∈ D \ S, d(q_k, p_j) ≥ d(q_k, p_i), where S is the result set of the kNN query. For kNN queries, our solution is to locate the real cell id of the query point q_k. Then, starting from that cell, we recursively spread the search range and incrementally check the records in the outer-layer adjacent cells until the result is complete. To improve performance, we employ two pruning techniques. One is the closest-point pruning technique, which is used to determine whether a cell should be checked. A cell may have different closest points with respect to different query points. However, in our scheme, since it is simple to obtain the locations (l_rx, l_ry) of the located real cell, we can easily get the closest point of a cell, denoted as p_c, with the help of B_x and B_y. In Fig. 2(b), cell A has the closest point (B_x[l_rx], B_y[l_ry + 1]), and cell B's closest point is (B_x[l_rx + 1], q_k.x).
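As an illustration, the cell-location procedure of Algorithm 1 can be sketched in Python. The `bilinear_predict` helper below is an assumption of ours, not the paper's fitting procedure: it fits the bilinear surface only through the four corner cell ids (with row-major ids the cross term vanishes), whereas the paper fits F_in over all grid nodes.

```python
import bisect

def bilinear_predict(x, y, Bx, By):
    """Illustrative stand-in for the fitted F_in: a bilinear surface
    through the four corner cell ids of the grid."""
    n, m = len(Bx) - 1, len(By) - 1
    u = (x - Bx[0]) / (Bx[-1] - Bx[0])   # normalized x position
    v = (y - By[0]) / (By[-1] - By[0])   # normalized y position
    return round((n - 1) * u) + n * round((m - 1) * v)

def bounded_search(bounds, v, lo, hi):
    """Binary search for i with bounds[i] <= v < bounds[i+1], restricted
    to the error-guarantee window [lo, hi] (Algorithm 1's BinarySearch).
    If the true index lies outside the window, the result is clamped;
    the trained (eg_x, eg_y) are meant to prevent that case."""
    lo, hi = max(lo, 0), min(hi, len(bounds) - 2)
    i = bisect.bisect_right(bounds, v, lo, hi + 2) - 1
    return max(lo, min(i, hi))

def get_real_cell_id(x, y, pid, Bx, By, eg):
    """Refine the predicted cell id `pid` into the real cell id."""
    n = len(Bx) - 1
    lpx, lpy = pid % n, pid // n         # predicted column / row
    lrx = bounded_search(Bx, x, lpx - eg[0], lpx + eg[0])
    lry = bounded_search(By, y, lpy - eg[1], lpy + eg[1])
    return lry * n + lrx
```

For example, with Bx = [0, 2, 5, 9, 10] and By = [0, 3, 6, 10] (a 4 × 3 grid), the point (6, 4) lies in column 2 and row 1, so its real cell id is 1 · 4 + 2 = 6; the bounded search only has to inspect the few boundaries inside the error window rather than all of B_x and B_y.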
The other pruning technique is the pivot based filtering, which can filter out records in a cell by the principle of the triangle inequality.

Algorithm 2 Build Grid
Input: A spatial dataset D; the number of columns along the x dimension, n; the number of columns along the y dimension, m.
Output: A grid layout G_{n×m} = (B_x, B_y).
  mapX ← ∅; mapY ← ∅
  for each entry in D do
    cntX ← (mapX.get(entry.x) is ∅) ? 1 : mapX.get(entry.x) + 1
    mapX.put(entry.x, cntX)
    cntY ← (mapY.get(entry.y) is ∅) ? 1 : mapY.get(entry.y) + 1
    mapY.put(entry.y, cntY)
  (x_min, y_min) ← findMin(D); (x_max, y_max) ← findMax(D)
  mapX.sortByKey(); mapY.sortByKey()
  avgX ← D.size / n; avgY ← D.size / m
  B_x ← getBoundary(mapX, avgX, x_min, x_max)
  B_y ← getBoundary(mapY, avgY, y_min, y_max)
  return (B_x, B_y)
  function getBoundary(map, avg, min, max)
    B ← ∅; cnt ← 0; pre ← 0
    B.add(min)
    for each entry in map do
      singleCnt ← entry.value
      if singleCnt > avg then
        B.add(entry.key + pre)
        pre ← entry.key
        cnt ← 0
        continue
      cnt ← cnt + singleCnt
      if cnt > avg then
        B.add(entry.key + pre)
        cnt ← 0
      else
        pre ← entry.key
    B.add(max)
    return B

By employing the spreading outwards strategy, closest-point pruning, and pivot based filtering, our kNN solution works as follows and is formally depicted in Algorithm 3:

• Step-1. Locate the real cell of q_k. With F_in and a local binary search, we obtain the real cell id rid of the query point q_k, as shown in lines 1 and 2.
• Step-2. Calculate the distances from q_k to the borders of the located cell. We denote them as c_t, c_b, c_l, and c_r, as shown in Fig. 2(b). Then, we obtain a radius r = min(c_t, c_b, c_l, c_r). See lines 6-8 for this step.

Algorithm 3 kNN Query
Input: A kNN query Q_kNN = (q_k, k); our index, G_{n×m} = (B_x, B_y), T, and F_in; trained error guarantees eg = (eg_x, eg_y).
Output: A priority queue queue containing the k closest points.
1: pid ← F_in(q_k.x, q_k.y)
2: rid ← getRealCellId(q_k.x, q_k.y, pid, B_x, B_y, eg)
3: l_rx ← rid mod n; l_ry ← rid / n
4: e_c ← 0; k_cnt ← 0; queue ← ∅
5: while k_cnt < k do
6:   c_b ← q_k.y − B_y[l_ry − e_c]; c_t ← B_y[l_ry + 1 + e_c] − q_k.y
7:   c_l ← q_k.x − B_x[l_rx − e_c]; c_r ← B_x[l_rx + 1 + e_c] − q_k.x
8:   r ← min(c_b, c_t, c_l, c_r)
9:   if e_c > 0 then
10:     cells ← collectAdjacent(q_k, G_{n×m}, T, e_c)
11:     for each cell in cells do
12:       if d(q_k, cell.p_c) < queue.peek().dist then
13:         records ← PivotFilter(cell, q_k, queue)
14:         for each entry in records do
15:           putQueue(entry, queue, q_k, k_cnt, r, k)
16:   else
17:     for each entry in T.get(rid) do
18:       putQueue(entry, queue, q_k, k_cnt, r, k)
19:   e_c ← e_c + 1
20: end while
21: function putQueue(entry, queue, q_k, k_cnt, r, k)
22:   dist ← getDistance(q_k, entry)
23:   if queue.size < k then
24:     queue.offer(entry)
25:     if dist ≤ r then
26:       k_cnt ← k_cnt + 1
27:   else if queue.peek().dist > dist then
28:     queue.poll()
29:     queue.offer(entry)
30:     if dist ≤ r then
31:       k_cnt ← k_cnt + 1

• Step-3. Scan the records in the cell. After filtering the records by the pivot based filtering technique, we put a record p_i into a priority queue if the queue's size is less than k or d(p_i, q_k) < σ, where σ is the distance between queue.peek() and q_k. Besides, we use a counter k_cnt to record the result size and increase it if d(p_i, q_k) ≤ r. In Algorithm 3, we use a separate function putQueue() (lines 21-31) to show the details of this step.
• Step-4. If k_cnt < k, we expand the search range and calculate a new radius r = min(c_t, c_b, c_l, c_r), where c_t, c_b, c_l, and c_r are the distances from q_k to the borders of the outer-layer adjacent cells. For each adjacent cell, we first check whether d(p_c, q_k) > σ. If yes, we skip the cell; otherwise, we perform Step-3. As shown in Fig. 2(b), we would skip cell A and further process cell B.

We repeat Step-4 and Step-3 until k_cnt ≥ k. Eventually, the priority queue holds the result of the kNN query. Note that, since the collectAdjacent() and PivotFilter() functions are straightforward, we do not present them in Algorithm 3 due to the limited space.
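The pivot based filtering at the heart of the per-cell scan (the PivotFilter() step above) can be sketched as follows; the function signature is illustrative, with the per-record pivot distances assumed to be precomputed at index-building time as described in Section IV-A.

```python
import math

def pivot_filter(records, pivot_dists, pivot, q, r):
    """Pivot based filtering (Section III-C): keep only records whose
    precomputed distance to the cell pivot falls inside
    [d(q, pv) - r, d(q, pv) + r]. By the triangle inequality,
    |d(p, pv) - d(q, pv)| <= d(p, q), so no point within distance r
    of q is ever discarded (false positives remain possible)."""
    dq = math.dist(q, pivot)
    lo, hi = dq - r, dq + r
    return [p for p, dp in zip(records, pivot_dists) if lo <= dp <= hi]
```

Since the records in a cell are stored sorted by their pivot distance, the surviving window [d(q, p_v) − r, d(q, p_v) + r] can in practice be located with two binary searches instead of the linear scan shown here.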
C. Cost Model
In this section, we build a cost model, which can be used to determine the number of columns in the x dimension and the y dimension, i.e., the values of n and m. Motivated by [6], we model the execution time of performing a query over our index. Here, we take the range query workload as an example. From Section IV-B, we know that the query time of a range query consists of four parts: 1) prediction; 2) local binary search; 3) retrieving cells; and 4) scanning and checking the data points in the intersected cells. Clearly, different n and m will generate different B_x and B_y; consequently, the execution times of F_in and the local binary search will be affected. For ease of description, we denote the execution time of the spatial interpolation function as T(F_in^{n×m}) and that of the local binary search as T(B^{n×m}). Assume there are N_i intersected cells and N_c contained cells falling within the range query Q_r. Meanwhile, we define N_p as the number of data points in the intersected cells. Through simulations, we can obtain the average times of retrieving a cell and scanning a data point, denoted as T_r and T_s, respectively. Putting these four parts together, the time cost of performing a range query is modeled as:

Time = T(F_in^{n×m}) + T(B^{n×m}) + T_r · (N_i + N_c) + T_s · N_p.

Given a dataset D and a query workload W, where Q_r ∈ W, we seek the layout parameters n × m that minimize the average value of Time.
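The layout search implied by the cost model can be sketched as below. This is a hedged illustration: the per-layout timings and per-query cell counts are assumed to have been measured (or simulated) offline for each candidate layout, as the paper describes.

```python
def range_query_time(t_fin, t_bin, t_r, t_s, n_i, n_c, n_p):
    """Modeled cost of one range query:
    Time = T(F_in) + T(B) + T_r * (N_i + N_c) + T_s * N_p."""
    return t_fin + t_bin + t_r * (n_i + n_c) + t_s * n_p

def pick_layout(stats):
    """Pick the (n, m) layout with the smallest average modeled time.
    `stats` maps each candidate layout to a list of per-query cost
    parameter tuples measured under the workload W."""
    def avg(nm):
        qs = stats[nm]
        return sum(range_query_time(*q) for q in qs) / len(qs)
    return min(stats, key=avg)
```

A coarser grid lowers the prediction and search terms but inflates the scan term N_p, while a finer grid does the opposite; the minimization balances the two.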
V. EVALUATION
In this section, we experimentally evaluate the performance of our index and compare it with alternative schemes in processing range and kNN queries. Specifically, we first explore the efficiency and accuracy of typical spatial interpolation functions. Then, we compare SPRIG with traditional indexes and Flood [6], a recently proposed multi-dimensional learned index, in terms of range and kNN query time and storage overhead. We implemented all indexes in Java and evaluated their in-memory versions. To analyze the impact of dataset size on index performance, we adopt three Twitter datasets [12] consisting of tweets with their locations:
Tweet200k, Tweet2M, and Tweet20M, which have 200k, 2M, and 20M spatial points, respectively. All experiments are conducted on a machine with 16 GB memory and a 3.4 GHz Intel(R) Core(TM) i7-3770 processor, running Ubuntu 16.04.
Efficiency and Accuracy of Spatial Interpolation Functions. We evaluate five spatial (two-dimensional) interpolation functions on adaptive grids: bilinear interpolation, bicubic interpolation, piecewise bicubic interpolation, Shepard interpolation, and radial basis function (RBF) interpolation [13]. Given a query workload, we evaluate the average execution time and maximum estimation errors of these five functions while varying the grid layout from 10 × 10 to 200 × 200. Note that, for ease of comparison, we adopt eg (Eq. (1)) instead of eg_x and eg_y to evaluate the accuracy of F_in. All of these interpolation functions are evaluated on Tweet200k. Fig. 5(a) shows the average execution time for three interpolation functions, bilinear, bicubic, and piecewise bicubic, while Fig. 5(b) presents their accuracy. We exclude the Shepard and RBF interpolation functions from the figures because they have much larger execution times and estimation errors; we still list their average execution times and estimation errors in Table I. From Fig. 5(a), Fig. 5(b), and Table I, we can see that the bilinear interpolation function has the best efficiency and accuracy across all grid layouts. Therefore, in the following comparisons, we apply the bilinear interpolation function in our index. Interestingly, the maximum estimation error of the bilinear interpolation function is always n + 1, which makes it quite suitable for reducing the execution time of the local search.

Fig. 3. Average query time of range queries over different datasets.

Fig. 4. Average query time of kNN queries over different datasets. Note that the average query time of Flood is shown separately in Table II due to its significant execution time.

TABLE I: SHEPARD AND RBF INTERPOLATION FUNCTIONS

Metrics             | Function | 10×10 | 20×20 | 50×50  | 100×100 | 200×200
Execution time (ms) | Shepard  | 6.7   | 25.8  | 161.2  | 646.3   | 2595.7
Execution time (ms) | RBF      | 6.1   | 10.3  | 60.9   | 243.3   | Out of memory
Estimation Error    | Shepard  | 60    | 223   | 1322   | 5195    | 20587
Estimation Error    | RBF      | 1815  | 6497  | 30813  | 116772  | Out of memory

Fig. 5. Spatial Interpolation Functions. (a) Average Execution Time. (b) Maximum Estimation Error.

Comparison on Range Query.
For range queries, we compare our proposed index with other multi-dimensional indexes that support range queries over the aforementioned three datasets. We choose the R-tree, the k-d tree, and Flood as the competing indexes: the first two are representative traditional indexes, and the third is a state-of-the-art learned index similar to our work. In addition, we consider five sets of queries with different selectivities {0.1%, 0.5%, 1.0%, 1.5%, 2.0%}. Aggregating these query sets into one large query workload, we tune our index with the cost model (see Section IV-C) to obtain the best grid layout. To be fair, we also train Flood with the approach proposed in [6], which obtains its best layout and error guarantee. Similarly, the traditional indexes, R-tree and k-d tree, are tuned to obtain their best parameter, i.e., the maximum number of children per node. Therefore, all indexes in the comparison are evaluated with their best parameters. From Fig. 3, we can see that SPRIG is always significantly faster than the traditional indexes at all selectivities, and is more advantageous on larger datasets. For example, on the Tweet20M dataset (Fig. 3(c)), our index achieves up to an order of magnitude better performance than the k-d tree for range queries. Besides, our index outperforms Flood in most cases. This benefit comes from the fact that Flood only learns the distribution of one dimension, while SPRIG learns the spatial distribution, which allows us to do more fine-grained filtering.
Comparison on kNN Query.
For kNN queries, we replace R-tree with M-tree [11], because R-tree is much slower than the other indexes for kNN queries, while M-tree is a typical kNN-supporting index. Since Flood does not provide any detail about how to deal with kNN queries, we implement it with a kNN search strategy similar to that of our index. In the kNN query comparison, we consider k = {4, 8, 16, 32, 64}. As with the range queries, we tune all the competing indexes to obtain their best parameters. From Fig. 4, we can see that M-tree is more expensive than our index when executing kNN queries. Our index also outperforms Flood on all datasets. Since Flood is much slower than the other indexes on the Tweet20M dataset, we exclude it from Fig. 4(c) and list its performance separately.

TABLE II
AVERAGE QUERY TIME OF FLOOD ON Tweet20M

k                  4     8     16    32    64
Query Time (ms)    4590  4796  4818  4778  5073

As shown in Fig. 4 and Table II, our index is around 2x, 5x, and 9x faster than Flood on the Tweet200k, Tweet2M, and Tweet20M datasets, respectively. Compared with k-d tree, our index is at least 2x faster on the Tweet200k dataset and slightly faster on the Tweet2M dataset. On the Tweet20M dataset, although our index still has a large advantage over M-tree and Flood in query performance, it is not as fast as k-d tree. However, k-d tree requires a significantly higher storage footprint to achieve such query performance. We compare the storage overheads of these indexes in the next section.
Storage Overhead.
Due to limited space, we only compare the storage overheads of the different indexes on the Tweet20M dataset; the relationship between the indexes is similar on the other two datasets. In the range and kNN query comparisons, we use the corresponding query workload to tune these indexes for their best parameters. For example, our index's best grid layout for the range query workload differs from that for the kNN query workload. Since different parameters of one index may lead to different storage overheads, we compare the storage overheads of these indexes on the range and kNN query workloads separately, as shown in Table III and Table IV, respectively. To obtain the storage footprint of a tree index, we evaluate the necessary storage of one node, e.g., the minimum bounding rectangle (MBR) of an R-tree node, and count the total number of internal nodes by traversing the tree. Flood has two components: a FITing-tree on one dimension and a table that maps each cell to the covered data records. Our index has three components: a grid layout G_{n x m}, a table T for managing the cells, and a spatial interpolation function F_in. The storage consumption of these two learned indexes depends on their grid layouts. Table III and Table IV show that both learned indexes have lower storage overheads than the traditional indexes on the range and kNN query workloads. Although our index consumes more storage than Flood, it achieves better query performance with an acceptable storage overhead. Recall that in the kNN query time comparison, our index is slower than k-d tree on Tweet20M. However, as shown in Table IV, the storage overhead of k-d tree is three orders of magnitude larger than that of our index, which makes it challenging to use in practice. Thus, our index is more practical than k-d tree on big datasets.

TABLE III
STORAGE OF INDEXES ON RANGE QUERY WORKLOAD

Index                    R-tree  k-d tree  Flood  SPRIG
Storage Overhead (MB)    7.95    305.17    0.11   6.30

TABLE IV
STORAGE OF INDEXES ON kNN QUERY WORKLOAD

Index                    M-tree  k-d tree  Flood  SPRIG
Storage Overhead (MB)    65.89   305.17    0.05   0.60

VI. CONCLUSION
In this paper, we have proposed a new learned model that can learn the spatial distribution of spatial data directly. Based on this learned model, we have built a novel learned spatial index, SPRIG, and designed range and kNN query execution strategies over it. Our experimental results suggest that 1) the bilinear interpolation function is the best option for the spatial learned model compared with other spatial interpolation functions; and 2) our index SPRIG is efficient with a relatively small storage footprint. In future work, we expect to further reduce the average query time of kNN queries on big datasets and to make our index more flexible.

REFERENCES

[1] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis, "The case for learned index structures," in SIGMOD, 2018, pp. 489-504.
[2] C. Tang, Y. Wang, Z. Dong, G. Hu, Z. Wang, M. Wang, and H. Chen, "XIndex: a scalable learned index for multicore data storage," in SIGPLAN, 2020, pp. 308-320.
[3] A. Galakatos, M. Markovitch, C. Binnig, R. Fonseca, and T. Kraska, "FITing-tree: A data-aware index structure," in SIGMOD, 2019, pp. 1189-1206.
[4] P. Ferragina and G. Vinciguerra, "The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds," VLDB, vol. 13, no. 8, pp. 1162-1175, 2020.
[5] J. Qi, G. Liu, C. S. Jensen, and L. Kulik, "Effectively learning spatial indices," VLDB, vol. 13, no. 12, pp. 2341-2354, 2020.
[6] V. Nathan, J. Ding, M. Alizadeh, and T. Kraska, "Learning multi-dimensional indexes," in SIGMOD, 2020, pp. 985-1000.
[7] H. Wang, X. Fu, J. Xu, and H. Lu, "Learned index for spatial queries," in . IEEE, 2019, pp. 569-574.
[8] A. Davitkova, E. Milchevski, and S. Michel, "The ML-index: A multidimensional, learned index for point, range, and nearest-neighbor queries," in EDBT, 2020, pp. 407-410.
[9] P. Li, H. Lu, Q. Zheng, L. Yang, and G. Pan, "LISA: A learned index structure for spatial data," in SIGMOD, 2020, pp. 2119-2133.
[10] L. Mitas and H. Mitasova, "Spatial interpolation," Geographical Information Systems: Principles, Techniques, Management and Applications, 1999.
[11] P. Ciaccia, M. Patella, and P. Zezula, "M-tree: An efficient access method for similarity search in metric spaces," in VLDB, 1997, pp. 426-435.
[12] https://developer.twitter.com/en, 2018.
[13] D. E. Myers, "Spatial interpolation: an overview,"