A Two-level Spatial In-Memory Index
Dimitrios Tsitsigkos, Konstantinos Lampropoulos, Panagiotis Bouros, Nikos Mamoulis, Manolis Terrovitis
Dimitrios Tsitsigkos
University of Ioannina, Greece / Athena Research Center, [email protected]
Konstantinos Lampropoulos
University of Ioannina, [email protected]
Panagiotis Bouros
Johannes Gutenberg University Mainz, [email protected]
Nikos Mamoulis
University of Ioannina, [email protected]
Manolis Terrovitis
Athena Research Center, [email protected]
ABSTRACT
Very large volumes of spatial data increasingly become available and demand effective management. While there have been decades of research on spatial data management, few works consider the current state of commodity hardware, with its relatively large memory and its ability for parallel multi-core processing. In this paper, we re-consider the design of spatial indexing under this new reality. Specifically, we propose a main-memory indexing approach for objects with spatial extent, which is based on a classic regular space partitioning into disjoint tiles. The novelty of our index is that the contents of each tile are further partitioned into four classes. This second-level partitioning not only reduces the number of comparisons required to compute the results, but also avoids the generation and elimination of duplicate results, which is an inherent problem of spatial indexes based on disjoint space partitioning. The spatial partitions defined by our indexing scheme are totally independent, facilitating effortless parallel evaluation, as no synchronization or communication between the partitions is necessary. We show how our index can be used to efficiently process spatial range queries and drastically reduce the cost of the refinement step of the queries. In addition, we study the efficient processing of numerous range queries in batch and in parallel. Extensive experiments on real datasets confirm the efficiency of our approaches.
The management and indexing of spatial data has been studied extensively for at least four decades. Classic spatial indexes [13] have been designed for the – now obsolete – storage model of the 80's, i.e., the data are too big to reside in memory and the goal is to minimize the I/O cost during query evaluation. Things have changed a lot since then. First, memories have become much bigger and cheaper. In most applications, the spatial data can easily fit in the memory of even a commodity machine. Second, modern processors have multiple cores and facilitate parallel query processing. In this paper, we re-consider the design of spatial indexing under this new reality. Our goal is a main-memory spatial index, which outperforms the state-of-the-art spatial access methods, considering computational cost as the main factor.

Our index is based on a simple grid-based space partitioning. Grid-based indexing has several advantages over hierarchical indexes, such as the R-tree [14]. First, the partitions relevant to a query are very fast to identify (using algebraic operations only). Second, the partitions are totally independent of each other and can be handled by different threads without the need for any synchronization or scheduling. Third, updates can be performed very fast, as locating the cell which contains (or the cells which intersect) an object takes constant time and no changes to the space partitioning are required. Hence, main-memory grids have been preferred over hierarchical indexes, especially for the (in-memory) management of highly dynamic collections of 2D points [16, 21, 22, 29, 33, 39].

Still, spatial grids have their own weaknesses. First, the distribution of the objects to cells can be highly uneven, which causes some of the partitions to be overloaded. This issue can be alleviated by increasing the grid granularity, which, however, may result in numerous empty tiles.
The problem of empty tiles can be easily handled by using a hash table or a bitmap to mark non-empty tiles, or by modeling and searching non-empty (fine) tiles using space-filling curves [9, 15]. The most important issue arises in the indexing of non-point objects (e.g., polygons), which are typically approximated by their minimum bounding rectangles (MBRs). If an MBR intersects multiple tiles, then it is assigned to multiple partitions, which causes replication, increases query times, and requires special handling for possible duplicate results. For example, consider the six object MBRs depicted in Figure 1(a), partitioned using a 4 × 4 grid; some of them (e.g., object r) are assigned to multiple tiles. Besides the increased space requirements due to replication, given a query range (e.g., W), a replicated object (e.g., r) may be examined and reported multiple times (e.g., at tiles 0, 1, 4, and 5). Spatial indexes that allow overlapping partitions (such as the R-tree) do not have this problem, because each object falls into exactly one leaf node, so there is no replication and no need for duplicate result avoidance. For example, the R-tree of Figure 1(b) would examine the MBR of r only once, in the leaf node under entry R. Still, as mentioned above, R-tree-like methods have relatively high update costs due to index maintenance, and query evaluation using them is harder to parallelize.

Figure 1: Examples of indexing schemes. (a) a main-memory 4 × 4 grid with per-tile object lists; (b) an R-tree.

Our goal is the design of a grid-based index for non-point data which maintains the simplicity and advantages of a grid, without having its disadvantages.
The novelty of our index is the introduction of an additional, second-level partitioning, which divides the objects that intersect a tile into four classes (2^m classes in an m-dimensional space), based on whether their MBRs start inside or before the tile in each dimension. This division helps us to avoid accessing some classes of objects during query evaluation, which reduces the query cost and at the same time avoids duplicate results, while ensuring that no results are missed. For example, in Figure 1(a), due to our second-level partitioning, object r will be examined and reported only by tile 0, because both the object starts in that tile and the query starts in or before that tile. We lay out the set of rules based on which range queries are evaluated using our index. Besides, we show how our index can be used to answer distance range queries, reducing the number of expensive distance computations that alternative indexing schemes require. Finally, we show how the number of applications of the expensive refinement step can be greatly reduced.

Besides rectangular range queries, we also study the evaluation of circular range (i.e., disk) queries and, in general, queries with convex range shapes. We show how our indexing scheme reduces the number of comparisons and avoids duplicate results also in this case. In addition, we show how, for the majority of query results, the refinement step can be avoided by a simple post-filtering test on the object MBRs. Finally, we investigate the efficient evaluation of numerous range queries in batch and in parallel.

We compare our index experimentally with a state-of-the-art implementation of an in-memory R-tree from boost.org and show that it is up to several times faster, especially for large queries on large datasets. In addition, our index performs much better than the R-tree for mixed workloads (with inserts and range queries).
We also show that our approach (which is directly parallelizable) scales gracefully with the number of cores (i.e., threads in a multi-core machine), making it especially suitable for shared-nothing parallel environments where tree-based spatial indexes (such as the R-tree) are hard to deploy.

In summary, this paper makes the following contributions:
• We design a novel second-level partitioning approach for space-partitioning indexes (such as grids).
• We show how spatial range queries can benefit from our indexing scheme, by avoiding redundant comparisons and the generation of duplicate results.
• We introduce a simple additional filter that avoids the refinement step for the great majority of the objects, rendering range queries very efficient in practice.
• We conduct an extensive experimental evaluation which demonstrates the superiority of our index in comparison to alternative methods and its scalability when evaluating multiple queries using multiple cores.

The rest of the paper is organized as follows. Section 2 provides the necessary background and discusses related work. Section 3 introduces our secondary partitioning scheme and its application in grid-based spatial indexing. Section 4 shows how spatial range query evaluation can benefit from our indexing scheme. In Section 5, we present a filtering condition that applies on the MBRs of the objects and can be used to confirm the inclusion of an object in a range query result, without the need of a refinement step. Section 6 discusses how numerous range queries can be processed efficiently and in parallel. An experimental evaluation is presented in Section 7. Finally, Section 8 concludes the paper with a discussion about future work.
In this section, we introduce the necessary background in spatial data management [20] and present work related to our research.
Common types of spatial objects include points (defined by one value per dimension), rectangles (defined by one interval per dimension), line segments (defined by a pair of points), polygons (defined by a sequence of points), linestrings (defined by a sequence of points), etc. Three classes of spatial relationships characterize the relative position and geometry of spatial objects.
Topological relationships model the relation between the geometric extents of objects (e.g., overlap, inside).
Directional relationships compare the relative locations of the objects with respect to a coordinate (or cardinal) system (e.g., north/south, above/below). Last, distance relationships capture distance/proximity information between two objects (e.g., near/far).

The most frequently applied query operation on spatial data is the spatial selection or range query, which retrieves the objects that satisfy a spatial relationship with a reference spatial object. Typically, the reference object is a region W and the objective is to retrieve the objects that intersect W or are inside W. In another popular range query type (especially in location-based services), given a reference location q and a distance threshold ϵ, the objective is to find all objects having at most ϵ Euclidean distance from q. Spatial access methods are primarily designed for spatial selection queries. Other important spatial queries include nearest neighbor queries, intersection joins and distance joins.

The potentially complex geometry of the objects renders the evaluation of spatial predicates directly on their exact representation inefficient. Hence, spatial queries are processed in two steps, following a filtering-and-refinement framework. During the filtering step, the query is applied on the
Minimum Bounding Rectangles (MBRs), which approximate the objects. If the MBR of an object does not qualify the query predicate, then the exact geometry does not qualify it either. The filtering step is a computationally cheap way to prune the search space, in many cases powered by spatial indexing, but it only provides candidate query results. During the refinement step, the exact representations of the candidates are tested against the query predicate to retrieve the actual results.
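As an illustration of this filtering-and-refinement framework, here is a minimal sketch; the MBR tuple encoding and the `exact_test` callback are our own assumptions, not the paper's implementation:

```python
def mbr_intersects(a, b):
    """a, b: MBRs as (x_lo, y_lo, x_hi, y_hi). Cheap filtering test:
    two rectangles intersect iff they intersect in every dimension."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_query(objects, window, exact_test):
    """objects: list of (mbr, geometry); window: query MBR.
    exact_test(geometry, window) stands for the (expensive) refinement
    predicate on the exact geometry."""
    results = []
    for mbr, geom in objects:
        if mbr_intersects(mbr, window):      # filtering step (cheap)
            if exact_test(geom, window):     # refinement step (expensive)
                results.append(geom)
    return results
```

Only candidates surviving the MBR filter pay the cost of the exact-geometry test.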
Due to the complex nature of spatial objects, decades of research efforts have been devoted to spatial indexing; as a result, a number of
Spatial Access Methods (SAMs) have been proposed [31, 32]. Although the majority of SAMs focus on disk-resident data, it is straightforward to also use them in main memory, which is the focus of our work. Typically, the goal of a SAM is to group objects closely located in space into the same index blocks (traditionally, blocks are disk pages). These blocks are then organized in an index (single-level or hierarchical).

Depending on the nature of the partitioning, spatial indices can be classified into two classes [25]. Indices based on space-oriented partitioning divide the space into disjoint partitions. As a result, objects whose extent overlaps with multiple partitions need to be replicated (or clipped) in each of them. A grid [5] is the simplest index based on space-oriented partitioning; the space is uniformly divided into cells (partitions), using axis-parallel lines. Hierarchical indices that fall in this category are the kd-tree [4] and the quad-tree [12]. Space-oriented partitioning was originally proposed for, and is especially suitable for, indexing collections of points, because no replication issues arise. A bitmap-based index for point data was recently proposed in [23]. SIDI [24] is another spatial index for point data, which learns the characteristics of the dataset before construction and whose layout is designed to fit the data well.

For non-point objects, the replication of object MBRs to multiple space-oriented partitions may negatively affect query performance. In addition, due to object replication, the same query results may be detected in multiple partitions and deduplication techniques should be applied, as discussed in the Introduction. In view of this, an alternative class of indices, based on data-oriented partitioning, was also proposed, allowing the extents of the partitions to overlap and ensuring that their contents are disjoint (i.e., each object is assigned to exactly one partition).
The R-tree [14] (and its variants, e.g., the R*-tree [3]) is the most popular SAM in this class (and in general). The R-tree is a height-balanced tree, which generalizes the B+-tree in the multi-dimensional space and hierarchically groups object MBRs into blocks. Each block is also approximated by an MBR, hence the tree defines a hierarchy of MBR groups. Some R-tree variants use circles (or spheres in the 3D space) instead of MBRs, e.g., the SS-tree [35], or a combination of circles and rectangles, e.g., the SR-tree [17].

Most spatial indexing methods are designed for the efficient evaluation of spatial range queries. In brief, during the filtering step, the goal is to determine which partitions of the space intersect the query region W. In the case of hierarchical indices such as the kd-tree, the quad-tree and the R-tree, the query is processed by recursively traversing the nodes whose MBRs intersect W, starting from the root. Finally, every object whose MBR overlaps with region W is passed as a candidate to the refinement step, where its exact geometry is compared against W.

Most spatial indexes have been designed to support dynamic updates. The R*-tree [3] differs from the original R-tree [14] in its insertion algorithm, which is designed to be both efficient and to result in a high tree quality. Bulk loading methods for R-trees have also been proposed, with the most popular method being the sort-tile-recursive approach [19]. Naturally, updates on hierarchical indexes are more expensive, compared to updates on single-level (flat) indexes, because they may result in index reorganization. Hence, single-level indexes, such as grids, are preferred over hierarchical ones in workloads with many updates (e.g., when indexing moving objects [16]).

The R-tree was originally proposed for disk-resident data with the key focus on minimizing the I/O during query processing.
The CR-tree [18] is an optimized R-tree for the memory hierarchy. BLOCK [25] is a recently proposed main-memory spatial index, which uses a hierarchy of grids. At each level, a uniform grid with higher resolution compared to the level above is used. Given a range query, starting from the uppermost grid, BLOCK evaluates the query on cells that are completely contained in the query. The remaining query parts (excluding the cells that are contained in the range) are either evaluated at the cells they overlap at the current level, or they are evaluated recursively at the level below, depending on the estimated benefit.

Early efforts on parallel and distributed spatial query evaluation have mainly focused on spatial joins, which are more expensive than range queries and can benefit more from parallelism. The R-tree join (RJ) algorithm [7] and PBSM [27] were parallelized in [6] and [28, 30, 40], respectively. With the advent of Hadoop, research on spatial data management has shifted to the development of distributed spatial data management systems [1, 11, 36–38].
Hadoop-GIS [1] is one of the first efforts in this direction. Spatial data in Hadoop-GIS are partitioned using a hierarchical grid, wherein high-density tiles are split into smaller ones, in order to handle data skew. The nodes of the cluster share a global tile index which can be used to find the HDFS files where the contents of the tiles are stored. For query evaluation, an implicit parallelization approach is followed, which leverages MapReduce. That is, the partitioned objects are given IDs based on the tiles they reside in, and finding the objects in each tile can be done by a map operation. Spatial queries are implemented as MapReduce workloads. Duplicate results in spatial queries are eliminated by adding a MapReduce job at the end. In the
SpatialHadoop system [11], data are also spatially partitioned, but different options for partitioning based on different spatial indexes are possible (i.e., grid based, R-tree based, quadtree based, etc.). Different spatial datasets can be partitioned by different approaches. A global index for each dataset is stored at a Master node, indexing for each HDFS file block the MBR of its contents. In addition, a local index is built at each physical partition and used by map tasks.

Spark-based implementations of spatial data management systems [36–38] apply similar partitioning approaches. The main difference to Hadoop-based implementations is that data, indexes, and intermediate results are shared in the memories of all nodes in the cluster as resilient distributed datasets (RDDs) and can be made persistent on disk. Unlike SpatialSpark [37] and GeoSpark [38], which are built on top of Spark, Simba [36] has its own native query engine and query optimizer; however, Simba does not support non-point geometries. Pandey et al. [26] conduct a comparison between in-memory spatial analytics systems and find that they scale well in general, although each one has its own limitations. Similar conclusions are drawn in another study [2].

We observe that distributed spatial data management systems focus more on data partitioning and less on minimizing the cost of query evaluation at each partition. In other words, emphasis is given to scaling out (i.e., making the cost anti-proportional to the number of nodes), rather than to per-node scalability (i.e., reducing the computational cost per node) and multi-core parallelism. For example, a typical range query throughput rate reported by the tested systems in [2, 26] is a few hundred queries per minute, whereas for the same scale of data an in-memory R-tree can handle on a single machine (without parallelism) tens of thousands of queries per minute (according to our tests in Section 7).
We consider the classic approach of approximating spatial objects by their minimum bounding rectangles (MBRs). By imposing an N × M regular grid over the space, we can divide it into N · M disjoint tiles. Determining N and M is not a subject of this section; we will discuss this issue in Section 7. Each tile defines a spatial partition. An object o is assigned to a tile T if MBR(o) and T intersect (i.e., they have at least one common point). Since MBR(o) can intersect with multiple tiles, o can be assigned to more than one tile. We target applications where the object extents are relatively small compared to the map (and to the extent of a tile); hence, object replication is expected to be low. For example, Figure 2 shows a grid and a spatial object o, whose MBR intersects tiles T_a and T_b; o is assigned to both tiles. For each tile T, we keep a list of (MBR, object-id) pairs that are assigned to T. For example, the MBR and id of o in Figure 2 appear in the lists of both T_a and T_b. While the MBRs and ids of the objects can thus be replicated to multiple tiles, the actual geometry of an object is stored only once, in a separate data structure (e.g., an array or a hash-map), in order to be retrieved fast given the object's id. Since the spatial distribution of objects may not be uniform, there could be empty tiles. If the number of empty tiles is significantly large compared to the total number of tiles, we can use a hash-map to assign each non-empty tile to the set of rectangles in it. The above storage scheme is quite effective for main-memory data because it supports queries and updates quite fast, while it is straightforward to parallelize popular spatial queries and operations. We now discuss in more detail how this simple indexing scheme can be used to evaluate rectangular range queries and expose its limitations.
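The first-level assignment described above can be sketched as follows; the unit-square grid extent, the tuple encodings and all names are illustrative assumptions:

```python
from collections import defaultdict

def tile_range(mbr, N, M):
    """Index range of tiles intersected by mbr = (x_lo, y_lo, x_hi, y_hi)
    on an N x M uniform grid over the unit square (constant time)."""
    tx0 = min(int(mbr[0] * N), N - 1)
    tx1 = min(int(mbr[2] * N), N - 1)
    ty0 = min(int(mbr[1] * M), M - 1)
    ty1 = min(int(mbr[3] * M), M - 1)
    return tx0, tx1, ty0, ty1

def build_grid(objects, N, M):
    """objects: dict id -> MBR. Returns a hash-map from (non-empty) tile
    to its list of (MBR, object-id) pairs; exact geometries would be kept
    once, in a separate structure keyed by id."""
    grid = defaultdict(list)
    for oid, mbr in objects.items():
        tx0, tx1, ty0, ty1 = tile_range(mbr, N, M)
        for tx in range(tx0, tx1 + 1):
            for ty in range(ty0, ty1 + 1):
                grid[(tx, ty)].append((mbr, oid))  # MBR/id replicated per tile
    return grid
```

Using a dict keyed by tile index realizes the hash-map option for sparse grids: empty tiles consume no memory.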
We first introduce some notation that will also be useful when we discuss our solution. Recall that each MBR r can be represented by an interval of values in each dimension. Let r[i] = [r[i][0], r[i][1]] be the projection of rectangle r on the i-th axis. For example, in the 2D space, r[0][1] denotes the upper bound of rectangle r in dimension 0 (i.e., the x-axis). Similarly, we use T[i] = [T[i][0], T[i][1]] to denote the projection of a tile T on the i-th dimension. Given a tile T and a dimension i, we use prev(T, i) to denote the tile T′ which is right before T in dimension i and has exactly the same projection as T in the other dimension(s). For example, in Figure 2, T_b = prev(T_a, 0). prev(T, i) is not defined for tiles T which are in the first column (for i = 0) or row (for i = 1) of the grid.

Given a range query window W, a tile that does not intersect W does not contribute any results to the query. Specifically, the only tiles T that may contain query results are those for which T[i][1] ≥ W[i][0] and T[i][0] ≤ W[i][1] in every dimension i; they can easily be enumerated after finding the tiles T_s and T_e, which contain the first and the last corner of W, i.e., the points (W[0][0], W[1][0]) and (W[0][1], W[1][1]), respectively. Figure 2 illustrates a window query W in light grey and its four corner points. The tiles which are relevant to W are between (in both dimensions) the two tiles T_s and T_e.

Figure 2: Example of tiling and query evaluation.

For each tile which is totally covered by the query range in at least one dimension (e.g., T_a in dimension 0), we know that the objects in it certainly intersect W in that dimension. For a tile T that partially overlaps with W in both dimensions (e.g., T_b), we need to iterate through its object list to verify the intersections with W. We first check whether the MBR of the object intersects W and then we might have to verify with the exact geometry of the object at a refinement step.

An important issue is that neighboring tiles may intersect W and also contain the same object o. In this case, o will be reported more than once, so we need an approach for handling these duplicates. For example, in Figure 2, object o could be reported both by T_a and by T_b. A solution to this problem is to report an object o only at the tile which is before all tiles (in both dimensions) where o is found to intersect W. For example, in Figure 2, o is reported by T_b only, which is before T_a.
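This "report only at the first relevant tile" policy can be sketched with a reference-point test: an object is reported only by the tile that contains the smallest corner of the intersection of its MBR with W. The tuple encodings and names below are our own assumptions:

```python
def intersection(a, b):
    """Intersection of two overlapping MBRs (x_lo, y_lo, x_hi, y_hi)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def report_here(tile, obj_mbr, window):
    """tile = (x_lo, y_lo, x_hi, y_hi) extent of the current tile.
    True in exactly one of the tiles where obj_mbr and window co-occur."""
    ix = intersection(obj_mbr, window)
    ref = (ix[0], ix[1])                 # reference point: smallest corner
    return (tile[0] <= ref[0] < tile[2] and
            tile[1] <= ref[1] < tile[3])
```

Since the reference point falls in a single tile, each object is reported exactly once, at the cost of extra comparisons per candidate.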
An easy approach to perform this test is to compute the intersection between the query window and the rectangle and report the result only if a reference point of the intersection (e.g., the corner with the smallest value in all dimensions) is included in the tile [10]. While this solution prevents reporting duplicates, it requires extra comparisons and it is unclear how to apply it to non-rectangular range queries. An alternative and more general approach is to add the results from all tiles to a hash table, which prevents the same rectangle from being reported multiple times.

We now present our proposal for improving this basic spatial indexing approach by introducing a second level of partitioning to the contents of each tile. Our approach avoids the generation of duplicate results altogether and, hence, does not require any duplicate elimination. (Note that T_s and T_e can be found in O(1) by algebraic calculations if the grid is uniform. We conventionally assume that the x dimension (i = 0) is directed from left to right and the y dimension (i = 1) from top to bottom.) We propose that the set of MBRs in each tile is further divided into four classes A, B, C, and D (which are physically stored separately in memory). Specifically, consider a rectangle r which is assigned to (i.e., intersects) tile T.
• r belongs to class A if, for every dimension i, the begin value r[i][0] of r falls into projection T[i], i.e., if T[i][0] ≤ r[i][0].
• r belongs to class B if its x-projection r[0] begins inside T[0], but its y-projection r[1] begins before T[1], i.e., if T[0][0] ≤ r[0][0] and T[1][0] > r[1][0].
• r belongs to class C if its x-projection r[0] begins before T[0], but its y-projection r[1] begins inside T[1], i.e., if T[0][0] > r[0][0] and T[1][0] ≤ r[1][0].
• r belongs to class D if both its x- and y-projections begin before T, i.e., if T[0][0] > r[0][0] and T[1][0] > r[1][0].

We can refer to each class by two bits, one for each dimension. The bit in each dimension indicates whether the rectangle starts before the tile in that dimension. Hence, class A can also be referred to as class 00, because a rectangle in class A of a tile T does not start before T in either dimension. Similarly, classes B, C, and D can be denoted by 01, 10, and 11, respectively. This notation can be generalized to an arbitrary number of dimensions m, where there are 2^m classes of bounding boxes in each multidimensional tile and m bits are used to denote each class.

Figure 3: The 4 classes of rectangles inside a tile T.

Figure 3 illustrates examples of rectangles in a tile T that belong to the four different classes. During data partitioning, for each tile T a rectangle r is assigned to, we identify its class; hence, for each tile, we have four different rectangle divisions (which are stored separately). Note that a rectangle can belong to class A of just one tile, while it can belong to other classes (in other tiles) an arbitrary number of times.

In this section, we show how the divisions can be used to evaluate spatial range queries efficiently and at the same time avoid generating and testing duplicate query results.
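The class assignment above amounts to two comparisons per rectangle and tile. A minimal sketch, assuming the same (x_lo, y_lo, x_hi, y_hi) tuple encoding as before:

```python
def rect_class(r, T):
    """2-bit class of rectangle r with respect to tile T:
    0 = A (00), 1 = B (01), 2 = C (10), 3 = D (11).
    Each bit is set iff r starts before T in that dimension."""
    bit_x = 1 if r[0] < T[0] else 0   # r starts before T in dimension 0
    bit_y = 1 if r[1] < T[1] else 0   # r starts before T in dimension 1
    return (bit_x << 1) | bit_y
```

In m dimensions the same idea yields an m-bit code, i.e., 2^m classes per tile.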
For simplicity, we first consider rectangular range queries, where the query range is a rectangle (window) W and the objective is to find the objects which spatially intersect W. The cases of other query shapes will be discussed later on. For now, we focus on the filtering step of the query, i.e., the objective is to find the object MBRs which intersect W. The refinement step will be discussed in Section 5.

Recall that the tiles which are relevant to W are between, in both dimensions, the two tiles that contain the corners (W[0][0], W[1][0]) and (W[0][1], W[1][1]). We lay out a set of rules that can be used to determine which rectangles in each of these tiles are query results and which comparisons are necessary for determining whether a rectangle is a result. Finally, these rules help us to avoid generating and eliminating duplicate query results, without any comparisons, bookkeeping, or synchronization in the processing at different tiles. In summary, the goal of our method is twofold: (i) eliminate any dependencies between the processing at different tiles and (ii) minimize the cost of processing at each tile, by avoiding redundant comparisons and duplicate result checks.

Recall that in order for two rectangles to intersect (in our case, W and a candidate query result), they should intersect in all dimensions. In other words, if a rectangle r does not intersect W in one dimension (i.e., in a dimension i, we either have r[i][1] < W[i][0] or W[i][1] < r[i][0]), then r is not a query result. We now present a lemma which can help us to determine classes of rectangles in a tile that should not be considered in a query, as they would otherwise produce duplicate results.

Lemma 1 (Filtering).
If W intersects tile T and starts before T in dimension i, then:
• in the classes having bit 1 in dimension i, all rectangles that intersect W are guaranteed to intersect W also in the previous tile prev(T, i), hence they can be safely disregarded;
• if W also starts before T in dimension j ≠ i, then all objects in the class having bit 1 in dimension j are guaranteed to intersect W also in the previous tile prev(T, j), hence they can be safely disregarded.

To understand the first point of the lemma, consider again Figure 2 and tile T_a; W starts before T_a in dimension 0. All rectangles of T_a in classes C = 10 and D = 11 can be ignored by tile T_a when processing query W, because these rectangles are guaranteed to also intersect the previous tile T_b = prev(T_a, 0) in dimension 0 and they can be processed there. Hence, o (which belongs to class C = 10) is not examined at all by tile T_a. To understand the second point of the lemma, note that W also starts before T_a in dimension 1. This guarantees that all objects in class B = 01 of tile T_a which intersect W can be reported at tile T_c = prev(T_a, 1), so T_a can safely ignore all rectangles in class B = 01.

We now turn our attention to minimizing the comparisons needed for the rectangle classes that have to be checked (i.e., those not eliminated by Lemma 1). If a tile T is covered by the window W in a dimension i, then we do not have to perform intersection tests in that dimension. Let us go back to the example of Figure 2. Recall that only rectangles in class A = 00 of tile T_a need to be checked against window W, because the other classes have been filtered out by Lemma 1. For all these rectangles, we only have to conduct an intersection test with W in dimension 1, since T_a is totally covered by W in dimension 0. For the dimension(s) where the tile is not covered by W, the following lemmas can be used to further reduce the necessary comparisons.

Lemma 2 (Comparisons Reduction 1). If W ends in tile T and starts before T in dimension i, then for a rectangle r ∈ T, r intersects W in dimension i iff r[i][0] ≤ W[i][1]. (Note that if W also covers a tile such as T_b in a dimension i, a rectangle r in T_b will be processed recursively at prev(T_b, i), if r is in a class of T_b having bit 1 in dimension i.)

Figure 4: Processing a range query at each tile.
For example, in tile T_a, we only have to test intersection in dimension 1 for rectangles r in class A = 00, as already explained. The intersection test can be reduced to a simple comparison, i.e., if r[1][0] ≤ W[1][1] then r intersects W. Symmetrically, we can show:

Lemma 3 (Comparisons Reduction 2). If W starts in tile T and ends after T in dimension i, then for a rectangle r ∈ T, r intersects W in dimension i iff r[i][1] ≥ W[i][0].

For example, consider tile T_d in Figure 2. Due to Lemma 1, we can eliminate from consideration the rectangle classes C = 10 and D = 11 of T_d, while the rectangles in classes A = 00 and B = 01 are guaranteed to intersect W in dimension 0. Hence, we only have to find the rectangles r in classes A and B for which r[1][1] ≥ W[1][0].
A detailed example of the tasks executed by each tile in a window query W is illustrated in Figure 4. The tile containing the lower-left corner of W processes all four classes of its rectangles; for each rectangle, just one comparison is needed per dimension due to Lemma 3. The remaining tiles of the bottom row process only classes A = 00 and B = 01 due to Lemma 1; in all of them except the last one, the intersection test at dimension 0 is skipped and only one comparison per rectangle is necessary (Lemma 3), while the last tile of the row applies one comparison in each dimension, using the start and end points of each rectangle in dimensions 0 and 1, respectively (Lemmas 2 and 3). In the leftmost tile of each intermediate row, only rectangle classes A = 00 and C = 10 need to be processed; Lemma 3 reduces the comparisons for dimension 0, while there is no need for comparisons in dimension 1, as W covers the tile in this dimension. In the inner tiles of the intermediate rows, only class A = 00 is accessed and no comparisons at all are required, whereas in the rightmost tile of each such row, one comparison for dimension 0 should be performed per rectangle (Lemma 2). The top row is handled similarly: its first tile processes classes A = 00 and C = 10, its remaining tiles only class A = 00; one comparison per rectangle is required (Lemma 2), except for the last tile, which requires two comparisons per rectangle.

Given a range query W, we first identify the range of tiles that intersect W (i.e., the first and the last tile in each dimension) by simple algebraic operations (i.e., by dividing the endpoints of W in each dimension by the extent of a tile in that dimension). We then pass the control to each such tile T, which accesses the relevant classes of rectangles and performs the necessary computations for the rectangles in them. For each qualifying rectangle, a refinement step is performed (after accessing the corresponding object's geometry). The results produced at each tile are eventually merged.

We now discuss the evaluation of disk range queries, where the objective is to find all objects which overlap a disk D of radius ϵ centered at a given point q. This query is equivalent to a distance range query of the form "find all objects having distance at most ϵ from location q" and is very popular in location-based services.

To evaluate a disk query on our two-level partitioned dataset, we apply a method similar to the one for rectangular window queries: we first find the tiles that intersect the disk and then the objects in them that satisfy the query predicate. If we approximate the disk D by MBR(D), we can easily identify the tiles that potentially intersect the disk by simple algebraic computations, as in window queries. For each such tile, its minimum distance to q is computed and, if this distance is at most ϵ, the tile is confirmed to intersect the disk.

We now turn our attention to computing the results in a tile and to duplicate avoidance. Since a rectangle can be assigned to multiple tiles, the objective is to examine only the MBR classes in each tile which are necessary to ensure that (i) no result is missed and (ii) no duplicate results are reported.
In other words, all rectangles that intersect the disk should be reported exactly once. For this, we follow a similar approach as for rectangular queries. For each tile T within distance ϵ from q, we check whether prev(T, i) in each dimension i is also within distance ϵ from q, i.e., whether prev(T, i) is in the set S of tiles that may include results. If yes, then we disregard the corresponding class of rectangles in T. Hence, if prev(T, 1) ∈ S, then class B = 01 is disregarded, whereas if prev(T, 0) ∈ S, then class C = 10 is disregarded. If both prev(T, 0) ∈ S and prev(T, 1) ∈ S, then all classes B, C, D are disregarded. In practice, we do not have to compute the minimum distance for each tile that intersects MBR(D): for each row of tiles intersecting MBR(D), we can just find the first tile Ts with distance at most ϵ to q by scanning the row forward, and then the last tile Te with distance at most ϵ by scanning the row backward; all tiles between Ts and Te are guaranteed to qualify the minimum distance predicate.

Figure 5 shows an example of a disk query centered at q. The tiles which intersect the disk are shown with different patterns, depending on the classes of rectangles in them that have to be checked; a tile where all four classes must be examined is called an ABCD tile, in the context of the disk query. Note that for the majority of tiles which intersect the disk range, we only have to examine rectangles in class A = 00. Still, duplicates are possible, as a rectangle may be examined in one ABCD tile in class B and in another ABCD tile in class C. To avoid such duplicates, for each rectangle in an ABCD tile T, if the tile is closer to q in the y-dimension than in the x-dimension, we ignore the rectangles r in classes B and D for which r[0][1] > T[0][1], i.e., those that "overflow" to the next tile in dimension 0 (these will be handled in another tile). If the tile is closer to q in the x-dimension, we ignore the rectangles r in classes C and D for which r[1][1] > T[1][1], i.e., those that "overflow" to the next tile in dimension 1. Finally, for the single tile which has (almost) the same distance to q in both dimensions, we consider all rectangles, regardless of whether they "overflow" to the next tiles or not.

Figure 5: Example of disk query evaluation

Before examining the rectangles in the tiles of S (only the relevant classes), we can compute the maximum distance between each tile and q and, if this distance is found to be at most ϵ, the tile is marked as covered by the disk. For the tiles which are covered by the disk, we do not have to verify any distances between the objects assigned to them and q, as these objects are guaranteed to be within distance ϵ from q (i.e., they are definite results). Again, at each row, the set of tiles which are covered by the disk is contiguous, meaning that we only have to check the tiles in both directions, starting from the tile which includes q in dimension x, until we find the first one that violates the maximum distance condition.

Finally, the method described above for disk queries can be generalized to any query whose range is a convex polygon. We first find the set of tiles S which intersect the query range. Then, for each tile T ∈ S, we determine which classes of objects need to be examined (i.e., we exclude classes that would produce duplicates). For each tile which is totally covered by the query region, we simply report the contents of the relevant classes as results; for the remaining tiles, we conduct an intersection test for each rectangle before determining whether it is a result.

We now discuss the evaluation of the refinement step of range queries on our two-level partitioning index. We begin with a general and important lemma, which applies independently of our index and greatly reduces the number of objects for which the refinement step needs to be applied, especially for query ranges that are relatively large.

Lemma 4 (Refinement Step Avoidance).
Given a candidate object whose MBR r intersects the query range, if at least one side of r is inside the query range, then the object is guaranteed to intersect the range and no refinement step is necessary.

The lemma is trivial to prove, based on the definition of the MBR. Recall that the MBR of an object is defined by the minimum and maximum values of the object in every dimension. Hence, on each side of the MBR there is at least one point which is part of the object's geometry. If one side of the MBR is inside the query range, then there is at least one point of the object inside the query range, i.e., the object and the range intersect. The lemma generalizes to more than two dimensions: in a d-dimensional space, we test if at least one of the (d−1)-dimensional faces of the minimum hypercube that bounds the object is inside the query range.

For different range shapes, we can define specialized MBR side coverage tests. Consider a rectangular query range W and an object MBR r that intersects W. To apply a refinement avoidance test, we should verify whether W covers r in at least one dimension. If this is true, given that r intersects W, one of the cases shown in Figure 6(a) should happen: either one side of r is inside the window W and Lemma 4 applies (see r_a in Figure 6(a)), or r splits W along the coverage axis (see r_b in Figure 6(a)). In both cases, whatever the geometry of the object is, the object definitely intersects W. For a disk query range D, we can check whether there are at least two corners of r whose distances to the disk center are smaller than or equal to the disk radius. For example, in Figure 6(b), one of the rectangles has at least two corners in the disk range, which means that at least one of its sides is in the range and Lemma 4 applies; on the other hand, another rectangle has only one corner inside the disk, hence the refinement step for the corresponding object cannot be avoided.

Figure 6: Refinement step avoidance: (a) window query, (b) disk query
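The two shape-specific tests of Lemma 4 can be sketched as follows; the function names are hypothetical, but the logic is the coverage-in-one-dimension test for windows and the two-corner test for disks described above.

```cpp
// An MBR or window: b[i][0] = start, b[i][1] = end in dimension i.
struct Box { double b[2][2]; };

// Window query: if W covers the candidate MBR r in at least one dimension,
// the object surely intersects W and the refinement step can be skipped
// (either one side of r is inside W, or r splits W along the coverage axis).
bool window_refinement_avoidable(const Box& r, const Box& W) {
    for (int i = 0; i < 2; ++i)
        if (W.b[i][0] <= r.b[i][0] && r.b[i][1] <= W.b[i][1])
            return true;
    return false;
}

// Disk query (center q, radius eps): if at least two corners of r are inside
// the disk, at least one full side of r lies in the (convex) range, so the
// refinement step is unnecessary.
bool disk_refinement_avoidable(const Box& r, double qx, double qy, double eps) {
    int in = 0;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j) {
            double dx = r.b[0][i] - qx, dy = r.b[1][j] - qy;
            if (dx * dx + dy * dy <= eps * eps) ++in;  // corner inside disk
        }
    return in >= 2;
}
```

Note that the disk test compares squared distances, so the four corner checks need no square roots.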
Let us now turn our focus to our index and see how we can take advantage of Lemma 4 to apply the refinement avoidance test while minimizing the necessary comparisons. The main idea is to study the refinement avoidance test at the tile level, in order to limit the comparisons required for each class of objects in the tiles. Specifically, for each tile T that intersects a query range W and for each dimension i, we consider two cases: (i) W starts before T in dimension i, i.e., W[i][0] < T[i][0], and (ii) W[i][0] ≥ T[i][0]. In the first case, due to Lemma 1, only classes of rectangles that start inside T in dimension i are considered, which means that for each rectangle r ∈ T which is found to intersect W, we already know that W[i][0] < r[i][0]. Hence, we only have to test whether r[i][1] ≤ W[i][1] to confirm that r is covered by W in dimension i and that the refinement step for r is not necessary. On the contrary, in the case where W[i][0] ≥ T[i][0], we should apply the complete refinement avoidance test in dimension i (i.e., W[i][0] ≤ r[i][0] and r[i][1] ≤ W[i][1]) for each r ∈ T which is found to intersect W. (We assume that the geometry of each object is continuous.)

As an example, consider again the query in Figure 4. In the tile containing the lower-left corner of W, for each rectangle found to intersect W, we should apply the complete coverage test in each of the two dimensions before applying the refinement step. For each rectangle r in the remaining tiles of the bottom row, we only have to test whether r[0][1] ≤ W[0][1] for dimension 0, since these rectangles are in classes A and B and start inside the tile in dimension 0. Similarly, for each rectangle r in the remaining tiles of the leftmost column, we only have to test whether r[1][1] ≤ W[1][1] for dimension 1, since these rectangles are in classes A and C and start inside the tile in dimension 1. For all remaining tiles, if r[0][1] ≤ W[0][1] or r[1][1] ≤ W[1][1] holds, r is guaranteed to be a true result and no refinement is necessary.
In the previous sections, we presented how our two-level index handles single query requests. Real systems, however, receive and need to evaluate a large number of concurrent queries. Motivated by this, we next discuss how to efficiently process batches of spatial range queries. Although our focus is primarily on a single-threaded processing environment, parallel query processing on modern multi-core hardware can also benefit from the ideas discussed in this section. To this end, our experimental analysis includes both single-threaded and multi-threaded experiments.

Naturally, a straightforward approach for processing a workload of concurrent spatial range queries is to evaluate every query independently, directly applying the ideas discussed in the previous sections. In a parallel processing environment, we can easily adopt this approach by assigning the queries to the available threads in a round-robin fashion. We call this simple approach queries-based. Its main shortcoming is that it is cache agnostic; as every issued query q typically overlaps multiple tiles of the grid, the computation of q requires accessing data structures in different parts of the main memory, i.e., the memory access pattern is prone to cache misses. The problem is present also in parallel query processing, as every thread goes through multiple rounds of "context switching".

To address this shortcoming of queries-based, we design a cache-conscious two-step approach. Given a large batch of queries Q, we first accumulate, for each tile, the subtasks of all queries in Q that intersect the tile; each subtask corresponds to accessing and processing the (relevant to the query) classes of rectangles in the tile. Then, in a second step, we initiate one process at each tile, which evaluates the corresponding subtasks. Essentially, query processing is no longer driven by the queries but by the grid tiles and, therefore, we call this approach tiles-based.
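The two-step tiles-based scheme can be sketched as follows. This is a simplified, single-threaded illustration with hypothetical names (`Grid`, `run_batch`): the grid stores each rectangle undivided in every tile it overlaps and deduplicates results with the reference point of the rectangle/window intersection, standing in for the class-based duplicate avoidance of our index.

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

struct Rect { double b[2][2]; int id; };  // b[i][0]=start, b[i][1]=end, dim i

// Toy uniform grid: g x g tiles over the normalized space [0,1]^2.
struct Grid {
    int g;
    std::vector<std::vector<Rect>> tiles;  // row-major tile contents
    explicit Grid(int g_) : g(g_), tiles((size_t)g_ * g_) {}
    int clampi(double v) const { return std::min(g - 1, std::max(0, (int)(v * g))); }
    int tile_of(double x, double y) const { return clampi(y) * g + clampi(x); }
    void insert(const Rect& r) {           // replicate r into every tile it overlaps
        for (int ty = clampi(r.b[1][0]); ty <= clampi(r.b[1][1]); ++ty)
            for (int tx = clampi(r.b[0][0]); tx <= clampi(r.b[0][1]); ++tx)
                tiles[ty * g + tx].push_back(r);
    }
};

// tiles-based batch evaluation: step 1 accumulates, per tile, the subtasks of
// all queries overlapping it; step 2 processes each tile once, so its data
// stays warm in the cache across all subtasks assigned to it.
std::vector<std::vector<int>> run_batch(const Grid& G,
                                        const std::vector<Rect>& queries) {
    std::vector<std::vector<size_t>> subtasks(G.tiles.size());
    for (size_t q = 0; q < queries.size(); ++q) {
        const Rect& W = queries[q];
        for (int ty = G.clampi(W.b[1][0]); ty <= G.clampi(W.b[1][1]); ++ty)
            for (int tx = G.clampi(W.b[0][0]); tx <= G.clampi(W.b[0][1]); ++tx)
                subtasks[ty * G.g + tx].push_back(q);
    }
    std::vector<std::vector<int>> results(queries.size());
    for (size_t t = 0; t < G.tiles.size(); ++t)   // one pass per tile
        for (size_t q : subtasks[t])
            for (const Rect& r : G.tiles[t]) {
                const Rect& W = queries[q];
                bool hit = r.b[0][0] <= W.b[0][1] && W.b[0][0] <= r.b[0][1] &&
                           r.b[1][0] <= W.b[1][1] && W.b[1][0] <= r.b[1][1];
                // report only at the tile containing the reference point of
                // the intersection, to avoid duplicates across tiles
                double px = std::max(r.b[0][0], W.b[0][0]);
                double py = std::max(r.b[1][0], W.b[1][0]);
                if (hit && (int)t == G.tile_of(px, py))
                    results[q].push_back(r.id);
            }
    return results;
}
```

In step 2, the outer loop over tiles is what a parallel implementation would distribute over threads, one independent tile at a time.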
This method lends itself to parallel processing, since each thread (corresponding to a tile) can benefit from the processor's cache while processing the subtasks assigned to it. As we demonstrate in Section 7, the tiles-based approach scales better with the number of parallel threads compared to queries-based.

In this section, we present our experimental analysis. We first describe our setup and then our experiments, which investigate the
construction and update costs of our two-level index, as well as its efficiency and scalability in spatial range query evaluation.

Table 1: Datasets used in the experiments

dataset      | type        | card. | avg. x-extent | avg. y-extent
AREAWATER    | polygons    | 2.3M  | –             | –
LINEARWATER  | linestrings | 5.8M  | –             | –
ROADS        | linestrings | 20M   | –             | –
EDGES        | linestrings | 70M   | –             | –

Our analysis was conducted on a machine with 384 GB of RAM and a dual Intel(R) Xeon(R) CPU E5-2630 v4 clocked at 2.20GHz, running CentOS Linux 7.6.1810. All methods were implemented in C++ and compiled using gcc (v4.8.5) with flags -O3, -mavx and -march=native. For our parallel processing tests, we used OpenMP and activated hyper-threading, allowing us to run up to 40 threads.

Datasets.
We experimented with four of the publicly available Tiger 2015 datasets [11]. The input objects were normalized so that the coordinates in each dimension take values inside [0, 1]. Table 1 provides statistics about the datasets we used. The datasets contain from 2.3M to 70M objects, either polygons or linestrings. The last two columns of the table show the relative (over the entire space) average length of every object's MBR on each axis.

Methods. We designed two variants of our two-level indexing approach (presented in Section 3.2). In the first one, termed 2-level, each tile of the grid stores the (MBR, id) pairs of the indexed objects in four tables (one for each of the A, B, C, D classes), such that there is no particular order of the contents of each table (i.e., as in a heap file). This organization supports insertions efficiently, as the MBRs of new objects are simply appended to the tables of the tiles. In the second variant, termed 2-level+, the MBRs of each class are also stored in four decomposed tables, following the Decomposition Storage Model (DSM) [8] adopted by column-oriented database systems (e.g., [34]). Specifically, each rectangle r with id i is decomposed to four tuples, i.e., ⟨r[0][0], i⟩, ⟨r[0][1], i⟩, ⟨r[1][0], i⟩, ⟨r[1][1], i⟩, and each tuple is stored in a dedicated table. The tables are sorted by their first column and are used to evaluate fast queries on tiles where just one endpoint of each MBR needs to be compared (see Lemmas 2 and 3). In this case, 2-level+ takes advantage of the sorted decomposed tables to reduce the information that has to be accessed and the number of comparisons. For example, in the bottom-row tiles of Figure 4, we only have to access the decomposed tables of classes A and B with ⟨r[1][1], i⟩ tuples, to test whether r[1][1] ≥ W[1][0] for each rectangle there. Since these tuples are sorted, we perform binary search to find the first qualifying tuple in each table and then scan the tables forward from thereon.
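The binary-search-then-scan access to a sorted decomposed table can be sketched as follows, assuming for illustration a tile where Lemma 3 applies in dimension 1, so the table stores ⟨r[1][1], id⟩ pairs sorted by coordinate (the `Entry` and `scan_from` names are ours).

```cpp
#include <vector>
#include <algorithm>
#include <utility>

// One decomposed table of a class in a tile (DSM-style, as in 2-level+):
// pairs <endpoint value, object id>, sorted by value. The qualifying
// rectangles are exactly those with r[1][1] >= W[1][0].
using Entry = std::pair<double, int>;

std::vector<int> scan_from(const std::vector<Entry>& sorted_tab, double w_start) {
    // binary search for the first entry with value >= W[1][0] ...
    auto it = std::lower_bound(sorted_tab.begin(), sorted_tab.end(),
                               Entry{w_start, -1});
    std::vector<int> ids;
    for (; it != sorted_tab.end(); ++it)  // ... then scan forward to the end
        ids.push_back(it->second);
    return ids;
}
```

Only one of the four decomposed tables of the class is touched, and only its qualifying suffix is scanned, which is exactly the access-reduction benefit described above.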
2-level+ processes window queries very fast, but it is suited mostly for static data.

We also considered three competitors to our indexing scheme. The 1-level scheme, discussed in Section 3.1, indexes the MBRs of the input objects using a simple uniform grid; MBRs that span multiple tiles are replicated accordingly, but our proposal for a second level of indexing (i.e., Section 3.2) is not applied. When processing window queries, 1-level performs duplicate elimination using the reference point approach [10]. The second competitor is an in-memory STR-bulkloaded R-tree [19], taken from the Boost library (boost.org), which has a fanout of 16 for both inner and leaf nodes; this configuration is reported to perform the best (we also confirmed this by testing). The third competitor is BLOCK; the implementation was kindly provided by the authors of [25]. After testing this approach, we found it to be orders of magnitude slower than the R-tree (BLOCK takes seconds to evaluate range queries on our data), which can be attributed to the fact that BLOCK is implemented for 3D objects. Therefore, we decided to exclude BLOCK from the reported measurements. (The Tiger datasets are available at http://spatialhadoop.cs.umn.edu/datasets.html.)

Tests and parameters. To assess the effectiveness of the tested indices, we compared their space requirements, their building and update costs, and their query performance. For the partitioning-based schemes, i.e., 1-level, 2-level and 2-level+, we investigated the best granularity for their grid by varying the number of partitions (i.e., divisions) per dimension. By default, we use 2000 partitions per dimension, resulting in a 2000 × 2000 grid, and queries whose default extent is 0.1% of the area of the map.
We first justify our decision to focus on and optimize the filtering step of range query evaluation, which, in fact, has been the primary target of previous works as well. We used our 2-level index to execute both the filtering and refinement steps. We consider three variants of query evaluation, depending on the way refinement is performed (see Section 5); filtering is identical in all three variants. More specifically, under Simple, all candidates identified by the filtering step are passed to the refinement step; RefAvoid employs Lemma 4 as an extra pre-refinement filter to significantly reduce the number of candidates to be refined; last, RefAvoid+ enhances RefAvoid by reducing the number of comparisons required for testing Lemma 4 in window queries, as discussed at the end of Section 5.

Figure 7 illustrates the breakdown of the average execution time for both window and disk queries; note that for disk queries RefAvoid+ is not applicable. We make two important observations. First, the figure clearly shows the effectiveness of the refinement avoidance technique discussed in Section 5. Both RefAvoid and RefAvoid+ reduce the number of candidates to be refined by over 90% and so, the cost of their refinement step is always lower than that of Simple. To achieve this, however, they apply a refinement avoidance test on the MBRs; the cost of this test is higher in the case of disk queries, because it requires expensive distance computations between the disk center and the corners of the object MBRs. The second observation is that, when our refinement avoidance technique is used, the bottleneck of the query is the filtering step. Hence, in the subsequent experiments, we focus on the filtering step of spatial query processing.
We next investigate the building cost and the tuning of our two-level index. Figure 8 reports the indexing time for both the 2-level and 2-level+ variants and the size of each index, while varying the granularity of the underlying grid partitioning. For reference and completeness purposes, we include the 1-level competitor, which also uses a uniform grid to partition the input space. As the goal of the index is to efficiently answer spatial range queries, we additionally report the average query time in order to determine the best grid granularity.

Naturally, the indexing cost of all three indices grows when increasing the granularity of the grid. As expected, 1-level and 2-level have the same space requirements: regardless of employing the second level of partitioning or not, both indices store exactly the same number of object MBRs (originals and replicas); the difference is that inside each tile, 2-level divides the rectangles into four classes and stores them in dedicated structures, while 1-level stores all rectangles together. In terms of the indexing time, 2-level is slightly more expensive than 1-level, as it needs to first determine the class of each rectangle and then store it accordingly. On the other hand, the indexing cost of 2-level+ is higher than that of both the 1-level and 2-level indices. Remember that 2-level+ essentially stores a second copy of the rectangles inside every tile, after decomposing their coordinate information. As a result, the size of 2-level+ is 2.4x larger; its building time is also higher, due to the cost of computing, sorting and storing the decomposed replicas of the rectangles. The sizes of the packed R-trees (not shown) are about the same as the sizes of the corresponding 1-level (and 2-level) indices when 2000 partitions per dimension are used, indicating that the replication ratio of our indexes is very low. In addition, the bulk loading costs of the R-trees are 0.53s, 1.4s, 5.2s, and 19.5s for the four datasets, respectively, i.e., about 20% lower compared to the construction cost of 2-level+.

Let us now study the efficiency of each index. The key observation is that employing the second level of indexing significantly enhances query processing; 2-level and 2-level+ always outperform 1-level. Due to lack of space, Figure 8 reports only on window queries, but the same trend applies for disk queries. The fastest index is 2-level+, as it trades the extra used space for better query performance; nevertheless, 2-level is also significantly faster than 1-level. The last observation is that all three indices perform at their best when the underlying grid defines approximately 2000 partitions per dimension. Under this configuration, the number of tiles is not excessive and the indices do not have a large overhead in accessing and managing tiles; at the same time, the rectangles are small enough compared to the tile extent (see Table 1), so as not to incur excessive replication. For the rest of our analysis, 1-level, 2-level and 2-level+ all use a 2000 × 2000 grid.

In order to confirm the superiority of our proposed indexing scheme in updates compared to the R-tree, we conducted an experiment where, for all datasets, we first constructed the index by loading 90%
of the data in batch and then we measured the cost of incrementally inserting the last 10% of the data. Table 2 compares the total update costs of the R-tree, 1-level, and 2-level. As the table shows, the R-tree is two orders of magnitude slower than the baseline 1-level index, while the cost of updates on 2-level is only slightly higher than on 1-level.

Figure 7: Time breakdown: 2-level indexing with 2000 partitions per dimension, 10000 queries of 0.1% extent ratio

Table 2: Total update cost (sec)

dataset     | R-tree | 1-level | 2-level
AREAWATER   | 0.619  | 0.007   | 0.009
LINEARWATER | 1.574  | 0.023   | 0.027
ROADS       | 5.34   | 0.059   | 0.068
EDGES       | 19.8   | 0.220   | 0.241

We now evaluate the performance of the indexes in query processing. We first compare them in terms of their average query cost and then evaluate batch and parallel query processing.
Window queries. The first row of plots in Figure 9 reports the average execution time for window queries, while varying the relative area of the window compared to the data space. The plots also include the performance of the R-tree. Naturally, query processing is negatively affected by the increase of the relative window extent; as the windows grow larger, they overlap a larger number of objects, rendering the range queries more expensive. Regarding the comparison between 1-level, 2-level and 2-level+, we observe the same trend as in Figure 8. More importantly, though, we observe that our two-level index variants also clearly outperform the R-tree in all tests, because (i) the relevant partitions to each query are found very fast (without the need of traversing a hierarchical index) and (ii) they manage to drastically reduce the total number of computations. As expected, query processing using 2-level+ is the most efficient method for window queries and 2-level comes in second place.

Disk range queries. We turn our focus now to disk range queries and the second row of plots in Figure 9. Different to window queries, the 2-level+ index does not give any benefit compared to 2-level, because all coordinates of the object MBRs are needed to compute their distances to the center of the disk (i.e., the decomposed tables are not useful in the computations). Hence, 2-level+ is not included in the comparison (as it uses the complete MBR table in each class and has the same performance as 2-level). In addition, for disk queries, 1-level cannot use the reference point technique to eliminate duplicate results; instead, all tiles use a shared hash table into which all query results are inserted in order to eliminate duplicates. This explodes the cost of 1-level (it becomes one order of magnitude slower than 2-level), so we also skip it from the comparisons. The results clearly show once again the superiority of the 2-level index.
The advantage of 2-level over the R-tree is more pronounced compared to the case of window queries, as 2-level manages to avoid the distance computations for the majority of the query results (which belong to tiles covered by the query range).

Batch and parallel query processing. Figure 10 compares the two approaches (queries-based and tiles-based) discussed in Section 6 for batch window query processing (10K queries per batch).

Figure 8: Indexing and tuning: varying the granularity of the grid, 10000 window queries of 0.1% relative extent
A general observation from the plots is that tiles-based is superior to queries-based when the dataset is large (i.e., dense) and the queries are relatively large. In this case, the sizes of the dedicated tables for each class per tile are large and the cache-conscious tiles-based approach makes a difference. On the other hand, the overhead of finding and accumulating the subtasks per tile does not pay off when the number of queries on each tile is too small or when the tiles do not contain many rectangles. The advantage of tiles-based becomes more prominent in parallel query processing. Figure 11 shows the speedup of batch query evaluation on the two largest datasets (again, 10K queries per batch) as a function of the number of parallel threads. Note that tiles-based scales gracefully with the number of threads (up to about 25 threads, where it starts being affected by hyper-threading). On the other hand, queries-based scales poorly, due to the numerous cache misses.

In this paper, we presented a secondary partitioning approach that can be applied to space-partitioning spatial indexes, such as grids, and divides the indexed rectangles within each spatial partition (tile) into four classes. Our approach reduces the number of comparisons during range query evaluation and avoids the generation (and elimination) of duplicate results. In addition, we proposed a refinement avoidance technique for spatial range queries, which confirms as results the great majority of the objects that intersect the range, without needing to apply a refinement step for them. Finally, we investigated techniques for evaluating numerous range query requests in batch and in parallel. Our experimental findings confirm the efficiency of our proposed indexing scheme compared to a state-of-the-art in-memory implementation of the R-tree, as well as its scalability to multiple query evaluation in parallel.

In the future, we will dig more into parallel and distributed spatial query evaluation.
The fact that our indexing scheme facilitates parallel and independent query evaluation at each tile renders it a promising approach for distributed spatial data management. In addition, we plan to investigate the efficiency of our partitioning scheme on 3D data. Another direction is to investigate the effectiveness of our secondary partitioning scheme on other space-partitioning approaches, such as the quad-tree and the kd-tree, and the better management of data skew. Finally, we will study the evaluation of other popular query types, such as nearest neighbor queries and spatial joins.

[Figure 9: Query processing: varying query relative extent, 2000 partitions per dimension, for 1-level, 2-level, 2-level+. Panels for AREAWATER, LINEARWATER, ROADS, EDGES; top row: window queries (avg query time [µsecs]; R-tree, 1-level, 2-level, 2-level+); bottom row: disk queries (avg query time [msecs]; R-tree, 2-level); x-axis: query relative extent [%].]

[Figure 10: Batch query processing for window queries: 2000 partitions per dimension. Panels for AREAWATER, LINEARWATER, ROADS, EDGES; y-axis: total query time [secs], comparing query-based and tile-based; x-axis: query relative extent [%].]
[Figure 11: Batch query parallel processing: 10000 window queries of 1% relative extent, 2000 partitions per dimension. Panels for ROADS and EDGES; y-axis: speedup [×], comparing query-based and tile-based.]
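The tile-at-a-time batch strategy compared above can likewise be sketched in a self-contained way. Here `tiles` is a plain dict mapping `(col, row)` to the rectangles replicated into that tile, and duplicates are suppressed with the classic reference-point rule [10] rather than the four-class scheme; the function name and signature are hypothetical.

```python
from collections import defaultdict

def batch_tile_based(tiles, n, w, queries):
    """Tile-at-a-time batch evaluation over an n x n grid of tile width w.
    Instead of evaluating each query separately (query-based), bucket the
    queries under every tile they overlap, then scan each tile once for all
    of its queries. The per-tile subtasks share nothing, so they can be
    handed to different threads without synchronization."""
    def cell(v):
        return min(int(v / w), n - 1)

    # Phase 1: accumulate one subtask per (tile, queries-on-that-tile) pair.
    tasks = defaultdict(list)
    for qid, (qx1, qy1, qx2, qy2) in enumerate(queries):
        for c in range(cell(qx1), cell(qx2) + 1):
            for r in range(cell(qy1), cell(qy2) + 1):
                tasks[(c, r)].append(qid)

    # Phase 2: run the subtasks; each tile's rectangles are read once, while
    # hot in cache, for all the queries assigned to that tile.
    results = [[] for _ in queries]
    for (c, r), qids in tasks.items():
        for rect in tiles.get((c, r), []):
            x1, y1, x2, y2 = rect
            for qid in qids:
                qx1, qy1, qx2, qy2 = queries[qid]
                if x1 <= qx2 and qx1 <= x2 and y1 <= qy2 and qy1 <= y2:
                    # Reference-point rule: report the pair only from the tile
                    # holding the bottom-left corner of rect-query overlap.
                    if cell(max(x1, qx1)) == c and cell(max(y1, qy1)) == r:
                        results[qid].append(rect)
    return results
```

The per-tile loop in phase 2 is what a thread pool would execute in parallel; since every subtask writes only to the result lists of its own query ids for its own tile, contention is limited to result accumulation.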
REFERENCES
[1] Ablimit Aji, Fusheng Wang, Hoang Vo, Rubao Lee, Qiaoling Liu, Xiaodong Zhang, and Joel H. Saltz. 2013. Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce. PVLDB 6, 11 (2013), 1009–1020.
[2] Md Mahbub Alam, Suprio Ray, and Virendra C. Bhavsar. 2018. A Performance Study of Big Spatial Data Systems. In BigSpatial@SIGSPATIAL Workshop. 1–9.
[3] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. 1990. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. In SIGMOD. 322–331.
[4] Jon Louis Bentley. 1975. Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM 18, 9 (1975), 509–517.
[5] Jon Louis Bentley and Jerome H. Friedman. 1979. Data Structures for Range Searching. ACM Comput. Surv. 11, 4 (1979), 397–409.
[6] Thomas Brinkhoff, Hans-Peter Kriegel, and Bernhard Seeger. 1996. Parallel Processing of Spatial Joins Using R-trees. In ICDE. 258–265.
[7] Thomas Brinkhoff, Hans-Peter Kriegel, and Bernhard Seeger. 1993. Efficient Processing of Spatial Joins Using R-Trees. In SIGMOD. 237–246.
[8] George P. Copeland and Setrag Khoshafian. 1985. A Decomposition Storage Model. In SIGMOD. 268–279.
[9] Jens Dittrich, Lukas Blunschi, and Marcos Antonio Vaz Salles. 2009. Indexing Moving Objects Using Short-Lived Throwaway Indexes. In SSTD. 189–207.
[10] Jens-Peter Dittrich and Bernhard Seeger. 2000. Data Redundancy and Duplicate Detection in Spatial Join Processing. In ICDE. 535–546.
[11] Ahmed Eldawy and Mohamed F. Mokbel. 2015. SpatialHadoop: A MapReduce Framework for Spatial Data. In ICDE. 1352–1363.
[12] Raphael A. Finkel and Jon Louis Bentley. 1974. Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Inf. 4 (1974), 1–9.
[13] Volker Gaede and Oliver Günther. 1998. Multidimensional Access Methods. ACM Comput. Surv. 30, 2 (1998), 170–231.
[14] Antonin Guttman. 1984. R-Trees: A Dynamic Index Structure for Spatial Searching. In SIGMOD. 47–57.
[15] Christian S. Jensen, Dan Lin, and Beng Chin Ooi. 2004. Query and Update Efficient B+-Tree Based Indexing of Moving Objects. In VLDB. 768–779.
[16] Dmitri V. Kalashnikov, Sunil Prabhakar, and Susanne E. Hambrusch. 2004. Main Memory Evaluation of Monitoring Queries Over Moving Objects. Distributed and Parallel Databases 15, 2 (2004), 117–135.
[17] Norio Katayama and Shin'ichi Satoh. 1997. The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. In SIGMOD. 369–380.
[18] Kihong Kim, Sang Kyun Cha, and Keunjoo Kwon. 2001. Optimizing Multidimensional Index Trees for Main Memory Access. In SIGMOD. 139–150.
[19] Scott T. Leutenegger, J. M. Edgington, and Mario A. López. 1997. STR: A Simple and Efficient Algorithm for R-Tree Packing. In ICDE. 497–506.
[20] Nikos Mamoulis. 2011. Spatial Data Management. Morgan & Claypool Publishers.
[21] Mohamed F. Mokbel, Xiaopeng Xiong, and Walid G. Aref. 2004. SINA: Scalable Incremental Processing of Continuous Queries in Spatio-temporal Databases. In SIGMOD. 623–634.
[22] Kyriakos Mouratidis, Marios Hadjieleftheriou, and Dimitris Papadias. 2005. Conceptual Partitioning: An Efficient Method for Continuous Nearest Neighbor Monitoring. In SIGMOD. 634–645.
[23] Parth Nagarkar, K. Selçuk Candan, and Aneesha Bhat. 2015. Compressed Spatial Hierarchical Bitmap (cSHB) Indexes for Efficiently Processing Spatial Range Query Workloads. PVLDB 8, 12 (2015), 1382–1393.
[24] Duc Hai Nguyen, Khue Doan, and Tran Vu Pham. 2016. SIDI: A Scalable In-Memory Density-based Index for Spatial Databases. In DIDC@HPDC. 45–52.
[25] Matthaios Olma, Farhan Tauheed, Thomas Heinis, and Anastasia Ailamaki. 2017. BLOCK: Efficient Execution of Spatial Range Queries in Main-Memory. In SSDBM. 15:1–15:12.
[26] Varun Pandey, Andreas Kipf, Thomas Neumann, and Alfons Kemper. 2018. How Good Are Modern Spatial Analytics Systems? PVLDB 11, 11 (2018), 1661–1673.
[27] Jignesh M. Patel and David J. DeWitt. 1996. Partition Based Spatial-Merge Join. In SIGMOD. 259–270.
[28] Jignesh M. Patel and David J. DeWitt. 2000. Clone Join and Shadow Join: Two Parallel Spatial Join Algorithms. In ACM-GIS. 54–61.
[29] Suprio Ray, Rolando Blanco, and Anil K. Goel. 2014. Supporting Location-Based Services in a Main-Memory Database. In IEEE MDM. 3–12.
[30] Suprio Ray, Bogdan Simion, Angela Demke Brown, and Ryan Johnson. 2014. Skew-resistant Parallel In-memory Spatial Join. In SSDBM. 6:1–6:12.
[31] Hanan Samet. 1990. The Design and Analysis of Spatial Data Structures. Addison-Wesley.
[32] Hanan Samet. 2006. Foundations of Multidimensional and Metric Data Structures. Academic Press.
[33] Darius Sidlauskas, Simonas Saltenis, Christian W. Christiansen, Jan M. Johansen, and Donatas Saulys. 2009. Trees or Grids? Indexing Moving Objects in Main Memory. In SIGSPATIAL/ACM-GIS. 236–245.
[34] Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth J. O'Neil, Patrick E. O'Neil, Alex Rasin, Nga Tran, and Stanley B. Zdonik. 2005. C-Store: A Column-oriented DBMS. In VLDB. 553–564.
[35] David A. White and Ramesh C. Jain. 1996. Similarity Indexing with the SS-tree. In ICDE. 516–523.
[36] Dong Xie, Feifei Li, Bin Yao, Gefei Li, Liang Zhou, and Minyi Guo. 2016. Simba: Efficient In-Memory Spatial Analytics. In SIGMOD. 1071–1085.
[37] Simin You, Jianting Zhang, and Le Gruenwald. 2015. Large-scale Spatial Join Query Processing in Cloud. In CloudDB, ICDE Workshops. 34–41.
[38] Jia Yu, Zongsi Zhang, and Mohamed Sarwat. 2019. Spatial Data Management in Apache Spark: the GeoSpark Perspective and Beyond. GeoInformatica 23, 1 (2019), 37–78.
[39] Xiaohui Yu, Ken Q. Pu, and Nick Koudas. 2005. Monitoring K-Nearest Neighbor Queries Over Moving Objects. In ICDE. 631–642.
[40] Xiaofang Zhou, David J. Abel, and David Truffet. 1997. Data Partitioning for Parallel Spatial Join Processing. In