R*-Grove: Balanced Spatial Partitioning for Large-scale Datasets
Tin Vu and Ahmed Eldawy
Department of Computer Science & Engineering
University of California, Riverside, CA 92521, USA
{tvu032,eldawy}@ucr.edu

July 21, 2020

Abstract
The rapid growth of big spatial data urged the research community to develop several big spatial data systems. Regardless of their architecture, one of the fundamental requirements of all these systems is to spatially partition the data efficiently across machines. The core challenges of big spatial partitioning are building high spatial quality partitions while simultaneously taking advantage of distributed processing models by providing load-balanced partitions. Previous works on big spatial partitioning reuse existing index search trees as-is, e.g., the R-tree family, STR, Kd-tree, and Quad-tree, by building a temporary tree for a sample of the input and using its leaf nodes as partition boundaries. However, we show in this paper that none of those techniques addresses the mentioned challenges completely. This paper proposes a novel partitioning method, termed R*-Grove, which can partition very large spatial datasets into high-quality partitions with excellent load balance and block utilization. This appealing property allows R*-Grove to outperform existing techniques in spatial query processing. R*-Grove can be easily integrated into any big data platform such as Apache Spark or Apache Hadoop. Our experiments show that R*-Grove outperforms the existing partitioning techniques for big spatial data systems. With all the proposed work publicly available as open source, we envision that R*-Grove will be adopted by the community to better serve big spatial data research.
Keywords: big spatial data, partitioning, R*-Grove, index optimization, query processing
1 Introduction

The recent few years witnessed a rapid growth of big spatial data collected by different applications such as satellite imagery [14], social networks [30], smart phones [20], and VGI [18]. Traditional spatial DBMS technology could not scale up to these petabytes of data, which led to the birth of many big spatial data management systems such as SpatialHadoop [12], GeoSpark [42], Simba [41], LocationSpark [35], and Sphinx [10], to name a few.

Regardless of their architecture, all these systems need an essential preliminary step that partitions the data across machines before the execution can be parallelized. This is also known as global indexing [13]. A common method, first introduced in SpatialHadoop [12], is the sample-based STR partitioner. This method picks a small sample of the input to determine its distribution, packs this sample using the STR packing algorithm [26], and then uses the boundaries of the leaf nodes to partition the entire data. Figure 1(a) shows an example of an STR-based partitioning where each data partition is depicted by a rectangle. The method was later generalized by replacing the STR bulk loading algorithm with other spatial indexes such as Quad-tree [33], Kd-tree, and Hilbert R-trees [23, 8].

Figure 1: Comparison between STR and R*-Grove. (a) STR-based partitioning [12]: the thin and wide partitions reduce the query efficiency. (b) The proposed R*-Grove method with square-like and balanced partitions.
The STR-based partitioning was very attractive due to its simplicity and good load balancing, which is very important for distributed applications. Its simplicity urged many other researchers to adopt it in their systems, such as GeoSpark [42] and Simba [41] for in-memory distributed processing; Sphinx [10] for SQL big spatial data processing; HadoopViz [15, 17] for scalable visualization of big spatial data; and in distributed spatial join [32].

Despite their wide use, the existing partitioning techniques all suffer from one or more of the following three limitations. First, some partitioning techniques (STR, Kd-tree) prioritize load balance over spatial quality, which results in suboptimal partitions. This is apparent in Figure 1(a) where the thin and wide partitions result in low overall quality for the partitions since square-like partitions are preferred for most spatial queries. Square-like partitions are preferred in indexing because they indicate that the index is not biased towards one dimension. Also, since most queries are shaped like a square or a circle, square-like partitions would minimize the overlap with the queries [2]. Second, they could produce partitions that do not fill the HDFS blocks in which they are stored. Big data systems are optimized to process full blocks, i.e., 128 MB, to offset the fixed overhead in processing each block. However, the index structures used in existing partitioning techniques, e.g., R-trees, Kd-tree, Quad-tree, produce nodes with a number of records in the range [m, M], where m ≤ M/2. In practice, m can be as low as 0.3M [2, 4]. While those underutilized index nodes were desirable for disk indexing as they can accommodate future inserts, they result in underutilized blocks as depicted in Figure 1(a) where all blocks are less than 80% full. Moreover, this design might also produce poor load balance among partitions due to the wide range of partition sizes. Third, all existing partitioning techniques rely on a sample and try to balance the number of records per partition. This resembles traditional indexes where the index contains record IDs. However, in big spatial data partitioning, the entire record is written in each partition, not just its ID. When records are highly variant in size, all existing techniques end up with extremely unbalanced partitions.

This paper proposes a novel spatial partitioning technique for big data, termed R*-Grove, which completely addresses all three aforementioned limitations. First, it produces high-quality partitions by utilizing the R*-tree optimization techniques [2], which aim at minimizing the total area, overlap area, and margins. The key idea of the R*-Grove partitioning technique is to start with one partition that contains all sample points and then use the node split algorithm of the R*-tree to split it into smaller partitions. This results in compact square-like partitions as shown in Figure 1(b). Second, in order to ensure that we produce full blocks and balanced partitions, R*-Grove introduces a new constraint that puts a lower bound on the ratio between the smallest and the largest block, e.g., 95%. This property is theoretically proven and practically validated by our experiments. Third, when the input records have variable sizes, R*-Grove combines a data size histogram with the sample points to assign a weight to each sample point.
These weights are utilized to guarantee that the size of each partition falls in a user-defined range.

Given the wide adoption of the previous STR-based partitioner, we believe the proposed R*-Grove will be widely used in big spatial data systems. This impacts a wide range of spatial analytics and processing algorithms including indexing [8, 37], range queries [12, 42], kNN queries [12], visualization [15, 17], spatial join [22], and computational geometry [11, 27]. All the work proposed in this paper is publicly available as open source and supports both Apache Spark and Apache Hadoop. We run an extensive experimental evaluation with up to 500 GB and 7 billion record datasets and up to nine dimensions. The experiments show that R*-Grove consistently outperforms existing STR-based, Z-curve-based, Hilbert-curve-based, and Kd-tree-based techniques in both partition quality and query efficiency.

The rest of this paper is organized as follows. Section 2 describes the related works. Section 3 gives a background about big spatial data partitioning. Section 4 describes the proposed R*-Grove technique. Section 5 describes the advantages of R*-Grove in popular case studies of big spatial data systems. Section 6 gives a comprehensive experimental evaluation of the proposed work. Finally, Section 7 concludes the paper.

2 Related Work

This section discusses the related work in big spatial data partitioning. In general, distributed indexes for big spatial data are constructed in two levels: one global index that partitions the data across machines, and several local indexes that organize the records in each partition. Previous work [12, 29, 13] showed that the global index provides a far larger improvement than local indexes. Therefore, in this paper we focus on global indexing, which can be easily combined with any of the existing local indexes.
The work in global indexing can be broadly categorized into three approaches, namely, sampling-based methods, space-filling-curve (SFC)-based methods, and quad-tree-based methods.

The sampling-based method picks a small sample from the input data to infer its distribution. The sample is loaded into an in-memory index structure while adjusting the data page capacity, e.g., leaf node capacity, such that the number of data pages is roughly equal to the desired number of partitions. The order of sample objects does not affect the partition quality, since the sample is uniformly taken from the entire input dataset. Furthermore, most algorithms sort the data as part of the partitioning process, so the original order is completely lost. Some R-tree bulk-loading algorithms (STR [26] or OMT [25]) can also be used to speed up the tree construction time. Then, the minimum bounding rectangles (MBRs) of the data pages are used to partition the entire dataset. This method was originally proposed for spatial join and denoted the seeded-tree [28]. It was then used for big spatial indexing in many systems including SpatialHadoop [12, 8], Scala-GiST [29], GeoSpark [42], Sphinx [10], Simba [41], and many other systems. This technique can be used with existing R-tree indexes, but it suffers from two limitations: load imbalance and low quality of spatial partitions. Additionally, when there is a big variance in record sizes, the load imbalance is further amplified due to the use of the sample. We will further discuss these limitations in Section 4.

The SFC-based method builds a spatial index on top of an existing one-dimensional index by applying any space-filling curve, e.g., the Z-curve or Hilbert curve. MD-HBase [31] builds Kd-tree-like and Quad-tree-like indexes on top of HBase by applying the Z-curve on the input data and customizing the region split method in HBase to respect the structure of both indexes. GeoMesa [16] uses geohashing, which is also based on the Z-curve, to build spatio-temporal indexes on top of Accumulo. Unlike MD-HBase, which only supports point data, GeoMesa can support rectangular or polygonal geometries by replicating a record to all overlapping buckets in the geohash. While this method can ensure a near-perfect load balance, it produces an even bigger spatial overlap between partitions as compared to the sampling-based approach described above. This drawback leads to inefficient performance of spatial queries.

The quad-tree-based method relies heavily on the Quad-tree structure to build an efficient and scalable Quad-tree index in Hadoop [40]. It starts by splitting the input data into equi-sized chunks and building a partial Quad-tree for each split. Then, it combines the leaf nodes of the partial trees based on the Quad-tree structure to merge them into the final tree. While highly efficient, this method cannot generalize to other spatial indexes and is tightly tied to the Quad-tree structure. In addition, this Quad-tree-based partitioning tends to produce many more than the desired number of partitions, which also leads to load imbalance.

Although there are several partitioning techniques for large-scale spatial data as mentioned above, the sampling-based method is the most ubiquitous option and is integrated in most existing spatial data systems. Sampling-based methods are preferred as they are simple to implement and provide very good results.
In this paper, we follow the sampling-based approach and propose a method which utilizes the R*-tree's advantages that were never used before for big spatial data partitioning. The proposed R*-Grove index has three advantages over the existing work. First, it inherits and improves the R*-tree index structure to produce high-quality partitions that are tailored to big spatial data. Second, the improved algorithm produces balanced partitions by employing a user-defined parameter, termed the balance factor, α, e.g., 95%. In addition, it can produce spatially disjoint partitions which are necessary for some spatial analysis algorithms. Third, R*-Grove can couple a sample with a data size histogram to guarantee the desired load balance even when the input record sizes are highly variant. While R*-Grove is not the only framework for big spatial partitioning, it is the first one that is tailored for large-scale spatial datasets, while existing techniques reuse traditional index structures, such as R-tree, STR, or Quad-tree, as black boxes.

Figure 2: The sampling-based partitioning process (Input → Phase 1: Sampling → Phase 2: Boundary Computation → Phase 3: Partitioning)

3 Background
3.1 R*-tree

The R*-tree [2] belongs to the R-tree family [19] and improves the insertion algorithm to provide a high-quality index. In an R-tree, the number of children in each node has to be in the range [m, M]. By design, m can be at most ⌊M/2⌋ to ensure that splitting a node of size M + 1 is feasible. In this paper, we utilize and enhance two main functions of the R*-tree index, namely, ChooseSubtree and SplitNode, which are both used in the insertion process. Given the MBR of a record and a tree node, the ChooseSubtree method chooses the best subtree to assign this record to. The SplitNode method takes an overflow node with M + 1 records and splits it into two nodes.

3.2 Sampling-based Partitioning

This section gives a background on the sampling-based partitioning technique [12, 8, 37], just partitioning hereafter, that this paper relies on. Figure 2 shows the workflow for the partitioning algorithm, which consists of three phases, namely, sampling, boundary computation, and partitioning. The sampling phase (Phase 1) draws a random sample of the input records and converts each one to a point. Notice that sample points are picked from the entire file in no particular order, so the order of points does not affect the next steps. The boundary computation phase (Phase 2) runs on a single machine and processes the sample to produce partition boundaries as a set of rectangles. Given a sample S, the input size D, and the desired partition size B, this phase adjusts the capacity of each partition to contain M = ⌈|S| · B/D⌉ sample points, which is expected to produce final partitions with the size of one block each. The final partitioning phase (Phase 3) scans the entire input in parallel and assigns each record to these partitions based on the MBR of the record and the partition boundaries. If each record is assigned to exactly one partition, the partitions will be spatially overlapping with no data replication. If each record is assigned to all overlapping partitions, the partitions will be spatially disjoint but some records can be replicated, and duplicate handling will be needed in the query processing [7]. Some algorithms can only work if the partitions are spatially disjoint, such as visualization [15] and some computational geometry functions [27].

The proposed R*-Grove method expands Phase 1 by optionally building a histogram of storage size that assists the partitioning algorithm in Phase 2. In Phase 2, it adapts R*-tree-based algorithms to produce the partition boundaries with the desired level of load balance.
In Phase 3, we propose a new data structure thatimproves the performance of that phase and allows us to produce spatially disjoint partitions if needed.
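As a concrete illustration of the capacity computation in Phase 2, the sketch below derives M = ⌈|S| · B/D⌉ for hypothetical figures (the function name and the sample/input sizes are ours, not from the paper's implementation):

```python
import math

def partition_capacity(num_samples, input_bytes, block_bytes):
    """M = ceil(|S| * B / D): the number of sample points per partition
    so that each final partition is expected to fill about one block."""
    return math.ceil(num_samples * block_bytes / input_bytes)

# Hypothetical figures: one million sample points drawn from a 500 GB input,
# partitioned into the default 128 MB HDFS blocks.
S = 1_000_000              # |S|: number of sample points
D = 500 * 1024**3          # D: input size in bytes (500 GB)
B = 128 * 1024**2          # B: HDFS block size in bytes (128 MB)
M = partition_capacity(S, D, B)
print(M)  # 250 sample points per boundary rectangle
```

Since the input spans 4,000 blocks, each boundary rectangle should enclose about 1/4,000 of the sample, i.e., 250 points, so that the corresponding final partition holds roughly one block of data.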
This paper uses the quality metrics defined in [9]. Below, we redefine these metrics while accounting for the case of partitions that span multiple HDFS blocks. A single partition π_i is defined by two parameters, a minimum bounding box mbb_i and a size in bytes size_i. Given the HDFS block size B, e.g., 128 MB, we define the number of blocks for a partition π_i as b_i = ⌈size_i / B⌉. Given a dataset that is partitioned into a set of l partitions, P = {π_i}, we define the five quality metrics as follows.

Definition 1 (Total Volume - Q1). The total volume is the sum of the volumes of all partitions, where the volume of a partition is the product of its side lengths.

Q1(P) = Σ_{π_i ∈ P} b_i · volume(mbb_i)

We multiply by the number of blocks b_i because big spatial data systems process each block separately. Lowering the total volume is preferred to minimize the overlap with a query. Given the popularity of the two-dimensional case, this is usually used under the term total area.

Definition 2 (Total Volume Overlap - Q2). This quality metric measures the sum of the overlap between pairs of partitions.

Q2(P) = Σ_{π_i, π_j ∈ P, i ≠ j} b_i · b_j · volume(mbb_i ∩ mbb_j) + Σ_{π_i ∈ P} b_i · (b_i − 1) · volume(mbb_i)

where mbb_i ∩ mbb_j is the intersection region between the two boxes. The first term calculates the overlaps between pairs of partitions and the second term accounts for self-overlap, which treats a partition with multiple blocks as overlapping partitions. Lowering the volume overlap is preferred to keep the partitions apart.

Definition 3 (Total Margin - Q3). The margin of a block is the sum of its side lengths. The total margin is the sum of all margins as given below.

Q3(P) = Σ_{π_i ∈ P} b_i · margin(mbb_i)

Similar to Q1, multiplying by the number of blocks b_i treats each block as a separate partition. Lowering the total margin is preferred to produce square-like partitions.
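To make the geometric metrics concrete, the following sketch computes them for partitions given as (mbb, size-in-bytes) pairs. The representation and helper names are ours; we read the pairwise sum as ranging over ordered pairs (hence the factor of 2 over unordered pairs) and the self-overlap term as b_i(b_i − 1) · volume(mbb_i), which is one plausible reading of the definition:

```python
import math
from itertools import combinations

B = 128 * 1024**2  # assumed HDFS block size in bytes

def volume(mbb):
    """mbb = ((min coords), (max coords)); volume = product of side lengths."""
    lo, hi = mbb
    v = 1.0
    for a, b in zip(lo, hi):
        v *= max(0.0, b - a)
    return v

def margin(mbb):
    """Margin = sum of side lengths."""
    lo, hi = mbb
    return sum(b - a for a, b in zip(lo, hi))

def intersect(m1, m2):
    """Intersection box of two mbbs (may be empty; volume() clamps to 0)."""
    lo = tuple(max(a, c) for a, c in zip(m1[0], m2[0]))
    hi = tuple(min(b, d) for b, d in zip(m1[1], m2[1]))
    return (lo, hi)

def blocks(size_bytes):
    """b_i = ceil(size_i / B)."""
    return math.ceil(size_bytes / B)

def q1_total_volume(parts):
    return sum(blocks(s) * volume(m) for m, s in parts)

def q2_total_overlap(parts):
    # First term is over ordered pairs i != j, hence the factor of 2.
    q = sum(2 * blocks(s1) * blocks(s2) * volume(intersect(m1, m2))
            for (m1, s1), (m2, s2) in combinations(parts, 2))
    # Self-overlap: a partition spanning b_i blocks acts as b_i overlapping partitions.
    q += sum(blocks(s) * (blocks(s) - 1) * volume(m) for m, s in parts)
    return q

def q3_total_margin(parts):
    return sum(blocks(s) * margin(m) for m, s in parts)
```

For example, a 100 MB partition with box (0,0)-(2,1) and a 200 MB partition with box (1,0)-(3,1) use one and two blocks respectively, giving Q1 = 6, Q2 = 8, and Q3 = 9 under these readings.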
Definition 4 (Block Utilization - Q4). Block utilization measures how full the HDFS blocks are.

Q4(P) = (Σ_{π_i ∈ P} size_i) / (B · Σ_{π_i ∈ P} b_i)

The numerator Σ size_i represents the total size of all partitions and the denominator B · Σ b_i is the maximum amount of data that can be stored in all blocks used by these partitions. In big data applications, each block is processed in a separate task which has a setup time of a few seconds. Having full or near-full blocks minimizes the overhead of the setup. The maximum value of block utilization is 1.0, or 100%.

Definition 5 (Standard Deviation of Sizes - Q5).

Q5(P) = sqrt( Σ_{π_i ∈ P} (size_i − sizē)² / l )

where sizē = Σ size_i / l is the average partition size. Lowering this value is preferred to balance the load across partitions.

Algorithm 1 A simplified version of the traditional R*-tree splitting mechanism.
Inputs: P is the list of all sample records; m is the minimum size of a node.
Output: the optimal splitting position.
1: function ChooseSplitPoint(P, m)
2:   chosenK = -1; minCost = ∞
3:   for k in [m, |P| − m] do
4:     P1 = P[1..k]            ▷ P1 is the first k records of P
5:     P2 = P[k+1..|P|]        ▷ P2 is all the remaining records P − P1
6:     Calculate the cost of the partitions P1 and P2
7:     if the cost is smaller than minCost then
8:       Set chosenK = k and update minCost
9:   return chosenK

4 R*-Grove Partitioning

This section describes the details of the proposed R*-Grove partitioning algorithm. R*-Grove employs three techniques that overcome the limitations of existing works. The first technique adapts the R*-tree index structure for spatial partitioning by utilizing the ChooseSubtree and SplitNode functions in the sample-based approach described in Section 3. This technique ensures a high spatial quality of partitions. The second technique addresses the problem of load balancing by introducing a new constraint that guarantees a user-defined ratio between the smallest and largest partitions. The third technique combines the sample points with a storage histogram to balance the sizes of the partitions rather than the number of records. This combination allows R*-Grove to precisely produce partitions with a desired block utilization, which cannot be achieved by any other partitioning technique.
4.1 High-quality Partitioning with the R*-tree

This part describes how R*-Grove utilizes the R*-tree index structure to produce high-quality partitions. It utilizes the SplitNode and ChooseSubtree functions from the R*-tree algorithm in Phases 2 and 3 as described shortly. A naïve method [38] is to use the R*-tree as a black box in Phase 2 in Figure 2 and insert all the sample points into an R*-tree. Then, it emits the MBRs of the leaf nodes as the output partition boundaries. However, this technique was shown to be inefficient as it processes the sample points one-by-one and does not integrate the R*-tree index well in the partitioning algorithm. Therefore, we propose an efficient approach that runs much faster and produces higher-quality partitions. It extends Phases 2 and 3 as follows.

Phase 2 computes partition boundaries by only using the SplitNode algorithm from the R*-tree index, which splits a node with M + 1 records into two nodes with the size of each one in the range [m, M]. This algorithm starts by choosing the split axis, e.g., x or y, that minimizes the total margin. Then, all the points are sorted along the chosen axis and the split point is chosen as depicted in Algorithm 1. The ChooseSplitPoint algorithm simply considers all the split points and chooses the one that minimizes some cost function, which is typically the total area of the two resulting partitions.

We set M = ⌈|S| · B/|D|⌉ as explained in Section 3 and m = 0.3M as recommended in the R*-tree paper. In particular, this phase starts by creating a single big tree node that has all the sample points S. Then, it recursively calls the SplitNode algorithm as long as the resulting node has more than M elements. This top-down approach has a key advantage over building the tree record-by-record as it allows the algorithm to look at all the records at the beginning and optimize for all of them. Furthermore, it avoids the ForcedReinsert algorithm, which is known to slow down the R*-tree insertion process. Notice that this is different from the bulk loading algorithms as it does not produce a full tree. Rather, it just produces a set of boundaries that are used as partitions. Phase 3 treats all the MBRs as leaf nodes in an R-tree and uses the ChooseLeaf method from the R*-tree to assign an input record to a partition.

Algorithm 2 R*-tree-based split while ensuring valid partitions.
Inputs: P is the list of all sample records; [m, M] is the target range of sizes for final partitions.
Output: the optimal splitting position.
1: function ChooseValidSplitPoint(P, m)
2:   for k in [m, |P| − m] do
3:     if either k or |P| − k is invalid then   ▷ Lemma 1
4:       Skip this iteration and continue
5:     Similar to Lines 4-8 in Algorithm 1
6:   return chosenK
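The top-down loop of Phase 2 can be sketched as follows. This is a simplified two-dimensional Python illustration with function names of our own; the cost here is just the total area of the two groups, standing in for the paper's full R*-tree cost functions:

```python
def mbr_area(points):
    """Area of the minimum bounding rectangle of a set of 2-D points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

def split_once(points, m):
    """Algorithm 1, simplified: try both axes and every split index
    k in [m, |P| - m], keeping the split with the smallest total area."""
    best = None
    for axis in (0, 1):
        pts = sorted(points, key=lambda p: p[axis])
        for k in range(m, len(pts) - m + 1):
            cost = mbr_area(pts[:k]) + mbr_area(pts[k:])
            if best is None or cost < best[0]:
                best = (cost, pts[:k], pts[k:])
    return best[1], best[2]

def compute_boundaries(points, M, m):
    """Phase 2, top-down: recursively split any group with more than
    M points; each leaf group then defines one partition boundary."""
    if len(points) <= M:
        return [points]
    left, right = split_once(points, m)
    return compute_boundaries(left, M, m) + compute_boundaries(right, M, m)
```

On two well-separated clusters, e.g., the points (0,0), (0,1), (10,0), (10,1) with M = 2 and m = 1, the split along x between the clusters has zero total area and is chosen, yielding two compact groups.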
Run-time analysis: The SplitNode algorithm can be modeled as a recursive algorithm where each iteration sorts all the points and runs the linear-time splitting algorithm to produce two smaller partitions. The run-time can be expressed as T(n) = T(k) + T(n − k) + O(n log n), where k is the size of one group resulting from the partitioning and n is the number of records in the input partition. In particular, T(k) and T(n − k) are the running times to partition the two groups resulting from the splitting process. The term O(n log n) is the running time for the splitting part, which requires sorting all the points. This recurrence relation has a worst case of O(n² log n) if k is always n − 1. In order to guarantee a run-time of O(n log² n), we define a parameter ρ ∈ [0, 0.5] which defines the minimum splitting ratio k/n. Setting this parameter to any non-zero fraction guarantees an O(n log² n) run-time. However, the restriction on k/n also limits the range of possible values of k. For example, if n = 100 and ρ = 0.3, k must be a number in the range [30, 70]. As ρ gets closer to 0.5, the two partitions become closer in size and the run-time decreases, but the quality of the index might also deteriorate due to the limited search space imposed by this parameter. To incorporate this parameter in the node-splitting algorithm, we call the ChooseSplitPoint function with the parameters (P, max{m, ρ · |P|}), where |P| is the number of points in the list P.

4.2 Producing Balanced Partitions

In this section, we focus on balancing the number of records in partitions assuming equal-size records. We further extend this in the next section to support variable-size records. The method in Section 4.1 does a good job in producing high-quality partitions similar to what the R*-tree provides. However, it does not address the second limitation, that is, balancing the sizes of the partitions. Recall that the R-tree index family requires the leaf nodes to have sizes in the range [m, M], where m ≤ M/2. With the R*-tree algorithm explained earlier, some partitions might be only 30% full, which reduces block utilization and load balance. We would like to be able to set m to a larger value, say, m = 0.95M. Unfortunately, if we do so, the SplitNode algorithm would just fail because it will face a situation where there is no valid partitioning.

To illustrate the limitation of the SplitNode mechanism, consider the following simple example. Let us assume we choose m = 9 and M = 10 while the list P contains 28 points. If we call the SplitNode algorithm on the 28 points, it might produce two partitions with 14 records each. Since both contain more than M = 10 points, the splitting method will be called again on each of them, which will produce an incorrect answer since there is no way to split 14 records into two groups such that each of them contains between 9 and 10 records. A correct splitting algorithm would produce three partitions with sizes 9, 9, and 10. Therefore, we need to introduce a new constraint to the splitting algorithm so that it always produces partitions with sizes in the range [m, M].

The key finding:
The SplitNode algorithm can be minimally modified to guarantee final leaf partitions in the range [m, M] by satisfying the following validity constraint: ⌈S_i/M⌉ ≤ ⌊S_i/m⌋, i ∈ {1, 2}, where S_1 and S_2 are the sizes of the two resulting partitions of the split. Algorithm 2 depicts the main changes to the algorithm, which introduces a new constraint test in Line 3 that skips over invalid partitionings. The rest of this section provides the theoretical proof that this simple constraint guarantees the algorithm's termination with leaf partitions in the range [m, M]. We start with the following definition.

Definition 6 (Valid Partition Size). An integer number S is said to be a valid partition size with respect to a range [m, M] if there exists a set of integers X = {x_1, ..., x_n} such that Σ x_i = S and x_i ∈ [m, M] for all i ∈ [1, n]. In words, if we have S records, there is at least one way to split them such that each split has between m and M records.

For example, if m = 9 and M = 10, the sizes 14, 31, and 62 are all invalid, while the sizes 9, 27, and 63 are valid. Therefore, to produce balanced partitions, the SplitNode algorithm should keep the invariant that the partition sizes are always valid according to the above definition. Going back to the earlier example, if S = 28, the answer S_1 = S_2 = 14 will be rejected because 14 is invalid. Rather, the first call to the SplitNode algorithm will result in two partitions with sizes {9, 19} or {10, 18}. The following lemma shows how to test a size for validity in constant time.

Lemma 1 (Validity Test). An integer S is a valid partition size w.r.t. a range [m, M] iff L ≤ U, in which L (lower bound) and U (upper bound) are computed as:

L = ⌈S/M⌉    U = ⌊S/m⌋
Proof.
First, if S is valid then, by definition, there is a partitioning of S into n partitions such that each partition is in the range [m, M]. It is easy to show that L ≤ U and we omit this part for brevity. The second part is to show that if the inequality L ≤ U holds, then there is at least one valid partitioning. Based on the definitions of L and U, we have:

U = ⌊S/m⌋ ⇒ U ≤ S/m ⇒ S ≥ m · U ⇒ S ≥ m · L ⇒ S − m · L ≥ 0    (1)

L = ⌈S/M⌉ ⇒ L ≥ S/M ⇒ S ≤ M · L ⇒ S − m · L ≤ (M − m) · L    (2)

Based on Inequalities 1 and 2, we can make a valid partitioning as follows:

1. Start with L empty partitions. Assign m records to each partition. The remaining number of records is S − m · L ≥ 0. This is satisfied due to Inequality 1.

2. Since each partition now has m records, it can receive up to M − m additional records in order to keep its validity. Overall, L partitions of size m can accommodate up to (M − m) · L additional records while keeping a valid partitioning. The remaining number of records S − m · L is not larger than the upper limit of what the partitions can accommodate, (M − m) · L, as shown in Inequality 2. Therefore, this condition is satisfied as well.

In conclusion, it follows that if the condition L ≤ U holds, we can always find a valid partitioning scheme for S records, which completes the proof.

If we apply this test to the example above, we find that 28 is valid because L = ⌈28/10⌉ = 3 ≤ U = ⌊28/9⌋ = 3, while 62 is invalid because L = ⌈62/10⌉ = 7 > U = ⌊62/9⌋ = 6. This approach works fine as long as the initial sample size |S| is valid, but how do we guarantee the validity of |S|? We show that this is easily guaranteed if the size is above some threshold S*, as shown in the following lemma.

Lemma 2. Given a range [m, M], any partition of size S ≥ S* is valid, where S* is defined by the following formula:

S* = ⌈m/(M − m)⌉ · m    (3)

Proof.
Following Definition 6, we will prove that for any partition size S ≥ S*, there exists a way to split it into k groups such that the size of each group is in the range [m, M].

First, let i = ⌈m/(M − m)⌉. We have:

S ≥ S* = ⌈m/(M − m)⌉ · m = i · m    (4)

⇒ S = i · m + X, where X ≥ 0. Let X = a · m + b, with a ≥ 0 and 0 ≤ b < m    (5)

⇒ S = i · m + (a · m + b) = (i + a) · m + b, where a ≥ 0 and 0 ≤ b < m    (6)

Second, since b < m, we have:

b/(M − m) < m/(M − m) ≤ i ⇒ b/i < M − m    (7)

From Equations 6 and 7, we can make a valid partitioning for a partition of size S as follows:

1. Start with i + a empty partitions. Assign m records to each partition. The remaining number of records is b. This step is based on Equation 6.

2. Equation 7 means that we can split the b remaining records over i groups such that each group receives at most M − m additional records. Since we already have i + a groups, each of size m, adding at most M − m records to i groups out of them will increase their sizes to at most M, which still keeps them in the valid range [m, M]. The remaining groups will still have m records, making them valid too.

This completes the proof of Lemma 2.

Based on Lemma 2, a question is raised as to how large the sample size |S| should be to ensure that a good block utilization is achievable. As we mentioned from the beginning, R*-Grove allows us to configure a parameter α = m/M, called the balance factor, which is computed as the ratio between the minimum and maximum number of records of a leaf node in the tree.

Algorithm 3 Choose the split point with weights.
Inputs: P is the list of all sample records; w is an array of weights of the corresponding records in P; [m, M] is the target range of sizes for final partitions.
Output: the optimal splitting position.
1: function ChooseWeightedSplitPoint(P, w, m)
2:   W = Σ_{1 ≤ i ≤ |P|} w_i
3:   for k in [m, |P| − m] do
4:     W1 = Σ_{1 ≤ i ≤ k} w_i
5:     if either W1 or W − W1 is invalid then   ▷ Lemma 1
6:       Skip this iteration and continue
7:     Similar to Lines 4-8 in Algorithm 1
8:   return chosenK
α should be close to 1 to guarantee a good block utilization. Let us assume that 0 < r ≤ 1 is the sampling ratio and p is the storage size of a single point. The maximum number of records M is computed in Section 4.1 as:

M = ⌈|S| · B / D⌉ ⇒ M = ⌈(|S| · p / D) · (B / p)⌉ ⇒ M = ⌈r · B / p⌉    (8)

From Equation 8, we can rewrite Lemma 2 as:

|S| ≥ S* = ⌈m / (M − m)⌉ · m ⇒ |S| ≥ ⌈α / (1 − α)⌉ · α · ⌈r · B / p⌉ ⇒ |S| · p ≥ ⌈α / (1 − α)⌉ · α · ⌈r · B⌉    (9)

Therefore, assume that we want to configure the balance factor as α = 0.95, the sampling ratio as r = 1%, and the block size as B = 128 MB; then the term |S| · p in Equation 9 is computed as about 23 MB. In other words, if the storage size of the sample points satisfies |S| · p ≥ 23 MB, it is guaranteed that a valid partitioning can be produced. This is a reasonable size that can be stored in main memory and processed on a single machine.
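The bound in Equation 9 is easy to evaluate numerically. The short sketch below (with our own hypothetical function and variable names) reproduces the 23 MB figure from the example above:

```python
import math

def min_sample_bytes(alpha, r, block_bytes):
    """Lower bound on |S|*p from Equation 9:
    ceil(alpha/(1-alpha)) * alpha * ceil(r*B)."""
    return math.ceil(alpha / (1 - alpha)) * alpha * math.ceil(r * block_bytes)

bound = min_sample_bytes(alpha=0.95, r=0.01, block_bytes=128 * 1024 * 1024)
print(round(bound / (1024 * 1024)))  # about 23 (MB)
```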
The above two approaches can be combined to produce high-quality and balanced partitions in terms of the number of records. However, the partitioning technique needs to write the actual records in each partition, and often these records are of variable sizes. For example, the sizes of records in the OSM-Objects dataset [1] range from 12 bytes to 10 MB per record. Therefore, balancing the number of records can result in a huge variance in the partition sizes in terms of the number of bytes.

To overcome this limitation, we combine the sample points with a storage size histogram of the input as follows. The storage size histogram is used to assign a weight to each sample point that represents the total size of all records in its vicinity. To find these weights, Phase 1 computes, in addition to the sample, a storage size histogram of the input. This histogram is created by overlaying a uniform grid on the input space and computing the total size of all records that lie in each grid cell [6, 34]. This histogram is computed on the full dataset, not the sample; therefore, it captures the actual size of the input. After that, we count the number of sample points in each grid cell. Finally, we divide the total weight of each cell among all sample points in this cell. For example, if a cell has a weight of 1,000 bytes and contains five sample points, the weight of each point in this cell becomes 200 bytes.

In Phase 2, the
SplitNode function is further improved to balance the total weight of the points in each partition rather than the number of points. This also requires modifying the value of M to be M = ⌈Σ w_i / N⌉, where w_i is the weight assigned to the sample point p_i, and N is the desired number of partitions. Algorithm 3 shows how the algorithm is modified to take the weights into account. Line 4 calculates the weight of each partitioning point, which is used to test the validity of this split point, as shown in Line 5.

Unfortunately, if we apply this change, the algorithm is no longer guaranteed to produce balanced partitions. The reason is that the proof of Lemma 1 is no longer valid. That proof assumed that the partition sizes are defined in terms of the number of records, which makes all possible partition sizes part of the search space in the for-loop in Line 2 of Algorithm 2. However, when the size of each partition is the sum of the weights, the possible sizes are limited to prefix sums of the weights. For example, let us assume a partition with five points, all of the same weight w_i = 200, while m = 450 and M = 550. The condition in Definition 6 suggests that the total weight 1,000 is valid because L = ⌈1000/550⌉ = 2 ≤ U = ⌊1000/450⌋ = 2. However, given the weights w_i = 200 for i ∈ [1, 5], every candidate split produces a first partition whose weight is a multiple of 200, and none of these values falls in [450, 550], so no valid split exists.

To resolve this, we modify the SplitNode algorithm so that it still guarantees a valid partitioning even for the case described above. The key idea is to make minimal changes to the weights to ensure that the algorithm will terminate with a valid partitioning; we call this process weight correction. For example, the case described earlier is resolved by changing the weights of two points from 200 and 200 to 100 and 300. This results in the two groups of weights {200, 200, 100} and {300, 200}, i.e., two partitions of total weight 500 each, which is valid. Keep in mind that these weights are approximate anyway, as they are based on the sample and the histogram, so these minimal changes do not hugely affect the overall quality; yet, they ensure that the algorithm terminates correctly. The following part describes how these weight changes are applied while ensuring a valid answer.

First of all, we assume that the points are already sorted along the chosen axis, as explained in Section 4.1. Further, we assume that Algorithm 3 failed by not finding any valid partitioning, i.e., it returned -1. Now, we make the following definitions to use in the weight update function.

Definition 7.
Point position:
Let p_i be the point at position i in the sort order and let w_i be its weight. We define the position of point i as pos_i = Σ_{j≤i} w_j. Based on this definition, we can place all the points on a linear scale based on their positions, as shown in Figure 3(a).
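The weight assignment described earlier and the positions of Definition 7 can be sketched together as follows. This helper is our own illustration; for brevity it assumes each sample point is tagged with the id of the histogram cell that contains it:

```python
from itertools import accumulate
from collections import Counter

def assign_weights(sample_cells, cell_bytes):
    """sample_cells: the histogram cell id of each sample point (in sort order).
    cell_bytes: total storage size of all input records per cell.
    Returns per-point weights and positions pos_i = sum_{j<=i} w_j."""
    counts = Counter(sample_cells)                      # sample points per cell
    weights = [cell_bytes[c] / counts[c] for c in sample_cells]
    positions = list(accumulate(weights))               # prefix sums
    return weights, positions

# A cell holding 1,000 bytes and five sample points gives each point weight 200.
w, pos = assign_weights([0] * 5, {0: 1000})
assert w == [200.0] * 5 and pos[-1] == 1000.0
```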
Definition 8.
Valid left range:
A range of positions
VL = [vl_s, vl_e] is a valid left range if for all positions vl ∈ VL, the value vl is valid w.r.t. [m, M]. All the valid left ranges can be written in the form [im, iM], where i is a natural number, and they might overlap for large values of i. (See Figure 3(b).) Definition 9.
Valid right range:
A range of positions
VR = [vr_s, vr_e] is a valid right range if for all positions vr ∈ VR, the value W − vr is valid w.r.t. [m, M]. Similar to valid left ranges, all valid right ranges can be written in the form [W − jM, W − jm], where W = Σ w_i. (See Figure 3(b).) Definition 10.
Valid range:
A range of positions V = [v_s, v_e] is valid if for all positions v ∈ V, v belongs to at least one valid left range and at least one valid right range. In other words, the valid ranges are the intersection of the valid left ranges and the valid right ranges.

Figure 3: Load balancing for datasets with variable-size records. (a) Positions of points; (b) valid left, valid right, and valid ranges; (c) weight correction.

Figure 3(b) illustrates the valid left, valid right, and valid ranges. If we split a partition around a point with a position in a valid left range, the first partition (on the left) will be valid. Similarly, for positions in a valid right range, the second partition (on the right) will be valid. Therefore, we would like to split a partition around a point in one of the valid ranges (the intersection of left and right).
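A position can be tested against Definitions 8-10 mechanically: v lies in some valid left range [im, iM] iff ⌈v/M⌉ ≤ ⌊v/m⌋ (the same condition as Definition 6), and applying the same test to W − v covers the valid right ranges. A small sketch, with helper names that are ours rather than the paper's:

```python
import math

def in_left_range(v, m, M):
    # v lies in some [i*m, i*M] for a natural i >= 1
    # iff there is an integer i with v/M <= i <= v/m.
    return v > 0 and math.ceil(v / M) <= math.floor(v / m)

def is_valid_position(v, W, m, M):
    """Definition 10: v must fall in a valid left range AND a valid right range."""
    return in_left_range(v, m, M) and in_left_range(W - v, m, M)

# Five points of weight 200 with m=450, M=550: no prefix position is valid,
# which is exactly the failure case that motivates weight correction.
assert not any(is_valid_position(200 * k, 1000, 450, 550) for k in range(1, 5))
# After correcting the weights to [200, 200, 100, 300, 200], position 500 is valid.
assert is_valid_position(500, 1000, 450, 550)
```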
Lemma 3.
Empty valid ranges:
If Algorithm 3 fails by returning -1, then none of the point positions in P falls in a valid range.

Proof. By contradiction, suppose a point p_i has a position pos_i that falls in a valid range. In this case, the partitions P1 = {p_k : k ≤ i} and P2 = {p_l : l > i} are both valid, because the total weight of P1 is equal to the position pos_i, which is valid since pos_i falls in a valid left range. Similarly, the total weight of P2 is valid because pos_i falls in a valid right range. In this case, Algorithm 3 should have found this partitioning as a valid one because it tests all the points, which is a contradiction.

A corollary to Lemma 3 is that when Algorithm 3 fails by returning -1, all valid ranges are empty. As a result, we would like to slightly modify the weights of some of the sample points in order to force some points to fall in valid ranges. We call this the weight correction process.

Figure 4: Auxiliary search structure for R*-Grove.

This process is described in the following lemma:
Lemma 4.
Weight correction:
Given any empty valid range [v_s, v_e], we can modify the weights of only two points such that the position of one of them falls in the range.

Proof. Figure 3(c) illustrates the proof of this lemma. Given an empty valid range, we modify the two points with positions that follow the empty valid range, p1 and p2, where pos1 < pos2. We would like to move the point p1 to the new position pos1' = (v_s + v_e)/2, i.e., shift it by Δpos = pos1 − pos1'. This is done by decreasing the weight w1 by Δpos; the updated weight is w1' = w1 − Δpos. To keep the positions of p2 and all the following points intact, we also increase the weight of p2 by Δpos; that is, w2' = w2 + Δpos.

We apply the weight correction process to all empty valid ranges to make them non-empty and then repeat Algorithm 3 to choose the best split among them.

The only remaining part is how to enumerate all the valid ranges. The idea is to simply find a valid left range, an overlapping valid right range, and compute their intersection, all in constant time. Given a natural number i, the valid left range is of the form [im, iM]. Assume that this range overlaps a valid right range of the form [W − jM, W − jm]. Since they overlap, the following two inequalities must hold:

W − jm ≥ im ⇒ j ≤ (W − i·m) / m
W − jM ≤ iM ⇒ j ≥ (W − i·M) / M

Therefore, the lower bound of j is j_l = ⌈(W − i·M) / M⌉ and the upper bound is j_u = ⌊(W − i·m) / m⌋. If j_l ≤ j_u, then there is a solution to these inequalities, which we use to generate the bounds of the valid range [v_s, v_e]. Notice that if there is more than one valid solution for j, all of them should be considered to generate all the valid ranges, but we omit this special case for brevity.

Optimization of Phase 3:
The
ChooseSubTree operation in the R*-tree chooses the node that results in the least overlap increase with its siblings [2]. A straightforward implementation of this method is O(n), as it needs to compute the overlap between each candidate partition and all other partitions. In the R*-tree index, this cost is limited due to the limited size of each node. However, this step can be too slow as the number of partitions in R*-Grove can be extremely large. To speed up this step, we use a K-d-tree-like auxiliary search structure, as shown in Figure 4. This index structure is generated during Phase 2 as the partition boundaries are computed. Each time the NodeSplit operation completes, the search structure is updated by adding a corresponding split in the direction of the chosen split axis. This auxiliary search structure is stored in memory and replicated to all nodes. It is used in Phase 3, when we physically store the input records in the partitions. Given a spatial record, it is assigned to the corresponding partition using a search algorithm similar to the K-d-tree's point search algorithm [5]. Based on this similarity, we can estimate the running time to choose a partition to be O(log n). Notice that this optimization is not applicable in traditional R*-trees, as the partitions might overlap, while in R*-Grove we utilize the fact that we only partition points, which guarantees disjoint partitions.

Since the partition MBRs in Phase 2 are computed from sample objects, there will be objects which do not fall in any partition in Phase 3. R*-Grove addresses this problem in two ways. First, if no disjoint partitions are desired, it chooses a single partition based on the ChooseLeaf method in the original R*-tree. In short, an object is assigned to the partition for which the enlargement in area or margin is minimal.
Second, if disjoint partitions are desired, R*-Grove uses the auxiliary data structure, which covers the entire space, to assign the record to all overlapping partitions.
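The auxiliary search structure can be sketched as a binary tree of axis-aligned splits, where the point search descends one branch per split, matching the O(log n) estimate above. This is a simplified illustration with our own class and function names, not the actual implementation:

```python
class AuxNode:
    """One split recorded during Phase 2: a split axis and coordinate,
    or a leaf holding a partition id."""
    def __init__(self, axis=None, coord=None, left=None, right=None, partition=None):
        self.axis, self.coord = axis, coord
        self.left, self.right = left, right
        self.partition = partition

def find_partition(node, point):
    # Descend the k-d-tree-like structure: one comparison per split level.
    while node.partition is None:
        node = node.left if point[node.axis] < node.coord else node.right
    return node.partition

# Two splits: x at 10, then y at 5 on the right side -> three disjoint partitions.
tree = AuxNode(axis=0, coord=10,
               left=AuxNode(partition=0),
               right=AuxNode(axis=1, coord=5,
                             left=AuxNode(partition=1),
                             right=AuxNode(partition=2)))
assert find_partition(tree, (3, 9)) == 0
assert find_partition(tree, (12, 2)) == 1
assert find_partition(tree, (12, 7)) == 2
```

A disjoint index would replace this point search with a range search over the same tree, visiting every leaf whose region overlaps the record's MBR.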
Disjoint indexes:
Another advantage of using the auxiliary search structure described above is that it allows for building a disjoint index. This search structure naturally provides disjoint partitions. To ensure that the partitions cover the entire input space, we assume that the input region is infinite, that is, it starts at −∞ and ends at +∞ in all dimensions. Then, Phase 3 replicates each record to all overlapping partitions by directly searching this k-d-tree-like structure with a range search algorithm, which has O(√n) running time [24]. This advantage was not possible with the black-box R*-tree implementation, as it is not guaranteed to provide disjoint partitions.

This section describes three case studies where the R*-Grove partitioning technique can improve big spatial data processing. We consider three fundamental operations, namely, indexing, range query, and spatial join.
Spatial data indexing is an essential component of most big spatial data management systems. The state-of-the-art global indexing techniques rely on reusing existing index structures with a sample, which has been shown to be inefficient in terms of quality and load balancing [9, 43, 37].

R*-Grove partitioning can be used for the global indexing step, which partitions records across machines. In big spatial data indexing, the global index is the most crucial step as it ensures load balancing and efficient pruning when the index is used. If only the number of records needs to be balanced, or if the records are roughly equi-sized, then the techniques described in Sections 4.1 and 4.2 can be used. If the records are of variable size and the total sizes of partitions need to be balanced, then the histogram-based step in Section 4.3 can be added to ensure a higher load balance. Notice that the index hugely benefits from the balanced partition sizes, as this reduces the total number of blocks in the output file, which improves the performance of all Spark and MapReduce queries that create one task per file block.

5.2 Range Query
Range query is a popular spatial query, which is also the building block of many other complex spatial queries. Previous studies found a strong correlation between the performance of range queries and the performance of other queries such as spatial join [9, 21]. Therefore, the performance of range queries can be considered a good reflection of the quality of a partitioning technique. A good partitioning technique allows the query processor to apply two optimizations. First, it can prune the partitions that are completely outside the query range. Second, it can directly write to the output the partitions that are completely contained in the query range without further processing [10]. For very small ranges, most partitioning techniques behave similarly, as it is most likely that the small query overlaps one partition and no partitions are completely contained [8]. However, as the query range increases, the differences between the partitioning techniques become apparent. Since most range queries are expected to be square-like, the R*-Grove partitioning is expected to perform very well, as it minimizes the total margin, which produces square-like partitions. Furthermore, the balanced load across partitions minimizes the straggler effect, where one partition takes significantly longer than all other partitions.
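The two optimizations above amount to a simple three-way classification of partition MBRs against the query rectangle. A sketch with hypothetical helper names:

```python
def classify_partitions(partitions, query):
    """partitions: dict name -> MBR (x1, y1, x2, y2); query: an MBR.
    Returns (pruned, fully_contained, needs_refinement) partition names."""
    def disjoint(a, b):
        return a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]
    def contains(outer, inner):
        return (outer[0] <= inner[0] and outer[1] <= inner[1]
                and inner[2] <= outer[2] and inner[3] <= outer[3])
    pruned, contained, refine = [], [], []
    for name, mbr in partitions.items():
        if disjoint(mbr, query):
            pruned.append(name)          # optimization 1: skip entirely
        elif contains(query, mbr):
            contained.append(name)       # optimization 2: copy to output as-is
        else:
            refine.append(name)          # partial overlap: scan records
    return pruned, contained, refine

parts = {"A": (0, 0, 4, 4), "B": (5, 5, 7, 7), "C": (8, 0, 9, 9)}
p, c, r = classify_partitions(parts, (4.5, 4.5, 8.5, 8.5))
assert p == ["A"] and c == ["B"] and r == ["C"]
```

Only the partitions in the third bucket require scanning individual records, which is why square-like, balanced partitions pay off as the query range grows.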
5.3 Spatial Join

Spatial join is another important spatial query that benefits from the improved R*-Grove partitioning technique. In a spatial join, two big datasets are combined to find all the overlapping pairs of records. To support spatial join on partitioned big spatial data, each dataset is partitioned independently. Then, a spatial join runs between the partition boundaries to find all pairs of overlapping partitions. Finally, these pairs of partitions are processed in parallel. An existing approach [44] preserves spatial locality to reduce the processing jobs. However, it still relies on a traditional index like the R-tree, so it inherits its limitations. The R*-Grove partitioning has two advantages for the spatial join operation. First, it is expected to reduce the number of partitions by increasing the load balance, which reduces the total number of pairs. Second, it produces square-like partitions, which are expected to overlap with fewer partitions of the other dataset as compared to the very thin and wide partitions that STR or other partitioning techniques produce. These advantages allow R*-Grove to significantly outperform other partitioning techniques in spatial join query performance. We will validate these advantages in Section 6.5.2.
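The pairing step of the distributed spatial join described above can be sketched as follows. This is our own helper, written as a quadratic loop for clarity, whereas a real system would join the partition boundaries using a spatial index:

```python
def overlapping_partition_pairs(parts_r, parts_s):
    """Pair every partition MBR of R with every overlapping MBR of S.
    MBRs are (x1, y1, x2, y2); only these pairs are processed in parallel."""
    def overlaps(a, b):
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])
    return [(i, j)
            for i, r in enumerate(parts_r)
            for j, s in enumerate(parts_s)
            if overlaps(r, s)]

R = [(0, 0, 5, 5), (6, 0, 10, 5)]
S = [(4, 4, 7, 7), (20, 20, 25, 25)]
assert overlapping_partition_pairs(R, S) == [(0, 0), (1, 0)]
```

Square-like, balanced partitions keep this pair list short, which is precisely the advantage claimed for R*-Grove over thin-and-wide STR partitions.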
In this section, we carry out an extensive experimental study to validate the advantages of R*-Grove over widely used partitioning techniques, such as bulk-loaded STR, Kd-Tree, Z-Curve and Hilbert curve. We show how R*-Grove addresses the current limitations of those techniques and leads to better performance in big spatial data processing. In addition, we also show other capabilities of R*-Grove in the context of big spatial data, for example, how it works with large or multi-dimensional datasets. The experimental results in this section provide evidence for the spatial community to start using R*-Grove if they would like to improve the performance of their spatial applications.
Datasets:
Table 1 summarizes the datasets that will be used in our experiments. We use both real-world and synthetic datasets: (1) a semi-synthetic OpenStreetMap (OSM-Nodes) dataset; (2) OSM-Roads with size 20 GB; (3) OSM-Parks with size about 7 GB; (4) the OSM-Objects dataset with size 92 GB, which contains many variable-size records; (5) the NYC-Taxi dataset with size about 41 GB; and (6) synthetic datasets of diagonal points with 3, 4, 5, and 9 dimensions. The synthetic datasets are generated using our open-source Spatial Data Generator [39]. Datasets (5) and (6) allow us to show the advantages of R*-Grove on multi-dimensional datasets.

Table 1: Datasets for experiments (columns: Dataset, Type, Dimensions, Size).

Parameters and performance metrics:
In the following experiments, we partition the mentioned datasets at different dataset sizes |D| using the different techniques, then we measure the partition quality metrics, namely, total partition area, total partition margin, total partition overlap, and block utilization (the maximum is 1.0), as well as the partitioning and query performance. Unless otherwise mentioned, we set the balance factor α = 0.95 and the HDFS block size to 128 MB.
Machine specs:
All the experiments are executed on a cluster of one head node and 12 worker nodes, each having 12 cores, 64 GB of RAM, and a 10 TB HDD. They run CentOS 7 and Oracle Java 1.8.0_131. The cluster is equipped with Apache Spark 2.3.0 and Apache Hadoop 2.9.0. The proposed indexes can run on both Spark and Hadoop. Unless otherwise mentioned, we use Spark by default. The source code is available at https://bitbucket.org/tvu032/beast-tv/src/rsgrove/ . The implementation of R*-Grove (
RSGrovePartitioner) is located in the indexing package.
Baseline techniques:
We compare R*-Grove to Kd-Tree, STR, Z-Curve and Hilbert curve (denoted H-Curve hereafter), which are widely used in existing big spatial data systems [13]. Z-Curve is adopted in some systems under the name Geohash, which behaves in the same way.
In this experiment, we compare the following three variants of R*-Grove: (1)
R*-tree-black-box is the application of the method in Section 4.1. Simply, it uses the basic R*-tree algorithm to compute high-quality partitions, but it does not guarantee a high block utilization or load balance. (2)
R*-tree-gray-box applies the improvements in Sections 4.1 and 4.2. In addition to the high-quality partitions, this method can also guarantee a high block utilization in terms of the number of records per partition, but it does not perform well if records have highly variable sizes, since it does not include the size adjustment technique in Section 4.3. (3)
R*-Grove applies all three improvements in Sections 4.1, 4.2 and 4.3. It has the advantage of producing high-quality partitions and can also guarantee a high block utilization in terms of storage size even when the record sizes are highly variable.

Figure 5: Partition quality on a variable-record-size dataset for R*-Grove and its two variants, R*-tree-black-box and R*-tree-gray-box. (a) Total area; (b) total margin; (c) block utilization; (d) load balance.

In Figure 5, we partition the
OSM-Objects dataset, which contains variable-size records, to validate our proposed improvements. Overall, R*-Grove outperforms R*-tree-black-box and R*-tree-gray-box in all spatial quality metrics. In particular, R*-Grove provides an excellent load balance between partitions, as shown in Figure 5(d), which reports the standard deviation of partition size on the
OSM-Objects dataset. Given that the HDFS block size is 128 MB, R*-Grove keeps the standard deviation of the partition size as low as about 5 MB.

Figure 6 shows an overview of the advantages of R*-Grove over other partitioning techniques for indexing, range query, and spatial join. In this experiment, we compare to four popular baseline techniques, namely, STR, Kd-Tree, Z-Curve and H-Curve. We use the OSM-Nodes dataset [36] for this experiment. The numbers on the y-axis are normalized to the largest number for a better representation, except for block utilization, which is reported as-is. Except for block utilization, the lower the value in the chart, the better.

Figure 6: The advantages of R*-Grove when compared to existing partitioning techniques.

The first two groups, total area and total margin, show that the index quality of R*-Grove is clearly better than the other baselines in both measures. For block utilization, on average, a partition in R*-Grove occupies around 90% of an HDFS block, while the other techniques only utilize 60-70% of the storage capacity of a block. R*-Grove also has a better load balance when compared to the other techniques. The last two groups indicate that R*-Grove significantly outperforms the other partitioning techniques in terms of range query and spatial join performance. We will go into further details in the rest of this section.
This section shows the advantages of R*-Grove for indexing big spatial data when compared to other partitioning techniques. We use
OSM-Nodes and
OSM-Objects datasets with sizes up to 200 GB and 92 GB, respectively. We compare five techniques, namely, R*-Grove, STR, Kd-Tree, Z-Curve and H-Curve. We implemented these techniques on Spark with the sampling-based partitioning mechanism. Figures 7(a) and 8(a) show that there is no significant difference in indexing performance between the different techniques. This result is expected since the main difference between them is in Phase 2, which runs on a single machine on a sample of a small size (and a histogram in the case of R*-Grove). Typically, Phase 2 takes only a few seconds to finish. These results suggest that the proposed R*-Grove algorithm requires the same computational resources as the baseline techniques, while it provides better query performance through a higher partition quality, as detailed next.
Figures 7(b) and 8(b) show the total area of indexed datasets when we vary the
OSM-Nodes and OSM-Objects dataset sizes from 20 GB to 200 GB and from 16 GB to 92 GB, respectively. R*-Grove is the winner since it minimizes the total area of all partitions. While H-Curve generally performs better than Z-Curve, both perform poorly since they do not take the partition area into account in their optimization criteria. In particular, Figure 8(b) strongly validates the advantages of R*-Grove on non-point datasets. Figures 7(c) and 8(c) report the total margin for the same experiment. R*-Grove is the clear winner because it inherits the splitting mechanism of the R*-tree, which is the only one among all these techniques that tries to produce square-like partitions. As the input size increases, more partitions are generated, which causes the total margin to increase.

Figure 7: Indexing performance and partition quality of R*-Grove and other partitioning techniques on the OSM-Nodes dataset with similar-size records. (a) Partitioning time; (b) total area; (c) total margin; (d) block utilization; (e) load balance; (f) range query performance.
Figures 7(d) and 8(d) show the block utilization as the input size increases. R*-Grove outperforms the other partitioning techniques because the improvements proposed in Sections 4.2 and 4.3 specifically target block utilization. Using R*-Grove, each partition occupies almost a full block in HDFS, which increases the overall block utilization. Z-Curve and H-Curve perform similarly since they produce equi-sized partitions by creating split points along the curve. The high variability of the Kd-tree is due to the way it partitions the space at each iteration. Since it always partitions the space along the median, it only works perfectly if the number of partitions is a power of two; otherwise, it can be very inefficient. This occasionally results in partitions with high block utilization, but their sizes can be highly variable.
Figures 7(e) and 8(e) show the standard deviation of partition size in MB for the
OSM-Nodes and
OSM-Objects datasets, respectively. Note that the HDFS block size is set to 128 MB. A smaller standard deviation indicates a better load balance. In Figure 7(e), the dataset
OSM-Nodes contains records of almost the same size, so R*-Grove performs only slightly better than Z-Curve, H-Curve and STR, even though these three techniques primarily try to balance the partition sizes. In Figure 8(e), the
OSM-Objects dataset contains highly variable record sizes. In this case, R*-Grove is substantially better than all the other techniques, as it is the only one that employs the storage histogram to balance variable-size records. In particular, we observe that the standard deviation of partition size for STR, Kd-Tree, Z-Curve and H-Curve is about 50 −
60% of the HDFS block size. Meanwhile, R*-Grove maintains a value of around 10 MB, which is only 8% of the block size.

Figure 8: Indexing performance and partition quality of R*-Grove and other partitioning techniques on the OSM-Objects dataset with variable-size records. (a) Partitioning time; (b) total area; (c) total margin; (d) block utilization; (e) load balance; (f) average range query cost.
Since the proposed R*-Grove follows the sampling-based partitioning mechanism, a valid question is how the sampling ratio affects the partition quality and performance. In this experiment, we execute several partitioning operations using R*-Grove on the OSM-Objects dataset. All the partitioning parameters are kept fixed except the sampling ratio, which we vary.

Figure 9: Indexing performance and partition quality of R*-Grove on the OSM-Objects dataset with different sampling ratios. (a) Partition construction time; (b) total area; (c) total margin; (d) total overlap; (e) load balance; (f) block utilization.

In Section 4.1, we introduced the parameter ρ, namely the minimum splitting ratio, to speed up the running time of the SplitNode algorithm used in Phase 2, boundary computation. In this experiment, we verify how the minimum splitting ratio impacts the partition quality and performance. We again use the OSM-Objects dataset with R*-Grove partitioning, as in the previous experiment in Section 6.4.4. We vary ρ from 0 to 0.
45. Figure 10 gives an overview of the experimental results. First, Figure 10(a) shows that the running time of Phase 2, boundary computation, decreases as ρ increases, which is expected: the balanced splitting in the recursive algorithm causes it to terminate earlier. According to the run-time analysis in Section 4.1, a larger value of ρ reduces the depth of the recursion, which results in a lower running time. However, this minimum splitting ratio also shrinks the search space for the optimal partitioning scheme. Fortunately, the number of records in the 1% sample is usually large enough that the boundary computation algorithm can still find a good partitioning scheme even for high values of ρ. In the following experiments, we keep ρ fixed.

Figure 7(f) shows the performance of the range query on the
OSM-Nodes dataset with size 200GB. For partitioned
OSM-Nodes dataset, we run a number of range queries (from 200 to 1,000), each with size 0.
01% of the area covered by the entire input. All the queries are sent in one batch to run in parallel, putting the cluster at full utilization. It is clear that R*-Grove outperforms all the other techniques, especially when we run a large number of queries. This is the result of the high-quality and load-balanced partitions, which minimize the number of blocks that need to be processed for each query. Figure 8(f) shows the average cost of a range query on the
OSM-Objects dataset in terms of the number of blocks that need to be processed; the lower, the better. This value is also computed for a range query with size 0.
01% of the space area. This result further confirms that R*-Grove provides better query performance for datasets with variable-size records.

Figure 10: Indexing performance and partition quality of R*-Grove on the OSM-Objects dataset with different minimum splitting ratios. (a) Partition construction time; (b) total area; (c) total margin; (d) total overlap; (e) load balance; (f) block utilization.
In this experiment, we split
OSM-Parks and
OSM-Roads datasets to get multiple datasets as follows:
Parks1 and Parks2, and Roads1 and Roads2 with sizes 10 and 20 GB, respectively. This allows us to study the effect of the input size on the spatial join query while keeping the input data characteristics the same, i.e., the distribution and geometry sizes. We compare to STR since it is the best competitor of R*-Grove in the previous experiments. Figure 11 shows the performance of the spatial join query. In general, R*-Grove significantly outperforms STR in all query instances.

Figure 11(a) shows the number of accessed blocks for each spatial join query over the datasets partitioned by R*-Grove and STR. We notice that R*-Grove needs to access 40%-60% fewer blocks than STR for two reasons. First, the better load balance in R*-Grove reduces the overall number of blocks in each dataset. Second, the higher partition quality in R*-Grove results in fewer overlapping partitions between the two datasets. The number of accessed blocks is an indicator of the actual performance of spatial join queries. Indeed, this is further verified in Figure 11(b), which shows the actual running times for those queries. As we described, STR does not produce high-quality partitions, and the compound effect makes it even worse for the spatial join query, which involves multiple STR-partitioned datasets. On the other hand, R*-Grove addresses the limitations of STR, so it can significantly improve the performance of the spatial join query.

Figure 11: Spatial join performance with R*-Grove and STR partitioning. (a) Number of processing blocks; (b) running time in seconds.
Figure 12: The scalability of R*-Grove partitioning in Spark and Hadoop.
Figure 12 shows the indexing time for the two-dimensional OSM-Nodes dataset with sizes 100, …, 300, and 500 GB. We executed the same indexing jobs in both Spark and Hadoop to see how the processing model affects the indexing performance. We observed that Spark outperforms Hadoop in terms of total indexing time. This experiment also demonstrates that R*-Grove is ready to work with large-volume datasets on both Hadoop and Spark. We also observe that the gap between Hadoop and Spark decreases as the input size increases, as Spark starts to spill more data to disk.
In this experiment, we study the quality of R*-Grove on multi-dimensional datasets. Inspired by [3], we use four synthetic datasets with 3, 4, 5, and 9 dimensions. We measure the running time and the quality of the five partitioning techniques: R*-Grove, STR, Kd-Tree, Z-Curve, and H-Curve. Figure 13(a) shows that R*-Grove is mostly the fastest technique to index the input dataset due to the best load balance among partitions. Figure 13(b) shows that R*-Grove significantly reduces the total area of partitions. Figure 13(c) shows the total margin of all the techniques. While the total margin varies with the number of dimensions, since they are different datasets, the techniques maintain the same order in terms of quality from best to worst, i.e., R*-Grove, Z-Curve, Kd-tree, STR, and H-Curve, except for the last group, where H-Curve is better than STR. This experiment indicates that R*-Grove maintains its characteristics for multi-dimensional datasets.
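For reference, the Z-Curve baseline orders points by interleaving the bits of their integer-scaled coordinates and then cuts the sorted run into equal-size partitions. A minimal sketch of the key computation; the function name and bit width are our own, not taken from any particular system:

```python
def z_value(coords, bits=16):
    # Interleave the bits of the integer coordinates (one per dimension)
    # to produce a Z-order (Morton) key.
    key = 0
    d = len(coords)
    for bit in range(bits):
        for dim, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * d + dim)
    return key

# Points are sorted by key, then split into equal-size runs (partitions).
points = [(3, 1), (0, 0), (2, 2), (1, 3)]
ordered = sorted(points, key=z_value)
print(ordered)  # [(0, 0), (3, 1), (1, 3), (2, 2)]
```

Because the key is a pure bit interleaving, points near a diagonal line get nearby keys, which explains the unusually good Z-Curve results on the diagonal datasets in this experiment.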
Figure 13: Indexing performance and partition quality of R*-Grove and other partitioning techniques on synthetic multi-dimensional datasets. (a) Partitioning time. (b) Total volume. (c) Total margin. (d) Block utilization. (e) Load balance. (f) Range query performance.

Figures 13(d) and 13(e) report the block utilization and the standard deviation of partition sizes, respectively. R*-Grove is the best technique at keeping both measures good. Figure 13(f) depicts the normalized range query performance of the different techniques, which affirms the advantages of R*-Grove. Notice that this is the only experiment where Z-Curve performs better than H-Curve. The reason is that the points are generated close to a diagonal line in the d-dimensional space. Since the Z-Curve simply interleaves the bits of all dimensions, it sorts these points along the diagonal line, which results in a good partitioning. However, the way H-Curve rotates the space at each level causes it to jump across the diagonal. Additionally, STR becomes very bad as the number of dimensions increases. This can be explained by the way STR computes the number of partitions given a sample of data points. The existing STR implementation always creates a tree with a fixed node degree n and d levels, where d is the number of dimensions. This configuration results in n^d leaf nodes, or partitions. It computes the node degree n as the smallest integer that satisfies n^d ≥ P, where P is the desired number of partitions. For example, for an input dataset of 100 GB, d = 9 dimensions, and a block size of B = 128 MB, the desired number of partitions is P = 100 · 1024/128 = 800 partitions and n = 3. This results in a total of 3^9 = 19683 partitions. Obviously, as d increases, the gap between the ideal number of partitions P and the actual number of partitions n^d increases, which results in a very poor block utilization, as shown in this experiment. Finally, Figure 13(f) shows the average cost of a range query in terms of the number of processed blocks, which indicates that R*-Grove is the winner when we want to speed up spatial query processing.

To further support our findings, we also execute a similar experiment on the NYC-Taxi dataset, which contains up to seven dimensions as follows: pickup latitude, pickup longitude, dropoff latitude, dropoff longitude, pickup datetime, trip time in secs, trip distance. These attribute values are normalized in order to avoid the dominance of some columns. We partition this dataset using multiple attributes, picked in the aforementioned order, with sizes 4, …
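The STR sizing computation described above can be sketched as follows. This is an illustrative reconstruction of the described formula, not the actual SpatialHadoop source, and the function name is our own:

```python
import math

def str_partition_counts(input_size_gb, block_size_mb, d):
    """Desired vs. actual partition counts for the STR scheme described
    in the text: a tree with fixed node degree n and d levels yields
    n**d leaf partitions."""
    # Desired partitions: input size divided by the block size.
    p_desired = math.ceil(input_size_gb * 1024 / block_size_mb)
    # Smallest integer node degree n with n**d >= p_desired.
    n = max(1, math.floor(p_desired ** (1.0 / d)))
    while n ** d < p_desired:  # guard against floating-point rounding
        n += 1
    return p_desired, n, n ** d

# The example from the text: 100 GB input, 128 MB blocks, d = 9.
print(str_partition_counts(100, 128, 9))  # (800, 3, 19683)
```

For d = 2 the same input gives n = 29 and 841 partitions, close to the ideal 800; at d = 9 the overshoot is more than 24x, which is exactly the block-utilization collapse observed in the experiment.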
Figure 14: Indexing performance and partition quality of R*-Grove and other partitioning techniques on the multi-dimensional NYC-Taxi dataset. (a) Partitioning time. (b) Total volume. (c) Total margin. (d) Block utilization. (e) Load balance. (f) Range query performance.
This paper proposes R*-Grove, a novel partitioning technique which can be widely used in many big spatial data processing systems. We highlighted three limitations in existing partitioning techniques such as STR, Kd-Tree, Z-Curve, and Hilbert Curve: the low quality of the partitions, the imbalance among partitions, and the failure to handle variable-size records. We showed that R*-Grove overcomes these three limitations to produce high-quality partitions. We presented three case studies in which R*-Grove can be used to facilitate big spatial indexing, range queries, and spatial joins. An extensive experimental evaluation was carried out on big spatial datasets and showed that R*-Grove is scalable and speeds up all the operations in the case studies. We believe that R*-Grove promises to be a good replacement for existing big spatial data partitioning techniques in many systems. In the future, we will further study the proposed technique for in-memory and streaming applications to see how it behaves under these architectures.
Funding
This work is supported in part by the National Science Foundation (NSF) under grants IIS-1838222 and CNS-1924694.
Data Availability Statement
The datasets generated for this study are available on the UCR Spatio-temporal Active Repository (UCR-STAR, https://star.cs.ucr.edu/) or on request to the corresponding author. In particular, we used
OSM2015/all_nodes, OSM2015/roads, OSM2015/parks, OSM2015/all_objects, and NYCTaxi. For the diagonal points datasets, we generated them using the spatial data generator [39] with the following parameters: dataset size |D| = 80 million points; number of dimensions d = 3, 4, 5, and 9; the percentage (ratio) of the points that are exactly on the line perc = 0.05; and the size of the buffer around the line where additional points are scattered buf = 0.…

References

[1] OpenStreetMap all objects dataset, 2019. http://star.cs.ucr.edu/
[2] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In SIGMOD, pages 322–331, Atlantic City, NJ, May 1990.
[3] N. Beckmann and B. Seeger. A benchmark for multidimensional index structures, 2008.
[4] N. Beckmann and B. Seeger. A revised R*-tree in comparison with related index structures. In SIGMOD, pages 799–812, Providence, RI, June 2009.
[5] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
[6] H. Chasparis and A. Eldawy. Experimental evaluation of selectivity estimation on big spatial data. In Proceedings of the Fourth International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data, pages 8:1–8:6, Chicago, IL, May 2017.
[7] J. Dittrich and B. Seeger. Data redundancy and duplicate detection in spatial join processing. In ICDE, pages 535–546, San Diego, CA, Feb.–Mar. 2000.
[8] A. Eldawy, L. Alarabi, and M. F. Mokbel. Spatial partitioning techniques in SpatialHadoop. PVLDB, 8(12):1602–1605, 2015.
[9] A. Eldawy et al. Spatial partitioning techniques in SpatialHadoop. PVLDB, 8(12):1602–1605, 2015.
[10] A. Eldawy et al. Sphinx: Empowering Impala for efficient execution of SQL queries on big spatial data. In SSTD, pages 65–83, Arlington, VA, Aug. 2017.
[11] A. Eldawy, Y. Li, M. F. Mokbel, and R. Janardan. CG_Hadoop: Computational geometry in MapReduce. In SIGSPATIAL, pages 284–293, Orlando, FL, Nov. 2013.
[12] A. Eldawy and M. F. Mokbel. SpatialHadoop: A MapReduce framework for spatial data. In ICDE, pages 1352–1363, Seoul, South Korea, Apr. 2015.
[13] A. Eldawy and M. F. Mokbel. The era of big spatial data: A survey. Foundations and Trends in Databases, 6(3-4):163–273, 2016.
[14] A. Eldawy, M. F. Mokbel, S. Al-Harthi, A. Alzaidy, K. Tarek, and S. Ghani. SHAHED: A MapReduce-based system for querying and visualizing spatio-temporal satellite data. In ICDE, pages 1585–1596, Seoul, South Korea, Apr. 2015.
[15] A. Eldawy, M. F. Mokbel, and C. Jonathan. HadoopViz: A MapReduce framework for extensible visualization of big spatial data. In ICDE, pages 601–612, Helsinki, Finland, May 2016.
[16] A. D. Fox, C. N. Eichelberger, J. N. Hughes, and S. Lyon. Spatio-temporal indexing in non-relational distributed databases. In Big Data, pages 291–299, Santa Clara, CA, Oct. 2013.
[17] S. Ghosh, A. Eldawy, and S. Jais. AID: An adaptive image data index for interactive multilevel visualization. In ICDE, page 4, Macau, China, Apr. 2019.
[18] M. F. Goodchild. Citizens as voluntary sensors: Spatial data infrastructure in the world of Web 2.0. IJSDIR, 2:24–32, 2007.
[19] A. Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD, pages 47–57, Boston, MA, June 1984.
[20] N. Henke et al. The Age of Analytics: Competing in a Data-driven World. Technical report, McKinsey Global Institute, Dec. 2016.
[21] E. G. Hoel and H. Samet. Performance of data-parallel spatial operations. In VLDB, pages 156–167, Santiago de Chile, Chile, Sept. 1994.
[22] E. H. Jacox and H. Samet. Spatial join techniques. ACM Transactions on Database Systems (TODS), 32(1):7, 2007.
[23] I. Kamel and C. Faloutsos. Hilbert R-tree: An improved R-tree using fractals. In VLDB, pages 500–509, Santiago de Chile, Chile, Sept. 1994.
[24] D.-T. Lee and C. Wong. Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees. Acta Informatica, 9(1):23–29, 1977.
[25] T. Lee et al. OMT: Overlap minimizing top-down bulk loading algorithm for R-tree. In CAiSE Short Paper Proceedings, volume 74, pages 69–72, 2003.
[26] S. T. Leutenegger et al. STR: A simple and efficient algorithm for R-tree packing. In ICDE, pages 497–506, 1997.
[27] Y. Li, A. Eldawy, J. Xue, N. Knorozova, M. F. Mokbel, and R. Janardan. Scalable computational geometry in MapReduce. The VLDB Journal, Jan. 2019.
[28] M. Lo and C. V. Ravishankar. Spatial joins using seeded trees. In SIGMOD, pages 209–220, Minneapolis, MN, May 1994.
[29] P. Lu et al. ScalaGiST: Scalable generalized search trees for MapReduce systems. PVLDB, 7(14):1797–1808, 2014.
[30] A. Magdy, L. Alarabi, S. Al-Harthi, M. Musleh, T. M. Ghanem, S. Ghani, and M. F. Mokbel. Taghreed: A system for querying, analyzing, and visualizing geotagged microblogs. In SIGSPATIAL, pages 163–172, Dallas/Fort Worth, TX, Nov. 2014.
[31] S. Nishimura, S. Das, D. Agrawal, and A. El Abbadi. MD-HBase: Design and implementation of an elastic data infrastructure for cloud-scale location services. Distributed and Parallel Databases, 31(2):289–319, 2013.
[32] I. Sabek and M. F. Mokbel. On spatial joins in MapReduce. In SIGSPATIAL, pages 21:1–21:10, Redondo Beach, CA, Nov. 2017.
[33] H. Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys, 16(2):187–260, 1984.
[34] A. B. Siddique, A. Eldawy, and V. Hristidis. Comparing synopsis techniques for approximate spatial data analysis. PVLDB, 12(11):1583–1596, 2019.
[35] M. Tang, Y. Yu, Q. M. Malluhi, M. Ouzzani, and W. G. Aref. LocationSpark: A distributed in-memory data management system for big spatial data. PVLDB, 9(13):1565–1568, 2016.
[36] The UCR Spatio-temporal Active Repository (UCR-STAR), 2019. https://star.cs.ucr.edu/.
[37] H. Vo, A. Aji, and F. Wang. SATO: A spatial data partitioning framework for scalable query processing. In SIGSPATIAL, pages 545–548, Dallas/Fort Worth, TX, Nov. 2014.
[38] T. Vu and A. Eldawy. R-Grove: Growing a family of R-trees in the big-data forest. In SIGSPATIAL, pages 532–535, Seattle, WA, Nov. 2018.
[39] T. Vu, S. Migliorini, A. Eldawy, and A. Belussi. Spatial data generators. In …, page 7, 2019.
[40] R. T. Whitman et al. Spatial indexing and analytics on Hadoop. In SIGSPATIAL, pages 73–82, Dallas/Fort Worth, TX, Nov. 2014.
[41] D. Xie, F. Li, B. Yao, G. Li, L. Zhou, and M. Guo. Simba: Efficient in-memory spatial analytics. In SIGMOD, pages 1071–1085, San Francisco, CA, July 2016.
[42] J. Yu, J. Wu, and M. Sarwat. GeoSpark: A cluster computing framework for processing large-scale spatial data. In SIGSPATIAL, pages 70:1–70:4, Bellevue, WA, Nov. 2015.
[43] J. Yu, J. Wu, and M. Sarwat. GeoSpark: A cluster computing framework for processing large-scale spatial data. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, page 70. ACM, 2015.
[44] X. Zhou, D. J. Abel, and D. Truffet. Data partitioning for parallel spatial join processing.