A GPU-friendly Geometric Data Model and Algebra for Spatial Queries: Extended Version
AA GPU-friendly Geometric Data Model and Algebra for SpatialQueries: Extended Version
Harish Doraiswamy and Juliana FreireNew York University [email protected] [email protected]
Abstract
The availability of low cost sensors has led to an unprecedented growth in the volume of spatial data.However, the time required to evaluate even simple spatial queries over large data sets greatly hampersour ability to interactively explore these data sets and extract actionable insights. Graphics ProcessingUnits (GPUs) are increasingly being used to speedup spatial queries. However, existing GPU-based solu-tions have two important drawbacks: they are often tightly coupled to the specific query types they target,making it hard to adapt them for other queries; and since their design is based on CPU-based approaches,it can be difficult to effectively utilize all the benefits provided by the GPU. As a first step towards makingGPU spatial query processing mainstream, we propose a new model that represents spatial data as geo-metric objects and define an algebra consisting of GPU-friendly composable operators that operate overthese objects. We demonstrate the expressiveness of the proposed algebra by formulating standard spatialqueries as algebraic expressions. We also present a proof-of-concept prototype that supports a subset ofthe operators and show that it is at least two orders of magnitude faster than a CPU-based implementa-tion. This performance gain is obtained both using a discrete Nvidia mobile GPU and the less powerfulintegrated GPUs common in commodity laptops.
The availability of low cost sensors such as GPS in vehicles, mobile and IoT devices have led to an unprece-dented growth in the volume of spatial data. Extracting insights from these data sets requires the ability toeffectively and efficiently handle a variety of queries.The most common approach to support spatial queries is through the use of spatial extensions that areavailable in existing relational database systems (e.g., the PostGIS extension for PostgreSQL [38], Oracle Spa-tial [32], DB2 Spatial Extender [1], SQL Server Spatial [44]). Popular geographic information system (GIS)software typically use these systems to process spatial queries [4, 19, 39] and some also provide their owndatabase backend. Using these state-of-the-art systems, the response times to even simple spatial queries overlarge data sets can runs into several minutes (or more), hampering the ability to perform interactive analyticsover these data [15, 29]. While faster response times can be attained by powerful clusters [14, 34], such anoption, due to its costs and complexity, is often out of reach for many analysts.Recent technological advances have made Graphics Processing Units (GPUs) a cost-effective alternativeto provide high computing power. Since GPUs are widely available, even in commodity laptops, effectiveGPU-based solutions have the potential to democratize large-scale spatial analytics. Not surprisingly, severalapproaches have been proposed that use GPUs to speed up spatial queries (e.g., [10, 11, 53–55]). However,these implementations typically follow the traditional approaches that were designed primarily for the CPU,1 a r X i v : . [ c s . D B ] A p r igure 1: Reformulation of the spatial selection query.and simply porting the algorithms may lead to an ineffective use of GPU capabilities. For example, a spatialaggregation query that aggregates input points across different polygonal regions would typically be imple-mented as a spatial join of the points and polygons followed by the aggregation of the join results. On theother hand, the recently proposed RasterJoin [47] represented a departure from traditional strategies: by mod-eling spatial aggregation queries using GPU-specific operations, it attained significant speedups even overthe traditional query plan executed on a GPU, suggesting that GPU-specific strategies can lead to substantialperformance gains.Another drawback of traditional GPU-based approaches is that they require different implementations foreach query class. Consequently, it is hard to re-use and/or extend them for other similar queries. As an exam-ple, consider the simple spatial selection query illustrated in Figure 1(a). Given a spatial data set consistingof a collection of points (say restaurants) and their locations, this query identifies all points that are containedwithin the specified query polygon (e.g., a neighborhood). Existing approaches (including the state-of-the-artGPU-based solution [11]) implement this query as a single operator typically making use of a spatial index,which organizes the minimum bounding rectangle (MBR) of the spatial objects in a tree structure. The indexis used to identify relevant MBRs, and then, for each point inside the selected MBRs, a test is performed tocheck whether the point is indeed inside the query polygon. Note that this containment test is specific to theinput being points. If the spatial component of the data is instead represented as the polygon corresponding tothe land plot where the restaurant is located, the selection query requires a different implementation, since apolygon-intersect-polygon test must be performed instead. This shortcoming, coupled with the complexitiesinvolved in implementing GPU-based solutions, has impeded a wider use of GPUs in spatial databases.
Our Approach.
With the goal of enabling GPUs to be exploited without the need to build custom solutions,we revisit the problem of designing a spatial data model and operators. We adapt common computer graphicsoperations for which GPUs are specifically designed and optimized, to propose a new geometric data modelthat provides a uniform representation for different geometric objects , and an algebra consisting of a smallset of composable operators capable of handling a wide variety of spatial queries .While several data models and algebras have been described in the literature [12, 18, 20, 22, 42], they wereall designed before the advent of modern GPUs and suffer from at least one of the above shortcomings, as wediscuss in Section 8. Furthermore, these models are typically user facing : users express the queries of interestby making use of the data types and the operators provided in the model; the implementation of the operatorsis left to the developer. In contrast, we aim for a developer-facing model that can be incorporated into existingsystems unbeknownst to the users, while at the same time providing significant benefits to the database engineand query performance.To give an intuition behind the proposed geometric model, consider again the example in Figure 1(a) froma geometric point of view. The query can be translated into two operations performed one after the other2s shown in Figure 1(b). Visually (or graphically), the input points and the query polygon are uniformlyrepresented as drawings on a canvas . The first operation merges the input points and the query polygoninto a single canvas. The second operation computes the intersection between the points and the polygon toeliminate points outside the polygon. Unlike the traditional execution strategy for spatial aggreagtion (i.e.,join followed by aggregation), the two operations used here are applicable to any kind of geometry. Therefore,even if the data (restaurants) were represented as polygons instead of points, the same set of operations couldbe applied.Informally, we represent a spatial object as an embedding of its geometry onto a plane, called a canvas ,and define GPU-friendly operators similar to the ones in Figure 1(b), that act on one or more canvases. As weshow in Section 4, these operators can be re-used and composed to support a diverse set of spatial queries.Given a small set of basic operators, our model makes it possible for implementations to focus on theefficiency of these operators, the gains from which become applicable to a variety of queries. While our focusin this paper is on the conceptual representation and modeling of spatial data and the design of the algebra,this work opens new avenues for research in spatial query optimization, both for theory (e.g., developing plangeneration strategies, designing cost models) and systems (e.g., designing indexes leveraging GPUs).
Contributions.
To the best of our knowledge, this is the first approach to propose a query algebra designedwith a focus on enabling an efficient GPU realization of spatial queries. Our main contributions are as follows: • We propose a new geometric data model that provides a uniform representation for spatial data on GPUs(Section 2). • We design a spatial algebra consisting of five fundamental operators designed based on common com-puter graphics operations (Section 3). We show that the algebra is: expressive and able to representall standard spatial queries; and closed, allowing the operators to be composed to construct complexqueries (Section 4). • We present a proof-of-concept implementation of a subset of the proposed operators that shows: 1) howthe proposed model and operators are naturally suited for GPUs; and 2) how operators can be re-usedin different queries. Our implementation achieves over two orders of magnitude speedup over a customCPU-based implementation, and outperforms custom GPU-based approaches (Sections 5 & 6). • We discuss the compatibility of the proposed algebra with the relational model and its utility for queryoptimization (Section 7).
In this section, we first formalize the notion of a spatial data set, and then define the concept of a canvas , thespatial analogue of a relational tuple.
As discussed in Section 1, an important limitation of current spatial operations is that they are tied to spe-cific representations for geometric data types. To design flexible operators, we need a uniform representationfor the geometry that is independent of underlying types. We propose to schematically represent the geom-etry using a single type called geometric object , which can conceptually represent any complex geometricstructure.
Definition 1 (Geometric Object)
A geometric object is defined as a collection of geometric primitives. igure 2: Canvas : A uniform representation of spatial data. (a) Example point data. (b) Two canvasescorresponding to the first two records of the table. The [0,0] element of the matrix (corresponding to the0-primitive) stores the unique ID corresponding to the record. (c) Example polygon data. (d) Two canvasescorresponding to the first two records. Here, all points inside a polygon will map to the same value, with theelement [2,0] (corresponding to the 2-primitives) storing the unique ID. The white parts of the canvas (notpart of the geometry) maps to a null value.
Definition 2 (Geometric Primitive) A d -dimensional geometric primitive ( d -primitive) is defined as a d -manifold (with or without a boundary). Informally, a d -manifold is geometric space in which the local neighborhood of every point represents R d .In the context of spatial data that is of interest in this work, we focus on 2-dimensional (2D) space and d -primitives, where ≤ d ≤ . Intuitively, a 0-primitive is a point while a 1-primitive is a line (not necessarilya straight line or with a finite length). 2-primitives include any subset of R that is neither a line nor a point,such as polygons and half spaces.A spatial data set can now be defined in terms of geometric objects as follows: Definition 3 (Spatial Data)
A spatial data set consists of one or more attributes of type geometric object.
Note that the above definition allows geometric objects of arbitrarily complex shapes composed using aheterogeneous set that contains points, lines, as well as polygons. However, geometric objects common inreal world data sets are primarily only points (e.g., locations of restaurants, hospitals, bus stops, etc.), onlylines (e.g., road networks), or only polygons (e.g., state or city boundaries).
We define the notion of a canvas to explicitly capture the geometric structure of a spatial data set. As men-tioned above, we assume that the dimensions of the geometric primitives composing a geometric object in aspatial data set is either 0, 1 or 2. Let S be a set of k -tuples, where k ≥ , such that ∅ ∈ S . A canvas isformally defined as follows: Definition 4 (Canvas) A canvas is a function C : R → S that maps each point in R to a triple ( s [0] , s [1] , s [2]) ∈ ( S × S × S ) , where the i th element of the triple, s [ i ] , stores information (as a k -tuple) corresponding to i -dimensional geometric primitives. Definition 5 (Empty Canvas) A canvas C is empty iff C maps all points in R to ( ∅ , ∅ , ∅ ) . igure 3: A canvas representing a complex object. Since all the primitives (colored differently) are part ofthe same object, they have the same ID.A canvas is analogous to a tuple in the relational model. It is defined such that it captures geometric objectsin the world coordinate space of the graphics pipeline, thus making it straightforward to apply computergraphics operations on it. Intuitively, a canvas stores for each point in R information corresponding to thegeometric primitives that pass through that point. This information is captured by the elements of the set S (we discuss S in more detail below).Given a spatial data set, each record of this data is represented using one or more canvases equal to thenumber of geometric object attributes of the data. For ease of exposition, consider a spatial data set havinga single geometric object attribute and a geometric object o corresponding to one of the records in this data.Let o = { g , g , g , . . . , g n } , where g i is a geometric primitive having dimension dim ( g i ) , ≤ dim ( g i ) ≤ , ∀ i . A canvas representation of the geometric object o is defined as follows. Definition 6 (Canvas representation of a geometric object) A canvas corresponding to a geometric object isa function C o : R → S such that ∀ d ∈ [0 , C o ( x, y )[ d ] = s d (cid:54) = ∅ ∈ S, if ∃ i | dim ( g i ) = d and g i intersects ( x, y ) ∅ otherwise The set S used in the above definition is called the object information set , and is defined as follows. Definition 7 (Object Information Set S ) The object information set S is defined as a set of triples ( v , v , v ) where v stores a unique identifier (or a pointer) for the record corresponding to the geometric object. v and v are real numbers storing meta data related to the canvas. The range of the canvas function C can thus be represented as a × matrix, where each row corresponds tothe Object Information Set for the associated primitive dimension. We abuse notation to represent the triple ( ∅ , ∅ , ∅ ) simply as ∅ . Example 1:
Consider the two example data sets in Figure 2. The first data set corresponds to the set ofrestaurants in a city (represented as points) (a), while the second data set corresponds to the neighborhoodboundaries of this city (represented as polygons) (c). Figure 2 also illustrates the canvas representationscorresponding to two records from each of these two data sets. Note that in this example, we use only theidentifier element of the object information set. The values of the other elements are initialized depending onthe query scenario (Section 4).
Example 2:
The complex geometric object shown in Figure 3 consists of two polygons (an ellipse and apolygon with a hole) connected by a line, with the hole also containing a point. This is represented in thecanvas by mapping the regions corresponding to the different primitives using the appropriate rows in thematrix (for point, line, polygon). 5 igure 4:
The 5 fundamental operators. For illustrative purposes, we use colors to denote the informationstored in each point of the canvas; the white color corresponds to a null value.
Below, we define the operators we designed for the canvas representation of a spatial data set. Concreteexamples of how these operators are used are given in Section 4. We categorize the set of operators into threeclasses: fundamental, derived, and utility operators. We use the following notation to represent operators thattake as input zero or more canvases: Op [ P , P , . . . ]( C , C , . . . , C n ) where Op is the operator name, P i , ∀ i , the parameters of the operator, and C j , ∀ j , the canvases input to theoperator. The output of all the operators is always one or more canvases. Thus, the proposed algebra is closedby design . Fundamental operators form the core of the proposed algebra. Their design is based on common computergraphics operations that are already supported in GPUs. Figure 4 illustrates the five fundamental operators.
Geometric Transform C (cid:48) = G [ γ ]( C ) : This operator takes as input a single canvas C and outputs a canvas C (cid:48) in which all the geometric objects of C are transformed to a new position in C (cid:48) defined by the parameterfunction γ . Here, the parameter function γ can be defined in two ways:1. γ : R → R γ : S → R In the first case, the new position ( x (cid:48) , y (cid:48) ) of a geometry is dependent on its current position ( x, y ) : C (cid:48) ( γ ( x, y )) = C ( x, y ) Examples of such functions include operations such as rotation, translation, etc. The example in Figure 4(a)rotates and translates (moves) the polygon object to a different position.6 scenario where this operator is useful is when spatial data sets in a database use different coordinatesystems. Thus, when performing binary or n-ary operations on canvases from these data sets, the geometry hasto be converted into a common coordinate system first. The parameter function γ can be defined appropriatelyfor this purpose.In the second case, the new position ( x (cid:48) , y (cid:48) ) of the geometry is dependent on the information stored at thecurrent position C ( x, y ) : C (cid:48) ( γ ( C ( x, y ))) = C ( x, y ) Such a transformation is useful, for example, when one is interested in accumulating values (e.g., for aggre-gation queries) corresponding to a geometric object—in this case, the function γ can be defined to move allpoints having the same object identifier to a unique location. Value Transform C (cid:48) = V [ f ]( C ) : This unary operator outputs a canvas C (cid:48) in which the information corre-sponding to the geometries is modified based on the parameter function f . That is, C (cid:48) ( x, y ) = f ( x, y, C ( x, y )) where, f : R × S → S is a function that changes the object information based on its location and/or value.Figure 4(b) illustrates an example of this operation where the color of the polygon in the canvas is changedfrom blue to orange. Mask C (cid:48) = M [ M ]( C ) : The mask operator is used to filter regions of canvas so that only regions satisfyingthe condition specified by M ⊂ S are retained. Formally, the application of this operator results in the canvas C (cid:48) such that C (cid:48) ( x, y ) = (cid:40) C ( x, y ) , if C ( x, y ) ∈ M ∅ otherwiseFor example, this can be used to accomplish select intersection operations shown in Figure 1(b) and Fig-ure 4(c). Blend C (cid:48) = B [ (cid:12) ]( C , C ) : Blend is a binary operator used to merge two canvases into one. The blendfunction (cid:12) : S × S → S defines how the merge is performed: C (cid:48) ( x, y ) = C ( x, y ) (cid:12) C ( x, y ) The merge operation used in Figure 1(b) is an instance of the blend function. Another example is shown inFigure 4(d).
Dissect { C , C , . . . , C n } = D ( C ) : The dissect operation splits a given canvas into multiple non-emptycanvases, each corresponding to a point ( x, y ) ∈ R having C ( x, y ) (cid:54) = ∅ . That is, a new canvas C i isgenerated corresponding to a non-null point ( x, y ) such that C i ( x (cid:48) , y (cid:48) ) = (cid:40) C ( x, y ) , if ( x (cid:48) , y (cid:48) ) = ( x, y ) ∅ otherwiseFor example, in Figure 4(e), a canvas encoding 4 points is split into 4 canvases each corresponding to oneof those points. As we show later, one of the uses of this operator is for queries involving aggregations overgeometries with 1- and 2-primitives (such as polygons). It is common for certain combinations of fundamental operators to be repeatedly used for various queries. Werepresent these combinations as derived operators and describe of couple of such useful operators below.7 ultiway Blend C (cid:48) = B ∗ [ (cid:12) ]( C , C , . . . , C n ) : This n-ary operator takes as input n canvases and generatesa single canvas after blending all these n canvases in the given order. C (cid:48) = B [ (cid:12) ]( C , B [ (cid:12) ]( C , B [ (cid:12) ]( C , . . . ))) Note that if the blend function (cid:12) is associative, then it allows relaxing the grouping of the different blendoperations, thus providing more flexibility while optimizing queries.
Map { C , C , . . . , C n } = D ∗ [ γ ]( C ) : Map is a composition of a dissect followed by a geometric transform. { C , C , . . . , C n } = G [ γ ]( D ( C )) This operator is useful to align all the canvases resulting from the dissect. In such a case, γ is typically definedas a constant function: γ ( x, y ) = ( x c , y c ) where x c and y c are constants.Without loss of generality, we assume the above notation of providing multiple canvases as input to a unaryoperator (in this case the geometric transform that takes as input canvases output from a dissect operation) asequivalent to applying the operator individually to each of the input canvases. Utility operators are primarily used to generate canvases based on a set of parameters. These are particularlyuseful for classes of spatial queries involving parametric constraints. In particular, we consider the followingthree types of utility operators:
Circle C = Circ [( x, y ) , r ]() : generates a canvas corresponding to a circle with center is ( x, y ) and radius r . Rectangle C = Rect [ l , l ]() : generates a canvas corresponding to a rectangle having diagonal end points l and l . Half Space C = HS [ a, b, c ]() : generates a canvas representing the half space defined by the equation: ax + by + c < . To demonstrate the expressiveness of the proposed model, in what follows we describe how common spatialqueries can be represented as algebraic expressions. We build upon the classification of spatial queries used byEldawy et al. [14] and categorize spatial queries into the following classes: selection , join , aggregate , nearestneighbor , and geometric queries . Note that this is a super set of the query types evaluated in a state-of-the-artexperimental survey by Pandey et al. [34].For ease of exposition, we consider only point and polygonal data sets. It is straightforward to expresssimilar queries for other types of spatial data sets with lines, or more complex geometries (combination ofpoints, lines and polygons). Without loss of generality, we assume that the different operators prune emptycanvases from their output, and an empty canvas is generated when an input canvas does not satisfy a givenconstraint. This is similar to the relational model, where tuples that do not match the query constraints areexcluded from the output table. 8 igure 5: A schematic representation as a plan diagram of the algebraic expression used to select points basedon a polygonal constraint (left). The different steps of this plan is illustrated using an example input with 2points (colored red and cyan)(right). For simplicity, both points are shown in a single canvas.
We can classify spatial selection queries into three types: polygonal selection, range selection, and distance-based selection. We first consider selection queries that have a polygonal constraints, and then extend thealgebraic expressions for other types of selection queries.
Polygonal Selection of Points.
Let D P be a data set consisting of a set of points. Let { ( x , y ) , ( x , y ) , . . . , ( x n , y n ) } be the coordinates corresponding to the location of these points. Let Q be any arbitrary-shaped polygon. Con-sider the following spatial query expressed in an SQL-like syntax: SELECT * FROM D P WHERE Location INSIDE Q Note that this is the same query used for the example in Figure 1(a). Using the proposed data representation,let C P = { C , C , . . . C n } be the set of canvases corresponding to each point (record) in D P . Let the canvas C i corresponding to the i th record be defined as follows: C i ( x, y )[0] = (cid:40) ( id, , if ( x, y ) = ( x i , y i ) ∅ otherwise C i ( x, y )[1] = ∅ C i ( x, y )[2] = ∅ Here, id corresponds to the unique identifier mapping the canvas to the corresponding record in D p . We usethe second element of the tuple C i ( x, y )[0] to keep count of the points incident on the location ( x, y ) , whichin this case is . The third element is ignored for this query. Let the canvas C Q corresponding to the querypolygon Q be defined as follows: C Q ( x, y )[0] = ∅ C Q ( x, y )[1] = ∅ C Q ( x, y )[2] = (cid:40) (1 , , if ( x, y ) falls inside Q ∅ otherwise While nearest-neighbor-based selection could also be in this category, we place it in a separate class (Section 4.4). igure 6: Plan diagram corresponding to the algebraic expression used to select polygons based on a polyg-onal constraint (left). This plan is illustrated using 2 input polygons (colored red and cyan) (right), which areshown in a single canvas.Similar to the case of points above, the elements C Q ( x, y )[2][0] and C Q ( x, y )[2][1] stores the id of the querypolygon (set to since there is only one polygon) and count of 2-primitives incident on ( x, y ) respectively.Using the canvases defined above, the select query can be algebraically expressed as follows: C result ← M [ M p ]( B [ (cid:12) ]( C P , C Q )) where, ∀ s , s ∈ S s (cid:12) s = s [0][0] s [0][1] s [0][2] — ∅ — s [2][0] s [2][1] s [2][2] and M p = (cid:8) s ∈ S | s [0] (cid:54) = ∅ and s [2][0] = 1 (cid:9) Similar to the example in Figure 1(b), the above expression first merges the input data with the querypolygon using the blend operator B , and then uses the mask operator M to select only the intersection (alocation is part of the intersection if both, a 1-primitive and 2-primitive are incident on it). Figure 5 visualizesthe above expression as a plan diagram, and illustrates the different steps for two examples when a point isinside the query polygon (and hence part of the result), and when a point is outside respectively. Polygonal Selection of Polygons.
Let D Y be a data set consisting of a set of polygons. Let { Y , Y , . . . , Y n } be the set of polygons associated with each record of the data set. As before, the polygons can take any shape.Let Q be another arbitrary-shaped polygon.Let the set of canvases C Y corresponding to polygons in D Y be defined as follows: C i ( x, y )[0] = ∅ C i ( x, y )[1] = ∅ C i ( x, y )[2] = (cid:40) ( id, , if ( x, y ) falls inside Y i ∅ otherwiseLet the canvas corresponding to query polygon Q be defined as before. Now,consider the following selection query, similar to the one above, but over D Y : SELECT * FROM D Y WHERE Geometry INTERSECTS Q C result ← M [ M y ]( B [ ⊕ ]( C Y , C Q )) where, ∀ s , s ∈ S s ⊕ s = — ∅ —— ∅ — s [2][0] s [2][1] + s [2][1] s [2][2] and M y = (cid:8) s ∈ S | s [2][1] = 2 (cid:9) Note that unlike in the previous case of selecting points, since both the data as well as the query consistof polygons, both the data canvas and the query canvas store information only for 2-primitives. Hence, thesecond element of the information tuple is made use of in this case to compute the intersection (i.e., locationshaving two 2-primitives incident on them). Figure 6 shows the algebraic expression using a plan diagram, andillustrates two examples denoting selection and non-selection scenarios respectively.
Selection Using Other Spatial Constraints.
In addition to polygonal constraints, selection queries overspatial data may also involve other types of spatial constraints. Commonly used are range constraints anddistance-based selection. It is easy to extend the algebraic expressions used for polygonal constraints to thesescenarios as follows.
1. Rectangular Range Constraints:
This class of queries requires the selection of spatial objects that intersecta 2D range. To execute such queries, the query polygon is simply replaced by a rectangle, the canvas for whichcan be created using the utility operator: C Q ← Rect [ l , l ]() , where l , l denotes the diagonal endpoints ofthe rectangle range.
2. One-Sided Range Constraints:
In this scenario, the queries require selecting geometries that intersect agiven half-space ax + by + c < (note that this is a more generic formulation of queries involving constraintssuch as x < c or y < c ). Again, the utility operator can be used to generate the required query canvas as areplacement for the query polygon: C Q ← HS [ a, b, c ]() .
3. Distance-based Selection:
In this case, the queries require the selection of geometries that lie within agiven distance d of a query point ( x q , y q ) . This essentially translates to using a circle with radius d centeredat ( x q , y q ) as the query polygon, the canvas for which can also be created using the utility operator: C Q ← Circ [( x q , y q ) , d ]() Given the possibility to adapt these three types of spatial constraints to a polygon, we will focus only onpolygonal constraints for the remainder of this section.
Spatial join queries can be broadly classified into three types: (Type I) points (cid:46)(cid:47) polygons join; (Type II) poly-gons (cid:46)(cid:47) polygons join; and (Type III) points (cid:46)(cid:47) points join. Type III is commonly known as a distance join. Asin the previous section, one set of points (say the RHS) of the distance join can be converted into a collectionof circles to transform this to a points (cid:46)(cid:47) polygons join query. We therefore focus on the first two types of joinqueries.Let D P and D Y be a point data set and a polygon data set respectively. A Type I join query between thesetwo data sets is typically specified as follows: SELECT * FROM D P , D Y WHERE D P . Location INSIDE D Y .Geometry D Y and D Y be two polygon data sets. A Type II join query between these two data sets canbe specified as follows: SELECT * FROM D Y , D Y WHERE D Y .Geometry INTERSECTS D Y .Geometry The above join queries are equivalent to performing selection queries, one for each record (canvas) from D Y and D Y respectively. Thus, conceptually, the algebraic expression for joins is the same as the correspondingselection queries, with the exception that a single query polygon is instead replaced with a collection ofpolygons. A Type I join query can then be realized using the following expression C result ← M [ M p ]( B [ (cid:12) ]( C P , C Y )) while a Type II join query can be realized using C result ← M [ M y ]( B [ ⊕ ]( C Y , C Y )) Here, C P , C Y , C Y , and C Y are collections of canvases corresponding to the data sets D P , D Y , D Y and D Y respectively. The different parameters of the operators in the above expressions remain the same as whatwas used for their selection counterparts.Similar to the join operator in the relational model, spatial joins using our model can be implemented inseveral ways. A straightforward approach is using nested loops for the blend operation, which can be mademore efficient if spatial indexes are available. The third class of queries common for spatial data are spatial aggregation queries. We consider two types ofsuch queries: aggregating the results from a selection, and the aggregation required for a group-by over a join.
Aggregation over a Select.
Consider first a simple count of the results from a selection query:
SELECT COUNT(*) FROM D P WHERE Location INSIDE Q This query can be realized using the expression: C count ← B ∗ [+]( G [ γ c ]( C result )) where γ c : S → R is defined such that ∀ s ∈ S , γ c ( s ) = ( s [2][0] , , + : S × S → S is defined as s + s = s [0][1] + s [0][1] 0 — ∅ — s [2][0] s [2][1] s [2][2] and C result ← M [ M p ]( B [ (cid:12) ]( C P , C Q )) is the set of canvases resulting from the select operation (same as in Section 4.1 above).Basically, each canvas (corresponding to a point) satisfying the selection constraint is transformed to aconstant location (1 , (recall that the id of the query polygon Q is 1), and the resulting canvases are mergedtogether to compute the required summation (see Figure 7). The value of C count (1 , stores the resultingcount. Note that the second element of the tuple corresponding to the 0-primitives is used for this operation,while this was not necessary when performing only a select.12 igure 7: Plan diagram for aggregating the results from a select query (left). The example (right) uses theresults from a select sub-query that returns 4 points, and illustrates the workflow that counts the results.Instead of count, if the query requires computing other distributive (e.g., sum, minimum, maximum) oralgebraic (e.g., average) aggregations over a given attribute, then the third element of the tuple correspondingto the 0-primitives can be used to store the value corresponding to this attribute, and the + function canbe modified appropriately. For example, let A be a real-valued attribute of the data set D P . Consider thefollowing query: SELECT SUM( A ) FROM D P WHERE Location INSIDE Q This query can be realized using the same expression as above by defining C i ∈ C P and + , respectively,as follows. C i ( x, y )[0] = (cid:40) ( id, , A [ i ]) if ( x, y ) = ( x i , y i ) ∅ otherwise C i ( x, y )[1] = ∅ C i ( x, y )[2] = ∅ s + s = s [0][1] + s [0][1] s [0][2] + s [0][2] — ∅ — s [2][0] s [2][1] s [2][2] In this scenario, the value of C result (1 , maintains the required sum. Aggregation over a Join.
The second type of aggregation queries consist of a group-by operation over aspatial join. In particular, consider the following query:
SELECT COUNT(*) FROM D P , D Y WHERE D P . Location INSIDE D Y . GeometryGROUP BY D Y . ID The algebraic expression used for aggregations over select works for this query as well: C count ← B ∗ [+]( G [ γ c ]( C result )) where C result ← M [ M p ]( B [ (cid:12) ]( C P , C Y )) When using the expression for a join, each of the polygons have a unique id . Hence, the join resultcorresponding to a point-polygon pair that satisfies the containment constraint will be moved to the location ( id, corresponding to that polygon. Thus, the final multiway blend operation will individually count pointswithin each of the polygons in D Y . The value C count ( id, stores the value corresponding to polygonwith ID id . 13 .4 Nearest-Neighbor Queries We consider the following nearest-neighbor-based query template that finds the k points closest to a givenquery point X ( x p , y p ) ( k NN query).
SELECT * FROM D P WHERE Location ∈ KNN( X , k ) Without loss of generality, we assume that the distances of points in D P to query point X are totally ordered,i.e., no two distances are the same. In the presence of a clash, the points can be perturbed by an infinitesimallysmall distance (cid:15) to ensure the total order condition is satisfied.One way to answer this query is to first find the distance r such that there are exactly k points within thecircle centered at X with radius r . Then, the distance-based selection can be used to obtain the query result.This workflow can be accomplished using the proposed algebra as follows. Let C X be a set of circles centeredat X have increasing radii. This can be accomplished by using the
Circ () utility operator. Let the id of eachcircle c be the radius of c . Then, the required radius r to identify the k nearest neighbors can be obtainedusing the following expression: C r ← D ∗ [ γ ]( M [ M r ]( C count )) where M r = (cid:8) s ∈ S | s [0][1] = k (cid:9) γ : S → R is defined as ∀ s ∈ S , γ ( s ) = (0 , , and C count ← B ∗ [+]( G [ γ c ]( M [ M p ]( B [ (cid:12) ]( C P , C X )))) is the same join-group-by aggregation used above. Essentially, the mask operation is applied to the resultfrom the aggregation to remove all circles containing fewer or more than k points, followed by a map toobtain individual canvases for each valid radius. Therefore, C (0 , , ∀ C ∈ C r has the id s of canvasescorresponding to the circles having exactly k points in them. Since the id s correspond to the radius of therespective circles, this can then be used to perform a distance-based selection to complete the k NN query.
The final class of queries we consider is the set of computational geometry queries. These include queries suchas computing the Voronoi diagram, spatial skyline, and convex hull [14]. While it might not be straightforwardto realize all of these queries algebraically, the provided operators can be used as part of a stored procedure toexecute some of them.We illustrate this through the following example: consider the query to compute the Voronoi diagram for agiven set of points { ( x , y ) , ( x , y ) , . . . , ( x n , y n ) } . This can be accomplished using the following pseudo-code: Procedure
ComputeVoronoi
Require:
Points { ( x , y ) , ( x , y ) , . . . , ( x n , y n ) } C voronoi ← ∅ for each i ∈ [1 , n ] do C voronoi ← V [ f ( x i ,y i ) ]( C voronoi ) end for return C voronoi Conceptually there is an infinite number of circles, but in practice, a finite number of circles can be created with small incrementsin radii up to a maximum radius. f ( x p ,y p ) : R × S → S is defined as follows: f ( x p ,y p ) ( x, y, s )[0] = ∅ f ( x p ,y p ) ( x, y, s )[1] = ∅ f ( x p ,y p ) ( x, y, s )[2] = ( i, d , if s = ∅ ( s [2][0] , s [2][1] , s [2][1] < d ( i, d , otherwisewhere d is the Euclidean distance between the point ( x, y ) and the parameter point ( x p , y p ) . This procedureincrementally builds the Voronoi diagram by adding one input point at a time—during iteration i , the regionsof existing polygons closest to point i are merged to form a new Voronoi region corresponding to this point.It might not be possible to express all computational geometry queries as stored procedures using thepreviously defined operators. In such cases new operators can be added for such queries. So far we focused on standard queries and showed how they could be translated into algebraic expressionsusing the proposed algebra. As mentioned in Section 1, an algebra is useful only if the operators can beeasily composed to also support more complex queries. In this section, we demonstrate this property using aspatial query involving constraints on two spatial attributes [16]: we consider selection queries over origin-destination data sets (e.g., taxi trips, migration data), where the selection is based on polygonal constraints onboth origin as well as destination locations:
SELECT * FROM D P WHERE Origin INSIDE Q and Destination INSIDE Q Here, D P is the input point data set having two location attributes Origin and
Destination , and Q and Q arepolygonal constraints over the two location attributes respectively. This query could be used, for example, toretrieve all the taxi trips between two specific neighborhoods.Let C p be the canvases corresponding to D P defined as before, but with respect to the origin attribute. Let C Q and C Q be canvases corresponding to the query constraints defined as follows: C Qi ( x, y )[0] = ∅ C Qi ( x, y )[1] = ∅ C Qi ( x, y )[2] = (cid:40) ( i, , if ( x, y ) falls inside Q i ∅ otherwiseThe above query can then be realized as: C result ← M [ M p (cid:48) ]( B [ (cid:12) ]( G [ γ d ]( C origin ) , C Q )) where C origin ← M [ M p ]( B [ (cid:12) ]( C P , C Q )) is the same expression as the selection query used earlier and selects the points that belong to the intersectionbetween C P and C Q . The function γ d : S → R is used to transform the point from the origin to thedestination location and is defined as ∀ s ∈ S , γ d ( s ) = destination ( s [0][0]) Here, destination() is a function that takes the id of the point as input and returns the destination location; andthe mask function M p (cid:48) defined as: M p (cid:48) = (cid:8) s ∈ S | s [0] (cid:54) = ∅ and s [2][0] = 2 (cid:9) igure 8: Examples of alternate plan execution strategies for complex queries. (a) Query plan for a selectionquery over origin- destination data having a polygonal constraint on both spatial attributes. (b) Selection querywith multiple polygonal constraints. (c) Spatial aggregation approach used in [47].The other parameter functions M p and (cid:12) are defined as before. Figure 8(a) illustrates the above expression asa plan diagram. Intuitively, this plan first computes C origin , i.e., all records whose origin intersect with Q . Itthen transforms each record in C origin to its destination and tests for their intersection with Q . We now briefly describe a proof-of-concept GPU-based implementation of our model to demonstrate its ad-vantages with respect to enabling the reuse of operators. In particular, we discuss the blend and mask operatorsrequired to implement spatial selection queries. To further illustrate the expressive power of our model, wealso examine the spatial aggregation operation proposed in [47], and show how it translates directly into analgebraic expression. We note that there can be alternate implementations and that different design choicescan be made, but these are beyond the scope of this paper.
The prototype was implemented using C++ and OpenGL. It assumes the traditional representation of pointand polygon data sets, that is, they are stored as a set of tuples. Instead of duplicating the geometric objectsin the data by explicitly storing the corresponding canvases, we create the canvases on the fly when a query isexecuted.
Data Representation.
Recall from Section 2 that geometric objects are modeled as a union of smooth mani-folds, and a canvas representing these objects are defined as a scalar function over R . Given such a continuousformal representation, it is therefore important to have a discrete representation to be used in the implementa-tion. Our choice was to maintain a canvas as a texture [43], which corresponds to a collection of pixels. Here,each pixel stores the object information triple. The canvas functions are defined as discussed in Section 4.1.However, since the pixels discretize the space, it is also necessary to store additional data corresponding tothe geometry boundaries. In the case of points, this additional information corresponds to the actual locationof the points. For polygons, we store a flag that is set to true if the pixel is on the boundary of the polygon. Toaccurately identify all boundary pixels, we use an OpenGL extension that enables conservative rasterization.This identifies and draws all pixels that are touched by a triangle (or line), and is different from the default16asterization, wherein a pixel is only drawn when > of the pixel is covered by the primitive. This en-sures that the border pixels are kept track of in a conservative fashion, and hence there is no loss in accuracy.Additionally, a simple index is maintained that maps each boundary pixel to the actual vector representationof the polygon.The canvases are created on the fly by simply rendering (i.e., drawing) the geometry using the traditionalgraphics pipeline. The color components (r,g,b,a) are used to store the canvas function. This rendering isperformed onto an off-screen buffer, which generates the required texture. To handle polygons with holes, theouter polygon is first drawn onto the off-screen buffer. The inner polygon (representing one or more holes) isthen drawn such that the pixels corresponding to it are negated (i.e., the canvas function is set to null). Operators.
The blend operator is accomplished through a straightforward alpha blending [43] of two tex-tures, which is supported as part of the graphics pipeline. The mask operator looks up each pixel of the texturein parallel and tests for the mask condition. Note that here, the boundary information is used to perform anaccurate test if the point is part of a pixel that is on the boundary of the polygon.Note that if an approximate result suffices, then the hybrid representation of the canvas can be entirelyeliminated, making the implementation simpler. In fact, in this case, the texture size can be adjusted in orderto appropriately bound the error in the query result, similar to the approach used in [47].
Alternate Implementations.
Another possibility for the implementation is to represent geometric objects asa collection of simplicial complexes, thus avoiding any rasterization. The operators then can be implementedto make use of the native ray tracing support provided by the latest RTX-based Nvidia GPUs. We decidedto use the rasterization pipeline instead so that our prototype could support any modern GPU from multiplevendors, and not just the RTX GPUs from Nvidia.
Queries.
The polygonal selection of points is accomplished by first creating the canvases corresponding tothe query polygon and query points, which are blended together and then filtered using the mask operator.The operator functions are as defined previously. Our implementation, without any modification, also worksfor polygonal selection of polygons , i.e., if the input is changed from a set of points to a set of polygons.A straightforward variation of the selection query is to support multiple polygons as part of the constraint .In particular, consider the case when the constraint requires the input point to be inside at least one of thepolygons (a disjunction). Existing approaches accomplish this by testing the points with respect to each of thepolygonal constraints. However, using our model this query can be expressed as follows using just the blendand mask operators: C result ← M [ M p (cid:48) ]( B [ (cid:12) ]( C P , B ∗ [ ⊕ ]( C Q ))) Here, C Q is the collection of canvases corresponding to the query polygons, and (cid:12) and ⊕ are the blendfunctions defined in Section 4.1. The above expression first blends together all the query constraint polygonsinto a single canvas, which is then used to perform the select similar to the single polygon case. The maskfunction M p (cid:48) is defined as: M p (cid:48) = (cid:8) s ∈ S | s [0] (cid:54) = ∅ and s [2][0] ≥ (cid:9) Recall that the mask function M p used for the single query polygon case tests the incidence of the polygon on apixel by testing the id field of the function value corresponding to 2-primitives. Instead, this is accomplishedusing M p (cid:48) by checking if the count of the polygons incident on the pixel is at least one. Thus, this maskfunction M p (cid:48) is valid even when there is only a single query polygon. So, in our implementation, we use thisinstead of the M p defined earlier. Figure 8(b) shows the plan for this query. Furthermore, as we discuss inSection 6, using the proposed operators also helps improve the performance of the queries when compared tothe traditional approach. A query with a conjunction can be expressed similarly, by appropriately adjustingthe mask function. 17 .2 Spatial Aggregation Consider the spatial join-aggregation query discussed in Section 4.3. Recall that the typical evaluation strategyused by existing systems is to perform a join followed by an aggregation.
RasterJoin [47] proposed analternate approach that maps these queries into operations supported by the graphics pipeline in GPUs, leadingto orders of magnitude speedup over CPU-based approaches. RasterJoin can be directly mapped into a queryexecution plan using the proposed spatial operators as illustrated in Figure 8(c), and translates to the followingexpression: C count ← B ∗ [+]( D ∗ [ γ c ]( M [ M p ]( B [ (cid:12) ]( B ∗ [+]( C P ) , C Y )))) Here, the parameters + , γ c , M p , and (cid:12) are the same as defined earlier. While the above expression assumes count as the aggregation function, other aggregations can be incorporated by modifying the blend operationparameter + appropriately. Note that in this plan, all the points are first merged into a single canvas whichkeeps track of partial aggregates. That is, each canvas pixel maintains the count of all points that fall into thatpixel. This canvas is then joined with the set of input polygons to identify the points that intersect with thepolygons, and the results are again merged to compute the final aggregate. In other words, the counts from theindividual pixels that fall within a polygon are combined to generate the aggregation for that polygon. Eventhough this approach performs an additional merge (through the multiway blend), the size of the input for thejoin is drastically reduced (there is only one canvas on the left hand side of the blend), thus reducing the costof the entire plan. We now briefly discuss the performance of the spatial selection queries using the prototype described above.All experiments were run on a laptop having an Intel Core i7-8750H processor, 16 GB memory and 512 GBSSD. The laptop has a dual Nvidia GTX 1070 Max-Q GPU with 8 GB graphics memory, and an integratedIntel UHD Graphics 630 GPU.
Data and Queries.
The main goal of our evaluation is to: 1) demonstrate the advantage of using GPU-friendlyoperators compared to a traditional GPU-based solution; and 2) illustrate how the same operators can be usedfor variations of a given query. We do this using selection queries that selects trips from the New York City’staxi data having their pickup location within a query polygon. To demonstrate point (2), we also use querieshaving a disjunction of multiple polygonal constraints. The size of the input is varied using the pickup timerange of the taxi trips.To mimic real-world use cases, all the query polygons used in these queries were “hand-drawn” using avisual interface (e.g., [16]) and adjusted to have the same MBR. We then use as input only taxi trips that havetheir pickup location within this MBR. In other words, the evaluation assumes the existence of a filtering stageand primarily focuses on the refinement step. We decided to do this for two reasons. First, the refinementstage, and not filtering, is now the primary bottleneck. Unlike previous decades when the disk-based indexfiltering was the primary bottleneck, due to the existence of fast ssd-based storage and large CPU memory,the filtering takes only a small fraction of the query time. For example, the filtering step used by the state-of-the-art GPU-based selection approach, even though it is CPU-based, takes only a few milliseconds even fordata having over a billion points [11]. Second, when working with complex queries, depending on the queryparameters, the optimizer need not always choose to use the spatial index corresponding to a spatial parameter,and the spatial operations could be further up in the plan (e.g., the optimizer might to choose first filter basedon another attribute, say time, before performing a spatial operation). In such scenarios, the spatial operationwould not have the benefit of an index based filtering, and query bottleneck would then be the refinement step.Additionally, the above setup also helps remove input bias when comparing the performance across polygonalconstraints having different shapes and sizes. 18 igure 9:
Scaling with input size.
Approaches.
We compare the performance of our approach with a CPU baseline, a parallel CPU implemen-tation using OpenMP, as well as a GPU baseline. Because of the above mentioned experimental setup thateliminates the effect of indexes used by current state-of-the-art, we only need to implement the PIP tests forthe above baselines. While our approach was executed on two different GPUs (denoted as Nvidia and Intel),the GPU baseline was executed only on the faster Nvidia GPU.
Performance.
Figures 9(a) and 9(c) shows the speedup achieved by the different approaches over a singlethreaded CPU implementation when the query had one and two polygonal constraints, respectively. Notethat while all GPU-based approaches are over two orders of magnitude faster than the CPU-based approach,the speedup of our approach increases when the number of polygonal constraint increases. This is because,the only additional work done by our approach when there are additional polygons is to blend the constraintpolygons. This is significantly less work when compared to existing approaches which have to perform morePIP tests in this case. This is corroborated when looking at the query run times in Figures 9(b) and 9(d)wherein our approach (in red) requires only 4 seconds (using the Nvidia GPU) even when there are twopolygons as constraints even when the query MBR has as many as 571M points. For a given input and GPU,not only is the time to transfer data between the CPU and GPU similar, but is also a significant fraction of thequery time. In this light, the speedup in the processing time achieved using our model over a traditional GPU-based approach (which is greater than the overall speedup depicted in Figures 9(a) & (c)) clearly demonstratesthe advantages of using a GPU-friendly approach.
Figure 10:
Varying polygonal constraints. Legend is the same as in Figure 9.Figure 10 shows the speedup and running times when the polygonal constraint is varied. The differentpolygons had different shapes (and sizes) with query selectivity varying from roughly 3% to 83%. While thereis some variation in the processing time depending on the complexity of the polygon constraint, this variationis higher for the baseline. This is because the number of PIP tests performed by the baseline is linearlyproportional to the size of the polygon. Irrespective of this complexity, our approach using the discrete GPUrequires at most 2 seconds even when the MBR has 302M points.19lso interesting to note is the performance of our approach on the integrated Intel GPU. While, as expected,it is slower than the GPU baseline using a Nvidia GPU, it is still over 2-orders of magnitude faster than theCPU implementation. Given that these GPUs are present in even mid-range laptops, ultrabooks, and eventablets, our algebraic formulation can potentially allow fast spatial queries even on such systems.
Interoperability with Relational Model.
The proposed model is compatible with the relational model andcan be incorporated into existing relational systems. Recall that the minimalistic definition of S used inSection 2.2 reserves the first element of the triple to store the unique ID corresponding to the data record.Thus, given a set of canvases corresponding to existing data sets, it is possible to switch to the correspondingrelational tuple using this ID. Analogously, the storage structure of a relational tuple can be changed to linkto the corresponding canvas, thus allowing connection in the opposite direction. Alternatively, similar to ourproof-of-concept implementation, the canvases could also be created on demand.Thus, conceptually, one can consider the relational tuple and a canvas to be the dual of each other allowinga seamless use of the two representations by a query optimizer to appropriately generate query plans involvingboth spatial and non-spatial operators. Query Optimization.
The proposed model facilitates query optimization in the following ways.
1. Allowing different query execution plans.
Given a complex query Q , the proposed model enables the cre-ation of multiple plans to realize Q . We gave two such examples in the previous section—one for disjunctionand the other for spatial aggregation. In all such scenarios, by appropriately modeling the cost functions ofthe operators together with metadata about the input, the optimizer can choose a plan that has a lower cost.
2. Supporting diverse implementations.
It is also possible to have multiple implementations of the sameoperators, for example, using pre-built spatial indexes. Each of the indexes would result in a different costbased on the properties of the data and the query, thus providing a rich set of options over which to performthe optimization.
3. Enabling general query processing.
Given the duality between the canvas and the relational tuple as dis-cussed above, the proposed operators can also be easily plugged into existing query optimizers, thus allowingfor complex queries involving both the spatial and relational attributes.
Limitations.
While the proposed data representation can be directly extended to support 3D primitives, theproposed operators over such 3D data do not have a straightforward implementation using the GPU. Giventhat native ray tracing support is now being introduced in GPUs, it would be interesting to explore extensionsto our algebra that make use of such advances to support 3D spatial queries.
Spatial Queries.
The most common approach used for executing spatial queries is to implement customtechniques for different query types. Selection queries, for example, are typically handled through the use ofspatial indexes. These include R-Trees [23], R ∗ -trees [7], kd-trees [8], quad trees [17] and the grid index [40].While such indexes are also useful for other query types such as spatial joins, enhancements and more efficientalgorithms are often designed for specific queries. For instance, several works focus on the filtering step ofspatial join algorithms [9, 25, 36, 37]. Custom algorithms have also been designed for spatial aggregation [35,45, 48, 50] and for nearest neighbor-based queries (see e.g., [24, 26, 28, 41, 52]).The advent of modern hardware with multiple processing units has led to the design of new approachesthat use them for spatial query processing. In particular, GPUs, and clusters supporting the MapReduceparadigm, are extremely popular for this purpose. GPUs have been used for spatial selections [11], spatial20oins [2, 56], spatial aggregations [47], as well as nearest neighbor queries [10, 33]. Similarly, there arededicated spatial database systems designed using MapReduce such as Hadoop-GIS [3] and Simba [51].Eldawy and Mokbel [14] provide a comprehensive survey of approaches that use MapReduce for spatialquery processing. Covering the numerous work related to different spatial queries is beyond the scope ofthis paper. However, note that many of these approaches (e.g., indexes) can be easily applied to enhance theoperators of the proposed algebra. Spatial Data Models and Algebras.
Specific to spatial databases, G¨uting [20] introduced geo-relationalalgebra, which extends relational algebra to include geometric data types and operators. The geometric datatypes included points, lines, and polygons (without holes) and the geometric operators included operationsthat are now common in most spatial database solutions (containment, intersection, perimeter, area, etc.).Aref and Samet [5, 6, 42] generalized the above model and provided one of the first high-level discussions onintegrating spatial and non-spatial data to build a spatial database system, and the related challenges involvedin designing a query optimizer for such a system. Note that current spatial extensions follow approachesvery similar to the ideas proposed in these works. This model is user facing , i.e., the queries of interest areexpressed making use of the data types and the operators provided in the model. The implementation of theoperators, however, is left to the developer, and often devolves into having separate implementations for eachdata type/query combination (similar to the selection query example discussed in Section 1). Our approach, onthe other hand, which uses a single representation for all spatial data types and set of GPU-friendly operatorsdifferent from the traditional operators, is primarily meant to help database developers implement an efficientGPU-based spatial query engine : it can be incorporated into existing systems unbeknownst to the user whileat the same time providing significant benefits to the database engine and query performance.Different from the extended relational models, Egenhofer and Franzosa [12] proposed a model that usesconcepts from point set topology for spatial queries. In particular, this work models spatial data objects(of a single type, like lines or regions) as closed sets (that defines the underlying topological space), anduses the topological relationship between pairs of closed sets to answer spatial queries. These relationshipsare computed based on 9 possible intersections computed between the open set, boundary and complementcorresponding to the closed sets. Egenhofer and Sharma [13] showed the equivalence of the above modelto a raster space, thus making it suitable for GIS queries involving raster data. Kainz et al. [27] model thesame topological relations as above, but using partially ordered sets. While theoretically elegant, there arethree main shortcomings of this topological approach: (1) the topological relationships are tied to a particulardata type, that is, between two regions, or two lines, etc., making it difficult to work with complex spatialobjects; (2) computing the relationships requires costly intersection tests to be performed between everypair of spatial objects, making the approach untenable for working with large spatial data sets; and moreimportantly, (3) while intersection-based queries are straightforward, distance-based queries such as distance-based selections/joins, nearest neighbors etc. cannot be expressed using this model.Gargano et al. [18] proposed an alternative model that supports complex objects in which spatial objectsare represented using a set of rectangular regions. The spatial queries are then realized as operations overthese sets. This representation results in a loss of accuracy in the query results. Trying to overcome this usingvery small rectangles can result in high memory overheads and also require expensive set operations, thuslimiting the practical applicability of this model.G¨uting and Hartmut proposed another model called Realms [21] and a corresponding ROSE algebra [22].A Realm models spatial data as a planar graph, where the nodes correspond to points on an integer grid. Giventhe hardware limitations during that era, the goal of this approach was to avoid costly floating point operationsand any imprecision in the query computation. As data is inserted into the database, the spatial objects are“redrawn” to ensure topological consistency (such as locations of intersection points). This framework hasimportant limitations. First, even though the redrawings ensure that queries involving intersection tests can beefficiently and precisely computed using only integer operations, due to the distortion involved, distance-basedqueries now become imprecise. Second, it is necessary for all query parameters to be a part of the Realm.21hus, when generating dynamic queries (common in several data analysis tasks), the query parameters have tofirst be inserted into the Realm, requiring several redrawings of the existing data. Then, after query execution,the newly inserted parameters should be removed. Not only is this expensive, the redrawings caused by thetemporary insertions are also not undone, resulting in further distortions. Third, queries involving spatialobjects outside the Realm boundaries are not possible. This is a major drawback in modern exploratory dataanalysis tasks where users can dynamically change their focus as they test and formulate new hypotheses.Finally, similar to the extended relational models, there are separate data types for points, lines, and polygons,making the implementation specific to these data types, and also making it difficult to incorporate complexspatial objects consisting of more than one type.All of the above models/algebras were designed before GPUs became mainstream, and thus an implemen-tation of these models using GPUs is non-trivial (difficult to parallelize, involves iterative algorithms likeintersection computations, etc.). Our approach on the other hand was designed keeping GPUs in mind, and isbased on computer graphics operations for which they are optimized.Models have also been proposed that focus on moving objects [31] which are orthogonal to our work. ForGIS applications, Tomlin [46] proposed the Map algebra which was then extended to support time by Jeremyet al. [30]. The map algebra was designed to enable cartographers to easily specify common cartographicfunctions. Voisard and David [49] propose a layered model specific to geographic maps to help users buildnew maps. From an implementation point of view, all of the above operations can be translated into spatialqueries for execution, and thus an efficient spatial model will be useful in such scenarios as well.
In this paper, we introduced a GPU-friendly data model and algebra to support queries over spatial data sets.A key and novel idea in this work is to use a representation that captures the geometric properties inherent inspatial data, and design GPU-friendly operators that can be applied directly on the geometry. We have shownthat the proposed algebra is expressive and able to realize common spatial queries. In addition, since thealgebra is closed, it can also be used to construct complex queries by composing the operators. The potentialease of implementation afforded by our approach, and its performance even on commodity hardware, cangreatly influence the design as well as adoption of GPU-based spatial database solutions, thus democratizingreal-time spatial analyses. This is corroborated by the results of the experimental evaluation carried out withour proof-of-concept prototype.We also believe that this work opens opportunities for new research in the area of spatial query optimization,both for the development of new theory, including algorithms for plan generation and cost estimation, andsystems that use the algebra to efficiently evaluate queries over the growing volumes of spatial data.
Acknowledgements
This work was partially supported by the DARPA D3M program and the NYU Moore Sloan Data ScienceEnvironment.
References [1] D. W. Adler. Db2 spatial extender - spatial data within the rdbms. In
Proc. VLDB , pages 687–690, SanFrancisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.[2] D. Aghajarian, S. Puri, and S. Prasad. Gcmf: An efficient end-to-end spatial join system over largepolygonal datasets on gpgpu platform. In
Proc. GIS , pages 18:1–18:10, New York, NY, USA, 2016.ACM. 223] A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz. Hadoop gis: A high performance spatialdata warehousing system over mapreduce.
PVLDB
Proc. SSD , pages 299–318,London, UK, UK, 1991. Springer-Verlag.[6] W. G. Aref and H. Samet. Optimization for spatial query processing. In
Proc. VLDB , pages 81–90, SanFrancisco, CA, USA, 1991. Morgan Kaufmann Publishers Inc.[7] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger. The r*-tree: an efficient and robust accessmethod for points and rectangles.
SIGMOD Rec. , 19(2):322–331, May 1990.[8] J. L. Bentley. Multidimensional binary search trees used for associative searching.
Commun. ACM ,18(9):509–517, 1975.[9] T. Brinkhoff, H. Kriegel, and B. Seeger. Efficient processing of spatial joins using r-trees.
SIGMODRec. , 22(2):237–246, June 1993.[10] B. Bustos, O. Deussen, S. Hiller, and D. Keim. A graphics hardware accelerated algorithm for nearestneighbor search. In V. N. Alexandrov, G. D. van Albada, P. M. A. Sloot, and J. Dongarra, editors,
Proc.ICCS , pages 196–199, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.[11] H. Doraiswamy, H. T. Vo, C. T. Silva, and J. Freire. A gpu-based index to support interactive spatio-temporal queries over historical data. In
Proc. ICDE , pages 1086–1097. IEEE, May 2016.[12] M. J. Egenhofer and R. D. Franzosa. Point-set topological spatial relations.
Int. J. Geogr. Inf. Syst. ,5(2):161–174, 1991.[13] M. J. Egenhofer and J. Sharma. Topological relations between regions in R and Z . In D. Abel andB. Chin Ooi, editors, Adv. Spatial Databases , pages 316–336, Berlin, Heidelberg, 1993. Springer BerlinHeidelberg.[14] A. Eldawy and M. F. Mokbel. The era of big spatial data: A survey.
Found. Trends databases , 6(3-4):163–273, Dec. 2016.[15] J.-D. Fekete and C. Silva. Managing Data for Visual Analytics: Opportunities and Challenges.
IEEEData Eng. Bull. , 35(3):27–36, 2012.[16] N. Ferreira, J. Poco, H. T. Vo, J. Freire, and C. T. Silva. Visual exploration of big spatio-temporal urbandata: A study of new york city taxi trips.
IEEE TVCG , 19(12):2149–2158, 2013.[17] R. Finkel and J. Bentley. Quad trees a data structure for retrieval on composite keys.
Acta Informatica ,4(1):1–9, 1974.[18] M. Gargano, E. Nardelli, and M. Talamo. Abstract data types for the logical modeling of complex data.
Information Systems , 16(6):565 – 583, 1991.[19] GRASS GIS. https://grass.osgeo.org/, 2018.[20] R. H. G¨uting. Geo-relational algebra: A model and query language for geometric database systems. In
Proc. EDBT , pages 506–527, London, UK, UK, 1988. Springer-Verlag.2321] R. H. G¨uting and M. Schneider. Realms: A foundation for spatial data types in database systems.In D. Abel and B. Chin Ooi, editors,
Adv. Spatial Databases , pages 14–35, Berlin, Heidelberg, 1993.Springer Berlin Heidelberg.[22] R. H. G¨uting and M. Schneider. Realm-based spatial data types: The rose algebra.
VLDBJ , 4(2):243–286, Apr 1995.[23] A. Guttman. R-trees: a dynamic index structure for spatial searching.
SIGMOD Rec. , 14(2):47–57, June1984.[24] G. R. Hjaltason and H. Samet. Distance browsing in spatial databases.
ACM Trans. Database Syst. ,24(2):265–318, June 1999.[25] E. H. Jacox and H. Samet. Spatial join techniques.
ACM Trans. Database Syst. , 32(1), Mar. 2007.[26] H. V. Jagadish, B. C. Ooi, K. Tan, C. Yu, and R. Zhang. idistance: An adaptive b+-tree based indexingmethod for nearest neighbor search.
ACM Trans. Database Syst. , 30(2):364–397, June 2005.[27] W. Kainz, M. J. Egenhofer, and I. Greasley. Modelling spatial relations and operations with partiallyordered sets.
Int. J. Geogr. Inf. Syst. , 7(3):215–229, 1993.[28] N. Katayama and S. Satoh. The sr-tree: An index structure for high-dimensional nearest neighborqueries.
SIGMOD Rec. , 26(2):369–380, June 1997.[29] Z. Liu and J. Heer. The effects of interactive latency on exploratory visual analysis.
IEEE TVCG ,20(12):2122–2131, 2014.[30] J. M., R. V., and C. D. Tomlin. Cubic map algebra functions for spatio-temporal analysis.
CaGIS ,32(1):17–32, 2005.[31] J. K. Nidzwetzki and R. H. G¨uting. Distributed secondo: an extensible and scalable database manage-ment system.
Distributed and Parallel Databases
Proc. GIS , pages 211–220, New York, NY, USA, 2011. ACM.[34] V. Pandey, A. Kipf, T. Neumann, and A. Kemper. How good are modern spatial analytics systems?
PVLDB , 11(11):1661–1673, July 2018.[35] D. Papadias, P. Kalnis, J. Zhang, and Y. Tao. Efficient olap operations in spatial data warehouses. In
Proc. SSTD , pages 443–459, London, UK, UK, 2001. Springer-Verlag.[36] J. Patel and D. DeWitt. Partition based spatial-merge join.
SIGMOD Rec. , 25(2):259–270, June 1996.[37] M. Pavlovic, T. Heinis, F. Tauheed, P. Karras, and A. Ailamaki. Transformers: Robust spatial joins onnon-uniform data distributions. In
Proc. ICDE
Spatial Databases with Application to GIS . Morgan KaufmannPublishers Inc., San Francisco, CA, USA, 2002.[41] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries.
SIGMOD Rec. , 24(2):71–79,May 1995.[42] H. Samet and W. G. Aref. Spatial data models and query processing. In W. Kim, editor,
ModernDatabase Systems , pages 338–360. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA,1995.[43] D. Shreiner, G. Sellers, J. M. Kessenich, and B. M. Licea-Kane.
OpenGL Programming Guide: TheOfficial Guide to Learning OpenGL, Version 4.3 . Addison-Wesley Professional, 8th edition, 2013.[44] SQL Server Spatial. https://docs.microsoft.com/en-us/sql/relational-databases/spatial/spatial-data-sql-server?view=sql-server-2017, 2018.[45] Y. Tao, D. Papadias, and J. Zhang. Aggregate processing of planar points. In
Proc. EDBT , pages 682–700, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg.[46] C. D. Tomlin. Map algebra: one perspective.
Landscape and Urban Planning , 30(1-2):3–12, 1994.[47] E. Tzirita Zacharatou, H. Doraiswamy, A. Ailamaki, C. T. Silva, and J. Freire. Gpu rasterization forreal-time spatial aggregation over arbitrary polygons.
PVLDB , 11(3):352–365, 2017.[48] I. F. V. Lopez, R. T. Snodgrass, and B. Moon. Spatiotemporal aggregate computation: A survey.
IEEETKDE , 17(2):271–286, Feb. 2005.[49] A. Voisard and B. David. A database perspective on geospatial data modeling.
IEEE TKDE , 14(2):226–243, March 2002.[50] L. Wang, R. Christensen, F. Li, and K. Yi. Spatial online sampling and aggregation.
PVLDB , 9(3):84–95,2015.[51] D. Xie, F. Li, B. Yao, G. Li, L. Zhou, and M. Guo. Simba: Efficient in-memory spatial analytics. In
Proc. SIGMOD , pages 1071–1085, New York, NY, USA, 2016. ACM.[52] P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In
Proc. SODA , pages 311–321, Philadelphia, PA, USA, 1993. Society for Industrial and Applied Mathe-matics.[53] J. Zhang and S. You. Speeding up large-scale point-in-polygon test based spatial join on gpus. In
Proc.BigSpatial , pages 23–32, 2012.[54] J. Zhang, S. You, and L. Gruenwald. High-performance online spatial and temporal aggregations onmulti-core cpus and many-core gpus. In
Proc. DOLAP , pages 89–96, 2012.[55] J. Zhang, S. You, and L. Gruenwald. High-performance spatial join processing on gpgpus with applica-tions to large-scale taxi trip data. Technical report, The City College of New York, 2012.[56] J. Zhang, S. You, and L. Gruenwald. Efficient parallel zonal statistics on large-scale global biodiversitydata on gpus. In