Uncertain Spatial Data Management: An Overview

Andreas Züfle

Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA, USA

Preprint: To Appear in Big Geospatial Data. Chapter 3.2. Springer Books.

Abstract.

Both the current trends in technology, such as smartphones, general mobile devices, stationary sensors, and satellites, as well as a new user mentality of using this technology to voluntarily share enriched location information, produce a flood of geo-spatial and geo-spatio-temporal data. This data flood provides tremendous potential for discovering new and useful knowledge. But in addition to the fact that measurements are imprecise, spatial data is often interpolated between discrete observations. To reduce communication and bandwidth utilization, data is often subjected to a reduction, thereby eliminating some of the known/recorded values. These issues introduce the notion of uncertainty in the context of spatio-temporal data management, an aspect raising an imminent need for scalable and flexible solutions. The main scope of this chapter is to survey existing techniques for managing, querying, and mining uncertain spatio-temporal data. First, this chapter surveys common data representations for uncertain data, explains the commonly used possible worlds semantics to interpret an uncertain database, and surveys existing systems to process uncertain data. Then, this chapter defines different probabilistic result semantics to distinguish the task of enriching individual objects with probabilities from that of enriching entire results with probabilities. Distinguishing between result semantics is important because, for many queries, object-level result probabilities can be computed efficiently, whereas computing the probabilities of entire results is often exponentially hard. Then, this chapter provides an overview of probabilistic query predicates to quantify the required probability of a result to be included in the result. Finally, this chapter introduces a novel paradigm to efficiently answer any kind of query on uncertain data: the Paradigm of Equivalent Worlds, which groups the exponential set of possible database worlds into a polynomial number of sets of equivalent worlds that can be processed efficiently. Examples and use-cases of querying uncertain spatial data are provided using the example of uncertain range queries.

Fig. 1. User locations in a location-based social network (Gowalla) over a day.

Due to the proliferation of handheld GPS-enabled devices, spatial and spatio-temporal data is generated, stored, and published by billions of users in a plethora of applications. The McKinsey Global Institute projects that mining this data, and thus turning it into actionable information, offers a “$600 billion potential annual consumer surplus from using personal location data globally”.

As the volume, variety and velocity of spatial data have increased sharply over the last decades, uncertainty has increased as well. Until the early 21st century, spatial data available for geographic information science (GIS) was mainly collected, curated, standardized [50,49], and published by authoritative sources such as the United States Geological Survey (USGS) [101]. Now, data used for spatial data mining is often obtained from sources of volunteered geographic information (VGI) [98,84]. Consequently, our ability to unearth valuable knowledge from large sets of such spatial data is often impaired by the uncertainty of the data, which has been named “the Achilles heel of GIS” [53], for many reasons:

– Imprecision is caused by physical limitations of sensing devices and connection errors, for instance in geographic information systems using cell-phone GPS [38].
– Data records may be obsolete. In geo-social networks and microblogging platforms such as Twitter, users may update their location infrequently, yielding uncertain location information in-between data records [63].
– Data can be obtained from unreliable sources, such as volunteered geographic information like data in OpenStreetMap [84], where data is obtained from individual users and may be inaccurate or plain wrong, deliberately or due to human error [54].
– Data sets pertaining to specific questions may be too small to answer questions reliably. Proper statistical inference is required to draw significant conclusions from the data and to avoid basing decisions upon spurious mining results [56,24].

To illustrate uncertainty in spatial and spatio-temporal data, Figure 1 shows a typical one-day “trajectory” of a prolific user in the location-based social network Gowalla (data taken from [35]). While a trajectory is usually defined as a function that continuously maps time to locations, we see that in this case, we can only observe the user at discrete times, with hours in-between subsequent location updates. Where was the user located in-between these updates? Should we use dead reckoning techniques to interpolate the locations, or should we assume that the user stays at a location until the next update? Also, users may spoof their location [116], either to protect their privacy or to gain advantages within the location-based social network. Given this uncertainty, how certain can we be about the location of the user at a given time $t$? And how does the uncertainty increase as location updates become more sparse and obsolete?

Fig. 2. Exemplary Uncertain Database.

The goal of this chapter is to provide a comprehensive overview of models and techniques to deal with uncertainty. To handle uncertainty, we must first remind ourselves that a database models an aspect of the real world, the universe of discourse. Information observed and stored in a database may deviate from the real world. For reliable decision making, we need to quantify the uncertainty of attribute values stored in the database and consider potentially missing objects that may change mining results.

Example 1.

As a running example used throughout this chapter, consider Figure 2, which shows a toy uncertain spatial database. In this example, two objects, $Q$ and $B$, have uncertain locations, indicated by alternative locations $\{q_1, q_2\}$ of $Q$ and alternative locations $\{b_1, b_2\}$ of $B$. In this book chapter, we will survey methods to answer questions such as “What object is closest to $Q$?” or “What is the probability of $B$ to be one of the two nearest neighbors of $Q$?”

To answer such queries, we first need a crisp definition of what it means for an uncertain object to be a (probabilistic) nearest neighbor of a query object and how the probability of such an event is defined. This chapter gives a widely used interpretation of uncertain databases using Possible Worlds Semantics. This interpretation allows answering arbitrary queries on uncertain data, but at a computational cost exponential in the number of uncertain objects. For efficient processing, this chapter defines a paradigm of querying uncertain data that allows many spatial queries on uncertain spatial data to be answered efficiently. managing and querying uncertain spatial data. Parts of this section have been presented in the form of presentation slides at recent conference tutorials at VLDB 2010 [90], ICDE 2014 [31], ICDE 2017 [122], and MDM 2020 [121].

This chapter is subdivided to give a survey of definitions, notions and techniques used in the field of querying and mining uncertain spatio-temporal data.

– Section 2 presents a survey of state-of-the-art data representation models used in the field of uncertain data management. This section explains discrete and continuous models for uncertain objects.
– To interpret queries on a database of uncertain objects, well-defined semantics of uncertain databases are required. For this purpose, Section 3 introduces the possible world semantics for uncertain data.
– To run queries on uncertain spatial data, existing systems for uncertain spatial database management are surveyed in Section 4.
– Given an uncertain database, the result of a probabilistic query can be interpreted in two ways, as elaborated in Section 5. This distinction between different probabilistic result semantics is not made explicitly in any related work, but is required to gain a deep understanding of problems in the field of querying uncertain spatial data and their complexity.
– Section 6 gives an overview of probabilistic query predicates. A probabilistic query predicate defines the requirements for the probability of a candidate result to be returned as a query result.
– Section 7 introduces a novel paradigm for uncertain data to efficiently answer any kind of query using possible world semantics. This Paradigm of Equivalent Worlds generalizes existing solutions by identifying requirements a query must satisfy in order to have a polynomial solution.
– Section 8 presents efficient solutions for the problem of computing range queries on uncertain spatial databases. For this purpose, the paradigm of equivalent worlds is leveraged to compute the distribution of a sum of independent, non-identically distributed Bernoulli random variables (a Poisson-binomial distribution), a problem that is paramount for many spatial queries on uncertain data.
– Section 9 gives an overview of specific research problems using uncertain spatial and spatio-temporal data, and surveys state-of-the-art solutions.
– Finally, Section 10 concludes this book chapter and sketches future research directions that can be opened by applying the Paradigm of Equivalent Worlds to new applications and query types.


Uncertain Data Model

Fig. 3. Models for Uncertain Attributes: (a) Discrete Probability Mass Function, (b) Continuous Probability Density Function. The figure reproduces a slide from Renz/Cheng/Kriegel, Similarity Search and Mining in Uncertain Databases: an attribute $x$ is uncertain if its value is given by a probabilistic density function (PDF), which describes all possible values $v$ of $x$, associated with probability $P(x = v)$. The PDF may be discrete (e.g., temperature history data) or continuous (e.g., sensor measurement error).

An object $o$ is uncertain if at least one of its attributes is uncertain. The uncertainty of an attribute can be captured in a discrete or continuous way. A discrete model uses a probability mass function (pmf) to describe the location of an uncertain object. In essence, such a model describes an uncertain object by a finite number of alternative instances, each with an associated probability [61,86], as shown in Figure 3(a). In contrast, a continuous model uses a continuous probability density function (pdf), like Gaussian, uniform, Zipfian, or a mixture model, as depicted in Figure 3(b), to represent object locations over the space. Thus, in a continuous model, the number of possible attribute values is uncountably infinite. In order to estimate the probability that an uncertain attribute value is within an interval, integration of its pdf over this interval is required [99]. The random variables corresponding to each uncertain attribute of an object $o$ can be arbitrarily correlated.

To capture positional uncertainty, such models can be applied by treating longitude and latitude (and optionally elevation) as two (three) uncertain attributes. In the case of discrete positional uncertainty, the position of an object $A$ is given by a discrete set $a_1, \ldots, a_m$ of $m \in \mathbb{N}$ possible alternatives in space, as depicted in Figure 4(a) for two uncertain objects $A$ and $B$. Each alternative $a_i$ is associated with a probability value $p(a_i)$, which may for example be derived from empirical information about the turn probabilities at intersections in an underlying road network. In a nutshell, the position of $A$ is a random variable, defined by a probability mass function $\mathrm{pdf}_A$ that maps each alternative position $a_i$ to its corresponding probability $p(a_i)$, and that maps all other positions in space to a zero probability. An important property of uncertain spatial databases is the inherent correlation of spatial attributes. In the example shown in Figure 4(a), it can be observed that the uncertain attributes $a$ and $b$ are highly correlated: given the value of one attribute, the other attribute is certain, as no two alternatives of objects $A$ and $B$ have identical values in either attribute.

Clearly, the probabilities of all alternatives must sum to at most one:

$$\sum_{i=1}^{m} p(a_i) \le 1.$$

Fig. 4. Uncertain Objects: (a) Discrete Case, (b) Continuous Case. Axes: uncertain attribute a, uncertain attribute b.

In the case where $\sum_{i=1}^{m} p(a_i) < 1$, object $A$ has a non-zero probability of $1 - \sum_{i=1}^{m} p(a_i) > 0$ to not exist at all. This case is called existential uncertainty, and $A$ is denoted as existentially uncertain [112]. If the total number of possible instances $m$ is greater than one, $A$ is denoted as attribute uncertain. In the context of uncertain spatial data, attribute uncertainty is also referred to as positional uncertainty or location uncertainty. An object can be both existentially uncertain and attribute uncertain. In Figure 4(a), object $A$ is both existentially uncertain and attribute uncertain, while object $B$ is attribute uncertain but does exist for certain.

In the case of continuous uncertainty, the number of possible alternative positions of an object $A$ is infinite, and given by the non-zero domain of the probability density function $\mathrm{pdf}_A$. The probability of $A$ to occur in some spatial region $r$ is given by integration:

$$\int_{r} \mathrm{pdf}_A(x)\,dx.$$

Since arbitrary pdfs may be represented by an uncountably infinite number of (position, probability) pairs, such pdfs may require infinite space to represent. For this reason, assumptions on the shape of a pdf are made in practice. All continuous models for positionally uncertain data therefore use parametric pdfs, such as Gaussian, uniform, Zipfian, mixture models, or parametric spline representations. For illustration purposes, Figure 4(b) depicts three uncertain objects modelled by a mixture of Gaussian pdfs. Similar to the discrete case, the constraint

$$\int_{\mathbb{R}^d} \mathrm{pdf}_A(x)\,dx \le 1$$

must be satisfied, where $\mathbb{R}^d$ is a $d$-dimensional vector space. In the case of spatial data, $d$ usually equals two or three. The notions of existentially and attribute uncertain objects are defined analogously to the discrete case.

The following section reviews related work and the state of the art in the field of modeling uncertain data.
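The discrete model described above can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from the literature: the class name `UncertainObject`, its method names, and all coordinates and probabilities are assumptions chosen for the example (they are not the values shown in Figure 4(a)).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertainObject:
    """Discrete uncertain object: alternative positions with probabilities.

    Hypothetical helper class for illustration only.
    """
    name: str
    alternatives: tuple  # sequence of ((x, y), probability) pairs

    def __post_init__(self):
        total = sum(p for _, p in self.alternatives)
        # The probabilities of all alternatives must sum to at most one.
        if any(p < 0 for _, p in self.alternatives) or total > 1 + 1e-9:
            raise ValueError("probabilities must be non-negative and sum to at most one")

    def existence_probability(self):
        """Probability that the object exists at all."""
        return sum(p for _, p in self.alternatives)

    def is_existentially_uncertain(self):
        """True if the object has a non-zero probability of not existing."""
        return self.existence_probability() < 1 - 1e-9

    def is_attribute_uncertain(self):
        """True if the object has more than one alternative position."""
        return len(self.alternatives) > 1


# Made-up example values (assumptions, not taken from the chapter).
A = UncertainObject("A", (((1.0, 2.0), 0.5), ((2.0, 1.0), 0.3)))
B = UncertainObject("B", (((4.0, 4.0), 0.6), ((5.0, 3.0), 0.4)))
```

With these made-up values, `A` is both existentially and attribute uncertain (it does not exist with probability 0.2), while `B` is attribute uncertain only.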
2.1 Existing Models for Uncertain Data

This section gives a brief survey of existing models for uncertain spatial data used in the database community. Many of the presented models have been developed to model uncertainty in relational data, but can be easily adapted to model uncertain spatial data. Since one of the main challenges of modeling uncertain data is to capture correlation between uncertain objects, this section elaborates on how state-of-the-art approaches tackle this challenge. Both discrete and continuous models are presented.

Discrete Models

In addition to reviewing related work defining discrete uncertainty models, the aim of this section is to put these papers into the context of Section 2. In particular, models which are special cases of, or equivalent to, the model presented in Section 2 will be identified, and proper mappings to Section 2 will be given.

Independent Tuple Model.

Initial models have been proposed simultaneously and independently in [52,118]. These works assume a relational model in which each tuple is associated with a probability describing its existential uncertainty. All tuples are considered independent of each other. This simple model can be seen as a special case of the model presented in Section 2, where only existential uncertainty but no attribute uncertainty is modelled.
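As a minimal sketch of the independent tuple model, the following Python snippet enumerates all possible worlds of a small table in which every tuple exists independently with a given probability. The function name `tuple_worlds`, the tuple identifiers, and the probabilities are illustrative assumptions.

```python
from itertools import product


def tuple_worlds(existence):
    """Enumerate all possible worlds of an independent-tuple table.

    existence maps a tuple identifier to its existence probability.
    Yields (frozenset_of_present_tuples, probability) pairs.
    """
    ids = list(existence)
    for mask in product((True, False), repeat=len(ids)):
        prob = 1.0
        world = set()
        for tid, present in zip(ids, mask):
            # Independence: multiply per-tuple presence/absence probabilities.
            prob *= existence[tid] if present else 1.0 - existence[tid]
            if present:
                world.add(tid)
        yield frozenset(world), prob


# Hypothetical table with three tuples and made-up probabilities.
table = {"t1": 0.9, "t2": 0.5, "t3": 0.2}
worlds = dict(tuple_worlds(table))
```

The table with three tuples yields eight worlds whose probabilities sum to one; the number of worlds doubles with every additional tuple.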

Block-Independent Disjoint Tuples Model and X-Tuple Model.

A more recent and currently the most prominent approach to model discrete uncertainty is the block-independent disjoint tuples model [41], which can capture mutual exclusion between tuples in uncertain relational databases. A probabilistic database is called block independent-disjoint if the set of all possible tuples can be partitioned into blocks such that tuples from the same block are disjoint events, and tuples from distinct blocks are independent. A commonly used example of a block-independent disjoint tuples model is the Uncertainty-Lineage Database Model [13,91,97,110,111], also called X-Relation Model or simply X-Tuple Model, that has been developed for relational data. In this model, a probabilistic database is a finite set of probabilistic tables. A probabilistic table $T$ contains a set of (uncertain) tuples, where each tuple $t \in T$ is associated with a membership probability value $Pr(t) > 0$. A generation rule $R$ on a table $T$ specifies a set of mutually exclusive tuples in the form of $R : t_{r_1} \oplus \ldots \oplus t_{r_m}$, where $t_{r_i} \in T$ $(1 \le i \le m)$ and $P(R) := \sum_{i=1}^{m} Pr(t_{r_i}) \le 1$. The rule $R$ constrains that, among all tuples $t_{r_1}, \ldots, t_{r_m}$ involved in the rule, at most one tuple can appear in a possible world. In the case where $P(R) < 1$, the probability $1 - P(R)$ corresponds to the probability that no tuple contained in rule $R$ exists. It is assumed that any two rules $R_1$ and $R_2$ do not share any common tuples, i.e., $R_1 \cap R_2 = \emptyset$. In this model, a possible world $w$ is a subset of $T$ such that for each generation rule $R$, $w$ contains exactly one tuple involved in $R$ if $P(R) = 1$, or $w$ contains 0 or 1 tuple involved in $R$ if $P(R) < 1$.

This model can be translated to a discrete model for uncertain spatial data as discussed in Section 2 by interpreting the set $T$ as the set of all possible locations of all objects, and interpreting each rule $R$ as an uncertain spatial object having alternatives $t_{r_i}$. The constraint that no two rules may share any common tuples translates into the assumption of mutually independent spatial objects. Finally, the case $P(R) < 1$ corresponds to the case of existential uncertainty (see Section 2). A similar block-independent disjoint tuples model is called p-or-set [89] and can be translated to the model described in Section 2 analogously.

In [8], another model for uncertainty in relational databases has been proposed that allows representing attribute values by sets of possible values instead of single deterministic values. This work extends relational algebra by an operator for computing possible results. A normalized representation of uncertain attributes, which essentially splits each uncertain attribute into a single relation, a so-called U-relation, allows projection-selection-join queries to be answered efficiently. The main drawback of this model is that it is not possible to compute probabilities of the returned possible results.

Sen and Deshpande [95] propose a model based on a probabilistic graphical model for explicitly modeling correlations among tuples in a probabilistic database. Strategies for executing SQL queries over such data have been developed in this work. The main drawback of using the proposed graphical model is its complexity, which grows exponentially in the number of mutually correlated tuples. This is a general drawback of graphical models such as Bayesian networks and graphical Markov models, where even a factorized representation may fail to reduce the complexity sufficiently. The idea of a factorized representation is to identify conditional independencies. For example, if a random variable $C$ depends on random variables $A$ and $B$, then the distribution of $C$ has to be given relative to all combinations of realizations of $A$ and $B$. If, however, $C$ is conditionally independent of $A$ given $B$, i.e., $B$ depends on $A$, $C$ depends on $B$, and $C$ only transitively depends on $A$, then it is sufficient to store the distribution of $C$ relative only to the realizations of $B$. Nevertheless, if for a given graphical model a random variable depends on more than a handful of other random variables, then the corresponding model will become infeasible.

And/Xor Tree Model.

A very recent work by Li and Deshpande [66] extends the block-independent disjoint tuples model by adding support for mutual co-existence. Two events satisfy the mutual co-existence correlation if, in any possible world, either both happen or neither occurs. This work allows both mutual exclusiveness and mutual co-existence to be specified in a hierarchical manner. The resulting tree structure is called an and/xor tree. While theoretically highly relevant, the and/xor tree model becomes impractical in large databases having non-trivial object dependencies, as it grows exponentially in the number of database objects.

If not stated otherwise, this chapter will apply the block-independent disjoint tuples model as the model of choice for discrete uncertain data.

Continuous Models

In general, similarity search methods based on continuous models involve expensive integrations of the PDFs; hence, special approximation and indexing techniques for efficient query processing are typically employed [34,99]. In order to increase the quality of approximations, and in order to reduce the computational complexity, a number of models have been proposed that make assumptions on the shape of object PDFs. Such assumptions can often be made in applications where the uncertain values follow a specific parametric distribution, e.g., a uniform distribution [32,29] or a Gaussian distribution [29,44,85]. Multiple such distributions can be mixed to obtain a mixture model [100,22]. To approximate arbitrary PDFs, [67] proposes to use polynomial spline approximations.
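To illustrate why parametric assumptions help, the sketch below computes the probability that a Gaussian-distributed location falls into an axis-aligned rectangle in closed form via the normal CDF, instead of by numerical integration. It assumes, purely for simplicity, that the two coordinates are independent Gaussians; the function names and parameters are illustrative assumptions, and correlated attributes would require a multivariate treatment.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a one-dimensional Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def prob_in_interval(lo, hi, mean, std):
    """Integral of the Gaussian pdf over [lo, hi]."""
    return normal_cdf(hi, mean, std) - normal_cdf(lo, mean, std)


def prob_in_rectangle(rect, mean, std):
    """Probability that a 2-D location lies in an axis-aligned rectangle.

    Assumption: the x and y coordinates are independent Gaussians, so the
    two-dimensional integral factorizes into two one-dimensional integrals.
    """
    (x_lo, x_hi), (y_lo, y_hi) = rect
    return (prob_in_interval(x_lo, x_hi, mean[0], std[0])
            * prob_in_interval(y_lo, y_hi, mean[1], std[1]))


# Hypothetical object centered at the origin with unit standard deviations.
p_interval = prob_in_interval(-1.0, 1.0, 0.0, 1.0)
p_rect = prob_in_rectangle(((-1.0, 1.0), (-1.0, 1.0)), (0.0, 0.0), (1.0, 1.0))
```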

In an uncertain spatial database $\mathcal{D} = \{U_1, \ldots, U_N\}$, the location of an object is a random variable. Consequently, if there is at least one uncertain object, the data stored in the database becomes a random variable. To interpret, that is, to define the semantics of, a database that is, in itself, a random variable, the concept of possible worlds is described in this section.

Definition 1 (Possible World Semantics).

A possible world $w = \{u_1^{a_1}, \ldots, u_N^{a_N}\}$ is a set of instances containing at most one instance $u_i^{a_i} \in U_i$ from each object $U_i \in \mathcal{D}$. The set of all possible worlds is denoted as $\mathcal{W}$. The total probability $P(w)$ of a possible world $w \in \mathcal{W}$ is derived from the chain rule of conditional probabilities:

$$P(w) := P\Big(\bigwedge_{u_i^{a_i} \in w} U_i = u_i^{a_i}\Big) = \prod_{i=1}^{N} P\Big(u_i^{a_i} \,\Big|\, \bigwedge_{j<i} u_j^{a_j}\Big)$$
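The following Python sketch enumerates the possible worlds of a small discrete uncertain database. It assumes stochastic independence between objects, so that the chain rule collapses to a plain product of instance probabilities. The function name `possible_worlds`, the object and instance names, and all probabilities are made up for illustration and are not the values of the chapter's figures.

```python
from itertools import product


def possible_worlds(database):
    """Enumerate possible worlds of a discrete uncertain database.

    database maps an object name to a dict {instance: probability}.
    Assumes independent objects. Yields (world, probability) pairs, where
    world maps each existing object to the instance chosen for it.
    """
    names = list(database)
    options = []
    for name in names:
        alts = list(database[name].items())
        missing = 1.0 - sum(p for _, p in alts)
        if missing > 1e-12:
            alts.append((None, missing))  # None: the object does not exist
        options.append(alts)
    for combo in product(*options):
        world = {n: inst for n, (inst, _) in zip(names, combo) if inst is not None}
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob


# Hypothetical database: Y is existentially uncertain (probabilities sum to 0.9).
database = {
    "X": {"x1": 0.5, "x2": 0.5},
    "Y": {"y1": 0.7, "y2": 0.2},
    "Z": {"z1": 0.2, "z2": 0.3, "z3": 0.5},
}
worlds = list(possible_worlds(database))
```

Even this toy database with three objects spans 2 · 3 · 3 = 18 possible worlds, six of which do not contain object `Y`.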

Fig. 5. An uncertain database and all of its possible worlds. Axes: uncertain attribute a, uncertain attribute b.

Table 1. Possible worlds corresponding to Figure 5. Columns: World, Probability. The table lists twelve worlds that contain an instance of every object and six worlds in which one object does not exist.

Example 2.

As an example, consider Figure 5 where a database consisting of three uncertain objects D = { U , U , U } is depicted. Objects U = { u , u } and U = { u , u } each have two possible instances, while object U = { u , u , u } has three possible instances. The probabilities of these instances are given as P ( u ) = P ( u ) = 0 . , P ( u ) = 0 . , P ( u ) = 0 . , P ( u ) = 0 . , P ( u ) = 0 . , P ( u ) = 0 . . Note that object U is the only object having existential uncertainty: With a probability of − . − . . object U does not exist at all. Assuming independence between spatial objects, the probability for the possible world where U = u , U = u and U = u is given by applying Equation 2 to obtain the product . · . · . . . All possible worlds spanned by D are depicted in Figure 5. The probability of each possible world is shown in Table 1, including possible worlds where U does not exist.

Recall that a predicate can evaluate to either true or false on a crisp (non-uncertain) database. An exemplary predicate is “There are at least five database objects within a 500 meter range of the location ‘Theresienwiese, Munich’”.

To evaluate a predicate $\varphi$ on an uncertain database using possible world semantics, the query predicate is evaluated on each possible world. The probability that the query predicate evaluates to true is defined as the sum of probabilities of all worlds where $\varphi$ is satisfied, formally:

Definition 2.

Let $\mathcal{D}$ be an uncertain spatial database inducing the set of possible worlds $\mathcal{W}$, let $\varphi$ be some query predicate, and let

$$I(\varphi, w \in \mathcal{W}) := P(\varphi(\mathcal{D}) \mid \mathcal{D} = w) \in \{0, 1\}$$

be the indicator function that returns one if world $w$ satisfies $\varphi$ and zero otherwise. The marginal probability $P(\varphi(\mathcal{D}))$ of the event $\varphi(\mathcal{D})$ that predicate $\varphi$ holds in $\mathcal{D}$ is defined as follows using the theorem of total probability [123]:

$$P(\varphi(\mathcal{D})) = \sum_{w \in \mathcal{W}} I(\varphi, w) \cdot P(w) \qquad (3)$$

The main challenge of analyzing uncertain data is to efficiently and effectively deal with the large number of possible worlds induced by an uncertain database $\mathcal{D}$. In the case of continuous uncertain data, the number of possible worlds is uncountably infinite, and expensive integration operations or numerical approximations are required for most spatial database queries and spatial data mining tasks. Even in the case of discrete uncertainty, the number of possible worlds grows exponentially in the number of objects: in the worst case, any combination of alternatives of objects may have a non-zero probability, as shown in Figure 5. This large number of possible worlds makes efficient query processing and data mining an extremely challenging problem. In particular, any problem that requires an enumeration of all possible worlds is . In particular, a number of probabilistic problems have been proven to be in any query in polynomial time. This implies that query processing on models that are generalizations of the discrete case with object independence, e.g., models using continuous distributions, or models that relax the object independence assumption, must also be a

4 Existing Uncertain Spatial Database Management Systems

Recently developed systems provide support for spatio-temporal data in big data systems [7,6,79,105,113]. Such systems exhibit high scalability for batch-processing jobs [10,43], but do not provide efficient solutions to handle uncertain data and to assess the reliability of results. The vivid field of managing, querying, and mining uncertain data has received tremendous attention from the database, data mining, and spatial data science communities. Recent books [3] and survey papers [4,108,72] provide an overview of the flurry of research papers that have appeared in these fields. been well-studied by the database research community in the past. While the traditional database literature [25,12,11,64,51] has studied the problem of managing uncertain data, this research field has seen a recent revival, due to modern techniques for collecting inherently uncertain data. The most prominent systems for probabilistic data management are MayBMS [9], MystiQ [23], Trio [5], and BayesStore [104]. These uncertain database management systems (UDBMS) provide solutions to cope with uncertain relational data, allowing traditional queries that select subsets of data based on predicates, or join different datasets based on conditions, to be answered efficiently. Extensions to the UDBMS also allow answering important classes of spatial queries such as top-k and distance-ranking queries [57,36,69,18,68]. While these existing UDBMS provide probabilistic guarantees for their query results, they offer no support for data mining tasks. A likely reason for this gap is the theoretic result of [40], which shows that the problem of answering complex queries is $N$ objects each having an arbitrary non-zero probability of being in that range. Further, assume stochastic independence between these objects. In that case, any of the $2^N$ combinations of result objects becomes a possible result and must be returned.

Nevertheless, a number of polynomial-time solutions have been proposed in the literature for various spatial query types such as nearest neighbor queries [33,61,58,29], k-nearest neighbor queries [21,78,70,30] and (similarity-) ranking queries [19,37,70,96]. At first glance, these findings may look contradictory (unless $P = NP$), providing polynomial-time solution to a object to be part of the result. This reduces the number of probabilities that have to be reported, in the worst case, from a number exponential in the number of database objects to a linear number. Re-using the example of a range query on an uncertain database, it is possible to compute the probability that a single object is within the query range independently of all other objects.

Fig. 6. The Exemplary Uncertain Database from Figure 2.

Unfortunately, this simplification also yields a loss of information, as it is not possible to infer the probability of query results given only probabilities of individual objects. Let us revisit the running example from the introduction, which is duplicated in Figure 6 for convenience. This example will illustrate how such an object-based approach, which computes object-individual probabilities rather than the probabilities of result sets, may yield misleading results.
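The range-query argument above can be made concrete with a small Python sketch. It computes one in-range probability per object (a linear number of values) and, for comparison, the probability of every possible result set (exponentially many), assuming stochastic independence between objects. The function names, object positions, probabilities, and query parameters are illustrative assumptions.

```python
import math
from itertools import product


def in_range_probability(alternatives, center, radius):
    """Probability that a discrete uncertain object lies within the query range."""
    return sum(p for pos, p in alternatives if math.dist(pos, center) <= radius)


def result_set_probabilities(in_range):
    """Probability of every possible result set, assuming independent objects."""
    names = list(in_range)
    results = {}
    for mask in product((True, False), repeat=len(names)):
        prob = 1.0
        for name, inside in zip(names, mask):
            prob *= in_range[name] if inside else 1.0 - in_range[name]
        results[frozenset(n for n, inside in zip(names, mask) if inside)] = prob
    return results


# Hypothetical objects, each given as ((x, y), probability) alternatives.
objects = {
    "A": [((0.0, 0.0), 0.5), ((3.0, 0.0), 0.5)],
    "B": [((1.0, 0.0), 0.7), ((5.0, 5.0), 0.3)],
    "C": [((0.0, 1.0), 0.9), ((4.0, 4.0), 0.1)],
}
center, radius = (0.0, 0.0), 1.5

object_level = {n: in_range_probability(alts, center, radius)
                for n, alts in objects.items()}
result_level = result_set_probabilities(object_level)

# Summing the result sets that contain A recovers A's object-level probability.
marginal_A = sum(p for result, p in result_level.items() if "A" in result)
```

Three objects already produce eight result sets, while the object-level answer consists of only three probabilities. The object-level values can always be recovered from the result-level ones by marginalization, but not the other way around once objects are correlated.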

Example 3.

Assume that the task is to simply ﬁnd the probabilistic two nearestneighbors (2NN) of uncertain object Q . Objects Q and B have two alternativepositions each, yielding a total of four possible worlds. For example, in onepossible world, where Q has location q and B has location b , the two nearestneighbors of Q are A and C . This possible world has a probability of . · . . , obtained by assuming stochastic independence between objects. Followingobject-based result semantics, we can obtain probabilities of . , . , . , . , . for objects A , B , C , D , and E to be the NNs of Q , respectively. However,this result hides any dependence between these result objects, such as objects A and B are mutually exclusive, while D and E are mutually inclusive.Towards approximate solutions, the Monte-Carlo DB (MCDB) system [59]has been proposed, which samples possible worlds from the database, executesthe query predicate on each sampled world. MCDB estimates the probability ofeach object to be part of the result set. However, this approach of assigning aresult probability to each object, as illustrate in the example above, cannot beextended to assess the probability of result sets. The problem is that the numberof possible result sets may be exponentially large. To aggregate possible worldsinto groups of mutually similar worlds (having similar results), an approach hasbeen proposed for clustering of uncertain data [120,94] and more recently forgeneral query processing on spatial data [92]. Revisiting the example of Figure 2,this approach reports the results of a probabilistic query 2NN query as { A, C } , { B, C } , { D, E } , having respective probabilities of . , . , and . . However,this approach ([92]) can only be applied to spatial queries that return resultsets, thus cannot be applied to more complex spatial queries or data miningtasks. 
To further elaborate the difference between solutions that compute the probability of each object to be part of the result, and solutions that compute the probability of each result, the following section surveys the two different “Probabilistic Result Semantics”: object-based and result-based.

Probabilistic Result Semantics

Recall that a spatial similarity query always requires a query object q and, informally speaking, returns objects to the user that are similar to q. In the case of uncertain data, there exist two fundamental semantics to describe the result of such a probabilistic spatial similarity query. These different result semantics will be denoted as object based result semantics and result based result semantics. Informally, the former semantics return possible result objects and their probability of being part of the result, while the latter semantics return possible results, which consist of a single object, a set of objects or a sorted list of objects depending on the query predicate, and their probability of being the result as a whole.

Using object based probabilistic result semantics, a probabilistic spatial query returns a set of objects, each associated with a probability describing the individual likelihood of this object to satisfy the spatial query predicate.

Deﬁnition 3 (Object Based Result Semantics).

Let $\mathcal{D}$ be an uncertain spatial database, let $q$ be a query object and let $\phi$ denote a spatial query predicate. Under object based (OB) probabilistic result semantics, the result of a probabilistic spatial $\phi$ query is a set of pairs

$$\phi_{OB}(q, \mathcal{D}) = \{(o, P(o \in \phi_{OB}(q, \mathcal{D}))) \mid o \in \mathcal{D}\}.$$

Each pair consists of a result object $o$ and its probability $P(o \in \phi_{OB}(q, \mathcal{D}))$ to satisfy $\phi$. Applying possible world semantics (cf. Definition 1) to compute the probability $P(o \in \phi_{OB}(q, \mathcal{D}))$ yields

$$P(o \in \phi_{OB}(q, \mathcal{D})) = \sum_{w \in \mathcal{W},\, o \in \phi(q, w)} P(w), \tag{4}$$

where $\phi(q, w)$ is the deterministic result of a spatial $\phi$ query having query object $q$ applied to the deterministic database defined by world $w$.

Formally, the result of a probabilistic spatial query under object based result semantics is a function

$$\phi_{OB}(q, \mathcal{D}) : \mathcal{D} \to [0, 1], \quad o \mapsto P(o \in \phi_{OB}(q, \mathcal{D})),$$

mapping each object $o$ in $\mathcal{D}$ (the results) to a probability value.

[Figure 7: possible positions of the uncertain objects A, B and C relative to the query object q. The two alternatives of A have probabilities 0.1 and 0.9, the two alternatives of B have probabilities 0.6 and 0.4, and the single position of C has probability 1.0. The possible worlds and their rankings are listed below.]

World   Rank 1   Rank 2   Rank 3   P(w)
w_1     A        B        C        0.04
w_2     A        C        B        0.06
w_3     B        C        A        0.54
w_4     C        B        A        0.36

Fig. 7.

Example Database showing possible positions of uncertain objects and their corresponding probabilities.

Example 4.

Figure 7 depicts a database containing objects $\mathcal{D} = \{A, B, C\}$. Objects A and B have two alternative locations each, while the position of C is known for certain. The locations and the probabilities of all alternatives are also depicted in Figure 7. This leads to a total of four possible worlds. For example, in world $w_1$, which fixes one alternative for each of A, B and C, object A is closest to q, followed by objects B and C. Assuming inter-object independence, the probability of this world is given by the product of the individual instance probabilities, $P(w_1) = 0.1 \cdot 0.4 \cdot 1.0 = 0.04$. The ranking of each possible world and the corresponding probability are also depicted in Figure 7. For a probabilistic 2NN query for the depicted query object q, the object based result semantics compute the probability of each object to be in the two-nearest-neighbor set of q. For object A, the probability P(A) of this event equals 0.1, since there exist exactly two possible worlds, $w_1$ and $w_2$, with a total probability of 0.04 + 0.06 = 0.1, in which A is on rank one or on rank two, yielding the result tuple (A, 0.1). The complete result of a P2NN query under object based result semantics is {(A, 0.1), (B, 0.94), (C, 0.96)}. Note that in general, objects having a zero probability are included in the result. For instance, assume an additional object D such that all instances of D have a distance to q greater than the distance between q and b. In this case, the pair (D, 0) would be part of the result.

The result of a query under object based probabilistic result semantics contains one result tuple for every single database object, even if the probability of the corresponding object to be a result is very low or zero. In many applications, such results may be meaningless. Therefore, the size of the result set can be reduced by using a probabilistic query predicate, as explained later in Section 6.

A computational problem is the computation of the probability $P(o)$ of an object $o \in \mathcal{D}$ to satisfy the spatial query predicate. In the example, this probability was derived by iterating over the set of all possible worlds $w_1, ..., w_4$. Since this set grows exponentially in the number of objects, such an approach is not viable in practice. Therefore, efficient techniques to compute the probability values $P(o)$ are required. A general paradigm to develop algorithms that avoid an explicit enumeration of all possible worlds is presented in Section 7.

5.2 Result Based Probabilistic Result Semantics

In the case of result based result semantics, possible result sets of a probabilistic spatial query are returned, each associated with the probability of this result.

Deﬁnition 4 (Result Based Result Semantics).

Let $\mathcal{D}$ be an uncertain spatial database, let $q$ be a query object and let $\phi$ denote a spatial query predicate. Under result based (RB) result semantics, the result of a probabilistic spatial $\phi$ query is a set of pairs

$$\phi_{RB}(q, \mathcal{D}) = \Big\{(r, P(r)) \;\Big|\; r \subseteq \mathcal{D},\ P(r) = \sum_{w \in \mathcal{W},\, \phi(q, w) = r} P(w)\Big\}.$$

This set contains one pair for each result $r \subseteq \mathcal{D}$, associated with the probability $P(r)$ of $r$ to be the result. Following possible world semantics, the probability $P(r)$ is defined as the sum of the probabilities of all worlds $w \in \mathcal{W}$ such that a spatial $\phi$ query returns $r$.

Formally, the result of a probabilistic spatial query under result based result semantics is a function

$$\phi_{RB}(q, \mathcal{D}) : \mathcal{P}(\mathcal{D}) \to [0, 1], \quad r \mapsto P(r),$$

mapping the elements of the power set $\mathcal{P}(\mathcal{D})$ (the results) to probability values.

Example 5.

For a probabilistic 2NN query for the query object q depicted in Figure 7, result based result semantics require computing the probability of each subset of {A, B, C} to be the two-nearest-neighbor set of q. For the set {B, C}, the probability of this event is 0.9, since there are two possible worlds, $w_3$ and $w_4$, with a total probability of 0.54 + 0.36 = 0.9, in which B and C are both contained in the 2NN set of q. Note that in worlds $w_3$ and $w_4$ objects B and C appear in different ranking positions. This fact is ignored by a kNN query, as the results are returned unsorted. In this example, the complete result of a P2NN query under result based result semantics is {({A, B, C}, 0), ({A, B}, 0.04), ({A, C}, 0.06), ({B, C}, 0.9), ({A}, 0), ({B}, 0), ({C}, 0), (∅, 0)}.

Clearly, the result of a query using result based result semantics can be used to derive the result of an identical query using object based result semantics. For instance, the result of Example 5 implies that the probability of object A to be a 2NN of q is 0.1, since there exist two possible results using result based result semantics that contain A, namely ({A, B}, 0.04) and ({A, C}, 0.06), having a total probability of 0.04 + 0.06 = 0.1, which matches the result of Example 4.

Lemma 1.

Let $q$ be the query point of a probabilistic spatial $\phi$ query. The result of this query using object based result semantics, $\phi_{OB}(q, \mathcal{D})$, is functionally dependent on the result of this query using result based result semantics. The set $PS\phi Q_{OB}(q, \mathcal{D})$ can be computed given only the set $PS\phi Q_{RB}(q, \mathcal{D})$ as follows:

$$PS\phi Q_{OB}(q, \mathcal{D}) = \Big\{(o, P(o)) \;\Big|\; o \in \mathcal{D} \wedge P(o) = \sum_{(r, P(r)) \in PS\phi Q_{RB}(q, \mathcal{D}),\, o \in r} P(r)\Big\}$$

Proof. Let $\mathcal{W}$ denote the set of possible worlds of $\mathcal{D}$, and let $P(w)$, $w \in \mathcal{W}$, denote the probability of a possible world. Furthermore, let $w_{S \subseteq \mathcal{D}} := \{w \in \mathcal{W} \mid \phi(q, w) = S\}$ denote the set of possible worlds such that $\phi(q, w) = S$, i.e., such that a $\phi$ query using query object $q$ returns set $S$. In each world $w$, query $q$ returns exactly one deterministic result $PS\phi Q_{RB}(q, w)$. Thus, the sets $w_{S \subseteq \mathcal{D}}$ represent a complete and disjoint partition of $\mathcal{W}$, i.e., it holds that

$$\mathcal{W} = \bigcup_{S \subseteq \mathcal{D}} w_S \tag{5}$$

and

$$\forall R, S \in \mathcal{P}(\mathcal{D}) : R \neq S \Rightarrow w_R \cap w_S = \emptyset. \tag{6}$$

Using Equations 5 and 6, we can rewrite Equation 4,

$$P(o \in \phi_{OB}(q, \mathcal{D})) = \sum_{w \in \mathcal{W},\, o \in \phi(q, w)} P(w),$$

as

$$P(o \in \phi_{OB}(q, \mathcal{D})) = \sum_{S \in \mathcal{P}(\mathcal{D})} \ \sum_{w \in w_S,\, o \in \phi(q, w)} P(w).$$

By definition, query $q$ returns the same result for each world $w \in w_S$. This result contains object $o$ if and only if $o \in S$. Thus we can rewrite the above equation as

$$P(o) = \sum_{S \in \mathcal{P}(\mathcal{D}),\, o \in S} P(S).$$

The probabilities $P(S)$ are given by function $PS\phi Q_{RB}(q, \mathcal{D})$.

In the above proof, we have performed a linear-time reduction of the problem of answering probabilistic spatial queries using object based result semantics to the problem of answering probabilistic spatial queries using result based result semantics.
Thus, we have shown that, except for a linear factor (which can be neglected for most probabilistic spatial query types, since most algorithms run in no better than log-linear time), the problem of answering a probabilistic spatial query using result based result semantics is at least as hard as answering a probabilistic spatial query using object based semantics.

To summarize this section, we have learned about two different semantics to interpret the result of a spatial query on uncertain data: object based and result based. Understanding the difference between both result semantics is paramount to understanding the landscape of existing research: in one related publication, the problem of answering some probabilistic query may be proven to be in #P, while another publication gives a solution that lies in P-TIME for the same spatial query predicate and the same probabilistic query predicate. In such cases, different result semantics may explain these results without assuming P = NP.
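The relationship between the two semantics can be checked on the four possible worlds listed with Figure 7. The sketch below is illustrative only; the function names are hypothetical. It computes the result based 2NN result by summing world probabilities per result set (Definition 4) and then derives the object based result from it as stated in Lemma 1.

```python
from collections import defaultdict

# Possible worlds of Figure 7 as (ranking by distance to q, probability).
WORLDS = [
    (("A", "B", "C"), 0.04),
    (("A", "C", "B"), 0.06),
    (("B", "C", "A"), 0.54),
    (("C", "B", "A"), 0.36),
]


def result_based(worlds, k):
    """P(r) for every kNN result set r that occurs in some world."""
    probs = defaultdict(float)
    for ranking, p in worlds:
        probs[frozenset(ranking[:k])] += p
    return dict(probs)


def object_based_from_result_based(rb):
    """Lemma 1: P(o) is the sum of P(r) over all result sets r containing o."""
    probs = defaultdict(float)
    for result_set, p in rb.items():
        for obj in result_set:
            probs[obj] += p
    return dict(probs)


rb = result_based(WORLDS, k=2)
ob = object_based_from_result_based(rb)

for result_set, p in rb.items():
    print(sorted(result_set), round(p, 2))
for obj in sorted(ob):
    print(obj, round(ob[obj], 2))
```

Result sets with zero probability are simply omitted from `rb` here, whereas Definition 4 formally assigns a probability to every subset of the database. The reverse direction, recovering result set probabilities from object probabilities alone, is not possible in general.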

Fig. 8. Example of an uncertain ε-range query. Object A is a true hit, objects B, C and D are possible hits.

Generally, in an uncertain database, the question whether an object satisfies a given query predicate φ, such as being in a specified range or being a kNN of a query object, cannot be answered deterministically due to the uncertainty of object locations. Due to this uncertainty, the predicate that an object satisfies φ is a random variable, having some (possibly zero, possibly one) probability. A probabilistic query predicate quantifies the minimal probability required for a result to qualify as sufficiently significant to be returned to the user. This section formally defines probabilistic query predicates for general query predicates. The following definitions are made for uncertain data in general, but can be applied analogously to uncertain spatial data.

A probabilistic query can be defined without any probabilistic query predicate. In this case, all objects and their respective probabilities are returned.

Definition 5 (Probabilistic Query).

Let $\mathcal{D}$ be an uncertain database, let $q$ be a query point and let $\phi$ be a query predicate. A probabilistic query $\phi(q, \mathcal{D})$ returns all database objects $o \in \mathcal{D}$ together with their respective probability $P(o \in \phi(q, \mathcal{D}))$ that $o$ satisfies $\phi$:

$$\phi(q, \mathcal{D}) = \{(o, P(o \in \phi(q, \mathcal{D}))) \mid o \in \mathcal{D}\} \tag{7}$$

The term probabilistic query is simply derived from the fact that, unlike a traditional query, a probabilistic query result has probability values associated with each result. The main challenge of answering a probabilistic query is to compute the probability $P(o \in \phi(q, \mathcal{D}))$ for each object. Using possible world semantics, a probabilistic query can be answered by evaluating the query predicate for each object and each possible world, i.e.,

$$P(o \in \phi(q, \mathcal{D})) := \sum_{w \in \mathcal{W}} f_{ind}(\phi, w) \cdot P(w).$$

But clearly, it is necessary to avoid the combinatorial growth that would be induced by this "naive" evaluation method.

Example 6.

For example, consider the query “Return all friends of user q having a spatial distance of less than 100 m to q” depicted in Figure 8. Thus, the predicate φ is a 100 m-range predicate using query point q. We can deterministically tell that friend A must be within ε = 100 m Euclidean distance of q, while friends E and F cannot possibly be in range. The pairs (A, 1), (E, 0) and (F, 0) are added to the result. For friends B, C and D, this predicate cannot be answered deterministically. Here, friend B has some possible positions located inside the 100 m range of q, while other possible positions are outside this range. The total probability of object B to satisfy the query predicate is the sum of the probabilities of its two locations inside q's range, and the pair consisting of B and this probability is added to the result. The corresponding pairs for C and D complete the result of the 100 m-range(q, D) query.

The immediate question in the above example is: “Is the probability of B sufficient to warrant returning B as a result?” To answer this question, a probabilistic query can explicitly specify a probabilistic query predicate, which states the requirements, in terms of probability, for an object to qualify to be included in the result. The following subsections briefly survey the most commonly used probabilistic query predicates: probabilistic threshold queries and probabilistic Top k queries.

This paragraph defines a probabilistic query predicate that allows returning only results that are statistically significant.

Deﬁnition 6 (Probabilistic Threshold Query (P τ Q)).

Let $\mathcal{D}$ be an uncertain (spatial) database, let $q$ be a spatial query object, let $0 \le \tau \le 1$ be a real value and let $\phi$ be a spatial query predicate. A probabilistic $\tau$ query (P$\tau$Q) returns all objects $o \in \mathcal{D}$ such that $o$ has a probability of at least $\tau$ to satisfy $\phi(q, \mathcal{D})$:

$$P\tau\phi(q, \mathcal{D}) := \{o \in \mathcal{D} \mid P(o \in \phi(q, \mathcal{D})) \ge \tau\}.$$

Example 7.

In Figure 8, a probabilistic threshold 100 m-range(q, D) query with the given threshold τ returns the set of objects {A, D}, since objects A and D are the only objects whose total probability of alternatives inside the query region is equal to or greater than τ.

Semantically, a probabilistic threshold spatial query returns all results having a statistically significant probability to satisfy the query predicate. Therefore, the probabilistic threshold query serves as a statistical test of the hypothesis “o is a result” at a significance level of τ. This test uses the probability $P(o \in \phi(q, \mathcal{D}))$ as a test statistic. Efficient algorithms to compute this probability for the example of kNN queries, similarity ranking queries and RkNN queries will be surveyed in Section 8.

A probabilistic threshold query on uncertain spatial data is useful in applications where the parameters of the spatial predicate φ (e.g. the range ε of an ε-range query, or the parameter k of a kNN query), as well as the probabilistic threshold τ, are chosen wisely, requiring expert knowledge about the database D. If these parameters are chosen inappropriately, no results may be returned, or the set of returned results may grow too large. For example, if τ is chosen very large, and if the database has a high grade of uncertainty, then no result may be returned at all. Analogously, if the parameter ε is chosen too small then no result will be returned, while a too large value of ε may return all objects. The special case of a threshold τ just above zero, i.e., the case of returning all possible results (having a non-zero probability), is often used as default if no other probabilistic query predicate is specified (e.g. [97,110]). This case may be referred to as a possibilistic query predicate, as all possible results (regardless of their probability) are returned.

6.2 Probabilistic Top k Queries

In cases where insufficient information is given to select appropriate parameter values, the following probabilistic query predicate is defined to guarantee that only the k most significant results are returned.

Definition 7 (Probabilistic Top k Query (PTopkQ)).

Let $\mathcal{D}$ be an uncertain spatial database, let $q$ be a spatial query object, let $1 \le k \le |\mathcal{D}|$ be a positive integer, and let $\phi$ be a spatial query predicate. A probabilistic spatial Top k query (PTopkQ) returns the smallest set $PTopk_\phi(q, \mathcal{D})$ of at least $k$ objects such that

$$\forall U_i \in PTopk_\phi(q, \mathcal{D}),\ U_j \in \mathcal{D} \setminus PTopk_\phi(q, \mathcal{D}) : P(U_i \in \phi(q, \mathcal{D})) \ge P(U_j \in \phi(q, \mathcal{D})).$$

Thus, a probabilistic spatial Top k query returns the k objects having the highest probability to satisfy the query predicate. In case of ties, the resulting set may contain more than k objects.

Example 8.

In Figure 8, a PTop3φ query using a φ = 100 m-range spatial predicate returns the objects PTop3 100 m-range(q, D) = {A, B, D}, since these objects have the highest probability to satisfy the spatial predicate, i.e., have the highest probability to be located in the spatial 100 m-range.

Note that the probabilistic Top k query predicate can be combined with a kNN spatial query, i.e., with the case where φ = kNN. Such a probabilistic Top k jNN query returns the set of k objects having the highest probability to be a j-nearest neighbor of the query object. Clearly, k and j may have different integer values, such that differentiation is needed.

6.3 Discussion

In summary, a probabilistic spatial query is defined by two query predicates:

– a spatial predicate φ, to select uncertain objects having sufficiently high proximity to the query object, and
– a probabilistic predicate ψ, to select uncertain objects having sufficiently high probability to satisfy φ.

It has to be mentioned that, alternatively to this definition, a single predicate can be used that combines both spatial and probabilistic features. For example, a monotonic score function can be utilized, which combines spatial proximity and probability to return a single scalar score. An example of such a monotone score function is the expected distance function

$$E(dist(q, U \in \mathcal{D})) = \sum_{u \in U} P(u) \cdot dist(q, u),$$

where $q$ is the query object, and $\mathcal{D}$ is an uncertain database. The expected distance function is utilized by a number of related publications, such as [78,37]. Using such a monotone score function, objects with a sufficiently high score can be returned. The advantage of using such an approach is that objects that are located very close to the query require a lower probability to be returned as a result, while objects that are located further away from the query object require a higher probability.
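The probabilistic predicates of Definitions 6 and 7 and the expected distance score can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the object probabilities for B, C and D are invented values chosen only to be consistent with the result sets reported in Examples 7 and 8, and the tie handling follows the remark that ties may enlarge the Top k result.

```python
def threshold_query(probabilities, tau):
    """Definition 6: objects whose probability to satisfy phi is >= tau."""
    return {o for o, p in probabilities.items() if p >= tau}


def top_k_query(probabilities, k):
    """Definition 7: the k most probable objects, plus ties at the cut-off."""
    if k <= 0 or not probabilities:
        return set()
    ranked = sorted(probabilities.values(), reverse=True)
    cutoff = ranked[min(k, len(ranked)) - 1]
    return {o for o, p in probabilities.items() if p >= cutoff}


def expected_distance(query, alternatives):
    """Sum of P(u) * dist(q, u) over the alternatives u of one object.

    Positions are one-dimensional here purely to keep the sketch short.
    """
    return sum(p * abs(pos - query) for pos, p in alternatives)


# Hypothetical probabilities of lying inside the query range.
probabilities = {"A": 1.0, "B": 0.3, "C": 0.2, "D": 0.8, "E": 0.0, "F": 0.0}

print(sorted(threshold_query(probabilities, tau=0.5)))
print(sorted(top_k_query(probabilities, k=3)))
print(expected_distance(0.0, [(10.0, 0.5), (30.0, 0.5)]))
```

Both probabilistic predicates operate purely on the object probabilities produced by the spatial predicate, which is why the two predicates can be chosen independently. The expected distance, in contrast, collapses position and probability into one number.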
Yet, the main problem of such a combined predicate is that the probability of an object is treated as a simple attribute, thus losing its probabilistic semantics. Thus, the resulting score is very hard to interpret. An object that has a high score may indeed have a very low probability to exist at all, because it is located (if it exists) very close to the query object. Consequently, the score itself no longer contains any confidence information, and thus it is not possible to answer queries according to possible world semantics using only a single aggregate, such as expected distance.

In Section 3 the concept of possible world semantics has been introduced. Possible world semantics give an intuitive and mathematically sound interpretation of an uncertain spatial database. Furthermore, queries that adhere to possible world semantics return unbiased results, by evaluating the query on each possible world. Since any such approach requires running queries on an exponential number of worlds, any naive approach is infeasible. Yet, for specific settings, such as specific result-based semantics, specific spatial query predicates and specific probabilistic query predicates, the literature has shown that it is possible to efficiently answer queries on uncertain data. While it is hardly feasible to enumerate all combinations of result-based semantics, spatial query predicates and probabilistic query predicates, this section introduces a general paradigm to find such a solution yourself. In a nutshell, the idea is to find, among the exponentially large set of possible worlds, a partitioning into a polynomially large number of subsets, which are equivalent for a given query.

7.1 Equivalent Worlds

The goal of this section is to introduce a general paradigm to efficiently compute exact probabilities, while still adhering to possible world semantics. For this purpose, reconsider Definition 2, defining the probability that some predicate $\phi$ is satisfied in an uncertain database $\mathcal{D}$ as the total probability of all possible worlds satisfying $\phi$. Recall Equation 3,

$$P(\phi(\mathcal{D})) = \sum_{w \in \mathcal{W}} I(\phi, w) \cdot P(w),$$

where $\mathcal{W}$ is the set of all possible worlds, $I(\phi, w)$ is an indicator function that returns one if predicate $\phi$ holds (i.e., resolves to true) in the crisp database defined by world $w$ and zero otherwise, and $P(w)$ is the probability of world $w$. To reduce the number of possible worlds that need to be considered to compute $P(\phi(\mathcal{D}))$, we first need the following definition.

Definition 8 (Class of Equivalent Worlds).

Let $\phi$ be a query predicate and let $S \subseteq \mathcal{W}$ be a set of possible worlds such that for any two worlds $w_1, w_2 \in S$ we can guarantee that $\phi$ holds in world $w_1$ if and only if $\phi$ holds in world $w_2$, i.e.,

$$\forall w_1, w_2 \in S : I(\phi, w_1) \Leftrightarrow I(\phi, w_2).$$

Then set $S$ is called a class of worlds equivalent with respect to $\phi$. In the remainder of this chapter, if the spatial query predicate $\phi$ is clearly given by the context, then $S$ will simply be denoted as a class of equivalent worlds. Any worlds $w_i, w_j \in S$ are denoted as equivalent worlds.

We now make the following observation:

Corollary 1.

Let $S \subseteq \mathcal{W}$ be a class of worlds equivalent with respect to $\phi$ (cf. Definition 8). Then we can rewrite Equation 3 as follows:

$$P(\phi(\mathcal{D})) = \sum_{w \in \mathcal{W}} I(\phi, w) \cdot P(w) \ \Leftrightarrow\ P(\phi(\mathcal{D})) = \sum_{w \in \mathcal{W} \setminus S} I(\phi, w) \cdot P(w) + I(\phi, w \in S) \cdot \sum_{w \in S} P(w). \tag{8}$$

Proof. Due to the assumption that for any two worlds $w_1, w_2 \in S$ it holds that $\phi$ holds in world $w_1$ if and only if $\phi$ holds in world $w_2$, we get $I(\phi, w_1) = 1 \Leftrightarrow I(\phi, w_2) = 1$ and $I(\phi, w_1) = 0 \Leftrightarrow I(\phi, w_2) = 0$ by definition of function $I$. Due to this assumption, we have to consider two cases.

Case 1: $\forall w \in S : I(\phi, w) = 0$. In this case, both Equation 3 and Equation 8 can be rewritten as

$$P(\phi(\mathcal{D})) = \sum_{w \in \mathcal{W} \setminus S} I(\phi, w) \cdot P(w).$$

Case 2: $\forall w \in S : I(\phi, w) = 1$. In this case, both Equation 3 and Equation 8 can be rewritten as

$$P(\phi(\mathcal{D})) = \sum_{w \in \mathcal{W} \setminus S} I(\phi, w) \cdot P(w) + \sum_{w \in S} P(w). \qquad \square$$

The only difference between both cases is the additive term $\sum_{w \in S} P(w)$, which exists only in Case 2. The indicator function $I(\phi, w \in S)$ ensures that this term is only added in the second case. As its main purpose, Corollary 1 states that, given a set of equivalent worlds $S$, we only have to evaluate the indicator function $I(\phi, w)$ on a single representative world $w \in S$, rather than on each world in $S$. This reduces the number of (crisp) $\phi$ queries required to compute Equation 3 by $|S| - 1$.

Corollary 1 leads to the following Lemma.

Lemma 2.

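Let $\mathcal{S}$ be a partitioning of $\mathcal{W}$ into disjoint classes of equivalent worlds such that $\bigcup_{S \in \mathcal{S}} S = \mathcal{W}$ and for all $S_1, S_2 \in \mathcal{S}$ with $S_1 \neq S_2$: $S_1 \cap S_2 = \emptyset$. Equation 3 can be rewritten as

$$P(\phi(\mathcal{D})) = \sum_{w \in \mathcal{W}} I(\phi, w) \cdot P(w) \ \Leftrightarrow\ P(\phi(\mathcal{D})) = \sum_{S \in \mathcal{S}} I(\phi, w \in S) \cdot \sum_{w \in S} P(w). \tag{9}$$

Proof.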

Lemma 2 is derived by applying Corollary 1 once for each $S \in \mathcal{S}$. $\square$

The next subsection will show how to leverage Lemma 2 to partition the set of all possible worlds into equivalence classes that are guaranteed to have the same result for a given query predicate, and how to exploit this partitioning to efficiently answer queries.

[Figure 9: all $O(2^N)$ possible worlds are partitioned into classes of equivalent worlds $C_1, C_2, \ldots, C_k$ with $k \in O(poly(N))$. For each class, a representative world is selected, the probability $P(C_i)$ of the class is computed, and the query predicate is evaluated (true or false). The result probability is obtained from the probabilities $P(C_i)$.]

Fig. 9. Summary of the Paradigm of Equivalent Worlds.
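The identity of Equation 9 can be illustrated with a small brute-force sketch. Everything here is an assumption made for illustration: the in-range probabilities are invented, the example predicate "at least two objects are inside the range" is hypothetical, and the function names are not from the source. Worlds are grouped by the number of objects inside the range, and the predicate is evaluated once per class on a representative world.

```python
from collections import defaultdict
from itertools import product

# Hypothetical probabilities of each object lying inside the query range.
IN_RANGE = {"A": 1.0, "B": 0.3, "C": 0.2, "D": 0.8}


def possible_worlds(in_range):
    """Yield (world, probability) pairs, assuming object independence."""
    names = list(in_range)
    for outcome in product([True, False], repeat=len(names)):
        prob = 1.0
        for name, inside in zip(names, outcome):
            prob *= in_range[name] if inside else 1.0 - in_range[name]
        yield dict(zip(names, outcome)), prob


def predicate(world):
    """Example predicate phi: at least two objects are inside the range."""
    return sum(world.values()) >= 2


def naive_probability(in_range):
    """Equation 3: one predicate evaluation per possible world."""
    return sum(p for w, p in possible_worlds(in_range) if predicate(w))


def class_based_probability(in_range):
    """Equation 9: one predicate evaluation per class of equivalent worlds."""
    class_prob = defaultdict(float)
    representative = {}
    for w, p in possible_worlds(in_range):
        key = sum(w.values())  # equal counts imply equal predicate outcome
        class_prob[key] += p
        representative.setdefault(key, w)
    return sum(class_prob[key] for key, w in representative.items() if predicate(w))


print(round(naive_probability(IN_RANGE), 6))
print(round(class_based_probability(IN_RANGE), 6))
```

This sketch still enumerates every world in order to build the classes, so it only demonstrates that both formulas agree. An efficient algorithm additionally has to compute each class probability without such an enumeration.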

Given a partitioning $\mathcal{S}$ of all possible worlds, Equation 9 requires performing the following two tasks. The first task is to evaluate the indicator function $I(\phi, w \in S)$ for one representative world of each partition. This can be achieved by performing a traditional (non-uncertain) $\phi$ query on these representatives. The final challenge is to efficiently compute the total probability $P(S) := \sum_{w \in S} P(w)$ for each equivalence class $S \in \mathcal{S}$. This computation must avoid an enumeration of all possible worlds, i.e., must be in $o(|S|)$. Achieving an efficient computation is a creative task, and usually requires exploiting properties of the model (such as object independence) and properties of the spatial query predicate.

The paradigm of equivalent worlds is illustrated and summarized in Figure 9. In the first step, the set of all possible worlds $\mathcal{W}$, which is exponential in the number $N$ of uncertain objects, has to be partitioned into a polynomially large set of classes of equivalent worlds, such that all worlds in the same class are guaranteed to be equivalent given the query predicate $\phi$. This yields the set $\mathcal{C} = \{C_1, C_2, ..., C_k\}$ of classes of equivalent worlds. To allow efficient processing, this set must be polynomial in size, since each class has to be considered individually in the following. Next, we need to compute the probability of each class $C_i$ without enumerating all possible worlds contained in $C_i$, the number of which may still be exponential. In fact, at least one class $C_i$ must contain $O(2^N)$ possible worlds. (Note that if an exponentially large set is partitioned into a polynomial number of subsets, then at least one such subset must have exponential size. This is evident considering that $O(2^n / poly(n)) = O(2^n)$.)

Next, we need to decide, for each class $C_i$, whether all worlds $w \in C_i$ satisfy the query predicate $\phi$, or whether no world $w \in C_i$ satisfies $\phi$. Due to the equivalence of all possible worlds in $C_i$, these are the only possible cases. For some query predicates, this decision can be made using special properties of the query predicate, as we will see later in this chapter. In the general case, this decision can be made by choosing one representative world $w \in C_i$ (e.g. at random) from each class $C_i$, and evaluating the query predicate on this world. This yields a total run-time of $O(|\mathcal{C}|) \cdot O(I(\phi, w))$, where $O(I(\phi, w))$ is the time complexity of evaluating the query predicate $\phi$ on the certain database $w$. If this query predicate can be evaluated in polynomial time, i.e., if $O(I(\phi, w)) \in O(poly(N))$, then the total run-time is in $O(poly(N))$. This is evident, since if $O(|\mathcal{C}|)$ is in $O(poly(N))$, then $O(|\mathcal{C}|) \cdot O(I(\phi, w))$ is in $O(poly(N)) \cdot O(poly(N))$, which is in $O(poly(N))$. For each class $C_i$ where the representative world satisfies $\phi$, the corresponding probability $P(C_i)$ is added to the result probability.

The following lemma summarizes the assumptions that a query predicate has to satisfy in order to efficiently apply the paradigm of finding equivalent worlds.

Lemma 3.

Given a query predicate $\phi$ and an uncertain database $\mathcal{D}$ of size $N := |\mathcal{D}|$, we can answer $\phi$ on $\mathcal{D}$ in polynomial time if the following four conditions are satisfied:

I. A traditional $\phi$ query on non-uncertain data can be answered in polynomial time.
II. We can identify a partitioning $\mathcal{C}$ of $\mathcal{W}$ into classes $C \in \mathcal{C}$ of equivalent worlds (see Definition 8).
III. The number $|\mathcal{C}|$ of classes is at most polynomial in $N$.
IV. The total probability of a class $C \in \mathcal{C}$ can be computed in at most polynomial time.

Proof. Answering a $\phi$ query on $\mathcal{D}$ requires evaluating Equation 3, which we reformed into Equation 9 using Property II. This requires iterating over all $|\mathcal{C}|$ classes of equivalent worlds, of which there are polynomially many due to Property III. For each class $C \in \mathcal{C}$, two tasks have to be performed. The first task is to compute the total probability of all worlds in $C$, and the second task is to evaluate $\phi$ on a single possible world $w \in C$. The former task can be performed in polynomial time due to Property IV. The latter task requires performing a crisp $\phi$ query on the (crisp) world $w$, which takes polynomial time due to Property I.

Case Study: Range Queries and the Sum of Independent Bernoulli Trials

In this chapter, the paradigm of equivalent worlds will be applied to efficiently solve the problem of computing the number of uncertain objects located within a specified range.

Example 9.

As an example, consider the setting depicted in Figure 8. In this example, we have four objects, A, B, C, and D, each having some probability of being located inside the query region defined by query location q and query range ε. Intuitively, the number of objects in this range can be anywhere between one and four, as only object A is guaranteed to be inside the range, while B, C, and D each only have a chance to be inside this range. How can we efficiently compute the distribution of this number of objects inside the query range? What is the probability of having exactly one, two, three or four objects in the range? Intuitively, the number of objects depends on the result of three “coin-flips”, each using a coin with a different bias of flipping heads.

Each such “coin-flip” is a Bernoulli trial, which may have a successful (“heads”) or unsuccessful (“tails”) outcome. In the case where all Bernoulli trials have the same probability p, the number of successful trials out of N trials is described by the well-known binomial distribution. In the case where each trial may have a different probability to succeed, the number of successful trials follows a Poisson-binomial distribution [55].

Formally, let $X_1, ..., X_N$ be independent and not necessarily identically distributed Bernoulli trials, i.e., random variables that may only take the values zero and one. Let $p_i := P(X_i = 1)$ denote the probability that random variable $X_i$ has value one. In this section, we will show how to efficiently compute the distribution of the random variable

$$\sum_{i=1}^{N} X_i$$

without enumeration of all possible worlds. That is, for each $0 \le k \le N$, this section shows how to compute the probability $P(\sum_{i=1}^{N} X_i = k)$ that exactly $k$ trials are successful.

This section shows two commonly used solutions to compute the distribution of $\sum_i X_i$ efficiently: the Poisson-binomial recurrence, and a technique based on generating functions. Both solutions have in common that they identify worlds that are equivalent with respect to the query predicate.

[Figure 10: automaton with states (i/j) for $0 \le i \le j \le N$. From each state (i/j), one transition with probability $p_{j+1}$ leads to state (i+1/j+1), and one transition with probability $1 - p_{j+1}$ leads to state (i/j+1).]

Fig. 10.

Deterministic finite automaton corresponding to the problem of the sum of independent Bernoulli trials.
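As a concrete reference point for the problem just stated, the sketch below computes the distribution of the number of successful trials in two ways: by enumerating all $2^N$ outcomes, and by the recurrence that the remainder of this subsection derives as Equation 10. The success probabilities are hypothetical, the function names are invented for this sketch, and the recurrence is written in a "push" form (each state distributes its mass to its two successors), which is equivalent to the "pull" form of Equation 10.

```python
from itertools import product


def count_distribution_naive(probs):
    """Enumerate all 2^N outcomes; exponential, for comparison only."""
    dist = [0.0] * (len(probs) + 1)
    for outcome in product([0, 1], repeat=len(probs)):
        weight = 1.0
        for p, x in zip(probs, outcome):
            weight *= p if x else 1.0 - p
        dist[sum(outcome)] += weight
    return dist


def count_distribution_recurrence(probs):
    """Poisson-binomial recurrence, quadratic in the number of trials.

    After processing j trials, dist[i] holds P(i/j): the probability that
    exactly i of the first j trials succeeded.
    """
    dist = [1.0]  # P(0/0) = 1
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for i, mass in enumerate(dist):
            nxt[i] += mass * (1.0 - p)  # trial fails: still i successes
            nxt[i + 1] += mass * p      # trial succeeds: i + 1 successes
        dist = nxt
    return dist


# Hypothetical in-range probabilities for four objects (first one certain).
probs = [1.0, 0.3, 0.2, 0.8]
naive = count_distribution_naive(probs)
fast = count_distribution_recurrence(probs)
print([round(x, 3) for x in fast])
```

Both functions return a list whose entry k is the probability that exactly k trials succeed. The first entry is zero here because the first hypothetical object is inside the range with certainty.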

The first approach iteratively computes the distribution of the sum of the first $k \leq N$ Bernoulli variables given the distribution of the sum of the first $k-1$ Bernoulli variables. To gain an intuition of how to do this efficiently, consider the deterministic finite automaton depicted in Figure 10. The states $(i/j)$ of this automaton correspond to the random event that out of the first $j$ Bernoulli trials $X_1, \ldots, X_j$, exactly $i$ trials have been successful. Initially, zero Bernoulli trials have been performed, out of which zero (trivially) were successful. This situation is represented by the initial state $(0/0)$ in Figure 10. Evaluating the first Bernoulli trial $X_1$, there are two possible outcomes: The trial may be successful with a probability of $p_1$, leading to state $(1/1)$, where one out of one trial has been successful. Alternatively, the trial may be unsuccessful, with a probability of $1-p_1$, leading to state $(0/1)$, where zero out of one trial has been successful.

Note that this automaton is deterministic, despite the process of choosing a successor node being a random event. Once the Bernoulli trial corresponding to a node has been performed, the next node will be chosen deterministically, i.e., the upper node will be chosen if the trial was successful, and the right node will be chosen otherwise. Either way, there is exactly one successor node.

The second trial is then applied to both possible outcomes. If the first trial has not been successful, i.e., we are currently located in state $(0/1)$, then there are again two outcomes for the second Bernoulli trial, leading to states $(1/2)$ and $(0/2)$ with a probability of $p_2$ and $1-p_2$, respectively. If currently located in state $(1/1)$, the two outcomes are state $(2/2)$ and state $(1/2)$ with the same probabilities. At this point, we have unified two different possible worlds that are equivalent with respect to $\sum_i X_i$: The world where trial one has been successful and trial two has not been successful, and the world where trial one has not been successful and trial two has been successful, have been unified into state $(1/2)$, representing both worlds. This unification was possible, since both paths leading to state $(1/2)$ are equivalent with respect to the number of successful trials.

The three states $(0/2)$, $(1/2)$ and $(2/2)$ are then subjected to the outcome of the third Bernoulli trial, leading to states $(0/3)$, $(1/3)$, $(2/3)$ and $(3/3)$. That is a total of four states for a total of $2^3 = 8$ possible worlds. In summary, the number of states in Figure 10 is in $O(N^2)$. However, it is not yet clear how to compute the probability of a state $(i/j)$ efficiently. Naively, we have to compute the sum over all paths leading to state $(i/j)$. For example, the probability of state $(2/3)$ is given by $p_1 \cdot p_2 \cdot (1-p_3) + p_1 \cdot (1-p_2) \cdot p_3 + (1-p_1) \cdot p_2 \cdot p_3$. This naive computation requires enumerating all $\binom{j}{i}$ combinations of paths leading to state $(i/j)$.

For an efficient computation, we make the following observation: Each state of the deterministic finite automaton depicted in Figure 10 has at most two incoming edges. Thus, to compute the probability of a state $(i/j)$, we only require the probabilities of the states leading to $(i/j)$. These are state $(i-1/j-1)$ and state $(i/j-1)$. Given the probabilities $P(i-1/j-1)$ and $P(i/j-1)$, we can compute the probability $P(i/j)$ of state $(i/j)$ as follows:

$$P(i/j) = P(i-1/j-1) \cdot p_j + P(i/j-1) \cdot (1-p_j)$$ (10)

where $P(0/0) = 1$ and $P(i/j) = 0$ if $i > j$ or if $i < 0$.

Equation 10 is known as the Poisson-Binomial Recurrence (to the best of our knowledge, the Poisson binomial recurrence was first introduced by [65]) and can be used to compute the probabilities of states $(k/N)$, $0 \leq k \leq N$, which by definition correspond to the probabilities $P(\sum_{i=1}^{N} X_i = k)$ that out of all $N$ Bernoulli trials, exactly $k$ trials are successful.

This approach follows the paradigm of equivalent worlds in each iteration $k$: The set of all $2^k$ possible worlds is partitioned into $k+1$ equivalence classes, each corresponding to a state $(i/k)$, where $0 \leq i \leq k$. Each class contains only and all of the $\binom{k}{i}$ possible worlds where exactly $i$ Bernoulli trials succeeded. The information about the particular sequence of the successful trials, i.e., which trials were successful and which were unsuccessful, is lost. This information, however, is no longer necessary to compute the distribution of $\sum_{i=1}^{N} X_i$, since for this random variable we only need to know the number of successful trials, not their sequence. This abstraction removes the combinatorial aspect of the problem.

An example showcasing the Poisson binomial recurrence is given in Example 10, which follows.
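The recurrence of Equation 10 can also be sketched in a few lines of Python before working through it by hand. This is a minimal sketch, not the author's implementation: the function name `poisson_binomial_recurrence`, the list-based row representation, and the four input probabilities are assumptions chosen for illustration only.

```python
from typing import List


def poisson_binomial_recurrence(probs: List[float]) -> List[float]:
    """Return [P(0/N), ..., P(N/N)] for N independent Bernoulli trials.

    probs[j-1] is the success probability p_j of trial X_j.
    """
    prev = [1.0]  # P(0/0) = 1: zero successes after zero trials
    for p_j in probs:
        j = len(prev)  # number of trials evaluated after this step
        curr = []
        for i in range(j + 1):
            # P(i-1/j-1), defined as 0 when i - 1 < 0
            from_success = prev[i - 1] if i >= 1 else 0.0
            # P(i/j-1), defined as 0 when i > j - 1
            from_failure = prev[i] if i < j else 0.0
            # Equation 10
            curr.append(from_success * p_j + from_failure * (1.0 - p_j))
        prev = curr  # only the previous row is needed, so older rows are dropped
    return prev


# Illustrative run; these probabilities are assumed values for the sketch.
example_probs = [0.1, 0.2, 0.3, 0.4]
distribution = poisson_binomial_recurrence(example_probs)
for successes, prob in enumerate(distribution):
    print(f"P({successes}/{len(example_probs)}) = {prob:.4f}")
```

Each pass of the outer loop corresponds to one column of states $(i/j)$ in the automaton. For the assumed inputs, the resulting distribution is 0.3024, 0.4404, 0.2144, 0.0404, 0.0024 (rounded to four decimals), which sums to one.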

Let $N = 4$ with success probabilities $p_1 = 0.1$, $p_2$, $p_3$ and $p_4 = 0.4$. The corresponding DFA is depicted in Figure 11. The probability of state $(0/0)$ is explicitly set to $1$ in Equation 10. To compute the probability of state $(0/1)$, we apply Equation 10 to compute

$$P(0/1) = P(-1/0) \cdot p_1 + P(0/0) \cdot (1-p_1).$$

Fig. 11. Deterministic finite automaton for four Bernoulli random variables, with states $(i/j)$ for $0 \leq i \leq j \leq 4$.

With $P(-1/0) = 0$ and $P(0/0) = 1$ explicitly defined in Equation 10, this yields

$$P(0/1) = 0 \cdot p_1 + 1 \cdot (1-p_1) = 0.9.$$

Analogously, we obtain

$$P(1/1) = P(0/0) \cdot p_1 + P(1/0) \cdot (1-p_1) = 1 \cdot p_1 = 0.1.$$

Using these initial probabilities, we can continue to compute

$$P(0/2) = P(-1/1) \cdot p_2 + P(0/1) \cdot (1-p_2) = 0 \cdot p_2 + 0.9 \cdot (1-p_2),$$
$$P(1/2) = P(0/1) \cdot p_2 + P(1/1) \cdot (1-p_2) = 0.9 \cdot p_2 + 0.1 \cdot (1-p_2),$$
$$P(2/2) = P(1/1) \cdot p_2 + P(2/1) \cdot (1-p_2) = 0.1 \cdot p_2 + 0 \cdot (1-p_2).$$

The probabilities $P(i/2)$, $0 \leq i \leq 2$, can be used to compute

$$P(0/3) = P(-1/2) \cdot p_3 + P(0/2) \cdot (1-p_3),$$
$$P(1/3) = P(0/2) \cdot p_3 + P(1/2) \cdot (1-p_3),$$
$$P(2/3) = P(1/2) \cdot p_3 + P(2/2) \cdot (1-p_3),$$
$$P(3/3) = P(2/2) \cdot p_3 + P(3/2) \cdot (1-p_3),$$

where $P(-1/2) = 0$ and $P(3/2) = 0$. Finally, these probabilities can be used to derive the final distribution of the random variable $\sum_{i=1}^{4} X_i$:

$$P(0/4) = P(-1/3) \cdot p_4 + P(0/3) \cdot (1-p_4) = 0.6 \cdot P(0/3),$$
$$P(1/4) = P(0/3) \cdot p_4 + P(1/3) \cdot (1-p_4) = 0.4 \cdot P(0/3) + 0.6 \cdot P(1/3),$$
$$P(2/4) = P(1/3) \cdot p_4 + P(2/3) \cdot (1-p_4) = 0.4 \cdot P(1/3) + 0.6 \cdot P(2/3),$$
$$P(3/4) = P(2/3) \cdot p_4 + P(3/3) \cdot (1-p_4) = 0.4 \cdot P(2/3) + 0.6 \cdot P(3/3),$$
$$P(4/4) = P(3/3) \cdot p_4 + P(4/3) \cdot (1-p_4) = 0.4 \cdot P(3/3).$$

These probabilities describe the PDF of $\sum_{i=1}^{4} X_i$ by definition of $P(i/j)$.

Complexity Analysis

To compute the distribution of $\sum_i X_i$, we need to compute each probability $P(i/j)$ for $0 \leq j \leq N$, $0 \leq i \leq j$, yielding a total of $O(N^2)$ probability computations. To compute any such probability, we have to evaluate Equation 10, which requires looking up four values $P(i-1/j-1)$, $P(i/j-1)$, $p_j$ and $1-p_j$, which can be performed in constant time. This yields a total runtime complexity of $O(N^2)$. The $O(N^2)$ space complexity required to store the matrix of probabilities $P(i/j)$ for $0 \leq j \leq N$, $0 \leq i \leq j$ can be reduced to $O(N)$ by exploiting that in each iteration where the probabilities $P(i/k)$, $0 \leq i \leq k$, are computed, only the probabilities $P(i/k-1)$, $0 \leq i \leq k-1$, are required, and the results of earlier iterations can be discarded. Thus, at most $N$ probabilities have to be stored at a time.

An alternative technique to compute the sum of independent Bernoulli variables is the generating functions technique. While showing the same complexity as the Poisson binomial recurrence, its advantage is its intuitiveness.

Represent each Bernoulli trial $X_i$ by a polynomial $\mathrm{poly}(X_i) = p_i \cdot x + (1-p_i)$. Consider the generating function

$$F_N = \prod_{i=1}^{N} \mathrm{poly}(X_i) = \sum_{i=0}^{N} c_i x^i.$$ (11)

The coefficient $c_i$ of $x^i$ in the expansion of $F_N$ equals the probability $P(\sum_{n=1}^{N} X_n = i)$ ([66]). For example, a monomial $c_4 \cdot x^4$ implies that with a probability of $c_4$, the sum of all Bernoulli random variables equals four.

The expansion of $N$ polynomials, each containing two monomials, leads to a total of $2^N$ monomials, one monomial for each sequence of successful and unsuccessful Bernoulli trials, i.e., one monomial for each possible world. To reduce this complexity, again an iterative computation of $F_N$ can be used, by exploiting that

$$F_k = F_{k-1} \cdot \mathrm{poly}(X_k).$$
(12)

This rewriting of Equation 11 allows us to inductively compute $F_k$ from $F_{k-1}$. The induction is started by computing the polynomial $F_0$, which is the empty product and thus equals the neutral element of multiplication, i.e., $F_0 = 1$. To understand the semantics of this polynomial, $F_0 = 1$ can be rewritten as $F_0 = 1 \cdot x^0$, which we can interpret as the following tautology: "with a probability of one, the sum of all zero Bernoulli trials equals zero." After each iteration, we can unify monomials having the same exponent, leading to a total of at most $k+1$ monomials after iteration $k$. This unification step removes the combinatorial aspect of the problem, since any monomial $x^i$ corresponds to a class of equivalent worlds, such that this class contains only and all of the worlds where the sum of the first $k$ trials equals $i$, i.e., $\sum_{n=1}^{k} X_n = i$. In each iteration, the number of these classes is at most $k+1$, and the probability of each class is given by the coefficient of $x^i$.

An example showcasing the generating functions technique is given in the following. This example uses the identical Bernoulli random variables used in Example 10.

Example 11. Again, let $N = 4$ with $p_1 = 0.1$, $p_2$, $p_3$ and $p_4 = 0.4$. We obtain the four generating polynomials $\mathrm{poly}(X_1) = (0.1x + 0.9)$, $\mathrm{poly}(X_2) = (p_2 x + (1-p_2))$, $\mathrm{poly}(X_3) = (p_3 x + (1-p_3))$, and $\mathrm{poly}(X_4) = (0.4x + 0.6)$. We trivially obtain $F_0 = 1$. Using Equation 12 we get

$$F_1 = F_0 \cdot \mathrm{poly}(X_1) = 1 \cdot (0.1x + 0.9) = 0.1x + 0.9.$$

Semantically, this polynomial implies that out of the first one Bernoulli variables, the probability of having a sum of one is $0.1$ (according to monomial $0.1x = 0.1x^1$), and the probability of having a sum of zero is $0.9$ (according to monomial $0.9 = 0.9x^0$). Next, we compute $F_2$, again using Equation 12:

$$F_2 = F_1 \cdot \mathrm{poly}(X_2) = (0.1x^1 + 0.9x^0) \cdot (p_2 x^1 + (1-p_2) x^0) = 0.1 p_2\, x^1 x^1 + 0.1 (1-p_2)\, x^1 x^0 + 0.9 p_2\, x^0 x^1 + 0.9 (1-p_2)\, x^0 x^0.$$

In this expansion, the monomials have deliberately not been unified, to give an intuition of how the generating functions technique is able to identify and unify equivalent worlds. In the above expansion, there is one monomial for each possible world. For example, the monomial $0.9 p_2\, x^0 x^1$ represents the world where the first trial was unsuccessful (represented by the zero of the first exponent) and the second trial was successful (represented by the one of the second exponent). The above notation allows us to identify the sequence of successful and unsuccessful Bernoulli trials, clearly leading to a total of $2^k$ possible worlds for $F_k$. However, we only need to compute the total number of successful trials; we do not need to know their sequence. Thus, we need to treat worlds having the same number of successful Bernoulli trials equivalently, to avoid the enumeration of an exponential number of sequences. This is done implicitly by polynomial multiplication, exploiting that

$$0.1 p_2\, x^1 x^1 + 0.1 (1-p_2)\, x^1 x^0 + 0.9 p_2\, x^0 x^1 + 0.9 (1-p_2)\, x^0 x^0 = 0.1 p_2\, x^2 + 0.1 (1-p_2)\, x^1 + 0.9 p_2\, x^1 + 0.9 (1-p_2)\, x^0.$$

This representation no longer allows us to distinguish the sequence of successful Bernoulli trials. This loss of information is beneficial, as it allows us to unify possible worlds having the same sum of Bernoulli trials:

$$0.1 p_2\, x^2 + 0.1 (1-p_2)\, x^1 + 0.9 p_2\, x^1 + 0.9 (1-p_2)\, x^0 = 0.1 p_2\, x^2 + \big(0.1 (1-p_2) + 0.9 p_2\big)\, x^1 + 0.9 (1-p_2)\, x^0.$$

Each remaining monomial represents an equivalence class of possible worlds. For example, the monomial in $x^1$ represents all worlds having a total of one successful Bernoulli trial. This is evident, since the coefficient of this monomial was derived from the sum of both worlds having a total of one successful Bernoulli trial.

In the next iteration, writing $a_i$ as shorthand for the coefficient of $x^i$ in $F_2$, we compute:

$$F_3 = F_2 \cdot \mathrm{poly}(X_3) = (a_2 x^2 + a_1 x^1 + a_0 x^0) \cdot (p_3 x^1 + (1-p_3) x^0) = a_2 p_3\, x^2 x^1 + a_2 (1-p_3)\, x^2 x^0 + a_1 p_3\, x^1 x^1 + a_1 (1-p_3)\, x^1 x^0 + a_0 p_3\, x^0 x^1 + a_0 (1-p_3)\, x^0 x^0.$$

This polynomial represents the three classes of possible worlds in $F_2$ combined with the two possible results of the third Bernoulli trial, yielding a total of six monomials. Unification yields

$$F_3 = a_2 p_3\, x^3 + \big(a_2 (1-p_3) + a_1 p_3\big)\, x^2 + \big(a_1 (1-p_3) + a_0 p_3\big)\, x^1 + a_0 (1-p_3)\, x^0.$$

Writing $b_i$ as shorthand for the coefficient of $x^i$ in $F_3$, the final generating function is given by

$$F_4 = F_3 \cdot \mathrm{poly}(X_4) = (b_3 x^3 + b_2 x^2 + b_1 x^1 + b_0) \cdot (0.4x + 0.6)$$
$$= 0.4 b_3\, x^4 + 0.6 b_3\, x^3 + 0.4 b_2\, x^3 + 0.6 b_2\, x^2 + 0.4 b_1\, x^2 + 0.6 b_1\, x^1 + 0.4 b_0\, x^1 + 0.6 b_0\, x^0$$
$$= 0.4 b_3\, x^4 + (0.6 b_3 + 0.4 b_2)\, x^3 + (0.6 b_2 + 0.4 b_1)\, x^2 + (0.6 b_1 + 0.4 b_0)\, x + 0.6 b_0.$$

This polynomial describes the PDF of $\sum_{i=1}^{4} X_i$, since each monomial $c_i x^i$ implies that the probability that, out of all four Bernoulli trials, the total number of successful events equals $i$, is $c_i$. Thus, we get $P(\sum_{n=1}^{4} X_n = i) = c_i$ for $0 \leq i \leq 4$, where $c_i$ is the coefficient of $x^i$ in $F_4$. Note that this result equals the result we obtained by using the Poisson binomial recurrence in the previous section.

Complexity Analysis

The generating function technique requires a total of $N$ iterations. In each iteration $1 \leq k \leq N$, a polynomial of degree at most $k$, and thus of maximum length $k+1$, is multiplied with a polynomial of degree $1$, thus having a length of $2$. This requires computing a total of $(k+1) \cdot 2$ monomials in each iteration, each requiring a scalar multiplication. This leads to a total time complexity of $\sum_{k=1}^{N} 2(k+1) \in O(N^2)$ for the polynomial expansions. Unification of a polynomial of length $k$ can be done in $O(k)$ time, exploiting that the monomials are sorted by exponent after expansion. Unification at each iteration leads to an $O(N^2)$ complexity for the unification step. This results in a total complexity of $O(N^2)$, the same as for the Poisson binomial recurrence approach.

An advantage of the generating function approach is that this naive polynomial multiplication can be accelerated using the Discrete Fourier Transform (DFT). This technique allows reducing the total complexity of computing the sum of $N$ Bernoulli random variables to $O(N \log N)$ ([71]). This acceleration is achieved by exploiting that the DFT allows multiplying two polynomials of size $k$ in $O(k \log k)$ time. Equi-sized polynomials are obtained in the approach of [71] by using a divide and conquer approach that recursively divides the set of $N$ Bernoulli trials into two equi-sized sets. Their recursive algorithm then combines these results by performing a polynomial multiplication of the generating polynomials of each set. More details of this algorithm can be found in [71].
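The generating functions technique and the divide and conquer idea described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the algorithm of [71]: the names `poly_mul`, `gf_iterative` and `gf_divide_and_conquer` are hypothetical, polynomials are stored as coefficient lists indexed by exponent, the input probabilities are assumed values, and the product is computed naively. A DFT-based product would have to replace `poly_mul` to obtain the acceleration discussed in the text.

```python
from typing import List


def poly_mul(a: List[float], b: List[float]) -> List[float]:
    """Multiply two polynomials given as coefficient lists (index = exponent)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            # x^i * x^j = x^(i+j): monomials with equal exponent are unified here
            out[i + j] += a_i * b_j
    return out


def gf_iterative(probs: List[float]) -> List[float]:
    """Compute F_N via F_k = F_{k-1} * poly(X_k), starting from F_0 = 1."""
    f = [1.0]  # F_0 = 1 * x^0
    for p_k in probs:
        f = poly_mul(f, [1.0 - p_k, p_k])  # poly(X_k) = (1 - p_k) + p_k * x
    return f


def gf_divide_and_conquer(probs: List[float]) -> List[float]:
    """Compute F_N by recursively splitting the trials into two halves."""
    if not probs:
        return [1.0]  # empty product
    if len(probs) == 1:
        return [1.0 - probs[0], probs[0]]
    mid = len(probs) // 2
    left = gf_divide_and_conquer(probs[:mid])
    right = gf_divide_and_conquer(probs[mid:])
    return poly_mul(left, right)  # combine the two (roughly) equi-sized halves


# Illustrative run; these probabilities are assumed values for the sketch.
coefficients = gf_iterative([0.1, 0.2, 0.3, 0.4])
for exponent, c in enumerate(coefficients):
    print(f"c_{exponent} = {c:.4f}")
```

Entry `i` of the returned list is the coefficient $c_i$ of $x^i$, i.e., the probability that exactly $i$ trials succeed. Both functions return the same coefficients up to floating-point rounding; they differ only in the order in which the factors $\mathrm{poly}(X_k)$ are multiplied.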

The Paradigm of Equivalent Worlds has been successfully applied to efficiently support many spatial query predicates and spatial data mining tasks. These more advanced techniques are out of scope of this book chapter, but the techniques presented in this chapter should help the interested reader to dive deeper into understanding state-of-the-art solutions, and to contribute to this field. An overview of research directions on uncertain spatial data is provided in Table 2.

Table 2. Advanced Topics in Querying and Mining Uncertain Spatial Data.

Topic                                          Related Work
Nearest Neighbor Query Processing              [33,61,29,58,115,82,93]
k-Nearest Neighbor (kNN) Query Processing      [60,21,30,15]
Top-k Query Processing                         [88,97,111]
Ranking of Uncertain Spatial Data              [74,19,37,96,70,18,39,76,17,57]
Reverse kNN Query Processing                   [75,27,16,46]
Skyline Query Processing                       [87,73,103,45,109]
Indexing Uncertain Spatial Data                [114,47,1]
Maximum Range-Sum Query Processing             [2,80,77]
Querying Uncertain Trajectory Data             [48,83,117]
Clustering Uncertain Spatial Data              [94,120,81,62]
Frequent Itemset and Colocation Mining         [14,106,20,17,107]

Efficient solutions on uncertain data have been presented for (1)-nearest neighbor (1NN) queries [33,61,29,58,115,82,93]. The case of 1NN is special, as for 1NN the cases of object-based and result-based probabilistic result semantics are equivalent: Since a 1NN query returns only a single result object, the probability of any object to be part of the result is equal to the probability of this object to be the (whole) result. For k-nearest neighbor queries, this is not the case, as initially motivated in Figure 2. For object-based result semantics (as explained in Section 5), polynomial time solutions leveraging the paradigm of equivalent worlds have been proposed [15]. For result-based result semantics, where each of the (potentially exponentially many in k) results is associated with a probability, solutions have been presented in [21,30].

A related problem is Top-k query processing, which returns the k best result objects for a given score function [88,97,111]. While these solutions were not proposed in the context of spatial or spatio-temporal data, they are mentioned here as they can be applied to spatial data.
For example, if the score function is defined as the distance to the query object, this problem becomes equivalent to kNN. Solutions for result-based probabilistic result semantics are proposed in [97,88] and for object-based result semantics in [111].

Another generalization of this problem are ranking queries, which return the Top-k result ordered by score. For uncertain data using object-based result semantics, this yields a probabilistic mapping of each database object to each rank. For example, it may return that an object o has one probability of being at Rank 1 and another probability of being at Rank 2. In the case of result-based probabilistic result semantics, each possible ranking of objects is mapped to a probability; for example, one specific ordering of three objects is assigned its own probability. Solutions for the result-based probabilistic result semantics case have been proposed in [96], having exponential run-time due to the hard nature of this problem. For the case of object-based probabilistic result semantics, first solutions having exponential run-time were proposed in [19,74]. Applying the paradigm of equivalent worlds, a number of solutions have been proposed concurrently and independently to achieve polynomial run-time (linear in the number of database objects times the number of ranks). The generating functions technique (as explained in Section 8) was proposed for this purpose by Li et al. [70]. An equivalent approach using a technique called Poisson-Binomial Recurrence was simultaneously proposed by [18,57]. A comparison of the generating functions technique and the Poisson-Binomial Recurrence, along with a proof of equivalence, can be found in [119]. Other works shown in Table 2 include solutions for the case of existential uncertainty [39], inverse ranking [76], spatially extended objects [17], and the computation of the expected rank of an object [37].
Solutions for indexing of uncertain spatial [1,28] and spatio-temporal [114,47] data have been proposed to speed up several of the previously mentioned query types.

The problem of finding reverse k nearest neighbors (RkNNs) has been studied for spatial data [75,27,16] and spatio-temporal data [46]. Solutions for skyline queries on uncertain data have been proposed in [87,73,103,45,109]. More recently, the problem of answering Maximum Range-Sum Queries has been studied for uncertain data [2,80,77].

Solutions tailored towards uncertain spatio-temporal trajectories, in which the exact location of an object at each point in time is a random variable, have been proposed [48,83,117]. In this work, the challenge is to leverage stochastic processes that consider temporal dependencies. Such dependencies describe that the location of an object at a time $t$ depends on its location at time $t-1$.

Solutions for clustering uncertain data have been proposed [94,120,81,62]. The challenge of clustering uncertain data is that the membership likelihood of an uncertain object to a cluster depends on other objects, making it hard to identify groups of worlds that are guaranteed to yield the same clustering result.

Finally, solutions for frequent itemset mining have been proposed for uncertain data [14,106,20,17]. While frequent itemset mining is not a spatial problem, it has applications in spatial co-location mining [107,26].

Yet, many other spatial query predicates, as well as other probabilistic query predicates using different probabilistic result semantics, are still open to study. The author hopes that this chapter provides interested scholars with a starting point to fully understand the preliminaries and assumptions made by existing work, as well as a general paradigm to develop efficient solutions for future work leveraging the Paradigm of Equivalent Worlds presented herein.

10 Summary

This chapter provided an overview of uncertain spatial data models and the concept of possible world semantics to interpret queries on these models. To understand the landscape of existing query processing algorithms on uncertain data, this chapter further surveyed different probabilistic result semantics and different probabilistic query predicates. To give the interested reader a start into this field, this chapter presented a general paradigm to efficiently query uncertain data based on the Paradigm of Equivalent Worlds, which aims at finding possible worlds that are guaranteed to have the same query result. As a case study applying this paradigm, this chapter provided solutions to efficiently compute range queries on uncertain data using an efficient recursion approach, as well as leveraging the concept of generating functions.

Given this survey on modeling and querying uncertain spatial data, this chapter further provided a brief (and not exhaustive) overview of some research directions on uncertain spatial data. Many queries on uncertain data have already been solved efficiently, but many new challenges arise. For instance, only limited work has focused on streaming uncertain data, that is, handling uncertain data that changes rapidly. Another mostly open research direction is uncertain data processing in resource-limited scenarios such as edge computing. The author hopes that readers will find this overview useful for understanding existing solutions and for adding their own research to this field.

References Agarwal, P. K., Cheng, S.-W., Tao, Y., and Yi, K.

Indexing uncertain data.In

Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposiumon Principles of database systems (2009), pp. 137–146.2.

Agarwal, P. K., Kumar, N., Sintos, S., and Suri, S.

Range-max queries onuncertain data.

Journal of Computer and System Sciences 94 (2018), 118–134.3.

Aggarwal, C. C.

Managing and Mining Uncertain Data , vol. 35. SpringerScience & Business Media, 2010.4.

Aggarwal, C. C., and Philip, S. Y.

A Survey of Uncertain Data Algorithmsand Applications.

IEEE Transactions on Knowledge and Data Engineering 21 , 5(2008), 609–623.5.

Agrawal, P., Benjelloun, O., Sarma, A. D., Hayworth, C., Nabar, S.,Sugihara, T., and Widom, J.

Trio: A System for Data, Uncertainty, andLineage.

Proc. of VLDB 2006 (demonstration description) (2006).6.

Aji, A., Wang, F., Vo, H., Lee, R., Liu, Q., Zhang, X., and Saltz,J.

Hadoop-GIS: A High Performance Spatial Data Warehousing System overMapReduce.

Proceedings of the VLDB Endowment 6 , 11 (2013), 1009–1020.7.

Akdogan, A., Demiryurek, U., Banaei-Kashani, F., and Shahabi, C.

Voronoi-based Geospatial Query Processing with MapReduce. In (2010), IEEE, pp. 9–16.8.

Antova, L., Jansen, T., Koch, C., and Olteanu, D.

Fast and simple re-lational processing of uncertain data. In

Proceedings of the 24th InternationalConference on Data Engineering (ICDE), Cancun, Mexico (2008), pp. 983–992.9.

Antova, L., Jansen, T., Koch, C., and Olteanu, D.

Fast and Simple Rela-tional Processing of Uncertain Data. In (2008), IEEE, pp. 983–992.10.

Apache . Hadoop. http://hadoop.apache.org/ .11.

Bacchus, F., Grove, A. J., Halpern, J. Y., and Koller, D.

From statisticalknowledge bases to degrees of belief.

Artiﬁcial Intelligence 87 , 1 (1996), 75–143.2.

Barbará, D., Garcia-Molina, H., and Porter, D.

The Management ofProbabilistic Data.

IEEE Transactions on Knowledge and Data Engineering 4 , 5(1992), 487–502.13.

Benjelloun, O., Sarma, A. D., Halevy, A. Y., and Widom, J.

Uldbs:Databases with uncertainty and lineage. In

Proceedings of the 32nd InternationalConference on Very Large Data Bases (VLDB), Seoul, Korea (2006), pp. 953–964.14.

Bernecker, T., Cheng, R., Cheung, D. W., Kriegel, H.-P., Lee, S. D.,Renz, M., Verhein, F., Wang, L., and Zuefle, A.

Model-based probabilisticfrequent itemset mining.

Knowledge and Information Systems 37 , 1 (2013), 181–217.15.

Bernecker, T., Emrich, T., Kriegel, H.-P., Mamoulis, N., Renz, M.,and Züfle, A.

A novel probabilistic pruning approach to speed up similarityqueries in uncertain databases. In (2011), IEEE, pp. 339–350.16.

Bernecker, T., Emrich, T., Kriegel, H.-P., Renz, M., Zankl, S., andZüfle, A.

Eﬃcient probabilistic reverse nearest neighbor query processing onuncertain data.

Proceedings of the VLDB Endowment 4 , 10 (2011), 669–680.17.

Bernecker, T., Emrich, T., Kriegel, H.-P., Renz, M., and Züfle, A.

Probabilistic ranking in fuzzy object databases. In

Proceedings of the 21st ACMinternational conference on Information and knowledge management (2012),pp. 2647–2650.18.

Bernecker, T., Kriegel, H.-P., Mamoulis, N., Renz, M., and Zuefle,A.

Scalable Probabilistic Similarity Ranking in Uncertain Databases.

IEEETransactions on Knowledge and Data Engineering 22 , 9 (2010), 1234–1246.19.

Bernecker, T., Kriegel, H.-P., and Renz, M.

ProUD: probabilistic rankingin uncertain databases. In

Proceedings of the 20th International Conference onScientiﬁc and Statistical Database Management (SSDBM), Hong Kong, China (2008), pp. 558–565.20.

Bernecker, T., Kriegel, H.-P., Renz, M., Verhein, F., and Zuefle, A.

Probabilistic Frequent Itemset Mining in Uncertain Databases. In

Proceedings ofthe 15th ACM SIGKDD International Conference on Knowledge Discovery andData Mining (2009), ACM, pp. 119–128.21.

Beskales, G., Soliman, M., and Ilyas, I.

Eﬃcient search for the top-k prob-able nearest neighbors in uncertain databases.

PVLDB 1 (2008), 326–339.22.

Böhm, C., Pryakhin, A., and Schubert, M.

The Gauss-tree: Eﬃcient objectidentiﬁcation of probabilistic feature vectors. In

Proceedings of the 22nd Interna-tional Conference on Data Engineering (ICDE), Atlanta, GA (2006), p. 9.23.

Boulos, J., Dalvi, N., Mandhani, B., Mathur, S., Re, C., and Suciu,D.

MYSTIQ: A system for ﬁnding more answers by using probabilities. In

Proceedings of the 2005 ACM SIGMOD International Conference on Managementof Data (2005), ACM, pp. 891–893.24.

Casella, G., and Berger, R. L.

Statistical Inference , vol. 2. Duxbury PaciﬁcGrove, CA, 2002.25.

Cavallo, R., and Pittarelli, M.

The Theory of Probabilistic Databases. In

VLDB (1987), vol. 87, pp. 1–4.26.

Chan, H. K.-H., Long, C., Yan, D., and Wong, R. C.-W.

Fraction-score:a new support measure for co-location pattern mining. In (2019), IEEE, pp. 1514–1525.27.

Cheema, M. A., Lin, X., Wang, W., Zhang, W., and Pei, J.

Probabilisticreverse nearest neighbor queries on uncertain data.

IEEE Trans. Knowl. DataEng. 22 , 4 (2010), 550–564.8.

Chen, L., Gao, Y., Zhong, A., Jensen, C. S., Chen, G., and Zheng, B.

Indexing metric uncertain data for range queries and range joins.

The VLDBJournal 26 , 4 (2017), 585–610.29.

Cheng, R., Chen, J., Mokbel, M. F., and Chow, C.-Y.

Probabilistic ver-iﬁers: Evaluating constrained nearest-neighbor queries over uncertain data. In

Proceedings of the 24th International Conference on Data Engineering (ICDE),Cancun, Mexico (2008), pp. 973–982.30.

Cheng, R., Chen, L., Chen, J., and Xie, X.

Evaluating probability thresholdk-nearest-neighbor queries over uncertain data. In

Proceedings of the 13th Interna-tional Conference on Extending Database Technology (EDBT), Saint-Petersburg,Russia (2009), pp. 672–683.31.

Cheng, R., Emrich, T., Kriegel, H.-P., Mamoulis, N., Renz, M., Tra-jcevski, G., and Züfle, A.

Managing Uncertainty in Spatial and Spatio-temporal Data. In (2014), IEEE, pp. 1302–1305.32.

Cheng, R., Kalashnikov, D. V., and Prabhakar, S.

Evaluating proba-bilistic queries over imprecise data. In

Proceedings of the ACM InternationalConference on Management of Data (SIGMOD), San Diego, CA (2003), pp. 551–562.33.

Cheng, R., Kalashnikov, D. V., and Prabhakar, S.

Querying imprecisedata in moving object environments.

IEEE Trans. Knowl. Data Eng. 16 , 9 (2004),1112–1127.34.

Cheng, R., Xia, Y., Prabhakar, S., Shah, R., and Vitter, J.

Eﬃcientindexing methods for probabilistic threshold queries over uncertain data. In

Pro-ceedings of the 30th International Conference on Very Large Data Bases (VLDB),Toronto, Canada (2004), pp. 876–887.35.

Cho, E., Myers, S. A., and Leskovec, J.

Friendship and Mobility: UserMovement In Location-Based Social Networks. In

Proceedings of the 17th ACMSIGKDD International Conference on Knowledge Discovery and Data Mining (2011), ACM, pp. 1082–1090.36.

Cormode, G., Li, F., and Yi, K.

Semantics of Ranking Queries for Proba-bilistic Data and Expected Ranks. In (2009), IEEE, pp. 305–316.37.

Cormode, G., Li, F., and Yi, K.

Semantics of ranking queries for probabilisticdata and expected results. In

Proceedings of the 25th International Conferenceon Data Engineering (ICDE), Shanghai, China (2009), pp. 305–316.38.

Couclelis, H.

The Certainty of Uncertainty: GIS and the Limits of GeographicKnowledge.

Transactions in GIS 7 , 2 (2003), 165–175.39.

Dai, X., Yiu, M. L., Mamoulis, N., Tao, Y., and Vaitis, M.

Probabilisticspatial queries on existentially uncertain data. In

International Symposium onSpatial and Temporal Databases (2005), Springer, pp. 400–417.40.

Dalvi, N., and Suciu, D.

Eﬃcient query evaluation on probabilistic databases.

The VLDB Journal 16 , 4 (2007), 523–544.41.

Dalvi, N. N., Ré, C., and Suciu, D.

Probabilistic databases: diamonds in thedirt.

Commun. ACM 52 , 7 (2009), 86–94.42.

Dalvi, N. N., and Suciu, D.

Eﬃcient query evaluation on probabilisticdatabases. In

Proceedings of the 30th International Conference on Very LargeData Bases (VLDB), Toronto, Canada (2004), pp. 864–875.43.

Dean, J., and Ghemawat, S.

MapReduce: Simpliﬁed Data Processing on LargeClusters.

Communications of the ACM 51 , 1 (2008), 107–113.4.

Deshpande, A., Guestrin, C., Madden, S., Hellerstein, J. M., and Hong,W.

Model-driven data acquisition in sensor networks. In

Proceedings of the 30thInternational Conference on Very Large Data Bases (VLDB), Toronto, Canada (2004), pp. 588–599.45.

Ding, X., Jin, H., Xu, H., and Song, W.

Probabilistic skyline queries overuncertain moving objects.

Computing and Informatics 32 , 5 (2014), 987–1012.46.

Emrich, T., Kriegel, H.-P., Mamoulis, N., Niedermayer, J., Renz, M.,and Züfle, A.

Reverse-nearest neighbor queries on uncertain moving objecttrajectories. In

International Conference on Database Systems for Advanced Ap-plications (2014), Springer, pp. 92–107.47.

Emrich, T., Kriegel, H.-P., Mamoulis, N., Renz, M., and Züfle, A.

In-dexing Uncertain Spatio-Temporal Data. In

Proceedings of the 21st ACM Inter-national Conference on Information and Knowledge Management (2012), ACM,pp. 395–404.48.

Emrich, T., Kriegel, H.-P., Mamoulis, N., Renz, M., and Züfle, A.

Querying Uncertain Spatio-Temporal Data. In

IEEE 28th International Confer-ence on Data Engineering (ICDE) (2012), IEEE, pp. 354–365.49.

Federal Geographic Data Committee . Geospatial Metadata Standards andGuidelines. .50.

Fegeas, R. G., Cascio, J. L., and Lazar, R. A.

An Overview of FIPS 173,The Spatial Data Transfer Standard.

Cartography and Geographic InformationSystems 19 , 5 (1992), 278–293.51.

Fuhr, N., and Rölleke, T.

A Probabilistic Relational Algebra for the Inte-gration of Information Retrieval and Database Systems.

ACM Transactions onInformation Systems (TOIS) 15 , 1 (1997), 32–66.52.

Fuhr, N., and Rölleke, T.

A probabilistic relational algebra for the integrationof information retrieval and database systems.

ACM Trans. Inf. Syst. 15 , 1 (1997),32–66.53.

Goodchild, M. F.

Uncertainty: The Achilles Heel of GIS.

Geo Info Systems 8, 11 (1998), 50–52.

54. Grira, J., Bédard, Y., and Roche, S. Spatial Data Uncertainty in the VGI World: Going from Consumer to Producer. Geomatica 64, 1 (2010), 61–72.

55. Hoeffding, W., et al. On the distribution of the number of successes in independent trials. The Annals of Mathematical Statistics 27, 3 (1956), 713–721.

56. Hsu, J. Multiple Comparisons: Theory and Methods. Chapman and Hall/CRC, 1996.

57. Hua, M., Pei, J., Zhang, W., and Lin, X. Ranking queries on uncertain data: a probabilistic threshold approach. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (2008), pp. 673–686.

58. Iijima, Y., and Ishikawa, Y. Finding probabilistic nearest neighbors for query objects with imprecise locations. In Proceedings of the 10th International Conference on Mobile Data Management (MDM), Taipei, Taiwan (2009), pp. 52–61.

59. Jampani, R., Xu, F., Wu, M., Perez, L. L., Jermaine, C., and Haas, P. J. MCDB: A Monte Carlo Approach to Managing Uncertain Data. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (2008), ACM, pp. 687–700.

60. Kolahdouzan, M., and Shahabi, C. Voronoi-based k nearest neighbor search for spatial network databases. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30 (2004), VLDB Endowment, pp. 840–851.

61. Kriegel, H.-P., Kunath, P., and Renz, M. Probabilistic nearest-neighbor query on uncertain objects. In Proceedings of the 12th International Conference on Database Systems for Advanced Applications (DASFAA), Bangkok, Thailand (2007), pp. 337–348.

62. Kriegel, H.-P., and Pfeifle, M. Density-based clustering of uncertain data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (2005), pp. 672–677.

63. Kumar, S., Morstatter, F., and Liu, H. Twitter data analytics. Springer, 2014.

64. Lakshmanan, L. V., Leone, N., Ross, R., and Subrahmanian, V. S. ProbView: A Flexible Probabilistic Database System. ACM Transactions on Database Systems (TODS) 22, 3 (1997), 419–469.

65. Lange, K. Numerical analysis for statisticians. In Statistics and computing (1999).

66. Li, J., and Deshpande, A. Consensus answers for queries over probabilistic databases. In Symposium on Principles of Database Systems (PODS), Providence, Rhode Island (2009), pp. 259–268.

67. Li, J., and Deshpande, A. Ranking continuous probabilistic datasets. Proceedings of the 36th International Conference on Very Large Data Bases (VLDB), Singapore 3, 1 (2010), 638–649.

68. Li, J., and Deshpande, A. Ranking Continuous Probabilistic Datasets. Proceedings of the VLDB Endowment 3, 1-2 (2010), 638–649.

69. Li, J., Saha, B., and Deshpande, A. A Unified Approach to Ranking in Probabilistic Databases. Proceedings of the VLDB Endowment 2, 1 (2009), 502–513.

70. Li, J., Saha, B., and Deshpande, A. A unified approach to ranking in probabilistic databases. Proceedings of the 35th International Conference on Very Large Data Bases (VLDB), Lyon, France 2, 1 (2009), 502–513.

71. Li, J., Saha, B., and Deshpande, A. A unified approach to ranking in probabilistic databases. VLDB Journal 20, 2 (2011), 249–275.

72. Li, L., Wang, H., Li, J., and Gao, H. A survey of uncertain data management. Frontiers of Computer Science (09 2018).

73. Lian, X., and Chen, L. Monochromatic and bichromatic reverse skyline search over uncertain databases. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (2008), pp. 213–226.

74. Lian, X., and Chen, L. Probabilistic ranked queries in uncertain databases. In Proceedings of the 12th International Conference on Extending Database Technology (EDBT), Nantes, France (2008), pp. 511–522.

75. Lian, X., and Chen, L. Efficient processing of probabilistic reverse nearest neighbor queries over uncertain data. VLDB Journal 18, 3 (2009), 787–808.

76. Lian, X., and Chen, L. Probabilistic inverse ranking queries over uncertain data. In Proceedings of the 14th International Conference on Database Systems for Advanced Applications (DASFAA), Brisbane, Australia (2009), pp. 35–50.

77. Liu, Q., Lian, X., and Chen, L. Probabilistic maximum range-sum queries on spatial database. In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (2019), pp. 159–168.

78. Ljosa, V., and Singh, A. K. Apla: Indexing arbitrary probability distributions. In Proceedings of the 23rd International Conference on Data Engineering (ICDE), Istanbul, Turkey (2007), pp. 946–955.

79. Lu, W., Shen, Y., Chen, S., and Ooi, B. C. Efficient Processing of k Nearest Neighbor Joins using MapReduce. Proceedings of the VLDB Endowment 5, 10 (2012), 1016–1027.

80. Nakayama, Y., Amagata, D., and Hara, T. Probabilistic maxrs queries on uncertain data. In International Conference on Database and Expert Systems Applications (2017), Springer, pp. 111–119.

81. Ngai, W. K., Kao, B., Chui, C. K., Cheng, R., Chau, M., and Yip, K. Y. Efficient Clustering of Uncertain Data. In Sixth International Conference on Data Mining (ICDM’06) (2006), IEEE, pp. 436–445.

82. Niedermayer, J., Züfle, A., Emrich, T., Renz, M., Mamoulis, N., Chen, L., and Kriegel, H.-P. Probabilistic Nearest Neighbor Queries on Uncertain Moving Object Trajectories. Proceedings of the VLDB Endowment 7, 3 (2013), 205–216.

83. Niedermayer, J., Züfle, A., Emrich, T., Renz, M., Mamoulis, N., Chen, L., and Kriegel, H.-P. Similarity search on uncertain spatio-temporal data. In International Conference on Similarity Search and Applications

85. Patroumpas, K., Papamichalis, M., and Sellis, T. K. Probabilistic range monitoring of streaming uncertain positions in geosocial networks. In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Crete, Greece (2012), pp. 20–37.

86. Pei, J., Hua, M., Tao, Y., and Lin, X. Query answering techniques on uncertain and probabilistic data: tutorial summary. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Vancouver, BC (2008), pp. 1357–1364.

87. Pei, J., Jiang, B., Lin, X., and Yuan, Y. Probabilistic skylines on uncertain data. In Proceedings of the 33rd international conference on Very large data bases (2007), Citeseer, pp. 15–26.

88. Re, C., Dalvi, N., and Suciu, D. Efficient top-k query evaluation on probabilistic databases. In Proceedings of the 23rd International Conference on Data Engineering (ICDE), Istanbul, Turkey (2007), pp. 886–895.

89. Re, C., Dalvi, N. N., and Suciu, D. Query evaluation on probabilistic databases. IEEE Data Eng. Bull. 29, 1 (2006), 25–31.

90. Renz, M., Cheng, R., Kriegel, H.-P., Züfle, A., and Bernecker, T. Similarity search and mining in uncertain databases. Proceedings of the 36th International Conference on Very Large Data Bases (VLDB), Singapore 3, 2 (2010), 1653–1654.

91. Sarma, A. D., Benjelloun, O., Halevy, A. Y., and Widom, J. Working models for uncertain data. In Proceedings of the 22nd International Conference on Data Engineering (ICDE), Atlanta, GA (2006), p. 7.

92. Schmid, K. A., and Züfle, A. Representative query answers on uncertain data. In Proceedings of the 16th International Symposium on Spatial and Temporal Databases (2019), pp. 140–149.

93. Schmid, K. A., Züfle, A., Emrich, T., Renz, M., and Cheng, R. Uncertain voronoi cell computation based on space decomposition. Geoinformatica 21, 4 (2017), 797–827.

94. Schubert, E., Koos, A., Emrich, T., Züfle, A., Schmid, K. A., and Zimek, A. A Framework for Clustering Uncertain Data. Proceedings of the VLDB Endowment 8, 12 (2015), 1976–1979.

95. Sen, P., and Deshpande, A. Representing and querying correlated tuples in probabilistic databases. In Proceedings of the 23rd International Conference on Data Engineering (ICDE), Istanbul, Turkey (2007), pp. 596–605.

96. Soliman, M., and Ilyas, I. Ranking with uncertain scores. In Proceedings of the 25th International Conference on Data Engineering (ICDE), Shanghai, China (2009), pp. 317–328.

97. Soliman, M. A., Ilyas, I. F., and Chang, K. C.-C. Top-k query processing in uncertain databases. In Proceedings of the 23rd International Conference on Data Engineering (ICDE), Istanbul, Turkey (2007), pp. 896–905.

98. Sui, D., Elwood, S., and Goodchild, M. Crowdsourcing Geographic Knowledge: Volunteered Geographic Information (VGI) in Theory and Practice. Springer Science & Business Media, 2012.

99. Tao, Y., Cheng, R., Xiao, X., Ngai, W. K., Kao, B., and Prabhakar, S. Indexing multi-dimensional uncertain data with arbitrary probability density functions. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), Trondheim, Norway (2005), pp. 922–933.

100. Tran, T. T., Peng, L., Li, B., Diao, Y., and Liu, A. Pods: a new model and processing algorithms for uncertain data streams. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Indianapolis, IN (2010), pp. 159–170.

101. United States Geological Survey. USGS Science Data Catalog. https://data.usgs.gov/datacatalog/.

102. Valiant, L. The complexity of enumeration and reliability problems. In SIAM Journal of Computing (1979), pp. 410–421.

103. Vu, K., and Zheng, R. Efficient algorithms for spatial skyline query with uncertainty. In Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (2013), pp. 412–415.

104. Wang, D. Z., Michelakis, E., Garofalakis, M., and Hellerstein, J. M. BAYESSTORE: Managing Large, Uncertain Data Repositories with Probabilistic Graphical Models. Proceedings of the VLDB Endowment 1, 1 (2008), 340–351.

105. Wang, K., Han, J., Tu, B., Dai, J., Zhou, W., and Song, X. Accelerating Spatial Data Processing with MapReduce. In IEEE 16th International Conference on Parallel and Distributed Systems (2010), IEEE, pp. 229–236.

106. Wang, L., Cheung, D. W.-L., Cheng, R., Lee, S. D., and Yang, X. S. Efficient Mining of Frequent Item Sets on Large Uncertain Databases. IEEE Transactions on Knowledge and Data Engineering 24, 12 (2012), 2170–2183.

107. Wang, L., Wu, P., and Chen, H. Finding probabilistic prevalent colocations in spatially uncertain data sets. IEEE Transactions on Knowledge and Data Engineering 25, 4 (2011), 790–804.

108. Wang, Y., Li, X., Li, X., and Wang, Y. A survey of queries over uncertain data. Knowledge and Information Systems 37, 3 (2013), 485–530.

109. Yang, Z., Li, K., Zhou, X., Mei, J., and Gao, Y. Top k probabilistic skyline queries on uncertain data. Neurocomputing 317 (2018), 1–14.

110. Yi, K., Li, F., Kollios, G., and Srivastava, D. Efficient processing of top-k queries in uncertain databases. In Proceedings of the 24th International Conference on Data Engineering (ICDE), Cancun, Mexico (2008), pp. 1406–1408.

111. Yi, K., Li, F., Kollios, G., and Srivastava, D. Efficient processing of top-k queries in uncertain databases with x-relations. IEEE Trans. Knowl. Data Eng. 20, 12 (2008), 1669–1682.

112. Yiu, M. L., Mamoulis, N., Dai, X., Tao, Y., and Vaitis, M. Efficient evaluation of probabilistic advanced spatial queries on existentially uncertain data. Knowledge and Data Engineering, IEEE Transactions on 21, 1 (2009), 108–122.

113. Zhang, C., Li, F., and Jestes, J. Efficient Parallel kNN Joins for Large Data in MapReduce. In Proceedings of the 15th International Conference on Extending Database Technology (2012), ACM, pp. 38–49.

114. Zhang, M., Chen, S., Jensen, C. S., Ooi, B. C., and Zhang, Z. Effectively indexing uncertain moving objects for predictive queries. Proceedings of the VLDB Endowment 2, 1 (2009), 1198–1209.

115. Zhang, P., Cheng, R., Mamoulis, N., Renz, M., Züfle, A., Tang, Y., and Emrich, T. Voronoi-based nearest neighbor search for multi-dimensional uncertain databases. In (2013), IEEE, pp. 158–169.

116. Zhao, B., and Sui, D. Z. True lies in geospatial big data: Detecting location spoofing in social media. Annals of GIS 23, 1 (2017), 1–14.

117. Zheng, K., Trajcevski, G., Zhou, X., and Scheuermann, P. Probabilistic range queries for uncertain trajectories on road networks. In Proceedings of the 14th International Conference on Extending Database Technology (2011), pp. 283–294.

118. Zimányi, E. Query evaluation in probabilistic relational databases. Theor. Comput. Sci. 171, 1-2 (1997), 179–219.

119. Züfle, A. Similarity Search and Mining in Uncertain Spatial and Spatio-Temporal Databases. PhD thesis, Ludwig-Maximilians University Munich, 2013.

120. Züfle, A., Emrich, T., Schmid, K. A., Mamoulis, N., Zimek, A., and Renz, M. Representative Clustering of Uncertain Data. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2014), ACM, pp. 243–252.

121. Züfle, A., Trajcevski, G., Pfoser, D., and Kim, J.-S. Managing uncertainty in evolving geo-spatial data. In (2020), IEEE, pp. 5–8.

122. Züfle, A., Trajcevski, G., Pfoser, D., Renz, M., Rice, M. T., Leslie, T., Delamater, P., and Emrich, T. Handling Uncertainty in Geo-Spatial Data. In (2017), IEEE, pp. 1467–1470.

123. Zwillinger, D., and Kokoska, S.


Submitted on 2 Sep 2020
