Range Counting Coresets for Uncertain Data
Amirali Abdullah
University of Utah [email protected]
Samira Daruki
University of Utah [email protected]
Jeff M. Phillips
University of Utah [email protected]
Abstract
We study coresets for various types of range counting queries on uncertain data. In our model each uncertain point has a probability density describing its location, sometimes defined as k distinct locations. Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that range counting queries can be answered by just examining this subset. We study three distinct types of queries. RE queries return the expected number of points in a query range. RC queries return the number of points in the range with probability at least a threshold. RQ queries return the probability that fewer than some threshold fraction of the points are in the range. In both RC and RQ coresets the threshold is provided as part of the query. For each type of query we provide coreset constructions with approximation-size tradeoffs. We show that random sampling can be used to construct each type of coreset, and we also provide significantly improved bounds using discrepancy-based approaches on axis-aligned range queries.

1 Introduction

A powerful notion in computational geometry is the coreset [3, 2, 10, 45]. Given a large data set P and a family of queries A, an η-coreset is a subset S ⊂ P such that for all r ∈ A we have ‖r(P) − r(S)‖ ≤ η (the notion of distance ‖·‖ between query results is problem specific and is intentionally left ambiguous for now). Initially used for smallest enclosing ball queries [10], and perhaps most famous in geometry for extent queries as η-kernels [3, 2], the coreset is now employed in many other problems such as clustering [6] and density estimation [45]. Techniques for constructing coresets are becoming more relevant in the era of big data; they summarize a large data set P with a proxy set S of potentially much smaller size that can guarantee error for certain classes of queries.
They also shed light onto the limits of how much information can possibly be represented in a small set of data.

In this paper we focus on a specific type of coreset called an η-sample [45, 21, 13] that can be thought of as preserving density queries and that has deep ties to the basis of learning theory [5]. Given a set of objects X (often X ⊂ R^d is a point set) and a family of subsets A of X, the pair (X, A) is called a range space. Often A is specified by containment in geometric shapes, for instance as all subsets of X defined by inclusion in any ball, any halfspace, or any axis-aligned rectangle. Now an η-sample of (X, A) is a single subset S ⊂ X such that

  max_{r ∈ A} | |X ∩ r|/|X| − |S ∩ r|/|S| | ≤ η.

For any query range r ∈ A, the subset S approximates the relative density of X in r with error at most η.

Uncertain points.
Another emerging notion in data analysis is modeling uncertainty in points. There are several formulations of these problems where each point p ∈ P has an independent probability distribution µ_p describing its location; such a point is said to have locational uncertainty. Imprecise points (also called deterministic uncertainty) model settings where a data point p ∈ P could be anywhere within a fixed continuous range, and were originally used for analyzing precision errors. The worst-case properties of a point set P under the imprecise model have been well-studied [19, 20, 7, 22, 33, 36, 37, 44, 30]. Indecisive points (or attribute uncertainty in the database literature [40]) model each p_i ∈ P as being able to take one of k distinct locations {p_{i,1}, p_{i,2}, . . . , p_{i,k}} with possibly different probabilities, modeling when multiple readings of the same object have been made [26, 43, 17, 16, 1, 4]. We also note another common model of existential uncertainty (similar to tuple uncertainty in the database literature [40], but a bit less general) where the location or value of each p ∈ P is fixed, but the point may not exist with some probability, modeling false readings [28, 27, 40, 17].

We will focus mainly on the indecisive model of locational uncertainty since it comes up frequently in real-world applications [43, 4] (when multiple readings of the same object are made, and typically k is small) and can be used to approximately represent more general continuous representations [25, 38].

Combining these two notions leads to the question: can we create a coreset (specifically for η-samples) of uncertain input data? A few more definitions are required to rigorously state this question. In fact, we develop three distinct notions of how to define the coreset error on uncertain points.
One corresponds to range counting queries, another to querying the mean, and the third to querying the median (actually it approximates the rank for all quantiles).

For an uncertain point set P = {p_1, p_2, . . . , p_n} with each p_i = {p_{i,1}, p_{i,2}, . . . , p_{i,k}} ⊂ R^d, we say that Q ⋐ P is a transversal if Q ∈ p_1 × p_2 × . . . × p_n. That is, Q = (q_1, q_2, . . . , q_n) is an instantiation of the uncertain data P and can be treated as a "certain" point set, where each q_i corresponds to the location of p_i. Pr_{Q ⋐ P}[ζ(Q)] (resp. E_{Q ⋐ P}[ζ(Q)]) represents the probability (resp. expected value) of an event ζ(Q) where Q is instantiated from P according to the probability distribution on the uncertainty in P.

As stated, our goal is to construct a subset of uncertain points T ⊂ P (including the distribution of each point p's location, µ_p) that preserves specific properties over a family of subsets (P, A). For completeness, the first variation we list cannot be accomplished purely with a coreset, as it requires Ω(n) space.

• Range Reporting (RR) queries take a range r ∈ A and a threshold τ, and return all p_i ∈ P such that Pr_{Q ⋐ P}[q_i ∈ r] ≥ τ. Note that the fate of each p_i ∈ P depends on no other p_j ∈ P with i ≠ j, so the points can be considered independently. Building indexes for this model has been studied [15, 18, 42, 47] and effectively solved in R^1 [1].

• Range Expectation (RE) queries take a range r ∈ A and report the expected number of uncertain points in r, E_{Q ⋐ P}[|r ∩ Q|]. The linearity of expectation allows summing the individual expectations that each point p ∈ P is in r. Single queries in this model have also been studied [23, 11, 24].

• Range Counting (RC) queries take a range r ∈ A and a threshold τ, but only return the number of p_i ∈ P which satisfy Pr_{Q ⋐ P}[q_i ∈ r] ≥ τ.
The effect of each p_i ∈ P on the query is separate from that of any other p_j ∈ P with i ≠ j. A random sampling heuristic [46] has been suggested without proof of accuracy.

• Range Quantile (RQ) queries take a query range r ∈ A and report the full cumulative density function on the number of points in the range, Pr_{Q ⋐ P}[|r ∩ Q|]. Thus for a query range r, this returned structure can produce, for any value τ ∈ [0, 1], the probability that τn or fewer points are in r. Since this is no longer an expectation, the linearity of expectation cannot be used to decompose this query along individual uncertain points.

Across all queries we consider, there are two main ways we can approximate the answers. The first and most standard way is to allow an ε-error (for 0 ≤ ε ≤ 1) in the returned answer for RQ, RE, and RC. The second way is to allow an α-error in the threshold associated with the query itself. As will be shown, this is not necessary for RR, RE, or RC, but is required to get useful bounds for RQ. Finally, we will also consider probabilistic error δ, demarcating the probability of failure in a randomized algorithm (such as random sampling). We strive to achieve these approximation factors with a small size coreset T ⊂ P as follows:

RE: For a given range r, let r(Q) = |Q ∩ r|/|Q|, and let E_r(P) = E_{Q ⋐ P}[r(Q)]. Then T ⊂ P is an ε-RE coreset of (P, A) if for all queries r ∈ A we have |E_r(P) − E_r(T)| ≤ ε.

RC: For a range r ∈ A, let G_{P,r}(τ) = (1/|P|) |{p_i ∈ P | Pr_{Q ⋐ P}[q_i ∈ r] ≥ τ}| be the fraction of points in P that are in r with probability at least some threshold τ. Then T ⊂ P is an ε-RC coreset of (P, A) if for all queries r ∈ A and all τ ∈ [0, 1] we have |G_{P,r}(τ) − G_{T,r}(τ)| ≤ ε.
RQ: For a range r ∈ A, let F_{P,r}(τ) = Pr_{Q ⋐ P}[r(Q) ≤ τ] = Pr_{Q ⋐ P}[|Q ∩ r|/|Q| ≤ τ] be the probability that at most a τ fraction of P is in r. Now T ⊂ P is an (ε, α)-RQ coreset of (P, A) if for all r ∈ A and τ ∈ [0, 1] there exists a γ ∈ [τ − α, τ + α] such that |F_{P,r}(τ) − F_{T,r}(γ)| ≤ ε. In such a situation, we also say that F_{T,r} is an (ε, α)-quantization of F_{P,r}.

A natural question is whether we can construct an (ε, 0)-RQ coreset, where there is no secondary α-error term on τ. We demonstrate that there are no useful non-trivial bounds on the size of such a coreset.

When the (ε, α)-quantization F_{T,r} need not be explicitly represented by a coreset T, then Löffler and Phillips [32, 26] show a different small-space representation that can replace it in the above definition of an (ε, α)-RQ coreset with probability at least 1 − δ. First randomly create m = O((1/ε^2) log(1/δ)) transversals Q_1, Q_2, . . . , Q_m, and for each transversal Q_i create an α-sample S_i of (Q_i, A). Then, to satisfy the requirements of F_{T,r}(τ), there exists some γ ∈ [τ − α, τ + α] such that we can return (1/m)|{S_i | r(S_i) ≤ γ}|, and it will be within ε of F_{P,r}(τ). However, this is subverting the attempt to construct and understand a coreset to answer these questions. A coreset T (our goal) can be used as a proxy for P, as opposed to querying m distinct point sets. This alternate approach also does not shed light on how much information can be captured by a small point set, which is provided by bounds on the size of a coreset.

Simple example.
We illustrate a simple example with k = 2 and d = 1, where n = 10 and the nk = 20 possible locations of the uncertain points are laid out on the line in a fixed interleaved order. We consider a coreset T ⊂ P that consists of 5 of the 10 uncertain points, and a specific range r ∈ I^+, a one-sided interval that contains a prefix of the locations in this order but none of the larger locations. Figure 1 shows that F_{T,r} is an (ε′, α)-quantization of F_{P,r}: for each value τ there is some γ ∈ [τ − α, τ + α] with |F_{P,r}(τ) − F_{T,r}(γ)| ≤ ε′. Also observe that |E_r(P) − E_r(T)| = 1/20 = ε. When these errors (the (ε′, α)-quantization and the ε-error) hold for all ranges in some range space, then T is an (ε′, α)-RQ coreset or an ε-RE coreset, respectively.

To understand the error associated with an RC coreset, also consider the threshold τ = 2/3 with respect to the range r. Then 2/10 of the uncertain points from P are in r with probability at least τ (two points with both locations in r), and 1/5 of the uncertain points from T are in r with probability at least τ (one such point). These fractions agree, so there is no RC error for this range and threshold.

We provide the first results for RE-, RC-, and RQ-coresets with guarantees. In particular, we show that a random sample T of size O((1/ε^2)(ν + log(1/δ))) is, with probability at least 1 − δ, an ε-RC coreset for any family of ranges A whose associated range space has VC-dimension ν. If we additionally enforce that each uncertain point has k possible locations, then a sample T of size O((1/ε^2)(ν + log(k/δ))) suffices for an ε-RE coreset. Then we leverage discrepancy-based techniques [35, 12] for some specific families of ranges A to improve these bounds to O((1/ε) poly(k, log(1/ε))). This is an important improvement since 1/ε can be quite large, while k, interpreted as the number of readings of a data point, is small for many applications. In R^1, for one-sided ranges, we construct ε-RE and ε-RC coresets of size O((√k/ε) log(k/ε)).

Figure 1: Example cumulative density functions (F_{T,r}, in red with fewer steps, and F_{P,r}, in blue with more steps) on uncertain point set P and a coreset T for a specific range.
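These definitions can be made concrete by brute-force evaluation on a tiny indecisive instance like the example above (each point takes one of its k locations uniformly at random). The sketch below is our own illustration, not a construction from the paper; enumerating all k^n transversals is feasible only for very small n.

```python
import itertools

def in_range(x, r):
    """Membership in a one-dimensional interval r = (lo, hi]."""
    lo, hi = r
    return lo < x <= hi

def expected_fraction(P, r):
    """E_r(P): by linearity, the average over points of Pr[q_i in r]."""
    n, k = len(P), len(P[0])
    return sum(sum(in_range(x, r) for x in p) / k for p in P) / n

def rc_fraction(P, r, tau):
    """G_{P,r}(tau): fraction of points that are in r with probability >= tau."""
    k = len(P[0])
    return sum(sum(in_range(x, r) for x in p) / k >= tau for p in P) / len(P)

def rq_cdf(P, r, tau):
    """F_{P,r}(tau) = Pr[|Q cap r|/|Q| <= tau], over all k^n transversals."""
    n, hits, total = len(P), 0, 0
    for Q in itertools.product(*P):        # each transversal equally likely
        total += 1
        hits += sum(in_range(x, r) for x in Q) / n <= tau
    return hits / total

# n = 3 indecisive points with k = 2 locations each; query r = (-inf, 4.5]
P = [(1.0, 4.0), (2.0, 6.0), (5.0, 7.0)]
r = (float('-inf'), 4.5)
```

Here expected_fraction(P, r) evaluates to 0.5 and rc_fraction(P, r, 0.5) to 2/3; a candidate coreset T would be run through the same functions and the results compared against these values.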
For axis-aligned rectangles in R^d we construct ε-RE coresets of size O((√k/ε) log^{(3d−1)/2}(k/ε)) and ε-RC coresets of size O((k^{2d+1}/ε) log^{2d−1}(k/ε)). Finally, we show that any ε-RE coreset of size t is also an (ε, α_{ε,t})-RQ coreset with value α_{ε,t} = ε + √((1/2t) ln(2/ε)). These results leverage new connections between uncertain points and both discrepancy of permutations and colored range searching that may be of independent interest.

2 Background: Discrepancy of Permutations

The key tools we will use to construct small coresets for uncertain data are discrepancy of range spaces, and specifically those defined on permutations. Consider a set X, a range space (X, A), and a coloring χ : X → {−1, +1}. Then for some range R ∈ A, the discrepancy is defined as disc_χ(X, R) = |Σ_{x ∈ X ∩ R} χ(x)|. We can then extend this over all ranges, disc_χ(X, A) = max_{R ∈ A} disc_χ(X, R), and over all colorings, disc(X, A) = min_χ disc_χ(X, A).

Consider a ground set (P, Σ_k) where P is a set of n objects and Σ_k = {σ_1, σ_2, . . . , σ_k} is a set of k permutations over P, so each σ_j : P → [n]. We can also consider a family of ranges I_k as a set of intervals defined on one of the k permutations: I_{x,y,j} ∈ I_k is defined so that P ∩ I_{x,y,j} = {p ∈ P | x < σ_j(p) ≤ y} for 0 ≤ x < y ≤ n and j ∈ [k]. The pair ((P, Σ_k), I_k) is then a range space, defining a set of subsets of P.

A canonical way to obtain k permutations from an uncertain point set P = {p_1, p_2, . . . , p_n} is as follows. Define the j-th canonical traversal of P as the set P_j = ∪_{i=1}^n p_{i,j}. When each p_{i,j} ∈ R^1, the sorted order of each canonical traversal P_j defines a permutation on P as σ_j(p_i) = |{p_{i′,j} ∈ P_j | p_{i′,j} ≤ p_{i,j}}|; that is, σ_j(p_i) describes how many locations (including p_{i,j}) in the traversal P_j have value less than or equal to p_{i,j}. In other words, σ_j describes the sorted order of the j-th locations among all uncertain points.
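When the locations lie in R^1, the canonical permutations σ_j are straightforward to compute; a minimal sketch (the helper name is ours, with P[i][j] playing the role of p_{i,j}):

```python
def canonical_permutations(P):
    """sigma_j(p_i): rank of p_{i,j} within the j-th canonical traversal
    P_j = {p_{1,j}, ..., p_{n,j}}, as defined above."""
    n, k = len(P), len(P[0])
    sigmas = []
    for j in range(k):
        order = sorted(range(n), key=lambda i: P[i][j])  # sorted order of P_j
        rank = [0] * n
        for pos, i in enumerate(order):
            rank[i] = pos + 1                            # ranks start at 1
        sigmas.append(rank)                              # sigma_j as a list over i
    return sigmas

P = [(1.0, 6.0), (2.0, 4.0), (3.0, 5.0)]
```

On this instance canonical_permutations(P) returns [[1, 2, 3], [3, 1, 2]]: the first traversal is already sorted, while on the second traversal p_2 has the smallest location.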
Then, given an uncertain point set, let the canonical traversals define the canonical k-permutation (P, Σ_k).

A geometric view of the permutation range space embeds P as n fixed points in R^k and considers ranges defined by inclusion in (k − 1)-dimensional slabs, defined by two parallel halfspaces with normals aligned along one of the coordinate axes. Specifically, the j-th coordinate of the i-th point is σ_j(p_i), and if the range is on the j-th permutation, then the slab is orthogonal to the j-th coordinate axis.

Another useful construction from an uncertain point set P is the set P_cert of all locations where any point in P might occur. Specifically, for every uncertain point set P we can define the corresponding certain point set P_cert = ∪_{i ∈ [n]} p_i = ∪_{j ∈ [k]} P_j = ∪_{i ∈ [n], j ∈ [k]} p_{i,j}. We can also extend any coloring χ on P to a coloring on P_cert by letting χ_cert(p_{i,j}) = χ(p_i), for i ∈ [n] and j ∈ [k]. Now we can naturally define the discrepancy induced on P_cert by any coloring χ of P as disc_{χ_cert}(P_cert, A) = max_{r ∈ A} |Σ_{p_{i,j} ∈ P_cert ∩ r} χ_cert(p_{i,j})|.

From low-discrepancy to ε-samples. There is a well-studied relationship between range spaces that admit low-discrepancy colorings and creating ε-samples of those range spaces [13, 35, 12, 8]. The key relationship states that if disc(X, A) = γ log^ω(n), then there exists an ε-sample of (P, A) of size O((γ/ε) log^ω(γ/ε)) [38], for values γ, ω independent of n and ε. Construct the coloring, and with equal probability discard either all points colored −1 or all points colored +1. This roughly halves the point set size, and also implies zero over-count in expectation for any fixed range. Repeat this coloring and reduction of points until the desired size is achieved.
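For intuition, on one-dimensional interval ranges a classic low-discrepancy coloring simply alternates colors in sorted order, which gives discrepancy at most 1 for every interval; the halve-and-repeat scheme can then be sketched as follows. The function names are ours and this illustrates only the certain-data case, with the alternating coloring standing in for whichever low-discrepancy coloring the range space at hand requires.

```python
import random

def alt_coloring_1d(points):
    """Color distinct points alternately in sorted order; any interval then
    contains an alternating +1/-1 run, so its discrepancy is at most 1."""
    return {p: (+1 if i % 2 == 0 else -1)
            for i, p in enumerate(sorted(points))}

def halve(points):
    """One halving step: keep one color class, chosen at random so the
    over/under-count is zero in expectation for any fixed range."""
    chi = alt_coloring_1d(points)
    keep = random.choice((-1, +1))
    return [p for p in points if chi[p] == keep]

def eps_sample_1d(points, size):
    """Repeat the coloring-and-halving until at most `size` points remain."""
    while len(points) > size:
        points = halve(points)
    return points

S = eps_sample_1d(list(range(64)), 8)   # 64 -> 32 -> 16 -> 8 points
```

After three halvings the surviving points form an evenly spread subset, so prefix-interval densities of the sample stay close to those of the full set.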
This can be done efficiently in a distributed manner through a merge-reduce framework [13]. The take-away is that a method for a low-discrepancy coloring directly implies a method to create an ε-sample, where the counting error is zero in expectation for any fixed range. We describe and extend these results in much more detail in Appendix A.

3 ε-RE Coresets

First we will analyze ε-RE coresets through the P_cert interpretation of an uncertain point set P. The canonical traversals P_j of P will also be useful. In Section 3.2 we will relate these results to a form of discrepancy.

Lemma 3.1. T ⊂ P is an ε-RE coreset for (P, A) if and only if T_cert ⊂ P_cert is an ε-sample for (P_cert, A).

Proof. First note that since Pr[p_i = p_{i,j}] = 1/k for all i, j, by linearity of expectation we have E_{Q ⋐ P}[|Q ∩ r|] = Σ_{i=1}^n E[|p_i ∩ r|] = (1/k)|P_cert ∩ r|. Now, direct computation gives us:

  | |P_cert ∩ r|/|P_cert| − |T_cert ∩ r|/|T_cert| | = | |P_cert ∩ r|/(k|P|) − |T_cert ∩ r|/(k|T|) | = |E_r(P) − E_r(T)| ≤ ε.

The next implication enables us to determine an ε-RE coreset on P from ε-samples on each P_j. Recall that P_j is the j-th canonical traversal of P for j ∈ [k], and T_j is defined similarly for a subset T ⊂ P.

Lemma 3.2.
Given a range space (P_cert, A), if we have T ⊂ P such that T_j is an ε-sample of (P_j, A) for all j ∈ [k], then T is an ε-RE coreset for (P, A).

Proof. Consider an arbitrary range r ∈ A, and compute |E_r(P) − E_r(T)| directly. Recalling that E_r(P) = |P_cert ∩ r|/|P_cert| and observing that |P_cert| = k|P|, we get:

  |E_r(P) − E_r(T)| = | Σ_{j=1}^k |P_j ∩ r|/(k|P|) − Σ_{j=1}^k |T_j ∩ r|/(k|T|) | ≤ (1/k) Σ_{j=1}^k | |P_j ∩ r|/|P| − |T_j ∩ r|/|T| | ≤ (1/k)(kε) = ε.

We now show that simple random sampling gives us an ε-RE coreset of P.

Theorem 3.1.
For an uncertain point set P and range space (P_cert, A) with VC-dimension ν, a random sample T ⊂ P of size O((1/ε^2)(ν + log(k/δ))) is an ε-RE coreset of (P, A) with probability at least 1 − δ.

Proof. A random sample T_j of size O((1/ε^2)(ν + log(1/δ′))) is an ε-sample of any (P_j, A) with probability at least 1 − δ′ [31]. Now assuming T ⊂ P resulted from a random sample on P, it induces the k disjoint canonical traversals T_j of T, such that T_j ⊂ P_j and |T_j| = O((1/ε^2)(ν + log(1/δ′))) for j ∈ [k]. Each T_j is an ε-sample of (P_j, A) for any single j ∈ [k] with probability at least 1 − δ′. Following Lemma 3.2 and using a union bound, we conclude that T ⊂ P is an ε-RE coreset for the uncertain point set P with probability at least 1 − kδ′. Setting δ′ = δ/k proves the theorem.

Next we extend the well-studied relationship between geometric discrepancy and ε-samples on certain data towards ε-RE coresets on uncertain data. We first require precise and slightly non-standard definitions. We introduce a new type of discrepancy based on the expected value of uncertain points, called RE-discrepancy. Let P^+_χ and P^−_χ denote the sets of uncertain points from P colored +1 or −1, respectively, by χ. Then RE-disc_χ(P, r) = |P| · |E_r(P^+_χ) − E_r(P)| for any r ∈ A. The usual extensions then follow: RE-disc_χ(P, A) = max_{r ∈ A} RE-disc_χ(P, r) and RE-disc(P, A) = min_χ RE-disc_χ(P, A). Note that (P, A) is technically not a range space, since A defines subsets of P_cert in this case, not of P.

Lemma 3.3.
Consider a coloring χ : P → {−1, +1} such that RE-disc_χ(P, A) = γ log^ω(n) and |P^+_χ| = n/2. Then the set P^+_χ is an ε-RE coreset of (P, A) with ε = (γ/n) log^ω(n). Furthermore, if a subset T ⊂ P has size n/2 and is a ((γ/n) log^ω(n))-RE coreset, then it defines a coloring χ (where χ(p_i) = +1 for p_i ∈ T) that has RE-disc_χ(P, A) = γ log^ω(n).

Proof. We prove the second statement; the first follows symmetrically. We refer to the subset T as P^+_χ. Let r = arg max_{r′ ∈ A} |E_{r′}(P) − E_{r′}(P^+_χ)|. This implies (γ/n) log^ω n ≥ |E_r(P) − E_r(P^+_χ)| = (1/n) RE-disc_χ(P, r).

We can now recast RE-discrepancy as discrepancy on P_cert. From Lemma 3.1, | |P_cert ∩ r|/(k|P|) − |T_cert ∩ r|/(k|T|) | = |E_r(P) − E_r(T)|, and after some basic substitutions we obtain the following.

Lemma 3.4. RE-disc_χ(P, A) = (1/k) disc_{χ_cert}(P_cert, A).

This does not immediately solve ε-RE coresets by standard discrepancy techniques on P_cert, because we need to find a coloring χ on P; a coloring χ_cert on P_cert may not be consistent across all p_{i,j} ∈ p_i. The following lemma allows us to reduce this to a problem of coloring each canonical traversal P_j.

Lemma 3.5. RE-disc_χ(P, A) ≤ max_j disc_{χ_cert}(P_j, A).

Proof.
For any r ∈ A and any coloring χ (and the corresponding χ_cert), we can write P_cert as a union of disjoint traversals P_j to obtain

  disc_{χ_cert}(P_cert, r) = | Σ_{j=1}^k Σ_{p_{i,j} ∈ P_j ∩ r} χ_cert(p_{i,j}) | ≤ Σ_{j=1}^k | Σ_{p_{i,j} ∈ P_j ∩ r} χ_cert(p_{i,j}) | ≤ Σ_{j=1}^k disc_{χ_cert}(P_j, r) ≤ k max_j disc_{χ_cert}(P_j, r).

Since this holds for every r ∈ A, using Lemma 3.4 we have RE-disc_χ(P, A) = (1/k) disc_{χ_cert}(P_cert, A) ≤ max_j disc_{χ_cert}(P_j, A).

ε-RE Coresets in R^1

Lemma 3.6.
Consider an uncertain point set P with P_cert ⊂ R^1 and the range space (P_cert, I^+) with ranges defined by one-sided intervals of the form (−∞, x]. Then RE-disc(P, I^+) = O(√k log n).

Proof. Spencer et al. [41] show that disc((P, Σ_k), I_k) is O(√k log n). Since we obtain Σ_k from the canonical traversals P_1 through P_k, by definition this results in an upper bound on the discrepancy over all P_j (it bounds the max). Lemma 3.5 then gives us the bound on RE-disc(P, I^+).

As discussed in Appendix A, the low RE-discrepancy coloring can be iterated in a merge-reduce framework as developed by Chazelle and Matousek [13]. With Theorem A.2 we can prove the following theorem.

Theorem 3.2.
Consider an uncertain point set P and range space (P_cert, I^+) with ranges defined by one-sided intervals of the form (−∞, x]. Then an ε-RE coreset can be constructed of size O((√k/ε) log(k/ε)).

Since expected value is linear, E_{[y,x]}(P) = E_{(−∞,x]}(P) − E_{(−∞,y)}(P) for y < x, so the above result also holds (with the error at most doubled) for the family I of two-sided intervals.

ε-RE Coresets for Rectangles in R^d

Here let P be a set of n uncertain points where each possible location of a point is p_{i,j} ∈ R^d. We consider a range space (P_cert, R_d) defined by d-dimensional axis-aligned rectangles. Each canonical traversal P_j for j ∈ [k] no longer implies a unique permutation on the points (for d > 1). But for any rectangle r ∈ R_d, we can represent r ∩ P_j as the disjoint union of the points of P_j contained in intervals on a predefined set of (1 + log n)^{d−1} permutations [9]. Spencer et al. [41] showed there exists a coloring χ such that max_j disc_χ(P_j, R_d) = O(D_ℓ(n) log^{d−1} n), where ℓ = (1 + log n)^{d−1} is the number of defined permutations and D_ℓ(n) is the discrepancy of ℓ permutations over n points with ranges defined as intervals on each permutation. Furthermore, they showed D_ℓ(n) = O(√ℓ log n).

To get the RE-discrepancy bound for P_cert = ∪_{j=1}^k P_j, we first decompose P_cert into the k point sets P_j of size n. We then obtain (1 + log n)^{d−1} permutations over the points in each P_j, and hence a family Σ_ℓ of ℓ = k(1 + log n)^{d−1} permutations over all the P_j. The bound D_ℓ(n) = O(√ℓ log n) yields disc((P, Σ_ℓ), I_ℓ) = O(√k log^{(d+1)/2} n). Now each set P_j ∩ r, for r ∈ R_d, can be written as the disjoint union of O(log^{d−1} n) intervals of Σ_ℓ. Summing over these intervals, we get that disc(P_j, R_d) = O(√k log^{(3d−1)/2} n) for each j. By Lemma 3.5 this bounds the RE-discrepancy as well.
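The relations in Lemmas 3.4 and 3.5 can be sanity-checked by brute force on small one-dimensional instances with one-sided interval ranges; the helper names below are ours.

```python
def re_disc(P, chi):
    """RE-disc_chi(P, I^+) = (1/k) * disc_{chi_cert}(P_cert, I^+) (Lemma 3.4).

    P[i] is the k-tuple of locations of p_i; chi[i] is its color (+1 or -1),
    which chi_cert copies to every location of p_i.
    """
    k = len(P[0])
    xs = sorted(x for p in P for x in p)        # candidate interval endpoints
    best = 0.0
    for x in xs:                                # range (-inf, x]
        s = sum(c for p, c in zip(P, chi) for loc in p if loc <= x)
        best = max(best, abs(s) / k)
    return best

def max_traversal_disc(P, chi):
    """max_j disc_{chi_cert}(P_j, I^+), the right-hand side of Lemma 3.5."""
    k = len(P[0])
    xs = sorted(x for p in P for x in p)
    best = 0
    for j in range(k):                          # j-th canonical traversal
        for x in xs:
            s = sum(c for p, c in zip(P, chi) if p[j] <= x)
            best = max(best, abs(s))
    return best

P = [(1.0, 6.0), (2.0, 4.0), (3.0, 5.0), (7.0, 8.0)]   # n = 4, k = 2
chi = [+1, -1, +1, -1]
assert re_disc(P, chi) <= max_traversal_disc(P, chi)   # Lemma 3.5 holds
```

On this instance both sides equal 1, so the inequality of Lemma 3.5 is tight here.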
Finally, we can again apply the merge-reduce framework of Chazelle and Matousek [13] (via Theorem A.2) to achieve an ε-RE coreset.

Theorem 3.3.
Consider an uncertain point set P and range space (P_cert, R_d) (for d > 1) with ranges defined by axis-aligned rectangles in R^d. Then an ε-RE coreset can be constructed of size O((√k/ε) log^{(3d−1)/2}(k/ε)).

4 ε-RC Coresets

Recall that an ε-RC coreset T of a set P of n uncertain points satisfies, for all queries r ∈ A and all thresholds τ ∈ [0, 1], that |G_{P,r}(τ) − G_{T,r}(τ)| ≤ ε, where G_{P,r}(τ) represents the fraction of points from P that are in range r with probability at least τ.

In this setting, given a range r ∈ A and a threshold τ ∈ [0, 1], we can let the pair (r, τ) ∈ A × [0, 1] define a range R_{r,τ} such that each p_i ∈ P is either in or not in R_{r,τ}. Let (P, A × [0, 1]) denote this range space. If (P_cert, A) has VC-dimension ν, then (P, A × [0, 1]) has VC-dimension O(ν + 1); see Corollary 5.23 in [21]. This implies that random sampling works to construct ε-RC coresets.

Theorem 4.1.
For an uncertain point set P and range space (P_cert, A) with VC-dimension ν, a random sample T ⊂ P of size O((1/ε^2)(ν + log(1/δ))) is an ε-RC coreset of (P, A) with probability at least 1 − δ.

Yang et al. [46] propose a similar result, without proof.

ε-RC Coresets in R^1

Constructing ε-RC coresets when the family of ranges I^+ consists of one-sided, one-dimensional intervals is much easier than in the other cases. It relies heavily on the ordered structure of the canonical permutations, and thus the discrepancy results do not need to decompose and then re-compose the ranges.

Lemma 4.1.
A point p_i ∈ P is in range r ∈ I^+ with probability at least τ = t/k if and only if p_{i,t} ∈ r ∩ P_t.

Proof. By the canonical permutations, since for all i ∈ [n] we require p_{i,j} < p_{i,j+1}, if p_{i,t} ∈ r then p_{i,j} ∈ r for all j ≤ t. Similarly, if p_{i,t} ∉ r, then p_{i,j} ∉ r for all j ≥ t.

Thus when each canonical permutation is represented up to an error ε by a coreset T, then each threshold τ is represented within ε. Hence, as with ε-RE coresets, we invoke the low-discrepancy coloring of Bohus [9] and Spencer et al. [41], and then iterate it (invoking Theorem A.1) to achieve a small ε-RC coreset.

Theorem 4.2.
For an uncertain point set P and range space (P_cert, I^+) with ranges defined by one-sided intervals of the form (−∞, a], an ε-RC coreset of (P, I^+) can be constructed of size O((√k/ε) log(k/ε)).

Extending Lemma 4.1 from one-sided intervals of the form (−∞, a] ∈ I^+ to intervals of the form [a, b] ∈ I turns out to be non-trivial. It is not true that G_{P,[a,b]}(τ) = G_{P,(−∞,b]}(τ) − G_{P,(−∞,a)}(τ), hence the two queries cannot simply be subtracted. Also, while the set of points corresponding to the query G_{P,(−∞,a]}(t/k) forms a contiguous interval in the t-th permutation we construct in Lemma 4.1, the same need not be true of the points corresponding to G_{P,[a,b]}(t/k). This is a similar difficulty in spirit to that noted by Kaplan et al. [29] for the problem of counting the number of points of distinct colors in a box, where one cannot take a naive decomposition and add up the numbers returned by each subproblem.

Figure 2: Uncertain point p_i queried by range [a, b]. Lifting shown to ¯p_i along dimensions 1 and 2 (left) and along dimensions 2 and 3 (right).

We now give a construction to solve this two-sided problem for uncertain points in R^1, inspired by that of Kaplan et al. [29], but we require specifying a fixed value of t ∈ [k]. Given an uncertain point p_i ∈ P, assume w.l.o.g. that p_{i,j} < p_{i,j+1}. Also pretend there is a point p_{i,k+1} = η, where η is larger than any b ∈ R from a query range [a, b] (essentially η = ∞). Given a range [a, b], we consider the right-most set of t locations of p_i (here {p_{i,j−t+1}, . . . , p_{i,j}}) that are in the range. This satisfies (i) p_{i,j−t+1} ≥ a, (ii) p_{i,j} ≤ b, and (iii), to ensure that it is the right-most such set, p_{i,j+1} > b.

To satisfy these three constraints we re-pose the problem in R^3, designating each contiguous set of t possible locations of p_i as a single point. So for t ≤ j ≤ k, we map p_{i,j} to ¯p^t_{i,j} = (p_{i,j−t+1}, p_{i,j}, p_{i,j+1}). Correspondingly, a range r = [a, b] is mapped to a range ¯r^t = [a, ∞) × (−∞, b] × (b, ∞); see Figure 2. Let ¯p^t_i denote the set of all ¯p^t_{i,j}, and let ¯P^t represent ∪_i ¯p^t_i.

Lemma 4.2. p_i is in interval r = [a, b] with threshold at least t/k if and only if |¯p^t_i ∩ ¯r^t| ≥ 1. Furthermore, no two points p_{i,j}, p_{i,j′} ∈ p_i can map to points ¯p^t_{i,j}, ¯p^t_{i,j′} such that both are in a range ¯r^t.

Proof.
Since p_{i,j} < p_{i,j+1}, if p_{i,j−t+1} ≥ a then p_{i,ℓ} ≥ a for all ℓ ≥ j − t + 1, and similarly, if p_{i,j} ≤ b then p_{i,ℓ} ≤ b for all ℓ ≤ j. Hence if ¯p^t_{i,j} satisfies the constraints of the range ¯r^t in the first two coordinates, then the t points p_{i,j−t+1}, . . . , p_{i,j} are in the range [a, b]. Satisfying the constraint of ¯r^t in the third coordinate indicates that p_{i,j+1} ∉ [a, b]. There can be only one index j which satisfies the constraints of the last two coordinates, namely p_{i,j} ≤ b < p_{i,j+1}. And for any range which contains at least t possible locations, there must be exactly one set of t consecutive locations ending at this satisfying p_{i,j}.

Corollary 4.1.
Any uncertain point set P ⊂ R^1 of size n with range r = [a, b] has G_{P,r}(t/k) = |¯P^t ∩ ¯r^t|/n.

This presents an alternative view of each uncertain point in R^1 with k possible locations as an uncertain point in R^3 with k − t + 1 possible locations (since for now we only consider a single threshold τ = t/k). Where I represents the family of ranges defined by two-sided intervals, let ¯I be the corresponding family of ranges in R^3 of the form [a, ∞) × (−∞, b] × (b, ∞) for an interval [a, b] ∈ I. Under the assumption (valid under the lifting defined above) that each uncertain point can have at most one location fall in each range, we can now decompose the ranges, count the number of points that fall in each sub-range, and add the counts together. Using the techniques (described in detail in Section 3.4) of Bohus [9] and Spencer et al. [41], we can consider ℓ = (k − t)(1 + ⌈log n⌉) = O(k log n) permutations of ¯P^t_cert such that each range ¯r ∈ ¯I can be written as the points in a disjoint union of intervals from these permutations. To extend the low-discrepancy coloring to each of the k distinct values of the threshold t, there are k such liftings and h = k · ℓ = O(k^2 log n) permutations to consider. We can construct a coloring χ : P → {−1, +1} such that the intervals on each permutation have discrepancy O(√h log n) = O(k log^{3/2} n). Recall that for any fixed threshold t we only need to consider the corresponding ℓ permutations; hence the total discrepancy for any such range is at most the sum of the discrepancies from the corresponding ℓ = O(k log n) permutations, or O(k^2 log^{5/2} n). Finally, this low-discrepancy coloring can be iterated (via Theorem A.1) to achieve the following theorem.

Theorem 4.3.
Consider an uncertain point set P along with the ranges I of two-sided intervals. We can construct an ε-RC coreset T for (P, I) of size O((k²/ε) log²(k/ε)).

R^d. The approach for I can be further extended to R_d, axis-aligned rectangles in R^d. Again the key idea is to define a proxy point set ¯P such that |¯r ∩ ¯P| equals the number of uncertain points with at least t locations in r. This requires a suitable lifting map and decomposition of space to prevent over- or under-counting; we employ techniques from Kaplan et al. [29].

First we transform queries on axis-aligned rectangles in R^d to the semi-bounded case in R^{2d}. Denoting the x_i-coordinate of a point q as x_i(q), we double the coordinates of each point q = (x_1(q), …, x_ℓ(q), …, x_d(q)) to obtain the point ˜q = (−x_1(q), x_1(q), …, −x_ℓ(q), x_ℓ(q), …, −x_d(q), x_d(q)) in R^{2d}. Now answering a range counting query ∏_{i=1}^d [a_i, b_i] is equivalent to answering the query ∏_{i=1}^d ((−∞, −a_i] × (−∞, b_i]) on the lifted point set. Based on this reduction we can focus on queries of negative orthants of the form ∏_{i=1}^d (−∞, a_i], and represent each orthant by its apex a = (a_1, …, a_d) ∈ R^d as Q^−_a. Similarly, we define Q^+_a as the positive orthant of the form ∏_{i=1}^d [a_i, ∞) ⊆ R^d. For any point set A ⊂ R^d define U(A) = ∪_{a∈A} Q^+_a.

A tight orthant has a location of p_i ∈ P incident to every bounding facet. Let C_{i,t} be the set of all apexes of tight negative orthants that contain exactly t locations of p_i; see Figure 3(a). An important observation is that a query orthant Q^−_a contains p_i with threshold at least t if and only if it contains at least one point from C_{i,t}. Let Q^+_{i,t} = ∪_{c∈C_{i,t}} Q^+_c be the locus of all apexes of negative orthant queries that contain at least t locations of p_i; see Figure 3(b). Notice that Q^+_{i,t} = U(C_{i,t}).

Lemma 4.3.
For any point set p_i ⊂ R^d of k points and some threshold 1 ≤ t ≤ k, we can decompose U(C_{i,t}) into f(k) = O(k^d) pairwise disjoint boxes B(C_{i,t}).

Proof. Let M(A) be the set of maximal empty negative orthants for a point set A, such that each m ∈ M(A) is also bounded in the positive direction along the first coordinate axis. Kaplan et al. [29] show (within Lemma 3.1) that |M(A)| = |B(A)| and provide a specific construction of the boxes B. Thus we only need to bound |M(C_{i,t})| to complete the proof; see M(C_{i,t}) in Figure 3(c). We note that each coordinate of each c ∈ C_{i,t} must be the same as that of some p_{i,j} ∈ p_i. Thus for each coordinate, among all c ∈ C_{i,t}, there are at most k values. And each maximal empty tight orthant m ∈ M(C_{i,t}) is uniquely defined by the d coordinates along the axis directions its facets are orthogonal to. Thus |M(C_{i,t})| ≤ k^d, completing the proof.

Note that since we are working in the lifted space R^{2d}, this corresponds to U(C_{i,t}) being decomposed into f(k) = O(k^{2d}) pairwise disjoint boxes, where d is the dimensionality of the original point set.

Figure 3: Illustration of an uncertain point p_i ∈ R^2 with k = 8 and t = 3. (a) All tight negative orthants containing exactly t = 3 locations of p_i, whose apexes form C_{i,t}. (b) U(C_{i,t}) is shaded, along with a query Q^−_a. (c) M(C_{i,t}), the maximal empty negative orthants of C_{i,t} that are also bounded in the x_1-direction.

Lemma 4.4.
For negative orthant queries Q^−_a with apex a on an uncertain point set P, a point p_i ∈ P is in Q^−_a with probability at least t/k if and only if a is in some box in B(C_{i,t}), and a lies in at most one box from B(C_{i,t}).

Proof. The query orthant Q^−_a contains point p_i with threshold at least t if and only if Q^−_a contains at least one point from C_{i,t}, and this happens exactly when a ∈ U(C_{i,t}). Since the union of the constructed boxes in B(C_{i,t}) is exactly U(C_{i,t}) and the boxes are disjoint, the result follows.

Corollary 4.2.
The number of uncertain points from P in a query range Q^−_a with probability at least t/k is exactly the number of boxes in ∪_{i=1}^n B(C_{i,t}) that contain a.

Thus, for a set of boxes representing P, we need to perform stabbing-count queries with apex a, and to show a low-discrepancy coloring of boxes. We do a second lifting by transforming each point a ∈ R^{2d} to a semi-bounded box ¯a = ∏_{i=1}^{2d} ((−∞, a_i] × [a_i, ∞)) and each box b ∈ R^{2d} of the form ∏_{i=1}^{2d} [x_i, y_i] to a point ¯b = (x_1, y_1, …, x_ℓ, y_ℓ, …, x_{2d}, y_{2d}) in R^{4d}. It is easy to verify that a ∈ b if and only if ¯b ∈ ¯a. Since this is our second doubling of dimension, we are now dealing with points in R^{4d}. Lifting P to ¯P in R^{4d} now presents an alternative view of each uncertain point p_i ∈ P as an uncertain point ¯p_i in R^{4d} with g_k = O(k^{2d}) possible locations, with the query boxes represented as ¯R in R^{4d}.

We now proceed similarly to the proof of Theorem 4.3. For a fixed threshold t, obtain ℓ = g_k · (1 + ⌈log n⌉)^{4d−1} disjoint permutations of ¯P^t_cert such that each range ¯r ∈ ¯R can be written as the points in a disjoint union of intervals from these permutations. For the k distinct values of t, there are k such liftings and h = O(k · g_k · log^{4d−1} n) such permutations we need to consider, and we can construct a coloring χ : P → {−1, +1} so that the intervals on each permutation have discrepancy O(√(h log n)) = O(k^{d+1/2} log^{2d} n). Hence for any such range and specific threshold t, the total discrepancy is the sum of the discrepancies from all corresponding ℓ = O(g_k log^{4d−1} n) permutations, or O(k^{3d+1/2} log^{6d−1} n). By applying the iterated low-discrepancy coloring (Theorem A.1), we achieve the following result.

Theorem 4.4.
Consider an uncertain point set P and the range space (P_cert, R_d) with ranges defined by axis-aligned rectangles in R^d. Then an ε-RC coreset can be constructed of size O((k^{3d+1/2}/ε) log^{6d−1}(k/ε)).

In this section, given an uncertain point set P and its ε-RE coreset T, we want to determine values ε′ and α so that T is an (ε′, α)-RQ coreset. That is, for any r ∈ A and threshold τ ∈ [0, 1] there exists a γ ∈ [τ − α, τ + α] such that

| Pr_{Q⋐P}[ |Q ∩ r|/|Q| ≤ τ ] − Pr_{S⋐T}[ |S ∩ r|/|S| ≤ γ ] | ≤ ε′.

At a high level, our tack will be to realize that both |Q ∩ r| and |S ∩ r| behave like Binomial random variables. By T being an ε-RE coreset of P, after normalizing, its mean is at most ε-far from that of P. Furthermore, Binomial random variables tend to concentrate around their means, and more so for those with more trials. This allows us to say |S ∩ r|/|S| is either α-close to the expected value of |Q ∩ r|/|Q| or is ε′-close to 0 or 1. Since |Q ∩ r|/|Q| has the same behavior, but with more concentration, we can bound their distance by the α and ε′ bounds noted before. We now work out the details.

Theorem 5.1. If T is an ε-RE coreset of P for ε ∈ (0, 1/2), then T is an (ε′, α)-RQ coreset for P for ε′, α ∈ (0, 1/2) satisfying α ≥ ε + √((1/(2|T|)) ln(2/ε′)).

Proof. We start by examining a Chernoff-Hoeffding bound on a set of independent random variables X_i so that each X_i ∈ [a_i, b_i] with ∆_i = b_i − a_i. Then for some parameter β ∈ (0, Σ_i ∆_i/2),

Pr[ | Σ_i X_i − E[Σ_i X_i] | ≥ β ] ≤ 2 exp(−2β²/Σ_i ∆_i²).

Consider any r ∈ A.
We now identify each random variable X_i = 1(q_i ∈ r) (that is, 1 if q_i ∈ r and 0 otherwise), where q_i is the random instantiation of some p_i ∈ T. So X_i ∈ {0, 1} and ∆_i = 1. Thus, equating |S ∩ r| = Σ_i X_i,

Pr_{S⋐T}[ | |S ∩ r| − E[|S ∩ r|] | ≥ β|S| ] ≤ 2 exp(−2(β|S|)²/Σ_i ∆_i²) = 2 exp(−2β²|S|) ≤ ε′.

Thus by solving for β (and equating |S| = |T|),

Pr_{S⋐T}[ | |S ∩ r|/|S| − E[|S ∩ r|/|S|] | ≥ √((1/(2|T|)) ln(2/ε′)) ] ≤ ε′.

Now by T being an ε-RE coreset of P,

| E_{S⋐T}[ |S ∩ r|/|S| ] − E_{Q⋐P}[ |Q ∩ r|/|Q| ] | ≤ ε.

Combining these two, we have

Pr_{S⋐T}[ | |S ∩ r|/|S| − E_{Q⋐P}[ |Q ∩ r|/|Q| ] | ≥ α ] ≤ ε′

for α = ε + √((1/(2|T|)) ln(2/ε′)). Combining these statements, for any x ≤ M − α ≤ M − α′ (where M = E_{Q⋐P}[ |Q ∩ r|/|Q| ] and α′ is the analogous, smaller deviation bound for P) we have ε′ > F_{T,r}(x) ≥ 0 and ε′ > F_{P,r}(x) ≥ 0 (and symmetrically for x ≥ M + α ≥ M + α′). It follows that F_{T,r} is an (ε′, α)-quantization of F_{P,r}. Since this holds for any r ∈ A, by T being an ε-RE coreset of P, it follows that T is also an (ε′, α)-RQ coreset of P.

We can now combine this result with specific results for ε-RE coresets to get size bounds for (ε, α)-RQ coresets. To achieve the bounds below we set ε = ε′.

Corollary 5.1.
For an uncertain point set P with range space (P_cert, A), there exists an (ε, ε + √((1/(2|T|)) ln(2/ε)))-RQ coreset of (P, A) of size |T| =
• O((1/ε²)(ν + log(k/δ))) when A has VC-dimension ν, with probability 1 − δ (Theorem 3.1),
• O((√k/ε) log(k/ε)) when A = I (Theorem 3.2), and
• O((√k/ε) log^{2d−1}(k/ε)) when A = R_d (Theorem 3.3).

Finally, we discuss why the α term in the (ε′, α)-RQ coreset T is needed. Recall from Section 3 that approximating the value of E_{Q⋐P}[ |Q ∩ r|/|Q| ] with E_{S⋐T}[ |S ∩ r|/|S| ] for all r corresponds to a low-discrepancy sample of P_cert. Discrepancy error immediately implies we will have at least an ε horizontal shift between the two distributions and their means, unless we could obtain a zero-discrepancy sample of P_cert. Note this ε horizontal error corresponds to the α term in an (ε′, α)-RQ coreset. When P is very large, then due to the central limit theorem, F_{P,r} will grow very sharply around E_{Q⋐P}[ |Q ∩ r|/|Q| ]. In the worst case F_{T,r} may be
Ω(1) vertically away from F_{P,r} on either side of E_{S⋐T}[ |S ∩ r|/|S| ], so no reasonable amount of ε′ vertical tolerance will make up for this gap. On the other hand, the ε′ vertical component is necessary since for very small probability events (that is, for a fixed range r and a small threshold τ) on P, we may need a much smaller value of τ (smaller by Ω(1)) to get the same probability on T, requiring a very large horizontal shift. But since it is a very small probability event, only a small vertical ε′ shift is required. The main result of this section, then, is showing that there exist pairs (ε′, α) which are both small.

This paper defines and provides the first results for coresets on uncertain data. These can be essential tools for monitoring a subset of a large noisy data set, as a way to approximately monitor the full uncertainty. There are many future directions on this topic, in addition to tightening the provided bounds, especially for other range spaces. Can we remove the dependence on k without random sampling? Can coresets be constructed over uncertain data for other queries such as minimum enclosing ball, clustering, and extents?

References

[1] Agarwal, P. K., Cheng, S.-W., Tao, Y., and Yi, K. Indexing uncertain data. In PODS (2009).
[2] Agarwal, P. K., Har-Peled, S., and Varadarajan, K. Geometric approximations via coresets. Current Trends in Combinatorial and Computational Geometry (E. Welzl, ed.) (2007).
[3] Agarwal, P. K., Har-Peled, S., and Varadarajan, K. R. Approximating extent measures of points. Journal of the ACM 51, 4 (2004), 606–635.
[4] Agrawal, P., Benjelloun, O., Sarma, A. D., Hayworth, C., Nabar, S., Sugihara, T., and Widom, J. Trio: A system for data, uncertainty, and lineage. In PODS (2006).
[5] Anthony, M., and Bartlett, P. L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[6] Bădoiu, M., Har-Peled, S., and Indyk, P. Approximate clustering via core-sets. In STOC (2002).
[7] Bandyopadhyay, D., and Snoeyink, J. Almost-Delaunay simplices: Nearest neighbor relations for imprecise points. In SODA (2004).
[8] Beck, J. Roth's estimate of the discrepancy of integer sequences is nearly sharp. Combinatorica 1 (1981), 319–325.
[9] Bohus, G. On the discrepancy of 3 permutations. Random Structures and Algorithms 1, 2 (1990), 215–220.
[10] Bădoiu, M., and Clarkson, K. Smaller core-sets for balls. In SODA (2003).
[11] Burdick, D., Deshpande, P. M., Jayram, T., Ramakrishnan, R., and Vaithyanathan, S. OLAP over uncertain and imprecise data. In VLDB (2005).
[12] Chazelle, B. The Discrepancy Method. Cambridge, 2000.
[13] Chazelle, B., and Matousek, J. On linear-time deterministic algorithms for optimization problems in fixed dimensions. Journal of Algorithms 21 (1996), 579–597.
[14] Chazelle, B., and Welzl, E. Quasi-optimal range searching in spaces of finite VC-dimension. Discrete and Computational Geometry 4 (1989), 467–489.
[15] Cheng, R., Xia, Y., Prabhakar, S., Shah, R., and Vitter, J. S. Efficient indexing methods for probabilistic threshold queries over uncertain data. In VLDB (2004).
[16] Cormode, G., and Garofalakis, M. Histograms and wavelets on probabilistic data. In ICDE (2009).
[17] Cormode, G., Li, F., and Yi, K. Semantics of ranking queries for probabilistic data and expected ranks. In ICDE (2009).
[18] Dalvi, N., and Suciu, D. Efficient query evaluation on probabilistic databases. In VLDB (2004).
[19] Guibas, L. J., Salesin, D., and Stolfi, J. Epsilon geometry: building robust algorithms from imprecise computations. In SoCG (1989).
[20] Guibas, L. J., Salesin, D., and Stolfi, J. Constructing strongly convex approximate hulls with inaccurate primitives. Algorithmica 9 (1993), 534–560.
[21] Har-Peled, S. Geometric Approximation Algorithms. American Mathematical Society, 2011.
[22] Held, M., and Mitchell, J. S. B. Triangulating input-constrained planar point sets. Information Processing Letters 109, 1 (2008).
[23] Jayram, T., Kale, S., and Vee, E. Efficient aggregation algorithms for probabilistic data. In SODA (2007).
[24] Jayram, T., McGregor, A., Muthukrishnan, S., and Vee, E. Estimating statistical aggregates on probabilistic data streams. In PODS (2007).
[25] Jørgensen, A. G., Löffler, M., and Phillips, J. M. Geometric computation on indecisive and uncertain points. arXiv:1205.0273.
[26] Jørgensen, A. G., Löffler, M., and Phillips, J. M. Geometric computation on indecisive points. In WADS (2011).
[27] Kamousi, P., Chan, T. M., and Suri, S. The stochastic closest pair problem and nearest neighbor search. In WADS (2011).
[28] Kamousi, P., Chan, T. M., and Suri, S. Stochastic minimum spanning trees in Euclidean spaces. In SoCG (2011).
[29] Kaplan, H., Rubin, N., Sharir, M., and Verbin, E. Counting colors in boxes. In SODA (2007).
[30] Kruger, H. Basic measures for imprecise point sets in R^d. Master's thesis, Utrecht University, 2008.
[31] Li, Y., Long, P. M., and Srinivasan, A. Improved bounds on the sample complexity of learning. Journal of Computer and System Sciences 62 (2001), 516–527.
[32] Löffler, M., and Phillips, J. Shape fitting on point sets with probability distributions. In ESA (2009).
[33] Löffler, M., and Snoeyink, J. Delaunay triangulations of imprecise points in linear time after preprocessing. In SoCG (2008).
[34] Matousek, J. Approximations and optimal geometric divide-and-conquer. Journal of Computer and System Sciences 50, 2 (1995), 203–208.
[35] Matoušek, J. Geometric Discrepancy. Springer, 1999.
[36] Nagai, T., and Tokura, N. Tight error bounds of geometric problems on convex objects with imprecise coordinates. In Jap. Conf. on Discrete and Comput. Geom. (2000), LNCS 2098, pp. 252–263.
[37] Ostrovsky-Berman, Y., and Joskowicz, L. Uncertainty envelopes. In (2005), pp. 175–178.
[38] Phillips, J. M. Algorithms for ε-approximations of terrains. In ICALP (2008).
[39] Phillips, J. M. Small and Stable Descriptors of Distributions for Geometric Statistical Problems. PhD thesis, Duke University, 2009.
[40] Sarma, A. D., Benjelloun, O., Halevy, A., Nabar, S., and Widom, J. Representing uncertain data: models, properties, and algorithms. The VLDB Journal 18, 5 (2009), 989–1019.
[41] Spencer, J., Srinivasan, A., and Tetali, P. Discrepancy of permutation families. Unpublished manuscript, 2001.
[42] Tao, Y., Cheng, R., Xiao, X., Ngai, W. K., Kao, B., and Prabhakar, S. Indexing multi-dimensional uncertain data with arbitrary probability density functions. In VLDB (2005).
[43] van der Merwe, R., Doucet, A., de Freitas, N., and Wan, E. The unscented particle filter. In NIPS (2000), vol. 8, pp. 351–357.
[44] van Kreveld, M., and Löffler, M. Largest bounding box, smallest diameter, and related problems on imprecise points. Computational Geometry: Theory and Applications 43 (2010), 419–433.
[45] Vapnik, V., and Chervonenkis, A. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16 (1971), 264–280.
[46] Yang, S., Zhang, W., Zhang, Y., and Lin, X. Probabilistic threshold range aggregate query processing over uncertain data. Advances in Data and Web Management (2009), 51–62.
[47] Zhang, Y., Lin, X., Tao, Y., Zhang, W., and Wang, H. Efficient computation of range aggregates against uncertain location based queries. IEEE Transactions on Knowledge and Data Engineering 24 (2012), 1244–1258.
A Low Discrepancy to ε-Coreset

Mainly in the 1990s, Chazelle and Matousek [14, 13, 34, 35, 12] led the development of methods to convert a low-discrepancy coloring into a coreset that allows for approximate range queries. Here we summarize and generalize these results. We start by restating a result of Phillips [38, 39] which generalizes these results; here we state it a bit more specifically for our setting.
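As a concrete illustration of this coloring-to-coreset conversion, the following self-contained Python sketch (our illustration, not from the paper) performs one halving step: keep the points colored +1 and check that the relative count of every interval range changes by at most disc/n. A trivial alternating coloring on sorted 1-d points, which has interval discrepancy at most 1, stands in for a real low-discrepancy coloring algorithm.

```python
import itertools

def halve(points, coloring):
    """Keep the points colored +1; the kept half is a (disc/n)-sample,
    where disc is the coloring's discrepancy over the range family."""
    return [p for p, c in zip(points, coloring) if c == +1]

def interval_error(P, S, a, b):
    """|relative count in P minus relative count in S| for interval [a, b]."""
    in_P = sum(1 for p in P if a <= p <= b)
    in_S = sum(1 for s in S if a <= s <= b)
    return abs(in_P / len(P) - in_S / len(S))

# Toy data: on sorted points, alternating signs give interval discrepancy <= 1.
P = sorted(0.05 * i for i in range(40))
chi = [+1 if i % 2 == 0 else -1 for i in range(len(P))]
S = halve(P, chi)

# The bound disc/n = 1/40 holds over all intervals with endpoints in P.
worst = max(interval_error(P, S, a, b) for a, b in itertools.combinations(P, 2))
assert worst <= 1 / 40 + 1e-9
```

Iterating such halving steps, as the algorithm below does, is what reduces a large P to a small coreset while keeping the per-range error controlled.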
Theorem A.1 (Phillips [38, 39]). Consider a point set P of size n and a family of subsets A. Assume an O(n^β)-time algorithm to construct a coloring χ : P → {−1, +1} so that disc_χ(P, A) = O(γ log^ω n), where β, γ, and ω are constant algorithm parameters dependent on A, but not on P (or n). There exists an algorithm to construct an ε-sample of (P, A) of size g(ε, A) = O((γ/ε) log^ω(γ/ε)) in time O(n · g(ε, A)^{β−1}).

Note that we ignore non-exponential dependence on ω and β since in our setting they are data- and problem-independent constants. But we are more careful with the γ terms, since they depend on k, the number of locations of each uncertain point.

We restate the algorithm and analysis here for completeness, using g = g(ε, A) for shorthand. Divide P into n/g parts {¯P_1, ¯P_2, …, ¯P_{n/g}} of size 4(β+2)g. Assume this divides evenly and n/g is a power of two; otherwise pad P and adjust g by a constant. Until there is a single set, repeat the following two stages. In stage 1, for β+2 steps, pair up all remaining sets, and for each pair (e.g. P_i and P_j) construct a low-discrepancy coloring χ on P_i ∪ P_j and discard all points colored −1 (or +1, at random). In the (β+3)rd step pair up all sets, but do not construct a coloring and halve. That is, every epoch (β+3 steps) the sizes of the remaining sets double; otherwise they remain the same size. When a single set remains, stage 2 begins; it performs the color-halve part of the above procedure until disc(P, A) ≤ εn, as desired. We begin by analyzing the error of a single coloring.

Lemma A.1.
The set P⁺ = {p ∈ P | χ(p) = +1} is a (disc_χ(P, A)/n)-sample of (P, A).

Proof.

max_{R∈A} | |P ∩ R|/|P| − |P⁺ ∩ R|/|P⁺| | = max_{R∈A} | (|P ∩ R| − 2|P⁺ ∩ R|)/n | ≤ disc_χ(P, A)/n.

We also note two simple facts [12, 35]:
(S1) If Q_1 is an ε-sample of P_1 and Q_2 is an ε-sample of P_2, then Q_1 ∪ Q_2 is an ε-sample of P_1 ∪ P_2.
(S2) If Q is an ε_1-sample of P and S is an ε_2-sample of Q, then S is an (ε_1 + ε_2)-sample of P.

Note that (S1) (along with Lemma A.1) implies that arbitrarily decomposing P into n/g sets and constructing colorings of each achieves the same error bound as doing so on just one. And (S2) implies that chaining together rounds adds the error of each round. It follows that, if we ignore the (β+3)rd step in each epoch, then there is a single set remaining after log(n/g) steps. The error caused by each step is disc(g, A)/g, so the total error is log(n/g)(γ log^ω g)/g = ε. Solving for g yields g = O((γ/ε) log(nε/γ) log^ω(γ/ε)).

Thus, to achieve the result stated in the theorem, the (β+3)rd-step skip of a reduce needs to remove the log(nε/γ) term from the error. This works! After β+3 steps, the size of each set is 2g and the discrepancy error is γ log^ω(2g)/(2g). This is just more than half of what it was before, so the total error is now

Σ_{i=0}^{log(n/g)/(β+3)} (β+3) γ log^ω(2^i g)/(2^i g) = Θ(β(γ log^ω g)/g) = ε.

Solving for g yields g = O((βγ/ε) log^ω(1/ε)), as desired. Stage 2 can be shown not to asymptotically increase the error.

To achieve the runtime, we again start with the form of the algorithm without the halve-skip on every (β+3)rd step. Then the first step takes O((n/g) · g^β) time.
And each ith step takes O((n/2^{i−1}) g^{β−1}) time. Since each subsequent step takes half as much time, the runtime is dominated by the first step, taking O(n g^{β−1}) time. For the full algorithm, the first epoch (β+3 steps, including a skipped halve) takes O(n g^{β−1}) time, and the ith epoch takes O((n/2^{(β+2)i})(2^i g)^{β−1}) = O(n g^{β−1}/2^{3i}) time. Thus the time is still dominated by the first epoch. Again, stage 2 can be shown not to affect this runtime; the total runtime bound is achieved as desired, completing the proof.

Finally, we state a useful corollary about the expected error being 0. This holds specifically when we choose to discard the set P⁺ or P⁻ = {p ∈ P | χ(p) = −1} at random on each halving.

Corollary A.1.
The expected error for any range R ∈ A on the ε-sample T created by Theorem A.1 is

E[ |R ∩ P|/|P| − |T ∩ R|/|T| ] = 0.

Note that there is no absolute value taken inside E[·], so technically this measures the expected undercount.

RE-discrepancy.
We are also interested in achieving these same results for RE-discrepancy. To this end, the algorithms are identical. Lemma 3.3 replaces Lemma A.1. (S1) and (S2) still hold. Nothing else about the analysis depends on properties of disc or RE-disc, so Theorem A.1 can be restated for RE-discrepancy.

Theorem A.2.
Consider an uncertain point set P of size n and a family of subsets A of P_cert. Assume an O(n^β)-time algorithm to construct a coloring χ : P → {−1, +1} so that RE-disc_χ(P, A) = O(γ log^ω n), where β, γ, and ω are constant algorithm parameters dependent on A, but not on P (or n). There exists an algorithm to construct an ε-RE coreset of (P, A) of size g(ε, A) = O((γ/ε) log^ω(γ/ε)) in time O(n · g(ε, A)^{β−1}).
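The merge-reduce structure shared by Theorems A.1 and A.2 can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: the function names are ours, we omit the (β+3)rd skipped halving and stage 2 (so it corresponds to the basic log(n/g)-round analysis rather than the sharpened bound), and a trivial alternating coloring on sorted 1-d points stands in for the assumed O(n^β)-time low-discrepancy coloring algorithm.

```python
import random

def low_disc_coloring(points):
    """Stand-in low-discrepancy coloring for 1-d intervals: alternate signs
    along sorted order (interval discrepancy at most 1). The random choice
    of which side gets +1 mirrors the randomization of Corollary A.1."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    sign = +1 if random.random() < 0.5 else -1
    chi = [0] * len(points)
    for rank, i in enumerate(order):
        chi[i] = sign if rank % 2 == 0 else -sign
    return chi

def halve(points):
    """One color-halve step: color, then keep the +1 class."""
    chi = low_disc_coloring(points)
    return [p for p, c in zip(points, chi) if c == +1]

def merge_reduce(P, g):
    """Simplified stage-1 sketch: split P into groups of size 2g, halve each
    to size g, then repeatedly merge pairs of groups (back to size 2g) and
    color-halve (back to size g) until one group of size g remains."""
    groups = [P[i:i + 2 * g] for i in range(0, len(P), 2 * g)]
    groups = [halve(G) for G in groups]
    while len(groups) > 1:
        merged = [groups[i] + groups[i + 1] for i in range(0, len(groups), 2)]
        groups = [halve(G) for G in merged]
    return groups[0]

P = [random.random() for _ in range(1024)]
T = merge_reduce(P, 64)
assert len(T) == 64 and set(T) <= set(P)
```

Each halving incurs error at most disc/(2g) by Lemma A.1, and (S1)/(S2) let those errors add across rounds, which is exactly the accounting carried out in the proof above.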