Consensus Answers for Queries over Probabilistic Databases
Jian Li and Amol Deshpande
{lijian, amol}@cs.umd.edu
University of Maryland, College Park

Abstract
We address the problem of finding a "best" deterministic query answer to a query over a probabilistic database. For this purpose, we propose the notion of a consensus world (or a consensus answer), which is a deterministic world (answer) that minimizes the expected distance to the possible worlds (answers). This problem can be seen as a generalization of the well-studied inconsistent information aggregation problems (e.g., rank aggregation) to probabilistic databases. We consider this problem for various types of queries, including SPJ queries, Top-k queries, group-by aggregate queries, and clustering. For different distance metrics, we obtain polynomial time optimal or approximation algorithms for computing the consensus answers (or prove NP-hardness). Most of our results are for a general probabilistic database model, called the and/xor tree model, which significantly generalizes previous probabilistic database models like x-tuples and block-independent disjoint models, and is of independent interest.

1 Introduction

There is an increasing interest in uncertain and probabilistic databases arising in application domains such as information retrieval [11, 35], recommendation systems [32, 33], mobile object data management [8], information extraction [20], data integration [3], and sensor networks [13]. Supporting complex queries and decision-making on probabilistic databases is significantly more difficult than in deterministic databases; the key challenges include defining proper and intuitive semantics for queries over them, and developing efficient query processing algorithms.

The common semantics in probabilistic databases are the "possible worlds" semantics, where a probabilistic database is considered to correspond to a probability distribution over a set of deterministic databases called "possible worlds". Posing a query over such a probabilistic database therefore generates a probability distribution over a set of deterministic results, which we call "possible answers". However, a full list of possible answers together with their probabilities is not desirable in most cases, since the size of the list can be exponentially large and the probability associated with each single answer is extremely small. One approach to addressing this issue is to "combine" the possible answers somehow to obtain a more compact representation of the result.
For select-project-join queries, for instance, one proposed approach is to union all the possible answers, and to compute the probability of each result tuple by adding the probabilities of all possible answers it belongs to [11]. This approach, however, cannot be easily extended to other types of queries like ranking or aggregate queries. Furthermore, from the user or application perspective, despite the probabilistic nature of the data, a single, deterministic query result is desirable in most cases, on which further analysis or decision-making can be based. For SPJ queries, this is often achieved by "thresholding", i.e., returning only the result tuples with a sufficiently high probability of being true. For aggregate queries, expected values are often returned instead [24]. For ranking queries, on the other hand, a range of different approaches have been proposed to find the true ranking of the tuples. These include U-Topk and U-kRanks [37], the probabilistic threshold Top-k function [22], Global Top-k [43], expected rank [9], and so on. Although these definitions seem to reason about ranking over probabilistic databases in some "natural" ways, there is a lack of a unified and systematic framework to justify their semantics and to discriminate the usefulness of one from another.

In this paper, we consider the problem of combining the results from all possible worlds in a systematic way by putting it in the context of inconsistent information aggregation, which has been studied extensively in numerous contexts over the last half century. In our context, the set of different query answers returned from the possible worlds can be thought of as inconsistent information which we need to aggregate to obtain a single representative answer. To the best of our knowledge, this connection between query processing in probabilistic databases and inconsistent information aggregation, though natural, has not been formalized before. Concretely, we propose the notion of the consensus answer. Roughly speaking, the consensus answer is an answer that is closest in expectation to the answers of the possible worlds. To measure the closeness of two answers τ₁ and τ₂, we define a suitable distance function d(τ₁, τ₂) over the answer space. For example, if an answer is a vector, we can simply use the L₁ norm; in other cases, for instance Top-k queries, the definition of d is more involved. If the consensus answer can be taken from any point in the answer space, we refer to it as the mean answer. A median answer is defined similarly, except that the median answer must be the answer of some possible world with non-zero probability. From a mathematical perspective, if the distance function is properly defined to reflect the closeness of answers, the consensus answer is perhaps the best deterministic representative of the set of all possible answers, since it can be thought of as the centroid of the set of points corresponding to the possible answers. Our key results can be summarized as follows:

• (Probabilistic And/Xor Tree) We propose a new model for modeling correlations, called the probabilistic and/xor tree model, that can capture two types of correlations: mutual exclusion and coexistence. This model generalizes previous models such as x-tuples and the block-independent disjoint tuples model. More importantly, this model admits an elegant generating functions based framework for many types of probability computations.

• (Set Distance Metrics) We show that the mean and the median world can be found in polynomial time under the symmetric difference metric for the and/xor tree model. For the Jaccard distance metric, we present a polynomial time algorithm to compute the mean and median worlds for tuple-independent databases.

• (Top-k Ranking Queries) The problem of aggregating inconsistent rankings has been well studied under the name of rank aggregation [14]. We develop polynomial time algorithms for computing mean and median Top-k answers under the symmetric difference metric, and mean answers under the intersection metric and a generalized Spearman's footrule distance [16], for the and/xor tree model.

• (Group-By Aggregates) For group-by count queries, we present a 4-approximation to the problem of finding a median answer (finding mean answers is trivial).

• (Consensus Clustering) We also consider the consensus clustering problem for the and/xor tree model and obtain a constant-factor approximation by extending a previous result [2].

Outline:
We begin with a discussion of the related work (Section 2). We then define the probabilistic and/xor tree model (Section 3), and present a generating functions based method to do probability computations on it (Section 3.3). The bulk of our key results is presented in Sections 4 and 5, where we address the problem of finding consensus worlds for different set distance metrics and for Top-k ranking queries, respectively. We then briefly discuss finding consensus worlds for group-by count aggregate queries and clustering queries in Section 6.

2 Related Work

There has been much work on managing probabilistic, uncertain, incomplete, and/or fuzzy data in database systems, and this area has received renewed attention in the last few years (see e.g. [23, 5, 28, 19, 17, 7, 8, 11, 40, 18]). This work has spanned a range of issues from theoretical development of data models and data languages to practical implementation issues such as indexing techniques. In terms of representation power, most of this work has either assumed independence between the tuples [17, 11], or has restricted the correlations that can be modeled [5, 28, 3, 34]. Several approaches for modeling complex correlations in probabilistic databases have also been proposed [35, 4, 36, 39].

For efficient query evaluation over probabilistic databases, one of the key results is the dichotomy of conjunctive query evaluation on tuple-independent probabilistic databases by Dalvi and Suciu [11, 12]. Briefly, the result states that the complexity of evaluating a conjunctive query over tuple-independent probabilistic databases is either PTIME or #P-hard, and that the PTIME queries are precisely those that admit safe query plans, which permit correct extensional evaluation of the query. Unfortunately, the problem of finding consensus answers appears to be much harder; this is because even if a query has a safe plan, the result tuples may still be arbitrarily correlated.

In recent years, there has also been much work on efficiently answering different types of queries over probabilistic databases. Soliman et al. [37] first considered the problem of ranking over probabilistic databases, and proposed two ranking functions to combine the tuple scores and probabilities. Yi et al. [41] presented improved algorithms for the same ranking functions. Zhang and Chomicki [43] presented a desiderata for ranking functions and proposed Global
Top-k queries. Ming Hua et al. [21, 22] recently presented a different ranking function called the probabilistic threshold Top-k query. Finally, Cormode et al. [9] also presented a semantics for ranking functions and a new ranking function called expected rank. In recent work, we proposed a parameterized ranking function and presented general algorithms for evaluating it [29]. Other types of queries have also recently been considered over probabilistic databases (e.g., clustering [10], nearest neighbors [6], etc.).

The problem of aggregating inconsistent information from different sources arises in numerous disciplines and has been studied in different contexts over decades. Specifically, the rank aggregation problem aims at combining k different complete ranked lists τ₁, . . . , τ_k on the same set of objects into a single ranking which best describes the combined preferences in the given lists. This problem was considered as early as the 18th century, when Condorcet and Borda proposed voting systems for elections [31, 25]. In the late 1950s, Kemeny proposed the first mathematical criterion for choosing the best ranking [26]. Namely, the Kemeny optimal aggregation τ is the ranking that minimizes ∑_{i=1}^{k} d(τ, τ_i), where d(τ_i, τ_j) is the number of pairs of elements that are ranked in different orders in τ_i and τ_j (also called the Kendall's tau distance). While computing a Kemeny optimal ranking is NP-hard [15], a 2-approximation can easily be achieved by picking the best of the k given ranking lists. Another well-known 2-approximation follows from the fact that the Spearman footrule distance, defined as d_F(τ_i, τ_j) = ∑_t |τ_i(t) − τ_j(t)|, is within twice the Kendall's tau distance, and footrule aggregation can be done optimally in polynomial time [14]. Ailon et al. [2] further improved the approximation ratio. We refer the readers to [27] for a survey of this problem.
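To make the two distances concrete, here is a small sketch (function names are our own, not from any of the cited works) computing the Kendall's tau and Spearman footrule distances between two full rankings; the Diaconis–Graham relation, Kendall ≤ footrule ≤ 2·Kendall, is what underlies the second 2-approximation mentioned above.

```python
from itertools import combinations

def kendall_tau(tau1, tau2):
    """Kendall's tau distance: the number of pairs of elements that are
    ranked in different orders by the two permutations (lists of the
    same items)."""
    pos1 = {t: i for i, t in enumerate(tau1)}
    pos2 = {t: i for i, t in enumerate(tau2)}
    return sum(1 for a, b in combinations(tau1, 2)
               if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0)

def footrule(tau1, tau2):
    """Spearman footrule distance: the L1 distance between the two
    position vectors."""
    pos1 = {t: i for i, t in enumerate(tau1)}
    pos2 = {t: i for i, t in enumerate(tau2)}
    return sum(abs(pos1[t] - pos2[t]) for t in pos1)
```

For example, reversing [1, 2, 3] gives Kendall distance 3 (every pair inverted) and footrule distance 4, which indeed lies between 3 and 6.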
For aggregating Top-k answers, Ailon [1] recently obtained an improved approximation based on rounding an LP solution. The consensus clustering problem asks for the clustering of a set of elements that minimizes the number of pairwise disagreements with k given clusterings. It is known to be NP-hard [42], and a 2-approximation can again be obtained by picking the best of the given k clusterings. The best known approximation is also due to Ailon et al. [2]. Recently, Cormode et al. [10] proposed approximation algorithms for k-center and k-median clustering under attribute-level uncertainty in probabilistic databases.

3 Preliminaries

We begin by reviewing the possible worlds semantics, and introduce the probabilistic and/xor tree model.
We consider probabilistic databases with both tuple-level uncertainty (the existence of a tuple is uncertain) and attribute-level uncertainty (a tuple's attribute value is uncertain). Specifically, we denote a probabilistic relation by R^P(K; A), where K is the key attribute and A is the value attribute. For a particular tuple in R^P, its key attribute is certain, and K is sometimes called the possible worlds key. R^P is assumed to correspond to a probability space (PW, Pr) whose set of outcomes is a set of deterministic relations, which we call possible worlds: PW = {pw₁, pw₂, . . . , pw_N}. Note that two tuples cannot have the same value of the key attribute in a single possible world. Because of the typically exponential size of PW, an explicit possible worlds representation is not feasible, and hence the semantics are usually captured implicitly by probabilistic models with polynomial size specifications. Let T denote the set of tuples in all possible worlds. For ease of notation, we will write t ∈ pw in place of "t appears in the possible world pw", Pr(t) for Pr(t is present), and Pr(¬t) for Pr(t is not present). For clarity, we will assume singleton key and value attributes.
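To make the possible worlds semantics concrete, the following sketch (the input encoding and function name are our own assumptions, not from the paper) enumerates the possible worlds of a small relation whose probabilistic tuples are independent of each other, each with mutually exclusive alternatives:

```python
from itertools import product

def bid_possible_worlds(blocks):
    """Yield (world, probability) pairs for a relation of independent
    probabilistic tuples, each with mutually exclusive alternatives.
    blocks: {key: [(value, prob), ...]} -- an assumed input encoding."""
    options = []
    for key, alts in blocks.items():
        opts = [((key, value), p) for value, p in alts]
        leftover = 1.0 - sum(p for _, p in alts)
        if leftover > 1e-12:          # the tuple may be absent entirely
            opts.append((None, leftover))
        options.append(opts)
    # one independent choice per block; a world is the set of chosen tuples
    for combo in product(*options):
        world = {t for t, _ in combo if t is not None}
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob
```

The number of worlds is exponential in the number of blocks, which is exactly why explicit enumeration is infeasible and compact models such as the and/xor tree below are needed.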
[Figure 1 omitted: it depicts (i) an and/xor tree over block-independent disjoint tuples t₁, . . . , t₄, each with two alternatives such as (t1, 8) and (t1, 2); (ii) a table of three possible worlds pw₁ = {(t3, 6), (t2, 5), (t1, 1)}, pw₂ = {(t3, 9), (t1, 7), (t4, 0)}, pw₃ = {(t2, 8), (t4, 4), (t5, 3)} with their probabilities; and (iii) the and/xor tree capturing that correlation.]

Figure 1: (i) The and/xor tree representation of a set of block-independent disjoint tuples; the generating function obtained by assigning the same variable x to all leaves (here 0.08x² + 0.44x³ + 0.48x⁴) gives us the distribution over the sizes of the possible worlds. (ii) Example of a highly correlated probabilistic database with three possible worlds, and (iii) the and/xor tree that captures the correlation; the coefficient of y (0.3) is the probability that the alternative (t3, 6) is ranked at position 1.

Further, for a tuple t^P ∈ R^P, we call the certain tuples corresponding to it (with the same key value) in the union of the possible worlds its alternatives.

Block-Independent Disjoint (BID) Scheme:
BID is one of the more popular models for probabilistic databases; it assumes that different probabilistic tuples (with different key values) are independent of each other [11, 40, 12, 38]. Formally, a BID scheme has a relational schema of the form R(K; A; Pr), where K is the possible worlds key, A is the value attribute, and Pr captures the probability of the corresponding tuple alternative.

We generalize the block-independent disjoint tuples model, which can capture mutual exclusion between tuples, by adding support for mutual co-existence, and by allowing these correlations to be specified hierarchically. Two events satisfy the mutual co-existence correlation if in any possible world either both happen or neither occurs. We model such correlations using a probabilistic and/xor tree (or and/xor tree for short), which also generalizes the notions of x-tuples [34, 41], p-or-sets [12] and tuple-independent databases. We first considered this model for tuple-level uncertainty in an earlier paper [29], and generalize it here to handle attribute-level uncertainty.

We use ∨ (or) to denote mutual exclusion and ∧ (and) to denote coexistence. Figure 1 shows two examples of probabilistic and/xor trees. Briefly, the leaves of the tree correspond to the tuple alternatives (we abuse notation somewhat and use t_i to denote both the tuple and its key value). The first tree captures a relation with four independent tuples, t₁, t₂, t₃, t₄, each with two alternatives, whereas the second tree shows how we can capture arbitrary possible worlds using an and/xor tree (Figure 1(ii) shows the possible worlds corresponding to that tree).

Now let us formally define a probabilistic and/xor tree. In a tree T, we denote the set of children of a node v by Ch_T(v), and the least common ancestor of two leaves l₁ and l₂ by LCA_T(l₁, l₂). We omit the subscript when the context is clear.

Definition 1
A probabilistic and/xor tree T represents the mutual exclusion and co-existence correlations in a probabilistic relation R^P(K; A), where K is the possible worlds key and A is the value attribute. In T, each leaf is a key–attribute pair (a tuple alternative), and each inner node is marked ∨ or ∧. For each ∨ node u and each of its children v ∈ Ch(u), there is a nonnegative value Pr(u, v) associated with the edge (u, v). Moreover, we require:

• (Probability Constraint) ∑_{v ∈ Ch(u)} Pr(u, v) ≤ 1.
• (Key Constraint) For any two different leaves l₁, l₂ holding the same key, LCA(l₁, l₂) is a ∨ node. (This constraint is imposed to avoid two leaves with the same key coexisting in a possible world.)

Let T_v be the subtree rooted at v and Ch(v) = {v₁, . . . , v_ℓ}. The subtree T_v inductively defines a random subset S_v of its leaves by the following independent process:

• If v is a leaf, S_v = {v}.
• If T_v is rooted at a ∨ node, then S_v = S_{v_i} with probability Pr(v, v_i), and S_v = ∅ with probability 1 − ∑_{i=1}^{ℓ} Pr(v, v_i).
• If T_v is rooted at a ∧ node, then S_v = ∪_{i=1}^{ℓ} S_{v_i}.

Probabilistic and/xor trees can capture more complicated correlations than prior models such as the BID model or x-tuples. We remark that Markov and Bayesian network models can capture more general correlations [35]; however, those models are structurally more complex, and probability computations on them (inference) are typically exponential in the treewidth of the model. The treewidth of an and/xor tree (viewed as a Markov network) is not bounded, and hence the techniques developed for those models cannot be used to obtain polynomial time algorithms for and/xor trees.
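The inductive process defining S_v can be sketched as a sampler; the node encoding (leaves as (key, value) pairs, inner nodes tagged "or"/"and") is our own assumption for illustration:

```python
import random

def sample_world(node):
    """Draw a random set of leaves S_v from the subtree rooted at node,
    following the inductive process of Definition 1.
    Leaves are (key, value) pairs; inner nodes are ("or", [(child, prob),
    ...]) or ("and", [child, ...])."""
    if node[0] not in ("or", "and"):        # leaf: S_v = {v}
        return {node}
    kind, children = node
    if kind == "or":                        # choose at most one child
        r, acc = random.random(), 0.0
        for child, p in children:
            acc += p
            if r < acc:
                return sample_world(child)
        return set()                        # empty with leftover probability
    world = set()                           # "and": union over all children
    for child in children:
        world |= sample_world(child)
    return world
```

For instance, an ∧ root whose children are ∨ nodes, each holding the alternatives of one tuple, reproduces exactly the BID scheme described above.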
3.3 Computing Probabilities on And/Xor Trees

Aside from the representational power of the and/xor tree model, perhaps its best feature is that many types of probability computations can be done efficiently and elegantly on it using generating functions. In our prior work [29], we used a similar technique for computing ranking functions in the tuple-level uncertainty model. Here we generalize the idea to a broader range of probability computations.

We denote the and/xor tree by T. Suppose X = {x₁, x₂, . . .} is a set of variables, and define a mapping s which associates each leaf l ∈ T with a variable s(l) ∈ X. Let T_v denote the subtree rooted at v, and let v₁, . . . , v_ℓ be v's children. For each node v ∈ T, we define a generating function F_v recursively:

• If v is a leaf, F_v(X) = s(v).
• If v is a ∨ node, F_v(X) = (1 − ∑_{h=1}^{ℓ} Pr(v, v_h)) + ∑_{h=1}^{ℓ} Pr(v, v_h) · F_{v_h}(X).
• If v is a ∧ node, F_v(X) = ∏_{h=1}^{ℓ} F_{v_h}(X).

The generating function F(X) for the tree T is the one defined for the root. It is easy to see that if we have a constant number of variables, the polynomial can be expanded in the form ∑_{i₁,i₂,...} c_{i₁,i₂,...} x₁^{i₁} x₂^{i₂} · · · in polynomial time. Now recall that each possible world pw contains a subset of the leaves of T (as dictated by the ∨ and ∧ nodes). The following theorem characterizes the relationship between the coefficients of F and the probabilities we are interested in.

Theorem 1
The coefficient of the term ∏_j x_j^{i_j} in F(X) is the total probability of the possible worlds in which, for every j, exactly i_j of the appearing leaves are associated with the variable x_j. The proof is by induction on the tree structure and is omitted.
Example 1
If we associate all leaves with the same variable x, the coefficient of x^i is equal to Pr(|pw| = i). This can be used to obtain the distribution over the possible world sizes (Figure 1(i)).
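A minimal sketch of the generating function recursion, restricted to two variables x and y (which suffices for the examples in this section); the node encoding and function names are our own assumptions:

```python
def gen_function(node, var):
    """Expand the and/xor tree generating function F (Theorem 1) over two
    variables. Leaves are (key, value) pairs; inner nodes are
    ("or", [(child, prob), ...]) or ("and", [child, ...]).
    var maps each leaf to "x", "y", or "1"; the result maps (i, j) to the
    coefficient of x^i y^j."""
    def add(poly, key, c):
        poly[key] = poly.get(key, 0.0) + c

    def mul(p1, p2):                       # polynomial product
        out = {}
        for (i1, j1), c1 in p1.items():
            for (i2, j2), c2 in p2.items():
                add(out, (i1 + i2, j1 + j2), c1 * c2)
        return out

    if node[0] not in ("or", "and"):       # leaf: F_v = s(v)
        exp = {"x": (1, 0), "y": (0, 1), "1": (0, 0)}[var[node]]
        return {exp: 1.0}
    kind, children = node
    if kind == "or":                       # convex combination of children
        slack = 1.0 - sum(p for _, p in children)
        poly = {(0, 0): slack} if slack > 1e-12 else {}
        for child, p in children:
            for key, c in gen_function(child, var).items():
                add(poly, key, p * c)
        return poly
    poly = {(0, 0): 1.0}                   # "and": product of children
    for child in children:
        poly = mul(poly, gen_function(child, var))
    return poly
```

Assigning "x" to every leaf recovers Example 1 (the coefficient of x^i is Pr(|pw| = i)); assigning "x" only to a subset S recovers Example 2 below.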
Example 2
If we associate a subset S of the leaves with the variable x, and all other leaves with the constant 1, the coefficient of x^i is equal to Pr(|pw ∩ S| = i).

Example 3
Next we show how to compute Pr(r(t) = i) (i.e., the probability that t is ranked at position i), where r(t) denotes the rank of the tuple in a possible world under some score metric. Assume t has only one alternative, (t, a), and hence only one possible score value, s. Then, in the and/xor tree T, we associate each leaf whose key differs from t and whose score value is larger than s with the variable x, the leaf (t, a) with the variable y, and the remaining leaves with the constant 1. The coefficient of x^{i−1} y in the generating function is then exactly Pr(r(t) = i). If the tuple has multiple alternatives, we can compute Pr(r(t) = i) by summing these probabilities over its alternatives.

3.4 Consensus Answers

We denote the domain of answers for a query by Ω and the distance function between two answers by d(·, ·). Formally, we define the consensus answer τ* to be a feasible answer to the query such that the expected distance between τ* and the answer τ_pw of the (random) world pw is minimized, i.e., τ* = argmin_{τ′ ∈ Ω} E[d(τ′, τ_pw)]. We call the consensus answer the mean answer when Ω is the set of all feasible answers. If Ω is restricted to the set of possible answers (answers of some possible world with non-zero probability), we call the consensus answer the median answer. Taking the example of Top-k queries, the median answer must be the Top-k answer of some possible world, while the mean answer can be any sorted list of size k.

4 Consensus Worlds

We first consider the problem of finding the consensus world for a given probabilistic database, under two set distance measures: symmetric difference and Jaccard distance.
4.1 Symmetric Difference

The symmetric difference distance between two sets S₁, S₂ is defined as d_∆(S₁, S₂) = |S₁ ∆ S₂| = |(S₁ \ S₂) ∪ (S₂ \ S₁)|. Note that two different alternatives of a tuple are treated as different tuples here.

Theorem 2
The mean world under the symmetric difference distance is the set of all tuples with probability > 0.5.

Proof:
Suppose S is a fixed set of tuples and S̄ = T − S. Let δ(p) be the indicator function: δ(p) = 1 if p is true, and δ(p) = 0 if p is false. We write E_{pw∈PW}[d_∆(S, pw)] as follows:

E[d_∆(S, pw)] = E[∑_{t∈S} δ(t ∉ pw) + ∑_{t∈S̄} δ(t ∈ pw)]
= ∑_{t∈S} E[δ(t ∉ pw)] + ∑_{t∈S̄} E[δ(t ∈ pw)]
= ∑_{t∈S} Pr(¬t) + ∑_{t∈S̄} Pr(t)

Thus each tuple t contributes Pr(¬t) to the expected distance if t ∈ S, and Pr(t) otherwise; hence the minimum is achieved by the set of tuples with probability greater than 0.5. □

Finding the consensus median world is somewhat trickier, the main concern being that the world containing all tuples with probability > 0.5 may not be a possible world.

Corollary 1
If the correlations can be modeled by a probabilistic and/xor tree, the median world is the set containing all tuples with probability greater than 0.5.

The proof is by induction on the height of the tree, and is omitted for space constraints. This however does not hold for arbitrary correlations, and it is easy to see that finding a median world is NP-hard even if result tuple probability computation is easy. We show a reduction from MAX-2-SAT for a simple two-relation query. Let the MAX-2-SAT instance consist of n variables x₁, . . . , x_n and k clauses. Consider a query R ⋈ S, where S(x, b) = {(x₁, 0), (x₁, 1), (x₂, 0), (x₂, 1), . . .} contains two mutually exclusive tuples for each variable; all tuples are equi-probable with probability 0.5. R(C, x, b) is a certain table, and contains two tuples for each clause: for the clause c₁ = x₁ ∨ ¬x₂, it contains the tuples (c₁, x₁, 1) and (c₁, x₂, 0). The result of π_C(R ⋈ S) contains one tuple for each clause, associated with a probability of 0.75. So the median answer is the possible answer containing the maximum number of tuples, which corresponds to finding the assignment to the x_i's that maximizes the number of satisfied clauses.

4.2 Jaccard Distance

The Jaccard distance between two sets S₁, S₂ is defined as d_J(S₁, S₂) = |S₁ ∆ S₂| / |S₁ ∪ S₂|. The Jaccard distance always lies in [0, 1] and is a true metric, i.e., it satisfies the triangle inequality. Next we present polynomial time algorithms for finding the mean and median worlds for tuple-independent databases, and the median world for the BID model.

Lemma 1
Given an and/xor tree T and a possible world W for it (corresponding to a set of leaves of T), we can compute E[d_J(W, pw)] in polynomial time.

Proof:
A generating function F_T is constructed with the variables associated with the leaves as follows: for t ∈ W the associated variable is x, and for t ∉ W it is y. For example, in a tuple-independent database, the generating function is:

F(x, y) = ∏_{t∈W} (Pr(¬t) + Pr(t) x) · ∏_{t∉W} (Pr(¬t) + Pr(t) y)

From Theorem 1, the coefficient c_{i,j} of the term x^i y^j in the generating function F is equal to the total probability of the worlds whose Jaccard distance from W is exactly (|W| − i + j)/(|W| + j). Thus, the expected distance is ∑_{i,j} c_{i,j} (|W| − i + j)/(|W| + j). □

Lemma 2
For tuple-independent databases, if the mean world contains tuple t₁ but not tuple t₂, then Pr(t₁) ≥ Pr(t₂).

Proof:
Say W₁ is the mean world and the lemma is not true, i.e., ∃ t₁ ∈ W₁, t₂ ∉ W₁ s.t. Pr(t₁) < Pr(t₂). Let W = W₁ − {t₁}, W₂ = W + {t₂}, and W′ = T − W − {t₁} − {t₂}. We will prove that W₂ has a smaller expected Jaccard distance, thus obtaining a contradiction. Suppose |W₁| = |W₂| = k. Let M = [m_{i,j}]_{i,j} be the matrix with m_{i,j} = (k − i + j)/(k + j). We construct generating functions as in Lemma 1; suppose F₁ and F₂ are the generating functions for W₁ and W₂, respectively. We write ||A|| = ∑_{i,j} a_{i,j} for any matrix A, and let A ⊗ B denote the Hadamard product of A and B (the entrywise product). We denote:

F′(x, y) = ∏_{t∈W} (Pr(¬t) + Pr(t) x) ∏_{t∈W′} (Pr(¬t) + Pr(t) y)

We can easily see:

F₁(x, y) = F′(x, y) (Pr(¬t₁) + Pr(t₁) x) (Pr(¬t₂) + Pr(t₂) y)
F₂(x, y) = F′(x, y) (Pr(¬t₂) + Pr(t₂) x) (Pr(¬t₁) + Pr(t₁) y)

Then, taking the difference, we get F̄ = F₁(x, y) − F₂(x, y):

F̄(x, y) = F′(x, y) (Pr(¬t₁) Pr(t₂) − Pr(t₁) Pr(¬t₂)) (y − x)    (1)

Let C_F = [c_{i,j}] be the coefficient matrix of a generating function F, where c_{i,j} is the coefficient of the term x^i y^j. Using the proof of Lemma 1:

E[d_J(W₁, pw)] − E[d_J(W₂, pw)] = ||C_{F₁} ⊗ M|| − ||C_{F₂} ⊗ M|| = ||C_{F̄} ⊗ M||

Let c′_{i,j} and c̄_{i,j} be the coefficients of x^i y^j in F′ and F̄, respectively. From (1), it is not hard to see that c̄_{i,j} = (c′_{i,j−1} − c′_{i−1,j}) p, where p = Pr(¬t₁) Pr(t₂) − Pr(t₁) Pr(¬t₂) > 0. Then we have:

||C_{F̄} ⊗ M|| = p ∑_{i,j} (c′_{i,j−1} − c′_{i−1,j}) m_{i,j}
= p ∑_{i,j} c′_{i,j} (m_{i,j+1} − m_{i+1,j})
= p ∑_{i,j} c′_{i,j} ((k − i + j + 1)/(k + j + 1) − (k − i + j − 1)/(k + j))

Since (k − i + j + 1)/(k + j + 1) − (k − i + j − 1)/(k + j) > 0 for all i, j ≥ 0, the proof is complete. □

5 Top-k Queries

In this section, we consider
Top-k queries in probabilistic databases. Each tuple t_i has a score s(t_i). In the tuple-level uncertainty model, s(t_i) is fixed for each t_i, while in the attribute-level uncertainty model it is a random variable. In the and/xor tree model, we assume that the attribute field is the score (uncertain attributes that do not contribute to the score can be ignored). We further assume that no two tuples can take the same score, to avoid ties. We use r(t) to denote the random variable indicating the rank of t, and r_pw(t) to denote the rank of t in the possible world pw. If t does not appear in pw, then r_pw(t) = ∞; thus Pr(r(t) > i) includes both the probability that t's rank is larger than i and the probability that t does not exist. We say t₁ ranks higher than t₂ in a possible world pw if r_pw(t₁) < r_pw(t₂). Finally, we use the symbol τ to denote rankings, and τ^i to denote the restriction of the Top-k list τ to its first i items. We use τ(i) to denote the i-th item in the list τ for a positive integer i, and τ(t) to denote the position of t ∈ T in τ.

5.1 Distance between Top-k Answers
Fagin et al. [16] provide a comprehensive analysis of the problem of comparing two Top-k lists. They present extensions of the Kendall's tau and Spearman footrule metrics (defined on full rankings) to Top-k lists, and propose several other natural metrics, such as the intersection metric and Goodman and Kruskal's gamma function. In this paper, we consider three of the metrics discussed in that paper: the symmetric difference metric, the intersection metric, and one particular extension of Spearman's footrule distance. We briefly recall the definitions here; for more details and the relationships between the definitions, please refer to [16].

Given two Top-k lists τ₁ and τ₂, the normalized symmetric difference metric is defined as:

d_∆(τ₁, τ₂) = (1/2k) |τ₁ ∆ τ₂| = (1/2k) |(τ₁ \ τ₂) ∪ (τ₂ \ τ₁)|

While d_∆ focuses only on membership, the intersection metric d_I also takes the order of the tuples into consideration. It is defined as:

d_I(τ₁, τ₂) = (1/k) ∑_{i=1}^{k} d_∆(τ₁^i, τ₂^i)

Both d_∆ and d_I always take values between 0 and 1.

The original Spearman's footrule metric is defined as the L₁ distance between two permutations σ₁ and σ₂; formally, F(σ₁, σ₂) = ∑_{t∈T} |σ₁(t) − σ₂(t)|. Let ℓ be an integer greater than k. The footrule distance with location parameter ℓ, denoted F^{(ℓ)}, generalizes the original footrule metric: it is obtained by placing all missing elements in each list at position ℓ and then computing the usual footrule distance between them. A natural choice of ℓ is k + 1, and we denote F^{(k+1)} by d_F. It is proven that d_F is a true metric and a member of a big and important equivalence class [16]. (All distance functions in one equivalence class are bounded by each other within constant factors; this class includes several extensions of Spearman's footrule and Kendall's tau metrics.)

It is shown in [16] that:

d_F(τ₁, τ₂) = (k + 1)|τ₁ ∆ τ₂| + ∑_{t∈τ₁∩τ₂} |τ₁(t) − τ₂(t)| − ∑_{t∈τ₁\τ₂} τ₁(t) − ∑_{t∈τ₂\τ₁} τ₂(t)

Next we consider the problem of evaluating consensus answers under these distance metrics.
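A small sketch computing the three Top-k distances just described (assuming the 1/(2k) normalization for d_∆ and location parameter ℓ = k + 1 for d_F; the function name is our own):

```python
def topk_distances(tau1, tau2):
    """Return (d_sym, d_int, d_foot) between two Top-k lists of equal
    length k: the normalized symmetric difference metric, the
    intersection metric, and the (unnormalized) footrule distance with
    location parameter k + 1."""
    k = len(tau1)
    d_sym = len(set(tau1) ^ set(tau2)) / (2 * k)
    # intersection metric: average of d_sym over all prefixes
    d_int = sum(len(set(tau1[:i]) ^ set(tau2[:i])) / (2 * i)
                for i in range(1, k + 1)) / k
    # footrule: elements missing from a list are placed at position k + 1
    ell = k + 1
    pos1 = {t: i + 1 for i, t in enumerate(tau1)}
    pos2 = {t: i + 1 for i, t in enumerate(tau2)}
    d_foot = sum(abs(pos1.get(t, ell) - pos2.get(t, ell))
                 for t in set(tau1) | set(tau2))
    return d_sym, d_int, d_foot
```

For the lists [a, b] and [a, c], for instance, d_∆ = 1/2, d_I = 1/4, and d_F = 2, matching the closed-form expression for d_F above.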
5.2 Symmetric Difference and the PT-k Function

In this section, we show how to find the mean and median
Top-k answers under the symmetric difference metric in the and/xor tree model. The probabilistic threshold Top-k (PT-k) query [22] has been proposed for evaluating ranking queries over probabilistic databases, and essentially returns all tuples t for which Pr(r(t) ≤ k) is greater than a given threshold. If we set the threshold so that the PT-k query returns exactly k tuples, we can show that the answer returned is the mean answer under the symmetric difference metric.

Theorem 3 If τ = {τ(1), τ(2), . . . , τ(k)} is the set of k tuples with the largest values of Pr(r(t) ≤ k), then τ is the mean Top-k answer under the metric d_∆, i.e., τ minimizes E[d_∆(τ, τ_pw)].

Proof:
Suppose τ is fixed. We write E[d_∆(τ, τ_pw)] as follows (omitting the normalization factor 1/(2k), which does not affect the minimizer):

E[d_∆(τ, τ_pw)] = E[∑_{t∈T} δ(t ∈ τ ∧ t ∉ τ_pw) + δ(t ∈ τ_pw ∧ t ∉ τ)]
= ∑_{t∈T\τ} E[δ(t ∈ τ_pw)] + ∑_{t∈τ} E[δ(t ∉ τ_pw)]
= ∑_{t∈T\τ} Pr(r(t) ≤ k) + ∑_{t∈τ} Pr(r(t) > k)
= k + ∑_{t∈T} Pr(r(t) ≤ k) − 2 ∑_{t∈τ} Pr(r(t) ≤ k)

The first two terms are invariant with respect to τ; therefore the set of k tuples with the largest values of Pr(r(t) ≤ k) minimizes the expectation. □

To find a median answer, we essentially need to find the
Top-k answer τ of some possible world such that ∑_{t∈τ} Pr(r(t) ≤ k) is maximized. Next we show how to do this in polynomial time given an and/xor tree. We write P(t) = Pr(r(t) ≤ k) for ease of notation, and use dynamic programming over the tree structure. For each possible attribute value a ∈ A, let T^a be the tree that contains exactly the leaves with attribute value at least a. For each node v in T^a and each 1 ≤ i ≤ k, we recursively compute the set of tuples pw^a_{v,i} that maximizes ∑_{t∈pw} P(t) among all possible worlds pw of size i generated by the subtree T^a_v rooted at v. We compute this for all distinct values a, and the optimal solution is the set pw^a_{root,k} with the largest value of ∑_t P(t) over all choices of a.

Suppose v₁, v₂, . . . , v_ℓ are v's children. The recursion is:

• If v is a ∨ node, pw^a_{v,i} is the best of the children's solutions, i.e., the pw^a_{v_j,i} maximizing ∑_{t∈pw} P(t) over j.
• If v is a ∧ node, pw^a_{v,i} = ∪_j pw_j such that ∑_j |pw_j| = i, pw_j ∈ PW(T^a_{v_j}), and ∑_{t∈∪_j pw_j} P(t) is maximized.

In the latter case, the maximum can again be computed by dynamic programming. Let pw^a_{[v₁,...,v_h],i} = ∪_{j=1}^{h} pw_j such that ∑_{j=1}^{h} |pw_j| = i, pw_j ∈ PW(T^a_{v_j}), and ∑_{t∈∪_j pw_j} P(t) is maximized. This can be computed recursively via pw^a_{[v₁,...,v_h],i} = pw^a_{[v₁,...,v_{h−1}],p} ∪ pw^a_{v_h,q}, choosing p, q with p + q = i so that ∑_{t∈pw^a_{[v₁,...,v_{h−1}],p} ∪ pw^a_{v_h,q}} P(t) is maximized. Then pw^a_{v,i} is simply pw^a_{[v₁,...,v_ℓ],i}.

Theorem 4
The median Top-k answer under the symmetric difference metric can be found in polynomial time for a probabilistic and/xor tree.

Note that the intersection metric d_I is a linear combination of normalized symmetric difference metrics. Using an approach similar to the proof of Theorem 3, we can show that:

E[d_I(τ, τ_pw)] = (1/k) Σ_{i=1}^k (1/(2i)) E[d_Δ(τ^i, τ^i_pw)]
= (1/k) Σ_{i=1}^k (1/(2i)) ( i + Σ_{t∈T} Pr(r(t) ≤ i) − 2 Σ_{t∈τ^i} Pr(r(t) ≤ i) ).

Hence, to minimize the expected distance we need to find the τ that maximizes the last term, A(τ) = Σ_{i=1}^k (1/i) Σ_{t∈τ^i} Pr(r(t) ≤ i). We first rewrite the objective as follows, using the indicator (δ) function:

A(τ) = Σ_{i=1}^k (1/i) Σ_{t∈T} Pr(r(t) ≤ i) δ(t ∈ τ^i)
= Σ_{t∈T} Σ_{i=1}^k (1/i) Pr(r(t) ≤ i) Σ_{j=1}^i δ(t = τ(j))
= Σ_{t∈T} Σ_{j=1}^k δ(t = τ(j)) Σ_{i=j}^k (1/i) Pr(r(t) ≤ i).

The last equality holds since Σ_{i=1}^k Σ_{j=1}^i a_{ij} = Σ_{j=1}^k Σ_{i=j}^k a_{ij}.

The optimization task can thus be written as an assignment problem, with each tuple t acting as an agent and each of the Top-k positions j as a task. Assigning task j to agent t gains a profit of Σ_{i=j}^k (1/i) Pr(r(t) ≤ i), and the goal is to find an assignment such that each task is assigned to at most one agent and the total profit is maximized. The best known algorithm for computing the optimal assignment runs in O(nk√n) time, via computing a maximum weight matching on a bipartite graph [30].

Approximating the Intersection Metric:
We define the following ranking function, where H_i denotes the i-th Harmonic number (with H_0 = 0):

Υ^H(t) = Σ_{i=1}^k (H_k − H_{i−1}) Pr(r(t) = i) = Σ_{i=1}^k (1/i) Pr(r(t) ≤ i).

This is a special case of the parameterized ranking functions proposed in [29] and can be computed in O(nk log n) time for all tuples in an and/xor tree. We claim that the Top-k answer τ^H returned by the Υ^H function, i.e., the k tuples with the highest Υ^H values, is a good approximation of the mean answer with respect to the intersection metric, by arguing that τ^H = {t_1, t_2, ..., t_k} is an approximate maximizer of A(τ). Indeed, we prove that A(τ^H) ≥ (1/H_k) A(τ*), where τ* is the optimal mean Top-k answer.

Let B(τ) = Σ_{t∈τ} Υ^H(t) for any Top-k answer τ. It is easy to see that A(τ*) ≤ B(τ*) ≤ B(τ^H), since τ^H maximizes the B() function. Then we get:

A(τ^H) = Σ_{j=1}^k Σ_{i=j}^k (1/i) Pr(r(t_j) ≤ i)
≥ Σ_{j=1}^k ((H_k − H_{j−1})/H_k) Σ_{i=1}^k (1/i) Pr(r(t_j) ≤ i)
= Σ_{j=1}^k ((H_k − H_{j−1})/H_k) Υ^H(t_j)
≥ (1/k) ( Σ_{j=1}^k (H_k − H_{j−1})/H_k ) ( Σ_{j=1}^k Υ^H(t_j) )
= (1/H_k) B(τ^H) ≥ (1/H_k) A(τ*).

The first inequality holds because Pr(r(t_j) ≤ i) is non-decreasing in i, so the weighted average of (1/i) Pr(r(t_j) ≤ i) over the tail i ≥ j is at least the weighted average over all i ∈ [k], and Σ_{i=j}^k (1/i) = H_k − H_{j−1}. The second inequality is Chebyshev's sum inequality: for sequences a_i and c_i (1 ≤ i ≤ n) sorted in the same order, Σ_{i=1}^n a_i c_i ≥ (1/n)(Σ_{i=1}^n a_i)(Σ_{i=1}^n c_i); here both (H_k − H_{j−1})/H_k and Υ^H(t_j) are non-increasing in j. The final equality uses Σ_{j=1}^k (H_k − H_{j−1}) = k.

E[F*(τ, τ_pw)] = E[ (k+1)|τ Δ τ_pw| + Σ_{t∈τ∩τ_pw} |τ(t) − τ_pw(t)| − Σ_{t∈τ\τ_pw} τ(t) − Σ_{t∈τ_pw\τ} τ_pw(t) ]
= (k+1) E[|τ Δ τ_pw|] + Σ_{t∈T} E[δ(t ∈ τ∩τ_pw) |τ(t) − τ_pw(t)|] − Σ_{t∈T} E[δ(t ∈ τ\τ_pw) τ(t)] − Σ_{t∈T\τ} E[δ(t ∈ τ_pw) τ_pw(t)]
= (k+1) E[|τ Δ τ_pw|] + Σ_{t∈T} Σ_{i=1}^k Σ_{j=1}^k E[δ(t = τ(i)) δ(t = τ_pw(j)) |i − j|] − Σ_{t∈T} Σ_{i=1}^k E[δ(t ∈ τ\τ_pw) δ(t = τ(i)) i] − Σ_{t∈T\τ} Υ_1(t)
= (k+1) E[|τ Δ τ_pw|] + Σ_{t∈T} Σ_{i=1}^k δ(t = τ(i)) Σ_{j=1}^k Pr(r(t) = j) |i − j| − Σ_{t∈T} Σ_{i=1}^k δ(t = τ(i)) i Pr(r(t) > k) − Σ_{t∈T\τ} Υ_1(t)
= (k+1)( k + Σ_{t∈T} Υ_0(t) − 2 Σ_{t∈τ} Υ_0(t) ) + Σ_{t∈T} Σ_{i=1}^k δ(t = τ(i)) Υ_2(t, i) − Σ_{t∈T\τ} Υ_1(t)
= (k+1)k + Σ_{t∈T} ( (k+1) Υ_0(t) − Υ_1(t) ) + Σ_{t∈T} Σ_{i=1}^k δ(t = τ(i)) ( Υ_2(t, i) + Υ_1(t) − 2(k+1) Υ_0(t) )

Figure 2: Derivation for Spearman's Footrule Distance
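Several of the quantities used above — Pr(r(t) = i), Pr(r(t) ≤ i), and the Υ^H scores — reduce to positional probabilities of individual tuples. For the special case of independent tuples (a restriction of the and/xor tree model), they can be computed with a simple Poisson-binomial dynamic program over the tuples sorted by score. The following is a minimal sketch; the function names and representation are ours, not from the paper:

```python
from itertools import accumulate

def rank_distribution(tuples, k):
    """For each tuple t, given as (score, existence_prob) with distinct scores,
    return [Pr(r(t) = 1), ..., Pr(r(t) = k)], where r(t) is t's rank among the
    tuples that exist in a random world (tuples are mutually independent)."""
    order = sorted(range(len(tuples)), key=lambda i: -tuples[i][0])
    # dp[c] = Pr(exactly c of the higher-scored tuples processed so far exist)
    dp = [1.0]
    dist = {}
    for idx in order:
        p = tuples[idx][1]
        # Pr(r(t) = c + 1) = p * Pr(exactly c higher-scored tuples exist)
        dist[idx] = [p * dp[c] if c < len(dp) else 0.0 for c in range(k)]
        # fold this tuple into the Poisson-binomial DP (truncated at k terms,
        # since ranks beyond k are never needed)
        new = [0.0] * min(len(dp) + 1, k)
        for c, q in enumerate(dp[:k]):
            if c < len(new):
                new[c] += q * (1 - p)
            if c + 1 < len(new):
                new[c + 1] += q * p
        dp = new
    return [dist[i] for i in range(len(tuples))]

def topk_by_upsilon_H(tuples, k):
    """Score by Υ^H(t) = Σ_{i=1..k} Pr(r(t) <= i) / i; return top-k indices."""
    dists = rank_distribution(tuples, k)
    def upsilon(d):
        cum = list(accumulate(d))          # cum[i-1] = Pr(r(t) <= i)
        return sum(c / (i + 1) for i, c in enumerate(cum))
    return sorted(range(len(tuples)), key=lambda i: -upsilon(dists[i]))[:k]
```

For general and/xor trees the paper instead uses the generating functions method; the sketch above only illustrates the independent-tuples case.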
For a Top-k answer τ = {τ(1), τ(2), ..., τ(k)}, we define:
• Υ_0(t) = Σ_{i=1}^k Pr(r(t) = i)
• Υ_1(t) = Σ_{i=1}^k Pr(r(t) = i) · i
• Υ_2(t, i) = Σ_{j=1}^k Pr(r(t) = j) |i − j| − i · Pr(r(t) > k).

It is easy to see that Υ_0(t), Υ_1(t), and Υ_2(t, i) can be computed in polynomial time for a probabilistic and/xor tree using our generating functions method. A careful and non-trivial rewriting of E_{pw∈PW}[F*(τ, τ_pw)] shows that it has the form (Figure 2):

E_{pw∈PW}[F*(τ, τ_pw)] = C + Σ_{t∈T} Σ_{i=1}^k δ(t = τ(i)) f(t, i),

where C is a constant independent of τ, and f(t, i) is a polynomially computable function of t and i. Figure 2 shows the exact derivation. Thus, we only need to minimize the second term, which can again be modeled as an assignment problem and solved in polynomial time.
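The footrule objective above is a minimum-cost assignment of tuples to the k positions. As a sanity-check sketch (our own helper names; the rank distributions are given explicitly, and the assignment is solved by brute force rather than by a polynomial-time matching algorithm as in the text):

```python
from itertools import permutations

def footrule_consensus(rank_dist, k):
    """rank_dist[t][i-1] = Pr(r(t) = i) for i = 1..k; the leftover mass is
    Pr(r(t) > k). Returns the Top-k list minimizing E[F*(tau, tau_pw)] up to
    the constant C, by brute force over assignments (exponential; only for
    tiny instances)."""
    n = len(rank_dist)
    ups0 = [sum(d) for d in rank_dist]                        # Pr(t in Top-k)
    ups1 = [sum((j + 1) * p for j, p in enumerate(d)) for d in rank_dist]
    def ups2(t, i):                                           # i is 1-based
        d = rank_dist[t]
        return sum(p * abs(i - (j + 1)) for j, p in enumerate(d)) \
               - i * (1 - ups0[t])
    # f(t, i): cost of placing tuple t at position i of the consensus answer
    f = [[ups2(t, i) + ups1[t] - 2 * (k + 1) * ups0[t] for i in range(1, k + 1)]
         for t in range(n)]
    best = min(permutations(range(n), k),
               key=lambda perm: sum(f[t][i] for i, t in enumerate(perm)))
    return list(best)
```

For instance, if one tuple is always ranked first and another always second, the consensus answer for k = 2 is exactly that pair in that order.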
Kendall's tau distance (also called the Kemeny distance) d_K between two Top-k lists τ_1 and τ_2 is defined to be the number of unordered pairs (t_i, t_j) whose relative order disagrees in the full rankings extended from τ_1 and τ_2, respectively. It is shown that d_F and d_K, together with a few other generalizations of the Spearman footrule and Kendall tau metrics, form a big equivalence class, i.e., they are within constant factors of each other [16]. Therefore, the optimal solution for d_F implies constant-factor approximations for all metrics in this class (the constant for d_K is 2).

However, we can also easily obtain a 3/2-approximation for d_K by extending the 3/2-approximation for the partial rank aggregation problem due to Ailon [1]. The only information used in that algorithm is, for each pair i, j, the proportion of input lists in which t_i is ranked higher than t_j. In our case, this corresponds to Pr(r(t_i) < r(t_j)), which can be computed in polynomial time using the generating functions method. We also note that the problem of optimally computing the mean answer is NP-hard for probabilistic and/xor trees. This follows from the fact that probabilistic and/xor trees can simulate arbitrary sets of possible worlds, and previous work has shown that aggregating even 4 rankings under this distance metric is NP-hard [14].

We briefly extend the notion of consensus answers to two other types of queries and present some initial results.
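The pivot routine at the heart of such rank aggregation algorithms can be sketched as follows, given the pairwise probabilities w[i][j] = Pr(r(t_i) < r(t_j)). This is a simplified QuickSort-style (KwikSort) sketch for full rankings with our own names, not Ailon's exact algorithm, which additionally uses LP-guided pivot decisions:

```python
import random

def pivot_rank(items, w, rng=random.Random(0)):
    """Order items by recursive pivoting: place i before the pivot p when
    w[i][p] > 1/2, i.e., a majority of possible worlds rank t_i above t_p."""
    if len(items) <= 1:
        return list(items)
    p = rng.choice(items)
    left = [i for i in items if i != p and w[i][p] > 0.5]
    right = [i for i in items if i != p and w[i][p] <= 0.5]
    return pivot_rank(left, w, rng) + [p] + pivot_rank(right, w, rng)
```

When the pairwise majorities are consistent (no majority cycles), the output is the induced total order regardless of the pivot choices.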
Consider a query of the type: select groupname, count(*) from R group by groupname
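Under the model described next (n independent tuples, with p[i][j] the probability that tuple i falls in group j), the mean answer is a single linearity-of-expectation computation, and for tiny instances the closest possible integral answer can be found by brute force. A minimal sketch (names are ours; the text gives a polynomial min-cost-flow algorithm instead of the brute-force search):

```python
from itertools import product

def mean_answer(P):
    """Mean group-by count vector: rbar[j] = sum_i p[i][j]."""
    m = len(P[0])
    return [sum(row[j] for row in P) for j in range(m)]

def closest_possible_answer(P):
    """Enumerate all group assignments with non-zero probability and return
    the achievable count vector minimizing the squared distance to the mean.
    Exponential in the number of tuples; illustration only."""
    rbar = mean_answer(P)
    m = len(P[0])
    best, best_cost = None, float("inf")
    for assign in product(*[[j for j in range(m) if row[j] > 0] for row in P]):
        r = [0] * m
        for j in assign:
            r[j] += 1
        cost = sum((r[j] - rbar[j]) ** 2 for j in range(m))
        if cost < best_cost:
            best, best_cost = r, cost
    return best
```

On a small instance the returned vector rounds each mean entry to a neighboring integer, which is exactly the structural property established in Lemma 3 below.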
Suppose there are m potential groups (indexed by groupname) and n independent tuples with attribute uncertainty. The probabilistic database can be specified by a matrix P = [p_{i,j}]_{n×m}, where p_{i,j} is the probability that tuple i takes groupname j, and Σ_{j=1}^m p_{i,j} = 1 for each 1 ≤ i ≤ n. A query result (on a deterministic relation) is an m-dimensional vector r whose j-th entry is the number of tuples with groupname j. The natural distance metric to use is the squared vector distance, d(r_1, r_2) = ||r_1 − r_2||².

Computing the mean answer is easy in this case because of linearity of expectation: we simply take the mean of each aggregate separately, i.e., r̄ = 1ᵀP, where 1 = (1, 1, ..., 1)ᵀ. We note that the mean answer minimizes the expected squared vector distance over all (not necessarily possible) answer vectors.

The median answer, however, is required to be a possible answer, and it is not clear how to compute it optimally in polynomial time: enumerating all worlds is computationally infeasible, and rounding the entries of r̄ to the nearest integers may not yield a possible answer. Next we present a polynomial time algorithm to find a possible answer closest to the mean answer r̄. This yields a 4-approximation for finding the median answer.

We can model the problem as follows. Consider the bipartite graph B(U, V, E), where each node in U is a tuple, each node in V is a groupname, and an edge (u, v), u ∈ U, v ∈ V, indicates that tuple u takes groupname v with non-zero probability. For an m-dimensional integral vector r, we call a subgraph G′ with deg_{G′}(u) = 1 for all u ∈ U and deg_{G′}(v) = r[v] for all v ∈ V an r-matching of B. Our objective is then to find an r-matching of B such that ||r − r̄||² is minimized. Before presenting the main algorithm, we need the following lemma.

Lemma 3
The possible world r* that is closest to r̄ is of the following form: r*[i] is either ⌊r̄[i]⌋ or ⌈r̄[i]⌉ for each 1 ≤ i ≤ m.

Proof:
Let M* be the corresponding r*-matching. Suppose the lemma is not true, i.e., there exists i such that |r*[i] − r̄[i]| > 1. W.l.o.g. we assume r*[i] > ⌈r̄[i]⌉; the other case can be proved in the same way. Consider the connected component K = {U′, V′, E(U′, V′)} containing i. We claim that there exists j ∈ V′ such that r*[j] < r̄[j] and there is an alternating path P with respect to M* connecting i and j. Then M′ = M* ⊕ P is also a valid matching; suppose M′ is an r′-matching. Note that r′[i] = r*[i] − 1, r′[j] = r*[j] + 1, and r′[v] = r*[v] for all other v. But then:

||r′ − r̄||² = Σ_{v=1}^m (r′[v] − r̄[v])²
= ||r* − r̄||² − (r*[i] − r̄[i])² − (r*[j] − r̄[j])² + (r*[i] − 1 − r̄[i])² + (r*[j] + 1 − r̄[j])²
= ||r* − r̄||² + 2 − 2(r*[i] − r̄[i]) + 2(r*[j] − r̄[j])
< ||r* − r̄||²,

since r*[i] − r̄[i] > 1 and r*[j] − r̄[j] < 0. This contradicts the assumption that r* is the vector closest to r̄.

Now we prove the claim. We grow an alternating-path tree (w.r.t. M*) rooted at i in a BFS manner: from nodes at odd depth we extend all edges in M*, and from nodes at even depth we extend all edges not in M* (the root i is at depth 1, so odd-depth nodes belong to V and even-depth nodes belong to U). Let O ⊆ V′ be the set of nodes at odd depth and E ⊆ U′ the set of nodes at even depth. It is easy to see that N_B(E) = O and Σ_{v∈O} r*[v] = |E|. Suppose, for contradiction, that r*[v] ≥ r̄[v] for all v ∈ O; recall that r*[i] > r̄[i]. The contradiction follows since:

|E| = Σ_{v∈O} r*[v] > Σ_{v∈O} r̄[v] = Σ_{v∈O} Σ_{u∈N_B(v)} P[u, v] ≥ Σ_{u∈E} Σ_{v∈N_B(u)} P[u, v] = |E|,

where the middle inequality holds because every edge incident to E is incident to O, and the last equality holds because Σ_{v∈N_B(u)} P[u, v] = 1 for each u ∈ E (all of u's neighbors lie in O). □

With Lemma 3 at hand, we can construct the following min-cost network flow instance to compute the vector r* closest to r̄. Add to B a source s and a sink t. Add an edge (s, u) of capacity 1 for each u ∈ U; these edges and the edges of B have cost 0. For each v ∈ V such that r̄[v] is not an integer, add two edges e_1(v, t) and e_2(v, t): e_1(v, t) has both lower and upper capacity bounds equal to ⌊r̄[v]⌋ and cost 0, and e_2(v, t) has capacity upper bound 1 and cost (⌈r̄[v]⌉ − r̄[v])² − (⌊r̄[v]⌋ − r̄[v])². If r̄[v] is an integer, we only add e_1(v, t). We find a min-cost integral flow of value n on this network. For each v such that e_2(v, t) is saturated, we set r*[v] = ⌈r̄[v]⌉, and r*[v] = ⌊r̄[v]⌋ otherwise. By Lemma 3, a flow of minimum cost yields the optimal vector r*.

Theorem 5
There is a polynomial time algorithm for finding the vector r* closest to r̄ such that r* corresponds to some possible answer with non-zero probability.

Finally, we can prove that:
Corollary 2
There is a polynomial time deterministic 4-approximation for finding the median aggregate answer.
Proof:
Suppose r* is the possible answer closest to the mean answer r̄, and let r_m be the median answer. Let r be the vector corresponding to a random possible answer. Since d is the squared vector distance, d(a, c) ≤ 2(d(a, b) + d(b, c)) for any vectors a, b, c. Then:

E[d(r*, r)] ≤ E[2(d(r*, r̄) + d(r̄, r))] = 2 (d(r*, r̄) + E[d(r̄, r)]) ≤ 4 E[d(r̄, r)] ≤ 4 E[d(r_m, r)].

The second inequality holds because d(r*, r̄) ≤ d(r, r̄) for every possible answer r, and hence d(r*, r̄) ≤ E[d(r̄, r)]; the last inequality holds because r̄ minimizes E[d(·, r)] over all vectors. □

The CONSENSUS-CLUSTERING problem is defined as follows: given k clusterings C_1, ..., C_k of V, find a clustering C that minimizes Σ_{i=1}^k d(C, C_i). In the setting of probabilistic databases, the given clusterings are the clusterings in the possible worlds, weighted by their existence probabilities. The main difficulty in extending the notion of consensus answers to clustering is that the input clusterings are not well-defined (unlike ranking, where the score function defines the ranking in any world). We therefore consider a somewhat simplified version of the problem: we assume that two tuples t_i and t_j are clustered together in a possible world if and only if they take the same value for the (uncertain) value attribute A. Thus, a possible world pw uniquely determines a clustering C_pw. We define the distance between two clusterings C_1 and C_2 to be the number of unordered pairs of tuples that are clustered together in one of them but separated in the other (the CONSENSUS-CLUSTERING metric). To deal with nonexistent keys in a possible world, we artificially create one cluster containing all of them.

Our task is to find a mean clustering C minimizing E[d(C, C_pw)]. A 4/3-approximation is known for CONSENSUS-CLUSTERING [2], and it can be adapted to our problem in a straightforward manner. In fact, that approximation algorithm only needs, for each pair of tuples t_i, t_j, the quantity w_{t_i,t_j}, the fraction of input clusterings that cluster t_i and t_j together; in our setting this can be computed as w_{t_i,t_j} = Σ_{a∈A} Pr(t_i.A = a ∧ t_j.A = a). To compute these quantities given an and/xor tree, we associate a variable x with all leaves with value (t_i, a) or (t_j, a), and the constant 1 with the other leaves. By Theorem 1, Pr(t_i.A = a ∧ t_j.A = a) is simply the coefficient of x² in the corresponding generating function.

Conclusion
We addressed the problem of finding a single representative answer to a query over probabilistic databases by generalizing the notion of inconsistent information integration. We believe this approach provides a systematic and formal way to reason about the semantics of probabilistic query answers, especially for
Top-k queries. Our initial work has opened up many interesting avenues for future work. These include the design of efficient exact and approximate algorithms for finding consensus answers for other types of queries, exploring connections to safe plans, and understanding the semantics of the other previously proposed ranking functions using this framework.

References

[1] Nir Ailon. Aggregation of partial rankings, p-ratings and top-m lists. In
SODA, pages 415–424, 2007. [2] Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: Ranking and clustering. J. ACM, 55(5), 2008. [3] Periklis Andritsos, Ariel Fuxman, and Renee J. Miller. Clean answers over dirty databases. In
ICDE , 2006.[4] Lyublena Antova, Christoph Koch, and Dan Olteanu. From complete to incomplete information and back. In
SIGMOD, 2007. [5] D. Barbara, H. Garcia-Molina, and D. Porter. The management of probabilistic data.
IEEE TKDE , 1992.[6] George Beskales, Mohamed A. Soliman, and Ihab F. Ilyas. Efficient search for the top-k probable nearest neighbors in uncertain databases.In
VLDB , 2008.[7] B. Buckles and F. E. Petry. A fuzzy model for relational databases.
Intl. Journal of Fuzzy Sets and Syst. , 1982.[8] Reynold Cheng, Dmitri Kalashnikov, and Sunil Prabhakar. Evaluating probabilistic queries over imprecise data. In
SIGMOD , 2003.[9] Graham Cormode, Feifei Li, and Ke Yi. Semantics of ranking queries for probabilistic data and expected ranks. In
ICDE , 2009.[10] Graham Cormode and Andrew McGregor. Approximation algorithms for clustering uncertain data. In
PODS , 2008.[11] Nilesh Dalvi and Dan Suciu. Efficient query evaluation on probabilistic databases. In
VLDB , 2004.[12] Nilesh Dalvi and Dan Suciu. Management of probabilistic data: Foundations and challenges. In
PODS , 2007.[13] Amol Deshpande, Carlos Guestrin, Sam Madden, Joseph M. Hellerstein, and Wei Hong. Model-driven data acquisition in sensor networks.In
VLDB , 2004.[14] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In
Proceedings of the Tenth International Conference on the World Wide Web (WWW), pages 613–622, 2001. [15] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation revisited. Manuscript, 2001. [16] Ronald Fagin, Ravi Kumar, and D. Sivakumar. Comparing top k lists.
SIAM J. Discrete Mathematics , 17(1):134–160, 2003.[17] N. Fuhr and T. Rolleke. A probabilistic relational algebra for the integration of information retrieval and database systems.
ACM Trans. on Info. Syst., 1997. [18] Minos Garofalakis and Dan Suciu, editors.
IEEE Data Engineering Bulletin Special Issue on Probabilistic Data Management . March 2006.[19] Gosta Grahne. Horn tables - an efficient tool for handling incomplete information in databases. In
PODS , 1989.[20] Rahul Gupta and Sunita Sarawagi. Creating probabilistic databases from information extraction models. In
VLDB , Seoul, Korea, 2006.[21] M. Hua, J. Pei, W. Zhang, and X. Lin. Efficiently answering probabilistic threshold top-k queries on uncertain data. In
ICDE , 2008.[22] M. Hua, J. Pei, W. Zhang, and X. Lin. Ranking queries on uncertain data: A probabilistic threshold approach. In
SIGMOD , 2008.[23] T. Imielinski and W. Lipski, Jr. Incomplete information in relational databases.
Journal of the ACM , 1984.[24] T. S. Jayram, Andrew McGregor, S. Muthukrishnan, and Erik Vee. Estimating statistical aggregates on probabilistic data streams. In
PODS, pages 243–252, 2007. [25] J.-C. Borda. Mémoire sur les élections au scrutin.
Histoire de l'Académie Royale des Sciences, 1781. [26] J. G. Kemeny. Mathematics without numbers.
Daedalus, 88:571–591, 1959. [27] J. Hodge and R. E. Klima.
The mathematics of voting and elections: a hands-on approach . AMS, 2000.[28] L. Lakshmanan, N. Leone, R. Ross, and V. S. Subrahmanian. Probview: a flexible probabilistic database system.
ACM Trans. on DB Syst., 1997. [29] Jian Li, Barna Saha, and Amol Deshpande. Ranking and clustering in probabilistic databases. Unpublished manuscript, 2008. [30] Silvio Micali and Vijay V. Vazirani. An O(√|V|·|E|) algorithm for finding maximum matching in general graphs. In FOCS '80: Proceedings of the 21st Annual Symposium on Foundations of Computer Science, pages 17–27, 1980.
[31] M. J. Condorcet. Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix. 1785. [32] Christopher Re, Nilesh Dalvi, and Dan Suciu. Efficient top-k query evaluation on probabilistic data. In
ICDE , 2007.[33] Christopher Re and Dan Suciu. Materialized views in probabilistic databases for information exchange and query optimization. In
VLDB ,Vienna, Austria, 2007.[34] A. Sarma, O. Benjelloun, A. Halevy, and J. Widom. Working models for uncertain data. In
ICDE , 2006.[35] Prithviraj Sen and Amol Deshpande. Representing and querying correlated tuples in probabilistic databases. In
ICDE , 2007.[36] Prithviraj Sen, Amol Deshpande, and Lise Getoor. Exploiting shared correlations in probabilistic databases. In
VLDB , 2008.[37] M. Soliman, I. Ilyas, and K. C. Chang. Top-k query processing in uncertain databases. In
ICDE, 2007. [38] Christopher Ré and Dan Suciu. Efficient evaluation of having queries on a probabilistic database. In
DBPL, 2007. [39] Daisy Zhe Wang, Eirinaios Michelakis, Minos Garofalakis, and Joseph M. Hellerstein. BayesStore: Managing large, uncertain data repositories with probabilistic graphical models. In
VLDB , Auckland, New Zealand, 2008.[40] J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. In
CIDR , 2005.[41] Ke Yi, Feifei Li, Divesh Srivastava, and George Kollios. Efficient processing of top-k queries in uncertain databases. In
ICDE , 2008.[42] Y.Wakabayashi. The complexity of computing medians of relations. In
Resenhas , volume 3(3), pages 323–349, 1998.[43] Xi Zhang and Jan Chomicki. On the semantics and evaluation of top-k queries in probabilistic databases. In
DBRank, 2008.