[PDF] On the Complexity and Approximation of Binary Evidence in Lifted Inference

Abstract

Lifted inference algorithms exploit symmetries in probabilistic models to speed up inference. They show impressive performance when calculating unconditional probabilities in relational models, but often resort to non-lifted inference when computing conditional probabilities. The reason is that conditioning on evidence breaks many of the model's symmetries, which can preempt standard lifting techniques. Recent theoretical results show, for example, that conditioning on evidence which corresponds to binary relations is #P-hard, suggesting that no lifting is to be expected in the worst case. In this paper, we balance this negative result by identifying the Boolean rank of the evidence as a key parameter for characterizing the complexity of conditioning in lifted inference. In particular, we show that conditioning on binary evidence with bounded Boolean rank is efficient. This opens up the possibility of approximating evidence by a low-rank Boolean matrix factorization, which we investigate both theoretically and empirically.

Full PDF

OOn the Complexity and Approximation ofBinary Evidence in Lifted Inference

Guy Van den Broeck and

Adnan Darwiche

Computer Science DepartmentUniversity of California, Los Angeles { guyvdb,darwiche } @cs.ucla.edu Abstract

Boolean rank of the evidence as a key parameter for charac-terizing the complexity of conditioning in lifted inference. In particular, we showthat conditioning on binary evidence with bounded Boolean rank is efﬁcient. Thisopens up the possibility of approximating evidence by a low-rank Boolean matrixfactorization, which we investigate both theoretically and empirically.

Statistical relational models are capable of representing both probabilistic dependencies and rela-tional structure [1, 2]. Due to their ﬁrst-order expressivity, they concisely represent probability dis-tributions over a large number of propositional random variables, causing inference in these modelsto quickly become intractable. Lifted inference algorithms [3] attempt to overcome this problem byexploiting symmetries found in the relational structure of the model.In the absence of evidence, exact lifted inference algorithms can work well. For large classes ofstatistical relational models [4], they perform inference that is polynomial in the number of objectsin the model [5], and are therein exponentially faster than classical inference algorithms. Whenconditioning a query on a set of evidence literals, however, these lifted algorithms lose their advan-tage over classical ones. The intuitive reason is that evidence breaks the symmetries in the model.The technical reason is that these algorithms perform an operation called shattering, which endsup reducing the ﬁrst-order model to a propositional one. This issue is implicitly reﬂected in theexperiment sections of exact lifted inference papers. Most report on experiments without evidence.Examples include publications on FOVE [3, 6, 7] and WFOMC [8, 5]. Others found ways to efﬁ-ciently deal with evidence on only unary predicates. They perform experiments without evidenceon binary or higher-arity relations. There are examples for FOVE [9, 10], WFOMC [11], PTP [12]and CP [13].This evidence problem has largely been ignored in the exact lifted inference literature, until recently,when Bui et al. [10] and Van den Broeck and Davis [11] showed that conditioning on unary evidenceis tractable. More precisely, conditioning on unary evidence is polynomial in the size of evidence.This type of evidence expresses attributes of objects in the world, but not relations between them.Unfortunately, Van den Broeck and Davis [11] also showed that this tractability does not extend to1 a r X i v : . [ c s . A I] N ov vidence on binary relations, for which conditioning on evidence is ∀ X, Y p( X, Y ) ) or false ( ∀ X, Y ¬ p( X, Y ) ). Asour ﬁrst main contribution , we formalize this intuition and characterize the complexity of condition-ing more precisely in terms of the Boolean rank of the evidence. We show that it is a measure ofhow much lifting is possible, and that one can efﬁciently condition on large amounts of evidence,provided that its Boolean rank is bounded.Despite the limitations, useful applications of exact lifted inference were found by sidestepping theevidence problem. For example, in lifted generative learning [14], the most challenging task is tocompute partition functions without evidence. Regardless, the lack of symmetries in real applica-tions is often cited as a reason for rejecting the idea of lifted inference entirely (informally calledthe “death sentence for lifted inference”). This problem has been avoided for too long, and aslifted inference gains maturity, solving it becomes paramount. As our second main contribution ,we present a ﬁrst general solution to the evidence problem. We propose to approximate evidence by an over-symmetric matrix, and will show that this can be achieved by minimizing Boolean rank.The need for approximating evidence is new and speciﬁc to lifted inference: in (undirected) proba-bilistic graphical models, more evidence typically makes inference easier. Practically, we will showthat existing tools from the data mining community can be used for this low-rank Boolean matrixfactorization task.The evidence problem is less pronounced in the approximate lifted inference literature. These al-gorithms often introduce approximations that lead to symmetries in their computation, even whenthere are no symmetries in the model. Also for approximate methods, however, the beneﬁts of lift-ing will decrease with the amount of symmetry-breaking evidence (e.g., Kersting et al. [15]). Wewill show experimentally that over-symmetric evidence approximation is also a viable technique forapproximate lifted inference.

Our analysis of conditioning is based on a reduction, turning evidence on a binary relation intoevidence on several unary predicates. We ﬁrst introduce some necessary background. An atom p( t , . . . , t n ) consists of a predicate p /n of arity n followed by n arguments, which are ei-ther (lowercase) constants or (uppercase) logical variables . A literal is an atom a or its negation ¬ a .A formula combines atoms with logical connectives (e.g., ∨ , ∧ , ⇔ ). A formula is ground if it doesnot contain any logical variables. A possible world assigns a truth value to each ground atom. Statis-tical relational languages deﬁne a probability distribution over possible words, where ground atomsare individual random variables. Numerous languages have been proposed in recent years, and ouranalysis will apply to many, including MLNs [16], parfactors [3] and WFOMC problems [8].

Example 1.

The following MLNs model the dependencies between web pages. A ﬁrst, peer-to-peermodel says that student web pages are more likely to link to other student pages. w studentpage( X ) ∧ linkto( X, Y ) ⇒ studentpage( Y ) It increases the probability of a world by a factor e w with every pair of pages X, Y that satisﬁes theformula. A second, hierarchical model says that professors are more likely to link to course pages. w profpage( X ) ∧ linkto( X, Y ) ⇒ coursepage( Y ) In this context, evidence e is a truth-value assignment to a set of ground atoms, and is of-ten represented as a conjunction of literals. In unary evidence , atoms have one argument (e.g., studentpage( a ) ) while in binary evidence , they have two (e.g., linkto( a, b ) ). Without loss of gen-erality, we assume full evidence on certain predicates (i.e., all their ground atoms are in e ). We willsometimes represent unary evidence as a Boolean vector and binary evidence as a Boolean matrix. Partial evidence on the relation p can be encoded as full evidence on predicates p and p by addingformulas ∀ X, Y p( X, Y ) ⇐ p ( X, Y ) and ∀ X, Y ¬ p( X, Y ) ⇐ p ( X, Y ) to the model. xample 2. Evidence e = p( a, a ) ∧ p( a, b ) ∧ ¬ p( a, c ) ∧ · · · ∧ ¬ p( d, c ) ∧ p( d, d ) is represented by P =  p( X,Y ) Y = a Y = b Y = c Y = dX = a X = b X = c X = d  We will look at computing conditional probabilities

Pr( q | e ) for single ground atoms q . Finally, weassume a representation language that can express universally quantiﬁed logical constraints. Certain binary relations can be represented by a pair of unary predicates. By adding the formula ∀ X, ∀ Y, p( X, Y ) ⇔ q( X ) ∧ r( Y ) (1)to our statistical relational model and conditioning on the q and r relations, we can condition oncertain types of binary p relations. Assuming that we condition on the q and r predicates, addingthis formula (as hard clauses) to the model does not change the probability distribution over theatoms in the original model. It is merely an indirect way of conditioning on the p relation.If we now represent these unary relations by vectors q and r , and the binary relation by the binarymatrix P , the above technique allows us to condition on any relation P that can be factorized in theouter vector product P = q r (cid:124) . Example 3.

Consider the following outer vector factorization of the Boolean matrix P . P =   =     (cid:124) In a model containing Formula 1, this factorization indicates that we can condition on the 16 binaryevidence literals ¬ p( a, a ) ∧ ¬ p( a, b ) ∧ · · · ∧ ¬ p( d, c ) ∧ p( d, d ) of P by conditioning on the the 8unary literals ¬ q( a ) ∧ q( b ) ∧ ¬ q( c ) ∧ q( d ) ∧ r( a ) ∧ ¬ r( b ) ∧ ¬ r( c ) ∧ r( d ) represented by q and r . This idea of encoding a binary relation in unary relations can be generalized to n pairs of unaryrelations, by adding the following formula to our model. ∀ X, ∀ Y, p( X, Y ) ⇔ (q ( X ) ∧ r ( Y )) ∨ (q ( X ) ∧ r ( Y )) ∨ · · · ∨ (q n ( X ) ∧ r n ( Y )) (2)By conditioning on the q i and r i relations, we can now condition on a much richer set of binary p relations. The relations that can be expressed this way are all the matrices that can be representedby the sum of outer products (in Boolean algebra, where + is ∨ and ∨ ): P = q r (cid:124) ∨ q r (cid:124) ∨ · · · ∨ q n r (cid:124) n = Q R (cid:124) (3)where the columns of Q and R are the q i and r i vectors respectively, and the matrix multiplicationis performed in Boolean algebra, that is, ( Q R (cid:124) ) i,j = (cid:87) r Q i,r ∧ R j,r Example 4.

Consider the following P , its decomposition into a sum/disjunction of outer vectorproducts, and the corresponding Boolean matrix multiplication. P =   =     (cid:124) ∨     (cid:124) ∨     (cid:124) =     (cid:124) This factorization shows that we can condition on the binary evidence literals of P (see Example 2)by conditioning on the unary literals e = [ ¬ q ( a ) ∧ q ( b ) ∧ ¬ q ( c ) ∧ q ( d )] ∧ [r ( a ) ∧ ¬ r ( b ) ∧ ¬ r ( c ) ∧ r ( d )] ∧ [q ( a ) ∧ q ( b ) ∧ ¬ q ( c ) ∧ ¬ q ( d )] ∧ [r ( a ) ∧ r ( b ) ∧ ¬ r ( c ) ∧ ¬ r ( d )] ∧ [ ¬ q ( a ) ∧ ¬ q ( b ) ∧ q ( c ) ∧ ¬ q ( d )] ∧ [ ¬ r ( a ) ∧ ¬ r ( b ) ∧ r ( c ) ∧ ¬ r ( d )] . Boolean Matrix Factorization

Matrix factorization (or decomposition) is a popular linear algebra tool. Some well-known instancesare singular value decomposition and non-negative matrix factorization (NMF) [17, 18]. NMFfactorizes into a product of non-negative matrices, which are more easily interpretable, and thereforeattracted much attention for unsupervised learning and feature extraction. These factorizations allwork with real-valued matrices. We instead consider Boolean-valued matrices, with only 0/1 entries.

Factorizing a matrix P as Q R (cid:124) in Boolean algebra is a known problem called Boolean MatrixFactorization (BMF) [19, 20]. BMF factorizes a ( k × l ) matrix P into a ( k × n ) matrix Q and a( l × n ) matrix R , where potentially n (cid:28) k and n (cid:28) l and we always have that n ≤ min( k, l ) .Any Boolean matrix can be factorized this way and the smallest number n for which it is possible iscalled the Boolean rank of the matrix. Unlike (textbook) real-valued rank, computing the Booleanrank is NP-hard and cannot be approximated unless P=NP [19]. The Boolean and real-valued rankare incomparable, and the Boolean rank can be exponentially smaller than the real-valued rank.

Example 5.

The factorization in Example 4 is a BMF with Boolean rank 3. It is only a decompo-sition in Boolean algebra and not over the real numbers. Indeed, the matrix product over the realscontains an incorrect value of 2:   × real   (cid:124) =   (cid:54) = P Note that P is of full real-valued rank (having four non-zero singular values) and that its Booleanrank is lower than its real-valued rank. Computing Boolean ranks is a theoretical problem. Because most real-world matrices will havenearly full rank (i.e., almost min( k, l ) ), applications of BMF look at approximate factorizations. Thegoal is to ﬁnd a pair of (small) Boolean matrices Q k × n and R l × n such that P k × l ≈ (cid:0) Q k × n R (cid:124) l × n (cid:1) ,or more speciﬁcally, to ﬁnd matrices that optimize some objective that trades off approximationerror and Boolean rank n . When n (cid:28) k and n (cid:28) l , this approximation extracts interesting structureand removes noise from the matrix. This has caused BMF to receive considerable attention in thedata mining community recently, as a tool for analyzing high-dimensional data. It is used to ﬁndimportant and interpretable (i.e., Boolean) concepts in a data matrix.Unfortunately, the approximate BMF optimization problem is NP-hard as well, and inapprox-imable [20]. However, several algorithms have been proposed that work well in practice. Algo-rithms exist that ﬁnd good approximations for ﬁxed values of n [20], or when P is sparse [21].BMF is related to other data mining tasks, such as biclustering [22] and tiling databases [23], whosealgorithms could also be used for approximate BMF. In the context of social network analysis, BMFis related to stochastic block models [24] and their extensions, such as inﬁnite relational models. Our goal in this section is to provide a new complexity result for reasoning with binary evidencein the context of lifted inference. Our result can be thought of as a parametrized complexity re-sult, similar to ones based on treewidth in the case of propositional inference. To state the newresult, however, we must ﬁrst deﬁne formally the computational task. We will also review the keycomplexity result that is known about this computation now (i.e., the one we will be improving on).Consider an MLN ∆ and let Γ m contain a set of ground literals representing binary evidence. That is,for some binary predicate p( X, Y ) , evidence Γ m contains precisely one literal (positive or negative)for each grounding of predicate p( X, Y ) . Here, m represents the number of objects that parameters X and Y may take. Therefore, evidence Γ m must contain precisely m literals. We assume without loss of generality that all logical variables range over the same set of objects. Pr m is the distribution induced by MLN ∆ over m objects, and q is a groundliteral. Our analysis will apply to classes of models ∆ that are domain-liftable [4], which means thatthe complexity of computing Pr m ( q ) without evidence is polynomial in m . One such class is theset of MLNs with two logical variables per formula [5].Our task is then to compute the posterior probability Pr m ( q | e m ) , where e m is a conjunction of theground literals in binary evidence Γ m . Moreover, our goal here is to characterize the complexity ofthis computation as a function of evidence size m .The following recent result provides a lower bound on the complexity of this computation [11]. Theorem 1.

Suppose that evidence Γ m is binary. Then there exists a domain-liftable MLN ∆ witha corresponding distribution Pr m , and a posterior marginal Pr m ( q | e m ) that cannot be computedby any algorithm whose complexity grows polynomially in evidence size m , unless P = N P . This is an analogue to results according to which, for example, the complexity of computing poste-rior probabilities in propositional graphical models is exponential in the worst case. Yet, for thesemodels, the complexity of inference can be parametrized, allowing one to bound the complexity ofinference on some models. Perhaps the best example of such a parametrized complexity is the onebased on treewidth, which can be thought of as a measure of the model’s sparsity (or tree-likeness).In this case, inference can be shown to be linear in the size of the model and exponential only in itstreewidth. Hence, this parametrized complexity result allows us to state that inference can be doneefﬁciently on models with bounded treewidth.We now provide a similar parameterized complexity result, but for evidence in lifted inference. Inthis case, the parameter we use to characterize complexity is that of Boolean rank.

Theorem 2.

Suppose that evidence Γ m is binary and has a bounded Boolean rank. Then for everydomain-liftable MLN ∆ and corresponding distribution Pr m , the complexity of computing posteriormarginal Pr m ( q | e m ) grows polynomially in evidence size m . The proof of this theorem is based on the reduction from binary to unary evidence, which is describedin Section 2. In particular, our reduction ﬁrst extends the MLN ∆ with Formula 2, leading to the newMLN ∆ (cid:48) and new pairs of unary predicates q i and r i . This does not change the domain-liftabilityof ∆ (cid:48) , as Formula 2 is itself liftable. We then replace binary evidence Γ m by unary evidence Γ (cid:48) . Thatis, the ground literals of the binary predicate p are replaced by ground literals of the unary predicates q i and r i (see Example 4). This unary evidence is obtained by Boolean matrix factorization. As thematrix size in our reduction is m , the following Lemma implies that the ﬁrst step of our reductionis polynomial in m for bounded rank evidence. Lemma 3 (Miettinen [25]) . The complexity of Boolean matrix factorization for matrices withbounded Boolean rank is polynomial in their size.

The main observation in our reduction is that Formula 2 has size n , which is the Boolean rank of thegiven binary evidence. Hence, when the Boolean rank n is bounded by a constant, the size of theextended MLN ∆ (cid:48) is independent of the evidence size and is proportional to the size of the originalMLN ∆ .We have now reduced inference on MLN ∆ and binary evidence Γ m into inference on an extendedMLN ∆ (cid:48) and unary evidence Γ (cid:48) . The second observation behind the proof is the following. Lemma 4 (Van den Broeck and Davis [11], Van den Broeck [26]) . Suppose that evidence Γ m is unary . Then for every domain-liftable MLN ∆ and corresponding distribution Pr m , the complexityof computing posterior marginal Pr m ( q | e m ) grows polynomially in evidence size m . Hence, computing posterior probabilities can be done in time which is polynomial in the size ofunary evidence m , which completes our proof.We can now identify additional similarities between treewidth and Boolean rank. Exact inference al-gorithms for probabilistic graphical models typically perform two steps, namely to (a) compute a treedecomposition of the graphical model (or a corresponding variable order), and (b) perform inferencethat is polynomial in the size of the decomposition, but potentially exponential in its (tree)width. Theanalogous steps for conditioning are to (a) perform a BMF, and (b) perform inference that is polyno-mial in the size of the BMF, but potentially exponential in its rank. The (a) steps are both NP-hard,yet are efﬁcient assuming bounded treewidth [27] or bounded Boolean rank (Lemma 3). Whereas5reewidth is a measure of tree-likeness and sparsity of the graphical model, Boolean rank seems tobe a fundamentally different property, more related to the presence of symmetries in evidence. Theorem 2 opens up many new possibilities. Even for evidence with high Boolean rank, it is possibleto ﬁnd a low-rank approximate BMF of the evidence, as is commonly done for other data miningand machine learning problems. Algorithms already exist for solving this task (cf. Section 3).

Example 6.

The evidence matrix from Example 4 has Boolean rank three. Dropping the third pairof vectors reduces the Boolean rank to two.   ≈     (cid:124) ∨     (cid:124) (cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:64)(cid:64)(cid:64)(cid:64)(cid:64)(cid:64) ∨     (cid:124) =     (cid:124) = 

01 0 0 1 

This factorization is approximate, as it ﬂips the evidence for atom p( c, c ) from true to false (repre-sented by the bold 0). By paying this price, the evidence has more symmetries, and we can conditionon the binary relation by introducing only two instead of three new pairs (q i , r i ) of unary predicates.Low-rank approximate BMF is an instance of a more general idea; that of over-symmetric evidenceapproximation . This means that when we want to compute Pr( q | e ) , we approximate it by comput-ing Pr( q | e (cid:48) ) instead, with evidence e (cid:48) that permits more efﬁcient inference. In this case, it is moreefﬁcient because it maintains more symmetries of the model and permits more lifting. Because alllifted inference algorithms, exact or approximate, exploit symmetries, we expect this general idea,and low-rank approximate BMF in particular, to improve the performance of any lifted inferencealgorithm.Having a small amount of incorrect evidence in the approximation need not be a problem. As theseliterals are not covered by the ﬁrst most important vector pairs, they can be considered as noise inthe original matrix. Hence, a low-rank approximation may actually improve the performance of, forexample, a lifted collective classiﬁcation algorithm. On the other hand, the approximation made inExample 6 may not be desirable if we are querying attributes of the constant c , and we may preferto approximate other areas of the evidence matrix instead. There are many challenges in ﬁndingappropriate evidence approximations, which makes the task all the more interesting. To complement the theoretical analysis from the previous sections, we will now report on experi-ments that investigate the following practical questions. Q1 How well can we approximate a real-world relational data set by a low-rank Boolean matrix? Q2 Is Boolean rank a good indicator of the complexity of inference, as suggested by Theorem 2? Q3 Is over-symmetric evidence approximation a viable technique for approximate lifted inference?To answer Q1 , we compute approximations of the linkto binary relation in the WebKB data setusing the ASSO algorithm for approximate BMF [20]. The WebKB data set consists of web pagesfrom the computer science departments of four universities [28]. The data has information aboutwords that appear on pages, labels of pages and links between web pages ( linkto relation). Thereare four folds, one for each university. The exact evidence matrix for the linkto relation ranges insize from 861 by 861 to 1240 by 1240. Its real-valued rank ranges from 384 to 503. Performing aBMF approximation in this domain adds or removes hyperlinks between web pages, so that moreweb pages can be grouped together that behave similarly.Figure 1 plots the approximation error for increasing Boolean ranks, measured as the number ofincorrect evidence literals. The error goes down quickly for low rank, and is reduced by half afterBoolean rank 70 to 80, even though the matrix dimensions and real-valued rank are much higher.Note that these evidence matrices contain around a million entries, and are sparse. Hence, theseapproximations correctly label . to . of the atoms.6

20 40 60 80 100 1200100020003000 Boolean rank E rr o r cornelltexaswashingtonwisconsin Figure 1: Approximation BMF error in termsof the number of incorrect literals for theWebKB linkto relation.

Rank n Circuit Size (a) Circuit Size (b)0 18 241 58 502 160 1293 1873 3714 > Figure 2: First-order NNF circuit size (numberof nodes) for increasing Boolean rank n , and(a) the peer to peer and (b) hierarchical model. R e l a ti v e K L D Iteration (a) Texas Data Set R e l a ti v e K L D Iteration (b) Wisconsin Data Set

Figure 3: KLD of LMCMC on different BMF approximations, relative to the KLD of vanilla MCMCon the same approximation. From top to bottom, the lines represent exact evidence (blue), andapproximations (red) of rank 150, 100, 75, 50, 20, 10, 5, 2, and 1.To answer Q2 , we perform two sets of experiments. Firstly, we look at exact lifted inference andinvestigate the inﬂuence of adding Formula 2 to the “peer-to-peer” and “hierarchical” MLNs fromExample 1. The goals is to condition on linkto relations with increasing rank n . These modelsare compiled using the WFOMC [8] algorithm into ﬁrst-order NNF circuits, which allow for exactdomain-lifted inference (c.f., Lemma 4). Table 2 shows the sizes of these circuits. As expected,circuit sizes grow exponentially with n . Evidence breaks more symmetries in the peer-to-peer modelthan in the hierarchical model, causing the circuit size to increase more quickly with Boolean rank.Since the connection between rank and exact inference is obvious from Theorem 2, the moreinteresting question in Q2 is whether Boolean rank is indicative of the complexity of approxi-mate lifted inference as well. Therefore, we investigate its inﬂuence on the Lifted MCMC algo-rithm (LMCMC) [29] with Rao-Blackwellized probability estimation [30]. LMCMC interleavesstandard MCMC steps (here Gibbs sampling) with jumps to states that are symmetric in the graphi-cal model, in order to speed up mixing of the chain. We run LMCMC on the WebKB MLN of Davisand Domingos [31], which has 333 ﬁrst-order formulas and over 1 million random variables. Itclassiﬁes web pages into 6 categories, based on their link structure and the 50 most predictive wordsthey contain. We learn its parameters with the Alchemy package and obtain evidence sets of varyingBoolean rank from the factorizations of Figure 1. . For these, we run both vanilla and lifted MCMC,and measure the KL divergence (KLD) between the marginal distribution at each iteration , and aground truth obtained from 3 million iterations on the corresponding evidence set. Figure 3 plots theKLD of LMCMC divided by the KLD of MCMC. It shows that the improvement of LMCMC overMCMC goes down with Boolean rank, answering Q2 positively.To answer Q3 , we look at the KLD between different evidence approximations Pr( . | e (cid:48) n ) of rank n , and the true marginals Pr( . | e ) conditioned on exact evidence. As this requires a good estimateof Pr( . | e ) , we make our learned WebKB model more tractable by removing formulas about wordcontent. For two approximations e (cid:48) a and e (cid:48) b such that rank a < b , we expect LMCMC to convergefaster to Pr( . | e (cid:48) a ) than to Pr( . | e (cid:48) b ) , as suggested by Figure 3. However, because Pr( . | e (cid:48) a ) is a morecrude approximation of Pr( . | e ) than Pr( . | e (cid:48) b ) is, the KLD at convergence should be worse for a When synthetically generating evidence of these ranks, results are comparable. Runtime per iteration is comparable for both algorithms. BMF runtime is negligible. K L D i v e r g e n ce IterationGround MCMCLifted MCMCLifted MCMC (Rank 2)Lifted MCMC (Rank 10) (a) Cornell, Ranks 2 and 10 K L D i v e r g e n ce IterationGround MCMCLifted MCMCLifted MCMC (Rank 75)Lifted MCMC (Rank 150) (b) Cornell, Ranks 75 and 150 K L D i v e r g e n ce IterationGround MCMCLifted MCMCLifted MCMC (Rank 75)Lifted MCMC (Rank 150) (c) Washington, Ranks 75 and 150 K L D i v e r g e n ce IterationGround MCMCLifted MCMCLifted MCMC (Rank 75)Lifted MCMC (Rank 150) (d) Wisconsin, Ranks 75 and 150

Figure 4: Error for different low-rank approximations of WebKB, in KLD from true marginals.than for b . Hence, we expect to see a trade-off , where the lowest ranks are optimal in the beginning,higher ranks become optimal later one, and the exact model is optimal at convergence.Figure 4 shows exactly that, for a representative sample of ranks and data sets. In Figure 4(a), rank2 and 10 outperform LMCMC with the exact evidence at ﬁrst. Exact evidence overtakes rank 2after 40k iterations, and rank 10 after 50k. After 80k iterations, even non-lifted MCMC outperformsthese crude approximations. Figure 4(b) shows the other side of the spectrum, where a rank 75and 150 approximation are overtaken at iterations 90k and 125k. Figure 4(c) is representative ofother datasets. Note here that at around iteration 50k, rank 75 in turn outperforms the rank 150approximation, which has fewer symmetries and does not permit as much lifting. Finally, Figure 4(d)shows the ideal case for low-rank approximation. This is the largest dataset, and therefore the mostchallenging inference task. Here, LMCMC on e converges slowly compared to its approximations e (cid:48) ,and e (cid:48) results in almost perfect marginals. The crossover point where exact inference outperformsthe approximation is never reached in practice. This answers Q3 positively. We presented two main results. The ﬁrst is a more precise complexity characterization of condi-tioning on binary evidence, in terms of its Boolean rank. The second is a technique to approximatebinary evidence by a low-rank Boolean matrix factorization. This is a ﬁrst type of over-symmetricevidence approximation that can speed up lifted inference. We showed empirically that low-rankBMF speeds up approximate inference, leading to improved approximations.For future work, we want to evaluate the practical implications of the theory developed for otherlifted inference algorithms, such as lifted BP, and look at the performance of over-symmetric evi-dence approximation on machine learning tasks such as collective classiﬁcation. There are manyremaining challenges in ﬁnding good evidence-approximation schemes, including ones that arequery-speciﬁc (cf. de Salvo Braz et al. [32]) or that incrementally run inference to ﬁnd better ap-proximations (cf. Kersting et al. [33]). Furthermore, we want to investigate other subsets of binaryrelations for which conditioning could be efﬁcient, in particular functional relations p( X, Y ) , whereeach X has at most a limited number of associated Y values. Acknowledgments

We thank Pauli Miettinen, Mathias Niepert, and Jilles Vreeken for helpful suggestions. This workwas supported by ONR grant eferences [1] L. Getoor and B. Taskar, editors.

An Introduction to Statistical Relational Learning . MIT Press, 2007.[2] Luc De Raedt, Paolo Frasconi, Kristian Kersting, and Stephen Muggleton, editors.

Probabilistic inductivelogic programming: theory and applications . Springer-Verlag, 2008.[3] David Poole. First-order probabilistic inference. In

Proceedings of IJCAI , pages 985–991, 2003.[4] Manfred Jaeger and Guy Van den Broeck. Liftability of probabilistic inference: Upper and lower bounds.In

Proceedings of the 2nd International Workshop on Statistical Relational AI, , 2012.[5] Guy Van den Broeck. On the completeness of ﬁrst-order knowledge compilation for lifted probabilisticinference. In

Advances in Neural Information Processing Systems 24 (NIPS) , pages 1386–1394, 2011.[6] Rodrigo de Salvo Braz, Eyal Amir, and Dan Roth. Lifted ﬁrst-order probabilistic inference. In

Proceed-ings of the International Joint Conference on Artiﬁcial Intelligence (IJCAI) , pages 1319–1325, 2005.[7] B. Milch, L.S. Zettlemoyer, K. Kersting, M. Haimes, and L.P. Kaelbling. Lifted probabilistic inferencewith counting formulas.

Proceedings of the 23rd AAAI Conference on Artiﬁcial Intelligence , 2008.[8] Guy Van den Broeck, Nima Taghipour, Wannes Meert, Jesse Davis, and Luc De Raedt. Lifted probabilisticinference by ﬁrst-order knowledge compilation. In

Proceedings of IJCAI , pages 2178–2185, 2011.[9] N. Taghipour, D. Fierens, J. Davis, and H. Blockeel. Lifted variable elimination with arbitrary constraints.In

Proceedings of the 15th International Conference on Artiﬁcial Intelligence and Statistics , 2012.[10] H.H. Bui, T.N. Huynh, and R. de Salvo Braz. Exact lifted inference with distinct soft evidence on everyobject. In

Proceedings of the 26th AAAI Conference on Artiﬁcial Intelligence , 2012.[11] Guy Van den Broeck and Jesse Davis. Conditioning in ﬁrst-order knowledge compilation and liftedprobabilistic inference. In

Proceedings of the 26th AAAI Conference on Artiﬁcial Intelligence, , 2012.[12] Vibhav Gogate and Pedro Domingos. Probabilistic theorem proving. In

Proceedings of the 27th Confer-ence on Uncertainty in Artiﬁcial Intelligence (UAI) , pages 256–265, 2011.[13] A. Jha, V. Gogate, A. Meliou, and D. Suciu. Lifted inference seen from the other side: The tractablefeatures. In

Proceedings of the 24th Conference on Neural Information Processing Systems (NIPS) , 2010.[14] Guy Van den Broeck, Wannes Meert, and Jesse Davis. Lifted generative parameter learning. In

StatisticalRelational AI (StaRAI) workshop , July 2013.[15] K. Kersting, B. Ahmadi, and S. Natarajan. Counting belief propagation. In

Proceedings of the 25thConference on Uncertainty in Artiﬁcial Intelligence (UAI) , pages 277–284, 2009.[16] M. Richardson and P. Domingos. Markov logic networks.

Machine learning , 62(1):107–136, 2006.[17] D. Seung and L. Lee. Algorithms for non-negative matrix factorization.

Advances in neural informationprocessing systems , 13:556–562, 2001.[18] M. Berry, M. Browne, A. Langville, V. Pauca, and R. Plemmons. Algorithms and applications for ap-proximate nonnegative matrix factorization. In

Computational Statistics and Data Analysis , 2006.[19] Pauli Miettinen, Taneli Mielik¨ainen, Aristides Gionis, Gautam Das, and Heikki Mannila. The discretebasis problem. In

Knowledge Discovery in Databases , pages 335–346. Springer, 2006.[20] Pauli Miettinen, Taneli Mielikainen, Aristides Gionis, Gautam Das, and Heikki Mannila. The discretebasis problem.

IEEE Transactions on Knowledge and Data Engineering , 20(10):1348–1362, 2008.[21] Pauli Miettinen. Sparse Boolean matrix factorizations. In

IEEE 10th International Conference on DataMining (ICDM) , pages 935–940. IEEE, 2010.[22] Boris Mirkin.

Mathematical classiﬁcation and clustering , volume 11. Kluwer Academic Pub, 1996.[23] Floris Geerts, Bart Goethals, and Taneli Mielik¨ainen. Tiling databases. In

Discovery science , 2004.[24] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps.

Social networks , 5(2):109–137, 1983.[25] Pauli Miettinen.

Matrix decomposition methods for data mining: Computational complexity and algo-rithms . PhD thesis, 2009.[26] Guy Van den Broeck.

Lifted Inference and Learning in Statistical Relational Models . PhD thesis, KULeuven, January 2013.[27] Hans L Bodlaender.

Treewidth: Algorithmic techniques and results . Springer, 1997.[28] M. Craven and S. Slattery. Relational learning with statistical predicate invention: Better models forhypertext.

Machine Learning Journal , 43(1/2):97–119, 2001.[29] Mathias Niepert. Markov chains on orbits of permutation groups. In

Proceedings of the 28th Conferenceon Uncertainty in Artiﬁcial Intelligence (UAI) , 2012.[30] Mathias Niepert. Symmetry-aware marginal density estimation. In

Proceedings of the 27th Conferenceon Artiﬁcial Intelligence (AAAI) , 2013.[31] Jesse Davis and Pedro Domingos. Deep transfer via second-order markov logic. In

Proceedings of the26th annual international conference on machine learning , pages 217–224, 2009.[32] R. de Salvo Braz, S. Natarajan, H. Bui, J. Shavlik, and S. Russell. Anytime lifted belief propagation.

Proceedings of the 6th International Workshop on Statistical Relational Learning , 2009.[33] K. Kersting, Y. El Massaoudi, B. Ahmadi, and F. Hadiji. Informed lifting for message-passing. In

Proceedings of the 24th AAAI Conference on Artiﬁcial Intelligence, , 2010., 2010.