Context-Specific Likelihood Weighting
Nitesh Kumar    Ondřej Kuželka
Department of Computer Science and Leuven.AI, KU Leuven, Belgium
Department of Computer Science, Czech Technical University in Prague, Czechia
Abstract
Sampling is a popular method for approximate inference when exact inference is impractical. Generally, sampling algorithms do not exploit context-specific independence (CSI) properties of probability distributions. We introduce context-specific likelihood weighting (CS-LW), a new sampling methodology, which besides exploiting the classical conditional independence properties, also exploits CSI properties. Unlike the standard likelihood weighting, CS-LW is based on partial assignments of random variables and requires fewer samples for convergence due to the sampling variance reduction. Furthermore, the speed of generating samples increases. Our novel notion of contextual assignments theoretically justifies CS-LW. We empirically show that CS-LW is competitive with state-of-the-art algorithms for approximate inference in the presence of a significant amount of CSIs.
1 Introduction

Exploiting independencies present in probability distributions is crucial for feasible probabilistic inference. Bayesian networks (BNs) qualitatively represent conditional independencies (CIs) over random variables, which allows inference algorithms to exploit them. In many applications, however, exact inference quickly becomes infeasible. The use of stochastic sampling for approximate inference is common in such applications. Sampling algorithms are simple yet powerful tools for inference. They can be applied to arbitrarily complex distributions, which is not true for exact inference algorithms. The design of efficient sampling algorithms for BNs has received much attention in the past. Unfortunately, BNs cannot represent certain independencies qualitatively: independencies that hold only in certain
contexts (Boutilier et al., 1996). These independencies are called context-specific independencies (CSIs). To illustrate them, consider the BN in Figure 1, where a tree structure is present in the conditional probability distribution (CPD) of a random variable E. If one observes the CPD carefully, one can conclude that P(E | A = 1, B, C) = P(E | A = 1), that is, P(E | A = 1, B, C) is the same for all values of B and C. The variable E is said to be independent of the variables {B, C} in the context A = 1. These independencies may have global implications; for instance, E ⊥ B, C | A = 1 implies E ⊥ D | H, A = 1. Sampling algorithms generally do not exploit CSIs arising due to structures within CPDs.

Figure 1: Context-Specific Independence.

One might think that structures in CPDs are accidental. It turns out, however, that such structures are common in many real-world settings. For example, consider a scenario (Koller and Friedman, 2009) where a symptom, fever, depends on diseases. It would be impractical for medical experts to answer exponentially many questions of the format: "What is the probability of high fever when the patient has disease A, does not have disease B, ...?" It might be the case that, if patients suffer from disease A, then they are certain to have a high fever, and our knowledge of their suffering from other diseases does not matter. One might argue: what if we automatically learn BNs from data? In this case, however, a huge amount of data would be needed to learn the parameters, exponential in the number of parents, required to describe a tabular CPD. Tree-CPDs, which require far fewer parameters, are a more efficient way of learning BNs automatically from data (Chickering et al., 1997; Friedman, 1998; Breese et al., 1998). Moreover, such structures naturally arise due to if-else conditions in programs written in probabilistic programming languages (PPLs).

There are exact inference algorithms that exploit CSIs and thus form the state of the art for exact inference (Friedman and Van den Broeck, 2018). These algorithms are based on the knowledge compilation technique (Darwiche, 2003), which uses logical reasoning to naturally exploit CSIs. An obvious question, then, is: how to design a sampling algorithm that naturally exploits CSIs, along with CIs? It is widely believed that CSI properties in distributions are difficult to harness for approximate inference (Friedman and Van den Broeck, 2018). In this paper, we answer this difficult question by developing a sampling algorithm that can harness both CI and CSI properties.

To realize this, we adopt likelihood weighting (LW, Shachter and Peot, 1990; Fung and Chang, 1990), a sampling algorithm for BNs, and extend it to a rule-based representation of distributions, since rules are known to represent the structures qualitatively (Poole, 1997). We call the resulting algorithm context-specific likelihood weighting (CS-LW) and provide its open-source implementation (available at https://github.com/niteshroyal/CS-LW.git). Additionally, we present a novel notion of contextual assignments that provides a theoretical framework for exploiting CSIs. Taking advantage of the better representation of structures via rules, CS-LW assigns only a subset of the variables required for computing a conditional query, leading to i) faster convergence, and ii) a faster speed of generating samples.
This contrasts with many modern sampling algorithms such as collapsed sampling, which speed up convergence by sampling only a subset of variables, but at the cost of a much reduced speed of generating samples. We empirically demonstrate that CS-LW is competitive with the state of the art.

2 Preliminaries

We denote random variables with uppercase letters (A) and their assignments with lowercase letters (a). Bold letters denote sets (A) and their assignments (a). The parents of a variable A are denoted by Pa(A) and their assignments by pa(A). In a probability distribution P(E, X, Z) specified by a Bayesian network B, E denotes the set of observed variables, X the set of unobserved query variables, and Z the set of unobserved variables other than query variables. The expected value of A relative to a distribution Q is denoted by E_Q[A]. Next, we briefly introduce LW, one of the most popular approximate inference algorithms for BNs.

2.1 Likelihood Weighting

A typical query to a probability distribution P(E, X, Z) is to compute P(x_q | e), that is, the probability of X being assigned x_q given that E is assigned e. Following Bayes' rule, we have:

P(x_q | e) = P(x_q, e) / P(e) = [Σ_{x,z} P(x, z, e) f(x)] / [Σ_{x,z} P(x, z, e)] = µ,

where f(x) is the indicator function 1{x = x_q}, which takes value 1 when x = x_q, and 0 otherwise. We can estimate µ using LW if we specify P using a Bayesian network B. LW belongs to a family of importance sampling schemes that are based on the observation

µ = [Σ_{x,z} Q(x, z, e) f(x) (P(x, z, e)/Q(x, z, e))] / [Σ_{x,z} Q(x, z, e) (P(x, z, e)/Q(x, z, e))],    (1)

where Q is a proposal distribution such that Q > 0 whenever P > 0. The distribution Q is different from P and is used to draw independent samples. Generally, Q is selected such that the samples can be drawn easily. In the case of LW, to draw a sample, variables X_i ∈ X ∪ Z are assigned values drawn from P(X_i | pa(X_i)), and variables in E are assigned their observed values. These variables are assigned in a topological ordering relative to the graph structure of B. Thus, the proposal distribution in the case of LW can be described as follows:

Q(X, Z, E) = ∏_{X_i ∈ X ∪ Z} P(X_i | Pa(X_i)) |_{E = e}.

Consequently, it is easy to compute the likelihood ratio P(x, z, e)/Q(x, z, e) in Equation 1. All factors in the numerator and denominator of the fraction cancel out except for P(x_i | pa(X_i)) where x_i ∈ e. Thus,

P(X, Z, e) / Q(X, Z, e) = ∏_{x_i ∈ e} P(x_i | Pa(X_i)) = ∏_{x_i ∈ e} W_{x_i} = W_e,

where W_{x_i}, which is also a random variable, is the weight of evidence x_i. The likelihood ratio W_e is the product of all of these weights, and thus it is also a random variable. Given M independent weighted samples from Q, we can estimate:

µ̂ = [Σ_{m=1}^{M} f(x[m]) w_e[m]] / [Σ_{m=1}^{M} w_e[m]].    (2)
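To make the estimator in Equation (2) concrete, the following is a minimal Python sketch of standard likelihood weighting for a small discrete BN. It is illustrative only: the dictionary-based encoding of CPDs, the function names, and the toy network are our own assumptions, not the authors' Prolog implementation.

```python
import random

def lw_sample(bn, order, evidence):
    """Draw one likelihood-weighted sample. `bn[v]` maps the current partial
    assignment of Pa(v) to a dict {value: P(v=value | pa(v))};
    `order` is a topological ordering of the variables."""
    sample, weight = {}, 1.0
    for v in order:
        dist = bn[v](sample)                      # P(v | pa(v))
        if v in evidence:
            sample[v] = evidence[v]
            weight *= dist[evidence[v]]           # weight of evidence W_{x_i}
        else:
            vals, probs = zip(*dist.items())
            sample[v] = random.choices(vals, weights=probs)[0]
    return sample, weight

def lw_estimate(bn, order, evidence, query_var, query_val, num_samples=10000):
    """Estimate P(query_var = query_val | evidence) as in Equation (2)."""
    num = den = 0.0
    for _ in range(num_samples):
        s, w = lw_sample(bn, order, evidence)
        num += w * (s[query_var] == query_val)
        den += w
    return num / den

# A made-up two-variable network a -> e, purely for illustration.
bn = {
    "a": lambda asg: {1: 0.5, 0: 0.5},
    "e": lambda asg: {1: 0.2, 0: 0.8} if asg["a"] == 1 else {1: 0.9, 0: 0.1},
}
print(lw_estimate(bn, ["a", "e"], {"e": 1}, "a", 1))
```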
Next, we formally define the independencies that arise due to the structures within CPDs.

2.2 Context-Specific Independence

Definition 1. Let P be a probability distribution over variables U, and let A, B, C, D be disjoint subsets of U. The variables A and B are independent given D and context c if P(A | B, D, c) = P(A | D, c) whenever P(B, D, c) > 0. This is denoted by A ⊥ B | D, c. If D is empty, then A and B are independent given context c, denoted by A ⊥ B | c.

Independence statements of the above form are called context-specific independencies (CSIs). When A is independent of B given all possible assignments to C, then we have A ⊥ B | C. Independence statements of this form are generally referred to as conditional independencies (CIs). Thus, CSI is a more fine-grained notion than CI. The graphical structure in B can only represent CIs. Any CI can be verified in linear time in the size of the graph. However, verifying an arbitrary CSI has recently been shown to be coNP-hard (Corander et al., 2019).

2.3 Rules

A natural representation of the structures in a CPD is via a tree-CPD, as illustrated in Figure 1. For each assignment to the parents of a variable A, a unique leaf in the tree specifies a (conditional) distribution over A. The path to each leaf dictates the contexts, i.e., partially assigned parents, given which this distribution is used. It is easier to reason using tree-CPDs if we break them into finer-grained elements. A finer-grained representation of structured CPDs is via rules (Poole, 1997; Koller and Friedman, 2009), where each path from the root to a leaf in each tree-CPD maps to a rule. For our purposes, we will use a simple rule-based representation language, which can be seen as a restricted fragment of Distributional Clauses (DC, Gutmann et al., 2011).
Example 1.
A set of rules for the tree-CPD in Figure 1:

e ∼ bernoulli(0.2) ← a=1.
e ∼ bernoulli(0.9) ← a=0 ∧ b=1.
e ∼ bernoulli(0.6) ← a=0 ∧ b=0 ∧ c=1.
e ∼ bernoulli(0.3) ← a=0 ∧ b=0 ∧ c=0.

We can also represent structures in CPDs of discrete-continuous distributions using this form of rules:
Example 2.
Consider a machine that breaks down if the cooling of the machine is not working or the ambient temperature is too high. The following set of rules specifies a distribution over cool, t (temperature) and broken, where a CSI is implied: broken is independent of cool in the context t>30.

cool ∼ bernoulli(0.1).
t ∼ gaussian(25,2.2).
broken ∼ bernoulli(0.9) ← t>30.
broken ∼ bernoulli(0.6) ← t=<30 ∧ cool=0.
broken ∼ bernoulli(0.1) ← t=<30 ∧ cool=1.

Intuitively, the head of a rule (h ∼ D ← b1 ∧ ⋯ ∧ bn) defines a random variable h, distributed according to a distribution D, whenever all atoms bi in the body (an assignment of some parents of the variable) of the rule are true, that is: P(h | b1, ..., bn) = D. Since we study tree-CPDs, we focus on mutually exclusive and exhaustive rules; that is, only one rule for the variable h can fire (i.e., have every atom in its body be true) at a time. A set of rules forms a program, which we call a DC(B) program.
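As an illustration of how such rules pick the sampling distribution from a partial assignment of the parents, here is a small Python sketch of the rules for broken from Example 2. The encoding (probabilities paired with body predicates) is our own convention and only mimics the DC syntax; it is not the paper's implementation.

```python
import random

# The three rules for `broken` from Example 2 as (p, body) pairs;
# each body inspects only the parents it mentions.
broken_rules = [
    (0.9, lambda a: a["t"] > 30),
    (0.6, lambda a: a["t"] <= 30 and a["cool"] == 0),
    (0.1, lambda a: a["t"] <= 30 and a["cool"] == 1),
]

def sample_broken(assignment):
    """Sample `broken` from the unique rule whose body is true under the
    (possibly partial) assignment; rules are mutually exclusive/exhaustive."""
    for p, body in broken_rules:
        if body(assignment):
            return int(random.random() < p)     # draw from bernoulli(p)
    raise ValueError("no rule fired")

# When t > 30 the first rule fires, so `cool` is never consulted --
# exactly the CSI: broken is independent of cool in the context t > 30.
print(sample_broken({"t": 34.2}))
print(sample_broken({"t": 25.0, "cool": 1}))
```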
Definition 2. Let B be a Bayesian network with tree-CPDs specifying a distribution P. Let P be a set of rules such that each path from the root to a leaf of each tree-CPD corresponds to a rule in P. Then P specifies the same distribution P, and P will be called a DC(B) program.

3 Bayes-Ball Simulation of Bayesian Networks

In this section, we will ignore the structures within CPDs and only exploit the graphical structure of BNs. The approach presented in this section forms the basis of our discussion of CS-LW, where we will also exploit the structure of the CPDs. In Section 2.1, we used all variables to estimate µ. However, due to CIs, the observed states and CPDs of only some variables might be required for computing µ. These variables are called requisite variables. To get a better estimate of µ, it is recommended to use only these variables. The standard approach is to first apply the Bayes-ball algorithm (Shachter, 1998) over the graph structure in B to obtain a sub-network of requisite variables, and then simulate the sub-network to obtain the weighted samples. An alternative approach, which we present next, is to use Bayes-ball to simulate the original network B and focus only on the requisite variables to obtain the weighted samples.

To obtain the samples, we need to traverse the graph structure of the Bayesian network B in a topological ordering. The Bayes-ball algorithm, which is linear in the graph's size, can be used for this. The advantage of using Bayes-ball is that it also detects CIs; thus, it traverses only a sub-graph that depends on the query and evidence. We can also keep assigning unobserved variables and weighting observed variables while traversing the graph. In this way, we assign/weigh only requisite variables. The Bayes-ball algorithm uses four rules to traverse the graph (when deterministic variables are absent in B), and marks variables to avoid repeating the same action. These rules are illustrated in Figure 2. Next, we discuss these rules and also indicate how to assign/weigh variables, resulting in a new algorithm called Bayes-ball simulation of BNs.

Figure 2: The four rules of the Bayes-ball algorithm that decide the next visits based on the direction of the current visit and the type of variable. To distinguish observed variables from unobserved variables, the former are shaded.

Starting with all query variables scheduled to be visited as if from one of their children, we apply the following rules until no more variables can be visited:

1. When the visit of an unobserved variable U ∈ X ∪ Z is from a child, and U is not marked on top, then do these in order: i) mark U on top; ii) visit all its parents; iii) sample a value y from P(U | pa(U)) and assign y to U; iv) if U is not marked on bottom, then mark U on bottom and visit all its children.
2. When the visit of an unobserved variable is from a parent, and the variable is not marked on bottom, then mark the variable on bottom and visit all its children.
3. When the visit of an observed variable is from a child, then do nothing.
4. When the visit of an observed variable E ∈ E is from a parent, and E is not marked on top, then do these in order: i) mark E on top; ii) visit all its parents; iii) let e be the observed value of E and let w be the probability of e according to P(E | pa(E)); then the weight of E is w.

The above rules define an order for visiting parents and children so that variables are assigned/weighted in a topological ordering. Indeed, we can define the order since the original rules of Bayes-ball do not prescribe any order. The marks record important information; consequently, we show the following. The proofs of all the results are in the supplementary material.
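The following is a minimal Python sketch of one weighted sample drawn by this Bayes-ball simulation, assuming no deterministic variables. The graph encoding (parents/children dictionaries, CPDs as functions of the current assignment) and the function names are our own illustrative conventions, not the authors' code.

```python
import random

def bayes_ball_lw_sample(bn, parents, children, evidence, query_vars):
    """One weighted sample that touches only requisite variables, following
    the four rules above. `bn[v]` maps a partial assignment of Pa(v) to a
    dict {value: prob}; `parents`/`children` describe the DAG."""
    top, bottom = set(), set()
    asg = dict(evidence)          # evidence variables are clamped up front
    weight = 1.0

    def draw(dist):
        vals, probs = zip(*dist.items())
        return random.choices(vals, weights=probs)[0]

    def visit_from_child(v):
        if v in evidence:                      # rule 3: observed, do nothing
            return
        if v in top:                           # rule 1 fires only once
            return
        top.add(v)
        for p in parents[v]:                   # i)-ii) visit all parents first
            visit_from_child(p)
        asg[v] = draw(bn[v](asg))              # iii) sample P(v | pa(v))
        if v not in bottom:                    # iv) pass the ball to children
            bottom.add(v)
            for c in children[v]:
                visit_from_parent(c)

    def visit_from_parent(v):
        nonlocal weight
        if v in evidence:                      # rule 4: diagnostic evidence
            if v in top:
                return
            top.add(v)
            for p in parents[v]:
                visit_from_child(p)
            weight *= bn[v](asg)[evidence[v]]  # weight of this evidence
        elif v not in bottom:                  # rule 2: pass ball downwards
            bottom.add(v)
            for c in children[v]:
                visit_from_parent(c)

    for q in query_vars:                       # start as if from a child
        visit_from_child(q)
    return asg, weight
```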
Lemma 1. Let E⋆ ⊆ E be marked on top, E⭒ ⊆ E be visited but not marked on top, and Z⋆ ⊆ Z be marked on top. Then the query µ can be computed as follows:

µ = [Σ_{x,z⋆} P(x, z⋆, e⋆ | e⭒) f(x)] / [Σ_{x,z⋆} P(x, z⋆, e⋆ | e⭒)]    (3)

Now, since X, Z⋆, E⋆, E⭒ are variables of B and they form a sub-network B⋆ such that E⭒ do not have any parents, we can write

P(x, z⋆, e⋆ | e⭒) = ∏_{u_i ∈ x ∪ z⋆ ∪ e⋆} P(u_i | pa(U_i))

such that ∀p ∈ pa(U_i): p ∈ x ∪ z⋆ ∪ e⋆ ∪ e⭒. This means that the CPDs of some observed variables are not required for computing µ. Now we define these variables.

Definition 3.
The observed variables whose observed states and CPDs might be required to compute µ will be called diagnostic evidence.

Definition 4.
The observed variables whose only observed states might be required to compute µ will be called predictive evidence.

Diagnostic evidence (denoted by e⋆) is marked on top, while predictive evidence (denoted by e⭒) is visited but not marked on top. The variables X, Z⋆, E⋆, E⭒ will be called requisite variables. Now, we can sample from a factor Q⋆ of Q such that

Q⋆(X, Z⋆, E⋆ | E⭒) = ∏_{X_i ∈ X ∪ Z⋆} P(X_i | Pa(X_i)) |_{E⋆ = e⋆}    (4)

When we use Bayes-ball, precisely this factor is considered for sampling. Starting by first setting E⭒ to their observed values, X ∪ Z⋆ is assigned and e⋆ is weighted in the topological ordering. Given M weighted samples D⋆ = ⟨x[1], w_{e⋆}[1]⟩, ..., ⟨x[M], w_{e⋆}[M]⟩ from Q⋆, we can estimate:

µ̃ = [Σ_{m=1}^{M} f(x[m]) w_{e⋆}[m]] / [Σ_{m=1}^{M} w_{e⋆}[m]].    (5)

In this way, we sample from a lower-dimensional space; thus, the new estimator µ̃ has a lower variance than µ̂ due to the Rao-Blackwell theorem. Consequently, fewer samples are needed to achieve the same accuracy. Hence, we exploit CIs using the graphical structure in B for improved inference.

4 Context-Specific Likelihood Weighting

Now, we will exploit the graphical structure as well as the structures within CPDs. This section is divided into two parts. The first part presents a novel notion of contextual assignments that forms a theoretical framework for exploiting CSIs. It provides insight into the computation of µ using partial assignments of requisite variables. We will show that CSIs allow for breaking the main problem of computing µ into several sub-problems that can be solved independently. The second part presents CS-LW, based on the notion introduced in the first part, where we will exploit the structure of rules in the program to sample variables given the states of only some of their requisite ancestors. This contrasts with our discussion so far for BNs, where knowledge of the states of all such ancestors is required.

4.1 Contextual Assignments

We will consider the variables X, Z⋆, E⋆, E⭒ requisite for computing the query µ to the distribution P, and the sub-network B⋆ formed by these variables. We start by defining the partial assignments that we will use to compute µ at the end of this section.

Definition 5.
Let Z† ⊆ Z⋆ and e† ⊆ e⋆. Denote Z⋆ \ Z† by Z‡, and e⋆ \ e† by e‡. A partial assignment x, z†, Z‡, e†, e‡ will be called a contextual assignment if, due to CSIs in P,

∏_{u_i ∈ x ∪ z† ∪ e†} P(u_i | pa(U_i)) = ∏_{u_i ∈ x ∪ z† ∪ e†} P(u_i | ppa(U_i)),

where ppa(U_i) ⊆ pa(U_i) is a set of partially assigned parents of U_i such that Z‡ ∩ Ppa(U_i) = ∅.

Example 3.
Consider the network of Figure 1, and assume that our diagnostic evidence is {F = f, G = g, H = h}, predictive evidence is {D = d}, and query is {E = e}. From the CPD's structure, we have: P(E = e | A = 1, B, C) = P(E = e | A = 1); consequently, a contextual assignment is x = {E = e}, z† = {A = 1}, e† = {}, Z‡ = {B, C}, e‡ = {F = f, G = g, H = h}. We also have: P(E = e | A = 0, B = 1, C) = P(E = e | A = 0, B = 1); consequently, another such assignment is x = {E = e}, z† = {A = 0, B = 1}, e† = {H = h}, Z‡ = {C}, e‡ = {F = f, G = g}.

We aim to treat the evidence e‡ independently; thus, we define it first.

Definition 6.
The diagnostic evidence e‡ in a contextual assignment x, z†, Z‡, e†, e‡ will be called residual evidence.

However, contextual assignments do not immediately allow us to treat the residual evidence independently. We need the assignments to be safe.
Definition 7.
Let e ∈ e⋆ be a diagnostic evidence, and let S be an unobserved ancestor of E in the graph structure of B⋆, where B⋆ is the sub-network formed by the requisite variables. Let S → ⋯ B_i ⋯ → E be a causal trail such that either no B_i is observed or there is no B_i. Let S be the set of all such S. Then the variables S will be called the basis of e. Let ė⋆ ⊆ e⋆, and let Ṡ⋆ be the set of all such S for all e ∈ ė⋆. Then Ṡ⋆ will be called the basis of ė⋆.

Reconsider Example 3; the basis of {F = f} is {B}.

Definition 8.
Let x, z†, Z‡, e†, e‡ be a contextual assignment, and let S‡ be the basis of the residual evidence e‡. If S‡ ⊆ Z‡, then the contextual assignment will be called safe.

Example 4.
Reconsider Example 3; the first example of a contextual assignment is safe, but the second is not, since the basis B of e‡ is assigned in z†. We can make the second safe like this: x = {E = e}, z† = {A = 0, B = 1}, e† = {F = f, H = h}, Z‡ = {C}, e‡ = {G = g}. See Figure 3.

Before showing that the residual evidence can now be treated independently, we first define a random variable called weight.

Definition 9.
Let e ∈ e⋆ be a diagnostic evidence, and let W_e be a random variable defined as follows: W_e = P(e | Pa(E)). The variable W_e will be called the weight of e. The weight of a subset ė⋆ ⊆ e⋆ is defined as follows:

W_{ė⋆} = ∏_{u_i ∈ ė⋆} P(u_i | Pa(U_i)).

Figure 3: Two safe contextual assignments to variables of the BN in Figure 1: (a) in the context A = 1, where the edges C → E and B → E are redundant since E ⊥ B, C | A = 1; (b) in the context A = 0, B = 1, where the edge C → E is redundant since E ⊥ C | A = 0, B = 1. To identify such assignments, intuitively, we should apply the Bayes-ball algorithm after removing these edges. The portions of the graphs that the algorithm visits, starting with visiting the variable E from its child, are highlighted. Notice that the variables x, z†, e† lie in the highlighted portion.

Now we can show the following result:

Theorem 2.
Let ė⋆ ⊆ e⋆, and let Ṡ⋆ be the basis of ė⋆. Then the expectation of the weight W_{ė⋆} relative to the distribution Q⋆, as defined in Equation 4, can be written as:

E_{Q⋆}[W_{ė⋆}] = Σ_{ṡ⋆} ∏_{u_i ∈ ė⋆ ∪ ṡ⋆} P(u_i | pa(U_i)).

Hence, apart from the unobserved variables Ṡ⋆, the computation of E_{Q⋆}[W_{ė⋆}] does not depend on other unobserved variables. Now we are ready to show our main result:

Theorem 3.
Let Ψ be the set of all possible safe contextual assignments in the distribution P. Then the query µ to P can be computed as follows:

µ = [Σ_{ψ ∈ Ψ} (∏_{u_i ∈ x[ψ] ∪ z†[ψ] ∪ e†[ψ]} P(u_i | ppa(U_i))) f(x[ψ]) R[ψ]] / [Σ_{ψ ∈ Ψ} (∏_{u_i ∈ x[ψ] ∪ z†[ψ] ∪ e†[ψ]} P(u_i | ppa(U_i))) R[ψ]]    (6)

where R[ψ] denotes E_{Q⋆}[W_{e‡[ψ]}].

We draw some important conclusions: i) µ can be exactly computed by performing the summation over all safe contextual assignments; notably, the variables in Z† vary, and so do the variables in E†; ii) for all ψ ∈ Ψ, the computation of E_{Q⋆}[W_{e‡[ψ]}] does not depend on the context x[ψ], z†[ψ], since no basis of e‡[ψ] is assigned in the context (by Theorem 2). Hence, E_{Q⋆}[W_{e‡[ψ]}] can be computed independently. However, the context decides which evidence is in the subset e‡[ψ]. That is why we cannot cancel E_{Q⋆}[W_{e‡[ψ]}] from the numerator and denominator.

First, we present an algorithm that simulates the DC(B) program P, specifying the same distribution P, to generate safe contextual assignments. Then we discuss how to estimate the expectations independently before estimating µ.
Algorithm 1: Simulation of DC(B) Programs

procedure SIMULATE-DC(P, x, e)
• Visits variables from parent and simulates a DC(B) program P based on the inputs: i) x, a query; ii) e, evidence.
• Output: i) i: f(x), which can be either 1 or 0; ii) W: a table of weights of diagnostic evidence (e†).
• The procedure maintains global data structures: i) Asg, a table that records assignments of variables (x ∪ z†); ii) Dst, a table that records distributions for variables; iii) Forward, a set of variables whose children are to be visited from a parent; iv) Top, a set of variables marked on top; v) Bottom, a set of variables marked on bottom.
1. Empty Asg, Dst, W, Top, Bottom, Forward.
2. If PROVE-MARKED(x) == yes then i = 1 else i = 0.
3. While Forward is not empty:
   (a) Remove m from Forward.
   (b) For all h ∼ D ← Body in P such that an atom m=z occurs in Body:
       i. If h is observed in e and h is not in Top:
          A. Add h to Top.
          B. For all h ∼ D ← Body in P: PROVE-MARKED(Body ∧ dist(h, D)).
          C. Let x be the observed value of h and let p be the probability of x according to distribution Dst[h]. Record W[h]=p.
       ii. If h is not observed in e and h is not in Bottom:
          A. Add h to Bottom and add h to Forward.
4. Return [i, W].

Algorithm 2: DC(B) Proof Procedure

procedure PROVE-MARKED(Goal)
• Visits variables from child and, consequently, proves a conjunction of atoms Goal. Returns yes; otherwise fails.
• Accesses the program P, the set Top, the tables Asg, Dst and the evidence e as defined in Algorithm 1.
1. While Goal is not empty:
   (a) Select the first atom b from Goal.
   (b) If b is of the form a=x:
       i. If a is observed in e then let y be the value of a.
       ii. Else if a is in Top then y=Asg[a].
       iii. Else:
          A. Add a to Top.
          B. For all a ∼ D ← Body in P: PROVE-MARKED(Body ∧ dist(a, D)).
          C. Sample a value y from distribution Dst[a] and record Asg[a]=y.
          D. If a is not in Bottom: add a to Bottom and add a to Forward.
       iv. If x==y then remove b from Goal else fail.
   (c) If b is of the form dist(a, D): record Dst[a]=D and remove b from Goal.
2. Return yes.

4.2 Simulation of DC(B) Programs

We start by asking a question. Suppose we modify the first and the fourth rule of the Bayes-ball simulation, introduced in Section 3, as follows:

• In the first rule, when the visit of an unobserved variable is from its child, everything remains the same except that only some parents are visited, not all.
• Similarly, in the fourth rule, when the visit of an observed variable is from its parent, everything remains the same except that only some parents are visited.

Which variables will be assigned, and which will be weighted using the modified simulation rules? Intuitively, only a subset of the variables in Z⋆ should be assigned, and only a subset of the variables in E⋆ should be weighted. But then how do we assign/weigh a variable knowing the state of only some of its parents? We can do that when structures are present in CPDs and these structures are explicitly represented using rules, as discussed in Section 2.3. This is because the rules define the distribution from which the variable should be sampled once the state of some of its parents is known. Hence, the key idea is to visit only some parents (if possible due to structures); consequently, those unobserved parents that are not visited might not need to be sampled.

To realize that, we need to modify the Bayes-ball simulation so that it works on DC(B) programs. This modified simulation for DC(B) programs is defined procedurally in Algorithm 1. The algorithm visits variables from their parents and calls Algorithm 2 to visit variables from their children. Like Bayes-ball, this algorithm also marks variables on top and bottom to avoid repeating the same action. Readers familiar with theorem proving will find that Algorithm 2 closely resembles SLD resolution (Kowalski, 1974), but it is also different since it is stochastic. An example illustrating how Algorithm 2 visits only some requisite ancestors to sample a variable is given in the supplementary material.

Since the simulation of P follows the same four rules of the Bayes-ball simulation, except that only some parents are visited in the first and fourth rule, we can show the following.

Lemma 4.
Let E† be the set of observed variables weighted, and let Z† be the set of unobserved variables, apart from query variables, assigned in a simulation of P. Then Z† ⊆ Z⋆ and E† ⊆ E⋆.

The query variables X are always assigned, since the simulation starts by visiting these variables as if the visits were from one of their children. To simplify notation, from now on we use Z† to denote the subset of variables in Z⋆ that are assigned, E† to denote the subset of variables in E⋆ that are weighted in the simulation of P, Z‡ to denote Z⋆ \ Z†, and E‡ to denote E⋆ \ E†. We show that the simulation performs safe contextual assignments to requisite variables.
Theorem 5.
The partial assignment x, z†, Z‡, e†, e‡ generated in a simulation of P is a safe contextual assignment.

The proof of Theorem 5 relies on the following lemma.
Lemma 6.
Let P be a DC(B) program specifying a distribution P. Let B, C be disjoint sets of parents of a variable A. In the simulation of P, if A is sampled/weighted given an assignment c, and without assigning B, then P(A | c, B) = P(A | c).

Hence, just like the standard LW, we sample from a factor Q† of the proposal distribution Q⋆, which is given by

Q† = ∏_{u_i ∈ x ∪ z† ∪ e†} P(u_i | ppa(U_i)),

where P(u_i | ppa(U_i)) = 1 if u_i ∈ e†. It is precisely this factor that Algorithm 1 considers for the simulation of P. Starting by first setting E⭒, E‡ to their observed values, it assigns X ∪ Z† and weighs e† in the topological ordering. In this process, it records partial weights w_{e†} such that ∏_{x_i ∈ e†} w_{x_i} = w_{e†} and w_{x_i} ∈ w_{e†}. Given M partially weighted samples D† = ⟨x[1], w_{e†}[1]⟩, ..., ⟨x[M], w_{e†}[M]⟩ from Q†, we could estimate µ using Theorem 3 as follows:

µ = [Σ_{m=1}^{M} f(x[m]) × w_{e†}[m] × E_{Q⋆}[W_{e‡[m]}]] / [Σ_{m=1}^{M} w_{e†}[m] × E_{Q⋆}[W_{e‡[m]}]]    (7)

However, we still cannot estimate it, since we do not yet have the expectations E_{Q⋆}[W_{e‡[m]}]. Fortunately, there are ways to estimate them from the partial weights in D†. We discuss one such way next.

We start with the notion of the sampling mean. Let W⋆ = ⟨w_{e_1}[1], ..., w_{e_m}[1]⟩, ..., ⟨w_{e_1}[n], ..., w_{e_m}[n]⟩ be a data set of n observations of the weights of m diagnostic evidence variables drawn using the standard LW. How can we estimate the expectation E_{Q⋆}[W_{e_i}] from W⋆? The standard approach is to use the sampling mean (1/n) Σ_{r=1}^{n} w_{e_i}[r]. In general, E_{Q⋆}[W_{e_i} ⋯ W_{e_j}] can be estimated using the estimator (1/n) Σ_{r=1}^{n} w_{e_i}[r] ⋯ w_{e_j}[r]. Since LW draws are independent and identically distributed (i.i.d.), it is easy to show that the estimator is unbiased.

However, some entries, i.e., the weights of residual evidence, are missing in the data set W† obtained using CS-LW. The trick is to fill in the missing entries by drawing samples of the missing weights once we obtain W†. More precisely, the missing weights ⟨W_{e_i}, ..., W_{e_j}⟩ in the r-th row of W† are filled in with a joint state ⟨w_{e_i}[r], ..., w_{e_j}[r]⟩ of the weights. To draw the joint state, we again use Algorithm 1 and visit the observed variables ⟨E_i, ..., E_j⟩ from parent. Once all missing entries are filled in, we can estimate E_{Q⋆}[W_{e_i} ⋯ W_{e_j}] using the sampling mean as just discussed. Once we estimate all the required expectations, it is straightforward to estimate µ using Equation 7.
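A sketch of this estimation step in Python, assuming the partial weights have been collected into a matrix with missing entries marked. Here `fill_missing` stands in for re-running Algorithm 1 to draw a missing weight; all names and the matrix encoding are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def estimate_mu(f_vals, weights, missing, fill_missing):
    """Estimate µ as in Equation (7).
    f_vals : (M,) array of f(x[m]) values (0 or 1).
    weights: (M, K) array; weights[m, k] = w_{e_k}[m] where recorded.
    missing: (M, K) boolean array; True where e_k was residual evidence
             in sample m (no weight recorded).
    fill_missing(m, k): draws a value for a missing weight entry."""
    M, K = weights.shape
    W = weights.copy()
    for m in range(M):
        for k in range(K):
            if missing[m, k]:
                W[m, k] = fill_missing(m, k)       # complete the weight table
    num = den = 0.0
    for m in range(M):
        # Partial weight w_{e†}[m]: product of weights recorded in sample m.
        w_partial = np.prod(weights[m][~missing[m]])
        # Estimate E[W_{e‡[m]}] by a sampling mean over the completed table.
        resid = np.flatnonzero(missing[m])
        exp_resid = np.mean(np.prod(W[:, resid], axis=1)) if resid.size else 1.0
        num += f_vals[m] * w_partial * exp_resid
        den += w_partial * exp_resid
    return num / den
```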
Table 1: The mean absolute error (MAE), the standard deviation of the error (Std.), and the average elapsed time (in seconds) versus the number of samples (N), for LW and CS-LW on the Alarm and Andes networks. For each case, LW and CS-LW were executed 30 times.

At this point, we can gain some insight into the role of CSIs in sampling. They allow us to estimate the expectation E_{Q⋆}[W_{e‡}] separately. We estimate it from all the samples obtained at the end of the sampling process, thereby reducing the contribution W_{e‡} makes to the variance of our main estimator of µ. The residual evidence e‡ will be large if many CSIs are present in the distribution; consequently, we will obtain a much better estimate of µ using significantly fewer samples. Moreover, drawing a single sample will be faster, since only a subset of the requisite variables is visited. Hence, in addition to CIs, we exploit CSIs and improve LW further. We observe all these anticipated improvements in our experiments.

5 Experiments

We answer three questions empirically:

Q1: How does the sampling speed of CS-LW compare with the standard LW in the presence of CSIs?
Q2: How does the accuracy of the estimate obtained using CS-LW compare with the standard LW?
Q3: How does CS-LW compare to state-of-the-art approximate inference algorithms?

To answer the first two questions, we need BNs with structures present within CPDs. Such BNs, however, are not readily available, since structure within CPDs is generally overlooked while designing inference algorithms. We identified two BNs from the Bayesian network repository (Elidan, 2001) that have many structures within CPDs: i) Alarm, a monitoring system for patients, with 37 variables; ii) Andes, an intelligent tutoring system, with 223 variables.
We used a standard decision tree learning algorithm to detect structures, overfitting it on the tabular CPDs to obtain tree-CPDs, which were then converted into rules. Let us denote the program with these rules by P_tree. CS-LW is implemented in the Prolog programming language; thus, to compare the sampling speed of LW with CS-LW, we need a similar implementation of LW. Fortunately, we can use the same implementation of CS-LW for obtaining LW estimates. Recall that if we do not make structures explicit in rules and represent each entry of the tabular CPDs with a rule, then CS-LW boils down to LW. Let P_table denote the program where each rule corresponds to an entry in the tabular CPDs. Table 1 shows the comparison of estimates obtained using P_tree (CS-LW) and P_table (LW). Note that CS-LW automatically discards non-requisite variables for sampling. So, we chose the query and evidence such that almost all variables in the BNs were requisite for the conditional query.

As expected, we observe that less time is required by CS-LW to generate the same number of samples. This is because it visits only a subset of the requisite variables in each simulation. Andes has more structure than Alarm; thus, the sampling speed of CS-LW is much higher than that of LW on Andes. Additionally, we observe that the estimate obtained by CS-LW with the same number of samples is much better than that of LW. This is significant. It is worth emphasizing that approaches based on collapsed sampling obtain better estimates than LW with the same number of samples, but the speed of drawing samples then decreases significantly. In CS-LW, the speed increases when structures are present. This is possible because CS-LW exploits CSIs. Hence, we get the answer to the first two questions: when many structures are present, and when they are made explicit in rules, CS-LW will draw samples faster than LW. Additionally, estimates will be better with the same number of samples.

To answer our final question, we compared CS-LW with collapsed compilation (CC, Friedman and Van den Broeck, 2018), which has recently been shown to outperform several sampling algorithms. It combines a state-of-the-art exact inference algorithm that exploits CSIs with importance sampling that scales the exact inference. The load between the exact part and the sampling part can be regulated using the size of the arithmetic circuit: the larger the circuit, the larger the load on the exact part and the smaller the load on the sampling part, i.e., fewer variables are considered for sampling. For this experiment, we consider two additional BNs: i) Win95pts, a system for printer troubleshooting in Windows 95, with 76 variables; ii) Munin1, an expert EMG assistant, with 186 variables. However, not many structures are present in the CPDs of these two BNs, so not much difference in the performance of LW and CS-LW is expected.

The comparison is shown in Table 2. We can observe the following: i) as expected from collapsed sampling, much fewer samples are drawn in the same time; ii) the right choice of circuit size is crucial, e.g., with circuit size 10,000, CC performs worse than LW on some BNs, while it performs better when the size is increased; iii) CS-LW performs better than CC when the circuit is not huge; iv) on three of the BNs, CC with a huge circuit size computes the exact conditional probability, while LW and CS-LW can only provide a good approximation of it in the same time. To demonstrate that the fourth observation does not undermine the importance of pure sampling, we used Munin1. Although the size of this BN is comparable to the size of Andes, almost all of its variables are multi-valued, and their domain size can be as large as 20; hence, some CPDs are huge, while in Andes the variables are binary-valued. CC, which works well on Andes, fails to deal with the huge CPDs of Munin1 on our machine, whereas both LW and CS-LW work well on this BN.

Hence, we get the answer to our final question: CS-LW is competitive with the state of the art and can be a useful tool for inference on massive BNs with structured CPDs.
6 Related Work

Although the question of how to exploit CSIs arising due to structures is not new and has puzzled researchers for decades, research in this direction has mainly focused on exact inference (Boutilier et al., 1996; Zhang and Poole, 1999; Poole, 1997; Poole and Zhang, 2003). Nowadays, it is common to use knowledge compilation (KC) based exact inference for this purpose (Chavira and Darwiche, 2008; Fierens et al., 2015; Shen et al., 2016). There are not many approximate inference algorithms that can exploit them. However, there are some tricks that make use of structures to approximate the probability of a query. One trick, introduced by Poole (1998), is to make the rule base simpler by ignoring distinctions between close probabilities. Another trick, explored in the case of tree-CPDs, is to prune the trees and reduce the size of the actual CPDs (Salmerón et al., 2000; Cano et al., 2011). However, approximation by making the distribution simpler is orthogonal to traditional ways of approximation, such as sampling. Fierens (2010) observed a speedup in Gibbs sampling due to structures; however, that work did not consider the global implications of structures.

Recently, Friedman and Van den Broeck (2018) realized that KC is good at exploiting structures while sampling is scalable; thus, they proposed CC, which inherits the advantages of both. However, along with the advantages, this approach also inherits the scalability limitations of KC. Furthermore, CC is limited to discrete distributions. The problem of exploiting CSIs in discrete-continuous distributions is non-trivial and poorly studied. Recently, it has attracted some attention (Zeng and Van den Broeck, 2019). However, the proposed approaches are also exact and rely on complicated weighted model integration (Belle et al., 2015), which quickly becomes infeasible. CS-LW is simple, scalable, and applies to such distributions. A sampling algorithm for a rule-based representation of discrete-continuous distributions was developed by Nitti et al. (2016); however, it did not exploit CIs or the global implications of rule structures.
Table 2: The mean absolute error (MAE), the standard deviation of the error (Std.), and the average number of samples (N) drawn when the algorithms were run 50 times for 2 minutes (approx.) each, on the Alarm, Win95pts, Andes, and Munin1 networks. The algorithms are LW, CS-LW, and CC with circuit sizes 10,000, 100,000, and 1,000,000.
7 Conclusions

We studied the role of CSI in approximate inference and introduced the notion of contextual assignments to show that CSIs allow for breaking the main problem of estimating a conditional probability query into several small problems that can be estimated independently. Based on this notion, we presented an extension of LW which not only generates samples faster, but also provides a better estimate of the query with far fewer samples. Hence, we provided a solid reason to prefer structured CPDs over tabular CPDs. Like LW, we believe other sampling algorithms can also be extended along the same lines. We aim to open up a new direction towards improved sampling algorithms that also exploit CSIs.
Acknowledgements
This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No [694980] SYNTH: Synthesising Inductive Data Models). OK was supported by the Czech Science Foundation project "Generative Relational Models" (20-19104Y) and partially also by the OP VVV project CZ.02.1.01/0.0/0.0/16_019/0000765 "Research Center for Informatics". The authors would like to thank Luc De Raedt, Jessa Bekker, Pedro Zuidberg Dos Martires, and the anonymous reviewers for valuable feedback.
References
Craig Boutilier, Nir Friedman, Moises Goldszmidt, and Daphne Koller. Context-specific independence in Bayesian networks. In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, pages 115–123. Morgan Kaufmann Publishers Inc., 1996.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

David Maxwell Chickering, David Heckerman, and Christopher Meek. A Bayesian approach to learning Bayesian networks with local structure. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pages 80–89. Morgan Kaufmann Publishers Inc., 1997.

Nir Friedman. The Bayesian structural EM algorithm. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 129–138. Morgan Kaufmann Publishers Inc., 1998.

John S. Breese, David Heckerman, and Carl Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 43–52. Morgan Kaufmann Publishers Inc., 1998.

Tal Friedman and Guy Van den Broeck. Approximate knowledge compilation by online collapsed importance sampling. In Advances in Neural Information Processing Systems, pages 8024–8034, 2018.

Adnan Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM (JACM), 50(3):280–305, 2003.

Ross D. Shachter and Mark A. Peot. Simulation approaches to general probabilistic inference on belief networks. In Machine Intelligence and Pattern Recognition, volume 10, pages 221–231. Elsevier, 1990.

Robert Fung and Kuo-Chu Chang. Weighing and integrating evidence for stochastic simulation in Bayesian networks. In Machine Intelligence and Pattern Recognition, volume 10, pages 209–219. Elsevier, 1990.

David Poole. Probabilistic partial evaluation: Exploiting rule structure in probabilistic inference. In IJCAI, volume 97, pages 1284–1291, 1997.

Jukka Corander, Antti Hyttinen, Juha Kontinen, Johan Pensar, and Jouko Väänänen. A logical approach to context-specific independence. Annals of Pure and Applied Logic, 170(9):975–992, 2019.

Bernd Gutmann, Ingo Thon, Angelika Kimmig, Maurice Bruynooghe, and Luc De Raedt. The magic of logical inference in probabilistic programming. Theory and Practice of Logic Programming, 11(4-5):663–680, 2011.

Ross D. Shachter. Bayes-Ball: The rational pastime. In Proceedings of the Fourteenth Annual Conference on Uncertainty in Artificial Intelligence, 1998.

Robert Kowalski. Predicate logic as programming language. In IFIP Congress, 1974.

Nevin L. Zhang and David Poole. On the role of context-specific independence in probabilistic inference. In IJCAI, volume 2, page 1288, 1999.

David Poole and Nevin Lianwen Zhang. Exploiting contextual independence in probabilistic inference. Journal of Artificial Intelligence Research, 18:263–313, 2003.

Mark Chavira and Adnan Darwiche. On probabilistic inference by weighted model counting. Artificial Intelligence, 172(6-7):772–799, 2008.

Daan Fierens, Guy Van den Broeck, Joris Renkens, Dimitar Shterionov, Bernd Gutmann, Ingo Thon, Gerda Janssens, and Luc De Raedt. Inference and learning in probabilistic logic programs using weighted Boolean formulas. Theory and Practice of Logic Programming, 15(3):358–401, 2015.

Yujia Shen, Arthur Choi, and Adnan Darwiche. Tractable operations for arithmetic circuits of probabilistic models. In Advances in Neural Information Processing Systems, pages 3936–3944, 2016.

David Poole. Context-specific approximation in probabilistic inference. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, UAI'98, pages 447–454, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. ISBN 155860555X.

Antonio Salmerón, Andrés Cano, and Serafín Moral. Importance sampling in Bayesian networks using probability trees. Computational Statistics & Data Analysis, 34(4):387–413, 2000.

Andrés Cano, Manuel Gómez-Olmedo, and Serafín Moral. Approximate inference in Bayesian networks using binary probability trees. International Journal of Approximate Reasoning, 52(1):49–62, 2011.

Daan Fierens. Context-specific independence in directed relational probabilistic models and its influence on the efficiency of Gibbs sampling. In ECAI, pages 243–248, 2010.

Zhe Zeng and Guy Van den Broeck. Efficient search-based weighted model integration. UAI 2019 Proceedings, 2019.

Vaishak Belle, Andrea Passerini, and Guy Van den Broeck. Probabilistic inference in hybrid domains by weighted model integration. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

Davide Nitti, Tinne De Laet, and Luc De Raedt. Probabilistic logic programming for hybrid relational domains. Machine Learning, 103(3):407–449, 2016.

David L. Poole and Alan K. Mackworth. Artificial Intelligence: Foundations of Computational Agents. Cambridge University Press, 2010.
Context-Specific Likelihood Weighting: Supplementary Materials

Nitesh Kumar    Ondřej Kuželka

Department of Computer Science and Leuven.AI, KU Leuven, Belgium
Department of Computer Science, Czech Technical University in Prague, Czechia
In this section, we present the detailed proof of Lemma 1.
Proof.
Let us denote the variables in Z that are marked on top (requisite) by Z⋆ and those that are not marked on top (not requisite) by Z̄⋆. The required probability µ is then given by

µ = P(x_q | e)
  = [Σ_{x, z⋆, z̄⋆} P(x, z⋆, z̄⋆, e) f(x)] / [Σ_{x, z⋆, z̄⋆} P(x, z⋆, z̄⋆, e)]
  = [Σ_{x, z⋆} P(x, z⋆, e) f(x) Σ_{z̄⋆} P(z̄⋆ | x, z⋆, e)] / [Σ_{x, z⋆} P(x, z⋆, e) Σ_{z̄⋆} P(z̄⋆ | x, z⋆, e)].

Since Σ_{z̄⋆} P(z̄⋆ | x, z⋆, e) = 1, we can write

µ = [Σ_{x, z⋆} P(x, z⋆, e) f(x)] / [Σ_{x, z⋆} P(x, z⋆, e)].

Now let us denote the observed variables in E that are visited (requisite) by E_r and those that are not visited (not requisite) by E_n. We can write

µ = [Σ_{x, z⋆} P(x, z⋆, e_r) P(e_n | x, z⋆, e_r) f(x)] / [Σ_{x, z⋆} P(x, z⋆, e_r) P(e_n | x, z⋆, e_r)].

The variables in X ∪ Z⋆ pass the Bayes-balls to all their parents and all their children, but E_n is not visited by these balls. The correctness of the Bayes-ball algorithm ensures that there is no active path from X ∪ Z⋆ to any variable in E_n given E_r. Thus X, Z⋆ ⊥ E_n | E_r and P(e_n | x, z⋆, e_r) = P(e_n | e_r). After cancelling out the common term P(e_n | e_r), we get

µ = [Σ_{x, z⋆} P(x, z⋆, e_r) f(x)] / [Σ_{x, z⋆} P(x, z⋆, e_r)].

Now let us denote the observed variables in E_r that are only visited by E⭒ and those that are visited as well as marked on top by E⋆. After cancelling out the common term, we get the desired result,

µ = [Σ_{x, z⋆} P(x, z⋆, e⋆ | e⭒) P(e⭒) f(x)] / [Σ_{x, z⋆} P(x, z⋆, e⋆ | e⭒) P(e⭒)] = [Σ_{x, z⋆} P(x, z⋆, e⋆ | e⭒) f(x)] / [Σ_{x, z⋆} P(x, z⋆, e⋆ | e⭒)].

Example 5.
Consider the network of Figure 1, and assume that our evidence is {D = d, F = f, G = g, H = h} and our query is {E = e}. Suppose we start by visiting the query variable from its child and apply the four rules of Bayes-ball. One can easily verify that the observed variables F, G, H will be marked on top; hence {F = f, G = g, H = h} is diagnostic evidence (e⋆). The observed variable D will only be visited; hence {D = d} is predictive evidence (e⭒). The variables A, B, C, E will be marked on top and are the requisite unobserved variables (X ∪ Z⋆).

In this section, we present the detailed proof of Theorem 2.
Proof.
The expectation E_{Q⋆}[W_{ė⋆}] is given by

Σ_{x, z⋆} ∏_{u_i ∈ x ∪ z⋆} P(u_i | pa(U_i)) ∏_{v_i ∈ ė⋆} P(v_i | pa(V_i)).

The basis Ṡ⋆ is a subset of X ∪ Z⋆ by Definition 7. Let us denote (X ∪ Z⋆) \ Ṡ⋆ by Z⋄. We can now rewrite the expectation as follows:

Σ_{ṡ⋆, z⋄} ∏_{u_i ∈ ė⋆ ∪ ṡ⋆} P(u_i | pa(U_i)) ∏_{v_i ∈ z⋄} P(v_i | pa(V_i)).

We will show that P_a ∉ Z⋄ for any P_a ∈ Pa(U_i), which will then allow us to push the summation over z⋄ inside. Let us consider two cases:

• For U_i ∈ Ė⋆, let P_a ∈ Pa(U_i) be an unobserved parent of U_i; then there is a direct causal trail from P_a to U_i, and consequently P_a will be in the set Ṡ⋆.
• For U_i ∈ Ṡ⋆, there is a causal trail U_i → ⋯ B_j ⋯ → E such that E ∈ Ė⋆ and such that either no B_j is observed or there is no B_j. Let P_a ∈ Pa(U_i) be an unobserved parent of U_i; then there is a direct causal trail from P_a to U_i, consequently there is such a causal trail from P_a to E, and P_a will be in the set Ṡ⋆.

Hence, we push the summation over z⋄ inside and use the fact that Σ_{z⋄} ∏_{v_i ∈ z⋄} P(v_i | pa(V_i)) = 1 to get the desired result.

In this section, we present the detailed proof of Theorem 3.
Proof.
Since X, Z⋆, E⋆, E⭒ are variables of the Bayesian network B and they form a sub-network B⋆ such that E⭒ do not have any parents, we can always write

P(x, z⋆, e⋆ | e⭒) = ∏_{u_i ∈ x ∪ z† ∪ e†} P(u_i | pa(U_i)) ∏_{v_i ∈ z‡ ∪ e‡} P(v_i | pa(V_i)),

such that p ∈ x ∪ z⋆ ∪ e⋆ ∪ e⭒ for all p ∈ pa(U_i) or p ∈ pa(V_i). Now consider the summation over all possible assignments of the variables in X, Z⋆, that is, Σ_{x, z⋆} P(x, z⋆, e⋆ | e⭒). We can always write

Σ_{x, z⋆} P(x, z⋆, e⋆ | e⭒) = Σ_{ψ ∈ Ψ} Σ_{z‡[ψ]} P(x[ψ], z†[ψ], z‡[ψ], e†[ψ], e‡[ψ] | e⭒).    (8)

To simplify notation, from now on we denote {x[ψ], z†[ψ], Z‡[ψ], e†[ψ], e‡[ψ]} by {x, z†, Z‡, e†, e‡}. After using the definition of contextual assignments, we have that

P(x, z†, z‡, e†, e‡ | e⭒) = ∏_{u_i ∈ x ∪ z† ∪ e†} P(u_i | ppa(U_i)) ∏_{v_i ∈ z‡ ∪ e‡} P(v_i | pa(V_i)).

Since p ∉ z‡ for any p ∈ ppa(U_i), we can push the summation over z‡ inside to get

Σ_{ψ ∈ Ψ} Σ_{z‡} P(x, z†, z‡, e†, e‡ | e⭒) = Σ_{ψ ∈ Ψ} ∏_{u_i ∈ x ∪ z† ∪ e†} P(u_i | ppa(U_i)) Σ_{z‡} ∏_{v_i ∈ z‡ ∪ e‡} P(v_i | pa(V_i)).    (9)

However, we get a strange term Σ_{z‡} ∏_{v_i ∈ z‡ ∪ e‡} P(v_i | pa(V_i)). Let S‡ denote the basis of the residual evidence e‡. We have that S‡ ⊆ Z‡ by Definition 8. Let us denote Z‡ \ S‡ by Z⋄. Now the strange term can be rewritten as

Σ_{s‡, z⋄} ∏_{u_i ∈ e‡ ∪ s‡} P(u_i | pa(U_i)) ∏_{v_i ∈ z⋄} P(v_i | pa(V_i)).

In the proof of Theorem 2, we showed that the summation over variables not in S‡ can be pushed inside; hence, the summation over Z⋄ can be pushed inside. After using the fact that Σ_{z⋄} ∏_{v_i ∈ z⋄} P(v_i | pa(V_i)) = 1, we conclude that the strange term is actually the expectation E_{Q⋆}[W_{e‡}]. Using Equations (3), (8), (9) and rearranging terms, the result follows.

In this section, we present the detailed proof of Lemma 4.
Proof.
It is clear that a subset of the unobserved variables is assigned. Let Z‡ be the set of unobserved variables left unassigned. Let E ∈ E⋆ be an observed variable. Consider two cases:

• All ancestors of E are in Z‡ ∪ E⭒ ∪ E⋆.
• Some ancestors of E are in Z‡ ∪ E⭒ ∪ E⋆ and some are in X ∪ Z†. Let A ∈ X ∪ Z† and let A → ⋯ B_i ⋯ → E be a causal trail. Some B_i are observed in all such trails.

Clearly, E will not be visited from any parent in the first case, and in the second case the visit will be blocked by observed variables. Consequently, E will not be weighted, which completes the proof.

In this section, we present the detailed proof of Theorem 5.
Proof.
Variables in Z‡ are not assigned in the simulation; hence, it follows immediately from Lemma 6 that the assignment is contextual. Assume, by contradiction, that A ∈ X ∪ Z†, E ∈ E‡, and there is a causal trail A → ⋯ B_i ⋯ → E such that no B_i is observed or there is no B_i. Since A is assigned, all children of A will be visited, and following the trail, the variable E will also be visited from its parent, since there is no observed variable in the trail to block the visit. Consequently, E will be weighted, which contradicts our assumption that E is not weighted. Hence, the assignment is also safe.

In this section, we present the detailed proof of Lemma 6.
Proof.
Since A is assigned/weighted and the rules in P are exhaustive, a rule R ∈ P with A in its head must have fired. Let d be the body and D the distribution in the head of R. Since each d_i ∈ d must be true for R to fire, d ⊆ c. We assume that the rules in P are mutually exclusive. Thus, among all rules for A, only R will fire, even when an assignment of some variables in B is also given. Hence, by definition of the rule R, we have that

D = P(A | d) = P(A | c) = P(A | c, B).

Let us now look into the process of estimating the unconditional probability of queries to DC(B) programs. We assume some familiarity with proof procedures for definite clauses (Poole and Mackworth, 2010).

Just like a set of definite clauses forms a knowledge base, a DC(B) program forms a probabilistic knowledge base. We can ask queries of the form yes ← e=1, which is the question: is e assigned 1? We first need to prove e=1 before concluding that the answer is yes. To realize that, we perform a proof procedure, starting from the query, to determine whether it is a logical consequence of the rules in the DC(B) program. Algorithm 2 describes the procedure, which is similar to the standard SLD resolution for definite clauses. However, there are some differences in proving atoms of the form e=1 due to the stochastic nature of sampling. We illustrate the proof procedure with an example.

Example 6.
Consider a Bayesian network whose graph structure is as shown in Figure 4 (right) and whose CPDs are expressed using the following rules:

a ∼ bernoulli(0.1).
d ∼ bernoulli(0.3).
b ∼ bernoulli(0.2) ← a=0.
b ∼ bernoulli(0.6) ← a=1.
c ∼ bernoulli(0.2) ← a=1.
c ∼ bernoulli(0.7) ← a=0 ∧ b=1.
c ∼ bernoulli(0.8) ← a=0 ∧ b=0.
e ∼ bernoulli(0.9) ← c=1.
e ∼ bernoulli(0.4) ← c=0 ∧ d=1.
e ∼ bernoulli(0.3) ← c=0 ∧ d=0.

Figure 4: Left: A search graph induced to prove e=1; Right: A graphical structure.

Suppose the query yes ← e=1 is asked. The procedure induces a search graph. An example of such a graph is shown in Figure 4 (left), where we write dist(y, p) for dist(y, bernoulli(p)) and use a comma (,) instead of ∧ since there is no risk of confusion. In this example, the proof succeeds using a derivation. However, it might happen that the proof cannot be derived. In that case, the proof fails, and the answer is no.

After repeating the procedure, the fraction of times we get yes is the estimated probability of the query. It is important to note that some requisite variables may not be assigned on some occasions; e.g., in the proof shown in Figure 4, the variable b, which is requisite to compute the probability, is not assigned. Hence, we sample values of e faster. In this way, the procedure exploits the structure of rules.
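The following Python sketch captures only the backward (child-to-parent) proof direction of this procedure for the rules of Example 6: a variable is assigned by proving the body of one of its rules, recursively assigning only the parents that the bodies mention. Evidence weighting, the forward phase, and the top/bottom marking of Algorithm 2 are deliberately omitted, and the encoding is our own illustrative convention.

```python
import random

# Rules of Example 6 as (head, p, body) triples; body = required parent values.
rules = [
    ("a", 0.1, {}), ("d", 0.3, {}),
    ("b", 0.2, {"a": 0}), ("b", 0.6, {"a": 1}),
    ("c", 0.2, {"a": 1}), ("c", 0.7, {"a": 0, "b": 1}), ("c", 0.8, {"a": 0, "b": 0}),
    ("e", 0.9, {"c": 1}), ("e", 0.4, {"c": 0, "d": 1}), ("e", 0.3, {"c": 0, "d": 0}),
]

def prove(var, asg):
    """Assign `var` by proving the body of one of its rules, recursively
    assigning only the parents that the attempted bodies mention."""
    if var in asg:
        return asg[var]
    for head, p, body in rules:
        if head != var:
            continue
        if all(prove(parent, asg) == val for parent, val in body.items()):
            asg[var] = int(random.random() < p)     # draw from bernoulli(p)
            return asg[var]
    raise ValueError("no rule fired for " + var)

asg = {}
print(prove("e", asg), asg)
# Depending on the draws, some requisite ancestors are never assigned
# (e.g., b is skipped whenever a is sampled as 1).
```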
10 Additional Experimental Details
To make sure that almost all variables in the BNs are requisite, we used the following query variables and evidence to obtain the results (Table 1 and Table 2):
• Alarm
  – x = {bp = low}
  – e = {lvfailure = false, cvp = normal, hr = normal, expco2 = low, ventalv = low, ventlung = zero}

• Win95pts
  – x = {problem1 = normal output}
  – e = {prtstatoff = no error, prtfile = yes, prtstattoner = no error, repeat = yes always the same, ds lclok = yes, lclok = yes, problem3 = yes, problem4 = yes, nnpsgrphc = yes, psgraphic = yes, problem5 = yes, gdiin = yes, appdata = correct, prtstatpaper = no error}

• Andes
  – x = {grav78 = false}
  – e = {goal 2 = true, displacem0 = false, snode 10 = true, snode 16 = true, grav2 = true, constant5 = false, known8 = true, try11 = true, kinemati17 = false, try13 = true, given21 = true, choose35 = false, write31 = false, need36 = false, resolve38 = true, goal 69 = false, snode 73 = false, goal 79 = false, try24 = false, newtons45 = false, try26 = false, snode 65 = false, snode 88 = false, buggy54 = true, weight57 = true, goal 104 = false, goal 108 = false, need67 = false, goal 114 = false, snode 118 = false, snode 122 = false, snode 125 = false, goal 129 = false, snode 135 = false, goal 146 = false, snode 151 = false}

• Munin1
  – x = {rmedd2amprew = r04}
  – e = {rlnlt1apbdenerv = no, rlnllpapbmaloss = no, rlnlwapbderegen = no, rdiffnapbmaloss = no, rderegenapbnmt = no, rdiffnmedd2block = no, rmeddcvew = ms60, rapbnmt = no, rapbforce = 5, rlnlbeapbdenerv = no, rlnlbeapbneuract = no, rmedldwa = no, rmedd2blockwd = no}
11 Code

The code for CS-LW is available at: https://github.com/niteshroyal/CS-LW.git