Handling Epistemic and Aleatory Uncertainties in Probabilistic Circuits
Federico Cerutti, Lance M. Kaplan, Angelika Kimmig, Murat Sensoy
Federico Cerutti (Department of Information Engineering, Università degli Studi di Brescia, Brescia, Italy, and Cardiff University, Crime and Security Research Institute, Cardiff, UK, [email protected]); Lance M. Kaplan (CCDC Army Research Laboratory, Adelphi, MD, USA, [email protected]); Angelika Kimmig (Department of Computer Science, KU Leuven, Belgium, [email protected]); Murat Şensoy (Blue Prism AI Labs, London, UK, and Department of Computer Science, Ozyegin University, Istanbul, Turkey, [email protected])
Submitted to MACH: Under Review
Abstract
When collaborating with an AI system, we need to assess when to trust its recommendations. If we mistakenly trust it in regions where it is likely to err, catastrophic failures may occur, hence the need for Bayesian approaches for probabilistic reasoning in order to determine the confidence (or epistemic uncertainty) in the probabilities in light of the training data. We propose an approach to overcome the independence assumption behind most of the approaches dealing with a large class of probabilistic reasoning that includes Bayesian networks as well as several instances of probabilistic logic. We provide an algorithm for Bayesian learning from sparse, albeit complete, observations, and for deriving inferences and their confidences keeping track of the dependencies between variables when they are manipulated within the unifying computational formalism provided by probabilistic circuits. Each leaf of such circuits is labelled with a beta-distributed random variable that provides us with an elegant framework for representing uncertain probabilities. We achieve better estimation of epistemic uncertainty than state-of-the-art approaches, including highly engineered ones, while being able to handle general circuits and with just a modest increase in the computational effort compared to using point probabilities.

1 Introduction
Even in simple collaboration scenarios, like those in which an artificial intelligence (AI) system assists a human operator with predictions, the success of the team hinges on the human correctly deciding when to follow the recommendations of the AI system and when to override them [6]. Extracting benefits from collaboration with the AI system depends on the human developing insights (i.e., a mental model) of when to trust the AI system with its recommendations [6]. If the human mistakenly trusts the AI system in regions where it is likely to err, catastrophic failures may occur. This is a strong argument in favour of Bayesian approaches to probabilistic reasoning: research at the intersection of AI and HCI has found that interaction improves when expectations are set right about what the system can do and how well it performs [39, 5]. Guidelines have been produced [1], and they recommend to:
Make clear what the system can do (G1), and make clear how well the system can do what it can do (G2).

To identify such regions where the AI system is likely to err, we need to distinguish between (at least) two different sources of uncertainty: aleatory (or aleatoric) and epistemic uncertainty [26, 27]. Aleatory uncertainty refers to the variability in the outcome of an experiment which is due to inherently random effects (e.g. flipping a fair coin): no additional source of information but Laplace's daemon¹ can reduce such variability. Epistemic uncertainty refers to the epistemic state of the agent using the model, hence its lack of knowledge that, in principle, can be reduced on the basis of additional data samples. Particularly when considering sparse data, the epistemic uncertainty around the learnt model can significantly affect decision making [2, 3], for instance when used for computing an expected utility [58].

In this paper, we propose an approach to probabilistic reasoning that manipulates distributions of probabilities without assuming independence and without resorting to sampling approaches, within the unifying computational formalism provided by arithmetic circuits [59], sometimes named probabilistic circuits when manipulating probabilities, or simply circuits. This is clearly a novel contribution, as the few approaches [47, 29, 10] resorting to distribution estimation via moment matching, like we also propose, still assume statistical independence, in particular when manipulating distributions within the circuit. Instead, we provide an algorithm for Bayesian learning from sparse, albeit complete, observations, and for probabilistic inferences that keep track of the dependencies between variables when they are manipulated within the circuit. In particular, we focus on the large class of approaches to probabilistic reasoning that rely upon algebraic model counting (AMC) [37] (Section 2.1), which has been proven to encompass probabilistic inferences under [50]'s semantics, thus covering not only Bayesian networks [49], but also probabilistic logic programming approaches such as ProbLog [21], and others as discussed by [11]. As AMC is defined in terms of the set of models of a propositional logic theory, we can exploit the results of [16] (Section 2.2), who studied the succinctness relations between circuit representations. The distributions we compute directly express how certain, or less certain, the machine is about a given conclusion, thus targeting directly [1, G1 and G2].

In previous work [10] we provided operators for manipulating beta-distributed random variables under strong independence assumptions (Section 4). This paper significantly extends and improves our previous approach by eliminating the independence assumption in manipulating beta-distributed random variables within a circuit.

Indeed, our main contribution (Section 5) is an algorithm for reasoning over a circuit whose leaves are labelled with beta-distributed random variables, with the additional piece of information describing which of those are actually independent (Section 5.1).

¹ "An intelligence that, at a given instant, could comprehend all the forces by which nature is animated and the respective situation of the beings that make it up" [41, p.2].
This is the input to an algorithm that shadows the circuit by superimposing a second circuit for computing the probability of a query conditioned on a set of pieces of evidence (Section 5.2) in a single feed forward. While this at first might seem unnecessary, it is actually essential when inspecting the main algorithm that evaluates such a shadowed circuit (Section 5.3), where a covariance matrix plays an essential role by keeping track of the dependencies between random variables while they are manipulated within the circuit. We also include discussions on memory management of the covariance matrix in Section 5.4.

We evaluate our approach against a set of competing approaches in an extensive set of experiments detailed in Section 6, comparing against leading approaches to dealing with uncertain probabilities, notably: (1) Monte Carlo sampling; (2) our previous proposal [10], taken as representative of the class of approaches using moment matching with strong independence assumptions; (3) Subjective Logic [30], which provides an alternative representation of beta distributions as well as a calculus for manipulating them, applied already in a variety of domains, e.g. [31, 43, 52]; (4) Subjective Bayesian Network (SBN) on circuits derived from singly-connected Bayesian networks [28, 32, 33], which already showed higher performance against other traditional approaches dealing with uncertain probabilities, such as (5) Dempster-Shafer Theory of Evidence [18, 53], and (6) replacing single probability values with closed intervals representing the possible range of probability values [61]. We achieve better estimation of epistemic uncertainty than state-of-the-art approaches, including highly engineered ones for a narrow domain such as SBN, while being able to handle general circuits and with just a modest increase in the computational effort compared to using point probabilities.

2 Background

2.1 Algebraic Model Counting

[37] introduce the task of algebraic model counting (AMC). AMC generalises weighted model counting (WMC) to the semiring setting and supports various types of labels, including numerical ones as used in WMC, but also sets, polynomials, Boolean formulae, and many more. The underlying mathematical structure is that of a commutative semiring.

A semiring is a structure $(\mathcal{A}, \oplus, \otimes, e^{\oplus}, e^{\otimes})$, where addition $\oplus$ and multiplication $\otimes$ are associative binary operations over the set $\mathcal{A}$, $\oplus$ is commutative, $\otimes$ distributes over $\oplus$, $e^{\oplus} \in \mathcal{A}$ is the neutral element of $\oplus$, $e^{\otimes} \in \mathcal{A}$ that of $\otimes$, and for all $a \in \mathcal{A}$, $e^{\oplus} \otimes a = a \otimes e^{\oplus} = e^{\oplus}$. In a commutative semiring, $\otimes$ is commutative as well.

Algebraic model counting is now defined as follows. Given:
• a propositional logic theory $T$ over a set of variables $\mathcal{V}$,
• a commutative semiring $(\mathcal{A}, \oplus, \otimes, e^{\oplus}, e^{\otimes})$, and
• a labelling function $\rho : \mathcal{L} \to \mathcal{A}$, mapping literals $\mathcal{L}$ of the variables in $\mathcal{V}$ to elements of the semiring set $\mathcal{A}$,

compute

$$\mathbf{A}(T) = \bigoplus_{I \in \mathcal{M}(T)} \bigotimes_{l \in I} \rho(l), \qquad (1)$$

where $\mathcal{M}(T)$ denotes the set of models of $T$. Among others, AMC generalises the task of probabilistic inference according to [50]'s semantics (PROB) [37, Thm. 1], [24, 19, 4, 7, 36].

A query $q$ is a finite set of algebraic literals $q \subseteq \mathcal{L}$. We denote the set of interpretations where the query is true by $\mathcal{I}(q)$:

$$\mathcal{I}(q) = \{ I \mid I \in \mathcal{M}(T) \wedge q \subseteq I \} \qquad (2)$$

The label of query $q$ is defined as the label of $\mathcal{I}(q)$:

$$\mathbf{A}(q) = \mathbf{A}(\mathcal{I}(q)) = \bigoplus_{I \in \mathcal{I}(q)} \bigotimes_{l \in I} \rho(l). \qquad (3)$$
As both operators are commutative and associative, the label is independent of the order of both literals and interpretations.

In the context of this paper, we extend AMC for handling PROB of queries with evidence by introducing an additional division operator $\oslash$ that defines the conditional label of a query as follows:

$$\mathbf{A}(q \mid E = e) = \mathbf{A}(\mathcal{I}(q \wedge E = e)) \oslash \mathbf{A}(\mathcal{I}(E = e)) \qquad (4)$$

where $\mathbf{A}(\mathcal{I}(q \wedge E = e)) \oslash \mathbf{A}(\mathcal{I}(E = e))$ returns the label of $q$ given the label of the set of pieces of evidence $E = e$.

In the case of probabilities as labels, i.e. $\rho(\cdot) \in [0, 1]$, (5) presents the AMC-conditioning parametrisation $S_p$ for handling PROB of (conditioned) queries:

$$\begin{array}{ll} \mathcal{A} = \mathbb{R}_{\geq 0} & a \oplus b = a + b \\ a \otimes b = a \cdot b & e^{\oplus} = 0 \quad e^{\otimes} = 1 \\ \rho(f) \in [0, 1] & \rho(\neg f) = 1 - \rho(f) \\ a \oslash b = \dfrac{a}{b} & \end{array} \qquad (5)$$

A naïve implementation of (4) is clearly exponential: [14] introduced the first method for deriving tractable circuits (d-DNNFs) that allow polytime algorithms for clausal entailment, model counting and enumeration.
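To make the parametrisation $S_p$ concrete, the following is a minimal Python sketch (ours, not from the paper) that computes the label $\mathbf{A}(T)$ of (1) by brute-force enumeration of models; the toy theory, variable names and helper functions are illustrative assumptions.

```python
from itertools import product

# Probability semiring S_p: oplus = +, otimes = *, neutral elements 0 and 1.
OPLUS = lambda a, b: a + b
OTIMES = lambda a, b: a * b
E_OPLUS, E_OTIMES = 0.0, 1.0

def rho(var, value, prob):
    """Label of a literal: prob[var] for the positive literal, 1 - prob[var] otherwise."""
    return prob[var] if value else 1.0 - prob[var]

def amc_brute_force(variables, theory, prob):
    """A(T) = sum over models of T of the product of literal labels, cf. (1)."""
    label = E_OPLUS
    for values in product([False, True], repeat=len(variables)):
        interpretation = dict(zip(variables, values))
        if theory(interpretation):  # keep only models of T
            term = E_OTIMES
            for var, value in interpretation.items():
                term = OTIMES(term, rho(var, value, prob))
            label = OPLUS(label, term)
    return label

# Toy theory T = a OR b with p(a) = 0.1, p(b) = 0.2:
# A(T) = 1 - (1 - 0.1) * (1 - 0.2) = 0.28
print(amc_brute_force(["a", "b"], lambda i: i["a"] or i["b"],
                      {"a": 0.1, "b": 0.2}))
```

The exponential cost of this enumeration is exactly what the compilation to tractable circuits discussed above avoids.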
2.2 Knowledge Compilation

As AMC is defined in terms of the set of models of a propositional logic theory, we can exploit the succinctness results of the knowledge compilation map of [16]. The restriction to two-valued variables allows us to directly compile AMC tasks to circuits without adding constraints on legal variable assignments to the theory.

In their knowledge compilation map, [16] provide an overview of succinctness relationships between various types of circuits. Instead of focusing on classical, flat target compilation languages based on conjunctive or disjunctive normal forms, [16] consider a richer, nested class based on representing propositional sentences using directed acyclic graphs: NNFs. A sentence in negation normal form (NNF) over a set of propositional variables $\mathcal{V}$ is a rooted, directed acyclic graph where each leaf node is labeled with true ($\top$), false ($\bot$), or a literal of a variable in $\mathcal{V}$, and each internal node with disjunction ($\vee$) or conjunction ($\wedge$).

An NNF is decomposable if for each conjunction node $\bigwedge_{i=1}^{n} \phi_i$, no two children $\phi_i$ and $\phi_j$ share any variable.

An NNF is deterministic if for each disjunction node $\bigvee_{i=1}^{n} \phi_i$, each pair of different children $\phi_i$ and $\phi_j$ is logically contradictory, that is $\phi_i \wedge \phi_j \models \bot$ for $i \neq j$. In other terms, only one child can be true at any time.
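As a small illustration (ours, not from the paper), consider the following NNF over $\{a, b, c\}$:

$$(a \wedge b) \vee (\neg a \wedge c)$$

Both conjunctions are decomposable, since their children mention disjoint variables, and the disjunction is deterministic, since its two children disagree on $a$ and thus cannot be true simultaneously.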
The function Eval specified in Algorithm 1 evaluates an NNF circuit for a commutative semiring $(\mathcal{A}, \oplus, \otimes, e^{\oplus}, e^{\otimes})$ and labelling function $\rho$. Evaluating an NNF representation $N^T$ of a propositional theory $T$ for a semiring $(\mathcal{A}, \oplus, \otimes, e^{\oplus}, e^{\otimes})$ and labelling function $\rho$ is a sound AMC computation iff Eval$(N^T, \oplus, \otimes, e^{\oplus}, e^{\otimes}, \rho) = \mathbf{A}(T)$.²

² In the case $\phi_i$ and $\phi_j$ are seen as events in a sample space, determinism can be equivalently rewritten as $\phi_i \cap \phi_j = \emptyset$ and hence $P(\phi_i \cap \phi_j) = 0$.
Algorithm 1 Evaluating an NNF circuit $N$ for a commutative semiring $(\mathcal{A}, \oplus, \otimes, e^{\oplus}, e^{\otimes})$ and labelling function $\rho$.

  procedure Eval(N, ⊕, ⊗, e⊕, e⊗, ρ)
    if N is a true node ⊤ then return e⊗
    if N is a false node ⊥ then return e⊕
    if N is a literal node l then return ρ(l)
    if N is a disjunction ∨_{i=1}^{m} N_i then
      return ⊕_{i=1}^{m} Eval(N_i, ⊕, ⊗, e⊕, e⊗, ρ)
    end if
    if N is a conjunction ∧_{i=1}^{m} N_i then
      return ⊗_{i=1}^{m} Eval(N_i, ⊕, ⊗, e⊕, e⊗, ρ)
    end if
  end procedure

In particular, [37, Theorem 4] shows that evaluating a d-DNNF representation of the propositional theory $T$ for a semiring and labelling function with neutral $(\oplus, \rho)$ is a sound AMC computation. A semiring addition and labelling function pair $(\oplus, \rho)$ is neutral iff $\forall v \in \mathcal{V} : \rho(v) \oplus \rho(\neg v) = e^{\otimes}$.

Unless specified otherwise, in the following we will refer to d-DNNF circuits labelled with probabilities or distributions of probability simply as circuits, and assume any addition and labelling function pair $(\oplus, \rho)$ is neutral. Also, we extend the definition of the labelling function such that it also operates on $\{\bot, \top\}$, i.e. $\rho(\bot) = e^{\oplus}$ and $\rho(\top) = e^{\otimes}$.

Let us now introduce a graphical notation for circuits in this paper: Figure 1 illustrates a d-DNNF circuit where each node has a unique integer (positive or negative) identifier. Moreover, circled nodes are labelled either with $\oplus$ for disjunction (a.k.a. $\oplus$-gates) or with $\otimes$ for conjunction (a.k.a. $\otimes$-gates). Leaf nodes are marked with a squared box and they are labelled with the literal, $\top$, or $\bot$, as well as its label via the labelling function $\rho$.

Unless specified otherwise, in the following we will slightly abuse the notation by defining a $\bar{\cdot}$ operator both for variables and $\top$, $\bot$, i.e. for $x \in \mathcal{V} \cup \{\bot, \top\}$,

$$\bar{x} = \begin{cases} \neg x & \text{if } x \in \mathcal{V} \\ \bot & \text{if } x = \top \\ \top & \text{if } x = \bot \end{cases} \qquad (6)$$

and for elements of the set $\mathcal{A}$ of labels, s.t. $\rho(\bar{x}) = \overline{\rho(x)}$.

Finally, each leaf node presents an additional parameter $\lambda$ (i.e. the indicator variable, cf. [21]) that assumes values 0 or 1, and we will be using it for reusing the same circuit for different purposes.

In the following, we will make use of a running example based upon the burglary example as presented in [21, Example 6]. In this way, we hope to convey better to the reader the value of our approach, as the circuit derived from it using [14] will have a clear, intuitive meaning. However, our approach is independent from the system that employs circuit compilation for its reasoning process, as long as it can make use of d-DNNF circuits.

0.1::burglary.
0.2::earthquake.
0.7::hears_alarm(john).
alarm :- burglary.
alarm :- earthquake.
calls(john) :- alarm, hears_alarm(john).
evidence(calls(john)).
query(burglary).

Listing 1: ProbLog code for the Burglary example, originally Example 6 in [21].

The d-DNNF circuit for our running example is depicted in Figure 1 and has been derived by compiling [14] into a d-DNNF the ProbLog [21] code listed in Listing 1 [21, Example 6]. For compactness, in the graph each literal of the program is represented only by its initials, i.e. burglary becomes b, hears_alarm(john) becomes h(j). ProbLog is an approach to augment prolog programs [40, 9] annotating facts with probabilities: see Appendix A for an introduction.³ As discussed in [21], the prolog language admits a propositional representation of its semantics.⁴

³ We refer readers interested in probabilistic augmentation of logical theories in general to [11].
⁴ Albeit ProbLog allows for rules to be annotated with probabilities: rules of the form p::h :- b are translated into h :- b, t with t a new fact of the form p::t.
For the example, the propositional representation of Listing 1 is:

$$\begin{aligned} \text{alarm} &\leftrightarrow \text{burglary} \vee \text{earthquake} \\ \text{calls(john)} &\leftrightarrow \text{alarm} \wedge \text{hears\_alarm(john)} \\ \text{calls(john)} & \end{aligned} \qquad (7)$$

Figure 1 thus shows the result of the compilation of (7) into a circuit, annotated with a unique id that is either a number $x$ or $\bar{x}$, to indicate the node that represents the negation of the variable represented by node $x$, and with weights (probabilities) as per Listing 1.

We need to enforce that we know calls(john) is true (see line 7 of Listing 1). This translates into having $\lambda = 1$ for calls(john), i.e. c(j), while $\lambda = 0$ for $\overline{\text{calls(john)}}$, i.e. $\overline{\text{c(j)}}$. The $\lambda$ indicators modify the execution of the function Eval (Alg. 1) in the way illustrated by Algorithm 2: note that Algorithm 2 is analogous to Algorithm 1 when all $\lambda = 1$. Hence, in the following, when considering the function Eval, we will be referring to the one defined in Algorithm 2.

Algorithm 2 Evaluating an NNF circuit $N$ for a commutative semiring $(\mathcal{A}, \oplus, \otimes, e^{\oplus}, e^{\otimes})$ and labelling function $\rho$, considering indicators $\lambda$.

  procedure Eval(N, ⊕, ⊗, e⊕, e⊗, ρ)
    if N is a true node ⊤ then
      if λ = 1 then return e⊗ else return e⊕ end if
    if N is a false node ⊥ then return e⊕
    if N is a literal node l then
      if λ = 1 then return ρ(l) else return e⊕ end if
    if N is a disjunction ∨_{i=1}^{m} N_i then
      return ⊕_{i=1}^{m} Eval(N_i, ⊕, ⊗, e⊕, e⊗, ρ)
    end if
    if N is a conjunction ∧_{i=1}^{m} N_i then
      return ⊗_{i=1}^{m} Eval(N_i, ⊕, ⊗, e⊕, e⊗, ρ)
    end if
  end procedure

Finally, the ProbLog program in Listing 1 queries the value of burglary, hence we need to compute the probability of burglary given calls(john):

$$p(\text{burglary} \mid \text{calls(john)}) = \frac{p(\text{burglary} \wedge \text{calls(john)})}{p(\text{calls(john)})} \qquad (8)$$

While the denominator of (8) is given by Eval of the circuit in Figure 1, we need to modify the circuit in order to obtain the numerator $p(\text{burglary} \wedge \text{calls(john)})$, as depicted in Figure 2, where $\lambda = 0$ for $\overline{\text{burglary}}$. Eval on the circuit in Figure 2 will thus return the value of the numerator of (8).

It is worth highlighting that computing $p(\text{query} \mid \text{evidence})$ for an arbitrary query and arbitrary set of pieces of evidence requires Eval to be executed at least twice on slightly modified circuits.
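To make this two-pass scheme concrete, here is a small Python sketch (ours; the hand-built d-DNNF below is an illustrative, simplified encoding of the compiled Burglary circuit, assuming the fact probabilities 0.1, 0.2 and 0.7 of Listing 1):

```python
def eval_circuit(node, rho, lam):
    """Algorithm 2: Eval over the probability semiring with lambda indicators."""
    kind, payload = node
    if kind == "lit":
        return rho[payload] if lam.get(payload, 1) == 1 else 0.0
    if kind == "or":   # oplus = +
        return sum(eval_circuit(child, rho, lam) for child in payload)
    if kind == "and":  # otimes = *
        result = 1.0
        for child in payload:
            result *= eval_circuit(child, rho, lam)
        return result
    raise ValueError(kind)

# Derived atoms (c = calls(john)) and their negations are labelled 1.
alarm = ("or", [("lit", "b"), ("and", [("lit", "~b"), ("lit", "e")])])
circuit = ("or", [
    ("and", [("lit", "c"), ("lit", "h"), alarm]),            # calls(john) true
    ("or", [("and", [("lit", "~c"), ("lit", "~h"), alarm]),  # calls(john) false
            ("and", [("lit", "~c"), ("lit", "~b"), ("lit", "~e")])]),
])
rho = {"b": 0.1, "~b": 0.9, "e": 0.2, "~e": 0.8,
       "h": 0.7, "~h": 0.3, "c": 1.0, "~c": 1.0}

denom = eval_circuit(circuit, rho, {"~c": 0})            # p(calls(john)) = 0.196
numer = eval_circuit(circuit, rho, {"~c": 0, "~b": 0})   # p(burglary, calls(john)) = 0.07
print(numer / denom)                                     # p(burglary | calls(john)) ~ 0.357
```

Note how the two passes differ only in the λ assignment, not in the circuit structure; this is the redundancy that the shadowing construction of Section 5 removes.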
In this paper, similarly to [38], we are interested in learning the parameters of our circuit, i.e. the $\rho$ function for each of the leaf nodes, or $\boldsymbol{\rho}$ in the following, thus representing it as a vector. We will learn $\boldsymbol{\rho}$ from a set of examples, where each example is an instantiation of all propositional variables: for $n$ propositional variables, there are $2^n$ such instantiations. In the case the circuit is derived from a logic program, an example is a complete interpretation of all the ground atoms. A complete dataset $\mathcal{D}$ is then a sequence (allowing for repetitions) of examples, each of which is a vector of instantiations of independent Bernoulli distributions with true but unknown parameter $p_x$. From this, the likelihood is thus:

$$p(\mathcal{D} \mid \mathbf{p}) = \prod_{i=1}^{|\mathcal{D}|} p(\mathbf{x}_i \mid p_{\mathbf{x}_i}) \qquad (9)$$

where $\mathbf{x}_i$ represents the $i$-th example in the dataset $\mathcal{D}$.

Figure 1: Circuit computing $p(\text{calls(john)})$ for the Burglary example (Listing 1). Solid box for query, double box for evidence.

Figure 2: Circuit computing $p(\text{burglary} \wedge \text{calls(john)})$ for the Burglary example (Listing 1). Solid box for query, double box for evidence. White over black for the numeric value that has changed from Figure 1. In particular, in this case, λ for the node labelled with $\overline{\text{burglary}}$ is set to 0.

Differently, however, from [38], we do not search for a maximum likelihood solution of this problem; rather we provide a Bayesian analysis of it in Section 3. The following analysis provides the distribution of the probabilities (second-order probabilities) for each propositional variable. For complete datasets, these distributions factor, meaning that the second-order probabilities for the propositional variables are statistically independent (see Appendix C). Nevertheless, it is shown that second-order probabilities of a variable and its negation are correlated, because the first-order probabilities (i.e. the expected values of the distributions) sum up to one.

The inference process proposed in this paper does not assume independent second-order probabilities, as it encompasses dependencies between the random variables associated to the propositions in the form of a covariance matrix. For complete datasets, the covariances at the leaves are only non-zero between a variable and its negation. More generally, when the training dataset $\mathcal{D}$ is not complete (i.e., variable values cannot always be observed for various instantiations), the second-order probabilities become correlated. The derivation of these correlations during the learning process with partial observations is left for future work. Nevertheless, the proposed inference method can accommodate such correlations without any modifications.

This is one of our main contributions, and it separates our approach from the literature. Indeed, ours is clearly not the only Bayesian approach to learning parameters in circuits, see for instance [29, 62, 54, 57, 47, 63]. In addition, similarly to [47, 29], we also apply the idea of moment matching instead of using sampling.

3 Bayesian analysis

Let us now expand further (9): for simplicity, let us consider here only the case of a single propositional variable, i.e. a single binary random variable $x \in \{0, 1\}$, e.g. a flipping coin, not necessarily fair, whose probability is thus conditioned by a parameter $0 \leq p_x \leq 1$:

$$p(x = 1 \mid p_x) = p_x \qquad (10)$$

The probability distribution over $x$ is known as the Bernoulli distribution:

$$\text{Bern}(x \mid p_x) = p_x^{x} (1 - p_x)^{1 - x} \qquad (11)$$

Given a data set $\mathcal{D}$ of i.i.d. observations $(x_1, \ldots, x_N)^{\mathsf{T}}$ drawn from the Bernoulli with parameter $p_x$, which is assumed unknown, the likelihood of the data given $p_x$ is:

$$p(\mathcal{D} \mid p_x) = \prod_{n=1}^{N} p(x_n \mid p_x) = \prod_{n=1}^{N} p_x^{x_n} (1 - p_x)^{1 - x_n} \qquad (12)$$

To develop a Bayesian analysis of the phenomenon, we can choose as prior the beta distribution, with parameters $\boldsymbol{\alpha} = \langle \alpha_x, \alpha_{\bar{x}} \rangle$, $\alpha_x \geq 1$, $\alpha_{\bar{x}} \geq 1$, that is conjugate to the Bernoulli:

$$\text{Beta}(p_x \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_x + \alpha_{\bar{x}})}{\Gamma(\alpha_x)\Gamma(\alpha_{\bar{x}})} \, p_x^{\alpha_x - 1} (1 - p_x)^{\alpha_{\bar{x}} - 1} \qquad (13)$$

where

$$\Gamma(t) \equiv \int_0^{\infty} u^{t-1} e^{-u} \, \mathrm{d}u \qquad (14)$$

is the gamma function.

Given a beta-distributed random variable $X$,

$$s_X = \alpha_x + \alpha_{\bar{x}} \qquad (15)$$

is its Dirichlet strength and

$$\mathbb{E}[X] = \frac{\alpha_x}{s_X} \qquad (16)$$

is its expected value. From (15) and (16) the beta parameters can equivalently be written as:

$$\boldsymbol{\alpha}_X = \langle \mathbb{E}[X]\, s_X, \; (1 - \mathbb{E}[X])\, s_X \rangle. \qquad (17)$$

The variance of a beta-distributed random variable $X$ is

$$\mathrm{var}[X] = \mathrm{var}[1 - X] = \frac{\mathbb{E}[X](1 - \mathbb{E}[X])}{s_X + 1}. \qquad (18)$$

Since $X + (1 - X) = 1$, it is easy to see that

$$\mathrm{cov}[X, 1 - X] = -\mathrm{var}[X]. \qquad (19)$$

From (18) we can rewrite $s_X$ (15) as

$$s_X = \frac{\mathbb{E}[X](1 - \mathbb{E}[X])}{\mathrm{var}[X]} - 1. \qquad (20)$$

Considering a beta distribution prior and the binomial likelihood function, and given $N$ observations of $x$ such that for $r$ observations $x = 1$ and for $s = N - r$ observations $x = 0$:

$$p(p_x \mid \mathcal{D}, \boldsymbol{\alpha}) = \frac{p(\mathcal{D} \mid p_x)\, p(p_x \mid \boldsymbol{\alpha})}{p(\mathcal{D})} \propto p_x^{r + \alpha_x - 1} (1 - p_x)^{s + \alpha_{\bar{x}} - 1} \qquad (21)$$

Hence $p(p_x \mid r, s, \boldsymbol{\alpha})$ is another beta distribution such that, after normalization via $p(\mathcal{D})$,

$$p(p_x \mid r, s, \boldsymbol{\alpha}) = \frac{\Gamma(r + \alpha_x + s + \alpha_{\bar{x}})}{\Gamma(r + \alpha_x)\Gamma(s + \alpha_{\bar{x}})} \, p_x^{r + \alpha_x - 1} (1 - p_x)^{s + \alpha_{\bar{x}} - 1} \qquad (22)$$

We can specify the parameters of the prior we are using for deriving our beta-distributed random variable $X$ as $\boldsymbol{\alpha} = \langle a_X W, (1 - a_X) W \rangle$, where $a_X$ is the prior assumption, i.e. $p(x = 1)$ in the absence of observations, and $W > 0$ is a prior weight. In the following, $\forall X$, $a_X = 0.5$ and $W = 2$, so as to have an uninformative, uniformly distributed, prior.
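As a quick worked instance of (22) (our numbers, for illustration): with the uniform prior $W = 2$, $a_X = 0.5$ (i.e. $\boldsymbol{\alpha} = \langle 1, 1 \rangle$) and $N = 10$ observations of which $r = 7$ are positive and $s = 3$ negative, the posterior is

$$p(p_x \mid 7, 3, \langle 1, 1 \rangle) = \text{Beta}(p_x \mid \langle 8, 4 \rangle), \quad \mathbb{E}[X] = \tfrac{8}{12} \approx 0.67, \quad s_X = 12, \quad \mathrm{var}[X] = \tfrac{(2/3)(1/3)}{13} \approx 0.017,$$

so the more observations accumulate, the larger the Dirichlet strength and the smaller the epistemic uncertainty.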
The complete dataset $\mathcal{D}$ is modelled as samples from independent binomial distributions for facts and rules. As such, the posterior factors as a product of beta distributions representing the posterior distribution for each fact or rule, as in (22) for a single fact (see Appendix C for further details). This posterior distribution enables the computation of the means and covariances for the leaves of the circuit, and, because it factors, the different variables are statistically independent, leading to zero covariances. Only the leaves associated to a variable and its complement exhibit nonzero covariance via (19). Now, the means and covariances of the leaves can be propagated through the circuit to determine the distribution of the queried conditional probability, as described in Section 5.

Given an inference, like the conditioned query of our running example (8), we approximate its distribution by a beta distribution by finding the corresponding Dirichlet strength to match the computed variance. Given a random variable $Z$ with known mean $\mathbb{E}[Z]$ and variance $\mathrm{var}[Z]$, we can use the method of moments and (20) to estimate the $\boldsymbol{\alpha}$ parameters of a beta-distributed variable $Z'$ of mean $\mathbb{E}[Z'] = \mathbb{E}[Z]$ and

$$s_{Z'} = \max\left\{ \frac{\mathbb{E}[Z](1 - \mathbb{E}[Z])}{\mathrm{var}[Z]} - 1, \; \frac{W a_Z}{\mathbb{E}[Z]}, \; \frac{W (1 - a_Z)}{1 - \mathbb{E}[Z]} \right\}. \qquad (23)$$

(23) is needed to ensure that the resulting beta-distributed random variable $Z'$ does not lead to an $\boldsymbol{\alpha}_{Z'} < \langle 1, 1 \rangle$.
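The following Python helper (ours; the function name is illustrative) sketches this moment-matching step, combining (17), (20) and (23):

```python
def beta_from_moments(mean, var, prior_weight=2.0, prior_mean=0.5):
    """Moment-match a beta distribution to a given mean and variance, cf. (23).

    The Dirichlet strength from (20) is floored so that both resulting
    parameters stay at or above the prior parameters, i.e. alpha >= <1, 1>
    when prior_weight = 2 and prior_mean = 0.5.
    """
    s = max(mean * (1.0 - mean) / var - 1.0,              # (20)
            prior_weight * prior_mean / mean,             # keeps alpha_x >= W a_Z
            prior_weight * (1.0 - prior_mean) / (1.0 - mean))
    return mean * s, (1.0 - mean) * s                     # (17)

# E.g. a node with mean 0.357 and variance 0.02 maps back to a beta:
print(beta_from_moments(0.357, 0.02))  # approx (3.74, 6.74)
```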
4 Manipulating beta-distributed random variables under independence assumptions

Subjective logic [30] provides (1) an alternative, more intuitive way of representing the parameters of beta-distributed random variables, and (2) a set of operators for manipulating them. A subjective opinion about a proposition $X$ is a tuple $\omega_X = \langle b_X, d_X, u_X, a_X \rangle$, representing the belief, disbelief and uncertainty that $X$ is true at a given instance, and, as above, $a_X$ is the prior probability that $X$ is true in the absence of observations. These values are non-negative and $b_X + d_X + u_X = 1$. The projected probability $p(x) = b_X + u_X \cdot a_X$ provides an estimate of the ground truth probability $p_x$.

The mapping from a beta-distributed random variable $X$ with parameters $\boldsymbol{\alpha}_X = \langle \alpha_x, \alpha_{\bar{x}} \rangle$ to a subjective opinion is:

$$\omega_X = \left\langle \frac{\alpha_x - W a_X}{s_X}, \; \frac{\alpha_{\bar{x}} - W(1 - a_X)}{s_X}, \; \frac{W}{s_X}, \; a_X \right\rangle \qquad (24)$$

With this transformation, the mean of $X$ is equivalent to the projected probability $p(x)$, and the Dirichlet strength is inversely proportional to the uncertainty of the opinion:

$$\mathbb{E}[X] = p(x) = b_X + u_X a_X, \qquad s_X = \frac{W}{u_X} \qquad (25)$$

Conversely, a subjective opinion $\omega_X$ translates directly into a beta-distributed random variable with:

$$\boldsymbol{\alpha}_X = \left\langle \frac{W}{u_X} b_X + W a_X, \; \frac{W}{u_X} d_X + W(1 - a_X) \right\rangle \qquad (26)$$

Subjective logic is a framework that includes various operators to indirectly determine opinions from various logical operations. In particular, we will make use of $\oplus^{SL}$, $\otimes^{SL}$, and $\oslash^{SL}$, resp. summing, multiplying, and dividing two subjective opinions as they are defined in [30] (Appendix B). Those operators aim at faithfully matching the projected probabilities: for instance, the multiplication of two subjective opinions $\omega_X \otimes^{SL} \omega_Y$ results in an opinion $\omega_Z$ such that $p(z) = p(x) \cdot p(y)$.

Building upon our previous work [10], we allow manipulation of imprecise probabilities as labels in our circuits. Figure 3 shows an example of the circuits we will be manipulating, where the probabilities from the circuit depicted in Fig. 1 have been replaced by uncertain probabilities represented as beta-distributed random variables and formalised as SL opinions, in a shorthand format listing only belief and uncertainty values.
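A direct Python transcription (ours) of the mappings (24)-(26):

```python
W = 2.0  # prior weight

def opinion_from_beta(alpha_x, alpha_notx, a=0.5):
    """Beta parameters -> subjective opinion <b, d, u, a>, cf. (24)."""
    s = alpha_x + alpha_notx                    # Dirichlet strength (15)
    return ((alpha_x - W * a) / s,              # belief
            (alpha_notx - W * (1.0 - a)) / s,   # disbelief
            W / s,                              # uncertainty
            a)

def beta_from_opinion(b, d, u, a=0.5):
    """Subjective opinion -> beta parameters, cf. (26)."""
    return (W * b / u + W * a, W * d / u + W * (1.0 - a))

# Round trip for the posterior Beta(8, 4) from the earlier worked example:
op = opinion_from_beta(8.0, 4.0)   # (0.583..., 0.25, 0.166..., 0.5)
print(beta_from_opinion(*op))      # back to (8.0, 4.0)
```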
Figure 3: Variation on the circuit represented in Figure 1 with leaves labelled with imprecise probabilities represented as Subjective Logic opinions, listing only $b_X$ and $u_X$: $d_X = 1 - b_X - u_X$, and $a_X = 0.5$. Solid box for query, double box for evidence.

The straightforward approach to derive an AMC-conditioning parametrisation under complete independence assumptions at each step of the evaluation of the probabilistic circuit using subjective logic is to use the operators $\oplus^{SL}$, $\otimes^{SL}$, and $\oslash^{SL}$. This gives rise to the SL AMC-conditioning parametrisation $S_{SL}$, defined as follows:
$$\begin{array}{l} \mathcal{A}^{SL} = \mathbb{R}_{\geq 0}^{4} \\[4pt] a \oplus b = \begin{cases} a & \text{if } b = e^{\oplus_{SL}} \\ b & \text{if } a = e^{\oplus_{SL}} \\ a \oplus^{SL} b & \text{otherwise} \end{cases} \qquad a \otimes b = \begin{cases} a & \text{if } b = e^{\otimes_{SL}} \\ b & \text{if } a = e^{\otimes_{SL}} \\ a \otimes^{SL} b & \text{otherwise} \end{cases} \\[4pt] e^{\oplus_{SL}} = \langle 0, 1, 0, 0 \rangle \qquad e^{\otimes_{SL}} = \langle 1, 0, 0, 1 \rangle \\[4pt] \rho^{SL}(f_i) = \langle b_{f_i}, d_{f_i}, u_{f_i}, a_{f_i} \rangle \qquad \rho^{SL}(\neg f_i) = \langle d_{f_i}, b_{f_i}, u_{f_i}, 1 - a_{f_i} \rangle \\[4pt] a \oslash b = \begin{cases} a & \text{if } b = e^{\otimes_{SL}} \\ a \oslash^{SL} b & \text{if defined} \\ \langle 0, 0, 1, 0.5 \rangle & \text{otherwise} \end{cases} \end{array} \qquad (27)$$

Note that $\langle \mathcal{A}^{SL}, \oplus^{SL}, \otimes^{SL}, e^{\oplus_{SL}}, e^{\otimes_{SL}} \rangle$ does not form a commutative semiring in general. If we consider only the projected probabilities, i.e. the means of the associated beta distributions, then $\oplus$ and $\otimes$ are indeed commutative and associative, and $\otimes$ distributes over $\oplus$. However, the uncertainty of the resulting opinion depends on the order of the operands.

In [10] we derived another set of operators operating with moment matching: they aim at maintaining a stronger connection to the beta distribution as the result of the manipulation. Indeed, while SL operators try to faithfully characterise the projected probabilities, they employ an uncertainty maximisation principle to limit the belief commitments, hence they have a looser connection to the beta distribution. Instead, in [10] we first represented beta distributions (and thus also SL opinions) not parametric in $\boldsymbol{\alpha}$, but rather parametric in mean and variance. We then proposed operators that manipulate means and variances, and finally transformed the results back into beta distributions by moment matching.

In [10] we first defined a sum operator between two independent beta-distributed random variables $X$ and $Y$ as the beta-distributed random variable $Z$ such that $\mathbb{E}[Z] = \mathbb{E}[X + Y]$ and $\sigma^2_Z = \sigma^2_{X+Y}$. The sum (and, in the following, the product as well) of two beta random variables is not necessarily a beta random variable. Consistently with [33], the resulting distribution is then approximated as a beta distribution via moment matching on mean and variance.

Given $X$ and $Y$ independent beta-distributed random variables represented by the subjective opinions $\omega_X$ and $\omega_Y$, the sum of $X$ and $Y$ ($\omega_X \oplus^{\beta} \omega_Y$) is defined as the beta-distributed random variable $Z$ such that:

$$\mathbb{E}[Z] = \mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y] \qquad (28)$$

and

$$\sigma^2_Z = \sigma^2_{X+Y} = \sigma^2_X + \sigma^2_Y. \qquad (29)$$

$\omega_Z = \omega_X \oplus^{\beta} \omega_Y$ can then be obtained as discussed in Section 3, taking (23) into consideration. The same applies for the following operators as well.

The product operator between two independent beta-distributed random variables $X$ and $Y$ is then defined as the beta-distributed random variable $Z$ such that $\mathbb{E}[Z] = \mathbb{E}[XY]$ and $\sigma^2_Z = \sigma^2_{XY}$. Given $X$ and $Y$ independent beta-distributed random variables represented by the subjective opinions $\omega_X$ and $\omega_Y$, the product of $X$ and $Y$ ($\omega_X \otimes^{\beta} \omega_Y$) is defined as the beta-distributed random variable $Z$ such that:

$$\mathbb{E}[Z] = \mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y] \qquad (30)$$

and

$$\sigma^2_Z = \sigma^2_{XY} = \sigma^2_X (\mathbb{E}[Y])^2 + \sigma^2_Y (\mathbb{E}[X])^2 + \sigma^2_X \sigma^2_Y. \qquad (31)$$
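In Python, these mean/variance forms of $\oplus^{\beta}$ and $\otimes^{\beta}$ (28)-(31) are one-liners over (mean, variance) pairs (our sketch; the step back to a beta distribution would reuse beta_from_moments above):

```python
def sum_beta(x, y):
    """(28)-(29): sum of independent variables as (mean, variance) pairs."""
    (mx, vx), (my, vy) = x, y
    return mx + my, vx + vy

def prod_beta(x, y):
    """(30)-(31): product of independent variables as (mean, variance) pairs."""
    (mx, vx), (my, vy) = x, y
    return mx * my, vx * my**2 + vy * mx**2 + vx * vy

# Two leaves with mean 0.5 and variance 0.05 each:
print(sum_beta((0.5, 0.05), (0.5, 0.05)))   # (1.0, 0.1)
print(prod_beta((0.5, 0.05), (0.5, 0.05)))  # (0.25, 0.0275)
```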
Finally, the conditioning-division operator between two beta-distributed random variables $X$ and $Y$, represented by subjective opinions $\omega_X$ and $\omega_Y$, is the beta-distributed random variable $Z$ such that $\mathbb{E}[Z] = \mathbb{E}[X/Y]$ and $\sigma^2_Z = \sigma^2_{X/Y}$. Given $\omega_X = \langle b_X, d_X, u_X, a_X \rangle$ and $\omega_Y = \langle b_Y, d_Y, u_Y, a_Y \rangle$ subjective opinions such that $X$ and $Y$ are beta-distributed random variables, $Y = \mathbf{A}(\mathcal{I}(E = e)) = \mathbf{A}(\mathcal{I}(q \wedge E = e)) \oplus \mathbf{A}(\mathcal{I}(\neg q \wedge E = e))$, with $\mathbf{A}(\mathcal{I}(q \wedge E = e)) = X$. The conditioning-division of $X$ by $Y$ ($\omega_X \oslash^{\beta} \omega_Y$) is defined as the beta-distributed random variable $Z$ such that:

$$\mathbb{E}[Z] = \mathbb{E}\left[\frac{X}{Y}\right] = \mathbb{E}[X]\,\mathbb{E}\left[\frac{1}{Y}\right] \approx \frac{\mathbb{E}[X]}{\mathbb{E}[Y]} \qquad (32)$$

and

$$\sigma^2_Z \approx (\mathbb{E}[Z])^2 (1 - \mathbb{E}[Z])^2 \left( \frac{\sigma^2_X}{(\mathbb{E}[X])^2} + \frac{\sigma^2_Y + \sigma^2_X}{(\mathbb{E}[Y] - \mathbb{E}[X])^2} + \frac{2\,\sigma^2_X}{\mathbb{E}[X](\mathbb{E}[Y] - \mathbb{E}[X])} \right) \qquad (33)$$

(Please note that (33) corrects a typo that is present in its version in [10].)

$S_{\beta}$ is defined as follows:

$$\begin{array}{l} \mathcal{A}^{\beta} = \mathbb{R}_{\geq 0}^{4} \\[4pt] a \oplus b = a \oplus^{\beta} b \qquad a \otimes b = a \otimes^{\beta} b \\[4pt] e^{\oplus_{\beta}} = \langle 0, 1, 0, 0.5 \rangle \qquad e^{\otimes_{\beta}} = \langle 1, 0, 0, 0.5 \rangle \\[4pt] \rho^{\beta}(f_i) = \langle b_{f_i}, d_{f_i}, u_{f_i}, a_{f_i} \rangle \qquad \rho^{\beta}(\neg f_i) = \langle d_{f_i}, b_{f_i}, u_{f_i}, 1 - a_{f_i} \rangle \\[4pt] a \oslash b = a \oslash^{\beta} b \end{array} \qquad (34)$$

As per (27), $\langle \mathcal{A}^{\beta}, \oplus^{\beta}, \otimes^{\beta}, e^{\oplus_{\beta}}, e^{\otimes_{\beta}} \rangle$ is also not, in general, a commutative semiring. Means are correctly matched to projected probabilities, therefore for them $S_{\beta}$ actually operates as a semiring. However, for what concerns variance, by using (31) and (29), thus under the independence assumption, the product is not distributive over addition:

$$\begin{aligned} \mathrm{var}[X(Y + Z)] &= \mathrm{var}[X](\mathbb{E}[Y] + \mathbb{E}[Z])^2 + (\mathrm{var}[Y] + \mathrm{var}[Z])\,\mathbb{E}[X]^2 + \mathrm{var}[X](\mathrm{var}[Y] + \mathrm{var}[Z]) \\ &\neq \mathrm{var}[X](\mathbb{E}[Y]^2 + \mathbb{E}[Z]^2) + (\mathrm{var}[Y] + \mathrm{var}[Z])\,\mathbb{E}[X]^2 + \mathrm{var}[X](\mathrm{var}[Y] + \mathrm{var}[Z]) \\ &= \mathrm{var}[(XY) + (XZ)]. \end{aligned}$$

To illustrate the discrepancy, let us consider node 6 in Figure 3: the disjunction operator there is summing up probabilities that are not statistically independent, despite the independence assumption used in developing the operator. Due to the dependencies between nodes in the circuit, the error grows during propagation, and then the numerator and denominator in the conditioning operator exhibit strong correlation due to redundant operators. Therefore, (33) introduces further error, leading to an overall inadequate characterisation of variance. The next section reformulates the operations to account for the existing correlations.

5 Covariance-aware probabilistic entailment

We now propose an entirely novel approach to the AMC-conditioning problem that considers the covariances between the various distributions we are manipulating. Indeed, our approach for computing Covariance-aware Probabilistic entailment with Beta-distributed random variables (CPB) is designed to satisfy the total probability theorem, and in particular to enforce that, for any beta-distributed random variables $X$ and $Y$,

$$\mathrm{var}[(Y \otimes X) \oplus (Y \otimes \bar{X})] = \mathrm{var}[Y]. \qquad (35)$$
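To see why (35) forces covariance tracking (a one-line check, ours): write $A = Y \otimes X$ and $B = Y \otimes \bar{X}$; then

$$\mathrm{var}[A + B] = \mathrm{var}[A] + \mathrm{var}[B] + 2\,\mathrm{cov}[A, B],$$

so adding the children's variances alone, as an independence-assuming $\oplus$ does, cannot return $\mathrm{var}[Y]$ unless the cross-covariance term, here induced by the shared factor $Y$, is carried along.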
Algorithm 3 Solving the PROB problem on a circuit $N_{\mathcal{A}}$ labelled with identifiers of beta-distributed random variables and the associative table $\mathcal{A}$, and covariance matrix $C_{\mathcal{A}}$.

  procedure CovProbBeta(N_A, C_A)
    N̂_A := ShadowCircuit(N_A)
    return EvalCovProbBeta(N̂_A, C_A)
  end procedure

Algorithm 3 provides an overview of CPB, which comprises three stages: (1) pre-processing; (2) circuit shadowing; and (3) evaluation. The overall approach is to view the second-order probability of each node in the circuit as a beta distribution. The determination of the distributions is through moment matching via the first and second moments through (18) and (20). Effectively, the collection of nodes is treated as a multivariate Gaussian characterised by a mean vector and covariance matrix that is computed via the propagation process described below. When analysing the distribution for a particular node (via marginalisation of the Gaussian), it is approximated via the best-fitting beta distribution through moment matching.

5.1 Pre-processing

We assume that the circuit we are receiving has the leaves labelled with unique identifiers of beta-distributed random variables. We also allow for the specification of the covariance matrix between the beta-distributed random variables, bearing in mind that $\mathrm{cov}[X, 1 - X] = -\mathrm{var}[X]$, cf. (18) and (19). We do not provide a specific algorithm for this, as it would depend on the way the circuit is computed. In our running example, we assume the ProbLog code from Listing 1 has been transformed into the aProbLog⁵ code in Listing 2.

We also expect there is a table associating each identifier with the actual value of the beta-distributed random variable. In the following, we assume that $\omega_0$ is a reserved identifier for Beta$(\infty, 1)$, i.e. the beta-distributed random variable pinned to probability 1 (in Subjective Logic terms, $\langle 1, 0, 0, 0.5 \rangle$). For instance, Table 1 provides the associations for the code in Listing 2, and Table 2 the covariance matrix for those beta-distributed random variables, which we assume have been learnt from complete observations of independent random variables, hence the posterior beta-distributed random variables are also independent (cf. Appendix C).

⁵ aProbLog [36] is the algebraic version of ProbLog that allows for arbitrary labels to be used.

ω1::burglary.
ω2::earthquake.
ω3::hears_alarm(john).
alarm :- burglary.
alarm :- earthquake.
calls(john) :- alarm, hears_alarm(john).
evidence(calls(john)).
query(burglary).

Listing 2: aProbLog code for the Burglary example with unique identifiers for the random variables associated to the database, originally Example 6 in [21].
Table 1: Associative table for the aProbLog code in Listing 2, mapping each identifier to its beta parameters and the corresponding Subjective Logic opinion.

Table 2: Covariance matrix for the associative table (Tab. 1) under the assumption that all the beta-distributed random variables are independent of each other. We use a shorthand notation for clarity: $\sigma^2_i = \mathrm{cov}[\omega_i]$; the only non-zero off-diagonal entries, $-\sigma^2_i$, are between a variable and its negation, cf. (19). Zeros are omitted.

5.2 Circuit shadowing

We then augment the circuit by adding shadow nodes to superimpose a second circuit, enabling the possibility to assess, in a single forward pass, both $p(\text{query} \wedge \text{evidence})$ and $p(\text{evidence})$. This can provide a benefit time-wise at the expense of memory but, more importantly, it simplifies the bookkeeping of indexes in the covariance matrix, as we will see below.

Algorithm 4 focuses on the node that identifies the negation of the query we want to evaluate with this circuit, identified as $\overline{\text{qnode}}(N_{\mathcal{A}})$: indeed, to evaluate $p(\text{query} \wedge \text{evidence})$, the $\lambda$ parameter for such a node must be set to 0. In lines 4-18, Algorithm 4 superimposes a new circuit by creating shadow nodes, e.g. $\hat{c}$ at line 9, that will represent random variables affected by the change in the $\lambda$ parameter for $\overline{\text{qnode}}(N_{\mathcal{A}})$. The result of Algorithm 4 on the circuit for our running example is depicted in Figure 4.

In Algorithm 4 we make use of a stack data structure with associated pop and push functions (cf. lines 3, 5, 8, 16): that is for ease of presentation, as the algorithm does not require an ordered list.

5.3 Evaluating the shadowed circuit

Each of the nodes in the shadowed circuit (e.g. Figure 4) has an associated (beta-distributed) random variable. In the following, and in Algorithm 5, given a node $n$, its associated random variable is identified as $X_n$. For the nodes for which a $\rho$ label exists, the associated random variable is the beta-distributed random variable labelled via the $\rho$ function, cf. Figure 4.⁶

⁶ In this paper we focus on a query composed of a single literal.

Figure 4: Shadowing of the circuit represented in Figure 1 according to Algorithm 4. Solid box for query, double box for evidence, in grey the shadow nodes added to the circuit. If a node has a shadow, they are grouped together with a dashed box. Dashed arrows connect shadow nodes to their children.
Algorithm 4 Shadowing the circuit $N_{\mathcal{A}}$.

  procedure ShadowCircuit(N_A)
    N̂_A := N_A
    links := stack()
    for p ∈ parents(N_A, q̄node(N_A)) do
      push(links, ⟨q̄node(N_A), p⟩)
    end for
    while ¬empty(links) do
      ⟨c, p⟩ := pop(links)
      N̂_A := N̂_A ∪ {ĉ}
      if p̂ ∉ N̂_A then
        N̂_A := N̂_A ∪ {p̂}
        children(N̂_A, p̂) := children(N̂_A, p)
      end if
      children(N̂_A, p̂) := (children(N̂_A, p̂) \ {c}) ∪ {ĉ}
      for p′ ∈ parents(N_A, p) do
        push(links, ⟨p, p′⟩)
      end for
    end while
    return N̂_A
  end procedure
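A compact Python rendering of ShadowCircuit (ours; the dict-based circuit representation with explicit parent links is an assumption):

```python
def shadow_circuit(children, parents, qnode_neg):
    """Algorithm 4: superimpose shadow copies of every ancestor of the
    negated-query leaf, so one pass can evaluate both the original roots
    (query and evidence) and the shadowed ones (evidence only).

    children: dict node -> list of child nodes; parents: dict node -> list
    of parent nodes; shadow of node x is represented as ('shadow', x).
    """
    shadow = lambda x: ("shadow", x)
    links = [(qnode_neg, p) for p in parents.get(qnode_neg, [])]
    while links:
        c, p = links.pop()
        sp = shadow(p)
        if sp not in children:
            # First visit: the shadow parent starts with the original children.
            children[sp] = list(children[p])
        if shadow(c) not in children[sp]:
            # Redirect the shadowed parent to the shadow of this child.
            children[sp] = [x for x in children[sp] if x != c] + [shadow(c)]
        links.extend((p, q) for q in parents.get(p, []))
    return children
```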
Algorithm 5 begins with building a vector of means (means) and a matrix of covariances (cov) of the random variables associated to the leaves of the circuit (lines 2-16), derived from the $C_{\mathcal{A}}$ covariance matrix provided as input. The algorithm can be made more robust by handling the case where $C_{\mathcal{A}}$ is empty or a matrix of zeroes: in this case, assuming independence among the variables, it is straightforward to obtain a matrix such as Table 2.

Then, Algorithm 5 proceeds to compute the means and covariances for all the remaining nodes in the circuit (lines 17-31). Here two cases arise.

Let $n$ be a $\oplus$-gate over $C$ nodes, its children: hence (lines 22-25)

$$\mathbb{E}[X_n] = \sum_{c \in C} \mathbb{E}[X_c], \qquad (36)$$

$$\mathrm{cov}[X_n] = \sum_{c_1 \in C} \sum_{c_2 \in C} \mathrm{cov}[X_{c_1}, X_{c_2}], \qquad (37)$$

$$\mathrm{cov}[X_n, X_z] = \sum_{c \in C} \mathrm{cov}[X_c, X_z] \quad \text{for } z \in \hat{N}_{\mathcal{A}} \setminus \{n\}, \qquad (38)$$

with

$$\mathrm{cov}[X, Y] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] \qquad (39)$$

and $\mathrm{cov}[X] \equiv \mathrm{cov}[X, X] = \mathrm{var}[X]$.

Let $n$ be a $\otimes$-gate over $C$ nodes, its children (lines 26-30). Due to the nature of the variable $X_n$, following [25], let us assume

$$X_n = \Pi(\mathbf{X}_C) = \prod_{c \in C} X_c, \quad \text{with } \mathbf{X}_C = (X_{c_1}, \ldots, X_{c_k})^{\mathsf{T}} \text{ and } k = |C|.$$

Algorithm 5 Evaluating the shadowed circuit $\hat{N}_{\mathcal{A}}$ taking into consideration the given $C_{\mathcal{A}}$ covariance matrix.

  procedure EvalCovProbBeta(N̂_A, C_A)
    means := zeros(|N̂_A|, 1)
    cov := zeros(|N̂_A|, |N̂_A|)
    nvisited := { shadow(q̄node(N_A)) }
    for n ∈ leaves(N̂_A) \ shadow(q̄node(N_A)) do
      nvisited := nvisited ∪ {n}
      if λ = 1 then means[n] := E[X_n] else means[n] := 0 end if
      for n′ ∈ leaves(N̂_A) \ shadow(q̄node(N_A)) do
        cov[n, n′] := C_A[X_n, X_n′]
      end for
    end for
    nqueue := N̂_A \ nvisited
    while nqueue ≠ ∅ do
      n := some n ∈ nqueue s.t. children(N̂_A, n) ⊆ nvisited
      nqueue := nqueue \ {n};  nvisited := nvisited ∪ {n}
      if n is a (shadowed) disjunction over C := children(N̂_A, n) then
        means[n] := Σ_{c ∈ C} means[c]
        cov[n, n] := Σ_{c1 ∈ C} Σ_{c2 ∈ C} cov[c1, c2]
        cov[z, n] := cov[n, z] := Σ_{c ∈ C} cov[c, z]   ∀ z ∈ N̂_A \ {n}
      else if n is a (shadowed) conjunction over C := children(N̂_A, n) then
        means[n] := Π_{c ∈ C} means[c]
        cov[n, n] := Σ_{c1 ∈ C} Σ_{c2 ∈ C} (means[n]/means[c1]) (means[n]/means[c2]) cov[c1, c2]
        cov[z, n] := cov[n, z] := Σ_{c ∈ C} (means[n]/means[c]) cov[c, z]   ∀ z ∈ N̂_A \ {n}
      end if
    end while
    r := root(N̂_A)
    return ⟨ means[r] / means[r̂],
             (1/means[r̂]²) cov[r, r] + (means[r]²/means[r̂]⁴) cov[r̂, r̂]
             − (2 means[r]/means[r̂]³) cov[r, r̂] ⟩
  end procedure

Expanding the first two terms of the Taylor series yields:

$$\begin{aligned} X_n &\approx \Pi(\mathbb{E}[\mathbf{X}_C]) + (\mathbf{X}_C - \mathbb{E}[\mathbf{X}_C])^{\mathsf{T}} \, \nabla \Pi(\mathbf{X}_C) \Big|_{\mathbf{X}_C = \mathbb{E}[\mathbf{X}_C]} \\ &= \mathbb{E}[X_n] + (X_{c_1} - \mathbb{E}[X_{c_1}]) \prod_{c \in C \setminus \{c_1\}} \mathbb{E}[X_c] + \ldots + (X_{c_k} - \mathbb{E}[X_{c_k}]) \prod_{c \in C \setminus \{c_k\}} \mathbb{E}[X_c] \\ &= \mathbb{E}[X_n] + \sum_{c_1 \in C} \frac{\prod_{c \in C} \mathbb{E}[X_c]}{\mathbb{E}[X_{c_1}]} \, (X_{c_1} - \mathbb{E}[X_{c_1}]) = \mathbb{E}[X_n] + \sum_{c_1 \in C} \frac{\mathbb{E}[X_n]}{\mathbb{E}[X_{c_1}]} \, (X_{c_1} - \mathbb{E}[X_{c_1}]) \end{aligned} \qquad (40)$$

where the first term can be seen as an approximation for $\mathbb{E}[X_n]$. Using this approximation, then (lines 34-42 of Algorithm 5):

$$\mathrm{cov}[X_n] \approx \sum_{c_1 \in C} \sum_{c_2 \in C} \frac{\mathbb{E}[X_n]}{\mathbb{E}[X_{c_1}]} \frac{\mathbb{E}[X_n]}{\mathbb{E}[X_{c_2}]} \, \mathrm{cov}[X_{c_1}, X_{c_2}], \qquad (41)$$

$$\mathrm{cov}[X_n, X_z] \approx \sum_{c \in C} \frac{\mathbb{E}[X_n]}{\mathbb{E}[X_c]} \, \mathrm{cov}[X_c, X_z] \quad \text{for } z \in \hat{N}_{\mathcal{A}} \setminus \{n\}. \qquad (42)$$

Finally, Algorithm 5 computes a conditioning between $X_r$ and $X_{\hat{r}}$, with $r$ being the root of the circuit ($r := \mathrm{root}(\hat{N}_{\mathcal{A}})$ at line 46). This shows how critical it is to keep track of the non-zero covariances where they exist. The Taylor series approximation of $X_r$ and $\frac{1}{X_{\hat{r}}}$ about $\mathbb{E}[X_r]$ and $\frac{1}{\mathbb{E}[X_{\hat{r}}]}$ leads to

$$\frac{X_r}{X_{\hat{r}}} \approx \frac{\mathbb{E}[X_r]}{\mathbb{E}[X_{\hat{r}}]} + \frac{1}{\mathbb{E}[X_{\hat{r}}]}(X_r - \mathbb{E}[X_r]) - \frac{\mathbb{E}[X_r]}{\mathbb{E}[X_{\hat{r}}]^2}(X_{\hat{r}} - \mathbb{E}[X_{\hat{r}}]), \qquad (43)$$

which implies

$$\mathbb{E}\left[\frac{X_r}{X_{\hat{r}}}\right] \approx \frac{\mathbb{E}[X_r]}{\mathbb{E}[X_{\hat{r}}]}, \qquad (44)$$

$$\mathrm{cov}\left[\frac{X_r}{X_{\hat{r}}}\right] \approx \frac{1}{\mathbb{E}[X_{\hat{r}}]^2} \, \mathrm{cov}[X_r] + \frac{\mathbb{E}[X_r]^2}{\mathbb{E}[X_{\hat{r}}]^4} \, \mathrm{cov}[X_{\hat{r}}] - \frac{2\,\mathbb{E}[X_r]}{\mathbb{E}[X_{\hat{r}}]^3} \, \mathrm{cov}[X_r, X_{\hat{r}}]. \qquad (45)$$

Tables 3 and 4 depict, respectively, the non-zero values of the means vector and of the cov matrix for our running example, from which Algorithm 5 derives the mean and variance of $p(\text{burglary} \mid \text{calls(john)})$.
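The core updates of Algorithm 5, i.e. (36)-(38) for $\oplus$-gates, (41)-(42) for $\otimes$-gates, and the final conditioning (44)-(45), can be sketched in a few lines of numpy (ours; node indices and the topological order are assumed given):

```python
import numpy as np

def propagate(order, kind, children, means, cov):
    """Fill means/cov for internal nodes, visited in topological order.

    means: 1-D array over all nodes, leaves already filled;
    cov: 2-D array over all nodes, leaf block already filled;
    kind[n] in {"or", "and"}; children[n]: list of node indices.
    """
    for n in order:
        cs = children[n]
        if kind[n] == "or":                       # (36)-(38)
            w = np.ones(len(cs))
            means[n] = means[cs].sum()
        else:                                     # (40): first-order Taylor
            means[n] = means[cs].prod()
            w = means[n] / means[cs]              # dXn/dXc at the means
        cov[n, :] = w @ cov[cs, :]                # (38) / (42)
        cov[:, n] = cov[n, :]
        cov[n, n] = w @ cov[np.ix_(cs, cs)] @ w   # (37) / (41)
    return means, cov

def condition(means, cov, r, rs):
    """(44)-(45): mean and variance of X_r / X_rs for the two roots."""
    m = means[r] / means[rs]
    v = (cov[r, r] / means[rs]**2
         + means[r]**2 * cov[rs, rs] / means[rs]**4
         - 2 * means[r] * cov[r, rs] / means[rs]**3)
    return m, v
```

The resulting (m, v) pair would then be mapped back to a beta distribution with beta_from_moments from Section 3.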
Table 3: Means vector (non-zero values) as computed by Algorithm 5 on our running example. In grey the shadow nodes. $\hat{X}_7$, i.e. the shadow of qnode($N_{\mathcal{A}}$), is included for illustration purposes.

Figure 5: Resulting distribution of probabilities for our running example using Algorithm 3 (solid line), and a Monte Carlo simulation with 100,000 samples grouped in 25 bins and then interpolated with a cubic polynomial (dashed line).

Table 4: Covariance matrix ($\times 10^{-3}$) as computed by Algorithm 5 on our running example. In grey the shadow nodes. Values very close or equal to zero are omitted. Also, values for nodes labelled with negated variables are omitted. $\hat{X}_7$, i.e. the shadow of qnode($N_{\mathcal{A}}$), is included for illustration purposes.
5.4 Memory performance

Algorithm 3 returns the mean and variance of the probability for the query conditioned on the evidence. Algorithm 4 adds shadow nodes to the initial circuit formed by the evidence to avoid redundant computations in the second pass. For the sake of clarity, Algorithm 5 is presented in its most simple form. As formulated, it requires an $|\hat{N}_{\mathcal{A}}| \times |\hat{N}_{\mathcal{A}}|$ array to store the covariance values between the nodes. For large circuits, this memory requirement can significantly slow down the processing (e.g., disk swaps) or simply become prohibitive. The covariances of a particular node are only required after it is computed via lines 24-25 or 34-35 in Algorithm 5. Furthermore, these covariances are no longer needed once all the parent node values have been computed. Thus, it is straightforward to dynamically allocate/de-allocate portions of the covariance array as needed. In fact, the selection of node $n$ to compute in line 19, which is currently arbitrary, can be designed to minimise processing time in light of the resident memory requirements for the covariance array. Such an optimisation depends on the computing architecture and complicates the presentation; thus, further details are beyond the scope of this paper.

6 Experimental analysis

To illustrate the benefits of Algorithm 3 (Section 5), we run an experimental analysis involving several circuits with unspecified labelling function. For each circuit, labels are first derived for the case of parametrisation $S_p$ (5) by selecting the ground truth probabilities from a uniform random distribution. Then, for each label, we derive a subjective opinion by observing $N_{ins}$ instantiations of a random variable derived from the chosen probability, so as to simulate data sparsity [33].

We then proceed to analyse the inference on specific query nodes $q$ in the presence of a set of evidence $E = e$ using:
• CPB, as articulated in Section 5;
• $S_{\beta}$, cf. (34);
• $S_{SL}$, cf. (27);
• MC, a Monte Carlo analysis with 100 samples from the derived random variables to obtain probabilities, and then computing the probability of queries in presence of evidence using the parametrisation $S_p$.

We then compare the RMSE to the actual ground truth. This process of inference to determine the marginal beta distributions is repeated 1000 times by considering 100 random choices for each label of the circuit, i.e. the ground truth, and for each ground truth 10 repetitions of sampling the interpretations used to derive the subjective opinion labels observing $N_{ins}$ instantiations of all the variables.

ω::stress(X) :- person(X).
ω::influences(X,Y) :- person(X), person(Y).
smokes(X) :- stress(X).
smokes(X) :- friend(X,Y), influences(Y,X), smokes(Y).
ω::asthma(X) :- smokes(X).
person(1). person(2). person(3). person(4).
friend(1,2). friend(2,1). friend(2,4). friend(3,2). friend(4,2).
evidence(smokes(2),true).
evidence(influences(4,2),false).
query(smokes(1)). query(smokes(3)). query(smokes(4)).
query(asthma(1)). query(asthma(2)). query(asthma(3)). query(asthma(4)).

Listing 3: Smokers and Friends aProbLog code.

We judge the quality of the beta distributions of the queries on how well their expression of uncertainty captures the spread between the projected probability and the actual ground truth probability, as also [33] did. In simulations where the ground truths are known, such as ours, confidence bounds can be formed around the projected probabilities at a significance level of γ, to determine the fraction of cases in which the ground truth falls within the bounds. If the uncertainty is well determined by the beta distributions, then this fraction should correspond to the strength γ of the confidence interval [33, Appendix C].
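A sketch of this coverage check in Python (ours; scipy's equal-tailed beta interval is used for the bounds):

```python
import numpy as np
from scipy.stats import beta as beta_dist

def coverage(alphas, betas, ground_truths, gamma):
    """Fraction of queries whose ground-truth probability falls inside the
    equal-tailed gamma-confidence interval of the inferred beta, cf. [33]."""
    lo, hi = beta_dist.interval(gamma, alphas, betas)
    inside = (ground_truths >= lo) & (ground_truths <= hi)
    return inside.mean()

# Well-calibrated uncertainty: actual coverage tracks the desired gamma.
rng = np.random.default_rng(0)
a, b = np.full(1000, 8.0), np.full(1000, 4.0)
truths = rng.beta(a, b)            # ground truths drawn from the model itself
for gamma in (0.5, 0.8, 0.95):
    print(gamma, coverage(a, b, truths, gamma))
```

Plotting actual coverage against the desired γ yields exactly the diagonal-style plots of Figures 6 and 9 below.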
Following [10], we consider the famous Friends & Smokers problem,⁷ cf. Listing 3, with fixed queries and set of evidence. Table 5 provides the root mean square error (RMSE) between the projected probabilities and the ground truth probabilities for all the inferred query variables for $N_{ins}$ = 10, 50, 100. The table also includes the predicted RMSE obtained by taking the square root of the average, over the number of runs, of the variances from the inferred marginal beta distributions.

⁷ https://dtai.cs.kuleuven.be/problog/tutorial/basic/05_smokers.html (on 29th April 2020).

Denoting by $X^g_q$ (resp. $X_q$) the random variable associated to the query $q$ computed using the golden standard (resp. computed using either MC or CPB), the Pearson's correlation coefficient displayed in Figure 8 is given by:

$$r = \frac{\mathrm{cov}[s_{X^g_q}, s_{X_q}]}{\sqrt{\mathrm{var}[s_{X^g_q}]\,\mathrm{var}[s_{X_q}]}} \qquad (46)$$

This is a measure of the quality of the epistemic uncertainty associated with the evaluation of the circuit using MC with varying numbers of samples, and CPB: the closer the Dirichlet strengths are to those of the golden standard, the better the computed epistemic uncertainty represents the actual uncertainty;⁸ hence, the closer the correlations are to 1 in Figure 8, the better.

⁸ The Dirichlet strengths are inversely proportional to the epistemic uncertainty.

From Table 5, CPB exhibits the lowest RMSE and the best prediction of its own RMSE. As already noticed in [10], $S_{\beta}$ is a little conservative in estimating its own RMSE, while $S_{SL}$ is overconfident. This is reflected in Figure 6, with the results of $S_{\beta}$ being over the diagonal and those of $S_{SL}$ being below it, while CPB sits exactly on the diagonal, as does MC. However, MC with 100 samples does not exhibit the lowest RMSE according to Table 5, although the difference with the best one is much lower compared with $S_{SL}$.

Considering the execution time, Figure 7, we can see that there is a substantial difference between CPB and MC with 100 samples.

Finally, Figure 8 depicts the correlation of the Dirichlet strength between the golden standard, i.e. a Monte Carlo simulation with 10,000 samples, and both CPB and MC, this last one varying the number of samples used. It is straightforward to see that MC improves the accuracy of the computed epistemic uncertainty when increasing the number of samples considered, approaching the same level of CPB when considering more than 200 samples.

To compare our approach against the state-of-the-art approaches for reasoning over Bayesian networks with uncertain probabilities, we consider circuits derived from networks such as the one in Listing 4.

ω::n1.
ω::n2 :- \+n1.    ω::n2 :- n1.
ω::n3 :- \+n2.    ω::n3 :- n2.
ω::n4 :- \+n2.    ω::n4 :- n2.
ω::n5 :- \+n3.    ω::n5 :- n3.
ω::n6 :- \+n3.    ω::n6 :- n3.
ω::n7 :- \+n6.    ω::n7 :- n6.
ω::n8 :- \+n5.    ω::n8 :- n5.
ω::n9 :- \+n5.    ω::n9 :- n5.
evidence(n1, e1). evidence(n4, e4). evidence(n7, e7).
evidence(n8, e8). evidence(n9, e9).
query(n2). query(n3). query(n5). query(n6).

Listing 4: An example of aProbLog code that can be seen also as a Bayesian network, cf. Fig. 12a in Appendix D. The $e_i$ are randomly assigned as either True or False.
 N_ins        CPB     S_β     S_SL    MC
 10     A     0.1065  0.1065  0.1198  0.1072
        P     0.1024  0.1412  0.1060  0.1027
 50     A     0.0489  0.0489  0.0617  0.0490
        P     0.0491  0.0898  0.0587  0.0489
 100    A     0.0354  0.0354  0.0521  0.0355
        P     0.0357  0.0709  0.0487  0.0356

Table 5: RMSE for the queried variables in the Friends & Smokers program: A stands for Actual, P for Predicted. Best results (also considering hidden decimals) for the actual RMSE boxed. The Monte Carlo approach has been run over 100 samples.

Figure 6: Actual versus desired significance of bounds derived from the uncertainty for Smokers & Friends with: (a) $N_{ins} = 10$; (b) $N_{ins} = 50$; and (c) $N_{ins} = 100$.
Subjective Bayesian Network (SBN) [28, 32, 33] was first proposed in [28], and it is an uncertain Bayesian network where the conditionals are subjective opinions instead of dogmatic probabilities. In other words, inference proceeds by subjective belief propagation (SBP), where π- and λ-messages are passed from parents and children, respectively, to a node, i.e., variable. The node uses these messages to formulate the inferred marginal probability of the corresponding variable. The node also uses these messages to determine the π- and λ-messages to send to its children and parents, respectively. In SBP, the π- and λ-messages are subjective opinions characterised by a projected probability and Dirichlet strength. The SBP formulation approximates output messages as beta-distributed random variables using the method of moments and a first-order Taylor series approximation to determine the mean and variance of the output messages in light of the beta-distributed input messages. The details of the derivations are provided in [32, 33].

Belief Networks (GBT) [53] introduced a computationally efficient method to reason over networks via Dempster-Shafer theory [18]. It is an approximation of a valuation-based system. Namely, a (conditional) subjective opinion $\omega_X = [b_x, b_{\bar{x}}, u_X]$ from our circuit obtained from data is converted to the following belief mass assignment: $m(x) = b_x$, $m(\bar{x}) = b_{\bar{x}}$ and $m(x \cup \bar{x}) = u_X$. Note that in the binary case, the belief function overlaps with the belief mass assignment. The method exploits the disjunctive rule of combination to compose beliefs conditioned on the Cartesian product space of the binary power sets. This enables both forward propagation and backward propagation after inverting the belief conditionals via the generalized Bayes' theorem (GBT). By operating in the Cartesian product space of the binary power sets, the computational complexity grows exponentially with respect to the number of parents.

Figure 7: Distribution of execution time for running the different algorithms for Smokers & Friends with: (a) $N_{ins} = 10$; (b) $N_{ins} = 50$; and (c) $N_{ins} = 100$.

Figure 8: Correlation of Dirichlet strengths between runs of the Monte Carlo approach varying the number of samples and the golden standard (i.e. a Monte Carlo run with 10,000 samples), as well as between the proposed approach and the golden standard with cubic interpolation (independent of the number of samples used in Monte Carlo), for Smokers & Friends with: (a) $N_{ins} = 10$; (b) $N_{ins} = 50$; and (c) $N_{ins} = 100$.
Credal Networks (Credal) [61]. A credal network over binary random variables extends a Bayesian network by replacing single probability values with closed intervals representing the possible range of probability values. The extension of Pearl's message-passing algorithm by the 2U algorithm for credal networks is described in [61]. This algorithm works by determining the maximum and minimum value (an interval) for each of the target probabilities based on the given input intervals. It turns out that these extreme values lie at the vertices of the polytope dictated by the extreme values of the input intervals. As a result, the computational complexity grows exponentially with respect to the number of parent nodes. For the sake of comparison, we assume that the random variables we label our circuits with, elicited from the given data, correspond to a credal network in the following way: if $\omega_X = [b_x, b_{\bar{x}}, u_X]$ is a subjective opinion on the probability $p_x$, then we have $[b_x, b_x + u_X]$ as the interval corresponding to this probability in the credal network. It should be noted that this mapping from the beta-distributed random variables to an interval is consistent with past studies of credal networks [35].

As before, Table 6 provides the root mean square error (RMSE) between the projected probabilities and the ground truth probabilities for all the inferred query variables for $N_{ins}$ = 10, 50, 100, together with the RMSE predicted by taking the square root of the average variances from the inferred marginal beta distributions. Figure 9 plots the desired and actual significance levels for the confidence intervals (best closest to the diagonal). Figure 10 depicts the distribution of execution time for running the various algorithms, and Figure 11 the correlation of the Dirichlet strength between the golden standard, i.e. a Monte Carlo simulation with 10,000 samples, and both CPB and MC varying the number of samples.

Table 6 shows that CPB shares the best performance with the state-of-the-art SBN and $S_{\beta}$ almost constantly. This is clearly a significant achievement considering that SBN is the state-of-the-art approach when dealing only with singly connected Bayesian networks with uncertain probabilities, while we can handle general circuits.

       N_ins        CPB     S_β     S_SL    MC      SBN     GBT     Credal
Net1   10     A     0.1511  0.1511  0.2078  0.1517  0.1511  0.1542  0.1633
              P     0.1473  0.1864  0.1559  0.1465  0.1472  0.0873  0.2009
Net1   50     A     0.0816  0.0816  0.1237  0.0818  0.0816  0.0848  0.0827
              P     0.0802  0.1227  0.0825  0.0789  0.0794  0.0372  0.1069
Net1   100    A     0.0544  0.0544  0.0837  0.0550  0.0544  0.0601  0.0557
              P     0.0572  0.0971  0.0592  0.0564  0.0566  0.0262  0.0766
Net2   10     A     0.1389  0.1389  0.1916  0.1392  0.1389  0.1418  0.1473
              P     0.1391  0.1808  0.1457  0.1381  0.1399  0.1058  0.1856
Net2   50     A     0.0701  0.0701  0.1092  0.0702  0.0701  0.0730  0.0702
              P     0.0722  0.1148  0.0755  0.0714  0.0720  0.0486  0.0952
Net2   100    A     0.0534  0.0534  0.0901  0.0536  0.0534  0.0553  0.0537
              P     0.0533  0.0937  0.0601  0.0526  0.0531  0.0340  0.0696
Net3   10     A     0.1481  0.1481  0.2160  0.1488  0.1481  0.1511  0.1634
              P     0.1453  0.1708  0.1578  0.1438  0.1454  0.0821  0.1947
Net3   50     A     0.0737  0.0737  0.1167  0.0741  0.0737  0.0760  0.0756
              P     0.0777  0.1115  0.0780  0.0763  0.0772  0.0348  0.1003
Net3   100    A     0.0574  0.0574  0.0909  0.0578  0.0574  0.0608  0.0582
              P     0.0564  0.0882  0.0584  0.0553  0.0560  0.0239  0.0728
As before, Table 6 provides the root mean square error (RMSE) between the projected probabilities and the ground truth probabilities for all the inferred query variables for $N_{ins}$ = 10, 50, 100, together with the RMSE predicted by taking the square root of the average variances from the inferred marginal beta distributions (a minimal sketch of this computation is given after the table). Figure 9 plots the desired and actual significance levels for the confidence intervals (best closest to the diagonal). Figure 10 depicts the distribution of execution times for the various algorithms, and Figure 11 the correlation of the Dirichlet strength between the golden standard, i.e. a Monte Carlo simulation with 10,000 samples, and both CPB and MC for a varying number of samples.

Table 6 shows that CPB almost constantly shares the best performance with the state-of-the-art SBN and $S_\beta$. This is a significant achievement considering that SBN is the state-of-the-art approach when dealing only with singly connected Bayesian networks with uncertain probabilities, while we can handle general circuits.

Net  N_ins    CPB     S_β     S_SL    MC      SBN     GBT     Credal
Net1 10   A   0.1511  0.1511  0.2078  0.1517  0.1511  0.1542  0.1633
Net1 10   P   0.1473  0.1864  0.1559  0.1465  0.1472  0.0873  0.2009
Net1 50   A   0.0816  0.0816  0.1237  0.0818  0.0816  0.0848  0.0827
Net1 50   P   0.0802  0.1227  0.0825  0.0789  0.0794  0.0372  0.1069
Net1 100  A   0.0544  0.0544  0.0837  0.0550  0.0544  0.0601  0.0557
Net1 100  P   0.0572  0.0971  0.0592  0.0564  0.0566  0.0262  0.0766
Net2 10   A   0.1389  0.1389  0.1916  0.1392  0.1389  0.1418  0.1473
Net2 10   P   0.1391  0.1808  0.1457  0.1381  0.1399  0.1058  0.1856
Net2 50   A   0.0701  0.0701  0.1092  0.0702  0.0701  0.0730  0.0702
Net2 50   P   0.0722  0.1148  0.0755  0.0714  0.0720  0.0486  0.0952
Net2 100  A   0.0534  0.0534  0.0901  0.0536  0.0534  0.0553  0.0537
Net2 100  P   0.0533  0.0937  0.0601  0.0526  0.0531  0.0340  0.0696
Net3 10   A   0.1481  0.1481  0.2160  0.1488  0.1481  0.1511  0.1634
Net3 10   P   0.1453  0.1708  0.1578  0.1438  0.1454  0.0821  0.1947
Net3 50   A   0.0737  0.0737  0.1167  0.0741  0.0737  0.0760  0.0756
Net3 50   P   0.0777  0.1115  0.0780  0.0763  0.0772  0.0348  0.1003
Net3 100  A   0.0574  0.0574  0.0909  0.0578  0.0574  0.0608  0.0582
Net3 100  P   0.0564  0.0882  0.0584  0.0553  0.0560  0.0239  0.0728

Table 6: RMSE for the queried variables in the various networks: A stands for Actual, P for Predicted. Best results for the Actual RMSE (also considering hidden decimals) are boxed. The Monte Carlo approach has been run over 100 samples.
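As a minimal sketch (our own naming, and assuming the predicted RMSE averages the variances over the inferred query variables), the "P" rows of Table 6 can be computed as follows:

import math

def beta_variance(alpha: float, beta: float) -> float:
    s = alpha + beta
    mu = alpha / s
    return mu * (1.0 - mu) / (s + 1.0)

# Predicted RMSE: square root of the average variance of the inferred
# marginal beta distributions, one (alpha, beta) pair per query variable.
def predicted_rmse(marginals: list) -> float:
    return math.sqrt(sum(beta_variance(a, b) for a, b in marginals) / len(marginals))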
Figure 9: Actual versus desired significance of bounds derived from the uncertainty for: (a) Net1 with $N_{ins}$ = 10; (b) Net1 with $N_{ins}$ = 50; (c) Net1 with $N_{ins}$ = 100; (d) Net2 with $N_{ins}$ = 10; (e) Net2 with $N_{ins}$ = 50; (f) Net2 with $N_{ins}$ = 100; (g) Net3 with $N_{ins}$ = 10; (h) Net3 with $N_{ins}$ = 50; (i) Net3 with $N_{ins}$ = 100. Each panel plots the actual against the desired confidence for CPB, $S_\beta$, $S_{SL}$, MC, SBN, GBT, and Credal.
$S_\beta$ has lower RMSE than $S_{SL}$, and it seems that $S_\beta$ overestimates the predicted RMSE while $S_{SL}$ underestimates it: $S_{SL}$ predicts a smaller error than is realised, and vice versa for $S_\beta$.

From visual inspection of Figure 9, it is evident that CPB, SBN, and MC are all very close to the diagonal, thus correctly assessing their own epistemic uncertainty. The performance of $S_\beta$ is heavily affected by the fact that it computes the conditional distributions at the very end of the process and relies, in (33), on an assumption of independence. CPB, which keeps track of the covariance between the various nodes in the circuit, does not suffer from this problem. This positive result has been achieved without substantial deterioration of performance in terms of execution time, as displayed in Figure 10, for which the same commentary as for Figure 7 applies.

Finally, Figure 11 depicts the correlation of the Dirichlet strength between the golden standard, i.e. a Monte Carlo simulation with 10,000 samples, and both CPB and MC, the latter with a varying number of samples. As for Figure 8, it is straightforward to see that MC improves the accuracy of its computed epistemic uncertainty when increasing the number of samples considered, approaching the level of CPB when considering more than 200 samples, while CPB performs very closely to the optimal value of 1. A sketch of this Monte Carlo baseline follows.
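The Monte Carlo baseline can be summarised by the following sketch (an illustration under our own naming, not the exact experimental harness): sample every leaf probability from its beta distribution, evaluate the circuit on each joint sample, and summarise the resulting output probabilities by their moments.

import random

# leaf_betas: dict name -> (alpha, beta); evaluate: function from a dict of
# sampled leaf probabilities to the output probability of the circuit.
def mc_output_moments(leaf_betas: dict, evaluate, n_samples: int = 10_000):
    outs = []
    for _ in range(n_samples):
        sample = {k: random.betavariate(a, b) for k, (a, b) in leaf_betas.items()}
        outs.append(evaluate(sample))
    mean = sum(outs) / n_samples
    var = sum((o - mean) ** 2 for o in outs) / (n_samples - 1)
    return mean, var

# Toy circuit p(q) = p_a * p_b + (1 - p_a) * p_c over independent leaves:
print(mc_output_moments(
    {"a": (2, 2), "b": (8, 2), "c": (1, 9)},
    lambda s: s["a"] * s["b"] + (1 - s["a"]) * s["c"]))

The sampled mean and variance can then be moment-matched to a beta distribution; the number of samples governs how accurately the epistemic uncertainty is recovered, which is precisely what Figure 11 measures.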
Conclusions

In this paper, we introduced (Section 5) an algorithm for reasoning over a probabilistic circuit whose leaves are labelled with beta-distributed random variables, together with the additional piece of information describing which of those are actually independent (Section 5.1). This provides the input to an algorithm that shadows the circuit derived for computing the probability of the pieces of evidence by superimposing a second circuit modified for computing the probability of a given query and the pieces of evidence, thus providing all the necessary components for computing the probability of a query conditioned on the pieces of evidence (Section 5.2). This is essential when evaluating such a shadowed circuit (Section 5.3), with the covariance matrix playing an essential role by keeping track of the dependencies between random variables while they are manipulated within the circuit. We also discuss memory management in Section 5.4.

In our extensive experimental analysis (Section 6) we compared against leading approaches to computing uncertain probabilities, notably: (1) Monte Carlo sampling; (2) our previous proposal [10], as representative of the family of approaches using moment matching with strong independence assumptions; (3) Subjective Logic [30]; (4) Subjective Bayesian Networks (SBN) [28, 32, 33]; (5) Dempster-Shafer Theory of Evidence [18, 53]; and (6) credal networks [61].

We achieve the same or better results than state-of-the-art approaches for dealing with epistemic uncertainty, including highly engineered ones for narrow domains such as SBN, while being able to handle general probabilistic circuits and with just a modest increase in the computational effort. In fact, this work has inspired us to leverage probabilistic circuits to expand second-order inference for SBN to arbitrary directed acyclic graphs whose variables are multinomial. In work soon to be released [34], we prove the mathematical equivalence of the updated SBN inference approach to that of [55], but with a significantly lower computational burden.

We focused our attention on probabilistic circuits derived from d-DNNFs: work by [15], and then also by [38], introduced Sentential Decision Diagrams (SDDs) as a new canonical formalism for propositional and for probabilistic circuits, respectively. However, as we can read in [15, p. 819], SDD is a strict subset of d-DNNF, which is thus the least constrained type of propositional circuit we can safely rely on according to [37, Theorem 4]. In future work we will enable our approach to efficiently make use of SDDs.

In addition, we will also work in the direction of enabling learning with partial observations, that is, incomplete data where the instantiations of each of the propositional variables are not always visible over all training instances, on top of the existing ability to track the covariance values between the various random variables for a better estimation of epistemic uncertainty.
Acknowledgement
This research was sponsored by the U.S. Army Research Laboratory and the U.K. Ministry of Defence under Agreement Number W911NF-16-3-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Research Laboratory, the U.S. Government, the U.K. Ministry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
References

[1] Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. Guidelines for human-AI interaction. In Conference on Human Factors in Computing Systems - Proceedings, pages 1-13, New York, New York, USA, May 2019. Association for Computing Machinery.

[2] R. Anderson, N. Hare, and S. Maskell. Using a Bayesian model for confidence to make decisions that consider epistemic regret. In , pages 264-269, 2016.

[3] Alessandro Antonucci, Alexander Karlsson, and David Sundgren. Decision making with hierarchical credal sets. In Anne Laurent, Oliver Strauss, Bernadette Bouchon-Meunier, and Ronald R. Yager, editors, Information Processing and Management of Uncertainty in Knowledge-Based Systems, pages 456-465, 2014.

[4] Fahiem Bacchus, Shannon Dalmao, and Toniann Pitassi. Solving #SAT and Bayesian inference with backtracking search. Journal of Artificial Intelligence Research, 34:391-442, January 2009.

[5] Gagan Bansal, Besmira Nushi, Ece Kamar, Walter Lasecki, Dan Weld, and Eric Horvitz. Beyond accuracy: The role of mental models in human-AI team performance. In HCOMP. AAAI, October 2019.

[6] Gagan Bansal, Besmira Nushi, Ece Kamar, Daniel S. Weld, Walter S. Lasecki, and Eric Horvitz. Updates in human-AI teams: Understanding and addressing the performance/compatibility tradeoff. In AAAI, pages 2429-2437, 2019.

[7] John S. Baras and George Theodorakopoulos. Path problems in networks. Synthesis Lectures on Communication Networks, 3:1-77, January 2010.

[8] Elena Bellodi and Fabrizio Riguzzi. Expectation maximization over binary decision diagrams for probabilistic logic programs. Intelligent Data Analysis, 17(2):343-363, March 2013.

[9] Ivan Bratko. Prolog Programming for Artificial Intelligence. Addison Wesley, 2001.

[10] Federico Cerutti, Lance M. Kaplan, Angelika Kimmig, and Murat Sensoy. Probabilistic logic programming with beta-distributed random variables. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, pages 7769-7776. AAAI Press, 2019.

[11] Federico Cerutti and Matthias Thimm. A general approach to reasoning with probabilities. International Journal of Approximate Reasoning, 111:35-50, 2019.

[12] Mark Chavira and Adnan Darwiche. On probabilistic inference by weighted model counting. Artificial Intelligence, 172(6):772-799, 2008.

[13] Arthur Choi and Adnan Darwiche. Dynamic minimization of sentential decision diagrams. In Proceedings of the 27th AAAI Conference on Artificial Intelligence, AAAI 2013, pages 187-194. AAAI Press, 2013.

[14] Adnan Darwiche. New advances in compiling CNF to decomposable negation normal form. In Proceedings of the 16th European Conference on Artificial Intelligence, ECAI'04, pages 318-322. IOS Press, 2004.

[15] Adnan Darwiche. SDD: A new canonical representation of propositional knowledge bases. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, IJCAI'11, pages 819-826. AAAI Press, 2011.

[16] Adnan Darwiche and Pierre Marquis. A knowledge compilation map. Journal of Artificial Intelligence Research, 17(1):229-264, September 2002.

[17] Luc De Raedt, Angelika Kimmig, and Hannu Toivonen. ProbLog: A probabilistic Prolog and its application in link discovery. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2462-2467, 2007.

[18] A. P. Dempster. A generalization of Bayesian inference. Journal of the Royal Statistical Society, Series B (Methodological), 30(2):205-247, 1968.

[19] Jason Eisner. Parameter estimation for probabilistic finite-state transducers. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 1-8, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics.

[20] Jason Eisner, Eric Goldlust, and Noah A. Smith. Compiling comp ling: Practical weighted dynamic programming and the Dyna language. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 281-290, 2005.

[21] Daan Fierens, Guy Van den Broeck, Joris Renkens, Dimitar Shterionov, Bernd Gutmann, Ingo Thon, Gerda Janssens, and Luc De Raedt. Inference and learning in probabilistic logic programs using weighted Boolean formulas. Theory and Practice of Logic Programming, 15(3):358-401, May 2015.

[22] Tal Friedman and Guy Van den Broeck. Approximate knowledge compilation by online collapsed importance sampling. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8024-8034. Curran Associates, Inc., 2018.

[23] Robert Gens and Pedro Domingos. Learning the structure of sum-product networks. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1910-1917, Atlanta, Georgia, USA, 2013. PMLR.

[24] Joshua Goodman. Semiring parsing. Computational Linguistics, 25(4):573-606, 1999.

[25] Haym Benaroya, Seon Mi Han, and Mark Nagurka. Probability Models in Engineering and Science. CRC Press, 2005.

[26] Stephen C. Hora. Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management. Reliability Engineering & System Safety, 54(2):217-223, 1996.

[27] Eyke Hüllermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: A tutorial introduction, 2019.

[28] Magdalena Ivanovska, Audun Jøsang, Lance Kaplan, and Francesco Sambo. Subjective networks: Perspectives and challenges. In Proceedings of the 4th International Workshop on Graph Structures for Knowledge Representation and Reasoning, pages 107-124, Buenos Aires, Argentina, 2015.

[29] Priyank Jaini, Abdullah Rashwan, Han Zhao, Yue Liu, Ershad Banijamali, Zhitang Chen, and Pascal Poupart. Online algorithms for sum-product networks with continuous variables. In Conference on Probabilistic Graphical Models, pages 228-239, 2016.

[30] Audun Jøsang. Subjective Logic: A Formalism for Reasoning Under Uncertainty. Springer, 2016.

[31] Audun Jøsang, Ross Hayward, and Simon Pope. Trust network analysis with subjective logic. In Proceedings of the 29th Australasian Computer Science Conference, Volume 48, pages 85-94, 2006.

[32] Lance Kaplan and Magdalena Ivanovska. Efficient subjective Bayesian network belief propagation for trees. In Intl. Conf. on Information Fusion (FUSION), pages 1300-1307, 2016.

[33] Lance Kaplan and Magdalena Ivanovska. Efficient belief propagation in second-order Bayesian networks for singly-connected graphs. International Journal of Approximate Reasoning, 93:132-152, 2018.

[34] Lance Kaplan, Magdalena Ivanovska, Kumar Vijay Mishra, Federico Cerutti, and Murat Sensoy. Second-order inference in uncertain Bayesian networks. To be submitted to the International Journal of Artificial Intelligence, 2020.

[35] Alexander Karlsson, Ronnie Johansson, and Sten F. Andler. An empirical comparison of Bayesian and credal networks for dependable high-level information fusion. In Intl. Conf. on Information Fusion (FUSION), pages 1-8, 2008.

[36] Angelika Kimmig, Guy Van den Broeck, and Luc De Raedt. An algebraic Prolog for reasoning about possible worlds. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pages 209-214, 2011.

[37] Angelika Kimmig, Guy Van den Broeck, and Luc De Raedt. Algebraic model counting. Journal of Applied Logic, 22(C):46-62, July 2017.

[38] Doga Kisa, Guy Van den Broeck, Arthur Choi, and Adnan Darwiche. Probabilistic sentential decision diagrams. In Proceedings of the Fourteenth International Conference on Principles of Knowledge Representation and Reasoning, KR'14, pages 558-567. AAAI Press, 2014.

[39] Rafal Kocielnik, Saleema Amershi, and Paul N. Bennett. Will you accept an imperfect AI? Exploring designs for adjusting end-user expectations of AI systems. In Conference on Human Factors in Computing Systems - Proceedings, pages 1-14, New York, New York, USA, May 2019. Association for Computing Machinery.

[40] Robert A. Kowalski. The early years of logic programming. Communications of the ACM, 31(1):38-43, 1988.

[41] Pierre Simon Laplace. A Philosophical Essay on Probabilities. Springer, 1825. Translated by Andrew I. Dale, published in 1995.

[42] Yitao Liang, Jessa Bekker, and Guy Van den Broeck. Learning the structure of probabilistic sentential decision diagrams. In Uncertainty in Artificial Intelligence - Proceedings of the 33rd Conference, UAI 2017, 2017.

[43] Magnus Moglia, Ashok K. Sharma, and Shiroma Maheepala. Multi-criteria decision assessments using subjective logic: Methodology and the case of urban water strategies. Journal of Hydrology, 452-453:180-189, 2012.

[44] Umut Oztok and Adnan Darwiche. A top-down compiler for sentential decision diagrams. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 3141-3148. AAAI Press, 2015.

[45] Judea Pearl. Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29(3):241-288, 1986.

[46] David Poole. Abducing through negation as failure: Stable models within the independent choice logic. The Journal of Logic Programming, 44(1):5-35, 2000.

[47] Abdullah Rashwan, Han Zhao, and Pascal Poupart. Online and distributed Bayesian moment matching for parameter learning in sum-product networks. In Artificial Intelligence and Statistics, pages 1469-1477, 2016.

[48] Amirmohammad Rooshenas and Daniel Lowd. Learning sum-product networks with direct and indirect variable interactions. In Proceedings of the 31st International Conference on Machine Learning, ICML'14, pages I-710-I-718. JMLR.org, 2014.

[49] Tian Sang, Paul Beame, and Henry Kautz. Performing Bayesian inference by weighted model counting. In Proceedings of the 20th National Conference on Artificial Intelligence, Volume 1, pages 475-481, 2005.

[50] Taisuke Sato. A statistical learning method for logic programs with distribution semantics. In Proceedings of the 12th International Conference on Logic Programming (ICLP-95), 1995.

[51] Taisuke Sato and Yoshitaka Kameya. Parameter learning of logic programs for symbolic-statistical modeling. Journal of Artificial Intelligence Research, 15(1):391-454, December 2001.

[52] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential deep learning to quantify classification uncertainty. In Advances in Neural Information Processing Systems 31, 2018.

[53] P. Smets. Belief functions: The disjunctive rule of combination and the generalized Bayesian theorem. International Journal of Approximate Reasoning, 9:1-35, 1993.

[54] Martin Trapp, Robert Peharz, Hong Ge, Franz Pernkopf, and Zoubin Ghahramani. Bayesian learning of sum-product networks. In Advances in Neural Information Processing Systems, pages 6344-6355, 2019.

[55] Tim Van Allen, Ajit Singh, Russell Greiner, and Peter Hooper. Quantifying the uncertainty of a belief net response: Bayesian error-bars for belief net inference. Artificial Intelligence, 172(4):483-513, 2008.

[56] Antonio Vergari, Nicola Di Mauro, and Floriana Esposito. Simplifying, regularizing and strengthening sum-product network structure learning. In Annalisa Appice, Pedro Pereira Rodrigues, Vítor Santos Costa, João Gama, Alípio Jorge, and Carlos Soares, editors, Machine Learning and Knowledge Discovery in Databases, pages 343-358, Cham, 2015. Springer International Publishing.

[57] Antonio Vergari, Alejandro Molina, Robert Peharz, Zoubin Ghahramani, Kristian Kersting, and Isabel Valera. Automatic Bayesian density analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5207-5215, 2019.

[58] John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior (commemorative edition). Princeton University Press, 2007.

[59] Joachim von zur Gathen. Algebraic complexity theory. Annual Review of Computer Science, 3(1):317-348, June 1988.

[60] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck. A semantic loss function for deep learning with symbolic knowledge. In Proceedings of the 35th International Conference on Machine Learning, volume 12, pages 8752-8760, 2018.

[61] M. Zaffalon and E. Fagiuoli. 2U: An exact interval propagation algorithm for polytrees with binary variables. Artificial Intelligence, 106(1):77-107, 1998.

[62] Han Zhao, Tameem Adel, Geoff Gordon, and Brandon Amos. Collapsed variational inference for sum-product networks. In International Conference on Machine Learning, pages 1310-1318, 2016.

[63] Han Zhao, Pascal Poupart, and Geoff Gordon. A unified approach for learning the parameters of sum-product networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 433-441, Red Hook, NY, USA, 2016. Curran Associates Inc.
A aProbLog
In the last years, several probabilistic variants of Prolog have been developed, such as ICL [46], Dyna [20], PRISM [51] and ProbLog [17], with its aProbLog extension [36] to handle arbitrary labels from a semiring. They are all based on definite clause logic (pure Prolog) extended with facts labelled with probability values. Their meaning is typically derived from Sato's distribution semantics [50], which assigns a probability to every literal. The probability of a Herbrand interpretation, or possible world, is the product of the probabilities of the literals occurring in this world. The success probability is the probability that a query succeeds in a randomly selected world.

For a set $J$ of ground facts, we define the set of literals $L(J)$ and the set of interpretations $I(J)$ as follows:

$L(J) = J \cup \{\neg f \mid f \in J\}$ (47)

$I(J) = \{S \mid S \subseteq L(J) \wedge \forall l \in J : l \in S \leftrightarrow \neg l \notin S\}$ (48)

An algebraic Prolog (aProbLog) program [36] consists of:

• a commutative semiring $\langle \mathcal{A}, \oplus, \otimes, e_\oplus, e_\otimes \rangle$;
• a finite set of ground algebraic facts $F = \{f_1, \ldots, f_n\}$;
• a finite set $BK$ of background knowledge clauses;
• a labelling function $\rho : L(F) \to \mathcal{A}$.

Background knowledge clauses are definite clauses, but their bodies may contain negative literals for algebraic facts. Their heads may not unify with any algebraic fact. For instance, in the following aProbLog program

alarm :- burglary.
0.05 :: burglary.

burglary is an algebraic fact with label 0.05, and alarm :- burglary represents a background knowledge clause, whose intuitive meaning is: in case of burglary, the alarm should go off.

The idea of splitting a logic program into a set of facts and a set of clauses goes back to Sato's distribution semantics [50], where it is used to define a probability distribution over interpretations of the entire program in terms of a distribution over the facts. This is possible because a truth value assignment to the facts in $F$ uniquely determines the truth values of all other atoms defined in the background knowledge. In the simplest case, as realised in ProbLog [17, 21], this basic distribution considers facts to be independent random variables and thus multiplies their individual probabilities. aProbLog uses the same basic idea, but generalises from the semiring of probabilities to general commutative semirings. While the distribution semantics is defined for countably infinite sets of facts, the set of ground algebraic facts in aProbLog must be finite.

In aProbLog, the label of a complete interpretation $I \in I(F)$ is defined as the product of the labels of its literals

$\mathbf{A}(I) = \bigotimes_{l \in I} \rho(l)$ (49)

and the label of a set of interpretations $S \subseteq I(F)$ as the sum of the interpretation labels

$\mathbf{A}(S) = \bigoplus_{I \in S} \bigotimes_{l \in I} \rho(l)$ (50)

A query $q$ is a finite set of algebraic literals and atoms from the Herbrand base (i.e., the set of ground atoms that can be constructed from the predicate, functor and constant symbols of the program), $q \subseteq L(F) \cup HB(F \cup BK)$. We denote the set of interpretations where the query is true by $I(q)$,

$I(q) = \{I \mid I \in I(F) \wedge I \cup BK \models q\}$ (51)

The label of query $q$ is defined as the label of $I(q)$,

$\mathbf{A}(q) = \mathbf{A}(I(q)) = \bigoplus_{I \in I(q)} \bigotimes_{l \in I} \rho(l).$ (52)

As both operators are commutative and associative, the label is independent of the order of both literals and interpretations.

ProbLog [21] is an instance of aProbLog with

$\mathcal{A} = \mathbb{R}_{\geq 0}$; $a \oplus b = a + b$; $a \otimes b = a \cdot b$; $e_\oplus = 0$; $e_\otimes = 1$; $\delta(f) \in [0, 1]$; $\delta(\neg f) = 1 - \delta(f)$ (53)

A brute-force sketch of this label computation, instantiated with the ProbLog semiring, is given below.
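The following sketch uses our own encoding (in particular, representing the effect of the background knowledge as a Python function is an assumption for illustration): it computes the label of a query via equations (47)-(52), instantiated with the ProbLog semiring of (53).

from itertools import product
from functools import reduce

# facts: list of fact names; label: dict (name, truth value) -> label;
# entails: maps a total truth assignment (dict name -> bool) to True iff
# the background knowledge plus the assignment entails the query;
# semiring: (plus, times, zero, one).
def query_label(facts, label, entails, semiring):
    plus, times, zero, one = semiring
    total = zero
    for bits in product((True, False), repeat=len(facts)):
        interp = dict(zip(facts, bits))
        if entails(interp):
            lab = reduce(times, (label[(f, v)] for f, v in interp.items()), one)
            total = plus(total, lab)
    return total

# ProbLog semiring of (53): (+, *, 0, 1), delta(f) = p, delta(not f) = 1 - p.
problog = (lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)

# Program: alarm :- burglary.  0.05 :: burglary.  Query: alarm.
labels = {("burglary", True): 0.05, ("burglary", False): 0.95}
print(query_label(["burglary"], labels, lambda i: i["burglary"], problog))  # 0.05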
B Subjective Logic Operators of Sum, Multiplication, and Division

Let us recall the following operators as defined in [30]. In the following, let $\omega_X = \langle b_X, d_X, u_X, a_X \rangle$ and $\omega_Y = \langle b_Y, d_Y, u_Y, a_Y \rangle$ be two subjective logic opinions.

B.1 Sum
The opinion about $X \cup Y$ (sum, $\omega_X \oplus_{SL} \omega_Y$) is defined as $\omega_{X \cup Y} = \langle b_{X \cup Y}, d_{X \cup Y}, u_{X \cup Y}, a_{X \cup Y} \rangle$, where:

• $b_{X \cup Y} = b_X + b_Y$;
• $d_{X \cup Y} = \frac{a_X (d_X - b_Y) + a_Y (d_Y - b_X)}{a_X + a_Y}$;
• $u_{X \cup Y} = \frac{a_X u_X + a_Y u_Y}{a_X + a_Y}$; and
• $a_{X \cup Y} = a_X + a_Y$.

B.2 Product

The opinion about $X \wedge Y$ (product, $\omega_X \otimes_{SL} \omega_Y$) is defined, under the assumption of independence, as $\omega_{X \wedge Y} = \langle b_{X \wedge Y}, d_{X \wedge Y}, u_{X \wedge Y}, a_{X \wedge Y} \rangle$, where:

• $b_{X \wedge Y} = b_X b_Y + \frac{(1 - a_X) a_Y b_X u_Y + a_X (1 - a_Y) u_X b_Y}{1 - a_X a_Y}$;
• $d_{X \wedge Y} = d_X + d_Y - d_X d_Y$;
• $u_{X \wedge Y} = u_X u_Y + \frac{(1 - a_Y) b_X u_Y + (1 - a_X) u_X b_Y}{1 - a_X a_Y}$; and
• $a_{X \wedge Y} = a_X a_Y$.

A direct code transcription of these two operators follows.
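The following is a direct transcription of the two operators into Python, as a sketch for binary opinions <b, d, u, a> (function names are ours):

# Sum (B.1) of two subjective logic opinions.
def sl_sum(x, y):
    bx, dx, ux, ax = x
    by, dy, uy, ay = y
    a = ax + ay
    return (bx + by,
            (ax * (dx - by) + ay * (dy - bx)) / a,
            (ax * ux + ay * uy) / a,
            a)

# Product (B.2); assumes the two opinions are independent.
def sl_product(x, y):
    bx, dx, ux, ax = x
    by, dy, uy, ay = y
    denom = 1.0 - ax * ay
    b = bx * by + ((1 - ax) * ay * bx * uy + ax * (1 - ay) * ux * by) / denom
    d = dx + dy - dx * dy
    u = ux * uy + ((1 - ay) * bx * uy + (1 - ax) * ux * by) / denom
    return (b, d, u, ax * ay)

# A well-formed result keeps b + d + u = 1:
x = (0.2, 0.6, 0.2, 0.3)
y = (0.3, 0.5, 0.2, 0.4)
print(sl_sum(x, y))      # ~ (0.5, 0.3, 0.2, 0.7)
print(sl_product(x, y))  # ~ (0.085, 0.8, 0.115, 0.12)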
B.3 Division

The opinion about the division of $X$ by $Y$, $X \,\widetilde{\wedge}\, Y$ (division, $\omega_X \oslash_{SL} \omega_Y$), is defined as $\omega_{X \widetilde{\wedge} Y} = \langle b_{X \widetilde{\wedge} Y}, d_{X \widetilde{\wedge} Y}, u_{X \widetilde{\wedge} Y}, a_{X \widetilde{\wedge} Y} \rangle$, where:

• $b_{X \widetilde{\wedge} Y} = \frac{a_Y (b_X + a_X u_X)}{(a_Y - a_X)(b_Y + a_Y u_Y)} - \frac{a_X (1 - d_X)}{(a_Y - a_X)(1 - d_Y)}$;
• $d_{X \widetilde{\wedge} Y} = \frac{d_X - d_Y}{1 - d_Y}$;
• $u_{X \widetilde{\wedge} Y} = \frac{a_Y (1 - d_X)}{(a_Y - a_X)(1 - d_Y)} - \frac{a_Y (b_X + a_X u_X)}{(a_Y - a_X)(b_Y + a_Y u_Y)}$; and
• $a_{X \widetilde{\wedge} Y} = \frac{a_X}{a_Y}$,

subject to:

• $a_X < a_Y$; $d_X \geq d_Y$;
• $b_X \geq \frac{a_X (1 - a_Y)(1 - d_X) b_Y}{(1 - a_X) a_Y (1 - d_Y)}$; and
• $u_X \geq \frac{(1 - a_Y)(1 - d_X) u_Y}{(1 - a_X)(1 - d_Y)}$.

A transcription of this operator, including its applicability constraints, follows.
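A Python transcription of the operator, as a sketch under our own naming; it raises ValueError whenever the division is undefined:

# Division (B.3) of opinion x by opinion y.
def sl_division(x, y):
    bx, dx, ux, ax = x
    by, dy, uy, ay = y
    if not (ax < ay and dx >= dy):
        raise ValueError("requires a_X < a_Y and d_X >= d_Y")
    if bx < ax * (1 - ay) * (1 - dx) * by / ((1 - ax) * ay * (1 - dy)):
        raise ValueError("belief constraint violated")
    if ux < (1 - ay) * (1 - dx) * uy / ((1 - ax) * (1 - dy)):
        raise ValueError("uncertainty constraint violated")
    k = ay - ax
    b = ay * (bx + ax * ux) / (k * (by + ay * uy)) - ax * (1 - dx) / (k * (1 - dy))
    d = (dx - dy) / (1 - dy)
    u = ay * (1 - dx) / (k * (1 - dy)) - ay * (bx + ax * ux) / (k * (by + ay * uy))
    return (b, d, u, ax / ay)

# Dividing the product of the two opinions from the B.2 sketch by one operand
# recovers the other: p = sl_product(x, y) with x = (0.2, 0.6, 0.2, 0.3) and
# y = (0.3, 0.5, 0.2, 0.4).
p = (0.085, 0.8, 0.115, 0.12)
print(sl_division(p, (0.3, 0.5, 0.2, 0.4)))  # ~ (0.2, 0.6, 0.2, 0.3)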
C Independence of posterior distributions when learning from complete observations

Let us instantiate AMC using probabilities as labels (cf. (5)) and let us consider a propositional logic theory over $M$ variables. We can thus re-write (1) as:

$p(T) = \sum_{I \in M(T)} \prod_{m=1}^{M} p(l_m)$ (54)

Hence, the probability of a theory is a function of the probabilities of the interpretations $I \in M(T)$, where

$p(I \in M(T)) = \prod_{m=1}^{M} p(l_m)$ (55)

Let us assume that we want to learn such probabilities from a dataset $\mathcal{D} = (\mathbf{x}_1, \ldots, \mathbf{x}_N)^T$; then, by (55), the variables for which we are learning probabilities are independent, hence

$p(l_1, \ldots, l_M) = \prod_{m=1}^{M} p(l_m)$ (56)

We can thus re-write the likelihood (9) as:

$p(\mathcal{D} \mid \mathbf{p}_x) = \prod_{i=1}^{|\mathcal{D}|} p(\mathbf{x}_i \mid \mathbf{p}_{x_i}) = \prod_{i=1}^{|\mathcal{D}|} \prod_{m=1}^{M} p_{x_m}^{x_{i,m}} (1 - p_{x_m})^{1 - x_{i,m}}$ (57)
Assuming a uniform prior, and letting $r_m$ be the number of observations with $x_m = 1$ and $s_m$ the number of observations with $x_m = 0$, we can thus compute the posterior as:

$p(\mathbf{p}_x \mid \mathcal{D}, \alpha) \propto p(\mathcal{D} \mid \mathbf{p}_x) \cdot p(\mathbf{p}_x \mid \alpha) \propto \prod_{m=1}^{M} p_{x_m}^{r_m + \alpha_{x_m} - 1} (1 - p_{x_m})^{s_m + \beta_{x_m} - 1}$ (58)

which, in turn, shows that independence is maintained also for the posterior beta distributions. The sketch below illustrates this column-wise factorisation.
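A minimal sketch of (58), with variable names of our own choosing: with complete observations the per-variable counts are sufficient statistics, and each posterior remains an independent beta distribution.

# data_column: 0/1 observations of one propositional variable;
# prior: (alpha, beta) of the beta prior (uniform by default).
def beta_posterior(data_column, prior=(1.0, 1.0)):
    r = sum(data_column)          # observations with x_m = 1
    s = len(data_column) - r      # observations with x_m = 0
    return prior[0] + r, prior[1] + s

# Complete observations over two variables factorise column by column:
D = [(1, 0), (1, 1), (0, 1), (1, 0)]
print([beta_posterior([row[m] for row in D]) for m in range(2)])
# [(4.0, 2.0), (3.0, 3.0)]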
D Bayesian networks derived from aProbLog programs

Figure 12 depicts the Bayesian networks that can be derived from the three circuits considered in the experiments described in Section 6.2.
Figure 10: Distribution of computational time for running the different algorithms for: (a) Net1 with $N_{ins}$ = 10; (b) Net1 with $N_{ins}$ = 50; (c) Net1 with $N_{ins}$ = 100; (d) Net2 with $N_{ins}$ = 10; (e) Net2 with $N_{ins}$ = 50; (f) Net2 with $N_{ins}$ = 100; (g) Net3 with $N_{ins}$ = 10; (h) Net3 with $N_{ins}$ = 50; (i) Net3 with $N_{ins}$ = 100. Each panel shows the distribution of execution times (s) for CPB and MC.

Figure 11: Correlation of Dirichlet strengths between runs of the Monte Carlo approach varying the number of samples and the golden standard (i.e. a Monte Carlo run with 10,000 samples), as well as between the proposed approach and the golden standard, with cubic interpolation (which is independent of the number of samples used in Monte Carlo), for: (a) Net1 with $N_{ins}$ = 10; (b) Net1 with $N_{ins}$ = 50; (c) Net1 with $N_{ins}$ = 100; (d) Net2 with $N_{ins}$ = 10; (e) Net2 with $N_{ins}$ = 50; (f) Net2 with $N_{ins}$ = 100; (g) Net3 with $N_{ins}$ = 10; (h) Net3 with $N_{ins}$ = 50; (i) Net3 with $N_{ins}$ = 100. Each panel plots the correlation with the golden standard against the number of Monte Carlo samples (10 to 200).

Figure 12: Bayesian networks derived from the three circuits of Section 6.2: panels (a), (b), (c).