Uncertainty Reasoning for Probabilistic Petri Nets via Bayesian Networks
Rebecca Bernemann, Benjamin Cabrera, Reiko Heckel, Barbara König
Rebecca Bernemann
University of Duisburg-Essen, [email protected]
Benjamin Cabrera
University of Duisburg-Essen, [email protected]
Reiko Heckel
University of Leicester, [email protected]
Barbara König
University of Duisburg-Essen, [email protected]
Abstract
This paper exploits extended Bayesian networks for uncertainty reasoning on Petri nets, where firing of transitions is probabilistic. In particular, Bayesian networks are used as symbolic representations of probability distributions, modelling the observer's knowledge about the tokens in the net. The observer can study the net by monitoring successful and failed steps. An update mechanism for Bayesian nets is enabled by relaxing some of their restrictions, leading to modular Bayesian nets that can conveniently be represented and modified. As for every symbolic representation, the question is how to derive information – in this case marginal probability distributions – from a modular Bayesian net. We show how to do this by generalizing the known method of variable elimination. The approach is illustrated by examples about the spreading of diseases (SIR model) and information diffusion in social networks. We have implemented our approach and provide runtime results.
Mathematics of computing → Bayesian networks; Software and its engineering → Petri nets
Keywords and phrases uncertainty reasoning, probabilistic knowledge, Petri nets, Bayesian networks
Funding
This work was supported by the Deutsche Forschungsgemeinschaft (DFG) under grant GRK 2167, Research Training Group "User-Centred Social Media".
© Rebecca Bernemann, Benjamin Cabrera, Reiko Heckel, and Barbara König; licensed under Creative Commons License CC-BY. 40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020). Editors: Nitin Saxena and Sunil Simon; Article No. 42; pp. 42:1–42:26. Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany.

Today's software systems and the real-world processes they support are often distributed, with agents acting independently based on their own local state but without complete knowledge of the global state. E.g., a social network may expose a partial history of its users' interactions while hiding their internal states. An application tracing the spread of a virus can record test results but not the true infection state of its subjects. Still, in both cases, we would like to derive knowledge under uncertainty to allow us, for example, to predict the spread of news in the social network or trace the outbreak of a virus.

Using Petri nets as a basis for modelling concurrent systems, our aim is to perform uncertainty reasoning on Petri nets, employing Bayesian networks as compact representations of probability distributions. Assume that we are observing a discrete-time concurrent system modelled by a Petri net. The net's structure is known, but its initial state is uncertain, given only as an a-priori probability distribution on markings. The net is probabilistic: transitions are chosen at random, either from the set of enabled transitions or independently, based on probabilities that are known but may change between steps. We cannot observe which transition actually fires, but only whether firing was successful or failed.
Failures occur if the chosen transition is not enabled under the current marking (in the case where we choose transitions independently), if no transition can fire, or if a special fail transition is chosen. After observing the system for a number of steps, recording a sequence of "success" and "failure" events, we then determine a marginal distribution on the markings (e.g., compute the probability that a given place is marked), taking into account all observations.

First, we set up a framework for uncertainty reasoning based on time-inhomogeneous Markov chains that formally describes this scenario, parameterized over the specific semantics of the probabilistic net. This encompasses the well-known stochastic Petri nets [29], as well as a semantics where the choice of the marking and the transition is independent. Using basic Bayesian reasoning (reminiscent of methods used for hidden Markov models [32]), it is conceptually relatively straightforward to update the probability distribution based on the acquired knowledge. However, the probability space is exponential in the number of places of the net and hence direct computations become infeasible relatively quickly.

Following [4], our solution is to use (modular) Bayesian networks [36, 12, 31] as compact symbolic representations of probability distributions. Updates to the probability distribution can be performed very efficiently on this data structure, simply by adding additional nodes. By analyzing the structure of the Petri net we ensure that such a node has a minimal number of connections to already existing nodes.

As for every symbolic representation, the question is how to derive information, in this case marginal probability distributions. We solve this question by generalizing the known method of variable elimination [13, 12] to modular Bayesian networks.
This method is known to work efficiently for networks of small treewidth, a fact that we experimentally verify in our implementation. We consider some small application examples modelling gossip and infection spreading. Summarized, our contributions are:
- We propose a framework for uncertainty reasoning based on time-inhomogeneous Markov chains, parameterized over different types of probabilistic Petri nets (Sct. 2 and 3).
- We use modular Bayesian networks to symbolically represent and update probability distributions (Sct. 4 and 5).
- We extend the variable elimination method to modular Bayesian networks and show how it can be efficiently employed in order to compute marginal distributions (Sct. 6). This is corroborated by our implementation and runtime results (Sct. 7).

All proofs and further material can be found in the appendix.

Markov chains [17, 35] are a stochastic state-based model, in which the probability of a transition depends only on the state of origin. Here we restrict to a finite state space.
Definition 1 (Markov chain). Let $Q$ be a finite state space. A (discrete-time) Markov chain is a sequence $(X_n)_{n \in \mathbb{N}}$ of random variables such that for $q, q_0, \dots, q_n \in Q$:
$$P(X_{n+1} = q \mid X_n = q_n) = P(X_{n+1} = q \mid X_n = q_n, \dots, X_0 = q_0).$$

Assume that $|Q| = k$. Then, the probability distribution over $Q$ at time $n$ can be represented as a $k$-dimensional vector $p^n$, indexed over $Q$. We abbreviate $p^n(q) = P(X_n = q)$. We define $k \times k$ transition matrices $P^n$, indexed over $Q$, with entries $P^n(q' \mid q) = P(X_{n+1} = q' \mid X_n = q)$, where we write $M(q' \mid q)$, resembling conditional probability, for the entry of a matrix $M$ at row $q'$ and column $q$. Note that $p^{n+1} = P^n \cdot p^n$. We do not restrict to time-homogeneous Markov chains, where it is required that $P^n = P^{n+1}$ for all $n \in \mathbb{N}$. Instead, the probability distribution on the transitions might vary over time.

As a basis for probabilistic Petri nets we use the following variant of condition/event nets [33]. Deviating from [33], we omit the initial marking; furthermore, the fact that the post-condition is marked does not inhibit the firing of a transition. That is, we omit the so-called contact condition, which makes it easier to model examples from application scenarios where the contact condition would be unnatural. Note however that we could easily accommodate the theory to include this condition, as we did in the predecessor paper [4].
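To make the update $p^{n+1} = P^n \cdot p^n$ concrete, the following is a minimal sketch (not from the paper; the dict-based encoding and all names are hypothetical) of a time-inhomogeneous Markov chain step, where the transition matrix may differ between steps:

```python
# Sketch: one step p^{n+1} = P^n · p^n of a time-inhomogeneous Markov chain.
# Distributions are dicts q -> prob; matrices are dicts (q_next, q) -> prob.

def step(P_n, p_n):
    """Multiply the step-n transition matrix P_n with the distribution p_n."""
    p_next = {}
    for (q_next, q), prob in P_n.items():
        p_next[q_next] = p_next.get(q_next, 0.0) + prob * p_n.get(q, 0.0)
    return p_next

# Two-state example where the transition matrix changes between steps:
P0 = {("a", "a"): 0.5, ("b", "a"): 0.5, ("b", "b"): 1.0}   # matrix at step 0
P1 = {("a", "a"): 1.0, ("a", "b"): 0.5, ("b", "b"): 0.5}   # matrix at step 1
p = {"a": 1.0}
p = step(P0, p)   # {'a': 0.5, 'b': 0.5}
p = step(P1, p)   # {'a': 0.75, 'b': 0.25}
```

Since the matrices are indexed as $M(q' \mid q)$, each entry contributes $M(q' \mid q) \cdot p^n(q)$ to $p^{n+1}(q')$, mirroring the definition above.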
Definition 2 (condition/event net). A condition/event net (C/E net or simply Petri net) $N = (S, T, {}^\bullet(), ()^\bullet)$ is a four-tuple consisting of a finite set of places $S$, a finite set of transitions $T$ with pre-conditions ${}^\bullet() : T \to \mathcal{P}(S)$ and post-conditions $()^\bullet : T \to \mathcal{P}(S)$. A marking is any subset of places $m \subseteq S$ and will also be represented by a bit string $m \in \{0,1\}^{|S|}$ (assuming an ordering on the places).

A transition $t$ can fire for a marking $m \subseteq S$ if ${}^\bullet t \subseteq m$. Then marking $m$ is transformed into $m' = (m \setminus {}^\bullet t) \cup t^\bullet$, written $m \stackrel{t}{\Rightarrow} m'$. We write $m \stackrel{t}{\Rightarrow}$ to indicate that there exists some $m'$ with $m \stackrel{t}{\Rightarrow} m'$, and $m \stackrel{t}{\not\Rightarrow}$ if this is not the case. We denote the set of all markings by $\mathcal{M} = \mathcal{P}(S)$.

In order to obtain a Markov chain from a C/E net, we need the following data: given a marking $m$ and a transition $t$, we denote by $r_n(m, t)$ the probability of firing $t$ in marking $m$ (at step $n$), and by $r_n(m, \mathit{fail})$ the probability of going directly to a fail state $*$.

Definition 3.
Let $N = (S, T, {}^\bullet(), ()^\bullet)$ be a condition/event net and let $T_f = T \cup \{\mathit{fail}\}$ (the set of transitions enriched with a fail transition). Furthermore let $r_n : \mathcal{M} \times T_f \to [0,1]$, $n \in \mathbb{N}$, be a family of functions (the transition distributions at step $n$), such that for each $n \in \mathbb{N}$, $m \in \mathcal{M}$: $\sum_{t \in T_f} r_n(m, t) = 1$. The Markov chain generated from
$N, r_n$ has states $Q = \mathcal{M} \cup \{*\}$ and for $m, m' \in \mathcal{M}$:
$$P(X_{n+1} = m' \mid X_n = m) = \sum_{t \in T,\ m \stackrel{t}{\Rightarrow} m'} r_n(m, t) \qquad P(X_{n+1} = m' \mid X_n = *) = 0$$
$$P(X_{n+1} = * \mid X_n = m) = \sum_{t \in T_f,\ m \stackrel{t}{\not\Rightarrow}} r_n(m, t) \qquad P(X_{n+1} = * \mid X_n = *) = 1$$
where we assume that $m \stackrel{\mathit{fail}}{\not\Rightarrow}$ for every $m \in \mathcal{M}$.

Note that we can make a transition from $m$ to the fail state $*$ either when there is a non-zero probability for performing such a transition directly or when we pick a transition that cannot be fired in $m$. Requiring that $m \stackrel{\mathit{fail}}{\not\Rightarrow}$ for every $m$ is for notational convenience, since we have to sum up all probabilities leading to the fail state $*$ to compute $P(X_{n+1} = * \mid X_n = m)$. In this way the fail symbol always signifies a transition to $*$.

By parametrising over $r_n$ we obtain different semantics for condition/event nets. In particular, we consider the following two probabilistic semantics, both based on probability distributions $p^n_T : T \to [0,1]$, $n \in \mathbb{N}$, on transitions. We work under the assumption that this information is given or can be gained from extra knowledge that we have about our environment.

Figure 1: Example Petri nets. (a) A Petri net modelling gossip diffusion in a social network ($K_i$: user $i$ knows the information). (b) A Petri net modelling the spread of a disease ($S$: susceptible, $I$: infected, $R$: removed). (c) A Petri net modelling a test with false positives and negatives ($I$: infected).

Independent case:
Here we assume that the marking and the transition are drawn independently, where markings are distributed according to $p^n$ and transitions according to $p^n_T$. It may happen that the transition and the marking do not "match" and the transition cannot fire. Formally, $r_n(m, t) = p^n_T(t)$ and $r_n(m, \mathit{fail}) = 0$ (where $m \in \mathcal{M}$, $t \in T$). This extends to the case where $\mathit{fail}$ has non-zero probability, with probability distribution $p^n_T : T_f \to [0,1]$.

Stochastic net case:
We consider stochastic Petri nets [29], which are often provided with a semantics based on continuous-time Markov chains [35]. Here, however, we do not consider continuous time, but instead model the embedded discrete-time Markov chain of jumps that abstracts from the timing. The firing rate of a transition $t$ is proportional to $p^n_T(t)$. Intuitively, we first sample a marking $m$ (according to $p^n$) and then sample a transition, restricting to those that are enabled in $m$. Formally: if no transition can fire in $m$, then $r_n(m, t) = 0$ for every $t \in T$ and $r_n(m, \mathit{fail}) = 1$; otherwise $r_n(m, t) = p^n_T(t) / \sum_{t' \in T,\ m \stackrel{t'}{\Rightarrow}} p^n_T(t')$ and $r_n(m, \mathit{fail}) = 0$.

Other semantics might make sense; for instance, the probability of firing a transition could depend on a place not contained in its pre-condition. Furthermore, it is possible to mix the two semantics and do one step in the independent and the next in the stochastic semantics.

Example 1.
The following nets illustrate the two semantics. The first net (Fig. 1a) describes the diffusion of gossip in a social network: there are four users and each place $K_i$ represents the knowledge of user $i$. To convey the fact that user $i$ knows some secret, place $K_i$ contains a token. The diffusion of information is represented by transitions $d_j$. E.g., if 1 knows the secret he will tell it to either 2 or 3, and if 3 knows a secret she will broadcast it to both 1 and 4. Note that a person will share the secret even if the recipient already knows it, and she will retain this knowledge (see the double arrows in the net); hence, in the Petri net semantics, we allow a transition to fire although its post-condition is marked. Here we use the stochastic semantics: only transitions that are enabled will be chosen (unless the marking is empty and no transition can fire). We assume that $p_T(d_2) = 1/3$ and $p_T(d_1) = p_T(d_3) = p_T(d_4) = p_T(d_5) = 1/6$, i.e., user 2 is more talkative than the others.
One of the states of the Markov chain is the marking $m = 1100$ ($K_1, K_2$ are marked, i.e., users 1 and 2 know the secret, and $K_3, K_4$ are unmarked, i.e., users 3 and 4 do not). In this situation transitions $d_1, d_2, d_3$ are enabled. We normalize the probabilities and obtain that $d_2$ fires with probability $1/2$ and the other two with probability $1/4$ each. By firing $d_2$ or $d_3$ we stay in state 1100, i.e., the corresponding Markov chain has a loop with probability $3/4$. Firing $d_1$ gives us a transition to state 1110 (user 3 now knows the secret too) with probability $1/4$.

The second net (Fig. 1b) models the classical SIR infection model [23] for two persons. A person is susceptible (represented by a token in place $S_i$) if he or she has not yet been infected. If the other person is infected (i.e. place $I_1$ or $I_2$ is marked), then he or she might also get infected with the disease. Finally, people recover (or die), which means that they are removed (places $R_i$). Again we use the stochastic semantics.

The third net (Fig. 1c) models a test (for instance for an infection) that may have false positives and false negatives. A token in place $I$ means that the corresponding person is infected. Apart from $I$ there is another random variable $R$ (for result) that tells whether the test is positive or negative. In order to faithfully model the test, we assign the following probabilities to the transitions: $p_T(\mathit{flp}) = P(R \mid \bar I)$ (false or lucky positive: this transition can fire regardless of whether $I$ is marked, in which case the test went wrong and is only accidentally positive), $p_T(\mathit{inf}) = P(R \mid I) - P(R \mid \bar I)$ (the remaining probability, such that the probabilities of $\mathit{flp}$ and $\mathit{inf}$ add up to the true positive; here we require that $P(R \mid \bar I) \le P(R \mid I)$) and $p_T(\mathit{fail}) = P(\bar R \mid I)$ (false negative). Here we use the independent semantics, assuming that we have a random test where the ground truth (infected or not infected) is independent of the firing probabilities of the transitions.
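The two semantics can be sketched as follows (a toy sketch, not the paper's implementation; the two-place net, its encoding and all identifiers are hypothetical):

```python
# Sketch: transition distributions r_n(m, t) for the two semantics on a toy
# C/E net. Markings are frozensets of places.

PRE  = {"t1": {"s1"}, "t2": {"s2"}}     # pre-conditions of the transitions
POST = {"t1": {"s2"}, "t2": {"s1"}}     # post-conditions (unused here)
p_T  = {"t1": 0.75, "t2": 0.25}         # distribution on transitions

def enabled(m, t):
    return PRE[t] <= m

def r_independent(m, t):
    # Independent case: the transition is drawn from p_T regardless of the
    # marking; fail itself has probability 0 (failure arises only from
    # picking a transition that is disabled in m).
    return 0.0 if t == "fail" else p_T[t]

def r_stochastic(m, t):
    # Stochastic net case: renormalize p_T over the transitions enabled in m.
    total = sum(p_T[u] for u in p_T if enabled(m, u))
    if total == 0:                      # no transition can fire at all
        return 1.0 if t == "fail" else 0.0
    if t == "fail" or not enabled(m, t):
        return 0.0
    return p_T[t] / total

m = frozenset({"s1"})                   # only t1 is enabled in this marking
assert r_stochastic(m, "t1") == 1.0     # renormalized to the enabled set
assert r_independent(m, "t2") == 0.25   # may be chosen although disabled
```

The last two lines show the difference: under the stochastic semantics the disabled transition gets probability 0, while under the independent semantics it keeps its weight and leads to the fail state when chosen.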
We now introduce the following scenario for uncertainty reasoning: assume that we are given an initial probability distribution on the markings of the Petri net. We stipulate that the fail state $*$ cannot occur, assuming that the state of the net is always some (potentially unknown) well-defined marking. If this fail state would be reached in the Markov model, we assume that the marking of the Petri net does not change, i.e., we perform a "reset" to the previous marking. Furthermore, we are aware of all firing probabilities of the various transitions, given by the functions $(r_n)_{n \in \mathbb{N}}$ and hence of all transition matrices $P^n$ that specify the transition probabilities at step $n$.

Then we observe the system and obtain a sequence of success and failure occurrences. We are not told which exact transition fires, but only whether the firing is successful or fails (since the pre-condition of the transition is not covered by the marking). Note that according to our model, transitions can be chosen to fire although they are not activated. This could happen if either a user or the environment tries to fire such a transition, unaware of the status of its pre-condition. Failure corresponds to entering state $*$, and in this case we assume the marking does not change. That is, we keep the previous marking, but acquire additional knowledge – namely that firing fails – which is used to update the probability distribution according to Prop. 4 (by performing the corresponding matrix multiplications, including normalization).

We use the following notation: let $M$ be a matrix indexed over $\mathcal{M} \cup \{*\}$. Then we denote by $M_*$ the matrix obtained by deleting the $*$-indexed row and column from $M$; analogously for a vector $p$. Note that $(M \cdot p)_* = M_* \cdot p_*$.
Furthermore, if $p_*$ is a sub-probability vector indexed over $\mathcal{M}$, then $\mathrm{norm}(p_*)$ stands for the corresponding normalized vector, where the $m$-entry is $p_*(m) / \big(\sum_{m' \in \mathcal{M}} p_*(m')\big)$.

Proposition 4.
Let $r_n : \mathcal{M} \times T_f \to [0,1]$ and $p^n : \mathcal{M} \cup \{*\} \to [0,1]$ be given as above. Let $N$ be a C/E net and let $(X_n)_{n \in \mathbb{N}}$ be the Markov chain generated from $N, r_n$. Then
$$P(X_{n+1} = m \mid X_{n+1} \ne *, X_n \ne *) = P(X_{n+1} = m \mid X_{n+1} \ne *) = \mathrm{norm}(P^n_* \cdot p^n_*)(m)$$
$$P(X_n = m \mid X_{n+1} = *, X_n \ne *) = \mathrm{norm}(F^n_* \cdot p^n_*)(m)$$
where $p^n(m) = P(X_n = m)$, $p^n(*) = P(X_n = *)$, and $F^n$ is a diagonal matrix with $F^n(\bar m \mid \bar m) := P^n(* \mid \bar m)$ for $\bar m \in \mathcal{M}$ and $F^n(* \mid *) := P^n(* \mid *) = 1$; all other entries are $0$.

Hence, in case we observe a success we update the probability distribution to $\bar p^{n+1}$ by computing $P^n_* \cdot \bar p^n$ (and normalizing). Instead, in the case of a failure we assume that the marking stays unchanged, but by observing the failure we have gathered additional knowledge, which means that we can replace $\bar p^{n+1}$ by $F^n_* \cdot \bar p^n$ (after normalization). $P^n_*$ and $F^n_*$ are typically not stochastic, but only sub-stochastic. For a (sub-)probability matrix $M_*$ and a (sub-)probability vector $p_*$ it is easy to see that $\mathrm{norm}(M_* \cdot p_*) = \mathrm{norm}(M_* \cdot \mathrm{norm}(p_*))$. Hence another option is to omit the normalization steps and to normalize at the very end of the sequence of observations. Normalization may be undefined (in the case of the $0$-vector), which signifies that we assumed an a priori probability distribution that is inconsistent with reality.

Example 2.
We return to Ex. 1 and discuss uncertainty reasoning. Assume that in the net in Fig. 1b person 1 is susceptible ($S_1$ is marked), person 2 is infected ($I_2$ is marked) and the $i_j$-transitions have a higher rate (higher probability of firing) than the $r_j$-transitions. Then, in the next step the probability that both are infected is higher than the probability that 1 is still susceptible and 2 has recovered.

Regarding the net in Fig. 1c we can show that in the next step, in the case of success, the probability distribution is updated in such a way that place $I$ is marked with probability $P(I \mid R)$ and unmarked with probability $P(\bar I \mid R)$ ($P(I \mid \bar R)$ and $P(\bar I \mid \bar R)$, respectively, in the case of failure), exactly as required. For more details see Appendix B.

Figure 2: An example Bayesian network (with nodes $A$, $B$, $C$, $D$, $E$).
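The success/failure updates of Prop. 4 can be sketched directly on distributions over markings (a minimal sketch with a hypothetical dict encoding, not the paper's symbolic representation):

```python
# Sketch of the updates of Prop. 4: on "success" multiply with P*, on
# "failure" with the diagonal matrix F*, then normalize.

def normalize(p):
    total = sum(p.values())
    return {m: v / total for m, v in p.items()}

def success_update(P_star, p):
    # P_star: dict (m_next, m) -> prob, restricted to proper markings
    q = {}
    for (m_next, m), prob in P_star.items():
        q[m_next] = q.get(m_next, 0.0) + prob * p.get(m, 0.0)
    return normalize(q)

def failure_update(fail_prob, p):
    # F* is diagonal: the entry for marking m is the probability P^n(* | m)
    # of failing in m, so observing a failure reweights each marking.
    return normalize({m: fail_prob[m] * v for m, v in p.items()})

# Toy example: uniform prior over two markings; "10" fails with 0.8 and "01"
# with 0.2, so observing a failure shifts belief towards "10".
p = {"10": 0.5, "01": 0.5}
p = failure_update({"10": 0.8, "01": 0.2}, p)
assert abs(p["10"] - 0.8) < 1e-9
```

This is the direct (exponential-size) computation that the modular Bayesian networks of the next sections represent compactly.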
In order to implement the updates to the probability distributions described above in an efficient way, we will now represent probability distributions over markings symbolically as Bayesian networks [31, 7]. Bayesian networks (BNs) model certain probabilistic dependencies of random variables through conditional probability tables and a graphical representation. Consider for instance the Bayesian network in Fig. 2. Each node ($A$, $B$, $C$, $D$, $E$) represents a binary random variable, where a node without predecessors (e.g., $A$) is associated with the probabilities $P(A)$ and $P(\bar A)$. Edges denote dependencies: for instance $D$ is dependent on $A, B$, which means that $D$ is associated with a conditional probability table (matrix) with entries $P(D \mid A, B)$, similarly for $E$ (entries of the form $P(E \mid D, C)$). In both cases, the matrix contains $2 \cdot 2^2 = 8$ entries. It is well known how to compute marginal probabilities (such as $P(E)$) from a Bayesian network.

We deviate from the literature on Bayesian networks in three respects: first, since we will update and transform those networks, we need a structure where we can easily express compositionality via sequential and parallel composition. To this end we use the representation of Bayesian networks via PROPs as in [15, 21]. Second, we permit sub-stochastic matrices. Third, we allow a node to have several outgoing wires, whereas in classical Bayesian networks a node is always associated with the distribution of a single random variable. This is needed since we have to add nodes to a network that represent stochastic matrices of arbitrary dimensions (basically the matrices $P^n$ and $F^n$ of Proposition 4). We rely on the notation introduced in [4], but extend it by taking the last item above into account.

The syntax of Bayesian networks is provided by causality graphs [4]. For this we fix a set of node labels $G$, also called generators, where every $g \in G$ is associated with a type $n_g \to m_g$, where $n_g, m_g \in \mathbb{N}$.

Definition 5 (Causality graph (CG)).
A causality graph (CG) of type $n \to m$, $n, m \in \mathbb{N}$, is a tuple $B = (V, \ell, s, \mathrm{out})$ where
- $V$ is a set of nodes,
- $\ell : V \to G$ is a labelling function that assigns a generator $\ell(v) \in G$ to each node $v \in V$,
- $s : V \to W_B^*$ is the source function that maps a node to a sequence of input wires, where $|s(v)| = n_{\ell(v)}$ and $W_B = \{(v, p) \mid v \in V,\ p \in \{1, \dots, m_{\ell(v)}\}\} \cup \{i_1, \dots, i_n\}$ is the wire set,
- $\mathrm{out} : \{o_1, \dots, o_m\} \to W_B$ is the output function that assigns each output port to a wire.

Moreover, the corresponding directed graph (defined by $s$) has to be acyclic. We also define the target function $t : V \to W_B^*$ with $t(v) = ((v, 1), \dots, (v, m_{\ell(v)}))$ and the set of internal wires $IW_B = W_B \setminus \{i_1, \dots, i_n, \mathrm{out}(o_1), \dots, \mathrm{out}(o_m)\}$.

We visualize such causality graphs by drawing the $n$ input wires on the left and the $m$ outputs on the right. Each node $v$ is drawn as a box, with $n_{\ell(v)}$ ingoing wires and $m_{\ell(v)}$ outgoing wires, ordered from top to bottom. Connections induced by the source and by the output function are drawn as undirected edges (see Fig. 2).

We define two operations on causality graphs: sequential composition and tensor. Given $B_1$ of type $n \to k$ and $B_2$ of type $k \to m$, the sequential composition is obtained via concatenation, by identifying the output wires of $B_1$ with the input wires of $B_2$, resulting in $B_1 ; B_2$ of type $n \to m$. The tensor takes two causality graphs $B_i$ of type $n_i \to m_i$, $i \in \{1, 2\}$, and takes their disjoint union, concatenating the sequences of input and output wires, resulting in $B_1 \otimes B_2$ of type $n_1 + n_2 \to m_1 + m_2$. For a visualization see Fig. 5 and for formal definitions see [4, 3].

The semantics of modular Bayesian networks is given by (sub-)stochastic matrices, i.e., matrices with entries from $[0,1]$ in which every column sums up to at most $1$. A matrix is of type $n \to m$ whenever it is of dimension $2^m \times 2^n$. We again use a sequential composition operator $;$ that corresponds to matrix multiplication ($P ; Q = Q \cdot P$) and the Kronecker product $\otimes$ as the tensor. More concretely,
More concretely,given P : n → m , Q : n → m we define P ⊗ Q : n + n → m + m as ( P ⊗ Q )( x x | y y ) = P ( x | y ) · Q ( x | y ) where x i ∈ { , } m i , y i ∈ { , } n i . F S T T C S 2 0 2 0
Finally, modular Bayesian networks, adapted from [4], are causality graphs where each generator $g \in G$ is associated with a (sub-)stochastic matrix of suitable type.

Definition 6 (Modular Bayesian network (MBN)). An MBN is a tuple $(B, ev)$ where $B$ is a causality graph and $ev$ an evaluation function that assigns to every generator $g \in G$ of type $n \to m$ a $2^m \times 2^n$ sub-stochastic matrix $ev(g)$. An MBN $(B, ev)$ is called an ordinary Bayesian network (OBN) whenever $B$ has no inputs (i.e. it has type $0 \to m$), each generator is of type $n \to 1$, $\mathrm{out}$ is a bijection and every node is associated with a stochastic matrix.

We now describe how to evaluate an MBN to obtain a (sub-)stochastic matrix. For OBNs – which are exactly the Bayesian networks considered in [16] – this coincides with the standard interpretation and yields a probability vector of dimension $2^m$.

Definition 7 (MBN evaluation). Let $(B, ev)$ be an MBN where $B$ is of type $n \to m$. Then $M_{ev}(B)$ is a $2^m \times 2^n$-matrix, which is defined as follows:
$$M_{ev}(B)(x_1 \dots x_m \mid y_1 \dots y_n) = \sum_{b \in \mathcal{B}} \prod_{v \in V} ev(\ell(v))\big(b(t(v)) \mid b(s(v))\big)$$
with $x_1, \dots, x_m, y_1, \dots, y_n \in \{0,1\}$. Here $\mathcal{B}$ is the set of all functions $b : W_B \to \{0,1\}$ such that $b(i_j) = y_j$ and $b(\mathrm{out}(o_k)) = x_k$, where $k \in \{1, \dots, m\}$, $j \in \{1, \dots, n\}$. The functions $b$ are applied pointwise to sequences of wires.

Calculating the underlying probability distribution of an MBN can also be done on a graphical level by treating every occurring wire as a boolean variable that can be assigned either $0$ or $1$. A function $b \in \mathcal{B}$ assigns the wires, ensuring consistency with the input/output values. After the wire assignment, the corresponding entries of each matrix $ev(\ell(v))$ are multiplied. Iterating over every possible wire assignment, the resulting products are summed up. Note that $M_{ev}$ is compositional: it preserves sequential composition and tensor.
More formally, it is a functor between symmetric monoidal categories, or, more specifically, between CC-structured PROPs. (For more details on PROPs see Appendix A.)

Example 3.
We illustrate Def. 7 by evaluating the Bayesian network $(B, ev)$ in Fig. 2. This results in a $2 \times 1$-matrix $M_{ev}(B)$, assigning (sub-)probabilities to the only output wire in the diagram being $1$ or $0$, respectively. More concretely, we assign values to the four inner wires to obtain:
$$M_{ev}(B)(\mathrm{e}) = \sum_{\mathrm{a} \in \{0,1\}} \sum_{\mathrm{b} \in \{0,1\}} \sum_{\mathrm{c} \in \{0,1\}} \sum_{\mathrm{d} \in \{0,1\}} \big(A(\mathrm{a}) \cdot B(\mathrm{b}) \cdot C(\mathrm{c}) \cdot D(\mathrm{d} \mid \mathrm{ab}) \cdot E(\mathrm{e} \mid \mathrm{cd})\big),$$
where a, b, c, d, e correspond to the output wire of the corresponding matrix ($A, B, C, D, E$).

An MBN $B$ of type $0 \to k$, as defined above, symbolically represents a probability distribution on $\{0,1\}^k$, that is, a probability distribution on markings of a net with $|S| = k$ places. Under uncertainty reasoning (cf. Section 3), the probability distribution in the next step, $p^{n+1}$, is obtained by multiplying $p^n$ with a matrix $M$ (either $P^n_*$ in the successful case or $F^n_*$ in the case of failure). Hence, a simple way to update $B$ would be to create an MBN $B_M$ with a single node $v$ (labelled by a generator $g$ with $ev(g) = M$), connected to $k$ inputs and $k$ outputs. Then the updated $B$ is simply $B ; B_M$ (remember that sequential composition corresponds to matrix multiplication). However, at dimension $2^k \times 2^k$ the matrix $M$ is huge and we would sacrifice the desirable compact symbolic representation. Hence the aim is to decompose $M = M' \otimes \mathrm{Id}$ where $\mathrm{Id}$ is an identity matrix of suitable dimension. Due to the functoriality of MBN evaluation this means composing with a smaller matrix and a number of identity wires (see e.g. Fig. 3b).

This decomposition arises naturally from the structure of the Petri net $N$, in particular if there are only relatively few transitions that may fire in a step. In this case we intuitively have to attach a stochastic matrix only to the wires representing the places connected to those transitions, while the other wires can be left unchanged.
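The effect of composing with $M' \otimes \mathrm{Id}$ can be sketched as follows (a sketch under a hypothetical bit-string encoding, not the paper's implementation): applying the small matrix to the first $\ell$ wires and leaving the remaining wires untouched is the same as multiplying with the full Kronecker product.

```python
# Sketch: updating only the "relevant" wires. Applying M' ⊗ Id to a
# distribution over bit strings equals applying the small matrix M' to the
# first `ell` bits and copying the rest unchanged.

def apply_to_prefix(M_small, p, ell):
    # M_small: dict (out_bits, in_bits) -> prob, over the first `ell` wires;
    # p: dict bit string -> prob
    q = {}
    for m, v in p.items():
        head, tail = m[:ell], m[ell:]
        for (out_bits, in_bits), prob in M_small.items():
            if in_bits == head:
                q[out_bits + tail] = q.get(out_bits + tail, 0.0) + prob * v
    return q

# A 1-wire matrix that sets an unmarked place to 1 with probability 0.3:
M1 = {("1", "0"): 0.3, ("0", "0"): 0.7, ("1", "1"): 1.0}
p  = {"00": 1.0}                 # two wires, only the first one is touched
q  = apply_to_prefix(M1, p, 1)
assert abs(q["10"] - 0.3) < 1e-9 and abs(q["00"] - 0.7) < 1e-9
```

The point of the decomposition is visible in the sizes: `M1` has $2^1 \times 2^1$ entries instead of the $2^2 \times 2^2$ entries of the full matrix on both wires.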
If there are several updates, we of course have to attach several matrices, but each of them might be of a relatively modest size.

In order to have a uniform treatment of the various semantics, we assume that for each step $n$ there is a set $\bar S \subseteq S$ of places and a set $\bar T \subseteq T_f$ of transitions such that: (i) $r_n(m, t) = 0$ whenever $t \notin \bar T$; (ii) $r_n(m_1 m_2, t) = \bar r(m_1, t)$ for some function $\bar r$ (where $m_1$ is a marking of length $\ell = |\bar S|$, corresponding to the places of $\bar S$); (iii) $\bar S$ contains at least ${}^\bullet t, t^\bullet$ for all $t \in \bar T$. Intuitively, $\bar S, \bar T$ specify the relevant places and transitions.

For the two Petri net semantics studied earlier, these conditions are satisfied if we take as $\bar T$ the support of $p^n_T$ and as $\bar S$ the union of all pre- and post-sets of $\bar T$. The function $r_n$ can in both cases be defined in terms of $\bar r$: in the independent case this is obvious, whereas in the stochastic net case we observe that $r_n(m, t)$ is only dependent on $p^n_T$ and on the set of transitions of $\bar T$ that are enabled in $m$, and this can be derived from $m_1$. Now, under these assumptions, we can prove that we obtain the decomposition mentioned above.

Proposition 8.
Assume that $N$ is a condition/event net together with a function $r_n$. Assume that we have $\bar S \subseteq S$, $\bar T \subseteq T_f$ satisfying the conditions above. Then
- $P^n_* = P' \otimes \mathrm{Id}_{k-\ell}$ where $P'(m_1' \mid m_1) = \sum_{t \in \bar T,\ m_1 \stackrel{t}{\Rightarrow} m_1'} \bar r(m_1, t)$,
- $F^n_* = F' \otimes \mathrm{Id}_{k-\ell}$ where $F'(m_1' \mid m_1) = \sum_{t \in \bar T,\ m_1 \stackrel{t}{\not\Rightarrow}} \bar r(m_1, t)$ if $m_1 = m_1'$, and $0$ otherwise.

Here $P', F'$ are $2^\ell \times 2^\ell$-matrices and $m_1, m_1' \subseteq \bar S$. Note also that we implicitly restricted the firing relation to the markings on $\bar S$.

Example 4.
In order to illustrate this, we go back to gossip diffusion (Fig. 1a, Ex. 1). Our input is the following: an initial probability distribution, describing the a priori knowledge, given by an MBN. Here we have no information about who knows or does not know the secret and hence we assume a uniform probability distribution over all markings. This is represented by the Bayesian network in Fig. 3a, where each node is associated with a $2 \times 1$-vector $K_i$ where both entries are $1/2$.

Also part of the input is the family of transition distributions $(r_n)_{n \in \mathbb{N}}$. Here we assume that the firing probabilities of transitions are as in Example 1, but not all users are active at the same time. We have information that in the first step only users 1 and 2 are active, hence by normalization we obtain probabilities $1/4, 1/2, 1/4$ for transitions $d_1, d_2, d_3$ (the other transitions are deactivated).

Now we observe a success step. According to Sct. 3 we can make an update with $P_*$ where $P$ is the transition matrix of the Markov chain. Since none of the transitions is attached to place $K_4$, the optimizations of this section allow us to represent $P_*$ as $P' \otimes \mathrm{Id}_1$ where $P'$ is an $8 \times 8$-matrix over the places $K_1, K_2, K_3$, with entries such as $P'(110 \mid 110) = 3/4$ and $P'(111 \mid 110) = 1/4$. (Without loss of generality we assume that the outputs have been permuted such that places in $\bar S$ occur first in the sequence of places.) This matrix is simply attached to the modular Bayesian network (see Fig. 3b).

Now assume that it is our task to compute the probability that place $K_3$ is marked. For this, we compute the corresponding marginal probabilities by terminating each output wire (apart from the third one) (see Fig. 3c). "Terminating a wire" means to remove it from the output wires. This results in summing up over all possible values assigned to each wire, where we can completely omit the last component, which is the unit of the Kronecker product. Note that the resulting vector is sub-stochastic and still has to be normalized. The normalization factor can be obtained by terminating also the remaining third wire, which gives us the probability mass of the sub-probability distribution. Our implementation will then tell us the probability with which place $K_3$ is marked.

Figure 3: Example: transformation of modular Bayesian networks. (a) An MBN modelling a uniform probability distribution. (b) An MBN after performing an update (observation of a successful step). (c) Computing a marginal probability distribution from an MBN.

Given a modular Bayesian network, it is inefficient to obtain the full distribution, not just from the point of view of the computation, but also since its direct representation is of exponential size. However, what we often need is to compute a marginal distribution (e.g., the probability that a certain place is marked) or a normalization factor for a sub-stochastic probability distribution (cf. Ex. 4). Another application would be to transform an MBN into an OBN, by isolating the part of the network that does not conform to the properties of an OBN, evaluating it and replacing it by an equivalent OBN.

Def. 7 gives a recipe for the evaluation, which is however quite inefficient. Hence we will now explain and adapt the well-known concept of variable elimination [13, 12]. Let us study the problem with a concrete example. Consider the Bayesian network $B$ in Fig. 2 and its evaluation described in Ex. 3. If we perform this computation naively, one has to enumerate $2^4 = 16$ bit vectors of length 4.
Furthermore, after eliminating d we have to represent a matrix (also called a factor in the literature on Bayesian networks) that depends on four random variables (a, b, c, e); hence we say that it has width 4 (2^4 = 16 entries). However, it is not difficult to see that we can – via the distributive law – reorder the products and sums to obtain a more efficient way of computing the values:

M_ev(B)(e) = Σ_{d ∈ {0,1}} Σ_{c ∈ {0,1}} [ ( Σ_{b ∈ {0,1}} ( Σ_{a ∈ {0,1}} A(a) · D(d | ab) ) · B(b) ) · C(c) · E(e | cd) ]

In this way we obtain smaller matrices; the largest matrix (or factor) that occurs is D (width 3). Choosing a different elimination order might have been worse. For instance, if we had eliminated d first, we would have had to deal with a matrix depending on a, b, c, e (width 4).

The literature on Bayesian networks [13, 12] extensively studies the best variable elimination order and discusses the relation to treewidth. For our setting we have to extend the results in the literature, since we also allow generators with more than one output.
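The effect of this reordering can be checked mechanically. The following Python sketch is our own illustration (the CPT entries for A, B, C, D, E are made-up numbers, not taken from the paper): it compares naive enumeration of all 2^4 assignments with the reordered sum-product computation and verifies that both yield the same marginal for e.

```python
import itertools

# Made-up CPT entries for the network with nodes A, B, C, D, E (cf. Fig. 2).
A = {0: 0.3, 1: 0.7}   # A(a)
B = {0: 0.6, 1: 0.4}   # B(b)
C = {0: 0.2, 1: 0.8}   # C(c)

def d_cpt(d, a, b):
    """D(d | ab): probability of d = 1 is 0.1 + 0.2a + 0.3b (made up)."""
    p1 = 0.1 + 0.2 * a + 0.3 * b
    return p1 if d == 1 else 1 - p1

def e_cpt(e, c, d):
    """E(e | cd): probability of e = 1 is 0.25 + 0.5cd (made up)."""
    p1 = 0.25 + 0.5 * c * d
    return p1 if e == 1 else 1 - p1

def naive(e):
    # Enumerate all 2^4 = 16 assignments of (a, b, c, d).
    return sum(A[a] * B[b] * C[c] * d_cpt(d, a, b) * e_cpt(e, c, d)
               for a, b, c, d in itertools.product((0, 1), repeat=4))

def reordered(e):
    # Sum-product order from the displayed equation: eliminate a, then b,
    # then c, then d; the largest factor has width 3 (the CPT of D).
    total = 0.0
    for d in (0, 1):
        inner_c = 0.0
        for c in (0, 1):
            inner_b = 0.0
            for b in (0, 1):
                inner_a = sum(A[a] * d_cpt(d, a, b) for a in (0, 1))
                inner_b += inner_a * B[b]
            inner_c += inner_b * C[c] * e_cpt(e, c, d)
        total += inner_c
    return total

for e in (0, 1):
    assert abs(naive(e) - reordered(e)) < 1e-12
```

Since every CPT is normalized, the two marginals also sum to 1 over e, which is a useful secondary check.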
Definition 9 (Elimination order). Let B = (V, ℓ, s, out) be the causality graph of a modular Bayesian network of type n → m. As in Def. 5 let W_B be the set of wires.

We define an undirected graph U_0 that has as vertices the wires W_B, where two wires w_1, w_2 are connected by an edge whenever they are connected to the same node. More precisely, they are connected whenever they are input or output wires for the same node (i.e., w_1, w_2 both occur in s(v)t(v) for a node v ∈ V).

Now let w_1, ..., w_k (where k = |IW_B|) be an ordering of the internal wires, a so-called elimination ordering. We update the graph U_{i−1} to U_i by removing the next wire w_i and connecting all of its neighbours by edges (so-called fill-in). External wires are never eliminated. The width of the elimination ordering is the size of the largest clique that occurs in some graph U_i. The elimination width of B is the least width taken over all orderings.

In the case of Bayesian networks, the set of wires of an OBN corresponds to the set of random variables. In the literature, the graph U_0 is called the moralisation of the Bayesian network: it is obtained by taking the Bayesian network (an acyclic graph), forgetting about the direction of the edges, and connecting all the parents (i.e., the predecessors) of each random variable, i.e., making them form a clique. This results in the same graph as the construction described above.

To introduce the algorithm, we need the notion of a factor, already hinted at earlier.

Definition 10 (Factor). Let (B, ev) be a modular Bayesian network with a set of wires W_B. A factor (f, w̃) of size s consists of a map f : {0,1}^s → [0,1] together with a sequence of wires w̃ ∈ W_B^∗. We require that w̃ is of length s (|w̃| = s) and does not contain duplicates. Given a wire w ∈ W_B and a multiset F of factors, we denote by C_w(F) all those factors (f, w̃) ∈ F where w̃ contains w.
By X_w(F) we denote the set of all wires that occur in the factors in C_w(F), apart from w.

We now consider an algorithm that computes the probability distribution represented by a modular Bayesian network of type n → m. We assume that an evaluation map ev, mapping generators to their corresponding matrices, and an elimination order w_1, ..., w_k of internal wires are given. Furthermore, given a sequence of wires w̃ = w_1 ... w_s and a bit string x = x_1 ... x_s, we define the substitution function b_{w̃,x} from wires to bits as b_{w̃,x}(w_j) = x_j.

Algorithm 11 (Variable elimination). Input:
An MBN (B, ev) of type n → m.

Let F_0 be the initial multiset of factors. For each node v of type n_{ℓ(v)} → m_{ℓ(v)}, it contains the matrix ev(v), represented as a factor f, together with the sequence s(v)t(v). That is, f(xy) = ev(v)(y | x) where x ∈ {0,1}^{n_{ℓ(v)}}, y ∈ {0,1}^{m_{ℓ(v)}}.

Now assume that we have a multiset F_{i−1} of factors and take the next wire w_i in the elimination order. We choose all those factors that contain w_i and compute a new factor (f, w̃). Let w̃ be a sequence that contains all wires of X_{w_i}(F_{i−1}) (in arbitrary order, but without duplicates). (Note that we talk about the nodes of an MBN B and the vertices of an undirected graph U_i.) Let s = |w̃|. Then f is a function of type f : {0,1}^s → [0,1], defined as:

f(y) = Σ_{z ∈ {0,1}} Π_{(g, w̃_g) ∈ C_{w_i}(F_{i−1})} g(b_{w̃w_i, yz}(w̃_g))

We set F_i = (F_{i−1} \ C_{w_i}(F_{i−1})) ∪ {(f, w̃)}.

After the elimination of all wires we obtain a multiset of factors F_k, whose sequences contain only input and output wires. The resulting probability distribution is p : {0,1}^{n+m} → [0,1], where for x ∈ {0,1}^n, y ∈ {0,1}^m, ι̃ = i_1 ... i_n, õ = out(o_1) ... out(o_m):

p(xy) = Π_{(f, w̃_f) ∈ F_k} f(b_{ι̃õ, xy}(w̃_f))

That is, given the next wire w_i we choose all factors that contain this wire, remove them from F_{i−1} and multiply them, while eliminating the wire. The next multiset is obtained by adding the new factor. Finally, we are left with factors that contain only input and output wires, and we obtain the final probability distribution by multiplying them.

Proposition 12.
Given a modular Bayesian network (B, ev) where B is of type n → m, Algorithm 11 computes its corresponding (sub-)stochastic matrix M_ev(B), that is,

M_ev(B)(y | x) = p(xy) for x ∈ {0,1}^n, y ∈ {0,1}^m.

Furthermore, the size of the largest factor in any multiset F_i is bounded by the width of the elimination ordering.

We conclude this section by investigating the relation between elimination width and the well-known notion of treewidth [1].
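The elimination step of Algorithm 11 is easy to prototype. The following Python sketch is our own illustration (the factor representation, the helper names and the toy two-factor example are assumptions, not the paper's implementation): a factor is a pair of a dict keyed by bit tuples and a wire list, and eliminating a wire multiplies all touching factors and sums the wire out.

```python
from itertools import product

# A factor is a pair (f, wires): f maps bit tuples (one bit per wire,
# in the order given by `wires`) to values in [0, 1].

def eliminate(factors, wire):
    """One step of the algorithm: multiply all factors mentioning `wire`
    and sum the wire out, returning the new multiset of factors."""
    touching = [(f, ws) for f, ws in factors if wire in ws]
    rest = [(f, ws) for f, ws in factors if wire not in ws]
    new_wires = sorted({w for _, ws in touching for w in ws} - {wire})
    new_f = {}
    for bits in product((0, 1), repeat=len(new_wires)):
        env = dict(zip(new_wires, bits))
        total = 0.0
        for z in (0, 1):          # sum over the eliminated wire
            env[wire] = z
            prod = 1.0
            for f, ws in touching:
                prod *= f[tuple(env[w] for w in ws)]
            total += prod
        new_f[bits] = total
    return rest + [(new_f, new_wires)]

def run_elimination(factors, order, external):
    """Eliminate internal wires in `order`, then multiply the remaining
    factors into a distribution over the external wires."""
    for w in order:
        factors = eliminate(factors, w)
    dist = {}
    for bits in product((0, 1), repeat=len(external)):
        env = dict(zip(external, bits))
        p = 1.0
        for f, ws in factors:
            p *= f[tuple(env[w] for w in ws)]
        dist[bits] = p
    return dist

# Toy example: a prior on internal wire "a" and a conditional for output
# wire "e" given "a"; eliminating "a" yields the marginal of "e".
prior_a = ({(0,): 0.4, (1,): 0.6}, ["a"])
e_given_a = ({(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}, ["a", "e"])
dist = run_elimination([prior_a, e_given_a], ["a"], ["e"])
assert abs(dist[(1,)] - (0.4 * 0.1 + 0.6 * 0.8)) < 1e-12
```

In this sketch the intermediate factor sizes are exactly the widths discussed above, so a bad elimination order shows up directly as large dicts.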
Definition 13 (Treewidth of a causality graph). Let B = (V, ℓ, s, out) be a causality graph of type n → m. A tree decomposition for B is an undirected tree T = (V_T, E_T) such that:
- every node t ∈ V_T is associated with a bag X_t ⊆ W_B,
- every wire w ∈ W_B is contained in at least one bag X_t,
- for every node v ∈ V there exists a bag X_t such that all input and output wires of v are contained in X_t (i.e., all wires in s(v) and t(v) are in X_t), and
- for every wire w ∈ W_B, the tree nodes {t ∈ V_T | w ∈ X_t} form a subtree of T.

The width of a tree decomposition is given by max_{t ∈ V_T} |X_t| − 1. The treewidth of B is the minimal width, taken over all tree decompositions.

Note that the treewidth of a causality graph corresponds to the treewidth of the graph U_0 from Def. 9. Now we are ready to compare elimination width and treewidth.

Proposition 14.
Elimination width is always an upper bound for treewidth, and the two coincide when B is a causality graph of type 0 → 0. For a network of type 0 → m the treewidth may be strictly smaller than the elimination width.

The treewidth might be strictly smaller since, for treewidth, we are also allowed to eliminate output wires. However, it is easy to see that the treewidth plus the number of output wires always provides an upper bound for the elimination width.

The paper [1] also discusses heuristics for computing good elimination orderings, an optimization problem that is NP-hard. Hence the treewidth of a causality graph gives us an upper bound for the most costly step in computing its corresponding probability distribution. [25] shows that a small treewidth is actually a necessary condition for obtaining efficient inference algorithms.

We can also compare elimination width to the related notion of term width; more details can be found in Appendix C.

We extended the implementation presented in the predecessor paper [4] by incorporating probabilistic Petri nets and elimination orderings, in order to evaluate the performance of the proposed concepts. The implementation is open source and freely available from GitHub (https://github.com/RebeccaBe/Bayesian-II). Runtime results were obtained by randomly generating Petri nets with different parameters, e.g. number of places, transitions and tokens, and the initial marking. The maximal number of places in pre- and post-conditions is restricted to three and at most five transitions are enabled in each step. With these parameters, the worst-case scenario is the creation of a matrix of type 30 → 30.

After the initialization of a Petri net, which can be interpreted with either semantics (independent/stochastic), transitions and their probabilities are picked at random. Then we observe either success or failure and update the probability distribution accordingly. We select the elimination order via a heuristic that prefers wires with minimal degree in the graph U_i (cf. Def. 9). Furthermore, we apply a few optimizations: nodes with no output wires are evaluated first, nodes without inputs second. The observation of a failure generates a diagonal matrix, which enables an optimized evaluation, as its input and output wires have to carry the same value (otherwise we obtain a factor 0). In addition, we use optimizations whenever we have definitive knowledge about the marking of a particular place (of a pre-condition), by drawing conclusions about the ability to fire certain transitions.

The plot on the left of Fig. 4 compares runtimes when incorporating ten success/failure observations directly on the joint distribution (i.e. the naive representation of a probability distribution) versus our MBN implementation. We initially assume a uniform distribution of tokens and calculate the probability that the first place is marked after the observations. Both approaches evaluate the same Petri net and therefore calculate the same results. The data is for the independent semantics, but it is very similar for the stochastic semantics.

While the runtime increases exponentially when using joint distributions, our MBN implementation stays relatively constant (see Fig. 4, left). Due to memory issues, handling Petri nets with more than 30 places is no longer feasible for the direct computation of joint distributions. We use the median for comparison (see Fig. 4, left), but if an MBN consists of very large matrices, the evaluation time will be rather high. The right plot of Fig. 4 shows this correlation, where colours denote the runtime and the y-axis represents the number of wires attached to the largest matrix. (Here we actually count equivalence classes, by grouping those wires that have to carry the same value due to their attachment to a diagonal matrix; see also the optimization explained above.)

The advantage of our approach decreases when we have substantially more places in the pre- and post-sets, more transitions that may fire and a larger number of steps, since then the Bayesian network is more densely connected and contains larger matrices. Furthermore, one might generally expect the state (containing tokens or not) of places of the Petri net to become more and more coupled over time, as more transitions have fired, decreasing the performance improvement we gain from using MBNs. However, recall that the transitions
that can fire at any time are explicitly controlled by the input p_T^n. This allows our model to capture situations where different parts of the network stay uncoupled over time and where using MBNs is an advantage. Furthermore, the observation of a failure allows an optimized variable elimination, as explained above.

Figure 4
Left: Median of runtimes after performing 10 transitions on a Petri net. Right: Effect of large matrices on the runtimes of the MBN implementation.

We propose a framework for uncertainty reasoning for probabilistic Petri nets that represents probability distributions compactly via Bayesian networks. In particular, we describe how to efficiently update and evaluate Bayesian networks.
Related work:
Naturally, uncertainty reasoning has been considered in many different scenarios (for an overview see [18]). Here we review only those approaches that are closest to our work.

In [4] we studied a simpler scenario for nets whose transitions do not fire probabilistically, but are picked by the observer, resulting in a restricted set of update operations. Rather than computing marginal distributions directly via variable elimination as in this paper, our aim there was to transform the resulting modular Bayesian network into an ordinary one. Since the updates to the net were of a simpler nature, we were able to perform this conversion. Here we are dealing with more complex updates where this cannot be done efficiently. Instead we concentrate on extracting information, such as marginal distributions, from a Bayesian network.

Furthermore, uncertainty reasoning as described in Sct. 3 is related to the methods used for hidden Markov models [32], where the observations refer to the states, whereas we (partially) observe the transitions.

There are several proposals which enrich Petri nets with a notion of uncertainty: possibilistic Petri nets [26], plausible Petri nets [8] that combine discrete and continuous processes, or fuzzy Petri nets [5, 34] where firing of transitions is governed by the truth values of statements. Uncertainty in connection with Petri nets is also treated in [24, 22], but without introducing a formal model. As far as we know, none of these approaches considers symbolic representations of probability distributions via Bayesian networks.

In [2] the authors exploit the fact that Petri nets also have a monoidal structure and describe how to convert an occurrence (Petri) net with a truly concurrent semantics into a Bayesian network, which allows one to derive probabilistic information, for instance on whether a place will eventually be marked.
This is different from our task, but it will be interesting to compare further by unfolding our nets and equipping them with a truly concurrent semantics, based on the probabilistic information from the time-inhomogeneous Markov chain.
We instead propose to use Bayesian networks as symbolic representations of probability distributions. An alternative would be to employ multi-valued (or multi-terminal) binary decision diagrams (BDDs) as in [19]. An exact comparison of both methods is left for future work. We believe that multi-valued BDDs will fare better if there are only few different numerical values in the distribution; otherwise Bayesian networks should have an advantage.

As mentioned earlier, representing Bayesian networks by PROPs or string diagrams is a well-known concept, see for instance [15, 21]. The paper [20] describes another transformation of Bayesian networks by string diagram surgery that models the effect of an intervention.

In addition there is a notion of dynamic Bayesian networks [30], where a random variable has a separate instance for each time slice. We instead keep only one instance of every random variable, but update the Bayesian network itself.

In addition to variable elimination, a popular method to compute marginals of a probability distribution is based on belief propagation and junction trees [27]. In order to assess the potential efficiency gain, this approach has to be adapted for modular Bayesian networks. However, due to the dense interconnection and large matrices of MBNs, an improvement in runtime is unclear and deserves future investigation.
Future work:
One interesting avenue of future work is to enrich our model with timing information by considering continuous-time Markov chains [35], where firing delays are sampled from an exponential distribution. Instead of asking about the probability distribution after n steps we could instead ask about the probability distribution at time t.

We would also like to add mechanisms for controlling the system, such as transitions that are under the control of the observer and can be fired whenever enabled. Then the task of the observer would be to control the system and guide it into a desirable state. In this vein we are also interested in studying stochastic games [10] with uncertainty.

The interaction between the structure of the Petri net and the efficiency of the analysis method also deserves further study. For instance, are free-choice nets [14] – with restricted conflicts of transitions – more amenable to this type of analysis than arbitrary nets?

Recently there has been a lot of interest in modelling compositional systems via string diagrams, in the categorical setting of symmetric monoidal categories or PROPs [9]. In this context it would be interesting to see how the established notion of treewidth [1] and its algebraic characterizations [11] translate into a notion of width for string diagrams. We started to study this for the notion of term width, but we are not aware of other approaches, apart from [6], which considers monoidal width.

References

[1] H.L. Bodlaender and A.M.C.A. Koster. Treewidth computations I. Upper bounds. Technical Report UU-CS-2008-032, Department of Information and Computing Sciences, Utrecht University, September 2008.
[2] R. Bruni, H.C. Melgratti, and U. Montanari. Bayesian network semantics for Petri nets. Theoretical Computer Science, 807:95–113, 2020.
[3] B. Cabrera. Analyzing and Modeling Complex Networks – Patterns, Paths and Probabilities. PhD thesis, Universität Duisburg-Essen, 2019.
[4] B. Cabrera, T. Heindel, R. Heckel, and B. König. Updating probabilistic knowledge on Condition/Event nets using Bayesian networks. In Proc. of CONCUR '18, volume 118 of LIPIcs, pages 27:1–27:17. Schloss Dagstuhl – Leibniz Center for Informatics, 2018.
[5] J. Cardoso, R. Valette, and D. Dubois. Fuzzy Petri nets: An overview. In Proc. of 13th Triennal World Congress, 1996.
[6] A. Chantawibul and P. Sobociński. Towards compositional graph theory. In Proc. of MFPS XXXI. Elsevier, 2015. ENTCS 319.
[7] E. Charniak. Bayesian networks without tears. AI Magazine, 12(4):50–63, 1991.
[8] M. Chiachio, J. Chiachio, D. Prescott, and J.D. Andrews. A new paradigm for uncertain knowledge representation by plausible Petri nets. Information Sciences, 453:323–345, 2018.
[9] B. Coecke and A. Kissinger. Picturing Quantum Processes: A First Course in Quantum Theory and Diagrammatic Reasoning. Cambridge University Press, 2017.
[10] A. Condon. The complexity of stochastic games. Information and Computation, 96(2):203–224, 1992.
[11] B. Courcelle and J. Engelfriet. Graph Structure and Monadic Second-Order Logic, A Language-Theoretic Approach. Cambridge University Press, June 2012.
[12] A. Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2011.
[13] R. Dechter. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 113:41–85, 1999.
[14] J. Desel and J. Esparza. Free Choice Petri Nets, volume 40 of Cambridge Tracts in Theoretical Computer Science. Cambridge University Press, 1995.
[15] B. Fong. Causal theories: A categorical perspective on Bayesian networks. Master's thesis, University of Oxford, 2012. arXiv:1301.6201.
[16] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29:131–163, 1997.
[17] C.M. Grinstead and J.L. Snell. Introduction to Probability. American Mathematical Society, 2012.
[18] J.Y. Halpern. Reasoning about Uncertainty. MIT Press, second edition, 2017.
[19] H. Hermanns, J. Meyer-Kayser, and M. Siegle. Multi-terminal binary decision diagrams to represent and analyse continuous-time Markov chains. In Proc. of NSMC '99 (International Workshop on the Numerical Solution of Markov Chains), pages 188–207, 1999.
[20] B. Jacobs, A. Kissinger, and F. Zanasi. Causal inference by string diagram surgery. In Proc. of FOSSACS '19, pages 313–329. Springer, 2019. LNCS 11425.
[21] B. Jacobs and F. Zanasi. A formal semantics of influence in Bayesian reasoning. In Proc. of MFCS '17, volume 83 of LIPIcs, pages 21:1–21:14, 2017.
[22] I. Jarkass and M. Rombaut. Dealing with uncertainty on the initial state of a Petri net. In Proc. of UAI '98 (Uncertainty in Artificial Intelligence), pages 289–295, 1998.
[23] M.J. Keeling and K.T.D. Eames. Networks and epidemic models. Journal of the Royal Society Interface, 2(4):295–307, 2005.
[24] M. Kuchárik and Z. Balogh. Modeling of uncertainty with Petri nets. In Proc. of ACIIDS '19 (Asian Conference on Intelligent Information and Database Systems), pages 499–509. Springer, 2019. LNAI 11431.
[25] J.H.P. Kwisthout, H.L. Bodlaender, and L.C. van der Gaag. The necessity of bounded treewidth for efficient inference in Bayesian networks. In Proc. of ECAI '10 (European Conference on Artificial Intelligence), volume 215 of Frontiers in Artificial Intelligence and Applications, pages 237–242. IOS Press, 2010.
[26] J. Lee, K.F.R. Liu, and W. Chiang. Modeling uncertainty reasoning with possibilistic Petri nets. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 33(2):214–224, 2003.
[27] V. Lepar and P.P. Shenoy. A comparison of Lauritzen-Spiegelhalter, Hugin, and Shenoy-Shafer architectures for computing marginals of probability distributions. In Proc. of UAI '98 (Uncertainty in Artificial Intelligence), pages 328–337, 1998.
[28] S. MacLane. Categorical algebra. Bulletin of the American Mathematical Society, 71(1):40–106, 1965.
[29] M. Ajmone Marsan. Stochastic Petri nets: an elementary introduction. In Proc. of the European Workshop on Applications and Theory in Petri Nets, volume 424 of Lecture Notes in Computer Science, pages 1–29. Springer, 1990.
[30] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, UC Berkeley, Computer Science Division, 2002.
[31] J. Pearl. Bayesian networks: A model of self-activated memory for evidential reasoning. In Proc. of the 7th Conference of the Cognitive Science Society, pages 329–334, 1985. UCLA Technical Report CSD-850017.
[32] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[33] W. Reisig. Petri Nets: An Introduction. EATCS Monographs on Theoretical Computer Science. Springer-Verlag, Berlin, 1985.
[34] Z. Suraj. Generalised fuzzy Petri nets for approximate reasoning in decision support systems. In Proc. of CS&P '12 (International Workshop on Concurrency, Specification and Programming), volume 928 of CEUR Workshop Proceedings, pages 370–381. CEUR-WS.org, 2012.
[35] A. Tolver. An introduction to Markov chains. Lecture notes, Department of Mathematical Sciences, University of Copenhagen, November 2016.
[36] W. Wiegerinck, W. Burgers, and B. Kappen. Bayesian networks, introduction and practical applications. In Handbook on Neural Information Processing, pages 401–431. Springer, 2013.
A PROPs
Both causality graphs and (sub-)stochastic matrices can be seen in the context of PROPs [28] (where PROP stands for "products and permutations category"), a categorical notion that formalizes string diagrams.

Since we do not need the full theory behind PROPs to obtain our results, and because of space restrictions, we did not define PROPs explicitly within the main part of the paper. However, here we formally introduce the mathematical structure underlying modular Bayesian networks: CC-structured PROPs, i.e., PROPs with commutative comonoid structure, a type of strict symmetric monoidal category.

A CC-structured PROP is a symmetric monoidal category whose objects are natural numbers and whose arrows are terms. In particular, every term t has a type n → m with n, m ∈ N. There are two operators that can be used to combine terms: sequential (;) and parallel (⊗) composition, the latter also called tensor (see Fig. 5). Sequential composition corresponds to the categorical composition and combines two terms t_1 : n → l and t_2 : l → m to a term t_1 ; t_2 : n → m. When applied to t_1 : n_1 → m_1 and t_2 : n_2 → m_2, the tensor operator produces a term t_1 ⊗ t_2 : n_1 + n_2 → m_1 + m_2. Furthermore, there are atomic terms: generators g ∈ G (from a given set G) of fixed type, and four different constants (Fig. 6):

id : 1 → 1    ∇ : 1 → 2    σ : 2 → 2    ⊤ : 1 → 0

The constants ∇ and ⊤ are specific to CC-structured PROPs. The freely generated CC-structured PROP can be obtained by taking all terms obtained inductively from generators and constants via sequential composition and tensor, quotiented by the axioms.

Figure 5
String diagrammatic representation of the operators ; and ⊗. Thick double lines represent several wires. The types are f_1 : n → k, f_2 : k → m, f_3 : n_1 → m_1, f_4 : n_2 → m_2.

The operators of higher arity are defined inductively:

id_1 = id        id_{n+1} = id_n ⊗ id
⊤_1 = ⊤        ⊤_{n+1} = ⊤_n ⊗ ⊤
σ_{n,0} = σ_{0,n} = id_n
σ_{n+1,1} = (id_1 ⊗ σ_{n,1}) ; (σ_{1,1} ⊗ id_n)
σ_{n,m+1} = (σ_{n,m} ⊗ id_1) ; (id_m ⊗ σ_{n,1})
∇_1 = ∇        ∇_{n+1} = (∇_n ⊗ ∇) ; (id_n ⊗ σ_{n,1} ⊗ id_1)

The axioms for CC-structured PROPs are:

(t_1 ; t_2) ⊗ (t_3 ; t_4) = (t_1 ⊗ t_3) ; (t_2 ⊗ t_4)
(t_1 ; t_2) ; t_3 = t_1 ; (t_2 ; t_3)
id_n ; t = t = t ; id_m
(t_1 ⊗ t_2) ⊗ t_3 = t_1 ⊗ (t_2 ⊗ t_3)
id_0 ⊗ t = t = t ⊗ id_0
σ ; σ = id_2
(t ⊗ id_k) ; σ_{m,k} = σ_{n,k} ; (id_k ⊗ t)
∇ ; (∇ ⊗ id_1) = ∇ ; (id_1 ⊗ ∇)
∇ = ∇ ; σ
∇ ; (id_1 ⊗ ⊤) = id_1

Table 1
Operators of higher arity (above) and axioms for CC-structured PROPs (below)

Figure 6
String diagrammatic equivalents to the constants
Causality graphs (Def. 5) form a PROP, in fact the free CC-structured PROP. They can simply be seen as the string diagram representation of the arrows of the PROP. The axioms in Table 1 basically describe how to rearrange a string diagram into an isomorphic one. The constants correspond to causality graphs as drawn in Fig. 6.

Hence, every causality graph has both a string diagrammatic representation and is represented by an equivalence class of terms. For instance, the causality graph B in Fig. 2 can also be written as a term (A ⊗ B ⊗ C) ; (D ⊗ id) ; E, where A, B, C, D, E are the corresponding matrices (given by ev).

Another instance of a CC-structured PROP are (sub-)stochastic matrices, with entries taken from the closed interval [0,1] ⊂ R. Here, the constants correspond to the following matrices: id_0 is the 1 × 1 matrix (1); id_1 is the 2 × 2 identity matrix; ∇ is the 4 × 2 matrix with ∇(xx | x) = 1 for x ∈ {0,1} and all other entries 0; σ is the 4 × 4 permutation matrix with σ(yx | xy) = 1; and ⊤ is the 1 × 2 matrix (1 1).

We index matrices over {0,1}^m × {0,1}^n, i.e., for x ∈ {0,1}^m, y ∈ {0,1}^n the corresponding entry is denoted by P(x | y). The order of rows and columns, regarding the assignment of events taking place or not, is descending (11, 10, 01, 00).

Sequential composition is matrix multiplication: given P : n → m, Q : m → ℓ, we define P ; Q = Q · P : n → ℓ, which is a 2^ℓ × 2^n matrix. The tensor is given by the Kronecker product: given P_1 : n_1 → m_1, P_2 : n_2 → m_2, we define P_1 ⊗ P_2 : n_1 + n_2 → m_1 + m_2 as (P_1 ⊗ P_2)(x_1 x_2 | y_1 y_2) = P_1(x_1 | y_1) · P_2(x_2 | y_2), where x_i ∈ {0,1}^{m_i}, y_i ∈ {0,1}^{n_i}.

A modular Bayesian network (MBN) is a causality graph where every generator is interpreted by a (sub-)stochastic matrix via an evaluation function ev. Given ev, we obtain a mapping M_ev that transforms causality graphs into (sub-)stochastic matrices (Def. 7). Note that M_ev is compositional: it preserves constants, generators, sequential composition and tensor.
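The matrix semantics is easy to experiment with. The following pure-Python sketch (our own illustration, not the paper's code) encodes the four constants as matrices in the descending index order 11, 10, 01, 00, implements ; as a matrix product and ⊗ as a Kronecker product, and checks two of the axioms numerically.

```python
# Constants of the CC-structured PROP as (sub-)stochastic matrices (lists of
# rows), with rows/columns indexed in the descending order 11, 10, 01, 00.
ID1 = [[1.0, 0.0], [0.0, 1.0]]                            # id : 1 -> 1
COPY = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]   # copy : 1 -> 2
SWAP = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0],       # swap : 2 -> 2
        [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
DISCARD = [[1.0, 1.0]]                                    # discard : 1 -> 0

def seq(P, Q):
    """Sequential composition P ; Q (P applied first): matrix product Q · P."""
    return [[sum(Q[i][k] * P[k][j] for k in range(len(P)))
             for j in range(len(P[0]))] for i in range(len(Q))]

def tensor(P, Q):
    """Parallel composition P (x) Q: the Kronecker product."""
    return [[a * b for a in prow for b in qrow]
            for prow in P for qrow in Q]

# Two of the axioms, checked numerically: copying a wire and discarding one
# copy is the identity, and composing the swap with itself gives id_2.
assert seq(COPY, tensor(ID1, DISCARD)) == ID1
assert seq(SWAP, SWAP) == [[1.0 if i == j else 0.0 for j in range(4)]
                           for i in range(4)]
```

Larger terms such as (A ⊗ B ⊗ C) ; (D ⊗ id) ; E can then be evaluated by nesting these two operators, which is exactly the compositionality of M_ev.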
More formally, M_ev is a functor between symmetric monoidal categories, preserving also the CC-structure of the PROP. This also means that evaluating a term via matrix operations and evaluating its causality graph as described in Def. 7 gives the same result.

B Modelling a Test with False Positives and Negatives

B.1 Setup
Here we analyze the net from Fig. 1c (discussed in Ex. 1), which models a test with false positives and negatives, in more detail. We have a random variable R that describes whether the test is positive (1) or negative (0).

Remember that there is one place (I, marked when the person is infected) and three transitions, modelling the following events:
- flp: this transition has no pre- and no post-condition. In the case where I is unmarked, firing this transition denotes a false positive. It may also fire if I is marked, in which case it stands for a test that actually went wrong and is only positive because of luck (lucky positive). Hence p_T(flp) = P(R | ¬I).
- inf: this transition has I in its pre- and post-condition. Hence it can be fired only if I is marked, representing a test that truly – and not by chance – uncovers an infection. Hence p_T(inf) = P(R | I) − P(R | ¬I), the probability that has to be added to the probability of flp to obtain the true positive (assuming that the probability of a false positive is less than that of a true positive).
- fail: the special fail transition stands for a failed test and its probability is p_T(fail) = P(¬R | I) (false negative). Note that if the person is not infected, inf fails as well, and the sum of the probabilities is p_T(inf) + p_T(fail) = P(¬R | I) + P(R | I) − P(R | ¬I) = 1 − P(R | ¬I) = P(¬R | ¬I) (true negative), exactly as required.

As explained before, we use the independent semantics.

B.2 Success Case
Now we perform uncertainty reasoning as described in Sct. 3, where the initial probability distribution is

p = ( P(I), P(¬I) )ᵀ

Assume first that we observe success. In this case we have to multiply p with the matrix P∗ that is given as follows:

P∗ = diag( p_T(flp) + p_T(inf), p_T(flp) ) = diag( P(R | I), P(R | ¬I) )

The first row/column always refers to the case where I is marked (I) and the second row/column to the case where I is unmarked (¬I). Hence, the first entry on the diagonal gives us the probability of going from marking I to itself and the second entry the probability of staying in the empty marking.
Hence by multiplying P∗ · p we obtain

P∗ · p = ( P(R | I) · P(I), P(R | ¬I) · P(¬I) )ᵀ = ( P(R ∩ I), P(R ∩ ¬I) )ᵀ

We normalize by dividing by P(R ∩ I) + P(R ∩ ¬I) = P(R) and get, using again the definition of conditional probability:

norm(P∗ · p) = ( P(I | R), P(¬I | R) )ᵀ

B.3 Failure Case
We now switch to the case where failure is observed. In this case we have to multiply p with the matrix F∗ that is given as follows:

F∗ = diag( p_T(fail), p_T(inf) + p_T(fail) ) = diag( P(¬R | I), P(¬R | ¬I) )

The first entry on the diagonal gives us the probability of failing from marking I and the second entry the probability of failing from the empty marking. Hence by multiplying F∗ · p we obtain

F∗ · p = ( P(¬R | I) · P(I), P(¬R | ¬I) · P(¬I) )ᵀ = ( P(¬R ∩ I), P(¬R ∩ ¬I) )ᵀ

We normalize by dividing by P(¬R ∩ I) + P(¬R ∩ ¬I) = P(¬R) and get, using again the definition of conditional probability:

norm(F∗ · p) = ( P(I | ¬R), P(¬I | ¬R) )ᵀ

Hence we obtain the probabilities that the person is infected respectively not infected, under the condition that the test is positive respectively negative, exactly as required. This shows that our formalism is expressive enough to model the standard testing scenario with false positives and negatives.
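As a sanity check of the two updates above, the following Python snippet instantiates the model with made-up numbers (prior P(I) = 0.1, true-positive rate P(R|I) = 0.95, false-positive rate P(R|¬I) = 0.05; these values are our assumption, not from the paper) and verifies that norm(P∗ · p) and norm(F∗ · p) agree with a direct application of Bayes' rule.

```python
# Made-up numbers: prior P(I) = 0.1, true-positive rate P(R|I) = 0.95,
# false-positive rate P(R|notI) = 0.05.
p = [0.1, 0.9]                      # (P(I), P(notI))
p_r_i, p_r_noti = 0.95, 0.05

# Success: multiply with the diagonal of P* = diag(P(R|I), P(R|notI)),
# then normalize.
succ = [p_r_i * p[0], p_r_noti * p[1]]
post_success = [x / sum(succ) for x in succ]

# Failure: multiply with F* = diag(P(notR|I), P(notR|notI)), then normalize.
fail = [(1 - p_r_i) * p[0], (1 - p_r_noti) * p[1]]
post_failure = [x / sum(fail) for x in fail]

# Both must coincide with a direct application of Bayes' rule.
bayes_pos = p_r_i * p[0] / (p_r_i * p[0] + p_r_noti * p[1])
bayes_neg = (1 - p_r_i) * p[0] / ((1 - p_r_i) * p[0] + (1 - p_r_noti) * p[1])
assert abs(post_success[0] - bayes_pos) < 1e-12
assert abs(post_failure[0] - bayes_neg) < 1e-12
```

With these numbers, a positive test raises the infection probability from 0.1 to roughly 0.68, while a negative test lowers it to well under 0.01, matching the usual Bayesian analysis of diagnostic tests.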
C Comparison to Term Width
We here compare the notion of elimination width to another notion of width: term width, a very natural notion, since we are working in a PROP.

Given a representation of the causality graph of a Bayesian network B as a term t, the width of t is intuitively the size of the largest matrix that occurs when evaluating t. In fact, an arrow of type n → m corresponds to a matrix of dimensions 2^m × 2^n, where 2^m · 2^n = 2^{m+n}. As before, we will give the width or size of the matrix as n + m.

Definition 15 (Term width). Let t be a term of a CC-structured PROP. Then we inductively define the term width of t : n → m, denoted by ⟨⟨t⟩⟩, as follows:
- Whenever t is a generator or a constant (⊤, σ, ∇, id), the width of t is the size of the corresponding matrix: ⟨⟨t⟩⟩ = n + m.
- ⟨⟨t_1 ; t_2⟩⟩ = max{⟨⟨t_1⟩⟩, ⟨⟨t_2⟩⟩, n + m}.
- ⟨⟨t_1 ⊗ t_2⟩⟩ = max{⟨⟨t_1⟩⟩, ⟨⟨t_2⟩⟩, n + m}.

The term width of a causality graph B, denoted by ⟨⟨B⟩⟩, is the minimum width of a term that represents B.

That is, we compute the sizes of all the matrices that we encounter along the way and take the maximum of these sizes. As discussed earlier, it is essential to choose a good term representation of a Bayesian network in order to obtain small matrix sizes and hence an efficient evaluation.

However, the elimination width and term width of a causality graph do not necessarily coincide. This suggests that term width does not provide suitable bounds for the actual computations. A direction of future research is to come up with an alternative notion that better relates the size of a term with the efficiency of its "computation recipe".

However, we still obtain an upper bound for the elimination width, by viewing every matrix as a factor.
▶ Proposition 16. There is a causality graph $B$ whose elimination width is strictly smaller than $\langle\langle B \rangle\rangle$, and vice versa. Let $B$ be a causality graph of type $n \to m$. Then the elimination width of $B$ is bounded by $2 \cdot \langle\langle B \rangle\rangle$.

D Proofs

▶ Proposition 4.
Let $r_n : M \times T_f \to [0,1]$ and $p_n : M \cup \{*\} \to [0,1]$ be given as above. Let $N$ be a C/E net and let $(X_n)_{n \in \mathbb{N}}$ be the Markov chain generated from $N$, $r_n$. Then

$$P(X_{n+1} = m \mid X_{n+1} \neq *,\, X_n \neq *) = P(X_{n+1} = m \mid X_{n+1} \neq *) = \mathrm{norm}(P_{n*} \cdot p_{n*})(m)$$
$$P(X_n = m \mid X_{n+1} = *,\, X_n \neq *) = \mathrm{norm}(F_{n*} \cdot p_{n*})(m)$$

where $p_n(m) = P(X_n = m)$, $p_n(*) = P(X_n = *)$ and $F_n$ is a diagonal matrix with $F_n(\bar{m} \mid \bar{m}) := P_n(* \mid \bar{m})$, $\bar{m} \in M$, and $F_n(* \mid *) := P_n(* \mid *) = 1$; all other entries are $0$.

Proof. $P(X_{n+1} = m \mid X_{n+1} \neq *,\, X_n \neq *)$: Whenever $X_{n+1} \neq *$, we automatically have $X_n \neq *$, since one cannot leave the fail state. Hence:

$$P(X_{n+1} = m \mid X_{n+1} \neq *,\, X_n \neq *) = P(X_{n+1} = m \mid X_{n+1} \neq *) = \frac{P(X_{n+1} = m \wedge X_{n+1} \neq *)}{P(X_{n+1} \neq *)} = \frac{P(X_{n+1} = m)}{\sum_{m'} P(X_{n+1} = m')}$$
$$= \frac{p_{n+1}(m)}{\sum_{m'} p_{n+1}(m')} = \mathrm{norm}((p_{n+1})_*)(m) = \mathrm{norm}((P_n \cdot p_n)_*)(m) = \mathrm{norm}(P_{n*} \cdot p_{n*})(m)$$

$P(X_n = m \mid X_{n+1} = *,\, X_n \neq *)$: We first observe that:

$$P(X_n = m \wedge X_{n+1} = *) = P(X_n = m) \cdot P(X_{n+1} = * \mid X_n = m) = p_n(m) \cdot P_n(* \mid m) = p_n(m) \cdot F_n(m \mid m) = (F_n \cdot p_n)(m)$$

From this we can derive:

$$P(X_{n+1} = * \wedge X_n \neq *) = P\Big(X_{n+1} = * \wedge \big(\bigvee_m X_n = m\big)\Big) = P\Big(\bigvee_m (X_{n+1} = * \wedge X_n = m)\Big) = \sum_m P(X_{n+1} = * \wedge X_n = m) = \sum_m (F_n \cdot p_n)(m)$$
The second-to-last equality holds since the events are disjoint. And so finally we obtain:

$$P(X_n = m \mid X_{n+1} = *,\, X_n \neq *) = \frac{P(X_n = m \wedge X_{n+1} = * \wedge X_n \neq *)}{P(X_{n+1} = * \wedge X_n \neq *)} = \frac{P(X_n = m \wedge X_{n+1} = *)}{P(X_{n+1} = * \wedge X_n \neq *)} = \frac{(F_n \cdot p_n)(m)}{\sum_{m'} (F_n \cdot p_n)(m')}$$
$$= \mathrm{norm}((F_n \cdot p_n)_*)(m) = \mathrm{norm}(F_{n*} \cdot p_{n*})(m) \qquad ◀$$

▶ Proposition 8.
Assume that $N$ is a condition/event net together with a function $r_n$. Assume that we have $\bar{S} \subseteq S$, $\bar{T} \subseteq T_f$ satisfying the conditions above. Then

- $P_{n*} = \bar{P} \otimes \mathrm{Id}_{k-\ell}$ where $\bar{P}(m_1' \mid m_1) = \sum_{t \in \bar{T},\, m_1 \overset{t}{\Rightarrow} m_1'} \bar{r}(m_1, t)$.
- $F_{n*} = \bar{F} \otimes \mathrm{Id}_{k-\ell}$ where $\bar{F}(m_1' \mid m_1) = \sum_{t \in \bar{T},\, m_1 \overset{t}{\nRightarrow}} \bar{r}(m_1, t)$ if $m_1 = m_1'$ and $0$ otherwise.

Here $\bar{P}$, $\bar{F}$ are $2^\ell \times 2^\ell$-matrices and $m_1, m_1' \subseteq \bar{S}$. Note also that we implicitly restricted the firing relation to the markings on $\bar{S}$.

Proof.
Let $m, m'$ be two markings which split into $m = m_1 m_2$, $m' = m_1' m_2'$.

$P_{n*} = \bar{P} \otimes \mathrm{Id}_{k-\ell}$:

$$P_{n*}(m' \mid m) = \sum_{t,\, m \overset{t}{\Rightarrow} m'} r_n(m, t) = \sum_{t \in \bar{T},\, m \overset{t}{\Rightarrow} m'} r_n(m, t) = \sum_{t \in \bar{T},\, m \overset{t}{\Rightarrow} m'} \bar{r}(m_1, t) = \sum_{t \in \bar{T},\, m_1 \overset{t}{\Rightarrow} m_1'} \bar{r}(m_1, t) \cdot \mathrm{Id}_{k-\ell}(m_2' \mid m_2)$$
$$= \Big( \sum_{t \in \bar{T},\, m_1 \overset{t}{\Rightarrow} m_1'} \bar{r}(m_1, t) \Big) \cdot \mathrm{Id}_{k-\ell}(m_2' \mid m_2) = \bar{P}(m_1' \mid m_1) \cdot \mathrm{Id}_{k-\ell}(m_2' \mid m_2) = (\bar{P} \otimes \mathrm{Id}_{k-\ell})(m' \mid m)$$

The second equality holds because only transitions of $\bar{T}$ can fire in $m$. The fourth equality is true since there is a transition $m \overset{t}{\Rightarrow} m'$ if and only if $m_1 \overset{t}{\Rightarrow} m_1'$ and $m_2 = m_2'$.

$F_{n*} = \bar{F} \otimes \mathrm{Id}_{k-\ell}$: Here we distinguish two cases. If $m = m'$, then

$$F_{n*}(m \mid m) = \sum_{t \in T_f,\, m \overset{t}{\nRightarrow}} r_n(m, t) = \sum_{t \in \bar{T},\, m \overset{t}{\nRightarrow}} r_n(m, t) = \sum_{t \in \bar{T},\, m_1 \overset{t}{\nRightarrow}} \bar{r}(m_1, t) = \sum_{t \in \bar{T},\, m_1 \overset{t}{\nRightarrow}} \bar{r}(m_1, t) \cdot \mathrm{Id}_{k-\ell}(m_2 \mid m_2)$$
$$= \bar{F}(m_1 \mid m_1) \cdot \mathrm{Id}_{k-\ell}(m_2 \mid m_2) = (\bar{F} \otimes \mathrm{Id}_{k-\ell})(m \mid m)$$

In the other case ($m \neq m'$) we have

$$F_{n*}(m' \mid m) = 0 = \bar{F}(m_1' \mid m_1) \cdot \mathrm{Id}_{k-\ell}(m_2' \mid m_2) = (\bar{F} \otimes \mathrm{Id}_{k-\ell})(m' \mid m)$$

Note that the second equality holds since whenever $m \neq m'$ we have $m_1 \neq m_1'$ (and so $\bar{F}(m_1' \mid m_1) = 0$) or $m_2 \neq m_2'$ (and so $\mathrm{Id}_{k-\ell}(m_2' \mid m_2) = 0$). ◀
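The factorization asserted by Proposition 8 can be sanity-checked numerically: a step matrix that acts only on the places of the subnet is the Kronecker product of a small matrix with an identity. The 2×2 matrix below is an invented example, not taken from the paper.

```python
# Plain-Python Kronecker product illustrating P_{n*} = P-bar (x) Id.
# The concrete matrix P is an assumed example.

def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    p, q = len(B), len(B[0])
    return [[A[i // p][j // q] * B[i % p][j % q]
             for j in range(len(A[0]) * q)]
            for i in range(len(A) * p)]

P = [[0.7, 0.2],
     [0.3, 0.8]]      # step matrix on the subnet (P-bar)
Id = [[1, 0],
      [0, 1]]         # identity on the untouched places
M = kron(P, Id)       # the full step matrix P-bar (x) Id, block structure
```

Every column of `M` moves probability only between markings that agree on the places outside the subnet, which is exactly the content of the proposition.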
▶ Proposition 12. Given a modular Bayesian network $(B, ev)$ where $B$ is of type $n \to m$, Algorithm 11 computes its corresponding (sub-)stochastic matrix $M^{ev}(B)$, that is,

$$M^{ev}(B)(y \mid x) = p(xy) \quad \text{for } x \in \{0,1\}^n,\ y \in \{0,1\}^m.$$

Furthermore, the size of the largest factor in any multiset $F_i$ is bounded by the width of the elimination ordering.

Proof.
We first show that we obtain the correct result. For this, we define the subprobability distribution $p_F$ associated to a multiset of factors $F$. Let $IW_B = \{w_1, \dots, w_k\}$ be the set of internal wires and fix the elimination ordering $w_1, \dots, w_k$. Then we define $p_F : \{0,1\}^{n+m} \to [0,1]$ with:

$$p_F(xy) = \sum_{z \in \{0,1\}^k} \prod_{(f, \tilde{w}_f) \in F} f\big(b^{\tilde{\iota}\tilde{o}\tilde{w},\, xyz}(\tilde{w}_f)\big)$$

where $x \in \{0,1\}^n$, $y \in \{0,1\}^m$, $\tilde{\iota} = i_1 \dots i_n$ and $\tilde{o} = \mathrm{out}(o_1) \dots \mathrm{out}(o_m)$. Clearly $p_{F_0}$ corresponds to $M^{ev}(B)$ (see Def. 7), that is, $p_{F_0}(xy) = M^{ev}(B)(y \mid x)$. Furthermore the result $p$ of the algorithm equals $p_{F_k}$, since in this case $z$ is empty. We only have to show that in each step $p_{F_{i-1}} = p_{F_i}$. This can be seen by observing that, due to distributivity, we have:

$$p_{F_{i-1}}(xy) = \sum_{z_k \in \{0,1\}} \cdots \sum_{z_i \in \{0,1\}}\ \prod_{(f, \tilde{w}_f) \in F_{i-1}} f(b_{i-1}(\tilde{w}_f))$$
$$= \sum_{z_k \in \{0,1\}} \cdots \sum_{z_i \in \{0,1\}} \Big( \prod_{(h, \tilde{w}_h) \in F_{i-1} \setminus C_{w_i}(F_{i-1})} h(b_{i-1}(\tilde{w}_h)) \Big) \cdot \Big( \prod_{(g, \tilde{w}_g) \in C_{w_i}(F_{i-1})} g(b_{i-1}(\tilde{w}_g)) \Big)$$
$$= \sum_{z_k \in \{0,1\}} \cdots \sum_{z_{i+1} \in \{0,1\}} \Big( \prod_{(h, \tilde{w}_h) \in F_{i-1} \setminus C_{w_i}(F_{i-1})} h(b_i(\tilde{w}_h)) \Big) \cdot \underbrace{\Big( \sum_{z_i \in \{0,1\}} \prod_{(g, \tilde{w}_g) \in C_{w_i}(F_{i-1})} g(b_i(\tilde{w}_g)) \Big)}_{f(b_i(\tilde{w}_f))}$$
$$= \sum_{z_k \in \{0,1\}} \cdots \sum_{z_{i+1} \in \{0,1\}}\ \prod_{(f, \tilde{w}_f) \in F_i} f(b_i(\tilde{w}_f)) = p_{F_i}(xy),$$

where $(f, \tilde{w}_f)$ is the new factor produced in step $i$ of the algorithm. Here we use the functions $b_i = b^{\tilde{\iota}\tilde{o}\, w_{i+1} \dots w_k,\; xy\, z_{i+1} \dots z_k}$.

Furthermore observe that for every factor $(f, \tilde{w}_f) \in F_i$, all the wires in $\tilde{w}_f$ are connected via an edge in $U_i$, i.e., all those wires are part of a clique. This means that the size of the factors is bounded by the elimination width. We show this by induction on $i$.

- $i = 0$: $F_0$ contains those factors that correspond to the generators originally contained in $B$. Each of these generators induces a factor $(f, \tilde{w}_f)$ and a clique containing all vertices in $\tilde{w}_f$ in $U_0$.
- $i \to i+1$: In step $i+1$ we eliminate wire $w_{i+1}$ and produce a new factor $(f, \tilde{w}_f)$. In order to produce this new factor, we multiply factors of $F_i$. Furthermore, we obtain a graph $U_{i+1}$ that contains a clique of all vertices in $\tilde{w}_f$.
For the factors that we keep, the corresponding parts of the graph are unchanged and hence the corresponding cliques remain. ◀

Figure 7: A causality graph $B$ with $n$ output wires whose treewidth is strictly smaller than its elimination width.
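The elimination step at the heart of Algorithm 11, multiplying the factors that mention a wire and summing the wire out, can be sketched as follows. The factor encoding (a tuple of variable names plus a table over $\{0,1\}$ assignments) is an assumption for illustration, not the paper's data structure.

```python
# Minimal sum-product variable elimination mirroring the proof of Prop. 12:
# eliminating a wire leaves the represented distribution p_F unchanged.
from itertools import product

def eliminate(factors, var):
    """Eliminate `var`: multiply all factors mentioning it, sum it out."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    new_vars = tuple(sorted({v for vs, _ in touching for v in vs} - {var}))
    table = {}
    for assign in product([0, 1], repeat=len(new_vars)):
        env = dict(zip(new_vars, assign))
        total = 0.0
        for b in (0, 1):                 # sum out the eliminated wire
            env[var] = b
            prod = 1.0
            for vs, tab in touching:
                prod *= tab[tuple(env[v] for v in vs)]
            total += prod
        table[assign] = total
    return rest + [(new_vars, table)]

# Two assumed example factors f(a, b) and g(b, c); eliminate wire 'b'.
f = (('a', 'b'), {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.4, (1, 1): 0.6})
g = (('b', 'c'), {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.7, (1, 1): 0.3})
vs, tab = eliminate([f, g], 'b')[0]
# tab[(a, c)] equals sum_b f(a, b) * g(b, c)
```

The size of the new factor is exactly the clique created in the graph $U_i$, which is why the elimination width bounds the factor sizes.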
▶ Proposition 14. Elimination width is always an upper bound for treewidth, and they coincide when $B$ is a causality graph of type $0 \to 0$. For a network of type $0 \to m$ the treewidth may be strictly smaller than the elimination width.

Proof.
First, by comparing Def. 13 and the definition of treewidth for undirected graphs from the literature [1], one observes that the treewidth of a causality graph $B$ coincides with the treewidth of its undirected clique graph $U_0$ constructed above. Furthermore, according to [1, Theorem 6], the fact that $U_0$ has treewidth $k$ is equivalent to the existence of an elimination order as in Def. 9, where the largest clique in the graph sequence $U_i$ is bounded by $k$. Furthermore, as described in [1], every elimination order gives rise to a tree decomposition of the same width. Hence elimination width provides an upper bound for treewidth.

In order to show that the treewidth of a causality graph of type $0 \to m$ may be strictly smaller than the elimination width, consider the network shown in Fig. 7 (left): we only have to eliminate one internal wire, the wire $a$ that exits node $A$. By eliminating it, we obtain an $n$-clique of the output wires and thus we have elimination width $n$. However, we have a star-shaped tree decomposition of width 1 (shown in Fig. 7 on the right). ◀
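The Fig. 7 phenomenon can be replayed programmatically: eliminating the single internal wire of a star connects all $n$ output wires into a clique. The adjacency-set encoding below is an assumed sketch, not the paper's implementation.

```python
# Eliminating a vertex in a graph connects all of its neighbours
# (the standard graph-elimination step behind Def. 9).

def eliminate_wire(adj, w):
    """Remove wire w and connect all of its neighbours pairwise."""
    nbrs = adj.pop(w)
    for u in nbrs:
        adj[u] = (adj[u] | nbrs) - {u, w}
    return nbrs   # the clique created by the elimination

n = 5
# Star: internal wire 'a' connected to n output wires b0..b4.
adj = {'a': {f'b{i}' for i in range(n)}}
for i in range(n):
    adj[f'b{i}'] = {'a'}

clique = eliminate_wire(adj, 'a')
# The n output wires now form an n-clique: elimination width n,
# even though the star itself has treewidth 1.
```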
▶ Proposition 16. There is a causality graph $B$ whose elimination width is strictly smaller than $\langle\langle B \rangle\rangle$, and vice versa. Let $B$ be a causality graph of type $n \to m$. Then the elimination width of $B$ is bounded by $2 \cdot \langle\langle B \rangle\rangle$.

Proof.
We first show that the elimination width of $B$ can be strictly smaller than $\langle\langle B \rangle\rangle$, as witnessed by the example in Fig. 8a. The elimination width is 2: if we denote the wires of the network by $a_1, a_2$ (the two wires originating from $A$) and $c$ (the wire originating from $C$), then the joint distribution can be obtained as

$$p(a_2, c) = \sum_{a_1 \in \{0,1\}} A(a_1, a_2) \cdot C(c \mid a_1),$$

where the largest factor that is involved is of size 2. However, there is no way to represent this causality graph by a term $t$ of width 2: we have to start with $A$ (of type $0 \to 2$) and multiply it with some other matrix of type $2 \to 2$ (such as $\mathrm{id} \otimes C$), which results in width 4.

On the other hand, $\langle\langle B \rangle\rangle$ can be strictly smaller than the elimination width of $B$, as witnessed by the causality graph in Fig. 8b, where we compose (multiply) two matrices $A, C$ of type $k \to k$.
Figure 8: (a) A causality graph $B$ whose elimination width is strictly smaller than $\langle\langle B \rangle\rangle$. (b) A causality graph $B$ where $\langle\langle B \rangle\rangle$ is strictly smaller than the elimination width.
The largest matrix encountered during the computation is hence of size $k + k = 2k$. On the other hand, the width of the elimination ordering is $3k - 1$: if we eliminate one of the inner wires, we immediately obtain a clique containing the remaining $3k - 1$ wires in $U_1$ (since every inner wire is connected to all the other wires).

Assume that $B$ is represented by a term $t$ with $\langle\langle t \rangle\rangle = \langle\langle B \rangle\rangle$. From $t$ we inductively derive an elimination order $eo(t)$ for the inner wires of $B$:
- whenever $t$ is a generator or a constant, the elimination order is the empty sequence (since there are no inner wires);
- whenever $t = t_1 \otimes t_2$, then $eo(t) = eo(t_1)\, eo(t_2)$ (the concatenation of the elimination orderings);
- whenever $t = t_1 ; t_2$, then let $\tilde{w}$ be the sequence of wires that becomes internal due to the composition. Then $eo(t) = eo(t_1)\, eo(t_2)\, \tilde{w}$.

This gives us an ordering of all the inner wires of $B$. Now, for a given evaluation map $ev$, we compute $M^{ev}(B)$ according to the elimination order $eo(t)$ and prove by structural induction on $t$ that the elimination width of $B$ is bounded by $2 \cdot \langle\langle t \rangle\rangle$. In fact we use a stronger induction hypothesis where we show in addition that in the multiset of factors that we obtain at the very end, every factor has size at most $\langle\langle t \rangle\rangle$.

- Whenever $t$ is a generator or a constant, the elimination width corresponds to the size of the largest generator. Hence, for a generator $g$ the elimination width equals $\langle\langle t \rangle\rangle \le 2 \cdot \langle\langle t \rangle\rangle$, whereas for the other constants we have elimination width $0 \le 2 \cdot \langle\langle t \rangle\rangle$.
- Whenever $t = t_1 \otimes t_2$, we know by the induction hypothesis that the elimination width of the causality graph $B_i$ represented by $t_i$ is bounded by $2 \cdot \langle\langle t_i \rangle\rangle$ and that in the multiset of factors that we obtain at the end, every factor has size at most $\langle\langle t_i \rangle\rangle$. This means that for $B = B_1 \otimes B_2$ the algorithm starts with the disjoint union of the factors of $B_1$ and $B_2$. If we follow the elimination order $eo(t_1)\, eo(t_2)$, we first process the factors of $B_1$, followed by the factors of $B_2$. No factor that is produced exceeds $\max\{2 \cdot \langle\langle t_1 \rangle\rangle, 2 \cdot \langle\langle t_2 \rangle\rangle\} = 2 \cdot \max\{\langle\langle t_1 \rangle\rangle, \langle\langle t_2 \rangle\rangle\} \le 2 \cdot \langle\langle t \rangle\rangle$ and in the end we obtain factors whose size is bounded by $\max\{\langle\langle t_1 \rangle\rangle, \langle\langle t_2 \rangle\rangle\} \le \langle\langle t \rangle\rangle$.
- Whenever $t = t_1 ; t_2$ (where $t_1 : n \to k$, $t_2 : k \to m$), we know by the induction hypothesis that the elimination width of the causality graph $B_i$ represented by $t_i$ is bounded by $2 \cdot \langle\langle t_i \rangle\rangle$ and that in the multiset of factors that we obtain at the end, every factor has size at most $\langle\langle t_i \rangle\rangle$. This means that for $B = B_1 ; B_2$ the algorithm starts with the disjoint union of the factors of $B_1$ and $B_2$. If we follow the elimination order $eo(t_1)\, eo(t_2)\, \tilde{w}$, we first process the factors of $B_1$, followed by the factors of $B_2$. As above, no factor that is produced exceeds $\max\{2 \cdot \langle\langle t_1 \rangle\rangle, 2 \cdot \langle\langle t_2 \rangle\rangle\} \le 2 \cdot \langle\langle t \rangle\rangle$ and in the end we obtain factors whose size is bounded by $\max\{\langle\langle t_1 \rangle\rangle, \langle\langle t_2 \rangle\rangle\} \le \langle\langle t \rangle\rangle$. Next, we have to eliminate the wires in $\tilde{w}$. The largest factor that can be produced in this way is of size at most $n + k + k + m \le \langle\langle t_1 \rangle\rangle + \langle\langle t_2 \rangle\rangle \le 2 \cdot \max\{\langle\langle t_1 \rangle\rangle, \langle\langle t_2 \rangle\rangle\} \le 2 \cdot \langle\langle t \rangle\rangle$. At the very end, once we have eliminated all the wires, we obtain factors whose size is at most $n + m \le \langle\langle t \rangle\rangle$. ◀
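The inductive definition of $eo(t)$ translates directly into a recursion over term trees. The AST encoding below (with explicit names for the wires internalized by each composition) is an assumption for illustration, not the paper's implementation.

```python
# Sketch of the elimination order eo(t) derived from a term, as in the
# proof: generators contribute nothing, tensor concatenates the orders,
# and composition appends the newly internalized wires.

def eo(t):
    tag = t[0]
    if tag == 'gen':
        return []                          # generators have no inner wires
    if tag == 'par':                       # tensor: concatenate the orders
        return eo(t[1]) + eo(t[2])
    # 'seq': (tag, t1, t2, internal_wires) - composition internalizes wires
    return eo(t[1]) + eo(t[2]) + list(t[3])

A = ('gen', 'A')
C = ('gen', 'C')
inner = ('seq', A, C, ['w1', 'w2'])        # A ; C internalizes w1, w2
t = ('par', inner, ('gen', 'D'))
# eo(t) lists w1, w2: the only inner wires of the whole term
```

Processing factors in this order is what keeps every intermediate factor within twice the term width, as the induction shows.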