Probabilistic Generating Circuits
Honghua Zhang (University of California, Los Angeles), Brendan Juba (Washington University in St. Louis), and Guy Van den Broeck (University of California, Los Angeles)
Abstract
Generating functions, which are widely used in combinatorics and probability theory, encode function values into the coefficients of a polynomial. In this paper, we explore their use as a tractable probabilistic model, and propose probabilistic generating circuits (PGCs) for their efficient representation. PGCs strictly subsume many existing tractable probabilistic models, including determinantal point processes (DPPs), probabilistic circuits (PCs) such as sum-product networks, and tractable graphical models. We contend that PGCs are not just a theoretical framework that unifies vastly different existing models, but also show huge potential in modeling realistic data. We exhibit a simple class of PGCs that are not trivially subsumed by simple combinations of PCs and DPPs, and obtain competitive performance on a suite of density estimation benchmarks. We also highlight PGCs' connection to the theory of strongly Rayleigh distributions.
1 Introduction

Probabilistic modeling is an important task in machine learning. Scaling up such models is a key challenge: probabilistic inference quickly becomes intractable as the models become large and sophisticated [Roth, 1996]. Central to this effort is the development of tractable probabilistic models (TPMs) that guarantee probabilistic inference that is tractable in the size of the model, yet can efficiently represent a wide range of probability distributions. There has been a proliferation of different classes of TPMs. Examples include bounded-treewidth graphical models [Meila and Jordan, 2000], determinantal point processes [Borodin and Rains, 2005, Kulesza and Taskar, 2012], and various probabilistic circuits [Darwiche, 2009, Kisa et al., 2014, Vergari et al., 2020] such as sum-product networks [Poon and Domingos, 2011].

Ideally, we want our probabilistic models to be as expressive efficient [Martens and Medabalimi, 2014] as possible, meaning that they can efficiently represent as many classes of distributions as possible, and adapt to a wider spectrum of realistic applications. Often, however, stronger expressive power comes at the expense of tractability: fewer restrictions can make a model more expressive, but they can also make probabilistic inference intractable. We therefore raise the following central research question of this paper:
Does there exist a class of tractable probabilistic models that is strictly more expressive efficient than current TPMs?
All aforementioned models are usually seen as representing probability mass functions: they take assignments to random variables as input and output likelihoods. In contrast, especially in the field of probability theory, it is also common to represent distributions as probability generating polynomials (or generating polynomials for short). Generating polynomials are a powerful mathematical tool, but they have not yet found direct use as a probabilistic machine learning representation that permits tractable probabilistic inference.
X1 X2 X3   Pr_β
0  0  0    0.02
0  0  1    0.08
0  1  0    0.12
0  1  1    0.48
1  0  0    0.02
1  0  1    0.08
1  1  0    0.04
1  1  1    0.16

(a) Table. (b) Probabilistic generating circuit. (c) Probabilistic mass circuit. (d) Kernel L_β and marginal kernel K_β for a DPP.

Figure 1: Four different representations for the same probability distribution Pr_β.

We make the key observation that the marginal probabilities (including likelihoods) for a probability distribution can be computed by evaluating its generating polynomial in a particular way. Based on this observation, we propose probabilistic generating circuits (PGCs), a class of probabilistic models that represent probability generating polynomials compactly as directed acyclic graphs. PGCs provide a partly positive answer to our research question: they are the first known class of TPMs that (strictly) subsume probabilistic circuits (PCs) such as sum-product networks [Poon and Domingos, 2011] and determinantal point processes (DPPs) while supporting tractable marginal inference.

Section 2 formally defines PGCs and establishes their tractability by presenting an efficient algorithm for computing marginals. Section 3 demonstrates the expressive power of PGCs by showing that they subsume PCs and DPPs. Section 4 shows that there are PGCs that are not trivially subsumed by simple combinations of PCs and DPPs. Section 5 evaluates PGCs on standard density estimation benchmarks: even simple PGCs outperform other TPM learners on half of the datasets. Then, Section 6 highlights PGCs' connection to strongly Rayleigh distributions. Section 7 summarizes the paper and motivates future research directions.

2 Probabilistic Generating Circuits

In this section we establish probabilistic generating circuits (PGCs) as a class of tractable probabilistic models. We first introduce generating polynomials as a representation for probability distributions and propose PGCs for their compact representation. Then, we show that marginal probabilities for a PGC can be computed efficiently.
2.1 Probability Generating Polynomials

It is a common technique in combinatorics to encode sequences as generating polynomials. In particular, probability distributions over binary random variables can be represented by probability generating polynomials.
Definition 1. Let Pr(·) be a probability distribution over binary random variables X_1, X_2, ..., X_n; then the probability generating polynomial (or generating polynomial for short) for the distribution is defined as

    g(z_1, \dots, z_n) = \sum_{S \subseteq \{1, \dots, n\}} α_S z^S,    (1)

where α_S = Pr({X_i = 1}_{i∈S}, {X_i = 0}_{i∉S}) and z^S = \prod_{i∈S} z_i.

As an illustrating example, we consider the probability distribution Pr_β specified as a table in Figure 1a. By definition, the generating polynomial for the distribution Pr_β is given by

    g_β = 0.16 z_1 z_2 z_3 + 0.04 z_1 z_2 + 0.08 z_1 z_3 + 0.02 z_1 + 0.48 z_2 z_3 + 0.12 z_2 + 0.08 z_3 + 0.02.    (2)

We see from Equation 2 that the generating polynomial for a distribution simply "enumerates" all possible variable assignments term-by-term, and the coefficient of each term corresponds to the probability of an assignment. The probability for the assignment X_1 = 0, X_2 = 1, X_3 = 1, for example, is 0.48, which corresponds to the coefficient of the term z_2 z_3. That is, given an assignment, we can evaluate its probability by directly reading off the coefficient for the corresponding term. We can also evaluate marginal probabilities by summing over the coefficients for a set of terms. For example, the marginal probability Pr(X_2 = 0, X_3 = 0) is given by Pr(X_1 = 0, X_2 = 0, X_3 = 0) + Pr(X_1 = 1, X_2 = 0, X_3 = 0), which corresponds to the sum of the constant term and the coefficient for the term z_1.

Equation 2 also illustrates that the size of a term-by-term representation for a generating polynomial is exponential in the number of variables. As the number of variables increases, it quickly becomes impractical to compute probabilities by extracting coefficients from these polynomials. Hence, to turn generating polynomials into tractable models, we need a data structure to represent them more efficiently. We thus introduce a new class of probabilistic circuits called probabilistic generating circuits to represent generating polynomials compactly as directed acyclic graphs (DAGs). We first present the formal definition of PGCs.
Definition 2. A probabilistic generating circuit (PGC) is a directed acyclic graph consisting of three types of nodes:

1. Sum nodes ⊕ with weighted edges to children;
2. Product nodes ⊗ with unweighted edges to children;
3. Leaf nodes, which are z_i s or constants.

A PGC has one node of out-degree 0 (edges are directed from children to parents), and we refer to it as the root of the PGC. The size of a PGC is the number of edges in it.

Each node in a PGC represents a polynomial: (i) each leaf in a PGC represents the polynomial z_i or a constant, (ii) each sum node represents the weighted sum over the polynomials represented by its children, and (iii) each product node represents the unweighted product over the polynomials represented by its children. The polynomial represented by a PGC is the polynomial represented by its root.

We have now fully specified the syntax of PGCs, but a PGC with valid syntax does not necessarily have valid semantics. Because of the presence of negative parameters, it is not guaranteed that the polynomial represented by a PGC is a probability generating polynomial: it might contain terms that are not multiaffine or that have negative coefficients (e.g., -0.5 z_1 z_2). In practice, however, we show in later sections ways of constructing PGC structures that are guaranteed to have valid semantics for any parameterization.

Continuing our example, we observe that the generating polynomial g_β in Equation 2 can be re-written as:

    g_β = ((0.1 z_1 + 0.1)(6 z_2 + 1) - 0.4 z_1 z_2)(0.8 z_3 + 0.2).    (3)

Based on Equation 3, we can immediately construct a PGC that compactly represents g_β, as shown in Figure 1b. In this way, generating polynomials for high-dimensional distributions may become feasible to represent by PGCs.

2.2 Computing Marginals

We now show that the computation of marginals is tractable for PGCs. As briefly mentioned in Section 2.1, we can compute probabilities by extracting the coefficients of generating polynomials, which is much trickier when they are represented as deeply-nested DAGs; as shown in Figure 1b, it is impossible to directly read off any coefficient. We circumvent this problem by making the key observation that we can "zero-out" the terms we don't want from a generating polynomial by evaluating it in a certain way. For example, when evaluating a marginal probability with X_i set to 0, we zero-out all terms that contain z_i by setting z_i to 0. We generalize this idea as follows.
Lemma 1. Let g(z_1, \dots, z_n) be the probability generating polynomial for Pr(·). Then for A, B ⊆ {1, \dots, n} with A ∩ B = ∅, the marginal probability can be computed as

    Pr({X_i = 1}_{i∈A}, {X_i = 0}_{i∈B}) = coef_{|A|}( g({z_i = t}_{i∈A}, {z_i = 0}_{i∈B}, {z_i = 1}_{i∉A∪B}) ),

where t is an indeterminate for polynomials, and coef_k(g(t)) denotes the coefficient for the term t^k in g(t).

As an illustrating example for Lemma 1, we compute the marginal probability Pr_β(X_2 = 1, X_3 = 0). As suggested by the Lemma, we first evaluate g_β(z_1 = 1, z_2 = t, z_3 = 0); that is, for the PGC shown in Figure 1b, we first replace z_1 by 1, z_2 by t and z_3 by 0, then propagate the values upward; the intermediate values are polynomials in the ring R[t]. The output of the PGC is 0.16t + 0.04, and the coefficient for the term of degree one gives us the answer: Pr(X_2 = 1, X_3 = 0) = 0.16. It follows that computing marginal probabilities is tractable for PGCs.
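To make this concrete, here is a minimal Python sketch (ours, using NumPy's polynomial utilities rather than any implementation from the paper) that propagates coefficient vectors over R[t] through the factorized form of g_β from Equation 3, exactly as in the example above:

    import numpy as np
    from numpy.polynomial import polynomial as P

    # Polynomials in the indeterminate t are coefficient arrays [c0, c1, ...].
    one, t, zero = np.array([1.0]), np.array([0.0, 1.0]), np.array([0.0])

    # Evaluate g_beta (Equation 3) at z1 = 1, z2 = t, z3 = 0, propagating
    # polynomial values bottom-up exactly as a PGC would.
    z1, z2, z3 = one, t, zero
    inner = P.polysub(P.polymul(P.polyadd(0.1 * z1, [0.1]),
                                P.polyadd(6.0 * z2, [1.0])),
                      0.4 * P.polymul(z1, z2))
    g = P.polymul(inner, P.polyadd(0.8 * z3, [0.2]))
    print(g)  # [0.04 0.16]; the degree-1 coefficient is Pr(X2=1, X3=0) = 0.16

Each sum or product node performs one polynomial addition or multiplication in R[t], which is exactly the cost accounting behind the theorem below.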
Theorem 1. For PGCs of size m representing distributions over n binary random variables, marginal probabilities (including likelihoods) are computable in time O(mn log n log log n).

The O(mn log n log log n) complexity in the theorem consists of two parts: the O(m) part is the time complexity of a bottom-up pass for a PGC of size m, and the O(n log n log log n) part is contributed by the time complexity of computing the product of two degree-n polynomials with the fast Fourier transform [Schönhage and Strassen, 1971, Cantor and Kaltofen, 1991].

3 The Expressive Power of PGCs

To this point, we have introduced PGCs as a probabilistic model and shown that they support tractable marginals. Next, we show that PGCs are at least as expressive as other TPMs, by showing that PGCs tractably subsume decomposable probabilistic circuits and determinantal point processes.
3.1 PGCs Subsume Decomposable PCs

We start by introducing the basics of probabilistic circuits [Vergari et al., 2020, Choi et al., 2020]. The syntax of probabilistic circuits (PCs) is basically the same as that of PGCs, except for the following: (1) the variables in PCs are X_i s and negative literals \bar{X}_i s rather than z_i s; (2) the edge weights of PCs must be non-negative. Unlike PGCs, all of the existing PCs represent probability mass polynomials, so we sometimes refer to them as probabilistic mass circuits. Figure 1c shows an example PC that represents the distribution Pr_β. For a given assignment X = x, the PC A evaluates to a number A(x), which is obtained by (i) replacing variable leaves X_i by x_i, (ii) replacing variable leaves \bar{X}_i by 1 - x_i, (iii) evaluating product nodes as taking the product over their children, and (iv) evaluating sum nodes as taking a weighted sum over their children. Finally, a PC A with variable leaves X = (X_1, \dots, X_n) represents the probability distribution Pr(X = x) ∝ A(x). For an arbitrary PC, most probabilistic inference tasks, including marginals and MAP inference, are computationally hard in the circuit size. In order to guarantee the efficient evaluation of queries, it is therefore necessary to impose further constraints on the structure of the circuit. In this paper we consider two well-known structural properties of probabilistic circuits [Darwiche and Marquis, 2002, Choi et al., 2020]:

Definition 3.
For a PC, we denote the input variables that a node depends on as its scope; then:

1. A ⊗ node is decomposable if the scopes of its children are disjoint.
2. A ⊕ node is smooth if the scopes of its children are the same.

A PC is decomposable if all of its ⊗ nodes are decomposable; a PC is smooth if all of its ⊕ nodes are smooth.

Let A be a PC over X_1, \dots, X_n. If A is decomposable and smooth, then we can efficiently compute its marginals: for disjoint A, B ⊆ {1, \dots, n}, the marginal probability Pr({X_i = 1}_{i∈A}, {X_i = 0}_{i∈B}) is given by the evaluation of A with the following inputs:

    X_i = 1, \bar{X}_i = 0  if i ∈ A;
    X_i = 0, \bar{X}_i = 1  if i ∈ B;
    X_i = 1, \bar{X}_i = 1  otherwise.

The class of decomposable PCs subsumes many TPMs as subclasses. Examples include sum-product networks (SPNs) [Poon and Domingos, 2011, Peharz et al., 2019], And-Or graphs [Mateescu et al., 2008], probabilistic sentential decision diagrams (PSDDs) [Kisa et al., 2014], arithmetic circuits [Darwiche, 2009], cutset networks [Rahman and Gogate, 2016] and bounded-treewidth graphical models [Meila and Jordan, 2000] such as Chow-Liu trees [Chow and Liu, 1968] and hidden Markov models [Rabiner and Juang, 1986].

A decomposable PC can always be "smoothed" (i.e., transformed into a smooth and decomposable PC) in polynomial time with respect to its size [Shih et al., 2019]. Hence, when we are trying to show that decomposable PCs can be transformed into equivalent PGCs in polynomial time, we can always assume without loss of generality that decomposable PCs are also smooth. Our first observation is that the probability mass functions represented by smooth and decomposable PCs are very similar to the corresponding generating polynomials:

Proposition 1.
Let A be a smooth and decomposable PC that represents the probability distribution Pr over random variables X_1, \dots, X_n. Then A represents a probability mass polynomial of the form:

    m(X_1, \dots, X_n, \bar{X}_1, \dots, \bar{X}_n) = \sum_{S \subseteq \{1,\dots,n\}} α_S \prod_{i∈S} X_i \prod_{i∉S} \bar{X}_i,    (4)

where α_S = Pr({X_i = 1}_{i∈S}, {X_i = 0}_{i∉S}).

By comparing Equation 4 to Equation 1 in the definition of generating polynomials, we find that they are basically the same, except for the absence of the negative literals \bar{X}_i in Equation 1, which gives us the following corollary:

Corollary 1.
Let A be a smooth and decomposable PC. By replacing each \bar{X}_i in A by 1 and each X_i by z_i, we obtain a PGC that represents the same distribution.

Corollary 1 establishes that PGCs subsume decomposable PCs and, in turn, the TPMs subsumed by decomposable PCs. This raises the question of whether PGCs are strictly more general. In the next section, we give a positive answer by showing that PGCs also subsume determinantal point processes, which cannot be tractably represented by decomposable PCs with univariate leaves [Zhang et al., 2020].
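As an illustration of Corollary 1, the following sketch applies the substitution to the mass polynomial of a small smooth and decomposable PC; the two-variable mixture used here is a hypothetical example of ours, written with SymPy:

    import sympy as sp

    X1, X2, nX1, nX2, z1, z2 = sp.symbols('X1 X2 nX1 nX2 z1 z2')  # nXi: negative literal

    # Mass polynomial of a tiny smooth, decomposable PC: a mixture of two
    # fully factorized distributions over X1, X2.
    m = 0.3 * (0.9*X1 + 0.1*nX1) * (0.2*X2 + 0.8*nX2) \
      + 0.7 * (0.4*X1 + 0.6*nX1) * (0.5*X2 + 0.5*nX2)

    # Corollary 1: replace every negative literal by 1 and every X_i by z_i.
    g = sp.expand(m.subs({nX1: 1, nX2: 1, X1: z1, X2: z2}))
    print(g)  # e.g. the coefficient of z1*z2 equals Pr(X1=1, X2=1)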
3.2 PGCs Subsume DPPs

In this section, we show that determinantal point processes (DPPs) can be tractably represented by PGCs. At a high level, a unique property of DPPs is that they are tractable representations of probability distributions that express global negative dependence, which makes them very useful in many applications [Mariet and Sra, 2016], such as document and video summarization [Chao et al., 2015, Lin and Bilmes, 2012], recommender systems [Zhou et al., 2010], and object retrieval [Affandi et al., 2014]. In machine learning, DPPs are most often represented by means of an
L-ensemble [Borodin and Rains, 2005]: Definition 4.
A probability distribution Pr over n binary random variables X = (X_1, \dots, X_n) is an L-ensemble if there exists a (symmetric) positive semidefinite matrix L ∈ R^{n×n} such that for all x = (x_1, \dots, x_n) ∈ {0, 1}^n,

    Pr(X = x) ∝ det(L_x),    (5)

where L_x = [L_{ij}]_{x_i=1, x_j=1} denotes the submatrix of L indexed by those i, j where x_i = 1 and x_j = 1. The matrix L is called the kernel of the L-ensemble. To ensure that the distribution is properly normalized, it is necessary to divide Equation 5 by det(L + I), where I is the n × n identity matrix [Kulesza and Taskar, 2012].

Consider again the example distribution Pr_β. It is actually a DPP whose kernel is given by the matrix L_β in Figure 1d. The probability of the assignment X = (1, 1, 0), for example, is

    Pr(X = (1, 1, 0)) = det((L_β)_{(1,1,0)}) / det(L_β + I) = 0.04.
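The following minimal NumPy sketch (ours; the random kernel is purely illustrative) implements Equation 5 with its normalization:

    import numpy as np

    def l_ensemble_prob(L, x):
        """Pr(X = x) = det(L_x) / det(L + I) for a binary assignment x."""
        idx = np.flatnonzero(x)
        # The determinant of the empty submatrix (all-zeros assignment) is 1.
        num = np.linalg.det(L[np.ix_(idx, idx)]) if idx.size else 1.0
        return num / np.linalg.det(L + np.eye(L.shape[0]))

    rng = np.random.default_rng(0)
    A = rng.normal(size=(3, 3))
    L = A @ A.T                      # a symmetric PSD kernel
    probs = [l_ensemble_prob(L, [(s >> i) & 1 for i in range(3)]) for s in range(8)]
    print(sum(probs))                # approximately 1: the 2^n probabilities sum to one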
To compute marginal probabilities for L-ensembles, we also need marginal kernels, which characterize DPPs in general, as an alternative to L-ensemble kernels.

Definition 5.
A probability distribution Pr is a DPP over n binary random variables X_1, \dots, X_n if there exists a positive semidefinite matrix K ∈ R^{n×n} such that for all A ⊆ {1, \dots, n},

    Pr({X_i = 1}_{i∈A}) = det(K_A),    (6)

where K_A = [K_{ij}]_{i∈A, j∈A} denotes the submatrix of K indexed by elements in A.

The marginal kernel K_β for the L-ensemble that represents the distribution Pr_β is shown in Figure 1d, along with its kernel L_β. One can use a generalized version of Equation 6 to compute the marginal probabilities Pr((X_i = 1)_{i∈A}, (X_j = 0)_{j∈B}) efficiently, where A, B ⊆ {1, \dots, n}. We refer to Kulesza and Taskar [2012] for further details. (Although not every DPP is an L-ensemble, Kulesza and Taskar [2012] show that DPPs that assign non-zero probability to the empty set, i.e., the all-false assignment, are L-ensembles.)

PCs and DPPs support tractable marginals in strikingly different ways, and we wonder whether these two tractable languages can be captured by a unified framework. Previous works have shown that DPPs cannot be tractably represented by various subclasses of PCs, including SPNs [Martens and Medabalimi, 2014, Zhang et al., 2020]. Here, by showing that PGCs tractably subsume DPPs, we provide a positive answer to this open problem.

The key to constructing a PGC representation for DPPs is that their generating polynomials can be written as determinants over polynomial rings.

Lemma 2 (Borcea et al. [2009]). With Z = diag(z_1, \dots, z_n), the generating polynomial for an L-ensemble with kernel L is given by

    g_L = \frac{1}{\det(L + I)} \det(LZ + I).    (7)

The generating polynomial for a DPP with marginal kernel K is given by

    g_K = \det(I - K + KZ).    (8)

Note that the generating polynomials presented in Lemma 2 are just mathematical objects; to use them as tractable models, we need to represent them in the framework of PGCs. So let us examine Equations (7) and (8) in detail. The entries of the matrices LZ + I and I - K + KZ are degree-one univariate polynomials, which live in a commutative ring; the determinants can therefore be computed by division-free algorithms [Samuelson, 1942, Berkowitz, 1984, Mahajan and Vinay, 1997]. In particular, Bird's algorithm [Bird, 2011] computes the determinant with O(nM(n)) additions and multiplications, where M(n) is the number of basic operations needed for a matrix multiplication. We conservatively assume that M(n) is upper-bounded by n^3. Thus when we encode Bird's algorithm as a PGC, the PGC contains at most O(n^4) sum and product nodes, each with a constant number of edges. Together with Lemma 2, it follows that DPPs are tractably subsumed by PGCs.
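To illustrate Lemma 2, the sketch below expands g_L symbolically with SymPy for a small kernel of our own choosing; an actual PGC would instead encode the determinant computation itself as sum and product nodes via a division-free algorithm, rather than expanding the polynomial:

    import numpy as np
    import sympy as sp

    A = np.array([[1.0, 0.4, 0.0],
                  [0.2, 1.0, 0.1],
                  [0.3, 0.0, 1.0]])
    L = sp.Matrix(A @ A.T)                 # a symmetric PSD kernel
    n = L.shape[0]
    Z = sp.diag(*sp.symbols(f'z1:{n + 1}'))

    # Equation 7: the generating polynomial of the L-ensemble with kernel L.
    g_L = sp.expand((L * Z + sp.eye(n)).det() / (L + sp.eye(n)).det())
    print(g_L)  # multiaffine; each coefficient is the probability of an assignment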
Theorem 2. Any DPP over n binary random variables can be represented by a PGC of size O(n^4).

We conclude this section with the following remarks:

1. In practice, when representing DPPs in the language of PGCs, we do not need to explicitly construct the sum nodes and product nodes to form the circuit structures. Recall from Lemma 1 that marginals are tractable as long as we can efficiently evaluate the PGCs over polynomial rings. Thus we can simply apply Bird's algorithm, for example, to compute the determinant.

2. Since nonsymmetric DPPs [Gartrell et al., 2019] are defined in the same way as standard DPPs, except that their kernel L does not need to be symmetric, they are also tractably subsumed by PGCs.

4 PGCs Are Strictly More Expressive

In the previous section we demonstrated the expressive power of PGCs by showing that they subsume decomposable PCs and DPPs. It is well-known, however, that PCs can use arbitrary families of tractable distributions at their leaves, including DPPs. Thus, our findings in the previous section do not rule out the possibility that PGCs could be represented by such a combination of PCs over standard TPMs such as DPPs. In this section, to show that PGCs contain classes of models that are strictly more expressive, we construct a simple class of PGCs that are not trivially subsumed by such simple combinations of PCs and DPPs.
4.1 Sum, Product, and Hierarchical Composition

We start by defining the sum, product and hierarchical composition operations for PGCs.

Proposition 2. Let A, B ⊂ N^+; denote {z_i}_{i∈A} by z_A and {X_i}_{i∈A} by X_A. Let f(z_A) and g(z_B) be the generating polynomials for distributions Pr_f(X_A) and Pr_g(X_B); then:

(Sum): let α ∈ [0, 1]; then αf + (1 - α)g is the generating polynomial for the probability distribution Pr_sum where

    Pr_sum(X_A = a, X_B = b) = α Pr_f(X_A = a) + (1 - α) Pr_g(X_B = b).

(Product): if A and B are disjoint (i.e., f and g depend on disjoint sets of variables), then fg is the generating polynomial for the probability distribution Pr_prod where

    Pr_prod(X_A = a, X_B = b) = Pr_f(X_A = a) Pr_g(X_B = b).

The sum and product operations described above are basically the same as those for PCs: the sum distribution Pr_sum is just a mixture over the two distributions Pr_f and Pr_g, and the product distribution Pr_prod is the point-wise product of Pr_f and Pr_g. The hierarchical composition is much more interesting.

Figure 2: An example of the hierarchical composition for PGCs. Given n binary random variables, we first partition them into m parts, each with k variables. Then, variables from part i are modeled by the PGC Pr_i with generating polynomial f_i. Let g_L = det(I + L diag(z_1, \dots, z_m)) be the generating polynomial for a DPP with kernel L. Then g_δ is the hierarchical composition of g_L and the f_i s. We refer to this architecture for g_δ as a determinantal PGC.

Proposition 3 (hierarchical composition). Let Pr_g be a probability distribution with generating polynomial g(z_1, \dots, z_n). Let A_1, \dots, A_n be disjoint subsets of N^+ and f_1(z_{A_1}), \dots, f_n(z_{A_n}) be generating polynomials for distributions Pr_1, \dots, Pr_n. We define the hierarchical composition of g and the f_i s by

    g_comp = g |_{z_i = f_i},

which is the generating polynomial obtained by substituting each z_i in g by f_i. In particular, g_comp is a well-defined generating polynomial that represents a valid probability distribution.

Unlike the sum and product operations, the hierarchical composition operation for PGCs does not have an immediate analogue for PCs. This operation is a simple yet powerful way to construct classes of PGCs (a concrete composition is sketched in code below); Figure 2 shows an example, which we illustrate in detail in the next section.
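As promised above, here is a minimal sketch of Proposition 3 in SymPy; the small polynomials g, f1 and f2 are hypothetical examples of ours:

    import sympy as sp

    z1, z2, w1, w2, w3 = sp.symbols('z1 z2 w1 w2 w3')

    g  = 0.5*z1*z2 + 0.2*z1 + 0.2*z2 + 0.1   # generating polynomial over z1, z2
    f1 = 0.6*w1*w2 + 0.4                      # leaf generating polynomial over w1, w2
    f2 = 0.7*w3 + 0.3                         # leaf generating polynomial over w3

    # Proposition 3: substitute each z_i by the leaf polynomial f_i.
    g_comp = sp.expand(g.subs({z1: f1, z2: f2}, simultaneous=True))
    print(g_comp)                             # a valid generating polynomial over w1, w2, w3
    print(g_comp.subs({w1: 1, w2: 1, w3: 1})) # 1.0: coefficients sum to one

Because the leaf variable sets are disjoint, the result is again multiaffine with non-negative coefficients summing to one, as Proposition 3 guarantees.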
4.2 Determinantal PGCs

Now we construct a separating class of PGCs that are not trivially subsumed by simple combinations of PCs and DPPs; Figure 2 gives an outline of its structure. We construct a model of a probability distribution over n random variables X_1, \dots, X_n. For simplicity we assume n = mk and partition the variables into m parts, each with k variables. Changing notation, we write them as {X_11, \dots, X_1k}, \dots, {X_m1, \dots, X_mk}. For 1 ≤ i ≤ m, let Pr_i be a PGC over the random variables X_i1, \dots, X_ik with generating polynomial f_i(z_i1, \dots, z_ik). Let Pr_L be a DPP with kernel L and generating polynomial g_L(z_1, \dots, z_m). Then the generating polynomial g_δ = g_L |_{z_i = f_i}, namely the hierarchical composition of g_L and the f_i s, defines a PGC Pr_δ, which we refer to as a determinantal PGC (DetPGC).

DPPs are great at modeling negative dependencies but cannot represent positive dependencies between variables: for the DPP Pr_L, Pr_L(X_i = 1, X_j = 1) > Pr_L(X_i = 1) Pr_L(X_j = 1) can never happen. Our construction of DetPGCs aims to equip a DPP model to capture local positive dependencies. To understand how DetPGCs actually behave, we compute the marginal probability Pr_δ(X_ik = 1, X_jl = 1) by Lemma 1.

When X_ik and X_jl belong to the same group (i.e., i = j):

    Pr_δ(X_ik = 1, X_il = 1) = Pr_L(X_i = 1) Pr_i(X_ik = 1, X_il = 1);

that is, when two variables belong to the same group, the dependencies between them are dominated by the leaf PGC Pr_i, giving space for positive dependencies.

When X_ik and X_jl belong to different groups (i.e., i ≠ j):

    Pr_δ(X_ik = 1, X_jl = 1) ≤ Pr_δ(X_ik = 1) Pr_δ(X_jl = 1);

that is, random variables from different clusters are still negatively dependent, just like variables in the DPP Pr_L.

We stress that we construct DetPGCs merely to illustrate how the flexibility of PGCs permits us to develop TPMs that capture structure that is beyond the reach of the standard suite of TPMs, including PCs, DPPs, and standard combinations thereof. We are not proposing DetPGCs as an "optimal" PGC structure for probabilistic modeling. Nevertheless, as we will see, even the simple DetPGC model may be a better model than PCs or DPPs for some kinds of real-world data.

5 Experiments

This section evaluates PGCs' ability to model real data on density estimation benchmarks. We use a weighted sum over DetPGCs as our model. This simple method achieves state-of-the-art performance on half the benchmarks, illustrating the potential of PGCs in real-world applications.
We evaluate PGCs on two density estimation benchmarks:

1. Twenty Datasets [Van Haaren and Davis, 2012], which contains 20 real-world datasets ranging from retail to biology. These datasets have been used to evaluate various tractable probabilistic models [Liang et al., 2017, Dang et al., 2020, Peharz et al., 2020].

2. Amazon Baby Registries, which contains 15 datasets, each representing a collection of registries or "baskets" of baby products from a specific category such as "apparel" and "bath". (The original benchmark had 17 datasets; we omit the two datasets with fewer than 10 variables: decor and pottytrain.) We randomly split each dataset into train (70%), valid (10%) and test (20%) sets. This benchmark has been commonly used to evaluate DPP learners [Gillenwater et al., 2014, Mariet and Sra, 2015, Gartrell et al., 2019].

The model we use in our experiments is a weighted sum over DetPGCs, the example constructed in Section 4.2, which we refer to as a
SimplePGC. Recall from Figure 2 that a DetPGC is the hierarchical composition of a DPP and some "leaf" PGCs Pr_1, \dots, Pr_m. For SimplePGC, we make the simplest choice by setting the Pr_i s to be fully general PGCs. In addition, to partition the input variables into m groups as shown in Figure 2, we use a simple greedy algorithm that aims at putting pairs of positively dependent variables into the same groups. The structure of a SimplePGC is also governed by two hyperparameters: the number of DetPGCs in the weighted sum (denoted by C) and the maximum number of variables (i.e., k in Figure 2) allowed in each group (denoted by K). We tune C and K by a grid search over four candidate values for K and five for C. Note that our model reduces to a mixture over DPPs when K = 1.

We implement SimplePGC in PyTorch and learn the parameters by maximum likelihood estimation (MLE). In particular, we use Adam [Kingma and Ba, 2014] as the optimizing algorithm to minimize the negative log-likelihoods of the training sets. Regularization is done by setting the weight decay parameter in Adam. For further details regarding the construction and implementation of SimplePGC, please see the Appendix.

(a) Results on the Twenty Datasets benchmark. (b) Results on the Amazon Baby Registries benchmark.

Figure 3: Experiment results on the Twenty Datasets and the Amazon Baby Registries benchmarks, comparing the performance of DPP, Strudel, EiNet and SimplePGC in terms of average log-likelihood. Bold numbers indicate the best log-likelihood. For SimplePGC, the annotations ∗ resp. † mean better log-likelihood compared to Strudel, resp. EiNet.

We compare SimplePGC against three baselines: DPPs, Strudel, and Einsum Networks.
DPP: As mentioned, SimplePGC reduces to a mixture over DPPs when the hyperparameter K = 1. DPPs are included as a sanity check; in particular, we expect PGCs to beat DPPs on most datasets and to be at least as good on all datasets.

Strudel: Strudel [Dang et al., 2020] is an algorithm for learning structured-decomposable PCs. We include it as one of the state-of-the-art density estimators.
Einsum Networks: Einsum Networks [Peharz et al., 2020] (EiNets) are a deep-learning-style implementation design for PCs. Compared to Strudel, EiNets are more closely related to SimplePGC in the sense that both are fixed-structure models where only parameters are learned.
Figure 3 shows the experiment results. As a sanity check, we first compare SimplePGC against DPPs. On both benchmarks, SimplePGC performs significantly better than DPPs on almost every dataset, except for 4 datasets from the Amazon Baby Registries benchmark, where SimplePGC performs at least as well as DPPs. Overall, SimplePGC achieves competitive performance when compared against Strudel and EiNet on both benchmarks. On the Twenty Datasets benchmark, SimplePGC obtains better average log-likelihood than either Strudel or EiNet on 18 out of the 20 datasets and, in particular, SimplePGC obtains higher log-likelihood than both of them on 3 datasets. Such results are remarkable, given the fact that SimplePGC is just a simple hand-crafted PGC architecture with little fine-tuning, while Strudel and EiNets follow from a long line of research aiming to perform well on exactly the Twenty Datasets benchmark. The performance of SimplePGC on the Amazon Baby Registries benchmark is even more impressive: SimplePGC beats both Strudel and EiNet on 11 out of 15 datasets and beats at least one of them on 13 datasets. We also conducted one-sample t-tests on the results; for further details please refer to the Appendix.
6 PGCs and Strongly Rayleigh Distributions

At a high level, the study of PCs and graphical models mainly focuses on constructing classes of models that guarantee tractable exact inference. A separate line of research in probabilistic machine learning, however, aims at identifying classes of distributions that support tractable sampling, where generating polynomials play an essential role. For example, a well-studied class of distributions are the strongly Rayleigh (SR) distributions [Borcea et al., 2009, Li et al., 2016], which were first defined in the field of probability theory for studying negative dependence:
Definition 6.
A polynomial f ∈ R[z_1, \dots, z_n] is real stable if f(z_1, \dots, z_n) ≠ 0 whenever Im(z_i) > 0 for all 1 ≤ i ≤ n. We say that a distribution over X_1, \dots, X_n is strongly Rayleigh (SR) if its generating polynomial is real stable.

SR distributions contain many important subclasses, such as DPPs and the spanning tree (forest) distributions, which have various applications. From Section 3.2, we already know that PGCs can compactly represent DPPs. We now show that PGCs can also represent spanning tree distributions.

Let G = (V, E) be a connected graph with vertex set V = {1, \dots, n} and edge set E. Associate to each edge e ∈ E a variable z_e and a weight w_e ∈ R_{≥0}. If e = {i, j}, let A_e be the n × n matrix where A_ii = A_jj = 1, A_ij = A_ji = -1, and all other entries are 0. The weighted Laplacian of G is given by L(G) = \sum_{e∈E} w_e z_e A_e, and by the Principal Minors Matrix-Tree Theorem [Chaiken and Kleitman, 1978],

    f_G = \det(L(G)_{\setminus\{i\}}) = \sum_{T \text{ a spanning tree of } G} w_{edges(T)} z_{edges(T)}

is the (un-normalized) generating polynomial for the distribution supported on the spanning trees of G (denote it by Pr_G); here L(G)_{\setminus\{i\}} denotes the principal minor of L(G) obtained by removing its i-th row and column, and w_{edges(T)} and z_{edges(T)} denote the products of the weights, respectively the variables, of the edges of T.

As presented in the equation above, the probability of each spanning tree is proportional to the product of its edge weights. Pr_G is a strongly Rayleigh distribution [Borcea et al., 2009], and, to the best of our knowledge, it is not a DPP unless the edge weights are all the same. By the same argument as in Section 3.2, we claim that Pr_G can be tractably represented by PGCs.

Thus, we see that there is another natural class of SR distributions, the spanning tree distributions, that can be represented by PGCs. More generally, generating polynomials play a key role in the study of a number of other classes of distributions, including the Ising model [Jerrum and Sinclair, 1993], exponentiated strongly Rayleigh (ESR) distributions [Mariet et al., 2018] and strongly log-concave (SLC) distributions [Robinson et al., 2019]. Specifically, most of these distributions are naturally characterized by their generating polynomials rather than by their probability mass functions. This poses a major barrier to linking them to other probabilistic models. Thus, by showing that PGCs can tractably represent certain subclasses of SR distributions, we present PGCs as a prospective avenue for bridging this gap.

Although we conjecture that not all SR distributions can be represented by polynomial-size PGCs, we believe that the subclasses of the above distributions that have concise parameterizations should be representable by PGCs. Establishing this for various families is a direction for future work.
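As a quick illustration, the following SymPy sketch (the triangle graph and its setup are our own example) builds the weighted Laplacian of K_3 and recovers one term per spanning tree:

    import sympy as sp

    # Triangle graph K3 with edges e1={1,2}, e2={1,3}, e3={2,3}.
    w = sp.symbols('w1 w2 w3')
    z = sp.symbols('z1 z2 z3')
    edges = [(0, 1), (0, 2), (1, 2)]

    Lap = sp.zeros(3, 3)
    for (i, j), we, ze in zip(edges, w, z):
        Lap[i, i] += we * ze
        Lap[j, j] += we * ze
        Lap[i, j] -= we * ze
        Lap[j, i] -= we * ze

    # Principal minor: delete row and column 0, then take the determinant.
    f_G = sp.expand(Lap[1:, 1:].det())
    print(f_G)  # w1*w2*z1*z2 + w1*w3*z1*z3 + w2*w3*z2*z3: one term per spanning tree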
7 Conclusion

We conclude by summarizing our contributions and highlighting future research directions. In this paper, we studied the use of probability generating polynomials as a data structure for representing probability distributions. We showed that their representation as circuits is a TPM, and strictly subsumes existing families of TPMs. Indeed, even a simple example family of distributions that can be represented by PGCs but not by PCs or DPPs obtains state-of-the-art performance as a probabilistic model on some datasets.

To facilitate the general use of PGCs for probabilistic modeling, a fascinating direction for future work is to build efficient structure learning or "architecture search" algorithms for PGCs. Theoretically, the main mathematical advantage of generating polynomials is the variety of properties they reveal about a distribution. This raises the question of whether there are other kinds of useful queries we can support efficiently with PGCs, and of where the boundary of tractable probabilistic inference truly lies.
References
Raja Hafiz Affandi, Emily Fox, Ryan Adams, and Ben Taskar. Learning the parameters of determinantal point process kernels. In International Conference on Machine Learning, pages 1224-1232, 2014.
Stuart J Berkowitz. On computing the determinant in small parallel time using a small number of processors. Information Processing Letters, 18(3):147-150, 1984.
Richard S Bird. A simple division-free algorithm for computing determinants. Information Processing Letters, 111(21-22):1072-1074, 2011.
Julius Borcea, Petter Brändén, and Thomas Liggett. Negative dependence and the geometry of polynomials. Journal of the American Mathematical Society, 22(2):521-567, 2009.
Alexei Borodin and Eric M Rains. Eynard-Mehta theorem, Schur process, and their Pfaffian analogs. Journal of Statistical Physics, 121(3-4):291-317, 2005.
David G Cantor and Erich Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Informatica, 28(7):693-701, 1991.
Seth Chaiken and Daniel J Kleitman. Matrix tree theorems. Journal of Combinatorial Theory, Series A, 24(3):377-381, 1978.
Wei-Lun Chao, Boqing Gong, Kristen Grauman, and Fei Sha. Large-margin determinantal point processes. In UAI, pages 191-200, 2015.
YooJung Choi, Antonio Vergari, and Guy Van den Broeck. Probabilistic circuits: A unifying framework for tractable probabilistic modeling. 2020.
C Chow and Cong Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462-467, 1968.
Meihua Dang, Antonio Vergari, and Guy Van den Broeck. Strudel: Learning structured-decomposable probabilistic circuits. arXiv e-prints, pages arXiv-2007, 2020.
Adnan Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.
Adnan Darwiche and Pierre Marquis. A knowledge compilation map. Journal of Artificial Intelligence Research, 17:229-264, 2002.
Mike Gartrell, Victor-Emmanuel Brunel, Elvis Dohmatob, and Syrine Krichene. Learning nonsymmetric determinantal point processes. arXiv preprint arXiv:1905.12962, 2019.
Jennifer Gillenwater, Alex Kulesza, Emily Fox, and Ben Taskar. Expectation-maximization for learning determinantal point processes. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, pages 3149-3157, 2014.
Mark Jerrum and Alistair Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 22(5):1087-1116, 1993.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Doga Kisa, Guy Van den Broeck, Arthur Choi, and Adnan Darwiche. Probabilistic sentential decision diagrams. In Fourteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2014.
Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123-286, 2012. ISSN 1935-8237. doi: 10.1561/2200000044.
Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Fast mixing Markov chains for strongly Rayleigh measures, DPPs, and constrained sampling. arXiv preprint arXiv:1608.01008, 2016.
Yitao Liang, Jessa Bekker, and Guy Van den Broeck. Learning the structure of probabilistic sentential decision diagrams. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI), 2017.
Hui Lin and Jeff A Bilmes. Learning mixtures of submodular shells with application to document summarization. arXiv preprint arXiv:1210.4871, 2012.
Meena Mahajan and V Vinay. A combinatorial algorithm for the determinant. In SODA, pages 730-738, 1997.
Zelda Mariet and Suvrit Sra. Fixed-point algorithms for learning determinantal point processes. In International Conference on Machine Learning, pages 2389-2397. PMLR, 2015.
Zelda Mariet, Suvrit Sra, and Stefanie Jegelka. Exponentiated strongly Rayleigh distributions. Advances in Neural Information Processing Systems, 2018.
Zelda E Mariet and Suvrit Sra. Kronecker determinantal point processes. In Advances in Neural Information Processing Systems, pages 2694-2702, 2016.
James Martens and Venkatesh Medabalimi. On the expressive efficiency of sum product networks. arXiv preprint arXiv:1411.7717, 2014.
Robert Mateescu, Rina Dechter, and Radu Marinescu. AND/OR multi-valued decision diagrams (AOMDDs) for graphical models. Journal of Artificial Intelligence Research, 33:465-519, 2008.
Marina Meila and Michael I Jordan. Learning with mixtures of trees. Journal of Machine Learning Research, 1(Oct):1-48, 2000.
Robert Peharz, Antonio Vergari, Karl Stelzner, Alejandro Molina, Xiaoting Shao, Martin Trapp, Kristian Kersting, and Zoubin Ghahramani. Random sum-product networks: A simple but effective approach to probabilistic deep learning. In Proceedings of UAI, 2019.
Robert Peharz, Steven Lang, Antonio Vergari, Karl Stelzner, Alejandro Molina, Martin Trapp, Guy Van den Broeck, Kristian Kersting, and Zoubin Ghahramani. Einsum networks: Fast and scalable learning of tractable probabilistic circuits. In International Conference on Machine Learning, pages 7563-7574. PMLR, 2020.
Hoifung Poon and Pedro Domingos. Sum-product networks: A new deep architecture. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 689-690. IEEE, 2011.
Lawrence Rabiner and Biing-Hwang Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4-16, 1986.
Tahrima Rahman and Vibhav Gogate. Learning ensembles of cutset networks. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
Joshua Robinson, Suvrit Sra, and Stefanie Jegelka. Flexible modeling of diversity with strongly log-concave distributions. arXiv preprint arXiv:1906.05413, 2019.
Dan Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82(1-2):273-302, 1996.
Paul A Samuelson. A method of determining explicitly the coefficients of the characteristic equation. The Annals of Mathematical Statistics, 13(4):424-429, 1942.
Arnold Schönhage and Volker Strassen. Schnelle Multiplikation grosser Zahlen. Computing, 7(3):281-292, 1971.
Andy Shih, Guy Van den Broeck, Paul Beame, and Antoine Amarilli. Smoothing structured decomposable circuits. In Advances in Neural Information Processing Systems, pages 11412-11422, 2019.
Jan Van Haaren and Jesse Davis. Markov network structure learning: A randomized feature generation approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, 2012.
Antonio Vergari, YooJung Choi, Robert Peharz, and Guy Van den Broeck. Probabilistic circuits: Representations, inference, learning and applications. Tutorial at the 34th AAAI Conference on Artificial Intelligence, 2020.
Honghua Zhang, Steven Holtzen, and Guy Van den Broeck. On the relationship between probabilistic circuits and determinantal point processes. In Ryan P. Adams and Vibhav Gogate, editors, Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI 2020, virtual online, August 3-6, 2020, volume 124 of Proceedings of Machine Learning Research, pages 1188-1197. AUAI Press, 2020.
Tao Zhou, Zoltán Kuscsik, Jian-Guo Liu, Matúš Medo, Joseph Rushton Wakeling, and Yi-Cheng Zhang. Solving the apparent diversity-accuracy dilemma of recommender systems. Proceedings of the National Academy of Sciences, 107(10):4511-4515, 2010.
A Proofs
Proof for Lemma 1.
We write the generating polynomial for Pr as g(z_1, \dots, z_n) = \sum_{S ⊆ [n]} α_S z^S; then,

    Pr({X_i = 1}_{i∈A}, {X_i = 0}_{i∈B}) = \sum_{A ⊆ S, B ∩ S = ∅} α_S.

Besides, we also have

    coef_{|A|}( g |_asgn ) = \sum_S α_S coef_{|A|}( z^S |_asgn ),

given the assignment asgn := {{z_i = t}_{i∈A}, {z_i = 0}_{i∈B}, {z_i = 1}_{i∉A∪B}}; that is, to prove the Lemma, we only need to show that coef_{|A|}(z^S |_asgn) = 1 for S ⊆ [n] where A ⊆ S and B ∩ S = ∅, and coef_{|A|}(z^S |_asgn) = 0 otherwise.

Case 1. Assume A ⊆ S and B ∩ S = ∅; then z^S |_asgn = t^{|A|}; hence coef_{|A|}(z^S |_asgn) = 1.

Case 2. Assume A ⊄ S or B ∩ S ≠ ∅. If B ∩ S ≠ ∅, then z^S |_asgn = 0; done. Now we assume A ⊄ S. In this case, z^S |_asgn = t^{|S ∩ A|}. It follows from S ∩ A ⊊ A that |S ∩ A| < |A|, which implies that coef_{|A|}(z^S |_asgn) = 0.

Proof for Proposition 1.
Let A be a decomposable and smooth PC. Without loss of generality we assume A is normalized. For each node u in A, we define I_u = {i : X_i or \bar{X}_i ∈ scope(u)} and denote the polynomial that u represents by m_u. We first prove the following intermediate result by a bottom-up induction on A:

    m_u = \sum_{S ⊆ I_u} α_S \prod_{i∈S} X_i \prod_{i∈I_u \setminus S} \bar{X}_i,

where the α_S are some non-negative numbers depending on the node u.

Case 1. If u is a leaf node X_i or \bar{X}_i, then m_u is X_i or \bar{X}_i; done.

Case 2. If u is a sum node with children {v_i}_{1≤i≤k} and weights {w_i}_{1≤i≤k}, then m_u = \sum_{1≤i≤k} w_i m_{v_i}. By smoothness, I_{v_i} = I_u for all i. Then, by the induction hypothesis:

    m_u = \sum_{1≤i≤k} w_i \sum_{S ⊆ I_{v_i}} α^i_S \prod_{j∈S} X_j \prod_{j∉S} \bar{X}_j
        = \sum_{S ⊆ I_u} ( \sum_{1≤i≤k} w_i α^i_S ) \prod_{j∈S} X_j \prod_{j∉S} \bar{X}_j.

Case 3. If u is a product node with children {v_i}_{1≤i≤k}, then, by decomposability, I_{v_1}, \dots, I_{v_k} are pairwise disjoint; in particular, each S ⊆ I_u can be uniquely decomposed into S_1 ⊆ I_{v_1}, \dots, S_k ⊆ I_{v_k}. Thus,

    m_u = \prod_{1≤i≤k} m_{v_i}
        = \prod_{1≤i≤k} \sum_{S_i ⊆ I_{v_i}} α^i_{S_i} \prod_{j∈S_i} X_j \prod_{j∉S_i} \bar{X}_j
        = \sum_{S ⊆ I_u} \prod_{1≤i≤k} α^i_{S_i} \prod_{j∈S} X_j \prod_{j∉S} \bar{X}_j,

with I_u = I_{v_1} ∪ ... ∪ I_{v_k} a disjoint union.

Hence the mass polynomial represented by A is given by:

    m(X_1, \dots, X_n, \bar{X}_1, \dots, \bar{X}_n) = \sum_{S ⊆ {1,\dots,n}} α_S \prod_{i∈S} X_i \prod_{i∉S} \bar{X}_i.

By plugging in {X_i = 1, \bar{X}_i = 0}_{i∈S} and {X_i = 0, \bar{X}_i = 1}_{i∉S}, it immediately follows that α_S = Pr({X_i = 1}_{i∈S}, {X_i = 0}_{i∉S}). (Here we correct the typo m(z_1, \dots, z_n) in the statement of Proposition 1.)

Proof for Proposition 2.
Let A, B ⊂ N^+; let f = \sum_{S ⊆ A} β_S z^S and g = \sum_{S ⊆ B} γ_S z^S be the normalized probability generating polynomials for distributions Pr_f(X_A) and Pr_g(X_B), respectively.

Case 1 (Sum). First, we view f and g as polynomials over {z_i}_{i∈A∪B} by setting β_S = 0 for all S ⊄ A, and γ_S = 0 for all S ⊄ B, which is equivalent, from the perspective of probability distributions, to viewing Pr_f and Pr_g as distributions over X_{A∪B} such that

    Pr_f(X_A = a, X_B = b) = Pr_f(X_A = a) if b_i = 0 for all i ∈ (A∪B) \setminus A, and 0 otherwise,

and

    Pr_g(X_A = a, X_B = b) = Pr_g(X_B = b) if a_i = 0 for all i ∈ (A∪B) \setminus B, and 0 otherwise.

Then,

    αf + (1 - α)g = α \sum_{S ⊆ A} β_S z^S + (1 - α) \sum_{S ⊆ B} γ_S z^S = \sum_{S ⊆ A∪B} (αβ_S + (1 - α)γ_S) z^S,

where the coefficients αβ_S + (1 - α)γ_S are clearly non-negative, and \sum_{S ⊆ A∪B} (αβ_S + (1 - α)γ_S) = α \sum_{S ⊆ A∪B} β_S + (1 - α) \sum_{S ⊆ A∪B} γ_S = α + (1 - α) = 1. That is, αf + (1 - α)g is a valid probability generating polynomial for a distribution, Pr_sum.

For assignments X_A = a and X_B = b with no conflict (A and B are not necessarily disjoint), let S = {i ∈ A : a_i = 1} ∪ {i ∈ B : b_i = 1}. By definition, Pr_sum(X_A = a, X_B = b) is given by the coefficient of the term z^S, which is

    αβ_S + (1 - α)γ_S = α Pr_f(X_A = a, X_B = b) + (1 - α) Pr_g(X_A = a, X_B = b) = α Pr_f(X_A = a) + (1 - α) Pr_g(X_B = b)

for short.

Case 2 (Product). We assume A ∩ B = ∅. Then,

    fg = ( \sum_{S ⊆ A} β_S z^S )( \sum_{T ⊆ B} γ_T z^T ) = \sum_{S ⊆ A, T ⊆ B} β_S γ_T z^S z^T.

As A and B are disjoint, the terms z^S z^T are multiaffine. On top of that, β_S γ_T ≥ 0 and \sum_{S ⊆ A, T ⊆ B} β_S γ_T = ( \sum_{S ⊆ A} β_S )( \sum_{T ⊆ B} γ_T ) = 1. Thus, fg is a valid probability generating polynomial for a distribution, Pr_prod.

For assignments X_A = a, X_B = b, we set S_a = {i ∈ A : a_i = 1} and S_b = {i ∈ B : b_i = 1}. Then, by definition, Pr_prod(X_A = a, X_B = b) is given by the coefficient of the term z^{S_a ∪ S_b}, which is β_{S_a} γ_{S_b} = Pr_f(X_A = a) Pr_g(X_B = b).

Proof for Proposition 3. Let g(z_1, \dots, z_n) be a normalized probability generating polynomial. Let A_1, \dots, A_n be disjoint subsets of N^+ and f_1(z_{A_1}), \dots, f_n(z_{A_n}) be normalized generating polynomials. Write g = \sum_{S ⊆ {1,\dots,n}} α_S z^S. Then,

    g |_{z_i = f_i} = \sum_{S ⊆ {1,\dots,n}} α_S \prod_{i∈S} f_i.

It follows from Proposition 2 (product operation) that the \prod_{i∈S} f_i are valid generating polynomials for S ⊆ {1, \dots, n}; again by Proposition 2 (sum operation), g |_{z_i = f_i} = \sum_S α_S \prod_{i∈S} f_i is a valid generating polynomial.

B Experiments
B.1 The Construction of SimplePGC
A SimplePGC is a weighted sum over several DetPGCs, which are defined in Section 4.2. The structure of a SimplePGC is governed by two hyperparameters: the number of DetPGCs in the weighted sum (denoted by C), and the maximum number of variables (i.e., k in Figure 2) in the leaf distributions of the DetPGCs (denoted by K).

Partitioning Variables
To construct a SimplePGC, we first partition the variables X_1, \dots, X_n into several groups. The idea is, as shown in Section 4.2, that for a DetPGC, variables from different groups have to be negatively dependent, so we want to put pairs of variables that are positively dependent into the same group. Given some training examples D_1, \dots, D_l, we estimate the probabilities Pr(X_i = 1) and Pr(X_i = 1, X_j = 1) by counting; in particular, we set Pr(event) = |{i : event is true in D_i}| / l. Then, inspired by the definition of pairwise mutual information, when Pr(X_i = 1, X_j = 1) > 0, we use the quantity

    w_ij = Pr(X_i = 1, X_j = 1) log [ Pr(X_i = 1, X_j = 1) / ( Pr(X_i = 1) Pr(X_j = 1) ) ]    (9)

to measure the degree of positive dependence between X_i and X_j. Note that X_i and X_j are positively dependent if w_ij > 0. Then we partition the variables into groups by a greedy algorithm (sketched in code below; the formal listing is Algorithm 1).
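Concretely, one plausible version of the greedy scheme is sketched below; this is our illustration, in which pairs are merged in decreasing order of w_ij subject to the size cap K, and the exact merging order and tie-breaking of Algorithm 1 may differ:

    import numpy as np

    def partition_variables(n, w, K):
        """Greedily group variables, merging the most positively dependent
        pairs first, subject to a maximum group size K. w is an (n, n)
        symmetric matrix of the weights w_ij from Equation 9; entries <= 0
        are ignored."""
        groups = {i: {i} for i in range(n)}      # each variable starts alone
        owner = list(range(n))                   # owner[i]: group id of variable i
        pairs = [(w[i, j], i, j)
                 for i in range(n) for j in range(i + 1, n) if w[i, j] > 0]
        for _, i, j in sorted(pairs, reverse=True):
            gi, gj = owner[i], owner[j]
            if gi != gj and len(groups[gi]) + len(groups[gj]) <= K:
                groups[gi] |= groups[gj]         # merge the two groups
                for v in groups[gj]:
                    owner[v] = gi
                del groups[gj]
        return list(groups.values())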
Algorithm 1 Partition Variables
Input: variables {X_i}_{1≤i≤n}, weights {w_ij}_{1≤i<j≤n}

(a) One-sided t-test results on the Twenty Datasets benchmark (SimplePGC vs. Strudel / vs. EiNet):

nltcs     =  =      msnbc     <  >      kdd       =  >      plants    <  =
audio     >  =      jester    >  <      netflix   >  <      accidents <  >
retail    =  =      pumsb     <  >      dna       >  >      kosarek   =  =
msweb     <  =      book      =  =      movie     =  =      webkb     =  =
reuters   =  =      20ng      =  =      bbc       =  =      ad        <  >

(b) One-sided t-test results on the Amazon Baby Registries benchmark (SimplePGC vs. Strudel / vs. EiNet):

apparel   >  =      bath      =  >      bedding   =  =      carseats  >  =
diaper    >  =      feeding   >  =      furniture =  =      gear      =  >
gifts     =  =      health    >  >      media     =  =      moms      =  =
safety    >  >      strollers =  =      toys      =  >

Figure 4: Results for one-sample two-sided t-tests on the two benchmarks with p = 0.05.