Multivariate binary probability distribution in the Grassmann formalism
Takashi Arai ∗ Faculty of Science, Yamagata University, Yamagata 990-8560, Japan
Abstract
We propose a probability distribution for multivariate binary random variables. For this purpose, we use the Grassmann number, an anti-commuting number. In our model, the partition function, the central moments, and the marginal and conditional distributions are expressed analytically by a matrix of parameters analogous to the covariance matrix in the multivariate Gaussian distribution. That is, summation over all possible states is not necessary to obtain the partition function and various expected values, which is a problem with the conventional multivariate Bernoulli distribution. The proposed model has many similarities to the multivariate Gaussian distribution. For example, the marginal and conditional distributions are expressed by the parameter matrix and its inverse matrix, respectively; that is, the inverse matrix expresses a sort of partial correlation. The analytical expressions for the marginal and conditional distributions are also useful for generating random numbers for multivariate binary variables. We validate the proposed method using synthetic datasets and observe that the sampling distributions of various statistics are consistent with the theoretical predictions and that the estimates are consistent and asymptotically normal.

∗ [email protected]

I. INTRODUCTION

The multivariate binary probability distribution is a model for multivariate binary random variables. It is used in many applications, such as modeling the behavior of magnets in statistical physics [1], building statistical models in computer vision [2], and social network analysis. In the terminology of graphical models, the multivariate binary probability distribution is a kind of Markov random field. The model is essentially the same as the Ising model in statistical physics and is also called the Boltzmann machine in machine learning research [3]. Recent applications of this model include detecting statistical dependence in voting patterns from senate voting records [4] and the study of cooperative mutations in the Human Immunodeficiency Virus (HIV) [5].

Existing methods for the multivariate binary probability distribution encode a binary variable as a dummy variable that takes discrete values in {0, 1} or {−1, 1}, which is called the multivariate Bernoulli distribution or the Ising model. However, such discrete coding prevents analytical calculations. For example, the marginal and conditional distributions are no longer of the same form as the original joint distribution. Furthermore, a problem also arises from the viewpoint of computational complexity. In the existing method of the multivariate binary probability distribution, we have to sum over all possible states to calculate the partition function and various expected values; in a binary system, however, the number of possible states increases exponentially with the number of variables. In other words, the computation of the partition function and expected values is NP-hard, which causes difficulties with parameter estimation. In fact, the maximum likelihood estimation of model parameters by a gradient-based method requires the calculation of various expected values, so the usual estimation procedure becomes difficult when the number of variables is large. One way of dealing with parameter estimation is to approximate the expected values by Gibbs sampling, a Markov chain Monte Carlo simulation, but this method is computationally demanding and time-consuming.
Another way is to approximate the likelihood function by a more tractable functional form; examples are variational inference [6] and the pseudo-likelihood and composite likelihood methods [5, 7, 8], where methods for estimating the sparse structure of a graph are proposed through the use of L1 and non-concave regularizations. However, the multivariate Bernoulli distribution has not been as widely used in applications as the multivariate Gaussian distribution, whose partition function can be computed analytically and which is widely used in various fields such as natural language processing [9], image analysis [2, 10, 11] and spatial statistics [12].

In this paper, we propose a probability distribution that models multivariate binary variables. The difficulty with the existing method of the multivariate binary probability distribution stems from the fact that binary variables are represented by discrete dummy variables, which complicates analytical calculations due to the summation over states. Using the Grassmann variable, an anti-commuting number, we rewrite the summation over all possible states as an integral over Grassmann variables. The resulting model resolves the problem in the conventional multivariate Bernoulli distribution that the summation over states cannot be carried out analytically.

This paper is organized as follows. In Sec. II, we propose a method for the multivariate binary probability distribution using Grassmann variables. Various properties of the distribution, such as the marginal and conditional distributions, correlatedness, and statistical independence, are investigated. In Sec. III, we validate the method numerically using synthetic datasets of correlated binary variables. The consistency and asymptotic normality of the sampling distributions of various statistics and estimates are investigated. Sec. IV is devoted to conclusions. We use standard notation from matrix theory: for a p × p matrix A, A[η] denotes the principal submatrix of A with rows and columns in a subset η ⊆ {1, 2, ..., p}.

II. PROPOSED METHOD
The multivariate Bernoulli distribution is a probability distribution for binary random variables, where the p-dimensional binary variables are encoded by dummy variables taking x_i ∈ {0, 1}, i ∈ P ≡ {1, 2, ..., p}. Usually, the joint distribution of the multivariate Bernoulli distribution is expressed as the exponential of a quadratic form in the dummy variables [6, 13],

p(x_P) = \frac{1}{Z} \exp\Big\{ \sum_{i=1}^{p} b_i x_i + \sum_{i,j=1}^{p} x_i w_{ij} x_j \Big\},   (1)

where b_i and w_{ij} are called the bias and weight terms, and the exponent is called the energy function. Z is the partition function that ensures that the distribution sums to one. In the conventional multivariate Bernoulli distribution, various quantities such as the partition function and expected values are computed by summation over all possible states. For example, the expected value of the random variable x_i is expressed as the summation

E[x_i] = \sum_{x_P \in \{0,1\}^p} x_i\, p(x_P) = \Big[ \sum_{x_1 \in \{0,1\}} \sum_{x_2 \in \{0,1\}} \cdots \sum_{x_p \in \{0,1\}} \Big] x_i\, p(x_1, x_2, \ldots, x_p).   (2)

However, for multivariate binary variables, the number of possible states increases exponentially with the dimension p, so performing the summation over states becomes difficult even numerically. Furthermore, the conventional multivariate Bernoulli distribution has the difficulty that the marginal and conditional distributions do not follow a multivariate Bernoulli distribution. In fact, the marginal distribution p(x_R) for the indices R = P \ M, in which the variables with indices M ⊆ P are marginalized out,

p(x_R) = \sum_{x_M} p(x_R, x_M),   (3)

is no longer of the same form as the original expression, Eq. (1). It is therefore difficult to interpret the model parameters, whereas in the multivariate Gaussian distribution the covariance matrix and its inverse can be interpreted as indirect and direct correlations. We try to resolve these difficulties by introducing the Grassmann number, an anti-commuting number. We express the system by a pair of Grassmann variables θ, θ̄ instead of the dummy variable x. By expressing the summation over states as an integral over the Grassmann variables, we expect that the partition function and various expected values can be expressed analytically.
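To make the computational bottleneck concrete, the following illustrative sketch (not part of the original text; the helper name brute_force_moments and the toy values of b and W are ours) evaluates the partition function and the means of Eq. (1) by brute-force enumeration. The loop visits all 2^p states, which is exactly the exponential cost discussed above.

```python
import itertools
import numpy as np

def brute_force_moments(b, W):
    """Partition function and means of the Bernoulli model (1) by enumerating all 2**p states."""
    p = len(b)
    Z = 0.0
    mean = np.zeros(p)
    for state in itertools.product([0, 1], repeat=p):   # 2**p terms
        x = np.array(state, dtype=float)
        w = np.exp(b @ x + x @ W @ x)                    # unnormalized weight
        Z += w
        mean += w * x
    return Z, mean / Z

# toy example with p = 3 (illustrative numbers)
b = np.array([0.1, -0.2, 0.3])
W = np.array([[0.0, 0.5, -0.3],
              [0.5, 0.0, 0.2],
              [-0.3, 0.2, 0.0]])
Z, mean = brute_force_moments(b, W)
print(Z, mean)   # feasible only for small p
```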
A. Univariate binary probability distribution

First, we explain our idea with the simplest example, the univariate binary probability distribution. In the conventional Bernoulli distribution, the normalization condition of the probability distribution and the expected values of the random variable are computed by summation over all possible states of the dummy variable x ∈ {0, 1}:

\sum_{x} p(x) = 1, \qquad E[x] = \sum_{x} x\, p(x) = p(x = 1).   (4)

In our formalism, we instead introduce a pair of Grassmann variables θ and θ̄ [14], anti-commuting numbers, corresponding to the dummy variable. These variables obey the following anti-commutation relations:

\{\theta, \bar\theta\} \equiv \theta\bar\theta + \bar\theta\theta = 0, \qquad \theta^2 = \bar\theta^2 = 0.   (5)

Then we assume that, instead of the summation described above, expected values can be obtained by integration of the Grassmann function defined by

e^{H(\bar\theta,\theta)} = \exp\big\{ \bar\theta \Lambda \theta - \log Z(\Lambda) \big\} = \frac{1}{Z(\Lambda)} e^{\bar\theta \Lambda \theta},   (6)

where Λ is a parameter of the model and Z(Λ) is the partition function, the normalization constant that ensures that the distribution sums to one. We hereafter refer to the exponent of the Grassmann function as the Hamiltonian H(θ̄, θ). In the above equation, we have adopted a quadratic form in the Grassmann variables as the Hamiltonian. The partition function is calculated by integration over the Grassmann variables as follows:

Z(\Lambda) = \int d\theta\, d\bar\theta\, e^{\bar\theta \Lambda \theta} = \Lambda,   (7)

where we have adopted the following sign convention for the Grassmann integral:

\int d\theta\, d\bar\theta\, \bar\theta\theta = 1.   (8)

Furthermore, we assume that the expected value of the dummy variable x, which corresponds to the probability p(x = 1), is calculated as the expected value of the product of Grassmann variables (θ̄θ):

E[x] = p(x = 1) = \int d\theta\, d\bar\theta\, (\bar\theta\theta)\, e^{H(\bar\theta,\theta)} = \frac{1}{\Lambda} \int d\theta\, d\bar\theta\, (\bar\theta\theta)\, e^{\bar\theta \Lambda \theta} = \frac{1}{\Lambda} \equiv \Sigma.   (9)

Thus the parameter Σ can be interpreted as the mean parameter of the probability distribution. In the same way, we assume that in our formalism the probability p(x = 0) is calculated as the expected value of the Grassmann variable (1 − θ̄θ),

p(x = 0) = \int d\theta\, d\bar\theta\, (1 - \bar\theta\theta)\, e^{H(\bar\theta,\theta)} = 1 - \Sigma,   (10)

in analogy with the summation p(x = 0) = \sum_x (1 − x) p(x). Central moments of higher order can be derived consistently by the following prescription. Since higher powers of a Grassmann variable vanish, we first reduce the polynomial in the dummy variable using the identity x^k = x, k ∈ {1, 2, ...}:

E\big[(x - \mu)^k\big] = E[x] \sum_{l=0}^{k-1} \binom{k}{l} (-\mu)^l + (-\mu)^k.   (11)

The Grassmann integral of the resulting expression then gives consistent results. Therefore, our formalism successfully reproduces the univariate Bernoulli distribution.
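As a quick numerical check of the univariate formalism (an illustrative sketch; the value Λ = 2.5 is arbitrary), the closed-form mean Σ = 1/Λ and the central moments obtained via Eq. (11) can be compared with direct enumeration over x ∈ {0, 1}:

```python
from math import comb

def univariate_moments(Lam, k_max=4):
    """Compare central moments from Eq. (11) with direct enumeration over x in {0, 1}."""
    sigma = 1.0 / Lam                        # p(x = 1), Eq. (9)
    mu = sigma
    for k in range(1, k_max + 1):
        # Eq. (11): E[(x - mu)^k] = E[x] * sum_{l<k} C(k,l)(-mu)^l + (-mu)^k
        grassmann = mu * sum(comb(k, l) * (-mu) ** l for l in range(k)) + (-mu) ** k
        direct = sum(((x - mu) ** k) * (sigma if x == 1 else 1.0 - sigma) for x in (0, 1))
        print(k, grassmann, direct)          # the two columns agree

univariate_moments(Lam=2.5)
```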
B. Bivariate binary probability distribution

The same idea as in the previous subsection applies to the bivariate binary probability distribution. We introduce a pair of Grassmann vectors θ = (θ_1, θ_2)^T, θ̄ = (θ̄_1, θ̄_2)^T corresponding to the dummy variables x_1, x_2. Again, we assume that the expected value obtained by summation over states can be calculated by integration of the following exponential function of the Grassmann variables,

e^{H(\bar\theta,\theta)} = \frac{1}{Z(\Lambda)} e^{\theta^\dagger \Lambda \theta},   (12)

where θ† denotes the transpose of the Grassmann vector θ̄, θ† ≡ θ̄^T, and Λ = Σ^{−1} is a matrix of model parameters analogous to the precision and covariance matrices of the bivariate Gaussian distribution,

\Lambda = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix} = \Sigma^{-1} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1}.   (13)

We adopt the following sign convention for the Grassmann integral:

\int d\theta_1 d\bar\theta_1\, d\theta_2 d\bar\theta_2\, (\bar\theta_1\theta_1)(\bar\theta_2\theta_2) = 1.   (14)

By performing the Grassmann integral, the partition function is expressed by the determinant of the matrix Λ,

Z(\Lambda) = \int d\theta_1 d\bar\theta_1\, d\theta_2 d\bar\theta_2\, e^{\theta^\dagger \Lambda \theta} = \det\Lambda.   (15)

We first discuss the joint distribution. In the conventional bivariate Bernoulli distribution, the co-occurrence probability can be written as an expected value of the dummy variables,

p(x_1 = 1, x_2 = 1) = \sum_{x_1, x_2} x_1 x_2\, p(x_1, x_2) = E[x_1 x_2].   (16)

In our formalism, we assume that this summation over states is expressed by the Grassmann integral. The co-occurrence probability p(x_1 = 1, x_2 = 1) is then calculated as

p(x_1 = 1, x_2 = 1) = \frac{1}{\det\Lambda} \int d\theta_1 d\bar\theta_1\, d\theta_2 d\bar\theta_2\, (\bar\theta_1\theta_1)(\bar\theta_2\theta_2)\, e^{\theta^\dagger \Lambda \theta} = \frac{1}{\det\Lambda}.   (17)

In the same way, the joint probabilities of the remaining states are calculated as

p(x_1 = 1, x_2 = 0) = \frac{\Lambda_{22} - 1}{\det\Lambda},   (18)
p(x_1 = 0, x_2 = 1) = \frac{\Lambda_{11} - 1}{\det\Lambda},   (19)
p(x_1 = 0, x_2 = 0) = \frac{\det(\Lambda - I)}{\det\Lambda},   (20)

where I is the identity matrix. The above expressions for the joint distribution can also be interpreted as all of the principal minors of the matrix Λ − I divided by det Λ. In terms of Σ, the joint probabilities are summarized as

p(x_1, x_2) = \det \begin{pmatrix} \Sigma_{11}^{x_1}(1-\Sigma_{11})^{1-x_1} & (-1)^{1-x_2}\,\Sigma_{12} \\ (-1)^{1-x_1}\,\Sigma_{21} & \Sigma_{22}^{x_2}(1-\Sigma_{22})^{1-x_2} \end{pmatrix}.   (21)

Next, we turn to the marginal distribution. In the conventional bivariate Bernoulli distribution, marginalization of the variable x_2 is performed by summation over the dummy variable:

p(x_1) = \sum_{x_2} p(x_1, x_2).   (22)

Again, we assume that this marginalization can be performed by integration over the Grassmann variables θ_2 and θ̄_2, which is carried out by completing the square and shifting the integration variables:

\int d\theta_2 d\bar\theta_2\, e^{H(\bar\theta,\theta)} = \frac{1}{\det\Lambda} \int d\theta_2 d\bar\theta_2\, \exp\big\{ \theta^\dagger \Lambda \theta \big\}
= \frac{1}{\det\Lambda} \int d\theta_2 d\bar\theta_2\, \exp\big\{ \bar\theta_1(\Lambda_{11} - \Lambda_{12}\Lambda_{22}^{-1}\Lambda_{21})\theta_1 + (\bar\theta_2 + \bar\theta_1\Lambda_{12}\Lambda_{22}^{-1})\Lambda_{22}(\theta_2 + \Lambda_{22}^{-1}\Lambda_{21}\theta_1) \big\}
= \frac{\Lambda_{22}}{\det\Lambda} \exp\big\{ \bar\theta_1(\Lambda_{11} - \Lambda_{12}\Lambda_{22}^{-1}\Lambda_{21})\theta_1 \big\}
\equiv e^{H(\bar\theta_1,\theta_1)}.   (23)

We call the resulting Hamiltonian H(θ̄_1, θ_1) the marginal Hamiltonian. From the above expression we can read off that the marginal distribution still has the same form as the original joint distribution, and that the parameter of the resulting distribution is the Schur complement of the matrix Λ. In terms of Σ, the marginal distribution is simply expressed by a principal submatrix of Σ,

e^{H(\bar\theta_1,\theta_1)} = \Sigma_{11}\, e^{\bar\theta_1 \Sigma_{11}^{-1} \theta_1}.   (24)
Therefore, the diagonal elements of the parameter matrix Σ can be interpreted as the mean parameters of the marginal distributions.

Here, we discuss the correlation, covariance, and statistical independence of the variables. By analogy with the bivariate Bernoulli distribution, the covariance between x_1 and x_2 can be calculated as a moment about the mean,

\mathrm{Cov}[x_1 x_2] = E[(x_1 - \mu_1)(x_2 - \mu_2)] = \frac{1}{\det\Lambda} \int d\theta_1 d\bar\theta_1\, d\theta_2 d\bar\theta_2\, (\bar\theta_1\theta_1 - \mu_1)(\bar\theta_2\theta_2 - \mu_2)\, e^{\theta^\dagger \Lambda \theta} = -\Sigma_{12}\Sigma_{21},   (25)

where µ_i ≡ E[x_i] = Σ_ii is the mean parameter. Therefore, the product of the off-diagonal elements can be interpreted as the covariance of the variables. Notice that the expression for the joint distribution, Eq. (21), can be transformed into

p(x_1, x_2) = p(x_1)\, p(x_2) - (-1)^{x_1 + x_2}\, \Sigma_{12}\Sigma_{21}.   (26)

When we define the correlation of binary variables by the Pearson correlation coefficient,

\rho_{12} \equiv \frac{\mathrm{Cov}[x_1 x_2]}{\sqrt{\mathrm{Cov}[x_1 x_1]\,\mathrm{Cov}[x_2 x_2]}} = \frac{-\Sigma_{12}\Sigma_{21}}{\sqrt{\Sigma_{11}(1-\Sigma_{11})\,\Sigma_{22}(1-\Sigma_{22})}},   (27)

the statistical independence of x_1 and x_2 is equivalent to Σ_12 Σ_21 = 0, i.e., ρ_12 = 0. Therefore, in our model, uncorrelatedness between variables is equivalent to statistical independence, as in the case of the Gaussian distribution:

\rho_{12} = 0 \iff p(x_1, x_2) = p(x_1)\, p(x_2).   (28)

Lastly, we discuss the conditional distribution. In the conventional Bernoulli distribution, conditioning on the observation x_2 = 1 is expressed by a summation over the dummy variable using Bayes' theorem:

p(x_1 \mid x_2 = 1) = \frac{p(x_1, x_2 = 1)}{p(x_2 = 1)} = \frac{\sum_{x_2} x_2\, p(x_1, x_2)}{p(x_2 = 1)}.   (29)

Again, we rewrite this summation as a Grassmann integral. The Hamiltonian corresponding to the conditional distribution, which we call the conditional Hamiltonian H(θ̄_1, θ_1 | x_2 = 1), is then calculated as

e^{H(\bar\theta_1,\theta_1 \mid x_2 = 1)} \equiv \frac{1}{p(x_2 = 1)} \int d\theta_2 d\bar\theta_2\, (\bar\theta_2\theta_2)\, e^{H(\bar\theta,\theta)} = \frac{1}{p(x_2 = 1)} \frac{1}{\det\Lambda} \int d\theta_2 d\bar\theta_2\, (\bar\theta_2\theta_2)\, e^{\theta^\dagger \Lambda \theta} = \frac{1}{\Lambda_{11}}\, e^{\bar\theta_1 \Lambda_{11} \theta_1}.   (30)

Therefore, the conditional distribution given x_2 = 1 still has the same form as the original joint distribution, and the model parameter is just a principal submatrix of Λ. In terms of Σ, the above conditional distribution is expressed by the Schur complement of Σ with respect to Σ_22:

e^{H(\bar\theta_1,\theta_1 \mid x_2 = 1)} = \Sigma_{1|2}\, e^{\bar\theta_1 \Sigma_{1|2}^{-1} \theta_1},   (31)
\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.   (32)

In the same way, the conditional Hamiltonian for the observation x_2 = 0 is calculated as

e^{H(\bar\theta_1,\theta_1 \mid x_2 = 0)} \equiv \frac{1}{p(x_2 = 0)} \frac{1}{\det\Lambda} \int d\theta_2 d\bar\theta_2\, (1 - \bar\theta_2\theta_2)\, e^{\theta^\dagger \Lambda \theta} = \frac{1}{\Lambda_{11} - \Lambda_{12}(\Lambda_{22}-1)^{-1}\Lambda_{21}} \exp\big\{ \bar\theta_1 \big[ \Lambda_{11} - \Lambda_{12}(\Lambda_{22}-1)^{-1}\Lambda_{21} \big] \theta_1 \big\}.   (33)

Again, the conditional distribution given x_2 = 0 still has the same form as the joint distribution. The meaning of the conditioning is easy to interpret in terms of Σ. In fact, the mean of the variable x_1 is shifted by each conditioning as follows:

p(x_1 = 1 \mid x_2 = 1) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21},   (34)
p(x_1 = 1 \mid x_2 = 0) = \Sigma_{11} - \Sigma_{12}(\Sigma_{22}-1)^{-1}\Sigma_{21} = \Sigma_{11} - (-\Sigma_{12})(1-\Sigma_{22})^{-1}\Sigma_{21}.   (35)

The conditioning on x_2 = 1 is expressed by the partial covariance Σ_{1|2} for an observed variable with mean Σ_22 and covariance −Σ_12Σ_21. On the other hand, the conditioning on x_2 = 0 is expressed by the partial covariance for an observed variable whose mean and sign of correlation are inverted, 1 − Σ_22 and −(−Σ_12)Σ_21. In other words, the conditional distribution given x_2 = 0 is simply expressed by the Schur complement of the following matrix Σ̃ with respect to Σ̃_22,

\tilde\Sigma \equiv \begin{pmatrix} \Sigma_{11} & -\Sigma_{12} \\ \Sigma_{21} & 1 - \Sigma_{22} \end{pmatrix}.   (36)
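The bivariate formulas can be checked numerically without any Grassmann algebra, since every quantity reduces to a determinant of the 2 × 2 parameter matrix. The following sketch (with illustrative values of Σ) evaluates the four joint probabilities through Eq. (21) and verifies the normalization, the covariance −Σ_12Σ_21 of Eq. (25), and the conditional means of Eqs. (34) and (35).

```python
import numpy as np
from itertools import product

# illustrative parameter matrix; diagonal = means, -S[0,1]*S[1,0] = covariance
S = np.array([[0.4, 0.2],
              [-0.3, 0.6]])

def joint(x1, x2):
    """Joint probability p(x1, x2) from the determinant formula, Eq. (21)."""
    m = np.array([[S[0, 0]**x1 * (1 - S[0, 0])**(1 - x1), (-1)**(1 - x2) * S[0, 1]],
                  [(-1)**(1 - x1) * S[1, 0],              S[1, 1]**x2 * (1 - S[1, 1])**(1 - x2)]])
    return np.linalg.det(m)

probs = {xx: joint(*xx) for xx in product([0, 1], repeat=2)}
assert abs(sum(probs.values()) - 1.0) < 1e-12            # normalization

mean = [probs[(1, 0)] + probs[(1, 1)], probs[(0, 1)] + probs[(1, 1)]]
cov = probs[(1, 1)] - mean[0] * mean[1]
print(cov, -S[0, 1] * S[1, 0])                            # Eq. (25): covariance = -S12*S21

# conditional means, Eqs. (34) and (35)
print(probs[(1, 1)] / (probs[(1, 1)] + probs[(0, 1)]), S[0, 0] - S[0, 1] * S[1, 0] / S[1, 1])
print(probs[(1, 0)] / (probs[(1, 0)] + probs[(0, 0)]), S[0, 0] + S[0, 1] * S[1, 0] / (1 - S[1, 1]))
```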
C. p-dimensional binary probability distribution

The procedure of the previous subsections extends straightforwardly to p-dimensional variables. In this subsection, we simply enumerate the results. First, we introduce a pair of p-dimensional Grassmann vectors θ = (θ_1, θ_2, ..., θ_p)^T, θ̄ = (θ̄_1, θ̄_2, ..., θ̄_p)^T and the p × p matrix of parameters Λ = Σ^{−1}. Then, the expected value by summation over states is replaced with the Grassmann integral of the following function,

e^{H(\bar\theta,\theta)} \equiv \frac{1}{\det\Lambda}\, e^{\theta^\dagger \Lambda \theta}.   (37)

We adopt the following sign convention for the Grassmann integral:

\int d\theta_1 d\bar\theta_1\, d\theta_2 d\bar\theta_2 \cdots d\theta_p d\bar\theta_p\, (\bar\theta_1\theta_1)(\bar\theta_2\theta_2)\cdots(\bar\theta_p\theta_p) = 1.   (38)

To discuss the joint distribution, we define index labels for the variables. We write the set of all indices of the p-dimensional binary variables as P ≡ {1, 2, ..., p}. We write the index label of the variables observed as x_i = 1 as A ⊆ P and denote these variables by x_A. In the same way, we write the index label of the variables observed as x_i = 0 as B ⊆ P and denote these variables by x_B. Then, without loss of generality, the matrix of parameters is expressed as a partitioned matrix as follows:

\Sigma = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix} = \Lambda^{-1} = \begin{pmatrix} \Lambda_{AA} & \Lambda_{AB} \\ \Lambda_{BA} & \Lambda_{BB} \end{pmatrix}^{-1}.   (39)

Then, the joint distribution is given by

p(x_A = 1, x_B = 0) = \frac{1}{\det\Lambda} \int d\theta_P d\bar\theta_P\, (\bar\theta_A\theta_A)(1 - \bar\theta_B\theta_B)\, e^{\theta^\dagger \Lambda \theta}
\equiv \frac{1}{\det\Lambda} \int \Big[ \prod_{i=1}^{p} d\theta_i d\bar\theta_i \Big] \Big[ \prod_{j \in A} \bar\theta_j\theta_j \Big] \Big[ \prod_{k \in B} (1 - \bar\theta_k\theta_k) \Big] e^{\theta^\dagger \Lambda \theta}
= \frac{\det(\Lambda_{BB} - I)}{\det\Lambda}
= \det \begin{pmatrix} \Sigma_{AA} & -\Sigma_{AB} \\ \Sigma_{BA} & I - \Sigma_{BB} \end{pmatrix},   (40)

where I denotes the identity matrix. The above equation indicates that the joint probabilities are expressed by the principal minors of the matrix Λ − I divided by det Λ.

Next, we turn to the marginal distribution. We write the index labels of the marginalized and remaining variables as M and R. Then, without loss of generality, the model parameters are written as a partitioned matrix as follows:

\Sigma = \begin{pmatrix} \Sigma_{RR} & \Sigma_{RM} \\ \Sigma_{MR} & \Sigma_{MM} \end{pmatrix} = \Lambda^{-1} = \begin{pmatrix} \Lambda_{RR} & \Lambda_{RM} \\ \Lambda_{MR} & \Lambda_{MM} \end{pmatrix}^{-1}.   (41)

Then, the marginal Hamiltonian H(θ̄_R, θ_R) is expressed as

e^{H(\bar\theta_R,\theta_R)} \equiv \int d\theta_M d\bar\theta_M\, e^{H(\bar\theta,\theta)} = \det\Sigma_{RR}\, e^{\theta_R^\dagger \Sigma_{RR}^{-1} \theta_R},   (42)

where

\Sigma_{RR}^{-1} = \Lambda^{(M)}_{RR} \equiv \Lambda_{RR} - \Lambda_{RM}\Lambda_{MM}^{-1}\Lambda_{MR}.   (43)

The parameter of the marginal Hamiltonian is just the principal submatrix of Σ with the same row and column indices. That is, the diagonal and off-diagonal elements of the matrix Σ_ij express the mean and the covariance with all the other variables marginalized out. When the product of off-diagonal elements −Σ_ij Σ_ji vanishes, the variables x_i and x_j are unconditionally independent, or marginally independent. Central moments of higher order can also be calculated by the Grassmann integral. For example, the central moment of the variables with index label R is given by

\mu_R \equiv E\Big[ \prod_{i \in R} (x_i - \mu_i) \Big] = \frac{1}{\det\Lambda} \int d\theta_P d\bar\theta_P \prod_{i \in R} (\bar\theta_i\theta_i - \mu_i)\, e^{\theta^\dagger \Lambda \theta} = \det\big[ \Sigma_{RR} - I_R(\mu_R) \big],   (44)

where I_R(µ_R) is the diagonal matrix whose diagonal elements are given by µ_R,

I_R(\mu_R) \equiv \mathrm{diag}(\mu_R) = \delta_{ij}\mu_i, \quad (i, j \in R),   (45)

and δ_ij is the Kronecker delta. We record a fourth-order central moment here for future reference:

\mu_{iijj} \equiv E\big[ (x_i - \mu_i)^2 (x_j - \mu_j)^2 \big], \quad (i \neq j)
= E\Big[ \big[ (1 - 2\mu_i) x_i + \mu_i^2 \big]\big[ (1 - 2\mu_j) x_j + \mu_j^2 \big] \Big]
= (1 - 2\mu_i)(1 - 2\mu_j)(-\Sigma_{ij}\Sigma_{ji}) + \mu_i(1 - \mu_i)\mu_j(1 - \mu_j).   (46)
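The same determinant expressions extend verbatim to p dimensions. The sketch below (again with an illustrative Σ, not a value from the paper) computes the joint probabilities as principal minors of Λ − I divided by det Λ, Eq. (40), checks that they sum to one, and evaluates a pairwise central moment via Eq. (44).

```python
import numpy as np
from itertools import product

def joint_prob(Lam, x):
    """p(x) as the principal minor of (Lam - I) on the zero indices, divided by det(Lam); Eq. (40)."""
    B = [i for i, xi in enumerate(x) if xi == 0]         # indices observed as 0
    M = Lam - np.eye(len(x))
    minor = np.linalg.det(M[np.ix_(B, B)]) if B else 1.0
    return minor / np.linalg.det(Lam)

# illustrative 3-dimensional parameter matrix Sigma (diagonal = means)
Sigma = np.array([[0.5,  0.2,  0.1],
                  [-0.1, 0.4,  0.2],
                  [0.1, -0.1,  0.6]])
Lam = np.linalg.inv(Sigma)

probs = {x: joint_prob(Lam, x) for x in product([0, 1], repeat=3)}
print(sum(probs.values()))                                # sums to 1 by the principal-minor identity

# pairwise central moment from Eq. (44): det(Sigma_RR - diag(mu_R)) = -Sigma_01 * Sigma_10
mu = np.diag(Sigma)
cov01 = np.linalg.det(Sigma[np.ix_([0, 1], [0, 1])] - np.diag(mu[[0, 1]]))
print(cov01, -Sigma[0, 1] * Sigma[1, 0])
```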
Then, we discuss the conditional distribution. As in the case of the joint distribution, we write the index labels of the variables observed as x_i = 1 and x_i = 0 as A and B, and write these variables as x_A and x_B, respectively. We write the union of A and B as C, i.e., x_C = (x_A, x_B). The remaining indices after conditioning are represented by the set difference R = P \ (A ∪ B) ≡ P \ C. Without loss of generality, the matrix of parameters is expressed as a partitioned matrix as follows:

\Sigma = \begin{pmatrix} \Sigma_{RR} & \Sigma_{RC} \\ \Sigma_{CR} & \Sigma_{CC} \end{pmatrix} = \begin{pmatrix} \Sigma_{RR} & \Sigma_{RA} & \Sigma_{RB} \\ \Sigma_{AR} & \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BR} & \Sigma_{BA} & \Sigma_{BB} \end{pmatrix},   (47)
\Lambda = \begin{pmatrix} \Lambda_{RR} & \Lambda_{RC} \\ \Lambda_{CR} & \Lambda_{CC} \end{pmatrix} = \begin{pmatrix} \Lambda_{RR} & \Lambda_{RA} & \Lambda_{RB} \\ \Lambda_{AR} & \Lambda_{AA} & \Lambda_{AB} \\ \Lambda_{BR} & \Lambda_{BA} & \Lambda_{BB} \end{pmatrix}.   (48)

The conditional Hamiltonian H(θ̄_R, θ_R | x_C) is given by

e^{H(\bar\theta_R,\theta_R \mid x_C)} = \frac{1}{p(x_C)} \frac{1}{\det\Lambda} \int d\theta_C d\bar\theta_C\, (\bar\theta_A\theta_A)(1 - \bar\theta_B\theta_B)\, e^{\theta^\dagger \Lambda \theta} = \frac{1}{\det\Lambda_{R|C}}\, e^{\theta_R^\dagger \Lambda_{R|C} \theta_R},   (49)

where

\Lambda_{R|C} = \Lambda_{RR} - \Lambda_{RB}\big( \Lambda_{BB} - I \big)^{-1}\Lambda_{BR} = \Big[ \Sigma_{RR} - \Sigma_{RC}\big[ \Sigma_{CC} - I_C(1 - x_C) \big]^{-1}\Sigma_{CR} \Big]^{-1},   (50)

and I_C(1 − x_C) is the diagonal matrix defined as in Eq. (45):

I_C(1 - x_C) \equiv \mathrm{diag}(1 - x_C) = \delta_{ij}(1 - x_i), \quad (i, j \in C).   (51)

As in the case of the bivariate variables, we can give an intuitive interpretation of the parameters of the conditional distribution. In fact, the matrix Λ_{R|C} can be rewritten as the inverse of the Schur complement of the following matrix Σ̃ with respect to the principal submatrix Σ̃_CC,

\Lambda_{R|C} = \tilde\Sigma_{R|C}^{-1} = \big[ \Sigma_{RR} - \tilde\Sigma_{RC}\tilde\Sigma_{CC}^{-1}\Sigma_{CR} \big]^{-1} = \tilde\Lambda_{RR},   (52)

where

\tilde\Sigma \equiv \begin{pmatrix} \Sigma_{RR} & \tilde\Sigma_{RC} \\ \Sigma_{CR} & \tilde\Sigma_{CC} \end{pmatrix} \equiv \begin{pmatrix} \Sigma_{RR} & \Sigma_{RA} & -\Sigma_{RB} \\ \Sigma_{AR} & \Sigma_{AA} & -\Sigma_{AB} \\ \Sigma_{BR} & \Sigma_{BA} & I - \Sigma_{BB} \end{pmatrix} \equiv \tilde\Lambda^{-1}
= \begin{pmatrix} \Lambda_{RR} - \Lambda_{RB}(\Lambda_{BB}-I)^{-1}\Lambda_{BR} & \Lambda_{RA} - \Lambda_{RB}(\Lambda_{BB}-I)^{-1}\Lambda_{BA} & -\Lambda_{RB}(\Lambda_{BB}-I)^{-1} \\ \Lambda_{AR} - \Lambda_{AB}(\Lambda_{BB}-I)^{-1}\Lambda_{BR} & \Lambda_{AA} - \Lambda_{AB}(\Lambda_{BB}-I)^{-1}\Lambda_{BA} & -\Lambda_{AB}(\Lambda_{BB}-I)^{-1} \\ (\Lambda_{BB}-I)^{-1}\Lambda_{BR} & (\Lambda_{BB}-I)^{-1}\Lambda_{BA} & \Lambda_{BB}(\Lambda_{BB}-I)^{-1} \end{pmatrix}^{-1}.   (53)

The matrix Σ̃ corresponds to the original matrix Σ with the mean and the sign of the covariance parameters of the variables x_B inverted, as I − Σ_BB and (−Σ_RB, −Σ_AB), respectively. In other words, observing the dummy variable x_i as x_i = 0 is equivalent to observing the dummy variable x̃_i with the dummy coding inverted, x̃_i = 1. Therefore, our formalism is symmetric and does not depend on how the binary variables are encoded.

The matrix Λ can also be interpreted intuitively. From Eq. (50), we see that the diagonal element of Λ expresses the inverse of the mean conditioned on all the other variables being observed as x_A = 1. Furthermore, the off-diagonal elements can be interpreted as partial correlations, similarly to the multivariate Gaussian distribution. Here, we consider the conditional distribution of x_R = (x_1, x_2) given x_A = 1, A = P \ {1, 2}. The corresponding conditional Hamiltonian is given by

e^{H(\bar\theta_R,\theta_R \mid x_A)} = \frac{1}{\det\Lambda_{RR}}\, e^{\theta_R^\dagger \Lambda_{RR} \theta_R} = \det\Sigma_{R|A}\, e^{\theta_R^\dagger \Sigma_{R|A}^{-1} \theta_R},   (54)

where

\Sigma_{R|A} = \begin{pmatrix} \Sigma_{11|A} & \Sigma_{12|A} \\ \Sigma_{21|A} & \Sigma_{22|A} \end{pmatrix} = \Lambda_{RR}^{-1} = \frac{1}{\det\Lambda_{RR}} \begin{pmatrix} \Lambda_{22} & -\Lambda_{12} \\ -\Lambda_{21} & \Lambda_{11} \end{pmatrix}.   (55)

Then, the correlation between x_1 and x_2 in the conditional distribution, i.e., the partial correlation ρ_{12|A}, is expressed by the product of the off-diagonal elements of Λ_RR:

\rho_{12|A} = \frac{-\Sigma_{12|A}\Sigma_{21|A}}{\sqrt{\Sigma_{11|A}(1-\Sigma_{11|A})\,\Sigma_{22|A}(1-\Sigma_{22|A})}} = \frac{-\Lambda_{12}\Lambda_{21}}{\sqrt{\Lambda_{22}(\det\Lambda_{RR} - \Lambda_{22})\,\Lambda_{11}(\det\Lambda_{RR} - \Lambda_{11})}}.   (56)

Therefore, the product Λ_12 Λ_21 can be interpreted as the partial correlation with all the other variables observed as x_A = 1. When the partial correlation conditioned on x_A = 1 vanishes, the variables x_1 and x_2 are conditionally independent:

\rho_{ij|A} = 0 \iff p(x_i, x_j \mid x_A = 1) = p(x_i \mid x_A = 1)\, p(x_j \mid x_A = 1).   (57)

The partial correlation for a general conditioning x_C = (x_A = 1, x_B = 0), other than x_A = 1, can be interpreted in the same way. In this case, we first define the matrix Σ̃ in which the dummy coding of the variables observed as x_B = 0 is inverted to x̃_B = 1, as in Eq. (53).
Then the product of the off-diagonal elements of the inverse matrix Λ̃ = Σ̃^{−1} expresses the magnitude of the partial correlation under that conditioning. The partial correlation for the general conditioning is given by the same expression, Eq. (56), except that Λ is replaced by Λ̃.

Lastly, we mention the normalization and positivity of our probability distribution. Since an analytical expression for the partition function is obtained, the normalization of the joint distribution is automatically satisfied. This can be confirmed from an identity for principal minors. In fact, the joint probabilities of the p-dimensional binary variables are expressed by the principal minors of the matrix Λ − I divided by det Λ, as shown in Eq. (40). With the identity for the summation over all principal minors,

\sum_{B \subseteq P} \det(\Lambda_{BB} - I) = \det\Lambda,   (58)

we see that the normalization of the joint distribution is satisfied by definition. On the other hand, the positivity requirement that all probabilities of the joint distribution be greater than or equal to zero does not necessarily hold in general. Expressed in the terminology of linear algebra, the requirement that all joint probabilities be positive is equivalent to the matrix Λ − I being a P-matrix, an important property in various applications [15]. When the matrix Λ − I is a P-matrix, the positivity of the marginal and conditional distributions can also be confirmed. The marginal distribution is expressed by the Schur complement of Λ as in Eq. (43). In other words, the positivity of the marginal probabilities is expressed as the positivity of the determinants of the Schur complements of each principal submatrix (Λ − I + I_P(δ_PM))[R′ ∪ M] (R′ ⊆ R) with respect to Λ_MM:

\Lambda - I + I_P(\delta_{PM}) \equiv \begin{pmatrix} \Lambda_{RR} - I & \Lambda_{RM} \\ \Lambda_{MR} & \Lambda_{MM} \end{pmatrix}.   (59)

The above matrix is the matrix Λ − I plus the identity submatrix I_P(δ_PM). Since adding the identity submatrix amounts to shifting the real parts of the eigenvalues by one, the matrix Λ − I + I_P(δ_PM) and its principal submatrices are still P-matrices. From the identity for the determinant of a block matrix, the determinant of the Schur complement of a P-matrix is positive; therefore, all of the marginal probabilities are positive. For the conditional probabilities, positivity is rephrased as the positivity of the determinants of the Schur complements of each principal submatrix (Λ − I)[R′ ∪ B] (R′ ⊆ R) with respect to Λ_BB − I, as can be seen from Eq. (50). Again, since the determinant of the Schur complement of the P-matrix (Λ − I)[R′ ∪ B] is positive, all of the conditional probabilities are positive. These properties are in contrast to those of the conventional multivariate Bernoulli distribution. In the multivariate Bernoulli distribution, the partition function is not given analytically but has to be summed numerically over all possible states; on the other hand, the property that all joint probabilities are positive is satisfied by definition, because the probability distribution is given by the exponential of a polynomial in the dummy variables, as shown in Eq. (1).
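A small sketch of the positivity check (illustrative, reusing the example Σ from the sketch above): a parameter matrix Σ defines a valid distribution only if Λ − I is a P-matrix, i.e., every principal minor of Λ − I is positive, which by Eq. (40) is the same as all 2^p joint probabilities being non-negative. The brute-force test below is adequate only for small p.

```python
import numpy as np
from itertools import combinations

def is_p_matrix(M):
    """Check that every principal minor of M is positive (the P-matrix condition)."""
    n = M.shape[0]
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            if np.linalg.det(M[np.ix_(idx, idx)]) <= 0:
                return False
    return True

Sigma = np.array([[0.5,  0.2,  0.1],
                  [-0.1, 0.4,  0.2],
                  [0.1, -0.1,  0.6]])
Lam = np.linalg.inv(Sigma)
print(is_p_matrix(Lam - np.eye(3)))   # True here: all joint probabilities are positive
```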
III. NUMERICAL EXPERIMENTS

In this section, we validate our method numerically. Since analytical expressions for the marginal and conditional distributions are available in our formalism, we can easily generate random numbers for correlated binary variables by repeating Bernoulli trials. Hence, we investigate the sampling distributions of various statistics and estimates using synthetic datasets. Below, we define the index labels for the variables used in this section. We denote the set of all indices of the p-dimensional binary variables as P ≡ {1, 2, ..., p}. We write the index labels of the variables observed as x_i = 1 and x_i = 0 as A and B, and write these variables as x_A and x_B, respectively. We denote a specific realization of the dummy vector by δ, for example δ = (1, ..., 0) for five-dimensional variables.
Generated data are represented by D = {x_P^1, x_P^2, ..., x_P^N}, where x_P^n, n ∈ {1, 2, ..., N}, is a p-dimensional vector of dummy variables.

A. Sampling distribution of statistics

Since the covariance structure of our model is determined by the products of off-diagonal elements of the matrix, −Σ_ij Σ_ji, the parameters Σ_ij themselves are not fixed uniquely even when we set the mean and covariance by hand. Hence, the remaining parameters of the model, Σ_ij (i < j), for a given mean and covariance are determined by maximizing the entropy of the joint distribution H(p | Σ_ij),

H(p \mid \Sigma_{ij}) = -\sum_{\delta} p(x_P = \delta \mid \Sigma_{ij}) \log p(x_P = \delta \mid \Sigma_{ij}),   (60)

where the summation is taken over all possible states. Correlated random numbers for multivariate binary variables can be generated by repeating Bernoulli trials using the analytical expressions for the marginal and conditional distributions. In fact, since the joint distribution can be factorized as p(x_1, x_2, ..., x_p) = p(x_1) p(x_2 | x_1) ⋯ p(x_p | x_1, x_2, ..., x_{p−1}), we can generate a random sample by repeating Bernoulli trials p times, from p(x_1) to p(x_p | x_1, x_2, ..., x_{p−1}), depending on the previous observations; a sketch of this sampler is given below.
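A minimal sketch of this sequential sampler (illustrative: the paper's maximum-entropy choice of the off-diagonal parameters in Eq. (60) is omitted, and the 3 × 3 matrix Σ below is only an example). Because the marginal of the first k variables is governed by the submatrix Σ[1..k, 1..k], each conditional probability is a ratio of two determinant-form joint probabilities.

```python
import numpy as np

def joint_prob(Sigma, x):
    """p(x) from the determinant form of Eq. (40):
    flip the sign of column j and invert the mean on the diagonal when x_j = 0."""
    x = np.asarray(x)
    D = np.diag(np.where(x == 1, 1.0, -1.0))
    return np.linalg.det(Sigma @ D + np.diag(1.0 - x))

def sample(Sigma, rng):
    """Draw one binary vector by sequential Bernoulli trials,
    p(x_1) p(x_2|x_1) ... p(x_p|x_1,...,x_{p-1})."""
    p = Sigma.shape[0]
    x = []
    prev = 1.0                               # p(x_1, ..., x_{k-1})
    for k in range(p):
        num = joint_prob(Sigma[:k + 1, :k + 1], x + [1])
        cond = num / prev                    # p(x_k = 1 | previous observations)
        x.append(1 if rng.random() < cond else 0)
        prev = num if x[-1] == 1 else prev - num
    return np.array(x)

rng = np.random.default_rng(0)
Sigma = np.array([[0.5,  0.2,  0.1],
                  [-0.1, 0.4,  0.2],
                  [0.1, -0.1,  0.6]])
data = np.array([sample(Sigma, rng) for _ in range(20000)])
print(data.mean(axis=0))                     # approaches diag(Sigma)
print(np.cov(data.T, bias=True)[0, 1], -Sigma[0, 1] * Sigma[1, 0])
```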
The dimension of the variables used to generate the synthetic datasets is p = 5, and the mean parameters µ_1, ..., µ_5 and the correlation coefficients ρ_ij (i < j) of the variables are fixed to preset values, where we have defined the correlation coefficient for binary variables by the Pearson correlation coefficient.
Fig. 1 shows the sampling distributions of the statistics for the sample mean, the unbiased sample covariance, and the empirical joint distribution computed from the synthetic datasets for different sample sizes N. The unbiased sample covariance is defined as

s_{ij} \equiv \frac{1}{N-1} \sum_{n=1}^{N} (x_i^n - \bar{x}_i)(x_j^n - \bar{x}_j), \quad (i \neq j),   (61)
\bar{x}_i \equiv \frac{1}{N} \sum_{n=1}^{N} x_i^n.   (62)

FIG. 1. Examples of the sampling distributions of the statistics for the sample mean x̄_i, the unbiased sample covariances s_ij, and the empirical joint distribution q, for the sample sizes N = 500, N = 200 and N = 50, respectively. The trial size, i.e., the number of times the statistics are computed, is M = 5000. The red solid lines denote the true values and the black dashed lines denote the mean values of the sampling distributions.

We observe that the sampling distributions of the statistics are consistent with the following theoretical predictions:

E[\bar{x}_i] = \mu_i,   (63)
\mathrm{Var}[\bar{x}_i] = \frac{1}{N} \mu_i(1 - \mu_i),   (64)
E[s_{ij}] = -\Sigma_{ij}\Sigma_{ji} \equiv \sigma_{ij},   (65)
\mathrm{Var}[s_{ij}] = \frac{1}{N} \mu_{iijj} - \frac{N-2}{N(N-1)} (\mu_{ij})^2 + \frac{1}{N(N-1)} \mu_{ii}\mu_{jj},   (66)
where µ_ii ≡ µ_i(1 − µ_i) and µ_iijj are the second- and fourth-order central moments, and

E[q_\delta] = \pi_\delta(\Sigma),   (67)
\mathrm{Var}[q_\delta] = \frac{1}{N} \pi_\delta(\Sigma)\big[ 1 - \pi_\delta(\Sigma) \big].   (68)

Although the sampling distribution can be skewed when the sample size is small, it becomes asymptotically normal as the sample size increases, which is consistent with the central limit theorem. These observations, in turn, demonstrate the soundness of our method.

B. Sampling distribution of estimates

In this subsection, we discuss a method of parameter estimation and the sampling distribution of the estimates given observed data. A common method of parameter estimation is maximum likelihood estimation given observed data D. In our model, the log-likelihood is expressed as

l(\Sigma \mid \mathcal{D}) = \sum_{n=1}^{N} \log p(x_P^n \mid \Sigma) = N \sum_{\delta} \frac{n_\delta}{N} \log p(x_P = \delta \mid \Sigma) \equiv N \sum_{\delta} q_\delta \log \pi_\delta(\Sigma),   (69)

where n_δ is the number of times the state x_P = δ is observed, which satisfies \sum_\delta n_\delta = N. In other words, the log-likelihood is expressed as the cross entropy between the empirical joint distribution q_δ and the model distribution π_δ(Σ) ≡ p(x_P = δ | Σ).

Maximum likelihood estimation of Eq. (69) by gradient-based methods is valid when the sample size is large or the number of variables p is small. When this is not the case, many of the empirical joint probabilities q_δ become zero, which causes the difficulty that the corresponding model joint probabilities π_δ(Σ) can take negative values during parameter estimation. Hence, in this paper, we use maximum a posteriori (MAP) estimation instead of maximum likelihood estimation. We assign a Dirichlet prior to the model joint distribution π_δ(Σ) ≡ p(x_P = δ | Σ) so that the joint probabilities of all states remain positive,

p(\Sigma \mid \mathcal{D}) \propto p(\mathcal{D} \mid \Sigma)\, p(\pi(\Sigma) \mid \alpha) = \frac{1}{B(\alpha)} \prod_{\delta} p(x_P = \delta \mid \Sigma)^{n_\delta}\, \pi_\delta(\Sigma)^{\alpha_\delta - 1} \propto \prod_{\delta} p(x_P = \delta \mid \Sigma)^{n_\delta + \gamma_\delta},   (70)

where B(α) is the multivariate Beta function and α_δ = 1 + γ_δ is the concentration parameter of the Dirichlet distribution. The use of this prior simply amounts to adding the prior hyperparameters to the empirical counts, n_δ → n_δ + γ_δ, which is known as pseudo-counts. In this paper, we set a small and uniform value for the concentration parameters, γ_δ = γ.
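A sketch of the resulting MAP objective (illustrative only: the optimization routine, the treatment of the diagonal elements, and the toy counts are our own choices, not the paper's). The negative log-posterior of Eq. (70) is the pseudo-count-weighted cross entropy, with π_δ(Σ) evaluated through the determinant formula and an infinite barrier whenever some π_δ(Σ) becomes non-positive.

```python
import numpy as np
from scipy.optimize import minimize
from itertools import product

def joint_prob(Sigma, x):
    """pi_delta(Sigma): joint probability from the determinant form of Eq. (40)."""
    x = np.asarray(x)
    D = np.diag(np.where(x == 1, 1.0, -1.0))
    return np.linalg.det(Sigma @ D + np.diag(1.0 - x))

def neg_log_posterior(offdiag, mu, counts, states, gamma):
    """Negative log posterior of Eq. (70): cross entropy with pseudo-counts n_delta + gamma."""
    p = len(mu)
    Sigma = np.diag(mu).astype(float)
    Sigma[~np.eye(p, dtype=bool)] = offdiag          # fill off-diagonal entries row by row
    probs = np.array([joint_prob(Sigma, s) for s in states])
    if np.any(probs <= 0):                           # barrier: keep all pi_delta positive
        return np.inf
    return -np.sum((counts + gamma) * np.log(probs))

# toy data: counts n_delta over the 2**p states of a p = 3 problem (hypothetical numbers)
p = 3
states = list(product([0, 1], repeat=p))
rng = np.random.default_rng(1)
counts = rng.integers(1, 40, size=2 ** p).astype(float)
mu_hat = np.array([sum(c for c, s in zip(counts, states) if s[i] == 1) for i in range(p)]) / counts.sum()

res = minimize(neg_log_posterior, x0=np.zeros(p * (p - 1)),
               args=(mu_hat, counts, states, 0.1), method="Nelder-Mead")
print(res.fun)
```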
FIG. 2. Examples of the sampling distributions of the MAP estimates for the mean parameter µ_i, the covariance parameters σ_ij, and the joint distribution of the estimated model π_δ(Σ), for the sample sizes N = 500, N = 200 and N = 50, respectively. The trial size, i.e., the number of times the estimates are computed, is M = 2000. The red solid lines denote the true values and the black dashed lines denote the mean values of the sampling distributions.

Fig. 2 shows the sampling distributions of the MAP estimates for the mean and covariance parameters and for the joint distribution of the estimated model, for different sample sizes N. The data are generated by the same model as in the previous subsection. As in the case of the statistics, we observe that each estimate converges asymptotically to the true value of the parameter as the sample size goes to infinity. Although the sampling distribution can be skewed when the sample size is small, it becomes asymptotically normal as the sample size increases. In other words, our MAP estimator is consistent and asymptotically normal. The standard errors of the sampling distributions decrease as 1/√N as the sample size increases. These results suggest that the usual statistical inference on model parameters, such as hypothesis testing and confidence-interval estimation, is available.

FIG. 3. Examples of the sampling distributions of the MAP estimates for the parameters Σ_ij. The first, second and third rows correspond to the results for the sample sizes N = 5000, N = 500 and N = 50, respectively. The trial size is M = 2000. The red solid lines denote the values of the true parameters; due to the degree of freedom of matrix transposition, there exist two lines corresponding to the true parameters.

Fig. 3 shows the sampling distributions of the MAP estimates for the parameters Σ_ij for different sample sizes N. Here, we note the degrees of freedom of the model parameters. Since the joint distribution is given by the determinant of the matrix Σ as in Eq. (40), the elements of the matrix are not determined uniquely; that is, there exist different parameters that generate the same joint probabilities. In fact, determinants of matrices of the form of Eq. (40) are invariant under multiplying the i-th row of Σ by a constant c while multiplying the i-th column by the constant 1/c. Furthermore, the determinants are invariant under matrix transposition. These degrees of freedom can also be read off from the expression for the Grassmann integral. Hence, without loss of generality, we fix the parameters Σ_{i1} = −1 (i ≠ 1) using the degree of freedom of constant multiplication; that is, the parameters Σ_{1j} (j ≠ 1) then simply represent the covariances. However, we have not found a way to fix the degree of freedom of matrix transposition. Hence, in Fig. 3, we show both of the estimated parameters corresponding to the degrees of freedom of matrix transposition. Although the sampling distribution of the estimate Σ_ij has large dispersion and is unimodal when the sample size is small, the estimates converge to one of the true parameters as the sample size goes to infinity.

IV. CONCLUSION
We formulated a probability distribution for multivariate binary variables using the Grassmann number. By performing the summation over states as an integration over Grassmann variables, we derived analytical expressions for the partition function, the central moments, and the joint, marginal and conditional distributions. These distributions are expressed by the matrix of parameters Σ, which is analogous to the covariance matrix of the multivariate Gaussian distribution. The proposed method has many similarities to the multivariate Gaussian distribution. For example, a principal submatrix of Σ expresses the marginal distribution, whose diagonal elements express the mean parameters and whose products of off-diagonal elements express the covariances. The inverse matrix Λ = Σ^{−1} expresses the conditional distribution; its diagonal element expresses the inverse of the conditional mean, and its off-diagonal elements can be interpreted as a kind of partial correlation conditioned on all the other variables. The uncorrelatedness of variables in the marginal and conditional distributions is equivalent to unconditional and conditional independence, respectively. Furthermore, the joint probabilities corresponding to all 2^p possible states of a p-dimensional binary variable are expressed as all possible principal minors of the matrix Λ − I divided by det Λ. The property that all joint probabilities must be greater than or equal to zero can be rephrased in the terminology of linear algebra as the requirement that the matrix Λ − I be a P-matrix. These properties are in contrast to those of the conventional multivariate Bernoulli distribution, where the joint probabilities of all states are always greater than zero because the distribution is expressed as the exponential of a polynomial in the dummy variables.

We validated our method numerically. Since we have analytical expressions for the marginal and conditional distributions, we can easily generate random numbers for correlated binary variables by repeating Bernoulli trials. We investigated the sampling distributions of various statistics and MAP estimates using synthetic datasets. We demonstrated that the sampling distributions of various statistics are consistent with the theoretical predictions, and we observed that the MAP estimates are consistent and asymptotically normal. Since our method has many similarities to the multivariate Gaussian distribution, it has many potential applications in which the covariance structure of random variables is extensively utilized. Examples include hierarchical and non-hierarchical clustering, such as k-means clustering and mixture models, and anomaly detection such as Hotelling's T² method. There, methods based on the multivariate Gaussian distribution can be extended and applied to binary random variables. It will be important to demonstrate that the proposed method describes real data well. Our method may in turn also be useful for studying the behavior of gases or magnets in statistical physics, which have conventionally been analyzed using the Ising model.

A further direction of theoretical research is to develop an estimation procedure for the model parameters. In our method, we have to estimate model parameters for which the joint probabilities of all possible states are positive.
The parameter estimation procedure employed in this paper is limited to cases where the number of variables p is small, owing to computational complexity: the objective function, i.e., the logarithm of the posterior distribution, contains a term for each of the 2^p possible states. Hence, it is desirable to develop an efficient parameter estimation procedure that ensures the positivity of all joint probabilities when the number of variables is large, so as to take advantage of the analytical expressions our method provides. Another direction is to explore the theoretical properties of the sampling distribution of the estimated parameters, so as to enable statistical inference on the model parameters, such as hypothesis testing and confidence-interval estimation. Lastly, we expect our method to serve as a framework for dealing with qualitative random variables, as an alternative to the method of quantification.

[1] E. Ising, Z. Physik, 253-258 (1925).
[2] S. Geman and D. Geman, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-6, 721 (1984).
[3] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, Cognitive Science, 147 (1985).
[4] O. Banerjee, L. El Ghaoui, and A. d'Aspremont, J. Mach. Learn. Res., 485-516 (2008).
[5] L. Xue, H. Zou, and T. Cai, Ann. Statist., 1403 (2012).
[6] M. J. Wainwright and M. I. Jordan, Graphical Models, Exponential Families, and Variational Inference (2008).
[7] H. Höfling and R. Tibshirani, Journal of Machine Learning Research, 883 (2009).
[8] P. Ravikumar, M. J. Wainwright, and J. D. Lafferty, Ann. Statist., 1287 (2010).
[9] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing (MIT Press, Cambridge, MA, USA, 1999).
[10] J. Woods, IEEE Transactions on Automatic Control, 846 (1978).
[11] M. Hassner and J. Sklansky, Computer Graphics and Image Processing, 357 (1980).
[12] B. D. Ripley, Spatial Statistics (Wiley, New York, 1981).
[13] B. Dai, S. Ding, and G. Wahba, Bernoulli, 1465 (2013).
[14] M. E. Peskin and D. V. Schroeder, An Introduction to Quantum Field Theory (Addison-Wesley, Reading, USA, 1995).
[15] A. Berman and R. J. Plemmons,
Nonnegative Matrices in the Mathematical Sciences (Society for Industrial and Applied Mathematics, 1994).