Hierarchical Multinomial-Dirichlet model for the estimation of conditional probability tables
Laura Azzimonti
IDSIA - SUPSI/USI, Manno, [email protected]
Giorgio Corani
IDSIA - SUPSI/USI, Manno, [email protected]
Marco Zaffalon
IDSIA - SUPSI/USI, Manno, [email protected]
Abstract—We present a novel approach for estimating conditional probability tables, based on a joint, rather than independent, estimate of the conditional distributions belonging to the same table. We derive exact analytical expressions for the estimators and we analyse their properties both analytically and via simulation. We then apply this method to the estimation of parameters in a Bayesian network. Given the structure of the network, the proposed approach better estimates the joint distribution and significantly improves the classification performance with respect to traditional approaches.
Keywords—hierarchical Bayesian model; Bayesian networks; conditional probability estimation.
I. INTRODUCTION

A Bayesian network is a probabilistic model constituted by a directed acyclic graph (DAG) and a set of conditional probability tables (CPTs), one for each node. The CPT of node $X$ contains the conditional probability distributions of $X$ given each possible configuration of its parents. Usually all variables are discrete and the conditional distributions are estimated adopting a Multinomial-Dirichlet model, where the Dirichlet prior is characterised by the vector of hyper-parameters $\alpha$. Yet, Bayesian estimation of multinomials is sensitive to the choice of $\alpha$, and inappropriate values cause the estimator to perform poorly [1]. Mixtures of Dirichlet distributions have been recommended both in statistics [2], [3] and in machine learning [4] in order to obtain more robust estimates. Yet, mixtures of Dirichlet distributions are computationally expensive; this prevents them from being widely adopted. Another difficulty encountered in CPT estimation is the presence of rare events. Assuming that all variables have cardinality $k$ and that the number of parents is $q$, we need to estimate $k^q$ conditional distributions, one for each joint configuration of the parent variables. Frequently one or more of such configurations are rarely observed in the data, making their estimation challenging.

We propose to estimate the conditional distributions by adopting a novel approach, based on a hierarchical Multinomial-Dirichlet model. This model has two main characteristics. First, the prior of each conditional distribution is constituted by a mixture of Dirichlet distributions with parameter $\alpha$; the mixing is attained by treating $\alpha$ as a random variable with its own prior and posterior distribution. By estimating from data the posterior distribution of $\alpha$, we need not fix its value a priori. Instead we give more weight to the values of $\alpha$ that are more likely given the data. Secondly, the model is hierarchical since it assumes that the conditional distributions within the same CPT (but referring to different joint configurations of the parents) are drawn from the same mixture. The hierarchical model jointly estimates all the conditional distributions of the same CPT, called columns of the CPT. The joint estimation generates a flow of information between different columns of the CPT; thus the hierarchical model exploits the parameters learned for data-rich columns to improve the estimates of the parameters of data-poor columns. This is called borrowing statistical strength [5, Sec. 6.3.3.2] and it is well known within the literature of hierarchical models [6]. The literature of Bayesian networks also acknowledges [7, Sec. 17.5.4] that introducing dependencies between columns of the same CPT could improve the estimates, especially when dealing with sparse data. However, as far as we know, this work is the first practical application of joint estimation of the columns of the same CPT.

To tackle the problem of computational complexity we adopt a variational inference approach. Namely, we compute a factorised approximation of the posterior distribution that is highly efficient. Variational inference appears particularly well suited for hierarchical models; for instance the inference of Latent Dirichlet Allocation [8] is based on variational Bayes. By extensive experiments, we show that our novel approach considerably improves parameter estimation compared to the traditional approaches based on the Multinomial-Dirichlet model.
The experiments show large gains especially when dealing with small samples, while with large samples the effect of the prior vanishes, as expected.

The paper is organised as follows. Section II reviews the estimation of CPTs under the traditional Multinomial-Dirichlet model. Section III introduces the novel hierarchical model and provides an analytical study of the resulting estimator, proving that the novel hierarchical approach provides lower estimation error than the traditional approaches, under some mild assumptions on the generative model. Section IV presents simulation studies showing that, given the same network structure, hierarchical estimation yields both a better fit of the joint distribution and a consistent improvement in classification performance, with respect to the traditional estimation under parameter independence. Section V reports some concluding remarks.

II. ESTIMATION UNDER MULTINOMIAL-DIRICHLET MODEL

We want to induce a Bayesian network over the set of random variables $\mathcal{X} = \{X_1, \dots, X_I\}$. We assume that each variable $X_i \in \mathcal{X}$ is discrete and has $r_i$ possible values in the set $\mathcal{X}_i$. The parents of $X_i$ are denoted by $\mathrm{Pa}_i$ and they have $q_i$ possible joint states collected in the set $\mathcal{P}_i$. We denote by $\theta_{x|pa}$ the probability of $X_i$ being in state $x \in \mathcal{X}_i$ when its parent set is in state $pa \in \mathcal{P}_i$, i.e., $\theta_{x|pa} = p(X_i = x \mid \mathrm{Pa}_i = pa) > 0$. We denote by $\theta_{X_i|pa}$ the parameters of the conditional distribution of $X_i$ given $\mathrm{Pa}_i = pa$. A common assumption [7, Sec. 17] is that $\theta_{X_i|pa}$ is generated from a Dirichlet distribution with known parameters. The collection of the conditional probability distributions $\theta_{X_i} = (\theta_{X_i|pa_1}, \dots, \theta_{X_i|pa_{q_i}})$ constitutes the conditional probability table (CPT) of $X_i$. Each vector of type $\theta_{X_i|pa}$, with $pa \in \mathcal{P}_i$, is a column of the CPT.

The assumption of local parameter independence [7, Sec. 17] allows each parameter vector $\theta_{X_i|pa}$ to be estimated independently of the other parameter vectors. The assumed generative model, for all $i \in \{1, \dots, I\}$, is:

$$p(\theta_{X_i|pa}) = \mathrm{Dir}(s\alpha), \quad pa \in \mathcal{P}_i,$$
$$p(X_i \mid \mathrm{Pa}_i = pa, \theta_{X_i|pa}) = \mathrm{Cat}(\theta_{X_i|pa}), \quad pa \in \mathcal{P}_i,$$

where $s \in \mathbb{R}^+$ denotes the prior strength, also called equivalent sample size, and $\alpha \in \mathbb{R}^{r_i}$ is a parameter vector such that $\sum_{x \in \mathcal{X}_i} \alpha_x = 1$. The most common choice is to set $\alpha_x = 1/r_i$ and $s = 1/q_i$, which is called the BDeu prior [7, Sec. 17].

If there are no missing values in the data set $D$, the posterior expected value of $\theta_{x|pa}$ is [6]:

$$E[\theta_{x|pa}] = \frac{n_{x,pa} + s\alpha_x}{n_{pa} + s},$$

where $n_{x,pa}$ is the number of observations in $D$ characterised by $X_i = x$ and $\mathrm{Pa}_i = pa$, while $n_{pa} = \sum_{x \in \mathcal{X}_i} n_{x,pa}$.
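As a concrete illustration of this estimator, the following R sketch (ours, with illustrative names; not code from the paper) computes the posterior expected values of a CPT from a matrix of counts under a fixed prior vector $\alpha$:

```r
# Posterior mean of a CPT under the (non-hierarchical) Multinomial-Dirichlet model:
# E[theta_{x|pa}] = (n_{x,pa} + s * alpha_x) / (n_pa + s).
# 'counts' is an r x q matrix of counts n_{x,pa}; 'alpha' and 's' are the prior
# parameters (here a uniform alpha, as in the BDeu-style choice discussed above).
bayes_cpt <- function(counts, s = 1, alpha = rep(1 / nrow(counts), nrow(counts))) {
  n_pa <- colSums(counts)                       # n_pa for each parent configuration
  sweep(counts + s * alpha, 2, n_pa + s, "/")   # one column per parent configuration
}

# Example: a 3-state child with 2 parent configurations, one of them data-poor.
counts <- matrix(c(8, 1, 1,
                   1, 0, 0), nrow = 3)
bayes_cpt(counts, s = 1)
```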
III. HIERARCHICAL MODEL

The proposed hierarchical model estimates the conditional probability tables by removing the local independence assumption. In order to simplify the notation we present the model on a node $X$ with a single parent $Y$. $X$ has $r$ states in the set $\mathcal{X}$, while $Y$ has $q$ states in the set $\mathcal{Y}$. Lastly, we denote by $n_{xy}$ the number of observations with $X = x$ and $Y = y$, and by $n_y = \sum_{x \in \mathcal{X}} n_{xy}$ the number of observations with $Y = y$, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$.

As described in Section II, the vectors $\theta_{X_i|pa}$, for $i = 1, \dots, I$ and $pa \in \mathcal{P}_i$, are usually assumed to be independently drawn from a Dirichlet distribution with known parameter $\alpha$. On the contrary, the hierarchical model treats $\alpha$ as a hidden random vector, thus making different columns of the CPT dependent. Specifically, we assume $\alpha$ to be drawn from a higher-level Dirichlet distribution with hyper-parameter $\alpha_0$. We assume that $(x_k, y_k)$, for $k = 1, \dots, n$, are $n$ i.i.d. observations from the hierarchical Multinomial-Dirichlet model:

$$p(\alpha \mid \alpha_0) = \mathrm{Dirichlet}(\alpha_0),$$
$$p(\theta_{X|y} \mid s, \alpha) = \mathrm{Dirichlet}(s\alpha), \quad y \in \mathcal{Y}, \qquad (1)$$
$$p(X \mid Y = y, \theta_{X|y}) = \mathrm{Cat}(\theta_{X|y}), \quad y \in \mathcal{Y},$$

where $s \in \mathbb{R}^+$ is the equivalent sample size and $\alpha_0 \in \mathbb{R}^r$ is a vector of hyper-parameters.
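Model (1) can be written almost verbatim in Stan and fitted with the variational inference provided by rstan, as done in Section IV. The following is a minimal sketch of ours: the data layout, variable names and toy data are illustrative assumptions, not the authors' released code.

```r
library(rstan)

# Minimal Stan sketch of model (1): a shared alpha ~ Dirichlet(alpha0) ties together
# the q columns theta_{X|y} ~ Dirichlet(s * alpha) of the same CPT.
hier_code <- "
data {
  int<lower=1> r;                  // number of states of the child X
  int<lower=1> q;                  // number of states of the parent Y
  int<lower=0> n;                  // number of observations
  int<lower=1, upper=r> x[n];      // observed child states
  int<lower=1, upper=q> y[n];      // observed parent states
  vector<lower=0>[r] alpha0;       // hyper-parameter of the top-level Dirichlet
  real<lower=0> s;                 // equivalent sample size
}
parameters {
  simplex[r] alpha;                // hidden common prior mean of the columns
  simplex[r] theta[q];             // one conditional distribution per parent state
}
model {
  alpha ~ dirichlet(alpha0);
  for (j in 1:q) theta[j] ~ dirichlet(s * alpha);
  for (k in 1:n) x[k] ~ categorical(theta[y[k]]);
}
"

# Toy data: r = 3 child states, q = 2 parent states.
x <- c(1, 1, 1, 2, 3, 1, 2, 1, 1, 3); y <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2)
dat <- list(r = 3, q = 2, n = length(x), x = x, y = y,
            alpha0 = rep(1, 3), s = 3)

fit <- vb(stan_model(model_code = hier_code), data = dat, seed = 1)
post_means <- colMeans(as.matrix(fit))   # posterior means of alpha and theta
```

The posterior means of alpha and theta read off from the fit correspond to the point estimators $\hat\alpha$ and $\hat\theta_{X|y}$ studied below.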
A. Posterior moments for $\theta_{X|y}$

We now study the hierarchical model, deriving an analytical expression for the posterior average of $\theta_{x|y}$, which is the element $x$ of vector $\theta_{X|y}$, and for the posterior covariance between $\theta_{x|y}$ and $\theta_{x'|y'}$. To keep notation simple, in the following we do not write explicitly the conditioning with respect to the fixed parameters $s$ and $\alpha_0$. We introduce the notation $E_D[\cdot] = E[\cdot \mid D]$ to represent the posterior average and $\mathrm{Cov}_D(\cdot,\cdot) = \mathrm{Cov}(\cdot,\cdot \mid D)$ to represent the posterior covariance.

Definition 1. We define the pointwise estimator $\hat\theta_{x|y}$ of the parameter $\theta_{x|y}$ as its posterior average, i.e., $\hat\theta_{x|y} = E_D[\theta_{x|y}]$, and the pointwise estimator $\hat\alpha_x$ of the element $x$ of the parameter vector $\alpha$ as its posterior average, i.e., $\hat\alpha_x = E_D[\alpha_x]$.
Theorem 1. Under model (1), the posterior average of $\theta_{x|y}$ is

$$\hat\theta_{x|y} = E_D[\theta_{x|y}] = \frac{n_{xy} + s\hat\alpha_x}{n_y + s}, \qquad (2)$$

while the posterior covariance between $\theta_{x|y}$ and $\theta_{x'|y'}$ is

$$\mathrm{Cov}_D(\theta_{x|y}, \theta_{x'|y'}) = \delta_{yy'}\,\frac{\hat\theta_{x|y}\delta_{xx'} - \hat\theta_{x|y}\hat\theta_{x'|y}}{n_y + s + 1} + s^2\,\mathrm{Cov}_D(\alpha_x, \alpha_{x'})\,C_{yy'},$$

where $C_{yy'}$ is defined as

$$C_{yy'} = \begin{cases} \dfrac{1}{(n_y + s)(n_{y'} + s)} & \text{if } y \neq y', \\[2mm] \dfrac{1}{(n_y + s)(n_y + s + 1)} & \text{if } y = y'. \end{cases}$$

The posterior average and posterior covariance of $\alpha$ cannot be computed analytically. Some results concerning their expression and numerical computation, together with the complete proof of Theorem 1, are detailed in the Appendix.

Notice that the pointwise estimator $\hat\theta_{x|y}$ is a mixture of traditional Bayesian estimators obtained under (non-hierarchical) Multinomial-Dirichlet models with $\alpha$ fixed, i.e., $\frac{n_{xy} + s\alpha_x}{n_y + s}$. Indeed, thanks to the linearity in $\alpha_x$, we obtain

$$\hat\theta_{x|y} = \frac{n_{xy} + s\hat\alpha_x}{n_y + s} = \int \frac{n_{xy} + s\alpha_x}{n_y + s}\, p(\alpha \mid D)\, d\alpha.$$

This mixture gives more weight to the values of $\alpha$ that are more likely given the observations.
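Given an estimate $\hat\alpha$ (for instance the posterior mean returned by a variational fit of model (1)), equation (2) yields every column of the CPT in closed form. A minimal R sketch of ours, with illustrative names and an illustrative value of $\hat\alpha$:

```r
# Equation (2): the hierarchical estimate of a CPT only differs from the traditional
# Bayesian one in that the fixed prior vector alpha is replaced by the posterior mean
# alpha_hat. 'counts' is an r x q matrix of counts n_{xy}.
hier_cpt <- function(counts, s, alpha_hat) {
  n_y <- colSums(counts)
  sweep(counts + s * alpha_hat, 2, n_y + s, "/")
}

# Example: with a data-poor second column, the estimate is shrunk towards alpha_hat
# (learned mainly from the data-rich first column) rather than towards the uniform 1/r.
counts <- matrix(c(8, 1, 1, 1, 0, 0), nrow = 3)
hier_cpt(counts, s = 3, alpha_hat = c(0.7, 0.15, 0.15))   # alpha_hat: illustrative value
```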
B. Properties of the estimator $\hat\theta_{X|y}$

We now study the mean squared error (MSE) of $\hat\theta_{x|y}$ and compare it to the MSE of other traditional estimators. In order to study the MSE of $\hat\theta_{x|y}$ we need to assume the generative model

$$p(\theta_{X|y} \mid s, \tilde\alpha) = \mathrm{Dirichlet}(s\tilde\alpha), \quad y \in \mathcal{Y},$$
$$p(X \mid Y = y, \theta_{X|y}) = \mathrm{Cat}(\theta_{X|y}), \quad y \in \mathcal{Y}, \qquad (3)$$

where $s$ and $\tilde\alpha$ are the true underlying parameters. Moreover, since $\theta_{X|y}$ is a random vector, we define the MSE of an estimator $\bar\theta_{x|y}$ of the single component $\theta_{x|y}$ as

$$\mathrm{MSE}(\bar\theta_{x|y}) = E_\theta\!\left[E_n\!\left[(\bar\theta_{x|y} - \theta_{x|y})^2\right]\right], \qquad (4)$$

where $E_\theta[\cdot]$ and $E_n[\cdot]$ represent respectively the expected value with respect to $\theta_{x|y}$ and $n_{xy}$, and the MSE of the estimator $\bar\theta_{X|y}$ of the vector $\theta_{X|y}$ as $\mathrm{MSE}(\bar\theta_{X|y}) = \sum_{x \in \mathcal{X}} \mathrm{MSE}(\bar\theta_{x|y})$.

Notice that the generative model (3) is the traditional, thus non-hierarchical, Multinomial-Dirichlet model, which implies parameter independence. Hence, the traditional Bayesian estimator satisfies exactly the assumptions of this model. The Bayesian estimator is usually adopted by assuming $\tilde\alpha$ to be fixed to the values of a uniform distribution on $\mathcal{X}$, i.e., $\alpha_B = \frac{1}{r}\,\mathbf{1}_{1\times r}$, see e.g. [6]. However, since in general $\tilde\alpha \neq \frac{1}{r}\,\mathbf{1}_{1\times r}$, the traditional Bayesian approach generates biased estimates in small samples. On the contrary, the novel hierarchical approach estimates the unknown parameter vector $\tilde\alpha$ based on its posterior distribution. For this reason the proposed approach can provide estimates that are closer to the true underlying parameters than the traditional approaches, with a particular advantage in small samples.

In order to study the MSE of different estimators, we first consider an ideal shrinkage estimator

$$\theta^*_{X|y} = \tilde\omega_y\, \theta^{ML}_{X|y} + (1 - \tilde\omega_y)\, \tilde\alpha, \qquad (5)$$

where $\tilde\omega_y \in (0,1)$ and $\theta^{ML}_{x|y} = \frac{n_{xy}}{n_y}$ is the maximum-likelihood (ML) estimator, obtained by estimating each vector $\theta_{X|y}$ from the observations independently of the other vectors. This convex combination shrinks the ML estimator towards the true underlying parameter $\tilde\alpha$. Setting $\tilde\omega_y = \frac{n_y}{n_y + s}$, the ideal estimator corresponds to a Bayesian estimator with known parameter $\tilde\alpha$, i.e., $\theta^*_{x|y} = \frac{n_{xy} + s\tilde\alpha_x}{n_y + s}$. However, since $\tilde\alpha$ represents the true underlying parameter, which is usually unknown, the ideal estimator (5) is unfeasible. Yet it is useful, as it allows us to study the MSE.

The main result concerns the comparison, in terms of MSE, of the ideal estimator with the traditional (non-hierarchical) Bayesian estimator, which replaces $\tilde\alpha_x$ with the uniform value $1/r$, i.e., $\theta^B_{x|y} = \frac{n_{xy} + s/r}{n_y + s}$, see e.g. [6]. The traditional Bayesian estimator can be written as

$$\theta^B_{X|y} = \omega_y\, \theta^{ML}_{X|y} + (1 - \omega_y)\, \frac{1}{r}\,\mathbf{1}_{1\times r}, \qquad (6)$$

where $\omega_y = \frac{n_y}{n_y + s}$.
Theorem 2. Under the assumption that the true generative model is (3), the MSE of the ideal estimator is

$$\mathrm{MSE}(\theta^*_{x|y}) = \left(\tilde\omega_y^2\, \frac{s}{n_y} + (1 - \tilde\omega_y)^2\right) \frac{\tilde\alpha_x - \tilde\alpha_x^2}{s + 1},$$

while the MSE of the traditional Bayesian estimator is

$$\mathrm{MSE}(\theta^B_{x|y}) = \mathrm{MSE}(\theta^*_{x|y}) + (1 - \omega_y)^2 \left(\tilde\alpha_x - \frac{1}{r}\right)^2.$$

If $\tilde\omega_y = \omega_y$, then $\mathrm{MSE}(\theta^*_{X|y}) \leq \mathrm{MSE}(\theta^B_{X|y})$.

The proof is reported in the Appendix. Since in general $\tilde\alpha_x \neq 1/r$, the second term in the expression of $\mathrm{MSE}(\theta^B_{x|y})$ is positive and the ideal estimator achieves a smaller MSE than the traditional Bayesian estimator. To improve the estimates of the traditional Bayesian model in terms of MSE, we propose to act exactly on this second term. Specifically, we can achieve this purpose by estimating the parameter vector $\tilde\alpha$ from data, instead of considering it fixed.

The proposed hierarchical estimator defined in (2) has the same structure as the ideal estimator (5):

$$\hat\theta_{X|y} = \omega_y\, \theta^{ML}_{X|y} + (1 - \omega_y)\, \hat\alpha,$$

where $\omega_y = \frac{n_y}{n_y + s}$. This convex combination of $\theta^{ML}_{X|y}$ and $\hat\alpha$ shrinks the ML estimator towards the posterior average of $\alpha$, with a strength that is inversely proportional to $n_y$.

Contrary to the traditional Bayesian estimator (6), the hierarchical estimator provides an estimate of $\tilde\alpha$ that converges to the true underlying parameter as $n$ increases. Indeed, it is well known that the posterior average $E_D[\alpha]$ converges to the true underlying parameter $\tilde\alpha$ as $n$ goes to infinity, i.e., $\hat\alpha \to \tilde\alpha$ as $n \to +\infty$ [6]. As a consequence, $\hat\theta_{x|y}$ converges to $\theta^*_{x|y}$ and the MSE of $\hat\theta_{x|y}$ converges to the MSE of $\theta^*_{x|y}$, i.e., $\mathrm{MSE}(\hat\theta_{x|y}) \to \mathrm{MSE}(\theta^*_{x|y})$ as $n \to +\infty$.

For finite samples, $\mathrm{MSE}(\hat\theta_{x|y})$ differs from $\mathrm{MSE}(\theta^*_{x|y})$, since $\hat\theta_{x|y}$ includes an estimator of $\alpha$. Since in the hierarchical model we cannot compute this quantity analytically, we verify by simulation that the hierarchical estimator performs well in terms of MSE with respect to the traditional Bayesian estimators.

In conclusion, as we will show in the numerical experiments, the hierarchical model can achieve a smaller MSE than the traditional Bayesian estimators, in spite of the unfavourable conditions of the generative model (3). This gain is obtained thanks to the estimation of the parameter $\tilde\alpha$, rather than considering it fixed as in the traditional approaches. In conditions more general than (3), the true generative model might not satisfy parameter independence and the MSE gain of the hierarchical approach would further increase.

IV. EXPERIMENTS

In the experiments we compute the proposed hierarchical estimator using variational Bayes inference in R by means of the rstan package [9]. The variational Bayes estimates are practically equivalent to those yielded by Markov Chain Monte Carlo (MCMC), while variational inference is much more efficient than MCMC (less than a second for estimating a CPT compared to a couple of minutes for the MCMC method). In the following we report the results obtained via variational inference. The code is available at http://ipg.idsia.ch/software.php?id=139.
A. MSE analysis

In the first study we assess the performance of the hierarchical estimator in terms of MSE. We consider two different settings, in which we generate observations from model (3), where $\tilde\alpha$ is fixed. In the first setting (test 1) we sample $\tilde\alpha$ from a Dirichlet distribution with parameter vector $\mathbf{1}_{1\times r}$, while in the second setting (test 2) we sample it from a Dirichlet distribution whose parameters are all equal to a large common value. Under test 2 the parameters of the sampling distribution of $\tilde\alpha$ are very large and equal to each other, implying $\tilde\alpha_x \simeq 1/r$ for all $x \in \mathcal{X}$. For this reason, test 2 is the ideal setting for the traditional Bayesian estimator, while test 1 is the ideal setting for the hierarchical estimator.

In both test 1 and test 2 we consider all the possible combinations of $r$ (the number of states of $X$) and $q$ (the number of conditioning states), with $q \in \{2, 4, 6, 8\}$. For each combination of $r$ and $q$, for both test 1 and test 2, we generate data sets with size $n \in \{20, 40, 80, 160, 320, 640\}$. We repeat the data sampling and the estimation procedure 10 times for each combination of $r$, $q$ and $n$. Then we compare the estimates yielded by the hierarchical estimator, with $s = r$ and $\alpha_0 = \mathbf{1}_{1\times r}$, and by the traditional Bayesian estimator assuming parameter independence, with $s = r$. We compare the performance of the different estimators by computing the difference in terms of average MSE, defined as $\mathrm{MSE}(\bar\theta_{X|Y}) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} \mathrm{MSE}(\bar\theta_{x|y}) / (rq)$ for an estimator $\bar\theta_{X|Y}$. In every repetition of the experiment we then compute $\mathrm{MSE}(\theta^B_{X|Y}) - \mathrm{MSE}(\hat\theta_{X|Y})$, which is represented in Figure 1.

The results show that the hierarchical estimator mostly provides better or equivalent results in comparison to $\theta^B_{X|Y}$, especially for small $n$ and/or large $q$. In a Bayesian network it is usual to have large values of $q$, since $q$ represents the cardinality of the set of joint states of the parents. In test 1 (light blue boxplots) the advantage of the hierarchical estimator over the Bayesian one is generally large, as expected. The advantage of the hierarchical model steadily increases as $q$ increases, becoming relevant for the largest values of $q$. For large $n$ the gap between the two estimators vanishes, although it is more persistent when dealing with large $q$. Interestingly, in test 2 (green boxplots), the traditional Bayesian estimator is just slightly better than the hierarchical one, even though the former is derived exactly from the true generative model. The traditional estimator has a small advantage only for the smallest values of $q$ and $n$, and this advantage quickly decreases if either $q$ or $n$ increases.
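The following R sketch reproduces the spirit of test 1 in a simplified form (our own construction: it allocates $n/q$ observations to each parent state and compares the traditional Bayesian estimator with the ideal estimator of Section III-B, which the hierarchical estimator approaches as $n$ grows):

```r
# Simulation in the spirit of test 1: sample a true alpha_tilde, then the CPT columns
# from Dirichlet(s * alpha_tilde) and the counts; compare the traditional Bayesian
# estimator (fixed uniform alpha) with the ideal estimator that knows alpha_tilde.
set.seed(1)
rdirichlet <- function(a) { g <- rgamma(length(a), a); g / sum(g) }

r <- 4; q <- 8; s <- r; n <- 80; n_rep <- 200
mse_bayes <- mse_ideal <- 0
for (b in 1:n_rep) {
  alpha_tilde <- rdirichlet(rep(1, r))                 # true alpha, as in test 1
  for (y in 1:q) {
    theta <- rdirichlet(s * alpha_tilde)               # true column theta_{X|y}
    counts_y <- rmultinom(1, n %/% q, theta)[, 1]      # n/q observations per parent state
    theta_b    <- (counts_y + s / r)           / (sum(counts_y) + s)  # eq. (6)
    theta_star <- (counts_y + s * alpha_tilde) / (sum(counts_y) + s)  # eq. (5)
    mse_bayes <- mse_bayes + sum((theta_b    - theta)^2)
    mse_ideal <- mse_ideal + sum((theta_star - theta)^2)
  }
}
c(bayes = mse_bayes, ideal = mse_ideal) / (n_rep * q * r)   # average MSE per entry
```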
B. Joint distribution fitting

In the second study we assess the performance of the hierarchical estimator in the recovery of the joint distribution of a given Bayesian network. We consider 5 data sets from the UCI Machine Learning Repository:
Adult, Letter Recognition, Nursery, Pen-Based Recognition of Handwritten Digits and
Spambase. We discretise all numerical variables into five equal-frequency bins and we consider only instances without missing values. For each data set we first learn, from all the available data, the associated directed acyclic graph (DAG) by means of a hill-climbing greedy search, as implemented in the bnlearn package [10]. We then keep this structure fixed for all the experiments referring to the same data set, since our focus is not structural learning. Then, for each data set and for each $n \in \{20, 40, 80, 160, 320, 640, 1280\}$ we repeat 10 times the procedure of 1) sampling $n$ observations from the data set and 2) estimating the CPTs. We perform the estimation using the proposed hierarchical approach, with $s = r$ and $\alpha_0 = \mathbf{1}_{1\times r}$, and the traditional BDeu prior (Bayesian estimation under parameter independence) with $s = 1$ and $s = 10$. The choice $s = 1$ is the most commonly adopted in practice, while $s = 10$ is the default value proposed by the bnlearn package. Conversely, we did not offer any choice to the hierarchical model. Indeed, we set the smoothing factor $s$ of the proposed model to the number of states of the child variable, which has the same order of magnitude as the smoothing factors used in the traditional Bayesian approach. In spite of the more limited choice for the parameter $s$, the hierarchical estimator consistently outperforms the traditional Bayesian estimator, regardless of whether the latter adopts a smoothing factor of 1 or 10.

We then measure the log-likelihood of all the instances included in the test set, where the test set contains all the instances not present in the training set. We report in the top panels of Figure 2 the difference between the log-likelihood of the hierarchical approach and the log-likelihood of Bayesian estimation under local parameter independence, i.e., the log-likelihood ratio. The log-likelihood ratio approximates the ratio between the Kullback-Leibler (KL) divergences of the two models; the KL divergence of a given model measures the distance between the estimated and the true underlying joint distribution.

The log-likelihood ratios obtained in the experiments are extremely large on small sample sizes, being larger than one thousand on all data sets (Figure 2, top panels). This shows the huge gain delivered by the hierarchical approach when dealing with small data sets. This happens regardless of the equivalent sample size used to perform Bayesian estimation under parameter independence, although the size of the gain does depend on the value of $s$ adopted by the BDeu prior.
Figure 1: Boxplots of the MSE difference between the Bayesian ($s = r$) and the hierarchical estimator in test 1 (light blue) and test 2 (green) with different dimensions of the conditioning set ($q = 2, 4, 6, 8$). Positive values favour the hierarchical model.

The lowest gains are obtained on the data set Nursery; the reason is that the DAG of
Nursery has the lowest number of parents per node (0.9 on average, compared to about twice as much for the other DAGs). Thus, this data set is the least challenging from the parameter estimation viewpoint. We point out that significant likelihood ratios are obtained in general even for large samples, even though they are not apparent from the figure due to the scale. For instance, for the largest sample size considered the log-likelihood ratios range from 50 (Nursery) to 85000 (Letter).
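For reference, the baseline part of this pipeline can be sketched in R with bnlearn as below; the hierarchical approach is obtained by replacing the bn.fit step with the CPTs estimated from model (1). The sketch uses bnlearn's built-in learning.test data set instead of the UCI data sets, purely to be self-contained.

```r
library(bnlearn)

# Sketch of the baseline pipeline on bnlearn's built-in discrete data set
# 'learning.test' (the paper uses the five UCI data sets instead).
data(learning.test)
set.seed(1)
train_idx <- sample(nrow(learning.test), 160)         # a small training sample
train <- learning.test[train_idx, ]
test  <- learning.test[-train_idx, ]

dag <- hc(learning.test)                              # structure learned once on all data

# Traditional Bayesian estimation of the CPTs under a BDeu-style prior.
fit_bdeu_1  <- bn.fit(dag, data = train, method = "bayes", iss = 1)
fit_bdeu_10 <- bn.fit(dag, data = train, method = "bayes", iss = 10)

# Test-set log-likelihood; the hierarchical approach is compared by plugging its
# CPTs into a network with the same structure.
logLik(fit_bdeu_1, test)
logLik(fit_bdeu_10, test)
```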
C. Classification
In the third study we assess the performance of the hierarchical estimator in terms of classification. We consider the same data sets as in the previous experiment, discretised in the same way.

For each data set we first learn the Tree-Augmented Naive Bayes (TAN) structure by means of the bnlearn package. The networks are estimated on the basis of all the available samples and are kept fixed for all the experiments referring to the same data set, since our focus is not structural learning. Then, for each data set and for each $n \in \{20, 40, 80, 160, 320, 640, 1280\}$, we sample $n$ observations. We then estimate the CPTs of the Bayesian network from the sampled data by means of the hierarchical estimator ($s = r$ and $\alpha_0 = \mathbf{1}_{1\times r}$) and the traditional Bayesian estimators obtained under a BDeu prior ($s = 1$ and $s = 10$). We repeat the sampling and the estimation steps 10 times for each value of $n$ and each data set. We then classify each instance of the test set, which contains 1000 instances sampled uniformly from all the instances not included in the training set. We assess the classification performance by measuring the accuracy and the area under the ROC curve (ROC AUC) of the classifiers.

In the central panels of Figure 2 we report the difference in accuracy between the hierarchical estimator and the traditional Bayesian ones, while in the bottom panels of the same figure we report the difference in ROC AUC between the same classifiers. The area under the ROC curve is a more sensitive indicator of the correctness of the estimated posterior probabilities than accuracy. According to Figure 2 (bottom panels), the hierarchical approach yields consistently higher ROC AUC than both the BDeu classifiers. In the smallest samples the increase of ROC AUC over both BDeu priors amounts to several points. As $n$ increases this gain in ROC AUC tends to vanish. However, even for larger $n$ the gain in ROC AUC for the data sets Adult and
Letter ranges between 1 and 5 points.

Figure 2 (central panels) also shows improvements in accuracy, even if this indicator is less sensitive to the estimated posterior probabilities than the area under the ROC curve. Indeed, when computing the accuracy, the most probable class is compared to the actual class, without paying further attention to its posterior probability. In small samples there is an average increase of accuracy of a few points compared to the BDeu priors with $s = 1$ and $s = 10$. The accuracy improvement tends to decrease as $n$ increases; yet on both the Adult and
Letter data sets an accuracy improvement of about 1-2 points is visible even for the largest sample size.

V. CONCLUSIONS

We have presented a novel approach for estimating the conditional probability tables of a Bayesian network by relaxing the local independence assumption. Given the same network structure, the novel approach yields a consistently better fit to the joint distribution than the traditional Bayesian estimation under parameter independence; it also improves classification performance. Moreover, the adoption of variational inference makes the proposed method competitive in terms of computational cost with respect to the traditional Bayesian estimation.

APPENDIX
In order to prove Theorem 1, we first need to derive some results concerning the posterior moments of the vector $\alpha$, whose generic element $\alpha_x$ is associated to state $x \in \mathcal{X}$ of the variable $X$. Given a data set $D$, the $k$-th posterior moment of the element $\alpha_x$ is $E[\alpha_x^k \mid D] = \int \alpha_x^k\, p(\alpha \mid D)\, d\alpha$.
Figure 2: Boxplots of the logarithm of the likelihood ratio (top panels), accuracy gain (central panels) and area under the ROC gain (bottom panels) obtained comparing the hierarchical method with BDeu ($s = 1$ and $s = 10$) on the five machine learning data sets analysed. Positive values favour the hierarchical model.

The following lemma states a general result for computing any posterior moment of $\alpha$, whose general expression is $E[\prod_{x \in \mathcal{X}} \alpha_x^{k_x} \mid D]$, where $k_x \in \mathbb{N}$ represents the power of element $\alpha_x$.
Lemma 1. Under the assumptions of model (1), the posterior average of the quantity $\prod_{x \in \mathcal{X}} \alpha_x^{k_x}$, with $k_x \in \mathbb{N}$ for all $x \in \mathcal{X}$, is

$$E_D\!\left[\prod_{x \in \mathcal{X}} \alpha_x^{k_x}\right] = \gamma \int \prod_{x \in \mathcal{X}} \prod_{\substack{y \in \mathcal{Y}\\ \text{s.t. } n_{xy} > 0}} \prod_{\nu=1}^{n_{xy}} (s\alpha_x + \nu - 1)\; \alpha_x^{\tilde k_x}\, d\alpha, \qquad (7)$$

where $\tilde k_x = \alpha_{0,x} + k_x - 1$ and $\gamma$ is a proportionality constant such that

$$\gamma^{-1} = \int \prod_{x \in \mathcal{X}} \prod_{\substack{y \in \mathcal{Y}\\ \text{s.t. } n_{xy} > 0}} \prod_{\nu=1}^{n_{xy}} (s\alpha_x + \nu - 1)\; \alpha_x^{\alpha_{0,x} - 1}\, d\alpha. \qquad (8)$$

The element $x'$ of the posterior average vector $E_D[\alpha]$ is

$$\hat\alpha_{x'} = \gamma \int \prod_{x \in \mathcal{X}} \prod_{\substack{y \in \mathcal{Y}\\ \text{s.t. } n_{xy} > 0}} \prod_{\nu=1}^{n_{xy}} (s\alpha_x + \nu - 1)\; \alpha_x^{\delta_{x,x'}}\, d\alpha, \qquad (9)$$

where $\delta_{x,x'}$ is a Kronecker delta. The element $(x', x'')$ of the posterior covariance matrix $\mathrm{Cov}_D(\alpha)$ is

$$\mathrm{Cov}_D(\alpha_{x'}, \alpha_{x''}) = E_D[\alpha_{x'}\alpha_{x''}] - \hat\alpha_{x'}\hat\alpha_{x''}, \qquad (10)$$

where

$$E_D[\alpha_{x'}\alpha_{x''}] = \gamma \int \prod_{x \in \mathcal{X}} \prod_{\substack{y \in \mathcal{Y}\\ \text{s.t. } n_{xy} > 0}} \prod_{\nu=1}^{n_{xy}} (s\alpha_x + \nu - 1)\; \alpha_x^{\delta_{x,x'} + \delta_{x,x''}}\, d\alpha.$$

Both integrals in (7) and (8) are multiple integrals computed with respect to the $r$ elements of vector $\alpha$, such that $\sum_{x \in \mathcal{X}} \alpha_x = 1$. The space of integration is thus the standard $r$-simplex.
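Since the integrals in (7)-(10) have no closed form, they must be evaluated numerically. As a simple illustration (ours; the paper relies on variational inference instead), (9) can be approximated by self-normalised Monte Carlo with the prior $\mathrm{Dirichlet}(\alpha_0)$ as proposal, weighting each draw by $w(\alpha) = \prod_{x} \prod_{y:\, n_{xy}>0} \prod_{\nu=1}^{n_{xy}} (s\alpha_x + \nu - 1)$, which by (12) below is proportional to the posterior-to-prior density ratio:

```r
# Self-normalised Monte Carlo estimate of alpha_hat = E_D[alpha] from equation (9):
# draw alpha ~ Dirichlet(alpha0) and weight each draw by
# w(alpha) = prod_x prod_{y: n_xy>0} prod_{nu=1}^{n_xy} (s*alpha_x + nu - 1).
set.seed(1)
rdirichlet <- function(a) { g <- rgamma(length(a), a); g / sum(g) }

log_w <- function(alpha, counts, s) {
  lw <- 0
  for (x in seq_len(nrow(counts)))
    for (y in seq_len(ncol(counts)))
      if (counts[x, y] > 0)
        lw <- lw + sum(log(s * alpha[x] + seq_len(counts[x, y]) - 1))
  lw
}

alpha_hat_mc <- function(counts, s, alpha0, m = 5000) {
  draws <- t(replicate(m, rdirichlet(alpha0)))          # m draws from the prior
  lw <- apply(draws, 1, log_w, counts = counts, s = s)
  w <- exp(lw - max(lw))                                 # stabilised weights
  colSums(draws * w) / sum(w)                            # weighted average = alpha_hat
}

counts <- matrix(c(8, 1, 1, 1, 0, 0), nrow = 3)          # toy CPT counts
alpha_hat_mc(counts, s = 3, alpha0 = rep(1, 3))
```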
Proof of Lemma 1: Under the assumptions of model (1), the joint posterior density of $\alpha, \theta_{X|y_1}, \dots, \theta_{X|y_q}$ is

$$p(\alpha, \theta_{X|y_1}, \dots, \theta_{X|y_q} \mid D) \propto \left(\frac{\Gamma(s)}{\prod_{x \in \mathcal{X}} \Gamma(s\alpha_x)}\right)^{q} \prod_{y \in \mathcal{Y}} \prod_{x \in \mathcal{X}} (\theta_{x|y})^{n_{xy} + s\alpha_x - 1} \prod_{x \in \mathcal{X}} \alpha_x^{\alpha_{0,x} - 1}.$$

Marginalising $p(\alpha, \theta_{X|y_1}, \dots, \theta_{X|y_q} \mid D)$ with respect to $\theta_{X|y_1}, \dots, \theta_{X|y_q}$, we obtain the marginal posterior density of $\alpha$, i.e.,

$$p(\alpha \mid D) \propto \left(\frac{\Gamma(s)}{\prod_{x \in \mathcal{X}} \Gamma(s\alpha_x)}\right)^{q} \prod_{y \in \mathcal{Y}} \frac{\prod_{x \in \mathcal{X}} \Gamma(s\alpha_x + n_{xy})}{\Gamma(s + n_y)} \prod_{x \in \mathcal{X}} \alpha_x^{\alpha_{0,x} - 1}. \qquad (11)$$

Thanks to the well-known property of the Gamma function $\Gamma(\alpha + m) = \prod_{\nu=1}^{m} (\alpha + \nu - 1)\, \Gamma(\alpha)$, for $m \geq 1$, we can write the posterior marginal density (11) as

$$p(\alpha \mid D) \propto \prod_{x \in \mathcal{X}} \prod_{\substack{y \in \mathcal{Y}\\ \text{s.t. } n_{xy} > 0}} \prod_{\nu=1}^{n_{xy}} (s\alpha_x + \nu - 1)\; \alpha_x^{\alpha_{0,x} - 1}. \qquad (12)$$

The proportionality constant of the posterior marginal density is obtained by integrating the right-hand side of (12) with respect to the $r$ elements of $\alpha$, such that $\sum_{x \in \mathcal{X}} \alpha_x = 1$. The resulting proportionality constant is thus

$$\gamma = \left(\int \prod_{x \in \mathcal{X}} \prod_{\substack{y \in \mathcal{Y}\\ \text{s.t. } n_{xy} > 0}} \prod_{\nu=1}^{n_{xy}} (s\alpha_x + \nu - 1)\; \alpha_x^{\alpha_{0,x} - 1}\, d\alpha\right)^{-1}.$$

The posterior average of the quantity $\prod_{x \in \mathcal{X}} \alpha_x^{k_x}$ can be derived directly from the posterior marginal density of $\alpha$ as

$$E_D\!\left[\prod_{x \in \mathcal{X}} \alpha_x^{k_x}\right] = \gamma \int \prod_{x \in \mathcal{X}} \prod_{\substack{y \in \mathcal{Y}\\ \text{s.t. } n_{xy} > 0}} \prod_{\nu=1}^{n_{xy}} (s\alpha_x + \nu - 1)\; \alpha_x^{\tilde k_x}\, d\alpha,$$

where $\tilde k_x = \alpha_{0,x} + k_x - 1$. In the special case of $\alpha_{0,x} = 1$ for all $x \in \mathcal{X}$, we have $\tilde k_x = k_x$. The posterior average of $\alpha_{x'}$ is obtained directly from (7) by choosing $k_x = \delta_{x,x'}$, i.e., $k_x = 1$ for $x = x'$ and $k_x = 0$ for all $x \neq x'$, while the posterior average of the product $\alpha_{x'}\alpha_{x''}$ is obtained directly from (7) by choosing $k_x = \delta_{x,x'} + \delta_{x,x''}$, i.e., $k_x = 1$ for $x \in \{x', x''\}$ and $k_x = 0$ for all $x \notin \{x', x''\}$.
Proof of Theorem 1: Given $\alpha$, the marginal posterior density of $\theta_{X|y}$ is a Dirichlet distribution with parameters $s\alpha + \mathbf{n}_y$, where $\mathbf{n}_y = (n_{x_1 y}, \dots, n_{x_r y})'$. It is thus easy to compute

$$E[\theta_{x|y} \mid \alpha, D] = E_{\alpha,D}[\theta_{x|y}] = \frac{n_{xy} + s\alpha_x}{n_y + s},$$

where $x \in \mathcal{X}$, $y \in \mathcal{Y}$, and $E_{\alpha,D}[\cdot] = E[\cdot \mid \alpha, D]$. The posterior average of $\theta_{x|y}$ can thus be computed by means of the law of total expectation as

$$E_D[\theta_{x|y}] = E_D\!\left[E_{\alpha,D}[\theta_{x|y}]\right] = \frac{n_{xy} + s\, E_D[\alpha_x]}{n_y + s}.$$

The posterior average of $\alpha_x$ is obtained directly from Lemma 1.

In order to compute the posterior covariance between $\theta_{x|y}$ and $\theta_{x'|y'}$ we can use the law of total covariance, i.e.,

$$\mathrm{Cov}_D(\theta_{x|y}, \theta_{x'|y'}) = \mathrm{Cov}_D\!\left(E_{\alpha,D}[\theta_{x|y}], E_{\alpha,D}[\theta_{x'|y'}]\right) + E_D\!\left[\mathrm{Cov}_{\alpha,D}(\theta_{x|y}, \theta_{x'|y'})\right].$$

The first quantity is:

$$\mathrm{Cov}_D\!\left(E_{\alpha,D}[\theta_{x|y}], E_{\alpha,D}[\theta_{x'|y'}]\right) = \mathrm{Cov}_D\!\left(\frac{n_{xy} + s\alpha_x}{n_y + s}, \frac{n_{x'y'} + s\alpha_{x'}}{n_{y'} + s}\right) = \frac{s^2\, \mathrm{Cov}_D(\alpha_x, \alpha_{x'})}{(n_y + s)(n_{y'} + s)}.$$

If $y' = y$, the second quantity is:

$$E_D\!\left[\mathrm{Cov}_{\alpha,D}(\theta_{x|y}, \theta_{x'|y})\right] = E_D\!\left[\frac{(n_{xy} + s\alpha_x)\left((n_y + s)\delta_{xx'} - (n_{x'y} + s\alpha_{x'})\right)}{(n_y + s)^2 (n_y + s + 1)}\right]$$
$$= \frac{(n_{xy} + s\, E_D[\alpha_x])\,\delta_{xx'}}{(n_y + s)(n_y + s + 1)} - \frac{E_D\!\left[(n_{xy} + s\alpha_x)(n_{x'y} + s\alpha_{x'})\right]}{(n_y + s)^2 (n_y + s + 1)}$$
$$= \frac{\hat\theta_{x|y}\delta_{xx'} - \hat\theta_{x|y}\hat\theta_{x'|y}}{n_y + s + 1} - \frac{s^2\left(E_D[\alpha_x\alpha_{x'}] - E_D[\alpha_x]E_D[\alpha_{x'}]\right)}{(n_y + s)^2 (n_y + s + 1)}$$
$$= \frac{\hat\theta_{x|y}\delta_{xx'} - \hat\theta_{x|y}\hat\theta_{x'|y}}{n_y + s + 1} - \frac{s^2\, \mathrm{Cov}_D(\alpha_x, \alpha_{x'})}{(n_y + s)^2 (n_y + s + 1)}.$$

Otherwise, if $y' \neq y$, $E_D[\mathrm{Cov}_{\alpha,D}(\theta_{x|y}, \theta_{x'|y'})] = 0$, since $\theta_{X|y} \perp\!\!\!\perp \theta_{X|y'}$ given $\alpha$. Exploiting the law of total covariance, we obtain

$$\mathrm{Cov}_D(\theta_{x|y}, \theta_{x'|y'}) = \frac{s^2\, \mathrm{Cov}_D(\alpha_x, \alpha_{x'})}{(n_y + s)(n_{y'} + s)} + \delta_{yy'}\left(\frac{\hat\theta_{x|y}\delta_{xx'} - \hat\theta_{x|y}\hat\theta_{x'|y}}{n_y + s + 1} - \frac{s^2\, \mathrm{Cov}_D(\alpha_x, \alpha_{x'})}{(n_y + s)^2 (n_y + s + 1)}\right).$$

The posterior covariance between $\alpha_x$ and $\alpha_{x'}$ is obtained directly from Lemma 1.
Proof of Theorem 2: Exploiting the linearity of the ideal estimator, we obtain

$$\theta^*_{x|y} - \theta_{x|y} = \tilde\omega_y\left(\theta^{ML}_{x|y} - \theta_{x|y}\right) + (1 - \tilde\omega_y)\left(\tilde\alpha_x - \theta_{x|y}\right).$$

The MSE of $\theta^*_{x|y}$ is thus

$$\mathrm{MSE}(\theta^*_{x|y}) = \tilde\omega_y^2\, \mathrm{MSE}(\theta^{ML}_{x|y}) + (1 - \tilde\omega_y)^2\, \mathrm{MSE}(\tilde\alpha_x) + 2\tilde\omega_y(1 - \tilde\omega_y)\, E_\theta\!\left[E_n\!\left[(\theta^{ML}_{x|y} - \theta_{x|y})(\tilde\alpha_x - \theta_{x|y})\right]\right].$$

Using the definition of MSE (4) and the assumptions of model (3), i.e., $E_n[n_{xy}] = n_y\theta_{x|y}$, $E_\theta[\theta_{x|y}] = \tilde\alpha_x$ and $\mathrm{Var}_\theta(\theta_{x|y}) = \frac{\tilde\alpha_x - \tilde\alpha_x^2}{s + 1}$, we obtain

$$\mathrm{MSE}(\theta^{ML}_{x|y}) = E_\theta\!\left[E_n\!\left[\left(\frac{n_{xy}}{n_y} - \theta_{x|y}\right)^2\right]\right] = E_\theta\!\left[\frac{1}{n_y^2}\,\mathrm{Var}_n(n_{xy})\right] = \frac{1}{n_y}\, E_\theta\!\left[\theta_{x|y}(1 - \theta_{x|y})\right]$$
$$= \frac{1}{n_y}\left(E_\theta[\theta_{x|y}] - E_\theta[\theta_{x|y}]^2 - \mathrm{Var}_\theta(\theta_{x|y})\right) = \frac{1}{n_y}\left(\tilde\alpha_x - \tilde\alpha_x^2 - \frac{\tilde\alpha_x - \tilde\alpha_x^2}{s + 1}\right) = \frac{s}{n_y}\,\frac{\tilde\alpha_x - \tilde\alpha_x^2}{s + 1}. \qquad (13)$$

This quantity corresponds to the first term of $\mathrm{MSE}(\theta^*_{x|y})$. The second term is obtained as

$$\mathrm{MSE}(\tilde\alpha_x) = E_\theta\!\left[E_n\!\left[(\tilde\alpha_x - \theta_{x|y})^2\right]\right] = E_\theta\!\left[(\tilde\alpha_x - \theta_{x|y})^2\right] = \mathrm{Var}_\theta(\theta_{x|y}) = \frac{\tilde\alpha_x - \tilde\alpha_x^2}{s + 1},$$

since $\tilde\alpha_x - \theta_{x|y}$ is independent of $n_{xy}$ and $E_\theta[\theta_{x|y}] = \tilde\alpha_x$. The last term, $E_\theta[E_n[(\theta^{ML}_{x|y} - \theta_{x|y})(\tilde\alpha_x - \theta_{x|y})]] = 0$, since $\tilde\alpha_x - \theta_{x|y}$ is independent of $n_{xy}$ and $E_n[\theta^{ML}_{x|y}] = \theta_{x|y}$. The MSE of the ideal estimator is thus

$$\mathrm{MSE}(\theta^*_{x|y}) = \tilde\omega_y^2\, \frac{s}{n_y}\, \frac{\tilde\alpha_x - \tilde\alpha_x^2}{s + 1} + (1 - \tilde\omega_y)^2\, \frac{\tilde\alpha_x - \tilde\alpha_x^2}{s + 1} = \left(\tilde\omega_y^2\, \frac{s}{n_y} + (1 - \tilde\omega_y)^2\right) \frac{\tilde\alpha_x - \tilde\alpha_x^2}{s + 1}.$$

If $\tilde\omega_y = \frac{n_y}{n_y + s}$ and $s > 0$, for all $x \in \mathcal{X}$,

$$\mathrm{MSE}(\theta^*_{x|y}) = \left(\frac{n_y^2}{(n_y + s)^2}\, \frac{s}{n_y} + \frac{s^2}{(n_y + s)^2}\right) \frac{\tilde\alpha_x - \tilde\alpha_x^2}{s + 1} = \frac{s}{n_y + s}\, \frac{\tilde\alpha_x - \tilde\alpha_x^2}{s + 1} < \frac{s}{n_y}\, \frac{\tilde\alpha_x - \tilde\alpha_x^2}{s + 1} = \mathrm{MSE}(\theta^{ML}_{x|y}).$$

Thus $\sum_{x \in \mathcal{X}} \mathrm{MSE}(\theta^*_{x|y}) < \sum_{x \in \mathcal{X}} \mathrm{MSE}(\theta^{ML}_{x|y})$.

The MSE of the Bayesian estimator (6) is:

$$\mathrm{MSE}(\theta^B_{x|y}) = \omega_y^2\, \mathrm{MSE}(\theta^{ML}_{x|y}) + (1 - \omega_y)^2\, \mathrm{MSE}\!\left(\tfrac{1}{r}\right) + 2\omega_y(1 - \omega_y)\, E_\theta\!\left[E_n\!\left[(\theta^{ML}_{x|y} - \theta_{x|y})\left(\tfrac{1}{r} - \theta_{x|y}\right)\right]\right].$$

The first term is derived in (13). The second term corresponds to

$$\mathrm{MSE}\!\left(\tfrac{1}{r}\right) = E_\theta\!\left[\left(\tfrac{1}{r} - \tilde\alpha_x\right)^2\right] + E_\theta\!\left[(\tilde\alpha_x - \theta_{x|y})^2\right] = \left(\tilde\alpha_x - \tfrac{1}{r}\right)^2 + \mathrm{Var}_\theta(\theta_{x|y}) = \left(\tilde\alpha_x - \tfrac{1}{r}\right)^2 + \frac{\tilde\alpha_x - \tilde\alpha_x^2}{s + 1},$$

since $\tfrac{1}{r} - \theta_{x|y}$ is independent of $n_{xy}$ and $E_\theta[\theta_{x|y}] = \tilde\alpha_x$. The last term, $E_\theta[E_n[(\theta^{ML}_{x|y} - \theta_{x|y})(\tfrac{1}{r} - \theta_{x|y})]] = 0$, since $\tfrac{1}{r} - \theta_{x|y}$ is independent of $n_{xy}$ and $E_n[\theta^{ML}_{x|y}] = \theta_{x|y}$. The MSE of the Bayesian estimator is thus

$$\mathrm{MSE}(\theta^B_{x|y}) = \left(\omega_y^2\, \frac{s}{n_y} + (1 - \omega_y)^2\right) \frac{\tilde\alpha_x - \tilde\alpha_x^2}{s + 1} + (1 - \omega_y)^2 \left(\tilde\alpha_x - \tfrac{1}{r}\right)^2 = \mathrm{MSE}(\theta^*_{x|y}) + (1 - \omega_y)^2 \left(\tilde\alpha_x - \tfrac{1}{r}\right)^2,$$

if $\tilde\omega_y = \omega_y$. Thus $\mathrm{MSE}(\theta^*_{x|y}) \leq \mathrm{MSE}(\theta^B_{x|y})$ for all $x \in \mathcal{X}$. The two estimators have the same MSE in the special case of $\tilde\alpha_x = \tfrac{1}{r}$. As a consequence, $\sum_{x \in \mathcal{X}} \mathrm{MSE}(\theta^*_{x|y}) \leq \sum_{x \in \mathcal{X}} \mathrm{MSE}(\theta^B_{x|y})$, with equality if $\tilde\alpha = \tfrac{1}{r}\cdot\mathbf{1}_{1\times r}$.

For finite samples, $\mathrm{MSE}(\hat\theta_{x|y})$ differs from $\mathrm{MSE}(\theta^*_{x|y})$, since $\hat\theta_{x|y}$ includes an estimator of $\alpha$. In particular,

$$\mathrm{MSE}(\hat\theta_{x|y}) = \mathrm{MSE}(\theta^*_{x|y}) + (1 - \omega_y)^2\, \mathrm{MSE}(\hat\alpha_x) + 2\omega_y(1 - \omega_y)\, E_\theta\!\left[E_n\!\left[(\theta^{ML}_{x|y} - \theta_{x|y})(\hat\alpha_x - \tilde\alpha_x)\right]\right]. \qquad (14)$$

The second and third terms in (14) cannot be computed analytically and should be computed numerically.

ACKNOWLEDGMENT
The research in this paper has been partially supported by the Swiss NSF grant n. IZKSZ2 162188.

REFERENCES
[1] J. Hausser and K. Strimmer, "Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks," J. Mach. Learn. Res., vol. 10, pp. 1469-1484, 2009.
[2] G. Casella and E. Moreno, "Assessing robustness of intrinsic tests of independence in two-way contingency tables," Journal of the American Statistical Association, vol. 104, no. 487, pp. 1261-1271, 2009.
[3] I. Good and J. Crook, "The robustness and sensitivity of the mixed-Dirichlet Bayesian test for independence in contingency tables," The Annals of Statistics, pp. 670-693, 1987.
[4] I. Nemenman, F. Shafee, and W. Bialek, "Entropy and inference, revisited," in Advances in Neural Information Processing Systems, 2002, pp. 471-478.
[5] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[6] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian Data Analysis. CRC Press, 2013.
[7] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993-1022, 2003.
[9] B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell, "Stan: A probabilistic programming language," Journal of Statistical Software, vol. 76, no. 1, pp. 1-32, 2017.
[10] M. Scutari, "Learning Bayesian networks with the bnlearn R package," Journal of Statistical Software, vol. 35, no. 3, pp. 1-22, 2010.