Maximum Entropy competes with Maximum Likelihood
A.E. Allahverdyan & N.H. Martirosyan
A. Alikhanyan National Laboratory (Yerevan Physics Institute), 0036 Yerevan, Armenia
The maximum entropy (MAXENT) method has a large number of applications in theoretical and applied machine learning, since it provides a convenient non-parametric tool for estimating unknown probabilities. The method is a major contribution of statistical physics to probabilistic inference. However, a systematic approach towards its validity limits is currently missing. Here we study MAXENT in a Bayesian decision theory set-up, i.e. assuming that there exists a well-defined prior Dirichlet density for unknown probabilities, and that the average Kullback-Leibler (KL) distance can be employed for deciding on the quality and applicability of various estimators. These allow us to evaluate the relevance of various MAXENT constraints, check its general applicability, and compare MAXENT with estimators having various degrees of dependence on the prior, viz. the regularized maximum likelihood (ML) and the Bayesian estimators. We show that MAXENT applies in sparse data regimes, but needs specific types of prior information. In particular, MAXENT can outperform the optimally regularized ML provided that there are prior rank correlations between the estimated random quantity and its probabilities.
I. INTRODUCTION
The maximum entropy (MAXENT) method was proposed within statistical physics [1–3], and later on found a wide range of inter-disciplinary applications in data science, probabilistic inference, biological data modeling, etc.; see e.g. [4]. MAXENT estimates unknown probabilities (that generated data) via maximizing the Boltzmann-Gibbs-Shannon entropy under certain constraints which can be derived from the observed data [4]. MAXENT leads to non-parametric estimators whose form does not depend on the underlying mechanism that generated the data (i.e. prior assumptions). Also, MAXENT avoids the zero-probability problem: when operating on sparse data, certain values of the involved random quantity may not appear due to a small but non-zero probability; MAXENT still provides a controllable non-zero estimate for this small probability. MAXENT has several formal justifications [1, 5–10]. But the following open problems are basic for MAXENT, because their insufficient understanding prevents its valid applications. (i)
Which constraints of entropy maximization are to be extracted from data, which is necessarily finite and noisy? (ii)
When and how can these constraints lead to overfitting, where, due to noisy data, involving more constraints leads to poorer results? (iii)
How do predictions of MAXENT compare with those of other estimators, e.g. the (regularized) maximum likelihood?

Here we approach these open problems via tools of Bayesian decision theory [11]. We assume that the data is given as an i.i.d. sample of a finite length M from a random quantity with n outcomes and unknown probabilities that are instanced from a non-informative prior Dirichlet density, or a mixture of such densities. Focusing on the sparse data regime M < n, we calculate average KL-distances between real probabilities and their estimates, decide on the quality of MAXENT under various constraints, and compare it with the (regularized) maximum-likelihood (ML) estimator. Our main results are that MAXENT does apply to sparse data, but does demand specific prior information. We explored two different scenarios of such information. First, the unknown probabilities are most probably deterministic. Second, there are prior rank correlations between the inferred random quantity and its probabilities. Moreover, in the latter case the non-parametric MAXENT estimator is better in terms of the average KL-distance than the optimally regularized ML (parametric) estimator.

Some of the above questions were already studied in the literature. Refs. [12–15] applied formal principles of statistics (e.g. the Minimum Description Length) to the selection of constraints (question (i)). Our approach to studying this question will be direct and unambiguous, since, as shown below, Bayesian decision theory leads to clear criteria for the validity of MAXENT estimators. We can also compare all predictions with the optimal Bayesian estimator. The latter is normally not available in practice due to insufficient knowledge of prior details, but it still provides an important theoretical benchmark.
Note that [16–23] studied soft constraints that allow incorporation of prior assumptions into the MAXENT estimator, making it effectively parametric. Here MAXENT will be taken in its original meaning as providing non-parametric estimators.

This paper is organized as follows. Section II recalls the tenets of Bayesian decision theory and describes the data-generation set-up. Section III introduces and motivates the Bayesian estimator and the regularized ML estimator. Section IV recalls the basic formulas of MAXENT, applies them to the studied set-up, and discusses their symmetry features. Section V compares predictions of MAXENT with the regularized ML. We close in the last section with a discussion of open problems. Appendix A shows how to apply MAXENT to categorical data. Appendix B presents our preliminary results on the affine symmetry of MAXENT estimators, and establishes relations with the minimum entropy principle proposed in [12–15].
II. BAYESIAN DECISION THEORY
Consider a random quantity Z with values (z_1, ..., z_n) and respective probabilities q = (q_1, ..., q_n) = (q(z_1), ..., q(z_n)). We look at an i.i.d. sample of length M:

    D = (Z_1, ..., Z_M),   m = {m_k}_{k=1}^n,   M ≡ ∑_{k=1}^n m_k,   (1)

where Z_u ∈ (z_1, ..., z_n) (u = 1, ..., M), and m_k is the number of appearances of z_k in (1). This sample will be an instance of our data; e.g. constraints of MAXENT will be determined from it. The conditional probability of data D reads

    P(D | q_1, ..., q_n) = P(m_1, ..., m_n | q_1, ..., q_n) = M! ∏_{k=1}^n q_k^{m_k} / m_k!.   (2)

To check the performance of various inference methods, the probabilities q̂(D) = {q̂_k(D)}_{k=1}^n inferred from (1) are compared with the true probabilities q = {q(z_k)}_{k=1}^n via the KL-distance

    K[q, q̂(D)] = ∑_{k=1}^n q_k ln [ q_k / q̂_k(D) ],   (3)

where concrete forms of q̂(D) are given below. The choice of distance (3) is motivated below, where we recall that it implies the global optimality of the standard (posterior-mean) Bayesian estimator. Another possible choice of distance is the squared (symmetric) Hellinger distance: dist_H[q, q̂] ≡ 1 − ∑_{k=1}^n √(q_k q̂_k). In our situation, it frequently leads to the same qualitative results as (3).

How should one compare various estimators with each other, and decide on the quality of a given estimator? Bayesian decision theory answers this question; see chapter 11 of [11]. The theory assumes that the probabilities of (z_1, ..., z_n) are generated from a known probability density P(q_1, ..., q_n) that encapsulates the prior information about the situation. Next, it decides on the quality of an estimator q̂(D) via the average distance

    ⟨K⟩ = ∫ ∏_{k=1}^n dq_k P(q_1, ..., q_n) K̄,   K̄ = ∑_D P(D|q) K[q, q̂(D)],   (4)

where K̄ is the average of (3) over samples (1) with fixed length M.
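As a concrete illustration (our own sketch, not part of the original analysis), the counts m_k of Eq. (1) and the KL distance of Eq. (3) are straightforward to compute with the Python standard library; the function names are ours:

```python
import bisect
import math
import random

def sample_counts(q, M, rng=random):
    """Draw an i.i.d. sample of length M from probabilities q and
    return the counts m_k of Eq. (1)."""
    n = len(q)
    # cumulative distribution for categorical sampling
    cdf, s = [], 0.0
    for qk in q:
        s += qk
        cdf.append(s)
    m = [0] * n
    for _ in range(M):
        u = rng.random()
        # first index whose cumulative probability covers u
        k = min(bisect.bisect_left(cdf, u), n - 1)
        m[k] += 1
    return m

def kl_distance(q, q_hat):
    """K[q, q_hat] = sum_k q_k ln(q_k / q_hat_k), cf. Eq. (3)."""
    return sum(qk * math.log(qk / qh) for qk, qh in zip(q, q_hat) if qk > 0)
```

The KL distance vanishes only when the estimate matches the true probabilities, which is why estimators are ranked below by their average ⟨K⟩.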
Sometimes Bayesian decision theory replaces the distance by a utility, loss, etc.; in Bayesian decision theory different loss (or decision) functions can be optimized based on the context of the problem [11]. Note the difference between the proper Bayesian approach and Bayesian decision theory; cf. chapters 10 and 11 in [11]. The former employs the data for moving from the prior (5) to the posterior (7). It averages over the prior, e.g. when calculating the posterior mean. The latter advises on choosing estimators, whose form may or may not depend on the prior; see below for examples. The decision theory averages both over the data and over the prior, as seen in (4).

For the prior density of q = {q_k}_{k=1}^n we choose the Dirichlet density (or a mixture of such densities, as seen below) [24, 25]:

    P(q_1, ..., q_n; α_1, ..., α_n) = ( Γ[∑_{k=1}^n α_k] / ∏_{k=1}^n Γ[α_k] ) ∏_{k=1}^n q_k^{α_k − 1} δ( ∑_{k=1}^n q_k − 1 ),   (5)

where Γ[x] = ∫_0^∞ dy y^{x−1} e^{−y} is Euler's Γ-function and the delta-function δ(∑_{k=1}^n q_k − 1) ensures the normalization of probabilities. Parameters α_k > 0 determine the prior weight of q_k [24, 25]:

    ⟨q_k⟩ ≡ ∫ ∏_{l=1}^n dq_l q_k P(q_1, ..., q_n; α_1, ..., α_n) = α_k / A,   A ≡ ∑_{k=1}^n α_k,   (6)

where the integration range goes over the simplex 0 ≤ q_k ≤ 1, ∀k, and ∑_{k=1}^n q_k = 1. The Dirichlet density (5) is unique in holding several desired features of a non-informative prior density over unknown probabilities; see [24, 25] for reviews. An important feature of density (5) is that it is conjugate to the multinomial conditional probability (2):

    P(q_1, ..., q_n | m_1, ..., m_n) = P(q_1, ..., q_n; α_1 + m_1, ..., α_n + m_n).   (7)

Eq. (7) is convenient when studying i.i.d. samples (1) of discrete random quantities. Here we assume that the prior density is known exactly [see however (32)]. In practice, such knowledge need not be available.
For example, it may be known that the prior density belongs to the Dirichlet family, but its hyper-parameters {α_k}_{k=1}^n are unknown and should be determined from the data, e.g. via empirical Bayes procedures; see [24–28] for reviews on hyper-parameter estimation.

III. BAYESIAN AND REGULARIZED MAXIMUM LIKELIHOOD (ML) ESTIMATORS
Starting from (4), we find the best estimator in terms of the minimal average KL-distance:

    min[⟨K⟩] = ∑_D P(D) min[ ∫ ∏_{k=1}^n dq_k P(q|D) K[q, q̂(D)] ],   (8)

where the minimization goes over inferred probabilities {q̂(D)}, and where P(q|D) is recovered from P(D|q): P(D) P(q|D) = P(D|q) P(q); cf. (1, 2). The equality in (8) follows from the fact that if q̂(D) minimizes ∫ ∏_{k=1}^n dq_k P(q|D) K[q, q̂(D)], then it will minimize each term of the sum for every D, and thus will minimize the whole sum. Then, implementing the constraint ∑_{k=1}^n q̂_k(D) = 1 via a Lagrange multiplier, we get from (8):

    argmin[ ∫ ∏_{k=1}^n dq_k P(q|D) K[q, q̂(D)] ] = { ∫ ∏_{k=1}^n dq_k q_l P(q|D) }_{l=1}^n.   (9)

We got in (9) the posterior average, because we employed the KL distance K[q, q̂(D)]. The optimal estimator will be different upon using another distance, e.g. the KL distance K[q̂(D), q] of q̂(D) from q, or the Hellinger distance. Note that in the proper Bayesian approach the posterior mean is simply postulated to be an estimator, since it is just a characteristic of the posterior distribution. In the present Bayesian decision approach the posterior mean emerges from minimizing a specific (viz. KL) distance. If another distance is used, the posterior mean is no longer optimal.

If the prior is a single Dirichlet density (5), we get from (7, 9) for the Bayesian estimator:

    p(z_k) = (m_k + α_k) / (M + A).   (10)

The average KL-distance (4) for the estimator (10) reads from (7, 2) (denoting ψ[x] ≡ (d/dx) ln Γ[x]):

    ⟨K[q, p]⟩ = (1/A) ∑_{k=1}^n α_k [ψ(1 + α_k) − ψ(1 + A)] + ln(M + A)
        − ( Γ[M + 1] Γ[A] / Γ[M + A + 1] ) ∑_{k=1}^n ∑_{m=0}^M ( Γ[m + 1 + α_k] Γ[M − m + A − α_k] ln(m + α_k) ) / ( Γ[α_k] Γ[A − α_k] Γ[m + 1] Γ[M − m + 1] ).
(11)

If the prior density is given by a mixture of Dirichlet densities with weights {π_a}_{a=1}^L:

    ∑_{a=1}^L π_a P(q_1, ..., q_n; α_1^{[a]}, ..., α_n^{[a]}),   ∑_{a=1}^L π_a = 1,   (12)

then instead of (6) and (10) we have from (9):

    ⟨q_k⟩ = ∑_{a=1}^L π_a α_k^{[a]} / A^{[a]},   A^{[a]} ≡ ∑_{k=1}^n α_k^{[a]},   (13)

    p(z_k) = [ ∑_{a=1}^L π_a Φ^{[a]} (m_k + α_k^{[a]}) / (M + A^{[a]}) ] / ∑_{a=1}^L π_a Φ^{[a]},   Φ^{[a]} ≡ ( Γ[A^{[a]}] / Γ[M + A^{[a]}] ) ∏_{k=1}^n Γ[m_k + α_k^{[a]}] / Γ[α_k^{[a]}].   (14)

For a mixture prior density, the Bayesian estimator (14) depends on all numbers {m_k; α_k^{[1]}, ..., α_k^{[L]}}, not just on m_k. Below we illustrate that not knowing precisely the details of the prior mixture can lead to serious losses when applying Bayesian estimators.

It is interesting (both conceptually and practically) to have a simple estimator, where the dependence on the prior is reduced to a single parameter. A good candidate is the regularized maximum likelihood (ML) estimator (see [29] for a review):

    p_ML(z_k) ≡ (m_k + b) / (M + nb) = λ m_k / M + (1 − λ) / n,   λ = M / (M + nb),   b ≥ 0,   0 < λ < 1,   (15)

where the regularizer b (or λ) takes care of the fact that for a finite sample (1) not all values z_k had a chance to appear (i.e. m_k = 0 for them). Then (15) avoids claiming a zero probability, due to b > 0. Eq. (15) is a shrinkage estimator, where the proper ML estimator m_k / M is shrunk towards the uniform distribution 1/n by the shrinkage factor λ. The proper ML estimator p_ML(z_k)|_{b=0} will be shown to be a meaningless estimator for not very long samples (1), producing results that are worse than {q̂(z_k) = 1/n}_{k=1}^n. Moreover, for such samples the correct choice of b (based on the prior information) is crucial, i.e. (15) is generally a parametric estimator. The estimator (15) recovers the true probabilities for M → ∞ [11], where n and b are fixed, hence λ → 1 in (15). For the optimal estimator (15), the value of b is found by minimizing the average KL-distance (4).
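Both estimators are one-liners in code. Below is a minimal sketch (function names are ours) of the posterior mean (10) and the shrinkage estimator (15):

```python
def bayes_estimator(m, alpha):
    """Posterior-mean estimator of Eq. (10):
    p(z_k) = (m_k + alpha_k) / (M + A), A = sum of alpha_k."""
    M, A = sum(m), sum(alpha)
    return [(mk + ak) / (M + A) for mk, ak in zip(m, alpha)]

def regularized_ml(m, b):
    """Regularized (shrinkage) ML estimator of Eq. (15):
    p_ML(z_k) = (m_k + b) / (M + n*b); b = 0 is the proper ML."""
    M, n = sum(m), len(m)
    return [(mk + b) / (M + n * b) for mk in m]
```

For a homogeneous prior α_k = α the two coincide at b = α, which illustrates the single-Dirichlet statement b_opt = α made in the text.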
When the prior is given by a Dirichlet density (5), the average KL-distance amounts to (11), where we need to replace ln(M + A) → ln(M + nb) and ln(m + α_k) → ln(m + b). Now (9, 10) imply that for a homogeneous Dirichlet prior, i.e. for (5) with α_k = α, we have b_opt = α for the optimal value of b, i.e. the regularized ML estimator coincides with the Bayesian estimator: p_ML(z_k) = p(z_k). This no longer holds for a mixture of Dirichlet prior densities.

IV. THE MAXIMUM ENTROPY (MAXENT) METHOD
MAXENT infers probabilities by maximizing the Boltzmann-Gibbs-Shannon entropy

    S[q] = − ∑_{k=1}^n q(z_k) ln q(z_k),   (16)

under constraints taken from the sample (1). The rationale of maximizing (16) is that a larger S means a smaller bias (or information) according to several axiomatic schemes [1–3, 5–10]. Note that physical applications of MAXENT operate with constraints that are known precisely, e.g. the mean energy constraint is deduced from the corresponding conservation law [1–3]. Such situations are rare in statistics and machine learning. Hence we need to understand which constraints are to be taken from the noisy data.

First, we can apply no constraint and maximize the entropy:

    q^{[0]}(z_k) = 1/n.   (17)

The calculation of the average distance is straightforward from (4, 11, 17) both for a single Dirichlet prior and a mixture of such priors. We exemplify the single Dirichlet case (5):

    ⟨K[q, q^{[0]}]⟩ = ∑_{k=1}^n ⟨q_k ln q_k⟩ + ln n = (1/A) ∑_{k=1}^n α_k [ψ(1 + α_k) − ψ(1 + A)] + ln n.   (18)

Now ⟨K[q, q^{[0]}]⟩ plays an important role: since (17) is completely data-independent and simply reproduces the prior expectation of unbiased probabilities, estimators whose average KL-distance is larger than (18) are meaningless; see below for examples.

Next, we employ the empiric mean of (1) as a constraint for the expected value of Z:

    μ_1 = (1/M) ∑_{u=1}^M Z_u = (1/M) ∑_{k=1}^n z_k m_k = ∑_{k=1}^n q_k z_k.   (19)

Maximizing (16) under constraint (19) via the Lagrange method leads to the famous Gibbs formula [1–3]:

    q^{[1]}(z_k) = e^{−β z_k} / ∑_{l=1}^n e^{−β z_l},   (20)

where the Lagrange multiplier β is found from (19). Appendix A presents an example of applying (20) to real data. We order the values of Z as z_1 < ... < z_n, and note a specific feature of (20): depending on the sign of β, we either get q^{[1]}(z_1) ≤ ... ≤ q^{[1]}(z_n) or q^{[1]}(z_1) ≥ ... ≥ q^{[1]}(z_n).
(21)

One can try to acquire further information from sample (1) by looking at the second empiric moment:

    μ_2 = (1/M) ∑_{u=1}^M Z_u^2 = (1/M) ∑_{k=1}^n z_k^2 m_k = ∑_{k=1}^n q_k z_k^2.   (22)

Now we maximize (16) under the two constraints (19) and (22):

    q^{[1+2]}(z_k) = e^{−β_1 z_k − β_2 z_k^2} / ∑_{l=1}^n e^{−β_1 z_l − β_2 z_l^2},   (23)

where the Lagrange multipliers β_1 and β_2 are found from solving both (19) and (22). Eqs. (20, 23) make obvious how to involve other (fractional) moments. The maximizations of (16) lead to unique results, because (16) is a concave function of {q_k}_{k=1}^n, while the moment constraints are linear.

Let the values (z_1, ..., z_n) of Z be subject to the affine transformation

    z̃_k = F(z_k),   F(z) = gz + h,   k = 1, ..., n.   (24)

Hence, as a result of transformation (24): μ_1 → μ̃_1 = g μ_1 + h and μ_2 → μ̃_2 = g^2 μ_2 + h^2 + 2gh μ_1; see (19, 22). These relations show that (24) leaves the inferred probabilities (20, 23) invariant, because the resulting set of equations for the unknowns in (20, 23) is identical for both the original values (z_1, ..., z_n) and the transformed values (24). Likewise, involving the first p moments μ_1, ..., μ_p produces affine-invariant probabilities. Note that involving only (22) [without involving (19)] will lead to the invariance of the probabilities with respect to a limited affine symmetry, where h = 0 in (24). Another example of limited affine symmetry is involving the fractional moment ∑_{k=1}^n q_k √(z_k) (for z_k ≥ 0 and instead of (19, 22)). Then the probabilities q^{[1/2]}(z_k) ∝ e^{−β_{1/2} √(z_k)} will stay intact only under h = 0 and g > 0 in (24). Note in this context that the ML estimator (15) is invariant with respect to (24) with an arbitrary bijective F, which keeps the values z̃_k different.

The symmetry features of various estimators are clearly important, though we so far have no analytical results that would relate them to the estimation quality quantified by (4).
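In practice, the Lagrange multiplier β in (20) is the root of a one-dimensional monotone equation (the Gibbs mean decreases strictly in β), so it can be found by bisection. A sketch under our own naming, with a max-shift to keep the exponentials numerically stable:

```python
import math

def gibbs_probs(z, beta):
    """q^[1](z_k) proportional to exp(-beta * z_k), cf. Eq. (20)."""
    w = [-beta * zk for zk in z]
    wmax = max(w)  # shift to avoid overflow in exp
    e = [math.exp(x - wmax) for x in w]
    s = sum(e)
    return [x / s for x in e]

def solve_beta(z, mu, lo=-50.0, hi=50.0, tol=1e-12):
    """Find beta such that sum_k q^[1](z_k) z_k = mu, cf. Eq. (19).
    The mean is strictly decreasing in beta, so bisection applies."""
    def mean(beta):
        q = gibbs_probs(z, beta)
        return sum(qk * zk for qk, zk in zip(q, z))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) > mu:
            lo = mid  # mean too large, increase beta
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The two-constraint case (23) requires a two-dimensional root search for (β_1, β_2), but the same monotonicity-based logic underlies standard solvers for it.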
But we noted from numerical comparison of MAXENT estimators based on various constraints that estimators with the largest affine symmetry, i.e. (24) with arbitrary g and h, tend to be better in terms of the average KL-distance (4). Intuitively, higher (affine) symmetry should be related to susceptibility with respect to noise; see Appendix B for further results.

V. NUMERICAL RESULTS

A. A single Dirichlet density
Recall that maximization of entropy (16) can be applied if there is no prior information that distinguishes one probability from another. If such information is present, MAXENT is generalized to the minimum relative entropy method [9]. We shall not study this generalization here. Hence, to ensure the applicability of MAXENT, we always choose prior densities such that ⟨q_k⟩ = 1/n; i.e. all n values are equally likely to be generated, on average. As seen from (6), for a single prior Dirichlet density (5) the condition ⟨q_k⟩ = 1/n implies:

    α_k = α,   k = 1, ..., n.   (25)

Now recall from (15, 10) that under (25) the Bayesian and the regularized ML estimators coincide. I.e. we conclude that the regularized ML is a better estimator than MAXENT (under any constraint).

Though ⟨q_k⟩ = 1/n does not depend on α, the most probable values q̃_k of q_k do depend on the magnitude of α. Finding q̃_k from (5, 25) amounts to maximizing L(q) = (α − 1) ∑_{k=1}^n ln q_k + γ ∑_{k=1}^n q_k, where the Lagrange multiplier γ ensures ∑_{k=1}^n q_k = 1. For α > 1, L(q) is a concave function of q, and its global maximum is found after differentiating it. Hence q̃_k holds

    q̃_k = 1/n for α > 1,   k = 1, ..., n.   (26)

For α < 1, L(q) is a convex function; it does not have local maxima with q_k > 0 (k = 1, ..., n). Its maxima are located at points where q_k = 0 for certain k. Repeating this argument, we see that the maxima of L(q) are at those points where a possibly large number of q_k are zero:

    q̃_k = 0 or q̃_k = 1, for α < 1,   k = 1, ..., n,   (27)

which means deterministic probabilities. Eq. (27) is consistent with ⟨q_k⟩ = 1/n, because there are n equivalent most probable values.

Let us start with the regime α > 1; cf. (26). Table I compares predictions of (15) with those of MAXENT solutions (20) and (23) for the Dirichlet prior (5) holding (25) with α = 2. It is seen that MAXENT is meaningless, because the trivial estimator (17) provides a smaller average KL-distance; cf. (18).
For the Bayesian estimator even M = 1 leads to a meaningful prediction; e.g. for the parameters of Table I we have ⟨K_Bayes⟩|_{M=1} = 0.… < … .

The above conclusion holds more generally (as we checked numerically): for the homogeneous Dirichlet prior (25) with α ≥ …, the MAXENT estimators (20, 19) and (23, 19, 22) are meaningless at least in the sparse data regime M < n. This puts a serious limitation on the validity of MAXENT.

The situation changes for sufficiently small values of α in the regime (27); see Table II for α = 0.… . Here the MAXENT estimators are meaningful provided that the sample length M is sufficiently large (but still in the sparse data regime M < n):

TABLE I: For n = 60 and z_k = k (k = 1, ..., n) we show the average KL-distance (4) for various estimators. The full affine symmetry (24) holds for all shown probabilities. M is the length of sample (1). The initial prior Dirichlet density (5) holds (25) with α_k = 2. Eq. (18) equals ⟨K[q, q^{[0]}]⟩ = 0.…, i.e. values of the average KL-distance larger than this are meaningless. ⟨K_Bayes⟩ is the averaged KL-distance for the Bayes estimator (10), which for this case coincides with the optimally regularized ML estimator. ⟨K_1⟩ and ⟨K_2⟩ are defined (resp.) via (20, 19) and (23, 22). The averages are found numerically (this applies to all Tables): first we generate instances of {q_k}_{k=1}^n from the Dirichlet density, and then for each instance we generate samples (1). Such parameters lead to 3-digit precision, as reported.

M    ⟨K_Bayes⟩   ⟨K_1⟩   ⟨K_2⟩
35   0.177   0.236   0.247
25   0.188   0.240   0.260
15   0.202   0.259   0.301

TABLE II: The same as in Table I, but for α_k = α = 0.… in (25). Eq. (18) gives ⟨K[q, q^{[0]}]⟩ = 1.…, i.e. values of the average KL-distance larger than this are meaningless.

M    ⟨K_Bayes⟩   ⟨K_1⟩   ⟨K_2⟩
55   0.233   1.756   1.685
45   0.276   1.700   1.643
35   0.338   1.723   1.680
25   0.428   1.753   1.717
15   0.606   1.770   2.164
11   0.730   1.762   4.946
9    0.818   1.774   11.63
7    0.916   1.848   32.24

TABLE III: The same as in Table I, but the initial prior density is a Dirichlet mixture given by (12, 30) with α = 0.… and ε = 1.… . The average KL-distance ⟨K[q, q^{[0]}]⟩ for the trivial estimator (17) equals …, i.e. values of the average KL-distance larger than this are meaningless; cf. (18). ⟨K_Bayes⟩ and ⟨K̃_Bayes⟩ refer to (14) and (32), respectively. ⟨K_ML⟩|_{b=b_opt} and ⟨K_ML⟩|_{b=1} refer to the regularized ML estimator (15) under the optimal value of b (found from numerically minimizing ⟨K_ML⟩) and under b = 1, respectively. The optimal value b_opt of b changes from 0.… for M = 35 to 0.… for M = 1. We also report ⟨K_ML⟩|_{b=1}, i.e. with a seemingly sensible value of b, to confirm that if b is not chosen properly, then the corresponding (regularized) ML estimator (15) is meaningless. ⟨K_1⟩ is defined via (4, 20). ⟨K_2⟩ is not shown, since ⟨K_2⟩ > ⟨K_1⟩ for … ≥ M ≥ … . We do not show ⟨K_1⟩|_{M=1}, since it is larger than the average KL-distance for all other estimators.

M    ⟨K_Bayes⟩   ⟨K̃_Bayes⟩   ⟨K_ML⟩|_{b=b_opt}   ⟨K_ML⟩|_{b=1}   ⟨K_1⟩
35   0.014   0.206   0.180   0.204   0.048
25   0.015   0.207   0.188   0.210   0.053
15   0.017   0.209   0.197   0.214   0.065
11   0.022   0.209   0.201   0.215   0.077
7    0.035   0.209   0.205   0.215   0.105
5    0.052   0.210   0.207   0.214   0.141
3    0.083   0.210   0.209   0.213   0.268
1    0.150   0.211   0.211   0.212   —

The estimator (20, 19) is meaningful for M ≥ … (M < n = 60), while the estimator (23, 19, 22) is meaningful for M ≥ …; see Table II. Though the predictions of MAXENT are still far from those of the Bayesian estimator, we should recall that the latter estimator is parametric, i.e. it depends on the prior (via the parameter α), in contrast to MAXENT estimators. Table II demonstrates the overfitting phenomenon: for … ≤ M ≤ … the MAXENT estimator (20, 19) is meaningful, but adding the second constraint makes the MAXENT estimator (23, 19, 22) not meaningful. The situation is worsened because (22) is again estimated from the noisy data and gathers more noise than information. This overfitting disappears for larger values of M, i.e. M ≥ …, as Table II demonstrates. Then adding the second constraint (22) is beneficial.

TABLE IV: The same as in Table III, but for different values of M. Here ⟨K_2⟩ refers to the MAXENT estimator (23) with constraints (19, 22). The Bayesian estimator is found from (14, 30). For this range of sufficiently large M the MAXENT estimator (23) performs better than (20): ⟨K_2⟩ < ⟨K_1⟩ < ⟨K_ML⟩|_{b=b_opt}.

M     ⟨K_Bayes⟩   ⟨K_ML⟩|_{b=b_opt}   ⟨K_ML⟩|_{b=1}   ⟨K_1⟩   ⟨K_2⟩
45    0.015   0.172   0.196   0.045   0.042
65    0.014   0.157   0.180   0.042   0.035
85    0.014   0.145   0.164   0.040   0.031
241   0.013   0.087   0.091   0.038   0.024
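The Monte Carlo averaging described in the caption of Table I (draw {q_k} from the Dirichlet density, then draw samples (1), then average the KL distance) can be sketched as follows. This is our own illustration: the estimator is passed in as a function of the counts m, and the parameters are illustrative rather than those of the Tables:

```python
import math
import random

def dirichlet_sample(alpha, rng):
    """Draw q from a Dirichlet density (5) via normalized Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def average_kl(estimator, alpha, M, trials, seed=0):
    """Monte Carlo estimate of <K> in Eq. (4) for a given estimator
    m -> q_hat and a single Dirichlet prior with parameters alpha."""
    rng = random.Random(seed)
    n = len(alpha)
    total = 0.0
    for _ in range(trials):
        q = dirichlet_sample(alpha, rng)
        # draw the counts m_k by M categorical draws from q
        cdf, s = [], 0.0
        for qk in q:
            s += qk
            cdf.append(s)
        m = [0] * n
        for _ in range(M):
            u = rng.random()
            k = next((i for i, c in enumerate(cdf) if c >= u), n - 1)
            m[k] += 1
        q_hat = estimator(m)
        total += sum(qk * math.log(qk / qh)
                     for qk, qh in zip(q, q_hat) if qk > 0)
    return total / trials
```

For example, `average_kl(lambda m: [(mk + 2.0) / (sum(m) + 2.0 * len(m)) for mk in m], [2.0] * 10, M=5, trials=1000)` approximates ⟨K_Bayes⟩ for a homogeneous prior with α = 2.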
B. Mixture of Dirichlet densities
For modeling more complex types of prior information about the unknown probabilities {q_k}_{k=1}^n, we shall assume that the prior density is a mixture of two Dirichlet densities; see (12). The relations ⟨q_k⟩ = 1/n (k = 1, ..., n) will still be kept, since they are necessary for applying MAXENT. Now we assume that there are (prior) conditional rank correlations between the values (z_1, ..., z_n) of Z, ordered as z_1 < ... < z_n, and its probabilities (q_1, ..., q_n). For one component of the mixture, the probabilities (q_1, ..., q_n) prefer to be ordered as q_1 < ... < q_n. For the other component they tend to be ordered in the opposite way, q_1 > ... > q_n. This type of prior knowledge can be modeled via a mixture (12) of two Dirichlet priors with L = 2, π_1 = π_2 = 1/2, and

    α_1^{[1]} < ... < α_n^{[1]},   α_1^{[2]} > ... > α_n^{[2]},   (28)

    ( α_k^{[1]} − α_l^{[1]} ) / A^{[1]} = ( α_l^{[2]} − α_k^{[2]} ) / A^{[2]},   for any k, l = 1, ..., n,   (29)

where (29) ensures the needed ⟨q_k⟩ = 1/n, as seen from (13). A simple case that leads to (28, 29) is

    L = 2,   π_1 = π_2 = 1/2,   α_k^{[1]} = α + ε(k − 1),   α_k^{[2]} = α + ε(n − k),   k = 1, ..., n,   (30)

where A^{[1]} = A^{[2]} = αn + εn(n − 1)/2. Recall that for a mixture of Dirichlet densities the Bayes estimator (14) and the optimally regularized ML estimator (15) are different.

For numerical illustration we choose {z_k = k}_{k=1}^n. Prior probability densities generated via (30) will be employed with z_1 < ... < z_n. Now Tables III and IV show that for M ≥ … the MAXENT estimator (20, 19) is clearly better than the optimally regularized ML estimator (15):

    ⟨K_1⟩ < ⟨K_ML⟩|_{b=b_opt},   (31)

where the optimal value of b is found from minimizing the averaged KL-distance (4). Moreover, for M ≥ …, we see that ⟨K_1⟩ is closer to the optimal ⟨K_Bayes⟩ than to ⟨K_ML⟩|_{b=b_opt}.
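The mixture construction (30) and the mixture posterior mean (14) used for ⟨K_Bayes⟩ here can be sketched as follows; the weights Φ^[a] are evaluated in log space via lgamma to avoid Γ-function overflow (names are ours):

```python
import math

def mixture_hyperparams(n, alpha, eps):
    """Two Dirichlet components of Eq. (30):
    alpha^[1]_k = alpha + eps*(k-1),  alpha^[2]_k = alpha + eps*(n-k)."""
    a1 = [alpha + eps * k for k in range(n)]
    a2 = [alpha + eps * (n - 1 - k) for k in range(n)]
    return a1, a2

def mixture_bayes(m, components, weights):
    """Posterior-mean estimator (14) for a Dirichlet-mixture prior.
    Phi^[a] of Eq. (14) is computed in log space with lgamma."""
    M = sum(m)
    log_phi = []
    for a in components:
        A = sum(a)
        lp = math.lgamma(A) - math.lgamma(M + A)
        lp += sum(math.lgamma(mk + ak) - math.lgamma(ak)
                  for mk, ak in zip(m, a))
        log_phi.append(lp)
    lmax = max(log_phi)  # shift before exponentiating
    w = [pi * math.exp(lp - lmax) for pi, lp in zip(weights, log_phi)]
    Z = sum(w)
    p = [0.0] * len(m)
    for wa, a in zip(w, components):
        A = sum(a)
        for k in range(len(m)):
            p[k] += wa * (m[k] + a[k]) / (M + A)
    return [pk / Z for pk in p]
```

Note that the component weights w_a re-weight the single-component posterior means, so the output depends on the full count vector {m_k}, not only on m_k itself, as stressed after Eq. (14).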
Note that such threshold values for M do depend on the assumed prior density and on n.

For M → ∞ the performance of the optimally regularized ML estimator (15) (for a fixed b) will be better than MAXENT with any finite number of constraints, since the regularized ML converges to the true probabilities for M → ∞ [11], while MAXENT does not. But as Table IV shows, MAXENT with constraints (19) or (19)+(22) still performs better than the optimally regularized ML even for M as large as 241 (for n = 60).

Table III shows that MAXENT with the two constraints (19, 22) performs worse than the method under the single constraint (19), although the affine invariance (24) of probabilities holds. This overfitting situation changes for larger values of M, i.e. M ≥ …, as Table IV demonstrates.

To stress the relevance of rank correlations, we note that the advantage (31) of MAXENT closely relates to the agreement between (28) and the ordering z_1 < ... < z_n of Z. If the vector (z_1, ..., z_n) is randomly permuted and employed for the values of Z, the predictions of MAXENT become meaningless even for rather large values of M > n.

Recall that both the Bayesian (14) and the regularized ML estimator (15) are parametric, i.e. their very form depends on the prior, which is frequently not available in practice. Hence we need to understand how strong this dependence is. Let us assume that one has to employ a Bayesian estimator without knowing the full form of the equal-weight mixture (30). Instead, one knows the average values of α_k = ( [α + ε(k − 1)] + [α + ε(n − k)] )/2 = α + ε(n − 1)/2 from (30), prescribes them to a single Dirichlet prior (5, 25), and builds up from (10) an estimator

    p̃(z_k) = [ m_k + α + ε(n − 1)/2 ] / [ M + nα + εn(n − 1)/2 ].   (32)

The performance of this perturbed Bayesian estimator deteriorates and gets worse than that of the MAXENT solution: ⟨K_1⟩ < ⟨K̃_Bayes⟩; see Table III. Likewise, the choice of b in the regularized ML estimator (15) is important.
If just some reasonable value is chosen instead of the optimal one, e.g. b = 1 instead of b ∼ 0.… in Table III, then the ML estimator can turn meaningless; see Table III for M ≤ … . In this context, we emphasize that the Bayes estimator and the optimally regularized ML estimator are never meaningless, even for M = 1. For the parameters of Table III, the MAXENT estimator (20) becomes meaningless for M ≤ … .

VI. SUMMARY AND DISCUSSION
The maximum entropy (MAXENT) method provides non-parametric estimators for inferring unknown probabilities [4]. MAXENT is widely applied both in statistical physics and probabilistic inference. However, its physical applications are mostly data-free and are based on additional principles (e.g. conservation laws [2, 3]) that are normally absent in statistics and machine learning. Hence we needed a systematic approach towards understanding the validity limits of MAXENT as an inference tool. Here we presented a Bayesian decision theory approach that allows one to determine whether MAXENT is applicable at all, i.e. whether it is better than a random guess. It also allows one to compare different estimators with each other (e.g. to compare MAXENT with the regularized maximum likelihood), and to study the relevance of various constraints employed in MAXENT. Our results are summarized as follows. MAXENT does apply to sparse data, but demands specific prior information. Here sparse means
M < n, i.e. the sample length M is smaller than the number of probabilities n to be inferred. We explored two different scenarios of such prior information. First, the unknown probabilities generated by the homogeneous Dirichlet density (25) are most probably deterministic. Second, there are prior rank correlations between the random quantity and its probabilities. This seems to be the simplest prior information that makes MAXENT applicable and superior over the optimally regularized maximum-likelihood estimator. Our approach is capable of describing several phenomena that are relevant for applying and understanding estimators: overfitting (i.e. adding more noisy constraints leads to poorer inference), instability of optimal Bayesian parametric estimators with respect to variation of prior details, inapplicability of non-parametric MAXENT estimators to very short samples, etc.

Several important problems were uncovered by this study and should be addressed in the future. First of all, this concerns the applicability of MAXENT to categorical data, where the values (z_1, ..., z_n) of the random variable Z in sample (1) are not numerical, but instead refer to certain distinguishable categories. A major difference between the maximum likelihood and MAXENT estimators is that the former freely applies to categorical data. In contrast, MAXENT does depend on the concrete numerical implementation (i.e. encoding) of the data, though this dependence is somewhat weakened by the affine symmetry (24). Thus an open problem demands considering various encoding schemes in view of their applicability to MAXENT estimators. (In this paper we in fact assumed the simplest encoding via natural numbers; see Tables.) Appendix A reports preliminary results in this direction along with a real data example. The second open problem relates to the influence of affine symmetries on the performance of various MAXENT estimators.
We observed numerically that the constraints which produce affine-invariant probabilities produce better estimators; see after (24). Preliminary results along this direction are given in Appendix B, where we also show relations of our results with the minimum entropy principle proposed in [12–15] for constraint selection.

Acknowledgments
We thank Roger Balian for useful discussions. AEA and NHM were supported by SCS of Armenia, grants No. 18RF-015 and No. 18T-1C090.

[1] E. T. Jaynes, Information theory and statistical mechanics, Physical Review, 620 (1957).
[2] R. Balian, From Microphysics to Macrophysics: Methods and Applications of Statistical Physics, Volumes I, II (Springer Science & Business Media, 2007).
[3] S. Pressé, K. Ghosh, J. Lee, and K. A. Dill, Principles of maximum entropy and maximum caliber in statistical physics, Reviews of Modern Physics, 1115 (2013).
[4] G. Erickson and C. R. Smith, Maximum-Entropy and Bayesian Methods in Science and Engineering: Volume 2: Applications, Vol. 31 (Springer Science & Business Media, 2013).
[5] C. Chakrabarti and I. Chakrabarty, Shannon entropy: axiomatic characterization and application, International Journal of Mathematics and Mathematical Sciences (2005).
[6] J. C. Baez, T. Fritz, and T. Leinster, A characterization of entropy in terms of information loss, Entropy, 1945 (2011).
[7] J. Van Campenhout and T. Cover, Maximum entropy and conditional probability, IEEE Transactions on Information Theory, 483 (1981).
[8] F. Topsøe, Information-theoretical optimization techniques, Kybernetika, 8 (1979).
[9] J. Shore and R. Johnson, Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy, IEEE Transactions on Information Theory, 26 (1980).
[10] J. Paris and A. Vencovská, In defense of the maximum entropy inference process, International Journal of Approximate Reasoning, 77 (1997).
[11] D. R. Cox and D. V. Hinkley, Theoretical Statistics (CRC Press, 1979).
[12] I. Good, Some statistical methods in machine intelligence research, Mathematical Biosciences, 185 (1970).
[13] R. Christensen, Entropy minimax multivariate statistical modeling–I: Theory, International Journal of General Systems, 231 (1985).
[14] S. C. Zhu, Y. N. Wu, and D. Mumford, Minimax entropy principle and its application to texture modeling, Neural Computation, 1627 (1997).
[15] G. Pandey and A. Dukkipati, in (IEEE, 2013), pp. 1521–1525.
[16] M. U. Thomas, A generalized maximum entropy principle, Operations Research, 1188 (1979).
[17] G. Lebanon and J. D. Lafferty, in Advances in Neural Information Processing Systems (2002), pp. 447–454.
[18] J. Kazama and J. Tsujii, Maximum entropy models with inequality constraints: A case study on text categorization, Machine Learning, 159 (2005).
[19] Y. Altun and A. Smola, in International Conference on Computational Learning Theory (Springer, 2006), pp. 139–153.
[20] M. Dudik, Maximum entropy density estimation and modeling geographic distributions of species, PhD dissertation, Princeton University (2007).
[21] J. Rau, Inferring the Gibbs state of a small quantum system, Physical Review A, 012101 (2011).
[22] L. L. Campbell, Minimum cross-entropy estimation with inaccurate side information, IEEE Transactions on Information Theory, 2650 (1999).
[23] M. P. Friedlander and M. R. Gupta, On minimizing distortion and relative entropy, IEEE Transactions on Information Theory, 238 (2005).
[24] B. A. Frigyik, A. Kapila, and M. R. Gupta, Introduction to the Dirichlet distribution and related processes, Department of Electrical Engineering, University of Washington, UWEETR-2010-0006, 1 (2010).
[25] J. L. Schafer, Analysis of Incomplete Multivariate Data (CRC Press, 1997).
[26] M. Claesen and B. De Moor, Hyperparameter search in machine learning, arXiv preprint arXiv:1502.02127 (2015).
[27] Z.-Y. Ran and B.-G. Hu, Parameter identifiability in statistical machine learning: a review, Neural Computation, 1151 (2017).
[28] J. Bergstra and Y. Bengio, Random search for hyper-parameter optimization, The Journal of Machine Learning Research, 281 (2012).
[29] J. Hausser and K. Strimmer, Entropy inference and the James–Stein estimator, with application to nonlinear gene association networks, Journal of Machine Learning Research (2009).
[30] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian Data Analysis (CRC Press, 2013).
Appendix A: MAXENT applies to categorical data
The MAXENT method can be applied to any multinomial data (1), provided that numeric values (z_1, ..., z_n) of the random quantity Z are given. MAXENT estimators depend on (z_1, ..., z_n) modulo the affine symmetry (24). This creates a problem in applying MAXENT to categorical data, since for MAXENT one now needs a specific encoding of the categorical Z into a numeric representation (z_1, ..., z_n) of the categories. Recall that there is a degree of arbitrariness in choosing the regularizer in the maximum likelihood (ML) estimator, or in choosing prior parameters for Bayesian inference. Here the arbitrariness lies in different encodings. In practice, the proper encoding of categories arises in any problem dealing with categorical data. If the categories are ordinal (e.g. military ranks, education levels), then one can use the encoding z_1 < z_2 < ... < z_n. However, for nominal categories (e.g. ethnicity, preference, disease) there is no such ordering.

Let us illustrate the MAXENT method with simple data from a pre-election presidential poll conducted in 1988, where out of M = 1447 voters m_1 = 727 preferred Bush, m_2 = 583 preferred Dukakis, and m_3 = 137 preferred other candidates or had no preference. The data together with its Bayesian analysis is taken from [30]. Here our random variable Z is the voter's preference with three outcomes (z_1, z_2, z_3) = ('Bush', 'Dukakis', 'Other') and unknowns (q_1, q_2, q_3), which simply represent the fractions of the population with each preference. The goal here is to estimate q_1 - q_2, i.e. whether Bush has more support than Dukakis. One can assume the noninformative Dirichlet prior distribution for (q_1, q_2, q_3) with parameters α_1 = α_2 = α_3 = 1, compute the posterior means (q̂_1, q̂_2) of q_1 and q_2, and take the difference [30]. The results show that Bush has more support: q̂_1 > q̂_2. Since the data is purely categorical, we shall apply the frequency encoding for MAXENT: each category is represented by its frequency in the data set, e.g.
in this example (z_1, z_2, z_3) = (727/1447, 583/1447, 137/1447) ≈ (0.502, 0.403, 0.095). The empirical mean is then ∑_{k=1}^3 (m_k/M) z_k ≈ 0.424, and the maximizing solution of (16) with ∑_{k=1}^3 q_k z_k ≈ 0.424 yields (q^[1](z_1), q^[1](z_2), q^[1](z_3)) with q^[1](z_1) > q^[1](z_2). Thus the MAXENT result also shows more support for Bush.

The more detailed data for the same problem from [30] is displayed in Table V, where the M = 1447 voters are stratified into 16 regions. The Proportion column shows the proportions M_i/M (i = 1, ..., 16) of voters registered in each region, and (m_i1/M_i, m_i2/M_i, m_i3/M_i) in each row are the proportions of voters preferring Bush, Dukakis, and others/no preference among those who vote in the corresponding region. As in the previous example, one can assume Dirichlet prior distributions for (q_i1, q_i2, q_i3) with α_1 = α_2 = α_3 = 1, this time for each region separately, and compute the posterior means (q̂_i1, q̂_i2) of q_i1 and q_i2 for each region.

TABLE V: The regional distribution of the election data. Here Proportion approximates M_i/M, where M_i is the sample length in each region, while M = 1447 is the overall sample length; see [30] for details.

Region          Bush   Dukakis  Other  Proportion  q^[1](z_i1) - q^[1](z_i2)
Northeast, I    0.298  0.617    0.085  0.032       -0.404
Northeast, II   0.500  0.478    0.022  0.032        0.070
Northeast, III  0.467  0.413    0.120  0.115        0.093
Northeast, IV   0.464  0.522    0.014  0.048       -0.180
Midwest, I      0.404  0.489    0.106  0.032       -0.147
Midwest, II     0.447  0.447    0.106  0.065        0.0
Midwest, III    0.509  0.388    0.103  0.080        0.197
Midwest, IV     0.552  0.338    0.110  0.100        0.292
South, I        0.571  0.286    0.143  0.015        0.330
South, II       0.469  0.406    0.125  0.066        0.105
South, III      0.515  0.404    0.081  0.068        0.201
South, IV       0.555  0.352    0.093  0.126        0.296
West, I         0.500  0.471    0.029  0.023        0.084
West, II        0.532  0.351    0.117  0.053        0.255
West, III       0.540  0.371    0.089  0.086        0.266
West, IV        0.554  0.361    0.084  0.057        0.294
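The pooled computation above can be reproduced numerically. The sketch below (the helper name maxent_first_moment is ours, and bisection on the Lagrange multiplier β is one standard way to impose a single linear constraint; this is an illustration, not the paper's code) compares the frequency-encoded MAXENT estimate with the Dirichlet(1,1,1) posterior mean, using only the counts quoted in the text. Both estimators give Bush more support than Dukakis.

```python
import math

def maxent_first_moment(z, c):
    """Solve the first-moment MAXENT problem q_k = exp(-beta*z_k)/Z
    under sum_k q_k z_k = c, by bisection on beta (the constrained
    mean is monotone decreasing in beta)."""
    lo, hi = -200.0, 200.0
    def mean(beta):
        w = [math.exp(-beta * zk) for zk in z]
        s = sum(w)
        return sum(wk * zk for wk, zk in zip(w, z)) / s
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(mid) > c:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    w = [math.exp(-beta * zk) for zk in z]
    s = sum(w)
    return [wk / s for wk in w]

# 1988 pre-election poll counts quoted in the text: Bush, Dukakis, other
m, M = [727, 583, 137], 1447
z = [mk / M for mk in m]                        # frequency encoding of the categories
c = sum((mk / M) * zk for mk, zk in zip(m, z))  # empirical first moment

q_me = maxent_first_moment(z, c)                # MAXENT estimate
q_bayes = [(mk + 1) / (M + 3) for mk in m]      # Dirichlet(1,1,1) posterior mean

print("MAXENT:         Bush - Dukakis =", q_me[0] - q_me[1])
print("Posterior mean: Bush - Dukakis =", q_bayes[0] - q_bayes[1])
```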
Assuming that the proportions M_i/M are approximately equal to the population proportions for each region [30], one can estimate the difference in the fractions q_1 - q_2 by

∑_{i=1}^{16} (M_i/M)(q̂_i1 - q̂_i2).  (A1)

Now we can apply the MAXENT method to each region i, using the frequency encoding (z_i1, z_i2, z_i3) = (m_i1/M_i, m_i2/M_i, m_i3/M_i), and get the corresponding estimates (q^[1](z_i1), q^[1](z_i2), q^[1](z_i3)). The rightmost column of Table V shows the differences q^[1](z_i1) - q^[1](z_i2). Thus, the MAXENT estimate of the difference in the fractions q_1 - q_2 can be computed as in (A1):

q^[1](z_1) - q^[1](z_2) = ∑_{i=1}^{16} (M_i/M)(q^[1](z_i1) - q^[1](z_i2)) ≈ 0.15 > 0.  (A2)

To see whether the MAXENT prediction is reliable (on average) here, the same Bayesian decision model is set up for these samples: first a sample of (q_1, q_2, q_3) is drawn from the Dirichlet distribution with α_1 = α_2 = α_3 = 1; then, using this sample as the category probabilities, categorical data sets of size M are generated, with categories replaced by their frequency encodings. The process is repeated, and the average ⟨K⟩ from (4) is computed by generating instances of {q_k}_{k=1}^3 and categorical samples. For the present case α_1 = α_2 = α_3 = 1 and n = 3, (18) gives ⟨K[q, q^[0]]⟩ ≈ 0.265. For sufficiently large M we get ⟨K⟩ < ⟨K[q, q^[0]]⟩, i.e. the MAXENT solution is reliable. For example, already at M = 17 we have ⟨K_Bayes⟩ < ⟨K⟩ < ⟨K[q, q^[0]]⟩, where ⟨K_Bayes⟩ refers to the Bayesian (posterior-mean) estimator (10). For M = 47 the predictions of MAXENT are close to those of the optimal Bayesian estimator, and for the actual sample size M = 1447 the two average KL distances are closer still. Note from Table V that M_i = M_{South,I} = 22 is the minimal sample length over the regions.
Hence all our MAXENT predictions for the regions are reliable in the above sense.

To summarize this real categorical-data example: the frequency encoding of the categorical variable allows one to apply MAXENT. The MAXENT estimator (19) agrees with the Bayesian estimator, and is reliable already for modest sample sizes. For sufficiently large M the average KL distance of the MAXENT estimator gets close to that of the (optimal) Bayesian estimator.

Appendix B: Affine symmetry and the minimum entropy principle
Above we focused on the MAXENT estimators (20, 19) (the first empirical moment is fixed) or (23, 19, 22) (the first and second empirical moments are fixed). As discussed around (24), both (19) and (22) lead to affine-invariant probabilities. We studied several alternative constraints that do not have the full affine symmetry, i.e. this symmetry is partial and amounts to a restriction on the parameters in (24). An example of this is constraining the square-root (fractional) moment [cf. (20, 23)]:

q^[1/2](z_k) = e^{-β_{1/2} √z_k} / ∑_{l=1}^n e^{-β_{1/2} √z_l},  (B1)

∑_{k=1}^n q_k √z_k = (1/M) ∑_{u=1}^M √Z_u,  (B2)

where β_{1/2} is determined from (B2), and where we assumed z_k > 0. For the estimator (B1) the symmetry (24) is kept only under g > 0 and h = 0. We denote the corresponding average KL distance by ⟨K_{1/2}⟩.

Let us now compare two different MAXENT estimators, each employing its own constraint; e.g. we compare (20) with (B1). We saw from extensive numerical simulations that whenever these constraints have different degrees of the affine symmetry, the estimator having the larger symmetry provides a smaller average KL distance. A particular example of this general relation is

⟨K_1⟩ < ⟨K_{1/2}⟩,  (B3)

which was verified for the parameters of Tables I–IV.

Recall that Refs. [12–14] proposed the minimum entropy principle: when comparing two possible constraints to be employed in the maximum entropy method, it is preferable to use the one that provides the smaller (maximized) entropy. The heuristic motivation of the principle is that it avoids overfitting by not insisting too much on the entropy maximization. This principle was motivated via the minimum description length in [15].

We ask whether in cases similar to (B3) we can compare the average entropies, i.e. for (B3) we compare ⟨S[q^[1](z_k)]⟩ and ⟨S[q^[1/2](z_k)]⟩, where the averages are defined as in (4).
In all cases we were able to check, relations similar to (B3) are accompanied by the result that the constraint which provides a smaller average KL distance (i.e. the better constraint) also has a smaller average entropy, e.g.

⟨S[q^[1](z_k)]⟩ < ⟨S[q^[1/2](z_k)]⟩.  (B4)

The theoretical origin of this relation between the average KL distance and the average (maximized) entropy is not yet clear. Here is a concrete numerical example that illustrates (B3, B4): for the parameters of Table II at M = 55 we found ⟨K_1⟩ < ⟨K_{1/2}⟩ together with ⟨S[q^[1](z_k)]⟩ < ⟨S[q^[1/2](z_k)]⟩ = 4.008.
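The comparison underlying (B3) can be explored in simulation. The sketch below (our own function names; a sketch of the set-up, not the paper's code) estimates the average KL distance ⟨K⟩ by Monte Carlo for the first-moment and square-root-moment MAXENT estimators, assuming the Dirichlet(1, ..., 1) prior and the natural-number encoding z_k = k used in the main text; the precise parameters of Tables I–IV are not reproduced here.

```python
import math
import random

def maxent(f_vals, c, iters=100):
    """MAXENT under one linear constraint: q_k proportional to
    exp(-beta*f_k) with sum_k q_k f_k = c; beta is found by bisection
    (the constrained mean is monotone decreasing in beta)."""
    lo, hi = -50.0, 50.0
    def mean(beta):
        w = [math.exp(-beta * f) for f in f_vals]
        s = sum(w)
        return sum(wk * fk for wk, fk in zip(w, f_vals)) / s
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(mid) > c:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    w = [math.exp(-beta * f) for f in f_vals]
    s = sum(w)
    return [wk / s for wk in w]

def kl(p, q):
    """Kullback-Leibler distance K[p, q]."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def avg_kl(n=3, M=55, trials=500, sqrt_moment=False, seed=0):
    """Monte Carlo estimate of <K> for the first-moment
    (sqrt_moment=False) or square-root-moment (sqrt_moment=True)
    MAXENT estimator, under a Dirichlet(1,...,1) prior and z_k = k."""
    rng = random.Random(seed)
    z = list(range(1, n + 1))
    f_vals = [math.sqrt(zk) for zk in z] if sqrt_moment else [float(zk) for zk in z]
    total = 0.0
    for _ in range(trials):
        g = [rng.gammavariate(1.0, 1.0) for _ in z]  # Dirichlet(1,...,1) draw
        s = sum(g)
        q = [gi / s for gi in g]
        counts = [0] * n                             # multinomial sample of size M
        for _ in range(M):
            u, acc = rng.random(), 0.0
            for k, qk in enumerate(q):
                acc += qk
                if u < acc:
                    counts[k] += 1
                    break
            else:
                counts[-1] += 1
        c = sum((counts[k] / M) * f_vals[k] for k in range(n))
        total += kl(q, maxent(f_vals, c))
    return total / trials

print("<K_1>   approx", avg_kl(sqrt_moment=False))
print("<K_1/2> approx", avg_kl(sqrt_moment=True))
```

Increasing `trials` tightens the Monte Carlo estimates; with the paper's actual table parameters one would expect the ordering (B3) to emerge.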