A Maximum Entropy Procedure to Solve Likelihood Equations
Antonio Calcagnì∗, Livio Finos, Gianmarco Altoè, and Massimiliano Pastore
University of Padova

∗ Corresponding author: [email protected]

Abstract
In this article we provide initial findings regarding the problem of solving likelihood equations by means of a maximum entropy approach. Unlike standard procedures that require equating at zero the score function of the maximum-likelihood problem, we propose an alternative strategy where the score is instead used as an external informative constraint to the maximization of the convex Shannon's entropy function. The problem involves the re-parameterization of the score parameters as expected values of discrete probability distributions where probabilities need to be estimated. This leads to a simpler situation where parameters are searched in a smaller (hyper)simplex space. We assessed our proposal by means of empirical case studies and a simulation study, the latter involving the most critical case of logistic regression under data separation. The results suggested that the maximum entropy re-formulation of the score problem solves the likelihood equation problem. Similarly, when maximum-likelihood estimation is difficult, as in the case of logistic regression under separation, the maximum entropy proposal achieved results (numerically) comparable to those obtained by Firth's bias-corrected approach. Overall, these first findings reveal that a maximum entropy solution can be considered as an alternative technique to solve the likelihood equation.

Keywords: maximum entropy; score function; maximum likelihood; binary regression; data separation
1 Introduction

Maximum likelihood is one of the most used tools of modern statistics. As a result of its attractive properties, it is useful and suited for a wide class of statistical problems, including modeling, testing, and parameter estimation [13, 37]. In the case of regular and correctly specified models, maximum likelihood provides a simple and elegant means of choosing the best asymptotically normal estimators. Generally, the maximum likelihood workflow proceeds by first defining the statistical model which is thought to generate the sample data, together with the associated likelihood function. Then, the likelihood is differentiated with respect to the parameters of interest to obtain the likelihood equations (score), which are solved at zero to find the final estimates. In the simplest cases, the maximum likelihood solutions are expressed in closed form. However, analytic expressions are not always available for more complex problems, and researchers need to solve likelihood equations numerically. A broad class
of these procedures includes Newton-like algorithms, such as the Newton–Raphson, Fisher scoring, and quasi Newton–Raphson algorithms [39]. However, when the sample size is small, or when the optimization is no longer convex as in the case of more sophisticated statistical models, the standard version of Newton–Raphson may not be optimal. In this case, robust versions should instead be used [11]. A typical example of such a situation is logistic regression for binary data, where maximum likelihood estimates may no longer be available, for instance, when the binary outcome variable can be perfectly or partially separated by a linear combination of the covariates [2]. As a result, the Newton–Raphson algorithm is unstable, with inconsistent or infinite estimates. Other examples include small sample sizes, large numbers of covariates, and multicollinearity among the regressor variables [36]. Different proposals have been made to solve these drawbacks, many of which are based on iterative adjustments of the Newton–Raphson algorithm (e.g., see [16, 28]), penalized maximum likelihood (e.g., see [18]), or homotopy-based methods (e.g., see [1]). Among them, bias-corrected methods guarantee the existence of finite maximum likelihood estimates by removing first-order bias [12], whereas homotopy Newton–Raphson algorithms, which are mostly based on Adomian's decomposition, ensure more robust numerical convergence in finding roots of the score function (e.g., see [41]).

Maximum entropy (ME)-based methods have a long history in statistical modeling and inference (e.g., for a recent review see [20]). Since the seminal work by [24], there have been many applications of maximum entropy to the problem of parameter estimation in statistics, including autoregressive models [21], multinomial models [23], spatial autoregressive models [32], structural equation models [8], the co-clustering problem [3], and fuzzy linear regressions [10]. What all these works share in common is an elegant estimation method that avoids strong parametric assumptions on the model being used (e.g., error distribution). Differently, maximum entropy has also been widely adopted in many optimization problems, including queueing systems, transportation, portfolio optimization, image reconstruction, and spectral analysis (for a comprehensive review see [27, 15]). In all these cases, maximum entropy is instead used as a pure mathematical solver engine for complex or ill-posed problems, such as those encountered when dealing with differential equations [14], oversampled data [5], and data decomposition [6].

The aim of this article is to introduce a maximum entropy-based technique to solve likelihood equations as they appear in many standard statistical models. The idea relies upon the use of Jaynes' classical ME principle as a mathematical optimization tool [15, 14, 38]. In particular, instead of maximizing the likelihood function and solving the corresponding score, we propose a solution where the score is used as the data constraint to the estimation problem. The solution involves two steps: (i) reparametrizing the parameters as discrete probability distributions and (ii) maximizing Shannon's entropy function w.r.t. the unknown probability mass points constrained by the score equation. Thus, parameter estimation is reformulated as recovering probabilities in a (hyper)simplex space, with the searching surface being always regular and convex.
In this context, the score equation represents all the available information about the statistical problem and is used to identify a feasible region for estimating the model parameters. In this sense, our proposal differs from other ME-based procedures for statistical estimation (e.g., see [25]). Instead, our intent is to offer an alternative technique to solve score functions of parametric, regular, and correctly specified statistical models, where inference is still based on maximum likelihood theory.

The remainder of this article is organized as follows. Section 2 presents our proposal and describes its main characteristics by means of simple numerical examples. Section 3 describes the results of a simulation study where the ME method is assessed in the typical case of logistic regression under separation. Finally, Section 4 provides a general discussion of findings, comments, and suggestions for further investigations. Complementary materials like datasets and scripts used throughout the article are available to download at https://github.com/antcalcagni/ME-score, whereas the list of symbols and abbreviations adopted hereafter is available in Table 1.

Table 1: List of symbols and abbreviations used throughout the manuscript.

ME          Maximum entropy
NR          Newton–Raphson algorithm
NRF         Bias-corrected Newton–Raphson algorithm
y           sample of observations
Y           sample space
θ           J × 1 vector of parameters
θ̂           estimated vector of parameters
θ̃           reparameterized vector of parameters under ME
f(y; θ)     density function
l(θ)        log-likelihood function
U(θ), U(θ̃)  score function
z           K × 1 vector of finite elements for θ̃
p           K × 1 vector of unknown probabilities for θ̃
p̂           vector of estimated probabilities for θ̃

2 A maximum entropy solution to the score problem

Let y = (y_1, …, y_n) be a random sample of independent observations from the parametric model M = {f(y; θ) : θ ∈ Θ, y ∈ Y}, with f(y; θ) a density function parameterized over θ, Θ ⊆ R^J the parameter space with J the number of parameters, and Y the sample space. Let

    l(θ) = Σ_{i=1}^n ln f(y_i; θ)

be the log-likelihood of the model and

    U(θ) = ∇_θ l(θ) = (∂l/∂θ_1, …, ∂l/∂θ_j, …, ∂l/∂θ_J)

the score equation. In the regular case, the maximum likelihood estimate (MLE) θ̂ of the unknown vector of parameters θ is the solution of the score equation U(θ) = 0_J. In simple cases, θ̂ has a closed-form expression but, more often, a numerical solution is required, for instance by using iterative algorithms like Newton–Raphson and Expectation–Maximization.

In the maximum likelihood setting, our proposal is instead to solve U(θ) = 0_J by means of a maximum entropy approach (for a brief introduction, see [22]). This involves a two-step formulation of the problem, where θ is first reparameterized as a convex combination of a numerical support with some predefined points and probabilities. Next, a non-linear programming (NLP) problem is set with the objective of maximizing the entropy of the unknown probabilities subject to some feasible constraints. More formally, let

    θ̃ = (z_1^T p_1, …, z_j^T p_j, …, z_J^T p_J)^T    (1)

be the reparameterized J × 1 vector of parameters of the model M, where z_j is a user-defined vector of K × 1 (finite) points, whereas p_j is a K × 1 vector of unknown probabilities obeying p_j^T 1_K = 1. Note that the arrays z_1, …, z_J must be chosen to cover the natural range of the model parameters. Thus, for instance, in the case of estimating the population mean µ ∈ R of a normal model N(µ, σ²) with σ² known, z_µ = (−d, …, 0, …, d)^T with d as large as possible.
In practice, as observations y are available, the support vector can be defined using sample information, i.e., z_µ = (min(y), …, max(y))^T. Similarly, in the case of estimating the parameter π ∈ [0, 1] of the binomial model Bin(π, n), the support vector is z_π = (0, …, 1)^T. The choice of the number of points K of z can be made via sensitivity analysis, although a small number of points (such as the K = 5 and K = 7 used in the examples below) is usually enough for many regular problems (e.g., see [33, 25]). Readers may refer to [25] and [9] for further details.

Under the reparameterization in Equation (1), U(θ) = 0_J is solved via the following NLP problem:

    maximize_{(p_1, …, p_J)}  H(p_1, …, p_J)
    subject to:  U(θ̃) = 0_J
                 p_1^T 1_K = 1, …, p_J^T 1_K = 1,    (2)

where H(p) = −Σ_{j=1}^J p_j^T log p_j is the Shannon entropy function, whereas the score equation U(θ̃) has been rewritten using the reparameterized parameters θ̃. The problem needs to recover K × J quantities, which are defined in a (convex) hyper-simplex region with J (non-)linear equality constraints U(θ̃)_1, …, U(θ̃)_J (consistency constraints) and J linear equality constraints p_1^T 1_K = 1, …, p_J^T 1_K = 1 (normalization constraints). The latter ensure that the recovered quantities p̂_1, …, p̂_J are still probabilities. Note that closed-form solutions for the ME-score problem do not exist, and solutions need to be attained numerically.

In the following examples, we show how the ME-score problem can be formulated in the simplest cases of estimating a mean from Normal, Poisson, and Gamma models (Examples 1–3), as well as in the more complex case of estimating the parameters of a logistic regression (Example 4).

2.1 Example 1: Normal model

Consider the case of estimating the location parameter µ ∈ R of a Normal density function with σ² known. In particular, let y = (y_1, …, y_12)^T be a sample of n = 12 observations drawn from a population with Normal density N(µ, σ²), with σ² = 1 known. Our objective is to estimate µ using the information in y. Let l(µ) = −(2σ²)^{-1} ||y − µ1_n||² be the log-likelihood of the model, where constant terms have been dropped, and U(µ) = (σ²)^{-1}(y^T 1_n − nµ) the corresponding score w.r.t. µ. To define the associated ME-score problem for solving U(µ) = 0, first let µ_ME = z^T p, with z and p being K × 1 vectors of supports and unknown probabilities; in this example, z consists of K = 7 points with z_1 = min(y) and z_K = max(y). Given the optimization problem in (2), in this case p can be recovered via the Lagrangean method, as follows. Let

    L(λ_0, λ_1, p) = −p^T log p + λ_0 (1 − p^T 1_K) + λ_1 (σ²)^{-1}(y^T 1_n − n(z^T p))    (3)

be the Lagrangean function, with λ_0 and λ_1 being the usual Lagrangean multipliers. The Lagrangean system of the problem is

    ∂L(λ_0, λ_1, p)/∂p  = −log(p) − 1_K − λ_0 1_K − λ_1 n(σ²)^{-1} z = 0_K    (4)
    ∂L(λ_0, λ_1, p)/∂λ_0 = 1 − p^T 1_K = 0    (5)
    ∂L(λ_0, λ_1, p)/∂λ_1 = (σ²)^{-1}(y^T 1_n − n(z^T p)) = 0.    (6)

Solving for p in Equation (4), by using Equation (6), we get the general solution of the ME-score problem:

    p̂ = exp(−z λ̂_1 n(σ²)^{-1}) / (exp(−z λ̂_1 n(σ²)^{-1})^T 1_K),    (7)

where the quantity in the denominator is the normalization constant. Note that the solution in Equation (7) depends on the Lagrangean multiplier λ̂_1, which needs to be determined numerically [19]. In this particular example, we estimated the unknown Lagrangean multiplier using a grid-search approach, yielding a negative estimate λ̂_1 < 0. The resulting estimate µ̂_ME = z^T p̂ corresponds to the maximum likelihood estimate µ̂_ML = n^{-1} y^T 1_n, as expected.
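To make the procedure concrete, the following R sketch reproduces the logic of Example 1 on illustrative data (the sample values printed in the article are not reproduced here), combining the closed-form solution (7) with a grid search for λ̂_1:

# ME-score solution for the Normal mean (Example 1), illustrative data
set.seed(1)
y  <- rnorm(12, mean = 3, sd = 1)            # n = 12, sigma^2 = 1 known
n  <- length(y); s2 <- 1
z  <- seq(min(y), max(y), length.out = 7)    # K = 7 support points

# Closed-form ME probabilities as a function of lambda1 (Equation 7)
p_of <- function(l1) { w <- exp(-z * l1 * n / s2); w / sum(w) }

# Score constraint (Equation 6): y'1 - n * z'p = 0
score <- function(l1) sum(y) - n * sum(z * p_of(l1))

# Grid search for the multiplier that zeroes the score constraint
grid   <- seq(-5, 5, by = 1e-4)
l1_hat <- grid[which.min(abs(sapply(grid, score)))]
p_hat  <- p_of(l1_hat)

c(mu_ME = sum(z * p_hat), mu_ML = mean(y))   # the two estimates coincide

Any univariate root finder (e.g., uniroot) could replace the grid search; the grid is kept here only to mirror the procedure described in the text.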
2.2 Example 2: Poisson model

Consider the simple case of estimating the parameter λ ∈ R+ of a Poisson density function. Let y = (y_1, …, y_16)^T be a sample of n = 16 observations drawn from a Poisson density Pois(λ), and let U(λ) = −n + (y^T 1_n)/λ be the score of the model. The reparameterized Poisson parameter is λ_ME = z^T p, with the support consisting of K = 5 points with z_K = max(y). Note that, since the Poisson parameter λ is bounded below by zero, we can set z_1 = 0. Unlike the previous case, we cannot determine p̂ analytically. For this reason, we need to solve the ME-score problem

    maximize_p  −p^T log(p)
    subject to:  p^T 1_K = 1
                 −n + (y^T 1_n)/(z^T p) = 0    (8)

via the augmented Lagrangean adaptive barrier algorithm as implemented in the function constrOptim.nl of the R package alabama [40]. The algorithm converged successfully in a few iterations, and the recovered probabilities p̂ yield λ̂_ME = z^T p̂, which is equal to the maximum likelihood solution λ̂_ML = n^{-1} y^T 1_n, as expected.
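A minimal R sketch of problem (8), again on illustrative data, using constrOptim.nl as in the text:

# ME-score solution for the Poisson rate (Example 2), illustrative data
library(alabama)
set.seed(2)
y <- rpois(16, lambda = 6)                  # n = 16
n <- length(y)
z <- seq(0, max(y), length.out = 5)         # K = 5, z1 = 0, zK = max(y)

negH <- function(p) sum(p * log(p))         # minimizing -H maximizes entropy
heq  <- function(p) c(sum(p) - 1,                # normalization constraint
                      -n + sum(y) / sum(z * p))  # score constraint U(lambda) = 0

fit <- constrOptim.nl(par = rep(1/5, 5), fn = negH,
                      hin = function(p) p,       # adaptive barrier keeps p > 0
                      heq = heq,
                      control.outer = list(trace = FALSE))

c(lambda_ME = sum(z * fit$par), lambda_ML = mean(y))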
2.3 Example 3: Gamma model

Consider now a random sample y = (y_1, …, y_17)^T of n = 17 observations drawn from a Gamma density Ga(α, ρ), with α ∈ R+ being the shape parameter and ρ ∈ R+ the rate parameter. The log-likelihood of the model is

    l(α, ρ) = (α − 1) log(y)^T 1_n − ρ y^T 1_n + nα log(ρ) − n log(Γ(α)),

where Γ(·) is the well-known gamma function and log(y) is taken element-wise. The corresponding score functions are

    U(α) = log(y)^T 1_n + n log(ρ) − nψ(α)
    U(ρ) = −y^T 1_n + nα/ρ,

with ψ(α) = ∂/∂α log(Γ(α)) being the digamma function, i.e., the derivative of the logarithm of the gamma function evaluated at α. The reparameterized Gamma parameters are defined as usual, α̃_ME = z_α^T p_α and ρ̃_ME = z_ρ^T p_ρ, whereas the supports can be determined as z_α = (0, …, α_0 + δ)^T and z_ρ = (0, …, ρ_0 + δ)^T, with δ a positive constant. Note that the upper limits of the supports can be chosen according to the approximations α_0 = 1/M and ρ_0 = α_0/ȳ, with M = log(ȳ) − (Σ_i log(y_i))/n and ȳ = (Σ_i y_i)/n [7]. In the current example, the supports consist of K = 5 points each, with δ = 3 and α_0 and ρ_0 computed from the sample as above. The ME-score problem for the Gamma case is

    maximize_{p_α, p_ρ}  −p_α^T log(p_α) − p_ρ^T log(p_ρ)
    subject to:  p_α^T 1_K = 1
                 p_ρ^T 1_K = 1
                 −y^T 1_n + n(z_α^T p_α)/(z_ρ^T p_ρ) = 0
                 log(y)^T 1_n + n log(z_ρ^T p_ρ) − nψ(z_α^T p_α) = 0,    (9)

which is solved via the augmented Lagrangean adaptive barrier algorithm. The algorithm required a few iterations to converge, and the estimated parameters under the ME-score formulation, α̂_ME = z_α^T p̂_α and ρ̂_ME = z_ρ^T p̂_ρ, equal the maximum likelihood solutions α̂_ML and ρ̂_ML, as expected.
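A corresponding R sketch of problem (9), under illustrative data and the moment-based bounds given above:

# ME-score solution for the Gamma parameters (Example 3), illustrative data
library(alabama)
set.seed(3)
y <- rgamma(17, shape = 1.5, rate = 5)      # n = 17
n <- length(y); K <- 5

# Support upper bounds from the moment-based approximations [7]
M      <- log(mean(y)) - mean(log(y))
alpha0 <- 1 / M
rho0   <- alpha0 / mean(y)
delta  <- 3
z_a <- seq(0, alpha0 + delta, length.out = K)
z_r <- seq(0, rho0   + delta, length.out = K)

negH <- function(p) sum(p * log(p))         # joint (negative) entropy
heq  <- function(p) {
  pa <- p[1:K]; pr <- p[(K + 1):(2 * K)]
  a  <- sum(z_a * pa); r <- sum(z_r * pr)   # reparameterized alpha and rho
  c(sum(pa) - 1, sum(pr) - 1,               # normalization constraints
    -sum(y) + n * a / r,                    # score constraint U(rho) = 0
    sum(log(y)) + n * log(r) - n * digamma(a))  # score constraint U(alpha) = 0
}

fit <- constrOptim.nl(par = rep(1/K, 2 * K), fn = negH,
                      hin = function(p) p, heq = heq,
                      control.outer = list(trace = FALSE))

c(alpha_ME = sum(z_a * fit$par[1:K]),
  rho_ME   = sum(z_r * fit$par[(K + 1):(2 * K)]))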
2.4 Example 4: Logistic regression

In what follows, we show the ME-score formulation for logistic regression. We consider both simple situations involving no separation, where maximum likelihood estimates can be easily computed, and those unfortunate situations in which separation occurs. Note that in the latter case, maximum likelihood estimates are no longer available without resorting to the use of a bias-reduction iterative procedure [16]. Formally, the logistic regression model with p continuous predictors is as follows:

    π_i = (1 + exp(−X_i β))^{-1}    (10)
    y_i ∼ Bin(π_i),

where X is an n × p matrix containing the predictors, β is a p × 1 vector of model parameters, and y is an n × 1 vector of observed responses. Here, the standard maximum likelihood solutions β̂ are usually attained numerically, e.g., using Newton–Raphson-like algorithms [2].

No separation case. As an illustration of the ME-score problem in the optimal situation where no separation occurs, we consider the traditional Finney data on vasoconstriction in the skin of the digits (see Table 2) [34].

Table 2: Finney's data on vasoconstriction in the skin of the digits. The response Y indicates the occurrence (Y = 1) or non-occurrence (Y = 0) of the vasoconstriction.

Volume  Rate   Y
3.70    0.825  1
3.50    1.090  1
1.25    2.500  1
0.75    1.500  1
0.80    3.200  1
0.70    3.500  1
0.60    0.750  0
1.10    1.700  0
0.90    0.750  0
0.90    0.450  0
0.80    0.570  0
0.55    2.750  0
0.60    3.000  0
1.40    2.330  1
0.75    3.750  1
2.30    1.640  1
3.20    1.600  1
0.85    1.415  1
1.70    1.060  0

In the Finney case, the goal is to predict the vasoconstriction responses as a function of Volume and Rate, according to the following linear term [34]:

    logit(π_i) = β_0 + β_1 log(Volume_i) + β_2 log(Rate_i),    (11)

with logit : [0, 1] → R being the inverse of the logistic function. In the maximum entropy framework, the model parameters can be reformulated as follows:

    β_ME = (z^T ⊗ I_{p+1}) vec(P^T),    (12)

where z is a K × 1 vector of support points, I_{p+1} is an identity matrix of order p + 1 (including the intercept term), P is a (p + 1) × K matrix of probabilities associated with the p parameters plus the intercept, ⊗ is the Kronecker product, and vec(·) is the linear operator that transforms a matrix into a column vector. Note that in this example p = 2 and K = 7, whereas the support z, centered around zero, is defined to be the same for both predictors and the intercept (the bounds of the support have been chosen to reflect the maximal variation allowed by the logistic function). Finally, the ME-score problem for the Finney logistic regression is:

    maximize_{vec(P)}  −vec(P)^T log(vec(P))
    subject to:  P 1_K = 1_{p+1}
                 X^T (y − π) = 0_{p+1},    (13)

where X is the n × (p + 1) matrix containing the variables Rate, Volume, and a column of all ones for the intercept term, and π = (1 + exp(−X β_ME))^{-1}, with β_ME defined as in Equation (12). Solutions for P̂ were obtained via the augmented Lagrangean adaptive barrier algorithm, which yielded the following estimates:
    P̂ = [ 0.000  0.004  0.062  0.159  0.220  0.263  0.292
          0.000  0.001  0.099  0.178  0.224  0.247  0.251
          0.205  0.201  0.190  0.170  0.137  0.085  0.012 ],

where the third line of P̂ refers to the intercept term. The final estimated coefficients, computed via Equation (12), have a negative intercept β̂_0,ME and slopes β̂_1,ME and β̂_2,ME of about 5 and 4, respectively, and are the same as those obtained in the original paper of [34].
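A compact R sketch of problem (13) for the data in Table 2 follows; the support bounds (±10) are an assumption, since the text only states that z is centered around zero:

# ME-score problem for the Finney logistic regression (Example 4)
library(alabama)
volume <- c(3.70, 3.50, 1.25, 0.75, 0.80, 0.70, 0.60, 1.10, 0.90, 0.90,
            0.80, 0.55, 0.60, 1.40, 0.75, 2.30, 3.20, 0.85, 1.70)
rate   <- c(0.825, 1.090, 2.500, 1.500, 3.200, 3.500, 0.750, 1.700, 0.750,
            0.450, 0.570, 2.750, 3.000, 2.330, 3.750, 1.640, 1.600, 1.415, 1.060)
y <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0)

X <- cbind(1, log(volume), log(rate))       # n x (p+1) design matrix
J <- ncol(X); K <- 7
z <- seq(-10, 10, length.out = K)           # assumed bounds, centered at zero

beta_of <- function(P) as.numeric(P %*% z)  # Equation (12): rows of P -> beta
negH <- function(v) sum(v * log(v))
heq  <- function(v) {
  P  <- matrix(v, nrow = J)                 # J x K probability matrix
  pr <- 1 / (1 + exp(-X %*% beta_of(P)))
  c(rowSums(P) - 1,                         # normalization constraints
    as.numeric(t(X) %*% (y - pr)))          # score constraints X'(y - pi) = 0
}

fit <- constrOptim.nl(par = rep(1/K, J * K), fn = negH,
                      hin = function(v) v, heq = heq,
                      control.outer = list(trace = FALSE))
beta_of(matrix(fit$par, nrow = J))  # compare with glm(y ~ log(volume) + log(rate))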
Separation case. As a typical example of data under separation, we consider the classical Fisher iris dataset [31]. As is generally known, the dataset contains fifty measurements of the length and width (in centimeters) of the sepal and petal variables for each of three species of iris, namely setosa, versicolor, and virginica [17]. For the sake of simplicity, we keep a subset of the whole dataset containing two species of iris (i.e., setosa and virginica) with the sepal length and width variables only. Inspired by the work of [31], we study a model where the response variable is a binary classification of iris, with Y = 0 indicating the class virginica and Y = 1 the class setosa, whereas sepal length and width are the predictors of Y. The logistic regression for the iris data assumes the following linear term:

    logit(π_i) = β_0 + β_1 length_i + β_2 width_i,    (14)

where the model parameters can be reformulated as in Equation (12), with K = 7, p = 2, and z being centered around zero with bounds z_1 = −25 and z_K = 25. The ME-score problem for the iris dataset is the same as in (13), and it is solved using the augmented Lagrangean adaptive barrier algorithm. The recovered P̂ is
    P̂ = [ 0.228  0.226  0.215  0.190  0.137  0.001  0.003
          0.000  0.039  0.040  0.158  0.218  0.257  0.288
          0.000  0.000  0.000  0.037  0.210  0.329  0.424 ],
where the intercept term is reported in the third line of the matrix. The estimates for the model coefficients are reported in Table 3 (ME, first column). For the sake of comparison, Table 3 also reports the estimates obtained by solving the score of the model via the bias-corrected Newton–Raphson (NRF, second column) and the Newton–Raphson (NR, third column). The NRF algorithm uses the Firth correction for the score function [16] as implemented in the R package logistf [26]. As expected, the NR algorithm fails to converge, reporting divergent estimates. By contrast, the NRF procedure converges to non-divergent solutions. Interestingly, the maximum entropy solutions are closer to the NRF estimates, although they differ in magnitude.

Table 3: Estimates for the iris logistic regression: ME (maximum entropy), NRF (bias-corrected Newton–Raphson), NR (Newton–Raphson). Note that the NRF algorithm implements the Firth bias correction [16].

       ME        NRF      NR
β_0    …         …        …
β_1    -10.091   -6.151   -166.637
β_2    …         …        …
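The failure of plain maximum likelihood on these data is easy to verify; the following R check is a sketch using the base iris data (variable names from the standard dataset, not the article's code):

# Divergence of plain ML (NR) vs. Firth's correction (NRF) under separation
library(logistf)
d <- subset(iris, Species %in% c("setosa", "virginica"))
d$y <- as.numeric(d$Species == "setosa")

# glm() warns that fitted probabilities numerically 0 or 1 occurred:
# the two classes are separated and the NR estimates diverge
fit_nr  <- glm(y ~ Sepal.Length + Sepal.Width, family = binomial, data = d)

# logistf() applies the Firth penalty and returns finite estimates
fit_nrf <- logistf(y ~ Sepal.Length + Sepal.Width, data = d)

cbind(NR = coef(fit_nr), NRF = coef(fit_nrf))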
3 Simulation study

Having examined the ME-score problem with numerical examples for both simple and more complex cases, in this section we numerically investigate the behavior of the maximum entropy solutions for the most critical case of logistic regression under separation.

Design. Two factors were systematically varied in a complete two-factorial design:

(i) the sample size n at three levels: 15, 50, 200;
(ii) the number of predictors p (excluding the intercept) at three levels: 1, 5, 10.

The levels of n and p were chosen to represent the most common cases of simple, medium, and complex models, as those usually encountered in many social research studies.

Procedure. Consider the logistic regression model as represented in Equation (10), and let n_k and p_k be distinct elements of the sets of levels of n and p. The following procedure was repeated Q = 10000 times for each of the 3 × 3 = 9 combinations of the simulation design:

1. Generate the matrix of predictors X_{n_k × (1+p_k)} = [1_{n_k} | X̃_{n_k × p_k}], where X̃_{n_k × p_k} is drawn from the multivariate standard Normal distribution N(0_{p_k}, I_{p_k}), whereas the column vector of all ones stands for the intercept term;

2. Generate the vector of parameters β_{p_k} from the multivariate centered Normal distribution N(0_{p_k}, σ² I_{p_k}), where σ² was chosen to cover the natural range of variability allowed by the logistic equation;

3. Compute the vector π_{n_k} via Equation (10) using X_{n_k × (1+p_k)} and β_{p_k};

4. For q = 1, …, Q, generate the vectors of response variables y^(q)_{n_k} from the binomial distribution Bin(π_{n_k}), with π_{n_k} being fixed;

5. For q = 1, …, Q, estimate the vectors of parameters β̂^(q)_{1+p_k} by means of the Newton–Raphson (NR), bias-corrected Newton–Raphson (NRF), and ME-score (ME) algorithms (a minimal sketch of one replicate is given below).

The entire procedure involves a total of Q × 3 × 3 = 90000 new datasets, as well as an equivalent number of model parameters. For the NR and NRF algorithms, we used the glm and logistf routines of the R packages stats [35] and logistf [26]. By contrast, the ME-score problem was solved via the augmented Lagrangean adaptive barrier algorithm implemented in the constrOptim.nl routine of the R package alabama [40]. Convergence of the algorithms was checked using the built-in criteria of glm, logistf, and constrOptim.nl. For each of the generated datasets {y, X}_{q=1,…,Q}, the occurrence of separation was checked using a linear programming-based routine to find infinite estimates in the maximum likelihood solution [29, 30]. The whole simulation procedure was performed on a (remote) HPC machine with 16 Intel Xeon E5-2630L v3 (1.80 GHz) CPUs and 16 × 4 GB of RAM.

Measures. The simulation results were evaluated considering the averaged bias of the parameters, B̂ = (1/Q) Σ_{q=1}^Q (β − β̂^(q)), its squared version B̂² (the square here is element-wise), and the averaged variance of the estimates, V̂ = Var_q(β̂^(q)). They were then combined to form the mean square error of the estimates, MSE = V̂ + B̂². The relative bias RB = (β̂_j − β_j)/|β_j| was also computed for each predictor j = 1, …, J (β indicates the population parameter). The measures were computed for each of the three algorithms and for all the combinations of the simulation design.
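The following R sketch illustrates steps 1–5 for a single replicate (the value of σ² is an assumption here, and the ME step, which would call constrOptim.nl on problem (13), is omitted for brevity):

# One simulation replicate (sketch; sigma2 is an assumed value)
library(logistf)
one_replicate <- function(n, p, sigma2 = 2) {
  X    <- cbind(1, matrix(rnorm(n * p), n, p))    # step 1: design matrix
  beta <- rnorm(p + 1, 0, sqrt(sigma2))           # step 2: true parameters
  pr   <- 1 / (1 + exp(-X %*% beta))              # step 3: probabilities
  y    <- rbinom(n, 1, pr)                        # step 4: binary responses
  dat  <- data.frame(y = y, X[, -1, drop = FALSE])
  nr   <- coef(glm(y ~ ., family = binomial, data = dat))   # step 5: NR
  nrf  <- coef(logistf(y ~ ., data = dat))                  # step 5: NRF
  rbind(true = beta, NR = nr, NRF = nrf)
}
set.seed(4)
one_replicate(n = 50, p = 5)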
Results. Table 4 reports the proportions of separation present in the data for each level of the simulation design, along with the proportions of non-convergence for the three algorithms. As expected, NR failed to converge when severe separation occurred, for instance, in the case of small samples and large numbers of predictors. By contrast, for the NRF and ME algorithms, the convergence criteria were always met. The results of the simulation study with regard to bias, variance, and mean square error (MSE) are reported in Table 5 and Figure 1. In general, the MSE of the three algorithms decreased almost linearly with increasing sample sizes and numbers of predictors. As expected, the NR algorithm showed higher MSE than NRF and ME, except in the simplest case of n = 200 and p = 1. Unlike the NR algorithm, with increasing model complexity (p > 1), ME showed performances similar to those of NRF both for medium (n = 50) and large (n = 200) sample sizes. Interestingly, for the most complex scenario, involving a large sample (n = 200) and higher model complexity (p = 10), the ME algorithm outperformed NRF in terms of MSE. To further investigate the relationship between NRF and ME, we focused on the latter conditions and analyzed the behavior of ME and NRF in terms of relative bias (RB, see Figure 2). Both the ME and NRF algorithms showed RB distributions centered about 0. Except for the condition N = 200 ∧ P = 10, where ME showed smaller variance than NRF, the two algorithms showed similar variance in the estimates of the parameters. Finally, we also computed the ratio of over- and under-estimation r as the ratio between the number of positive and negative RB values, getting the following results: r_ME ≈ 1.17 (over-estimation: 54%) and r_NRF ≈ 0.96 (over-estimation: 49%) for the case N = 200 ∧ P = 5, and r_ME ≈ 1.13 (over-estimation: 53%) and r_NRF ≈ 0.89 (over-estimation: 47%) for the case N = 200 ∧ P = 10.

Overall, the results suggest the following points:

• In the simplest cases with no separation (i.e., N = 50 ∧ P = 1, N = 200 ∧ P = 1, N = 200 ∧ P = 5), the ME solutions to the maximum likelihood equations were the same as those provided by the standard Newton–Raphson (NR) and the bias-corrected version (NRF). In all these cases, the bias of the estimates approximated zero (see Table 5);

• In the cases of separation, ME showed performances comparable to those of NRF, which is known to provide the most efficient estimates in the case of the logistic model under separation: bias and MSE decreased as a function of sample size and predictors, with MSE being lower for ME than NRF in the cases N = 200 ∧ P = 5 and N = 200 ∧ P = 10;

• In the most complex scenarios with a large sample and higher model complexity (N = 200 ∧ P = 5, N = 200 ∧ P = 10), the ME and NRF algorithms showed similar relative bias, with ME estimates being less variable than NRF in the N = 200 ∧ P = 10 condition. The ME algorithm tended to over-estimate the population parameters; by contrast, NRF tended to under-estimate the true model parameters.

Table 4: Simulation study: proportions of separation occurring in the data and non-convergence (nc) rates for the NR, NRF, and ME algorithms.

n     p    separation  nc NR  nc NRF  nc ME
15    1    0.333       0.085  0.000   0.000
50    1    0.002       0.002  0.000   0.000
200   1    0.000       0.000  0.000   0.000
15    5    0.976       0.237  0.000   0.000
50    5    0.771       0.771  0.000   0.000
200   5    0.000       0.000  0.000   0.000
15    10   1.000       0.002  0.000   0.000
50    10   0.949       0.950  0.000   0.000
200   10   0.013       0.013  0.000   0.000
4 Discussion

We have described a new approach to solving the problem U(θ) = 0_J in order to get θ̂ in the context of maximum likelihood theory. Our proposal takes advantage of the maximum entropy principle to set up a non-linear programming problem where U(θ) is not solved directly, but is instead used as an informative constraint on the maximization of Shannon's entropy. Thus, the parameter θ is not searched over the parameter space Θ ⊆ R^J; rather, it is reparameterized as a convex combination of a known vector z, which indicates the finite set of possible values for θ, and a vector of unknown probabilities p, which instead needs to be estimated. In so doing, we converted the problem U(θ) = 0_J from one of numerical mathematics to one of inference, where U(θ) is treated as one of the many pieces of (external) information we may have. As a result, the maximum entropy solution requires neither the computation of the Hessian of second-order derivatives of l(θ) (or the expectation of the Fisher information matrix), nor the definition of initial values θ_0, as is required by Newton-like algorithms.

Table 5: Simulation study: averaged bias, squared averaged bias, and MSE for the NR, NRF, and ME algorithms.

             NR                              NRF                         ME
n     p    B̂       V̂        B̂²      MSE     B̂      V̂     B̂²    MSE    B̂      V̂     B̂²    MSE
15    1   -5.54   236.70   30.67   267.36   0.22   0.35  0.05  0.40  -1.17   6.28  1.37  7.64
50    1   -0.13     3.42    0.02     3.44  -0.00   1.41  0.00  1.41  -0.12   1.99  0.01  2.00
200   1    0.03     0.11    0.00     0.11   0.00   0.10  0.00  0.10   0.03   0.11  0.00  0.11
15    5   10.68  1553.37  113.98  1667.33  -1.22   3.00  1.50  4.49   0.20   5.32  0.04  5.36
50    5    7.46  1918.18   55.65  1973.78  -0.44   2.20  0.20  2.39  -0.11   1.45  0.01  1.46
200   5    0.24     1.58    0.06     1.64   0.01   0.50  0.00  0.50   0.12   0.42  0.02  0.44
15   10   -0.97   177.40    0.95   178.35  -0.13   4.82  0.02  4.84  -0.38   8.10  0.14  8.24
50   10    2.80  1490.39    7.83  1498.20  -0.07   1.23  0.00  1.23  -0.02   1.53  0.00  1.53
200  10    0.66    15.29    0.43    15.72   0.02   0.86  0.00  0.86   0.10   0.48  0.01  0.50

Figure 1: Simulation study: averaged bias, squared averaged bias, and MSE for the NR, NRF, and ME algorithms. Note that the number of predictors p is represented column-wise (outside), whereas the sample size n is reported on the x-axis (inside). The measures are plotted on a logarithmic scale.

In contrast, the maximum entropy solution revolves around the reduction of the initial uncertainty: as one adds pieces of external information (constraints), a departure from the initial uniform distribution p results, implying a reduction of the uncertainty about θ; a solution is found when no further reduction can be enforced given the set of constraints.

We used a set of empirical cases and a simulation study to assess the maximum entropy solution to the score problem. In cases where the Newton–Raphson is no longer correct for θ (e.g., logistic regression under separation), the ME-score formulation showed results (numerically) comparable with those obtained using the bias-corrected Newton–Raphson, in the sense of having the same or even smaller mean square errors (MSE). Broadly speaking, these first findings suggest that the ME-score formulation can be considered a valid alternative to solve U(θ) = 0_J, although further in-depth investigations need to be conducted to formally evaluate the statistical properties of the ME-score solution.

Nevertheless, we stress that here the maximum entropy approach has been used to build a solver for maximum likelihood equations [15, 14, 38]. In this sense, standard errors, confidence levels, and other likelihood-based quantities can be computed using the usual asymptotic properties of maximum likelihood theory. However, attention should be directed at the definition of the support points z, since they need to be sufficiently large to include the true (hypothesized) parameters we are looking for. Relatedly, our proposal differs from other methods, such as Generalized Maximum Entropy (GME) or Generalized Cross Entropy (GCE) [25, 10], in two important respects. First, the ME-score formulation does not provide a class of estimators for the parameters of statistical models. By contrast, GME and GCE are estimators belonging to the exponential family, which can be used in many cases as alternatives to maximum likelihood estimators [22]. Second, the ME-score formulation does not provide an inferential framework for θ. While GME and GCE use information theory to provide the basis for inference and model evaluation (e.g., using Lagrangean multipliers and normalized entropy indices), the ME-score formulation focuses on the problem of finding roots of U(θ) = 0_J. Finally, an open issue which deserves greater consideration in future investigations is the examination of how the ME-score solution can be considered in light of the well-known maximum entropy-likelihood duality [4].

Some advantages of the ME-score solution over Newton-like algorithms may include the following:
(i) model parameters are searched in a smaller and simpler space because of the convex reparameterization required for θ; (ii) the function to be maximized requires neither the computation of second-order derivatives of l(θ) nor the search for good initial values θ_0; (iii) additional information on the parameters, such as dominance relations among the parameters, can be added to the ME-score formulation in terms of inequality constraints (e.g., θ_j < θ_t, j ≠ t). Furthermore, the ME-score formulation may be extended to include a priori probability distributions on θ. While in the current proposal the elements of z_j all have the same probability of occurring, the Kullback–Leibler entropy might be used to form a Kullback–Leibler-score problem, where z_1, …, z_J are adequately weighted by known vectors of probability w_1, …, w_J (see the sketch below). This would offer, for instance, another opportunity to deal with cases involving penalized likelihood estimation.
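As a sketch in our notation (the article does not spell this variant out), such a Kullback–Leibler-score problem would replace the entropy objective in (2) with a divergence from the reference weights:

    minimize_{(p_1, …, p_J)}  Σ_{j=1}^J p_j^T log(p_j / w_j)
    subject to:  U(θ̃) = 0_J
                 p_j^T 1_K = 1,  j = 1, …, J,

where the division p_j / w_j is taken element-wise; the ME-score problem (2) is recovered when each w_j is uniform over the K support points.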
In conclusion, we think that this work yielded initial findings on the solution of likelihood equations from a maximum entropy perspective. To our knowledge, this is the first time that maximum entropy has been used to define a solver for the score function. We believe this contribution will be of interest to all researchers working at the intersection of information theory, data mining, and applied statistics.

Figure 2: Simulation study: relative bias for the NRF and ME algorithms in the conditions N = 200 ∧ P = 5 (A) and N = 200 ∧ P = 10 (B). Note that plots are paired vertically by predictor. Rate of over-estimation (under-estimation): (A) ME = 0.54 (0.46), NRF = 0.49 (0.51); (B) ME = 0.53 (0.47), NRF = 0.47 (0.53).

References

[1] Saeid Abbasbandy, Y Tan, and SJ Liao. Newton-homotopy analysis method for nonlinear equations. Applied Mathematics and Computation, 188(2):1794–1800, 2007.

[2] Adelin Albert and John A Anderson. On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71(1):1–10, 1984.
[3] Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu, and Dharmendra S Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. Journal of Machine Learning Research, 8(Aug):1919–1986, 2007.

[4] Lawrence D Brown. Fundamentals of Statistical Exponential Families: with Applications in Statistical Decision Theory, volume 9. Institute of Mathematical Statistics Lecture Notes-Monograph Series, 1986.

[5] RK Bryan. Maximum entropy analysis of oversampled data problems. European Biophysics Journal, 18(3):165–174, 1990.

[6] Antonio Calcagnì, Luigi Lombardi, and Simone Sulpizio. Analyzing spatial data from mouse tracker methodology: An entropic approach. Behavior Research Methods, 49(6):2012–2030, 2017.

[7] S Choi and R Wette. Maximum likelihood estimation of the parameters of the gamma distribution and their bias. Technometrics, 11(4):683–690, 1969.

[8] Enrico Ciavolino and Amjad D Al-Nasser. Comparing generalised maximum entropy and partial least squares methods for structural equation models. Journal of Nonparametric Statistics, 21(8):1017–1036, 2009.

[9] Enrico Ciavolino and Antonio Calcagnì. A generalized maximum entropy (GME) approach for crisp-input/fuzzy-output regression model. Quality & Quantity, 48(6):3401–3414, 2014.

[10] Enrico Ciavolino and Antonio Calcagnì. A generalized maximum entropy (GME) estimation approach to fuzzy regression model. Applied Soft Computing, 38:51–63, 2016.

[11] Daniel Commenges, Hélène Jacqmin-Gadda, Cécile Proust, and Jérémie Guedj. A Newton-like algorithm for likelihood maximization: The robust-variance scoring algorithm. arXiv preprint math/0610402, 2006.

[12] Gauss M Cordeiro and Peter McCullagh. Bias correction in generalized linear models. Journal of the Royal Statistical Society: Series B (Methodological), 53(3):629–643, 1991.

[13] David Roxbee Cox. Principles of Statistical Inference. Cambridge University Press, 2006.

[14] SA El-Wakil, A Elhanbaly, and MA Abdou. Maximum entropy method for solving the collisional Vlasov equation. Physica A: Statistical Mechanics and its Applications, 323:213–228, 2003.

[15] Shu-Cherng Fang, Jay R Rajasekera, and H-S Jacob Tsao. Entropy Optimization and Mathematical Programming, volume 8. Springer Science & Business Media, 2012.

[16] David Firth. Bias reduction of maximum likelihood estimates. Biometrika, 80(1):27–38, 1993.

[17] Ronald A Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.

[18] Sujuan Gao and Jianzhao Shen. Asymptotic properties of a double penalized maximum likelihood estimator in logistic regression. Statistics & Probability Letters, 77(9):925–930, 2007.

[19] Amos Golan. Maximum entropy, likelihood and uncertainty: A comparison. In Maximum Entropy and Bayesian Methods, pages 35–56. Springer, 1998.

[20] Amos Golan. Foundations of Info-Metrics: Modeling and Inference with Imperfect Information. Oxford University Press, 2017.

[21] Amos Golan, George Judge, and Larry Karp. A maximum entropy approach to estimation and inference in dynamic models or counting fish in the sea using maximum entropy. Journal of Economic Dynamics and Control, 20(4):559–582, 1996.

[22] Amos Golan, George Judge, and Douglas Miller. The maximum entropy approach to estimation and inference. In Applying Maximum Entropy to Econometric Problems, pages 3–24. Emerald Group Publishing Limited, 1997.

[23] Amos Golan, George Judge, and Jeffrey M Perloff. A maximum entropy approach to recovering information from multinomial response data. Journal of the American Statistical Association, 91(434):841–853, 1996.

[24] Amos Golan, George Judge, and Sherman Robinson. Recovering information from incomplete or partial multisectoral economic data. The Review of Economics and Statistics, pages 541–549, 1994.

[25] Amos Golan, George G Judge, and Douglas Miller. Maximum Entropy Econometrics: Robust Estimation with Limited Data. Wiley, New York, 1996.

[26] Georg Heinze and Meinhard Ploner. logistf: Firth's Bias-Reduced Logistic Regression, 2018. R package version 1.23.

[27] Jagat Narain Kapur. Maximum-Entropy Models in Science and Engineering. John Wiley & Sons, 1989.

[28] EC Kenne Pagui, A Salvan, and N Sartori. Median bias reduction of maximum likelihood estimates. Biometrika, 104(4):923–938, 2017.

[29] Kjell Konis. Linear Programming Algorithms for Detecting Separated Data in Binary Logistic Regression Models. PhD thesis, Department of Statistics, University of Oxford, 2007. https://ora.ox.ac.uk/objects/uuid:8f9ee0d0-d78e-4101-9ab4-f9cbceed2a2a.

[30] Kjell Konis. safeBinaryRegression: Safe Binary Regression, 2013. R package version 0.1-3.

[31] Emmanuel Lesaffre and Adelin Albert. Partial separation in logistic discrimination. Journal of the Royal Statistical Society: Series B, 51(1):109–116, 1989.

[32] Thomas L Marsh and Ron C Mittelhammer. Generalized maximum entropy estimation of a first order spatial autoregressive model. In Spatial and Spatiotemporal Econometrics, pages 199–234. Emerald Group Publishing Limited, 2004.

[33] R Bernardini Papalia. A composite generalized cross-entropy formulation in small samples estimation. Econometric Reviews, 27(4-6):596–609, 2008.

[34] Daryl Pregibon. Logistic regression diagnostics. The Annals of Statistics, 9(4):705–724, 1981.

[35] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2019.

[36] Jianzhao Shen and Sujuan Gao. A solution to separation and multicollinearity in multiple logistic regression. Journal of Data Science, 6(4):515, 2008.

[37] Stephen M Stigler. The epic story of maximum likelihood. Statistical Science, 22(4):598–620, 2007.

[38] N Sukumar. Construction of polygonal interpolants: a maximum entropy approach. International Journal for Numerical Methods in Engineering, 61(12):2159–2181, 2004.

[39] Martin A Tanner. Tools for Statistical Inference. Springer, 2012.

[40] Ravi Varadhan. alabama: Constrained Nonlinear Optimization, 2015. R package version 2015.3-1.

[41] Tzong-Mou Wu. A study of convergence on the Newton-homotopy continuation method.