Application of the Iterated Weighted Least-Squares Fit to counting experiments
Hans Dembinski a,∗, Michael Schmelling a, Roland Waldi b

a Max Planck Institute for Nuclear Physics, Heidelberg, Germany
b Rostock University, Rostock, Germany

∗ Corresponding author
Abstract
Least-squares fits are popular in many data analysis applications, and so we review some theoretical results in regard to the optimality of this fit method. It is well-known that common variants of the least-squares fit applied to Poisson-distributed data produce biased estimates, but it is not well-known that the bias can be overcome by iterating an appropriately weighted least-squares fit. We prove that the iterated fit converges to the maximum-likelihood estimate. Using toy experiments, we show that the iterated weighted least-squares method converges faster than the equivalent maximum-likelihood method when the statistical model is a linear function of the parameters, and that it does not require problem-specific starting values. Both can be a practical advantage. The equivalence of both methods also holds for binomially distributed data. We further show that the unbinned maximum-likelihood method can be derived as a limiting case of the iterated least-squares fit when the bin width goes to zero, which demonstrates the deep connection between the two methods.
1. Introduction
In this paper, we review some theoretical results on least-squares methods, in particular, when they yield optimal estimates. We show how they can be applied to counting experiments without sacrificing optimality. The insights discussed here are known in the statistics community [1, 2], but less so in the
high-energy physics community. Standard text books on statistical methods and papers, see e.g. [3, 4], correctly warn about biased results when standard variants of the least-squares fit are applied to counting experiments with small numbers of events, but do not show that these can be overcome. The results presented here are of practical relevance for fits of linear models, where the iterated weighted least-squares method discussed in this paper converges faster than the standard maximum-likelihood method and does not require starting values near the optimum.

The least-squares fit is a popular tool of statistical inference. It can be applied in situations with $k$ measurements $\{y_i \,|\, i = 1, \dots, k\}$, described by a model with $m$ parameters $p = (p_j \,|\, j = 1, \dots, m)$ that predicts the expectation values $E[y_i] = \mu_i(p)$ for the measurements. The measurements differ from the expectation values by unknown residuals $\epsilon_i = y_i - \mu_i(p)$. The solution $\hat p$ that minimizes the sum $Q(p)$ of squared residuals,

$Q(p) = \sum_{i=1}^{k} \big(y_i - \mu_i(p)\big)^2,$   (1)

is taken as the best fit of the model to the data.

More generally, the measurements and the model predictions can be regarded as $k$-dimensional vectors $y = (y_i \,|\, i = 1, \dots, k)$ and $\mu = (\mu_i \,|\, i = 1, \dots, k)$, for which one wants to minimize a distance measure. In Eq. (1), we minimize the squared Euclidean distance. A generalization is the bilinear form

$Q(p) = (y - \mu)^T W (y - \mu),$   (2)

where $W$ is a positive-definite symmetric matrix of weights. This variant is called weighted least squares (WLS). Eq. (1) is recovered with $W = 1$. An important special case is when the weight matrix is equal to the inverse of the true covariance matrix $C$ of the measurements, $W = C^{-1}$ with $C = E[y y^T] - E[y] E[y]^T$. For uncorrelated measurements, Eq. (2) simplifies to the familiar form

$Q(p) = \sum_{i=1}^{k} \big(y_i - \mu_i(p)\big)^2 / \sigma_i^2,$   (3)

with variances $\sigma_i^2 = E[y_i^2] - E[y_i]^2$.

Aitken [5] showed in a generalization of the Gauss-Markov theorem [3, p. 152] that minimizing $Q(p)$ with $W \propto C^{-1}$ produces an optimal solution, in the sense detailed below, for linear models $\mu(p) = X p$, where $X$ is a constant $k \times m$ matrix. The theorem applies when the covariance matrix $C$ is finite and non-singular. Then, $Q(p)$ has a unique minimum at

$\hat p = (X^T C^{-1} X)^{-1} X^T C^{-1} y.$   (4)

The best-fit parameters $\hat p$ in this case are a linear function of the measurements $y$ with the covariance matrix

$C_p = (X^T C^{-1} X)^{-1}.$   (5)

If the measurements are unbiased, $E[y] = \mu$, this solution is the best linear unbiased estimator (BLUE). Like all linear estimators, Eq. (4) is unbiased if the input is unbiased. In addition, it has minimal variance among all linear estimators. This is true for any shape of the data distribution and any sample size. These excellent properties may be compromised in practical applications, since the covariance matrix $C$ is often only approximately known.
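For illustration, the closed-form solution of Eqs. (4) and (5) is easy to evaluate numerically. The following sketch is our own example, not part of the original text: it assumes a straight-line model $\mu(x, p) = p_0 + p_1 x$ with uncorrelated Gaussian measurements and illustrative values for the parameters and uncertainties.

```python
# Sketch: closed-form WLS solution Eq. (4) and covariance Eq. (5) for an
# assumed straight-line model mu(x, p) = p0 + p1*x with uncorrelated,
# normally distributed measurements (all values here are illustrative).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
X = np.column_stack((np.ones_like(x), x))   # k x m design matrix
p_true = np.array([1.0, 10.0])
sigma = np.full(x.size, 0.5)                # known standard deviations
y = X @ p_true + rng.normal(0.0, sigma)     # unbiased measurements

Cinv = np.diag(1.0 / sigma**2)              # W = C^{-1}, uncorrelated case
C_p = np.linalg.inv(X.T @ Cinv @ X)         # Eq. (5): covariance of p-hat
p_hat = C_p @ X.T @ Cinv @ y                # Eq. (4): BLUE solution
print(p_hat, np.sqrt(np.diag(C_p)))
```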
The least-squares approach is often regarded as a special case of the more general maximum-likelihood (ML) approach. The ML principle states that the best fit of a model should maximize the likelihood $L$, which is proportional to the joint probability of all measurements under the model. In practice, it is more convenient to work with $\ln L$ rather than $L$, so that the product of probabilities turns into a sum of log-probabilities,

$\ln L(p) = \ln \prod_{i=1}^{k} P_i(y_i; p) = \sum_{i=1}^{k} \ln P_i(y_i; p).$   (6)

Here $P_i(y_i; p)$ is the value of the probability density at $y_i$ for continuous outcomes or the actual probability for discrete outcomes. The ML method needs a fully specified probability distribution for each measurement, while the WLS method uses only the first two moments.

The parameter vector $\hat p$ that maximizes Eq. (6) is called the maximum-likelihood estimate (MLE). MLEs have optimal asymptotic properties; asymptotic here means in the limit of infinite samples. They are consistent (asymptotically unbiased) and efficient (asymptotically attaining minimal variance) [3]. In many practical cases of inference, in particular when data are Poisson-distributed, this method is known to produce good estimates also for finite samples. These properties make the ML fit the recommended tool for the practitioner [3, 4].

The WLS fit can be derived as a special case of a ML fit, if one considers normally distributed measurements $y_i$ with expectations $\mu_i$ and variances $\sigma_i^2$, where each measurement has the probability density function (PDF)

$P_i(y_i; \mu_i, \sigma_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(y_i - \mu_i)^2}{2\sigma_i^2}\right).$   (7)

For fixed $\sigma_i$, we obtain Eq. (3) from Eq. (6),

$\ln L(p) = -\frac{1}{2} \sum_{i=1}^{k} \frac{(y_i - \mu_i(p))^2}{\sigma_i^2} + c \equiv -\frac{1}{2} Q(p) + c,$   (8)

where the constant term $c$ depends only on the fixed variances $\sigma_i$. Constant terms do not affect the location $\hat p$ of the maximum of $\ln L$ and the minimum of $Q$. We will often drop them from equations.

This derivation shows that for Gaussian PDFs a ML and a WLS fit give identical results when the same fixed variances are used, even if they are not the true variances. This does not hold in general, but is relevant in this context. When data are Poisson-distributed and have small counts, common implementations of the WLS fit are biased, as we will show in the following section. The bias does not originate from the skewed shape of the Poisson distribution, however, but rather from the fact that the weights are either biased or not fixed.

To these standard methods, we add the iterated weighted least-squares (IWLS) fit [1]. It yields maximum-likelihood estimates when data are Poisson or binomially distributed with only the probabilities as free parameters [2]. This extends the strict equivalence between ML and WLS fits to a larger class of problems, an extension which is highly relevant in practice, since counts in histograms are Poisson-distributed, and counted fractions are binomially distributed with the denominator considered fixed. The iterations are used to successively update estimates of the variances $\sigma_i^2$, which are kept constant during minimization.

When IWLS and ML fits are equivalent, which one is recommended? We conducted toy experiments where IWLS and ML fits are carried out numerically, as is common in practice. We found similar convergence rates for both methods when the model is non-linear, and a significantly faster convergence for the IWLS fit if the model is linear. This makes the IWLS fit a useful addition to the toolbox.

We have seen how the WLS fit can be derived from the ML fit under certain conditions. Inversely, we will show that the unbinned ML fit can be derived as a limiting case from the IWLS fit under weak conditions. The derivation shows that the two approaches are deeply connected.
2. Least-Squares Variants In Use
Standard variants of the WLS fit used in practice produce biased estimates when the fit is applied to Poisson-distributed data with small counts. The bias is often attributed to the breakdown of the normal approximation to the Poisson distribution, but it is actually related to how the unknown true variances $\sigma_i^2$ in Eq. (3) are replaced by estimates.

We demonstrate this along a simple example. We fit the single parameter $\mu$ of the Poisson distribution

$P(n; \mu) = e^{-\mu} \mu^n / n!,$   (9)

to $k$ counts $\{n_i \,|\, i = 1, \dots, k\}$ sampled from it. The maximum-likelihood estimate for $\mu$ can be computed analytically by maximizing Eq. (6). We solve $\partial \ln L / \partial\mu \equiv \partial_\mu \ln L = 0$ and obtain the arithmetic average

$\hat\mu = \frac{1}{k} \sum_{i=1}^{k} n_i,$   (10)

which is unbiased and has minimal variance. We will now apply variants of the WLS fit to the same problem, which differ in how they substitute the unknown true variance.

Variance computed for each sample.
For a single isolated sample, the unbiased estimate of $\mu$ is $\hat\mu_i = n_i$, with variance $\mathrm{Var}[n_i] = \mu \simeq \hat\mu_i = n_i$. This is the origin of the well-known $\sqrt{n}$-estimate for the standard deviation of a count $n$. With this variance estimate, we get

$Q(\mu) = \sum_{i=1}^{k} (n_i - \mu)^2 / n_i.$   (11)

This form is called Neyman's $\chi^2$ in the statistics literature [4]. Replacing the true variance $\mu$ by its sample estimate $n_i$ is an application of the bootstrap principle discussed by Efron and Tibshirani [6]. To obtain the minimum, we solve $\partial_\mu Q = 0$ and obtain the harmonic average

$\hat\mu^{-1} = \frac{1}{k} \sum_{i=1}^{k} n_i^{-1}.$   (12)

The solution is biased and breaks down for samples with
$n_i = 0$. The variance estimates here are constant (they do not vary with $\hat\mu$), but differ from sample to sample. This treatment ignores the fact that the true variance is the same for all samples in this setup.

Variance computed from model.
Another choice is to directly insert $\mathrm{Var}[n_i] = \mu$ in the formula,

$Q(\mu) = \sum_{i=1}^{k} (n_i - \mu)^2 / \mu.$   (13)

This form, called Pearson's $\chi^2$ [4], is a conceptual improvement, because $\mu$ is the exact but unknown value of the variance. However, the variance $\mu$ now varies together with the expectation value $\mu$. Solving $\partial_\mu Q = 0$ yields the quadratic average

$\hat\mu = \sqrt{\frac{1}{k} \sum_{i=1}^{k} n_i^2}.$   (14)

This estimate is also biased, but can handle samples with
$n_i = 0$. The bias may come as a surprise, since we used the exact value for the variance after all. The failure here can be traced back to the fact that the variance estimates $\sigma_i^2 = \mu$ are not fixed during the minimization. A small positive bias on $\mu$ in Eq. (13) leads to a second-order increase in the numerator, which is overcompensated by a first-order increase of the denominator. In other words, the fit tends to increase the variance even at the cost of a small bias in the expectation when given this freedom, because overall it yields a reduction of $Q$.

Constant variance.
Finally, we simply use $\sigma_i^2 = c$, where $c$ is an arbitrary constant,

$Q(\mu) = \sum_{i=1}^{k} (n_i - \mu)^2 / c.$   (15)

We solve $\partial_\mu Q = 0$ and obtain the optimal maximum-likelihood estimate Eq. (10) as the solution; the constant $c$ drops out.

This seems counter-intuitive, since we used a constant for all samples instead of a value close or equal to the true variance. However, this case satisfies all conditions of the Gauss-Markov theorem. The expectation values are trivial linear functions of the parameter, $\mu_i = \mu$. The variances $\sigma_i^2$ are all equal and only need to be known up to a global scaling factor, hence any constant $c$ will do.

We learned that keeping the variance estimates constant during minimization is important, but the estimates should in general be as close to the true variances as possible. An iterated fit can satisfy both requirements.
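The biases of the three estimators can be seen directly in a small simulation. The sketch below is our own illustration, with assumed values $\mu = 2$ and $k = 10$: it evaluates the analytical solutions of Eqs. (10), (12), and (14) on many Poisson toy samples.

```python
# Our own toy check of Eqs. (10), (12), (14); mu = 2 and k = 10 are
# assumed values. The arithmetic mean (ML) is unbiased, while the harmonic
# average (Neyman) and the quadratic average (Pearson) are biased.
import numpy as np

rng = np.random.default_rng(42)
mu, k, n_toys = 2.0, 10, 100_000
n = rng.poisson(mu, size=(n_toys, k)).astype(float)

ml = n.mean(axis=1)                         # Eq. (10): arithmetic average
pearson = np.sqrt((n**2).mean(axis=1))      # Eq. (14): quadratic average

# Eq. (12) breaks down for n_i = 0, so keep only toys without empty samples
ok = (n > 0).all(axis=1)
neyman = 1.0 / (1.0 / n[ok]).mean(axis=1)   # Eq. (12): harmonic average

print("ML:     ", ml.mean())       # close to mu = 2, unbiased
print("Neyman: ", neyman.mean())   # biased low
print("Pearson:", pearson.mean())  # biased high, since E[n^2] = mu + mu^2
```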
3. Iterated Weighted Least-Squares
The iterated (re)weighted least-squares methods (IWLS or IRLS) are well known in statistical regression [1], and can be applied to fits with $k$ measurements $\{y_i \,|\, i = 1, \dots, k\}$ described by a model with $m$ parameters $p = (p_j \,|\, j = 1, \dots, m)$, which predicts the expectations $E[y_i] = \mu_i(p)$ and variances $\mathrm{Var}[y_i] = \sigma_i^2(p)$ of each measurement. We will discuss the special application where the $y_i$ are entries of a histogram. One then minimizes the sum of squared residuals

$Q(p) = \sum_{i=1}^{k} \big(y_i - \mu_i(p)\big)^2 / \sigma_i^2(\hat p),$   (16)

where the $\sigma_i^2$ are constant within one iteration of the fit and computed from the model using the parameter estimate $\hat p$ that minimized $Q(p)$ in the previous iteration. A convenient choice for the first iteration is $\sigma_i^2 = 1$. One iterates until $\hat p$ converges.
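As a concrete illustration of this iteration, here is a minimal sketch for Poisson-distributed counts; the exponential model, the starting values, and the use of scipy.optimize.minimize are our own assumptions, not prescribed by the method.

```python
# Our own sketch of the IWLS iteration of Eq. (16) for Poisson counts,
# with an assumed model mu_i(p) = p0 * exp(-p1 * x_i). The variances are
# frozen during each minimization and updated between iterations.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 20)               # bin centers
mu = lambda p: p[0] * np.exp(-p[1] * x)
n = rng.poisson(mu((50.0, 2.0))).astype(float)

p_hat = np.array([n[0] + 1.0, 1.0])         # crude starting values
var = np.ones_like(n)                       # first iteration: sigma_i^2 = 1
for _ in range(30):
    w = 1.0 / var                           # weights fixed for this iteration
    res = minimize(lambda p: np.sum(w * (n - mu(p)) ** 2), p_hat)
    if np.allclose(res.x, p_hat, rtol=1e-8):
        break
    p_hat = res.x
    var = mu(p_hat)                         # Poisson: Var[n_i] = mu_i(p-hat)
print(p_hat)
```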
In particle physics, we often work with samples drawn from two monoparametric distributions of the exponential family:

• Poisson distribution. Example: fitting a distribution function to a histogram of counts.

• Binomial distribution with fixed number of trials. Example: fitting an efficiency function to two histograms with generated and accepted events.

Charles, Frome, and Yu [2] derived that the IWLS fit gives the exact same result as the ML fit for a family of distributions. We demonstrate this in the appendix for the special distributions discussed here.

The Hessian matrices of second derivatives are also equal up to a constant factor, $\partial_{p_l} \partial_{p_m} Q = -2\, \partial_{p_l} \partial_{p_m} \ln L$. The inverse of the Hessian is an estimate of the covariance matrix of the solution, an important uncertainty estimate in practical applications.

We emphasize that the equivalence does not depend on the size of the data sample or on the functional form of the model that predicts the expectation values for the measurements. In particular, when the IWLS fit is applied to histograms, it is not biased by small counts per bin or even empty bins.

A formal discussion of how systematic uncertainties can be handled with the IWLS fit is outside of the scope of this paper, but we note that it can include systematic uncertainties. Barlow [7] discusses how correlated systematic uncertainties can be handled in a least-squares fit. One minimizes Eq. (2) in each iteration with a matrix

$C = C_\mathrm{stat}(\hat p) + C_\mathrm{sys}(\hat p),$   (17)

where $C_\mathrm{stat}$ is the current estimate of the stochastic covariance computed from the previous solution, and $C_\mathrm{sys}$ is a covariance matrix that represents the systematic uncertainties of the measurements. The matrix $C_\mathrm{sys}$ may be a function of the parameter vector. Like the covariance matrix $C_\mathrm{stat}$, it is kept constant during each iteration, and updated between iterations using the current value of $\hat p$. This approach has been successfully applied in a combination of measurements from the CDF and D0 experiments [8].

When the IWLS and the ML fits are equivalent, which one should be used in practice? The two methods produce the same results in analytical problems, but can have different performance in numerical problems. In practice, the extrema of the log-likelihood function $\ln L(p)$ and the weighted least-squares function $Q(p)$ are usually found with a local optimizer, like the MIGRAD algorithm in the
MINUIT package [9, 10]. Computing the functions is sometimes expensive, when the fitted data sets are large and the model has many parameters. Numerical methods are therefore judged based on the number of function evaluations required to converge to the optimum within some tolerance. Another criterion is robustness, the ability to converge to the right optimum from a point in the neighborhood of the solution.

To address these points, we conducted toy experiments with Poisson-distributed counts $n_i$ and find that the ML method requires fewer function evaluations than the IWLS method in general. However, the rate of convergence of the IWLS method can be greatly accelerated when the model that computes the count expectation $E[n_i] = \mu_i(p)$ is linear in the parameters, $\mu_i = X_i p$, where $X_i$ is a vector of constants. The maximum of $\ln L(p)$ usually cannot be found analytically in this case, but the minimum of $Q(p)$ is given by Eq. (4) in each iteration of the IWLS fit.
[Figure 1: Example of a toy data set (points) to test the performance of the ML, IWLS, and L-IWLS fits (see text). Left: the curve represents the fit result of the three methods for this data set (which are identical). Right: intermediate parameter states during the optimization for the ML, IWLS, and L-IWLS fits (see text), in each iteration of the respective algorithms until the stopping criterion is reached. The L-IWLS fit often converges quadratically. The IWLS fit is slowed down by the artificial dampening that we introduced to avoid oscillations.]

When the computing time is dominated by the evaluation of $Q(p)$ or $\ln L(p)$, solving the IWLS fit is faster than the ML fit. The IWLS fit also does not require a problem-specific starting point for the optimization in this case. We call this special variant the L-IWLS fit. All three methods are able to handle fits that have bounded parameters, which are common in particle physics. In our toy experiments, the parameters are bounded to be non-negative. Details are given in the next section.

Whether the ML or the IWLS fits are more robust in the above sense is more difficult to say. No general proofs can be given for either method. Our toy studies suggest the following order of increasing robustness: IWLS, L-IWLS, ML. In some toy experiments, the IWLS methods require many more iterations than average, producing a long tail in the distribution of iteration counts. Such tails are not observed for the ML fit. It is likely, however, that a more sophisticated implementation of the IWLS fit than ours could improve the robustness of this method.

We compare the performances of ML, IWLS, and L-IWLS fits in a series of 1000 toy experiments with Poisson-distributed samples. We use a linear model for the expectation with two parameters, $\mu(x, p) = p_0 + p_1 x$, with $x \in$
$[0, 1]$ as an independent variable. For the true parameters $p_\mathrm{truth} = (1, 10)$, we simulate 10 pairs $(x_i, n_i)$. The $x_i$ are evenly spaced over the interval $[0, 1]$, $\mu_i$ is calculated for each $x_i$ based on the true parameters, and finally a random sample $n_i$ is drawn for each $\mu_i$ from the Poisson distribution. The model is then fitted to each toy data set using the following three methods. One of the toy experiments is shown in Fig. (1).

• ML fit: Starting from Eq. (A.1) we use the
MIGRAD algorithm from the MINUIT package to find the minimum. We pass the exact analytical gradient to MIGRAD for this problem, replacing the numerical approximation that
MINUIT uses otherwise. We restrict the parameter range to $p_k \geq 0$, and a small constant is added to $\mu$ whenever it appears in a denominator to avoid division by zero.

• IWLS fit: We use Newton's method to update $p$,

$p_{n+1} = p_n - H^{-1} \partial_p Q,$   (18)

with the exact analytical gradient $\partial_p Q$ and Hessian $H$ for this problem. Since the model is linear and the function $Q$ quadratic, Newton's method yields the exact solution for the given gradient and Hessian matrix, but without taking the boundary condition $p_k \geq 0$ into account. To check for convergence, we use the MINUIT criterion, which is based on the estimated distance-to-minimum and deviations in the diagonal elements of the inverted Hessian [9]. This approach works very well for most toy experiments, but in some rare cases the iteration oscillates around the minimum. To avoid the oscillations, we introduce an artificial dampening that averages the new
parameter vector with the previous one, $p_{n+1} := (p_{n+1} + p_n)/2$.

[Figure 2: Application of the maximum-likelihood (ML) fit, the iterated weighted least-squares (IWLS) fit, and its specialization for linear models (L-IWLS) to 1000 toy experiments with Poisson-distributed samples (see text). Top row: histograms of the two fitted parameters of the model, $\hat p_0$ and $\hat p_1$, overlaid for all three fit methods; the histograms are nearly identical, with means $\hat p_0 = 1.00 \pm 0.02$ and $\hat p_1 = 10.11 \pm 0.07$ for all three methods. Bottom row: normalized histograms of the number of evaluations of the model function for the three fit methods in double-logarithmic scale.]

• L-IWLS fit: We solve Eq. (4) with the
NNLS [11] algorithm as implemented in SciPy [12], and iterate. It solves Eq. (4) under the boundary condition $p_k \geq 0$. To check for convergence, we again use the MINUIT criterion.
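A condensed sketch of this L-IWLS scheme is shown below; it follows the description above, but replaces the MINUIT convergence criterion with a simple fixed tolerance and uses assumed toy values.

```python
# Sketch of the L-IWLS fit: each iteration solves Eq. (4) under p_k >= 0
# with NNLS on the whitened system; the np.allclose stopping rule stands
# in for the MINUIT criterion used in the text.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(7)
x = np.linspace(0.0, 1.0, 10)
X = np.column_stack((np.ones_like(x), x))    # linear model mu = X @ p
n = rng.poisson(X @ np.array([1.0, 10.0])).astype(float)

var = np.ones_like(n)                        # first iteration: sigma_i^2 = 1
p_hat = np.zeros(2)
for _ in range(100):
    w = 1.0 / np.sqrt(var)                   # whitening weights
    p_new, _ = nnls(X * w[:, None], n * w)   # weighted Eq. (4) with p >= 0
    if np.allclose(p_new, p_hat, rtol=1e-10, atol=1e-12):
        break
    p_hat = p_new
    var = np.maximum(X @ p_hat, 1e-12)       # Poisson variances from model
print(p_hat)
```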
We note that our application of the general IWLS fit to a problem with a linear model is artificial. We only do this here to compare all three fitting methods on the same problem. For the IWLS and ML fits, we use the optimistic starting point
$p_\mathrm{truth} = (1, 10)$. The ML and IWLS fits therefore run under ideal conditions compared to the L-IWLS fit, which does not require a specific starting point. In practice, one will usually start with a less ideal starting point, which slows down the convergence of ML and IWLS fits compared to the L-IWLS fit.

In case of the ML and IWLS fits, we increase the call counter by one for each evaluation of $Q$ or $\ln L$ and each evaluation of their gradients for all values of $x$. In case of the L-IWLS fit, we count one application of the NNLS algorithm as one call, since it requires essentially one computation of the gradient.

The results are shown in Fig. (2). As expected, the results are equal within the accuracy of the numerical algorithms, which stop when MINUIT's standard convergence criterion is reached. This criterion roughly corresponds to a fixed small precision in the parameter relative to its uncertainty.

The average number of calls required to converge is different: 19.3 for ML, 23.6 for IWLS, and 4.8 for L-IWLS. The L-IWLS fit is the fastest to converge, requiring only a quarter of the function evaluations of the ML fit. The IWLS fit is the slowest; it requires about 20% more calls on average than the ML fit. This is mainly due to the artificial dampening. In cases where the dampening is not needed, the IWLS fit converges as rapidly as the L-IWLS fit. Since we chose a linear model for this performance study in order to compare all three methods, a Newton step computes the exact solution to the fitting problem for the current covariance matrix estimate.

An investigation shows that the convergence issues of the IWLS fit appear when a parameter of the model is very close to zero. If this is not the case and no dampening is applied, the IWLS and L-IWLS fits produce identical results. MINUIT was designed to handle such cases well and shows a much more stable convergence rate. This suggests that the issues of the IWLS fit can be overcome as well with a more sophisticated implementation, but this comes at the cost of a slower convergence in favorable cases. The overall performance of the IWLS fit will probably not surpass that of the more straight-forward ML fit.

In conclusion, we recommend the L-IWLS fit for linear models and the ML fit for non-linear models.

4. Unbinned maximum-likelihood from IWLS
In the introduction, we reviewed how the WLS fit can be derived as a special case of the ML fit, when measurements are normally distributed with known variance. Alternatively, the WLS fit can also be derived from geometric principles without relying on the ML principle. We will now show that the unbinned ML fit can be derived as a limiting case of the IWLS fit.

For the unbinned ML fit of a known probability density $f(x; p)$ of a continuous stochastic variable $x$ with parameters $p$, one maximizes the sum of logarithms of the model density evaluated at the measurements $\{x_i \,|\, i = 1, \dots, k\}$,

$\ln L(p) = \sum_{i=1}^{k} \ln f(x_i; p).$   (19)

The maximum is found by solving the system of equations
$\partial_{p_j} \ln L(p) = 0$. The density $f(x; p)$ must be at least once differentiable in $p$.

To derive these equations as a limit of the IWLS fit, we assume that $f(x; p)$ is finite everywhere in $x$, so that the probability density is not concentrated in discrete points.

We start by considering a histogram of $k$ samples $x_i$. Since the samples are independently drawn from a PDF, the histogram counts $n_l$ are uncorrelated and Poisson-distributed. Following the IWLS approach, we minimize the function

$Q(p) = \sum_l \frac{(n_l - k P_l)^2}{k \hat P_l}$   (20)

and iterate, where $P_l(p) = \int_{x_l}^{x_l + \Delta x} f(x; p)\, \mathrm{d}x$ is the expected fraction of the samples in bin $l$, and $\hat P_l = P_l(\hat p)$ is the value based on the fitted parameters $\hat p$ from the previous iteration. Expansion of the squares yields three terms,

$Q(p) = \sum_l \frac{n_l^2}{k \hat P_l} - 2 \sum_l \frac{n_l P_l}{\hat P_l} + k \sum_l \frac{P_l^2}{\hat P_l}.$   (21)

The first term is proportional to $1/\Delta x$, but not a function of $p$. Therefore it does not contribute to the minimum obtained by solving the equations $\partial_{p_j} Q(p) = 0$.

We investigate the limit $\Delta x \to 0$.
Since $f(x)$ is finite everywhere, we have ultimately either zero or one count in each bin. With $P_l \to f(x_l; p)\,\Delta x$, the second term has a finite limit

$2 \sum_l \frac{n_l P_l}{\hat P_l} \;\xrightarrow{\Delta x \to 0}\; 2 \sum_{i=1}^{k} \frac{f(x_i; p)\,\Delta x}{f(x_i; \hat p)\,\Delta x} = 2 \sum_{i=1}^{k} \frac{f(x_i; p)}{f(x_i; \hat p)},$   (22)

where only bins around the measurements $x_i$ with one entry contribute ($n_l = 1$). The third term also has a finite limit,

$k \sum_l \frac{P_l^2}{\hat P_l} \;\xrightarrow{\Delta x \to 0}\; k \sum_l \frac{f(x_l; p)^2\,(\Delta x)^2}{f(x_l; \hat p)\,\Delta x} = k \int \frac{f(x; p)^2}{f(x; \hat p)}\, \mathrm{d}x.$   (23)

One $\Delta x$ cancels in the ratio, and in the limit $\Delta x \to 0$ the sum turns into an integral.

We now consider the derivatives $\partial_{p_j} Q(p)$ in the limit of many iterations. We assume that the iterations converge, so that the previous solution $\hat p$ approaches the next solution $p$. We get

$\partial_{p_j} Q(p) = -2 \sum_{i=1}^{k} \frac{\partial_{p_j} f(x_i; p)}{f(x_i; \hat p)} + k \int \frac{2 f(x; p)\, \partial_{p_j} f(x; p)}{f(x; \hat p)}\, \mathrm{d}x \;\xrightarrow{\hat p \to p}\; -2 \sum_{i=1}^{k} \frac{\partial_{p_j} f(x_i; p)}{f(x_i; p)} + 2k\, \partial_{p_j} \int f(x; p)\, \mathrm{d}x.$   (24)

The last term vanishes in the limit, because $\int f(x; p)\, \mathrm{d}x = 1$ does not depend on $p$. Combining both limits, we obtain

$\partial_{p_j} Q(p) \;\xrightarrow{\Delta x \to 0,\ \hat p \to p}\; -2 \sum_{i=1}^{k} \frac{\partial_{p_j} f(x_i; p)}{f(x_i; p)} = -2 \sum_{i=1}^{k} \partial_{p_j} \ln f(x_i; p) = -2\, \partial_{p_j} \ln L(p).$   (25)

The derivatives are equal up to a constant factor, which means that the solutions of $\partial_{p_j} Q(p) = 0$ and $\partial_{p_j} \ln L(p) = 0$ are the same. We assumed a single measurement $x_i$ per event for simplicity, but the result also holds for the general case of a set of $k$ $n$-dimensional vectors $\{x_i \,|\, i = 1, \dots, k\}$ with $x_i = (x_{ji} \,|\, j = 1, \dots, n)$ and a corresponding $n$-dimensional probability density $f(x; p)$. In this case, one would repeat the derivation starting from an $n$-dimensional histogram.

The derivation provides some insights.

• The absolute values of $Q(\hat p)$ and $-2 \ln L(\hat p)$ at the solution $\hat p$ are not equal. They differ by an (infinite) additive constant.

• The derivatives $\partial_{p_j} Q(p)$ and $-2\, \partial_{p_j} \ln L(p)$ differ in general when $p$ is not the solution $\hat p$, because the second term in Eq. (24) does not vanish for $p \neq \hat p$.

In practice, the second point means that the MINUIT package produces the same error estimates for the solution $\hat p$ if the HESSE algorithm is used, but not if the
MINOS algorithm is used. The
HESSE algorithm numerically computes and inverts the Hessian matrix of second derivatives at the minimum, which gives identical results for $Q$ and $-2 \ln L$. The MINOS algorithm scans the neighborhood of the minimum, which for $Q$ and $-2 \ln L$ usually has a different shape.
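The limit can also be checked numerically. The following sketch is our own illustration: it assumes a truncated exponential density on $[0, 1]$ and compares the numerical gradient of the binned $Q$ from Eq. (20), evaluated at $p = \hat p$, with $-2\, \partial_{p_j} \ln L$ as the number of bins grows.

```python
# Our own numerical check that the gradient of the binned Q of Eq. (20),
# taken at p = p-hat, approaches -2 d(ln L)/d(lambda) as the bin width goes
# to zero. The truncated exponential density and all values are assumptions.
import numpy as np

rng = np.random.default_rng(13)
lam, k, eps = 1.5, 500, 1e-6
u = rng.uniform(size=k)                       # inversion sampling on [0, 1]
xs = -np.log(1.0 - u * (1.0 - np.exp(-lam))) / lam

def bin_probs(edges, l):                      # P_l of the truncated density
    cdf = (1.0 - np.exp(-l * edges)) / (1.0 - np.exp(-l))
    return np.diff(cdf)

def grad_Q(nbins):                            # dQ/dlambda at lambda = lam
    edges = np.linspace(0.0, 1.0, nbins + 1)
    n, _ = np.histogram(xs, edges)
    Ph = bin_probs(edges, lam)                # frozen denominator of Eq. (20)
    Q = lambda l: np.sum((n - k * bin_probs(edges, l)) ** 2 / (k * Ph))
    return (Q(lam + eps) - Q(lam - eps)) / (2 * eps)

lnL = lambda l: np.sum(np.log(l) - l * xs - np.log(1.0 - np.exp(-l)))
grad_lnL = (lnL(lam + eps) - lnL(lam - eps)) / (2 * eps)

for nbins in (10, 100, 1000, 10000):          # finer bins approach the limit
    print(nbins, grad_Q(nbins), -2.0 * grad_lnL)
```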
5. Notes on goodness-of-fit tests
For a goodness-of-fit (GoF) test, one computes a test statistic for a probabilistic model and a set of measurements. The test statistic is designed to have a known probability distribution when the measurements are truly distributed according to the probabilistic model. If the value for a particular model is very improbable, the model may be rejected.

It is well-known that the minimum value $Q(\hat p)$ is $\chi^2$-distributed with expectation $(k - m)$, if the measurements are normally distributed, where $k$ and $m$ are the number of measurements and number of fitted parameters, respectively [3]. This GoF property is so useful and frequently applied, that the function $Q(p)$ is often simply called chi-square.

In general, $Q(\hat p)$ is not $\chi^2$-distributed for measurements that are not normally distributed around the model expectations. Approximately, it holds for Poisson and binomially distributed measurements when counts are not close to zero, and fractions are neither too close to zero nor to one. Stronger statements can be made about the expectation value of $Q(\hat p)$. For linear models with $m$ parameters and $k$ unbiased measurements with known covariance matrix $C$, the expectation of $Q(\hat p)$ is guaranteed to be

$E[Q(\hat p)] = k - m,$   (26)

regardless of the sample size and the distribution of the measurements, as shown in the appendix. Therefore, the well-known quality criterion that the reduced $\chi^2$ should be close to unity,
$Q(\hat p)/(k - m) \simeq 1$, is often useful even if measurements are not normally distributed.

We saw previously that $-2 \ln L(\hat p)$ differs from $Q(\hat p)$ by an infinite additive constant, which is a hint that it cannot straightforwardly replace the latter as a GoF statistic. When used with unbinned data, $\ln L(\hat p)$ is ill-suited as a GoF test statistic. Heinrich [13] presented striking examples in which $\ln L(\hat p)$ carries no information about how well the model fits the measurements. Cousins [14, 15] gave an intuitive explanation for this fact. The IWLS fit provides a maximum-likelihood estimate for measurements that follow a Poisson or binomial distribution, and a GoF test statistic as a side result, which in general is not exactly $\chi^2$-distributed, but its distribution can often be obtained from a Monte Carlo simulation.
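Eq. (26) is easy to verify with a short Monte Carlo run. The sketch below is our own check, with an assumed linear model and Poisson-distributed measurements whose known true covariance is used as the fixed weight matrix.

```python
# Our own Monte Carlo check of Eq. (26), E[Q(p-hat)] = k - m, for an
# assumed linear model with k = 10 Poisson measurements and m = 2
# parameters; the known true covariance serves as the fixed weight matrix.
import numpy as np

rng = np.random.default_rng(11)
x = np.linspace(0.0, 1.0, 10)                   # k = 10 measurements
X = np.column_stack((np.ones_like(x), x))       # m = 2 parameters
mu = X @ np.array([1.0, 10.0])                  # true expectations
Cinv = np.diag(1.0 / mu)                        # inverse true covariance
M = np.linalg.inv(X.T @ Cinv @ X) @ X.T @ Cinv  # Eq. (4) as a matrix

q = []
for _ in range(20_000):
    y = rng.poisson(mu).astype(float)
    r = y - X @ (M @ y)                         # residuals at the solution
    q.append(r @ Cinv @ r)
print(np.mean(q))                               # close to k - m = 8
```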
6. Conclusions
An iterated weighted least-squares fit applied to measurements, which are Poisson- or binomially distributed around model expectations, provides the exact same solution as a maximum-likelihood fit. This holds for any model and any sample size. When the two fit methods are equivalent, the maximum-likelihood fit is still recommended, except when the model is linear. In this case, the minimum of the weighted least-squares problem can be found analytically in each iteration, which usually needs less computation overall than numerically maximizing the likelihood and requires no problem-specific starting point. The iterated weighted least-squares fit provides a goodness-of-fit statistic in addition, while the maximum-likelihood fit usually does not. Of course, a goodness-of-fit statistic can always be separately computed after the optimization, but in case of the maximum-likelihood fit it requires implementing two functions in a computer program instead of one.

Whether the two fit methods give equivalent results depends only on the probability distribution of the measurements around the model expectations. Here we presented proofs of the equivalence for Poisson and binomial distributions. In the statistics literature [1, 2], more general proofs are given that hold also for some other distributions.
7. Acknowledgments
We thank Bob Cousins for a critical reading of the manuscript and for valuable pointers to the primary statistical literature, and the anonymous reviewer for suggestions to clarify additional points.
Appendix A. Equivalence of ML and IWLS for Poisson-distributed data
A common task is to fit a model to a histogram with $k$ bins, each with a count $n_i$. Especially in multi-dimensional histograms some bins may have few or even zero entries. This poses a problem for a conventional weighted least-squares fit, but not for a ML fit or an IWLS fit.

A ML fit of a model with $m$ parameters $p = (p_j \,|\, j = 1, \dots, m)$ to a sample of $k$ Poisson-distributed numbers $\{n_i \,|\, i = 1, \dots, k\}$ with expectation values $E[n_i] = \mu_i(p)$ is performed by maximizing the log-likelihood

$\ln L(p) = \sum_{i=1}^{k} n_i \ln \mu_i - \sum_{i=1}^{k} \mu_i,$   (A.1)

which is obtained by taking the logarithm of the product of Poisson probabilities (9) of the data under the model, and dropping terms that do not depend on $p$. To find the maximum, we set the $m$ first derivatives

$\frac{\partial \ln L}{\partial p_j} = \sum_{i=1}^{k} \frac{n_i}{\mu_i} \frac{\partial \mu_i}{\partial p_j} - \sum_{i=1}^{k} \frac{\partial \mu_i}{\partial p_j}$   (A.2)

for $j = 1, \dots, m$ to 0. We get a system of equations
$\sum_{i=1}^{k} \frac{n_i - \mu_i}{\mu_i} \frac{\partial \mu_i}{\partial p_j} = 0.$   (A.3)

We now approach the same problem as an IWLS fit. The sum of weighted squared residuals is

$Q = \sum_{i=1}^{k} \frac{(n_i - \mu_i)^2}{\hat\mu_i},$   (A.4)

where $\hat\mu_i$ is the expected variance computed from the model, using the parameter estimate $\hat p$ from the previous iteration. To find the minimum, we again set the $m$ first derivatives to 0 and obtain
$\frac{\partial Q}{\partial p_j} = -2 \sum_{i=1}^{k} \frac{n_i - \mu_i}{\hat\mu_i} \frac{\partial \mu_i}{\partial p_j} = 0.$   (A.5)

Eq. (A.5) and Eq. (A.3) become identical in the limit $\hat\mu_i \to \mu_i$, and so do their solutions. The limit is approached by iterating the fit, so that we actually obtain the maximum-likelihood estimate from the IWLS fit. Remarkably, this does not depend on the size of the counts $n_i$ per bin. The equivalence holds even when many bins with zero entries are present. To obtain this result, the $\hat\mu_i$ must be constant. If $\hat\mu_i$ was replaced by $\mu_i$ in Eq. (A.4), extra non-vanishing terms would appear in Eq. (A.5).

As already mentioned, when the expectations are linear functions, the unique analytical solution to Eq. (A.5) is given by Eq. (4), with $C^{-1} = (\delta_{ij}/\hat\mu_i \,|\, i, j = 1, \dots, k)$ and $y_i = n_i$. An analytical solution of Eq. (A.3) is not known to the authors. The IWLS fit converges faster than the ML fit in this case.

Appendix B. Equivalence of ML and IWLS for binomially distributed data
Another common task is to obtain an efficiency function of a selection or trigger as a function of an observable. One collects a histogram of generated events with bin contents $N_i$, and a corresponding histogram of accepted events with bin contents $n_i$. The $N_i$ are considered as constants here, while the $n_i$ are drawn from the binomial distribution. The goal is to obtain a model function for the efficiencies $\epsilon_i$ that best describes the drawn samples $n_i$. A single least-squares fit will give biased results when many $n_i$ are close to either 0 or $N_i$, but not a ML or an IWLS fit.

A ML fit of a model with $m$ parameters $p = (p_j \,|\, j = 1, \dots, m)$ for a sample of $k$ binomially distributed numbers $\{n_i \,|\, i = 1, \dots, k\}$ with expectations $E[n_i] = \mu_i(p) = \epsilon_i(p) N_i$ is performed by maximizing the log-likelihood

$\ln L(p) = \sum_{i=1}^{k} n_i \ln \mu_i + \sum_{i=1}^{k} (N_i - n_i) \ln(N_i - \mu_i),$   (B.1)

which is obtained by taking the logarithm of the product of binomial probabilities to observe $n_i$ when $\mu_i = \epsilon_i N_i$ are expected,

$P(n_i; \mu_i, N_i) = \binom{N_i}{n_i} \epsilon_i^{n_i} (1 - \epsilon_i)^{N_i - n_i} = \binom{N_i}{n_i} \frac{\mu_i^{n_i} (N_i - \mu_i)^{N_i - n_i}}{N_i^{N_i}},$   (B.2)

and dropping terms that do not depend on $p$. A binomial distribution has two parameters $(\mu_i, N_i)$, but it is a monoparametric distribution in this context since the $N_i$ are known and only the $\mu_i$ are free parameters.

Again we set the $m$ first derivatives

$\frac{\partial \ln L}{\partial p_j} = \sum_{i=1}^{k} \frac{n_i}{\mu_i} \frac{\partial \mu_i}{\partial p_j} - \sum_{i=1}^{k} \frac{N_i - n_i}{N_i - \mu_i} \frac{\partial \mu_i}{\partial p_j}$   (B.3)

to zero for $j = 1, \dots, m$. The minimum is obtained by solving

$\sum_{i=1}^{k} \frac{n_i - \mu_i}{\mu_i (1 - \mu_i / N_i)} \frac{\partial \mu_i}{\partial p_j} = 0.$   (B.4)
For the IWLS fit, we need to minimize the sum

$Q(p) = \sum_{i=1}^{k} \frac{(n_i - \mu_i)^2}{\hat\mu_i (1 - \hat\mu_i / N_i)},$   (B.5)

where the variances for the binomial distribution with expectation $\mu_i$ are $\sigma_i^2 = \mathrm{Var}[n_i] = N_i \epsilon_i (1 - \epsilon_i) = \mu_i (1 - \mu_i / N_i)$. Again, we replaced $\mu_i$ in the variance by the constant estimate $\hat\mu_i$ from the previous iteration. Setting the $m$ first derivatives to 0, we obtain

$\frac{\partial Q}{\partial p_j} = -2 \sum_{i=1}^{k} \frac{n_i - \mu_i}{\hat\mu_i (1 - \hat\mu_i / N_i)} \frac{\partial \mu_i}{\partial p_j} = 0.$   (B.6)

Eq. (B.6) and Eq. (B.4) become identical in the limit $\hat\mu_i \to \mu_i$, which is approached by iterating the minimization. Again, we obtain the maximum-likelihood estimate with the IWLS fit.

Like in the previous case, the L-IWLS fit for a linear model converges faster than the ML fit, while the IWLS fit converges more slowly than the ML fit in general.
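For illustration, the following sketch (our own construction, with an assumed linear efficiency model and illustrative values) iterates the weighted linear least-squares solve with the binomial variances of Eq. (B.5).

```python
# Our own sketch of an IWLS efficiency fit per Eq. (B.5); the linear
# efficiency model eps = p0 + p1*x and all values are assumptions.
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.1, 0.9, 8)
X = np.column_stack((np.ones_like(x), x))     # efficiency model eps = X @ p
N = np.full(x.size, 50)                       # generated events (fixed)
n = rng.binomial(N, X @ np.array([0.2, 0.6])).astype(float)  # accepted

A = X * N[:, None].astype(float)              # mu_i = N_i eps_i, linear in p
p_hat = np.zeros(2)
var = np.ones_like(n)                         # first iteration: sigma_i^2 = 1
for _ in range(100):
    w = 1.0 / np.sqrt(var)                    # whitening weights
    p_new, *_ = np.linalg.lstsq(A * w[:, None], n * w, rcond=None)
    if np.allclose(p_new, p_hat, rtol=1e-10, atol=1e-12):
        break
    p_hat = p_new
    mu = np.clip(A @ p_hat, 1e-6, N - 1e-6)   # keep variances positive
    var = mu * (1.0 - mu / N)                 # binomial variances, Eq. (B.5)
print(p_hat)                                  # close to (0.2, 0.6)
```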
Appendix C. Expectation of $Q(\hat p)$ for linear models

We compute the expectation of $Q$ in Eq. (2), evaluated at the solution $\hat p$ from Eq. (4) for linear models with $E[y] = Xp$, where $X$ is a fixed $k \times m$ matrix, and where the measurements $y$ have a known finite covariance matrix $C$. Similar proofs are found in the literature [16]. The covariance matrix of $\hat p$ is obtained by error propagation with the matrix $M = (X^T C^{-1} X)^{-1} X^T C^{-1}$ and $\hat p = M y$ as

$C_p = M C M^T = (X^T C^{-1} X)^{-1},$   (C.1)

where we used that $C^{-1}$ and $(X^T C^{-1} X)^{-1}$ are symmetric matrices.

The expectation is a linear operator. Since the solution $\hat p = M y$ is a linear function of the measurement, we have

$E[\hat p] = M E[y] = M X p = p,$   (C.2)

in other words, $\hat p$ is an unbiased estimate of $p$.

We expand $Q$ evaluated at $\hat p$,

$Q(\hat p) = y^T C^{-1} y - y^T C^{-1} X \hat p - \hat p^T X^T C^{-1} y + \hat p^T X^T C^{-1} X \hat p,$   (C.3)

which simplifies with $C_p^{-1} \hat p = X^T C^{-1} y$ and Eq. (C.1) to

$Q(\hat p) = y^T C^{-1} y - \hat p^T C_p^{-1} \hat p.$   (C.4)

The scalar result of a bilinear form is trivially equal to the trace of this bilinear form, and a cyclic permutation inside the trace then yields

$Q(\hat p) = \mathrm{Tr}(C^{-1} y y^T) - \mathrm{Tr}(C_p^{-1} \hat p \hat p^T).$   (C.5)

We compute the expectation on both sides and get, using linearity of trace and expectation,

$E[Q(\hat p)] = \mathrm{Tr}(C^{-1} E[y y^T]) - \mathrm{Tr}(C_p^{-1} E[\hat p \hat p^T]).$   (C.6)

The definition of the covariance matrix $C = E[y y^T] - E[y] E[y]^T$ is inserted, and vice versa for $C_p$. We get

$E[Q(\hat p)] = \mathrm{Tr}(C^{-1} C + C^{-1} E[y] E[y]^T) - \mathrm{Tr}(C_p^{-1} C_p + C_p^{-1} E[\hat p] E[\hat p^T]).$   (C.7)

The trace of a matrix multiplied with its inverse is equal to the number of diagonal elements, which is $k$ in case of $C$ and $m$ in case of $C_p$. We use this, $E[y] = Xp$, $E[\hat p] = p$, and again the linearity of the trace, to get

$E[Q(\hat p)] = k + \mathrm{Tr}(C^{-1} X p p^T X^T) - \big(m + \mathrm{Tr}(C_p^{-1} p p^T)\big).$   (C.8)

The remaining traces are identical and cancel,

$\mathrm{Tr}(C^{-1} X p p^T X^T) = \mathrm{Tr}(X^T C^{-1} X p p^T) = \mathrm{Tr}(C_p^{-1} p p^T),$   (C.9)

and so we finally obtain the result

$E[Q(\hat p)] = k - m,$   (C.10)

which is independent of the PDFs that describe the scatter of the measurements $y$ around the expectation values $E[y]$.

References

[1] J. A. Nelder, R. W. M. Wedderburn, Generalized Linear Models, J. R. Statist. Soc. A135, 370–384 (1972).

[2] A. Charles, E. L. Frome, P. L. Yu, The Equivalence of Generalized Least Squares and Maximum Likelihood Estimates in the Exponential Family, J. Am. Stat. Assoc., 169–171 (1976).

[3] F. James, Statistical methods in experimental physics, World Scientific Publishing Company, 2006.

[4] S. Baker and R. D. Cousins, Clarification of the use of chi-square and likelihood functions in fits to histograms, Nuclear Instruments and Methods in Physics Research, 437 (1984).

[5] A. C. Aitken, IV. On Least Squares and Linear Combination of Observations, Proc. R. Soc. Edinburgh, 42 (1935).

[6] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Monographs on Statistics and Applied Probability No. 57, Chapman & Hall/CRC, Boca Raton, Florida, USA, 1993.

[7] R. J. Barlow, Combining experiments with systematic errors, arXiv:1701.03701 (2017).

[8] T. A. Aaltonen et al. (CDF, D0 collaborations), Combination of measurements of the top-quark pair production cross section from the Tevatron Collider, Phys. Rev. D89, 072001 (2014).

[9] F. James and M. Roos, Minuit - a system for function minimization and analysis of the parameter errors and correlations, Computer Physics Communications, 343 (1975).

[10] iminuit team, MINUIT from Python, https://github.com/iminuit/iminuit (2013), accessed: 2018-03-05.

[11] C. Lawson and R. Hanson, Solving Least Squares Problems, Society for Industrial and Applied Mathematics, 1995, https://doi.org/10.1137/1.9781611971217.

[12] E. Jones et al., SciPy: Open source scientific tools for Python, (2001), accessed: 2018-03-05.

[13] J. Heinrich, Pitfalls of Goodness-of-Fit from Likelihood, in Statistical Problems in Particle Physics, Astrophysics, and Cosmology, edited by L. Lyons, R. Mount, and R. Reitmeyer, p. 52, 2003, arXiv:physics/0310167.

[14] R. D. Cousins,