Application of the Iterated Weighted Least-Squares Fit to counting experiments
Hans Dembinski a,∗, Michael Schmelling a, Roland Waldi b

a Max Planck Institute for Nuclear Physics, Heidelberg, Germany
b Rostock University, Rostock, Germany

∗ Corresponding author
Abstract
Least-squares fits are popular in many data analysis applications, and so we review some theoretical results in regard to the optimality of this fit method. It is well-known that common variants of the least-squares fit applied to Poisson-distributed data produce biased estimates, but it is not well-known that the bias can be overcome by iterating an appropriately weighted least-squares fit. We prove that the iterated fit converges to the maximum-likelihood estimate. Using toy experiments, we show that the iterated weighted least-squares method converges faster than the equivalent maximum-likelihood method when the statistical model is a linear function of the parameters, and that it does not require problem-specific starting values. Both can be a practical advantage. The equivalence of both methods also holds for binomially distributed data. We further show that the unbinned maximum-likelihood method can be derived as a limiting case of the iterated least-squares fit when the bin width goes to zero, which demonstrates the deep connection between the two methods.
1. Introduction
In this paper, we review some theoretical results on least-squares methods, in particular, when they yield optimal estimates. We show how they can be applied to counting experiments without sacrificing optimality. The insights discussed here are known in the statistics community [1, 2], but less so in the
high-energy physics community. Standard text books on statistical methods and papers, see e.g. [3, 4], correctly warn about biased results when standard variants of the least-squares fit are applied to counting experiments with small numbers of events, but do not show that these can be overcome. The results presented here are of practical relevance for fits of linear models, where the iterated weighted least-squares method discussed in this paper converges faster than the standard maximum-likelihood method and does not require starting values near the optimum.

The least-squares fit is a popular tool of statistical inference. It can be applied in situations with $k$ measurements $\{y_i \,|\, i = 1, \dots, k\}$, described by a model with $m$ parameters $p = (p_j \,|\, j = 1, \dots, m)$ that predicts the expectation values $E[y_i] = \mu_i(p)$ for the measurements. The measurements differ from the expectation values by unknown residuals $\epsilon_i = y_i - \mu_i(p)$. The solution $\hat p$ that minimizes the sum $Q(p)$ of squared residuals,

$Q(p) = \sum_{i=1}^{k} \big(y_i - \mu_i(p)\big)^2,$   (1)

is taken as the best fit of the model to the data.

More generally, the measurements and the model predictions can be regarded as $k$-dimensional vectors $y = (y_i \,|\, i = 1, \dots, k)$ and $\mu = (\mu_i \,|\, i = 1, \dots, k)$, for which one wants to minimize a distance measure. In Eq. (1), we minimize the squared Euclidean distance. A generalization is the bilinear form

$Q(p) = (y - \mu)^T W (y - \mu),$   (2)

where $W$ is a positive-definite symmetric matrix of weights. This variant is called weighted least squares (WLS). Eq. (1) is recovered with $W = 1$. An important special case is when the weight matrix is equal to the inverse of the true covariance matrix $C$ of the measurements, $W = C^{-1}$ with $C = E[y y^T] - E[y] E[y]^T$. For uncorrelated measurements, Eq. (2) simplifies to the familiar form

$Q(p) = \sum_{i=1}^{k} \big(y_i - \mu_i(p)\big)^2 / \sigma_i^2,$   (3)

with variances $\sigma_i^2 = E[y_i^2] - E[y_i]^2$.

Aitken [5] showed in a generalization of the Gauss-Markov theorem [3, p. 152] that minimizing $Q(p)$ with $W \propto C^{-1}$ produces an optimal solution, in the sense detailed below, for linear models $\mu(p) = X p$, where $X$ is a constant $k \times m$ matrix. The theorem applies when the covariance matrix $C$ is finite and non-singular. Then, $Q(p)$ has a unique minimum at

$\hat p = (X^T C^{-1} X)^{-1} X^T C^{-1} y.$   (4)

The best-fit parameters $\hat p$ in this case are a linear function of the measurements $y$ with the covariance matrix

$C_p = (X^T C^{-1} X)^{-1}.$   (5)

If the measurements are unbiased, $E[y] = \mu$, this solution is the best linear unbiased estimator (BLUE). Like all linear estimators, Eq. (4) is unbiased if the input is unbiased. In addition, it has minimal variance among all linear estimators. This is true for any shape of the data distribution and any sample size. These excellent properties may be compromised in practical applications, since the covariance matrix $C$ is often only approximately known.
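For illustration, the closed-form solution of Eqs. (4) and (5) is easy to evaluate numerically. The following sketch is our own example, not part of the original text: it assumes a straight-line model $\mu(x, p) = p_0 + p_1 x$ with uncorrelated Gaussian measurements and illustrative values for the parameters and uncertainties.

```python
# Sketch: closed-form WLS solution Eq. (4) and covariance Eq. (5) for an
# assumed straight-line model mu(x, p) = p0 + p1*x with uncorrelated,
# normally distributed measurements (all values here are illustrative).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
X = np.column_stack((np.ones_like(x), x))   # k x m design matrix
p_true = np.array([1.0, 10.0])
sigma = np.full(x.size, 0.5)                # known standard deviations
y = X @ p_true + rng.normal(0.0, sigma)     # unbiased measurements

Cinv = np.diag(1.0 / sigma**2)              # W = C^{-1}, uncorrelated case
C_p = np.linalg.inv(X.T @ Cinv @ X)         # Eq. (5): covariance of p-hat
p_hat = C_p @ X.T @ Cinv @ y                # Eq. (4): BLUE solution
print(p_hat, np.sqrt(np.diag(C_p)))
```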
The least-squares approach is often regarded as a special case of the more general maximum-likelihood (ML) approach. The ML principle states that the best fit of a model should maximize the likelihood $L$, which is proportional to the joint probability of all measurements under the model. In practice, it is more convenient to work with $\ln L$ rather than $L$, so that the product of probabilities turns into a sum of log-probabilities,

$\ln L(p) = \ln \prod_{i=1}^{k} P_i(y_i; p) = \sum_{i=1}^{k} \ln P_i(y_i; p).$   (6)

Here $P_i(y_i; p)$ is the value of the probability density at $y_i$ for continuous outcomes or the actual probability for discrete outcomes. The ML method needs a fully specified probability distribution for each measurement, while the WLS method uses only the first two moments.

The parameter vector $\hat p$ that maximizes Eq. (6) is called the maximum-likelihood estimate (MLE). MLEs have optimal asymptotic properties; asymptotic here means in the limit of infinite samples. They are consistent (asymptotically unbiased) and efficient (asymptotically attaining minimal variance) [3]. In many practical cases of inference, in particular when data are Poisson-distributed, this method is known to produce good estimates also for finite samples. These properties make the ML fit the recommended tool for the practitioner [3, 4].

The WLS fit can be derived as a special case of a ML fit, if one considers normally distributed measurements $y_i$ with expectations $\mu_i$ and variances $\sigma_i^2$, where each measurement has the probability density function (PDF)

$P_i(y_i; \mu_i, \sigma_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(y_i - \mu_i)^2}{2\sigma_i^2}\right).$   (7)

For fixed $\sigma_i$, we obtain Eq. (3) from Eq. (6),

$\ln L(p) = -\frac{1}{2} \sum_{i=1}^{k} \frac{(y_i - \mu_i(p))^2}{\sigma_i^2} + c \equiv -\frac{1}{2} Q(p) + c,$   (8)

where the constant term $c$ depends only on the fixed variances $\sigma_i$. Constant terms do not affect the location $\hat p$ of the maximum of $\ln L$ and the minimum of $Q$. We will often drop them from equations.

This derivation shows that for Gaussian PDFs a ML and a WLS fit give identical results when the same fixed variances are used, even if they are not the true variances. This does not hold in general, but is relevant in this context. When data are Poisson-distributed and have small counts, common implementations of the WLS fit are biased, as we will show in the following section. The bias does not originate from the skewed shape of the Poisson distribution, however, but rather from the fact that the weights are either biased or not fixed.

To these standard methods, we add the iterated weighted least-squares (IWLS) fit [1]. It yields maximum-likelihood estimates when data are Poisson or binomially distributed with only the probabilities as free parameters [2]. This extends the strict equivalence between ML and WLS fits to a larger class of problems, an extension which is highly relevant in practice, since counts in histograms are Poisson-distributed, and counted fractions are binomially distributed with the denominator considered fixed. The iterations are used to successively update estimates of the variances $\sigma_i^2$, which are kept constant during minimization.

When IWLS and ML fits are equivalent, which one is recommended? We conducted toy experiments where IWLS and ML fits are carried out numerically, as is common in practice. We found similar convergence rates for both methods when the model is non-linear, and a significantly faster convergence for the IWLS fit if the model is linear. This makes the IWLS fit a useful addition to the toolbox.

We have seen how the WLS fit can be derived from the ML fit under certain conditions. Inversely, we will show that the unbinned ML fit can be derived as a limiting case from the IWLS fit under weak conditions. The derivation shows that the two approaches are deeply connected.
2. Least-Squares Variants In Use
Standard variants of the WLS fit used in practice produce biased estimates when the fit is applied to Poisson-distributed data with small counts. The bias is often attributed to the breakdown of the normal approximation to the Poisson distribution, but it is actually related to how the unknown true variances $\sigma_i^2$ in Eq. (3) are replaced by estimates.

We demonstrate this along a simple example. We fit the single parameter $\mu$ of the Poisson distribution

$P(n; \mu) = e^{-\mu} \mu^n / n!,$   (9)

to $k$ counts $\{n_i \,|\, i = 1, \dots, k\}$ sampled from it. The maximum-likelihood estimate for $\mu$ can be computed analytically by maximizing Eq. (6). We solve $\partial \ln L / \partial\mu \equiv \partial_\mu \ln L = 0$ and obtain the arithmetic average

$\hat\mu = \frac{1}{k} \sum_{i=1}^{k} n_i,$   (10)

which is unbiased and has minimal variance. We will now apply variants of the WLS fit to the same problem, which differ in how they substitute the unknown true variance.

Variance computed for each sample.
For a single isolated sample, the unbiased estimate of $\mu$ is $\hat\mu_i = n_i$, with variance $\mathrm{Var}[n_i] = \mu \simeq \hat\mu_i = n_i$. This is the origin of the well-known $\sqrt{n}$-estimate for the standard deviation of a count $n$. With this variance estimate, we get

$Q(\mu) = \sum_{i=1}^{k} (n_i - \mu)^2 / n_i.$   (11)

This form is called Neyman's $\chi^2$ in the statistics literature [4]. Replacing the true variance $\mu$ by its sample estimate $n_i$ is an application of the bootstrap principle discussed by Efron and Tibshirani [6]. To obtain the minimum, we solve $\partial_\mu Q = 0$ and obtain the harmonic average

$\hat\mu^{-1} = \frac{1}{k} \sum_{i=1}^{k} n_i^{-1}.$   (12)

The solution is biased and breaks down for samples with
$n_i = 0$. The variance estimates here are constant (they do not vary with $\hat\mu$), but differ from sample to sample. This treatment ignores the fact that the true variance is the same for all samples in this setup.

Variance computed from model.
Another choice is to directly insert $\mathrm{Var}[n_i] = \mu$ in the formula,

$Q(\mu) = \sum_{i=1}^{k} (n_i - \mu)^2 / \mu.$   (13)

This form, called Pearson's $\chi^2$ [4], is a conceptual improvement, because $\mu$ is the exact but unknown value of the variance. However, the variance $\mu$ now varies together with the expectation value $\mu$. Solving $\partial_\mu Q = 0$ yields the quadratic average

$\hat\mu = \sqrt{\frac{1}{k} \sum_{i=1}^{k} n_i^2}.$   (14)

This estimate is also biased, but can handle samples with
$n_i = 0$. The bias may come as a surprise, since we used the exact value for the variance after all. The failure here can be traced back to the fact that the variance estimates $\sigma_i^2 = \mu$ are not fixed during the minimization. A small positive bias on $\mu$ in Eq. (13) leads to a second-order increase in the numerator, which is overcompensated by a first-order increase of the denominator. In other words, the fit tends to increase the variance even at the cost of a small bias in the expectation when given this freedom, because overall it yields a reduction of $Q$.

Constant variance.
Finally, we simply use $\sigma_i^2 = c$, where $c$ is an arbitrary constant,

$Q(\mu) = \sum_{i=1}^{k} (n_i - \mu)^2 / c.$   (15)

We solve $\partial_\mu Q = 0$ and obtain the optimal maximum-likelihood estimate Eq. (10) as the solution; the constant $c$ drops out.

This seems counter-intuitive, since we used a constant for all samples instead of a value close or equal to the true variance. However, this case satisfies all conditions of the Gauss-Markov theorem. The expectation values are trivial linear functions of the parameter, $\mu_i = \mu$. The variances $\sigma_i^2$ are all equal and only need to be known up to a global scaling factor, hence any constant $c$ will do.

We learned that keeping the variance estimates constant during minimization is important, but the estimates should in general be as close to the true variances as possible. An iterated fit can satisfy both requirements.
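The biases of the three estimators can be seen directly in a small simulation. The sketch below is our own illustration, with assumed values $\mu = 2$ and $k = 10$: it evaluates the analytical solutions of Eqs. (10), (12), and (14) on many Poisson toy samples.

```python
# Our own toy check of Eqs. (10), (12), (14); mu = 2 and k = 10 are
# assumed values. The arithmetic mean (ML) is unbiased, while the harmonic
# average (Neyman) and the quadratic average (Pearson) are biased.
import numpy as np

rng = np.random.default_rng(42)
mu, k, n_toys = 2.0, 10, 100_000
n = rng.poisson(mu, size=(n_toys, k)).astype(float)

ml = n.mean(axis=1)                         # Eq. (10): arithmetic average
pearson = np.sqrt((n**2).mean(axis=1))      # Eq. (14): quadratic average

# Eq. (12) breaks down for n_i = 0, so keep only toys without empty samples
ok = (n > 0).all(axis=1)
neyman = 1.0 / (1.0 / n[ok]).mean(axis=1)   # Eq. (12): harmonic average

print("ML:     ", ml.mean())       # close to mu = 2, unbiased
print("Neyman: ", neyman.mean())   # biased low
print("Pearson:", pearson.mean())  # biased high, since E[n^2] = mu + mu^2
```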
3. Iterated Weighted Least-Squares
The iterated (re)weighted least-squares methods (IWLS or IRLS) are well known in statistical regression [1], and can be applied to fits with $k$ measurements $\{y_i \,|\, i = 1, \dots, k\}$ described by a model with $m$ parameters $p = (p_j \,|\, j = 1, \dots, m)$, which predicts the expectations $E[y_i] = \mu_i(p)$ and variances $\mathrm{Var}[y_i] = \sigma_i^2(p)$ of each measurement. We will discuss the special application where the $y_i$ are entries of a histogram. One then minimizes the sum of squared residuals

$Q(p) = \sum_{i=1}^{k} \big(y_i - \mu_i(p)\big)^2 / \sigma_i^2(\hat p),$   (16)

where the $\sigma_i^2$ are constant within one iteration of the fit and computed from the model using the parameter estimate $\hat p$ that minimized $Q(p)$ in the previous iteration. A convenient choice for the first iteration is $\sigma_i^2 = 1$. One iterates until $\hat p$ converges.
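As a concrete illustration of this iteration, here is a minimal sketch for Poisson-distributed counts; the exponential model, the starting values, and the use of scipy.optimize.minimize are our own assumptions, not prescribed by the method.

```python
# Our own sketch of the IWLS iteration of Eq. (16) for Poisson counts,
# with an assumed model mu_i(p) = p0 * exp(-p1 * x_i). The variances are
# frozen during each minimization and updated between iterations.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 20)               # bin centers
mu = lambda p: p[0] * np.exp(-p[1] * x)
n = rng.poisson(mu((50.0, 2.0))).astype(float)

p_hat = np.array([n[0] + 1.0, 1.0])         # crude starting values
var = np.ones_like(n)                       # first iteration: sigma_i^2 = 1
for _ in range(30):
    w = 1.0 / var                           # weights fixed for this iteration
    res = minimize(lambda p: np.sum(w * (n - mu(p)) ** 2), p_hat)
    if np.allclose(res.x, p_hat, rtol=1e-8):
        break
    p_hat = res.x
    var = mu(p_hat)                         # Poisson: Var[n_i] = mu_i(p-hat)
print(p_hat)
```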
In particle physics, we often work with samples drawn from two monoparametric distributions of the exponential family:

• Poisson distribution. Example: fitting a distribution function to a histogram of counts.

• Binomial distribution with fixed number of trials. Example: fitting an efficiency function to two histograms with generated and accepted events.

Charles, Frome, and Yu [2] derived that the IWLS fit gives the exact same result as the ML fit for a family of distributions. We demonstrate this in the appendix for the special distributions discussed here.

The Hessian matrices of second derivatives are also equal up to a constant factor, $\partial_{p_l} \partial_{p_m} Q = -2\, \partial_{p_l} \partial_{p_m} \ln L$. The inverse of the Hessian is an estimate of the covariance matrix of the solution, an important uncertainty estimate in practical applications.

We emphasize that the equivalence does not depend on the size of the data sample or on the functional form of the model that predicts the expectation values for the measurements. In particular, when the IWLS fit is applied to histograms, it is not biased by small counts per bin or even empty bins.

A formal discussion of how systematic uncertainties can be handled with the IWLS fit is outside of the scope of this paper, but we note that it can include systematic uncertainties. Barlow [7] discusses how correlated systematic uncertainties can be handled in a least-squares fit. One minimizes Eq. (2) in each iteration with a matrix

$C = C_\mathrm{stat}(\hat p) + C_\mathrm{sys}(\hat p),$   (17)

where $C_\mathrm{stat}$ is the current estimate of the stochastic covariance computed from the previous solution, and $C_\mathrm{sys}$ is a covariance matrix that represents the systematic uncertainties of the measurements. The matrix $C_\mathrm{sys}$ may be a function of the parameter vector. Like the covariance matrix $C_\mathrm{stat}$, it is kept constant during each iteration, and updated between iterations using the current value of $\hat p$. This approach has been successfully applied in a combination of measurements from the CDF and D0 experiments [8].

When the IWLS and the ML fits are equivalent, which one should be used in practice? The two methods produce the same results in analytical problems, but can have different performance in numerical problems. In practice, the extrema of the log-likelihood function $\ln L(p)$ and the weighted least-squares function $Q(p)$ are usually found with a local optimizer, like the MIGRAD algorithm in the
MINUIT package [9, 10]. Computing the functions is sometimes expensive, when the fitted data sets are large and the model has many parameters. Numerical methods are therefore judged based on the number of function evaluations required to converge to the optimum within some tolerance. Another criterion is robustness, the ability to converge to the right optimum from a point in the neighborhood of the solution.

To address these points, we conducted toy experiments with Poisson-distributed counts $n_i$ and find that the ML method requires fewer function evaluations than the IWLS method in general. However, the rate of convergence of the IWLS method can be greatly accelerated when the model that computes the count expectation $E[n_i] = \mu_i(p)$ is linear in the parameters, $\mu_i = X_i p$, where $X_i$ is a vector of constants. The maximum of $\ln L(p)$ usually cannot be found analytically in this case, but the minimum of $Q(p)$ is given by Eq. (4) in each iteration of the IWLS fit.
[Figure 1: Example of a toy data set (points) to test the performance of the ML, IWLS, and L-IWLS fits (see text). Left: the curve represents the fit result of the three methods for this data set (which are identical). Right: intermediate parameter states during the optimization for the ML, IWLS, and L-IWLS fits (see text), in each iteration of the respective algorithms until the stopping criterion is reached. The L-IWLS fit often converges quadratically. The IWLS fit is slowed down by the artificial dampening that we introduced to avoid oscillations.]

When the computing time is dominated by the evaluation of $Q(p)$ or $\ln L(p)$, solving the IWLS fit is faster than the ML fit. The IWLS fit also does not require a problem-specific starting point for the optimization in this case. We call this special variant the L-IWLS fit. All three methods are able to handle fits that have bounded parameters, which are common in particle physics. In our toy experiments, the parameters are bounded to be non-negative. Details are given in the next section.

Whether the ML or the IWLS fits are more robust in the above sense is more difficult to say. No general proofs can be given for either method. Our toy studies suggest the following order of increasing robustness: IWLS, L-IWLS, ML. In some toy experiments, the IWLS methods require many more iterations than average, producing a long tail in the distribution of iteration counts. Such tails are not observed for the ML fit. It is likely, however, that a more sophisticated implementation of the IWLS fit than ours could improve the robustness of this method.

We compare the performances of ML, IWLS, and L-IWLS fits in a series of 1000 toy experiments with Poisson-distributed samples. We use a linear model for the expectation with two parameters, $\mu(x, p) = p_0 + p_1 x$, with $x \in$
$[0, 1]$ as an independent variable. For the true parameters $p_\mathrm{truth} = (1, 10)$, we simulate 10 pairs $(x_i, n_i)$. The $x_i$ are evenly spaced over the interval $[0, 1]$, $\mu_i$ is calculated for each $x_i$ based on the true parameters, and finally a random sample $n_i$ is drawn for each $\mu_i$ from the Poisson distribution. The model is then fitted to each toy data set using the following three methods. One of the toy experiments is shown in Fig. (1).

• ML fit: Starting from Eq. (A.1) we use the
MIGRAD algorithm from the MINUIT package to find the minimum. We pass the exact analytical gradient to MIGRAD for this problem, replacing the numerical approximation that
MINUIT uses otherwise. We restrict the parameter range to $p_k \geq 0$, and a small constant is added to $\mu$ whenever it appears in a denominator to avoid division by zero.

• IWLS fit: We use Newton's method to update $p$,

$p_{n+1} = p_n - H^{-1} \partial_p Q,$   (18)

with the exact analytical gradient $\partial_p Q$ and Hessian $H$ for this problem. Since the model is linear and the function $Q$ quadratic, Newton's method yields the exact solution for the given gradient and Hessian matrix, but without taking the boundary condition $p_k \geq 0$ into account. To check for convergence, we use the MINUIT criterion, which is based on the estimated distance-to-minimum and deviations in the diagonal elements of the inverted Hessian [9]. This approach works very well for most toy experiments, but in some rare cases the iteration oscillates around the minimum. To avoid the oscillations, we introduce an artificial dampening that averages the new
parameter vector with the previous one, $p_{n+1} := (p_{n+1} + p_n)/2$.

[Figure 2: Application of the maximum-likelihood (ML) fit, the iterated weighted least-squares (IWLS) fit, and its specialization for linear models (L-IWLS) to 1000 toy experiments with Poisson-distributed samples (see text). Top row: histograms of the two fitted parameters of the model, $\hat p_0$ and $\hat p_1$, overlaid for all three fit methods; the histograms are nearly identical, with means $\hat p_0 = 1.00 \pm 0.02$ and $\hat p_1 = 10.11 \pm 0.07$ for all three methods. Bottom row: normalized histograms of the number of evaluations of the model function for the three fit methods in double-logarithmic scale.]

• L-IWLS fit: We solve Eq. (4) with the
NNLS [11] algorithm as implemented in SciPy [12], and iterate. It solves Eq. (4) under the boundary condition $p_k \geq 0$. To check for convergence, we again use the MINUIT criterion.
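A condensed sketch of this L-IWLS scheme is shown below; it follows the description above, but replaces the MINUIT convergence criterion with a simple fixed tolerance and uses assumed toy values.

```python
# Sketch of the L-IWLS fit: each iteration solves Eq. (4) under p_k >= 0
# with NNLS on the whitened system; the np.allclose stopping rule stands
# in for the MINUIT criterion used in the text.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(7)
x = np.linspace(0.0, 1.0, 10)
X = np.column_stack((np.ones_like(x), x))    # linear model mu = X @ p
n = rng.poisson(X @ np.array([1.0, 10.0])).astype(float)

var = np.ones_like(n)                        # first iteration: sigma_i^2 = 1
p_hat = np.zeros(2)
for _ in range(100):
    w = 1.0 / np.sqrt(var)                   # whitening weights
    p_new, _ = nnls(X * w[:, None], n * w)   # weighted Eq. (4) with p >= 0
    if np.allclose(p_new, p_hat, rtol=1e-10, atol=1e-12):
        break
    p_hat = p_new
    var = np.maximum(X @ p_hat, 1e-12)       # Poisson variances from model
print(p_hat)
```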
We note that our application of the general IWLS fit to a problem with a linear model is artificial. We only do this here to compare all three fitting methods on the same problem. For the IWLS and ML fits, we use the optimistic starting point
$p_\mathrm{truth} = (1, 10)$. The ML and IWLS fits therefore run under ideal conditions compared to the L-IWLS fit, which does not require a specific starting point. In practice, one will usually start with a less ideal starting point, which slows down the convergence of ML and IWLS fits compared to the L-IWLS fit.

In case of the ML and IWLS fits, we increase the call counter by one for each evaluation of $Q$ or $\ln L$ and each evaluation of their gradients for all values of $x$. In case of the L-IWLS fit, we count one application of the NNLS algorithm as one call, since it requires essentially one computation of the gradient.

The results are shown in Fig. (2). As expected, the results are equal within the accuracy of the numerical algorithms, which stop when MINUIT's standard convergence criterion is reached. This criterion roughly corresponds to a fixed small precision in the parameter relative to its uncertainty.

The average number of calls required to converge is different: 19.3 for ML, 23.6 for IWLS, and 4.8 for L-IWLS. The L-IWLS fit is the fastest to converge, requiring only a quarter of the function evaluations of the ML fit. The IWLS fit is the slowest; it requires about 20% more calls on average than the ML fit. This is mainly due to the artificial dampening. In cases where the dampening is not needed, the IWLS fit converges as rapidly as the L-IWLS fit. Since we chose a linear model for this performance study in order to compare all three methods, a Newton step computes the exact solution to the fitting problem for the current covariance matrix estimate.

An investigation shows that the convergence issues of the IWLS fit appear when a parameter of the model is very close to zero. If this is not the case and no dampening is applied, the IWLS and L-IWLS fits produce identical results. MINUIT was designed to handle such cases well and shows a much more stable convergence rate. This suggests that the issues of the IWLS fit can be overcome as well with a more sophisticated implementation, but this comes at the cost of a slower convergence in favorable cases. The overall performance of the IWLS fit will probably not surpass that of the more straight-forward ML fit.

In conclusion, we recommend the L-IWLS fit for linear models and the ML fit for non-linear models.

4. Unbinned maximum-likelihood from IWLS
In the introduction, we reviewed how the WLS fit can be derived as a special case of the ML fit, when measurements are normally distributed with known variance. Alternatively, the WLS fit can also be derived from geometric principles without relying on the ML principle. We will now show that the unbinned ML fit can be derived as a limiting case of the IWLS fit.

For the unbinned ML fit of a known probability density $f(x; p)$ of a continuous stochastic variable $x$ with parameters $p$, one maximizes the sum of logarithms of the model density evaluated at the measurements $\{x_i \,|\, i = 1, \dots, k\}$,

$\ln L(p) = \sum_{i=1}^{k} \ln f(x_i; p).$   (19)

The maximum is found by solving the system of equations
$\partial_{p_j} \ln L(p) = 0$. The density $f(x; p)$ must be at least once differentiable in $p$.

To derive these equations as a limit of the IWLS fit, we assume that $f(x; p)$ is finite everywhere in $x$, so that the probability density is not concentrated in discrete points.

We start by considering a histogram of $k$ samples $x_i$. Since the samples are independently drawn from a PDF, the histogram counts $n_l$ are uncorrelated and Poisson-distributed. Following the IWLS approach, we minimize the function

$Q(p) = \sum_l \frac{(n_l - k P_l)^2}{k \hat P_l}$   (20)

and iterate, where $P_l(p) = \int_{x_l}^{x_l + \Delta x} f(x; p)\, \mathrm{d}x$ is the expected fraction of the samples in bin $l$, and $\hat P_l = P_l(\hat p)$ is the value based on the fitted parameters $\hat p$ from the previous iteration. Expansion of the squares yields three terms,

$Q(p) = \sum_l \frac{n_l^2}{k \hat P_l} - 2 \sum_l \frac{n_l P_l}{\hat P_l} + k \sum_l \frac{P_l^2}{\hat P_l}.$   (21)

The first term is proportional to $1/\Delta x$, but not a function of $p$. Therefore it does not contribute to the minimum obtained by solving the equations $\partial_{p_j} Q(p) = 0$.

We investigate the limit $\Delta x \to 0$.
Since $f(x)$ is finite everywhere, we have ultimately either zero or one count in each bin. With $P_l \to f(x_l; p)\,\Delta x$, the second term has a finite limit

$2 \sum_l \frac{n_l P_l}{\hat P_l} \;\xrightarrow{\Delta x \to 0}\; 2 \sum_{i=1}^{k} \frac{f(x_i; p)\,\Delta x}{f(x_i; \hat p)\,\Delta x} = 2 \sum_{i=1}^{k} \frac{f(x_i; p)}{f(x_i; \hat p)},$   (22)

where only bins around the measurements $x_i$ with one entry contribute ($n_l = 1$). The third term also has a finite limit,

$k \sum_l \frac{P_l^2}{\hat P_l} \;\xrightarrow{\Delta x \to 0}\; k \sum_l \frac{f(x_l; p)^2\,(\Delta x)^2}{f(x_l; \hat p)\,\Delta x} = k \int \frac{f(x; p)^2}{f(x; \hat p)}\, \mathrm{d}x.$   (23)

One $\Delta x$ cancels in the ratio, and in the limit $\Delta x \to 0$ the sum turns into an integral.

We now consider the derivatives $\partial_{p_j} Q(p)$ in the limit of many iterations. We assume that the iterations converge, so that the previous solution $\hat p$ approaches the next solution $p$. We get

$\partial_{p_j} Q(p) = -2 \sum_{i=1}^{k} \frac{\partial_{p_j} f(x_i; p)}{f(x_i; \hat p)} + k \int \frac{2 f(x; p)\, \partial_{p_j} f(x; p)}{f(x; \hat p)}\, \mathrm{d}x \;\xrightarrow{\hat p \to p}\; -2 \sum_{i=1}^{k} \frac{\partial_{p_j} f(x_i; p)}{f(x_i; p)} + 2k\, \partial_{p_j} \int f(x; p)\, \mathrm{d}x.$   (24)

The last term vanishes in the limit, because $\int f(x; p)\, \mathrm{d}x = 1$ does not depend on $p$. Combining both limits, we obtain

$\partial_{p_j} Q(p) \;\xrightarrow{\Delta x \to 0,\ \hat p \to p}\; -2 \sum_{i=1}^{k} \frac{\partial_{p_j} f(x_i; p)}{f(x_i; p)} = -2 \sum_{i=1}^{k} \partial_{p_j} \ln f(x_i; p) = -2\, \partial_{p_j} \ln L(p).$   (25)

The derivatives are equal up to a constant factor, which means that the solutions of $\partial_{p_j} Q(p) = 0$ and $\partial_{p_j} \ln L(p) = 0$ are the same. We assumed a single measurement $x_i$ per event for simplicity, but the result also holds for the general case of a set of $k$ $n$-dimensional vectors $\{x_i \,|\, i = 1, \dots, k\}$ with $x_i = (x_{ji} \,|\, j = 1, \dots, n)$ and a corresponding $n$-dimensional probability density $f(x; p)$. In this case, one would repeat the derivation starting from an $n$-dimensional histogram.

The derivation provides some insights.

• The absolute values of $Q(\hat p)$ and $-2 \ln L(\hat p)$ at the solution $\hat p$ are not equal. They differ by an (infinite) additive constant.

• The derivatives $\partial_{p_j} Q(p)$ and $-2\, \partial_{p_j} \ln L(p)$ differ in general when $p$ is not the solution $\hat p$, because the second term in Eq. (24) does not vanish for $p \neq \hat p$.

In practice, the second point means that the MINUIT package produces the same error estimates for the solution $\hat p$ if the HESSE algorithm is used, but not if the
MINOS algorithm is used. The
HESSE algorithm numerically computes and inverts the Hessian matrix of second derivatives at the minimum, which gives identical results for $Q$ and $-2 \ln L$. The MINOS algorithm scans the neighborhood of the minimum, which for $Q$ and $-2 \ln L$ usually has a different shape.
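The limit can also be checked numerically. The following sketch is our own illustration: it assumes a truncated exponential density on $[0, 1]$ and compares the numerical gradient of the binned $Q$ from Eq. (20), evaluated at $p = \hat p$, with $-2\, \partial_{p_j} \ln L$ as the number of bins grows.

```python
# Our own numerical check that the gradient of the binned Q of Eq. (20),
# taken at p = p-hat, approaches -2 d(ln L)/d(lambda) as the bin width goes
# to zero. The truncated exponential density and all values are assumptions.
import numpy as np

rng = np.random.default_rng(13)
lam, k, eps = 1.5, 500, 1e-6
u = rng.uniform(size=k)                       # inversion sampling on [0, 1]
xs = -np.log(1.0 - u * (1.0 - np.exp(-lam))) / lam

def bin_probs(edges, l):                      # P_l of the truncated density
    cdf = (1.0 - np.exp(-l * edges)) / (1.0 - np.exp(-l))
    return np.diff(cdf)

def grad_Q(nbins):                            # dQ/dlambda at lambda = lam
    edges = np.linspace(0.0, 1.0, nbins + 1)
    n, _ = np.histogram(xs, edges)
    Ph = bin_probs(edges, lam)                # frozen denominator of Eq. (20)
    Q = lambda l: np.sum((n - k * bin_probs(edges, l)) ** 2 / (k * Ph))
    return (Q(lam + eps) - Q(lam - eps)) / (2 * eps)

lnL = lambda l: np.sum(np.log(l) - l * xs - np.log(1.0 - np.exp(-l)))
grad_lnL = (lnL(lam + eps) - lnL(lam - eps)) / (2 * eps)

for nbins in (10, 100, 1000, 10000):          # finer bins approach the limit
    print(nbins, grad_Q(nbins), -2.0 * grad_lnL)
```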
5. Notes on goodness-of-fit tests
For a goodness-of-fit (GoF) test, one computes a test statistic for a probabilistic model and a set of measurements. The test statistic is designed to have a known probability distribution when the measurements are truly distributed according to the probabilistic model. If the value for a particular model is very improbable, the model may be rejected.

It is well-known that the minimum value $Q(\hat p)$ is $\chi^2$-distributed with expectation $(k - m)$, if the measurements are normally distributed, where $k$ and $m$ are the number of measurements and number of fitted parameters, respectively [3]. This GoF property is so useful and frequently applied, that the function $Q(p)$ is often simply called chi-square.

In general, $Q(\hat p)$ is not $\chi^2$-distributed for measurements that are not normally distributed around the model expectations. Approximately, it holds for Poisson and binomially distributed measurements when counts are not close to zero, and fractions are neither too close to zero nor to one. Stronger statements can be made about the expectation value of $Q(\hat p)$. For linear models with $m$ parameters and $k$ unbiased measurements with known covariance matrix $C$, the expectation of $Q(\hat p)$ is guaranteed to be

$E[Q(\hat p)] = k - m,$   (26)

regardless of the sample size and the distribution of the measurements, as shown in the appendix. Therefore, the well-known quality criterion that the reduced $\chi^2$ should be close to unity,
$Q(\hat p)/(k - m) \simeq 1$, is often useful even if measurements are not normally distributed.

We saw previously that $-2 \ln L(\hat p)$ differs from $Q(\hat p)$ by an infinite additive constant, which is a hint that it cannot straightforwardly replace the latter as a GoF statistic. When used with unbinned data, $\ln L(\hat p)$ is ill-suited as a GoF test statistic. Heinrich [13] presented striking examples in which $\ln L(\hat p)$ carries no information about how well the model fits the measurements. Cousins [14, 15] gave an intuitive explanation for this fact. The IWLS fit provides a maximum-likelihood estimate for measurements that follow a Poisson or binomial distribution, and a GoF test statistic as a side result, which in general is not exactly $\chi^2$-distributed, but its distribution can often be obtained from a Monte Carlo simulation.
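Eq. (26) is easy to verify with a short Monte Carlo run. The sketch below is our own check, with an assumed linear model and Poisson-distributed measurements whose known true covariance is used as the fixed weight matrix.

```python
# Our own Monte Carlo check of Eq. (26), E[Q(p-hat)] = k - m, for an
# assumed linear model with k = 10 Poisson measurements and m = 2
# parameters; the known true covariance serves as the fixed weight matrix.
import numpy as np

rng = np.random.default_rng(11)
x = np.linspace(0.0, 1.0, 10)                   # k = 10 measurements
X = np.column_stack((np.ones_like(x), x))       # m = 2 parameters
mu = X @ np.array([1.0, 10.0])                  # true expectations
Cinv = np.diag(1.0 / mu)                        # inverse true covariance
M = np.linalg.inv(X.T @ Cinv @ X) @ X.T @ Cinv  # Eq. (4) as a matrix

q = []
for _ in range(20_000):
    y = rng.poisson(mu).astype(float)
    r = y - X @ (M @ y)                         # residuals at the solution
    q.append(r @ Cinv @ r)
print(np.mean(q))                               # close to k - m = 8
```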
6. Conclusions
An iterated weighted least-squares fit applied to measurements, which are Poisson- or binomially distributed around model expectations, provides the exact same solution as a maximum-likelihood fit. This holds for any model and any sample size. When the two fit methods are equivalent, the maximum-likelihood fit is still recommended, except when the model is linear. In this case, the minimum of the weighted least-squares problem can be found analytically in each iteration, which usually needs less computation overall than numerically maximizing the likelihood and requires no problem-specific starting point. The iterated weighted least-squares fit provides a goodness-of-fit statistic in addition, while the maximum-likelihood fit usually does not. Of course, a goodness-of-fit statistic can always be separately computed after the optimization, but in case of the maximum-likelihood fit it requires implementing two functions in a computer program instead of one.

Whether the two fit methods give equivalent results depends only on the probability distribution of the measurements around the model expectations. Here we presented proofs of the equivalence for Poisson and binomial distributions. In the statistics literature [1, 2], more general proofs are given that hold also for some other distributions.
7. Acknowledgments
We thank Bob Cousins for a critical reading of the manuscript and for valuable pointers to the primary statistical literature, and the anonymous reviewer for suggestions to clarify additional points.
Appendix A. Equivalence of ML and IWLS for Poisson-distributed data
A common task is to fit a model to a histogram with $k$ bins, each with a count $n_i$. Especially in multi-dimensional histograms some bins may have few or even zero entries. This poses a problem for a conventional weighted least-squares fit, but not for a ML fit or an IWLS fit.

A ML fit of a model with $m$ parameters $p = (p_j \,|\, j = 1, \dots, m)$ to a sample of $k$ Poisson-distributed numbers $\{n_i \,|\, i = 1, \dots, k\}$ with expectation values $E[n_i] = \mu_i(p)$ is performed by maximizing the log-likelihood

$\ln L(p) = \sum_{i=1}^{k} n_i \ln \mu_i - \sum_{i=1}^{k} \mu_i,$   (A.1)

which is obtained by taking the logarithm of the product of Poisson probabilities (9) of the data under the model, and dropping terms that do not depend on $p$. To find the maximum, we set the $m$ first derivatives

$\frac{\partial \ln L}{\partial p_j} = \sum_{i=1}^{k} \frac{n_i}{\mu_i} \frac{\partial \mu_i}{\partial p_j} - \sum_{i=1}^{k} \frac{\partial \mu_i}{\partial p_j}$   (A.2)

for $j = 1, \dots, m$ to 0. We get a system of equations
$\sum_{i=1}^{k} \frac{n_i - \mu_i}{\mu_i} \frac{\partial \mu_i}{\partial p_j} = 0.$   (A.3)

We now approach the same problem as an IWLS fit. The sum of weighted squared residuals is

$Q = \sum_{i=1}^{k} \frac{(n_i - \mu_i)^2}{\hat\mu_i},$   (A.4)

where $\hat\mu_i$ is the expected variance computed from the model, using the parameter estimate $\hat p$ from the previous iteration. To find the minimum, we again set the $m$ first derivatives to 0 and obtain
$\frac{\partial Q}{\partial p_j} = -2 \sum_{i=1}^{k} \frac{n_i - \mu_i}{\hat\mu_i} \frac{\partial \mu_i}{\partial p_j} = 0.$   (A.5)

Eq. (A.5) and Eq. (A.3) become identical in the limit $\hat\mu_i \to \mu_i$, and so do their solutions. The limit is approached by iterating the fit, so that we actually obtain the maximum-likelihood estimate from the IWLS fit. Remarkably, this does not depend on the size of the counts $n_i$ per bin. The equivalence holds even when many bins with zero entries are present. To obtain this result, the $\hat\mu_i$ must be constant. If $\hat\mu_i$ was replaced by $\mu_i$ in Eq. (A.4), extra non-vanishing terms would appear in Eq. (A.5).

As already mentioned, when the expectations are linear functions, the unique analytical solution to Eq. (A.5) is given by Eq. (4), with $C^{-1} = (\delta_{ij}/\hat\mu_i \,|\, i, j = 1, \dots, k)$ and $y_i = n_i$. An analytical solution of Eq. (A.3) is not known to the authors. The IWLS fit converges faster than the ML fit in this case.

Appendix B. Equivalence of ML and IWLS for binomially distributed data
Another common task is to obtain an efficiency function of a selection or trigger as a function of an observable. One collects a histogram of generated events with bin contents $N_i$, and a corresponding histogram of accepted events with bin contents $n_i$. The $N_i$ are considered as constants here, while the $n_i$ are drawn from the binomial distribution. The goal is to obtain a model function for the efficiencies $\epsilon_i$ that best describes the drawn samples $n_i$. A single least-squares fit will give biased results when many $n_i$ are close to either 0 or $N_i$, but not a ML or an IWLS fit.

A ML fit of a model with $m$ parameters $p = (p_j \,|\, j = 1, \dots, m)$ for a sample of $k$ binomially distributed numbers $\{n_i \,|\, i = 1, \dots, k\}$ with expectations $E[n_i] = \mu_i(p) = \epsilon_i(p) N_i$ is performed by maximizing the log-likelihood

$\ln L(p) = \sum_{i=1}^{k} n_i \ln \mu_i + \sum_{i=1}^{k} (N_i - n_i) \ln(N_i - \mu_i),$   (B.1)

which is obtained by taking the logarithm of the product of binomial probabilities to observe $n_i$ when $\mu_i = \epsilon_i N_i$ are expected,

$P(n_i; \mu_i, N_i) = \binom{N_i}{n_i} \epsilon_i^{n_i} (1 - \epsilon_i)^{N_i - n_i} = \binom{N_i}{n_i} \frac{\mu_i^{n_i} (N_i - \mu_i)^{N_i - n_i}}{N_i^{N_i}},$   (B.2)

and dropping terms that do not depend on $p$. A binomial distribution has two parameters $(\mu_i, N_i)$, but it is a monoparametric distribution in this context since the $N_i$ are known and only the $\mu_i$ are free parameters.

Again we set the $m$ first derivatives

$\frac{\partial \ln L}{\partial p_j} = \sum_{i=1}^{k} \frac{n_i}{\mu_i} \frac{\partial \mu_i}{\partial p_j} - \sum_{i=1}^{k} \frac{N_i - n_i}{N_i - \mu_i} \frac{\partial \mu_i}{\partial p_j}$   (B.3)

to zero for $j = 1, \dots, m$. The minimum is obtained by solving

$\sum_{i=1}^{k} \frac{n_i - \mu_i}{\mu_i (1 - \mu_i / N_i)} \frac{\partial \mu_i}{\partial p_j} = 0.$   (B.4)
For the IWLS fit, we need to minimize the sum

$Q(p) = \sum_{i=1}^{k} \frac{(n_i - \mu_i)^2}{\hat\mu_i (1 - \hat\mu_i / N_i)},$   (B.5)

where the variances for the binomial distribution with expectation $\mu_i$ are $\sigma_i^2 = \mathrm{Var}[n_i] = N_i \epsilon_i (1 - \epsilon_i) = \mu_i (1 - \mu_i / N_i)$. Again, we replaced $\mu_i$ in the variance by the constant estimate $\hat\mu_i$ from the previous iteration. Setting the $m$ first derivatives to 0, we obtain

$\frac{\partial Q}{\partial p_j} = -2 \sum_{i=1}^{k} \frac{n_i - \mu_i}{\hat\mu_i (1 - \hat\mu_i / N_i)} \frac{\partial \mu_i}{\partial p_j} = 0.$   (B.6)

Eq. (B.6) and Eq. (B.4) become identical in the limit $\hat\mu_i \to \mu_i$, which is approached by iterating the minimization. Again, we obtain the maximum-likelihood estimate with the IWLS fit.

Like in the previous case, the L-IWLS fit for a linear model converges faster than the ML fit, while the IWLS fit converges more slowly than the ML fit in general.
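For illustration, the following sketch (our own construction, with an assumed linear efficiency model and illustrative values) iterates the weighted linear least-squares solve with the binomial variances of Eq. (B.5).

```python
# Our own sketch of an IWLS efficiency fit per Eq. (B.5); the linear
# efficiency model eps = p0 + p1*x and all values are assumptions.
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.1, 0.9, 8)
X = np.column_stack((np.ones_like(x), x))     # efficiency model eps = X @ p
N = np.full(x.size, 50)                       # generated events (fixed)
n = rng.binomial(N, X @ np.array([0.2, 0.6])).astype(float)  # accepted

A = X * N[:, None].astype(float)              # mu_i = N_i eps_i, linear in p
p_hat = np.zeros(2)
var = np.ones_like(n)                         # first iteration: sigma_i^2 = 1
for _ in range(100):
    w = 1.0 / np.sqrt(var)                    # whitening weights
    p_new, *_ = np.linalg.lstsq(A * w[:, None], n * w, rcond=None)
    if np.allclose(p_new, p_hat, rtol=1e-10, atol=1e-12):
        break
    p_hat = p_new
    mu = np.clip(A @ p_hat, 1e-6, N - 1e-6)   # keep variances positive
    var = mu * (1.0 - mu / N)                 # binomial variances, Eq. (B.5)
print(p_hat)                                  # close to (0.2, 0.6)
```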
Appendix C. Expectation of $Q(\hat p)$ for linear models

We compute the expectation of $Q$ in Eq. (2), evaluated at the solution $\hat p$ from Eq. (4) for linear models with $E[y] = Xp$, where $X$ is a fixed $k \times m$ matrix, and where the measurements $y$ have a known finite covariance matrix $C$. Similar proofs are found in the literature [16]. The covariance matrix of $\hat p$ is obtained by error propagation with the matrix $M = (X^T C^{-1} X)^{-1} X^T C^{-1}$ and $\hat p = M y$ as

$C_p = M C M^T = (X^T C^{-1} X)^{-1},$   (C.1)

where we used that $C^{-1}$ and $(X^T C^{-1} X)^{-1}$ are symmetric matrices.

The expectation is a linear operator. Since the solution $\hat p = M y$ is a linear function of the measurement, we have

$E[\hat p] = M E[y] = M X p = p,$   (C.2)

in other words, $\hat p$ is an unbiased estimate of $p$.

We expand $Q$ evaluated at $\hat p$,

$Q(\hat p) = y^T C^{-1} y - y^T C^{-1} X \hat p - \hat p^T X^T C^{-1} y + \hat p^T X^T C^{-1} X \hat p,$   (C.3)

which simplifies with $C_p^{-1} \hat p = X^T C^{-1} y$ and Eq. (C.1) to

$Q(\hat p) = y^T C^{-1} y - \hat p^T C_p^{-1} \hat p.$   (C.4)

The scalar result of a bilinear form is trivially equal to the trace of this bilinear form, and a cyclic permutation inside the trace then yields

$Q(\hat p) = \mathrm{Tr}(C^{-1} y y^T) - \mathrm{Tr}(C_p^{-1} \hat p \hat p^T).$   (C.5)

We compute the expectation on both sides and get, using linearity of trace and expectation,

$E[Q(\hat p)] = \mathrm{Tr}(C^{-1} E[y y^T]) - \mathrm{Tr}(C_p^{-1} E[\hat p \hat p^T]).$   (C.6)

The definition of the covariance matrix $C = E[y y^T] - E[y] E[y]^T$ is inserted, and vice versa for $C_p$. We get

$E[Q(\hat p)] = \mathrm{Tr}(C^{-1} C + C^{-1} E[y] E[y]^T) - \mathrm{Tr}(C_p^{-1} C_p + C_p^{-1} E[\hat p] E[\hat p^T]).$   (C.7)

The trace of a matrix multiplied with its inverse is equal to the number of diagonal elements, which is $k$ in case of $C$ and $m$ in case of $C_p$. We use this, $E[y] = Xp$, $E[\hat p] = p$, and again the linearity of the trace, to get

$E[Q(\hat p)] = k + \mathrm{Tr}(C^{-1} X p p^T X^T) - \big(m + \mathrm{Tr}(C_p^{-1} p p^T)\big).$   (C.8)

The remaining traces are identical and cancel,

$\mathrm{Tr}(C^{-1} X p p^T X^T) = \mathrm{Tr}(X^T C^{-1} X p p^T) = \mathrm{Tr}(C_p^{-1} p p^T),$   (C.9)

and so we finally obtain the result

$E[Q(\hat p)] = k - m,$   (C.10)

which is independent of the PDFs that describe the scatter of the measurements $y$ around the expectation values $E[y]$.

References

[1] J. A. Nelder, R. W. M. Wedderburn, Generalized Linear Models, J. R. Statist. Soc. A135, 370–384 (1972).

[2] A. Charles, E. L. Frome, P. L. Yu, The Equivalence of Generalized Least Squares and Maximum Likelihood Estimates in the Exponential Family, J. Am. Stat. Assoc., 169–171 (1976).

[3] F. James, Statistical methods in experimental physics, World Scientific Publishing Company, 2006.

[4] S. Baker and R. D. Cousins, Clarification of the use of chi-square and likelihood functions in fits to histograms, Nuclear Instruments and Methods in Physics Research, 437 (1984).

[5] A. C. Aitken, IV. On Least Squares and Linear Combination of Observations, Proc. R. Soc. Edinburgh, 42 (1935).

[6] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Monographs on Statistics and Applied Probability No. 57, Chapman & Hall/CRC, Boca Raton, Florida, USA, 1993.

[7] R. J. Barlow, Combining experiments with systematic errors, arXiv:1701.03701 (2017).

[8] T. A. Aaltonen et al. (CDF, D0 collaborations), Combination of measurements of the top-quark pair production cross section from the Tevatron Collider, Phys. Rev. D89, 072001 (2014).

[9] F. James and M. Roos, Minuit - a system for function minimization and analysis of the parameter errors and correlations, Computer Physics Communications, 343 (1975).

[10] iminuit team, MINUIT from Python, https://github.com/iminuit/iminuit (2013), accessed: 2018-03-05.

[11] C. Lawson and R. Hanson, Solving Least Squares Problems, Society for Industrial and Applied Mathematics, 1995, https://doi.org/10.1137/1.9781611971217.

[12] E. Jones et al., SciPy: Open source scientific tools for Python, (2001), accessed: 2018-03-05.

[13] J. Heinrich, Pitfalls of Goodness-of-Fit from Likelihood, in Statistical Problems in Particle Physics, Astrophysics, and Cosmology, edited by L. Lyons, R. Mount, and R. Reitmeyer, p. 52, 2003, arXiv:physics/0310167.

[14] R. D. Cousins,