Bayesian model averaging for analysis of lattice field theory results
FFERMILAB-PUB-20-374-T
William I. Jay∗ and Ethan T. Neil†
Theoretical Physics Department, Fermi National Accelerator Laboratory, Batavia, IL 60510, USA
and Department of Physics, University of Colorado, Boulder, CO 80309, USA
(Dated: August 3, 2020)
Abstract
Statistical modeling is a key component in the extraction of physical results from lattice field theory calculations. Although the general models used are often strongly motivated by physics, their precise form is typically ill-determined, and many model variations can be plausibly considered for the same lattice data. Model averaging, which amounts to a probability-weighted average over all model variations, can incorporate systematic errors associated with model choice without being overly conservative. We discuss the framework of model averaging from the perspective of Bayesian statistics, and give useful formulas and approximations for the particular case of least-squares fitting, commonly used in modeling lattice results. In addition, we frame the common problem of data subset selection (e.g. choice of minimum and maximum time separation for fitting a two-point correlation function) as a model selection problem, and study model averaging as a straightforward alternative to manual selection of fit ranges. Numerical examples involving both mock and real lattice data are given.

∗ Email: [email protected]
† Email: [email protected]

I. INTRODUCTION

One of the central problems in lattice field theory is that of model fitting and parameter estimation. This problem appears repeatedly in the analysis of lattice results, from single two-point correlators up to joint chiral and continuum extrapolations of results from many simulation streams. The functional forms which appear in these cases are often notoriously difficult to work with; the sum of exponentials which models the two-point correlator is in general quite numerically unstable, and chiral and continuum extrapolations can involve nonlinear dependence on large numbers of unknown variables. Worse still, many of the models appearing in analysis of lattice simulations are incomplete, with well-known limits to their applicability.
Chiral perturbation theory is an effective theory which will break down at large masses or momentum scales, and contains (in principle) an infinite number of low-energy constants; a single two-point correlator in principle has contributions from an infinite tower of excited states. The Symanzik effective theory describing the appearance of lattice artifacts similarly contains an organized but infinite number of terms. These contributions are typically dealt with by truncating the model, and often the data as well. However, this can lead to subtle dependence of the results on the analyst's choice of fit range and number of terms to include in the model.

This is not to say that lattice theorists are unaware of these potential sources of systematic error. The effects of model truncation and data truncation can be estimated by studying the variation of quantities of interest as the range of data included is varied, or as additional model terms are added. However, the often-adopted approach of taking the full difference between these variations as a systematic error is somewhat crude and likely to be overly conservative in many cases.

In this paper, we describe the technique of Bayesian model averaging as an alternative approach to these systematic errors, and outline its potential applications in the analysis of lattice simulation results. This approach allows for fully rigorous estimation of probability distributions for parameters of interest by combining results from several models. All models must reduce to a common model containing the parameter(s) to be estimated, but there is no other requirement that they be nested models. (For a continuum extrapolation of a matrix element, e.g., the reduced model can simply be a single constant parameter, the value of the matrix element in the continuum limit.)

Bayesian model averaging is somewhat well-known in the statistical literature [1–5], although it is most often used in the context of linear models.
Here we place no such restriction, giving formulas which can be used for arbitrary non-linear models. In general, the model probability weights required for model averaging are complicated integrals, but we will give approximate formulas which may be used with sufficient sample sizes. We discuss the effects of estimator bias and avoiding the Jeffreys-Lindley paradox [6–9] when empirical priors are used.

Our work is partially inspired by the Bayesian approach to effective field theory advocated by Schindler and Phillips [10]. Other examples of using weighted averages over models in lattice analysis include [11–16]. We believe that our treatment is the first attempt to lay out the procedure rigorously and in a fully Bayesian framework for a lattice field theory audience.

An outline of the paper is as follows. In Section II, we describe the basic Bayesian framework for model averaging and derive formal results for model-averaged expectation values. Section III specializes to the case of least-squares fitting and derives an approximate formula for the model probabilities which are needed for model averaging. Section III C gives a numerical example of model averaging applied to mock data. In Section IV, we discuss the common problem of data subset selection, reframing it as a model variation problem in order to apply model averaging. Section IV A shows an example application to the common task of fitting a two-point correlation function and demonstrates the effectiveness of model averaging as a replacement for choosing cuts on the data. We make some concluding remarks in Section V. A detailed discussion of bias correction in the estimation of model probability is given in Appendix A.
II. BAYESIAN FRAMEWORK
The basic analysis problem is as follows: we wish to describe a dataset D using a base model M_0 in order to determine the value of one or more common parameters {a_c}. In the example of a continuum extrapolation, M_0 could simply be the value of a specific matrix element in the continuum limit.

In estimating the parameters {a_c}, we are often led to consider several extensions of the base model M_0 which contain various unphysical or uninteresting terms, like lattice artifacts or undetermined excited states. This extends our study to a space of models {M}, but our basic interest is still in estimation of the parameters of M_0. All of the models in {M} must contain M_0, in the sense that marginalizing over additional parameters {a_m} reduces them to M_0. (Note that the set {a_m} implicitly depends on the choice of model M.)

It is important to note that the base model M_0 itself does not necessarily have to be contained in the set {M} of models that are actually fit to the data. As a simple example, in a continuum extrapolation it is certainly not necessary to include the continuum-only model M_0 (without any lattice spacing dependence) in the set of fits. Indeed, a continuum-only model would surely describe the data poorly in this example.

To obtain the marginal probabilities for the common parameters, we marginalize over both models and additional parameters [10]:

  pr(a_c | D) = \sum_M \int da_m \, \frac{pr(D | a, M) \, pr(a | M) \, pr(M)}{pr(D)},    (1)

where "pr" denotes a probability distribution, and the set of all parameters {a} is the union of {a_c} and {a_m}. In principle, this formula assumes that all parameters {a} are dimensionless. In practice, we will be interested in model weights which depend only on ratios of probabilities, so any units will tend to cancel.

In principle, this "master formula" contains all the required information.
If we can carry out the integrals and explicitly construct pr(a_c | D), then expectation values for arbitrary functions of the common parameters a_c are immediately available:

  \langle f(a_c) \rangle = \int da_c \, f(a_c) \, pr(a_c | D),    (2)

from which we can construct the standard mean, variance, and so forth. However, evaluating the integrals in the master formula is generally quite difficult, especially in the context of the complicated non-linear models appearing in lattice analyses. Direct Monte Carlo evaluation of such integrals is an intriguing option which deserves more attention.¹ Instead, consider the expectation value of f(a_c) within a single fixed model M:

  \langle f(a_c) \rangle_M = \int da_c \, f(a_c) \, pr(a_c | M, D)    (3)
  = \int da_c \, f(a_c) \, \frac{pr(a_c, M, D)}{pr(M, D)}    (4)
  = \int da_c \, f(a_c) \, \frac{pr(D | a_c, M) \, pr(a_c, M)}{pr(M | D) \, pr(D)}    (5)
  = \int da_c \, f(a_c) \, \frac{pr(D | a_c, M) \, pr(a_c | M) \, pr(M)}{pr(M | D) \, pr(D)}    (6)
  = \frac{1}{pr(M | D)} \int da_c \, f(a_c) \, pr(a_c, M | D).    (7)

But now if we marginalize the integral on the right-hand side over the space of models {M}, we just obtain the total, model-independent expectation value for f:

  \sum_M \int da_c \, f(a_c) \, pr(a_c, M | D) = \int da_c \, f(a_c) \, pr(a_c | D) = \langle f(a_c) \rangle.    (8)

Thus, we arrive at the relation

  \langle f(a_c) \rangle = \sum_M \langle f(a_c) \rangle_M \, pr(M | D).    (9)

This is the central formula of interest for purposes of model averaging. It shows that all of the moments of the fully combined PDF can be obtained as a weighted average over individual model information, with the weight factors given by the posterior probability pr(M | D) for each individual model.
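The weighted-average structure of Eq. (9) is straightforward to apply once per-model estimates and weights are in hand. The sketch below combines means and, anticipating the variance decomposition derived in Section II A, variances over models; all numerical inputs are invented for illustration.

```python
import numpy as np

def model_average(means, variances, weights):
    """Model-averaged mean and variance over a set of models.

    means[i], variances[i]: <a>_i and sigma^2_{a,i} from model M_i;
    weights[i]: posterior probabilities pr(M_i|D), assumed normalized.
    """
    means = np.asarray(means, float)
    variances = np.asarray(variances, float)
    weights = np.asarray(weights, float)
    mean = np.sum(weights * means)          # Eq. (9) with f(a) = a
    # weighted statistical variance plus the spread of means over models
    var = np.sum(weights * variances) + np.sum(weights * means**2) - mean**2
    return mean, var

m, v = model_average([1.8, 1.6], [0.01, 0.01], [0.75, 0.25])
```

When all models agree, the spread term vanishes and the common statistical variance is recovered unchanged.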
Due to its role in model averaging, we will refer to pr(M | D) interchangeably as the "posterior probability" or as the "model weight." The model weight itself can be expressed as an integral over the parameter space,

  pr(M | D) = \int da \, pr(M, a | D)    (10)
  = \int da \, \frac{pr(D | a, M) \, pr(a, M)}{pr(D)}    (11)
  = \int da \, \frac{pr(D | a, M) \, pr(a | M) \, pr(M)}{pr(D)}.    (12)

To fix the normalization of these probabilities, observe that the expectation of the unit operator \langle 1 \rangle should be unity independent of model choice, which gives the condition

  \sum_M pr(M | D) = 1.    (13)

Via Eq. (12), this condition fixes the normalization of \sum_M pr(M), apart from the overall normalization factor pr(D).

¹ Direct Monte Carlo evaluation may deserve particular attention in the context of lattice studies, in which much more complicated integrals are evaluated as a matter of course. This method seems to have been explored in Ref. [17]. However, we will not pursue this approach here.

The formulae presented so far are completely general. However, certain common choices used for the estimators of the likelihood function and other quantities can introduce bias. Any such biases should be corrected to guarantee convergence to correct results. We discuss an important bias correction in detail in Section III A below.

A. Estimation of model parameters with model averaging
It is instructive to consider what happens to the simple estimate of a model parameter under the model combination procedure. Suppose we are interested in the single parameter a, marginalized over a set of N models {M}. Using Eq. (9), we find for its mean

  \langle a \rangle = \sum_M \langle a \rangle_M \, pr(M | D)    (14)

and variance:

  \sigma_a^2 = \langle a^2 \rangle - \langle a \rangle^2    (15)
  = \sum_{i=1}^{N} \langle a^2 \rangle_i \, pr(M_i | D) - \left( \sum_{i=1}^{N} \langle a \rangle_i \, pr(M_i | D) \right)^2    (16)
  = \sum_{i=1}^{N} \sigma_{a,i}^2 \, pr(M_i | D) + \sum_{i=1}^{N} \langle a \rangle_i^2 \, pr(M_i | D) - \left( \sum_{i=1}^{N} \langle a \rangle_i \, pr(M_i | D) \right)^2.    (17)

This result for the variance also appears in the statistics literature [4], and has been used in the context of lattice calculations in [14, 16]. The first term is simply the weighted average of the statistical variance over all models. The remaining two terms can then be thought of as a "systematic error" contribution to the variance of a due to model choice. In the limit of equal model weights, i.e., pr(M_i | D) = 1/N, the latter contribution can be thought of as the variance over the space of models, since it reduces to the standard formula

  \sigma_{a, \rm syst}^2 = \frac{1}{N} \sum_{i=1}^{N} \langle a \rangle_i^2 - \frac{1}{N^2} \left( \sum_{i=1}^{N} \langle a \rangle_i \right)^2.    (18)

We note that in the general case, this is not the same as the variance computed from the set of weighted estimates w_i \equiv \langle a \rangle_i \, pr(M_i | D); such a weighted variance would contain an extraneous factor of pr(M_i | D) in the \langle a \rangle_i^2 term. We also note that taking the full width of the distribution of results \langle a \rangle_i, as done in e.g.
[18], will give an estimated systematic error strictly greater than \sigma_{a, \rm syst}, so that this procedure is a conservative error estimate.

It is illustrative to specialize to the case of considering only two models M_1, M_2, for which we have found the weights

  pr(M_1 | D) = 1 - p,    (19)
  pr(M_2 | D) = p.    (20)

Suppose now that model 1 is strongly favored by the data, so p is small. Expanding the expectation values above to first order in p, we find:

  \langle a \rangle = \langle a \rangle_1 + ( \langle a \rangle_2 - \langle a \rangle_1 ) p,    (21)
  \sigma_a^2 \approx \sigma_{a,1}^2 + \left[ \sigma_{a,2}^2 - \sigma_{a,1}^2 + ( \langle a \rangle_2 - \langle a \rangle_1 )^2 \right] p.    (22)

In the limit p → 0, we recover the results of model M_1, as expected. For small but non-zero p, the corrections to the mean and variance of a due to including M_2 are likely to be small, but it is clear that this depends on how large the difference between the estimates from M_1 and M_2 is.

III. LEAST-SQUARES FITTING
The discussion so far has been completely general with regards to the form of the probability distributions appearing. We now specialize to the most common usage case in the context of lattice simulations, namely least-squares regression of a model M to some data set D. The likelihood function pr(D | a, M) is taken to be

  pr(D | a, M) = \prod_{i=1}^{N} \frac{1}{(2\pi)^{d/2} (\det \Sigma)^{1/2}} \exp\left[ -\frac{\chi_i^2}{2} \right],    (23)

where

  \chi_i^2 \equiv ( y_i - f_M(a) )^T \Sigma^{-1} ( y_i - f_M(a) )    (24)

is the standard chi-square goodness of fit for the data sample y_i; we assume the samples are drawn independently from some underlying distribution. Here d denotes the dimension of a single observation vector y_i, and N is the number of independent observations drawn from the true distribution. The matrix \Sigma is the covariance matrix between the y_i.

For the prior distribution, it is standard (and typically justified by the principle of maximum entropy [10, 19]) to adopt a multivariate Gaussian form,

  pr(a | M) = \frac{1}{(2\pi)^{k/2} (\det \tilde\Sigma)^{1/2}} \exp\left[ -\frac{1}{2} ( a - \tilde a )^T \tilde\Sigma^{-1} ( a - \tilde a ) \right]    (25)
  = \prod_{x=1}^{k} \frac{1}{\sqrt{2\pi} \, \tilde\sigma_x} \exp\left( -\frac{( a_x - \tilde a_x )^2}{2 \tilde\sigma_x^2} \right),    (26)

where k is the number of fit parameters in model M and \tilde\Sigma is the prior covariance matrix. The second formula holds for the simplified case where the prior parameter covariance matrix \tilde\Sigma is diagonal with entries \tilde\sigma_x^2.

The ordinary least-squares likelihood pr(D | a, M) is normalized by the data covariance matrix and some factors of (2\pi). Since we are considering only the case of a fixed data set, this overall normalization is the same for all models and can be ignored here. On the other hand, it is important to include the normalization of the prior distribution pr(a | M), which differs for models with different numbers of parameters.

The best-fit point a* maximizes the above likelihood or, equivalently, minimizes the negative log-likelihood function

  -2 \log( pr(D | a, M) \, pr(a | M) ) = \chi^2 + \chi_p^2 \equiv \chi_N^2
    (27)

with \chi_p^2 denoting the Gaussian prior contribution and the combination of terms defining the "augmented chi-squared" function [19].

A. Bias correction of the model weights

Estimator bias occurs when a sample estimator differs from the "true" underlying population value it is approximating. A common example of such a bias occurs in the naive estimator of the variance for data drawn from an underlying Gaussian distribution. In this example, the bias is of order 1/N, where N is the sample size, which means that it will be automatically removed in the limit of large N.

However, there are many examples of asymptotic biases which do not vanish as N → ∞. The sample maximum likelihood estimate (MLE) of the log-likelihood function, \chi_N^2(a*), suffers from such an asymptotic bias. Roughly speaking, because the MLE maximizes the sample log-likelihood, it tends to overshoot the true asymptotic value. Fundamentally, this bias arises not from the choice of the MLE but rather from finite-sample-size fluctuations in the data itself.

To ensure that the results at finite sample size N converge to the asymptotic model weight, we define the bias-corrected model weight to be [20–23]

  pr(M | D)_{BC} = \exp\left( -\mathrm{tr}[ J_N^{-1}(a*) \, I_N(a*) ] \right) \times \int da \, \frac{pr(D | a, M) \, pr(a | M) \, pr(M)}{pr(D)},    (28)

where I_N and J_N are the sample estimates of the log-likelihood Fisher information matrix and the (negative) Hessian matrix, respectively. These matrices are defined in Appendix A, which also gives an informal derivation of the bias correction term, ∼ tr[ J_N^{-1}(a*) I_N(a*) ].

When the model is linear in the parameters a, it is straightforward to show that I and J are identical, so that J_N^{-1}(a*) I_N(a*) →
𝟙, the k × k identity matrix. In the more general nonlinear case, this well-known relationship between J_N and I_N still holds asymptotically² [21, 22]. In either case, tracing over the identity matrix simply counts the total number of parameters k in the model.

The appearance of the structure tr[ J^{-1} I ] is closely related to the Akaike information criterion [24, 25] and its generalization, the Takeuchi information criterion [20]. We discuss this connection more below.

B. Gaussian approximation
By construction, the sample likelihood pr(a | D) is locally maximized at the best-fit parameter values a*. Taylor expansion about the best-fit point gives

  \chi_N^2(a) \approx \chi_N^2(a*) + \frac{1}{2} ( a - a* )^T H_N^* ( a - a* ) + \ldots,    (29)

where H_N^* is the standard Hessian matrix evaluated at the best-fit point,

  H_{N,xy}^* \equiv \left. \frac{\partial^2 \chi_N^2}{\partial a_x \partial a_y} \right|_{a = a*} = 2 J_{N,xy}(a*).    (30)

² Strictly speaking, this relationship will hold asymptotically assuming that the "true model" from which the data are drawn can be described by the model M being fit. For lattice applications where we typically have a strong physical basis for the models that we use, this assumption is likely to hold.

Within this approximation, the integral for the model weight becomes Gaussian and can be evaluated analytically:

  pr(M | D) = \int da \, \frac{pr(D | a, M) \, pr(a | M) \, pr(M)}{pr(D)}    (31)
  \approx \int da \, \frac{1}{(2\pi)^{k/2} (\det \tilde\Sigma)^{1/2}} \, e^{ -\chi_N^2(a*)/2 - \frac{1}{4}( a - a* )^T H_N^* ( a - a* ) } \, \frac{pr(M)}{pr(D)}    (32)
  = \frac{pr(M)}{pr(D)} (2\pi)^{-k/2} (\det \tilde\Sigma)^{-1/2} \left[ (2\pi)^{k/2} \, e^{-\chi_N^2(a*)/2} \, ( \det( H_N^*/2 ) )^{-1/2} \right]    (33)
  = \frac{pr(M)}{pr(D)} (\det \tilde\Sigma)^{-1/2} \, e^{-\chi_N^2(a*)/2} \, ( \det( H_N^*/2 ) )^{-1/2}.    (34)

Thus, neglecting the term pr(D) which is the same for all models, one finds the following form for the log-likelihood

  -2 \log( pr(M | D) ) \approx -2 \log( pr(M) ) + \chi_N^2(a*) + \log \det \tilde\Sigma_M + \log \det( H_N^* / 2 )
    (35)
  = -2 \log( pr(M) ) + \chi_N^2(a*) + \log \det \tilde\Sigma_M - \log \det \Sigma_N^*,    (36)

where \Sigma_N^* \equiv ( H_N^*/2 )^{-1} is the usual model best-fit covariance matrix, i.e., the predicted covariance of the fit parameters a. Including the bias correction term introduced in Eq. (28) gives the overall result

  -2 \log( pr(M | D)_{BC} ) \approx \mathrm{GAP}_M,
  \mathrm{GAP}_M \equiv -2 \log( pr(M) ) + \chi_N^2(a*) + \log \det \tilde\Sigma_M - \log \det \Sigma_N^* + 2 \, \mathrm{tr}[ J_N^{-1}(a*) \, I_N(a*) ],    (37)

where we introduce the notation "GAP," standing for "Gaussian approximate posterior." If the model is linear in the parameters a, then the GAP formula is in fact exact. For nonlinear regression, the approximation is expected to improve as the sample size of the data D grows. From this point forward, we adopt the bias-corrected form as the default choice of pr(M | D) unless explicitly stated otherwise, dropping the "BC" subscript.

From Eq. (37), we see that the posterior probability pr(M | D) encapsulates the principle of Occam's razor: models with large \chi_N^2(a*) are penalized, but so are models which have a large number of free parameters. (The final bias-correction term reduces to 2k asymptotically, where k is the number of parameters.)

Unfortunately, in the presence of any empirical priors and/or models with differing dimensions, an effect known as the Jeffreys-Lindley paradox [6, 8, 9] leads to outsized dependence on the prior widths that can severely distort the overall results. The presence of the paradox is linked closely to the use of "empirical priors" which are not based on true prior information and can be taken to be arbitrarily wide. Note that for estimation of the best-fit parameters a* in a fixed model there is no such distortion, and the use of empirical priors is not problematic.
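For concreteness, the GAP expression of Eq. (37) involves only quantities already produced by a standard augmented least-squares fit. The sketch below is our own illustration (the function name and defaults are not from the paper); the trace term defaults to its asymptotic value k.

```python
import numpy as np

def gap(chi2_min, prior_cov, fit_cov, k, log_pr_model=0.0, bias_trace=None):
    """Gaussian approximate posterior, Eq. (37), up to the constant pr(D).

    chi2_min:   chi^2_N evaluated at the best-fit point a*;
    prior_cov:  prior covariance matrix ~Sigma_M;
    fit_cov:    best-fit parameter covariance Sigma*_N;
    k:          number of fit parameters;
    bias_trace: tr[J_N^-1 I_N]; defaults to k, its asymptotic value.
    """
    if bias_trace is None:
        bias_trace = k
    # slogdet avoids overflow for wide priors / many parameters
    _, logdet_prior = np.linalg.slogdet(np.atleast_2d(prior_cov))
    _, logdet_fit = np.linalg.slogdet(np.atleast_2d(fit_cov))
    return (-2.0 * log_pr_model + chi2_min
            + logdet_prior - logdet_fit + 2.0 * bias_trace)

val = gap(10.0, np.eye(2), np.eye(2), k=2)
```

When the prior and posterior covariances coincide the determinant terms cancel, and GAP reduces to the AIC discussed next; the example above returns 10 + 2·2 = 14.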
However, when the normalization of the likelihood is important, as in the estimation of pr(M | D), the paradox can lead to nonsensical results if the prior width is taken to be extremely large.

Instead of attempting to confront the Jeffreys-Lindley paradox head-on, we will instead argue that the effect of the ( \log \det \tilde\Sigma_M - \log \det \Sigma_N^* ) terms vanishes asymptotically and so can be viewed as a 1/N effect in the sample size N. Working in a scenario without true priors, we may adopt a cross-validation procedure as follows: we partition our full data set D into a small "training set" D_T and the remaining data D_T^c. A given model is first fit to the training set D_T to determine a set of fit parameters a_T. The results for a_T are then used to fix priors on a for the fit to the remaining data D_T^c.

In practice, the cost of this procedure is that we must "use up" the data in D_T to fix priors, reducing the overall statistical precision of our analysis. However, in the asymptotic limit N → ∞, both partitions { D_T, D_T^c } approach the true asymptotic distribution and will yield exactly the same fit results. Therefore, in this limit \tilde\Sigma_M → \Sigma_N^* and the determinant terms in GAP_M cancel completely.

In the same N → ∞ limit, the bias-correction term reduces to twice the number of parameters, and we recover the well-known Akaike information criterion (AIC) [24, 25]:
  \mathrm{GAP}_M \xrightarrow{ N \to \infty, \ \mathrm{CV} } \mathrm{AIC}_M = -2 \log( pr(M) ) + \chi_N^2(a*) + 2k,    (38)

where "CV" denotes the aforementioned cross-validation procedure to remove the determinant terms. More generally, simply dropping the determinant terms from Eq. (37) gives the Takeuchi information criterion (TIC) [20]. However, even for non-linear models the TIC approaches the AIC in the N → ∞ limit which we are considering, as discussed in Section III A. We therefore advocate the use of the AIC for general model averaging purposes.

C. Practical example: polynomial data
To demonstrate the method and some key features, we consider a simple toy model. We first specify a quadratic "model truth" polynomial function,

  F(x) = 1.8 - c_1 (x/L) + c_2 (x/L)^2,    (39)

where c_1 and c_2 are fixed positive constants and L sets the scale of the x-axis. A set of N mock data samples is generated on a fixed grid of x values by taking the model truth at each point and adding uncorrelated Gaussian noise with mean zero and standard deviation σ = 1.0.
The resulting mock data are plotted in Fig. 1 (top panel) for the choice N = 160; we will also study the N-dependence.

We take as our space of model functions polynomials labeled by their degree m,

  f_m(x) = \sum_{j=0}^{m} a_j (x/L)^j,    (40)

with 0 ≤ m ≤ 5. We take the flat prior pr(m) = 1/6, corresponding to minimal prior knowledge about the functional form of the model (except that it is polynomial). All parameter priors are taken to be Gaussian with mean 0 and width 10 (the parameters are, essentially, unconstrained).

The results of the fits to individual models as well as the model-averaging results are shown in Fig. 1 (bottom panel) and in Table I. Note that although all models with m ≥ 2 describe the data comparably well in terms of \chi_N^2, the bias-corrected model probability estimated through the AIC places relatively more weight on the simpler models, with the maximum probability assigned to the correct choice m = 2. The model-averaged result is
Figure 1: Top: synthetic data (green points) for the given quadratic "model truth" function (black curve) plus Gaussian noise, with noise sample size N = 160. Bottom: fit results for degree-m polynomial models (blue circles), compared to the known value a_0 = 1.8, starting from m = 0. The lower inset shows the standard p-value (blue dashed line) and the model weight calculated from the AIC (orange solid line). Comparison of these curves shows the "Occam's razor" effect, with the AIC penalizing fits with roughly equal goodness of fit but more fit parameters.

consistent with model truth and has slightly larger uncertainty than the individual fit to the correct model m = 2.

As noted, the results so far use a fixed sample size of N = 160. We repeat the test as described above with several values of N starting from N = 20, plotting the resulting estimates of a_0 from various procedures in Fig. 2. The result of the model averaging procedure using the AIC is seen to be consistent with model truth in all cases, with an error that compares favorably to single-model fits using the known true quadratic model and is uniformly smaller than the more conservative procedure of taking the full variation of the mean over all models with p-value above a fixed threshold.

           m = 0     m = 1       m = 2      m = 3      m = 4      m = 5
  a_1        —     -0.170(72)  -0.96(30)  -1.42(81)  -1.1(1.2)  -0.9(1.3)
  a_2        —         —       -0.8(1.2)   1.9(6.4)   1.3(6.8)
  a_3        —         —           —      -1.4(3.2)   0.3(7.3)
  a_4        —         —           —          —      -1.0(3.9)
  p-value   0.01      0.04       0.22       0.24       0.25       0.25
  pr(M|D)   0.01      0.04       0.56       0.25       0.10       0.04

Table I: Individual best-fit results and associated quantities for N = 160; "—" marks parameters absent from a given model, and pr(M | D) is estimated from the AIC. The model-averaged value for the intercept is consistent with the model truth a_0 = 1.8.

Omitting the bias-correction term and averaging using only the \chi_N^2 results to estimate model probability (the "naive" estimate) also tends to give slightly larger error than model averaging using the AIC, but the results remain consistent with the correct answer.
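The polynomial test can be reproduced in a few lines. In the sketch below the slope and curvature of the model truth, the x grid, and the random seed are stand-ins of our own choosing (only the intercept a_0 = 1.8, N = 160, and σ = 1 follow the text); the fit is performed on the sample mean, whose error is σ/√N.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(32.0)
L = x.max()
truth = 1.8 - 0.5 * (x / L) + 0.1 * (x / L) ** 2   # stand-in c1, c2
N, sigma = 160, 1.0

err = sigma / np.sqrt(N)                   # error on the sample mean
ybar = truth + err * rng.normal(size=x.size)

aic, a0 = [], []
for m in range(6):                          # degrees 0..5, flat prior pr(m) = 1/6
    coeffs = np.polyfit(x / L, ybar, m, w=np.full(x.size, 1.0 / err))
    chi2 = np.sum(((ybar - np.polyval(coeffs, x / L)) / err) ** 2)
    aic.append(chi2 + 2 * (m + 1))          # Eq. (38) with k = m + 1 parameters
    a0.append(coeffs[-1])                   # intercept: the common parameter

aic = np.array(aic)
w = np.exp(-0.5 * (aic - aic.min()))        # pr(M|D) proportional to exp(-AIC/2)
w /= w.sum()
a0_avg = float(np.sum(w * np.array(a0)))
```

The weights w implement the Occam penalty automatically: higher-degree fits match the data about as well but pay 2 units of AIC per extra parameter.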
In the absence of the bias correction, the likelihood of the models with m > 2 is overestimated. However, because these models contain the m = 2 model within their parameter space, they tend to estimate the correct value of the intercept a_0 on average. As a result, there is no bias introduced into the mean result for a_0 when using biased model probabilities in this particular example. However, the error bar is slightly overestimated when using the naive estimator.

IV. DATA SUBSET SELECTION AS A MODEL VARIATION PROBLEM
A routine part of modeling lattice Monte Carlo data is data subset selection, i.e., choosing a "cut" on the data beyond which the model is not applied. A canonical and simple example is fitting a two-point correlation function C(t) to extract the ground-state energy. The full model expected to describe this correlation function involves an infinite tower of exponentials, \sum_{i=0}^{\infty} A_i e^{-E_i t}. In practice, one truncates the sum after a finite number of terms and then selects a minimum value t_min below which the data are simply ignored. Choosing the precise value of t_min is generally done by hand.

Although this process is typically thought of as a data selection problem, it can easily be reformulated as a model selection problem. In the previous example, the justification for ignoring data below t_min is partially one of expediency. If we are only interested in the first few states, it suffices to look at times with t ≥ t_min where they dominate. Times below t_min will be heavily contaminated by contributions from the higher excited states, and little to no information about the first few states is lost by ignoring them.

Based on this observation, we can define a joint model that describes the full data set. First, select a subset of the data and fit the model of choice M to this subset as usual. Second, fit the remaining data to a "perfect" model with zero degrees of freedom. For example, we could take the "perfect" model to be a polynomial with degree equal to N_cut, but in principle other functional forms will also work. Because the "perfect" model has
Figure 2: Scaling of various estimates of the intercept a_0 vs. the data sample size N. The true value (dashed line) is a_0 = 1.8.
The blue circles (model average using the AIC) show good consistency with both the model truth and with the estimates using the correct quadratic model form (red squares). Taking the full-width difference between all models with fit p-value greater than a fixed threshold (the "full-width systematic") gives a substantially more conservative error estimate. Averaging with a naive estimate of pr(M | D) which omits the bias-correction term (silver triangles) does not directly lead to bias in the estimation of a_0, but also gives slightly larger uncertainty due to overweighting of more complicated models, as discussed in the text.

zero degrees of freedom, there exists a solution for its parameters for which the differences between the model and the sample means vanish exactly.

To give an explicit construction, we first define a partition P of the data vectors into y_i = ( y_i^{cut}, y_i^{keep} ), where y_i^{keep} are the subset to be modeled and y_i^{cut} are the cut data. We then define the corresponding partitioned model g_M(a, P) as

  y_i - g_M(a, P) = \begin{cases} y_i - \bar y^{cut}, & y_i \in y_i^{cut} \\ y_i - f_M(a), & y_i \in y_i^{keep}, \end{cases}    (41)

where \bar y^{cut} is the sample average of the y_i^{cut}. The partition-dependent log-likelihood is then, dropping constant terms that do not change with fixed data set D,

  -2 \log pr(D | a, M) = \sum_{i=1}^{N} \chi_i^2(P) = \sum_{i=1}^{N} ( y_i - g_M(a, P) )^T \Sigma^{-1} ( y_i - g_M(a, P) )    (42)
  = \sum_{i=1}^{N} ( y_i^{keep} - f_M(a) )^T \Sigma_P^{-1} ( y_i^{keep} - f_M(a) ) + \mathrm{(const)},    (43)

where \Sigma_P^{-1} is the submatrix of the full inverse data covariance matrix \Sigma^{-1} which corresponds to the data subset y_i^{keep}. All other terms involving the cut data contain the expression \bar y^{cut} - g_M(a, P) at least once and therefore vanish identically by construction, even in the presence of off-diagonal correlations between y_i^{keep} and y_i^{cut}.

Since matrix inversion does not generally commute with subspace projection, the matrix \Sigma_P^{-1} typically differs from ( \Sigma_P )^{-1}, the inverse of the covariance sub-matrix.
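The statement that inversion and projection do not commute is easy to check numerically. The toy 3×3 covariance below is invented for illustration: index 0 is "cut" and indices 1, 2 are kept.

```python
import numpy as np

# Invented covariance with both short- and long-range correlations.
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.5],
                  [0.2, 0.5, 1.0]])
keep = [1, 2]                                            # the kept data subset

Sigma_inv_P = np.linalg.inv(Sigma)[np.ix_(keep, keep)]   # project the inverse
Sigma_P_inv = np.linalg.inv(Sigma[np.ix_(keep, keep)])   # invert the projection

diff = np.abs(Sigma_inv_P - Sigma_P_inv).max()           # nonzero: the two differ
```

The difference is controlled by the correlations between kept and cut entries (here 0.5 and 0.2): by the Schur complement, the projected inverse equals ( \Sigma_P - \Sigma_{kc} \Sigma_{cc}^{-1} \Sigma_{ck} )^{-1}, which reduces to ( \Sigma_P )^{-1} only when those cross-correlations vanish.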
However, in practical lattice applications ( \Sigma_P )^{-1} is often used as an approximation to \Sigma_P^{-1}; the difference between these matrices is suppressed whenever long-range (i.e., further off-diagonal) correlations are smaller than short-range ones. An obstruction to using \Sigma_P^{-1} directly is that finite-sample estimates of the full inverse covariance matrix \Sigma^{-1} are typically ill-conditioned. Therefore, in what follows we will use ( \Sigma_P )^{-1}.

The result of this construction is that the contribution from the "perfect" model describing the data outside the chosen subset is \Delta\chi^2 = 0. However, there remains a bias-correction term which accounts for the N_cut additional model parameters used to describe the cut data y_i^{cut}. Thus, the overall model probability for the joint model is easily seen to be obtained from the modified expression
  \mathrm{AIC}_{M, N_{cut}} = -2 \log( pr(M) ) + \chi_N^2(a*) + 2k + 2 N_{cut},    (44)

where \chi_N^2(a*) is evaluated only for the model M and for data within our selected subset. In other words, in practice we do not need to construct the "perfect" model at all; we simply need to count the number of points cut away in our subset selection to obtain the correct model weight.

What about the covariance terms in Eq. (37) involving the "perfect" model parameters? Because the parameters of the "perfect" model are uniquely determined by the mean values of the data points alone, we can impose a prior on the perfect model with infinitesimal prior width \epsilon. So long as \epsilon is much smaller than the error on the removed data points themselves, it will dominate the posterior covariance matrix for those model parameters as well, so that

  \log \det \tilde\Sigma_{M, \rm perfect} - \log \det \Sigma_{N, \rm perfect}^* \approx k \log \epsilon - k \log \epsilon = 0.    (45)

The bias-correction term is also exactly 2 N_{cut} in this case, because the "perfect" model can always be constructed to be linear in the parameters. In other words, the use of the AIC to compute the contribution of the cut data to the relative model weights is exact and not an asymptotic approximation.

Although we were motivated by the example of a two-point correlator where the data are cleanly divided into two subsets along a single dimension, the argument above holds for arbitrarily complex subdivisions of the full data set. Whatever subset of the data we choose to fit explicitly to model M, the joint model which also describes the remaining data will contribute an additional factor of 2 N_{cut} to the information criterion. We can also consider a set of models {M} and perform ordinary model averaging over the joint space defined by {M} and the parameters that uniquely define a data subset.

A. Practical example: Synthetic correlation functions
To test the data subset selection procedure, we set up another toy-model example resembling a two-point correlation function, following the example description above. (We note in passing that if we fit a single model M and only vary N_cut, then the full GAP formula Eq. (37) may be used without running afoul of the Jeffreys-Lindley paradox, since the prior covariance Σ̃_M is held fixed and therefore cancels completely. As we have argued that the influence of the covariance determinant terms is a finite-sample-size effect, we will nevertheless adopt the simpler AIC prescription for our numerical tests to follow.) The "model truth" is a sum of two exponential states,

F(t) = 2.0 e^{−0.8 t} + 10.0 e^{−E₂ t},   (46)

with ground-state energy E = 0.8 and a second state of energy E₂ > E. To generate synthetic data, we add correlated Gaussian noise η(t) with mean zero and variance 0.09. The noise is added fractionally to the data, i.e., the synthetic data are generated according to the formula y(t) = F(t)(1 + η(t)). The correlation matrix of the noise takes the form ρ_{t,t′} = ρ^{|t−t′|}, i.e. equal to 1 on the diagonal and dying off as a power law as the time separation between points increases, similar to a real QCD correlation function. We fix the numerical correlation coefficient ρ to a constant value less than 1. N mock data samples are generated over a fixed range of t.

Additional trials in which the above parameters were varied were also carried out, including using uncorrelated Gaussian noise instead of correlated noise. No qualitative difference in the outcome of the tests was observed with these variations.

For this test, we consider a single model which consists of a single exponential term,

f(t) = A e^{−E t}.   (47)

This model is fit to all data in the range [t_min, t_max], varying t_min from 1 to 28, with the goal of using model averaging with Eq. (44) to obtain a combined result for the ground-state energy E.

The results of four independent trials following the above procedure with N = 500 are shown in Fig. 3. Excited-state contamination, i.e.,
the influence of the second exponential state which is not present in our fit model, is clearly visible at low t_min. In each case, excellent agreement of the model-averaged result with model truth is seen. As in the polynomial example, the bias-corrected model probability is seen to weight simpler models more strongly, which in this case means favoring fits that cut away less of the data.

In Fig. 4, we repeat the above exercise while varying the sample size N, once again over a range beginning at N = 20, and compare the model average to a fit at fixed t_min; the latter has an unaccounted-for systematic error due to model truncation. (Indeed, if we do not adjust t_min as N → ∞, we expect the result for E to become incompatible with the correct ground-state energy, as the contamination from the second state will eventually be resolved in a large enough sample.)

On the other hand, the error on the model-averaged result is generally much smaller than the more conservative full-width estimate, resulting from taking the full variation of mean results over all models with p-value above a fixed threshold. Use of the naive model probability (i.e., omitting the 2N_cut contribution to the model weight) causes a drastic inflation of the error in the averaged result for E. Once again, the interpretation is that omitting the bias correction causes overestimation of the likelihood for the more complicated models, in this case models with larger t_min. These models generally give results for E consistent with the correct answer, but with much larger errors, and the model-averaged result is altered accordingly.
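The synthetic-data test above can be sketched concretely in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the excited-state energy (1.6), the restriction to uncorrelated noise, and the t_min scan range are our own choices for the sketch, and all helper names are ours.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Illustrative "model truth": ground state E = 0.8 plus one heavier state.
# The excited-state energy 1.6 is an assumption made for this sketch.
t = np.arange(1, 26)
F = 2.0 * np.exp(-0.8 * t) + 10.0 * np.exp(-1.6 * t)

# N samples with fractional Gaussian noise (uncorrelated, for simplicity)
N = 500
samples = F * (1.0 + 0.3 * rng.standard_normal((N, t.size)))
y = samples.mean(axis=0)
sigma = samples.std(axis=0, ddof=1) / np.sqrt(N)

def single_exp(t, A, E):
    return A * np.exp(-E * t)

results = []  # (E, err, AIC) for each choice of t_min
for t_min in range(1, 16):
    sel = t >= t_min
    popt, pcov = curve_fit(single_exp, t[sel], y[sel], sigma=sigma[sel],
                           absolute_sigma=True, p0=[2.0, 0.8])
    chi2 = np.sum(((y[sel] - single_exp(t[sel], *popt)) / sigma[sel]) ** 2)
    n_cut = int(np.sum(~sel))          # points removed by the subset cut
    aic = chi2 + 2 * 2 + 2 * n_cut     # Eq. (44), k = 2, flat model prior
    results.append((popt[1], np.sqrt(pcov[1, 1]), aic))

E_vals, E_errs, aics = map(np.array, zip(*results))
w = np.exp(-0.5 * (aics - aics.min()))
w /= w.sum()
E_avg = np.sum(w * E_vals)
E_var = np.sum(w * E_errs**2) + np.sum(w * E_vals**2) - E_avg**2
print(f"model-averaged E = {E_avg:.3f} +/- {np.sqrt(E_var):.3f}")
```

Fits at small t_min pick up large χ² from the excited state and are exponentially suppressed in the weight, while fits at large t_min are penalized by the 2N_cut term, so the average is dominated by intermediate windows.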
Figure 3: Fit results for the ground-state energy with true value E = 0.8, plotted versus t_min (blue points). The model-averaged result (red open square) shows good agreement with model truth in all cases. The lower inset shows the standard p-value (blue dashed line) and the model weight calculated from Eq. (44) (orange solid line). The four subfigures represent four random draws of correlated Gaussian noise, but are otherwise identical.
Figure 4: Scaling of various estimates of the ground-state energy E vs. the data sample size N. The true value (dashed line) is E = 0.8. The blue circles (model average using the AIC) show good consistency with the model truth and generally comparable error to fitting with fixed t_min = 14 (red squares). Using the full-width difference between all models with fit p-value above a fixed threshold gives a much more conservative systematic error estimate. The naive model probability pr(M|D) which omits the bias-correction term (silver triangles) also leads to significantly larger uncertainty due to overweighting of more complicated models as discussed in the text.

B. Practical example: QCD correlation functions
1. Masses from a two-point correlation function
We now consider the example of model averaging applied to a pion two-point correlation function from a real lattice QCD calculation. This correlator has been used in published work by the Fermilab Lattice and MILC collaborations [26]. The underlying gauge-field ensemble has a lattice spacing of a ≈ 0.09 fm and a pion mass of about 215 MeV. In this example, the correlation function was constructed using staggered fermions and corresponds to a pion with energy E_π ≈ 300 MeV.

The results of our procedure are shown in Fig. 5. The top pane shows the effective mass and the final result of model averaging, which is consistent with by-eye expectations. The oscillations in the effective mass are a familiar feature of staggered two-point correlation functions with nonzero momentum. The middle pane shows intermediate results from individual fits. The green band indicates the model-averaged result for the points shown. For clarity of presentation, the results in the middle pane are from fits with (1 + 1) states only, i.e., 1 decaying state and 1 oscillating state. We also tried fits including (2 + 2), (3 + 3), and (4 + 4) states. The only qualitative difference from including more states is that good fits are obtained for smaller t_min. The model average is unchanged. The bottom pane shows the model weights for the fits with (1 + 1) states. As expected, the weights peak in the middle and taper off at both ends. When t_min is small, the fit quality rapidly declines due to contributions from excited states. When t_min is large, Eq. (44) disfavors cutting too aggressively. The model-averaged result agrees with intermediate results that went into the analysis of Ref. [26] to better than 1σ [27].
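The generic functional form of the (n + n)-state staggered fit model, with n ordinary decaying states and n opposite-parity states entering with an alternating sign, can be sketched as below. This is a schematic of our own construction: a production fit would also include backward-propagating cosh/sinh terms from the finite time extent, priors, and so on, which we omit here.

```python
import numpy as np

def staggered_2pt(t, amps, energies, osc_amps, osc_energies):
    """(n + n)-state staggered two-point model: n decaying states plus n
    opposite-parity states entering with an alternating sign (-1)^t.
    A minimal sketch of the functional form only."""
    t = np.asarray(t)
    out = sum(A * np.exp(-E * t) for A, E in zip(amps, energies))
    out = out + (-1.0) ** t * sum(
        A * np.exp(-E * t) for A, E in zip(osc_amps, osc_energies))
    return out
```

With one state in each tower, (1 + 1), this reproduces the oscillating effective mass visible in the top pane of Fig. 5; adding more pairs of states extends the model toward smaller t_min.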
2. Matrix elements from three-point correlation functions
Model averaging also shows promising results for the extraction of more complicated matrix elements. In this example, we test the extraction of a K → π transition matrix element ⟨π|J|K⟩, again using correlation functions that were part of published work by the Fermilab Lattice and MILC collaborations [26]. In this case, the underlying gauge-field ensemble has a lattice spacing of about 0.12 fm, with pion and kaon masses of about 220 and 515 MeV, respectively. Staggered fermions were also used for these correlators. The calculation occurs in the rest frame of the kaon. The pion momentum has been adjusted to be near the point of zero recoil, q² ≡ (p_K − p_π)² ≈ 0. The methodology for extracting these matrix elements is complicated but relatively standard within the lattice community. The desired matrix element is the result of a joint correlated fit to two- and three-point functions. In order to visualize the result, it is standard to construct a ratio R(t, T) of two- and three-point functions whose asymptotic plateau is proportional to the bare lattice matrix element. Here T denotes the location of the sink operator which couples to the kaon. In conducting such a fit, the analyst is faced with several choices: the number of states in the pion channel (n + n), the number of states in the kaon channel (m + m), and the fit window t ∈ [t_min, T − t_min]. We refer the reader to Ref. [26] for additional details about fits like these.

Fig. 6 shows the result of model averaging for the matrix element. We adopt a flat prior model weight pr(M) = C for all choices of (m + m) and (n + n), which drops out of the model average. The top pane shows the ratio R(t, T) for two different sink locations alongside the result of model averaging. The middle pane shows intermediate fit results and the model average. The particular choices made for each of the fits are displayed along the horizontal axes, with the fit window displayed on top and the number of states on the bottom. For instance, the leftmost point used (n + n) = (3 + 3) states for the pion channel, (m + m) = (3 + 3) states for the kaon channel, and a fit window t ∈ [3, T − 3]. The model-averaged result agrees to better than 1σ with the published result of Ref. [26].

This example suggests another important application of the framework we are describing. A complete analysis of a lattice matrix element might consider a more general set of fits, e.g., with different numbers of decaying and oscillating states in each channel or with different t_min cuts for the source and sink. Scanning over all possibilities can easily produce tens or hundreds of individual fit results.
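To make the combinatorics concrete, the model space for a joint fit like this can be enumerated mechanically; even a modest grid of discrete choices multiplies out quickly. The ranges below are illustrative choices of ours, not the grid used in the text.

```python
from itertools import product

# Illustrative enumeration of the model space for the joint K -> pi fit:
# (n + n) pion states, (m + m) kaon states, and a symmetric fit window
# [t_min, T - t_min]. The ranges are our own example values.
n_states = [1, 2, 3, 4]       # pion channel: (n + n)
m_states = [1, 2, 3, 4]       # kaon channel: (m + m)
t_min_values = [3, 4, 5, 6]

models = [
    {"n": n, "m": m, "t_min": t_min}
    for n, m, t_min in product(n_states, m_states, t_min_values)
]
print(len(models))  # 64 fit variations from just three discrete choices
```

Each entry would be handed to the fitter in turn, with the resulting weights computed from Eq. (38) or Eq. (44) and combined in the model average.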
Finding an objective selection criterion for choosing a best-fit result can be difficult. The model weights in Eq. (37) and Eq. (38) offer a potential solution to this problem, particularly when used in conjunction with expert knowledge and the usual careful thinking.

V. CONCLUSION
We have presented a Bayesian approach to the problem of model averaging. The statistical methods we describe apply very generally, though our examples have focused on
Figure 5: Model averaging results for a pion two-point correlation function using staggered fermions. Top: the effective mass and the final result of model averaging; the oscillating contributions are from the opposite-parity states associated with staggered fermions. Middle: individual fits together with the result of model averaging. Bottom: model weights for the individual fits.

Figure 6: Model averaging results for a matrix element associated with a K → π transition form factor. Top: a ratio of two- and three-point correlation functions (whose plateau is proportional to the matrix element ⟨π|J|K⟩) and the final result of model averaging. Middle: individual fits together with the result of model averaging. Bottom: model weights for the individual fits.
practical problems in lattice gauge theory. The context for regression problems is rather exceptional in many lattice studies, since the models often rest on firm theoretical foundations. For instance, multi-exponential fits to correlators are based on the spectral decomposition, which only requires the existence of a positive-definite transfer matrix. Effective field theory governs extrapolations to the chiral, continuum, or heavy-mass limit. If a model fails to describe the data, the simulation itself is rightfully viewed with additional scrutiny. Hypothesis testing is typically less important than reliably extracting the values of parameters capturing the physics of interest. When predictions from nested models (say, the NLO versus NNLO predictions from effective field theory) differ slightly, it is important to be able to produce a final number with associated statistical and systematic uncertainties. Bayesian model averaging is an attractive approach to problems like these.

Two key practical results are the model-averaged mean and variance, Eq. (14) and Eq. (17). Both formulae rely on the model weights pr(M|D). In general, the model weights are defined through complicated integrals. However, analytic results are available in the Gaussian approximation, which is exact for linear least-squares fitting. For nonlinear least-squares fitting, the approximation is expected to become increasingly good for larger data sets.

The full GAP result of Eq. (37) involves the prior and posterior covariance matrices and is not free from the well-known Jeffreys-Lindley paradox. From a pragmatic perspective, we argued that these terms can be viewed as a finite-sample-size effect and be ignored using arguments from cross validation. Eq. (38) is the final expression used to construct the model weights used in the examples.
Since this expression is easily computable using just the familiar augmented χ² and the number of parameters in the model, it is easy to include and test in existing lattice analyses. However, it relies on taking an asymptotic limit in the sample size, and it would be interesting to study improved estimators at finite sample size in future work.

A particularly nice application of these ideas is data subset selection, which we recast as a model variation problem. The basic observation was to reinterpret cuts on the data as additional model parameters, with a model weight given by Eq. (44). As discussed, carrying out data subset selection with a fixed model avoids the complication of the Jeffreys-Lindley paradox due to cancellation of the associated contributions. The model averaging approach gives a straightforward way to replace the common practice of tuning such data subset cuts by hand.

Broadly speaking, perhaps the most attractive feature of Bayesian model averaging is the natural appearance of Occam's razor. The model weights appearing in Eqs. (37), (38) and (44) favor models which use the fewest parameters while describing the most data. Inclusion of an asymptotic bias correction to the estimated likelihood, which yields the AIC as a model-selection criterion in the limit of large sample size, is crucial to the occurrence of this effect.

An interesting direction to explore in future work would be to study improved estimators at finite sample size, rather than relying on the asymptotic result to estimate the model weights. This will necessarily involve careful treatment of the covariance matrix terms in Eq. (37). It would also be interesting to explore direct Monte Carlo evaluation of the integral in Eq. (28), although the bias correction should be studied carefully in the context of whatever specific approach is used.
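The two key practical results, bias-corrected model weights and the model-averaged mean and variance of Eq. (14) and Eq. (17), combine into a very small amount of code. The following is a minimal sketch with function names of our own choosing:

```python
import numpy as np

def model_weights(chi2, k, n_cut, log_prior=None):
    """Bias-corrected model weights from the modified AIC of Eq. (44).

    chi2  : best-fit chi^2 of each model (within its chosen data subset)
    k     : number of fit parameters of each model
    n_cut : number of data points cut away by each model's subset choice
    """
    chi2, k, n_cut = map(np.asarray, (chi2, k, n_cut))
    aic = chi2 + 2 * k + 2 * n_cut
    if log_prior is not None:       # -2 log pr(M); a flat prior drops out
        aic = aic - 2 * np.asarray(log_prior)
    log_w = -0.5 * aic
    log_w -= log_w.max()            # stabilize the exponentials
    w = np.exp(log_w)
    return w / w.sum()

def model_average(means, errors, weights):
    """Model-averaged mean and error. The variance combines the weighted
    statistical variances with the spread of the means across models."""
    means, errors, weights = map(np.asarray, (means, errors, weights))
    mu = np.sum(weights * means)
    var = np.sum(weights * errors**2) + np.sum(weights * means**2) - mu**2
    return mu, np.sqrt(var)
```

For example, two models with identical χ², k, and N_cut receive equal weight, and the averaged variance correctly inflates when the per-model means disagree by more than their individual errors.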
Studying the interplay of model averaging with resampling methods such as jackknife and bootstrap, commonly used in lattice analyses, would likely be a useful extension of this work.

A. Practical suggestions and warnings
Model averaging has performed well for us in many test cases. However, as with any statistical tool, the techniques we describe should not be used blindly. In particular, model averaging should not be used as a substitute for plotting data and fits and thinking carefully about the results [28].

A basic assumption underlying this technique is that only statistically correct results are included in the model average. Including results for fits that fail to converge numerically, for example, will likely result in incorrect answers. Incomplete treatment of autocorrelation effects in the data will similarly yield invalid statistical estimates and thus invalid model-averaged results.

Although this technique allows the data to remove much of the subjectivity from analyses including model variations, this does not extend to the choice of the model prior weights pr(M). In the absence of specific and strong beliefs about particular models, we advocate for the use of a flat prior, i.e. weighting all models equally in the prior. In particular, we emphasize that one should not attempt to impose parsimony through the model priors by overweighting models with fewer parameters; this principle is built in to the bias-corrected model weights as we have discussed.

The model weights of Eq. (37) and Eq. (38) are useful beyond model averaging à la Eq. (9). For instance, many lattice calculations oblige the analyst to make many choices beyond t_min. In this situation, the model weights can help guide the decision about which, say, dozen fit results (out of potentially hundreds) are most promising for further investigation and scrutiny using more familiar and established techniques. Model selection is equivalent to model averaging in the limit that a single model has very high probability of correctness; this situation can naturally emerge from the data analysis, as in the example shown in Fig. 6.

Model averaging may be especially useful in the context of fitting models that contain discrete degrees of freedom that are not amenable to standard numerical minimization procedures. For example, a multi-exponential model Σ_i A_i e^{−E_i t} in which the signs of the amplitudes A_i are a priori unknown could be studied with improved numerical stability by fixing the signs of all included amplitudes one by one, and then averaging together the results.

In certain cases, we have found that the systematic errors due to model truncation or variation can be significantly overestimated by more conservative methods. Revisiting old lattice analyses which are limited by systematic errors related to model variation may be worthwhile. On the other hand, in our tests we have found excellent agreement between correlator fits with few states and those with many states; the combination of model averaging with few-state fits could reduce problems related to numerical convergence and reduce the computational cost of fitting.

Acknowledgments
We thank the members of the Fermilab Lattice and MILC collaborations for generously sharing their correlator data with us for use in numerical tests. We are especially grateful to Elvira Gámiz for providing intermediate results and helping compare against published results. We also thank Andreas Kronfeld, Christoph Lehner, Taku Izubuchi, Tom DeGrand, Jason Chang, and Dan Hackett for helpful discussions and comments in various stages of this work. This work was supported in part by the U.S. Department of Energy (DOE), Office of Science, Office of High Energy Physics, under Award Number DE-SC0010005 (ETN). Fermilab is operated by the Fermi Research Alliance, LLC under contract No. DE-AC02-07CH11359 with the United States Department of Energy.

[1] E. E. Leamer, Specification searches: Ad hoc inference with nonexperimental data, vol. 53 (Wiley, New York, 1978).
[2] A. Racine, A. Grieve, H. Fluhler, and A. Smith, Applied Statistics pp. 93–150 (1986).
[3] E. I. George and R. E. McCulloch, Journal of the American Statistical Association, 881 (1993).
[4] R. E. Kass and A. E. Raftery, Journal of the American Statistical Association, 773 (1995).
[5] A. E. Raftery and Y. Zheng, Journal of the American Statistical Association, 931 (2003).
[6] D. V. Lindley, Biometrika, 187 (1957).
[7] M. S. Bartlett, Biometrika, 533 (1957).
[8] H. Jeffreys, The theory of probability (OUP Oxford, 1998).
[9] R. D. Cousins, Synthese, 395 (2017), 1310.3791.
[10] M. R. Schindler and D. R. Phillips, Annals Phys., 682 (2009), [Erratum: Annals Phys. 324, 2051–2055 (2009)], 0808.3643.
[11] C. Davies, K. Hornbostel, I. Kendall, G. Lepage, C. McNeile, J. Shigemitsu, and H. Trottier (HPQCD), Phys. Rev. D, 114507 (2008), 0807.1687.
[12] S. Dürr et al. (Budapest-Marseille-Wuppertal), Phys. Rev. D, 114504 (2014), 1310.3626.
[13] E. Berkowitz et al. (2017), 1704.01114.
[14] C. Chang et al., Nature, 91 (2018), 1805.12130.
[15] E. Rinaldi, S. Syritsyn, M. L. Wagman, M. I. Buchoff, C. Schroeder, and J. Wasem, Phys. Rev. D, 074510 (2019), 1901.07519.
[16] N. Miller et al. (2020), 2005.04795.
[17] C. Morningstar, Nucl. Phys. B Proc. Suppl., 185 (2002), hep-lat/0112023.
[18] A. Bazavov et al. (Fermilab Lattice, MILC), Phys. Rev. D, 074509 (2014), 1407.3772.
[19] G. Lepage, B. Clark, C. Davies, K. Hornbostel, P. Mackenzie, C. Morningstar, and H. Trottier, Nucl. Phys. B Proc. Suppl., 12 (2002), hep-lat/0110175.
[20] K. Takeuchi, Suri-Kagaku (Mathematical Sciences), 12 (1976).
[21] M. Stone, Journal of the Royal Statistical Society, Series B, 44 (1977).
[22] R. Shibata, in From Data to Model, edited by J. Willems (Springer-Verlag, New York, 1989), pp. 215–240.
[23] S. Konishi and G. Kitagawa, Biometrika, 875 (1996).
[24] H. Akaike, in Selected papers of Hirotugu Akaike (Springer, 1998), pp. 199–213.
[25] H. Akaike, Annals of the Institute of Statistical Mathematics, 9 (1978).
[26] A. Bazavov et al. (Fermilab Lattice, MILC), Phys. Rev. D, 114509 (2019), 1809.02827.
[27] E. Gámiz, private communication (2020).
[28] F. Anscombe, The American Statistician.
[29] M. Dixon and T. Ward (2018), 1803.04947.

Appendix A: Calculation of asymptotic bias for model weights

In this appendix we derive the asymptotic bias of the log-likelihood. The form of the bias is well known in the statistics literature, and it appears in the Takeuchi Information Criterion, a generalization of the well-known Akaike Information Criterion. What follows is not a tight mathematical proof, but rather an informal derivation designed to illustrate how the bias term arises. For technical details, we refer interested readers to the extensive original literature [20–23]. The introduction of Ref. [29] is particularly accessible.

The maximum likelihood estimator (MLE) a* is a consistent estimator of the asymptotic or "true" value â. However, the estimated log-likelihood function evaluated at a* is not a consistent estimator of the expected log-likelihood function. Roughly speaking, because the MLE maximizes the estimated log-likelihood, it tends to overshoot the expected log-likelihood. Consider the log-likelihood

log L(x; a) = log ∏_{i=1}^{n} L(x_i; a),   (A1)

where L(x_i; a) is the likelihood for a single sample x_i. The sample and population expectation values are defined according to

E_n[···] = (1/n) Σ_{i=1}^{n} (···) ≡ ⟨···⟩_n,   (A2)
E_z[···] = ∫ dz f(z) (···) ≡ ⟨···⟩_z,   (A3)

where f(z) is the population distribution from which the samples x_n are presumed to be drawn. Because a* and â maximize their respective log-likelihoods, they are solutions to the usual equations:

⟨∂_a log L(x_n | a = a*)⟩_n = 0,   (A4)
⟨∂_a log L(z | a = â)⟩_z = 0.   (A5)

Note that, for a fixed number of samples n, a* is a fixed number.
The sample Fisher information matrix I_N and the negative sample Hessian matrix J_N are defined as

I_{N,xy}(a) ≡ (1/N) Σ_{i=1}^{N} (∂ log L(x_i|a)/∂a_x)(∂ log L(x_i|a)/∂a_y)   (A6)
           = (1/(4N)) Σ_{i=1}^{N} (∂χ²_i/∂a_x)(∂χ²_i/∂a_y),   (A7)

J_{N,xy}(a) ≡ −(1/N) Σ_{i=1}^{N} ∂² log L(x_i|a)/∂a_x ∂a_y = (1/(2N)) Σ_{i=1}^{N} ∂²χ²_i/∂a_x ∂a_y.   (A8)

The bias in the log-likelihood is defined as the difference between its estimated and expected values,

b(a*(x_n)) ≡ ⟨ log L(x_n | a*(x_n)) − ⟨ log L(z | a*(x_n)) ⟩_z ⟩_n.   (A9)

We are interested in the behavior of this bias in the limit of many samples, n → ∞. To emphasize the dependence on the data, we have written a* = a*(x_n). To evaluate the bias explicitly and resolve the mixed expectation value, it helps to add and subtract terms judiciously:

b(a*(x_n)) = ( ⟨log L(x_n | a*(x_n))⟩_n − ⟨log L(x_n | â)⟩_n )
           + ( ⟨log L(x_n | â)⟩_n − ⟨log L(z | â)⟩_z )
           + ( ⟨log L(z | â)⟩_z − ⟨⟨log L(z | a*(x_n))⟩_z⟩_n ).   (A10)

This trivial rewriting pays immediate dividends. The first and third terms involve matching expectation values at nearby points and are amenable to Taylor expansion. As we will argue shortly, the second term vanishes.

Let us begin with the first term. Expanding around the MLE point a* gives

⟨log L(x_n | a*(x_n))⟩_n − ⟨log L(x_n | â)⟩_n = −(1/2) ⟨ (â − a*(x_n)) (∂² log L/∂a ∂a′) (â − a*(x_n)) ⟩_n   (A11)
  = −(1/2) (â − a*(x_n)) ⟨ ∂² log L/∂a ∂a′ ⟩_n (â − a*(x_n))   (A12)
  → +(1/2) tr[ I(â) J⁻¹(â) ]  as n → ∞.   (A13)

In the first equality, the linear term vanishes by the definition of a*, Eq. (A4). The final line follows in the large-n limit under suitable regularity conditions (for details, see Refs. [22, 23, 29]).

The second term vanishes. To see this, first observe that both terms are evaluated at the same fixed parameters â, which remain unchanged by the limit n → ∞. Next, note that the estimated log-likelihood converges point-by-point to the asymptotic distribution. Therefore, the difference vanishes in this limit. This argument fails when the MLE a*(x_n) is involved, since a*(x_n) itself moves with n.

Finally, we consider the third term, which contains population expectation values. In this case it is useful to expand around â, since the linear term will vanish by Eq. (A5):

⟨log L(z | â)⟩_z − ⟨⟨log L(z | a*(x_n))⟩_z⟩_n   (A14)
  = ⟨log L(z | â)⟩_z − ⟨⟨log L(z | â)⟩_z⟩_n + (1/2) ⟨(a* − â) J(â) (a* − â)⟩_n   (A15)
  = ⟨log L(z | â)⟩_z − ⟨log L(z | â)⟩_z + (1/2) ⟨(a* − â) J(â) (a* − â)⟩_n   (A16)
  = (1/2) ⟨(a* − â) J(â) (a* − â)⟩_n   (A17)
  → (1/2) tr[ I(â) J⁻¹(â) ]  as n → ∞.   (A18)

In the final line we have again taken the limit n → ∞ and invoked suitable regularity conditions. Combining results, we see that

b(a*(x_n)) → +(1/2) tr[I(â) J⁻¹(â)] + 0 + (1/2) tr[I(â) J⁻¹(â)]   (A19)
           = tr[I(â) J⁻¹(â)].   (A20)

As indicated, this result is evaluated at the (unknown) parameters â. However, since the sample matrices I_N(a*) and J_N(a*) are consistent estimators of I(â) and J(â), the bias may be evaluated using them instead [22, 23, 29].

For the sake of concreteness, the proof sketched here has used the MLE a*. However, a similar bias term is expected to be present quite generally. For instance, Theorem 2.1 of Ref. [23] proves the existence of bias for a more general class of estimators. Roughly speaking, the bias arises from finite-sample-size fluctuations in the data and not from the choice of the maximum likelihood estimator itself. Due to the generality of this bias term, we include the correction in the general formula Eq. (28) and not only in the Gaussian approximation.

So far the discussion has been for general log-likelihoods. Now we specialize to the case of least-squares fitting, where −2 log L(x_n; a) = χ²(x_n; a). The bias correction then amounts to the replacement χ²(x_n; a*) → χ²(x_n; a*) + 2 tr[I_N(a*) J_N⁻¹(a*)].
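The limiting result b → tr[I J⁻¹], which reduces to the AIC penalty k when the model is correctly specified (so that I ≈ J), can be checked numerically. The following sketch, entirely our own construction, computes the sample matrices I_N and J_N for a correctly specified linear least-squares model and verifies that the trace approaches k = 2.

```python
import numpy as np

# Numerical sketch of the Takeuchi bias term tr[I_N J_N^{-1}] for a linear
# least-squares model y = a0 + a1*t with known Gaussian noise. For a
# correctly specified model, I ~ J and the trace approaches k = 2,
# recovering the AIC penalty. All numbers here are illustrative.
rng = np.random.default_rng(1)
N, sigma = 20000, 0.5
t = rng.uniform(0.0, 1.0, N)
y = 1.0 + 2.0 * t + sigma * rng.standard_normal(N)

# Per-sample log-likelihood: log L_i = -(y_i - a0 - a1*t_i)^2 / (2 sigma^2)
X = np.stack([np.ones(N), t], axis=1)           # design matrix
a_star = np.linalg.lstsq(X, y, rcond=None)[0]   # MLE for Gaussian noise
resid = y - X @ a_star

# I_N: average outer product of the per-sample score vectors d log L_i / da
scores = (resid / sigma**2)[:, None] * X
I_N = scores.T @ scores / N

# J_N: negative average per-sample Hessian (parameter-independent here)
J_N = X.T @ X / (N * sigma**2)

bias = np.trace(I_N @ np.linalg.inv(J_N))
print(f"tr[I J^-1] = {bias:.3f}  (expect ~ k = 2)")
```

Under model misspecification, I and J differ and the trace departs from k, which is precisely the distinction between the Takeuchi and Akaike criteria.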