Bayesian model averaging for analysis of lattice field theory results
FFERMILAB-PUB-20-374-T
William I. Jay∗ and Ethan T. Neil†
Theoretical Physics Department, Fermi National Accelerator Laboratory, Batavia, IL 60510, USA
and Department of Physics, University of Colorado, Boulder, CO 80309, USA
(Dated: August 3, 2020)
Abstract
Statistical modeling is a key component in the extraction of physical results from lattice field theory calculations. Although the general models used are often strongly motivated by physics, their precise form is typically ill-determined, and many model variations can be plausibly considered for the same lattice data. Model averaging, which amounts to a probability-weighted average over all model variations, can incorporate systematic errors associated with model choice without being overly conservative. We discuss the framework of model averaging from the perspective of Bayesian statistics, and give useful formulas and approximations for the particular case of least-squares fitting, commonly used in modeling lattice results. In addition, we frame the common problem of data subset selection (e.g. choice of minimum and maximum time separation for fitting a two-point correlation function) as a model selection problem, and study model averaging as a straightforward alternative to manual selection of fit ranges. Numerical examples involving both mock and real lattice data are given.

∗ Email: [email protected]
† Email: [email protected]

I. INTRODUCTION

One of the central problems in lattice field theory is that of model fitting and parameter estimation. This problem appears repeatedly in the analysis of lattice results, from single two-point correlators up to joint chiral and continuum extrapolations of results from many simulation streams. The functional forms which appear in these cases are often notoriously difficult to work with; the sum of exponentials which models the two-point correlator is in general quite numerically unstable, and chiral and continuum extrapolations can involve nonlinear dependence on large numbers of unknown variables. Worse still, many of the models appearing in analysis of lattice simulations are incomplete, with well-known limits to their applicability.
Chiral perturbation theory is an effective theory which will break down at large masses or momentum scales, and contains (in principle) an infinite number of low-energy constants; a single two-point correlator in principle has contributions from an infinite tower of excited states. The Symanzik effective theory describing the appearance of lattice artifacts similarly contains an organized but infinite number of terms. These contributions are typically dealt with by truncating the model, and often the data as well. However, this can lead to subtle dependence of the results on the analyst's choice of fit range and number of terms to include in the model.

This is not to say that lattice theorists are unaware of these potential sources of systematic error. The effects of model truncation and data truncation can be estimated by studying the variation of quantities of interest as the range of data included is varied, or as additional model terms are added. However, the often-adopted approach of taking the full difference between these variations as a systematic error is somewhat crude and likely to be overly conservative in many cases.

In this paper, we describe the technique of Bayesian model averaging as an alternative approach to these systematic errors, and outline its potential applications in the analysis of lattice simulation results. This approach allows for fully rigorous estimation of probability distributions for parameters of interest by combining results from several models. All models must reduce to a common model containing the parameter(s) to be estimated, but there is no other requirement that they be nested models. (For a continuum extrapolation of a matrix element, e.g., the reduced model can simply be a single constant parameter, the value of the matrix element in the continuum limit.)

Bayesian model averaging is somewhat well-known in the statistical literature [1–5], although it is most often used in the context of linear models.
Here we place no such restriction, giving formulas which can be used for arbitrary non-linear models. In general, the model probability weights required for model averaging are complicated integrals, but we will give approximate formulas which may be used with sufficient sample sizes. We discuss the effects of estimator bias and avoiding the Jeffreys-Lindley paradox [6–9] when empirical priors are used.

Our work is partially inspired by the Bayesian approach to effective field theory advocated by Schindler and Phillips [10]. Other examples of using weighted averages over models in lattice analysis include [11–16]. We believe that our treatment is the first attempt to lay out the procedure rigorously and in a fully Bayesian framework for a lattice field theory audience.

An outline of the paper is as follows. In Section II, we describe the basic Bayesian framework for model averaging and derive formal results for model-averaged expectation values. Section III specializes to the case of least-squares fitting and derives an approximate formula for the model probabilities which are needed for model averaging. Section III C gives a numerical example of model averaging applied to mock data. In Section IV, we discuss the common problem of data subset selection, reframing it as a model variation problem in order to apply model averaging. Section IV A shows an example application to the common task of fitting a two-point correlation function and demonstrates the effectiveness of model averaging as a replacement for choosing cuts on the data. We make some concluding remarks in Section V. A detailed discussion of bias correction in the estimation of model probability is given in Appendix A.
II. BAYESIAN FRAMEWORK
The basic analysis problem is as follows: we wish to describe a dataset D using a base model M_0 in order to determine the value of one or more common parameters {a_c}. In the example of a continuum extrapolation, M_0 could simply be the value of a specific matrix element in the continuum limit.

In estimating the parameters {a_c}, we are often led to consider several extensions of the base model M_0 which contain various unphysical or uninteresting terms, like lattice artifacts or undetermined excited states. This extends our study to a space of models {M}, but our basic interest is still in estimation of the parameters of M_0. All of the models in {M} must contain M_0, in the sense that marginalizing over additional parameters {a_m} reduces them to M_0. (Note that the set {a_m} implicitly depends on the choice of model M.)

It is important to note that the base model M_0 itself does not necessarily have to be contained in the set {M} of models that are actually fit to the data. As a simple example, in a continuum extrapolation it is certainly not necessary to include the continuum-only model M_0 (without any lattice spacing dependence) in the set of fits. Indeed, a continuum-only model would surely describe the data poorly in this example.

To obtain the marginal probabilities for the common parameters, we marginalize over both models and additional parameters [10]:

  pr(a_c | D) = \sum_M \int da_m \, \frac{pr(D | a, M) \, pr(a | M) \, pr(M)}{pr(D)},    (1)

where "pr" denotes a probability distribution, and the set of all parameters {a} is the union of {a_c} and {a_m}. In principle, this formula assumes that all parameters {a} are dimensionless. In practice, we will be interested in model weights which depend only on ratios of probabilities, so any units will tend to cancel.

In principle, this "master formula" contains all the required information.
If we can carry out the integrals and explicitly construct pr(a_c | D), then expectation values for arbitrary functions of the common parameters a_c are immediately available:

  \langle f(a_c) \rangle = \int da_c \, f(a_c) \, pr(a_c | D),    (2)

from which we can construct the standard mean, variance, and so forth. However, evaluating the integrals in the master formula is generally quite difficult, especially in the context of the complicated non-linear models appearing in lattice analyses. Direct Monte Carlo evaluation of such integrals is an intriguing option which deserves more attention.¹ Instead, consider the expectation value of f(a_c) within a single fixed model M:

  \langle f(a_c) \rangle_M = \int da_c \, f(a_c) \, pr(a_c | M, D)    (3)
  = \int da_c \, f(a_c) \, \frac{pr(a_c, M, D)}{pr(M, D)}    (4)
  = \int da_c \, f(a_c) \, \frac{pr(D | a_c, M) \, pr(a_c, M)}{pr(M | D) \, pr(D)}    (5)
  = \int da_c \, f(a_c) \, \frac{pr(D | a_c, M) \, pr(a_c | M) \, pr(M)}{pr(M | D) \, pr(D)}    (6)
  = \frac{1}{pr(M | D)} \int da_c \, f(a_c) \, pr(a_c, M | D).    (7)

But now if we marginalize the integral on the right-hand side over the space of models {M}, we just obtain the total, model-independent expectation value for f:

  \sum_M \int da_c \, f(a_c) \, pr(a_c, M | D) = \int da_c \, f(a_c) \, pr(a_c | D) = \langle f(a_c) \rangle.    (8)

Thus, we arrive at the relation

  \langle f(a_c) \rangle = \sum_M \langle f(a_c) \rangle_M \, pr(M | D).    (9)

This is the central formula of interest for purposes of model averaging. It shows that all of the moments of the fully combined PDF can be obtained as a weighted average over individual model information, with the weight factors given by the posterior probability pr(M | D) for each individual model.
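The weighted-average structure of Eq. (9) is straightforward to apply once per-model estimates and weights are in hand. The sketch below combines means and, anticipating the variance decomposition derived in Section II A, variances over models; all numerical inputs are invented for illustration.

```python
import numpy as np

def model_average(means, variances, weights):
    """Model-averaged mean and variance over a set of models.

    means[i], variances[i]: <a>_i and sigma^2_{a,i} from model M_i;
    weights[i]: posterior probabilities pr(M_i|D), assumed normalized.
    """
    means = np.asarray(means, float)
    variances = np.asarray(variances, float)
    weights = np.asarray(weights, float)
    mean = np.sum(weights * means)          # Eq. (9) with f(a) = a
    # weighted statistical variance plus the spread of means over models
    var = np.sum(weights * variances) + np.sum(weights * means**2) - mean**2
    return mean, var

m, v = model_average([1.8, 1.6], [0.01, 0.01], [0.75, 0.25])
```

When all models agree, the spread term vanishes and the common statistical variance is recovered unchanged.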
Due to its role in model averaging, we will refer to pr(M | D) interchangeably as the "posterior probability" or as the "model weight." The model weight itself can be expressed as an integral over the parameter space,

  pr(M | D) = \int da \, pr(M, a | D)    (10)
  = \int da \, \frac{pr(D | a, M) \, pr(a, M)}{pr(D)}    (11)
  = \int da \, \frac{pr(D | a, M) \, pr(a | M) \, pr(M)}{pr(D)}.    (12)

To fix the normalization of these probabilities, observe that the expectation of the unit operator \langle 1 \rangle should be unity independent of model choice, which gives the condition

  \sum_M pr(M | D) = 1.    (13)

Via Eq. (12), this condition fixes the normalization of \sum_M pr(M), apart from the overall normalization factor pr(D).

¹ Direct Monte Carlo evaluation may deserve particular attention in the context of lattice studies, in which much more complicated integrals are evaluated as a matter of course. This method seems to have been explored in Ref. [17]. However, we will not pursue this approach here.

The formulae presented so far are completely general. However, certain common choices used for the estimators of the likelihood function and other quantities can introduce bias. Any such biases should be corrected to guarantee convergence to correct results. We discuss an important bias correction in detail in Section III A below.

A. Estimation of model parameters with model averaging
It is instructive to consider what happens to the simple estimate of a model parameter under the model combination procedure. Suppose we are interested in the single parameter a, marginalized over a set of N models {M}. Using Eq. (9), we find for its mean

  \langle a \rangle = \sum_M \langle a \rangle_M \, pr(M | D)    (14)

and variance:

  \sigma_a^2 = \langle a^2 \rangle - \langle a \rangle^2    (15)
  = \sum_{i=1}^{N} \langle a^2 \rangle_i \, pr(M_i | D) - \left( \sum_{i=1}^{N} \langle a \rangle_i \, pr(M_i | D) \right)^2    (16)
  = \sum_{i=1}^{N} \sigma_{a,i}^2 \, pr(M_i | D) + \sum_{i=1}^{N} \langle a \rangle_i^2 \, pr(M_i | D) - \left( \sum_{i=1}^{N} \langle a \rangle_i \, pr(M_i | D) \right)^2.    (17)

This result for the variance also appears in the statistics literature [4], and has been used in the context of lattice calculations in [14, 16]. The first term is simply the weighted average of the statistical variance over all models. The remaining two terms can then be thought of as a "systematic error" contribution to the variance of a due to model choice. In the limit of equal model weights, i.e., pr(M_i | D) = 1/N, the latter contribution can be thought of as the variance over the space of models, since it reduces to the standard formula

  \sigma_{a, \rm syst}^2 = \frac{1}{N} \sum_{i=1}^{N} \langle a \rangle_i^2 - \frac{1}{N^2} \left( \sum_{i=1}^{N} \langle a \rangle_i \right)^2.    (18)

We note that in the general case, this is not the same as the variance computed from the set of weighted estimates w_i \equiv \langle a \rangle_i \, pr(M_i | D); such a weighted variance would contain an extraneous factor of pr(M_i | D) in the \langle a \rangle_i^2 term. We also note that taking the full width of the distribution of results \langle a \rangle_i, as done in e.g.
[18], will give an estimated systematic error strictly greater than \sigma_{a, \rm syst}, so that this procedure is a conservative error estimate.

It is illustrative to specialize to the case of considering only two models M_1, M_2, for which we have found the weights

  pr(M_1 | D) = 1 - p,    (19)
  pr(M_2 | D) = p.    (20)

Suppose now that model 1 is strongly favored by the data, so p is small. Expanding the expectation values above to first order in p, we find:

  \langle a \rangle = \langle a \rangle_1 + ( \langle a \rangle_2 - \langle a \rangle_1 ) p,    (21)
  \sigma_a^2 \approx \sigma_{a,1}^2 + \left[ \sigma_{a,2}^2 - \sigma_{a,1}^2 + ( \langle a \rangle_2 - \langle a \rangle_1 )^2 \right] p.    (22)

In the limit p → 0, we recover the results of model M_1, as expected. For small but non-zero p, the corrections to the mean and variance of a due to including M_2 are likely to be small, but it is clear that this depends on how large the difference between the estimates from M_1 and M_2 is.

III. LEAST-SQUARES FITTING
The discussion so far has been completely general with regards to the form of the probability distributions appearing. We now specialize to the most common usage case in the context of lattice simulations, namely least-squares regression of a model M to some data set D. The likelihood function pr(D | a, M) is taken to be

  pr(D | a, M) = \prod_{i=1}^{N} \frac{1}{(2\pi)^{d/2} (\det \Sigma)^{1/2}} \exp\left[ -\frac{\chi_i^2}{2} \right],    (23)

where

  \chi_i^2 \equiv ( y_i - f_M(a) )^T \Sigma^{-1} ( y_i - f_M(a) )    (24)

is the standard chi-square goodness of fit for the data sample y_i; we assume the samples are drawn independently from some underlying distribution. Here d denotes the dimension of a single observation vector y_i, and N is the number of independent observations drawn from the true distribution. The matrix \Sigma is the covariance matrix between the y_i.

For the prior distribution, it is standard (and typically justified by the principle of maximum entropy [10, 19]) to adopt a multivariate Gaussian form,

  pr(a | M) = \frac{1}{(2\pi)^{k/2} (\det \tilde\Sigma)^{1/2}} \exp\left[ -\frac{1}{2} ( a - \tilde a )^T \tilde\Sigma^{-1} ( a - \tilde a ) \right]    (25)
  = \prod_{x=1}^{k} \frac{1}{\sqrt{2\pi} \, \tilde\sigma_x} \exp\left( -\frac{( a_x - \tilde a_x )^2}{2 \tilde\sigma_x^2} \right),    (26)

where k is the number of fit parameters in model M and \tilde\Sigma is the prior covariance matrix. The second formula holds for the simplified case where the prior parameter covariance matrix \tilde\Sigma is diagonal with entries \tilde\sigma_x^2.

The ordinary least-squares likelihood pr(D | a, M) is normalized by the data covariance matrix and some factors of (2\pi). Since we are considering only the case of a fixed data set, this overall normalization is the same for all models and can be ignored here. On the other hand, it is important to include the normalization of the prior distribution pr(a | M), which differs for models with different numbers of parameters.

The best-fit point a* maximizes the above likelihood or, equivalently, minimizes the negative log-likelihood function

  -2 \log( pr(D | a, M) \, pr(a | M) ) = \chi^2 + \chi_p^2 \equiv \chi_N^2
    (27)

with \chi_p^2 denoting the Gaussian prior contribution and the combination of terms defining the "augmented chi-squared" function [19].

A. Bias correction of the model weights

Estimator bias occurs when a sample estimator differs from the "true" underlying population value it is approximating. A common example of such a bias occurs in the naive estimator of the variance for data drawn from an underlying Gaussian distribution. In this example, the bias is of order 1/N, where N is the sample size, which means that it will be automatically removed in the limit of large N.

However, there are many examples of asymptotic biases which do not vanish as N → ∞. The sample maximum likelihood estimate (MLE) of the log-likelihood function, \chi_N^2(a*), suffers from such an asymptotic bias. Roughly speaking, because the MLE maximizes the sample log-likelihood, it tends to overshoot the true asymptotic value. Fundamentally, this bias arises not from the choice of the MLE but rather from finite-sample-size fluctuations in the data itself.

To ensure that the results at finite sample size N converge to the asymptotic model weight, we define the bias-corrected model weight to be [20–23]

  pr(M | D)_{BC} = \exp\left( -\mathrm{tr}[ J_N^{-1}(a*) \, I_N(a*) ] \right) \times \int da \, \frac{pr(D | a, M) \, pr(a | M) \, pr(M)}{pr(D)},    (28)

where I_N and J_N are the sample estimates of the log-likelihood Fisher information matrix and the (negative) Hessian matrix, respectively. These matrices are defined in Appendix A, which also gives an informal derivation of the bias correction term, ∼ tr[ J_N^{-1}(a*) I_N(a*) ].

When the model is linear in the parameters a, it is straightforward to show that I and J are identical, so that J_N^{-1}(a*) I_N(a*) →
𝟙, the k × k identity matrix. In the more general nonlinear case, this well-known relationship between J_N and I_N still holds asymptotically² [21, 22]. In either case, tracing over the identity matrix simply counts the total number of parameters k in the model.

The appearance of the structure tr[ J^{-1} I ] is closely related to the Akaike information criterion [24, 25] and its generalization, the Takeuchi information criterion [20]. We discuss this connection more below.

B. Gaussian approximation
By construction, the sample likelihood pr(a | D) is locally maximized at the best-fit parameter values a*. Taylor expansion about the best-fit point gives

  \chi_N^2(a) \approx \chi_N^2(a*) + \frac{1}{2} ( a - a* )^T H_N^* ( a - a* ) + \ldots,    (29)

where H_N^* is the standard Hessian matrix evaluated at the best-fit point,

  H_{N,xy}^* \equiv \left. \frac{\partial^2 \chi_N^2}{\partial a_x \partial a_y} \right|_{a = a*} = 2 J_{N,xy}(a*).    (30)

² Strictly speaking, this relationship will hold asymptotically assuming that the "true model" from which the data are drawn can be described by the model M being fit. For lattice applications where we typically have a strong physical basis for the models that we use, this assumption is likely to hold.

Within this approximation, the integral for the model weight becomes Gaussian and can be evaluated analytically:

  pr(M | D) = \int da \, \frac{pr(D | a, M) \, pr(a | M) \, pr(M)}{pr(D)}    (31)
  \approx \int da \, \frac{1}{(2\pi)^{k/2} (\det \tilde\Sigma)^{1/2}} \, e^{ -\chi_N^2(a*)/2 - \frac{1}{4}( a - a* )^T H_N^* ( a - a* ) } \, \frac{pr(M)}{pr(D)}    (32)
  = \frac{pr(M)}{pr(D)} (2\pi)^{-k/2} (\det \tilde\Sigma)^{-1/2} \left[ (2\pi)^{k/2} \, e^{-\chi_N^2(a*)/2} \, ( \det( H_N^*/2 ) )^{-1/2} \right]    (33)
  = \frac{pr(M)}{pr(D)} (\det \tilde\Sigma)^{-1/2} \, e^{-\chi_N^2(a*)/2} \, ( \det( H_N^*/2 ) )^{-1/2}.    (34)

Thus, neglecting the term pr(D) which is the same for all models, one finds the following form for the log-likelihood

  -2 \log( pr(M | D) ) \approx -2 \log( pr(M) ) + \chi_N^2(a*) + \log \det \tilde\Sigma_M + \log \det( H_N^* / 2 )
    (35)
  = -2 \log( pr(M) ) + \chi_N^2(a*) + \log \det \tilde\Sigma_M - \log \det \Sigma_N^*,    (36)

where \Sigma_N^* \equiv ( H_N^*/2 )^{-1} is the usual model best-fit covariance matrix, i.e., the predicted covariance of the fit parameters a. Including the bias correction term introduced in Eq. (28) gives the overall result

  -2 \log( pr(M | D)_{BC} ) \approx \mathrm{GAP}_M,
  \mathrm{GAP}_M \equiv -2 \log( pr(M) ) + \chi_N^2(a*) + \log \det \tilde\Sigma_M - \log \det \Sigma_N^* + 2 \, \mathrm{tr}[ J_N^{-1}(a*) \, I_N(a*) ],    (37)

where we introduce the notation "GAP," standing for "Gaussian approximate posterior." If the model is linear in the parameters a, then the GAP formula is in fact exact. For nonlinear regression, the approximation is expected to improve as the sample size of the data D grows. From this point forward, we adopt the bias-corrected form as the default choice of pr(M | D) unless explicitly stated otherwise, dropping the "BC" subscript.

From Eq. (37), we see that the posterior probability pr(M | D) encapsulates the principle of Occam's razor: models with large \chi_N^2(a*) are penalized, but so are models which have a large number of free parameters. (The final bias-correction term reduces to 2k asymptotically, where k is the number of parameters.)

Unfortunately, in the presence of any empirical priors and/or models with differing dimensions, an effect known as the Jeffreys-Lindley paradox [6, 8, 9] leads to outsized dependence on the prior widths that can severely distort the overall results. The presence of the paradox is linked closely to the use of "empirical priors" which are not based on true prior information and can be taken to be arbitrarily wide. Note that for estimation of the best-fit parameters a* in a fixed model there is no such distortion, and the use of empirical priors is not problematic.
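For concreteness, the GAP expression of Eq. (37) involves only quantities already produced by a standard augmented least-squares fit. The sketch below is our own illustration (the function name and defaults are not from the paper); the trace term defaults to its asymptotic value k.

```python
import numpy as np

def gap(chi2_min, prior_cov, fit_cov, k, log_pr_model=0.0, bias_trace=None):
    """Gaussian approximate posterior, Eq. (37), up to the constant pr(D).

    chi2_min:   chi^2_N evaluated at the best-fit point a*;
    prior_cov:  prior covariance matrix ~Sigma_M;
    fit_cov:    best-fit parameter covariance Sigma*_N;
    k:          number of fit parameters;
    bias_trace: tr[J_N^-1 I_N]; defaults to k, its asymptotic value.
    """
    if bias_trace is None:
        bias_trace = k
    # slogdet avoids overflow for wide priors / many parameters
    _, logdet_prior = np.linalg.slogdet(np.atleast_2d(prior_cov))
    _, logdet_fit = np.linalg.slogdet(np.atleast_2d(fit_cov))
    return (-2.0 * log_pr_model + chi2_min
            + logdet_prior - logdet_fit + 2.0 * bias_trace)

val = gap(10.0, np.eye(2), np.eye(2), k=2)
```

When the prior and posterior covariances coincide the determinant terms cancel, and GAP reduces to the AIC discussed next; the example above returns 10 + 2·2 = 14.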
However, when the normalization of the likelihood is important, as in the estimation of pr(M | D), the paradox can lead to nonsensical results if the prior width is taken to be extremely large.

Instead of attempting to confront the Jeffreys-Lindley paradox head-on, we will instead argue that the effect of the ( \log \det \tilde\Sigma_M - \log \det \Sigma_N^* ) terms vanishes asymptotically and so can be viewed as a 1/N effect in the sample size N. Working in a scenario without true priors, we may adopt a cross-validation procedure as follows: we partition our full data set D into a small "training set" D_T and the remaining data D_T^c. A given model is first fit to the training set D_T to determine a set of fit parameters a_T. The results for a_T are then used to fix priors on a for the fit to the remaining data D_T^c.

In practice, the cost of this procedure is that we must "use up" the data in D_T to fix priors, reducing the overall statistical precision of our analysis. However, in the asymptotic limit N → ∞, both partitions { D_T, D_T^c } approach the true asymptotic distribution and will yield exactly the same fit results. Therefore, in this limit \tilde\Sigma_M → \Sigma_N^* and the determinant terms in GAP_M cancel completely.

In the same N → ∞ limit, the bias-correction term reduces to twice the number of parameters, and we recover the well-known Akaike information criterion (AIC) [24, 25]:
  \mathrm{GAP}_M \xrightarrow{ N \to \infty, \ \mathrm{CV} } \mathrm{AIC}_M = -2 \log( pr(M) ) + \chi_N^2(a*) + 2k,    (38)

where "CV" denotes the aforementioned cross-validation procedure to remove the determinant terms. More generally, simply dropping the determinant terms from Eq. (37) gives the Takeuchi information criterion (TIC) [20]. However, even for non-linear models the TIC approaches the AIC in the N → ∞ limit which we are considering, as discussed in Section III A. We therefore advocate the use of the AIC for general model averaging purposes.

C. Practical example: polynomial data
To demonstrate the method and some key features, we consider a simple toy model. We first specify a quadratic "model truth" polynomial function,

  F(x) = 1.8 - c_1 (x/L) + c_2 (x/L)^2,    (39)

where c_1 and c_2 are fixed positive constants and L sets the scale of the x-axis. A set of N mock data samples is generated on a fixed grid of x values by taking the model truth at each point and adding uncorrelated Gaussian noise with mean zero and standard deviation σ = 1.0.
The resulting mock data are plotted in Fig. 1 (top panel) for the choice N = 160; we will also study the N-dependence.

We take as our space of model functions polynomials labeled by their degree m,

  f_m(x) = \sum_{j=0}^{m} a_j (x/L)^j,    (40)

with 0 ≤ m ≤ 5. We take the flat prior pr(m) = 1/6, corresponding to minimal prior knowledge about the functional form of the model (except that it is polynomial). All parameter priors are taken to be Gaussian with mean 0 and width 10 (the parameters are, essentially, unconstrained).

The results of the fits to individual models as well as the model-averaging results are shown in Fig. 1 (bottom panel) and in Table I. Note that although all models with m ≥ 2 describe the data comparably well in terms of \chi_N^2, the bias-corrected model probability estimated through the AIC places relatively more weight on the simpler models, with the maximum probability assigned to the correct choice m = 2. The model-averaged result is
Figure 1: Top: synthetic data (green points) for the given quadratic "model truth" function (black curve) plus Gaussian noise, with noise sample size N = 160. Bottom: fit results for degree-m polynomial models (blue circles), compared to the known value a_0 = 1.8, starting from m = 0. The lower inset shows the standard p-value (blue dashed line) and the model weight calculated from the AIC (orange solid line). Comparison of these curves shows the "Occam's razor" effect, with the AIC penalizing fits with roughly equal goodness of fit but more fit parameters.

consistent with model truth and has slightly larger uncertainty than the individual fit to the correct model m = 2.

As noted, the results so far use a fixed sample size of N = 160. We repeat the test as described above with several values of N starting from N = 20, plotting the resulting estimates of a_0 from various procedures in Fig. 2. The result of the model averaging procedure using the AIC is seen to be consistent with model truth in all cases, with an error that compares favorably to single-model fits using the known true quadratic model and is uniformly smaller than the more conservative procedure of taking the full variation of the mean over all models with p-value above a fixed threshold.

           m = 0     m = 1       m = 2      m = 3      m = 4      m = 5
  a_1        —     -0.170(72)  -0.96(30)  -1.42(81)  -1.1(1.2)  -0.9(1.3)
  a_2        —         —       -0.8(1.2)   1.9(6.4)   1.3(6.8)
  a_3        —         —           —      -1.4(3.2)   0.3(7.3)
  a_4        —         —           —          —      -1.0(3.9)
  p-value   0.01      0.04       0.22       0.24       0.25       0.25
  pr(M|D)   0.01      0.04       0.56       0.25       0.10       0.04

Table I: Individual best-fit results and associated quantities for N = 160; "—" marks parameters absent from a given model, and pr(M | D) is estimated from the AIC. The model-averaged value for the intercept is consistent with the model truth a_0 = 1.8.

Omitting the bias-correction term and averaging using only the \chi_N^2 results to estimate model probability (the "naive" estimate) also tends to give slightly larger error than model averaging using the AIC, but the results remain consistent with the correct answer.
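The polynomial test can be reproduced in a few lines. In the sketch below the slope and curvature of the model truth, the x grid, and the random seed are stand-ins of our own choosing (only the intercept a_0 = 1.8, N = 160, and σ = 1 follow the text); the fit is performed on the sample mean, whose error is σ/√N.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(32.0)
L = x.max()
truth = 1.8 - 0.5 * (x / L) + 0.1 * (x / L) ** 2   # stand-in c1, c2
N, sigma = 160, 1.0

err = sigma / np.sqrt(N)                   # error on the sample mean
ybar = truth + err * rng.normal(size=x.size)

aic, a0 = [], []
for m in range(6):                          # degrees 0..5, flat prior pr(m) = 1/6
    coeffs = np.polyfit(x / L, ybar, m, w=np.full(x.size, 1.0 / err))
    chi2 = np.sum(((ybar - np.polyval(coeffs, x / L)) / err) ** 2)
    aic.append(chi2 + 2 * (m + 1))          # Eq. (38) with k = m + 1 parameters
    a0.append(coeffs[-1])                   # intercept: the common parameter

aic = np.array(aic)
w = np.exp(-0.5 * (aic - aic.min()))        # pr(M|D) proportional to exp(-AIC/2)
w /= w.sum()
a0_avg = float(np.sum(w * np.array(a0)))
```

The weights w implement the Occam penalty automatically: higher-degree fits match the data about as well but pay 2 units of AIC per extra parameter.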
In the absence of the bias correction, the likelihood of the models with m > 2 is overestimated. However, because these models contain the m = 2 model within their parameter space, they tend to estimate the correct value of the intercept a_0 on average. As a result, there is no bias introduced into the mean result for a_0 when using biased model probabilities in this particular example. However, the error bar is slightly overestimated when using the naive estimator.

IV. DATA SUBSET SELECTION AS A MODEL VARIATION PROBLEM
A routine part of modeling lattice Monte Carlo data is data subset selection, i.e., choosing a "cut" on the data beyond which the model is not applied. A canonical and simple example is fitting a two-point correlation function C(t) to extract the ground-state energy. The full model expected to describe this correlation function involves an infinite tower of exponentials, \sum_{i=0}^{\infty} A_i e^{-E_i t}. In practice, one truncates the sum after a finite number of terms and then selects a minimum value t_min below which the data are simply ignored. Choosing the precise value of t_min is generally done by hand.

Although this process is typically thought of as a data selection problem, it can easily be reformulated as a model selection problem. In the previous example, the justification for ignoring data below t_min is partially one of expediency. If we are only interested in the first few states, it suffices to look at times with t ≥ t_min where they dominate. Times below t_min will be heavily contaminated by contributions from the higher excited states, and little to no information about the first few states is lost by ignoring them.

Based on this observation, we can define a joint model that describes the full data set. First, select a subset of the data and fit the model of choice M to this subset as usual. Second, fit the remaining data to a "perfect" model with zero degrees of freedom. For example, we could take the "perfect" model to be a polynomial with degree equal to N_cut, but in principle other functional forms will also work. Because the "perfect" model has
Figure 2: Scaling of various estimates of the intercept a_0 vs. the data sample size N. The true value (dashed line) is a_0 = 1.8.
The blue circles (model average using the AIC) show good consistency with both the model truth and with the estimates using the correct quadratic model form (red squares). Taking the full-width difference between all models with fit p-value greater than a fixed threshold (the "full-width systematic") gives a substantially more conservative error estimate. Averaging with a naive estimate of pr(M | D) which omits the bias-correction term (silver triangles) does not directly lead to bias in the estimation of a_0, but also gives slightly larger uncertainty due to overweighting of more complicated models, as discussed in the text.

zero degrees of freedom, there exists a solution for its parameters for which the differences between the model and the sample means vanish exactly.

To give an explicit construction, we first define a partition P of the data vectors into y_i = ( y_i^{cut}, y_i^{keep} ), where y_i^{keep} are the subset to be modeled and y_i^{cut} are the cut data. We then define the corresponding partitioned model g_M(a, P) as

  y_i - g_M(a, P) = \begin{cases} y_i - \bar y^{cut}, & y_i \in y_i^{cut} \\ y_i - f_M(a), & y_i \in y_i^{keep}, \end{cases}    (41)

where \bar y^{cut} is the sample average of the y_i^{cut}. The partition-dependent log-likelihood is then, dropping constant terms that do not change with fixed data set D,

  -2 \log pr(D | a, M) = \sum_{i=1}^{N} \chi_i^2(P) = \sum_{i=1}^{N} ( y_i - g_M(a, P) )^T \Sigma^{-1} ( y_i - g_M(a, P) )    (42)
  = \sum_{i=1}^{N} ( y_i^{keep} - f_M(a) )^T \Sigma_P^{-1} ( y_i^{keep} - f_M(a) ) + \mathrm{(const)},    (43)

where \Sigma_P^{-1} is the submatrix of the full inverse data covariance matrix \Sigma^{-1} which corresponds to the data subset y_i^{keep}. All other terms involving the cut data contain the expression \bar y^{cut} - g_M(a, P) at least once and therefore vanish identically by construction, even in the presence of off-diagonal correlations between y_i^{keep} and y_i^{cut}.

Since matrix inversion does not generally commute with subspace projection, the matrix \Sigma_P^{-1} typically differs from ( \Sigma_P )^{-1}, the inverse of the covariance sub-matrix.
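The statement that inversion and projection do not commute is easy to check numerically. The toy 3×3 covariance below is invented for illustration: index 0 is "cut" and indices 1, 2 are kept.

```python
import numpy as np

# Invented covariance with both short- and long-range correlations.
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.5],
                  [0.2, 0.5, 1.0]])
keep = [1, 2]                                            # the kept data subset

Sigma_inv_P = np.linalg.inv(Sigma)[np.ix_(keep, keep)]   # project the inverse
Sigma_P_inv = np.linalg.inv(Sigma[np.ix_(keep, keep)])   # invert the projection

diff = np.abs(Sigma_inv_P - Sigma_P_inv).max()           # nonzero: the two differ
```

The difference is controlled by the correlations between kept and cut entries (here 0.5 and 0.2): by the Schur complement, the projected inverse equals ( \Sigma_P - \Sigma_{kc} \Sigma_{cc}^{-1} \Sigma_{ck} )^{-1}, which reduces to ( \Sigma_P )^{-1} only when those cross-correlations vanish.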
However, in practical lattice applications ( \Sigma_P )^{-1} is often used as an approximation to \Sigma_P^{-1}; the difference between these matrices is suppressed whenever long-range (i.e., further off-diagonal) correlations are smaller than short-range ones. An obstruction to using \Sigma_P^{-1} directly is that finite-sample estimates of the full inverse covariance matrix \Sigma^{-1} are typically ill-conditioned. Therefore, in what follows we will use ( \Sigma_P )^{-1}.

The result of this construction is that the contribution from the "perfect" model describing the data outside the chosen subset is \Delta\chi^2 = 0. However, there remains a bias-correction term which accounts for the N_cut additional model parameters used to describe the cut data y_i^{cut}. Thus, the overall model probability for the joint model is easily seen to be obtained from the modified expression
  \mathrm{AIC}_{M, N_{cut}} = -2 \log( pr(M) ) + \chi_N^2(a*) + 2k + 2 N_{cut},    (44)

where \chi_N^2(a*) is evaluated only for the model M and for data within our selected subset. In other words, in practice we do not need to construct the "perfect" model at all; we simply need to count the number of points cut away in our subset selection to obtain the correct model weight.

What about the covariance terms in Eq. (37) involving the "perfect" model parameters? Because the parameters of the "perfect" model are uniquely determined by the mean values of the data points alone, we can impose a prior on the perfect model with infinitesimal prior width \epsilon. So long as \epsilon is much smaller than the error on the removed data points themselves, it will dominate the posterior covariance matrix for those model parameters as well, so that

  \log \det \tilde\Sigma_{M, \rm perfect} - \log \det \Sigma_{N, \rm perfect}^* \approx k \log \epsilon - k \log \epsilon = 0.    (45)

The bias-correction term is also exactly 2 N_{cut} in this case, because the "perfect" model can always be constructed to be linear in the parameters. In other words, the use of the AIC to compute the contribution of the cut data to the relative model weights is exact and not an asymptotic approximation.

Although we were motivated by the example of a two-point correlator where the data are cleanly divided into two subsets along a single dimension, the argument above holds for arbitrarily complex subdivisions of the full data set. Whatever subset of the data we choose to fit explicitly to model M, the joint model which also describes the remaining data will contribute an additional factor of 2 N_{cut} to the information criterion. We can also consider a set of models {M} and perform ordinary model averaging over the joint space defined by {M} and the parameters that uniquely define a data subset.

A. Practical example: Synthetic correlation functions
To test the data subset selection procedure, we set up another toy-model example resembling a two-point correlation function, following the example description above. (We note in passing that if we fit a single model M and only vary N_cut, then the full GAP formula Eq. (37) may be used without running afoul of the Jeffreys-Lindley paradox, since the prior covariance Σ̃_M is held fixed and therefore cancels completely. As we have argued that the influence of the covariance determinant terms is a finite-sample-size effect, we will nevertheless adopt the simpler AIC prescription for our numerical tests to follow.) The "model truth" is a sum of two exponential states,

F(t) = 2.0 e^{−0.8 t} + 10.0 e^{−E₂ t},   (46)

with ground-state energy E = 0.8 and a second state of energy E₂ > E. To generate synthetic data, we add correlated Gaussian noise η(t) with mean zero and variance 0.09. The noise is added fractionally to the data, i.e., the synthetic data are generated according to the formula y(t) = F(t)(1 + η(t)). The correlation matrix of the noise takes the form ρ_{t,t′} = ρ^{|t−t′|}, i.e. equal to 1 on the diagonal and dying off as a power law as the time separation between points increases, similar to a real QCD correlation function. We fix the numerical correlation coefficient ρ to a constant value less than 1. N mock data samples are generated over a fixed range of t.

Additional trials in which the above parameters were varied were also carried out, including using uncorrelated Gaussian noise instead of correlated noise. No qualitative difference in the outcome of the tests was observed with these variations.

For this test, we consider a single model which consists of a single exponential term,

f(t) = A e^{−E t}.   (47)

This model is fit to all data in the range [t_min, t_max], varying t_min from 1 to 28, with the goal of using model averaging with Eq. (44) to obtain a combined result for the ground-state energy E.

The results of four independent trials following the above procedure with N = 500 are shown in Fig. 3. Excited-state contamination, i.e.,
the influence of the second exponential state which is not present in our fit model, is clearly visible at low t_min. In each case, excellent agreement of the model-averaged result with model truth is seen. As in the polynomial example, the bias-corrected model probability is seen to weight simpler models more strongly, which in this case means favoring fits that cut away less of the data.

In Fig. 4, we repeat the above exercise while varying the sample size N, once again over a range beginning at N = 20, and compare the model average to a fit at fixed t_min; the latter has an unaccounted-for systematic error due to model truncation. (Indeed, if we do not adjust t_min as N → ∞, we expect the result for E to become incompatible with the correct ground-state energy, as the contamination from the second state will eventually be resolved in a large enough sample.)

On the other hand, the error on the model-averaged result is generally much smaller than the more conservative full-width estimate, resulting from taking the full variation of mean results over all models with p-value above a fixed threshold. Use of the naive model probability (i.e., omitting the 2N_cut contribution to the model weight) causes a drastic inflation of the error in the averaged result for E. Once again, the interpretation is that omitting the bias correction causes overestimation of the likelihood for the more complicated models, in this case models with larger t_min. These models generally give results for E consistent with the correct answer, but with much larger errors, and the model-averaged result is altered accordingly.
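The synthetic-data test above can be sketched concretely in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the excited-state energy (1.6), the restriction to uncorrelated noise, and the t_min scan range are our own choices for the sketch, and all helper names are ours.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Illustrative "model truth": ground state E = 0.8 plus one heavier state.
# The excited-state energy 1.6 is an assumption made for this sketch.
t = np.arange(1, 26)
F = 2.0 * np.exp(-0.8 * t) + 10.0 * np.exp(-1.6 * t)

# N samples with fractional Gaussian noise (uncorrelated, for simplicity)
N = 500
samples = F * (1.0 + 0.3 * rng.standard_normal((N, t.size)))
y = samples.mean(axis=0)
sigma = samples.std(axis=0, ddof=1) / np.sqrt(N)

def single_exp(t, A, E):
    return A * np.exp(-E * t)

results = []  # (E, err, AIC) for each choice of t_min
for t_min in range(1, 16):
    sel = t >= t_min
    popt, pcov = curve_fit(single_exp, t[sel], y[sel], sigma=sigma[sel],
                           absolute_sigma=True, p0=[2.0, 0.8])
    chi2 = np.sum(((y[sel] - single_exp(t[sel], *popt)) / sigma[sel]) ** 2)
    n_cut = int(np.sum(~sel))          # points removed by the subset cut
    aic = chi2 + 2 * 2 + 2 * n_cut     # Eq. (44), k = 2, flat model prior
    results.append((popt[1], np.sqrt(pcov[1, 1]), aic))

E_vals, E_errs, aics = map(np.array, zip(*results))
w = np.exp(-0.5 * (aics - aics.min()))
w /= w.sum()
E_avg = np.sum(w * E_vals)
E_var = np.sum(w * E_errs**2) + np.sum(w * E_vals**2) - E_avg**2
print(f"model-averaged E = {E_avg:.3f} +/- {np.sqrt(E_var):.3f}")
```

Fits at small t_min pick up large χ² from the excited state and are exponentially suppressed in the weight, while fits at large t_min are penalized by the 2N_cut term, so the average is dominated by intermediate windows.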
Figure 3: Fit results for the ground-state energy with true value E = 0.8, plotted versus t_min (blue points). The model-averaged result (red open square) shows good agreement with model truth in all cases. The lower inset shows the standard p-value (blue dashed line) and the model weight calculated from Eq. (44) (orange solid line). The four subfigures represent four random draws of correlated Gaussian noise, but are otherwise identical.
Figure 4: Scaling of various estimates of the ground-state energy E vs. the data sample size N. The true value (dashed line) is E = 0.8. The blue circles (model average using the AIC) show good consistency with the model truth and generally comparable error to fitting with fixed t_min = 14 (red squares). Using the full-width difference between all models with fit p-value above a fixed threshold gives a much more conservative systematic error estimate. The naive model probability pr(M|D) which omits the bias-correction term (silver triangles) also leads to significantly larger uncertainty due to overweighting of more complicated models as discussed in the text.

B. Practical example: QCD correlation functions
1. Masses from a two-point correlation function
We now consider the example of model averaging applied to a pion two-point correlation function from a real lattice QCD calculation. This correlator has been used in published work by the Fermilab Lattice and MILC collaborations [26]. The underlying gauge-field ensemble has a lattice spacing of a ≈ 0.09 fm and a pion mass of about 215 MeV. In this example, the correlation function was constructed using staggered fermions and corresponds to a pion with energy E_π ≈ 300 MeV.

The results of our procedure are shown in Fig. 5. The top pane shows the effective mass and the final result of model averaging, which is consistent with by-eye expectations. The oscillations in the effective mass are a familiar feature of staggered two-point correlation functions with nonzero momentum. The middle pane shows intermediate results from individual fits. The green band indicates the model-averaged result for the points shown. For clarity of presentation, the results in the middle pane are from fits with (1 + 1) states only, i.e., 1 decaying state and 1 oscillating state. We also tried fits including (2 + 2), (3 + 3), and (4 + 4) states. The only qualitative difference from including more states is that good fits are obtained for smaller t_min. The model average is unchanged. The bottom pane shows the model weights for the fits with (1 + 1) states. As expected, the weights peak in the middle and taper off at both ends. When t_min is small, the fit quality rapidly declines due to contributions from excited states. When t_min is large, Eq. (44) disfavors cutting too aggressively. The model-averaged result agrees with intermediate results that went into the analysis of Ref. [26] to better than 1σ [27].
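The generic functional form of the (n + n)-state staggered fit model, with n ordinary decaying states and n opposite-parity states entering with an alternating sign, can be sketched as below. This is a schematic of our own construction: a production fit would also include backward-propagating cosh/sinh terms from the finite time extent, priors, and so on, which we omit here.

```python
import numpy as np

def staggered_2pt(t, amps, energies, osc_amps, osc_energies):
    """(n + n)-state staggered two-point model: n decaying states plus n
    opposite-parity states entering with an alternating sign (-1)^t.
    A minimal sketch of the functional form only."""
    t = np.asarray(t)
    out = sum(A * np.exp(-E * t) for A, E in zip(amps, energies))
    out = out + (-1.0) ** t * sum(
        A * np.exp(-E * t) for A, E in zip(osc_amps, osc_energies))
    return out
```

With one state in each tower, (1 + 1), this reproduces the oscillating effective mass visible in the top pane of Fig. 5; adding more pairs of states extends the model toward smaller t_min.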
2. Matrix elements from three-point correlation functions
Model averaging also shows promising results for the extraction of more complicated matrix elements. In this example, we test the extraction of a K → π transition matrix element ⟨π|J|K⟩, again using correlation functions that were part of published work by the Fermilab Lattice and MILC collaborations [26]. In this case, the underlying gauge-field ensemble has a lattice spacing of about 0.12 fm, with pion and kaon masses of about 220 and 515 MeV, respectively. Staggered fermions were also used for these correlators. The calculation occurs in the rest frame of the kaon. The pion momentum has been adjusted to be near the point of zero recoil, q² ≡ (p_K − p_π)² ≈ 0. The methodology for extracting these matrix elements is complicated but relatively standard within the lattice community. The desired matrix element is the result of a joint correlated fit to two- and three-point functions. In order to visualize the result, it is standard to construct a ratio R(t, T) of two- and three-point functions whose asymptotic plateau is proportional to the bare lattice matrix element. Here T denotes the location of the sink operator which couples to the kaon. In conducting such a fit, the analyst is faced with several choices: the number of states in the pion channel (n + n), the number of states in the kaon channel (m + m), and the fit window t ∈ [t_min, T − t_min]. We refer the reader to Ref. [26] for additional details about fits like these.

Fig. 6 shows the result of model averaging for the matrix element. We adopt a flat prior model weight pr(M) = C for all choices of (m + m) and (n + n), which drops out of the model average. The top pane shows the ratio R(t, T) for two different sink locations alongside the result of model averaging. The middle pane shows intermediate fit results and the model average. The particular choices made for each of the fits are displayed along the horizontal axes, with the fit window displayed on top and the number of states on the bottom. For instance, the leftmost point used (n + n) = (3 + 3) states for the pion channel, (m + m) = (3 + 3) states for the kaon channel, and a fit window t ∈ [3, T − 3]. The model-averaged result agrees to better than 1σ with the published result of Ref. [26].

This example suggests another important application of the framework we are describing. A complete analysis of a lattice matrix element might consider a more general set of fits, e.g., with different numbers of decaying and oscillating states in each channel or with different t_min cuts for the source and sink. Scanning over all possibilities can easily produce tens or hundreds of individual fit results.
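To make the combinatorics concrete, the model space for a joint fit like this can be enumerated mechanically; even a modest grid of discrete choices multiplies out quickly. The ranges below are illustrative choices of ours, not the grid used in the text.

```python
from itertools import product

# Illustrative enumeration of the model space for the joint K -> pi fit:
# (n + n) pion states, (m + m) kaon states, and a symmetric fit window
# [t_min, T - t_min]. The ranges are our own example values.
n_states = [1, 2, 3, 4]       # pion channel: (n + n)
m_states = [1, 2, 3, 4]       # kaon channel: (m + m)
t_min_values = [3, 4, 5, 6]

models = [
    {"n": n, "m": m, "t_min": t_min}
    for n, m, t_min in product(n_states, m_states, t_min_values)
]
print(len(models))  # 64 fit variations from just three discrete choices
```

Each entry would be handed to the fitter in turn, with the resulting weights computed from Eq. (38) or Eq. (44) and combined in the model average.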
Finding an objective selection criterion for choosing a best-fit result can be difficult. The model weights in Eq. (37) and Eq. (38) offer a potential solution to this problem, particularly when used in conjunction with expert knowledge and the usual careful thinking.

V. CONCLUSION
We have presented a Bayesian approach to the problem of model averaging. The statistical methods we describe apply very generally, though our examples have focused on
Figure 5: Model averaging results for a pion two-point correlation function using staggered fermions. Top: the effective mass and the final result of model averaging; the oscillating contributions are from the opposite-parity states associated with staggered fermions. Middle: individual fits together with the result of model averaging. Bottom: model weights for the individual fits.

Figure 6: Model averaging results for a matrix element associated with a K → π transition form factor. Top: a ratio of two- and three-point correlation functions (whose plateau is proportional to the matrix element ⟨π|J|K⟩) and the final result of model averaging. Middle: individual fits together with the result of model averaging. Bottom: model weights for the individual fits.
practical problems in lattice gauge theory. The context for regression problems is rather exceptional in many lattice studies, since the models often rest on firm theoretical foundations. For instance, multi-exponential fits to correlators are based on the spectral decomposition, which only requires the existence of a positive-definite transfer matrix. Effective field theory governs extrapolations to the chiral, continuum, or heavy-mass limit. If a model fails to describe the data, the simulation itself is rightfully viewed with additional scrutiny. Hypothesis testing is typically less important than reliably extracting the values of parameters capturing the physics of interest. When predictions from nested models (say, the NLO versus NNLO predictions from effective field theory) differ slightly, it is important to be able to produce a final number with associated statistical and systematic uncertainties. Bayesian model averaging is an attractive approach to problems like these.

Two key practical results are the model-averaged mean and variance, Eq. (14) and Eq. (17). Both formulae rely on the model weights pr(M|D). In general, the model weights are defined through complicated integrals. However, analytic results are available in the Gaussian approximation, which is exact for linear least-squares fitting. For nonlinear least-squares fitting, the approximation is expected to become increasingly good for larger data sets.

The full GAP result of Eq. (37) involves the prior and posterior covariance matrices and is not free from the well-known Jeffreys-Lindley paradox. From a pragmatic perspective, we argued that these terms can be viewed as a finite-sample-size effect and be ignored using arguments from cross validation. Eq. (38) is the final expression used to construct the model weights used in the examples.
Since this expression is easily computable using just the familiar augmented χ² and the number of parameters in the model, it is easy to include and test in existing lattice analyses. However, it relies on taking an asymptotic limit in the sample size, and it would be interesting to study improved estimators at finite sample size in future work.

A particularly nice application of these ideas is data subset selection, which we recast as a model variation problem. The basic observation was to reinterpret cuts on the data as additional model parameters, with a model weight given by Eq. (44). As discussed, carrying out data subset selection with a fixed model avoids the complication of the Jeffreys-Lindley paradox due to cancellation of the associated contributions. The model averaging approach gives a straightforward way to replace the common practice of tuning such data subset cuts by hand.

Broadly speaking, perhaps the most attractive feature of Bayesian model averaging is the natural appearance of Occam's razor. The model weights appearing in Eqs. (37), (38) and (44) favor models which use the fewest parameters while describing the most data. Inclusion of an asymptotic bias correction to the estimated likelihood, which yields the AIC as a model-selection criterion in the limit of large sample size, is crucial to the occurrence of this effect.

An interesting direction to explore in future work would be to study improved estimators at finite sample size, rather than relying on the asymptotic result to estimate the model weights. This will necessarily involve careful treatment of the covariance matrix terms in Eq. (37). It would also be interesting to explore direct Monte Carlo evaluation of the integral in Eq. (28), although the bias correction should be studied carefully in the context of whatever specific approach is used.
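The two key practical results, bias-corrected model weights and the model-averaged mean and variance of Eq. (14) and Eq. (17), combine into a very small amount of code. The following is a minimal sketch with function names of our own choosing:

```python
import numpy as np

def model_weights(chi2, k, n_cut, log_prior=None):
    """Bias-corrected model weights from the modified AIC of Eq. (44).

    chi2  : best-fit chi^2 of each model (within its chosen data subset)
    k     : number of fit parameters of each model
    n_cut : number of data points cut away by each model's subset choice
    """
    chi2, k, n_cut = map(np.asarray, (chi2, k, n_cut))
    aic = chi2 + 2 * k + 2 * n_cut
    if log_prior is not None:       # -2 log pr(M); a flat prior drops out
        aic = aic - 2 * np.asarray(log_prior)
    log_w = -0.5 * aic
    log_w -= log_w.max()            # stabilize the exponentials
    w = np.exp(log_w)
    return w / w.sum()

def model_average(means, errors, weights):
    """Model-averaged mean and error. The variance combines the weighted
    statistical variances with the spread of the means across models."""
    means, errors, weights = map(np.asarray, (means, errors, weights))
    mu = np.sum(weights * means)
    var = np.sum(weights * errors**2) + np.sum(weights * means**2) - mu**2
    return mu, np.sqrt(var)
```

For example, two models with identical χ², k, and N_cut receive equal weight, and the averaged variance correctly inflates when the per-model means disagree by more than their individual errors.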
Studying the interplay of model averaging with resampling methods such as jackknife and bootstrap, commonly used in lattice analyses, would likely be a useful extension of this work.

A. Practical suggestions and warnings
Model averaging has performed well for us in many test cases. However, as with any statistical tool, the techniques we describe should not be used blindly. In particular, model averaging should not be used as a substitute for plotting data and fits and thinking carefully about the results [28].

A basic assumption underlying this technique is that only statistically correct results are included in the model average. Including results for fits that fail to converge numerically, for example, will likely result in incorrect answers. Incomplete treatment of autocorrelation effects in the data will similarly yield invalid statistical estimates and thus invalid model-averaged results.

Although this technique allows the data to remove much of the subjectivity from analyses including model variations, this does not extend to the choice of the model prior weights pr(M). In the absence of specific and strong beliefs about particular models, we advocate for the use of a flat prior, i.e. weighting all models equally in the prior. In particular, we emphasize that one should not attempt to impose parsimony through the model priors by overweighting models with fewer parameters; this principle is built in to the bias-corrected model weights as we have discussed.

The model weights of Eq. (37) and Eq. (38) are useful beyond model averaging à la Eq. (9). For instance, many lattice calculations oblige the analyst to make many choices beyond t_min. In this situation, the model weights can help guide the decision about which, say, dozen fit results (out of potentially hundreds) are most promising for further investigation and scrutiny using more familiar and established techniques. Model selection is equivalent to model averaging in the limit that a single model has very high probability of correctness; this situation can naturally emerge from the data analysis, as in the example shown in Fig. 6.

Model averaging may be especially useful in the context of fitting models that contain discrete degrees of freedom that are not amenable to standard numerical minimization procedures. For example, a multi-exponential model Σ_i A_i e^{−E_i t} in which the signs of the amplitudes A_i are a priori unknown could be studied with improved numerical stability by fixing the signs of all included amplitudes one by one, and then averaging together the results.

In certain cases, we have found that the systematic errors due to model truncation or variation can be significantly overestimated by more conservative methods. Revisiting old lattice analyses which are limited by systematic errors related to model variation may be worthwhile. On the other hand, in our tests we have found excellent agreement between correlator fits with few states and those with many states; the combination of model averaging with few-state fits could reduce problems related to numerical convergence and reduce the computational cost of fitting.

Acknowledgments
We thank the members of the Fermilab Lattice and MILC collaborations for generously sharing their correlator data with us for use in numerical tests. We are especially grateful to Elvira Gámiz for providing intermediate results and helping compare against published results. We also thank Andreas Kronfeld, Christoph Lehner, Taku Izubuchi, Tom DeGrand, Jason Chang, and Dan Hackett for helpful discussions and comments in various stages of this work. This work was supported in part by the U.S. Department of Energy (DOE), Office of Science, Office of High Energy Physics, under Award Number DE-SC0010005 (ETN). Fermilab is operated by the Fermi Research Alliance, LLC under contract No. DE-AC02-07CH11359 with the United States Department of Energy.

[1] E. E. Leamer, Specification searches: Ad hoc inference with nonexperimental data, vol. 53 (Wiley, New York, 1978).
[2] A. Racine, A. Grieve, H. Fluhler, and A. Smith, Applied Statistics pp. 93–150 (1986).
[3] E. I. George and R. E. McCulloch, Journal of the American Statistical Association, 881 (1993).
[4] R. E. Kass and A. E. Raftery, Journal of the American Statistical Association, 773 (1995).
[5] A. E. Raftery and Y. Zheng, Journal of the American Statistical Association, 931 (2003).
[6] D. V. Lindley, Biometrika, 187 (1957).
[7] M. S. Bartlett, Biometrika, 533 (1957).
[8] H. Jeffreys, The theory of probability (OUP Oxford, 1998).
[9] R. D. Cousins, Synthese, 395 (2017), 1310.3791.
[10] M. R. Schindler and D. R. Phillips, Annals Phys., 682 (2009), [Erratum: Annals Phys. 324, 2051–2055 (2009)], 0808.3643.
[11] C. Davies, K. Hornbostel, I. Kendall, G. Lepage, C. McNeile, J. Shigemitsu, and H. Trottier (HPQCD), Phys. Rev. D, 114507 (2008), 0807.1687.
[12] S. Dürr et al. (Budapest-Marseille-Wuppertal), Phys. Rev. D, 114504 (2014), 1310.3626.
[13] E. Berkowitz et al. (2017), 1704.01114.
[14] C. Chang et al., Nature, 91 (2018), 1805.12130.
[15] E. Rinaldi, S. Syritsyn, M. L. Wagman, M. I. Buchoff, C. Schroeder, and J. Wasem, Phys. Rev. D, 074510 (2019), 1901.07519.
[16] N. Miller et al. (2020), 2005.04795.
[17] C. Morningstar, Nucl. Phys. B Proc. Suppl., 185 (2002), hep-lat/0112023.
[18] A. Bazavov et al. (Fermilab Lattice, MILC), Phys. Rev. D, 074509 (2014), 1407.3772.
[19] G. Lepage, B. Clark, C. Davies, K. Hornbostel, P. Mackenzie, C. Morningstar, and H. Trottier, Nucl. Phys. B Proc. Suppl., 12 (2002), hep-lat/0110175.
[20] K. Takeuchi, Suri-Kagaku (Mathematical Sciences), 12 (1976).
[21] M. Stone, Journal of the Royal Statistical Society, Series B, 44 (1977).
[22] R. Shibata, in From Data to Model, edited by J. Willems (Springer-Verlag, New York, 1989), pp. 215–240.
[23] S. Konishi and G. Kitagawa, Biometrika, 875 (1996).
[24] H. Akaike, in Selected papers of Hirotugu Akaike (Springer, 1998), pp. 199–213.
[25] H. Akaike, Annals of the Institute of Statistical Mathematics, 9 (1978).
[26] A. Bazavov et al. (Fermilab Lattice, MILC), Phys. Rev. D, 114509 (2019), 1809.02827.
[27] E. Gámiz, private communication (2020).
[28] F. Anscombe, The American Statistician.
[29] M. Dixon and T. Ward (2018), 1803.04947.

Appendix A: Calculation of asymptotic bias for model weights

In this appendix we derive the asymptotic bias of the log-likelihood. The form of the bias is well known in the statistics literature, and it appears in the Takeuchi Information Criterion, a generalization of the well-known Akaike Information Criterion. What follows is not a tight mathematical proof, but rather an informal derivation designed to illustrate how the bias term arises. For technical details, we refer interested readers to the extensive original literature [20–23]. The introduction of Ref. [29] is particularly accessible.

The maximum likelihood estimator (MLE) a* is a consistent estimator of the asymptotic or "true" value â. However, the estimated log-likelihood function evaluated at a* is not a consistent estimator of the expected log-likelihood function. Roughly speaking, because the MLE maximizes the estimated log-likelihood, it tends to overshoot the expected log-likelihood. Consider the log-likelihood

log L(x; a) = log ∏_{i=1}^{n} L(x_i; a),   (A1)

where L(x_i; a) is the likelihood for a single sample x_i. The sample and population expectation values are defined according to

E_n[···] = (1/n) Σ_{i=1}^{n} (···) ≡ ⟨···⟩_n,   (A2)
E_z[···] = ∫ dz f(z) (···) ≡ ⟨···⟩_z,   (A3)

where f(z) is the population distribution from which the samples x_n are presumed to be drawn. Because a* and â maximize their respective log-likelihoods, they are solutions to the usual equations:

⟨∂_a log L(x_n | a = a*)⟩_n = 0,   (A4)
⟨∂_a log L(z | a = â)⟩_z = 0.   (A5)

Note that, for a fixed number of samples n, a* is a fixed number.
The sample Fisher information matrix I_N and the negative sample Hessian matrix J_N are defined as

I_{N,xy}(a) ≡ (1/N) Σ_{i=1}^{N} (∂ log L(x_i|a)/∂a_x)(∂ log L(x_i|a)/∂a_y)   (A6)
           = (1/(4N)) Σ_{i=1}^{N} (∂χ²_i/∂a_x)(∂χ²_i/∂a_y),   (A7)

J_{N,xy}(a) ≡ −(1/N) Σ_{i=1}^{N} ∂² log L(x_i|a)/∂a_x ∂a_y = (1/(2N)) Σ_{i=1}^{N} ∂²χ²_i/∂a_x ∂a_y.   (A8)

The bias in the log-likelihood is defined as the difference between its estimated and expected values,

b(a*(x_n)) ≡ ⟨ log L(x_n | a*(x_n)) − ⟨ log L(z | a*(x_n)) ⟩_z ⟩_n.   (A9)

We are interested in the behavior of this bias in the limit of many samples, n → ∞. To emphasize the dependence on the data, we have written a* = a*(x_n). To evaluate the bias explicitly and resolve the mixed expectation value, it helps to add and subtract terms judiciously:

b(a*(x_n)) = ( ⟨log L(x_n | a*(x_n))⟩_n − ⟨log L(x_n | â)⟩_n )
           + ( ⟨log L(x_n | â)⟩_n − ⟨log L(z | â)⟩_z )
           + ( ⟨log L(z | â)⟩_z − ⟨⟨log L(z | a*(x_n))⟩_z⟩_n ).   (A10)

This trivial rewriting pays immediate dividends. The first and third terms involve matching expectation values at nearby points and are amenable to Taylor expansion. As we will argue shortly, the second term vanishes.

Let us begin with the first term. Expanding around the MLE point a* gives

⟨log L(x_n | a*(x_n))⟩_n − ⟨log L(x_n | â)⟩_n = −(1/2) ⟨ (â − a*(x_n)) (∂² log L/∂a ∂a′) (â − a*(x_n)) ⟩_n   (A11)
  = −(1/2) (â − a*(x_n)) ⟨ ∂² log L/∂a ∂a′ ⟩_n (â − a*(x_n))   (A12)
  → +(1/2) tr[ I(â) J⁻¹(â) ]  as n → ∞.   (A13)

In the first equality, the linear term vanishes by the definition of a*, Eq. (A4). The final line follows in the large-n limit under suitable regularity conditions (for details, see Refs. [22, 23, 29]).

The second term vanishes. To see this, first observe that both terms are evaluated at the same fixed parameters â, which remain unchanged by the limit n → ∞. Next, note that the estimated log-likelihood converges point-by-point to the asymptotic distribution. Therefore, the difference vanishes in this limit. This argument fails when the MLE a*(x_n) is involved, since a*(x_n) itself moves with n.

Finally, we consider the third term, which contains population expectation values. In this case it is useful to expand around â, since the linear term will vanish by Eq. (A5):

⟨log L(z | â)⟩_z − ⟨⟨log L(z | a*(x_n))⟩_z⟩_n   (A14)
  = ⟨log L(z | â)⟩_z − ⟨⟨log L(z | â)⟩_z⟩_n + (1/2) ⟨(a* − â) J(â) (a* − â)⟩_n   (A15)
  = ⟨log L(z | â)⟩_z − ⟨log L(z | â)⟩_z + (1/2) ⟨(a* − â) J(â) (a* − â)⟩_n   (A16)
  = (1/2) ⟨(a* − â) J(â) (a* − â)⟩_n   (A17)
  → (1/2) tr[ I(â) J⁻¹(â) ]  as n → ∞.   (A18)

In the final line we have again taken the limit n → ∞ and invoked suitable regularity conditions. Combining results, we see that

b(a*(x_n)) → +(1/2) tr[I(â) J⁻¹(â)] + 0 + (1/2) tr[I(â) J⁻¹(â)]   (A19)
           = tr[I(â) J⁻¹(â)].   (A20)

As indicated, this result is evaluated at the (unknown) parameters â. However, since the sample matrices I_N(a*) and J_N(a*) are consistent estimators of I(â) and J(â), the bias may be evaluated using them instead [22, 23, 29].

For the sake of concreteness, the proof sketched here has used the MLE a*. However, a similar bias term is expected to be present quite generally. For instance, Theorem 2.1 of Ref. [23] proves the existence of bias for a more general class of estimators. Roughly speaking, the bias arises from finite-sample-size fluctuations in the data and not from the choice of the maximum likelihood estimator itself. Due to the generality of this bias term, we include the correction in the general formula Eq. (28) and not only in the Gaussian approximation.

So far the discussion has been for general log-likelihoods. Now we specialize to the case of least-squares fitting, where −2 log L(x_n; a) = χ²(x_n; a). The bias correction then amounts to the replacement χ²(x_n; a*) → χ²(x_n; a*) + 2 tr[I_N(a*) J_N⁻¹(a*)].
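The limiting result b → tr[I J⁻¹], which reduces to the AIC penalty k when the model is correctly specified (so that I ≈ J), can be checked numerically. The following sketch, entirely our own construction, computes the sample matrices I_N and J_N for a correctly specified linear least-squares model and verifies that the trace approaches k = 2.

```python
import numpy as np

# Numerical sketch of the Takeuchi bias term tr[I_N J_N^{-1}] for a linear
# least-squares model y = a0 + a1*t with known Gaussian noise. For a
# correctly specified model, I ~ J and the trace approaches k = 2,
# recovering the AIC penalty. All numbers here are illustrative.
rng = np.random.default_rng(1)
N, sigma = 20000, 0.5
t = rng.uniform(0.0, 1.0, N)
y = 1.0 + 2.0 * t + sigma * rng.standard_normal(N)

# Per-sample log-likelihood: log L_i = -(y_i - a0 - a1*t_i)^2 / (2 sigma^2)
X = np.stack([np.ones(N), t], axis=1)           # design matrix
a_star = np.linalg.lstsq(X, y, rcond=None)[0]   # MLE for Gaussian noise
resid = y - X @ a_star

# I_N: average outer product of the per-sample score vectors d log L_i / da
scores = (resid / sigma**2)[:, None] * X
I_N = scores.T @ scores / N

# J_N: negative average per-sample Hessian (parameter-independent here)
J_N = X.T @ X / (N * sigma**2)

bias = np.trace(I_N @ np.linalg.inv(J_N))
print(f"tr[I J^-1] = {bias:.3f}  (expect ~ k = 2)")
```

Under model misspecification, I and J differ and the trace departs from k, which is precisely the distinction between the Takeuchi and Akaike criteria.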