Sample size effects in multivariate fitting of correlated data
D. Toussaint and W. Freeman
Department of Physics, University of Arizona, Tucson, AZ 85721, USA (Dated: May 28, 2018)
Abstract
A common problem in analysis of experiments or in lattice QCD simulations is fitting a parameterized model to the average over a number of samples of correlated data values. If the number of samples is not infinite, estimates of the variance of the parameters ("error bars") and of the goodness of fit are affected. We illustrate these problems with numerical simulations, and calculate approximate corrections to the variance of the parameters for estimates made in the standard way from derivatives of the parameters' probability distribution as well as from jackknife and bootstrap estimates.
PACS numbers:

I. INTRODUCTION

A common problem in analysis of experiments or of Monte Carlo simulations is fitting a parameterized model to the average over a number of samples of correlated data values. In particular, lattice QCD calculations typically require fitting operator correlators, which are a function of distance between the operators, to sums of exponentials with unknown amplitudes and masses. If the number of samples is not infinite, estimates of the variance of the parameters ("error bars") and of the goodness of fit are affected. This can be viewed as a generalization of the well known rule "replace N by N − 1" used when estimating the variance of a distribution from a finite sample. We find that none of the standard methods (estimates from derivatives of the parameters' probability distribution, or from jackknife or bootstrap resampling) give unbiased estimates of the parameters' variance.

Many numerical simulation programs or experiments involve two or more stages of fitting, where the parameters resulting from the first stage are the data input to the second stage. For example, in computations of meson decay constants in lattice QCD the first stage involves fitting a correlator of meson operators to exponentials and extracting the mass and amplitude, and the second stage involves fitting these masses and amplitudes to functions of the quark masses and lattice spacings to allow extrapolation to the chiral and continuum limits. (See for example Refs. [1] and [2].) For example, in Ref. [1] about 600 hadron masses and amplitudes are computed in the first stage of fitting, and these 600 numbers and their (co)variances are in turn the data for the fitting in the second stage. While an unbiased estimate of the variance of the parameters is always welcome, it is particularly important in this case since many parameters in the first stage of fitting are used as inputs (data) in the second stage of fitting, and if their errors are systematically too large or too small the apparent goodness of fit in the second stage of fitting will be very good or very bad, respectively.

II. THE PROBLEM
We consider a problem where we need to fit a function of P parameters to an average of N samples, where each sample consists of D data points. We use subscript indices to label the components of the data vectors and superscript indices to label the samples. Thus $x_i^a$ is the $i$'th component of the $a$'th sample, with $0 \le i < D$ and $0 \le a < N$. Each sample is assumed to be normally distributed, but the different components of the D dimensional sample are generally correlated. Averages over samples will be denoted by overbars. We will need to imagine averaging over many trials of the experiment, and we will use angle brackets to denote such an average: $\langle \bar x_i \rangle$. So, for example

$\bar x_i = \frac{1}{N}\sum_a x_i^a \qquad \overline{x_i x_j} = \frac{1}{N}\sum_a x_i^a x_j^a \qquad \bar x_i\,\bar x_j = \frac{1}{N^2}\sum_a x_i^a \sum_b x_j^b$   (1)

The covariance matrix ("of the mean") for one trial is

$C_{ij} = \frac{1}{N}\left( \overline{x_i x_j} - \bar x_i \bar x_j \right) = \frac{1}{N^2}\sum_a x_i^a x_j^a - \frac{1}{N^3}\left(\sum_a x_i^a\right)\left(\sum_b x_j^b\right)$   (2)

This covariance matrix will fluctuate around the true covariance matrix, obtainable only in the limit $N \to \infty$. Note we use N instead of N − 1 in normalizing $C_{ij}$. For our purposes, the difference between these normalizations is best included with the other order 1/N effects to be discussed. Fit parameters $p_\alpha$, with $0 \le \alpha < P$, are obtained by minimizing

$\chi^2 = \left( \bar x_i - x_i^f(p_\alpha) \right)\left(C^{-1}\right)_{ij}\left( \bar x_j - x_j^f(p_\alpha) \right)$   (3)

where $x_i^f(p_\alpha)$ is the value of $x_i$ predicted by the model. As pointed out in Ref. [3], since we are stuck with estimates of the covariance matrix and the $\bar x_i$ obtained from the same samples, they are correlated.

First change to a convenient coordinate system (alas, available only in theory, not in practice). For the moment we assume that our fit model is good, so that the $x_i^f(p_\alpha)$ can be adjusted to equal the true averages of the $x_i$. Shift the coordinates so that $\langle x_i \rangle$ is zero. Then rotate the coordinates so that the true covariance matrix is diagonal, and rescale them so that $\langle (x_i^a)^2 \rangle = 1$. (So far, we have followed Ref. [3].) We now have $\langle x_i^a x_j^b \rangle = \delta_{ij}\delta^{ab}$, and the true covariance matrix is the unit matrix.

Make a further rotation so that the changes in the $x_i^f(p_\alpha)$ as the $p_\alpha$ vary around their true values are in the first P components, and so that the changes in the $x_i^f(p_\alpha)$ as the first parameter $p_0$ varies are in the first component. Now we can rescale $p_0$ so that $\partial p_0 / \partial \bar x_0 = 1$, which simply means that $p_0$ is the average $\bar x_0$. In doing this we have assumed that $p_0$ is linear enough in the $\bar x_i$ or that the fluctuations in the $\bar x_i$ are small enough.

In this basis, write the covariance matrix (from the data in this experiment) and its inverse in blocks,

$C \equiv \begin{pmatrix} U & V \\ V^T & W \end{pmatrix} \qquad C^{-1} \equiv \begin{pmatrix} A & B \\ B^T & E \end{pmatrix}$   (4)

where the matrices U and A are P by P, V and B are P by D − P, and W and E are D − P by D − P. Now $\chi^2$ is given by

$\chi^2 = \left( \bar x_i - x_i^f \right)\left(C^{-1}\right)_{ij}\left( \bar x_j - x_j^f \right),$   (5)

where only the first P components of $x_i^f$ are nonzero. For example, with two parameters

$\chi^2 = \left( \bar x_0 - x_0^f,\ \bar x_1 - x_1^f,\ \bar x_2, \ldots \right) \begin{pmatrix} A & B \\ B^T & E \end{pmatrix} \begin{pmatrix} \bar x_0 - x_0^f \\ \bar x_1 - x_1^f \\ \bar x_2 \\ \vdots \end{pmatrix}$   (6)

The $x_i^f$ are found from minimizing $\chi^2$:

$0 = \frac{\partial \chi^2}{\partial x_{i^*}^f} = 2 A_{i^* j^*}\left( \bar x_{j^*} - x_{j^*}^f \right) + 2 B_{i^* j'} \bar x_{j'}$   (7)

where, here and in many subsequent equations, starred indices run from 0 to P − 1 and primed indices from P to D − 1, and the factor of two comes from differentiating with respect to the $x_i^f$ on both sides of Eq. 5 and using the fact that $C^{-1}$ is symmetric.
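As an illustration of Eqs. (1)-(3), the following minimal Python sketch (not part of the original analysis; the toy model, array names and sizes are our own assumptions) builds the covariance matrix of the mean from N samples of D correlated data points and evaluates the correlated $\chi^2$ for a given parameter vector.

\begin{verbatim}
import numpy as np

def covariance_of_mean(samples):
    """samples: (N, D) array; returns the D x D covariance matrix 'of the mean',
    Eq. (2), normalized with 1/N rather than 1/(N-1)."""
    N = samples.shape[0]
    xbar = samples.mean(axis=0)                   # \bar x_i
    xx = samples.T @ samples / N                  # \overline{x_i x_j}
    return (xx - np.outer(xbar, xbar)) / N        # C_ij

def chi2(params, xbar, cov_inv, model):
    """Correlated chi^2 of Eq. (3): residuals contracted with the inverse
    covariance matrix of the mean."""
    r = xbar - model(params)
    return r @ cov_inv @ r

# toy example: D correlated points fit to a constant plus slope (hypothetical model)
rng = np.random.default_rng(0)
N, D = 100, 8
true = 1.0 + 0.1 * np.arange(D)
L = np.linalg.cholesky(0.5 * np.eye(D) + 0.5)     # induce correlations
data = true + rng.standard_normal((N, D)) @ L.T
C = covariance_of_mean(data)
Cinv = np.linalg.inv(C)
model = lambda p: p[0] + p[1] * np.arange(D)
print(chi2(np.array([1.0, 0.1]), data.mean(axis=0), Cinv, model))
\end{verbatim}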
This is solved by

$x_{i^*}^f = \bar x_{i^*} + A^{-1}_{i^* j^*} B_{j^* k'} \bar x_{k'}$   (8)

From $C C^{-1} = 1$ and Eq. 4,

$U A + V B^T = 1 \qquad U B + V E = 0 \qquad V^T A + W B^T = 0 \qquad V^T B + W E = 1$   (9)

Using the third of Eqs. 9, remembering that A and W are symmetric, we get an alternate form of Eq. 8:

$B = -A V W^{-1} \qquad A^{-1} B = -V W^{-1} \qquad x_{i^*}^f = \bar x_{i^*} - V_{i^* j'} W^{-1}_{j' k'} \bar x_{k'}$   (10)

From this equation we see that, in this basis, parameter number zero, $x_0^f$, does not depend on the other of the first P components, $\bar x_{i^*}$ with $1 \le i^* < P$. Thus the distribution of the parameters depends only on the combination $D - P \equiv d$.

Similarly for $\chi^2$:

$\chi^2 = \left( -\bar x^T B^T A^{-1},\ \bar x^T \right) \begin{pmatrix} A & B \\ B^T & E \end{pmatrix} \begin{pmatrix} -A^{-1} B \bar x \\ \bar x \end{pmatrix} = \bar x_{i'} \left( -B^T A^{-1} B + E \right)_{i'j'} \bar x_{j'}$   (11)

Multiply by $W^{-1} W = 1$ and use the third and fourth equations in 9:

$\chi^2 = \bar x\, W^{-1}\left( -W B^T A^{-1} B + W E \right)\bar x = \bar x\, W^{-1}\left( V^T A A^{-1} B + W E \right)\bar x = \bar x\, W^{-1}\left( V^T B + W E \right)\bar x = \bar x_{i'} W^{-1}_{i'j'} \bar x_{j'}$   (12)

But W is just the covariance matrix for the last D − P components of x in this basis, so the statistical properties of $\chi^2$ are exactly the same as a D − P dimensional problem with no fit parameters, and the distribution of $\chi^2$, as expected, depends only on the number of degrees of freedom, $d \equiv D - P$. We note that the distribution of $\chi^2$ (more properly, $T^2$) is known. Since it is important here and closely related to the estimates of parameter errors, we quote the result in Appendix I.
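A quick numerical check of Eq. (10), under the idealized assumptions of this section (zero true mean, unit true covariance), is sketched below; it is not from the original paper and the variable names are ours. It verifies that the $\chi^2$ minimizer for the first P components agrees with $\bar x_{i^*} - V_{i^*j'}W^{-1}_{j'k'}\bar x_{k'}$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
N, D, P = 50, 10, 3
x = rng.standard_normal((N, D))                  # idealized basis: unit true covariance
xbar = x.mean(axis=0)
C = (x.T @ x / N - np.outer(xbar, xbar)) / N     # Eq. (2)

# Block decomposition of Eq. (4)
V = C[:P, P:]
W = C[P:, P:]

# Eq. (10): the fitted values of the first P components
xf_blocks = xbar[:P] - V @ np.linalg.solve(W, xbar[P:])

# Eq. (8): the same thing written with blocks of the inverse covariance
Cinv = np.linalg.inv(C)
A, B = Cinv[:P, :P], Cinv[:P, P:]
xf_direct = xbar[:P] + np.linalg.solve(A, B @ xbar[P:])

print(np.allclose(xf_blocks, xf_direct))         # True
\end{verbatim}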
III. NUMERICAL EXAMPLE

To illustrate the effects of sample size, we begin with a numerical example, using the basis described above. In this example, N Gaussian distributed random data vectors with D = 25 were generated. The data was fit with P = 5 parameters, which are just the first P components of the average data vector. This was repeated for many trials. The black octagons in Fig. 1 show N times the variance (over trials) of one of the parameters, where the asymptotic value is one. These black octagons are the correct answer for the variance of the parameter, and this is the variance that we wish to estimate from our experiment, where we only have one trial to work with. We see that for finite N the parameters fluctuate by an amount larger than the asymptotic value.

We also show the average over trials of the variance estimated from derivatives of the parameter probability, the average over trials of the variance estimated from a single elimination jackknife analysis, and the average from a bootstrap analysis. For the jackknife and bootstrap, the plot contains average variances both for the case where the full sample covariance matrix was used in each resampling and where a new covariance matrix was made using the data in each jackknife or bootstrap sample. Red squares are average variances from the derivative method.

FIG. 1: Number of samples times the variance (square of the error) of a parameter in fitting correlated data, and averages over trials of several methods for estimating this variance. The horizontal line indicates the asymptotic value, variance$(x_0^f) = 1/N$. N is the number of samples; the meaning of the plot symbols is described in the text.

We see that the actual variance of the parameter is larger than 1/N, and that the various methods for estimating this variance produce biased estimates of the variance of the parameter.
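A minimal sketch of this numerical experiment (our own reconstruction, not the authors' code; the trial counts and random seeds are arbitrary) is given below. It measures the actual variance of $x_0^f$ over trials and the average of the derivative estimate $(A^{-1})_{00}$, both multiplied by N as in Fig. 1.

\begin{verbatim}
import numpy as np

def one_trial(rng, N, D, P):
    """One trial in the idealized basis: i.i.d. unit-variance data,
    parameter 0 fit as in Eq. (10), plus the derivative error estimate."""
    x = rng.standard_normal((N, D))
    xbar = x.mean(axis=0)
    C = (x.T @ x / N - np.outer(xbar, xbar)) / N          # Eq. (2)
    U, V, W = C[:P, :P], C[:P, P:], C[P:, P:]
    xf0 = xbar[0] - V[0] @ np.linalg.solve(W, xbar[P:])   # Eq. (10), parameter 0
    var_deriv = (U - V @ np.linalg.solve(W, V.T))[0, 0]   # (A^{-1})_{00}, Eq. (35)
    return xf0, var_deriv

rng = np.random.default_rng(2)
D, P, trials = 25, 5, 2000
for N in (30, 50, 100, 200):
    results = np.array([one_trial(rng, N, D, P) for _ in range(trials)])
    actual = N * results[:, 0].var()        # black octagons: N x variance over trials
    estimate = N * results[:, 1].mean()     # average derivative estimate
    print(N, actual, estimate)
\end{verbatim}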
IV. LARGE N EXPANSION

Most of the effects shown in Fig. 1 can be understood analytically. We can expand the covariance matrix in each trial around its true value,

$C_{ij} = \frac{1}{N}\left\{ \delta_{ij} + \left( \overline{x_i x_j} - \delta_{ij} - \bar x_i \bar x_j \right) \right\}$   (13)

Here the term in parentheses has fluctuations of order $1/\sqrt{N}$ and an average of order 1/N. Thus its square will also have expectation value $\approx 1/N$. Then

$C^{-1}_{ij} = N\delta_{ij} - N\left( \overline{x_i x_j} - \delta_{ij} - \bar x_i \bar x_j \right) + N\left( \overline{x_i x_k} - \delta_{ik} - \bar x_i \bar x_k \right)\left( \overline{x_k x_j} - \delta_{kj} - \bar x_k \bar x_j \right) + \ldots$   (14)

Using the fact that integrals of polynomials weighted by Gaussians are found by pairing the $x_i^a$ in all possible ways, or making all possible contractions, we can develop rules for calculating these expectation values. We will use parentheses to list the pairings. For example, with (12) indicating that the first and second x are paired,

$\langle \bar x_i \bar x_i \bar x_j \bar x_j \rangle = \bar x_i \bar x_i \bar x_j \bar x_j \big|_{(12)(34)} + \bar x_i \bar x_i \bar x_j \bar x_j \big|_{(13)(24)} + \bar x_i \bar x_i \bar x_j \bar x_j \big|_{(14)(23)}$   (15)

Using $\langle x_i^a x_j^b \rangle = \delta_{ij}\delta^{ab}$ and $\bar x_i = \frac{1}{N}\sum_a x_i^a$ we get the Feynman rules for contractions of barred quantities.

1. Each contraction gives a $\delta_{ij}$ for the lower indices it connects.
2. Each bar gives a $\frac{1}{N}$, whether it covers a single x or two, $\bar x_i$ or $\overline{x_i x_j}$ (see Eq. 1).
3. Each continuous line made of overbars and contraction symbols gives a factor of N. This is from the $\sum_{ab\ldots}\delta^{ab}\delta^{bc}\ldots$, which has N nonzero terms. For example, $\bar x_i\, \overline{x_j x_k}\, \bar x_l \big|_{(12)(34)}$ is one continuous line, while $\bar x_i\, \overline{x_j x_k}\, \bar x_l \big|_{(14)(23)}$ is two lines (one is a loop). This results in every loop giving an extra factor of N relative to other contractions with the same number of fields.

Since an open line (not a loop) with C contractions has 2C x's and C + 1 bars, but a loop with C contractions has 2C x's and C bars, these rules can be rephrased as:

1. Each contraction gives a $\delta_{ij}$ for the lower indices it connects.
2. Each x gives a factor of $1/\sqrt{N}$.
3. Each loop gives a factor of N.

In the expansion of $C^{-1}$ we find the combination $\overline{x_i x_j} - \delta_{ij} - \bar x_i \bar x_j$, which we will denote by $\overline{\overline{x_i x_j}}$. This occurs frequently enough that we should state special rules for it. In evaluating an expression containing $\overline{\overline{x_i x_j}}$ there will be contractions where the $x_i$ and $x_j$ in $\overline{x_i x_j}$ are contracted with each other. These contractions just cancel the $\delta_{ij}$. The terms with $\bar x_i$ and $\bar x_j$ contracted give a $-\delta_{ij}/N$. Thus a "tadpole" where $\overline{\overline{x_i x_j}}\big|_{(12)}$ contracts with itself just gives a $-\delta_{ij}/N$. (This includes the $N^{-1/2}$ from each of the x's.)

Now consider terms where $\overline{\overline{x_i x_j}}$ is part of an open line, like

$\bar x_i\ \overline{\overline{x_i x_j}}\ \bar x_j \big|_{(12)(34)}$   (16)

In this case the $\overline{x_i x_j}$ and the $-\bar x_i \bar x_j$ cancel, so $\overline{\overline{x_i x_j}}$ can never be part of an open line. But if this object is part of a loop, like in

$\overline{\overline{x_i x_j}}\ \overline{\overline{x_k x_l}} \big|_{(13)(24)}$   (17)

the $\overline{x_i x_j}$ part is part of the loop, but the $-\bar x_i \bar x_j$ part breaks the loop. Thus the four paths hidden in these double bars give

$\frac{N-1}{N^2}\,\delta_{ik}\delta_{jl}$   (18)

Similarly, a loop of three double bars gives $2^3 = 8$ terms; to leading order a loop of n of the $\overline{\overline{x_i x_j}}$'s gives a factor of $N^{1-n}$ times $\delta$'s.

As trivial examples,

$\langle \bar x_i \bar x_j \rangle = \bar x_i \bar x_j \big|_{(12)} = \frac{1}{N}\,\delta_{ij}$   (19)

$\langle \overline{x_i x_j} \rangle = \overline{x_i x_j} \big|_{(12)} = \delta_{ij}$   (20)

$\langle N C_{ij} \rangle = \delta_{ij} + \overline{\overline{x_i x_j}} \big|_{(12)} = \left( 1 - \frac{1}{N} \right)\delta_{ij}$   (21)

We are also interested in the variances of averaged quantities. For the variance of something, $\mathrm{var}(X) = \langle X^2 \rangle - \langle X \rangle^2$, we need the "connected part" of $\langle X^2 \rangle$.
We use a vertical bar to denote this, and we only need contractions where some of the lines cross the bar. As an example, for the variance of an arbitrary element of C to lowest order,

$\langle \mathrm{var}(C_{ij}) \rangle = \langle C_{ij}^2 \rangle - \langle C_{ij} \rangle^2 \quad (\text{no sum on } ij)$   (22)

$= \frac{1}{N^2}\left\langle \overline{\overline{x_i x_j}}\ \Big|\ \overline{\overline{x_i x_j}} \right\rangle_{NS}$   (23)

$= \frac{1}{N^2}\left( \overline{\overline{x_i x_j}}\ \Big|\ \overline{\overline{x_i x_j}} \right)_{(13)(24)+(14)(23)} = \frac{N-1}{N^4}\left( \delta_{ii}\delta_{jj} + \delta_{ij}\delta_{ij} \right)_{NS} = \begin{cases} \frac{1}{N^3}\,\frac{N-1}{N}, & i \ne j \\ \frac{2}{N^3}\,\frac{N-1}{N}, & i = j \end{cases}$   (24)

The last line is written to display that the fractional variance of a diagonal element is 2/N.

In equations where the components are separated into starred indices, $0 \le i^* < P$, and primed indices, $P \le i' < D$, contractions of primed with starred indices are zero, contractions of starred with starred indices give delta functions with $\delta_{i^*i^*} = P$, and primed with primed use $\delta_{i'i'} = D - P$.
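The following short Monte Carlo check (ours, not the authors'; the sizes are arbitrary) verifies the tadpole result $\langle N C_{ij}\rangle = (1 - 1/N)\,\delta_{ij}$ of Eq. (21) and the approximate 2/N fractional variance of a diagonal element from Eq. (24).

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
N, D, trials = 20, 4, 50000

x = rng.standard_normal((trials, N, D))
xbar = x.mean(axis=1)                                  # (trials, D)
xx = np.einsum('tni,tnj->tij', x, x) / N               # \overline{x_i x_j}
C = (xx - np.einsum('ti,tj->tij', xbar, xbar)) / N     # Eq. (2), one C per trial

print("<N C_00>          :", N * C[:, 0, 0].mean(), "vs", 1 - 1 / N)           # Eq. (21)
print("frac. var of C_00 :", C[:, 0, 0].var() / C[:, 0, 0].mean() ** 2,
      "vs approx.", 2 / N)                                                      # Eq. (24)
\end{verbatim}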
V. VARIANCE (AND HIGHER MOMENTS) OF THE PARAMETERS

In this section we examine the variances of the parameters, that is, the error bars on our answers. First we calculate how much the parameters actually vary over many trials of the experiment. Then we calculate the average of common ways of estimating this variance: from derivatives of the probability, from an "eliminate J" jackknife analysis, or from a bootstrap resampling (using either the covariance matrix from the full sample, or a new covariance matrix made from each jackknife or bootstrap resample). The differences allow us to find and correct for bias in our error estimates resulting from the finite sample size.

For the actual variance of our parameters, use Eq. 10. Since we are in a coordinate system where the average of this quantity is zero, we don't need to worry about taking the connected part.

$\left\langle x_0^f x_0^f \right\rangle = \langle \bar x_0 \bar x_0 \rangle - 2\left\langle \bar x_0 V_{0j'} W^{-1}_{j'k'} \bar x_{k'} \right\rangle + \left\langle \bar x_{j'} W^{-1}_{j'k'} V^T_{k'0} V_{0m'} W^{-1}_{m'n'} \bar x_{n'} \right\rangle$   (25)

Since V and W are made entirely of double bars and can therefore only be part of a loop, and primed indices can't contract with index zero, the middle term (cross term) is zero. We compute this to order $1/N^3$.

$\left\langle x_0^f x_0^f \right\rangle = \langle \bar x_0 \bar x_0 \rangle + \Big\langle \bar x_{j'} \left( \delta_{j'm'} - \overline{\overline{x_{j'} x_{m'}}} + \overline{\overline{x_{j'} x_{p'}}}\ \overline{\overline{x_{p'} x_{m'}}} - \ldots \right) \left( \overline{\overline{x_{m'} x_0}} \right)\left( \overline{\overline{x_0 x_{n'}}} \right) \left( \delta_{n'k'} - \overline{\overline{x_{n'} x_{k'}}} + \overline{\overline{x_{n'} x_{r'}}}\ \overline{\overline{x_{r'} x_{k'}}} - \ldots \right) \bar x_{k'} \Big\rangle$   (26)

The leading term, $\langle \bar x_0 \bar x_0 \rangle$, is just 1/N. The term with six x's has only one contraction:

$\bar x_{j'}\ \overline{\overline{x_{j'} x_0}}\ \overline{\overline{x_0 x_{k'}}}\ \bar x_{k'} \big|_{(16)(25)(34)} = \frac{N-1}{N^3}\, d$   (27)

where $d \equiv D - P$. There are two equal terms with eight x's. There are three nonzero contractions of this term. Ignoring the $1/N^2$ parts, these are

$-2\,\bar x_{j'}\ \overline{\overline{x_{j'} x_{m'}}}\ \overline{\overline{x_{m'} x_0}}\ \overline{\overline{x_0 x_{k'}}}\ \bar x_{k'} \big|_{(18)(27)(34)(56)} = -\frac{2}{N^3}\,\delta_{j'k'}\delta_{j'k'}\delta_{m'm'}\delta_{00}$

$-2\,\bar x_{j'}\ \overline{\overline{x_{j'} x_{m'}}}\ \overline{\overline{x_{m'} x_0}}\ \overline{\overline{x_0 x_{k'}}}\ \bar x_{k'} \big|_{(18)(23)(47)(56)} = +\frac{2}{N^3}\,\delta_{j'k'}\delta_{j'm'}\delta_{m'k'}\delta_{00}$

$-2\,\bar x_{j'}\ \overline{\overline{x_{j'} x_{m'}}}\ \overline{\overline{x_{m'} x_0}}\ \overline{\overline{x_0 x_{k'}}}\ \bar x_{k'} \big|_{(18)(24)(37)(56)} = -\frac{2}{N^3}\,\delta_{j'k'}\delta_{j'm'}\delta_{m'k'}\delta_{00}$   (28)

Here the plus sign on the second contraction comes from the tadpole. The second and third contractions cancel, so we just have

$-\frac{2}{N^3}\, d^2$   (29)

To order $1/N^3$ we only need two loop contractions from the terms with ten x's. There are three such terms, but two of them are equal.

$\left\langle \bar x_{j'}\ \overline{\overline{x_{j'} x_{m'}}}\ \overline{\overline{x_{m'} x_0}}\ \overline{\overline{x_0 x_{n'}}}\ \overline{\overline{x_{n'} x_{k'}}}\ \bar x_{k'} \right\rangle + 2 \left\langle \bar x_{j'}\ \overline{\overline{x_{j'} x_{p'}}}\ \overline{\overline{x_{p'} x_{m'}}}\ \overline{\overline{x_{m'} x_0}}\ \overline{\overline{x_0 x_{k'}}}\ \bar x_{k'} \right\rangle$   (30)

Each term has two contractions that contribute at this order; summing them gives

$\left\langle \bar x_{j'}\ \overline{\overline{x_{j'} x_{m'}}}\ \overline{\overline{x_{m'} x_0}}\ \overline{\overline{x_0 x_{n'}}}\ \overline{\overline{x_{n'} x_{k'}}}\ \bar x_{k'} \right\rangle_{\rm loops} = \frac{1}{N^3}\left( d^2 + d \right)$   (31)

$2\left\langle \bar x_{j'}\ \overline{\overline{x_{j'} x_{p'}}}\ \overline{\overline{x_{p'} x_{m'}}}\ \overline{\overline{x_{m'} x_0}}\ \overline{\overline{x_0 x_{k'}}}\ \bar x_{k'} \right\rangle_{\rm loops} = \frac{2}{N^3}\left( d^2 + d \right)$   (32)

Putting it all together,

$\left\langle x_0^f x_0^f \right\rangle = \frac{1}{N} + \frac{N-1}{N^3}(d) - \frac{2}{N^3}(d^2) + \frac{3}{N^3}(d^2 + d) = \frac{1}{N} + \frac{d}{N^2} + \frac{d(d+2)}{N^3} + \ldots$   (33)

Thus the fluctuations in the parameters are larger than the asymptotic value 1/N obtained from the covariance matrix.

As noted above, the probability distribution of the parameters is not exactly Gaussian. Higher moments of this distribution can be obtained in the same way. At leading order in 1/N there is only one independent diagram for the connected part of each moment, and we find, for M even,

$\left\langle \left( x_0^f \right)^M \right\rangle_{\rm connected} = \frac{(D - P)\,(M - 1)!}{N^M}$   (34)
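The statement that the excess variance depends only on $d = D - P$, with the size given by Eq. (33), can be checked numerically; the sketch below (ours, with arbitrary sizes and trial counts) compares two problems with different D and P but the same d.

\begin{verbatim}
import numpy as np

def measured_variance(rng, N, D, P, trials):
    """N times the variance over trials of parameter 0, fit as in Eq. (10)."""
    vals = np.empty(trials)
    for t in range(trials):
        x = rng.standard_normal((N, D))
        xbar = x.mean(axis=0)
        C = (x.T @ x / N - np.outer(xbar, xbar)) / N
        V, W = C[:P, P:], C[P:, P:]
        vals[t] = xbar[0] - V[0] @ np.linalg.solve(W, xbar[P:])
    return N * vals.var()

rng = np.random.default_rng(4)
N, trials, d = 100, 6000, 10
print(measured_variance(rng, N, 25, 15, trials))   # D - P = 10
print(measured_variance(rng, N, 12, 2, trials))    # same d, different D and P
print(1 + d / N + d * (d + 2) / N**2)              # N times Eq. (33)
\end{verbatim}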
VI. ESTIMATES OF THE PARAMETERS' VARIANCE

In practice, the most common method for estimating the variance of the parameters is to use the covariance matrix for the parameters. (See, for example, Ref. [5].) In our coordinate system, this matrix is just $A^{-1}$, and our estimate for the variance of parameter zero is $(A^{-1})_{00}$. Using the third and first of Eqs. 9,

$B^T = -W^{-1} V^T A \qquad U A - V W^{-1} V^T A = 1 \qquad A^{-1} = U - V W^{-1} V^T$   (35)

Then, our estimate for the variance of parameter zero is

$\mathrm{var}(x_0^f)_{\rm derivative} = \left(A^{-1}\right)_{00} = U_{00} - V_{0k'} W^{-1}_{k'l'} V^T_{l'0} = \frac{1}{N}\left( \delta_{00} + \overline{\overline{x_0 x_0}} \right) - \frac{1}{N}\left( \overline{\overline{x_0 x_{k'}}} \right)\left( \delta_{k'l'} - \overline{\overline{x_{k'} x_{l'}}} + \overline{\overline{x_{k'} x_{m'}}}\ \overline{\overline{x_{m'} x_{l'}}} - \ldots \right) \overline{\overline{x_{l'} x_0}}$   (36)

For the order 1/N correction we only need the $\delta_{k'l'}$ from $W^{-1}$, and find

$N\,\mathrm{var}(x_0^f)_{\rm derivative} = \delta_{00} + \overline{\overline{x_0 x_0}}\big|_{(12)} - \overline{\overline{x_0 x_{k'}}}\ \overline{\overline{x_{k'} x_0}}\big|_{(14)(23)} = 1 - \frac{1}{N} - \frac{N-1}{N^2}(D - P) = 1 - \frac{1}{N}(1 + D - P) + \ldots$   (37)

The order $1/N^2$ contribution to this estimate vanishes, as sketched in Appendix II. If D = P = 1 this is just $N\langle C_{00} \rangle = 1 - \frac{1}{N}$, the standard correction for a simple average, reflecting our normalization of the covariance matrix. Comparing to the desired result in Eq. 33, we see that this is an underestimate of the variance of the parameters. The difference between this error estimate and the correct one above is that this estimate assumes that the covariance matrix remains fixed while the data points vary, while the correct answer takes into account the correlations between the data points and the covariance matrix (constructed from these same data points).
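A numerical check of the bias in Eq. (37) (our sketch; the sizes are arbitrary): average $(A^{-1})_{00}$ over many trials and compare N times that average with $1 - (1 + D - P)/N$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
N, D, P, trials = 40, 12, 4, 20000

est = np.empty(trials)
for t in range(trials):
    x = rng.standard_normal((N, D))
    xbar = x.mean(axis=0)
    C = (x.T @ x / N - np.outer(xbar, xbar)) / N
    U, V, W = C[:P, :P], C[:P, P:], C[P:, P:]
    est[t] = (U - V @ np.linalg.solve(W, V.T))[0, 0]   # (A^{-1})_{00}, Eq. (35)

print(N * est.mean())              # biased: sits below one
print(1 - (1 + D - P) / N)         # prediction of Eq. (37)
\end{verbatim}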
VII. VARIANCE OF JACKKNIFE AND BOOTSTRAP PARAMETERS

The variance of the parameters is also often estimated by a jackknife or bootstrap analysis. In these methods the fit is repeated many times using subsets of the data sample, and the variance of the parameters is estimated from the variance over the jackknife or bootstrap samples. Both the jackknife and bootstrap can be done either using the covariance matrix from the full sample in fitting each jackknife or bootstrap sample, or by remaking a covariance matrix for each resample. Using the full sample covariance matrix amounts to seeing how the parameters vary with fixed covariance matrix, that is, by varying $\bar x_{i^*}$ and $\bar x_{k'}$ in Eq. 8 with $A^{-1}_{i^*j^*} B_{j^*k'}$ held fixed. This is the same question as is answered by $\mathrm{var}(x_0^f)_{\rm derivative}$ in Eq. 37. Since the change in the parameters is linear in $\bar x_{i^*}$ and $\bar x_{k'}$, it doesn't matter if the $\bar x_i$ are varied infinitesimally (by taking derivatives) or slightly (jackknife) or fully (bootstrap). In this case, the variance of the parameters will have the same bias as does Eq. 37, and no new calculation is necessary, although there is a slight difference due to the normalization of the covariance matrix used here.

Remaking the covariance matrix for each resample includes correlations of the covariance matrix and data, but not in quite the desired way. The calculations above can be extended to calculate the expectation value of the parameter variance for the jackknife analysis in which the covariance matrix is recomputed for each jackknife sample. An "eliminate J" jackknife consists of making N/J resamples, each omitting J data vectors (numbers nJ through (n + 1)J − 1), leaving $N_J \equiv N - J$ elements. We will denote averages in the n'th jackknife sample with a superscript (n). The average of x in the n'th jackknife sample is

$\bar x^{(n)} = \frac{1}{N_J}\sum_{a \in (n)} x^a$   (38)

where J data vectors (starting with number nJ) were deleted from the full sample. The variance of this quantity (over the jackknife samples) is

$\frac{J}{N(N - J)},$   (39)

so we generally multiply the variance over the jackknife samples by $\frac{N - J}{J}$ to get the expected variance of the mean, $\frac{1}{N}$.

We now compute the variance of the parameters in the jackknife fits. In doing this we will need averages of products of quantities from different jackknife ensembles. Without losing generality, we can think of these as ensembles number zero and one, which differ only in their first J data elements. Thus, expectation values of sums over values in different ensembles may produce factors of $N_J - J$ instead of $N_J$, where $N_J$ is the number of samples in the jackknife, and is really $N - J$. ($N_J - J = N - 2J$ is the number of samples in common between two different jackknife resamples.)

For example, using (n) to denote quantities in the n'th jackknife sample ($x_j^{a(n)}$ is the j'th component of the a'th data vector in jackknife sample (n)), for $n \ne m$,

$\left\langle \bar x_j^{(n)} \bar x_k^{(m)} \right\rangle = \left\langle \frac{1}{N_J}\sum_a x_j^{a(n)}\ \frac{1}{N_J}\sum_b x_k^{b(m)} \right\rangle = \frac{1}{N_J^2}\sum_{ab} \delta_{jk}\,\delta^{ab} = \frac{N_J - J}{N_J^2}\,\delta_{jk}$   (40)

where we define $\delta^{ab} = 1$ only when a = b and the data vector is common to both resamples; the restriction on a and b gives a factor of $N_J - J$ instead of $N_J$.

From Eq. 10, parameter 0 in jackknife fit (n) is

$x_0^{f(n)} = \bar x_0^{(n)} - V^{(n)}_{0j'} W^{-1(n)}_{j'k'} \bar x_{k'}^{(n)}$   (41)

and the variance of this parameter over the jackknife samples is

$\mathrm{var}_J(x_0^f) = \left\langle \left[ \bar x_0^{(n)} - \bar x_{j'}^{(n)} W^{-1(n)}_{j'i'} V^{T(n)}_{i'0} - \frac{J}{N}\sum_m \left( \bar x_0^{(m)} - \bar x_{k'}^{(m)} W^{-1(m)}_{k'i'} V^{T(m)}_{i'0} \right) \right] \left[ \bar x_0^{(n)} - V^{(n)}_{0j'} W^{-1(n)}_{j'k'} \bar x_{k'}^{(n)} - \frac{J}{N}\sum_p \left( \bar x_0^{(p)} - V^{(p)}_{0j'} W^{-1(p)}_{j'l'} \bar x_{l'}^{(p)} \right) \right] \right\rangle$   (42)

This is more complicated than Eq. 25 because the mean over jackknife samples is not exactly zero. Also, the sums over sample vectors now sometimes give $N_J$ and sometimes $N_J - J$, so some of the shortcuts developed above won't work any more. Note the J/N is correct: there are N/J jackknife resamples, each containing $N_J = N - J$ elements.

In Eq. 42 the sums contain terms where n = m and terms where $n \ne m$. Separate the diagonal and off-diagonal terms in the sums, and use the fact that all non-diagonal terms are equal, $\sum_m$ contains N/J − 1 terms with $m \ne n$, and $\sum_{mp}$ has N/J diagonal terms and (N/J)(N/J − 1) off diagonal:

$\mathrm{var}_J(x_0^f) = \Big\langle \left( 1 - \tfrac{J}{N} \right) \bar x_0^{(n)} \bar x_0^{(n)} - \left( 1 - \tfrac{J}{N} \right) \bar x_0^{(n)} \bar x_0^{(m)} - 2\left( 1 - \tfrac{J}{N} \right) \bar x_0^{(n)} V^{(n)}_{0i'} W^{-1(n)}_{i'k'} \bar x_{k'}^{(n)} + 2\left( 1 - \tfrac{J}{N} \right) \bar x_0^{(n)} V^{(m)}_{0i'} W^{-1(m)}_{i'k'} \bar x_{k'}^{(m)} + \left( 1 - \tfrac{J}{N} \right) \bar x_{j'}^{(n)} W^{-1(n)}_{j'i'} V^{T(n)}_{i'0} V^{(n)}_{0l'} W^{-1(n)}_{l'k'} \bar x_{k'}^{(n)} - \left( 1 - \tfrac{J}{N} \right) \bar x_{j'}^{(n)} W^{-1(n)}_{j'i'} V^{T(n)}_{i'0} V^{(m)}_{0l'} W^{-1(m)}_{l'k'} \bar x_{k'}^{(m)} \Big\rangle$   (43)

where $n \ne m$. To evaluate this expression we need:

(a) $\left\langle \bar x_0^{(n)} \bar x_0^{(n)} \right\rangle$
(b) $\left\langle \bar x_0^{(n)} \bar x_0^{(m)} \right\rangle$, $n \ne m$
(c) $\left\langle \bar x_0^{(n)} V^{(n)}_{0i'} W^{-1(n)}_{i'k'} \bar x_{k'}^{(n)} \right\rangle$
(d) $\left\langle \bar x_0^{(n)} V^{(m)}_{0i'} W^{-1(m)}_{i'k'} \bar x_{k'}^{(m)} \right\rangle$, $n \ne m$
(e) $\left\langle \bar x_{j'}^{(n)} W^{-1(n)}_{j'i'} V^{T(n)}_{i'0} V^{(n)}_{0l'} W^{-1(n)}_{l'k'} \bar x_{k'}^{(n)} \right\rangle$
(f) $\left\langle \bar x_{j'}^{(n)} W^{-1(n)}_{j'i'} V^{T(n)}_{i'0} V^{(m)}_{0l'} W^{-1(m)}_{l'k'} \bar x_{k'}^{(m)} \right\rangle$, $n \ne m$
(44)

Here (a), (c) and (e), which involve only jackknife sample (n), are the same as in the previous section with the replacement of N by $N_J$. Because $W^{-1(n)}$ and $V^{(n)}$ consist only of double-barred quantities which can't be part of an open line, and primed and unprimed indices can't contract, (c) vanishes. For (d) we can imagine expanding all the $\overline{\overline{x_i x_j}}^{(m)}$'s into pieces, $\left( \overline{x_i x_j}^{(m)} - \delta_{ij} - \bar x_i^{(m)} \bar x_j^{(m)} \right)$, and making all contractions. There is only one factor of x from jackknife sample (n), which must contract with something from (m). Thus, all of these terms differ from (c) by replacement of exactly one factor of $N_J$ by $N_J - J$, and therefore also sum to zero.

Similarly (e) and (f) differ by the replacement of one or more factors of $N_J$ by $N_J - J$. Thus, their difference will be one order in $\frac{J}{N}$ less than their value. This means that to get the first correction to the asymptotic form, we need only keep the lowest order term in part (e), and the analogous contraction for part (f).

(a) $\left\langle \bar x_0^{(n)} \bar x_0^{(n)} \right\rangle = \frac{1}{N_J^2}\sum_{ab} \left\langle x_0^{a(n)} x_0^{b(n)} \right\rangle = \frac{1}{N_J}$

(b) $\left\langle \bar x_0^{(n)} \bar x_0^{(m)} \right\rangle = \frac{1}{N_J^2}\sum_{ab} \left\langle x_0^{a(n)} x_0^{b(m)} \right\rangle = \frac{N_J - J}{N_J^2}$

(e) $\left\langle \bar x_{j'}^{(n)}\ \overline{\overline{x_{j'} x_0}}^{(n)}\ \overline{\overline{x_0 x_{k'}}}^{(n)}\ \bar x_{k'}^{(n)} \right\rangle = \bar x_{j'}^{(n)}\ \overline{\overline{x_{j'} x_0}}^{(n)}\ \overline{\overline{x_0 x_{k'}}}^{(n)}\ \bar x_{k'}^{(n)} \big|_{(16)(25)(34)} = \frac{N_J - 1}{N_J^3}\,(D - P)$

(f) $\left\langle \bar x_{j'}^{(n)}\ \overline{\overline{x_{j'} x_0}}^{(n)}\ \overline{\overline{x_0 x_{k'}}}^{(m)}\ \bar x_{k'}^{(m)} \right\rangle = \bar x_{j'}^{(n)}\ \overline{\overline{x_{j'} x_0}}^{(n)}\ \overline{\overline{x_0 x_{k'}}}^{(m)}\ \bar x_{k'}^{(m)} \big|_{(16)(25)(34)} = \frac{N_J - 2J - \ldots}{N_J^3}\,(D - P)$
(45)

Here (a), (c) and (e) are the same as in the previous section. To evaluate (f) we need to separate the two terms in the double overbar (the delta function isn't there since the indices can never be equal), since they may give different numbers of factors of $N_J - J$.
Thisevaluation proceeds as: 18 f ) ≈ (cid:10) x ( n ) j ′ x j ′ x ( n )0 x x ( m ) k ′ x ( m ) k ′ (cid:11) n = m = D x ( n ) j ′ x j ′ x ( n )0 x x ( m ) k ′ x ( m ) k ′ − x ( n ) j ′ x ( n ) j ′ x ( n )0 x x ( m ) k ′ x ( m ) k ′ − x ( n ) j ′ x j ′ x ( n )0 x ( m )0 x ( m ) k ′ x ( m ) k ′ + x ( n ) j ′ x ( n ) j ′ x ( n )0 x ( m )0 x ( m ) k ′ x ( m ) k ′ E n = m = 1 N J X abcdef D x a ( n ) j ′ x b ( n ) j ′ x b ( n )0 x d ( m )0 x d ( m ) k ′ x f ( m ) k ′ − x a ( n ) j ′ x b ( n ) j ′ x b ( n )0 x d ( m )0 x e ( m ) k ′ x f ( m ) k ′ − x a ( n ) j ′ x b ( n ) j ′ x c ( n )0 x d ( m )0 x d ( m ) k ′ x f ( m ) k ′ + x a ( n ) j ′ x b ( n ) j ′ x c ( n )0 x d ( m )0 x e ( m ) k ′ x f ( m ) k ′ E n = m (46)Now the (16)(25)(34) contraction gives= 1 N J X abcdef δ j ′ k ′ δ af δ j ′ k ′ δ bd δ δ bd − δ j ′ k ′ δ af δ j ′ k ′ δ be δ δ bd − δ j ′ k ′ δ af δ j ′ k ′ δ bd δ δ cd + δ j ′ k ′ δ af δ j ′ k ′ δ be δ δ cd ! = 1 N J (cid:16) N J ( N J − J ) − N J ( N J − J ) + ( N J − J ) (cid:17) ( D − P )= 1 N J (cid:16) N J − N J J − N J + . . . (cid:17) ( D − P )= 1 N J (cid:16) N J − J − . . . (cid:17) ( D − P ) (47)Putting the pieces together, the variance of the parameter over the jackknife samples is (cid:18) − JN (cid:19) (cid:18) JN J + 2 J ( D − P ) N J (cid:19) (48)= (cid:18) N − JN (cid:19) (cid:18) J ( N − J ) (cid:19) (cid:18) D − P ) N + . . . (cid:19) (49)19omparing with Eq. 39, which is for D = P , we see that there is an extra factor of1 + D − P ) N (independent of J ). However, by comparison with Eq. 33 we see that this effectis too large by a factor of two, so the jackknife variance for the parameters is also biased.The leading corrections to the bootstrap estimate of the parameters’ variance can be donein a similar way. To be specific, our bootstrap procedure is to make B resamplings, eachmade by choosing N data vectors with replacement from the original set of N vectors, andcalculate the variance of the parameters over the bootstrap resamples. Similarly to Eq. 42,the average over trials of the bootstrap estimate of the variance is var B ( x f ) = * x ( n )0 − x ( n ) j ′ W − n ) j ′ i ′ V T ( n ) i ′ − B X m (cid:16) x ( m )0 − x ( m ) k ′ W − m ) k ′ i ′ V T ( m ) i ′ (cid:17)! x ( n )0 − V ( n )0 j ′ W − n ) j ′ k ′ x ( n ) k ′ − B X p (cid:16) x ( p )0 − V ( p )0 j ′ W − p ) j ′ l ′ x ( p ) l ′ (cid:17)! + (50)which, after separating diagonal and off-diagonal terms in the sums, becomes (cid:18) − B (cid:19) * x ( n )0 x ( n )0 − x ( n )0 x ( m )0 − x ( n )0 V ( n )0 i ′ W − n ) i ′ j ′ x ( n ) k ′ + 2 x ( n )0 V ( m )0 i ′ W − m ) i ′ j ′ x ( m ) k ′ + x ( n ) j ′ W − n ) j ′ i ′ V T ( n ) i ′ V ( n )0 j ′ W − n ) j ′ k ′ x ( n ) k ′ − x ( n ) j ′ W − n ) j ′ i ′ V T ( n ) i ′ V ( m )0 j ′ W ( − m ) j ′ k ′ x ( m ) k ′ + n = m (51)The overall (cid:0) − B (cid:1) is the expected factor for difference between the average over the originalsample and average over bootstraps. Label the parts as in Eq. 44, where now ( n ) means the n ’th bootstrap resample.For part ( a ), h x ( n )0 x ( n )0 i = 1 N X ab D x a ( n )0 x b ( n )0 E (52)where, in this section, the superscript a ( n ) means the number of the data vector in theoriginal set that was chosen to be the a ’th member of bootstrap resample ( n ). For example,20f for N = 3 our bootstrap ensemble members were members 0, 1 and 0 of the originalensemble, then 0( n ) = 0, 1( n ) = 1 and 2( n ) = 0. We will get contributions with nonvanishingexpectation value when a ( n ) = b ( n ). 
If a member of the original ensemble is chosen m timesin the bootstrap sample, then there will be m contributions. Thus the total is the sum overall members of the original ensemble of the square of the number of times that member waschosen for this bootstrap sample. The probability distribution for the number of times amember appears in the bootstrap sample is a binomial distribution with probability p = 1 /N .The average square of the number of times a member appears in a bootstrap resample isjust the second moment of this distribution, etc. h ( n i ) i = 1 (cid:10) ( n i ) (cid:11) = = 2 − N (cid:10) ( n i ) (cid:11) = = 5 − N + 2 N ( N >
2) (53)Thus the expectation value of ( a ) is N N (cid:0) − N (cid:1) = N − N .Part ( b ) is the expectation value of the number of times a member was chosen in bootstrapresample ( n ) times the number of times it was chosen in resample ( m ). These two areindependent, so we get just the product of the averages, or − N .For part ( c ), break the double bar into its two components.( c ) = − (cid:10) x n ) x x i ′ ( n ) x i ′ ( n ) (cid:11) + 2 (cid:10) x n ) x n ) x i ′ ( n ) x i ′ ( n ) (cid:11) = − N X abc D x a ( n )0 x b ( n )0 x b ( n ) i ′ x c ( n ) i ′ E + 2 N X abcd D x a ( n )0 x b ( n )0 x c ( n ) i ′ x d ( n ) i ′ E (54)In the first term we get a contribution when a ( n ) = b ( n ) = c ( n ). For each of the N membersof the original ensemble we therefore get n a terms, where n a is the number of times thatmember appeared in the bootstrap resample, so we get N (5 − . . . ) ( D − P ), where the D − P is from the implicit sum over i ′ . In the second term we get contributions when a ( n ) = b ( n )and c ( n ) = d ( n ). The probabilities of these two conditions are not quite independent, sinceif one member of the original ensemble is chosen multiple times in the bootstrap resamplethe other members will be chosen fewer times. This effect will be suppressed by a power of21 N , so to leading order we just have h n a i = 4 N ( D − P ). Putting in the two and overallfactors of N from the left, ( c ) = − D − P ) N + . . . .Parts ( d ), ( e ) and f are done similarly, where to this order in N we only need the loopcontraction in parts ( e ) and ( f ).Putting it together var B ( x f ) = (cid:18) − B (cid:19) N (cid:18) D − P − N (cid:19) (55) VIII. CORRECTING SMALL BIASES
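The jackknife bias just derived can be seen directly in a simulation. The sketch below is ours, not from the paper; it uses J = 1 and arbitrary sizes. It remakes the covariance matrix on each single-elimination jackknife resample, rescales the variance over resamples by (N − J)/J as above, and compares N times the average with the expansions in Eqs. (49) and (33).

\begin{verbatim}
import numpy as np

def jackknife_estimate(x, P):
    """Single-elimination jackknife estimate of var(parameter 0), remaking the
    covariance matrix (Eq. 2) on every resample."""
    N, D = x.shape
    fits = np.empty(N)
    for n in range(N):
        xs = np.delete(x, n, axis=0)
        xbar = xs.mean(axis=0)
        C = (xs.T @ xs / (N - 1) - np.outer(xbar, xbar)) / (N - 1)
        V, W = C[:P, P:], C[P:, P:]
        fits[n] = xbar[0] - V[0] @ np.linalg.solve(W, xbar[P:])
    return (N - 1) * fits.var()          # rescale by (N - J)/J with J = 1

rng = np.random.default_rng(6)
N, D, P, trials = 50, 15, 5, 2000
d = D - P
est = np.mean([jackknife_estimate(rng.standard_normal((N, D)), P)
               for _ in range(trials)])
print("N * <jackknife estimate> :", N * est)
print("with 2d/N bias, Eq. (49) :", 1 + 2 * d / N)
print("actual excess, Eq. (33)  :", 1 + d / N + d * (d + 2) / N**2)
\end{verbatim}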
VIII. CORRECTING SMALL BIASES

Once the biases in the various estimates of the error on the parameter have been calculated, it is a simple matter to correct for them. In particular, we should multiply variance estimates from the derivative method by $F_{\rm deriv}$ in Eq. 56. Note this assumes the covariance matrix was normalized as in Eq. 2. For the jackknife or bootstrap done with the full sample covariance matrix, multiply the variance by $F_{\rm reuse}$. This differs from $F_{\rm deriv}$ only in the 1 in the denominator, the well known correction for the difference between the sample average and the true average, which was not included in our normalization of C. For the jackknife or bootstrap analysis where a new covariance matrix is made for each jackknife or bootstrap sample, multiply the variance by $F_{\rm jackknife,remake}$ or $F_{\rm bootstrap,remake}$. Of course, if you are rescaling error bars instead of the variance, you should use the square root of the factor below. (In $F_{\rm bootstrap,remake}$ we assumed that the $\left(1 - \frac{1}{B}\right)$ in Eq. 51 has already been accounted for.)

$F_{\rm deriv} = \dfrac{1 + \frac{1}{N}(D - P) + \frac{1}{N^2}(D - P)(D - P + 2) + \ldots}{1 - \frac{1}{N}(1 + D - P) + \frac{0}{N^2}}$

$F_{\rm reuse} = \dfrac{1 + \frac{1}{N}(D - P) + \frac{1}{N^2}(D - P)(D - P + 2) + \ldots}{1 - \frac{1}{N}(D - P) + \frac{0}{N^2}}$

$F_{\rm jackknife,remake} = 1 - \frac{1}{N}(D - P) + \ldots$

$F_{\rm bootstrap,remake} = 1 + \frac{1}{N} + \ldots$
(56)
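These correction factors are simple to apply in practice; a minimal helper following Eq. (56) is sketched below (ours; the function and argument names are our own, and the expansions assume N is large compared with D − P).

\begin{verbatim}
def f_deriv(n, d):
    """Multiply a derivative-method variance estimate by this factor (Eq. 56).
    n = number of samples, d = D - P."""
    return (1 + d / n + d * (d + 2) / n**2) / (1 - (1 + d) / n)

def f_reuse(n, d):
    """Jackknife or bootstrap reusing the full-sample covariance matrix."""
    return (1 + d / n + d * (d + 2) / n**2) / (1 - d / n)

def f_jackknife_remake(n, d):
    """Jackknife with the covariance matrix remade on each resample."""
    return 1 - d / n

def f_bootstrap_remake(n, d):
    """Bootstrap with the covariance matrix remade on each resample."""
    return 1 + 1 / n

# error bars (rather than variances) are rescaled by the square root:
# corrected_error = raw_error * f_deriv(n, d) ** 0.5
\end{verbatim}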
FIG. 2: Numerical results from Fig. 1 together with the order 1/N and 1/N$^2$ results from the previous section. The meaning of the symbols is the same as in Fig. 1.

IX. COMPARISON TO NUMERICAL RESULTS
In Fig. 2 we plot the order 1/N forms for the variance of the parameter and the various methods of estimating it together with the numerical data. The horizontal axis has been inverted, to 1/N. Figure 3 shows the same data, with the estimates for the variance corrected for bias (up to errors of order 1/N$^2$ or 1/N$^3$). Here the lines for the actual variance of the parameter (black) and for the derivative or resampling with the full sample covariance matrix (red) are second order in 1/N, while the line for the jackknife with remade covariance matrices (blue) is only first order in 1/N. As an aside, we note that although the lowest order corrections for the bootstrap with remade covariance matrices are smaller than for the other methods, the next order corrections appear to be larger.

FIG. 3: Numerical results from Fig. 1 corrected for bias, up to corrections of order 1/N$^2$ for the jackknife with remade covariance matrices and order 1/N$^3$ for the methods with fixed covariance matrix.

[1] C. Aubin et al., Phys. Rev. D 70, 114501 (2004).
[2] C. Aubin et al., Phys. Rev. Lett. 95, 011601 (2005).
[3] C. Michael, Phys. Rev. D 49, 2616 (1994).
[4] C. Michael and A. McKerrell, Phys. Rev. D 51, 3745 (1995).
[5] W.-M. Yao et al. (Particle Data Group), J. Phys. G 33, 1 (2006) and 2007 partial update for the 2008 edition (see Section 32); D. Toussaint, in "From Actions to Answers: Proceedings of the 1989 Theoretical Advanced Study Institute in Elementary Particle Physics", T. DeGrand and D. Toussaint, eds. (World Scientific, Singapore, 1990).
[6] See for example D. F. Morrison, "Multivariate Statistical Methods", McGraw-Hill, 1967; R. A. Johnson and D. W. Wichern, "Applied Multivariate Statistical Analysis", Prentice-Hall, 1982.
Appendix I

Since estimating the goodness of fit is as important as estimating the errors on the parameters, we quote some results here. Note that what we call $\chi^2$ (with the covariance matrix estimated from our data) is more properly called $T^2$, but we stick with the common usage in the lattice gauge community. The probability distribution for $\chi^2$ is known [6]. In terms of N and d,

$\mathrm{Prob}(\chi^2) = \frac{N^{-d/2}\,\Gamma(N/2)}{\Gamma(d/2)\,\Gamma\left( (N - d)/2 \right)}\,\left( \chi^2 \right)^{(d-2)/2}\left( 1 + \frac{\chi^2}{N} \right)^{-N/2}$   (57)

We can compare to the $\chi^2$ distribution, $\mathrm{Prob}(\chi^2) = C\,(\chi^2)^{(d-2)/2}\,e^{-\chi^2/2}$, and see that in the limit of large N they are the same.

From moments of Eq. 57 we see that the mean and variance of $\chi^2$ depend on the sample size. Using

$I(D, N) \equiv \int_0^\infty d(\chi^2)\,(\chi^2)^{(D-2)/2}\left( 1 + \frac{\chi^2}{N} \right)^{-N/2} = \frac{N^{D/2}\,\Gamma(D/2)\,\Gamma\left( (N - D)/2 \right)}{\Gamma(N/2)}$   (58)

$\left\langle \chi^2 \right\rangle = \frac{I(D + 2, N)}{I(D, N)} = \frac{N D}{N - D - 2} = \frac{D}{1 - \frac{D+2}{N}}$   (59)

$\left\langle (\chi^2)^2 \right\rangle = \frac{I(D + 4, N)}{I(D, N)} = \frac{N^2 D (D + 2)}{(N - D - 2)(N - D - 4)} = \frac{D(D+2)}{\left( 1 - \frac{D+2}{N} \right)\left( 1 - \frac{D+4}{N} \right)}$   (60)

(Note this is using our normalization of the covariance matrix.) Taking the connected part, or variance, and expanding in 1/N, this is

$\mathrm{var}(\chi^2) = 2 d\left( 1 + \frac{3d + 6}{N} \right)$   (61)

Estimates of confidence levels, or the probability (over trials) that $\chi^2$ would exceed the value in your experiment, can be found by integrating Eq. 57.
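For instance, a minimal numerical integration of Eq. (57) (our sketch; the function names and the example values of N, d and $\chi^2$ are our own) gives the confidence level for an observed $\chi^2$, and shows how it differs from the naive $\chi^2$ result at small N.

\begin{verbatim}
import numpy as np
from math import lgamma

def prob_density(chi2, n, d):
    """Probability density of Eq. (57) for chi^2 with d degrees of freedom
    and n samples (our normalization of the covariance matrix)."""
    lognorm = -0.5 * d * np.log(n) + lgamma(n / 2) - lgamma(d / 2) - lgamma((n - d) / 2)
    return np.exp(lognorm + 0.5 * (d - 2) * np.log(chi2) - 0.5 * n * np.log1p(chi2 / n))

def confidence_level(chi2_obs, n, d, npts=20000, chi2_max=200.0):
    """Probability that chi^2 exceeds chi2_obs, by trapezoidal integration of Eq. (57)."""
    grid = np.linspace(chi2_obs, chi2_max, npts)
    dens = prob_density(grid, n, d)
    return float(np.sum(0.5 * (dens[1:] + dens[:-1]) * np.diff(grid)))

# example: d = 10 degrees of freedom, observed chi^2 = 15
print(confidence_level(15.0, n=40, d=10))      # small sample: larger confidence level
print(confidence_level(15.0, n=10**6, d=10))   # essentially the chi^2-distribution result
\end{verbatim}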
Appendix II

The customary estimate for the variance of the parameters (Eq. 37), or the estimate from jackknife or bootstrap resamplings with the covariance matrix held fixed, has zero coefficient at the next order in 1/N.