Sample size effects in multivariate fitting of correlated data
D. Toussaint and W. Freeman
Department of Physics, University of Arizona, Tucson, AZ 85721, USA (Dated: May 28, 2018)
Abstract
A common problem in analysis of experiments or in lattice QCD simulations is fitting a parameterized model to the average over a number of samples of correlated data values. If the number of samples is not infinite, estimates of the variance of the parameters ("error bars") and of the goodness of fit are affected. We illustrate these problems with numerical simulations, and calculate approximate corrections to the variance of the parameters for estimates made in the standard way from derivatives of the parameters' probability distribution as well as from jackknife and bootstrap estimates.
PACS numbers:

I. INTRODUCTION

A common problem in analysis of experiments or of Monte Carlo simulations is fitting a parameterized model to the average over a number of samples of correlated data values. In particular, lattice QCD calculations typically require fitting operator correlators, which are a function of distance between the operators, to sums of exponentials with unknown amplitudes and masses. If the number of samples is not infinite, estimates of the variance of the parameters ("error bars") and of the goodness of fit are affected. This can be viewed as a generalization of the well known rule "replace N by N − 1" used when estimating the variance of a distribution from a finite sample. We find that none of the standard methods (estimates from derivatives of the parameters' probability distribution, or from jackknife or bootstrap resampling) give unbiased estimates of the parameters' variance.

Many numerical simulation programs or experiments involve two or more stages of fitting, where the parameters resulting from the first stage are the data input to the second stage. For example, in computations of meson decay constants in lattice QCD the first stage involves fitting a correlator of meson operators to exponentials and extracting the mass and amplitude, and the second stage involves fitting these masses and amplitudes to functions of the quark masses and lattice spacings to allow extrapolation to the chiral and continuum limits. (See for example Refs. [1] and [2].) For example, in Ref. [1] about 600 hadron masses and amplitudes are computed in the first stage of fitting, and these 600 numbers and their (co)variances are in turn the data for the fitting in the second stage. While an unbiased estimate of the variance of the parameters is always welcome, it is particularly important in this case since many parameters in the first stage of fitting are used as inputs (data) in the second stage of fitting, and if their errors are systematically too large or too small the apparent goodness of fit in the second stage of fitting will be very good or very bad, respectively.

II. THE PROBLEM
We consider a problem where we need to fit a function of P parameters to an average of N samples, where each sample consists of D data points. We use subscript indices to label the components of the data vectors and superscript indices to label the samples. Thus $x_i^a$ is the $i$'th component of the $a$'th sample, with $0 \le i < D$ and $0 \le a < N$. Each sample is assumed to be normally distributed, but the different components of the D dimensional sample are generally correlated. Averages over samples will be denoted by overbars. We will need to imagine averaging over many trials of the experiment, and we will use angle brackets to denote such an average: $\langle \bar x_i \rangle$. So, for example

$\bar x_i = \frac{1}{N}\sum_a x_i^a \qquad \overline{x_i x_j} = \frac{1}{N}\sum_a x_i^a x_j^a \qquad \bar x_i\,\bar x_j = \frac{1}{N^2}\sum_a x_i^a \sum_b x_j^b$   (1)

The covariance matrix ("of the mean") for one trial is

$C_{ij} = \frac{1}{N}\left( \overline{x_i x_j} - \bar x_i \bar x_j \right) = \frac{1}{N^2}\sum_a x_i^a x_j^a - \frac{1}{N^3}\left(\sum_a x_i^a\right)\left(\sum_b x_j^b\right)$   (2)

This covariance matrix will fluctuate around the true covariance matrix, obtainable only in the limit $N \to \infty$. Note we use N instead of N − 1 in normalizing $C_{ij}$. For our purposes, the difference between these normalizations is best included with the other order 1/N effects to be discussed. Fit parameters $p_\alpha$, with $0 \le \alpha < P$, are obtained by minimizing

$\chi^2 = \left( \bar x_i - x_i^f(p_\alpha) \right)\left(C^{-1}\right)_{ij}\left( \bar x_j - x_j^f(p_\alpha) \right)$   (3)

where $x_i^f(p_\alpha)$ is the value of $x_i$ predicted by the model. As pointed out in Ref. [3], since we are stuck with estimates of the covariance matrix and the $\bar x_i$ obtained from the same samples, they are correlated.

First change to a convenient coordinate system (alas, available only in theory, not in practice). For the moment we assume that our fit model is good, so that the $x_i^f(p_\alpha)$ can be adjusted to equal the true averages of the $x_i$. Shift the coordinates so that $\langle x_i \rangle$ is zero. Then rotate the coordinates so that the true covariance matrix is diagonal, and rescale them so that $\langle (x_i^a)^2 \rangle = 1$. (So far, we have followed Ref. [3].) We now have $\langle x_i^a x_j^b \rangle = \delta_{ij}\delta^{ab}$, and the true covariance matrix is the unit matrix.

Make a further rotation so that the changes in the $x_i^f(p_\alpha)$ as the $p_\alpha$ vary around their true values are in the first P components, and so that the changes in the $x_i^f(p_\alpha)$ as the first parameter $p_0$ varies are in the first component. Now we can rescale $p_0$ so that $\partial p_0 / \partial \bar x_0 = 1$, which simply means that $p_0$ is the average $\bar x_0$. In doing this we have assumed that $p_0$ is linear enough in the $\bar x_i$ or that the fluctuations in the $\bar x_i$ are small enough.

In this basis, write the covariance matrix (from the data in this experiment) and its inverse in blocks,

$C \equiv \begin{pmatrix} U & V \\ V^T & W \end{pmatrix} \qquad C^{-1} \equiv \begin{pmatrix} A & B \\ B^T & E \end{pmatrix}$   (4)

where the matrices U and A are P by P, V and B are P by D − P, and W and E are D − P by D − P. Now $\chi^2$ is given by

$\chi^2 = \left( \bar x_i - x_i^f \right)\left(C^{-1}\right)_{ij}\left( \bar x_j - x_j^f \right),$   (5)

where only the first P components of $x_i^f$ are nonzero. For example, with two parameters

$\chi^2 = \left( \bar x_0 - x_0^f,\ \bar x_1 - x_1^f,\ \bar x_2, \ldots \right) \begin{pmatrix} A & B \\ B^T & E \end{pmatrix} \begin{pmatrix} \bar x_0 - x_0^f \\ \bar x_1 - x_1^f \\ \bar x_2 \\ \vdots \end{pmatrix}$   (6)

The $x_i^f$ are found from minimizing $\chi^2$:

$0 = \frac{\partial \chi^2}{\partial x_{i^*}^f} = 2 A_{i^* j^*}\left( \bar x_{j^*} - x_{j^*}^f \right) + 2 B_{i^* j'} \bar x_{j'}$   (7)

where, here and in many subsequent equations, starred indices run from 0 to P − 1 and primed indices from P to D − 1, and the factor of two comes from differentiating with respect to the $x_i^f$ on both sides of Eq. 5 and using the fact that $C^{-1}$ is symmetric.
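As an illustration of Eqs. (1)-(3), the following minimal Python sketch (not part of the original analysis; the toy model, array names and sizes are our own assumptions) builds the covariance matrix of the mean from N samples of D correlated data points and evaluates the correlated $\chi^2$ for a given parameter vector.

\begin{verbatim}
import numpy as np

def covariance_of_mean(samples):
    """samples: (N, D) array; returns the D x D covariance matrix 'of the mean',
    Eq. (2), normalized with 1/N rather than 1/(N-1)."""
    N = samples.shape[0]
    xbar = samples.mean(axis=0)                   # \bar x_i
    xx = samples.T @ samples / N                  # \overline{x_i x_j}
    return (xx - np.outer(xbar, xbar)) / N        # C_ij

def chi2(params, xbar, cov_inv, model):
    """Correlated chi^2 of Eq. (3): residuals contracted with the inverse
    covariance matrix of the mean."""
    r = xbar - model(params)
    return r @ cov_inv @ r

# toy example: D correlated points fit to a constant plus slope (hypothetical model)
rng = np.random.default_rng(0)
N, D = 100, 8
true = 1.0 + 0.1 * np.arange(D)
L = np.linalg.cholesky(0.5 * np.eye(D) + 0.5)     # induce correlations
data = true + rng.standard_normal((N, D)) @ L.T
C = covariance_of_mean(data)
Cinv = np.linalg.inv(C)
model = lambda p: p[0] + p[1] * np.arange(D)
print(chi2(np.array([1.0, 0.1]), data.mean(axis=0), Cinv, model))
\end{verbatim}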
This is solved by

$x_{i^*}^f = \bar x_{i^*} + A^{-1}_{i^* j^*} B_{j^* k'} \bar x_{k'}$   (8)

From $C C^{-1} = 1$ and Eq. 4,

$U A + V B^T = 1 \qquad U B + V E = 0 \qquad V^T A + W B^T = 0 \qquad V^T B + W E = 1$   (9)

Using the third of Eqs. 9, remembering that A and W are symmetric, we get an alternate form of Eq. 8:

$B = -A V W^{-1} \qquad A^{-1} B = -V W^{-1} \qquad x_{i^*}^f = \bar x_{i^*} - V_{i^* j'} W^{-1}_{j' k'} \bar x_{k'}$   (10)

From this equation we see that, in this basis, parameter number zero, $x_0^f$, does not depend on the other of the first P components, $\bar x_{i^*}$ with $1 \le i^* < P$. Thus the distribution of the parameters depends only on the combination $D - P \equiv d$.

Similarly for $\chi^2$:

$\chi^2 = \left( -\bar x^T B^T A^{-1},\ \bar x^T \right) \begin{pmatrix} A & B \\ B^T & E \end{pmatrix} \begin{pmatrix} -A^{-1} B \bar x \\ \bar x \end{pmatrix} = \bar x_{i'} \left( -B^T A^{-1} B + E \right)_{i'j'} \bar x_{j'}$   (11)

Multiply by $W^{-1} W = 1$ and use the third and fourth equations in 9:

$\chi^2 = \bar x\, W^{-1}\left( -W B^T A^{-1} B + W E \right)\bar x = \bar x\, W^{-1}\left( V^T A A^{-1} B + W E \right)\bar x = \bar x\, W^{-1}\left( V^T B + W E \right)\bar x = \bar x_{i'} W^{-1}_{i'j'} \bar x_{j'}$   (12)

But W is just the covariance matrix for the last D − P components of x in this basis, so the statistical properties of $\chi^2$ are exactly the same as a D − P dimensional problem with no fit parameters, and the distribution of $\chi^2$, as expected, depends only on the number of degrees of freedom, $d \equiv D - P$. We note that the distribution of $\chi^2$ (more properly, $T^2$) is known. Since it is important here and closely related to the estimates of parameter errors, we quote the result in Appendix I.
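A quick numerical check of Eq. (10), under the idealized assumptions of this section (zero true mean, unit true covariance), is sketched below; it is not from the original paper and the variable names are ours. It verifies that the $\chi^2$ minimizer for the first P components agrees with $\bar x_{i^*} - V_{i^*j'}W^{-1}_{j'k'}\bar x_{k'}$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
N, D, P = 50, 10, 3
x = rng.standard_normal((N, D))                  # idealized basis: unit true covariance
xbar = x.mean(axis=0)
C = (x.T @ x / N - np.outer(xbar, xbar)) / N     # Eq. (2)

# Block decomposition of Eq. (4)
V = C[:P, P:]
W = C[P:, P:]

# Eq. (10): the fitted values of the first P components
xf_blocks = xbar[:P] - V @ np.linalg.solve(W, xbar[P:])

# Eq. (8): the same thing written with blocks of the inverse covariance
Cinv = np.linalg.inv(C)
A, B = Cinv[:P, :P], Cinv[:P, P:]
xf_direct = xbar[:P] + np.linalg.solve(A, B @ xbar[P:])

print(np.allclose(xf_blocks, xf_direct))         # True
\end{verbatim}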
III. NUMERICAL EXAMPLE

To illustrate the effects of sample size, we begin with a numerical example, using the basis described above. In this example, N Gaussian distributed random data vectors with D = 25 were generated. The data was fit with P = 5 parameters, which are just the first P components of the average data vector. This was repeated for many trials. The black octagons in Fig. 1 show N times the variance (over trials) of one of the parameters, where the asymptotic value is one. These black octagons are the correct answer for the variance of the parameter, and this is the variance that we wish to estimate from our experiment, where we only have one trial to work with. We see that for finite N the parameters fluctuate by an amount larger than the asymptotic value.

We also show the average over trials of the variance estimated from derivatives of the parameter probability, the average over trials of the variance estimated from a single elimination jackknife analysis, and the average from a bootstrap analysis. For the jackknife and bootstrap, the plot contains average variances both for the case where the full sample covariance matrix was used in each resampling and where a new covariance matrix was made using the data in each jackknife or bootstrap sample. Red squares are average variances from the derivative method.

FIG. 1: Number of samples times the variance (square of the error) of a parameter in fitting correlated data, and averages over trials of several methods for estimating this variance. The horizontal line indicates the asymptotic value, variance$(x_0^f) = 1/N$. N is the number of samples; the meaning of the plot symbols is described in the text.

We see that the actual variance of the parameter is larger than 1/N, and that the various methods for estimating this variance produce biased estimates of the variance of the parameter.
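A minimal sketch of this numerical experiment (our own reconstruction, not the authors' code; the trial counts and random seeds are arbitrary) is given below. It measures the actual variance of $x_0^f$ over trials and the average of the derivative estimate $(A^{-1})_{00}$, both multiplied by N as in Fig. 1.

\begin{verbatim}
import numpy as np

def one_trial(rng, N, D, P):
    """One trial in the idealized basis: i.i.d. unit-variance data,
    parameter 0 fit as in Eq. (10), plus the derivative error estimate."""
    x = rng.standard_normal((N, D))
    xbar = x.mean(axis=0)
    C = (x.T @ x / N - np.outer(xbar, xbar)) / N          # Eq. (2)
    U, V, W = C[:P, :P], C[:P, P:], C[P:, P:]
    xf0 = xbar[0] - V[0] @ np.linalg.solve(W, xbar[P:])   # Eq. (10), parameter 0
    var_deriv = (U - V @ np.linalg.solve(W, V.T))[0, 0]   # (A^{-1})_{00}, Eq. (35)
    return xf0, var_deriv

rng = np.random.default_rng(2)
D, P, trials = 25, 5, 2000
for N in (30, 50, 100, 200):
    results = np.array([one_trial(rng, N, D, P) for _ in range(trials)])
    actual = N * results[:, 0].var()        # black octagons: N x variance over trials
    estimate = N * results[:, 1].mean()     # average derivative estimate
    print(N, actual, estimate)
\end{verbatim}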
IV. LARGE N EXPANSION

Most of the effects shown in Fig. 1 can be understood analytically. We can expand the covariance matrix in each trial around its true value,

$C_{ij} = \frac{1}{N}\left\{ \delta_{ij} + \left( \overline{x_i x_j} - \delta_{ij} - \bar x_i \bar x_j \right) \right\}$   (13)

Here the term in parentheses has fluctuations of order $1/\sqrt{N}$ and an average of order 1/N. Thus its square will also have expectation value $\approx 1/N$. Then

$C^{-1}_{ij} = N\delta_{ij} - N\left( \overline{x_i x_j} - \delta_{ij} - \bar x_i \bar x_j \right) + N\left( \overline{x_i x_k} - \delta_{ik} - \bar x_i \bar x_k \right)\left( \overline{x_k x_j} - \delta_{kj} - \bar x_k \bar x_j \right) + \ldots$   (14)

Using the fact that integrals of polynomials weighted by Gaussians are found by pairing the $x_i^a$ in all possible ways, or making all possible contractions, we can develop rules for calculating these expectation values. We will use parentheses to list the pairings. For example, with (12) indicating that the first and second x are paired,

$\langle \bar x_i \bar x_i \bar x_j \bar x_j \rangle = \bar x_i \bar x_i \bar x_j \bar x_j \big|_{(12)(34)} + \bar x_i \bar x_i \bar x_j \bar x_j \big|_{(13)(24)} + \bar x_i \bar x_i \bar x_j \bar x_j \big|_{(14)(23)}$   (15)

Using $\langle x_i^a x_j^b \rangle = \delta_{ij}\delta^{ab}$ and $\bar x_i = \frac{1}{N}\sum_a x_i^a$ we get the Feynman rules for contractions of barred quantities.

1. Each contraction gives a $\delta_{ij}$ for the lower indices it connects.
2. Each bar gives a $\frac{1}{N}$, whether it covers a single x or two, $\bar x_i$ or $\overline{x_i x_j}$ (see Eq. 1).
3. Each continuous line made of overbars and contraction symbols gives a factor of N. This is from the $\sum_{ab\ldots}\delta^{ab}\delta^{bc}\ldots$, which has N nonzero terms. For example, $\bar x_i\, \overline{x_j x_k}\, \bar x_l \big|_{(12)(34)}$ is one continuous line, while $\bar x_i\, \overline{x_j x_k}\, \bar x_l \big|_{(14)(23)}$ is two lines (one is a loop). This results in every loop giving an extra factor of N relative to other contractions with the same number of fields.

Since an open line (not a loop) with C contractions has 2C x's and C + 1 bars, but a loop with C contractions has 2C x's and C bars, these rules can be rephrased as:

1. Each contraction gives a $\delta_{ij}$ for the lower indices it connects.
2. Each x gives a factor of $1/\sqrt{N}$.
3. Each loop gives a factor of N.

In the expansion of $C^{-1}$ we find the combination $\overline{x_i x_j} - \delta_{ij} - \bar x_i \bar x_j$, which we will denote by $\overline{\overline{x_i x_j}}$. This occurs frequently enough that we should state special rules for it. In evaluating an expression containing $\overline{\overline{x_i x_j}}$ there will be contractions where the $x_i$ and $x_j$ in $\overline{x_i x_j}$ are contracted with each other. These contractions just cancel the $\delta_{ij}$. The terms with $\bar x_i$ and $\bar x_j$ contracted give a $-\delta_{ij}/N$. Thus a "tadpole" where $\overline{\overline{x_i x_j}}\big|_{(12)}$ contracts with itself just gives a $-\delta_{ij}/N$. (This includes the $N^{-1/2}$ from each of the x's.)

Now consider terms where $\overline{\overline{x_i x_j}}$ is part of an open line, like

$\bar x_i\ \overline{\overline{x_i x_j}}\ \bar x_j \big|_{(12)(34)}$   (16)

In this case the $\overline{x_i x_j}$ and the $-\bar x_i \bar x_j$ cancel, so $\overline{\overline{x_i x_j}}$ can never be part of an open line. But if this object is part of a loop, like in

$\overline{\overline{x_i x_j}}\ \overline{\overline{x_k x_l}} \big|_{(13)(24)}$   (17)

the $\overline{x_i x_j}$ part is part of the loop, but the $-\bar x_i \bar x_j$ part breaks the loop. Thus the four paths hidden in these double bars give

$\frac{N-1}{N^2}\,\delta_{ik}\delta_{jl}$   (18)

Similarly, a loop of three double bars gives $2^3 = 8$ terms; to leading order a loop of n of the $\overline{\overline{x_i x_j}}$'s gives a factor of $N^{1-n}$ times $\delta$'s.

As trivial examples,

$\langle \bar x_i \bar x_j \rangle = \bar x_i \bar x_j \big|_{(12)} = \frac{1}{N}\,\delta_{ij}$   (19)

$\langle \overline{x_i x_j} \rangle = \overline{x_i x_j} \big|_{(12)} = \delta_{ij}$   (20)

$\langle N C_{ij} \rangle = \delta_{ij} + \overline{\overline{x_i x_j}} \big|_{(12)} = \left( 1 - \frac{1}{N} \right)\delta_{ij}$   (21)

We are also interested in the variances of averaged quantities. For the variance of something, $\mathrm{var}(X) = \langle X^2 \rangle - \langle X \rangle^2$, we need the "connected part" of $\langle X^2 \rangle$.
We use a vertical bar to denote this, and we only need contractions where some of the lines cross the bar. As an example, for the variance of an arbitrary element of C to lowest order,

$\langle \mathrm{var}(C_{ij}) \rangle = \langle C_{ij}^2 \rangle - \langle C_{ij} \rangle^2 \quad (\text{no sum on } ij)$   (22)

$= \frac{1}{N^2}\left\langle \overline{\overline{x_i x_j}}\ \Big|\ \overline{\overline{x_i x_j}} \right\rangle_{NS}$   (23)

$= \frac{1}{N^2}\left( \overline{\overline{x_i x_j}}\ \Big|\ \overline{\overline{x_i x_j}} \right)_{(13)(24)+(14)(23)} = \frac{N-1}{N^4}\left( \delta_{ii}\delta_{jj} + \delta_{ij}\delta_{ij} \right)_{NS} = \begin{cases} \frac{1}{N^3}\,\frac{N-1}{N}, & i \ne j \\ \frac{2}{N^3}\,\frac{N-1}{N}, & i = j \end{cases}$   (24)

The last line is written to display that the fractional variance of a diagonal element is 2/N.

In equations where the components are separated into starred indices, $0 \le i^* < P$, and primed indices, $P \le i' < D$, contractions of primed with starred indices are zero, contractions of starred with starred indices give delta functions with $\delta_{i^*i^*} = P$, and primed with primed use $\delta_{i'i'} = D - P$.
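The following short Monte Carlo check (ours, not the authors'; the sizes are arbitrary) verifies the tadpole result $\langle N C_{ij}\rangle = (1 - 1/N)\,\delta_{ij}$ of Eq. (21) and the approximate 2/N fractional variance of a diagonal element from Eq. (24).

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
N, D, trials = 20, 4, 50000

x = rng.standard_normal((trials, N, D))
xbar = x.mean(axis=1)                                  # (trials, D)
xx = np.einsum('tni,tnj->tij', x, x) / N               # \overline{x_i x_j}
C = (xx - np.einsum('ti,tj->tij', xbar, xbar)) / N     # Eq. (2), one C per trial

print("<N C_00>          :", N * C[:, 0, 0].mean(), "vs", 1 - 1 / N)           # Eq. (21)
print("frac. var of C_00 :", C[:, 0, 0].var() / C[:, 0, 0].mean() ** 2,
      "vs approx.", 2 / N)                                                      # Eq. (24)
\end{verbatim}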
V. VARIANCE (AND HIGHER MOMENTS) OF THE PARAMETERS

In this section we examine the variances of the parameters, that is, the error bars on our answers. First we calculate how much the parameters actually vary over many trials of the experiment. Then we calculate the average of common ways of estimating this variance: from derivatives of the probability, from an "eliminate J" jackknife analysis, or from a bootstrap resampling (using either the covariance matrix from the full sample, or a new covariance matrix made from each jackknife or bootstrap resample). The differences allow us to find and correct for bias in our error estimates resulting from the finite sample size.

For the actual variance of our parameters, use Eq. 10. Since we are in a coordinate system where the average of this quantity is zero, we don't need to worry about taking the connected part.

$\left\langle x_0^f x_0^f \right\rangle = \langle \bar x_0 \bar x_0 \rangle - 2\left\langle \bar x_0 V_{0j'} W^{-1}_{j'k'} \bar x_{k'} \right\rangle + \left\langle \bar x_{j'} W^{-1}_{j'k'} V^T_{k'0} V_{0m'} W^{-1}_{m'n'} \bar x_{n'} \right\rangle$   (25)

Since V and W are made entirely of double bars and can therefore only be part of a loop, and primed indices can't contract with index zero, the middle term (cross term) is zero. We compute this to order $1/N^3$.

$\left\langle x_0^f x_0^f \right\rangle = \langle \bar x_0 \bar x_0 \rangle + \Big\langle \bar x_{j'} \left( \delta_{j'm'} - \overline{\overline{x_{j'} x_{m'}}} + \overline{\overline{x_{j'} x_{p'}}}\ \overline{\overline{x_{p'} x_{m'}}} - \ldots \right) \left( \overline{\overline{x_{m'} x_0}} \right)\left( \overline{\overline{x_0 x_{n'}}} \right) \left( \delta_{n'k'} - \overline{\overline{x_{n'} x_{k'}}} + \overline{\overline{x_{n'} x_{r'}}}\ \overline{\overline{x_{r'} x_{k'}}} - \ldots \right) \bar x_{k'} \Big\rangle$   (26)

The leading term, $\langle \bar x_0 \bar x_0 \rangle$, is just 1/N. The term with six x's has only one contraction:

$\bar x_{j'}\ \overline{\overline{x_{j'} x_0}}\ \overline{\overline{x_0 x_{k'}}}\ \bar x_{k'} \big|_{(16)(25)(34)} = \frac{N-1}{N^3}\, d$   (27)

where $d \equiv D - P$. There are two equal terms with eight x's. There are three nonzero contractions of this term. Ignoring the $1/N^2$ parts, these are

$-2\,\bar x_{j'}\ \overline{\overline{x_{j'} x_{m'}}}\ \overline{\overline{x_{m'} x_0}}\ \overline{\overline{x_0 x_{k'}}}\ \bar x_{k'} \big|_{(18)(27)(34)(56)} = -\frac{2}{N^3}\,\delta_{j'k'}\delta_{j'k'}\delta_{m'm'}\delta_{00}$

$-2\,\bar x_{j'}\ \overline{\overline{x_{j'} x_{m'}}}\ \overline{\overline{x_{m'} x_0}}\ \overline{\overline{x_0 x_{k'}}}\ \bar x_{k'} \big|_{(18)(23)(47)(56)} = +\frac{2}{N^3}\,\delta_{j'k'}\delta_{j'm'}\delta_{m'k'}\delta_{00}$

$-2\,\bar x_{j'}\ \overline{\overline{x_{j'} x_{m'}}}\ \overline{\overline{x_{m'} x_0}}\ \overline{\overline{x_0 x_{k'}}}\ \bar x_{k'} \big|_{(18)(24)(37)(56)} = -\frac{2}{N^3}\,\delta_{j'k'}\delta_{j'm'}\delta_{m'k'}\delta_{00}$   (28)

Here the plus sign on the second contraction comes from the tadpole. The second and third contractions cancel, so we just have

$-\frac{2}{N^3}\, d^2$   (29)

To order $1/N^3$ we only need two loop contractions from the terms with ten x's. There are three such terms, but two of them are equal.

$\left\langle \bar x_{j'}\ \overline{\overline{x_{j'} x_{m'}}}\ \overline{\overline{x_{m'} x_0}}\ \overline{\overline{x_0 x_{n'}}}\ \overline{\overline{x_{n'} x_{k'}}}\ \bar x_{k'} \right\rangle + 2 \left\langle \bar x_{j'}\ \overline{\overline{x_{j'} x_{p'}}}\ \overline{\overline{x_{p'} x_{m'}}}\ \overline{\overline{x_{m'} x_0}}\ \overline{\overline{x_0 x_{k'}}}\ \bar x_{k'} \right\rangle$   (30)

Each term has two contractions that contribute at this order; summing them gives

$\left\langle \bar x_{j'}\ \overline{\overline{x_{j'} x_{m'}}}\ \overline{\overline{x_{m'} x_0}}\ \overline{\overline{x_0 x_{n'}}}\ \overline{\overline{x_{n'} x_{k'}}}\ \bar x_{k'} \right\rangle_{\rm loops} = \frac{1}{N^3}\left( d^2 + d \right)$   (31)

$2\left\langle \bar x_{j'}\ \overline{\overline{x_{j'} x_{p'}}}\ \overline{\overline{x_{p'} x_{m'}}}\ \overline{\overline{x_{m'} x_0}}\ \overline{\overline{x_0 x_{k'}}}\ \bar x_{k'} \right\rangle_{\rm loops} = \frac{2}{N^3}\left( d^2 + d \right)$   (32)

Putting it all together,

$\left\langle x_0^f x_0^f \right\rangle = \frac{1}{N} + \frac{N-1}{N^3}(d) - \frac{2}{N^3}(d^2) + \frac{3}{N^3}(d^2 + d) = \frac{1}{N} + \frac{d}{N^2} + \frac{d(d+2)}{N^3} + \ldots$   (33)

Thus the fluctuations in the parameters are larger than the asymptotic value 1/N obtained from the covariance matrix.

As noted above, the probability distribution of the parameters is not exactly Gaussian. Higher moments of this distribution can be obtained in the same way. At leading order in 1/N there is only one independent diagram for the connected part of each moment, and we find, for M even,

$\left\langle \left( x_0^f \right)^M \right\rangle_{\rm connected} = \frac{(D - P)\,(M - 1)!}{N^M}$   (34)
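The statement that the excess variance depends only on $d = D - P$, with the size given by Eq. (33), can be checked numerically; the sketch below (ours, with arbitrary sizes and trial counts) compares two problems with different D and P but the same d.

\begin{verbatim}
import numpy as np

def measured_variance(rng, N, D, P, trials):
    """N times the variance over trials of parameter 0, fit as in Eq. (10)."""
    vals = np.empty(trials)
    for t in range(trials):
        x = rng.standard_normal((N, D))
        xbar = x.mean(axis=0)
        C = (x.T @ x / N - np.outer(xbar, xbar)) / N
        V, W = C[:P, P:], C[P:, P:]
        vals[t] = xbar[0] - V[0] @ np.linalg.solve(W, xbar[P:])
    return N * vals.var()

rng = np.random.default_rng(4)
N, trials, d = 100, 6000, 10
print(measured_variance(rng, N, 25, 15, trials))   # D - P = 10
print(measured_variance(rng, N, 12, 2, trials))    # same d, different D and P
print(1 + d / N + d * (d + 2) / N**2)              # N times Eq. (33)
\end{verbatim}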
VI. ESTIMATES OF THE PARAMETERS' VARIANCE

In practice, the most common method for estimating the variance of the parameters is to use the covariance matrix for the parameters. (See, for example, Ref. [5].) In our coordinate system, this matrix is just $A^{-1}$, and our estimate for the variance of parameter zero is $(A^{-1})_{00}$. Using the third and first of Eqs. 9,

$B^T = -W^{-1} V^T A \qquad U A - V W^{-1} V^T A = 1 \qquad A^{-1} = U - V W^{-1} V^T$   (35)

Then, our estimate for the variance of parameter zero is

$\mathrm{var}(x_0^f)_{\rm derivative} = \left(A^{-1}\right)_{00} = U_{00} - V_{0k'} W^{-1}_{k'l'} V^T_{l'0} = \frac{1}{N}\left( \delta_{00} + \overline{\overline{x_0 x_0}} \right) - \frac{1}{N}\left( \overline{\overline{x_0 x_{k'}}} \right)\left( \delta_{k'l'} - \overline{\overline{x_{k'} x_{l'}}} + \overline{\overline{x_{k'} x_{m'}}}\ \overline{\overline{x_{m'} x_{l'}}} - \ldots \right) \overline{\overline{x_{l'} x_0}}$   (36)

For the order 1/N correction we only need the $\delta_{k'l'}$ from $W^{-1}$, and find

$N\,\mathrm{var}(x_0^f)_{\rm derivative} = \delta_{00} + \overline{\overline{x_0 x_0}}\big|_{(12)} - \overline{\overline{x_0 x_{k'}}}\ \overline{\overline{x_{k'} x_0}}\big|_{(14)(23)} = 1 - \frac{1}{N} - \frac{N-1}{N^2}(D - P) = 1 - \frac{1}{N}(1 + D - P) + \ldots$   (37)

The order $1/N^2$ contribution to this estimate vanishes, as sketched in Appendix II. If D = P = 1 this is just $N\langle C_{00} \rangle = 1 - \frac{1}{N}$, the standard correction for a simple average, reflecting our normalization of the covariance matrix. Comparing to the desired result in Eq. 33, we see that this is an underestimate of the variance of the parameters. The difference between this error estimate and the correct one above is that this estimate assumes that the covariance matrix remains fixed while the data points vary, while the correct answer takes into account the correlations between the data points and the covariance matrix (constructed from these same data points).
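A numerical check of the bias in Eq. (37) (our sketch; the sizes are arbitrary): average $(A^{-1})_{00}$ over many trials and compare N times that average with $1 - (1 + D - P)/N$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
N, D, P, trials = 40, 12, 4, 20000

est = np.empty(trials)
for t in range(trials):
    x = rng.standard_normal((N, D))
    xbar = x.mean(axis=0)
    C = (x.T @ x / N - np.outer(xbar, xbar)) / N
    U, V, W = C[:P, :P], C[:P, P:], C[P:, P:]
    est[t] = (U - V @ np.linalg.solve(W, V.T))[0, 0]   # (A^{-1})_{00}, Eq. (35)

print(N * est.mean())              # biased: sits below one
print(1 - (1 + D - P) / N)         # prediction of Eq. (37)
\end{verbatim}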
VII. VARIANCE OF JACKKNIFE AND BOOTSTRAP PARAMETERS

The variance of the parameters is also often estimated by a jackknife or bootstrap analysis. In these methods the fit is repeated many times using subsets of the data sample, and the variance of the parameters is estimated from the variance over the jackknife or bootstrap samples. Both the jackknife and bootstrap can be done either using the covariance matrix from the full sample in fitting each jackknife or bootstrap sample, or by remaking a covariance matrix for each resample. Using the full sample covariance matrix amounts to seeing how the parameters vary with fixed covariance matrix, that is, by varying $\bar x_{i^*}$ and $\bar x_{k'}$ in Eq. 8 with $A^{-1}_{i^*j^*} B_{j^*k'}$ held fixed. This is the same question as is answered by $\mathrm{var}(x_0^f)_{\rm derivative}$ in Eq. 37. Since the change in the parameters is linear in $\bar x_{i^*}$ and $\bar x_{k'}$, it doesn't matter if the $\bar x_i$ are varied infinitesimally (by taking derivatives) or slightly (jackknife) or fully (bootstrap). In this case, the variance of the parameters will have the same bias as does Eq. 37, and no new calculation is necessary, although there is a slight difference due to the normalization of the covariance matrix used here.

Remaking the covariance matrix for each resample includes correlations of the covariance matrix and data, but not in quite the desired way. The calculations above can be extended to calculate the expectation value of the parameter variance for the jackknife analysis in which the covariance matrix is recomputed for each jackknife sample. An "eliminate J" jackknife consists of making N/J resamples, each omitting J data vectors (numbers nJ through (n + 1)J − 1), leaving $N_J \equiv N - J$ elements. We will denote averages in the n'th jackknife sample with a superscript (n). The average of x in the n'th jackknife sample is

$\bar x^{(n)} = \frac{1}{N_J}\sum_{a \in (n)} x^a$   (38)

where J data vectors (starting with number nJ) were deleted from the full sample. The variance of this quantity (over the jackknife samples) is

$\frac{J}{N(N - J)},$   (39)

so we generally multiply the variance over the jackknife samples by $\frac{N - J}{J}$ to get the expected variance of the mean, $\frac{1}{N}$.

We now compute the variance of the parameters in the jackknife fits. In doing this we will need averages of products of quantities from different jackknife ensembles. Without losing generality, we can think of these as ensembles number zero and one, which differ only in their first J data elements. Thus, expectation values of sums over values in different ensembles may produce factors of $N_J - J$ instead of $N_J$, where $N_J$ is the number of samples in the jackknife, and is really $N - J$. ($N_J - J = N - 2J$ is the number of samples in common between two different jackknife resamples.)

For example, using (n) to denote quantities in the n'th jackknife sample ($x_j^{a(n)}$ is the j'th component of the a'th data vector in jackknife sample (n)), for $n \ne m$,

$\left\langle \bar x_j^{(n)} \bar x_k^{(m)} \right\rangle = \left\langle \frac{1}{N_J}\sum_a x_j^{a(n)}\ \frac{1}{N_J}\sum_b x_k^{b(m)} \right\rangle = \frac{1}{N_J^2}\sum_{ab} \delta_{jk}\,\delta^{ab} = \frac{N_J - J}{N_J^2}\,\delta_{jk}$   (40)

where we define $\delta^{ab} = 1$ only when a = b and the data vector is common to both resamples; the restriction on a and b gives a factor of $N_J - J$ instead of $N_J$.

From Eq. 10, parameter 0 in jackknife fit (n) is

$x_0^{f(n)} = \bar x_0^{(n)} - V^{(n)}_{0j'} W^{-1(n)}_{j'k'} \bar x_{k'}^{(n)}$   (41)

and the variance of this parameter over the jackknife samples is

$\mathrm{var}_J(x_0^f) = \left\langle \left[ \bar x_0^{(n)} - \bar x_{j'}^{(n)} W^{-1(n)}_{j'i'} V^{T(n)}_{i'0} - \frac{J}{N}\sum_m \left( \bar x_0^{(m)} - \bar x_{k'}^{(m)} W^{-1(m)}_{k'i'} V^{T(m)}_{i'0} \right) \right] \left[ \bar x_0^{(n)} - V^{(n)}_{0j'} W^{-1(n)}_{j'k'} \bar x_{k'}^{(n)} - \frac{J}{N}\sum_p \left( \bar x_0^{(p)} - V^{(p)}_{0j'} W^{-1(p)}_{j'l'} \bar x_{l'}^{(p)} \right) \right] \right\rangle$   (42)

This is more complicated than Eq. 25 because the mean over jackknife samples is not exactly zero. Also, the sums over sample vectors now sometimes give $N_J$ and sometimes $N_J - J$, so some of the shortcuts developed above won't work any more. Note the J/N is correct: there are N/J jackknife resamples, each containing $N_J = N - J$ elements.

In Eq. 42 the sums contain terms where n = m and terms where $n \ne m$. Separate the diagonal and off-diagonal terms in the sums, and use the fact that all non-diagonal terms are equal, $\sum_m$ contains N/J − 1 terms with $m \ne n$, and $\sum_{mp}$ has N/J diagonal terms and (N/J)(N/J − 1) off diagonal:

$\mathrm{var}_J(x_0^f) = \Big\langle \left( 1 - \tfrac{J}{N} \right) \bar x_0^{(n)} \bar x_0^{(n)} - \left( 1 - \tfrac{J}{N} \right) \bar x_0^{(n)} \bar x_0^{(m)} - 2\left( 1 - \tfrac{J}{N} \right) \bar x_0^{(n)} V^{(n)}_{0i'} W^{-1(n)}_{i'k'} \bar x_{k'}^{(n)} + 2\left( 1 - \tfrac{J}{N} \right) \bar x_0^{(n)} V^{(m)}_{0i'} W^{-1(m)}_{i'k'} \bar x_{k'}^{(m)} + \left( 1 - \tfrac{J}{N} \right) \bar x_{j'}^{(n)} W^{-1(n)}_{j'i'} V^{T(n)}_{i'0} V^{(n)}_{0l'} W^{-1(n)}_{l'k'} \bar x_{k'}^{(n)} - \left( 1 - \tfrac{J}{N} \right) \bar x_{j'}^{(n)} W^{-1(n)}_{j'i'} V^{T(n)}_{i'0} V^{(m)}_{0l'} W^{-1(m)}_{l'k'} \bar x_{k'}^{(m)} \Big\rangle$   (43)

where $n \ne m$. To evaluate this expression we need:

(a) $\left\langle \bar x_0^{(n)} \bar x_0^{(n)} \right\rangle$
(b) $\left\langle \bar x_0^{(n)} \bar x_0^{(m)} \right\rangle$, $n \ne m$
(c) $\left\langle \bar x_0^{(n)} V^{(n)}_{0i'} W^{-1(n)}_{i'k'} \bar x_{k'}^{(n)} \right\rangle$
(d) $\left\langle \bar x_0^{(n)} V^{(m)}_{0i'} W^{-1(m)}_{i'k'} \bar x_{k'}^{(m)} \right\rangle$, $n \ne m$
(e) $\left\langle \bar x_{j'}^{(n)} W^{-1(n)}_{j'i'} V^{T(n)}_{i'0} V^{(n)}_{0l'} W^{-1(n)}_{l'k'} \bar x_{k'}^{(n)} \right\rangle$
(f) $\left\langle \bar x_{j'}^{(n)} W^{-1(n)}_{j'i'} V^{T(n)}_{i'0} V^{(m)}_{0l'} W^{-1(m)}_{l'k'} \bar x_{k'}^{(m)} \right\rangle$, $n \ne m$
(44)

Here (a), (c) and (e), which involve only jackknife sample (n), are the same as in the previous section with the replacement of N by $N_J$. Because $W^{-1(n)}$ and $V^{(n)}$ consist only of double-barred quantities which can't be part of an open line, and primed and unprimed indices can't contract, (c) vanishes. For (d) we can imagine expanding all the $\overline{\overline{x_i x_j}}^{(m)}$'s into pieces, $\left( \overline{x_i x_j}^{(m)} - \delta_{ij} - \bar x_i^{(m)} \bar x_j^{(m)} \right)$, and making all contractions. There is only one factor of x from jackknife sample (n), which must contract with something from (m). Thus, all of these terms differ from (c) by replacement of exactly one factor of $N_J$ by $N_J - J$, and therefore also sum to zero.

Similarly (e) and (f) differ by the replacement of one or more factors of $N_J$ by $N_J - J$. Thus, their difference will be one order in $\frac{J}{N}$ less than their value. This means that to get the first correction to the asymptotic form, we need only keep the lowest order term in part (e), and the analogous contraction for part (f).

(a) $\left\langle \bar x_0^{(n)} \bar x_0^{(n)} \right\rangle = \frac{1}{N_J^2}\sum_{ab} \left\langle x_0^{a(n)} x_0^{b(n)} \right\rangle = \frac{1}{N_J}$

(b) $\left\langle \bar x_0^{(n)} \bar x_0^{(m)} \right\rangle = \frac{1}{N_J^2}\sum_{ab} \left\langle x_0^{a(n)} x_0^{b(m)} \right\rangle = \frac{N_J - J}{N_J^2}$

(e) $\left\langle \bar x_{j'}^{(n)}\ \overline{\overline{x_{j'} x_0}}^{(n)}\ \overline{\overline{x_0 x_{k'}}}^{(n)}\ \bar x_{k'}^{(n)} \right\rangle = \bar x_{j'}^{(n)}\ \overline{\overline{x_{j'} x_0}}^{(n)}\ \overline{\overline{x_0 x_{k'}}}^{(n)}\ \bar x_{k'}^{(n)} \big|_{(16)(25)(34)} = \frac{N_J - 1}{N_J^3}\,(D - P)$

(f) $\left\langle \bar x_{j'}^{(n)}\ \overline{\overline{x_{j'} x_0}}^{(n)}\ \overline{\overline{x_0 x_{k'}}}^{(m)}\ \bar x_{k'}^{(m)} \right\rangle = \bar x_{j'}^{(n)}\ \overline{\overline{x_{j'} x_0}}^{(n)}\ \overline{\overline{x_0 x_{k'}}}^{(m)}\ \bar x_{k'}^{(m)} \big|_{(16)(25)(34)} = \frac{N_J - 2J - \ldots}{N_J^3}\,(D - P)$
(45)

Here (a), (c) and (e) are the same as in the previous section. To evaluate (f) we need to separate the two terms in the double overbar (the delta function isn't there since the indices can never be equal), since they may give different numbers of factors of $N_J - J$.
Thisevaluation proceeds as: 18 f ) ≈ (cid:10) x ( n ) j ′ x j ′ x ( n )0 x x ( m ) k ′ x ( m ) k ′ (cid:11) n = m = D x ( n ) j ′ x j ′ x ( n )0 x x ( m ) k ′ x ( m ) k ′ − x ( n ) j ′ x ( n ) j ′ x ( n )0 x x ( m ) k ′ x ( m ) k ′ − x ( n ) j ′ x j ′ x ( n )0 x ( m )0 x ( m ) k ′ x ( m ) k ′ + x ( n ) j ′ x ( n ) j ′ x ( n )0 x ( m )0 x ( m ) k ′ x ( m ) k ′ E n = m = 1 N J X abcdef D x a ( n ) j ′ x b ( n ) j ′ x b ( n )0 x d ( m )0 x d ( m ) k ′ x f ( m ) k ′ − x a ( n ) j ′ x b ( n ) j ′ x b ( n )0 x d ( m )0 x e ( m ) k ′ x f ( m ) k ′ − x a ( n ) j ′ x b ( n ) j ′ x c ( n )0 x d ( m )0 x d ( m ) k ′ x f ( m ) k ′ + x a ( n ) j ′ x b ( n ) j ′ x c ( n )0 x d ( m )0 x e ( m ) k ′ x f ( m ) k ′ E n = m (46)Now the (16)(25)(34) contraction gives= 1 N J X abcdef δ j ′ k ′ δ af δ j ′ k ′ δ bd δ δ bd − δ j ′ k ′ δ af δ j ′ k ′ δ be δ δ bd − δ j ′ k ′ δ af δ j ′ k ′ δ bd δ δ cd + δ j ′ k ′ δ af δ j ′ k ′ δ be δ δ cd ! = 1 N J (cid:16) N J ( N J − J ) − N J ( N J − J ) + ( N J − J ) (cid:17) ( D − P )= 1 N J (cid:16) N J − N J J − N J + . . . (cid:17) ( D − P )= 1 N J (cid:16) N J − J − . . . (cid:17) ( D − P ) (47)Putting the pieces together, the variance of the parameter over the jackknife samples is (cid:18) − JN (cid:19) (cid:18) JN J + 2 J ( D − P ) N J (cid:19) (48)= (cid:18) N − JN (cid:19) (cid:18) J ( N − J ) (cid:19) (cid:18) D − P ) N + . . . (cid:19) (49)19omparing with Eq. 39, which is for D = P , we see that there is an extra factor of1 + D − P ) N (independent of J ). However, by comparison with Eq. 33 we see that this effectis too large by a factor of two, so the jackknife variance for the parameters is also biased.The leading corrections to the bootstrap estimate of the parameters’ variance can be donein a similar way. To be specific, our bootstrap procedure is to make B resamplings, eachmade by choosing N data vectors with replacement from the original set of N vectors, andcalculate the variance of the parameters over the bootstrap resamples. Similarly to Eq. 42,the average over trials of the bootstrap estimate of the variance is var B ( x f ) = * x ( n )0 − x ( n ) j ′ W − n ) j ′ i ′ V T ( n ) i ′ − B X m (cid:16) x ( m )0 − x ( m ) k ′ W − m ) k ′ i ′ V T ( m ) i ′ (cid:17)! x ( n )0 − V ( n )0 j ′ W − n ) j ′ k ′ x ( n ) k ′ − B X p (cid:16) x ( p )0 − V ( p )0 j ′ W − p ) j ′ l ′ x ( p ) l ′ (cid:17)! + (50)which, after separating diagonal and off-diagonal terms in the sums, becomes (cid:18) − B (cid:19) * x ( n )0 x ( n )0 − x ( n )0 x ( m )0 − x ( n )0 V ( n )0 i ′ W − n ) i ′ j ′ x ( n ) k ′ + 2 x ( n )0 V ( m )0 i ′ W − m ) i ′ j ′ x ( m ) k ′ + x ( n ) j ′ W − n ) j ′ i ′ V T ( n ) i ′ V ( n )0 j ′ W − n ) j ′ k ′ x ( n ) k ′ − x ( n ) j ′ W − n ) j ′ i ′ V T ( n ) i ′ V ( m )0 j ′ W ( − m ) j ′ k ′ x ( m ) k ′ + n = m (51)The overall (cid:0) − B (cid:1) is the expected factor for difference between the average over the originalsample and average over bootstraps. Label the parts as in Eq. 44, where now ( n ) means the n ’th bootstrap resample.For part ( a ), h x ( n )0 x ( n )0 i = 1 N X ab D x a ( n )0 x b ( n )0 E (52)where, in this section, the superscript a ( n ) means the number of the data vector in theoriginal set that was chosen to be the a ’th member of bootstrap resample ( n ). For example,20f for N = 3 our bootstrap ensemble members were members 0, 1 and 0 of the originalensemble, then 0( n ) = 0, 1( n ) = 1 and 2( n ) = 0. We will get contributions with nonvanishingexpectation value when a ( n ) = b ( n ). 
If a member of the original ensemble is chosen m timesin the bootstrap sample, then there will be m contributions. Thus the total is the sum overall members of the original ensemble of the square of the number of times that member waschosen for this bootstrap sample. The probability distribution for the number of times amember appears in the bootstrap sample is a binomial distribution with probability p = 1 /N .The average square of the number of times a member appears in a bootstrap resample isjust the second moment of this distribution, etc. h ( n i ) i = 1 (cid:10) ( n i ) (cid:11) = = 2 − N (cid:10) ( n i ) (cid:11) = = 5 − N + 2 N ( N >
2) (53)Thus the expectation value of ( a ) is N N (cid:0) − N (cid:1) = N − N .Part ( b ) is the expectation value of the number of times a member was chosen in bootstrapresample ( n ) times the number of times it was chosen in resample ( m ). These two areindependent, so we get just the product of the averages, or − N .For part ( c ), break the double bar into its two components.( c ) = − (cid:10) x n ) x x i ′ ( n ) x i ′ ( n ) (cid:11) + 2 (cid:10) x n ) x n ) x i ′ ( n ) x i ′ ( n ) (cid:11) = − N X abc D x a ( n )0 x b ( n )0 x b ( n ) i ′ x c ( n ) i ′ E + 2 N X abcd D x a ( n )0 x b ( n )0 x c ( n ) i ′ x d ( n ) i ′ E (54)In the first term we get a contribution when a ( n ) = b ( n ) = c ( n ). For each of the N membersof the original ensemble we therefore get n a terms, where n a is the number of times thatmember appeared in the bootstrap resample, so we get N (5 − . . . ) ( D − P ), where the D − P is from the implicit sum over i ′ . In the second term we get contributions when a ( n ) = b ( n )and c ( n ) = d ( n ). The probabilities of these two conditions are not quite independent, sinceif one member of the original ensemble is chosen multiple times in the bootstrap resamplethe other members will be chosen fewer times. This effect will be suppressed by a power of21 N , so to leading order we just have h n a i = 4 N ( D − P ). Putting in the two and overallfactors of N from the left, ( c ) = − D − P ) N + . . . .Parts ( d ), ( e ) and f are done similarly, where to this order in N we only need the loopcontraction in parts ( e ) and ( f ).Putting it together var B ( x f ) = (cid:18) − B (cid:19) N (cid:18) D − P − N (cid:19) (55) VIII. CORRECTING SMALL BIASES
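The jackknife bias just derived can be seen directly in a simulation. The sketch below is ours, not from the paper; it uses J = 1 and arbitrary sizes. It remakes the covariance matrix on each single-elimination jackknife resample, rescales the variance over resamples by (N − J)/J as above, and compares N times the average with the expansions in Eqs. (49) and (33).

\begin{verbatim}
import numpy as np

def jackknife_estimate(x, P):
    """Single-elimination jackknife estimate of var(parameter 0), remaking the
    covariance matrix (Eq. 2) on every resample."""
    N, D = x.shape
    fits = np.empty(N)
    for n in range(N):
        xs = np.delete(x, n, axis=0)
        xbar = xs.mean(axis=0)
        C = (xs.T @ xs / (N - 1) - np.outer(xbar, xbar)) / (N - 1)
        V, W = C[:P, P:], C[P:, P:]
        fits[n] = xbar[0] - V[0] @ np.linalg.solve(W, xbar[P:])
    return (N - 1) * fits.var()          # rescale by (N - J)/J with J = 1

rng = np.random.default_rng(6)
N, D, P, trials = 50, 15, 5, 2000
d = D - P
est = np.mean([jackknife_estimate(rng.standard_normal((N, D)), P)
               for _ in range(trials)])
print("N * <jackknife estimate> :", N * est)
print("with 2d/N bias, Eq. (49) :", 1 + 2 * d / N)
print("actual excess, Eq. (33)  :", 1 + d / N + d * (d + 2) / N**2)
\end{verbatim}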
VIII. CORRECTING SMALL BIASES

Once the biases in the various estimates of the error on the parameter have been calculated, it is a simple matter to correct for them. In particular, we should multiply variance estimates from the derivative method by $F_{\rm deriv}$ in Eq. 56. Note this assumes the covariance matrix was normalized as in Eq. 2. For the jackknife or bootstrap done with the full sample covariance matrix, multiply the variance by $F_{\rm reuse}$. This differs from $F_{\rm deriv}$ only in the 1 in the denominator, the well known correction for the difference between the sample average and the true average, which was not included in our normalization of C. For the jackknife or bootstrap analysis where a new covariance matrix is made for each jackknife or bootstrap sample, multiply the variance by $F_{\rm jackknife,remake}$ or $F_{\rm bootstrap,remake}$. Of course, if you are rescaling error bars instead of the variance, you should use the square root of the factor below. (In $F_{\rm bootstrap,remake}$ we assumed that the $\left(1 - \frac{1}{B}\right)$ in Eq. 51 has already been accounted for.)

$F_{\rm deriv} = \dfrac{1 + \frac{1}{N}(D - P) + \frac{1}{N^2}(D - P)(D - P + 2) + \ldots}{1 - \frac{1}{N}(1 + D - P) + \frac{0}{N^2}}$

$F_{\rm reuse} = \dfrac{1 + \frac{1}{N}(D - P) + \frac{1}{N^2}(D - P)(D - P + 2) + \ldots}{1 - \frac{1}{N}(D - P) + \frac{0}{N^2}}$

$F_{\rm jackknife,remake} = 1 - \frac{1}{N}(D - P) + \ldots$

$F_{\rm bootstrap,remake} = 1 + \frac{1}{N} + \ldots$
(56)
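These correction factors are simple to apply in practice; a minimal helper following Eq. (56) is sketched below (ours; the function and argument names are our own, and the expansions assume N is large compared with D − P).

\begin{verbatim}
def f_deriv(n, d):
    """Multiply a derivative-method variance estimate by this factor (Eq. 56).
    n = number of samples, d = D - P."""
    return (1 + d / n + d * (d + 2) / n**2) / (1 - (1 + d) / n)

def f_reuse(n, d):
    """Jackknife or bootstrap reusing the full-sample covariance matrix."""
    return (1 + d / n + d * (d + 2) / n**2) / (1 - d / n)

def f_jackknife_remake(n, d):
    """Jackknife with the covariance matrix remade on each resample."""
    return 1 - d / n

def f_bootstrap_remake(n, d):
    """Bootstrap with the covariance matrix remade on each resample."""
    return 1 + 1 / n

# error bars (rather than variances) are rescaled by the square root:
# corrected_error = raw_error * f_deriv(n, d) ** 0.5
\end{verbatim}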
FIG. 2: Numerical results from Fig. 1 together with the order 1/N and 1/N$^2$ results from the previous section. The meaning of the symbols is the same as in Fig. 1.

IX. COMPARISON TO NUMERICAL RESULTS
In Fig. 2 we plot the order 1/N forms for the variance of the parameter and the various methods of estimating it together with the numerical data. The horizontal axis has been inverted, to 1/N. Figure 3 shows the same data, with the estimates for the variance corrected for bias (up to errors of order 1/N$^2$ or 1/N$^3$). Here the lines for the actual variance of the parameter (black) and for the derivative or resampling with the full sample covariance matrix (red) are second order in 1/N, while the line for the jackknife with remade covariance matrices (blue) is only first order in 1/N. As an aside, we note that although the lowest order corrections for the bootstrap with remade covariance matrices are smaller than for the other methods, the next order corrections appear to be larger.

FIG. 3: Numerical results from Fig. 1 corrected for bias, up to corrections of order 1/N$^2$ for the jackknife with remade covariance matrices and order 1/N$^3$ for the methods with fixed covariance matrix.

[1] C. Aubin et al., Phys. Rev. D 70, 114501 (2004).
[2] C. Aubin et al., Phys. Rev. Lett. 95, 011601 (2005).
[3] C. Michael, Phys. Rev. D 49, 2616 (1994).
[4] C. Michael and A. McKerrell, Phys. Rev. D 51, 3745 (1995).
[5] W.-M. Yao et al. (Particle Data Group), J. Phys. G 33, 1 (2006) and 2007 partial update for the 2008 edition (see Section 32); D. Toussaint, in "From Actions to Answers: Proceedings of the 1989 Theoretical Advanced Study Institute in Elementary Particle Physics", T. DeGrand and D. Toussaint, eds. (World Scientific, Singapore, 1990).
[6] See for example D. F. Morrison, "Multivariate Statistical Methods", McGraw-Hill, 1967; R. A. Johnson and D. W. Wichern, "Applied Multivariate Statistical Analysis", Prentice-Hall, 1982.
Appendix I

Since estimating the goodness of fit is as important as estimating the errors on the parameters, we quote some results here. Note that what we call $\chi^2$ (with the covariance matrix estimated from our data) is more properly called $T^2$, but we stick with the common usage in the lattice gauge community. The probability distribution for $\chi^2$ is known [6]. In terms of N and d,

$\mathrm{Prob}(\chi^2) = \frac{N^{-d/2}\,\Gamma(N/2)}{\Gamma(d/2)\,\Gamma\left( (N - d)/2 \right)}\,\left( \chi^2 \right)^{(d-2)/2}\left( 1 + \frac{\chi^2}{N} \right)^{-N/2}$   (57)

We can compare to the $\chi^2$ distribution, $\mathrm{Prob}(\chi^2) = C\,(\chi^2)^{(d-2)/2}\,e^{-\chi^2/2}$, and see that in the limit of large N they are the same.

From moments of Eq. 57 we see that the mean and variance of $\chi^2$ depend on the sample size. Using

$I(D, N) \equiv \int_0^\infty d(\chi^2)\,(\chi^2)^{(D-2)/2}\left( 1 + \frac{\chi^2}{N} \right)^{-N/2} = \frac{N^{D/2}\,\Gamma(D/2)\,\Gamma\left( (N - D)/2 \right)}{\Gamma(N/2)}$   (58)

$\left\langle \chi^2 \right\rangle = \frac{I(D + 2, N)}{I(D, N)} = \frac{N D}{N - D - 2} = \frac{D}{1 - \frac{D+2}{N}}$   (59)

$\left\langle (\chi^2)^2 \right\rangle = \frac{I(D + 4, N)}{I(D, N)} = \frac{N^2 D (D + 2)}{(N - D - 2)(N - D - 4)} = \frac{D(D+2)}{\left( 1 - \frac{D+2}{N} \right)\left( 1 - \frac{D+4}{N} \right)}$   (60)

(Note this is using our normalization of the covariance matrix.) Taking the connected part, or variance, and expanding in 1/N, this is

$\mathrm{var}(\chi^2) = 2 d\left( 1 + \frac{3d + 6}{N} \right)$   (61)

Estimates of confidence levels, or the probability (over trials) that $\chi^2$ would exceed the value in your experiment, can be found by integrating Eq. 57.
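For instance, a minimal numerical integration of Eq. (57) (our sketch; the function names and the example values of N, d and $\chi^2$ are our own) gives the confidence level for an observed $\chi^2$, and shows how it differs from the naive $\chi^2$ result at small N.

\begin{verbatim}
import numpy as np
from math import lgamma

def prob_density(chi2, n, d):
    """Probability density of Eq. (57) for chi^2 with d degrees of freedom
    and n samples (our normalization of the covariance matrix)."""
    lognorm = -0.5 * d * np.log(n) + lgamma(n / 2) - lgamma(d / 2) - lgamma((n - d) / 2)
    return np.exp(lognorm + 0.5 * (d - 2) * np.log(chi2) - 0.5 * n * np.log1p(chi2 / n))

def confidence_level(chi2_obs, n, d, npts=20000, chi2_max=200.0):
    """Probability that chi^2 exceeds chi2_obs, by trapezoidal integration of Eq. (57)."""
    grid = np.linspace(chi2_obs, chi2_max, npts)
    dens = prob_density(grid, n, d)
    return float(np.sum(0.5 * (dens[1:] + dens[:-1]) * np.diff(grid)))

# example: d = 10 degrees of freedom, observed chi^2 = 15
print(confidence_level(15.0, n=40, d=10))      # small sample: larger confidence level
print(confidence_level(15.0, n=10**6, d=10))   # essentially the chi^2-distribution result
\end{verbatim}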
Appendix II

The customary estimate for the variance of the parameters (Eq. 37), or the estimate from jackknife or bootstrap resamplings with the covariance matrix held fixed, has zero coefficient at the next order in 1/N.