Error estimation in the method of quasi-optimal weights
A.D. Morozov, A.V. Lokhov, F.V. Tkachov
Physics Department, Lomonosov Moscow State University, Moscow 119991, Russia
University of Münster, Münster 48149, Germany
Institute for Nuclear Research of the Russian Academy of Sciences, Moscow 117312, Russia
Abstract
We examine the problem of construction of confidence intervals within the basic single-parameter, single-iteration variation of the method of quasi-optimal weights. Two kinds of distortions of such intervals due to insufficiently large samples are examined, both allowing an analytical investigation. First, a criterion is developed for validity of the assumption of asymptotic normality together with a recipe for the corresponding corrections. Second, a method is derived to take into account the systematic shift of the confidence interval due to the non-linearity of the theoretical mean of the weight as a function of the parameter to be estimated. A numerical example illustrates the two corrections.
1. Introduction
Error estimation is an important and often difficult part of any physical experiment [1]. A key role is played by the concept of a confidence interval: such an interval must cover the unknown value being measured with a given probability called the confidence level. In this work we study the problem of construction of confidence intervals for the method of quasi-optimal weights (QOW) [2], [3]. QOW is a rather general and versatile method of parameter estimation. It was successfully used to complete the data processing of the Troitsk ν-mass experiment [4] after several years of difficulties that became known as the Troitsk anomaly [5]. The QOW method starts from the classical method of moments and inherits its analytical transparency (see the textbooks [1], [6], [7], [8]). This allows one to approach the construction of confidence intervals in a systematic fashion, relying on analytical methods, which (together with its generality) distinguishes the QOW method from other approaches to parameter estimation, including the maximum likelihood method. This is all the more attractive given that the QOW method is versatile and flexible (see examples of unconventional applications in [9]). The present work is focused on the single-parameter case as a starting point for any multi-parameter extensions.

The plan of the work is as follows. In sec. 2 we review the method of quasi-optimal weights together with the basic large-$N$ formula for the confidence interval. In sec. 3 we obtain a criterion for taking into account corrections that modify the confidence interval to account for an insufficiently large sample size $N$. Sec. 4 presents a method to correctly take into account the non-linearity of the function $h$ (see the main equation of the QOW method, eq. (4)), which represents the theoretical mean of the quasi-optimal weight as a function of the parameter being estimated.
It is shown that ignoring the non-linearity can lead to a systematic overall shift of the boundaries of the confidence interval, and also to an effective decrease of the confidence level. In sec. 5 we discuss two basic examples: the Poisson case (sec. 5a) and the case of normally distributed measurements (sec. 5b), followed by a numerical illustration (sec. 5c). We always consider the case of a heterogeneous sample unless stated otherwise.
2. A review of the QOW method
The QOW method is a refinement of the classical method of moments in regard to the choice of the weight functions ("generalized moments"). Therefore, it is convenient to begin with a review of the classical method in order to fix notations in such a way as to automatically account for the case of a heterogeneous sample. Special attention is given to error estimation because this point is not treated adequately in textbooks.

The following notations will be used, following ref. [3]:
$\theta$ is a generic notation for the parameter being estimated; its various specific values are marked by indices: $\theta_0$, $\theta_i$, etc.;
$\theta^*$ is the unknown true value of the parameter, which is to be estimated;
$x_i$ is the $i$-th random value representing one measurement;
$\{x_i\}$ is the sample of $N$ measurements;
$\pi_i(\theta; x)$ is the probability density function of $x_i$; it depends on $\theta$. This density is assumed to be known (analytically or as a Monte Carlo event generator; in the latter case it is sufficient to construct the main functions of the method, $h(\theta)$ etc., in the form of numerical interpolations).

The method of moments assumes that one can compute theoretical means with respect to the density distributions $\pi_i(\theta; x)$ for functions $f(x)$ of the random variable $x$ (the functions $f$ must satisfy the usual restrictions so that the emerging integrals make sense):

  $\langle f \rangle_{\theta, i} = \int dx\, \pi_i(\theta; x)\, f(x)$.   (1)

We will call $f$ weights rather than moments, retaining the latter word only in the name of the method, because we will not limit ourselves to the classical moments $f_n(x) = x^n$. Besides the obvious requirement that the integral (1) exists, one may require continuity for each of the functions $f_i(x)$ [6]. For a convenient estimate of the dispersion to be possible, one should also require quadratic integrability of each $f_i$. In practice the weights in the QOW method satisfy, as a rule, both of these requirements (even in the case of estimating the median of the Cauchy distribution, which does not even have a mean).
It is convenient to introduce the following function (in all such expressions summations run over $i$ from 1 to $N$):

  $h(\theta) = \frac{1}{N} \sum_i \langle f_i \rangle_{\theta, i}$.   (2)

The corresponding empirical analogue (the empirical mean) is

  $f_{\exp} = \frac{1}{N} \sum_i f_i(x_i)$.   (3)

The idea of the method of moments (inherited by the QOW method) is to take as an estimator for $\theta^*$ the solution with respect to $\theta$ of the following equation (see Fig. 1):

  $f_{\exp} = h(\theta)$.   (4)

Within the adopted assumptions the empirical mean $f_{\exp}$ tends to $h(\theta^*)$ as $N \to \infty$, and one expects that the solution of the equation with respect to $\theta$ would approach $\theta^*$. The corresponding (point) estimate is as follows (see Fig. 1):

  $\theta_{\exp} = h^{-1}(f_{\exp})$.   (5)

Fig. 1. The function $h(\theta)$ (the theoretical weight mean, eq. (2)) together with the point estimates and the CL = 90% intervals for the numerical example of sec. 5c.

It is clear from Fig. 1 that the confidence interval for the estimate of $\theta$ in this method is obtained by mapping the confidence interval for $\langle f \rangle_{\mathrm{th}}$ using the inverse of $h$. The estimation of $\langle f \rangle_{\mathrm{th}}$ is a special case of so-called direct measurements [11]. Following the well-known recipe, take the unbiased sample variance
  $\mathrm{Var}_{\exp} f = \frac{N}{N - 1} \left( \langle f^2 \rangle_{\exp} - \left( \langle f \rangle_{\exp} \right)^2 \right)$.   (6)

Then the confidence interval is written in the standard form:

  $f_- \le \langle f \rangle_{\mathrm{th}} \le f_+$, where $f_\pm = f_{\exp} \pm \Delta f$, $\Delta f = y_\varepsilon \sqrt{\mathrm{Var}_{\exp} f / N}$,   (7)

and $y_\varepsilon$ is the $\varepsilon$-quantile of the standard normal distribution. One sees from Fig. 1 that the interval (7) maps to

  $h^{-1}(f_-) \le \theta \le h^{-1}(f_+)$,   (8)

which is the required confidence interval with the confidence level $\varepsilon$ for the parameter being estimated. This formula is the starting point for the construction of confidence intervals in the QOW method. Since the scatter of $f_{\exp}$ decreases for larger $N$, in practice (especially in the multidimensional case) it is convenient to have in view a linear approximation for this formula, perhaps together with quadratic corrections (see below).

The QOW method refines the basic method of moments by a special choice of the weights $f$. Within some natural assumptions on $\pi(\theta; x)$ there exist weights that generate estimates with the minimal possible variance (the so-called Frechet-Rao-Cramer limit [6]). The simplest one is given by the following formula:

  $f_{\mathrm{opt}}(\theta; x) = \partial_\theta \ln \pi(\theta; x)$,   (9)

and is called the optimal weight (optimal because the estimate obtained with this weight has the minimal variance). The optimal weight is defined up to an additive constant and a factor, both independent of the sample $\{x_i\}$. For a heterogeneous sample, it is sufficient to consider the entire sample as a single random variable whose probability distribution is the product of the distributions of its components (due to the assumed independence of the different $x_i$), and then obtain the optimal weight according to the general prescription (9):

  $f_{\mathrm{opt}}(\theta; x_1, x_2, \ldots, x_N) = \partial_\theta \ln \prod_i \pi_i(\theta; x_i) = \sum_i \partial_\theta \ln \pi_i(\theta; x_i) = \sum_i f_{\mathrm{opt}, i}(\theta; x_i)$.   (10)

In other words, the optimal weight for the entire sample is equal to the sum of the optimal weights for the individual measurements.
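As a toy illustration of eqs. (3)-(5) and (9) (an assumed model, not an example from the paper: the exponential density, the numbers, and all function names below are ours), the following sketch builds the optimal weight at an a priori value $\theta_0$, forms the empirical mean, and inverts $h$ numerically:

```python
import random
import statistics

# Toy illustration of eqs. (3)-(5) and (9) (assumed model, not from the
# paper): estimate the rate theta of an exponential density
#   pi(theta; x) = theta * exp(-theta * x).
# Optimal weight taken at an a priori value theta0 (eq. (9)):
#   f_opt(theta0; x) = d/dtheta ln pi(theta0; x) = 1/theta0 - x,
# whose theoretical mean (eq. (2)) is h(theta) = 1/theta0 - 1/theta.

random.seed(1)
theta_true = 2.0
theta0 = 1.5                                 # a priori guess, deliberately off
sample = [random.expovariate(theta_true) for _ in range(10_000)]

def weight(x):
    return 1.0 / theta0 - x

def h(theta):
    return 1.0 / theta0 - 1.0 / theta        # increasing in theta

f_exp = statistics.fmean(weight(x) for x in sample)   # empirical mean, eq. (3)

def invert_h(target, lo=1e-3, hi=100.0):
    """Solve h(theta) = target by bisection (the point estimate, eq. (5))."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if h(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

theta_exp = invert_h(f_exp)
print(theta_exp)      # statistically consistent with theta_true = 2.0
```

For this toy density the inversion can of course be done in closed form ($\theta_{\exp} = 1/\bar{x}$); the bisection stands in for the general case where $h(\theta)$ is known only numerically.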
An important property of the optimal weight is that its theoretical mean vanishes when calculated over the distribution at the same value of the parameter for which the weight is optimal:

  $\langle f_{\mathrm{opt}}(\theta; \cdot) \rangle_\theta = 0$.   (11)

As a side remark we note that this property can be taken as a starting point for the construction of the parameter estimator, and then one actually returns to the maximum likelihood method (cf. §16 in [6]). However, one then relinquishes the simple way to estimate errors described above.

The key observation of the QOW method is that the deviation of the estimate from the optimal one (i.e. the one that corresponds to the Frechet-Rao-Cramer boundary) depends only quadratically on the deviation of the weight from the optimal one (for deviations which are not too large). This means that even non-ideal approximations for the weight would yield good estimates that are close to the optimum. One should also keep in mind that the deviation is an integral characteristic (see eq. (13) below), which gives one more freedom for constructing a quasi-optimal weight than is the case for point-wise function approximations. The importance of this observation is seen from the fact that $\theta^*$ is unknown prior to the measurement, so it is impossible to point out the corresponding exact optimal weight. However, the quadratic (moreover, integral) smallness of deviations from the optimum often allows one to choose a good, almost optimal weight without exact knowledge of $\theta^*$. Therein lies the crucial advantage of the QOW method.

The quadratic smallness can be expressed as follows. Suppose the weight $f(x)$ used in the method of moments is close to $f_{\mathrm{opt}}(\theta^*; x)$: $f(x) = f_{\mathrm{opt}}(\theta^*; x) + \delta f(x)$. Consider the theoretical expression for the variance of the estimate for large $N$; it is then sufficient to consider the linear approximation for $h$:

  $\mathrm{Var}_{\mathrm{th}}\, \theta_{\exp} \approx \frac{\mathrm{Var}_{\mathrm{th}} f}{H^2}$,   (12)

where $H = h'(\theta^*)$.
Expand this expression in $\delta f$, retain only the quadratic terms, and denote $\delta f_{\mathrm{th}} = \delta f - \langle \delta f \rangle_{\mathrm{th}}$; then one obtains [2]:

  $\mathrm{Var}_{\mathrm{th}}\, \theta_{\exp} - \mathrm{Var}_{\mathrm{opt}}\, \theta_{\exp} = O\!\left( (\delta f_{\mathrm{th}})^2 \right)$,   (13)

where the quadratic term is non-negative due to the Schwartz inequality.

A convenient way to choose quasi-optimal weights is to take the optimal weight $f_{\mathrm{opt}}(\theta_0; x)$ which corresponds to some a priori value $\theta_0$ (e.g. known from previous experiments), thus assuming that $\theta_0 \approx \theta^*$. Then the deviation of the error from the theoretical lower bound would vanish quadratically as $\theta_0 \to \theta^*$. One should note the following, bearing in mind applications to problems which occur in accelerator measurements with a complicated event structure and computationally expensive theoretical Monte Carlo event generators: in view of the integral character of the above formulae, quasi-optimal weights can be chosen in the form of rather crude, even piecewise-constant, approximations.

Note that if the quasi-optimal weight is known as a function of $\theta$, then it becomes possible to improve the estimate iteratively. Indeed, let $\theta_1$ be the solution of the main equation (4) with the weight taken at $\theta_0$. Then one can repeat the entire procedure taking the new value $\theta_1$ in place of $\theta_0$. This would yield a new solution $\theta_2$, and so on. It should be remembered, however, that the meaning of such iterations is not so much in the convergence of the estimates $\theta_i$ as in the minimization of the corresponding formal errors. We have called the errors formal because they are obtained following the general prescription of the basic method of moments, where the weight is independent of the sample. But in the iterative version the weight acquires a dependence on the sample via the estimates of $\theta$ obtained at the previous iterations, and then the true experimental error would be underestimated, similarly to how it is underestimated in the elementary case of the estimation of a variance relative to the experimental mean, where, in the simplest case of the normal distribution, the corresponding correction is achieved via the Student distribution.
There is no known analytical way to implement such corrections in the general case. Therefore, we only note here that an iterative variant of the method must observe a balance between the improvement of the estimate due to the iterations and the above effect of error underestimation. The eventual balance depends on the concrete experimental situation and must be studied case by case. This work only aims to understand the error estimation in the basic, non-iterative variant of the QOW method, to provide a starting point for further improvements.
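The iterative variant described above can be sketched as follows (a schematic illustration under assumed inputs, not the paper's code: the toy Poisson model $\nu(\theta, x) = \theta + x$ and all names are ours); each step re-solves the main equation (4) with the weight frozen at the previous estimate:

```python
import math
import random

# Schematic sketch of the iterative QOW variant (assumed toy model, not
# the paper's code): Poisson counts with mean nu(theta, x) = theta + x,
# i.e. a flat signal theta on a known background x.  The weight at the
# current guess theta_k is f = n * A(theta_k, x) (cf. eq. (9)) with
#   A(theta_k, x) = (d nu / d theta) / nu = 1 / (theta_k + x).

random.seed(7)
theta_true = 3.0
xs = [0.5 * i for i in range(1, 21)]            # control points

def poisson(mu):
    """Simple Poisson sampler (Knuth's method), adequate for small mu."""
    limit, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

ns = [poisson(theta_true + x) for x in xs]      # one measured spectrum

def qow_step(theta_k):
    """One iteration: solve the main equation (4) with the weight frozen
    at theta_k; for this model the equation is linear in theta."""
    A = [1.0 / (theta_k + x) for x in xs]
    return sum(a * (n - x) for a, n, x in zip(A, ns, xs)) / sum(A)

theta = 1.0                                     # crude a priori value
for _ in range(5):
    theta = qow_step(theta)                     # theta_1, theta_2, ...
print(theta)          # statistically consistent with theta_true = 3.0
```

In line with the caveat above, the formal error attached to the final iteration would ignore the sample dependence that $\theta_k$ introduces into the weight.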
3. The case of insufficiently large N

In the simplest case the confidence interval is constructed via the standard deviation. This method is limited by the following conditions of applicability [1]:
1. The sampling distribution of the estimator becomes asymptotically normal as $N \to \infty$.
2. $N$ is sufficiently large for the deviations of the distribution from the normal one to be negligible.

In order to study the case of an insufficiently large $N$, rewrite the main equation (4) as follows:

  $\frac{1}{N} \sum_i f_{\mathrm{opt}, i}(\theta_0; x_i) - \frac{1}{N} \sum_i \langle f_{\mathrm{opt}, i}(\theta_0; \cdot) \rangle_{\theta, i} = 0$.   (14)

As was already mentioned, the standard deviation can be used as an error estimate only in the case of an asymptotically normal distribution of the estimator. On the left-hand side, this is guaranteed by the central limit theorem [7], [8]. For small $N$, however, this approximation may not be valid. To clarify this point, recall the fundamental Berry-Esseen theorem [7]. Let $Y_i$ be independent random variables such that

  $\mu_i = \langle Y_i \rangle_{\mathrm{th}}$, $\sigma_i^2 = \mathrm{Var}_{\mathrm{th}}\, Y_i$, $\rho_i = \langle |Y_i - \mu_i|^3 \rangle_{\mathrm{th}} < \infty$.   (15)

Let $F_N(y)$ be the distribution function of the quantity

  $S_N = \frac{\sum_i (Y_i - \mu_i)}{\sqrt{\sum_i \sigma_i^2}}$.

Then for any $y$ and $N$ the following inequality is true:

  $\left| F_N(y) - \Phi(y) \right| \le C\, \epsilon_N$,   (16)

where $0.4097 \le C \le 0.5600$ (some concrete values are given in Table 1 [12]),

  $\epsilon_N = \frac{\sum_i \rho_i}{\left( \sum_i \sigma_i^2 \right)^{3/2}}$,

and $\Phi(y)$ is the standard normal distribution function.
Table 1. Values of the constant $C(N)$ in the Berry-Esseen inequality, $N = 1, \ldots, 10$ [12].

Denote $\epsilon_{\mathrm{BE}, N} = C\, \epsilon_N$ and rewrite the inequality as follows:

  $\Phi(y) - \epsilon_{\mathrm{BE}, N} \le F_N(y) \le \Phi(y) + \epsilon_{\mathrm{BE}, N}$.   (17)

This yields pointwise estimates for the upper and lower bounds of the distribution function. When constructing the confidence interval, it is highly undesirable to overestimate its probability content; therefore consider the lower bound:
  $F_N(y) \ge \Phi(y) - \epsilon_{\mathrm{BE}, N}$.   (18)

The distribution function is defined by

  $F_N(y) = P(S_N < y)$.   (19)

Therefore, with a given confidence level $\varepsilon$ for the theoretical mean $\langle S \rangle$ of $S_N$, one obtains (the factor 2 comes from the fact that the bound on the absolute value is actually two-sided):

  $P\left( \left| S_N - \langle S \rangle \right| \le y_\varepsilon \right) \ge \varepsilon - 2\, \epsilon_{\mathrm{BE}, N}$.   (20)

To err on the conservative side, consider the worst case, which corresponds to replacing the inequality with an equality:

  $P\left( \left| S_N - \langle S \rangle \right| \le y_\varepsilon \right) = \varepsilon - 2\, \epsilon_{\mathrm{BE}, N}$.   (21)

In other words, the formally constructed confidence interval corresponds to a confidence level that is less than $\varepsilon$ by the amount $2\,\epsilon_{\mathrm{BE}, N}$. Since the quantity $\epsilon_{\mathrm{BE}, N}$ which emerged from the Berry-Esseen theorem is independent of $\varepsilon$, one can implement the corresponding correction by simply rebuilding the confidence interval with the confidence level $\varepsilon^* = \varepsilon + 2\,\epsilon_{\mathrm{BE}, N}$ instead of $\varepsilon$. Then

  $P\left( \left| S_N - \langle S \rangle \right| \le y_{\varepsilon^*} \right) = (\varepsilon + 2\,\epsilon_{\mathrm{BE}, N}) - 2\,\epsilon_{\mathrm{BE}, N} = \varepsilon$.   (22)

To summarize, the confidence interval for the confidence level $\varepsilon$ for the theoretical mean $\langle S \rangle$ in the case of small $N$ has the form:

  $S_N - y_{\varepsilon + 2\epsilon_{\mathrm{BE}, N}} \le \langle S \rangle \le S_N + y_{\varepsilon + 2\epsilon_{\mathrm{BE}, N}}$.   (23)

This confidence interval is guaranteed to cover the measured value $\langle S \rangle$ with a probability no less than $\varepsilon$. Also note that the conservative choice made in eq. (21) may lead to $\varepsilon + 2\,\epsilon_{\mathrm{BE}, N} > 1$ in the case of large deviations from asymptotic normality, which cannot be treated as corrections anyway.

Application to the QOW method. Let us specify the quantities that enter the Berry-Esseen construction, eq. (15), for the case of the QOW method:

  $Y_i = f_{\mathrm{opt}, i}(\theta_0; x_i)$,   (24)

  $\mu_i = \langle f_{\mathrm{opt}, i}(\theta_0; \cdot) \rangle_{\theta, i}$,   (25)

  $\sigma_i^2 = \langle f_{\mathrm{opt}, i}^2(\theta_0; \cdot) \rangle_{\theta, i} - \mu_i^2$,   (26)

  $\rho_i = \langle | f_{\mathrm{opt}, i}(\theta_0; \cdot) - \mu_i |^3 \rangle_{\theta, i}$,   (27)

  $\epsilon_N = \frac{\sum_i \rho_i}{\left( \sum_i \sigma_i^2 \right)^{3/2}}$,   (28)

  $\epsilon_{\mathrm{BE}, N} = C\, \epsilon_N$,   (29)

  $S_N = \frac{1}{\Sigma_N} \sum_i \left[ f_{\mathrm{opt}, i}(\theta_0; x_i) - \mu_i \right]$, where $\Sigma_N = \sqrt{\textstyle\sum_i \sigma_i^2}$.   (30)

Then the confidence interval for the weight is represented as follows:

  $\left| S_N - \langle S \rangle \right| \le y_{\varepsilon + 2\,\epsilon_{\mathrm{BE}, N}}$.   (31)

There is still an ambiguity in the choice of the argument $\theta$ for $\epsilon_{\mathrm{BE}, N}(\theta)$. It would be mathematically correct to take the value $\theta^*$, which, however, is unknown. It is customary in such cases to choose $\theta_{\exp}$ instead, especially since there is a safety margin built into the above construction due to the fact that the worst-case bound was chosen from the Berry-Esseen inequality (see the reasoning for eq. (21)).

The aim here is to construct confidence intervals not for $S_N$, but for the quantity $\langle f_{\mathrm{opt}}(\theta_0; x) \rangle_\theta$ that occurs in the main equation of the method (4). Therefore it is necessary to transform the interval (31) as follows:

  $\varphi_-(x; \theta_{\exp}, \varepsilon) \le \langle f_{\mathrm{opt}}(\theta_0; x) \rangle_\theta \le \varphi_+(x; \theta_{\exp}, \varepsilon)$,   (32)

where

  $\varphi_\pm(x; \theta_{\exp}, \varepsilon) = f_{\exp} \pm y_{\varepsilon + 2\,\epsilon_{\mathrm{BE}, N}(\theta_{\exp})}\, \frac{\Sigma_N(\theta_{\exp})}{N}$,   (33)

and $f_{\exp} = \frac{1}{N} \sum_i f_{\mathrm{opt}, i}(\theta_0; x_i)$ is the empirical mean of the quasi-optimal weight. As argued above, we have substituted $\theta_{\exp}$ in place of the unknown $\theta^*$, and obtain the final interval for the weight:

  $\left| \langle f_{\mathrm{opt}}(\theta_0; x) \rangle_\theta - f_{\exp} \right| \le y_{\varepsilon + 2\,\epsilon_{\mathrm{BE}, N}(\theta_{\exp})}\, \frac{\Sigma_N(\theta_{\exp})}{N}$.   (34)

From this one obtains the following general interval for $\theta$, without invoking approximations for the function $h(\theta)$:

  $\theta_- \le \theta \le \theta_+$,   (35)

  $\theta_\pm = h^{-1}\!\left( f_{\exp} \pm y_{\varepsilon + 2\,\epsilon_{\mathrm{BE}, N}(\theta_{\exp})}\, \frac{\Sigma_N(\theta_{\exp})}{N} \right)$.   (36)

This formula assumes that $h(\theta)$ increases monotonically; in the case of a decrease, the upper and lower bounds should be swapped. To summarize, the quantity $2\,\epsilon_{\mathrm{BE}, N}(\theta_{\exp})$ has a simple statistical meaning ("the leakage of probability from the confidence interval") and can therefore be used as a convenient quantitative indicator of whether or not the normality assumption is valid for the distribution of the weight's mean.
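A minimal sketch of the correction described in this section (the per-measurement moments are assumed example inputs, not numbers from the paper): it computes $\epsilon_N$ from eqs. (26)-(29) and the widened quantile of eq. (23), using the conservative value $C = 0.5600$:

```python
from statistics import NormalDist

# Sketch of the Berry-Esseen widening of a confidence interval,
# eqs. (15)-(23) and (28)-(29); the per-measurement moments below are
# assumed example inputs, not numbers from the paper.

C = 0.5600                       # conservative Berry-Esseen constant

def widened_quantile(eps, sigma2s, rho3s):
    """Return (eps_BE, y) where y is the normal quantile for the
    corrected confidence level eps* = eps + 2*eps_BE (eqs. (22)-(23))."""
    eps_N = sum(rho3s) / sum(sigma2s) ** 1.5          # eq. (28)
    eps_BE = C * eps_N                                # eq. (29)
    eps_star = eps + 2.0 * eps_BE
    if eps_star >= 1.0:
        # large deviation from normality: cannot be treated as a correction
        raise ValueError("eps + 2*eps_BE >= 1")
    # two-sided interval: y is the (1 + eps*)/2 quantile of N(0, 1)
    y = NormalDist().inv_cdf(0.5 * (1.0 + eps_star))
    return eps_BE, y

# assumed inputs: N = 500 measurements, each with unit variance and
# third absolute central moment 1.6 (a mildly non-Gaussian weight)
N = 500
eps_BE, y = widened_quantile(0.90, [1.0] * N, [1.6] * N)
print(eps_BE, y)     # eps_BE ~ 0.04; y exceeds the uncorrected 1.645
```

The raised exception corresponds to the case noted after eq. (23), where $\varepsilon + 2\,\epsilon_{\mathrm{BE}, N} \ge 1$ and the deviation from normality is too large to be treated as a correction.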
4. The linear and quadratic approximations for h(θ)

The geometric interpretation of the QOW method (see Fig. 1) allows one to immediately write down the confidence interval for $\theta$ with the confidence level $\varepsilon$ as follows:

  $\theta_- \le \theta \le \theta_+$,   (37)

where

  $\theta_\pm = h^{-1}(f_{\exp} \pm \Delta f)$,   (38)

$h^{-1}$ is the inverse function of $h$, and the quantity $\Delta f$ was defined in eq. (7). For practical purposes (having in view the cases when the modelling and construction of $h(\theta)$ is expensive) it is convenient to consider the inversion of $h$ in eq. (37) in the linear and quadratic approximations, which is justified for large $N$, when the scatter of the values of $f_{\exp}$ around $h(\theta_{\exp})$ decreases with growing $N$. Start with the following quadratic approximation for $h(\theta)$ near $\theta_{\exp}$:
  $h(\theta) \approx h(\theta_{\exp}) + h'(\theta_{\exp})(\theta - \theta_{\exp}) + \tfrac{1}{2} h''(\theta_{\exp})(\theta - \theta_{\exp})^2$.   (39)

After simple transformations one obtains:

  $h^{-1}(f) \approx \theta_{\exp} + \Delta\, (1 - c\,\Delta)$,   (40)

where

  $c = \frac{h''(\theta_{\exp})}{2\, h'(\theta_{\exp})}$, $\quad \Delta = \frac{f - h(\theta_{\exp})}{h'(\theta_{\exp})}$.   (41)

Note that $c = 0$ corresponds to the linear case. Substituting eq. (40) into the main formula (37), one obtains:

  $\theta_\pm = \theta_{\exp} + \Delta_{\mathrm{nlin}} \pm \Delta_{\mathrm{lin}}$,   (42)

where

  $\Delta_{\mathrm{lin}} = \frac{\Delta f}{h'(\theta_{\exp})}$,   (43)

  $\Delta_{\mathrm{nlin}} = -c\, \Delta_{\mathrm{lin}}^2$.   (44)

It is seen that neglecting the quadratic terms would result in a systematic measurement error in the form of an overall shift of the confidence interval by $\Delta_{\mathrm{nlin}}$. The second effect of the non-linearity is a possible distortion of the probability content of the confidence interval in the linear approximation compared with the quadratic case. In order to obtain a quantitative estimate, we use expression (42) to form the difference of the probability contents of the two confidence intervals:

  $\delta\varepsilon = P\!\left( \theta_{\exp} + \Delta_{\mathrm{nlin}} - \Delta_{\mathrm{lin}} \le \theta \le \theta_{\exp} + \Delta_{\mathrm{nlin}} + \Delta_{\mathrm{lin}} \right) - P\!\left( \theta_{\exp} - \Delta_{\mathrm{lin}} \le \theta \le \theta_{\exp} + \Delta_{\mathrm{lin}} \right)$.   (45)

Introduce an intermediate quantity, namely the distribution function $\Phi_\theta$ of the estimator (it will drop out of the final answer). Then expression (45) can be rewritten as follows:

  $\delta\varepsilon = \left[ \Phi_\theta(\theta_{\mathrm{adj}} + \Delta_{\mathrm{lin}}) - \Phi_\theta(\theta_{\mathrm{adj}} - \Delta_{\mathrm{lin}}) \right] - \left[ \Phi_\theta(\theta_{\exp} + \Delta_{\mathrm{lin}}) - \Phi_\theta(\theta_{\exp} - \Delta_{\mathrm{lin}}) \right]$,   (46)

where $\theta_{\mathrm{adj}} = \theta_{\exp} + \Delta_{\mathrm{nlin}}$ (an adjusted $\theta_{\exp}$). Regrouping the terms, one obtains:

  $\delta\varepsilon = \left[ \Phi_\theta(\theta_{\mathrm{adj}} + \Delta_{\mathrm{lin}}) - \Phi_\theta(\theta_{\exp} + \Delta_{\mathrm{lin}}) \right] - \left[ \Phi_\theta(\theta_{\mathrm{adj}} - \Delta_{\mathrm{lin}}) - \Phi_\theta(\theta_{\exp} - \Delta_{\mathrm{lin}}) \right]$.   (47)

In practice one works with confidence intervals whose confidence levels are close to 1 (100%). Then the variation of $\Phi_\theta$ in the region of interest is small, see Fig. 2.

Fig. 2. Distribution function of $\theta_{\exp}$.

Therefore, one can use a linear approximation for each of the terms in the square brackets:

  $\Phi_\theta(\theta + \Delta_{\mathrm{nlin}}) - \Phi_\theta(\theta) \approx \Delta_{\mathrm{nlin}}\, \varphi_\theta(\theta)$,   (48)

where $\varphi_\theta$ is the probability density of the estimator. Then eq. (45) becomes

  $\delta\varepsilon \approx \Delta_{\mathrm{nlin}} \left[ \varphi_\theta(\theta_{\exp} + \Delta_{\mathrm{lin}}) - \varphi_\theta(\theta_{\exp} - \Delta_{\mathrm{lin}}) \right]$.   (49)

From the main equation of the QOW method, eq. (14), one can see that $\theta_{\exp}$ is connected to the empirical mean of the weight $f_{\exp}$. The distribution of $f_{\exp}$ is assumed to be normal (the corrections for non-normality in the sub-asymptotic case were considered in sec. 3). Therefore, for a direct calculation it is more convenient to use the probability density of the empirical weight, which we denote as $\varphi_h(h)$ (the notation $h$ for the argument, instead of the cumbersome $f_{\exp}$, is motivated by eq. (4)).
Note that $\varphi_h$ is a normal density, in accordance with the assumptions made here; its parameters are discussed below. Using eq. (14), one can write (this is in fact the transformation of a probability measure under a mapping):

  $\varphi_\theta(\theta) = \varphi_h\big(h(\theta)\big)\, \frac{dh(\theta)}{d\theta}$.   (50)

Finally, eq. (45) takes the form:

  $\delta\varepsilon \approx \Delta_{\mathrm{nlin}} \left[ \varphi_h\big(h(\theta_{\exp} + \Delta_{\mathrm{lin}})\big)\, h'(\theta_{\exp} + \Delta_{\mathrm{lin}}) - \varphi_h\big(h(\theta_{\exp} - \Delta_{\mathrm{lin}})\big)\, h'(\theta_{\exp} - \Delta_{\mathrm{lin}}) \right]$.   (51)

This expression contains no non-constructive quantities and can be implemented as a program; sec. 5c shows how the probability content differs between the linear approximation and the more accurate quadratic one. For instance, a value $\delta\varepsilon = 1\%$ would have to be viewed in the context of the target confidence level (e.g. 95%), and one would take further decisions based on that number depending on the concrete experimental context.
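The chain (41) → (43) → (44) → (51) can be sketched as follows for a weight-mean function $h(\theta)$ given numerically (the concrete $h$, $\Delta f$ and $\sigma_h$ below are illustrative assumptions, not numbers from the paper):

```python
import math

# Sketch of the non-linearity corrections of this section for a
# weight-mean function h(theta) given numerically (eqs. (41), (43),
# (44), (51)); the concrete h, Delta f and sigma_h are illustrative
# assumptions, not numbers from the paper.

def normal_pdf(z, mu, sigma):
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def nonlinearity_corrections(h, theta_exp, df, sigma_h, step=1e-4):
    """df is Delta f of eq. (7); sigma_h is the standard deviation of
    the (normal) distribution of the empirical weight mean."""
    hp = lambda t: (h(t + step) - h(t - step)) / (2.0 * step)            # h'
    h1 = hp(theta_exp)
    h2 = (h(theta_exp + step) - 2.0 * h(theta_exp)
          + h(theta_exp - step)) / step ** 2                             # h''
    c = h2 / (2.0 * h1)                        # eq. (41)
    d_lin = df / h1                            # eq. (43)
    d_nlin = -c * d_lin ** 2                   # eq. (44)
    mean_h = h(theta_exp)
    # probability-content distortion, eq. (51)
    d_eps = d_nlin * (
        normal_pdf(h(theta_exp + d_lin), mean_h, sigma_h) * hp(theta_exp + d_lin)
        - normal_pdf(h(theta_exp - d_lin), mean_h, sigma_h) * hp(theta_exp - d_lin))
    return d_lin, d_nlin, d_eps

# illustrative curved h; Delta f chosen as y_90 * sigma_h (cf. eq. (7))
h = lambda t: math.log(t)
d_lin, d_nlin, d_eps = nonlinearity_corrections(
    h, theta_exp=2.0, df=0.05, sigma_h=0.05 / 1.645)
print(d_lin, d_nlin, d_eps)    # d_lin = 0.1, d_nlin = 0.0025 for this h
```

For $h = \ln\theta$ at $\theta_{\exp} = 2$ one has $h' = 0.5$, $h'' = -0.25$, so $c = -0.25$ and the shift $\Delta_{\mathrm{nlin}} = 0.0025$ is towards larger $\theta$, as expected for a concave $h$.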
5. Examples

5a. The Poisson case
Let us work out eq. (51) for the case of the Poisson distribution:

  $\pi(\theta; n, x) = \frac{\nu(\theta, x)^n}{n!}\, e^{-\nu(\theta, x)}$,   (52)

where $\nu(\theta, x)$ is a known function that depends on the parameter $\theta$ to be estimated (the case of neutrino mass measurements in tritium decays, cf. [5], [10]). Also known is the sample $\{x_i, n_i\}$ of size $N$ ($n_i$ is the count measured at the value $x_i$ of the control parameter $x$). From the basic formula for the quasi-optimal weight, eq. (9), dropping the additive term (which is allowed by the method without loss of optimality, see sec. 2), one obtains:

  $f(\theta_0; n, x) = n\, \partial_\theta \ln \nu(\theta_0, x) \equiv n\, A(\theta_0, x)$.   (53)

Evaluate the main function of the QOW method, $h(\theta)$. For a single control point $x_i$:

  $h(\theta, x_i) = \sum_{n \ge 0} \frac{\nu(\theta, x_i)^n}{n!}\, e^{-\nu(\theta, x_i)}\, n\, A(\theta_0, x_i) = \nu(\theta, x_i)\, A(\theta_0, x_i)$.   (54)

For the entire sample (cf. eq. (10)):

  $h(\theta) = \frac{1}{N} \sum_i h(\theta, x_i) = \frac{1}{N} \sum_i \nu(\theta, x_i)\, A(\theta_0, x_i)$.   (55)

Then the main equation of the QOW method becomes:
  $h(\theta) = \frac{1}{N} \sum_i n_i\, A(\theta_0, x_i)$.   (56)
222 exp 02 i ii x A xN (57) To compute lin , one first evaluates f according to eq. (7): f y . (58) Finally, in order to compute the loss of probability content due to the non-linearity, one substitutes eq. (58) into eqs. (43) and (44), and then use eq. (57) to construct h . Evaluation of the correction (29) that is due to the insufficiently large sample is essentially reduced to computation of N given by the following expression N (59) where exp exp , ii x (60) exp ,,, , ! n xn xA xx n x eN n (61) The summation in (61) is to be done numerically. nlin is computed from lin and the nonlinearity coefficient (eq. (41)): exp 0expexp exp 0 , ,, , i ii i ii x A xhh x A x (62) Computer implementation of all the above is completely straightforward. The reasoning here follows the general scheme and is similar to the Poisson case with the integer count n replaced by a continuous measurement y , and ,x replaced by the pair ,x , x : ,1; , exp2 2 y xy x x x , (63) where , x is a known function that depends on the parameter to be estimated. The variance x is assumed to be known for each value x i of the control parameter x and to be independent of . (If x does depend on , one has to take this into account when evaluating the derivatives, starting with the formula (64) that would become quadratic in y . It would simply make the formulae more cumbersome but is not a matter of principle.) Also known is the sample , i i x y of size N . The quasi-optimal weight is as follows: , 1; , , xy x y y A xx (64) The main function of the QOW method for a single control parameter x i , , , h x x A x , (65) and for the entire sample: i i ii i h h x x A xN N (66) The main equation of the QOW method is:
  $h(\theta) = \frac{1}{N} \sum_i y_i\, A(\theta_0, x_i)$.   (67)
The auxiliary distribution $\varphi_h$ has the mean $h(\theta)$ and the variance given by the following expression:

  $\sigma^2 = \frac{1}{N^2} \sum_i \sigma^2(x_i)\, A^2(\theta_0, x_i)$.   (68)

The Berry-Esseen correction is irrelevant here, because a sum of normally distributed random variables is automatically normally distributed. Finally, the formula for the nonlinearity coefficient is the same as in the Poisson case, eq. (62), with $\nu$ replaced by $\mu$.

5c. A numerical example

The numerical Monte Carlo test presented below illustrates how the above analytical formulae may work. The model is remotely reminiscent of, but much simpler than, what one deals with in the neutrino mass measurement experiments [4], [10]: the Poisson case (sec. 5a) with a simple model spectrum $\nu(\theta, x)$. We used 10 uniformly spaced control points $x_i = 0, 2, \ldots, 18$, $i = 1, \ldots, 10$, and 1 000 000 samples; $\theta^*$ denotes the "unknown" model value being measured.

Straightforward calculations give the values of the Berry-Esseen probability loss $\epsilon_{\mathrm{BE}, N}$ (see the analytical expression, eq. (29)) and of the overall shift $\Delta_{\mathrm{nlin}}$ of the confidence interval due to the non-linearity (see eq. (44)), together with the distortion $\delta\varepsilon$ of the probability content of the uncorrected confidence interval due to the nonlinearity (see eq. (51)).   (69)

The Berry-Esseen correction was evaluated using the most conservative value of the constant $C$, and the corresponding probability loss was then estimated as 5.4%. The probability distortion $\delta\varepsilon$ estimated using the quadratic approximation (see eq. (40)) turned out to be 0.3%. The resulting CL = 90% interval for $\theta^*$ is:

  $\theta_- \le \theta^* \le \theta_+$.   (70)

This interval includes the systematic overall nonlinear shift $\Delta_{\mathrm{nlin}}$; therefore the point estimate $\theta_{\exp}$ is not at the center of the confidence interval, which may seem unusual. The effect of the nonlinearity is clearly visible in Fig. 1.

In the Monte Carlo modelling, the confidence interval in the linear approximation (eq. (42) with $\Delta_{\mathrm{nlin}} = 0$) covers the exact $\theta^*$ with probability 90.3%. In the quadratic approximation ($\Delta_{\mathrm{nlin}} \ne 0$), the probability content becomes 90.2%, which is consistent with the negative $\delta\varepsilon$ in eq. (69). For the exact confidence interval obtained using the inverse function $h^{-1}$, the probability content is 89.8%, which is less than the target 90% due to the Berry-Esseen effect. The Berry-Esseen correction to the confidence interval increases its probability content to 94.7%, which is conservatively larger than the target 90%, as expected. Note that the better results for the linear and quadratic approximations here are an example-specific fluctuation; the more significant number is the magnitude of the effect.
An interesting feature of this example is that the two corrections work in opposite directions, which is clearly seen in the two figures below: the Berry-Esseen correction reflects the fact that the histogram in Fig. 3 leaks to the left from under the Gaussian shape, whereas the nonlinear correction, knowing nothing about that effect, attempts to preserve the probability content of the Gaussian confidence interval when mapping the result to the $\theta$ axis.

Fig. 3. The histogram of the average weight distribution (sec. 4) together with the corresponding asymptotic Gaussian. The Berry-Esseen correction to the confidence level attempts to (conservatively) estimate the probability leakage to the left of the Gaussian.

Fig. 4. The Monte Carlo distribution of the point estimate $\theta_{\exp}$. Also shown is the normal distribution that corresponds to the asymptotic large-$N$ assumption. A mismatch is seen, and it is this mismatch that the nonlinear correction $\Delta_{\mathrm{nlin}}$ attempts to correct for.
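A coverage test in the spirit of the Monte Carlo check above can be sketched as follows (a much smaller toy: the Poisson spectrum $\nu(\theta, x) = 20\,e^{-x/\theta}$, the control points and all numbers are our assumptions, since the paper's model is not reproduced here):

```python
import math
import random

# Sketch of a Monte Carlo coverage test in the spirit of sec. 5c for a
# toy Poisson spectrum nu(theta, x) = 20*exp(-x/theta); the spectrum,
# control points and all numbers are assumptions, not the paper's model.

random.seed(42)
theta_true, theta0 = 2.0, 1.8
xs = [0.4 * i for i in range(10)]
y90 = 1.6448536269514722                 # two-sided 90% normal quantile

def nu(theta, x):
    return 20.0 * math.exp(-x / theta)

def A(x):                                # d ln nu / d theta at theta0
    return x / theta0 ** 2

def h(theta):                            # eq. (55); increasing in theta
    return sum(nu(theta, x) * A(x) for x in xs) / len(xs)

def invert_h(target, lo=0.5, hi=8.0):
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if h(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def poisson(mu):
    limit, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

covered, trials = 0, 2000
for _ in range(trials):
    ns = [poisson(nu(theta_true, x)) for x in xs]
    f_exp = sum(n * A(x) for n, x in zip(ns, xs)) / len(xs)  # rhs of eq. (56)
    theta_exp = invert_h(f_exp)
    # standard deviation of the weight mean, eq. (57)
    sigma = math.sqrt(sum(nu(theta_exp, x) * A(x) ** 2 for x in xs)) / len(xs)
    lo, hi = invert_h(f_exp - y90 * sigma), invert_h(f_exp + y90 * sigma)
    covered += lo <= theta_true <= hi
coverage = covered / trials
print(coverage)     # close to the nominal 0.90
```

The interval here is the exact-inversion one of eq. (36) without the Berry-Esseen widening; repeating the loop with the widened quantile of eq. (23) would show the conservative over-coverage described above.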
6. Conclusions
To summarize, the QOW method inherits the analytical transparency of the elementary method of moments and allows one to explore, in much detail, the errors in the problem of parameter estimation. Constructive prescriptions emerge that allow a straightforward computer implementation. Eq. (29) is key for detecting a situation with an insufficiently large $N$; eq. (51) takes into account the non-linearity. A concrete example of such error evaluation is presented in sec. 5c. The obtained formulae are immediately applicable to situations with one parameter singled out on physical grounds (e.g. the neutrino mass [4], [10]), whose confidence interval is extracted from the marginal distribution. An important problem yet to be explored is the extension of the obtained results to the iterative variants of the QOW method (see the comments at the end of sec. 2).

Acknowledgments
We thank A.N. Titov (Institute for Nuclear Research RAS and KATRIN) for useful discussions and I.A. Denisov (Siberian Federal University and MOLPIT) for sharing a piece of software. This work was supported by R.Sh. Menyashev.
References

[1] F. James. Statistical Methods in Experimental Physics. World Scientific, 2008.
[2] F.V. Tkachov. Int. J. Mod. Phys. A 17 (21), 1999. arXiv:physics/0001019.
[3] F.V. Tkachov. Transcending the Least Squares. arXiv:physics/0604127.
[4] Troitsk ν-mass experiment: http://mass.inr.ru
[5] V.M. Lobashev et al. Nucl. Phys. Proc. Suppl. 66 (1998) 187-190.
[6] M. Kendall, A. Stuart. Kendall's Advanced Theory of Statistics, Volume 1: Distribution Theory. Oxford University Press, 1987.
[7] M.B. Lagutin. Visual Mathematical Statistics. BINOM, 2007 (in Russian).
[8] A.A. Borovkov. Mathematical Statistics. Gordon and Breach, 1998.
[9] A.V. Lokhov, F.V. Tkachov, P.S. Trukhanov. Nucl. Instrum. Meth. A 686 (2012). arXiv:1111.4835; A.V. Lokhov, F.V. Tkachov. PANIC 2014. arXiv:1411.6245.
[10]