Parametric unfolding. Method and restrictions
aa r X i v : . [ phy s i c s . d a t a - a n ] A p r Parametric unfolding. Method and restrictions.
Nikolay D. Gagunashvili ∗ University of Iceland, Sæmundargata 2, 101 Reykjavik, Iceland
Abstract
Parametric unfolding of a true distribution distorted due to finite resolutionand limited efficiency for the registration of individual events is discussed.Details of the computational algorithm of the unfolding procedure are pre-sented.
Keywords: comparison experimental and simulated data, homogeneitytest, weighted histogram, deconvolution problem
1. Introduction
The measured distribution P ( x ′ ) of events with a reconstructed character-istic x ′ obtained from a detector with finite resolution and limited efficiencycan be represented as P ( x ′ ) = Z Ω p ( x ) A ( x ) R ( x, x ′ ) dx, (1)where p ( x ) is the true density, A ( x ) the efficiency function describing theprobability of recording an event with a true characteristic x , and R ( x, x ′ )the experimental resolution function, i.e. the probability of obtaining x ′ instead of x after the reconstruction of the event. The integration in (1) iscarried out over the domain Ω of the variable x .If a parametric model of true distribution p ( x ; a , a , . . . , a l ) exists, themodel parameters can estimated by fitting P s ( x ′ ) = Z Ω p ( x, a , a , . . . , a l ) A ( x ) R ( x, x ′ ) dx (2) ∗ Corresponding author.
E-mail address: [email protected]
Preprint submitted to Nuclear Instruments and Methods A April 28, 2020 o the measured distribution P ( x ′ ), as discussed e.g. in [1, 2].To realize method, the efficiency function A ( x ) and the resolution function R ( x, x ′ ) must be defined. In many cases, especially in particle physics, theyare not known analytically and instead are obtained by computer simulationof the measurement process. A test statistic for comparing the histogram ofthe measured distribution P ( x ′ ) and the histogram of the measured distri-bution P s ( x ′ ) obtained by simulation [4] was used in [3], where the modelparameters were estimated by minimization of this statistic. The test wasimproved significantly in [5], computer code implementing this test was de-veloped in [6, 7].This paper extends the previous work by presenting a detailed parametricunfolding algorithm using the above mentioned results [5, 6, 7]. A bootstrapalgorithm for the calculation of the statistical errors of the estimated param-eters has been developed. To gauge the quality of the results a method ofresiduals analysis has been developed that complements the p -value of thechi-square test statistic. Application of the method as well as the evaluationare demonstrated on a numerical example.
2. Fitting a simulated parametric model to data
In experimental particle and nuclear physics analyses the modelling of themeasurement process usually is the most time-consuming step, requiring thesimulation of particle transport through a medium and the rather complexregistration apparatus. The minimization algorithm to estimate the modelparameters then is an iterative procedure that may need many to calculatethe simulated measured histogram many times. A way to decrease the CPU-time demand is to perform an initial calculation for some distribution g ( x ),and then to calculate simulated measured histogram with an alternative truedistribution p ( x ; a , a , . . . , a l ) by taking all entries with weights [8] w ( x ) = p ( x ; a , a , . . . , a l ) / ( g ( x ) , (3)that exploit the equality P s ( x ′ ) = Z Ω p ( x, a , . . . , a l ) A ( x ) R ( x, x ′ ) dx = Z Ω w ( x ) g ( x ) A ( x ) R ( x, x ′ ) dx. Let us denote the sum of the entries to the i th bin of the measuredhistogram by n i , i = 1 , ..., m , and the sum of the weights of the events in the2 th bin of the simulated measured histogram by W i ( a , .., a l ) = n wi X k =1 w ik ( a , .., a l ) , i = 1 , .., m, (4)where n wi is the number of events in bin i and w ik is the weight of the k thevent in the i th bin. Events that do not register due to inefficiency enter inan overflow bin m . The total number of events is denoted by n = P mi =1 n i and total number of simulated events by n w = P mi =1 n wi .Let the values of the parameter a , .., a l be fixed. The hypothesis of homo-geneity that the measured histogram with bin contents n i and the simulatedhistogram with bin contents W i are drawn from the same parent distributionis probed by the test statistic X (ˆ p , .., ˆ p m ) = n X i n i ˆ p i ! − n + n w X i = k r i W i ˆ p i + ( n w − P i = k r i W i ) − P i = k r i ˆ p i ! − n w , (5)where the first sum is over all bins i , and the second sum omits the leastsensitive bin k as defined below. The estimates ˆ p , . . . , ˆ p m minimize X ,ˆ p , ..., ˆ p m = argmin p ,...,p m X ( p , ..., p m ) , (6)subject to the constraints p i > ∀ i, X p i = 1 and 1 − X i = k r i p i > r i = W i P n wi k =1 w ik . (7)As shown in [5], the power of the test is optimized by the choice k = argmin i ˆ p i r i . (8)Statistic (5) has approximately a χ m − distribution if the hypothesis of ho-mogeneity is valid [5].Varying the model paramaters a , a , . . . , a l , estimators for best fit pa-rameters ˆ a , ˆ a , . . . , ˆ a l are found by minimization of the statistic (5),ˆ a , ..., ˆ a l = argmin a ,...,a l X (ˆ p , ..., ˆ p m , a , ..., a l ) . (9)3f the parametric model fits the data, the statistic X (ˆ p , ..., ˆ p m , ˆ a , ..., ˆ a l ) hasa χ m − − l distribution, because l parameters are estimated in addition to theprobabilities and can be used for a goodness-of-fit test in the selection of thebest from a set of alternative models.Another approach to the evaluation of the fit quality is the analysis ofthe residuals. The definition of Pearson’s residuals for usual histograms is res i = n i − n ˆ p i p n ˆ p i (1 − ˆ p i ) . (10)which for weighted histograms generalizes to res wi = W i − n w ˆ p i p n w ˆ p i ( r i − ˆ p i ) . (11)For a homogeneity test two unweighted histograms an adjustment of theresidual was proposed in [9], which for histograms with weighted entriesbecomes Res i = res i p − n/ ( n + q ) (12)and Res wi = res wi p − q/ ( n + q ) , (13)where q is equivalent number of unweighted events for a sample of weightedevents q = ( P i,k w ik ) P i,k w ik . (14)If the hypothesis of homogeneity is valid, then the adjusted residuals areapproximately independent and identically distributed random variables witha standard normal PDF N (0 , n, ˆ p , ..., ˆ p m . The fit is done for each histogram of the set. The resultingset of parameter estimates then permits one to calculate an estimate of thecovariance matrix of the parameters also.4 Figure 1: Histogram of true distribution p ( x ) and measured distribution P ( x ′ ).
3. Numerical example
Starting from a true PDF to be of the form p ( x ) = 23 π x − + 1 + 13 π x − + 1 , (15)the measured density P ( x ′ ) was defined according (1) with an acceptancefunction A ( x ) = 1 − ( x −
36 (16)and a gaussian resolution function R ( x, x ′ ) = 1 √ πσ exp (cid:18) − ( x ′ − x ) σ (cid:19) , σ = 1 . (17)A simulation of 10 000 events generated according to P ( x ′ ) was done (seealgorithm in [11]) and is presented as a histogram with 77 bins in Figure 1.As input for the fitting procedure the true distribution assumed to befully simulated is taken as g ( x ) = 27 π x − + 1 + 57 π x − + 1 . (18)5 Figure 2: Histogram of true distribution g ( x ) and measured distribution P s ( x ′ ). The result P s ( x ′ ) from simulating 1 000 000 events according to the algorithmdescribed in [11] is shown in Figure 2 together with the initial distribution g ( x ).As a fit-model for the true distribution the following function with fivefree parameters was chosen, p ( x ; a , a , b , b , p ) = p b π b ( x − a ) + b + 1 − p b π b ( x − a ) + b . (19)In the fit event weights w ( x ) = p ( x, a , a , b , b , p ) /g ( x ) were used accordingto the formula (3). The test statistic for X (ˆ p , .., ˆ p ) for a fixed set of pa-rameters a , a , b , b , p was calculated with methods and codes published in[5, 6, 7]. The parameter variations were driven by the SIMPLEX algorithmof the package MINUIT [12] in order to determine the best fit parametersby minimizing X (ˆ p , ..., ˆ p , ˆ a , ˆ a , ˆ b , ˆ b , ˆ p ). Figure 3 shows unfolded distri-bution compared to the true PDF. Figure 4 shows the weighted histogramwith the optimal values of parameters in comparison with the histogramrepresenting the measured events.The bootstrap method, with the resample size equal to 1000, was usedfor estimating error intervals and correlation matrix of the fit parameters.6 a = 10 . +0 . − . a = 14 . +0 . − . b = 0 . +0 . − . b = 0 . +0 . − . -0.586 -0.631 -0.679 1ˆ p = 0 . +0 . − . a a b b p Table 1: Best fit parameters with uncertainties and correlation matrix.
To resample the measured histogram, bins contents were generated as multi-nomial random numbers [14] with parameter n = 10 000 and probabilitiesˆ p , ..., ˆ p . Assignment of central value and estimate of the statistical errorsof particular parameter, for example a , is done based on the ordered list ofbootstrap estimates ˆ a , ..., ˆ a , (20)where the number in parentheses shows the location when sorting in ascend-ing order.Starting from the smallast size 68% confidence interval for ˆ a , with lowerand upper limit estimated by L = ˆ a i ) and U = ˆ a i +680) , (21)where ˆ i = argmin i [ˆ a i +680) − ˆ a i ) ] (22)the central value is taken to be ˆ a i +340) and the uncertainties are estimatedby the signed deviations from the central value − err = L − ˆ a i +340) and + err = U − ˆ a i +340) . (23)The results are given in Table 1. For the best fit parameter the value of thetest statistic is X (ˆ p , ..., ˆ p , ˆ a , ˆ a , ˆ b , ˆ b , ˆ p ) = 65 . p -value of p = 0 . p -value of p = 0 . p -values for the test statistis considered.7 Figure 3: True PDF p ( x ) and unfolded PDF p (ˆ a , ˆ a , ˆ b , ˆ b , ˆ p ) (dashed line) . Figure 4: Histogram of measured PDF P ( x ′ ) and histogram of fitted measured PDF P s ( x ′ , ˆ a , ˆ a , ˆ b , ˆ b , ˆ p ) (solid line) .8
202 5 10 15 x´ R e s Figure 5: Distribution of residuals -202 -2 0 2Theoretical quantile D a t a qu a n t il e Figure 6: Quantile-quantile plot of residuals with 95% confidence band .9 . Evaluation of the method For the evaluation of the method as a whole, the procedure describedabove was repeated 1000 times. Sets of 10 000 events distributed according p ( x ) were simulated to create histograms of the measured distribution P ( x ′ ).The same set of 1 000 000 simulated events distributed according to g ( x ) wasused in each run. Figure 7 shows plots for all pairs of the 5 parameter and thedistribution of the estimators of ˆ a , ˆ a , ˆ b , ˆ b , ˆ p . Figure 8 shows the regioncovered by the 1000 estimates of the unfolded distribution together with truedistribution p ( x ). Figure 9 presents a histogram of the distribution of the p -values and confirms that the theoretical distribution χ can be used for agoodness of fit test. aˆ aˆ b ˆ b ˆ pˆ Figure 7: Plots for all pairs of estimators of ˆ a , ˆ a , ˆ b , ˆ b , ˆ p and resulting correlationmatrix. Figure 8: Comparison of the regions covered by 1000 estimates of the unfolded distributionwith true distribution p ( x ) (solid line) . Figure 9: Distribution of p -values derived from the X test statistic. a = 10 .
000 +0.036 -0.045 +0.051 -0.038 a = 14 .
000 +0.084 -0.076 +0.084 -0.086 b = 1 .
000 +0.048 -0.062 +0.061 -0.056 b = 1 .
000 +0.114 -0.105 +0.137 -0.117 p = 0 .
667 +0.017 -0.023 +0.025 -0.021
Table 2: Error estimates from the smallest size 68% quantile interval of the parameterdistributions from 1000 toy experiments compared to the errors obtained by the bootstrapmethod for the example discussed before. a a b b p a a b b -0.586 -0.631 -0.679 -0.777 p Table 3: Correlation matrices obtained by the bootstrap method (lower triangle) andcalculated from the distribution of the 1000 simulated toy experiments (upper triangle) .Finally, the simulation study was done for different values of the res-olution parameter σ . Results are presented in Table 4, showing how theaccuracy of the parameter estimates diminishes with the worsening of thedetector resolution. The fits were done without constraints for the values ofthe parameters. The study indicates that for larger resolution parameterseventually constraints will be needed to obtain stable results.12arameter σ = 0 . σ = 0 . σ = 1 . σ = 1 . a = 10 .
000 0.049 0.063 0.081 0.123 a = 14 .
000 0.096 0.122 0.160 0.252 b = 1 .
000 0.070 0.089 0.110 0.143 b = 1 .
000 0.147 0.179 0.219 0.265 p = 0 .
667 0.027 0.032 0.040 0.041
Table 4: Sizes of 68% confidence intervals for different values of the resolution parameter σ in the response function. Conclusions
Parametric unfolding of data measured by a detector with finite reso-lution and limited efficiency is presented. The method is developed as anapplication of an improved test for comparing weighted histograms and in-corporates new computational algorithms and codes. The bootstrap methodis employed to estimation the errors of the fit parameters. Residual analysisgeneralized for weighted histograms has been developed to gauge the qual-ity of the unfolding result. A numerical example is given to illustrate themethod, and an extensive simulation study was done to confirm that theproposed method as a whole is valid.
Acknowledgments
The author is grateful to Michael Schmeling (Max Plank institute forNuclear Physics) for critical reading of the manuscript and comments andto Hj¨orleifur Sveinbj¨ornsson and Helmut Neukirchen (University of Iceland)for their help and support of this work. This research was funded by theUniversity of Iceland Research Fund (HI17080029).