A data-driven convergence criterion for iterative unfolding of smeared spectra
M. Licciardi a,∗, B. Quilain b

a Univ. Grenoble Alpes, CNRS, Grenoble INP, LPSC-IN2P3, 38000 Grenoble, France
b LLR, Ecole Polytechnique, CNRS/IN2P3, Université Paris-Saclay, Palaiseau, France

∗ Corresponding author, [email protected]
Abstract
A data-driven convergence criterion for the D'Agostini (Richardson-Lucy) iterative unfolding is presented. It relies on the unregularized spectrum (infinite number of iterations), and allows a safe estimation of the bias and undercoverage induced by truncating the algorithm. In addition, situations where the response matrix is not perfectly known are also discussed, and we show that in most cases the unregularized spectrum is not an unbiased estimator of the true distribution. Whenever a bias is introduced, either by truncation or by poor knowledge of the response, a way to retrieve appropriate coverage properties is proposed.
Introduction
Unfolding procedures are at the heart of many domains in science and engineering, from optics to high-energy physics. These procedures aim at answering an apparently simple question: what is the true physical distribution that led to the observed data? The answer is however often not simple, since detector effects (such as finite resolution or limited acceptance) smear the signal, and information about the initial distribution is partially lost. Moreover, there may be some non-trivial transformation between the true variable and the observed one: for example, one may wish to infer the momentum of a particle from its penetration length in a calorimeter. Many unfolding (or deconvolution) techniques have been proposed in the last decades [1, 2, 3, 4, 5, 6] to solve this statistical problem, in a large variety of physics fields.

A general issue when solving such inverse problems is that unfolding procedures enhance fluctuations. Indeed, to counter the smearing by detector effects, unfolding techniques act as anti-smearing processes. Any true distribution will be smeared when folded through the detector's finite resolution, so we may find a true spectrum with large fluctuations while the corresponding observed data remains relatively smooth. To mitigate this effect one uses regularization techniques to encode some additional information about, for instance, the smoothness, curvature or generic shape of the true distribution. Doing so, the variance of the true distribution's estimator is reduced, but biases are introduced, and the regularized variance does not provide proper frequentist coverage anymore. The strength of the regularization may be tuned to balance decreasing variance with increasing bias. This tuning however suffers from arbitrariness, since there seem not to be many consensual prescriptions (if any), though the statistics field has proposed several criteria [7, 8]. Popular regularized methods in high-energy physics include penalized log-likelihood minimization or Tikhonov regularization [9, 4, 6], filtered Singular Value Decomposition [10, 11], and truncated iterative unfoldings [12, 13].

The iterative unfolding algorithm introduced by D'Agostini [5, 14] in high-energy physics, and known as Richardson-Lucy [1, 2] in astrophysics since the 1970s, is widely used. D'Agostini's formulation is based on Bayes' inversion formula for conditional probabilities and is interpreted in terms of bayesian statistics. This algorithm however appears to be equivalent to the expectation-maximization (EM) algorithm applied to obtain a maximum-likelihood estimator (MLE) for Poisson likelihoods (see for example a derivation in [15]), and can be efficiently used apart from its bayesian interpretation.

In such EM algorithms, an initial guess of the true distribution is required at the start, but the limit point of the algorithm is an unbiased MLE. The regularization is introduced by stopping the algorithm after a small number of iterations. Doing so, the estimator is biased and depends on the initial guess that has been used to start the iterations. In analyses using this iterative algorithm, the cut-off is chosen, at best, after more or less detailed Monte-Carlo (MC) studies ensuring that the bias introduced by the truncation is reasonably small – in the worst case, no dedicated studies are done at all.
In any situation, such a MC-based approach is only valid if the MC distributions used for these tests are in close compatibility with the observed data. However, if the truncation bias is evaluated based on MC distributions that do not reproduce the data well, there is no guarantee for the chosen cut-off to be appropriate for the data set of interest, and the level of bias may be well underestimated.

In order to remedy this effect, we present here a data-driven criterion to choose an appropriate number of iterations for D'Agostini-like iterative algorithms. It provides an upper bound on the regularization bias that can be used to correct for under-coverage of the error estimates. This criterion has been primarily developed and used for neutrino-nucleus cross section measurements [16]. Because of nuclear collective effects, neutrino-nucleus interaction Monte-Carlo generators [17, 18] have sizeable uncertainties and their respective predictions are not always consistent with each other (though remarkable improvements have been made in the last decade). In this field, as in others where Monte-Carlo simulations are suspected to not be as accurate as expected, such a data-driven criterion will hopefully help for more cautious data analysis.

This article is organized as follows. The iterative unfolding algorithm and its properties are recalled in section 1. In section 2, a two-peak toy model used for illustration is introduced. Section 3 presents the convergence criterion, and coverage properties of unfolded spectra are studied. In section 4 the cases where the response matrix is not perfectly known are discussed. Finally, we conclude and discuss some other methods in section 5.
1. Iterative unfolding
The iterative unfolding algorithm aims at recovering the distribution of a true variable $X^\text{true}$ given the observation of its observed counterpart $Y^\text{obs}$. We follow here the description of D'Agostini [5], which works with binned distributions (histograms). We denote $D = (D_j)_{j=1,\dots,n_\text{obs}}$ the vector of observed counts; an estimator of the true distribution $N$ will be denoted as $\hat N = (\hat N_i)_{i=1,\dots,n_\text{true}}$.

The unfolding matrix $U = (U_{ij})$ is built as the transition matrix

$$U_{ij} \equiv P(X^\text{true} \text{ in bin } i \mid Y^\text{obs} \text{ in bin } j) \equiv P(X^\text{true}_i \mid Y^\text{obs}_j), \tag{1}$$

which can be written, using Bayes' formula, as

$$U_{ij} = \frac{P(Y^\text{obs}_j \mid X^\text{true}_i)\, P(X^\text{true}_i)}{P(Y^\text{obs}_j)}. \tag{2}$$

The denominator can be regarded as a normalization factor, ensuring that $\sum_i U_{ij} = 1$: all observed counts originate from some true bin. The reverted conditional probability $P(Y^\text{obs}_j \mid X^\text{true}_i)$ is to be identified with the detector response matrix $R$, transforming the true variable into the observed one. Finally, $P(X^\text{true}_i)$ is a prior guess of what the distribution of $X^\text{true}$ could be, called in short prior (denoted $P_0$). The unfolding matrix then writes

$$U_{ij} = \frac{R_{ij}\, P_{0,i}}{\sum_l R_{lj}\, P_{0,l}}, \tag{3}$$

built upon only two ingredients: the detector response matrix and the prior. The unfolded estimator $\hat N$ is obtained as $\hat N = U \cdot D$, as a direct consequence of the relation $P(X^\text{true}_i) = \sum_j P(X^\text{true}_i \mid Y^\text{obs}_j)\, P(Y^\text{obs}_j)$.

To mitigate the arbitrariness due to the choice of a specific prior, iterations are introduced. The prior is replaced, for the next iteration, by the true spectrum $\hat N$ just extracted, normalized as $\hat N_i / \sum_l \hat N_l$. The algorithm then reads:

1. Initialization: pick a prior $P_0$;
2. Recursion: for any iteration $k$:
   • build the unfolding matrix $U_k$ as
     $$U_{k,ij} = \frac{R_{ij}\, P_{k,i}}{\sum_l R_{lj}\, P_{k,l}}; \tag{4}$$
   • extract the unfolded distribution $\hat N_{k+1} = U_k \cdot D$;
   • update the prior as $P_{k+1,i} \equiv \hat N_{k+1,i} / \sum_l \hat N_{k+1,l}$.

This algorithm produces a sequence of true spectra $(\hat N_k)_{k \geqslant 1}$, for which the prior is (up to a normalization) nothing but the initial condition. The response matrix stays the same throughout all iterations and defines the endpoint of the sequence, which we may denote $\hat N_\infty$.

What are the properties of $\hat N_\infty$? A few theoretical studies have been done on this iterative algorithm, through its connection to the expectation-maximization (EM) algorithm. Indeed, applying an EM algorithm on Poisson likelihoods leads to the exact same iteration (see a derivation in [15, section 4.1.2]). The results of interest here are as follows [15, 19, and references therein]:

1. $\hat N_\infty$ is a maximum-likelihood estimator (MLE) for the Poisson likelihood built from the observations $D$:
   $$\mathcal{L}(N; D) = \prod_{j=1}^{n_\text{obs}} \text{Poisson}\big(D_j; (RN)_j\big); \tag{5}$$
2. it does not depend on the chosen prior;
3. if the response matrix is perfectly known, then $\hat N_\infty$ is an unbiased MLE, i.e. over statistical realizations we have $\langle \hat N_\infty \rangle = N$, where $N$ is the true distribution.

Note that these properties hold under the assumption that the response has full column rank, i.e. $\text{rank}(R) = n_\text{true}$; if needed, the number of bins $n_\text{true}$ of the unfolded distribution can be reduced until this condition is fulfilled.
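To make the recursion concrete, the following is a minimal NumPy sketch of eqns. (3)-(4). The function name, the shape convention $R[i,j] = P(Y^\text{obs}_j \mid X^\text{true}_i)$, and the return of the whole sequence of spectra are our own illustrative choices, not code from the original references.

```python
import numpy as np

def dagostini_unfold(D, R, prior, n_iter):
    """Iterate eqn. (4). D: observed histogram, shape (n_obs,);
    R[i, j] = P(Y_obs in bin j | X_true in bin i), shape (n_true, n_obs);
    prior: any positive initial guess of the true spectrum, shape (n_true,)."""
    P = prior / prior.sum()                      # normalized prior P_k
    spectra = []
    for _ in range(n_iter):
        denom = R.T @ P                          # sum_l R[l, j] P[l], shape (n_obs,)
        U = R * P[:, None] / denom[None, :]      # unfolding matrix U_k, eqn. (4)
        N_hat = U @ D                            # unfolded spectrum N_{k+1}
        spectra.append(N_hat)
        P = N_hat / N_hat.sum()                  # updated prior P_{k+1}
    return spectra                               # the sequence (N_k)
```

Each iteration costs only two matrix-vector products, so running the algorithm far beyond the chosen truncation point (to approximate the endpoint) is cheap.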
The three properties above are lost when the algorithm is truncated after a finite number of iterations $k$: $\hat N_k$ is not a maximum-likelihood estimator. It varies under a change of prior, and the lower $k$, the larger this variation. A truncation bias is introduced: even with the exact response matrix, we have $\langle \hat N_k \rangle \neq N$, resulting in (severe) undercoverage. (A singular exception arises when the prior equals the true distribution, $P_0 = N$, but this never occurs in real data analysis, where the true distribution is unknown.) However, $\hat N_\infty$ often suffers from large variance and lack of smoothness, which is a typical feature of anti-smearing processes. Therefore, limited deviations from these properties may be acceptable, but the choice of $k$ should be addressed, in any case, with special care; small values of $k$ could lead to sizeable bias. We present in this paper a data-driven criterion to choose a suitable $k$ (cf. section 3.2).

The importance of the response matrix can be illustrated as follows in a simple situation. Let $R_0$ be the exact response matrix; observations are such that $D \sim \text{Poisson}(R_0 N)$. Assuming $R$ to be an invertible square matrix, we may formally identify $\hat N_\infty$ with $R^{-1} D$. As a result, $\langle \hat N_\infty \rangle = R^{-1} R_0 N$: the endpoint is biased when the response matrix used in the unfolding is not the exact one ($R \neq R_0$).

There is no simple analytical relation between the unfolded ($\hat N_k$) and observed ($D$) distributions for $k \geqslant 1$. However, an error propagation matrix $E_k$ can be iteratively built [20] and allows to obtain the covariance matrix $V_k$ of the unfolded spectrum analytically as

$$V_k = E_k V_D E_k^T \tag{6}$$

where $V_D$ is the covariance matrix associated to the observed distribution $D$. Another option is to numerically sample the covariance matrix $V_k$ (this procedure is known as bootstrap resampling in statistics), as follows:

1. build a set of toy spectra $\{D^{(t)}\}_{t=1,\dots,N_t}$ following the variance $V_D$;
2. unfold each toy spectrum separately to get $\{\hat N_k^{(t)}\}_{t=1,\dots,N_t}$;
3. an estimator of the variance $V_k$ is given by
   $$\hat V_k = \frac{1}{N_t - 1} \sum_t \delta \hat N_k^{(t)T} \cdot \delta \hat N_k^{(t)} \tag{7}$$
   where $\delta \hat N_k^{(t)} = \hat N_k^{(t)} - \langle \hat N_k^{(t)} \rangle$.

Another source of variance of the unfolded spectrum is systematic and comes from uncertainties on the detector response matrix. Because the response matrix defines the endpoint $\hat N_\infty$, a biased response would lead to a biased endpoint spectrum; this is illustrated in section 4. A first simple solution is to increase the covariance of the observed distribution, $V_D \to V_D + V_\text{syst}$, to include systematic uncertainties, and propagate this new error matrix to the unfolded space using one of the methods described above (analytical or numerical). The systematic variance can also be evaluated from alternative unfoldings, built using modified response matrices in eqn. (4). From a set of response matrices $\{R^{(r)}\}$ representative of the expected variations of $R$, one could obtain a set of unfolded spectra $\{\hat N_k^{(r)}\}$. Their distribution allows to build a systematic covariance matrix for $\hat N_k$ (and $\hat N_\infty$) as in eqn. (7). Whatever method is used, a useful validation is to check that the total variance $V_k$ provides proper coverage for systematically and statistically fluctuating realizations.
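As an illustration of the numerical option, here is a sketch of the bootstrap estimate of eqn. (7); it assumes Poisson-distributed observed counts (so that the toys can be thrown bin by bin) and reuses the dagostini_unfold helper sketched above.

```python
import numpy as np

def bootstrap_covariance(D, R, prior, k, n_toys=1000, seed=0):
    """Numerical estimate of V_k, eqn. (7), from Poisson-fluctuated toys."""
    rng = np.random.default_rng(seed)
    toys = rng.poisson(lam=D, size=(n_toys, len(D)))          # toy spectra D^(t)
    unfolded = np.array([dagostini_unfold(t, R, prior, k)[-1] for t in toys])
    delta = unfolded - unfolded.mean(axis=0)                  # deviations from mean
    return delta.T @ delta / (n_toys - 1)                     # (n_true, n_true)
```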
So far, the observed data $D$ has been assumed to be background-free, which is not a realistic case for most high-energy experiments. Backgrounds can be accounted for in several ways.

1. Background subtraction. The MC background prediction $B_\text{MC}$ is subtracted from the observed data $D$ and the unfolding is applied on the signal distribution $D - B_\text{MC}$. This is however only relevant when the background prediction is known to be accurate.

2. Scaling factor from control regions. A common way to monitor the MC background prediction is to use control regions (sidebands, SB). The observed data/MC ratio $\alpha = N^\text{SB}_\text{Data} / N^\text{SB}_\text{MC}$ in the sideband is used to scale the background prediction in the region of interest; the unfolding is then applied on $D - \alpha B_\text{MC}$. To be used, this method requires: 1) to build signal-free sidebands and 2) that the extrapolation of a single normalization-like factor $\alpha$ from sideband to the main sample is meaningful. The latter is achieved when the background distribution in the control region closely relates to the background in the main sample, e.g. if they share the same kinematic distribution, or type of interactions, etc.

3. Simultaneous unfolding of signal and sidebands. When signal events are observed in the sideband, the unfolding matrix can be extended to
   $$\begin{pmatrix} \hat N_\text{Signal} \\ \hat N_\text{Bkgd} \end{pmatrix} = U \cdot \begin{pmatrix} D \\ D_\text{SB} \end{pmatrix} \tag{8}$$
   where $D$ ($D_\text{SB}$) is the observed distribution of events in the main sample (sideband), and $\hat N_\text{Signal}$ ($\hat N_\text{Bkgd}$) the unfolded distribution of signal (background) events. This method is also useful when data/MC shape discrepancies are observed in background distributions, making the normalization correction of method 2 inappropriate.

From a statistical point of view, an asset of the last method is to preserve the Poisson properties of the input distribution $(D, D_\text{SB})$. In (scaled) background subtraction, it is therefore not guaranteed that the properties of the endpoint $\hat N_\infty$ are preserved, in particular that it provides an unbiased MLE.

In the following we will assume the input distribution to follow Poisson statistics. Up to a redefinition of binnings, simple and simultaneous unfoldings are equivalent, and need not be treated differently; a sketch of the simultaneous case is given below.
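In code, the simultaneous unfolding of eqn. (8) reduces to running the same algorithm on concatenated histograms; the block layout of the response below is an assumption for illustration, reusing the dagostini_unfold sketch.

```python
import numpy as np

def simultaneous_unfold(D_main, D_sb, R_sig, R_bkg, prior_sig, prior_bkg, k):
    """Eqn. (8): ordinary unfolding of the concatenated (main, sideband) data.
    R_sig and R_bkg map true signal/background bins to all observed bins,
    with shapes (n_sig, n_obs_main + n_obs_sb) and (n_bkg, n_obs_main + n_obs_sb)."""
    D_ext = np.concatenate([D_main, D_sb])          # extended observed vector
    R_ext = np.vstack([R_sig, R_bkg])               # extended response matrix
    prior = np.concatenate([prior_sig, prior_bkg])
    N_ext = dagostini_unfold(D_ext, R_ext, prior, k)[-1]
    return N_ext[:prior_sig.size], N_ext[prior_sig.size:]   # (signal, background)
```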
2. The 2-peak model
Throughout this paper we will use a simple two-peak model to illustrate the behaviour of the iterative unfolding. The true two-peak distribution is smeared by an artificial detector response, defined here as a convolution with a gaussian of width $\sigma_s = 0.5$. The true spectrum and the smeared spectrum are displayed in figure 1 along with the response matrix. The number of bins is set to $n_\text{obs} = 20$ for observed data and $n_\text{true} = 12$ for the unfolded spectrum. The parameters of the two peaks are given in table 1. Modified two-peak models, used to test the properties of the algorithm (cf. sections 3-4), are also introduced in table 1.
Figure 1: (a) True and smeared spectra for the nominal two-peak model. (b) The corresponding response matrix, with $n_\text{true} = 12$ and $n_\text{obs} = 20$.

Table 1: Definition of the two-peak toy models used in this article. $(A_i, \mu_i, \sigma_i)$ refers to the maximum amplitude, mean position and width of the $i$-th peak; $b$ is the baseline and $\sigma_s$ the smearing width. Differences with respect to the nominal model are highlighted.

Toy model            First peak (A1, µ1, σ1)   Second peak (A2, µ2, σ2)   Baseline b   Smearing σs
1  Nominal           (1, 3, 0.5)               (1.5, 6, 0.75)             …            0.5
2  Shifted peaks     (1, …, 0.5)               (1.5, …, 0.75)             …            0.5
3  Closer peaks      (1, …, 0.5)               (1.5, 6, 0.75)             …            0.5
4  Wider 2nd peak    (1, 3, 0.5)               (1.5, 6, …)                …            0.5
5  Smaller 1st peak  (…, 3, 0.5)               (1.5, 6, 0.75)             …            0.5
1n More smearing     Same as toy model n                                               …
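A toy of this kind can be generated along the following lines; the x-range [0, 10], the baseline value, and the Monte-Carlo construction of the response are assumptions consistent with table 1 and figure 1, not the exact implementation used for the figures.

```python
import numpy as np

rng = np.random.default_rng(1)
n_true, n_obs, sigma_s = 12, 20, 0.5
edges_true = np.linspace(0.0, 10.0, n_true + 1)    # true-space binning
edges_obs = np.linspace(0.0, 10.0, n_obs + 1)      # observed-space binning

def true_density(x):
    """Two gaussian peaks on a flat baseline (table 1, nominal model);
    the baseline value 0.1 is a placeholder for the lost table entry."""
    peak1 = 1.0 * np.exp(-0.5 * ((x - 3.0) / 0.5) ** 2)
    peak2 = 1.5 * np.exp(-0.5 * ((x - 6.0) / 0.75) ** 2)
    return peak1 + peak2 + 0.1

# Monte-Carlo construction of R[i, j] = P(Y_obs in bin j | X_true in bin i):
grid = np.linspace(0.0, 10.0, 2001)
w = true_density(grid)
x_t = rng.choice(grid, size=500_000, p=w / w.sum())        # true-space events
y_o = x_t + rng.normal(0.0, sigma_s, size=x_t.size)        # gaussian smearing
H, _, _ = np.histogram2d(x_t, y_o, bins=[edges_true, edges_obs])
R = H / H.sum(axis=1, keepdims=True)   # row-normalize; ignores out-of-range losses
```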
3. Data-driven convergence criterion
3.1. Notations and definitions

Let us first summarize some notations and definitions used through this paper. We define the following vectors of size $n \equiv n_\text{true}$ (number of bins in the phase-space of true variables):

• $N$ is the true spectrum, or truth;
• $\hat N_k$ is the unfolded spectrum (true spectrum estimator) after $k$ iterations of the algorithm;
• $\langle \hat N_k \rangle \equiv E(\hat N_k)$ is the expected value of the unfolded spectrum after $k$ iterations;
• $b_k = \langle \hat N_k \rangle - N$ is the average bias.

We also denote the following $n \times n$ matrix:

• $V_k \equiv \text{Var}(\hat N_k) = E\big[(\hat N_k - \langle \hat N_k \rangle)(\hat N_k - \langle \hat N_k \rangle)^T\big]$ is the covariance matrix associated to the unfolded spectrum $\hat N_k$.

Finally, we also use the following metrics to study the convergence of the algorithm and establish the convergence criterion:

• a measure of the distance of the unfolded spectrum to the truth:
  $$\chi^2_\text{true}(k) = \big(\hat N_k - N\big)^T \cdot V_k^{-1} \cdot \big(\hat N_k - N\big); \tag{9}$$
• a measure of the distance of the unfolded spectrum to the endpoint:
  $$\chi^2_\text{data}(k) = \big(\hat N_k - \hat N_\infty\big)^T \cdot V_k^{-1} \cdot \big(\hat N_k - \hat N_\infty\big), \tag{10}$$
  for which we have by construction $\lim_{k\to\infty} \chi^2_\text{data}(k) = 0$;
• a measure to compare bias and variance:
  $$\chi^2_\text{bias}(k) = b_k^T V_k^{-1} b_k. \tag{11}$$

The following useful relation (proof in appendix) describes how the distribution of $\chi^2_\text{true}$ departs from a perfect $\chi^2$ law in presence of bias:

$$E\big[\chi^2_\text{true}(k)\big] = n + \chi^2_\text{bias}(k). \tag{12}$$

3.2. The convergence criterion

The iterative unfolding produces, from a given input data spectrum, a sequence of spectra with associated covariances $(\hat N_k, V_k)_{k \geqslant 1}$. We would like to build a data-driven criterion, i.e. one to be applied only on this sequence of spectra and covariances and not on an a priori MC distribution, to determine what is an appropriate number of iterations to unfold this particular input data spectrum.

Because the algorithm is truncated (finite number of iterations $k$) it is expected to have a convergence bias $b_k \equiv \langle \hat N_k \rangle - N \neq 0$. For this bias to have a limited impact on the coverage – defined at this point by the covariance $V_k$ – we would like to keep it "well smaller than the error bars". In other words, we wish to have

$$\chi^2_\text{bias}(k) \equiv b_k^T V_k^{-1} b_k \leqslant n\varepsilon \tag{13}$$

for some $\varepsilon$ much smaller than 1, or equivalently, from eqn. (12),

$$E\big[\chi^2_\text{true}(k)\big] \leqslant n(1 + \varepsilon). \tag{14}$$

The impact of the size of the bias, $\varepsilon$, on the coverage is studied in section 3.3.

This $\chi^2_\text{bias}$ is not computable for actual data since biases are unknown. Our proposal is to use the endpoint spectrum $\hat N_\infty$ as a pivot. This endpoint is an unbiased MLE of the true distribution (cf. section 1), leading to $b_\infty = 0$ (we insist that this only holds when the response matrix is exactly known; other cases are discussed in section 4). Consequently, if the unfolded spectrum $\hat N_k$ is close enough to the endpoint, we can expect it to be close to the true spectrum as well. Formally, we define the distance to the endpoint $d_k = \hat N_k - \hat N_\infty$ and the $\chi^2_\text{data}$ metric as

$$\chi^2_\text{data}(k) \equiv d_k^T V_k^{-1} d_k, \tag{15}$$

which describes the level of convergence of the algorithm. As it is built only from the sequence of unfolded spectra $(\hat N_k, V_k)_{k \geqslant 1}$, this quantity can be used to construct a data-driven convergence criterion.
As it is built only from thesequence of unfolded spectra ( ˆ N k , V k ) k (cid:62) , this quantity can be used to construct a data-driven convergence criterion.Using toy studies for which the truth is known, the goal is to find some η such that E (cid:2) χ ( k ) (cid:3) (cid:54) n η = ⇒ E (cid:2) χ ( k ) (cid:3) (cid:54) n (1 + ε ) . (16)This relation describes an average behaviour, determined over many statistical fluctuations.Since we only have a single realization of the real experiment, the condition to be appliedon the unfolded data is simply χ ( k ) (cid:54) n η . (17)The number of iterations k chosen to truncate the algorithm will be the smallest k satisfyingthe above relation.Typical MC-driven convergence criteria would only rely on conditions similar to theright-hand side of eqn. (16). Looking at the evolution of biases using fake-data sets wouldprovide the number of iterations k , usually very small ( k < k and thecorresponding spectrum ˆ N k obtained this way are largely correlated to the choice of prior,and the amount of bias after only a few iterations depends on how different is the truthfrom the prior. In particular, the (unknown) true distribution of real data may be furtherfrom the prior that what has been tested with toy models. The data-driven criterion ofeqn. (17) relies instead on the endpoint spectrum, which does not vary with the chosenprior. Whenever the difference truth/prior is higher on real data, the endpoint spectrumremains a robust quantity upon which a convergence criterion may be built.Admittedly, the value of η in eqn. (17) is chosen using MC toy models, and this criterionis not fully data-driven. However, the convergence speed of χ is characterized by theresponse matrix used in the unfolding, which introduce much less model-dependence thanthe choice of a specific prior. Even if η is chosen on some toy models, the criterion (17) istherefore still relevant for real data, where the truth is unknown. ε Let us study now how the coverage provided by the covariance V k evolves with thenumber of iterations k , and investigate its relation with the measure of bias χ . For agiven k and a given realization of the experiment ˆ N k , the confidence region for 1 − α CL isdefined as the set of spectra N such that χ k ( N ) (cid:54) χ ( α ) , (18)with χ k ( N ) = (cid:0) ˆ N k − N (cid:1) T V − k (cid:0) ˆ N k − N (cid:1) (19) We insist once again that this only holds when the response matrix is exactly known. Other cases arediscussed in section 4. χ ( α ) = F n (1 − α ) (20)where F n is the inverse of the cumulative function of a χ distribution with n degrees offreedom [21] (in the large sample approximation). The coverage of this confidence region isdefined as the fraction of statistical realizations ˆ N k for which the true spectrum N belongsto the confidence region, that is χ k ( N ) ≡ χ ( k ) (cid:54) χ ( α ) . (21)When there is no bias ( χ = 0), the region defined by eqns. (18-20) achieves the nominal1 − α coverage; however, in presence of bias, the coverage is reduced. This is illustrated infigure 2 using toy model k →∞ E [ χ ( k )] = n or equivalently lim k →∞ χ ( k ) = 0. Large biases areobserved for k (cid:46)
3.3. Impact of the size of the bias ε

Let us now study how the coverage provided by the covariance $V_k$ evolves with the number of iterations $k$, and investigate its relation with the measure of bias $\chi^2_\text{bias}$. For a given $k$ and a given realization of the experiment $\hat N_k$, the confidence region at $1-\alpha$ CL is defined as the set of spectra $N'$ such that

$$\chi^2_k(N') \leqslant \chi^2_c(\alpha), \tag{18}$$

with

$$\chi^2_k(N') = \big(\hat N_k - N'\big)^T V_k^{-1} \big(\hat N_k - N'\big) \tag{19}$$

and

$$\chi^2_c(\alpha) = F_n(1-\alpha), \tag{20}$$

where $F_n$ is the inverse of the cumulative function of a $\chi^2$ distribution with $n$ degrees of freedom [21] (in the large sample approximation). The coverage of this confidence region is defined as the fraction of statistical realizations $\hat N_k$ for which the true spectrum $N$ belongs to the confidence region, that is

$$\chi^2_k(N) \equiv \chi^2_\text{true}(k) \leqslant \chi^2_c(\alpha). \tag{21}$$

When there is no bias ($\chi^2_\text{bias} = 0$), the region defined by eqns. (18-20) achieves the nominal $1-\alpha$ coverage; however, in presence of bias, the coverage is reduced. This is illustrated in figure 2 using the nominal toy model, for which $\lim_{k\to\infty} E[\chi^2_\text{true}(k)] = n$, or equivalently $\lim_{k\to\infty} \chi^2_\text{bias}(k) = 0$. Large biases are observed for $k \lesssim 10$, resulting in poor coverage. With more iterations ($k \gtrsim 10$) the bias decreases and the coverage improves (exceeding 60% for the 68% CL region); for $k \gtrsim 20$ in this example there is almost no bias and the coverage is as expected.
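The coverage study of figure 2 can be reproduced in outline as follows, with scipy.stats.chi2 providing $F_n$; N_true, R and the helpers sketched earlier are assumed.

```python
import numpy as np
from scipy.stats import chi2

def coverage(N_true, R, prior, k, alpha=0.32, n_toys=500, seed=2):
    """Fraction of toys whose (1 - alpha) CL region, eqns. (18)-(20),
    contains the true spectrum, eqn. (21)."""
    rng = np.random.default_rng(seed)
    n = N_true.size
    crit = chi2.ppf(1.0 - alpha, df=n)        # F_n(1 - alpha), eqn. (20)
    mu = R.T @ N_true                         # folded expectation, cf. eqn. (5)
    hits = 0
    for _ in range(n_toys):
        D = rng.poisson(mu)
        N_k = dagostini_unfold(D, R, prior, k)[-1]
        V_k = bootstrap_covariance(D, R, prior, k)
        r = N_k - N_true
        hits += r @ np.linalg.solve(V_k, r) <= crit   # eqn. (21)
    return hits / n_toys
```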
Figure 2: Evolution of the coverage of confidence regions for (a) 68% CL and (b) 90% CL, produced using the nominal model, for which $\lim_{k\to\infty} E[\chi^2_\text{true}(k)]/n = 1$ (unbiased endpoint). The 68%-quantile of $\chi^2_\text{true}$ is the value of $\chi^2_c$ that provides a 68% CL confidence region with proper coverage; it corresponds to $F_n(1-\alpha)$ (red dotted line) in the no-bias limit (large $k$). The extended confidence region defined by $\chi^2_c(\alpha; \chi^2_\text{bias})$ achieves correct coverage even in presence of large biases.

The data-driven criterion (16) is built to provide a number of iterations $k$ allowing for a control of the truncation bias: $\chi^2_\text{bias}(k) \leqslant n\varepsilon$. Nonetheless, the confidence region defined with the nominal $\chi^2_c(\alpha) = F_n(1-\alpha)$ undercovers in presence of even a small bias; one could then seek to increase this boundary to recover an appropriate coverage. We found that

$$\chi^2_c\big(\alpha; \chi^2_\text{bias}\big) \equiv \big(1 + \chi^2_\text{bias}/n\big) \cdot F_n(1-\alpha) \tag{22}$$

is a good approximation, valid for a wide range of values of $\chi^2_\text{bias}$ and $\alpha$. The coverage provided by eqn. (22) is illustrated in figure 2 for $1-\alpha$ being 68% or 90%; it is very satisfactory for $\chi^2_\text{bias}/n \lesssim 1$. On real data analysis where $\chi^2_\text{bias}(k)$ is unknown, a conservative confidence region can be defined using

$$\chi^2_c(\alpha; n\varepsilon) = (1+\varepsilon) \cdot F_n(1-\alpha). \tag{23}$$

We have illustrated in this section how the coverage evolves in presence of bias, and how confidence regions can be extended to compensate the undercoverage induced by such biases. The choice of $\varepsilon$ in the criterion (16) is left to the discretion of the analyzer, but some recommendations follow; a short sketch of eqns. (22)-(23) is given after this list.

1. For $\varepsilon \ll 1$, the undercoverage is negligible and the unfolded spectrum $\hat N_k$ and its covariance $V_k$ can be used to define confidence regions. However, note that when the response matrix is not accurately known we have $\lim_{k\to\infty} \chi^2_\text{bias}(k)/n > 0$ (driven by the endpoint bias $b_\infty$), and arbitrarily small values of $\varepsilon$ are not possible (such cases are discussed in section 4).

2. For $\varepsilon < 1$, the loss of coverage may become significant and should be accounted for. Conservative confidence regions providing at least the nominal coverage can be recovered using $\chi^2_c(\alpha; n\varepsilon)$ from eqn. (23), or equivalently by inflating the covariance matrix as $V_k \to V_k(1+\varepsilon)$.
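In code, eqns. (22)-(23) amount to a one-line inflation of the critical value; a sketch using scipy:

```python
from scipy.stats import chi2

def chi2_crit_extended(alpha, n, chi2_bias):
    """(1 + chi2_bias / n) * F_n(1 - alpha), eqn. (22); pass chi2_bias = n * eps
    for the conservative data-driven version of eqn. (23)."""
    return (1.0 + chi2_bias / n) * chi2.ppf(1.0 - alpha, df=n)
```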
3.4. Choice of η

Having set the value of $\varepsilon$, the remaining task is to find $\eta$ fulfilling the condition

$$E\big[\chi^2_\text{data}(k)\big] \leqslant n\eta \implies E\big[\chi^2_\text{true}(k)\big] \leqslant n(1+\varepsilon). \tag{24}$$

This section illustrates how $\eta$ can be chosen. Since the amount of bias and its evolution over iterations depend on the unknown true model, extracting $\eta$ from the nominal model (or standard prediction, or MC prediction) is not enough. It is important to evaluate what alternative true models could be plausible. For the two-peak model, we considered (cf. table 1): both peaks shifted in the same direction (model 2), closer peaks (model 3), a wider second peak (model 4) and a smaller first peak (model 5). The evolutions of $E(\chi^2_\text{true})$ and $E(\chi^2_\text{data})$ for these models are displayed in figure 3; $\varepsilon$ is set here to 0.2. For each model, the true response is used in the unfolding, but the prior is based on the nominal model. The first iteration such that $E(\chi^2_\text{true}) \leqslant n(1+\varepsilon)$ – or equivalently $\chi^2_\text{bias} \leqslant n\varepsilon$ – varies from 3 to 9. When compared to the nominal model, the largest shape discrepancies occur for models with shifted peaks; $\chi^2_\text{data}(k=1)$, correlated to the difference prior/truth, is then the largest for these models. In turn, more iterations are required to reach below the $\varepsilon$ threshold. The values of $E(\chi^2_\text{data}/n)$ range from 2.08 to 22.1. By choosing $\eta$ as the lowest $E(\chi^2_\text{data}/n)$ among all models (in our example $\eta = 2.08$), the condition (24) is fulfilled for all the models considered.

The criterion

$$\chi^2_\text{data}(k) \leqslant n\eta \tag{25}$$

is then applied on single realizations of the experiment, as illustrated in figure 4 for an Asimov data set (a) and a statistically fluctuated data set (b), with the prior taken in turn as the nominal model, the truth, or a flat distribution. For Asimov (fluctuated) data, the criterion is met at $k = 5$ (9) with the nominal prior, $k = 2$ (11) with the truth as prior, and $k = 6$ (13) with a flat prior. We can expect $k$ to be smallest when the prior is the truth; this is verified with Asimov data. However, because of random statistical fluctuations, the initial data set may appear more similar to another model: for fluctuated data, the lowest $k$ is obtained with the nominal prior.

We also displayed

$$\chi^2_\text{prior}(k) = \Delta \hat N_k^T V_k^{-1} \Delta \hat N_k, \tag{26}$$

with $\Delta \hat N_k = \hat N_k - \hat N'_k$ the difference induced by changing the prior in the unfolding: $\hat N_k$ is obtained using the nominal prior and $\hat N'_k$ using the truth as prior. This emphasizes that the endpoint spectrum does not depend on the selected prior, i.e. $\lim_{k\to\infty} \chi^2_\text{prior}(k) = 0$.

It would have been closer to real conditions to fix the prior and vary the true model. However, in order to compare the behaviour on the same fluctuation, the prior has been varied for a fixed truth (and realization).
[Figure 3 panels: unfolding of models 2, 3, 4 and 5, showing $E(\chi^2_\text{true}/n)$, $E(\chi^2_\text{data}/n)$ and $\chi^2_\text{bias}/n$ versus the number of iterations; the values of $E(\chi^2_\text{data}/n)$ at the first iteration passing the $\varepsilon$ threshold are 2.08, 2.15, 7.53 and 22.07 respectively.]
Figure 3: Evolution of $E(\chi^2_\text{true}/n)$ and $\chi^2_\text{bias}/n = E(\chi^2_\text{true}/n) - 1$ for models 2 to 5, with $\chi^2_\text{bias}/n \to 0$ as $k \to \infty$. The first iteration such that $E(\chi^2_\text{true}/n) \leqslant 1 + \varepsilon$ is indicated, with $\varepsilon = 0.2$. The corresponding values of $E(\chi^2_\text{data}/n)$ are indicated as well; $\eta$ is set as the lowest $E(\chi^2_\text{data}/n)$ among all models.
[Figure 4 panels: (a) Asimov data, (b) fluctuated data; each shows $\chi^2_\text{data}/n$ for priors taken as the nominal model, the truth and a flat distribution, together with $\chi^2_\text{prior}/n$ and the threshold $\eta$.]
Figure 4: Evolution of the $\chi^2$ metrics for a single realization of the two-peak model, for Asimov data (a) and fluctuated data (b), with the prior taken as the nominal model, the truth, or a flat distribution.

4. Imperfectly known response matrix

So far, the basic case of a perfectly known response matrix has been discussed. In full generality, the bias of the algorithm at iteration $k$ can be written as

$$b_k \equiv \langle \hat N_k \rangle - N = \underbrace{\langle \hat N_k \rangle - \langle \hat N_\infty \rangle}_{\text{truncation bias}} + \underbrace{\langle \hat N_\infty \rangle - N}_{\text{endpoint bias}} \tag{27}$$

with the truncation bias vanishing in the limit $k \to \infty$. As discussed in section 1, the endpoint spectrum $\hat N_\infty$ is an unbiased MLE (i.e. $\langle \hat N_\infty \rangle = N$) when the response matrix is exactly known, but there remains a non-zero bias $b_\infty$ (called here endpoint bias) otherwise. In fact, we claim that in most data analyses the response matrix is not perfectly accurate and endpoint biases should be considered. We consider three sources which can bias the response matrix:

• imperfect or biased knowledge of the detector response;
• finite binning of the true distributions;
• limited statistics of the simulation which might be used to build the response matrix.

Let us first mention the most obvious situation, where the detector response suffers from systematic uncertainties; it occurs when, for instance, resolution or acceptance are not perfectly known or modelled. This is illustrated in our model when the smearing width $\sigma_s$ used to build the response matrix differs from the real one (cf. figure 5), leading to significant biases: $\lim_{k\to\infty} \chi^2_\text{bias}(k)/n \simeq 0.25$. In this context, the addition of a systematic covariance matrix $V_k^\text{syst}$ to the purely statistical $V_k$,

$$\chi^2_\text{bias}(k) \to \chi^2_\text{bias}(k) = b_k^T \big[V_k + V_k^\text{syst}\big]^{-1} b_k, \tag{28}$$

will reduce the relative size of the endpoint bias with respect to the uncertainties and possibly retrieve $\lim_{k\to\infty} \chi^2_\text{bias}(k)/n \ll 1$. In our example (figure 5 (b)) the bias indeed reduces to $\lim_{k\to\infty} \chi^2_\text{bias}(k)/n \simeq 0.054$.
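The modified bias measure of eqn. (28) is a direct extension of the statistical one; V_syst below stands for whatever systematic covariance the analysis assigns to the response (an assumption of this sketch).

```python
import numpy as np

def chi2_bias_total(b, V_stat, V_syst):
    """b^T (V_stat + V_syst)^{-1} b, eqn. (28)."""
    return b @ np.linalg.solve(V_stat + V_syst, b)
```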
[Figure 5 panels: (a) $\lim \chi^2_\text{bias}/n = 0.250$; (b) $\lim \chi^2_\text{bias}/n = 0.054$.]
Figure 5: Evolution of $\chi^2_\text{bias}/n = E(\chi^2_\text{true}/n) - 1$ when the unfolding uses a smearing $\sigma_s^\text{MC} = 0.5$ while the real smearing is $\sigma_s = 0.55$. (a) When no systematic uncertainty accounts for this mismodelling, large biases are observed. (b) When a systematic covariance matrix is added in the computation of $\chi^2_\text{bias}$, the relative size of the bias with respect to the total uncertainty is significantly reduced.
A second source of inaccuracy comes from the fact that binned distributions are used. Let us note $\rho_X(x_t)$ (resp. $\rho_Y(y_o)$) the p.d.f. of $X^\text{true}$ (resp. $Y^\text{obs}$). Analytically we have

$$\rho_Y(y_o) = \int K(x_t, y_o)\, \rho_X(x_t)\, \mathrm{d}x_t \tag{29}$$

with the kernel $K(x_t, y_o)$ modelling the detector response (a gaussian smearing in our example). The p.d.f. of $Y^\text{obs}$ under the condition that $X^\text{true}$ is in bin $i$ is then

$$\rho_{Y|X^\text{true}_i}(y_o) = \int_{\text{bin } i} K(x_t, y_o)\, \tilde\rho_{X,i}(x_t)\, \mathrm{d}x_t \tag{30}$$

with $\tilde\rho_{X,i}(x_t)$ the p.d.f. of $X^\text{true}$ restricted to bin $i$ and normalized so that $\int_{\text{bin } i} \tilde\rho_{X,i} = 1$. Finally, the response matrix coefficient $R_{ij}$ can be expressed as

$$R_{ij} = \int_{\text{bin } j} \rho_{Y|X^\text{true}_i}(y_o)\, \mathrm{d}y_o = \int_{\text{bin } j} \int_{\text{bin } i} K(x_t, y_o)\, \tilde\rho_{X,i}(x_t)\, \mathrm{d}x_t\, \mathrm{d}y_o. \tag{31}$$

Because true bins $i$ have finite size, the shape of the true distribution $\tilde\rho_{X,i}(x_t)$ inside bin $i$ actually matters. To be accurate, one should use the (unknown!) true spectrum to weight events inside a true bin. Instead, one only has an educated guess of $\tilde\rho_{X,i}(x_t)$ at best – not speaking of a flat distribution. As a result, the response matrix is inaccurate even when the response kernel $K$ is perfectly known. Figure 6 illustrates this effect: when using alternative distributions $\tilde\rho_{X,i}(x_t)$ instead of the true one, we obtain $\lim_{k\to\infty} \chi^2_\text{bias}(k)/n \sim 10^{-1}$: the endpoint is biased.

Note that this is a pure binning effect. To check the case where true bins have infinitesimal width, let us note the true bin $i$ as $[x_i - \delta, x_i + \delta]$. When $\delta \to 0$, we have $\tilde\rho_{X,i}(x_t) \to \delta(x_i - x_t)$ and, for all bins $i$, $\rho_{Y|X^\text{true}_i}(y_o) \to \rho_{Y|X^\text{true}=x_i}(y_o) = K(x_i, y_o)$. In this case, the conditional p.d.f. is determined by the detector response only ($\rho_{Y|X} = K$) and does not depend on the shape of any specific distribution $\rho_X$.
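The dependence on the assumed in-bin shape can be made explicit by computing eqn. (31) by numerical quadrature for a gaussian kernel; rho_guess below plays the role of $\tilde\rho_{X,i}$ and is the knob whose mismodelling produces the endpoint bias of figure 6 (the function and its arguments are our own sketch).

```python
import numpy as np
from scipy.special import erf

def response_from_kernel(edges_true, edges_obs, rho_guess, sigma_s, n_quad=200):
    """R[i, j] from eqn. (31) with a gaussian smearing kernel K(x_t, y_o)."""
    n_t, n_o = len(edges_true) - 1, len(edges_obs) - 1
    R = np.zeros((n_t, n_o))
    for i in range(n_t):
        x = np.linspace(edges_true[i], edges_true[i + 1], n_quad)  # quadrature nodes
        w = rho_guess(x)
        w = w / w.sum()                                # normalized in-bin shape
        for j in range(n_o):
            # P(Y_obs in bin j | X_true = x): gaussian integral over the obs bin
            z_lo = (edges_obs[j] - x) / (np.sqrt(2.0) * sigma_s)
            z_hi = (edges_obs[j + 1] - x) / (np.sqrt(2.0) * sigma_s)
            R[i, j] = np.sum(w * 0.5 * (erf(z_hi) - erf(z_lo)))
    return R
```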
[Figure 6 panels: (a) $\lim \chi^2_\text{bias}/n = 0.160$; (b) $\lim \chi^2_\text{bias}/n = 0.082$.]
Figure 6: Evolution of $\chi^2_\text{bias}/n = E(\chi^2_\text{true}/n) - 1$ when the in-bin shape $\tilde\rho_{X,i}(x_t)$ used to build the response matrix differs from the truth.

A third case appears when the response matrix is obtained using MC simulations, which occurs in particular for sophisticated detectors with a complex detector response. The conditional probabilities

$$R_{ij} = P(Y^\text{obs}_j \mid X^\text{true}_i) \tag{32}$$

are estimated by sampling $X^\text{true}$ and recording the output quantity $Y^\text{obs}$. Assuming that both the detector response and the true distribution are perfectly known ($K$ and $\tilde\rho_X$ in eqn. (30)), a limited sample size will still blur the response matrix. Even when large samples are accessible, the response matrix is never exact strictly speaking. Figure 7 provides an example with our nominal model, sampling the response matrix with three increasingly large samples (successive powers of ten). We obtain for $\lim_{k\to\infty} \chi^2_\text{bias}(k)/n$ values of about 0.6, 0.1, and 0.02 respectively: depending on sample size, the bias may, or may not, be negligible. As expected, the larger the sample, the smaller the bias; but the bias is there in all cases.
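One simple way to quantify such an endpoint bias in a toy study is to unfold an Asimov data set, built from the exact response, with a large number of iterations of the inexact one; a sketch, reusing the helpers above:

```python
import numpy as np

def endpoint_bias(N_true, R_exact, R_used, prior, k_large=500):
    """Approximate lim_k chi2_bias(k)/n when unfolding with an inexact
    response R_used (e.g. sampled from limited MC statistics)."""
    D_asimov = R_exact.T @ N_true                     # expected observed counts
    N_inf = dagostini_unfold(D_asimov, R_used, prior, k_large)[-1]
    b = N_inf - N_true                                # endpoint bias b_infinity
    V = bootstrap_covariance(D_asimov, R_used, prior, k_large)
    return (b @ np.linalg.solve(V, b)) / N_true.size  # chi2_bias/n, eqn. (11)
```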
[Figure 7 panels: (a) $\lim \chi^2_\text{bias}/n = 0.616$; (b) $\lim \chi^2_\text{bias}/n = 0.115$; (c) $\lim \chi^2_\text{bias}/n = 0.022$.]
Figure 7: Evolution of $\chi^2_\text{bias}/n = E(\chi^2_\text{true}/n) - 1$ when the response matrix is sampled with increasing MC sample sizes from (a) to (c) (successive powers of ten). The unfolding uses the correct smearing and the true model shape; the reference response is generated from an independent, much larger sample. As expected, lower MC statistics yields a higher endpoint bias.

In summary, several sources of response matrix inaccuracy exist: systematic uncertainties on the detector response, finite bin size in true space, and the sampling of the response matrix. Hence, endpoint biases are present and the unregularized spectrum $\hat N_\infty$ is not an unbiased estimator. However, it may occur that the endpoint bias $b_\infty$ is actually negligible when compared to uncertainties ($\lim_{k\to\infty} \chi^2_\text{bias}(k)/n \ll 1$); a possible check is to use several plausible models $\tilde\rho_{X,i}(x_t)$ and to quantify the induced discrepancy on unfolded spectra. If not negligible, systematic uncertainties may need to be assigned to the response matrix construction.

As for the data-driven criterion presented in this article, the control of the bias provided by $\chi^2_\text{bias}/n \leqslant \varepsilon$ does not depend on the nature of the bias, be it a truncation bias or an endpoint bias. As a result, it remains applicable for inaccurate response matrices. An example of coverage evolution is given in figure 8. It is the equivalent of figure 2, but the response matrix is not accurate anymore: it is sampled from a finite MC sample, using the nominal model. The extended confidence region $\chi^2_c(\alpha; \chi^2_\text{bias})$ from eqn. (22) allows to recover appropriate coverage in this case as well.
Figure 8: Evolution of the coverage of confidence regions for (a) 68% CL and (b) 90% CL, produced using the nominal model with a sampled response matrix, for which $\lim_{k\to\infty} E[\chi^2_\text{true}(k)]/n = 1.17$ (biased endpoint). The 68%-quantile of $\chi^2_\text{true}$ is the value of $\chi^2_c$ that provides a 68% CL confidence region with proper coverage; it corresponds to $F_n(1-\alpha)$ (red dotted line) in the no-bias limit, which is never reached in this case. The extended confidence region defined by $\chi^2_c(\alpha; \chi^2_\text{bias})$ achieves correct coverage even in presence of large biases.

Although the endpoint spectrum is not an unbiased MLE anymore, it remains prior-independent. Therefore, the benefits of the criterion presented in section 3, based on $\chi^2_\text{data}$, are still relevant. The value of $\eta$ satisfying the convergence condition (24) can be obtained using pseudo-data studies with alternative true models, as in section 3.4. The results are displayed in figure 9; models requiring a systematic uncertainty related to the smearing are also considered. The worst-case scenario is taken to set $\eta = 1.08$.
5. Discussion and summary
The D'Agostini (Richardson-Lucy) iterative unfolding has developed into one of the most frequently used unfolding techniques. To the best of our knowledge, only a few elaborate methods to set the number of iterations have been published in the specific context of this algorithm. Most notably, we mention here the method by G. Zech [12], where acceptable numbers of iterations $k$ are such that $\hat N_k$ fits the data almost as well as the unregularized best-fit $\hat N_\infty$. The observable of interest is $\Delta\chi^2(k) = \chi^2(\hat N_k) - \chi^2(\hat N_\infty)$ (or the corresponding $p$-value), on which a threshold is set. As in our method, the endpoint spectrum is taken as reference; however, $\chi^2_\text{data}$ compares $\hat N_k$ directly to $\hat N_\infty$ instead of their respective goodness-of-fit with data. We believe these are complementary approaches.

Concerning uncertainty quantification, ref. [12] recommends to provide the unregularized covariance matrix $V_\infty$, which is meant to ensure proper coverage for the corresponding confidence intervals. However, as shown in section 4, the true response matrix is in most cases not perfectly known. Thus, the endpoint spectrum $\hat N_\infty$ remains a biased estimator, and the covariance $V_\infty$ actually undercovers.
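For comparison, the observable of ref. [12] can be sketched as a goodness-of-fit difference in the observed space; the use of a Baker-Cousins-style Poisson $\chi^2$ as the fit statistic is our assumption here.

```python
import numpy as np

def delta_chi2(D, R, N_k, N_inf):
    """Delta chi2(k) = chi2(N_k) - chi2(N_inf) in observed space, cf. [12];
    a Poisson chi2 is assumed as the fit statistic."""
    def chi2_poisson(N):
        mu = R.T @ N                                   # folded prediction
        log_term = np.where(D > 0, D * np.log(np.where(D > 0, D, 1.0) / mu), 0.0)
        return 2.0 * np.sum(mu - D + log_term)
    return chi2_poisson(N_k) - chi2_poisson(N_inf)
```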
[Figure 9 panels: unfolding of models 2 ($E(\chi^2_\text{true})/n = 1.20$ after 10 iterations, $E(\chi^2_\text{data})/n = 1.80$), 3 ($E(\chi^2_\text{data})/n = 3.69$), 4 ($E(\chi^2_\text{data})/n = 7.52$), 5 ($E(\chi^2_\text{data})/n = 22.27$), 12 ($E(\chi^2_\text{true})/n = 1.17$ after 12 iterations, $E(\chi^2_\text{data})/n = 1.08$) and 14 ($E(\chi^2_\text{data})/n = 4.45$).]
Figure 9: Evolution of $E(\chi^2_\text{data}/n)$ and $\chi^2_\text{bias}/n = E(\chi^2_\text{true}/n) - 1$ with a sampled response matrix. For the models with a smearing $\sigma_s$ different from the nominal one, a systematic covariance is computed and included in the $\chi^2$ definitions. The first iteration such that $E(\chi^2_\text{true}/n) \leqslant 1 + \varepsilon$ is indicated, with $\varepsilon = 0.2$. The corresponding values of $E(\chi^2_\text{data}/n)$ are indicated as well; $\eta$ is set as the lowest $E(\chi^2_\text{data}/n)$ among all models.

Another interesting idea for uncertainty quantification has been proposed by M. Kuusela and V. Panaretos [22], although not in the specific context of the D'Agostini unfolding. It is called bias-corrected uncertainty quantification. The spectrum estimate is chosen with a generally strong regularization, but the corresponding covariance is iteratively corrected for the regularization bias until the desired coverage is retrieved. Within the iterative unfolding framework discussed here, this would translate into picking a $k$ for the spectrum estimator $\hat N_k$, and a larger $k'$ for a less regularized covariance $V_{k'}$. However, in presence of endpoint biases, appropriate coverage may be beyond the reach of the unfolding, at any iteration.

Controlling the amount of bias is a key point for analyses. The convergence criterion presented in this article allows to control the level of bias introduced (by setting $\varepsilon$) and suggests a way to extend confidence regions to retrieve the expected coverage (by using $\chi^2_c(\alpha; n\varepsilon)$). As the statistical variance inflates with the number of iterations, a well-controlled covariance $V_k$ also provides smaller, yet meaningful, error bars than the unregularized $V_\infty$.

In addition, with the convergence criterion presented here, the stopping iteration is not determined a priori. This is particularly relevant when the true model is suspected to not be well reproduced by MC generators. In such cases, MC-based studies may fail to evaluate or control the level of bias. With this new method, we aim at being as data-driven as possible, while providing valid uncertainties, which are crucial for model comparisons using unfolded data.

Appendix
Proof of relation (12)
Let us denote $\Delta N_k \equiv \hat N_k - N$, and $b_k \equiv \langle \Delta N_k \rangle$; $\chi^2_\text{true}$ can be rewritten as

$$\chi^2_\text{true}(k) = \Delta N_k^T \cdot V_k^{-1} \cdot \Delta N_k. \tag{33}$$

As a scalar number, $\chi^2_\text{true}(k)$ equals its trace and

$$\chi^2_\text{true}(k) = \text{Tr}\big[\Delta N_k^T V_k^{-1} \Delta N_k\big] = \text{Tr}\big[V_k^{-1} \Delta N_k \Delta N_k^T\big]. \tag{34}$$

The expectation value of $\Delta N_k \Delta N_k^T$ is by definition the covariance with respect to the true spectrum, $V_{T,k}$:

$$V_{T,k} \equiv E\big[(\hat N_k - N)(\hat N_k - N)^T\big] = V_k + b_k b_k^T. \tag{35}$$

In this bias-variance decomposition, $V_k$ corresponds to the variance of $\hat N_k$ around its expectation value and $b_k b_k^T$ accounts for the average bias. Using the linearity of the trace operator and of the expectation value, we get

$$E\big[\chi^2_\text{true}(k)\big] = \text{Tr}\big[V_k^{-1} E\big(\Delta N_k \Delta N_k^T\big)\big] = \text{Tr}\big[V_k^{-1} V_{T,k}\big], \tag{36}$$

yielding

$$E\big[\chi^2_\text{true}(k)\big] = \text{Tr}\big[I_n + V_k^{-1} b_k b_k^T\big] = n + \chi^2_\text{bias}(k). \tag{37}$$
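Relation (12) can also be verified numerically with toys; a sketch, assuming a known truth and reusing the helpers from the previous sections:

```python
import numpy as np

def check_relation_12(N_true, R, prior, k, n_toys=2000, seed=3):
    """Monte-Carlo check of E[chi2_true(k)] = n + chi2_bias(k)."""
    rng = np.random.default_rng(seed)
    mu = R.T @ N_true                                 # folded expectation
    toys = np.array([dagostini_unfold(rng.poisson(mu), R, prior, k)[-1]
                     for _ in range(n_toys)])
    V_k = np.cov(toys, rowvar=False)                  # covariance around the mean
    b_k = toys.mean(axis=0) - N_true                  # average bias b_k
    res = toys - N_true                               # Delta N_k for each toy
    chi2_true = np.einsum('ti,ij,tj->t', res, np.linalg.inv(V_k), res)
    # the two returned numbers should be close:
    return chi2_true.mean(), N_true.size + b_k @ np.linalg.solve(V_k, b_k)
```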
References

[1] W. H. Richardson, Bayesian-Based Iterative Method of Image Restoration, Journal of the Optical Society of America 62 (1972) 55.
[2] L. Lucy, An iterative technique for the rectification of observed distributions, Astron. J. 79 (1974) 745-754. doi:10.1086/111605.
[3] H. Multhei, B. Schorr, On an Iterative Method for the Unfolding of Spectra, Nucl. Instrum. Meth. A 257 (1987) 371. doi:10.1016/0168-9002(87)90759-5.
[4] A. Höcker, V. Kartvelishvili, SVD approach to data unfolding, Nucl. Instrum. Meth. A 372 (1996) 469-481. doi:10.1016/0168-9002(95)01478-0.
[5] G. D'Agostini, A Multidimensional unfolding method based on Bayes' theorem, Nucl. Instrum. Meth. A 362 (1995) 487-498. doi:10.1016/0168-9002(95)00274-X.
[6] V. Blobel, An Unfolding Method for High Energy Physics Experiments (2002). arXiv:hep-ex/0208022.
[7] P. C. Hansen, Analysis of Discrete Ill-Posed Problems by Means of the L-Curve, SIAM Review 34 (1992) 561-580.
[8] G. Wahba, G. H. Golub, M. Heath, Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter, Technometrics 21, no. 2 (1979).
[9] A. N. Tikhonov, Solution of incorrectly formulated problems and the regularization method, Soviet Math. Dokl. 4 (1963) 1035-1038.
[10] P. C. Hansen, The truncated SVD as a method for regularization, BIT 27 (1987) 534-553. doi:10.1007/BF01937276.
[11] W. Tang, X. Li, X. Qian, H. Wei, C. Zhang, Data Unfolding with Wiener-SVD Method, JINST 12 (2017) P10002. doi:10.1088/1748-0221/12/10/p10002.
[12] G. Zech, Iterative unfolding with the Richardson-Lucy algorithm, Nucl. Instrum. Meth. A 716 (2013) 1-9. doi:10.1016/j.nima.2013.03.026.
[13] G. Zech, Analysis of distorted measurements – parameter estimation and unfolding (2016). arXiv:1607.06910.
[14] G. D'Agostini, Improved iterative Bayesian unfolding, in: Alliance Workshop on Unfolding and Data Correction, 2010. arXiv:1010.0632.
[15] M. J. Kuusela, Statistical Issues in Unfolding Methods for High Energy Physics. Master's thesis, Aalto University (2012).
[16] M. Licciardi, Etude de la production d'un pion dans l'interaction de neutrinos muoniques avec le nouveau détecteur WAGASCI au Japon. PhD thesis, Université Paris-Saclay (2018).
[17] Y. Hayato, A neutrino interaction simulation program library NEUT, Acta Phys. Polon. B 40 (2009) 2477-2489.
[18] C. Andreopoulos, The GENIE neutrino Monte Carlo generator, Acta Phys. Polon. B 40 (2009) 2461-2475.
[19] M. J. Kuusela, Uncertainty quantification in unfolding elementary particle spectra at the Large Hadron Collider. PhD thesis, EPFL (2016). doi:10.5075/epfl-thesis-7118.
[20] T. Adye, Unfolding algorithms and tests using RooUnfold (2011). arXiv:1105.1160.
[21] P. A. Zyla et al. (Particle Data Group), Statistics, Prog. Theor. Exp. Phys. 2020, 083C01 (2020).
[22] M. Kuusela, V. M. Panaretos, Statistical unfolding of elementary particle spectra: Empirical Bayes estimation and bias-corrected uncertainty quantification, Ann. Appl. Stat. 9 (2015) 1671-1705. doi:10.1214/15-AOAS857.