Unfolding with Gaussian Processes
Adam Bozson*, Glen Cowan, Francesco Spanò
Department of Physics, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, United Kingdom
*Corresponding author: [email protected]
Abstract
A method to perform unfolding with Gaussian processes (GPs) is presented. Using Bayesian regression, we define an estimator for the underlying truth distribution as the mode of the posterior. We show that in the case where the bin contents are distributed approximately according to a Gaussian, this estimator is equivalent to the mean function of a GP conditioned on the maximum likelihood estimator. Regularisation is introduced via the kernel function of the GP, which has a natural interpretation as the covariance of the underlying distribution. This novel approach allows for the regularisation to be informed by prior knowledge of the underlying distribution, and for it to be varied along the spectrum. In addition, the full statistical covariance matrix for the estimator is obtained as part of the result. The method is applied to two examples: a double-peaked bimodal distribution and a falling spectrum.
Keywords: unfolding, Gaussian process
1. Introduction
Experimental measurements are distorted and biased by detector effects, due to limitations of the measuring instrument and procedures. The need to infer the underlying distribution using the measured data is shared by a variety of fields, from astronomy [1] and medical applications [2] to the investigation of the parameters that describe oil well properties [3].

In most of these fields, these techniques are called deconvolution or restoration [4]. They are used to solve what is defined as the inverse problem: to infer an unknown function $f(x)$ from the measured data, using knowledge and assumptions of the distortions. In particle physics such techniques are known as unfolding, and a variety of methods have been developed for this purpose (for some reviews see Refs. [5, 6, 7]).

In this paper, a novel Bayesian method to perform unfolding in particle physics is proposed. We use an approach that "gives prior probability to every possible function" via Gaussian process regression [8], where higher probabilities are assigned to functions that are considered to agree with the observations. This approach allows greater flexibility than unfolding schemes based on a set of parametrised functions belonging to a specific class. In addition, it is shown to have a locally tunable regularisation scheme in terms of the variable to be unfolded.

In Sec. 2, we define the unfolding problem and the notation for approximately Gaussian-distributed datasets. Sec. 3 discusses the solution to the unfolding problem based on the maximum likelihood (ML) method, and the need for regularisation. In a Bayesian setting, the likelihood is enhanced by prior information so that the ML solution is replaced by the mode of the posterior distribution. Sec. 4 connects the maximum a posteriori (MAP) estimator to the solution of a regression problem which conditions prior knowledge encoded in a Gaussian process on the ML solution extracted from data. Example applications are provided in Sec. 5. Finally, we report the conclusions and outlook for future exploration of this method in Sec. 6.
2. Definitions and notation
In particle physics, measured distributions are often reported as populations of bins rather than continuous functions. Therefore the first step we will take is to represent the underlying distributions with discretised bin populations. We note that this process biases the estimated histogram away from the true distribution.

The truth distribution is referred to as $f(x)$ and represented by a histogram $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_M)$ with contents $\mu_j \propto \int_{\text{bin } j} f(x)\,\mathrm{d}x$, $j = 1, \ldots, M$. Observed data are contained in a histogram $\boldsymbol{n} = (n_1, \ldots, n_N)$ with $N$ bins. The expectation value of $\boldsymbol{n}$ is the histogram $\boldsymbol{\nu}$.

The truth and observed distributions are related through the effects of detector response, acceptance, and background contributions. For simplicity, we take the background to be zero (the relaxation of this assumption is discussed in Sec. 6). The contents of $\boldsymbol{\mu}$ and $\boldsymbol{\nu}$ are linearly related by
\[
\boldsymbol{\nu} = R \boldsymbol{\mu}, \qquad (1)
\]
where $R$ is the $N \times M$ response matrix with elements $R_{ij}$ giving the conditional probability for an event to lie in bin $i$ of the observed histogram, given its true value is in bin $j$ of the truth histogram.

The goal of unfolding is to construct estimators $\hat{\boldsymbol{\mu}}$ for the truth histogram $\boldsymbol{\mu}$, along with their covariance matrix $U_{ij} = \mathrm{cov}[\hat{\mu}_i, \hat{\mu}_j]$. In an experiment, the bin counts of the observed histogram $n_i$ fluctuate according to the Poisson distribution with expectation values $\nu_i$ and covariance matrix $V_{ij} = \nu_i \delta_{ij}$. The findings in this paper apply when bin counts are approximately Gaussian, i.e., for large $\nu_i$.
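To make the notation concrete, the sketch below builds a toy response matrix from a Gaussian resolution model, folds a truth histogram according to Eq. (1), and draws a Poisson-fluctuated observation. It is a minimal numpy/scipy illustration, not the code of Ref. [20]: the binning, resolution value, and the analytic construction of $R$ (rather than the normalised 2D histogram used in Sec. 5) are our own assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=1)

# Illustrative binning: M = N = 20 bins on 0 < x < 1 (as in the two-peak example).
edges = np.linspace(0.0, 1.0, 21)
centres = 0.5 * (edges[:-1] + edges[1:])
sigma = 0.075   # Gaussian detector resolution

# R_ij: probability for an event with true value in bin j to be observed in bin i.
# Events smeared outside the range are lost, giving a simple acceptance effect.
R = (norm.cdf(edges[1:, None], loc=centres[None, :], scale=sigma)
     - norm.cdf(edges[:-1, None], loc=centres[None, :], scale=sigma))

# A toy truth histogram mu, the folded expectation nu = R mu (Eq. 1),
# and an observed histogram n with Poisson fluctuations.
mu = 1000.0 * (norm.pdf(centres, 0.3, 0.1) + norm.pdf(centres, 0.7, 0.1)) * np.diff(edges)
nu = R @ mu
n = rng.poisson(nu)
```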
3. ML solution and regularisation
Since the data $\boldsymbol{n}$ are approximately Gaussian-distributed around $\boldsymbol{\nu}$, the likelihood is given by
\[
P(\boldsymbol{n} \,|\, \boldsymbol{\nu}) = \left[ (2\pi)^N \det(V) \right]^{-1/2} \exp\left[ -\tfrac{1}{2} (\boldsymbol{n} - \boldsymbol{\nu})^T V^{-1} (\boldsymbol{n} - \boldsymbol{\nu}) \right] \qquad (2)
\]
and hence the log-likelihood may be written
\[
\log P(\boldsymbol{n} \,|\, \boldsymbol{\mu}) = -\tfrac{1}{2} (\boldsymbol{n} - R\boldsymbol{\mu})^T V^{-1} (\boldsymbol{n} - R\boldsymbol{\mu}) + \ldots, \qquad (3)
\]
where we have substituted $\boldsymbol{\nu} = R\boldsymbol{\mu}$ and not written terms that do not depend on $\boldsymbol{\mu}$. It can be shown that the maximum likelihood (ML) solution $\hat{\boldsymbol{\mu}}_{\mathrm{ML}}$ satisfies $\boldsymbol{n} = R \hat{\boldsymbol{\mu}}_{\mathrm{ML}}$. We write the ML solution as $\hat{\boldsymbol{\mu}}_{\mathrm{ML}} = R^{-1}\boldsymbol{n}$, which may be obtained by explicit matrix inversion for invertible $R$ when $N = M$, or by alternative methods such as numerically maximising Eq. (3) or singular value decomposition. The ML covariance matrix is given by $U_{\mathrm{ML}} = R^{-1} V (R^{-1})^T$ [9].

The detector response acts to smear out fine structure in the truth distribution, so statistical fluctuations in the data can lead to a large amount of fine structure in the unfolded result. This effect yields large local fluctuations in the ML unfolded solution when the typical bin width is not much larger than the detector resolution. In addition, the estimators for neighbouring bin counts can often have strong negative correlations.

These unwanted false features are typically reduced by a technique known as regularisation. Regularisation may be introduced by minimising a cost functional $\Phi(\boldsymbol{\mu}) = -\alpha \log P(\boldsymbol{n} \,|\, \boldsymbol{\mu}) + S(\boldsymbol{\mu})$, where $S(\boldsymbol{\mu})$ penalises high-variance distributions, effectively constricting the space of possible unfolded solutions. Multiple measures of smoothness may be used, such as those based on derivatives [10, 11] or entropy [12]. The ML solution has the minimum variance for an unbiased estimator, so any reduction in variance must be balanced by introducing some bias. The regularisation parameter $\alpha$ controls this bias–variance trade-off.

An unfolded distribution may alternatively be obtained by iterative techniques [13, 14], which converge on the ML solution. Stopping after a fixed number of iterations can yield a solution with the desired properties, although the fact that the bias–variance trade-off is controlled by a discrete parameter, rather than a continuous one, is seen as a disadvantage because it limits the possibility to tune the parameter values. Fully Bayesian unfolding [15] addresses regularisation through a non-constant prior distribution, and performs the unfolding by sampling from the posterior distribution.
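For the square, invertible case the ML estimator and its covariance reduce to two matrix operations. The following is a minimal numpy sketch (not the code of Ref. [20]); the function name is ours, and the Poisson covariance $V = \mathrm{diag}(\boldsymbol{\nu})$ is estimated here by $\mathrm{diag}(\boldsymbol{n})$, which is an assumption.

```python
import numpy as np

def ml_solution(n, R):
    """ML unfolding for a square, invertible response matrix (N = M).

    Returns mu_ML = R^{-1} n and U_ML = R^{-1} V (R^{-1})^T, where the
    Poisson covariance V = diag(nu) is estimated by diag(n).
    """
    R_inv = np.linalg.inv(R)
    mu_ml = R_inv @ np.asarray(n, dtype=float)
    V = np.diag(np.asarray(n, dtype=float))
    U_ml = R_inv @ V @ R_inv.T
    return mu_ml, U_ml
```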
4. Gaussian process method
The method presented in this paper builds on the ML solution from Sec. 3. Starting from the Gaussian likelihood given by Eq. (3) and a Gaussian process (GP) prior, Bayes' theorem is applied to obtain a posterior distribution. We define the estimator representing the unfolded distribution as a summary statistic of the posterior, namely the mode. We remark that while the ML solution is a frequentist estimator, the method presented here incorporates elements of Bayesian statistics. However, the final estimator for the unfolded distribution is a valid frequentist estimator, so this method may be used in either fully Bayesian or hybrid analyses.

A random process extends the notion of a random variable to the space of functions of a set of indices $X$. A GP is therefore a set of indexed random variables, any finite subset of which are distributed according to a joint Gaussian distribution [8]. Since the Gaussian distribution is entirely defined by its mean and covariance, a complete description of a GP requires just a mean function $m(\boldsymbol{x}) = \mathrm{E}[f(\boldsymbol{x})]$ and a kernel function $k(\boldsymbol{x}, \boldsymbol{x}') = \mathrm{cov}[f(\boldsymbol{x}), f(\boldsymbol{x}')]$. A GP can be thought of as a probability distribution for the latent function $f(\boldsymbol{x})$.

GPs may be used for regression, where one wishes to estimate the function $f$ of the (generally multidimensional) variable $\boldsymbol{x}$, given some observations $\boldsymbol{y}$ taken at $X = (\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots)$. This is done by updating a GP prior using Bayes' theorem to obtain a posterior GP for $f$. The mean (or equivalently the mode) of the posterior, evaluated at $X_* = (\boldsymbol{x}_{*1}, \boldsymbol{x}_{*2}, \ldots)$, is denoted $\bar{\boldsymbol{f}}_*$ and used as the estimator for $f$. A rich treatment of using GPs for regression may be found in Ch. 2 of Ref. [8], whose notation we follow in this paper. Here we state the posterior mean and covariance of a GP with prior mean function $m$ and kernel function $k$ for a vector of observations $\boldsymbol{y}$ with data covariance matrix $V$:
\[
\bar{\boldsymbol{f}}_* = K_*^T \left[ K + V \right]^{-1} (\boldsymbol{y} - \boldsymbol{m}) + \boldsymbol{m}_*, \qquad (4)
\]
\[
\mathrm{cov}(\boldsymbol{f}_*) = K_{**} - K_*^T \left[ K + V \right]^{-1} K_*, \qquad (5)
\]
with the matrices $K_{ij} = k(\boldsymbol{x}_i, \boldsymbol{x}_j)$, $[K_*]_{ij} = k(\boldsymbol{x}_i, \boldsymbol{x}_{*j})$, and $[K_{**}]_{ij} = k(\boldsymbol{x}_{*i}, \boldsymbol{x}_{*j})$. Here $\boldsymbol{m}_* = m(X_*)$.

This standard result from GP regression is used in the following section to link the estimator for an unfolded distribution to a GP.

4.1. MAP estimator

We consider again the model from Sec. 3 with the likelihood $P(\boldsymbol{n} \,|\, \boldsymbol{\mu})$ given by Eq. (3). From Bayes' theorem, the log-posterior probability is given by
\[
\log P(\boldsymbol{\mu} \,|\, \boldsymbol{n}) = \log P(\boldsymbol{n} \,|\, \boldsymbol{\mu}) + \log P(\boldsymbol{\mu}) - \log P(\boldsymbol{n}), \qquad (6)
\]
where $P(\boldsymbol{\mu})$ is the prior probability, and the last term $P(\boldsymbol{n})$ may be ignored since it does not depend on $\boldsymbol{\mu}$.

We take the prior probability to be given by a GP with mean vector $\boldsymbol{m}$ (the reference histogram) and covariance matrix $K_{ij} = k(x_i, x_j)$. The log-prior probability is then given by
\[
\log P(\boldsymbol{\mu}) = -\tfrac{1}{2} (\boldsymbol{\mu} - \boldsymbol{m})^T K^{-1} (\boldsymbol{\mu} - \boldsymbol{m}) + \ldots. \qquad (7)
\]
Substituting the likelihood Eq. (3) and prior Eq. (7) into the expression for the posterior Eq. (6), we obtain
\[
\log P(\boldsymbol{\mu} \,|\, \boldsymbol{n}) = -\tfrac{1}{2} (\boldsymbol{n} - R\boldsymbol{\mu})^T V^{-1} (\boldsymbol{n} - R\boldsymbol{\mu}) - \tfrac{1}{2} (\boldsymbol{\mu} - \boldsymbol{m})^T K^{-1} (\boldsymbol{\mu} - \boldsymbol{m}) + \ldots, \qquad (8)
\]
dropping terms which do not contain $\boldsymbol{\mu}$. The maximum a posteriori (MAP) estimator $\hat{\boldsymbol{\mu}}$ is the mode of this posterior probability, and maximises $\log P(\boldsymbol{\mu} \,|\, \boldsymbol{n})$. This summary statistic is found to be given by
\[
\hat{\boldsymbol{\mu}} = K \left[ K + R^{-1} V (R^{-1})^T \right]^{-1} \left( R^{-1}\boldsymbol{n} - \boldsymbol{m} \right) + \boldsymbol{m}. \qquad (9)
\]
A derivation of this is given in Appendix A.

By comparing the MAP estimator from Eq. (9) to that obtained from GP regression in Eq. (4), we find the important result that $\hat{\boldsymbol{\mu}}$ is the posterior mean of a GP regression whose observations are the ML solution, which is given by $\hat{\boldsymbol{\mu}}_{\mathrm{ML}} = R^{-1}\boldsymbol{n}$ with covariance matrix $U_{\mathrm{ML}} = R^{-1} V (R^{-1})^T$. Since the posterior distribution is a product of Gaussians, it is also Gaussian and therefore the mode is identical to the mean. This connection allows us to write that the covariance of the MAP estimator may be given by
\[
U = K - K \left[ K + R^{-1} V (R^{-1})^T \right]^{-1} K. \qquad (10)
\]
Furthermore, if the observation (training) indices $X = (x_1, x_2, \ldots)$ are different from the prediction (testing) indices $X_* = (x_{*1}, x_{*2}, \ldots)$, and the reference histogram can be obtained for bins defined by $X_*$, then we may use the standard results from GP regression to generalise the MAP solution to
\[
\hat{\boldsymbol{\mu}} = K_*^T \left[ K + U_{\mathrm{ML}} \right]^{-1} \left( \hat{\boldsymbol{\mu}}_{\mathrm{ML}} - \boldsymbol{m} \right) + \boldsymbol{m}_*, \qquad (11)
\]
\[
U = K_{**} - K_*^T \left[ K + U_{\mathrm{ML}} \right]^{-1} K_*, \qquad (12)
\]
where $[K_*]_{ij} = k(x_i, x_{*j})$, $[K_{**}]_{ij} = k(x_{*i}, x_{*j})$, and $\boldsymbol{m}_*$ is the mean histogram at $X_*$.

The generalised results in Eq. (11) and Eq. (12) are simple algebraic expressions once the ML solution is known. Therefore the unfolded estimator and covariance are efficient to compute. This is an advantage over other, more CPU-intensive unfolding schemes. In addition, these results are linear in $\boldsymbol{n}$, so error propagation is simple.
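Once the ML solution and a kernel are in hand, Eqs. (11) and (12) amount to a few lines of linear algebra. The following is a minimal numpy sketch under the notation above; it is not the code of Ref. [20], and the function name `gp_unfold` and its interface are our own illustrative choices (the kernel is passed in as a callable).

```python
import numpy as np

def gp_unfold(mu_ml, U_ml, x, x_star, kernel, m=None, m_star=None):
    """GP/MAP unfolded estimator and covariance, Eqs. (11) and (12).

    mu_ml, U_ml : ML solution and its covariance (the "observations")
    x, x_star   : bin centres for the ML solution and for the desired truth histogram
    kernel      : callable (x1, x2) -> covariance matrix between two sets of bin centres
    m, m_star   : reference (prior mean) histograms at x and x_star; zero if omitted
    """
    mu_ml = np.asarray(mu_ml, dtype=float)
    m = np.zeros_like(mu_ml) if m is None else np.asarray(m, dtype=float)
    m_star = np.zeros(len(x_star)) if m_star is None else np.asarray(m_star, dtype=float)

    K = kernel(x, x)                  # K_ij = k(x_i, x_j)
    K_s = kernel(x, x_star)           # [K_*]_ij = k(x_i, x*_j)
    K_ss = kernel(x_star, x_star)     # [K_**]_ij = k(x*_i, x*_j)

    # Solve (K + U_ML) z = [mu_ML - m, K_*] once for both the mean and the covariance.
    B = np.linalg.solve(K + U_ml, np.column_stack([mu_ml - m, K_s]))
    mu_hat = K_s.T @ B[:, 0] + m_star          # Eq. (11)
    U = K_ss - K_s.T @ B[:, 1:]                # Eq. (12)
    return mu_hat, U
```

For $X = X_*$ and $\boldsymbol{m} = \boldsymbol{m}_*$, this reduces to Eqs. (9) and (10) with $\hat{\boldsymbol{\mu}}_{\mathrm{ML}} = R^{-1}\boldsymbol{n}$ and $U_{\mathrm{ML}} = R^{-1} V (R^{-1})^T$.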
4.2. Kernel functions and hyperparameters

In the proposed GP unfolding method, the regularisation is introduced via the kernel function $k(\boldsymbol{x}, \boldsymbol{x}')$, which constricts the space of possible solutions to those with a particular covariance. A common choice for the kernel function is the squared-exponential:
\[
k(\boldsymbol{x}, \boldsymbol{x}') = A \exp\left( -\frac{\left\| \boldsymbol{x} - \boldsymbol{x}' \right\|^2}{2 l^2} \right). \qquad (13)
\]
This kernel function is stationary in the sense that it is a function of only the distance between the inputs, $\left\| \boldsymbol{x} - \boldsymbol{x}' \right\|$. It is parameterised by the amplitude $A$ and length scale $l$, referred to as the set of hyperparameters, $\theta = \{A, l\}$. Various methods exist to choose their values.

One method for this is by simulation, as is often done in particle physics analyses. In this approach, a simulation program produces values for $\boldsymbol{\mu}$, $\boldsymbol{\nu}$, and statistically independent $\boldsymbol{n}$. Then the pseudo-data $\boldsymbol{n}$ are unfolded with varying hyperparameters to obtain $\hat{\boldsymbol{\mu}}_\theta$, and the agreement with $\boldsymbol{\mu}$ checked for closure. The acceptable degree of closure, and its measure, are often chosen by eye, although a more specific goodness-of-fit statistic may be used.

Another approach, taken from GP methods [8], is to maximise the marginal likelihood,
\[
\log P(\boldsymbol{n}; \theta) = \log\left( \int P(\boldsymbol{n} \,|\, \boldsymbol{\mu})\, P(\boldsymbol{\mu}; \theta)\, \mathrm{d}\boldsymbol{\mu} \right) = -\tfrac{1}{2} (\hat{\boldsymbol{\mu}}_{\mathrm{ML}} - \boldsymbol{m})^T \left[ K_\theta + U_{\mathrm{ML}} \right]^{-1} (\hat{\boldsymbol{\mu}}_{\mathrm{ML}} - \boldsymbol{m}) - \tfrac{1}{2} \log\left| K_\theta + U_{\mathrm{ML}} \right| - \tfrac{N}{2} \log 2\pi, \qquad (14)
\]
where $K_\theta = k_\theta(X, X)$ is the kernel function evaluated at $X$ with the hyperparameters set to $\theta$. This is a Bayesian approach, marginalising over the latent distribution $\boldsymbol{\mu}$. The maximum of this marginal likelihood defines a model with a trade-off between the fit to the data (the first term) versus model complexity (the second term). For example, a GP using a squared-exponential kernel with very small length scale $l$ will tend to fit to $\hat{\boldsymbol{\mu}}_{\mathrm{ML}}$, but will be overly complex (under-regularisation). In contrast, a large $l$ describes a simpler model, but will fail to fit to $\hat{\boldsymbol{\mu}}_{\mathrm{ML}}$ (over-regularisation). These extreme situations are penalised by the marginal likelihood, whose maximum point may be used to choose values for $\theta$.

Finally, we mention the method of cross validation, often employed for hyperparameter optimisation in machine learning [16]. Various approaches for cross validation exist, but given the relatively small number of bins $M$ in a typical unfolding scenario, and the fast computation of the GP unfolded result, we recommend the leave-one-out approach: $M$ sets of $M - 1$ bins, $X_*^{(1)}, X_*^{(2)}, \ldots, X_*^{(M)}$, are produced, with each $X_*^{(i)}$ missing the $i$th bin. Then the prediction for $\mu_i$ is compared to its true value via a loss function, most often the squared error. The set of $M$ losses for some hyperparameters $\theta$ can then be used to choose their values.

Other kernel functions may be more suitable for describing the truth distribution. An attractive feature of the approach presented here is that one may encode knowledge of the underlying physical process to derive a physically-motivated kernel [17] which may better describe the truth distribution.

The mathematics of reproducing kernel Hilbert spaces formalises the link between the kernel and the traditional regularisation approach used in some particle physics results. For example, a thin plate covariance [18] leads to a solution equivalent to that of spline regularisation, known as Tikhonov regularisation in particle physics [9, 10, 11, 19]. In one dimension, this stationary kernel may be written $k(r) = A\,(2r^3 - 3Rr^2 + R^3)$, where $r = |x - x'|$ and $R$ is determined by boundary conditions. This kernel contains a single parameter $A$, which controls the global strength of the regularisation, as is the case with Tikhonov regularisation in its usual implementation. In contrast, an advantage of the GP approach presented in this paper is that the regularisation may be varied locally along the spectrum by using a non-stationary kernel function. We provide an example of this in Sec. 5.2.
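As an illustration of Eqs. (13) and (14), the sketch below evaluates the squared-exponential kernel and the log marginal likelihood, which can then be scanned or maximised over $\theta = \{A, l\}$. It assumes the `gp_unfold` interface sketched above; the function names and grid ranges are illustrative, not those of Ref. [20].

```python
import numpy as np

def sq_exp_kernel(x1, x2, A, l):
    """Squared-exponential kernel of Eq. (13) for 1D bin centres."""
    d2 = (np.asarray(x1)[:, None] - np.asarray(x2)[None, :]) ** 2
    return A * np.exp(-0.5 * d2 / l**2)

def log_marginal_likelihood(mu_ml, U_ml, K_theta, m=None):
    """Log marginal likelihood of Eq. (14) for a kernel matrix K_theta = k_theta(X, X)."""
    mu_ml = np.asarray(mu_ml, dtype=float)
    m = np.zeros_like(mu_ml) if m is None else np.asarray(m, dtype=float)
    C = K_theta + U_ml
    r = mu_ml - m
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * r @ np.linalg.solve(C, r) - 0.5 * logdet - 0.5 * len(r) * np.log(2.0 * np.pi)

# A crude grid scan over theta = (A, l), in the spirit of the contours of Fig. 3:
# A_best, l_best = max(
#     ((A, l) for A in np.exp(np.linspace(10.0, 20.0, 41))
#             for l in np.linspace(0.02, 0.25, 40)),
#     key=lambda th: log_marginal_likelihood(mu_ml, U_ml, sq_exp_kernel(x, x, *th)),
# )
```

The chosen values could then be used to build the kernel callable passed to the `gp_unfold` sketch above.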
5. Example applications
Python code for the following examples may be found in Ref. [20]. We consider the case of a bimodal distribution in Sec. 5.1, and a falling spectrum in Sec. 5.2.
5.1. Two-peak example

A set of 20 000 toy truth events is obtained by sampling from two Gaussian distributions for $x$ with mean values 0.3 and 0.7, both with standard deviation 0.1. These truth events are histogrammed in $\boldsymbol{\mu}$. The truth events are then smeared with a Gaussian resolution of $\sigma = 0.075$ to generate the histogram $\boldsymbol{\nu}$. Events are accepted in the region $0 < x < 1$. The $\boldsymbol{\mu}$ and $\boldsymbol{\nu}$ histograms use 20 constant-width bins. The truth and smeared events are used to determine the response matrix $R$ from a normalised 2D histogram. Finally, the observed histogram $\boldsymbol{n}$ is generated by Poisson fluctuations around $\boldsymbol{\nu}$. The three histograms are shown in Fig. 1.

We use a GP with the squared-exponential kernel function given by Eq. (13). The values for the two hyperparameters $A$ and $l$ are chosen to be those that maximise the marginal likelihood given by Eq. (14). The maximum and contours of the marginal likelihood are shown in Fig. 3. The mean histogram for the GP, $\boldsymbol{m}$, is taken to be 0 for all bins since it is found to have little impact on the final result. The unfolded estimator for the truth, $\hat{\boldsymbol{\mu}}$, given by Eq. (9), is shown in Fig. 2. The covariance matrix $U$ is defined by Eq. (10), and the correlation matrix $\rho_{ij} = U_{ij} / \sqrt{U_{ii} U_{jj}}$ is shown in Fig. 4.
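For concreteness, a rough numpy re-creation of this setup is sketched below; it is not the code of Ref. [20], and the exact acceptance definition and the random seed are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Toy truth sample: two Gaussians at 0.3 and 0.7, each with width 0.1.
n_events = 20_000
peak = rng.random(n_events) < 0.5
x_true = np.where(peak, rng.normal(0.3, 0.1, n_events), rng.normal(0.7, 0.1, n_events))
x_obs = x_true + rng.normal(0.0, 0.075, n_events)   # Gaussian detector smearing

# Accept events in 0 < x < 1 and fill 20 constant-width bins.
edges = np.linspace(0.0, 1.0, 21)
keep = (x_true > 0) & (x_true < 1) & (x_obs > 0) & (x_obs < 1)
mu, _ = np.histogram(x_true[keep], bins=edges)

# Response matrix from the normalised 2D (observed vs. truth) histogram;
# truth columns with no events would need special care.
h2, _, _ = np.histogram2d(x_obs[keep], x_true[keep], bins=[edges, edges])
R = h2 / h2.sum(axis=0, keepdims=True)

nu = R @ mu
n = rng.poisson(nu)   # observed histogram with Poisson fluctuations
```

The result could then be unfolded with the `gp_unfold` and marginal-likelihood sketches from Sec. 4.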
5.2. Falling spectrum example

Toy truth events are sampled from the falling spectrum $f(x) = e^{-x}$ in the region $1 < x < 5$ and histogrammed in $\boldsymbol{\mu}$. These events are smeared according to a Gaussian whose resolution is proportional to $\sqrt{x}$. The smeared events are placed in a histogram $\boldsymbol{\nu}$ with 30 bins of equal width in the region $0.5 < x < 5$, and the observed histogram $\boldsymbol{n}$ is generated from a Poisson distribution around $\boldsymbol{\nu}$. These three histograms are shown in Fig. 5.

In this example, $N > M$, so while the problem is well-constrained, the $N \times M$ response matrix $R$ is not directly invertible. To mitigate this, we use the ML estimator as the starting point and then regularise with the MAP prescription detailed in Sec. 4. Specifically, we numerically maximise the Gaussian likelihood in Eq. (3) using MINUIT [21] via the iminuit [22] Python interface to obtain the ML estimator $\hat{\boldsymbol{\mu}}_{\mathrm{ML}}$. The covariance matrix for the ML estimator, $U_{\mathrm{ML}}$, is obtained by inverting the Hessian matrix with the HESSE subroutine. These results are then substituted in Eq. (9) with $R^{-1}\boldsymbol{n} \to \hat{\boldsymbol{\mu}}_{\mathrm{ML}}$ and $R^{-1} V (R^{-1})^T \to U_{\mathrm{ML}}$.

Since the (a priori known) detector resolution increases proportionally with $\sqrt{x}$, a kernel with constant length scale, such as the squared-exponential Eq. (13), is unsuitable in this case. Therefore we choose a kernel function with variable length scale, the Gibbs kernel [8, 23] in 1D,
\[
k(x, x') = A \sqrt{\frac{2\, l(x)\, l(x')}{l(x)^2 + l(x')^2}} \exp\left( -\frac{(x - x')^2}{l(x)^2 + l(x')^2} \right), \qquad (15)
\]
where $l(x)$ is an arbitrary positive function of $x$, here chosen to be $l(x) = bx + c$. This allows for a linearly-changing length scale. The increased flexibility afforded by this kernel function is realised by introducing more regularisation parameters, $\theta = \{A, b, c\}$. We remark that for a large number of parameters, it becomes increasingly difficult to choose the optimal point.

The unfolded estimators are shown against the truth histogram in Fig. 6. Here the parameters $\theta$ are chosen with the maximum marginal likelihood prescription given in Sec. 4.2. As expected, the fitted $b > 0$, so the preferred length scale increases with $x$.
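The Gibbs kernel of Eq. (15) with $l(x) = bx + c$ is straightforward to implement; a minimal numpy sketch is shown below (the function name is ours, not from Ref. [20]). The resulting callable can be plugged into the `gp_unfold` sketch from Sec. 4 together with the numerically obtained $\hat{\boldsymbol{\mu}}_{\mathrm{ML}}$ and $U_{\mathrm{ML}}$.

```python
import numpy as np

def gibbs_kernel(x1, x2, A, b, c):
    """Gibbs kernel of Eq. (15) with a linearly varying length scale l(x) = b*x + c."""
    x1 = np.asarray(x1, dtype=float)[:, None]
    x2 = np.asarray(x2, dtype=float)[None, :]
    l1, l2 = b * x1 + c, b * x2 + c        # l(x) must stay positive over the spectrum
    denom = l1**2 + l2**2
    return A * np.sqrt(2.0 * l1 * l2 / denom) * np.exp(-((x1 - x2) ** 2) / denom)

# Example use with the gp_unfold sketch (hyperparameter values are placeholders):
# kernel = lambda u, v: gibbs_kernel(u, v, A=..., b=..., c=...)
# mu_hat, U = gp_unfold(mu_ml, U_ml, x, x_star, kernel)
```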
6. Conclusions and Outlook
In this paper we have presented how GPs may be applied to the unfolding problem. It is shown that conditioning a GP prior on the ML solution is equivalent to constructing the MAP estimator. In this application, the use of a GP regressor may be thought of as a method of regularising the ML solution.

Figure 1: Truth ($\boldsymbol{\mu}$), expected ($\boldsymbol{\nu}$), and observed ($\boldsymbol{n}$) histograms for the two-peak unfolding example. The histogram definitions are reported in the text. The error bars on $\boldsymbol{n}$ represent their Poisson uncertainties.

Figure 2: Truth ($\boldsymbol{\mu}$) and unfolded truth estimators ($\hat{\boldsymbol{\mu}}$) for the two-peak example. The error bars on $\hat{\boldsymbol{\mu}}$ represent the standard deviations obtained from the covariance matrix as defined by Eq. (10).
Figure 3: Contours of the log-marginal likelihood Eq. (14) for the two-peak example as a function of the parameters for the squared-exponential kernel, $A$ and $l$. The cross indicates the point of maximum marginal likelihood. The contour labels are the depth of the contour below the maximum.
Figure 4: Correlation matrix for the unfolded truth estimators $\hat{\boldsymbol{\mu}}$ for the two-peak example.

Figure 5: Truth ($\boldsymbol{\mu}$), expected ($\boldsymbol{\nu}$), and observed ($\boldsymbol{n}$) histograms for the falling spectrum example, as defined in the text. The error bars on $\boldsymbol{n}$ represent their Poisson uncertainties.

Figure 6: Truth ($\boldsymbol{\mu}$) and unfolded truth estimators ($\hat{\boldsymbol{\mu}}$) for the falling spectrum example.

The GP is entirely described by mean and kernel functions. While the mean function is found to have little impact on the result, the kernel function prescribes the covariance of the unfolded estimator for the truth distribution. By choosing an appropriate kernel function, the smoothness in the unfolded estimator can be controlled. Furthermore, the kernel function has a direct interpretation and may be motivated by knowledge of the underlying physics. This means that, in contrast to other unfolding schemes, the regularisation is a natural product of this approach.

For $N = M$, where the bins for the truth and observed histograms are equal, the ML solution is simply given by inverting the response matrix, $\hat{\boldsymbol{\mu}}_{\mathrm{ML}} = R^{-1}\boldsymbol{n}$. However, generally $N \neq M$ and this method may not be used. For the general case we envisage two possibilities. First, the $N \times M$ response matrix $R$ is constructed from Monte Carlo simulation, and the ML solution is found numerically by maximising the likelihood given by Eq. (3). This is the approach taken in Sec. 5.2 for the falling spectrum example. Alternatively, the square $N \times N$ matrix $R'$ could be constructed using the same binning for the truth and observed histograms. Then the predictive GP mean function Eq. (11) is evaluated at $X_*$, the $M$ centres of the bins of the desired truth histogram.

In Sec. 2, we take the background contribution to be equal to zero for simplicity. Background contributions, in the form of the $N$-dimensional vector $\boldsymbol{\beta}$, are simple to include by modifying the folding equation Eq. (1) to $\boldsymbol{\nu} = R\boldsymbol{\mu} + \boldsymbol{\beta}$. Then for the estimators used in the method presented in this paper, one simply replaces the data with the background-subtracted data, $\boldsymbol{n} \to \boldsymbol{n} - \boldsymbol{\beta}$.

This paper assumes throughout that the data may be approximated as distributed according to a Gaussian, and we note that this is not universally the case in particle physics. However, the choice of unfolding method depends on the analysis being done and should be tested against simulation in any case. Therefore we recommend that for histograms with small bin populations, the unfolding is tested to ensure it acceptably meets the requirements of the analysis under consideration.

The treatment of systematic uncertainties is postponed to future work in this area. We remark that approximate variational approaches, as used in published particle physics analyses [24, 25], may still be employed in this case. We envisage further research into the applications of Student-$t$ processes [26] in unfolding in particle physics, as an extension to this work.

GPs have been introduced to a number of scientific fields to improve their statistical procedures [3, 4]. They have not, however, traditionally been used in particle physics, although recent developments in this area have shown promise [17]. In this paper, we have introduced GPs into the important problem of unfolding.
We show that the method is generally applicable to problems of different shapes and sizes, that the regularisation can be controlled naturally, and that the result, including the unfolded covariance matrix, can be obtained conveniently.

Acknowledgements
We would like to extend our gratitude to our colleagues in ATLAS and RHUL for their support. In particular, we thank Pedro Teixeira-Dias and Lewis Wilkins for their proofreading and helpful comments on this manuscript. We also thank Veronique Boisvert and Pim Verschuuren for insightful discussions. This work was supported by the UK Science and Technology Facilities Council.

Appendix A. Derivation of MAP estimator
With reference to Eq. (8), we wish to find the value $\hat{\boldsymbol{\mu}}$ that maximises the expression
\[
-\tfrac{1}{2} (\boldsymbol{n} - R\boldsymbol{\mu})^T V^{-1} (\boldsymbol{n} - R\boldsymbol{\mu}) - \tfrac{1}{2} (\boldsymbol{\mu} - \boldsymbol{m})^T K^{-1} (\boldsymbol{\mu} - \boldsymbol{m}). \qquad (A.1)
\]
The derivative for each term is given by
\[
\frac{\partial}{\partial \boldsymbol{\mu}} \left[ -\tfrac{1}{2} (\boldsymbol{n} - R\boldsymbol{\mu})^T V^{-1} (\boldsymbol{n} - R\boldsymbol{\mu}) \right] = (\boldsymbol{n} - R\boldsymbol{\mu})^T V^{-1} R, \qquad (A.2)
\]
\[
\frac{\partial}{\partial \boldsymbol{\mu}} \left[ -\tfrac{1}{2} (\boldsymbol{\mu} - \boldsymbol{m})^T K^{-1} (\boldsymbol{\mu} - \boldsymbol{m}) \right] = -(\boldsymbol{\mu} - \boldsymbol{m})^T K^{-1}. \qquad (A.3)
\]
Combining these and taking the transpose ($V^{-1}$ and $K^{-1}$ are symmetric), we therefore require that $\hat{\boldsymbol{\mu}}$ satisfies
\[
0 = R^T V^{-1} (\boldsymbol{n} - R\hat{\boldsymbol{\mu}}) - K^{-1} (\hat{\boldsymbol{\mu}} - \boldsymbol{m}) \qquad (A.4)
\]
\[
\;\; = R^T V^{-1} \boldsymbol{n} - \left[ R^T V^{-1} R + K^{-1} \right] \hat{\boldsymbol{\mu}} + K^{-1} \boldsymbol{m}. \qquad (A.5)
\]
Now we use that the covariance of the ML solution from Sec. 3 is given by $U_{\mathrm{ML}} = R^{-1} V (R^{-1})^T$ and therefore that $R^T V^{-1} R = U_{\mathrm{ML}}^{-1}$. Substituting into Eq. (A.5) and rearranging for $\hat{\boldsymbol{\mu}}$,
\[
\hat{\boldsymbol{\mu}} = \left[ K^{-1} + U_{\mathrm{ML}}^{-1} \right]^{-1} \left( U_{\mathrm{ML}}^{-1} R^{-1} \boldsymbol{n} + K^{-1} \boldsymbol{m} \right) \qquad (A.6)
\]
\[
\;\; = K \left[ K + U_{\mathrm{ML}} \right]^{-1} R^{-1} \boldsymbol{n} + U_{\mathrm{ML}} \left[ K + U_{\mathrm{ML}} \right]^{-1} \boldsymbol{m} \qquad (A.7)
\]
\[
\;\; = K \left[ K + U_{\mathrm{ML}} \right]^{-1} \left( R^{-1} \boldsymbol{n} - \boldsymbol{m} \right) + \boldsymbol{m}, \qquad (A.8)
\]
where from Eq. (A.6) to Eq. (A.7) we use that $\left[ A^{-1} + B^{-1} \right]^{-1} B^{-1} \equiv A \left[ A + B \right]^{-1}$ for invertible matrices $A$ and $B$.
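The step from Eq. (A.6) to Eq. (A.8) is easy to verify numerically; the short check below uses random symmetric positive-definite matrices as stand-ins for $K$ and $U_{\mathrm{ML}}$ and arbitrary vectors for $\hat{\boldsymbol{\mu}}_{\mathrm{ML}} = R^{-1}\boldsymbol{n}$ and $\boldsymbol{m}$ (all values are illustrative).

```python
import numpy as np

rng = np.random.default_rng(seed=3)
M = 5

def random_spd(size):
    """Random symmetric positive-definite matrix."""
    Q = rng.normal(size=(size, size))
    return Q @ Q.T + size * np.eye(size)

K, U = random_spd(M), random_spd(M)                  # stand-ins for K and U_ML
mu_ml, m = rng.normal(size=M), rng.normal(size=M)

# Eq. (A.6): [K^{-1} + U^{-1}]^{-1} (U^{-1} mu_ML + K^{-1} m)
lhs = np.linalg.solve(np.linalg.inv(K) + np.linalg.inv(U),
                      np.linalg.solve(U, mu_ml) + np.linalg.solve(K, m))

# Eq. (A.8): K [K + U]^{-1} (mu_ML - m) + m
rhs = K @ np.linalg.solve(K + U, mu_ml - m) + m

assert np.allclose(lhs, rhs)
```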
References

[1] D. Foreman-Mackey, E. Agol, S. Ambikasaran, R. Angus, Fast and scalable Gaussian process modeling with applications to astronomical time series, Astron. J. 154 (6) (2017) 220. arXiv:1703.09710, doi:10.3847/1538-3881/aa9332.
[2] I. Andersen, A. Szymkowiak, C. Rasmussen, L. Hanson, J. Marstrand, H. Larsson, L. Hansen, Perfusion quantification using Gaussian process deconvolution, Magn. Reson. Med. 48 (2) 351–361. doi:10.1002/mrm.10213.
[3] J. A. Christen, B. Sansó, M. Santana-Cibrian, J. X. Velasco-Hernández, Bayesian deconvolution of oil well test data using Gaussian processes, J. Appl. Stat. 43 (4) (2016) 721–737. doi:10.1080/02664763.2015.1077374.
[4] B. Hunt, Bayesian methods in nonlinear digital image restoration, IEEE Trans. Comput. C-26 (3) (1977) 219–229. doi:10.1109/TC.1977.1674810.
[5] G. Cowan, A survey of unfolding methods for particle physics, in: Proc. Conference on Advanced Statistical Techniques in Particle Physics, Durham, England, 2002, pp. 248–257.
[6] V. Blobel, Unfolding methods in particle physics, in: Proc. PHYSTAT 2011 Workshop on Statistical Issues Related to Discovery Claims in Search Experiments and Unfolding, CERN, Geneva, Switzerland, 2011, pp. 240–251. doi:10.5170/CERN-2011-006.240.
[7] F. Spanò, Unfolding in particle physics: a window on solving inverse problems, EPJ Web Conf. 55 (2013) 03002. doi:10.1051/epjconf/20135503002.
[8] C. E. Rasmussen, C. K. I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.
[9] G. Cowan, Statistical Data Analysis, Oxford University Press, 1998.
[10] A. Tikhonov, On the solution of ill-posed problems and the method of regularization, Mat. Sb. 151 (3) (1963) 501–504.
[11] D. L. Phillips, A technique for the numerical solution of certain integral equations of the first kind, J. ACM 9 (1) (1962) 84–97. doi:10.1145/321105.321114.
[12] M. Schmelling, The method of reduced cross-entropy: A general approach to unfold probability distributions, Nucl. Inst. Methods Phys. Res. A 340 (2) (1994) 400–412. doi:10.1016/0168-9002(94)90119-8.
[13] G. Zech, Iterative unfolding with the Richardson-Lucy algorithm, Nucl. Inst. Methods Phys. Res. A 716 (2013) 1–9. arXiv:1210.5177, doi:10.1016/j.nima.2013.03.026.
[14] G. D'Agostini, A multidimensional unfolding method based on Bayes' theorem, Nucl. Inst. Methods Phys. Res. A 362 (1995) 487–498. doi:10.1016/0168-9002(95)00274-X.
[15] G. Choudalakis, Fully Bayesian unfolding (2012). arXiv:1201.4612.
[16] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[17] M. Frate, K. Cranmer, S. Kalia, A. Vandenberg-Rodes, D. Whiteson, Modeling smooth backgrounds and generic localized signals with Gaussian processes (2017). arXiv:1709.05681.
[18] O. Williams, A. Fitzgibbon, Gaussian process implicit surfaces, in: Proc. Gaussian Processes in Practice, 2007, pp. 1–4.
[19] G. Wahba, Spline Models for Observational Data, Society for Industrial and Applied Mathematics, 1990. doi:10.1137/1.9781611970128.
[20] https://github.com/adambozson/gp-unfold.
[21] F. James, M. Roos, Minuit – a system for function minimization and analysis of the parameter errors and correlations, Comput. Phys. Commun. 10 (1975) 343–367. doi:10.1016/0010-4655(75)90039-9.
[22] iminuit – A Python interface to Minuit. URL https://github.com/iminuit/iminuit.
[23] M. N. Gibbs, Bayesian Gaussian processes for regression and classification, Ph.D. thesis, Univ. Cambridge (1997).
[24] G. Aad et al. (ATLAS Collaboration), Measurements of top-quark pair differential cross-sections in the lepton+jets channel in pp collisions at √s = 8 TeV using the ATLAS detector, Eur. Phys. J. C 76 (2016) 538. doi:10.1140/epjc/s10052-016-4366-4.
[25] G. Aad et al. (ATLAS Collaboration), Measurement of the differential cross-section of highly boosted top quarks as a function of their transverse momentum in √s = 8 TeV proton-proton collisions using the ATLAS detector, Phys. Rev. D 93 (2016) 1–34. doi:10.1103/PhysRevD.93.032009.
[26] A. Shah, A. Wilson, Z. Ghahramani, Student-t processes as alternatives to Gaussian processes, in: Proc. Artificial Intelligence and Statistics, 2014, pp. 877–885.