Gaussian Process Accelerated Feldman-Cousins Approach for Physical Parameter Inference
Lingge Li, Nitish Nayak, Jianming Bian, Pierre Baldi
University of California, Irvine
(Dated: October 2, 2019)
Abstract
The unified approach of Feldman and Cousins allows for exact statistical inference of small signals that commonly arise in high energy physics. It has gained widespread use, for instance, in measurements of neutrino oscillation parameters in long-baseline experiments. However, the approach relies on the Neyman construction of the classical confidence interval and is computationally intensive as it is typically done in a grid-based fashion over the entire parameter space. In this letter, we propose an efficient algorithm for the Feldman-Cousins approach using Gaussian processes to construct confidence intervals iteratively. We show that in the neutrino oscillation context, one can obtain confidence intervals 5 times faster in one dimension and 10 times faster in two dimensions, while maintaining an accuracy above 99%.

I. INTRODUCTION

Constructing classical confidence intervals for physical parameters with boundary conditions is challenging when dealing with small signals. The challenge is especially evident when studying neutrino oscillations because of the low event counts and multiple competing effects on the energy spectrum. The low event counts are primarily caused by the extremely low interaction cross-section of neutrinos, arising from the fact that they interact via the weak nuclear force. In order to extract meaningful statistical conclusions, one has to resort to means other than the asymptotic properties of Poisson data. The gold standard is the so-called unified approach outlined by Feldman and Cousins [1]. It builds upon the Neyman construction of classical confidence intervals by specifying an ordering principle based on likelihood ratios and is known for providing correct coverage.

The Feldman-Cousins approach is firmly grounded in statistical theory and widely used in neutrino experiments, for example Refs. [2][3][4]. However, it comes at a heavy computational cost, which in some cases such as Ref. [2] renders it infeasible for multi-dimensional confidence intervals. For the 1 − α confidence interval, the Feldman-Cousins approach includes all the values in the parameter space where the likelihood ratio test fails to reject at the α level. However, it doesn't provide a prescription for how to sample that parameter space. Therefore, one is forced to sample it in its entirety in a grid-based fashion. Moreover, at each point one has to perform a large number of Monte Carlo simulations in order to calculate the p-value for the likelihood ratio test.

To accelerate the Feldman-Cousins approach, we propose approximating the function of p-values over the parameter space with Gaussian processes. Instead of performing a large number of Monte Carlo simulations, we start with just a small number of them at several parameter values to get noisy estimates of the p-values. We then train a Gaussian process model to interpolate over these estimates. Iteratively, we perform more Monte Carlo simulations to refine the Gaussian process approximation. We can control the p-value approximation error so that it does not change the likelihood ratio test decisions and the confidence interval. Meanwhile, the Monte Carlo simulations can be allocated intelligently in the parameter space to achieve substantial savings in computation.

The proposed algorithm is rooted in the framework of Bayesian optimization [5], which was originally designed to find the extremal points of an objective function that is unknown a priori.
In the Feldman-Cousins approach, the function of p-values over the parameter space is unknown. We adapt Bayesian optimization to locate a set of points in the parameter space that lie on the boundary of the desired confidence intervals. By side-stepping points that are estimated to be either inside or outside the confidence interval with high probability, we can thus reduce the computational cost while producing the same result. We show that in the context of neutrino oscillation experiments, one can accelerate the construction of one-dimensional and two-dimensional confidence intervals by a factor of 5 and 10 respectively, without sacrificing the accuracy of the Feldman-Cousins approach.

II. STATISTICAL INFERENCE FOR NEUTRINO OSCILLATIONS

A. Neutrino Oscillations
Neutrino oscillations demonstrate that neutrinos have mass and that the neutrino mass eigenstates are different from their flavor eigenstates. In the three-flavor framework, the transformation of the mass eigenstates (ν₁, ν₂, ν₃) into the flavor eigenstates (ν_e, ν_µ, ν_τ) is described by the 3 × 3 unitary matrix U_PMNS [6], which is parameterized by three mixing angles θ₁₂, θ₂₃ and θ₁₃, and a CP violation phase δ_CP. The probability of oscillations between different neutrino flavor states of given energy E_ν over a propagation distance (baseline) L depends on the U_PMNS parameters and the differences of the squared masses of the eigenstates, ∆m²₂₁ and ∆m²₃₂.

The mixing angles θ₁₂ and θ₁₃ along with the squared-mass splitting ∆m²₂₁ have been measured to relatively high accuracy by several experiments, for example, Refs. [7][8][9]. One can then infer the remaining parameters, θ₂₃, δ_CP, and ∆m²₃₂, by measuring the probabilities P(ν_µ → ν_µ) and P(ν_µ → ν_e). Of particular interest are: (1) the sign of ∆m²₃₂, positive indicating a "Normal Hierarchy" (NH) and negative indicating an "Inverted Hierarchy" (IH) of neutrino mass states; (2) whether δ_CP ≠ 0, π, indicating Charge-Parity (CP) violation in the lepton sector; (3) whether the mixing angle is in fact maximal, i.e. θ₂₃ = 45°. The neutrino mass hierarchy has important implications for current and future neutrino experiments [10] involved in measuring the absolute neutrino mass and investigating the possible Majorana nature of the neutrino. Leptonic CP violation could be important to deduce the origin of the predominance of matter in the universe.

To infer neutrino oscillation parameters θ, a typical long-baseline neutrino oscillation experiment sends a beam of ν_µ neutrinos into a detector and observes a handful of oscillated ν_e neutrinos along with the ν_µ neutrinos that survive over the baseline. As the oscillation probability is a function of neutrino energy, the observed neutrinos are binned by their energy. The neutrino oscillation parameters are inferred by comparing the observed neutrino energy spectra with the expected spectra for different oscillation parameters as shown in Fig. 1.

FIG. 1: An illustration of a toy neutrino oscillation experiment setup with the ν_µ → ν_e channel on the left and the ν_µ → ν_µ channel on the right. Expectations for different oscillation parameters are compared to mock observations in order to find maximum likelihood estimates. The likelihood of the observed data is maximized using the extended likelihood function. The fit is performed in both channels simultaneously.
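For orientation, the parameter dependence sketched above can be made explicit in the standard two-flavor vacuum approximation for the muon neutrino survival probability; this is a textbook formula quoted here for context, not a result of this analysis:

\[ P(\nu_\mu \to \nu_\mu) \;\approx\; 1 - \sin^2(2\theta_{23})\, \sin^2\!\left(\frac{1.267\, \Delta m^2_{32}\,[\mathrm{eV}^2]\; L\,[\mathrm{km}]}{E_\nu\,[\mathrm{GeV}]}\right), \]

so the position of the dip in the ν_µ spectrum constrains ∆m²₃₂ while its depth constrains sin²(2θ₂₃).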
B. Feldman-Cousins Approach

Denote the random variable for the neutrino count in the i-th energy bin by X_i. Further, assume that each X_i follows an independent Poisson distribution with mean λ_i. For a given θ, the expectations λ⃗ are also influenced by systematic uncertainties in the beam configuration and the interaction model among others, which we parameterize by δ. For given oscillation and nuisance parameters (θ, δ), the expectations λ⃗ are obtained through simulations as they are analytically intractable. Denote the implicit mapping between λ⃗ and (θ, δ) by v. The extended log-likelihood of (θ, δ) is given by

\[ \log L(\theta, \delta) = \sum_{i \in I} \log \mathrm{Pois}\big(x_i;\, v(\theta, \delta)_i\big) + \log \mathrm{Pois}\Big(\sum_{i \in I} x_i;\, \sum_{i \in I} v(\theta, \delta)_i\Big) - \delta^2, \]

where −δ² is a penalty term for the systematic error [11].

For a unified treatment of constructing classical confidence intervals for both null and non-null observations, an ordering principle based on likelihood ratios was introduced by Feldman and Cousins in 1997. The unified approach provides correct coverage even at parameter boundaries and has the highest statistical power as a result of the Neyman-Pearson lemma. In essence, a particular parameter value θ₀ is included in the 1 − α confidence interval if the likelihood ratio test fails to reject the null hypothesis θ = θ₀ at the α level. The likelihood ratio test statistic is given by

\[ -2 \log \frac{L(\theta_0)}{\max_{\theta} L(\theta)} \]

and has an asymptotic χ² distribution by Wilks' theorem.

FIG. 2: In the context of neutrino oscillations, the likelihood ratio test statistic distribution changes across the parameter space. Here the parameter is δ_CP and ranges from 0 to 2π. The solid blue line indicates the 68th percentile of Monte Carlo simulated distributions while the dashed black line is the 68th percentile of the asymptotic χ² distribution.

In the context of neutrino oscillations, the asymptotic distribution is unreliable because of the small sample size in neutrino data and physical boundaries on the oscillation parameters. The reference distribution of the likelihood ratio test statistic can vary drastically as a function of θ; Fig. 2 shows several distributions at different θ values and comparisons of their critical values in particular. Therefore, for any given θ, Monte Carlo experiments are used to simulate the reference distribution and calculate the p-value for the likelihood ratio test. Since the parameter space is bounded, the simulations are performed on a grid for a large number of θ values and the computational cost adds up quickly.
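To make the computational burden concrete, the following minimal Python sketch computes the Feldman-Cousins p-value at a single grid point for a binned Poisson model. The bin expectations, the grid of alternative predictions, and the omission of nuisance-parameter profiling are all simplifying assumptions of ours, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def loglik(counts, mu):
    # Poisson log-likelihood up to constants (factorials cancel in ratios);
    # assumes all expectations mu are strictly positive
    return np.sum(counts * np.log(mu) - mu)

def lr_statistic(counts, mu_theta, predictions):
    # -2 log [ L(theta) / max_theta' L(theta') ], maximized over a grid of
    # predicted spectra; predictions should contain mu_theta itself so that
    # the statistic is non-negative
    best = max(loglik(counts, mu) for mu in predictions)
    return -2.0 * (loglik(counts, mu_theta) - best)

def fc_pvalue(observed, mu_theta, predictions, n_mc=2000):
    # Simulate the reference distribution of the statistic at this grid point
    t_obs = lr_statistic(observed, mu_theta, predictions)
    t_mc = np.array([lr_statistic(rng.poisson(mu_theta), mu_theta, predictions)
                     for _ in range(n_mc)])
    return np.mean(t_mc >= t_obs)  # fraction of pseudo-experiments above t_obs
```

Repeating this at every grid point, with n_mc large enough for a stable tail estimate, is what makes the brute-force construction expensive.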
III. GAUSSIAN PROCESS ALGORITHM

A. Gaussian Process Regression

A Gaussian process (GP) is a stochastic process where any finite collection of points is jointly Gaussian with mean µ and covariance Σ. A GP can be interpreted as an infinite-dimensional extension of the multivariate Gaussian; it can be thought of as a distribution in function space where each draw from the distribution is a curve. Typically, a zero-mean GP is used for modeling, but it is still impossible to specify an infinite-dimensional covariance matrix explicitly. Instead, we can parameterize a zero-mean GP with a kernel function κ that defines the pairwise covariance. Let f ∼ GP(0, κ(·, ·)). Then for any pair x and x′ we have

\[ \begin{pmatrix} f(x) \\ f(x') \end{pmatrix} \sim N\!\left( 0, \begin{bmatrix} \kappa(x, x) & \kappa(x, x') \\ \kappa(x, x') & \kappa(x', x') \end{bmatrix} \right). \]

Given a finite set of observed data \(\vec{x}_{obs}\), we can write down the multivariate Gaussian likelihood in this fashion and maximize it with respect to the kernel parameters ω. Conveniently, at a new point x∗ we can obtain the closed-form predictive distribution:

\[ f(x^*) \mid f(\vec{x}_{obs}) \sim N\Big( \kappa(x^*, \vec{x}_{obs})\, \kappa(\vec{x}_{obs}, \vec{x}_{obs})^{-1} f(\vec{x}_{obs}),\;\; \kappa(x^*, x^*) - \kappa(x^*, \vec{x}_{obs})\, \kappa(\vec{x}_{obs}, \vec{x}_{obs})^{-1} \kappa(\vec{x}_{obs}, x^*) \Big). \]

Since a GP is uniquely characterized by its kernel, different kernels produce distinct behaviors. A commonly used kernel is the squared exponential κ(x₁, x₂) = exp(−(x₁ − x₂)²/l²), where l is called the length scale. Intuitively, the length scale determines the distance over which the GP interpolates between points. The squared exponential kernel is infinitely differentiable and functions drawn from such a GP are smooth. However, this smoothness assumption might not be appropriate for some applications; a more general kernel is the Matérn kernel. The Matérn kernel has an additional parameter ν that controls the smoothness, and the squared exponential kernel is the special case where ν → ∞. Fig. 3 shows some GP examples; please refer to Ref. [12] for more details on Gaussian processes.

FIG. 3: Sampled paths from Gaussian processes with different kernels (left). With ν = 1.5, the Matérn kernel produces functions that are only once differentiable. Sampled paths from a Gaussian process prior and posterior with the squared exponential kernel (right). The posterior paths, representing curves drawn from the predictive distribution, are better aligned with the observed data points in solid blue.

Different kernels can be combined to compose a GP as long as the new kernel covariance matrix is still positive semi-definite. With the squared exponential kernel alone, the covariance implies that the observed data have no error. To account for error in the data, a diagonal matrix σ²I is usually added to model constant variance across observations. In many situations such as ours, there exists heteroskedasticity, which means that different observations have different errors. When we iteratively perform Monte Carlo simulations to calculate p-values, the errors in the estimates also vary based on the number of simulations. We can model the p-value errors as a diagonal matrix and add it to the GP covariance.
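The closed-form posterior above, with per-point Monte Carlo variances on the diagonal, is short enough to write out directly. Below is a minimal NumPy sketch for one-dimensional inputs; the kernel follows the definition in the text, while the function and variable names are ours:

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale):
    # kappa(x1, x2) = exp(-(x1 - x2)^2 / l^2), as defined in the text
    return np.exp(-((a[:, None] - b[None, :]) / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, noise_var, x_new, length_scale):
    """Predictive mean and standard deviation of a zero-mean GP, with
    heteroskedastic noise variances noise_var added to the diagonal."""
    K_oo = sq_exp_kernel(x_obs, x_obs, length_scale) + np.diag(noise_var)
    K_no = sq_exp_kernel(x_new, x_obs, length_scale)
    K_nn = sq_exp_kernel(x_new, x_new, length_scale)
    mean = K_no @ np.linalg.solve(K_oo, y_obs)
    cov = K_nn - K_no @ np.linalg.solve(K_oo, K_no.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))
```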
B. Monte Carlo Error Estimation

In the Feldman-Cousins approach, a large number of Monte Carlo simulations is required in order to make the error in the p-value calculation negligible. When the Monte Carlo error in the p-value calculation is not negligible, we should try to quantify it. Since the p-value is the quantile of the observed likelihood ratio statistic under the reference distribution, we can use a binomial proportion confidence interval as the p-value error estimate, as outlined below [13]. As shown in Fig. 4, the Monte Carlo error only slowly approaches zero as the number of simulations increases to 10,000.
FIG. 4: Monte Carlo error in terms of the p-value as a function of the number of experiments (left). Example of non-parametric quantile interval construction using the Binomial distribution (right). In a sample with 100 draws, the 85th and 95th order statistics form a 95% confidence interval for the 90th quantile of the unknown distribution.

Suppose X₁, ..., X_n are independent draws from an unknown distribution F whose q-th quantile is denoted by F⁻¹(q). Each draw X_i falls below F⁻¹(q) with probability q. Consequently, M, the number of X_i's less than or equal to F⁻¹(q), has a Binomial(n, q) distribution. We can obtain a confidence interval for F⁻¹(q) from the sample statistics X₍l₎, X₍u₎ (the l-th and u-th ordered draws) with 1 ≤ l ≤ u ≤ n such that

\[ B(u - 1;\, n, q) - B(l - 1;\, n, q) \geq 1 - \alpha, \]

where B(·; n, q) denotes the Binomial(n, q) cumulative distribution function. B(u − 1; n, q) − B(l − 1; n, q) is the probability that M is between l and u − 1. Thus, (l, u) would form a confidence interval for M. Correspondingly, (X₍l₎, X₍u₎) would form a confidence interval for F⁻¹(q). Our goal, however, is to estimate F(x∗) for an arbitrary point x∗ rather than F⁻¹(q) for a fixed q, where, in our context, x∗ is the observed likelihood ratio test statistic and F(x∗) is the p-value. This can be done by inverting the quantile confidence interval: we vary q until the confidence intervals for F⁻¹(q_l) and F⁻¹(q_u) no longer contain x∗. Then (q_l, q_u) forms a confidence interval for F(x∗).
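A hedged SciPy sketch of the order-statistic construction above; the equal-tailed split of the miscoverage is our choice (other splits satisfy the same inequality):

```python
from scipy.stats import binom

def quantile_ci_indices(n, q, conf=0.95):
    """Order-statistic indices (l, u), 1-based, such that (X_(l), X_(u))
    covers the q-th quantile of an unknown distribution with probability
    >= conf, i.e. B(u-1; n, q) - B(l-1; n, q) >= conf."""
    tail = (1.0 - conf) / 2.0
    l = int(binom.ppf(tail, n, q))            # ensures B(l-1; n, q) < tail
    u = int(binom.ppf(1.0 - tail, n, q)) + 1  # ensures B(u-1; n, q) >= 1 - tail
    return l, u

# For n = 100 draws and the 90th quantile, this yields indices close to the
# (85, 95) pair quoted in Fig. 4; the exact pair depends on the tail split.
print(quantile_ci_indices(100, 0.9))
```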
C. Proposed Algorithm

Bayesian optimization can be used to find the extremum of a black-box function h when h is expensive to evaluate, so that a grid search is too computationally intensive. Bayesian optimization is an iterative procedure; in each iteration, h is evaluated at a number of points to update an approximation of h. The approximation usually starts from a zero-mean Gaussian process prior GP(0, κ(·, ·)). After each iteration, the GP model yields a posterior distribution, hence Bayesian. Based on the approximation posterior, the points in the next iteration are proposed by an acquisition function a. The acquisition function aims to balance "exploration", reducing approximation uncertainty, and "exploitation", reaching the extremum.

In our context, the expensive black-box function is the function of p-values over the parameter space. Denote the grid points in the parameter space where Monte Carlo simulations are performed by \(\vec{\theta}_o\), the simulated p-values at these points by \(y(\vec{\theta}_o)\), and the independent simulation errors by \(\sigma^2(\vec{\theta}_o)\). The GP predictive posterior distribution of the unobserved p-values at \(\vec{\theta}_u\), conditional on the obtained p-values \(y(\vec{\theta}_o)\) at points \(\vec{\theta}_o\), is then given by

\[ f(\vec{\theta}_u) \mid y(\vec{\theta}_o) \sim N\Big( K_{uo}\big(K_{oo} + \mathrm{diag}(\sigma^2(\vec{\theta}_o))\big)^{-1} y(\vec{\theta}_o),\;\; K_{uu} - K_{uo}\big(K_{oo} + \mathrm{diag}(\sigma^2(\vec{\theta}_o))\big)^{-1} K_{ou} \Big), \]

where K_oo, K_ou, K_uo, K_uu denote the covariance matrices between the points \(\vec{\theta}_o\) and \(\vec{\theta}_u\).

Different from typical Bayesian optimization, we do not simply wish to find the minimum or maximum p-value. Instead, we want to find the points where the p-value is equal to α so that they enclose the confidence interval. Moreover, we want to be able to find multiple intervals at different confidence levels. Therefore, we choose our acquisition function to be

\[ a(\theta) = \sum_{\alpha_i} \left| \frac{f(\theta) - \alpha_i}{\sigma_f(\theta)} \right|^{-1}, \]

where f(θ) is the GP approximated p-value (posterior mean) at θ and σ_f(θ) is the GP posterior standard deviation at θ.

Algorithm 1: GP iterative confidence interval construction

    for each iteration t = 1, 2, ... do
        Propose points in the parameter space: arg max_θ a(θ)
        for each proposed point θ′ do
            Simulate the likelihood ratio statistic distribution:
            for k = 1, 2, ... do
                Perform a pseudo-experiment
                Maximize the likelihood with respect to (θ, δ)
                Maximize the likelihood with the constraint θ = θ′
                Calculate the likelihood ratio statistic
            end for
            Calculate the p-value based on the simulated distribution
        end for
        Train the GP approximation f(θ) for the p-values
        Update confidence intervals
    end for
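A minimal sketch of the acquisition rule above on a one-dimensional grid, given GP posterior means and standard deviations; the target α levels for 68% and 90% intervals and the small guard constant are our choices:

```python
import numpy as np

def acquisition(mean, sd, levels=(0.32, 0.10)):
    # a(theta) = sum_i |(f(theta) - alpha_i) / sigma_f(theta)|^(-1);
    # large where the posterior mean is close to a target level alpha_i
    # relative to the posterior uncertainty, i.e. near an interval boundary
    return sum(sd / (np.abs(mean - a) + 1e-12) for a in levels)

def propose(grid, mean, sd, k=5):
    # Pick the k highest-priority grid points for the next batch of MC
    return grid[np.argsort(-acquisition(mean, sd))[:k]]
```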
Iteratively, the GP algorithm will seek out points on the boundary of the confidence intervals, about which it is unsure. Points far from the boundary, which have p-values much greater or less than the α_i, are probabilistically "ruled out." At these points, we will end up performing fewer Monte Carlo experiments or skipping them altogether. Every point on the grid is either included or rejected with some uncertainty based on the GP posterior. With more iterations, the uncertainty diminishes so that the approximated confidence intervals converge to the ones produced by a full grid search. Fig. 5 illustrates the proposed algorithm on a 1-dimensional example.

FIG. 5: An illustration of our construction for the 68% and 90% confidence intervals for δ_CP, which consist of points lying underneath the dashed horizontal lines. From a few initial points with high variance, the GP learns a rough approximation of the true curve (left). Based on the approximation, more points are proposed around the interval boundary, shown in dark blue, and the GP improves itself (right). The shade of blue represents the number of simulations used to calculate the p-value and the error bars are for the p-value.

Here we use the squared exponential kernel with the Monte Carlo errors added to the covariance diagonal, along with a small amount of white noise as is often done in a regression setting [12]. Point estimates of the GP kernel parameters are obtained by optimizing the log marginal likelihood

\[ -\frac{1}{2}\, y(\vec{\theta}_o)^T \big(K_{oo} + \mathrm{diag}(\sigma^2(\vec{\theta}_o))\big)^{-1} y(\vec{\theta}_o) - \frac{1}{2} \log \big| K_{oo} + \mathrm{diag}(\sigma^2(\vec{\theta}_o)) \big| - \frac{n}{2} \log 2\pi. \]
There are constraints on the kernel parameters that should be incorporated. For instance, the length scale l should be greater than the grid resolution and less than the grid range.
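The paper implements the GP with scikit-learn [18]; the hedged sketch below shows one way to encode the bounded length scale, the white-noise term, and the heteroskedastic Monte Carlo variances. The grid, the p-value estimates, and their variances are placeholder values:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical 1D grid over delta_CP in [0, 2*pi) with 20 points
grid = np.linspace(0.0, 2 * np.pi, 20, endpoint=False)
res, span = grid[1] - grid[0], grid[-1] - grid[0]

# p-value estimates y and per-point MC variances sigma2 at sampled points
# (placeholders; in practice these come from the pseudo-experiments)
theta_o = grid[::4]
y = np.array([0.9, 0.6, 0.2, 0.4, 0.8])
sigma2 = np.array([0.01, 0.02, 0.005, 0.02, 0.01])

# Length scale constrained between the grid resolution and the grid range,
# plus a small white-noise term, as described in the text
kernel = RBF(length_scale=3 * res, length_scale_bounds=(res, span)) \
         + WhiteKernel(noise_level=1e-4, noise_level_bounds=(1e-6, 1e-2))
gp = GaussianProcessRegressor(kernel=kernel, alpha=sigma2)  # heteroskedastic errors
gp.fit(theta_o.reshape(-1, 1), y)  # optimizes the log marginal likelihood
mean, sd = gp.predict(grid.reshape(-1, 1), return_std=True)
```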
IV. NUMERICAL STUDIES

By way of illustration, we set up a toy long-baseline neutrino oscillation experiment in order to construct confidence intervals for the oscillation parameters. A flux distribution of ν_µs is modeled as a Landau function over neutrino energies up to 4.5 GeV with the location parameter at 2 GeV, as shown in Fig. 6. The normalisation uncertainty is taken to be 10% and is applied as a nuisance parameter. The ν_µ distribution is then oscillated into ν_es using the PMNS model for a toy baseline of 810 km through the Earth. Corrections from matter interactions [14] are applied assuming a constant matter density. The setup is similar to NOvA [3], an accelerator-based long-baseline experiment at Fermilab. The oscillated ν_es are then "observed" with a toy interaction cross-section distribution, similar in shape to Ref. [15]; the cross-section increases as a function of neutrino energy from 0 GeV up to 1 GeV and decreases slowly until the maximum neutrino energy of 4.5 GeV. The cross-section is multiplied with the oscillated ν_e distribution to get an energy spectrum expectation in uniform energy bins, and the same procedure is applied to the ν_µ → ν_µ channel. However, in order to expedite the computation, the 2-flavor oscillation probability approximation is used. The reactor mixing angle θ₁₃ and the solar parameters θ₁₂ and ∆m²₂₁ are fixed at the values given in Ref. [16]. A mock data set is obtained by applying Poisson variations on the expected spectrum at oscillation parameter values given by NOvA.

FIG. 6: The distributions for the ν_e interaction cross-section (left) and the ν_µ flux (right) are shown along with a normalization systematic error of 10%.

We then use this setup to construct 1-dimensional confidence intervals for δ_CP and 2-dimensional confidence intervals for sin²θ₂₃ vs δ_CP with the two algorithms, a standard grid-search implementation of Feldman-Cousins and the GP-based algorithm. |∆m²₃₂| is treated as a nuisance parameter, and so is sin²θ₂₃ in the case of the 1-dimensional interval for δ_CP. The likelihood function is integrated over the nuisance parameters assuming a flat prior over a bounded range for |∆m²₃₂| and for sin²θ₂₃, similar to Ref. [2]. The prior on the nuisance parameters for the systematic uncertainties in the toy model is assumed to be a standard normal distribution. The toy model and parameter fitting routine are implemented in ROOT [17] while the Gaussian process algorithm is implemented with scikit-learn [18].
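As a concrete illustration of this kind of toy model, the sketch below builds a binned ν_µ survival spectrum from a Landau-like flux (approximated here by SciPy's Moyal distribution) and the 2-flavor survival probability. The bin count, flux scale, and normalization are placeholder choices, not the paper's exact configuration:

```python
import numpy as np
from scipy.stats import moyal

def survival_prob(e_nu, sin2_2th23, dm2_32, baseline=810.0):
    # Two-flavor vacuum approximation:
    # P = 1 - sin^2(2*theta23) * sin^2(1.267 * dm2 [eV^2] * L [km] / E [GeV])
    return 1.0 - sin2_2th23 * np.sin(1.267 * dm2_32 * baseline / e_nu) ** 2

def expected_numu_spectrum(sin2_2th23, dm2_32, norm=500.0, n_bins=16):
    """Toy expected nu_mu counts per energy bin: Landau-like flux shape
    (Moyal, peaked near 2 GeV) times the survival probability."""
    edges = np.linspace(0.5, 4.5, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    flux = moyal.pdf(centers, loc=2.0, scale=0.4)
    return norm * flux * survival_prob(centers, sin2_2th23, dm2_32) * np.diff(edges)

# A mock data set applies Poisson fluctuations to the expectation:
rng = np.random.default_rng(1)
observed = rng.poisson(expected_numu_spectrum(0.95, 2.5e-3))
```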
A. 1-dimensional Confidence Intervals

To make inferences on δ_CP, a significance curve is usually drawn under different assumptions of the mass hierarchy, as shown in Fig. 7. The portion of the significance curve below a certain value gives us the confidence interval at that level. We can observe that the NH curves from both the standard FC and GP algorithms have the same intersections with the 1σ horizontal line, which implies that the 1σ confidence intervals are the same. Though there are slight discrepancies, the shape of the GP significance curve is mostly correct.

FIG. 7: Example significance curves obtained with the standard Feldman-Cousins and Gaussian process algorithms mostly overlap, especially when the significance is close to the 68% and 90% confidence levels, as desired. In this case, the inverted hierarchy (IH) is rejected at more than the 1σ level and the normal hierarchy (NH) has the same 1σ confidence interval.

To evaluate the performance of the GP algorithm, we perform the same inference procedure on 200 different data sets to find the 68% and 90% confidence intervals. First, with the standard FC results as ground truth, we consider the accuracy of the GP algorithm for classifying whether or not each grid point is included in the confidence intervals. As the GP algorithm is iterative, we can calculate the accuracy at the end of each iteration with fixed computation. When the computation reaches 20% of that required by standard FC, we stop the algorithm and calculate the absolute error as the difference in confidence interval endpoints. Fig. 8 shows that the median accuracy reaches 1 with less than 20% of the computation and that the error is no more than 0.1π for most data sets. As δ_CP ranges from 0 to 2π and there are only 20 grid points, an error of 0.1π is just one grid point. With a finer grid, we expect the performance of the GP algorithm to improve.

FIG. 8: Relative accuracy of the confidence intervals in terms of correctly included grid points as a function of computation (left) and the distribution of absolute errors for both normal and inverted hierarchies (right).
B. 2-dimensional Confidence Contours
To find the 2-dimensional confidence contours under hierarchy constraints, the GP algorithm approximates the p-value surface on the parameter grid as shown in Fig. 9 and specifically prioritizes points on the contour boundaries. Grid points below a certain value are included in the confidence contour at that level. To make the final smooth contours in Fig. 10, we use Fourier smoothing to draw the closest elliptical curves. We can observe that the FC and GP contours overlap in the same areas. In fact, the area difference between the contours stays on the same order of magnitude with Fourier smoothing.

FIG. 9: GP approximated percentile (1 − p-value) on the 20 × 20 grid for sin²θ₂₃ vs δ_CP (left) and the priority for sampling points from the grid (right). Notice that the points near 68% and 90% have the highest priority.

Similarly, we use both algorithms on 200 different data sets to find the 68% and 90% confidence contours and calculate the grid point classification accuracy after each iteration, up to 10% of the standard FC computation. A concern is that contours with a larger area could require more computation to achieve the same accuracy, as there are more points along the boundary. We address this concern by stratifying contours by area quartile and plotting the median accuracy as a function of computation. Fig. 11 shows that the median accuracy reaches 1 with less than 10% of the computation and that the contour area does not have an effect. The reason is that while larger contours have more points on the boundary, smaller contours are more difficult to locate precisely. Overall, it takes roughly the same computation to probe the p-value surface accurately, so the GP algorithm should have similar performance regardless of the contour size.

FIG. 10: Confidence contours for the same data constrained to the normal (left) and inverted hierarchies (right). The true (dashed) and approximated (transparent) contours are almost indistinguishable.

FIG. 11: Relative accuracy of the confidence contours as a function of computation (left) and median accuracy stratified by area as a function of computation (right).
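The paper does not spell out its Fourier smoothing procedure; a common variant, sketched below under that assumption, treats the boundary points as a periodic complex sequence and keeps only the lowest Fourier modes, which with very few modes yields a near-elliptical closed curve:

```python
import numpy as np

def fourier_smooth(points, n_modes=3):
    """Smooth a closed contour (array of shape (n, 2)) by truncating the
    Fourier series of its boundary, treated as z = x + iy; keeping only a
    few modes draws the closest near-elliptical curve."""
    z = points[:, 0] + 1j * points[:, 1]
    coeffs = np.fft.fft(z)
    keep = np.zeros_like(coeffs)
    keep[:n_modes + 1] = coeffs[:n_modes + 1]  # DC term and low positive modes
    keep[-n_modes:] = coeffs[-n_modes:]        # matching negative modes
    smooth = np.fft.ifft(keep)
    return np.column_stack([smooth.real, smooth.imag])
```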
Lastly, we are interested in where the computational savings come from. We keep track of the number of grid points explored by the GP algorithm and the number of simulations at each point for the 200 data sets. Fig. 12 shows that the algorithm explores about half of the total grid points and that, on average, only about 300 Monte Carlo simulations are done at each, instead of 2000 in standard FC. We conclude that most of the computational savings come from performing fewer Monte Carlo simulations; skipping grid points nearly doubles the computational savings. Roughly, a factor of 2000/300 ≈ 7 from fewer simulations per point, times a factor of about 2 from visiting half the grid, accounts for the overall speedup. As mentioned earlier, the advantage of the GP algorithm could be greater on a finer grid.

FIG. 12: Distribution of the number of points explored on the grid (left) and distribution of the average number of Monte Carlo experiments simulated at a point (right).
V. DISCUSSION
The proposed algorithm significantly accelerates the Feldman-Cousins approach, wherein experiments have to devote enormous computational resources in order to estimate uncertainties in neutrino oscillation parameters [19]. This could also prove useful in estimating confidence intervals from a combined fit of neutrino oscillation results from different experiments when the respective likelihood functions are available. While we designed the GP-based construction in the neutrino oscillation context, the GP approximation does not have a particular parametric form. The same idea can therefore be applied to many other scenarios where the confidence interval construction for a continuous parameter over a bounded region normally proceeds via the unified approach.

[1] G. J. Feldman and R. D. Cousins. Unified approach to the classical statistical analysis of small signals. Physical Review D, 57(7):3873, 1998.
[2] K. Abe et al. Measurement of neutrino and antineutrino oscillations by the T2K experiment including a new additional sample of ν_e interactions at the far detector. Physical Review D, 96(9):092006, 2017.
[3] M. A. Acero et al. New constraints on oscillation parameters from ν_e appearance and ν_µ disappearance in the NOvA experiment. Physical Review D, 98(3):032012, 2018.
[4] P. Adamson et al. Search for sterile neutrinos in MINOS and MINOS+ using a two-detector fit. Physical Review Letters, 122(9):091803, 2019.
[5] J. Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400-404. Springer, 1975.
[6] B. Pontecorvo. Mesonium and antimesonium. Zhur. Eksptl'. i Teoret. Fiz., 33, 1957.
[7] Q. R. Ahmad et al. Direct evidence for neutrino flavor transformation from neutral-current interactions in the Sudbury Neutrino Observatory. Physical Review Letters, 89(011301), 2002.
[8] Y. Fukuda et al. Evidence for oscillation of atmospheric neutrinos. Physical Review Letters, 81(1562), 1998.
[9] F. P. An et al. Measurement of electron antineutrino oscillation based on 1230 days of operation of the Daya Bay experiment. Physical Review D, 95(072006), 2017.
[10] G. Drexlin, V. Hannen, S. Mertens, and C. Weinheimer. Current direct neutrino mass experiments. Advances in High Energy Physics, 2013, 2013.
[11] R. Barlow. Extended maximum likelihood. Nuclear Instruments and Methods in Physics Research, 293(3), 1990.
[12] C. E. Rasmussen. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning, pages 63-71. Springer, 2004.
[13] G. J. Hahn and W. Q. Meeker. Statistical Intervals: A Guide for Practitioners, volume 92. John Wiley & Sons, 2011.
[14] A. Y. Smirnov. The MSW effect and matter effects in neutrino oscillations. Physica Scripta, T121:57-64, 2005.
[15] J. A. Formaggio and G. P. Zeller. From eV to EeV: Neutrino cross sections across energy scales. Reviews of Modern Physics, 84(3):1307, 2012.
[16] Particle Data Group. Neutrino masses, mixing and oscillations. 2017.
[17] R. Brun and F. Rademakers. ROOT: An object oriented data analysis framework. Nuclear Instruments and Methods in Physics Research Section A, 389(1-2):81-86, 1997.
[18] F. Pedregosa, G. Varoquaux, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830, 2011.
[19] A. Sousa, N. Buchanan, S. Calvez, P. Ding, D. Doyle, A. Himmel, B. Holzman, J. Kowalkowski, A. Norman, and T. Peterka. Implementation of Feldman-Cousins corrections and oscillation calculations in the HPC environment for the NOvA experiment. In Proceedings of the 23rd International Conference on Computing in High-Energy and Nuclear Physics (CHEP 2018), Sofia, Bulgaria, July 9-13, 2018. 2019, in press.