Novel approaches to estimation of acceptance and background distributions in multidimensional amplitude analyses

Abhijit Mathad, Daniel O'Hanlon, Anton Poluektov

Physik-Institut, Universität Zürich, Zürich, Switzerland
Department of Physics, University of Warwick, Coventry, United Kingdom
INFN Sezione di Bologna, Bologna, Italy
Aix Marseille Univ, CNRS/IN2P3, CPPM, Marseille, France
February 6, 2019
Abstract
Amplitude analyses are a powerful technique to study heavy hadron decays. A significant complication in these analyses is the treatment of instrumental effects, such as background and selection efficiency variations, in the multidimensional kinematic phase space. This paper reviews conventional methods to estimate efficiency and background distributions and outlines the method of density estimation using Gaussian processes and artificial neural networks. Such techniques see widespread use elsewhere, but have not gained popularity in use for amplitude analyses. Finally, novel applications of these models are proposed, to estimate background density in the signal region from the sidebands in multiple dimensions, and a more general method for model-assisted density estimation using artificial neural networks.

∗ Corresponding author. E-mail address: [email protected]

1 Introduction
Amplitude analysis of hadron decays is a powerful technique employed in many flavour physics studies, such as measurements of CP violation, searches for effects beyond the Standard Model, spectroscopic studies of excited hadrons, and searches for previously unobserved hadronic states. In this kind of analysis, multidimensional kinematic distributions of the decay products of a parent particle are studied to reveal the dynamical structure of the decay amplitude [1]. In addition to the decay dynamics, the kinematic distributions are in general affected by non-uniform acceptance, or detection efficiency, and background density, which need to be accounted for in the fit.

In this paper we briefly review the conventional approaches employed by previous analyses, recall a few already known but rarely used methods, and, finally, propose new techniques to model non-uniform acceptance and background distributions. The proposed techniques not only offer more accurate descriptions of these distributions, but also provide improved avenues to control the systematic uncertainties arising from such approaches.

A simple yet typical example of an amplitude analysis is the study of the two-dimensional distribution of a three-body decay of a scalar meson into three scalar mesons: the Dalitz-plot analysis [1, 2]. This is the simplest case where the decay has internal degrees of freedom, yet the amplitude is a function of only two kinematic variables. In more complicated cases, such as decays involving non-spin-zero states or decays with more than three particles in the final state, one is required to analyse kinematic distributions in more than two dimensions. Here we focus on the simple two-dimensional case; however, most of the approaches presented here can be generalised to cases with higher dimensionality.

We avoid any quantitative comparisons of the performance of the illustrated techniques.
The optimal technique for each individual analysis depends on many factors, such as the size of the data sample, the dimensionality of the kinematic phase space, the requirements on statistical and systematic uncertainties, the complexity of the amplitude model, the signal selection procedure, and the structure of the background contributions.

The structure of the paper is as follows: the formalism of multidimensional maximum-likelihood fits is recalled in Section 2, where non-parametric methods to deal with background and acceptance are presented. The samples used to illustrate the background and acceptance parameterisation techniques are described in Section 3. Conventional techniques to parametrise the acceptance distribution are illustrated in Section 4. Further, two rarely used but efficient approaches are presented: a technique using Gaussian processes (Section 5) and density estimation with artificial neural networks (Section 6). Finally, two novel approaches are proposed: a technique for inter- or extrapolation of the background density from the sidebands using one of the techniques above (Section 7), and a model-assisted parameterisation of the background or acceptance density using neural networks (Section 8).

2 Multidimensional maximum-likelihood fit

A typical amplitude analysis in flavour physics deals with the distribution of kinematic variables z that characterise the multibody decay. The goal is to determine the unknown parameters p that characterise the amplitude of the decay A(z|p). Given a set of decay candidates z_i (1 ≤ i ≤ N) obtained in an experiment, an unbinned maximum-likelihood fit is performed to infer the model parameters p.
The negative logarithm of the likelihood minimised in the fit is of the form

    -2\ln\mathcal{L} = -2\sum_{i=1}^{N} \ln F(z_i \,|\, p),    (1)

where N is the size of the sample being fitted and F(z|p) is the normalised probability density of the decay. It depends on the model A(z|p), a normalisation term I, the acceptance ε(z), and the normalised probability density function of the background events, B(z):

    F(z \,|\, p) = \frac{\epsilon(z)\,|\mathcal{A}(z \,|\, p)|^2}{I} + B(z).    (2)

Another instrumental effect that needs to be taken into account in the fit, particularly if the amplitude contains narrow resonant states, is the finite resolution of the kinematic variables z. This effect is beyond the scope of this paper and is not considered.

The contribution of background events is typically obtained by analysing the distribution of selection variables, m. In the simplest case, m is a single variable taken to be the combined invariant mass of the final-state particles, which typically peaks at the mass of the parent particle for signal events and is distributed more uniformly for the background. However, other parameters of the event can also be included in the background selection. Alternatively, instead of treating the background contribution explicitly as in Eq. (2), one can assign to each candidate z_i in the data sample a weight w_i, such that the background contribution is statistically subtracted. This procedure is discussed in more detail in Section 2.2.

While the amplitude A(z|p) is driven by the model of the decay dynamics and is the primary object under study, the experimental effects of background and non-uniform acceptance exist as "nuisance" objects that nevertheless have to be modelled accurately for correct interpretation of the results. Their description, especially in the case of a multidimensional kinematic phase space, often presents a major difficulty in an analysis.
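As a concrete illustration of Eqs. (1) and (2), the sketch below evaluates the fit PDF for a hypothetical one-dimensional "phase space" z ∈ [0, 1]. The amplitude, acceptance and background shapes are invented placeholders (not the models used in this paper), the signal fraction is made explicit, and the normalisation integral I is estimated by Monte-Carlo integration:

```python
import numpy as np

def amplitude_sq(z, p):
    """|A(z|p)|^2 for a toy amplitude: a single Breit-Wigner-like peak."""
    mass, width = p
    return 1.0 / ((z - mass) ** 2 + (width / 2) ** 2)

def acceptance(z):
    """Toy acceptance eps(z), dropping towards the phase-space edges."""
    return z * (1.0 - z)

def background(z):
    """Normalised toy background density B(z): uniform on [0, 1]."""
    return np.ones_like(z)

def pdf(z, p, f_bkg, norm_points=10000):
    """F(z|p) of Eq. (2), with an explicit background fraction f_bkg.
    The normalisation I is estimated on a uniform sample (volume V = 1)."""
    y = np.random.default_rng(1).uniform(0.0, 1.0, norm_points)
    i_norm = np.mean(acceptance(y) * amplitude_sq(y, p))
    signal = acceptance(z) * amplitude_sq(z, p) / i_norm
    return (1.0 - f_bkg) * signal + f_bkg * background(z)

def nll(data, p, f_bkg):
    """Negative log-likelihood of Eq. (1)."""
    return -2.0 * np.sum(np.log(pdf(data, p, f_bkg)))
```

In a real analysis the amplitude would be the coherent sum of resonant contributions; the structure of the likelihood is unchanged.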
Below we review several conventional techniques employed in amplitude analyses to deal with the effects of background and non-uniform acceptance.

2.1 Treatment of acceptance

Non-uniform acceptance is usually handled either explicitly, using a parametric or non-parametric model of the decay density as in Eq. (2), or in an implicit way, by including its effect in the normalisation term of the likelihood. In the latter case, the scattered data from simulation can be used directly, and no functional representation of the acceptance is required.

To demonstrate the implicit approach, consider Eq. (1) minimised in the unbinned fit (the background contribution has been omitted for simplicity):

    -2\ln\mathcal{L} = -2\sum_{i=1}^{N} \ln\left(\frac{|\mathcal{A}(z_i|p)|^2\,\epsilon(z_i)}{I}\right)
                     = -2\sum_{i=1}^{N} \ln|\mathcal{A}(z_i|p)|^2 - 2\sum_{i=1}^{N} \ln\epsilon(z_i)
                       + 2N\ln\left(\frac{V}{M}\right) + 2N\ln\left(\sum_{j=1}^{M} |\mathcal{A}(y_j|p)|^2\,\epsilon(y_j)\right).    (3)

Here the normalisation term I is calculated by taking the mean of the values of the density function on a uniformly distributed sample y_j (1 ≤ j ≤ M) in a volume V. The constant terms that do not depend on the parameters of interest, p, can be omitted, which leads to the following expression:

    -2\ln\mathcal{L} = -2\sum_{i=1}^{N} \ln|\mathcal{A}(z_i|p)|^2 + 2N\ln\left(\sum_{j=1}^{M} |\mathcal{A}(y_j|p)|^2\,\epsilon(y_j)\right).    (4)

The second term in the above equation can be seen as the sum of |A|² calculated over the sample y_j distributed uniformly over the decay phase space, where each event j enters with the weight ε(y_j). The weight of each event can also be interpreted as the probability of that event passing the detector acceptance. Such an interpretation hints at a way to prepare the normalisation sample: one has to generate the decays uniformly in the decay phase space, and then simulate the reconstruction and selection of the events.
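The "implicit acceptance" normalisation of Eq. (4) can be sketched as follows. The toy amplitude and the accept-reject filter standing in for the detector simulation are hypothetical placeholders; the point is that the retained events carry the acceptance weight implicitly, so no parameterisation of ε(z) appears anywhere:

```python
import numpy as np

rng = np.random.default_rng(7)

def amplitude_sq(z, p):
    """Toy |A(z|p)|^2 (not the paper's model): peak position and width."""
    mass, width = p
    return 1.0 / ((z - mass) ** 2 + (width / 2) ** 2)

def accept(z):
    """Hypothetical acceptance: keep each event with probability 4 z (1 - z)."""
    return rng.uniform(size=z.shape) < 4.0 * z * (1.0 - z)

# Normalisation sample: generated uniformly in phase space, then passed
# through the simulated selection.  Only retained events are kept.
y = rng.uniform(0.0, 1.0, 200000)
y_acc = y[accept(y)]

def nll(data, p):
    """-2 ln L of Eq. (4), up to p-independent constants: the
    normalisation sum runs over the accepted simulated events only."""
    norm = np.mean(amplitude_sq(y_acc, p))
    return (-2.0 * np.sum(np.log(amplitude_sq(data, p)))
            + 2.0 * len(data) * np.log(norm))
```

Minimising this function over p recovers the amplitude parameters without any explicit acceptance model, at the price of requiring a large accepted simulation sample.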
The retained events then serve as a normalisation sample for the likelihood, and no further corrections to the acceptance are required. In a real analysis, however, additional weighting of the normalisation sample may be needed to account for imperfections in the simulated sample.

Since no explicit parameterisation of the acceptance distribution is needed, this approach is often used in amplitude analyses with more degrees of freedom than the two-dimensional Dalitz plot, such as Λb → J/ψpK− [3], B+ → J/ψφK+ [4], or D → K−π+π+π− [5] decays, where the amplitudes are described in a five- or six-dimensional phase space. This method is, however, statistically sub-optimal, since it does not exploit the fact that the acceptance distribution can be assumed to be at least locally smooth. As a result, these analyses usually require simulated samples several times larger than the real data samples.

Explicit modelling of the acceptance function is more typically used in two-dimensional Dalitz-plot analyses. The techniques often used are two-dimensional polynomials with the parameters obtained from fitting a simulated sample [6], two-dimensional histograms smoothed with cubic splines [7], and kernel density estimation [8, 9]. Analyses in more than two dimensions do sometimes also use an explicit acceptance parameterisation, albeit with assumptions on the factorisation of the acceptance in some variables [3] to reduce the dimensionality.

2.2 Treatment of background contributions

As in the case of acceptance, backgrounds can be treated in amplitude analyses either in an explicit or an implicit fashion. The implicit inclusion of the background in the likelihood fit can be performed using the sPlot technique [10, 11], where each event is assigned a weight that depends on the value of the selection variables m.
These weights are positive in the signal-dominated regions of the selection variables and negative in the background-dominated regions, such that the contribution of the background events can be statistically subtracted from the likelihood.

This procedure does not require an explicit parameterisation of the background density; however, it suffers from some drawbacks. Firstly, it assumes that the amplitude fit variables are uncorrelated with the selection variables which, as will be demonstrated in Section 7, is an assumption that in general is not well motivated. The presence of correlations will thus introduce a bias in the fit results, especially if the background level is large. Secondly, since the procedure does not make any assumptions on the functional form of the background density, it results in larger statistical uncertainties on the results than when a reasonable functional form is assumed.

3 Illustrative samples

For the purposes of illustration, the D+s → K+π−π+ decay is considered in this paper. As this is a three-body decay with scalar particles in both the initial and final states, its dynamics are fully characterised by two kinematic variables. This decay is also convenient because all the final-state particles are charged tracks, which makes the selection easy to implement in a simplified Monte-Carlo simulation framework. There are no identical particles in the final state, avoiding any need to deal with Bose symmetrisation of the kinematic phase space.
Finally, being a singly Cabibbo-suppressed decay, it is interesting from a physics point of view, since it can exhibit significant CP violation.

Here we choose to parametrise the phase space in terms of the two square Dalitz-plot variables [12], defined as

    m' = \frac{1}{\pi}\arccos\left(2\,\frac{m(K^+\pi^-) - m^{\rm min}_{K\pi}}{m^{\rm max}_{K\pi} - m^{\rm min}_{K\pi}} - 1\right), \qquad
    \theta' = \frac{1}{\pi}\,\theta(K^+\pi^-),    (5)

where m(K+π−) is the invariant mass of the K+π− combination, m^min_Kπ = M_K + M_π and m^max_Kπ = M_{D_s} − M_π are the minimum and maximum values of m(K+π−), M_K and M_π are the masses of the K and π mesons, respectively [13], and θ(K+π−) is the helicity angle of the K+π− combination. While the square Dalitz plot was designed for analyses of B mesons, in order to give more weight to the interference regions between different two-body combinations that predominantly occur in the corners of the conventional Dalitz plot, here it is used purely to avoid complications related to the curved boundaries of the conventional Dalitz plot.

Although the three-body Dalitz-plot analysis is the simplest case for the techniques considered here, they scale well with the dimensionality of the kinematic phase space and can be applied to more complicated amplitude analyses. Similarly, the exact definition of the kinematic phase space (conventional or square Dalitz plots, or various representations of four-body kinematics) is not a limitation for the techniques under study.

3.1 Acceptance for D+s → K+π−π+

The sample of D+s → K+π−π+ decays is simulated using a simplified Monte-Carlo technique, where only the kinematic properties of the initial and final-state particles are considered.
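The mapping of Eq. (5) can be sketched directly. The particle masses below are rounded PDG values in GeV, used for illustration only:

```python
import numpy as np

# Rounded masses in GeV (natural units), for illustration only.
M_DS, M_K, M_PI = 1.968, 0.494, 0.140

def square_dalitz(m_kpi, theta_kpi):
    """Map the K+pi- invariant mass and helicity angle to (m', theta')
    following Eq. (5); both outputs lie in [0, 1]."""
    m_min = M_K + M_PI
    m_max = M_DS - M_PI
    mprime = np.arccos(2.0 * (m_kpi - m_min) / (m_max - m_min) - 1.0) / np.pi
    thetaprime = theta_kpi / np.pi
    return mprime, thetaprime
```

With this convention, m′ = 1 at the Kπ threshold and m′ = 0 at the kinematic limit, so the curved Dalitz-plot boundary maps onto the unit square.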
The production of D+s mesons and the reconstruction of the decay products are inspired by the conditions of the LHCb experiment [14]; however, the numerical values of the parameters used in the simulation are largely arbitrary and differ from those at LHCb.

For the initial D+s mesons, the transverse momentum p_T (the component of momentum perpendicular to the ẑ-axis, which in the case of a proton-proton collider corresponds to the direction of the beams) is generated according to an exponential distribution. The angle θ between the direction of the D+s momentum and the ẑ-axis is generated such that the pseudorapidity, η = −ln tan(θ/2), is distributed uniformly. Since the D+s decay is spherically symmetric, the momenta of the final-state products are generated such that they are uniform in the square Dalitz-plot variables m′ and θ′. These momenta in the D+s rest frame are then boosted to the laboratory frame.

To simulate the selection of D+s candidates in the experiment, only the candidates that satisfy certain kinematic criteria are retained: thresholds are applied to the total momentum p and the transverse momentum p_T of each track, to the p_T of at least one of the final-state tracks, and to the p_T of the D+s candidate. This results in a high-statistics sample of selected events, of which a smaller subsample of selected candidates is used in the examples presented elsewhere in this paper to estimate this distribution. (Natural units with c = ℏ = 1 are used in this paper.)

Figure 1: Relative efficiency variation over the square Dalitz-plot variables for the D+s → K+π−π+ sample, obtained from the high-statistics Monte-Carlo simulation. The distribution is normalised such that the average efficiency equals 1.

3.2 Combinatorial background density for D+s → K+π−π+

The simulated combinatorial background contribution to the D+s → K+π−π+ decays contains purely random combinations of three tracks, as well as combinations of ρ(770) → π+π− or K∗(892) → K+π− with a K+ or a π+, respectively. The kaon and pion tracks, as well as the K∗ and ρ resonances, are generated uniformly in pseudorapidity, η, and with an exponential p_T distribution. The fractions of ρ and K∗ in the combinations before applying selection cuts are 20% and 10%, respectively. The invariant masses of the ρ and K∗ resonances are generated according to a relativistic Breit-Wigner distribution, with masses and widths equal to their world-average values [13], and their decay products are generated assuming that the resonances are unpolarised (i.e. they are isotropic in the resonance rest frame and are uncorrelated with the third track).

In a second stage, the three charged tracks are combined to form the D+s candidate. Only tracks above a minimum p_T are used, and a kinematic fit constrains the invariant mass of the combination to the known D+s mass, M_{D_s} = 1.97 GeV [13]. The three-dimensional distribution of the invariant mass m_D ≡ m(K+π−π+) of the three particles before the kinematic fit, together with the square Dalitz-plot variables m′ and θ′ after the kinematic fit, is used to extract the density of the background events in the signal region.

The distributions of each variable, with the definitions of the signal and sideband regions, are shown in Fig. 2. To clearly show the features of the background density, the two-dimensional projections in Fig. 2(a,b,c) are obtained with the high-statistics Monte-Carlo sample of events satisfying the selection requirements. The examples of background density estimation in this paper, as well as the one-dimensional projections in Fig. 2(d,e,f), use the smaller sample. The following examples assume that the background density is estimated using only the sideband regions of the m_D distribution, defined as the regions below and above a window around M_{D_s} that defines the signal region. One can see the K∗ and ρ resonances shifted in the sidebands with respect to the signal region. In this particular case, this is purely explained by the kinematic fit procedure applied to the background sample. In a real analysis, it can also be caused by a dependence of the background production properties on the selection variable, m_D.

Figure 2: Simulated combinatorial background to the D+s → K+π−π+ decay. Two-dimensional (a) θ′ vs. m′, (b) m′ vs. m_D, and (c) θ′ vs. m_D projections; one-dimensional projection onto (d) the m_D variable with the definition of the signal and sideband regions; and projections of the signal and sideband regions onto the (e) m′ and (f) θ′ variables. The two-dimensional distributions are normalised such that the average density is equal to one.

4 Conventional techniques

In this section, we present estimates of the acceptance variation over the amplitude fit variables using conventional methods based on Legendre polynomials and cubic splines. These conventional methods use histograms to estimate the local density of events before and after a selection requirement, and then interpolate the efficiency value between the bin centres of the histogram.

Numerous analyses use multidimensional orthogonal Legendre polynomials [6, 15–17] to parameterise the variation of the acceptance. In d dimensions these consist of the product of d polynomials of order n_d, where n_d can differ for each dimension, that describe the form of the acceptance. As such, the number of free parameters in this method is O(n_1 × n_2 × ⋯ × n_d), increasing with power-law growth in the number of dimensions. The coefficients of these polynomials are then extracted in a maximum-likelihood fit. As the optimal order of each of these polynomials is a priori unknown, cross-validation and/or regularisation is often used to prevent overfitting.

Another commonly used method is interpolation between the histogram bin centres with cubic splines [7, 18, 19]. Here, the scale of the variation over the phase space is determined by the initial histogram bin size, and therefore, to avoid overfitting, the size and location of the bins are optimised before the spline function is calculated.
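The binned efficiency estimate that both conventional methods start from can be sketched in a few lines; the two-dimensional samples here are synthetic placeholders standing in for the events before and after selection:

```python
import numpy as np

def binned_efficiency(before, after, bins=10, rng=(0.0, 1.0)):
    """Per-bin efficiency from two 2-D samples of shape (n, 2):
    histogram the events before and after selection and take the ratio."""
    h_before, xe, ye = np.histogram2d(before[:, 0], before[:, 1],
                                      bins=bins, range=[rng, rng])
    h_after, _, _ = np.histogram2d(after[:, 0], after[:, 1],
                                   bins=bins, range=[rng, rng])
    # Empty bins are assigned zero efficiency rather than dividing by zero.
    with np.errstate(invalid="ignore", divide="ignore"):
        eff = np.where(h_before > 0, h_after / h_before, 0.0)
    return eff, xe, ye
```

The polynomial or spline model is then fitted to (or interpolated through) the resulting bin values.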
Unlike the Legendre polynomials, the spline functions have no free parameters other than the values at the bin centres. They are therefore particularly susceptible to overfitting, as they simply "connect" the points.

In Figure 3, we show the results of these two conventional approaches [20, 21]. For the Legendre polynomial fit, additive L1 and L2 regularisation terms were included in the likelihood to reduce overfitting and improve the stability of the likelihood fit [22]:

    -2\ln\mathcal{L}_{\rm reg} = -2\ln\mathcal{L} + L_1 + L_2,    (6)

with

    L_1 = \lambda_1 \sum_{i,j} |c_{ij}|    (7)

and

    L_2 = \lambda_2 \sum_{i,j} c_{ij}^2,    (8)

where c_ij are the coefficients of the polynomials and λ1 and λ2 are regularisation parameters. A cross-validation procedure was performed to determine the optimal degree of the Legendre polynomials (set to be equal in each dimension for simplicity), as well as the magnitude of each of the L1 and L2 terms. This resulted in a polynomial degree of n = 8 in each dimension, an L1 regularisation parameter λ1 = 0.01, and an L2 regularisation parameter λ2 = 0.1. The reduced χ² for this fit is calculated using an independent test set, with the number of effective degrees of freedom calculated using bootstrap resampling; the resulting χ² value is 2517.

Figure 3: Estimate of the acceptance variation using the (top) Legendre polynomial and (bottom) cubic spline models, shown (left to right) in the two-dimensional (m′, θ′) plane and as m′ and θ′ projections.

For the cubic spline parameterisation, a 10 × 10 binning in (m′, θ′) is used, chosen ad hoc such that the bin size is similar to the size of the smallest structures in these variables. The reduced χ² is calculated using the number of bins in the χ² test (50 × 50) minus the number of points fitted by the spline (100), and a value of χ²/nDoF = 2460/2400 = 1.03 is obtained.
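A simplified version of the regularised Legendre fit can be sketched as follows. For brevity, a binned least-squares fit with only the L2 (ridge) penalty of Eq. (8) is shown, since it has a closed-form solution; the paper's unbinned likelihood fit and the L1 term of Eq. (7) would require an iterative minimiser:

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_design(x, y, degree):
    """Design matrix of products P_i(x) P_j(y) for inputs scaled to [-1, 1]."""
    cols = []
    for i in range(degree + 1):
        for j in range(degree + 1):
            ci = np.zeros(degree + 1); ci[i] = 1.0
            cj = np.zeros(degree + 1); cj[j] = 1.0
            cols.append(legendre.legval(x, ci) * legendre.legval(y, cj))
    return np.stack(cols, axis=1)

def fit_acceptance(x, y, z, degree=4, lam2=0.1):
    """Ridge-regularised least squares for the coefficients c_ij."""
    a = legendre_design(x, y, degree)
    n = a.shape[1]
    return np.linalg.solve(a.T @ a + lam2 * np.eye(n), a.T @ z)

def evaluate(x, y, coeffs, degree=4):
    """Evaluate the fitted 2-D Legendre expansion."""
    return legendre_design(x, y, degree) @ coeffs
```

The degree and λ2 play the same roles as in the text and would be tuned by cross-validation.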
5 Gaussian processes

A Gaussian process [23] is a statistical model which assumes that each point in space is associated with a normally distributed random variable. This implies that any linear combination of model estimates also follows a normal distribution, yielding a closed-form expression for the model at an arbitrary point in space. Fortunately, to avoid having to fit the parameters of an infinite number of normal distributions, Gaussian processes are completely determined by a parameterisation of the covariance matrix with a covariance function. As such, it is possible to obtain model estimates in a large number of dimensions with relatively few parameters, which can then be used to extrapolate the behaviour of the model with reliable estimates of the uncertainty. Furthermore, these parameters can be robustly extracted directly from the data, which gives Gaussian processes an advantage over other "non-parametric" models, such as kernel density estimates or piece-wise spline interpolation. A pedagogical introduction to Gaussian processes can be found in Ref. [24], and other applications in high-energy physics can be found in Refs. [25, 26].

Given some input data vector x of length n (where the elements of x can themselves be vectors of arbitrary dimension), the output, y, of a Gaussian process is defined as

    y \sim \mathcal{N}(0, \Sigma(x; \Theta)),    (9)

where N is a multivariate normal distribution with zero mean and covariance matrix Σ(x; Θ), given by

    \Sigma(x;\Theta) = \begin{pmatrix}
      k(x_1, x_1; \Theta) & k(x_1, x_2; \Theta) & \cdots & k(x_1, x_n; \Theta) \\
      k(x_2, x_1; \Theta) & k(x_2, x_2; \Theta) & \cdots & k(x_2, x_n; \Theta) \\
      \vdots & \vdots & \ddots & \vdots \\
      k(x_n, x_1; \Theta) & k(x_n, x_2; \Theta) & \cdots & k(x_n, x_n; \Theta)
    \end{pmatrix}.    (10)

Here k(x_i, x_j; Θ) is a covariance function, with hyperparameter vector Θ, defined between any two input points x_i, x_j that are elements of x.
For a new input point x∗, the conditional probability of predicting an output equal to the true unknown value y∗, given the previously observed outputs y, follows a normal distribution,

    P(y_* \,|\, y) \sim \mathcal{N}\left(\Sigma_* \Sigma^{-1} y,\; \Sigma_{**} - \Sigma_* \Sigma^{-1} \Sigma_*^T\right),    (11)

where Σ∗ = [k(x∗, x_1; Θ), k(x∗, x_2; Θ), ..., k(x∗, x_n; Θ)] and Σ∗∗ = k(x∗, x∗; Θ). The best estimate of the true value y∗ is equal to the mean of the above probability distribution,

    \hat{y}_* = \Sigma_* \Sigma^{-1} y,    (12)

and the uncertainty is the square root of the variance,

    {\rm Var}(y_*) = \Sigma_{**} - \Sigma_* \Sigma^{-1} \Sigma_*^T.    (13)

The negative logarithm of the likelihood for this construction is given by

    -2\ln p(y \,|\, x, \Theta) = y^T \Sigma^{-1} y + \ln|\Sigma| + n \ln 2\pi.    (14)

The hyperparameters Θ of the covariance function can then be inferred by minimising the negative logarithm of the likelihood (14), or obtained via marginalisation using suitable priors and Markov-chain Monte-Carlo.

One such covariance function is the Matérn function,

    k_\nu(d) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\sqrt{2\nu}\,\frac{d}{\rho}\right)^{\nu} K_\nu\!\left(\sqrt{2\nu}\,\frac{d}{\rho}\right),    (15)

where d is the distance between two points, Γ is the gamma function, K_ν is the modified Bessel function of the second kind, ρ and ν are non-negative parameters, and σ controls the absolute magnitude of the covariance. The Matérn function is defined in terms of the distance between two points, rather than the location of each point, and therefore describes a stationary distribution. For half-integer values of ν, it can be expressed as the product of an exponential function and a polynomial of order p = ν − 1/2. The parameter ν can be thought of as controlling the smoothness of the function, and when ν → ∞ the Matérn function converges to the squared-exponential covariance function,

    \lim_{\nu\to\infty} k_\nu(d) = \sigma^2 \exp\left(-\frac{d^2}{2\rho^2}\right).    (16)

Here we use the Matérn function with a fixed half-integer value of ν throughout, as this provides a good balance between replicating and smoothing the observed structures; however, the parameter values, and the best model choice in general, depend on the data in question. These are ideally selected using cross-validation or a similar procedure, and can be considered as a source of systematic uncertainty on the final distribution.

In reality, one would also want to describe distributions that are non-stationary (i.e., where the mean in Eq. (9) is non-zero). However, due to the linearity of the model, this amounts to a simple subtraction of the mean function from the observed data, and therefore there is no loss of generality in this description. Specific assumptions on the mean distribution are discussed for the applications described in Sections 5.1 and 7.

5.1 Acceptance parameterisation

The dataset described in Section 3 parameterises the acceptance in terms of the square Dalitz-plot variables m′ and θ′. As the Gaussian process does not estimate the density directly (although there are modifications that would permit this [27]), the density in (m′, θ′) space is first estimated by a uniformly binned histogram with 50 bins on each axis. The locations of the centres of these bins are then the input points x defined above, where x_i = [m′_i, θ′_i]. As these events were generated uniformly in (m′, θ′) space, the acceptance probability in each bin is simply proportional to the bin content, which is the output, y, of the Gaussian process. Therefore, for each input x_i = [m′_i, θ′_i] there is an associated output y_i.
As the acceptance is positive everywhere, the resulting Gaussian process does not represent a stationary process, and as such a mean function constant in (m′, θ′) is added to the Gaussian process to account for this offset.

An advantage of the Gaussian process is that it is relatively robust to statistical fluctuations due to low sample sizes, as the uncertainty at each point is estimated directly. As such, the aforementioned binning can be considerably finer than in other methods, provided that the assumption of normally distributed uncertainties holds (processes where the likelihood is replaced with a Poisson distribution can also be used, but these are less computationally tractable than the Gaussian case, as Poisson distributions are not closed under linear combination).

Here, Gaussian processes are implemented using the GPy package [28], and the resulting acceptance model as a function of m′ and θ′ can be seen in Figure 4. The parameters of this model are the overall scale of the Matérn function, σ, the overall scale of the constant mean function, the characteristic length scales over which points covary in each dimension, ρ_m′ and ρ_θ′, and a term describing the additive Gaussian noise at each point, ε. The fit was performed using a maximum-likelihood approach, and the resulting parameter values can be found in Table 1.

Figure 4: Result of the density estimation of the simulated sample of D+s → K+π−π+ decays using the Gaussian process method, (a) in the two square Dalitz-plot variables m′ and θ′, and projections of the two-dimensional distribution onto the (b) m′ and (c) θ′ variables.

Table 1: Parameters of the Gaussian process fit to the simulated D+s → K+π−π+ decays, with a Matérn kernel and a constant mean function.

    Model parameter          Value
    σ (Matérn scale)         648 events
    Mean function scale      547 events
    ρ_m′                     …
    ρ_θ′                     …
    ε                        …

The χ² per number of degrees of freedom of this model with respect to the data is evaluated using an independent dataset of simulated D+s → K+π−π+ decays. Here, the effective number of degrees of freedom is calculated approximately as the number of bins in the χ² test (50 × 50), and a value of χ²/nDoF = 2471/2500 = 0.99 is obtained. This indicates that the model reproduces the underlying distribution very well, where the smallness of the χ² value is mostly due to the fact that this reproduction can be achieved with a small number of parameters.

As the only parameters that scale with dimensionality are the characteristic length scales of the Matérn function, the increase in the number of parameters is linear in the dimensionality. Furthermore, the time complexity of Gaussian processes is also linear in the dimensionality, making them very efficient in high dimensions compared to other parameterisations. Unfortunately, however, the complexity is cubic in the input data size, due to the dependence on a matrix inversion, and therefore Gaussian processes do not scale well to large data sizes. Nevertheless, methods exist to mitigate this, such as the use of binned data in the strategies described here, or the selection of a small number of "pseudo" inputs [29].

6 Density estimation with artificial neural networks

Multivariate techniques such as artificial neural networks (ANNs) or boosted decision trees provide an alternative approach to parametrise a multidimensional probability density from scattered data [30, 31]. The approach involving ANNs exploits a property of neural networks whereby a "feed-forward" network (in which the layers of neurons are arranged in a non-cyclical structure) with smooth activation functions can approximate any continuous function. Here, the parameters of the ANN are treated as free parameters in a maximum-likelihood fit to the unbinned data, performed by treating the negative logarithm of the likelihood as a custom loss function. Since this technique does not require binning the input data, and since ANNs have in general successfully shown their ability for multivariate generalisation, we expect that the density estimation approach using ANNs can become useful for multidimensional amplitude analyses.
This approach is demonstrated below for the parameterisation of the two-dimensional acceptance of the D+s → K+π−π+ decay described in Section 3.

The output of the n-th neuron of the (l+1)-th layer in the ANN is given by

    a_{n,l+1} = f\left(\sum_m w_{nm,l}\, a_{m,l} + b_{n,l}\right),    (17)

where w_{nm,l} is the matrix of weights, b_{n,l} is the vector of biases for the l-th layer, and f(x) is a non-linear activation function. For the estimated density to be smooth, it is convenient to use a smooth differentiable activation function such as the sigmoid function,

    f(x) = \frac{1}{1 + e^{-x}}.    (18)

In this structure, the first layer of neurons is the input layer and accepts the kinematic variables z as inputs, while the output neuron returns a single scalar, the density estimate P(z).

Density estimation can be performed by treating the weights and biases q ≡ {w_{nm,l}, b_{n,l}} of the ANN as free parameters and minimising the negative logarithm of the likelihood,

    -2\ln\mathcal{L} = -2\sum_{i=1}^{N} \ln P(z_i \,|\, q) + 2N \ln\left(\sum_{j=1}^{M} P(y_j \,|\, q)\right),    (19)

where z_i (i = 1 ... N) are the data points and y_j (j = 1 ... M) is a uniformly distributed sample used for normalisation. The function (19) is used as the loss function to train the ANN given the training sample z_i.

As in many applications of machine-learning techniques, special care needs to be taken to avoid overfitting, where the model configuration or parameters become too specialised to the training dataset and therefore fail to generalise properly. In the case of density estimation with an ANN, overfitting manifests itself as unphysical rapid oscillations in the PDF around the training data points.
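The forward pass of Eq. (17) and the loss of Eq. (19) can be sketched in plain numpy. The layer sizes and data are placeholders, the positive output is obtained here with a softplus transform (an assumption, not specified in the text), and in practice the gradients would be supplied by a framework such as TensorFlow:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(sizes, rng):
    """Random weights w_{nm,l} and zero biases b_{n,l} for each layer."""
    return [(rng.normal(0.0, 1.0, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def density(z, params):
    """Forward pass of Eq. (17); a softplus output keeps P(z) positive."""
    a = z
    for w, b in params[:-1]:
        a = sigmoid(a @ w + b)
    w, b = params[-1]
    return np.log1p(np.exp(a @ w + b))[:, 0]

def loss(data, norm_sample, params, lam=1.0):
    """-2 ln L of Eq. (19), plus a simple L2 penalty on the weights."""
    nll = (-2.0 * np.sum(np.log(density(data, params) + 1e-12))
           + 2.0 * len(data) * np.log(np.sum(density(norm_sample, params))))
    l2 = lam * sum(np.sum(w ** 2) for w, _ in params)
    return nll + l2
```

Minimising this loss over the weights and biases, with gradients from automatic differentiation, yields the unbinned density estimate.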
Regularisation techniques, where the likelihood is explicitly penalised to promote smoothness or sparsity, are thus essential to control overfitting. It was found that regularisation which penalises large neuron weights (and therefore those that result in large gradients of the likelihood function) works well in the typical cases where the density is expected to be smooth. Specifically, an $L_2$ regularisation term of the form
$$L_2 = \lambda \sum_{n,m,l} w_{nm,l}^2 \qquad (20)$$
is added to the loss function (19), where $\lambda$ is the regularisation parameter that ultimately controls the smoothness of the PDF.

The density estimate of the $D_s^+ \to K^+\pi^-\pi^+$ decay acceptance is performed with an ANN consisting of four hidden layers with the number of neurons, from the first to the fourth layer, equalling 32, 64, 32, and 8. The regularisation parameter $\lambda$ is chosen to be equal to 1. Normalisation is performed with $5 \times$ events distributed uniformly over the space of inputs, the square Dalitz plot. The likelihood minimisation is performed using the TensorFlow framework [32] and the Adam optimiser [33] with a learning rate of $10^{-}$. The resulting estimate of the density after 10 000 training epochs (passes through the data) is shown in Fig. 5.

As highlighted in Section 2.2, the conventional methods that use either sideband distributions or the sPlot technique to determine the combinatorial background contribution in general introduce systematic biases, since they ignore correlations between the amplitude-fit variables and the selection variables (such as the combined invariant mass of the final-state particles, $m_D$). The bias can become more pronounced if only one of the sidebands can be used to estimate the background (e.g., due to the presence of specific peaking backgrounds in the other sideband, as often happens in $B$ meson decays).
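Returning to Eqs. (17)-(20), the ingredients of the ANN density fit can be sketched in plain numpy. This is a toy illustration only: the network topology, sample sizes, and the accept-if-better random search below are assumptions made for brevity, whereas the fits described in the text use TensorFlow with gradient-based minimisation.

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [2, 16, 8, 1]  # illustrative topology, smaller than the 32-64-32-8 network in the text

def init_params():
    # One (weight matrix, bias vector) pair per layer, cf. Eq. (17).
    return [(rng.normal(scale=0.5, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # activation of Eq. (18)

def density(z, params):
    a = z
    for w, b in params:
        a = sigmoid(a @ w + b)  # Eq. (17) applied layer by layer
    return a[:, 0] + 1e-6       # small offset keeps the logarithm finite

def loss(params, z_data, y_norm, lam=1.0):
    # Negative log-likelihood of Eq. (19), with a Monte-Carlo estimate of
    # the normalisation integral, plus the L2 penalty of Eq. (20).
    nll = (-np.sum(np.log(density(z_data, params)))
           + len(z_data) * np.log(np.sum(density(y_norm, params))))
    return nll + lam * sum(np.sum(w ** 2) for w, _ in params)

# Toy training data on the unit square and a uniform normalisation sample.
z = rng.beta(4, 2, size=(300, 2))
y = rng.uniform(size=(1000, 2))

# Crude accept-if-better random search, only to show that Eq. (19) works as
# a loss function; real fits would use automatic differentiation and Adam.
params = init_params()
start = best = loss(params, z, y)
for _ in range(300):
    trial = [(w + rng.normal(scale=0.02, size=w.shape),
              b + rng.normal(scale=0.02, size=b.shape)) for w, b in params]
    trial_loss = loss(trial, z, y)
    if trial_loss < best:
        params, best = trial, trial_loss
```

The $L_2$ term enters the loss exactly as in Eq. (20), so the same `lam` knob plays the role of the smoothness-controlling $\lambda$ of the text.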
The sPlot procedure also introduces additional statistical uncertainty compared to the parametric approach, due to the lack of any assumptions on the behaviour of the background.

To overcome such issues, one can add the selection variables to the background parameterisation. For example, in the case of a Dalitz-plot analysis, a 3D fit can be performed
Figure 5: Result of the density estimation of the simulated sample of $D_s^+ \to K^+\pi^-\pi^+$ decays using an ANN (a) in the two-dimensional square Dalitz-plot variables $m'$ and $\theta'$, and projections of this distribution onto the (b) $m'$ and (c) $\theta'$ variables.

to obtain the probability density function $P(m', \theta', m_D)$, which can then be used to extrapolate the desired combinatorial PDF $B(m', \theta')$ into the signal region. We present in this section two such approaches, one using a Gaussian process [34] and another using an ANN [35], to extrapolate the combinatorial background PDF from both the upper and lower sidebands of $m_D$ into the signal region. To illustrate the performance of both of these approaches, the simulated combinatorial background of the $D_s^+ \to K^+\pi^-\pi^+$ decay is used, as described in Section 3.

In the Gaussian process method, the fit variables $(m', \theta')$ and the selection variable $m_D$, taken from the sideband sample, are first binned to obtain a local estimate of the density, and the parameters of the covariance function are then inferred by fitting the model using the locations of the bin centres and their respective yields. As mentioned in Section 5, the model is fairly robust to variations in the choice of the location and size of these bins, provided that they capture sufficient variation in the input variables.

Here, the $D_s^+ \to K^+\pi^-\pi^+$ lower, $m_D \in [1.\;, 1.92]$ GeV, and upper, $m_D \in [2.\;, \;]$ GeV, $D_s^+$ sidebands described in Section 3.2 are each separated into three bins of 0.05 GeV, for a total of six bins in $m_D$. In each of these bins, the square Dalitz plot is separated into $20 \times 20$ uniform bins (with bin size $0.05 \times 0.05$), resulting in $6 \times 20 \times 20 = 2400$ inputs to the Gaussian process.

The Gaussian process uses the Matérn kernel, defined in Section 5, with $\nu = $, along with a constant mean function in $(m', \theta')$ and a linear mean function in $m_D$. Results of the estimation of the background density in the sideband regions of the $m_D$ variable are presented in Fig. 6. It can be seen that the Gaussian process model estimates well the variation of the resonance structure in $(m', \theta')$ as a function of $m_D$ due to the kinematic constraints, permitting accurate estimation of this background structure in the unobserved signal region. The corresponding kernel parameters extracted from the data are given in Table 2.

The distribution of the background in the signal region is obtained by querying the Gaussian process at $m_D = 1.97$ GeV, the results of which can be seen in Fig. 7.

Table 2: Parameters of the Gaussian process fit to the simulated $D_s^+ \to K^+\pi^-\pi^+$ background, with a Matérn kernel, constant mean function in $(m', \theta')$, and linear mean function in $m_D$.

Model parameter | Value
$\sigma$ | 279 events
$\sigma_{m_D}$ | …
$\sigma$ | …
$\rho_{m_D}$ | .27 GeV
$\rho_{m'}$ | …
$\rho_{\theta'}$ | …
$\epsilon$ | …

In the $m'$ distribution (Fig. 7(b)), the bias of the distribution is clearly smaller than that obtained from the simple projections of the distribution in the sidebands.

If there are no narrow structures in the background, one can consider a background PDF that is positive definite, reasonably smooth in $m_D$, and sufficiently generic in the square Dalitz-plot variables, such as
$$P(m', \theta', m_D) = \big| P_1(m', \theta') + e^{-\alpha m_D} P_2(m', \theta') \big|, \qquad (21)$$
where $P_{1,2}(m', \theta')$ are functions modelled with ANNs. One can then perform an unbinned fit of $P(m', \theta', m_D)$ to the sideband data, with regularisation to avoid overtraining, and with the weights and biases $q_{1,2}$ of the ANN functions $P_{1,2}$, and $\alpha$, as the free parameters. The background in the signal region can then be extrapolated using the trained model as $B(m', \theta') = P(m', \theta', m_D = M_{D_s^+})$.

In the presence of narrow structures in the amplitude that vary as a function of $m_D$ (such as the resonant $K^*$ and $\rho$ contributions in the $D_s^+ \to K^+\pi^-\pi^+$ sample, see Fig. 2), the approximation of Eq. (21) may not work well. As ANN density estimation can be performed in multiple dimensions, one can estimate the background density in the selection variable $m_D$ in addition to that in the Dalitz-plot variables $m'$ and $\theta'$. Therefore, as an alternative to a PDF of the form (21), a full three-dimensional ANN can be used to parametrise the background as a function of the square Dalitz-plot variables and $m_D$, where additional regularisation has to be applied to ensure continuity as a function of the selection variable $m_D$.
The latter can be done by adding an extra penalty term to the likelihood which penalises configurations in which the neurons of the input layer have large weights corresponding to the $m_D$ variable.

Figure 6: Results of the estimation of the combinatorial background density in the sideband regions using a Gaussian process. Two-dimensional (first row) $m_D$ vs. $m'$, (second row) $\theta'$ vs. $m'$, and (third row) $m_D$ vs. $\theta'$ projections of the (left) simulated background sample and (right) density predictions from the fit model. (Bottom row) One-dimensional projections onto (from left to right) $m'$, $\theta'$ and $m_D$.

Figure 7: Results of the interpolation of the combinatorial background density in the signal region using a Gaussian process: (a) two-dimensional density and (b,c) its projections.

The fit in the sideband regions to the simulated combinatorial background of the $D_s^+ \to K^+\pi^-\pi^+$ decay using the ANN is shown in Fig. 8. For the neurons in the first layer that take the $m_D$ dimension as input, the regularisation parameter $\lambda$ is set equal to 10, while for the other neurons it is equal to 1, as in Section 6. In Fig. 9, the combinatorial background density predicted in the signal region using the ANN approach is shown, where the ANN is trained for 10 000 epochs. As in the case of the Gaussian process technique, a certain bias can be observed due to the smearing of the narrow structure in the $m'$ distribution. Fine tuning of the regularisation parameters could possibly improve the agreement in this region. In the following section, however, we propose a more general way to control the features of density models that aims to solve this problem.

In the training of the ANN, it is often difficult to replicate specific features of the resulting density, especially in the case of limited training data. The only handles on generic ANN training are the topology of the network and the generic regularisation terms, and careful tuning of these is needed to obtain a reasonable description of the density. The procedure of parameterising the background or acceptance using only the input data, without any external knowledge of the processes that govern the features of the distributions, is not optimal. In general it is known, for instance, that the acceptance function should be relatively smooth, with a falloff at the boundaries of the phase space due to kinematic selection requirements, or that the combinatorial background should contain contributions from certain two-body resonances. The implication of this is that the behaviour is constrained much more than conventional parameterisation techniques assume, and ideally an efficient procedure should take this prior information into account.
For example, one can introduce a simplified model of these processes, and extract only the parameters of this simplified model from the unbinned data samples. In the case of background and efficiency distributions, even a simple analytical model would be difficult to express. Instead, here we propose a technique to perform nonparametric estimation of these distributions, using the formalism described in Section 6, under the assumption that the complex observed behaviour explicitly depends only on a few underlying parameters that are sufficient to describe the efficiency or background behaviour in the region of interest. The values of these latent parameters can then be inferred from the observed data, or from detailed simulation in the case of a description of the acceptance, in order to parameterise the observations.

Figure 8: Results of the estimation of the combinatorial background density in the sideband regions using an ANN. Two-dimensional (first row) $m_D$ vs. $m'$, (second row) $\theta'$ vs. $m'$, and (third row) $m_D$ vs. $\theta'$ projections of the (left) simulated background sample and (right) density predictions from the fit model. (Bottom row) One-dimensional projections onto (from left to right) $m'$, $\theta'$ and $m_D$.

Figure 9: Results of the interpolation of the combinatorial background density in the signal region using an ANN: (a) two-dimensional density and (b,c) its projections.

This way, the features of the resulting density are controlled by the simplified model, which leverages prior information on the correlations between these variables and the parameters of interest. This results in more stable training of the ANN, as only data for the simplified parameters are required, rather than resource-intensive detailed simulation, and therefore larger sample sizes can be generated. Crucially, reliance on a few latent parameters also results in an ad hoc regularisation effect, and as such the density obtained via this procedure is less sensitive to statistical fluctuations when obtaining the values of the simplified parameters. A similar technique, utilising generative adversarial networks, has recently been independently proposed [36].
In the initial stage, an estimate of the joint probability distribution $P \equiv P(z, \Theta)$ is constructed, in terms of the parameters of interest, $z$, in which the background or efficiency description is required, and the latent parameters on which the background or efficiency depend, $\Theta$. The variables $z$ could comprise the (square) Dalitz-plot variables, as in the examples in this paper, but could also be anything else that is required to be parameterised in a physics analysis, such as the invariant mass of the reconstructed particle, or its decay time. The parameters $\Theta$ are those that directly control physical constraints on the system, and influence $z$ via their correlations. These parameters necessarily vary between analyses, but they would likely include effective cut values on the final-state particle momenta, parameters that describe the shape of these momentum distributions, or fractions of potential background contributions.

An estimate of the joint probability distribution is parameterised using an ANN, obtained via the probability density estimation technique described in Section 6. The ANN is trained using a sample of simulated data, $s_{\mathrm{train}} = \{z_{\mathrm{train}}, \Theta_{\mathrm{train}}\}$, that encapsulates the dependencies between the parameters of interest, $z$, and the latent variables $\Theta$. This data is required to span the space of possible real parameter values; however, an accurate description of any specific configuration of $\Theta$ or $z$ is not required (that is, there is no requirement for the set of input data points in this initial construction to overlap with the set of eventual evaluation points, due to the model smoothing).

Secondly, an estimate of the specific values of the latent parameters, $\Theta_{\mathrm{pred}}$, that correspond to the background data or detailed simulation, $z_{\mathrm{data}}$, is obtained.
This is done by fixing the weights of the ANN and performing a maximum-likelihood fit for these values, treating the ANN output as the probability of the latent parameters conditioned on the known parameters of interest, $P(\Theta \,|\, z)$, such that $\Theta_{\mathrm{pred}} = \arg\max_{\Theta} P(\Theta \,|\, z_{\mathrm{data}})$. Lastly, again using the ANN as a joint probability function, the probability density of the parameters of interest is obtained from the distribution $P(z \,|\, \Theta_{\mathrm{pred}})$. This approach is illustrated below for the estimation of the acceptance and combinatorial background distributions of the $D_s^+ \to K^+\pi^-\pi^+$ decay, described in Section 3.

It is worth noting that, whilst these latent parameters should in principle comprise the set of features on which the parameters of interest depend strongly, this set need not be exhaustive. Provided that the included parameters are at least reasonably correlated with any additional parameters that are not considered, yet influence the parameters of interest, an "effective" value of these parameters can be obtained in the maximum-likelihood fit stage. As such, these latent parameters can differ from those that can be calculated directly from the dataset that the model is evaluated on, in such a way that the efficiency or background distribution can nevertheless be correctly parameterised using them.

To demonstrate the feasibility of the model-assisted approach for the parameterisation of acceptances, the same model as that described in Section 3.1 is used; however, the requirements on the reconstructed $D_s^+$ mesons and their decay products are varied in the generation of the training sample used to construct the ANN, as in step one above.
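The latent-parameter fit and subsequent evaluation steps of this procedure can be sketched with a one-dimensional toy, using an analytic stand-in for the trained ANN joint density. All names, shapes, and parameter values below are hypothetical illustrations, not quantities from the analysis; in a real application, `p_z_given_theta` would be the ANN estimate of Section 6 and the grid search would be a proper likelihood minimisation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Analytic stand-in for the trained joint density: z is one-dimensional and
# theta is a single latent "slope" parameter.
def p_z_given_theta(z, theta):
    # Exponential shape in z with a theta-dependent slope, normalised on [0, 1].
    return np.exp(-theta * z) * theta / (1.0 - np.exp(-theta))

# "Observed" data: a toy sample generated at theta_true by rejection sampling.
theta_true = 2.5
z_data = []
while len(z_data) < 5000:
    z_try = rng.uniform()
    if rng.uniform() < np.exp(-theta_true * z_try):
        z_data.append(z_try)
z_data = np.asarray(z_data)

# Step two: maximum-likelihood estimate of the latent parameter,
# theta_pred = argmax_theta sum_i ln P(z_i | theta), here by grid search.
grid = np.linspace(0.5, 5.0, 200)
logl = [np.sum(np.log(p_z_given_theta(z_data, t))) for t in grid]
theta_pred = grid[int(np.argmax(logl))]

# Step three: evaluate the density of the parameter of interest at the
# fitted latent-parameter value.
z_axis = np.linspace(0.0, 1.0, 101)
density_pred = p_z_given_theta(z_axis, theta_pred)
```

The key point the toy illustrates is that the resulting density is controlled entirely through the small set of latent parameters, which is what provides the ad hoc regularisation discussed above.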
The range of the variations for each of the five parameters of the efficiency model is given in Table 3, where the entire sample consists of 500 000 events that satisfy the selection requirements. Since the efficiency model is relatively simple, generation of the training sample takes only a few minutes, and therefore the sample can be made arbitrarily large. Since the events that do not satisfy the selection requirements are rejected during generation, the initially uniform distributions of the model parameters can become significantly non-uniform for the accepted events in the training sample. To compensate for this effect, the model parameters are sampled from exponential distributions, whose parameters are tuned to ensure that the distribution of accepted events is roughly uniform.

The functional form of the efficiency model, a 7-dimensional probability density function $p(m', \theta', \Theta)$, where $\Theta$ is the vector of parameters listed in Table 3, is obtained via the ANN density estimation procedure described in Section 6. The ANN topology and training parameters are the same as those in Section 6, with the exception that here the $L_2$ regularisation parameter is $\lambda = 0.5$. The projections of the result of the ANN training after 30 000 epochs in the square Dalitz-plot variables, as well as the correlations between the Dalitz-plot variables and the latent model parameters, are given in Appendix A.

An unbinned maximum-likelihood fit is then performed to obtain the effective model parameters for a test sample corresponding to that described in Section 3.1. The results are given in Table 3, and the corresponding density estimate is shown in Fig. 10.
Figure 10: Result of the density estimation of the simulated sample of $D_s^+ \to K^+\pi^-\pi^+$ decays using the model-assisted ANN (a) in the two square Dalitz-plot variables $m'$ and $\theta'$, and the projections of the two-dimensional distribution onto the (b) $m'$ and (c) $\theta'$ variables.

Table 3: Parameters of the $D_s^+ \to K^+\pi^-\pi^+$ efficiency model: the range of parameter variations used at the ANN training stage, the true values used in the generated test sample, and the reconstructed values extracted from the fit of the ANN model to the test sample.

Model parameter | Range | True value | Reconstructed value
Track $p_T$ cut (GeV) | (0.…, 1) | 0.4 | 0.… ± 0.…
$p$ cut (GeV) | (1, 10) | 3.0 | 5.… ± 0.…
$D_s^+$ $p_T$ cut (GeV) | (0, 5) | 2.0 | 2.… ± 0.…
$p_T$ cut (GeV) | (0.…, 3) | 1.0 | 1.… ± 0.…
$p_T$ cut (GeV) | (2.…, 6) | 3.0 | 3.… ± 0.…
Figure 11: Results of the interpolation of the combinatorial background density in the signal region using the model-assisted ANN: (a) the two-dimensional density and (b,c) its projections.
The estimation of the combinatorial background density is performed in a similar way to the acceptance parameterisation, except that the ANN training sample also includes the full range of the selection variable, $m_D$, and the test sample contains only the events in the sidebands (defined in Section 7), to reproduce a more realistic analysis scenario. As in the case of the acceptance parameterisation, the background model from Section 3.2 is used both to generate samples for the initial joint density estimation by the ANN and as the test dataset for the subsequent maximum-likelihood fit. The list of generated model parameters, with their input ranges, is given in Table 4, where the size of the training sample is 500 000 events. Here, the topology of the ANN is unchanged from that described in Section 7, and the penalty factor in the $L_2$ regularisation term is taken to be $\lambda = 2$. The projections of the ANN variables, after 30 000 training epochs, in the $m'$, $\theta'$ and $m_D$ variables, as well as the correlations of these variables with each other and with the model parameters, are shown in Appendix B.

The results of the background density estimation are shown in Fig. 11, and the true and reconstructed values of the model parameters are given in Table 4. The values of the reconstructed parameters are consistent with the true values within their uncertainties, which indicates a good quality of the 11-dimensional ANN parameterisation of the background model.

Techniques have been proposed here to efficiently parameterise the background and acceptance variations that are an essential component of multidimensional fits of hadronic decay amplitudes. Often, treatments of the acceptance variations are sub-optimal: either they do not exploit rudimentary assumptions of local smoothness and require large quantities of computationally expensive simulated data, or a sizeable systematic uncertainty results from the use of an inefficient parameterisation.
For the background description, assumptions often have to be made on the functional form of specific backgrounds, or on the validity of an extrapolation into the signal region. These assumptions

Table 4: Parameters of the $D_s^+ \to K^+\pi^-\pi^+$ background model: the range of parameter variations used at the ANN training stage, the true values used in the generated test sample, and the reconstructed values extracted from the fit of the ANN model to the test sample.

Model parameter | Range | True value | Reconstructed value
Mean $p_T(K)$ (GeV) | (0.…, 1) | 0.3 | 0.… ± 0.…
Mean $p_T(\pi)$ (GeV) | (0.…, 1) | 0.6 | 0.… ± 0.…
Mean $p_T(K^*)$ (GeV) | (0.…, 3) | 2.9 | 2.… ± 0.…
Mean $p_T(\rho)$ (GeV) | (0.…, 3) | 2.0 | 2.… ± 0.…
$K^*$ fraction | (0, 0.3) | 0.1 | 0.… ± 0.…
$\rho$ fraction | (0, 0.3) | 0.2 | 0.… ± 0.…
$p_T$ cut (GeV) | (0.…, 0.5) | 0.3 | 0.… ± 0.…
$p$ cut (GeV) | (1, 4) | 3.0 | 2.… ± 0.…

Acknowledgements
The authors would like to thank their colleagues from the LHCb collaboration (Tim Gershon, Thomas Latham, Mark Whitehead, Sneha Malde, Maurizio Martinelli, and others) for fruitful discussions. This work is supported by the Science and Technology Facilities Council (United Kingdom).

Appendices

A  Training the neural network to parametrise the efficiency model
The results of the ANN training to estimate the 7-dimensional density of the generated efficiency model, as a function of a set of effective model parameters, are shown in Fig. 12. These plots show the projections of the $m'$ and $\theta'$ variables (top row, two leftmost plots), as well as the normalised two-dimensional distributions for the correlations between $m'$ and $\theta'$ (top row, two rightmost plots) and the correlations between $m'$ or $\theta'$ and the model parameters. The plots marked "Data" show the training data distributions; those marked "Fit" show the high-statistics distributions generated from the result of the ANN training. As in the rest of the paper, the two-dimensional distributions are normalised in such a way that the average of the projected distribution equals 1.

B  Training the neural network to parametrise the background model
The results of the ANN training to estimate the 11-dimensional density of the generated combinatorial background model as a function of the model parameters are shown in Figs. 13-15. These plots show the projections of the $m'$, $\theta'$ and $m_D$ variables (Fig. 13, top row), as well as the normalised two-dimensional distributions for the correlations between each pair of the $m'$, $\theta'$ and $m_D$ variables (Fig. 13, second row and the two leftmost plots of the third row) and the correlations between $m'$, $\theta'$ or $m_D$ and the model parameters. The plots marked "Data" show the training data distributions; those marked "Fit" show the high-statistics distributions generated from the result of the ANN training. As in the rest of the paper, the two-dimensional distributions are normalised in such a way that the average of the projected distribution equals 1.

References

[1] R. H. Dalitz, On the analysis of tau-meson data and the nature of the tau-meson, Phil. Mag. (1953) 1068.
[2] J. Back et al., Laura++: A Dalitz plot fitter, Comput. Phys. Commun. (2018) 198, arXiv:1711.09854.
[3] LHCb collaboration, R. Aaij et al., Observation of $J/\psi p$ resonances consistent with pentaquark states in $\Lambda_b^0 \to J/\psi p K^-$ decays, Phys. Rev. Lett. (2015) 072001, arXiv:1507.03414.
[4] LHCb collaboration, R. Aaij et al., Observation of exotic $J/\psi\phi$ structures from amplitude analysis of $B^+ \to J/\psi\phi K^+$ decays, Phys. Rev. Lett. (2016) 022003, arXiv:1606.07895.
Figure 12: Results of the estimation of the simulated acceptance distribution, as a function of the effective model parameters, using an ANN.
Figure 13: Results of the estimation of the simulated combinatorial background density, as a function of effective model parameters, using an ANN.
Figure 14: Results of the estimation of the simulated combinatorial background density, as a function of effective model parameters, using an ANN.
Figure 15: Results of the estimation of the simulated combinatorial background density, as a function of effective model parameters, using an ANN.

[5] LHCb collaboration, R. Aaij et al., Studies of the resonance structure in $D^0 \to K^\mp\pi^\pm\pi^+\pi^-$ decays, Eur. Phys. J. C78 (2018) 443, arXiv:1712.08609.
[6] LHCb collaboration, R. Aaij et al., Dalitz plot analysis of $B^0 \to \bar{D}^0\pi^+\pi^-$ decays, Phys. Rev. D92 (2015) 032002, arXiv:1505.01710.
[7] LHCb collaboration, R. Aaij et al., Dalitz plot analysis of $B_s^0 \to \bar{D}^0 K^-\pi^+$ decays, Phys. Rev. D90 (2014) 072003, arXiv:1407.7712.
[8] A. Poluektov, Kernel density estimation of a multidimensional efficiency profile, JINST (2015), no. 02, P02011, arXiv:1411.5528.
[9] LHCb collaboration, R. Aaij et al., Study of the $D^0 p$ amplitude in $\Lambda_b^0 \to D^0 p\pi^-$ decays, JHEP (2017) 030, arXiv:1701.07873.
[10] M. Pivk and F. R. Le Diberder, sPlot: A statistical tool to unfold data distributions, Nucl. Instrum. Meth. A555 (2005) 356, arXiv:physics/0402083.
[11] Y. Xie, sFit: a method for background subtraction in maximum likelihood fit, arXiv:0905.0724.
[12] BaBar collaboration, B. Aubert et al., An amplitude analysis of the decay $B^\pm \to \pi^\pm\pi^\pm\pi^\mp$, Phys. Rev. D72 (2005) 052002, arXiv:hep-ex/0507025.
[13] Particle Data Group, C. Patrignani et al., Review of particle physics, Chin. Phys. C40 (2016) 100001.
[14] LHCb collaboration, A. A. Alves Jr. et al., The LHCb detector at the LHC, JINST (2008) S08005.
[15] LHCb collaboration, R. Aaij et al., Angular analysis of the $B^0 \to K^{*0}\mu^+\mu^-$ decay using fb$^{-1}$ of integrated luminosity, JHEP (2016) 104, arXiv:1512.04442.
[16] LHCb collaboration, R. Aaij et al., Differential branching fraction and angular analysis of the decay $B^0 \to K^+\pi^-\mu^+\mu^-$ in the $K^*_{0,2}(1430)$ region, JHEP (2016) 065, arXiv:1609.04736.
[17] LHCb collaboration, R. Aaij et al., Angular moments of the decay $\Lambda_b^0 \to \Lambda\mu^+\mu^-$, JHEP (2018) 146, arXiv:1808.00264.
[18] LHCb collaboration, R. Aaij et al., Amplitude analysis of the decay $B^0 \to K_S^0\pi^+\pi^-$ and first observation of CP asymmetry in $B^0 \to K^*(892)^-\pi^+$, Phys. Rev. Lett. (2018) 261801, arXiv:1712.09320.
[19] LHCb collaboration, R. Aaij et al., Evidence for an $\eta_c(1S)\pi^-$ resonance in $B^0 \to \eta_c(1S)K^+\pi^-$ decays, Eur. Phys. J. C78 (2018) 1019, arXiv:1809.07416.
[20] T. Oliphant, NumPy: A guide to NumPy, USA: Trelgol Publishing, 2006.
[21] E. Jones, T. Oliphant, P. Peterson et al., SciPy: Open source scientific tools for Python, 2001.
[22] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2005), no. 2, 301.
[23] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning series, University Press Group Limited, 2006.
[24] S. Roberts et al., Gaussian processes for time-series modelling, Phil. Trans. R. Soc. A (2013), no. 1984, 20110550.
[25] M. Frate et al., Modeling smooth backgrounds and generic localized signals with Gaussian processes, arXiv:1709.05681.
[26] A. Bozson, G. Cowan, and F. Spanò, Unfolding with Gaussian processes, arXiv:1811.01242.
[27] I. Murray, D. MacKay, and R. P. Adams, The Gaussian process density sampler, in Advances in Neural Information Processing Systems, pp. 9-16, 2009.
[28] GPy python package, http://gpy.readthedocs.io/en/deploy/.
[29] E. Snelson and Z. Ghahramani, Sparse Gaussian processes using pseudo-inputs, in Advances in Neural Information Processing Systems, pp. 1257-1264, 2006.
[30] A. Likas, Probability density estimation using artificial neural networks, Computer Physics Communications (2001), no. 2, 167.
[31] B. Viaud, On the potential of multivariate techniques for the determination of multidimensional efficiencies, Eur. Phys. J. Plus (2016), no. 6, 191, arXiv:1602.03758.
[32] M. Abadi et al., TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[33] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, CoRR abs/1412.6980 (2014), arXiv:1412.6980.
[34] D. P. O'Hanlon, Studies of CP-violation in charmless three-body b-hadron decays, PhD thesis, University of Warwick, UK, CERN-THESIS-2017-184. Presented 04 Oct 2017.
[35] A. Mathad, Search for CP violation in charmless three-body decays of strange-beauty baryons, PhD thesis, University of Warwick, UK, CERN-THESIS-2018-237. Presented 08 Oct 2018.
[36] S. Alonso-Monsalve and L. H. Whitehead, Image-based model parameter optimisation using Model-Assisted Generative Adversarial Networks, arXiv:1812.00879.
[37] R. Brun and F. Rademakers, ROOT: An object oriented data analysis framework, Nucl. Instrum. Meth. A389 (1997) 81.
[38] TensorFlowAnalysis package, https://gitlab.cern.ch/poluekt/TensorFlowAnalysis.