Functional Decomposition: A new method for search and limit setting

R. Edgar*, D. Amidei, C. Grud, K. Sekhon
University of Michigan, 450 Church St., Ann Arbor, MI 48109, United States
Abstract
In the analysis of High-Energy Physics data, it is frequently desired to separate resonant signals from a smooth, non-resonant background. This paper introduces a new technique, functional decomposition (FD), to accomplish this task. It is universal and readily able to describe often-problematic effects such as sculpting and trigger turn-ons. Functional decomposition models a dataset as a truncated series expansion in a complete set of orthonormal basis functions, using a process analogous to Fourier analysis. A new family of orthonormal functions is presented, which has been expressly designed to accomplish this in a succinct way. A consistent signal extraction methodology based on linear signal estimators is also detailed, as is an automated method for selecting the method's (few) hyperparameters and preventing over-fitting. The full collection of algorithms described in this paper has been implemented in an easy-to-use software package, which will also be briefly described.
Keywords: orthogonal density estimation, nonparametric density estimation, orthonormal exponentials, resonance search, background modeling
1. Motivation and Overview
A substantial body of searches for new physical phenomena is conducted under the resonance ansatz: new physics is presumed to present as a localized deviation (a resonance) from an otherwise-smooth background. The smooth background is most commonly modeled using Monte Carlo simulation, data sidebands, parametric fits to functional forms, or some combination of these. While all of these approaches are applied frequently and with success, each also has disadvantages: Monte Carlo requires careful testing and sophisticated control of systematic uncertainties, and can be computationally infeasible for very large datasets; data sidebands are usually limited in statistical precision and often capture only some of the physical phenomena contributing to the region of interest; parametric fits rely on functions that are approximate or ad-hoc, and therefore may not adequately capture all features of the smooth background.

* Corresponding author
Email addresses: [email protected] (R. Edgar), [email protected] (D. Amidei), [email protected] (C. Grud), [email protected] (K. Sekhon)
Preprint submitted to Nuclear Instruments and Methods A, May 15, 2018
Figure 1: Functional decomposition applied to a simulated data spectrum, designed to approximate the invariant mass M_γγ of the two-photon final state at 13 TeV at CERN's LHC. The simulated spectrum has roughly 5 × 10^8 events scaled down to 2 × 10^7 in order to approximate the statistically-asymptotic behavior. The figures show the decomposition plotted over the full range (a) and only in the low-mass region (b). Both figures show the same data and decomposition; only the x-axis ranges and the binning differ.

A resonance search technique that addresses these disadvantages is thus a tool with potentially widespread application. Functional decomposition is such a tool. Its power can be seen in Figure 1, which illustrates the application of FD to a particularly troublesome case: the production of a purely data-driven model for a spectrum that includes both a turn-on and two known resonant peaks. The performance on this particular example will be explored in more detail during the course of this paper, but the excellent modeling of the data over its full range is evident, including the turn-on, peak, resonances and tail.

Functional Decomposition is a form of Orthogonal Density Estimation (ODE; [1]). It operates by applying a transformation to the variable of interest and modeling the resulting dimensionless variable using a complete set of orthonormal functions. Judicious choices of the basis and transformation ensure that the smooth background has a succinct representation, using only the first few terms in the series. The remaining (i.e. higher-order) terms are then available to construct estimators for the resonant contributions. This model can be written as:

    z = T(x; θ)    (1)

    Ω(z) = Σ_{n=0}^{N−1} c_n E_n(z) + Σ_{m=1}^{N_s} s^{(m)} S^{(m)}(z)    (2)

    Ω(x) = Σ_{n=0}^{N−1} c_n (dz/dx) E_n(z) + Σ_{m=1}^{N_s} s^{(m)} S^{(m)}(x).    (3)

Here, x is the initial variable of interest and z = T(x; θ) is the corresponding dimensionless variable. The transformation T is parameterized by the vector θ (the transformation hyperparameters). The functions {E_n} are a complete set of orthonormal functions, from which the first N are retained for the background model. The parameters c_n are the coefficients of the background distribution, presumed zero if n ≥ N. Finally, S^{(m)}(z) are some number N_s of resonant contributions, each of which has a corresponding normalization s^{(m)}, and each of whose shapes is presumed to be known.

The use of a complete, orthonormal set of basis functions guarantees that any smooth function can be described by such a series expansion. But the performance is entirely determined by the particular choice of basis and transformation; succinct expansions (that is, N is small) retain maximum information for estimating the resonant contributions, while more verbose choices reduce the resonant sensitivity and often produce approximations that are not positive-definite (and therefore are not valid probability distributions).

The remainder of this paper is devoted to establishing particulars for this technique that are effective for resonance searches and measurements in high-energy physics. It is organized as follows:

• Section 2: The test spectrum;
• Section 3: The orthonormal exponentials: a basis for falling spectra;
• Section 4: The power-law transformation;
• Section 5: Decomposing the dataset;
• Section 6: Estimating signal and background parameters;
• Section 7: Optimization of hyperparameters;
• Section 8: Statistical interpretation;
• Section 9: Summary and discussion; the FD package.

Several mathematical results will be required along the way. The proofs and derivations are relegated to appendices to avoid some otherwise-lengthy digressions.
For convenience and consistency, the following notational conventions will be used throughout:

• A function is written f(z);
• its vector representation in a Hilbert space is denoted f̃;
• and the individual components of that vector are f̃_n.
• Similarly, operators are written Ô with components Ô_nm. The operator form of f(z) is written as f̂.
• Families of functions have parenthesized indices (i.e. f^{(n)}(z); f̃^{(n)}; f̃^{(n)}_i). This distinguishes them from the Hilbert-space indices.
• Einstein summation convention is used whenever possible: repeated indices with one index lowered and the other raised imply summation (i.e. c_i d^i = Σ_i c_i d_i). Explicit sums are used when required for clarity or to indicate summation limits.
• Raised and lowered indices denote the same numerical values; the position is used only to indicate implied summation.
• Multiplications written Ô f̃ are to be read as Ô_mn f̃^m (matrix multiplication).
• Angle brackets are sometimes used for inner products: ⟨u, v⟩ = ũ_n ṽ^n.

2. The test spectrum

FD will be illustrated with the aid of an example spectrum, already shown in Figure 1. This spectrum was constructed to exhibit several features that are usually the source of some difficulty: a large mass range and an event rate that spans some six orders of magnitude, a turn-on in the low-mass region, and two known resonant peaks which, for the purpose of conducting a search, must be included as part of the background.

The test spectrum was designed with particular reference to the two-photon final state in pp collisions at 13 TeV at CERN's LHC. Forming the invariant mass M_γγ, one expects a smooth, high-statistics background with one resonant peak from the Higgs boson and a second resonant peak from the Z. Though the Z does not decay to two photons, Z → ee decays are occasionally misidentified as two-photon events.
This can lead to a substantial 'fake' Z contribution when the production rate is sufficiently high, as is the case at the LHC. At low mass, the spectrum is modified by the trigger and selection thresholds.

The model for the continuum M_γγ background consists of 5 × 10^8 events generated according to the probability distribution

    P(x) = p_0 G(x) + p_1 (1 − y)^{p_2} y^{p_3 + p_4 log y + p_5 log² y}    (4)

where

    G(x) = (1 / (√(2π) p_7)) exp[ −(1/2) ((x − p_6) / p_7)² ]    (5)

    y = x / (13 TeV),    (6)

and the constant parameters are given by

    p_0 = 9.50700 × 10^…    p_1 = 4.58242 × 10^…
    p_2 = −.21268 × 10^…    p_3 = −.51309
    p_4 = 2.38849 × 10^−…   p_5 = −.38068 × 10^…
    p_6 = 3.34980 × 10^…    p_7 = 1.83062 × 10^…

The probability distribution P(x) is one of the well-known 'dijet functions', typically used in resonance searches involving the strong interaction (e.g. [2], [3], [4], [5]), with the addition of a Gaussian term to produce the turn-on. The Z contribution is modeled using 5 × 10^5 normally-distributed events with a mass of 89.… GeV; the Higgs contribution is modeled using 1 × 10^5 normally-distributed events with a mass of 125.… GeV. Relative to the true Z, the resolution is broadened and the mass is shifted downward by misidentification of electrons as photons.

Figure 2: A selected few of the orthonormal exponentials.

The test sample is scaled down by a factor of 25, for a total of 2 × 10^7 background events, 2 × 10^4 Z events, and 4 × 10^3 Higgs events. This scaling is applied to ensure that any biases or spurious signals introduced by the method are clearly visible and not obscured by statistical fluctuations.
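A qualitatively similar toy spectrum can be generated by accept-reject sampling. The parameter values in this sketch are hypothetical stand-ins (they are not the constants of Eqs. 4-6 above, several of which did not survive extraction), chosen only to produce a steeply falling continuum with a low-mass turn-on term and two Gaussian peaks:

```python
import numpy as np

rng = np.random.default_rng(42)
ROOT_S = 13000.0  # sqrt(s) in GeV

def continuum(x):
    """Dijet-style continuum plus an additive Gaussian turn-on term.

    All numerical constants here are illustrative stand-ins, not the
    paper's fitted parameters.
    """
    y = x / ROOT_S
    dijet = (1.0 - y) ** 9.0 * y ** (-4.5 - 0.3 * np.log(y))
    gauss = 8.0e6 * np.exp(-0.5 * ((x - 40.0) / 12.0) ** 2)
    return dijet + gauss

def sample(pdf, lo, hi, n):
    """Accept-reject sampling on [lo, hi] using a scanned flat envelope."""
    grid = np.linspace(lo, hi, 20001)
    fmax = pdf(grid).max() * 1.05
    out = []
    while len(out) < n:
        x = rng.uniform(lo, hi, size=2 * n)
        keep = rng.uniform(0.0, fmax, size=2 * n) < pdf(x)
        out.extend(x[keep][: n - len(out)])
    return np.array(out)

# Continuum plus two known resonances (a 'fake'-Z-like and an h-like peak).
m = np.concatenate([
    sample(continuum, 30.0, 2000.0, 2000),
    rng.normal(89.0, 3.0, size=200),    # Z-like peak from e->gamma misID
    rng.normal(125.0, 1.7, size=40),    # Higgs-like peak
])
```

The resulting array `m` plays the role of the {x_i} dataset used in the following sections.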
3. The orthonormal exponentials: a basis for falling spectra
Existing sets of orthonormal functions (orthogonal polynomials, Bessel functions, trigonometric functions, etc.) do not generally produce succinct representations of exponentially-falling spectra. To address this, a new set of orthonormal functions has been constructed from the exponential function. Some previous efforts to construct orthonormal bases from finite sets of exponential functions have been made in the field of signal processing ([6], [7], [8]). None, to the authors' knowledge, describe the infinite family of functions detailed here.

The choice of the exponential function is motivated by several considerations. Qualitatively, the tails of many spectra can often be approximated using members of the exponential family (simple exponentials or exponentials of a polynomial, for example). Furthermore, the simple exponential is an entropy-maximizing distribution. That is, in an information-theoretic sense, the exponential is the least-informative (most delocalized) distribution of all distributions sharing the same mean. This intuitively fits the resonance ansatz of smoothly-falling delocalized backgrounds and localized signals.

We begin by defining a non-orthogonal family of functions (the exponential basis) coupled with the L² inner product:

    F_n(z) = √(2n) e^{−nz}    (7)

    ⟨f, g⟩ = ∫_0^∞ dz f(z) g(z).    (8)

Here z is the transformed variable as above (Eq. 1). This set of functions is complete with respect to continuous probability distributions defined on [0, ∞) (see Appendix A.1 for proof). The orthonormal exponentials are then defined in terms of this inner product and the exponential basis functions:

    E_n(z) = Σ_{m=1}^{n} d̂_nm F_m(z)    (9)

    ⟨E_n, E_m⟩ = δ_nm.    (10)

A selected few of the orthonormal exponentials can be seen in Figure 2.

The coefficients d̂_nm can be derived numerically using any number of well-known methods (e.g. Gram-Schmidt). However, the inner product matrix ⟨F_n, F_m⟩ is ill-conditioned; this constrains the usefulness of numerical solutions to just the first few orthonormal exponentials. The authors have therefore derived an exact solution for the coefficients d̂_nm, as well as recurrence relations for the functions (see Appendix A.2). These recurrence relations take the form

    E_1(z) = √2 e^{−z}    (11)

    E_{n+1}(z) = (1/φ_{n+1}) [ (e^{−z} − β_n) E_n(z) − φ_n E_{n−1}(z) ]    (12)

    φ_n = (1/4) √(1 − (2n − 1)^{−2}),   β_n = (1/2) (1 + (4n² − 1)^{−1}).    (13)

This provides a fast and numerically stable method for evaluating the orthonormal exponentials. Moreover, evaluating E_N(z) for some z with the recurrence relations naturally produces E_n(z) for all n < N, which is quite advantageous for the present application.
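The recurrence relations are straightforward to implement. The sketch below (the grid range and tolerances are illustrative choices, and the coefficient forms φ_n and β_n are those of Eqs. 12-13 as reconstructed here) evaluates the orthonormal exponentials and checks the orthonormality condition of Eq. 10 numerically:

```python
import numpy as np

def orthonormal_exponentials(n_max, z):
    """Evaluate E_1(z)..E_{n_max}(z) via the three-term recurrence.

    Evaluating the highest order produces every lower order as a
    by-product, as noted in the text.
    """
    phi = lambda n: 0.25 * np.sqrt(1.0 - 1.0 / (2.0 * n - 1.0) ** 2)
    beta = lambda n: 0.5 * (1.0 + 1.0 / (4.0 * n ** 2 - 1.0))
    u = np.exp(-z)
    E = [np.sqrt(2.0) * u]                     # E_1(z) = sqrt(2) exp(-z)
    for n in range(1, n_max):
        prev = E[-2] if n >= 2 else np.zeros_like(u)  # phi_1 = 0 anyway
        E.append(((u - beta(n)) * E[-1] - phi(n) * prev) / phi(n + 1))
    return E

# Numerical check of orthonormality <E_n, E_m> = delta_nm (Eq. 10),
# with the inner product on [0, inf) truncated at z = 60.
z = np.linspace(0.0, 60.0, 600001)
dz = z[1] - z[0]
E = orthonormal_exponentials(5, z)

def inner(f, g):
    y = f * g
    return dz * (y.sum() - 0.5 * (y[0] + y[-1]))  # trapezoid rule

gram = np.array([[inner(Ei, Ej) for Ej in E] for Ei in E])
```

The Gram matrix comes out as the identity to quadrature accuracy, confirming that the recurrence reproduces an orthonormal family.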
4. The power-law transformation
We next specify the transform z = T(x; θ). There are few constraints on the choice of this transformation. It must (invertibly) map the range [x_0, ∞) of the variable of interest to [0, ∞) while rendering it dimensionless. It must be continuous. It is desirable that it have some flexibility that can be applied to ensuring that the resulting decompositions are succinct; on the other hand, too many free parameters become difficult to handle. We find that the power-law transformation

    z = ((x − x_0) / λ)^α    (14)

meets these requirements well. There are three hyperparameters: x_0 specifies the start of the distribution, λ is a positive scale parameter, and α is a positive, dimensionless exponent. Intuitively, the hyperparameters adjust the shape of the tail (all orthonormal exponentials approach e^{−z} as z → ∞) as well as the spacing of the different degrees of freedom across the spectrum.

Because every choice of λ and α produces a distinct (but still complete) orthonormal basis, they are, in a certain sense, arbitrary. But as stated above, careful selection of their values can greatly affect the number of terms required to model the background. Optimal selection of the hyperparameters is thus crucial to ensuring FD's efficacy. But the optimization of the hyperparameters must be performed numerically, and it is necessary to recompute the series coefficients c_n at each iteration of the optimization.

For large datasets, a from-scratch re-computation of the series coefficients can be expensive. Luckily, there is another way. Each choice of hyperparameters by design produces a distinct orthonormal basis on the same underlying Hilbert space. Thus the decomposition f̃*_n of some function f(x) with hyperparameters θ* is connected by a linear transformation to the decomposition f̃_n with hyperparameters θ. A general treatment of these transformation matrices is found in Appendix A.3.

The power-law transformation has the nice property that the transformation matrices between different choices of hyperparameters are calculable:

    f̃_m = (exp[c Ĉ + s Ŝ])_m^n f̃*_n,    (15)

where exp is the matrix exponential and the matrices Ĉ and Ŝ are the mathematical constants

    Ĉ_nm = − Σ_{i=1}^{n} Σ_{j=1}^{m} d̂_ni d̂_mj (i / (i + j)) [1 − γ − ln(i + j)]    (16)

    Ŝ_nm = − Σ_{i=1}^{n} Σ_{j=1}^{m} d̂_ni d̂_mj (i / (i + j))    (17)

(γ = 0.5772… is the Euler-Mascheroni constant). The transformation parameters are

    c = ln(α / α*)    (18)

    s = − (α c / (e^c − 1)) ln(λ / λ*).    (19)

See Appendix A.4 for a derivation of this result. Being mathematical constants, the matrices Ĉ and Ŝ need be computed only once. To apply a transformation in practice, then, requires only the computation of the action of a matrix exponential on the original decomposition f̃*_n. This is a common enough operation that fast and efficient implementations exist in most widely-available numerical linear algebra libraries.
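The transformation and the algebraic property underlying Eqs. 15-19 can be illustrated directly. The hyperparameter values below (λ = 32.9, α = 0.6 and an initial point λ* = 45, α* = 0.5) are used purely as example numbers:

```python
import numpy as np

# Power-law transformation (Eq. 14) and its inverse.
def to_z(x, x0, lam, alpha):
    return ((x - x0) / lam) ** alpha

def to_x(z, x0, lam, alpha):
    return x0 + lam * z ** (1.0 / alpha)

x = np.linspace(30.0, 2000.0, 1000)
z = to_z(x, x0=30.0, lam=32.9, alpha=0.6)

# A change of hyperparameters acts on z as z -> A * z**p with
# p = alpha / alpha_star and A = (lam_star / lam)**alpha; this is why
# the corresponding change of basis is generated by one 'scale' and
# one 'power' operator (Eqs. 15-19).
z_star = to_z(x, x0=30.0, lam=45.0, alpha=0.5)
p = 0.6 / 0.5
A = (45.0 / 32.9) ** 0.6
```

The roundtrip to_x(to_z(x)) recovers x, and z agrees with A * z_star**p to machine precision, exhibiting the scale-and-power structure of the hyperparameter change.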
5. Decomposing the dataset
Given some dataset {x_i} with M measurements, the customary approach to obtaining the parameters f̃_n is to choose them to maximize the log-likelihood of the data. This approach suffers in performance when the number of parameters is large and, if unbinned, for large datasets. Moreover, the estimate for the parameters will change depending on the model; that is, f̃ with N = 2 will generally be different from f̃ with N > 2.

If f(x) is the underlying probability distribution, the parameters can be extracted with the inner product:

    f̃_n^{true} = ⟨E_n, f⟩ = ∫_0^∞ dz f(z) E_n(z) = lim_{M→∞} (1/M) Σ_{i=1}^{M} E_n(z_i),    (20)

where z_i = z(x_i) and the last equality follows from the strong law of large numbers. It then suffices to define the empirical moments as

    f̃_n = (1/M) Σ_{i=1}^{M} E_n(z_i).    (21)

This same result can also be obtained by direct computation, treating the data as a normalized "comb" of Dirac δ-functions. Furthermore, it can be shown to agree with the parameters estimated by maximizing the unbinned log-likelihood assuming infinitely-many free parameters f̃_n. The first few parameters for the test data spectrum are shown in Table 1.

Asymptotically, the parameters f̃_n are normally distributed regardless of the underlying function f(z) (see Appendix A.5). The uncertainties on the moments can therefore be represented by Σ̂^{f̃}_nm / M, where the covariance is given by

    Σ̂^{f̃}_nm = (1/M) Σ_{i=1}^{M} E_n(z_i) E_m(z_i) − f̃_n f̃_m    (22)
            = f̃^i Î_inm − f̃_n f̃_m    (23)

and Î_inm is the triple-product tensor Î_inm = ∫_0^∞ dz E_i(z) E_n(z) E_m(z). This covariance matrix is exactly calculable from the decomposition f̃_n (see Appendix A.6). This allows fast computation of the covariance in O(N³) time, instead of the O(N²M) required for a direct computation.

Table 1: The first 32 moments of the test data spectrum. These are computed with λ = 32.90 GeV and α = 0.60.

Remarkably, the N × N covariance matrix of the first N moments is a function of only the first 2N moments. This makes the dissemination of the covariance matrix at best a convenience; one need only report the first 2N moments to exactly capture the N desired moments along with their covariance. Even reporting the first N moments alone is often enough; they capture their own covariance to a very good approximation!
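The empirical moments and their covariance (Eqs. 21-22) are inexpensive to compute. The sketch below writes out the first three orthonormal exponentials explicitly (consistent with the recurrence of Section 3) and uses a unit-mean exponential toy dataset, for which the expected moments are f̃_1 = 1/√2 and zero above:

```python
import numpy as np

rng = np.random.default_rng(0)

def E(n, z):
    """First three orthonormal exponentials, written out in u = exp(-z)."""
    u = np.exp(-z)
    if n == 1:
        return np.sqrt(2.0) * u
    if n == 2:
        return 6.0 * u ** 2 - 4.0 * u
    if n == 3:
        return (10.0 / np.sqrt(6.0)) * (6.0 * u ** 3 - 7.2 * u ** 2 + 1.8 * u)
    raise ValueError(n)

# Toy dataset on z in [0, inf): an exponential with unit mean, i.e.
# f(z) = exp(-z) = E_1(z) / sqrt(2) exactly.
z = rng.exponential(1.0, size=200_000)
M = z.size

# Empirical moments (Eq. 21) and their covariance (Eq. 22).
f = np.array([E(n, z).mean() for n in (1, 2, 3)])
Sigma = np.array([[np.mean(E(n, z) * E(m, z)) for m in (1, 2, 3)]
                  for n in (1, 2, 3)]) - np.outer(f, f)

# The uncertainty on f_n is Sigma[n, n] / M (Eq. 22 and following text).
unc = np.sqrt(np.diag(Sigma) / M)
```

For this toy distribution the extracted moments reproduce (1/√2, 0, 0) within the quoted uncertainties.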
6. Estimating signal and background parameters
The infinite set of parameters f̃ from the previous section is exactly equivalent to the original dataset. It can be regarded as being the data, transformed to a different representation. The next task is to extract the parameters c_n and s_m as per Eq. 2. Because the first N moments are all free parameters for the background estimation, sensitivity to the resonant contributions lies exclusively in the higher moments (f̃_n with n ≥ N). We use minimum-variance estimators, described in Appendix A.7, to extract the signal contributions s^{(m)}:

    ε̃^{(n)}_i = (Σ̂_l^{−1})_ij S̃^{j(n)}    (24)

    (η^{−1})_{(n)(m)} = ⟨ε^{(n)}, S^{(m)}⟩ = S̃^{i(n)} (Σ̂_l^{−1})_ij S̃^{j(m)}    (25)

    s^{(n)} = η^{(n)(k)} ⟨ε^{(k)}, f⟩.    (26)

Note that the Hilbert space indices in Eqs. 24 and 25 are taken to run over the higher moments only, that is, i, j ∈ [N, ∞). This is an important departure from the convention used in the rest of this paper.

Equation 24 defines ε̃^{(n)}, the minimum-variance estimator associated with the n'th resonant signal. These estimators, by their definition, are orthogonal to the first N moments (the background). However, they are not necessarily orthogonal to one another. To properly account for multiple signals, the overlap matrix η^{−1}_nm is introduced, whose inverse orthogonalizes the estimators with respect to the other signals (and also correctly normalizes the estimators). This procedure is optimal for multiple signals in the sense that it minimizes the uncertainty on each signal contribution individually, as well as the overall uncertainty.

Figure 3: (a) A comparison between the signals and the corresponding minimum-variance estimators for the h and Z (assuming the covariance of the test sample). (b) A comparison between the moments of the test sample (green) and the moments of the minimum-variance estimators for the Z (purple) and h (blue).

Here the inner products and the matrix multiplications are taken with respect to the higher moments only. The covariance Σ̂_l, on the other hand, is defined from the lower moments:

    Σ̂^l_jk = Σ_{i=1}^{N−1} f̃^i Î_ijk.    (27)

The estimators retain the properties of linearity and unbiasedness regardless of the choice of Σ̂. The choice, then, is motivated by convenience and optimality. Were the true distribution and its associated covariance Σ̂_T known a priori, one would choose Σ̂ = Σ̂_T and obtain an optimal set of estimators. However, given that the true distribution is not known, the next-best choice is to use the empirical covariance obtained from the data. This is a good approximation, as the covariance matrix is dominantly a function of the lower moments.

Returning to the M_γγ test spectrum, consider the estimators for the Z and h. A comparison between the signals and their corresponding estimators can be seen in Figure 3a. The moments of the estimators are shown overlaid with the decomposition of the test data in Figure 3b. The tail visible in the data moments (green curve) from n = 15 and above is the signal contribution to the data. Usually this tail would not be visible, but the scaling of the test data reduces the statistical fluctuations and makes it more obvious to the eye.

Once the signal normalizations have been extracted, the background parameters are obtained by subtracting the signal contribution from the data lower moments:

    c_n = f̃_n − s^{(m)} S̃^{(m)}_n  (n < N);   c_n = 0  (n ≥ N).    (28)

The parameters for the test dataset, both signal and background, are shown in Table 2. The signal parameters are converted to yields by multiplying by the total number of events in the dataset. The uncertainties can be computed using

    σ²_{(nm)} = M ω̃_{(n)} Σ̂^{f̃} ω̃_{(m)},    (29)

where

    ω̃^{(n)}_i = η_{(k)(n)} ε̃^{(k)}_i,    (30)

essentially the projection of the full covariance matrix onto the subspace defined by the signal estimators. The extracted signals from the test data, compared to the injected signal strengths, are shown in Table 3. For both the Z and h, the difference between injected and extracted signals is small and consistent with the statistical variation of the test dataset (±66 events for the h and ±182 events for the Z). There is a small negative correlation between the two extracted signals.

Table 2: The parameters for the FD model of the test dataset. These are computed with λ = 32.90 GeV, α = 0.60 and N = 15.

            Injected    Extracted
    N_h     4 000       … ± 66
    N_Z     20 000      20 196 ± 182
    ρ_Zh    −0.068

Table 3: Comparison between the injected and extracted signal for the test dataset. For both the Z and h, the difference is well below 1σ and consistent with the statistical variation of the test dataset. The correlation between the two extracted signals is small and negative.
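The linear algebra of Eqs. 24-26 can be exercised on a toy higher-moment space; the covariance and the two signal templates below are random stand-ins rather than the test spectrum's quantities:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy higher-moment space of dimension D, a random SPD covariance as a
# stand-in for Sigma_l, and two hypothetical signal templates.
D = 12
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)           # SPD stand-in for Sigma_l
S = rng.normal(size=(D, 2))               # columns: S^(1), S^(2)

# Eqs. 24-26: raw estimators, overlap matrix, orthogonalized weights.
eps = np.linalg.solve(Sigma, S)           # eps^(n) = Sigma^{-1} S^(n)
eta_inv = S.T @ eps                       # (eta^{-1})_{nm} = <eps^(n), S^(m)>
omega = eps @ np.linalg.inv(eta_inv)      # omega^(n) = eta_{(k)(n)} eps^(k)

# Each orthogonalized estimator responds to 'its' signal with unit weight
# and to the other signal not at all: omega^T S = identity.
resp = omega.T @ S

# Unbiased recovery of a two-signal mixture in the higher moments.
truth = np.array([0.7, 1.3])
f_high = S @ truth                        # signal-only higher moments
s_hat = omega.T @ f_high                  # recovered normalizations
```

The identity response matrix is exactly the orthogonalization-and-normalization property that the overlap matrix η provides for multiple overlapping signals.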
7. Optimization of hyperparameters
The final ingredient in FD is the selection of the hyperparameters. For the power-law transformation, there are three hyperparameters: x_0, the lower mass limit; λ, the length scale; and α, the scaling exponent. Additionally there is N, the number of moments to allocate to background modeling.

The first of these is a true free parameter and serves only to delineate the region of interest. If the series is not truncated (N = ∞), then α and λ are free parameters as well; any value will do, because all choices result in a complete basis. Unfortunately, this would leave no higher moments with which to search for resonances!

It is clearly desirable to choose the smallest possible N that allows adequate representation of the smooth background. Beyond the innate utility of producing an optimal representation of the dataset, this is also a compromise between avoiding loss of signal sensitivity on the one hand (N too large), and poor modeling with attendant biases on the other (N too small). The hyperparameters α and λ are chosen to support this end, and to produce the most succinct representation of the data.

We make the concept of 'succinct' concrete using a minimum-description-length (MDL) approach [9]. The objective is to minimize the amount of information required to fully represent the dataset using a two-part encoding scheme. The expected total information required to encode the dataset in this way is

    L = D_KL( f̃ ‖ c̃ + s^{(m)} S̃^{(m)} ) + D_KL( c̃ ‖ p̃ ).    (31)

The first term is the Kullback-Leibler (KL) divergence of the full dataset with respect to the full model, which represents the 'information cost' (in nats) to encode the full decomposition f̃ if the model parameters are known. The second term similarly represents the information cost to encode the background estimate c̃ given some prior background assumption p̃. The resonant contributions are assumed to have a constant information cost, independent of hyperparameters, and thus are neglected for the purpose of minimization.

Because the moments are normally distributed, the KL divergences can be calculated using the formula for two multivariate Gaussians:

    D_KL(ã ‖ b̃) = (1/2) [ tr(Σ̂_b^{−1} Σ̂_a) + (ã − b̃)^T Σ̂_b^{−1} (ã − b̃) − L − ln det(Σ̂_b^{−1} Σ̂_a) ],    (32)

where Σ̂_a and Σ̂_b are the uncertainty matrices associated with ã and b̃, respectively, and L is the number of degrees of freedom.

The second term of Eq. 31 is amenable to approximation. If the prior p̃ is taken to be weak (Σ̂_p is generally large with respect to Σ̂_c), the first two terms of the Kullback-Leibler divergence approach zero. The log-determinant approaches L ln(M/j), where j is the equivalent statistical strength of the prior (that is, the prior contains equivalent information to j events). The weakest reasonable prior is j = L (in order for L moments to be independent, they must be based on a distribution of at least j events). Thus the second term can be approximated

    D_KL( c̃ ‖ p̃ ) ≈ (N/2) ln( M / (N e) ).    (33)

To carry out the minimization of Eq. 31 requires some care. Multiple minima are a ubiquitous feature, as might be anticipated by the fact that all (α, λ) produce an exact representation as N → ∞. We find that a two-stage process is most effective: a grid search over a defined range followed by a gradient-descent minimization started from the best point identified in the grid search. This process typically entails evaluating L at several hundred combinations of (α, λ), at a minimum.

If f̃ is known, the signal and background contributions s^{(m)} and c̃ may be calculated with little computational effort.
However, calculating f̃ from a dataset is much more computationally intensive. Furthermore, the complexity scales linearly with the size of the dataset. The speed of this procedure can be substantially improved by making use of the transformation matrices defined in Sec. 4.

This is accomplished by decomposing the dataset at some initial choice of hyperparameters, (α_ini, λ_ini), to derive an initial decomposition f̃_ini. The appropriate transformation matrices are then applied to extract the decomposition at each point required for the search. In the test dataset, having 5 × 10^8 events, a transformation is roughly two orders of magnitude faster than a full decomposition.
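The Gaussian KL divergence of Eq. 32 is itself only a few lines of linear algebra. A minimal sketch, with generic means and covariances rather than FD's moment vectors:

```python
import numpy as np

def gaussian_kl(a_mean, a_cov, b_mean, b_cov):
    """KL divergence D(a || b) between two multivariate normals (Eq. 32)."""
    L = a_mean.size                     # number of degrees of freedom
    b_inv = np.linalg.inv(b_cov)
    d = a_mean - b_mean
    ratio = b_inv @ a_cov
    return 0.5 * (np.trace(ratio) + d @ b_inv @ d - L
                  - np.log(np.linalg.det(ratio)))
```

For identical distributions the divergence is zero, and for unit covariances it reduces to half the squared distance between the means, as expected from Eq. 32.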
Figure 4: A scan of the hyperparameters α and λ. The simulated spectrum has roughly 5 × 10^8 events scaled down to 2 × 10^7 in order to approximate the statistically-asymptotic behavior. The figure shows the difference between the cost function L at each combination (α, λ) and the cost function at the minimum. The number of background moments N is profiled; that is, at each point the N has been chosen that minimizes L at that point. The red '×' is placed at the minimum, where the hyperparameters are optimal.

The results of a scan over the test dataset are shown in Fig. 4. From an initial decomposition at λ = 45 GeV and α = 0.50, the scan identifies λ = 32.90 GeV and α = 0.60 as optimal, with N = 15. Several minima are evident. The 'valley' structure arises from the fact that the hyperparameters most strongly influence the decomposition's tail (for finitely many moments, the decomposition always goes like ≃ e^{−z} for sufficiently large z). Outside the valley, the tail is poorly described by e^{−z}, requiring larger N and consequently producing much more costly decompositions.

This procedure for selecting the hyperparameters also addresses one of the major shortcomings of conventional orthogonal density estimation: the problem of positive-definiteness. The underlying probability distribution f(z), being a probability distribution, must be everywhere non-negative. However, the Hilbert space of functions that can be represented in the orthonormal basis is more general, and includes functions that take negative values. Between the statistical uncertainty of a finite dataset and the truncation of the series, there is no guarantee that the resulting approximation of f(z) will be non-negative, even though f(z) itself must be.

It turns out that the covariance matrix corresponding to a given model is invertible if and only if that model has no zeros. Because the computation of L requires the inversion of the model's covariance matrix, it is defined only when the model has no zeros. If the model does have zeros, the computation will fail and L is assigned an infinite value. The hyperparameter selection algorithm thus naturally excludes models that are not positive-definite.
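The two-stage search described above (coarse grid scan, then local descent from the best grid point) can be sketched generically. The cost function here is a synthetic stand-in with several local minima, mimicking the multi-minimum structure of L rather than implementing the FD objective:

```python
import numpy as np

def cost(theta):
    """Synthetic multi-minimum stand-in for the MDL cost L(alpha, lambda)."""
    a, l = theta
    return (a ** 2 - 1.0) ** 2 + 0.1 * a + (l - 2.0) ** 2

# Stage 1: coarse grid search over a defined range.
grid_a = np.linspace(-2.0, 2.0, 41)
grid_l = np.linspace(0.0, 4.0, 41)
best = min(((cost((a, l)), a, l) for a in grid_a for l in grid_l))
theta = np.array([best[1], best[2]])

# Stage 2: numeric-gradient descent started from the best grid point.
h, lr = 1e-5, 0.02
for _ in range(2000):
    g = np.array([(cost(theta + h * e) - cost(theta - h * e)) / (2 * h)
                  for e in np.eye(2)])
    theta -= lr * g
```

The grid stage lands in the basin of the global minimum (here near a = −1), and the descent stage refines it; a local-only search started elsewhere could instead converge to the secondary minimum near a = +1.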
8. Statistical interpretation
When one of the signals S̃^{(m)} represents a hypothetical resonance, a precise statistical interpretation of its observed coefficient s^{(m)} is required. This section describes a convenient approximation for the probability of observing a particular value of s^{(m)} given a model (which may or may not include an actual signal contribution), describes the calculation of p-values and limits, and finally demonstrates these procedures on the test data spectrum.

The data decomposition f̃, the background parameters c̃, and the signal normalizations s^{(m)} are all described exactly by compound Poisson distributions. To a high degree of accuracy, these can be approximated by multivariate normal distributions having covariances as defined above. In most cases, this approximation is extremely good. One exception is the important case of a small signal on the mass distribution's tail. Here, a better approximation is useful in order to obtain the most accurate confidence intervals and p-values.

This can be framed more precisely. Given some model

    Ω̃_i = c̃_i + s^{(m)} S̃^{(m)}_i,    (34)

what is the probability distribution P(x | Ω̃), where x = ω̃^i_{(n)} f̃_i is the estimate for the parameter s^{(n)}? Like any probability distribution, P can be specified exactly in terms of its central moments:

    μ_i(Ω̃) = E[ (x − s^{(n)})^i ] = ∫_0^∞ dz Ω(z) (ω_{(n)}(z) − s^{(n)})^i = Ω̃ (ω̂_{(n)} − s^{(n)})^i 1̃,    (35)

where

    ω̂^{(n)}_ij = ∫_0^∞ dz ω^{(n)}(z) E_i(z) E_j(z) = ω̃^k_{(n)} Î_ijk    (36)

is the operator representation of ω^{(n)}(z). Note that 'central moments' is used as in the statistical literature, and is distinct from the meaning of 'moment' otherwise employed in this paper.
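The moment-matched approximation described next (Eqs. 37-40) can be sketched numerically. The dataset size M, the true normalization s, and the central moments μ₂ and μ₃ below are hypothetical values, chosen only to exhibit the construction:

```python
import math
import numpy as np

# Shifted continuous-Poisson approximation: match the mean, variance and
# skewness of the estimate x to the first three central moments.
M = 10_000                    # number of events (hypothetical)
s = 0.01                      # true normalization s^(n) (hypothetical)
mu2, mu3 = 4.0e-4, 8.0e-6     # per-event central moments (hypothetical)

a = M * mu2 ** 3 / mu3 ** 2
b = M * mu2 / mu3

sigma = math.sqrt(mu2 / M)    # expected standard deviation of x
x = np.linspace(s - 8 * sigma, s + 8 * sigma, 4001)
k = a + b * (x - s) + 0.5

# Work in logs to avoid overflow of a**(k - 1); normalize numerically.
log_p = (k - 1.0) * math.log(a) - a - np.array([math.lgamma(v) for v in k])
p = np.exp(log_p - log_p.max())
dx = x[1] - x[0]
p /= (p.sum() - 0.5 * (p[0] + p[-1])) * dx     # trapezoid normalization

mean = (p * x).sum() * dx
var = (p * (x - mean) ** 2).sum() * dx
```

By construction the resulting density reproduces the target mean and variance (here mean ≈ s and var ≈ μ₂/M), while retaining the Poisson-like right skew that a pure Gaussian approximation would miss.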
We approximate P as a shifted, continuous Poisson distribution whose mean, variance and skewness are matched to the first three central moments:

a = M μ₂³(Ω̃) / μ₃²(Ω̃)   (37)
b = M μ₂(Ω̃) / μ₃(Ω̃)   (38)
k(x) = a + b (x − s^(n)) + 1/2   (39)
P(x | Ω̃) = a^{k(x)−1} e^{−a} / Γ(k(x)).   (40)

This reduces to the normal distribution and to the classical discrete Poisson distribution in the appropriate limits. It provides an excellent approximation to the exact distribution in all the cases that the authors have examined.

The results of a search are customarily expressed in the form of p-values and limits. Both can be obtained from Eq. 40 with Ω̃ = Ω̃_b, that is, using the null hypothesis as model. In this section, the n'th signal is presumed to be the contribution of interest. The remaining resonant contributions are considered background. The null-hypothesis model Ω̃_b is obtained by completely excluding the n'th signal from consideration (the overlap matrix is constructed using only the remaining signals, and s^(n) is set to zero).

The p-value is defined as the probability of obtaining an estimate x′ that is as large or larger than actually observed. Using Eq. 40, this is given by

P(x′ > x) = ∫_x^∞ dx′ P(x′ | Ω̃_b) = Γ(k(x), a) / Γ(k(x)),   (41)

where Γ(y, λ) is the upper incomplete gamma function.

Limits can be conveniently calculated using a Bayesian approach. Assuming a uniform prior on positive signals, the 95% confidence-level upper limit x₉₅ is defined by

∫_0^{x₉₅} P(x | Ω̃_b + s^(n) S̃^(n)) ds^(n) = 0.95 × ∫_0^∞ P(x | Ω̃_b + s^(n) S̃^(n)) ds^(n),   (42)

where x is fixed to the normalization of the n'th signal as observed in the data, and the integral runs over the corresponding true value. This integral must be performed numerically. In terms of events, the limit can be written

N₉₅ = M × x₉₅,   (43)

where M is the number of data events. This can be converted to a cross-section as desired.

In searches, a data spectrum is typically tested against a collection of hypothetical 'new-physics' models, with each model regarded as an independent hypothesis. For the test dataset, we consider simple Gaussian shapes over a range of masses and widths. Each model has three resonant contributions: one each for the Z and h (known signals), and one representing a hypothetical unknown resonance.

The p-values and limits on the test spectrum are shown in Figure 5 as a function of the mass and width of the hypothetical resonance. Because the test spectrum was generated with only the Z and h, and no additional resonance, no substantial signal should be detected (the spurious signal should be small). That is in fact the case, with p-values around 0.5 and deviations of at most a few tenths of a standard deviation. This indicates minimal bias and spurious signal.

A second set of scans is shown in Figure 6. These differ in that an actual third resonance has been injected into the test dataset: a Gaussian with a mass of 625 GeV. The extracted signal is 70 ±
22 events.

An additional interesting feature is visible in the plots: the limits 'spike', becoming substantially worse at certain mass/width combinations. This occurs because of the Z and h contributions, which are treated as free parameters. If the new-physics signal is too similar to either, the signals become degenerate (or nearly so). There is consequently a natural loss of sensitivity to the new-physics signal, as it becomes difficult to distinguish it from the other, known, resonance.

Although it is not immediately apparent, the approach presented above correctly and naturally accounts for the systematic uncertainties due to the presence of parameters other than the signal normalization of interest. This is a result of the fact that each signal estimator is orthogonal to every element of the model other than the corresponding signal.

[Figure 5, panels (a)-(d): expected and observed 95% CL limits (events), deviation (σ), and p-value versus M_γγ (GeV) for Gaussian signal widths of 1% (detector resolution), 2.5%, 5%, and 10%.]
Figure 5: Scans over the test-data set. The test data is scaled down by a factor of 25, reducing the statistical fluctuations and making the spurious signal (more precisely, the lack thereof) evident. Also visible is the loss of sensitivity in the vicinity of the Z and h, which occurs due to the difficulty of distinguishing between two separate peaks when the peaks have similar masses and widths.

[Figure 6, panels (a)-(d): expected and observed 95% CL limits (events), deviation (σ), and p-value versus M_γγ (GeV) for Gaussian signal widths of 1% (detector resolution), 2.5%, 5%, and 10%.]
Figure 6: Scans over the test-data set, with an additional Gaussian resonance injected at a mass of 625 GeV. The fit extracts 70 ± 22 events, compared to the 66 events injected.

Consider a model consisting of the background and a single signal,

Ω̃_i = c̃_i + s S̃_i.   (44)

A single signal S̃ is considered without loss of generality. The log-likelihood of the dataset f̃ is given by

L = −(1/2) (Ω̃ − f̃)ᵀ Σ̂⁻¹ (Ω̃ − f̃) − (1/2) log det(2π Σ̂).   (45)

The profiled likelihood is obtained by fixing s and choosing the c̃_i to maximize L:

dL/dc̃_i = −Ẽ_i Σ̂⁻¹ (Ω̃ − f̃),   (46)

which is at an extremum when Ω̃_i = f̃_i (i < N). This is exactly as arranged by the orthogonal estimation procedure of Sec. 6. An equivalent result can be obtained by marginalization.

In reality, Σ̂ is a function of the c̃_i and s rather than a constant, and Eq. 45 is an approximation of the exact likelihood. In principle this causes some small differences with respect to an exact procedure. However, these corrections are generally proportional to 1/M and are negligible for all practical purposes.
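As a numerical illustration of the machinery in this section (the input numbers below are invented, not values from the paper), the tail probability Γ(k(x), a)/Γ(k(x)) of Eq. 41, with the moment-matched parameters of Eqs. 37-39, can be evaluated directly with `scipy.special.gammaincc`, SciPy's regularized upper incomplete gamma function Γ(s, x)/Γ(s):

```python
from scipy.special import gammaincc  # regularized upper incomplete gamma

def tail_probability(x, s_n, mu2, mu3, M):
    """Evaluate Gamma(k(x), a) / Gamma(k(x)) from Eq. 41 using the
    moment-matched parameters of Eqs. 37-39. mu2 and mu3 are the second
    and third central moments of Eq. 35, and M is the number of data
    events; all values used below are purely illustrative."""
    a = M * mu2**3 / mu3**2        # Eq. 37
    b = M * mu2 / mu3              # Eq. 38
    k = a + b * (x - s_n) + 0.5    # Eq. 39
    return gammaincc(k, a)         # Gamma(k, a) / Gamma(k)

# When the observation sits exactly at the model expectation (x = s_n),
# k = a + 0.5 and the tail probability is close to 1/2, consistent with
# the p-values around 0.5 seen in the null scans above.
p = tail_probability(x=0.0, s_n=0.0, mu2=1e-4, mu3=1e-6, M=10_000)
assert 0.45 < p < 0.55
```

Note that `gammaincc(k, a)` already includes the 1/Γ(k) normalization, so no separate call to the complete gamma function is needed.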
9. Summary and discussion; the FD package
Functional decomposition provides a complete and self-consistent approach to the problem of detecting a narrow, resonant structure superimposed on a smooth background. By employing a carefully-constructed series of orthonormal functions, it is able to successfully model spectra with sculpting or turn-on effects, and it generalizes to arbitrarily large datasets. It addresses numerous shortcomings of traditional Monte Carlo and ad-hoc function-based methods. The mechanism for choosing the series' truncation point strikes a natural balance between sensitivity and flexibility.

The orthonormal exponentials also have application as a means to parameterize falling spectra. The same algorithm used to create a background for a resonance search also optimally parameterizes the data spectrum using only a handful of coefficients. This provides a natural way to encapsulate the spectrum's shape without resorting to ad-hoc functions or to reproducing the raw data.

A user-friendly software package that completely implements all of the described techniques is available at https://github.com/ryan-c-edgar/functional-decomposition . It is written in Python using Numpy [10], Scipy [11], Matplotlib [12] and Numexpr [13]. The software can read ROOT ntuples as well as CSV files. All variables from the input files are available for use; a text-based configuration file specifies which variables to decompose and can optionally specify cuts on any other variables that are available. The configuration file also allows parametric signal shapes to be freely defined, and can make use of any Python builtins or Numpy/Scipy functions to this end. It also contains definitions of scan ranges and output plots.

The implementation is highly optimized for speed and memory usage, and consequently is able to perform fast, unbinned statistical analysis of very large datasets on modest hardware. Readers are encouraged to download the code and give it a try!

Acknowledgements
This work is supported by the U.S. Department of Energy, Office of Science, under Grant DE-SC0007859.
References

[1] S. Efromovich, Orthogonal series density estimation, Wiley Interdisciplinary Reviews: Computational Statistics 2 (4) (2010) 467-476. doi:10.1002/wics.97.
[2] CDF Collaboration, Search for new particles decaying to dijets at CDF, Phys. Rev. D 55 (1997) R5263-R5268. doi:10.1103/PhysRevD.55.R5263.
[3] CDF Collaboration, Search for new particles decaying into dijets in proton-antiproton collisions at √s = 1.96 TeV, Phys. Rev. D 79 (2009) 112002. arXiv:0812.4036, doi:10.1103/PhysRevD.79.112002.
[4] CMS Collaboration, Searches for dijet resonances in pp collisions at √s = 13 TeV using data collected in 2016, Tech. Rep. CMS-PAS-EXO-16-056, CERN, Geneva, http://cds.cern.ch/record/2256873 (2017).
[5] ATLAS Collaboration, Search for new phenomena in dijet events using 37 fb⁻¹ of pp collision data collected at √s = 13 TeV with the ATLAS detector, Phys. Rev. D 96 (5) (2017) 052004. arXiv:1703.09127, doi:10.1103/PhysRevD.96.052004.
[6] W. Kautz, Transient synthesis in the time domain, Transactions of the IRE Professional Group on Circuit Theory CT-1 (3) (1954) 29-39. doi:10.1109/tct.1954.1083588.
[7] D. C. Ross, Orthonormal exponentials, IEEE Transactions on Communication and Electronics 83 (71) (1964) 173-176. doi:10.1109/tcome.1964.6539335.
[8] D. C. Lai, Signal processing with orthonormalized exponentials, Mathematics and Computers in Simulation 27 (5-6) (1985) 409-420. doi:10.1016/0378-4754(85)90060-6.
[9] P. D. Grünwald, The Minimum Description Length Principle (Adaptive Computation and Machine Learning series), The MIT Press, 2007.
[10] T. E. Oliphant, Guide to NumPy, 2nd Edition, CreateSpace Independent Publishing Platform, USA, 2015.
[11] E. Jones, T. Oliphant, P. Peterson, et al., SciPy: Open source scientific tools for Python, [Online; accessed 10 May 2018] (2001-).
[12] J. D. Hunter, Matplotlib: A 2D graphics environment, Computing in Science & Engineering 9 (3) (2007) 90-95. doi:10.1109/mcse.2007.55.
[13] D. Cooke, et al., numexpr: Fast numerical array expression evaluator for Python, NumPy, pandas, bcolz, and more, https://github.com/pydata/numexpr [Online; accessed 10 May 2018] (2009-).

Appendix A. Proofs and Derivations
Appendix A.1. Completeness of the exponentials
This section demonstrates that the set of exponentials as defined in Eq. 7 is complete with respect to the set of normalizable probability distributions on [0, ∞). This is accomplished by showing completeness for a more general category of functions.

Suppose that f(z) is a real-valued function defined on [0, ∞) and furthermore that

lim_{z→∞} f(z) = 0.   (A.1)

Consider the transformation z = −ln y. This bijectively maps the exponentials F_n = √2 e^{−nz} on [0, ∞) to the polynomials F*_n on (0, 1],

F*_n(y) = √2 y^n,   (A.2)

and maps the inner product as

⟨f*, g*⟩ = ∫_0^1 (dy/y) f*(y) g*(y).   (A.3)

Then by the completeness of the polynomials, the transformed function f*(y) = f(−ln y) can be represented

f*(y) = Σ_{n=0}^∞ a*_n y^n.   (A.4)

However, lim_{y→0} f*(y) = 0 and so the constant term a*_0 must be zero. It then follows from Eq. A.2 and the definition of F_n that

f(z) = Σ_{n=1}^∞ a_n F_n(z),   (A.5)

where a_n = a*_n/√
2.

Appendix A.2. Coefficients of the orthonormal exponentials and recurrence relations
This section constructs an explicit solution for the coefficients of the orthonormal exponentials in terms of the non-orthonormal exponentials.

Consider the functions

Λ_n(t) = ⟨E_n, √2 e^{−tz}⟩ = ∫_0^∞ dz E_n(z) √2 e^{−tz}   (A.6)

with respect to a complex argument t having Re(t) > 0. When t is a positive integer, Λ_n(t) = ⟨E_n, F_t⟩. Using Eq. 9, Λ_n may be written

Λ_n(t) = 2 Σ_{i=1}^n d̂_ni/(t + i).   (A.7)

This is a rational function of t, and may be written as the ratio of two polynomials in t. The denominator is at most of degree n, and the numerator at most of degree n − 1. By construction, Λ_n(t) is zero for t ∈ {1, . . . , n − 1}. These n − 1 zeros completely determine the numerator, up to a multiplicative constant f(n):

numerator(t) = f(n) Π_{i=1}^{n−1} (t − i).   (A.8)

It follows that d̂_ni ≠ 0 for all i ≤ n (otherwise the degree of the numerator would be less than n − 1). Consequently, the denominator is the product of the n denominators in Eq. A.7, and Λ_n may be written

Λ_n(t) = f(n)/(t + n) · Π_{i=1}^{n−1} (t − i)/(t + i).   (A.9)

Next, note that Λ_n can be uniquely analytically continued to the full complex plane, and consider the equality between Eq. A.7 and Eq. A.9:

2 Σ_{i=1}^n d̂_ni/(t + i) = f(n)/(t + n) · Π_{i=1}^{n−1} (t − i)/(t + i).   (A.10)

The values of d̂_nm can be extracted from the residues of the n simple poles at t = −1, . . . , −n:

d̂_nm = (1/2) lim_{t→−m} [ (t + m) f(n)/(t + n) Π_{i=1}^{n−1} (t − i)/(t + i) ]   (A.11)
d̂_nm = f(n) (−1)^{n+m} · m/(n + m) · Π_{i=1}^{m−1} (m + i)/(m − i) · Π_{i=m+1}^{n} (i + m)/(i − m).   (A.12)

Finally, we fix f(n) from the requirement that ⟨E_n, E_n⟩ be unity:

⟨E_n, E_n⟩ = 2 ∫_0^∞ dz Σ_{i=1}^n d̂_ni Σ_{j=1}^n d̂_nj e^{−(i+j)z} = 2 Σ_{i=1}^n Σ_{j=1}^n d̂_ni d̂_nj/(i + j) = Σ_{i=1}^n d̂_ni Λ_n(i).   (A.13)

The sum reduces to a single term because Λ_n(i) is nonzero for integer i only if i ≥ n, so ⟨E_n, E_n⟩ = d̂_nn Λ_n(n) = f(n)²/4n. From this, f(n) = 2√n and

Λ_n(t) = 2√n/(t + n) · Π_{i=1}^{n−1} (t − i)/(t + i)   (A.14)
d̂_nm = 2√n (−1)^{n+m} · m/(n + m) · Π_{i=1}^{m−1} (m + i)/(m − i) · Π_{i=m+1}^{n} (i + m)/(i − m).   (A.15)

The first few of these coefficients are tabulated in Table A.4.

Table A.4: The coefficients of the first few orthonormal exponentials. The n'th orthonormal exponential is written E_n(z) = √2 Σ_{i=1}^n d̂_ni e^{−iz}.

n = 1:   1
n = 2:   −2√2,  3√2
n = 3:   3√3,  −12√3,  10√3
n = 4:   −8,  60,  −120,  70

Two recurrence relations derived from this result are also useful. The first arises from considering the ratio d̂_{n(m+1)}/d̂_{nm}. This results in a recurrence relation on the coefficients themselves:

d̂_{n1} = (−1)^{n+1} n√n,   d̂_{n(m+1)} = (m − n)(m + n)/(m(m + 1)) · d̂_{nm}.   (A.16)

This form can be used to conveniently generate the coefficients of the n'th orthonormal exponential with minimal computational effort. The second is a three-term recurrence relation on the orthonormalized exponentials themselves. That such a recurrence relation exists is implied by the isomorphism between the orthonormal exponentials and the polynomials (Appendix A.1).
It is given by:

E_1(z) = √2 e^{−z}
E_{n+1}(z) = (1/φ_{n+1}) [ (e^{−z} − a_n) E_n(z) − φ_n E_{n−1}(z) ]
a_n = 1/2 + 1/(2(4n² − 1)),   φ_n = (1/4) √(1 − (2n − 1)^{−2}),   (A.17)

with E_0(z) ≡ 0 (note that φ_1 = 0, so the E_0 term never contributes). That the coefficients of Eq. A.16 satisfy this relation can be shown using only simple (though tedious) algebra. Equation A.17 is generally the fastest and most numerically stable method to evaluate the orthonormal exponentials.

Appendix A.3. The general hyperparameter transformation matrix
This section demonstrates a general result: under certain conditions, the transformation matrix between two different sets of hyperparameters θ and θ* can be expressed as a matrix exponential. This can be seen more generally by noting that the hyperparameter transformations form a Lie algebra. This section is included nonetheless, first for completeness, but also so as to have the result expressed in the form most convenient for the needs of this paper.

Suppose some function f(x) has a known decomposition in a basis defined by the transformed variable z* = T(x, θ*), and the decomposition is desired with respect to a different choice of hyperparameters, z = T(x, θ). In the starred basis, the known decomposition is expressed

f(x) = f̃*_n E_n(z*).   (A.18)

The decomposition in z is related by a linear transformation:

f̃_n = M̂_nm f̃*_m   (A.19)
M̂_nm = ∫_0^∞ dz E_n(z) E_m(z*).   (A.20)

Now suppose that the hyperparameters are a function of some variable β, that is, θ = Θ(β) with Θ(0) = θ*. If Θ is differentiable with respect to β, then the transformation matrix is also differentiable with respect to β. Its derivative is

dM̂_nm/dβ = ∫_0^∞ dz E′_n(z) E_m(z*) (dz/dβ)   (A.21)
= M̂_im ∫_0^∞ dz E′_n(z) E_i(z) (dz/dβ).   (A.22)

To obtain the second equality, the term E_m(z*) has been transformed into a function of z using M̂. This results in a matrix differential equation, and if dz/dβ is independent of β, it has the solution

M̂ = exp[T̂],   T̂_nm = ∫_0^∞ dz E′_n(z) E_m(z) (dz/dβ),   (A.23)

where exp is the matrix exponential.

Appendix A.4. Hyperparameter transformation matrix for the power-law transformation
The power-law transformation is written

z = ((x − x_0)/λ)^α.   (A.24)

Suppose that α = α(β) and λ = λ(β). Then

dz/dβ = (1/α)(dα/dβ) z ln z − (α/λ)(dλ/dβ) z.   (A.25)

For this to be independent of β, as required for Eq. A.23, both terms must individually be so:

(1/α)(dα/dβ) = c   (A.26)
−(α/λ)(dλ/dβ) = s.   (A.27)

From Eq. A.26, it follows that α = α* e^{βc}. Then substituting this into Eq. A.27,

d(ln λ)/dβ = −(s/α*) e^{−βc}   (A.28)
ln λ + C = (s/(α* c)) e^{−βc}   (A.29)
ln(λ/λ*) = (s/(α* c)) (e^{−βc} − 1),   (A.30)

where the constant C has been fixed by the requirement that λ(0) = λ*. At β = 1, these yield expressions for the constants s and c:

c = ln(α/α*)   (A.31)
s = −αc/(e^c − 1) · ln(λ/λ*).   (A.32)

From Eq. A.23, the infinitesimal transformation matrix can be expressed as

T̂_nm = ∫_0^∞ dz E′_n(z) E_m(z) (cz ln z + sz).   (A.33)

It is easiest to evaluate the integrals in the exponential basis and then transform to the orthonormal basis. Using the series coefficients d̂_nm derived in Appendix A.2, the argument of the matrix exponential can be written

∫_0^∞ dz E′_n(z) E_m(z) (cz ln z + sz) = −2 Σ_{i=1}^n Σ_{j=1}^m d̂_ni d̂_mj i ∫_0^∞ dz e^{−(i+j)z} (cz ln z + sz).   (A.34)

This can be evaluated numerically with the aid of the integrals

∫_0^∞ dz e^{−nz} z = 1/n²   (A.35)
∫_0^∞ dz e^{−nz} z ln z = (1 − γ − ln n)/n²,   (A.36)

where γ is the Euler-Mascheroni constant. This finally gives an expression for the transformation matrix,

M̂ = exp[c Ĉ + s Ŝ]
Ĉ_nm = −2 Σ_{i=1}^n Σ_{j=1}^m d̂_ni d̂_mj i/(i + j)² · [1 − γ − ln(i + j)]
Ŝ_nm = −2 Σ_{i=1}^n Σ_{j=1}^m d̂_ni d̂_mj i/(i + j)²,   (A.37)

where s and c are set according to the desired start and end hyperparameter values as per Eqs. A.31 and A.32.

Appendix A.5. Finiteness of the mean and variance of the moments
We here argue that the empirical moments f̃_n corresponding to some function f(x) must be normally distributed in the high-statistics limit. That is, the conditions of the central limit theorem always apply, regardless of the underlying function.

Suppose that f(x) is a continuous probability distribution that is everywhere finite on [x_0, ∞). Then the moments and (co-)variance may be expressed

f̃_n = ∫_0^∞ dz f(z) E_n(z)   (A.38)
Σ̂_nm = ∫_0^∞ dz f(z) E_n(z) E_m(z) − f̃_n f̃_m.   (A.39)

Because the orthonormal exponentials are linear combinations of terms like exp(−kz), both of these may be expressed as linear combinations of integrals of the form

M_k = ∫_0^∞ dz f(z) e^{−kz}.   (A.40)

Comparing the integrands for M_k and M_j, if j < k then

|f(z)| e^{−kz} < |f(z)| e^{−jz}.   (A.41)

The convergence of the longest length-scale integral (i.e., f̃_1) then ensures the convergence of all the shorter length-scale integrals. A necessary and sufficient condition for the moments (and covariance) to be finite is then

∫_0^∞ dz f(z) e^{−z} ∈ ℝ.   (A.42)

The previously-derived condition for completeness, lim_{z→∞} f(z) = 0, is more restrictive. Thus the moments, their variance, and the covariance between any two moments must all be finite.

Appendix A.6. Calculating the covariance matrix
We here record a surprising (and convenient) result: the empirical covariance matrix need not be evaluated directly, because the N × N covariance matrix can be calculated from the first 2N moments. This is most readily seen from the continuous analogue of Eq. 22:

Σ̂_nm = ∫_0^∞ dz f(z) E_n(z) E_m(z) − f̃_n f̃_m.   (A.43)

Substituting f(z) with its expansion,

Σ̂_nm = ∫_0^∞ dz f̃_i E_i(z) E_n(z) E_m(z) − f̃_n f̃_m.   (A.44)

Note that the triple integral is nonzero only if |n − m| ≤ i ≤ n + m. This is because E_n E_m is contained in the subspace spanned by {E_1, . . . , E_{n+m}}, to which E_i is by definition orthogonal if i > n + m. The rest of the condition follows by permutation of the indices. The covariance matrix can therefore be represented

Σ̂_nm = Σ_{i=|n−m|}^{n+m} f̃_i Î_inm − f̃_n f̃_m,   Î_ijk = ∫_0^∞ dz E_i(z) E_j(z) E_k(z).   (A.45)

From the range of the sum, it can be seen that if n, m < N, then Σ̂_nm is computable using at most the first 2N terms of the series expansion.

The triple integral is expressible in terms of the coefficients of the orthonormal exponentials as

Î_ijk = 2√2 Σ_{a=1}^i Σ_{b=1}^j Σ_{c=1}^k d̂_ia d̂_jb d̂_kc / (a + b + c).   (A.46)

There appears to be no simpler closed-form solution for Î, but as it is a mathematical constant it need be calculated only once.

Appendix A.6.1. Covariance matrix from the recursion relations
The calculation of the covariance matrix via Eq. A.45 is straightforward and useful, but requires O(N³) operations to calculate the N × N covariance matrix from 2N moments. An O(N²) algorithm also exists, which is especially useful for large covariance matrices computed from many moments.

Consider the first term in Eq. A.43:

f̂_nm = ∫_0^∞ dz f(z) E_n(z) E_m(z).   (A.47)

Note that this is written f̂ to reflect the fact that it is a Hilbert-space operator corresponding to multiplication by f(z). This is distinct from the vector form f̃.

Apply the recursion relations from Eq. A.17 to E_m(z). The result is

f̂_{n(m+1)} = ∫_0^∞ dz f(z) E_n(z) (1/φ_{m+1}) [ (e^{−z} − ê_mm) E_m(z) − φ_m E_{m−1}(z) ]
= (1/φ_{m+1}) [ ∫_0^∞ dz f(z) e^{−z} E_n(z) E_m(z) − ê_mm f̂_nm − φ_m f̂_{n(m−1)} ]
= (1/φ_{m+1}) [ ê_ni f̂_im − ê_mm f̂_nm − φ_m f̂_{n(m−1)} ],   (A.48)

where

ê_nm = ∫_0^∞ dz e^{−z} E_n(z) E_m(z)   (A.49)

is the operator representation of e^{−z}. This matrix is tridiagonal and constant. Its elements can be obtained from Eq. A.17, and take the values

ê_nn = 1/2 + 1/(2(4n² − 1)),   ê_{n(n+1)} = ê_{(n+1)n} = φ_{n+1}.   (A.50)

The covariance matrix can therefore be obtained as

Σ̂_nm = f̂_nm − f̃_n f̃_m,   (A.51)

where f̂ is calculated either as f̂_nm = f̃_i Î_inm or by using the recursion relations above.

Appendix A.7. Optimal signal estimators
Here we construct optimal estimators for a distribution's resonant contributions. We use a construction that generalizes the concept of the Best Linear Unbiased Estimator (BLUE) to the problem of simultaneously estimating several parameters.

Suppose that some function F(z) is to be modeled as a linear combination of N functions f^(1) . . . f^(N):

F(z) = c^(m) f^(m)(z).   (A.52)

We wish to construct a set of N functions {ω̃^(n)} such that

⟨ω^(n), f^(m)⟩ = δ_nm,   (A.53)

that is, if ω̃^(n) is applied to F(z), its expected value is c^(n). These functions have covariance σ^(nm), given by

σ^(nm) = ∫_0^∞ dz F(z) ω^(n)(z) ω^(m)(z) − ∫_0^∞ dz F(z) ω^(n)(z) · ∫_0^∞ dz F(z) ω^(m)(z)   (A.54)
= ω̃_i^(n) Σ̂_ij ω̃_j^(m),   (A.55)

where Σ̂ is the covariance matrix associated with F(z).

We call the set {ω̃^(n)} optimal if the entropy of σ is a minimum with respect to the set of all possible linear, unbiased estimators. That is, we minimize H = ln det(2πe σ). This can be pictured as minimizing the volume of the N-dimensional ellipsoid described by σ.

This can be accomplished by introducing N² Lagrange multipliers η^(nm) to produce a new objective function,

L = ln det(2πe σ) − η^(nm) ( ω̃_i^(n) f̃_i^(m) − δ_nm ),   (A.56)

where the constraints enforce linearity and unbiasedness. Without loss of generality, take σ to be diagonal. Then the objective function and its derivatives can be written

L = Σ_{n=1}^N ln(2πe σ^(nn)) − η^(nm) ( ω̃_i^(n) f̃_i^(m) − δ_nm )   (A.57)
dL/dω̃_i^(n) = (2/σ^(nn)) Σ̂_ij ω̃_j^(n) − η^(nm) f̃_i^(m)   (A.58)
dL/dη^(nm) = ω̃_i^(n) f̃_i^(m) − δ_nm.   (A.59)

Setting Eq. A.58 to zero and absorbing factors of σ^(nn) into η^(nm),

ω̃_i^(n) = η^(nm) Σ̂⁻¹_ij f̃_j^(m).   (A.60)

Taking the dot product with f̃^(l) and applying the constraint of Eq. A.59 yields the inverse of η:

η⁻¹^(nm) = f̃_i^(n) Σ̂⁻¹_ij f̃_j^(m).   (A.61)

Note that here, raising an object to the power minus-one indicates the matrix inverse. We call ε̃^(n) = Σ̂⁻¹_ij f̃_j^(n) the minimum-variance estimator for f^(n), and σ^(nm) the orthogonalization matrix for the functions f^(1), . . . , f^(N).
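In matrix form, the construction above is a generalized-least-squares projection: with the vectors f̃^(n) as the rows of a matrix F and the covariance Σ̂, the estimator weights are W = (F Σ̂⁻¹ Fᵀ)⁻¹ F Σ̂⁻¹, which satisfies the unbiasedness constraint W Fᵀ = I of Eq. A.53. A small sketch with invented toy inputs (not the FD package's code):

```python
import numpy as np

# Toy inputs (hypothetical): two signal templates expressed as moment
# vectors in a 4-dimensional truncated basis, plus a non-trivial
# covariance matrix for the data moments.
F = np.array([[1.0, 0.5, 0.2, 0.1],
              [0.3, 1.0, 0.4, 0.2]])      # rows: the vectors f~^(n)
Sigma = np.diag([1.0, 2.0, 1.5, 0.5])
Sigma[0, 1] = Sigma[1, 0] = 0.3           # some correlation

# Eq. A.61: eta^{-1} = F Sigma^{-1} F^T, and Eq. A.60 gives the
# estimator weights w~^(n) = eta F Sigma^{-1} (the rows of W).
Sinv = np.linalg.inv(Sigma)
eta_inv = F @ Sinv @ F.T
W = np.linalg.solve(eta_inv, F @ Sinv)

# Unbiasedness (Eq. A.53): each estimator applied to each template
# returns the Kronecker delta.
assert np.allclose(W @ F.T, np.eye(2))

# Applying W to moments built from known normalizations recovers them.
c_true = np.array([2.0, -1.0])
data_moments = F.T @ c_true
assert np.allclose(W @ data_moments, c_true)
```

Because W Fᵀ = I by construction, each row of W is orthogonal to every template except its own, which is the property the text invokes to argue that nuisance contributions decouple from the signal estimate.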