Density estimation on small datasets
Wei-Chia Chen, Ammar Tareen, and Justin B. Kinney∗
Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
∗ Email correspondence to [email protected]
How might a smooth probability distribution be estimated, with accurately quantified uncertainty, from a limited amount of sampled data? Here we describe a field-theoretic approach that addresses this problem remarkably well in one dimension, providing an exact nonparametric Bayesian posterior without relying on tunable parameters or large-data approximations. Strong non-Gaussian constraints, which require a non-perturbative treatment, are found to play a major role in reducing distribution uncertainty. A software implementation of this method is provided.
The need to estimate smooth probability distributions from a limited number of samples is ubiquitous in data analysis [1]. This "density estimation" problem also presents a fundamental conceptual challenge in statistical learning, important aspects of which remain unresolved. These outstanding problems are especially acute in the context of small datasets, where standard large-dataset approximations do not apply. Here we investigate the potential for Bayesian field theory, an area of statistical learning based on field-theoretic methods in physics [2-5], to estimate probability densities in this small data regime.

Density estimation requires answering two distinct questions. First, what is the best estimate for the underlying probability distribution? Second, what do other plausible distributions look like? Ideally, one would like to answer these questions by first considering all possible distributions (regardless of mathematical form), then identifying those that fit the data while satisfying a transparent notion of smoothness. Such an approach should not require one to manually identify values for critical parameters, specify boundary conditions, or make invalid mathematical approximations in the small data regime. However, the most common density estimation approaches, including kernel density estimation (KDE) [1] and Dirichlet process mixture modeling (DPMM) [6, 7], do not satisfy these requirements.

Previous work has described a Bayesian field theory approach, called Density Estimation using Field Theory (DEFT) [8, 9], for addressing the density estimation problem in low dimensions. DEFT satisfies all of the above criteria except for the last one: in [8, 9], an appeal to the large data regime was used to justify a Laplace approximation (i.e., a saddle-point approximation) of the Bayesian posterior. This approximation facilitated the sampling of an ensemble of plausible densities, as well as the identification of an optimal smoothness lengthscale. Independent but closely related work [10] has also relied heavily on this approximation.

Here we investigate the performance of DEFT in the small data regime and find that the Laplace approximation advocated in prior work can be catastrophic. This is because non-Gaussian features of the DEFT posterior are critical for suppressing "wisps" – large positive fluctuations that otherwise occur in posterior-sampled densities. We further find that these non-Gaussian effects cannot be addressed perturbatively using Feynman diagrams, as has been suggested in other Bayesian field theory contexts [4, 5]. These results are not specific to DEFT, but rather reflect the fundamentally nonperturbative nature of the density estimation problem.

Happily, we find that importance resampling [7] can rapidly and effectively correct for the Laplace approximation. The resulting DEFT algorithm, which we have made available in robust and easy-to-use software, thus appears to satisfy all of the above requirements for an ideal density estimation method in one dimension. Tests of DEFT on simulated data show favorable performance relative to KDE and DPMM. We also illustrate the utility of DEFT on real data from the Large Hadron Collider [11] and the World Health Organization (WHO) [12].

We first recap the DEFT approach to density estimation [8, 9].
Consider $N$ data points $\{x_i\}_{i=1}^N$ drawn from a smooth one-dimensional probability distribution $Q_{\rm true}(x)$ that is confined to an $x$-interval of length $L$. From these data we wish to obtain a best estimate $Q^*$ of $Q_{\rm true}$, as well as an ensemble of plausible distributions with which to quantify the uncertainty in this estimate. DEFT reparametrizes each candidate distribution $Q$ in terms of a field $\phi$ via

$$ Q(x) = \frac{e^{-\phi(x)}}{\int dx'\, e^{-\phi(x')}}. \qquad (1) $$

After adopting a Bayesian prior that constrains the $\alpha$-order $x$-derivative of $\phi$ (denoted by $\partial^\alpha \phi$ in what follows), and accounting for the likelihood of the data given $\phi$, one obtains a posterior distribution on $\phi$. We represent this posterior as $p(Q \mid {\rm data}, \ell) \propto \exp(-S_\ell[\phi])$, where

$$ S_\ell[\phi] = \int \frac{dx}{L} \left[ \frac{\ell^{2\alpha}}{2} (\partial^\alpha \phi)^2 + N L R\, \phi + N e^{-\phi} \right] \qquad (2) $$

is the "posterior action" described in [9]. In Eq. 2, $\ell$ is a smoothness lengthscale that has yet to be determined, and $R(x) = N^{-1} \sum_{i=1}^N \delta(x - x_i)$ is a histogram (of bin width zero) that summarizes the data. See Supplemental Information section SI.1 for details. The behavior of $Q$ under this action $S_\ell[\phi]$ is the primary focus of the present paper.

FIG. 1. (Color) Density estimation using field theory. (a) A Gaussian mixture distribution $Q_{\rm true}$, the sum of two normal densities, confined to a finite $x$-interval. (b) A histogram $R$ of $N = 30$ data points sampled from $Q_{\rm true}$ and discretized to $G = 100$ grid points. (c) The corresponding estimate $Q^*$ computed by DEFT. (d) 100 distributions sampled from $p_{\rm Lap}(Q \mid {\rm data})$, which accounts for uncertainty in $\ell$ as well as in $Q$. (e) 100 distributions generated using importance resampling of the Laplace ensemble. The differential entropies of the illustrated distributions are provided (panel annotations: $Q_{\rm true}$: 2.88 bits; $R$: 2.21 bits; $Q^*$: 2.96 bits; $Q \sim p_{\rm Lap}(Q \mid {\rm data})$: 1.28 ± … bits; $Q \sim p(Q \mid {\rm data})$: 2.91 ± … bits).

$S_\ell[\phi]$ is minimized at the maximum a posteriori (MAP) field $\phi_\ell$. The MAP field $\phi_\ell$ is unique even in the absence of boundary conditions; see SI.2 for details. Although $\phi_\ell$ cannot be solved for analytically, it is readily computed as the solution to a convex optimization problem after discretization of the $x$-domain at $G$ equally spaced grid points. In this discrete representation, $R$ becomes a histogram with bin width $h = L/G$. As long as $h \ll \ell$, the choice of $G$ will not greatly affect $\phi_\ell$. The optimal lengthscale $\ell^*$ is identified by maximizing the Bayesian evidence, $p({\rm data} \mid \ell)$; see SI.3 for details. $Q^* = Q_{\ell^*}$ is then used as our best density estimate. Fig. 1(a-c) illustrates this procedure on simulated data.

To characterize the uncertainty in the DEFT estimate $Q^*$, we sample the Bayesian posterior $p(Q \mid {\rm data}) = \int d\ell\, p(\ell \mid {\rm data})\, p(Q \mid {\rm data}, \ell)$. Each sample is generated by first drawing $\ell$ from $p(\ell \mid {\rm data})$, then drawing $Q$ from $p(Q \mid {\rm data}, \ell)$. Previous work [8] has suggested that this sampling task be performed using the Laplace approximation, i.e., approximating $p(Q \mid {\rm data}, \ell)$ with a Gaussian that has the same mean and Hessian. The corresponding action, $S^{\rm Lap}_\ell[\phi]$, is thus quadratic in $\delta\phi = \phi - \phi_\ell$. This Laplace approximation has the advantage that posterior samples $Q$ can be rapidly and independently generated [8].

Fig. 1d shows multiple $Q$s sampled from the Laplace posterior $p_{\rm Lap}(Q \mid {\rm data}) = \int d\ell\, p(\ell \mid {\rm data})\, p_{\rm Lap}(Q \mid {\rm data}, \ell)$. Clearly something is very wrong. Although many of these $Q$s appear reasonable, others exhibit wisps that have substantial probability mass far removed from the data.

We hypothesized that wisps are an artifact of the Laplace approximation. To correct for potential inaccuracies of this approximation, we adopted an importance resampling approach [7]. For each sampled $\phi$ we computed a weight

$$ w_\ell[\phi] = \exp\!\left( S^{\rm Lap}_\ell[\phi] - S_\ell[\phi] \right). \qquad (3) $$

We then resampled the Laplace ensemble with replacement, selecting each $\phi$ (and thus $Q$) with a probability proportional to $w_\ell[\phi]$. A mixture of such resampled ensembles across lengthscales $\ell$ was then used to generate an ensemble reflecting $p(Q \mid {\rm data})$; see SI.4 for details. Fig. 1e shows 100 distributions $Q$ from this resampled posterior. Wisps no longer appear.

Eliminating wisps is especially important when estimating values for summary statistics, such as distribution entropy. In entropy estimation, the goal is to discern a value for the quantity $H_{\rm true} = H[Q_{\rm true}]$, where $H[Q] = -\int dx\, Q(x) \log Q(x)$. Using the DEFT posterior ensemble, we can estimate $H_{\rm true}$ as $\hat{H} \pm \hat{\delta H}$, where $\hat{H} = \langle H \rangle$ and $\hat{\delta H} = \sqrt{\langle H^2 \rangle - \langle H \rangle^2}$, with $\langle \cdot \rangle$ denoting a posterior average. Previous work expressed hope that the ensemble provided by the Laplace approximation might serve this purpose [8]. But in this case we see that $\hat{H}$ is far less accurate than the point estimates $H[R]$ or $H[Q^*]$, and $\hat{\delta H}$ is enormous (Fig. 1d). Importance resampling fixes both problems: the resulting $\hat{H}$ is closer to $H_{\rm true}$ than either point estimate, and $\hat{\delta H}$ is remarkably small (Fig. 1e).
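The resampling step of Eq. 3 amounts to only a few lines in practice. Below is a minimal sketch in Python (NumPy only); `action` and `action_laplace` are hypothetical helpers standing in for the discretized $S_\ell[\phi]$ and $S^{\rm Lap}_\ell[\phi]$, and are not part of any published API.

```python
import numpy as np

def resample_laplace_ensemble(phi_samples, action, action_laplace, rng):
    """Importance-resample draws from p_Lap(phi | data, l) using Eq. 3.

    phi_samples            : array (n_samples, G) of Laplace-posterior draws
    action, action_laplace : callables returning S_l[phi] and S_l^Lap[phi]
    """
    # log w_l[phi] = S_l^Lap[phi] - S_l[phi]   (Eq. 3)
    log_w = np.array([action_laplace(p) - action(p) for p in phi_samples])
    w = np.exp(log_w - log_w.max())            # shift for numerical stability
    idx = rng.choice(len(phi_samples), size=len(phi_samples),
                     replace=True, p=w / w.sum())
    return phi_samples[idx]

def entropy_stats(phi_samples, h):
    """Posterior mean and std of H[Q] over an ensemble of fields (in bits)."""
    Q = np.exp(-phi_samples)
    Q /= Q.sum(axis=1, keepdims=True) * h      # normalize each density on the grid
    H = -np.sum(Q * np.log2(Q + 1e-300), axis=1) * h
    return H.mean(), H.std()
```

Applying `entropy_stats` to the raw Laplace ensemble versus the resampled one reproduces the qualitative contrast between panels (d) and (e) of Fig. 1.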
We now turn to the problem of understanding how wisps arise. To this end we consider the variation in the action upon $\phi_\ell \to \phi_\ell + \delta\phi$. One finds that

$$ \delta S_\ell[\phi_\ell + \delta\phi] = \int \frac{dx}{L}\, \frac{\ell^{2\alpha}}{2} (\partial^\alpha \delta\phi)^2 + \int \frac{dx}{L}\, V(\delta\phi), \qquad (4) $$

where

$$ V(\delta\phi) = N L Q_\ell \left[ e^{-\delta\phi} - 1 + \delta\phi \right]. \qquad (5) $$

The first (kinetic) term on the right hand side of Eq. 4 imposes a smoothness constraint on $\delta\phi$, while the second (potential) term keeps $\delta\phi$ confined to a potential well consistent with the data. See SI.5 for details. Note that $V$ is convex, nonnegative, and vanishes when $\delta\phi = 0$. We define $n_{\rm eff}$, the effective number of degrees of freedom constrained by the data, as twice the value of the second term in Eq. 4 averaged over the posterior ensemble. Typical fluctuations $\delta\phi$ will therefore exhibit $V(\delta\phi) \sim n_{\rm eff}/2$. We distinguish the "data rich" regime of the $x$-domain, which we define by $Q_\ell(x) \gg n_{\rm eff}/NL$, from the "data poor" regime, corresponding to $Q_\ell(x) \ll n_{\rm eff}/NL$. In the data rich regime, fluctuations are small enough that $V$ adheres well to its Laplace approximation, $V \approx N L Q_\ell\, \delta\phi^2/2$. Under this nearly symmetric potential, both positive fluctuations $\delta\phi_+$ and negative fluctuations $\delta\phi_-$ are constrained by

$$ |\delta\phi_\pm| \sim \delta\phi_{\rm rich} = \sqrt{\frac{n_{\rm eff}}{N L Q_\ell}}. \qquad (6) $$

By contrast, $V$ is highly asymmetric in the data poor regime and produces highly asymmetric fluctuations. Positive fluctuations satisfy $\delta\phi_+ \sim n_{\rm eff}/NLQ_\ell$, whereas negative fluctuations obey

$$ -\delta\phi_- \sim \delta\phi^-_{\rm poor} = \log \frac{n_{\rm eff}}{N L Q_\ell}. \qquad (7) $$

See SI.5 for more information.

The key point is that adopting $S^{\rm Lap}_\ell[\phi]$ in place of $S_\ell[\phi]$ is equivalent to assuming the Laplace approximation for $V$ throughout the entire $x$-domain. Because $\delta\phi_{\rm rich} \gg \delta\phi^-_{\rm poor}$ in data poor regions, the Laplace approximation greatly overestimates the size of downward fluctuations in $\phi$. This results in the large upward fluctuations in $Q$ that we identify as wisps. We note that wisps are especially prominent at the $x$-interval boundaries in Fig. 1 for two reasons: (i) $Q_\ell$ is especially small here, making these regions very data poor, and (ii) the kinetic term in Eq. 4, which is all that suppresses wisps in data poor regions, is less effective at constraining $\delta\phi$ because data are present on only one side.

Feynman diagrams provide a general means of correcting for inaccuracies in Laplace approximations [13], and have been advocated in the context of some Bayesian field theory regression problems [4, 5]. For density estimation, however, Feynman diagrams are ineffective if any region of the $x$-interval is data poor. This is due to the action $S_\ell[\phi]$ being strongly coupled. For example, in the Bayesian evidence computations used to determine $\ell^*$, DEFT estimates the partition function $Z_\ell = \int \mathcal{D}\phi\, e^{-S_\ell[\phi]}$ using the Laplace approximation $Z^{\rm Lap}_\ell = \int \mathcal{D}\phi\, e^{-S^{\rm Lap}_\ell[\phi]}$. See SI.3 for details. At first, one might think it possible to correct for potential inaccuracies in this approximation using a series of vacuum diagrams (written out explicitly in SI.6), i.e.,

$$ \log \frac{Z_\ell}{Z^{\rm Lap}_\ell} = (\text{two-loop vacuum diagrams}) + \cdots. \qquad (8) $$

However, as described in SI.8, the number of diagrams needed to obtain accurate results is prohibitive when data-poor regions of the $x$-interval are present. Fortunately, one can instead compute nonperturbative corrections to this log ratio using the importance resampling weights in Eq. 3 via

$$ \log \frac{Z_\ell}{Z^{\rm Lap}_\ell} = \log \langle w_\ell \rangle_{{\rm Lap}\mid\ell}. \qquad (9) $$

See SI.7 for details.

These results reflect a fundamental yet under-appreciated aspect of density estimation: unless data are observed throughout the $x$-domain, the uncertainties in estimated probability densities require a nonperturbative treatment. Specifically, perturbative methods such as the Laplace approximation or Feynman diagrams can only be expected to work if $Q_{\rm true}(x) \gtrsim 1/NL$ everywhere within the $x$-domain. Very often, however, density estimation is applied to data like that in Fig. 1, which is localized far away from one or both $x$-interval boundaries. We argue that the analysis of such data will quite generally require a nonperturbative treatment.

To benchmark the performance of DEFT, we quantified its ability to estimate probability densities of known functional form. Specifically, we simulated datasets of varying size $N$ from a variety of $Q_{\rm true}$ distributions, then asked two questions. First, how accurately does $Q^*$ estimate $Q_{\rm true}$? Second, how typical is $Q_{\rm true}$ among the distributions in the Bayesian posterior? In both contexts, DEFT was compared to KDE and DPMM. See SI.9 for details on how KDE and DPMM were implemented. Fig. 2 shows the results of these performance tests for two different choices of $Q_{\rm true}$. Fig. S3 in the SI provides analogous results for other $Q_{\rm true}$ distributions.

FIG. 2. (Color) Performance of DEFT. (a) DEFT, KDE, and DPMM were used to analyze data from two different $Q_{\rm true}$ distributions: the Gaussian mixture from Fig. 1a (left) or a Pareto distribution, $Q_{\rm true}(x) \propto x^{-\ldots}$, confined to a finite $x$-interval (right). (b) 100 datasets of size $N = 10$ and 100 datasets of size $N = 100$ were generated for each $Q_{\rm true}$. For each dataset, $Q^*$ was computed by DEFT (using $G = 100$ and $\alpha = 1$, 2, 3, or 4), by KDE, or by DPMM. Violin plots (with median indicated) show the resulting Kullback-Leibler divergences $D_{\rm KL}(Q_{\rm true} \| Q^*)$. (c) P-values quantifying, for each simulated dataset, the location of $D_{\rm KL}(Q_{\rm true} \| Q^*)$ within the distribution of $D_{\rm KL}(Q \| Q^*)$ values observed for $Q \sim p(Q \mid {\rm data})$.

To answer the first question, we compared the Kullback-Leibler divergence, $D_{\rm KL}(Q_{\rm true} \| Q^*)$, achieved by each estimator on each dataset. Note that smaller values for these divergences indicate better method accuracy. As illustrated in Fig. 2b, DEFT usually performed comparably to KDE and DPMM at $N = 10$, and somewhat better at $N = 100$. DEFT appears to have a particular advantage over both KDE and DPMM on $Q_{\rm true}$ distributions that bump up against one or both $x$-interval boundaries. Also unsurprising is that DEFT performs notably better with $\alpha = 2$, 3, and 4 than with $\alpha = 1$, since $\alpha = 1$ yields non-smooth $Q^*$ distributions with cusps at each data point [8, 14].

To answer the second question, we computed where $D_{\rm KL}(Q_{\rm true} \| Q^*)$ falls within the distribution of divergences $D_{\rm KL}(Q \| Q^*)$ observed for $Q \sim p(Q \mid {\rm data})$. This location is naturally quantified by a p-value corresponding to the null hypothesis that $Q_{\rm true} \sim p(Q \mid {\rm data})$. If $Q_{\rm true}$ is typical of plausible $Q$s, these p-values should be uniformly distributed between 0 and 1. Alternatively, p-values clustered close to 0 indicate that the posterior ensemble $p(Q \mid {\rm data})$ overestimates how much $Q_{\rm true}$ diverges from $Q^*$, whereas p-values clustered close to 1 indicate that $p(Q \mid {\rm data})$ underestimates this uncertainty. Fig. 2c shows our results for the two choices of $Q_{\rm true}$ in Fig. 2a; results for other choices of $Q_{\rm true}$ are shown in Fig. S3. In general, the p-values for DEFT (with $\alpha = 2$, 3, and 4) were distributed with remarkable uniformity. DEFT with $\alpha = 1$ tended to overestimate uncertainties, whereas KDE and DPMM tended to underestimate uncertainties.
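The typicality test of Fig. 2c is straightforward to reproduce. The sketch below uses one natural p-value convention that matches the description above (small values when the posterior overestimates divergence from $Q^*$); all names are illustrative, and densities are assumed tabulated on a common grid of bin width $h$.

```python
import numpy as np

def dkl(P, Q, h):
    """Kullback-Leibler divergence D_KL(P || Q) for grid densities."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask])) * h

def typicality_pvalue(Q_true, Q_star, Q_samples, h):
    """Locate D_KL(Q_true || Q*) within the posterior-sampled divergences.

    Values near 0 mean the ensemble overestimates how far Q_true sits
    from Q*; values near 1 mean it underestimates that uncertainty.
    """
    d_true = dkl(Q_true, Q_star, h)
    d_post = np.array([dkl(Q, Q_star, h) for Q in Q_samples])
    return np.mean(d_post <= d_true)
```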
Finally, we illustrate the capabilities of DEFT using data reported in the initial observation of the Higgs boson [11] (see Fig. S4 for an analysis of data from the WHO). Fig. 3a, which is a reconstruction of Fig. 4 of [11], shows a histogram of the invariant masses of $N = 58$ 4-lepton events observed by the CMS Collaboration at the Large Hadron Collider. Such events are generated by the decays of the Higgs boson via $H \to ZZ \to 4\ell$, but they also arise from a variety of background decay processes. One of the challenges faced by the CMS Collaboration was determining whether these data exhibit a localized excess of events representing a possible Higgs resonance. Fig. 3b shows DEFT applied to these data using default parameters. Despite Higgs decays representing only a modest fraction of the observed events, the resulting density estimate exhibits a local maximum near $m_H = 125$ GeV. The confidence in this maximum can be quantified by sampling $p(Q \mid {\rm data})$: 81% of sampled $Q$s have exactly one local maximum between 110 GeV and 140 GeV (7% have no local maxima and 12% have multiple local maxima), and these maxima occurred at 127.1 GeV ± … GeV.

FIG. 3. (Color) DEFT applied to Higgs boson data. (a) A reconstruction of Fig. 4 from [11]. Dots (black) indicate the invariant masses of 4-lepton decay events histogrammed across $G = 37$ bins of width 3 GeV each. Also shown are the number of events expected, based on Standard Model simulations, from either background decay processes (blue) or from the decay of a Higgs boson with mass of 125 GeV (red). (b) The optimal density estimate $Q^*$ (black), along with 100 posterior samples $Q \sim p(Q \mid {\rm data})$ (olive), computed by DEFT using the histogram data in panel (a).

The one substantive choice DEFT asks of the user is the value of $\alpha$ that defines the qualitative meaning of smoothness and which governs how DEFT relates to maximum entropy estimation (see [9]). In our experience, however, this choice matters little in practice: values of $\alpha \geq 2$ behave similarly (cf. Fig. 2). The remaining parameters, such as $G$, reflect computational practicalities. These parameters can be chosen automatically and have little effect on the results as long as reasonable values are used.

DEFT thus addresses a major outstanding need, not just in statistical learning theory but also in day-to-day data analysis. To this end we have developed an open source Python package called SUFTware. SUFTware allows users to apply DEFT in one dimension to their own data, and in the future will include additional field-theory-based statistical methods. This implementation is sufficiently fast for routine use: the computations for Fig. 1 take about 0.25 seconds on a standard laptop computer (see SI.10 for a discussion of computational complexity). SUFTware has minimal dependencies, is compatible with both Python 2 and Python 3, and is readily installed using the pip package manager. See http://suftware.readthedocs.io for installation and usage instructions.

We thank Kush Coshic for preliminary contributions to this project, as well as Serena Bradde, David McCandlish, and two anonymous referees for helpful feedback. This work was supported by a CSHL/Northwell Health Alliance grant to JBK and by NIH Cancer Center Support Grant 5P30CA045508.

[1] B. W. Silverman, Density Estimation for Statistics and Data Analysis (Chapman and Hall, 1986).
[2] W. Bialek, C. Callan, and S. Strong, Phys Rev Lett 77, 4693 (1996).
[3] J. C. Lemm, Bayesian Field Theory (Johns Hopkins, 2003).
[4] T. A. Enßlin, M. Frommert, and F. S. Kitaura, Phys Rev D 80, 105005 (2009).
[5] T. Ensslin, arXiv:1301.2556v1 (2013).
[6] P. Müller, F. A. Quintana, A. Jara, and T. Hanson, Bayesian Nonparametric Data Analysis (Springer, 2015).
[7] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian Data Analysis, 3rd ed. (CRC Press, 2013).
[8] J. B. Kinney, Phys Rev E 90, 011301(R) (2014).
[9] J. B. Kinney, Phys Rev E 92, 032107 (2015).
[10] J. Riihimäki and A. Vehtari, Bayesian Anal 9, 425 (2014).
[11] CMS Collaboration, Phys Lett B 716, 30 (2012).
[12] World Health Organization, World Health Statistics 2017: Monitoring Health for the SDGs, Sustainable Development Goals (World Health Organization, Geneva, 2017).
[13] J. Zinn-Justin, Path Integrals in Quantum Mechanics (Oxford, 2010).
[14] I. Nemenman and W. Bialek, Phys Rev E 65, 026137 (2002).

SI.1. THE POSTERIOR ACTION $S_\ell[\phi]$

A derivation of Eq. 2 has already been reported in Ref. [9]. The derivation presented here, however, is more straightforward. The action in Eq. 2 is given by

$$ S_\ell[\phi] = S^0_\ell[\phi] + S_{\rm data}[\phi], \qquad ({\rm S1}) $$

where $S^0_\ell[\phi]$ is the "prior action", corresponding to a Bayesian prior $p(Q \mid \ell) \propto \exp(-S^0_\ell[\phi])$, while $S_{\rm data}[\phi]$, the "likelihood action", is related to the likelihood via $p({\rm data} \mid Q) \propto \exp(-S_{\rm data}[\phi])$. DEFT uses a prior action of the form

$$ S^0_\ell[\phi] = \int \frac{dx}{L}\, \frac{\ell^{2\alpha}}{2} (\partial^\alpha \phi)^2. \qquad ({\rm S2}) $$

The parameter $\alpha$ reflects a fundamental choice in how one defines "smoothness", and $\ell$ is a lengthscale below which fluctuations in $\phi$ are strongly damped. The derivation of $S_{\rm data}[\phi]$ is as follows. Suppose we are given $N$ data points drawn from a probability distribution $Q_{\rm true}(x)$ that is confined to the interval $[x_{\rm min}, x_{\rm max}]$. Label these data in order of increasing value as $x_1, x_2, \ldots, x_N$. Next, imagine these data as being produced by a stochastic process in time, with $x$ being the time variable and $r(x)$ being the instantaneous emission rate. The likelihood of the data is then given by

$$ (dx)^N p({\rm data} \mid r) = \left[ e^{-\int_{x_{\rm min}}^{x_1} dx\, r(x)} \right] \cdot \left[ dx\, r(x_1) \right] \cdot \left[ e^{-\int_{x_1}^{x_2} dx\, r(x)} \right] \cdot \left[ dx\, r(x_2) \right] \cdots \left[ dx\, r(x_N) \right] \cdot \left[ e^{-\int_{x_N}^{x_{\rm max}} dx\, r(x)} \right] $$
$$ = (dx)^N \exp\left\{ -\int_{x_{\rm min}}^{x_{\rm max}} dx\, r(x) \right\} \prod_{i=1}^N r(x_i) $$
$$ = (dx)^N \exp\left\{ -\int dx\, r(x) + \sum_{i=1}^N \log r(x_i) \right\} $$
$$ = (dx)^N \exp\left\{ -\int dx \left[ r(x) - N R(x) \log r(x) \right] \right\}, \qquad ({\rm S3}) $$

where $\int dx$ indicates integration over the entire $x$-domain and $R(x) = N^{-1} \sum_{i=1}^N \delta(x - x_i)$ is the raw data density referred to in the main text. Next, we parametrize the emission rate $r(x)$ using the field $\phi(x)$ via

$$ r(x) = \frac{N}{L} e^{-\phi(x)}. \qquad ({\rm S4}) $$

The probability density corresponding to this rate is

$$ Q(x) = \frac{r(x)}{\int dx'\, r(x')} = \frac{e^{-\phi(x)}}{\int dx'\, e^{-\phi(x')}}, \qquad ({\rm S5}) $$

and so our definition of $\phi$ here is consistent with the definition of $\phi$ in the main text. We therefore see that the likelihood density in Eq. S3 is given by $p({\rm data} \mid \phi) \propto \exp(-S_{\rm data}[\phi])$, where the corresponding action (after dropping the constant term $N \log(L/N)$) is

$$ S_{\rm data}[\phi] = \int \frac{dx}{L} \left[ N L R(x)\, \phi(x) + N e^{-\phi(x)} \right]. \qquad ({\rm S6}) $$

Plugging Eq. S2 and Eq. S6 into Eq. S1 gives Eq. 2 of the main text. Note the origin of the two terms in the integrand in Eq. S6: the term linear in $\phi$ comes from the exact locations of the $N$ data points, whereas the nonlinear term (which leads to such interesting behavior) comes from regions of the $x$-domain in which no data is observed.

We briefly discuss a subtle issue with the above derivation. The probability distribution $Q(x)$ is invariant under additive shifts in the underlying field, i.e., $\phi(x) \to \phi(x) + c$ for any constant $c$. By contrast, the likelihood action $S_{\rm data}[\phi]$ is not invariant under such transformations. This difference is due to Eq. S4 which, by specifying how the emission rate $r(x)$ relates to $\phi(x)$, introduces an additional assumption about how $\phi$ should be constrained by data. But although this additional assumption alters $p(\phi \mid {\rm data})$, it does not alter $p(Q \mid {\rm data})$. The more involved derivation of $S_\ell[\phi]$ provided in Ref. [9] demonstrates this fact explicitly.
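To make the grid-level objects used throughout the SI concrete, here is a minimal sketch of the discretized action of Eq. 2. A forward-difference matrix plays the role of $\partial^\alpha$, so that $D^{\rm T} D$ realizes the bilateral form of Eq. S8 without imposing boundary conditions (cf. SI.2). All function names are illustrative and are not part of SUFTware's API.

```python
import numpy as np

def finite_difference_operator(G, h, alpha):
    """Matrix form of the alpha-th derivative on a uniform grid."""
    D = np.eye(G)
    for _ in range(alpha):
        D = (D[1:, :] - D[:-1, :]) / h   # forward difference; one row lost per order
    return D                              # shape (G - alpha, G)

def posterior_action(phi, R_hist, N, L, ell, alpha):
    """Discretized S_l[phi] from Eq. 2: kinetic + data + exponential terms."""
    G = phi.size
    h = L / G
    D = finite_difference_operator(G, h, alpha)
    kinetic = 0.5 * ell**(2 * alpha) * np.sum((D @ phi)**2) * h / L
    data_term = N * h * np.sum(R_hist * phi)   # R_hist: data histogram density
    exp_term = (N / L) * h * np.sum(np.exp(-phi))
    return kinetic + data_term + exp_term
```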
SI.2. THE MAP FIELD $\phi_\ell$

To solve for $\phi_\ell$, the maximum a posteriori (MAP) field at lengthscale $\ell$, we set $\delta S_\ell / \delta\phi = 0$. The resulting equation of motion is

$$ \ell^{2\alpha} \Delta^\alpha \phi_\ell + N L R - N e^{-\phi_\ell} = 0. \qquad ({\rm S7}) $$

The operator $\Delta^\alpha$ that appears here is the "bilateral Laplacian", which is described in Ref. [9]. Briefly, $\Delta^\alpha$ is defined by the requirement that

$$ \int dx\, \varphi\, \Delta^\alpha \phi = \int dx\, (\partial^\alpha \varphi)(\partial^\alpha \phi) \qquad ({\rm S8}) $$

for any two fields $\varphi$ and $\phi$. This bilateral Laplacian is identical to the standard $\alpha$-order Laplacian $(-1)^\alpha \partial^{2\alpha}$ in the interior of the $x$-interval, but differs at the boundaries. Specifically, the standard $\alpha$-order Laplacian requires the additional specification of $\alpha$ boundary conditions in order to be self-adjoint. By contrast, the bilateral Laplacian is self-adjoint without the specification of any boundary conditions. The equation of motion, Eq. S7, thus has a unique solution without the need to assume any boundary conditions on $\phi$. See Ref. [9] for more information.

By integrating Eq. S7 we find that $\int dx\, e^{-\phi_\ell(x)} = L$, due to $\int dx\, R(x) = 1$ and $\int dx\, \Delta^\alpha \phi_\ell = 0$. The MAP density $Q_\ell$ thus has a simple form:

$$ Q_\ell(x) = \frac{e^{-\phi_\ell(x)}}{L}. \qquad ({\rm S9}) $$

Similarly, multiplying Eq. S7 on the left by $x^k$ for $k = 1, \ldots, \alpha - 1$ and integrating yields

$$ \langle x^k \rangle_{Q_\ell} = \langle x^k \rangle_R, \qquad ({\rm S10}) $$

i.e., the first $\alpha - 1$ moments of $Q_\ell$ exactly match those of the data.

As described in Ref. [9], DEFT computes the MAP field $\phi_\ell$ for a set of lengthscales $\ell_1, \ell_2, \ell_3, \ldots, \ell_K$, ranging from $\ell_1 = 0$ to $\ell_K = \infty$. These lengthscales are chosen so that neighboring MAP densities, $Q_{\ell_k}$ and $Q_{\ell_{k+1}}$, are approximately equally spaced along this "MAP curve", as quantified by the geodesic distance $D_{\rm geo}(Q_{\ell_k}, Q_{\ell_{k+1}})$. We note that $Q_{\ell=0}$ is in fact the data histogram $R$, while $Q_{\ell=\infty}$ is in fact the maximum entropy distribution consistent with the moment constraints in Eq. S10. See Ref. [9] for details.

SI.3. THE EVIDENCE $p({\rm data} \mid \ell)$

The DEFT algorithm computes the MAP field at lengthscales spanning $\ell = 0$ to $\ell = \infty$. The optimal lengthscale $\ell^*$ is then computed by maximizing the Bayesian evidence $p({\rm data} \mid \ell)$. The key quantity needed for this procedure is the "evidence ratio," which is given by

$$ E(\ell) = \frac{p({\rm data} \mid \ell)}{p({\rm data} \mid \infty)}. \qquad ({\rm S11}) $$

It can be shown that $E(\ell) = (Z_\ell / Z^0_\ell)/(Z_\infty / Z^0_\infty)$, where

$$ Z_\ell = \int \mathcal{D}\phi\, e^{-S_\ell[\phi]} \quad \text{and} \quad Z^0_\ell = \int \mathcal{D}\phi\, e^{-S^0_\ell[\phi]} \qquad ({\rm S12}) $$

respectively denote the posterior partition function and the prior partition function. The prior partition function $Z^0_\ell$ can be computed analytically, although it has a divergence that must be regularized. By contrast, the posterior partition function $Z_\ell$ can only be analytically computed in the Laplace approximation. We therefore instead use the quantity

$$ Z^{\rm Lap}_\ell = \int \mathcal{D}\phi\, e^{-S^{\rm Lap}_\ell[\phi]}, \qquad ({\rm S13}) $$

where $S^{\rm Lap}_\ell[\phi]$ is the Laplace approximation of $S_\ell[\phi]$. The resulting evidence ratio in this approximation is found to be

$$ E(\ell) = e^{S_\infty[\phi_\infty] - S_\ell[\phi_\ell]} \sqrt{ \frac{ \det_{\rm ker}\!\left[ e^{-\phi_\infty} \right] \det_{\rm row}\!\left[ L^{2\alpha} \Delta^\alpha \right] }{ \eta^{-\alpha} \det\!\left[ L^{2\alpha} \Delta^\alpha + \eta\, e^{-\phi_\ell} \right] } }, \qquad ({\rm S14}) $$

where $\eta = N (L/\ell)^{2\alpha}$, and "ker" and "row" respectively denote the kernel and row space of the bilateral Laplacian $\Delta^\alpha$. See Ref. [9] for details.

It should be emphasized that, although the Laplace approximation can be grossly inaccurate when sampling $Q \sim p(Q \mid {\rm data})$, it does not strongly affect the evidence ratio $E(\ell)$. This is because $\log E(\ell)$ typically varies over many orders of magnitude, whereas $\log(Z_\ell / Z^{\rm Lap}_\ell)$ varies with $\ell$ far less dramatically. This is demonstrated in Fig. S2 below. Nevertheless, the SUFTware implementation of DEFT includes an option to correct for this approximation using importance sampling, as described in the main text.
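In the discretized setting, both the MAP field of SI.2 and the Hessian needed for the evidence computation above follow from a short Newton iteration on the convex action. A minimal sketch, reusing the illustrative `finite_difference_operator` from the SI.1 sketch (a line search can be added for robustness):

```python
import numpy as np

def map_field(R_hist, N, L, ell, alpha, tol=1e-9, max_iter=100):
    """Newton solve of the discretized equation of motion, Eq. S7.

    Convexity of S_l[phi] guarantees a unique solution; no boundary
    conditions are needed because D.T @ D realizes the bilateral Laplacian.
    """
    G = R_hist.size
    h = L / G
    D = finite_difference_operator(G, h, alpha)
    K = (h / L) * ell**(2 * alpha) * (D.T @ D)   # kinetic (bilateral) term
    phi = np.zeros(G)
    for _ in range(max_iter):
        grad = K @ phi + N * h * R_hist - (N * h / L) * np.exp(-phi)
        hess = K + (N * h / L) * np.diag(np.exp(-phi))
        step = np.linalg.solve(hess, grad)
        phi -= step
        if np.max(np.abs(step)) < tol:
            break
    return phi   # Q_l = exp(-phi) / L  (Eq. S9); hess is the Laplace Hessian
```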
FIG. S1. (Color) Fluctuations $\delta\phi$ in data rich vs. data poor regimes (left panel, $T_{\rm eff} = 0.\ldots$; right panel, $T_{\rm eff} = 40.\ldots$). Solid blue lines indicate $f(\delta\phi)$ from Eq. S18. Dotted orange lines indicate the Laplace approximation $f_{\rm Lap}(\delta\phi) = \delta\phi^2/2$. Dashed green lines indicate the value of half the effective temperature ($T_{\rm eff}/2$) from Eq. S20. In the data rich regime, $T_{\rm eff}/2 \ll 1$ and the Laplace approximation accurately captures both $\delta\phi_\pm$ fluctuations. In the data poor regime, $T_{\rm eff}/2 \gg 1$ and the magnitude of the negative fluctuations $\delta\phi_-$ due to $f$ is substantially less than would result from $f_{\rm Lap}$. Note also that the very large positive fluctuations $\delta\phi_+$ in the data poor regime have little noticeable effect on $Q_\ell$, since they just push $Q_\ell$ closer to zero.

SI.4. SAMPLING THE POSTERIOR $p(\phi, \ell \mid {\rm data})$

The posterior probability $p(\phi, \ell \mid {\rm data})$ can be decomposed as

$$ p(\phi, \ell \mid {\rm data}) = p(\phi \mid \ell, {\rm data})\, p(\ell \mid {\rm data}). \qquad ({\rm S15}) $$

This forms the basis for our posterior sampling procedure. First, we sample plausible $\ell$s from $p(\ell \mid {\rm data})$. Note that $p(\ell \mid {\rm data}) \propto p({\rm data} \mid \ell)\, p(\ell)$ by Bayes's theorem. Assuming $p(\ell)$ is uniform over the length of the MAP curve as quantified by geodesic distance (see Ref. [9]), $p(\ell \mid {\rm data})$ becomes proportional to the evidence ratio $E(\ell)$. We thus sample values of $\ell$ from the set $\{\ell_1, \ell_2, \ldots, \ell_K\}$ used to trace the MAP curve, each $\ell_k$ being selected with probability proportional to $E(\ell_k)$. For each of these $\ell$ values, we then sample plausible $\phi$s from $p(\phi \mid \ell, {\rm data})$. Here we employ importance sampling. Specifically, we can rewrite the distribution $p(\phi \mid \ell, {\rm data})$ as follows:

$$ p(\phi \mid \ell, {\rm data}) = \frac{e^{-S_\ell[\phi]}}{Z_\ell} = \frac{e^{-S^{\rm Lap}_\ell[\phi]}}{Z^{\rm Lap}_\ell} \frac{w_\ell[\phi]}{\langle w_\ell \rangle_{{\rm Lap}\mid\ell}} \propto p_{\rm Lap}(\phi \mid \ell, {\rm data})\, w_\ell[\phi], \qquad ({\rm S16}) $$

where we have made use of Eq. S36 (derived below). Therefore, we first sample $\phi$s from the Laplace-approximated distribution $p_{\rm Lap}(\phi \mid {\rm data}, \ell)$, then correct for the non-Gaussian nature of the original distribution by resampling these $\phi$s using the importance weights $w_\ell[\phi]$.
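The two-stage draw that Eq. S15 suggests can be sketched directly: pick $\ell_k$ with probability $\propto E(\ell_k)$, then draw $\phi$ from the Laplace-approximated Gaussian at that $\ell$ via the Hessian's eigendecomposition. This is an illustrative sketch (not SUFTware's internals); the resulting draws are then importance-resampled with the weights of Eq. 3.

```python
import numpy as np

def sample_posterior(log_evidence, phi_maps, hessians, n_samples, rng):
    """Two-stage draw from p(phi, ell | data) via Eq. S15.

    log_evidence : log E(l_k) for the K lengthscales along the MAP curve
    phi_maps     : list of MAP fields phi_{l_k}, each of shape (G,)
    hessians     : list of Hessians of S_{l_k} at phi_{l_k}, each (G, G)
    """
    log_p = log_evidence - np.max(log_evidence)
    p_ell = np.exp(log_p)
    p_ell /= p_ell.sum()
    draws = []
    for _ in range(n_samples):
        k = rng.choice(len(p_ell), p=p_ell)          # step 1: l_k ~ E(l_k)
        evals, evecs = np.linalg.eigh(hessians[k])   # step 2: phi ~ N(phi_l, H^-1)
        z = rng.standard_normal(evals.size) / np.sqrt(evals)
        draws.append((k, phi_maps[k] + evecs @ z))
    return draws   # then resample with the weights w_l (Eq. 3 / Eq. S16)
```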
SI.5. ORIGIN OF WISPS

To derive Eqs. 4 and 5, it suffices to note that

$$ S_\ell[\phi_\ell + \delta\phi] = S_\ell[\phi_\ell] + O(\delta\phi^2), \qquad ({\rm S17}) $$

because the equation of motion in Eq. S7 causes all first-order terms in $\delta\phi$ to cancel; the remaining terms constitute the variation $\delta S_\ell$ in Eq. 4. Next, we express $V(\delta\phi) = N L Q_\ell\, f(\delta\phi)$, where

$$ f(\delta\phi) = e^{-\delta\phi} - 1 + \delta\phi \qquad ({\rm S18}) $$

is just $e^{-\delta\phi}$ with the 0th and 1st order terms subtracted out. This function is plotted in Fig. S1. The key to deriving the magnitude of fluctuations $\delta\phi$ in different regimes is the relationship $\langle V(\delta\phi) \rangle \sim n_{\rm eff}/2$, which we rephrase here as

$$ \langle f(\delta\phi) \rangle \sim \frac{T_{\rm eff}}{2}, \qquad ({\rm S19}) $$

where

$$ T_{\rm eff} = \frac{n_{\rm eff}}{N L Q_\ell} \qquad ({\rm S20}) $$

is an effective temperature.

In the data rich regime, $T_{\rm eff}/2 \ll 1$. Therefore $f(\delta\phi) \ll 1$, implying small fluctuations $\delta\phi$. As illustrated in Fig. S1 (left panel), the Laplace approximation works well in this regime. Setting

$$ f(\delta\phi) \approx f_{\rm Lap}(\delta\phi) = \frac{\delta\phi^2}{2} \sim \frac{T_{\rm eff}}{2} \qquad ({\rm S21}) $$

gives Eq. 6.

In the data poor regime, $T_{\rm eff}/2 \gg 1$. As illustrated in Fig. S1 (right panel), $f$ is highly asymmetric in this regime, and so the positive and negative fluctuations, $\delta\phi_+$ and $\delta\phi_-$, need to be treated separately. Specifically,

$$ f(\delta\phi_+) \approx \delta\phi_+ \sim \frac{T_{\rm eff}}{2}, \quad \text{whereas} \quad f(\delta\phi_-) \approx e^{-\delta\phi_-} \sim \frac{T_{\rm eff}}{2}. \qquad ({\rm S22}) $$

Solving the latter condition for $\delta\phi_-$ gives Eq. 7. Note in Fig. S1 (right panel) how the Laplace approximation greatly overestimates the magnitude of negative fluctuations $\delta\phi_-$ in the data poor regime.
SI.6. COMPUTING $\log(Z_\ell / Z^{\rm Lap}_\ell)$ USING FEYNMAN DIAGRAMS

Here we show how Feynman diagrams can be used to compute $\log(Z_\ell / Z^{\rm Lap}_\ell)$, thereby obtaining corrections to the Laplace approximation. Our exposition closely follows that sketched by Zinn-Justin [13]. However, because Feynman diagrams are rarely used in the context of statistical inference, we felt it worthwhile to make these calculations explicit.

Upon discretization of the $x$-interval using $G$ grid points, the action in Eq. 2 becomes

$$ S_\ell[\phi] = \frac{\ell^{2\alpha}}{2G} \sum_{ij} \Delta^\alpha_{ij} \phi_i \phi_j + \frac{NL}{G} \sum_i R_i \phi_i + \frac{N}{G} \sum_i e^{-\phi_i}, \qquad ({\rm S23}) $$

where $i, j = 1, 2, \ldots, G$. In what follows we represent fluctuations in $\phi$ about the MAP field $\phi_\ell$ using the rescaled fluctuation $x = \sqrt{N}(\phi - \phi_\ell)$. The action can then be expanded in the following way:

$$ S_\ell[\phi] = S^{\rm Lap}_\ell[\phi] + \sum_{ijk} \frac{B_{ijk}}{3!\sqrt{N}}\, x_i x_j x_k + \sum_{ijkl} \frac{C_{ijkl}}{4!\,N}\, x_i x_j x_k x_l + \cdots, \qquad ({\rm S24}) $$

where the Laplace action is

$$ S^{\rm Lap}_\ell[\phi] = S_\ell[\phi_\ell] + \frac{1}{2} \sum_{ij} A_{ij}\, x_i x_j, \qquad ({\rm S25}) $$

and

$$ A_{ij} = \frac{1}{N} \frac{\partial^2 S_\ell}{\partial\phi_i \partial\phi_j}\bigg|_{\phi_\ell} = \frac{\ell^{2\alpha}}{NG} \Delta^\alpha_{ij} + \frac{1}{G} e^{-\phi_{\ell i}}\, \delta_{ij}, \qquad ({\rm S26}) $$

$$ B_{ijk} = \frac{1}{N} \frac{\partial^3 S_\ell}{\partial\phi_i \partial\phi_j \partial\phi_k}\bigg|_{\phi_\ell} = -\frac{1}{G} e^{-\phi_{\ell i}}\, \delta_{ijk}, \qquad ({\rm S27}) $$

$$ C_{ijkl} = \frac{1}{N} \frac{\partial^4 S_\ell}{\partial\phi_i \partial\phi_j \partial\phi_k \partial\phi_l}\bigg|_{\phi_\ell} = \frac{1}{G} e^{-\phi_{\ell i}}\, \delta_{ijkl}. \qquad ({\rm S28}) $$

The quantity $\log(Z_\ell / Z^{\rm Lap}_\ell)$ is conveniently given by the sum of connected vacuum diagrams. At $O(N^{-1})$, the relevant diagrams contain only 3rd-order and 4th-order vertices. From the expansion in Eq. S24 we see that the values corresponding to these vertices are given by $-B_{ijk}/\sqrt{N}$ and $-C_{ijkl}/N$, respectively. We also need the propagator matrix $P$, which is given by the inverse of the Hessian $A$, i.e., $P_{ij} = (A^{-1})_{ij}$. We thus obtain

$$ \log \frac{Z_\ell}{Z^{\rm Lap}_\ell} = (\text{figure-eight}) + (\text{dumbbell}) + (\text{sunset}) + O(N^{-2}), \qquad ({\rm S29}) $$

where the contribution from each diagram is

$$ (\text{figure-eight}) = \frac{1}{8} \sum_{ijkl} \left( -\frac{C_{ijkl}}{N} \right) P_{ij} P_{kl} = -\frac{1}{8NG} \sum_i e^{-\phi_{\ell i}} (P_{ii})^2, \qquad ({\rm S30}) $$

$$ (\text{dumbbell}) = \frac{1}{8} \sum_{ijk} \sum_{lmn} \left( -\frac{B_{ijk}}{\sqrt{N}} \right)\!\left( -\frac{B_{lmn}}{\sqrt{N}} \right) P_{ij} P_{kl} P_{mn} = \frac{1}{8NG^2} \sum_{i,l} e^{-\phi_{\ell i} - \phi_{\ell l}}\, P_{ii} P_{il} P_{ll}, \qquad ({\rm S31}) $$

$$ (\text{sunset}) = \frac{1}{12} \sum_{ijk} \sum_{lmn} \left( -\frac{B_{ijk}}{\sqrt{N}} \right)\!\left( -\frac{B_{lmn}}{\sqrt{N}} \right) P_{il} P_{jm} P_{kn} = \frac{1}{12NG^2} \sum_{i,l} e^{-\phi_{\ell i} - \phi_{\ell l}}\, (P_{il})^3. \qquad ({\rm S32}) $$
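For modest $G$, these contractions are easy to evaluate directly. A minimal sketch (illustrative function name; the symmetry factors 1/8, 1/8, and 1/12 follow the reconstruction of Eqs. S30-S32 above):

```python
import numpy as np

def two_loop_correction(P, phi_map, N, G):
    """Sum of the two-loop vacuum diagrams, Eqs. S30-S32.

    P       : propagator matrix, inverse of the Hessian A (shape G x G)
    phi_map : MAP field phi_l on the grid
    """
    e = np.exp(-phi_map)
    d = np.diag(P)                                        # P_ii
    fig8     = -np.sum(e * d**2) / (8 * N * G)            # Eq. S30
    dumbbell = (e * d) @ P @ (e * d) / (8 * N * G**2)     # Eq. S31
    sunset   = e @ (P**3) @ e / (12 * N * G**2)           # Eq. S32 (elementwise cube)
    return fig8 + dumbbell + sunset
```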
SI.7. COMPUTING $\log(Z_\ell / Z^{\rm Lap}_\ell)$ USING IMPORTANCE SAMPLING

Alternatively, the correction $\log(Z_\ell / Z^{\rm Lap}_\ell)$ can be computed using importance sampling involving the weights $w_\ell$ in Eq. 3. To see how, we express the partition function $Z_\ell$ as an average over the Laplace ensemble:

$$ Z_\ell = \int \mathcal{D}\phi\, e^{-S_\ell[\phi]} \qquad ({\rm S33}) $$
$$ = Z^{\rm Lap}_\ell \int \mathcal{D}\phi\, \frac{e^{-S^{\rm Lap}_\ell[\phi]}}{Z^{\rm Lap}_\ell}\, e^{S^{\rm Lap}_\ell[\phi] - S_\ell[\phi]} \qquad ({\rm S34}) $$
$$ = Z^{\rm Lap}_\ell \int \mathcal{D}\phi\, p_{\rm Lap}(\phi \mid {\rm data}, \ell)\, w_\ell[\phi] \qquad ({\rm S35}) $$
$$ = Z^{\rm Lap}_\ell\, \langle w_\ell \rangle_{{\rm Lap}\mid\ell}, \qquad ({\rm S36}) $$

where $\langle \cdot \rangle_{{\rm Lap}\mid\ell}$ denotes the mean taken with respect to the Laplace posterior $p_{\rm Lap}(\phi \mid {\rm data}, \ell)$, and $w_\ell$ denotes the importance sampling weights in Eq. 3. The quantity $\log(Z_\ell / Z^{\rm Lap}_\ell)$ can thus be computed using Eq. 9.
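In code, Eq. 9 is a one-line log-sum-exp over the log-weights of the Laplace draws; the shift by the maximum avoids overflow when a few weights dominate (a sketch, consistent with the helpers above):

```python
import numpy as np

def log_z_ratio(log_w):
    """log(Z_l / Z_l^Lap) = log <w_l>_Lap|l  (Eq. 9), from log-weights."""
    m = np.max(log_w)
    return m + np.log(np.mean(np.exp(log_w - m)))
```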
SI.8. FEYNMAN DIAGRAMS VS. IMPORTANCE SAMPLING

Perhaps disappointingly, Feynman diagrams generally do not work well in situations where wisps appear. This is because the posterior action in such cases is strongly coupled. To see this, consider an expansion of the potential $V$ in Eq. 5 to $m$'th order in $\delta\phi$:

$$ V_m(\delta\phi) = N L Q_\ell \sum_{n=2}^m \frac{(-\delta\phi)^n}{n!}. \qquad ({\rm S37}) $$

To produce accurate results, the potential $V_m$ must include enough terms to sufficiently approximate $V$ when evaluated at $\delta\phi = -\delta\phi^-_{\rm poor} = -\phi^* + \log(N/n_{\rm eff})$. This would require $m_{\rm min} = \phi^* - \log(N/n_{\rm eff})$ terms at the very least, since not until here do the (all positive) terms in this power series begin to decrease. Thus, the number of terms that would be needed cannot be fixed a priori, but rather must increase with $\phi^*$. This presents a major problem for Feynman-diagram-based expansions. Any diagram influenced by the $m_{\rm min}$'th term in Eq. S37 must contain an $m_{\rm min}$'th order vertex. But $m_{\rm min}$ can be quite large: for $\phi^*$ in Fig. 1, one finds $m_{\rm min} > 100$ near the boundaries of the $x$-interval. Evaluating Feynman diagrams up to such high order is not feasible.

This expectation is confirmed in Fig. S2, which compares the two ways of computing $\log(Z_\ell / Z^{\rm Lap}_\ell)$ for two different choices of $Q_{\rm true}$. The Feynman diagram approximation works well when $Q_{\rm true}$ fills the entire $x$-interval, indicating that the action $S_\ell[\phi]$ is nearly quadratic and the corrections to the Laplace approximation are small. However, when $Q_{\rm true}$ vanishes in large regions of the $x$-domain, the Feynman diagram approximation is very bad. In this case, the action $S_\ell[\phi]$ is strongly coupled and a fundamentally non-perturbative approach is required to compute the corrections.

Although the non-quadratic nature of the posterior action can lead to a partition function $Z_\ell$ differing from its Laplace-approximated value $Z^{\rm Lap}_\ell$ by a large amount, we find that the Laplace approximation nevertheless generally works well for identifying the optimal lengthscale. This is because $\log Z^{\rm Lap}_\ell$ typically varies by multiple orders of magnitude across different values of $\ell$, thereby swamping potential inaccuracies in the $Z_\ell \approx Z^{\rm Lap}_\ell$ assumption.
FIG. S2. (Color) Wisps appear when $S_\ell[\phi]$ is strongly coupled. The accuracy of Feynman diagrams was assessed using data drawn from the $Q_{\rm true}$ density in Fig. 1 confined to the intervals $[-3, 3]$ (a,c) or $[-15, 15]$ (b,d). (a,b) $Q_{\rm true}$ (gray) is shown along with 100 distributions $Q$ (magenta) sampled at fixed $\ell = \ell^*$ from the Laplace-approximated posterior inferred from a dataset of size $N = \ldots$. (c,d) $\log(Z_\ell / Z^{\rm Lap}_\ell)$ computed for 100 different datasets, generated as above, using either Feynman diagrams (Eq. 8, horizontal axis) or importance weights (Eq. 9, vertical axis). These two quantities agree well in (c) ($\rho^2 = 0.95$) but poorly in (d) ($\rho^2 = 0.09$). Squared Pearson correlations, $\rho^2$, are shown in the titles of (c,d).

SI.9. OTHER DENSITY ESTIMATION METHODS
Here we describe the kernel density estimation (KDE) and Dirichlet process mixture modeling (DPMM) algorithms used for the computations shown in Fig. 2 and Fig. S3.
A. Kernel density estimation
KDE is arguably the most common approach to density estimation in one dimension. Given data $\{x_i\}_{i=1}^N$, the KDE density estimate is given by

$$ Q^*(x) = \frac{1}{N} \sum_{i=1}^N \frac{1}{w} K\!\left( \frac{x - x_i}{w} \right), \qquad ({\rm S38}) $$

where $K(z)$ is the kernel function and $w$ is the "bandwidth". We used a Gaussian kernel,

$$ K(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \qquad ({\rm S39}) $$

and chose the bandwidth $w$ using cross-validation. Specifically, we considered 100 candidate bandwidths geometrically distributed between $w_{\rm min}$ (the minimum spacing between data points) and $w_{\rm max}$ (10 times the span of the data). We then chose the bandwidth $w$ that maximized the jackknifed log likelihood

$$ \mathcal{L}(w) = \sum_{i=1}^N \log Q^*_{-i}(x_i), \qquad ({\rm S40}) $$

where the subscript on $Q^*_{-i}$ indicates the density $Q^*$ computed as in Eq. S38 but using a dataset missing the datum $x_i$.

KDE does not provide an explicit posterior on $Q$. Therefore, to compute p-values for Fig. 2 and Fig. S3, we approximated posterior samples $Q \sim p(Q \mid {\rm data})$ by applying KDE to bootstrap-resampled datasets.
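This bandwidth selection is compact to implement. The sketch below (illustrative names, NumPy only) evaluates the leave-one-out likelihood of Eq. S40 by zeroing the diagonal of the pairwise kernel matrix:

```python
import numpy as np

def choose_bandwidth(data, n_candidates=100):
    """Maximize the jackknifed log likelihood (Eq. S40) over a geometric grid
    of candidate bandwidths, using the Gaussian kernel of Eq. S39."""
    spacings = np.diff(np.sort(data))
    w_min = spacings[spacings > 0].min()          # minimum spacing between points
    w_max = 10 * (data.max() - data.min())        # 10 times the span of the data
    best_w, best_L = None, -np.inf
    for w in np.geomspace(w_min, w_max, n_candidates):
        z = (data[:, None] - data[None, :]) / w
        K = np.exp(-0.5 * z**2) / (w * np.sqrt(2 * np.pi))
        np.fill_diagonal(K, 0.0)                  # leave-one-out: drop self-kernel
        Q_loo = K.sum(axis=1) / (len(data) - 1)   # Q*_{-i}(x_i)
        L = np.sum(np.log(Q_loo + 1e-300))
        if L > best_L:
            best_w, best_L = w, L
    return best_w
```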
B. Dirichlet process mixture modeling

DPMM is arguably the most popular nonparametric Bayesian method for estimating probability densities. DPMMs have a hierarchical structure, in the sense that each data point is assumed to be drawn from one of a number of "clusters," with each cluster having a probability density defined by a kernel of pre-specified functional form.

In the computations for Fig. 2 and Fig. S3, we adopted the finite DPMM described in Refs. [6, 7]. Densities were assumed to be of the form

$$ Q(x) = \sum_{h=1}^H w_h\, K_{m_h}(x), \qquad ({\rm S41}) $$

where $H$ is the number of clusters, $w_h$ is the probability of cluster $h$, and $m_h$ is the set of parameters defining the density of cluster $h$. $K_m(z)$ was assumed to be a Gaussian density specified by $m = (\mu, \sigma^2)$, i.e., a mean and a variance. A normal-inverse-gamma distribution was used as the prior on $m$:

$$ p(\mu, \sigma^2) = \mathcal{N}(\mu \mid \hat\mu, \hat\kappa \sigma^2)\, \Gamma^{-1}(\sigma^2 \mid \hat\alpha, \hat\beta), \qquad ({\rm S42}) $$

where

$$ \hat\kappa = 1, \quad \hat\alpha = 1, \quad \hat\beta = \hat\sigma^2, \quad \hat\sigma^2 = \frac{1}{N-1} \sum_{i=1}^N (x_i - \hat\mu)^2, \quad \text{and} \quad \hat\mu = \frac{1}{N} \sum_{i=1}^N x_i. \qquad ({\rm S43}) $$

The number of clusters was fixed at $H = 10$. For each dataset, we used Gibbs sampling to obtain an ensemble of plausible densities representing $p(Q \mid {\rm data})$. The optimal estimate $Q^*$ was then defined as the mean density in this ensemble. Following Ref. [7], our Gibbs sampling algorithm worked as follows. For each cluster $h = 1, 2, \ldots, H$, we chose an initial weight $w_h = 1/H$ and a set of kernel parameters $m_h$ chosen according to the prior distribution $p(\mu, \sigma^2)$ in Eq. S42. The sampler was then run by iterating the following steps (a code sketch follows this list):

1. Data were redistributed across clusters. Specifically, each data point $x_i$ was allocated to cluster $h$ with probability

$$ p(h \mid x_i) = \frac{w_h\, K_{m_h}(x_i)}{\sum_{h'=1}^H w_{h'}\, K_{m_{h'}}(x_i)}. \qquad ({\rm S44}) $$

2. The mean and variance of each cluster were updated using

$$ m_h \sim \mathcal{N}(\mu_h \mid \hat\mu_h, \hat\kappa_h \sigma_h^2)\, \Gamma^{-1}(\sigma_h^2 \mid \hat\alpha_h, \hat\beta_h), \qquad ({\rm S45}) $$

where

$$ \hat\mu_h = \hat\kappa_h \left( \frac{\hat\mu}{\hat\kappa} + n_h \langle x_h \rangle \right), \qquad ({\rm S46}) $$
$$ \hat\kappa_h = \frac{\hat\kappa}{1 + n_h \hat\kappa}, \qquad ({\rm S47}) $$
$$ \hat\alpha_h = \hat\alpha + \frac{n_h}{2}, \qquad ({\rm S48}) $$
$$ \hat\beta_h = \hat\beta + \frac{1}{2} \left( \sum_{i \in h} (x_i - \langle x_h \rangle)^2 + \frac{n_h}{1 + n_h \hat\kappa} (\langle x_h \rangle - \hat\mu)^2 \right). \qquad ({\rm S49}) $$

Here, $x_h$ represents the set of data points belonging to cluster $h$ and $n_h = |x_h|$.

3. The cluster weights were updated by sampling

$$ w_1, \ldots, w_H \sim {\rm Dirichlet}(1 + n_1, \ldots, 1 + n_H). \qquad ({\rm S50}) $$
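One full pass of this sampler, written out explicitly (an illustrative sketch with the conjugate updates as reconstructed in Eqs. S44-S50; `mu0`, `kappa0`, `alpha0`, `beta0` are the hats of Eq. S43):

```python
import numpy as np

def gibbs_sweep(x, w, mu, sigma2, mu0, kappa0, alpha0, beta0, rng):
    """One iteration of the finite-DPMM Gibbs sampler (Eqs. S44-S50)."""
    N, H = len(x), len(w)
    # Step 1: assign each datum to a cluster with probability p(h | x_i), Eq. S44.
    K = np.exp(-0.5 * (x[:, None] - mu[None, :])**2 / sigma2[None, :])
    K /= np.sqrt(2 * np.pi * sigma2[None, :])
    p = w[None, :] * K
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(H, p=p[i]) for i in range(N)])
    # Step 2: update each cluster's (mu_h, sigma2_h) from the normal-inverse-gamma
    # conditional, Eqs. S45-S49.
    n = np.bincount(z, minlength=H)
    for h in range(H):
        xh = x[z == h]
        xbar = xh.mean() if n[h] > 0 else 0.0
        kappa_h = 1.0 / (1.0 / kappa0 + n[h])                       # Eq. S47
        mu_h = kappa_h * (mu0 / kappa0 + n[h] * xbar)               # Eq. S46
        alpha_h = alpha0 + n[h] / 2.0                               # Eq. S48
        beta_h = beta0 + 0.5 * (np.sum((xh - xbar)**2)              # Eq. S49
                                + n[h] / (1.0 + n[h] * kappa0) * (xbar - mu0)**2)
        sigma2[h] = 1.0 / rng.gamma(alpha_h, 1.0 / beta_h)          # inverse-gamma draw
        mu[h] = rng.normal(mu_h, np.sqrt(kappa_h * sigma2[h]))
    # Step 3: resample the mixture weights, Eq. S50.
    w[:] = rng.dirichlet(1.0 + n)
    return z
```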
SI.10. COMPUTATIONAL COMPLEXITY

An explicit expression for the algorithmic complexity of DEFT is not very helpful for understanding runtime performance. This is because DEFT involves multiple steps computed in series, the runtimes of which are governed by different parameters. In practice, we have found DEFT to be primarily limited by the number of grid points $G$. This is because a computation of the evidence ratio $E(\ell)$, as well as posterior sampling, requires a spectral decomposition of the $G \times G$ Hessian matrix at each lengthscale $\ell$ along the MAP curve. We note, however, that DEFT computations with $G = 100$ are generally quite fast (i.e., $\sim 0.25$ seconds on a standard laptop computer). Although DEFT does require histogramming the data, which is $O(N)$, this is rarely the bottleneck in practice. In fact, we have found that the speed of DEFT often increases with $N$, since this leads to a shorter MAP curve, thus requiring fewer discrete lengthscales $\ell$ to be examined.

In our computations for Fig. 2 and Fig. S3, DEFT was often faster than our KDE and DPMM implementations. The use of jackknife cross-validation greatly slows down KDE in a manner that increases linearly with $N$. DPMM, on the other hand, is greatly slowed down by its reliance on Gibbs sampling, which is necessitated by the non-convexity of the parameter posterior. In fact, Gibbs sampling is needed not just to generate a posterior sample, but also to estimate $Q^*$ (via a posterior mean). We note that the accuracy of KDE and DPMM is also very sensitive to the choice of kernel, especially when data is clustered near the $x$-interval boundaries.

FIG. S3. (Color) Extension of Fig. 2 to other choices of $Q_{\rm true}$. The same analysis as in Fig. 2 was performed for twelve additional $Q_{\rm true}$ distributions (Argus, Beta, Bradford, Cauchy, ChiSquare, ExpPower, ExpWeibull, FoldedNormal, SkewNorm, Wald, WeibullMin, and WrapCauchy), which were selected from the built-in distributions in the scipy.stats Python library.

FIG. S4. (Color) Demonstration of DEFT on data from the World Health Organization (WHO). Densities were estimated for 9 different global health indicators reported by the WHO in [12] (per capita alcohol consumption, clean water access, particulate matter in air, family planning needs met, overweight children, healthy life expectancy, total population, homicide mortality, and road traffic mortality). Each datum corresponds to a different country; $N$ varies between panels because of missing data in [12]. Orange shows a histogram of each global health indicator computed using $G = 100$ grid points. The best DEFT estimate $Q^*$ is shown in dark blue, while 100 posterior-sampled densities $Q \sim p(Q \mid {\rm data})$ are also shown.