Restoring a smooth function from its noisy integrals
Olga Goulko
Department of Physics, University of Massachusetts, Amherst, MA 01003, USA and Present address: Raymond and Beverly Sackler School of Chemistry and School of Physics and Astronomy, Tel Aviv University, Tel Aviv 6997801, Israel
Nikolay Prokof’ev
Department of Physics, University of Massachusetts, Amherst, MA 01003, USA and National Research Center “Kurchatov Institute,” 123182 Moscow, Russia
Boris Svistunov
Department of Physics, University of Massachusetts, Amherst, MA 01003, USA; National Research Center “Kurchatov Institute,” 123182 Moscow, Russia; and Wilczek Quantum Center, School of Physics and Astronomy and T. D. Lee Institute, Shanghai Jiao Tong University, Shanghai 200240, China

(Dated: May 14, 2018)

Numerical (and experimental) data analysis often requires the restoration of a smooth function from a set of sampled integrals over finite bins. We present the bin hierarchy method that efficiently computes the maximally smooth function from the sampled integrals using essentially all the information contained in the data. We perform extensive tests with different classes of functions and levels of data quality, including Monte Carlo data suffering from a severe sign problem and physical data for the Green’s function of the Fröhlich polaron.
PACS numbers: 02.60.-x, 02.70.-c, 07.05.Rm, 87.57.np
I. INTRODUCTION
Sampling is at the heart of many Monte Carlo computations and experimental measurements. Data points are generated or measured and we eventually want to restore the underlying smooth probability density distribution $f(x)$ behind them. Many density estimation protocols for lists of statistical data exist, but for large-scale Monte Carlo calculations storing the individual data points would require petabytes of memory and lead to a substantial slowing down of the simulation. More importantly, the individual data points are not needed, provided that we know that $f(x)$ is structureless below a certain scale. Hence, without losing any essential information, we can collect integrals of $f(x)$ over finite-size bins. The problem then is how to extract all the information from these integrals without systematic bias, augmenting it with our a priori knowledge of the smoothness of the function.

In this paper, we propose a simple and efficient solution, which we call the bin hierarchy method (BHM). We note that the problem is closely related to the problem of numerical analytic continuation under consistent constraints, but with a crucial simplification. Unlike for the numerical restoration of spectral functions, we do not have to account for possible sharp features, such as $\delta$-function peaks or kinks, which cannot be resolved within the sampled error bars. Assuming that $f(x)$ is smooth, we are allowed to parametrize its optimal approximation $\tilde f(x)$ as a polynomial spline (or a spline with a general functional ansatz).

The protocol of constructing $\tilde f(x)$ has some similarities with the smoothing spline approach. However, there is a fundamental difference in the data structure. Rather than fitting to approximate function values on a given set of points, we fit the spline directly to the sampled integrals.
The central requirement is that $\tilde f(x)$ must be consistent not only with the sampled integrals over the elementary histogram bins, but also with the integrals over any combination of these. Integrals over large bins are known with higher precision than those over smaller bins, since the former contain more data points. Hence, accurately recovering a large integral is a priority over recovering a smaller one. In particular, bins below a certain scale contain noise rather than information about $f(x)$, and thus can be safely omitted in the fitting. We formalize this idea by introducing a hierarchy of bins with one bin over the entire domain of $f(x)$ on the top level, and doubling the number of bins on each subsequent level. The goodness of fit is evaluated on each level $n$ separately, with $\chi_n^2/\tilde n$ as the measure, where $\tilde n$ is the number of bins on the given hierarchy level.

Out of all possible approximations $\tilde f(x)$ consistent with the sampled integrals we select the one with the least features, i.e., the one with the fewest fitting parameters. This is achieved by fitting a spline of order $m$ (typically $m=3$) with the minimal number of knots that yields an acceptable fit. The number and the positions of the knots are determined automatically by the fitting algorithm. Known boundary conditions on $f(x)$ can also be accounted for. If desired, additional optimization terms, for instance for the jump in the highest spline derivative, can be included in the fitting procedure, with iterative improvement in the spirit of the method of consistent constraints.

The paper is organized as follows. In Sec. II, we review alternative methods to construct $\tilde f(x)$ and argue that the BHM is superior in accuracy, efficiency, and simplicity. The details of the BHM are presented in Sec. III. In Sec. IV, we present several tests of the BHM. We conclude in Sec. V.

II. REVIEW OF ALTERNATIVE METHODS

A. Naive histograms
The most straightforward way to approximate a distribution $f(x)$ is the histogram: divide the domain into several bins and count how many times the sampled values of $x$, which were generated with probabilities given by $f(x)$, fall into each of the bins. For each generated $x \in \text{bin } i$, the counter $c_i$ for the bin $i$ is increased by one. The distribution is then approximated by
$$\tilde f(x) = \sum_i \frac{c_i}{N \Delta_i}\, \Pi_i(x), \qquad (1)$$
where $N$ is the total number of sampled points, $\Delta_i$ is the width of the bin $i$, and $\Pi_i(x)$ is the boxcar function, which is one if $x \in \text{bin } i$ and zero otherwise. In what follows, we refer to this method as “naive histogramming” to emphasize the contrast with the BHM, which also involves histogramming as part of the protocol. The end result of the naive histogram method is a staircase function, which is not smooth. This drawback can be ameliorated by taking $c_i/(N\Delta_i)$ as the value of $\tilde f(x)$ at the bin centers and fitting a smoothing spline to the data with their statistical error bars. A more important issue is that the naive histogram method suffers from an inherent compromise between the resolution of features and statistical noise. Sufficiently narrow bins are necessary to resolve $f(x)$ without introducing a significant systematic bias, while reducing the size of the bins increases the noise in the sampled counters.

B. Basis projection method
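Before turning to the generalization, the naive histogram estimator of Eq. (1) can be sketched in a few lines; the sampled distribution and the bin edges below are hypothetical choices for illustration only.

```python
import numpy as np

def naive_histogram(samples, edges):
    """Eq. (1): the staircase estimate c_i / (N * Delta_i) on each bin."""
    counts, _ = np.histogram(samples, bins=edges)
    return counts / (len(samples) * np.diff(edges))

rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=100_000)   # samples of f(x) = 1 on [0, 1]
edges = np.linspace(0.0, 1.0, 11)               # ten equal bins
f_tilde = naive_histogram(samples, edges)       # one staircase value per bin
```

By construction the staircase integrates to one, while each bin value fluctuates around the true density with an error that grows as the bins shrink, which is the resolution/noise compromise described above.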
A generalization of the naive histogram method, the basis projection method, can significantly reduce this drawback. This method is based on a generalized Fourier series expansion of $f(x)$. For each bin $i$, we choose a basis $\{e_i^{(j)}(x)\}$ of $m_i$ real-valued functions with $e_i^{(j)}(x) = 0$ if $x \notin \text{bin } i$ that satisfy the orthonormality condition
$$\langle e_i^{(j)} | e_i^{(j')} \rangle \equiv \int_{\text{bin } i} w_i(x)\, e_i^{(j)}(x)\, e_i^{(j')}(x)\, dx = \delta_{jj'}. \qquad (2)$$
The non-negative weight function $w_i(x)$ of the inner product should be chosen such that all integrals are convergent. If no divergences are present, $w_i(x)$ can be set to unity. The basis and weight functions may be different for each bin, and can also be defined for bins of infinite size. In general, the basis functions can be chosen to reflect the properties of $f(x)$ that are known in advance, for instance known divergences or asymptotic behavior. Otherwise, in the majority of cases, shifted Legendre polynomials are a reasonable choice.

With the basis projection method, in each bin $i$ we keep track of several counters $c_i^{(j)}$ estimating the $m_i$ generalized Fourier coefficients $\langle e_i^{(j)} | f \rangle$. Specifically, each time we generate $x \in \text{bin } i$ we add the value $w_i(x)\, e_i^{(j)}(x)$ (rather than one) to each of the $c_i^{(j)}$ counters. The function $\tilde f(x)$ is then given by
$$\tilde f(x) = \sum_i \sum_{j=1}^{m_i} \frac{c_i^{(j)}}{N}\, e_i^{(j)}(x). \qquad (3)$$
Note that Eq. (1) is a special case of the above formula with $m_i = 1$, $w_i(x) = 1$, and $e_i^{(1)}(x) = \Pi_i(x)/\sqrt{\Delta_i}$. It can easily be seen that this procedure corresponds to the best possible representation of $f(x)$ in terms of orthogonal basis functions on each bin $i$, since minimizing
$$\int_{\text{bin } i} w_i(x)\left(f(x) - \tilde f(x)\right)^2 dx = \int_{\text{bin } i} w_i(x)\left(f(x) - \sum_{j=1}^{m_i} \frac{c_i^{(j)}}{N}\, e_i^{(j)}(x)\right)^2 dx \qquad (4)$$
with respect to the $c_i^{(j)}$ implies
$$\int_{\text{bin } i} w_i(x)\, e_i^{(j)}(x)\left(f(x) - \sum_{j'=1}^{m_i} \frac{c_i^{(j')}}{N}\, e_i^{(j')}(x)\right) dx = 0, \qquad (5)$$
and consequently $c_i^{(j)}/N = \langle e_i^{(j)} | f \rangle$.

Using basis projections we collect more information in each sampling step than with the naive histogram, since some information about the position of $x$ inside the bin is preserved. Therefore fewer bins are necessary to resolve the function. The bin widths and basis sizes $m_i$ can be chosen such that the systematic error is negligible compared to the statistical error, without a substantial increase in the statistical noise of the $\tilde f(x)$ values.

While the basis projection method is a significant improvement over the naive histogram, it still has several disadvantages. The binning and the basis functions must be chosen at the beginning of the sampling process and cannot be altered retrospectively. Basis projections can also be computationally expensive, especially for a large basis, or a basis built of complicated functions. Moreover, the sampled projections generally exhibit large jumps at the bin boundaries because the basis functions may be unbounded. As before, we can eliminate these jumps by taking the values of $\tilde f(x)$ at several points inside each bin and then constructing a smoothing spline. This procedure, however, involves an unnecessary loss of information.
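The counter update of Eqs. (2)–(3) can be sketched for a single bin with a shifted Legendre basis. This is a minimal illustration under our own assumptions: one bin $[0,1]$, weight $w(x)=1$, a cubic basis, and the test density $f(x)=2x$ (so every sampled value is $v=\mathrm{sign}(f)=1$); the function names are ours.

```python
import numpy as np
from numpy.polynomial import legendre

def e_j(j, x, a, b):
    # Orthonormal shifted Legendre polynomial on the bin [a, b] for w(x) = 1:
    # e_j(x) = sqrt((2j+1)/(b-a)) * P_j(2(x-a)/(b-a) - 1), satisfying Eq. (2).
    t = 2.0 * (x - a) / (b - a) - 1.0
    coef = np.zeros(j + 1)
    coef[j] = 1.0
    return np.sqrt((2 * j + 1) / (b - a)) * legendre.legval(t, coef)

rng = np.random.default_rng(1)
N = 200_000
x = np.sqrt(rng.uniform(size=N))      # inverse-CDF sampling of f(x) = 2x on [0, 1]
m = 4                                 # cubic basis: j = 0, 1, 2, 3

# Each generated x adds w(x) e_j(x) to the counter c_j, so c_j/N -> <e_j|f>.
c = np.array([e_j(j, x, 0.0, 1.0).sum() for j in range(m)])

# Eq. (3): reconstruct f~(x) = sum_j (c_j / N) e_j(x).
xs = np.linspace(0.0, 1.0, 101)
f_tilde = sum((c[j] / N) * e_j(j, xs, 0.0, 1.0) for j in range(m))
```

Because the basis contains the linear function exactly, the reconstruction reproduces $f(x)=2x$ up to statistical noise in the coefficients.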
Rather than reducing the fit to specific individual points, it is natural to fit a spline to the sampled integrals $\langle e_i^{(j)} | f \rangle$ directly, resulting in smaller errors. In addition, we can extract more exhaustive information about $f(x)$ by using overlapping bins of different size that cover a range of relevant scales of the function. This is achieved by the BHM described in Sec. III.

C. Reweighting
Reweighting is a Monte Carlo-specific technique that allows one to convert the statistics generated for a given variable $x$ to statistics for other values of this variable. This is achieved in the following way. Prior to sampling, we choose a fixed set of points $x_i$ and associate a bin with each of the $x_i$. The bins can be large and overlapping. Whenever the random variable $x$ falls into the bin $i$ associated with the point $x_i$, we update the $x_i$ counter by the reweighted value $v\, p(x_i)/p(x)$. Here $v$ is the value that would have been sampled without reweighting, and $p(x_i)/p(x)$ is the ratio of the probability weights at $x_i$ and $x$, where $p(x)$ is the probability distribution used to sample the physical function $f(x)$. This can be seen as a deterministic “update” from the generated point $x$ to the fixed point $x_i$. For sufficiently large bins, several counters may be updated simultaneously for each generated $x$. The reweighting technique provides an unbiased estimator for the value $f(x_i)$ at any fixed $x_i$.

The reweighting method is very efficient when one is interested in a single special value of a certain continuous variable. A characteristic example is the sampling of the single-particle density matrix from the statistics for the Green’s function by reweighting the latter to the imaginary-time value $\tau = -0$. In all other cases, reweighting is computationally expensive compared to the gain in accuracy. It also requires a case-specific implementation that assumes knowledge of the analytical form of the reweighting factors. To be maximally efficient, one has to adjust the size of bin $i$ for each $x_i$ of interest. The bin size must be optimized in such a way that, on the one hand, it is large enough to ensure sufficiently large statistics, and on the other hand small enough that the reweighting factors remain of the order of unity (to avoid numerous vanishingly small contributions that consume simulation time). Bins that are too large can also result in a strong increase in the autocorrelation time, since reweighting from a rare event to a frequent one implies a large $p(x_i)/p(x)$ ratio, which results in occasional anomalously large contributions to the sampled counters. Moreover, to produce a smooth outcome, the reweighting protocol requires an appropriately dense mesh of points with strongly overlapping bins, which is computationally expensive since a large number of counters needs to be updated for each generated $x$. For a continuous set of function values, a smoothing spline needs to be fitted at additional computational cost, comparable to the cost of the BHM fit described in Sec. III.

III. BIN HIERARCHY METHOD
The BHM restores smooth distributions in a universal, unbiased, and efficient manner. The method satisfies the following requirements:

• All the information contained in the sampled data should be used to fit the distribution: the accuracy should be essentially the same as if we had stored the full list of the generated values of $x$. This is achieved by introducing a hierarchy of overlapping histogram bins of different size, where large bins give precise estimates of the broad features of the distribution, while small bins resolve its fine structure once enough data have accumulated. Note that we assume that the sampled distribution is structureless below a certain scale, and hence there always exists a binning that is sufficiently fine to resolve all of its features.

• The resulting function should be exactly smooth, meaning that the function and all its derivatives up to a given order must have no jumps. This is achieved by fitting a polynomial (or generalized) spline.

• Out of all smooth functions consistent with the sampled data within their error bars, the selected result should have the least features (peaks, oscillations, etc.). This is achieved by choosing the spline with the minimal number of free parameters that fits the data.

• The sampling process and the restoration of the smooth function should be computationally and memory efficient. In particular, they should be more efficient than the basis projection method.

• The algorithm should work well regardless of the quality and the number of sampled data. As more data are collected and the statistics improves, so should the fitted distribution, without the need for any external adjustments.

• The algorithm should be highly automated. No human input or control should be necessary other than a limited choice of initial parameters. In particular, the method should be applicable to all reasonably occurring functions and should require no adjustment for different classes of functions.

Below we describe the setup in detail.
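To make the bookkeeping concrete before the detailed description, the sampling stage reduces to a few per-bin accumulators (count, running mean, and scaled variance), from which the bin integrals and their errors follow. This is a minimal sketch under our own naming conventions (the class, its methods, and the Welford-style online update are illustrative, not the authors' implementation):

```python
import numpy as np

class ElementaryBins:
    """Per-bin accumulators: the count N_i, the running mean v_bar_i, and the
    scaled variance M2(v_i) = sum over the bin of (v - v_bar_i)^2,
    updated online with Welford's algorithm. Illustrative sketch only."""
    def __init__(self, edges):
        self.edges = np.asarray(edges, dtype=float)
        n_bins = len(self.edges) - 1
        self.N_i = np.zeros(n_bins, dtype=int)
        self.vbar = np.zeros(n_bins)
        self.M2 = np.zeros(n_bins)
        self.N = 0                        # total number of sampling steps

    def add(self, x, v):
        self.N += 1
        i = np.searchsorted(self.edges, x, side="right") - 1
        i = min(max(i, 0), len(self.N_i) - 1)
        self.N_i[i] += 1
        d = v - self.vbar[i]
        self.vbar[i] += d / self.N_i[i]
        self.M2[i] += d * (v - self.vbar[i])

    def integrals(self):
        """Sampled integrals I_i = v_bar_i N_i / N and their statistical errors."""
        I = self.vbar * self.N_i / self.N
        M2_I = self.M2 + self.vbar**2 * self.N_i * (self.N - self.N_i) / self.N
        var = M2_I / (self.N - 1)
        return I, np.sqrt(var / self.N)

rng = np.random.default_rng(0)
bins = ElementaryBins(np.linspace(0.0, 1.0, 5))   # 2^2 elementary bins
for x in rng.uniform(size=40_000):                # f(x) = 1 on [0, 1], so v = +1
    bins.add(x, 1.0)
I, dI = bins.integrals()
```

For a flat unit density on four equal bins, every $I_i$ fluctuates around $1/4$ within its error bar, and the integrals sum to one exactly by construction.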
A. Sampling stage
We divide the domain into $2^K$ non-overlapping elementary bins that do not need to be of equal width. Because in the end we form combinations of several bins, the elementary bins may be arbitrarily small. Bins with too few data points are automatically excluded from the analysis. As the number of sampled points increases, bins of smaller size become usable. The number of elementary bins is thus limited only by memory. The $K \to \infty$ limit would formally correspond to keeping the full list of the generated values of $x$. In practice, the distributions of interest are structureless below a certain scale, which sets a natural limit on the required elementary bin width.

We generate values of $x$ according to a probability distribution given by $|f(x)|$ (importance sampling) and sample the value $v = \mathrm{sign}[f(x)]$ in the elementary bin $i$ that contains $x$. We can also include additional weighting coefficients in the sampled values. For each bin we store the number of sampled values in the bin, $N_i$, as well as the average $\bar v_i$ and the scaled variance $M_2(v_i) = \sum_{x \in \text{bin } i} (v - \bar v_i)^2$ of the sampled values over the bin. We obtain the sampled integral $I_i$ over each elementary bin via $I_i = \bar v_i N_i / N$, where $N$ is the total number of sampled points. The variance of the sampled integral is calculated via
$$\mathrm{Var}(I_i) = \frac{M_2(I_i)}{N-1}, \qquad M_2(I_i) = M_2(v_i) + \bar v_i^2\, \frac{N_i (N - N_i)}{N}, \qquad (7)$$
and its error is given by $\delta I_i = \sqrt{\mathrm{Var}(I_i)/N}$.

B. Bin hierarchy
After sampling has finished, we construct combinations of elementary bins. These make up a hierarchy with $2^n$ bins at each level $n \in \{0, \ldots, K\}$. At the top level we have one large bin containing the entire integral; on the next level we have two bins containing the integrals over half of the interval each, etc. The fit is constructed to minimize
$$\sum_{n=0}^{K} \frac{\chi_n^2}{2^n} = \chi_0^2 + \frac{\chi_1^2}{2} + \ldots + \frac{\chi_K^2}{2^K}, \qquad (8)$$
where $\chi_n^2$ is the goodness of the fit considering only the bins on the level $n$. Note that the construction of the bin hierarchy happens in the post-processing. During the sampling stage, only the values $N_i$, $\bar v_i$ and $M_2(v_i)$ in the elementary bins (corresponding to the hierarchy level $K$) need to be collected and stored.

We tested several modifications of this setup, which produced consistent results. Including more than $2^n$ bins on each level by using overlapping bins of equal size was found to bring no significant improvement. Likewise, using weights other than $1/2^n$ for $\chi_n^2$ did not improve the result. In general, finer bins at higher levels do not substantially contribute to the form of the fit, as their error bars are large and the shape of the distribution is usually already determined by the lower-level integrals. But they are still included in the fitting procedure (provided that they contain enough data points for statistics) to resolve potential fine structures of the distribution. For example, a periodically oscillating function like the sine can have zero average over large parts of the domain, while being nonzero on a finer scale. The structure of such a function will be resolved from the integrals over small bins if they contain enough data. Due to the $1/2^n$ weight factors, the small bins cannot overpower the large-bin contributions.

Note that while for minimization we use the sum over all levels, to decide whether a fit is accepted we check the goodness of the fit at every level separately.
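The hierarchy construction and the weighted objective of Eq. (8) can be sketched as follows. This is a minimal illustration under the assumption of statistically independent elementary bins (so errors of merged bins add in quadrature); the function names are ours.

```python
import numpy as np

def build_hierarchy(I_elem, err_elem):
    """Combine 2^K elementary integrals into levels n = 0, ..., K by pairwise
    summation; errors add in quadrature for independent bins. Returns a list
    indexed by level n, each entry a pair (integrals, errors)."""
    levels = [(np.asarray(I_elem, dtype=float), np.asarray(err_elem, dtype=float))]
    while len(levels[-1][0]) > 1:
        I, err = levels[-1]
        levels.append((I[0::2] + I[1::2], np.sqrt(err[0::2]**2 + err[1::2]**2)))
    return levels[::-1]                   # levels[0] is the single top-level bin

def weighted_objective(levels, model_levels):
    """Eq. (8): sum_n chi2_n / 2^n, with chi2_n computed from the level-n bins
    and the corresponding integrals of the trial spline."""
    total = 0.0
    for n, ((I, err), I_model) in enumerate(zip(levels, model_levels)):
        chi2_n = np.sum(((I - I_model) / err) ** 2)
        total += chi2_n / 2**n
    return total

# Toy check with K = 2: four elementary integrals with unit errors.
levels = build_hierarchy([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0])
exact = [lvl[0].copy() for lvl in levels]  # a "model" that reproduces the data
```

For this toy input the levels hold the bins $\{10\}$, $\{3, 7\}$, and $\{1, 2, 3, 4\}$, and a model reproducing the data gives a vanishing objective.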
A fit is accepted only if each of the $\chi_n^2/\tilde n$ is approximately 1, within a given number $T$ of standard deviations $\sigma = \sqrt{2/\tilde n}$ (here $\tilde n \le 2^n$ is the number of bins on level $n$ that contain sufficient data to be used for fitting). The fit acceptance threshold $T$ is an external input parameter, typically $T = 2$. Lower values of the threshold are likely to result in the failure to produce an acceptable fit. Instead of a fixed value, a range of $T$ can also be specified, for example between two and four. If there is no acceptable fit at the lowest threshold value, it is gradually increased until either the maximum is reached or an acceptable fit is found. For a large number of hierarchy levels, the statistical probability for a level to exceed a $2\sigma$ threshold on $\chi_n^2/\tilde n$ becomes non-negligible, even if the overall fit is good. Specifying a threshold range allows one to attempt fits with lower threshold values without risking a failed fit.

C. Polynomial fit
The sampled integrals are fitted to a spline. In what follows we discuss polynomial (typically, cubic) splines, but a generalization to other functions is straightforward. Here we briefly review the theory of fitting a single polynomial $p(x) = \sum_{k=0}^{m} a_k x^k$, before we turn to the full spline in the next section.

There exists a lot of standard software for fitting a set of points to a function. Here we want to match the polynomial to a set of integrals, which requires a slight adjustment. The derivation follows the steps of general linear least-squares fitting. The best polynomial fit minimizes
$$\chi^2 = \sum_i \left( \frac{I_i - I_i^{(p)}}{\delta I_i} \right)^2, \qquad (9)$$
where $I_i^{(p)}$ is the integral of the polynomial over the bin $i$. For simplicity we omit the sum over bin levels and the corresponding weighting factors in this section, but the generalization is straightforward. Let us label the boundaries of bin $i$ by $\bar x_i$ and $\bar x_{i+1}$. The integral of the polynomial over the bin equals
$$\int_{\bar x_i}^{\bar x_{i+1}} p(x)\, dx = \sum_{k=0}^{m} \frac{a_k}{k+1} \left( \bar x_{i+1}^{k+1} - \bar x_i^{k+1} \right) \equiv \sum_{k=0}^{m} a_k X_{ik}, \qquad (10)$$
and Eq. (9) becomes
$$\chi^2 = \sum_i \left( \frac{I_i}{\delta I_i} - \sum_{k=0}^{m} \frac{a_k X_{ik}}{\delta I_i} \right)^2 \equiv \sum_i \left( b_i - \sum_k A_{ik} a_k \right)^2 = |A \cdot a - b|^2. \qquad (11)$$
The vector $b$ contains the sampled integrals divided by the error bar (its length equals the number of bins $\tilde n$), the vector $a$ of length $m+1$ contains the free parameters of the polynomial, and the $\tilde n \times (m+1)$ design matrix $A$ is defined by $A_{ik} = X_{ik}/\delta I_i$. The design matrix has more rows than columns since there must be more data points than fit parameters. We find the vector $a$ that minimizes $|A \cdot a - b|^2$ using singular value decomposition of the design matrix. The error on the fitted polynomial at a given point $x$ is obtained from the covariance matrix $C_{jk} = \mathrm{Cov}(a_j, a_k)$ via
$$\delta p(x) = \sqrt{ \sum_{i,j=0}^{m} C_{ij}\, x^{i+j} }. \qquad (12)$$

D. Spline
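Each spline piece is a polynomial fitted to bin integrals exactly as in Eqs. (9)–(12), so that building block is worth sketching before the full spline. This is a minimal illustration with our own function names; the covariance is taken as $(A^T A)^{-1}$, which corresponds to the linear least-squares case described above.

```python
import numpy as np

def fit_poly_to_integrals(edges, I, dI, m):
    """Least-squares fit of p(x) = sum_k a_k x^k to sampled bin integrals:
    X_ik = (x_{i+1}^{k+1} - x_i^{k+1})/(k+1)  [Eq. (10)],
    A_ik = X_ik / dI_i, b_i = I_i / dI_i      [Eq. (11)],
    solved via SVD (np.linalg.lstsq)."""
    x = np.asarray(edges, dtype=float)
    k = np.arange(m + 1)
    X = (x[1:, None] ** (k + 1) - x[:-1, None] ** (k + 1)) / (k + 1)
    A = X / dI[:, None]
    b = I / dI
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    cov = np.linalg.inv(A.T @ A)          # covariance matrix C of Eq. (12)
    return a, cov

def delta_p(cov, x):
    """Pointwise fit error of Eq. (12): sqrt(sum_ij C_ij x^(i+j))."""
    idx = np.arange(cov.shape[0])
    return float(np.sqrt(np.sum(cov * x ** (idx[:, None] + idx[None, :]))))

# Exact integrals of p(x) = 1 + 2x over 8 bins on [0, 1] should be recovered.
edges = np.linspace(0.0, 1.0, 9)
I_exact = np.diff(edges + edges**2)       # integral of 1 + 2x over each bin
dI = np.full(8, 0.01)
a, cov = fit_poly_to_integrals(edges, I_exact, dI, m=1)
```

Feeding in exact integrals of a linear function recovers its coefficients to machine precision, while `delta_p` returns the statistical error band implied by the assigned $\delta I_i$.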
At each knot between two spline pieces, the values of the corresponding polynomials and their derivatives up to order $(m-1)$ are equal. This gives $m$ constraints at each knot. Since each polynomial has $m+1$ coefficients, the total number of free parameters is $n_p + m$, where $n_p$ is the number of spline pieces. The extra $m$ parameters arise since there is one more polynomial piece than knots. Known boundary conditions, if any, further reduce the number of free parameters.

The algorithm determines the positions of the knots using the following procedure. Begin by attempting to fit one polynomial on the whole interval. If this fit is not acceptable, split the interval into two equal parts and attempt fitting a two-piece spline. If this fit is not acceptable either, check the goodness of the fit on each interval separately. Intervals where the fit is not acceptable have to be split in the next iteration, while all other intervals remain unchanged. Repeat this procedure until either an acceptable fit is found, or the intervals can no longer be split. We require that the set of equations for each spline piece be overdetermined, i.e., that the number of bins inside each interval is larger than $m+1$. This sets the limit on the maximal number of spline pieces.

To check the goodness of the fit on an individual interval, we compute the values of $\chi_n^2/\tilde n$, where $\chi_n^2$ is evaluated only on bins of level $n$ that are fully inside the given interval (and that contain enough statistical data), and $\tilde n$ is the number of such bins. Larger bins that reach over several intervals cannot be used for this check, but they are used to check the goodness of the overall fit. In all test cases, if the polynomials on the individual intervals passed the goodness-of-fit test, the overall spline evaluated over the entire domain also passed. Note that, as before, the checks on the individual intervals are performed for each hierarchy level separately. If more than half of the level $n$ bins inside a given interval do not contain enough data, the check for this interval terminates without proceeding to subsequent levels.
E. Constraining jumps in the highest derivative
An additional constraint on the jumps in the $m$th derivative of the spline can be imposed on the final fit, if desired. Jumps in the derivatives up to order $m-1$ are zero by construction. The piecewise constant $m$th derivative, which is proportional to the value of the parameter $a_m^{(j)}$ on each interval $j$, is the only one that exhibits jumps at the knots. Imposing an additional constraint that aims to minimize these jumps ensures a continuous evolution of the fitted spline as a function of the number of sampled points $N$. The number of spline pieces for the best fit generally increases with growing $N$, as better statistics permits a better resolution of the sampled function. Additional interval divisions are triggered at certain discrete values of $N$, namely when one of the $\chi_n^2$ values on the affected interval crosses the goodness-of-fit threshold. This creates an additional jump in the $m$th derivative at the position of the new knot. This jump can be closed with only a small sacrifice in $\chi_n^2$, since the previous spline with one fewer knot was almost acceptable. In this sense, the additional constraint on the jump acts as a compensation mechanism for a potentially too vigorous interval splitting. In most cases, constraining the jump has almost no effect on the fit.

The constraint term has the form
$$\frac{\lambda}{n_p} \sum_{j=1}^{n_p - 1} \lambda_j \left( \frac{a_m^{(j+1)} - a_m^{(j)}}{\bar a_m^{(j+1)} - \bar a_m^{(j)}} \right)^2, \qquad (13)$$
where $\bar a_m^{(j)}$ are the polynomial parameters of the optimal fit without the constraint and the weights $\lambda$ and $\lambda_j$ control the strength of the constraint. After computing the optimal spline through minimization of Eq. (8), a new spline fit is attempted on the same interval division, this time by minimizing the sum of (8) and the constraint (13). The local weights $\lambda_j$ are determined by the goodness of the original fit (without constraint) on the intervals $j-1$, $j$, $j+1$ and $j+2$, which are adjacent or next-to-adjacent to the respective knot.
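The penalty of Eq. (13) itself is a one-liner once the highest-derivative coefficients of the constrained and reference fits are collected; the constrained fit then minimizes the sum of Eq. (8) and this term. A minimal sketch (the function name and interface are ours):

```python
import numpy as np

def jump_penalty(a_m, a_m_ref, lam, lam_j):
    """Eq. (13): (lambda / n_p) * sum_j lambda_j *
    ((a_m^(j+1) - a_m^(j)) / (abar_m^(j+1) - abar_m^(j)))^2, where a_m holds
    the m-th-derivative coefficient of each of the n_p spline pieces and
    a_m_ref holds the same coefficients of the unconstrained reference fit."""
    a_m = np.asarray(a_m, dtype=float)
    n_p = len(a_m)
    jumps = np.diff(a_m) / np.diff(np.asarray(a_m_ref, dtype=float))
    return lam / n_p * np.sum(np.asarray(lam_j) * jumps**2)
```

A spline whose highest-derivative coefficients are equal across all pieces has no jumps and incurs zero penalty, while jumps comparable to those of the reference fit contribute with weight $\lambda \lambda_j / n_p$ each.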
Specifically, we determine (on each hierarchy level of each of these intervals) the difference between the maximally acceptable $\chi_n^2/\tilde n = 1 + T\sqrt{2/\tilde n}$ and the actual $\chi_n^2/\tilde n$. We take $\lambda_j$ to be the minimum of all these differences. This ensures that the constraint does not affect spline intervals where the original fit was already barely acceptable. The global weight $\lambda$ is iteratively adjusted to the maximal value that is still compatible with the imposed threshold on the $\chi_n^2/\tilde n$. Additional iterations can be performed by replacing the $\bar a_m^{(j)}$ with the values calculated with the constraint.

F. Errors on the spline coefficients
An estimate of the spline coefficient errors can be obtained from Eq. (12) for each spline piece. This equation is accurate for parametric fits to a set of statistically independent, normally distributed data points. This is not the case for the BHM, due to the overlapping bins and the unconventional $\chi^2$-minimization protocol. Because of this, one cannot attach Gaussian standard deviation probabilities to the error bars. A robust way of defining the error at a given point $x$ is to look at the histogram of $\tilde f(x)$ values over a large set of independent runs with the same sample size $N$. The smallest interval around the histogram mean that contains 68.27% of the $\tilde f(x)$ values corresponds to the robust estimate for two standard errors ($\pm\sigma$). This is the most rigorous, albeit impractical, way of defining the error under these circumstances. A suitable alternative protocol should yield errors that are close to the robust value.

We performed extensive tests of several error estimation protocols by comparing them to the robust estimate for different types of distribution $f(x)$. These tests confirmed that Eq. (12) provides a good error estimate in most cases. In some cases the error was found to be overestimated by a factor of about two compared to the robust error. This occurred in particular at the domain boundaries and for BHM splines with a large number of knots. A concrete example is presented in Sec. IV C. The error was never observed to be too small, and hence Eq. (12) gives a convenient and efficient way to obtain a conservative error estimate on the BHM fit.

A tighter error estimate can be obtained using bootstrap with the following protocol. Produce an array of $M$ histograms, each of which contains a fraction $N/M$ of the $N$ sampled data points, so that each point is sampled in exactly one of the histograms. Generate $\tilde M \ge M$ bootstrap histograms by taking $\tilde M$ random linear combinations (with positive integer coefficients) of the $M$ histograms. Each of these bootstrapped histograms then contains $N$ sampled points, but with possible repetitions. Run the BHM fit on each of the bootstrap histograms, fixing the positions of the knots to be the same as for the BHM fit of the regular histogram (the sum of all bootstrap histograms with unit weight) and without evaluating the goodness of fit. The error on the spline coefficients is then determined from the statistics of the bootstrapped histogram fits in the usual way. The bootstrap error was found to be very close to the robust error estimate in all our tests. This scheme uses more resources, especially memory, since $M \gtrsim 100$ histograms need to be stored instead of one. The increase in computational time is less significant, since performing fits with fixed knot positions is very fast.

Another error estimation protocol is based on the analysis of the evolution of the BHM fit as a function of the sample size $N$. For this, the BHM spline is produced and saved periodically in intervals of $\Delta$ sampling steps. The idea is that the fit results $\tilde f(x)$ will approach the exact value with the dispersion $\sigma \propto 1/\sqrt{N}$. The interval size $\Delta$ must be large enough to guarantee that the contributions $A_k$ from different intervals are statistically independent, allowing one to invoke the central limit theorem:
$$\tilde f_k = \frac{1}{k} \sum_{i=1}^{k} A_i \quad \Rightarrow \quad A_k = k \tilde f_k - (k-1) \tilde f_{k-1}. \qquad (14)$$
The dispersion of $\tilde f_k$ can then be obtained from the dispersion $\sigma_*^2$ of the independent $A_k$ via $\sigma_k^2 = \sigma_*^2/k$. An additional cutoff $k \ge k_0$ should also be introduced to reduce the effect of the initial portion of the statistics, which can be strongly affected by transient processes. A simple consistency criterion is that there should be a range of $\Delta$ and $k_0$ such that $\sigma_*$ is essentially independent of both parameters. Our tests showed that the errors calculated with this method are almost indistinguishable from the ones obtained with bootstrap, and hence also agree with the robust error. This method is somewhat slower, since the full BHM fit needs to be evaluated repeatedly, and requires a sufficiently large $N$. On the other hand, the $A_k$ sequence provides explicit insight into the statistics of the data and can help to identify statistical fluctuations.

G. Sign problem
Monte Carlo sampling often suffers from a sign prob-lem, when positive and negative terms occur with almostequal frequency in the sampling process. Before a suffi-cient amount of data has been collected, the error maythus substantially exceed the absolute value of the sam-pled integrals. The same applies to the error of the fittedspline. Fits where the noise substantially exceeds the sig-nal are unsuitable for further data analysis. To identifythis problem, the algorithm checks—at all bin levels—whether the sampled data are consistent with zero beforethe start of the fitting process. The data is deemed cer-tainly inconsistent with zero only if at least one of thefollowing conditions is met: (i) at any individual level n , the value of χ n / ˜ n for the zero function exceeds 1 byat least 4 standard deviations (4 (cid:112) / ˜ n ), or (ii) at anytwo individual levels, χ n / ˜ n exceeds 1 by at least 3 (cid:112) / ˜ n ,respectively, or (iii) χ n / ˜ n exceeds 1 by at least 3 (cid:112) / ˜ n at one level, and by at least 2 (cid:112) / ˜ n on two other levels,respectively, or (iv) at any four individual levels, χ n / ˜ n exceeds 1 by at least 2 (cid:112) / ˜ n , respectively. The probabil-ity for this to occur as a statistical fluctuation is lowerthan 10 − . Unless one of these conditions is met, the al- . . . . x . . . . f ( x ) ∝ − x / + x − x / (a)BHM splinebasis projection f ( x ) . . . . x − . − . . . ˜ f ( x ) − f ( x ) (b)BHM splinebasis projection FIG. 1. Cubic polynomial test function. The BHM fit (red dashed line with error band) in comparison with the basis projectionmethod (blue dotted line with error band). The left panel (a) shows the function and the fits, while the right panel (b) showsthe difference between the fitted and the true function for both methods. gorithm can be set to not continue with the fitting proce-dure until more data has been collected. 
The conditions are chosen to be strict, to minimize the risk of continuing the analysis with a noisy fit.

An alternative to this procedure is to consider the evolution of the fitted spline with accumulated statistics, accepting the spline only when it ceases to change substantially over Monte Carlo time,

$\int \bigl| \tilde f_{2N}(x) - \tilde f_N(x) \bigr| \, dx < \alpha \int \bigl| \tilde f_{2N}(x) \bigr| \, dx$, (15)

where $\tilde f_N(x)$ is the BHM spline obtained from $N$ sampled data points, $\alpha$ is the acceptance threshold, and the integration goes over the entire domain. Typical values for $\alpha$ are small, of the order of a few percent.

IV. NUMERICAL TESTS
We present tests of the BHM for several types of distribution. Unless otherwise stated, a moderate number of random sample points was taken for each test, to demonstrate the ability to restore the smooth distribution from limited statistics. We always fit cubic splines. The displayed error bands on the splines were obtained with Eq. (12).

For comparison, we sampled the same data points also with the basis projection method using a polynomial basis up to cubic order. The binning for basis projection sampling was retrospectively chosen to be the same as the interval division for the spline that was determined by the bin hierarchy algorithm. This is a strict test, since in practice the optimal binning for the projection method is not known in advance. For each of the tests we observed that the accuracy of the bin hierarchy fit was at least as good as that of the basis projection method, which, together with the method's efficiency and the guaranteed smoothness of $\tilde f(x)$, demonstrates the advantage of the BHM. We display the basis projection results for the first example of a polynomial distribution and omit them in the following examples to avoid overcrowding the plots.

A. Polynomial
The distribution sampled on the interval $[1, 2.8]$ is a cubic polynomial,

$f(x) \propto c_0 - c_1 x + c_2 x^2 - c_3 x^3$, (16)

with constants $c_i$ chosen such that $f(x) > 0$ on the entire domain. Since the distribution is a cubic polynomial, the algorithm should stop after fitting one polynomial on the entire domain. This is indeed the case. Figure 1 shows the BHM fit in comparison with the result obtained using the basis projection method. In this particular case, the basis projection method provides the most accurate estimate of the function by definition, since we are projecting on functions that have the same form as $f(x)$ and cover the whole interval $[1, 2.8]$ without introducing any systematic error. Nevertheless, BHM produces a result of comparable accuracy. This example demonstrates that the BHM fit competes with the basis projection method even when the latter is known to be optimal.

We also perform tests with a higher-order polynomial. The function sampled on the interval $[-1, 1]$ is the quartic

$f(x) \propto x^4 - c\,x^2$, (17)

with $0 < c < 1$, so that $f(x)$ changes sign on the domain. Since $f(x)$ changes sign, we use $|f(x)|$ as the probability to generate $x$ and then sample $\mathrm{sign}[f(x)]$. The BHM provides an accurate smooth fit of this function. The result in comparison with the basis projection method is shown in Fig. 2. Four spline pieces were needed to resolve the function in this example. As in the previous example, the errors on the BHM spline and the basis projection are comparable in size.

FIG. 2. Quartic polynomial test function. The BHM fit (red dashed line with error band) in comparison with the basis projection method (blue dotted line with error band). The left panel (a) shows the function and the fits, while the right panel (b) shows the difference between the fitted and the true function for both methods.

FIG. 3. BHM fit with error band of an exponential test function. The left panel (a) shows the function and the fit, while the right panel (b) shows the difference between the fitted and the true function.
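The sign-sampling procedure described above can be sketched as follows. This is an illustrative Python sketch, not the paper's actual simulation code: the coefficient $c = 0.5$ is a hypothetical stand-in for the one in Eq. (17), and simple rejection sampling is used in place of a Markov chain.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # sign-changing quartic of the same type as Eq. (17);
    # the coefficient 0.5 is illustrative, not the paper's value
    return x**4 - 0.5 * x**2

FMAX = 0.5  # maximum of |f| on [-1, 1], attained at the endpoints

def sample_signed(n):
    """Draw n points with density proportional to |f(x)| on [-1, 1]
    by rejection sampling, returning them together with sign[f(x)]."""
    xs = np.empty(0)
    while xs.size < n:
        cand = rng.uniform(-1.0, 1.0, size=4 * n)
        accept = rng.uniform(0.0, FMAX, size=cand.size) < np.abs(f(cand))
        xs = np.concatenate([xs, cand[accept]])
    xs = xs[:n]
    return xs, np.sign(f(xs))

# Signed histogram: bin sums of sign[f(x)] estimate the bin integrals of f
# up to a common normalization factor.
xs, signs = sample_signed(200_000)
hist, edges = np.histogram(xs, bins=20, range=(-1.0, 1.0), weights=signs)
```

The signed bin sums reproduce the shape of $f$: bins near the endpoints (where $f > 0$) accumulate positive weight, while bins near the origin (where $f < 0$) accumulate negative weight.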
B. Decaying exponential
The distribution sampled on the interval $[1, 2.8]$ is

$f(x) \propto \exp(-x)$. (18)

An exponentially decaying distribution is typical for many physical processes. In this example, the BHM produced an acceptable fit with two spline pieces on the intervals $[1, 1.9]$ and $[1.9, 2.8]$; the function and the fit are shown in Fig. 3. For comparison, Fig. 4 shows a fit that uses only the elementary bins; it does not properly resolve all sampled integrals of $f(x)$ on the first interval $[1, 1.9]$.

C. Oscillating function
The distribution sampled on the interval $[0.6, \pi + 0.6]$ is

$f(x) \propto 10 + \cos(10 x)$. (19)

FIG. 4. Exponential test function. The BHM fit (red dashed line with error band) in comparison with the fit using only elementary bins (blue dotted line with error band). The latter does not properly resolve all sampled integrals of the test function.
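Assuming the sampling interval $[0.6, \pi + 0.6]$, whose length $\pi$ covers five full periods of $\cos(10x)$, the cosine term in Eq. (19) integrates to zero, so the normalized density is $f(x) = 1/\pi + \cos(10x)/(10\pi)$. A quick numerical check of this normalization (our own NumPy sketch, not part of the BHM code):

```python
import numpy as np

# Normalized oscillating test density on [0.6, pi + 0.6]: the interval has
# length pi, i.e. five full periods of cos(10x), so the cosine integrates to
# zero and 10 + cos(10x) normalizes to f(x) = 1/pi + cos(10x)/(10*pi).
a, b = 0.6, np.pi + 0.6

def f(x):
    return 1.0 / np.pi + np.cos(10.0 * x) / (10.0 * np.pi)

# trapezoidal rule on a fine grid
x = np.linspace(a, b, 200_001)
y = f(x)
integral = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))  # should be 1
```

The density is also strictly positive, since $1/\pi > 1/(10\pi)$, as required of a probability distribution.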
The challenge is to resolve the periodic oscillations within the accuracy of the sampled data. For a small sample, we expect a flat fit reproducing the average of the function. As we collect more data the oscillations should be resolved. We show the bin hierarchy fit for a small sample and for a substantially larger sample in Fig. 5. The fit reproduces the original distribution well within error bars. Many spline pieces are needed to resolve the structure (in the examples shown: 15 to 16 intervals) and Eq. (12) overestimates the error on the spline coefficients. In Fig. 5 we show the error band obtained from Eq. (12) as well as the one obtained using bootstrap.

In Fig. 6 we explicitly compare the error estimates obtained with Eq. (12), bootstrap, and fit evolution. The bootstrap and fit evolution errors are nearly the same for all $x$. In this example, for bootstrap and fit evolution errors, the deviation $\tilde f(x) - f(x)$ exceeds $1\sigma$ on roughly 30% of the domain and $2\sigma$ on roughly 7% of the domain, which is comparable to the Gaussian case. For the Eq. (12) error, the deviation almost never exceeds $1\sigma$. We also present a robust error analysis based on 100 independent samples of equal size. The histogram of $\tilde f(x)$ values is shown in Fig. 7 for several values of $x$. For reference we indicate the specific value of $\tilde f(x)$ from the example shown in Fig. 5, to illustrate for which $x$ the fit values in this example are statistical outliers, and for which they are typical. We also show the median of the Eq. (12) errors and a representative bootstrap error (from the example shown in Fig. 5). Indeed, the size of the bootstrap and fit evolution errors is very close to the robust estimate, while the error from Eq. (12) is too large.

D. Divergent function on a semi-infinite domain
The distribution sampled on the interval $[0, \infty)$ is

$f(x) \propto \dfrac{1}{\sqrt{x}\,(1 + x)}$. (20)

The basis projection method has the advantage that the $1/\sqrt{x}$ divergence at $x = 0$ can be incorporated directly into the basis, and that the asymptotic behavior at large $x$ can be resolved using a semi-infinite bin. We present a setup that achieves the same in the BHM framework. The divergence can be avoided by weighting all sampled values with $\sqrt{x}$ and recovering the original distribution at the end by dividing the BHM spline by $\sqrt{x}$. The semi-infinite domain can be mapped onto a finite interval, for instance through a transform such as $y(x) = 2\arctan(x)/\pi$ (used in this example) or $y(x) = 1 - \exp(-x)$. For each $x$ generated according to the distribution $f(x)$, the value $y(x) \in [0, 1]$ is calculated and sampling is then performed into the corresponding $y$-histogram. To recover the correct function, the sampled values need to be scaled by $1/x'(y)$.

Figure 8 shows the BHM fit transformed back into the original domain. Due to the large domain, a larger number of sampling points was used for this example. The fit agrees well with the original distribution within error bars. In this example, the BHM produced an acceptable fit on one spline interval.

In order to compensate for the divergence, it is assumed that its location and type are known in advance. Often it is sufficient to know the approximate properties of the divergence. To demonstrate this, Fig. 9 shows BHM fits of data sampled with a weighting factor that slightly overestimates or underestimates the power of the divergence (the mapping of the semi-infinite domain onto a finite interval is the same as before). Despite the wrong scaling, the fits still agree with the original distribution, but the errors on the fits are larger. It is better to overestimate the power, since in this case the divergence is still fully compensated.

E. Sampling with a severe sign problem
We demonstrate the performance of the algorithm in the presence of a severe sign problem by generating data on the interval $[0, 3]$ with the distribution given by

$f(x) \propto \exp(-c\,x) - \exp(-x) \equiv f_+(x) - f_-(x)$, (21)

where the constant $c$ is close to 1, so that the two terms are of almost equal magnitude. The values of $x$ are generated via a Markov chain Monte Carlo process with two types of updates: switching between two "sectors" corresponding to $f_+$ and $f_-$, and varying the value of $x$ on the interval $[0, 3]$ within the same sector. The update acceptance probabilities are given by the detailed balance equations,

$P_{f_{-\sigma} \to f_{\sigma}} = f_{\sigma}(x)/f_{-\sigma}(x)$, (22)
$P_{x \to x'} = f_{\sigma}(x')/f_{\sigma}(x)$. (23)

The sign problem is due to the functions $f_+(x)$ and $f_-(x)$ having almost the same magnitude but being sampled with opposite sign. This setup is a toy version of the typical situation arising in diagrammatic Monte Carlo for fermions.

Figure 10 shows the BHM spline for a smaller and a larger number of sampled points. In the former case, the data is compatible with zero, which is correctly captured by the algorithm presented in Sec. III G. In this case, the fit should not be used, despite having acceptable $\chi^2$ (the fit is consistent with the true function within errors). As more data are gathered the algorithm begins to correctly reproduce the features of the function. In the examples shown, one spline piece was sufficient for an acceptable fit.

FIG. 5. Oscillating test function using a smaller sample (top panels (a) and (b)) and a substantially larger sample (bottom panels (c) and (d)). The BHM fit (red dashed line) is shown with two error estimates on the spline coefficients: errors calculated using Eq. (12) (red band) and errors calculated using bootstrap (blue band). Bootstrap provides a tighter bound on the errors in this example. The left panels, (a) and (c), show the function and the fit, while the right panels, (b) and (d), show the difference between the fitted and the true function.

F. Green's function of the Fröhlich polaron
We now apply the BHM to a physical problem, the diagrammatic Monte Carlo calculation of the zero-momentum imaginary-time Green's function,

$G(\mathbf{k} = 0, \tau) = \langle \mathrm{vac} | \, a_{\mathbf{k}=0}(\tau) \, a^{\dagger}_{\mathbf{k}=0}(0) \, | \mathrm{vac} \rangle$, (24)
$a_{\mathbf{k}}(\tau) = e^{H\tau} a_{\mathbf{k}} e^{-H\tau}$, (25)

of the Fröhlich polaron with the dimensionless coupling constant $\alpha = 2$ and the chemical potential $\mu$ chosen below the ground-state energy and measured in units of the (momentum-independent) phonon frequency $\omega$. Here $|\mathrm{vac}\rangle$ is the vacuum state and $a_{\mathbf{k}}$ is the annihilation operator for an electron with momentum $\mathbf{k}$. The Fröhlich Hamiltonian describes an electron coupled to a bath of phonons,

$H = H_e + H_{\mathrm{ph}} + H_{e\text{-}\mathrm{ph}}$, (26)
$H_e = \sum_{\mathbf{k}} \frac{k^2}{2} \, a^{\dagger}_{\mathbf{k}} a_{\mathbf{k}}, \qquad H_{\mathrm{ph}} = \sum_{\mathbf{q}} \omega \, b^{\dagger}_{\mathbf{q}} b_{\mathbf{q}}$, (27)
$H_{e\text{-}\mathrm{ph}} = \sum_{\mathbf{k},\mathbf{q}} \frac{i \, (2\sqrt{2}\,\alpha\pi)^{1/2}}{q} \left( b^{\dagger}_{\mathbf{q}} - b_{-\mathbf{q}} \right) a^{\dagger}_{\mathbf{k}-\mathbf{q}} a_{\mathbf{k}}$, (28)

where $b_{\mathbf{q}}$ is the annihilation operator for a phonon with momentum $\mathbf{q}$.

The Green's function is a central quantity for the diagrammatic technique, from which other properties of the system can be obtained with appropriate analysis. As reference we use a very long Monte Carlo run with reweighting. This provides a reliable estimate of the Green's function with negligible errors (several orders of magnitude smaller than the errors of the sample used for the BHM and for basis projection sampling).

Figure 11 shows the BHM fit and the result obtained using the basis projection method for the same set of sampled points. The BHM fit produced four spline pieces in this example and, as in the previous examples, the basis was retrospectively chosen to have the same interval division. It can be clearly seen that the BHM provides an accurate smooth fit of the Fröhlich polaron Green's function, which agrees well with the reference. While the error bars on the BHM fit and the basis projection are comparable, the basis projection sampling is less efficient, since it requires $O(m^2)$ operations during the sampling stage, where $m$ is the number of basis functions used. In this particular example with a polynomial basis up to cubic order, this corresponds to at least 16 times as many operations as needed during the sampling stage for BHM.

FIG. 6. Oscillating test function. Difference between the fitted and the true function, normalized by one standard error obtained with the different methods: Eq. (12), bootstrap, and fit evolution.

V. CONCLUSIONS
We have argued and demonstrated that the BHM yields an efficient, flexible, and fully automated algorithm to restore smooth functions from their noisy integrals using all available information. The resulting fits are at least as accurate as the ones obtained using basis projection sampling, but with guaranteed smoothness at the knots. Sampling with the BHM is also computationally less expensive than using basis projections, which require many operations in each sampling step. Similar to the projection method onto a polynomial basis, the BHM spline is a piecewise polynomial function on several large intervals. The crucial technical advantage is that these intervals do not need to be fixed beforehand. A suitable division into intervals is found automatically by the algorithm and is adjusted over time, as more data points are collected.

In the future we plan to extend the bin hierarchy algorithm to multivariate functions, since many relevant physical observables depend on several variables, such as time and momentum. Two, three, and four dimensions are the most relevant for physical applications.
ACKNOWLEDGMENTS
We thank Chris Amey for helpful discussions. This work was supported by the Simons Collaboration on the Many Electron Problem and the National Science Foundation under Grant No. DMR-1720465. O.G. also acknowledges support by the US-Israel Binational Science Foundation (Grants 2014262 and 2016087).
FIG. 7. Smoothed histogram of $\tilde f(x)$ values for the oscillating test function at four values of $x$, from $x = 1$ to $x = 3.74$ (panels (a) to (d), from left to right). The vertical lines are: histogram mean (red solid line); true function value (blue dotted line); the particular $\tilde f(x)$ value from the simulation shown in the bottom panel of Fig. 5 (black dashed line). The latter is shown for reference. The horizontal lines are: the $\pm\sigma$ interval around the histogram mean, where $\sigma$ is the median standard error from Eq. (12) (red solid line); the $\pm\sigma$ interval around the histogram mean, where $\sigma$ is the robust standard error (blue dotted line); and the $\pm\sigma$ interval around the histogram mean, where $\sigma$ is the typical bootstrap error, taken from the example shown in the bottom panel of Fig. 5 (black dashed line).

FIG. 8. Sampling a divergent function defined on a semi-infinite domain using the BHM with appropriate transforms. The left panel (a) shows the function and the fit, while the right panel (b) shows the difference between the fitted and the true function.

FIG. 9. Divergent test function $f(x) \propto x^{-1/2}(1+x)^{-1}$ with an $x^{-1/2}$ divergence at $x = 0$. Data is sampled with a compensation for the divergence that overestimates (left panel (a)) and underestimates (right panel (b)) the power of the divergence, using weighting factors $x^{p}$ with $p$ slightly larger and slightly smaller than $1/2$, respectively. The plots show the difference between the BHM fit and the true function.

FIG. 10. Sampling with a severe sign problem using a smaller sample (left panel (a)) and a larger sample (right panel (b)). The sampled integrals used for the fit in the left panel are consistent with zero, which is correctly captured by the check presented in Sec. III G. In this case, the displayed BHM spline would not be used for further data analysis.

FIG. 11. Diagrammatic Monte Carlo sampling of the Fröhlich polaron Green's function. The left panel (a) shows a very precise numerical calculation of the function, while the right panel (b) shows the difference between the fitted and the reference function for the BHM and the basis projection method.