Implementation of the Bin Hierarchy Method for restoring a smooth function from a sampled histogram

Abstract

We present BHM, a tool for restoring a smooth function from a sampled histogram using the bin hierarchy method. The theoretical background of the method is presented in [arXiv:1707.07625]. The code automatically generates a smooth polynomial spline with the minimal acceptable number of knots from the input data. It works universally for any sufficiently regularly shaped distribution and any level of data quality, requiring almost no external parameter specification. It is particularly useful for large-scale numerical data analysis. This paper explains the details of the implementation and the use of the program.

Olga Goulko a,b,∗, Alexander Gaenko c, Emanuel Gull c, Nikolay Prokof’ev a,d, Boris Svistunov a,d,e

a Department of Physics, University of Massachusetts, Amherst, MA 01003, USA
b Present address: Raymond and Beverly Sackler School of Chemistry and School of Physics and Astronomy, Tel Aviv University, Tel Aviv 6997801, Israel
c Department of Physics, University of Michigan, Ann Arbor, MI 48109, USA
d National Research Center “Kurchatov Institute,” 123182 Moscow, Russia
e Wilczek Quantum Center, School of Physics and Astronomy and T. D. Lee Institute, Shanghai Jiao Tong University, Shanghai 200240, China

PROGRAM SUMMARY

Manuscript Title: Implementation of the Bin Hierarchy Method for restoring a smooth function from a sampled histogram

Authors: Olga Goulko, Alexander Gaenko, Emanuel Gull, Nikolay Prokof’ev, Boris Svistunov

Program Title: BHM

Journal Reference:

Catalogue identifier:

Licensing provisions: GPLv3

Programming language: C++

Operating system: Tested on Linux

RAM:

Keywords: Data analysis, Function restoration, Spline fitting, Histogram, Smoothing

Classification:

External routines/libraries: CMake, GSL

Nature of problem: Restoring a smooth function from a sampled histogram in an efficient, reliable and automatized way is crucial for numerical and experimental data analysis.

Solution method: To make use of all information contained in the sampled data, the BHM algorithm generates a hierarchy of overlapping bins of different sizes from the initially supplied fine histogram. The bin hierarchy is fitted to a polynomial spline with the minimal acceptable number of knots, the positions of which are determined automatically. The output is a smooth function with error band.

Running time: Typically less than a second

∗ Corresponding author. E-mail address: [email protected]

1. Introduction

Preprint submitted to Computer Physics Communications, November 15, 2017

Numerical approaches to problems in condensed matter and quantum many-body physics often involve generating data points according to an unknown probability density $f(x)$, which needs to be restored from the sampled data. The amount of data generated in large-scale quantum Monte Carlo simulations is usually so large that it is impossible (or at least impractical) to store the complete list of sampled data points $x_i$ in order to use density estimation protocols [2–4] to recover $f(x)$. Instead, data points are typically collected into a histogram, the histogram bins representing integrals over the sampled distribution. This does not involve any significant loss of information, as long as the bins are sufficiently small to resolve the features of the distribution (which is always possible provided that $f(x)$ is sufficiently smooth). More sophisticated sampling methods exist, which retain more information about the individual points, but these are in general less efficient and require a case-dependent implementation. We provide a universal and efficient program to restore a smooth distribution, which uses the standard histogram as input.

BHM is an implementation of the bin hierarchy method, introduced in [1]. It is

1. unbiased:
   • utilizes all relevant information contained in the data;
   • non-parametric fit automatically adjusts to data quality;
   • provides maximally featureless solution (least acceptable number of spline knots);
2. efficient:
   • based on regular histogram, which is efficient to sample;
   • fast analysis;
3. automatic:
   • very little user input;
   • no adjustment for different types of sampled functions;
   • no adjustment with simulation time as more data is collected.

The paper is organized as follows. The general problem setup is presented in Sec. 2. In Sec. 3, we give an overview of the algorithm. We explain how to use the program in Sec. 4, giving a detailed explanation of the input and output formats, as well as possible parameter specifications. Several examples are presented in Sec. 5.

2. Problem setup

The central object in BHM is a smooth function $f(x)$ defined on a bounded domain $D$. Statistical sampling with a probability density $p(x)$ is performed to generate samples for $f(x)$ according to $f_j = f(x_j)/p(x_j)$ with $p$-distributed $x_j$. In the simplest case, when $f(x)$ itself is a normalized probability distribution, $p(x) = f(x)$ can be chosen, implying $f_j = 1$. The samples are binned into a histogram with $2^K$ bins. We are interested in restoring a smooth function $\tilde f(x)$ from this histogram.

Each histogram bin $i$ with bin boundaries $x_{i,\min}$ and $x_{i,\max}$ represents the stochastic integral

$$I_i = \int_{x_{i,\min}}^{x_{i,\max}} f(x)\,dx \tag{1}$$

through the following relations:

$$I_i = \bar f_i \frac{N_i}{N}, \tag{2}$$

$$M(I_i) = M(f_i) + \bar f_i^2\, \frac{N_i (N - N_i)}{N}, \tag{3}$$

$$\mathrm{Var}(I_i) = \frac{M(I_i)}{N - 1}, \tag{4}$$

$$\delta I_i = \sqrt{\frac{\mathrm{Var}(I_i)}{N}}, \tag{5}$$

where the “scaled variance” $M(f_i) = (N_i - 1)\mathrm{Var}(f_i)$ is the sum of squares of differences from the mean, $N_i$ is the number of samples in bin $i$ and $N$ the total number of samples. Note that in the simplest case $p(x) = f(x)$ the above quantities are determined through $N_i$ and $N$ alone.

The goal is to find a function $\tilde f(x)$ whose integrals over different parts of the domain $D$ agree with the sampled integrals. Working with integrals rather than interpolated function values allows us to include combinations of histogram bins into the fitting. Rebinning data to larger bin sizes leads to a reduction of statistical noise, while retaining small bins results in a higher resolution due to smaller discretization errors.

The resulting fit $\tilde f(x)$ is a polynomial spline of order $m$, where $m$ is the highest power with non-zero coefficient. The spline function and its derivatives up to order $m - 1$ are continuous at the spline knots.

3. Overview of the algorithm

In this section we give a brief overview of the algorithm. More details on the theoretical background of the method can be found in [1]. A flowchart of the algorithm is shown in Fig. 1.

• The algorithm starts from a list of $2^K$ histogram bins supplied in an input file (for a detailed format description, see Sec. 4). Typical values of $K$ are 7 – 15. The bins are not required to have the same size; however, in practice there is no need to have variable size bins. The bins must not overlap or leave gaps.

• From this input the code generates a hierarchy of histogram bins of increasing size. Combining two neighboring bins of the $2^K$ initial bins leads to $2^{K-1}$ larger bins with, on average, twice as many entries. Successive repetitions of this rebinning result in a hierarchy of levels with $2^{K-2}, \ldots, 4, 2, 1$ bins.

• Bins whose number of samples $N_i$ is smaller than a user-defined minimal value are excluded from the fitting process. Likewise, levels that do not contain enough usable bins (the minimal fraction can be defined by the user) are also excluded. This implies that in general fitting starts with a level $K' < K$, so that the original binning can be chosen to be very fine without introducing noise into the final fit. For bins that will be used for fitting, the bin integrals and their errors are computed via Eqs. (2), (3), (4) and (5).

• The code checks if the data is compatible with zero on the whole domain. There is an option not to proceed with the fit if this is the case. This feature is particularly useful for data suffering from a severe sign problem.

• The next step is fitting a spline of order $m$ on the given spline interval division. The starting point is one spline interval, which means that one polynomial is fitted on the whole domain. The fit minimizes

$$\sum_{n=0}^{K} \frac{\chi_n^2}{2^n}, \tag{6}$$

where $\chi_n^2$ is defined for bins on hierarchy level $n$ in the usual way.

• Afterwards the goodness of fit is evaluated on each hierarchy level individually. The criterion is

$$\frac{\chi_n^2}{\tilde n} \le 1 + T \sqrt{\frac{2}{\tilde n}}, \tag{7}$$

where $T$ is the fit acceptance threshold (input parameter) and $\tilde n$ the number of bins on level $n$ that were used for fitting. The expression $\sqrt{2/\tilde n}$ corresponds to one standard deviation of the $\chi^2$-distribution.

• If at least one level fails the global goodness-of-fit check, the goodness of fit is then evaluated on each spline interval separately (again level by level). Spline intervals on which the fit was acceptable remain unchanged, while the others are split into two parts, by introducing a spline knot in the middle (“number of bins”-wise).

• If any of the resulting intervals is too small, meaning that there is not enough data to fit on that interval, the code exits without having produced an acceptable spline. Otherwise the BHM fit is repeated on the new interval division.

• Once an acceptable spline has been found, there is an option to refit the data on the same interval division with an additional constraint that aims to minimize the jump in the highest derivative.

• The resulting BHM spline is output (spline coefficients and error coefficients). In addition, the spline values can be output evaluated on a grid.

Figure 1: Flowchart of the algorithm
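The rebinning step and the bin integrals of Eqs. (2)–(5) can be illustrated with a short Python sketch. It is not taken from the BHM source code: the function names `bin_integrals` and `build_hierarchy` are hypothetical, and the sketch assumes the simplest case $p(x) = f(x)$, in which $\bar f_i = 1$ and $M(f_i) = 0$, so that only the bin counts $N_i$ and the total $N$ enter.

```python
import numpy as np

def bin_integrals(counts, total):
    """Eqs. (2)-(5) for the case p(x) = f(x): mean 1 and M(f_i) = 0 in every bin."""
    counts = np.asarray(counts, dtype=float)
    integral = counts / total                        # Eq. (2)
    scaled_var = counts * (total - counts) / total   # Eq. (3)
    variance = scaled_var / (total - 1)              # Eq. (4)
    error = np.sqrt(variance / total)                # Eq. (5)
    return integral, error

def build_hierarchy(counts):
    """Merge neighboring bins pairwise: 2^K bins, 2^(K-1) bins, ..., 1 bin."""
    level = np.asarray(counts, dtype=np.int64)
    if level.size == 0 or level.size & (level.size - 1):
        raise ValueError("the number of bins must be a power of 2")
    levels = [level]
    while level.size > 1:
        level = level[0::2] + level[1::2]
        levels.append(level)
    return levels

# Illustrative data: 10000 uniform samples on [0, 1) in 2^3 bins.
rng = np.random.default_rng(0)
samples = rng.random(10_000)
counts, _ = np.histogram(samples, bins=8, range=(0.0, 1.0))
levels = build_hierarchy(counts)
for level in levels:
    integral, error = bin_integrals(level, samples.size)
    print(len(level), integral.sum(), error.max())
```

Every level carries the same total number of samples, while the relative error of the individual bin integrals shrinks as the bins grow, which is the trade-off between noise and resolution that the hierarchy exploits.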

4. Input and output

Instructions for compiling the program and executing unit tests can be found in the README file.

The program executable requires 1 argument, the name of the parameter file, e.g.:

    $ ./bhm in.param

In particular, the parameter file determines the name of the input file with the histogram data and the name of the output file for the BHM spline (see below).

As a special case, if the parameter file name is an empty string, the default parameters will be used, which are suitable for most applications:

    $ ./bhm "" <histogram.dat >spline.dat

In this case, the histogram data input is expected to be provided at the standard input, and the results will be printed to the standard output. In the example above, the standard input is redirected from file histogram.dat, and the standard output is redirected to file spline.dat.

Without an argument, the program prints a short help message and exits.

The input histogram data is text-based, line-oriented, and has the following format:

    A  N_exc
    x_{1,min}  N_1  \bar f_1  M(f_1)
    x_{2,min}  N_2  \bar f_2  M(f_2)
    ....
    x_{i,min}  N_i  \bar f_i  M(f_i)
    ....
    x_max

where the first line specifies an overall normalization factor $A$ and the number $N_{\rm exc}$ of samples outside of the histogram bounds. The normalization step is omitted if either $A = 1$ or $A = 0$. Otherwise, all values $\bar f_i$ and $M(f_i)$ are divided by $A$ and $A^2$, respectively, before constructing the BHM fit. The value $N_{\rm exc}$ is used to calculate the total number of samples $N = N_{\rm exc} + \sum_i N_i$, which is needed for Eqs. (2)–(5). $N_{\rm exc}$ can be zero.

Starting from the second line, each line, except the last one, contains 2 or 4 blank-separated values, specifying the left bin boundary, the number of samples in the bin, and, optionally, the mean value and the scaled variance. For example, line 5 of the listing corresponds to a bin $i$ with the left boundary $x_{i,\min}$, number of samples $N_i$, mean value $\bar f_i$ and scaled variance $M(f_i)$ (see Eqs. (2)–(5)). If the mean value and the scaled variance are both omitted, they are assumed to be $\bar f_i = 1$ and $M(f_i) = 0$, which corresponds to only ever adding 1 to bin counters, or in other words $p(x) = f(x)$. The last line of the file (line 7 of the listing) must contain a single entry $x_{\max}$, the right boundary of the last bin.

The numbers $x_{1,\min} < \ldots < x_{i,\min} < \ldots < x_{\max}$ must form a strictly monotonically increasing sequence, corresponding to non-overlapping, finite-size bins with no gaps. In the current implementation, the number of bins must be a power of 2 (this limitation may be removed in later versions).

It is important to note that all sampled data and variances are assumed to be uncorrelated. If correlations are present, they have to be removed prior to the BHM fit, for example through appropriate blocking analysis or by scaling the variances with the estimated correlation factor.
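As an illustration of the input format described above, the following Python sketch writes a two-column histogram file (the $p(x) = f(x)$ case, where mean and scaled variance are omitted). The helper name `write_histogram` and its arguments are assumptions made for this example and are not part of the BHM distribution; the sketch uses $A = 1$ so that no normalization is applied.

```python
import io
import numpy as np

def write_histogram(stream, samples, x_min, x_max, power_bins):
    """Write samples as a 2-column histogram: left boundary and N_i per line."""
    samples = np.asarray(samples, dtype=float)
    edges = np.linspace(x_min, x_max, 2**power_bins + 1)  # 2^K uniform bins
    counts, _ = np.histogram(samples, bins=edges)
    n_exc = samples.size - int(counts.sum())   # samples outside the bounds
    stream.write(f"1 {n_exc}\n")                # first line: A = 1 and N_exc
    for left, count in zip(edges[:-1], counts):
        stream.write(f"{left:.12g} {count}\n")
    stream.write(f"{edges[-1]:.12g}\n")         # last line: x_max

# Illustrative data: normally distributed samples, histogram on [-1, 1].
rng = np.random.default_rng(1)
buffer = io.StringIO()
write_histogram(buffer, rng.normal(size=1000), -1.0, 1.0, power_bins=3)
lines = buffer.getvalue().splitlines()
print(lines[0], "...", lines[-1])
```

Replacing the `io.StringIO` buffer by an open file, or by `sys.stdout` when the program reads its histogram from the standard input, would produce data in the layout of the listing above.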
The input parameter file is a text-based, line-oriented file that has a key = value format. An example input is shown in Fig. 2. The keys are case-insensitive; the string values may be enclosed in quotes; the symbol # starts a comment which is ignored until the end of the line. The meaning of each parameter is indicated in the figure in the corresponding comment. Below we provide more detailed explanations for some of the parameters.

DataPointsMin in line 1 specifies the minimal number of data points that a bin must contain in order to be used for fitting. Bins that contain fewer sampled points are ignored (but still contribute in combination with other bins at higher hierarchy levels). DataPointsMin must be at least 10, in order to ensure that meaningful statistics can be made from the data. The default value is 100. If a hierarchy level does not contain enough usable bins (the minimal number is given by the parameter UsableBinFraction in line 7, times the total number of bins on that level) then this level and all subsequent levels are completely omitted from the fitting.

The maximal possible number of interval divisions is determined by the parameter MinLevel in line 3. For example, if there are $2^K$ elementary bins, MinLevel=2 means that the smallest possible spline intervals coincide with the bins on hierarchy level $K - 2$. MinLevel must be at least 2 (corresponding to a total of at least 1+2+4=7 bins per interval); moreover, MinLevel must be large enough to ensure that the fit is overdetermined on each interval.

The fit acceptance threshold $T$ (lines 4-6) can be either set to a fixed value, or to a range of values between Threshold and ThresholdMax. In the latter case, a BHM fit is first attempted with the smallest value Threshold. If no acceptable fit is found, the threshold value is successively increased in ThresholdSteps equidistant steps, until either an acceptable fit is produced or ThresholdMax is reached. Setting ThresholdMax to be smaller than or equal to Threshold and/or setting ThresholdSteps=0 corresponds to only using one fixed value of $T$. Note that threshold values that are too low can result in overfitting (too many spline pieces) or the failure to produce an acceptable fit. Values that are too high can result in underfitting (too few spline pieces and a poor fit with underestimated error bars). These issues are illustrated in Example 5.2. The value $T = 2.0$ [...] $T = 2.0$ [...] $T = 4.0$ [...] functions, and hence there is no need to change any of the parameters unless specifically desired.

Parameter File in.param

    DataPointsMin=100
    SplineOrder=3
    MinLevel=2
    Threshold=2.0
    ThresholdMax=4.0
    ThresholdSteps=4
    UsableBinFraction=0.25
    JumpSuppression=false
    Verbose=true
    PrintFitInfo=true
    FailOnBadFit=true
    FailOnZeroFit=true
    Data="histogram.dat"
    OutputName="spline.dat"
    GridOutput="spline_plot.dat"
    GridPoints=1024

Figure 2: Sample parameter file
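The threshold scan controlled by Threshold, ThresholdMax and ThresholdSteps can be sketched as follows. This is only one reading of the description above, not the program's actual code: the function name `threshold_schedule` is hypothetical, and it is assumed that both end points are included and that the step is (ThresholdMax − Threshold)/ThresholdSteps.

```python
def threshold_schedule(threshold, threshold_max, steps):
    """Return the sequence of T values that would be tried, smallest first."""
    # A single fixed threshold is used if the range is empty or steps is 0.
    if steps <= 0 or threshold_max <= threshold:
        return [threshold]
    step = (threshold_max - threshold) / steps
    return [threshold + k * step for k in range(steps + 1)]

# Values from the sample parameter file: Threshold=2.0, ThresholdMax=4.0,
# ThresholdSteps=4.
schedule = threshold_schedule(2.0, 4.0, 4)
print(schedule)
```

Under these assumptions the sample parameter file leads to the trial values 2.0, 2.5, 3.0, 3.5 and 4.0; the scan would stop at the first value for which an acceptable fit is found.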

The default verbose output is printed to standard error and contains auxiliary information such as values of the input parameters, a brief description of the input histogram, and the log of the fitting process. The fitting log is described in detail in Example 5.1. If requested by the PrintFitInfo input parameter, information about the final fit is also printed to the standard output.

The output of the program is both human- and machine-readable, and has the following text-based, line-oriented, blank-separated format:

    # ...
    # ...
    m s
    x_1 x_2 ... x_{s+1}
    # 1
    a_0 a_1 a_2 ... a_m
    ε_0 ε_1 ε_2 ... ε_{2m}
    ...
    # i
    a_0 a_1 a_2 ... a_m
    ε_0 ε_1 ε_2 ... ε_{2m}
    # (i + 1)
    ...

Any lines at the beginning of the file that start with # are considered comments and are ignored. The first significant line of the file (line 3 of the listing) specifies the spline polynomial order $m$ and the number of spline pieces $s$; the next line (line 4 of the listing) lists all $(s + 1)$ spline piece boundaries $x_1, \ldots, x_{s+1}$. The following lines form $s$ sections describing each spline piece $\tilde f_i$, for $i = 1 \ldots s$. Each section (lines 5–7, 9–11 of the listing) consists of 3 lines:

1. Header (starts with #) specifying the spline piece number ($i$),
2. $(m + 1)$ numbers specifying the spline piece coefficients $a_0 \ldots a_m$ ($\tilde f_i(x) = \sum_{k=0}^{m} a_k x^k$),
3. $(2m + 1)$ numbers $\varepsilon_0 \ldots \varepsilon_{2m}$ specifying the error bar $E_i(x) = \sqrt{\sum_{k=0}^{2m} \varepsilon_k x^k}$.

The simplest way to plot the resulting spline is to use the provided Python3 script bhm_spline.py, as follows:

    $ python3 bhm_spline.py spline.dat

On the other hand, it may be convenient to customize the plot and/or compare it with a known function, or plot it interactively (e.g., from a Jupyter notebook). For this purpose the script can be imported as a module that provides a BHMSpline class. The following listing demonstrates a possible way of using the module.

    import numpy as np
    import matplotlib.pyplot as plt
    from bhm_spline import BHMSpline

    spline=BHMSpline("spline.dat")
    x=np.linspace(*spline.domain())

    def fn(x): return (x**4-0.8*x*x)/0.171964

    plt.plot(x,spline(x), x,fn(x))

    plt.plot(x,spline.errorbar(x))

    spline.plot()

    spline.plot(fn)

    spline.plot_difference(fn)

In line 3 the class BHMSpline is imported; line 5 creates the object representing the spline. In line 6 an interval of x-values is created corresponding to the domain of the spline. Line 8 defines a reference function to compare with the spline. In line 10 the spline and the reference function are plotted using the Matplotlib plotting library; in line 12 the error bar $E(x)$ is plotted. The class also provides a convenience plotting method: when called without arguments (as on line 14), the spline is plotted along with the error bars; when a function is passed as an argument (line 16), its graph is plotted also. It is also possible to plot the difference between the spline and the reference function with error bar (line 18).

If the GridOutput parameter in the parameter file is set to a non-empty filename, the program also outputs to the specified file the values and the error bars of the spline computed on a one-dimensional grid of points. A plotting program, such as gnuplot, can then be used to plot the generated function and the error bars and to compare them with a reference function; for example:

    $ gnuplot
    gnuplot> quartic(x)=(x**4-0.8*x*x)/0.171964
    gnuplot> plot "spline_plot.dat" with errorbars
    gnuplot> replot quartic(x)

In this example, line 1 of the listing starts the gnuplot program; line 2 defines a reference function (quartic polynomial); line 3 plots the grid output file generated by BHM; and line 4 plots the reference function on the same graph.
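For readers who want to evaluate the spline without the provided script, the formulas for $\tilde f_i(x)$ and $E_i(x)$ given with the output format can be sketched in plain Python. The function `evaluate` and its argument layout are hypothetical and do not reproduce the BHMSpline class; the knots, coefficients and error coefficients below are made-up numbers chosen only to show the bookkeeping.

```python
import bisect
import math

def evaluate(knots, coeffs, errs, x):
    """Evaluate spline value and error bar at x.

    knots  : the s + 1 spline piece boundaries, in increasing order
    coeffs : s lists of m + 1 coefficients a_0 ... a_m
    errs   : s lists of 2m + 1 error coefficients eps_0 ... eps_2m
    """
    if not knots[0] <= x <= knots[-1]:
        raise ValueError("x lies outside the spline domain")
    # Locate the piece containing x; the right end point belongs to the last piece.
    i = min(bisect.bisect_right(knots, x) - 1, len(coeffs) - 1)
    value = sum(a * x**k for k, a in enumerate(coeffs[i]))
    err_squared = sum(e * x**k for k, e in enumerate(errs[i]))
    return value, math.sqrt(max(err_squared, 0.0))

# Made-up linear spline (m = 1) with two pieces on [0, 2].
knots = [0.0, 1.0, 2.0]
coeffs = [[0.0, 1.0], [1.0, 0.0]]          # f(x) = x, then f(x) = 1
errs = [[0.01, 0.0, 0.0], [0.04, 0.0, 0.0]]
print(evaluate(knots, coeffs, errs, 0.5))
print(evaluate(knots, coeffs, errs, 2.0))
```

The polynomials are evaluated in powers of $x$ itself, as in the formulas above, rather than in powers of the distance from the left knot.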

5. Examples

In this section we present three detailed examples of the features of BHM illustrated on different distributions $f(x)$. We provide a program to generate the input data for these examples (as well as for several additional test functions). Calling the program without arguments:

    $ ./generator

prints a brief help message, which includes a list of the functions supported by the program.

Calling the program with a single file argument:

    $ ./generator generator.param

generates the histogram data for a given analytical function according to the parameters listed in the generator.param file. For all examples discussed below, the parameters are the same as in the example generator parameter file shown in Fig. 3 (including the random number generator seed), except when stated otherwise.

Calling the program as:

    $ ./generator -python name

(where name is the name of the function, possibly abbreviated) prints the Python code that corresponds to the function, which is convenient for plotting the analytical function against the approximating spline in an interactive Python environment (as has been discussed in subsection 4.5).

Parameter File generator.param

    SampleSize=10000
    Function=exponential
    PowerBins=10
    RandomSeed=956475
    Output="histogram.dat"
    GridOutput="function.dat"
    GridPoints=1024

Figure 3: Sample parameter file to generate example input

If the GridOutput parameter in the parameter file is set to a non-empty filename, the program also outputs the values of the function computed on a one-dimensional grid to the specified file; a plotting program, such as gnuplot, can then be used to plot the generated function; for example:

    $ gnuplot
    gnuplot> plot "function.dat" with lines
    gnuplot> replot "spline_plot.dat" with errorbars

In this example, line 1 of the listing starts the gnuplot program; line 2 plots the generated function; and line 3 plots the content of the spline_plot.dat generated by BHM as discussed in subsection 4.6.

This example demonstrates BHM fits for different choices of spline order $m$.

The original function is a quartic polynomial (Function=quartic polynomial):

$$f(x) = \alpha\,(x^4 - 0.8\,x^2). \tag{8}$$

Because $f(x)$ changes sign, sampling on the interval $[-1, 1]$ is performed with the probability density $p(x) = |f(x)|$, where $\alpha = 1/0.171964$ normalizes $p(x)$ on this interval.

The histogram data is fitted with BHM using the default parameters, with the exception of SplineOrder, which is set to 3, 4, and 5, respectively. The fit results are shown in Fig. 4. From the output files "spline.dat" it can be seen that the cubic spline has four spline pieces; the quartic spline has one spline piece, as expected; the quintic spline also has one spline piece, its coefficients up to quartic order are similar to the ones obtained via the quartic fit, and its highest spline coefficient is small.

Figure 4: Quartic polynomial test function (left panel). Difference between BHM fit $\tilde f(x)$ with different spline orders $m$ and the test function $f(x)$ (right panel).

We explain in detail the verbose output for the cubic fit $m = 3$. At the beginning of the output, the fit parameters are listed, as well as general information about the input histogram. Then follows information about the goodness of fit at the different fitting stages:

    ...
    BHM fit:
    Begin BHM fitting with threshold T = 2
    Checking separate chi_n^2/n in spline fit
    level n chi_n^2/n max chi_n^2/n
    [...]
    Checking interval 0 (order: 0, number: 0)
    [...]
    This interval fit is not good
    Checking separate chi_n^2/n in spline fit
    level n chi_n^2/n max chi_n^2/n
    [...]
    Checking interval 0 (order: 1, number: 0)
    [...]
    This interval fit is not good
    Checking interval 1 (order: 1, number: 1)
    [...]
    This interval fit is not good
    Checking separate chi_n^2/n in spline fit
    level n chi_n^2/n max chi_n^2/n
    [...]
    Good spline found with threshold T = 2
    ...

First a fit is attempted with one spline piece on the whole domain (lines 4-13). This fit is not acceptable because $\chi_n^2/\tilde n$ (third column in the output) exceeds the maximally allowed value $1 + T\sqrt{2/\tilde n}$ (fourth column in the output) for most of the levels. The second column lists $\tilde n$, the number of available bins at each level. This number is in general smaller than $2^n$, because some bins do not contain enough data to be used for fitting. Also, hierarchy levels below $n = 7$ were omitted because the fraction of usable bins on these levels was below the set UsableBinFraction value.

Since the first fit was unsuccessful, $\chi^2$ is evaluated on each spline interval separately (lines 14-16). In this case, this yields no new information, since only one interval is present. As soon as a level is found where the fit is unacceptable (level 0 in this case), this check stops without proceeding to lower levels, since this is enough to identify a bad interval.

After the interval is divided, another BHM fit is attempted on two intervals (lines 17-26). This fit already has smaller $\chi_n^2/\tilde n$ values than the previous one, but still fails the threshold on several levels. Both spline intervals are then again checked separately (lines 27-31 and 32-36, respectively) and both fail the goodness-of-fit check on level 3. Note that level 0 is not present in the individual interval checks, because the bin on this level is larger than each of the spline intervals.

The intervals are numbered consecutively, but additional information is provided so that their location can be recovered (see e.g. lines 27 and 32). The boundaries of an interval always coincide with the boundaries of a bin on a certain hierarchy level (denoted by “order”) and “number” denotes the number of this bin.

After the intervals are again divided, the resulting BHM fit (lines 38-47) is acceptable. No separate interval checks need to be performed and the code exits with the fit result. If PrintFitInfo is requested, the goodness-of-fit information of the final result is output again at the end. This includes the $\chi_n^2/\tilde n$ values on each level $n$, the unit standard deviation $\sqrt{2/\tilde n}$ of the corresponding $\chi^2$-distribution, as well as the number of standard deviations by which $\chi_n^2/\tilde n$ exceeds 1 on each level (last column).
If $\chi_n^2/\tilde n \le$ [...]

This example demonstrates BHM fits for different choices of the threshold $T$. The sampled distribution is a decaying exponential (Function=exponential),

f(x) = α exp(−x), (9)

normalized on the interval [1, [...] α = 3e/(e − [...] $N_{\rm exc}$ sampled outside of the histogram bounds. The total number of sampled points in this example is SampleSize=100000.

The histogram data is fitted with BHM using the default parameters, with the exception of the parameters defining the fit acceptance threshold, which is set to be fixed at $T = 0$, 2, and 8, respectively. This can be achieved by either setting the value of ThresholdMax to be equal to or less than the value of Threshold, or by setting ThresholdSteps=0. The fit results are shown in Fig. 5.

For all threshold values an acceptable fit exists, but with different interval divisions. The extremely low threshold value $T = 0$ (which means that only fits with $\chi_n^2/\tilde n \le$ [...] $T = 2$ produces a suitable fit with 3 spline pieces that captures the shape of the test function well. The very high value $T = 8$ yields an underfitted spline with only 2 pieces. This spline deviates strongly from the true function and the error on the spline is severely underestimated.

This example demonstrates that BHM works for both uniform and non-uniform input histograms. The sampled distribution,

f(x) = 0.[...] G(0, [...].2) + 0.[...] G(2, 1) + [...] G(−[...], [...]), (10)

is a linear combination of three Gaussians $G(\mu, \sigma)$ with mean $\mu$ and standard deviation $\sigma$ (Function=triple gaussian). It has several distinct features and resembles a physically relevant case.

We sample SampleSize=1000000 data points on the interval [−[...], 5] into a uniform and a non-uniform histogram, both with 2^[...] bins. Note that the non-uniform histogram binning is predefined and cannot be adjusted by changing the PowerBins entry. The non-uniform histogram bins are smaller in the center of the domain (where the sampled function has a sharp feature) and increase exponentially in size towards the domain boundaries. The smallest bin size is equal to the domain length divided by 2^[...]. The non-uniform histogram is always collected in addition to the customizable uniform histogram if Function=triple gaussian is chosen and is output into the file nonuniform histogram.dat.

The fit results are shown in Fig. 6. Both histogram divisions produce fits of similar quality that reproduce the tested distribution well. Since BHM automatically considers combinations of elementary bins, there is no need for a case-specific implementation of a non-uniform histogram grid. Note that sampling the same data in a uniform histogram with 2^[...] bins produces nearly the same fit as when using 2^[...] uniform bins in this example.
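The acceptance criterion of Eq. (7), which decides in all of the above examples whether a hierarchy level passes, can be written out as a small Python sketch. The function name `level_report` and the input numbers are hypothetical and serve only to illustrate how the threshold $T$ changes the verdict; they are not output of the program.

```python
import math

def level_report(chi2, n_used, threshold):
    """Goodness-of-fit quantities for one hierarchy level, following Eq. (7)."""
    reduced = chi2 / n_used                  # chi_n^2 / n~
    sigma = math.sqrt(2.0 / n_used)          # one standard deviation of chi^2 / n~
    limit = 1.0 + threshold * sigma          # maximally allowed value
    deviations = (reduced - 1.0) / sigma     # standard deviations above 1
    return reduced, limit, deviations, reduced <= limit

# Made-up level with chi_n^2 = 160 on 128 usable bins, tested for T = 0, 2, 8.
for T in (0.0, 2.0, 8.0):
    print(T, level_report(160.0, 128, T))
```

With these made-up numbers the level lies exactly two standard deviations above 1, so it is rejected for $T = 0$ and accepted for $T = 2$ and $T = 8$, in line with the behaviour described for low and high thresholds.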

6. Acknowledgments

This work was supported by the Simons Collaboration on the Many Electron Problem and by the National Science Foundation under the grants PHY-1314735 (O.G., N.P., and B.S.) and DMR-1720465 (N.P. and B.S.). O.G. also acknowledges support by the US-Israel Binational Science Foundation (Grants 2014262 and 2016087).

Figure 5: Decaying exponential test function (left panel). BHM fits of the test function with different goodness-of-fit thresholds (right panel).

Figure 6: Triple Gaussian test function (left panel). BHM fits of the test function based on a uniform histogram and a histogram with bins of different size (right panel).

References

[1] O. Goulko, N. Prokof’ev, B. Svistunov, Restoring a smooth function from its noisy integrals, arXiv:1707.07625.
[2] I. Narsky, F. C. Porter, Statistical Analysis Techniques in Particle Physics, John Wiley & Sons, 2013.
[3] D. W. Scott, Multivariate Density Estimation, John Wiley & Sons, 2015.
[4] B. W. Silverman, Density estimation for statistics and data analysis, London: Chapman and Hall, 1986.