Optimal estimate of probability density functions from experimental data
R. Labbé
Laboratorio de Turbulencia, Departamento de Física, Facultad de Ciencia, Universidad de Santiago de Chile, USACH. Casilla 307, Correo 2, Santiago, Chile

(Dated: June 1, 2006)

A method providing an optimal estimate of probability density functions (PDFs) from time series is proposed. It allows almost arbitrary resolution of the PDF when applied either to sampled analytic functions or to digitized data from experiments. When the results are compared with PDFs of the same data calculated using the standard histogram method, a remarkable improvement is observed, especially in the far lateral regions of the PDF, where low-probability events give poor statistics.
PACS numbers: 02.50.-r 05.10.-a
Probability density functions (PDFs) are of main interest in physical systems where the statistical description of magnitudes is more appropriate than the detailed behavior in time and/or space of one or more variables. In particular, in turbulence research it is of interest to characterize properties with non-Gaussian statistics, especially those related to small-scale intermittency [1, 2], which is responsible for the slowly decaying wings in the PDF of velocity differences at small scales, or the statistics of global magnitudes like the pressure [3] or the power injected in confined turbulent flows [4, 5], which are characterized by non-symmetric PDFs showing an exponential or stretched-exponential wing on the left side. Since the events that contribute to these particular features of the PDF are rare, it is often not possible to obtain statistics good enough to describe them accurately, and their effect on the signals may appear to be rather marginal. However, in view of their strength, they are detectable as non-Gaussian behavior in the tails of the PDF of the variable under study, and given their importance in the description and understanding of intermittency, among other effects, it is desirable to have a reliable method to estimate the PDF of functions exhibiting this kind of feature. Additionally, when the PDF of short bursts in a signal is being studied, it is worthwhile to have a tool to estimate it over relatively short time intervals. In these cases, the usual method of building a histogram of the data is inadequate because the number of available points may not be large enough. In this note I propose a simple method to estimate the PDF of a sampled function, like the data obtained when measuring the time evolution of some quantity in an experiment, which produces remarkably good results.

The idea behind the method is simple. Given a bounded time function f(t), t ∈ [a, b] ⊂ ℝ, with Fourier transform F(ω), a sampled version f_n = f(t_n) of f(t), with t_n = nT, n = 1, ..., N, and T the time interval between samples, is, according to the sampling theorem, a complete representation of the continuous function f(t), provided that (i) the function is band limited, and (ii) the highest frequency content in the spectrum of f(t) is bounded by the Nyquist frequency, defined as one half of the sampling frequency, i.e.,

\[ F(\omega) = 0, \qquad |\omega| > \pi/T. \tag{1} \]

Thus, although the set of values {f_n} is nothing but a "small" subset (one having zero measure) of the set {f(t) | t ∈ [a, b] ⊂ ℝ}, the Nyquist-Shannon-Kotelnikov sampling theorem allows us to recover all the information contained in the original function from the set of points {f_n}. When f(t) is defined for all t ∈ (−∞, ∞), the explicit expression for its reconstructed version, f_r(t), in terms of the samples {f_n} is

\[ f_r(t) = \sum_{n=-\infty}^{\infty} f_n \, \frac{\sin[\pi(t - nT)/T]}{\pi(t - nT)/T}. \tag{2} \]

As we will see later, we do not need f_r(t) in the process of building the PDF. If we want to evaluate the PDF at M points y_k, k = 1, ..., M, a local approximation using a few samples near the points {(t_k, f(t_k)) | f(t_k) = y_k, k = 1, ..., M} will be enough. Now, the usual method of binning the data to make a histogram, which by appropriate normalization gives an estimate of the PDF of f(t), has the obvious drawback that only the values in the set {f_n} are used. Thus, most of the information available to build the PDF of f(t) is lost.
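As an aside, the reconstruction formula (2) is simple to implement. The sketch below (Python with NumPy; the function name and argument layout are my own choices, not part of the paper) truncates the infinite sum to the N available samples, the only approximation possible with a finite record. The method described next never needs this global reconstruction, but the snippet makes the content of the sampling theorem concrete.

```python
import numpy as np

def sinc_reconstruct(samples, T, t):
    """Truncated Whittaker-Shannon reconstruction, Eq. (2):
    f_r(t) = sum_n f_n * sinc((t - n*T)/T), with f_n = f(n*T), n = 1..N."""
    f = np.asarray(samples, dtype=float)
    n = np.arange(1, len(f) + 1)                     # sample indices n = 1..N
    t = np.atleast_1d(t).astype(float)
    # np.sinc(x) = sin(pi*x)/(pi*x), which is exactly the kernel of Eq. (2)
    kernel = np.sinc((t[:, None] - n[None, :] * T) / T)
    return kernel @ f                                # f_r evaluated at each requested t
```

As the text points out below, a low-order local polynomial through a few neighboring samples is a far cheaper way to rebuild the signal near a given point, and that is what the method actually uses.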
Alternatively, if we consider the continuous function f(t), it is intuitively obvious that the probability of finding a certain value ỹ = f(t̃) in the interval [y, y + δy] should be proportional to the time δt spent by the function in traversing the arbitrarily small neighborhood [y, y + δy] of ỹ. More precisely, P(y) ∝ |δt/δy| → 1/|f'(t)| as δy → 0. As f can take the value y at many different instants t, we need to add all these contributions over the whole time interval in which the PDF is being calculated. Then we have

\[ P(y) = \frac{1}{N} \sum_{t:\, y = f(t)} \frac{1}{|f'(t)|}, \tag{3} \]

where N is a normalization constant, such that

\[ \int_{-\infty}^{\infty} P(y)\, dy = 1. \tag{4} \]

Equation (3) can be seen as a particular case of equation (5-5) in reference [6]. Note that in (3) the set of values of y can be constructed arbitrarily, provided that f(t) is defined for at least a subset of the values y chosen to evaluate P(y). As a consequence, and as a by-product, this allows the resolution in the evaluation of P(y) to be increased arbitrarily.

It is important to mention here that a large number of points is not strictly necessary for evaluating a PDF, although it certainly helps in obtaining an accurate evaluation of the normalization constant N. The reason is that formula (3) corresponds to the limit of infinitely many points of the binning method (the proof is straightforward). What is indeed needed is a good representation of f(t) in the neighborhoods of the zeroes of f'(t), where the right-hand side of equation (3) becomes singular. The advantage of using a rather large number of points y_k arises when the normalization constant is calculated: given that P(y) is singular at the roots of f'(t),

\[ N = \int \sum_{t:\, y = f(t)} |f'(t)|^{-1}\, dy \]

must be evaluated using a low-order numerical integration method. In the examples below the trapezium rule was used, which requires many points to give an accurate result. Another concern is related to the singular values on the right-hand side of equation (3): one might expect zeroes of the denominator to appear while running the numerical calculation. This is not the case, because when an arbitrary value y_k is picked, producing a set {t_kj}, j = 1, ..., m_k, such that f(t_kj) = y_k, obtaining f'(t_kj) = 0 for some j is extremely unlikely. To date, this has never happened to me, neither in the examples given below nor in other calculations. Thus, to compute a PDF, all we need is a suitable, computationally efficient interpolation scheme to rebuild the function f near (t, y), using some of the samples in the discrete set {f_n}. Although this can be done with the help of equation (2), using an interpolating polynomial through four or six points around the point (t, f(t)) is a far better approach, provided that the function corresponding to the samples is smooth enough. When dealing with digitized data, this is ensured by the anti-aliasing filter, except for the remaining electronic noise. I will return to this point later.

To illustrate the method, let us start with the calculation of the PDF of a few cycles of a sine function plus a "drift",

\[ f(t) = \sin(2\pi t/T) + \alpha t, \tag{5} \]

with suitable values for the parameters and a rather poor sampling. From the samples shown in figure 1(a), and using a third-degree interpolating polynomial, the PDF displayed in figure 1(b) is obtained, using 5000 points for P(y). Note that the expected singularities in this PDF are remarkably well represented. Of course, there is no way to obtain this result by binning the data shown in figure 1(a).

FIG. 1: (a) Samples of a function from which the PDF is to be calculated. (b) The resulting PDF, using 5000 points.
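The paper gives no implementation, but the recipe it describes can be sketched directly: for each level y_k, locate the sampling intervals in which the signal crosses y_k, refine each crossing with a cubic polynomial through the four surrounding samples, accumulate 1/|f'(t*)| at the crossing times, and normalize the result with the trapezium rule. The code below is a minimal illustration of that recipe in Python with NumPy; the function name, argument layout, and edge handling are my own assumptions, not the author's code.

```python
import numpy as np

def pdf_estimate(samples, T, y_levels):
    """Sketch of the level-crossing PDF estimate of Eq. (3).

    For each level y_k, every sampling interval that brackets y_k is refined
    with a cubic fit through the four surrounding samples; 1/|f'(t*)| is
    accumulated at the crossing time t*, and the result is normalized so that
    the trapezium-rule integral over y_levels equals one (Eq. (4))."""
    f = np.asarray(samples, dtype=float)
    y_levels = np.asarray(y_levels, dtype=float)
    t = T * np.arange(1, len(f) + 1)
    p_raw = np.zeros_like(y_levels)

    for k, y in enumerate(y_levels):
        g = f - y
        crossings = np.where(g[:-1] * g[1:] < 0)[0]      # sign changes bracket a root
        for i in crossings:
            lo, hi = max(i - 1, 0), min(i + 3, len(f))   # up to four neighboring samples
            poly = np.poly1d(np.polyfit(t[lo:hi], f[lo:hi], deg=hi - lo - 1))
            roots = (poly - y).roots
            roots = roots[np.abs(roots.imag) < 1e-9].real
            roots = roots[(roots >= t[i]) & (roots <= t[i + 1])]
            if roots.size == 0:
                continue
            slope = poly.deriv()(roots[0])
            if slope != 0.0:                             # exact zeros are extremely unlikely
                p_raw[k] += 1.0 / abs(slope)

    norm = np.trapz(p_raw, y_levels)                     # low-order quadrature, as in the text
    return p_raw / norm if norm > 0 else p_raw
```

Applied to a coarsely sampled version of Eq. (5), for instance with y_levels = np.linspace(samples.min(), samples.max(), 5000), such a sketch should exhibit the sharp, integrable peaks near the turning points of the sine, as in figure 1(b). The local cubic fit plays the role the text assigns to the interpolating polynomial through four points; using the full reconstruction of Eq. (2) would yield the same crossings at a much higher computational cost.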
As a second example, consider figure 2(a), which displays a 13 s sample of a 3000 s long record of the power injected to maintain the turbulence in a flow like those of references [4, 5]. As this record was taken specifically to build the PDF of the injected power, some oversampling was performed to allow numerical smoothing of the data. In this case, a cutoff frequency of 50 Hz was used in the anti-aliasing filter, for a sampling rate of 150 sps (samples per second). The applied smoothing process is such that the signal spectrum remains unchanged below the filter cutoff frequency. With these two cautionary measures, it is possible to use third-order polynomials to locally reconstruct the signal around the points y_k chosen to build the PDF, using only four neighboring samples.

FIG. 2: (a) Sample record of the power injected in a confined turbulent flow. (b) Its PDF estimate, obtained by binning the data using 200 bins. (c) The result of increasing the number of bins to 4000. (d) Estimate of the PDF based on equation (3) (see text). The number of points is 4000, as in the previous plot.

In figure 2(b), a PDF built using the standard binning method is displayed, with 200 bins. The PDF looks very acceptable, thanks to the roughly 4.5 × 10^5 samples of the whole data record. If we try to increase the accuracy of the PDF by increasing the number of bins, things go from bad to worse: figure 2(c) shows the PDF obtained when 4000 bins are used, which is clearly not a reasonable approach. However, by using the method proposed here, a remarkably good 4000-point estimate of the PDF is obtained, as displayed in figure 2(d). When it is compared with figure 2(b), it is clear that all of the features present there are recovered, but with a highly increased level of detail.

In conclusion, in this note I report a powerful method to estimate the PDF of magnitudes obtained as time series from essentially continuous functions of time, like those resulting from digitizing the output of an anti-aliasing filter, a very common experimental scenario. In contrast to the usual binning procedure, the method proposed here can yield, in principle, optimal accuracy and arbitrary resolution in the resulting estimate of the probability density function, even for rather small data sets.

This work benefited from the financial support provided by FONDECYT, under project No. 1040291.

[1] T. Zhou, Z. Hao, L. P. Chua, and S. C. M. Yu, Phys. Rev. E, 066307 (2005).
[2] A. Staicu and W. van de Water, Phys. Rev. Lett., 094501 (2003).
[3] P. Abry, S. Fauve, P. Flandrin, and C. Laroche, J. Phys. II