Computing Accurate Probabilistic Estimates of One-D Entropy from Equiprobable Random Samples

Hoshin V Gupta, Mohammed Reza Ehsani, Tirthankar Roy, Maria A Sans-Fuentes, Uwe Ehret, Ali Behrangi

Hydrology and Atmospheric Sciences, The University of Arizona, Tucson, AZ
Civil and Environmental Engineering, University of Nebraska-Lincoln, Nebraska
GIDP Statistics and Data Science, The University of Arizona, Tucson, AZ
Institute of Water and River Basin Management - Hydrology, Karlsruhe Institute of Technology, Karlsruhe, Germany

Draft Manuscript (02/22/2021) for submission to ArXiv
Abstract
We develop a simple Quantile Spacing (QS) method for accurate probabilistic estimation of one-dimensional entropy from equiprobable random samples, and compare it with the popular Bin-Counting (BC) method. In contrast to BC, which uses equal-width bins with varying probability mass, the QS method uses estimates of the quantiles that divide the support of the data generating probability density function (pdf) into equal-probability-mass intervals. Whereas BC requires optimal tuning of a bin-width hyper-parameter whose value varies with sample size and shape of the pdf, QS requires specification of the number of quantiles to be used. Results indicate, for the class of distributions tested, that the optimal number of quantile-spacings is a fixed fraction of the sample size (empirically determined to be ~0.25-0.35), and that this value is relatively insensitive to distributional form or sample size, providing a clear advantage over BC since hyperparameter tuning is not required. Bootstrapping is used to approximate the sampling variability distribution of the resulting entropy estimate, and is shown to accurately reflect the true uncertainty. For the four distributional forms studied (Gaussian, Log-Normal, Exponential and Bimodal Gaussian Mixture), expected estimation bias is less than 1% and uncertainty is relatively low even for very small sample sizes. We speculate that estimating quantile locations, rather than bin-probabilities, results in more efficient use of the information in the data to approximate the underlying shape of an unknown data generating pdf.
Keywords:
Entropy, estimation, quantile spacing, accuracy, uncertainty, bootstrap, small-sample efficiency
1. Introduction

[1] Consider a data generating process $p(x)$ from which a finite set of $N_Z$ random, equiprobable, independent identically distributed (iid) samples $Z = \{z_i,\ i = 1 \ldots N_Z\}$ is drawn. In general, we may not know the nature and mathematical form of $p(x)$, and our goal is to compute an estimate $\hat{H}_p(X|Z)$ of the entropy $H_p(X)$ of $p(x)$.

[2] In the idealized case, where $X$ is a one-dimensional continuous random variable and the parametric mathematical form of $p(x)$ is known, we can apply the definition of differential entropy (Shannon 1948; Cover and Thomas 1991) to compute:

$$H_p(X) = E_p\left[-\ln p(x)\right] = \int_{-\infty}^{+\infty} -\ln\!\big(p(x)\big) \cdot p(x)\, dx \qquad \text{(Eqn. 1)}$$

Explicit, closed-form solutions for $H_p(X)$ are available for a variety of probability density functions (pdfs). For a variety of others, closed-form solutions are not available, and one can compute $H_p(X)$ via numerical integration of Eqn. 1. In all such cases, entropy estimation consists of first obtaining estimates $\hat{\theta}|Z$ of the parameters $\theta$ of the known parametric density $p(x|\theta)$, and then computing the entropy estimate $\hat{H}_{p|\hat{\theta}}(X|Z)$ by plugging $p(x|\hat{\theta})$ into Eqn. 1. Any bias and uncertainty in the entropy estimate will depend on the accuracy and uncertainty of the parameter estimates $\hat{\theta}$. If the form of $p(x|\theta)$ is "assumed" rather than explicitly known, then additional bias will stem from the inadequacy of this assumption.

[3] In most practical situations the mathematical form of $p(x)$ is not known, and $Z$ must first be used to obtain a data-based estimate $\hat{p}(x|Z)$, from which an estimate $\hat{H}_{\hat{p}}(X|Z)$ can be obtained via numerical integration of Eqn. 1. In generating $\hat{p}(x|Z)$, consistency with prior knowledge regarding the nature of $p(x)$ must be ensured; for example, $X$ may be known to take on only positive values, or values on some finite range. Consistency must also be maintained with the information in $Z$. Further, the sample size $N_Z$ must be sufficiently large that the information in $Z$ provides an accurate characterization of $p(x)$; in other words, $Z$ must be informationally representative and consistent.

[4] To summarize, for the case that $X$ is a continuous random variable, entropy estimation from data involves two steps: (i) use of $Z$ to estimate $\hat{p}(x|Z)$, and (ii) numerical integration to compute an estimate of entropy using Eqn. 2:

$$\hat{H}_{\hat{p}}(X|Z) = E_{\hat{p}}\left[-\ln \hat{p}(x|Z)\right] = \int_{-\infty}^{+\infty} -\ln\!\big(\hat{p}(x|Z)\big) \cdot \hat{p}(x|Z)\, dx \qquad \text{(Eqn. 2)}$$

Accordingly, the estimate $\hat{H}_{\hat{p}}(X|Z)$ has two potential sources of error. One is due to the use of $\hat{p}(x|Z)$ to approximate $p(x)$, and the other is due to imperfect numerical integration. To maximize accuracy, we must ensure that both of these errors are minimized. Further, $\hat{H}_{\hat{p}}(X|Z)$ is a statistic that is subject to inherent random variability associated with the sample $Z$, and so it will be useful to have an uncertainty estimate, in some form such as confidence intervals.

[5] For cases where $X$ is discrete and can take on only a finite set of values $\{x(j),\ j = 1 \ldots N_X\}$, if the mathematical form of $p(x) = \{p(x(j)),\ j = 1 \ldots N_X\}$ is known, then $H_p(X)$ can be computed by applying the mathematical definition of discrete entropy:

$$H_p(X) = E_p\left[-\ln p(x)\right] = -\sum_{j=1}^{N_X} \ln\!\big(p(x(j))\big) \cdot p(x(j)) \qquad \text{(Eqn. 3)}$$

[6] Here, given a data sample $Z$, pdf estimation amounts simply to counting the number $n(j)$ of data points in $Z$ that take on the value $x(j)$, and setting $\hat{p}(x|Z) = \{\hat{p}(x(j)|Z) = n(j)/N_Z,\ j = 1 \ldots N_X\}$, with $\hat{p}(x|Z) = 0$ otherwise. Entropy estimation then consists of applying the equation:

$$\hat{H}_{\hat{p}}(X|Z) = -\sum_{j=1}^{N_X} \ln\!\big(\hat{p}(x(j)|Z)\big) \cdot \hat{p}(x(j)|Z) \qquad \text{(Eqn. 4)}$$

In this case, there is no numerical integration error; any bias in the estimate is entirely due to $\hat{p}(x(j)|Z) \ne p(x(j))$, which occurs due to $Z$ not being perfectly informative about $p(x)$, while uncertainty is due to $Z$ being a random sample drawn from $p(x)$. If $Z$ is a representative sample, then as $N_Z \to \infty$ we have $\hat{p}(x(j)|Z) \to p(x(j))$ and hence $\hat{H}_{\hat{p}}(X|Z) \to H_p(X)$, so that estimation bias and uncertainty will both tend towards zero as the sample size is increased.

[7] When the one-dimensional random variable $X$ is some hybrid combination of discrete and continuous, the relative fractions of total probability mass associated with the discrete and continuous portions of the pdf must also be estimated. The general principles discussed herein also apply to the hybrid case, but we will not consider it further in this paper; for a relevant discussion of estimating entropy for mixed discrete-continuous random variables, see Gong et al. (2014).
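To make Eqns. 1 and 4 concrete, the following minimal sketch (ours, not part of the original manuscript; function names are our own) computes differential entropy by numerical integration for a known parametric pdf, and the discrete plug-in entropy estimate from counts:

```python
import numpy as np
from scipy import integrate, stats

def differential_entropy_known_pdf(pdf, lo, hi):
    """Eqn. 1: H_p(X) = -integral of ln(p(x)) * p(x) dx over [lo, hi]."""
    integrand = lambda x: -np.log(pdf(x)) * pdf(x)
    H, _abs_err = integrate.quad(integrand, lo, hi)
    return H

# Example: standard Gaussian; closed form is 0.5 * ln(2*pi*e) ~ 1.4189
H_num = differential_entropy_known_pdf(stats.norm.pdf, -10, 10)

def discrete_plugin_entropy(samples):
    """Eqn. 4: plug-in entropy from relative frequencies of a discrete sample."""
    _, counts = np.unique(samples, return_counts=True)
    p_hat = counts / counts.sum()          # p_hat(x(j)|Z) = n(j) / N_Z
    return -np.sum(p_hat * np.log(p_hat))  # unobserved categories contribute 0
```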
Popular Approaches to Estimating Distributions from Data

[8] We focus here on the case of a one-dimensional continuous random variable $X$ for which the mathematical form of $p(x)$ is unknown. Silverman (1986) provides a summary of methods for the estimation of pdfs from data, while Beirlant et al. (1997) provides an overview of methods for estimating the differential entropy of a continuous random variable. The three most widely used "non-parametric" methods for estimating differential entropy by pdf approximation are: (a) the Bin-Counting (BC) or piecewise-constant frequency histogram method, (b) the Kernel Density (KD) method, and (c) the Average Shifted Histogram (ASH) method. Scott (2008) points out that these can all be asymptotically viewed as "Kernel" methods, where the bins in the BC and ASH approaches are understood as treating the data points falling in each bin as being from a locally uniform distribution.
[9] As discussed by Scott (1979) and Scott and Thompson (1983), appropriate selection of the bin-width (effectively a smoothing hyper-parameter) is critical to the success of the BC and ASH histogram-based methods. Bin-widths that are too small can result in an overly rough approximation of the underlying distribution (increasing the variance), while bin-widths that are too large can result in an overly smooth approximation (introducing bias). Therefore, one typically has to choose values that balance variance and bias errors. Scott (1979) and Scott (2004) present expressions for "optimal" bin width when using BC, including the "normal reference rule" that is applicable when the pdf is approximately Gaussian, and the "oversmoothed bandwidth rule" that places an upper bound on the bin-width. Similarly, Scott (2008) shows that while KD is more computationally costly to implement than BC, its accuracy and convergence are better, and derives optimal values for the KD smoothing hyper-parameter. Scott (1985) also proposed the ASH method, which refines BC by sub-dividing each histogram bin into sub-bins, with computational cost similar to BC and accuracy approaching that of KD. Note that, if prior information on the shape of $p(x)$ is available, or if a representation with the smallest number of bins is desired, then variable bin-width methods may be more appropriate (e.g., Wegman 1975; Denison et al. 2002; Jackson et al. 2005; Endres and Foldiak 2005; Hutter 2007).
[10] The BC, KD and ASH methods all require hyper-parameter tuning to be successful. BC requires selection of the histogram bin-width (and thereby the number of bins), KD requires selection of the form of the Kernel function and tuning of its parameters, and ASH requires selection of the form of a Kernel function and tuning of both the coarse bin width and the number of sub-bins. While recommendations are available to guide the selection of these "hyper-parameters", they depend on theoretical arguments based in assumptions regarding the typical underlying forms of $p(x)$. Based on empirical studies, and given that we typically do not know the "true" form of $p(x)$ to be used as a reference for tuning, Gong et al. (2014) recommend use of BC and KD rather than the ASH method.
[11] Finally, since BC effectively treats the pdf as being discrete, and therefore uses Eqn. 4 with each of the indices $j$ corresponding to one of the histogram bins, the entropy lost by implementing the discrete constant bin-width approximation is approximately $\ln(h)$, where $h$ is the bin width, provided $h$ is sufficiently small (Cover and Thomas, 2006). This fact allows conversion of the discrete entropy estimate to differential entropy simply by adding $\ln(h)$ to the discrete entropy estimate.
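A simple worked example (ours, not from the manuscript) illustrates the correction: discretize the Uniform density on $[0, 1)$ into $N$ equal-width bins of width $h = 1/N$. Each bin then has probability mass $1/N$, so the discrete entropy is $\ln(N)$, and adding $\ln(h)$ recovers the true differential entropy of zero:

$$\hat{H}_{diff} = \hat{H}_{disc} + \ln(h) = \ln(N) + \ln(1/N) = 0$$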
[12] In summary, while BC and KD can be used to obtain accurate estimates of entropy for pdfs of arbitrary form, hyper-parameter tuning is required to ensure that good results are obtained. In the next section, we propose an alternative method for approximating $p(x)$ that does not require counting the numbers of samples in "bins", and is instead based on estimating the quantile positions of $p(x)$.

Proposed Quantile Spacing (QS) Approach

[13] We present an approach to computing an estimate $\hat{H}_p(X|Z)$ of the entropy $H_p(X)$, given a set of available samples $Z$, for the case where $X$ is a one-dimensional continuous random variable and the mathematical form of the data generating process $p(x)$ is unknown. The approach is based in the assumption that $p(x)$ can be approximated as piecewise constant on the intervals between quantile locations, and consists of three steps.

Step 1 - Assumption about Support Interval

[14] The first step is to assume that $p(x)$ exists only on some finite support interval $[x_{min}, x_{max}]$, where $x_{min} \le \min\{Z\}$ and $x_{max} \ge \max\{Z\}$; i.e., we treat $p(x)$ as being zero everywhere outside of the interval $[x_{min}, x_{max}]$. Given that the true support of $X$ may, in reality, be as extensive as $[-\infty, +\infty]$, we allow the selection of this interval (based on prior knowledge, such as physical realism) to be as extensive as appropriate and/or necessary. However, as we show later, the impact of this selection can be quite significant and will need special attention.

Step 2 - Assumption about Approximate Form of $p(x)$

[15] The second step is to assume that $p(x)$ can be approximated as piecewise constant on the intervals between the quantiles $q = \{q_0, q_1, q_2, \ldots, q_{N_q}\}$ associated with the $\{0, 1/N_q, 2/N_q, \ldots, (N_q - 1)/N_q, 1\}$ non-exceedance probabilities of $p(x)$, where $N_q$ represents the number of quantile spacings, $q_0 = x_{min}$, and $q_{N_q} = x_{max}$. This corresponds to making the minimally informative (maximum entropy) assumption that $p(x)$ is "uniform" over each of the quantile intervals $[q_{j-1}, q_j]$ for $j = 1 \ldots N_q$, which is equivalent to assuming that the corresponding cumulative distribution function $P(x)$ is piecewise linear (i.e., increases linearly between $q_{j-1}$ and $q_j$).

[16] Assuming perfect knowledge of the locations of the quantiles $q$, this approximation corresponds to:

$$p(x) \approx \hat{p}(x|q) = \frac{1}{N_q \, \Delta_j} = \frac{K}{\Delta_j} \quad \text{for } q_{j-1} \le x < q_j,\ j = 1 \ldots N_q \qquad \text{(Eqn. 5)}$$

where $\Delta_j = q_j - q_{j-1}$. To ensure that $\hat{p}(x|q)$ integrates to 1.0 over the support region $[x_{min}, x_{max}]$ we have $K = 1/N_q$. Accordingly, our entropy estimate is given by:

$$\hat{H}_{\hat{p}}(X|q) = -\sum_{j=1}^{N_q} \int_{q_{j-1}}^{q_j} \ln\!\left(\frac{1}{N_q \Delta_j}\right) \cdot \frac{1}{N_q \Delta_j}\, dx \qquad \text{(Eqn. 6a)}$$

$$= \frac{1}{N_q} \sum_{j=1}^{N_q} \ln\!\left(N_q \cdot \Delta_j\right) \qquad \text{(Eqn. 6b)}$$

$$= \ln(N_q) + \frac{1}{N_q} \sum_{j=1}^{N_q} \ln\!\left(\Delta_j\right) \qquad \text{(Eqn. 6c)}$$

From Eqn. 6b we see that the estimate depends on the logs of the spacings between quantiles, and is defined by the average of these values. Further, we can define the error due to the piecewise constant approximation of $p(x)$ as $\Delta H_{p,\hat{p}}(X|q) = \hat{H}_{\hat{p}}(X|q) - H_p(X)$.

Step 3 - Estimation of the Quantiles of $p(x)$

[17] The third step is to use the available data $Z$ to compute estimates of the quantiles $q$ to be plugged into Eqn. 6. Of course, given a finite sample size $N_Z$, the number of quantiles $N_q$ that can be estimated will, in general, be much smaller than the sample size $N_Z$ (i.e., $N_q \ll N_Z$).

[18] Various methods for computing estimates of the quantiles are available. Here, we use a relatively simple approach in which $N_K$ sample subsets $Z_k,\ k = 1 \ldots N_K$, each of size $N_q - 1$ (i.e., $Z_k = \{z_1, z_2, \ldots, z_{N_q - 1}\}$), are drawn from the available sample set $Z$, where the samples in each subset are drawn from $Z$ without replacement so that the values obtained in each subset are unique (not repeated). For each subset, we sort the values in increasing order to obtain $q_k = \{q_{1k}, q_{2k}, \ldots, q_{(N_q - 1)k}\}$, thereby obtaining $N_K$ estimates $\{q_{j1}, q_{j2}, \ldots, q_{jN_K}\}$ for each $q_j,\ j = 1 \ldots N_q - 1$.
This procedure results in an empirical estimate of the sample distribution $p(q_j|Z)$ for each quantile $q_j,\ j = 1 \ldots N_q - 1$. Finally, we compute $\hat{q}_j = \frac{1}{N_K}\sum_{k=1}^{N_K} q_{jk}$, and set $\hat{q} = \{x_{min}, \hat{q}_1, \hat{q}_2, \ldots, \hat{q}_{N_q - 1}, x_{max}\}$. Plugging these values into Equations 6b & 6c, we get:

$$\hat{H}_{\hat{p}}(X|\hat{q}) = \frac{1}{N_q} \sum_{j=1}^{N_q} \ln\!\left(N_q \cdot \hat{\Delta}_j\right) \qquad \text{(Eqn. 7a)}$$

$$= \ln(N_q) + \frac{1}{N_q} \sum_{j=1}^{N_q} \ln\!\left(\hat{\Delta}_j\right) \qquad \text{(Eqn. 7b)}$$

where $\hat{\Delta}_j = \hat{q}_j - \hat{q}_{j-1}$. For practical computation, to avoid numerical problems as $N_q$ becomes large (so that $\hat{\Delta}_j$ becomes very small and $\ln(\hat{\Delta}_j)$ approaches $-\infty$), we will actually use Eqn. 7a. Further, we define the additional error due purely to imperfect quantile estimation as $\Delta H_{\hat{p}}(X|\hat{q}, q) = \hat{H}_{\hat{p}}(X|\hat{q}) - \hat{H}_{\hat{p}}(X|q)$.
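To make the three-step procedure concrete, here is a minimal sketch in Python (ours; the authors' reference implementation is linked in the Acknowledgements, and the default `N_K = 500` is an illustrative placeholder rather than a value taken from this manuscript):

```python
import numpy as np

def qs_entropy(Z, N_q, N_K=500, rng=None):
    """Quantile Spacing entropy estimate (Eqn. 7a).

    Z   : 1-D array of iid samples from the unknown pdf p(x)
    N_q : number of quantile spacings (N_q - 1 interior quantiles)
    N_K : number of subsample replicates used to smooth the quantile estimates
    """
    rng = rng or np.random.default_rng()
    Z = np.asarray(Z, dtype=float)
    x_min, x_max = Z.min(), Z.max()  # support interval from the sample (Step 1)

    # Step 3: each sorted subsample of size N_q - 1 gives one estimate of the
    # interior quantiles; averaging over N_K subsamples smooths the estimates.
    q_reps = np.empty((N_K, N_q - 1))
    for k in range(N_K):
        q_reps[k] = np.sort(rng.choice(Z, size=N_q - 1, replace=False))
    q_hat = q_reps.mean(axis=0)

    # Assemble q_hat = {x_min, q_1, ..., q_{N_q-1}, x_max} and apply Eqn. 7a.
    edges = np.concatenate(([x_min], q_hat, [x_max]))
    spacings = np.diff(edges)
    return np.mean(np.log(N_q * spacings))
```

For example, calling `qs_entropy(rng.standard_normal(1000), N_q=250)` (i.e., $\alpha = 0.25$) should return a value close to the standard Gaussian's differential entropy of approximately 1.419.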
Random Variability associated with the QS-based Entropy Estimate

[19] Given that the quantile spacing estimates $\{\hat{\Delta}_j,\ j = 1 \ldots N_q\}$ are subject to random sampling variability associated with (i) the sampling of $Z$ from $p(x)$, and (ii) estimation of the quantile positions $\hat{q}_j$, the entropy estimate $\hat{H}_{\hat{p}}(X|\hat{q})$ will also be subject to random sampling variability. As shown later, we can generate probabilistic estimates of the nature and size of this error from the empirical estimates of $p(q_j|Z)$ obtained in Step 3, and by bootstrapping on $Z$.
Properties of the Proposed Approach

[20] The accuracy of the estimate $\hat{H}_{\hat{p}}(X|\hat{q})$ obtained using the QS method outlined above depends on the following four assumptions, each of which we discuss below:

(i) A1: The piecewise constant approximation $\hat{p}(x|q)$ of $p(x)$ on the intervals between the quantile positions is adequate.

(ii) A2: The quantile positions $q = \{q_0, q_1, q_2, \ldots, q_{N_q}\}$ of $p(x)$ have been estimated accurately.

(iii) A3: The pdf $p(x)$ exists only on the support interval $[x_{min}, x_{max}]$, which has been properly chosen.

(iv) A4: The sample set $Z$ is consistent, representative and sufficiently informative about the underlying nature of $p(x)$.
Implications of the Piecewise Constant Assumption

[21] Assume that $\hat{p}(x|q)$ provides a piecewise-constant estimate of $p(x)$ and that the quantile positions $q = \{q_0, q_1, q_2, \ldots, q_{N_q}\}$ associated with a given choice for $N_q$ are perfectly known. Since the continuous shape of the cumulative distribution function (cdf) $P(x)$ can be approximated to an arbitrary degree of accuracy by a sufficient number of piecewise linear segments, we will have $\hat{P}(x|q) \to P(x)$ as $N_q \to \infty$.

[22] However, an insufficiently accurate approximation will result in a pdf estimate that is not sufficiently smooth, so that the entropy estimate will be biased. This bias will, in general, be positive (overestimation), because the piecewise-constant form $\hat{p}(x|q)$ used to approximate $p(x)$ will always be shifted slightly in the direction of larger entropy; i.e., each piecewise constant segment in $\hat{p}(x|q)$ is a maximum-entropy (uniform distribution) approximation of the corresponding segment of $p(x)$. However, the bias can be reduced and made arbitrarily small by increasing $N_q$ until the Kullback-Leibler distance between $\hat{p}(x|q)$ and $p(x)$ is so small that the information loss associated with use of $\hat{p}(x|q)$ in place of $p(x)$ in Eqn. 1 is negligible.

[23] The left panel of Figure 1 shows how this bias in the estimate of $H_p(X)$, due solely to the piecewise constant approximation of the pdf (no sample data are involved), declines with increasing $N_q$ for three pdf forms of varying functional complexity (Gaussian, Log-Normal and Exponential), each using a parameter choice such that all have the same theoretical entropy $H_p(X)$. Also shown, for completeness, are results for the Uniform pdf, where only one piecewise constant bin is theoretically required. Note that because $H_p(a \cdot (X - x_0)) = H_p(X) + \ln(a)$, the entropy can be changed to any desired value simply by rescaling on $X$. For these theoretical examples, the quantile positions are known exactly, and the resulting estimation bias is due only to the piecewise constant approximation of $p(x)$. However, since the theoretical pdfs used for this example all have infinite support, whereas the piecewise approximation requires specification of a finite support interval, for the latter we set $[x_{min}, x_{max}]$ to be the theoretical locations where $P(x_{min}) = \varepsilon$ and $P(x_{max}) = 1 - \varepsilon$, with $\varepsilon$ chosen to be some sufficiently small number. We see empirically that the bias due to the piecewise constant approximation declines to zero approximately as an exponential function of $\log N_q$, falling below the dashed 5% and 1% bias lines of Figure 1 as $N_q$ is increased.

[24] In practice, given a finite sample size $N_Z$, our ability to increase the value of $N_q$ will be constrained by the size of the sample (i.e., $N_q < N_Z$). This is because when the form of $p(x)$ is unknown, the locations of the quantile positions must be estimated using the information provided by $Z$. Further, what constitutes a sufficiently large value for $N_q$ will depend on the complexity of the underlying shape of $p(x)$.
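The left-panel experiment of Figure 1 can be reproduced in outline as follows (a sketch under our own choices; the value of $\varepsilon$, the pdf, and the $N_q$ values are illustrative):

```python
import numpy as np
from scipy import stats

def qs_entropy_exact_quantiles(dist, N_q, eps=1e-6):
    """Piecewise-constant entropy (Eqn. 6b) using the THEORETICAL quantiles of a
    scipy.stats distribution, with support truncated at P = eps and P = 1 - eps."""
    probs = np.linspace(eps, 1.0 - eps, N_q + 1)  # non-exceedance probabilities
    q = dist.ppf(probs)                            # exact quantile locations
    return np.mean(np.log(N_q * np.diff(q)))

true_H = stats.norm.entropy()  # 0.5 * ln(2*pi*e)
for N_q in (10, 100, 1000):
    H_hat = qs_entropy_exact_quantiles(stats.norm, N_q)
    print(N_q, 100.0 * (H_hat - true_H) / true_H)  # positive bias, shrinking with N_q
```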
Implications of Imperfect Quantile Position Estimation

[25] Assume that $N_q$ is large enough for the piecewise constant pdf approximation to be sufficiently accurate, but that the estimates $\{\hat{q}_1, \hat{q}_2, \ldots, \hat{q}_{N_q}\}$ of the locations of the quantiles are imperfect. Clearly, this can affect the estimate of entropy computed via Eqn. 7a, by distorting the shape of $\hat{p}(x|\hat{q})$ away from $\hat{p}(x|q)$, and therefore away from $p(x)$. Further, the uncertainty associated with the quantile estimates will translate into uncertainty associated with the estimate of entropy.

[26] In general, as the number of quantiles $N_q$ is increased, the inter-sample spacings associated with each ordered subset $q_k = \{q_{1k}, \ldots, q_{(N_q - 1)k}\},\ k = 1 \ldots N_K$ will decrease, so that the distribution of possible locations for each quantile $q_{jk},\ j = 1 \ldots N_q - 1$ will progressively become more tightly constrained. This means that the bias associated with each estimated quantile $\hat{q}_j$ will reduce progressively towards zero as $N_q$ is increased (constrained only by sample size $N_Z$), and the variance of the estimate $\hat{q}_j$ will decline towards zero as the number of subsamples $N_K$ is increased.

[27] Figure 2 illustrates how the bias and uncertainty associated with estimates of the quantiles diminish with increasing $N_q$ and $N_K$. Experimental results are shown for the Log-Normal density, with the y-axis indicating percent error in the quantile estimates corresponding to the 90% (green), 95% (purple) and 99% (turquoise) non-exceedance probabilities. In these plots, there is no distorting effect of sample size $N_Z$ (the sample size is effectively infinite), since when computing the estimates of the quantiles (as explained in Section 3.3) we draw subsamples of size $N_q$ directly from the theoretical pdf.

[28] The left-side plot shows, for a fixed number $N_K$ of subsamples, how the biases and uncertainties diminish as $N_q$ is increased. The boxplots reflect uncertainty due to random sampling variability, estimated by repeating each experiment many times (drawing new samples from the pdf each time). As expected, for smaller $N_q$ the quantile location estimates tend to be negatively biased, particularly for those in the more extreme tail locations of the distribution, with the bias shrinking steadily towards zero as $N_q$ is increased. The right-side plot shows, for a fixed value of $N_q$, how the uncertainties diminish but the biases remain relatively constant as the number of subsamples $N_K$ is increased. Overall, the uncertainty becomes quite small once $N_K$ is sufficiently large.
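A sketch of the left-panel experiment of Figure 2 (our construction; the Log-Normal parameters and repetition counts are illustrative, since the original values are not reproduced here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
dist = stats.lognorm(s=1.0)            # illustrative Log-Normal
q99_true = dist.ppf(0.99)              # tail quantile being estimated

def estimate_quantile(N_q, N_K, tau=0.99):
    """Average, over N_K sorted subsamples of size N_q - 1, of the order
    statistic corresponding to the tau non-exceedance probability."""
    j = int(round(tau * N_q)) - 1      # 0-based index of the ~tau quantile
    reps = [np.sort(dist.rvs(N_q - 1, random_state=rng))[j] for _ in range(N_K)]
    return np.mean(reps)

for N_q in (100, 400, 1600):
    errs = [100 * (estimate_quantile(N_q, N_K=100) - q99_true) / q99_true
            for _ in range(20)]        # 20 repetitions -> boxplot-style spread
    print(N_q, np.mean(errs))          # percent error, diminishing as N_q grows
```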
Implications of the Finite Support Assumption

[29] Assume that $N_q$ has been chosen large enough for the piecewise constant pdf approximation to be sufficiently accurate, and that the exact quantile positions associated with this choice for $N_q$ are known. The equation for estimating entropy (Eqn. 7a) can be decomposed into three terms:

$$H_{\hat{p}}(X|q) = \frac{1}{N_q} \sum_{j=1}^{N_q} \ln\!\left(N_q \cdot \Delta_j\right) = H_1 + H_{2:N_q-1} + H_{N_q} \qquad \text{(Eqn. 8)}$$

where $H_1 = \ln(N_q \cdot \Delta_1)/N_q$, $H_{2:N_q-1} = \frac{1}{N_q}\sum_{j=2}^{N_q-1} \ln(N_q \cdot \Delta_j)$, and $H_{N_q} = \ln(N_q \cdot \Delta_{N_q})/N_q$, with $\Delta_j$ indicating the true inter-quantile spacings. Only the first and last terms, $H_1$ and $H_{N_q}$, are affected by the choices for $x_{min}$ and $x_{max}$, through $\Delta_1 = q_1 - x_{min}$ and $\Delta_{N_q} = x_{max} - q_{N_q - 1}$.

[30] Clearly, if $p(x)$ is bounded both above and below by specific known values, then there is no issue. However, if the support of $p(x)$ is not known, or if one or both bounds can reasonably be expected to extend to $\pm\infty$ (as appropriate), then the choice for the relevant limiting value ($x_{min}$ or $x_{max}$) can significantly affect the computed value for $\hat{H}_{\hat{p}}$. To see this, note that the first term $H_1$ can be made to vary from $-\infty$ when $\Delta_1 = 0$, to $+\infty$ when $\Delta_1 = \infty$, passing through zero when $\Delta_1 = 1/N_q$; and similarly for the last term $H_{N_q}$. Therefore, the error associated with $\hat{H}_{\hat{p}}$ can be made arbitrarily negatively large by choosing $\Delta_1$ and $\Delta_{N_q}$ to be too small, or arbitrarily positively large by choosing $\Delta_1$ and $\Delta_{N_q}$ to be too large.

[31] In practice, when dealing with samples $Z$ from some unknown data generating process, we will often have only the samples themselves from which to infer the support of $p(x)$, and therefore can only confidently state that $x_{min} \le \min\{Z\}$ and $x_{max} \ge \max\{Z\}$. One possibility could be to ignore the fractional contributions of the terms $H_1$ and $H_{N_q}$ corresponding to the (unknown) portions of the pdf, and instead use as our estimate $H^*_{\hat{p}}(X|q) = H_{2:N_q-1}$. This would be equivalent to setting $\Delta_1 = \Delta_{N_q} = 1/N_q$, so that $x_{min} = q_1 - 1/N_q$ and $x_{max} = q_{N_q - 1} + 1/N_q$. By doing so, we would be ignoring a portion of the overall entropy associated with the pdf and can therefore expect to obtain an underestimate. However, this bias error $BE = H_{\hat{p}}(X|q) - H^*_{\hat{p}}(X|q)$ will tend to zero as $N_q$ is increased.

[32] An alternative approach, which we recommend in this paper, is to set $x_{min} = \min\{Z\}$ and $x_{max} = \max\{Z\}$. In this case, there will be random variability associated with the sampled values for $\min\{Z\}$ and $\max\{Z\}$, and so the bias in our estimate can be either negative or positive. Nonetheless, this bias error $BE$ will still tend to zero as $N_q$ is increased.

[33] Note that the percentage contributions of the entropy fractions $H_1$ and $H_{N_q}$ to the total entropy $H_{\hat{p}}(X|q)$ will depend on the nature of the underlying pdf. Figure 3 illustrates this for three pdfs (the Gaussian, which has infinite extent on both sides, and the Exponential and Log-Normal, which have infinite extent on only one side), assuming no estimation error associated with the quantile locations. For the Gaussian (blue) and Exponential (red) densities, the largest fractional entropy contributions are clearly from the tail regions, whereas for the Log-Normal (orange) density this is not so. So, the entropy fractions can be proportionally quite large or small at the extremes, depending on the form of the pdf. Nonetheless, the overall entropy fraction associated with each quantile spacing diminishes with increasing $N_q$: for the examples shown, the maximum contribution associated with any single quantile spacing is markedly smaller for $N_q = 1000$ (right plot) than for $N_q = 100$ (left plot). This plot illustrates clearly the most important issue that must be dealt with when estimating entropy from samples.

[34] So, on the one hand, the cumulative entropy fractions associated with the tail regions of $p(x)$ that lie beyond $\min\{Z\}$ and $\max\{Z\}$ are impossible to know. On the other, the individual contributions associated with the extreme quantile spacings $\Delta_1$ and/or $\Delta_{N_q}$, where $p(x)$ is small, can be quite a bit larger than those associated with the intermediate quantile spacings. Overall, the only real way to control the estimation bias and uncertainty associated with these extreme regions is to use a sufficiently large value for $N_q$, so that the relative contribution of the extreme regions is small. This will, in turn, of course, be constrained by the sample size.
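The sensitivity described in paragraph [30] is easy to verify numerically. The following sketch (ours) decomposes Eqn. 8 for a standard Gaussian with exact quantiles, varying only the assumed support bounds:

```python
import numpy as np
from scipy import stats

def eqn8_terms(N_q, x_min, x_max):
    """Decompose Eqn. 8 into (H_1, H_mid, H_Nq) using exact Gaussian quantiles."""
    q_interior = stats.norm.ppf(np.arange(1, N_q) / N_q)   # q_1 ... q_{Nq-1}
    edges = np.concatenate(([x_min], q_interior, [x_max]))
    terms = np.log(N_q * np.diff(edges)) / N_q
    return terms[0], terms[1:-1].sum(), terms[-1]

# Widening the assumed support inflates only the first and last terms.
for bound in (4.0, 8.0, 16.0):
    print(bound, eqn8_terms(N_q=100, x_min=-bound, x_max=bound))
```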
Combined Effect of the Piecewise Constant Assumption, Finite Support Assumption, and Quantile Position Estimation using Finite Sample Sizes

[35] In Section 4.1, we saw that the effect of the piecewise constant assumption on the QS-based estimate of entropy is a positive bias that diminishes with increasing $N_q$. Similarly, Section 4.2 showed that the biases associated with the quantiles diminish with increasing $N_q$, while the corresponding uncertainties diminish with increasing $N_K$. As mentioned earlier, the bias in each quantile position will be towards the direction of locally higher probability mass (since more of the equiprobable random samples will tend to be drawn from this region), and therefore the estimate $\hat{p}(x|\hat{q})$ of $p(x)$ will be distorted in the direction of having smaller "dispersion" (i.e., $\hat{p}(x|\hat{q})$ will tend to be more "peaked" than $p(x)$), resulting in a negative bias in the corresponding estimate of entropy. Finally, Section 4.3 discussed the implications of the finite support assumption, given that $x_{min}$ and $x_{max}$ will often not be known.

[36] Figure 4 illustrates the combined effect of these assumptions. Here we show how the overall percentage error in the QS-based estimate of entropy varies as a function of $\alpha = N_q/N_Z$, where $\alpha$ expresses the number of quantiles $N_q$ as a fraction of the sample size $N_Z$. Sample sets of given size $N_Z$ are drawn from the Gaussian (left panel), Exponential (middle panel) and Log-Normal (right panel) densities, the quantiles are estimated using the procedure discussed in Section 3.3, $x_{min}$ and $x_{max}$ are set to be the smallest and largest data values in the set (Section 4.3), and entropy is estimated using Eqn. 7a for different selected values of $N_q$. To account for sampling variability, the results are averaged over different sample sets drawn randomly from the parent density.

[37] The plots show how percentage estimation error (bias) varies as $\alpha$ (and hence $N_q$) changes as a fraction of sample size $N_Z$, for a range of different sample sizes. As might be expected, in each case when $\alpha$ is too small the estimation bias is positive (over-estimation) and can be quite large, due to the piecewise constant approximation. However, as $\alpha$ is increased the estimation bias decreases rapidly, crosses zero, and becomes negative (under-estimation) due to the combined effects of quantile position estimation bias and use of the smallest and largest sample values to approximate $x_{min}$ and $x_{max}$. Most interesting is the fact that all of the curves cross zero at approximately $\alpha \approx 0.25$ to $0.35$, and that this location does not seem to depend strongly on the sample size or shape of the pdf. Further, the marginal cost of setting $\alpha$ too large is low compared to that of setting $\alpha$ too small. Overall, the expected bias error diminishes with increasing sample size $N_Z$, and the optimal choice is $\alpha \approx 0.25$ to $0.35$.

[38] Figure 5 illustrates both the bias and uncertainty in the estimate of entropy as a function of sample size $N_Z$, when the number of quantiles $N_q$ is specified as a fixed fraction of the sample size (i.e., fixed $\alpha$ in the optimal range). The uncertainty intervals are due to sampling variability, estimated by drawing different sample sets from the parent population. The results show that uncertainty due to sampling variability diminishes rapidly with sample size, becoming relatively small for large sample sizes.
Implications of Informativeness of the Data Sample

[39] For the results shown in Figures 4 & 5, we drew samples directly from $p(x)$. In practice, we must construct our entropy estimate using a single data sample $Z$ of finite size $N_Z$. Provided that $Z$ is a consistent and representative random sample from $p(x)$, with each element $z_i$ being iid, then a sufficiently large sample size $N_Z$ should enable construction of an accurate approximation $\hat{p}(x|\hat{q})$ of $p(x)$ via the QS method. However, if $N_Z$ is too small, it can (i) prevent setting a sufficiently large value for $N_q$, and (ii) tend to make the sets $Z_k$ sub-sampled from $Z$ insufficiently independent for accurate estimates of the quantile positions of $p(x)$ to be obtained. The overall effect will be to prevent $\hat{p}(x|\hat{q})$ from approaching $p(x)$, leading to an unreliable estimate of its entropy.

[40] Further, even if $N_Z$ is sufficiently large for $\hat{p}(x|\hat{q}) \approx p(x)$, sampling variability associated with randomly drawing $Z$ from $p(x)$ will result in the entropy estimate $\hat{H}_{\hat{p}}(X|\hat{q})$ being subject to statistical variability. Figure 6 shows how bootstrapped estimates of the uncertainty differ from those shown in Figure 5 above, in which we drew different sample sets from the parent population. Here, each time a sample set is drawn from the parent density, we draw $N_B$ bootstrap samples of the same size $N_Z$ from that sample set, use these to obtain $N_B$ different estimates of the associated entropy (using $\alpha$ fixed in the optimal range), and compute the width of the resulting inter-quartile range (IQR). We then repeat this procedure for different sample sets of the same size drawn from the parent population. Figure 6 shows the ratio of the IQR obtained using bootstrapping to the actual IQR for different sample sizes; the boxplots represent variability due to random sampling. Here, an expected (mean) ratio value of 1.0 and a small boxplot width is ideal, indicating that bootstrapping provides a good estimate of the uncertainty to be associated with random sampling variability. The results show that for smaller sample sizes there is a tendency to overestimate the width of the inter-quartile range, but this slight positive bias disappears for larger sample sizes.
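The Figure 6 diagnostic can be sketched as follows (ours; all counts are illustrative and `qs_entropy` is the helper sketched in Section 3):

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_iqr(Z, alpha=0.25, N_B=200):
    """IQR of QS entropy estimates over N_B bootstrap resamples of Z."""
    N_q = max(2, int(alpha * Z.size))
    H = [qs_entropy(rng.choice(Z, Z.size, replace=True), N_q, rng=rng)
         for _ in range(N_B)]
    return np.subtract(*np.percentile(H, [75, 25]))

# "Actual" IQR: QS estimates over independent sample sets from the parent pdf.
N_Z = 500
H_ref = [qs_entropy(rng.standard_normal(N_Z), int(0.25 * N_Z), rng=rng)
         for _ in range(200)]
actual_iqr = np.subtract(*np.percentile(H_ref, [75, 25]))
ratio = bootstrap_iqr(rng.standard_normal(N_Z)) / actual_iqr  # ideal: ~1.0
```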
Summary of Properties of the Proposed Quantile Spacing Approach

[41] To summarize, bias in the estimate $\hat{H}_{\hat{p}}(X|\hat{q})$ can arise due to: (a) inadequacy of the piecewise approximation of $p(x)$, (b) imperfect estimation of the quantile positions, (c) imperfect knowledge of the support interval, and (d) the sample $Z$ not being consistent, representative and sufficiently informative. Meanwhile, uncertainty in the estimate can arise due to: (a) random sampling variability associated with estimation of the quantiles, and (b) random sampling variability associated with drawing $Z$ from $p(x)$. For a given sample size $N_Z$, and provided that the sample is consistent, representative and fully informative, the bias and uncertainty can be reduced by selecting sufficiently large values for $N_q$ and $N_K$ (we recommend setting $N_q = \alpha \cdot N_Z$ with $\alpha$ in the range 0.25 to 0.35, together with a sufficiently large $N_K$), while the overall statistical variability associated with the estimate can be estimated by bootstrapping from $Z$.
Algorithm for Estimating Entropy via the Quantile Spacing Approach

[42] Given a sample set $Z$ of size $N_Z$:

1) Set $x_{min} = \min\{Z\}$ and $x_{max} = \max\{Z\}$.

2) Select values for $\{N_q, N_K, N_B\}$. Recommended defaults are $N_q = \alpha \cdot N_Z$ with $\alpha$ in the range 0.25 to 0.35, together with sufficiently large values for $N_K$ and $N_B$.

3) Bootstrap a sample set $Z_b$ of size $N_Z$ from $Z$ with replacement.

4) Compute the entropy estimate $\hat{H}_{\hat{p}}(X|\hat{q}_b)$ using Eqn. 7a and the procedure outlined in Section 3.

5) Repeat steps 3 and 4 $N_B$ times to generate the bootstrapped distribution of $\hat{H}_{\hat{p}}(X|\hat{q}_b)$ as an empirical probabilistic estimate $p(\hat{H}_{\hat{p}}(X|Z))$ of the entropy $H_p(X)$ of $p(x)$ given $Z$.
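Putting the pieces together, a minimal end-to-end sketch of this algorithm (ours; the authors' reference implementation is linked in the Acknowledgements, and the defaults shown are placeholders rather than the manuscript's recommended values):

```python
import numpy as np

def qs_entropy_bootstrap(Z, alpha=0.25, N_K=500, N_B=500, rng=None):
    """Steps 1-5: bootstrapped distribution of QS entropy estimates for Z."""
    rng = rng or np.random.default_rng()
    Z = np.asarray(Z, dtype=float)
    N_q = max(2, int(alpha * Z.size))
    H = np.empty(N_B)
    for b in range(N_B):
        Zb = rng.choice(Z, size=Z.size, replace=True)  # bootstrap resample
        H[b] = qs_entropy(Zb, N_q, N_K=N_K, rng=rng)   # helper from Sec. 3
    return H  # e.g., report np.median(H) and np.percentile(H, [25, 75])
```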
Relationship to the Bin Counting Approach

[43] Because the proposed QS approach employs a piecewise constant approximation of $p(x)$, there are obvious similarities to BC. However, there are also clear differences. First, while BC typically employs equal-width binning along the support of $X$, with each bin having a different fraction of the total probability mass, QS uses variable-width intervals (analogous to "bins"), each having an identical fraction of the total probability mass (so that the intervals are wider where $p(x)$ is small, and narrower where $p(x)$ is large). Both methods require specification of the support interval $[x_{min}, x_{max}]$.

[44] Second, whereas BC requires counting samples falling within bins to estimate the probability masses associated with each bin, QS involves no "bin-counting"; instead, the data samples are used to estimate the positions of the quantiles. Since the probability mass estimates obtained by counting random samples falling within bins can be highly uncertain due to sampling variability, particularly for small sample sizes $N_Z$, this translates into uncertainty regarding the shape of the pdf, and thereby regarding its entropy. In QS, the effect of sampling variability is to consistently provide a pdf approximation that tends to be slightly more peaked than the true pdf, so that the bias in the entropy estimate tends to be slightly negative. This negative bias acts to counter the positive bias resulting from the piecewise constant approximation of the pdf.

[45] Third, whereas BC requires selection of a bin-width hyperparameter $h$ that represents the appropriate bin-width required for "smoothing" to ensure an appropriate balance between bias and variance errors, QS requires selection of a hyperparameter $\alpha$ that specifies the number of quantile positions $N_q$ as a fraction of the sample size $N_Z$. As seen in Figure 4, an appropriate choice for $\alpha$ can effectively drive estimation bias to zero, while $N_K$ controls the degree of uncertainty associated with the estimation of the positions of the quantiles. Further, if we desire estimates of the uncertainty in the computed value of entropy arising due to random sampling variability, we must specify the number of bootstraps $N_B$. In practice, the values selected for $N_K$ and $N_B$ can be made arbitrarily large, and our experiments suggest that moderately large values work well in practice. Accordingly, the QS hyperparameter $\alpha$ takes the place of the BC hyperparameter $h$ in determining the accuracy of the entropy estimate obtained from a given sample.

[46] Our survey of the literature suggests that the problem of how to select the BC bin-width hyperparameter $h$ is not simple, and a number of different strategies have been proposed. Sturges (1926) proposed to choose the number of bins based on sample size only.
Scott (1979) estimated the optimal number of bins by minimizing the mean squared error between the sample histogram and the "true" form of the pdf (for which the shape must be assumed). Freedman and Diaconis (1981) further developed this approach by estimating the shape of the true pdf from the interquartile range of the sample. More recently, Knuth (2019) proposed a method that does not require choice of a hyperparameter: using a Bayesian maximum likelihood approach, and assuming a piecewise-constant density model, the posterior probability for the number of bins is identified (this approach also provides uncertainty estimates for the related bin counts). Other BC approaches that provide uncertainty estimates, based on the Dirichlet, Multinomial, and Binomial distributions, are discussed by Darscheid et al. (2018). However, as shown in the next section, in practice the "optimal" fixed bin-width can vary significantly with the shape of the pdf and with sample size.
Experimental Comparison with the Bin Counting Method

[47] The right panel of Figure 1 shows the theoretical bias, due only to the piecewise-constant approximation, associated with the estimate of $H_p(X)$ obtained using BC when the support interval $[x_{min}, x_{max}]$ is subdivided into equal-width intervals. We can compare the results to the left panel of Figure 1 if we consider the number $N_{BC}$ of BC bins to be analogous to the number $N_q$ of spacings between quantiles for QS. Note that no random sampling variability or data informativeness issues are involved in the construction of these figures. For BC we use the theoretical fractions of probability mass associated with each of the equal-width bins, and for QS we use the theoretical quantile positions to compute the interval spacings. In both cases, to address the "infinite support" issue, we set $[x_{min}, x_{max}]$ to the locations where $P(x_{min}) = \varepsilon$ and $P(x_{max}) = 1 - \varepsilon$, with $\varepsilon$ chosen to be suitably small. The BC bias also declines toward zero as the number of bins is increased; in fact, it can decline somewhat faster than for the QS approach. Clearly, for the Gaussian (blue) and Exponential (red) densities, the BC constant bin-width approximation can provide better entropy estimates with fewer bins than the QS variable bin-width approach. However, for the skewed Log-Normal density (orange) the behavior of the BC approximation is more complicated, whereas the QS approach shows an exponential rate of improvement with increasing number of bins for all three density types. This suggests that the variable bin-width QS approximation may provide a more consistent approach for more complex distributional forms (see Section 7).
[48] Further, Figure 7 shows the results of a "naïve" implementation of BC where the value for $N_{BC}$ is varied as a fractional percentage of sample size $N_Z$. As with QS, we specify the support interval by setting $x_{min} = \min\{Z\}$ and $x_{max} = \max\{Z\}$, but here the support interval is divided into equal-width bins, so that $X_{BIN} = \{x_0, x_1, x_2, \ldots, x_{N_{BC}}\}$ represents the locations of the edges of the bins (where $x_0 = x_{min}$ and $x_{N_{BC}} = x_{max}$), and therefore $h = (x_{max} - x_{min})/N_{BC}$. We then assign to each bin the probability mass $\hat{p}(x_j|X_{BIN}) = n_j/N_Z$ for $x_{j-1} \le x < x_j$, where $n_j$ is the number of samples falling in the bin defined by $x_{j-1} \le x < x_j$. Finally, we compute the BC estimate of entropy as

$$\hat{H}_{\hat{p}}(X|X_{BIN}) = -\sum_{j=1}^{N_{BC}} \ln\!\left(\frac{n_j}{N_Z}\right) \cdot \frac{n_j}{N_Z} + \ln(h)$$

following the convention that $0 \cdot \ln(0) = 0$ to handle bins where the number of samples $n_j = 0$. We obtain estimates for different sample sets drawn from the parent density and average the results; results are shown for a range of sample sizes $N_Z$. The yellow marker symbols indicate where each curve crosses the zero-bias line; clearly $N_{BC}$ is not a constant fraction of $N_Z$, and for any given sample size the ratio $N_{BC}/N_Z$ changes with the form of the pdf.
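A sketch of this naïve BC estimator (ours; the empty-bin convention follows the text):

```python
import numpy as np

def bc_entropy(Z, N_BC):
    """Naive Bin-Counting differential entropy: discrete entropy of the bin
    masses plus the ln(h) correction; empty bins contribute 0 * ln(0) = 0."""
    counts, edges = np.histogram(Z, bins=N_BC)   # equal-width bins on [min, max]
    h = edges[1] - edges[0]
    p = counts[counts > 0] / Z.size              # drop empty bins (0 * ln(0) = 0)
    return -np.sum(p * np.log(p)) + np.log(h)
```

Sweeping `N_BC` over a range of fractions of the sample size and locating the zero-bias crossing reproduces the qualitative behavior of Figure 7.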
[49] To more clearly compare these BC results with those shown in Figure 4 for the QS approach, Figure 8 shows the sampling variability distribution of the optimal number of bins (i.e., the value of $N_{BC}$ for which the expected entropy estimation error is zero) as a function of sample size, for the Gaussian, Exponential and Log-Normal densities. We see clearly that, in contrast to QS, the "expected optimal" number of bins needed to achieve zero bias is neither a constant fraction of the sample size nor independent of the pdf shape; instead, it declines as the sample size increases, and differs across pdf shapes. Further, the sampling variability associated with the optimal fractional number of bins can be quite large, and is highly skewed at smaller sample sizes. This is in contrast with QS, where the optimal fraction is approximately constant at $\alpha \approx 25$ to $35\%$ across different sample sizes and pdf shapes.
Testing on Multi-Modal PDF Forms

[50] While the types of pdf forms tested in this paper are far from exhaustive, they represent different shapes and degrees of skewness, including infinite support on both sides (Gaussian) and infinite support on only one side (Exponential and Log-Normal). However, all three forms are "unimodal", and so we conducted an additional test on a multimodal distributional form.

[51] Figure 9 shows results for a Bimodal pdf (Figure 9a) constructed using a mixture of two Gaussians. Since its theoretical entropy value is not available in closed form, we used the piecewise constant approximation method with true (known) quantile positions to compute its entropy, progressively increasing the number of quantiles $N_q$ until the estimate converged to within three decimal places (Figure 9b). Figures 9c and 9d show that the QS estimation bias declines exponentially with the fractional number of quantiles and crosses zero at $\alpha \approx 25$ to $35\%$, in a manner similar to the unimodal pdfs tested previously (Figure 4). Figure 9c shows the results for a fixed sample size $N_Z$, along with the distribution due to sampling variability (500 repetitions), showing that both the IQR and the whiskers of the sampling distribution fall within a small percentage of the correct value. Figure 9d shows the expected bias (estimated by averaging over 500 repetitions) for varying sample size $N_Z$; for smaller sample sizes, the optimal value for $\alpha$ is somewhat larger, while for larger $N_Z$ a value of $\alpha \approx 25$ to $35\%$ seems to work quite well, while being relatively insensitive to the choice of value within this range.

[52] Interestingly, comparing Figure 9d (Bimodal Gaussian Mixture) with Figure 4a (Unimodal Gaussian), we see that QS actually converges more rapidly for the Bimodal density. One possible explanation is that the Bimodal density is, in some sense, "closer" in shape to a Uniform density, for which the piecewise constant representation is a better approximation.
Discussion and Conclusions

[53] In principle, the QS approach provides a relatively simple method for obtaining accurate estimates of entropy from data samples, along with an idea of the estimation uncertainty associated with sampling variability. It appears to have an advantage over BC, since the most important hyper-parameter to be specified, the number of quantiles $N_q$, does not need to be tuned and can apparently be set to a fixed fraction (~25-35%) of the sample size, regardless of pdf shape or sample size. In contrast, for BC the optimal number of bins $N_{BC}$ varies with pdf shape and sample size and, since the underlying pdf shape is usually not known beforehand, it can be difficult to come up with a general rule for how to accurately specify this value. Therefore, QS is potentially more accurate than BC when applied to data from an unknown distributional form.

[54] The fact that QS differs from BC in one very important way may help to explain the properties noted above. Whereas in BC we choose the "bin" size and locations, and then compute the "probability mass" estimates for each bin from the data, in QS we instead choose the "probability mass" size (by specifying the number of quantiles) and then compute the "bin" sizes and locations (to conform to the spacings between quantiles) from the data. In doing so, BC uses only the samples falling within a particular bin to compute each probability mass estimate, a value that can (in principle) be highly sensitive to sampling variability unless the number of samples in each bin is sufficiently large. In contrast, QS uses a potentially large number of samples from the data to generate a smoothed (via subsampling and averaging) estimate of the position of each quantile. As shown in Figure 2, the estimation bias and uncertainty are small for most of the quantiles, and may only be significant near the extreme tails of the density, and for smaller sample sizes. In principle, therefore, with its focus on estimating quantiles rather than probability masses, the QS method seems to provide a more efficient use of the information in the data, and thereby a more robust approximation of the shape of the pdf.
[55] In this paper, we have used a simple, perhaps naïve, way of estimating the locations of the quantiles. Future work could investigate more sophisticated methods in which the bias associated with extreme quantiles is accounted for and corrected. These include both Kernel and non-parametric methods. The simplest non-parametric methods are the empirical quantile estimator based on a single order statistic, or the extension based on two consecutive order statistics (Hyndman and Fan 1996), for which the variance can be large. Quantile estimators based on L-statistics have been explored as a way to reduce estimation variance (Harrell and Davis 1982; Kaigh and Lachenbruch 1982; Brodin 2006), and include Kernel quantile estimators (Parzen 1979; Yang 1985; Sheather and Marron 1990; Cheng and Parzen 1997; Park 2006). However, performance of the latter can be very sensitive to the choice of bandwidth. More recently, quantile L-estimators intended to be efficient at small sample sizes for estimating quantiles in the tails of a distribution have also been proposed (Sfakianakis and Verginis 2008; Navruz and Özdemir 2020). Finally, quantile estimators based on Bernstein polynomials (Cheng 1995; Pepelyshev, Rafajłowicz and Steland 2014) and importance sampling (see Kohler and Tent 2020, and references therein) have also been investigated.
[56] Note that the small-sample efficiency of the QS method may be affected by the fact that the entropy fractions associated with the extreme upper and lower end "bins" (quantile spacings) can be quite large when a small number of quantiles is used (see Figure 3). Our implementation of QS seems to successfully compensate for the lack of exact knowledge of $q_0$ and $q_{N_q}$ by using as empirical estimates the values $\min\{Z\}$ and $\max\{Z\}$ from the data sample. Intuitively, one would expect that wider "bins" could be used in regions where the slopes of the entropy fraction curves are flatter (e.g., the center of the Gaussian density), and narrower "bins" in the tails (more like the BC method). Taken to its logical conclusion, an ideal approach might be to use "bin" locations and widths such that the cumulative value of $-\ln(p(x)) \cdot p(x)$ for any given pdf is subdivided into equal intervals (i.e., equal fractional entropy spacings), so that each bin then contributes approximately the same amount to the summation in Eqn. 7. To achieve this, the challenge is to estimate the edge locations of these "bins" (analogous to the locations of the quantiles) from the sample data; we leave the possibility of such an approach for future investigation.
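For a known pdf, the equal-entropy-spacing edges contemplated here can be computed directly by inverting the cumulative entropy integrand; a sketch (ours, purely illustrative of the idea rather than the data-based estimator that would actually be needed):

```python
import numpy as np
from scipy import stats, integrate

def equal_entropy_edges(pdf, lo, hi, n_bins, grid=20000):
    """Edges that split the cumulative entropy integrand -p(x)*ln(p(x)) into
    n_bins pieces of equal contribution (for a KNOWN pdf)."""
    x = np.linspace(lo, hi, grid)
    dH = -pdf(x) * np.log(pdf(x))                  # entropy density
    cumH = integrate.cumulative_trapezoid(dH, x, initial=0.0)
    targets = np.linspace(0.0, cumH[-1], n_bins + 1)
    return np.interp(targets, cumH, x)             # invert the cumulative curve

edges = equal_entropy_edges(stats.norm.pdf, -8, 8, n_bins=20)
```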
Acknowledgements

We acknowledge the many members of the GeoInfoTheory community (https://geoinfotheory.org) who have provided both moral support and extensive discussion of matters related to Information Theory and its applications to the Geosciences; without their existence and enthusiastic engagement it is unlikely that the ideas leading to this manuscript would have occurred to us. The first author also acknowledges partial support by the Australian Research Council Centre of Excellence for Climate Extremes (CE170100023). The algorithms used in this work are freely accessible for non-commercial use at https://github.com/rehsani/Entropy.

References Cited

Beirlant J, EJ Dudewicz, L Györfi, EC Van Der Meulen (1997),
Nonparametric entropy estimation: An overview, International Journal of Mathematical and Statistical Sciences, 6: 17-39

Cheng C and E Parzen (1997),
Unified estimators of smooth quantile and quantile density functions , Journal of Statistical Planning and Inference 59: 291-307 Cover TM and JA Thomas (2006),
Elements of Information Theory , Wiley Darscheid P, A Guthke and U Ehret (2018),
A maximum-entropy method to estimate discrete distributions from samples ensuring nonzero probabilities , Entropy, 20(8) Denison DGT, NM Adams, CC Holmes and DJ Hand (2002),
Bayesian partition modelling , Computational Statistics and Data Analysis, 38(4): 475β485 Endres A and P Foldiak (2005),
Bayesian bin distribution inference and mutual information, IEEE Transactions on Information Theory, 51(11): 3766-3779

Freedman D and P Diaconis (1981),
On the histogram as a density estimator: L2 theory , Zeitschrift fΓΌr Wahrscheinlichkeitstheorie und Verwandte Gebiete, 57(4): 453-476 Gong W, D Yang, HV Gupta, and G Nearing (2014),
Estimating information entropy for hydrological data: One dimensional case , Technical Note, Water Resources Research, 50, doi: 10.1002/2014WR015874 Harrell FE and CE Davis (1982),
A new distribution-free quantile estimator . Biometrika 69: 635β640 Hutter M (2007),
Exact Bayesian regression of piecewise constant functions, Bayesian Analysis, 2(4): 635-664

Hyndman RJ and Y Fan (1996),
Sample quantiles in statistical packages , American Statistician 50(4):361β365 Jackson B, J Scargle, D Barnes, S Arabhi, A Alt, P Gioumousis, E Gwin, P Sangtrakulcharoen, L Tan and TT Tsai (2005),
An algorithm for optimal partitioning of data on an interval , IEEE Signal Processing Letters, 12: 105β108. Kaigh WD and PA Lachenbruch (1982),
A Generalized Quantile Estimator , Communications in Statistics, Part A-Theory and Methods, 11: 2217-2238 Kohler M and R Tent (2020),
Nonparametric quantile estimation using surrogate models and importance sampling , Metrika, 83: 141β169 Knuth KH (2019),
Optimal data-based binning for histograms and histogram-based probability density models , Digital Signal Processing, 95, 102581 Navruz G and AF Γzdemir (2020),
A new quantile estimator with weights based on a subsampling approach , British Journal of Mathematical and Statistical Psychology, 73 Park C (2006),
Smooth nonparametric estimation of a quantile function under right censoring using beta kernels , Technical Report (TR 2006-01-CP), Department of Mathematical Sciences, Clemson University Parzen E (1979),
Nonparametric Statistical Data Modeling , Journal of the American Statistical Association, 74: 105-131 Scott DW (1979),
Optimal and data-based histograms , Biometrika, 66(3), 605β610, doi:10.2307/2335182 Scott DW (1985),
Averaged shifted histogramsβEffective nonparametric density estimators in several dimensions , Ann. Stat., 13(3), 1024β1040, doi:10.1214/aos/1176349654 Scott DW (2004),
Handbook of Computational StatisticsβConcepts and Methods , Springer NY Scott DW (2008),
Multivariate Density Estimation: Theory, Practice, and Visualization, John Wiley NY

Scott DW and JR Thompson (1983),
Probability density estimation in higher dimensions , in Computer Science and Statistics: Proceedings of the Fifteenth Symposium on the Interface, edited by J. E. Gentle, 173β179, North-Holland, Amsterdam, Netherlands Sfakianakis ME and DG Verginis (2008),
A new family of nonparametric quantile estimators , Communications in Statistics β Simulation and Computation, 37: 337β345 Shannon CE (1948),
A mathematical theory of communication , Bell System Technical Journal, 27: 379-423 Sheather JS and JS Marron (1990),
Kernel quantile estimators , Journal of the American Statistical Association, 85: 410β416 Silverman BW (1986),
Density estimation for statistics and data analysis , Chapman and Hall NY Sturges HA (1926),
The choice of a class interval , Journal of the American Statistical Association, 21(153): 65-66 Wegman EJ (1975),
Maximum likelihood estimation of a probability density function , SankhyΔ: The Indian
Journal of Statistics, 3(2), 211β224 Hutter M (2007),
Exact Bayesian regression of piecewise constant functions , Bayesian Analsis, 2(4), 635β664 Yang SS (1985),
A smooth nonparametric estimator of a quantile function , Journal of the American Statistical Association, 80: 1004-1011
List of Figures

Figure 1: Plots showing how the entropy estimation bias associated with the piecewise-constant approximation of various theoretical pdf forms varies with the number of quantiles (left; QS method) or the number of equal-width bins (right; BC method) used in the approximation. The dashed horizontal lines indicate ±1% and ±5% bias error. No sampling is involved; the bias is due purely to the piecewise-constant assumption. For QS, the locations of the quantiles are set to their theoretical values. To address the "infinite support" issue, x_min and x_max were set to the locations where F(x) = ε and F(x) = 1 - ε respectively, for a small value of ε.
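For concreteness, the piecewise-constant entropy computation underlying Figure 1 can be sketched as follows. This is a minimal illustration rather than the code used to generate the figure; the helper name qs_entropy_theoretical, the use of scipy frozen-distribution objects, and the placeholder value eps = 1e-4 (the ε actually used is not restated here) are all assumptions of the sketch.

    import numpy as np
    from scipy import stats

    def qs_entropy_theoretical(dist, n_q, eps=1e-4):
        """Entropy (nats) of the piecewise-constant approximation built on
        the theoretical equal-probability-mass quantiles of `dist`."""
        # N_Q + 1 quantile locations spanning probabilities [eps, 1 - eps],
        # truncating the (possibly infinite) support as described in Figure 1
        probs = np.linspace(eps, 1.0 - eps, n_q + 1)
        x = dist.ppf(probs)            # theoretical quantiles
        dx = np.diff(x)                # the N_Q quantile spacings
        # each spacing carries mass ~1/N_Q, so its density is ~1/(N_Q * dx)
        return np.mean(np.log(n_q * dx))

    # percent bias relative to the closed-form entropy of a Gaussian
    dist = stats.norm()
    for n_q in [10, 100, 1000]:
        h_hat = qs_entropy_theoretical(dist, n_q)
        print(n_q, 100 * (h_hat - dist.entropy()) / dist.entropy())

Other distributional forms can be examined by swapping in, e.g., stats.expon() or stats.lognorm(1.0).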
Figure 2: Plots showing bias and uncertainty associated with estimates of the quantiles derived from random samples, for the Log-Normal pdf. Uncertainty associated with random sampling variability is estimated by repeating each experiment many times. In both subplots, for each case, the box plots are shown side by side to improve legibility. (Left) Results obtained when varying N_Q over a range of values for fixed N_K. (Right) Results obtained when varying N_K over a range of values for fixed N_Q.
Figure 3: Plots showing the percentage entropy fraction associated with each quantile spacing for the Gaussian, Exponential and Log-Normal pdfs, for N_Q = 100 (left) and N_Q = 1000 (right). For the Uniform pdf (not shown, to avoid complicating the figures) the percentage entropy fraction associated with each quantile spacing is a horizontal line (at 1% in the left panel, and at 0.1% in the right panel). Note that the entropy fractions can be proportionally quite large or small at the extremes, depending on the form of the pdf. However, the entropy fraction associated with each individual quantile spacing diminishes with increasing N_Q: for the examples shown, the maximum contribution of any single spacing is roughly an order of magnitude smaller for N_Q = 1000 (right) than for N_Q = 100 (left).
Figure 4: Plots showing expected percent error in the QS-based estimate of entropy derived from random samples, as a function of α = N_Q/N_S, which expresses the number of quantiles N_Q as a percentage of the sample size N_S. Results are averaged over many trials, each obtained by drawing a sample set of size N_S from the theoretical pdf, with x_min and x_max set to the smallest and largest data values in the particular sample. Results are shown for a range of sample sizes N_S, for the Gaussian (left), Exponential (middle) and Log-Normal (right) densities. In each case, the estimation bias is positive (overestimation) when α is small, and crosses zero to become negative (underestimation) as α increases; the marginal cost of setting α too large is low compared to that of setting it too small. As N_S increases, the bias diminishes. The optimal choice is α ≈ 25-35% and is relatively insensitive to pdf shape or sample size.
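The sample-based QS estimator explored in Figure 4 can be sketched along the following lines, with N_Q tied to the sample size through α and the support end points taken from the sample itself. The helper name qs_entropy_sample and the default alpha = 0.3 (a value inside the ~25-35% optimal range reported above) are illustrative assumptions, not the authors' exact implementation.

    import numpy as np

    def qs_entropy_sample(sample, alpha=0.3):
        """QS entropy estimate (nats) from an iid sample; alpha ~ N_Q / N_S."""
        n_s = len(sample)
        n_q = max(1, round(alpha * n_s))          # number of quantile spacings
        probs = np.linspace(0.0, 1.0, n_q + 1)    # equal-mass break points
        x = np.quantile(sample, probs)            # x[0], x[-1] = sample min, max
        dx = np.diff(x)                           # assumes continuous data (no ties)
        return np.mean(np.log(n_q * dx))          # each spacing holds mass ~1/N_Q

    rng = np.random.default_rng(1)
    # for a standard Gaussian the true value is 0.5*ln(2*pi*e) ~ 1.419
    print(qs_entropy_sample(rng.standard_normal(1000)))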
Figure 5: Plots showing bias and uncertainty in the QS-based estimate of entropy derived from random samples, as a function of sample size N_S, when the number of quantiles N_Q is set to a fixed percentage of the sample size (a single value of α within the optimal range identified in Figure 4), and x_min and x_max are respectively set to the smallest and largest data values in the particular sample. The uncertainty shown is due to random sampling variability, estimated by drawing many different samples from the parent density. Results are shown for the Gaussian (blue), Exponential (red) and Log-Normal (orange) densities; box plots are shown side by side to improve legibility. As the sample size N_S increases, the uncertainty diminishes.
Figure 6: Plots showing, for different sample sizes and a fixed value of α, the ratio of the interquartile range (IQR) of the QS-based estimate of entropy obtained using bootstrapping to the actual IQR arising from random sampling variability. Here, each sample set drawn from the parent density is bootstrapped to obtain N_B different estimates of the associated entropy, and the width of the resulting interquartile range is computed. The procedure is repeated for many different sample sets drawn from the parent population, and the graph shows the resulting variability as box plots. The ideal result would be a ratio of 1.0.
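The bootstrap step of Figure 6 can be sketched as below, reusing qs_entropy_sample from the sketch after Figure 4; the helper name bootstrap_entropy_iqr and the placeholder n_b = 500 are assumptions (the N_B actually used is not restated here). The ratio plotted in the figure divides this bootstrap IQR by the IQR of estimates obtained from genuinely independent samples of the same size.

    import numpy as np

    def bootstrap_entropy_iqr(sample, n_b=500, alpha=0.3, seed=None):
        """IQR of N_B bootstrap replicates of the QS entropy estimate."""
        rng = np.random.default_rng(seed)
        reps = [qs_entropy_sample(rng.choice(sample, size=len(sample), replace=True),
                                  alpha)
                for _ in range(n_b)]              # resample with replacement
        q25, q75 = np.percentile(reps, [25, 75])
        return q75 - q25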
Figure 7: Plots showing how the expected percentage error in the BC-based estimate of entropy derived from random samples varies as a function of the number of bins N_bins. Results are averaged over many trials, each obtained by drawing a sample set of size N_S from the theoretical pdf, with x_min and x_max set to the smallest and largest data values in the particular sample. Results are shown for a range of sample sizes N_S, for the Gaussian (left), Exponential (middle) and Log-Normal (right) densities. When the number of bins is small the estimation bias is positive (overestimation), but it declines rapidly, crossing zero to become negative (underestimation) as the number of bins is increased. In general, the overall ranges of overestimation and underestimation bias are larger than for the QS method (see Figure 4).
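For comparison, the Bin-Counting baseline of Figure 7 amounts to a histogram plug-in estimate: the discrete entropy of the bin probabilities plus the log of the bin width, computed over equal-width bins spanning the sample range. The sketch below illustrates the idea; the helper name bc_entropy_sample is an assumption, not the authors' code.

    import numpy as np

    def bc_entropy_sample(sample, n_bins):
        """Bin-Counting (histogram) differential-entropy estimate (nats)."""
        counts, edges = np.histogram(sample, bins=n_bins)  # equal-width bins over [min, max]
        width = edges[1] - edges[0]
        p = counts[counts > 0] / counts.sum()              # empty bins contribute 0*log(0) = 0
        return -np.sum(p * np.log(p)) + np.log(width)      # discrete entropy + log bin width

Unlike QS, the bin count that minimizes bias here must be tuned, since it varies with sample size and pdf shape (Figure 8).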
Figure 8: Boxplots showing the sampling variability distribution of the optimal fractional number of bins (as a percentage of sample size) required to achieve zero bias when using the BC method to estimate entropy from random samples. Results are shown for the Gaussian (blue), Exponential (red) and Log-Normal (orange) densities. The uncertainty estimates are computed by drawing many different sample data sets of a given size from the parent distribution. Note that the expected optimal fractional number of bins varies with the shape of the pdf and, rather than being constant, declines as the sample size increases. This is in contrast with the QS method, where the optimal fractional number of quantiles is approximately constant (~25-35%) across different sample sizes and pdf shapes. Further, the variability in the optimal fractional number of bins can be large and highly skewed at smaller sample sizes.
Figure 9: Plots showing results for the Bimodal pdf. (a) Pdf and cdf of the Gaussian Mixture model. (b) Convergence of the entropy computed using the piecewise-constant approximation as the number of quantiles N_Q is increased. (c) Bias and sampling variability of the QS-based estimate of entropy plotted against N_Q as a percentage of sample size. (d) Expected bias of the QS-based estimate of entropy plotted against N_Q as a percentage of sample size, for a range of sample sizes N_S.