Computing Accurate Probabilistic Estimates of One-D Entropy from Equiprobable Random Samples

Hoshin V Gupta, Mohammed Reza Ehsani, Tirthankar Roy, Maria A Sans-Fuentes, Uwe Ehret, Ali Behrangi

Hydrology and Atmospheric Sciences, The University of Arizona, Tucson, AZ
Civil and Environmental Engineering, University of Nebraska-Lincoln, Nebraska
GIDP Statistics and Data Science, The University of Arizona, Tucson, AZ
Institute of Water and River Basin Management - Hydrology, Karlsruhe Institute of Technology, Karlsruhe, Germany

Draft Manuscript (02/22/2021) for submission to ArXiv
Abstract
We develop a simple Quantile Spacing (QS) method for accurate probabilistic estimation of one-dimensional entropy from equiprobable random samples, and compare it with the popular Bin-Counting (BC) method. In contrast to BC, which uses equal-width bins with varying probability mass, the QS method uses estimates of the quantiles that divide the support of the data generating probability density function (pdf) into equal-probability-mass intervals. Whereas BC requires optimal tuning of a bin-width hyper-parameter whose value varies with sample size and shape of the pdf, QS requires specification of the number of quantiles to be used. Results indicate, for the class of distributions tested, that the optimal number of quantile-spacings is a fixed fraction of the sample size (empirically determined to be ~0.25-0.35), and that this value is relatively insensitive to distributional form or sample size, providing a clear advantage over BC since hyperparameter tuning is not required. Bootstrapping is used to approximate the sampling variability distribution of the resulting entropy estimate, and is shown to accurately reflect the true uncertainty. For the four distributional forms studied (Gaussian, Log-Normal, Exponential and Bimodal Gaussian Mixture), expected estimation bias is less than 1% and uncertainty is relatively low even for very small sample sizes. We speculate that estimating quantile locations, rather than bin-probabilities, results in more efficient use of the information in the data to approximate the underlying shape of an unknown data generating pdf.
Keywords:
Entropy, estimation, quantile spacing, accuracy, uncertainty, bootstrap, small-sample efficiency
1. Introduction

[1] Consider a data generating process $p(x)$ from which a finite set of $N_Z$ random, equiprobable, independent identically distributed (iid) samples $Z = \{z_i,\ i = 1 \ldots N_Z\}$ is drawn. In general, we may not know the nature and mathematical form of $p(x)$, and our goal is to compute an estimate $\hat{H}_p(X|Z)$ of the entropy $H_p(X)$ of $p(x)$.

[2] In the idealized case, where $X$ is a one-dimensional continuous random variable and the parametric mathematical form of $p(x)$ is known, we can apply the definition of differential entropy (Shannon 1948; Cover and Thomas 1991) to compute:

$$H_p(X) = E_p\left[-\ln p(x)\right] = \int_{-\infty}^{+\infty} -\ln\!\big(p(x)\big) \cdot p(x)\, dx \qquad \text{(Eqn. 1)}$$

Explicit, closed-form solutions for $H_p(X)$ are available for a variety of probability density functions (pdfs). For a variety of others, closed-form solutions are not available, and one can compute $H_p(X)$ via numerical integration of Eqn. 1. In all such cases, entropy estimation consists of first obtaining estimates $\hat{\theta}|Z$ of the parameters $\theta$ of the known parametric density $p(x|\theta)$, and then computing the entropy estimate $\hat{H}_{p|\hat{\theta}}(X|Z)$ by plugging $p(x|\hat{\theta})$ into Eqn. 1. Any bias and uncertainty in the entropy estimate will depend on the accuracy and uncertainty of the parameter estimates $\hat{\theta}$. If the form of $p(x|\theta)$ is "assumed" rather than explicitly known, then additional bias will stem from the inadequacy of this assumption.

[3] In most practical situations the mathematical form of $p(x)$ is not known, and $Z$ must first be used to obtain a data-based estimate $\hat{p}(x|Z)$, from which an estimate $\hat{H}_{\hat{p}}(X|Z)$ can be obtained via numerical integration of Eqn. 1. In generating $\hat{p}(x|Z)$, consistency with prior knowledge regarding the nature of $p(x)$ must be ensured; for example, $X$ may be known to take on only positive values, or values on some finite range. Consistency must also be maintained with the information in $Z$. Further, the sample size $N_Z$ must be sufficiently large that the information in $Z$ provides an accurate characterization of $p(x)$; in other words, $Z$ must be informationally representative and consistent.

[4] To summarize, for the case that $X$ is a continuous random variable, entropy estimation from data involves two steps: (i) use of $Z$ to estimate $\hat{p}(x|Z)$, and (ii) numerical integration to compute an estimate of entropy using Eqn. 2:

$$\hat{H}_{\hat{p}}(X|Z) = E_{\hat{p}}\left[-\ln \hat{p}(x|Z)\right] = \int_{-\infty}^{+\infty} -\ln\!\big(\hat{p}(x|Z)\big) \cdot \hat{p}(x|Z)\, dx \qquad \text{(Eqn. 2)}$$

Accordingly, the estimate $\hat{H}_{\hat{p}}(X|Z)$ has two potential sources of error. One is due to the use of $\hat{p}(x|Z)$ to approximate $p(x)$, and the other is due to imperfect numerical integration. To maximize accuracy, we must ensure that both of these errors are minimized. Further, $\hat{H}_{\hat{p}}(X|Z)$ is a statistic that is subject to inherent random variability associated with the sample $Z$, and so it will be useful to have an uncertainty estimate, in some form such as confidence intervals.

[5] For cases where $X$ is discrete and can take on only a finite set of values $\{x(j),\ j = 1 \ldots N_X\}$, if the mathematical form of $p(x) = \{p(x(j)),\ j = 1 \ldots N_X\}$ is known, then $H_p(X)$ can be computed by applying the mathematical definition of discrete entropy:

$$H_p(X) = E_p\left[-\ln p(x)\right] = -\sum_{j=1}^{N_X} \ln\!\big(p(x(j))\big) \cdot p(x(j)) \qquad \text{(Eqn. 3)}$$

[6] Here, given a data sample $Z$, pdf estimation amounts simply to counting the number $n(j)$ of data points in $Z$ that take on the value $x(j)$, and setting $\hat{p}(x|Z) = \{\hat{p}(x(j)|Z) = n(j)/N_Z,\ j = 1 \ldots N_X\}$, with $\hat{p}(x|Z) = 0$ otherwise. Entropy estimation then consists of applying the equation:

$$\hat{H}_{\hat{p}}(X|Z) = -\sum_{j=1}^{N_X} \ln\!\big(\hat{p}(x(j)|Z)\big) \cdot \hat{p}(x(j)|Z) \qquad \text{(Eqn. 4)}$$

In this case, there is no numerical integration error; any bias in the estimate is entirely due to $\hat{p}(x(j)|Z) \ne p(x(j))$, which occurs due to $Z$ not being perfectly informative about $p(x)$, while uncertainty is due to $Z$ being a random sample drawn from $p(x)$. If $Z$ is a representative sample, then as $N_Z \to \infty$ we have $\hat{p}(x(j)|Z) \to p(x(j))$ and hence $\hat{H}_{\hat{p}}(X|Z) \to H_p(X)$, so that estimation bias and uncertainty will both tend towards zero as the sample size is increased.

[7] When the one-dimensional random variable $X$ is some hybrid combination of discrete and continuous, the relative fractions of total probability mass associated with the discrete and continuous portions of the pdf must also be estimated. The general principles discussed herein also apply to the hybrid case, but we will not consider it further in this paper; for a relevant discussion of estimating entropy for mixed discrete-continuous random variables, see Gong et al. (2014).
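To make Eqns. 1 and 4 concrete, the following minimal sketch (ours, not part of the original manuscript; function names are our own) computes differential entropy by numerical integration for a known parametric pdf, and the discrete plug-in entropy estimate from counts:

```python
import numpy as np
from scipy import integrate, stats

def differential_entropy_known_pdf(pdf, lo, hi):
    """Eqn. 1: H_p(X) = -integral of ln(p(x)) * p(x) dx over [lo, hi]."""
    integrand = lambda x: -np.log(pdf(x)) * pdf(x)
    H, _abs_err = integrate.quad(integrand, lo, hi)
    return H

# Example: standard Gaussian; closed form is 0.5 * ln(2*pi*e) ~ 1.4189
H_num = differential_entropy_known_pdf(stats.norm.pdf, -10, 10)

def discrete_plugin_entropy(samples):
    """Eqn. 4: plug-in entropy from relative frequencies of a discrete sample."""
    _, counts = np.unique(samples, return_counts=True)
    p_hat = counts / counts.sum()          # p_hat(x(j)|Z) = n(j) / N_Z
    return -np.sum(p_hat * np.log(p_hat))  # unobserved categories contribute 0
```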
Popular Approaches to Estimating Distributions from Data

[8] We focus here on the case of a one-dimensional continuous random variable $X$ for which the mathematical form of $p(x)$ is unknown. Silverman (1986) provides a summary of methods for the estimation of pdfs from data, while Beirlant et al. (1997) provides an overview of methods for estimating the differential entropy of a continuous random variable. The three most widely used "non-parametric" methods for estimating differential entropy by pdf approximation are: (a) the Bin-Counting (BC) or piecewise-constant frequency histogram method, (b) the Kernel Density (KD) method, and (c) the Average Shifted Histogram (ASH) method. Scott (2008) points out that these can all be asymptotically viewed as "Kernel" methods, where the bins in the BC and ASH approaches are understood as treating the data points falling in each bin as being from a locally uniform distribution.
[9] As discussed by Scott (1979) and Scott and Thompson (1983), appropriate selection of the bin-width (effectively a smoothing hyper-parameter) is critical to the success of the BC and ASH histogram-based methods. Bin-widths that are too small can result in an overly rough approximation of the underlying distribution (increasing the variance), while bin-widths that are too large can result in an overly smooth approximation (introducing bias). Therefore, one typically has to choose values that balance variance and bias errors. Scott (1979) and Scott (2004) present expressions for "optimal" bin width when using BC, including the "normal reference rule" that is applicable when the pdf is approximately Gaussian, and the "oversmoothed bandwidth rule" that places an upper bound on the bin-width. Similarly, Scott (2008) shows that while KD is more computationally costly to implement than BC, its accuracy and convergence are better, and derives optimal values for the KD smoothing hyper-parameter. Scott (1985) also proposed the ASH method, which refines BC by sub-dividing each histogram bin into sub-bins, with computational cost similar to BC and accuracy approaching that of KD. Note that, if prior information on the shape of $p(x)$ is available, or if a representation with the smallest number of bins is desired, then variable bin-width methods may be more appropriate (e.g., Wegman 1975; Denison et al. 2002; Jackson et al. 2005; Endres and Foldiak 2005; Hutter 2007).
[10] The BC, KD and ASH methods all require hyper-parameter tuning to be successful. BC requires selection of the histogram bin-width (and thereby the number of bins), KD requires selection of the form of the Kernel function and tuning of its parameters, and ASH requires selection of the form of a Kernel function and tuning of both the coarse bin width and the number of sub-bins. While recommendations are available to guide the selection of these "hyper-parameters", they depend on theoretical arguments based in assumptions regarding the typical underlying forms of $p(x)$. Based on empirical studies, and given that we typically do not know the "true" form of $p(x)$ to be used as a reference for tuning, Gong et al. (2014) recommend use of BC and KD rather than the ASH method.
[11] Finally, since BC effectively treats the pdf as being discrete, and therefore uses Eqn. 4 with each of the indices $j$ corresponding to one of the histogram bins, the entropy lost by implementing the discrete constant bin-width approximation is approximately $\ln(h)$, where $h$ is the bin width, provided $h$ is sufficiently small (Cover and Thomas, 2006). This fact allows conversion of the discrete entropy estimate to differential entropy simply by adding $\ln(h)$ to the discrete entropy estimate.
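A simple worked example (ours, not from the manuscript) illustrates the correction: discretize the Uniform density on $[0, 1)$ into $N$ equal-width bins of width $h = 1/N$. Each bin then has probability mass $1/N$, so the discrete entropy is $\ln(N)$, and adding $\ln(h)$ recovers the true differential entropy of zero:

$$\hat{H}_{diff} = \hat{H}_{disc} + \ln(h) = \ln(N) + \ln(1/N) = 0$$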
[12] In summary, while BC and KD can be used to obtain accurate estimates of entropy for pdfs of arbitrary form, hyper-parameter tuning is required to ensure that good results are obtained. In the next section, we propose an alternative method for approximating $p(x)$ that does not require counting the numbers of samples in "bins", and is instead based on estimating the quantile positions of $p(x)$.

Proposed Quantile Spacing (QS) Approach

[13] We present an approach to computing an estimate $\hat{H}_p(X|Z)$ of the entropy $H_p(X)$, given a set of available samples $Z$, for the case where $X$ is a one-dimensional continuous random variable and the mathematical form of the data generating process $p(x)$ is unknown. The approach is based in the assumption that $p(x)$ can be approximated as piecewise constant on the intervals between quantile locations, and consists of three steps.

Step 1 - Assumption about Support Interval

[14] The first step is to assume that $p(x)$ exists only on some finite support interval $[x_{min}, x_{max}]$, where $x_{min} \le \min\{Z\}$ and $x_{max} \ge \max\{Z\}$; i.e., we treat $p(x)$ as being zero everywhere outside of the interval $[x_{min}, x_{max}]$. Given that the true support of $X$ may, in reality, be as extensive as $[-\infty, +\infty]$, we allow the selection of this interval (based on prior knowledge, such as physical realism) to be as extensive as appropriate and/or necessary. However, as we show later, the impact of this selection can be quite significant and will need special attention.

Step 2 - Assumption about Approximate Form of $p(x)$

[15] The second step is to assume that $p(x)$ can be approximated as piecewise constant on the intervals between the quantiles $q = \{q_0, q_1, q_2, \ldots, q_{N_q}\}$ associated with the $\{0, 1/N_q, 2/N_q, \ldots, (N_q - 1)/N_q, 1\}$ non-exceedance probabilities of $p(x)$, where $N_q$ represents the number of quantile spacings, $q_0 = x_{min}$, and $q_{N_q} = x_{max}$. This corresponds to making the minimally informative (maximum entropy) assumption that $p(x)$ is "uniform" over each of the quantile intervals $[q_{j-1}, q_j]$ for $j = 1 \ldots N_q$, which is equivalent to assuming that the corresponding cumulative distribution function $P(x)$ is piecewise linear (i.e., increases linearly between $q_{j-1}$ and $q_j$).

[16] Assuming perfect knowledge of the locations of the quantiles $q$, this approximation corresponds to:

$$p(x) \approx \hat{p}(x|q) = \frac{1}{N_q \, \Delta_j} = \frac{K}{\Delta_j} \quad \text{for } q_{j-1} \le x < q_j,\ j = 1 \ldots N_q \qquad \text{(Eqn. 5)}$$

where $\Delta_j = q_j - q_{j-1}$. To ensure that $\hat{p}(x|q)$ integrates to 1.0 over the support region $[x_{min}, x_{max}]$ we have $K = 1/N_q$. Accordingly, our entropy estimate is given by:

$$\hat{H}_{\hat{p}}(X|q) = -\sum_{j=1}^{N_q} \int_{q_{j-1}}^{q_j} \ln\!\left(\frac{1}{N_q \Delta_j}\right) \cdot \frac{1}{N_q \Delta_j}\, dx \qquad \text{(Eqn. 6a)}$$

$$= \frac{1}{N_q} \sum_{j=1}^{N_q} \ln\!\left(N_q \cdot \Delta_j\right) \qquad \text{(Eqn. 6b)}$$

$$= \ln(N_q) + \frac{1}{N_q} \sum_{j=1}^{N_q} \ln\!\left(\Delta_j\right) \qquad \text{(Eqn. 6c)}$$

From Eqn. 6b we see that the estimate depends on the logs of the spacings between quantiles, and is defined by the average of these values. Further, we can define the error due to the piecewise constant approximation of $p(x)$ as $\Delta H_{p,\hat{p}}(X|q) = \hat{H}_{\hat{p}}(X|q) - H_p(X)$.

Step 3 - Estimation of the Quantiles of $p(x)$

[17] The third step is to use the available data $Z$ to compute estimates of the quantiles $q$ to be plugged into Eqn. 6. Of course, given a finite sample size $N_Z$, the number of quantiles $N_q$ that can be estimated will, in general, be much smaller than the sample size $N_Z$ (i.e., $N_q \ll N_Z$).

[18] Various methods for computing estimates of the quantiles are available. Here, we use a relatively simple approach in which $N_K$ sample subsets $Z_k,\ k = 1 \ldots N_K$, each of size $N_q - 1$ (i.e., $Z_k = \{z_1, z_2, \ldots, z_{N_q - 1}\}$), are drawn from the available sample set $Z$, where the samples in each subset are drawn from $Z$ without replacement so that the values obtained in each subset are unique (not repeated). For each subset, we sort the values in increasing order to obtain $q_k = \{q_{1k}, q_{2k}, \ldots, q_{(N_q - 1)k}\}$, thereby obtaining $N_K$ estimates $\{q_{j1}, q_{j2}, \ldots, q_{jN_K}\}$ for each $q_j,\ j = 1 \ldots N_q - 1$.
This procedure results in an empirical estimate of the sample distribution $p(q_j|Z)$ for each quantile $q_j,\ j = 1 \ldots N_q - 1$. Finally, we compute $\hat{q}_j = \frac{1}{N_K}\sum_{k=1}^{N_K} q_{jk}$, and set $\hat{q} = \{x_{min}, \hat{q}_1, \hat{q}_2, \ldots, \hat{q}_{N_q - 1}, x_{max}\}$. Plugging these values into Equations 6b & 6c, we get:

$$\hat{H}_{\hat{p}}(X|\hat{q}) = \frac{1}{N_q} \sum_{j=1}^{N_q} \ln\!\left(N_q \cdot \hat{\Delta}_j\right) \qquad \text{(Eqn. 7a)}$$

$$= \ln(N_q) + \frac{1}{N_q} \sum_{j=1}^{N_q} \ln\!\left(\hat{\Delta}_j\right) \qquad \text{(Eqn. 7b)}$$

where $\hat{\Delta}_j = \hat{q}_j - \hat{q}_{j-1}$. For practical computation, to avoid numerical problems as $N_q$ becomes large (so that $\hat{\Delta}_j$ becomes very small and $\ln(\hat{\Delta}_j)$ approaches $-\infty$), we will actually use Eqn. 7a. Further, we define the additional error due purely to imperfect quantile estimation as $\Delta H_{\hat{p}}(X|\hat{q}, q) = \hat{H}_{\hat{p}}(X|\hat{q}) - \hat{H}_{\hat{p}}(X|q)$.
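To make the three-step procedure concrete, here is a minimal sketch in Python (ours; the authors' reference implementation is linked in the Acknowledgements, and the default `N_K = 500` is an illustrative placeholder rather than a value taken from this manuscript):

```python
import numpy as np

def qs_entropy(Z, N_q, N_K=500, rng=None):
    """Quantile Spacing entropy estimate (Eqn. 7a).

    Z   : 1-D array of iid samples from the unknown pdf p(x)
    N_q : number of quantile spacings (N_q - 1 interior quantiles)
    N_K : number of subsample replicates used to smooth the quantile estimates
    """
    rng = rng or np.random.default_rng()
    Z = np.asarray(Z, dtype=float)
    x_min, x_max = Z.min(), Z.max()  # support interval from the sample (Step 1)

    # Step 3: each sorted subsample of size N_q - 1 gives one estimate of the
    # interior quantiles; averaging over N_K subsamples smooths the estimates.
    q_reps = np.empty((N_K, N_q - 1))
    for k in range(N_K):
        q_reps[k] = np.sort(rng.choice(Z, size=N_q - 1, replace=False))
    q_hat = q_reps.mean(axis=0)

    # Assemble q_hat = {x_min, q_1, ..., q_{N_q-1}, x_max} and apply Eqn. 7a.
    edges = np.concatenate(([x_min], q_hat, [x_max]))
    spacings = np.diff(edges)
    return np.mean(np.log(N_q * spacings))
```

For example, calling `qs_entropy(rng.standard_normal(1000), N_q=250)` (i.e., $\alpha = 0.25$) should return a value close to the standard Gaussian's differential entropy of approximately 1.419.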
Random Variability associated with the QS-based Entropy Estimate

[19] Given that the quantile spacing estimates $\{\hat{\Delta}_j,\ j = 1 \ldots N_q\}$ are subject to random sampling variability associated with (i) the sampling of $Z$ from $p(x)$, and (ii) estimation of the quantile positions $\hat{q}_j$, the entropy estimate $\hat{H}_{\hat{p}}(X|\hat{q})$ will also be subject to random sampling variability. As shown later, we can generate probabilistic estimates of the nature and size of this error from the empirical estimates of $p(q_j|Z)$ obtained in Step 3, and by bootstrapping on $Z$.
Properties of the Proposed Approach

[20] The accuracy of the estimate $\hat{H}_{\hat{p}}(X|\hat{q})$ obtained using the QS method outlined above depends on the following four assumptions, each of which we discuss below:

(i) A1: The piecewise constant approximation $\hat{p}(x|q)$ of $p(x)$ on the intervals between the quantile positions is adequate.

(ii) A2: The quantile positions $q = \{q_0, q_1, q_2, \ldots, q_{N_q}\}$ of $p(x)$ have been estimated accurately.

(iii) A3: The pdf $p(x)$ exists only on the support interval $[x_{min}, x_{max}]$, which has been properly chosen.

(iv) A4: The sample set $Z$ is consistent, representative and sufficiently informative about the underlying nature of $p(x)$.
Implications of the Piecewise Constant Assumption

[21] Assume that $\hat{p}(x|q)$ provides a piecewise-constant estimate of $p(x)$ and that the quantile positions $q = \{q_0, q_1, q_2, \ldots, q_{N_q}\}$ associated with a given choice for $N_q$ are perfectly known. Since the continuous shape of the cumulative distribution function (cdf) $P(x)$ can be approximated to an arbitrary degree of accuracy by a sufficient number of piecewise linear segments, we will have $\hat{P}(x|q) \to P(x)$ as $N_q \to \infty$.

[22] However, an insufficiently accurate approximation will result in a pdf estimate that is not sufficiently smooth, so that the entropy estimate will be biased. This bias will, in general, be positive (overestimation), because the piecewise-constant form $\hat{p}(x|q)$ used to approximate $p(x)$ will always be shifted slightly in the direction of larger entropy; i.e., each piecewise constant segment in $\hat{p}(x|q)$ is a maximum-entropy (uniform distribution) approximation of the corresponding segment of $p(x)$. However, the bias can be reduced and made arbitrarily small by increasing $N_q$ until the Kullback-Leibler distance between $\hat{p}(x|q)$ and $p(x)$ is so small that the information loss associated with use of $\hat{p}(x|q)$ in place of $p(x)$ in Eqn. 1 is negligible.

[23] The left panel of Figure 1 shows how this bias in the estimate of $H_p(X)$, due solely to the piecewise constant approximation of the pdf (no sample data are involved), declines with increasing $N_q$ for three pdf forms of varying functional complexity (Gaussian, Log-Normal and Exponential), each using a parameter choice such that all have the same theoretical entropy $H_p(X)$. Also shown, for completeness, are results for the Uniform pdf, where only one piecewise constant bin is theoretically required. Note that because $H_p(a \cdot (X - x_0)) = H_p(X) + \ln(a)$, the entropy can be changed to any desired value simply by rescaling on $X$. For these theoretical examples, the quantile positions are known exactly, and the resulting estimation bias is due only to the piecewise constant approximation of $p(x)$. However, since the theoretical pdfs used for this example all have infinite support, whereas the piecewise approximation requires specification of a finite support interval, for the latter we set $[x_{min}, x_{max}]$ to be the theoretical locations where $P(x_{min}) = \varepsilon$ and $P(x_{max}) = 1 - \varepsilon$, with $\varepsilon$ chosen to be some sufficiently small number. We see empirically that the bias due to the piecewise constant approximation declines to zero approximately as an exponential function of $\log N_q$, falling below the dashed 5% and 1% bias lines of Figure 1 as $N_q$ is increased.

[24] In practice, given a finite sample size $N_Z$, our ability to increase the value of $N_q$ will be constrained by the size of the sample (i.e., $N_q < N_Z$). This is because when the form of $p(x)$ is unknown, the locations of the quantile positions must be estimated using the information provided by $Z$. Further, what constitutes a sufficiently large value for $N_q$ will depend on the complexity of the underlying shape of $p(x)$.
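The left-panel experiment of Figure 1 can be reproduced in outline as follows (a sketch under our own choices; the value of $\varepsilon$, the pdf, and the $N_q$ values are illustrative):

```python
import numpy as np
from scipy import stats

def qs_entropy_exact_quantiles(dist, N_q, eps=1e-6):
    """Piecewise-constant entropy (Eqn. 6b) using the THEORETICAL quantiles of a
    scipy.stats distribution, with support truncated at P = eps and P = 1 - eps."""
    probs = np.linspace(eps, 1.0 - eps, N_q + 1)  # non-exceedance probabilities
    q = dist.ppf(probs)                            # exact quantile locations
    return np.mean(np.log(N_q * np.diff(q)))

true_H = stats.norm.entropy()  # 0.5 * ln(2*pi*e)
for N_q in (10, 100, 1000):
    H_hat = qs_entropy_exact_quantiles(stats.norm, N_q)
    print(N_q, 100.0 * (H_hat - true_H) / true_H)  # positive bias, shrinking with N_q
```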
Implications of Imperfect Quantile Position Estimation

[25] Assume that $N_q$ is large enough for the piecewise constant pdf approximation to be sufficiently accurate, but that the estimates $\{\hat{q}_1, \hat{q}_2, \ldots, \hat{q}_{N_q}\}$ of the locations of the quantiles are imperfect. Clearly, this can affect the estimate of entropy computed via Eqn. 7a, by distorting the shape of $\hat{p}(x|\hat{q})$ away from $\hat{p}(x|q)$, and therefore away from $p(x)$. Further, the uncertainty associated with the quantile estimates will translate into uncertainty associated with the estimate of entropy.

[26] In general, as the number of quantiles $N_q$ is increased, the inter-sample spacings associated with each ordered subset $q_k = \{q_{1k}, \ldots, q_{(N_q - 1)k}\},\ k = 1 \ldots N_K$ will decrease, so that the distribution of possible locations for each quantile $q_{jk},\ j = 1 \ldots N_q - 1$ will progressively become more tightly constrained. This means that the bias associated with each estimated quantile $\hat{q}_j$ will reduce progressively towards zero as $N_q$ is increased (constrained only by sample size $N_Z$), and the variance of the estimate $\hat{q}_j$ will decline towards zero as the number of subsamples $N_K$ is increased.

[27] Figure 2 illustrates how the bias and uncertainty associated with estimates of the quantiles diminish with increasing $N_q$ and $N_K$. Experimental results are shown for the Log-Normal density, with the y-axis indicating percent error in the quantile estimates corresponding to the 90% (green), 95% (purple) and 99% (turquoise) non-exceedance probabilities. In these plots, there is no distorting effect of sample size $N_Z$ (the sample size is effectively infinite), since when computing the estimates of the quantiles (as explained in Section 3.3) we draw subsamples of size $N_q$ directly from the theoretical pdf.

[28] The left-side plot shows, for a fixed number $N_K$ of subsamples, how the biases and uncertainties diminish as $N_q$ is increased. The boxplots reflect uncertainty due to random sampling variability, estimated by repeating each experiment many times (drawing new samples from the pdf each time). As expected, for smaller $N_q$ the quantile location estimates tend to be negatively biased, particularly for those in the more extreme tail locations of the distribution, with the bias shrinking steadily towards zero as $N_q$ is increased. The right-side plot shows, for a fixed value of $N_q$, how the uncertainties diminish but the biases remain relatively constant as the number of subsamples $N_K$ is increased. Overall, the uncertainty becomes quite small once $N_K$ is sufficiently large.
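A sketch of the left-panel experiment of Figure 2 (our construction; the Log-Normal parameters and repetition counts are illustrative, since the original values are not reproduced here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
dist = stats.lognorm(s=1.0)            # illustrative Log-Normal
q99_true = dist.ppf(0.99)              # tail quantile being estimated

def estimate_quantile(N_q, N_K, tau=0.99):
    """Average, over N_K sorted subsamples of size N_q - 1, of the order
    statistic corresponding to the tau non-exceedance probability."""
    j = int(round(tau * N_q)) - 1      # 0-based index of the ~tau quantile
    reps = [np.sort(dist.rvs(N_q - 1, random_state=rng))[j] for _ in range(N_K)]
    return np.mean(reps)

for N_q in (100, 400, 1600):
    errs = [100 * (estimate_quantile(N_q, N_K=100) - q99_true) / q99_true
            for _ in range(20)]        # 20 repetitions -> boxplot-style spread
    print(N_q, np.mean(errs))          # percent error, diminishing as N_q grows
```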
Implications of the Finite Support Assumption

[29] Assume that $N_q$ has been chosen large enough for the piecewise constant pdf approximation to be sufficiently accurate, and that the exact quantile positions associated with this choice for $N_q$ are known. The equation for estimating entropy (Eqn. 7a) can be decomposed into three terms:

$$H_{\hat{p}}(X|q) = \frac{1}{N_q} \sum_{j=1}^{N_q} \ln\!\left(N_q \cdot \Delta_j\right) = H_1 + H_{2:N_q-1} + H_{N_q} \qquad \text{(Eqn. 8)}$$

where $H_1 = \ln(N_q \cdot \Delta_1)/N_q$, $H_{2:N_q-1} = \frac{1}{N_q}\sum_{j=2}^{N_q-1} \ln(N_q \cdot \Delta_j)$, and $H_{N_q} = \ln(N_q \cdot \Delta_{N_q})/N_q$, with $\Delta_j$ indicating the true inter-quantile spacings. Only the first and last terms, $H_1$ and $H_{N_q}$, are affected by the choices for $x_{min}$ and $x_{max}$, through $\Delta_1 = q_1 - x_{min}$ and $\Delta_{N_q} = x_{max} - q_{N_q - 1}$.

[30] Clearly, if $p(x)$ is bounded both above and below by specific known values, then there is no issue. However, if the support of $p(x)$ is not known, or if one or both bounds can reasonably be expected to extend to $\pm\infty$ (as appropriate), then the choice for the relevant limiting value ($x_{min}$ or $x_{max}$) can significantly affect the computed value for $\hat{H}_{\hat{p}}$. To see this, note that the first term $H_1$ can be made to vary from $-\infty$ when $\Delta_1 = 0$, to $+\infty$ when $\Delta_1 = \infty$, passing through zero when $\Delta_1 = 1/N_q$; and similarly for the last term $H_{N_q}$. Therefore, the error associated with $\hat{H}_{\hat{p}}$ can be made arbitrarily negatively large by choosing $\Delta_1$ and $\Delta_{N_q}$ to be too small, or arbitrarily positively large by choosing $\Delta_1$ and $\Delta_{N_q}$ to be too large.

[31] In practice, when dealing with samples $Z$ from some unknown data generating process, we will often have only the samples themselves from which to infer the support of $p(x)$, and therefore can only confidently state that $x_{min} \le \min\{Z\}$ and $x_{max} \ge \max\{Z\}$. One possibility could be to ignore the fractional contributions of the terms $H_1$ and $H_{N_q}$ corresponding to the (unknown) portions of the pdf, and instead use as our estimate $H^*_{\hat{p}}(X|q) = H_{2:N_q-1}$. This would be equivalent to setting $\Delta_1 = \Delta_{N_q} = 1/N_q$, so that $x_{min} = q_1 - 1/N_q$ and $x_{max} = q_{N_q - 1} + 1/N_q$. By doing so, we would be ignoring a portion of the overall entropy associated with the pdf and can therefore expect to obtain an underestimate. However, this bias error $BE = H_{\hat{p}}(X|q) - H^*_{\hat{p}}(X|q)$ will tend to zero as $N_q$ is increased.

[32] An alternative approach, which we recommend in this paper, is to set $x_{min} = \min\{Z\}$ and $x_{max} = \max\{Z\}$. In this case, there will be random variability associated with the sampled values for $\min\{Z\}$ and $\max\{Z\}$, and so the bias in our estimate can be either negative or positive. Nonetheless, this bias error $BE$ will still tend to zero as $N_q$ is increased.

[33] Note that the percentage contributions of the entropy fractions $H_1$ and $H_{N_q}$ to the total entropy $H_{\hat{p}}(X|q)$ will depend on the nature of the underlying pdf. Figure 3 illustrates this for three pdfs (the Gaussian, which has infinite extent on both sides, and the Exponential and Log-Normal, which have infinite extent on only one side), assuming no estimation error associated with the quantile locations. For the Gaussian (blue) and Exponential (red) densities, the largest fractional entropy contributions are clearly from the tail regions, whereas for the Log-Normal (orange) density this is not so. So, the entropy fractions can be proportionally quite large or small at the extremes, depending on the form of the pdf. Nonetheless, the overall entropy fraction associated with each quantile spacing diminishes with increasing $N_q$: for the examples shown, the maximum contribution associated with any single quantile spacing is markedly smaller for $N_q = 1000$ (right plot) than for $N_q = 100$ (left plot). This plot illustrates clearly the most important issue that must be dealt with when estimating entropy from samples.

[34] So, on the one hand, the cumulative entropy fractions associated with the tail regions of $p(x)$ that lie beyond $\min\{Z\}$ and $\max\{Z\}$ are impossible to know. On the other, the individual contributions associated with the extreme quantile spacings $\Delta_1$ and/or $\Delta_{N_q}$, where $p(x)$ is small, can be quite a bit larger than those associated with the intermediate quantile spacings. Overall, the only real way to control the estimation bias and uncertainty associated with these extreme regions is to use a sufficiently large value for $N_q$, so that the relative contribution of the extreme regions is small. This will, in turn, of course, be constrained by the sample size.
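The sensitivity described in paragraph [30] is easy to verify numerically. The following sketch (ours) decomposes Eqn. 8 for a standard Gaussian with exact quantiles, varying only the assumed support bounds:

```python
import numpy as np
from scipy import stats

def eqn8_terms(N_q, x_min, x_max):
    """Decompose Eqn. 8 into (H_1, H_mid, H_Nq) using exact Gaussian quantiles."""
    q_interior = stats.norm.ppf(np.arange(1, N_q) / N_q)   # q_1 ... q_{Nq-1}
    edges = np.concatenate(([x_min], q_interior, [x_max]))
    terms = np.log(N_q * np.diff(edges)) / N_q
    return terms[0], terms[1:-1].sum(), terms[-1]

# Widening the assumed support inflates only the first and last terms.
for bound in (4.0, 8.0, 16.0):
    print(bound, eqn8_terms(N_q=100, x_min=-bound, x_max=bound))
```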
Combined Effect of the Piecewise Constant Assumption, Finite Support Assumption, and Quantile Position Estimation using Finite Sample Sizes

[35] In Section 4.1, we saw that the effect of the piecewise constant assumption on the QS-based estimate of entropy is a positive bias that diminishes with increasing $N_q$. Similarly, Section 4.2 showed that the biases associated with the quantiles diminish with increasing $N_q$, while the corresponding uncertainties diminish with increasing $N_K$. As mentioned earlier, the bias in each quantile position will be towards the direction of locally higher probability mass (since more of the equiprobable random samples will tend to be drawn from this region), and therefore the estimate $\hat{p}(x|\hat{q})$ of $p(x)$ will be distorted in the direction of having smaller "dispersion" (i.e., $\hat{p}(x|\hat{q})$ will tend to be more "peaked" than $p(x)$), resulting in a negative bias in the corresponding estimate of entropy. Finally, Section 4.3 discussed the implications of the finite support assumption, given that $x_{min}$ and $x_{max}$ will often not be known.

[36] Figure 4 illustrates the combined effect of these assumptions. Here we show how the overall percentage error in the QS-based estimate of entropy varies as a function of $\alpha = N_q/N_Z$, where $\alpha$ expresses the number of quantiles $N_q$ as a fraction of the sample size $N_Z$. Sample sets of given size $N_Z$ are drawn from the Gaussian (left panel), Exponential (middle panel) and Log-Normal (right panel) densities, the quantiles are estimated using the procedure discussed in Section 3.3, $x_{min}$ and $x_{max}$ are set to be the smallest and largest data values in the set (Section 4.3), and entropy is estimated using Eqn. 7a for different selected values of $N_q$. To account for sampling variability, the results are averaged over different sample sets drawn randomly from the parent density.

[37] The plots show how percentage estimation error (bias) varies as $\alpha$ (and hence $N_q$) changes as a fraction of sample size $N_Z$, for a range of different sample sizes. As might be expected, in each case when $\alpha$ is too small the estimation bias is positive (over-estimation) and can be quite large, due to the piecewise constant approximation. However, as $\alpha$ is increased the estimation bias decreases rapidly, crosses zero, and becomes negative (under-estimation) due to the combined effects of quantile position estimation bias and use of the smallest and largest sample values to approximate $x_{min}$ and $x_{max}$. Most interesting is the fact that all of the curves cross zero at approximately $\alpha \approx 0.25$ to $0.35$, and that this location does not seem to depend strongly on the sample size or shape of the pdf. Further, the marginal cost of setting $\alpha$ too large is low compared to that of setting $\alpha$ too small. Overall, the expected bias error diminishes with increasing sample size $N_Z$, and the optimal choice is $\alpha \approx 0.25$ to $0.35$.

[38] Figure 5 illustrates both the bias and uncertainty in the estimate of entropy as a function of sample size $N_Z$, when the number of quantiles $N_q$ is specified as a fixed fraction of the sample size (i.e., fixed $\alpha$ in the optimal range). The uncertainty intervals are due to sampling variability, estimated by drawing different sample sets from the parent population. The results show that uncertainty due to sampling variability diminishes rapidly with sample size, becoming relatively small for large sample sizes.
Implications of Informativeness of the Data Sample

[39] For the results shown in Figures 4 & 5, we drew samples directly from $p(x)$. In practice, we must construct our entropy estimate using a single data sample $Z$ of finite size $N_Z$. Provided that $Z$ is a consistent and representative random sample from $p(x)$, with each element $z_i$ being iid, then a sufficiently large sample size $N_Z$ should enable construction of an accurate approximation $\hat{p}(x|\hat{q})$ of $p(x)$ via the QS method. However, if $N_Z$ is too small, it can (i) prevent setting a sufficiently large value for $N_q$, and (ii) tend to make the sets $Z_k$ sub-sampled from $Z$ insufficiently independent for accurate estimates of the quantile positions of $p(x)$ to be obtained. The overall effect will be to prevent $\hat{p}(x|\hat{q})$ from approaching $p(x)$, leading to an unreliable estimate of its entropy.

[40] Further, even if $N_Z$ is sufficiently large for $\hat{p}(x|\hat{q}) \approx p(x)$, sampling variability associated with randomly drawing $Z$ from $p(x)$ will result in the entropy estimate $\hat{H}_{\hat{p}}(X|\hat{q})$ being subject to statistical variability. Figure 6 shows how bootstrapped estimates of the uncertainty differ from those shown in Figure 5 above, in which we drew different sample sets from the parent population. Here, each time a sample set is drawn from the parent density, we draw $N_B$ bootstrap samples of the same size $N_Z$ from that sample set, use these to obtain $N_B$ different estimates of the associated entropy (using $\alpha$ fixed in the optimal range), and compute the width of the resulting inter-quartile range (IQR). We then repeat this procedure for different sample sets of the same size drawn from the parent population. Figure 6 shows the ratio of the IQR obtained using bootstrapping to the actual IQR for different sample sizes; the boxplots represent variability due to random sampling. Here, an expected (mean) ratio value of 1.0 and a small boxplot width is ideal, indicating that bootstrapping provides a good estimate of the uncertainty to be associated with random sampling variability. The results show that for smaller sample sizes there is a tendency to overestimate the width of the inter-quartile range, but this slight positive bias disappears for larger sample sizes.
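The Figure 6 diagnostic can be sketched as follows (ours; all counts are illustrative and `qs_entropy` is the helper sketched in Section 3):

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_iqr(Z, alpha=0.25, N_B=200):
    """IQR of QS entropy estimates over N_B bootstrap resamples of Z."""
    N_q = max(2, int(alpha * Z.size))
    H = [qs_entropy(rng.choice(Z, Z.size, replace=True), N_q, rng=rng)
         for _ in range(N_B)]
    return np.subtract(*np.percentile(H, [75, 25]))

# "Actual" IQR: QS estimates over independent sample sets from the parent pdf.
N_Z = 500
H_ref = [qs_entropy(rng.standard_normal(N_Z), int(0.25 * N_Z), rng=rng)
         for _ in range(200)]
actual_iqr = np.subtract(*np.percentile(H_ref, [75, 25]))
ratio = bootstrap_iqr(rng.standard_normal(N_Z)) / actual_iqr  # ideal: ~1.0
```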
Summary of Properties of the Proposed Quantile Spacing Approach

[41] To summarize, bias in the estimate $\hat{H}_{\hat{p}}(X|\hat{q})$ can arise due to: (a) inadequacy of the piecewise approximation of $p(x)$, (b) imperfect estimation of the quantile positions, (c) imperfect knowledge of the support interval, and (d) the sample $Z$ not being consistent, representative and sufficiently informative. Meanwhile, uncertainty in the estimate can arise due to: (a) random sampling variability associated with estimation of the quantiles, and (b) random sampling variability associated with drawing $Z$ from $p(x)$. For a given sample size $N_Z$, and provided that the sample is consistent, representative and fully informative, the bias and uncertainty can be reduced by selecting sufficiently large values for $N_q$ and $N_K$ (we recommend setting $N_q = \alpha \cdot N_Z$ with $\alpha$ in the range 0.25 to 0.35, together with a sufficiently large $N_K$), while the overall statistical variability associated with the estimate can be estimated by bootstrapping from $Z$.
Algorithm for Estimating Entropy via the Quantile Spacing Approach

[42] Given a sample set $Z$ of size $N_Z$:

1) Set $x_{min} = \min\{Z\}$ and $x_{max} = \max\{Z\}$.

2) Select values for $\{N_q, N_K, N_B\}$. Recommended defaults are $N_q = \alpha \cdot N_Z$ with $\alpha$ in the range 0.25 to 0.35, together with sufficiently large values for $N_K$ and $N_B$.

3) Bootstrap a sample set $Z_b$ of size $N_Z$ from $Z$ with replacement.

4) Compute the entropy estimate $\hat{H}_{\hat{p}}(X|\hat{q}_b)$ using Eqn. 7a and the procedure outlined in Section 3.

5) Repeat steps 3 and 4 $N_B$ times to generate the bootstrapped distribution of $\hat{H}_{\hat{p}}(X|\hat{q}_b)$ as an empirical probabilistic estimate $p(\hat{H}_{\hat{p}}(X|Z))$ of the entropy $H_p(X)$ of $p(x)$ given $Z$.
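Putting the pieces together, a minimal end-to-end sketch of this algorithm (ours; the authors' reference implementation is linked in the Acknowledgements, and the defaults shown are placeholders rather than the manuscript's recommended values):

```python
import numpy as np

def qs_entropy_bootstrap(Z, alpha=0.25, N_K=500, N_B=500, rng=None):
    """Steps 1-5: bootstrapped distribution of QS entropy estimates for Z."""
    rng = rng or np.random.default_rng()
    Z = np.asarray(Z, dtype=float)
    N_q = max(2, int(alpha * Z.size))
    H = np.empty(N_B)
    for b in range(N_B):
        Zb = rng.choice(Z, size=Z.size, replace=True)  # bootstrap resample
        H[b] = qs_entropy(Zb, N_q, N_K=N_K, rng=rng)   # helper from Sec. 3
    return H  # e.g., report np.median(H) and np.percentile(H, [25, 75])
```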
Relationship to the Bin Counting Approach

[43] Because the proposed QS approach employs a piecewise constant approximation of $p(x)$, there are obvious similarities to BC. However, there are also clear differences. First, while BC typically employs equal-width binning along the support of $X$, with each bin having a different fraction of the total probability mass, QS uses variable-width intervals (analogous to "bins"), each having an identical fraction of the total probability mass (so that the intervals are wider where $p(x)$ is small, and narrower where $p(x)$ is large). Both methods require specification of the support interval $[x_{min}, x_{max}]$.

[44] Second, whereas BC requires counting samples falling within bins to estimate the probability masses associated with each bin, QS involves no "bin-counting"; instead, the data samples are used to estimate the positions of the quantiles. Since the probability mass estimates obtained by counting random samples falling within bins can be highly uncertain due to sampling variability, particularly for small sample sizes $N_Z$, this translates into uncertainty regarding the shape of the pdf, and thereby regarding its entropy. In QS, the effect of sampling variability is to consistently provide a pdf approximation that tends to be slightly more peaked than the true pdf, so that the bias in the entropy estimate tends to be slightly negative. This negative bias acts to counter the positive bias resulting from the piecewise constant approximation of the pdf.

[45] Third, whereas BC requires selection of a bin-width hyperparameter $h$ that represents the appropriate bin-width required for "smoothing" to ensure an appropriate balance between bias and variance errors, QS requires selection of a hyperparameter $\alpha$ that specifies the number of quantile positions $N_q$ as a fraction of the sample size $N_Z$. As seen in Figure 4, an appropriate choice for $\alpha$ can effectively drive estimation bias to zero, while $N_K$ controls the degree of uncertainty associated with the estimation of the positions of the quantiles. Further, if we desire estimates of the uncertainty in the computed value of entropy arising due to random sampling variability, we must specify the number of bootstraps $N_B$. In practice, the values selected for $N_K$ and $N_B$ can be made arbitrarily large, and our experiments suggest that moderately large values work well in practice. Accordingly, the QS hyperparameter $\alpha$ takes the place of the BC hyperparameter $h$ in determining the accuracy of the entropy estimate obtained from a given sample.

[46] Our survey of the literature suggests that the problem of how to select the BC bin-width hyperparameter $h$ is not simple, and a number of different strategies have been proposed. Sturges (1926) proposed to choose the number of bins based on sample size only.
Scott (1979) estimated the optimal number of bins by minimizing the mean squared error between the sample histogram and the "true" form of the pdf (for which the shape must be assumed). Freedman and Diaconis (1981) further developed this approach by estimating the shape of the true pdf from the interquartile range of the sample. More recently, Knuth (2019) proposed a method that does not require choice of a hyperparameter: using a Bayesian maximum likelihood approach, and assuming a piecewise-constant density model, the posterior probability for the number of bins is identified (this approach also provides uncertainty estimates for the related bin counts). Other BC approaches that provide uncertainty estimates, based on the Dirichlet, Multinomial, and Binomial distributions, are discussed by Darscheid et al. (2018). However, as shown in the next section, in practice the "optimal" fixed bin-width can vary significantly with the shape of the pdf and with sample size.
Experimental Comparison with the Bin Counting Method

[47] The right panel of Figure 1 shows the theoretical bias, due only to the piecewise-constant approximation, associated with the estimate of $H_p(X)$ obtained using BC when the support interval $[x_{min}, x_{max}]$ is subdivided into equal-width intervals. We can compare the results to the left panel of Figure 1 if we consider the number $N_{BC}$ of BC bins to be analogous to the number $N_q$ of spacings between quantiles for QS. Note that no random sampling variability or data informativeness issues are involved in the construction of these figures. For BC we use the theoretical fractions of probability mass associated with each of the equal-width bins, and for QS we use the theoretical quantile positions to compute the interval spacings. In both cases, to address the "infinite support" issue, we set $[x_{min}, x_{max}]$ to the locations where $P(x_{min}) = \varepsilon$ and $P(x_{max}) = 1 - \varepsilon$, with $\varepsilon$ chosen to be suitably small. The BC bias also declines toward zero as the number of bins is increased; in fact, it can decline somewhat faster than for the QS approach. Clearly, for the Gaussian (blue) and Exponential (red) densities, the BC constant bin-width approximation can provide better entropy estimates with fewer bins than the QS variable bin-width approach. However, for the skewed Log-Normal density (orange) the behavior of the BC approximation is more complicated, whereas the QS approach shows an exponential rate of improvement with increasing number of bins for all three density types. This suggests that the variable bin-width QS approximation may provide a more consistent approach for more complex distributional forms (see Section 7).
[48] Further, Figure 7 shows the results of a "naïve" implementation of BC where the value for $N_{BC}$ is varied as a fractional percentage of sample size $N_Z$. As with QS, we specify the support interval by setting $x_{min} = \min\{Z\}$ and $x_{max} = \max\{Z\}$, but here the support interval is divided into equal-width bins, so that $X_{BIN} = \{x_0, x_1, x_2, \ldots, x_{N_{BC}}\}$ represents the locations of the edges of the bins (where $x_0 = x_{min}$ and $x_{N_{BC}} = x_{max}$), and therefore $h = (x_{max} - x_{min})/N_{BC}$. We then assign to each bin the probability mass $\hat{p}(x_j|X_{BIN}) = n_j/N_Z$ for $x_{j-1} \le x < x_j$, where $n_j$ is the number of samples falling in the bin defined by $x_{j-1} \le x < x_j$. Finally, we compute the BC estimate of entropy as

$$\hat{H}_{\hat{p}}(X|X_{BIN}) = -\sum_{j=1}^{N_{BC}} \ln\!\left(\frac{n_j}{N_Z}\right) \cdot \frac{n_j}{N_Z} + \ln(h)$$

following the convention that $0 \cdot \ln(0) = 0$ to handle bins where the number of samples $n_j = 0$. We obtain estimates for different sample sets drawn from the parent density and average the results; results are shown for a range of sample sizes $N_Z$. The yellow marker symbols indicate where each curve crosses the zero-bias line; clearly $N_{BC}$ is not a constant fraction of $N_Z$, and for any given sample size the ratio $N_{BC}/N_Z$ changes with the form of the pdf.
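A sketch of this naïve BC estimator (ours; the empty-bin convention follows the text):

```python
import numpy as np

def bc_entropy(Z, N_BC):
    """Naive Bin-Counting differential entropy: discrete entropy of the bin
    masses plus the ln(h) correction; empty bins contribute 0 * ln(0) = 0."""
    counts, edges = np.histogram(Z, bins=N_BC)   # equal-width bins on [min, max]
    h = edges[1] - edges[0]
    p = counts[counts > 0] / Z.size              # drop empty bins (0 * ln(0) = 0)
    return -np.sum(p * np.log(p)) + np.log(h)
```

Sweeping `N_BC` over a range of fractions of the sample size and locating the zero-bias crossing reproduces the qualitative behavior of Figure 7.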
[49] To more clearly compare these BC results with those shown in Figure 4 for the QS approach, Figure 8 shows the sampling variability distribution of the optimal number of bins (i.e., the value of $N_{BC}$ for which the expected entropy estimation error is zero) as a function of sample size, for the Gaussian, Exponential and Log-Normal densities. We see clearly that, in contrast to QS, the "expected optimal" number of bins needed to achieve zero bias is neither a constant fraction of the sample size nor independent of the pdf shape; instead, it declines as the sample size increases, and differs across pdf shapes. Further, the sampling variability associated with the optimal fractional number of bins can be quite large, and is highly skewed at smaller sample sizes. This is in contrast with QS, where the optimal fraction is approximately constant at $\alpha \approx 25$ to $35\%$ across different sample sizes and pdf shapes.
Testing on Multi-Modal PDF Forms

[50] While the types of pdf forms tested in this paper are far from exhaustive, they represent different shapes and degrees of skewness, including infinite support on both sides (Gaussian) and infinite support on only one side (Exponential and Log-Normal). However, all three forms are "unimodal", and so we conducted an additional test on a multimodal distributional form.

[51] Figure 9 shows results for a Bimodal pdf (Figure 9a) constructed using a mixture of two Gaussians. Since its theoretical entropy value is not available in closed form, we used the piecewise constant approximation method with true (known) quantile positions to compute its entropy, progressively increasing the number of quantiles $N_q$ until the estimate converged to within three decimal places (Figure 9b). Figures 9c and 9d show that the QS estimation bias declines exponentially with the fractional number of quantiles and crosses zero at $\alpha \approx 25$ to $35\%$, in a manner similar to the unimodal pdfs tested previously (Figure 4). Figure 9c shows the results for a fixed sample size $N_Z$, along with the distribution due to sampling variability (500 repetitions), showing that both the IQR and the whiskers of the sampling distribution fall within a small percentage of the correct value. Figure 9d shows the expected bias (estimated by averaging over 500 repetitions) for varying sample size $N_Z$; for smaller sample sizes, the optimal value for $\alpha$ is somewhat larger, while for larger $N_Z$ a value of $\alpha \approx 25$ to $35\%$ seems to work quite well, while being relatively insensitive to the choice of value within this range.

[52] Interestingly, comparing Figure 9d (Bimodal Gaussian Mixture) with Figure 4a (Unimodal Gaussian), we see that QS actually converges more rapidly for the Bimodal density. One possible explanation is that the Bimodal density is, in some sense, "closer" in shape to a Uniform density, for which the piecewise constant representation is a better approximation.
Discussion and Conclusions

[53] In principle, the QS approach provides a relatively simple method for obtaining accurate estimates of entropy from data samples, along with an idea of the estimation uncertainty associated with sampling variability. It appears to have an advantage over BC, since the most important hyper-parameter to be specified, the number of quantiles $N_q$, does not need to be tuned and can apparently be set to a fixed fraction (~25-35%) of the sample size, regardless of pdf shape or sample size. In contrast, for BC the optimal number of bins $N_{BC}$ varies with pdf shape and sample size and, since the underlying pdf shape is usually not known beforehand, it can be difficult to come up with a general rule for how to accurately specify this value. Therefore, QS is potentially more accurate than BC when applied to data from an unknown distributional form.

[54] The fact that QS differs from BC in one very important way may help to explain the properties noted above. Whereas in BC we choose the "bin" size and locations, and then compute the "probability mass" estimates for each bin from the data, in QS we instead choose the "probability mass" size (by specifying the number of quantiles) and then compute the "bin" sizes and locations (to conform to the spacings between quantiles) from the data. In doing so, BC uses only the samples falling within a particular bin to compute each probability mass estimate, a value that can (in principle) be highly sensitive to sampling variability unless the number of samples in each bin is sufficiently large. In contrast, QS uses a potentially large number of samples from the data to generate a smoothed (via subsampling and averaging) estimate of the position of each quantile. As shown in Figure 2, the estimation bias and uncertainty are small for most of the quantiles, and may only be significant near the extreme tails of the density, and for smaller sample sizes. In principle, therefore, with its focus on estimating quantiles rather than probability masses, the QS method seems to provide a more efficient use of the information in the data, and thereby a more robust approximation of the shape of the pdf.
[55] In this paper, we have used a simple, perhaps naïve, way of estimating the locations of the quantiles. Future work could investigate more sophisticated methods in which the bias associated with extreme quantiles is accounted for and corrected. These include both Kernel and non-parametric methods. The simplest non-parametric methods are the empirical quantile estimator based on a single order statistic, or the extension based on two consecutive order statistics (Hyndman and Fan 1996), for which the variance can be large. Quantile estimators based on L-statistics have been explored as a way to reduce estimation variance (Harrell and Davis 1982; Kaigh and Lachenbruch 1982; Brodin 2006), and include Kernel quantile estimators (Parzen 1979; Yang 1985; Sheather and Marron 1990; Cheng and Parzen 1997; Park 2006). However, performance of the latter can be very sensitive to the choice of bandwidth. More recently, quantile L-estimators intended to be efficient at small sample sizes for estimating quantiles in the tails of a distribution have also been proposed (Sfakianakis and Verginis 2008; Navruz and Özdemir 2020). Finally, quantile estimators based on Bernstein polynomials (Cheng 1995; Pepelyshev, Rafajłowicz and Steland 2014) and importance sampling (see Kohler and Tent 2020, and references therein) have also been investigated.
[56] Note that the small-sample efficiency of the QS method may be affected by the fact that the entropy fractions associated with the extreme upper and lower end "bins" (quantile spacings) can be quite large when a small number of quantiles is used (see Figure 3). Our implementation of QS seems to successfully compensate for the lack of exact knowledge of $q_0$ and $q_{N_q}$ by using as empirical estimates the values $\min\{Z\}$ and $\max\{Z\}$ from the data sample. Intuitively, one would expect that wider "bins" could be used in regions where the slopes of the entropy fraction curves are flatter (e.g., the center of the Gaussian density), and narrower "bins" in the tails (more like the BC method). Taken to its logical conclusion, an ideal approach might be to use "bin" locations and widths such that the cumulative value of $-\ln(p(x)) \cdot p(x)$ for any given pdf is subdivided into equal intervals (i.e., equal fractional entropy spacings), so that each bin then contributes approximately the same amount to the summation in Eqn. 7. To achieve this, the challenge is to estimate the edge locations of these "bins" (analogous to the locations of the quantiles) from the sample data; we leave the possibility of such an approach for future investigation.
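For a known pdf, the equal-entropy-spacing edges contemplated here can be computed directly by inverting the cumulative entropy integrand; a sketch (ours, purely illustrative of the idea rather than the data-based estimator that would actually be needed):

```python
import numpy as np
from scipy import stats, integrate

def equal_entropy_edges(pdf, lo, hi, n_bins, grid=20000):
    """Edges that split the cumulative entropy integrand -p(x)*ln(p(x)) into
    n_bins pieces of equal contribution (for a KNOWN pdf)."""
    x = np.linspace(lo, hi, grid)
    dH = -pdf(x) * np.log(pdf(x))                  # entropy density
    cumH = integrate.cumulative_trapezoid(dH, x, initial=0.0)
    targets = np.linspace(0.0, cumH[-1], n_bins + 1)
    return np.interp(targets, cumH, x)             # invert the cumulative curve

edges = equal_entropy_edges(stats.norm.pdf, -8, 8, n_bins=20)
```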
Acknowledgements

We acknowledge the many members of the GeoInfoTheory community (https://geoinfotheory.org) who have provided both moral support and extensive discussion of matters related to Information Theory and its applications to the Geosciences; without their existence and enthusiastic engagement it is unlikely that the ideas leading to this manuscript would have occurred to us. The first author also acknowledges partial support by the Australian Research Council Centre of Excellence for Climate Extremes (CE170100023). The algorithms used in this work are freely accessible for non-commercial use at https://github.com/rehsani/Entropy.

References Cited

Beirlant J, EJ Dudewicz, L Györfi, EC Van Der Meulen (1997),
Nonparametric entropy estimation: An overview, International Journal of Mathematical and Statistical Sciences, 6: 17-39

Cheng C and E Parzen (1997),
Unified estimators of smooth quantile and quantile density functions , Journal of Statistical Planning and Inference 59: 291-307 Cover TM and JA Thomas (2006),
Elements of Information Theory , Wiley Darscheid P, A Guthke and U Ehret (2018),
A maximum-entropy method to estimate discrete distributions from samples ensuring nonzero probabilities , Entropy, 20(8) Denison DGT, NM Adams, CC Holmes and DJ Hand (2002),
Bayesian partition modelling , Computational Statistics and Data Analysis, 38(4): 475β485 Endres A and P Foldiak (2005),
Bayesian bin distribution inference and mutual information, IEEE Transactions on Information Theory, 51(11): 3766-3779

Freedman D and P Diaconis (1981),
On the histogram as a density estimator: L2 theory , Zeitschrift fΓΌr Wahrscheinlichkeitstheorie und Verwandte Gebiete, 57(4): 453-476 Gong W, D Yang, HV Gupta, and G Nearing (2014),
Estimating information entropy for hydrological data: One dimensional case , Technical Note, Water Resources Research, 50, doi: 10.1002/2014WR015874 Harrell FE and CE Davis (1982),
A new distribution-free quantile estimator . Biometrika 69: 635β640 Hutter M (2007),
Exact Bayesian regression of piecewise constant functions, Bayesian Analysis, 2(4): 635-664

Hyndman RJ and Y Fan (1996),
Sample quantiles in statistical packages , American Statistician 50(4):361β365 Jackson B, J Scargle, D Barnes, S Arabhi, A Alt, P Gioumousis, E Gwin, P Sangtrakulcharoen, L Tan and TT Tsai (2005),
An algorithm for optimal partitioning of data on an interval , IEEE Signal Processing Letters, 12: 105β108. Kaigh WD and PA Lachenbruch (1982),
A Generalized Quantile Estimator , Communications in Statistics, Part A-Theory and Methods, 11: 2217-2238 Kohler M and R Tent (2020),
Nonparametric quantile estimation using surrogate models and importance sampling , Metrika, 83: 141β169 Knuth KH (2019),
Optimal data-based binning for histograms and histogram-based probability density models , Digital Signal Processing, 95, 102581 Navruz G and AF Γzdemir (2020),
A new quantile estimator with weights based on a subsampling approach , British Journal of Mathematical and Statistical Psychology, 73 Park C (2006),
Smooth nonparametric estimation of a quantile function under right censoring using beta kernels , Technical Report (TR 2006-01-CP), Department of Mathematical Sciences, Clemson University Parzen E (1979),
Nonparametric Statistical Data Modeling , Journal of the American Statistical Association, 74: 105-131 Scott DW (1979),
Optimal and data-based histograms , Biometrika, 66(3), 605β610, doi:10.2307/2335182 Scott DW (1985),
Averaged shifted histogramsβEffective nonparametric density estimators in several dimensions , Ann. Stat., 13(3), 1024β1040, doi:10.1214/aos/1176349654 Scott DW (2004),
Handbook of Computational StatisticsβConcepts and Methods , Springer NY Scott DW (2008),
Multivariate Density Estimation: Theory, Practice, and Visualization, John Wiley NY

Scott DW and JR Thompson (1983),
Probability density estimation in higher dimensions , in Computer Science and Statistics: Proceedings of the Fifteenth Symposium on the Interface, edited by J. E. Gentle, 173β179, North-Holland, Amsterdam, Netherlands Sfakianakis ME and DG Verginis (2008),
A new family of nonparametric quantile estimators , Communications in Statistics β Simulation and Computation, 37: 337β345 Shannon CE (1948),
A mathematical theory of communication , Bell System Technical Journal, 27: 379-423 Sheather JS and JS Marron (1990),
Kernel quantile estimators , Journal of the American Statistical Association, 85: 410β416 Silverman BW (1986),
Density estimation for statistics and data analysis , Chapman and Hall NY Sturges HA (1926),
The choice of a class interval , Journal of the American Statistical Association, 21(153): 65-66 Wegman EJ (1975),
Maximum likelihood estimation of a probability density function , SankhyΔ: The Indian
Journal of Statistics, 3(2), 211β224 Hutter M (2007),
Exact Bayesian regression of piecewise constant functions , Bayesian Analsis, 2(4), 635β664 Yang SS (1985),
A smooth nonparametric estimator of a quantile function , Journal of the American Statistical Association, 80: 1004-1011
List of Figures

Figure 1: Plots showing how the entropy estimation bias associated with the piecewise-constant approximation of various theoretical pdf forms varies with the number of quantiles (left; QS method) or the number of equal-width bins (right; BC method) used in the approximation. The dashed horizontal lines indicate ±1% and ±5% bias error. No sampling is involved; the bias is due purely to the piecewise-constant assumption. For QS, the locations of the quantiles are set to their theoretical values. To address the "infinite support" issue, x_min and x_max were set to the locations where F(x) = ε and F(x) = 1 - ε respectively, for a small value of ε.
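For concreteness, the piecewise-constant entropy computation underlying Figure 1 can be sketched as follows. This is a minimal illustration rather than the code used to generate the figure; the helper name qs_entropy_theoretical, the use of scipy frozen-distribution objects, and the placeholder value eps = 1e-4 (the ε actually used is not restated here) are all assumptions of the sketch.

    import numpy as np
    from scipy import stats

    def qs_entropy_theoretical(dist, n_q, eps=1e-4):
        """Entropy (nats) of the piecewise-constant approximation built on
        the theoretical equal-probability-mass quantiles of `dist`."""
        # N_Q + 1 quantile locations spanning probabilities [eps, 1 - eps],
        # truncating the (possibly infinite) support as described in Figure 1
        probs = np.linspace(eps, 1.0 - eps, n_q + 1)
        x = dist.ppf(probs)            # theoretical quantiles
        dx = np.diff(x)                # the N_Q quantile spacings
        # each spacing carries mass ~1/N_Q, so its density is ~1/(N_Q * dx)
        return np.mean(np.log(n_q * dx))

    # percent bias relative to the closed-form entropy of a Gaussian
    dist = stats.norm()
    for n_q in [10, 100, 1000]:
        h_hat = qs_entropy_theoretical(dist, n_q)
        print(n_q, 100 * (h_hat - dist.entropy()) / dist.entropy())

Other distributional forms can be examined by swapping in, e.g., stats.expon() or stats.lognorm(1.0).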
Figure 2: Plots showing bias and uncertainty associated with estimates of the quantiles derived from random samples, for the Log-Normal pdf. Uncertainty associated with random sampling variability is estimated by repeating each experiment many times. In both subplots, for each case, the box plots are shown side by side to improve legibility. (Left) Results obtained when varying N_Q over a range of values for fixed N_K. (Right) Results obtained when varying N_K over a range of values for fixed N_Q.
Figure 3: Plots showing the percentage entropy fraction associated with each quantile spacing for the Gaussian, Exponential and Log-Normal pdfs, for N_Q = 100 (left) and N_Q = 1000 (right). For the Uniform pdf (not shown, to avoid complicating the figures) the percentage entropy fraction associated with each quantile spacing is a horizontal line (at 1% in the left panel, and at 0.1% in the right panel). Note that the entropy fractions can be proportionally quite large or small at the extremes, depending on the form of the pdf. However, the entropy fraction associated with each individual quantile spacing diminishes with increasing N_Q: for the examples shown, the maximum contribution of any single spacing is roughly an order of magnitude smaller for N_Q = 1000 (right) than for N_Q = 100 (left).
Figure 4: Plots showing expected percent error in the QS-based estimate of entropy derived from random samples, as a function of α = N_Q/N_S, which expresses the number of quantiles N_Q as a percentage of the sample size N_S. Results are averaged over many trials, each obtained by drawing a sample set of size N_S from the theoretical pdf, with x_min and x_max set to the smallest and largest data values in the particular sample. Results are shown for a range of sample sizes N_S, for the Gaussian (left), Exponential (middle) and Log-Normal (right) densities. In each case, the estimation bias is positive (overestimation) when α is small, and crosses zero to become negative (underestimation) as α increases; the marginal cost of setting α too large is low compared to that of setting it too small. As N_S increases, the bias diminishes. The optimal choice is α ≈ 25-35% and is relatively insensitive to pdf shape or sample size.
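The sample-based QS estimator explored in Figure 4 can be sketched along the following lines, with N_Q tied to the sample size through α and the support end points taken from the sample itself. The helper name qs_entropy_sample and the default alpha = 0.3 (a value inside the ~25-35% optimal range reported above) are illustrative assumptions, not the authors' exact implementation.

    import numpy as np

    def qs_entropy_sample(sample, alpha=0.3):
        """QS entropy estimate (nats) from an iid sample; alpha ~ N_Q / N_S."""
        n_s = len(sample)
        n_q = max(1, round(alpha * n_s))          # number of quantile spacings
        probs = np.linspace(0.0, 1.0, n_q + 1)    # equal-mass break points
        x = np.quantile(sample, probs)            # x[0], x[-1] = sample min, max
        dx = np.diff(x)                           # assumes continuous data (no ties)
        return np.mean(np.log(n_q * dx))          # each spacing holds mass ~1/N_Q

    rng = np.random.default_rng(1)
    # for a standard Gaussian the true value is 0.5*ln(2*pi*e) ~ 1.419
    print(qs_entropy_sample(rng.standard_normal(1000)))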
Figure 5: Plots showing bias and uncertainty in the QS-based estimate of entropy derived from random samples, as a function of sample size N_S, when the number of quantiles N_Q is set to a fixed percentage of the sample size (a single value of α within the optimal range identified in Figure 4), and x_min and x_max are respectively set to the smallest and largest data values in the particular sample. The uncertainty shown is due to random sampling variability, estimated by drawing many different samples from the parent density. Results are shown for the Gaussian (blue), Exponential (red) and Log-Normal (orange) densities; box plots are shown side by side to improve legibility. As the sample size N_S increases, the uncertainty diminishes.
Figure 6: Plots showing, for different sample sizes and a fixed value of α, the ratio of the interquartile range (IQR) of the QS-based estimate of entropy obtained using bootstrapping to the actual IQR arising from random sampling variability. Here, each sample set drawn from the parent density is bootstrapped to obtain N_B different estimates of the associated entropy, and the width of the resulting interquartile range is computed. The procedure is repeated for many different sample sets drawn from the parent population, and the graph shows the resulting variability as box plots. The ideal result would be a ratio of 1.0.
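The bootstrap step of Figure 6 can be sketched as below, reusing qs_entropy_sample from the sketch after Figure 4; the helper name bootstrap_entropy_iqr and the placeholder n_b = 500 are assumptions (the N_B actually used is not restated here). The ratio plotted in the figure divides this bootstrap IQR by the IQR of estimates obtained from genuinely independent samples of the same size.

    import numpy as np

    def bootstrap_entropy_iqr(sample, n_b=500, alpha=0.3, seed=None):
        """IQR of N_B bootstrap replicates of the QS entropy estimate."""
        rng = np.random.default_rng(seed)
        reps = [qs_entropy_sample(rng.choice(sample, size=len(sample), replace=True),
                                  alpha)
                for _ in range(n_b)]              # resample with replacement
        q25, q75 = np.percentile(reps, [25, 75])
        return q75 - q25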
Figure 7: Plots showing how the expected percentage error in the BC-based estimate of entropy derived from random samples varies as a function of the number of bins N_bins. Results are averaged over many trials, each obtained by drawing a sample set of size N_S from the theoretical pdf, with x_min and x_max set to the smallest and largest data values in the particular sample. Results are shown for a range of sample sizes N_S, for the Gaussian (left), Exponential (middle) and Log-Normal (right) densities. When the number of bins is small the estimation bias is positive (overestimation), but it declines rapidly, crossing zero to become negative (underestimation) as the number of bins is increased. In general, the overall ranges of overestimation and underestimation bias are larger than for the QS method (see Figure 4).
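For comparison, the Bin-Counting baseline of Figure 7 amounts to a histogram plug-in estimate: the discrete entropy of the bin probabilities plus the log of the bin width, computed over equal-width bins spanning the sample range. The sketch below illustrates the idea; the helper name bc_entropy_sample is an assumption, not the authors' code.

    import numpy as np

    def bc_entropy_sample(sample, n_bins):
        """Bin-Counting (histogram) differential-entropy estimate (nats)."""
        counts, edges = np.histogram(sample, bins=n_bins)  # equal-width bins over [min, max]
        width = edges[1] - edges[0]
        p = counts[counts > 0] / counts.sum()              # empty bins contribute 0*log(0) = 0
        return -np.sum(p * np.log(p)) + np.log(width)      # discrete entropy + log bin width

Unlike QS, the bin count that minimizes bias here must be tuned, since it varies with sample size and pdf shape (Figure 8).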
Figure 8: Boxplots showing the sampling variability distribution of the optimal fractional number of bins (as a percentage of sample size) required to achieve zero bias when using the BC method to estimate entropy from random samples. Results are shown for the Gaussian (blue), Exponential (red) and Log-Normal (orange) densities. The uncertainty estimates are computed by drawing many different sample data sets of a given size from the parent distribution. Note that the expected optimal fractional number of bins varies with the shape of the pdf and, rather than being constant, declines as the sample size increases. This is in contrast with the QS method, where the optimal fractional number of quantiles is approximately constant (~25-35%) across different sample sizes and pdf shapes. Further, the variability in the optimal fractional number of bins can be large and highly skewed at smaller sample sizes.
Figure 9: Plots showing results for the Bimodal pdf. (a) Pdf and cdf of the Gaussian Mixture model. (b) Convergence of the entropy computed using the piecewise-constant approximation as the number of quantiles N_Q is increased. (c) Bias and sampling variability of the QS-based estimate of entropy plotted against N_Q as a percentage of sample size. (d) Expected bias of the QS-based estimate of entropy plotted against N_Q as a percentage of sample size, for a range of sample sizes N_S.