On the Estimation of Information Measures of Continuous Distributions
Georg Pichler (Institute of Telecommunications, TU Wien, Vienna, Austria), Pablo Piantanida (Université Paris-Saclay, CNRS, CentraleSupélec, Laboratoire des Signaux et Systèmes, Gif-sur-Yvette, France, and Montreal Institute for Learning Algorithms (Mila), Université de Montréal, QC, Canada), and Günther Koliander (Acoustics Research Institute, Austrian Academy of Sciences, Vienna, Austria)
Abstract
The estimation of information measures of continuous distributions based on samples is a fundamental problem in statistics and machine learning. In this paper, we analyze estimates of differential entropy in K-dimensional Euclidean space, computed from a finite number of samples, when the probability density function belongs to a predetermined convex family P. First, estimating differential entropy to any accuracy is shown to be infeasible if the differential entropy of densities in P is unbounded, clearly showing the necessity of additional assumptions. Subsequently, we investigate sufficient conditions that enable confidence bounds for the estimation of differential entropy. In particular, we provide confidence bounds for simple histogram-based estimation of differential entropy from a fixed number of samples, assuming that the probability density function is Lipschitz continuous with known Lipschitz constant and known, bounded support. Our focus is on differential entropy, but we provide examples that show that similar results hold for mutual information and relative entropy as well.

1 Introduction

Many learning tasks, especially in unsupervised/semi-supervised settings, use information theoretic quantities, such as relative entropy, mutual information, differential entropy, or other divergence functionals as target functions in numerical optimization problems [19, 20, 25, 38, 16, 8, 5]. Furthermore, estimators for information theoretic quantities are useful in other fields, such as neuroscience [26]. As these quantities typically cannot be computed directly, surrogate functions, either upper/lower bounds or estimates, are used in their place. Here, we will investigate the problem of estimating differential entropy using a finite number of samples. Throughout, we will restrict our attention to differential entropy, but similar results also hold for conditional differential entropy, mutual information, and relative entropy (cf. Section 4).
The contributions of this work can be summarized as follows:

• First, we explore the following basic but fundamental question: Fixing C, δ > 0 and given N ∈ ℕ samples from a probability density function (pdf) p ∈ P, where P is a family of pdfs on ℝ^K, is it possible to obtain an estimate ĥ of the differential entropy h(p) < ∞ satisfying P{|ĥ − h(p)| > C} ≤ δ? In Section 2, we show that the answer to this question is negative (Proposition 2) if P is convex and the differential entropy of the pdfs in P is unbounded.

• Subsequently, we investigate sufficient conditions for the class P that enable estimation of differential entropy with such a confidence bound, and in Section 3 (Theorem 3) we show that a known, bounded support together with an L-Lipschitz continuous pdf for fixed L > 0 suffices. These assumptions ensure that the differential entropy of the pdfs in P is bounded: a known, bounded support bounds the differential entropy from above, and L-Lipschitz continuity bounds it from below. For a simple histogram-based estimator we explicitly compute a relation between the probability of correct estimation, accuracy, dimension K, sample size N, and Lipschitz constant L. It is shown that estimation becomes impossible if either assumption is removed.

• Finally, in Section 4 we obtain impossibility results, similar to Proposition 2, for the estimation of other information measures.

Related Work

The problem of estimating information measures from a finite number of samples is as old as information theory itself. Shortly after his seminal paper [31], Shannon worked on estimating the entropy rate of English text [32]. There have been numerous works on the estimation of information measures, such as entropy, mutual information, and differential entropy, since. There are many different approaches for estimating information measures, including kernel based methods, nearest neighbor methods, methods based on sample distances, as well as multiple variants of plug-in estimates. Many estimators have been shown to be consistent and/or asymptotically unbiased under various constraints, e.g., in [17, 1, 12, 21, 36]. An excellent overview can be found in [6].

In [36], rate-of-convergence results as well as a central limit theorem are provided for differential entropy and Rényi entropy. In [35], finite-sample guarantees are provided for pdfs supported on [0,1]^K, but instead of Lipschitz continuity, β-Hölder continuity, β ∈ (0,1], is assumed. Additionally, strict positivity on the interior of the support is required and the constants bounding the approximation error depend on the underlying, unknown distribution. These additional complications are likely due to the extended scope, as [35] is not focused on differential entropy, but on the expectation of arbitrary functionals of the probability density. The same authors also provide finite sample analysis for the estimation of Rényi divergence under similarly strong conditions in [34].

There are several negative results, which clearly show that information measures are hard to estimate from a finite number of samples. It was shown in [3, Th. 4] that rate-of-convergence results cannot be obtained for any consistent estimator of entropy on a countable alphabet, and only when imposing various assumptions on the true distribution were rate-of-convergence results obtained. More negative results on the estimation of entropy and mutual information can be found in [28]. In fact, obtaining confidence bounds for information measures from samples is inherently difficult and requires regularity assumptions about the involved distributions, which are not subject to empirical test.
In the seminal work of [4] as well as subsequent works [13, 10, 29, 11] (and references therein), such necessary conditions for the estimation of statistical parameters with confidence bounds are discussed in great detail and generality. The results of [10, 29] can be applied to differential entropy estimation and yield a result very similar to Proposition 2, essentially showing that differential entropy cannot be bounded using a finite number of samples, unless additional assumptions on the distribution are made.

Especially in the context of unsupervised and semi-supervised machine learning, it recently became popular to use variational bounds or estimates of information measures as part of the loss function for training neural networks [7, 30, 19]. Criticism of this approach, in particular the use of variational bounds, has been voiced [24]. The current paper has a more general scope, dealing with the estimation problem of information measures in general, not limited to specific variational bounds or techniques.

The information flow in neural networks is also a recent topic of investigation. In [33], an argument for successive compression in the layers of a deep neural network is given, along the lines of the information bottleneck method [37]. While flaws in this argument were pointed out [2, 22], the authors of [15] found that a clustering phenomenon might elucidate the behavior of deep neural networks. These insights were obtained by estimating the differential entropy h(X + G) of a sum of two random vectors, where X is sub-Gaussian and G ∼ N(0, σ²I) is an independent Gaussian vector. This is similar in spirit to the work conducted here; however, our assumption of compact support is replaced by assuming X to be sub-Gaussian (similar to our assumption of an arbitrary but fixed compact support in Section 3, the constant K in the definition of the sub-Gaussian X is assumed to be fixed in [14, eq. (1)]). Note that the pdf of X + G is Lipschitz continuous with a fixed Lipschitz constant L(σ), so [15] is implicitly also using a Lipschitz assumption.

2 A Negative Result

Let P be a family of pdfs on 𝒳 := ℝ^K with finite differential entropy, i.e., h(p) := −∫ p(x) log p(x) dx ∈ ℝ for every p ∈ P. Suppose we observe N i.i.d. copies D := (X_1, X_2, ..., X_N) of some random vector X ∼ p ∈ P and want to obtain an estimate of differential entropy from these samples D. Such an estimator is a function ĥ : 𝒳^N → ℝ that maps D to ĥ(D), approximating the differential entropy h := h(X) := h(p) < ∞. Its accuracy can be measured by a confidence interval, a widely used tool in statistical practice for indicating the precision of point estimators. For a given error probability δ > 0, we would like to have C > 0 such that |h − ĥ(D)| ≥ C with probability less than δ, i.e., a confidence interval of size C with confidence 1 − δ. However, there is no free lunch when estimating differential entropy, as evidenced by the following result, a corollary of a more general result in [10], here specialized to a bound of differential entropy. It is based on the abstract notion of a dense graph condition (DGC): h satisfies the DGC over P if the graph of h over P is dense in its own epigraph [10, eq. (2.4)].

Theorem 1 ([10, Th. 2.1]). Assume that h satisfies the DGC over P and define B := sup{h(p) : p ∈ P}, where possibly B = +∞. If for any C > 0, sup_{p∈P} P_p{ĥ(D) + C ≤ B} = 1, then inf_{p∈P} P_p{h(p) − ĥ(D) ≤ C} = 0.

A similar result follows from [29, Prop. 3.1]. We will not work with the DGC, but make two practical assumptions: P is a convex family and the differential entropy of the pdfs in P is unbounded (either from above or from below). Under these assumptions we show that for any δ, C > 0 there is a pdf p ∈ P such that |h − ĥ(D)| ≤ C with probability at most δ, i.e., ĥ(D) is far from h with high probability. Fundamentally, this follows from the fact that P contains pdfs with a large difference in differential entropy, which cannot be accurately distinguished based on samples. Similar results hold true for mutual information and relative entropy and are given in Section 4.

Proposition 2.
Let P be a convex family of pdfs with unbounded differential entropy, i.e., for any α ∈ [0, 1] and p, q ∈ P we have αp + (1 − α)q ∈ P, as well as sup_{q∈P} |h(q)| = ∞. Then, for any pair of constants C, δ > 0, there exists a continuous random vector X ∼ p ∈ P satisfying

P_p{|h(p) − ĥ(D)| ≤ C} ≤ δ.

Remark. Before proceeding with the proof of Proposition 2, we note that this result could be proved as a consequence of Theorem 1. However, this would necessitate showing that our conditions imply the DGC. Furthermore, the proof of [10, Th. 2.1] itself hinges on deep statistical results and thus we opted for providing a short, self-contained proof.
Proof of Proposition 2.
The function ĥ, the constants C, δ > 0, and the sample size N ∈ ℕ are arbitrary, but fixed. Choose an arbitrary X_0 ∼ p_0 ∈ P and let h_0 := h(p_0) < ∞. Then fix b > 0 such that P{|ĥ(X_{0,1}, X_{0,2}, ..., X_{0,N}) − h_0| ≤ b} ≥ 1 − δ/2, where (X_{0,n})_{n=1,...,N} are i.i.d. copies of X_0. Furthermore, let Q ∼ B(1 − ε) be a Bernoulli random variable with parameter 1 − ε = P{Q = 1}, independent of X_0, where 0 < ε ≤ δ/(2N). Choose a > 0 such that aε > b + C + log 2. By our assumption sup_{q∈P} |h(q)| = ∞, we can find X̃ ∼ p̃ ∈ P with h̃ := h(p̃) such that |h̃ − h_0| ≥ a.

Define X := Q X_0 + (1 − Q) X̃, which yields h = h(X) = I(X; Q) + h(X | Q) = I(X; Q) + (1 − ε) h_0 + ε h̃ ∈ (1 − ε) h_0 + ε h̃ + [0, log 2], where I(·;·) denotes mutual information. For convenience we use Ĥ := ĥ(D), where D = (X_1, X_2, ..., X_N) are N i.i.d. copies of X, the n-th copy being generated from (Q_n, X_{0,n}, X̃_n). Define the event E := {Q_1 = Q_2 = ··· = Q_N = 1} = {D = (X_{0,1}, X_{0,2}, ..., X_{0,N})}. By the union bound, we have P{E} ≥ 1 − Nε and obtain

P{|Ĥ − h_0| ≤ b} = P{E} P{|Ĥ − h_0| ≤ b | E} + P{E^c} P{|Ĥ − h_0| ≤ b | E^c}
  ≥ (1 − Nε) P{|Ĥ − h_0| ≤ b | E}
  = (1 − Nε) P{|ĥ(X_{0,1}, X_{0,2}, ..., X_{0,N}) − h_0| ≤ b}
  ≥ (1 − δ/2)² ≥ 1 − δ.
We thus found X such that

P{|h − Ĥ| ≤ C} ≤ P{|h_0 − Ĥ| ≥ |h − h_0| − C}
  ≤ P{|h_0 − Ĥ| ≥ ε|h_0 − h̃| − log 2 − C}
  ≤ P{|h_0 − Ĥ| ≥ εa − log 2 − C}
  ≤ P{|h_0 − Ĥ| > b}
  = 1 − P{|h_0 − Ĥ| ≤ b} ≤ δ.

Remark. Proposition 2 shows that in order to obtain confidence bounds, one needs to make assumptions about the underlying distribution. However, as pointed out in [10, p. 1395], when making these assumptions, one uses information external to the samples.
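To make the construction concrete, the following minimal numerical sketch (our own illustration, not part of the proof) instantiates it with p_0 = U([0,1]), so h(p_0) = 0, and p̃ = U([0, e^{−a}]), so h(p̃) = −a. It reports how likely the N samples are to be indistinguishable from samples of p_0 and how far the differential entropy of the mixture is shifted; all function and variable names are hypothetical.

```python
import math

# Sketch of the mixture construction in the proof of Proposition 2, assuming
# p0 = U([0,1]) (so h(p0) = 0) and p~ = U([0, exp(-a)]) (so h(p~) = -a).
# With contamination weight eps <= delta/(2N), all N samples come from p0 with
# probability at least 1 - N*eps, yet h of the mixture is shifted by about eps*a.

def mixture_example(N, delta, C, b):
    eps = delta / (2 * N)              # contamination probability
    a = (b + C + math.log(2)) / eps    # so that eps*a equals b + C + log 2 (any larger a also works)
    p_from_p0 = (1 - eps) ** N         # prob. that the sample "looks like" p0
    # h(mixture) lies in (1-eps)*h(p0) + eps*h(p~) + [0, log 2], see the proof
    h_lo = -eps * a
    return eps, a, p_from_p0, h_lo, h_lo + math.log(2)

eps, a, p_same, h_lo, h_hi = mixture_example(N=1000, delta=0.05, C=1.0, b=1.0)
print(f"eps = {eps:.2e}, a = {a:.1f}")
print(f"P(all N samples come from p0) >= {p_same:.4f}")
print(f"h(mixture) in [{h_lo:.2f}, {h_hi:.2f}] nats, while h(p0) = 0")
```

With these (arbitrary) parameters, the samples coincide with a sample from p_0 with probability above 0.97, while the differential entropy of the mixture is shifted by more than b + C.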
Remark. Note that the family of all pdfs with support [0,1]^K satisfies the requirements of Proposition 2. It also satisfies the DGC, but it is not strongly nonparametric, as defined in [10, p. 1395].

3 Estimation under Lipschitz and Support Assumptions

One way to avoid the problems outlined in Section 2 is to impose additional assumptions on the underlying probability distribution that bound the differential entropy from above and from below. We will showcase that the differential entropy of an L-Lipschitz continuous pdf with fixed, known L > 0 on ℝ^K and known, compact support 𝒳 can be well approximated from samples. In the following, let X ∼ p be supported on 𝒳 := [0,1]^K, i.e., ∫_𝒳 p dλ_K = P{X ∈ 𝒳} = 1, where λ_K denotes the Lebesgue measure on ℝ^K. Any known compact support suffices; an affine transformation then yields 𝒳 = [0,1]^K, while possibly resulting in a different Lipschitz constant. The pdf p : ℝ^K → ℝ_+ of X is assumed to be L-Lipschitz continuous on ℝ^K with some fixed L > 0, where ℝ^K is equipped with the ℓ_1-norm ‖x‖ := ‖x‖_1 = Σ_k |x_k|, hence,

∀ x, y ∈ ℝ^K : |p(x) − p(y)| ≤ L ‖x − y‖.

The ℓ_1-norm is only chosen to facilitate subsequent computations; by the equivalence of norms on ℝ^K, any norm suffices.

Given N i.i.d. copies D = (X_1, X_2, ..., X_N) of X, let Y be distributed according to the empirical distribution of D, i.e., Y = X_U, where U ∼ U({1, 2, ..., N}) is a uniform random variable on {1, 2, ..., N}. Let the discrete random vector Ỹ = Δ_M(Y) be the element-wise quantization of Y, where Δ_M(x) := ⌊Mx⌋/M is the M-step discretization of [0,1] for some M ∈ ℕ. Additionally, define the continuous random vector Ȳ = Ỹ + U([0, 1/M]^K), i.e., independent uniform noise is added to the quantized samples. It is straightforward to see that h(Ȳ) − H(Ỹ) = −K log M, where H(·) denotes Shannon entropy. We will estimate differential entropy by

ĥ(D) = h(Ȳ) = H(Ỹ) − K log M,

i.e., the Shannon entropy of the discretized and binned samples with a correction factor. In the following, we shall also use the two constants

η(K, L) := (1/K) ((K + 1)! / (2^K L))^{1/(K+1)},  and  α := (√(e² + 4) − e)/(2e) ≈ 0.12.
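The estimator ĥ(D) = H(Ỹ) − K log M can be computed directly from the quantized samples. The following minimal sketch (function and variable names are ours, and it is not an optimized implementation) illustrates one way to do so.

```python
import numpy as np

# A sketch of the simple histogram-based estimator analyzed in Theorem 3:
# quantize each coordinate to a grid of step 1/M, take the Shannon entropy of
# the empirical distribution of the quantized samples, and subtract K*log(M).

def histogram_entropy_estimate(samples, M):
    """samples: (N, K) array with entries in [0, 1); M: bins per dimension."""
    N, K = samples.shape
    bins = np.floor(M * samples).astype(int)           # element-wise Delta_M
    _, counts = np.unique(bins, axis=0, return_counts=True)
    p_hat = counts / N                                  # empirical pmf of the bins
    H_hat = -np.sum(p_hat * np.log(p_hat))              # Shannon entropy (nats)
    return H_hat - K * np.log(M)                        # h_hat(D) = H(Y~) - K log M

# Example: N = 10_000 samples from the uniform pdf on [0,1]^2, for which h = 0.
rng = np.random.default_rng(0)
print(histogram_entropy_estimate(rng.random((10_000, 2)), M=16))
```

On this toy example the estimate is close to the true value h = 0, with a small negative bias coming from the empirical estimation of the bin probabilities.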
Theorem 3. For M ≥ 1/(2α η(K, L)) and any δ ∈ (0, 1), we have with probability greater than 1 − δ that

|ĥ(D) − h(X)| ≤ (LK/(2M)) log(2M η(K, L))        (1a)
  + √((2/N) log(2/δ)) log N                       (1b)
  + log(1 + (M^K − 1)/N).                         (1c)

The proof will be given in Appendix A.

Remark. Of the three error terms (1a)–(1c), the terms (1a) and (1c) constitute the bias and (1b) is a variance-like error term. While the variance (1b) vanishes as N → ∞, the term (1a) does not depend on the sample size N, as it merely measures the error incurred due to the quantization Δ_M, which is bounded by the Lipschitz constraint and approaches zero as M → ∞. The final term (1c) results from ensuring that N samples suffice to suitably approximate the empirical distribution over the M^K quantization bins. Thus, it ties the quantization to the sample size and approaches zero if M^K/N → 0. In total, the right-hand side of (1) approaches zero for N, M → ∞ provided that M^K/N → 0.

Remark. Theorem 3 should be regarded as a proof-of-concept rather than a practical tool for performing differential entropy estimation. While analytically tractable, the estimation strategy is crude and the bounds are loose; in particular, the term (1b), while being a completely universal bound, is known to be loose, as pointed out in [28, p. 1200].
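For concreteness, the following short sketch evaluates the three terms on the right-hand side of (1) numerically for given K, L, N, M, and δ (nats throughout). All names are illustrative, and the constants follow the statement of Theorem 3 as reconstructed above, so the snippet should be read as a rough guide rather than an authoritative calculator.

```python
import math

# Evaluate the confidence bound (1) of Theorem 3 for given parameters (a sketch).

def eta(K, L):
    return (math.factorial(K + 1) / (2**K * L)) ** (1.0 / (K + 1)) / K

def theorem3_bound(K, L, N, M, delta):
    alpha = (math.sqrt(math.e**2 + 4) - math.e) / (2 * math.e)
    assert M >= 1 / (2 * alpha * eta(K, L)), "quantization too coarse for the bound"
    term_a = L * K / (2 * M) * math.log(2 * M * eta(K, L))          # (1a) quantization bias
    term_b = math.sqrt(2 / N * math.log(2 / delta)) * math.log(N)   # (1b) deviation term
    term_c = math.log(1 + (M**K - 1) / N)                           # (1c) empirical-pmf bias
    return term_a + term_b + term_c

# Example: K = 1, L = 1, N = 10**6 samples, M = 100 bins, 95% confidence.
print(theorem3_bound(K=1, L=1, N=10**6, M=100, delta=0.05))
```

Even for a one-dimensional density with a million samples, the guaranteed accuracy is only on the order of a few hundredths of a nat, which illustrates the remark above about the looseness of the bound.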
Remark. We want to note that requiring both a fixed Lipschitz constant L and a known bounded support, e.g., 𝒳 = [0,1]^K, is necessary. Consider for instance the set P′ = {p : p supported on [0,1]^K and Lipschitz continuous} of pdfs with arbitrary Lipschitz constant, or the set P″ = {p : p supported on a bounded set and L-Lipschitz continuous} with fixed Lipschitz constant, but arbitrary, bounded support. Both families satisfy the conditions of Proposition 2, i.e., they are convex and sup_{p∈P′} |h(p)| = sup_{p∈P″} |h(p)| = ∞.

In principle, Theorem 3 also allows for the approximation of mutual information with a confidence bound. Let (X, Y) ∼ p_XY be two random vectors, supported on [0,1]^{K_1} and [0,1]^{K_2}, respectively. Assuming that p_XY is L-Lipschitz continuous on ℝ^{K_1+K_2}, it is clear that the marginals p_X and p_Y are L-Lipschitz continuous as well. Thus, Theorem 3 can be used to approximate all three terms in I(X; Y) = h(X) + h(Y) − h((X, Y)), as sketched below.
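The following is a minimal sketch of this plug-in approach. The helper function and the toy joint distribution are ours and chosen only for illustration; in particular, the toy joint density is not Lipschitz continuous, so Theorem 3 does not formally apply to it.

```python
import numpy as np

# Plug-in estimate of I(X;Y) = h(X) + h(Y) - h((X,Y)) using the same
# histogram estimator three times (names and the toy example are hypothetical).

def hist_h(samples, M):
    N, K = samples.shape
    _, counts = np.unique(np.floor(M * samples).astype(int), axis=0, return_counts=True)
    p = counts / N
    return -(p * np.log(p)).sum() - K * np.log(M)

rng = np.random.default_rng(0)
x = rng.random((100_000, 1))
y = (x + 0.1 * rng.random((100_000, 1))) % 1.0      # Y depends on X through a narrow band
mi_hat = hist_h(x, 32) + hist_h(y, 32) - hist_h(np.hstack([x, y]), 32)
print(f"estimated I(X;Y) = {mi_hat:.3f} nats")      # true value is log(10) = 2.303 nats
```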
4 Other Information Measures

In this section, we showcase that similar statements as Proposition 2 also hold for mutual information and relative entropy. For simplicity we will not assume p_XY ∈ P for some family P of probability density functions, but merely require I(X; Y), D(X‖Y) < ∞. Only proof sketches are given, as the examples in this section are similar to the proof of Proposition 2. Here we show that, in general, it is not possible to accurately estimate mutual information I(X; Y) and relative entropy D(X‖Y) from samples (X, Y).

Mutual information. For any N, let î : ℝ^N × ℝ^N → ℝ be a measurable function that represents an estimate of the mutual information I := I(X; Y) < ∞ from the sample vectors X := (X_1, ..., X_N) and Y := (Y_1, ..., Y_N), consisting of N i.i.d. realizations of (X, Y). For convenience we use Î := î(X, Y). Let X, Z ∼ U([0,1]), W ∼ U([0, e^{−a}]), and Q ∼ B(1 − ε) be independent random variables. Define Y := QZ + (1 − Q)(X + W). Writing H_2(·) for the binary entropy function, we have

I(X; Y) = h(Y) − h(Y | X) ≥ h(Y | Q) − H_2(ε) − h(Y | X, Q) = ε h(X + W) − H_2(ε) + aε ≥ aε − H_2(ε).

Let Z := (Z_1, ..., Z_N) denote the corresponding N i.i.d. realizations of Z. For any δ > 0, we can find b ∈ ℝ such that P{î(X, Z) ≤ b} ≥ 1 − δ/2. Letting E = {Y = Z} = {Q_1 = ··· = Q_N = 1}, we have P{E} ≥ 1 − Nε. Thus, when choosing ε ≤ δ/(2N),

P{Î ≤ b} = P{E} P{Î ≤ b | E} + P{E^c} P{Î ≤ b | E^c}
  ≥ (1 − Nε) P{Î ≤ b | E}
  ≥ (1 − Nε)(1 − δ/2) ≥ 1 − δ.

We may choose a ≥ (b + C + H_2(ε))/ε, so that I ≥ aε − H_2(ε) ≥ b + C. Then, for arbitrary C, δ > 0 and N ∈ ℕ, we have found X, Y and b ∈ ℝ such that P{Î > b} ≤ δ, yet I ≥ b + C.

Remark. Note that [7, Th. 3] claims a confidence bound for mutual information that, together with the approximation result [7, Lem. 1], seemingly contradicts our result. However, the confidence bound proved in [7, Th. 3] requires strong conditions on the functions T_θ (here we use the notation of [7]), and [7, Lem. 1] does not necessarily hold under these conditions. Moreover, both approximation results [7, Lem. 1 and Lem. 2] do not hold uniformly for a family of distributions, but implicitly assume a fixed, underlying distribution. This is especially evident in [7, Lem. 2], which also seemingly contradicts our result when assuming that the optimal function T* is in the family T_Θ. However, this apparent contradiction is resolved by noting that the chosen N ∈ ℕ depends on the underlying, true distribution.

In the following we show that a similar result holds if Y has a fixed finite alphabet, say 𝒴 = {1, 2, ..., K}. Again, for every N ∈ ℕ, let î_N : ℝ^N × 𝒴^N → ℝ be an estimator that estimates I := I(X; Y) ≤ log K from X, Y. Note that the result for continuous Y cannot carry over unchanged, as we have I ∈ [0, log K]. We shall assume that î_N is consistent in the sense that Î_N → I in probability as N → ∞, where we use Î_N := î_N(X, Y).

Let X ∼ U([0,1]) and W ∼ U(𝒴) be independent random variables. Fix δ > 0 and, by consistency, find N_0 such that P{î_N(X, W) ≥ δ} ≤ δ/2 for all N ≥ N_0, where W := (W_1, ..., W_N) collects i.i.d. copies of W. In the following, consider N ≥ N_0 fixed. Fix M ∈ ℕ and v ∈ 𝒴^M, and define the quantization Z := 1 + ⌊MX⌋. The random variable Y is simply Y = v_Z. We use the notation Î^v_N to highlight that Î_N depends on the particular choice of v, and wish to show that P{Î^v_N ≥ δ} ≤ δ for at least one v.

Assume to the contrary that P{Î^v_N ≥ δ} > δ for all v ∈ 𝒴^M. Let E = {∃ i, j ∈ {1, 2, ..., N} : i ≠ j, Z_i = Z_j} be the event that two elements of X fall in the same "bin." Note that Z is the quantization of X. For M large enough, we obtain

P{E} = 1 − M! / (M^N (M − N)!) ≤ 1 − ((M − N + 1)/M)^N ≤ ε.

Defining V ∼ U(𝒴^M), independent of X, we obtain for ε small enough

δ < K^{−M} Σ_{v∈𝒴^M} P{Î^v_N ≥ δ}
  ≤ ε + K^{−M} Σ_{v∈𝒴^M} P{Î^v_N ≥ δ | E^c}
  = ε + Σ_{v∈𝒴^M} P{Î^V_N ≥ δ | E^c, V = v} P{V = v}
  = ε + P{Î^V_N ≥ δ | E^c}
  = ε + P{î_N(X, V_Z) ≥ δ | E^c}
  = ε + P{î_N(X, W) ≥ δ | E^c}
  ≤ ε + δ/(2(1 − ε)) ≤ δ,

leading to a contradiction. To summarize, for arbitrary δ and N large enough, there exists v such that P{Î^v_N ≥ δ} ≤ δ, but clearly Y is a deterministic function of X and hence I = log K.

Relative entropy. Let p and q be two continuous pdfs (w.r.t. λ_1) and X, Y be N i.i.d. random variables distributed according to p and q, respectively. For any N, let d̂_N : ℝ^N × ℝ^N → ℝ be an estimator that estimates D := D(p‖q) < ∞ from X, Y. For convenience we use D̂_N := d̂_N(X, Y). Let Z_1, Z_2 be two independent i.i.d. N-vectors with components uniformly distributed on [−1, 0]. For an arbitrary δ > 0 we can find c ∈ ℝ such that P{d̂_N(Z_1, Z_2) ≤ c} ≥ 1 − δ/2.
Consider C, δ > 0 and an arbitrary N ∈ ℕ. Define the pdfs

p(x) = e^{−a} 1_{[0,1]}(x) + (1 − e^{−a}) 1_{[−1,0)}(x),  and
q(x) = e^{−a−b} 1_{[0,1]}(x) + (1 − e^{−a−b}) 1_{[−1,0)}(x)

for a, b ∈ ℝ_+. With b = k e^a, where k ∈ ℝ_+, we have D = D̄(a, k) with the function

D̄(a, k) = e^{−a} b + (1 − e^{−a}) log((1 − e^{−a})/(1 − e^{−a−b})) = k + (1 − e^{−a}) log((1 − e^{−a})/(1 − e^{−a−ke^a})) ≥ k − e^{−1}.

Let E_1 = {X < 0} and E_2 = {Y < 0} be the events that every component of X and Y, respectively, is negative. Then, P{E_1^c ∪ E_2^c} ≤ 2N e^{−a}, i.e., P{E_1 ∩ E_2} ≥ 1 − 2N e^{−a}. Choose a ≥ log(4N/δ) and k ≥ c + C + e^{−1}, such that 2N e^{−a} ≤ δ/2 and D ≥ c + C. We can now bound the probability

P{D̂_N ≤ c} = P{D̂_N ≤ c | E_1 ∩ E_2} P{E_1 ∩ E_2} + P{D̂_N ≤ c | E_1^c ∪ E_2^c} P{E_1^c ∪ E_2^c}
  ≥ P{D̂_N ≤ c | E_1 ∩ E_2} (1 − δ/2)
  = P{d̂_N(Z_1, Z_2) ≤ c} (1 − δ/2)
  ≥ (1 − δ/2)² ≥ 1 − δ.

In summary, for an estimator d̂_N and any δ, C > 0 and N ∈ ℕ, we can find distributions p, q and c ∈ ℝ such that P{D̂_N > c} ≤ δ, even though D ≥ c + C.

5 Conclusion

We showed that under mild assumptions on the family of allowed distributions, differential entropy cannot be reliably estimated solely based on samples, no matter how many samples are available. In particular, as first noted in [10], no non-trivial bound or estimate of an information measure can be obtained based only on samples; external information about the regularity of the underlying probability distribution needs to be taken into account. However, such regularity assumptions are not subject to empirical verification, and thus the existence of statistical guarantees for an empirical estimate cannot be empirically tested. This shows that researchers should take great care when approximating or bounding information measures, and specifically explore the necessary assumptions for the underlying distribution.

Regarding the use of information measures in machine learning, we note that our results apply to all estimators of information measures. In particular, empirical versions of variational bounds cannot provide estimates of information measures with high reliability in general.

It would be interesting to investigate the type of assumptions on the underlying distributions that may hold in typical machine learning setups. However, as pointed out previously, these properties cannot be deduced from data, but must result from the model under consideration. On a related note, it might be interesting to see whether the confidence bounds for differential entropy estimation under bounded support and a Lipschitz condition from Section 3 carry over to empirical versions of variational bounds. Extensions of these results to other information measures, e.g., Rényi entropy, Rényi divergences, or f-divergences, could also be of particular interest for future work.

Acknowledgments

The authors are very grateful to Prof. Elisabeth Gassiat for pointing out the connection between the present work and the reference [10].
Appendix A: Proof of Theorem 3
We shall first introduce auxiliary random variables, which are depicted in Figure 1. Let X̃ be the element-wise discretization X̃ = Δ_M(X). The continuous random vector X̄ = X̃ + U([0, 1/M]^K) is obtained by adding independent uniform random noise. Let q be the pdf of X̄. It is straightforward to see that Ỹ is distributed according to the empirical distribution of N i.i.d. copies of X̃. Also note that H(X̃) = h(X̄) + K log M.

[Figure 1: Connections between all involved random variables. X yields X̃ and X̄ via Δ_M and added uniform noise; the N i.i.d. copies D of X yield Y via the empirical distribution, and Ỹ and Ȳ are obtained from Y in the same way.]

In order to prove Theorem 3, we use the triangle inequality twice to obtain

|ĥ(D) − h(X)| = |h(Ȳ) − h(X)|
  ≤ |h(Ȳ) − h(X̄)| + |h(X̄) − h(X)|
  = |H(Ỹ) − H(X̃)| + |h(X̄) − h(X)|
  ≤ |H(Ỹ) − E[H(Ỹ)]| + |E[H(Ỹ)] − H(X̃)| + |h(X̄) − h(X)|,   (2)

noting that h(Ȳ) − H(Ỹ) = h(X̄) − H(X̃) = −K log M. Note that H(Ỹ) is a random quantity that depends on D. We thus split the bound into three terms, where the first term in (2) is variance-like and the second and third terms constitute the bias. In the remainder of this appendix, we will complete the proof by showing that all three terms in (2) can be bounded as follows. First, we have

|E[H(Ỹ)] − H(X̃)| ≤ log(1 + (M^K − 1)/N),   (3)

and

|h(X̄) − h(X)| ≤ (LK/(2M)) log(2M η(K, L)).   (4)

And with probability greater than 1 − δ we also have

|H(Ỹ) − E[H(Ỹ)]| ≤ √((2/N) log(2/δ)) log N.   (5)

As Ỹ is distributed according to the empirical distribution of N i.i.d. copies of X̃, on an alphabet of size M^K, the inequalities (3) and (5) follow directly from the following well-known lemma, concerning the estimation of (discrete) Shannon entropy.

Lemma 4 ([28, eq. (3.4) and Prop. 1] and [3, Remark iii, p. 168]). Let Z be a random variable on {1, 2, ..., M} and Ẑ distributed according to the empirical measure of N i.i.d. copies of Z. We then have

|H(Ẑ) − H(Z)| ≤ |H(Z) − E[H(Ẑ)]| + |H(Ẑ) − E[H(Ẑ)]|,

where

|H(Z) − E[H(Ẑ)]| ≤ log(1 + (M − 1)/N),

and for any δ ∈ (0, 1), with probability greater than 1 − δ,

|H(Ẑ) − E[H(Ẑ)]| ≤ √((2/N) log(2/δ)) log N.
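As a sanity check, the following small Monte Carlo sketch (parameters and names are ours) draws a uniform Z on an alphabet of size m and verifies empirically that the plug-in entropy error stays within the combined bias and deviation bounds of Lemma 4.

```python
import numpy as np

# Monte Carlo check of the two plug-in Shannon entropy bounds in Lemma 4
# for a uniform distribution on m symbols (illustrative parameter choices).

rng = np.random.default_rng(1)
m, N, delta, trials = 64, 2_000, 0.05, 1_000
H_true = np.log(m)                                   # entropy of the uniform pmf
errs = np.empty(trials)
for t in range(trials):
    counts = np.bincount(rng.integers(0, m, size=N), minlength=m)
    p_hat = counts[counts > 0] / N
    errs[t] = abs(-(p_hat * np.log(p_hat)).sum() - H_true)
bound = np.log(1 + (m - 1) / N) + np.sqrt(2 / N * np.log(2 / delta)) * np.log(N)
print(f"fraction of trials within the bound: {(errs <= bound).mean():.3f} (expect >= {1 - delta})")
```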
In order to show (4), we will first obtain some preliminary results and then conclude the proof in Lemma 10. We start by bounding the difference between p and q using the following auxiliary results.

Lemma 5. Let f : [0, A] → ℝ be an arbitrary L-Lipschitz continuous function and assume f(y) = 0 for some y ∈ [0, A], where [0, A] := {x ∈ ℝ^K : 0 ≤ x_k ≤ A_k, k ∈ {1, 2, ..., K}}. Then

∫_{[0,A]} |f(x)| λ_K(dx) ≤ (L/2) (∏_{k=1}^K A_k) (Σ_{k=1}^K A_k).

In particular, for A_k = ε, k ∈ {1, 2, ..., K}, we have ∫_{[0,ε]^K} |f(x)| λ_K(dx) ≤ ε^{K+1} LK/2.

Proof. For x ∈ [0, A] we have |f(x)| = |f(x) − f(y)| ≤ L ‖x − y‖ and hence

∫_{[0,A]} |f(x)| λ_K(dx) ≤ L ∫_{[0,A]} ‖x − y‖ λ_K(dx)
  = L Σ_{k=1}^K ∫_{[0,A]} |x_k − y_k| λ_K(dx)
  = L (∏_{k=1}^K A_k) Σ_{k=1}^K (y_k² + (A_k − y_k)²)/(2 A_k)
  ≤ (L/2) (∏_{k=1}^K A_k) (Σ_{k=1}^K A_k),

where the last inequality follows since y_k² + (A_k − y_k)² ≤ A_k² for y_k ∈ [0, A_k].

Lemma 6. For an L-Lipschitz continuous pdf p on 𝒳 and q the pdf of X̄, as defined above, we have |p(x) − q(x)| ≤ LK/(2M) for every x ∈ 𝒳.

Proof. Let x ∈ 𝒳 and x̃ = Δ_M(x). The function q is constant on Δ_M^{−1}(x̃) and given by q(x′) = λ_K(Δ_M^{−1}(x̃))^{−1} ∫_{Δ_M^{−1}(x̃)} p dλ_K for all x′ ∈ Δ_M^{−1}(x̃), where λ_K(Δ_M^{−1}(x̃)) = M^{−K}. Thus, since x ∈ Δ_M^{−1}(x̃), we obtain

M^{−K} |q(x) − p(x)| = |∫_{Δ_M^{−1}(x̃)} (p(x′) − p(x)) λ_K(dx′)| ≤ ∫_{Δ_M^{−1}(x̃)} |p(x′) − p(x)| λ_K(dx′) ≤ M^{−K−1} LK/2,   (6)

where we applied Lemma 5 to x′ ↦ p(x′) − p(x) in (6).

Lemma 7. If p is an L-Lipschitz continuous pdf on ℝ^K, then p ≤ (L^K (K+1)! / 2^K)^{1/(K+1)}.

Proof. Let x ∈ ℝ^K and define the ball B_x(p(x)/L) = {x′ ∈ ℝ^K : ‖x′ − x‖ ≤ p(x)/L} with radius p(x)/L, centered at x. We then have λ_K(B_x(p(x)/L)) = 2^K p(x)^K / (L^K K!) and hence

1 ≥ ∫_{B_x(p(x)/L)} p(x′) λ_K(dx′)
  = 2^K p(x)^{K+1} / (L^K K!) − ∫_{B_x(p(x)/L)} (p(x) − p(x′)) λ_K(dx′)
  ≥ 2^K p(x)^{K+1} / (L^K K!) − L ∫_{B_x(p(x)/L)} ‖x − x′‖ λ_K(dx′)   (7)
  = 2^K p(x)^{K+1} / (L^K (K+1)!),

where the fact that p(x) − p(x′) ≤ L ‖x − x′‖ for all x′ ∈ ℝ^K is used in (7).

Using the previous lemmas to bound the distance between p and q, the following two results will allow us to bound the difference of the differential entropies.

Lemma 8. For x ∈ [0, 1], y ≥ 0, and a := |x − y| ≤ α ≈ 0.12 we have

|x log x − y log y| ≤ −a log a.   (8)

Proof.
In the following, we assume w.l.o.g. that x ≤ y = x + a. If x ≤ y ≤ e^{−1}, then |x log x − (x + a) log(x + a)| = x log x − (x + a) log(x + a) is monotonically decreasing in x and thus maximal at x = 0, and hence (8) follows. If, on the other hand, y ≥ e^{−1}, then necessarily a ≤ α ≤ e^{−1} − α ≤ x ≤ y ≤ 1 + α. Define the function f(x) := −x log x for x > 0 and f(0) := 0. Note that by the mean value theorem there are a_0 ∈ (0, a) and x_0 ∈ (x, y) such that |f(0) − f(a)| = f(a) = a f′(a_0) and |f(x) − f(y)| = a |f′(x_0)|. Inequality (8) then follows by observing that |f′(x_0)| ≤ f′(a_0) whenever a_0 ∈ (0, α) and x_0 ∈ (e^{−1} − α, 1 + α).

Lemma 9.
Let p and q be two pdfs supported on 𝒳 with finite differential entropies. Assume that for all x ∈ 𝒳 we have |p(x) − q(x)| ≤ ε and 0 ≤ p(x) ≤ A, and that ε/A ≤ α holds. Then,

|h(p) − h(q)| ≤ ε log(A/ε).   (9)
Proof.
Define p′(x) := A^{−1} p(A^{−1/K} x) and q′(x) := A^{−1} q(A^{−1/K} x) for x ∈ [0, A^{1/K}]^K. We have |p′(x) − q′(x)| ≤ A^{−1} ε and 0 ≤ p′(x) ≤ 1, as well as

|h(p) − h(q)| = |h(p′) − h(q′)| = |∫ (p′ log p′ − q′ log q′) dλ_K|
  ≤ ∫ |p′ log p′ − q′ log q′| dλ_K
  ≤ −λ_K([0, A^{1/K}]^K) · A^{−1} ε log(A^{−1} ε)   (10)
  = ε log(A/ε),

where Lemma 8 was applied in (10).

We can now finish the proof of Theorem 3 by showing (4).
Lemma 10. If M ≥ 1/(2α η(K, L)), we have |h(X̄) − h(X)| ≤ (LK/(2M)) log(2M η(K, L)).
Proof. By Lemma 6, |p(x) − q(x)| ≤ LK/(2M) and, by Lemma 7, p ≤ (L^K (K+1)! / 2^K)^{1/(K+1)}. We can thus apply Lemma 9 with ε = LK/(2M) and A = (L^K (K+1)! / 2^K)^{1/(K+1)}, provided that ε/A ≤ α, which is equivalent to M ≥ 1/(2α η(K, L)). Inserting ε and A in (9) proves the result.

References

[1] I. Ahmad and P.-E. Lin. A nonparametric estimation of the entropy for absolutely continuous distributions. IEEE Trans. Inf. Theory, 22(3):372–375, May 1976.

[2] R. A. Amjad and B. C. Geiger. Learning representations for neural network-based classification using the information bottleneck principle. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, to appear.
[3] A. Antos and I. Kontoyiannis. Convergence properties of functional estimates for discrete distributions. Random Structures & Algorithms, 19(3-4):163–193, Nov. 2001.

[4] R. R. Bahadur and L. J. Savage. The nonexistence of certain statistical procedures in nonparametric problems. Ann. Math. Statist., 27(4):1115–1122, 1956.

[5] D. Barber and F. Agakov. The IM algorithm: A variational approach to information maximization. In NIPS'03, volume 16 of Advances in Neural Information Processing Systems, pages 201–208, Cambridge, MA, USA, Dec. 2003.

[6] J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. Van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17–39, 1997.

[7] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm. Mutual information neural estimation. In ICML'18, volume 80 of PMLR, pages 531–540, Stockholm, Sweden, July 2018.

[8] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS'16, volume 29 of Advances in Neural Information Processing Systems, pages 2172–2180, Barcelona, Spain, 2016.

[9] G. A. Darbellay and I. Vajda. Estimation of the information by an adaptive partitioning of the observation space. IEEE Trans. Inf. Theory, 45(4):1315–1321, May 1999.

[10] D. L. Donoho. One-sided inference about functionals of a density. Ann. Statist., 16(4):1390–1420, 1988.

[11] D. L. Donoho and R. C. Liu. Geometrizing rates of convergence, II. Ann. Statist., 19(2):633–667, 1991.

[12] W. Gao, S. Kannan, S. Oh, and P. Viswanath. Estimating mutual information for discrete-continuous mixtures. In NIPS'17, volume 30 of Advances in Neural Information Processing Systems, pages 5986–5997, Long Beach, CA, USA, 2017.

[13] L. J. Gleser and J. T. Hwang. The nonexistence of 100(1−α)% confidence sets of finite expected diameter in errors-in-variables and related models. Ann. Statist., pages 1351–1362, 1987.

[14] Z. Goldfeld, K. Greenewald, Y. Polyanskiy, and J. Weed. Convergence of smoothed empirical measures with applications to entropy estimation. arXiv preprint, 2019.

[15] Z. Goldfeld, E. Van Den Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy. Estimating information flow in deep neural networks. In ICML'19, volume 97 of PMLR, pages 2299–2308, Long Beach, CA, USA, 2019.

[16] S. Gordon, H. Greenspan, and J. Goldberger. Applying the information bottleneck principle to unsupervised clustering of discrete and continuous image representations. In Proc. Ninth IEEE Int. Conf. Comput. Vision, pages 370–377, Nice, France, Oct. 2003.

[17] L. Györfi and E. C. Van der Meulen. Density-free convergence properties of various estimators of entropy. Comput. Stat. Data Anal., 5(4):425–436, Sept. 1987.

[18] Y. Han, J. Jiao, T. Weissman, and Y. Wu. Optimal rates of entropy estimation over Lipschitz balls. arXiv preprint, 2019.

[19] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, New Orleans, LA, USA, 2019.

[20] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama. Learning discrete representations via information maximizing self-augmented training. In ICML'17, volume 70 of PMLR, pages 1558–1567, Sydney, Australia, 2017.

[21] K. Kandasamy, A. Krishnamurthy, B. Poczos, L. Wasserman, and J. M. Robins. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In NIPS'15, volume 28 of Advances in Neural Information Processing Systems, pages 397–405, Montréal, Canada, 2015.

[22] A. Kolchinsky, B. D. Tracey, and S. V. Kuyk. Caveats for information bottleneck in deterministic scenarios. In ICLR, New Orleans, LA, USA, 2019.

[23] H. Liu, L. Wasserman, and J. D. Lafferty. Exponential concentration for mutual information estimation with application to forests. In NIPS'12, volume 26 of Advances in Neural Information Processing Systems, pages 2537–2545, Lake Tahoe, NV, USA, 2012.

[24] D. McAllester and K. Stratos. Formal limitations on the measurement of mutual information. In ICLR, New Orleans, LA, USA, 2019.

[25] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, and S. Ishii. Distributional smoothing with virtual adversarial training. In ICLR, San Juan, Puerto Rico, 2016.

[26] I. Nemenman, W. Bialek, and R. D. R. Van Steveninck. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E, 69(5):056111, 2004.

[27] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory, 56(11):5847–5861, Nov. 2010.

[28] L. Paninski. Estimation of entropy and mutual information. Neural Comput., 15(6):1191–1253, 2003.

[29] J. Pfanzagl. The nonexistence of confidence sets for discontinuous functionals. J. Stat. Plan. Inference, 75(1):9–20, 1998.

[30] B. Poole, S. Ozair, A. van den Oord, A. A. Alemi, and G. Tucker. On variational bounds of mutual information. In ICML'19, volume 97 of PMLR, pages 5171–5180, Long Beach, CA, USA, 2019.

[31] C. E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27(3):379–423, July 1948.

[32] C. E. Shannon. Prediction and entropy of printed English. Bell Syst. Tech. J., 30(1):50–64, 1951.

[33] R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arXiv preprint, 2017.

[34] S. Singh and B. Poczos. Generalized exponential concentration inequality for Rényi divergence estimation. In ICML'14, volume 32 of PMLR, pages 333–341, Beijing, China, 2014.

[35] S. Singh and B. Poczos. Finite-sample analysis of fixed-k nearest neighbor density functional estimators. In NIPS'16, volume 29 of Advances in Neural Information Processing Systems, pages 1217–1225, Barcelona, Spain, 2016.

[36] K. Sricharan, R. Raich, and A. O. Hero. k-nearest neighbor estimation of entropies with confidence. In Proc. IEEE Int. Symp. Inf. Theory (ISIT 2011), pages 1205–1209, Saint Petersburg, Russia, July 2011.

[37] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Annu. Allerton Conf. Commun., Control, and Comput., pages 368–377, Monticello, IL, Sept. 1999.

[38] A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint, 2018.

[39] Q. Wang, S. R. Kulkarni, and S. Verdú. Divergence estimation of continuous distributions based on data-dependent partitions.