On the Estimation of Information Measures of Continuous Distributions
Georg Pichler (Institute of Telecommunications, TU Wien, Vienna, Austria), Pablo Piantanida (Université Paris-Saclay, CNRS, CentraleSupélec, Laboratoire des Signaux et Systèmes, Gif-sur-Yvette, France, and Montreal Institute for Learning Algorithms (Mila), Université de Montréal, QC, Canada), and Günther Koliander (Acoustics Research Institute, Austrian Academy of Sciences, Vienna, Austria)
Abstract
The estimation of information measures of continuous distributions based on samples is a fundamental problem in statistics and machine learning. In this paper, we analyze estimates of differential entropy in K-dimensional Euclidean space, computed from a finite number of samples, when the probability density function belongs to a predetermined convex family P. First, estimating differential entropy to any accuracy is shown to be infeasible if the differential entropy of densities in P is unbounded, clearly showing the necessity of additional assumptions. Subsequently, we investigate sufficient conditions that enable confidence bounds for the estimation of differential entropy. In particular, we provide confidence bounds for simple histogram-based estimation of differential entropy from a fixed number of samples, assuming that the probability density function is Lipschitz continuous with known Lipschitz constant and known, bounded support. Our focus is on differential entropy, but we provide examples that show that similar results hold for mutual information and relative entropy as well.

1 Introduction

Many learning tasks, especially in unsupervised/semi-supervised settings, use information theoretic quantities, such as relative entropy, mutual information, differential entropy, or other divergence functionals as target functions in numerical optimization problems [19, 20, 25, 38, 16, 8, 5]. Furthermore, estimators for information theoretic quantities are useful in other fields, such as neuroscience [26]. As these quantities typically cannot be computed directly, surrogate functions, either upper/lower bounds or estimates, are used in their place. Here, we will investigate the problem of estimating differential entropy using a finite number of samples. Throughout, we will restrict our attention to differential entropy, but similar results also hold for conditional differential entropy, mutual information, and relative entropy (cf. Section 4).
The contributions of this work can be summarized as follows:

• First, we explore the following basic but fundamental question: Fixing C, δ > 0 and given N ∈ ℕ samples from a probability density function (pdf) p ∈ P, where P is a family of pdfs on ℝ^K, is it possible to obtain an estimate ĥ of the differential entropy h(p) < ∞ satisfying P{|ĥ − h(p)| > C} ≤ δ? In Section 2, we show that the answer to this question is negative (Proposition 2) if P is convex and the differential entropy of the pdfs in P is unbounded.

• Subsequently, we investigate sufficient conditions for the class P that enable estimation of differential entropy with such a confidence bound, and in Section 3 (Theorem 3) we show that a known, bounded support together with an L-Lipschitz continuous pdf for fixed L > 0 suffices. These assumptions ensure that the differential entropy of the pdfs in P is bounded: a known, bounded support bounds the differential entropy from above, and L-Lipschitz continuity bounds it from below. For a simple histogram-based estimator we explicitly compute a relation between the probability of correct estimation, accuracy, dimension K, sample size N, and Lipschitz constant L. It is shown that estimation becomes impossible if either assumption is removed.

• Finally, in Section 4 we obtain impossibility results, similar to Proposition 2, for the estimation of other information measures.

Related Work

The problem of estimating information measures from a finite number of samples is as old as information theory itself. Shortly after his seminal paper [31], Shannon worked on estimating the entropy rate of English text [32]. There have been numerous works on the estimation of information measures, such as entropy, mutual information, and differential entropy, since. There are many different approaches for estimating information measures, including kernel based methods, nearest neighbor methods, methods based on sample distances, as well as multiple variants of plug-in estimates. Many estimators have been shown to be consistent and/or asymptotically unbiased under various constraints, e.g., in [17, 1, 12, 21, 36]. An excellent overview can be found in [6].

In [36], rate-of-convergence results as well as a central limit theorem are provided for differential entropy and Rényi entropy. In [35], finite-sample guarantees are provided for pdfs supported on [0,1]^K, but instead of Lipschitz continuity, β-Hölder continuity, β ∈ (0,1], is assumed. Additionally, strict positivity on the interior of the support is required and the constants bounding the approximation error depend on the underlying, unknown distribution. These additional complications are likely due to the extended scope, as [35] is not focused on differential entropy, but on the expectation of arbitrary functionals of the probability density. The same authors also provide finite sample analysis for the estimation of Rényi divergence under similarly strong conditions in [34].

There are several negative results, which clearly show that information measures are hard to estimate from a finite number of samples. It was shown in [3, Th. 4] that rate-of-convergence results cannot be obtained for any consistent estimator of entropy on a countable alphabet, and only when imposing various assumptions on the true distribution were rate-of-convergence results obtained. More negative results on the estimation of entropy and mutual information can be found in [28]. In fact, obtaining confidence bounds for information measures from samples is inherently difficult and requires regularity assumptions about the involved distributions, which are not subject to empirical test.
In the seminal work of [4] as well as subsequent works [13, 10, 29, 11] (and references therein), such necessary conditions for the estimation of statistical parameters with confidence bounds are discussed in great detail and generality. The results of [10, 29] can be applied to differential entropy estimation and yield a result very similar to Proposition 2, essentially showing that differential entropy cannot be bounded using a finite number of samples, unless additional assumptions on the distribution are made.

Especially in the context of unsupervised and semi-supervised machine learning, it recently became popular to use variational bounds or estimates of information measures as part of the loss function for training neural networks [7, 30, 19]. Criticism of this approach, in particular the use of variational bounds, has been voiced [24]. The current paper has a more general scope, dealing with the estimation problem of information measures in general, not limited to specific variational bounds or techniques.

The information flow in neural networks is also a recent topic of investigation. In [33], an argument for successive compression in the layers of a deep neural network is given, along the lines of the information bottleneck method [37]. While flaws in this argument were pointed out [2, 22], the authors of [15] found that a clustering phenomenon might elucidate the behavior of deep neural networks. These insights were obtained by estimating the differential entropy h(X + G) of a sum of two random vectors, where X is sub-Gaussian and G ∼ N(0, σ²I) is an independent Gaussian vector. This is similar in spirit to the work conducted here; however, our assumption of compact support is replaced by assuming X to be sub-Gaussian (similar to our assumption of an arbitrary but fixed compact support in Section 3, the constant K in the definition of the sub-Gaussian X is assumed to be fixed in [14, eq. (1)]). Note that the pdf of X + G is Lipschitz continuous with a fixed Lipschitz constant L(σ), so [15] is implicitly also using a Lipschitz assumption.

2 A Negative Result

Let P be a family of pdfs on 𝒳 := ℝ^K with finite differential entropy, i.e., h(p) := −∫ p(x) log p(x) dx ∈ ℝ for every p ∈ P. Suppose we observe N i.i.d. copies D := (X_1, X_2, ..., X_N) of some random vector X ∼ p ∈ P and want to obtain an estimate of differential entropy from these samples D. Such an estimator is a function ĥ : 𝒳^N → ℝ that maps D to ĥ(D), approximating the differential entropy h := h(X) := h(p) < ∞. Its accuracy can be measured by a confidence interval, a widely used tool in statistical practice for indicating the precision of point estimators. For a given error probability δ > 0, we would like to have C > 0 such that |h − ĥ(D)| ≥ C with probability less than δ, i.e., a confidence interval of size C with confidence 1 − δ. However, there is no free lunch when estimating differential entropy, as evidenced by the following result, a corollary of a more general result in [10], here specialized to a bound of differential entropy. It is based on the abstract notion of a dense graph condition (DGC): h satisfies the DGC over P if the graph of h over P is dense in its own epigraph [10, eq. (2.4)].

Theorem 1 ([10, Th. 2.1]). Assume that h satisfies the DGC over P and define B := sup{h(p) : p ∈ P}, where possibly B = +∞. If for any C > 0, sup_{p∈P} P_p{ĥ(D) + C ≤ B} = 1, then inf_{p∈P} P_p{h(p) − ĥ(D) ≤ C} = 0.

A similar result follows from [29, Prop. 3.1]. We will not work with the DGC, but make two practical assumptions: P is a convex family and the differential entropy of the pdfs in P is unbounded (either from above or from below). Under these assumptions we show that for any δ, C > 0 there is a pdf p ∈ P such that |h − ĥ(D)| ≤ C with probability at most δ, i.e., ĥ(D) is far from h with high probability. Fundamentally, this follows from the fact that P contains pdfs with a large difference in differential entropy, which cannot be accurately distinguished based on samples. Similar results hold true for mutual information and relative entropy and are given in Section 4.

Proposition 2.
Let P be a convex family of pdfs with unbounded differential entropy, i.e., for any α ∈ [0, 1] and p, q ∈ P we have αp + (1 − α)q ∈ P, as well as sup_{q∈P} |h(q)| = ∞. Then, for any pair of constants C, δ > 0, there exists a continuous random vector X ∼ p ∈ P satisfying

P_p{|h(p) − ĥ(D)| ≤ C} ≤ δ.

Remark. Before proceeding with the proof of Proposition 2, we note that this result could be proved as a consequence of Theorem 1. However, this would necessitate showing that our conditions imply the DGC. Furthermore, the proof of [10, Th. 2.1] itself hinges on deep statistical results and thus we opted for providing a short, self-contained proof.
Proof of Proposition 2.
The function ĥ, the constants C, δ > 0, and the sample size N ∈ ℕ are arbitrary, but fixed. Choose an arbitrary X_0 ∼ p_0 ∈ P and let h_0 := h(p_0) < ∞. Then fix b > 0 such that P{|ĥ(X_{0,1}, X_{0,2}, ..., X_{0,N}) − h_0| ≤ b} ≥ 1 − δ/2, where (X_{0,n})_{n=1,...,N} are i.i.d. copies of X_0. Furthermore, let Q ∼ B(1 − ε) be a Bernoulli random variable with parameter 1 − ε = P{Q = 1}, independent of X_0, where 0 < ε ≤ δ/(2N). Choose a > 0 such that aε > b + C + log 2. By our assumption sup_{q∈P} |h(q)| = ∞, we can find X̃ ∼ p̃ ∈ P with h̃ := h(p̃) such that |h̃ − h_0| ≥ a.

Define X := Q X_0 + (1 − Q) X̃, which yields h = h(X) = I(X; Q) + h(X | Q) = I(X; Q) + (1 − ε) h_0 + ε h̃ ∈ (1 − ε) h_0 + ε h̃ + [0, log 2], where I(·;·) denotes mutual information. For convenience we use Ĥ := ĥ(D), where D = (X_1, X_2, ..., X_N) are N i.i.d. copies of X, the n-th copy being generated from (Q_n, X_{0,n}, X̃_n). Define the event E := {Q_1 = Q_2 = ··· = Q_N = 1} = {D = (X_{0,1}, X_{0,2}, ..., X_{0,N})}. By the union bound, we have P{E} ≥ 1 − Nε and obtain

P{|Ĥ − h_0| ≤ b} = P{E} P{|Ĥ − h_0| ≤ b | E} + P{E^c} P{|Ĥ − h_0| ≤ b | E^c}
  ≥ (1 − Nε) P{|Ĥ − h_0| ≤ b | E}
  = (1 − Nε) P{|ĥ(X_{0,1}, X_{0,2}, ..., X_{0,N}) − h_0| ≤ b}
  ≥ (1 − δ/2)² ≥ 1 − δ.
We thus found X such that

P{|h − Ĥ| ≤ C} ≤ P{|h_0 − Ĥ| ≥ |h − h_0| − C}
  ≤ P{|h_0 − Ĥ| ≥ ε|h_0 − h̃| − log 2 − C}
  ≤ P{|h_0 − Ĥ| ≥ εa − log 2 − C}
  ≤ P{|h_0 − Ĥ| > b}
  = 1 − P{|h_0 − Ĥ| ≤ b} ≤ δ.

Remark. Proposition 2 shows that in order to obtain confidence bounds, one needs to make assumptions about the underlying distribution. However, as pointed out in [10, p. 1395], when making these assumptions, one uses information external to the samples.
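To make the construction concrete, the following minimal numerical sketch (our own illustration, not part of the proof) instantiates it with p_0 = U([0,1]), so h(p_0) = 0, and p̃ = U([0, e^{−a}]), so h(p̃) = −a. It reports how likely the N samples are to be indistinguishable from samples of p_0 and how far the differential entropy of the mixture is shifted; all function and variable names are hypothetical.

```python
import math

# Sketch of the mixture construction in the proof of Proposition 2, assuming
# p0 = U([0,1]) (so h(p0) = 0) and p~ = U([0, exp(-a)]) (so h(p~) = -a).
# With contamination weight eps <= delta/(2N), all N samples come from p0 with
# probability at least 1 - N*eps, yet h of the mixture is shifted by about eps*a.

def mixture_example(N, delta, C, b):
    eps = delta / (2 * N)              # contamination probability
    a = (b + C + math.log(2)) / eps    # so that eps*a equals b + C + log 2 (any larger a also works)
    p_from_p0 = (1 - eps) ** N         # prob. that the sample "looks like" p0
    # h(mixture) lies in (1-eps)*h(p0) + eps*h(p~) + [0, log 2], see the proof
    h_lo = -eps * a
    return eps, a, p_from_p0, h_lo, h_lo + math.log(2)

eps, a, p_same, h_lo, h_hi = mixture_example(N=1000, delta=0.05, C=1.0, b=1.0)
print(f"eps = {eps:.2e}, a = {a:.1f}")
print(f"P(all N samples come from p0) >= {p_same:.4f}")
print(f"h(mixture) in [{h_lo:.2f}, {h_hi:.2f}] nats, while h(p0) = 0")
```

With these (arbitrary) parameters, the samples coincide with a sample from p_0 with probability above 0.97, while the differential entropy of the mixture is shifted by more than b + C.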
Remark. Note that the family of all pdfs with support [0,1]^K satisfies the requirements of Proposition 2. It also satisfies the DGC, but it is not strongly nonparametric, as defined in [10, p. 1395].

3 Estimation under Lipschitz and Support Assumptions

One way to avoid the problems outlined in Section 2 is to impose additional assumptions on the underlying probability distribution that bound the differential entropy from above and from below. We will showcase that the differential entropy of an L-Lipschitz continuous pdf with fixed, known L > 0 on ℝ^K and known, compact support 𝒳 can be well approximated from samples. In the following, let X ∼ p be supported on 𝒳 := [0,1]^K, i.e., ∫_𝒳 p dλ_K = P{X ∈ 𝒳} = 1, where λ_K denotes the Lebesgue measure on ℝ^K. Any known compact support suffices; an affine transformation then yields 𝒳 = [0,1]^K, while possibly resulting in a different Lipschitz constant. The pdf p : ℝ^K → ℝ_+ of X is assumed to be L-Lipschitz continuous on ℝ^K with some fixed L > 0, where ℝ^K is equipped with the ℓ_1-norm ‖x‖ := ‖x‖_1 = Σ_k |x_k|, hence,

∀ x, y ∈ ℝ^K : |p(x) − p(y)| ≤ L ‖x − y‖.

The ℓ_1-norm is only chosen to facilitate subsequent computations; by the equivalence of norms on ℝ^K, any norm suffices.

Given N i.i.d. copies D = (X_1, X_2, ..., X_N) of X, let Y be distributed according to the empirical distribution of D, i.e., Y = X_U, where U ∼ U({1, 2, ..., N}) is a uniform random variable on {1, 2, ..., N}. Let the discrete random vector Ỹ = Δ_M(Y) be the element-wise quantization of Y, where Δ_M(x) := ⌊Mx⌋/M is the M-step discretization of [0,1] for some M ∈ ℕ. Additionally, define the continuous random vector Ȳ = Ỹ + U([0, 1/M]^K), i.e., independent uniform noise is added to the quantized samples. It is straightforward to see that h(Ȳ) − H(Ỹ) = −K log M, where H(·) denotes Shannon entropy. We will estimate differential entropy by

ĥ(D) = h(Ȳ) = H(Ỹ) − K log M,

i.e., the Shannon entropy of the discretized and binned samples with a correction factor. In the following, we shall also use the two constants

η(K, L) := (1/K) ((K + 1)! / (2^K L))^{1/(K+1)},  and  α := (√(e² + 4) − e)/(2e) ≈ 0.12.
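The estimator ĥ(D) = H(Ỹ) − K log M can be computed directly from the quantized samples. The following minimal sketch (function and variable names are ours, and it is not an optimized implementation) illustrates one way to do so.

```python
import numpy as np

# A sketch of the simple histogram-based estimator analyzed in Theorem 3:
# quantize each coordinate to a grid of step 1/M, take the Shannon entropy of
# the empirical distribution of the quantized samples, and subtract K*log(M).

def histogram_entropy_estimate(samples, M):
    """samples: (N, K) array with entries in [0, 1); M: bins per dimension."""
    N, K = samples.shape
    bins = np.floor(M * samples).astype(int)           # element-wise Delta_M
    _, counts = np.unique(bins, axis=0, return_counts=True)
    p_hat = counts / N                                  # empirical pmf of the bins
    H_hat = -np.sum(p_hat * np.log(p_hat))              # Shannon entropy (nats)
    return H_hat - K * np.log(M)                        # h_hat(D) = H(Y~) - K log M

# Example: N = 10_000 samples from the uniform pdf on [0,1]^2, for which h = 0.
rng = np.random.default_rng(0)
print(histogram_entropy_estimate(rng.random((10_000, 2)), M=16))
```

On this toy example the estimate is close to the true value h = 0, with a small negative bias coming from the empirical estimation of the bin probabilities.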
Theorem 3. For M ≥ 1/(2α η(K, L)) and any δ ∈ (0, 1), we have with probability greater than 1 − δ that

|ĥ(D) − h(X)| ≤ (LK/(2M)) log(2M η(K, L))        (1a)
  + √((2/N) log(2/δ)) log N                       (1b)
  + log(1 + (M^K − 1)/N).                         (1c)

The proof will be given in Appendix A.

Remark. Of the three error terms (1a)–(1c), the terms (1a) and (1c) constitute the bias and (1b) is a variance-like error term. While the variance (1b) vanishes as N → ∞, the term (1a) does not depend on the sample size N, as it merely measures the error incurred due to the quantization Δ_M, which is bounded by the Lipschitz constraint and approaches zero as M → ∞. The final term (1c) results from ensuring that N samples suffice to suitably approximate the empirical distribution over the M^K quantization bins. Thus, it ties the quantization to the sample size and approaches zero if M^K/N → 0. In total, the right-hand side of (1) approaches zero for N, M → ∞ provided that M^K/N → 0.

Remark. Theorem 3 should be regarded as a proof-of-concept rather than a practical tool for performing differential entropy estimation. While analytically tractable, the estimation strategy is crude and the bounds are loose; in particular, the term (1b), while being a completely universal bound, is known to be loose, as pointed out in [28, p. 1200].
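For concreteness, the following short sketch evaluates the three terms on the right-hand side of (1) numerically for given K, L, N, M, and δ (nats throughout). All names are illustrative, and the constants follow the statement of Theorem 3 as reconstructed above, so the snippet should be read as a rough guide rather than an authoritative calculator.

```python
import math

# Evaluate the confidence bound (1) of Theorem 3 for given parameters (a sketch).

def eta(K, L):
    return (math.factorial(K + 1) / (2**K * L)) ** (1.0 / (K + 1)) / K

def theorem3_bound(K, L, N, M, delta):
    alpha = (math.sqrt(math.e**2 + 4) - math.e) / (2 * math.e)
    assert M >= 1 / (2 * alpha * eta(K, L)), "quantization too coarse for the bound"
    term_a = L * K / (2 * M) * math.log(2 * M * eta(K, L))          # (1a) quantization bias
    term_b = math.sqrt(2 / N * math.log(2 / delta)) * math.log(N)   # (1b) deviation term
    term_c = math.log(1 + (M**K - 1) / N)                           # (1c) empirical-pmf bias
    return term_a + term_b + term_c

# Example: K = 1, L = 1, N = 10**6 samples, M = 100 bins, 95% confidence.
print(theorem3_bound(K=1, L=1, N=10**6, M=100, delta=0.05))
```

Even for a one-dimensional density with a million samples, the guaranteed accuracy is only on the order of a few hundredths of a nat, which illustrates the remark above about the looseness of the bound.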
Remark. We want to note that requiring both a fixed Lipschitz constant L and a known bounded support, e.g., 𝒳 = [0,1]^K, is necessary. Consider for instance the set P′ = {p : p supported on [0,1]^K and Lipschitz continuous} of pdfs with arbitrary Lipschitz constant, or the set P″ = {p : p supported on a bounded set and L-Lipschitz continuous} with fixed Lipschitz constant, but arbitrary, bounded support. Both families satisfy the conditions of Proposition 2, i.e., they are convex and sup_{p∈P′} |h(p)| = sup_{p∈P″} |h(p)| = ∞.

In principle, Theorem 3 also allows for the approximation of mutual information with a confidence bound. Let (X, Y) ∼ p_XY be two random vectors, supported on [0,1]^{K_1} and [0,1]^{K_2}, respectively. Assuming that p_XY is L-Lipschitz continuous on ℝ^{K_1+K_2}, it is clear that the marginals p_X and p_Y are L-Lipschitz continuous as well. Thus, Theorem 3 can be used to approximate all three terms in I(X; Y) = h(X) + h(Y) − h((X, Y)), as sketched below.
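The following is a minimal sketch of this plug-in approach. The helper function and the toy joint distribution are ours and chosen only for illustration; in particular, the toy joint density is not Lipschitz continuous, so Theorem 3 does not formally apply to it.

```python
import numpy as np

# Plug-in estimate of I(X;Y) = h(X) + h(Y) - h((X,Y)) using the same
# histogram estimator three times (names and the toy example are hypothetical).

def hist_h(samples, M):
    N, K = samples.shape
    _, counts = np.unique(np.floor(M * samples).astype(int), axis=0, return_counts=True)
    p = counts / N
    return -(p * np.log(p)).sum() - K * np.log(M)

rng = np.random.default_rng(0)
x = rng.random((100_000, 1))
y = (x + 0.1 * rng.random((100_000, 1))) % 1.0      # Y depends on X through a narrow band
mi_hat = hist_h(x, 32) + hist_h(y, 32) - hist_h(np.hstack([x, y]), 32)
print(f"estimated I(X;Y) = {mi_hat:.3f} nats")      # true value is log(10) = 2.303 nats
```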
4 Other Information Measures

In this section, we showcase that similar statements as Proposition 2 also hold for mutual information and relative entropy. For simplicity we will not assume p_XY ∈ P for some family P of probability density functions, but merely require I(X; Y), D(X‖Y) < ∞. Only proof sketches are given, as the examples in this section are similar to the proof of Proposition 2. Here we show that, in general, it is not possible to accurately estimate mutual information I(X; Y) and relative entropy D(X‖Y) from samples (X, Y).

Mutual information. For any N, let î : ℝ^N × ℝ^N → ℝ be a measurable function that represents an estimate of the mutual information I := I(X; Y) < ∞ from the sample vectors X := (X_1, ..., X_N) and Y := (Y_1, ..., Y_N), consisting of N i.i.d. realizations of (X, Y). For convenience we use Î := î(X, Y). Let X, Z ∼ U([0,1]), W ∼ U([0, e^{−a}]), and Q ∼ B(1 − ε) be independent random variables. Define Y := QZ + (1 − Q)(X + W). Writing H_2(·) for the binary entropy function, we have

I(X; Y) = h(Y) − h(Y | X) ≥ h(Y | Q) − H_2(ε) − h(Y | X, Q) = ε h(X + W) − H_2(ε) + aε ≥ aε − H_2(ε).

Let Z := (Z_1, ..., Z_N) denote the corresponding N i.i.d. realizations of Z. For any δ > 0, we can find b ∈ ℝ such that P{î(X, Z) ≤ b} ≥ 1 − δ/2. Letting E = {Y = Z} = {Q_1 = ··· = Q_N = 1}, we have P{E} ≥ 1 − Nε. Thus, when choosing ε ≤ δ/(2N),

P{Î ≤ b} = P{E} P{Î ≤ b | E} + P{E^c} P{Î ≤ b | E^c}
  ≥ (1 − Nε) P{Î ≤ b | E}
  ≥ (1 − Nε)(1 − δ/2) ≥ 1 − δ.

We may choose a ≥ (b + C + H_2(ε))/ε, so that I ≥ aε − H_2(ε) ≥ b + C. Then, for arbitrary C, δ > 0 and N ∈ ℕ, we have found X, Y and b ∈ ℝ such that P{Î > b} ≤ δ, yet I ≥ b + C.

Remark. Note that [7, Th. 3] claims a confidence bound for mutual information that, together with the approximation result [7, Lem. 1], seemingly contradicts our result. However, the confidence bound proved in [7, Th. 3] requires strong conditions on the functions T_θ (here we use the notation of [7]), and [7, Lem. 1] does not necessarily hold under these conditions. Moreover, both approximation results [7, Lem. 1 and Lem. 2] do not hold uniformly for a family of distributions, but implicitly assume a fixed, underlying distribution. This is especially evident in [7, Lem. 2], which also seemingly contradicts our result when assuming that the optimal function T* is in the family T_Θ. However, this apparent contradiction is resolved by noting that the chosen N ∈ ℕ depends on the underlying, true distribution.

In the following we show that a similar result holds if Y has a fixed finite alphabet, say 𝒴 = {1, 2, ..., K}. Again, for every N ∈ ℕ, let î_N : ℝ^N × 𝒴^N → ℝ be an estimator that estimates I := I(X; Y) ≤ log K from X, Y. Note that the result for continuous Y cannot carry over unchanged, as we have I ∈ [0, log K]. We shall assume that î_N is consistent in the sense that Î_N → I in probability as N → ∞, where we use Î_N := î_N(X, Y).

Let X ∼ U([0,1]) and W ∼ U(𝒴) be independent random variables. Fix δ > 0 and, by consistency, find N_0 such that P{î_N(X, W) ≥ δ} ≤ δ/2 for all N ≥ N_0, where W := (W_1, ..., W_N) collects i.i.d. copies of W. In the following, consider N ≥ N_0 fixed. Fix M ∈ ℕ and v ∈ 𝒴^M, and define the quantization Z := 1 + ⌊MX⌋. The random variable Y is simply Y = v_Z. We use the notation Î^v_N to highlight that Î_N depends on the particular choice of v, and wish to show that P{Î^v_N ≥ δ} ≤ δ for at least one v.

Assume to the contrary that P{Î^v_N ≥ δ} > δ for all v ∈ 𝒴^M. Let E = {∃ i, j ∈ {1, 2, ..., N} : i ≠ j, Z_i = Z_j} be the event that two elements of X fall in the same "bin." Note that Z is the quantization of X. For M large enough, we obtain

P{E} = 1 − M! / (M^N (M − N)!) ≤ 1 − ((M − N + 1)/M)^N ≤ ε.

Defining V ∼ U(𝒴^M), independent of X, we obtain for ε small enough

δ < K^{−M} Σ_{v∈𝒴^M} P{Î^v_N ≥ δ}
  ≤ ε + K^{−M} Σ_{v∈𝒴^M} P{Î^v_N ≥ δ | E^c}
  = ε + Σ_{v∈𝒴^M} P{Î^V_N ≥ δ | E^c, V = v} P{V = v}
  = ε + P{Î^V_N ≥ δ | E^c}
  = ε + P{î_N(X, V_Z) ≥ δ | E^c}
  = ε + P{î_N(X, W) ≥ δ | E^c}
  ≤ ε + δ/(2(1 − ε)) ≤ δ,

leading to a contradiction. To summarize, for arbitrary δ and N large enough, there exists v such that P{Î^v_N ≥ δ} ≤ δ, but clearly Y is a deterministic function of X and hence I = log K.

Relative entropy. Let p and q be two continuous pdfs (w.r.t. λ_1) and X, Y be N i.i.d. random variables distributed according to p and q, respectively. For any N, let d̂_N : ℝ^N × ℝ^N → ℝ be an estimator that estimates D := D(p‖q) < ∞ from X, Y. For convenience we use D̂_N := d̂_N(X, Y). Let Z_1, Z_2 be two independent i.i.d. N-vectors with components uniformly distributed on [−1, 0]. For an arbitrary δ > 0 we can find c ∈ ℝ such that P{d̂_N(Z_1, Z_2) ≤ c} ≥ 1 − δ/2.
Consider C, δ > 0 and an arbitrary N ∈ ℕ. Define the pdfs

p(x) = e^{−a} 1_{[0,1]}(x) + (1 − e^{−a}) 1_{[−1,0)}(x),  and
q(x) = e^{−a−b} 1_{[0,1]}(x) + (1 − e^{−a−b}) 1_{[−1,0)}(x)

for a, b ∈ ℝ_+. With b = k e^a, where k ∈ ℝ_+, we have D = D̄(a, k) with the function

D̄(a, k) = e^{−a} b + (1 − e^{−a}) log((1 − e^{−a})/(1 − e^{−a−b})) = k + (1 − e^{−a}) log((1 − e^{−a})/(1 − e^{−a−ke^a})) ≥ k − e^{−1}.

Let E_1 = {X < 0} and E_2 = {Y < 0} be the events that every component of X and Y, respectively, is negative. Then, P{E_1^c ∪ E_2^c} ≤ 2N e^{−a}, i.e., P{E_1 ∩ E_2} ≥ 1 − 2N e^{−a}. Choose a ≥ log(4N/δ) and k ≥ c + C + e^{−1}, such that 2N e^{−a} ≤ δ/2 and D ≥ c + C. We can now bound the probability

P{D̂_N ≤ c} = P{D̂_N ≤ c | E_1 ∩ E_2} P{E_1 ∩ E_2} + P{D̂_N ≤ c | E_1^c ∪ E_2^c} P{E_1^c ∪ E_2^c}
  ≥ P{D̂_N ≤ c | E_1 ∩ E_2} (1 − δ/2)
  = P{d̂_N(Z_1, Z_2) ≤ c} (1 − δ/2)
  ≥ (1 − δ/2)² ≥ 1 − δ.

In summary, for an estimator d̂_N and any δ, C > 0 and N ∈ ℕ, we can find distributions p, q and c ∈ ℝ such that P{D̂_N > c} ≤ δ, even though D ≥ c + C.

5 Conclusion

We showed that under mild assumptions on the family of allowed distributions, differential entropy cannot be reliably estimated solely based on samples, no matter how many samples are available. In particular, as first noted in [10], no non-trivial bound or estimate of an information measure can be obtained based only on samples; external information about the regularity of the underlying probability distribution needs to be taken into account. However, such regularity assumptions are not subject to empirical verification, and thus the existence of statistical guarantees for an empirical estimate cannot be empirically tested. This shows that researchers should take great care when approximating or bounding information measures, and specifically explore the necessary assumptions for the underlying distribution.

Regarding the use of information measures in machine learning, we note that our results apply to all estimators of information measures. In particular, empirical versions of variational bounds cannot provide estimates of information measures with high reliability in general.

It would be interesting to investigate the type of assumptions on the underlying distributions that may hold in typical machine learning setups. However, as pointed out previously, these properties cannot be deduced from data, but must result from the model under consideration. On a related note, it might be interesting to see whether the confidence bounds for differential entropy estimation under bounded support and a Lipschitz condition from Section 3 carry over to empirical versions of variational bounds. Extensions of these results to other information measures, e.g., Rényi entropy, Rényi divergences, or f-divergences, could also be of particular interest for future work.

Acknowledgments

The authors are very grateful to Prof. Elisabeth Gassiat for pointing out the connection between the present work and the reference [10].
Appendix A: Proof of Theorem 3
We shall first introduce auxiliary random variables, which are depicted in Figure 1. Let X̃ be the element-wise discretization X̃ = Δ_M(X). The continuous random vector X̄ = X̃ + U([0, 1/M]^K) is obtained by adding independent uniform random noise. Let q be the pdf of X̄. It is straightforward to see that Ỹ is distributed according to the empirical distribution of N i.i.d. copies of X̃. Also note that H(X̃) = h(X̄) + K log M.

[Figure 1: Connections between all involved random variables. X yields X̃ and X̄ via Δ_M and added uniform noise; the N i.i.d. copies D of X yield Y via the empirical distribution, and Ỹ and Ȳ are obtained from Y in the same way.]

In order to prove Theorem 3, we use the triangle inequality twice to obtain

|ĥ(D) − h(X)| = |h(Ȳ) − h(X)|
  ≤ |h(Ȳ) − h(X̄)| + |h(X̄) − h(X)|
  = |H(Ỹ) − H(X̃)| + |h(X̄) − h(X)|
  ≤ |H(Ỹ) − E[H(Ỹ)]| + |E[H(Ỹ)] − H(X̃)| + |h(X̄) − h(X)|,   (2)

noting that h(Ȳ) − H(Ỹ) = h(X̄) − H(X̃) = −K log M. Note that H(Ỹ) is a random quantity that depends on D. We thus split the bound into three terms, where the first term in (2) is variance-like and the second and third terms constitute the bias. In the remainder of this appendix, we will complete the proof by showing that all three terms in (2) can be bounded as follows. First, we have

|E[H(Ỹ)] − H(X̃)| ≤ log(1 + (M^K − 1)/N),   (3)

and

|h(X̄) − h(X)| ≤ (LK/(2M)) log(2M η(K, L)).   (4)

And with probability greater than 1 − δ we also have

|H(Ỹ) − E[H(Ỹ)]| ≤ √((2/N) log(2/δ)) log N.   (5)

As Ỹ is distributed according to the empirical distribution of N i.i.d. copies of X̃, on an alphabet of size M^K, the inequalities (3) and (5) follow directly from the following well-known lemma, concerning the estimation of (discrete) Shannon entropy.

Lemma 4 ([28, eq. (3.4) and Prop. 1] and [3, Remark iii, p. 168]). Let Z be a random variable on {1, 2, ..., M} and Ẑ distributed according to the empirical measure of N i.i.d. copies of Z. We then have

|H(Ẑ) − H(Z)| ≤ |H(Z) − E[H(Ẑ)]| + |H(Ẑ) − E[H(Ẑ)]|,

where

|H(Z) − E[H(Ẑ)]| ≤ log(1 + (M − 1)/N),

and for any δ ∈ (0, 1), with probability greater than 1 − δ,

|H(Ẑ) − E[H(Ẑ)]| ≤ √((2/N) log(2/δ)) log N.
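As a sanity check, the following small Monte Carlo sketch (parameters and names are ours) draws a uniform Z on an alphabet of size m and verifies empirically that the plug-in entropy error stays within the combined bias and deviation bounds of Lemma 4.

```python
import numpy as np

# Monte Carlo check of the two plug-in Shannon entropy bounds in Lemma 4
# for a uniform distribution on m symbols (illustrative parameter choices).

rng = np.random.default_rng(1)
m, N, delta, trials = 64, 2_000, 0.05, 1_000
H_true = np.log(m)                                   # entropy of the uniform pmf
errs = np.empty(trials)
for t in range(trials):
    counts = np.bincount(rng.integers(0, m, size=N), minlength=m)
    p_hat = counts[counts > 0] / N
    errs[t] = abs(-(p_hat * np.log(p_hat)).sum() - H_true)
bound = np.log(1 + (m - 1) / N) + np.sqrt(2 / N * np.log(2 / delta)) * np.log(N)
print(f"fraction of trials within the bound: {(errs <= bound).mean():.3f} (expect >= {1 - delta})")
```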
In order to show (4), we will first obtain some preliminary results and then conclude the proof in Lemma 10. We start by bounding the difference between p and q using the following auxiliary results.

Lemma 5. Let f : [0, A] → ℝ be an arbitrary L-Lipschitz continuous function and assume f(y) = 0 for some y ∈ [0, A], where [0, A] := {x ∈ ℝ^K : 0 ≤ x_k ≤ A_k, k ∈ {1, 2, ..., K}}. Then

∫_{[0,A]} |f(x)| λ_K(dx) ≤ (L/2) (∏_{k=1}^K A_k) (Σ_{k=1}^K A_k).

In particular, for A_k = ε, k ∈ {1, 2, ..., K}, we have ∫_{[0,ε]^K} |f(x)| λ_K(dx) ≤ ε^{K+1} LK/2.

Proof. For x ∈ [0, A] we have |f(x)| = |f(x) − f(y)| ≤ L ‖x − y‖ and hence

∫_{[0,A]} |f(x)| λ_K(dx) ≤ L ∫_{[0,A]} ‖x − y‖ λ_K(dx)
  = L Σ_{k=1}^K ∫_{[0,A]} |x_k − y_k| λ_K(dx)
  = L (∏_{k=1}^K A_k) Σ_{k=1}^K (y_k² + (A_k − y_k)²)/(2 A_k)
  ≤ (L/2) (∏_{k=1}^K A_k) (Σ_{k=1}^K A_k),

where the last inequality follows since y_k² + (A_k − y_k)² ≤ A_k² for y_k ∈ [0, A_k].

Lemma 6. For an L-Lipschitz continuous pdf p on 𝒳 and q the pdf of X̄, as defined above, we have |p(x) − q(x)| ≤ LK/(2M) for every x ∈ 𝒳.

Proof. Let x ∈ 𝒳 and x̃ = Δ_M(x). The function q is constant on Δ_M^{−1}(x̃) and given by q(x′) = λ_K(Δ_M^{−1}(x̃))^{−1} ∫_{Δ_M^{−1}(x̃)} p dλ_K for all x′ ∈ Δ_M^{−1}(x̃), where λ_K(Δ_M^{−1}(x̃)) = M^{−K}. Thus, since x ∈ Δ_M^{−1}(x̃), we obtain

M^{−K} |q(x) − p(x)| = |∫_{Δ_M^{−1}(x̃)} (p(x′) − p(x)) λ_K(dx′)| ≤ ∫_{Δ_M^{−1}(x̃)} |p(x′) − p(x)| λ_K(dx′) ≤ M^{−K−1} LK/2,   (6)

where we applied Lemma 5 to x′ ↦ p(x′) − p(x) in (6).

Lemma 7. If p is an L-Lipschitz continuous pdf on ℝ^K, then p ≤ (L^K (K+1)! / 2^K)^{1/(K+1)}.

Proof. Let x ∈ ℝ^K and define the ball B_x(p(x)/L) = {x′ ∈ ℝ^K : ‖x′ − x‖ ≤ p(x)/L} with radius p(x)/L, centered at x. We then have λ_K(B_x(p(x)/L)) = 2^K p(x)^K / (L^K K!) and hence

1 ≥ ∫_{B_x(p(x)/L)} p(x′) λ_K(dx′)
  = 2^K p(x)^{K+1} / (L^K K!) − ∫_{B_x(p(x)/L)} (p(x) − p(x′)) λ_K(dx′)
  ≥ 2^K p(x)^{K+1} / (L^K K!) − L ∫_{B_x(p(x)/L)} ‖x − x′‖ λ_K(dx′)   (7)
  = 2^K p(x)^{K+1} / (L^K (K+1)!),

where the fact that p(x) − p(x′) ≤ L ‖x − x′‖ for all x′ ∈ ℝ^K is used in (7).

Using the previous lemmas to bound the distance between p and q, the following two results will allow us to bound the difference of the differential entropies.

Lemma 8. For x ∈ [0, 1], y ≥ 0, and a := |x − y| ≤ α ≈ 0.12 we have

|x log x − y log y| ≤ −a log a.   (8)

Proof.
In the following, we assume w.l.o.g. that x ≤ y = x + a. If x ≤ y ≤ e^{−1}, then |x log x − (x + a) log(x + a)| = x log x − (x + a) log(x + a) is monotonically decreasing in x and thus maximal at x = 0, and hence (8) follows. If, on the other hand, y ≥ e^{−1}, then necessarily a ≤ α ≤ e^{−1} − α ≤ x ≤ y ≤ 1 + α. Define the function f(x) := −x log x for x > 0 and f(0) := 0. Note that by the mean value theorem there are a_0 ∈ (0, a) and x_0 ∈ (x, y) such that |f(0) − f(a)| = f(a) = a f′(a_0) and |f(x) − f(y)| = a |f′(x_0)|. Inequality (8) then follows by observing that |f′(x_0)| ≤ f′(a_0) whenever a_0 ∈ (0, α) and x_0 ∈ (e^{−1} − α, 1 + α).

Lemma 9.
Let p and q be two pdfs supported on 𝒳 with finite differential entropies. Assume that for all x ∈ 𝒳 we have |p(x) − q(x)| ≤ ε and 0 ≤ p(x) ≤ A, and that ε/A ≤ α holds. Then,

|h(p) − h(q)| ≤ ε log(A/ε).   (9)
Proof.
Define p′(x) := A^{−1} p(A^{−1/K} x) and q′(x) := A^{−1} q(A^{−1/K} x) for x ∈ [0, A^{1/K}]^K. We have |p′(x) − q′(x)| ≤ A^{−1} ε and 0 ≤ p′(x) ≤ 1, as well as

|h(p) − h(q)| = |h(p′) − h(q′)| = |∫ (p′ log p′ − q′ log q′) dλ_K|
  ≤ ∫ |p′ log p′ − q′ log q′| dλ_K
  ≤ −λ_K([0, A^{1/K}]^K) · A^{−1} ε log(A^{−1} ε)   (10)
  = ε log(A/ε),

where Lemma 8 was applied in (10).

We can now finish the proof of Theorem 3 by showing (4).
Lemma 10. If M ≥ 1/(2α η(K, L)), we have |h(X̄) − h(X)| ≤ (LK/(2M)) log(2M η(K, L)).
Proof. By Lemma 6, |p(x) − q(x)| ≤ LK/(2M) and, by Lemma 7, p ≤ (L^K (K+1)! / 2^K)^{1/(K+1)}. We can thus apply Lemma 9 with ε = LK/(2M) and A = (L^K (K+1)! / 2^K)^{1/(K+1)}, provided that ε/A ≤ α, which is equivalent to M ≥ 1/(2α η(K, L)). Inserting ε and A in (9) proves the result.

References

[1] I. Ahmad and P.-E. Lin. A nonparametric estimation of the entropy for absolutely continuous distributions. IEEE Trans. Inf. Theory, 22(3):372–375, May 1976.

[2] R. A. Amjad and B. C. Geiger. Learning representations for neural network-based classification using the information bottleneck principle. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, to appear.
[3] A. Antos and I. Kontoyiannis. Convergence properties of functional estimates for discrete distributions. Random Structures & Algorithms, 19(3-4):163–193, Nov. 2001.

[4] R. R. Bahadur and L. J. Savage. The nonexistence of certain statistical procedures in nonparametric problems. Ann. Math. Statist., 27(4):1115–1122, 1956.

[5] D. Barber and F. Agakov. The IM algorithm: A variational approach to information maximization. In NIPS'03, volume 16 of Advances in Neural Information Processing Systems, pages 201–208, Cambridge, MA, USA, Dec. 2003.

[6] J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. Van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17–39, 1997.

[7] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm. Mutual information neural estimation. In ICML'18, volume 80 of PMLR, pages 531–540, Stockholm, Sweden, July 2018.

[8] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS'16, volume 29 of Advances in Neural Information Processing Systems, pages 2172–2180, Barcelona, Spain, 2016.

[9] G. A. Darbellay and I. Vajda. Estimation of the information by an adaptive partitioning of the observation space. IEEE Trans. Inf. Theory, 45(4):1315–1321, May 1999.

[10] D. L. Donoho. One-sided inference about functionals of a density. Ann. Statist., 16(4):1390–1420, 1988.

[11] D. L. Donoho and R. C. Liu. Geometrizing rates of convergence, II. Ann. Statist., 19(2):633–667, 1991.

[12] W. Gao, S. Kannan, S. Oh, and P. Viswanath. Estimating mutual information for discrete-continuous mixtures. In NIPS'17, volume 30 of Advances in Neural Information Processing Systems, pages 5986–5997, Long Beach, CA, USA, 2017.

[13] L. J. Gleser and J. T. Hwang. The nonexistence of 100(1−α)% confidence sets of finite expected diameter in errors-in-variables and related models. Ann. Statist., pages 1351–1362, 1987.

[14] Z. Goldfeld, K. Greenewald, Y. Polyanskiy, and J. Weed. Convergence of smoothed empirical measures with applications to entropy estimation. arXiv preprint, 2019.

[15] Z. Goldfeld, E. Van Den Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy. Estimating information flow in deep neural networks. In ICML'19, volume 97 of PMLR, pages 2299–2308, Long Beach, CA, USA, 2019.

[16] S. Gordon, H. Greenspan, and J. Goldberger. Applying the information bottleneck principle to unsupervised clustering of discrete and continuous image representations. In Proc. Ninth IEEE Int. Conf. Comput. Vision, pages 370–377, Nice, France, Oct. 2003.

[17] L. Györfi and E. C. Van der Meulen. Density-free convergence properties of various estimators of entropy. Comput. Stat. Data Anal., 5(4):425–436, Sept. 1987.

[18] Y. Han, J. Jiao, T. Weissman, and Y. Wu. Optimal rates of entropy estimation over Lipschitz balls. arXiv preprint, 2019.

[19] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, New Orleans, LA, USA, 2019.

[20] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama. Learning discrete representations via information maximizing self-augmented training. In ICML'17, volume 70 of PMLR, pages 1558–1567, Sydney, Australia, 2017.

[21] K. Kandasamy, A. Krishnamurthy, B. Poczos, L. Wasserman, and J. M. Robins. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In NIPS'15, volume 28 of Advances in Neural Information Processing Systems, pages 397–405, Montréal, Canada, 2015.

[22] A. Kolchinsky, B. D. Tracey, and S. V. Kuyk. Caveats for information bottleneck in deterministic scenarios. In ICLR, New Orleans, LA, USA, 2019.

[23] H. Liu, L. Wasserman, and J. D. Lafferty. Exponential concentration for mutual information estimation with application to forests. In NIPS'12, volume 26 of Advances in Neural Information Processing Systems, pages 2537–2545, Lake Tahoe, NV, USA, 2012.

[24] D. McAllester and K. Stratos. Formal limitations on the measurement of mutual information. In ICLR, New Orleans, LA, USA, 2019.

[25] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, and S. Ishii. Distributional smoothing with virtual adversarial training. In ICLR, San Juan, Puerto Rico, 2016.

[26] I. Nemenman, W. Bialek, and R. D. R. Van Steveninck. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E, 69(5):056111, 2004.

[27] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory, 56(11):5847–5861, Nov. 2010.

[28] L. Paninski. Estimation of entropy and mutual information. Neural Comput., 15(6):1191–1253, 2003.

[29] J. Pfanzagl. The nonexistence of confidence sets for discontinuous functionals. J. Stat. Plan. Inference, 75(1):9–20, 1998.

[30] B. Poole, S. Ozair, A. van den Oord, A. A. Alemi, and G. Tucker. On variational bounds of mutual information. In ICML'19, volume 97 of PMLR, pages 5171–5180, Long Beach, CA, USA, 2019.

[31] C. E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27(3):379–423, July 1948.

[32] C. E. Shannon. Prediction and entropy of printed English. Bell Syst. Tech. J., 30(1):50–64, 1951.

[33] R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arXiv preprint, 2017.

[34] S. Singh and B. Poczos. Generalized exponential concentration inequality for Rényi divergence estimation. In ICML'14, volume 32 of PMLR, pages 333–341, Beijing, China, 2014.

[35] S. Singh and B. Poczos. Finite-sample analysis of fixed-k nearest neighbor density functional estimators. In NIPS'16, volume 29 of Advances in Neural Information Processing Systems, pages 1217–1225, Barcelona, Spain, 2016.

[36] K. Sricharan, R. Raich, and A. O. Hero. k-nearest neighbor estimation of entropies with confidence. In Proc. IEEE Int. Symp. Inf. Theory (ISIT 2011), pages 1205–1209, Saint Petersburg, Russia, July 2011.

[37] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Annu. Allerton Conf. Commun., Control, and Comput., pages 368–377, Monticello, IL, Sept. 1999.

[38] A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint, 2018.

[39] Q. Wang, S. R. Kulkarni, and S. Verdú. Divergence estimation of continuous distributions based on data-dependent partitions.