Two-Moment Inequalities for Rényi Entropy and Mutual Information
arXiv preprint [cs.IT], Feb.
Galen Reeves
Abstract—This paper explores some applications of a two-moment inequality for the integral of the $r$-th power of a function, where $0 < r < 1$. The first contribution is an upper bound on the Rényi entropy of a random vector in terms of the two different moments. When one of the moments is the zeroth moment, these bounds recover previous results based on maximum entropy distributions under a single moment constraint. More generally, evaluation of the bound with two carefully chosen nonzero moments can lead to significant improvements with a modest increase in complexity. The second contribution is a method for upper bounding mutual information in terms of certain integrals with respect to the variance of the conditional density. The bounds have a number of useful properties arising from the connection with variance decompositions.

Index Terms—Information Inequalities, Mutual Information, Rényi Entropy.
I. INTRODUCTION
Measures of entropy and information play a central role in applications throughout information theory, statistics, computer science, and statistical physics. In many cases, there is interest in understanding maximal properties of these measures over a given family of distributions. One example is given by the principle of maximum entropy, which originated in statistical mechanics and was introduced in broader context by Jaynes [1].

Entropy-moment inequalities can be used to describe properties of distributions characterized by moment constraints. Perhaps the most well-known entropy-moment inequality follows from the fact that the Gaussian distribution maximizes differential entropy over all distributions with the same variance [2, Theorem 8.6.5]. This inequality leads to remarkably simple proofs for fundamental results in information theory and estimation theory.

A variety of entropy-moment inequalities have also been studied in the context of Rényi entropy [3]–[7], which is a generalization of Shannon entropy. Recent work has focused on the extremal distributions for the closely related Rényi divergence [8]–[12].

Another line of work focuses on relationships between measures of dissimilarity between probability distributions provided by the family of $f$-divergences [13], [14], which includes as special cases the total variation distance, relative entropy (or Kullback–Leibler divergence), Rényi divergence, and chi-square divergence.

The work of G. Reeves was supported in part by funding from the Laboratory for Analytic Sciences (LAS). Any opinions, findings, conclusions, and recommendations expressed in this material are those of the author and do not necessarily reflect the views of the sponsors. G. Reeves is with the Department of Electrical and Computer Engineering and the Department of Statistical Science, Duke University, Durham (e-mail: [email protected]).
One application of these results is to provide bounds for mutual information in terms of divergence measures that dominate relative entropy, such as the chi-square divergence; see e.g. [13], [15].
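As a small numerical illustration of this domination, one can check the bound $D(P\|Q) \le \log(1 + \chi^2(P, Q))$ (derived as inequality (18) in Section IV) on a discrete example. This snippet is an illustrative sketch, not part of the paper's development; the distributions are arbitrary choices:

```python
import math

# D(P||Q) and chi^2(P,Q) for discrete distributions on a common alphabet
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def chi2(p, q):
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]
d, c = kl(P, Q), chi2(P, Q)
print(d, math.log(1 + c))   # relative entropy vs. its chi-square upper bound
```

Here the relative entropy is about $0.275$ nats while the chi-square upper bound evaluates to about $0.489$ nats.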
A. Overview of results
The starting point of our analysis (Proposition 2) is an inequality for the integral of the $r$-th power of a function. Specifically, for any numbers $p, q, r$ with $0 < r < 1$ and $p < \frac{1-r}{r} < q$, the following inequality holds:
\[
\left( \int f^r(x) \, dx \right)^{1/r} \le C \left( \int |x|^p f(x) \, dx \right)^{\lambda} \left( \int |x|^q f(x) \, dx \right)^{1-\lambda},
\]
for all non-negative functions $f : \mathbb{R}_+ \to \mathbb{R}_+$, where $C$ and $0 < \lambda < 1$ are given explicitly in terms of the tuple $(p, q, r)$. An extension to functions defined on an arbitrary subset of $\mathbb{R}^n$ is also provided (Proposition 3).

The remainder of the paper shows how this inequality can be used to provide bounds on information measures such as Rényi entropy and mutual information. Some useful properties of the bounds include:

• Simplicity:
Beyond the existence of a density, these bounds do not require further regularity conditions such as boundedness or sub-exponential tails. As a consequence, these bounds can be applied under relatively mild technical assumptions.

• Tightness:
For some applications, the bounds can provide an accurate characterization of the underlying information measures. For example, a special case of Proposition 9 in this paper played a key role in the author's recent work [16]–[18], where it was used to bound the relative entropy between low-dimensional projections of a random vector and a Gaussian approximation.

• Geometric Interpretation:
Our bounds on the mutual information between random variables $X$ and $Y$ can be expressed in terms of the variance of the conditional density of $Y$ given $X$. Specifically, the bounds depend on integrals of the form:
\[
\int \|y\|^s \operatorname{Var}\!\left( f_{Y|X}(y \mid X) \right) dy.
\]
For $s = 0$, this integral is the expected squared $L^2$ distance between the conditional density $f_{Y|X}$ and the marginal density $f_Y$.

The paper is organized as follows: Section II provides integral inequalities for nonnegative functions; Section III gives bounds on Rényi entropy of orders less than one; and Section IV provides bounds on mutual information.

II. MOMENT INEQUALITIES
Throughout this section, we assume that $f$ is a real-valued Lebesgue measurable function defined on a measurable subset $S$ of $\mathbb{R}^n$. For any positive number $p$, the function $\|\cdot\|_p$ is defined according to
\[
\|f\|_p = \left( \int_S |f(x)|^p \, dx \right)^{1/p}.
\]
Recall that for $0 < p < 1$, the function $\|\cdot\|_p$ is not a norm because it does not satisfy the triangle inequality. The $s$-th moment of $f$ is defined according to
\[
\mu_s(f) = \int_S \|x\|^s f(x) \, dx,
\]
where $\|\cdot\|$ denotes the standard Euclidean norm on vectors.

A. Multiple Moments
Consider the following optimization problem:
\[
\begin{aligned}
\text{maximize} \quad & \|f\|_r \\
\text{subject to} \quad & f(x) \ge 0 \quad \text{for all } x \in S \\
& \mu_{s_i}(f) \le m_i \quad \text{for } 1 \le i \le k.
\end{aligned}
\]
For $r \in (0, 1)$ this is a convex optimization problem because $\|\cdot\|_r^r$ is concave and the moment constraints are linear. By standard theory in convex optimization (see e.g., [19]), it can be shown that if the problem is feasible and the maximum is finite, then the maximizer has the form
\[
f^*(x) = \left( \sum_{i=1}^k \nu_i^* \|x\|^{s_i} \right)^{\frac{1}{r-1}}, \quad \text{for all } x \in S.
\]
The parameters $\nu_1^*, \dots, \nu_k^*$ are nonnegative and the $i$-th moment constraint holds with equality for all $i$ such that $\nu_i^*$ is strictly positive, that is $\nu_i^* > 0 \implies \mu_{s_i}(f^*) = m_i$. Consequently, the maximum can be expressed in terms of a linear combination of the moments:
\[
\|f^*\|_r^r = \|(f^*)^r\|_1 = \|f^* (f^*)^{r-1}\|_1 = \sum_{i=1}^k \nu_i^* m_i.
\]
For the purposes of this paper, it is useful to consider a relative inequality in terms of the moments of the function itself. Given a number $0 < r < 1$ and vectors $s \in \mathbb{R}^k$ and $\nu \in \mathbb{R}_+^k$, the function $c_r(\nu, s)$ is defined according to
\[
c_r(\nu, s) = \left( \int_0^\infty \left( \sum_{i=1}^k \nu_i x^{s_i} \right)^{-\frac{r}{1-r}} dx \right)^{\frac{1-r}{r}},
\]
if the integral exists. Otherwise, $c_r(\nu, s)$ is defined to be positive infinity. It can be verified that $c_r(\nu, s)$ is finite provided that there exist $i, j$ such that $\nu_i$ and $\nu_j$ are strictly positive and $s_i < (1-r)/r < s_j$.

The following result can be viewed as a consequence of the constrained optimization problem described above. We provide a different and very simple proof that depends only on Hölder's inequality.

Proposition 1.
Let $f$ be a nonnegative Lebesgue measurable function defined on the positive reals $\mathbb{R}_+$. For any number $0 < r < 1$ and vectors $s \in \mathbb{R}^k$ and $\nu \in \mathbb{R}_+^k$, we have
\[
\|f\|_r \le c_r(\nu, s) \sum_{i=1}^k \nu_i \, \mu_{s_i}(f).
\]

Proof.
Let $g(x) = \sum_{i=1}^k \nu_i x^{s_i}$. Then, we have
\[
\|f\|_r^r = \left\| g^{-r} (fg)^r \right\|_1 \le \left\| g^{-r} \right\|_{\frac{1}{1-r}} \left\| (gf)^r \right\|_{\frac{1}{r}} = \left\| g^{-\frac{r}{1-r}} \right\|_1^{1-r} \left\| gf \right\|_1^r = \left( c_r(\nu, s) \sum_{i=1}^k \nu_i \, \mu_{s_i}(f) \right)^r,
\]
where the second step follows from Hölder's inequality with conjugate exponents $\frac{1}{1-r}$ and $\frac{1}{r}$.

B. Two Moments
The next result follows from Proposition 1 for the case oftwo moments.
Proposition 2.
Let $f$ be a nonnegative Lebesgue measurable function defined on the positive reals $\mathbb{R}_+$. For any numbers $p, q, r$ with $0 < r < 1$ and $p < 1/r - 1 < q$, we have
\[
\|f\|_r \le \left[ \psi_r(p, q) \right]^{\frac{1-r}{r}} \left[ \mu_p(f) \right]^{\lambda} \left[ \mu_q(f) \right]^{1-\lambda},
\]
where $\lambda = (q + 1 - 1/r)/(q - p)$ and
\[
\psi_r(p, q) = \frac{1}{q - p} \, \widetilde{B}\!\left( \frac{r \lambda}{1-r}, \frac{r (1-\lambda)}{1-r} \right), \tag{1}
\]
where $\widetilde{B}(a, b) = B(a, b) (a + b)^{a+b} a^{-a} b^{-b}$ and $B(a, b)$ is the Beta function.

Proof. Letting $s = (p, q)$ and $\nu = (\gamma^{1-\lambda}, \gamma^{-\lambda})$ with $\gamma > 0$, we have
\[
\left[ c_r(\nu, s) \right]^{\frac{r}{1-r}} = \int_0^\infty \left( \gamma^{1-\lambda} x^p + \gamma^{-\lambda} x^q \right)^{-\frac{r}{1-r}} dx.
\]
Making the change of variable $x = (\gamma u)^{\frac{1}{q-p}}$ leads to
\[
\left[ c_r(\nu, s) \right]^{\frac{r}{1-r}} = \frac{1}{q - p} \int_0^\infty \frac{u^{b-1}}{(1 + u)^{a+b}} \, du = \frac{B(a, b)}{q - p},
\]
where $a = \frac{r \lambda}{1-r}$ and $b = \frac{r (1-\lambda)}{1-r}$ and the second step follows from the integral representation of the Beta function [20, Eq. (1.1.19)]. Therefore, by Proposition 1, the inequality
\[
\|f\|_r \le \left( \frac{B(a, b)}{q - p} \right)^{\frac{1-r}{r}} \left( \gamma^{1-\lambda} \mu_p(f) + \gamma^{-\lambda} \mu_q(f) \right),
\]
holds for all $\gamma > 0$. Evaluating this inequality with
\[
\gamma = \frac{\lambda \, \mu_q(f)}{(1 - \lambda) \, \mu_p(f)},
\]
leads to the stated result.

The special case $r = 1/2$ admits the simplified expression
\[
\psi_{1/2}(p, q) = \frac{\pi \, \lambda^{-\lambda} (1 - \lambda)^{-(1-\lambda)}}{(q - p) \sin(\pi \lambda)}, \tag{2}
\]
where we have used Euler's reflection formula for the Beta function [20, Theorem 1.2.1].

Next, we consider an extension of Proposition 2 for functions defined on $\mathbb{R}^n$. Given any measurable subset $S$ of $\mathbb{R}^n$ we define
\[
\omega(S) = \operatorname{Vol}\!\left( B^n \cap \operatorname{cone}(S) \right), \tag{3}
\]
where $B^n = \{ u \in \mathbb{R}^n : \|u\| \le 1 \}$ is the $n$-dimensional Euclidean ball of radius one and
\[
\operatorname{cone}(S) = \{ x \in \mathbb{R}^n : t x \in S \text{ for some } t > 0 \}.
\]
The function $\omega(S)$ is proportional to the surface measure of the projection of $S$ on the Euclidean sphere and satisfies
\[
\omega(S) \le \omega(\mathbb{R}^n) = \frac{\pi^{n/2}}{\Gamma\!\left( \frac{n}{2} + 1 \right)}, \tag{4}
\]
for all $S \subseteq \mathbb{R}^n$. Note that $\omega(\mathbb{R}_+) = 1$ and $\omega(\mathbb{R}) = 2$.

Proposition 3.
Let $f$ be a nonnegative Lebesgue measurable function defined on a subset $S$ of $\mathbb{R}^n$. For any numbers $p, q, r$ with $0 < r < 1$ and $p < 1/r - 1 < q$, we have
\[
\|f\|_r \le \left[ \omega(S) \, \psi_r(p, q) \right]^{\frac{1-r}{r}} \left[ \mu_{np}(f) \right]^{\lambda} \left[ \mu_{nq}(f) \right]^{1-\lambda},
\]
where $\lambda = (q + 1 - 1/r)/(q - p)$ and $\psi_r(p, q)$ is given by (1).

Proof. Let $f$ be extended to $\mathbb{R}^n$ using the rule $f(x) = 0$ for all $x$ outside of $S$ and let $g : \mathbb{R}_+ \to \mathbb{R}_+$ be defined according to
\[
g(y) = \frac{1}{n} \int_{S^{n-1}} f\!\left( y^{1/n} u \right) d\sigma(u),
\]
where $S^{n-1} = \{ u \in \mathbb{R}^n : \|u\| = 1 \}$ is the Euclidean sphere of radius one and $\sigma(u)$ is the surface measure of the sphere. We will show that
\[
\|f\|_r \le \left( \omega(S) \right)^{\frac{1-r}{r}} \|g\|_r \tag{5}
\]
\[
\mu_{ns}(f) = \mu_s(g). \tag{6}
\]
Then, the stated inequality follows from applying Proposition 2 to the function $g$.

In order to prove (5), we begin with a transformation into polar coordinates:
\[
\|f\|_r^r = \int_0^\infty \int_{S^{n-1}} |f(tu)|^r \, t^{n-1} \, d\sigma(u) \, dt. \tag{7}
\]
Letting $\mathbb{1}_{\operatorname{cone}(S)}(x)$ denote the indicator function of the set $\operatorname{cone}(S)$, the integral over the sphere can be bounded using:
\[
\int_{S^{n-1}} |f(tu)|^r \, d\sigma(u) = \int_{S^{n-1}} \mathbb{1}_{\operatorname{cone}(S)}(u) \, |f(tu)|^r \, d\sigma(u) \overset{(a)}{\le} \left( \int_{S^{n-1}} \mathbb{1}_{\operatorname{cone}(S)}(u) \, d\sigma(u) \right)^{1-r} \left( \int_{S^{n-1}} |f(tu)| \, d\sigma(u) \right)^r \overset{(b)}{=} n \left( \omega(S) \right)^{1-r} g^r(t^n), \tag{8}
\]
where: (a) follows from Hölder's inequality with conjugate exponents $\frac{1}{1-r}$ and $\frac{1}{r}$; and (b) follows from the definition of $g$ and the fact that
\[
\omega(S) = \int_0^1 \int_{S^{n-1}} \mathbb{1}_{\operatorname{cone}(S)}(u) \, t^{n-1} \, d\sigma(u) \, dt = \frac{1}{n} \int_{S^{n-1}} \mathbb{1}_{\operatorname{cone}(S)}(u) \, d\sigma(u).
\]
Plugging (8) back into (7) and then making the change of variables $t \to y^{1/n}$ yields
\[
\|f\|_r^r \le n \left( \omega(S) \right)^{1-r} \int_0^\infty g^r(t^n) \, t^{n-1} \, dt = \left( \omega(S) \right)^{1-r} \|g\|_r^r.
\]
The proof of (6) follows along similar lines. We have
\[
\mu_{ns}(f) \overset{(a)}{=} \int_0^\infty \int_{S^{n-1}} t^{ns} f(tu) \, t^{n-1} \, d\sigma(u) \, dt \overset{(b)}{=} \frac{1}{n} \int_0^\infty \int_{S^{n-1}} y^s f\!\left( y^{1/n} u \right) d\sigma(u) \, dy = \mu_s(g),
\]
where (a) follows from a transformation into polar coordinates and (b) follows from the change of variable $t = y^{1/n}$.

III. RÉNYI ENTROPY BOUNDS
Let $X$ be a random vector that has a density $f(x)$ with respect to Lebesgue measure on $\mathbb{R}^n$. The differential Rényi entropy of order $r \in (0, 1) \cup (1, \infty)$ is defined according to [2]:
\[
h_r(X) = \frac{1}{1 - r} \log \left( \int_{\mathbb{R}^n} f^r(x) \, dx \right).
\]
The Rényi entropy is continuous and non-increasing in $r$. If the support set $S = \{ x \in \mathbb{R}^n : f(x) > 0 \}$ has finite measure then the limit as $r$ converges to zero is given by $h_0(X) = \log \operatorname{Vol}(S)$. If the support does not have finite measure then $h_r(X)$ increases to infinity as $r$ decreases to zero. The case $r = 1$ is given by the Shannon differential entropy:
\[
h_1(X) = -\int_S f(x) \log f(x) \, dx.
\]
Given a random variable $X$ that is not identically zero and numbers $p, q, r$ with $0 < r < 1$ and $p < 1/r - 1 < q$, we define the function
\[
L_r(X; p, q) = \frac{r \lambda}{1 - r} \log \mathbb{E}[|X|^p] + \frac{r (1 - \lambda)}{1 - r} \log \mathbb{E}[|X|^q],
\]
where $\lambda = (q + 1 - 1/r)/(q - p)$.

The next result, which follows directly from Proposition 3, provides an upper bound on the Rényi entropy.

Proposition 4.
Let $X$ be a random vector with a density on $\mathbb{R}^n$ and support set $S$. For any numbers $p, q, r$ with $0 < r < 1$ and $p < 1/r - 1 < q$, we have
\[
h_r(X) \le \log \omega(S) + \log \psi_r(p, q) + L_r(\|X\|^n; p, q). \tag{9}
\]

The relationship between Proposition 4 and previous results depends on whether the moment $p$ is equal to zero:

• One-moment inequalities: If $p = 0$ then there exists a distribution such that (9) holds with equality. This is because the zero-moment constraint ensures that the function that maximizes the Rényi entropy integrates to one. In this case, Proposition 4 is equivalent to previous results that focused on distributions that maximize Rényi entropy subject to a single moment constraint [3]–[5]. With some abuse of terminology, we refer to these bounds as one-moment inequalities.

• Two-moment inequalities: If $p \ne 0$ then the right-hand side of (9) corresponds to the Rényi entropy of a non-negative function that might not integrate to one. Nevertheless, the expression provides an upper bound on the Rényi entropy for any density with the same moments. We refer to the bounds obtained using a general pair $(p, q)$ as two-moment inequalities.

The contribution of two-moment inequalities is that they lead to tighter bounds. To quantify the tightness, we define $\Delta_r(X; p, q)$ to be the gap between the right-hand side and left-hand side of (9) corresponding to the pair $(p, q)$, that is
\[
\Delta_r(X; p, q) = \log \omega(S) + \log \psi_r(p, q) + L_r(\|X\|^n; p, q) - h_r(X).
\]
The gaps corresponding to the optimal two-moment and one-moment inequalities are defined according to:
\[
\Delta_r(X) = \inf_{p, q} \Delta_r(X; p, q), \qquad \widetilde{\Delta}_r(X) = \inf_q \Delta_r(X; 0, q).
\]

A. Some consequences of these bounds

By Lyapunov's inequality, the mapping $s \mapsto \frac{1}{s} \log \mathbb{E}[|X|^s]$ is nondecreasing on $[0, \infty)$ and thus
\[
L_r(X; p, q) \le L_r(X; 0, q) = \frac{1}{q} \log \mathbb{E}[|X|^q], \qquad p \ge 0. \tag{10}
\]
In other words, the case $p = 0$ provides an upper bound on $L_r(X; p, q)$ for nonnegative $p$.
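Proposition 4 is easy to sanity-check numerically. The following sketch (not from the paper; the Exp(1) example, tolerances, and grid sizes are my own illustrative choices) evaluates both sides of (9) for a standard exponential random variable on $S = \mathbb{R}_+$, so that $n = 1$ and $\omega(S) = 1$, with $r = 1/2$, $p = 0$, $q = 2$:

```python
import math

def integral(g, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

f = lambda x: math.exp(-x)       # Exp(1) density on S = R_+ (n = 1, omega(S) = 1)

r, p, q = 0.5, 0.0, 2.0          # requires p < 1/r - 1 = 1 < q
lam = (q + 1 - 1 / r) / (q - p)
a, b = r * lam / (1 - r), r * (1 - lam) / (1 - r)
B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
psi = B * (a + b) ** (a + b) * a ** (-a) * b ** (-b) / (q - p)      # psi_r(p, q)

h_r = math.log(integral(lambda x: f(x) ** r, 0.0, 60.0)) / (1 - r)  # h_r(X)
Ep = integral(lambda x: x ** p * f(x), 0.0, 60.0)                   # E[X^p] = 1
Eq = integral(lambda x: x ** q * f(x), 0.0, 60.0)                   # E[X^q] = 2
L = (r * lam / (1 - r)) * math.log(Ep) + (r * (1 - lam) / (1 - r)) * math.log(Eq)

bound = math.log(1.0) + math.log(psi) + L    # log omega(S) + log psi_r + L_r
print(h_r, bound)                            # 2 log 2 <= log pi + 0.5 log 2
```

Here $h_{1/2}(X) = 2\log 2 \approx 1.386$ while the right-hand side of (9) evaluates to $\log \pi + \tfrac{1}{2}\log 2 \approx 1.491$, so the bound holds with a modest gap.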
Alternatively, we also have the lower bound
\[
L_r(X; p, q) \ge \frac{r}{1 - r} \log \mathbb{E}\!\left[ |X|^{\frac{1-r}{r}} \right], \tag{11}
\]
which follows from the convexity of $s \mapsto \log \mathbb{E}[|X|^s]$.

A useful property of $L_r(X; p, q)$ is that it is additive with respect to the product of independent random variables. Specifically, if $X$ and $Y$ are independent, then
\[
L_r(XY; p, q) = L_r(X; p, q) + L_r(Y; p, q). \tag{12}
\]
One consequence is that multiplication by a bounded random variable cannot increase the Rényi entropy by an amount that exceeds the gap of the two-moment inequality with nonnegative moments.

(Footnote: A more accurate name would be two-moment inequalities under the constraint that one of the moments is the zeroth moment.)

Proposition 5. Let $Y$ be a random vector on $\mathbb{R}^n$ with finite Rényi entropy of order $0 < r < 1$, and let $X$ be an independent random variable that satisfies $0 < X \le t$. Then,
\[
h_r(XY) \le h_r(tY) + \Delta_r(Y; p, q),
\]
for all $0 \le p < 1/r - 1 < q$.

Proof. Let $Z = XY$ and let $S_Z$ and $S_Y$ denote the support sets of $Z$ and $Y$, respectively. The assumption that $X$ is non-negative means that $\operatorname{cone}(S_Z) = \operatorname{cone}(S_Y)$. We have
\[
h_r(Z) \overset{(a)}{\le} \log \omega(S_Z) + \log \psi_r(p, q) + L_r(\|Z\|^n; p, q) \overset{(b)}{=} h_r(Y) + L_r(|X|^n; p, q) + \Delta_r(Y; p, q) \overset{(c)}{\le} h_r(Y) + n \log t + \Delta_r(Y; p, q),
\]
where: (a) follows from Proposition 4; (b) follows from (12) and the definition of $\Delta_r(Y; p, q)$; and (c) follows from (10) and the assumption $|X| \le t$. Finally, recalling that $h_r(tY) = h_r(Y) + n \log t$ completes the proof.

B. Example with lognormal distribution

If $W \sim \mathcal{N}(\mu, \sigma^2)$ then the random variable $X = \exp(W)$ has a lognormal distribution with parameters $(\mu, \sigma^2)$. The Rényi entropy is given by
\[
h_r(X) = \mu + \frac{1}{2} \left( \frac{1-r}{r} \right) \sigma^2 + \frac{1}{2} \log\!\left( 2 \pi r^{\frac{1}{r-1}} \sigma^2 \right),
\]
and the logarithm of the $s$-th moment is given by
\[
\log \mathbb{E}[|X|^s] = \mu s + \frac{1}{2} \sigma^2 s^2.
\]
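The two lognormal formulas above can be checked by direct numerical integration. This is an illustrative sketch only; the parameter values $(\mu, \sigma^2, r)$ and the quadrature settings are arbitrary choices of mine, not from the paper:

```python
import math

def integral(g, a, b, n=400_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

mu, sig, r = 0.2, 1.0, 0.6

def f(x):   # lognormal(mu, sig^2) density
    return math.exp(-(math.log(x) - mu) ** 2 / (2 * sig ** 2)) \
           / (x * sig * math.sqrt(2 * math.pi))

# Renyi entropy: numeric integral vs. the closed form above
h_num = math.log(integral(lambda x: f(x) ** r, 0.0, 2000.0)) / (1 - r)
h_exact = mu + 0.5 * ((1 - r) / r) * sig ** 2 \
          + 0.5 * math.log(2 * math.pi * r ** (1 / (r - 1)) * sig ** 2)

# second moment: numeric integral vs. log E[X^s] = mu s + sig^2 s^2 / 2 at s = 2
m2_num = integral(lambda x: x ** 2 * f(x), 0.0, 2000.0)
m2_exact = math.exp(2 * mu + 2 * sig ** 2)
print(h_num, h_exact)
print(m2_num, m2_exact)
```

Both pairs agree to within the quadrature error of the midpoint rule.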
With a bit of work, it can be shown that the gap of the optimal two-moment inequality does not depend on the parameters $(\mu, \sigma^2)$ and is given by
\[
\Delta_r(X) = \log\!\left( \widetilde{B}\!\left( \frac{r}{2(1-r)}, \frac{r}{2(1-r)} \right) \sqrt{ \frac{r}{4(1-r)} } \right) + \frac{1}{2} - \frac{1}{2} \log\!\left( 2 \pi r^{\frac{1}{r-1}} \right). \tag{13}
\]
The details of this derivation are given in Appendix B-A. Meanwhile, the gap of the optimal one-moment inequality is given by
\[
\widetilde{\Delta}_r(X) = \inf_q \left[ \log\!\left( \frac{1}{q} \widetilde{B}\!\left( \frac{r}{1-r} - \frac{1}{q}, \frac{1}{q} \right) \right) + \frac{1}{2} q \sigma^2 \right] - \frac{1}{2} \left( \frac{1-r}{r} \right) \sigma^2 - \frac{1}{2} \log\!\left( 2 \pi r^{\frac{1}{r-1}} \sigma^2 \right). \tag{14}
\]
The functions $\Delta_r(X)$ and $\widetilde{\Delta}_r(X)$ are illustrated in Figure 1 as a function of $r$ for various $\sigma^2$. The function $\Delta_r(X)$ is bounded uniformly with respect to $r$ and converges to zero as $r$ increases to one. The tightness of the two-moment inequality in this regime follows from the fact that the lognormal distribution maximizes Shannon entropy subject to a constraint on $\mathbb{E}[\log X]$. By contrast, the function $\widetilde{\Delta}_r(X)$ varies with the parameter $\sigma^2$. For any fixed $r \in (0, 1)$, it can be shown that $\widetilde{\Delta}_r(X)$ increases to infinity if $\sigma^2$ converges to zero or infinity.

Fig. 1. Comparison of upper bounds on Rényi entropy for the lognormal distribution as a function of the order $r$ for various $\sigma^2$.

Fig. 2. Comparison of upper bounds on Rényi entropy for the multivariate Gaussian distribution $\mathcal{N}(0, I_n)$ as a function of the dimension $n$. The solid black line is the gap of the optimal two-moment inequality for the lognormal distribution.

C. Example with multivariate Gaussian distribution

Next, we consider the case where $Y \sim \mathcal{N}(0, I_n)$ is an $n$-dimensional Gaussian vector with mean zero and identity covariance. The Rényi entropy is given by
\[
h_r(Y) = \frac{n}{2} \log\!\left( 2 \pi r^{\frac{1}{r-1}} \right),
\]
and the $s$-th moment of the magnitude $\|Y\|$ is given by
\[
\mathbb{E}[\|Y\|^s] = 2^{s/2} \, \frac{\Gamma\!\left( \frac{n+s}{2} \right)}{\Gamma\!\left( \frac{n}{2} \right)}.
\]
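Both displayed Gaussian formulas admit a quick numerical check by integrating radially. The following sketch (the dimension, order, and moment are arbitrary illustrative choices, not values from the paper) compares the closed forms against midpoint-rule quadrature:

```python
import math

def integral(g, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

n_dim, s, r = 3, 1.5, 0.4
A = 2 * math.pi ** (n_dim / 2) / math.gamma(n_dim / 2)   # area of the unit sphere

# E[||Y||^s] for Y ~ N(0, I_n) via the radial (chi) density
chi = lambda t: A * t ** (n_dim - 1) * math.exp(-t * t / 2) \
                / (2 * math.pi) ** (n_dim / 2)
m_num = integral(lambda t: t ** s * chi(t), 0.0, 40.0)
m_exact = 2 ** (s / 2) * math.gamma((n_dim + s) / 2) / math.gamma(n_dim / 2)

# h_r(Y) via the radial integral of f^r
fr = lambda t: A * t ** (n_dim - 1) * math.exp(-r * t * t / 2) \
               / (2 * math.pi) ** (r * n_dim / 2)
h_num = math.log(integral(fr, 0.0, 40.0)) / (1 - r)
h_exact = (n_dim / 2) * math.log(2 * math.pi * r ** (1 / (r - 1)))
print(m_num, m_exact)
print(h_num, h_exact)
```

Both pairs agree to high accuracy, confirming the moment and entropy formulas term by term.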
As the dimension $n$ increases, it can be shown that the gap of the optimal two-moment inequality converges to the gap for the lognormal distribution. The proof of the following result is given in Appendix B-C.

Proposition 6. If $Y \sim \mathcal{N}(0, I_n)$ then,
\[
\lim_{n \to \infty} \Delta_r(Y) = \Delta_r(X),
\]
where $X$ has a lognormal distribution.

The functions $\Delta_r(Y)$ and $\widetilde{\Delta}_r(Y)$ are illustrated in Figure 2. Both functions are increasing in the dimension $n$. However, while $\Delta_r(Y)$ converges to a finite limit, $\widetilde{\Delta}_r(Y)$ increases without bound. For any fixed integer $n$, it can be shown that both $\Delta_r(Y)$ and $\widetilde{\Delta}_r(Y)$ converge to zero as $r$ increases to one. This behavior follows from the fact that the Gaussian distribution is the maximum entropy distribution for Shannon entropy under a second moment constraint.

D. Inequalities for differential entropy

Proposition 4 can also be used to recover some known inequalities for differential entropy by considering the limiting behavior as $r$ converges to one. For example, it is well known that the differential entropy of an $n$-dimensional random vector $X$ with finite second moment satisfies
\[
h(X) \le \frac{n}{2} \log\!\left( 2 \pi e \, \frac{\mathbb{E}[\|X\|^2]}{n} \right), \tag{15}
\]
with equality if and only if the entries of $X$ are i.i.d. zero-mean Gaussian. A generalization of this result in terms of an arbitrary positive moment is given by
\[
h(X) \le \log \frac{\Gamma\!\left( \frac{n}{s} + 1 \right)}{\Gamma\!\left( \frac{n}{2} + 1 \right)} + \frac{n}{2} \log \pi + \frac{n}{s} \log\!\left( \frac{e s \, \mathbb{E}[\|X\|^s]}{n} \right), \tag{16}
\]
for all $s > 0$. Note that (15) corresponds to the case $s = 2$. Inequality (16) can be proved as an immediate consequence of Proposition 4 and the fact that $h_r(X)$ is non-increasing in $r$. Using properties of the beta function given in Appendix A, it is straightforward to verify that
\[
\lim_{r \to 1} \psi_r(0, q) = (e q)^{1/q} \, \Gamma\!\left( \frac{1}{q} + 1 \right),
\]
for all $q > 0$. Combining this result with Proposition 4 and (10) leads to
\[
h(X) \le \log \omega(S) + \log \Gamma\!\left( \frac{1}{q} + 1 \right) + \frac{1}{q} \log\!\left( e q \, \mathbb{E}[\|X\|^{nq}] \right).
\]
Using (4) and making the substitution $s = n q$ leads to (16).

Another example follows from the fact that the lognormal distribution maximizes the differential entropy of a positive random variable $X$ subject to constraints on the mean and variance of $\log(X)$, and hence
\[
h(X) \le \mathbb{E}[\log(X)] + \frac{1}{2} \log\!\left( 2 \pi e \operatorname{Var}(\log(X)) \right), \tag{17}
\]
with equality if and only if $X$ is lognormal. In Appendix B-D, it is shown how this inequality can be proved using our two-moment inequalities, by studying the behavior as both $p$ and $q$ converge to zero as $r$ increases to one.

IV. MUTUAL INFORMATION BOUNDS

A. Relative entropy and chi-square divergence

Let $P$ and $Q$ be distributions defined on a common probability space that have densities $p$ and $q$ with respect to a dominating measure $\mu$. The relative entropy (or Kullback–Leibler divergence) is defined according to
\[
D(P \| Q) = \int p \log\!\left( \frac{p}{q} \right) d\mu,
\]
and the chi-square divergence is defined according to
\[
\chi^2(P, Q) = \int \frac{(p - q)^2}{q} \, d\mu.
\]
The chi-square divergence is equal to the squared $L^2$ distance between the scaled densities $p/\sqrt{q}$ and $\sqrt{q}$. The chi-square divergence can also be interpreted as the first non-zero term in the power series expansion of the relative entropy [13, Lemma 4]. More generally, the chi-square divergence provides an upper bound on the relative entropy, via
\[
D(P \| Q) \le \log\!\left( 1 + \chi^2(P, Q) \right). \tag{18}
\]
The proof of this inequality follows straightforwardly from Jensen's inequality and the concavity of the logarithm; see e.g., [21, Theorem 5].

Given a random pair $(X, Y)$ the mutual information between $X$ and $Y$ is defined according to
\[
I(X; Y) = D(P_{X,Y} \| P_X \times P_Y).
\]
From (18), we see that the mutual information can always be upper bounded using
\[
I(X; Y) \le \log\!\left( 1 + \chi^2(P_{X,Y}, P_X \times P_Y) \right). \tag{19}
\]
The next section provides bounds on the mutual information that can improve upon this inequality.

B.
Mutual information and variance of conditional density

Let $(X, Y)$ be a random pair such that the conditional distribution of $Y$ given $X$ has a density $f_{Y|X}(y|x)$ with respect to Lebesgue measure on $\mathbb{R}^n$. Note that the marginal density of $Y$ is given by $f_Y(y) = \mathbb{E}[f_{Y|X}(y|X)]$. To simplify notation, we will write $f(y|x)$ and $f(y)$ where the subscripts are implicit. The support set of $Y$ is denoted by $S_Y$.

The measure of the dependence between $X$ and $Y$ that is used in our bounds can be understood in terms of the variance of the conditional density. For each $y$, the conditional density $f(y|X)$ evaluated with a random realization of $X$ is a random variable. The variance of this random variable is given by
\[
\operatorname{Var}(f(y|X)) = \mathbb{E}\!\left[ \left( f(y|X) - f(y) \right)^2 \right],
\]
where we have used the fact that the marginal density $f(y)$ is the expectation of $f(y|X)$. The $s$-th moment of the variance of the conditional density is defined according to
\[
V_s(Y|X) = \int_{S_Y} \|y\|^s \operatorname{Var}(f(y|X)) \, dy.
\]
The function $V_s(Y|X)$ is nonnegative and equal to zero if and only if $X$ and $Y$ are independent.

For $t \in (0, 1]$ the function $\kappa(t)$ is defined according to
\[
\kappa(t) = \sup_{u \in (0, \infty)} \frac{\log(1 + u)}{u^t}.
\]
Properties of this function are given in Appendix C, where it is shown that $1/(e t) < \kappa(t) \le 1/t$ with equality on the right when $t = 1$.

We are now ready to give the main results of this section, which are bounds on the mutual information. We begin with a general upper bound in terms of the variance of the conditional density.

Proposition 7. For any $0 < t \le 1$, the mutual information satisfies
\[
I(X; Y) \le \kappa(t) \int_{S_Y} [f(y)]^{1 - 2t} \, [\operatorname{Var}(f(y|X))]^t \, dy.
\]

Proof.
We use the following series of inequalities:
\[
I(X; Y) \overset{(a)}{=} \int f(y) \, D\!\left( P_{X|Y=y} \,\|\, P_X \right) dy \overset{(b)}{\le} \int f(y) \log\!\left( 1 + \chi^2(P_{X|Y=y}, P_X) \right) dy \overset{(c)}{=} \int f(y) \log\!\left( 1 + \frac{\operatorname{Var}(f(y|X))}{f^2(y)} \right) dy \overset{(d)}{\le} \kappa(t) \int f(y) \left( \frac{\operatorname{Var}(f(y|X))}{f^2(y)} \right)^t dy,
\]
where: (a) follows from the definition of mutual information; (b) follows from (18); and (c) follows from Bayes' rule, which allows us to write the chi-square divergence in terms of the variance of the conditional density:
\[
\chi^2(P_{X|Y=y}, P_X) = \mathbb{E}\!\left[ \left( \frac{f(y|X)}{f(y)} - 1 \right)^2 \right] = \frac{\operatorname{Var}(f(y|X))}{f^2(y)}.
\]
Inequality (d) follows from the non-negativity of the variance and the definition of $\kappa(t)$.

Evaluating Proposition 7 with $t = 1$ recovers the well-known inequality $I(X; Y) \le \chi^2(P_{X,Y}, P_X \times P_Y)$. The next two results follow from the cases $0 < t < 1/2$ and $t = 1/2$, respectively.

Proposition 8. For any $0 < r < 1$, the mutual information satisfies
\[
I(X; Y) \le \kappa(t) \left( e^{h_r(Y)} \, V_0(Y|X) \right)^t,
\]
where $t = (1 - r)/(2 - r)$.

Proof. Starting with Proposition 7 and applying Hölder's inequality with conjugate exponents $\frac{1}{1-t}$ and $\frac{1}{t}$ leads to
\[
I(X; Y) \le \kappa(t) \left( \int f^r(y) \, dy \right)^{1-t} \left( \int \operatorname{Var}(f(y|X)) \, dy \right)^t = \kappa(t) \, e^{t \, h_r(Y)} \, V_0^t(Y|X),
\]
where we have used the fact that $r = (1 - 2t)/(1 - t)$.

Proposition 9. For any $p < 1 < q$, the mutual information satisfies
\[
I(X; Y) \le C(\lambda) \sqrt{ \frac{\omega(S_Y) \, V_{np}^{\lambda}(Y|X) \, V_{nq}^{1-\lambda}(Y|X)}{q - p} },
\]
where $\lambda = (q - 1)/(q - p)$ and
\[
C(\lambda) = \kappa(1/2) \sqrt{ \frac{\pi \, \lambda^{-\lambda} (1 - \lambda)^{-(1-\lambda)}}{\sin(\pi \lambda)} }.
\]

Proof. Evaluating Proposition 7 with $t = 1/2$ gives
\[
I(X; Y) \le \kappa(1/2) \int_{S_Y} \sqrt{ \operatorname{Var}(f(y|X)) } \, dy.
\]
Evaluating Proposition 3 with $r = 1/2$ leads to
\[
\left( \int_{S_Y} \sqrt{ \operatorname{Var}(f(y|X)) } \, dy \right)^2 \le \omega(S_Y) \, \psi_{1/2}(p, q) \, V_{np}^{\lambda}(Y|X) \, V_{nq}^{1-\lambda}(Y|X).
\]
Combining these inequalities with the expression for $\psi_{1/2}(p, q)$ given in (2) completes the proof.

The contribution of Propositions 8 and 9 is that they provide bounds on the mutual information in terms of quantities that can be easy to characterize. One application of these bounds is to establish conditions under which the mutual information corresponding to a sequence of random pairs $(X_k, Y_k)$ converges to zero. In this case, Proposition 8 provides a sufficient condition in terms of the Rényi entropy of $Y_k$ and the function $V_0(Y_k|X_k)$, while Proposition 9 provides a sufficient condition in terms of $V_s(Y_k|X_k)$ evaluated with two different values of $s$. These conditions are summarized in the following result.

Proposition 10. Let $(X_k, Y_k)$ be a sequence of random pairs such that the conditional distribution of $Y_k$ given $X_k$ has a density on $\mathbb{R}^n$. The following are sufficient conditions under which the mutual information $I(X_k; Y_k)$ converges to zero as $k$ increases to infinity:

(i) There exists $0 < r < 1$ such that
\[
\lim_{k \to \infty} e^{h_r(Y_k)} \, V_0(Y_k|X_k) = 0.
\]

(ii) There exist $p < 1 < q$ such that
\[
\lim_{k \to \infty} V_{np}^{\,q-1}(Y_k|X_k) \, V_{nq}^{\,1-p}(Y_k|X_k) = 0.
\]

C. Properties of the bounds

The function $V_s(Y|X)$ has a number of interesting properties. The variance of the conditional density can be expressed in terms of an expectation with respect to two independent random variables $X_1$ and $X_2$ with the same distribution as $X$ via the decomposition:
\[
\operatorname{Var}(f(y|X)) = \mathbb{E}\!\left[ f(y|X) f(y|X) - f(y|X_1) f(y|X_2) \right].
\]
Consequently, by swapping the order of the integration and expectation we obtain
\[
V_s(Y|X) = \mathbb{E}\!\left[ K_s(X, X) - K_s(X_1, X_2) \right], \tag{20}
\]
where
\[
K_s(x_1, x_2) = \int \|y\|^s f(y|x_1) f(y|x_2) \, dy.
\]
The function $K_s(x_1, x_2)$ is a positive definite kernel that does not depend on the distribution of $X$.
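The decomposition (20) can be checked numerically for the Gaussian-noise example treated later in Section IV-D, where the kernel has the closed form $K_0(x_1, x_2) = e^{-(x_1 - x_2)^2/4} / (2\sqrt{\pi})$. The following sketch (the two-point input distribution is my own illustrative choice) evaluates $V_0(Y|X)$ both directly from its definition and through (20):

```python
import math

def integral(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

a = 1.5                                  # X = +/- a with equal probability
xs, ps = (a, -a), (0.5, 0.5)
cond = lambda y, x: phi(y - x)           # f(y|x) for Y = X + W, W ~ N(0, 1)
marg = lambda y: sum(p * cond(y, x) for x, p in zip(xs, ps))

# V_0(Y|X) directly from the definition: int Var(f(y|X)) dy ...
V0_def = integral(lambda y: sum(p * cond(y, x) ** 2 for x, p in zip(xs, ps))
                  - marg(y) ** 2, -30.0, 30.0)

# ... and through the decomposition (20), using the closed-form kernel
# K_0(x1, x2) = exp(-(x1 - x2)^2 / 4) / (2 sqrt(pi)) of Section IV-D
K0 = lambda x1, x2: math.exp(-(x1 - x2) ** 2 / 4) / (2 * math.sqrt(math.pi))
EK_diag = sum(p * K0(x, x) for x, p in zip(xs, ps))
EK_prod = sum(p1 * p2 * K0(x1, x2)
              for x1, p1 in zip(xs, ps) for x2, p2 in zip(xs, ps))
V0_ker = EK_diag - EK_prod
print(V0_def, V0_ker)   # the two evaluations agree
```

Both routes give $V_0(Y|X) = \frac{1}{2}(1 - e^{-a^2})/(2\sqrt{\pi}) \approx 0.126$ for this input, illustrating that (20) trades an $n$-dimensional integral over $y$ for expectations of a fixed kernel.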
For $s = 0$, this kernel has been studied previously in the machine learning literature [22], where it is referred to as the expected likelihood kernel.

The variance of the conditional density also satisfies a data-processing inequality. Suppose that $U \to X \to Y$ forms a Markov chain. Then, the square of the conditional density of $Y$ given $U$ can be expressed as
\[
f_{Y|U}^2(y|u) = \mathbb{E}\!\left[ f_{Y|X}(y|X_1') \, f_{Y|X}(y|X_2') \mid U = u \right],
\]
where $(U, X_1', X_2') \sim P_U P_{X|U} P_{X|U}$. Combining this expression with (20) yields
\[
V_s(Y|U) = \mathbb{E}\!\left[ K_s(X_1', X_2') - K_s(X_1, X_2) \right], \tag{21}
\]
where we recall that $(X_1, X_2)$ are independent copies of $X$. Finally, it is easy to verify that the function $V_s(Y|X)$ satisfies
\[
V_s(aY|X) = |a|^{s - n} \, V_s(Y|X),
\]
for all $a \ne 0$. Using this scaling relationship we see that the sufficient conditions in Proposition 10 are invariant to scaling of $Y$.

D. Example with Gaussian noise

We now provide a specific example of our bounds on the mutual information. Let $(X, Y)$ be distributed according to
\[
Y = X + W, \tag{22}
\]
where $W \sim \mathcal{N}(0, 1)$ is independent of $X$. In this case, it is well known that the mutual information satisfies
\[
I(X; Y) \le \frac{1}{2} \log(1 + \operatorname{Var}(X)), \tag{23}
\]
where equality is attained when $X$ is Gaussian. This inequality follows straightforwardly from the fact that the Gaussian distribution maximizes differential entropy subject to a second moment constraint. One of the limitations of this bound is that it can be loose when the second moment is dominated by events that have small probability. In fact, it is easy to construct examples for which $X$ does not have a finite second moment and yet $I(X; Y)$ is arbitrarily close to zero. Our results provide bounds on $I(X; Y)$ that are significantly less sensitive to the effects of rare events.
To begin, observe that the product of the conditional densities can be factored according to
\[
f(y|x_1) f(y|x_2) = \phi\!\left( \sqrt{2} \, y - \frac{x_1 + x_2}{\sqrt{2}} \right) \phi\!\left( \frac{x_1 - x_2}{\sqrt{2}} \right),
\]
where $\phi(x) = (2\pi)^{-1/2} \exp(-x^2/2)$ is the density of the standard Gaussian distribution. Integrating with respect to $y$ leads to
\[
K_s(x_1, x_2) = 2^{-\frac{s+1}{2}} \, \mathbb{E}\!\left[ \left| W + \frac{x_1 + x_2}{\sqrt{2}} \right|^s \right] \phi\!\left( \frac{x_1 - x_2}{\sqrt{2}} \right).
\]
For the case $s = 0$, the function $K_0(x_1, x_2)$ is proportional to the standard Gaussian kernel and we have
\[
V_0(Y|X) = \frac{1}{2 \sqrt{\pi}} \left[ 1 - \mathbb{E}\!\left[ e^{-\frac{(X_1 - X_2)^2}{4}} \right] \right].
\]
This expression shows that $V_0(Y|X)$ is a measure of the variation in $X$.

A useful property of $V_0(Y|X)$ is that the conditions under which it converges to zero are weaker than the conditions needed for other measures of variation, such as the variance. To see why, observe that the expectation is bounded uniformly with respect to $(X_1, X_2)$. In particular, for every $\epsilon > 0$ and $x_0 \in \mathbb{R}$, we have
\[
1 - \mathbb{E}\!\left[ e^{-\frac{(X_1 - X_2)^2}{4}} \right] \le \epsilon^2 + 2 \, \mathbb{P}[\, |X - x_0| \ge \epsilon \,],
\]
where we have used the inequality $1 - e^{-x} \le x$ and the fact that $\mathbb{P}[\, |X_1 - X_2| \ge 2\epsilon \,] \le 2 \, \mathbb{P}[\, |X - x_0| \ge \epsilon \,]$. Therefore, $V_0(Y|X)$ is small provided that $X$ is close to a constant value with high probability.

To study some further properties of these bounds, we now focus on the case where $X$ is a Gaussian scalar mixture generated according to
\[
X = A \sqrt{U}, \qquad A \sim \mathcal{N}(0, 1), \quad U \ge 0, \tag{24}
\]
with $A$ and $U$ independent. In this case, the expectations with respect to the kernel $K_s(x_1, x_2)$ can be computed explicitly, leading to
\[
V_s(Y|X) = \frac{\Gamma\!\left( \frac{s+1}{2} \right)}{2 \pi} \, \mathbb{E}\!\left[ (1 + 2 U_1)^{s/2} - \frac{(1 + U_1)^{s/2} (1 + U_2)^{s/2}}{\left( 1 + \frac{U_1 + U_2}{2} \right)^{\frac{s+1}{2}}} \right].
\]
It can be shown that this expression depends primarily on the magnitude of $U$.
This is not surprising given that $X$ converges to a constant if and only if $U$ converges to zero.

Our results can also be used to bound the mutual information $I(U; Y)$ by noting that $U \to X \to Y$ forms a Markov chain, and taking advantage of the characterization provided in (21). Letting $X_1' = A_1 \sqrt{U}$ and $X_2' = A_2 \sqrt{U}$ with $(A_1, A_2, U)$ mutually independent, leads to
\[
V_s(Y|U) = \frac{\Gamma\!\left( \frac{s+1}{2} \right)}{2 \pi} \, \mathbb{E}\!\left[ (1 + U_1)^{\frac{s-1}{2}} - \frac{(1 + U_1)^{s/2} (1 + U_2)^{s/2}}{\left( 1 + \frac{U_1 + U_2}{2} \right)^{\frac{s+1}{2}}} \right].
\]
In this case, $V_s(Y|U)$ is a measure of the variation in $U$. To study its behavior, we consider the simple upper bound
\[
V_s(Y|U) \le \frac{\Gamma\!\left( \frac{s+1}{2} \right)}{2 \pi} \, \mathbb{P}[U_1 \ne U_2] \, \mathbb{E}\!\left[ (1 + U)^{\frac{s-1}{2}} \right].
\]
This bound shows that if $s \le 1$ then $V_s(Y|U)$ is bounded uniformly with respect to distributions on $U$, and if $s > 1$ then $V_s(Y|U)$ is bounded in terms of the $\frac{s-1}{2}$-th moment of $1 + U$. In conjunction with Propositions 8 and 9, the function $V_s(Y|U)$ provides bounds on the mutual information $I(U; Y)$ that can be expressed in terms of simple expectations involving two independent copies of $U$. Figure 3 provides an illustration of the upper bound in Proposition 9 for the case where $U$ is a discrete random variable supported on two points and $X$ and $Y$ are generated according to (22) and (24). This example shows that there exist sequences of distributions for which our upper bounds on the mutual information converge to zero while the chi-square divergence between $P_{X,Y}$ and $P_X \times P_Y$ is bounded away from zero.

V. CONCLUSION

This paper provides bounds on Rényi entropy and mutual information that are based on a relatively simple two-moment inequality. One of the main takeaways from our analysis is that sometimes two carefully chosen moments are all that is needed to provide an accurate characterization. Extensions to inequalities with more moments are also worth exploring.

Fig. 3.
Bounds on the mutual information \(I(U; Y)\) when \(U \sim (1 - \epsilon)\, \delta_0 + \epsilon\, \delta_{a(\epsilon)}\), with \(a(\epsilon) = 1 + 1/\sqrt{\epsilon}\), and \(X\) and \(Y\) are generated according to (22) and (24). The bound from Proposition 9 is evaluated with \(p = 0\) and \(q = 2\).

APPENDIX A
THE GAMMA AND BETA FUNCTIONS

This section reviews some properties of the gamma and beta functions. For \(x > 0\), the gamma function is defined according to \(\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, \mathrm{d}t\). Binet's formula for the logarithm of the gamma function [20, Theorem 1.6.3] gives

\[
\log \Gamma(x) = \left( x - \frac{1}{2} \right) \log x - x + \frac{1}{2} \log(2\pi) + \theta(x), \tag{25}
\]

where the remainder term \(\theta(x)\) is convex and non-increasing with \(\lim_{x \to 0} \theta(x) = \infty\) and \(\lim_{x \to \infty} \theta(x) = 0\). Euler's reflection formula [20, Theorem 1.2.1] gives

\[
\Gamma(x)\, \Gamma(1 - x) = \frac{\pi}{\sin(\pi x)}. \tag{26}
\]

For \(x, y > 0\) the beta function is defined according to \(\mathrm{B}(x, y) = \Gamma(x)\Gamma(y)/\Gamma(x + y)\). The beta function can also be expressed in integral form as [20, pg. 7]

\[
\mathrm{B}(x, y) = \int_0^\infty \frac{s^{x-1}}{(1 + s)^{x+y}}\, \mathrm{d}s. \tag{27}
\]

Recall that \(\widetilde{\mathrm{B}}(x, y) = \mathrm{B}(x, y)\, (x + y)^{x+y} x^{-x} y^{-y}\). Using (25) leads to

\[
\log\!\left( \widetilde{\mathrm{B}}(x, y) \sqrt{\frac{x y}{2\pi (x + y)}} \right) = \theta(x) + \theta(y) - \theta(x + y). \tag{28}
\]

It can also be shown that [23, Equation (2), pg. 2]

\[
\widetilde{\mathrm{B}}(x, y) \ge \frac{x + y}{x y}. \tag{29}
\]

APPENDIX B
DETAILS FOR RÉNYI ENTROPY EXAMPLES

This appendix studies properties of the two-moment inequalities for Rényi entropy described in Section III.

A. Lognormal distribution

Let \(X\) be a lognormal random variable with parameters \((\mu, \sigma^2)\) and consider the parametrization

\[
p = \frac{1 - r}{r} - (1 - \lambda) \sqrt{\frac{(1 - r)\, u}{r \lambda (1 - \lambda)}}, \qquad
q = \frac{1 - r}{r} + \lambda \sqrt{\frac{(1 - r)\, u}{r \lambda (1 - \lambda)}},
\]

where \(\lambda \in (0, 1)\) and \(u \in (0, \infty)\). Then, we have

\[
\psi_r(p, q) = \widetilde{\mathrm{B}}\!\left( \frac{r \lambda}{1 - r}, \frac{r (1 - \lambda)}{1 - r} \right) \sqrt{\frac{r \lambda (1 - \lambda)}{(1 - r)\, u}}, \qquad
L_r(X; p, q) = \mu + \frac{1}{2} \left( \frac{1 - r}{r} \right) \sigma^2 + \frac{1}{2}\, u\, \sigma^2.
\]
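The gamma- and beta-function facts above, Binet's remainder in (25), the identity (28), and the inequality (29), are easy to spot-check numerically with `math.lgamma`; the sketch below also checks the consequence of (28) and (29) that is used later in Appendix B:

```python
import math

def theta(x):  # Binet remainder defined by (25)
    return math.lgamma(x) - ((x - 0.5) * math.log(x) - x + 0.5 * math.log(2 * math.pi))

def log_b_tilde(x, y):  # log of B~(x,y) = B(x,y) (x+y)^(x+y) x^-x y^-y
    log_b = math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    return log_b + (x + y) * math.log(x + y) - x * math.log(x) - y * math.log(y)

# theta is positive, decreasing, and vanishes at infinity
assert theta(0.5) > theta(1.0) > theta(10.0) > 0 and theta(1000.0) < 1e-3

for x, y in [(0.3, 0.8), (2.0, 5.0), (0.1, 9.0)]:
    # identity (28)
    lhs = log_b_tilde(x, y) + 0.5 * math.log(x * y / (2 * math.pi * (x + y)))
    assert abs(lhs - (theta(x) + theta(y) - theta(x + y))) < 1e-12
    # inequality (29)
    assert log_b_tilde(x, y) >= math.log((x + y) / (x * y))
    # consequence of (28) and (29) used in Appendix B:
    # theta(x) + theta(y) - theta(x+y) >= (1/2) log((x+y) / (2 pi x y))
    assert theta(x) + theta(y) - theta(x + y) >= 0.5 * math.log((x + y) / (2 * math.pi * x * y))
```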
Combining these expressions with (28) leads to

\[
\Delta_r(X; p, q) = \theta\!\left( \frac{r \lambda}{1 - r} \right) + \theta\!\left( \frac{r (1 - \lambda)}{1 - r} \right) - \theta\!\left( \frac{r}{1 - r} \right)
+ \frac{1}{2}\, u\, \sigma^2 - \frac{1}{2} \log\!\left( u\, \sigma^2 \right) - \frac{1}{2} \log\!\left( r^{\frac{r}{r - 1}} \right). \tag{30}
\]

We now characterize the minimum with respect to the parameters \((\lambda, u)\). Note that the mapping \(\lambda \mapsto \theta\!\left( \frac{r \lambda}{1 - r} \right) + \theta\!\left( \frac{r (1 - \lambda)}{1 - r} \right)\) is convex and symmetric about the point \(\lambda = 1/2\). Therefore, the minimum with respect to \(\lambda\) is attained at \(\lambda = 1/2\). Meanwhile, the mapping \(u \mapsto u \sigma^2 - \log(u \sigma^2)\) is convex and attains its minimum at \(u = 1/\sigma^2\). Evaluating (30) with these values, we see that the optimal two-moment inequality can be expressed as

\[
\Delta_r(X) = 2\, \theta\!\left( \frac{r}{2 (1 - r)} \right) - \theta\!\left( \frac{r}{1 - r} \right) + \frac{1}{2} \log\!\left( e\, r^{\frac{r}{1 - r}} \right).
\]

By (28), this expression is equivalent to (25). Moreover, the fact that \(\Delta_r(X)\) decreases to zero as \(r\) increases to one follows from the fact that \(\theta(x)\) decreases to zero as \(x\) increases to infinity.

Next, we express the gap in terms of the pair \((p, q)\). Comparing the difference between \(\Delta_r(X; p, q)\) and \(\Delta_r(X)\) leads to

\[
\Delta_r(X; p, q) = \Delta_r(X) + \frac{1}{2}\, \varphi\!\left( \frac{r \lambda (1 - \lambda)}{1 - r} (q - p)^2 \sigma^2 \right)
+ \theta\!\left( \frac{r \lambda}{1 - r} \right) + \theta\!\left( \frac{r (1 - \lambda)}{1 - r} \right) - 2\, \theta\!\left( \frac{r}{2 (1 - r)} \right),
\]

where \(\varphi(x) = x - \log(x) - 1\). In particular, if \(p = 0\), then we obtain the simplified expression

\[
\Delta_r(X; 0, q) = \Delta_r(X) + \frac{1}{2}\, \varphi\!\left( \left( q - \frac{1 - r}{r} \right) \sigma^2 \right)
+ \theta\!\left( \frac{r}{1 - r} - \frac{1}{q} \right) + \theta\!\left( \frac{1}{q} \right) - 2\, \theta\!\left( \frac{r}{2 (1 - r)} \right).
\]

This characterization shows that the gap of the optimal one-moment inequality \(\widetilde{\Delta}_r(X)\) increases to infinity in the limit as either \(\sigma \to 0\) or \(\sigma \to \infty\).

B. Multivariate Gaussian distribution

Let \(Y \sim \mathcal{N}(0, I_n)\) be an \(n\)-dimensional Gaussian vector and consider the parametrization

\[
p = \frac{1 - r}{r} - \frac{1 - \lambda}{r} \sqrt{\frac{2 (1 - r)\, z}{\lambda (1 - \lambda)\, n}}, \qquad
q = \frac{1 - r}{r} + \frac{\lambda}{r} \sqrt{\frac{2 (1 - r)\, z}{\lambda (1 - \lambda)\, n}},
\]

where \(\lambda \in (0, 1)\) and \(z \in (0, \infty)\).
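Before turning to the multivariate case, the lognormal gap can be checked numerically: the sketch below verifies that the minimum of (30) over a grid of \((\lambda, u)\) is attained at \(\lambda = 1/2\), \(u = 1/\sigma^2\), and that the value there agrees with the closed form for \(\Delta_r(X)\). The values \(r = 0.7\) and \(\sigma^2 = 2\) are arbitrary choices for illustration.

```python
import math

def theta(x):  # Binet remainder defined by (25)
    return math.lgamma(x) - ((x - 0.5) * math.log(x) - x + 0.5 * math.log(2 * math.pi))

r, sigma2 = 0.7, 2.0

def gap(lam, u):  # right-hand side of (30)
    return (theta(r * lam / (1 - r)) + theta(r * (1 - lam) / (1 - r))
            - theta(r / (1 - r))
            + 0.5 * u * sigma2 - 0.5 * math.log(u * sigma2)
            - 0.5 * (r / (r - 1)) * math.log(r))

best = gap(0.5, 1 / sigma2)
# closed form: 2 theta(r/(2(1-r))) - theta(r/(1-r)) + (1/2) log(e r^(r/(1-r)))
closed = (2 * theta(r / (2 * (1 - r))) - theta(r / (1 - r))
          + 0.5 * (1 + (r / (1 - r)) * math.log(r)))
assert abs(best - closed) < 1e-12

# no grid point does better than (lambda, u) = (1/2, 1/sigma^2)
for lam in [0.1, 0.3, 0.5, 0.7, 0.9]:
    for u in [0.05, 0.2, 1 / sigma2, 1.0, 4.0]:
        assert gap(lam, u) >= best - 1e-12
```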
Then, we have

\[
\log \omega(S_Y) = \frac{n}{2} \log \pi - \log\!\left( \frac{n}{2} \right) - \log \Gamma\!\left( \frac{n}{2} \right), \qquad
\psi_r(p, q) = \widetilde{\mathrm{B}}\!\left( \frac{r \lambda}{1 - r}, \frac{r (1 - \lambda)}{1 - r} \right) \sqrt{\frac{r \lambda (1 - \lambda)}{1 - r} \cdot \frac{r n}{2 z}}.
\]

Furthermore, if

\[
(1 - \lambda) \sqrt{\frac{2 (1 - r)\, z}{\lambda (1 - \lambda)\, n}} < 1, \tag{31}
\]

then \(L_r(\|Y\|^n; p, q)\) is finite and is given by

\[
L_r(\|Y\|^n; p, q) = Q_{r,n}(\lambda, z) + \frac{n}{2} \log 2 + \frac{r}{1 - r} \left[ \log \Gamma\!\left( \frac{n}{2 r} \right) - \log \Gamma\!\left( \frac{n}{2} \right) \right],
\]

where

\[
Q_{r,n}(\lambda, z) = \frac{r \lambda}{1 - r} \log \Gamma\!\left( \frac{n}{2 r} - \frac{1 - \lambda}{r} \sqrt{\frac{(1 - r)\, n z}{2 \lambda (1 - \lambda)}} \right)
+ \frac{r (1 - \lambda)}{1 - r} \log \Gamma\!\left( \frac{n}{2 r} + \frac{\lambda}{r} \sqrt{\frac{(1 - r)\, n z}{2 \lambda (1 - \lambda)}} \right)
- \frac{r}{1 - r} \log \Gamma\!\left( \frac{n}{2 r} \right). \tag{32}
\]

Combining these expressions and then using (25) and (28) leads to

\[
\Delta_r(Y; p, q) = \theta\!\left( \frac{r \lambda}{1 - r} \right) + \theta\!\left( \frac{r (1 - \lambda)}{1 - r} \right) - \theta\!\left( \frac{r}{1 - r} \right)
+ Q_{r,n}(\lambda, z) - \frac{1}{2} \log z - \frac{1}{2} \log\!\left( r^{\frac{r}{r - 1}} \right)
+ \frac{r}{1 - r}\, \theta\!\left( \frac{n}{2 r} \right) - \frac{1}{1 - r}\, \theta\!\left( \frac{n}{2} \right). \tag{33}
\]

Next, we study some properties of \(Q_{r,n}(\lambda, z)\). The decomposition (25) shows that the logarithm of the gamma function can be expressed as the sum of convex functions:

\[
\log \Gamma(x) = \varphi(x) + \frac{1}{2} \log\!\left( \frac{2\pi}{x} \right) - 1 + \theta(x),
\]

where \(\varphi(x) = x \log x + 1 - x\). Starting with the definition of \(Q_{r,n}(\lambda, z)\) and then using Jensen's inequality yields

\[
Q_{r,n}(\lambda, z) \ge \frac{r \lambda}{1 - r}\, \varphi\!\left( \frac{n}{2 r} - \frac{1 - \lambda}{r} \sqrt{\frac{(1 - r)\, n z}{2 \lambda (1 - \lambda)}} \right)
+ \frac{r (1 - \lambda)}{1 - r}\, \varphi\!\left( \frac{n}{2 r} + \frac{\lambda}{r} \sqrt{\frac{(1 - r)\, n z}{2 \lambda (1 - \lambda)}} \right)
- \frac{r}{1 - r}\, \varphi\!\left( \frac{n}{2 r} \right)
= \frac{\lambda}{a}\, \varphi\!\left( 1 - \sqrt{\frac{1 - \lambda}{\lambda}\, a z} \right) + \frac{1 - \lambda}{a}\, \varphi\!\left( 1 + \sqrt{\frac{\lambda}{1 - \lambda}\, a z} \right),
\]

where \(a = 2 (1 - r)/n\). Using the inequality \(\varphi(x) \ge \frac{3}{2} \frac{(x - 1)^2}{x + 2}\) leads to

\[
Q_{r,n}(\lambda, z) \ge \frac{z}{2} \left[ \left( 1 - \sqrt{\frac{1 - \lambda}{\lambda}\, b z} \right) \left( 1 + \sqrt{\frac{\lambda}{1 - \lambda}\, b z} \right) \right]^{-1}
\ge \frac{z}{2} \left( 1 + \sqrt{\frac{\lambda}{1 - \lambda}\, b z} \right)^{-1}, \tag{34}
\]

where \(b = 2 (1 - r)/(9 n)\).

Observe that the right-hand side of (34) converges to \(z/2\) as \(n\) increases to infinity. It turns out this limiting behavior is tight.
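The chain of lower bounds in (34) can be checked directly against the definition (32) using `math.lgamma`; the parameter values in the sketch below are arbitrary but chosen to satisfy (31):

```python
import math

def Q(r, n, lam, z):  # Q_{r,n}(lambda, z) as defined in (32)
    d = math.sqrt((1 - r) * n * z / (2 * lam * (1 - lam)))
    c = r / (1 - r)
    return (c * lam * math.lgamma(n / (2 * r) - (1 - lam) / r * d)
            + c * (1 - lam) * math.lgamma(n / (2 * r) + lam / r * d)
            - c * math.lgamma(n / (2 * r)))

r, lam, z = 0.6, 0.4, 1.3
for n in [10, 100, 1000]:
    b = 2 * (1 - r) / (9 * n)
    # the two lower bounds in (34)
    first = (z / 2) / ((1 - math.sqrt((1 - lam) / lam * b * z))
                       * (1 + math.sqrt(lam / (1 - lam) * b * z)))
    second = (z / 2) / (1 + math.sqrt(lam / (1 - lam) * b * z))
    assert Q(r, n, lam, z) >= first >= second
```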
Using (25), it is straightforward to show that \(Q_{r,n}(\lambda, z)\) converges pointwise to \(z/2\) as \(n\) increases to infinity, that is,

\[
\lim_{n \to \infty} Q_{r,n}(\lambda, z) = \frac{z}{2} \tag{35}
\]

for any fixed pair \((\lambda, z) \in (0, 1) \times (0, \infty)\).

C. Proof of Proposition 6

Let \(D = (0, 1) \times (0, \infty)\). For fixed \(r \in (0, 1)\), we use \(Q_n(\lambda, z)\) to denote the function \(Q_{r,n}(\lambda, z)\) defined in (32) and we use \(G_n(\lambda, z)\) to denote the right-hand side of (33). These functions are defined to be equal to positive infinity for any pair \((\lambda, z) \in D\) such that (31) does not hold.

Note that the terms \(\theta(n/(2r))\) and \(\theta(n/2)\) converge to zero in the limit as \(n\) increases to infinity. In conjunction with (35), this shows that \(G_n(\lambda, z)\) converges pointwise to a limit \(G_\infty(\lambda, z)\) given by

\[
G_\infty(\lambda, z) = \theta\!\left( \frac{r \lambda}{1 - r} \right) + \theta\!\left( \frac{r (1 - \lambda)}{1 - r} \right) - \theta\!\left( \frac{r}{1 - r} \right)
+ \frac{1}{2} z - \frac{1}{2} \log(z) - \frac{1}{2} \log\!\left( r^{\frac{r}{r - 1}} \right).
\]

At this point, the correspondence with the lognormal distribution can be seen from the fact that \(G_\infty(\lambda, z)\) is equal to the right-hand side of (30) evaluated with \(u \sigma^2 = z\).

To show that the gap corresponding to the lognormal distribution provides an upper bound on the limit, we use

\[
\limsup_{n \to \infty} \Delta_r(Y) = \limsup_{n \to \infty} \inf_{(\lambda, z) \in D} G_n(\lambda, z)
\le \inf_{(\lambda, z) \in D} \limsup_{n \to \infty} G_n(\lambda, z)
= \inf_{(\lambda, z) \in D} G_\infty(\lambda, z)
= \Delta_r(X). \tag{36}
\]

Here, the last equality follows from the analysis in Appendix B-A, which shows that the minimum of \(G_\infty(\lambda, z)\) is attained at \(\lambda = 1/2\) and \(z = 1\).

To prove the lower bound requires a bit more work. Fix any \(\epsilon \in (0, 1)\) and let \(D_\epsilon = (0, 1 - \epsilon] \times (0, \infty)\). Using the lower bound on \(Q_n(\lambda, z)\) given in (34), it can be verified that

\[
\liminf_{n \to \infty} \inf_{(\lambda, z) \in D_\epsilon} \left[ Q_n(\lambda, z) - \frac{1}{2} \log z \right] \ge \frac{1}{2}.
\]

Consequently, we have

\[
\liminf_{n \to \infty} \inf_{(\lambda, z) \in D_\epsilon} G_n(\lambda, z) = \inf_{(\lambda, z) \in D_\epsilon} G_\infty(\lambda, z) \ge \Delta_r(X). \tag{37}
\]
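The pointwise convergence in (35) is also easy to observe numerically; the sketch below evaluates \(Q_{r,n}(\lambda, z)\) from the definition (32) for increasing \(n\) (with arbitrary fixed \(r\), \(\lambda\), \(z\)) and confirms that the distance to \(z/2\) shrinks:

```python
import math

def Q(r, n, lam, z):  # Q_{r,n}(lambda, z) as defined in (32)
    d = math.sqrt((1 - r) * n * z / (2 * lam * (1 - lam)))
    c = r / (1 - r)
    return (c * lam * math.lgamma(n / (2 * r) - (1 - lam) / r * d)
            + c * (1 - lam) * math.lgamma(n / (2 * r) + lam / r * d)
            - c * math.lgamma(n / (2 * r)))

r, lam, z = 0.6, 0.4, 1.3
errs = [abs(Q(r, n, lam, z) - z / 2) for n in [100, 1000, 10000]]
assert errs[0] > errs[1] > errs[2]  # error decreases with n
assert errs[2] < 1e-2               # close to the limit z/2 by n = 10^4
```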
To complete the proof we will show that for any sequence \(\lambda_n\) that converges to one as \(n\) increases to infinity, we have

\[
\liminf_{n \to \infty} \inf_{z \in (0, \infty)} G_n(\lambda_n, z) = \infty. \tag{38}
\]

To see why this is the case, note that by (28) and (29),

\[
\theta\!\left( \frac{r \lambda}{1 - r} \right) + \theta\!\left( \frac{r (1 - \lambda)}{1 - r} \right) - \theta\!\left( \frac{r}{1 - r} \right)
\ge \frac{1}{2} \log\!\left( \frac{1 - r}{2 \pi r \lambda (1 - \lambda)} \right).
\]

Therefore, we can write

\[
G_n(\lambda, z) \ge Q_n(\lambda, z) - \frac{1}{2} \log\!\left( \lambda (1 - \lambda)\, z \right) + c_n, \tag{39}
\]

where \(c_n\) is bounded uniformly for all \(n\). Making the substitution \(u = \lambda (1 - \lambda) z\), we obtain

\[
\inf_{z > 0} G_n(\lambda, z) \ge \inf_{u > 0} \left[ Q_n\!\left( \lambda, \frac{u}{\lambda (1 - \lambda)} \right) - \frac{1}{2} \log u \right] + c_n.
\]

Next, let \(b_n = 2 (1 - r)/(9 n)\). The lower bound in (34) leads to

\[
\inf_{u > 0} \left[ Q_n\!\left( \lambda, \frac{u}{\lambda (1 - \lambda)} \right) - \frac{1}{2} \log u \right]
\ge \inf_{u > 0} \left[ \frac{u}{2 \lambda \left( 1 - \lambda + \sqrt{b_n u} \right)} - \frac{1}{2} \log u \right]. \tag{40}
\]

The limiting behavior in (38) can now be seen as a consequence of (39) and the fact that, for any sequence \(\lambda_n\) converging to one, the right-hand side of (40) increases without bound as \(n\) increases. Combining (36), (37), and (38) establishes that the large-\(n\) limit of \(\Delta_r(Y)\) exists and is equal to \(\Delta_r(X)\). This concludes the proof of Proposition 6.

D. Proof of Inequality (17)

Given any \(\lambda \in (0, 1)\) and \(u \in (0, \infty)\), let

\[
p(r) = \frac{1 - r}{r} - \sqrt{\frac{1 - r}{r} \left( \frac{1 - \lambda}{\lambda} \right) u}, \qquad
q(r) = \frac{1 - r}{r} + \sqrt{\frac{1 - r}{r} \left( \frac{\lambda}{1 - \lambda} \right) u}.
\]

We need the following results, which characterize the terms in Proposition 4 in the limit as \(r\) increases to one.

Lemma 11. The function \(\psi_r(p(r), q(r))\) satisfies

\[
\lim_{r \to 1} \psi_r(p(r), q(r)) = \sqrt{\frac{2 \pi}{u}}.
\]

Proof. Starting with (28), we can write

\[
\psi_r(p, q) = \frac{1}{q - p} \sqrt{\frac{2 \pi (1 - r)}{r \lambda (1 - \lambda)}}
\exp\!\left( \theta\!\left( \frac{r \lambda}{1 - r} \right) + \theta\!\left( \frac{r (1 - \lambda)}{1 - r} \right) - \theta\!\left( \frac{r}{1 - r} \right) \right).
\]

As \(r\) converges to one, the terms in the exponent converge to zero. Noting that \(q(r) - p(r) = \sqrt{\frac{(1 - r)\, u}{r \lambda (1 - \lambda)}}\) completes the proof.

Lemma 12.
If \(X\) is a random variable such that \(\mathbb{E}[|X|^s]\) is finite for all \(s\) in a neighborhood of zero, then \(\mathbb{E}[\log |X|]\) and \(\mathrm{Var}(\log |X|)\) are finite, and

\[
\lim_{r \to 1} L_r(X; p(r), q(r)) = \mathbb{E}[\log |X|] + \frac{u}{2}\, \mathrm{Var}(\log |X|).
\]

Proof. Let \(\Lambda(s) = \log(\mathbb{E}[|X|^s])\). The assumption that \(\mathbb{E}[|X|^s]\) is finite in a neighborhood of zero means that \(\mathbb{E}[(\log |X|)^m]\) is finite for all positive integers \(m\) and that \(\Lambda(s)\) is real-analytic in a neighborhood of zero; that is, there exist constants \(\delta > 0\) and \(C < \infty\), depending on \(X\), such that

\[
\left| \Lambda(s) - a s - \frac{b}{2} s^2 \right| \le C |s|^3, \qquad \text{for all } |s| \le \delta,
\]

where \(a = \mathbb{E}[\log |X|]\) and \(b = \mathrm{Var}(\log |X|)\). Consequently, for all \(r\) such that \(-\delta < p(r) < (1 - r)/r < q(r) < \delta\), it follows that

\[
\left| L_r(X; p(r), q(r)) - a - \left( \frac{1 - r}{2 r} + \frac{u}{2} \right) b \right|
\le C\, \frac{r}{1 - r} \left( \lambda\, |p(r)|^3 + (1 - \lambda)\, |q(r)|^3 \right).
\]

Taking the limit as \(r\) increases to one completes the proof.

We are now ready to prove Inequality (17). Combining Proposition 4 with Lemma 11 and Lemma 12 yields

\[
\limsup_{r \to 1} h_r(X) \le \frac{1}{2} \log\!\left( \frac{2 \pi}{u} \right) + \mathbb{E}[\log |X|] + \frac{u}{2}\, \mathrm{Var}(\log |X|).
\]

The stated inequality follows from evaluating the right-hand side with \(u = 1/\mathrm{Var}(\log |X|)\) and recalling that \(h(X)\) corresponds to the limit of \(h_r(X)\) as \(r\) increases to one.

APPENDIX C
PROPERTIES OF LOGARITHM-POWER RATIO

This section studies properties of the function \(\kappa \colon (0, 1] \to \mathbb{R}_+\) defined by

\[
\kappa(t) = \sup_{u \in (0, \infty)} \frac{\log(1 + u)}{u^t}. \tag{41}
\]

For \(t = 1\), the bound \(\log(1 + u) \le u\) means that \(\kappa(1) \le 1\). Noting that \(\lim_{u \to 0} \log(1 + u)/u = 1\) shows that this inequality is tight, and thus \(\kappa(1) = 1\). For any \(t \in (0, 1)\), it can be verified via differentiation that the supremum is attained on \((0, \infty)\) by the unique solution \(u^*_t\) to the fixed-point equation

\[
u = t (1 + u) \log(1 + u). \tag{42}
\]
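The fixed-point characterization of the supremum in (41) can be verified numerically: the sketch below solves (42) by bisection for a few arbitrary values of \(t\) and confirms that no point on a dense grid beats the fixed point.

```python
import math

def u_star(t):
    # solve u = t(1+u)log(1+u) for the unique positive root, by bisection
    f = lambda u: t * (1 + u) * math.log(1 + u) - u
    lo, hi = 1e-9, 2.0
    while f(hi) < 0:  # grow the bracket until the sign changes
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

for t in [0.3, 0.5, 0.8]:
    us = u_star(t)
    kappa = math.log(1 + us) / us ** t
    # grid search over (0, 50]: the fixed point attains the supremum in (41)
    assert all(math.log(1 + 0.01 * k) / (0.01 * k) ** t <= kappa + 1e-9
               for k in range(1, 5001))
```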
The solution to this equation can be expressed as

\[
u^*_t = \exp\!\left( W\!\left( -\tfrac{1}{t} \exp\!\left( -\tfrac{1}{t} \right) \right) + \tfrac{1}{t} \right) - 1,
\]

where Lambert's function \(W(z)\) is the solution to the equation \(z = x \exp(x)\) on the interval \([-1, \infty)\).

Lemma 13. The function \(g(t) = t\, \kappa(t)\) is nondecreasing on \((0, 1]\) with \(\lim_{t \to 0} g(t) = 1/e\) and \(g(1) = 1\).

Proof. The fact that \(g(1) = 1\) follows from \(\kappa(1) = 1\). By the envelope theorem [24], the derivative of \(g(t)\) can be expressed as

\[
g'(t) = \left( \frac{1}{t} - \log(u^*_t) \right) g(t).
\]

Therefore, the derivative satisfies

\[
g'(t) \ge 0
\iff \frac{1}{t} - \log(u^*_t) \ge 0
\iff \frac{(1 + u^*_t) \log(1 + u^*_t)}{u^*_t} - \log(u^*_t) \ge 0
\iff (1 + u^*_t) \log(1 + u^*_t) \ge u^*_t \log(u^*_t).
\]

Noting that \(u \mapsto u \log u\) is negative on \((0, 1)\) and nonnegative and nondecreasing on \([1, \infty)\) shows that the last condition is always satisfied, and hence \(g'(t)\) is nonnegative.

To prove the small-\(t\) limit, we can rearrange (42) to see that \(u^*_t\) satisfies

\[
\frac{u^*_t}{(1 + u^*_t) \log(1 + u^*_t)} = t, \tag{43}
\]

and hence

\[
\log(g(t)) = \log\!\left( \frac{u^*_t}{1 + u^*_t} \right) - \frac{u^*_t \log u^*_t}{(1 + u^*_t) \log(1 + u^*_t)}. \tag{44}
\]

Now, as \(t\) decreases to zero, (43) shows that \(u^*_t\) increases to infinity. By (44), it then follows that \(\log(g(t))\) converges to negative one, which proves the desired limit.

REFERENCES

[1] E. T. Jaynes, "On the rationale of maximum-entropy methods," Proceedings of the IEEE, vol. 70, no. 9, pp. 939–952, Sep. 1982.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley-Interscience, 2006.
[3] J. A. Costa, A. O. Hero, and C. Vignat, "A characterization of the multivariate distributions maximizing Rényi entropy," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), Lausanne, Switzerland, Jul. 2002.
[4] E. Lutwak, D. Yang, and G. Zhang, "Moment-entropy inequalities," The Annals of Probability, vol. 32, no. 1B, pp.
757–774, 2004.
[5] ——, "Moment-entropy inequalities for a random vector," IEEE Transactions on Information Theory, vol. 53, no. 4, pp. 1603–1607, 2007.
[6] O. Johnson and C. Vignat, "Some results concerning maximum Rényi entropy distributions," Annales de l'Institut Henri Poincaré (B) Probability and Statistics, vol. 43, no. 3, pp. 339–351, Jun. 2007.
[7] E. Lutwak, S. Lv, D. Yang, and G. Zhang, "Affine moments of a random vector," IEEE Transactions on Information Theory, vol. 59, no. 9, pp. 5592–5599, Sep. 2013.
[8] T. van Erven and P. Harremoës, "Rényi divergence and Kullback–Leibler divergence," IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, Jul. 2014.
[9] M. A. Kumar and R. Sundaresan, "Minimization problems based on relative α-entropy I: Forward projection," IEEE Transactions on Information Theory, vol. 61, no. 9, pp. 5063–5080, Sep. 2015.
[10] ——, "Minimization problems based on relative α-entropy II: Reverse projection," IEEE Transactions on Information Theory, vol. 61, no. 9, pp. 5081–5095, Sep. 2015.
[11] I. Sason, "On the Rényi divergence, joint range of relative entropy, and a channel coding theorem," IEEE Transactions on Information Theory, vol. 62, no. 1, pp. 23–34, Jan. 2016.
[12] S. G. Bobkov, G. P. Chistyakov, and F. Götze, "Rényi divergence and the central limit theorem," 2016, [Online]. Available: https://arxiv.org/abs/1608.01805.
[13] F. Nielsen and R. Nock, "On the chi square and higher-order chi distances for approximating f-divergences," IEEE Signal Processing Letters, vol. 21, no. 1, pp. 10–13, Jan. 2014.
[14] I. Sason and S. Verdú, "f-divergence inequalities," IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 5973–6006, Nov. 2016.
[15] S.-L. Huang, C. Suh, and L. Zheng, "Euclidean information theory of networks," IEEE Transactions on Information Theory, vol. 61, no. 12, pp. 6795–6814, Dec. 2015.
[16] G. Reeves and H. D. Pfister, "The replica-symmetric prediction for compressed sensing with Gaussian matrices is exact," Jul. 2016, [Online]. Available: https://arxiv.org/abs/1607.02524.
[17] ——, "The replica-symmetric prediction for compressed sensing with Gaussian matrices is exact," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, Jul. 2016, pp. 665–669.
[18] G. Reeves, "Conditional central limit theorems for Gaussian projections," Dec. 2016, [Online]. Available: https://128.84.21.199/abs/1612.09252.
[19] R. T. Rockafellar, Convex Analysis. Princeton University Press, 1970.
[20] G. E. Andrews, R. Askey, and R. Roy, Special Functions, ser. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1999, vol. 71.
[21] A. L. Gibbs and F. E. Su, "On choosing and bounding probability metrics," International Statistical Review, vol. 70, no. 3, pp. 419–435, Dec. 2002.
[22] T. Jebara, R. Kondor, and A. Howard, "Probability product kernels," Journal of Machine Learning Research, vol. 5, pp. 818–844, 2004.
[23] L. Grenié and G. Molteni, "Inequalities for the beta function," Mathematical Inequalities & Applications, vol. 18, no. 4, pp. 1427–1442, 2015.
[24] P. Milgrom and I. Segal, "Envelope theorems for arbitrary choice sets,"