Variations on a Theme by Massey
Olivier Rioul
LTCI, Télécom Paris, Institut Polytechnique de Paris, F-91120 Palaiseau, France
Abstract
In 1994, James Lee Massey proposed the guessing entropy as a measure of the difficulty that an attacker has to guess a secret used in a cryptographic system, and established a well-known inequality between entropy and guessing entropy. More than 15 years before, in an unpublished work, he also established a well-known inequality for the entropy of an integer-valued random variable of given variance. In this paper, we establish a link between the two works by Massey in the more general framework of the relationship between discrete (absolute) entropy and continuous (differential) entropy. Two approaches are given, in which the discrete entropy (or Rényi entropy) of an integer-valued variable can be upper bounded using the differential (Rényi) entropy of some suitably chosen continuous random variable.
I. INTRODUCTION
In an unpublished work in the mid-1970s, later published in the late 1980s [1], James L. Massey proved the following bound on the entropy of an integer-valued random variable X:

H(X) ≤ ½ log(2πe(σ² + 1/12)).   (1)

This inequality establishes an interesting connection between the entropy of X and that of a Gaussian random variable. After more than a decade, Massey also established an important inequality for the guessing entropy [2]:

G(X) ≥ 2^{H(X)−2} + 1   when H(X) ≥ 2 bits   (2)

where again an integer-valued random variable (the number of guesses) is involved, the guessing entropy G(X) being defined as the minimum average number of guesses.
Perhaps surprisingly, it can be found that the two Massey inequalities are part of a common framework which relates discrete (absolute) and continuous (differential) entropies. Of course, the question of making the link between the entropy H(X) of a discrete random variable X and the entropy h(X̃) of a continuous random variable X̃ is not new. The classical textbook answer to this question states that when X is a quantized version of X̃ with small quantization step ∆, then one has the approximation

H(X) ≈ h(X̃) − log ∆.   (3)

Massey's approach somehow goes in the opposite direction, deducing X̃ from X using an additive continuous perturbation. Doing so, the above approximation becomes an exact equality

H(X) = h(X̃) − log ∆   (4)

where ∆ need not be arbitrarily small. In particular, for integer-valued X, one has ∆ = 1 and both entropies coincide: H(X) = h(X̃).
This paper gives a general derivation of inequalities of the Massey type, based on a version of Kullback's inequality [3] for exponential families applied to X̃, leading to upper bounds on H(X). We illustrate the method when X has fixed variance σ², and also when X ≥ 0 with fixed mean µ.
In the first case, we recover the Massey inequality (1) for a given variance, and in the second case, we improve the other Massey inequality (2) for the guessing entropy [2]. The method can also easily be extended to Rényi entropies. Inequalities relating guessing entropy to Rényi entropies have become increasingly popular for practical applications [4].
Another approach presented in this paper uses Kullback's inequality directly on the integer-valued variable X for the same exponential family density, combined with the Poisson summation formula. This greatly improves Massey's original inequality for integer-valued variables of given variance.
The remainder of this paper is organized as follows. Section II introduces two complementary approaches to relate discrete to continuous entropy. Based on Massey's approach, a general method for establishing Massey-type inequalities is presented in Section III and generalized to Rényi entropies in Section IV. Finally, another method improving Massey's original inequality is presented in Section V.

II. DISCRETE VS. CONTINUOUS ENTROPY
Consider a discrete random variable X whose values are regularly spaced ∆ apart, with some probability distribution p(x) = P(X = x) with finite entropy. As ∆ → 0, X may approach in distribution a continuous random variable X̃ with density f. How then is the discrete (absolute) entropy

H(X) = Σ_x p(x) log(1/p(x))   (5)

related to the continuous (differential) entropy

h(X̃) = ∫ f(x) log(1/f(x)) dx   (6)

and how can H(X) be evaluated from h(X̃)?
Similarly (or more generally), for any fixed α > 0, how is the discrete Rényi α-entropy

H_α(X) = (1/(1−α)) log Σ_x p(x)^α   (7)

related to the continuous Rényi α-entropy

h_α(X̃) = (1/(1−α)) log ∫ f(x)^α dx   (8)

and how can H_α(X) be evaluated from h_α(X̃)? (Notice that the limiting case α → 1 gives H₁(X) = H(X) and h₁(X̃) = h(X̃).)

A. Reza's Equivalence
For Shannon's entropy, the classical answer to the above question dates back to the 1961 textbook by Reza [5, § 8.3], the earliest reference to the best of the author's knowledge. It has also been presented in the classical textbooks [6, § 1.3] and [7, § 8.3]. The argument can be easily generalized to Rényi entropies.
The approach is to first consider the continuous variable X̃ having density f, and then quantize it to obtain the discrete X with step size ∆, in such a way that

p(x_k) = P(X = x_k) = ∫_{k∆}^{(k+1)∆} f(x) dx   (9)

where the discrete values x_k correspond to mean values

f(x_k) = (1/∆) ∫_{k∆}^{(k+1)∆} f(x) dx = p(x_k)/∆   (10)

as given by the mean value theorem (assuming, e.g., f continuous within each bin of length ∆). Then the integral in (6) or in (8) can be approximated by the Riemann sum

Σ_k ∆·f(x_k) log(1/f(x_k)) = Σ_k p(x_k) log(∆/p(x_k)) = H(X) + log ∆
(1/(1−α)) log Σ_k ∆·f^α(x_k) = (1/(1−α)) log Σ_k ∆^{1−α} p^α(x_k) = H_α(X) + log ∆   (11)

which tends to h(X̃) (resp. h_α(X̃)) as ∆ → 0, provided that f log f (resp. f^α) is Riemann-integrable (e.g., f is continuous and compactly supported).

Proposition 1:
Under the above assumptions, one has the well-known approximation H(X) ≈ h(X̃) − log ∆ for small ∆, and more generally, for any α > 0,

H_α(X) ≈ h_α(X̃) − log ∆   (12)

in the sense that lim_{∆→0} {H_α(X) + log ∆} = h_α(X̃).

B. Massey's Equivalence
Reza's approximation (12), however appealing it may be, is not so convenient for evaluating the discrete entropy of X from the continuous one: it requires an arbitrarily small ∆, and the resulting values of X are in fact not necessarily regularly spaced, since they correspond to mean values (10).
Now instead of deriving the discrete X from the continuous X̃ and expressing the continuous entropy in terms of the discrete one, we can proceed in the opposite direction: starting from X with regularly spaced values, we infer the continuous version X̃ in order to express the discrete entropy in terms of the continuous one.
Massey's solution, in an unpublished work in the mid-1970s [1], is to write the density f as a staircase function whose values are the discrete probabilities. This amounts to taking the discrete random variable X (whose values are regularly spaced ∆ apart) and adding an independent uniformly distributed random perturbation U, as explained in [7, Exercise 8.7], which also credits an unpublished work by Willems.
More generally, this random perturbation approach does not necessarily require U to be uniformly distributed, as shown in the following

Theorem 1:
Let X be a discrete random variable whose values are regularly spaced ∆ apart, and define X̃ by

X̃ = X + U   (13)

where U is a continuous random variable independent of X, with support of finite length ≤ ∆. Then

h_α(X̃) = H_α(X) + h_α(U).   (14)

In particular, if U is uniformly distributed in an interval of length ∆, then h_α(U) = log ∆ and the exact equality

H_α(X) = h_α(X̃) − log ∆   (15)

holds for any α > 0 (compare to (12)).

Proof:
The density of X̃ = X + U is a mixture of the form

f(x) = Σ_{k∈Z} p(x_k) χ(x − x_k)   (16)

where the x_k are the regularly spaced values of X and χ is the density of U. The terms in the sum have disjoint supports. Since entropy is invariant by translation, we may always assume that χ is supported in the interval [0, ∆]. Splitting the integral in (6) or in (8) over the intervals [x_k, x_{k+1} = x_k + ∆] we obtain

h(X̃) = Σ_k p(x_k) ∫ χ(x − x_k) log(1/(p(x_k) χ(x − x_k))) dx
      = Σ_k p(x_k) log(1/p(x_k)) + ∫ χ log(1/χ)

(using ∫ χ = 1 and Σ_k p(x_k) = 1), and similarly

h_α(X̃) = (1/(1−α)) log Σ_k p(x_k)^α ∫ χ(x − x_k)^α dx = (1/(1−α)) log Σ_k p(x_k)^α ∫ χ^α
       = (1/(1−α)) log Σ_k p(x_k)^α + (1/(1−α)) log ∫ χ^α   (17)

which proves (14).

Remark 1:
The above proof follows the textbook solution [8] to Exercise 8.7 of [7] (see also [9, Proof of Thm. 3]). For Shannon's entropy (α = 1), a simpler proof is as follows.
Proof of Theorem 1 (α = 1): By the support assumption, X can be recovered by rounding X̃ = X + U, hence is a deterministic function of X̃. Therefore, H(X | X̃) = 0 and

H(X) = H(X) − H(X | X̃) = I(X; X̃) = h(X̃) − h(X̃ | X) = h(X̃) − h(U)   (18)

which proves (14).

Remark 2:
Compared to (12), equality (15) has the advantage that it is exact, that the discrete values of X are truly regularly spaced, and also that ∆ is not necessarily small. In fact, (15) is invariant by scaling: if s > 0, then H_α(sX) = h_α(sX̃) − log(s∆) is the same as (15) because h_α(sX̃) = h_α(X̃) + log s, as is easily checked.
As a result, one can always set ∆ = 1 and consider an integer-valued random variable X. Hereafter we shall always make this assumption. As a result, (15) simply writes

H_α(X) = h_α(X̃)   (19)

when U is uniformly distributed in an interval of length 1. This is the original remark by Massey [1] that discrete and continuous entropies coincide in this case.

III. A GENERAL APPROACH TO MASSEY'S INEQUALITIES
In this section, we focus on Shannon's entropy (α = 1) and follow Massey's approach in bounding the continuous entropy (Kullback's inequality) and then applying Theorem 1 to derive upper bounds on the discrete entropy.

A. Kullback's Inequality
Let D(f‖φ) = ∫ f log(f/φ) be the relative entropy (or Kullback-Leibler divergence) between the density f of X̃ and some probability density function φ. The information inequality [7, Thm. 2.6.3] states that D(f‖φ) ≥ 0 with equality iff (if and only if) f = φ a.e. This gives the well-known "Gibbs' inequality"

h(X̃) ≤ −E log φ(X̃)   (20)

with equality iff f = φ a.e. This can be used to derive well-known bounds on the continuous entropy h(X̃) as follows.

Proposition 2 (Kullback's Inequality):
Suppose f is parametrized by some θ in such a way that the "moment" E[T(X̃)] = m is a fixed quantity. Set φ in the form of an "exponential family distribution"

φ(x) = e^{−T(x)} / Z   (21)

where T depends on the parameter θ and

Z = Z(θ) = ∫ e^{−T(x)} dx   (22)

is a normalization constant (known as the "partition function"). Then

h(X̃) ≤ m log e + log Z   (23)

with equality iff X̃ has density (21).

Proof:
Apply Gibbs’ inequality (20) to (21).
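As a quick numerical sanity check of (23) (an illustration, not part of the paper's development; entropies in nats, so log = ln), one can compare the differential entropy of a unit-scale Laplace density with the bound obtained from the Gaussian exponential family T(x) = (x − µ)²/(2σ²), for which m = ½ and Z = √(2πσ²):

```python
import math

# Laplace density with mean 0 and scale b has variance sigma^2 = 2*b^2
# and differential entropy 1 + ln(2b) nats (standard closed form).
b = 1.0
sigma2 = 2 * b * b
h_laplace = 1 + math.log(2 * b)

# Right-hand side of (23) for the Gaussian family matched to this variance:
# m*log e + log Z = 1/2 + ln(sqrt(2*pi*sigma^2)) = (1/2)*ln(2*pi*e*sigma^2).
m = 0.5
log_Z = 0.5 * math.log(2 * math.pi * sigma2)
bound = m + log_Z

assert h_laplace < bound  # strict, since the Laplace density is not Gaussian
print(h_laplace, bound)
```

The gap (about 0.072 nats here) is strictly positive, in accordance with the equality condition of Proposition 2.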
Remark 3:
Such a general inequality (23) is well known (see, e.g., [10, § 21]) and can be seen as a version of Kullback's inequality [3, § 4] (or the Kullback-Sanov inequality [11, pp. 23–24], [12, Chap. 3, Thm. 2.1]) for exponential families. It is more general in the sense that one does not use the condition on Z(θ) which would be required for equality to hold. Such a condition would read (d/dθ) log Z(θ) = −m in the case of a natural exponential family φ(x) = e^{−θT′(x)}/Z(θ) where T′ does not depend on θ.
Let µ_X̃ and σ²_X̃ denote the mean and variance of X̃, respectively. We illustrate (23) in three classical situations:
a) Support length parameter: Here X̃ has finite support: X̃ ∈ (a, b) a.s.; letting ℓ(·) denote the support length, the corresponding parameter is θ = ℓ(X̃) = b − a. We set T(x) = 0 if x ∈ (a, b) and = +∞ otherwise, so that φ is the uniform distribution on (a, b), with moment m = 0 and partition function Z = b − a, and (23) reduces to [7, Ex. 12.2.4]

h(X̃) ≤ log(b − a)   (24)

with equality iff X̃ is uniformly distributed in (a, b).
b) Variance parameter: Here X̃ ∈ R with parameter θ = σ_X̃. We set T(x) = ½((x − µ_X̃)/σ_X̃)², so that φ = N(µ_X̃, σ²_X̃) is the Gaussian density, with moment m = ½ and partition function Z = √(2πσ²_X̃), and (23) reduces to the well-known Shannon bound [13, § 20.5]

h(X̃) ≤ ½ log(2πeσ²_X̃)   (25)

with equality iff X̃ is Gaussian.
c) Mean parameter: Here we assume that X̃ > 0 a.s., with parameter θ = µ_X̃. We set T(x) = x/µ_X̃, so that φ is the exponential density, with moment m = 1 and partition function Z = µ_X̃, and (23) reduces to another Shannon bound [13, § 20.7]

h(X̃) ≤ log(eµ_X̃)   (26)

with equality iff X̃ is exponential.

B. Inequalities of the Massey Type
Applying Theorem 1 on top of Kullback's inequality provides upper bounds on the discrete entropy H(X), depending on the choice of T and θ. In keeping with Remark 2, we assume that X is integer-valued, with mean µ and variance σ², and we apply Theorem 1 in the form H(X) = h(X̃) − h(U), where U has support of finite length ℓ(U) = ∆ ≤ 1, in the three classical situations a), b), c) above. Note that in these cases we respectively have a) ℓ(X̃) = ℓ(X) + ℓ(U) = ℓ(X) + ∆; b) σ²_X̃ = σ² + σ²_U; c) µ_X̃ = µ + µ_U.
a) Support length parameter: Suppose that X has finite support {k, ..., k+ℓ} of length ℓ ≥ 0. Since ℓ(X̃) = ℓ(X) + ℓ(U) = ℓ + ∆, by Theorem 1 and inequality (24), we have

H(X) ≤ log(ℓ + ∆) − h(U).   (27)

Since U has support length ∆ ≤ 1, from (24) we always have h(U) ≤ log ∆ ≤ log 1 = 0, with equality iff U is uniformly distributed in an interval of length ∆ = 1. Thus, given ∆, the best upper bound in (27) is log(ℓ + ∆) − log ∆, which is minimized when ∆ takes its maximum value 1. One obtains the trivial bound

H(X) ≤ log(ℓ + 1)   (28)

achieved when X is equiprobable. Interestingly, achievability of h(X + U) = log(ℓ + 1) is at the basis of the analysis done in [9, Thm. 1] on Shannon's vs. Hartley's formula.
b) Variance parameter: Suppose that X has finite variance σ². Since σ²_X̃ = σ² + σ²_U, by Theorem 1 and inequality (25), we have

H(X) ≤ ½ log(2πe(σ² + σ²_U)) − h(U)   (29)

where U has support length ≤ 1. Here the best choice of U (the best compromise between maximum possible h(U) and minimum possible σ²_U) depends on the value of σ. But it can be observed that the obtained bound cannot be tight for small values of σ.
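Both regimes can be illustrated numerically. The following sketch (an illustration only; entropies in nats) takes U uniform on a unit interval, so that h(U) = 0 and σ²_U = 1/12 in (29), and compares the resulting bound ½ ln(2πe(σ² + 1/12)) with the exact entropy of a Bernoulli(½) variable (small σ) and of a Poisson variable (large σ):

```python
import math

def bound_nats(var):
    # Right-hand side of (29) with U uniform on a unit interval:
    # h(U) = 0 and var(U) = 1/12.
    return 0.5 * math.log(2 * math.pi * math.e * (var + 1.0 / 12))

def entropy_nats(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# Small variance: Bernoulli(1/2) has sigma^2 = 1/4; the bound is loose.
print(entropy_nats([0.5, 0.5]), bound_nats(0.25))

# Large variance: Poisson(lam) has sigma^2 = lam; the bound is nearly tight.
lam = 50.0
probs, pk = [], math.exp(-lam)   # pmf at k = 0
for k in range(1, 501):
    probs.append(pk)
    pk = pk * lam / k            # recurrence p_k = p_{k-1} * lam / k
h_poisson = entropy_nats(probs)
print(h_poisson, bound_nats(lam))
```

The Bernoulli gap is about 0.18 nats, while the Poisson gap is below 0.01 nats, consistent with the asymptotic tightness for large σ.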
Indeed, when σ = 0, X is deterministic, H(X) = 0, and the upper bound in (29) becomes ½ log(2πeσ²_U) − h(U), which from (25) is strictly positive since U cannot be Gaussian when it has finite support.
For large σ, the best asymptotic upper bound in (29) is obtained when h(U) is maximum, equal to log 1 = 0. From the equality case in (24), U is then uniformly distributed in an interval of length 1. In this case σ²_U = 1/12 and one obtains Massey's inequality [1], or the Massey-Willems inequality [7, Ex. 8.7]

H(X) ≤ ½ log(2πe(σ² + 1/12))   (30)

for any fixed σ². This bound is asymptotically tight for large σ: as an example, for Poisson-distributed X we have [14] H(X) = ½ log(2πeσ²) + O(σ⁻²).
However, inequality (30) can still be improved: the constant 1/12 can be replaced by an arbitrarily small constant as σ gets larger (see Section V).
c) Mean parameter: Suppose that X ≥ 0 has finite mean µ. Since µ_X̃ = µ + µ_U, by Theorem 1 and inequality (26), we have

H(X) ≤ log(e(µ + µ_U)) − h(U)   (31)

provided that U ≥ 0 a.s. with support length ≤ 1. Again the best choice of U (the best compromise between maximum possible h(U) and minimum possible µ_U) depends on the value of the parameter µ ≥ 0. Also the obtained bound cannot be tight for small values of µ: when µ = 0, X = 0 a.s., H(X) = 0, and the upper bound in (31) becomes log(eµ_U) − h(U), which from (26) is strictly positive because U cannot be exponential when it has finite support.
For large µ, the best asymptotic upper bound in (31) is again obtained when h(U) is maximum, equal to log 1 = 0. From the equality case in (24), U ≥ 0 is then uniformly distributed in an interval of length 1. In this case the minimum value of µ_U is achieved when U is uniformly distributed in (0, 1), which gives µ_U = ½.
We thus obtain, for any X ≥ 0 with fixed µ, a new Massey-type inequality

H(X) ≤ log(e(µ + ½))   (32)

which is asymptotically tight for large µ: as an example, for geometric X we have H(X) = µH(1/µ) = log(eµ) + O(µ⁻¹).

C. Improved Massey's Inequality for Guessing
Inequality (32) can be thought of as an improvement of Massey's inequality for the guessing entropy [2]. Let G(X) be the number of successive guesses of some (discrete) secret X before the actual value of X is found, and define the guessing entropy as the minimum average number of guesses for a given probability distribution of X:

G(X) = min E(G(X)).   (33)

Massey's original inequality reads [2]

G(X) ≥ 2^{H(X)−2} + 1   when H(X) ≥ 2 bits.   (34)

Corollary 1 (Improvement of Massey's Inequality):
When H(X) is expressed in bits,

G(X) ≥ 2^{H(X)}/e + ½.   (35)

This inequality improves Massey's original inequality as soon as H(X) > log(2e/(4−e)) ≈ 2.08 bits. The improvement is particularly important for large values of entropy, by the optimal factor 4/e (see Fig. 1).

Proof:
As explained in [2], the optimal strategy leading to the minimum (33) requires k guesses with probability

P(G(X) = k) = p_(k)   (k = 1, 2, ...)   (36)

where p_(k) is the kth largest probability in X's distribution. Applying (32) to G(X) − 1 ≥ 0, and noting that µ = G(X) − 1 and H(G(X)) = H(X), yields

H(X) ≤ log(e(G(X) − ½))   (37)

which is (35).
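Corollary 1 is easy to check numerically. The sketch below uses a hypothetical Zipf-like distribution chosen purely for illustration (it is not taken from [2]); it computes the guessing entropy (33) by sorting probabilities in decreasing order as in (36), then compares it with both lower bounds (34) and (35):

```python
import math

def guessing_entropy(probs):
    # Guessing entropy (33): guess candidate values in decreasing order of
    # probability, so the k-th guess succeeds with the k-th largest
    # probability (36); average the number of guesses.
    p = sorted(probs, reverse=True)
    return sum(k * pk for k, pk in enumerate(p, start=1))

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A hypothetical Zipf-like secret over 64 values (illustration only).
raw = [1.0 / k for k in range(1, 65)]
total = sum(raw)
probs = [x / total for x in raw]

G = guessing_entropy(probs)
H = entropy_bits(probs)

massey_34 = 2 ** (H - 2) + 1          # Massey's original bound (34)
improved_35 = 2 ** H / math.e + 0.5   # improved bound (35)

assert H >= 2                     # so (34) applies as well
assert massey_34 <= G
assert improved_35 <= G
assert improved_35 > massey_34    # here H exceeds the ~2.08-bit threshold
print(G, H, massey_34, improved_35)
```

For this distribution the improved bound (35) is visibly stronger than (34), as expected since H(X) is well above the threshold.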
Fig. 1. Massey's original (blue) and improved (black) lower bounds on the guessing entropy E[G(X)] as a function of H(X).

Remark 4:
It is quite startling to notice that the approach followed by Massey himself back in the 1970s [1] can improve the result of his 1994 paper [2] so much.
Massey's inequality was already improved by the author in the (weaker) form G(X) ≥ 2^{H(X)}/e with a very different proof, see [15] and [16]. See also [16], [17] for a different kind of improvement.
Inequality (35) can be shown to be optimal among all possible bounds of the form G ≥ a·b^H + c [18].

IV. EXTENSION TO RÉNYI ENTROPIES
We now extend the previous section to Rényi entropies of any order α > 0. We assume α ≠ 1; the case α = 1 of the preceding section can be recovered as the limit α → 1.

A. α-Kullback's Inequality

Let D_α(f‖φ) = (1/(α−1)) log ∫ f^α φ^{1−α} be the Rényi α-divergence [19] between the density f of X̃ and some probability density function φ. We have D_α(f‖φ) ≥ 0 with equality iff f = φ a.e. Denoting the "escort" densities of exponent α by f_α = f^α/∫f^α and φ_α = φ^α/∫φ^α, the relative α-entropy [20] between f and φ is defined as

∆_α(f‖φ) = D_{1/α}(f_α‖φ_α) ≥ 0   (38)

with equality to 0 iff f = φ a.e. Expanding D_{1/α}(f_α‖φ_α) gives the α-Gibbs' inequality [21, Prop. 8]

h_α(X̃) ≤ (α/(1−α)) log E[φ_α^{(α−1)/α}(X̃)]   (39)

with equality iff f = φ a.e. This can be used to derive upper bounds on the continuous α-entropy h_α(X̃) as follows.
Proposition 3 (α-Kullback's Inequality): Suppose f is parametrized by some θ in such a way that the "moment" E[T(X̃)] = m is a fixed quantity. Set φ in the form

φ(x) = T(x)^{1/(α−1)} / Z   (40)

where T depends on the parameter θ and Z = ∫ T(x)^{1/(α−1)} dx is a normalization constant, so that

φ_α(x) = T(x)^{α/(α−1)} / Z_α   (41)

with "α-partition function"

Z_α = Z_α(θ) = ∫ T(x)^{α/(α−1)} dx.   (42)

Then

h_α(X̃) ≤ (α/(1−α)) log m + log Z_α   (43)

with equality iff X̃ has density (40).

Proof:
Apply the α-Gibbs' inequality (39) to (40).
Of course, both T(x)^{1/(α−1)} and T(x)^{α/(α−1)} need to be integrable over the given support interval for (43) to hold. We illustrate this in the three classical situations:
a) Support length parameter: Here X̃ has finite support: X̃ ∈ (a, b) a.s.; letting ℓ(·) denote the support length, the corresponding parameter is θ = ℓ(X̃) = b − a. We set T(x) = 1 if x ∈ (a, b) and = 0 otherwise, so that φ = φ_α is the uniform distribution on (a, b), with moment m = 1 and α-partition function Z_α = b − a, and (43) reduces to

h_α(X̃) ≤ log(b − a)   (44)

with equality iff X̃ is uniformly distributed in (a, b).
b) Variance parameter: Here X̃ ∈ R with parameter θ = σ_X̃. We set T(x) in the form T(x) = 1 + β·((x − µ_X̃)/σ_X̃)², so that m = 1 + β, where β is such that (40) has finite variance σ²_X̃. The corresponding density (40) is known as the α-Gaussian density [22]. We obtain the following

Lemma 1:
Under the above assumptions, we have α > 1/3, β = (1−α)/(3α−1), and φ(x) is the α-Gaussian density

φ(x) = √(β/(πσ²_X̃)) · (Γ(1/(1−α)) / Γ(1/(1−α) − ½)) · (1 + β((x − µ_X̃)/σ_X̃)²)^{1/(α−1)}   for 1/3 < α < 1
φ(x) = √(|β|/(πσ²_X̃)) · (Γ(α/(α−1) + ½) / Γ(α/(α−1))) · (1 − |β|((x − µ_X̃)/σ_X̃)²)₊^{1/(α−1)}   for α > 1   (45)

and (43) reduces to

h_α(X̃) ≤ ½ log(((3α−1)/(1−α)) πσ²_X̃) + (α/(1−α)) log(2α/(3α−1)) + log(Γ(α/(1−α) − ½) / Γ(α/(1−α)))   for 1/3 < α < 1
h_α(X̃) ≤ ½ log(((3α−1)/(α−1)) πσ²_X̃) + (α/(α−1)) log((3α−1)/(2α)) + log(Γ(α/(α−1) + 1) / Γ(α/(α−1) + 3/2))   for α > 1   (46)

with equality iff X̃ is α-Gaussian.
Proof: See Appendix A.
For α > 1 (β < 0), the density is supported in the interval |x − µ_X̃| < σ_X̃/√|β|, so that T(x) = 1 + β((x − µ_X̃)/σ_X̃)² ≥ 0; hence the notation (·)₊ = max(·, 0). When α → 1 we recover (25), attained for the Gaussian density. As other examples we have

h_{1/2}(X̃) ≤ log(2πσ_X̃)   (47)
h₂(X̃) ≤ ½ log((125/9) σ²_X̃)   (48)

with equality iff X̃ is ½-Gaussian and 2-Gaussian, respectively.
c) Mean parameter: Here we assume that X̃ > 0 a.s., with parameter θ = µ_X̃. We set T(x) in the form T(x) = 1 + β·x/µ_X̃, so that m = 1 + β, where β is such that (40) has finite mean µ_X̃. The corresponding density (40) is known as the Lomax density [23, § II.B]. We obtain the following

Lemma 2:
Under the above assumptions, we have α > ½, β = (1−α)/(2α−1), Z_α = µ_X̃, and φ(x) is the "α-exponential" density

φ(x) = (β/µ_X̃)(α/(1−α)) (1 + β x/µ_X̃)^{1/(α−1)}   for ½ < α < 1
φ(x) = (|β|/µ_X̃)(α/(α−1)) (1 − |β| x/µ_X̃)₊^{1/(α−1)}   for α > 1   (49)

and (43) reduces to

h_α(X̃) ≤ log µ_X̃ + (α/(1−α)) log(α/(2α−1))   (50)

with equality iff X̃ is α-exponential.
Proof: See Appendix B.
When α → 1 we recover (26), attained for the exponential density. As other examples we have

h_{3/4}(X̃) ≤ log((27/8) µ_X̃)   (51)
h₂(X̃) ≤ log((9/4) µ_X̃)   (52)

with equality iff X̃ is ¾-exponential and 2-exponential, respectively.

Remark 5:
It is possible to further generalize a), b), c) to a parameterization by a ρth-order moment θ = E(|X̃|^ρ). The maximizing distribution is a generalized α-Gaussian [24], and cases a), b), c) are recovered by setting ρ = +∞, ρ = 2, and ρ = 1, respectively.
(For α > 1, i.e. β < 0, the density (49) is supported in the interval 0 < x < µ_X̃/|β|, so that T(x) = 1 + β x/µ_X̃ ≥ 0; hence the notation (·)₊ = max(·, 0). For α < 1, φ is a Pareto Type II distribution, a.k.a. Lomax distribution, with shape parameter α/(1−α).)

Fig. 2. α-Gaussian distributions (45) for several values of α.

Fig. 3. α-exponential distributions (49) for several values of α.

B. Massey-Type Inequalities for Rényi Entropies
We again apply Theorem 1, in the form H_α(X) = h_α(X̃) − h_α(U) where U has support of finite length ℓ(U) = ∆ ≤ 1, on top of the α-Kullback inequality (Prop. 3). As a result, we obtain upper bounds on the (discrete) Rényi entropy H_α(X) of an integer-valued X with mean µ and variance σ², in the three situations a), b), c) above.
a) Support length parameter: Suppose that X has finite support {k, ..., k+ℓ} of length ℓ ≥ 0. This situation is handled exactly the same as in the case α = 1 (§ III-B-a)). One obtains the known bound (for any α > 0)

H_α(X) ≤ log(ℓ + 1)   (53)

achieved when X is equiprobable.
b) Variance parameter: Suppose that X has finite variance σ². With a similar reasoning as in the case α = 1 (§ III-B-b)), for large σ, the best upper bound in Theorem 1 is obtained with Massey's choice that U is uniformly distributed in an interval of length 1. Hence (19) holds, and since σ²_X̃ = σ² + σ²_U = σ² + 1/12, (46) gives the following natural generalization of Massey's inequality (30) to Rényi entropies:

H_α(X) ≤ ½ log(((3α−1)/(1−α)) π(σ² + 1/12)) + (α/(1−α)) log(2α/(3α−1)) + log(Γ(α/(1−α) − ½) / Γ(α/(1−α)))   for 1/3 < α < 1
H_α(X) ≤ ½ log(((3α−1)/(α−1)) π(σ² + 1/12)) + (α/(α−1)) log((3α−1)/(2α)) + log(Γ(α/(α−1) + 1) / Γ(α/(α−1) + 3/2))   for α > 1   (54)

for any α > 1/3. Thus, for example,

H_{1/2}(X) ≤ ½ log(4π²(σ² + 1/12))   (55)
H₂(X) ≤ ½ log((125/9)(σ² + 1/12)).   (56)

Remark 6:
Such inequalities cannot exist in general when α ≤ 1/3. To see this, consider the discrete random variable X ≥ 2 with distribution P(X = k) = c/(k log k)³, with normalization constant c = 1/Σ_{k≥2} (k log k)⁻³. Then X has finite second moment Σ_{k≥2} c k²/(k log k)³ = Σ_{k≥2} c/(k log³k) < +∞, hence finite variance, but Σ_{k≥2} P(X = k)^{1/3} = c^{1/3} Σ_{k≥2} 1/(k log k) = +∞, hence H_α(X) ≥ H_{1/3}(X) = +∞ for all α ≤ 1/3.
c) Mean parameter: Suppose that X ≥ 0 has finite mean µ. For large µ, as in the case α = 1 (§ III-B-c)), the best upper bound in Theorem 1 is obtained when U is uniformly distributed in (0, 1). Hence (19) holds, and since µ_X̃ = µ + µ_U = µ + ½, (50) gives the following natural generalization of (32) to Rényi entropies:

H_α(X) ≤ log(µ + ½) + (α/(1−α)) log(α/(2α−1))   (57)

for any α > ½. Thus, for example,

H_{3/4}(X) ≤ log(27(µ + ½)/8)   (58)
H₂(X) ≤ log(9(µ + ½)/4).   (59)

Remark 7:
Such inequalities cannot exist in general when α ≤ ½. To see this, consider the discrete random variable X ≥ 2 with distribution P(X = k) = c/(k log k)², with normalization constant c = 1/Σ_{k≥2} (k log k)⁻². Then X has finite mean µ = Σ_{k≥2} c k/(k log k)² = Σ_{k≥2} c/(k log²k) < +∞, but Σ_{k≥2} √P(X = k) = √c Σ_{k≥2} 1/(k log k) = +∞, hence H_α(X) ≥ H_{1/2}(X) = +∞ for all α ≤ ½.

C. Massey-Type Inequalities for Guessing
As in Corollary 1, we can apply (57) to G(X) − 1, where G(X) is the number of successive guesses of some X in the optimal strategy leading to the guessing entropy (33). Again the µ + ½ term in (57) is replaced by G(X) − ½, and we immediately obtain the following

Corollary 2:
When H_α(X) is expressed in bits, for any α > ½,

G(X) ≥ 2^{H_α(X)} / (1 + (α−1)/α)^{α/(α−1)} + ½.   (60)

Remark 8:
Since the factor (1 + (α−1)/α)^{α/(α−1)} converges to e as α → 1, inequality (35) of Corollary 1 is recovered by letting α → 1. This factor is nonincreasing in α, and since 1 + x < e^x for x ≠ 0, the term (1 + (α−1)/α)^{α/(α−1)} = (1 + x)^{1/x} with x = (α−1)/α is greater than e for α < 1 and less than e for α > 1. Thus, for example, we have

G(X) ≥ (8/27)·2^{H_{3/4}(X)} + ½   (61)
G(X) ≥ (4/9)·2^{H₂(X)} + ½   (62)

where 8/27 < 1/e < 4/9. Since H_α(X) is also nonincreasing in α, it follows that none of the inequalities (60) is a trivial consequence of another for a different value of α.

Remark 9:
By Remark 7, no inequality of the type (60) can generally hold for α ≤ ½. This does not contradict Arikan's inequality [25], which reads

G(X) ≥ 2^{H_{1/2}(X)} / (1 + ln M)   (63)

because it was established when X takes M possible values, where M can be arbitrarily large.
A similar remark can be made for the moment E[G^ρ(X)] considered by Arikan [25], by using the extension to ρth-order moments of Remark 5. It is then found that E[G^ρ(X)] can be lower-bounded by an exponential function of H_α(X) for any α > 1/(1+ρ) but not for α = 1/(1+ρ) in general (when the number of possible values of X is infinite).

V. ANOTHER APPROACH TO MASSEY INEQUALITIES
A. A Mixed Discrete-Continuous Kullback's Inequality
Instead of applying Kullback's inequality (23) to X̃ = X + U, we now apply a similar inequality directly to the integer-valued variable X, using the same exponential family density (21). This will have the effect of removing the constant 1/12 in (30) and the constant ½ in (32), at the expense of an additional additive constant in the upper bound.

Theorem 2 (Generalized Kullback's inequality for integer-valued variables):
Let X be integer-valued and let X̃ be the random variable having density (21):

f(x) = e^{−T(x)} / Z   (64)

such that the "moment" E[T(X)] = E[T(X̃)] = m is a fixed quantity. Then

H(X) ≤ h(X̃) + log Z′   (65)

where Z′ is a "discrete partition function"

Z′ = Σ_x f(x)   (66)

the sum being over all integer values x of X.

Proof:
Apply the information inequality D(p‖q) ≥ 0 to p(x) = P(X = x), the probability distribution of X, and to q(x) = f(x)/Z′, which is also a discrete probability distribution on the same set of values because of the normalization constant Z′. We obtain Gibbs' inequality in the form H(X) ≤ −E log q(X) = −E log f(X) + log Z′, where −E log f(X) = E[T(X)] log e + log Z = E[T(X̃)] log e + log Z = h(X̃).
Notice that (65) is again invariant by scaling: if ∆ > 0, H(∆X) = H(X) while h(∆X̃) = h(X̃) + log ∆ and Z′ is divided by ∆, hence the r.h.s. of (65) becomes h(X̃) + log ∆ + log(Z′/∆) = h(X̃) + log Z′.
Similarly as in the preceding sections, we illustrate this approach in three situations:
a) Support length parameter: X has finite support {k, ..., k+ℓ} of length ℓ ≥ 0, and X̃ is uniformly distributed on an interval (a, b) that includes {k, ..., k+ℓ}, of differential entropy h(X̃) = log(b − a).
b) Variance parameter: X has finite mean µ and variance σ², and X̃ ∼ N(µ, σ²) of differential entropy h(X̃) = ½ log(2πeσ²).
c) Mean parameter: X ≥ 0 has finite mean µ, and X̃ has exponential density e^{−x/µ}/µ of differential entropy h(X̃) = log(eµ).
Again case a) is trivial: since Z′ = Σ_x 1/(b−a) = (ℓ+1)/(b−a), we end up with the classical bound H(X) ≤ log(b−a) + log((ℓ+1)/(b−a)) = log(ℓ+1), achieved when X is equiprobable. For the other cases, Theorem 2 gives the following.
b) If X has finite variance σ²,

H(X) ≤ ½ log(2πeσ²) + log Σ_x e^{−(x−µ)²/(2σ²)} / √(2πσ²)   (67)

c) If X ≥ 0 has finite mean µ,

H(X) ≤ log(eµ) + log Σ_x e^{−x/µ}/µ   (68)

the sums being taken over all integer values x of X.

B. Use of the Poisson Summation Formula
When σ or µ is large, the additional logarithmic term log Z′ in (65) is likely small because of the approximation Z′ = Σ_x f(x) ≈ ∫ f(x) dx = 1. In order to evaluate this precisely, the Poisson summation formula can be used.
Let f̂(t) = ∫ f(x) e^{−2iπtx} dx be the Fourier transform of f(x). If f and f̂ have O(|x|⁻¹⁻ᵉ) decay at infinity [26, p. 252], then Poisson's summation formula holds:

Σ_{x∈Z} f(x) = Σ_{x∈Z} f̂(x)   (69)

where the x = 0 term in the r.h.s. is f̂(0) = ∫ f(x) dx = 1. Thus for case b) one has [27, Eq. (15)]

Σ_{x∈Z} e^{−(x−µ)²/(2σ²)}/√(2πσ²) = Σ_{x∈Z} e^{−2iπµx} e^{−2(πσx)²} = 1 + 2 Σ_{x=1}^{+∞} e^{−2(πσx)²} cos 2πµx.   (70)

For case c), one can apply Poisson's formula to the symmetrized density ½(f(x) + f(−x)) to ensure that the decay condition at infinity holds:

Σ_{x∈Z} e^{−|x|/µ}/(2µ) = Σ_{x∈Z} 1/(1 + (2πµx)²).   (71)

This yields

Σ_{x∈N} e^{−x/µ}/µ = 1 + 1/(2µ) + 2 Σ_{x=1}^{+∞} 1/(1 + (2πµx)²).   (72)

Of course one can also evaluate the geometric sum in the l.h.s.: Σ_{x∈N} e^{−x/µ}/µ = (1/µ)/(1 − e^{−1/µ}). But the expression (72) is more convenient to show that Σ_{x∈N} e^{−x/µ}/µ > 1 + 1/(2µ). This implies that (68) is unfortunately strictly weaker than (32); in fact, (32) already reads H(X) ≤ log(eµ) + log(1 + 1/(2µ)). Thus, the approach of this section cannot improve (32).
For case b), however, (67) together with Poisson's formula (70) greatly improves Massey's original inequality (30).

C. Improved Massey's Inequality for Large Variance

Corollary 3:
For any integer-valued $X$ of variance $\sigma^2 > 0$,
$$H(X) < \frac{1}{2}\log(2\pi e\sigma^2) + \frac{2\log e}{e^{2\pi^2\sigma^2} - 1}. \quad (73)$$

Proof:
The sum in the r.h.s. of (70) is bounded by $\sum_{x\geq 1} e^{-2\pi^2\sigma^2 x^2} \leq \sum_{x\geq 1} \big(e^{-2\pi^2\sigma^2}\big)^x = \frac{1}{e^{2\pi^2\sigma^2}-1}$. Substituting in (67) and using the inequality $\log(1+z) < (\log e)\,z$ (when $z > 0$) gives the result.

Massey's original inequality (30) reads $H(X) \leq \frac{1}{2}\log\big(2\pi e(\sigma^2 + \frac{1}{12})\big) < \frac{1}{2}\log(2\pi e\sigma^2) + \frac{\log e}{24\sigma^2}$. Here in (73) the $O(1/\sigma^2)$ term is replaced by the exponentially small $O(e^{-2\pi^2\sigma^2})$.

As an illustration, consider a binomial $X \sim \mathcal{B}(n,p)$ of variance $\sigma^2 = npq$ (where $p + q = 1$). The best known upper bound on $H(X)$ is [28, Eq. (7)]
$$H(X) < \frac{1}{2}\log(2\pi e\,npq) + \frac{\log e}{n} + \frac{\log(pq)}{2n} + \frac{\log e}{npq}, \quad (74)$$
which (73) considerably improves for large $n$, since all $O(1/n)$ terms are replaced by $O(e^{-2\pi^2 npq})$:
$$H(X) < \frac{1}{2}\log(2\pi e\,npq) + \frac{2\log e}{e^{2\pi^2 npq} - 1}. \quad (75)$$
The exponentially small term can even be made to disappear under mild conditions. For example:

Corollary 4:
If the integer-valued variable $X \in \mathbb{N}$ is nonnegative and $\mu/\sigma^2$ is bounded by a constant $< \pi$, then for large enough $\sigma$,
$$H(X) < \frac{1}{2}\log(2\pi e\sigma^2). \quad (76)$$

Proof:
Apply (67), where the sum can be taken only over $x \in \mathbb{N}$. Then by (70),
$$\sum_{x\in\mathbb{N}} \frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}} \leq 1 + 2\sum_{x=1}^{+\infty} e^{-2\pi^2\sigma^2 x^2} - \sum_{x=1}^{+\infty} \frac{e^{-\frac{(x+\mu)^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}}.$$
To obtain (76) it is sufficient to prove that $e^{-2\pi^2\sigma^2 x^2} < \frac{e^{-\frac{1}{2}\left(\frac{x+\mu}{\sigma}\right)^2}}{2\sqrt{2\pi\sigma^2}}$, i.e.,
$$2\pi^2\sigma^2 x^2 - \frac{1}{2}\Big(\frac{x+\mu}{\sigma}\Big)^2 > \log\sqrt{8\pi\sigma^2}$$
for all $x \geq 1$. When $2\pi\sigma^2 > 1$ we have $(2\pi\sigma)^2 > 1/\sigma^2$ and it is enough to prove the required inequality for $x = 1$, i.e., $(2\pi\sigma)^2 > \big(\frac{\mu+1}{\sigma}\big)^2 + \log(8\pi\sigma^2)$. This will hold for large enough $\sigma$ provided that $\pi\sigma^2 > (1+\varepsilon)\mu$ for some $\varepsilon > 0$.

As an example, if $X \sim \mathcal{P}(\lambda)$ is Poisson-distributed, then $\mu/\sigma^2 = \lambda/\lambda = 1 < \pi$, so that for large enough $\lambda$,
$$H(X) < \frac{1}{2}\log(2\pi e\lambda). \quad (77)$$
It is found numerically that this inequality holds as soon as $\lambda > \ldots$

Similarly, if $X \sim \mathcal{B}(n,p)$ is binomial, we may always assume that $p \leq \frac{1}{2}$, since considering $n - X$ in place of $X$ permutes the roles of $p$ and $q = 1-p$ without changing $H(X)$. Then $\mu/\sigma^2 = \frac{np}{npq} = \frac{1}{q} \leq 2 < \pi$, and by Corollary 4, for large enough $n$,
$$H(X) < \frac{1}{2}\log(2\pi e\,npq). \quad (78)$$
It is found numerically that this inequality holds for all $n > \ldots$ as soon as $|p - \frac{1}{2}| < \ldots$

For the last two examples, Takano's strong central limit theorem [29, Thm. 2] implies that
$$H(X) = \frac{1}{2}\log(2\pi e\sigma^2) + o\Big(\frac{1}{\sigma^\varepsilon}\Big) \quad (79)$$
for every $\varepsilon > 0$. The above inequalities show that the $o\big(\frac{1}{\sigma^\varepsilon}\big)$ term is actually negative for large enough $\sigma$.

APPENDIX A
PROOF OF LEMMA

For $\big(1 + \beta(\frac{x-\mu_X}{\sigma_X})^2\big)^{\frac{1}{\alpha-1}}$ to be integrable, it is necessary that $\beta$ has the same sign as $1 - \alpha$. For $\alpha > 1$, the support of this function is the interval $|x - \mu_X| \leq \sigma_X/\sqrt{|\beta|}$. For $\alpha < 1$, the existence of a finite variance implies that the integral of $x^2\big(1 + \beta(\frac{x-\mu_X}{\sigma_X})^2\big)^{-\frac{1}{1-\alpha}}$ converges at infinity, which requires $\alpha > \frac{1}{3}$. In either case $\beta$ is such that $\varphi$ has variance $\sigma_X^2$, that is,
$$Z_\alpha = \int \Big(1 + \beta\big(\tfrac{x-\mu_X}{\sigma_X}\big)^2\Big)^{\frac{\alpha}{\alpha-1}} \mathrm{d}x = \int \Big(1 + \beta\big(\tfrac{x-\mu_X}{\sigma_X}\big)^2\Big)\Big(1 + \beta\big(\tfrac{x-\mu_X}{\sigma_X}\big)^2\Big)^{\frac{1}{\alpha-1}} \mathrm{d}x = Z(1+\beta), \quad (80)$$
hence $m = 1 + \beta = Z_\alpha/Z$. Now we can write $Z = \frac{\sigma_X}{\sqrt{|\beta|}}\, I\big(\frac{1}{\alpha-1}\big)$ and $Z_\alpha = \frac{\sigma_X}{\sqrt{|\beta|}}\, I\big(\frac{\alpha}{\alpha-1}\big)$, where $I(\gamma)$ is the integral
$$I(\gamma) = \int_{-\infty}^{+\infty} \frac{\mathrm{d}x}{(1+x^2)^{-\gamma}} = \int_0^1 (1-t)^{-\gamma-\frac{3}{2}}\, t^{-\frac{1}{2}}\,\mathrm{d}t = \frac{\Gamma(-\gamma-\frac{1}{2})\sqrt{\pi}}{\Gamma(-\gamma)} \quad \text{for } \gamma < -\tfrac{1}{2};$$
$$I(\gamma) = \int_{-1}^{1} (1-x^2)^{\gamma}\,\mathrm{d}x = \int_0^1 (1-t)^{\gamma}\, t^{-\frac{1}{2}}\,\mathrm{d}t = \frac{\Gamma(\gamma+1)\sqrt{\pi}}{\Gamma(\gamma+\frac{3}{2})} \quad \text{for } \gamma > 0. \quad (81)$$
Here we have made the change of variables $t = x^2$ and $t = \frac{x^2}{1+x^2}$, respectively, and recognized Euler integrals of the first kind. In either case, letting $\gamma = \frac{1}{\alpha-1}$,
$$m = \frac{Z_\alpha}{Z} = \frac{I(\gamma+1)}{I(\gamma)} = \frac{-\gamma-1}{-\gamma-\frac{3}{2}} = \frac{\gamma+1}{\gamma+\frac{3}{2}} = \frac{2\alpha}{3\alpha-1}, \quad (82)$$
hence $\beta = \frac{1-\alpha}{3\alpha-1}$. Plugging this and $Z = \frac{\sigma_X}{\sqrt{|\beta|}}\, I\big(\frac{1}{\alpha-1}\big)$ into $\varphi(x) = \frac{1}{Z}\big(1 + \beta(\frac{x-\mu_X}{\sigma_X})^2\big)_+^{\frac{1}{\alpha-1}}$ gives (45). Plugging (82) and $Z_\alpha = \frac{\sigma_X}{\sqrt{|\beta|}}\, I\big(\frac{\alpha}{\alpha-1}\big)$ into (43) gives (46).

APPENDIX B
PROOF OF LEMMA

For $\big(1 + \beta\frac{x}{\mu_X}\big)^{\frac{1}{\alpha-1}}$ to be integrable, it is necessary that $\beta$ has the same sign as $1 - \alpha$.
For $\alpha > 1$, the support of this function is the interval $0 \leq x \leq \mu_X/|\beta|$. For $\alpha < 1$, the existence of a finite mean implies that the integral of $x\big(1 + \beta\frac{x}{\mu_X}\big)^{-\frac{1}{1-\alpha}}$ converges at $+\infty$, which requires $\alpha > \frac{1}{2}$. In either case $\beta$ is such that $\varphi$ has mean $\mu_X$, that is,
$$Z_\alpha = \int_0^{+\infty} \big(1 + \beta\tfrac{x}{\mu_X}\big)^{\frac{\alpha}{\alpha-1}}\, \mathrm{d}x = \int_0^{+\infty} \big(1 + \beta\tfrac{x}{\mu_X}\big)\big(1 + \beta\tfrac{x}{\mu_X}\big)^{\frac{1}{\alpha-1}}\, \mathrm{d}x = Z(1+\beta), \quad (83)$$
hence $m = 1 + \beta = Z_\alpha/Z$. Now we can write $Z = \frac{\mu_X}{|\beta|}\, I\big(\frac{1}{\alpha-1}\big)$ and $Z_\alpha = \frac{\mu_X}{|\beta|}\, I\big(\frac{\alpha}{\alpha-1}\big)$, where $I(\gamma)$ is the integral
$$I(\gamma) = \int_0^{+\infty} \frac{\mathrm{d}x}{(1+x)^{-\gamma}} = \frac{-1}{\gamma+1} \quad \text{for } \gamma < -1; \qquad I(\gamma) = \int_0^1 (1-x)^{\gamma}\,\mathrm{d}x = \frac{1}{\gamma+1} \quad \text{for } \gamma > 0. \quad (84)$$
In either case, letting $\gamma = \frac{1}{\alpha-1}$,
$$m = \frac{Z_\alpha}{Z} = \frac{I(\gamma+1)}{I(\gamma)} = \frac{\gamma+1}{\gamma+2} = \frac{\alpha}{2\alpha-1}, \quad (85)$$
hence $\beta = \frac{1-\alpha}{2\alpha-1}$. Plugging this and $Z = \frac{\mu_X}{|\beta|}\, I\big(\frac{1}{\alpha-1}\big)$ into $\varphi(x) = \frac{1}{Z}\big(1 + \beta\frac{x}{\mu_X}\big)_+^{\frac{1}{\alpha-1}}$ gives (49). Furthermore, one has
$$Z_\alpha = \frac{\mu_X}{|\beta|}\, I\Big(\frac{\alpha}{\alpha-1}\Big) = \frac{\mu_X}{|\beta|}\cdot\frac{\alpha-1}{2\alpha-1} = \mu_X. \quad (86)$$
Plugging this and (85) into (43) gives (50).

REFERENCES

[1] J. L. Massey, "On the entropy of integer-valued random variables," in
Proc. Beijing International Workshop on Information Theory, July 4–7, 1988.
[2] ——, "Guessing and entropy," in Proc. IEEE International Symposium on Information Theory, 1994, p. 204.
[3] S. Kullback, "Certain inequalities in information theory and the Cramér–Rao inequality," The Annals of Mathematical Statistics, vol. 25, no. 4, pp. 745–751, Dec. 1954.
[4] M. O. Choudary and P. G. Popescu, "Back to Massey: Impressively fast, scalable and tight security evaluation tools," in Proc. 19th Workshop on Cryptographic Hardware and Embedded Systems (CHES 2017), vol. LNCS 10529, 2017, pp. 367–386.
[5] F. M. Reza, An Introduction to Information Theory. New York: Dover, 1961.
[6] R. J. McEliece, The Theory of Information and Coding. Cambridge University Press, 1st Ed. 1985, 2nd Ed. 2002.
[7] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 1st Ed. 1990, 2nd Ed. 2006.
[8] ——, "Elements of information theory: Solutions to problems," Oct. 2006.
[9] O. Rioul and J. C. Magossi, "On Shannon's formula and Hartley's rule: Beyond the mathematical coincidence," Entropy, vol. 16, no. 9, pp. 4892–4910, Sept. 2014.
[10] O. Rioul, "This is IT: A primer on Shannon's entropy and information," Mathematical Physics, vol. Bourbaphy Seminar XXIII (2018), to appear.
[11] I. N. Sanov, "On the probability of large deviations of random variables," Matematicheskii Sbornik, vol. 42 (84), no. 1, pp. 11–44 (in Russian), 1957 (translation: North Carolina Institute of Statistics, Mimeograph Series, no. 192, Mar. 1958).
[12] S. Kullback, Information Theory and Statistics. Wiley, 1st Ed. 1959; Dover, 2nd Ed. 1968.
[13] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 623–656, Oct. 1948.
[14] R. J. Evans, J. Boersma, N. M. Blachman, and A. A. Jagers, "The entropy of a Poisson distribution: Problem 87-6," SIAM Review, vol. 30, no. 2, pp. 314–317, June 1988.
[15] E. de Chérisey, S. Guilley, P. Piantanida, and O. Rioul, "Best information is most successful: Mutual information and success rate in side-channel analysis," IACR Transactions on Cryptographic Hardware and Embedded Systems (TCHES 2019), vol. 2019, no. 2, pp. 49–79, 2019.
[16] A. Tănăsescu and P. G. Popescu, "Exploiting the Massey gap," Entropy, vol. 22, no. 1398, pp. 1–9, Dec. 2020.
[17] P. G. Popescu and M. O. Choudary, "Refinement of Massey inequality," in Proc. IEEE International Symposium on Information Theory, 2019, pp. 495–496.
[18] A. Tănăsescu, M. O. Choudary, O. Rioul, and P. G. Popescu, "An asymptotically optimal global Massey-like inequality and refinement for finite support distributions," in preparation, 2021.
[19] T. van Erven and P. Harremoës, "Rényi divergence and Kullback-Leibler divergence," IEEE Trans. Inf. Theory, vol. 60, no. 7, pp. 3797–3820, Jul. 2014.
[20] A. Lapidoth and C. Pfister, "Two measures of dependence," in Proc. IEEE International Conference on the Science of Electrical Engineering (ICSEE 2016), 2016.
[21] O. Rioul, "Rényi entropy power and normal transport," in Proc. International Symposium on Information Theory and Its Applications (ISITA 2020), Oct. 24–27, 2020, pp. 1–5.
[22] J. Costa, A. Hero, and C. Vignat, "On solutions to multivariate maximum α-entropy problems," in Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR 2003), ser. Lecture Notes in Computer Science, A. Rangarajan, M. Figueiredo, and J. Zerubia, Eds., vol. 2683. Springer, 2003, pp. 211–226.
[23] C. Bunte and A. Lapidoth, "Maximizing Rényi entropy rate," in Proc. IEEE 28th Convention of Electrical and Electronics Engineers in Israel, 2014.
[24] E. Lutwak, D. Yang, and G. Zhang, "Cramér–Rao and moment-entropy inequalities for Rényi entropy and generalized Fisher information," IEEE Transactions on Information Theory, vol. 51, no. 2, pp. 473–478, Feb. 2005.
[25] E. Arikan, "An inequality on guessing and its application to sequential decoding," IEEE Transactions on Information Theory, vol. 42, no. 1, pp. 99–105, Jan. 1996.
[26] E. M. Stein and G. Weiss, Introduction to Fourier Analysis on Euclidean Spaces. Princeton University Press, 1971.
[27] S.-D. Poisson, "Suite du mémoire sur les intégrales définies et sur la sommation des séries," Journal de l'École Royale Polytechnique, vol. 19, no. 12, pp. 404–509, July 1823.
[28] J. A. Adell, A. Lekuona, and Y. Yu, "Sharp bounds on the entropy of the Poisson law and related quantities," IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2299–2306, May 2010.
[29] S. Takano, "Convergence of entropy in the central limit theorem,"