On Reverse Pinsker Inequalities
Igal Sason
Abstract
New upper bounds on the relative entropy are derived as a function of the total variation distance. One bound refines an inequality by Verdú for general probability measures. A second bound improves the tightness of an inequality by Csiszár and Talata for arbitrary probability measures that are defined on a common finite set. The latter result is further extended, for probability measures on a finite set, leading to an upper bound on the Rényi divergence of an arbitrary non-negative order (including $\infty$) as a function of the total variation distance. Another lower bound by Verdú on the total variation distance, expressed in terms of the distribution of the relative information, is tightened, and it is attained under some conditions. The effect of these improvements is exemplified.

Keywords: Pinsker's inequality, relative entropy, relative information, Rényi divergence, total variation distance, typical sequences.

I. INTRODUCTION
Consider two probability measures $P$ and $Q$ defined on a common measurable space $(\mathcal{A}, \mathscr{F})$. The Csiszár–Kemperman–Kullback–Pinsker inequality states that
$$ D(P\|Q) \ \ge\ \frac{\log e}{2}\, |P-Q|^2 \tag{1} $$
where
$$ D(P\|Q) = \mathbb{E}_P\!\left[\log\frac{\mathrm{d}P}{\mathrm{d}Q}\right] = \int_{\mathcal{A}} \mathrm{d}P\, \log\frac{\mathrm{d}P}{\mathrm{d}Q} \tag{2} $$
designates the relative entropy from $P$ to $Q$ (a.k.a. the Kullback–Leibler divergence), and
$$ |P-Q| = 2 \sup_{A\in\mathscr{F}} \big|P(A)-Q(A)\big| \tag{3} $$
designates the total variation distance (or $L_1$ distance) between $P$ and $Q$. One of the implications of inequality (1) is that convergence in relative entropy implies convergence in total variation distance. The total variation distance is bounded, $|P-Q| \le 2$, in contrast to the relative entropy. Inequality (1) is a.k.a. Pinsker's inequality, although the analysis made by Pinsker [15] leads to a significantly looser bound in which the constant $\frac{1}{2}\log e$ on the RHS of (1) is replaced by a much smaller constant (see [25, Eq. (51)]). Improved and generalized versions of Pinsker's inequality have been studied in [7], [8], [9], [14], [18], [24].

For any $\varepsilon > 0$, there exists a pair of probability measures $P$ and $Q$ such that $|P-Q| \le \varepsilon$ while $D(P\|Q) = \infty$. Consequently, a reverse Pinsker inequality, which provides an upper bound on the relative entropy in terms of the total variation distance, does not exist in general. Nevertheless, under some conditions, such inequalities hold [4], [25], [26] (to be addressed later in this section). If $P \ll Q$, the relative information in $a \in \mathcal{A}$ according to $(P,Q)$ is defined to be
$$ \imath_{P\|Q}(a) \triangleq \log\frac{\mathrm{d}P}{\mathrm{d}Q}(a). \tag{4} $$

I. Sason is with the Department of Electrical Engineering, Technion–Israel Institute of Technology, Haifa 32000, Israel (e-mail: [email protected]). This work has been submitted in part to the 2015 IEEE Information Theory Workshop, Jeju Island, Korea, October 11–15, 2015. The research was supported by the Israeli Science Foundation (ISF), grant number 12/12.
From (2), the relative entropy can be expressed in terms of the relative information as follows:
$$ D(P\|Q) = \mathbb{E}\big[\imath_{P\|Q}(X)\big] = \mathbb{E}\big[\imath_{P\|Q}(Y)\,\exp\big(\imath_{P\|Q}(Y)\big)\big] \tag{5} $$
where $X \sim P$ and $Y \sim Q$ (i.e., $X$ and $Y$ are distributed according to $P$ and $Q$, respectively). The total variation distance is also expressible in terms of the relative information [25]. If $P \ll Q$, then
$$ |P-Q| = \mathbb{E}\Big[\big|1 - \exp\big(\imath_{P\|Q}(Y)\big)\big|\Big] \tag{6} $$
and if, in addition, $Q \ll P$, then
$$ |P-Q| = \mathbb{E}\Big[\big|1 - \exp\big(-\imath_{P\|Q}(X)\big)\big|\Big]. \tag{7} $$
Let
$$ \beta^{-1} \triangleq \sup_{a\in\mathcal{A}} \frac{\mathrm{d}P}{\mathrm{d}Q}(a) \tag{8} $$
with the convention, implied by continuity, that $\beta = 0$ if $\imath_{P\|Q}$ is unbounded from above. With $\beta \in [0,1]$, as it is defined in (8), the following inequality holds (see [25, Theorem 7]):
$$ |P-Q| \ \ge\ \left(\frac{2(1-\beta)}{\log\frac{1}{\beta}}\right) D(P\|Q). \tag{9} $$
From (9), if the relative information is bounded from above, a reverse Pinsker inequality holds. This inequality has been recently used in the context of the optimal quantization of probability measures, when the distortion is characterized either by the total variation distance or by the relative entropy between the approximating and the original probability measures [2, Proposition 4]. Inequality (9) is refined in this work, and the improvement that is obtained by this refinement is exemplified (see Section II).

In the special case where $P$ and $Q$ are defined on a common discrete set (i.e., a finite or countable set) $\mathcal{A}$, the relative entropy and total variation distance are simplified to
$$ D(P\|Q) = \sum_{a\in\mathcal{A}} P(a)\log\frac{P(a)}{Q(a)}, \qquad |P-Q| = \sum_{a\in\mathcal{A}} \big|P(a)-Q(a)\big| = \|P-Q\|_1. $$
A restriction to probability measures on a finite set $\mathcal{A}$ has led in [4, p. 1012 and Lemma 6.3] to the following upper bound on the relative entropy in terms of the total variation distance:
$$ D(P\|Q) \ \le\ \left(\frac{\log e}{Q_{\min}}\right) |P-Q|^2 \tag{10} $$
where $Q_{\min} \triangleq \min_{a\in\mathcal{A}} Q(a)$, suggesting a kind of a reverse Pinsker inequality for probability measures on a finite set. A recent application of this bound has been exemplified in [13, Appendix D] and [23, Lemma 7] for the analysis of the third-order asymptotics of the discrete memoryless channel with or without cost constraints.

The present paper also considers generalized reverse Pinsker inequalities for Rényi divergences. In the discrete setting, the Rényi divergence of order $\alpha$ from $P$ to $Q$ is defined as
$$ D_\alpha(P\|Q) \triangleq \frac{1}{\alpha-1}\, \log\!\left(\sum_{a\in\mathcal{A}} P^\alpha(a)\, Q^{1-\alpha}(a)\right), \qquad \forall\, \alpha \in (0,1)\cup(1,\infty). \tag{11} $$
Recall that $D_1(P\|Q) \triangleq D(P\|Q)$ is defined to be the analytic extension of $D_\alpha(P\|Q)$ at $\alpha = 1$ (if $D(P\|Q) < \infty$, it can be verified with L'Hôpital's rule that $D(P\|Q) = \lim_{\alpha\to 1^-} D_\alpha(P\|Q)$). The extreme cases of $\alpha = 0, \infty$ are defined as follows:
• If $\alpha = 0$ then $D_0(P\|Q) = -\log Q\big(\mathrm{Support}(P)\big)$, where $\mathrm{Support}(P) = \{a \in \mathcal{A} : P(a) > 0\}$ denotes the support of $P$;
• If $\alpha = +\infty$ then $D_\infty(P\|Q) = \log\big(\operatorname{ess\,sup} \frac{P}{Q}\big)$, where $\operatorname{ess\,sup} f$ denotes the essential supremum of a function $f$.
Pinsker's inequality has been generalized by Gilardoni [9] for Rényi divergences of order $\alpha \in (0,1]$ (see also [6, Theorem 30]), and it gets the form
$$ D_\alpha(P\|Q) \ \ge\ \frac{\alpha \log e}{2}\, |P-Q|^2. $$
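The basic quantities in (1)–(3) and (11) are easy to verify numerically. The following Python sketch is an illustration added for this presentation (not part of the original analysis); the distributions are hypothetical, and natural logarithms are used throughout, so $\log e = 1$ and all divergences are in nats. It checks Pinsker's inequality (1) and Gilardoni's generalization for $\alpha \in (0,1]$.

```python
# Numerical sanity check of (1)-(3) and (11); natural logarithms (nats).
import numpy as np

def kl_divergence(p, q):
    """D(P||Q) in nats; assumes P is absolutely continuous w.r.t. Q."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def total_variation(p, q):
    """|P - Q| = sum_a |P(a) - Q(a)|, taking values in [0, 2]."""
    return float(np.sum(np.abs(p - q)))

def renyi_divergence(p, q, alpha):
    """D_alpha(P||Q) for alpha in (0,1) or (1,inf), per (11)."""
    return float(np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0))

p = np.array([0.5, 0.3, 0.2])   # hypothetical P
q = np.array([0.25, 0.25, 0.5]) # hypothetical Q
tv = total_variation(p, q)

# Pinsker's inequality (1): D(P||Q) >= |P-Q|^2 / 2 (in nats).
assert kl_divergence(p, q) >= 0.5 * tv**2
# Gilardoni's generalization: D_alpha >= (alpha/2) |P-Q|^2 for alpha in (0,1].
for alpha in (0.25, 0.5, 0.75):
    assert renyi_divergence(p, q, alpha) >= 0.5 * alpha * tv**2
```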
An improved bound, providing the best lower bound on the Rényi divergence of order $\alpha > 0$ in terms of the total variation distance, has been recently introduced in [20, Section 2]. Motivated by these findings, the analysis in this paper suggests an improvement over the upper bound on the relative entropy in (10) for probability measures defined on a common finite set. The improved version of the bound in (10) is further generalized to provide an upper bound on the Rényi divergence of orders $\alpha \in [0,\infty]$ in terms of the total variation distance.

Note that the issue addressed in this paper, of deriving, under suitable conditions, upper bounds on the relative entropy as a function of the total variation distance, has some similarity to the issue of deriving upper bounds on the difference between entropies as a function of the total variation distance. Note also that, in the special case where $Q$ is a Gaussian distribution and $P$ is a probability distribution with the same covariance matrix, $D(P\|Q) = h(Q) - h(P)$, where $h(\cdot)$ denotes the differential entropy of a specified distribution (see [3, Eq. (8.76)]). Bounds on the entropy difference in terms of the total variation distance have been studied, e.g., in [3, Theorem 17.3.3], [11], [16], [17], [19], [26, Section 1.7], [27].

This paper is structured as follows: Section II refers to [25], deriving a refined version of inequality (9) for general probability measures, and improving another lower bound on the total variation distance which is expressed in terms of the distribution of the relative information. Section III derives a reverse Pinsker inequality for probability measures on a finite set, improving inequality (10) that follows from [4, Lemma 6.3]. Section IV extends the analysis to Rényi divergences of arbitrary non-negative orders. Section V exemplifies the utility of a reverse Pinsker inequality in the context of typical sequences.

II. A REFINED REVERSE PINSKER INEQUALITY FOR GENERAL PROBABILITY MEASURES
The present section derives a reverse Pinsker inequality for general probability measures, suggesting a refined version of [25, Theorem 7]. The utility of this new inequality is exemplified. This section also provides a lower bound on the total variation distance which is based on the distribution of the relative information; the latter inequality is based on a modification of the proof of [25, Theorem 8], and it has the advantage of being tight for a two-parameter family of probability measures which are defined on an arbitrary set of 2 elements.
A. Main Result and Proof
Inequality (9) provides an upper bound on the relative entropy $D(P\|Q)$ as a function of the total variation distance when $P \ll Q$ and the relative information $\imath_{P\|Q}$ is bounded from above (this implies that $\beta$ in (8) is positive). The following theorem tightens this upper bound.

Theorem 1: Let $P$ and $Q$ be probability measures on a measurable space $(\mathcal{A},\mathscr{F})$ with $P \ll Q$, and let $\beta_1, \beta_2 \in [0,1]$ be given by
$$ \beta_2^{-1} \triangleq \sup_{a\in\mathcal{A}} \frac{\mathrm{d}P}{\mathrm{d}Q}(a), \qquad \beta_1 \triangleq \inf_{a\in\mathcal{A}} \frac{\mathrm{d}P}{\mathrm{d}Q}(a). \tag{12} $$
Then, the following inequality holds:
$$ D(P\|Q) \ \le\ \left(\frac{\log\frac{1}{\beta_2}}{1-\beta_2} \,-\, \beta_1 \log e\right) \frac{|P-Q|}{2}. \tag{13} $$

Proof:
Let $X \sim P$, $Y \sim Q$, and
$$ \mathcal{B} \triangleq \big\{a \in \mathcal{A} : \imath_{P\|Q}(a) > 0\big\}. \tag{14} $$
From (5), the relative entropy is equal to
$$ D(P\|Q) = \int_{\mathcal{A}} \mathrm{d}Q\, \exp\big(\imath_{P\|Q}\big)\, \imath_{P\|Q} = \int_{\mathcal{B}} \mathrm{d}Q\, \exp\big(\imath_{P\|Q}\big)\, \imath_{P\|Q} + \int_{\mathcal{A}\setminus\mathcal{B}} \mathrm{d}Q\, \exp\big(\imath_{P\|Q}\big)\, \imath_{P\|Q}. \tag{15} $$
In the following, the two integrals on the RHS of (15) are upper bounded. The upper bound on the first integral on the RHS of (15) is based on the proof of [25, Theorem 7]; it is provided in the following for completeness, and with more details, in order to clarify the way that this bound is refined here. Let $z(a) \triangleq \exp(\imath_{P\|Q}(a))$ for $a \in \mathcal{A}$. By assumption, $1 < z(a) \le \beta_2^{-1}$ for all $a \in \mathcal{B}$. The function $f(z) = \frac{z\log z}{z-1}$ is monotonically increasing over the interval $(1,\infty)$ since we have $(z-1)^2 f'(z) = (z-1)\log e - \log z > 0$ for $z > 1$. Consequently, we have
$$ \frac{z(a)\log z(a)}{z(a)-1} \ \le\ \frac{\log\frac{1}{\beta_2}}{1-\beta_2}, \qquad \forall\, a \in \mathcal{B} \tag{16} $$
and
$$ \begin{aligned} \int_{\mathcal{B}} \mathrm{d}Q\, \exp\big(\imath_{P\|Q}\big)\, \imath_{P\|Q} \ &\overset{(a)}{\le}\ \frac{\log\frac{1}{\beta_2}}{1-\beta_2} \int_{\mathcal{B}} \mathrm{d}Q\, \big(\exp(\imath_{P\|Q}) - 1\big) \\ &\overset{(b)}{=}\ \frac{\log\frac{1}{\beta_2}}{1-\beta_2} \int_{\mathcal{A}} \mathrm{d}Q\, \big(1-\exp(\imath_{P\|Q})\big)^- \\ &\overset{(c)}{=}\ \frac{\log\frac{1}{\beta_2}}{1-\beta_2}\; \mathbb{E}\Big[\big(1-\exp(\imath_{P\|Q}(Y))\big)^-\Big] \\ &\overset{(d)}{=}\ \left(\frac{\log\frac{1}{\beta_2}}{1-\beta_2}\right) \frac{|P-Q|}{2} \end{aligned} \tag{17} $$
where inequality (a) follows from (16), equality (b) is due to (14) and the definition $(a)^- \triangleq -a\, 1\{a < 0\}$, equality (c) holds since $Y \sim Q$, and equality (d) follows from [25, Eq. (14)].

At this point, we deviate from the analysis in [25], where the second integral on the RHS of (15) has been upper bounded by zero (since $\imath_{P\|Q}(a) \le 0$ for all $a \in \mathcal{A}\setminus\mathcal{B}$). If $\beta_1 > 0$, we provide in the following a strictly negative upper bound on this integral. Since $P \ll Q$, we have
$$ \begin{aligned} \int_{\mathcal{A}\setminus\mathcal{B}} \mathrm{d}Q\, \exp\big(\imath_{P\|Q}\big)\, \imath_{P\|Q} \ &\overset{(a)}{=}\ \int_{\{a\in\mathcal{A}:\, \imath_{P\|Q}(a)<0\}} \mathrm{d}Q\, \frac{\mathrm{d}P}{\mathrm{d}Q}\, \imath_{P\|Q} \\ &\overset{(b)}{\le}\ \beta_1 \int_{\{a\in\mathcal{A}:\, \imath_{P\|Q}(a)<0\}} \mathrm{d}Q\, \imath_{P\|Q} \\ &\overset{(c)}{\le}\ \beta_1 \log e \int_{\{a\in\mathcal{A}:\, \imath_{P\|Q}(a)<0\}} \mathrm{d}Q\, \big(\exp(\imath_{P\|Q}) - 1\big) \\ &\overset{(d)}{=}\ -\beta_1 \log e \int_{\mathcal{A}\setminus\mathcal{B}} \mathrm{d}Q\, \big(1-\exp(\imath_{P\|Q})\big) \\ &\overset{(e)}{=}\ -\beta_1 \log e \int_{\mathcal{A}} \mathrm{d}Q\, \big(1-\exp(\imath_{P\|Q})\big)^+ \\ &\overset{(f)}{=}\ -\beta_1 \log e \cdot \mathbb{E}\Big[\big(1-\exp(\imath_{P\|Q}(Y))\big)^+\Big] \\ &\overset{(g)}{=}\ -\frac{\beta_1 \log e}{2} \cdot |P-Q| \end{aligned} \tag{18} $$
where equality (a) holds due to (4) and (14), and since the integrand is zero if $\imath_{P\|Q} = 0$; inequality (b) follows from the definition of $\beta_1$ in (12), and since $\imath_{P\|Q}$ is negative over the domain of integration; inequality (c) holds since the inequality $x \le \log e\, \big(\exp(x) - 1\big)$ is satisfied for all $x \in \mathbb{R}$; equalities (d) and (e) follow from the definition of the set $\mathcal{B}$ in (14); equality (f) holds since $Y \sim Q$; and equality (g) follows from [25, Eq. (15)].

Inequality (13) finally follows by combining (15), (17) and (18).
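As a randomized sanity check of Theorem 1, the following Python sketch (added for illustration; natural logarithms, so $\log e = 1$) draws strictly positive pairs $(P,Q)$ on small finite sets, where $\beta_1, \beta_2$ in (12) reduce to the minimum and the reciprocal of the maximum of the ratios $P(a)/Q(a)$, and verifies that (13) holds and is at least as tight as (9).

```python
# Randomized check of the refined bound (13) against the bound (9).
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    n = int(rng.integers(2, 8))
    p = rng.dirichlet(np.ones(n))
    q = rng.dirichlet(np.ones(n))
    ratio = p / q
    beta2 = 1.0 / float(np.max(ratio))   # beta_2^{-1} = sup dP/dQ, see (12)
    beta1 = float(np.min(ratio))         # beta_1 = inf dP/dQ
    tv = float(np.sum(np.abs(p - q)))
    kl = float(np.sum(p * np.log(ratio)))
    bound_9 = np.log(1 / beta2) / (1 - beta2) * tv / 2   # from (9)
    bound_13 = bound_9 - 0.5 * beta1 * tv                # refined bound (13)
    assert kl <= bound_13 + 1e-12
    assert bound_13 <= bound_9 + 1e-12
```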
B. Example for the Refined Inequality in Theorem 1

We exemplify in the following the improvement obtained by (13), in comparison to (9), due to the introduction of the additional parameter $\beta_1$ in (12). Note that when $\beta_1$ is replaced by zero (i.e., no information on the infimum of $\frac{\mathrm{d}P}{\mathrm{d}Q}$ is available, or $\beta_1 = 0$), inequalities (9) and (13) coincide.

Let $P$ and $Q$ be two probability measures, defined on a set $\mathcal{A}$, where $P \ll Q$, and assume that
$$ 1-\eta \ \le\ \frac{\mathrm{d}P}{\mathrm{d}Q}(a) \ \le\ 1+\eta, \qquad \forall\, a \in \mathcal{A} \tag{19} $$
for a fixed constant $\eta \in (0,1)$.

In (13), one can replace $\beta_1$ and $\beta_2$ with lower bounds on these constants. From (12), we have $\beta_2 \ge \frac{1}{1+\eta}$ and $\beta_1 \ge 1-\eta$, and it follows from (13) that
$$ D(P\|Q) \ \le\ \left(\frac{(1+\eta)\log(1+\eta)}{\eta} - (1-\eta)\log e\right) \frac{|P-Q|}{2} \ \le\ \Big((1+\eta)\log e - (1-\eta)\log e\Big)\, \frac{|P-Q|}{2} \ =\ \big(\eta \log e\big)\, |P-Q|. \tag{20} $$
From (19),
$$ \big|\exp\big(\imath_{P\|Q}(a)\big) - 1\big| \ \le\ \eta, \qquad \forall\, a \in \mathcal{A} $$
so, from (6), the total variation distance satisfies (recall that $Y \sim Q$)
$$ |P-Q| = \mathbb{E}\Big[\big|\exp\big(\imath_{P\|Q}(Y)\big) - 1\big|\Big] \ \le\ \eta. $$
Combining the last inequality with (20) gives that
$$ D(P\|Q) \ \le\ \eta^2 \log e, \qquad \forall\, \eta \in (0,1). \tag{21} $$
For comparison, it follows from (9) (see [25, Theorem 7]) that
$$ D(P\|Q) \ \le\ \frac{\log\frac{1}{\beta_2}}{2(1-\beta_2)} \cdot |P-Q| \ \le\ \frac{(1+\eta)\log(1+\eta)}{2\eta} \cdot |P-Q| \ \le\ \frac{(1+\eta)\log(1+\eta)}{2} \ \le\ \left(\frac{\log e}{2}\right) \eta\,(1+\eta). \tag{22} $$
Let $\eta \approx 0$. The upper bound on the relative entropy in (22) scales like $\eta$, whereas the tightened bound in (21) scales like $\eta^2$. The scaling in (21) is correct, as it follows from Pinsker's inequality. For example, consider the probability measures defined on a two-element set $\mathcal{A} = \{a,b\}$ with
$$ P(a) = Q(b) = \frac{1}{2} - \frac{\eta}{4}, \qquad P(b) = Q(a) = \frac{1}{2} + \frac{\eta}{4} $$
so that $|P-Q| = \eta$. Condition (19) is satisfied for $\eta \approx 0$, and Pinsker's inequality yields that
$$ D(P\|Q) \ \ge\ \left(\frac{\log e}{2}\right) \eta^2 \tag{23} $$
so the ratio of the upper and lower bounds in (21) and (23) is 2, and both provide the true quadratic scaling in $\eta$, whereas the weaker upper bound in (22) scales linearly in $\eta$ for $\eta \approx 0$.
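The two-point example above is easy to reproduce numerically. The following Python sketch (an added illustration; natural logarithms, so $\log e = 1$) evaluates $D(P\|Q)$, the refined bound (21), the weaker bound (22), and the Pinsker lower bound (23) for a few small values of $\eta$, exhibiting the quadratic-versus-linear scaling discussed in the text.

```python
# Two-point example: P(a) = 1/2 - eta/4, P(b) = 1/2 + eta/4, Q swapped,
# so |P - Q| = eta. Compares (21), (22) and (23) for small eta.
import math

for eta in (0.1, 0.01, 0.001):
    pa, pb = 0.5 - eta / 4, 0.5 + eta / 4
    qa, qb = pb, pa
    kl = pa * math.log(pa / qa) + pb * math.log(pb / qb)
    bound_21 = eta**2              # eta^2 log e, with log e = 1
    bound_22 = 0.5 * eta * (1 + eta)
    pinsker_23 = 0.5 * eta**2      # lower bound (23)
    print(f"eta={eta}: D={kl:.3e}, (21)={bound_21:.3e}, "
          f"(22)={bound_22:.3e}, (23)={pinsker_23:.3e}")
```

For $\eta = 0.01$, the bound (21) is about $10^{-4}$, i.e., of the same (quadratic) order as $D(P\|Q)$, while (22) is about $5\cdot 10^{-3}$, i.e., larger by roughly a factor of $1/\eta$.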
C. Another Lower Bound on the Total Variation Distance

The following lower bound on the total variation distance is based on the distribution of the relative information, and it improves the lower bounds in [15, Eq. (2.3.18)], [22, Lemma 7] and [25, Theorem 8] by modifying the proof of the latter theorem in [25]. Besides improving the tightness of the bound, the motivation for the derivation of the following lower bound is that it is attained under some conditions.
Theorem 2: If $P$ and $Q$ are mutually absolutely continuous probability measures, then
$$ |P-Q| \ \ge\ \sup_{\eta > 0} \Big\{ \big(1-\exp(-\eta)\big) \Big( \mathbb{P}\big[\imath_{P\|Q}(X) \ge \eta\big] + \exp(\eta)\, \mathbb{P}\big[\imath_{P\|Q}(X) \le -\eta\big] \Big) \Big\} \tag{24} $$
where $X \sim P$. This lower bound is attained if $P$ and $Q$ are probability measures on a 2-element set $\mathcal{A} = \{a,b\}$ where, for an arbitrary $\eta > 0$,
$$ P(a) = \frac{\exp(\eta)-1}{2\sinh(\eta)}, \qquad Q(a) = \frac{1-\exp(-\eta)}{2\sinh(\eta)}. \tag{25} $$

Proof:
Since $P \ll\gg Q$, we have from (7)
$$ |P-Q| = \mathbb{E}\Big[\big|1-\exp\big(-\imath_{P\|Q}(X)\big)\big|\Big] \ \ge\ \mathbb{E}\Big[\big|1-\exp\big(-\imath_{P\|Q}(X)\big)\big|\, 1\big\{|\imath_{P\|Q}(X)| \ge \eta\big\}\Big], \qquad \forall\, \eta > 0 $$
where $1\{\cdot\}$ is the indicator function of the specified event (it is equal to 1 if the event occurs, and it is zero otherwise). At this point, we deviate from the proof of [25, Theorem 8], and write
$$ \begin{aligned} |P-Q| \ &\ge\ \mathbb{E}\Big[\big|1-\exp\big(-\imath_{P\|Q}(X)\big)\big|\, 1\big\{\imath_{P\|Q}(X) \ge \eta\big\}\Big] + \mathbb{E}\Big[\big|1-\exp\big(-\imath_{P\|Q}(X)\big)\big|\, 1\big\{\imath_{P\|Q}(X) \le -\eta\big\}\Big] \\ &\overset{(a)}{\ge}\ \big(1-\exp(-\eta)\big)\, \mathbb{E}\big[1\{\imath_{P\|Q}(X) \ge \eta\}\big] + \big(\exp(\eta)-1\big)\, \mathbb{E}\big[1\{\imath_{P\|Q}(X) \le -\eta\}\big] \\ &=\ \big(1-\exp(-\eta)\big) \Big( \mathbb{P}\big[\imath_{P\|Q}(X) \ge \eta\big] + \exp(\eta)\, \mathbb{P}\big[\imath_{P\|Q}(X) \le -\eta\big] \Big) \end{aligned} \tag{26} $$
where step (a) follows from the inequality $|1-\exp(-z)| \ge 1-\exp(-\eta)$ if $z \ge \eta$, and $|1-\exp(-z)| \ge \exp(\eta)-1$ if $z \le -\eta$. Taking the supremum on the right-hand side of (26) w.r.t. the free parameter $\eta > 0$ gives the lower bound on $|P-Q|$ in (24).

The condition (25) for the tightness of the lower bound in (24) follows from the fact that, for an arbitrary $\eta > 0$, we have $\log\big(\frac{P(a)}{Q(a)}\big) = \eta$ and $\log\big(\frac{1-P(a)}{1-Q(a)}\big) = -\eta$. This yields that the inequalities in the derivation of the lower bound (24) hold with equality.
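The attainment condition (25) can be checked directly. The following Python sketch (an added illustration, using the natural-base $\exp/\log$ pair) constructs the two-point measures in (25) for a few values of $\eta$; the relative information then takes only the values $\pm\eta$, and the RHS of (24), evaluated at that $\eta$, matches the total variation distance exactly.

```python
# Check that the two-point construction (25) attains the bound (24).
import math

for eta in (0.2, 1.0, 3.0):
    s = 2 * math.sinh(eta)
    pa = (math.exp(eta) - 1) / s          # P(a) in (25)
    qa = (1 - math.exp(-eta)) / s         # Q(a) in (25)
    pb, qb = 1 - pa, 1 - qa
    tv = abs(pa - qa) + abs(pb - qb)
    # i(a) = +eta (P-probability pa), i(b) = -eta (P-probability pb):
    rhs_24 = (1 - math.exp(-eta)) * (pa + math.exp(eta) * pb)
    print(f"eta={eta}: |P-Q| = {tv:.6f} = RHS of (24) = {rhs_24:.6f}")
    assert abs(tv - rhs_24) < 1e-12
```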
Remark 1: One can further tighten the lower bound in (24) by writing, for arbitrary $\eta_1, \eta_2 > 0$,
$$ |P-Q| \ \ge\ \mathbb{E}\Big[\big|1-\exp\big(-\imath_{P\|Q}(X)\big)\big|\, 1\big\{\imath_{P\|Q}(X) \ge \eta_1\big\}\Big] + \mathbb{E}\Big[\big|1-\exp\big(-\imath_{P\|Q}(X)\big)\big|\, 1\big\{\imath_{P\|Q}(X) \le -\eta_2\big\}\Big] $$
and proceeding similarly to (26), to get the following lower bound on the total variation distance:
$$ |P-Q| \ \ge\ \sup_{\eta_1,\eta_2 > 0} \left\{ \big(1-\exp(-\eta_1)\big) \left( \mathbb{P}\big[\imath_{P\|Q}(X) \ge \eta_1\big] + \left(\frac{\exp(\eta_2)-1}{1-\exp(-\eta_1)}\right) \mathbb{P}\big[\imath_{P\|Q}(X) \le -\eta_2\big] \right) \right\}. $$
This lower bound is attained if $P$ and $Q$ are probability measures on a 2-element set $\mathcal{A} = \{a,b\}$ where, for arbitrary $\eta_1, \eta_2 > 0$,
$$ P(a) = \frac{\exp(\eta_1) - \exp(\eta_1-\eta_2)}{\exp(\eta_1) - \exp(-\eta_2)}, \qquad Q(a) = \frac{1-\exp(-\eta_2)}{\exp(\eta_1) - \exp(-\eta_2)} \tag{27} $$
which implies that $\log\big(\frac{P(a)}{Q(a)}\big) = \eta_1$ and $\log\big(\frac{1-P(a)}{1-Q(a)}\big) = -\eta_2$. Condition (27) is specialized to the condition in (25) when $\eta_1 = \eta_2 = \eta > 0$.

III. A REVERSE PINSKER INEQUALITY FOR PROBABILITY MEASURES ON A FINITE SET

The present section introduces a strengthened version of inequality (10) (see Theorem 3) as a reverse Pinsker inequality for probability measures on a finite set, followed by a discussion and an example.
A. Main Result and Proof

Theorem 3: Let $P$ and $Q$ be probability measures defined on a common finite set $\mathcal{A}$, and assume that $Q$ is strictly positive on $\mathcal{A}$. Then, the following inequality holds:
$$ D(P\|Q) \ \le\ \log\!\left(1 + \frac{|P-Q|^2}{2\, Q_{\min}}\right) - \frac{\beta_1 \log e}{2} \cdot |P-Q|^2 \tag{28} $$
where
$$ Q_{\min} \triangleq \min_{a\in\mathcal{A}} Q(a) > 0, \qquad \beta_1 \triangleq \min_{a\in\mathcal{A}} \frac{P(a)}{Q(a)} \in [0,1]. \tag{29} $$

Remark 2:
The upper bound on the relative entropy in Theorem 3 improves the bound in (10). The improvement in (28) is demonstrated as follows: let $V \triangleq |P-Q|$; then the RHS of (28) satisfies
$$ \log\!\left(1+\frac{V^2}{2Q_{\min}}\right) - \frac{\beta_1 \log e}{2}\, V^2 \ \le\ \log\!\left(1+\frac{V^2}{2Q_{\min}}\right) \ \le\ \frac{V^2 \log e}{2\, Q_{\min}} \ \le\ \frac{V^2 \log e}{Q_{\min}}. $$
Hence, the upper bound on $D(P\|Q)$ in Theorem 3 can be loosened to (10).

Proof:
Theorem 3 is proved by obtaining upper and lower bounds on the $\chi^2$-divergence from $P$ to $Q$:
$$ \chi^2(P,Q) \triangleq \sum_{a\in\mathcal{A}} \frac{\big(P(a)-Q(a)\big)^2}{Q(a)} = \sum_{a\in\mathcal{A}} \frac{P^2(a)}{Q(a)} - 1. \tag{30} $$
A lower bound follows by invoking Jensen's inequality:
$$ \chi^2(P,Q) = \sum_{a\in\mathcal{A}} P(a)\, \exp\!\left(\log\frac{P(a)}{Q(a)}\right) - 1 \ \ge\ \exp\!\left(\sum_{a\in\mathcal{A}} P(a)\log\frac{P(a)}{Q(a)}\right) - 1 = \exp\big(D(P\|Q)\big) - 1. \tag{31} $$
A refined version of (31) is derived in the following. The starting point of its derivation relies on a refined version of Jensen's inequality from [5, Theorem 1], which enables one to get the inequality
$$ \min_{a\in\mathcal{A}} \frac{P(a)}{Q(a)} \cdot D(Q\|P) \ \le\ \log\big(1+\chi^2(P,Q)\big) - D(P\|Q) \ \le\ \max_{a\in\mathcal{A}} \frac{P(a)}{Q(a)} \cdot D(Q\|P). \tag{32} $$
Inequality (32) is proved in the appendix. From the LHS of (32) and the definition of $\beta_1$ in (29), we have
$$ \chi^2(P,Q) \ \ge\ \exp\Big(D(P\|Q) + \beta_1\, D(Q\|P)\Big) - 1 \ \ge\ \exp\!\left(D(P\|Q) + \frac{\beta_1 \log e}{2} \cdot |P-Q|^2\right) - 1 \tag{33} $$
where the last inequality relies on Pinsker's lower bound on $D(Q\|P)$. Inequality (33) refines the lower bound in (31), since $\beta_1 \in [0,1]$, and it coincides with (31) in the worst case where $\beta_1 = 0$. An upper bound on $\chi^2(P,Q)$ is derived as follows:
$$ \chi^2(P,Q) = \sum_{a\in\mathcal{A}} \frac{\big(P(a)-Q(a)\big)^2}{Q(a)} \ \le\ \frac{\sum_{a\in\mathcal{A}} \big(P(a)-Q(a)\big)^2}{\min_{a\in\mathcal{A}} Q(a)} \ \le\ \frac{\max_{a\in\mathcal{A}} |P(a)-Q(a)| \, \sum_{a\in\mathcal{A}} \big|P(a)-Q(a)\big|}{\min_{a\in\mathcal{A}} Q(a)} = \frac{|P-Q| \cdot \max_{a\in\mathcal{A}} |P(a)-Q(a)|}{Q_{\min}} \tag{34} $$
and, from (3),
$$ |P-Q| \ \ge\ 2\max_{a\in\mathcal{A}} \big|P(a)-Q(a)\big| \tag{35} $$
since, for every $a \in \mathcal{A}$, the 1-element set $\{a\}$ is included in the $\sigma$-algebra $\mathscr{F}$. Combining (34) and (35) gives that
$$ \chi^2(P,Q) \ \le\ \frac{|P-Q|^2}{2\, Q_{\min}}. \tag{36} $$
Inequality (28) finally follows from the bounds on the $\chi^2$-divergence in (33) and (36).
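A randomized numerical check of Theorem 3 is given by the following Python sketch (added for illustration; natural logarithms, so $\log e = 1$). It verifies, for strictly positive random pairs $(P,Q)$, that $D(P\|Q)$ is upper bounded by (28), that (28) is at least as tight as its loosening with $\beta_1 = 0$, and that the latter is at least as tight as the Csiszár–Talata bound (10).

```python
# Randomized comparison of the reverse Pinsker bounds on a finite set:
# the bound (28), its loosening with beta_1 = 0, and the bound (10).
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    n = int(rng.integers(2, 10))
    p = rng.dirichlet(np.ones(n))
    q = rng.dirichlet(np.ones(n))
    kl = float(np.sum(p * np.log(p / q)))
    tv = float(np.sum(np.abs(p - q)))
    qmin = float(np.min(q))
    beta1 = float(np.min(p / q))                    # (29)
    bound_10 = tv**2 / qmin                         # (10), log e = 1
    bound_loose = np.log(1 + tv**2 / (2 * qmin))    # (28) with beta_1 = 0
    bound_28 = bound_loose - 0.5 * beta1 * tv**2    # Theorem 3
    assert kl <= bound_28 + 1e-12
    assert bound_28 <= bound_loose <= bound_10 + 1e-12
```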
Corollary 1: Under the same setting as in Theorem 3, we have
$$ D(P\|Q) \ \le\ \log\!\left(1 + \frac{|P-Q|^2}{2\, Q_{\min}}\right). \tag{37} $$

Proof:
This inequality follows from (28), and since $\beta_1 \ge 0$.

B. Discussion
In the following, we discuss Theorem 3 and its proof, and link it to some related results.
Remark 3:
The combination of (31) with the second line of (34), without further loosening the upper bound on the $\chi^2$-divergence as is done in the third line of (34) and in inequality (35), gives the following tighter upper bound on the relative entropy in terms of the Euclidean norm $\|P-Q\|_2$:
$$ D(P\|Q) \ \le\ \log\!\left(1 + \frac{\|P-Q\|_2^2}{Q_{\min}}\right). \tag{38} $$
This improves the upper bound on the relative entropy in the proofs of Property 4 of [23, Lemma 7] and [13, Appendix D]:
$$ D(P\|Q) \ \le\ \frac{\|P-Q\|_2^2 \, \log e}{Q_{\min}}. $$
Furthermore, avoiding the use of Jensen's inequality in (31) gives the equality (see [6, Eq. (6)])
$$ \chi^2(P,Q) = \exp\big(D_2(P\|Q)\big) - 1 \tag{39} $$
whose combination with the second line of (34) gives
$$ D_2(P\|Q) \ \le\ \log\!\left(1 + \frac{\|P-Q\|_2^2}{Q_{\min}}\right). \tag{40} $$
Inequality (40) improves the tightness of inequality (38), since $D(P\|Q) \le D_2(P\|Q)$. Note that (40) is satisfied with equality when $Q$ is an equiprobable distribution over a finite set.
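The equality case of (40) is easy to check numerically. The following short Python sketch (an added illustration in nats) draws a random $P$ on a small set and verifies that, for the equiprobable $U$, both sides of (40) coincide, since the two inequalities leading to (40) hold with equality in this case.

```python
# Equality in (40) for equiprobable Q = U: exp(D_2(P||U)) - 1
# equals chi^2(P,U) = ||P - U||_2^2 / U_min exactly.
import numpy as np

rng = np.random.default_rng(2)
n = 6
u = np.full(n, 1.0 / n)
p = rng.dirichlet(np.ones(n))
d2 = np.log(np.sum(p**2 / u))                       # D_2(P||U), per (11)
rhs_40 = np.log(1 + np.sum((p - u)**2) / (1.0 / n)) # RHS of (40)
assert abs(d2 - rhs_40) < 1e-12
```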
Remark 4: Inequality (31) improves the lower bound on the $\chi^2$-divergence in [4, Lemma 6.3], which states that $\chi^2(P,Q) \ge \frac{D(P\|Q)}{\log e}$; this improvement also follows from [6, Eqs. (6), (7)].

Remark 5:
The upper bound on the relative entropy in (28) involves the parameter $\beta_1 \in [0,1]$ as defined in (29). A non-trivial lower bound on $\beta_1$ can be used in conjunction with (28) for improving the upper bound in Corollary 1. We derive in the following a lower bound on $\beta_1$ for a given probability measure $Q$ and a given total variation distance $|P-Q|$, which can be used, in conjunction with (28), to get an upper bound on the relative entropy $D(P\|Q)$. We have
$$ \beta_1 = \min_{a\in\mathcal{A}} \frac{P(a)}{Q(a)} \ \ge\ \frac{P_{\min}}{Q_{\max}} $$
where $P_{\min} \triangleq \min_{a\in\mathcal{A}} P(a)$ and $Q_{\max} \triangleq \max_{a\in\mathcal{A}} Q(a)$. Note that, from (35), if $|P-Q| < 2\, Q_{\min}$ then $P_{\min} \ge Q_{\min} - \frac{1}{2}|P-Q| > 0$. Let $(x)^+ \triangleq \max\{x, 0\}$; then
$$ \beta_1 \ \ge\ \frac{\big(Q_{\min} - \frac{1}{2}|P-Q|\big)^+}{Q_{\max}}. \tag{41} $$

Remark 6:
In an attempt to extend the concept of the proof of Theorem 3 to general probability measures, we have
$$ \begin{aligned} \chi^2(P,Q) &= \int_{\mathcal{A}} \left(\frac{\mathrm{d}P}{\mathrm{d}Q} - 1\right)^2 \mathrm{d}Q = \mathbb{E}\Big[\big(\exp\big(\imath_{P\|Q}(Y)\big) - 1\big)^2\Big] \qquad (Y \sim Q) \\ &\le\ \sup_{a\in\mathcal{A}} \big|\exp\big(\imath_{P\|Q}(a)\big) - 1\big| \cdot \mathbb{E}\Big[\big|\exp\big(\imath_{P\|Q}(Y)\big) - 1\big|\Big] \\ &\overset{(a)}{=}\ \sup_{a\in\mathcal{A}} \big|\exp\big(\imath_{P\|Q}(a)\big) - 1\big| \cdot |P-Q| \ =\ \sup_{a\in\mathcal{A}} \left|\frac{\mathrm{d}P}{\mathrm{d}Q}(a) - 1\right| \cdot |P-Q| \end{aligned} \tag{42} $$
where equality (a) holds due to (6). Let $\beta_1, \beta_2 \in [0,1]$ be defined as in Theorem 1 (see (12)). Since we have $\beta_1 \le \frac{\mathrm{d}P}{\mathrm{d}Q}(a) \le \beta_2^{-1}$ for all $a \in \mathcal{A}$, then
$$ \sup_{a\in\mathcal{A}} \left|\frac{\mathrm{d}P}{\mathrm{d}Q}(a) - 1\right| \ \le\ \max\big\{\beta_2^{-1} - 1,\; 1-\beta_1\big\}. \tag{43} $$
A combination of (42) and (43) leads to the following upper bound on the $\chi^2$-divergence:
$$ \chi^2(P,Q) \ \le\ \max\big\{\beta_2^{-1} - 1,\; 1-\beta_1\big\} \cdot |P-Q|. \tag{44} $$
A combination of (39) (see [6, Eq. (6)]) and (44) gives
$$ D_2(P\|Q) \ \le\ \log\Big(1 + \max\big\{\beta_2^{-1} - 1,\; 1-\beta_1\big\} \cdot |P-Q|\Big) \tag{45} $$
and, since the Rényi divergence is monotonically non-decreasing in its order (see, e.g., [6, Theorem 3]) and $D_1(P\|Q) = D(P\|Q)$, it follows that
$$ D(P\|Q) \ \le\ \log\Big(1 + \max\big\{\beta_2^{-1} - 1,\; 1-\beta_1\big\} \cdot |P-Q|\Big). \tag{46} $$
A comparison of the upper bound on the relative entropy in (46) and the bound of Theorem 1 in (13) yields that the latter bound is superior. Hence, the extension of the concept of the proof of Theorem 3 to general probability measures does not improve the bound in Theorem 1.
Remark 7:
The second inequality in (33) relies on Pinsker's inequality as a lower bound on $D(Q\|P)$. This lower bound can be slightly improved by invoking higher-order Pinsker-type inequalities (see [9, Section 5] and references therein). In [9, Section 6], Gilardoni derived a lower bound on the relative entropy which is tight for both large and small total variation distances. Hence, the second inequality in (33) can instead rely on the inequality (see [9, Eq. (2)])
$$ D(Q\|P) \ \ge\ -\log\!\left(1 - \frac{|P-Q|}{2}\right) - \left(1 - \frac{|P-Q|}{2}\right)\log\!\left(1 + \frac{|P-Q|}{2}\right). $$
Note that although the latter lower bound on the relative entropy is tight for both large and small total variation distances, it is not uniformly tighter than Pinsker's inequality. For this reason, and for the simplicity of the bound, we rely on Pinsker's inequality in the second inequality of (33). An exact parametrization of the minimum of the relative entropy in terms of the total variation distance was introduced in [7, Theorem 1], expressed in terms of hyperbolic functions; the bound, however, is not expressed in closed form in terms of the total variation distance.
Remark 8:
A related problem to Theorem 3 has been recently studied in [1]. Consider an arbitrary probability measure $Q$, and an arbitrary $\varepsilon \in [0,2)$. The problem studied in [1] is the characterization of $D^*(\varepsilon, Q)$, defined to be the infimum of $D(P\|Q)$ over all probability measures $P$ that are at least $\varepsilon$-far away from $Q$ in total variation, i.e.,
$$ D^*(\varepsilon, Q) = \inf_{P:\, |P-Q| \ge \varepsilon} D(P\|Q), \qquad \varepsilon \in [0,2). $$
Note that $D(P\|Q) < \infty$ yields that $\mathrm{Supp}(P) \subseteq \mathrm{Supp}(Q)$. From Sanov's theorem (see [3, Theorem 11.4.1]), $D^*(\varepsilon, Q)$ is equal to the asymptotic exponential decay of the probability that the total variation distance, between the empirical distribution of a sequence of i.i.d. random variables and the true distribution ($Q$), is more than a specified value $\varepsilon$. Upper and lower bounds on $D^*(\varepsilon, Q)$ have been introduced in [1, Theorem 1] in terms of the balance coefficient $\beta \ge \frac{1}{2}$, which is defined as
$$ \beta \triangleq \inf\left\{ x \in \big\{Q(A) : A \in \mathscr{F}\big\} : x \ge \frac{1}{2} \right\}. $$
It has been demonstrated in [1, Theorem 1] that
$$ D^*(\varepsilon, Q) = C \varepsilon^2 + O(\varepsilon^3) \tag{47} $$
where
$$ \frac{1}{4(2\beta-1)}\, \log\!\left(\frac{\beta}{1-\beta}\right) \ \le\ C \ \le\ \frac{\log e}{8\beta(1-\beta)}. $$
If the support of $Q$ is a finite set $\mathcal{A}$, Theorem 3 and (41) yield that
$$ D^*(\varepsilon, Q) \ \le\ \log\!\left(1 + \frac{\varepsilon^2}{2\, Q_{\min}}\right) - \frac{\log e}{2\, Q_{\max}} \cdot \Big(Q_{\min} - \frac{\varepsilon}{2}\Big)^{\!+} \varepsilon^2. $$
Hence, it follows that $D^*(\varepsilon, Q) \le C_1 \varepsilon^2 + O(\varepsilon^3)$ where
$$ C_1 = \frac{\log e}{2} \left(\frac{1}{Q_{\min}} - \frac{Q_{\min}}{Q_{\max}}\right). $$
Similarly to (47), the same quadratic scaling of $D^*(\varepsilon, Q)$ holds for small values of $\varepsilon$, but with different coefficients.
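For a binary $Q$, the quantity $D^*(\varepsilon, Q)$ can be computed exactly, since the minimizing $P$ moves a mass of $\frac{\varepsilon}{2}$ away from $Q$ in the cheaper of the two directions. The following Python sketch (an added illustration in nats, with hypothetical values of $q$ and $\varepsilon$) compares this exact value with the upper bound above that follows from Theorem 3 and (41).

```python
# D*(eps, Q) for binary Q = (q, 1-q), versus the Theorem 3 + (41) bound.
import math

def kl_binary(x, y):
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

q, eps = 0.3, 0.1                                  # hypothetical values
d_star = min(kl_binary(q + eps / 2, q), kl_binary(q - eps / 2, q))
qmin, qmax = min(q, 1 - q), max(q, 1 - q)
beta1_lb = max(qmin - eps / 2, 0.0) / qmax         # lower bound (41)
upper = math.log(1 + eps**2 / (2 * qmin)) - 0.5 * beta1_lb * eps**2
print(f"D*({eps}, Q) = {d_star:.5f} <= {upper:.5f}")
assert d_star <= upper
```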
C. Example: Total Variation Distance From the Equiprobable Distribution

Let $\mathcal{A}$ be a finite set, and let $U$ be the equiprobable probability measure on $\mathcal{A}$ (i.e., $U(a) = \frac{1}{|\mathcal{A}|}$ for every $a \in \mathcal{A}$). The relative entropy of an arbitrary distribution $P$ on $\mathcal{A}$ with respect to the equiprobable distribution satisfies
$$ D(P\|U) = \log|\mathcal{A}| - H(P). \tag{48} $$
From Pinsker's inequality (1), the following upper bound on the total variation distance holds:
$$ |P-U| \ \le\ \sqrt{\frac{2}{\log e}\, \big(\log|\mathcal{A}| - H(P)\big)}. \tag{49} $$
From [26, Theorem 2.51], for all probability measures $P$ and $Q$,
$$ |P-Q| \ \le\ 2\sqrt{1 - \exp\big(-D(P\|Q)\big)} $$
which gives the second upper bound
$$ |P-U| \ \le\ 2\sqrt{1 - \frac{\exp\big(H(P)\big)}{|\mathcal{A}|}}. \tag{50} $$
From Theorem 3 and (41), we have
$$ D(P\|U) \ \le\ \log\!\left(1 + \frac{|\mathcal{A}|\, |P-U|^2}{2}\right) - \left(\frac{|\mathcal{A}| \log e}{2}\right) \left(\frac{1}{|\mathcal{A}|} - \frac{|P-U|}{2}\right)^{\!+} |P-U|^2. $$
A loosening of the latter bound, by removing its second non-negative term on the RHS of this inequality, in conjunction with (48), leads to the following closed-form lower bound on the total variation distance:
$$ |P-U| \ \ge\ \sqrt{2\left(\exp\big(-H(P)\big) - \frac{1}{|\mathcal{A}|}\right)}. \tag{51} $$
Let $H(P) = \beta \log|\mathcal{A}|$, so $\beta \in [0,1]$. From (49), (50) and (51), it follows that
$$ \sqrt{2\left[\left(\frac{1}{|\mathcal{A}|}\right)^{\!\beta} - \frac{1}{|\mathcal{A}|}\right]} \ \le\ |P-U| \ \le\ \min\left\{\sqrt{2(1-\beta)\ln|\mathcal{A}|},\ 2\sqrt{1 - |\mathcal{A}|^{\beta-1}}\right\}. \tag{52} $$
As expected, if $\beta = 1$, both the upper and lower bounds are equal to zero (since $D(P\|U) = 0$). The lower bound on the LHS of (52) improves the lower bound on the total variation distance which follows from (10):
$$ |P-U| \ \ge\ \sqrt{\frac{(1-\beta)\ln|\mathcal{A}|}{|\mathcal{A}|}}. \tag{53} $$
For example, for a set of size $|\mathcal{A}| = 1024$ and $\beta = 0.5$, the improvement in the new lower bound on the total variation distance is from 0.0582 to 0.2461. Note that if $\beta \to 0$ (i.e., $P$ is far in relative entropy from the equiprobable distribution), and the set $\mathcal{A}$ stays fixed, the ratio between the upper and lower bounds in (52) tends to $\sqrt{2}$. On the other hand, in this case, the ratio between the upper bound and the looser lower bound in (53) tends to
$$ 2\sqrt{\frac{|\mathcal{A}|-1}{\ln|\mathcal{A}|}} $$
which can be made arbitrarily large for a sufficiently large set $\mathcal{A}$.
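The numbers quoted in this example are reproduced by the following Python sketch (an added illustration; natural logarithms): it evaluates the two sides of (52) and the looser bound (53) for $|\mathcal{A}| = 1024$ and $\beta = 0.5$.

```python
# Reproducing the bounds (52)-(53) for |A| = 1024 and H(P) = 0.5 log|A|.
import math

A, beta = 1024, 0.5
new_lower = math.sqrt(2 * ((1 / A)**beta - 1 / A))       # LHS of (52)
old_lower = math.sqrt((1 - beta) * math.log(A) / A)      # bound (53)
upper = min(math.sqrt(2 * (1 - beta) * math.log(A)),
            2 * math.sqrt(1 - A**(beta - 1)))            # RHS of (52)
# Prints: 0.0582 (53)  vs  0.2461 <= |P-U| <= 1.9686 (52)
print(f"{old_lower:.4f} (53)  vs  {new_lower:.4f} <= |P-U| <= {upper:.4f} (52)")
```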
IV. EXTENSION OF THEOREM 3 TO RÉNYI DIVERGENCES
The present section extends Theorem 3 to Rényi divergences of an arbitrary order $\alpha \in [0,\infty]$ (i.e., it relies on Theorem 3 to provide a generalization of the special case where $\alpha = 1$), and the use of this generalized inequality is exemplified.

A. Main Result and Proof
The following theorem provides a kind of a generalized reverse Pinsker inequality, where the Rényi divergence of an arbitrary order $\alpha \in [0,\infty]$ is upper bounded in terms of the total variation distance, for probability measures defined on a common finite set.

Theorem 4:
Let $P$ and $Q$ be probability measures on a common finite set $\mathcal{A}$, and assume that $P, Q$ are strictly positive. Let $\varepsilon \triangleq |P-Q|$ (recall that $\varepsilon \in [0,2)$), $\varepsilon' \triangleq \min\{1, \varepsilon\}$, and
$$ P_{\min} \triangleq \min_{a\in\mathcal{A}} P(a), \qquad Q_{\min} \triangleq \min_{a\in\mathcal{A}} Q(a). $$
Then, the Rényi divergence of order $\alpha \in [0,\infty]$ satisfies
$$ D_\alpha(P\|Q) \ \le\ \begin{cases} \log\Big(1 + \dfrac{\varepsilon}{2\, Q_{\min}}\Big), & \text{if } \alpha \in (2,\infty] \\[6pt] \log\Big(1 + \dfrac{\varepsilon \varepsilon'}{2\, Q_{\min}}\Big), & \text{if } \alpha \in [1,2] \\[6pt] \min\{f_1, f_2\}, & \text{if } \alpha \in \big(\tfrac{1}{2}, 1\big) \\[6pt] \min\Big\{-2\log\Big(1-\dfrac{\varepsilon}{2}\Big),\, f_1,\, f_2\Big\}, & \text{if } \alpha \in \big[0, \tfrac{1}{2}\big] \end{cases} \tag{54} $$
where, for $\alpha \in [0,1)$,
$$ f_1 \equiv f_1(\alpha, P_{\min}, Q_{\min}, \varepsilon) \triangleq \left(\frac{\alpha}{1-\alpha}\right) \left[\log\!\left(1 + \frac{\varepsilon^2}{2\, P_{\min}}\right) - \left(\frac{Q_{\min}\log e}{2}\right) \varepsilon^2\right], \tag{55} $$
$$ f_2 \equiv f_2(P_{\min}, Q_{\min}, \varepsilon, \varepsilon') \triangleq \log\!\left(1 + \frac{\varepsilon \varepsilon'}{2\, Q_{\min}}\right) - \left(\frac{P_{\min}\log e}{2}\right) \varepsilon^2. \tag{56} $$

Proof:
The Rényi divergence of order $\infty$ satisfies (see, e.g., [6, Theorem 6])
$$ D_\infty(P\|Q) = \log\!\left(\operatorname{ess\,sup} \frac{P}{Q}\right). $$
Since, by assumption, the probability measures $P$ and $Q$ are defined on a common finite set $\mathcal{A}$,
$$ D_\infty(P\|Q) = \log\!\left(\max_{a\in\mathcal{A}} \frac{P(a)}{Q(a)}\right) = \log\!\left(1 + \max_{a\in\mathcal{A}} \frac{P(a)-Q(a)}{Q(a)}\right) \ \le\ \log\!\left(1 + \frac{\max_{a\in\mathcal{A}} |P(a)-Q(a)|}{\min_{a\in\mathcal{A}} Q(a)}\right) \ \le\ \log\!\left(1 + \frac{|P-Q|}{2\, Q_{\min}}\right) \tag{57} $$
where the last inequality follows from (35). Since the Rényi divergence of order $\alpha \in [0,\infty]$ is monotonically non-decreasing in $\alpha$ (see, e.g., [6, Theorem 3]), it follows from (57) that
$$ D_\alpha(P\|Q) \ \le\ D_\infty(P\|Q) \ \le\ \log\!\left(1 + \frac{\varepsilon}{2\, Q_{\min}}\right), \qquad \forall\, \alpha \in [0,\infty] \tag{58} $$
which proves the first line in (54) when the validity of the bound is restricted to $\alpha \in (2,\infty]$. For proving the second line in (54), it is shown that the bound in (37) can be sharpened by replacing $D(P\|Q)$ on the LHS of (37) with the quadratic Rényi divergence $D_2(P\|Q)$ (note that $D_2(P\|Q) \ge D(P\|Q)$), leading to
$$ D_2(P\|Q) \ \le\ \log\!\left(1 + \frac{|P-Q|^2}{2\, Q_{\min}}\right). \tag{59} $$
The strengthened inequality in (59), in comparison to (37), follows by replacing inequality (31) with the equality in (39); combining (36) and (39) gives inequality (59), and
$$ D_\alpha(P\|Q) \ \le\ D_2(P\|Q) \ \le\ \log\!\left(1 + \frac{\varepsilon^2}{2\, Q_{\min}}\right), \qquad \forall\, \alpha \in [0,2]. \tag{60} $$
The combination of (58) with (60) gives the second line in (54) (note that $\varepsilon\varepsilon' = \min\{\varepsilon, \varepsilon^2\}$), when the validity of the bound is restricted to $\alpha \in [1,2]$.

For $\alpha \in (0,1)$, $D_\alpha(P\|Q)$ satisfies the skew-symmetry property $D_\alpha(P\|Q) = \big(\frac{\alpha}{1-\alpha}\big) D_{1-\alpha}(Q\|P)$ (see, e.g., [6, Proposition 2]). Consequently, we have
$$ D_\alpha(P\|Q) = \left(\frac{\alpha}{1-\alpha}\right) D_{1-\alpha}(Q\|P) \ \le\ \left(\frac{\alpha}{1-\alpha}\right) D(Q\|P) \ \le\ \left(\frac{\alpha}{1-\alpha}\right) \left[\log\!\left(1 + \frac{\varepsilon^2}{2\, P_{\min}}\right) - \left(\frac{Q_{\min}\log e}{2}\right) \varepsilon^2\right], \qquad \forall\, \alpha \in (0,1) \tag{61} $$
where the first inequality holds since the Rényi divergence is monotonically non-decreasing in its order, and the second inequality follows from Theorem 3, which implies that
$$ D(Q\|P) \ \le\ \log\!\left(1 + \frac{\varepsilon^2}{2\, P_{\min}}\right) - \frac{\log e}{2} \cdot \min_{a\in\mathcal{A}} \frac{Q(a)}{P(a)} \cdot \varepsilon^2 \ \le\ \log\!\left(1 + \frac{\varepsilon^2}{2\, P_{\min}}\right) - \left(\frac{Q_{\min}\log e}{2}\right) \varepsilon^2. $$
The third line in (54) follows from (58), (60) and (61), while restricting the validity of the bound to $\alpha \in \big(\tfrac{1}{2}, 1\big)$.

For proving the fourth line in (54), note that, from (11),
$$ D_{1/2}(P\|Q) = -2\log Z(P,Q) $$
where $Z(P,Q) \triangleq \sum_{a\in\mathcal{A}} \sqrt{P(a)\, Q(a)}$ is the Bhattacharyya coefficient between $P$ and $Q$ [12]. The Bhattacharyya distance is defined as minus the logarithm of the Bhattacharyya coefficient; it is non-negative in general, and it is zero if and only if $P = Q$ (since $0 \le Z(P,Q) \le 1$, and $Z(P,Q) = 1$ if and only if $P = Q$). Hence, the Rényi divergence of order $\frac{1}{2}$ is twice the Bhattacharyya distance. Based on the inequality $Z(P,Q) \ge 1 - \frac{|P-Q|}{2}$, which follows from [10, Example 6.2] (see also [21, Proposition 1]), we have
$$ D_\alpha(P\|Q) \ \le\ D_{1/2}(P\|Q) \ \le\ -2\log\!\left(1 - \frac{\varepsilon}{2}\right), \qquad \forall\, \alpha \in \Big[0, \frac{1}{2}\Big] \tag{62} $$
where $\varepsilon \triangleq |P-Q| \in [0,2)$. Finally, the last case in (54) follows from (58), (60), (61) and (62).
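A randomized numerical check of the four cases of (54) is given by the following Python sketch (an added illustration; natural logarithms, so $\log e = 1$). Strictly positive pairs $(P,Q)$ are drawn on small alphabets, and the appropriate case of (54) is tested for representative orders.

```python
# Randomized check of the four cases of Theorem 4, Eq. (54).
import numpy as np

def renyi(p, q, alpha):
    if alpha == 1.0:
        return float(np.sum(p * np.log(p / q)))
    return float(np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1))

rng = np.random.default_rng(3)
for _ in range(500):
    n = int(rng.integers(2, 8))
    p = rng.dirichlet(np.ones(n))
    q = rng.dirichlet(np.ones(n))
    eps = float(np.sum(np.abs(p - q)))
    eps1 = min(1.0, eps)                       # eps' = min{1, eps}
    pmin, qmin = float(np.min(p)), float(np.min(q))
    f1 = lambda a: a / (1 - a) * (np.log(1 + eps**2 / (2 * pmin))
                                  - 0.5 * qmin * eps**2)     # (55)
    f2 = np.log(1 + eps * eps1 / (2 * qmin)) - 0.5 * pmin * eps**2  # (56)
    for a in (0.25, 0.75, 1.5, 4.0):
        if a > 2:
            bound = np.log(1 + eps / (2 * qmin))
        elif a >= 1:
            bound = np.log(1 + eps * eps1 / (2 * qmin))
        elif a > 0.5:
            bound = min(f1(a), f2)
        else:
            bound = min(-2 * np.log(1 - eps / 2), f1(a), f2)
        assert renyi(p, q, a) <= bound + 1e-9
```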
B. Example: Rényi Divergence for Multinomial Distributions

Let $X_1, X_2, \ldots$ be independent Bernoulli random variables with $X_i \sim \mathrm{Bernoulli}(p_i)$, and let $Y_1, Y_2, \ldots$ be independent Bernoulli random variables with $Y_i \sim \mathrm{Bernoulli}(q_i)$ (assume w.l.o.g. that $q_i \le \frac{1}{2}$). Let $U_n$ and $V_n$ be the partial sums $U_n = \sum_{i=1}^n X_i$ and $V_n = \sum_{i=1}^n Y_i$, and let $P_{U_n}, P_{V_n}$ denote their multinomial distributions. For all $\alpha \in [0,2]$ and $n \in \mathbb{N}$, we have
$$ D_\alpha(P_{U_n} \| P_{V_n}) \ \overset{(a)}{\le}\ D_\alpha(P_{X_1,\ldots,X_n} \| P_{Y_1,\ldots,Y_n}) \ \overset{(b)}{=}\ \sum_{i=1}^n D_\alpha(P_{X_i} \| P_{Y_i}) \ \overset{(c)}{\le}\ \sum_{i=1}^n \log\!\left(1 + \frac{|P_{X_i} - P_{Y_i}|^2}{2\,(P_{Y_i})_{\min}}\right) \ \overset{(d)}{=}\ \sum_{i=1}^n \log\!\left(1 + 2 q_i \left(\frac{p_i}{q_i} - 1\right)^{\!2}\right) \tag{63} $$
where inequality (a) follows from the data-processing inequality for the Rényi divergence (see [6, Theorem 9]), equality (b) follows from the additivity property of the Rényi divergence under the independence assumption for $\{X_i\}$ and for $\{Y_i\}$ (see [6, Theorem 28]), inequality (c) follows from Theorem 4, and equality (d) holds since $|P_{X_i} - P_{Y_i}| = 2|p_i - q_i|$ for Bernoulli random variables, and $(P_{Y_i})_{\min} = \min\{q_i, 1-q_i\} = q_i$ (as $q_i \le \frac{1}{2}$). Similarly, for all $\alpha > 2$ and $n \in \mathbb{N}$,
$$ D_\alpha(P_{U_n} \| P_{V_n}) \ \le\ \sum_{i=1}^n \log\!\left(1 + \left|\frac{p_i}{q_i} - 1\right|\right). \tag{64} $$
The only difference in the derivation of (64) is in inequality (c) of (63), where the bound in the first line of (54) is used this time.

Let $\{\varepsilon_n\}_{n=1}^\infty$ be a non-negative sequence such that
$$ (1-\varepsilon_n)\, q_n \ \le\ p_n \ \le\ (1+\varepsilon_n)\, q_n, \quad \forall\, n \in \mathbb{N}, \qquad \text{and} \qquad \sum_{n=1}^\infty \varepsilon_n^2 < \infty. $$
Then, from (63), it follows that $D_\alpha(P_{U_n} \| P_{V_n}) \le K_1$ for all $\alpha \in [0,2]$ and $n \in \mathbb{N}$, where
$$ K_1 \triangleq \sum_{n=1}^\infty \log\big(1 + \varepsilon_n^2\big) < \infty. $$
Furthermore, if $\sum_{n=1}^\infty \varepsilon_n < \infty$, it follows from (64) that $D_\alpha(P_{U_n} \| P_{V_n}) \le K_2$ for all $\alpha > 2$ and $n \in \mathbb{N}$, where
$$ K_2 \triangleq \sum_{n=1}^\infty \log(1 + 2\varepsilon_n) < \infty. $$
Note that although $D_\alpha(P_{X_i} \| P_{Y_i})$ in equality (b) of (63) is equal to the binary Rényi divergence
$$ d_\alpha(p_i \| q_i) \triangleq \begin{cases} \dfrac{1}{\alpha-1}\, \log\Big(p_i^\alpha\, q_i^{1-\alpha} + (1-p_i)^\alpha\, (1-q_i)^{1-\alpha}\Big), & \text{if } \alpha \in (0,1)\cup(1,\infty) \\[6pt] p_i \log\Big(\dfrac{p_i}{q_i}\Big) + (1-p_i)\log\Big(\dfrac{1-p_i}{1-q_i}\Big), & \text{if } \alpha = 1 \end{cases} $$
the reason for the use of the upper bounds in step (c) of (63) and in (64) is to state sufficient conditions, in terms of $\{\varepsilon_n\}_{n=1}^\infty$, for the boundedness of the Rényi divergence $D_\alpha(P_{U_n} \| P_{V_n})$.
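The chain (63) can be tested numerically, since the distribution of a sum of independent (non-identical) Bernoulli variables is obtained by direct convolution. The following Python sketch (an added illustration in nats; the parameters $p_i, q_i$ are randomly generated, hypothetical values) compares $D_\alpha(P_{U_n}\|P_{V_n})$ with the RHS of (63) for a few orders $\alpha \in [0,2]$.

```python
# Check of (63): sums of independent Bernoulli(p_i) and Bernoulli(q_i).
import numpy as np

def sum_distribution(probs):
    """Distribution of a sum of independent Bernoulli(p_i) variables."""
    dist = np.array([1.0])
    for p in probs:
        dist = np.convolve(dist, [1 - p, p])
    return dist

rng = np.random.default_rng(4)
n = 10
q = rng.uniform(0.05, 0.5, size=n)      # q_i <= 1/2, as assumed
p = q * rng.uniform(0.8, 1.2, size=n)   # p_i close to q_i
pu, pv = sum_distribution(p), sum_distribution(q)

bound_63 = float(np.sum(np.log(1 + 2 * q * (p / q - 1) ** 2)))
for alpha in (0.5, 1.0, 2.0):
    if alpha == 1.0:
        d = float(np.sum(pu * np.log(pu / pv)))
    else:
        d = float(np.log(np.sum(pu**alpha * pv**(1 - alpha))) / (alpha - 1))
    assert d <= bound_63 + 1e-12
```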
V. THE EXPONENTIAL DECAY OF THE PROBABILITY FOR A NON-TYPICAL SEQUENCE

Let $U^N = (U_1, \ldots, U_N)$ be a sequence of i.i.d. symbols that are emitted by a memoryless and stationary source with distribution $Q$ and a finite alphabet $\mathcal{A}$. Let $|\mathcal{A}| = r < \infty$ denote the cardinality of the source alphabet, and assume that all symbols are emitted with positive probability (i.e., $Q_{\min} \triangleq \min_{a\in\mathcal{A}} Q(a) > 0$). The empirical probability distribution $\hat{P}_{U^N}$ of the emitted sequence is given by
$$ \hat{P}_{U^N}(a) \triangleq \frac{1}{N} \sum_{k=1}^N 1\{U_k = a\}, \qquad \forall\, a \in \mathcal{A}. $$
For an arbitrary $\delta > 0$, let the $\delta$-typical set be defined as
$$ \mathcal{T}_Q(\delta) \triangleq \Big\{ u^N \in \mathcal{A}^N : \big|\hat{P}_{u^N}(a) - Q(a)\big| < \delta\, Q(a), \ \forall\, a \in \mathcal{A} \Big\}, \tag{65} $$
i.e., the empirical distribution of every symbol in an $N$-length $\delta$-typical sequence deviates from the true distribution of this symbol by a fraction of less than $\delta$. Consequently, the complement of (65) is given by
$$ \mathcal{T}_Q(\delta)^{\mathrm{c}} = \Big\{ u^N \in \mathcal{A}^N : \exists\, a \in \mathcal{A},\ \big|\hat{P}_{u^N}(a) - Q(a)\big| \ge \delta\, Q(a) \Big\}. $$
From Sanov's theorem (see [3, Theorem 11.4.1]), the asymptotic exponential decay of the probability that a sequence $U^N$ is not $\delta$-typical, for a specified $\delta > 0$, is given by
$$ \lim_{N\to\infty} -\frac{1}{N}\, \log Q^N\big(\mathcal{T}_Q(\delta)^{\mathrm{c}}\big) = \min_{P \in \mathcal{P}_Q} D(P\|Q) \tag{66} $$
where
$$ \mathcal{P}_Q \triangleq \Big\{ P \text{ is a probability measure on } (\mathcal{A}, \mathscr{F}) : \exists\, a \in \mathcal{A},\ |P(a)-Q(a)| \ge \delta\, Q(a) \Big\}. \tag{67} $$
We obtain in the following explicit upper and lower bounds on the exponential decay rate on the RHS of (66). The emphasis is on the upper bound, which is based on Theorem 3; we first introduce the lower bound for completeness. The derivation of the lower bound is similar to the analysis in [14, Section 4]; note, however, that there is a difference between the $\delta$-typicality in [14, Eq. (19)] and the way it is defined in (65). The probability-dependent refinement of Pinsker's inequality (see [14, Theorem 2.1]) states that
$$ D(P\|Q) \ \ge\ \frac{\varphi(\pi_Q)}{4}\, |P-Q|^2 \tag{68} $$
where
$$ \pi_Q \triangleq \max_{A\in\mathscr{F}} \min\big\{Q(A),\, 1-Q(A)\big\} \ \le\ \frac{1}{2} \tag{69} $$
and
$$ \varphi(p) = \begin{cases} \dfrac{1}{1-2p}\, \log\Big(\dfrac{1-p}{p}\Big), & \text{if } p \in \big[0, \tfrac{1}{2}\big) \\[6pt] 2\log e, & \text{if } p = \tfrac{1}{2} \end{cases} \tag{70} $$
is a monotonically decreasing and continuous function. Hence, $\varphi(\pi_Q) \ge 2\log e$, and (68) forms a probability-dependent refinement of Pinsker's inequality [14]. From (67) and (68), we have
$$ \min_{P\in\mathcal{P}_Q} D(P\|Q) \ \ge\ \frac{\varphi(\pi_Q)}{4} \Big(\min_{P\in\mathcal{P}_Q} |P-Q|\Big)^{\!2} = \frac{\varphi(\pi_Q)}{4} \Big(2\delta \min_{a\in\mathcal{A}} Q(a)\Big)^{\!2} = \varphi(\pi_Q)\, Q_{\min}^2\, \delta^2 \ \triangleq\ E_{\mathrm{L}} \tag{71} $$
$$ \ \ge\ 2\, Q_{\min}^2\, \delta^2\, \log e \tag{72} $$
where the transition from (71) to (72) follows from the global lower bound on $\varphi(\pi_Q)$. We derive in the following an upper bound on the asymptotic exponential decay rate in (66):
$$ \min_{P\in\mathcal{P}_Q} D(P\|Q) \ \overset{(a)}{\le}\ \min_{P\in\mathcal{P}_Q} \left\{ \log\!\left(1 + \frac{|P-Q|^2}{2\, Q_{\min}}\right) \right\} = \log\!\left(1 + \frac{\big(\min_{P\in\mathcal{P}_Q} |P-Q|\big)^2}{2\, Q_{\min}}\right) \ \overset{(b)}{=}\ \log\!\left(1 + \frac{\big(2\delta \min_{a\in\mathcal{A}} Q(a)\big)^2}{2\, Q_{\min}}\right) = \log\big(1 + 2\, Q_{\min}\, \delta^2\big) \ \triangleq\ E_{\mathrm{U}} \tag{73} $$
where inequality (a) follows from (37), and equality (b) follows from (67). The ratio between the upper and lower bounds on the asymptotic exponent in (66), as given in (71) and (73) respectively, satisfies
$$ 1 \ \le\ \frac{E_{\mathrm{U}}}{E_{\mathrm{L}}} = \frac{1}{Q_{\min}} \cdot \frac{2\log e}{\varphi(\pi_Q)} \cdot \frac{\log\big(1 + 2\, Q_{\min}\,\delta^2\big)}{2\, Q_{\min}\,\delta^2\, \log e} \ \le\ \frac{1}{Q_{\min}} \tag{74} $$
where the last inequality in (74) follows from the fact that the second and third multiplicative factors in (74) are both less than or equal to 1. Note that both bounds in (71) and (73) scale like $\delta^2$ for $\delta \approx 0$.
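The exponent bounds $E_{\mathrm{L}}$ and $E_{\mathrm{U}}$ in (71) and (73) are computed by the following Python sketch (an added illustration in nats, for a hypothetical three-symbol source); $\pi_Q$ in (69) is found by scanning all subsets, which is feasible for tiny alphabets, and the ratio bound (74) is also checked.

```python
# Exponent bounds (71)-(74) for a small source distribution Q.
import itertools
import math

Q = [0.5, 0.3, 0.2]   # hypothetical source distribution
delta = 0.1
qmin = min(Q)
pi_q = max(min(sum(s), 1 - sum(s))
           for r in range(len(Q) + 1)
           for s in itertools.combinations(Q, r))      # (69)
phi = 2.0 if pi_q == 0.5 else \
      math.log((1 - pi_q) / pi_q) / (1 - 2 * pi_q)     # (70), in nats
e_lower = phi * (qmin * delta) ** 2                    # E_L in (71)
e_upper = math.log(1 + 2 * qmin * delta**2)            # E_U in (73)
print(f"{e_lower:.6f} <= exponent <= {e_upper:.6f}")
assert e_lower <= e_upper <= e_lower / qmin            # (74)
```

For this choice of $Q$ and $\delta$, the two bounds differ by a factor close to $1/Q_{\min} = 5$, in agreement with (74).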
APPENDIX: A PROOF OF INEQUALITY (32)

This appendix proves inequality (32), which provides upper and lower bounds on the difference $\log\big(1+\chi^2(P,Q)\big) - D(P\|Q)$ in terms of the dual relative entropy $D(Q\|P)$. To this end, we first prove a new inequality relating $f$-divergences [21], and the bounds in (32) then follow as a special case. Recall the following definition of an $f$-divergence:

Definition 1:
Let $f : (0,\infty) \to \mathbb{R}$ be a convex function with $f(1) = 0$, and let $P$ and $Q$ be two probability measures defined on a common finite set $\mathcal{A}$. The $f$-divergence from $P$ to $Q$ is defined by
$$ D_f(P\|Q) \triangleq \sum_{a\in\mathcal{A}} Q(a)\, f\!\left(\frac{P(a)}{Q(a)}\right) \tag{75} $$
with the convention that
$$ f\Big(\frac{0}{0}\Big) = 0, \qquad f(0) = \lim_{t\to 0^+} f(t), \qquad 0\, f\Big(\frac{b}{0}\Big) = \lim_{t\to 0^+} t\, f\Big(\frac{b}{t}\Big) = b \lim_{u\to\infty} \frac{f(u)}{u}, \quad \forall\, b > 0. \tag{76} $$

Proposition 1:
Let $f : (0,\infty) \to \mathbb{R}$ be a convex function with $f(1) = 0$, and assume that the function $g : (0,\infty) \to \mathbb{R}$, defined by $g(t) = -t f(t)$ for every $t > 0$, is also convex. Let $P$ and $Q$ be two probability measures that are defined on a finite set $\mathcal{A}$, and assume that $P, Q$ are strictly positive. Then, the following inequality holds:
$$ \min_{a\in\mathcal{A}} \frac{P(a)}{Q(a)} \cdot D_f(P\|Q) \ \le\ -D_g(P\|Q) - f\big(1+\chi^2(P,Q)\big) \ \le\ \max_{a\in\mathcal{A}} \frac{P(a)}{Q(a)} \cdot D_f(P\|Q). \tag{77} $$

Proof:
Let $\mathcal{A} = \{a_1, \ldots, a_n\}$, and let $u = (u_1, \ldots, u_n) \in \mathbb{R}^n_+$ be an arbitrary $n$-tuple with positive entries. Define
$$ J_n(f, u, P) \triangleq \sum_{i=1}^n P(a_i)\, f(u_i) - f\!\left(\sum_{i=1}^n P(a_i)\, u_i\right), \qquad J_n(f, u, Q) \triangleq \sum_{i=1}^n Q(a_i)\, f(u_i) - f\!\left(\sum_{i=1}^n Q(a_i)\, u_i\right). \tag{78} $$
The following refinement of Jensen's inequality has been introduced in [5, Theorem 1] for a convex function $f : (0,\infty) \to \mathbb{R}$:
$$ \min_{i\in\{1,\ldots,n\}} \frac{P(a_i)}{Q(a_i)} \cdot J_n(f, u, Q) \ \le\ J_n(f, u, P) \ \le\ \max_{i\in\{1,\ldots,n\}} \frac{P(a_i)}{Q(a_i)} \cdot J_n(f, u, Q). \tag{79} $$
Let $u_i \triangleq \frac{P(a_i)}{Q(a_i)}$ for $i \in \{1,\ldots,n\}$. Calculation of (78) gives that
$$ J_n(f, u, Q) = \sum_{i=1}^n Q(a_i)\, f\!\left(\frac{P(a_i)}{Q(a_i)}\right) - f\!\left(\sum_{i=1}^n Q(a_i) \cdot \frac{P(a_i)}{Q(a_i)}\right) = \sum_{a\in\mathcal{A}} Q(a)\, f\!\left(\frac{P(a)}{Q(a)}\right) - f(1) = D_f(P\|Q), \tag{80} $$
$$ J_n(f, u, P) = \sum_{i=1}^n P(a_i)\, f\!\left(\frac{P(a_i)}{Q(a_i)}\right) - f\!\left(\sum_{i=1}^n \frac{P^2(a_i)}{Q(a_i)}\right) \ \overset{(a)}{=}\ -\sum_{i=1}^n Q(a_i)\, g\!\left(\frac{P(a_i)}{Q(a_i)}\right) - f\!\left(\sum_{i=1}^n \frac{P^2(a_i)}{Q(a_i)}\right) \ \overset{(b)}{=}\ -D_g(P\|Q) - f\big(1+\chi^2(P,Q)\big) \tag{81} $$
where equality (a) holds by the definition of $g$, and equality (b) follows from equalities (30) and (75). The substitution of (80) and (81) into (79) gives inequality (77).

As a consequence of Proposition 1, we prove inequality (32). Let $f(t) = -\log(t)$ for $t > 0$. The function $f : (0,\infty) \to \mathbb{R}$ is convex with $f(1) = 0$, and $g(t) = -t f(t) = t\log(t)$ for $t > 0$ is also convex with $g(1) = 0$. Inequality (32) follows by substituting $f, g$ into (77), where $D_f(P\|Q) = D(Q\|P)$ and $D_g(P\|Q) = D(P\|Q)$. Inequality (32) also holds in the case where $P$ is not strictly positive on $\mathcal{A}$, with the convention in (76) and $\lim_{t\to 0^+} g(t) = 0$.
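The sandwich inequality (32) is straightforward to verify numerically. The following Python sketch (an added illustration in nats) checks, over random strictly positive pairs $(P,Q)$, that $\log\big(1+\chi^2(P,Q)\big) - D(P\|Q)$ lies between the two scaled versions of $D(Q\|P)$ in (32).

```python
# Randomized check of the refined-Jensen sandwich (32).
import numpy as np

rng = np.random.default_rng(5)
for _ in range(1000):
    n = int(rng.integers(2, 8))
    p = rng.dirichlet(np.ones(n))
    q = rng.dirichlet(np.ones(n))
    ratio = p / q
    d_pq = float(np.sum(p * np.log(ratio)))         # D(P||Q)
    d_qp = float(np.sum(q * np.log(q / p)))         # D(Q||P)
    chi2 = float(np.sum((p - q) ** 2 / q))          # (30)
    mid = np.log(1 + chi2) - d_pq
    assert ratio.min() * d_qp - 1e-10 <= mid <= ratio.max() * d_qp + 1e-10
```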
ACKNOWLEDGMENT

Sergio Verdú is gratefully acknowledged for his earlier results in [25] that attracted my interest and motivated this work, for providing a draft of [26], and for raising the question that led to the inclusion of Remark 6. Vincent Tan is acknowledged for pointing out [23] and for suggesting a simplified proof of (35). A discussion with Georg Böcherer and Bernhard Geiger on their paper [2] has been stimulating along the writing of this manuscript.
REFERENCES

[1] D. Berend, P. Harremoës and A. Kontorovich, "Minimum KL-divergence on complements of L1 balls," IEEE Trans. on Information Theory, vol. 60, no. 6, pp. 3172–3177, June 2014.
[2] G. Böcherer and B. C. Geiger, "Optimal quantization for distribution synthesis," March 2015. [Online]. Available: http://arxiv.org/abs/1307.6843.
[3] T. M. Cover and J. A. Thomas, Elements of Information Theory, second edition, John Wiley & Sons, 2006.
[4] I. Csiszár and Z. Talata, "Context tree estimation for not necessarily finite memory processes, via BIC and MDL," IEEE Trans. on Information Theory, vol. 52, no. 3, pp. 1007–1016, March 2006.
[5] S. S. Dragomir, "Bounds for the normalized Jensen functional," Bulletin of the Australian Mathematical Society, vol. 74, no. 3, pp. 471–478, 2006.
[6] T. van Erven and P. Harremoës, "Rényi divergence and Kullback-Leibler divergence," IEEE Trans. on Information Theory, vol. 60, no. 7, pp. 3797–3820, July 2014.
[7] A. A. Fedotov, P. Harremoës and F. Topsøe, "Refinements of Pinsker's inequality," IEEE Trans. on Information Theory, vol. 49, no. 6, pp. 1491–1498, June 2003.
[8] G. L. Gilardoni, "On the minimum f-divergence for given total variation," Comptes Rendus Mathematique, vol. 343, no. 11–12, pp. 763–766, 2006.
[9] G. L. Gilardoni, "On Pinsker's and Vajda's type inequalities for Csiszár's f-divergences," IEEE Trans. on Information Theory, vol. 56, no. 11, pp. 5377–5386, November 2010.
[10] A. Guntuboyina, S. Saha and G. Schiebinger, "Sharp inequalities for f-divergences," IEEE Trans. on Information Theory, vol. 60, no. 1, pp. 104–121, January 2014.
[11] S. W. Ho and R. W. Yeung, "The interplay between entropy and variational distance," IEEE Trans. on Information Theory, vol. 56, no. 12, pp. 5906–5929, December 2010.
[12] T. Kailath, "The divergence and Bhattacharyya distance measures in signal selection," IEEE Trans. on Communication Technology, vol. 15, no. 1, pp. 52–60, February 1967.
[13] V. Kostina and S. Verdú, "Channels with cost constraints: strong converse and dispersion," to appear in the IEEE Trans. on Information Theory, vol. 61, no. 5, May 2015.
[14] E. Ordentlich and M. J. Weinberger, "A distribution dependent refinement of Pinsker's inequality," IEEE Trans. on Information Theory, vol. 51, no. 5, pp. 1836–1840, May 2005.
[15] M. S. Pinsker, Information and Information Stability of Random Variables and Random Processes, San Francisco: Holden-Day, 1964; originally published in Russian in 1960.
[16] V. V. Prelov, "On inequalities between mutual information and variation," Problems of Information Transmission, vol. 43, no. 1, pp. 12–23, March 2007.
[17] V. V. Prelov and E. C. van der Meulen, "Mutual information, variation, and Fano's inequality," Problems of Information Transmission, vol. 44, no. 3, pp. 185–197, September 2008.
[18] M. D. Reid and R. C. Williamson, "Information, divergence and risk for binary experiments," Journal of Machine Learning Research, vol. 12, no. 3, pp. 731–817, March 2011.
[19] I. Sason, "Entropy bounds for discrete random variables via maximal coupling," IEEE Trans. on Information Theory, vol. 59, no. 11, pp. 7118–7131, November 2013.
[20] I. Sason, "On the Rényi divergence and the joint range of relative entropies," March 2015. [Online]. Available: http://arxiv.org/abs/1501.03616.
[21] I. Sason, "Tight bounds on symmetric divergence measures and a new inequality relating f-divergences," accepted to the IEEE 2015 Information Theory Workshop, Jerusalem, Israel, April 26–May 1, 2015. [Online]. Available: http://arxiv.org/abs/1502.06428.
[22] Y. Steinberg and S. Verdú, "Simulation of random processes and rate-distortion theory," IEEE Trans. on Information Theory, vol. 42, no. 1, pp. 63–86, January 1996.
[23] M. Tomamichel and V. Y. F. Tan, "A tight upper bound for the third-order asymptotics for most discrete memoryless channels," IEEE Trans. on Information Theory, vol. 59, no. 11, pp. 7041–7051, November 2013.
[24] I. Vajda, "Note on discrimination information and variation," IEEE Trans. on Information Theory, vol. 16, no. 6, pp. 771–773, November 1970.
[25] S. Verdú, "Total variation distance and the distribution of the relative information," Proceedings of the 2014 Information Theory and Applications (ITA) Workshop, pp. 499–501, San Diego, California, USA, February 2014.
[26] S. Verdú, Information Theory, in preparation.
[27] Z. Zhang, "Estimating mutual information via Kolmogorov distance," IEEE Trans. on Information Theory, vol. 53, no. 9, pp. 3280–3282, September 2007.