Minimax Rényi Redundancy

Semih Yagli, Student Member, IEEE, Yücel Altuğ, and Sergio Verdú, Fellow, IEEE
Abstract—The redundancy for universal lossless compression of discrete memoryless sources in Campbell's setting is characterized as a minimax Rényi divergence, which is shown to be equal to the maximal $\alpha$-mutual information via a generalized redundancy-capacity theorem. Special attention is placed on the analysis of the asymptotics of minimax Rényi divergence, which is determined up to a term vanishing in blocklength.

Keywords: Universal lossless compression, generalized redundancy-capacity theorem, minimax redundancy, minimax regret, Jeffreys' prior, risk aversion, Rényi divergence, $\alpha$-mutual information.

I. INTRODUCTION

In variable-length source coding, the expected code length is the usual cost function that one aims to minimize. For discrete memoryless sources, asymptotically, the minimal achievable per-letter expected code length is equal to the entropy. However, if $P_{Y^n|V=\theta}$ is a discrete memoryless source distribution with an unknown parameter $\theta$ and the encoding system assumes a distribution $Q_{Y^n}$, then one needs to pay an extra penalty for the mismatch given by
$$\frac{1}{n}\, D\big(P_{Y^n|V=\theta} \,\big\|\, Q_{Y^n}\big) + o(1), \qquad (1)$$
where $D(P\|Q)$ stands for the relative entropy between the probability measures $P$ and $Q$. In light of (1), the conventional worst-case measure of redundancy in universal lossless compression is
$$R_n = \inf_{Q_{Y^n}} \sup_{\theta}\, D\big(P_{Y^n|V=\theta} \,\big\|\, Q_{Y^n}\big), \qquad (2)$$
where the infimization is over all distributions on $\mathcal{Y}^n$, and the supremum is over all possible values of the unknown parameter. In this zero-sum game, $Q_{Y^n}$ is chosen by the code designer, and $\theta$ is chosen by nature.

A relation between $R_n$ and the maximal mutual information is given by the redundancy-capacity theorem (e.g., [4] and [5]), which states that
$$R_n = \sup_{P_V} I(P_V, P_{Y^n|V}), \qquad (3)$$
where the supremization is over all probability distributions on the parameter space. Through (1), (2) and (3), we see a pleasing relationship between entropy, relative entropy and mutual information in the context of lossless data compression.

Semih Yagli and Sergio Verdú are with the Electrical Engineering Department, Princeton University, Princeton, NJ 08544. Yücel Altuğ is with Natera Inc., San Carlos, CA 94070. E-mail: [email protected], [email protected], [email protected]. Part of this paper was presented at the 2017 IEEE International Symposium on Information Theory [1].

For prefix codes, (1) is well known [2, Theorem 5.4.3]. On the other hand, the loss in rate incurred due to the prefix condition is known to be asymptotically negligible [3].
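The zero-sum game in (2) and the identity (3) can be illustrated numerically. The following sketch (ours, not from the paper) treats a toy family of just two binary memoryless sources with $n=1$: it computes the minimax value of (2) by a grid search over $Q$, the capacity side of (3) by a grid search over the prior $P_V$, and checks that the two agree, as the redundancy-capacity theorem guarantees for finite parameter sets.

```python
import math

def kl(p, q):
    """Relative entropy D(p||q) in bits for distributions on a finite set."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy parameter set: two binary memoryless sources (n = 1 for clarity).
sources = [(0.1, 0.9), (0.6, 0.4)]

# Minimax side of (2): grid over Q and take the best worst-case divergence.
grid = [i / 10000 for i in range(1, 10000)]
minimax = min(max(kl(p, (q, 1 - q)) for p in sources) for q in grid)

# Capacity side of (3): grid over the prior P_V on the two sources.
def mutual_info(w):
    mix = tuple(w * a + (1 - w) * b for a, b in zip(*sources))
    return w * kl(sources[0], mix) + (1 - w) * kl(sources[1], mix)

capacity = max(mutual_info(w) for w in grid)

# The redundancy-capacity theorem says the two values agree.
assert abs(minimax - capacity) < 1e-3
print(round(minimax, 4), round(capacity, 4))
```

The grid resolution bounds the numerical gap; the theoretical gap is exactly zero.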
$I(P_X, P_{Y|X}) = D(P_{Y|X} P_X \,\|\, P_Y \times P_X)$ is the mutual information between $X$ and $Y$ with $(X,Y) \sim P_X P_{Y|X}$.

Let $Y^n \sim P_{Y^n|V=\theta}$, and note that
$$D\big(P_{Y^n|V=\theta} \big\| Q_{Y^n}\big) = \mathbb{E}\Big[ \imath_{P_{Y^n|V=\theta}\|Q_{Y^n}}(Y^n) \Big], \qquad (4)$$
where the relative information between the discrete probability measures $P$ and $Q$ is defined as
$$\imath_{P\|Q}(a) = \log\frac{P(a)}{Q(a)}. \qquad (5)$$
A much more stringent performance guarantee than the average of the relative information is its pointwise maximum. In particular, if one replaces $\mathbb{E}[\imath_{P_{Y^n|V=\theta}\|Q_{Y^n}}(Y^n)]$ with $\max_{y^n} \imath_{P_{Y^n|V=\theta}\|Q_{Y^n}}(y^n)$ in (2), the resulting quantity, i.e.,
$$r_n = \inf_{Q_{Y^n}} \sup_{\theta} \max_{y^n\in\mathcal{Y}^n} \imath_{P_{Y^n|V=\theta}\|Q_{Y^n}}(y^n), \qquad (6)$$
is called the minimax regret, which has found applications in various settings, e.g., [6]–[10]. An analog of the redundancy-capacity theorem is given by [7]
$$r_n = \log \sum_{y^n\in\mathcal{Y}^n} \sup_{\theta} P_{Y^n|V=\theta}(y^n) \qquad (7)$$
$$\phantom{r_n} = \sup_{P_V} I_\infty(P_V, P_{Y^n|V}), \qquad (8)$$
where $I_\infty(P_X, P_{Y|X})$ denotes the $\alpha$-mutual information of infinite order, whose definition is given in (42).

The average and pointwise formulations are two extremes of performance guarantees, which are not quite suitable for certain applications. For this reason, one seeks a compromise between those two. For example, in the economics literature, average and pointwise guarantees are referred to as risk-neutral and risk-avoiding, respectively. Since the former is known to be too lenient and the latter is known to be too stringent for typical applications, the notion of risk-aversion has been introduced to provide a more useful compromise between these two extremes [11], [12], which is known to be relevant for diverse applications [13].
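The Shtarkov sum in (7) is directly computable for the binary memoryless family, where the supremum over $\theta$ of $\theta^t(1-\theta)^{n-t}$ is attained at the maximum-likelihood point $\theta = t/n$. The sketch below (ours, for illustration) evaluates (7) exactly and compares it with the $k=2$ asymptotic expansion (22), $r_n = \frac{1}{2}\log\frac{n}{2\pi} + \log\pi + o(1)$.

```python
import math

def shtarkov_regret_bits(n):
    """Minimax regret r_n (in bits) for the binary memoryless family via (7):
    r_n = log sum_{y^n} sup_theta P_{Y^n|V=theta}(y^n); the supremum over
    theta of theta^t (1-theta)^(n-t) is attained at theta = t/n."""
    total = 0.0
    for t in range(n + 1):
        # log of the binomial coefficient C(n, t), computed stably via lgamma
        log_term = math.lgamma(n + 1) - math.lgamma(t + 1) - math.lgamma(n - t + 1)
        if 0 < t < n:
            log_term += t * math.log(t / n) + (n - t) * math.log(1 - t / n)
        total += math.exp(log_term)
    return math.log2(total)

# Compare with the asymptotic expansion (22) for k = 2:
# r_n = (1/2) log(n / (2*pi)) + log(pi) + o(1).
n = 2000
exact = shtarkov_regret_bits(n)
approx = 0.5 * math.log2(n / (2 * math.pi)) + math.log2(math.pi)
assert abs(exact - approx) < 0.05
```

At $n = 2000$ the exact value and the two-term expansion already agree to within a few hundredths of a bit; the residual is the $o(1)$ term of (22).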
In this paper, we introduce the notion of risk-aversion within the universal source coding context and quantify its effect on the fundamental limit.

In the non-universal setting, i.e., when the source distribution is known, a classical result of Campbell [14] introduces such a risk-averse cost function in a discrete memoryless setting. Specifically, [14] proposes to generalize the conventional notion of minimizing the expected code length with the cost function
$$L_\lambda(Y^n) = \frac{1}{\lambda}\log \mathbb{E}\big[\exp\big(\lambda\,\ell(f(Y^n))\big)\big], \qquad (9)$$
where $\lambda \in (0,\infty)$, $f$ denotes the code, and $\ell(\cdot)$ denotes the length function. In this case, for a discrete memoryless source $Y^n$, Campbell [14] shows that the minimum per-letter cost asymptotically achievable by prefix codes is given by the Rényi entropy of order $\frac{1}{1+\lambda}$, namely $H_{\frac{1}{1+\lambda}}(Y)$. Notice that $L_\lambda(Y^n)$ captures the notion of risk-aversion through the parameter $\lambda$ since
$$L_\lambda(Y^n) \xrightarrow{\;\lambda\to 0\;} \mathbb{E}[\ell(f(Y^n))], \qquad (10)$$
$$L_\lambda(Y^n) \xrightarrow{\;\lambda\to\infty\;} \max_{y^n\in\mathcal{Y}^n} \ell(f(y^n)). \qquad (11)$$
A natural way to introduce risk-aversion in universal source coding is to use Campbell's formulation and characterize the penalty for the mismatch akin to (1). Indeed, about forty years after Campbell's work, Sundaresan [15, Theorem 8] shows that if one uses $L_\lambda(Y^n)$ as the cost function, the penalty paid for universality can be written as
$$\frac{1}{n}\, D_{1+\lambda}\Big( \widetilde{P}^{\frac{1}{1+\lambda}}_{Y^n|V=\theta} \,\Big\|\, \widetilde{Q}^{\frac{1}{1+\lambda}}_{Y^n} \Big) + o(1), \qquad (12)$$
where $D_{1+\lambda}(P\|Q)$ denotes the Rényi divergence of order $1+\lambda$, which is defined in (40), and $\widetilde{P}^{\alpha}_{Y}$ denotes the scaled distribution of $P_Y$:
$$\widetilde{P}^{\alpha}_{Y}(y) = \frac{P^{\alpha}_{Y}(y)}{\sum_{b\in\mathcal{Y}} P^{\alpha}_{Y}(b)}. \qquad (13)$$
The distance measure
$$S_{\alpha}(P\|Q) = D_{\alpha}\big( \widetilde{P}^{1/\alpha} \,\big\|\, \widetilde{Q}^{1/\alpha} \big) \qquad (14)$$
is known as the Sundaresan divergence of order $\alpha$ between $P$ and $Q$. Following [15], the relevant measure of redundancy for universal lossless compression under Campbell's performance criterion is
$$R_\lambda(n) = \inf_{Q_{Y^n}} \sup_{\theta}\, S_{1+\lambda}\big(P_{Y^n|V=\theta} \big\| Q_{Y^n}\big). \qquad (15)$$
The conventional minimax redundancy in (2) corresponds to $R_0(n)$, while the minimax regret in (6) corresponds to $R_\infty(n)$. Although, in general, $S_\alpha(P\|Q) \neq D_\alpha(P\|Q)$, we are able to establish a pleasing analog of the classical redundancy results (2), (3) and (6), (8):
$$R_\lambda(n) = \inf_{Q_{Y^n}} \sup_{\theta}\, D_{1+\lambda}\big(P_{Y^n|V=\theta} \big\| Q_{Y^n}\big) \qquad (16)$$
$$\phantom{R_\lambda(n)} = \sup_{P_V} I_{1+\lambda}(P_V, P_{Y^n|V}), \qquad (17)$$
where in (17)
$$I_{1+\lambda}(P_X, P_{Y|X}) = \inf_{Q_Y} D_{1+\lambda}\big(P_{Y|X} P_X \big\| Q_Y \times P_X\big) \qquad (18)$$
is the $\alpha$-mutual information of order $1+\lambda$ between $X$ and $Y$ with $(X,Y) \sim P_X P_{Y|X}$; see [16], [17]. Note that (16) is analogous to (2) with Rényi divergence replacing the relative entropy. Thus, we refer to $R_\lambda(n)$ as the minimax Rényi redundancy. Moreover, (17) generalizes the redundancy-capacity theorem to $\alpha$-mutual information, thereby finding another operational meaning for the maximal $\alpha$-mutual information beyond those that have been shown in the literature on error probability bounds for data transmission (e.g., [17], [18]). Moreover, the $\alpha$-mutual information smoothly interpolates between two extremes, namely $I(P_V, P_{Y^n|V})$ in (3) and $I_\infty(P_V, P_{Y^n|V})$ in (8).

Unless otherwise stated, logarithms and exponentials are of arbitrary basis.

For example, in lossless compression with prefix codes, $\imath_{P_{Y^n|V=\theta}\|Q_{Y^n}}(y^n)$ is often viewed as a proxy for the mismatch penalty incurred by assuming that $y^n$ is drawn from $Q_{Y^n}$ rather than the true distribution $P_{Y^n|V=\theta}$. Such an approximation can be justified asymptotically.

Campbell's and Sundaresan's results are still valid when $\lambda \in (-1, 0)$. However, such a formulation corresponds to a risk-seeking scheme, which falls outside the philosophy espoused in this paper.
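The scaled distribution (13) and the Sundaresan divergence (14) are simple to compute on a finite alphabet. The following sketch (ours, not from the paper) implements both and confirms the remark after (15) that, in general, $S_\alpha(P\|Q) \neq D_\alpha(P\|Q)$; the helper names `scaled`, `renyi_div` and `sundaresan_div` are our own.

```python
import math

def scaled(p, a):
    """Scaled distribution (13): p_i^a, normalized to sum to 1."""
    w = [pi ** a for pi in p]
    s = sum(w)
    return [wi / s for wi in w]

def renyi_div(p, q, alpha):
    """Rényi divergence of order alpha > 1, in bits; see (41)."""
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return math.log2(s) / (alpha - 1)

def sundaresan_div(p, q, alpha):
    """Sundaresan divergence (14): Rényi divergence of order alpha
    between the (1/alpha)-scaled versions of p and q."""
    return renyi_div(scaled(p, 1 / alpha), scaled(q, 1 / alpha), alpha)

p = [0.7, 0.2, 0.1]
q = [0.25, 0.25, 0.5]
alpha = 2.0  # corresponds to risk-aversion parameter lambda = 1

# In general the two divergences differ, as noted after (15).
assert abs(sundaresan_div(p, q, alpha) - renyi_div(p, q, alpha)) > 1e-3
print(sundaresan_div(p, q, alpha), renyi_div(p, q, alpha))
```

Theorem 1 below shows that the two divergences nevertheless produce the same minimax value in (15) and (16).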
Finally, (16) and (17), coupled with Campbell's result [14], provide a pleasing relationship between Rényi entropy, Rényi divergence and $\alpha$-mutual information in the context of universal lossless data compression.

The asymptotic behaviors of the minimax redundancy and minimax regret have also received considerable attention in the literature (e.g., [6], [7], [9], [19]–[24]) since, in addition to compression, they are relevant in applications such as machine learning, finance, prediction, and gambling. In particular, Xie and Barron in their key contributions [19], [6] show that
$$R_n = R_0(n) \qquad (19)$$
$$\phantom{R_n} = \frac{k-1}{2}\log\frac{n}{2\pi e} + \log\frac{\Gamma^k(1/2)}{\Gamma(k/2)} + o(1), \qquad (20)$$
$$r_n = R_\infty(n) \qquad (21)$$
$$\phantom{r_n} = \frac{k-1}{2}\log\frac{n}{2\pi} + \log\frac{\Gamma^k(1/2)}{\Gamma(k/2)} + o(1), \qquad (22)$$
where $n$ and $k$ are the number of observations and the alphabet size, respectively, $\Gamma$ denotes the Gamma function, and $o(1)$ vanishes as $n \to \infty$.

While Merhav [25, Theorem 1] gives $R_\lambda(n) = \frac{k-1}{2}\log n + o(\log n)$, we quantify asymptotically the effect of the risk-aversion parameter $\lambda$ on the fundamental limit in universal source coding by providing a pleasing interpolation between (20) and (22):
$$R_\lambda(n) = \frac{k-1}{2}\log\frac{n}{2\pi\,(1+\lambda)^{1/\lambda}} + \log\frac{\Gamma^k(1/2)}{\Gamma(k/2)} + o(1). \qquad (23)$$

In the remainder of the paper, Section II sets the basic notation and definitions. Section III states the main results and gives the outlines of their proofs, which are contained in Section IV. In the Appendices, we prove several lemmas that are used in Section IV.

II. NOTATION AND DEFINITIONS
Let $\mathcal{Y} = \{1, 2, \ldots, k\}$ and denote the $(k-1)$-dimensional simplex of probability mass functions defined on $\mathcal{Y}$ by
$$\Delta_{k-1} = \Big\{ (\theta_1, \ldots, \theta_k) \in \mathbb{R}^k_{+} : \sum_{i=1}^{k}\theta_i = 1 \Big\}. \qquad (24)$$
For each parameter $\theta = (\theta_1, \ldots, \theta_k) \in \Delta_{k-1}$, we define our observation model $P_{Y|V=\theta}$ such that
$$P_{Y|V=\theta}(i) = \theta_i, \qquad (25)$$
and the independent identically distributed (i.i.d.) extension of this model, $P_{Y^n|V=\theta}$, such that
$$P_{Y^n|V=\theta}(y^n) = \prod_{i=1}^{n} P_{Y|V=\theta}(y_i) \qquad (26)$$
$$\phantom{P_{Y^n|V=\theta}(y^n)} = \theta_1^{t_1}\cdots\theta_k^{t_k}, \qquad (27)$$
where
$$t_i = \sum_{j=1}^{n} 1\{y_j = i\} \qquad (28)$$
denotes the number of times $i \in \mathcal{Y}$ appears in the vector $y^n$, and therefore
$$\sum_{i=1}^{k} t_i = n. \qquad (29)$$
It can be verified that the Fisher information matrix (in nats) of $P_{Y|V=\theta}$ for the parameter vector $\theta$ is
$$J(\theta, P_{Y|V}) = \mathrm{diag}\Big(\frac{1}{\theta_1}, \frac{1}{\theta_2}, \ldots, \frac{1}{\theta_{k-1}}\Big) + \frac{1}{\theta_k}\,\mathbf{1}_{(k-1)\times(k-1)}, \qquad (30)$$
where $\mathbf{1}_{l\times l}$ denotes an $l \times l$ matrix all of whose entries are equal to 1. The determinant of the Fisher information matrix in (30) satisfies
$$|J(\theta, P_{Y|V})| = \frac{1}{\prod_{i=1}^{k}\theta_i}. \qquad (31)$$
An important probability measure on $\Delta_{k-1}$ is Jeffreys' prior [28], defined as
$$P^*_V(\theta) = \frac{|J(\theta, P_{Y|V})|^{1/2}}{\int_{\Delta_{k-1}} |J(\xi, P_{Y|V})|^{1/2}\,\mathrm{d}\xi} \qquad (32)$$
$$\phantom{P^*_V(\theta)} = \frac{\theta_1^{-1/2}\cdots\theta_k^{-1/2}}{D_k(1/2, \ldots, 1/2)}, \qquad (33)$$
where $D_k(\alpha_1, \ldots, \alpha_k)$ denotes a special form of the Dirichlet integrals of type 1, which can be written in terms of the Gamma function:
$$D_k(\alpha_1, \ldots, \alpha_k) = \int_{\Delta_{k-1}} \xi_1^{\alpha_1 - 1}\cdots\xi_k^{\alpha_k - 1}\,\mathrm{d}\xi \qquad (34)$$
$$\phantom{D_k(\alpha_1,\ldots,\alpha_k)} = \frac{\Gamma(\alpha_1)\cdots\Gamma(\alpha_k)}{\Gamma(\alpha_1 + \cdots + \alpha_k)}. \qquad (35)$$
In particular,
$$D_k(1/2, \ldots, 1/2) = \frac{\Gamma^k(1/2)}{\Gamma(k/2)} \qquad (36)$$
$$= \begin{cases} \pi^{k/2}\big/(k/2 - 1)!, & k \text{ even}, \\[2pt] \pi^{(k-1)/2}\Big/\prod_{i=1}^{(k-1)/2}\big(i - \tfrac12\big), & k \text{ odd}. \end{cases} \qquad (37)$$

In a fundamentally different setup, Hayashi [26, Lemma 3] considers the counterpart of the Clarke and Barron [27, Theorem 2.1] result, replacing relative entropy with Rényi divergence.

As a special case, when $k = 2$, we use the shorthand notation $P_{Y|V=\theta}$ instead of $P_{Y|V=(\theta, 1-\theta)}$.

Note that the Fisher information matrix is $(k-1)\times(k-1)$ since there are $k-1$ free parameters in the model. Nevertheless, it is notationally convenient to denote the parameter vector $\theta$ as if it were $k$-dimensional.

The source distribution we get by assuming Jeffreys' prior on the parameter space is referred to as
Jeffreys' mixture, which is denoted by
$$Q^*_{Y^n}(y^n) = \int_{\Delta_{k-1}} P_{Y^n|V=\theta}(y^n)\,\mathrm{d}P^*_V(\theta) \qquad (38)$$
$$\phantom{Q^*_{Y^n}(y^n)} = \frac{D_k\big(t_1 + \tfrac12, \ldots, t_k + \tfrac12\big)}{D_k\big(\tfrac12, \ldots, \tfrac12\big)}. \qquad (39)$$
For discrete probability measures $P$ and $Q$ on the set $\mathcal{Y}$ such that $Q$ dominates $P$, i.e., $P \ll Q$, the Rényi divergence of order $\alpha$ between $P$ and $Q$ is defined as
$$D_\alpha(P\|Q) = \begin{cases} D(P\|Q), & \alpha = 1, \\[2pt] \dfrac{1}{\alpha - 1}\log \mathbb{E}\big[\exp\big((\alpha-1)\,\imath_{P\|Q}(Y)\big)\big], & \alpha \in (1,\infty), \\[2pt] \max_{b\in\mathcal{Y}} \imath_{P\|Q}(b), & \alpha = \infty, \end{cases} \qquad (40)$$
where $Y \sim P$. In particular, when $\alpha \in (1,\infty)$, the Rényi divergence of order $\alpha$ between $P$ and $Q$ can be expressed as
$$D_\alpha(P\|Q) = \frac{1}{\alpha - 1}\log \sum_{b\in\mathcal{Y}} P^{\alpha}(b)\,Q^{1-\alpha}(b). \qquad (41)$$
Given $(P_X, P_{Y|X})$, an analogous generalization can be made for mutual information, resulting in the $\alpha$-mutual information [17]:
$$I_\alpha(P_X, P_{Y|X}) = \begin{cases} I(P_X, P_{Y|X}), & \alpha = 1, \\[2pt] \inf_{Q_Y} D_\alpha\big(P_{Y|X} P_X \big\| Q_Y \times P_X\big), & \alpha \in (1,\infty), \\[2pt] \log \mathbb{E}\Big[\operatorname*{ess\,sup}_{X} \exp\big(\imath_{X;Y}(X;\bar{Y})\big)\Big], & \alpha = \infty, \end{cases} \qquad (42)$$
where $\bar{Y} \sim P_Y$, independent of $X \sim P_X$, and we have used the conventional notation for the information density $\imath_{X;Y}(x;y) = \imath_{P_{Y|X=x}\|P_Y}(y)$. As shown in Lemma 1 in Appendix A, the infimum in (42) can be solved explicitly.

In parallel with the standard usage for relative entropy, it is common to define the conditional Rényi divergence as
$$D_\alpha\big(P_{Y|X} \big\| Q_{Y|X} \big| P_X\big) = D_\alpha\big(P_X P_{Y|X} \big\| P_X Q_{Y|X}\big); \qquad (43)$$
therefore, the unconditional Rényi divergence in (42) can be written as $D_\alpha(P_{Y|X}\|Q_Y|P_X)$.

Whenever it is informative to explicitly show the dimensionality of the parameter space in the notation for Jeffreys' mixture, we do so by replacing $Q^*_{Y^n}$ with $Q^{*(k-1)}_{Y^n}$.

We are not concerned with Rényi divergences of order $\alpha \in (0,1)$. A more general definition can be found in [29].

The definition of $\alpha$-mutual information in (42) dates back to Sibson's information radius [16], although it should be noted that Sibson's motivation in [16] is not the generalization of mutual information. See [17] for a more thorough discussion.

III. STATEMENT OF THE RESULTS
Theorem 1 states that under the minimax operation in (15) the Sundaresan divergence can be replaced by the Rényi divergence. We further show that this minimax operation can be written as the maximization of the $\alpha$-mutual information, thus providing a generalization of the redundancy-capacity theorem in (3). In Theorem 2, we investigate the asymptotic behavior of the minimax Rényi redundancy between $P_{Y^n|V=\theta}$ and $Q_{Y^n}$, and we find its precise asymptotic expansion, thereby quantifying the effect of the risk-aversion parameter $\lambda$.

Theorem 1 (Generalized Redundancy-Capacity Theorem). For any $\lambda \in (0,\infty)$ and positive integer $n$,
$$R_\lambda(n) = \inf_{Q_{Y^n}} \sup_{\theta\in\Delta_{k-1}} D_{1+\lambda}\big(P_{Y^n|V=\theta} \big\| Q_{Y^n}\big) \qquad (44)$$
$$\phantom{R_\lambda(n)} = \sup_{P_V} I_{1+\lambda}(P_V, P_{Y^n|V}). \qquad (45)$$

As we show in the proof in Section IV, (44) is due to the fact that scaling a distribution is a one-to-one operation that preserves memorylessness, while the minimax theorem for Rényi divergence [29, Theorem 34] is the gateway to showing the generalized redundancy-capacity theorem in (45).

Although Theorem 1 holds in great generality, we illustrate its use in the simple example below.

Example (Z-channel with crossover probability 1/2). Consider the Z-channel with crossover probability 1/2; see, e.g., [2, Problem 7.8]. In this case,
$$\frac{\lambda}{1+\lambda}\,I_{1+\lambda}(P_V, P_{Y|V}) = \log\left( \Big(\frac{P_V(1)}{2^{1+\lambda}}\Big)^{\frac{1}{1+\lambda}} + \Big(1 - \big(1 - 2^{-(1+\lambda)}\big)P_V(1)\Big)^{\frac{1}{1+\lambda}} \right), \qquad (46)$$
which is a concave function of $P_V(1)$ for every value of $\lambda \in (0,\infty)$, and is maximized when
$$P_V(1) = \frac{2^{1+\lambda}\,\big(2^{1+\lambda} - 1\big)^{-\frac{1+\lambda}{\lambda}}}{1 + \big(2^{1+\lambda} - 1\big)^{-\frac{1}{\lambda}}}. \qquad (47)$$
After some elementary algebra, plugging (47) into (46) yields
$$\sup_{P_V} I_{1+\lambda}(P_V, P_{Y|V}) = \log\Big( 1 + \big(2^{1+\lambda} - 1\big)^{-\frac{1}{\lambda}} \Big). \qquad (48)$$
Observe that as $\lambda \to 0$, the right side of (48) converges to the capacity of the channel, namely, $\log\frac{5}{4}$. On the other hand, to compute the minimax Rényi redundancy, note that
$$D_{1+\lambda}\big(P_{Y|V=0} \big\| Q_Y\big) = \log\frac{1}{Q_Y(0)}, \qquad (49)$$
$$D_{1+\lambda}\big(P_{Y|V=1} \big\| Q_Y\big) = \frac{1}{\lambda}\log\Big( 2^{-(1+\lambda)}\,Q_Y^{-\lambda}(0) + 2^{-(1+\lambda)}\,Q_Y^{-\lambda}(1) \Big). \qquad (50)$$
Let $Q^*_Y$ be the distribution such that
$$D_{1+\lambda}\big(P_{Y|V=0} \big\| Q^*_Y\big) = D_{1+\lambda}\big(P_{Y|V=1} \big\| Q^*_Y\big). \qquad (51)$$
Since
$$\inf_{Q_Y} \sup_{\theta\in\{0,1\}} D_{1+\lambda}\big(P_{Y|V=\theta} \big\| Q_Y\big) = D_{1+\lambda}\big(P_{Y|V=0} \big\| Q^*_Y\big) \qquad (52)$$
$$\phantom{\inf_{Q_Y}\sup} = \log\Big( 1 + \big(2^{1+\lambda} - 1\big)^{-\frac{1}{\lambda}} \Big), \qquad (53)$$
through (48) and (53), as enforced by the generalized redundancy-capacity theorem, we observe that the maximal $\alpha$-mutual information matches the minimax Rényi divergence.
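The closed forms (47) and (48) of the example can be checked numerically. The sketch below (ours, not from the paper) evaluates the order-$(1+\lambda)$ mutual information of the Z-channel as a function of the prior, confirms that the stated maximizer attains the stated maximum, and compares against a brute-force grid maximization; the helper name `I_alpha` is our own.

```python
import math

lam = 0.7          # risk-aversion parameter lambda
a = 1 + lam        # order of the Rényi quantities

def I_alpha(q):
    """alpha-mutual information of order 1+lambda for the Z-channel with
    crossover probability 1/2 and prior P_V(1) = q; see (46)."""
    s = (q * 2 ** (-a)) ** (1 / a) + (1 - (1 - 2 ** (-a)) * q) ** (1 / a)
    return (a / lam) * math.log2(s)

# Closed-form maximizer (47) and maximum (48).
q_star = (2 ** a) * (2 ** a - 1) ** (-a / lam) / (1 + (2 ** a - 1) ** (-1 / lam))
sup_closed = math.log2(1 + (2 ** a - 1) ** (-1 / lam))

# Brute-force check over a fine grid of priors.
sup_grid = max(I_alpha(i / 100000) for i in range(1, 100000))

assert abs(I_alpha(q_star) - sup_closed) < 1e-9
assert abs(sup_grid - sup_closed) < 1e-6
```

As $\lambda \to 0$ the value of `sup_closed` approaches $\log_2\frac{5}{4}$, the Shannon capacity of this Z-channel, consistent with the remark after (48).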
Theorem 2 (Asymptotic Behavior of Minimax Rényi Redundancy). For any $\lambda \in (0,\infty)$,
$$\lim_{n\to\infty}\Big\{ R_\lambda(n) - \frac{k-1}{2}\log\frac{n}{2\pi} \Big\} = \log\frac{\Gamma^k(1/2)}{\Gamma(k/2)} - \frac{k-1}{2\lambda}\log(1+\lambda). \qquad (54)$$

We prove Theorem 2 in Section IV by dividing it into two parts: converse and achievability. In both parts, Jeffreys' prior plays a significant role. However, it is known that Jeffreys' prior dramatically emphasizes the lower dimensional faces of the simplex. While this is not a problem in proving the converse bound, Jeffreys' prior achieves a suboptimal minimax value (see Lemma 14 in Appendix J). Similar issues arise in finding the exact asymptotic constant in minimax redundancy [19] and in minimax regret [6]. To overcome this problem, we modify Jeffreys' prior by placing masses near the faces of the simplex as in [19]. Although this resolves the problem encountered in the minimax redundancy and minimax regret cases, the functional form of Rényi divergence becomes the second obstacle, which forces us to show a uniform Laplace approximation, thereby making the proof of achievability a much more involved task than that of the converse. For this reason, we start by presenting the achievability proof in the special case of binary alphabets, in which the notation is simplified considerably.

IV. PROOFS
A. Proof of Theorem 1
To establish (44), for any $\lambda \in (0,\infty)$, define the bijection $f_\lambda : \Delta_{k-1} \to \Delta_{k-1}$ as
$$f_\lambda(\theta_1, \ldots, \theta_k) = \frac{1}{\kappa_\lambda}\Big( \theta_1^{\frac{1}{1+\lambda}}, \ldots, \theta_k^{\frac{1}{1+\lambda}} \Big), \qquad (55)$$
where
$$\kappa_\lambda = \sum_{b\in\mathcal{Y}} P^{\frac{1}{1+\lambda}}_{Y|V=\theta}(b). \qquad (56)$$
Then, for any $\theta = (\theta_1, \ldots, \theta_k) \in \Delta_{k-1}$ and $y^n \in \mathcal{Y}^n$, the scaled version of the conditional distribution (see (13)) satisfies
$$\widetilde{P}^{\frac{1}{1+\lambda}}_{Y^n|V=\theta}(y^n) = \prod_{i=1}^{n} \widetilde{P}^{\frac{1}{1+\lambda}}_{Y|V=\theta}(y_i) \qquad (57)$$
$$= \prod_{i=1}^{n} P_{Y|V=f_\lambda(\theta)}(y_i) \qquad (58)$$
$$= P_{Y^n|V=f_\lambda(\theta)}(y^n). \qquad (59)$$
Therefore, for any given distribution $R_{Y^n}$ on $\mathcal{Y}^n$,
$$\sup_{\theta\in\Delta_{k-1}} D_{1+\lambda}\Big( \widetilde{P}^{\frac{1}{1+\lambda}}_{Y^n|V=\theta} \Big\| R_{Y^n} \Big) = \sup_{\theta\in\Delta_{k-1}} D_{1+\lambda}\big( P_{Y^n|V=f_\lambda(\theta)} \big\| R_{Y^n} \big) \qquad (60)$$
$$= \sup_{\theta\in\Delta_{k-1}} D_{1+\lambda}\big( P_{Y^n|V=\theta} \big\| R_{Y^n} \big). \qquad (61)$$
As a result of (61),
$$\inf_{Q_{Y^n}} \sup_{\theta\in\Delta_{k-1}} S_{1+\lambda}\big(P_{Y^n|V=\theta} \big\| Q_{Y^n}\big) = \inf_{Q_{Y^n}} \sup_{\theta\in\Delta_{k-1}} D_{1+\lambda}\Big( \widetilde{P}^{\frac{1}{1+\lambda}}_{Y^n|V=\theta} \Big\| \widetilde{Q}^{\frac{1}{1+\lambda}}_{Y^n} \Big) \qquad (62)$$
$$= \inf_{Q_{Y^n}} \sup_{\theta\in\Delta_{k-1}} D_{1+\lambda}\Big( P_{Y^n|V=\theta} \Big\| \widetilde{Q}^{\frac{1}{1+\lambda}}_{Y^n} \Big) \qquad (63)$$
$$= \inf_{\widetilde{Q}^{\frac{1}{1+\lambda}}_{Y^n}} \sup_{\theta\in\Delta_{k-1}} D_{1+\lambda}\Big( P_{Y^n|V=\theta} \Big\| \widetilde{Q}^{\frac{1}{1+\lambda}}_{Y^n} \Big) \qquad (64)$$
$$= \inf_{Q_{Y^n}} \sup_{\theta\in\Delta_{k-1}} D_{1+\lambda}\big( P_{Y^n|V=\theta} \big\| Q_{Y^n} \big), \qquad (65)$$
where (64) follows because every probability measure on $\mathcal{Y}^n$ is a scaled version of another probability measure on $\mathcal{Y}^n$.

In order to establish (45), note that
$$\inf_{Q_{Y^n}} \sup_{\theta\in\Delta_{k-1}} D_{1+\lambda}\big( P_{Y^n|V=\theta} \big\| Q_{Y^n} \big) = \inf_{Q_{Y^n}} \sup_{P_V} \mathbb{E}\big[ D_{1+\lambda}\big( P_{Y^n|V}(\cdot|V) \big\| Q_{Y^n} \big) \big] \qquad (66)$$
$$= \sup_{P_V} \inf_{Q_{Y^n}} \mathbb{E}\big[ D_{1+\lambda}\big( P_{Y^n|V}(\cdot|V) \big\| Q_{Y^n} \big) \big] \qquad (67)$$
$$= \sup_{P_V} I_{1+\lambda}(P_V, P_{Y^n|V}), \qquad (68)$$
where the expectation in (66) is with respect to $V \sim P_V$, and (67) follows from [29, Theorem 34], which holds when $\mathcal{Y}$ is finite. The right side of (67) is the maximal $\alpha$-mutual information of order $1+\lambda$ in the sense of Csiszár, see [17] and [18], which is known to equal the maximal $I_{1+\lambda}$ (see [17, Proposition 1] and [18, Theorem 5]) in the discrete parameter case.
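The key structural fact behind (57)–(59), that scaling commutes with taking i.i.d. products, is easy to verify numerically. The sketch below (ours, for illustration) scales the $n$-fold product distribution directly and compares it with the product distribution of the scaled single-letter distribution $f_\lambda(\theta)$; the helper name `scaled` is our own.

```python
import math
from itertools import product

def scaled(p, a):
    """Scaled distribution (13): p_i^a, normalized."""
    w = [pi ** a for pi in p]
    s = sum(w)
    return [wi / s for wi in w]

lam = 1.5
beta = 1 / (1 + lam)   # scaling exponent used by the bijection f_lambda in (55)
theta = [0.5, 0.3, 0.2]
n = 4

# Left side of (57): scale the n-fold product distribution.
prod_dist = {y: math.prod(theta[i] for i in y) for y in product(range(3), repeat=n)}
z = sum(v ** beta for v in prod_dist.values())
scaled_product = {y: v ** beta / z for y, v in prod_dist.items()}

# Right side of (59): the i.i.d. product of the scaled single-letter distribution.
f_theta = scaled(theta, beta)
product_of_scaled = {y: math.prod(f_theta[i] for i in y)
                     for y in product(range(3), repeat=n)}

# Scaling and taking i.i.d. products commute, as claimed in (57)-(59).
assert all(abs(scaled_product[y] - product_of_scaled[y]) < 1e-12
           for y in scaled_product)
```

This is exactly the property that lets the proof replace the Sundaresan divergence with a Rényi divergence over a reparametrized source.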
To see that (68) holds even when the parameter space is continuous, recall the definition of $\alpha$-mutual information (42), which can be written as
$$I_{1+\lambda}(P_V, P_{Y^n|V}) = \inf_{Q_{Y^n}} \frac{1}{\lambda}\log \mathbb{E}\Big[ \mathbb{E}\big[ \exp\big( \lambda\,\imath_{P_{Y^n|V}\|Q_{Y^n}}(Y^n) \big) \,\big|\, V \big] \Big], \qquad (69)$$
and note that
$$\sup_{P_V}\inf_{Q_{Y^n}} \mathbb{E}\big[ \lambda D_{1+\lambda}\big( P_{Y^n|V}(\cdot|V) \big\| Q_{Y^n} \big) \big] \le \sup_{P_V}\inf_{Q_{Y^n}} \log \mathbb{E}\Big[ \mathbb{E}\big[ \exp\big( \lambda\,\imath_{P_{Y^n|V}\|Q_{Y^n}}(Y^n) \big) \big| V \big] \Big] \qquad (70)$$
$$\le \inf_{Q_{Y^n}} \log\Big( \sup_{P_V} \mathbb{E}\Big[ \mathbb{E}\big[ \exp\big( \lambda\,\imath_{P_{Y^n|V}\|Q_{Y^n}}(Y^n) \big) \big| V \big] \Big] \Big) \qquad (71)$$
$$= \inf_{Q_{Y^n}} \sup_{\theta\in\Delta_{k-1}} \lambda D_{1+\lambda}\big( P_{Y^n|V=\theta} \big\| Q_{Y^n} \big) \qquad (72)$$
$$= \inf_{Q_{Y^n}} \sup_{P_V} \mathbb{E}\big[ \lambda D_{1+\lambda}\big( P_{Y^n|V}(\cdot|V) \big\| Q_{Y^n} \big) \big] \qquad (73)$$
$$= \sup_{P_V}\inf_{Q_{Y^n}} \mathbb{E}\big[ \lambda D_{1+\lambda}\big( P_{Y^n|V}(\cdot|V) \big\| Q_{Y^n} \big) \big], \qquad (74)$$
where (70) follows from Jensen's inequality, (71) follows from the fact that the maximin value is always less than or equal to the minimax value, and (74) is again due to [29, Theorem 34].

When both random variables are discrete, another generalization of mutual information, whose maximum also coincides with (68), is put forward by Arimoto [30]. See [17] for further discussion of the various proposals of $\alpha$-mutual information.

B. Proof of the Converse of Theorem 2
This section is devoted to the proof of
$$\liminf_{n\to\infty}\Big\{ R_\lambda(n) - \frac{k-1}{2}\log\frac{n}{2\pi} \Big\} \ge \log\frac{\Gamma^k(1/2)}{\Gamma(k/2)} - \frac{k-1}{2\lambda}\log(1+\lambda), \qquad (75)$$
for any $\lambda \in (0,\infty)$. Define
$$\mathcal{M}_n = \Big\{ a = (a_1, \ldots, a_k) \in \mathbb{Z}^k_{+} : \sum_{i=1}^{k} a_i = n \Big\}, \qquad (76)$$
$$\mathcal{M}_{n,\delta} = \mathcal{M}_n \cap \Big\{ (a_1, \ldots, a_k) \in \mathbb{Z}^k_{+} : \frac{n\delta}{k} \le a_i \ \ \forall i \Big\}, \qquad (77)$$
for any $\delta \in (0,1)$. Let $t = (t_1, \ldots, t_k)$, and consider the following:
$$\frac{\lambda}{1+\lambda}\,R_\lambda(n) = \sup_{P_V} \frac{\lambda}{1+\lambda}\,I_{1+\lambda}(P_V, P_{Y^n|V}) \qquad (78)$$
$$= \sup_{P_V} \log \sum_{y^n\in\mathcal{Y}^n} \Big( \int_{\Delta_{k-1}} P^{1+\lambda}_{Y^n|V=\theta}(y^n)\,\mathrm{d}P_V(\theta) \Big)^{\frac{1}{1+\lambda}} \qquad (79)$$
$$\ge \log \sum_{t\in\mathcal{M}_n} \binom{n}{t} \Big( \int_{\Delta_{k-1}} \big(\theta_1^{t_1}\cdots\theta_k^{t_k}\big)^{1+\lambda}\,\mathrm{d}P^*_V(\theta) \Big)^{\frac{1}{1+\lambda}} \qquad (80)$$
$$\ge \log \sum_{t\in\mathcal{M}_{n,\delta}} \binom{n}{t} \Big( \int_{\Delta_{k-1}} \big(\theta_1^{t_1}\cdots\theta_k^{t_k}\big)^{1+\lambda}\,\mathrm{d}P^*_V(\theta) \Big)^{\frac{1}{1+\lambda}}, \qquad (81)$$
where (78) is due to Theorem 1, (79) follows from a more general result [17, Theorem 1] (although, for the sake of completeness, its proof is included in Lemma 1 in Appendix A), (80) is due to the suboptimal choice of Jeffreys' prior, and (81) follows because $\mathcal{M}_{n,\delta} \subset \mathcal{M}_n$.

Using Robbins' sharpening [31] of Stirling's approximation, one can show that
$$\frac{e^{nH(\hat{P}_{y^n})}}{\sqrt{(2\pi)^{k-1}\, n^{-1}\prod_{i=1}^{k} t_i}}\; \frac{e^{\frac{1}{12n+1}}}{\prod_{i=1}^{k} e^{\frac{1}{12 t_i}}} \le \binom{n}{t_1\cdots t_k} \qquad (82)$$
$$\binom{n}{t_1\cdots t_k} \le \frac{e^{nH(\hat{P}_{y^n})}}{\sqrt{(2\pi)^{k-1}\, n^{-1}\prod_{i=1}^{k} t_i}}\; \frac{e^{\frac{1}{12n}}}{\prod_{i=1}^{k} e^{\frac{1}{12 t_i + 1}}}, \qquad (83)$$
where the entropy is in nats and $\hat{P}_{y^n}$ denotes the empirical distribution of the vector $y^n$. Since $t \in \mathcal{M}_{n,\delta}$, (82) particularizes to
$$\binom{n}{t_1\cdots t_k} \ge \frac{e^{nH(\hat{P}_{y^n})}}{\sqrt{(2\pi)^{k-1}\, n^{-1}\prod_{i=1}^{k} t_i}}\; \frac{e^{\frac{1}{12n+1}}}{e^{\frac{k^2}{12 n\delta}}}. \qquad (84)$$
With the aid of (33) and (35), we can express the integral in the right side of (81) as
$$\int_{\Delta_{k-1}} \big(\theta_1^{t_1}\cdots\theta_k^{t_k}\big)^{1+\lambda}\,\mathrm{d}P^*_V(\theta) = \frac{\prod_{i=1}^{k}\Gamma\big((1+\lambda)t_i + \frac12\big)}{\Gamma\big((1+\lambda)n + \frac{k}{2}\big)\, D_k\big(\frac12, \ldots, \frac12\big)}. \qquad (85)$$
The Gamma function generalization of Stirling's approximation (shown to be valid for positive real numbers by Whittaker and Watson [32]) yields
$$\Gamma(x) = \sqrt{2\pi}\; x^{x - \frac12}\, e^{-x}\,(1 + r), \qquad x > 0, \qquad (86)$$
where $|r| \le e^{1/(12x)} - 1$. In particular, for $i = 1, \ldots, k$,
$$\Gamma\Big((1+\lambda)t_i + \frac12\Big) = \sqrt{2\pi}\,\Big((1+\lambda)t_i + \frac12\Big)^{(1+\lambda)t_i}\, e^{-(1+\lambda)t_i - \frac12}\,(1 + r_i), \qquad (87)$$
$$\Gamma\Big((1+\lambda)n + \frac{k}{2}\Big) = \sqrt{2\pi}\,\Big((1+\lambda)n + \frac{k}{2}\Big)^{(1+\lambda)n + \frac{k-1}{2}}\, e^{-(1+\lambda)n - \frac{k}{2}}\,(1 + r_0), \qquad (88)$$
where
$$|r_i| \le \exp_e\Big( \frac{1}{12(1+\lambda)t_i + 6} \Big) - 1, \qquad (89)$$
$$|r_0| \le \exp_e\Big( \frac{1}{12(1+\lambda)n + 6k} \Big) - 1. \qquad (90)$$
It follows from (87) and (88) that
$$\frac{\prod_{i=1}^{k}\Gamma\big((1+\lambda)t_i + \frac12\big)}{\Gamma\big((1+\lambda)n + \frac{k}{2}\big)} = \Big(\frac{2\pi}{n}\Big)^{\frac{k-1}{2}}\, \frac{e^{-n(1+\lambda)H(\hat{P}_{y^n})}}{(1+\lambda)^{\frac{k-1}{2}}}\; \frac{\prod_{i=1}^{k}\big(1 + \frac{1}{2(1+\lambda)t_i}\big)^{(1+\lambda)t_i}(1 + r_i)}{\big(1 + \frac{k}{2(1+\lambda)n}\big)^{(1+\lambda)n + \frac{k-1}{2}}(1 + r_0)}. \qquad (91)$$
Combining (85) and (91), we can write
$$\int_{\Delta_{k-1}} \big(\theta_1^{t_1}\cdots\theta_k^{t_k}\big)^{1+\lambda}\,\mathrm{d}P^*_V(\theta) = \frac{(2\pi)^{\frac{k-1}{2}}}{D_k\big(\frac12,\ldots,\frac12\big)}\, \frac{e^{-n(1+\lambda)H(\hat{P}_{y^n})}}{\big((1+\lambda)n\big)^{\frac{k-1}{2}}}\; \frac{\prod_{i=1}^{k}\big(1 + \frac{1}{2(1+\lambda)t_i}\big)^{(1+\lambda)t_i}(1 + r_i)}{\big(1 + \frac{k}{2(1+\lambda)n}\big)^{(1+\lambda)n + \frac{k-1}{2}}(1 + r_0)} \qquad (92)$$
$$\ge \frac{(2\pi)^{\frac{k-1}{2}}}{D_k\big(\frac12,\ldots,\frac12\big)}\, \frac{e^{-n(1+\lambda)H(\hat{P}_{y^n})}}{\big((1+\lambda)n\big)^{\frac{k-1}{2}}}\; \frac{\big(1 + \frac{k}{2(1+\lambda)n\delta}\big)^{(1+\lambda)n\delta}}{\big(1 + \frac{k}{2(1+\lambda)n}\big)^{(1+\lambda)n + \frac{k-1}{2}}}\; \frac{\big(2 - e^{\frac{k}{12(1+\lambda)n\delta + 6k}}\big)^{k}}{e^{\frac{1}{12(1+\lambda)n + 6k}}}, \qquad (93)$$
where (93) is due to the definition of $\mathcal{M}_{n,\delta}$ in (77), the fact that for any positive constant $c$, $(1 + c/x)^x$ is a monotone increasing function of $x$, and the fact that the error terms (see (89) and (90)) satisfy
$$\frac{\prod_{i=1}^{k}(1 + r_i)}{1 + r_0} \ge \frac{\big(2 - e^{\frac{k}{12(1+\lambda)n\delta + 6k}}\big)^{k}}{e^{\frac{1}{12(1+\lambda)n + 6k}}}. \qquad (94)$$
Uniting the lower bounds in (81), (84) and (93),
$$R_\lambda(n) - \frac{k-1}{2}\log\frac{n}{2\pi} \ge -\frac{1}{\lambda}\log D_k\big(\tfrac12, \ldots, \tfrac12\big) - \frac{k-1}{2\lambda}\log(1+\lambda) + \frac{1+\lambda}{\lambda}\log\big( \beta(n,\delta,k)\,\epsilon(n,\delta,k,\lambda) \big), \qquad (95)$$
where
$$\beta(n,\delta,k) = \sum_{t\in\mathcal{M}_{n,\delta}} \frac{1}{n^{k-1}} \prod_{j=1}^{k}\Big(\frac{t_j}{n}\Big)^{-1/2}, \qquad (96)$$
and $\epsilon(n,\delta,k,\lambda)$ collects the multiplicative error terms in (84) and (93):
$$\epsilon(n,\delta,k,\lambda) = \frac{e^{\frac{1}{12n+1}}}{e^{\frac{k^2}{12n\delta}}} \left( \frac{\big(1 + \frac{k}{2(1+\lambda)n\delta}\big)^{(1+\lambda)n\delta}}{\big(1 + \frac{k}{2(1+\lambda)n}\big)^{(1+\lambda)n + \frac{k-1}{2}}}\; \frac{\big(2 - e^{\frac{k}{12(1+\lambda)n\delta + 6k}}\big)^{k}}{e^{\frac{1}{12(1+\lambda)n + 6k}}} \right)^{\frac{1}{1+\lambda}}. \qquad (97)$$
Notice that
$$\lim_{n\to\infty} \epsilon(n,\delta,k,\lambda) = 1, \quad \text{for any } \delta \in (0,1), \qquad (98)$$
$$\lim_{\delta\to 0}\lim_{n\to\infty} \beta(n,\delta,k) = \int_{\Delta_{k-1}} \tau_1^{-1/2}\cdots\tau_k^{-1/2}\,\mathrm{d}\tau \qquad (99)$$
$$= D_k(1/2, \ldots, 1/2), \qquad (100)$$
where (98) follows after noticing that each factor of $\epsilon(n,\delta,k,\lambda)$ goes to 1, and (99) follows from the definition of the Riemann integral. Assembling (95), (98) and (100), we obtain the desired bound in (75).

C. Proof of the Achievability of Theorem 2 when $k = 2$

In this section, we prove the $\le$ direction in (54) when $k = 2$, i.e.,
$$\limsup_{n\to\infty}\Big\{ R_\lambda(n) - \frac{1}{2}\log\frac{n}{2\pi} \Big\} \le \log\frac{\Gamma^2(1/2)}{\Gamma(1)} - \frac{1}{2\lambda}\log(1+\lambda) \qquad (101)$$
$$= \log\pi - \frac{1}{2\lambda}\log(1+\lambda). \qquad (102)$$
To that end, we modify Jeffreys' prior by placing masses near the vertices of the simplex $\Delta_1$, which, in turn, enables us to show that when the parameter $\theta$ takes values near the vertices of the simplex, the value of the minimax Rényi redundancy grows strictly slower than $\frac{1}{2}\log n + O(1)$. Thus, we focus on values of $\theta$ that are not close to the vertices of the simplex, thereby enabling us to argue that the minimax Rényi redundancy behaves as in (102).

Inspired by Xie and Barron's [19] modified Jeffreys' prior, for $\epsilon \in (0,1)$ and $c \in (0, 1/(2\log e))$, consider the prior
$$P^{\epsilon}_V(\theta) = (1-\epsilon)\,P^*_V(\theta) + \frac{\epsilon}{2}\,1\Big\{\theta = \frac{c\log n}{n}\Big\} + \frac{\epsilon}{2}\,1\Big\{\theta = 1 - \frac{c\log n}{n}\Big\}, \qquad (103)$$
which differs from the one in [19] in the location of the point masses. Because of the modification of Jeffreys' prior, the corresponding $Y^n$ marginal changes from $Q^*_{Y^n}$ in (38) to
$$Q^{\epsilon}_{Y^n} = (1-\epsilon)\,Q^*_{Y^n} + \frac{\epsilon}{2}\,P_{Y^n|V=\frac{c\log n}{n}} + \frac{\epsilon}{2}\,P_{Y^n|V=1-\frac{c\log n}{n}}. \qquad (104)$$

Since $k = 2$, we have $\theta = (\theta, 1-\theta)$. To simplify the discussion, we prefer the shorthand notation $\theta$ rather than $\theta_1$.

In view of Theorem 1,
$$R_\lambda(n) \le \sup_{\theta\in[0,1]} D_{1+\lambda}\big( P_{Y^n|V=\theta} \big\| Q^{\epsilon}_{Y^n} \big) \qquad (105)$$
$$= \max\{ \Xi_1(n,\lambda,\epsilon),\ \Xi_2(n,\lambda,\epsilon),\ \Xi_3(n,\lambda,\epsilon) \}, \qquad (106)$$
where
$$\Xi_1(n,\lambda,\epsilon) = \sup_{\theta\in[0,\, \frac{c\log n}{n}]} D_{1+\lambda}\big(P_{Y^n|V=\theta} \big\| Q^{\epsilon}_{Y^n}\big), \qquad (107)$$
$$\Xi_2(n,\lambda,\epsilon) = \sup_{\theta\in[\frac{c\log n}{n},\, 1-\frac{c\log n}{n}]} D_{1+\lambda}\big(P_{Y^n|V=\theta} \big\| Q^{\epsilon}_{Y^n}\big), \qquad (108)$$
$$\Xi_3(n,\lambda,\epsilon) = \sup_{\theta\in[1-\frac{c\log n}{n},\, 1]} D_{1+\lambda}\big(P_{Y^n|V=\theta} \big\| Q^{\epsilon}_{Y^n}\big). \qquad (109)$$
The following result shows that the first and the third supremizations in the right side of (106) are both dominated by $\frac{1}{2}\log n + O(1)$.

Proposition 1. If $c \in (0, 1/(2\log e))$, then
$$\max\{\Xi_1(n,\lambda,\epsilon),\ \Xi_3(n,\lambda,\epsilon)\} \le \log\frac{2}{\epsilon} + \frac{c\,(\log e)\,\log n}{1 - \frac{c\log n}{n}}. \qquad (110)$$

Proof:
Assume that $\theta \in \big[0, \frac{c\log n}{n}\big]$. We have
$$D_{1+\lambda}\big(P_{Y^n|V=\theta} \big\| Q^{\epsilon}_{Y^n}\big) \le \log\frac{2}{\epsilon} + n\,D_{1+\lambda}\Big( P_{Y|V=\theta} \Big\| P_{Y|V=\frac{c\log n}{n}} \Big) \qquad (111)$$
$$\le \log\frac{2}{\epsilon} + n\,D_{1+\lambda}\Big( P_{Y|V=0} \Big\| P_{Y|V=\frac{c\log n}{n}} \Big) \qquad (112)$$
$$= \log\frac{2}{\epsilon} - n\log\Big( 1 - \frac{c\log n}{n} \Big) \qquad (113)$$
$$\le \log\frac{2}{\epsilon} + \frac{c\log e}{1 - \frac{c\log n}{n}}\,\log n, \qquad (114)$$
where (111) follows from (104), (112) follows because the Rényi divergence is monotone decreasing in $\theta$ (see Lemma 2 in Appendix B), and (114) follows because, for $x < 1$,
$$\log\Big(\frac{1}{1-x}\Big) \le \frac{x}{1-x}\,\log e. \qquad (115)$$
Using a symmetrical argument, one can show that the upper bound in (114) still holds when $\theta \in [1 - c\log n/n,\ 1]$.

It remains to investigate the behavior of the second supremization in the right side of (106). Let
$$\Xi^*(n,\lambda) = \sup_{\theta\in[\frac{c\log n}{n},\, 1-\frac{c\log n}{n}]} D_{1+\lambda}\big(P_{Y^n|V=\theta} \big\| Q^*_{Y^n}\big), \qquad (116)$$
and note that
$$\Xi_2(n,\lambda,\epsilon) \le \log\frac{1}{1-\epsilon} + \Xi^*(n,\lambda), \qquad (117)$$
which follows from (104). The following proposition gives an asymptotic upper bound on $\Xi^*(n,\lambda)$.

Proposition 2. Let $c \in (0, 1/(2\log e))$. For any $\lambda \in (0,\infty)$,
$$\limsup_{n\to\infty}\Big\{ \Xi^*(n,\lambda) - \frac{1}{2}\log\frac{n}{2\pi} \Big\} \le \log\frac{\Gamma^2(1/2)}{\Gamma(1)} - \frac{1}{2\lambda}\log(1+\lambda). \qquad (118)$$

Proof:
Let θ = θ and θ = 1 − θ . Without loss ofgenerality, we may assume that θ ≤ / , otherwise we mayinterchange the roles of θ and θ together with the roles of t and t = n − t below. Note that D λ ( P Y n | V = θ k Q ∗ Y n )= 1 λ log( V ( λ, θ , n ) + W ( λ, θ , n )) , (119)where V ( λ, θ , n )= (cid:16) θ n (1+ λ )1 + θ n (1+ λ )2 (cid:17) (cid:18) D ( , )D ( , n + ) (cid:19) λ , (120) W ( λ, θ , n )= n − X t =1 (cid:18) nt (cid:19)(cid:0) θ t θ t (cid:1) λ (cid:18) D ( , )D ( t + , t + ) (cid:19) λ . (121)Thanks to Lemma 4 in Appendix D, we know that for allsufficiently large n satisfying k ln n n < , (122)we have V ( λ, θ , n ) ≤ C λ (2) C λ (2) n − (1+ λ ) c log e n λ , (123)where the explicit expressions for C ( k ) and C ( k ) are givenin (219) and (228), respectively. Hence, we may now focusattention on W ( λ, θ , n ) . Note that (cid:18) nt (cid:19) ≤ (cid:18) n πt t (cid:19) exp e (cid:0) nh (cid:0) t n (cid:1) + n (cid:1) , (124) θ t θ t = exp e (cid:0) − n (cid:2) d (cid:0) t n k θ (cid:1) + h (cid:0) t n (cid:1)(cid:3)(cid:1) , (125)where h : [0 , → [0 , and d ( ·k· ) : [0 , × [0 , → [0 , ∞ ] denote the binary entropy and the binary relative entropyfunctions in nats, respectively and the bound in (124) followsfrom Stirling’s approximation, see (83). Note also that ( t + , t + ) = Γ( n + 1)Γ( t + ) Γ( t + ) (126) ≤ (cid:16) n π (cid:17) e nh ( t n ) (cid:0) n (cid:1) n + e n +1) Q i =1 (cid:16) t i (cid:17) t i (cid:16) − e ti +6 (cid:17) , (127)where (127) also follows from an application of Stirling’sapproximation, see (86).By substituting (124), (125), and (127) into the right sideof (121), we get W ( λ, θ , n ) ≤ (cid:16) n π (cid:17) λ D λ (cid:0) , (cid:1) S ( λ, θ , n ) , (128)where S ( λ, θ , n ) = n − X t =1 (cid:16) n πt t (cid:17) exp e ( − n (1 + λ ) d ( t n k θ )) × K ( λ, n, t ) , (129) and K ( λ, n, t ) = e n (1 + n ) n + e n +1) Q i =1 (1 + t i ) t i (2 − e ti +6 ) ! 
λ .(130)Note that we can find an asymptotically suboptimal upperbound on S ( λ, θ , n ) that depends only on λ by invokingLemma 6 in Appendix F, which shows a non-asymptoticuniform upper bound on K ( λ, n, t ) , and then by invokingLemma 5 in Appendix E, which shows a non-asymptoticuniform upper bound on T ( λ, θ , n ) = n − X t =1 (cid:16) n πt t (cid:17) exp e ( − n (1 + λ ) d ( t n k θ )) .(131)Finding the optimal upper bound, on the other hand, requiresa uniform Laplace approximation on S ( λ, θ , n ) , which isintroduced next. First, given δ ∈ (0 , , split S ( λ, θ , n ) as S ( λ, θ , n ) = S ( λ, θ , n, δ ) + S ( λ, θ , n, δ ) (132) + S ( λ, θ , n, δ ) ,where S ( λ, θ , n, δ ) = ⌊ n (1 − δ ) θ ⌋ X t =1 (cid:16) n πt t (cid:17) (133) × exp e (cid:0) − n (1 + λ ) d (cid:0) t n k θ (cid:1)(cid:1) K ( λ, n, t ) , S ( λ, θ , n, δ ) = ⌊ n (1+ δ ) θ ⌋ X t = ⌈ n (1 − δ ) θ ⌉ (cid:16) n πt t (cid:17) (134) × exp e (cid:0) − n (1 + λ ) d (cid:0) t n k θ (cid:1)(cid:1) K ( λ, n, t ) , S ( λ, θ , n, δ ) = n − X t = ⌈ n (1+ δ ) θ ⌉ (cid:16) n πt t (cid:17) (135) × exp e (cid:0) − n (1 + λ ) d (cid:0) t n k θ (cid:1)(cid:1) K ( λ, n, t ) .In Lemmas 8, 9 and 10 in Appendix G, we show each of thefollowing properties: lim n →∞ sup θ ∈ [ c log nn , ] S ( λ, θ , n, δ ) = 0 ∀ δ ∈ (0 , , (136) lim δ → lim sup n →∞ sup θ ∈ [ c log nn , ] S ( λ, θ , n, δ ) ≤ (1 + λ ) − , (137) lim n →∞ sup θ ∈ [ c log nn , ] S ( λ, θ , n, δ ) = 0 ∀ δ ∈ (0 , . (138)Since the left side of (132) does not depend on δ , (136)–(138)imply, by letting δ → , that lim sup n →∞ sup θ ∈ [ c log nn , ] S ( λ, θ , n ) ≤ (1 + λ ) − . (139)Finally, it follows from (119), (123), (128), and (139) that lim sup n →∞ ( sup θ ∈ [ c log nn , ] D λ ( P Y n | V = θ k Q ∗ Y n ) −
$\frac12\log\frac{n}{2\pi}\Big\} \le \log\Gamma^2(1/2) - \frac{1}{2\lambda}\log(1+\lambda).$ (140)

Since $\theta_1+\theta_2=1$, it also follows that

$\limsup_{n\to\infty}\Big\{\sup_{\theta\in[\frac12,\,1-\frac{c\log n}{n}]} D_\lambda(P_{Y^n|V=\theta}\,\|\,Q^*_{Y^n}) - \frac12\log\frac{n}{2\pi}\Big\} \le \log\Gamma^2(1/2) - \frac{1}{2\lambda}\log(1+\lambda).$ (141)

Combining (140) and (141) gives us the promised result of Proposition 2.

Invoking Proposition 1, we see that the functions in (107) and (109) can be bounded by

$\Xi_1(n,\lambda,\epsilon) - \frac12\log\frac{n}{2\pi} \le \left(\frac{c\log e}{1-\frac{c\log n}{n}} - \frac12\right)\log n + \frac12\log(2\pi) + \log\frac{2}{\epsilon},$ (142)

$\Xi_2(n,\lambda,\epsilon) - \frac12\log\frac{n}{2\pi} \le \left(\frac{c\log e}{1-\frac{c\log n}{n}} - \frac12\right)\log n + \frac12\log(2\pi) + \log\frac{2}{\epsilon},$ (143)

while, thanks to (117) and Proposition 2, it follows that

$\limsup_{n\to\infty}\Big\{\Xi_3(n,\lambda,\epsilon) - \frac12\log\frac{n}{2\pi}\Big\} \le \log\Gamma^2(1/2) - \frac{1}{2\lambda}\log(1+\lambda) + \log\frac{1}{1-\epsilon}.$ (144)

Since $c\in(0,1/(2\log e))$, we see that the right side of (144) asymptotically dominates the right sides of (142) and (143). Due to (106) and (142)–(144), the desired result in (102) follows by choosing an arbitrarily small $\epsilon$ in (103).

D. Proof of the achievability of Theorem 2 when $k>2$

In this section, we prove $\le$ in (54) when $k>2$, i.e.,

$\limsup_{n\to\infty}\Big\{R_\lambda(n) - \frac{k-1}{2}\log\frac{n}{2\pi}\Big\} \le \log\frac{\Gamma^k(1/2)}{\Gamma(k/2)} - \frac{k-1}{2\lambda}\log(1+\lambda).$ (145)

To do so, we once again modify Jeffreys' prior as in the previous section by placing masses near the lower-dimensional faces of the simplex $\Delta_{k-1}$, which, in turn, enables us to show that when the parameter vector $\theta$ takes values near the faces of the simplex, the value of the minimax Rényi redundancy grows strictly slower than $\frac{k-1}{2}\log n + O(1)$. Hence, by focusing on the parameter values that are not close to the faces of the simplex, we show that the minimax Rényi redundancy behaves as in (145).

Following the idea in [19], let $c\in(0,1/(2\log e))$ and, for $i = 1,\dots,k$, define

$L_i = \big\{\theta : \theta_i = \tfrac{c\log n}{n}\big\} \cap \Delta_{k-1}.$ (146)

Accordingly, we define the probability measure $\mu_i$ with respect to $\mathrm{d}_i\xi = \mathrm{d}\xi_1\cdots\mathrm{d}\xi_{i-1}\,\mathrm{d}\xi_{i+1}\cdots\mathrm{d}\xi_{k-1}$, the Lebesgue measure on $\mathbb{R}^{k-2}$, as

$\mu_i(\theta) = \dfrac{\theta_1^{-1/2}\cdots\theta_{i-1}^{-1/2}\,\theta_{i+1}^{-1/2}\cdots\theta_k^{-1/2}}{\int_{L_i}\xi_1^{-1/2}\cdots\xi_{i-1}^{-1/2}\,\xi_{i+1}^{-1/2}\cdots\xi_k^{-1/2}\,\mathrm{d}_i\xi}.$ (147)

Finally, for $\epsilon\in(0,1)$, we define the prior distribution $P^\epsilon_V$ on the probability simplex $\Delta_{k-1}$ as

$P^\epsilon_V = \frac{\epsilon}{k}\sum_{i=1}^k \mu_i + (1-\epsilon)P^*_V,$ (148)

where $P^*_V$ is Jeffreys' prior. Because of the modification on Jeffreys' prior in (148), the corresponding $Y^n$-marginal changes from $Q^*_{Y^n}$ in (38) to

$Q^\epsilon_{Y^n}(y^n) = \frac{\epsilon}{k}\sum_{i=1}^k M_i(y^n) + (1-\epsilon)Q^*_{Y^n}(y^n),$ (149)

where

$M_i(y^n) = \int_{L_i} P_{Y^n|V=\theta}(y^n)\,\mu_i(\theta)\,\mathrm{d}_i\theta$ (150)

$= \left(\frac{c\log n}{n}\right)^{t_i}\left(1-\frac{c\log n}{n}\right)^{n-t_i} \frac{D_{k-1}\big(t_1+\frac12,\dots,t_{i-1}+\frac12,\,t_{i+1}+\frac12,\dots,t_k+\frac12\big)}{D_{k-1}\big(\frac12,\dots,\frac12\big)}.$ (151)

Define, for $i = 1,\dots,k$,

$\mathcal{R}_i = \big\{\theta : \theta_i \in \big[0,\tfrac{c\log n}{n}\big]\big\},$ (152)

$\mathcal{R}_0 = \Delta_{k-1} - \bigcup_{i=1}^k \mathcal{R}_i.$
(153)
Note that $\mathcal{R}_0$ denotes the vectors none of whose coordinates are within close proximity of zero in the sense of (152). In view of Theorem 1,

$R_\lambda(n) = \inf_{Q_{Y^n}}\ \sup_{\theta\in\Delta_{k-1}} D_\lambda(P_{Y^n|V=\theta}\,\|\,Q_{Y^n})$ (154)

$\le \sup_{\theta\in\Delta_{k-1}} D_\lambda(P_{Y^n|V=\theta}\,\|\,Q^\epsilon_{Y^n})$ (155)

$= \max_{i\in\{0,1,\dots,k\}}\ \sup_{\theta\in\mathcal{R}_i} D_\lambda(P_{Y^n|V=\theta}\,\|\,Q^\epsilon_{Y^n}).$ (156)

The following result shows that the supremizations over $\mathcal{R}_i$ for $i = 1,\dots,k$ in (156) are all dominated by $\frac{k-1}{2}\log n + O(1)$.

Proposition 3. If $c\in(0,1/(2\log e))$, then for each $i\in\{1,\dots,k\}$,

$\sup_{\theta\in\mathcal{R}_i} D_\lambda(P_{Y^n|V=\theta}\,\|\,Q^\epsilon_{Y^n}) \le \log\frac{k}{\epsilon} + \log C(k-1) + \left(\frac{k-2}{2} + \frac{c\log e}{1-\frac{c\log n}{n}}\right)\log n,$ (157)

where the explicit value of $C(k)$ is given in (204).

Proof: Thanks to the symmetry, it suffices to show the result for $i = 1$. To that end, define $f:\Delta_{k-1}\to\Delta_{k-2}$ as

$f(\theta) = \left(\frac{\theta_2}{1-\theta_1},\cdots,\frac{\theta_k}{1-\theta_1}\right),$ (158)

and let $Q^{*(k-2)}_{Y^n}$ denote the Jeffreys' mixture when the underlying parameter space is the $(k-2)$-dimensional simplex. Further define

$\psi(\lambda,n,\theta_1,t) = \binom{n}{t}\,\frac{\big[\theta_1^t(1-\theta_1)^{n-t}\big]^{1+\lambda}}{\left(\frac{c\log n}{n}\right)^{\lambda t}\left(1-\frac{c\log n}{n}\right)^{\lambda(n-t)}},$ (159)

$\zeta(k,\lambda,n,\theta,t) = \exp\!\left(\lambda D_\lambda\!\left(P_{Y^{n-t}|V=f(\theta)}\,\big\|\,Q^{*(k-2)}_{Y^{n-t}}\right)\right),$ (160)

and note that

$\zeta(k,\lambda,n,\theta,t) \le C^\lambda(k-1)\exp\!\left(\lambda\log(n-t)^{\frac{k-2}{2}}\right)$ (161)

$\le C^\lambda(k-1)\exp\!\left(\lambda\log n^{\frac{k-2}{2}}\right),$ (162)

where (161) follows from Lemma 3 in Appendix C. For $\theta\in\mathcal{R}_1$,

$D_\lambda(P_{Y^n|V=\theta}\,\|\,Q^\epsilon_{Y^n}) \le \log\frac{k}{\epsilon} + D_\lambda\big(P_{Y^n|V=\theta}\,\|\,M_1\big)$ (163)

$= \log\frac{k}{\epsilon} + \frac{1}{\lambda}\log\sum_{t=0}^{n}\psi(\lambda,n,\theta_1,t)\,\zeta(k,\lambda,n,\theta,t)$ (164)

$\le \log\frac{k}{\epsilon} + \log C(k-1) + \frac{k-2}{2}\log n + D_\lambda\!\left(P_{Y^n|V=\theta_1}\,\Big\|\,P_{Y^n|V=\frac{c\log n}{n}}\right),$ (165)

where (163) follows from (149), and (165) is due to (162). Finally, the desired result follows because (111)–(114) imply

$D_\lambda\!\left(P_{Y^n|V=\theta_1}\,\Big\|\,P_{Y^n|V=\frac{c\log n}{n}}\right) = n\,D_\lambda\!\left(P_{Y|V=\theta_1}\,\Big\|\,P_{Y|V=\frac{c\log n}{n}}\right)$ (166)

$\le \frac{c\log e}{1-\frac{c\log n}{n}}\,\log n.$ (167)

It remains to investigate the supremization over $\mathcal{R}_0$ in (156). Observe that

$\sup_{\theta\in\mathcal{R}_0} D_\lambda(P_{Y^n|V=\theta}\,\|\,Q^\epsilon_{Y^n}) \le \log\frac{1}{1-\epsilon} + \sup_{\theta\in\mathcal{R}_0} D_\lambda(P_{Y^n|V=\theta}\,\|\,Q^*_{Y^n}),$ (168)

which follows from the definition of $Q^\epsilon_{Y^n}$ in (149). Parallel to Proposition 2, Proposition 4 characterizes the behavior of the supremum on the right side of (168).

Proposition 4.
For any $\lambda\in(0,\infty)$,

$\limsup_{n\to\infty}\Big\{\sup_{\theta\in\mathcal{R}_0} D_\lambda(P_{Y^n|V=\theta}\,\|\,Q^*_{Y^n}) - \frac{k-1}{2}\log\frac{n}{2\pi}\Big\} \le \log\frac{\Gamma^k(1/2)}{\Gamma(k/2)} - \frac{k-1}{2\lambda}\log(1+\lambda).$ (169)

Proof:
We are only interested in $\theta\in\mathcal{R}_0$. Therefore, for all $i = 1,\dots,k$,

$\theta_i \ge \frac{c\log n}{n},$ (170)

where $c\in(0,1/(2\log e))$ is a constant. Since there is an index $j$ such that $\theta_j\ge1/k$, it simplifies notation, without loss of generality, to assume $j = k$; otherwise, the proof remains identical. For a given positive integer $l$, let

$\mathcal{I}_l = \{i_1,\dots,i_l\} \subset \mathcal{Y}$ (171)

be a proper subset and note that

$D_\lambda(P_{Y^n|V=\theta}\,\|\,Q^*_{Y^n}) = \frac{1}{\lambda}\log\big(V(k,\lambda,\theta,n) + W(k,\lambda,\theta,n)\big),$ (172)

where

$V(k,\lambda,\theta,n) = \sum_{l=1}^{k-1}\binom{k}{l}\sum_{\substack{t:\,t_i\ge0\,\forall i\\ t_1+\cdots+t_k=n\\ t_i=0\,\forall i\in\mathcal{I}_l}}\binom{n}{t_1\cdots t_k}\big(\theta_1^{t_1}\cdots\theta_k^{t_k}\big)^{1+\lambda}\left(\frac{D_k(\frac12,\dots,\frac12)}{D_k(t_1+\frac12,\dots,t_k+\frac12)}\right)^{\lambda},$ (173)

$W(k,\lambda,\theta,n) = \sum_{\substack{t:\,t_i\ge1\,\forall i\\ t_1+\cdots+t_k=n}}\binom{n}{t_1\cdots t_k}\big(\theta_1^{t_1}\cdots\theta_k^{t_k}\big)^{1+\lambda}\left(\frac{D_k(\frac12,\dots,\frac12)}{D_k(t_1+\frac12,\dots,t_k+\frac12)}\right)^{\lambda}.$ (174)

Thanks to Lemma 4 in Appendix D, we know that for all sufficiently large $n$ satisfying

$\frac{k\ln n}{n} < 1,$ (175)

it follows that

$V(k,\lambda,\theta,n) \le \tilde C(k,\lambda)\, n^{-(1+\lambda)c\log e}\, n^{\frac{\lambda(k-1)}{2}},$ (176)

where $\tilde C(k,\lambda)$ is a constant depending only on $\lambda$ and $k$, which is explicitly given in the proof of Lemma 4, see (231). Hence, we may now focus attention on $W(k,\lambda,\theta,n)$. Note that

$\binom{n}{t_1\cdots t_k} \le \frac{e^{nH(\hat P_{y^n})}\, e^{\frac{1}{12n}}}{\sqrt{(2\pi)^{k-1}\, n^{k-1}\prod_{i=1}^k\frac{t_i}{n}}},$ (177)

and

$\prod_{i=1}^k \theta_i^{t_i} = \exp_e\!\left(-n\big[D(\hat P_{y^n}\|P_{Y|V=\theta}) + H(\hat P_{y^n})\big]\right),$ (178)

where both the entropy and relative entropy are in nats and the bound in (177) follows from Stirling's approximation, see (83). Note also that

$\frac{1}{D_k(t_1+\frac12,\dots,t_k+\frac12)} = \frac{\Gamma(n+\frac k2)}{\prod_{i=1}^k\Gamma(t_i+\frac12)}$ (179)

$\le \left(\frac{n}{2\pi}\right)^{\frac{k-1}{2}}\frac{e^{nH(\hat P_{y^n})}\left(1+\frac{k}{2n}\right)^{n+\frac{k-1}{2}}\, e^{\frac{1}{12n+6k}}}{\prod_{i=1}^k\left(1+\frac{1}{2t_i}\right)^{t_i}\left(2-e^{\frac{1}{12t_i+6}}\right)},$ (180)

where (180) also follows from an application of Stirling's approximation, see (86).
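The two Stirling-type estimates above are easy to sanity-check numerically. The snippet below is illustrative only (the helper names are ours, not the paper's); it verifies the exact identity (178) and the multinomial upper bound (177) for a sample type.

```python
import math

def multinomial(n, counts):
    # exact multinomial coefficient n! / (t_1! ... t_k!)
    m = math.factorial(n)
    for c in counts:
        m //= math.factorial(c)
    return m

def check(n, counts, theta):
    # empirical distribution (type) of a sequence with the given counts
    p_hat = [c / n for c in counts]
    # entropy H and relative entropy D, both in nats
    H = -sum(p * math.log(p) for p in p_hat if p > 0)
    D = sum(p * math.log(p / q) for p, q in zip(p_hat, theta) if p > 0)
    # identity (178): prod_i theta_i^{t_i} = exp(-n [D + H])
    lhs = math.prod(q ** c for q, c in zip(theta, counts))
    assert math.isclose(lhs, math.exp(-n * (D + H)), rel_tol=1e-9)
    # Stirling bound (177) on the multinomial coefficient
    k = len(counts)
    bound = (math.exp(n * H) * math.exp(1 / (12 * n))
             / math.sqrt((2 * math.pi) ** (k - 1) * n ** (k - 1)
                         * math.prod(p_hat)))
    assert multinomial(n, counts) <= bound

check(60, [10, 20, 30], [0.2, 0.3, 0.5])
```

The bound (177) is tight to within the $e^{1/(12n)}$ factor, which the check confirms even for moderate $n$.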
By substituting (177), (178) and (180) into the right side of (174), we get

$W(k,\lambda,\theta,n) \le \left(\frac{n}{2\pi}\right)^{\frac{\lambda(k-1)}{2}} D_k^{\lambda}\!\left(\tfrac12,\dots,\tfrac12\right) S(k,\lambda,\theta,n),$ (181)

where

$S(k,\lambda,\theta,n) = \sum_{\substack{t:\,t_i\ge1\,\forall i\\ t_1+\cdots+t_k=n}}\frac{K(k,\lambda,n,t)}{\sqrt{(2\pi)^{k-1}\, n^{k-1}\prod_{i=1}^k\frac{t_i}{n}}}\,\exp_e\!\left(-n(1+\lambda)\,D(\hat P_{y^n}\|P_{Y|V=\theta})\right),$ (182)

and

$K(k,\lambda,n,t) = e^{\frac{1}{12n}}\left[\frac{\left(1+\frac{k}{2n}\right)^{n+\frac{k-1}{2}}\, e^{\frac{1}{12n+6k}}}{\prod_{i=1}^k\left(1+\frac{1}{2t_i}\right)^{t_i}\left(2-e^{\frac{1}{12t_i+6}}\right)}\right]^{\lambda}.$ (183)

Observe once again that we can find an asymptotically suboptimal upper bound on $S(k,\lambda,\theta,n)$ that depends only on $k$ and $\lambda$ by invoking Lemma 6 in Appendix F, which shows a non-asymptotic uniform upper bound on $K(k,\lambda,n,t)$, and then by invoking Lemma 5 in Appendix E, which shows a non-asymptotic uniform upper bound on

$T(k,\lambda,\theta,n) = \sum_{\substack{t:\,t_i\ge1\,\forall i\\ t_1+\cdots+t_k=n}}\frac{1}{\sqrt{(2\pi)^{k-1}\, n^{k-1}\prod_{i=1}^k\frac{t_i}{n}}}\,\exp_e\!\left(-n(1+\lambda)\,D(\hat P_{y^n}\|P_{Y|V=\theta})\right).$ (184)

Finding the optimal upper bound, on the other hand, requires a uniform Laplace approximation on $S(k,\lambda,\theta,n)$, which is introduced next. First, given $\delta\in(0,1/(k-1))$, recall the set $\mathcal{M}_n$ as defined in (76), let

$\mathcal{N}^\theta_\delta = \mathcal{M}_n \cap \left\{(a_1,\dots,a_k)\in\mathbb{Z}_+^k : \left|\frac{a_i}{n\theta_i}-1\right|\le\delta\ \ \forall i\right\},$ (185)

and split $S(k,\lambda,\theta,n)$ as

$S(k,\lambda,\theta,n) = S_1(k,\lambda,\theta,n,\delta) + S_2(k,\lambda,\theta,n,\delta),$ (186)

where

$S_1(k,\lambda,\theta,n,\delta) = \sum_{\substack{t\in\mathcal{N}^\theta_\delta\\ t_i\ge1\,\forall i}}\frac{K(k,\lambda,n,t)}{\sqrt{(2\pi)^{k-1}\, n^{k-1}\prod_{i=1}^k\frac{t_i}{n}}}\,\exp_e\!\left(-n(1+\lambda)\,D(\hat P_{y^n}\|P_{Y|V=\theta})\right),$ (187)

$S_2(k,\lambda,\theta,n,\delta) = \sum_{\substack{t\notin\mathcal{N}^\theta_\delta\\ t_i\ge1\,\forall i}}\frac{K(k,\lambda,n,t)}{\sqrt{(2\pi)^{k-1}\, n^{k-1}\prod_{i=1}^k\frac{t_i}{n}}}\,\exp_e\!\left(-n(1+\lambda)\,D(\hat P_{y^n}\|P_{Y|V=\theta})\right).$ (188)
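The Laplace-approximation step that this split is designed to justify can be previewed numerically in the binary case $k=2$: dropping the $K$-factor, the sum $T(\lambda,\theta,n)$ of (131) approaches $(1+\lambda)^{-1/2}$ as $n$ grows. The following is a small illustrative script (function names are ours, not the paper's).

```python
import math

def binary_kl(p, q):
    # binary relative entropy d(p || q) in nats
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def laplace_sum(n, theta, lam):
    # sum_{t=1}^{n-1} sqrt(n / (2 pi t (n-t))) * exp(-n (1+lam) d(t/n || theta))
    total = 0.0
    for t in range(1, n):
        tau = t / n
        total += (math.sqrt(n / (2 * math.pi * t * (n - t)))
                  * math.exp(-n * (1 + lam) * binary_kl(tau, theta)))
    return total

# Laplace approximation: the sum tends to (1 + lam) ** -0.5 as n grows
val = laplace_sum(4000, 0.3, 1.0)
assert abs(val - (1 + 1.0) ** -0.5) < 0.02
```

The Gaussian concentration around $t \approx n\theta$ with variance $\theta(1-\theta)/(n(1+\lambda))$ is exactly what makes the limit $(1+\lambda)^{-1/2}$, matching (137) and (139).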
In Lemmas 12 and 13 in Appendix I, we show that the following properties hold:

$\lim_{\delta\to0}\ \limsup_{n\to\infty}\ \sup_{\substack{\theta\in\mathcal{R}_0\\ \theta_k\ge1/k}} S_1(k,\lambda,\theta,n,\delta) \le (1+\lambda)^{-\frac{k-1}{2}},$ (189)

$\lim_{n\to\infty}\ \sup_{\substack{\theta\in\mathcal{R}_0\\ \theta_k\ge1/k}} S_2(k,\lambda,\theta,n,\delta) = 0 \quad \forall\delta\in\big(0,\tfrac{1}{k-1}\big).$ (190)

Since the left side of (186) does not depend on $\delta$, (189) and (190) imply, by letting $\delta\to0$, that

$\limsup_{n\to\infty}\ \sup_{\substack{\theta\in\mathcal{R}_0\\ \theta_k\ge1/k}} S(k,\lambda,\theta,n) \le (1+\lambda)^{-\frac{k-1}{2}}.$ (191)

Finally, it follows from (172), (176), (181), and (191) that (169) holds when $\theta_k\ge1/k$, as we wanted to show.

Invoking Proposition 3, we see that for each $i = 1,\dots,k$,

$\sup_{\theta\in\mathcal{R}_i} D_\lambda\big(P_{Y^n|V=\theta}\,\|\,Q^\epsilon_{Y^n}\big) - \frac{k-1}{2}\log\frac{n}{2\pi} \le \left(\frac{c\log e}{1-\frac{c\log n}{n}} - \frac12\right)\log n + \frac{k-1}{2}\log(2\pi) + \log\frac{k}{\epsilon} + \log C(k-1),$ (192)

while, thanks to (168) and Proposition 4, it follows that

$\limsup_{n\to\infty}\Big\{\sup_{\theta\in\mathcal{R}_0} D_\lambda(P_{Y^n|V=\theta}\,\|\,Q^\epsilon_{Y^n}) - \frac{k-1}{2}\log\frac{n}{2\pi}\Big\} \le \log\frac{\Gamma^k(1/2)}{\Gamma(k/2)} - \frac{k-1}{2\lambda}\log(1+\lambda) + \log\frac{1}{1-\epsilon}.$ (193)

Since $c\in(0,1/(2\log e))$, we see that, as $n\to\infty$, the right side of (192) goes to $-\infty$ whereas the right side of (193) remains constant. In view of (156), (192) and (193), the desired result in (145) follows by choosing an arbitrarily small $\epsilon$ in (148).

APPENDIX A
EXPLICIT EVALUATION OF α-MUTUAL INFORMATION
In the case of a finite collection of arbitrary distributions, an explicit evaluation of $I_\lambda$ is provided by Sibson [16, Corollary 2.3]. A more general result that allows non-discrete alphabets can be found in [17].

Lemma 1. Let $\lambda\in(0,\infty)$. Given an arbitrary input distribution $P_V$ on $\Theta$ and a random transformation $P_{Y|V}:\Theta\to\mathcal{Y}$ with finite output alphabet $\mathcal{Y}$, the $\alpha$-mutual information of order $\lambda$ induced by $P_V$ on $P_{Y|V}$ satisfies

$\frac{\lambda}{1+\lambda}\, I_\lambda(P_V, P_{Y|V}) = \log\sum_{y\in\mathcal{Y}}\left(\int_{\theta\in\Theta} P^{1+\lambda}_{Y|V=\theta}(y)\,\mathrm{d}P_V(\theta)\right)^{\frac{1}{1+\lambda}}.$ (194)

Proof:
Define

$R_Y(y) = \dfrac{\left(\int_{\theta\in\Theta} P^{1+\lambda}_{Y|V=\theta}(y)\,\mathrm{d}P_V(\theta)\right)^{\frac{1}{1+\lambda}}}{\sum_{b\in\mathcal{Y}}\left(\int_{\xi\in\Theta} P^{1+\lambda}_{Y|V=\xi}(b)\,\mathrm{d}P_V(\xi)\right)^{\frac{1}{1+\lambda}}},$ (195)

and recall that

$D_\lambda(R_Y\,\|\,Q_Y) \ge 0$ (196)

for any distribution $Q_Y$ on $\mathcal{Y}$. Capitalizing on (196), note that

$D_\lambda(P_{Y|V}P_V\,\|\,Q_Y P_V) = \frac{1}{\lambda}\log\sum_{y\in\mathcal{Y}}\int_{\theta\in\Theta}\frac{P^{1+\lambda}_{Y|V=\theta}(y)}{Q^{\lambda}_Y(y)}\,\mathrm{d}P_V(\theta)$ (197)

$\ge \frac{1+\lambda}{\lambda}\log\sum_{b\in\mathcal{Y}}\left(\int_{\xi\in\Theta} P^{1+\lambda}_{Y|V=\xi}(b)\,\mathrm{d}P_V(\xi)\right)^{\frac{1}{1+\lambda}}$ (198)

$= D_\lambda(P_{Y|V}P_V\,\|\,R_Y P_V).$ (199)

By the definition of the $\alpha$-mutual information, see (42), (199) implies the result in (194).

APPENDIX B
MONOTONICITY OF BINARY RÉNYI DIVERGENCE

Lemma 2. Let $P_{Y|V=\theta}$ denote a Bernoulli distribution with parameter $\theta$. For any $\xi\in(0,1)$ and $\lambda\in(0,\infty)$, $D_\lambda(P_{Y|V=\theta}\,\|\,P_{Y|V=\xi})$ is a monotone decreasing function of $\theta$ on $[0,\xi]$.

Proof: Fix $\lambda\in(0,\infty)$ and let $Y\sim P_{Y|V=\theta}$. It suffices to prove that $\mathbb{E}\Big[\Big(\frac{P_{Y|V=\theta}(Y)}{P_{Y|V=\xi}(Y)}\Big)^{\lambda}\Big]$ is a monotone decreasing function of $\theta$ on $[0,\xi]$. To that end, note that

$\frac{\mathrm{d}}{\mathrm{d}\theta}\,\mathbb{E}\left[\left(\frac{P_{Y|V=\theta}(Y)}{P_{Y|V=\xi}(Y)}\right)^{\lambda}\right] = (1+\lambda)\left(\frac{\theta^\lambda}{\xi^\lambda} - \frac{(1-\theta)^\lambda}{(1-\xi)^\lambda}\right)$ (200)

$\le 0,$ (201)

where (201) follows because $\theta\in[0,\xi]$ implies

$\frac{\theta}{\xi} \le \frac{1-\theta}{1-\xi}.$ (202)

APPENDIX C
UNIFORM UPPER BOUND ON $D_\lambda\big(P_{Y^n|V=\theta}\,\|\,Q^*_{Y^n}\big)$

Lemma 3.
Let $\theta\in\Delta_{k-1}$ be an element of the $(k-1)$-dimensional simplex and assume that we are given a discrete i.i.d. model $P_{Y^n|V=\theta}$. Then, for any $n\ge1$ and $y^n\in\mathcal{Y}^n$, the relative information between the model $P_{Y^n|V=\theta}$ and Jeffreys' mixture $Q^*_{Y^n}$ satisfies the following bound:

$\imath_{P_{Y^n|V=\theta}\|Q^*_{Y^n}}(y^n) \le \frac{k-1}{2}\log n + \log C(k),$ (203)

where

$C(k) = \frac{e^{\frac{k+1}{12}}\, D_k\!\left(\frac12,\cdots,\frac12\right)}{(2\pi)^{\frac{k-1}{2}}\left(2-e^{\frac12}\right)^{k}}\left(1+\frac{k}{2}\right)^{\frac{k-1}{2}}.$ (204)

Consequently, for any $\lambda>0$,

$D_\lambda\big(P_{Y^n|V=\theta}\,\|\,Q^*_{Y^n}\big) \le \frac{k-1}{2}\log n + \log C(k),$ (205)

where $C(k)$ is given in (204).

Proof: Immediate consequence of [19, Lemma 4].

APPENDIX D
EDGE CASES OF $t_i$

Lemma 4.
Let $c\in(0,1/(2\log e))$ and, for a given positive integer $l$, let $\mathcal{I}_l = \{i_1,\dots,i_l\}$ be a proper subset of $\mathcal{Y}$. Then, for any $n$ satisfying

$\frac{k\ln n}{n} < 1,$ (206)

and $\theta\in\mathcal{R}_0$ (defined in (153)),

$V(k,\lambda,\theta,n) \le \tilde C(k,\lambda)\, n^{-(1+\lambda)c\log e + \frac{\lambda(k-1)}{2}},$ (207)

where $V(k,\lambda,\theta,n)$ is defined in (173) and $\tilde C(k,\lambda)$ is a constant that only depends on $k$ and $\lambda$. (The quantity $V(\lambda,\theta,n)$ defined in (120) corresponds to the special case of (173) where $k = 2$, $\theta = (\theta,\,1-\theta)$.)

Proof: Denote

$\{i_{l+1},i_{l+2},\dots,i_k\} = \{1,\dots,k\} - \{i_1,i_2,\dots,i_l\},$ (208)

and note that

$\sum_{\substack{t:\,t_i=0\,\forall i\in\mathcal{I}_l\\ t_1+\cdots+t_k=n}}\binom{n}{t_1\cdots t_k}\big(\theta_1^{t_1}\cdots\theta_k^{t_k}\big)^{1+\lambda}\frac{D_k^\lambda(\frac12,\dots,\frac12)}{D_k^\lambda(t_1+\frac12,\dots,t_k+\frac12)} = \sum_{\substack{t_{i_{l+1}},\dots,t_{i_k}\\ t_{i_{l+1}}+\cdots+t_{i_k}=n}}\binom{n}{t_{i_{l+1}}\cdots t_{i_k}}\big(\theta_{i_{l+1}}^{t_{i_{l+1}}}\cdots\theta_{i_k}^{t_{i_k}}\big)^{1+\lambda}\frac{D_k^\lambda(\frac12,\dots,\frac12)}{D_k^\lambda(\frac12,\dots,\frac12,\,t_{i_{l+1}}+\frac12,\dots,t_{i_k}+\frac12)}.$ (209)

Regarding the last term within the summation on the right side of (209),

$\frac{D_k(\frac12,\dots,\frac12)}{D_k(\frac12,\dots,\frac12,\,t_{i_{l+1}}+\frac12,\dots,t_{i_k}+\frac12)} = \frac{D_{k-l}(\frac12,\dots,\frac12)}{D_{k-l}(t_{i_{l+1}}+\frac12,\dots,t_{i_k}+\frac12)}\cdot\frac{\Gamma(\frac{k-l}{2})}{\Gamma(\frac k2)}\cdot\frac{\Gamma(n+\frac k2)}{\Gamma(n+\frac{k-l}{2})}$ (210)

$\le \frac{D_{k-l}(\frac12,\dots,\frac12)}{D_{k-l}(t_{i_{l+1}}+\frac12,\dots,t_{i_k}+\frac12)}\cdot\frac{\Gamma(\frac{k-1}{2})}{\Gamma(\frac k2)}\cdot\frac{\Gamma(n+\frac k2)}{\Gamma(n+\frac{k-l}{2})},$ (211)

where (210) follows from the definition of the Dirichlet integrals in (35), and (211) follows from the fact that $l\ge1$. Now, observe that

$\frac{\Gamma(n+\frac k2)}{\Gamma(n+\frac{k-l}{2})} = \frac{\big(n+\frac k2\big)^{n+\frac{k-1}{2}}\, e^{-n-\frac k2}\,(1+r_0)}{\big(n+\frac{k-l}{2}\big)^{n+\frac{k-l-1}{2}}\, e^{-n-\frac{k-l}{2}}\,(1+r_l)}$ (212)

$\le \left(1+\frac k2\right)^{\frac{k-1}{2}} e^{\frac k2}\,\frac{e^{\frac{1}{12}}}{2-e^{\frac{1}{12}}}\; n^{\frac l2},$ (213)

where $r_l$ is the remainder in Stirling's approximation of $\Gamma\big(n+\frac{k-l}{2}\big)$ in (86), and (213) is due to the following elementary bounds:

$\left(1+\frac{k}{2n}\right)^{n+\frac{k-1}{2}} \le \left(1+\frac k2\right)^{\frac{k-1}{2}} e^{\frac k2},$ (214)

$1+r_0 \le e^{\frac{1}{12}},$ (215)

$\left(1+\frac{k-l}{2n}\right)^{n+\frac{k-l-1}{2}} e^{\frac l2} \ge 1,$ (216)

$1+r_l \ge 2-e^{\frac{1}{12}}.$ (217)

It follows that

$\frac{D_k(\frac12,\dots,\frac12)}{D_k(\frac12,\dots,\frac12,\,t_{i_{l+1}}+\frac12,\dots,t_{i_k}+\frac12)} \le \frac{D_{k-l}(\frac12,\dots,\frac12)}{D_{k-l}(t_{i_{l+1}}+\frac12,\dots,t_{i_k}+\frac12)}\, C_2(k)\, n^{\frac l2},$ (218)

where

$C_2(k) = \frac{\Gamma(\frac{k-1}{2})}{\Gamma(\frac k2)}\left(1+\frac k2\right)^{\frac{k-1}{2}}\frac{e^{\frac k2+\frac{1}{12}}}{2-e^{\frac{1}{12}}}.$ (219)

Since $\theta\in\mathcal{R}_0$,

$\big(1-(\theta_{i_1}+\cdots+\theta_{i_l})\big)^n \le \left(1-\frac{lc\log n}{n}\right)^n$ (220)

$\le n^{-lc\log e},$ (221)

where (221) is because $\frac{lc\log n}{n} < \frac{k\ln n}{n} < 1$ and, for any $x<1$, we have $\log(1-x)\le -x\log e$. Let

$\bar\theta = \big(\bar\theta_{i_{l+1}},\dots,\bar\theta_{i_k}\big)$ (222)

$= \frac{\big(\theta_{i_{l+1}},\cdots,\theta_{i_k}\big)}{1-(\theta_{i_1}+\cdots+\theta_{i_l})}.$ (223)

It follows from (218) and (221) that

$\sum_{\substack{t_{i_{l+1}},\dots,t_{i_k}\\ t_{i_{l+1}}+\cdots+t_{i_k}=n}}\binom{n}{t_{i_{l+1}}\cdots t_{i_k}}\big(\theta_{i_{l+1}}^{t_{i_{l+1}}}\cdots\theta_{i_k}^{t_{i_k}}\big)^{1+\lambda}\frac{D_k^\lambda(\frac12,\dots,\frac12)}{D_k^\lambda(\frac12,\dots,\frac12,\,t_{i_{l+1}}+\frac12,\dots,t_{i_k}+\frac12)} \le \sum_{\substack{t_{i_{l+1}},\dots,t_{i_k}\\ t_{i_{l+1}}+\cdots+t_{i_k}=n}}\binom{n}{t_{i_{l+1}}\cdots t_{i_k}}\big(\bar\theta_{i_{l+1}}^{t_{i_{l+1}}}\cdots\bar\theta_{i_k}^{t_{i_k}}\big)^{1+\lambda}\frac{D_{k-l}^\lambda(\frac12,\dots,\frac12)}{D_{k-l}^\lambda(t_{i_{l+1}}+\frac12,\dots,t_{i_k}+\frac12)}\cdot\frac{C_2^\lambda(k)\, n^{\frac{\lambda l}{2}}}{n^{(1+\lambda)lc\log e}}.$ (224)

Note that

$\exp\!\left(\lambda D_\lambda\big(P_{Y^n|V=\bar\theta}\,\big\|\,Q^{*(k-l-1)}_{Y^n}\big)\right) = \sum_{\substack{t_{i_{l+1}},\dots,t_{i_k}\\ t_{i_{l+1}}+\cdots+t_{i_k}=n}}\binom{n}{t_{i_{l+1}}\cdots t_{i_k}}\big(\bar\theta_{i_{l+1}}^{t_{i_{l+1}}}\cdots\bar\theta_{i_k}^{t_{i_k}}\big)^{1+\lambda}\frac{D_{k-l}^\lambda(\frac12,\dots,\frac12)}{D_{k-l}^\lambda(t_{i_{l+1}}+\frac12,\dots,t_{i_k}+\frac12)},$ (225)

where $Q^{*(k-l-1)}_{Y^n}$ denotes the Jeffreys' mixture when the underlying parameter space is the $(k-l-1)$-dimensional simplex. Using the uniform upper bound on Rényi divergence in Lemma 3, we get

$\exp\!\left(\lambda D_\lambda\big(P_{Y^n|V=\bar\theta}\,\big\|\,Q^{*(k-l-1)}_{Y^n}\big)\right) \le C^\lambda(k-l)\, n^{\frac{\lambda(k-l-1)}{2}},$ (226)

where $C(\cdot)$ is as defined in (204). Since $l\in\{1,\dots,k-1\}$, $D_1(\frac12) = 1$, and $D_m(\frac12,\dots,\frac12)\le 2\pi^{\frac m2}$ for any integer $m\ge1$, we can upper bound

$C(k-l) \le \frac{2\pi^{\frac k2}\, e^{\frac{k}{12}}}{2-e^{\frac12}}\left(1+\frac k2\right)^{\frac{k-1}{2}}$ (227)

$= C_1(k).$ (228)

As a result,

$V(k,\lambda,\theta,n) \le \sum_{l=1}^{k-1}\binom{k}{l}\, C_2^\lambda(k)\, C_1^\lambda(k)\, n^{-(\lambda+1)c\log e + \frac{\lambda(k-1)}{2}}$ (229)

$= (2^k-2)\, C_2^\lambda(k)\, C_1^\lambda(k)\, n^{-(\lambda+1)c\log e + \frac{\lambda(k-1)}{2}},$ (230)

and (207) follows after setting

$\tilde C(k,\lambda) = (2^k-2)\, C_2^\lambda(k)\, C_1^\lambda(k).$ (231)

APPENDIX E
UNIFORM UPPER BOUND ON $T(k,\lambda,\theta,n)$

The quantity defined in (184) satisfies the following upper bound. (The quantity $T(\lambda,\theta,n)$ defined in (131) corresponds to the special case of (184) where $k = 2$, $\theta = (\theta,\,1-\theta)$.)

Lemma 5.

$T(k,\lambda,\theta,n) \le C^\lambda(k)\,(2\pi)^{\frac{\lambda(k-1)}{2}}\, D_k^{-\lambda}\!\left(\tfrac12,\dots,\tfrac12\right)\frac{e^{\frac{k(20\lambda+3)}{36}}}{\big(2-e^{\frac{1}{12+6k}}\big)^{\lambda}},$ (232)

where $C(k)$ is explicitly given in (204).

Proof: Define

$\tilde S(k,\lambda,\theta,n) = \sum_{\substack{t:\,t_i\ge1\,\forall i\\ t_1+\cdots+t_k=n}}\frac{\tilde K(k,\lambda,n,t)}{\sqrt{(2\pi)^{k-1}\, n^{k-1}\prod_{i=1}^k\frac{t_i}{n}}}\,\exp_e\!\left(-n(1+\lambda)\,D(\hat P_{y^n}\|P_{Y|V=\theta})\right),$ (233)

where

$\tilde K(k,\lambda,n,t) = e^{-\frac{1}{12n+1}}\left[\frac{\big(1+\frac{k}{2n}\big)^{n+\frac{k-1}{2}}}{\prod_{i=1}^k\big(1+\frac{1}{2t_i}\big)^{t_i}}\right]^{\lambda}\left[\frac{2-e^{\frac{1}{12n+6k}}}{\prod_{i=1}^k e^{\frac{1}{12t_i+6}}}\right]^{\lambda}.$ (234)

Note that

$D_\lambda(P_{Y^n|V=\theta}\,\|\,Q^*_{Y^n}) \ge \frac{1}{\lambda}\log W(k,\lambda,\theta,n)$ (235)

$\ge \frac{1}{\lambda}\log\left(\left(\frac{n}{2\pi}\right)^{\frac{\lambda(k-1)}{2}} D_k^\lambda\!\left(\tfrac12,\dots,\tfrac12\right)\tilde S(k,\lambda,\theta,n)\right),$ (236)

where (235) follows from (172), and (236) follows from Stirling's approximations, (82) and (86), as well as the fact that

$\prod_{i=1}^k \theta_i^{t_i} = \exp_e\!\left(-n\big[D(\hat P_{y^n}\|P_{Y|V=\theta}) + H(\hat P_{y^n})\big]\right).$ (237)

Regarding $\tilde K(k,\lambda,n,t)$, one can check that

$\tilde K(k,\lambda,n,t) \ge \frac{\big(2-e^{\frac{1}{12+6k}}\big)^{\lambda}}{e^{\frac{k(20\lambda+3)}{36}}}.$ (238)

Invoking Lemma 3 in Appendix C to upper bound the left side of (235) and applying the bound in (238) to (236) results in (232).

APPENDIX F
BOUNDS ON $K(k,\lambda,n,t)$

The quantity defined in (183) satisfies the following non-asymptotic bound.

Lemma 6
(Uniform Upper Bound on $K(k,\lambda,n,t)$). Given $\lambda\in(0,\infty)$,

$K(k,\lambda,n,t) \le e^{\frac{1}{12}}\left[\frac{e^{\frac k2}\big(1+\frac k2\big)^{\frac{k-1}{2}}\, e^{\frac{1}{12+6k}}}{\big(\frac32\big(2-e^{\frac{1}{18}}\big)\big)^{k}}\right]^{\lambda}$ (239)

$= M(k,\lambda).$ (240)

In particular,

$K(\lambda,n,t) \le M(2,\lambda)$ (241)

$\le 2^\lambda e^{\frac{1}{12}}.$ (242)

Proof: For $x\ge1$,

$\left(1+\frac{1}{2x}\right)^{x}\left(2-e^{\frac{1}{12x+6}}\right) \ge \frac32\left(2-e^{\frac{1}{18}}\right),$ (243)

because the function on the left side of (243) is an increasing function. On the other hand,

$\left(1+\frac{k}{2x}\right)^{\frac{k-1}{2}} e^{\frac{1}{12x+6k}} \le \left(1+\frac k2\right)^{\frac{k-1}{2}} e^{\frac{1}{12+6k}},$ (244)

because the function on the left side of (244) is a decreasing function. Finally, (239) follows from the fact that $\lambda\ge0$ and $e^{\frac k2}\ge\big(1+\frac{k}{2n}\big)^n$. (The quantity $K(\lambda,n,t)$ defined in (130) corresponds to the special case of (183) where $k = 2$, $t = (t,\,n-t)$.)

Lemma 7 (Asymptotic Upper Bound on $K(k,\lambda,n,t)$). Let $c\in(0,1/(2\log e))$ and $\delta\in(0,1/(k-1))$ be fixed and let $n>1$ be an integer. Assume that $\theta\in\mathcal{R}_0$ (defined in (153)) satisfies $\theta_k\ge1/k$. If, for $i\in\{1,\dots,k-1\}$,

$n(1-\delta)\theta_i \le t_i \le n(1+\delta)\theta_i,$ (245)

then

$K(k,\lambda,n,t) \le K\big(k,\lambda,n,\, c(1-\delta)u\big)$ (246)

$= M(k,\lambda,n,c,\delta),$ (247)

where in (246), $u = (u_1,\dots,u_k)\in\mathbb{R}^k$ satisfies

$u = \left(\log n,\,\dots,\,\log n,\ \frac{(1-(k-1)\delta)\,n}{c(1-\delta)\,k}\right).$ (248)

Furthermore,

$\lim_{n\to\infty} M(k,\lambda,n,c,\delta) = 1.$ (249)

Proof: Note that since $\theta\in\mathcal{R}_0$, for $i\in\{1,\dots,k\}$,

$t_i \ge c(1-\delta)u_i$ (250)

$= v_i,$ (251)

which, in turn, implies that

$\left(1+\frac{1}{2t_i}\right)^{t_i} \ge \left(1+\frac{1}{2v_i}\right)^{v_i},$ (252)

$e^{\frac{1}{12t_i+6}} \le e^{\frac{1}{12v_i+6}}.$ (253)

Hence, inequality (246) follows. It is straightforward to see the limit in (249).

APPENDIX G
LEMMAS FOR THE PROOF IN SECTION
IV-C

In the proofs of Lemmas 8, 9 and 10, we use the following bound: for $\theta\in(0,\frac12]$ and $\delta\in(0,1)$,

$|\tau-\theta| \le \delta\theta \implies d(\tau\|\theta) \ge \frac{(1-\delta)(\tau-\theta)^2}{2\theta(1-\theta)},$ (254)

in nats. In particular, when $0<\tau\le\theta\le\frac12$,

$d(\tau\|\theta) \ge \frac{(\tau-\theta)^2}{2\theta(1-\theta)}.$ (255)

To show (254) and (255), we rely on Taylor's theorem:

$d(\tau\|\theta) = \frac{(\tau-\theta)^2}{2\theta(1-\theta)} + \frac{2\alpha-1}{6\alpha^2(1-\alpha)^2}\,(\tau-\theta)^3,$ (256)

for some $\alpha$ in between $\tau$ and $\theta$.

Lemma 8.
Let $c\in(0,1/(2\log e))$ and fix $\delta\in(0,1)$. Then

$\lim_{n\to\infty}\ \sup_{\theta\in[\frac{c\log n}{n},\frac12]} S_1(\lambda,\theta,n,\delta) = 0,$ (257)

where $S_1(\lambda,\theta,n,\delta)$ is defined in (133).

Proof: Assume that $n$ is a sufficiently large integer and let $\theta\in\big[\frac{c\log n}{n},\frac12\big]$ be given. Then

$S_1(\lambda,\theta,n,\delta) \le \sum_{t=1}^{\lfloor n(1-\delta)\theta\rfloor} 2^\lambda e^{\frac{1}{12}}\sqrt{\frac{n}{2\pi t(n-t)}}\,\exp_e\!\left(-\frac{n(1+\lambda)\delta^2\theta}{2}\right)$ (258)

$\le n(1-\delta)\theta\ 2^\lambda e^{\frac{1}{12}}\sqrt{\frac{n}{2\pi(n-1)}}\,\exp_e\!\left(-\frac{n(1+\lambda)\delta^2\theta}{2}\right),$ (259)

where (258) is due to (255), the uniform upper bound on $K(\lambda,n,t)$ given in Lemma 6 in Appendix F, and the fact that $\big(\frac tn-\theta\big)^2\ge\delta^2\theta^2$; (259) follows because, for $1\le t\le\lfloor n(1-\delta)\theta\rfloor$,

$t(n-t) \ge n-1.$ (260)

Since the supremum in

$\sup_{\theta\in[\frac{c\log n}{n},\frac12]}\ n(1-\delta)\theta\ 2^\lambda e^{\frac{1}{12}}\sqrt{\frac{n}{2\pi(n-1)}}\,\exp_e\!\left(-\frac{n(1+\lambda)\delta^2\theta}{2}\right)$

is attained at $\theta = \frac{c\log n}{n}$, it follows that (257) holds.

Lemma 9.
Let $c\in(0,1/(2\log e))$. Then

$\lim_{\delta\to0}\ \limsup_{n\to\infty}\ \sup_{\theta\in[\frac{c\log n}{n},\frac12]} S_2(\lambda,\theta,n,\delta) \le (1+\lambda)^{-\frac12},$ (261)

where $S_2(\lambda,\theta,n,\delta)$ is defined in (134).

Proof: Assume that $n$ is a sufficiently large integer, let $\theta\in\big[\frac{c\log n}{n},\frac12\big]$ be given and define

$\sigma_n = \sqrt{\frac{\theta(1-\theta)}{n(1+\lambda)(1-\delta)}}.$ (262)

We have

$S_2(\lambda,\theta,n,\delta) \le \frac{M(2,\lambda,n,c,\delta)}{(1-\delta)\sqrt{1+\lambda}}\sqrt{\frac{1-\theta}{1-(1+\delta)\theta}}\ \sum_{t=\lceil n(1-\delta)\theta\rceil}^{\lfloor n(1+\delta)\theta\rfloor}\frac{1}{n\sqrt{2\pi}\,\sigma_n}\,\exp_e\!\left(-\frac{\big(\frac tn-\theta\big)^2}{2\sigma_n^2}\right)$ (263)

$\le \frac{M(2,\lambda,n,c,\delta)}{\sqrt{(1-\delta)^3(1+\lambda)}}\ \sum_{t=\lceil n(1-\delta)\theta\rceil}^{\lfloor n(1+\delta)\theta\rfloor}\frac{1}{n\sqrt{2\pi}\,\sigma_n}\,\exp_e\!\left(-\frac{\big(\frac tn-\theta\big)^2}{2\sigma_n^2}\right),$ (264)

where (263) is due to (254), the bound on $K(\lambda,n,t)$ for the given range of $t$ (see Lemma 7 in Appendix F), and the fact that, for $\lceil n(1-\delta)\theta\rceil\le t\le\lfloor n(1+\delta)\theta\rfloor$,

$\sqrt{t(n-t)} \ge n\sqrt{(1-\delta)\theta\big(1-(1+\delta)\theta\big)};$ (265)

(264) follows because, for $\theta\in\big[\frac{c\log n}{n},\frac12\big]$,

$\sqrt{\frac{1-\theta}{1-(1+\delta)\theta}} = \sqrt{1+\frac{\delta\theta}{1-(1+\delta)\theta}}$ (266)

$\le \frac{1}{\sqrt{1-\delta}}.$ (267)

In light of Lemma 7 in Appendix F,

$\lim_{n\to\infty} M(2,\lambda,n,c,\delta) = 1.$ (268)

Moreover, the Riemann sum in (264) can be upper bounded as

$\limsup_{n\to\infty}\ \sum_{t=\lceil n(1-\delta)\theta\rceil}^{\lfloor n(1+\delta)\theta\rfloor}\frac{1}{n\sqrt{2\pi}\,\sigma_n}\,\exp_e\!\left(-\frac{\big(\frac tn-\theta\big)^2}{2\sigma_n^2}\right) \le 1.$ (269)

It follows that (261) holds.

Lemma 10.
Let $c\in(0,1/(2\log e))$ and fix $\delta\in(0,1)$. Then

$\lim_{n\to\infty}\ \sup_{\theta\in[\frac{c\log n}{n},\frac12]} S_3(\lambda,\theta,n,\delta) = 0,$ (270)

where $S_3(\lambda,\theta,n,\delta)$ is defined in (135).

Proof: The proof of this lemma is more involved than that of Lemma 8. To proceed, using Pinsker's inequality (e.g., [33, Ex. 3.18]), namely

$d(\tau\|\theta) \ge 2(\tau-\theta)^2,$ (271)

we first prove that

$\lim_{n\to\infty}\ \sup_{\theta\in[n^{-\beta},\frac12]} S_3(\lambda,\theta,n,\delta) = 0,$ (272)

where $\beta\in(0,\frac12)$ is a fixed constant. Then, we show that

$\lim_{n\to\infty}\ \sup_{\theta\in[\frac{c\log n}{n},\,n^{-\beta}]} S_3(\lambda,\theta,n,\delta) = 0,$ (273)

with the help of Lemma 11 in Appendix H. Fix a constant $\beta\in(0,\frac12)$, and assume that $n$ is a sufficiently large integer. First, let $\theta\in[n^{-\beta},\frac12]$ be arbitrary and note that

$S_3(\lambda,\theta,n,\delta) \le \sum_{t=\lceil n(1+\delta)\theta\rceil}^{n-1} 2^\lambda e^{\frac{1}{12}}\sqrt{\frac{n}{2\pi t(n-t)}}\,\exp_e\!\left(-2n(1+\lambda)\delta^2\theta^2\right)$ (274)

$\le 2^\lambda e^{\frac{1}{12}}\, n\sqrt{\frac{n}{2\pi(n-1)}}\,\exp_e\!\left(-2(1+\lambda)\delta^2 n^{1-2\beta}\right),$ (275)

where (274) follows from Lemma 6 in Appendix F, Pinsker's inequality as in (271), and the fact that $\big(\frac tn-\theta\big)^2\ge\delta^2\theta^2$; (275) follows because $\theta\ge n^{-\beta}$ and, for $\lceil n(1+\delta)\theta\rceil\le t\le n-1$,

$t(n-t) \ge n-1.$ (276)

Thus, (275) implies that

$\sup_{\theta\in[n^{-\beta},\frac12]} S_3(\lambda,\theta,n,\delta) \le 2^\lambda e^{\frac{1}{12}}\, n\sqrt{\frac{n}{2\pi(n-1)}}\,\exp_e\!\left(-2(1+\lambda)\delta^2 n^{1-2\beta}\right).$ (277)

Since $\beta<\frac12$,

$\lim_{n\to\infty} 2^\lambda e^{\frac{1}{12}}\, n\sqrt{\frac{n}{2\pi(n-1)}}\,\exp_e\!\left(-2(1+\lambda)\delta^2 n^{1-2\beta}\right) = 0,$ (278)

and it follows that (272) holds. Second, let $\theta\in\big[\frac{c\log n}{n},\,n^{-\beta}\big]$ be arbitrary and fix some constant $\kappa\in\big(0,\frac12\big)$. Further, separate $S_3(\lambda,\theta,n,\delta)$ into two sums as follows:

$S_3(\lambda,\theta,n,\delta) = \tilde S_1(\kappa,\lambda,\theta,n,\delta) + \tilde S_2(\kappa,\lambda,\theta,n,\delta),$ (279)

where

$\tilde S_1(\kappa,\lambda,\theta,n,\delta) = \sum_{t=\lceil n\kappa\rceil}^{n-1}\sqrt{\frac{n}{2\pi t(n-t)}}\,\exp_e\!\left(-n(1+\lambda)\,d\big(\tfrac tn\big\|\theta\big)\right) K(\lambda,n,t),$ (280)

$\tilde S_2(\kappa,\lambda,\theta,n,\delta) = \sum_{t=\lceil n(1+\delta)\theta\rceil}^{\lfloor n\kappa\rfloor}\sqrt{\frac{n}{2\pi t(n-t)}}\,\exp_e\!\left(-n(1+\lambda)\,d\big(\tfrac tn\big\|\theta\big)\right) K(\lambda,n,t).$ (281)
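Pinsker's inequality as used in (271) is stated for the binary divergence in nats; it can be checked by brute force over a grid (illustrative code, not part of the proof).

```python
import math

def binary_kl(p, q):
    # binary relative entropy d(p || q) in nats, with 0 log 0 = 0
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)

# Pinsker's inequality for the binary divergence:
#   d(tau || theta) >= 2 (tau - theta)^2   (nats)
for i in range(1, 100):
    for j in range(1, 100):
        tau, theta = i / 100, j / 100
        assert binary_kl(tau, theta) >= 2 * (tau - theta) ** 2 - 1e-12
```

The small slack `1e-12` only guards against floating-point rounding; the inequality itself is tight at $\tau=\theta$.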
Regarding $\tilde S_1(\kappa,\lambda,\theta,n,\delta)$, we have

$\tilde S_1(\kappa,\lambda,\theta,n,\delta) \le \sum_{t=\lceil n\kappa\rceil}^{n-1}\sqrt{\frac{n}{2\pi t(n-t)}}\,\exp_e\!\left(-2n(1+\lambda)\big(\tfrac tn-\theta\big)^2\right) 2^\lambda e^{\frac{1}{12}}$ (282)

$\le \frac{n}{\sqrt{2\pi\kappa}}\,\exp_e\!\left(-2n(1+\lambda)\big(\kappa-n^{-\beta}\big)^2\right) 2^\lambda e^{\frac{1}{12}},$ (283)

where (282) follows from Lemma 6 in Appendix F and (271); (283) follows because $\frac tn-\theta\ge\kappa-n^{-\beta}$ and $\sqrt{t(n-t)}\ge\sqrt{n\kappa}$ for $\lceil n\kappa\rceil\le t\le n-1$ and $\frac{c\log n}{n}\le\theta\le n^{-\beta}$. Hence,

$\sup_{\theta\in[\frac{c\log n}{n},\,n^{-\beta}]}\tilde S_1(\kappa,\lambda,\theta,n,\delta) \le \frac{n}{\sqrt{2\pi\kappa}}\,\exp_e\!\left(-2n(1+\lambda)\big(\kappa-n^{-\beta}\big)^2\right) 2^\lambda e^{\frac{1}{12}},$ (284)

and

$\lim_{n\to\infty}\ \sup_{\theta\in[\frac{c\log n}{n},\,n^{-\beta}]}\tilde S_1(\kappa,\lambda,\theta,n,\delta) = 0.$ (285)

Regarding $\tilde S_2(\kappa,\lambda,\theta,n,\delta)$, we have

$\frac{\sqrt{2\pi}}{2^\lambda e^{\frac{1}{12}}}\,\tilde S_2(\kappa,\lambda,\theta,n,\delta) \le \sum_{t=\lceil n(1+\delta)\theta\rceil}^{\lfloor n\kappa\rfloor}\frac{\sqrt n}{\sqrt{t(n-t)}}\,\exp_e\!\left(-n(1+\lambda)\,d\big(\tfrac tn\big\|\theta\big)\right),$ (286)

where (286) follows from Lemma 6 in Appendix F. Let $\theta^*\in\big[\frac{c\log n}{n},\,n^{-\beta}\big]$ be the maximizer of the right side of (286). Note that

$\limsup_{n\to\infty}\ \sum_{t=\lceil n(1+\delta)\theta^*\rceil}^{\lfloor n\kappa\rfloor}\frac{\sqrt n}{\sqrt{t(n-t)}}\, e^{-n(1+\lambda)d(\frac tn\|\theta^*)} \le \limsup_{n\to\infty}\ \int_{(1+\delta)\theta^*}^{\kappa}\frac{\sqrt n}{\sqrt{\tau(1-\tau)}}\, e^{-n(1+\lambda)d(\tau\|\theta^*)}\,\mathrm{d}\tau$ (287)

$\le \limsup_{n\to\infty}\ \frac{(1+\lambda)^{-1}\ln^{-1}(1+\delta)}{\sqrt{n(1+\delta)\theta^*\big(1-(1+\delta)\theta^*\big)}}$ (288)

$\le \limsup_{n\to\infty}\ \frac{(1+\lambda)^{-1}\ln^{-1}(1+\delta)}{\sqrt{(1+\delta)\, c\log n\,\big(1-(1+\delta)n^{-\beta}\big)}}$ (289)

$= 0,$ (290)

where (287) follows after noticing that, for any $\theta\in\big[\frac{c\log n}{n},\,n^{-\beta}\big]$, the function

$g_\theta(x) = \frac{1}{\sqrt{x(1-x)}}\, e^{-n(1+\lambda)d(x\|\theta)}$ (291)

is a decreasing function in $x\in((1+\delta)\theta,\kappa)$ and therefore the corresponding Riemann sum on the left side of (287) can be upper bounded by the integral on its right side, and (288) follows from Lemma 11 in Appendix H. Hence,

$\lim_{n\to\infty}\ \sup_{\theta\in[\frac{c\log n}{n},\,n^{-\beta}]}\tilde S_2(\kappa,\lambda,\theta,n,\delta) = 0.$ (292)

As a result of (285) and (292), (273) holds. The desired result follows since we have established (272) and (273).

APPENDIX H
UPPER BOUND FOR THE INTEGRAL IN (287)
Lemma 11.
Let $c\in(0,1/(2\log e))$ and $\lambda\in(0,\infty)$. Fix $\beta\in(0,\frac12)$, $\delta\in(0,1)$ and $\kappa\in(0,\frac12)$. For any $\theta\in\big[\frac{c\log n}{n},\,n^{-\beta}\big]$,

$\int_{(1+\delta)\theta}^{\kappa}\frac{n(1+\lambda)}{\sqrt{\tau(1-\tau)}}\, e^{-n(1+\lambda)d(\tau\|\theta)}\,\mathrm{d}\tau \le \frac{\ln^{-1}(1+\delta)}{\sqrt{(1+\delta)\theta\big(1-(1+\delta)\theta\big)}}.$ (293)

Proof: Abbreviate

$a_n = n(1+\lambda),$ (294)

$\varphi(\tau) = \frac{1}{\sqrt{\tau(1-\tau)}},$ (295)

$\phi(\tau) = d(\tau\|\theta).$ (296)

Applying integration by parts yields

$\int a_n\,\varphi(\tau)\, e^{-a_n\phi(\tau)}\,\mathrm{d}\tau = -\frac{\varphi(\tau)}{\phi'(\tau)}\, e^{-a_n\phi(\tau)} + \int e^{-a_n\phi(\tau)}\,\frac{\mathrm{d}}{\mathrm{d}\tau}\!\left(\frac{\varphi(\tau)}{\phi'(\tau)}\right)\mathrm{d}\tau.$ (297)

For $\tau\in[(1+\delta)\theta,\,\kappa]$, we have

$\frac{\mathrm{d}}{\mathrm{d}\tau}\!\left(\frac{\varphi(\tau)}{\phi'(\tau)}\right) \le 0,$ (298)

because $\varphi(\tau)$ is a decreasing function and $\phi(\tau)$ is an increasing convex function for the given range of $\tau$. Hence, we see that

$\int_{(1+\delta)\theta}^{\kappa} a_n\,\varphi(\tau)\, e^{-a_n\phi(\tau)}\,\mathrm{d}\tau \le \frac{\varphi(\tau)}{\phi'(\tau)}\, e^{-a_n\phi(\tau)}\bigg|_{\tau=\kappa}^{(1+\delta)\theta}$ (299)

$\le \frac{\varphi((1+\delta)\theta)}{\ln(1+\delta)},$ (300)

where (300) follows because $\kappa\le\frac12$ implies

$\frac{\varphi(\tau)}{\phi'(\tau)}\, e^{-a_n\phi(\tau)}\bigg|_{\tau=\kappa} \ge 0,$ (301)

and

$\phi'(\tau)\, e^{a_n\phi(\tau)}\bigg|_{\tau=(1+\delta)\theta} \ge \ln(1+\delta).$ (302)

APPENDIX I
LEMMAS FOR THE PROOF IN SECTION
IV-D

In the proofs of Lemmas 12 and 13, we use the following bound: for $\theta_k\ge1/k$ and $\delta\in(0,1/(k-1))$,

$|\tau_i-\theta_i|\le\delta\theta_i \text{ for } i = 1,\dots,k-1 \implies D(\tau\|\theta) \ge \frac12\big(1-(k-1)\delta\big)\,(\tau'-\theta')^{\mathsf T} J(\theta,P_{Y|V})\,(\tau'-\theta'),$ (303)

where $J(\theta,P_{Y|V})$ denotes the Fisher information matrix, and

$\tau' = (\tau_1,\dots,\tau_{k-1}),$ (304)

$\theta' = (\theta_1,\dots,\theta_{k-1}).$ (305)

To show (303), we rely on Taylor's theorem:

$D(\tau\|\theta) = \sum_{i=1}^k\left(\frac{(\tau_i-\theta_i)^2}{2\theta_i} - \frac{(\tau_i-\theta_i)^3}{6\alpha_i^2}\right),$ (306)

for some $\alpha = (\alpha_1,\dots,\alpha_k)\in\Delta_{k-1}$ such that $\alpha_i$ lies between $\tau_i$ and $\theta_i$.

Lemma 12.
The function defined in (187) satisfies

$\lim_{\delta\to0}\ \limsup_{n\to\infty}\ \sup_{\substack{\theta\in\mathcal{R}_0\\ \theta_k\ge1/k}} S_1(k,\lambda,\theta,n,\delta) \le (1+\lambda)^{-\frac{k-1}{2}}.$ (307)

Proof: Assume that $n$ is a sufficiently large integer, and let $\theta\in\mathcal{R}_0$ with $\theta_k\ge1/k$ be given. Define

$\Sigma_n = \frac{J^{-1}(\theta,P_{Y|V})}{n(1+\lambda)\big(1-(k-1)\delta\big)}.$ (308)

We invoke (303) with

$\tau' \leftarrow \left(\frac{t_1}{n},\dots,\frac{t_{k-1}}{n}\right).$ (309)

Hence,

$S_1(k,\lambda,\theta,n,\delta) \le \frac{M(k,\lambda,n,c,\delta)}{\sqrt{(1-\delta)^{k-1}\big(1-(k-1)\delta\big)^{k-1}}}\sqrt{\frac{\theta_k\,(1+\lambda)^{-(k-1)}}{(1+\delta)\theta_k-\delta}}\ \sum_{\substack{t\in\mathcal{N}^\theta_\delta\\ t_i\ge1\,\forall i}}\frac{\exp_e\!\left(-\frac12(\tau'-\theta')^{\mathsf T}\Sigma_n^{-1}(\tau'-\theta')\right)}{n^{k-1}\sqrt{(2\pi)^{k-1}|\Sigma_n|}}$ (310)

$\le (1+\lambda)^{-\frac{k-1}{2}}\,\frac{M(k,\lambda,n,c,\delta)}{\sqrt{(1-\delta)^{k-1}\big(1-(k-1)\delta\big)^{k}}}\ \sum_{\substack{t\in\mathcal{N}^\theta_\delta\\ t_i\ge1\,\forall i}}\frac{\exp_e\!\left(-\frac12(\tau'-\theta')^{\mathsf T}\Sigma_n^{-1}(\tau'-\theta')\right)}{n^{k-1}\sqrt{(2\pi)^{k-1}|\Sigma_n|}},$ (311)

where (310) is due to (303), the bound on $K(k,\lambda,n,t)$ when $t\in\mathcal{N}^\theta_\delta$ (see Lemma 7 in Appendix F), and the fact that, for $t\in\mathcal{N}^\theta_\delta$,

$\prod_{i=1}^k t_i \ge n^k\,(1-\delta)^{k-1}\,\theta_1\cdots\theta_{k-1}\,\big((1+\delta)\theta_k-\delta\big);$ (312)

(311) follows because $\theta_k\ge1/k$ implies

$\frac{\theta_k}{(1+\delta)\theta_k-\delta} \le \frac{1}{1-(k-1)\delta}.$ (313)

In light of Lemma 7 in Appendix F,

$\lim_{n\to\infty} M(k,\lambda,n,c,\delta) = 1.$ (314)

Since the multi-variable Riemann sum in (311) can be upper bounded as

$\limsup_{n\to\infty}\ \sum_{\substack{t\in\mathcal{N}^\theta_\delta\\ t_i\ge1\,\forall i}}\frac{\exp_e\!\left(-\frac12(\tau'-\theta')^{\mathsf T}\Sigma_n^{-1}(\tau'-\theta')\right)}{n^{k-1}\sqrt{(2\pi)^{k-1}|\Sigma_n|}} \le 1,$ (315)

we can conclude that (307) holds.

Lemma 13.
The function defined in (188) satisfies

$\lim_{n\to\infty}\ \sup_{\substack{\theta\in\mathcal{R}_0\\ \theta_k\ge1/k}} S_2(k,\lambda,\theta,n,\delta) = 0.$ (316)

Proof: Assume that $n$ is a sufficiently large integer, and let $\theta\in\mathcal{R}_0$ with $\theta_k\ge1/k$ be given. Recall the definition of $\mathcal{N}^\theta_\delta$ in (185), and note that if

$t \notin \mathcal{N}^\theta_\delta,$ (317)

then there must exist $i\in\{1,\dots,k-1\}$ such that

$t_i \notin I_{\delta,\theta_i,n} = \big[\lceil n(1-\delta)\theta_i\rceil,\ \lfloor n(1+\delta)\theta_i\rfloor\big].$ (318)

Moreover, by symmetry, we can write

$S_2(k,\lambda,\theta,n,\delta) = \sum_{\substack{1\le t_1\le n-k+1\\ t_1\notin I_{\delta,\theta_1,n}}}\ \sum_{\substack{t_2,\dots,t_k\ge1\\ t_2+\cdots+t_k=n-t_1}}\frac{K(k,\lambda,n,t)}{\sqrt{(2\pi)^{k-1}\, n^{k-1}\prod_{i=1}^k\frac{t_i}{n}}}\,\exp_e\!\left(-n(1+\lambda)\,D(\hat P_{y^n}\|P_{Y|V=\theta})\right)$ (319)

$\le \sum_{\substack{1\le t_1\le n-(k-1)\\ t_1\notin I_{\delta,\theta_1,n}}}\ \sum_{\substack{t_2,\dots,t_k\ge1\\ t_2+\cdots+t_k=n-t_1}}\frac{M(k,\lambda)}{\sqrt{(2\pi)^{k-1}\, n^{k-1}\prod_{i=1}^k\frac{t_i}{n}}}\,\exp_e\!\left(-n(1+\lambda)\,D(\hat P_{y^n}\|P_{Y|V=\theta})\right)$ (320)

$= \sum_{\substack{1\le t_1\le n-(k-1)\\ t_1\notin I_{\delta,\theta_1,n}}}\sqrt{\frac{n}{2\pi t_1(n-t_1)}}\,\exp_e\!\left(-n(1+\lambda)\,d\big(\tfrac{t_1}{n}\big\|\theta_1\big)\right) M(k,\lambda)\, T(k-1,\lambda,\theta',n-t_1),$ (321)

where (320) is due to the uniform upper bound on $K(k,\lambda,n,t)$ in Lemma 6; in (321), $\theta' = \big(\frac{\theta_2}{1-\theta_1},\cdots,\frac{\theta_k}{1-\theta_1}\big)$ and the function denoted by $T(k,\lambda,\theta,n)$ is defined in (184). By invoking Lemma 5 in Appendix E, we see that $T(k-1,\lambda,\theta',n-t_1)$ can be upper bounded by a constant depending only on $\lambda$ and $k$. On the other hand, the sum without the factor $T$ vanishes as $n\to\infty$ (see Lemmas 8 and 10). Therefore, (316) follows.

APPENDIX J
JEFFREYS' MIXTURE IS NOT MINIMAX
The fact that Jeffreys’ prior is capacity achieving (or leastfavorable) follows from the converse proof of Theorem 2.Therefore, Jeffreys’ mixture is maximin for R´enyi redundancy.Parallel to the results in [19] and [6], Lemma 14 below provesthat Jeffreys’ mixture is not minimax.
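Recall from (38) that Jeffreys' mixture assigns $Q^*_{Y^n}(y^n) = D_k\big(t_1+\frac12,\dots,t_k+\frac12\big)/D_k\big(\frac12,\dots,\frac12\big)$ to a sequence with symbol counts $t_1,\dots,t_k$. As a quick numerical sanity check (illustrative snippet with helper names of our choosing), the mixture is a bona fide probability distribution on $\mathcal{Y}^n$:

```python
import math
from itertools import product

def dirichlet_integral(alphas):
    # D_k(a_1, ..., a_k) = prod_i Gamma(a_i) / Gamma(sum_i a_i)
    return math.prod(math.gamma(a) for a in alphas) / math.gamma(sum(alphas))

def jeffreys_mixture(counts):
    # Q*(y^n) = D_k(t_1 + 1/2, ..., t_k + 1/2) / D_k(1/2, ..., 1/2)
    k = len(counts)
    return (dirichlet_integral([t + 0.5 for t in counts])
            / dirichlet_integral([0.5] * k))

# summing Q* over all sequences in Y^n gives 1 (here k = 3 symbols, n = 5)
k, n = 3, 5
total = sum(jeffreys_mixture([y_n.count(a) for a in range(k)])
            for y_n in product(range(k), repeat=n))
assert abs(total - 1.0) < 1e-9
```

This is the Dirichlet-$\big(\frac12,\dots,\frac12\big)$ (Krichevsky–Trofimov-type) mixture: maximin for the Rényi redundancy, but, as Lemma 14 below shows, not minimax.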
Lemma 14.
For any $l\in\{1,\dots,k-1\}$,

$\liminf_{n\to\infty}\Big\{\sup_{\theta} D_\lambda(P_{Y^n|V=\theta}\,\|\,Q^*_{Y^n}) - \frac{k-1}{2}\log\frac{n}{2\pi}\Big\} \ge \log\frac{\Gamma^k(1/2)}{\Gamma(k/2)} - \frac{k-1}{2\lambda}\log(1+\lambda) + \frac{k-l}{2}\left(\log2 + \frac{\log(1+\lambda)}{\lambda}\right),$ (322)

where the supremization is over all $\theta\in\Delta_{k-1}$ that are on the face of the simplex such that at most $l$ of its components are known to be non-zero.

Note that the third term on the right side of (322) interpolates between the extra constants $\frac{k-l}{2}\log(2e)$ when $\lambda\to0$ and $\frac{k-l}{2}\log2$ when $\lambda\to\infty$, shown in [19] and [6], respectively.

Proof:
Assuming, without loss of generality, that the last k − l entries of θ are equal to zero simplifies the notation; otherwise, the proof remains identical. Define

\[
\bar{\theta} = (\theta_1, \dots, \theta_l) \in \Delta_{l-1}, \quad (323)
\]
\[
L(k,l,n) = \left(1+\frac{k}{2n}\right)^{n+\frac{k-1}{2}} \left(1+\frac{l}{2n}\right)^{-\left(n+\frac{l-1}{2}\right)} e^{\frac{l-k}{2}}\, e^{\frac{1}{12n+6k+1}-\frac{1}{12n+6l}}, \quad (324)
\]

where θ_i denotes the i-th entry of θ. Note that

\[
D_\lambda\big(P_{Y^n|V=\theta}\,\big\|\,Q^*_{Y^n}\big)
= D_\lambda\big(P_{Y^n|V=\bar\theta}\,\big\|\,Q^{*(l-1)}_{Y^n}\big) + \log\frac{\Gamma(l/2)\,\Gamma(n+k/2)}{\Gamma(k/2)\,\Gamma(n+l/2)} \quad (325)
\]
\[
\ge D_\lambda\big(P_{Y^n|V=\bar\theta}\,\big\|\,Q^{*(l-1)}_{Y^n}\big) + \log\frac{\Gamma(l/2)}{\Gamma(k/2)} + \frac{k-l}{2}\log n + \log L(k,l,n), \quad (326)
\]

where Q^{*(l−1)}_{Y^n} denotes the Jeffreys' mixture when the underlying parameter space is the (l − 1)-dimensional simplex, (325) follows from the fact that

\[
\frac{D_k\big(t_1+\tfrac12,\dots,t_l+\tfrac12,\tfrac12,\dots,\tfrac12\big)}{D_k\big(\tfrac12,\dots,\tfrac12\big)}
= \frac{D_l\big(t_1+\tfrac12,\dots,t_l+\tfrac12\big)}{D_l\big(\tfrac12,\dots,\tfrac12\big)} \cdot \frac{\Gamma(k/2)\,\Gamma(n+l/2)}{\Gamma(l/2)\,\Gamma(n+k/2)}, \quad (327)
\]

and (326) follows from Stirling's approximation, which can be seen in (86). Since

\[
\sup_{\theta}\, D_\lambda\big(P_{Y^n|V=\theta}\,\big\|\,Q^*_{Y^n}\big)
\ge \sup_{\bar\theta \in \Delta_{l-1}} D_\lambda\big(P_{Y^n|V=\bar\theta}\,\big\|\,Q^{*(l-1)}_{Y^n}\big) + \log\frac{\Gamma(l/2)}{\Gamma(k/2)} + \frac{k-l}{2}\log n + \log L(k,l,n) \quad (328)
\]
\[
\ge \inf_{Q_{Y^n}} \sup_{\bar\theta \in \Delta_{l-1}} D_\lambda\big(P_{Y^n|V=\bar\theta}\,\big\|\,Q_{Y^n}\big) + \log\frac{\Gamma(l/2)}{\Gamma(k/2)} + \frac{k-l}{2}\log n + \log L(k,l,n), \quad (329)
\]

where the supremization in the left side of (328) is over all θ whose last k − l entries are zero, the converse result in Section IV-B with k ← l and the fact that

\[
\lim_{n\to\infty} L(k,l,n) = 1, \quad (330)
\]

along with routine algebraic manipulations, yield the desired result in (322).
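The convergence (330), and the Stirling step behind (326), amount to the standard estimate Γ(n + k/2)/Γ(n + l/2) = n^{(k−l)/2}(1 + o(1)). A quick numerical confirmation (ours, not part of the paper; natural logarithms throughout):

```python
import math

def log_gamma_ratio_gap(n, k, l):
    # log[ Gamma(n + k/2) / Gamma(n + l/2) ] - ((k - l)/2) * log n,
    # which should vanish as n grows, consistent with (326) and (330).
    return (math.lgamma(n + k / 2) - math.lgamma(n + l / 2)
            - ((k - l) / 2) * math.log(n))

for n in (10, 1000, 100000):
    print(n, log_gamma_ratio_gap(n, 5, 2))
```

The gap decays roughly like 1/n, so the correction factor L(k, l, n) indeed tends to 1.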
ACKNOWLEDGMENTS

This work has been supported by ARO-MURI contract number W911NF-15-1-0479 and in part by the Center for Science of Information, an NSF Science and Technology Center under Grant CCF-0939370.

REFERENCES

[1] S. Yagli, Y. Altuğ, and S. Verdú, "Minimax Rényi redundancy," in Proceedings of the 2017 IEEE International Symposium on Information Theory, June 2017, pp. 2980–2984.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ, USA: Wiley, 2006.
[3] I. Kontoyiannis and S. Verdú, "Optimal lossless data compression: non-asymptotics and asymptotics," IEEE Transactions on Information Theory, vol. 60, no. 2, pp. 777–795, Feb. 2014.
Problems of Information Transmission, vol. 15, no. 2, pp. 134–138, Oct. 1979.
[6] Q. Xie and A. R. Barron, "Asymptotic minimax regret for data compression, gambling, and prediction," IEEE Transactions on Information Theory, vol. 46, no. 2, pp. 431–445, Mar. 2000.
[7] Y. M. Shtarkov, "Universal sequential coding of single messages," Problemy Peredachi Informatsii, vol. 23, no. 3, pp. 3–17, Jul.–Sep. 1987.
[8] J. Forster and M. K. Warmuth, "Relative expected instantaneous loss bounds," Journal of Computer and System Sciences, vol. 64, no. 1, pp. 76–102, Feb. 2002.
[9] M. Drmota and W. Szpankowski, "Precise minimax redundancy and regret," IEEE Transactions on Information Theory, vol. 50, no. 11, pp. 2686–2707, Nov. 2004.
[10] F. Liang and A. R. Barron, "Exact minimax strategies for predictive density estimation, data compression, and model selection," IEEE Transactions on Information Theory, vol. 50, no. 11, pp. 2708–2726, Nov. 2004.
[11] J. W. Pratt, "Risk aversion in the small and in the large," Econometrica: Journal of the Econometric Society, vol. 32, no. 1–2, pp. 122–136, Jan.–Apr. 1964.
[12] K. J. Arrow, Aspects of the Theory of Risk-Bearing. Helsinki, Finland: Yrjö Jahnssonin Säätiö, 1965.
[13] S. A. Ross, "Some stronger measures of risk aversion in the small and the large with applications," Econometrica: Journal of the Econometric Society, vol. 49, no. 3, pp. 621–638, May 1981.
[14] L. L. Campbell, "A coding theorem and Rényi's entropy," Information and Control, vol. 8, no. 4, pp. 423–429, Aug. 1965.
[15] R. Sundaresan, "Guessing under source uncertainty," IEEE Transactions on Information Theory, vol. 53, no. 1, pp. 269–287, Jan. 2007.
[16] R. Sibson, "Information radius," Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, vol. 14, no. 2, pp. 149–160, 1969.
[17] S. Verdú, "α-mutual information," in Proceedings of the 2015 Information Theory and Applications Workshop (ITA), San Diego, CA, Feb. 2015, pp. 1–6.
[18] I. Csiszár, "Generalized cutoff rates and Rényi's information measures," IEEE Transactions on Information Theory, vol. 41, no. 1, pp. 26–34, Jan. 1995.
[19] Q. Xie and A. R. Barron, "Minimax redundancy for the class of memoryless sources," IEEE Transactions on Information Theory, vol. 43, no. 2, pp. 646–657, Mar. 1997.
[20] L. D. Davisson, R. J. McEliece, M. B. Pursley, and M. S. Wallace, "Efficient universal noiseless source codes," IEEE Transactions on Information Theory, vol. 27, no. 3, pp. 269–279, May 1981.
[21] L. Györfi, I. Páli, and E. C. van der Meulen, "There is no universal source code for an infinite source alphabet," IEEE Transactions on Information Theory, vol. 40, no. 1, pp. 267–271, Jan. 1994.
[22] R. E. Krichevsky and V. K. Trofimov, "The performance of universal encoding," IEEE Transactions on Information Theory, vol. 27, no. 2, pp. 199–207, Mar. 1981.
[23] J. Rissanen, "Universal coding, information, prediction, and estimation," IEEE Transactions on Information Theory, vol. 30, no. 4, pp. 629–636, Jul. 1984.
[24] J. Rissanen, "Stochastic complexity and modeling," The Annals of Statistics, vol. 14, no. 3, pp. 1080–1100, Sep. 1986.
[25] N. Merhav, "On optimum strategies for minimizing the exponential moments of a loss function," Communications in Information and Systems, vol. 11, no. 4, pp. 343–368, 2011.
[26] M. Hayashi, "Universal channel coding for general output alphabet," 2015. [Online]. Available: https://arxiv.org/abs/1502.02218
[27] B. S. Clarke and A. R. Barron, "Information-theoretic asymptotics of Bayes methods," IEEE Transactions on Information Theory, vol. 36, no. 3, pp. 453–471, May 1990.
[28] H. Jeffreys, "An invariant form for the prior probability in estimation problems," in Proceedings of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences, vol. 186, no. 1007, Sep. 1946, pp. 453–461.
[29] T. van Erven and P. Harremoës, "Rényi divergence and Kullback-Leibler divergence," IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, Jul. 2014.
[30] S. Arimoto, "Information measures and capacity of order α for discrete memoryless channels," in Topics in Information Theory, Proc. Coll. Math. Soc. János Bolyai. Keszthely, Hungary: Bolyai, 1975, pp. 41–52.
[31] H. Robbins, "A remark on Stirling's formula," The American Mathematical Monthly, vol. 62, no. 1, pp. 26–29, Jan. 1955.
[32] E. T. Whittaker and G. N. Watson, A Course of Modern Analysis, 4th ed. Cambridge, U.K.: Cambridge University Press, 1963.
[33] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed. Cambridge, U.K.: Cambridge University Press, 2011.
Semih Yagli received his Bachelor of Science degree in Electrical and Electronics Engineering in 2013 and his Bachelor of Science degree in Mathematics in 2014, both from Middle East Technical University, and his Master of Arts degree in Electrical Engineering in 2016 from Princeton University. Currently, he is pursuing his Ph.D. degree in Electrical Engineering at Princeton University under the supervision of Sergio Verdú. His research interests include information theory, optimization, and machine learning.
Yücel Altuğ received the B.S. and M.S. degrees in electrical and electronics engineering from Boğaziçi University, Turkey, in 2006 and 2008, respectively, and the Ph.D. degree in electrical and computer engineering from Cornell University in 2013, where he was awarded the ECE Director's Ph.D. Thesis Research Award. After postdoctoral appointments at Cornell University and Princeton University, he is currently a senior data scientist at Natera Inc. His research interests include Shannon theory, feedback communications, and stochastic modeling and algorithm design for next-generation DNA sequencing and genetic testing.
Sergio Verdú received the Telecommunications Engineering degree from the Universitat Politècnica de Barcelona in 1980, and the Ph.D. degree in Electrical Engineering from the University of Illinois at Urbana-Champaign in 1984. Since then, he has been a member of the faculty of Princeton University, where he is the Eugene Higgins Professor of Electrical Engineering, and is a member of the Program in Applied and Computational Mathematics.

Sergio Verdú is the recipient of the 2007 Claude E. Shannon Award and the 2008 IEEE Richard W. Hamming Medal. He is a member of both the National Academy of Engineering and the National Academy of Sciences. In 2016, Verdú received the National Academy of Sciences Award for Scientific Reviewing.

Verdú is a recipient of several paper awards from the IEEE: the 1992 Donald Fink Paper Award, the 1998 and 2012 Information Theory Paper Awards, an Information Theory Golden Jubilee Paper Award, the 2002 Leonard Abraham Prize Award, the 2006 Joint Communications/Information Theory Paper Award, and the 2009 Stephen O. Rice Prize from the IEEE Communications Society. In 1998, Cambridge University Press published his book Multiuser Detection, for which he received the 2000 Frederick E. Terman Award from the American Society for Engineering Education. He was awarded a Doctorate Honoris Causa from the Universitat Politècnica de Catalunya in 2005, and was elected corresponding member of the Real Academia de Ingeniería of Spain in 2013.

Sergio Verdú served as President of the IEEE Information Theory Society in 1997, and on its Board of Governors (1988–1999, 2009–2014). He has also served in various editorial capacities for the IEEE Transactions on Information Theory: Associate Editor (Shannon Theory, 1990–1993; Book Reviews, 2002–2006), Guest Editor of the Special Fiftieth Anniversary Commemorative Issue (published by IEEE Press as "Information Theory: Fifty Years of Discovery"), and member of the Executive Editorial Board (2010–2013). He co-chaired the Europe-United States Frontiers of Engineering program of the National Academy of Engineering during 2009–2013. He is the founding Editor-in-Chief of Foundations and Trends in Communications and Information Theory. Verdú served as co-chair of the 2000 and 2016