Bernoulli 16(2), 2010, 459–470
DOI: 10.3150/09-BEJ216

Relative log-concavity and a pair of triangle inequalities
YAMING YU
Department of Statistics, University of California, Irvine, CA 92697-1250, USA. E-mail: [email protected]

This is an electronic reprint of the original article published by the ISI/BS in Bernoulli, 2010, Vol. 16, No. 2, 459–470. This reprint differs from the original in pagination and typographic detail.
The relative log-concavity ordering ≤_lc between probability mass functions (pmf's) on non-negative integers is studied. Given three pmf's f, g, h that satisfy f ≤_lc g ≤_lc h, we present a pair of (reverse) triangle inequalities: if Σ_i i f_i = Σ_i i g_i < ∞, then

D(f|h) ≥ D(f|g) + D(g|h),

and if Σ_i i g_i = Σ_i i h_i < ∞, then

D(h|f) ≥ D(h|g) + D(g|f),

where D(·|·) denotes the Kullback–Leibler divergence. These inequalities, interesting in themselves, are also applied to several problems, including maximum entropy characterizations of Poisson and binomial distributions and the best binomial approximation in relative entropy. We also present parallel results for continuous distributions and discuss the behavior of ≤_lc under convolution.

Keywords: Bernoulli sum; binomial approximation; Hoeffding's inequality; maximum entropy; minimum entropy; negative binomial approximation; Poisson approximation; relative entropy
1. Introduction and main result
A non-negative sequence u = {u_i, i ≥ 0} is log-concave if (a) the support of u is an interval in Z_+ = {0, 1, . . .} and (b) u_i² ≥ u_{i+1} u_{i−1} for all i or, equivalently, log(u_i) is concave in supp(u). Such sequences occur naturally in combinatorics, probability and statistics, for example, as probability mass functions (pmf's) of many discrete distributions. Given two pmf's f = {f_0, f_1, . . .} and g = {g_0, g_1, . . .} on Z_+, we say that f is log-concave relative to g, written as f ≤_lc g, if

1. each of f and g is supported on an interval of Z_+;
2. supp(f) ⊂ supp(g);
3. log(f_i/g_i) is concave in supp(f).
We have f ≤_lc f (assuming interval support) and f ≤_lc g, g ≤_lc h =⇒ f ≤_lc h. In other words, ≤_lc defines a pre-order among discrete distributions with interval supports on Z_+. When g is a geometric pmf, f ≤_lc g simply means that f is log-concave; when g is a binomial or Poisson pmf and f ≤_lc g, then f is ultra log-concave [23] (see Section 2). Whitt [27] discusses this particular ordering and illustrates its usefulness with a queueing theory example. Yu [30] uses ≤_lc to derive simple conditions that imply other stochastic orders such as the usual stochastic order, the hazard rate order and the likelihood ratio order. Stochastic orders play an important role in diverse areas, including reliability theory and survival analysis ([2, 7]); see Shaked and Shanthikumar [24] for a book-length treatment.

In this paper, we are concerned with entropy relations between distributions under ≤_lc. The investigation is motivated by maximum entropy characterizations of binomial and Poisson distributions (see Section 2). For a random variable X on Z_+ with pmf f, the Shannon entropy is defined as

H(X) = H(f) = −Σ_{i=0}^∞ f_i log(f_i).

By convention, 0 log(0) = 0. The relative entropy (Kullback and Leibler [19]; Kullback [18]; Csiszár and Shields [5]) between pmf's f and g on Z_+ is defined as

D(f|g) = Σ_{i=0}^∞ f_i log(f_i/g_i) if supp(f) ⊂ supp(g), and D(f|g) = ∞ otherwise.

By convention, 0 log(0/0) = 0.
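As a quick numerical illustration of the ordering (a sketch added for this edit, assuming Python with NumPy and SciPy; the parameters are arbitrary), one can verify bi(5, 0.3) ≤_lc po(1.5) by checking that the second differences of log(f_i/g_i) are non-positive on supp(f) = {0, . . . , 5}:

```python
import numpy as np
from scipy.stats import binom, poisson

# Check f <=_lc g for f = bi(5, 0.3) and g = po(1.5): supp(f) = {0,...,5}
# is contained in supp(g) = Z_+, so it remains to test that the second
# differences of log(f_i/g_i) are non-positive on supp(f).
n, p, lam = 5, 0.3, 1.5
i = np.arange(n + 1)
log_ratio = binom.logpmf(i, n, p) - poisson.logpmf(i, lam)
print(np.all(np.diff(log_ratio, 2) <= 1e-12))  # True: f is ultra log-concave
```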
We state our main result.
Theorem 1.
Let f, g, h be pmf's on Z_+ such that f ≤_lc g ≤_lc h. If f and g have finite and equal means, then D(f|h) < ∞ and

D(f|h) ≥ D(f|g) + D(g|h);   (1.1)

if h and g have finite and equal means, then

D(h|f) ≥ D(h|g) + D(g|f).   (1.2)

Theorem 1 has an appealing geometric interpretation. (With a slight abuse of notation, we write the mean of a pmf g as E(g) = Σ_i i g_i.) If g and h satisfy E(g) < ∞ and g ≤_lc h, then (1.1) gives

D(g|h) = inf_{f ∈ F} D(f|h),   F = {f : f ≤_lc g, E(f) = E(g)}.

That is, g is the I-projection of h onto F. Relation (1.2) can be interpreted similarly. See Csiszár and Shields [5] for general definitions and properties of the I-projection and the related reverse I-projection. We also discuss the behavior of ≤_lc under convolution, as this becomes relevant in a few places.
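For concreteness, here is a small numerical sanity check of (1.1), a sketch assuming NumPy/SciPy with arbitrary parameters: take f to be the pmf of a Bernoulli sum with p = (0.2, 0.4), g = bi(2, 0.3) (so E(f) = E(g) = 0.6) and h = po(0.6); then f ≤_lc g by Newton's inequalities (see (2.3) in Section 2) and g ≤_lc h since binomial pmf's are ultra log-concave.

```python
import numpy as np
from scipy.stats import binom, poisson

def kl(f, g):
    """D(f|g) for pmf's on a common grid, with 0 log(0/0) = 0."""
    mask = f > 0
    return float(np.sum(f[mask] * np.log(f[mask] / g[mask])))

f = np.convolve([0.8, 0.2], [0.6, 0.4])   # pmf of X1 + X2, p = (0.2, 0.4)
i = np.arange(3)                          # supp(f) = {0, 1, 2}
g = binom.pmf(i, 2, 0.3)                  # same mean 0.6
h = poisson.pmf(i, 0.6)                   # po(0.6) restricted to supp(f)
print(kl(f, h) >= kl(f, g) + kl(g, h))    # True, as (1.1) asserts
```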
2. Some implications of Theorem 1
Theorem 1 is used to unify and generalize classical results on maximum entropy characterizations of Poisson and binomial distributions in Section 2.1 and to determine the best binomial approximation (in relative entropy) to a sum of independent Bernoulli random variables in Section 2.2. Section 2.3 contains analogous results for the negative binomial. Theorem 1 also implies monotonicity (in terms of relative entropy) in certain Poisson limit theorems.
2.1. Maximum entropy characterizations

Throughout this subsection (and in Section 2.2), let X_1, . . . , X_n be independent Bernoulli random variables with Pr(X_i = 1) = 1 − Pr(X_i = 0) = p_i, 0 < p_i < 1. Define S = Σ_{i=1}^n X_i and p̄ = (1/n) Σ_{i=1}^n p_i.

A theorem of Shepp and Olkin [25] (see also [22] and [10]) states that

H(S) ≤ H(bi(n, p̄)),   (2.1)

where bi(n, p) denotes the binomial pmf with n trials and probability p for success. In other words, subject to a fixed mean np̄, the entropy of S is maximized when all p_i are equal. Karlin and Rinott [16] (see also Harremoës [10]) note the corresponding result

H(S) ≤ H(po(np̄)),   (2.2)

where po(λ) denotes the Poisson pmf with mean λ.

Johnson [13] gives a generalization of (2.2) to ultra log-concave (ULC) distributions. The notion of ultra log-concavity was introduced by Pemantle [23] in the study of negative dependence. A pmf f on Z_+ is ULC of order k if f_i/C(k, i) is log-concave in i, where C(k, i) denotes the binomial coefficient; it is ULC of order ∞, or simply ULC, if i! f_i is log-concave. Equivalently, these definitions can be stated with the ≤_lc notation:

1. f is ULC of order k if f ≤_lc bi(k, p) for some p ∈ (0, 1) (the value of p does not affect the definition);
2. f is ULC of order ∞ if f ≤_lc po(λ) for some λ > 0 (the value of λ does not affect the definition).
An example is the distribution of S in (2.2) and (2.1). Denoting the pmf of S by f_S, we have

f_S ≤_lc bi(n, p̄),   (2.3)

which can be shown to be a reformulation of Newton's inequalities (Hardy et al. [9]). Also, note that, as can be verified using the definition, f being ULC of order k means that it is also ULC of orders k + 1, k + 2, . . . , ∞. Another notable property of ULC distributions, expressed in our notation, is due to Liggett [21].

Theorem 2 ([21]). If f ≤_lc bi(k, p) and g ≤_lc bi(m, p), p ∈ (0, 1), then f ∗ g ≤_lc bi(k + m, p), where f ∗ g = {Σ_{i=0}^j f_i g_{j−i}, j = 0, . . . , k + m} denotes the convolution of f and g.

This is a strong result; it implies (2.3) trivially. Simply observe that

bi(1, p_i) ≤_lc bi(1, p̄),   i = 1, . . . , n,

and apply Theorem 2 to obtain f_S = bi(1, p_1) ∗ · · · ∗ bi(1, p_n) ≤_lc bi(n, p̄), that is, f_S is ULC of order n. A limiting case of Theorem 2 also holds: for pmf's f and g on Z_+, we have

f ≤_lc po(λ), g ≤_lc po(µ) =⇒ f ∗ g ≤_lc po(λ + µ).
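As a small check of Theorem 2 (a sketch under the same NumPy/SciPy assumptions as above; the parameters are arbitrary), convolving a ULC pmf of order 2 with one of order 3 should produce a pmf that is ULC of order 5:

```python
import numpy as np
from scipy.stats import binom
from scipy.special import comb

# f = bi(2, 0.2) is ULC of order 2 and g = bi(3, 0.7) is ULC of order 3,
# so by Theorem 2 their convolution is ULC of order 5, i.e.,
# (f*g)_j / C(5, j) is log-concave in j.
f = binom.pmf(np.arange(3), 2, 0.2)
g = binom.pmf(np.arange(4), 3, 0.7)
conv = np.convolve(f, g)                       # pmf of the sum on {0,...,5}
log_ratio = np.log(conv / comb(5, np.arange(6)))
print(np.all(np.diff(log_ratio, 2) <= 1e-12))  # True
```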
The following generalization of (2.2) is proved by Johnson [13].

Theorem 3 ([13]).
If a pmf f on Z_+ is ULC, then H(f) ≤ H(po(E(f))).

Johnson's proof uses two operations, namely convolution with a Poisson pmf and binomial thinning, to construct a semigroup action on the set of ULC distributions with a fixed mean. The entropy is then shown to be monotone along this semigroup. A corresponding generalization of (2.1) appears in Yu [28]. The proof adopts the idea of Johnson [13] and is likewise non-trivial.
Theorem 4 ([28]).
If a pmf f is ULC of order n, then H(f) ≤ H(bi(n, E(f)/n)).

We point out that Theorems 3 and 4 can be deduced from Theorem 1; in fact, both are special cases of the following result.
Theorem 5.
Any log-concave pmf g on Z_+ is the unique maximizer of entropy in the set F = {f : f ≤_lc g, E(f) = E(g)}.

Proof.
The log-concavity of g ensures that λ ≡ E(g) < ∞. Letting f ∈ F and using the geometric pmf ge(p) = {p(1 − p)^i, i = 0, 1, . . .}, we get

D(f|ge(p)) = −H(f) − log(p) − λ log(1 − p),
D(g|ge(p)) = −H(g) − log(p) − λ log(1 − p),

which also shows that H(f) < ∞ and H(g) < ∞. Since f ≤_lc g ≤_lc ge(p), Theorem 1 yields

−H(f) ≥ D(f|g) − H(g) ≥ −H(g),

so that H(f) ≤ H(g) for all f ∈ F, with equality if and only if D(f|g) = 0, that is, f = g. □

Theorems 3 and 4 are obtained by noting that both po(λ) and bi(n, p) are log-concave. For recent extensions of Theorems 3 and 4 to compound distributions, see [14] and [31].
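As a concrete check of the resulting entropy chain H(f_S) ≤ H(bi(n, p̄)) ≤ H(po(np̄)), implied by (2.1), (2.2) and Theorem 5, here is a sketch (NumPy/SciPy assumed; the p_i are arbitrary and the Poisson entropy is computed on a truncated grid whose tail is negligible):

```python
import numpy as np
from scipy.stats import binom, poisson

def entropy(pmf):
    pmf = pmf[pmf > 0]
    return float(-np.sum(pmf * np.log(pmf)))

p = [0.1, 0.5, 0.9]                     # Bernoulli success probabilities
fS = np.ones(1)
for pi in p:                            # pmf of S by repeated convolution
    fS = np.convolve(fS, [1 - pi, pi])
n, pbar = len(p), float(np.mean(p))
H_S = entropy(fS)
H_bi = entropy(binom.pmf(np.arange(n + 1), n, pbar))
H_po = entropy(poisson.pmf(np.arange(60), n * pbar))
print(H_S <= H_bi <= H_po)              # True: (2.1), (2.2) and Theorem 5
```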
2.2. Best binomial approximation

Recall that S = Σ_{i=1}^n X_i is a sum of independent Bernoulli random variables, each with success probability p_i. Let λ = Σ_{i=1}^n p_i and let f_S denote the pmf of S. Approximating S with a Poisson distribution Po(λ) is an old problem (Le Cam [20], Chen [3], Barbour et al. [1]). Approximating S with a binomial Bi(n, p̄), p̄ = (1/n) Σ_{i=1}^n p_i, has also been considered (Stein [26], Ehm [8]). The results are typically stated in terms of the total variation distance, defined for pmf's f and g as V(f, g) = Σ_i |f_i − g_i|. For example, Ehm [8] applies the method of Stein and Chen to derive the bound (q̄ = 1 − p̄)

V(f_S, bi(n, p̄)) ≤ (1 − p̄^{n+1} − q̄^{n+1}) [(n + 1) p̄ q̄]^{−1} Σ_{i=1}^n (p_i − p̄)².

Here, we are concerned with the following problem: what is the best m, m ≥ n, and p ∈ (0, 1) for approximating S with Bi(m, p)? Intuition says Bi(n, p̄). Indeed, Choi and Xia [4] study this in terms of the total variation distance d_m = V(f_S, bi(m, λ/m)) and prove that, under certain conditions, for large enough m, d_m increases with m.

Theorem 6 ([4]).
Let r = ⌊λ⌋ be the integer part of λ and let δ = λ − r. If r > 2 + 2(1 + δ)² and m ≥ max{n, λ²/(r − 2 − 2(1 + δ)²)}, then

d_m < d_{m+1} < V(f_S, po(λ)).

The derivation of Theorem 6 is somewhat involved. However, if we consider this problem in terms of relative entropy rather than total variation, then Theorem 7 below gives a definite and equally intuitive answer. Similar results (see Section 2.3) hold for the negative binomial approximation of a sum of independent geometric random variables.
Theorem 7.
Suppose that m′ ≥ m ≥ n and p′ ∈ (0, 1). Then,

D(f_S | bi(m′, p′)) ≥ D(f_S | bi(m, λ/m)) + D(bi(m, λ/m) | bi(m′, p′))   (2.4)

and, therefore,

D(f_S | bi(m′, p′)) ≥ D(f_S | bi(m, λ/m)) ≥ D(f_S | bi(n, p̄)).
Let f = f_S, g = bi(m, λ/m) and h = bi(m′, p′) in Theorem 1. By (2.3), we have f ≤_lc bi(n, p̄) ≤_lc g ≤_lc h. The claim follows from (1.1). □

Theorem 7 shows that, for approximating S in the sense of relative entropy,

1. Bi(m, λ/m), which has the same mean as S, is preferable to Bi(m, p′), p′ ≠ λ/m;
2. Bi(n, p̄) is preferable to Bi(m, λ/m), m > n.

Obviously, the proof of (2.4) still applies when bi(m′, p′) is replaced by po(λ). Hence,

D(f_S | po(λ)) ≥ D(f_S | bi(n, p̄)) + D(bi(n, p̄) | po(λ)),   (2.5)

that is, Po(λ) is worse than Bi(n, p̄) by at least D(bi(n, p̄) | po(λ)).

We conclude this subsection with another interesting result in the form of a corollary of Theorem 1. Writing b_m = bi(m, λ/m) for simplicity, we have

D(b_m | po(λ)) ≥ D(b_m | b_{m+1}) + D(b_{m+1} | po(λ))

and, therefore,

D(b_m | po(λ)) > D(b_{m+1} | po(λ)),   m > λ.   (2.6)

That is, the limit Bi(m, λ/m) → Po(λ), m → ∞, is monotone in relative entropy. As simple as (2.6) may seem, it is difficult to derive it directly without Theorem 1, which perhaps explains why (2.6) appears new, even though the binomial-to-Poisson limit is common knowledge.
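The monotonicity (2.6) is easy to see numerically. In the sketch below (NumPy/SciPy assumed; λ = 2 and the grid of m values are arbitrary), D(b_m | po(λ)) is evaluated for increasing m and decreases toward 0:

```python
import numpy as np
from scipy.stats import binom, poisson

lam = 2.0

def kl_bm_to_po(m):
    """D(bi(m, lam/m) | po(lam)), summed over supp(b_m) = {0,...,m}."""
    i = np.arange(m + 1)
    log_b = binom.logpmf(i, m, lam / m)
    return float(np.sum(np.exp(log_b) * (log_b - poisson.logpmf(i, lam))))

print([round(kl_bm_to_po(m), 6) for m in (3, 5, 10, 50, 200)])
# a strictly decreasing sequence, illustrating (2.6)
```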
2.3. Negative binomial approximation

Let T be a sum of geometric random variables, T = Σ_{i=1}^n Y_i, where Y_i ∼ Ge(r_i) independently, r_i ∈ (0, 1). Denote the mean of T by µ = Σ_{i=1}^n (1 − r_i)/r_i and denote the pmf of T by f_T. Let nb(n, r) = {C(n + i − 1, i) r^n (1 − r)^i, i = 0, 1, . . .} denote the pmf of the negative binomial NB(n, r). The counterpart of (2.1) appears in Karlin and Rinott [16].

Theorem 8 ([16]). H(T) ≥ H(nb(n, n/(n + µ))).

In other words, subject to a fixed mean µ, the entropy of T is minimized when all r_i are equal. Theorem 8 can be generalized as follows.

Theorem 9. Any log-concave pmf f is the unique minimizer of entropy in the set G = {g : f ≤_lc g ≤_lc ge(p), E(g) = E(f)}, p ∈ (0, 1).

We realize that Theorem 9 is just a reformulation of Theorem 5, which follows from Theorem 1. To show that Theorem 9 indeed implies Theorem 8, we need the following inequality of Hardy et al. [9], written in our notation as

nb(n, n/(n + µ)) ≤_lc f_T.   (2.7)

We also need f_T to be log-concave, but this holds because convolutions of log-concave sequences are also log-concave.

Next, we consider the problem of selecting the best m, m ≥ n, and r ∈ (0, 1) for approximating T with NB(m, r).
Theorem 10.
Suppose that m′ ≥ m ≥ n and r′ ∈ (0, 1). Write nb_m = nb(m, m/(m + µ)) as shorthand. Then,

D(f_T | nb(m′, r′)) ≥ D(f_T | nb_m) + D(nb_m | nb(m′, r′))

and, therefore,

D(f_T | nb(m′, r′)) ≥ D(f_T | nb_m) ≥ D(f_T | nb(n, n/(n + µ))).
The relations

nb(m′, r′) ≤_lc nb_m ≤_lc nb(n, n/(n + µ))

are easy to verify. We also have (2.7). The claim follows from (1.2). □

Theorem 10 implies that, for approximating T in the sense of relative entropy, NB(n, n/(n + µ)) is no worse than NB(m′, r′) whenever m′ ≥ n. The counterpart of (2.5) also holds (nb_n = nb(n, n/(n + µ))):

D(f_T | po(µ)) ≥ D(f_T | nb_n) + D(nb_n | po(µ)),

that is, Po(µ) is worse than NB(n, n/(n + µ)) by at least D(nb_n | po(µ)). In addition, parallel to (2.6), we have

D(nb_m | po(µ)) > D(nb_{m′} | po(µ)),   m′ > m > 0,   (2.8)

that is, the limit NB(m, m/(m + µ)) → Po(µ), m → ∞, is monotone in relative entropy. Note that in (2.8), m and m′ need not be integers; similarly in Theorem 10.
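Parallel to the binomial sketch earlier, (2.8) can be checked numerically; again a sketch assuming NumPy/SciPy, with µ = 2 and the grid cutoff as arbitrary choices:

```python
import numpy as np
from scipy.stats import nbinom, poisson

mu = 2.0
i = np.arange(150)          # generous grid; the truncated tails are negligible

def kl_nbm_to_po(m):
    """D(nb(m, m/(m + mu)) | po(mu)), truncated to the grid."""
    log_nb = nbinom.logpmf(i, m, m / (m + mu))
    return float(np.sum(np.exp(log_nb) * (log_nb - poisson.logpmf(i, mu))))

print([round(kl_nbm_to_po(m), 6) for m in (1, 2, 5, 20, 100)])
# strictly decreasing toward 0, illustrating (2.8)
```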
We conclude this subsection with a problem on the behavior of ≤_lc under convolution. Analogous to Theorem 2 is the following result of Davenport and Pólya ([6], Theorem 2), rephrased in terms of ≤_lc.

Theorem 11 ([6]).
Suppose that pmf's f and g on Z_+ satisfy nb(k, r) ≤_lc f and nb(m, r) ≤_lc g for k, m > 0, r ∈ (0, 1). Their convolution f ∗ g then satisfies nb(k + m, r) ≤_lc f ∗ g.

Actually, Davenport and Pólya [6] assume that k + m = 1, so their conclusion is the log-convexity of f ∗ g, but it is readily verified that the same proof works for all positive k and m. The limiting case also holds, that is,

po(λ) ≤_lc f, po(µ) ≤_lc g =⇒ po(λ + µ) ≤_lc f ∗ g.

An open problem is to determine general conditions that ensure

f ≤_lc f′, g ≤_lc g′ =⇒ f ∗ g ≤_lc f′ ∗ g′.   (2.9)

Theorem 2 simply says that (2.9) holds if f′ = bi(k, p) and g′ = bi(m, p) with the same p, and Theorem 11 says that (2.9) holds if f = nb(k, r) and g = nb(m, r) with the same r. The proofs of Theorems 11 and 2 (Theorem 2 especially) are non-trivial. It is reasonable to ask whether there exist other interesting and non-trivial instances of (2.9).
3. Proof of Theorem 1
The proof of Theorem 1 hinges on the following lemma that dates back to Karlin and Novikoff [15] and Karlin and Studden [17]. Our assumptions are slightly different from those of Karlin and Studden [17], Lemma XI.7.2. In the proof (included for completeness), the number of sign changes of a sequence is counted after discarding zero terms.
Lemma 1 ([17]).
Let a_i, i = 0, 1, . . . , be a real sequence such that Σ_{i=0}^∞ a_i = 0 and Σ_{i=0}^∞ i a_i = 0. Suppose that the set C = {i : a_i > 0} is an interval on Z_+. For any concave function w(i) on Z_+, we then have

Σ_{i=0}^∞ w(i) a_i ≥ 0.   (3.1)
Karlin and Studden ([17], Lemma XI.7.2) assume that a_i, i = 0, 1, . . . , changes sign exactly twice, with sign sequence −, +, −. However, it also suffices to assume that C is an interval. Suppose that a_i changes sign exactly once, with sign sequence +, −; that is, there exists 0 ≤ k < ∞ such that a_i ≥ 0 for 0 ≤ i ≤ k, with strict inequality for at least one i ≤ k, and a_i ≤ 0 for i > k. Then,

Σ_{i=0}^∞ i a_i ≤ Σ_{i=0}^k k a_i + Σ_{i=k+1}^∞ (k + 1) a_i = −Σ_{i=0}^k a_i < 0,

which contradicts Σ_{i=0}^∞ i a_i = 0; by symmetry, the sign sequence cannot be −, + either. Assuming that C is an interval, this shows that, except for the trivial case a_i ≡ 0, the sequence a_i changes sign exactly twice, with sign sequence −, +, −.

The rest of the argument is well known. The sequence A_j = Σ_{i=0}^j a_i has exactly one sign change, with sign sequence −, +; similarly, Σ_{i=0}^j A_i ≤ 0 for j = 0, 1, . . . , which implies (3.1) for every concave function w(i) upon applying summation by parts. □

Theorem 12 below is a consequence of Lemma 1. Although not phrased as such, the basic idea is implicit in Karlin and Studden [17] in their analyses of special cases; see also Whitt [27]. When f is the pmf of a sum of n independent Bernoulli random variables and g = bi(n, E(f)/n), as discussed in Section 2, Theorem 12 reduces to an inequality of Hoeffding [11].

Theorem 12.
Suppose that two pmf's f and g on Z_+ satisfy f ≤_lc g and E(f) = E(g) < ∞. For any concave function w(i) on Z_+, we then have

Σ_{i=0}^∞ f_i w(i) ≥ Σ_{i=0}^∞ g_i w(i).
Since E(g) < ∞ and w is concave, Σ_{i=0}^∞ g_i w(i) either converges absolutely or diverges to −∞. Assume the former. Since log(f_i/g_i) is concave and hence unimodal, the set C = {i : f_i − g_i > 0} must be an interval. The result then follows from Lemma 1. □
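A quick numerical illustration of Theorem 12 (a sketch assuming NumPy/SciPy; the p_i and the concave w below are arbitrary choices): taking f to be a Bernoulli-sum pmf and g = bi(n, p̄), as in Hoeffding's special case, the expectation of a concave function is at least as large under f:

```python
import numpy as np
from scipy.stats import binom

p = [0.1, 0.5, 0.9]
f = np.ones(1)
for pi in p:                           # pmf of the Bernoulli sum
    f = np.convolve(f, [1 - pi, pi])
i = np.arange(len(f))
g = binom.pmf(i, len(p), np.mean(p))   # same mean as f, and f <=_lc g
w = -(i - 1.5) ** 2                    # an arbitrary concave function
print(np.sum(f * w) >= np.sum(g * w))  # True, as Theorem 12 asserts
```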
Theorem 1 is a consequence of Theorem 12. Actually, we prove a slightly more general "quadrangle inequality," which may be of interest. Theorem 1 corresponds to the special case g = g′ in Theorem 13.

Theorem 13. Let f, g, g′, h be pmf's on Z_+ such that f ≤_lc g ≤_lc g′ ≤_lc h. If E(f) = E(g) < ∞, then D(f|h) < ∞ and

D(f|h) + D(g|g′) ≥ D(f|g′) + D(g|h);   (3.2)

if E(g′) = E(h) < ∞, then

D(h|f) + D(g′|g) ≥ D(g′|f) + D(h|g).   (3.3)

Proof.
The concavity of log(f_i/h_i) and E(f) < ∞ imply D(f|h) < ∞. Likewise for D(g|h). Thus, (3.2) can be written as

D(f|h) − D(f|g′) ≥ D(g|h) − D(g|g′)

or, equivalently,

Σ_i f_i log(g′_i/h_i) ≥ Σ_i g_i log(g′_i/h_i).   (3.4)

Since log(g′_i/h_i) is concave in supp(g′), and supp(f) ⊂ supp(g) ⊂ supp(g′), (3.4) follows directly from Theorem 12.

To prove (3.3), we may assume D(h|f) < ∞ and D(g′|g) < ∞. These imply, in particular, that supp(f) = supp(g′) = supp(h). We get

Σ_i g′_i log(f_i/g_i) ≥ Σ_i h_i log(f_i/g_i)

and (3.3) follows as before. □
4. The continuous case
For probability density functions (pdf's) f and g with respect to Lebesgue measure on R, the differential entropy of f and the relative entropy between f and g are defined, respectively, as

H(f) = −∫_{−∞}^{∞} f(x) log(f(x)) dx and D(f|g) = ∫_{−∞}^{∞} f(x) log(f(x)/g(x)) dx.

Parallel to the discrete case, let us write f ≤_lc g if

1. supp(f) and supp(g) are both intervals on R;
2. supp(f) ⊂ supp(g); and
3. log(f(x)/g(x)) is concave in supp(f).

There then holds a continuous analog of Theorem 1 (with its first phrase replaced by "Let f, g, h be pdf's on R"); the proof is similar and is hence omitted. The following maximum/minimum entropy result parallels Theorems 5 and 9.

Theorem 14.
If a pdf g on R is log-concave, then it maximizes the differential entropy in the set F = {f : f ≤_lc g, E(f) = E(g)}. Alternatively, if a pdf f on R is log-concave, then it minimizes the differential entropy in the set G = {g : f ≤_lc g, g is log-concave and E(g) = E(f)}.

We illustrate Theorem 14 with a minimum entropy characterization of the gamma distribution. This parallels Theorem 8 for the negative binomial. Denote by gam(α, β) the pdf of the gamma distribution Gam(α, β), that is,

gam(x; α, β) = β^{−α} x^{α−1} e^{−x/β} / Γ(α),   x > 0.

Theorem 15.
Let α_i ≥ 1, β_i > 0 and let X_i ∼ Gam(α_i, 1), i = 1, . . . , n, independently. Define S = Σ_{i=1}^n β_i X_i. Then, subject to a fixed mean E(S) = Σ_{i=1}^n α_i β_i, the differential entropy of S (as a function of β_i, i = 1, . . . , n) is minimized when all β_i are equal.

Note that Theorem 3.1 of Karlin and Rinott ([16]; see also Yu [29]) implies that Theorem 15 holds when all α_i are equal. We use ≤_lc to give an extension to general α_i ≥ 1. The key is the following lemma; Davenport and Pólya [6] assume α_1 + α_2 = 1, but the proof works for all positive α_1, α_2.

Lemma 2 ([6], Theorem 4).
Let α_1, α_2 > 0 and let f and g be pdf's on (0, ∞) such that gam(α_1, 1) ≤_lc f and gam(α_2, 1) ≤_lc g. Then, gam(α_1 + α_2, 1) ≤_lc f ∗ g, where (f ∗ g)(x) = ∫_0^x f(y) g(x − y) dy.

Proof of Theorem 15.
Repeated application of Lemma 2 yields

gam(α_+, 1) ≤_lc f_S,   (4.1)

where α_+ = Σ_{i=1}^n α_i and f_S denotes the pdf of S. Alternatively, we can show (4.1) by noting that f_S is a mixture of gam(α_+, β), where β has the distribution of S / Σ_{i=1}^n X_i (see, e.g., [27] and [30]).
Since α_i ≥ 1, each X_i is log-concave and so is f_S. The claim follows from Theorem 14. □
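For the smallest non-trivial case, the claim can be checked numerically. The sketch below (NumPy/SciPy assumed; the scales 1 and 3 are arbitrary) takes S = β₁X₁ + β₂X₂ with X_i ∼ Exp(1), whose density is the hypoexponential (e^{−x/β₂} − e^{−x/β₁})/(β₂ − β₁), and compares its differential entropy with that of Gam(2, E(S)/2):

```python
import numpy as np
from scipy.stats import gamma

b1, b2 = 1.0, 3.0
x = np.linspace(1e-8, 80.0, 400001)
dx = x[1] - x[0]
fS = (np.exp(-x / b2) - np.exp(-x / b1)) / (b2 - b1)  # density of S
g = gamma.pdf(x, a=2, scale=(b1 + b2) / 2)            # Gam(2, E(S)/2)
H_S = -np.sum(fS * np.log(fS)) * dx                   # Riemann sums
H_g = -np.sum(g * np.log(g)) * dx
print(H_S >= H_g)   # True: equal scales minimize the entropy (Theorem 15)
```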
Weighted sums of gamma variates, as in Theorem 15, arise naturally in statistical contexts, for example, as quadratic forms in normal variables, but their distributions can be non-trivial to compute (Imhof [12]). When comparing different gamma distributions as convenient approximations, we obtain a result similar to Theorems 7 and 10. The proof, also similar, is omitted.

Theorem 16.
Fix α_i > 0, β_i > 0 and let X_i ∼ Gam(α_i, 1), i = 1, . . . , n, independently. Define S = Σ_{i=1}^n β_i X_i, with pdf f_S. Write g_a = gam(a, Σ_{i=1}^n β_i α_i / a) as shorthand. For b > 0 and a′ ≥ a ≥ α_+, where α_+ = Σ_{i=1}^n α_i, we then have

D(f_S | gam(a′, b)) ≥ D(f_S | g_a) + D(g_a | gam(a′, b))

and, consequently,

D(f_S | gam(a′, b)) ≥ D(f_S | g_a) ≥ D(f_S | g_{α_+}).

In other words, to approximate S in the sense of relative entropy, Gam(α_+, Σ_{i=1}^n β_i α_i / α_+), which has the same mean as S, is no worse than Gam(a, b) whenever a ≥ α_+. Note that, unlike in Theorem 15, we do not require here that α_i ≥ 1.

Acknowledgments
The author would like to thank three referees for their constructive comments.
References

[1] Barbour, A.D., Holst, L. and Janson, S. (1992). Poisson Approximation. Oxford Studies in Probability. Oxford: Clarendon Press. MR1163825
[2] Barlow, R.E. and Proschan, F. (1975). Statistical Theory of Reliability and Life Testing. New York: Holt, Rinehart & Winston. MR0438625
[3] Chen, L.H.Y. (1975). Poisson approximation for dependent trials. Ann. Probab. 3 534–545.
[4] Choi, K.P. and Xia, A. (2002). Approximating the number of successes in independent trials: Binomial versus Poisson. Ann. Appl. Probab. 12 1139–1148.
[5] Csiszár, I. and Shields, P.C. (2004). Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory 1 417–528.
[6] Davenport, H. and Pólya, G. (1949). On the product of two power series. Canad. J. Math. 1 1–5.
[7] Dharmadhikari, S. and Joag-Dev, K. (1988). Unimodality, Convexity, and Applications. New York: Academic Press. MR0954608
[8] Ehm, W. (1991). Binomial approximation to the Poisson binomial distribution. Statist. Probab. Lett. 11 7–16.
[9] Hardy, G.H., Littlewood, J.E. and Pólya, G. (1952). Inequalities, 2nd ed. Cambridge, UK: Cambridge Univ. Press.
[10] Harremoës, P. (2001). Binomial and Poisson distributions as maximum entropy distributions. IEEE Trans. Inform. Theory 47 2039–2041.
[11] Hoeffding, W. (1956). On the distribution of the number of successes in independent trials. Ann. Math. Statist. 27 713–721.
[12] Imhof, J.P. (1961). Computing the distribution of quadratic forms in normal variables. Biometrika 48 419–426.
[13] Johnson, O. (2007). Log-concavity and the maximum entropy property of the Poisson distribution. Stochastic Process. Appl. 117 791–802.
[14] Johnson, O., Kontoyiannis, I. and Madiman, M. (2008). Log-concavity, ultra-log-concavity, and a maximum entropy property of discrete compound Poisson measures. Preprint. arXiv:0805.4112.
[15] Karlin, S. and Novikoff, A. (1963). Generalized convex inequalities. Pacific J. Math. 13 1251–1279.
[16] Karlin, S. and Rinott, Y. (1981). Entropy inequalities for classes of probability distributions I: The univariate case. Adv. in Appl. Probab. 13 93–112.
[17] Karlin, S. and Studden, W.J. (1966). Tchebycheff Systems: With Applications in Analysis and Statistics. New York: Interscience. MR0204922
[18] Kullback, S. (1959). Information Theory and Statistics. New York: Wiley. MR0103557
[19] Kullback, S. and Leibler, R.A. (1951). On information and sufficiency. Ann. Math. Statist. 22 79–86.
[20] Le Cam, L. (1960). An approximation theorem for the Poisson binomial distribution. Pacific J. Math. 10 1181–1197.
[21] Liggett, T.M. (1997). Ultra logconcave sequences and negative dependence. J. Combin. Theory Ser. A 79 315–325.
[22] Mateev, P. (1978). The entropy of the multinomial distribution. Teor. Veroyatn. Primen. 23 196–198.
[23] Pemantle, R. (2000). Towards a theory of negative dependence. J. Math. Phys. 41 1371–1390.
[24] Shaked, M. and Shanthikumar, J.G. (1994). Stochastic Orders and Their Applications. New York: Academic Press. MR1278322
[25] Shepp, L.A. and Olkin, I. (1981). Entropy of the sum of independent Bernoulli random variables and of the multinomial distribution. In Contributions to Probability 201–206. New York: Academic Press.
[26] Stein, C. (1986). Approximate Computation of Expectations. IMS Monograph Series 7. Hayward, CA: Inst. Math. Statist. MR0882007
[27] Whitt, W. (1985). Uniform conditional variability ordering of probability distributions. J. Appl. Probab. 22 619–633.
[28] Yu, Y. (2008). On the maximum entropy properties of the binomial distribution. IEEE Trans. Inform. Theory 54 3351–3353.
[29] Yu, Y. (2008). On an inequality of Karlin and Rinott concerning weighted sums of i.i.d. random variables. Adv. in Appl. Probab. 40 1223–1226.
[30] Yu, Y. (2009). Stochastic ordering of exponential family distributions and their mixtures. J. Appl. Probab. 46 244–254.
[31] Yu, Y. (2009). On the entropy of compound distributions on nonnegative integers. IEEE Trans. Inform. Theory 55 3645–3650.