An Inequality for the Sum of Independent Bounded Random Variables
Christopher R. Dance
Xerox Research Centre Europe, 6 chemin de Maupertuis, 38240 Meylan, France [email protected]
Phone: +33 4 76 61 51 37
September 2012
Abstract
We give a simple inequality for the sum of independent bounded random variables. This inequality improves on the celebrated result of Hoeffding in a special case. It is optimal in the limit where the sum tends to a Poisson random variable.
Modern machine learning and stochastic programming are largely based on inequalities relating to the sums of random variables. Hoeffding [2] proposed several such bounds, which were in turn improved by Talagrand [5], Pinelis [4] and Bentkus [1]. In this paper we prove the following related result.
Theorem 1.
Suppose that S = Σ_{i=1}^n X_i is a sum of independent random variables with P(0 ≤ X_i ≤ 1) = 1 for 1 ≤ i ≤ n and E S = λ. Then

P(S ≤ 1) ≤ max{ (1 + λ − λ/n)(1 − λ/n)^{n−1}, (1 − (λ−1)/(n−1))^{n−1} }   (1)
         ≤ max{1 + λ, e} e^{−λ}.   (2)

In this context, Hoeffding's inequality states that for all λ ≥ 1,

P(S ≤ 1) ≤ λ (1 − (λ−1)/(n−1))^{n−1}.   (3)
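As a quick numerical sanity check of (1)-(2), the following Python sketch (ours; the function names and the choice λ = 3, n = 10 are illustrative only) compares a Monte Carlo estimate of P(S ≤ 1) for a Bernoulli sum, the case that turns out to be extremal, with the two bounds.

import math
import random

def bound_1(lam, n):
    # Right-hand side of (1): the maximum of the two n-dependent terms.
    t1 = (1 + lam - lam / n) * (1 - lam / n) ** (n - 1)
    t2 = (1 - (lam - 1) / (n - 1)) ** (n - 1)
    return max(t1, t2)

def bound_2(lam):
    # Right-hand side of (2): the n-independent limit of (1).
    return max(1 + lam, math.e) * math.exp(-lam)

def empirical(lam, n, trials=200000):
    # Monte Carlo estimate of P(S <= 1) with S a sum of n
    # independent Bernoulli(lam/n) variables.
    p = lam / n
    hits = sum(1 for _ in range(trials)
               if sum(random.random() < p for _ in range(n)) <= 1)
    return hits / trials

lam, n = 3.0, 10
print("empirical P(S <= 1):", empirical(lam, n))   # ~0.149
print("bound (1):          ", bound_1(lam, n))     # 0.1493..., tight here
print("bound (2):          ", bound_2(lam))        # 0.1991...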
Theorem 1 is not as general as Hoeffding's inequality, since it only allows us to bound P(S ≤ 1) rather than P(S ≤ t) for any positive t. However, from Theorem 1 we may derive Corollary 1, which states that

P(S ≤ 1) ≤ e^{1 − r E S} where r = 0.8414... .   (4)

In contrast, the strongest such result that can be obtained from Hoeffding's bound is

P(S ≤ 1) ≤ e^{1 − (1 − e^{−1}) E S} where 1 − e^{−1} = 0.6321... .   (5)

Thus Theorem 1 improves on the Hoeffding bound.

It is interesting to compare our result with Theorem 1.2 of Bentkus [1] in the form of his inequality 1.1. This states that for a sequence of bounded independent random variables Y_i such that P(0 ≤ Y_i ≤ 1) = 1, we have
P(Σ_{i=1}^n Y_i ≥ x) ≤ e P(B_n ≥ x)   (6)

where B_n ∼ binomial(p, n) with p := Σ_{i=1}^n E Y_i / n. If we set X_i := 1 − Y_i and S = Σ_{i=1}^n (1 − Y_i) in order to match the random variables in our Theorem 1, Bentkus's result gives

P(S ≤ 1) = P(Σ_{i=1}^n Y_i ≥ n − 1) ≤ e (p^n + n(1 − p) p^{n−1}).

If we set m := E S, so that p = 1 − m/n, we have p^n = (1 − m/n)^n ≤ e^{−m} and p + n(1 − p) ≤ 1 + m, so that

P(S ≤ 1) ≤ e p^{n−1}(p + n(1 − p)) ≤ e p^{−1} (1 + E S) e^{−E S}.

This bound is a factor of e larger than our result for all E S ≥ e − 1.
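To make the comparison concrete, the following snippet (ours) tabulates the three n-independent quantities just discussed: our bound (2), the exponential envelope λ e^{1−λ} of Hoeffding's (3), and the Bentkus-derived expression e p^{−1}(1 + m) e^{−m} obtained above, at n = 100.

import math

def ours(lam):
    return max(1 + lam, math.e) * math.exp(-lam)          # bound (2)

def hoeffding(lam, n):
    return lam * (1 - (lam - 1) / (n - 1)) ** (n - 1)     # bound (3)

def bentkus(lam, n):
    p = 1 - lam / n
    return math.e * (1 + lam) * math.exp(-lam) / p        # derived above

n = 100
for lam in (2.0, 4.0, 8.0):
    print(f"lam={lam}: ours={ours(lam):.4f}  "
          f"hoeffding={hoeffding(lam, n):.4f}  bentkus={bentkus(lam, n):.4f}")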
To see that the bounds of Theorem 1 are tight, note that if S ∼ binomial(λ/n, n) then

P(S ≤ 1) = P(S = 0) + P(S = 1) = (1 − λ/n)^n + λ(1 − λ/n)^{n−1},

corresponding to the first term in the 'max' of (1), while if S ∼ 1 + binomial((λ−1)/(n−1), n−1) then we get the second term in the 'max':

P(S ≤ 1) = (1 − (λ−1)/(n−1))^{n−1}.

Similarly, in the n-independent form of our bound (2), if S ∼ Poisson(λ) then

P(S ≤ 1) = P(S = 0) + P(S = 1) = e^{−λ} + λ e^{−λ},

corresponding to the first term in the 'max' of (2), while if S ∼ 1 + Poisson(λ − 1) then we get the second term in the 'max':

P(S ≤ 1) = e^{1−λ}.

While the sum of a finite collection of bounded random variables Σ_{i=1}^n X_i cannot have a Poisson distribution, the law of small numbers implies that the Poisson distribution is the limit as n → ∞ of the sum of a suitable collection of random variables (X_i)_{i=1,2,...,n}. For instance, if each X_i is a Bernoulli random variable taking value 1 with probability λ/n and value 0 otherwise, then the following limiting probability mass function is Poisson:

lim_{n→∞} P(Σ_{i=1}^n X_i = x) = e^{−λ} λ^x / x!  for x ∈ Z_+.   (7)

Proof of Theorem 1

In this section, we define four families of random sums S_n, T_n, U_n and V_n. Then we present Lemmas 1, 3, 4 and 5 that relate these families, and combine these results to prove Theorem 1.

The random variable considered in Theorem 1 is from the family S_n of random variables S of the form

S := Σ_{i=1}^n X_i   (8)

where the X_i are independent random variables with X_i ∈ [0, 1]. Family T_n is the set of Bernoulli sums T of the form

T := Σ_{i=1}^n Y_i   (9)

where the Y_i are independent random variables taking values a_i or b_i with a_i, b_i ∈ [0, 1] for i = 1, 2, ..., n. Family U_n is the set of Bernoulli sums U of the form

U := Σ_{i=1}^n B_i   (10)

with each Bernoulli random variable B_i taking the value 0 or 1. Finally, family V_n is the set of shifted binomial random variables V with any parameter p ∈ [0, 1] and with either of the following two forms:

V ∼ binomial(p, n)  or  V ∼ 1 + binomial((np − 1)/(n − 1), n − 1).   (11)
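Both extremal forms, and the limit (7), are easy to confirm with exact probability computations; the short script below (ours; λ = 2.5 is an arbitrary choice) does so.

import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

lam = 2.5
# First form: binomial(lam/n, n), approaching (1 + lam) e^{-lam}.
for n in (10, 100, 1000):
    p01 = binom_pmf(0, n, lam / n) + binom_pmf(1, n, lam / n)
    print(f"n={n}: P(binomial(lam/n, n) <= 1) = {p01:.6f}")
print("Poisson limit (1 + lam) e^(-lam) =", (1 + lam) * math.exp(-lam))

# Second form: 1 + binomial((lam-1)/(n-1), n-1), approaching e^{1-lam}.
n = 1000
q = (lam - 1) / (n - 1)
print("P(1 + binomial(q, n-1) <= 1) =", (1 - q) ** (n - 1))
print("Poisson limit e^(1 - lam)    =", math.exp(1 - lam))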
Lemma 1. For any random sum S ∈ S_n, there exists a random sum T ∈ T_n such that E S = E T and P(S ≤ 1) ≤ P(T ≤ 1).

Lemma 1 follows directly from Theorem 8 of Mulholland and Rogers [3], which we state as Lemma 2.

Lemma 2. For each integer i with 1 ≤ i ≤ n, let f_{i1}(x), ..., f_{ik}(x) be Borel-measurable functions, and let K_i be the set of probability distribution functions F_i(x): R → [0, 1] satisfying

∫_{−∞}^{∞} f_{ij}(x) dF_i(x) = 0  for j = 1, 2, ..., k.   (12)

Let E_i be the set of functions from K_i that are step-functions having κ_i jumps at points x_{i1}, x_{i2}, ..., x_{iκ_i}, where 1 ≤ κ_i ≤ k + 1 and where the κ_i vectors

(1, f_{i1}(x_{ij}), ..., f_{ik}(x_{ij})),  j = 1, 2, ..., κ_i,

are linearly independent. Suppose that g(x_1, x_2, ..., x_n) is Borel-measurable as a function of the point (x_1, ..., x_n) ∈ R^n. Then

sup_{F_i(x) ∈ K_i} ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} g(x_1, ..., x_n) dF_1(x_1) ··· dF_n(x_n) = sup_{H_i(x) ∈ E_i} ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} g(x_1, ..., x_n) dH_1(x_1) ··· dH_n(x_n),

provided the left-hand side is finite.

Proof. See [3].

In the above Lemma, the conditions (12) can be interpreted as moment conditions on random variables X_i whose probability distribution functions are F_i, while the distribution functions in E_i correspond to random variables whose support consists of a finite set of κ_i points and which satisfy conditions (12).

Proof of Lemma 1. Let 1_C denote the indicator function for condition C and let X_i be the random variables defining S for 1 ≤ i ≤ n. In Lemma 2, put

f_{i1}(x) := 1_{0 ≤ x ≤ 1} − 1,  f_{i2}(x) := x − E X_i,   (13)

which are both Borel-measurable functions. Then the set E_i of distribution functions corresponds to the set of random variables Z_i which take on 1 ≤ κ_i ≤ 3 values z_{i1}, ..., z_{iκ_i}, which satisfy

P(0 ≤ Z_i ≤ 1) = 1 and E Z_i = E X_i   (14)

and for which the vectors (1, f_{i1}(z_{ij}), f_{i2}(z_{ij})) are linearly independent for 1 ≤ j ≤ κ_i.

We now rule out the case κ_i = 3, since if P(0 ≤ Z_i ≤ 1) = 1 then the jumps must satisfy

(1, f_{i1}(z_{ij}), f_{i2}(z_{ij})) = (1, 1_{0 ≤ z_{ij} ≤ 1} − 1, z_{ij} − E X_i) = (1, 0, z_{ij} − E X_i),   (15)

and at most two vectors of this form can be linearly independent. Thus the random variables Z_i take on at most two values, say a_i, b_i ∈ [0, 1], and the Z_i match the definition of the random variables Y_i defining the sum T.

Finally, if we set g(x_1, ..., x_n) := 1_{Σ_{i=1}^n x_i ≤ 1}, which is Borel-measurable, and identify the distribution functions F_i(x) with those of the random variables X_i, then Lemma 2 gives

P(Σ_{i=1}^n X_i ≤ 1) ≤ P(Σ_{i=1}^n Z_i ≤ 1)   (16)

for some two-point random variables Z_i as above, since P(Σ_{i=1}^n X_i ≤ 1) ∈ [0, 1] is finite. This completes the proof.
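Lemma 1 is easy to probe empirically. In the sketch below (ours), each Uniform[0, 1] summand is replaced by a Bernoulli(1/2) variable with the same mean, and P(S ≤ 1) can only increase, as the lemma predicts; here the uniform probability is exactly 1/n!, the volume of the standard simplex.

import math
import random

def mc_uniform(n, trials=200000):
    # Monte Carlo estimate of P(U_1 + ... + U_n <= 1), U_i ~ Uniform[0, 1].
    return sum(sum(random.random() for _ in range(n)) <= 1
               for _ in range(trials)) / trials

n = 4
print("uniform summands:   ", mc_uniform(n), "(exact:", 1 / math.factorial(n), ")")
print("two-point summands: ", (n + 1) / 2 ** n)   # P(binomial(1/2, n) <= 1)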
Lemma 3.
For any T ∈ T_n there exists a U ∈ U_1 ∪ U_2 ∪ ··· ∪ U_n such that

P(T ≤ 1) ≤ P(U ≤ 1) and E T ≤ E U.   (17)

Proof.
We use induction on n. If n = 1 then we set U = 1, so that P(T ≤ 1) = P(U ≤ 1) = 1 and E T ≤ E U, directly satisfying (17).

If n > 1, consider T. Recall that T = Σ_{i=1}^n Y_i where Y_i ∈ {a_i, b_i} and 0 ≤ a_i ≤ b_i ≤ 1. Thus we may write

T = a + Σ_{i=1}^n c_i B_i   (18)

where a := Σ_{i=1}^n a_i ≥ 0, c_i := b_i − a_i ∈ [0, 1] and where the B_i are independent Bernoulli random variables.

In the first case, if a > 1, take U = Σ_{i=1}^n B′_i with P(B′_i = 1) = 1 for all i. Then P(T ≤ 1) = P(U ≤ 1) = 0 and E T ≤ Σ_{i=1}^n b_i ≤ n = E U, satisfying (17).

Secondly, if c_i + c_j ≤ 1 − a for some pair i, j with i ≠ j, then consider the sum

S := X_{ij} + Σ_{k ∈ {1,2,...,n}\{i,j}} Y_k  where X_{ij} := Y_i + Y_j.   (19)

We have X_{ij} = a_i + a_j + c_i B_i + c_j B_j ≤ a + c_i + c_j ≤ 1 for any values of B_i, B_j. Thus S ∈ S_{n−1} and we can apply Lemma 1 to show that E T = E S = E T′ and P(T ≤ 1) = P(S ≤ 1) ≤ P(T′ ≤ 1) for some T′ ∈ T_{n−1}. The Lemma then follows by induction.

Otherwise, we have a ∈ [0, 1], c_i ∈ [0, 1] and c_i + c_j > 1 − a for all i and all j ≠ i. The key observation is that the latter condition implies that

P(T ≤ 1) = P(Σ_{i∈C} B_i ≤ 1, Σ_{i∈D} B_i = 0)  where C := {i : c_i ≤ 1 − a}, D := {i : c_i > 1 − a}.

If Σ_{i∈C} E B_i < 1 or C = ∅, then we put U = 1 + Σ_{i∈D} B_i, noting that U ∈ ∪_{m=1}^n U_m, giving

P(U ≤ 1) = P(Σ_{i∈D} B_i = 0) ≥ P(Σ_{i∈C} B_i ≤ 1, Σ_{i∈D} B_i = 0) = P(T ≤ 1)

and

E U = 1 + Σ_{i∈D} E B_i ≥ a + (1 − a) Σ_{i∈C} E B_i + Σ_{i∈D} E B_i  (as Σ_{i∈C} E B_i < 1)
    ≥ a + Σ_{i∈C} c_i E B_i + Σ_{i∈D} c_i E B_i = E T,

satisfying (17).

Finally, if Σ_{i∈C} E B_i ≥ 1 and C ≠ ∅, then we put U = Σ_{i=1}^n B_i, noting that U ∈ U_n, so that

P(U ≤ 1) = P(Σ_{i∈C} B_i + Σ_{i∈D} B_i ≤ 1) ≥ P(Σ_{i∈C} B_i ≤ 1, Σ_{i∈D} B_i = 0) = P(T ≤ 1)

and

E U = Σ_{i∈C} E B_i + Σ_{i∈D} E B_i ≥ a + (1 − a) Σ_{i∈C} E B_i + Σ_{i∈D} E B_i ≥ a + Σ_{i∈C} c_i E B_i + Σ_{i∈D} c_i E B_i = E T,

satisfying (17) and completing the proof.
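The key observation in the final case can be verified by brute force. The following check (ours; the sampling ranges merely force the condition c_i + c_j > 1 − a) enumerates all 2^n outcomes and confirms that the two probabilities agree.

import itertools
import random

def check(n=4, seed=0):
    rng = random.Random(seed)
    a = rng.uniform(0.0, 0.2)                  # small offset a = sum of the a_i
    # Each c_i > (1 - a)/2, so c_i + c_j > 1 - a for every pair i != j.
    c = [rng.uniform((1 - a) / 2 + 1e-9, 1.0) for _ in range(n)]
    p = [rng.uniform(0.0, 1.0) for _ in range(n)]   # P(B_i = 1)
    C = [i for i in range(n) if c[i] <= 1 - a]
    D = [i for i in range(n) if c[i] > 1 - a]
    lhs = rhs = 0.0
    for bits in itertools.product((0, 1), repeat=n):
        prob = 1.0
        for i, b in enumerate(bits):
            prob *= p[i] if b else 1.0 - p[i]
        if a + sum(c[i] * bits[i] for i in range(n)) <= 1:   # event {T <= 1}
            lhs += prob
        if sum(bits[i] for i in C) <= 1 and all(bits[i] == 0 for i in D):
            rhs += prob
    return lhs, rhs

print(check())   # the two probabilities coincide up to float rounding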
Lemma 4. For any U ∈ U_n with n ≥ 1, there exists a V ∈ ∪_{m=1}^n V_m such that

P(U ≤ 1) ≤ P(V ≤ 1) and E U = E V.   (20)

Proof. Let U := Σ_{i=1}^n B_i where the B_i are Bernoulli random variables, q_i := E B_i and q := (q_1, q_2, ..., q_n). We have

P(U ≤ 1) = P(Σ_{i=1}^n B_i = 0) + P(Σ_{i=1}^n B_i = 1)   (21)
         = Π_{i=1}^n (1 − q_i) + Σ_{j=1}^n q_j Π_{i=1:n, i≠j} (1 − q_i) =: L_n(q).   (22)

Consider maximizing L_n(q) over q ∈ {[0, 1]^n | λ = Σ_{i=1}^n q_i}, noting that maxima might lie on the interior with q_i ∈ (0, 1) for all 1 ≤ i ≤ n or on the boundary with q_i ∈ {0, 1} for some i. Since L_n(q) is a differentiable function of q, any critical point of L_n(q) on the interior with q ∈ {(0, 1)^n | λ = Σ_{i=1}^n q_i} must satisfy

∇_{q_k} ( L_n(q) + μ Σ_{i=1}^n q_i ) = 0  for all 1 ≤ k ≤ n   (23)

for a suitable Lagrange multiplier μ. However, L_n(q) is a symmetric linear function of each q_k. So if n ≥ 2, any such critical point has q_k = q_l for all k ≠ l in 1, ..., n. Thus q = (λ/n, λ/n, ..., λ/n), for which U corresponds to the random variable V_{n,1} ∼ binomial(λ/n, n), which is in V_n and has E V_{n,1} = λ.

If q_i = 1 for some i and n ≥ 2 then

L_n(q) = Π_{j=1:n, j≠i} (1 − q_j) ≤ (1 − (λ−1)/(n−1))^{n−1}.   (24)

The right-hand side is P(V_{n,2} ≤ 1) for the random variable V_{n,2} ∼ 1 + binomial((λ−1)/(n−1), n−1), for which V_{n,2} ∈ V_n and E V_{n,2} = λ.

If q_i = 0 for some i and n ≥ 2 then inspection of L_n(q) gives

L_n(q) = L_{n−1}(q^{−i})  where q^{−i} := (q_1, ..., q_{i−1}, q_{i+1}, ..., q_n).   (25)

However, to have q_i = 0 for some i we require that λ = Σ_{j=1, j≠i}^n q_j ≤ n − 1. Thus for q ∈ {[0, 1]^n | λ = Σ_{i=1}^n q_i} and n ≥ 2,

L_n(q) ≤ max_{1≤i≤n} { L_{n−1}(q^{−i}), P(V_{n,1} ≤ 1), P(V_{n,2} ≤ 1) }  if 0 ≤ λ ≤ n − 1,
L_n(q) ≤ max { P(V_{n,1} ≤ 1), P(V_{n,2} ≤ 1) }  if n − 1 < λ ≤ n,

and for n = 1, consider the random variable V_{1,1} ∼ binomial(λ, 1), for which V_{1,1} ∈ V_1, P(U ≤ 1) = P(V_{1,1} ≤ 1) and λ = E V_{1,1}. Thus

P(U ≤ 1) ≤ max_{V ∈ ∪_{m=1}^n V_m, E V = λ} P(V ≤ 1),

which completes the proof.
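Lemma 4 can be probed by random search: no admissible q should make L_n(q) exceed the best candidate in V_1 ∪ ··· ∪ V_n. A small illustration (ours, with n = 6 and λ = 3 chosen arbitrarily):

import math
import random

def L(q):
    # P(U <= 1) for U a sum of independent Bernoulli(q_i), as in (22).
    n = len(q)
    p0 = math.prod(1 - qi for qi in q)
    p1 = sum(q[j] * math.prod(1 - q[i] for i in range(n) if i != j)
             for j in range(n))
    return p0 + p1

def best_candidate(n, lam):
    # max of P(V <= 1) over V in V_1, ..., V_n with E V = lam.
    vals = []
    for m in range(1, n + 1):
        if lam <= m:
            vals.append(L([lam / m] * m))                        # binomial(lam/m, m)
        if m >= 2 and 1 <= lam <= m:
            vals.append((1 - (lam - 1) / (m - 1)) ** (m - 1))    # 1 + binomial(., m-1)
    return max(vals)

rng = random.Random(1)
n, lam = 6, 3.0

def random_q():
    while True:
        q = [rng.random() for _ in range(n)]
        s = sum(q)
        q = [qi * lam / s for qi in q]
        if max(q) <= 1:
            return q

best = max(L(random_q()) for _ in range(20000))
print(f"best random L_n(q) = {best:.5f}  <=  best candidate = {best_candidate(n, lam):.5f}")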
Lemma 5. Let H_n(λ) := sup{ P(V ≤ 1) | V ∈ V_n, E V = λ }, with the convention that sup ∅ = 0. Then

H_n(λ) ≤ H_{n′}(λ′)  for all 0 ≤ λ′ ≤ λ and all 1 ≤ n ≤ n′.   (26)
The definition of V n gives H n ( λ ) = ≤ λ ≤ { F n ( λ ) , G n ( λ ) } if 1 < λ < n n ≤ λ (27)where for 1 ≤ λ ≤ nF n ( λ ) := P (cid:18) binomial (cid:18) λn , n (cid:19) ≤ (cid:19) = (cid:18) − λn (cid:19) n + λ (cid:18) − λn (cid:19) n − (28) G n ( λ ) := P (cid:18) (cid:18) λ − n − , n − (cid:19) ≤ (cid:19) = (cid:18) − λ − n − (cid:19) n − , (29)so let us collect some facts about F n ( λ ) and G n ( λ ).First, set x := 1 − λ/n so we have n = λ/ (1 − x ) andlog F n ( λ ) = log( x n + λx n − ) = λ log x − x + log (cid:18) λx (cid:19) =: g ( x ) . (30) ow (1 − x ) λ ∇ g ( x ) = log x + 1 − xx − (1 − x ) x ( x + λ ) =: u ( x ) (31)and ∇ u ( x ) = x − x ( x + λ ) ( x + ( λ − x + λ − λ ) . (32)Note that min x ∈ R ( x + ( λ − x + λ − λ ) = (3 λ − /
4. Thus if λ ≥ / √ ∇ u ( x ) ≤ < x ≤ u ( x ) ≥ u (1) = 0. Thus ∇ g ( x ) ≥
0, by (31), so that log F n ( λ ) is increasing in n , by(30) and from the fact that n = λ/ (1 − x ) is increasing in x for fixed λ . Hence F n ( λ ) ≤ F n +1 ( λ ) for all 2 / √ ≤ λ < n and n ≥
1. (33)Second, Taylor expansion giveslog G n +1 ( λ ) = n log (cid:18) − λn (cid:19) = − λ − ∞ X k =2 λ k kn k − for (cid:12)(cid:12)(cid:12)(cid:12) λn (cid:12)(cid:12)(cid:12)(cid:12) < n for λ ≥
0. Thus G n ( λ ) ≤ G n +1 ( λ ) for all 0 ≤ λ < n and n ≥ . (35)Third, considering the range of λ for which F n ( λ ) ≤ G n ( λ ) gives (cid:18) − λn (cid:19) n + λ (cid:18) − λn (cid:19) n − ≤ (cid:18) − λ − n − (cid:19) n − (36) ⇔ − λn + λ ≤ − λ − n − − λn ! n − = (cid:18) nn − (cid:19) n − (37) ⇔ λ ≤ (cid:18) nn − (cid:19) n − nn − . (38)Applying the inequality log x ≤ x − x = n − n we see that n log n − n ≤ −
1, hence (cid:16) nn − (cid:17) n ≥ e . Sofor n ≥ (cid:18) nn − (cid:19) n − nn − ≥ e − > √ . (39)Additionally G ( λ ) − F ( λ ) = ( λ − / ≥ λ ∈ R . In conjunction with (38) and (39) thisgives F n ( λ ) ≤ G n ( λ ) for all λ < √ n ≥
2. (40)Now consider the function H n ( λ ). The definition of H n ( λ ) gives H n ( λ ) = H n +1 ( λ ) = 1 for all 0 ≤ λ ≤ n ≥ H n ( λ ) ≤ H n +1 ( λ ) for all n ≤ λ < n + 1 and n ≥ H n ( λ ) = H n +1 ( λ ) = 0 for all λ ≥ n + 1 and n ≥
1. (43)For all 1 < λ < √ and all n ≥
2, (35) and (40) give H n ( λ ) = max { F n ( λ ) , G n ( λ ) } = G n ( λ ) ≤ G n +1 ( λ ) (44) ≤ max { F n +1 ( λ ) , G n +1 ( λ ) } = H n +1 ( λ ) . (45)For all √ ≤ λ < n and all n ≥
2, (33) and (35) give H n ( λ ) = max { F n ( λ ) , G n ( λ ) } (46) ≤ max { F n +1 ( λ ) , G n +1 ( λ ) } = H n +1 ( λ ) . (47)In summary H n ( λ ) ≤ H n +1 ( λ ) for all λ ≥ n ≥ H n ( λ ) is non-decreasing in n .Finally, ∇ λ F n ( λ ) = − n − n (cid:0) − λn (cid:1) n − ≤ ∇ λ G n ( λ ) ≤
0, so H n ( λ ) is non-increasing in λ .This completes the proof. heorem 1. By Lemmas 1, 3 and 4, there exist random sums T ∈ T n , U ∈ ∪ nm =1 U m and V ∈ ∪ nm =1 V m such that P ( S ≤ ≤ P ( T ≤ ≤ P ( U ≤ ≤ P ( V ≤
1) and E S = E T ≤ E U = E V. Say that V ∈ V m for some 1 ≤ m ≤ n and let λ V := E V . Then λ V ≥ λ =: E S as just shown, soLemma 5 gives P ( V ≤ ≤ H m ( λ V ) ≤ H n ( λ ) . Now, by definition of H n ( λ ), for 0 ≤ λ ≤ n we have H n ( λ ) = max { F n ( λ ) , G n ( λ ) } where F n ( λ ) := (cid:0) λ − λn (cid:1) (cid:0) − λn (cid:1) n − and G n ( λ ) := (cid:16) − λ − n − (cid:17) n − , so that P ( S ≤ ≤ max { F n ( λ ) , G n ( λ ) } (49)which proves (1). Furthermore, Lemma 5 gives H n ( λ ) ≤ lim m →∞ max { F m ( λ ) , G m ( λ ) } = max { λ, e } e − λ (50)which completes the proof. If we wish to bound the expectation of a random sum, then Theorem 1 can be conveniently rearrangedas follows.
Corollary 1.
Suppose that S = P ni =1 X i is a sum of independent random variables with P (0 ≤ X i ≤
1) = 1 for ≤ i ≤ n . Then for r = 0 . . . . , we have P ( S ≤ ≤ e − r E S or equivalently E S ≤ r (1 − log P ( S ≤ . (51) Proof.
We work with the right-hand side of Theorem 1 to find the smallest a such thatmax { e, m } e − m ≤ e − m + am for all m ≥
0, or equivalently, such thatlog(1 + m ) − am ≤ . For fixed a ≥ m = a −
1. Substituting this m , we require that a − log a ≤ . Now the function a − log a is decreasing for a ≤
1, thus we require that a ≥ a where a is the root of a = e a − having a ≤
1. A fixed point method yields the solution a = 0 . · · · = 1 − r . Acknowledgements
Acknowledgements

The author thanks the anonymous reviewer, Christophe Leuridan, Bin Yu, Nicolò Cesa-Bianchi, Onno Zoeter and Shengbo Guo for helpful comments and discussions. The final version of this publication is (or will be) available at springerlink.com in the Journal of Theoretical Probability.
References

[1] Bentkus, V.: On Hoeffding's inequalities. The Annals of Probability 32(2), 1650-1673 (2004)
[2] Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58, 13-30 (1963)
[3] Mulholland, H.P., Rogers, C.A.: Representation theorems for distribution functions. Proceedings of the London Mathematical Society (3) 8, 177-223 (1958)
[4] Pinelis, I.: Optimal tail comparison based on comparison of moments. In: High Dimensional Probability, Progress in Probability 43, 297-314 (1998)
[5] Talagrand, M.: The missing factor in Hoeffding's inequalities. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques 31(4), 689-702 (1995)