New-Type Hoeffding's Inequalities and Application in Tail Bounds
Pingyi Fan, Senior Member, IEEE

Abstract—It is well known that Hoeffding's inequality has many applications in the signal and information processing fields. How to improve Hoeffding's inequality and refine its applications has always attracted much attention. An improvement of Hoeffding's inequality was recently given by Hertz [1]. Even though the improvement is not large, it can still be used to update many known results based on the original Hoeffding's inequality, especially the Hoeffding-Azuma inequality for martingales. However, the original Hoeffding's inequality and its refinement by Hertz only consider the first-order moment of random variables. In this paper, we present a new type of Hoeffding's inequalities in which higher-order moments of random variables are taken into account. They yield considerable improvements in tail bound evaluation compared with the known results. It is expected that the developed new-type Hoeffding's inequalities will find further interesting applications in related fields that use Hoeffding's results.
Index Terms—Hoeffding's Lemma, Hoeffding's tail bounds, Azuma inequality, Chernoff's bound.
I. INTRODUCTION
It is well known that Hoeffding's inequality has been applied in many scenarios in the signal and information processing fields. Since Hoeffding's inequality was first presented in 1963 [2], it has attracted much attention in academic research [3], [4] and in industry. In particular, in the last decade it has been used to evaluate channel code designs [5], [6], achievable rates over nonlinear channels [7], and the delay performance of CSMA with linear virtual channels under a general topology [8] in information theory [9]. As one key tool, it has also found applications in machine learning and big data processing, e.g. PAC-Bayesian analysis and Markov model analysis in machine learning [10], [11], statistical model bias analysis [12], concept drift in online learning for big data mining [13], and compressed sensing of high-dimensional sparse functions [14]. It has likewise been employed in biomedical fields, e.g. in developing computational molecular modelling tools [15] and in analyzing level set estimation in medical imaging and pattern recognition [16]. Due to its wide applicability, refinements and improvements of Hoeffding's inequality and of the Hoeffding-Azuma inequality for martingales usually lead to new insights in the related fields. Recently, Hertz [1] presented an improvement of the original Hoeffding's inequality that exploits the asymmetry of the finite distribution interval of the random variable. It reduces the relevant exponential coefficient from the arithmetic
mean of $|a|$ and $b$ to their geometric mean, where $[a, b]$ ($a < 0$, $b > 0$) is the distribution interval of the random variable $X$. This improvement motivated us to improve Hoeffding's inequality further. For simplicity, let us first review Hoeffding's inequality [2] and the improvement obtained by Hertz [1].

Pingyi Fan is with the Beijing National Research Center for Information Science and Technology and the Department of Electronic Engineering, Tsinghua University, Beijing 100084, China (e-mail: [email protected]).

A. Hoeffding's Inequality and An Improvement
Assume that $X$ is a zero-mean real-valued random variable with $X \in [a, b]$, $a < 0$, $b > 0$. Hoeffding's lemma states that for all $s \in \mathbb{R}$, $s > 0$,

$E[e^{sX}] \le \exp\{ s^2 (b-a)^2 / 8 \}$.   (1)

Recently, D. Hertz presented an improved result of the following form:

$E[e^{sX}] \le \exp\{ s^2 \Phi^2(a,b) / 2 \}$,   (2)

where

$\Phi(a,b) = (|a|+b)/2$ if $b > |a|$, and $\Phi(a,b) = \sqrt{|a| b}$ if $b \le |a|$.   (3)

Since $\sqrt{|a| b} \le (|a|+b)/2$, this gives a tighter upper bound for $-a > b$ than the original Hoeffding's inequality. Motivated by this result, an interesting question arises: can we further improve Hoeffding's inequality, and if so, how? In this paper, we derive a new type of Hoeffding's inequalities in which, besides $E(X) = 0$, the higher-order moments of the random variable $X$ are taken into account, i.e. $E(X^k) = m_k$ ($k = 2, 3, \ldots$).

B. Main Theorem
To give a clear picture of this paper, the new type of Hoeffding's inequalities is given first, as follows.
Theorem 1:
Assume that $X$ is a real-valued random variable with $E(X) = 0$ and $X \in [a, b]$, $a < 0$, $b > 0$. For all $s \in \mathbb{R}$, $s > 0$, and any integer $k$ ($k \ge 1$), we have

$E[e^{sX}] \le \Upsilon_k(a,b) \exp\{ s^2 \Phi^2(a,b) / (2k) \}$,   (4)

where

$\Upsilon_k(a,b) = [1 + \max\{|a|,b\}/|a|]^k - k \max\{|a|,b\}/|a|$,   (5)

$\Phi(a,b) = (|a|+b)/2$ if $b > |a|$, and $\Phi(a,b) = \sqrt{|a| b}$ if $b \le |a|$.   (6)

Remark 1.
When $k = 1$, it is easy to check that $\Upsilon_1(a,b) = 1$. This indicates that the new-type Hoeffding's inequality reduces to the improved Hoeffding's inequality (2), which is still better than the original Hoeffding's inequality. When $k = 2$, $\Upsilon_2(a,b) = 1 + \{\max\{|a|,b\}/|a|\}^2$ and the exponential coefficient is halved compared to the improved Hoeffding's inequality (2). In fact, this result can be refined further, as given by the following corollary.

Corollary 1:
Under the same assumptions as Theorem 1, for $k = 2$ we have

$E[e^{sX}] \le [1 + m_2/a^2] \exp\{ s^2 \Phi^2(a,b) / 4 \}$,   (7)

where $m_2 = E(X^2)$. If $E(X^2)$ is unknown, the inequality can be relaxed as

$E[e^{sX}] \le [1 + b/|a|] \exp\{ s^2 \Phi^2(a,b) / 4 \}$   (8)

and

$E[e^{sX}] \le 2 \exp\{ s^2 \Phi^2(a,b) / 4 \}$ if $|a| \ge b$.   (9)

Comparing the result in eqn. (8) with that presented in Theorem 1, it is easy to check that

$[1 + b/|a|] \le 1 + \{\max\{|a|,b\}/|a|\}^2$   (10)

holds. This indicates that Corollary 1 indeed improves the result presented in Theorem 1 for $k = 2$. Compared with eqn. (2), the exponential coefficient is halved. That is to say, when the parameter $s$ is relatively large, the new type of Hoeffding's inequalities gives much tighter results than the original Hoeffding's inequality and its improvement obtained by Hertz.

The remaining part of this paper is organized as follows. In Section II, we first present the proof of Corollary 1, show the insight gained by taking higher-order moments of real-valued random variables into account, and then present the proof of the main theorem. In Section III, we present applications of the new-type Hoeffding's inequalities to one-sided and two-sided tail bounds. In Section IV, we discuss how to select the integer parameter $k$ to obtain a tighter bound. Finally, Section V concludes the paper.

II. THE PROOF OF MAIN THEORETICAL RESULTS
Let us first introduce some Lemmas.
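Before doing so, the bounds (1), (2) and (4) reviewed above can be compared numerically. The following is a quick sanity-check sketch; the zero-mean two-point distribution and all numeric values below are illustrative choices, not fixed by the paper.

```python
import math

def phi(a, b):
    # Phi(a,b) from eqn (3): arithmetic mean of |a| and b when b > |a|,
    # geometric mean when b <= |a|  (assumes a < 0 < b)
    return (abs(a) + b) / 2 if b > abs(a) else math.sqrt(abs(a) * b)

def upsilon(a, b, k):
    # Upsilon_k(a,b) from eqn (5)
    M = max(abs(a), b) / abs(a)
    return (1 + M) ** k - k * M

def theorem1_bound(a, b, s, k):
    # eqn (4): Upsilon_k(a,b) * exp{s^2 Phi^2(a,b) / (2k)}
    return upsilon(a, b, k) * math.exp(s ** 2 * phi(a, b) ** 2 / (2 * k))

# Zero-mean two-point law: P(X=a) = b/(b-a), P(X=b) = -a/(b-a)
a, b, s = -3.0, 1.0, 0.8
mgf = b / (b - a) * math.exp(s * a) + (-a) / (b - a) * math.exp(s * b)
hoeffding = math.exp(s ** 2 * (b - a) ** 2 / 8)   # eqn (1)
hertz = math.exp(s ** 2 * phi(a, b) ** 2 / 2)     # eqn (2)

assert mgf <= hertz <= hoeffding                  # Hertz tighter since |a| > b
assert abs(theorem1_bound(a, b, s, 1) - hertz) < 1e-12   # k = 1 recovers (2)
# For a larger s, k = 2 wins despite the prefactor Upsilon_2 = 2:
assert theorem1_bound(-2.0, 2.0, 3.0, 2) < theorem1_bound(-2.0, 2.0, 3.0, 1)
```

The last assertion illustrates the trade-off discussed below: as $s$ grows, the exponent reduction by the factor $k$ eventually outweighs the larger prefactor $\Upsilon_k$.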
A. Some Useful Lemmas

Lemma 1:
Suppose $f(x)$ is a convex function of $x$ with $f(x) > 0$ for $x \in [a, b]$. Then we have the following results.

(i) $f(x) \le \frac{b-x}{b-a} f(a) + \frac{x-a}{b-a} f(b)$;

(ii) $f^2(x)$ is also a convex function of $x$, and both $f^2(x) \le \big[\frac{b-x}{b-a} f(a) + \frac{x-a}{b-a} f(b)\big]^2$ and $f^2(x) \le \frac{b-x}{b-a} f^2(a) + \frac{x-a}{b-a} f^2(b)$ hold.

The proof of Lemma 1 follows directly from the definition of a convex function together with $(f^2(x))' = 2 f(x) f'(x)$ and $(f^2(x))'' = 2 (f'(x))^2 + 2 f(x) f''(x) > 0$.

Lemma 2:
Assume that $X$ is a real-valued random variable with $E(X) = 0$ and $P(X \in [a, b]) = 1$, $a < 0$, $b > 0$. We have

(i) $E(X^2) \le |a| b$;   (11)

(ii) $E(X^4) \le |a| b (a^2 + ab + b^2) \le |a| b (|a| + b)^2$.   (12)

Proof: (i) Since $f(x) = x^2$ is a convex function of $x$ on $[a, b]$, we have

$x^2 \le \frac{b-x}{b-a} a^2 + \frac{x-a}{b-a} b^2$,   (13)

$E(X^2) \le \frac{b}{b-a} a^2 + \frac{-a}{b-a} b^2 = |a| b$.   (14)

(ii) Since $f(x) = x^2$ is convex with $f(x) \ge 0$, we know that $f^2(x) = x^4$ is also a convex function of $x$ according to Lemma 1. Then we have

$x^4 \le \frac{b-x}{b-a} a^4 + \frac{x-a}{b-a} b^4$,   (15)

$E(X^4) \le \frac{b}{b-a} a^4 + \frac{-a}{b-a} b^4 = |a| b (a^2 + ab + b^2) \le |a| b (|a| + b)^2$,   (16)

where the last step uses $ab < 0$.

Lemma 3:
For $0 < \lambda < 1$ and $u > 0$, let

$\psi(u) = -\lambda u + \ln(1 - \lambda + \lambda e^u)$.   (17)

Then we have

$\psi(u) = 0.5\, \tau (1-\tau) u^2$,   (18)

where

$\tau = \frac{\lambda}{(1-\lambda) e^{-\xi} + \lambda}, \quad \xi \in [0, u]$.   (19)

In addition, we have

$\psi(u) \le u^2/8$ if $\lambda \le 0.5$, and $\psi(u) \le \lambda(1-\lambda) u^2 / 2$ if $\lambda > 0.5$.   (20)

This lemma was derived in [1]. For completeness, we reorganize its proof as follows.

Proof:
Since

$\psi(u) = -\lambda u + \ln(1 - \lambda + \lambda e^u)$,   (21)

for $u > 0$ one can use Taylor's expansion and obtain

$\psi(u) = \psi(0) + \psi'(0) u + 0.5\, \psi''(\xi) u^2, \quad \xi \in [0, u]$.   (22)

It is easy to check that $\psi(0) = 0$ and

$\psi'(u) = -\lambda + \frac{\lambda e^u}{1 - \lambda + \lambda e^u}$,   (23)

$\psi''(u) = \frac{\lambda e^u}{1 - \lambda + \lambda e^u} \Big( 1 - \frac{\lambda e^u}{1 - \lambda + \lambda e^u} \Big)$.   (24)

That means $\psi'(0) = 0$ and

$\psi''(\xi) = \tau (1 - \tau)$,   (25)

where

$\tau = \frac{\lambda}{(1-\lambda) e^{-\xi} + \lambda}, \quad \xi \in [0, u]$.   (26)

That is,

$\psi(u) = 0.5\, \tau (1-\tau) u^2$.   (27)

Now let us consider two cases. (a) If $\lambda > 0.5$, then

$\tau = \frac{\lambda}{(1-\lambda) e^{-\xi} + \lambda} \ge \lambda > 0.5$.   (28)

Since $\tau(1-\tau)$ is decreasing for $\tau > 0.5$, it reaches its maximum over this range at $\tau = \lambda$; in other words, $\tau(1-\tau) \le \lambda(1-\lambda)$. (b) If $\lambda \le 0.5$, then we have $\tau(1-\tau) \le 1/4$. Combining cases (a) and (b), we get

$\psi(u) \le u^2/8$ if $\lambda \le 0.5$, and $\psi(u) \le \lambda(1-\lambda) u^2/2$ if $\lambda > 0.5$.   (29)

The proof is completed.

B. Observation from Corollary 1
Now let us revisit Corollary 1. It claims that, under the same assumptions as Theorem 1, for $k = 2$ we have

$E[e^{sX}] \le [1 + m_2/a^2] \exp\{ s^2 \Phi^2(a,b) / 4 \}$,   (30)

where $m_2 = E(X^2)$. Before presenting the proof of Corollary 1, let us analyze why such a new type of Hoeffding's inequality can halve the exponential coefficient. Since $f(x) = \exp(\alpha x)$ is a convex function for any $\alpha > 0$, let $\alpha = 2\tilde{s}$; then, by Lemma 1(ii),

$E(\exp(2\tilde{s}X)) \le \frac{b^2 + m_2}{(b-a)^2} \exp(2\tilde{s}a) + \frac{m_2 + a^2}{(b-a)^2} \exp(2\tilde{s}b) + \frac{2(-ab - m_2)}{(b-a)^2} \exp(\tilde{s}a)\exp(\tilde{s}b)$
$= \frac{b^2 + m_2}{(b-a)^2} \exp(2\tilde{s}a) + \frac{m_2 + a^2}{(b-a)^2} \exp(2\tilde{s}b) + \frac{2(-ab - m_2)}{(b-a)^2} \exp\big(2\tilde{s}\,\frac{a+b}{2}\big)$.   (31)

The inequality above can be rewritten as

$E(\exp(\alpha X)) \le \frac{b^2 + m_2}{(b-a)^2} \exp(\alpha a) + \frac{m_2 + a^2}{(b-a)^2} \exp(\alpha b) + \frac{2(-ab - m_2)}{(b-a)^2} \exp\big(\alpha\,\frac{a+b}{2}\big)$.   (32)

Using Lemma 2, it is easy to see that all of the weighting coefficients of $\exp(\alpha a)$, $\exp(\alpha b)$ and $\exp(\alpha\frac{a+b}{2})$ are non-negative, and

$\frac{b^2 + m_2}{(b-a)^2} + \frac{m_2 + a^2}{(b-a)^2} + \frac{2(-ab - m_2)}{(b-a)^2} = 1$.   (33)

Now, writing $s = \alpha$ in inequality (32), we have

$E(\exp(sX)) \le \frac{b^2 + m_2}{(b-a)^2} \exp(sa) + \frac{m_2 + a^2}{(b-a)^2} \exp(sb) + \frac{2(-ab - m_2)}{(b-a)^2} \exp\big(s\,\frac{a+b}{2}\big)$.   (34)

The right-hand side is a weighted linear combination of $\exp(sa)$, $\exp(sb)$ and $\exp(s\frac{a+b}{2})$. That is to say, one can use the information provided by three points to estimate the upper bound of $E(\exp(sX))$, which provides more information than the two-point linear combination of $\exp(sa)$ and $\exp(sb)$ alone. Similarly, if one uses the information of the function $\exp(sx)$ at multiple points, the upper bound on $E(\exp(sX))$ may be improved further; this is why we consider higher-order moments of random variables when discussing improvements of Hoeffding's inequality.

C. Proof of Corollary 1
Now let us present the proof of Corollary 1.
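As a quick numerical sanity check of (7)-(9) before the formal proof, consider the following sketch; the uniform three-point distribution below is an illustrative assumption, not one fixed by the paper.

```python
import math

# Corollary 1 (k = 2): E[e^{sX}] <= (1 + m2/a^2) exp{s^2 Phi^2(a,b)/4},
# with relaxed forms (8)-(9) when m2 = E(X^2) is unknown.
# Test distribution: X uniform over {-1, 0, 1}, so E(X) = 0 and m2 = 2/3.
a, b, m2 = -1.0, 1.0, 2.0 / 3.0
phi_sq = abs(a) * b                # Phi^2 = |a| b, since b <= |a|
for s in (0.5, 1.0, 2.0, 4.0):
    mgf = (math.exp(-s) + 1.0 + math.exp(s)) / 3.0
    tight = (1 + m2 / a ** 2) * math.exp(s ** 2 * phi_sq / 4)    # eqn (7)
    relaxed = (1 + b / abs(a)) * math.exp(s ** 2 * phi_sq / 4)   # eqns (8)-(9)
    assert mgf <= tight <= relaxed  # relaxation uses m2 <= |a| b (Lemma 2)
```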
Proof:
Following inequality (34), we have

$E(\exp(sX)) \le \frac{b^2 + m_2}{(b-a)^2} e^{sa} + \frac{m_2 + a^2}{(b-a)^2} e^{sb} + \frac{2(-ab - m_2)}{(b-a)^2} e^{s\frac{a+b}{2}}$
$= \Big[\frac{b^2}{(b-a)^2} e^{sa} + \frac{a^2}{(b-a)^2} e^{sb} + \frac{2(-ab)}{(b-a)^2} e^{s\frac{a+b}{2}}\Big] + \frac{m_2}{(b-a)^2} \big( e^{s\frac{b}{2}} - e^{s\frac{a}{2}} \big)^2$.   (35)

Let $u = \frac{s(b-a)}{2}$, $\lambda = \frac{-a}{b-a}$, and $\beta = \frac{m_2}{(b-a)^2}$; then $s = \frac{2u}{b-a}$ and $\frac{b}{b-a} = 1 - \lambda$. Inequality (35) can be rewritten as

$E(\exp(sX)) \le [(1-\lambda) e^{-\lambda u} + \lambda e^{(1-\lambda)u}]^2 + \beta ( e^{(1-\lambda)u} - e^{-\lambda u} )^2$
$\le [(1-\lambda) e^{-\lambda u} + \lambda e^{(1-\lambda)u}]^2 (1 + \beta/\lambda^2) = \exp(2\psi(u)) (1 + \beta/\lambda^2)$.   (36)

By using Lemma 3, we have

$E(\exp(sX)) \le \exp(u^2/4)(1 + \beta/\lambda^2)$ if $|a| < b$, and $E(\exp(sX)) \le \exp(\lambda(1-\lambda)u^2)(1 + \beta/\lambda^2)$ if $|a| \ge b$.   (37)

Now we discuss the exponential coefficient and the multiplying factor $(1 + \beta/\lambda^2)$ in two different cases.

(a) If $|a| \ge b$, then using $u = \frac{s(b-a)}{2}$, $\lambda = \frac{-a}{b-a}$ and $\beta = \frac{m_2}{(b-a)^2}$, we have

$\lambda(1-\lambda) u^2 = \frac{|a| b}{(b-a)^2} \cdot \frac{s^2 (b-a)^2}{4} = \frac{s^2 |a| b}{4}$,   (38)

as well as

$1 + \beta/\lambda^2 = 1 + m_2/a^2 \le 1 + b/|a| \le 2$.   (39)

(b) If $|a| < b$, then we have

$u^2/4 = \frac{s^2 (b-a)^2}{16} = \frac{s^2}{4} \Big( \frac{|a|+b}{2} \Big)^2$   (40)

and

$1 + \beta/\lambda^2 = 1 + m_2/a^2 \le 1 + b/|a|$.   (41)

Combining the two cases and using

$\Phi(a,b) = (|a|+b)/2$ if $b > |a|$, and $\Phi(a,b) = \sqrt{|a| b}$ if $b \le |a|$,

we obtain the results in inequalities (7)-(9). The proof is completed.

D. Proof of Theorem 1

Proof: If $k = 1$, Theorem 1 is exactly the improved Hoeffding's inequality (2).
Now we focus on the case $k \ge 2$. Since $f(x) = e^{\alpha x}$ is a convex function of $x$ for all $\alpha > 0$ and $f(x) > 0$, we have

$e^{\alpha x} \le \frac{b-x}{b-a} e^{\alpha a} + \frac{x-a}{b-a} e^{\alpha b}$,   (42)

and, for a positive integer $k$ ($k \ge 2$),

$e^{k\alpha x} \le \Big[ \frac{b-x}{b-a} e^{\alpha a} + \frac{x-a}{b-a} e^{\alpha b} \Big]^k = \Big\{ \Big[ \frac{b}{b-a} e^{\alpha a} + \frac{-a}{b-a} e^{\alpha b} \Big] + x \Big[ \frac{e^{\alpha b} - e^{\alpha a}}{b-a} \Big] \Big\}^k$,   (43)

so that

$E(e^{k\alpha X}) \le E \Big\{ \Big[ \frac{b}{b-a} e^{\alpha a} + \frac{-a}{b-a} e^{\alpha b} \Big] + X \Big[ \frac{e^{\alpha b} - e^{\alpha a}}{b-a} \Big] \Big\}^k$.   (44)

Setting $s = k\alpha$, $\lambda = \frac{-a}{b-a}$ and $u = \frac{s}{k}(b-a)$, we have

$E(e^{sX}) \le E \Big\{ [(1-\lambda) e^{-\lambda u} + \lambda e^{(1-\lambda)u}] + \frac{X}{|a|} [\lambda e^{(1-\lambda)u} - \lambda e^{-\lambda u}] \Big\}^k$.   (45)

Let

$e^{\psi(u)} = (1-\lambda) e^{-\lambda u} + \lambda e^{(1-\lambda)u}$   (46)

and

$\varphi(u) = \lambda e^{(1-\lambda)u} - \lambda e^{-\lambda u}$;   (47)

then

$E(e^{sX}) \le E \Big[ e^{\psi(u)} + \frac{X}{|a|} \varphi(u) \Big]^k = e^{k\psi(u)} + k e^{(k-1)\psi(u)} E\Big(\frac{X}{|a|}\Big) \varphi(u) + \sum_{i=2}^{k} C_k^i\, e^{(k-i)\psi(u)} E\Big(\frac{X}{|a|}\Big)^i \varphi^i(u)$   (48)
$= e^{k\psi(u)} + \sum_{i=2}^{k} C_k^i\, e^{(k-i)\psi(u)} E\Big(\frac{X}{|a|}\Big)^i \varphi^i(u)$
$\le e^{k\psi(u)} + \sum_{i=2}^{k} C_k^i\, e^{(k-i)\psi(u)} E\Big(\frac{|X|}{|a|}\Big)^i \varphi^i(u)$
$\le [(1-\lambda) e^{-\lambda u} + \lambda e^{(1-\lambda)u}]^k \Big\{ \Big[ 1 + \frac{\max\{|a|,b\}}{|a|} \Big]^k - k \frac{\max\{|a|,b\}}{|a|} \Big\}$,   (49)

where $C_k^i = \frac{k!}{i!(k-i)!}$, the second line uses $E(X) = 0$, and the last step uses $|X| \le \max\{|a|,b\}$ together with $0 \le \varphi(u) \le e^{\psi(u)}$.

By using $(b-a)\lambda = -a = |a|$, $\psi(u) = 0.5\,\tau(1-\tau)u^2$ and $u = \frac{s}{k}(b-a)$, we have

$E(e^{sX}) \le e^{0.5 k \tau(1-\tau) u^2} \Big\{ \Big[ 1 + \frac{\max\{|a|,b\}}{|a|} \Big]^k - k \frac{\max\{|a|,b\}}{|a|} \Big\}$
$\le \Big\{ \Big[ 1 + \frac{\max\{|a|,b\}}{|a|} \Big]^k - k \frac{\max\{|a|,b\}}{|a|} \Big\} \exp\Big( \frac{s^2 \Phi^2}{2k} \Big) = \Upsilon_k(a,b) \exp\Big( \frac{s^2 \Phi^2}{2k} \Big)$,   (50)

where

$\Upsilon_k(a,b) = \Big[ 1 + \frac{\max\{|a|,b\}}{|a|} \Big]^k - k \frac{\max\{|a|,b\}}{|a|}$,   (51)

$\Phi = \frac{b-a}{2}$ if $-a < b$, and $\Phi = \sqrt{|a| b}$ if $-a \ge b$.   (52)

The proof is completed.

Remark 2.
The proof of Theorem 1 creates a new route for using multipoint values of $\exp(sx)$ to obtain a tighter approximation of $E(\exp(sX))$ for any random variable distributed on a finite interval with $P(X \in [a,b]) = 1$. Compared with the original Hoeffding's inequality and its improvement obtained by Hertz, the advantage is that it reduces the exponential coefficient by a factor of $k$ when the moments up to order $k$ are taken into account; the cost is that the multiplying factor grows roughly like $C^k$, as shown by $\Upsilon_k(a,b)$, where $C > 1$ is a constant. That means there exists a trade-off between the exponential coefficient reduction and the multiplying factor increment, which needs to be considered in specific applications. In some scenarios, one may be interested in the case $k = 4$. The following corollary gives a refinement of Theorem 1 for this case.

Corollary 2:
Assume that $X$ is a real-valued random variable with $P(X \in [a,b]) = 1$, $a < 0$, $b > 0$, and $E(X) = 0$, $E(X^2) = m_2$, $E(X^3) = 0$, $E(X^4) = m_4$. For all $s \in \mathbb{R}$, $s > 0$, we have

$E[e^{sX}] \le \Big[ 1 + \frac{6 m_2}{a^2} + \frac{m_4}{a^4} \Big] \exp\Big( \frac{s^2 \Phi^2(a,b)}{8} \Big)$,   (53)

where

$\Phi = \frac{b-a}{2}$ if $-a < b$, and $\Phi = \sqrt{|a| b}$ if $-a \ge b$.   (54)

Proof:
The proof follows the same approach as that of Theorem 1. Since $f(x) = e^{\alpha x}$ is a convex function of $x$ for all $\alpha > 0$ and $f(x) > 0$, we have

$e^{\alpha x} \le \frac{b-x}{b-a} e^{\alpha a} + \frac{x-a}{b-a} e^{\alpha b}$   (55)

and

$e^{4\alpha x} \le \Big[ \frac{b-x}{b-a} e^{\alpha a} + \frac{x-a}{b-a} e^{\alpha b} \Big]^4 = \Big\{ \Big[ \frac{b}{b-a} e^{\alpha a} + \frac{-a}{b-a} e^{\alpha b} \Big] + x \Big[ \frac{e^{\alpha b} - e^{\alpha a}}{b-a} \Big] \Big\}^4$.   (56)

Let $s = 4\alpha$. Using $E(X) = 0$, $E(X^2) = m_2$, $E(X^3) = 0$ and $E(X^4) = m_4$, we have

$E(e^{sX}) \le \Big( \frac{b}{b-a} e^{\frac{s}{4}a} + \frac{-a}{b-a} e^{\frac{s}{4}b} \Big)^4 + 6 m_2 \Big( \frac{b}{b-a} e^{\frac{s}{4}a} + \frac{-a}{b-a} e^{\frac{s}{4}b} \Big)^2 \Big( \frac{e^{\frac{s}{4}b} - e^{\frac{s}{4}a}}{b-a} \Big)^2 + m_4 \Big( \frac{e^{\frac{s}{4}b} - e^{\frac{s}{4}a}}{b-a} \Big)^4$.   (57)

Let $\lambda = \frac{-a}{b-a}$ and $u = \frac{s(b-a)}{4}$; then $\frac{b}{b-a} = 1 - \lambda$, $\frac{s}{4}a = -\lambda u$ and $\frac{s}{4}b = (1-\lambda)u$. The inequality above can be rewritten as

$E(e^{sX}) \le [(1-\lambda) e^{-\lambda u} + \lambda e^{(1-\lambda)u}]^4 + \frac{6 m_2}{(b-a)^2 \lambda^2} [(1-\lambda) e^{-\lambda u} + \lambda e^{(1-\lambda)u}]^2 [\lambda e^{(1-\lambda)u} - \lambda e^{-\lambda u}]^2 + \frac{m_4}{(b-a)^4 \lambda^4} [\lambda e^{(1-\lambda)u} - \lambda e^{-\lambda u}]^4$   (58)
$\le [(1-\lambda) e^{-\lambda u} + \lambda e^{(1-\lambda)u}]^4 \Big[ 1 + \frac{6 m_2}{(b-a)^2 \lambda^2} + \frac{m_4}{(b-a)^4 \lambda^4} \Big]$   (59)
$= e^{4\psi(u)} \Big[ 1 + \frac{6 m_2}{(b-a)^2 \lambda^2} + \frac{m_4}{(b-a)^4 \lambda^4} \Big]$,   (61)

where the second step uses $0 \le \lambda e^{(1-\lambda)u} - \lambda e^{-\lambda u} \le (1-\lambda) e^{-\lambda u} + \lambda e^{(1-\lambda)u}$. By using $(b-a)\lambda = |a|$ and Lemma 3, we have

$E(e^{sX}) \le \Big[ 1 + \frac{6 m_2}{a^2} + \frac{m_4}{a^4} \Big] e^{\frac{s^2 \Phi^2}{8}}$,   (62)

where

$\Phi = \frac{b-a}{2}$ if $-a < b$, and $\Phi = \sqrt{|a| b}$ if $-a \ge b$.   (63)

The proof is completed.

If $E(X^2)$ and $E(X^4)$ are not exactly known and $|a| = b$, we have the following result.

Corollary 3:
Assume that $X$ is a real-valued random variable with $P(X \in [-a, a]) = 1$, $a > 0$, and $E(X) = 0$, $E(X^3) = 0$. For all $s \in \mathbb{R}$, $s > 0$, we have

$E[e^{sX}] \le 8 \exp\Big( \frac{a^2 s^2}{8} \Big)$.   (64)

Proof: By using $m_2 \le a^2$ and $m_4 \le a^4$ in the inequality of Corollary 2, we get the result directly.

III. APPLICATIONS IN TAIL BOUND ESTIMATION
Let us consider the scenario where $X_1, X_2, \ldots, X_n$ are independent random variables such that $X_i \in [a_i, b_i]$, $a_i < 0$, $b_i > 0$, and $E X_i = 0$ for $i = 1, 2, \ldots, n$. Define $S_n = \sum_{i=1}^n X_i$. It is easy to check that $E S_n = 0$. For all $s > 0$, we have

$P(S_n \ge t) = P(e^{sS_n} \ge e^{st}) \le e^{-st} E e^{sS_n} = e^{-st} \prod_{i=1}^n E e^{sX_i}$,   (65)

where the inequality is Chernoff's bound (via Markov's inequality) and the last equality uses independence. Using the results of Theorem 1 and its corollaries, one can obtain

$E e^{sX_i} \le A_{k_i} \exp\Big( \frac{s^2 \Phi_i^2}{2 k_i} \Big)$,   (66)

where $\Phi_i = \Phi(a_i, b_i)$ and the prefactor $A_{k_i}$ depends on which inequality of Section II is selected for $X_i$:

$A_{k_i} = 1$ for $k_i = 1$;
$A_{k_i} = 1 + \max\{|a_i|, b_i\}/|a_i|$ for $k_i = 2$;
$A_{k_i} = [1 + \max\{|a_i|, b_i\}/|a_i|]^{k_i} - k_i \max\{|a_i|, b_i\}/|a_i|$ for $k_i \ge 3$.   (67)

In this case, we get

$P(S_n \ge t) \le \Big( \prod_{i=1}^n A_{k_i} \Big) \exp\Big\{ -st + \frac{s^2}{2} \Big( \sum_{i=1}^n \frac{\Phi_i^2}{k_i} \Big) \Big\}$.   (68)

Now selecting

$s = t \Big( \sum_{i=1}^n \frac{\Phi_i^2}{k_i} \Big)^{-1}$   (69)

to minimize the exponent in the last inequality, we obtain

$P(S_n \ge t) \le \Big( \prod_{i=1}^n A_{k_i} \Big) \exp\Big\{ -\frac{t^2}{2} \Big( \sum_{i=1}^n \frac{\Phi_i^2}{k_i} \Big)^{-1} \Big\}$.   (70)

In particular, if all the $k_i$ ($i = 1, 2, \ldots, n$) are selected as 1, then $A_{k_i} = 1$ and the bound reduces to the improved Hoeffding one-sided tail bound. If all the $k_i$ are selected as 2 and $|a_i| = b_i$, then $A_{k_i} = 2$ and the inequality can be rewritten as

$P(S_n \ge t) \le 2^n \exp\Big\{ -\frac{t^2}{\sum_{i=1}^n a_i^2} \Big\}$.   (71)

Furthermore,

$P\Big( \frac{S_n}{n} \ge l \Big) \le \Big( \prod_{i=1}^n A_{k_i} \Big) \exp\Big\{ -\frac{n l^2}{2 \tilde{\Phi}} \Big\}$,   (72)

where $l$ is a positive number and $\tilde{\Phi} = \frac{1}{n} \sum_{i=1}^n \frac{\Phi_i^2}{k_i}$.
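The one-sided bound (70) is straightforward to evaluate numerically. Below is a minimal sketch, assuming an illustrative single variable on $[-5, 5]$ with known $E(X^2) = 5$; when the second moment is known, Corollary 1's sharper $k = 2$ prefactor $1 + m_2/a^2$ is used in place of the generic factor in (67).

```python
import math

def phi(a, b):
    # Phi(a,b) from eqns (3)/(6)
    return (abs(a) + b) / 2 if b > abs(a) else math.sqrt(abs(a) * b)

def prefactor(a, b, k, m2=None):
    # A_{k_i} of eqn (67); for k = 2 with a known second moment we use
    # Corollary 1's sharper factor 1 + m2/a^2 instead.
    if k == 1:
        return 1.0
    if k == 2 and m2 is not None:
        return 1 + m2 / a ** 2
    M = max(abs(a), b) / abs(a)
    return (1 + M) ** k - k * M

def one_sided_bound(params, t):
    # eqn (70): P(S_n >= t) <= (prod A_{k_i}) exp{-t^2/2 (sum Phi_i^2/k_i)^{-1}}
    pref = math.prod(prefactor(a, b, k, m2) for a, b, k, m2 in params)
    ssum = sum(phi(a, b) ** 2 / k for a, b, k, m2 in params)
    return pref * math.exp(-t ** 2 / (2 * ssum))

# One variable on [-5, 5] with E(X^2) = 5 (illustrative configuration):
# at t = 4, the k = 2 choice already beats k = 1.
bound_k1 = one_sided_bound([(-5.0, 5.0, 1, None)], t=4.0)
bound_k2 = one_sided_bound([(-5.0, 5.0, 2, 5.0)], t=4.0)
assert bound_k2 < bound_k1
```

The crossover observed here reflects the trade-off studied in Section IV: for large enough $t$, a larger $k_i$ pays off despite the bigger prefactor.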
The two-sided tail bound can be given by

$P(|S_n| \ge t) \le \Big( \prod_{i=1}^n A_{k_i} \Big) \exp\Big\{ -\frac{t^2}{2} \Big( \sum_{i=1}^n \frac{\Phi_i^2}{k_i} \Big)^{-1} \Big\} + \Big( \prod_{j=1}^n B_{k_j} \Big) \exp\Big\{ -\frac{t^2}{2} \Big( \sum_{j=1}^n \frac{\tilde{\Phi}_j^2}{k_j} \Big)^{-1} \Big\}$,   (73)

where $\{B_{k_j}, j = 1, 2, \ldots, n\}$ is the complementary counterpart of $\{A_{k_i}, i = 1, 2, \ldots, n\}$: $B_{k_j}$ (and $\tilde{\Phi}_j$) are calculated exactly as $A_{k_j}$ (and $\Phi_j$), but with the roles of the endpoints exchanged, i.e. on the interval $[-b_j, -a_j]$ of $-X_j$. In other words, for the same $X_i$, one may select two different integer parameter values of $k_i$ to estimate the tail bounds in the positive and negative directions.

IV. SELECTION OF INTEGER PARAMETER k

On the selection of the integer parameter $k_i$, we first discuss the one-sided tail bound. For simplicity, let us consider $n = 1$. The first question is when selecting a larger $k$ gives a tighter bound. This is determined by

$A_{k+1} \exp\Big\{ -\frac{(k+1) t^2}{2 \Phi^2} \Big\} < A_k \exp\Big\{ -\frac{k t^2}{2 \Phi^2} \Big\}$.   (74)

Taking logarithms on both sides of inequality (74), after some manipulation we get

$t^2 > 2 \Phi^2 (\ln A_{k+1} - \ln A_k)$,   (75)

that is,

$t > \Phi \sqrt{2 \ln \frac{A_{k+1}}{A_k}}$.   (76)

To clearly illustrate the effect of the selection of $k$, we give three examples.

Example 1. $a = -1$, $b = 1$. The selection rule of $k$ ($k = 1, 2, 3$) is given by: $k = 1$ for $0 < t < \sqrt{2 \ln 2} \approx 1.18$; $k = 2$ for $\sqrt{2 \ln 2} < t < \sqrt{2 \ln 2.5} \approx 1.35$; $k = 3$ for $t > \sqrt{2 \ln 2.5}$.   (77)

Example 2. $a = -1$, $b = 5$. The selection rule of $k$ ($k = 1, 2, 3$) is given by: $k = 1$ for $0 < t < \sqrt{18 \ln 6} \approx 5.68$; $k = 2$ for $\sqrt{18 \ln 6} < t < \sqrt{18 \ln(67/2)} \approx 7.95$; $k = 3$ for $t > \sqrt{18 \ln(67/2)}$.   (78)

Example 3. $a = -5$, $b = 1$, using Corollary 1's factor $A_2 = 1 + b/|a| = 6/5$. The selection rule of $k$ ($k = 1, 2, 3$) is given by: $k = 1$ for $0 < t < \sqrt{10 \ln(6/5)} \approx 1.35$; $k = 2$ for $\sqrt{10 \ln(6/5)} < t < \sqrt{10 \ln(25/6)} \approx 3.78$; $k = 3$ for $t > \sqrt{10 \ln(25/6)}$.   (79)

Remark 3.
All three examples show that when $t$ is relatively small, i.e. close to zero, selecting the parameter $k = 1$ is best. The results of Example 3 show that for intermediate $t$, between the two thresholds in (79), selecting $k = 2$ gives a tighter bias bound. The results of Examples 2 and 3 also indicate that when the random variable $X$ satisfies $P(X \in [a, b]) = 1$ with $a = -1$, $b = 5$, and one needs to estimate $P(|X| > t)$ for moderate $t$, the right-hand-side bound should select $k = 1$ while the left-hand-side bound should select $k = 2$. This shows that one may not consistently select the same parameter $k$ for both sided bias bounds when $|a| \ne b$.

Now let us consider the general case. The goal of the selection of the parameters $k_i$ is basically to minimize the right-hand side of inequality (70). Thus, one can set up an optimization problem as follows.

Problem 1: For a given $t > 0$,

$\min_{k_i} \Big( \prod_{i=1}^n A_{k_i} \Big) \exp\Big\{ -\frac{t^2}{2} \Big( \sum_{i=1}^n \frac{\Phi_i^2}{k_i} \Big)^{-1} \Big\}$,   (80)

where the $A_{k_i}$ are calculated using the theoretical results in Theorem 1 and its corollaries for the given $k_i$. It is equivalent to

$\min_{k_i} \Big( \sum_{i=1}^n \ln A_{k_i} \Big) - \frac{t^2}{2} \Big( \sum_{i=1}^n \frac{\Phi_i^2}{k_i} \Big)^{-1}$   (81)

and to

$\max_{k_i} \Big( \sum_{i=1}^n \frac{\Phi_i^2}{k_i} \Big)^{-1} - \frac{2}{t^2} \Big( \sum_{i=1}^n \ln A_{k_i} \Big)$.   (82)

In fact, such an optimization problem can be solved by computer search. Here, in order to provide a tractable model, we relax $A_{k_i}$ to the form $[1 + \max\{|a_i|, b_i\}/|a_i|]^{k_i}$ suggested by Theorem 1. In this case, the optimization problem can be transformed into the following problem.

Problem 2:

$\max_{k_i} \Big( \sum_{i=1}^n \frac{\Phi_i^2}{k_i} \Big)^{-1} - \frac{2}{t^2} \sum_{i=1}^n k_i \ln\Big( 1 + \frac{\max\{|a_i|, b_i\}}{|a_i|} \Big)$.   (83)

Let us define

$g(k_1, k_2, \ldots, k_n) = \Big( \sum_{i=1}^n \frac{\Phi_i^2}{k_i} \Big)^{-1} - \frac{2}{t^2} \sum_{i=1}^n k_i \ln\Big( 1 + \frac{\max\{|a_i|, b_i\}}{|a_i|} \Big)$.   (84)

In order to get some insight, let us treat $k_j$ as a real number rather than an integer. Then the partial derivative of $g(\cdot)$ with respect to $k_j$ is

$\frac{\partial g}{\partial k_j} = \frac{\Phi_j^2}{k_j^2} \Big( \sum_{i=1}^n \frac{\Phi_i^2}{k_i} \Big)^{-2} - \frac{2}{t^2} \ln\Big( 1 + \frac{\max\{|a_j|, b_j\}}{|a_j|} \Big)$.   (85)

Setting $\frac{\partial g}{\partial k_j} = 0$, after some manipulation we obtain

$k_j = \frac{\Phi(a_j, b_j)}{\sqrt{2 \ln\big( 1 + \max\{|a_j|, b_j\}/|a_j| \big)}} \cdot \frac{t}{\sum_{i=1}^n \Phi_i^2 / k_i}$.   (86)

Since $t \big( \sum_{i=1}^n \Phi_i^2 / k_i \big)^{-1}$ is a common factor for all the $k_j$ ($j = 1, 2, \ldots, n$), this means

$k_j \propto \frac{\Phi(a_j, b_j)}{\sqrt{2 \ln\big( 1 + \max\{|a_j|, b_j\}/|a_j| \big)}}$.   (87)

That is to say, up to a common factor, the near-optimal value of $k_j$ is mainly determined by $a_j$ and $b_j$, the parameters of the distribution interval of $X_j$. This is an interesting result, which provides further insight. In many applications, all the $X_i$ ($i = 1, 2, \ldots, n$) are distributed on the same interval. In this case, one can select the same value of $k_i$ for all of them, so as to approximate the near-optimal, tighter tail bound. Such a discussion can be extended to the scenarios of two-sided tail bounds.

Remark 4.
Consider a symmetric distribution interval, where $|a_i| = b_i$. In this case, we have

$\frac{\Phi(a_j, b_j)}{\sqrt{2 \ln\big( 1 + \max\{|a_j|, b_j\}/|a_j| \big)}} = \frac{|a_j|}{\sqrt{2 \ln 2}}$.   (88)

This means

$k_j \propto |a_j|$,   (89)

i.e. the selected integer parameter $k_j$ is proportional to the length of the distribution interval. When $|a_j|$ is relatively small, i.e. close to zero, the linear interpolation through the two points $x = a_j$ and $x = |a_j|$ is good enough to approximate the random curve of $e^{sX}$; that is to say, selecting $k_j = 1$ is good enough. When $|a_j|$ is relatively large, the linear interpolation through the two points $x = a_j$ and $x = |a_j|$ may no longer be adequate, and more points on the curve of $e^{sx}$ are needed for the interpolation to approximate the random curve of $e^{sX}$ well; that is to say, selecting a larger $k_j$ is necessary. This observation is consistent with our "intuitive feeling" about function approximation. We illustrate this phenomenon in detail with some examples below.

Example 4.
Let $a = -5$, $b = 5$ and $m_2 = 5$. The selection rule of $k$ ($k = 1, 2, 3$) is given by: $k = 1$ for $0 < t < \sqrt{50 \ln(6/5)} \approx 3.02$; $k = 2$ for $\sqrt{50 \ln(6/5)} < t < \sqrt{50 \ln(25/6)} \approx 8.45$; $k = 3$ for $t > \sqrt{50 \ln(25/6)}$.   (90)

Remark 5.
The results of Example 1 show that selecting $k = 1$ is always best there, since the working region of $t$ for $k \ge 2$ lies outside the distribution interval of $X$ and hence cannot occur in practice. Example 4 shows that when $m_2$ is given, it is possible for $k \ge 2$ to yield a tighter tail bound; e.g. for $t = 4$ the best selection is $k = 2$. This also shows that when the distribution interval is relatively large, it may pay to select a larger integer value of $k$ for the tail bound estimation.

Example 5.
Let us consider $n = 4$, where $X_1, X_2, X_3, X_4$ are independent zero-mean random variables on finite intervals $[a_i, b_i]$ with $E(X_i^2) = 5$, and $S = X_1 + X_2 + X_3 + X_4$, so that $E(S) = 0$ and $S$ is again bounded. Fig. 1 shows the curves of the one-sided tail bounds for three different groups of parameter selections. Group one: $k_1 = k_2 = k_3 = k_4 = 1$; Group two: $k_1 = k_2 = k_3 = 1$, $k_4 = 2$; Group three: $k_1 = k_2 = 1$, $k_3 = k_4 = 2$. The y-axis is the logarithm of the one-sided tail bound, $\big( \sum_{i=1}^n \ln A_{k_i} \big) - \frac{t^2}{2} \big( \sum_{i=1}^n \Phi_i^2 / k_i \big)^{-1}$, and the x-axis is $t$. It is observed that, among the three groups of parameter selections, for small $t$ the curve of Group one provides the tightest bound, for intermediate $t$ the curve of Group two provides the tightest bound, and for large $t$ the curve of Group three provides the tightest bound.

Fig. 1. The logarithm of the one-sided tail bound for three different group selections of the parameters $k$ with $n = 4$ in Example 5.

The results in Example 5 demonstrate that the new-type Hoeffding's inequalities are useful in tail bound estimation.

Remark 6. In real applications, one may not wish to spend much effort on the selection of the parameters $k_i$, in order to simplify the system analysis. We recommend selecting $k_i$ to be 1 or 2.

V. CONCLUSION

In this paper, we presented a new type of Hoeffding's inequalities that uses higher-order moments of random variables. Applications to one-sided and two-sided tail bound improvements were also obtained by using the positivity of the exponential function and Chernoff's inequality. Future research may focus on improving related inequalities that rely on Hoeffding's Lemma.

REFERENCES

[1] Hertz, D. (2020). Improved Hoeffding's lemma and Hoeffding's tail bounds. arXiv preprint arXiv:2012.03535.
[2] Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13-30.
[3] Schmidt, J. P., Siegel, A., Srinivasan, A. (1995). Chernoff-Hoeffding bounds for applications with limited independence.
SIAM Journal on Discrete Mathematics, 223-250.
[4] From, S. G., Swift, A. W. (2013). A refinement of Hoeffding's inequality. Journal of Statistical Computation and Simulation, 977-983.
[5] Scarlett, J., Martinez, A., Guillén i Fàbregas, A. (2014). Second-order rate region of constant-composition codes for the multiple-access channel. IEEE Transactions on Information Theory, 157-172.
[6] Sason, I., Eshel, R. (2011, July). On concentration of measures for LDPC code ensembles. In 2011 IEEE International Symposium on Information Theory Proceedings (pp. 1268-1272). IEEE.
[7] Xenoulis, K., Kalouptsidis, N., Sason, I. (2012, July). New achievable rates for nonlinear Volterra channels via martingale inequalities. In 2012 IEEE International Symposium on Information Theory Proceedings (pp. 1425-1429). IEEE.
[8] Yun, D., Lee, D., Yun, S.-Y., Shin, J., Yi, Y. (2015). Delay optimal CSMA with linear virtual channels under a general topology. IEEE/ACM Transactions on Networking, 2847-2857.
[9] Raginsky, M., Sason, I. (2018). Concentration of Measure Inequalities in Information Theory, Communications, and Coding, Third Edition. Now Foundations and Trends.
[10] Seldin, Y., Laviolette, F., Cesa-Bianchi, N., Shawe-Taylor, J., Auer, P. (2012). PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory.
[11] Fan, J., Jiang, B., Sun, Q. (2018). Hoeffding's lemma for Markov chains and its applications to statistical learning. arXiv preprint arXiv:1802.00211.
[12] Gourgoulias, K., Katsoulakis, M. A., Rey-Bellet, L., Wang, J. (2020). How biased is your model? Concentration inequalities, information and model bias. IEEE Transactions on Information Theory.
[13] Frías-Blanco, I., et al. (2014). Online and non-parametric drift detection methods based on Hoeffding's bounds. IEEE Transactions on Knowledge and Data Engineering.
[14] Schnass, K., Vybiral, J. (2011, May). Compressed learning of high-dimensional sparse functions.
In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3924-3927). IEEE.
[15] Rasheed, M., Clement, N., Bhowmick, A., Bajaj, C. L. (2019). Statistical framework for uncertainty quantification in computational molecular modeling. IEEE/ACM Transactions on Computational Biology and Bioinformatics.
[16] Willett, R. M., Nowak, R. D. (2007). Minimax optimal level-set estimation. IEEE Transactions on Image Processing, 16(12).