Sharpening Jensen's Inequality

J. G. Liao and Arthur Berg
Division of Biostatistics and Bioinformatics
Penn State University College of Medicine

October 26, 2017
Abstract
This paper proposes a new sharpened version of Jensen's inequality. The proposed bound is simple and insightful, is broadly applicable under minimal assumptions, and provides fairly accurate results in spite of its simple form. Applications to the moment generating function, power mean inequalities, and Rao-Blackwell estimation are presented. This material can be incorporated into any calculus-based statistics course.
Keywords:
Jensen gap, Power mean inequality, Rao-Blackwell estimator, Taylor series

1 Introduction
Jensen’s inequality is a fundamental inequality in mathematics and it underlies many im-portant statistical proofs and concepts. Some standard applications include derivationof the arithmetic-geometric mean inequality, non-negativity of Kullback and Leibler diver-gence, and the convergence property of the expectation-maximization algorithm (Dempster et al.,1977). Jensen’s inequality is covered in all major statistical textbooks such as Casella and Berger(2002, Section 4.7) and Wasserman (2013, Section 4.2) as a basic mathematical tool forstatistics.Let X be a random variable with finite expectation and let ϕ ( x ) be a convex function,then Jensen’s inequality (Jensen, 1906) establishes E [ ϕ ( X )] − ϕ ( E [ X ]) ≥ . (1)This inequality, however, is not sharp unless var( X ) = 0 or ϕ ( x ) is a linear function of x .Therefore, there is substantial room for advancement. This paper proposes a new sharperbound for the Jensen gap E [ ϕ ( X )] − ϕ ( E [ X ]). Some other improvements of Jensen’s in-equality have been developed recently; see for example Walker (2014), Abramovich and Persson(2016); Horvath et al. (2014) and references cited therein. Our proposed bound, however,has the following advantages. First, it has a simple, easy to use, and insightful form interms of the second derivative ϕ ′′ ( x ) and var( X ). At the same time, it gives fairly accurateresults in the several examples below. Many previously published improvements, however,are much more complicated in form, much more involved to use, and can even be moredifficult to compute than E [ ϕ ( X )] itself as discussed in Walker (2014). Second, our methodrequires only the existence of ϕ ′′ ( x ) and is therefore broadly applicable. In contrast, someother methods require ϕ ( x ) to admit a power series representation with positive coeffi-cients (Abramovich and Persson, 2016; Dragomir, 2014; Walker, 2014) or require ϕ ( x ) tobe super-quadratic (Abramovich et al., 2014). Third, we provide both a lower bound andan upper bound in a single formula.We have incorporated the materials in this paper in our classroom teaching. With onlyslightly increased technical level and lecture time, we are able to present a much sharperversion of the Jensen’s inequality that significantly enhances students’ understanding of2he underlying concepts. Theorem . Let X be a one-dimensional random variable with mean µ , and P ( X ∈ ( a, b )) =1, where −∞ ≤ a < b ≤ ∞ . Let ϕ ( x ) is a twice differentiable function on ( a, b ), and definefunction h ( x ; ν ) , ϕ ( x ) − ϕ ( ν )( x − ν ) − ϕ ′ ( ν ) x − ν . Then inf x ∈ ( a,b ) { h ( x ; µ ) } var( X ) ≤ E [ ϕ ( X )] − ϕ ( E [ X ]) ≤ sup x ∈ ( a,b ) { h ( x ; µ ) } var( X ) . (2) Proof.
Let F ( x ) be the cumulative distribution function of X . Applying Taylor’s theoremto ϕ ( x ) about µ with a mean-value form of the remainder gives ϕ ( x ) = ϕ ( µ ) + ϕ ′ ( µ )( x − µ ) + ϕ ′′ ( g ( x ))2 ( x − µ ) , where g ( x ) is between x and µ . Explicitly solving for ϕ ′′ ( g ( x )) / ϕ ′′ ( g ( x )) / h ( x ; µ )as defined above. Therefore E [ ϕ ( X )] − ϕ ( E [ X ]) = Z ba { ϕ ( x ) − ϕ ( µ ) } dF ( x )= Z ba (cid:8) ϕ ′ ( µ )( x − µ ) + h ( x ; µ )( x − µ ) (cid:9) dF ( x )= Z ba h ( x ; µ )( x − µ ) dF ( x ) , and the result follows because inf x ∈ ( a,b ) h ( x ; µ ) ≤ h ( x ; µ ) ≤ sup x ∈ ( a,b ) h ( x ; µ ).Theorem 1 also holds when inf h ( x ; µ ) is replaced by inf ϕ ′′ ( x ) / h ( x ; µ ) replacedby sup ϕ ′′ ( x ) / ϕ ′′ ( x )2 ≤ inf h ( x ; µ ) and sup ϕ ′′ ( x )2 ≥ sup h ( x ; µ ) . These less tight bounds are implied in the economics working paper Becker (2012). Ourlower and upper bounds have the general form J · var( X ), where J depends on ϕ . Similar3orms of bounds are presented in Abramovich and Persson (2016); Dragomir (2014); Walker(2014), but our J in Theorem 1 is much simpler and applies to a wider class of ϕ .Inequality (2) implies Jensen’s inequality when ϕ ′′ ( x ) ≥
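As a concrete check of Theorem 1, the following minimal sketch (our own illustration, not code from the paper) verifies inequality (2) numerically for $X \sim \mathrm{Uniform}(0, 1)$ and $\varphi(x) = e^x$, a case where the Jensen gap is available in closed form.

```python
# A minimal numerical check of Theorem 1 (our own illustration, not code from
# the paper). We take X ~ Uniform(0, 1) and phi(x) = exp(x), for which the
# Jensen gap is known exactly: E[e^X] - e^{E[X]} = (e - 1) - e^{1/2}.
import numpy as np

phi = np.exp                       # a convex, twice differentiable choice
mu, var = 0.5, 1.0 / 12.0          # mean and variance of Uniform(0, 1)

def h(x, nu):
    # h(x; nu) from Theorem 1; for phi = exp, phi'(nu) = exp(nu)
    return (phi(x) - phi(nu)) / (x - nu) ** 2 - phi(nu) / (x - nu)

x = np.linspace(1e-6, 1.0 - 1e-6, 100_000)   # fine grid over (a, b) = (0, 1)
lower = h(x, mu).min() * var                 # inf h(x; mu) * var(X)
upper = h(x, mu).max() * var                 # sup h(x; mu) * var(X)
gap = (np.e - 1.0) - np.exp(mu)              # exact Jensen gap

print(f"{lower:.4f} <= {gap:.4f} <= {upper:.4f}")   # 0.0585 <= 0.0696 <= 0.0817
```

The exact gap of about 0.0696 lands comfortably inside the Theorem 1 interval, which is far more informative than the plain Jensen lower bound of 0.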
In some applications the moments of $X$ in (2) are unknown, although a random sample $x_1, \ldots, x_n$ from the underlying distribution $F$ is available. A version of Theorem 1 suitable for this situation is given in the following corollary.

Corollary 1.1. Let $x_1, \ldots, x_n$ be any $n$ data points in $(-\infty, \infty)$, and let
$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \bar{\varphi}_x = \frac{1}{n}\sum_{i=1}^n \varphi(x_i), \qquad S^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.$$
Then
$$\inf_{x \in [a,b]} h(x; \bar{x})\,S^2 \;\le\; \bar{\varphi}_x - \varphi(\bar{x}) \;\le\; \sup_{x \in [a,b]} h(x; \bar{x})\,S^2,$$
where $a = \min\{x_1, \ldots, x_n\}$ and $b = \max\{x_1, \ldots, x_n\}$.
Proof. Consider the discrete random variable $X$ with probability distribution $P(X = x_i) = 1/n$, $i = 1, \ldots, n$. We have $E[X] = \bar{x}$, $E[\varphi(X)] = \bar{\varphi}_x$, and $\mathrm{var}(X) = S^2$. The corollary then follows from an application of Theorem 1.

Lemma 1. If $\varphi'(x)$ is convex, then $h(x; \mu)$ is monotonically increasing in $x$, and if $\varphi'(x)$ is concave, then $h(x; \mu)$ is monotonically decreasing in $x$.
Proof. We prove that $h'(x; \mu) \ge 0$ when $\varphi'(x)$ is convex; the analogous result for concave $\varphi'(x)$ follows similarly. Note that
$$\frac{dh(x; \mu)}{dx} = \left\{\frac{\varphi'(x) + \varphi'(\mu)}{2} - \frac{\varphi(x) - \varphi(\mu)}{x - \mu}\right\} \frac{2}{(x - \mu)^2},$$
so it suffices to prove
$$\frac{\varphi'(x) + \varphi'(\mu)}{2} \ge \frac{\varphi(x) - \varphi(\mu)}{x - \mu}.$$
Without loss of generality we assume $x > \mu$. Convexity of $\varphi'(x)$ gives
$$\varphi'(y) \le \varphi'(\mu) + \frac{\varphi'(x) - \varphi'(\mu)}{x - \mu}(y - \mu)$$
for all $y \in (\mu, x)$. Therefore we have
$$\varphi(x) - \varphi(\mu) = \int_\mu^x \varphi'(y)\,dy \le \int_\mu^x \left\{\varphi'(\mu) + \frac{\varphi'(x) - \varphi'(\mu)}{x - \mu}(y - \mu)\right\} dy = \frac{\varphi'(x) + \varphi'(\mu)}{2}(x - \mu),$$
and the result follows.

Lemma 1 makes Theorem 1 easy to use, as the following results hold:
$$\inf h(x; \mu) = \lim_{x \to a} h(x; \mu), \quad \sup h(x; \mu) = \lim_{x \to b} h(x; \mu) \quad \text{when } \varphi'(x) \text{ is convex};$$
$$\inf h(x; \mu) = \lim_{x \to b} h(x; \mu), \quad \sup h(x; \mu) = \lim_{x \to a} h(x; \mu) \quad \text{when } \varphi'(x) \text{ is concave}.$$
Note that the limits of $h(x; \mu)$ can be either finite or infinite. The proof of Lemma 1 borrows ideas from Bennish (2003). Examples of functions $\varphi(x)$ for which $\varphi'$ is convex include $\varphi(x) = \exp(x)$ and $\varphi(x) = x^p$ for $p \ge 2$ or $p \in (0, 1]$; examples of $\varphi(x)$ for which $\varphi'$ is concave include $\varphi(x) = -\log x$ and $\varphi(x) = x^p$ for $p < 0$ or $p \in [1, 2]$.
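Lemma 1 and the endpoint-limit rules above are easy to check numerically. The following sketch (our own illustration; the choices $\varphi(x) = -\log x$ and $\mu = 2$ are arbitrary) confirms on a grid that $h(x; \mu)$ is decreasing when $\varphi'$ is concave.

```python
# A quick numerical sanity check of Lemma 1 (our own illustration, not from
# the paper): for phi(x) = -log(x), phi' is concave, so h(x; mu) should be
# monotonically decreasing in x on (0, infinity).
import numpy as np

mu = 2.0

def h(x, nu):
    # h(x; nu) of Theorem 1 with phi(x) = -log(x), so phi'(nu) = -1/nu
    return (-np.log(x) + np.log(nu)) / (x - nu) ** 2 + 1.0 / (nu * (x - nu))

x = np.linspace(0.01, 10.0, 2000)
x = x[np.abs(x - mu) > 1e-3]        # skip the removable singularity at x = mu
print(np.all(np.diff(h(x, mu)) < 0))   # True: h decreases, as Lemma 1 predicts
```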
3 Examples

Example 1. For any random variable $X$ supported on $(a, b)$ with a finite variance, we can bound the moment generating function $E[e^{tX}]$ using Theorem 1 to get
$$\inf_{x \in (a,b)} \{h(x; \mu)\}\,\mathrm{var}(X) \le E[e^{tX}] - e^{tE[X]} \le \sup_{x \in (a,b)} \{h(x; \mu)\}\,\mathrm{var}(X),$$
where
$$h(x; \mu) = \frac{e^{tx} - e^{t\mu}}{(x - \mu)^2} - \frac{t e^{t\mu}}{x - \mu}.$$
For $t > 0$ and $(a, b) = (-\infty, \infty)$, we have $\inf h(x; \mu) = \lim_{x \to -\infty} h(x; \mu) = 0$ and $\sup h(x; \mu) = \lim_{x \to \infty} h(x; \mu) = \infty$, so Theorem 1 provides no improvement over Jensen's inequality. However, when the domain is bounded below, as for a non-negative random variable with $(a, b) = (0, \infty)$, a significant improvement in the lower bound is possible because
$$\inf h(x; \mu) = h(0; \mu) = \frac{1 - e^{t\mu} + t\mu e^{t\mu}}{\mu^2} > 0.$$
Similar results hold for $t < 0$.

We apply this to an example from Walker (2014), where $X$ is an exponential random variable with mean 1 and $\varphi(x) = e^{tx}$ with $t = 1/2$. Here the actual Jensen gap is $E[e^{tX}] - e^{tE[X]} = 2 - \sqrt{e} \approx 0.351$. Since $\mu = \mathrm{var}(X) = 1$, we have
$$0.176 \approx h(0; \mu) \le E[e^{tX}] - e^{tE[X]} \le \lim_{x \to \infty} h(x; \mu) = \infty.$$
The less sharp lower bound using $\inf \varphi''(x)/2$ is $t^2/2 = 0.125$.
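The numbers in this example are easy to reproduce; the short verification script below is ours, not the paper's.

```python
# Numerical check of Example 1 (our own verification script): X ~ Exp(1),
# phi(x) = exp(t*x) with t = 1/2. The true Jensen gap is 2 - sqrt(e), and the
# Theorem 1 lower bound is h(0; mu) * var(X) with mu = var(X) = 1.
import numpy as np

t, mu, var = 0.5, 1.0, 1.0
gap = 2.0 - np.sqrt(np.e)                        # E[e^{tX}] - e^{t E[X]} exactly
lower_h = (1.0 - np.exp(t * mu) + t * mu * np.exp(t * mu)) / mu**2 * var
lower_crude = t**2 / 2.0 * var                   # inf phi''(x)/2 over (0, inf)

print(f"Jensen gap      : {gap:.3f}")            # 0.351
print(f"Theorem 1 bound : {lower_h:.3f}")        # 0.176
print(f"Cruder bound    : {lower_crude:.3f}")    # 0.125
```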
Example 2. Let $X$ be a positive random variable on an interval $(a, b)$ with mean $\mu$. Note that $-\log(x)$ is convex and its derivative is concave. Applying Theorem 1 and Lemma 1 leads to
$$\lim_{x \to b} h(x; \mu)\,\mathrm{var}(X) \le -E[\log(X)] + \log \mu \le \lim_{x \to a} h(x; \mu)\,\mathrm{var}(X),$$
where
$$h(x; \mu) = \frac{-\log x + \log \mu}{(x - \mu)^2} + \frac{1}{\mu(x - \mu)}.$$
Now consider a sample of $n$ positive data points $x_1, \ldots, x_n$. Let $\bar{x}$ be the arithmetic mean and $\bar{x}_g = (x_1 x_2 \cdots x_n)^{1/n}$ be the geometric mean. Applying Corollary 1.1 gives
$$\exp\{S^2 h(b; \bar{x})\} \le \frac{\bar{x}}{\bar{x}_g} \le \exp\{S^2 h(a; \bar{x})\},$$
where $a$, $b$, and $S^2$ are as defined in Corollary 1.1. To give some numerical results, we generated 100 random numbers from the uniform distribution on [10, 100]. For these 100 numbers, the arithmetic mean $\bar{x}$ is 54.830 and the geometric mean $\bar{x}_g$ is 47.509, and the above inequality brackets the ratio $\bar{x}/\bar{x}_g = 1.154$ fairly tightly. Replacing $h(b; \bar{x})$ by $\varphi''(b)/2$ and $h(a; \bar{x})$ by $\varphi''(a)/2$ gives less sharp bounds.
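The sketch below reproduces the spirit of this computation. The paper's exact 100 draws are not available, so a seed of our own choosing is used and the bound values will differ slightly from the paper's.

```python
# Sample-based arithmetic/geometric mean bounds from Corollary 1.1 with
# phi(x) = -log(x). Data are our own (seed 0), not the paper's draws.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(10.0, 100.0, size=100)

xbar = x.mean()
xg = np.exp(np.log(x).mean())                 # geometric mean
S2 = x.var()                                  # (1/n) * sum (x_i - xbar)^2
a, b = x.min(), x.max()

def h(t, nu):
    # h(t; nu) of Theorem 1 with phi(x) = -log(x)
    return (-np.log(t) + np.log(nu)) / (t - nu) ** 2 + 1.0 / (nu * (t - nu))

lower = np.exp(S2 * h(b, xbar))               # phi' concave: inf h at t = b
upper = np.exp(S2 * h(a, xbar))               # sup h at t = a
print(f"{lower:.3f} <= {xbar / xg:.3f} <= {upper:.3f}")
```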
Example 3. Let $X$ be a positive random variable on a positive interval $(a, b)$ with mean $\mu$. For any real number $s \ne 0$, define the power mean as $M_s(X) = (E[X^s])^{1/s}$. Jensen's inequality establishes that $M_s(X)$ is an increasing function of $s$. We now give a sharper inequality by applying Theorem 1. Let $r \ne 0$, $Y = X^r$, $\mu_y = E[Y]$, $p = s/r$, and $\varphi(y) = y^p$. Note that $E[X^s] = E[\varphi(Y)]$. Applying Theorem 1 leads to
$$\inf h(y; \mu_y)\,\mathrm{var}(Y) \le E[X^s] - (E[X^r])^p \le \sup h(y; \mu_y)\,\mathrm{var}(Y),$$
where
$$h(y; \mu_y) = \frac{y^p - \mu_y^p}{(y - \mu_y)^2} - \frac{p \mu_y^{p-1}}{y - \mu_y}.$$
To apply Lemma 1, note that $\varphi'(y)$ is convex for $p \ge 2$ or $p \in (0, 1]$ and is concave for $p < 0$ or $p \in [1, 2]$, as noted in Section 2.

Applying the above result to the case of $r = 1$ and $s = -1$, we have $Y = X$, $p = -1$, and
$$\left\{(E[X])^{-1} + \lim_{y \to a} h(y; \mu_y)\,\mathrm{var}(X)\right\}^{-1} \le \left(E[X^{-1}]\right)^{-1} \le \left\{(E[X])^{-1} + \lim_{y \to b} h(y; \mu_y)\,\mathrm{var}(X)\right\}^{-1}.$$
For the same sequence $x_1, \ldots, x_n$ generated in Example 2, the sample version of this inequality tightly brackets the harmonic mean $\bar{x}_{\mathrm{harmonic}}$ of approximately 39; in particular, the resulting upper bound of 48.905 is much smaller than the arithmetic mean $\bar{x} = 54.830$, which is the upper bound implied by Jensen's inequality. Replacing $h(b; \bar{x})$ by $\varphi''(b)/2$ and $h(a; \bar{x})$ by $\varphi''(a)/2$ gives less sharp bounds.

In The American Statistician, de Carvalho (2016) revisited Kolmogorov's formulation of the generalized mean,
$$E_\varphi(X) = \varphi^{-1}(E[\varphi(X)]), \tag{3}$$
where $\varphi$ is a continuous monotone function with inverse $\varphi^{-1}$. Example 2 corresponds to $\varphi(x) = -\log(x)$ and Example 3 corresponds to $\varphi(x) = x^s$. We can also apply Theorem 1 to bound $\varphi^{-1}(E[\varphi(X)])$ for a more general function $\varphi(x)$.
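A sample version of the harmonic-mean bound can be sketched as follows; as in Example 2, the data are our own seeded draws, not the paper's, so the printed bounds will differ from those quoted above.

```python
# Sketch of Example 3's harmonic-mean bound: r = 1, s = -1, so p = -1 and
# phi(y) = 1/y. Data are our own (seed 0), not the paper's draws.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(10.0, 100.0, size=100)
xbar, S2 = x.mean(), x.var()
a, b = x.min(), x.max()

def h(y, nu, p=-1.0):
    # h(y; nu) of Theorem 1 with phi(y) = y^p
    return (y**p - nu**p) / (y - nu) ** 2 - p * nu ** (p - 1) / (y - nu)

harmonic = 1.0 / np.mean(1.0 / x)
# phi' is concave for p = -1: inf h at y = b, sup h at y = a.
lower = 1.0 / (1.0 / xbar + h(a, xbar) * S2)   # taking reciprocals flips bounds
upper = 1.0 / (1.0 / xbar + h(b, xbar) * S2)
print(f"{lower:.3f} <= {harmonic:.3f} <= {upper:.3f}")
```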
Example 4. The Rao-Blackwell theorem (Theorem 7.3.17 in Casella and Berger, 2002; Theorem 10.42 in Wasserman, 2013) is a basic result in statistical estimation. Let $\hat{\theta}$ be an estimator of $\theta$, let $L(\theta, \hat{\theta})$ be a loss function convex in $\hat{\theta}$, and let $T$ be a sufficient statistic. Then the Rao-Blackwell estimator $\hat{\theta}^* = E[\hat{\theta} \mid T]$ satisfies the following inequality in risk:
$$E[L(\theta, \hat{\theta})] \ge E[L(\theta, \hat{\theta}^*)]. \tag{4}$$
We can improve this inequality by applying Theorem 1 to $\varphi(\hat{\theta}) = L(\theta, \hat{\theta})$ with respect to the conditional distribution of $\hat{\theta}$ given $T$:
$$E[L(\theta, \hat{\theta}) \mid T] - L(\theta, \hat{\theta}^*) \ge \inf_{x \in (a,b)} h(x; \hat{\theta}^*)\,\mathrm{var}(\hat{\theta} \mid T),$$
where the function $h$ is defined as in Theorem 1 for $\varphi(\hat{\theta})$ and $P(\hat{\theta} \in (a, b) \mid T) = 1$. Further taking expectations over $T$ gives
$$E[L(\theta, \hat{\theta})] - E[L(\theta, \hat{\theta}^*)] \ge E\left[\inf_{x \in (a,b)} h(x; \hat{\theta}^*)\,\mathrm{var}(\hat{\theta} \mid T)\right].$$
In particular, for squared-error loss $L(\theta, \hat{\theta}) = (\hat{\theta} - \theta)^2$, we have $h(x; \nu) \equiv 1$ because the loss is quadratic, so the bound holds with equality:
$$E[(\theta - \hat{\theta})^2] - E[(\theta - \hat{\theta}^*)^2] = E\left[\mathrm{var}(\hat{\theta} \mid T)\right].$$
Using the original Jensen's inequality only establishes the cruder inequality in Equation (4).
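The squared-error identity can be seen in a simulation sketch under assumptions of our choosing (the normal model, sample size, and estimators below are illustrative, not from the paper).

```python
# Simulation sketch of Example 4: X_1, ..., X_n ~ N(theta, 1), crude estimator
# theta_hat = X_1, sufficient statistic T = X_bar, and Rao-Blackwell estimator
# theta* = E[X_1 | X_bar] = X_bar. For squared-error loss the risk drop should
# equal E[var(theta_hat | T)] = 1 - 1/n.
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 3.0, 5, 200_000

x = rng.normal(theta, 1.0, size=(reps, n))
theta_hat = x[:, 0]                 # unbiased but crude: risk = 1
theta_star = x.mean(axis=1)         # Rao-Blackwellized: risk = 1/n

risk_hat = np.mean((theta_hat - theta) ** 2)
risk_star = np.mean((theta_star - theta) ** 2)
print(risk_hat - risk_star, 1.0 - 1.0 / n)   # both ~0.8
```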
4 Partitioning the Domain

As discussed in Example 1 above, Theorem 1 does not improve on Jensen's inequality if $\inf h(x; \mu) = 0$. In such cases, we can often sharpen the bounds by partitioning the domain $(a, b)$, following an approach used in Walker (2014). Let $a = x_0 < x_1 < \cdots < x_m = b$, $I_j = [x_{j-1}, x_j)$, $\eta_j = P(X \in I_j)$, and $\mu_j = E(X \mid X \in I_j)$. It follows from the law of total expectation that
$$E[\varphi(X)] = \sum_{j=1}^m \eta_j E[\varphi(X) \mid X \in I_j] = \sum_{j=1}^m \eta_j \varphi(\mu_j) + \sum_{j=1}^m \eta_j \left(E[\varphi(X) \mid X \in I_j] - \varphi(\mu_j)\right).$$
Let $Y$ be a discrete random variable with distribution $P(Y = \mu_j) = \eta_j$, $j = 1, 2, \ldots, m$. It is easy to see that $E[Y] = E[X] = \mu$. It follows from Theorem 1 that
$$\sum_{j=1}^m \eta_j \varphi(\mu_j) = E[\varphi(Y)] \ge \varphi(E[Y]) + \inf_{y \in [\mu_1, \mu_m]} h(y; \mu)\,\mathrm{var}(Y).$$
We can also apply Theorem 1 to each $E[\varphi(X) \mid X \in I_j] - \varphi(\mu_j)$ term:
$$E[\varphi(X) \mid X \in I_j] - \varphi(\mu_j) \ge \inf_{x \in I_j} h(x; \mu_j)\,\mathrm{var}(X \mid X \in I_j).$$
Combining the above two equations, we have
$$E[\varphi(X)] - \varphi(E[X]) \ge \inf_{y \in [\mu_1, \mu_m]} h(y; \mu)\,\mathrm{var}(Y) + \sum_{j=1}^m \eta_j \inf_{x \in I_j} h(x; \mu_j)\,\mathrm{var}(X \mid X \in I_j). \tag{5}$$
Replacing inf by sup on the right-hand side gives the upper bound.

The Jensen gap on the left side of (5) is positive if any of the $m + 1$ terms on the right is positive. In particular, the Jensen gap is positive if there exists an interval $I \subset (a, b)$ that satisfies $\inf_{x \in I} \varphi''(x) > 0$, $P(X \in I) > 0$, and $\mathrm{var}(X \mid X \in I) > 0$. Note that a finer partition does not necessarily lead to a sharper lower bound in (5). The focus of the partition should therefore be on isolating the part of the interval $(a, b)$ in which $\varphi''(x)$ is close to 0.

Consider, for example, $X \sim N(\mu, \sigma^2)$ with $\mu = 0$ and $\sigma = 1$ and $\varphi(x) = e^x$. We divide $(-\infty, \infty)$ into three intervals with equal probabilities. This gives (entries lost in extraction are marked by an ellipsis):

$I_j$                 $\eta_j$   $E[X \mid X \in I_j]$   $\mathrm{var}(X \mid X \in I_j)$   $\inf_{x \in I_j} h(x; \mu_j)$   $\sup_{x \in I_j} h(x; \mu_j)$
$(-\infty, -0.43)$    1/3        $-1.091$                0.280                              0                                …
$[-0.43, 0.43)$       1/3        0                       …                                  …                                …
$[0.43, \infty)$      1/3        1.091                   0.280                              1.209                            $\infty$

The resulting lower bound in (5) is positive, whereas the actual Jensen gap is $e^{\mu + \sigma^2/2} - e^\mu = 0.649$. The upper bound, which is $\infty$, however, provides no improvement over Theorem 1.
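The lower bound in (5) for this example can be reproduced numerically. The Monte Carlo sketch below is our own (the seed, sample size, and rounded cut points of $\pm 0.43$ are our choices, with the interval infima taken at left endpoints via Lemma 1).

```python
# Monte Carlo sketch of the partitioned lower bound (5) for X ~ N(0, 1) and
# phi(x) = exp(x), with three (approximately) equal-probability intervals.
import numpy as np

phi = np.exp

def h(x, nu):
    # h(x; nu) from Theorem 1; for phi = exp, phi'(nu) = exp(nu)
    return (phi(x) - phi(nu)) / (x - nu) ** 2 - phi(nu) / (x - nu)

rng = np.random.default_rng(2)
xs = rng.normal(size=2_000_000)
cuts = [-np.inf, -0.43, 0.43, np.inf]

bound, etas, mus = 0.0, [], []
for lo, hi in zip(cuts[:-1], cuts[1:]):
    piece = xs[(xs >= lo) & (xs < hi)]
    eta, mu_j = len(piece) / len(xs), piece.mean()
    # phi' = exp is convex, so h(.; mu_j) is increasing (Lemma 1); its infimum
    # over I_j sits at the left endpoint, and equals 0 when that endpoint is -inf.
    inf_h = 0.0 if np.isinf(lo) else h(lo, mu_j)
    bound += eta * inf_h * piece.var()       # within-interval terms of (5)
    etas.append(eta)
    mus.append(mu_j)

mus = np.array(mus)
var_Y = np.sum(np.array(etas) * mus**2)      # E[Y] = E[X] = 0
bound += h(mus.min(), 0.0) * var_Y           # between-interval term of (5)

print(f"lower bound {bound:.3f} <= true Jensen gap {np.exp(0.5) - 1.0:.3f}")
```

In our runs the bound is positive, as the text predicts, with most of the improvement coming from the between-interval term.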
To summarize, this paper proposes a new sharpened version of Jensen's inequality. The proposed bound is simple and insightful, is broadly applicable under minimal assumptions on $\varphi(x)$, and provides fairly accurate results in spite of its simple form. It can be incorporated into any calculus-based statistics course.

References
Abramovich, S. and L.-E. Persson (2016). Some new estimates of the Jensen gap. Journal of Inequalities and Applications 2016(1), 39.

Abramovich, S., L.-E. Persson, and N. Samko (2014). Some new scales of refined Jensen and Hardy type inequalities. Mathematical Inequalities & Applications 17(3), 1105–1114.

Becker, R. A. (2012). The variance drain and Jensen's inequality. CAEPR Working Paper No. 2012-004. Available at SSRN: https://ssrn.com/abstract=2027471 or http://dx.doi.org/10.2139/ssrn.2027471.

Bennish, J. (2003). A proof of Jensen's inequality. Missouri Journal of Mathematical Sciences 15(1).

Casella, G. and R. L. Berger (2002). Statistical Inference (2nd ed.). Pacific Grove, CA: Duxbury.

de Carvalho, M. (2016). Mean, what do you mean? The American Statistician 70(3), 270–274.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39(1), 1–38.

Dragomir, S. (2014). Jensen integral inequality for power series with nonnegative coefficients and applications. RGMIA Res. Rep. Collect. 17, 42.

Horváth, L., K. A. Khan, and J. Pečarić (2014). Refinement of Jensen's inequality for operator convex functions. Advances in Inequalities and Applications 2014.

Jensen, J. L. W. V. (1906). Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica 30(1), 175–193.

Walker, S. G. (2014). On a lower bound for the Jensen inequality. SIAM Journal on Mathematical Analysis 46(5), 3151–3157.

Wasserman, L. (2013). All of Statistics: A Concise Course in Statistical Inference. New York: Springer.