Saturable Generalizations of Jensen’s Inequality
A convex analysis approach to tight expectation inequalities

André M. Timpanaro

February 25, 2021
Abstract
Jensen’s inequality can be thought of as answering the question of how knowledge of E(X) allows us to bound E(f(X)) in the cases where f is either convex or concave. Many generalizations have been proposed, which boil down to how additional knowledge allows us to sharpen the inequality. In this work, we investigate the question of how knowledge of expectations E(f_i(X)) of a random vector X translates into bounds for E(g(X)). Our results show that there is a connection between this problem and properties of convex hulls, allowing us to rewrite it as an optimization problem. The results of these optimization problems not only give sharp bounds for E(g(X)) but in some cases also yield discrete probability measures where equality holds. We develop both analytical and numerical approaches for finding these bounds and also study in more depth the case where the known information is the average and variance, for which some analytical results can be obtained.

1 Introduction

Jensen’s inequality [11] has a long history in applied mathematics and science, being used in Probability Theory, Statistical Mechanics, Information Theory and Convex Analysis. Written in terms of expectations, it states that if X is a random vector and f is a convex function, then

E(f(X)) ≥ f(E(X))    (1)

Equation (1) can be regarded as an inequality relating different expectations that can be computed from the random vector X (as is the case for Markov’s and Chebyshev’s inequalities). In particular, one can think of it as answering the question of how knowledge of E(X) allows us to bound E(f(X)) in the cases where f is convex (or concave, if we consider −f).

Given its importance, there has been considerable effort on finding generalizations of inequality (1) [12, 16, 9, 3, 8, 1, 15, 7, 13, 2]. These generalizations can be broken down into bounds that require more information about the function f (like analyticity assumptions [16, 3, 8], superquadraticity [1, 2] or assumptions about asymptotic behaviours [9]), and bounds that require knowledge of more expectations besides E(X) (like the variance [12, 16, 3, 8, 7], other dispersion measures [9] or more complicated expectations [1, 13, 2]). A notable limitation of the current results is that they require the random vector to have a support contained in R, or even in specific subsets of R.

In this work, we consider the question of how knowledge of expectations E(f_i(X)) of a random vector X translates into bounds for E(g(X)). The case where the f_i and g are polynomials is famously related to generalized moment problems and can be solved numerically with semidefinite programming [14, 6]. In order to extend this problem to more general classes of functions, we take a different route and show that, using convex hulls, one can obtain sharp lower and upper bounds for E(g(X)). Most of the work is devoted to theorems that allow these bounds to be obtained in a simpler way, instead of requiring finding the actual convex hull. In sections 3 and 5 we obtain the main results. Some applications are given in sections 4 and 6. These include two theorems that expand some of the work already done regarding the Jensen gap:
Theorem. If f : [a, b] → R is bounded, differentiable and such that f′ is strictly convex, then for every λ and σ² that are possible values for the average and variance of a variable in [a, b], there exist measures µ± with E(X)_{µ±} = λ and Var(X)_{µ±} = σ² such that

E(f(X))_{µ−} = [ σ² f(a) + (λ − a)² f( λ + σ²/(λ − a) ) ] / [ σ² + (λ − a)² ]

E(f(X))_{µ+} = [ σ² f(b) + (λ − b)² f( λ + σ²/(λ − b) ) ] / [ σ² + (λ − b)² ]

and for every measure µ with the same average and variance we have

E(f(X))_{µ−} ≤ E(f(X))_µ ≤ E(f(X))_{µ+}
Theorem. Let f : R → R be continuous, λ ∈ R, σ > 0, p⃗ ≡ (p_a, p_b, p_c), S = { p⃗ ∈ R³ | p_a, p_b, p_c > 0 and p_a + p_b + p_c = 1 } and define

x_a(p⃗, θ) = λ + σ ( cos(θ) √(p_b/p_a) + sin(θ) √p_c ) / √(p_a + p_b)

x_b(p⃗, θ) = λ + σ ( −cos(θ) √(p_a/p_b) + sin(θ) √p_c ) / √(p_a + p_b)

x_c(p⃗, θ) = λ − σ sin(θ) √( (p_a + p_b)/p_c )

Then

inf_{µ ∈ M} E(f(X))_µ = inf_{p⃗ ∈ S} inf_θ ( p_a f(x_a) + p_b f(x_b) + p_c f(x_c) )

sup_{µ ∈ M} E(f(X))_µ = sup_{p⃗ ∈ S} sup_θ ( p_a f(x_a) + p_b f(x_b) + p_c f(x_c) )

where M is the set of measures with support in R, E(X)_µ = λ and Var(X)_µ = σ².

Finally, in section 7 we present a theorem that reduces the problem of bounding E(g(X)) given E(f_i(X)) to a convex optimization problem:

Theorem. If S ⊆ R^n, let g : R^n → R and f : R^n → R^m be measurable functions, let M(S, ψ) be the set of all measures µ with support contained in S such that E(f(X))_µ = ψ, and let D = { ψ ∈ R^m | M(S, ψ) ≠ ∅ }. If φ ∈ int(D) ≠ ∅, then

inf_{µ ∈ M(S,φ)} E(g(X))_µ = sup_{α ∈ R^m} ( inf_{x ∈ S} ( g(x) + α • (f(x) − φ) ) )

and

sup_{µ ∈ M(S,φ)} E(g(X))_µ = inf_{α ∈ R^m} ( sup_{x ∈ S} ( g(x) + α • (f(x) − φ) ) )

2 Definitions and notation

We are interested in studying the following problem:

Let X be a random vector with probability measure µ on R^n that has its support contained in S ⊆ R^n. Let f_1, ..., f_m and g be measurable functions R^n → R. If E(f_i(X))_µ = φ_i ∈ R, i = 1, ..., m, then what can be said about E(g(X))_µ?

For ease of notation, for the rest of this work, the functions f_i will always be the m functions whose expectations are known (bundled together as the vector f), the φ_i are the values of these expectations (bundled as the vector φ) and g is the function whose expectation we wish to study. We further define/denote:

• z is the (m+1)-th coordinate of a point in R^{m+1}.
• Γ(x) ≡ (f_1(x), f_2(x), ..., f_m(x), g(x)).
• M(S) is the set of all probability measures µ on R^n with support contained in S ⊆ R^n.
• M(S, φ) is the subset of M(S) such that E(f(X))_µ = φ ∈ R^m and E(g(X))_µ is finite.
• M_k(S, φ) is the subset of M(S, φ) of measures that have at most k points in their support.
• H(S) is the convex hull of a set S.
• If φ ∈ R^m and S ⊆ R^n, then γ(S, φ) ≡ H(Γ(S)) ∩ ({φ} × R).
• cl(S) is the closure of a set S.
• int(S) is the interior of a set S.
• bd(S) is the boundary of a set S.
• u • v denotes the scalar product Σ_i u_i v_i.

The following definition will also be useful:

Def 1 (Progressive Cover). A progressive cover of a set S ⊆ R^n is a sequence of sets (S_i)_{i=1}^∞ such that

• S_i ⊆ S_j if i ≤ j
• S_i ⊆ S ∀i
• ∀x ∈ S ∃i | x ∈ S_i

A progressive compact cover is a progressive cover where all sets in it are compact, and a progressive bounded cover is a progressive cover where all sets in it are bounded.
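The objects above are easy to experiment with numerically: a measure in M_k(S, φ) is just a finite list of points and probabilities. The following minimal Python sketch (the class and variable names are ours, not part of the paper) represents such a measure and checks the constraints for a two-point example with prescribed average and variance; this particular measure reappears in the examples of section 5:

```python
import numpy as np

class DiscreteMeasure:
    # A measure in M_k(S, phi): points x_1..x_k in R^n with probabilities p_1..p_k.
    def __init__(self, points, probs):
        self.points = np.atleast_2d(points).astype(float)   # shape (k, n)
        self.probs = np.asarray(probs, dtype=float)         # shape (k,)
        assert np.all(self.probs >= 0.0) and np.isclose(self.probs.sum(), 1.0)

    def expect(self, h):
        # E(h(X))_mu for a function h acting on a single point of R^n.
        return sum(p * h(x) for x, p in zip(self.points, self.probs))

# A two-point measure on {-b, b} with E(X) = lam and Var(X) = sig2,
# where b = sqrt(lam^2 + sig2).
lam, sig2 = 0.3, 1.0
b = np.sqrt(lam**2 + sig2)
mu = DiscreteMeasure([[-b], [b]], [(1 - lam / b) / 2, (1 + lam / b) / 2])
print(mu.expect(lambda x: x[0]))      # lam
print(mu.expect(lambda x: x[0]**2))   # lam^2 + sig2
```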
3 The Hull Inequality

In order to tackle our problem, we first recall the following well-known lemma.
Lemma 1. If S ⊆ R^n and µ ∈ M(S) has a non-divergent expectation E(X)_µ, then E(X)_µ ∈ H(S).

The same general reasoning used in proving lemma 1 can be used to extend it to situations where we are mapping the random vector:
Lemma 2.
Let µ ∈ M(S), with S ⊆ R^n, and let Ψ : R^n → R^m be a measurable function such that E(Ψ(X))_µ is non-divergent. Then E(Ψ(X))_µ ∈ H(Ψ(S)).

From here the following corollary follows:
Corollary 1.
Let Ψ : R^n → R^m be a measurable function, S ⊆ R^n and Ξ ∈ R^m. Then there exists a measure µ ∈ M(S) such that E(Ψ(X))_µ = Ξ iff Ξ ∈ H(Ψ(S)).

Proof. If Ξ ∈ H(Ψ(S)), then by definition Ξ is a finite convex combination of elements of Ψ(S):

Ξ = Σ_k α_k P_k

The measure µ ∈ M(S) we are after can be obtained by taking Q_k ∈ S such that Ψ(Q_k) = P_k and attributing probability α_k to each Q_k. On the other hand, if E(Ψ(X))_µ = Ξ for some µ ∈ M(S), then all we need to do is apply lemma 2.

The connection of corollary 1 with the problem we are interested in is given by the following theorem.

Theorem 1 (Hull Inequality). Let γ(S, φ) = H(Γ(S)) ∩ ({φ} × R), with S ⊆ R^n, and let z denote the (m+1)-th coordinate of a point. Then

inf_{γ(S,φ)} z = inf_{µ ∈ M(S,φ)} E(g(X))_µ = inf_{µ ∈ M_{m+2}(S,φ)} E(g(X))_µ

sup_{γ(S,φ)} z = sup_{µ ∈ M(S,φ)} E(g(X))_µ = sup_{µ ∈ M_{m+2}(S,φ)} E(g(X))_µ

Proof. The case γ(S, φ) = ∅ is trivially true. Moving on to γ(S, φ) ≠ ∅, we’ll give the proof for the supremum only, as the result for the infimum follows from considering the supremum for −g(X) instead of g(X). For conciseness, let us denote

σ = sup_{γ(S,φ)} z and s = sup_{µ ∈ M(S,φ)} E(g(X))_µ

We first prove that s ≥ σ. There exists a sequence (P_k)_{k=1}^∞ in γ(S, φ) such that P_k = (φ, z_k) and lim_{k→∞} z_k = σ. Using corollary 1, it follows that for every k there exists µ_k ∈ M(S, φ) such that E(Γ(X))_{µ_k} = P_k, hence

lim_{k→∞} E(g(X))_{µ_k} = σ

implying s ≥ σ (note that this reasoning works even if σ is ∞).

Next we show that s ≤ σ. If σ is ∞, there is nothing to prove; otherwise, suppose for contradiction that s > σ. It follows that there exists µ ∈ M(S, φ) such that ḡ ≡ E(g(X))_µ > σ. Let P = E(Γ(X))_µ. By definition of M(S, φ) we have P = (φ, ḡ). At the same time, lemma 2 implies P ∈ H(Γ(S)) and hence P ∈ γ(S, φ), but then the supremum σ should be at least ḡ (since it is the z coordinate of a point in γ(S, φ)). Contradiction!

Finally, using corollary 1, for every µ ∈ M(S, φ) we have E(Γ(X))_µ ∈ H(Γ(S)). So we can use Carathéodory’s theorem to build a measure µ̃ ∈ M_{m+2}(S, φ) such that E(Γ(X))_µ = E(Γ(X))_µ̃ and hence such that E(g(X))_µ = E(g(X))_µ̃, which completes the proof.

Note that if we supplement this theorem with corollary 1, we get that the possible values for E(g(X))_µ are exactly the z coordinates in γ(S, φ), which provides us, at least in principle, with the answer to the problem we set out to study. This can be put in the form

inf_{γ(S,φ)} z ≤ E(g(X))_µ ≤ sup_{γ(S,φ)} z

which can be seen as a generalization of Jensen’s inequality, as we will see from examples in the following sections. Furthermore, the equality in theorem 1 means the inequality can always be saturated when the bounds are finite. Also, the fact that M(S, φ) and M_{m+2}(S, φ) have the same extrema allows us to think of the problem of obtaining these bounds as an optimization over M_{m+2}(S, φ), which we will explore later. However, using convex hulls still obfuscates the results due to their complexity. In the next sections we develop tools to obtain the extrema in γ(S, φ) more easily.
4 A first generalization of Jensen’s inequality

Theorem 1 tells us that studying H(Γ(S)) gives us information about E(g(X)). As an example, consider the situation where we are given the expectation of the random vector, E(X) ∈ R^n, and we want to study E(g(X)). This is the case where m = n, f_i(x) = x_i, S = R^n and φ = E(X). In this case, Γ(S) is the graph of g(x):

G = { (x, g(x)) | x ∈ R^n }

whose convex hull obeys

H(Γ(S)) = H(G) ⊆ cl(H(G)) = { (x, z) | x ∈ R^n and ğ(x) ≤ z ≤ ĝ(x) }

where ğ, ĝ : R^n → R are respectively the convex and concave envelopes of g. Applying theorem 1, we get that

ğ( E(X)_µ ) ≤ E(g(X))_µ ≤ ĝ( E(X)_µ )

which can be seen as a generalization of Jensen’s inequality beyond the convex/concave case. Note that we could have proven this inequality directly from Jensen’s inequality together with ğ(x) ≤ g(x) ≤ ĝ(x), since ğ is convex and ĝ is concave. However, the theory we will develop can be applied in many more situations. Notably, we can study the case where one is given E(X) and Var(X) and we want to study E(g(X)), which has received considerable attention [16, 12].
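In one dimension the envelope bound can be approximated directly: sample the graph G, keep its lower convex hull, and evaluate it at E(X). A rough sketch follows (the helper lower_convex_envelope and the choice g = sin are our own illustration, and we assume a bounded support [a, b] so the sampled hull approximates H(G)):

```python
import numpy as np

def lower_convex_envelope(g, a, b, m=2001):
    # Approximate the convex envelope of g on [a, b]: sample the graph
    # G = {(x, g(x))} and keep the lower convex hull (monotone chain).
    xs = np.linspace(a, b, m)
    ys = g(xs)
    chain = []                       # lower-hull vertices, left to right
    for x, y in zip(xs, ys):
        while len(chain) >= 2:
            (x1, y1), (x2, y2) = chain[-2], chain[-1]
            if (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) <= 0.0:
                chain.pop()          # middle point lies above the new segment
            else:
                break
        chain.append((float(x), float(y)))
    cx, cy = np.array(chain).T
    return lambda t: np.interp(t, cx, cy)

# Jensen-type lower bound for a non-convex g, here g = sin on [0, 3*pi],
# valid for any X supported there with E(X) = 4 (assumed value):
env = lower_convex_envelope(np.sin, 0.0, 3 * np.pi)
print(env(4.0), "<= E(sin(X))")
```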
5 Geometrical Approach

Consider now the situation where one of the extrema σ− = inf_{γ(S,φ)} z or σ+ = sup_{γ(S,φ)} z is finite. It follows that there exists a point P = (φ, σ) (either (φ, σ−) or (φ, σ+)) in cl(γ(S, φ)). This point would be in the boundary of H(Γ(S)), and hence there exists a supporting hyperplane passing through P that divides the space in two parts, one of which has no intersection with Γ(S). It turns out that the existence of these hyperplanes can help us find σ± (and hence the bound on the expectation we are interested in).

Theorem 2. If S ⊆ R^n, then there exist vectors α± = (α_{1±}, ..., α_{m±}) and values β±, c± such that β± ≥ 0, (α_{1±}, ..., α_{m±}, β±) ≠ 0⃗ and

• If σ+ is finite, then α+ • f(x) + β+ g(x) + c+ ≤ 0 ∀x ∈ S and sup_{µ ∈ M(S,φ)} E( α+ • f(X) + β+ g(X) + c+ )_µ = 0

• If σ− is finite, then α− • f(x) + β− g(x) + c− ≥ 0 ∀x ∈ S and inf_{µ ∈ M(S,φ)} E( α− • f(X) + β− g(X) + c− )_µ = 0

• Moreover, for each case where σ± is finite, if (φ, σ±) ∈ γ(S, φ), then there exists a measure µ± ∈ M_{m+1}(S, φ) with support s±, such that E(g(X))_{µ±} = σ± and α± • f(x) + β± g(x) + c± = 0 ∀x ∈ s±
Proof. We will focus on the proof for σ+ (the proof for σ− is analogous). Let us prove the first item, starting with the inequality

α+ • f(x) + β+ g(x) + c+ ≤ 0 ∀x ∈ S    (2)

Since σ+ is finite, using theorem 1, there exists a point P+ = (φ, σ+) ∈ cl(γ(S, φ)). This point must be in the boundary of H(Γ(S)), so there must exist a supporting hyperplane Π containing it. This hyperplane divides the space into two regions, one of which has no intersection with H(Γ(S)) (and hence with Γ(S)).

Algebraically, the hyperplane and the regions it defines can be specified (for every y ∈ R^{m+1}) as a+ • y + c+ = 0 (Π itself) and a+ • y + c+ > 0, a+ • y + c+ < 0 (the two regions), where a+ ≡ (α_{1+}, ..., α_{m+}, β+) ≠ 0⃗. Note that the sign of a+ is arbitrary (as we can change it and the sign of c+ to get the same geometric locus), so without loss of generality we may assume β+ ≥ 0.

We will now consider some possibilities. If Γ(S) is such that int(H(Γ(S))) = ∅, then Π can also be made such that H(Γ(S)) ⊆ Π, meaning that equality holds in (2) for all points x ∈ S (and we are done). We will divide the case int(H(Γ(S))) ≠ ∅ in two other cases. Firstly, if β+ = 0, we need to prove that we can choose a+ and c+ such that

a+ • Γ(x) + c+ = α+ • f(x) + c+ ≤ 0 ∀x ∈ S

But since β+ = 0, the sign of a+ has still not been determined. So the observation that a+ • y + c+ has the same sign for all y ∈ Γ(S) trivially implies that we can choose this sign to be negative.

Finally, for the case β+ > 0, consider the line λ = { (φ, t) | t ∈ R }. If λ ∩ int(H(Γ(S))) = ∅, then the separating hyperplane theorem tells us that there exists a hyperplane Ξ separating λ and int(H(Γ(S))). Since P+ ∈ λ, the distance between these two sets is 0, implying that Ξ is also a supporting hyperplane containing P+. Furthermore, it implies that λ ⊆ Ξ, so if we had used Ξ instead of Π to define α and β, we’d be back to the case β+ = 0 that we proved already. On the other hand, if λ and int(H(Γ(S))) are not disjoint, then λ ∩ H(Γ(S)) is not a singleton, implying that there are points of the form (φ, t) in H(Γ(S)) with t ≠ σ+. These points must have t < σ+, and hence the region defined by Π where H(Γ(S)) (and hence Γ(S)) resides is a+ • y + c+ ≤ 0. Therefore a+ • Γ(x) + c+ ≤ 0 ∀x ∈ S. Expanding this, we get the desired inequality:

α+ • f(x) + β+ g(x) + c+ ≤ 0 ∀x ∈ S

(for σ− the only difference is that in the last step we have t > σ−, which reverses the sign of the final inequality).

Next, we must prove that

sup_{µ ∈ M(S,φ)} E( α+ • f(X) + β+ g(X) + c+ )_µ = 0

We first note that if µ ∈ M(S, φ), then

E( α+ • f(X) + β+ g(X) + c+ )_µ = α+ • φ + c+ + β+ E(g(X))_µ

Since β+ ≥ 0, this implies that

sup_{µ ∈ M(S,φ)} E( α+ • f(X) + β+ g(X) + c+ )_µ = α+ • φ + c+ + β+ σ+ = a+ • P+ + c+

which equals 0 because P+ ∈ Π (the reasoning is identical for σ−).

Finally, let us prove the last item. Since P+ ∈ γ(S, φ), then P+ ∈ H(Γ(S)) and we can use Carathéodory’s theorem to write P+ as a convex combination of m+1 points in Γ(S) (m+2 are not needed because we are in the boundary) and, using an argument similar to the one in the proof of corollary 1, there exists µ+ ∈ M_{m+1}(S, φ) with E(Γ(X))_{µ+} = P+. Hence E(g(X))_{µ+} = σ+. Also,

E( a+ • Γ(X) + c+ )_{µ+} = a+ • E(Γ(X))_{µ+} + c+ = a+ • P+ + c+ = 0
Since a+ • Γ(x) + c+ must have the same sign for all x ∈ S and its expectation is null, µ+ must be such that a+ • Γ(X) + c+ = 0 almost surely. Since the support s+ of µ+ is finite, then s+ ⊆ Π. Expanding a+ • Γ(x), we get

α+ • f(x) + β+ g(x) + c+ = 0 ∀x ∈ s+

(once again, the reasoning is identical for σ−).

An important situation where this theorem can be applied is when S is compact in R^n and the restriction of Γ to S is continuous. In this case both σ± are finite and such that (φ, σ±) ∈ γ(S, φ) and, as we will see in section 5.1, studying the possibilities for µ± will be very useful for finding the σ±. The next theorem will allow us to extend this use case by studying progressive compact covers of S, instead of S itself.
Theorem 3. Let (S_i)_{i=1}^∞ be a progressive cover of S ⊆ R^n and let L_i, U_i, L_i^{(k)} and U_i^{(k)} be defined as follows:

L_i ≡ inf_{µ ∈ M(S_i,φ)} E(g(X))_µ,  U_i ≡ sup_{µ ∈ M(S_i,φ)} E(g(X))_µ,

L_i^{(k)} ≡ inf_{µ ∈ M_k(S_i,φ)} E(g(X))_µ  and  U_i^{(k)} ≡ sup_{µ ∈ M_k(S_i,φ)} E(g(X))_µ

Then

inf_{µ ∈ M(S,φ)} E(g(X))_µ = lim_{i→∞} L_i,  sup_{µ ∈ M(S,φ)} E(g(X))_µ = lim_{i→∞} U_i,

inf_{µ ∈ M_k(S,φ)} E(g(X))_µ = lim_{i→∞} L_i^{(k)}  and  sup_{µ ∈ M_k(S,φ)} E(g(X))_µ = lim_{i→∞} U_i^{(k)}
Proof. Let us show that

sup_{µ ∈ M(S,φ)} E(g(X))_µ = lim_{i→∞} U_i

Since (S_i)_{i=1}^∞ is a progressive cover, it follows that M(S_i, φ) ⊆ M(S_j, φ) ⊆ M(S, φ) whenever i < j. Firstly, this implies that in the case M(S, φ) = ∅ we have M(S_i, φ) = ∅ ∀i, hence the supremum and all the U_i are −∞ (proving this case). Secondly, if M(S, φ) ≠ ∅, then (U_i)_{i=1}^∞ is non-decreasing.

Consider first the case where the supremum is finite:

sup_{µ ∈ M(S,φ)} E(g(X))_µ = σ

It follows that for every ε > 0 there exists µ_ε ∈ M(S, φ) such that

ḡ ≡ E(g(X))_{µ_ε} ≥ σ − ε

We must have then E(Γ(X))_{µ_ε} = (φ, ḡ). Using lemma 2, we have (φ, ḡ) ∈ H(Γ(S)). Using Carathéodory’s theorem, we can then build ν_ε ∈ M_{m+2}(S, φ) with E(Γ(X))_{ν_ε} = (φ, ḡ). Let s_ε be the support of ν_ε. Since s_ε ⊆ S is finite and (S_i)_{i=1}^∞ is a progressive cover of S, there exists N_ε such that for all n > N_ε we have s_ε ⊆ S_n, implying ν_ε ∈ M(S_n, φ) and hence

U_n = sup_{µ ∈ M(S_n,φ)} E(g(X))_µ ≥ E(g(X))_{ν_ε} = ḡ

However, since M(S_n, φ) ⊆ M(S, φ), then U_n ≤ σ, so |U_n − σ| ≤ ε, implying the limit.

The case when the supremum is ∞ is similar. For all δ, there exists µ_δ ∈ M(S, φ) such that

E(g(X))_{µ_δ} ≥ δ

Invoking lemma 2 and Carathéodory’s theorem, we can once again find ν_δ ∈ M_{m+2}(S, φ) with the same expectation E(Γ(X)) as µ_δ. Once again, since the support of ν_δ is finite, there exists N_δ such that for all n > N_δ the support of ν_δ is in S_n, implying U_n ≥ δ and hence that U_n → ∞.

This concludes the proof, as the limit for the infimum follows from considering the limit of the supremum for −g instead of g, and the reasoning for the case with finite support is nearly identical (substitute M for M_k and U_i for U_i^{(k)}, and instead of using lemma 2 together with Carathéodory’s theorem to build ν_ε and ν_δ, we can just use µ_ε and µ_δ in their places, as the supports are already finite).

To illustrate how to use theorems 2 and 3, we will work out some examples where the bounds can also be derived by simpler methods.

5.1 Γ is continuous

Let us first find the lower bound for E(X⁴) given E(X) = λ and Var(X) = σ². This problem fits the framework we just developed: we have m = 2, f_1(x) = x, f_2(x) = x², g(x) = x⁴, φ = (λ, λ² + σ²) and S = R. Theorem 2 gives us the most information when Γ(S) is compact, which is not the case here. However, we can use theorem 3 and study a progressive compact cover of R instead. We will consider a cover whose elements are intervals of the form [−a, a] (which particular cover is used turns out to be unimportant). Since Γ([−a, a]) is compact, the corresponding γ([−a, a], φ) will be compact and we can use theorem 2 to find the infimum of E(X⁴)_µ for µ ∈ M([−a, a], φ). Theorem 2 implies that there exists a measure in M([−a, a], φ) that attains this infimum and whose support consists of roots of

h(x) ≡ α₀ + α₁x + α₂x² + βx⁴ ≥ 0 ∀x ∈ [−a, a]

for some choice of constants β, α_k, with β ≥ 0. We note that the roots in ]−a, a[ must be double roots, while ±a may be simple roots. If σ² > 0 then the support must have more than one point and hence h(x) must have more than one root. These constraints leave us with the possibilities found in figure 1 for the qualitative graph of h(x).
Figure 1: Possibilities (a)–(d) for the qualitative graph of h(x) respecting h(x) ≥ 0 in [−a, a] and with at least 2 roots in the interval.

Adding the constraint that the cubic term in h(x) must be 0 allows us to discard the possibilities in 1(c) and (d), as the sum of the roots is necessarily different from zero in these cases (in 1(c), for example, the roots are b < a with multiplicity 2, −a with multiplicity 1 and c ≤ −a with multiplicity 1, so their sum, taking the multiplicities into account, must be negative). It also implies that in the cases 1(a) and (b) the roots must be ±b for some b ∈ [−a, a]. Imposing the known expectations, we get that b = √(λ² + σ²), and hence there exists only one measure obeying all the properties prescribed by theorem 2 for every a ≥ b (the probabilities for −b and b are uniquely determined by λ), so it must have the smallest possible expectation for X⁴. Since this measure is the same for every a ≥ b, the limit prescribed by theorem 3 is trivial and we have the bound

E(X⁴) ≥ (λ² + σ²)²

which can also be derived directly from Jensen’s inequality: E((X²)²) ≥ E(X²)².

For the second example, let’s rederive Markov’s inequality: if X is a nonnegative random variable and a > 0, then
Pr(X > a) ≤ min{ E(X)/a, 1 }

We can translate this into our framework using m = 1, f_1(x) = x, g(x) = Θ(a − x), S = R₊ and φ = λ, where Θ is Heaviside’s step function

Θ(x) = 1 if x ≥ 0; 0 if x < 0

and we want to show that the lower bound for E(g(X)) = E(Θ(a − X)) = Pr(X ≤ a) is max{1 − λ/a, 0}.

Once again, Γ(S) is not bounded, but we can use a progressive compact cover for S. This cover must be built more carefully than in the previous example, to deal with the discontinuity of g(x) at x = a. A possible choice is to use

S_n = [0, a] ∪ [a + 1/n, n] for n > a + 1

as the elements of the cover. Since the restriction of Γ to any S_n is continuous, Γ(S_n) is compact and we can once again use theorem 2 to find a measure that minimizes E(g(X)).

The support of the measure that minimizes E(g(X)) is composed of roots of

h(x) = βΘ(a − x) + α₁x + α₀ ≥ 0

for some choice of α₀, α₁ and β, such that β ≥ 0. What the possible roots are will depend only on the sign of α₁. Some representative graphs can be found in figure 2. Taking n > λ, a + 1, we have the following cases:

(A) If α₁ < 0, then the only possible root is n (Fig 2(a)).
(B) If α₁ > 0, then the only possible roots are 0 and a + 1/n (Figs 2(b–d)).
(C) If α₁ = 0, then all values in the interval [a + 1/n, n] can be roots, as long as we also have α₀ = 0 (Fig 2(e)).

Figure 2: Possibilities (a)–(e) for the graph of h(x) respecting h(x) ≥ 0 in S_n = [0, a] ∪ [a + 1/n, n].

Since n > λ, case (A) is irrelevant, as it never obeys the constraints. If λ ≤ a, then case (B) is the only one that can obey the constraints, and doing the algebra leads us to Pr(X ≤ a) = E(Θ(a − X)) = 1 − nλ/(na + 1). Finally, if λ > a, then case (C) is the only one that can obey the constraints and we have trivially Pr(X ≤ a) = E(Θ(a − X)) = 0.

To obtain Markov’s inequality we must use theorem 3 and take the limit n → ∞, which is clearly

lim_{n→∞} E(Θ(a − X))|_{S_n} = max{ 1 − λ/a, 0 }

completing the proof. Discontinuities in Γ where the limit along some curve exists can in principle be treated similarly (removing an open set that becomes smaller from each of the elements of a progressive compact cover), but this is normally only feasible when X ∈ R. Nevertheless, note that this still requires a one-sided limit to exist in order to work.
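The optimization underlying these examples can also be explored by brute force: by theorem 1 (here m = 1), the infimum of Pr(X ≤ a) is already attained within M₃(R₊, φ), so a crude random search over 3-point measures should approach the Markov bound. A sketch, with all sampling choices our own:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, a = 1.0, 2.0            # assumed values: E(X) = 1, threshold a = 2
best = np.inf

# Random 3-point measures on R_+, rescaled so that E(X) = lam exactly.
for _ in range(200_000):
    x = rng.uniform(0.0, 10.0, size=3)    # candidate support in R_+
    p = rng.dirichlet(np.ones(3))         # candidate probabilities
    m1 = p @ x
    if m1 <= 0.0:
        continue
    x *= lam / m1                         # enforce the mean constraint
    best = min(best, float(p @ (x <= a))) # E(Theta(a - X)) = Pr(X <= a)

print(best)   # approaches max(1 - lam/a, 0) = 0.5 from above
```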
5.2 Progressive compact covers

The examples in section 5.1 highlight the importance of the case when S ⊆ R^n has a progressive compact cover. This raises the question of how to identify these situations, which fortunately has an easy answer:
Lemma 3. S ⊆ R^n has a progressive compact cover iff it is an F_σ set.

Proof. If S has a progressive compact cover (S_k)_{k=1}^∞, then clearly S is F_σ, as

S = ∪_{k=1}^∞ S_k

If, on the other hand, S is an F_σ set, then S can be written as

S = ∪_{k=1}^∞ F_k

where the F_k are all closed. Let (R_k)_{k=1}^∞ be a progressive compact cover of R^n (like R_k = [−k, k]^n, for example). Then, if we define

S_k = R_k ∩ ( ∪_{q=1}^k F_q )

then (S_k)_{k=1}^∞ is a progressive compact cover of S.

This leads us to the following strengthening of theorem 1:

Corollary 2. If S is an F_σ set in R^n and Γ is continuous, then

inf_{µ ∈ M(S,φ)} E(g(X))_µ = inf_{µ ∈ M_{m+1}(S,φ)} E(g(X))_µ

sup_{µ ∈ M(S,φ)} E(g(X))_µ = sup_{µ ∈ M_{m+1}(S,φ)} E(g(X))_µ

Proof.
This is trivially true if M(S, φ) = ∅. Otherwise, lemma 3 implies that S has a progressive compact cover (S_k)_{k=1}^∞. Since Γ is continuous, γ(S_k, φ) is always compact, and using theorems 2 and 3 we get

inf_{µ ∈ M(S,φ)} E(g(X))_µ = lim_{k→∞} ( inf_{µ ∈ M(S_k,φ)} E(g(X))_µ ) = lim_{k→∞} ( inf_{µ ∈ M_{m+1}(S_k,φ)} E(g(X))_µ ) = inf_{µ ∈ M_{m+1}(S,φ)} E(g(X))_µ

The proof for the supremum follows from considering the infimum for −g instead of g.

6 Bounds given the average and the variance

The examples in the previous section were meant to familiarize the reader with this method of obtaining bounds (study how the roots can be distributed, then apply the constraints to find candidates for the measure extremizing the expectation we are interested in). An interesting feature is that intuitively similar problems should have similar supports for their solutions. The next sections highlight some cases where such patterns arise. In all of them we will be investigating bounds for E(g(X)) given E(X) and Var(X).

6.1 g′(x) is strictly convex
Theorem 4. Let X be a random variable with support contained in [a, b] and let g : [a, b] → R be bounded, differentiable and such that g′(x) is strictly convex. Then for every λ and σ² that are possible values for the average and variance of a variable in [a, b] (which amounts to λ ∈ [a, b] and σ² ≤ λ(a + b) − ab − λ²), there exist measures µ± with E(X)_{µ±} = λ and Var(X)_{µ±} = σ² > 0 such that

E(g(X))_{µ−} = [ σ² g(a) + (λ − a)² g( λ + σ²/(λ − a) ) ] / [ σ² + (λ − a)² ]

E(g(X))_{µ+} = [ σ² g(b) + (λ − b)² g( λ + σ²/(λ − b) ) ] / [ σ² + (λ − b)² ]

and for every measure µ in M([a, b]) with the same average and variance, we have

E(g(X))_{µ−} ≤ E(g(X))_µ ≤ E(g(X))_{µ+}
Proof. We will focus on the lower bound, as the proof for the upper bound is analogous. We have S = [a, b], f_1(x) = x, f_2(x) = x² and φ = (λ, λ² + σ²). Since Γ is continuous and S is compact, theorem 2 implies that there exists a measure µ ∈ M([a, b], φ) that minimizes E(g(X))_µ. The support of µ consists of roots of

h(x) ≡ βg(x) + α₀ + α₁x + α₂x²

for some choice of α₀, α₁, α₂ and β, such that β ≥ 0 and h(x) ≥ 0 in [a, b]. Furthermore, σ² > 0, so there must be more than one point in the support. Using that g′ is strictly convex and β ≥ 0, we can obtain all possibilities for the qualitative graph of h(x) (Figs 3(a, b)).

Figure 3: The possible qualitative graphs of h(x), given the constraints provided by theorem 2. (a) and (b) are the possibilities for the lower bound case, while (c) and (d) are the possibilities for the upper bound case.

It follows that the support must be of the form {a, c} with c ∈ [a, b]. To actually find the measure we need to impose all the constraints, which leads us to the system

p_a + p_c = 1
a p_a + c p_c = λ
a² p_a + c² p_c = λ² + σ²

For σ² > 0 there is only one solution:

p_a = σ² / (σ² + (λ − a)²),  p_c = (λ − a)² / (σ² + (λ − a)²),  c = λ + σ²/(λ − a)

and evaluating E(g(X)) for this measure gives us the lower bound. If we wanted the upper bound instead, the only difference is that now we must have h(x) ≤ 0, so the possibilities for the qualitative graph of h(x) are the ones in Figs 3(c, d) and the support must be of the form {d, b} where d ∈ [a, b]. The rest follows by swapping a for b and c for d.

Let us examine some cases where we can apply theorem 4.

6.1.1 Moment Generating Functions

Suppose we want to find bounds for the moment generating function E(e^{Xs}) of a non-negative random variable X. We have then S = R₊ and, in order to be able to use theorem 4, we first study the case S = [0, a] and then use theorem 3 to obtain the correct bounds.

If s > 0, then g(x) = e^{xs} is such that g′(x) is strictly convex, so X ∈ [0, a] leads us to the bounds

[ σ² + λ² e^{λs + σ²s/λ} ] / [ σ² + λ² ] ≤ E(e^{Xs}) ≤ [ σ² e^{as} + (λ − a)² e^{λs + σ²s/(λ−a)} ] / [ σ² + (λ − a)² ]

whereas if s < 0, then g(x) = −e^{xs} is such that g′(x) is strictly convex, so X ∈ [0, a] implies

[ σ² e^{as} + (λ − a)² e^{λs + σ²s/(λ−a)} ] / [ σ² + (λ − a)² ] ≤ E(e^{Xs}) ≤ [ σ² + λ² e^{λs + σ²s/λ} ] / [ σ² + λ² ]

Finally, taking the limit a → ∞, one arrives at

E(e^{Xs}) ≥ [ σ² + λ² e^{(λ² + σ²)s/λ} ] / [ σ² + λ² ] for s > 0, with no upper bound available

and

e^{λs} ≤ E(e^{Xs}) ≤ [ σ² + λ² e^{(λ² + σ²)s/λ} ] / [ σ² + λ² ] for s < 0

(note that this lower bound does not improve over Jensen’s inequality). These bounds are visualized more easily by graphing them for the cumulant generating function (see figure 4).

Figure 4: Our bounds for the cumulant generating function of a non-negative random variable with average and variance equal to 1 (grey region between the red curves), together with the usual Jensen lower bound (blue).
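The two-point measures of theorem 4 make these bounds cheap to evaluate numerically. Below is a small sketch (the helper theorem4_bounds is our own naming); it reproduces the s > 0 lower bound for the MGF with a = 0 and compares it with the limiting closed form:

```python
import numpy as np

def theorem4_bounds(g, a, b, lam, sig2):
    # Evaluate E(g(X)) at the two-point measures prescribed by theorem 4
    # on [a, b] (g bounded, differentiable, g' strictly convex).
    def two_point(e):                  # e = a gives mu-, e = b gives mu+
        c = lam + sig2 / (lam - e)
        w = sig2 / (sig2 + (lam - e) ** 2)
        return w * g(e) + (1.0 - w) * g(c)
    return two_point(a), two_point(b)

# MGF of a non-negative variable with E(X) = Var(X) = 1, s > 0:
lam, sig2, s = 1.0, 1.0, 0.5
low, up = theorem4_bounds(lambda x: np.exp(s * x), 0.0, 50.0, lam, sig2)
print(low)   # sharp lower bound (left endpoint a = 0)
print((sig2 + lam**2 * np.exp(s * (lam**2 + sig2) / lam)) / (sig2 + lam**2))
print(up)    # grows without limit as the right endpoint increases
```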
6.1.2 Power Means

Suppose that X is a positive random variable, with average λ and variance σ², and that we are interested in finding bounds for the power mean E(X^s)^{1/s} for s ≠ 0. We must study g(x) = x^s with S = ]0, ∞[. Once again, we need to consider a progressive compact cover to use theorem 4 and then apply theorem 3 to obtain the final bound. Since any progressive compact cover of ]0, ∞[ will do, we can use intervals of the form [1/a, a] and then take the limit a → ∞.

Without worrying at first about which is the lower and which is the upper bound (this will depend on the convexity of the derivatives), the two bounds prescribed by theorem 4 (and their limits for a → ∞) are as follows.

When 1/a is in the support:

[ σ² (1/a)^s + (λ − 1/a)² ( λ + σ²/(λ − 1/a) )^s ] / [ σ² + (λ − 1/a)² ] → (σ² + λ²)^{s−1} λ^{2−s} if s > 0; ∞ if s < 0

When a is in the support:

[ σ² a^s + (λ − a)² ( λ + σ²/(λ − a) )^s ] / [ σ² + (λ − a)² ] → ∞ if s > 2; λ^s if s < 2

For s > 2 and 0 < s < 1, g′(x) is strictly convex for x > 0, whereas for s < 0 and for 1 < s < 2, −g′(x) is strictly convex. If we define M_s = E(X^s)^{1/s}, this implies that

0 ≤ M_s ≤ λ for s < 0

(σ² + λ²)^{1−1/s} λ^{2/s−1} ≤ M_s ≤ λ for 0 < s < 1

λ ≤ M_s ≤ (σ² + λ²)^{1−1/s} λ^{2/s−1} for 1 < s < 2

(σ² + λ²)^{1−1/s} λ^{2/s−1} ≤ M_s for s > 2, with no upper bound available

These bounds and their comparison with Jensen’s inequality can be found in figure 5.

Figure 5: Our bounds for the power mean of a non-negative random variable with average and variance equal to 1 (grey region between the red curves), together with the usual Jensen bounds (grey regions and blue curves).

6.2 g(x) is continuous

If we relax the hypotheses and require only that g be continuous, we can still use corollary 2 to write the problem of finding the bounds as an extremization over measures with up to 3 points in their support:
Theorem 5. Let g : R → R be continuous, λ ∈ R, σ > 0, p⃗ ≡ (p_a, p_b, p_c), S = { p⃗ ∈ R³ | p_a, p_b, p_c > 0 and p_a + p_b + p_c = 1 } and define

x_a(p⃗, θ) = λ + σ ( cos(θ) √(p_b/p_a) + sin(θ) √p_c ) / √(p_a + p_b)

x_b(p⃗, θ) = λ + σ ( −cos(θ) √(p_a/p_b) + sin(θ) √p_c ) / √(p_a + p_b)

x_c(p⃗, θ) = λ − σ sin(θ) √( (p_a + p_b)/p_c )

Then

inf_{µ ∈ M} E(g(X))_µ = inf_{p⃗ ∈ S} inf_θ ( p_a g(x_a) + p_b g(x_b) + p_c g(x_c) )

sup_{µ ∈ M} E(g(X))_µ = sup_{p⃗ ∈ S} sup_θ ( p_a g(x_a) + p_b g(x_b) + p_c g(x_c) )

where M is the set of measures with support in R, E(X)_µ = λ and Var(X)_µ = σ².

Proof. Since S = R is an F_σ set and Γ(x) = (x, x², g(x)) is continuous, we can apply corollary 2. Since m = 2:

inf_{µ ∈ M(S,φ)} E(g(X))_µ = inf_{µ ∈ M₃(S,φ)} E(g(X))_µ

sup_{µ ∈ M(S,φ)} E(g(X))_µ = sup_{µ ∈ M₃(S,φ)} E(g(X))_µ

where φ = (λ, λ² + σ²) (so M(S, φ) = M). To characterize the measures in M₃(S, φ), we must impose the constraints. Calling the points in the support a, b, c, we have

p_a + p_b + p_c = 1
a p_a + b p_b + c p_c = λ
a² p_a + b² p_b + c² p_c = λ² + σ²

whose solution is

a = λ + σ ( cos(θ) √(p_b/p_a) + sin(θ) √p_c ) / √(p_a + p_b)

b = λ + σ ( −cos(θ) √(p_a/p_b) + sin(θ) √p_c ) / √(p_a + p_b)

c = λ − σ sin(θ) √( (p_a + p_b)/p_c )

where the probabilities are constrained by (p_a, p_b, p_c) ∈ S and θ is a free variable. From here the theorem follows from extremizing over these measures (note that the case with exactly 2 points in the support can be ignored, as we can always make p_c → 0 in a way that c does not contribute to E(g(X)), by making θ → 0 in a convenient way).

This theorem also illustrates how to use these results when the constraints are not enough to reduce the possibilities to a single measure. We are left with an optimization problem over the measures satisfying the constraints.
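For a concrete g, the extremization prescribed by theorem 5 can be approximated with a crude grid over (p⃗, θ); a sketch (the discretization is entirely our choice and only meant as an illustration):

```python
import numpy as np

def theorem5_extrema(g, lam, sigma, n=60):
    # Crude grid search over the 3-point measures of theorem 5,
    # parametrized by probabilities (p_a, p_b, p_c) and the angle theta.
    lo, hi = np.inf, -np.inf
    ps = np.linspace(0.02, 0.96, n)
    thetas = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
    for pa in ps:
        for pb in ps:
            pc = 1.0 - pa - pb
            if pc < 0.02:
                continue
            r = np.sqrt(pa + pb)
            for th in thetas:
                s, c0 = np.sin(th), np.cos(th)
                xa = lam + sigma * (c0 * np.sqrt(pb / pa) + s * np.sqrt(pc)) / r
                xb = lam + sigma * (-c0 * np.sqrt(pa / pb) + s * np.sqrt(pc)) / r
                xc = lam - sigma * s * r / np.sqrt(pc)
                val = pa * g(xa) + pb * g(xb) + pc * g(xc)
                lo, hi = min(lo, val), max(hi, val)
    return lo, hi

# Sanity check with g(x) = x^4, E(X) = 0, Var(X) = 1: the infimum should
# approach (lam^2 + sigma^2)^2 = 1; the supremum is actually unbounded,
# so the grid only returns a large finite value.
print(theorem5_extrema(lambda x: x ** 4, 0.0, 1.0))
```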
7 A convex optimization approach

With the exception of section 6.2, the cases we analysed so far could be tackled analytically. This was mostly because the number n of random variables in the vector X and the number m of constraints were small, together with other properties that allowed us to reduce the size of the support. As an illustration, in the case where we can only apply corollary 2, the measures that extremize E(g(X))_µ can have in their support up to m+1 points in S ⊆ R^n, which corresponds to (m+1)(n+1) variables (for each unknown point in the support, each of the n coordinates and the probability of that point are variables to be found), whereas we have only m+1 constraints (the m constraints given by φ plus the normalization of the measure). So we are still left with an optimization problem over the (m+1)n remaining variables, which in general will be a nonlinear program (as seen in section 6.2). The complexity of finding the bounds would then scale exponentially with (m+1)n and quickly become numerically unfeasible.

This situation can be somewhat remedied if we look at what we have been doing from a different angle. The functions h±(x) = α± • f(x) + β± g(x) + c± that appear in theorem 2 can be thought of as establishing the inequalities E(h−(X))_µ ≥ 0 and E(h+(X))_µ ≤ 0. Substituting the constraints, these give bounds for E(g(X))_µ that can then be optimized by changing α±, β± and c±, until a measure satisfying E(h(X))_µ = 0 (or a sequence of measures whose expectations converge to 0) can be found. The main result is summarized in the following theorem.
Theorem 6. Let S ⊆ R^n and let D = { φ | M(S, φ) ≠ ∅ } be the set of all values for E(f(X)) that are compatible with the support being contained in S. If φ ∈ int(D) ≠ ∅, then

inf_{µ ∈ M(S,φ)} E(g(X))_µ = sup_{α ∈ R^m} ( inf_{x ∈ S} ( g(x) + α • (f(x) − φ) ) )

and

sup_{µ ∈ M(S,φ)} E(g(X))_µ = inf_{α ∈ R^m} ( sup_{x ∈ S} ( g(x) + α • (f(x) − φ) ) )

Proof. We’ll do the proof only for inf E(g(X)), as the proof for the supremum follows from considering the infimum for −g. Defining

σ = inf_{µ ∈ M(S,φ)} E(g(X))_µ

we will start with the case where σ is finite. In this case, the hypotheses of theorem 2 are satisfied, so consider the values α_{i−}, β− and c− predicted by it (denoted α_i, β and c for simplicity). As seen in the proof of theorem 2, a = (α_1, ..., α_m, β) is a vector normal to a supporting hyperplane Π of H(Γ(S)) that passes through (φ, σ) ∈ cl(H(Γ(S))).

Suppose that we had β = 0. In this case, if we project all points in Π onto the hyperplane z = 0, we get Π′ × {0} (instead of R^m × {0}), where Π′ is a hyperplane in R^m with normal vector (α_1, ..., α_m). Furthermore, if we project all points in H(Γ(S)) onto the z = 0 hyperplane, we get D × {0} (because of corollary 1). But since Π is a supporting hyperplane of H(Γ(S)), then Π′ will be a supporting hyperplane of D passing through φ. As a consequence, this would imply that φ ∈ bd(D). The point is that the hypothesis φ ∈ int(D) then implies that β > 0 and, as such, without loss of generality, we can choose β = 1 for the values predicted by theorem 2; that is,

α • f(x) + g(x) + c ≥ 0 ∀x ∈ S    (3)

inf_{µ ∈ M(S,φ)} ( α • E(f(X))_µ + E(g(X))_µ + c ) = 0 ⇒ σ = −α • φ − c    (4)

Consider now the set

C = { (α, z) ∈ R^{m+1} | g(x) + α • (f(x) − φ) + z ≥ 0 ∀x ∈ S }

The definition of C implies that if µ ∈ M(S, φ) and (α, z) ∈ C, then E(g(X))_µ ≥ −z. However, this implies that

σ = − inf_{(α,z) ∈ C} z    (5)

To see why this is true, we first note that (α, −σ) ∈ C (which follows directly from substituting σ = −α • φ − c into the definition of C, while using (3)) and that (α, z) ∈ C ⇒ (α, z′) ∈ C ∀z′ > z. So if we suppose for contradiction that inf_{(α,z) ∈ C} z ≠ −σ, then there would need to exist (α, z′) ∈ C such that z′ < −σ. But then we would have

g(x) + α • (f(x) − φ) ≥ −z′ ∀x ∈ S ⇒ E(g(X))_µ ≥ −z′ > σ ∀µ ∈ M(S, φ)

which contradicts the definition of σ.

It also follows from its definition that C is convex and is the epigraph of some convex function F(α). The definition of C leads us easily to

F(α) = − inf_{x ∈ S} ( g(x) + α • (f(x) − φ) )

Hence, if we calculate the infimum in equation (5) by first taking the infimum over z and then over α, we get

σ = − inf_{(α,z) ∈ C} z = − inf_{α ∈ R^m} F(α) = sup_{α ∈ R^m} (−F(α)) ⇒ inf_{µ ∈ M(S,φ)} E(g(X))_µ = sup_{α ∈ R^m} ( inf_{x ∈ S} ( g(x) + α • (f(x) − φ) ) )

This concludes the proof of the case where σ is finite. Since φ ∈ int(D) ≠ ∅, the hypotheses of the theorem exclude the case σ = ∞. Finally, in the case σ = −∞, Γ(S) must be unbounded.
We will start by considering a progressive bounded cover (B_k)_{k=1}^∞ of Γ(S) (like B_k = Γ(S) ∩ [−k, k]^{m+1}, for example) and use it to find a progressive cover (S_k)_{k=1}^∞ of S, where S_k = S ∩ Γ^{−1}(B_k). Since B_k is bounded, it follows that

γ(S_k, φ) = H(B_k) ∩ ({φ} × R)

is bounded. Since γ(S_k, φ) is bounded, theorem 1 implies that inf_{µ ∈ M(S_k,φ)} E(g(X))_µ is finite and hence, from what we have already shown,

inf_{µ ∈ M(S_k,φ)} E(g(X))_µ = sup_{α ∈ R^m} ( inf_{x ∈ S_k} ( g(x) + α • (f(x) − φ) ) )
If we combine this result with theorem 3 we get

−∞ = inf_{µ ∈ M(S,φ)} E(g(X))_µ = lim_{k→∞} ( inf_{µ ∈ M(S_k,φ)} E(g(X))_µ ) = lim_{k→∞} ( sup_{α ∈ R^m} ( inf_{x ∈ S_k} ( g(x) + α • (f(x) − φ) ) ) )

So for every δ ∈ R there exists N_δ such that if k > N_δ then

sup_{α ∈ R^m} ( inf_{x ∈ S_k} ( g(x) + α • (f(x) − φ) ) ) < δ ⇒ inf_{x ∈ S_k} ( g(x) + α • (f(x) − φ) ) < δ ∀α ∈ R^m

But

inf_{x ∈ S} ( g(x) + α • (f(x) − φ) ) ≤ inf_{x ∈ S_k} ( g(x) + α • (f(x) − φ) ) < δ ∀α ∈ R^m

⇒ inf_{x ∈ S} ( g(x) + α • (f(x) − φ) ) < δ ∀α ∈ R^m, δ ∈ R

⇒ inf_{x ∈ S} ( g(x) + α • (f(x) − φ) ) = −∞ ∀α ∈ R^m

⇒ sup_{α ∈ R^m} ( inf_{x ∈ S} ( g(x) + α • (f(x) − φ) ) ) = −∞ = inf_{µ ∈ M(S,φ)} E(g(X))_µ

The main advantage of this formulation is that finding the bounds becomes a convex optimization problem in R^m. In particular, we minimize some convex function F(α), where evaluating F is akin to solving a global optimization in S ⊆ R^n. More precisely, if we define

G(x; α) = g(x) + α • (f(x) − φ)

then the convex functions we must use for finding the lower and upper bounds of E(g(X))_µ are F± : R^m → R such that F±(α) = sup_{x ∈ S} (±G(x; α)), as we have

inf_{µ ∈ M(S,φ)} E(g(X))_µ = − inf_{α ∈ R^m} F−(α) and sup_{µ ∈ M(S,φ)} E(g(X))_µ = inf_{α ∈ R^m} F+(α)    (6)

The complexity of solving the problem numerically with this approach grows polynomially in m and exponentially in n, which is a huge improvement over the more naive approach of the previous sections. Nevertheless, this approach is not as useful for obtaining analytical results and makes it harder to use special properties of the f_i and g, so there is actually a tradeoff between the two approaches. Finally, the intermediate steps of the minimization of both F± can be used to create bounds that are looser but numerically cheaper to obtain (only a rough idea of where the extrema are might already lead to a useful bound):

Corollary 3. −F−(α) ≤ E(g(X))_µ ≤ F+(α′) ∀α, α′ ∈ R^m and µ ∈ M(S, φ)

Proof.
This follows directly from equation (6).

7.1 Minimizing F±

In order to find the value of F±(α) for a given α, we will need to solve an optimization problem in S. Doing this numerically will typically lead us to sequences (x_n^±)_{n=1}^∞ in S such that lim_{n→∞} G(x_n^±; α) = ±F±(α). Interestingly, if the sequences (f(x_n^±))_{n=1}^∞ are convergent, they can be used to find a subgradient for F± at α, without the need to evaluate F± at different values of α. More precisely:

Lemma 4. If α is such that F±(α) is finite, (x_n^±)_{n=1}^∞ is such that lim_{n→∞} G(x_n^±; α) = ±F±(α), and the limit

S±(α) = ± lim_{n→∞} ( f(x_n^±) − φ )

exists, then S±(α) is a subgradient of F± at α.

Proof. The proof is by direct verification. We need to show that for all α′ ∈ R^m we have

F±(α′) − F±(α) − S±(α) • (α′ − α) ≥ 0

If F±(α′) is not finite, then the left-hand side is ∞ (its definition implies that F±(α′) cannot be −∞) and the inequality follows trivially, so we only need to consider the cases where F±(α′) is finite. For F+:

F+(α′) − F+(α) − S+(α) • (α′ − α) = sup_{x ∈ S} G(x; α′) − lim_{n→∞} ( g(x_n^+) + α • (f(x_n^+) − φ) ) − lim_{n→∞} ( f(x_n^+) − φ ) • (α′ − α)

= sup_{x ∈ S} G(x; α′) − lim_{n→∞} ( g(x_n^+) + α′ • (f(x_n^+) − φ) ) = sup_{x ∈ S} G(x; α′) − lim_{n→∞} G(x_n^+; α′) ≥ 0

For F−:

F−(α′) − F−(α) − S−(α) • (α′ − α) = sup_{x ∈ S} (−G(x; α′)) + lim_{n→∞} ( g(x_n^−) + α • (f(x_n^−) − φ) ) + lim_{n→∞} ( f(x_n^−) − φ ) • (α′ − α)

= − inf_{x ∈ S} G(x; α′) + lim_{n→∞} ( g(x_n^−) + α′ • (f(x_n^−) − φ) ) = lim_{n→∞} G(x_n^−; α′) − inf_{x ∈ S} G(x; α′) ≥ 0

This result implies that a subgradient method can be used to obtain the bounds numerically, under no extra assumptions about f and g. Also, note that if S is compact and f, g are continuous on S, then no limits need to be taken and we can just use the estimates for argmax_{x ∈ S} (±G(x; α)) obtained when calculating F±(α).
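Putting equation (6), corollary 3 and lemma 4 together gives a simple numerical recipe: evaluate the inner sup over (a sample of) S, take f(x*) − φ at the inner maximizer as the subgradient, and run a diminishing-step subgradient descent, keeping the best F+ value seen, which is itself a valid bound by corollary 3. A minimal sketch, with the inner global solver replaced by a dense grid and all step-size choices our own:

```python
import numpy as np

def upper_bound(g, f, phi, xs, n_iter=2000):
    # Minimize F_+(alpha) = sup_x (g(x) + alpha.(f(x) - phi)) over alpha
    # (equation (6)); xs is a dense sample of S standing in for the inner
    # global optimizer. By lemma 4, f(x*) - phi at an inner maximizer x*
    # is a subgradient of F_+ at alpha.
    fx = np.column_stack([fi(xs) for fi in f]) - phi   # shape (N, m)
    gx = g(xs)
    alpha = np.zeros(fx.shape[1])
    best = np.inf
    for t in range(1, n_iter + 1):
        vals = gx + fx @ alpha
        i = int(np.argmax(vals))           # inner sup over the sample of S
        best = min(best, vals[i])          # every F_+(alpha) is a bound (corollary 3)
        sg = fx[i]                         # subgradient from lemma 4
        alpha -= sg / (np.sqrt(t) * (np.linalg.norm(sg) + 1e-12))
    return best

# Example: upper bound for E(sin(X)) on S = [0, 10], given E(X) = 1, E(X^2) = 2.
xs = np.linspace(0.0, 10.0, 5001)
f = [lambda x: x, lambda x: x ** 2]
print(upper_bound(np.sin, f, np.array([1.0, 2.0]), xs))
```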
7.2 Probabilistic work extraction

Let X be a random variable such that X ≥ a, with a < 0 and E(e^X) = 1. Given some value λ > a, we are interested in the largest possible value of Pr(X ≥ λ). This problem was studied in [5] (in the context of finding the optimal work extraction of a process obeying Jarzynski’s equality [10]), where it was found that

Pr(X ≥ λ) ≤ min{ 1, (1 − e^a)/(e^λ − e^a) }

holds and is sharp.

We can obtain the same result with theorem 6. We have in this case S = [a, ∞[, g(x) = Θ(x − λ), f(x) = e^x, φ = 1 and D = [e^a, ∞[. Since φ ∈ int(D), theorem 6 tells us that the answer is

σ+ ≡ inf_α ( sup_{x ≥ a} ( Θ(x − λ) + α(e^x − 1) ) )

One can easily determine that

F(α) ≡ sup_{x ≥ a} ( Θ(x − λ) + α(e^x − 1) ) = ∞ if α > 0;  α(e^λ − 1) + 1 if 1/(e^a − e^λ) ≤ α ≤ 0;  α(e^a − 1) if α ≤ 1/(e^a − e^λ)

The graph of F(α) is slightly different depending on the sign of λ (Fig 6), with the minimum value attained at α = 0 if λ ≤ 0 and at α = 1/(e^a − e^λ) if λ > 0. Substituting, we get the result in [5]:

σ+ = 1 if λ ≤ 0 and σ+ = (1 − e^a)/(e^λ − e^a) if λ > 0, that is, σ+ = min{ 1, (1 − e^a)/(e^λ − e^a) }

Figure 6: F(α) for different values of λ: (a) λ < 0 (b) λ = 0 (c) λ > 0.
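This one-dimensional dual problem also makes a convenient numerical test of the recipe above: F(α) is convex, so even a generic scalar minimizer recovers the closed form. A quick sketch (the parameter values are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize_scalar

a, lam = -1.0, 0.5                  # assumed values: X >= a with a < 0, lam > 0

def F(alpha):
    # F(alpha) = sup_{x >= a} (Theta(x - lam) + alpha*(e^x - 1)); for
    # alpha <= 0 the sup is attained at x = a or x = lam, and F is
    # infinite for alpha > 0.
    if alpha > 0.0:
        return np.inf
    return max(alpha * (np.exp(a) - 1.0), 1.0 + alpha * (np.exp(lam) - 1.0))

res = minimize_scalar(F, bounds=(-10.0, 0.0), method="bounded")
closed = min(1.0, (1.0 - np.exp(a)) / (np.exp(lam) - np.exp(a)))
print(res.fun, closed)              # the two values agree (up to solver tolerance)
```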
In theorem 6 we use φ ∈ int(D) as a hypothesis. This is a strengthening of φ ∈ D, which just says that the constraints in the problem are feasible. In order to show that φ ∈ D, it suffices to find an example of a measure obeying the constraints. The following lemma can be used to prove φ ∉ cl(D) by means of proving that a certain inequality holds:

Lemma 5. Let S ⊆ R^n and let D = { φ | M(S, φ) ≠ ∅ }. Then φ ∉ cl(D) if and only if there exist α ∈ R^m and ε > 0 such that

α • (f(x) − φ) > ε ∀x ∈ S

Proof. If φ ∉ cl(D), then since cl(D) and {φ} are convex and closed and {φ} is also compact, by the separating hyperplane theorem there exists a hyperplane Π that separates both sets with a gap; that is, if Π is defined by α • x + β = 0, then there exists ε > 0 such that α • φ + β < −ε/2 and α • y + β > ε/2 ∀y ∈ cl(D). Combining both inequalities, it follows that α • (y − φ) > ε ∀y ∈ cl(D). Since D = H(f(S)) (using corollary 1), then f(S) ⊆ cl(D) and hence α • (f(x) − φ) > ε ∀x ∈ S.

On the other hand, if α • (f(x) − φ) > ε ∀x ∈ S, then taking an expectation on both sides we have that α • (E(f(X))_µ − φ) > ε ∀µ ∈ M(S). Once again using corollary 1, this means that α • (y − φ) > ε ∀y ∈ D, and this implies that α • (y − φ) ≥ ε ∀y ∈ cl(D). Since taking y = φ would give α • (y − φ) = 0 while ε > 0, this finally implies φ ∉ cl(D).

This lemma also allows us to extend theorem 6 a bit:

Corollary 4. If g(S) is bounded, then theorem 6 can be extended to the case φ ∉ bd(D).

Proof.
Starting from theorem 6, what we still need to show is that if φ ∉ cl(D), then

sup_{α ∈ R^m} ( inf_{x ∈ S} ( g(x) + α • (f(x) − φ) ) ) = ∞ and inf_{α ∈ R^m} ( sup_{x ∈ S} ( g(x) + α • (f(x) − φ) ) ) = −∞

Using lemma 5, there exist α ∈ R^m and ε > 0 such that α • (f(x) − φ) > ε ∀x ∈ S. It follows that for all λ > 0 we have

inf_{x ∈ S} ( g(x) + λα • (f(x) − φ) ) ≥ inf_{x ∈ S} g(x) + inf_{x ∈ S} ( λα • (f(x) − φ) ) ≥ λε + inf_{x ∈ S} g(x)

and since g(S) is bounded, taking λ → ∞ it follows that

sup_{α ∈ R^m} ( inf_{x ∈ S} ( g(x) + α • (f(x) − φ) ) ) = ∞

Similarly, if λ < 0 we have

sup_{x ∈ S} ( g(x) + λα • (f(x) − φ) ) ≤ sup_{x ∈ S} g(x) + sup_{x ∈ S} ( λα • (f(x) − φ) ) ≤ λε + sup_{x ∈ S} g(x)

so taking λ → −∞ we have

inf_{α ∈ R^m} ( sup_{x ∈ S} ( g(x) + α • (f(x) − φ) ) ) = −∞
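The unboundedness in corollary 4 is easy to observe numerically: if φ is infeasible, pushing α along the direction provided by lemma 5 drives the dual objective for the upper bound to −∞. A toy sketch (all data assumed for illustration):

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 1001)   # S = [0, 1]: every measure on S has E(X) in [0, 1]
g = np.sin                         # any bounded g
phi = 2.0                          # infeasible constraint E(X) = 2, i.e. phi not in cl(D)

# With f(x) = x, alpha = -1 gives alpha*(f(x) - phi) = 2 - x >= 1, as in lemma 5.
# The multiples lam*alpha with lam -> -infinity used in corollary 4's proof are the
# positive coefficients below; they drive the upper-bound objective to -infinity.
for c in [0.0, 10.0, 100.0, 1000.0]:
    print(c, np.max(g(xs) + c * (xs - phi)))   # tends to -infinity as c grows
```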
Conclusions

In this work we have studied how knowledge of expectations E(f_i(X)) of a random vector X allows us to bound E(g(X)) for an unrelated function g. What we have shown is that studying convex hulls allows us to arrive at bounds that can be saturated (theorem 1). We have also shown that it is possible to obtain said bounds without actually finding the convex hulls.

In particular, theorem 2 in conjunction with theorem 3 allows analytical bounds to be found, especially in cases with low dimensionality and few known expectations (theorems 4 and 5). A numerical approach for problems with higher dimensionality and larger numbers of expectations was also developed (theorem 6), showing a connection between this problem and convex optimization.

The importance of these results is tied to the importance and scope of applications of Jensen’s inequality, but some of the main points we’d like to stress are the following:

• Our results go beyond the case where X is a single random variable and allow (especially in the numerical case) the study of random vectors, which widens significantly the range of applications these results can have.

• With some extra assumptions on g, theorem 4 allows one to find sharp inequalities for the Jensen gap when E(X) and Var(X) are the known information.

• When the f_i and g are polynomials we fall back to a well-known problem, involving the connection between generalized moment problems and semidefinite programming. From this point of view, our work is also a generalization of these optimization problems beyond the polynomial case.

• Interest in the problem of how certain expectations impact others has increased in some applied fields (the example from [5], developed in section 7.2, is a case in point). Having a unifying framework for studying these problems would be extremely helpful (be it for analytical calculations or for numerics).

As future avenues of research, we can point out the following questions that our results raise:

• Are there other classes of functions for which analytical results for the Jensen gap can be obtained, as in theorems 4 and 5? This would be particularly interesting for the cases where m > 2.

• The convex optimization problem that arises in theorem 6 doesn’t seem to have been studied in detail and, even though we were able to show that it is amenable to a subgradient method, we were unable to find a way to tackle it with a higher-order method (for example, it seems to be outside the scope of barrier methods [4]). As such, extending these methods to handle this new setup would be a very interesting undertaking.

• Regarding some limitations of theorem 6, some interesting questions are whether it is possible to relax the hypothesis on g in corollary 4 and whether anything can be said about the case φ ∈ bd(D) at all.

References
[1] Shoshana Abramovich, Slavica Ivelić, and Josip E. Pečarić. Improvement of Jensen–Steffensen’s inequality for superquadratic functions. Banach J. Math. Anal., 4(1):159–169, 2010.

[2] Shoshana Abramovich, Graham Jameson, and Gord Sinnamon. Refining Jensen’s inequality. Bulletin mathématique de la Société des Sciences Mathématiques de Roumanie, 47(95)(1/2):3–14, 2004.

[3] Shoshana Abramovich and Lars-Erik Persson. Some new estimates of the ‘Jensen gap’. Journal of Inequalities and Applications, 2016(1):39, 2016.

[4] Stephen Boyd and Lieven Vandenberghe. Interior-point methods. In Convex Optimization, pages 561–630. Cambridge University Press, 2004.

[5] Vasco Cavina, Andrea Mari, and Vittorio Giovannetti. Optimal processes for probabilistic work extraction beyond the second law. Scientific Reports, 6:29282, 2016.

[6] Etienne de Klerk and Monique Laurent. A survey of semidefinite programming approaches to the generalized problem of moments and their error analysis. In World Women in Mathematics 2018, pages 17–56, Cham, Switzerland, 2019. Springer.

[7] S. S. Dragomir. Some inequalities for (m, M)-convex mappings and applications for the Csiszár Φ-divergence in information theory. Mathematical Journal of Ibaraki University, 33:35–50, 2001.

[8] Silvestru Sever Dragomir. Inequality for power series with nonnegative coefficients and applications. Open Mathematics, 13(1), 2015.

[9] Xiang Gao, Meera Sitharam, and Adrian E. Roitberg. Bounds on the Jensen gap, and implications for mean-concentrated distributions. The Australian Journal of Mathematical Analysis and Applications, 16(2):14, 2019.

[10] C. Jarzynski. Nonequilibrium equality for free energy differences. Phys. Rev. Lett., 78:2690–2693, 1997.

[11] J. L. W. V. Jensen. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Math., 30:175–193, 1906.

[12] J. G. Liao and Arthur Berg. Sharpening Jensen’s inequality. The American Statistician, 73(3):278–281, 2019.

[13] Josip E. Pečarić. A companion to Jensen–Steffensen’s inequality. Journal of Approximation Theory, 44(3):289–291, 1985.

[14] Konrad Schmüdgen. Semidefinite programming and polynomial optimization. In The Moment Problem, pages 399–411. Springer, 2017.

[15] Slavko Simic. Sharp global bounds for Jensen’s inequality. Rocky Mountain J. Math., 41(6):2021–2031, 2011.

[16] Stephen G. Walker. On a lower bound for the Jensen inequality.