Basic inequalities for weighted entropies
Yuri Suhov, Izabella Stuhl, Salimeh Yasaei Sekeh, Mark Kelbert
Abstract
The concept of weighted entropy takes into account values of different outcomes, i.e., makes entropy context-dependent, through the weight function. In this paper we establish a number of simple inequalities for weighted entropies (general as well as specific), mirroring similar bounds on standard (Shannon) entropies and related quantities. The required assumptions are written in terms of various expectations of the weight functions. Examples are weighted Ky Fan and weighted Hadamard inequalities involving determinants of positive-definite matrices, and weighted Cramér–Rao inequalities involving the weighted Fisher information matrix.
The definition and initial results on weighted entropy were introduced in [1, 11]. The purpose was to introduce disparity between outcomes of the same probability: in the case of a standard entropy such outcomes contribute the same amount of information/uncertainty, which is appropriate in context-free situations. However, imagine two equally rare medical conditions, occurring with probability p ≪ 1, one of which carries a major health risk while the other is just a peculiarity. Formally, they provide the same amount of information −log p, but the value of this information can be very different. The weight, or a weight function, was supposed to fulfill this task, at least to a certain extent. The initial results have been further extended and deepened in [24, 7, 14, 23, 25, 31, 15] and, more recently, in [6, 26, 2, 22, 27]. Certain applications emerged, see [8, 13], along with a number of theoretical suggestions.

The purpose of this note is to extend a number of inequalities, established previously for a standard (Shannon) entropy, to the case of the weighted entropy. We particularly mention Ky Fan- and Hadamard-type inequalities from [3, 9, 20], which are related to (standard) Gaussian entropies. Extended inequalities for weighted entropies have already found applications and further developments in [28, 29, 30]. Another kind of bounds, weighted Cramér–Rao inequalities, may be useful in statistics.

An additional motivation for studying weighted entropy (WE) is provided by the following questions. (I) What is the rate at which the WE is produced by a sample of a random process (and what could be an analog of the Shannon–McMillan–Breiman theorem)? (II) What would be an analog of Shannon's second coding theorem when an incorrect channel output causes a penalty but does not make the transmission session invalid?
Properties of the WE established in the current paper could be helpful in this line of research. One of the naturally emerging questions is about the form/structure of the weight function (WF). In this paper we focus on some simple inequalities (as suggested by the title). Our results hold for fairly general WFs.

Mathematics Subject Classification: 60A10, 60B05, 60C05
Key words and phrases: weighted entropy, weighted conditional entropy, weighted relative entropy, weighted mutual entropy, weighted Gibbs inequality, convexity, concavity, weighted Hadamard inequality, weighted Fisher information, weighted Cramér–Rao inequalities.

Let (Ω, B, P) be a standard probability space (see, e.g., [12]). We consider random variables (RVs) as (measurable) functions Ω → X, with values in a measurable space (X, M) equipped with a countably additive reference measure ν. Probability mass functions (PMFs) or probability density functions (PDFs) are denoted by the letter f with various indices and are defined relative to ν. The difference between PMFs (discrete parts of probability measures) and PDFs (continuous parts) is insignificant for most of the presentation; this will be reflected in the common acronym PM/DF. In a few cases we will address directly the probabilities P(X = i) (when X is a finite or countable set, assuming that ν(i) = 1 for all i ∈ X). On the other hand, some important facts will remain true without the assumption that ∫_X f(x) ν(dx) = 1. When we deal with a collection of RVs X_i, the space of values X_i and the reference measure ν_i may vary with i. Some of the RVs X_i may be random 1 × n vectors, viz., X^n = (X_1, . . . , X_n), with random components X_i : Ω → X_i, 1 ≤ i ≤ n.

Definition 1.1
Given a function x ∈ X ↦ ϕ(x) ≥ 0 and an RV X : Ω → X with a PM/DF f, the weighted entropy (WE) of X (or f) with weight function (WF) ϕ and reference measure ν is defined by

h^w_ϕ(X) = h^w_ϕ(f) = −E[ϕ(X) log f(X)] = −∫_X ϕ(x) f(x) log f(x) ν(dx),   (1.1)

whenever the integral ∫_X ϕ(x) f(x) (1 ∨ |log f(x)|) ν(dx) < ∞. (The standard agreement 0 · log 0 = 0 · log ∞ = 0 is adopted throughout the paper.) If f(x) ≤ 1 for all x ∈ X, then h^w_ϕ(f) is non-negative. (This is the case when ν(X) ≤ 1.) The dependence of h^w_ϕ(X) = h^w_ϕ(f) on ν is omitted.

Given two functions x ∈ X ↦ f(x) ≥ 0 and x ∈ X ↦ g(x) ≥ 0, the relative WE of g relative to f, with WF ϕ, is defined by

D^w_ϕ(f ‖ g) = ∫_X ϕ(x) f(x) log [f(x)/g(x)] ν(dx).   (1.2)

Alternatively, the quantity D^w_ϕ(f ‖ g) can be termed a weighted Kullback–Leibler divergence (of g from f) with WF ϕ. If f is a PM/DF, one can use the alternative form D^w_ϕ(f ‖ g) = E[ϕ(X) log (f(X)/g(X))]. In what follows, all WFs are assumed non-negative and positive on a set of positive f-measure.

Remark 1.2
Passing to standard entropies, an obvious formula reads

h^w_ϕ(f) = h(ϕf) + D(ϕf ‖ f) = −D(ϕf ‖ ϕ),   (1.3)

provided that one can guarantee that the integrals involved converge. However, in general neither ϕf nor ϕ is a PM/DF, which can be a nuisance. Besides, the interpretation of ϕ as a weight function in h^w_ϕ(f) makes the inequalities more transparent.

Theorem 1.3 (The weighted Gibbs inequality; cf. [4], Lemma 1, [3], Theorem 2.6.3, [5], Lemma 1, [20], Theorem 1.2.3 (c).)
Given non-negative functions f, g, assume the bound

∫_X ϕ(x) [f(x) − g(x)] ν(dx) ≥ 0.   (1.4)

Then

D^w_ϕ(f ‖ g) ≥ 0.   (1.5)

Moreover, equality in (1.5) holds iff the ratio g/f equals 1 modulo the function ϕ. In other words,

[g(x)/f(x) − 1] ϕ(x) = 0 for f-almost all x ∈ X.

Proof.
Following a standard calculation (see, e.g., [3], Theorem 2.6.3, or [20], Theorem 1.2.3 (c)) and using (1.2), we write

−D^w_ϕ(f ‖ g) = ∫_X ϕ(x) f(x) 1(f(x) > 0) log [g(x)/f(x)] ν(dx)
 ≤ ∫_X ϕ(x) f(x) 1(f(x) > 0) [g(x)/f(x) − 1] ν(dx)
 = ∫_X ϕ(x) 1(f(x) > 0) [g(x) − f(x)] ν(dx) ≤ ∫_X ϕ(x) [g(x) − f(x)] ν(dx) ≤ 0.   (1.6)

(The first inequality uses log t ≤ t − 1, the second that g ≥ 0 on the set {f = 0}, and the third is assumption (1.4).) The equality in (1.6) occurs iff ϕ · (g/f − 1) vanishes f-a.s.

Theorem 1.4 (Bounding the WE via a uniform distribution.)
Suppose an RV X takes at most m values, i.e., X = {1, . . . , m}, and set p_i = P(X = i), 1 ≤ i ≤ m. Suppose that for a given 0 < β ≤ 1,

Σ_{i=1}^m ϕ(i)(p_i − β) ≥ 0.   (1.7)

Then h^w_ϕ(X) = −Σ_{i=1}^m ϕ(i) p_i log p_i obeys

h^w_ϕ(X) ≤ −log β Σ_{i=1}^m ϕ(i) p_i, or −E[ϕ(X) log p_X] ≤ −(log β) E[ϕ(X)],   (1.8)

with equality iff for all i = 1, . . . , m, ϕ(i)(p_i − β) = 0.

In the case of a general space X, assume that for a constant β > 0 we have

∫_X ϕ(x) [f(x) − β] ν(dx) ≥ 0.   (1.9)

Then

h^w_ϕ(X) ≤ −log β ∫_X ϕ(x) f(x) ν(dx);   (1.10)

equality iff ϕ(x) [f(x) − β] = 0 for f-almost all x ∈ X.

Proof.
The proof follows directly from Theorem 1.3, with g(x) = β, x ∈ X.

Definition 1.5
Let (X_1, X_2) be a pair of RVs X_i : Ω → X_i, with a joint PM/DF f(x_1, x_2), x_i ∈ X_i, i = 1, 2, relative to the measure ν_1(dx_1) × ν_2(dx_2), and marginal PM/DFs

f_1(x_1) = ∫_{X_2} f(x_1, x_2) ν_2(dx_2), x_1 ∈ X_1,  f_2(x_2) = ∫_{X_1} f(x_1, x_2) ν_1(dx_1), x_2 ∈ X_2.

Let (x_1, x_2) ∈ X_1 × X_2 ↦ ϕ(x_1, x_2) be a given WF. We use Eqn (1.1) to define the joint WE of X_1, X_2 with WF ϕ (under an assumption of absolute convergence of the integrals involved):

h^w_ϕ(X_1, X_2) = −E[ϕ(X_1, X_2) log f(X_1, X_2)] = −∫_{X_1×X_2} ϕ(x_1, x_2) f(x_1, x_2) log f(x_1, x_2) ν_1(dx_1) ν_2(dx_2).   (1.11)

Next, the conditional
WE of X_2 given X_1 with WF ϕ is defined by

h^w_ϕ(X_2 | X_1) = −E[ϕ(X_1, X_2) log (f(X_1, X_2)/f_1(X_1))] = h^w_ϕ(X_1, X_2) − h^w_{ψ_1}(X_1)
 = −∫_{X_1×X_2} ϕ(x_1, x_2) f(x_1, x_2) log [f(x_1, x_2)/f_1(x_1)] ν_1(dx_1) ν_2(dx_2),   (1.12)

where, here and below,

ψ_1(x_1) = ∫_{X_2} ϕ(x_1, x_2) [f(x_1, x_2)/f_1(x_1)] ν_2(dx_2).

Further, the mutual
WE between X_1 and X_2 is defined by

i^w_ϕ(X_1 : X_2) = D^w_ϕ(f ‖ f_1 ⊗ f_2) = E[ϕ(X_1, X_2) log (f(X_1, X_2)/(f_1(X_1) f_2(X_2)))]
 = ∫_{X_1×X_2} ϕ(x_1, x_2) f(x_1, x_2) log [f(x_1, x_2)/(f_1(x_1) f_2(x_2))] ν_1(dx_1) ν_2(dx_2).   (1.13)

We will use the notation X_i^k = (X_i, . . . , X_k) and x_i^k = (x_i, . . . , x_k), 1 ≤ i < k ≤ n, for collections of RVs and their sample values (particularly for pairs and triples of RVs), allowing us to shorten equations throughout the paper. In addition, we employ the Cartesian products X_i^k = X_i × . . . × X_k and product measures ν_i^k(dx_i^k) = ν_i(dx_i) × . . . × ν_k(dx_k). Given a random 1 × n vector X^n with a PM/DF f, we denote by f_i, f_{ij} and f_{ijk} the PM/DFs for the component X_i, the pair X_{ij} = (X_i, X_j) and the triple X_{ijk} = (X_i, X_j, X_k), respectively. The arguments of f_i, f_{ij} and f_{ijk} are written as x_i ∈ X_i, x_{ij} = (x_i, x_j) ∈ X_{ij} = X_i × X_j and x_{ijk} = (x_i, x_j, x_k) ∈ X_{ijk} = X_i × X_j × X_k. Next, the symbols f_{i|j}, f_{ij|k} and f_{i|jk} are used for conditional PM/DFs:

f_{i|j}(x_i | x_j) = f_{ij}(x_{ij})/f_j(x_j),  f_{ij|k}(x_{ij} | x_k) = f_{ijk}(x_{ijk})/f_k(x_k),  f_{i|jk}(x_i | x_{jk}) = f_{ijk}(x_{ijk})/f_{jk}(x_{jk}).

For a pair of RVs X_{12}, set

ψ_1(x_1) = ∫_{X_2} ϕ(x_1, x_2) f_{2|1}(x_2 | x_1) ν_2(dx_2), x_1 ∈ X_1;   (1.14)

the quantity ψ_2(x_2), x_2 ∈ X_2, is defined in a similar (symmetric) fashion.

Next, given a triple of RVs X_{123} with a joint PM/DF f(x_{123}), set:

ψ_{12}(x_{12}) = ∫_{X_3} ϕ(x_{123}) f_{3|12}(x_3 | x_{12}) ν_3(dx_3) = E[ϕ(X_{123}) | X_{12} = x_{12}], x_{12} ∈ X_{12},
ψ_1(x_1) = ∫_{X_{23}} ϕ(x_{123}) f_{23|1}(x_{23} | x_1) ν_2(dx_2) ν_3(dx_3) = E[ϕ(X_{123}) | X_1 = x_1], x_1 ∈ X_1,   (1.15)

and define the functions ψ_{ij} and ψ_i for distinct labels 1 ≤ i, j, k ≤ 3 in a similar manner.

Lemma 1.6 (Bounds on conditional WE, I.) Let X_{12} be a pair of RVs with a joint PM/DF f(x_{12}). Suppose that a WF x_{12} ∈ X_{12} ↦ ϕ(x_{12}) obeys

E[ϕ(X_{12}) (f_{2|1}(X_2 | X_1) − 1)] = ∫_{X_{12}} ϕ(x_{12}) f(x_{12}) [f_{2|1}(x_2 | x_1) − 1] ν_1(dx_1) ν_2(dx_2) ≤ 0.
(1.16)

Then

h^w_ϕ(X_{12}) ≥ h^w_{ψ_1}(X_1), or, equivalently, h^w_ϕ(X_2 | X_1) ≥ 0,   (1.17)

with equality iff ϕ(x_{12}) [f_{2|1}(x_2 | x_1) − 1] = 0 for f-almost all x_{12} ∈ X_{12}.

Proof.
The statement is derived similarly to Theorem 1.3:

∫_{X_{12}} ϕ(x_{12}) f(x_{12}) log f_{2|1}(x_2 | x_1) ν_1(dx_1) ν_2(dx_2) ≤ ∫_{X_{12}} ϕ(x_{12}) f(x_{12}) [f_{2|1}(x_2 | x_1) − 1] ν_1(dx_1) ν_2(dx_2).

The argument is concluded as in (1.6). The cases of equality also follow.
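To make the definitions concrete, here is a minimal numerical sketch in Python (not part of the original paper): a pair (X_1, X_2) on {0, 1}², under the counting measure, with arbitrary illustrative values for the joint PMF f and the WF ϕ. Since f_{2|1} ≤ 1 in the discrete case, hypothesis (1.16) holds automatically for any ϕ ≥ 0, so the conditional WE of Lemma 1.6 is non-negative.

```python
import math

# Joint PMF f(x1, x2) on {0,1}^2 and a WF phi(x1, x2); nu is the counting
# measure. The PMF and WF values are arbitrary illustrative choices.
f = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
phi = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 1.5}

f1 = {a: f[(a, 0)] + f[(a, 1)] for a in (0, 1)}   # marginal of X1
f2 = {b: f[(0, b)] + f[(1, b)] for b in (0, 1)}   # marginal of X2

# joint WE, Eqn (1.11)
h_joint = -sum(phi[k] * v * math.log(v) for k, v in f.items())

# reduced WFs psi1(x1) = E[phi | X1 = x1], psi2(x2) = E[phi | X2 = x2], (1.14)
psi1 = {a: sum(phi[(a, b)] * f[(a, b)] for b in (0, 1)) / f1[a] for a in (0, 1)}
psi2 = {b: sum(phi[(a, b)] * f[(a, b)] for a in (0, 1)) / f2[b] for b in (0, 1)}

h_psi1 = -sum(psi1[a] * f1[a] * math.log(f1[a]) for a in (0, 1))
h_psi2 = -sum(psi2[b] * f2[b] * math.log(f2[b]) for b in (0, 1))

h_cond = h_joint - h_psi1   # h^w_phi(X2 | X1), Eqn (1.12); >= 0 by Lemma 1.6

# mutual WE, Eqn (1.13)
i_mut = sum(phi[k] * v * math.log(v / (f1[k[0]] * f2[k[1]])) for k, v in f.items())
print(h_cond, i_mut)
```

On the same data one can also confirm the chain-rule identity h^w_ϕ(X_{12}) = h^w_{ψ_1}(X_1) + h^w_ϕ(X_2 | X_1) and the representation i^w_ϕ(X_1 : X_2) = h^w_{ψ_1}(X_1) + h^w_{ψ_2}(X_2) − h^w_ϕ(X_{12}) used in (1.22) below.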
Remark 1.7
In particular, suppose that X_2 takes finitely or countably many values and ν_2 is a counting measure with ν_2(i) = 1, i ∈ X_2. Then the value f_{2|1}(x_2 | x_1) yields the conditional probability P(X_2 = x_2 | X_1 = x_1), which is ≤ 1 for f-almost all x_{12} ∈ X_{12}. Then h^w_ϕ(X_2 | X_1) ≥ 0, and the bound is strict unless, modulo ϕ, the RV X_2 is a function of X_1. That is, there exists a map υ : X_1 → X_2 such that [x_2 − υ(x_1)] ϕ(x_{12}) = 0 for f-almost every x_{12} ∈ X_{12}.

For future use, we can consider a triple of RVs, X_{123}, and a pair, X_{12}, and assume that

E[ϕ(X_{123}) (f_{3|12}(X_3 | X_{12}) − 1)] = ∫_{X_{123}} ϕ(x_{123}) f(x_{123}) [f_{3|12}(x_3 | x_{12}) − 1] ν_{123}(dx_{123}) ≤ 0.   (1.18)

Then

h^w_ϕ(X_{123}) ≥ h^w_{ψ_{12}}(X_{12}), or, equivalently, h^w_ϕ(X_3 | X_{12}) ≥ 0,   (1.19)

with equality iff ϕ(x_{123}) [f_{3|12}(x_3 | x_{12}) − 1] = 0 for f-almost all x_{123} ∈ X_{123}.

Theorem 1.8 (Sub-additivity of the WE.)
Let X_{12} = (X_1, X_2) be a pair of RVs with a joint PM/DF f(x_{12}) and marginals f_1(x_1), f_2(x_2), where x_{12} ∈ X_{12}. Suppose that a WF x_{12} ∈ X_{12} ↦ ϕ(x_{12}) obeys

E ϕ(X_{12}) − E ϕ(X_{12}^⊗) = ∫_{X_{12}} ϕ(x_{12}) [f(x_{12}) − f_1(x_1) f_2(x_2)] ν_1(dx_1) ν_2(dx_2) ≥ 0.   (1.20)

Here X_{12}^⊗ stands for the pair of independent RVs having the same marginal distributions as X_1, X_2. (The joint PDF for X_{12}^⊗ is the product f_1(x_1) f_2(x_2).) Then

h^w_ϕ(X_{12}) ≤ h^w_{ψ_1}(X_1) + h^w_{ψ_2}(X_2), or, equivalently, h^w_ϕ(X_2 | X_1) ≤ h^w_{ψ_2}(X_2), or, equivalently, i^w_ϕ(X_1 : X_2) ≥ 0.   (1.21)

The equalities hold iff X_1, X_2 are independent modulo ϕ, i.e.,

ϕ(x_{12}) [1 − f_1(x_1) f_2(x_2)/f(x_{12})] = 0 for f-almost all x_{12} ∈ X_{12}.

Proof. The subsequent argument works for the proof of Theorem 1.10 as well. Set (f_1 ⊗ f_2)(x_1, x_2) = f_1(x_1) f_2(x_2). According to (1.2), (1.11)–(1.13), and owing to Theorem 1.3 and Lemma 1.6,

0 ≥ −D^w_ϕ(f ‖ f_1 ⊗ f_2) = ∫_{X_{12}} ϕ(x_{12}) f(x_{12}) log [f_1(x_1) f_2(x_2)/f(x_{12})] ν_1(dx_1) ν_2(dx_2)
 = h^w_ϕ(X_1, X_2) − h^w_{ψ_1}(X_1) − h^w_{ψ_2}(X_2) = h^w_ϕ(X_2 | X_1) − h^w_{ψ_2}(X_2) = −i^w_ϕ(X_1 : X_2).   (1.22)

This yields the inequalities in (1.21). The cases of equality are also identified from Theorem 1.3.

Note that if in (1.20) we use the function ψ_{12}(x_{12}) emerging from the triple X_{123}, the assumption becomes

E ϕ(X_{123}) − E ϕ(X_{12}^⊗ → X_3) = ∫_{X_{123}} ϕ(x_{123}) [f_{12}(x_{12}) − f_1(x_1) f_2(x_2)] f_{3|12}(x_3 | x_{12}) ν_{123}(dx_{123}) ≥ 0,   (1.23)

and the conclusion

h^w_{ψ_{12}}(X_2 | X_1) ≤ h^w_{ψ_2}(X_2).   (1.24)

Here X_{12}^⊗ → X_3 denotes the triple of RVs where X_1 and X_2 have been made independent, keeping intact their marginal distributions, and X_3 has the same conditional PM/DF f_{3|12} as within the original triple X_{123}.

Lemma 1.9 (Bounds on conditional WE, II.)
Let X_{123} be a triple of RVs with a joint PM/DF f(x_{123}). Given a WF x_{123} ↦ ϕ(x_{123}), assume that

E[ϕ(X_{123}) (f_{3|12}(X_3 | X_{12}) − 1)] = ∫_{X_{123}} ϕ(x_{123}) f(x_{123}) [f_{3|12}(x_3 | x_{12}) − 1] ν_{123}(dx_{123}) ≤ 0.   (1.25)

Then

h^w_{ψ_{12}}(X_2 | X_1) ≤ h^w_ϕ(X_{23} | X_1);   (1.26)

equality iff ϕ(x_{123}) [f_{3|12}(x_3 | x_{12}) − 1] = 0 for f-almost all x_{123} ∈ X_{123}.

As in Remark 1.7, assume X_3 takes finitely or countably many values and ν_3(i) = 1, i ∈ X_3. Then the value f_{3|12}(x_3 | x_{12}) yields the conditional probability P(X_3 = x_3 | X_{12} = x_{12}), for f-almost all x_{123} ∈ X_{123}. Then h^w_ϕ(X_{23} | X_1) ≥ h^w_{ψ_{12}}(X_2 | X_1), with equality iff, modulo ϕ, the RV X_3 is a function of X_{12}.

Proof.
Observe that h^w_ϕ(X_{23} | X_1) = h^w_ϕ(X_{123}) − h^w_{ψ_1}(X_1) and h^w_{ψ_{12}}(X_2 | X_1) = h^w_{ψ_{12}}(X_{12}) − h^w_{ψ_1}(X_1), so that we need to prove that h^w_ϕ(X_{123}) ≥ h^w_{ψ_{12}}(X_{12}). The proof follows that of Lemma 1.6, with obvious modifications.

Of course, if we swap labels 2 and 3 in (1.25), assuming that

E ϕ(X_{123}) [f_{2|13}(X_2 | X_{13}) − 1] = ∫_{X_{123}} ϕ(x_{123}) f(x_{123}) [f_{2|13}(x_2 | x_{13}) − 1] ν_{123}(dx_{123}) ≤ 0,   (1.27)

we get h^w_{ψ_{13}}(X_3 | X_1) ≤ h^w_ϕ(X_{23} | X_1), with equality iff ϕ(x_{123}) [f_{2|13}(x_2 | x_{13}) − 1] = 0 for f-almost all x_{123} ∈ X_{123}.

Theorem 1.10 (Sub-additivity of the conditional WE.) Let X_{123} be a triple of RVs, with a joint PM/DF f. Given a WF x_{123} ↦ ϕ(x_{123}), assume the following bound:

E ϕ(X_{123}) − E ϕ(X_1 → X_{23}^⊗) = ∫_{X_{123}} ϕ(x_{123}) [f(x_{123}) − f_1(x_1) Π_{i=2,3} f_{i|1}(x_i | x_1)] ν_{123}(dx_{123}) ≥ 0.   (1.28)

Here X_1 → X_{23}^⊗ stands for the triple of RVs where X_1 keeps its distribution as within the triple X_{123}, whereas X_2 and X_3 have been made conditionally independent given X_1, with the same marginal conditional PDFs f_{2|1} and f_{3|1} as in X_{123}. Then

h^w_ϕ(X_{23} | X_1) ≤ h^w_{ψ_{12}}(X_2 | X_1) + h^w_{ψ_{13}}(X_3 | X_1),   (1.29)

with equality iff, modulo ϕ, the RVs X_2 and X_3 are conditionally independent given X_1. That is:

ϕ(x_{123}) [f(x_{123}) − f_1(x_1) f_{2|1}(x_2 | x_1) f_{3|1}(x_3 | x_1)] = 0 for f-almost all x_{123} ∈ X_{123}.

Proof.
The proof is based on the identity (1.30):

h^w_ϕ(X_{23} | X_1) − h^w_{ψ_{12}}(X_2 | X_1) − h^w_{ψ_{13}}(X_3 | X_1)
 = ∫_{X_{123}} ϕ(x_{123}) f(x_{123}) log [f_{23|1}(x_{23} | x_1) / (f_{2|1}(x_2 | x_1) f_{3|1}(x_3 | x_1))] ν_{123}(dx_{123})
 = ∫_{X_{123}} ϕ(x_{123}) f(x_{123}) log [f(x_{123}) f_1(x_1) / (f_{12}(x_{12}) f_{13}(x_{13}))] ν_{123}(dx_{123}).   (1.30)

After that we apply the same argument as in (1.22).

Lemma 1.11 (Bounds on conditional WE, III.)
For a triple of RVs X_{123} with a joint PM/DF f(x_{123}) and a WF x_{123} ↦ ϕ(x_{123}), assume the bound as in (1.28). Then

h^w_ϕ(X_3 | X_{12}) ≤ h^w_{ψ_{13}}(X_3 | X_1);   (1.31)

equality iff X_2 and X_3 are conditionally independent given X_1, modulo ϕ.

Proof.
Write (1.31) as

h^w_{ψ_{13}}(X_{13}) − h^w_{ψ_1}(X_1) ≥ h^w_ϕ(X_{123}) − h^w_{ψ_{12}}(X_{12}),

and then pass to the equivalent form

h^w_ϕ(X_{23} | X_1) ≤ h^w_{ψ_{12}}(X_2 | X_1) + h^w_{ψ_{13}}(X_3 | X_1),

which is exactly (1.29).

Summarizing, we have an array of inequalities (1.32) for h^w_ϕ(X_{23} | X_1) and its upper bounds, each requiring its own assumption (and with its own case for equality):

by Lemma 1.6: 0 ≤ h^w_ϕ(X_{23} | X_1), assuming a modified form of (1.16) (cf. (1.18));
by Lemma 1.11: h^w_ϕ(X_3 | X_{12}) ≤ h^w_{ψ_{13}}(X_3 | X_1), assuming (1.28);
by Theorem 1.8: h^w_{ψ_{12}}(X_2 | X_1) ≤ h^w_{ψ_2}(X_2), assuming (1.23);
by Lemma 1.9: h^w_{ψ_{13}}(X_3 | X_1) ≤ h^w_ϕ(X_{23} | X_1), assuming (1.27);
by Theorem 1.10: h^w_ϕ(X_{23} | X_1) ≤ h^w_{ψ_{12}}(X_2 | X_1) + h^w_{ψ_{13}}(X_3 | X_1), assuming (1.28).   (1.32)

It is worth noting that the assumptions listed in Eqn (1.32) express an impact on the total expected weight when we perform various manipulations with the RVs forming the pair or triple under consideration.

Theorem 1.12 (Strong sub-additivity of the WE.) Given a triple of RVs X_{123}, assume that bound (1.28) is fulfilled. Then

h^w_ϕ(X_{123}) + h^w_{ψ_1}(X_1) ≤ h^w_{ψ_{12}}(X_{12}) + h^w_{ψ_{13}}(X_{13}).   (1.33)

The equality in (1.33) holds iff, modulo ϕ, X_2 and X_3 are conditionally independent given X_1.

Proof.
Write the inequality in Eqn (1.33) in the equivalent form

h^w_ϕ(X_{123}) − h^w_{ψ_1}(X_1) ≤ [h^w_{ψ_{12}}(X_{12}) − h^w_{ψ_1}(X_1)] + [h^w_{ψ_{13}}(X_{13}) − h^w_{ψ_1}(X_1)].   (1.34)

The LHS in (1.34) equals h^w_ϕ(X_{23} | X_1) while the RHS yields h^w_{ψ_{12}}(X_2 | X_1) + h^w_{ψ_{13}}(X_3 | X_1). The inequality then follows from Theorem 1.10.

Theorem 2.1 (Concavity of the WE; cf. [3], Theorem 2.7.3.)
The functional f ↦ h^w_ϕ(f) is concave in the argument f. Namely, for given PM/DFs f_1(x), f_2(x), a non-negative function x ∈ X ↦ ϕ(x), and λ_1, λ_2 ∈ [0, 1] with λ_1 + λ_2 = 1,

h^w_ϕ(λ_1 f_1 + λ_2 f_2) ≥ λ_1 h^w_ϕ(f_1) + λ_2 h^w_ϕ(f_2).   (2.1)

The inequality in (2.1) is strict unless one of the values λ_1, λ_2 vanishes (and the other equals 1) or when f_1 and f_2 coincide modulo ϕ, that is, ϕ(x) [f_1(x) − f_2(x)] = 0 for (λ_1 f_1 + λ_2 f_2)-almost all x ∈ X.

Proof.
Let X_1, X_2 : Ω → X be RVs with PM/DFs f_1 and f_2, respectively. Consider a binary RV Θ with

Θ = 1 with probability λ_1 and Θ = 2 with probability λ_2.   (2.2)

Setting Z = X_Θ yields an RV Z with values in X and with PM/DF f = λ_1 f_1 + λ_2 f_2. Thus, h^w_ϕ(Z) = h^w_ϕ(λ_1 f_1 + λ_2 f_2).

On the other hand, take the conditional WE h^w_ϕ̃(Z | Θ) with the WF ϕ̃(z, θ) = ϕ(z) depending on the first argument z ∈ X and not on the value θ = 1, 2 of the RV Θ. Then the WF ψ(z) = E[ϕ̃(Z, Θ) | Z = z] coincides with ϕ(z). It means that condition (1.20) holds true for the pair of RVs (Z, Θ). According to Theorem 1.8 (cf. Eqn (1.21)),

h^w_ϕ̃(Z | Θ) ≤ h^w_ϕ(Z),

with equality iff Z and Θ are independent modulo ϕ. The latter holds when the product λ_1 λ_2 = 0 or when f_1 = f_2 modulo ϕ. Now,

h^w_ϕ̃(Z | Θ) = −Σ_{θ=1,2} λ_θ ∫_X ϕ(z) f_θ(z) log f_θ(z) ν(dz) = λ_1 h^w_ϕ(f_1) + λ_2 h^w_ϕ(f_2).

This completes the proof.

Theorem 2.2 (a) (Convexity of relative WE; cf. [3], Theorem 2.7.2.)
Consider two pairs of non-negative functions, (f_1, g_1) and (f_2, g_2), on X. Given a WF x ∈ X ↦ ϕ(x) and λ_1, λ_2 ∈ (0, 1) with λ_1 + λ_2 = 1, the following property is satisfied:

λ_1 D^w_ϕ(f_1 ‖ g_1) + λ_2 D^w_ϕ(f_2 ‖ g_2) ≥ D^w_ϕ(λ_1 f_1 + λ_2 f_2 ‖ λ_1 g_1 + λ_2 g_2),   (2.3)

with equality iff λ_1 λ_2 = 0 or f_1 = f_2 and g_1 = g_2 modulo ϕ.

(b) (Data-processing inequality for relative WE; cf. [3], Theorem 2.8.1.) Let (f, g) be a pair of non-negative functions and ϕ a WF on X. Let Π = (Π(x, y), x, y ∈ X) be a stochastic kernel. (That is, for all x, y ∈ X, Π(x, y) ≥ 0 and ∫_X Π(x, y) ν(dy) = 1; in other words, Π(x, y) is a transition function of a Markov chain.) Set

Ψ(u) = ∫_X ϕ(x) Π(u, x) ν(dx).

Then

D^w_Ψ(f ‖ g) ≥ D^w_ϕ(fΠ ‖ gΠ),   (2.4)

where

(fΠ)(x) = ∫_X f(u) Π(u, x) ν(du) and (gΠ)(x) = ∫_X g(u) Π(u, x) ν(du).

The equality occurs iff fΠ = f and gΠ = g.

Proof. (a) The log-sum inequality yields, for x ∈ X,

λ_1 ϕ(x) f_1(x) log [λ_1 f_1(x)/(λ_1 g_1(x))] + λ_2 ϕ(x) f_2(x) log [λ_2 f_2(x)/(λ_2 g_2(x))]
 ≥ (λ_1 ϕ(x) f_1(x) + λ_2 ϕ(x) f_2(x)) log [(λ_1 f_1(x) + λ_2 f_2(x))/(λ_1 g_1(x) + λ_2 g_2(x))].   (2.5)

Integrating in ν(dx) yields the asserted inequality (2.3). The cases of equality emerge from the log-sum equality cases.

(b) Again, a straightforward application of the log-sum inequality gives the result.

Theorem 2.3
Let X_{123} be a triple of RVs with joint PM/DF f(x_{123}). Let x_{123} ∈ X_{123} ↦ ϕ(x_{123}) be a WF such that X_1 and X_3 are conditionally independent given X_2, modulo ϕ. (This property can be referred to as a Markov property modulo ϕ.)

(a) (Data-processing inequality for conditional WE.) Assume inequality (2.6) (which is (1.28) with X_2 and X_3 swapped):

∫_{X_{123}} ϕ(x_{123}) [f(x_{123}) − f_1(x_1) Π_{i=2,3} f_{i|1}(x_i | x_1)] ν_{123}(dx_{123}) ≥ 0.   (2.6)

Then the conditional WEs satisfy property (2.7):

h^w_{ψ_{23}}(X_3 | X_2) ≤ h^w_{ψ_{13}}(X_3 | X_1),   (2.7)

with equality iff, modulo ϕ, the RVs X_2 and X_3 are conditionally independent given X_1. Furthermore, assume in addition that the bound (2.8) holds true:

∫_{X_{123}} ϕ(x_{123}) f(x_{123}) [f_{2|13}(x_2 | x_{13}) − 1] ν_{123}(dx_{123}) ≤ 0   (2.8)

(which becomes (1.25) after a cyclic substitution X_1 → X_2 → X_3 → X_1), and suppose that h^w_{ψ_{23}}(X_3 | X_2) = h^w_{ψ_{12}}(X_2 | X_1) (a stationarity-type property). Then

h^w_{ψ_{13}}(X_3 | X_1) ≤ 2 h^w_{ψ_{12}}(X_2 | X_1).   (2.9)

(b) (Data-processing inequality for mutual WE; cf. [3], Theorem 2.8.1.) Assume inequality (2.10):

∫_{X_{123}} ϕ(x_{123}) [f(x_{123}) − f_1(x_1) Π_{i=2,3} f_{i|1}(x_i | x_1)] ν_{123}(dx_{123}) ≥ 0   (2.10)

(similar to (1.28), with X_2 and X_3 swapped). Then

i^w_{ψ_{13}}(X_1 : X_3) ≤ i^w_{ψ_{23}}(X_2 : X_3).   (2.11)

Here, equality in (2.11) holds iff, modulo ϕ, the RVs X_2 and X_3 are conditionally independent given X_1.

Proof. (a) Following the argument in Lemma 1.11, we observe that h^w_ϕ(X_3 | X_{12}) ≤ h^w_{ψ_{13}}(X_3 | X_1). On the other hand, owing to conditional independence,

h^w_ϕ(X_3 | X_{12}) = h^w_{ψ_{23}}(X_3 | X_2).   (2.12)

This yields the inequality in (2.7); for equality we need that, modulo ϕ, the RVs X_2 and X_3 are conditionally independent given X_1. Together with the conditional independence of X_1 and X_3 given X_2, it implies that for i = 1, 2, the conditional PM/DF f_{3|i} does not depend on i.

Next, using Lemma 1.9, we can write

h^w_{ψ_{13}}(X_3 | X_1) ≤ h^w_ϕ(X_{23} | X_1) = h^w_ϕ(X_3 | X_{12}) + h^w_{ψ_{12}}(X_2 | X_1).   (2.13)

Applying (2.12) yields the following assertion:

h^w_{ψ_{13}}(X_3 | X_1) ≤ h^w_{ψ_{23}}(X_3 | X_2) + h^w_{ψ_{12}}(X_2 | X_1).   (2.14)

Now, the assumption that h^w_{ψ_{23}}(X_3 | X_2) = h^w_{ψ_{12}}(X_2 | X_1) implies (2.9).
The cases of equality follow from Lemmas 1.11 and 1.9.

(b) As before, we use Lemma 1.11 and Eqn (2.12) (implied by conditional independence):

h^w_{ψ_{23}}(X_3 | X_2) = h^w_ϕ(X_3 | X_{12}) ≤ h^w_{ψ_{13}}(X_3 | X_1).

Consequently,

i^w_{ψ_{23}}(X_2 : X_3) = h^w_{ψ_3}(X_3) − h^w_{ψ_{23}}(X_3 | X_2) ≥ h^w_{ψ_3}(X_3) − h^w_{ψ_{13}}(X_3 | X_1) = i^w_{ψ_{13}}(X_1 : X_3),

with the case of equality also determined from Lemma 1.9.

Theorem 2.4 (Cf. [3], Theorem 2.7.4.) Let X_{12} be a pair of RVs with joint PM/DF f(x_{12}) = f_1(x_1) f_{2|1}(x_2 | x_1) = f_2(x_2) f_{1|2}(x_1 | x_2).

(I) The mutual WE i^w_ϕ(X_1 : X_2) is convex in f_{2|1}(x_2 | x_1) for fixed f_1(x_1).

(II) Suppose that the WF ϕ(x_1, x_2) depends only on x_1: ϕ(x_1, x_2) = ϕ_1(x_1). Then i^w_ϕ(X_1 : X_2) is a concave function of f_1(x_1) for fixed f_{2|1}(x_2 | x_1).

Proof. (I) For a fixed f_1, take two conditional PM/DFs, f^{(1)}_{2|1}(x_2 | x_1) and f^{(2)}_{2|1}(x_2 | x_1), and set

f̃_{2|1}(x_2 | x_1) = λ_1 f^{(1)}_{2|1}(x_2 | x_1) + λ_2 f^{(2)}_{2|1}(x_2 | x_1)

and

f̃(x_{12}) = f_1(x_1) f̃_{2|1}(x_2 | x_1) = λ_1 f^{(1)}(x_{12}) + λ_2 f^{(2)}(x_{12}), where f^{(j)}(x_{12}) = f_1(x_1) f^{(j)}_{2|1}(x_2 | x_1), j = 1, 2.

Also set:

f̃_2(x_2) = ∫_{X_1} f̃(x_{12}) ν_1(dx_1) and f^{(j)}_2(x_2) = ∫_{X_1} f^{(j)}(x_{12}) ν_1(dx_1),

and

g̃(x_{12}) = f_1(x_1) f̃_2(x_2), and g^{(j)}(x_{12}) = f_1(x_1) f^{(j)}_2(x_2), j = 1, 2.

Next, the mutual WE i^w_ϕ(X_1 : X_2) for the joint PM/DFs f̃(x_{12}) and f^{(j)}(x_{12}) is given, respectively, by the relative WEs D^w_ϕ(f̃ ‖ g̃) and D^w_ϕ(f^{(j)} ‖ g^{(j)}), j = 1, 2. Now assertion (I) follows from Theorem 2.2 (a).

(II) Under the condition of the theorem, the reduced WF does not depend on the choice of the PM/DF f_1:

ψ_1(x_1) = ∫_{X_2} ϕ(x_1, x_2) f_{2|1}(x_2 | x_1) ν_2(dx_2) = ϕ_1(x_1).

Next, write

i^w_ϕ(X_1 : X_2) = h^w_{ψ_2}(X_2) − h^w_ϕ(X_2 | X_1) = h^w_{ψ_2}(X_2) − ∫_{X_1} f_1(x_1) h^w_ϕ(X_2 | X_1 = x_1) ν_1(dx_1),

where

h^w_ϕ(X_2 | X_1 = x_1) = −∫_{X_2} ϕ(x_1, x_2) f_{2|1}(x_2 | x_1) log f_{2|1}(x_2 | x_1) ν_2(dx_2).
Owing to Theorem 2.1, for a fixed WF and a fixed conditional PM/DF f_{2|1}(x_2 | x_1), the first term is concave in f_1. The subtracted term is linear in f_1. This completes the proof of statement (II).

Theorem 2.5 (The weighted Fano inequality; cf. [3], Theorem 2.10.1, [20], Theorem 1.2.8.)
Suppose an RV X takes a value x* ∈ X with probability p* = P(X = x*) < 1 (i.e., p* = f(x*) ν({x*})). Given a WF x ∈ X ↦ ϕ(x), assume that

∫_{X\{x*}} ϕ(x) [f(x) − (1 − p*)/ν(X \ {x*})] ν(dx) ≥ 0.   (2.15)

Then

h^w_ϕ(X) ≤ −ϕ(x*) p* log p* + ϕ* log [ν(X \ {x*})/(1 − p*)].   (2.16)

Here ϕ* = ∫_X ϕ(x) f(x) ν(dx) − ϕ(x*) p*.

The equality in (2.16) is achieved iff ϕ(x) [f(x) − (1 − p*)/ν(X \ {x*})] = 0 for f-almost all x ∈ X \ {x*}, i.e., iff the RV X is (conditionally) uniform on X \ {x*} modulo ϕ.

Proof. We write

h^w_ϕ(X) = −ϕ(x*) p* log p* − ∫_{X\{x*}} ϕ(x) f(x) log f(x) ν(dx)
 = −ϕ(x*) p* log p* − log(1 − p*) ∫_{X\{x*}} ϕ(x) f(x) ν(dx) − (1 − p*) ∫_{X\{x*}} ϕ(x) [f(x)/(1 − p*)] log [f(x)/(1 − p*)] ν(dx).   (2.17)

Theorem 1.4, with β = 1/ν(X \ {x*}), yields that the last term in Eqn (2.17) is upper-bounded by ϕ* log ν(X \ {x*}). This leads to (2.16).

Theorem 2.6 (The weighted generalized Fano inequality; cf. [20], Theorem 1.2.11.)
Let X_i : Ω → X_i, i = 1, 2, be a pair of RVs. Suppose that X_1 takes exactly m values 1, . . . , m (that is, X_1 = {1, . . . , m}) while X_2 takes values 1, . . . , m and possibly other values (that is, X_2 ⊇ {1, . . . , m}), and set ε_j = P(X_2 ≠ j | X_1 = j). Let a WF (x_1, x_2) ∈ X_{12} ↦ ϕ(x_1, x_2) be given such that for all j = 1, . . . , m,

∫_{X_2\{j}} ϕ(j, x_2) [f_{2|1}(x_2 | j) − ε_j/ν_2(X_2 \ {j})] ν_2(dx_2) ≥ 0.   (2.18)

Then

h^w_ϕ(X_2 | X_1) ≤ Σ_{1≤j≤m} P(X_1 = j) [−ϕ*_j(0)(1 − ε_j) log(1 − ε_j) + ϕ*_j(1) log (ν_2(X_2 \ {j})/ε_j)].   (2.19)

Here the RV X*_j takes two values, say 0 and 1, with P(X*_j = 0) = 1 − ε_j = 1 − P(X*_j = 1), and the WF ϕ*_j has

ϕ*_j(0) = ϕ(j, j) and ϕ*_j(1) = ∫_{X_2\{j}} ϕ(j, x_2) f_{2|1}(x_2 | j) ν_2(dx_2).   (2.20)

Proof.
By the definition of the conditional WE, the weighted Fano inequality, Theorem 1.4, and with definitions (2.20) at hand, we obtain that

h^w_ϕ(X_2 | X_1) ≤ Σ_j P(X_1 = j) [−ϕ(j, j)(1 − ε_j) log(1 − ε_j) + ∫_{X_2\{j}} ϕ(j, x_2) f_{2|1}(x_2 | j) ν_2(dx_2) · log (ν_2(X_2 \ {j})/ε_j)].

This yields inequality (2.19).
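As a numerical sanity check (a sketch, not from the paper), the weighted Fano bound (2.16) can be tested on a 5-point alphabet under the counting measure. The PMF p and the WF phi below are arbitrary illustrative choices; the code first verifies hypothesis (2.15) and then compares h^w_ϕ(X) with the right-hand side of (2.16).

```python
import math

m, pstar = 5, 0.6
p = [pstar, 0.2, 0.1, 0.06, 0.04]     # p[0] = P(X = x*); the rest is uneven
phi = [1.0, 1.5, 1.2, 0.8, 0.5]       # illustrative WF values

# hypothesis (2.15): here nu(X \ {x*}) = m - 1, so the uniform level is
# (1 - p*) / (m - 1)
hyp = sum(w * (q - (1 - pstar) / (m - 1)) for w, q in zip(phi[1:], p[1:]))

# weighted entropy h^w_phi(X), Eqn (1.1) with counting measure
h = -sum(w * q * math.log(q) for w, q in zip(phi, p))

# phi* = sum phi(x) f(x) over X \ {x*}; right-hand side of (2.16)
phi_star = sum(w * q for w, q in zip(phi[1:], p[1:]))
bound = -phi[0] * pstar * math.log(pstar) + phi_star * math.log((m - 1) / (1 - pstar))
print(hyp, h, bound)
```

With these values the hypothesis holds and the bound is not tight, as expected, since the conditional law on X \ {x*} is far from uniform.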
In this section we establish some extremality properties for the WE; cf. [4], Chap. 12.
Theorem 3.1
Suppose X* : Ω → X is an RV with a PM/DF f* and x ∈ X ↦ ϕ(x) is a given WF.

(I) Then f* (or X*) is the unique maximizer, modulo ϕ, of the WE h^w_ϕ(f) under the constraints

∫_X ϕ(x) [f(x) − f*(x)] ν(dx) ≥ 0 and   (3.1)

∫_X ϕ(x) [f(x) − f*(x)] log f*(x) ν(dx) ≥ 0.   (3.2)

(II) On the other hand, consider a constraint

∫_X ϕ(x) f(x) β(x) dν(x) = c,   (3.3)

where x ∈ X ↦ β(x) is a given function and c a given constant, neither of which is assumed non-negative. Suppose that f*(x) = (1/Z) exp[−b β(x)] is a (Gibbsian-type) PM/DF such that

∫_X ϕ(x) f*(x) dν(x) = 1 and ∫_X ϕ(x) f*(x) β(x) dν(x) = c.

Here b is a constant (an analog of the inverse temperature) and Z = ∫_X exp[−b β(x)] dν(x) ∈ (0, ∞) is the normalizing denominator (an analog of a partition function). Introduce the second constraint:

(log Z) ∫_X ϕ(x) [f*(x) − f(x)] dν(x) ≥ 0.   (3.4)

Then, under (3.3) and (3.4), the WE h^w_ϕ(f) is maximized at f = f*. As above, it is a unique maximizer, modulo ϕ.

Proof. (I) Using definition (1.2) and Theorem 1.3, we obtain

0 ≥ −D^w_ϕ(f ‖ f*) = h^w_ϕ(f) + ∫_X ϕ(x) f(x) log f*(x) ν(dx)   (3.5)

(here (3.1) guarantees (1.4) with g = f*). Under constraint (3.2) it yields

h^w_ϕ(f) ≤ −∫_X ϕ(x) f*(x) log f*(x) ν(dx) = h^w_ϕ(f*).   (3.6)

The uniqueness of the maximizer follows from the uniqueness case for equality in the weighted Gibbs inequality.

(II) Again use (3.5):

h^w_ϕ(f) ≤ −∫_X ϕ(x) f(x) [−log Z − b β(x)] dν(x) = (log Z) ∫_X ϕ(x) f(x) dν(x) + b ∫_X ϕ(x) f(x) β(x) dν(x)
 ≤ (log Z) ∫_X ϕ(x) f*(x) dν(x) + b ∫_X ϕ(x) f*(x) β(x) dν(x) = h^w_ϕ(f*).

Note that when Z ≥ 1, the factor log Z can be omitted from (3.4); otherwise log Z can be replaced by −1.
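A quick Python illustration of part (II) in the special case ϕ ≡ 1 (an assumption under which (3.3)–(3.4) reduce to fixing the mean "energy"): the Gibbsian PMF f* then maximizes the entropy among PMFs with the same constraint value. The energy values beta and the perturbation direction d below are illustrative choices; d has zero total mass and zero mean energy, so the perturbed f satisfies the same constraint (3.3).

```python
import math

beta = [0.0, 1.0, 2.0, 3.0]      # "energy" values beta(x) on a 4-point space
b = 1.0                          # inverse-temperature analog
Z = sum(math.exp(-b * e) for e in beta)
f_star = [math.exp(-b * e) / Z for e in beta]   # Gibbsian maximizer f*

# competing PMF with the same constraint value: perturb f* along a direction
# with zero total mass and zero mean energy (so (3.3) is preserved exactly)
d = [1.0, -2.0, 1.0, 0.0]
f = [q + 0.05 * v for q, v in zip(f_star, d)]

def h(p):
    # standard entropy, i.e., h^w_phi with phi ≡ 1 and counting measure
    return -sum(q * math.log(q) for q in p if q > 0)

c_star = sum(q * e for q, e in zip(f_star, beta))
c = sum(q * e for q, e in zip(f, beta))
print(h(f), h(f_star))           # h(f) < h(f*) since f differs from f*
```

The strict gap h(f*) − h(f) equals the Kullback–Leibler divergence D(f ‖ f*), in line with the uniqueness claim of Theorem 3.1.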
Example 3.2 Consider a random vector X = X^d : Ω → R^d with PDF f (relative to the d-dimensional Lebesgue measure), mean vector µ = 0 and covariance matrix C = (C_{ij}) with C_{ij} = E[X_i X_j], 1 ≤ i, j ≤ d. Let f^No_C be the normal PDF with the same µ and C. Let x = x^d ∈ R^d ↦ ϕ(x) be a given WF which is positive on an open domain in R^d. Introduce the d × d matrices Φ = (Φ_{ij}), Φ^No_C = (Φ^No_{ij}) and x^T x, where (x^T x)_{ij} = x_i x_j and

Φ = ∫_{R^d} ϕ(x) f(x) x^T x dx, Φ^No_C = ∫_{R^d} ϕ(x) f^No_C(x) x^T x dx.

Suppose that

∫_{R^d} ϕ(x) [f(x) − f^No_C(x)] dx ≥ 0

and

log [(2π)^d (det C)] ∫_{R^d} ϕ(x) [f(x) − f^No_C(x)] dx + (log e) tr [C^{−1} (Φ − Φ^No_C)] ≤ 0.   (3.7)

Then

h^w_ϕ(f) ≤ h^w_ϕ(f^No_C) = (1/2) log [(2π)^d (det C)] ∫_{R^d} ϕ(x) f^No_C(x) dx + (log e/2) tr [C^{−1} Φ^No_C],   (3.8)

with equality iff f = f^No_C modulo ϕ.

Proof.
Using the same idea as before, write

0 ≥ −D^w_ϕ(f ‖ f^No_C) = h^w_ϕ(f) − (1/2) log [(2π)^d (det C)] ∫_{R^d} ϕ(x) f(x) dx − (log e/2) tr [C^{−1} Φ].   (3.9)

Equivalently,

h^w_ϕ(f) ≤ (1/2) log [(2π)^d (det C)] ∫_{R^d} ϕ(x) f(x) dx + (log e/2) tr [C^{−1} Φ],

which, by (3.7) and the preceding constraint, leads directly to the result.

To further illustrate the above methodology, we provide some more examples, omitting the proofs.

Example 3.3
Let f_Exp denote an exponential PDF on R_+ = (0, ∞) (relative to the Lebesgue measure dx) with mean λ^{−1}. Suppose a PDF f on R_+ satisfies the constraints

∫_{R_+} ϕ(x) [f(x) − f_Exp(x)] dx ≥ 0

and

(log λ) ∫_{R_+} ϕ(x) [f(x) − f_Exp(x)] dx − λ ∫_{R_+} x ϕ(x) [f(x) − f_Exp(x)] dx ≥ 0,   (3.10)

where x ∈ R_+ ↦ ϕ(x) is a given WF positive on an open interval. Then

h^w_ϕ(f) ≤ h^w_ϕ(f_Exp) = −(λ log λ) ∫_{R_+} ϕ(x) e^{−λx} dx + λ² ∫_{R_+} x ϕ(x) e^{−λx} dx,

and f_Exp is the unique maximizer modulo ϕ.

Example 3.4 Take X = Z_+ = {0, 1, . . .} and let ν be the counting measure: ν(i) = 1 for all i ∈ Z_+. Then, for an RV X with PMF f(i), we have f(i) = P(X = i). Fix a WF i ∈ Z_+ ↦ ϕ(i).

(a) Let f_Ge be a geometric PMF: f_Ge(x) = (1 − p)^x p, x ∈ Z_+. Then for any PMF f(i), i ∈ Z_+, satisfying the constraints

Σ_{i∈Z_+} ϕ(i) [f(i) − f_Ge(i)] ≥ 0

and

log p Σ_{i∈Z_+} ϕ(i) [f(i) − f_Ge(i)] + log(1 − p) Σ_{i∈Z_+} i ϕ(i) [f(i) − f_Ge(i)] ≥ 0,   (3.11)

we have h^w_ϕ(f) ≤ h^w_ϕ(f_Ge), with equality iff f = f_Ge modulo ϕ.

(b) Let f_Po be a Poisson PMF: f_Po(k) = e^{−λ} λ^k / k!, k ∈ Z_+. Then for any PMF f(k), k ∈ Z_+, satisfying the constraints

Σ_{k∈Z_+} ϕ(k) [f(k) − f_Po(k)] ≥ 0

and

log λ Σ_{k∈Z_+} k ϕ(k) [f(k) − f_Po(k)] − λ Σ_{k∈Z_+} ϕ(k) [f(k) − f_Po(k)] − Σ_{k∈Z_+} (log k!) ϕ(k) [f(k) − f_Po(k)] ≥ 0,   (3.12)

we have h^w_ϕ(f) ≤ h^w_ϕ(f_Po), with equality iff f = f_Po modulo ϕ.

Theorem 3.5 below offers an extension of the Ky Fan inequality stating that log det C is a concave function of a positive-definite d × d matrix C. Cf. [16, 17, 18, 21]. We follow the method proposed by Cover, Dembo and Thomas. As before, f^No_C denotes the normal PDF with zero mean and covariance matrix C.
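For orientation: in the special case ϕ ≡ 1, the weighted Ky Fan inequality (3.16) below reduces to the classical concavity of log det over positive-definite matrices, since then σ(C) = (d/2) log(2πe) + (1/2) log det C. A quick numerical check of that specialization (with randomly generated matrices; a sketch, not part of the paper):

```python
import numpy as np

# phi ≡ 1 specialization of (3.16):
#   log det(l1*C1 + l2*C2) >= l1*log det(C1) + l2*log det(C2)
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C1 = A @ A.T + np.eye(3)          # positive-definite "covariance" matrices
C2 = B @ B.T + np.eye(3)
l1, l2 = 0.3, 0.7                 # lambda_1 + lambda_2 = 1

lhs = np.linalg.slogdet(l1 * C1 + l2 * C2)[1]
rhs = l1 * np.linalg.slogdet(C1)[1] + l2 * np.linalg.slogdet(C2)[1]
print(lhs, rhs)                   # lhs >= rhs, equality only when C1 = C2
```

(`slogdet` returns the sign and the log of the absolute determinant; for positive-definite matrices the sign is 1, so the second component is log det.)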
Theorem 3.5 (The weighted Ky Fan inequality; cf. [3], Theorem 17.9.1, [4], Theorem 1, [5], Theorem 8, [20], Worked Example 1.5.9.)
Assume that x^d ∈ R^d ↦ ϕ(x^d) ≥ 0 is a given WF positive on an open domain. Suppose that, for λ_1, λ_2 ∈ [0, 1] with λ_1 + λ_2 = 1 and positive-definite matrices C_1, C_2, with C = λ_1 C_1 + λ_2 C_2,

∫_{R^d} ϕ(x) [λ_1 f^No_{C_1}(x) + λ_2 f^No_{C_2}(x) − f^No_C(x)] dx ≥ 0, and   (3.13)

log [(2π)^d (det C)] ∫_{R^d} ϕ(x) [λ_1 f^No_{C_1}(x) + λ_2 f^No_{C_2}(x) − f^No_C(x)] dx + (log e) tr [C^{−1} Ψ] ≤ 0,   (3.14)

where

Ψ = ∫_{R^d} ϕ(x) [λ_1 f^No_{C_1}(x) + λ_2 f^No_{C_2}(x) − f^No_C(x)] x^T x dx.   (3.15)

Then, with σ_ϕ(C_1) = h^w_ϕ(f^No_{C_1}), σ_ϕ(C_2) = h^w_ϕ(f^No_{C_2}) and σ_ϕ(C) = h^w_ϕ(f^No_C),

σ_ϕ(C) − λ_1 σ_ϕ(C_1) − λ_2 σ_ϕ(C_2) ≥ 0,   (3.16)

with equality iff λ_1 λ_2 = 0 or C_1 = C_2.

Proof. Take values λ_1, λ_2 ∈ [0, 1] such that λ_1 + λ_2 = 1, and let C_1 and C_2 be two positive-definite d × d matrices. Let X_1 and X_2 be two multivariate normal vectors, with PDFs f_k ∼ N(0, C_k), k = 1, 2. Set Z = X_Θ, where the RV Θ takes two values, θ = 1 and θ = 2, with probabilities λ_1 and λ_2 respectively, and is independent of X_1 and X_2. Then the vector Z has covariance C = λ_1 C_1 + λ_2 C_2. Also set:

α(C) = ∫_{R^d} ϕ(x) f^No_C(x) dx.   (3.17)

Let x = (x_1, . . . , x_d) ∈ R^d ↦ ϕ(x) be a given WF and set ϕ̃(x, θ) = ϕ(x). Following the same arguments as in the proof of Theorem 2.1,

h^w_ϕ̃(Z | Θ) ≤ h^w_ϕ(Z).

It is plain that

h^w_ϕ̃(Z | Θ) = λ_1 h^w_ϕ(X_1) + λ_2 h^w_ϕ(X_2)
 = Σ_{k=1,2} λ_k {(1/2) log [(2π)^d (det C_k)] ∫_{R^d} ϕ(x) f^No_{C_k}(x) dx + (log e/2) tr [C_k^{−1} Φ^{(k)}]},   (3.18)

where

Φ^{(k)} = ∫_{R^d} x^T x ϕ(x) f^No_{C_k}(x) dx, k = 1, 2, and (x^T x)_{ij} = x_i x_j.

According to Example 3.2, we have

h^w_ϕ(Z) ≤ {(1/2) log [(2π)^d (det C)]} α(C) + (log e/2) tr [C^{−1} Φ],   (3.19)

where

Φ = ∫_{R^d} x^T x ϕ(x) f^No_C(x) dx.   (3.20)

The inequality (3.16) then follows. The cases of equality are covered by Theorem 2.1.

The following lemma is an immediate extension of Lemma 1.6.

Lemma 3.6
Let $\mathbf X^n=(X_1,\ldots,X_n)$ be a random vector, with components $X_i:\Omega\to\mathcal X_i$, $1\leq i\leq n$, and the joint PM/DF $f$. Extending the notation used earlier, set:
\[\mathbf x^n=(x_1,\ldots,x_n)\in\mathcal X^n:=\mathop{\times}_{1\leq i\leq n}\mathcal X_i\quad\hbox{and}\quad\nu^n({\rm d}\mathbf x^n)=\prod_{1\leq i\leq n}\nu_i({\rm d}x_i),\]
and more generally,
\[\mathbf x^l_k=(x_k,\ldots,x_l)\in\mathcal X^l_k:=\mathop{\times}_{k\leq i\leq l}\mathcal X_i\quad\hbox{and}\quad\nu^l_k({\rm d}\mathbf x^l_k)=\prod_{k\leq i\leq l}\nu_i({\rm d}x_i),\quad 1\leq k\leq l\leq n.\]
Next, introduce
\[f_i(x_i)=\int_{\mathcal X^{i-1}\times\mathcal X^n_{i+1}}f(\mathbf x^{i-1},x_i,\mathbf x^n_{i+1})\,\nu^{i-1}({\rm d}\mathbf x^{i-1})\,\nu^n_{i+1}({\rm d}\mathbf x^n_{i+1})\quad\hbox{(the marginal PM/DF for RV $X_i$)},\]
and
\[f_{|i}(\mathbf x^n|x_i)=\frac{f(\mathbf x^n)}{f_i(x_i)}\quad\hbox{(the conditional PM/DF given that $X_i=x_i$)}.\]
Given a WF $\mathbf x^n\in\mathcal X^n\mapsto\varphi(\mathbf x^n)$, suppose that
\[\int_{\mathcal X^n}\varphi(\mathbf x^n)\Big[f(\mathbf x^n)-\prod_{i=1}^nf_i(x_i)\Big]\,\nu^n({\rm d}\mathbf x^n)\ \geq\ 0.\qquad(3.21)\]
Then
\[h^{\rm w}_\varphi(\mathbf X^n)\ \leq\ \sum_{i=1}^nh^{\rm w}_{\psi_i}(X_i),\qquad(3.22)\]
where
\[\psi_i(x_i)=\int_{\mathcal X^{i-1}\times\mathcal X^n_{i+1}}\varphi(\mathbf x^n)f_{|i}(\mathbf x^n|x_i)\,\nu^{i-1}({\rm d}\mathbf x^{i-1})\,\nu^n_{i+1}({\rm d}\mathbf x^n_{i+1}).\]
Here, equality in (3.22) holds iff, modulo $\varphi$, the components $X_1,\ldots,X_n$ are independent.

Theorem 3.7 (The weighted Hadamard inequality; cf. [3], Theorem 17.9.2; [4], Theorem 3; [5], Theorem 26; [20], Worked Example 1.5.10.)
Let $C=(C_{ij})$ be a positive-definite $d\times d$ matrix and $f^{\rm No}_C$ the normal PDF with zero mean and covariance matrix $C$. Given a WF $\mathbf x^d=(x_1,\ldots,x_d)\in\mathbb R^d\mapsto\varphi(\mathbf x^d)$, positive on an open domain in $\mathbb R^d$, introduce the quantity $\alpha=\alpha(C)$ by (3.17) and the matrix $\Phi=(\Phi_{ij})$ by (3.20). Let $f^{\rm No}_i$ stand for the ${\rm N}(0,C_{ii})$-PDF (the marginal PDF of the $i$-th component). Then under the condition
\[\int_{\mathbb R^d}\varphi(\mathbf x^d)\Big[f^{\rm No}_C(\mathbf x^d)-\prod_{i=1}^df^{\rm No}_i(x_i)\Big]\,{\rm d}\mathbf x^d\ \geq\ 0,\qquad(3.23)\]
we have:
\[\alpha\log\prod_i(2\pi C_{ii})+(\log e)\sum_iC_{ii}^{-1}\Phi_{ii}-\alpha\log\big[(2\pi)^d(\det C)\big]-(\log e)\,{\rm tr}\,C^{-1}\Phi\ \geq\ 0,\qquad(3.24)\]
with equality iff $C$ is diagonal.

Proof. If $(X_1,\ldots,X_d)\sim{\rm N}(0,C)$ then, by following (3.22) in Lemma 3.6, we can write
\[\frac12\log\big[(2\pi)^d(\det C)\big]\int_{\mathbb R^d}\varphi(\mathbf x^d)f(\mathbf x^d)\,{\rm d}\mathbf x^d+\frac{\log e}2\,{\rm tr}\,C^{-1}\Phi\ \leq\ \sum_{i=1}^d\Big\{\frac12\log(2\pi C_{ii})\int_{\mathbb R}\psi_i(x)f^{\rm No}_i(x)\,{\rm d}x+\frac{\log e}2\,C_{ii}^{-1}\Psi_{ii}\Big\}.\qquad(3.25)\]
Here
\[\psi_i(x_i)=\int_{\mathbb R^{d-1}}\varphi(\mathbf x^d)f^{\rm No}_{|i}(\mathbf x^d|x_i)\prod_{j:\,j\neq i}{\rm d}x_j,\qquad\Psi_{ii}=\int_{\mathbb R}x_i^2\,\psi_i(x_i)f^{\rm No}_i(x_i)\,{\rm d}x_i=\Phi_{ii},\]
and
\[f^{\rm No}_{|i}(\mathbf x^d|x_i)=\frac{f^{\rm No}_C(\mathbf x^d)}{f^{\rm No}_i(x_i)}\quad\hbox{(the conditional PDF)}.\]
With
\[\alpha=\int_{\mathbb R}\psi_i(x_i)f^{\rm No}_i(x_i)\,{\rm d}x_i=\int_{\mathbb R^d}\varphi(\mathbf x^d)f^{\rm No}_C(\mathbf x^d)\,{\rm d}\mathbf x^d,\]
the bound (3.24) follows. $\square$

Remark 3.8
As above, maximizing the left-hand side in (3.24) would give a bound between $\det C$ and the product $\prod_{i=1}^dC_{ii}$.

Weighted Fisher information and related inequalities
In this section we introduce a weighted version of the Fisher information matrix and establish some straightforward facts. The bulk of these properties is derived by following Ref. [32].
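Before the formal definitions, a quick numerical sanity check may help fix ideas: for a scalar $X\sim{\rm N}(\theta,1)$ the score is $S(x,\theta)=x-\theta$, and with the illustrative (non-canonical, assumed here only) WF $\varphi(x)=1+x^2$ one gets, at $\theta=0$, $J^{\rm w}_\varphi(X;0)={\mathbb E}[X^2]+{\mathbb E}[X^4]=1+3=4$. A sketch by direct quadrature of the defining integral (4.1) below:

```python
import numpy as np

# Scalar weighted Fisher information of Definition 4.1 below:
# X ~ N(theta, 1) at theta = 0, score S(x, 0) = x,
# with the illustrative WF phi(x) = 1 + x^2 (an assumption made here only).
x, dx = np.linspace(-10.0, 10.0, 200_001, retstep=True)
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # N(0, 1) density
phi = 1.0 + x**2

# J^w_phi(X; 0) = E[phi(X) S(X, 0)^2]; Riemann sum (Gaussian tails negligible)
J_num = float(np.sum(phi * f * x**2) * dx)

# Closed form for the standard normal: E[X^2] + E[X^4] = 1 + 3 = 4
print(J_num)   # ~ 4.0
```

With $\varphi\equiv 1$ the same computation returns the standard Fisher information $J(X;0)=1$, in line with the remark closing Definition 4.1.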
Definition 4.1
Let $\mathbf X=(X_1,\ldots,X_n)$ be a random $1\times n$ vector with probability density function (PDF) $f_\theta(\mathbf x)=f_{\mathbf X}(\mathbf x;\theta)$, $\mathbf x=(x_1,\ldots,x_n)\in\mathbb R^n$, where $\theta=(\theta_1,\ldots,\theta_m)\in\mathbb R^m$ is a parameter vector. Suppose that the dependence $\theta\mapsto f_\theta$ is $C^1$. The $m\times m$ weighted Fisher information matrix $J^{\rm w}_\varphi(\mathbf X;\theta)$, with a given WF $\mathbf x\in\mathbb R^n\mapsto\varphi(\mathbf x)\geq 0$, is defined by
\[J^{\rm w}_\varphi(\mathbf X;\theta)={\mathbb E}\Big[\varphi(\mathbf X)S(\mathbf X,\theta)^{\rm T}S(\mathbf X,\theta)\Big]=\int\frac{\varphi(\mathbf x)}{f_\theta(\mathbf x)}\Big(\frac{\partial f_\theta(\mathbf x)}{\partial\theta}\Big)^{\rm T}\frac{\partial f_\theta(\mathbf x)}{\partial\theta}\,{\mathbf 1}\big(f_\theta(\mathbf x)>0\big)\,{\rm d}\mathbf x,\qquad(4.1)\]
assuming the integrals are absolutely convergent. Here and below, $\displaystyle\frac{\partial}{\partial\theta}$ stands for the $1\times m$ gradient in $\theta$ and $S(\mathbf x,\theta)={\mathbf 1}\big(f_\theta(\mathbf x)>0\big)\displaystyle\frac{\partial}{\partial\theta}\log f_\theta(\mathbf x)$ denotes the score vector. When $\varphi(\mathbf x)\equiv 1$, $J^{\rm w}_\varphi(\mathbf X;\theta)=J(\mathbf X;\theta)$, the standard Fisher information matrix, cf. [5], [4], [20].

Definition 4.2
Let $(\mathbf X,\mathbf Y)$ be a pair of RVs with a joint PDF $f_\theta(\mathbf x,\mathbf y)=f_{\mathbf X,\mathbf Y}(\mathbf x,\mathbf y;\theta)$ and conditional PDF
\[f_\theta(\mathbf y|\mathbf x)=f_{\mathbf Y|\mathbf X}(\mathbf y|\mathbf x;\theta):=\frac{f_{\mathbf X,\mathbf Y}(\mathbf x,\mathbf y;\theta)}{f_{\mathbf X}(\mathbf x;\theta)}.\]
Given a joint WF $(\mathbf x,\mathbf y)\in\mathbb R^n\times\mathbb R^n\mapsto\varphi(\mathbf x,\mathbf y)\geq 0$, we set:
\[J^{\rm w}_\varphi(\mathbf X,\mathbf Y;\theta)={\mathbb E}\bigg[\varphi(\mathbf X,\mathbf Y)\Big(\frac{\partial\log f_\theta(\mathbf X,\mathbf Y)}{\partial\theta}\Big)^{\rm T}\frac{\partial\log f_\theta(\mathbf X,\mathbf Y)}{\partial\theta}\,{\mathbf 1}\big(f_\theta(\mathbf X,\mathbf Y)>0\big)\bigg]=\int\frac{\varphi(\mathbf x,\mathbf y)}{f_\theta(\mathbf x,\mathbf y)}\Big(\frac{\partial f_\theta(\mathbf x,\mathbf y)}{\partial\theta}\Big)^{\rm T}\frac{\partial f_\theta(\mathbf x,\mathbf y)}{\partial\theta}\,{\mathbf 1}\big(f_\theta(\mathbf x,\mathbf y)>0\big)\,{\rm d}\mathbf x\,{\rm d}\mathbf y\qquad(4.2)\]
and
\[J^{\rm w}_\varphi(\mathbf Y|\mathbf X;\theta)={\mathbb E}\bigg[\varphi(\mathbf X,\mathbf Y)\Big(\frac{\partial\log f_\theta(\mathbf Y|\mathbf X)}{\partial\theta}\Big)^{\rm T}\frac{\partial\log f_\theta(\mathbf Y|\mathbf X)}{\partial\theta}\,{\mathbf 1}\big(f_\theta(\mathbf Y|\mathbf X)>0\big)\bigg]=\int\varphi(\mathbf x,\mathbf y)\,\frac{f_\theta(\mathbf x,\mathbf y)}{f_\theta(\mathbf y|\mathbf x)^2}\Big(\frac{\partial f_\theta(\mathbf y|\mathbf x)}{\partial\theta}\Big)^{\rm T}\frac{\partial f_\theta(\mathbf y|\mathbf x)}{\partial\theta}\,{\mathbf 1}\big(f_\theta(\mathbf y|\mathbf x)>0\big)\,{\rm d}\mathbf x\,{\rm d}\mathbf y.\qquad(4.3)\]
Next, consider an $m\times m$ matrix $S_\theta=S_\theta(f_{\mathbf X,\mathbf Y})$ and a $1\times m$ vector $B_\theta=B_\theta(\mathbf x,f_{\mathbf Y|\mathbf X})$:
\[B_\theta={\mathbb E}_{\mathbf Y|\mathbf X=\mathbf x}\bigg[\varphi(\mathbf x,\mathbf Y)\,\frac{\partial\log f_\theta(\mathbf Y|\mathbf x)}{\partial\theta}\bigg]=\int\varphi(\mathbf x,\mathbf y)\,\frac{\partial f_\theta(\mathbf y|\mathbf x)}{\partial\theta}\,{\mathbf 1}\big(f_\theta(\mathbf y|\mathbf x)>0\big)\,{\rm d}\mathbf y,\qquad(4.4)\]
\[S_\theta={\mathbb E}\bigg\{\bigg[\Big(\frac{\partial\log f_\theta(\mathbf X)}{\partial\theta}\Big)^{\rm T}B_\theta(\mathbf X)+B_\theta(\mathbf X)^{\rm T}\,\frac{\partial\log f_\theta(\mathbf X)}{\partial\theta}\bigg]\,{\mathbf 1}\big(f_\theta(\mathbf X)>0\big)\bigg\}.\qquad(4.5)\]
When $\varphi(\mathbf x,\mathbf y)$ depends only on $\mathbf x$ and under standard regularity assumptions, the vector $B_\theta$ vanishes (and so does the matrix $S_\theta$):
\[B_\theta=\varphi(\mathbf x)\int\frac{\partial f_\theta(\mathbf y|\mathbf x)}{\partial\theta}\,{\rm d}\mathbf y=\varphi(\mathbf x)\,\frac{\partial}{\partial\theta}\int f_\theta(\mathbf y|\mathbf x)\,{\rm d}\mathbf y=0.\]
For the sake of brevity, in the formulas that follow we routinely omit indicators of positivity of the PDFs involved: their presence can easily be derived from the local context.

Lemma 4.3 (The chain rule: cf. [32], Lemma 1.)
Given a pair $(\mathbf X,\mathbf Y)$ of random vectors and a joint WF $(\mathbf x,\mathbf y)\mapsto\varphi(\mathbf x,\mathbf y)$, set:
\[\psi(\mathbf x)=\psi_{\mathbf X}(\mathbf x)=\int\varphi(\mathbf x,\mathbf y)f_\theta(\mathbf y|\mathbf x)\,{\rm d}\mathbf y={\mathbb E}_{\mathbf Y|\mathbf X=\mathbf x}\,\varphi(\mathbf x,\mathbf Y).\qquad(4.6)\]
Then
\[J^{\rm w}_\varphi(\mathbf X,\mathbf Y;\theta)=J^{\rm w}_\psi(\mathbf X;\theta)+J^{\rm w}_\varphi(\mathbf Y|\mathbf X;\theta)+S_\theta.\qquad(4.7)\]

Proof.
For simplicity, assume that $\theta$ is scalar: $\theta=\theta_1$; the generalization to the vector case is straightforward. We have
\[J^{\rm w}_\varphi(\mathbf X,\mathbf Y;\theta)={\mathbb E}\bigg[\varphi(\mathbf X,\mathbf Y)\Big(\frac{\partial\log f_\theta(\mathbf X,\mathbf Y)}{\partial\theta}\Big)^2\bigg].\qquad(4.8)\]
Furthermore, $\log f_\theta(\mathbf x,\mathbf y)=\log f_\theta(\mathbf x)+\log f_\theta(\mathbf y|\mathbf x)$. Using this in (4.8) yields:
\[J^{\rm w}_\varphi(\mathbf X,\mathbf Y;\theta)={\mathbb E}\bigg[\varphi(\mathbf X,\mathbf Y)\Big(\frac{\partial\log f_\theta(\mathbf X)}{\partial\theta}\Big)^2\bigg]+{\mathbb E}\bigg[\varphi(\mathbf X,\mathbf Y)\Big(\frac{\partial\log f_\theta(\mathbf Y|\mathbf X)}{\partial\theta}\Big)^2\bigg]+2\,{\mathbb E}\bigg[\varphi(\mathbf X,\mathbf Y)\,\frac{\partial\log f_\theta(\mathbf X)}{\partial\theta}\,\frac{\partial\log f_\theta(\mathbf Y|\mathbf X)}{\partial\theta}\bigg].\qquad(4.9)\]
We can also write
\[{\mathbb E}\bigg[\varphi(\mathbf X,\mathbf Y)\,\frac{\partial\log f_\theta(\mathbf X)}{\partial\theta}\,\frac{\partial\log f_\theta(\mathbf Y|\mathbf X)}{\partial\theta}\bigg]={\mathbb E}\bigg\{\frac{\partial\log f_\theta(\mathbf X)}{\partial\theta}\,{\mathbb E}\bigg[\varphi(\mathbf X,\mathbf Y)\,\frac{\partial\log f_\theta(\mathbf Y|\mathbf X)}{\partial\theta}\,\bigg|\,\mathbf X\bigg]\bigg\}.\qquad(4.10)\]
The inner expectation in the right-hand side of (4.10) equals $B_\theta(\mathbf X)$; hence the doubled cross term in (4.9) yields $S_\theta$ in (4.7), while the first two summands in (4.9) coincide with $J^{\rm w}_\psi(\mathbf X;\theta)$ and $J^{\rm w}_\varphi(\mathbf Y|\mathbf X;\theta)$, respectively. $\square$

Throughout the paper, an inequality $A\leq B$ between matrices $A$ and $B$ means that $B-A$ is a positive semi-definite matrix.

Lemma 4.4 (Data-refinement inequality: cf. [32], Lemma 2.)
For a given joint WF $(\mathbf x,\mathbf y)\mapsto\varphi(\mathbf x,\mathbf y)$,
\[J^{\rm w}_\varphi(\mathbf X,\mathbf Y;\theta)\ \geq\ J^{\rm w}_\psi(\mathbf X;\theta)+S_\theta,\qquad(4.11)\]
with equality if $\mathbf X$ is a sufficient statistic for $\theta$. Here the WF $\psi=\psi_{\mathbf X}$ is defined as in (4.6).

Proof.
Bound (4.11) follows from Lemma 4.3 using the non-negativity of the matrix
\[J^{\rm w}_\varphi(\mathbf Y|\mathbf X=\mathbf x;\theta)=\int f_\theta(\mathbf y|\mathbf x)\,\varphi(\mathbf x,\mathbf y)\Big(\frac{\partial\log f_\theta(\mathbf y|\mathbf x)}{\partial\theta}\Big)^{\rm T}\frac{\partial\log f_\theta(\mathbf y|\mathbf x)}{\partial\theta}\,{\rm d}\mathbf y.\]
Equality holds when $J^{\rm w}_\varphi(\mathbf Y|\mathbf X=\mathbf x;\theta)=0$, which leads to the statement. $\square$

Lemma 4.5 (Data-processing inequality: cf. [32], Lemma 3.) For a given joint WF $(\mathbf x,\mathbf y)\mapsto\varphi(\mathbf x,\mathbf y)$ and a function $\mathbf x\mapsto g(\mathbf x)$, set
\[\varrho_g(\mathbf x)=\varphi(\mathbf x,g(\mathbf x))\quad\hbox{and}\quad\rho_g(\mathbf x)=\varphi(\mathbf x,g(\mathbf x))\,f_\theta(\mathbf x|g(\mathbf x)).\qquad(4.12)\]
Then we have
\[J^{\rm w}_{\varrho_g}(\mathbf X;\theta)\ \geq\ J^{\rm w}_{\rho_g}(g(\mathbf X);\theta).\qquad(4.13)\]
The equality holds iff the function $g$ is invertible.

Proof.
We make use of Lemma 4.4. Let $\mathbf Y=g(\mathbf X)$; then $S_\theta=0$. This yields
\[J^{\rm w}_{\rho_g}(g(\mathbf X);\theta)\ \leq\ J^{\rm w}_\varphi(\mathbf X,g(\mathbf X);\theta).\qquad(4.14)\]
Note that the equality holds true if $J^{\rm w}_\varphi(\mathbf X|g(\mathbf X);\theta)=0$, that is, if $g(\mathbf X)$ is a sufficient statistic for $\theta$. Now use the chain rule, Lemma 4.3, where $J^{\rm w}_\varphi(g(\mathbf X)|\mathbf X;\theta)=0$. Hence,
\[J^{\rm w}_\varphi(\mathbf X,g(\mathbf X);\theta)=J^{\rm w}_{\varrho_g}(\mathbf X;\theta).\qquad(4.15)\]
The assertions (4.14) and (4.15) lead directly to the result. $\square$

Lemma 4.6 (Parameter transformation: cf. [32], Lemma 4.)
Suppose we have a family of PDFs $f_\eta(\mathbf x)$ parameterized by a $1\times m'$ vector $\eta=(\eta_1,\ldots,\eta_{m'})\in\mathbb R^{m'}$. Suppose that the vector $\eta$ is a function of $\theta\in\mathbb R^m$. Then
\[J^{\rm w}_\varphi(\mathbf X;\theta)=\Big(\frac{\partial\eta}{\partial\theta}\Big)^{\rm T}J^{\rm w}_\varphi\big(\mathbf X;\eta(\theta)\big)\Big(\frac{\partial\eta}{\partial\theta}\Big),\qquad(4.16)\]
with the $m'\times m$ matrix $\displaystyle\frac{\partial\eta}{\partial\theta}=\Big(\frac{\partial\eta_i}{\partial\theta_j}\Big)$, $1\leq i\leq m'$, $1\leq j\leq m$. In the linear case where $\eta(\theta)=\theta Q$ for some $m\times m'$ matrix $Q$, we obtain:
\[J^{\rm w}_\varphi(\mathbf X;\theta)=QJ^{\rm w}_\varphi\big(\mathbf X;\eta(\theta)\big)Q^{\rm T}.\qquad(4.17)\]

Proof.
Formula (4.16) becomes straightforward by substituting the expression
\[\frac{\partial\log f_{\eta(\theta)}(\mathbf x)}{\partial\theta}=\frac{\partial\log f_\eta(\mathbf x)}{\partial\eta}\,\frac{\partial\eta(\theta)}{\partial\theta}.\qquad(4.18)\]
$\square$

Concluding this section, we consider a linear model where the parameter is related to an additive shift. Suppose a random vector $\mathbf X$ in $\mathbb R^n$ has a joint PDF $f_{\mathbf X}$ and $\mathbf x\in\mathbb R^n\mapsto\varphi(\mathbf x)$ is a given WF. Set:
\[L^{\rm w}_\varphi(\mathbf X):=\int\frac{\varphi(\mathbf x)}{f(\mathbf x)}\big(\nabla f(\mathbf x)\big)^{\rm T}\nabla f(\mathbf x)\,{\rm d}\mathbf x.\qquad(4.19)\]
Here and below, we use the symbol $\nabla$ for the spatial gradient ($1\times n$ vectors), as opposed to the parameter gradients $\displaystyle\frac{\partial}{\partial\theta}$ and $\displaystyle\frac{\partial}{\partial\eta}$. Furthermore, set
\[\mathbf X=\theta Q+\mathbf YP.\qquad(4.20)\]
Here $Q$ and $P$ are two matrices, of sizes $m\times n$ and $k\times n$, respectively, with $m\leq k\leq n$; next, $\mathbf X\in\mathbb R^n$ and $\mathbf Y\in\mathbb R^k$. Let $\mathbf x\in\mathbb R^n\mapsto\varphi(\mathbf x)\geq 0$ be a given WF and set
\[\psi(\mathbf y)=\psi_P(\mathbf y)=\int_{\mathbb R^{n-k}}\varphi(\mathbf x)\,{\mathbf 1}(\mathbf xP^{\rm T}=\mathbf y)\,f_{\mathbf X|\mathbf XP^{\rm T}}(\mathbf x|\mathbf y)\,{\rm d}\mathbf x_{\complement P},\qquad\mathbf y\in\mathbb R^nP^{\rm T},\]
where $\mathbf x_{\complement P}$ stands for the complementary variable in $\mathbf x$, given that $\mathbf xP^{\rm T}=\mathbf y$. In Lemma 4.7 we present relationships between $J^{\rm w}_\varphi(\mathbf X;\theta)$, $J^{\rm w}_\psi(\mathbf Y;\theta)$, $L^{\rm w}_\varphi(\mathbf X)$ and $L^{\rm w}_\psi(\mathbf XP^{\rm T})$ for the above model. (The proofs are straightforward and omitted.)

Lemma 4.7 (Cf. [32], Lemmas 5 and 6.)
Assume the model (4.20). Then
\[J^{\rm w}_\varphi(\mathbf X;\theta)=QL^{\rm w}_\varphi(\mathbf X)Q^{\rm T},\qquad J^{\rm w}_\psi(\mathbf Y;\theta)=QP^{\rm T}L^{\rm w}_\psi(\mathbf XP^{\rm T})PQ^{\rm T},\]
and
\[J^{\rm w}_\varphi(\mathbf X;\theta)\ \geq\ J^{\rm w}_\psi(\mathbf Y;\theta).\qquad(4.21)\]

Corollary 4.8 (Cf. [32], Corollary 1.)
Let $P$ be an $m\times m$ matrix. Let $\mathbf X$ be a random vector in $\mathbb R^m$, and let the WFs $\varphi$ and $\psi=\psi_P$ be as above. Then

(i) $J^{\rm w}_\varphi(\mathbf X;\theta)\geq P^{\rm T}J^{\rm w}_\psi(\mathbf XP^{\rm T};\theta)P$.

(ii) For $P$ with orthonormal rows (i.e., with $PP^{\rm T}$ equal to $I_m$, the unit $m\times m$ matrix),
\[J^{\rm w}_\psi(\mathbf XP^{\rm T};\theta)\ \leq\ P^{\rm T}J^{\rm w}_\varphi(\mathbf X;\theta)P.\qquad(4.22)\]

(iii) For $P$ with a full row rank $m$, and $\mathbf X\in\mathbb R^m$ with nonsingular $J^{\rm w}_\varphi$,
\[J^{\rm w}_\psi(\mathbf XP^{\rm T};\theta)\ \leq\ \Big(P^{\rm T}\big[J^{\rm w}_\varphi(\mathbf X;\theta)\big]^{-1}P\Big)^{-1}.\qquad(4.23)\]

We start with multivariate weighted Cramér–Rao inequalities (WCRIs). As usual, consider a family of PDFs $f_\theta(\mathbf x)$, $\mathbf x\in\mathbb R^n$, depending on a parameter $\theta\in\mathbb R^m$, and let $\mathbf X=\mathbf X_\theta$ denote the random vector with PDF $f_\theta$. Let a statistic $\mathbf x\mapsto T(\mathbf x)=(T_1(\mathbf x),\ldots,T_s(\mathbf x))$ and a WF $\mathbf x\mapsto\varphi(\mathbf x)\geq 0$ be given. With ${\mathbb E}_\theta$ standing for the expectation relative to $f_\theta$, set:
\[\alpha(\theta)={\mathbb E}_\theta\,\varphi(\mathbf X),\qquad\eta(\theta)={\mathbb E}_\theta\big[\varphi(\mathbf X)T(\mathbf X)\big].\qquad(5.1)\]
We also suppose that the operations of taking the expectation and the gradient are interchangeable:
\[{\mathbb E}_\theta\big[\varphi(\mathbf X)S(\mathbf X,\theta)\big]=\frac{\partial\alpha(\theta)}{\partial\theta},\qquad{\mathbb E}_\theta\big[\varphi(\mathbf X)T(\mathbf X)^{\rm T}S(\mathbf X,\theta)\big]=\frac{\partial\eta(\theta)}{\partial\theta},\qquad(5.2)\]
assuming $C^1$-dependence of $\theta\mapsto\alpha(\theta)$ and $\theta\mapsto\eta(\theta)$ and absolute convergence of the integrals involved. Let $C^{\rm w}_\varphi(\theta)$ denote the weighted covariance matrix of $T(\mathbf X)$:
\[C^{\rm w}_\varphi(\theta)={\mathbb E}_\theta\Big\{\varphi(\mathbf X)\big[T(\mathbf X)-\eta(\theta)\big]^{\rm T}\big[T(\mathbf X)-\eta(\theta)\big]\Big\},\qquad(5.3)\]
and let $J^{\rm w}_\varphi(\mathbf X;\theta)={\mathbb E}\big[\varphi(\mathbf X)S(\mathbf X,\theta)^{\rm T}S(\mathbf X,\theta)\big]$ be the weighted Fisher information matrix under the WF $\varphi$; cf. Eqn (4.1).

Theorem 5.1 (A weighted Cramér–Rao inequality, version I; cf. [4], Theorem 11.10.1; [5], Theorem 20.) Assuming $J^{\rm w}_\varphi(\mathbf X;\theta)$ is invertible and under condition (5.2), the vectors $\eta(\theta)$, $\displaystyle\frac{\partial\alpha(\theta)}{\partial\theta}$ and matrices $C^{\rm w}_\varphi(\theta)$, $J^{\rm w}_\varphi(\mathbf X;\theta)$, $\displaystyle\frac{\partial\eta(\theta)}{\partial\theta}$ obey
\[C^{\rm w}_\varphi(\theta)\ \geq\ \bigg[\frac{\partial\eta(\theta)}{\partial\theta}-\big(\eta(\theta)\big)^{\rm T}\frac{\partial\alpha(\theta)}{\partial\theta}\bigg]\Big[J^{\rm w}_\varphi(\mathbf X;\theta)\Big]^{-1}\bigg[\frac{\partial\eta(\theta)}{\partial\theta}-\big(\eta(\theta)\big)^{\rm T}\frac{\partial\alpha(\theta)}{\partial\theta}\bigg]^{\rm T}.\qquad(5.4)\]

Proof.
We start with the simplified version where $s=1$, and $T(\mathbf X)=T_1(\mathbf X)$ and $\eta(\theta)=\eta_1(\theta)$ are scalars, keeping general $n,m\geq 1$. By using (5.2), write:
\[{\mathbb E}_\theta\Big\{\varphi(\mathbf X)\big[T(\mathbf X)-\eta(\theta)\big]S(\mathbf X;\theta)\Big\}={\mathbb E}_\theta\big[\varphi(\mathbf X)T(\mathbf X)S(\mathbf X;\theta)\big]-\eta(\theta)\,{\mathbb E}_\theta\big[\varphi(\mathbf X)S(\mathbf X;\theta)\big]=\frac{\partial\eta(\theta)}{\partial\theta}-\eta(\theta)\frac{\partial\alpha(\theta)}{\partial\theta}.\qquad(5.5)\]
Then for any $1\times m$ vector $\mu\in\mathbb R^m$,
\[0\ \leq\ {\mathbb E}_\theta\Big\{\varphi(\mathbf X)\big[T(\mathbf X)-\eta(\theta)-S(\mathbf X,\theta)\mu^{\rm T}\big]^2\Big\}={\mathbb E}_\theta\Big\{\varphi(\mathbf X)\big[T(\mathbf X)-\eta(\theta)\big]^2\Big\}+\mu J^{\rm w}_\varphi(\mathbf X;\theta)\mu^{\rm T}-2\mu\Big(\frac{\partial\eta(\theta)}{\partial\theta}-\eta(\theta)\frac{\partial\alpha(\theta)}{\partial\theta}\Big)^{\rm T}.\qquad(5.6)\]
Taking $\mu=\Big(\displaystyle\frac{\partial\eta(\theta)}{\partial\theta}-\eta(\theta)\frac{\partial\alpha(\theta)}{\partial\theta}\Big)\big[J^{\rm w}_\varphi(\mathbf X;\theta)\big]^{-1}$ (which is the minimiser of the RHS in (5.6)), we obtain
\[{\rm Var}^{\rm w}_\varphi[T(\mathbf X)]:={\mathbb E}_\theta\Big\{\varphi(\mathbf X)\big(T(\mathbf X)-\eta(\theta)\big)^2\Big\}\ \geq\ \Big(\frac{\partial\eta(\theta)}{\partial\theta}-\eta(\theta)\frac{\partial\alpha(\theta)}{\partial\theta}\Big)\big[J^{\rm w}_\varphi(\mathbf X;\theta)\big]^{-1}\Big(\frac{\partial\eta(\theta)}{\partial\theta}-\eta(\theta)\frac{\partial\alpha(\theta)}{\partial\theta}\Big)^{\rm T}.\qquad(5.7)\]
Turning to the general case $s\geq 1$, set $T(\mathbf X)\lambda^{\rm T}$, where $\lambda\in\mathbb R^s$ is a $1\times s$ vector. Then (5.7) yields that for all $\lambda$,
\[\lambda C^{\rm w}_\varphi(\theta)\lambda^{\rm T}={\rm Var}^{\rm w}_\varphi[T(\mathbf X)\lambda^{\rm T}]:={\mathbb E}_\theta\Big\{\varphi(\mathbf X)\big[T(\mathbf X)\lambda^{\rm T}-\eta(\theta)\lambda^{\rm T}\big]^2\Big\}\ \geq\ \lambda\bigg[\frac{\partial\eta(\theta)}{\partial\theta}-\big(\eta(\theta)\big)^{\rm T}\frac{\partial\alpha(\theta)}{\partial\theta}\bigg]\Big[J^{\rm w}_\varphi(\mathbf X;\theta)\Big]^{-1}\bigg[\frac{\partial\eta(\theta)}{\partial\theta}-\big(\eta(\theta)\big)^{\rm T}\frac{\partial\alpha(\theta)}{\partial\theta}\bigg]^{\rm T}\lambda^{\rm T},\]
implying (5.4). $\square$

Definition 5.2
The calibrated relative WE $K^{\rm w}_\varphi(f\,\|\,g)$ of $f$ and $g$ with WF $\varphi$ is defined by
\[K^{\rm w}_\varphi(f\,\|\,g)=\int\frac{\varphi(\mathbf x)f(\mathbf x)}{\alpha(f)}\log\frac{f(\mathbf x)\alpha(g)}{g(\mathbf x)\alpha(f)}\,{\rm d}\mathbf x=\frac{D^{\rm w}_\varphi(f\,\|\,g)}{\alpha(f)}+\log\frac{\alpha(g)}{\alpha(f)}=D(\widetilde f\,\|\,\widetilde g).\qquad(5.8)\]
Here $\widetilde f$ and $\widetilde g$ are the PDFs produced from $\varphi f$ and $\varphi g$ after normalizing by $\alpha(f)$ and $\alpha(g)$:
\[\alpha(f)=\int\varphi(\mathbf x)f(\mathbf x)\,{\rm d}\mathbf x,\quad\alpha(g)=\int\varphi(\mathbf x)g(\mathbf x)\,{\rm d}\mathbf x,\quad\widetilde f(\mathbf x)=\frac{\varphi(\mathbf x)f(\mathbf x)}{\alpha(f)},\quad\widetilde g(\mathbf x)=\frac{\varphi(\mathbf x)g(\mathbf x)}{\alpha(g)},\qquad(5.9)\]
and $D(\cdot\,\|\,\cdot)$ is the standard Kullback–Leibler divergence.

Theorem 5.3 (Weighted Kullback inequalities, cf. [10].) For given $\varphi$ and $f,g$ as above, the following bounds hold true. First, for a $1\times n$ vector $\zeta$,
\[K^{\rm w}_\varphi(f\,\|\,g)\ \geq\ \sup\Big[\frac{e_\varphi(f)\zeta^{\rm T}}{\alpha(f)}+\log\alpha(g)-\log M_g(\zeta):\ \zeta\in\mathbb R^n\Big],\qquad(5.10)\]
where
\[e_\varphi(f)=\int\varphi(\mathbf x)f(\mathbf x)\,\mathbf x\,{\rm d}\mathbf x,\qquad M_g(\zeta)=\int\varphi(\mathbf x)g(\mathbf x)\exp(\mathbf x\zeta^{\rm T})\,{\rm d}\mathbf x.\qquad(5.11)\]
Second,
\[D^{\rm w}_\varphi(f\,\|\,g)\ \geq\ \sup\Big[e_\varphi(f)\zeta^{\rm T}:\ \zeta\in\mathcal M\Big],\qquad(5.12)\]
where
\[\mathcal M=\Big\{\zeta:\ \int\varphi(\mathbf x)\big(f(\mathbf x)-g(\mathbf x)\exp(\mathbf x\zeta^{\rm T})\big)\,{\rm d}\mathbf x\ \geq\ 0\Big\}.\qquad(5.13)\]

Proof.
First, given $\zeta\in\mathbb R^n$, set $\widetilde G_\zeta(\mathbf x)=\displaystyle\frac{\varphi(\mathbf x)g(\mathbf x)\exp(\mathbf x\zeta^{\rm T})}{M_g(\zeta)}$. Following (5.11) and (5.8), obtain:
\[K^{\rm w}_\varphi(f\,\|\,g)=D(\widetilde f\,\|\,\widetilde G_\zeta)+\int\widetilde f(\mathbf x)\log\frac{\widetilde G_\zeta(\mathbf x)}{\widetilde g(\mathbf x)}\,{\rm d}\mathbf x\ \geq\ \int\widetilde f(\mathbf x)\log\frac{\alpha(g)\exp(\mathbf x\zeta^{\rm T})}{M_g(\zeta)}\,{\rm d}\mathbf x;\qquad(5.14)\]
the bound holds as $D(\widetilde f\,\|\,\widetilde G_\zeta)\geq 0$ by the Gibbs inequality for the standard Kullback–Leibler divergence. By taking the supremum, we arrive at (5.10).

Second, write:
\[G_\zeta(\mathbf x)=g(\mathbf x)\exp(\mathbf x\zeta^{\rm T})\quad\hbox{and}\quad D^{\rm w}_\varphi(f\,\|\,g)=D^{\rm w}_\varphi(f\,\|\,G_\zeta)+e_\varphi(f)\zeta^{\rm T}.\qquad(5.15)\]
For $\zeta\in\mathcal M$, the bound $D^{\rm w}_\varphi(f\,\|\,G_\zeta)\geq 0$ holds true (the weighted Gibbs inequality (1.3)). This yields (5.12). $\square$

An application of the weighted Kullback inequality is given in the next theorem, where we obtain another version of the weighted Cramér–Rao inequality.

Theorem 5.4 (A weighted Cramér–Rao inequality, version II; cf. [4], Theorem 11.10.1; [5], Theorem 20.)
Suppose we have a family of $1\times n$ random vectors $\mathbf X$, with PDFs $f_\theta(\mathbf x)$, $\mathbf x\in\mathbb R^n$, indexed by $\theta\in\mathbb R^m$. Suppose that $\displaystyle\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\to 1$ as $\varepsilon\to 0$, uniformly in $\mathbf x$. Let $\mathbf x\mapsto\varphi(\mathbf x)$ be a given WF. Denoting, as before, the expectation relative to $f_\theta$ by ${\mathbb E}_\theta$, set
\[\alpha(\theta)={\mathbb E}_\theta[\varphi(\mathbf X)],\quad e(\theta)={\mathbb E}_\theta[\varphi(\mathbf X)\mathbf X]\quad\hbox{and}\quad\widetilde C^{\rm w}_\varphi(\theta)=\frac1{\alpha(\theta)}\,{\mathbb E}_\theta\big[\varphi(\mathbf X)\mathbf X^{\rm T}\mathbf X\big]-e(\theta)^{\rm T}e(\theta).\qquad(5.16)\]
Under the assumptions needed to define the matrix $J^{\rm w}_\varphi(\mathbf X;\theta)$,
\[J^{\rm w}_\varphi(\mathbf X;\theta)\ \geq\ \Big(\frac{\partial e(\theta)}{\partial\theta}\Big)^{\rm T}\Big[\widetilde C^{\rm w}_\varphi(\theta)\Big]^{-1}\frac{\partial e(\theta)}{\partial\theta}+\alpha(\theta)^{-1}\Big(\frac{\partial\alpha(\theta)}{\partial\theta}\Big)^{\rm T}\frac{\partial\alpha(\theta)}{\partial\theta}.\qquad(5.17)\]

Proof. By definition (5.8), for $\varepsilon\in\mathbb R^m$,
\[K^{\rm w}_\varphi(f_{\theta+\varepsilon}\,\|\,f_\theta)=-\int\frac{\varphi(\mathbf x)f_{\theta+\varepsilon}(\mathbf x)}{\alpha(\theta+\varepsilon)}\log\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\,{\rm d}\mathbf x.\qquad(5.18)\]
Next, set $M(\theta,\zeta)={\mathbb E}_\theta\big\{\varphi(\mathbf X)\exp(\mathbf X\zeta^{\rm T})\big\}$ and
\[\Psi(\theta,\varepsilon)=\sup\Big[e(\theta+\varepsilon)\zeta^{\rm T}+\log\alpha(\theta)-\log M(\theta,\zeta):\ \zeta\in\mathbb R^n\Big].\qquad(5.19)\]
Then, owing to Theorem 5.3, we obtain:
\[K^{\rm w}_\varphi(f_{\theta+\varepsilon}\,\|\,f_\theta)\ \geq\ \Psi(\theta,\varepsilon).\qquad(5.20)\]
The LHS of (5.20) is, up to the factor $\alpha(\theta+\varepsilon)^{-1}$,
\[-\int\varphi(\mathbf x)f_{\theta+\varepsilon}(\mathbf x)\log\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\,{\rm d}\mathbf x=\int\varphi(\mathbf x)f_{\theta+\varepsilon}(\mathbf x)\bigg\{\Big[1-\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\Big]+\frac12\Big[1-\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\Big]^2+O\bigg(\Big[1-\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\Big]^3\bigg)\bigg\}\,{\rm d}\mathbf x.\qquad(5.21)\]
Here we have used the Taylor expansion of $\log(1+z)$. The first-order term disappears:
\[\int\varphi(\mathbf x)f_{\theta+\varepsilon}(\mathbf x)\Big[1-\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\Big]\,{\rm d}\mathbf x=\alpha(\theta+\varepsilon)-\alpha(\theta+\varepsilon)=0.\qquad(5.22)\]
Next, for small $\varepsilon$,
\[\frac12\int\varphi(\mathbf x)f_{\theta+\varepsilon}(\mathbf x)\Big[1-\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\Big]^2\,{\rm d}\mathbf x=\frac12\,\varepsilon\bigg[J^{\rm w}_\varphi(\mathbf X;\theta)-\frac1{\alpha(\theta)}\Big(\frac{\partial\alpha(\theta)}{\partial\theta}\Big)^{\rm T}\frac{\partial\alpha(\theta)}{\partial\theta}\bigg]\varepsilon^{\rm T}+o(\|\varepsilon\|^2).\qquad(5.23)\]
Finally, the remainder obeys
\[\lim_{\varepsilon\to 0}\frac1{\|\varepsilon\|^2}\int\varphi(\mathbf x)f_{\theta+\varepsilon}(\mathbf x)\,O\bigg(\Big[1-\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\Big]^3\bigg)\,{\rm d}\mathbf x=0.\qquad(5.24)\]
For the RHS in (5.20), we take the point $\tau$ where the gradient vanishes:
\[\nabla_\zeta\Big[e(\theta+\varepsilon)\zeta^{\rm T}+\log\alpha(\theta)-\log M(\theta,\zeta)\Big]\Big|_{\zeta=\tau}=0,\quad\hbox{i.e.,}\quad e(\theta+\varepsilon)=\nabla_\zeta\log M(\theta,\zeta)\Big|_{\zeta=\tau}=\frac1{M(\theta,\tau)}\nabla_\zeta M(\theta,\zeta)\Big|_{\zeta=\tau}.\]
Consider the limit
\[\lim_{\varepsilon\to 0}\frac1{\|\varepsilon\|^2}\sup_{t\in\mathbb R^n}\Big\{t\,\mu_\varphi(\theta+\varepsilon)^{\rm T}-\Psi(t)\Big\}.\qquad(5.25)\]
Here $\Psi(t)=\log M(\theta,t)-\log\alpha(\theta)$ denotes the weighted cumulant-generating function for the PDF $\widetilde f_\theta$, and $\mu_\varphi(\theta)={\mathbb E}_\theta[\mathbf X\varphi(\mathbf X)]\big/{\mathbb E}_\theta\varphi(\mathbf X)$. The supremum is attained at a value $t=\tau=\tau(\varepsilon)$ where the first derivative of the weighted cumulant-generating function equals $\nabla_t\Psi(t)\big|_{t=\tau}=\mu_\varphi(\theta+\varepsilon)$. We also have $\nabla_t\Psi(0)=\mu_\varphi(\theta)$, and therefore
\[\nabla_t\nabla_t\Psi(0)\lim_{\varepsilon\to 0}\frac{\partial\tau}{\partial\varepsilon}=\frac{\partial}{\partial\theta}\mu_\varphi(\theta).\qquad(5.26)\]
It can also be seen that the Hessian satisfies
\[\nabla_t\nabla_t\Psi(0)=\frac{{\mathbb E}_\theta\big[\mathbf X^{\rm T}\mathbf X\,\varphi(\mathbf X)\big]}{{\mathbb E}_\theta[\varphi(\mathbf X)]}-\mu_\varphi(\theta)^{\rm T}\mu_\varphi(\theta):=V_\varphi(\mathbf X;\theta).\qquad(5.27)\]
In addition, by using the Taylor formula at an intermediate point between $\theta$ and $\theta+\varepsilon$,
\[\lim_{\varepsilon\to 0}\frac1{\|\varepsilon\|^2}\Big\{\tau\,\mu_\varphi(\theta+\varepsilon)^{\rm T}-\Psi(\tau)\Big\}=\frac12\Big(\frac{\partial}{\partial\theta}\mu_\varphi(\theta)\Big)\big[\nabla_t\nabla_t\Psi(0)\big]^{-1}\Big(\frac{\partial}{\partial\theta}\mu_\varphi(\theta)\Big)^{\rm T}.\qquad(5.28)\]
Returning to the RHS of (5.25), it becomes:
\[\lim_{\varepsilon\to 0}\frac1{\|\varepsilon\|^2}\Big\{\tau\,\mu_\varphi(\theta+\varepsilon)^{\rm T}-\Psi(\tau)\Big\}=\frac12\Big(\frac{\partial}{\partial\theta}\mu_\varphi(\theta)\Big)\big[V_\varphi(\mathbf X;\theta)\big]^{-1}\Big(\frac{\partial}{\partial\theta}\mu_\varphi(\theta)\Big)^{\rm T}.\qquad(5.29)\]
Now (5.29) gives the required result (5.17). $\square$

Remark 5.5
When ϕ ( x ) ≡ then α ( θ ) = 1 , e ( θ ) = E θ X , C w ϕ ( θ ) = e C w ϕ ( θ ) , and the two inequalities (5.4) and (5.17) coincide.In general, these inequalities competing; the question which inequality is stronger is not discussed inthis paper. We also note that both inequalities (5.4) and (5.17) lack a covariant property: multiplyingWF ϕ by a constant has a different impact on the left- and righ-hand sides. Acknowledgement
YS thanks the Office of the Rector, University of São Paulo (USP), for the financial support during the academic year 2013-4. YS thanks the Math Department, Penn State University, USA, for the hospitality and support during the academic years 2014-6. IS is supported by FAPESP Grant (process number 11/51845-5) and expresses her gratitude to the IMS, University of São Paulo, Brazil, and to the Math Department, University of Denver, USA, for the warm hospitality. SYS thanks the CAPES PNPD-UFSCar Foundation for the financial support in the year 2014. SYS thanks the Federal University of São Carlos, Department of Statistics, for hospitality during the year 2014. MK thanks the Higher School of Economics for the support in the framework of the Global Competitiveness Program.
References

[1] M. Belis and S. Guiasu. A quantitative-qualitative measure of information in cybernetic systems. IEEE Trans. on Inf. Theory (1968), 593–594.
[2] A. Clim. Weighted entropy with application. Analele Universităţii Bucureşti, Matematică, Anul LVII (2008), 223–231.
[3] T. Cover and J.A. Thomas. Elements of Information Theory. New York: Wiley, 2006.
[4] T.M. Cover and J.A. Thomas. Determinant inequalities via information theory. SIAM J. Matrix Anal. and its Applicat. (1988), 384–392.
[5] A. Dembo, T.M. Cover and J.A. Thomas. Information theoretic inequalities. IEEE Trans. Inform. Theory (1991), 1501–1518.
[6] A. Di Crescenzo and M. Longobardi. Entropy-based measure of uncertainty in past lifetime distributions. J. App. Prob. (2002), no. 3, 434–440.
[7] G. Dial and I.J. Taneja. On weighted entropy of type (α, β) and its generalizations. Appl. Math. (1981), 418–425.
[8] G. Frizelle and Y.M. Suhov. An entropic measurement of queueing behaviour in a class of manufacturing operations. Proc. Royal Soc. A (2001), 1579–1601.
[9] G. Frizelle and Y.M. Suhov. The measurement of complexity in production and other commercial systems. Proc. Royal Soc. A (2008), 2649–2668.
[10] A. Fuchs and G. Letta. L'inégalité de Kullback. Application à la théorie de l'estimation. Séminaire de probabilités 4. Strasbourg (1970), 108–131.
[11] S. Guiasu. Weighted entropy. Report on Math. Physics (1971), 165–179.
[12] K. Ito. Introduction to Probability Theory. Cambridge: Cambridge University Press, 1984.
[13] D.H. Johnson and R.M. Glantz. When does interval coding occur? Neurocomputing (2004), 13–18.
[14] P.L. Kannappan and P.K. Sahoo. On the general solution of a functional equation connected to sum form information measures on open domain. Math. Sci. (1986), 545–550.
[15] J.N. Kapur. Measures of Information and Their Applications. Chapter 17. New Delhi: Wiley Eastern Limited, 1994.
[16] K. Fan. On a theorem of Weyl concerning eigenvalues of linear transformations, I. Proc. Nat. Acad. USA (1949), 652–655.
[17] K. Fan. On a theorem of Weyl concerning eigenvalues of linear transformations, II. Proc. Nat. Acad. USA (1950), 31–35.
[18] K. Fan. Maximum properties and inequalities for the eigenvalues of completely continuous operators. Proc. Nat. Acad. USA (1951), 760–766.
[19] M. Kelbert and Y. Suhov. Continuity of mutual entropy in the limiting signal-to-noise ratio regimes. In: Stochastic Analysis. Berlin: Springer-Verlag (2010), 281–299.
[20] M. Kelbert and Y. Suhov. Information Theory and Coding by Example. Cambridge: Cambridge University Press, 2013.
[21] M. Moslehian. Ky Fan inequalities. arXiv:1108.1467, 2011.
[22] K. Muandet, S. Marukatat and C. Nattee. Query selection via weighted entropy in graph-based semi-supervised classification. In: Advances in Machine Learning. Lecture Notes in Computer Science (2009), pp. 278–292.
[23] O. Parkash and H.C. Taneja. Characterization of the quantitative-qualitative measure of inaccuracy for discrete generalized probability distributions. Commun. Statist. Theory Methods (1986), 3763–3771.
[24] B.D. Sharma, J. Mitter and M. Mohan. On measure of 'useful' information. Inform. Control (1978), 323–336.
[25] R.P. Singh and J.D. Bhardwaj. On parametric weighted information improvement. Inf. Sci. (1992), 149–163.
[26] A. Sreevally and S.K. Varma. Generating measure of cross entropy by using measure of weighted entropy. Soochow Journal of Mathematics (2004), no. 2, 237–243.
[27] A. Srivastava. Some new bounds of weighted entropy measures. Cybernetics and Information Technologies (2011), no. 3, 60–65.
[28] Y. Suhov, S. Yasaei Sekeh and M. Kelbert. Entropy-power inequality for weighted entropy. arXiv:1502.02188.
[29] Y. Suhov, I. Stuhl and S. Yasaei Sekeh. Weighted Gaussian entropy and determinant inequalities. arXiv:1505.01753.
[30] Y. Suhov, I. Stuhl and M. Kelbert. Weight functions and log-optimal investment portfolios. arXiv:1505.01437.
[31] R.K. Tuteja, Sh. Chaudhary and P. Jain. Weighted entropy of orders α and type β information energy. Soochow Journal of Mathematics (1993), no. 2, 129–138.
[32] R. Zamir. A proof of the Fisher information inequality via a data processing argument. IEEE Transactions on Information Theory, No. 3 (1998), 1246–1250.

Yuri Suhov: DPMMS, University of Cambridge, UK; Math Dept, Penn State University, PA, USA; IPIT RAS, Moscow, RF
Izabella Stuhl: IMS, University of São Paulo, Brazil; Math Dept, University of Denver, CO, USA; University of Debrecen, Hungary
Salimeh Yasaei Sekeh: Stat Dept, Federal University of São Carlos, Brazil