Basic inequalities for weighted entropies
Yuri Suhov, Izabella Stuhl, Salimeh Yasaei Sekeh, Mark Kelbert
Abstract
The concept of weighted entropy takes into account values of different outcomes, i.e., makes entropy context-dependent, through the weight function. In this paper we establish a number of simple inequalities for weighted entropies (general as well as specific), mirroring similar bounds on standard (Shannon) entropies and related quantities. The required assumptions are written in terms of various expectations of the weight functions. Examples are weighted Ky Fan and weighted Hadamard inequalities involving determinants of positive-definite matrices, and weighted Cramér–Rao inequalities involving the weighted Fisher information matrix.
The definition and initial results on weighted entropy were introduced in [1, 11]. The purpose was to introduce disparity between outcomes of the same probability: in the case of a standard entropy such outcomes contribute the same amount of information/uncertainty, which is appropriate in context-free situations. However, imagine two equally rare medical conditions, occurring with probability p ≪ 1, one of which carries a major health risk while the other is just a peculiarity. Formally, they provide the same amount of information −log p, but the value of this information can be very different. The weight, or a weight function, was supposed to fulfill this task, at least to a certain extent. The initial results have been further extended and deepened in [24, 7, 14, 23, 25, 31, 15] and, more recently, in [6, 26, 2, 22, 27]. Certain applications emerged, see [8, 13], along with a number of theoretical suggestions.

The purpose of this note is to extend a number of inequalities, established previously for a standard (Shannon) entropy, to the case of the weighted entropy. We particularly mention Ky Fan- and Hadamard-type inequalities from [3, 9, 20], which are related to (standard) Gaussian entropies. Extended inequalities for weighted entropies have already found applications and further developments in [28, 29, 30]. Another kind of bounds, weighted Cramér–Rao inequalities, may be useful in statistics.

An additional motivation for studying weighted entropy (WE) is provided by the following questions. (I) What is the rate at which the WE is produced by a sample of a random process (and what could be an analog of the Shannon–McMillan–Breiman theorem)? (II) What would be an analog of Shannon's second coding theorem when an incorrect channel output causes a penalty but does not make the transmission session invalid?
Properties of the WE established in the current paper could be helpful in this line of research. One of the naturally emerging questions is about the form/structure of the weight function (WF). In this paper we focus on some simple inequalities (as suggested by the title). Our results hold for fairly general WFs.

Mathematics Subject Classification: 60A10, 60B05, 60C05
Key words and phrases: weighted entropy, weighted conditional entropy, weighted relative entropy, weighted mutual entropy, weighted Gibbs inequality, convexity, concavity, weighted Hadamard inequality, weighted Fisher information, weighted Cramér–Rao inequalities.

Let (Ω, B, P) be a standard probability space (see, e.g., [12]). We consider random variables (RVs) as (measurable) functions Ω → X, with values in a measurable space (X, M) equipped with a countably additive reference measure ν. Probability mass functions (PMFs) or probability density functions (PDFs) are denoted by the letter f with various indices and are defined relative to ν. The difference between PMFs (discrete parts of probability measures) and PDFs (continuous parts) is insignificant for most of the presentation; this will be reflected in the common acronym PM/DF. In a few cases we will address directly the probabilities P(X = i) (when X is a finite or countable set, assuming that ν(i) = 1 for all i ∈ X). On the other hand, some important facts will remain true without the assumption that ∫_X f(x) ν(dx) = 1. When we deal with a collection of RVs X_i, the space of values X_i and the reference measure ν_i may vary with i. Some of the RVs X_i may be random 1 × n vectors, viz., X^n = (X_1, . . . , X_n), with random components X_i : Ω → X_i, 1 ≤ i ≤ n.

Definition 1.1
Given a function x ∈ X ↦ ϕ(x) ≥ 0 and an RV X : Ω → X with a PM/DF f, the weighted entropy (WE) of X (or f) with weight function (WF) ϕ and reference measure ν is defined by

h^w_ϕ(X) = h^w_ϕ(f) = −E[ϕ(X) log f(X)] = −∫_X ϕ(x) f(x) log f(x) ν(dx),   (1.1)

whenever the integral ∫_X ϕ(x) f(x) (1 ∨ |log f(x)|) ν(dx) < ∞. (The standard agreement 0 · log 0 = 0 · log ∞ = 0 is adopted throughout the paper.) If f(x) ≤ 1 for all x ∈ X, then h^w_ϕ(f) is non-negative. (This is the case when ν(X) ≤ 1.) The dependence of h^w_ϕ(X) = h^w_ϕ(f) on ν is omitted.

Given two functions x ∈ X ↦ f(x) ≥ 0 and x ∈ X ↦ g(x) ≥ 0, the relative WE of g relative to f, with WF ϕ, is defined by

D^w_ϕ(f ‖ g) = ∫_X ϕ(x) f(x) log [f(x)/g(x)] ν(dx).   (1.2)

Alternatively, the quantity D^w_ϕ(f ‖ g) can be termed a weighted Kullback–Leibler divergence (of g from f) with WF ϕ. If f is a PM/DF, one can use the alternative form D^w_ϕ(f ‖ g) = E[ϕ(X) log (f(X)/g(X))]. In what follows, all WFs are assumed non-negative and positive on a set of positive f-measure.

Remark 1.2
Passing to standard entropies, an obvious formula reads

h^w_ϕ(f) = h(ϕf) + D(ϕf ‖ f) = −D(ϕf ‖ ϕ),   (1.3)

provided that one can guarantee that the integrals involved converge. However, in general neither ϕf nor ϕ is a PM/DF, which can be a nuisance. Besides, the interpretation of ϕ as a weight function in h^w_ϕ(f) makes the inequalities more transparent.

Theorem 1.3 (The weighted Gibbs inequality; cf. [4], Lemma 1, [3], Theorem 2.6.3, [5], Lemma 1, [20], Theorem 1.2.3 (c).)
Given non-negative functions f, g, assume the bound

∫_X ϕ(x) [f(x) − g(x)] ν(dx) ≥ 0.   (1.4)

Then

D^w_ϕ(f ‖ g) ≥ 0.   (1.5)

Moreover, equality in (1.5) holds iff the ratio g/f equals 1 modulo the function ϕ. In other words,

[g(x)/f(x) − 1] ϕ(x) = 0 for f-almost all x ∈ X.

Proof.
Following a standard calculation (see, e.g., [3], Theorem 2.6.3, or [20], Theorem 1.2.3 (c)) and using (1.2), we write

−D^w_ϕ(f ‖ g) = ∫_X ϕ(x) f(x) 1(f(x) > 0) log [g(x)/f(x)] ν(dx)
 ≤ ∫_X ϕ(x) f(x) 1(f(x) > 0) [g(x)/f(x) − 1] ν(dx)
 = ∫_X ϕ(x) 1(f(x) > 0) [g(x) − f(x)] ν(dx) ≤ ∫_X ϕ(x) [g(x) − f(x)] ν(dx) ≤ 0.   (1.6)

(The first inequality uses log t ≤ t − 1, the second that g ≥ 0 on the set {f = 0}, and the third is assumption (1.4).) The equality in (1.6) occurs iff ϕ · (g/f − 1) vanishes f-a.s.

Theorem 1.4 (Bounding the WE via a uniform distribution.)
Suppose an RV X takes at most m values, i.e., X = {1, . . . , m}, and set p_i = P(X = i), 1 ≤ i ≤ m. Suppose that for a given 0 < β ≤ 1,

Σ_{i=1}^m ϕ(i)(p_i − β) ≥ 0.   (1.7)

Then h^w_ϕ(X) = −Σ_{i=1}^m ϕ(i) p_i log p_i obeys

h^w_ϕ(X) ≤ −log β Σ_{i=1}^m ϕ(i) p_i, or −E[ϕ(X) log p_X] ≤ −(log β) E[ϕ(X)],   (1.8)

with equality iff for all i = 1, . . . , m, ϕ(i)(p_i − β) = 0.

In the case of a general space X, assume that for a constant β > 0 we have

∫_X ϕ(x) [f(x) − β] ν(dx) ≥ 0.   (1.9)

Then

h^w_ϕ(X) ≤ −log β ∫_X ϕ(x) f(x) ν(dx);   (1.10)

equality iff ϕ(x) [f(x) − β] = 0 for f-almost all x ∈ X.

Proof.
The proof follows directly from Theorem 1.3, with g(x) = β, x ∈ X.

Definition 1.5
Let (X_1, X_2) be a pair of RVs X_i : Ω → X_i, with a joint PM/DF f(x_1, x_2), x_i ∈ X_i, i = 1, 2, relative to the measure ν_1(dx_1) × ν_2(dx_2), and marginal PM/DFs

f_1(x_1) = ∫_{X_2} f(x_1, x_2) ν_2(dx_2), x_1 ∈ X_1,  f_2(x_2) = ∫_{X_1} f(x_1, x_2) ν_1(dx_1), x_2 ∈ X_2.

Let (x_1, x_2) ∈ X_1 × X_2 ↦ ϕ(x_1, x_2) be a given WF. We use Eqn (1.1) to define the joint WE of X_1, X_2 with WF ϕ (under an assumption of absolute convergence of the integrals involved):

h^w_ϕ(X_1, X_2) = −E[ϕ(X_1, X_2) log f(X_1, X_2)] = −∫_{X_1×X_2} ϕ(x_1, x_2) f(x_1, x_2) log f(x_1, x_2) ν_1(dx_1) ν_2(dx_2).   (1.11)

Next, the conditional
WE of X_2 given X_1 with WF ϕ is defined by

h^w_ϕ(X_2 | X_1) = −E[ϕ(X_1, X_2) log (f(X_1, X_2)/f_1(X_1))] = h^w_ϕ(X_1, X_2) − h^w_{ψ_1}(X_1)
 = −∫_{X_1×X_2} ϕ(x_1, x_2) f(x_1, x_2) log [f(x_1, x_2)/f_1(x_1)] ν_1(dx_1) ν_2(dx_2),   (1.12)

where, here and below,

ψ_1(x_1) = ∫_{X_2} ϕ(x_1, x_2) [f(x_1, x_2)/f_1(x_1)] ν_2(dx_2).

Further, the mutual
WE between X_1 and X_2 is defined by

i^w_ϕ(X_1 : X_2) = D^w_ϕ(f ‖ f_1 ⊗ f_2) = E[ϕ(X_1, X_2) log (f(X_1, X_2)/(f_1(X_1) f_2(X_2)))]
 = ∫_{X_1×X_2} ϕ(x_1, x_2) f(x_1, x_2) log [f(x_1, x_2)/(f_1(x_1) f_2(x_2))] ν_1(dx_1) ν_2(dx_2).   (1.13)

We will use the notation X_i^k = (X_i, . . . , X_k) and x_i^k = (x_i, . . . , x_k), 1 ≤ i < k ≤ n, for collections of RVs and their sample values (particularly for pairs and triples of RVs), allowing us to shorten equations throughout the paper. In addition, we employ the Cartesian products X_i^k = X_i × . . . × X_k and product measures ν_i^k(dx_i^k) = ν_i(dx_i) × . . . × ν_k(dx_k). Given a random 1 × n vector X^n with a PM/DF f, we denote by f_i, f_{ij} and f_{ijk} the PM/DFs for the component X_i, the pair X_{ij} = (X_i, X_j) and the triple X_{ijk} = (X_i, X_j, X_k), respectively. The arguments of f_i, f_{ij} and f_{ijk} are written as x_i ∈ X_i, x_{ij} = (x_i, x_j) ∈ X_{ij} = X_i × X_j and x_{ijk} = (x_i, x_j, x_k) ∈ X_{ijk} = X_i × X_j × X_k. Next, the symbols f_{i|j}, f_{ij|k} and f_{i|jk} are used for conditional PM/DFs:

f_{i|j}(x_i | x_j) = f_{ij}(x_{ij})/f_j(x_j),  f_{ij|k}(x_{ij} | x_k) = f_{ijk}(x_{ijk})/f_k(x_k),  f_{i|jk}(x_i | x_{jk}) = f_{ijk}(x_{ijk})/f_{jk}(x_{jk}).

For a pair of RVs X_{12}, set

ψ_1(x_1) = ∫_{X_2} ϕ(x_1, x_2) f_{2|1}(x_2 | x_1) ν_2(dx_2), x_1 ∈ X_1;   (1.14)

the quantity ψ_2(x_2), x_2 ∈ X_2, is defined in a similar (symmetric) fashion.

Next, given a triple of RVs X_{123} with a joint PM/DF f(x_{123}), set:

ψ_{12}(x_{12}) = ∫_{X_3} ϕ(x_{123}) f_{3|12}(x_3 | x_{12}) ν_3(dx_3) = E[ϕ(X_{123}) | X_{12} = x_{12}], x_{12} ∈ X_{12},
ψ_1(x_1) = ∫_{X_{23}} ϕ(x_{123}) f_{23|1}(x_{23} | x_1) ν_2(dx_2) ν_3(dx_3) = E[ϕ(X_{123}) | X_1 = x_1], x_1 ∈ X_1,   (1.15)

and define the functions ψ_{ij} and ψ_i for distinct labels 1 ≤ i, j, k ≤ 3 in a similar manner.

Lemma 1.6 (Bounds on conditional WE, I.) Let X_{12} be a pair of RVs with a joint PM/DF f(x_{12}). Suppose that a WF x_{12} ∈ X_{12} ↦ ϕ(x_{12}) obeys

E[ϕ(X_{12}) (f_{2|1}(X_2 | X_1) − 1)] = ∫_{X_{12}} ϕ(x_{12}) f(x_{12}) [f_{2|1}(x_2 | x_1) − 1] ν_1(dx_1) ν_2(dx_2) ≤ 0.
(1.16)

Then

h^w_ϕ(X_{12}) ≥ h^w_{ψ_1}(X_1), or, equivalently, h^w_ϕ(X_2 | X_1) ≥ 0,   (1.17)

with equality iff ϕ(x_{12}) [f_{2|1}(x_2 | x_1) − 1] = 0 for f-almost all x_{12} ∈ X_{12}.

Proof.
The statement is derived similarly to Theorem 1.3:

∫_{X_{12}} ϕ(x_{12}) f(x_{12}) log f_{2|1}(x_2 | x_1) ν_1(dx_1) ν_2(dx_2) ≤ ∫_{X_{12}} ϕ(x_{12}) f(x_{12}) [f_{2|1}(x_2 | x_1) − 1] ν_1(dx_1) ν_2(dx_2).

The argument is concluded as in (1.6). The cases of equality also follow.
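To make the definitions concrete, here is a minimal numerical sketch in Python (not part of the original paper): a pair (X_1, X_2) on {0, 1}², under the counting measure, with arbitrary illustrative values for the joint PMF f and the WF ϕ. Since f_{2|1} ≤ 1 in the discrete case, hypothesis (1.16) holds automatically for any ϕ ≥ 0, so the conditional WE of Lemma 1.6 is non-negative.

```python
import math

# Joint PMF f(x1, x2) on {0,1}^2 and a WF phi(x1, x2); nu is the counting
# measure. The PMF and WF values are arbitrary illustrative choices.
f = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
phi = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 1.5}

f1 = {a: f[(a, 0)] + f[(a, 1)] for a in (0, 1)}   # marginal of X1
f2 = {b: f[(0, b)] + f[(1, b)] for b in (0, 1)}   # marginal of X2

# joint WE, Eqn (1.11)
h_joint = -sum(phi[k] * v * math.log(v) for k, v in f.items())

# reduced WFs psi1(x1) = E[phi | X1 = x1], psi2(x2) = E[phi | X2 = x2], (1.14)
psi1 = {a: sum(phi[(a, b)] * f[(a, b)] for b in (0, 1)) / f1[a] for a in (0, 1)}
psi2 = {b: sum(phi[(a, b)] * f[(a, b)] for a in (0, 1)) / f2[b] for b in (0, 1)}

h_psi1 = -sum(psi1[a] * f1[a] * math.log(f1[a]) for a in (0, 1))
h_psi2 = -sum(psi2[b] * f2[b] * math.log(f2[b]) for b in (0, 1))

h_cond = h_joint - h_psi1   # h^w_phi(X2 | X1), Eqn (1.12); >= 0 by Lemma 1.6

# mutual WE, Eqn (1.13)
i_mut = sum(phi[k] * v * math.log(v / (f1[k[0]] * f2[k[1]])) for k, v in f.items())
print(h_cond, i_mut)
```

On the same data one can also confirm the chain-rule identity h^w_ϕ(X_{12}) = h^w_{ψ_1}(X_1) + h^w_ϕ(X_2 | X_1) and the representation i^w_ϕ(X_1 : X_2) = h^w_{ψ_1}(X_1) + h^w_{ψ_2}(X_2) − h^w_ϕ(X_{12}) used in (1.22) below.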
Remark 1.7
In particular, suppose that X_2 takes finitely or countably many values and ν_2 is a counting measure with ν_2(i) = 1, i ∈ X_2. Then the value f_{2|1}(x_2 | x_1) yields the conditional probability P(X_2 = x_2 | X_1 = x_1), which is ≤ 1 for f-almost all x_{12} ∈ X_{12}. Then h^w_ϕ(X_2 | X_1) ≥ 0, and the bound is strict unless, modulo ϕ, the RV X_2 is a function of X_1. That is, there exists a map υ : X_1 → X_2 such that [x_2 − υ(x_1)] ϕ(x_{12}) = 0 for f-almost every x_{12} ∈ X_{12}.

For future use, we can consider a triple of RVs, X_{123}, and a pair, X_{12}, and assume that

E[ϕ(X_{123}) (f_{3|12}(X_3 | X_{12}) − 1)] = ∫_{X_{123}} ϕ(x_{123}) f(x_{123}) [f_{3|12}(x_3 | x_{12}) − 1] ν_{123}(dx_{123}) ≤ 0.   (1.18)

Then

h^w_ϕ(X_{123}) ≥ h^w_{ψ_{12}}(X_{12}), or, equivalently, h^w_ϕ(X_3 | X_{12}) ≥ 0,   (1.19)

with equality iff ϕ(x_{123}) [f_{3|12}(x_3 | x_{12}) − 1] = 0 for f-almost all x_{123} ∈ X_{123}.

Theorem 1.8 (Sub-additivity of the WE.)
Let X_{12} = (X_1, X_2) be a pair of RVs with a joint PM/DF f(x_{12}) and marginals f_1(x_1), f_2(x_2), where x_{12} ∈ X_{12}. Suppose that a WF x_{12} ∈ X_{12} ↦ ϕ(x_{12}) obeys

E ϕ(X_{12}) − E ϕ(X_{12}^⊗) = ∫_{X_{12}} ϕ(x_{12}) [f(x_{12}) − f_1(x_1) f_2(x_2)] ν_1(dx_1) ν_2(dx_2) ≥ 0.   (1.20)

Here X_{12}^⊗ stands for the pair of independent RVs having the same marginal distributions as X_1, X_2. (The joint PDF for X_{12}^⊗ is the product f_1(x_1) f_2(x_2).) Then

h^w_ϕ(X_{12}) ≤ h^w_{ψ_1}(X_1) + h^w_{ψ_2}(X_2), or, equivalently, h^w_ϕ(X_2 | X_1) ≤ h^w_{ψ_2}(X_2), or, equivalently, i^w_ϕ(X_1 : X_2) ≥ 0.   (1.21)

The equalities hold iff X_1, X_2 are independent modulo ϕ, i.e.,

ϕ(x_{12}) [1 − f_1(x_1) f_2(x_2)/f(x_{12})] = 0 for f-almost all x_{12} ∈ X_{12}.

Proof. The subsequent argument works for the proof of Theorem 1.10 as well. Set (f_1 ⊗ f_2)(x_1, x_2) = f_1(x_1) f_2(x_2). According to (1.2), (1.11)–(1.13), and owing to Theorem 1.3 and Lemma 1.6,

0 ≥ −D^w_ϕ(f ‖ f_1 ⊗ f_2) = ∫_{X_{12}} ϕ(x_{12}) f(x_{12}) log [f_1(x_1) f_2(x_2)/f(x_{12})] ν_1(dx_1) ν_2(dx_2)
 = h^w_ϕ(X_1, X_2) − h^w_{ψ_1}(X_1) − h^w_{ψ_2}(X_2) = h^w_ϕ(X_2 | X_1) − h^w_{ψ_2}(X_2) = −i^w_ϕ(X_1 : X_2).   (1.22)

This yields the inequalities in (1.21). The cases of equality are also identified from Theorem 1.3.

Note that if in (1.20) we use the function ψ_{12}(x_{12}) emerging from the triple X_{123}, the assumption becomes

E ϕ(X_{123}) − E ϕ(X_{12}^⊗ → X_3) = ∫_{X_{123}} ϕ(x_{123}) [f_{12}(x_{12}) − f_1(x_1) f_2(x_2)] f_{3|12}(x_3 | x_{12}) ν_{123}(dx_{123}) ≥ 0,   (1.23)

and the conclusion

h^w_{ψ_{12}}(X_2 | X_1) ≤ h^w_{ψ_2}(X_2).   (1.24)

Here X_{12}^⊗ → X_3 denotes the triple of RVs where X_1 and X_2 have been made independent, keeping intact their marginal distributions, and X_3 has the same conditional PM/DF f_{3|12} as within the original triple X_{123}.

Lemma 1.9 (Bounds on conditional WE, II.)
Let X_{123} be a triple of RVs with a joint PM/DF f(x_{123}). Given a WF x_{123} ↦ ϕ(x_{123}), assume that

E[ϕ(X_{123}) (f_{3|12}(X_3 | X_{12}) − 1)] = ∫_{X_{123}} ϕ(x_{123}) f(x_{123}) [f_{3|12}(x_3 | x_{12}) − 1] ν_{123}(dx_{123}) ≤ 0.   (1.25)

Then

h^w_{ψ_{12}}(X_2 | X_1) ≤ h^w_ϕ(X_{23} | X_1);   (1.26)

equality iff ϕ(x_{123}) [f_{3|12}(x_3 | x_{12}) − 1] = 0 for f-almost all x_{123} ∈ X_{123}.

As in Remark 1.7, assume X_3 takes finitely or countably many values and ν_3(i) = 1, i ∈ X_3. Then the value f_{3|12}(x_3 | x_{12}) yields the conditional probability P(X_3 = x_3 | X_{12} = x_{12}), for f-almost all x_{123} ∈ X_{123}. Then h^w_ϕ(X_{23} | X_1) ≥ h^w_{ψ_{12}}(X_2 | X_1), with equality iff, modulo ϕ, the RV X_3 is a function of X_{12}.

Proof.
Observe that h^w_ϕ(X_{23} | X_1) = h^w_ϕ(X_{123}) − h^w_{ψ_1}(X_1) and h^w_{ψ_{12}}(X_2 | X_1) = h^w_{ψ_{12}}(X_{12}) − h^w_{ψ_1}(X_1), so that we need to prove that h^w_ϕ(X_{123}) ≥ h^w_{ψ_{12}}(X_{12}). The proof follows that of Lemma 1.6, with obvious modifications.

Of course, if we swap labels 2 and 3 in (1.25), assuming that

E ϕ(X_{123}) [f_{2|13}(X_2 | X_{13}) − 1] = ∫_{X_{123}} ϕ(x_{123}) f(x_{123}) [f_{2|13}(x_2 | x_{13}) − 1] ν_{123}(dx_{123}) ≤ 0,   (1.27)

we get h^w_{ψ_{13}}(X_3 | X_1) ≤ h^w_ϕ(X_{23} | X_1), with equality iff ϕ(x_{123}) [f_{2|13}(x_2 | x_{13}) − 1] = 0 for f-almost all x_{123} ∈ X_{123}.

Theorem 1.10 (Sub-additivity of the conditional WE.) Let X_{123} be a triple of RVs, with a joint PM/DF f. Given a WF x_{123} ↦ ϕ(x_{123}), assume the following bound:

E ϕ(X_{123}) − E ϕ(X_1 → X_{23}^⊗) = ∫_{X_{123}} ϕ(x_{123}) [f(x_{123}) − f_1(x_1) Π_{i=2,3} f_{i|1}(x_i | x_1)] ν_{123}(dx_{123}) ≥ 0.   (1.28)

Here X_1 → X_{23}^⊗ stands for the triple of RVs where X_1 keeps its distribution as within the triple X_{123}, whereas X_2 and X_3 have been made conditionally independent given X_1, with the same marginal conditional PDFs f_{2|1} and f_{3|1} as in X_{123}. Then

h^w_ϕ(X_{23} | X_1) ≤ h^w_{ψ_{12}}(X_2 | X_1) + h^w_{ψ_{13}}(X_3 | X_1),   (1.29)

with equality iff, modulo ϕ, the RVs X_2 and X_3 are conditionally independent given X_1. That is:

ϕ(x_{123}) [f(x_{123}) − f_1(x_1) f_{2|1}(x_2 | x_1) f_{3|1}(x_3 | x_1)] = 0 for f-almost all x_{123} ∈ X_{123}.

Proof.
The proof is based on the identity (1.30):

h^w_ϕ(X_{23} | X_1) − h^w_{ψ_{12}}(X_2 | X_1) − h^w_{ψ_{13}}(X_3 | X_1)
 = ∫_{X_{123}} ϕ(x_{123}) f(x_{123}) log [f_{23|1}(x_{23} | x_1) / (f_{2|1}(x_2 | x_1) f_{3|1}(x_3 | x_1))] ν_{123}(dx_{123})
 = ∫_{X_{123}} ϕ(x_{123}) f(x_{123}) log [f(x_{123}) f_1(x_1) / (f_{12}(x_{12}) f_{13}(x_{13}))] ν_{123}(dx_{123}).   (1.30)

After that we apply the same argument as in (1.22).

Lemma 1.11 (Bounds on conditional WE, III.)
For a triple of RVs X_{123} with a joint PM/DF f(x_{123}) and a WF x_{123} ↦ ϕ(x_{123}), assume the bound as in (1.28). Then

h^w_ϕ(X_3 | X_{12}) ≤ h^w_{ψ_{13}}(X_3 | X_1);   (1.31)

equality iff X_2 and X_3 are conditionally independent given X_1, modulo ϕ.

Proof.
Write (1.31) as

h^w_{ψ_{13}}(X_{13}) − h^w_{ψ_1}(X_1) ≥ h^w_ϕ(X_{123}) − h^w_{ψ_{12}}(X_{12}),

and then pass to the equivalent form

h^w_ϕ(X_{23} | X_1) ≤ h^w_{ψ_{12}}(X_2 | X_1) + h^w_{ψ_{13}}(X_3 | X_1),

which is exactly (1.29).

Summarizing, we have an array of inequalities (1.32) for h^w_ϕ(X_{23} | X_1) and its upper bounds, each requiring its own assumption (and with its own case for equality):

by Lemma 1.6: 0 ≤ h^w_ϕ(X_{23} | X_1), assuming a modified form of (1.16) (cf. (1.18));
by Lemma 1.11: h^w_ϕ(X_3 | X_{12}) ≤ h^w_{ψ_{13}}(X_3 | X_1), assuming (1.28);
by Theorem 1.8: h^w_{ψ_{12}}(X_2 | X_1) ≤ h^w_{ψ_2}(X_2), assuming (1.23);
by Lemma 1.9: h^w_{ψ_{13}}(X_3 | X_1) ≤ h^w_ϕ(X_{23} | X_1), assuming (1.27);
by Theorem 1.10: h^w_ϕ(X_{23} | X_1) ≤ h^w_{ψ_{12}}(X_2 | X_1) + h^w_{ψ_{13}}(X_3 | X_1), assuming (1.28).   (1.32)

It is worth noting that the assumptions listed in Eqn (1.32) express an impact on the total expected weight when we perform various manipulations with the RVs forming the pair or triple under consideration.

Theorem 1.12 (Strong sub-additivity of the WE.) Given a triple of RVs X_{123}, assume that bound (1.28) is fulfilled. Then

h^w_ϕ(X_{123}) + h^w_{ψ_1}(X_1) ≤ h^w_{ψ_{12}}(X_{12}) + h^w_{ψ_{13}}(X_{13}).   (1.33)

The equality in (1.33) holds iff, modulo ϕ, X_2 and X_3 are conditionally independent given X_1.

Proof.
Write the inequality in Eqn (1.33) in the equivalent form

h^w_ϕ(X_{123}) − h^w_{ψ_1}(X_1) ≤ [h^w_{ψ_{12}}(X_{12}) − h^w_{ψ_1}(X_1)] + [h^w_{ψ_{13}}(X_{13}) − h^w_{ψ_1}(X_1)].   (1.34)

The LHS in (1.34) equals h^w_ϕ(X_{23} | X_1) while the RHS yields h^w_{ψ_{12}}(X_2 | X_1) + h^w_{ψ_{13}}(X_3 | X_1). The inequality then follows from Theorem 1.10.

Theorem 2.1 (Concavity of the WE; cf. [3], Theorem 2.7.3.)
The functional f ↦ h^w_ϕ(f) is concave in the argument f. Namely, for given PM/DFs f_1(x), f_2(x), a non-negative function x ∈ X ↦ ϕ(x), and λ_1, λ_2 ∈ [0, 1] with λ_1 + λ_2 = 1,

h^w_ϕ(λ_1 f_1 + λ_2 f_2) ≥ λ_1 h^w_ϕ(f_1) + λ_2 h^w_ϕ(f_2).   (2.1)

The inequality in (2.1) is strict unless one of the values λ_1, λ_2 vanishes (and the other equals 1) or when f_1 and f_2 coincide modulo ϕ, that is, ϕ(x) [f_1(x) − f_2(x)] = 0 for (λ_1 f_1 + λ_2 f_2)-almost all x ∈ X.

Proof.
Let X_1, X_2 : Ω → X be RVs with PM/DFs f_1 and f_2, respectively. Consider a binary RV Θ with

Θ = 1 with probability λ_1 and Θ = 2 with probability λ_2.   (2.2)

Setting Z = X_Θ yields an RV Z with values in X and with PM/DF f = λ_1 f_1 + λ_2 f_2. Thus, h^w_ϕ(Z) = h^w_ϕ(λ_1 f_1 + λ_2 f_2).

On the other hand, take the conditional WE h^w_ϕ̃(Z | Θ) with the WF ϕ̃(z, θ) = ϕ(z) depending on the first argument z ∈ X and not on the value θ = 1, 2 of the RV Θ. Then the WF ψ(z) = E[ϕ̃(Z, Θ) | Z = z] coincides with ϕ(z). It means that condition (1.20) holds true for the pair of RVs (Z, Θ). According to Theorem 1.8 (cf. Eqn (1.21)),

h^w_ϕ̃(Z | Θ) ≤ h^w_ϕ(Z),

with equality iff Z and Θ are independent modulo ϕ. The latter holds when the product λ_1 λ_2 = 0 or when f_1 = f_2 modulo ϕ. Now,

h^w_ϕ̃(Z | Θ) = −Σ_{θ=1,2} λ_θ ∫_X ϕ(z) f_θ(z) log f_θ(z) ν(dz) = λ_1 h^w_ϕ(f_1) + λ_2 h^w_ϕ(f_2).

This completes the proof.

Theorem 2.2 (a) (Convexity of relative WE; cf. [3], Theorem 2.7.2.)
Consider two pairs of non-negative functions, (f_1, g_1) and (f_2, g_2), on X. Given a WF x ∈ X ↦ ϕ(x) and λ_1, λ_2 ∈ (0, 1) with λ_1 + λ_2 = 1, the following property is satisfied:

λ_1 D^w_ϕ(f_1 ‖ g_1) + λ_2 D^w_ϕ(f_2 ‖ g_2) ≥ D^w_ϕ(λ_1 f_1 + λ_2 f_2 ‖ λ_1 g_1 + λ_2 g_2),   (2.3)

with equality iff λ_1 λ_2 = 0 or f_1 = f_2 and g_1 = g_2 modulo ϕ.

(b) (Data-processing inequality for relative WE; cf. [3], Theorem 2.8.1.) Let (f, g) be a pair of non-negative functions and ϕ a WF on X. Let Π = (Π(x, y), x, y ∈ X) be a stochastic kernel. (That is, for all x, y ∈ X, Π(x, y) ≥ 0 and ∫_X Π(x, y) ν(dy) = 1; in other words, Π(x, y) is a transition function of a Markov chain.) Set

Ψ(u) = ∫_X ϕ(x) Π(u, x) ν(dx).

Then

D^w_Ψ(f ‖ g) ≥ D^w_ϕ(fΠ ‖ gΠ),   (2.4)

where

(fΠ)(x) = ∫_X f(u) Π(u, x) ν(du) and (gΠ)(x) = ∫_X g(u) Π(u, x) ν(du).

The equality occurs iff fΠ = f and gΠ = g.

Proof. (a) The log-sum inequality yields, for x ∈ X,

λ_1 ϕ(x) f_1(x) log [λ_1 f_1(x)/(λ_1 g_1(x))] + λ_2 ϕ(x) f_2(x) log [λ_2 f_2(x)/(λ_2 g_2(x))]
 ≥ (λ_1 ϕ(x) f_1(x) + λ_2 ϕ(x) f_2(x)) log [(λ_1 f_1(x) + λ_2 f_2(x))/(λ_1 g_1(x) + λ_2 g_2(x))].   (2.5)

Integrating in ν(dx) yields the asserted inequality (2.3). The cases of equality emerge from the log-sum equality cases.

(b) Again, a straightforward application of the log-sum inequality gives the result.

Theorem 2.3
Let X_{123} be a triple of RVs with joint PM/DF f(x_{123}). Let x_{123} ∈ X_{123} ↦ ϕ(x_{123}) be a WF such that X_1 and X_3 are conditionally independent given X_2, modulo ϕ. (This property can be referred to as a Markov property modulo ϕ.)

(a) (Data-processing inequality for conditional WE.) Assume inequality (2.6) (which is (1.28) with X_2 and X_3 swapped):

∫_{X_{123}} ϕ(x_{123}) [f(x_{123}) − f_1(x_1) Π_{i=2,3} f_{i|1}(x_i | x_1)] ν_{123}(dx_{123}) ≥ 0.   (2.6)

Then the conditional WEs satisfy property (2.7):

h^w_{ψ_{23}}(X_3 | X_2) ≤ h^w_{ψ_{13}}(X_3 | X_1),   (2.7)

with equality iff, modulo ϕ, the RVs X_2 and X_3 are conditionally independent given X_1. Furthermore, assume in addition that the bound (2.8) holds true:

∫_{X_{123}} ϕ(x_{123}) f(x_{123}) [f_{2|13}(x_2 | x_{13}) − 1] ν_{123}(dx_{123}) ≤ 0   (2.8)

(which becomes (1.25) after a cyclic substitution X_1 → X_2 → X_3 → X_1), and suppose that h^w_{ψ_{23}}(X_3 | X_2) = h^w_{ψ_{12}}(X_2 | X_1) (a stationarity-type property). Then

h^w_{ψ_{13}}(X_3 | X_1) ≤ 2 h^w_{ψ_{12}}(X_2 | X_1).   (2.9)

(b) (Data-processing inequality for mutual WE; cf. [3], Theorem 2.8.1.) Assume inequality (2.10):

∫_{X_{123}} ϕ(x_{123}) [f(x_{123}) − f_1(x_1) Π_{i=2,3} f_{i|1}(x_i | x_1)] ν_{123}(dx_{123}) ≥ 0   (2.10)

(similar to (1.28), with X_2 and X_3 swapped). Then

i^w_{ψ_{13}}(X_1 : X_3) ≤ i^w_{ψ_{23}}(X_2 : X_3).   (2.11)

Here, equality in (2.11) holds iff, modulo ϕ, the RVs X_2 and X_3 are conditionally independent given X_1.

Proof. (a) Following the argument in Lemma 1.11, we observe that h^w_ϕ(X_3 | X_{12}) ≤ h^w_{ψ_{13}}(X_3 | X_1). On the other hand, owing to conditional independence,

h^w_ϕ(X_3 | X_{12}) = h^w_{ψ_{23}}(X_3 | X_2).   (2.12)

This yields the inequality in (2.7); for equality we need that, modulo ϕ, the RVs X_2 and X_3 are conditionally independent given X_1. Together with the conditional independence of X_1 and X_3 given X_2, it implies that for i = 1, 2, the conditional PM/DF f_{3|i} does not depend on i.

Next, using Lemma 1.9, we can write

h^w_{ψ_{13}}(X_3 | X_1) ≤ h^w_ϕ(X_{23} | X_1) = h^w_ϕ(X_3 | X_{12}) + h^w_{ψ_{12}}(X_2 | X_1).   (2.13)

Applying (2.12) yields the following assertion:

h^w_{ψ_{13}}(X_3 | X_1) ≤ h^w_{ψ_{23}}(X_3 | X_2) + h^w_{ψ_{12}}(X_2 | X_1).   (2.14)

Now, the assumption that h^w_{ψ_{23}}(X_3 | X_2) = h^w_{ψ_{12}}(X_2 | X_1) implies (2.9).
The cases of equality follow from Lemmas 1.11 and 1.9.

(b) As before, we use Lemma 1.11 and Eqn (2.12) (implied by conditional independence):

h^w_{ψ_{23}}(X_3 | X_2) = h^w_ϕ(X_3 | X_{12}) ≤ h^w_{ψ_{13}}(X_3 | X_1).

Consequently,

i^w_{ψ_{23}}(X_2 : X_3) = h^w_{ψ_3}(X_3) − h^w_{ψ_{23}}(X_3 | X_2) ≥ h^w_{ψ_3}(X_3) − h^w_{ψ_{13}}(X_3 | X_1) = i^w_{ψ_{13}}(X_1 : X_3),

with the case of equality also determined from Lemma 1.9.

Theorem 2.4 (Cf. [3], Theorem 2.7.4.) Let X_{12} be a pair of RVs with joint PM/DF f(x_{12}) = f_1(x_1) f_{2|1}(x_2 | x_1) = f_2(x_2) f_{1|2}(x_1 | x_2).

(I) The mutual WE i^w_ϕ(X_1 : X_2) is convex in f_{2|1}(x_2 | x_1) for fixed f_1(x_1).

(II) Suppose that the WF ϕ(x_1, x_2) depends only on x_1: ϕ(x_1, x_2) = ϕ_1(x_1). Then i^w_ϕ(X_1 : X_2) is a concave function of f_1(x_1) for fixed f_{2|1}(x_2 | x_1).

Proof. (I) For a fixed f_1, take two conditional PM/DFs, f^{(1)}_{2|1}(x_2 | x_1) and f^{(2)}_{2|1}(x_2 | x_1), and set

f̃_{2|1}(x_2 | x_1) = λ_1 f^{(1)}_{2|1}(x_2 | x_1) + λ_2 f^{(2)}_{2|1}(x_2 | x_1)

and

f̃(x_{12}) = f_1(x_1) f̃_{2|1}(x_2 | x_1) = λ_1 f^{(1)}(x_{12}) + λ_2 f^{(2)}(x_{12}), where f^{(j)}(x_{12}) = f_1(x_1) f^{(j)}_{2|1}(x_2 | x_1), j = 1, 2.

Also set:

f̃_2(x_2) = ∫_{X_1} f̃(x_{12}) ν_1(dx_1) and f^{(j)}_2(x_2) = ∫_{X_1} f^{(j)}(x_{12}) ν_1(dx_1),

and

g̃(x_{12}) = f_1(x_1) f̃_2(x_2), and g^{(j)}(x_{12}) = f_1(x_1) f^{(j)}_2(x_2), j = 1, 2.

Next, the mutual WE i^w_ϕ(X_1 : X_2) for the joint PM/DFs f̃(x_{12}) and f^{(j)}(x_{12}) is given, respectively, by the relative WEs D^w_ϕ(f̃ ‖ g̃) and D^w_ϕ(f^{(j)} ‖ g^{(j)}), j = 1, 2. Now assertion (I) follows from Theorem 2.2 (a).

(II) Under the condition of the theorem, the reduced WF does not depend on the choice of the PM/DF f_1:

ψ_1(x_1) = ∫_{X_2} ϕ(x_1, x_2) f_{2|1}(x_2 | x_1) ν_2(dx_2) = ϕ_1(x_1).

Next, write

i^w_ϕ(X_1 : X_2) = h^w_{ψ_2}(X_2) − h^w_ϕ(X_2 | X_1) = h^w_{ψ_2}(X_2) − ∫_{X_1} f_1(x_1) h^w_ϕ(X_2 | X_1 = x_1) ν_1(dx_1),

where

h^w_ϕ(X_2 | X_1 = x_1) = −∫_{X_2} ϕ(x_1, x_2) f_{2|1}(x_2 | x_1) log f_{2|1}(x_2 | x_1) ν_2(dx_2).
Owing to Theorem 2.1, for a fixed WF and a fixed conditional PM/DF f_{2|1}(x_2 | x_1), the first term is concave in f_1. The subtracted term is linear in f_1. This completes the proof of statement (II).

Theorem 2.5 (The weighted Fano inequality; cf. [3], Theorem 2.10.1, [20], Theorem 1.2.8.)
Suppose an RV X takes a value x* ∈ X with probability p* = P(X = x*) < 1 (i.e., p* = f(x*) ν({x*})). Given a WF x ∈ X ↦ ϕ(x), assume that

∫_{X\{x*}} ϕ(x) [f(x) − (1 − p*)/ν(X \ {x*})] ν(dx) ≥ 0.   (2.15)

Then

h^w_ϕ(X) ≤ −ϕ(x*) p* log p* + ϕ* log [ν(X \ {x*})/(1 − p*)].   (2.16)

Here ϕ* = ∫_X ϕ(x) f(x) ν(dx) − ϕ(x*) p*.

The equality in (2.16) is achieved iff ϕ(x) [f(x) − (1 − p*)/ν(X \ {x*})] = 0 for f-almost all x ∈ X \ {x*}, i.e., iff the RV X is (conditionally) uniform on X \ {x*} modulo ϕ.

Proof. We write

h^w_ϕ(X) = −ϕ(x*) p* log p* − ∫_{X\{x*}} ϕ(x) f(x) log f(x) ν(dx)
 = −ϕ(x*) p* log p* − log(1 − p*) ∫_{X\{x*}} ϕ(x) f(x) ν(dx) − (1 − p*) ∫_{X\{x*}} ϕ(x) [f(x)/(1 − p*)] log [f(x)/(1 − p*)] ν(dx).   (2.17)

Theorem 1.4, with β = 1/ν(X \ {x*}), yields that the last term in Eqn (2.17) is upper-bounded by ϕ* log ν(X \ {x*}). This leads to (2.16).

Theorem 2.6 (The weighted generalized Fano inequality; cf. [20], Theorem 1.2.11.)
Let X_i : Ω → X_i, i = 1, 2, be a pair of RVs. Suppose that X_1 takes exactly m values 1, . . . , m (that is, X_1 = {1, . . . , m}) while X_2 takes values 1, . . . , m and possibly other values (that is, X_2 ⊇ {1, . . . , m}), and set ε_j = P(X_2 ≠ j | X_1 = j). Let a WF (x_1, x_2) ∈ X_{12} ↦ ϕ(x_1, x_2) be given such that for all j = 1, . . . , m,

∫_{X_2\{j}} ϕ(j, x_2) [f_{2|1}(x_2 | j) − ε_j/ν_2(X_2 \ {j})] ν_2(dx_2) ≥ 0.   (2.18)

Then

h^w_ϕ(X_2 | X_1) ≤ Σ_{1≤j≤m} P(X_1 = j) [−ϕ*_j(0)(1 − ε_j) log(1 − ε_j) + ϕ*_j(1) log (ν_2(X_2 \ {j})/ε_j)].   (2.19)

Here the RV X*_j takes two values, say 0 and 1, with P(X*_j = 0) = 1 − ε_j = 1 − P(X*_j = 1), and the WF ϕ*_j has

ϕ*_j(0) = ϕ(j, j) and ϕ*_j(1) = ∫_{X_2\{j}} ϕ(j, x_2) f_{2|1}(x_2 | j) ν_2(dx_2).   (2.20)

Proof.
By the definition of the conditional WE, the weighted Fano inequality, Theorem 1.4, and with definitions (2.20) at hand, we obtain that

h^w_ϕ(X_2 | X_1) ≤ Σ_j P(X_1 = j) [−ϕ(j, j)(1 − ε_j) log(1 − ε_j) + ∫_{X_2\{j}} ϕ(j, x_2) f_{2|1}(x_2 | j) ν_2(dx_2) · log (ν_2(X_2 \ {j})/ε_j)].

This yields inequality (2.19).
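As a numerical sanity check (a sketch, not from the paper), the weighted Fano bound (2.16) can be tested on a 5-point alphabet under the counting measure. The PMF p and the WF phi below are arbitrary illustrative choices; the code first verifies hypothesis (2.15) and then compares h^w_ϕ(X) with the right-hand side of (2.16).

```python
import math

m, pstar = 5, 0.6
p = [pstar, 0.2, 0.1, 0.06, 0.04]     # p[0] = P(X = x*); the rest is uneven
phi = [1.0, 1.5, 1.2, 0.8, 0.5]       # illustrative WF values

# hypothesis (2.15): here nu(X \ {x*}) = m - 1, so the uniform level is
# (1 - p*) / (m - 1)
hyp = sum(w * (q - (1 - pstar) / (m - 1)) for w, q in zip(phi[1:], p[1:]))

# weighted entropy h^w_phi(X), Eqn (1.1) with counting measure
h = -sum(w * q * math.log(q) for w, q in zip(phi, p))

# phi* = sum phi(x) f(x) over X \ {x*}; right-hand side of (2.16)
phi_star = sum(w * q for w, q in zip(phi[1:], p[1:]))
bound = -phi[0] * pstar * math.log(pstar) + phi_star * math.log((m - 1) / (1 - pstar))
print(hyp, h, bound)
```

With these values the hypothesis holds and the bound is not tight, as expected, since the conditional law on X \ {x*} is far from uniform.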
In this section we establish some extremality properties for the WE; cf. [4], Chap. 12.
Theorem 3.1
Suppose X* : Ω → X is an RV with a PM/DF f* and x ∈ X ↦ ϕ(x) is a given WF.

(I) Then f* (or X*) is the unique maximizer, modulo ϕ, of the WE h^w_ϕ(f) under the constraints

∫_X ϕ(x) [f(x) − f*(x)] ν(dx) ≥ 0 and   (3.1)

∫_X ϕ(x) [f(x) − f*(x)] log f*(x) ν(dx) ≥ 0.   (3.2)

(II) On the other hand, consider a constraint

∫_X ϕ(x) f(x) β(x) dν(x) = c,   (3.3)

where x ∈ X ↦ β(x) is a given function and c a given constant, neither of which is assumed non-negative. Suppose that f*(x) = (1/Z) exp[−b β(x)] is a (Gibbsian-type) PM/DF such that

∫_X ϕ(x) f*(x) dν(x) = 1 and ∫_X ϕ(x) f*(x) β(x) dν(x) = c.

Here b is a constant (an analog of the inverse temperature) and Z = ∫_X exp[−b β(x)] dν(x) ∈ (0, ∞) is the normalizing denominator (an analog of a partition function). Introduce the second constraint:

(log Z) ∫_X ϕ(x) [f*(x) − f(x)] dν(x) ≥ 0.   (3.4)

Then, under (3.3) and (3.4), the WE h^w_ϕ(f) is maximized at f = f*. As above, it is a unique maximizer, modulo ϕ.

Proof. (I) Using definition (1.2) and Theorem 1.3, we obtain

0 ≥ −D^w_ϕ(f ‖ f*) = h^w_ϕ(f) + ∫_X ϕ(x) f(x) log f*(x) ν(dx)   (3.5)

(here (3.1) guarantees (1.4) with g = f*). Under constraint (3.2) it yields

h^w_ϕ(f) ≤ −∫_X ϕ(x) f*(x) log f*(x) ν(dx) = h^w_ϕ(f*).   (3.6)

The uniqueness of the maximizer follows from the uniqueness case for equality in the weighted Gibbs inequality.

(II) Again use (3.5):

h^w_ϕ(f) ≤ −∫_X ϕ(x) f(x) [−log Z − b β(x)] dν(x) = (log Z) ∫_X ϕ(x) f(x) dν(x) + b ∫_X ϕ(x) f(x) β(x) dν(x)
 ≤ (log Z) ∫_X ϕ(x) f*(x) dν(x) + b ∫_X ϕ(x) f*(x) β(x) dν(x) = h^w_ϕ(f*).

Note that when Z ≥ 1, the factor log Z can be omitted from (3.4); otherwise log Z can be replaced by −1.
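A quick Python illustration of part (II) in the special case ϕ ≡ 1 (an assumption under which (3.3)–(3.4) reduce to fixing the mean "energy"): the Gibbsian PMF f* then maximizes the entropy among PMFs with the same constraint value. The energy values beta and the perturbation direction d below are illustrative choices; d has zero total mass and zero mean energy, so the perturbed f satisfies the same constraint (3.3).

```python
import math

beta = [0.0, 1.0, 2.0, 3.0]      # "energy" values beta(x) on a 4-point space
b = 1.0                          # inverse-temperature analog
Z = sum(math.exp(-b * e) for e in beta)
f_star = [math.exp(-b * e) / Z for e in beta]   # Gibbsian maximizer f*

# competing PMF with the same constraint value: perturb f* along a direction
# with zero total mass and zero mean energy (so (3.3) is preserved exactly)
d = [1.0, -2.0, 1.0, 0.0]
f = [q + 0.05 * v for q, v in zip(f_star, d)]

def h(p):
    # standard entropy, i.e., h^w_phi with phi ≡ 1 and counting measure
    return -sum(q * math.log(q) for q in p if q > 0)

c_star = sum(q * e for q, e in zip(f_star, beta))
c = sum(q * e for q, e in zip(f, beta))
print(h(f), h(f_star))           # h(f) < h(f*) since f differs from f*
```

The strict gap h(f*) − h(f) equals the Kullback–Leibler divergence D(f ‖ f*), in line with the uniqueness claim of Theorem 3.1.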
Example 3.2 Consider a random vector X = X^d : Ω → R^d with PDF f (relative to the d-dimensional Lebesgue measure), mean vector µ = 0 and covariance matrix C = (C_{ij}) with C_{ij} = E[X_i X_j], 1 ≤ i, j ≤ d. Let f^No_C be the normal PDF with the same µ and C. Let x = x^d ∈ R^d ↦ ϕ(x) be a given WF which is positive on an open domain in R^d. Introduce the d × d matrices Φ = (Φ_{ij}), Φ^No_C = (Φ^No_{ij}) and x^T x, where (x^T x)_{ij} = x_i x_j and

Φ = ∫_{R^d} ϕ(x) f(x) x^T x dx, Φ^No_C = ∫_{R^d} ϕ(x) f^No_C(x) x^T x dx.

Suppose that

∫_{R^d} ϕ(x) [f(x) − f^No_C(x)] dx ≥ 0

and

log [(2π)^d (det C)] ∫_{R^d} ϕ(x) [f(x) − f^No_C(x)] dx + (log e) tr [C^{−1} (Φ − Φ^No_C)] ≤ 0.   (3.7)

Then

h^w_ϕ(f) ≤ h^w_ϕ(f^No_C) = (1/2) log [(2π)^d (det C)] ∫_{R^d} ϕ(x) f^No_C(x) dx + (log e/2) tr [C^{−1} Φ^No_C],   (3.8)

with equality iff f = f^No_C modulo ϕ.

Proof.
Using the same idea as before, write

0 ≥ −D^w_ϕ(f ‖ f^No_C) = h^w_ϕ(f) − (1/2) log [(2π)^d (det C)] ∫_{R^d} ϕ(x) f(x) dx − (log e/2) tr [C^{−1} Φ].   (3.9)

Equivalently,

h^w_ϕ(f) ≤ (1/2) log [(2π)^d (det C)] ∫_{R^d} ϕ(x) f(x) dx + (log e/2) tr [C^{−1} Φ],

which, by (3.7) and the preceding constraint, leads directly to the result.

To further illustrate the above methodology, we provide some more examples, omitting the proofs.

Example 3.3
Let f_Exp denote an exponential PDF on R_+ = (0, ∞) (relative to the Lebesgue measure dx) with mean λ^{−1}. Suppose a PDF f on R_+ satisfies the constraints

∫_{R_+} ϕ(x) [f(x) − f_Exp(x)] dx ≥ 0

and

(log λ) ∫_{R_+} ϕ(x) [f(x) − f_Exp(x)] dx − λ ∫_{R_+} x ϕ(x) [f(x) − f_Exp(x)] dx ≥ 0,   (3.10)

where x ∈ R_+ ↦ ϕ(x) is a given WF positive on an open interval. Then

h^w_ϕ(f) ≤ h^w_ϕ(f_Exp) = −(λ log λ) ∫_{R_+} ϕ(x) e^{−λx} dx + λ² ∫_{R_+} x ϕ(x) e^{−λx} dx,

and f_Exp is the unique maximizer modulo ϕ.

Example 3.4 Take X = Z_+ = {0, 1, . . .} and let ν be the counting measure: ν(i) = 1 for all i ∈ Z_+. Then, for an RV X with PMF f(i), we have f(i) = P(X = i). Fix a WF i ∈ Z_+ ↦ ϕ(i).

(a) Let f_Ge be a geometric PMF: f_Ge(x) = (1 − p)^x p, x ∈ Z_+. Then for any PMF f(i), i ∈ Z_+, satisfying the constraints

Σ_{i∈Z_+} ϕ(i) [f(i) − f_Ge(i)] ≥ 0

and

log p Σ_{i∈Z_+} ϕ(i) [f(i) − f_Ge(i)] + log(1 − p) Σ_{i∈Z_+} i ϕ(i) [f(i) − f_Ge(i)] ≥ 0,   (3.11)

we have h^w_ϕ(f) ≤ h^w_ϕ(f_Ge), with equality iff f = f_Ge modulo ϕ.

(b) Let f_Po be a Poisson PMF: f_Po(k) = e^{−λ} λ^k / k!, k ∈ Z_+. Then for any PMF f(k), k ∈ Z_+, satisfying the constraints

Σ_{k∈Z_+} ϕ(k) [f(k) − f_Po(k)] ≥ 0

and

log λ Σ_{k∈Z_+} k ϕ(k) [f(k) − f_Po(k)] − λ Σ_{k∈Z_+} ϕ(k) [f(k) − f_Po(k)] − Σ_{k∈Z_+} (log k!) ϕ(k) [f(k) − f_Po(k)] ≥ 0,   (3.12)

we have h^w_ϕ(f) ≤ h^w_ϕ(f_Po), with equality iff f = f_Po modulo ϕ.

Theorem 3.5 below offers an extension of the Ky Fan inequality stating that log det C is a concave function of a positive-definite d × d matrix C. Cf. [16, 17, 18, 21]. We follow the method proposed by Cover, Dembo and Thomas. As before, f^No_C denotes the normal PDF with zero mean and covariance matrix C.
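For orientation: in the special case ϕ ≡ 1, the weighted Ky Fan inequality (3.16) below reduces to the classical concavity of log det over positive-definite matrices, since then σ(C) = (d/2) log(2πe) + (1/2) log det C. A quick numerical check of that specialization (with randomly generated matrices; a sketch, not part of the paper):

```python
import numpy as np

# phi ≡ 1 specialization of (3.16):
#   log det(l1*C1 + l2*C2) >= l1*log det(C1) + l2*log det(C2)
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C1 = A @ A.T + np.eye(3)          # positive-definite "covariance" matrices
C2 = B @ B.T + np.eye(3)
l1, l2 = 0.3, 0.7                 # lambda_1 + lambda_2 = 1

lhs = np.linalg.slogdet(l1 * C1 + l2 * C2)[1]
rhs = l1 * np.linalg.slogdet(C1)[1] + l2 * np.linalg.slogdet(C2)[1]
print(lhs, rhs)                   # lhs >= rhs, equality only when C1 = C2
```

(`slogdet` returns the sign and the log of the absolute determinant; for positive-definite matrices the sign is 1, so the second component is log det.)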
Theorem 3.5 (The weighted Ky Fan inequality; cf. [3], Theorem 17.9.1, [4], Theorem 1, [5], Theorem 8, [20], Worked Example 1.5.9.)
Assume that x^d ∈ R^d ↦ ϕ(x^d) ≥ 0 is a given WF positive on an open domain. Suppose that, for λ_1, λ_2 ∈ [0, 1] with λ_1 + λ_2 = 1 and positive-definite matrices C_1, C_2, with C = λ_1 C_1 + λ_2 C_2,

∫_{R^d} ϕ(x) [λ_1 f^No_{C_1}(x) + λ_2 f^No_{C_2}(x) − f^No_C(x)] dx ≥ 0, and   (3.13)

log [(2π)^d (det C)] ∫_{R^d} ϕ(x) [λ_1 f^No_{C_1}(x) + λ_2 f^No_{C_2}(x) − f^No_C(x)] dx + (log e) tr [C^{−1} Ψ] ≤ 0,   (3.14)

where

Ψ = ∫_{R^d} ϕ(x) [λ_1 f^No_{C_1}(x) + λ_2 f^No_{C_2}(x) − f^No_C(x)] x^T x dx.   (3.15)

Then, with σ_ϕ(C_1) = h^w_ϕ(f^No_{C_1}), σ_ϕ(C_2) = h^w_ϕ(f^No_{C_2}) and σ_ϕ(C) = h^w_ϕ(f^No_C),

σ_ϕ(C) − λ_1 σ_ϕ(C_1) − λ_2 σ_ϕ(C_2) ≥ 0,   (3.16)

with equality iff λ_1 λ_2 = 0 or C_1 = C_2.

Proof. Take values λ_1, λ_2 ∈ [0, 1] such that λ_1 + λ_2 = 1, and let C_1 and C_2 be two positive-definite d × d matrices. Let X_1 and X_2 be two multivariate normal vectors, with PDFs f_k ∼ N(0, C_k), k = 1, 2. Set Z = X_Θ, where the RV Θ takes two values, θ = 1 and θ = 2, with probabilities λ_1 and λ_2 respectively, and is independent of X_1 and X_2. Then the vector Z has covariance C = λ_1 C_1 + λ_2 C_2. Also set:

α(C) = ∫_{R^d} ϕ(x) f^No_C(x) dx.   (3.17)

Let x = (x_1, . . . , x_d) ∈ R^d ↦ ϕ(x) be a given WF and set ϕ̃(x, θ) = ϕ(x). Following the same arguments as in the proof of Theorem 2.1,

h^w_ϕ̃(Z | Θ) ≤ h^w_ϕ(Z).

It is plain that

h^w_ϕ̃(Z | Θ) = λ_1 h^w_ϕ(X_1) + λ_2 h^w_ϕ(X_2)
 = Σ_{k=1,2} λ_k {(1/2) log [(2π)^d (det C_k)] ∫_{R^d} ϕ(x) f^No_{C_k}(x) dx + (log e/2) tr [C_k^{−1} Φ^{(k)}]},   (3.18)

where

Φ^{(k)} = ∫_{R^d} x^T x ϕ(x) f^No_{C_k}(x) dx, k = 1, 2, and (x^T x)_{ij} = x_i x_j.

According to Example 3.2, we have

h^w_ϕ(Z) ≤ {(1/2) log [(2π)^d (det C)]} α(C) + (log e/2) tr [C^{−1} Φ],   (3.19)

where

Φ = ∫_{R^d} x^T x ϕ(x) f^No_C(x) dx.   (3.20)

The inequality (3.16) then follows. The cases of equality are covered by Theorem 2.1.

The following lemma is an immediate extension of Lemma 1.6.

Lemma 3.6
Let $\mathbf X^n=(X_1,\ldots,X_n)$ be a random vector, with components $X_i:\Omega\to\mathcal X_i$, $1\leq i\leq n$, and the joint PM/DF $f$. Extending the notation used earlier, set:
\[\mathbf x^n=(x_1,\ldots,x_n)\in\mathcal X^n:=\mathop{\times}_{1\leq i\leq n}\mathcal X_i\quad\hbox{and}\quad\nu^n({\rm d}\mathbf x^n)=\prod_{1\leq i\leq n}\nu_i({\rm d}x_i),\]
and more generally,
\[\mathbf x^l_k=(x_k,\ldots,x_l)\in\mathcal X^l_k:=\mathop{\times}_{k\leq i\leq l}\mathcal X_i\quad\hbox{and}\quad\nu^l_k({\rm d}\mathbf x^l_k)=\prod_{k\leq i\leq l}\nu_i({\rm d}x_i),\quad 1\leq k\leq l\leq n.\]
Next, introduce
\[f_i(x_i)=\int_{\mathcal X^{i-1}\times\mathcal X^n_{i+1}}f(\mathbf x^{i-1},x_i,\mathbf x^n_{i+1})\,\nu^{i-1}({\rm d}\mathbf x^{i-1})\,\nu^n_{i+1}({\rm d}\mathbf x^n_{i+1})\quad\hbox{(the marginal PM/DF for RV $X_i$)},\]
and
\[f_{|i}(\mathbf x^n|x_i)=\frac{f(\mathbf x^n)}{f_i(x_i)}\quad\hbox{(the conditional PM/DF given that $X_i=x_i$)}.\]
Given a WF $\mathbf x^n\in\mathcal X^n\mapsto\varphi(\mathbf x^n)$, suppose that
\[\int_{\mathcal X^n}\varphi(\mathbf x^n)\Big[f(\mathbf x^n)-\prod_{i=1}^nf_i(x_i)\Big]\,\nu^n({\rm d}\mathbf x^n)\ \geq\ 0.\qquad(3.21)\]
Then
\[h^{\rm w}_\varphi(\mathbf X^n)\ \leq\ \sum_{i=1}^nh^{\rm w}_{\psi_i}(X_i),\qquad(3.22)\]
where
\[\psi_i(x_i)=\int_{\mathcal X^{i-1}\times\mathcal X^n_{i+1}}\varphi(\mathbf x^n)f_{|i}(\mathbf x^n|x_i)\,\nu^{i-1}({\rm d}\mathbf x^{i-1})\,\nu^n_{i+1}({\rm d}\mathbf x^n_{i+1}).\]
Here, equality in (3.22) holds iff, modulo $\varphi$, the components $X_1,\ldots,X_n$ are independent.

Theorem 3.7 (The weighted Hadamard inequality; cf. [3], Theorem 17.9.2; [4], Theorem 3; [5], Theorem 26; [20], Worked Example 1.5.10.)
Let $C=(C_{ij})$ be a positive-definite $d\times d$ matrix and $f^{\rm No}_C$ the normal PDF with zero mean and covariance matrix $C$. Given a WF $\mathbf x^d=(x_1,\ldots,x_d)\in\mathbb R^d\mapsto\varphi(\mathbf x^d)$, positive on an open domain in $\mathbb R^d$, introduce the quantity $\alpha=\alpha(C)$ by (3.17) and the matrix $\Phi=(\Phi_{ij})$ by (3.20). Let $f^{\rm No}_i$ stand for the ${\rm N}(0,C_{ii})$-PDF (the marginal PDF of the $i$-th component). Then under the condition
\[\int_{\mathbb R^d}\varphi(\mathbf x^d)\Big[f^{\rm No}_C(\mathbf x^d)-\prod_{i=1}^df^{\rm No}_i(x_i)\Big]\,{\rm d}\mathbf x^d\ \geq\ 0,\qquad(3.23)\]
we have:
\[\alpha\log\prod_i(2\pi C_{ii})+(\log e)\sum_iC_{ii}^{-1}\Phi_{ii}-\alpha\log\big[(2\pi)^d(\det C)\big]-(\log e)\,{\rm tr}\,C^{-1}\Phi\ \geq\ 0,\qquad(3.24)\]
with equality iff $C$ is diagonal.

Proof. If $(X_1,\ldots,X_d)\sim{\rm N}(0,C)$ then, by following (3.22) in Lemma 3.6, we can write
\[\frac12\log\big[(2\pi)^d(\det C)\big]\int_{\mathbb R^d}\varphi(\mathbf x^d)f(\mathbf x^d)\,{\rm d}\mathbf x^d+\frac{\log e}2\,{\rm tr}\,C^{-1}\Phi\ \leq\ \sum_{i=1}^d\Big\{\frac12\log(2\pi C_{ii})\int_{\mathbb R}\psi_i(x)f^{\rm No}_i(x)\,{\rm d}x+\frac{\log e}2\,C_{ii}^{-1}\Psi_{ii}\Big\}.\qquad(3.25)\]
Here
\[\psi_i(x_i)=\int_{\mathbb R^{d-1}}\varphi(\mathbf x^d)f^{\rm No}_{|i}(\mathbf x^d|x_i)\prod_{j:\,j\neq i}{\rm d}x_j,\qquad\Psi_{ii}=\int_{\mathbb R}x_i^2\,\psi_i(x_i)f^{\rm No}_i(x_i)\,{\rm d}x_i=\Phi_{ii},\]
and
\[f^{\rm No}_{|i}(\mathbf x^d|x_i)=\frac{f^{\rm No}_C(\mathbf x^d)}{f^{\rm No}_i(x_i)}\quad\hbox{(the conditional PDF)}.\]
With
\[\alpha=\int_{\mathbb R}\psi_i(x_i)f^{\rm No}_i(x_i)\,{\rm d}x_i=\int_{\mathbb R^d}\varphi(\mathbf x^d)f^{\rm No}_C(\mathbf x^d)\,{\rm d}\mathbf x^d,\]
the bound (3.24) follows. $\square$

Remark 3.8
As above, maximizing the left-hand side in (3.24) would give a bound between $\det C$ and the product $\prod_{i=1}^dC_{ii}$.

Weighted Fisher information and related inequalities
In this section we introduce a weighted version of the Fisher information matrix and establish some straightforward facts. The bulk of these properties is derived by following Ref. [32].
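Before the formal definitions, a quick numerical sanity check may help fix ideas: for a scalar $X\sim{\rm N}(\theta,1)$ the score is $S(x,\theta)=x-\theta$, and with the illustrative (non-canonical, assumed here only) WF $\varphi(x)=1+x^2$ one gets, at $\theta=0$, $J^{\rm w}_\varphi(X;0)={\mathbb E}[X^2]+{\mathbb E}[X^4]=1+3=4$. A sketch by direct quadrature of the defining integral (4.1) below:

```python
import numpy as np

# Scalar weighted Fisher information of Definition 4.1 below:
# X ~ N(theta, 1) at theta = 0, score S(x, 0) = x,
# with the illustrative WF phi(x) = 1 + x^2 (an assumption made here only).
x, dx = np.linspace(-10.0, 10.0, 200_001, retstep=True)
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # N(0, 1) density
phi = 1.0 + x**2

# J^w_phi(X; 0) = E[phi(X) S(X, 0)^2]; Riemann sum (Gaussian tails negligible)
J_num = float(np.sum(phi * f * x**2) * dx)

# Closed form for the standard normal: E[X^2] + E[X^4] = 1 + 3 = 4
print(J_num)   # ~ 4.0
```

With $\varphi\equiv 1$ the same computation returns the standard Fisher information $J(X;0)=1$, in line with the remark closing Definition 4.1.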
Definition 4.1
Let $\mathbf X=(X_1,\ldots,X_n)$ be a random $1\times n$ vector with probability density function (PDF) $f_\theta(\mathbf x)=f_{\mathbf X}(\mathbf x;\theta)$, $\mathbf x=(x_1,\ldots,x_n)\in\mathbb R^n$, where $\theta=(\theta_1,\ldots,\theta_m)\in\mathbb R^m$ is a parameter vector. Suppose that the dependence $\theta\mapsto f_\theta$ is $C^1$. The $m\times m$ weighted Fisher information matrix $J^{\rm w}_\varphi(\mathbf X;\theta)$, with a given WF $\mathbf x\in\mathbb R^n\mapsto\varphi(\mathbf x)\geq 0$, is defined by
\[J^{\rm w}_\varphi(\mathbf X;\theta)={\mathbb E}\Big[\varphi(\mathbf X)S(\mathbf X,\theta)^{\rm T}S(\mathbf X,\theta)\Big]=\int\frac{\varphi(\mathbf x)}{f_\theta(\mathbf x)}\Big(\frac{\partial f_\theta(\mathbf x)}{\partial\theta}\Big)^{\rm T}\frac{\partial f_\theta(\mathbf x)}{\partial\theta}\,{\mathbf 1}\big(f_\theta(\mathbf x)>0\big)\,{\rm d}\mathbf x,\qquad(4.1)\]
assuming the integrals are absolutely convergent. Here and below, $\displaystyle\frac{\partial}{\partial\theta}$ stands for the $1\times m$ gradient in $\theta$ and $S(\mathbf x,\theta)={\mathbf 1}\big(f_\theta(\mathbf x)>0\big)\displaystyle\frac{\partial}{\partial\theta}\log f_\theta(\mathbf x)$ denotes the score vector. When $\varphi(\mathbf x)\equiv 1$, $J^{\rm w}_\varphi(\mathbf X;\theta)=J(\mathbf X;\theta)$, the standard Fisher information matrix, cf. [5], [4], [20].

Definition 4.2
Let $(\mathbf X,\mathbf Y)$ be a pair of RVs with a joint PDF $f_\theta(\mathbf x,\mathbf y)=f_{\mathbf X,\mathbf Y}(\mathbf x,\mathbf y;\theta)$ and conditional PDF
\[f_\theta(\mathbf y|\mathbf x)=f_{\mathbf Y|\mathbf X}(\mathbf y|\mathbf x;\theta):=\frac{f_{\mathbf X,\mathbf Y}(\mathbf x,\mathbf y;\theta)}{f_{\mathbf X}(\mathbf x;\theta)}.\]
Given a joint WF $(\mathbf x,\mathbf y)\in\mathbb R^n\times\mathbb R^n\mapsto\varphi(\mathbf x,\mathbf y)\geq 0$, we set:
\[J^{\rm w}_\varphi(\mathbf X,\mathbf Y;\theta)={\mathbb E}\bigg[\varphi(\mathbf X,\mathbf Y)\Big(\frac{\partial\log f_\theta(\mathbf X,\mathbf Y)}{\partial\theta}\Big)^{\rm T}\frac{\partial\log f_\theta(\mathbf X,\mathbf Y)}{\partial\theta}\,{\mathbf 1}\big(f_\theta(\mathbf X,\mathbf Y)>0\big)\bigg]=\int\frac{\varphi(\mathbf x,\mathbf y)}{f_\theta(\mathbf x,\mathbf y)}\Big(\frac{\partial f_\theta(\mathbf x,\mathbf y)}{\partial\theta}\Big)^{\rm T}\frac{\partial f_\theta(\mathbf x,\mathbf y)}{\partial\theta}\,{\mathbf 1}\big(f_\theta(\mathbf x,\mathbf y)>0\big)\,{\rm d}\mathbf x\,{\rm d}\mathbf y\qquad(4.2)\]
and
\[J^{\rm w}_\varphi(\mathbf Y|\mathbf X;\theta)={\mathbb E}\bigg[\varphi(\mathbf X,\mathbf Y)\Big(\frac{\partial\log f_\theta(\mathbf Y|\mathbf X)}{\partial\theta}\Big)^{\rm T}\frac{\partial\log f_\theta(\mathbf Y|\mathbf X)}{\partial\theta}\,{\mathbf 1}\big(f_\theta(\mathbf Y|\mathbf X)>0\big)\bigg]=\int\varphi(\mathbf x,\mathbf y)\,\frac{f_\theta(\mathbf x,\mathbf y)}{f_\theta(\mathbf y|\mathbf x)^2}\Big(\frac{\partial f_\theta(\mathbf y|\mathbf x)}{\partial\theta}\Big)^{\rm T}\frac{\partial f_\theta(\mathbf y|\mathbf x)}{\partial\theta}\,{\mathbf 1}\big(f_\theta(\mathbf y|\mathbf x)>0\big)\,{\rm d}\mathbf x\,{\rm d}\mathbf y.\qquad(4.3)\]
Next, consider an $m\times m$ matrix $S_\theta=S_\theta(f_{\mathbf X,\mathbf Y})$ and a $1\times m$ vector $B_\theta=B_\theta(\mathbf x,f_{\mathbf Y|\mathbf X})$:
\[B_\theta={\mathbb E}_{\mathbf Y|\mathbf X=\mathbf x}\bigg[\varphi(\mathbf x,\mathbf Y)\,\frac{\partial\log f_\theta(\mathbf Y|\mathbf x)}{\partial\theta}\bigg]=\int\varphi(\mathbf x,\mathbf y)\,\frac{\partial f_\theta(\mathbf y|\mathbf x)}{\partial\theta}\,{\mathbf 1}\big(f_\theta(\mathbf y|\mathbf x)>0\big)\,{\rm d}\mathbf y,\qquad(4.4)\]
\[S_\theta={\mathbb E}\bigg\{\bigg[\Big(\frac{\partial\log f_\theta(\mathbf X)}{\partial\theta}\Big)^{\rm T}B_\theta(\mathbf X)+B_\theta(\mathbf X)^{\rm T}\,\frac{\partial\log f_\theta(\mathbf X)}{\partial\theta}\bigg]\,{\mathbf 1}\big(f_\theta(\mathbf X)>0\big)\bigg\}.\qquad(4.5)\]
When $\varphi(\mathbf x,\mathbf y)$ depends only on $\mathbf x$ and under standard regularity assumptions, the vector $B_\theta$ vanishes (and so does the matrix $S_\theta$):
\[B_\theta=\varphi(\mathbf x)\int\frac{\partial f_\theta(\mathbf y|\mathbf x)}{\partial\theta}\,{\rm d}\mathbf y=\varphi(\mathbf x)\,\frac{\partial}{\partial\theta}\int f_\theta(\mathbf y|\mathbf x)\,{\rm d}\mathbf y=0.\]
For the sake of brevity, in the formulas that follow we routinely omit indicators of positivity of the PDFs involved: their presence can easily be derived from the local context.

Lemma 4.3 (The chain rule: cf. [32], Lemma 1.)
Given a pair $(\mathbf X,\mathbf Y)$ of random vectors and a joint WF $(\mathbf x,\mathbf y)\mapsto\varphi(\mathbf x,\mathbf y)$, set:
\[\psi(\mathbf x)=\psi_{\mathbf X}(\mathbf x)=\int\varphi(\mathbf x,\mathbf y)f_\theta(\mathbf y|\mathbf x)\,{\rm d}\mathbf y={\mathbb E}_{\mathbf Y|\mathbf X=\mathbf x}\,\varphi(\mathbf x,\mathbf Y).\qquad(4.6)\]
Then
\[J^{\rm w}_\varphi(\mathbf X,\mathbf Y;\theta)=J^{\rm w}_\psi(\mathbf X;\theta)+J^{\rm w}_\varphi(\mathbf Y|\mathbf X;\theta)+S_\theta.\qquad(4.7)\]

Proof.
For simplicity, assume that $\theta$ is scalar: $\theta=\theta_1$; the generalization to the vector case is straightforward. We have
\[J^{\rm w}_\varphi(\mathbf X,\mathbf Y;\theta)={\mathbb E}\bigg[\varphi(\mathbf X,\mathbf Y)\Big(\frac{\partial\log f_\theta(\mathbf X,\mathbf Y)}{\partial\theta}\Big)^2\bigg].\qquad(4.8)\]
Furthermore, $\log f_\theta(\mathbf x,\mathbf y)=\log f_\theta(\mathbf x)+\log f_\theta(\mathbf y|\mathbf x)$. Using this in (4.8) yields:
\[J^{\rm w}_\varphi(\mathbf X,\mathbf Y;\theta)={\mathbb E}\bigg[\varphi(\mathbf X,\mathbf Y)\Big(\frac{\partial\log f_\theta(\mathbf X)}{\partial\theta}\Big)^2\bigg]+{\mathbb E}\bigg[\varphi(\mathbf X,\mathbf Y)\Big(\frac{\partial\log f_\theta(\mathbf Y|\mathbf X)}{\partial\theta}\Big)^2\bigg]+2\,{\mathbb E}\bigg[\varphi(\mathbf X,\mathbf Y)\,\frac{\partial\log f_\theta(\mathbf X)}{\partial\theta}\,\frac{\partial\log f_\theta(\mathbf Y|\mathbf X)}{\partial\theta}\bigg].\qquad(4.9)\]
We can also write
\[{\mathbb E}\bigg[\varphi(\mathbf X,\mathbf Y)\,\frac{\partial\log f_\theta(\mathbf X)}{\partial\theta}\,\frac{\partial\log f_\theta(\mathbf Y|\mathbf X)}{\partial\theta}\bigg]={\mathbb E}\bigg\{\frac{\partial\log f_\theta(\mathbf X)}{\partial\theta}\,{\mathbb E}\bigg[\varphi(\mathbf X,\mathbf Y)\,\frac{\partial\log f_\theta(\mathbf Y|\mathbf X)}{\partial\theta}\,\bigg|\,\mathbf X\bigg]\bigg\}.\qquad(4.10)\]
The inner expectation in the right-hand side of (4.10) equals $B_\theta(\mathbf X)$; hence the doubled cross term in (4.9) yields $S_\theta$ in (4.7), while the first two summands in (4.9) coincide with $J^{\rm w}_\psi(\mathbf X;\theta)$ and $J^{\rm w}_\varphi(\mathbf Y|\mathbf X;\theta)$, respectively. $\square$

Throughout the paper, an inequality $A\leq B$ between matrices $A$ and $B$ means that $B-A$ is a positive semi-definite matrix.

Lemma 4.4 (Data-refinement inequality: cf. [32], Lemma 2.)
For a given joint WF $(\mathbf x,\mathbf y)\mapsto\varphi(\mathbf x,\mathbf y)$,
\[J^{\rm w}_\varphi(\mathbf X,\mathbf Y;\theta)\ \geq\ J^{\rm w}_\psi(\mathbf X;\theta)+S_\theta,\qquad(4.11)\]
with equality if $\mathbf X$ is a sufficient statistic for $\theta$. Here the WF $\psi=\psi_{\mathbf X}$ is defined as in (4.6).

Proof.
Bound (4.11) follows from Lemma 4.3 using the non-negativity of the matrix
\[J^{\rm w}_\varphi(\mathbf Y|\mathbf X=\mathbf x;\theta)=\int f_\theta(\mathbf y|\mathbf x)\,\varphi(\mathbf x,\mathbf y)\Big(\frac{\partial\log f_\theta(\mathbf y|\mathbf x)}{\partial\theta}\Big)^{\rm T}\frac{\partial\log f_\theta(\mathbf y|\mathbf x)}{\partial\theta}\,{\rm d}\mathbf y.\]
Equality holds when $J^{\rm w}_\varphi(\mathbf Y|\mathbf X=\mathbf x;\theta)=0$, which leads to the statement. $\square$

Lemma 4.5 (Data-processing inequality: cf. [32], Lemma 3.) For a given joint WF $(\mathbf x,\mathbf y)\mapsto\varphi(\mathbf x,\mathbf y)$ and a function $\mathbf x\mapsto g(\mathbf x)$, set
\[\varrho_g(\mathbf x)=\varphi(\mathbf x,g(\mathbf x))\quad\hbox{and}\quad\rho_g(\mathbf x)=\varphi(\mathbf x,g(\mathbf x))\,f_\theta(\mathbf x|g(\mathbf x)).\qquad(4.12)\]
Then we have
\[J^{\rm w}_{\varrho_g}(\mathbf X;\theta)\ \geq\ J^{\rm w}_{\rho_g}(g(\mathbf X);\theta).\qquad(4.13)\]
The equality holds iff the function $g$ is invertible.

Proof.
We make use of Lemma 4.4. Let $\mathbf Y=g(\mathbf X)$; then $S_\theta=0$. This yields
\[J^{\rm w}_{\rho_g}(g(\mathbf X);\theta)\ \leq\ J^{\rm w}_\varphi(\mathbf X,g(\mathbf X);\theta).\qquad(4.14)\]
Note that the equality holds true if $J^{\rm w}_\varphi(\mathbf X|g(\mathbf X);\theta)=0$, that is, if $g(\mathbf X)$ is a sufficient statistic for $\theta$. Now use the chain rule, Lemma 4.3, where $J^{\rm w}_\varphi(g(\mathbf X)|\mathbf X;\theta)=0$. Hence,
\[J^{\rm w}_\varphi(\mathbf X,g(\mathbf X);\theta)=J^{\rm w}_{\varrho_g}(\mathbf X;\theta).\qquad(4.15)\]
The assertions (4.14) and (4.15) lead directly to the result. $\square$

Lemma 4.6 (Parameter transformation: cf. [32], Lemma 4.)
Suppose we have a family of PDFs $f_\eta(\mathbf x)$ parameterized by a $1\times m'$ vector $\eta=(\eta_1,\ldots,\eta_{m'})\in\mathbb R^{m'}$. Suppose that the vector $\eta$ is a function of $\theta\in\mathbb R^m$. Then
\[J^{\rm w}_\varphi(\mathbf X;\theta)=\Big(\frac{\partial\eta}{\partial\theta}\Big)^{\rm T}J^{\rm w}_\varphi\big(\mathbf X;\eta(\theta)\big)\Big(\frac{\partial\eta}{\partial\theta}\Big),\qquad(4.16)\]
with the $m'\times m$ matrix $\displaystyle\frac{\partial\eta}{\partial\theta}=\Big(\frac{\partial\eta_i}{\partial\theta_j}\Big)$, $1\leq i\leq m'$, $1\leq j\leq m$. In the linear case where $\eta(\theta)=\theta Q$ for some $m\times m'$ matrix $Q$, we obtain:
\[J^{\rm w}_\varphi(\mathbf X;\theta)=QJ^{\rm w}_\varphi\big(\mathbf X;\eta(\theta)\big)Q^{\rm T}.\qquad(4.17)\]

Proof.
Formula (4.16) becomes straightforward by substituting the expression
\[\frac{\partial\log f_{\eta(\theta)}(\mathbf x)}{\partial\theta}=\frac{\partial\log f_\eta(\mathbf x)}{\partial\eta}\,\frac{\partial\eta(\theta)}{\partial\theta}.\qquad(4.18)\]
$\square$

Concluding this section, we consider a linear model where the parameter is related to an additive shift. Suppose a random vector $\mathbf X$ in $\mathbb R^n$ has a joint PDF $f_{\mathbf X}$ and $\mathbf x\in\mathbb R^n\mapsto\varphi(\mathbf x)$ is a given WF. Set:
\[L^{\rm w}_\varphi(\mathbf X):=\int\frac{\varphi(\mathbf x)}{f(\mathbf x)}\big(\nabla f(\mathbf x)\big)^{\rm T}\nabla f(\mathbf x)\,{\rm d}\mathbf x.\qquad(4.19)\]
Here and below, we use the symbol $\nabla$ for the spatial gradient ($1\times n$ vectors), as opposed to the parameter gradients $\displaystyle\frac{\partial}{\partial\theta}$ and $\displaystyle\frac{\partial}{\partial\eta}$. Furthermore, set
\[\mathbf X=\theta Q+\mathbf YP.\qquad(4.20)\]
Here $Q$ and $P$ are two matrices, of sizes $m\times n$ and $k\times n$, respectively, with $m\leq k\leq n$; next, $\mathbf X\in\mathbb R^n$ and $\mathbf Y\in\mathbb R^k$. Let $\mathbf x\in\mathbb R^n\mapsto\varphi(\mathbf x)\geq 0$ be a given WF and set
\[\psi(\mathbf y)=\psi_P(\mathbf y)=\int_{\mathbb R^{n-k}}\varphi(\mathbf x)\,{\mathbf 1}(\mathbf xP^{\rm T}=\mathbf y)\,f_{\mathbf X|\mathbf XP^{\rm T}}(\mathbf x|\mathbf y)\,{\rm d}\mathbf x_{\complement P},\qquad\mathbf y\in\mathbb R^nP^{\rm T},\]
where $\mathbf x_{\complement P}$ stands for the complementary variable in $\mathbf x$, given that $\mathbf xP^{\rm T}=\mathbf y$. In Lemma 4.7 we present relationships between $J^{\rm w}_\varphi(\mathbf X;\theta)$, $J^{\rm w}_\psi(\mathbf Y;\theta)$, $L^{\rm w}_\varphi(\mathbf X)$ and $L^{\rm w}_\psi(\mathbf XP^{\rm T})$ for the above model. (The proofs are straightforward and omitted.)

Lemma 4.7 (Cf. [32], Lemmas 5 and 6.)
Assume the model (4.20). Then
\[J^{\rm w}_\varphi(\mathbf X;\theta)=QL^{\rm w}_\varphi(\mathbf X)Q^{\rm T},\qquad J^{\rm w}_\psi(\mathbf Y;\theta)=QP^{\rm T}L^{\rm w}_\psi(\mathbf XP^{\rm T})PQ^{\rm T},\]
and
\[J^{\rm w}_\varphi(\mathbf X;\theta)\ \geq\ J^{\rm w}_\psi(\mathbf Y;\theta).\qquad(4.21)\]

Corollary 4.8 (Cf. [32], Corollary 1.)
Let $P$ be an $m\times m$ matrix. Let $\mathbf X$ be a random vector in $\mathbb R^m$, and let the WFs $\varphi$ and $\psi=\psi_P$ be as above. Then

(i) $J^{\rm w}_\varphi(\mathbf X;\theta)\geq P^{\rm T}J^{\rm w}_\psi(\mathbf XP^{\rm T};\theta)P$.

(ii) For $P$ with orthonormal rows (i.e., with $PP^{\rm T}$ equal to $I_m$, the unit $m\times m$ matrix),
\[J^{\rm w}_\psi(\mathbf XP^{\rm T};\theta)\ \leq\ P^{\rm T}J^{\rm w}_\varphi(\mathbf X;\theta)P.\qquad(4.22)\]

(iii) For $P$ with a full row rank $m$, and $\mathbf X\in\mathbb R^m$ with nonsingular $J^{\rm w}_\varphi$,
\[J^{\rm w}_\psi(\mathbf XP^{\rm T};\theta)\ \leq\ \Big(P^{\rm T}\big[J^{\rm w}_\varphi(\mathbf X;\theta)\big]^{-1}P\Big)^{-1}.\qquad(4.23)\]

We start with multivariate weighted Cramér–Rao inequalities (WCRIs). As usual, consider a family of PDFs $f_\theta(\mathbf x)$, $\mathbf x\in\mathbb R^n$, depending on a parameter $\theta\in\mathbb R^m$, and let $\mathbf X=\mathbf X_\theta$ denote the random vector with PDF $f_\theta$. Let a statistic $\mathbf x\mapsto T(\mathbf x)=(T_1(\mathbf x),\ldots,T_s(\mathbf x))$ and a WF $\mathbf x\mapsto\varphi(\mathbf x)\geq 0$ be given. With ${\mathbb E}_\theta$ standing for the expectation relative to $f_\theta$, set:
\[\alpha(\theta)={\mathbb E}_\theta\,\varphi(\mathbf X),\qquad\eta(\theta)={\mathbb E}_\theta\big[\varphi(\mathbf X)T(\mathbf X)\big].\qquad(5.1)\]
We also suppose that the operations of taking the expectation and the gradient are interchangeable:
\[{\mathbb E}_\theta\big[\varphi(\mathbf X)S(\mathbf X,\theta)\big]=\frac{\partial\alpha(\theta)}{\partial\theta},\qquad{\mathbb E}_\theta\big[\varphi(\mathbf X)T(\mathbf X)^{\rm T}S(\mathbf X,\theta)\big]=\frac{\partial\eta(\theta)}{\partial\theta},\qquad(5.2)\]
assuming $C^1$-dependence of $\theta\mapsto\alpha(\theta)$ and $\theta\mapsto\eta(\theta)$ and absolute convergence of the integrals involved. Let $C^{\rm w}_\varphi(\theta)$ denote the weighted covariance matrix of $T(\mathbf X)$:
\[C^{\rm w}_\varphi(\theta)={\mathbb E}_\theta\Big\{\varphi(\mathbf X)\big[T(\mathbf X)-\eta(\theta)\big]^{\rm T}\big[T(\mathbf X)-\eta(\theta)\big]\Big\},\qquad(5.3)\]
and let $J^{\rm w}_\varphi(\mathbf X;\theta)={\mathbb E}\big[\varphi(\mathbf X)S(\mathbf X,\theta)^{\rm T}S(\mathbf X,\theta)\big]$ be the weighted Fisher information matrix under the WF $\varphi$; cf. Eqn (4.1).

Theorem 5.1 (A weighted Cramér–Rao inequality, version I; cf. [4], Theorem 11.10.1; [5], Theorem 20.) Assuming $J^{\rm w}_\varphi(\mathbf X;\theta)$ is invertible and under condition (5.2), the vectors $\eta(\theta)$, $\displaystyle\frac{\partial\alpha(\theta)}{\partial\theta}$ and matrices $C^{\rm w}_\varphi(\theta)$, $J^{\rm w}_\varphi(\mathbf X;\theta)$, $\displaystyle\frac{\partial\eta(\theta)}{\partial\theta}$ obey
\[C^{\rm w}_\varphi(\theta)\ \geq\ \bigg[\frac{\partial\eta(\theta)}{\partial\theta}-\big(\eta(\theta)\big)^{\rm T}\frac{\partial\alpha(\theta)}{\partial\theta}\bigg]\Big[J^{\rm w}_\varphi(\mathbf X;\theta)\Big]^{-1}\bigg[\frac{\partial\eta(\theta)}{\partial\theta}-\big(\eta(\theta)\big)^{\rm T}\frac{\partial\alpha(\theta)}{\partial\theta}\bigg]^{\rm T}.\qquad(5.4)\]

Proof.
We start with the simplified version where $s=1$, and $T(\mathbf X)=T_1(\mathbf X)$ and $\eta(\theta)=\eta_1(\theta)$ are scalars, keeping general $n,m\geq 1$. By using (5.2), write:
\[{\mathbb E}_\theta\Big\{\varphi(\mathbf X)\big[T(\mathbf X)-\eta(\theta)\big]S(\mathbf X;\theta)\Big\}={\mathbb E}_\theta\big[\varphi(\mathbf X)T(\mathbf X)S(\mathbf X;\theta)\big]-\eta(\theta)\,{\mathbb E}_\theta\big[\varphi(\mathbf X)S(\mathbf X;\theta)\big]=\frac{\partial\eta(\theta)}{\partial\theta}-\eta(\theta)\frac{\partial\alpha(\theta)}{\partial\theta}.\qquad(5.5)\]
Then for any $1\times m$ vector $\mu\in\mathbb R^m$,
\[0\ \leq\ {\mathbb E}_\theta\Big\{\varphi(\mathbf X)\big[T(\mathbf X)-\eta(\theta)-S(\mathbf X,\theta)\mu^{\rm T}\big]^2\Big\}={\mathbb E}_\theta\Big\{\varphi(\mathbf X)\big[T(\mathbf X)-\eta(\theta)\big]^2\Big\}+\mu J^{\rm w}_\varphi(\mathbf X;\theta)\mu^{\rm T}-2\mu\Big(\frac{\partial\eta(\theta)}{\partial\theta}-\eta(\theta)\frac{\partial\alpha(\theta)}{\partial\theta}\Big)^{\rm T}.\qquad(5.6)\]
Taking $\mu=\Big(\displaystyle\frac{\partial\eta(\theta)}{\partial\theta}-\eta(\theta)\frac{\partial\alpha(\theta)}{\partial\theta}\Big)\big[J^{\rm w}_\varphi(\mathbf X;\theta)\big]^{-1}$ (which is the minimiser of the RHS in (5.6)), we obtain
\[{\rm Var}^{\rm w}_\varphi[T(\mathbf X)]:={\mathbb E}_\theta\Big\{\varphi(\mathbf X)\big(T(\mathbf X)-\eta(\theta)\big)^2\Big\}\ \geq\ \Big(\frac{\partial\eta(\theta)}{\partial\theta}-\eta(\theta)\frac{\partial\alpha(\theta)}{\partial\theta}\Big)\big[J^{\rm w}_\varphi(\mathbf X;\theta)\big]^{-1}\Big(\frac{\partial\eta(\theta)}{\partial\theta}-\eta(\theta)\frac{\partial\alpha(\theta)}{\partial\theta}\Big)^{\rm T}.\qquad(5.7)\]
Turning to the general case $s\geq 1$, set $T(\mathbf X)\lambda^{\rm T}$, where $\lambda\in\mathbb R^s$ is a $1\times s$ vector. Then (5.7) yields that for all $\lambda$,
\[\lambda C^{\rm w}_\varphi(\theta)\lambda^{\rm T}={\rm Var}^{\rm w}_\varphi[T(\mathbf X)\lambda^{\rm T}]:={\mathbb E}_\theta\Big\{\varphi(\mathbf X)\big[T(\mathbf X)\lambda^{\rm T}-\eta(\theta)\lambda^{\rm T}\big]^2\Big\}\ \geq\ \lambda\bigg[\frac{\partial\eta(\theta)}{\partial\theta}-\big(\eta(\theta)\big)^{\rm T}\frac{\partial\alpha(\theta)}{\partial\theta}\bigg]\Big[J^{\rm w}_\varphi(\mathbf X;\theta)\Big]^{-1}\bigg[\frac{\partial\eta(\theta)}{\partial\theta}-\big(\eta(\theta)\big)^{\rm T}\frac{\partial\alpha(\theta)}{\partial\theta}\bigg]^{\rm T}\lambda^{\rm T},\]
implying (5.4). $\square$

Definition 5.2
The calibrated relative WE $K^{\rm w}_\varphi(f\,\|\,g)$ of $f$ and $g$ with WF $\varphi$ is defined by
\[K^{\rm w}_\varphi(f\,\|\,g)=\int\frac{\varphi(\mathbf x)f(\mathbf x)}{\alpha(f)}\log\frac{f(\mathbf x)\alpha(g)}{g(\mathbf x)\alpha(f)}\,{\rm d}\mathbf x=\frac{D^{\rm w}_\varphi(f\,\|\,g)}{\alpha(f)}+\log\frac{\alpha(g)}{\alpha(f)}=D(\widetilde f\,\|\,\widetilde g).\qquad(5.8)\]
Here $\widetilde f$ and $\widetilde g$ are the PDFs produced from $\varphi f$ and $\varphi g$ after normalizing by $\alpha(f)$ and $\alpha(g)$:
\[\alpha(f)=\int\varphi(\mathbf x)f(\mathbf x)\,{\rm d}\mathbf x,\quad\alpha(g)=\int\varphi(\mathbf x)g(\mathbf x)\,{\rm d}\mathbf x,\quad\widetilde f(\mathbf x)=\frac{\varphi(\mathbf x)f(\mathbf x)}{\alpha(f)},\quad\widetilde g(\mathbf x)=\frac{\varphi(\mathbf x)g(\mathbf x)}{\alpha(g)},\qquad(5.9)\]
and $D(\cdot\,\|\,\cdot)$ is the standard Kullback–Leibler divergence.

Theorem 5.3 (Weighted Kullback inequalities, cf. [10].) For given $\varphi$ and $f,g$ as above, the following bounds hold true. First, for a $1\times n$ vector $\zeta$,
\[K^{\rm w}_\varphi(f\,\|\,g)\ \geq\ \sup\Big[\frac{e_\varphi(f)\zeta^{\rm T}}{\alpha(f)}+\log\alpha(g)-\log M_g(\zeta):\ \zeta\in\mathbb R^n\Big],\qquad(5.10)\]
where
\[e_\varphi(f)=\int\varphi(\mathbf x)f(\mathbf x)\,\mathbf x\,{\rm d}\mathbf x,\qquad M_g(\zeta)=\int\varphi(\mathbf x)g(\mathbf x)\exp(\mathbf x\zeta^{\rm T})\,{\rm d}\mathbf x.\qquad(5.11)\]
Second,
\[D^{\rm w}_\varphi(f\,\|\,g)\ \geq\ \sup\Big[e_\varphi(f)\zeta^{\rm T}:\ \zeta\in\mathcal M\Big],\qquad(5.12)\]
where
\[\mathcal M=\Big\{\zeta:\ \int\varphi(\mathbf x)\big(f(\mathbf x)-g(\mathbf x)\exp(\mathbf x\zeta^{\rm T})\big)\,{\rm d}\mathbf x\ \geq\ 0\Big\}.\qquad(5.13)\]

Proof.
First, given $\zeta\in\mathbb R^n$, set $\widetilde G_\zeta(\mathbf x)=\displaystyle\frac{\varphi(\mathbf x)g(\mathbf x)\exp(\mathbf x\zeta^{\rm T})}{M_g(\zeta)}$. Following (5.11) and (5.8), obtain:
\[K^{\rm w}_\varphi(f\,\|\,g)=D(\widetilde f\,\|\,\widetilde G_\zeta)+\int\widetilde f(\mathbf x)\log\frac{\widetilde G_\zeta(\mathbf x)}{\widetilde g(\mathbf x)}\,{\rm d}\mathbf x\ \geq\ \int\widetilde f(\mathbf x)\log\frac{\alpha(g)\exp(\mathbf x\zeta^{\rm T})}{M_g(\zeta)}\,{\rm d}\mathbf x;\qquad(5.14)\]
the bound holds as $D(\widetilde f\,\|\,\widetilde G_\zeta)\geq 0$ by the Gibbs inequality for the standard Kullback–Leibler divergence. By taking the supremum, we arrive at (5.10).

Second, write:
\[G_\zeta(\mathbf x)=g(\mathbf x)\exp(\mathbf x\zeta^{\rm T})\quad\hbox{and}\quad D^{\rm w}_\varphi(f\,\|\,g)=D^{\rm w}_\varphi(f\,\|\,G_\zeta)+e_\varphi(f)\zeta^{\rm T}.\qquad(5.15)\]
For $\zeta\in\mathcal M$, the bound $D^{\rm w}_\varphi(f\,\|\,G_\zeta)\geq 0$ holds true (the weighted Gibbs inequality (1.3)). This yields (5.12). $\square$

An application of the weighted Kullback inequality is given in the next theorem, where we obtain another version of the weighted Cramér–Rao inequality.

Theorem 5.4 (A weighted Cramér–Rao inequality, version II; cf. [4], Theorem 11.10.1; [5], Theorem 20.)
Suppose we have a family of $1\times n$ random vectors $\mathbf X$, with PDFs $f_\theta(\mathbf x)$, $\mathbf x\in\mathbb R^n$, indexed by $\theta\in\mathbb R^m$. Suppose that $\displaystyle\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\to 1$ as $\varepsilon\to 0$, uniformly in $\mathbf x$. Let $\mathbf x\mapsto\varphi(\mathbf x)$ be a given WF. Denoting, as before, the expectation relative to $f_\theta$ by ${\mathbb E}_\theta$, set
\[\alpha(\theta)={\mathbb E}_\theta[\varphi(\mathbf X)],\quad e(\theta)={\mathbb E}_\theta[\varphi(\mathbf X)\mathbf X]\quad\hbox{and}\quad\widetilde C^{\rm w}_\varphi(\theta)=\frac1{\alpha(\theta)}\,{\mathbb E}_\theta\big[\varphi(\mathbf X)\mathbf X^{\rm T}\mathbf X\big]-e(\theta)^{\rm T}e(\theta).\qquad(5.16)\]
Under the assumptions needed to define the matrix $J^{\rm w}_\varphi(\mathbf X;\theta)$,
\[J^{\rm w}_\varphi(\mathbf X;\theta)\ \geq\ \Big(\frac{\partial e(\theta)}{\partial\theta}\Big)^{\rm T}\Big[\widetilde C^{\rm w}_\varphi(\theta)\Big]^{-1}\frac{\partial e(\theta)}{\partial\theta}+\alpha(\theta)^{-1}\Big(\frac{\partial\alpha(\theta)}{\partial\theta}\Big)^{\rm T}\frac{\partial\alpha(\theta)}{\partial\theta}.\qquad(5.17)\]

Proof. By definition (5.8), for $\varepsilon\in\mathbb R^m$,
\[K^{\rm w}_\varphi(f_{\theta+\varepsilon}\,\|\,f_\theta)=-\int\frac{\varphi(\mathbf x)f_{\theta+\varepsilon}(\mathbf x)}{\alpha(\theta+\varepsilon)}\log\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\,{\rm d}\mathbf x.\qquad(5.18)\]
Next, set $M(\theta,\zeta)={\mathbb E}_\theta\big\{\varphi(\mathbf X)\exp(\mathbf X\zeta^{\rm T})\big\}$ and
\[\Psi(\theta,\varepsilon)=\sup\Big[e(\theta+\varepsilon)\zeta^{\rm T}+\log\alpha(\theta)-\log M(\theta,\zeta):\ \zeta\in\mathbb R^n\Big].\qquad(5.19)\]
Then, owing to Theorem 5.3, we obtain:
\[K^{\rm w}_\varphi(f_{\theta+\varepsilon}\,\|\,f_\theta)\ \geq\ \Psi(\theta,\varepsilon).\qquad(5.20)\]
The LHS of (5.20) is, up to the factor $\alpha(\theta+\varepsilon)^{-1}$,
\[-\int\varphi(\mathbf x)f_{\theta+\varepsilon}(\mathbf x)\log\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\,{\rm d}\mathbf x=\int\varphi(\mathbf x)f_{\theta+\varepsilon}(\mathbf x)\bigg\{\Big[1-\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\Big]+\frac12\Big[1-\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\Big]^2+O\bigg(\Big[1-\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\Big]^3\bigg)\bigg\}\,{\rm d}\mathbf x.\qquad(5.21)\]
Here we have used the Taylor expansion of $\log(1+z)$. The first-order term disappears:
\[\int\varphi(\mathbf x)f_{\theta+\varepsilon}(\mathbf x)\Big[1-\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\Big]\,{\rm d}\mathbf x=\alpha(\theta+\varepsilon)-\alpha(\theta+\varepsilon)=0.\qquad(5.22)\]
Next, for small $\varepsilon$,
\[\frac12\int\varphi(\mathbf x)f_{\theta+\varepsilon}(\mathbf x)\Big[1-\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\Big]^2\,{\rm d}\mathbf x=\frac12\,\varepsilon\bigg[J^{\rm w}_\varphi(\mathbf X;\theta)-\frac1{\alpha(\theta)}\Big(\frac{\partial\alpha(\theta)}{\partial\theta}\Big)^{\rm T}\frac{\partial\alpha(\theta)}{\partial\theta}\bigg]\varepsilon^{\rm T}+o(\|\varepsilon\|^2).\qquad(5.23)\]
Finally, the remainder obeys
\[\lim_{\varepsilon\to 0}\frac1{\|\varepsilon\|^2}\int\varphi(\mathbf x)f_{\theta+\varepsilon}(\mathbf x)\,O\bigg(\Big[1-\frac{f_\theta(\mathbf x)\alpha(\theta+\varepsilon)}{f_{\theta+\varepsilon}(\mathbf x)\alpha(\theta)}\Big]^3\bigg)\,{\rm d}\mathbf x=0.\qquad(5.24)\]
For the RHS in (5.20), we take the point $\tau$ where the gradient vanishes:
\[\nabla_\zeta\Big[e(\theta+\varepsilon)\zeta^{\rm T}+\log\alpha(\theta)-\log M(\theta,\zeta)\Big]\Big|_{\zeta=\tau}=0,\quad\hbox{i.e.,}\quad e(\theta+\varepsilon)=\nabla_\zeta\log M(\theta,\zeta)\Big|_{\zeta=\tau}=\frac1{M(\theta,\tau)}\nabla_\zeta M(\theta,\zeta)\Big|_{\zeta=\tau}.\]
Consider the limit
\[\lim_{\varepsilon\to 0}\frac1{\|\varepsilon\|^2}\sup_{t\in\mathbb R^n}\Big\{t\,\mu_\varphi(\theta+\varepsilon)^{\rm T}-\Psi(t)\Big\}.\qquad(5.25)\]
Here $\Psi(t)=\log M(\theta,t)-\log\alpha(\theta)$ denotes the weighted cumulant-generating function for the PDF $\widetilde f_\theta$, and $\mu_\varphi(\theta)={\mathbb E}_\theta[\mathbf X\varphi(\mathbf X)]\big/{\mathbb E}_\theta\varphi(\mathbf X)$. The supremum is attained at a value $t=\tau=\tau(\varepsilon)$ where the first derivative of the weighted cumulant-generating function equals $\nabla_t\Psi(t)\big|_{t=\tau}=\mu_\varphi(\theta+\varepsilon)$. We also have $\nabla_t\Psi(0)=\mu_\varphi(\theta)$, and therefore
\[\nabla_t\nabla_t\Psi(0)\lim_{\varepsilon\to 0}\frac{\partial\tau}{\partial\varepsilon}=\frac{\partial}{\partial\theta}\mu_\varphi(\theta).\qquad(5.26)\]
It can also be seen that the Hessian satisfies
\[\nabla_t\nabla_t\Psi(0)=\frac{{\mathbb E}_\theta\big[\mathbf X^{\rm T}\mathbf X\,\varphi(\mathbf X)\big]}{{\mathbb E}_\theta[\varphi(\mathbf X)]}-\mu_\varphi(\theta)^{\rm T}\mu_\varphi(\theta):=V_\varphi(\mathbf X;\theta).\qquad(5.27)\]
In addition, by using the Taylor formula at an intermediate point between $\theta$ and $\theta+\varepsilon$,
\[\lim_{\varepsilon\to 0}\frac1{\|\varepsilon\|^2}\Big\{\tau\,\mu_\varphi(\theta+\varepsilon)^{\rm T}-\Psi(\tau)\Big\}=\frac12\Big(\frac{\partial}{\partial\theta}\mu_\varphi(\theta)\Big)\big[\nabla_t\nabla_t\Psi(0)\big]^{-1}\Big(\frac{\partial}{\partial\theta}\mu_\varphi(\theta)\Big)^{\rm T}.\qquad(5.28)\]
Returning to the RHS of (5.25), it becomes:
\[\lim_{\varepsilon\to 0}\frac1{\|\varepsilon\|^2}\Big\{\tau\,\mu_\varphi(\theta+\varepsilon)^{\rm T}-\Psi(\tau)\Big\}=\frac12\Big(\frac{\partial}{\partial\theta}\mu_\varphi(\theta)\Big)\big[V_\varphi(\mathbf X;\theta)\big]^{-1}\Big(\frac{\partial}{\partial\theta}\mu_\varphi(\theta)\Big)^{\rm T}.\qquad(5.29)\]
Now (5.29) gives the required result (5.17). $\square$

Remark 5.5
When ϕ ( x ) ≡ then α ( θ ) = 1 , e ( θ ) = E θ X , C w ϕ ( θ ) = e C w ϕ ( θ ) , and the two inequalities (5.4) and (5.17) coincide.In general, these inequalities competing; the question which inequality is stronger is not discussed inthis paper. We also note that both inequalities (5.4) and (5.17) lack a covariant property: multiplyingWF ϕ by a constant has a different impact on the left- and righ-hand sides. Acknowledgement
YS thanks the Office of the Rector, University of São Paulo (USP), for the financial support during the academic year 2013-4. YS thanks the Math Department, Penn State University, USA, for the hospitality and support during the academic years 2014-6. IS is supported by FAPESP Grant (process number 11/51845-5) and expresses her gratitude to the IMS, University of São Paulo, Brazil, and to the Math Department, University of Denver, USA, for the warm hospitality. SYS thanks the CAPES PNPD-UFSCar Foundation for the financial support in the year 2014. SYS thanks the Federal University of São Carlos, Department of Statistics, for hospitality during the year 2014. MK thanks the Higher School of Economics for the support in the framework of the Global Competitiveness Program.
References

[1] M. Belis and S. Guiasu. A quantitative-qualitative measure of information in cybernetic systems. IEEE Trans. on Inf. Theory (1968), 593–594.
[2] A. Clim. Weighted entropy with application. Analele Universităţii Bucureşti, Matematică, Anul LVII (2008), 223–231.
[3] T. Cover and J.A. Thomas. Elements of Information Theory. New York: Wiley, 2006.
[4] T.M. Cover and J.A. Thomas. Determinant inequalities via information theory. SIAM J. Matrix Anal. and its Applicat. (1988), 384–392.
[5] A. Dembo, T.M. Cover and J.A. Thomas. Information theoretic inequalities. IEEE Trans. Inform. Theory (1991), 1501–1518.
[6] A. Di Crescenzo and M. Longobardi. Entropy-based measure of uncertainty in past lifetime distributions. J. App. Prob. (2002), no. 3, 434–440.
[7] G. Dial and I.J. Taneja. On weighted entropy of type (α, β) and its generalizations. Appl. Math. (1981), 418–425.
[8] G. Frizelle and Y.M. Suhov. An entropic measurement of queueing behaviour in a class of manufacturing operations. Proc. Royal Soc. A (2001), 1579–1601.
[9] G. Frizelle and Y.M. Suhov. The measurement of complexity in production and other commercial systems. Proc. Royal Soc. A (2008), 2649–2668.
[10] A. Fuchs and G. Letta. L'inégalité de Kullback. Application à la théorie de l'estimation. Séminaire de probabilités 4. Strasbourg (1970), 108–131.
[11] S. Guiasu. Weighted entropy. Report on Math. Physics (1971), 165–179.
[12] K. Ito. Introduction to Probability Theory. Cambridge: Cambridge University Press, 1984.
[13] D.H. Johnson and R.M. Glantz. When does interval coding occur? Neurocomputing (2004), 13–18.
[14] P.L. Kannappan and P.K. Sahoo. On the general solution of a functional equation connected to sum form information measures on open domain. Math. Sci. (1986), 545–550.
[15] J.N. Kapur. Measures of Information and Their Applications. Chapter 17. New Delhi: Wiley Eastern Limited, 1994.
[16] K. Fan. On a theorem of Weyl concerning eigenvalues of linear transformations, I. Proc. Nat. Acad. USA (1949), 652–655.
[17] K. Fan. On a theorem of Weyl concerning eigenvalues of linear transformations, II. Proc. Nat. Acad. USA (1950), 31–35.
[18] K. Fan. Maximum properties and inequalities for the eigenvalues of completely continuous operators. Proc. Nat. Acad. USA (1951), 760–766.
[19] M. Kelbert and Y. Suhov. Continuity of mutual entropy in the limiting signal-to-noise ratio regimes. In: Stochastic Analysis. Berlin: Springer-Verlag (2010), 281–299.
[20] M. Kelbert and Y. Suhov. Information Theory and Coding by Example. Cambridge: Cambridge University Press, 2013.
[21] M. Moslehian. Ky Fan inequalities. arXiv:1108.1467, 2011.
[22] K. Muandet, S. Marukatat and C. Nattee. Query selection via weighted entropy in graph-based semi-supervised classification. In: Advances in Machine Learning. Lecture Notes in Computer Science (2009), pp. 278–292.
[23] O. Parkash and H.C. Taneja. Characterization of the quantitative-qualitative measure of inaccuracy for discrete generalized probability distributions. Commun. Statist. Theory Methods (1986), 3763–3771.
[24] B.D. Sharma, J. Mitter and M. Mohan. On measure of 'useful' information. Inform. Control (1978), 323–336.
[25] R.P. Singh and J.D. Bhardwaj. On parametric weighted information improvement. Inf. Sci. (1992), 149–163.
[26] A. Sreevally and S.K. Varma. Generating measure of cross entropy by using measure of weighted entropy. Soochow Journal of Mathematics (2004), no. 2, 237–243.
[27] A. Srivastava. Some new bounds of weighted entropy measures. Cybernetics and Information Technologies (2011), no. 3, 60–65.
[28] Y. Suhov, S. Yasaei Sekeh and M. Kelbert. Entropy-power inequality for weighted entropy. arXiv:1502.02188.
[29] Y. Suhov, I. Stuhl and S. Yasaei Sekeh. Weighted Gaussian entropy and determinant inequalities. arXiv:1505.01753.
[30] Y. Suhov, I. Stuhl and M. Kelbert. Weight functions and log-optimal investment portfolios. arXiv:1505.01437.
[31] R.K. Tuteja, Sh. Chaudhary and P. Jain. Weighted entropy of orders α and type β information energy. Soochow Journal of Mathematics (1993), no. 2, 129–138.
[32] R. Zamir. A proof of the Fisher information inequality via a data processing argument. IEEE Transactions on Information Theory, No. 3 (1998), 1246–1250.

Yuri Suhov: DPMMS, University of Cambridge, UK; Math Dept, Penn State University, PA, USA; IPIT RAS, Moscow, RF
Izabella Stuhl: IMS, University of São Paulo, Brazil; Math Dept, University of Denver, CO, USA; University of Debrecen, Hungary
Salimeh Yasaei Sekeh: Stat Dept, Federal University of São Carlos, Brazil