Extracting analytic proofs from numerically solved Shannon-type Inequalities
Ido B. Gattegno and Haim H. Permuter
I. INTRODUCTION
A class of information inequalities, called Shannon-type inequalities (STIs), can be proven by a computer program called ITIP [1]. In previous work [2], we showed how this technique can be utilized in a Fourier-Motzkin elimination algorithm for information-theoretic inequalities. Here, we provide an algorithm for extracting analytic proofs of information inequalities. Shannon-type inequalities are proven by solving an optimization problem; we show how to extract a formal proof from the numerical solution of such an inequality. Such a proof is useful when an inequality is implied by several constraints due to the PMF and the argument is not easily apparent. More complicated are cases where an inequality holds due to both constraints from the PMF and other constraints that arise from the statistical model. Such cases include information-theoretic capacity regions, rate-distortion functions and lossless compression rates. We begin with a formal definition of Shannon-type information inequalities. We then review the optimal solution of the optimization problem and show how to extract a proof that is readable to the user.
II. PRELIMINARIES AND NOTATIONS
We use the following notation. Calligraphic letters denote discrete sets, e.g., $\mathcal{X}$. The empty set is denoted by $\emptyset$, while $\mathcal{N}_n \triangleq \{1, 2, \ldots, n\}$ is a set of indices. Lowercase letters, e.g., $x$, represent variables. A vector of $n$ variables $(x_1, \ldots, x_n)$ is denoted by $x_{\mathcal{N}_n}$, and its substring by $x_\alpha = (x_i \mid i \in \alpha)$ for $\emptyset \neq \alpha \subseteq \mathcal{N}_n$, e.g., $x_{\{1,2\}} = (x_1, x_2)^\top$; whenever the dimensions are clear from the context, the subscript is omitted. Vector inequalities, e.g., $\mathbf{v} \geq \mathbf{0}$, are in the element-wise sense. Random variables are denoted by uppercase letters, e.g., $X$, with similar conventions for random vectors.

III. INFORMATION INEQUALITIES AND CONSTRAINTS
In [3], Yeung characterized a subset of information inequalities, named Shannon-type inequalities (STIs), that are provable using a computer program called ITIP [1]. Further work on ITIP was done in [4]. This section consists of a mathematical review of this work. We establish a canonical form for linear combinations of Shannon's information measures, which uniquely represents each such expression as a linear combination of joint entropies. By imposing non-negativity constraints on the information measures, we establish a region in which linear information inequalities reside. A theorem then casts the identification of true information inequalities as a minimization problem that can be solved by linear programming techniques.
A. Unconstrained inequalities
Given a random vector $X_{\mathcal{N}_n}$ that takes values in $\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$, define $h_\ell \triangleq \big(H(X_\alpha) \mid \emptyset \neq \alpha \subseteq \mathcal{N}_n\big)$. Let $\mathcal{P}$ be the set of all probability mass functions (PMFs) over $\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$. Moreover, for every $p \in \mathcal{P}$, $h_\ell(p) \in \mathbb{R}^{2^n-1}$ is the vector whose entries are the values of $H(X_\alpha)$, $\emptyset \neq \alpha \subseteq \mathcal{N}_n$, with respect to $p$.
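To make the object $h_\ell(p)$ concrete, the following is a minimal Python sketch (our own illustration; the function name and the choice of bits as the entropy unit are ours) that computes the $2^n - 1$ joint entropies of a joint PMF given as an $n$-dimensional array, ordered by subset size and then lexicographically (one concrete choice, consistent with (25a) later on):

import numpy as np
from itertools import combinations

def entropy_vector(pmf):
    """Return h_l(p): the joint entropies H(X_alpha), in bits, for every nonempty alpha."""
    pmf = np.asarray(pmf, dtype=float)
    n = pmf.ndim
    h = []
    for k in range(1, n + 1):
        for alpha in combinations(range(n), k):
            # marginalize over the variables outside alpha
            marg = pmf.sum(axis=tuple(i for i in range(n) if i not in alpha))
            probs = marg[marg > 0]
            h.append(float(-(probs * np.log2(probs)).sum()))
    return np.array(h)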
Definition 1 (Basic information measure (BIM))

An information measure is called basic if it takes on one of the following forms:
$H(X_\alpha \mid X_\gamma)$,   (1a)
$I(X_\alpha; X_\beta \mid X_\gamma)$,   (1b)
where $\alpha, \beta, \gamma \subseteq \mathcal{N}_n$ and $\alpha, \beta \neq \emptyset$.

Definition 2 (Elemental information measure (EIM))
An information measure is called elemental if it takes on one of the following forms:
$H(X_i \mid X_{\mathcal{N}_n \setminus \{i\}})$,   (2a)
$I(X_i; X_j \mid X_K)$,   (2b)
where $i, j \in \mathcal{N}_n$, $i \neq j$, and $K \subseteq \mathcal{N}_n \setminus \{i, j\}$.

Lemma 1
Every BIM can be represented as a linear combination of EIMs with non-negative coefficients.
By the definition of mutual information and by the entropy chain rule, for every $i, j \in \mathcal{N}_n$, $i \neq j$, and $K \subseteq \mathcal{N}_n \setminus \{i, j\}$, we have
$H(X_i \mid X_{\mathcal{N}_n \setminus \{i\}}) = H(X_{\mathcal{N}_n}) - H(X_{\mathcal{N}_n \setminus \{i\}})$,   (3a)
$I(X_i; X_j \mid X_K) = H(X_i, X_K) + H(X_j, X_K) - H(X_i, X_j, X_K) - H(X_K)$.   (3b)
Lemma 1 combined with (3) implies that every BIM is uniquely representable as a linear combination of unconditional joint entropies. This representation, which is called the canonical form, allows one to write every linear combination of BIMs as $b^\top h_\ell$, where $b$ is a vector of coefficients. We assume a lexicographic ordering of the elements of $h_\ell$. For the proof of uniqueness see [3, Section 13.2]. Henceforth, an arbitrary linear combination of BIMs is denoted by $b^\top h_\ell$.
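As a concrete instance (our own worked example for $n = 3$, under the lexicographic ordering assumed above), the BIM $I(X_1; X_2 \mid X_3)$ has the canonical form
$I(X_1; X_2 \mid X_3) = H(X_1, X_3) + H(X_2, X_3) - H(X_1, X_2, X_3) - H(X_3)$,
so the corresponding coefficient vector $b \in \mathbb{R}^{7}$ has $+1$ at the coordinates of $H(X_1, X_3)$ and $H(X_2, X_3)$, $-1$ at the coordinates of $H(X_3)$ and $H(X_1, X_2, X_3)$, and $0$ elsewhere.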
Definition 3

An information inequality $b^\top h_\ell \geq 0$ always holds if $b^\top h_\ell(p) \geq 0$ for every $p \in \mathcal{P}$.

Proposition 1
An information inequality $b^\top h_\ell \geq 0$ always holds if and only if (iff)
$\min_{p \in \mathcal{P}} b^\top h_\ell(p) = \min_{h_\ell(p) \in \Gamma_n^*} b^\top h_\ell(p) = 0$,   (4a)
where
$\Gamma_n^* = \bigcup_{p \in \mathcal{P}} \{h_\ell(p)\}$.   (4b)
Proposition 1 follows since there is always a $p \in \mathcal{P}$ for which $b^\top h_\ell(p) = 0$. The optimization problem in (4) is intractable, as it involves optimizing over the set of all PMFs of $n$ discrete random variables. Therefore, an algorithm that numerically proves information inequalities requires a simpler alternative description of $\Gamma_n^*$. Such a description is currently unknown, which leads one to search for a different subset of $\mathbb{R}^{2^n-1}$ that is, in a sense, similar to $\Gamma_n^*$ and based on which numerical proofs can be implemented.

Definition 4 (Basic and elemental inequalities)
Non-negativity inequalities on BIMs and EIMs are called basic inequalities (BIs) and elemental inequalities (EIs), respectively.
Every $h \in \Gamma_n^*$ is a vector of entropies that is induced by some $p \in \mathcal{P}$ and, in particular, satisfies all BIs. Since BIs are linear constraints on $h$, in [3],
$\Gamma_n \triangleq \{h \in \mathbb{R}^{2^n-1} \mid h \text{ satisfies all BIs}\}$   (5)
was proposed as an alternative to $\Gamma_n^*$.

Lemma 2 (Minimality of elemental inequalities)
The set of EIs is minimal in the sense that every BI is implied by a subset of the EIs.
Remark 1
There are $n + \binom{n}{2} 2^{n-2}$ EIs, while the number of BIs is bounded from below by $\sum_{j=1}^{n} \binom{n}{j}\Big(2^{n-j} + \sum_{i=1}^{n-j}\binom{n-j}{i} 2^{n-j-i}\Big)$.

Based on Remark 1 and Lemma 2, we write
$\Gamma_n = \{h \in \mathbb{R}^{2^n-1} \mid \mathrm{G} h \geq 0\}$,   (6)
where $\mathrm{G}$ is a matrix such that the entries of $\mathrm{G} h_\ell$ are all the EIMs.
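To make the construction of $\mathrm{G}$ concrete, here is a minimal Python sketch (our own, not the authors' software; the coordinate ordering of $h_\ell$ and all names are our choices) that enumerates the EIMs of the forms (2a) and (2b) and stacks their canonical coefficient rows:

from itertools import combinations

def entropy_index(n):
    """Map each nonempty subset of {1, ..., n} to a coordinate of h_l (ordered by size, then lexicographically)."""
    subsets = []
    for k in range(1, n + 1):
        subsets.extend(combinations(range(1, n + 1), k))
    return {s: i for i, s in enumerate(subsets)}

def elemental_matrix(n):
    """Rows are the canonical-form coefficients of the EIMs (2a) and (2b), so that G @ h lists all EIMs."""
    idx = entropy_index(n)
    full = tuple(range(1, n + 1))
    rows = []
    # Type (2a): H(X_i | X_{N\{i}}) = H(X_N) - H(X_{N\{i}})
    for i in full:
        row = [0.0] * len(idx)
        row[idx[full]] += 1
        rest = tuple(j for j in full if j != i)
        if rest:
            row[idx[rest]] -= 1
        rows.append(row)
    # Type (2b): I(X_i; X_j | X_K) = H(X_{iK}) + H(X_{jK}) - H(X_{ijK}) - H(X_K)
    for i, j in combinations(full, 2):
        others = [k for k in full if k not in (i, j)]
        for r in range(len(others) + 1):
            for K in combinations(others, r):
                row = [0.0] * len(idx)
                row[idx[tuple(sorted((i,) + K))]] += 1
                row[idx[tuple(sorted((j,) + K))]] += 1
                row[idx[tuple(sorted((i, j) + K))]] -= 1
                if K:
                    row[idx[K]] -= 1
                rows.append(row)
    return rows

For $n = 2$ this produces exactly the three rows of (25c) below.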
Theorem 1

Let $b^\top h_\ell \geq 0$ be an information inequality, and let
$\rho^* = \min_{h : \mathrm{G} h \geq 0} b^\top h$.   (7)
If $\rho^* = 0$, then $b^\top h_\ell \geq 0$ always holds.

The proof of Theorem 1 follows from Proposition 1, since every $h \in \Gamma_n^*$ satisfies all EIs, which implies that $\Gamma_n^* \subseteq \Gamma_n$. The optimization problem (7) is solvable using linear programming (LP) optimization methods [5]. Information inequalities that are provable by Theorem 1 form a subset of inequalities called unconstrained Shannon-type inequalities (STIs).
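As an illustration of Theorem 1, the following sketch (ours; it assumes SciPy's HiGHS-based linprog and the elemental_matrix helper sketched after (6)) solves (7) directly. Since $\Gamma_n$ is a cone, the LP value is either $0$ (the inequality is an unconstrained STI) or the problem is unbounded (it is not provable this way):

import numpy as np
from scipy.optimize import linprog

def is_unconstrained_sti(b, n):
    """Check Theorem 1: is b^T h >= 0 implied by the elemental inequalities G h >= 0?"""
    G = np.array(elemental_matrix(n))   # helper sketched after (6)
    res = linprog(c=b, A_ub=-G, b_ub=np.zeros(G.shape[0]),   # -G h <= 0  <=>  G h >= 0
                  bounds=[(None, None)] * (2 ** n - 1), method="highs")
    # Gamma_n is a cone, so the optimum is either 0 (an STI) or the LP is unbounded (not provable).
    return res.status == 0 and abs(res.fun) < 1e-9

# Example: I(X1; X2) >= 0, i.e., H(X1) + H(X2) - H(X1,X2) >= 0, with n = 2:
print(is_unconstrained_sti(np.array([1.0, 1.0, -1.0]), 2))   # True
# H(X1) >= H(X1,X2) is not an STI; here the LP is unbounded and the check returns False:
print(is_unconstrained_sti(np.array([1.0, 0.0, -1.0]), 2))   # False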
B. Constrained STIs

Some information inequalities (respectively, identities) hold only when a certain structure is imposed on the PMF. Such structure may account for independencies between random variables, Markov chains and functional dependencies. We formulate these constraints on the domain of PMFs as linear constraints on entropy and mutual information terms.
Lemma 3
1) Let $\{X_i\}_{i=1}^n$ be a collection of random variables. $\{X_i\}_{i=1}^n$ are mutually independent iff $H(X_{\mathcal{N}_n}) = \sum_{i=1}^n H(X_i)$.
2) $\{X_i\}_{i=1}^n$ are pairwise independent iff $I(X_i; X_j) = 0$ for every $1 \leq i \neq j \leq n$.
3) Let $\alpha, \beta \subseteq \mathcal{N}_n$ and $\alpha, \beta \neq \emptyset$. $X_\alpha$ is a function of $X_\beta$ iff $H(X_\alpha \mid X_\beta) = 0$.
4) Let $\alpha, \beta, \gamma, \delta \subseteq \mathcal{N}_n$ and $\alpha, \beta, \gamma, \delta \neq \emptyset$. $X_\alpha - X_\beta - X_\gamma - X_\delta$ forms a Markov chain iff $I(X_\alpha; X_\gamma, X_\delta \mid X_\beta) = 0$ and $I(X_\alpha, X_\beta; X_\delta \mid X_\gamma) = 0$.

Every set of constraints as in Lemma 3 is representable by
$\mathrm{Q} h_\ell = 0$,   (8)
where $\mathrm{Q}$ is a matrix whose rows are the coefficients that correspond to each constraint.

Proposition 2
Given a joint PMF of discrete random variables, the constraints induced by reading its factorization define all probabilistic relations between the random variables.
For instance, given a PMF $P(X, Y, Z)$ where $X - Y - Z$ forms a Markov chain, we can write
$P(X, Y, Z) = P(X)\, P(Y \mid X)\, \underbrace{P(Z \mid Y)}_{H(Z \mid Y) = H(Z \mid Y, X)}$.   (9)
The induced constraint is equivalent to $I(X; Z \mid Y) = 0$, which implies that $X - Y - Z$ forms a Markov chain. By Lemma 3, the set of all such constraints characterizes the probabilistic structure of the joint PMF. Note that $\Gamma_2^* = \Gamma_2$ but $\Gamma_3^* \neq \Gamma_3$ [3, Section 15.1]. Property 4 of Lemma 3 is an extension of [3, Section 13.3.2] and can be generalized to account for longer Markov chains.
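A row of $\mathrm{Q}$ is simply the canonical-form coefficient vector of the information measure that the constraint sets to zero. The following minimal sketch (our own illustration, reusing the entropy_index helper and coordinate ordering introduced earlier) builds such a row for a constraint $I(X_\alpha; X_\beta \mid X_\gamma) = 0$, using the identity $I(X_\alpha; X_\beta \mid X_\gamma) = H(X_{\alpha \cup \gamma}) + H(X_{\beta \cup \gamma}) - H(X_{\alpha \cup \beta \cup \gamma}) - H(X_\gamma)$:

def cmi_row(alpha, beta, gamma, n):
    """Canonical-form coefficients of I(X_alpha; X_beta | X_gamma) over the 2^n - 1 joint entropies."""
    idx = entropy_index(n)  # helper from the sketch after (6)
    row = [0.0] * len(idx)
    def add(subset, coeff):
        if subset:  # the entropy of the empty set is zero and has no coordinate
            row[idx[tuple(sorted(subset))]] += coeff
    add(set(alpha) | set(gamma), +1.0)
    add(set(beta) | set(gamma), +1.0)
    add(set(alpha) | set(beta) | set(gamma), -1.0)
    add(set(gamma), -1.0)
    return row

# Example: the constraint I(X1; X3 | X2) = 0 induced by the Markov chain in (9), with n = 3,
# becomes a single row of Q:
Q = [cmi_row({1}, {3}, {2}, 3)]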
Theorem 2
Let $b^\top h_\ell \geq 0$ be an information inequality and
$\rho^* = \min_{h :\, \mathrm{G} h \geq 0,\ \mathrm{Q} h = 0} b^\top h$.   (10)
If $\rho^* = 0$, then $b^\top h_\ell \geq 0$ always holds under the constraints $\mathrm{Q} h_\ell = 0$.

Constrained information inequalities that are captured by Theorem 2 are called constrained STIs.
IV. OPTIMIZATION PROBLEMS AND OPTIMAL SOLUTION
A. The Lagrangian dual function
An LP problem is a special case of a convex optimization problem. The reader may refer to [6] for further study of convex optimization (here we use lemmas from Chapter 5). Consider an optimization problem in the standard form
minimize: $f_0(\mathbf{x})$   (11a)
subject to: $f_i(\mathbf{x}) \leq 0$, $i = 1, \ldots, m$   (11b)
$h_i(\mathbf{x}) = 0$, $i = 1, \ldots, p$,   (11c)
with $\mathbf{x} \in \mathbb{R}^n$. The function $f_0(\mathbf{x})$ is called the objective function. We define $\mathcal{D}$ to be the domain of this problem,
$\mathcal{D} \triangleq \Big(\bigcap_{i=0}^{m} \mathrm{dom}(f_i)\Big) \cap \Big(\bigcap_{i=1}^{p} \mathrm{dom}(h_i)\Big)$,
where $\mathrm{dom}(\cdot)$ is the domain of its argument. We assume a nonempty domain, i.e., $\mathcal{D} \neq \emptyset$, and that an optimal solution to this problem exists.

Definition 5 (The Lagrangian)
The Lagrangian $L : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ associated with the problem in (11) is
$L(\mathbf{x}, \lambda, \nu) \triangleq f_0(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i f_i(\mathbf{x}) + \sum_{i=1}^{p} \nu_i h_i(\mathbf{x})$,   (12)
where $\mathbf{x} \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^m$ and $\nu \in \mathbb{R}^p$. We refer to $\lambda_i$ as the Lagrange multiplier of the $i$-th inequality constraint and to $\nu_i$ as the Lagrange multiplier of the $i$-th equality constraint.

Definition 6 (Lagrangian dual function)
The Lagrangian dual function $g : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ is the infimum of the Lagrangian over $\mathbf{x}$:
$g(\lambda, \nu) \triangleq \inf_{\mathbf{x} \in \mathcal{D}} L(\mathbf{x}, \lambda, \nu)$.

Lemma 4
For any $\lambda \geq 0$ and any $\nu$,
$g(\lambda, \nu) \leq p^*$,   (13)
where $p^*$ is the optimal value of the problem in (11).

Let $d^*$ be the optimal value of the following optimization problem:
maximize: $g(\lambda, \nu)$ subject to: $\lambda \geq 0$.   (14)
Note that since the problem in (11) is convex, (14) is also convex. We refer to $\lambda^*, \nu^*$ as the optimal Lagrange multipliers and to $d^*$ as the dual optimal value of the problem in (14). In general, by Lemma 4 we know that $d^* \leq p^*$. Furthermore, under specific conditions equality is achieved.

B. Strong duality
When $d^* = p^*$, the optimal values of the dual and the primal problems coincide. If that is the case, we say that strong duality holds. We assume here that all equality constraints are affine. Thus, the equality constraints in (11) can be replaced with $\mathrm{A}\mathbf{x} = \mathbf{b}$.

Definition 7 (Slater's condition (for affine constraints))

Assume that $f_i(\mathbf{x})$ are affine functions of $\mathbf{x}$ for $i = 1, \ldots, k$. If there exists $\mathbf{x} \in \mathrm{relint}(\mathcal{D})$, where $\mathrm{relint}(\mathcal{D})$ is the relative interior of $\mathcal{D}$, such that
$f_i(\mathbf{x}) \leq 0$, $i = 1, \ldots, k$,   (15a)
$f_i(\mathbf{x}) < 0$, $i = k + 1, \ldots, m$,   (15b)
$\mathrm{A}\mathbf{x} = \mathbf{b}$,   (15c)
then we say that Slater's condition holds.

Lemma 5

If Slater's condition holds, then strong duality holds.
Remark 2
In LP problems, all constraints are affine, and therefore Slater's condition reduces to requiring a point that satisfies all constraints with weak inequality. Consequently, if the LP problem is feasible, then Slater's condition holds and we have strong duality. In many information-theoretic problems, the inequalities are affine functions of the entropies.
Consider an LP problem of the form
minimize: $c^\top \mathbf{x}$   (16a)
subject to: $\mathrm{A}\mathbf{x} = \mathbf{b}$   (16b)
$\mathrm{B}\mathbf{x} \leq \mathbf{d}$.   (16c)
The Lagrangian of this problem is
$L(\mathbf{x}, \lambda, \nu) = c^\top \mathbf{x} + \lambda^\top(\mathrm{B}\mathbf{x} - \mathbf{d}) + \nu^\top(\mathrm{A}\mathbf{x} - \mathbf{b})$,   (17)
and the dual function is
$g(\lambda, \nu) = \inf_{\mathbf{x}} L(\mathbf{x}, \lambda, \nu)$   (18a)
$= (-\mathbf{b}^\top \nu - \mathbf{d}^\top \lambda) + \inf_{\mathbf{x}} \{(c + \mathrm{B}^\top \lambda + \mathrm{A}^\top \nu)^\top \mathbf{x}\}$,   (18b)
subject to $\lambda \geq 0$.

Lemma 6 (Optimal Lagrange multipliers of an LP problem)
If a solution to an LP problem with linear constraints exists, then
$\mathrm{A}^\top \nu^* + \mathrm{B}^\top \lambda^* + c = 0$.   (19)
The proof of Lemma 6 follows directly from the definition of the dual function, since
$g(\lambda, \nu) = \begin{cases} -\mathbf{b}^\top \nu - \mathbf{d}^\top \lambda, & \mathrm{A}^\top \nu + \mathrm{B}^\top \lambda + c = 0 \\ -\infty, & \text{otherwise.} \end{cases}$   (20)
Recall that since Slater's condition holds, $g(\lambda^*, \nu^*) = p^*$. Consequently, using the optimal Lagrange multipliers, we can represent the linear objective function as a linear combination of the constraints.
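Numerically, the optimal Lagrange multipliers of Lemma 6 can be read off directly from an LP solver. The following sketch (ours; it assumes SciPy's HiGHS-based linprog, and the sign mapping for the reported dual values is our reading of the SciPy documentation) solves an LP of the form (16) and recovers $\lambda^*$ and $\nu^*$, checking (19) numerically:

import numpy as np
from scipy.optimize import linprog

def lp_with_multipliers(c, A, b, B, d):
    """Solve (16): minimize c^T x subject to A x = b and B x <= d; return (x*, lambda*, nu*)."""
    res = linprog(c=c, A_ub=B, b_ub=d, A_eq=A, b_eq=b,
                  bounds=[(None, None)] * len(c), method="highs")
    if res.status != 0:
        raise RuntimeError(res.message)
    # HiGHS reports dual values as "marginals" (sensitivities of the optimum w.r.t. the
    # right-hand sides). With the Lagrangian sign convention of (17), lambda* and nu* are
    # the negatives of these marginals; this mapping is an assumption worth double-checking.
    lam = -res.ineqlin.marginals   # multipliers of B x <= d (nonnegative)
    nu = -res.eqlin.marginals      # multipliers of A x = b
    # Numerical check of (19): A^T nu* + B^T lambda* + c = 0.
    assert np.allclose(np.asarray(A).T @ nu + np.asarray(B).T @ lam + np.asarray(c), 0.0, atol=1e-8)
    return res.x, lam, nu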
V. EXTRACTING A FORMAL PROOF FROM THE OPTIMAL SOLUTION
Recall from Section III-B that the non-negativity of a linear combination of information measures can be proven by solving an LP problem. Assume we want to prove the following inequality in the canonical form:
$f_L^\top h_\ell(p) \leq f_R^\top h_\ell(p)$, $p \in \mathcal{Q}$,   (21)
where $\mathcal{Q}$ is the subset of $\mathcal{P}$ on which the constraints due to the PMF hold. Define
$f_D \triangleq f_R - f_L$.   (22)
The corresponding LP problem we solve in order to check whether (21) is an STI is
minimize: $f_D^\top h$   (23a)
subject to: $-\mathrm{G} h \leq 0$   (23b)
$\mathrm{Q} h = 0$,   (23c)
where $\mathrm{G} h_\ell \geq 0$ represents the elemental inequalities and $\mathrm{Q} h_\ell = 0$ the constraints due to the PMF. Note that both the inequality and the equality constraints in (23) are affine. If a solution to this problem exists, then by Lemma 6 we have
$f_D = \mathrm{G}^\top \lambda^* - \mathrm{Q}^\top \nu^*$.   (24)
Thus, we can represent the coefficients of the objective by rows of $\mathrm{G}$ and $\mathrm{Q}$. Define $\mathrm{G}_\ell(p)$ and $\mathrm{Q}_\ell(p)$ to be vectors with labels corresponding to $\mathrm{G}$ and $\mathrm{Q}$, respectively; the labels in the components are the information measures represented by the rows of the corresponding matrix. For instance, assume a case where there are only two variables, $(X_1, X_2)$. By our definitions,
$h_\ell(p)^\top = [H(X_1), H(X_2), H(X_1, X_2)]$,   (25a)
$\mathrm{G}_\ell^\top = [H(X_1 \mid X_2), H(X_2 \mid X_1), I(X_1; X_2)]$,   (25b)
$\mathrm{G} = \begin{bmatrix} 0 & -1 & 1 \\ -1 & 0 & 1 \\ 1 & 1 & -1 \end{bmatrix}$.   (25c)
Note that $h_\ell^\top(p) f_D$ is the canonical form of the RHS minus the LHS of the inequality we aim to prove. Similarly, the components of $h_\ell^\top(p)\mathrm{G}^\top$ and $h_\ell^\top(p)\mathrm{Q}^\top$ are the canonical forms of $\mathrm{G}_\ell^\top(p)$ and $\mathrm{Q}_\ell^\top(p)$, respectively. From (24) we have
$h_\ell^\top(p)(f_R - f_L) = h_\ell^\top(p)\mathrm{G}^\top \lambda^* - h_\ell^\top(p)\mathrm{Q}^\top \nu^*$   (26a)
$= \mathrm{G}_\ell^\top(p)\lambda^* - \mathrm{Q}_\ell^\top(p)\nu^*$.   (26b)
We thus obtain a representation of the difference between the RHS and the LHS in terms of elemental inequalities and PMF constraints. From this representation, proving the inequality is more apparent: since $\lambda^* \geq 0$ and all elemental information measures are nonnegative, $\mathrm{G}_\ell^\top(p)\lambda^*$ is nonnegative; as for $\mathrm{Q}_\ell^\top(p)\nu^*$, it is a sum of information measures that each equal zero due to the constraints (e.g., Markov chains) induced by the PMF. We refer to this representation as the elemental form of the expression. Equality between the original representation and the elemental form can be shown either by using information-theoretic identities or by showing that the canonical forms of both are the same.
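Putting the pieces together, a minimal sketch of the extraction step (our own illustration, building on the helper functions sketched in Sections III and IV; label strings, tolerances and function names are our choices, not the authors' implementation) is:

import numpy as np

def extract_proof(f_D, G, Q, G_labels, Q_labels, tol=1e-9):
    """Print the elemental form (26b) of the target expression, using the multipliers of (23)."""
    # (23) is (16) with c = f_D, B = -G, d = 0, A = Q, b = 0; lp_with_multipliers is the
    # sketch from Section IV, and elemental_matrix / cmi_row can supply G, Q and their labels.
    _, lam, nu = lp_with_multipliers(c=f_D, A=Q, b=np.zeros(len(Q)),
                                     B=-np.asarray(G, float), d=np.zeros(len(G)))
    terms = [f"{l:+.3g}*{name}" for l, name in zip(lam, G_labels) if l > tol]
    terms += [f"{-v:+.3g}*{name}" for v, name in zip(nu, Q_labels) if abs(v) > tol]
    print("RHS - LHS = " + " ".join(terms))
    print("(each G-term is a nonnegative elemental measure; each Q-term is zero under the PMF constraints)")

With $n = 4$, $f_D$ the canonical form of $I(B;C) - I(A;D)$, $\mathrm{G}$ given by elemental_matrix(4) and $\mathrm{Q}$ built from the constraints (28)-(29) below via cmi_row, this prints a decomposition of the type shown in (31) in the next section; the particular optimal multipliers, and hence the exact decomposition, may depend on the solver.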
[Fig. 1: Demonstration of the algorithm. The interface takes the random variables A, B, C, D, the joint PMF factorization p(a,b)p(c|b)p(d|c), and the expression to prove, I(B;C) - I(A;D), and displays the extracted output.]

VI. AN ILLUSTRATION
We provide an illustration of how the algorithm works and of the extracted proof. Let $(A, B, C, D)$ be random variables such that $A \leftrightarrow B \leftrightarrow C \leftrightarrow D$ is a Markov chain. The PMF of these variables factorizes as
$P(a, b, c, d) = P(a, b)\, P(c \mid b)\, P(d \mid c)$.   (27)
Following Proposition 2, we obtain the constraints
$I(C; A \mid B) = 0$,   (28)
$I(D; A, B \mid C) = 0$.   (29)
Since $A \leftrightarrow B \leftrightarrow C \leftrightarrow D$ is a Markov chain,
$I(A; D) \leq I(B; C)$   (30)
always holds under this PMF factorization. Using the canonical form of the RHS minus the LHS, it can be verified that
$I(B; C) - I(A; D) = I(A; C \mid D) + I(B; C \mid A) + I(B; D \mid A, C) - I(C; A \mid B) - I(D; A, B \mid C)$.   (31)
This representation clarifies that $I(B; C) - I(A; D) \geq 0$: the first three terms on the right-hand side are nonnegative, and the last two vanish because of the Markov chain constraints (28) and (29). An implementation of this algorithm is demonstrated in Fig. 1.

REFERENCES

[1] R. W. Yeung and Y.-O. Yan, "ITIP: Information Theoretic Inequality Prover," software available online (~ITIP/).
[2] I. B. Gattegno, Z. Goldfeld, and H. H. Permuter, "Fourier-Motzkin elimination software for information theoretic inequalities," IEEE Information Theory Society Newsletter, arXiv:1610.03990; software available online (~fmeit/).
[3] R. W. Yeung, Information Theory and Network Coding. Springer Science & Business Media, 2008.
[4] E. Perron, R. Pulikkoonattu, and S. Diggavi, "Xitip: Information theoretic inequalities prover," http://xitip.epfl.ch/.
[5] A. Schrijver, Theory of Linear and Integer Programming. John Wiley & Sons, 1998.
[6] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.