New identities for the Shannon-Wiener entropy function with applications
Aiden A. Bruen*
Adjunct Research Professor, Carleton University, [email protected]

*The author gratefully acknowledges the financial support of the Natural Sciences and Engineering Research Council of Canada.
Abstract
We show how the two-variable entropy function H(p, q) = p log(1/p) + q log(1/q) can be expressed as a linear combination of entropy functions symmetric in p and q involving quotients of polynomials in p, q of degree n, for any n ≥ 2.

Keywords: entropy, Binary Symmetric Channel, Shannon, Wiener, identities

2020 MSC: 94A15 Information theory, 94A17 Measures of information, entropy, 94A60 Cryptography
In Renyi's A Diary on Information Theory [8], he points out that the entropy formula, i.e.,
\[ H(X) = p_1 \log \frac{1}{p_1} + p_2 \log \frac{1}{p_2} + \cdots + p_N \log \frac{1}{p_N}, \]
where logs are to the base 2, was arrived at, independently, by Claude Shannon and Norbert Wiener in 1948. This famous formula was the revolutionary precursor of the information age. Renyi goes on to say that

this formula had already appeared in the work of Boltzmann, which is why it is also called the Boltzmann-Shannon formula. Boltzmann arrived at this formula in connection with a completely different problem. Almost half a century before Shannon he gave essentially the same formula to describe entropy in his investigations of statistical mechanics. He showed that if, in a gas containing a large number of molecules, the probabilities of the possible states of the individual molecules are p_1, p_2, ..., p_N, then the entropy of the system is H = c(p_1 log(1/p_1) + p_2 log(1/p_2) + ... + p_N log(1/p_N)), where c is a constant. (In statistical mechanics the natural logarithm is used and not base 2 ...). The entropy of a physical system is the measure of its disorder ...

Since 1948 there have been many advances in information theory and entropy. A well-known paper of Dembo, Cover and Thomas is devoted to inequalities in information theory. Here, we concentrate on equalities. We show how a Shannon function H(p, q) can be expanded in infinitely many ways in an infinite series of functions, each of which is a linear combination of Shannon functions of the type H(f(p), g(q)), where f, g are quotients of polynomials of degree n for any n ≥ 2. Apart from its intrinsic interest, this new result gives insight into the algorithm in Section 6 for constructing a common secret key between two communicating parties; see also [2], [3], [4], [6].
Recall that a binary symmetric channel has input and output symbols drawn from {0, 1}. We say that there is a common probability q = 1 - p of any symbol being transmitted incorrectly, independently for each transmitted symbol, 0 ≤ p ≤ 1. The channel matrix is
\[ P = \begin{pmatrix} p & q \\ q & p \end{pmatrix}. \]
Again, p is the probability of success, i.e., p denotes the probability that 0 is transmitted to 0 and also the probability that 1 gets transmitted to 1. The second extension P^{(2)} of P has alphabet {00, 01, 10, 11} and channel matrix
\[ P^{(2)} = \begin{pmatrix} p^2 & pq & qp & q^2 \\ pq & p^2 & q^2 & qp \\ qp & q^2 & p^2 & pq \\ q^2 & qp & pq & p^2 \end{pmatrix} = \begin{pmatrix} pP & qP \\ qP & pP \end{pmatrix}. \]
An alternative way to think of the n-th extension C^{(n)} of a channel C (see Welsh [9]) is to regard it as n copies of C acting independently and in parallel. See Figure 1.

Let us assume also that, for C, the input probability of 0 and the input probability of 1 are both equal to 1/2.

Figure 1: n copies of C acting independently and in parallel.
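The n-th extension is simply the n-fold Kronecker product of the channel matrix with itself, so it can be generated mechanically. The following sketch (Python with NumPy; the function name extension and the value p = 0.7 are our own choices, not taken from the cited sources) builds P^{(n)} and checks the block decomposition displayed above for n = 2.

```python
import numpy as np

def extension(P, n):
    """Channel matrix of the n-th extension: the n-fold Kronecker product of P."""
    M = np.array([[1.0]])
    for _ in range(n):
        M = np.kron(M, P)
    return M

p, q = 0.7, 0.3
P = np.array([[p, q],
              [q, p]])

P2 = extension(P, 2)                      # rows/columns in the order 00, 01, 10, 11
block = np.block([[p * P, q * P],
                  [q * P, p * P]])
assert np.allclose(P2, block)             # P^(2) = (pP qP; qP pP)
assert np.allclose(P2.sum(axis=1), 1.0)   # each row is a probability distribution
print(P2)
```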
Theorem 2.1. Let X = (X_1, ..., X_n) and Y = (Y_1, ..., Y_n) denote an input-output pair for C^{(n)}. Then
(a) H(X) = H(X_1, ..., X_n) = n.
(b) H(Y) = H(Y_1, ..., Y_n) = n.
(c) H(X | Y) is equal to nH(p, q).
(d) The capacity of C^{(n)} is n(1 - H(p, q)).

Proof. (a) Since, by definition, the X_i are independent, H(X) = H(X_1) + H(X_2) + ... + H(X_n). We have
\[ H(X_i) = -\Bigl[\Pr(X_i = 0)\log \Pr(X_i = 0) + \Pr(X_i = 1)\log \Pr(X_i = 1)\Bigr] = -\Bigl[\tfrac12 \log\tfrac12 + \tfrac12 \log\tfrac12\Bigr] = -\log\tfrac12 = -[\log 1 - \log 2] = \log 2 = 1. \]
Thus H(X) = n.

(b) The value of Y_i only depends on X_i. The X_i are independent. Thus Y_1, Y_2, ..., Y_n are independent. For any Y_i we have
\[ \Pr(Y_i = 0) = \Pr(X_i = 0)\,p + \Pr(X_i = 1)\,q = \tfrac12(p + q) = \tfrac12. \]
Also Pr(Y_i = 1) = 1/2. Then, as for X_i, H(Y_i) = 1. Thus H(Y) = H(Y_1, Y_2, ..., Y_n) = \sum_i H(Y_i) = n.

(c) We have H(X) - H(X | Y) = H(Y) - H(Y | X). Since H(X) = H(Y) = n we have H(X | Y) = H(Y | X). Now
\[ H(Y \mid X) = \sum_{x} \Pr(x)\, H(Y \mid X = x), \]
where x denotes a given value of the random vector X. Since the channel is memoryless,
\[ H(Y \mid X = x) = \sum_i H(Y_i \mid X = x) = \sum_i H(Y_i \mid X_i = x_i). \]
The last step needs a little work; see [5, Exercise 4.10], [7], [3], or the example below for details. Then
\[ H(Y \mid X) = \sum_x \Pr(x) \sum_i H(Y_i \mid X_i = x_i) = \sum_i \sum_u H(Y_i \mid X_i = u)\, \Pr(X_i = u). \]
Thus
\[ H(Y \mid X) = \sum_{i=1}^{n} H(Y_i \mid X_i) = nH(p, q) = H(X \mid Y). \]

Example. Let n = 2 and (x_1, x_2) = (0, 0). Then (y_1, y_2) = (0, 0) or (0, 1) or (1, 0) or (1, 1). Using independence we get that the corresponding entropy term is
\[ -\bigl[p^2 \log p^2 + pq \log pq + qp \log qp + q^2 \log q^2\bigr]. \]
This simplifies to -2[p log p + q log q] = 2H(p, q). Note that the probability that (x_1, x_2) = (0, 0) is 1/4.

(d) The capacity of C is the maximum value, over all inputs, of H(X) - H(X | Y). Since X is random, the input probability of a 1 or 0 is 0.5. This input distribution maximizes H(X) - H(X | Y) for C, the maximum value being 1 - H(p, q). Then the capacity of C^{(n)} is n(1 - H(p, q)). It represents the information about X conveyed by Y or the amount of information about Y conveyed by X. ∎
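As a numerical sanity check of the theorem (not part of the original argument), the sketch below computes H(X), H(Y) and H(X | Y) for C^{(n)} by brute force from the joint distribution of (X, Y), assuming uniform inputs, and compares H(X | Y) with nH(p, q). The helper names are ours.

```python
import itertools
import math

def H(*probs):
    """Shannon entropy (base 2); zero probabilities contribute nothing."""
    return -sum(x * math.log2(x) for x in probs if x > 0)

def channel_entropies(p, n):
    """H(X), H(Y), H(X|Y) for the n-th extension of a BSC with success probability p."""
    q = 1 - p
    words = list(itertools.product([0, 1], repeat=n))
    joint = {}                                     # Pr(X = x, Y = y)
    for x in words:
        for y in words:
            flips = sum(a != b for a, b in zip(x, y))
            joint[(x, y)] = (0.5 ** n) * (q ** flips) * (p ** (n - flips))
    HX = H(*(sum(joint[(x, y)] for y in words) for x in words))
    HY = H(*(sum(joint[(x, y)] for x in words) for y in words))
    HXY = H(*joint.values())                       # joint entropy H(X, Y)
    return HX, HY, HXY - HY                        # H(X|Y) = H(X, Y) - H(Y)

p, n = 0.7, 3
HX, HY, HXgivenY = channel_entropies(p, n)
print(HX, HY)                                      # both equal n = 3
print(HXgivenY, n * H(p, 1 - p))                   # both equal nH(p, q)
```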
An Entropy Equality

First we need some additional discussion on entropy.
A. Extending a basic result.
A fundamental result for random variables
X, Y is that H(X) + H(Y | X) = H(Y) + H(X | Y). A corresponding argument may be used to establish similar identities involving more than two random variables. For example,
\[ H(X, Y, Z) = H(X) + H(Y \mid X) + H(Z \mid X, Y) = H(X, Y) + H(Z \mid X, Y). \]
Also H(X, Y, Z) = H(X) + H(Y, Z | X).
B. From Random Variables to Random Vectors.

For any random variable X taking only a finite number of values with probabilities p_1, p_2, ..., p_n such that \sum p_i = 1 and p_i > 0 (1 ≤ i ≤ n), we define the entropy of X using the Shannon formula
\[ H(X) = -\sum_{k=1}^{n} p_k \log p_k = \sum_{k=1}^{n} p_k \log \frac{1}{p_k}. \]
Analogously, if X is a random vector which takes only a finite number of values u_1, u_2, ..., u_m, we define its entropy by the formula
\[ H(X) = -\sum_{k=1}^{m} \Pr(u_k) \log \Pr(u_k). \]
For example, when X is a two-dimensional random vector, say X = (U, V) with p_{ij} = Pr(U = a_i, V = b_j), then we can write
\[ H(X) = H(U, V) = -\sum_{i,j} p_{ij} \log p_{ij}. \]
Note that \sum_{i,j} p_{ij} = 1.

More generally, if X_1, X_2, ..., X_m is a collection of random variables, each taking only a finite set of values, then we can regard X = (X_1, X_2, ..., X_m) as a random vector taking a finite set of values, and we define the joint entropy of X_1, ..., X_m by
\[ H(X_1, X_2, \ldots, X_m) = H(X) = -\sum \Pr(X_1 = x_1, \ldots, X_m = x_m) \log \Pr(X_1 = x_1, \ldots, X_m = x_m). \]
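To make these definitions concrete, here is a small sketch (ours, not taken from the paper) that computes joint and conditional entropies directly from a joint probability table and verifies the identity H(X, Y, Z) = H(X) + H(Y | X) + H(Z | X, Y) from part A on a randomly chosen distribution of three binary random variables.

```python
import itertools
import math
import random

def entropy(dist):
    """Shannon entropy (base 2) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, keep):
    """Marginal distribution over the coordinates listed in `keep`."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def conditional_entropy(joint, target, given):
    """H(target | given), computed directly as sum_b Pr(b) H(target | given = b)."""
    total = 0.0
    for b, p_b in marginal(joint, given).items():
        cond = {}
        for outcome, p in joint.items():
            if tuple(outcome[i] for i in given) == b:
                key = tuple(outcome[i] for i in target)
                cond[key] = cond.get(key, 0.0) + p / p_b
        total += p_b * entropy(cond)
    return total

# A random joint distribution for three binary random variables (X, Y, Z).
random.seed(1)
weights = [random.random() for _ in range(8)]
joint = {xyz: w / sum(weights)
         for xyz, w in zip(itertools.product([0, 1], repeat=3), weights)}

lhs = entropy(joint)                                   # H(X, Y, Z)
rhs = (entropy(marginal(joint, [0]))                   # H(X)
       + conditional_entropy(joint, [1], [0])          # H(Y | X)
       + conditional_entropy(joint, [2], [0, 1]))      # H(Z | X, Y)
print(lhs, rhs)                                        # the two values agree
```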
C. The Grouping Axiom for Entropy.

This axiom or identity can shorten calculations. It reads as follows ([9, p. 2], [1, p. 8], [3, Section 9.6]).

Let p = p_1 + p_2 + ... + p_m and q = q_1 + q_2 + ... + q_n, where each p_i and q_j is non-negative. Assume that p, q are positive with p + q = 1. Then
\[ H(p_1, \ldots, p_m, q_1, \ldots, q_n) = H(p, q) + p\,H\!\left(\frac{p_1}{p}, \ldots, \frac{p_m}{p}\right) + q\,H\!\left(\frac{q_1}{q}, \ldots, \frac{q_n}{q}\right). \]
For example, suppose m = 1, so p = p_1. Then we get
\[ H(p_1, q_1, q_2, \ldots, q_n) = H(p, q) + q\,H\!\left(\frac{q_1}{q}, \ldots, \frac{q_n}{q}\right). \]
This is because p_1 H(p_1/p_1) = p_1 H(1) = p_1 (1 log 1) = p_1 (0) = 0.
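A quick numerical check of the grouping axiom (a sketch; the helper H and the particular probabilities are ours):

```python
import math

def H(*probs):
    """Shannon entropy (base 2); zero probabilities contribute nothing."""
    return -sum(x * math.log2(x) for x in probs if x > 0)

p_parts = [0.10, 0.25, 0.15]         # p = p_1 + p_2 + p_3 = 0.5
q_parts = [0.30, 0.20]               # q = q_1 + q_2       = 0.5
p, q = sum(p_parts), sum(q_parts)

lhs = H(*(p_parts + q_parts))
rhs = (H(p, q)
       + p * H(*[x / p for x in p_parts])
       + q * H(*[x / q for x in q_parts]))
print(lhs, rhs)                      # the two values agree
```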
Theorem 3.1. Let X, Y, Z be random vectors such that H(Z | X, Y) = 0. Then
(a) H(X | Y) = H(X, Z | Y).
(b) H(X | Y) = H(X | Y, Z) + H(Z | Y).

Proof.
H(X | Y) = H(X, Y) - H(Y)
= H(X, Y, Z) - H(Z | X, Y) - H(Y)
= H(X, Y, Z) - H(Y)   [since H(Z | X, Y) = 0]
= H(Y) + H(X, Z | Y) - H(Y)
= H(X, Z | Y),
proving (a).

For (b),
H(X | Y) = H(X, Z, Y) - H(Y)   [from (a)]
= H(X, Z, Y) - H(Y, Z) + H(Y, Z) - H(Y)
= H(X | Y, Z) + H(Z | Y). ∎
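Theorem 3.1 is also easy to test numerically. In the sketch below (ours), Z = X + Y (mod 2) is a deterministic function of (X, Y), so H(Z | X, Y) = 0, and both parts of the theorem are checked on a random joint distribution of (X, Y).

```python
import itertools
import math
import random

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, keep):
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def cond(joint, a, b):
    """H(A | B) = H(A, B) - H(B); A and B are lists of coordinate indices."""
    return entropy(marginal(joint, a + b)) - entropy(marginal(joint, b))

# Random joint distribution of (X, Y); Z = X + Y (mod 2) is determined by (X, Y).
random.seed(2)
w = [random.random() for _ in range(4)]
joint = {(x, y, x ^ y): w[2 * x + y] / sum(w)
         for x, y in itertools.product([0, 1], repeat=2)}

X, Y, Z = [0], [1], [2]                                    # coordinate indices
print(cond(joint, Z, X + Y))                               # H(Z | X, Y) = 0
print(cond(joint, X, Y), cond(joint, X + Z, Y))            # part (a): the values agree
print(cond(joint, X, Y),
      cond(joint, X, Y + Z) + cond(joint, Z, Y))           # part (b): the values agree
```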
The New Identities

We will use the above identity, i.e.,
\[ H(X \mid Y) = H(X \mid Y, Z) + H(Z \mid Y), \tag{4.1} \]
which holds under the assumption that H(Z | X, Y) = 0. We begin with arrays
\[ A = a_1 a_2 \ldots a_n, \qquad B = b_1 b_2 \ldots b_n, \]
where n is even. We assume that A, B are random binary strings subject to the condition that, for each i, we have Pr(a_i = b_i) = p. We also assume that the events {(a_i = b_i)} form an independent set. We divide A, B into blocks of size 2.

To start, put X = (x_1, x_2), Y = (y_1, y_2), Z = x_1 + x_2 (addition mod 2).

Lemma 4.2. H(Z | X, Y) = 0.

Proof.
We want to calculate \sum_{x, y} H(Z | x, y) Pr(X = x, Y = y). Given x, y, say x = (α_1, α_2), y = (β_1, β_2), the value of Z is α_1 + α_2. There is no uncertainty in the value of Z given x, y, i.e., each term in the above sum for H is H(1, 0) = 0. Therefore H(Z | X, Y) = 0. ∎

From this we can use formula (4.1) for this block of size two. We can think of a channel from X to Y (or from Y to X) which is the second extension of a binary symmetric channel where p is the probability of success. We have
\[ H(X \mid Y) = H(X \mid Y, Z) + H(Z \mid Y). \]
From Theorem 2.1 part (c) the left side, i.e., H(X | Y), is equal to 2H(p, q). Next we calculate the right side, beginning with H(Z | Y), i.e., H(Z | (y_1, y_2)). We have
\[ H(Z \mid Y) = H\bigl(Z \mid (y_1 + y_2 = x_1 + x_2)\bigr)\Pr(y_1 + y_2 = x_1 + x_2) + H\bigl(Z \mid (y_1 + y_2 \ne x_1 + x_2)\bigr)\Pr(y_1 + y_2 \ne x_1 + x_2). \]
We know that
Pr(x_1 + x_2 = y_1 + y_2) = p^2 + q^2 and Pr(x_1 + x_2 ≠ y_1 + y_2) = 1 - (p^2 + q^2) = 2pq, since p + q = 1. From the standard formula we have
\[ H(Z \mid Y) = (p^2 + q^2)\log\frac{1}{p^2 + q^2} + 2pq \log\frac{1}{2pq}, \]
since H(Z | Y) = H(p^2 + q^2, 2pq).

Next we calculate
\[ H\bigl((x_1, x_2) \mid (y_1, y_2), (x_1 + x_2)\bigr) = H(X \mid Y, Z). \]
Again we have two possibilities, i.e., y_1 + y_2 = x_1 + x_2 and y_1 + y_2 ≠ x_1 + x_2. The corresponding probabilities are p^2 + q^2 and 2pq respectively. We obtain
\[ H(X \mid Y, Z) = (p^2 + q^2)\, H\!\left(\frac{p^2}{p^2 + q^2}, \frac{q^2}{p^2 + q^2}\right) + 2pq\, H\!\left(\frac{pq}{2pq}, \frac{pq}{2pq}\right). \]
This comes about from the facts that
(a) If y_1 + y_2 = x_1 + x_2 then we either have y_1 = x_1 and y_2 = x_2, or y_1 = 1 - x_1, y_2 = 1 - x_2.
(b) If y_1 + y_2 ≠ x_1 + x_2 then either y_1 = x_1 and y_2 ≠ x_2, or y_1 ≠ x_1 and y_2 = x_2.
(c) H(1/2, 1/2) = 1.
Then from equation (4.1) we have our first identity, as follows:
\[ 2H(p, q) = (p^2 + q^2)\, H\!\left(\frac{p^2}{p^2 + q^2}, \frac{q^2}{p^2 + q^2}\right) + 2pq + H(p^2 + q^2,\; 2pq). \tag{4.3} \]
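Identity (4.3) is easy to check numerically; a minimal sketch (ours) follows.

```python
import math

def H(*probs):
    """Shannon entropy (base 2) of a probability vector."""
    return -sum(x * math.log2(x) for x in probs if x > 0)

for p in (0.55, 0.7, 0.9, 0.99):
    q = 1 - p
    s = p * p + q * q                     # probability that the block parity is preserved
    lhs = 2 * H(p, q)
    rhs = s * H(p * p / s, q * q / s) + 2 * p * q + H(s, 2 * p * q)
    print(p, lhs, rhs)                    # lhs and rhs agree for every p
```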
Blocks of Size Three.

Here X = (x_1, x_2, x_3), Y = (y_1, y_2, y_3), Z = x_1 + x_2 + x_3. As in Lemma 4.2 we have H(Z | X, Y) = 0, so we can use formula (4.1) again, i.e.,
\[ H(X \mid Y) = H(X \mid Y, Z) + H(Z \mid Y). \]
We have a channel from X to Y (or from Y to X) which is the third extension C^{(3)} of a binary symmetric channel C, where p is the probability that 0 (or 1) is transmitted to itself.

From Theorem 2.1 we have H(X | Y) = 3H(p, q).

Similar to the case of blocks of size 2, we have H(Z | Y) = H(p^3 + 3pq^2, q^3 + 3qp^2). This is because the probabilities that Z = y_1 + y_2 + y_3 or Z ≠ y_1 + y_2 + y_3 are, respectively, p^3 + 3pq^2 or q^3 + 3qp^2, as follows. If Z = y_1 + y_2 + y_3, then either x_1 = y_1, x_2 = y_2, x_3 = y_3 or else, for some i, 1 ≤ i ≤ 3, x_i = y_i and, for the other two indices j, k, x_j ≠ y_j and x_k ≠ y_k. A similar analysis can be carried out for the case where Z ≠ y_1 + y_2 + y_3.

We then get H(X | Y, Z) = f(p, q) + f(q, p), where
\[ f(p, q) = (p^3 + 3pq^2)\, H\!\left(\frac{p^3}{p^3 + 3pq^2}, \frac{pq^2}{p^3 + 3pq^2}, \frac{pq^2}{p^3 + 3pq^2}, \frac{pq^2}{p^3 + 3pq^2}\right). \]
We now use the grouping axiom for m = 1. The p_1 in the formula refers to p^3/(p^3 + 3pq^2) here, and the q there is now 3pq^2/(p^3 + 3pq^2). Then
\[ f(p, q) = (p^3 + 3pq^2)\left\{ H\!\left(\frac{p^3}{p^3 + 3pq^2}, \frac{3pq^2}{p^3 + 3pq^2}\right) + \frac{3pq^2}{p^3 + 3pq^2}\, H\!\left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right) \right\} = (p^3 + 3pq^2)\, H\!\left(\frac{p^3}{p^3 + 3pq^2}, \frac{3pq^2}{p^3 + 3pq^2}\right) + 3pq^2 \log 3. \]
f(q, p) is obtained by interchanging p with q. We note that, since p + q = 1, 3pq^2 log 3 + 3qp^2 log 3 = 3pq log 3.

From working with blocks of size 3 we get
\[ 3H(p, q) = H(p^3 + 3pq^2,\; q^3 + 3qp^2) + (p^3 + 3pq^2)\, H\!\left(\frac{p^3}{p^3 + 3pq^2}, \frac{3pq^2}{p^3 + 3pq^2}\right) + (q^3 + 3qp^2)\, H\!\left(\frac{q^3}{q^3 + 3qp^2}, \frac{3qp^2}{q^3 + 3qp^2}\right) + 3pq \log 3. \tag{4.4} \]
For blocks of size 2, formula (4.3) can be put in a more compact form in terms of capacities, namely,
\[ 2\bigl(1 - H(p, q)\bigr) = \Bigl[1 - H(p^2 + q^2,\; 2pq)\Bigr] + (p^2 + q^2)\left[1 - H\!\left(\frac{p^2}{p^2 + q^2}, \frac{q^2}{p^2 + q^2}\right)\right]. \tag{4.5} \]
Using the same method we can find a formula analogous to formulae (4.3), (4.4) for obtaining nH(p, q) as a linear combination of terms of the form H(u, v), where u, v involve terms in p^n, p^{n-1}q, ..., q^n, q^{n-1}p, ..., plus extra terms such as the 3pq log 3 in formula (4.4).

The result of Theorem 2.1 can be extended to the more general case where we take the product of n binary symmetric channels, even if the channel matrices are different, corresponding to differing p-values. As an example, suppose we use the product of 2 binary symmetric channels with channel matrices
\[ \begin{pmatrix} p_1 & q_1 \\ q_1 & p_1 \end{pmatrix}, \qquad \begin{pmatrix} p_2 & q_2 \\ q_2 & p_2 \end{pmatrix}. \]
Then the argument in Section 4 goes through. To avoid being overwhelmed by symbols we make a provisional notation change (Notation 5.1 below).
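Before turning to the case of unequal channel probabilities, here is a quick numerical check of identities (4.4) and (4.5), in the same style as the check of (4.3) above (a sketch; the helper names are ours).

```python
import math

def H(*probs):
    """Shannon entropy (base 2) of a probability vector."""
    return -sum(x * math.log2(x) for x in probs if x > 0)

for p in (0.55, 0.7, 0.9):
    q = 1 - p

    # Identity (4.4): blocks of size three.
    a = p**3 + 3 * p * q**2               # Pr(parity of the block is preserved)
    b = q**3 + 3 * q * p**2               # a + b = 1
    rhs4 = (H(a, b)
            + a * H(p**3 / a, 3 * p * q**2 / a)
            + b * H(q**3 / b, 3 * q * p**2 / b)
            + 3 * p * q * math.log2(3))
    print(3 * H(p, q), rhs4)              # the two values agree

    # Identity (4.5): the capacity form of identity (4.3).
    s = p * p + q * q
    rhs5 = (1 - H(s, 2 * p * q)) + s * (1 - H(p * p / s, q * q / s))
    print(2 * (1 - H(p, q)), rhs5)        # the two values agree
```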
Notation 5.1.
We denote by h(p) the quantity H(p, q) = p log(1/p) + q log(1/q).

Then
\[ h(p_1) + h(p_2) = h(p_1 p_2 + q_1 q_2) + (p_1 p_2 + q_1 q_2)\, h\!\left(\frac{p_1 p_2}{p_1 p_2 + q_1 q_2}\right) + (p_1 q_2 + p_2 q_1)\, h\!\left(\frac{p_1 q_2}{p_1 q_2 + p_2 q_1}\right). \tag{5.2} \]
Similarly to the above we can derive a formula for h(p_1) + h(p_2) + ... + h(p_n).

The above method of using blocks of various sizes is reminiscent of the algorithm for the key exchange in [4], which relates to earlier work in [2], [6] and others. Indeed, the identities above were informed by the details of the algorithm.

The algorithm starts with two arrays (a_1, ..., a_n) and (b_1, ..., b_n). We assume that the set of events {a_i = b_i} is an independent set with p = Pr(a_i = b_i). We subdivide A, B into corresponding sub-blocks of size t, where t divides n. Exchanging parities by public discussion we end up with new shorter sub-arrays A_1, B_1, where the probabilities of corresponding entries being equal are independent with probability p_1 > p. Eventually, after m iterations, we end up with a common secret key A_m = B_m.

Let us take an example. Start with two binary arrays (a_1, ..., a_n) and (b_1, ..., b_n) of length n with n even, n = 2t, say. We subdivide the arrays into corresponding blocks of size 2. If the two blocks are (a_1, a_2) and (b_1, b_2) we discard those blocks if the parities disagree. If the parities agree, which happens with probability p^2 + q^2, we keep a_1 and b_1, discarding a_2 and b_2. Thus, on average, we keep (p^2 + q^2)(n/2) partial blocks and discard [1 - (p^2 + q^2)](n/2) blocks of size 2.

Let us suppose n = 100 and p = 0.7. From Theorem 2.1 part (d), the information that Y has about X, i.e., that B has about A, is 100[1 - H(0.7, 0.3)] ≈ 100 - 88.13 ≈ 11.87 Shannon bits.

We are seeking to find a sub-array of
A, B such that corresponding bits are equal. Our method is to publicly exchange parities. The length of this secret key will be at most 11.

Back to the algorithm. A, B keep on average (50)(p^2 + q^2) blocks of size 2, i.e., (50)(0.58) = 29 blocks of size 2. A and B remove the bottom element of each block. We are left with 29 pairs of elements (a_1, b_1). The probability that a_1 = b_1, given that a_1 + a_2 = b_1 + b_2, is p^2/(p^2 + q^2), i.e., (0.7)^2 / [(0.7)^2 + (0.3)^2] = 0.49/0.58 ≈ 0.845. We have 1 - H(0.845, 0.155) ≈ 1 - 0.62 ≈ 0.38.

To recapitulate: initially we had 100 pairs (a_i, b_i) with Pr(a_i = b_i) = 0.7. The information revealed to B by A is (100)(1 - H(0.7, 0.3)) ≈ 11.87 Shannon bits of information. After the first step of the algorithm we are left with 29 pairs (a_j, b_j) with Pr(a_j = b_j) ≈ 0.845. The information revealed to B by the remnant of A is 29(0.38) ≈ 11.02 Shannon bits of information. So we have "wasted" less than 1 bit, i.e. the wastage is about 8%. Mathematically we have
\[ 100\bigl(1 - H(0.7,\, 0.3)\bigr) = 29\bigl(1 - H(0.845,\, 0.155)\bigr) + \text{Wastage}. \]
In general we have
\[ n\,[1 - H(p, q)] = \frac{n}{2}(p^2 + q^2)\left[1 - H\!\left(\frac{p^2}{p^2 + q^2}, \frac{q^2}{p^2 + q^2}\right)\right] + W, \]
where W denotes the wastage. Dividing by n/2 we get
\[ 2\,[1 - H(p, q)] = (p^2 + q^2)\left[1 - H\!\left(\frac{p^2}{p^2 + q^2}, \frac{q^2}{p^2 + q^2}\right)\right] + \frac{2W}{n}. \]
Comparing with formula (4.5) we see that
\[ W = \frac{n}{2}\Bigl[1 - H(p^2 + q^2,\; 2pq)\Bigr]. \]
In this case W = 50[1 - H(0.58, 0.42)] ≈ 50 - 49.07 ≈ 0.93. To sum up, then, the new identities tell us exactly how much information was wasted and not utilized.
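The accounting above can also be checked by simulation. The sketch below (ours; the parameter values just repeat the example) runs the first block-of-2 parity round on random strings with Pr(a_i = b_i) = 0.7 and reports the average number of kept positions, the agreement probability among the kept bits, and the information figures discussed above.

```python
import math
import random

def H(*probs):
    return -sum(x * math.log2(x) for x in probs if x > 0)

def first_round(a, b):
    """Keep the first bit of each size-2 block whose parity agrees in A and B."""
    kept_a, kept_b = [], []
    for i in range(0, len(a), 2):
        if (a[i] ^ a[i + 1]) == (b[i] ^ b[i + 1]):     # parities exchanged publicly
            kept_a.append(a[i])
            kept_b.append(b[i])
    return kept_a, kept_b

random.seed(0)
p, n, trials = 0.7, 100, 20000
q = 1 - p
kept_total = agree_total = 0
for _ in range(trials):
    a = [random.randint(0, 1) for _ in range(n)]
    b = [x if random.random() < p else 1 - x for x in a]
    ka, kb = first_round(a, b)
    kept_total += len(ka)
    agree_total += sum(x == y for x, y in zip(ka, kb))

s = p * p + q * q
print(kept_total / trials, (n / 2) * s)                # kept positions: about 29
print(agree_total / kept_total, p * p / s)             # agreement probability: about 0.845
print(n * (1 - H(p, q)),                               # about 11.87 bits before the round
      (n / 2) * s * (1 - H(p * p / s, q * q / s)),     # about 10.9 bits after the round
      (n / 2) * (1 - H(s, 2 * p * q)))                 # wastage W, about 0.93 bits
```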
The identities also tell us, in conjunction with the algorithm, the optimum size of the sub-blocks at each stage.

One of the original motivations for work in coding theory was that Shannon's fundamental theorem showed that capacity is the bound for accurate communication, but the problem was to construct linear codes, or other codes such as turbo codes, that came close to the bound. Here we have an analogous situation. In the example just considered, the maximum length of a cryptographic common secret key obtained as a common subset of
A, B is bounded by n(1 - H(p, q)). The problem is to find algorithms which produce such a common secret key coming close to this Shannon bound of n(1 - H(p, q)).

This work nicely illustrates the inter-connections between codes, cryptography, and information theory. Information theory tells us the bound. The identities tell us the size of the sub-blocks for constructing a common secret key which attains, or gets close to, the information theory bound. Coding theory is then used to ensure that the two communicating parties have a common secret key by using the hash function described in the algorithm using a code C. Error correction can ensure that the common secret key can be obtained without using another round of the algorithm (thereby shortening the common key) if the difference between the keys of A and B is less than the minimum distance of the dual code of C. This improves on the standard method of checking parities of random subsets of the keys A_m, B_m at the last stage.
Concluding Remarks.

1. Please see Chapter 25 of the forthcoming
Cryptography, Information Theory, and Error-Correction: A Handbook for the Twenty-First Century, by Bruen, Forcinito, and McQuillan [3], for background information, additional details, and related material.

2. Standard tables for entropy list values to two decimal places. When p is close to 1, interpolation for three decimal places is difficult, as h(p) is very steeply sloped near p = 1. Formula (4.3) may help, since p^2 + q^2 is less than p when p > 1/2, and the formula can be re-iterated.

3. In [4] the emphasis is on the situation where the eavesdropper has no initial information. The case where the eavesdropper has initial information is discussed in [6]. In Section 7 of [4] the quoted result uses Renyi entropy rather than Shannon entropy.

4. The methods in this note suggest possible generalizations which we do not pursue here.
Acknowledgement: The author thanks Drs. Mario Forcinito, James McQuillan and David Wehlau for their help and encouragement with this work.
References

[1] R. B. Ash. Information Theory. Dover, 1990.

[2] C. H. Bennett and G. Brassard. Quantum cryptography: Public key distribution and coin tossing. In International Conference on Computers, Systems & Signal Processing, December 1984.

[3] A. A. Bruen, M. A. Forcinito, and J. M. McQuillan. Cryptography, Information Theory, and Error-Correction: A Handbook for the Twenty-First Century. Wiley, second edition, 2021. To appear.

[4] A. A. Bruen, D. L. Wehlau, and M. Forcinito. Error correcting codes, block designs, perfect secrecy and finite fields. Acta Applicandae Mathematica, 93, September 2006.

[5] G. A. Jones and J. M. Jones. Information and Coding Theory. Springer, 2000.

[6] U. Maurer and S. Wolf. Privacy amplification secure against active adversaries. In Advances in Cryptology, Proceedings of Crypto '97, pages 307-321. Springer-Verlag, August 1997.

[7] R. J. McEliece. The Theory of Information and Coding. Addison-Wesley, 1978.

[8] A. Renyi. A Diary on Information Theory. Wiley, 1987.

[9] D. Welsh.