Simple Channel Coding Bounds
Ligong Wang
Signal and Information Processing Laboratory, ETH Zurich, Switzerland — [email protected]
Roger Colbeck
Institute for Theoretical Physics and Institute of Theoretical Computer Science, ETH Zurich, Switzerland — [email protected]
Renato Renner
Institute for Theoretical Physics, ETH Zurich, Switzerland — [email protected]
Abstract—New channel coding converse and achievability bounds are derived for a single use of an arbitrary channel. Both bounds are expressed using a quantity called the "smooth 0-divergence", which is a generalization of Rényi's divergence of order 0. The bounds are also studied in the limit of large block-lengths. In particular, they combine to give a general capacity formula which is equivalent to the one derived by Verdú and Han.

I. INTRODUCTION
We consider the problem of transmitting information through a channel. A channel consists of an input alphabet $\mathcal{X}$, an output alphabet $\mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are each equipped with a $\sigma$-algebra, and a channel law, which is a stochastic kernel $P_{Y|X}$ from $\mathcal{X}$ to $\mathcal{Y}$. We consider average error probabilities throughout this paper; thus an $(m,\epsilon)$-code consists of an encoder $f: \{1,\ldots,m\} \to \mathcal{X},\ i \mapsto x$, and a decoder $g: \mathcal{Y} \to \{1,\ldots,m\},\ y \mapsto \hat{i}$, such that the probability that $i \neq \hat{i}$ is smaller than or equal to $\epsilon$, assuming that the message is uniformly distributed. Our aim is to derive upper and lower bounds on the largest $m$, given $\epsilon > 0$, such that an $(m,\epsilon)$-code exists for a given channel.

Such bounds are different from those in Shannon's original work [1] in the sense that they are nonasymptotic and do not rely on any channel structure such as memorylessness or information stability.

Previous works have demonstrated the advantages of such nonasymptotic bounds. They can lead to more general channel capacity formulas [2] as well as giving tight approximations to the maximal rate achievable for a desired error probability and a fixed block-length [3].

In this paper we prove a new converse bound and a new achievability bound. They are asymptotically tight in the sense that they combine to give a general capacity formula that is equivalent to [2, (1.4)]. We are mainly interested in proving simple bounds which offer theoretical intuition into channel coding problems. It is not our main concern to derive bounds which outperform the existing ones in estimating the largest achievable rates in finite block-length scenarios. In fact, as will be seen in Section VI, the new achievability bound is less tight than the one in [3], though the differences are small. (Note that Shannon's method of obtaining codes that have small maximum error probabilities from those that have small average error probabilities [1] can be applied to our codes. We shall not examine other such methods which might lead to tighter bounds for finite block-lengths.)
Both new bounds are expressed using a quantity which we call the smooth 0-divergence, denoted $D_0^\delta(\cdot\|\cdot)$, where $\delta$ is a nonnegative parameter. This quantity is a generalization of Rényi's divergence of order 0 [4]. Thus, our new bounds demonstrate connections between the channel coding problem and Rényi's divergence of order 0. Various previous works [5], [6], [7] have shown connections between channel coding and Rényi's information measures of order $\alpha$ for $\alpha \geq 0$. Also relevant is [8], where channel coding bounds were derived using the smooth min- and max-entropies introduced in [9].

As will be seen, proofs of the new bounds are simple and self-contained. The achievability bound uses random coding and suboptimal decoding, where the decoding rule can be thought of as a generalization of Shannon's joint typicality decoding rule [1]. The converse is proved by simple algebra combined with the fact that $D_0^\delta(\cdot\|\cdot)$ satisfies a Data Processing Theorem.

The quantity $D_0^\delta(\cdot\|\cdot)$ has also been defined for quantum systems [10], [11]. In [11] the present work is extended to quantum communication channels.

The remainder of this paper is arranged as follows: in Section II we introduce the quantity $D_0^\delta(\cdot\|\cdot)$; in Section III we state and prove the converse theorem; in Section IV we state and prove the achievability theorem; in Section V we analyze the bounds asymptotically for an arbitrary channel to study its capacity and $\epsilon$-capacity; finally, in Section VI we compare numerical results obtained using our new achievability bound with some existing bounds.

II. THE QUANTITY $D_0^\delta(\cdot\|\cdot)$

In [4] Rényi defined entropies and divergences of order $\alpha$ for every $\alpha > 0$. We denote these $H_\alpha(\cdot)$ and $D_\alpha(\cdot\|\cdot)$, respectively. They are generalizations of Shannon's entropy $H(\cdot)$ and relative entropy $D(\cdot\|\cdot)$. Letting $\alpha$ tend to zero in $D_\alpha(\cdot\|\cdot)$ yields the following definition of $D_0(\cdot\|\cdot)$.

Definition 1 (Rényi's Divergence of Order 0): For $P$ and $Q$, two probability measures on $(\Omega, \mathcal{F})$, $D_0(P\|Q)$ is defined as

$D_0(P\|Q) = -\log \int_{\mathrm{supp}(P)} dQ$,   (1)

where we use the convention $\log 0 = -\infty$.

We generalize $D_0(\cdot\|\cdot)$ to define $D_0^\delta(\cdot\|\cdot)$ as follows.

Definition 2 (Smooth 0-Divergence): Let $P$ and $Q$ be two probability measures on $(\Omega, \mathcal{F})$. For $\delta \geq 0$, $D_0^\delta(P\|Q)$ is defined as

$D_0^\delta(P\|Q) = \sup_{\substack{\Phi:\,\Omega \to [0,1] \\ \int_\Omega \Phi\, dP \geq 1-\delta}} \left\{ -\log \int_\Omega \Phi\, dQ \right\}$.   (2)
Remark: To achieve the supremum in (2), one should choose $\Phi$ to be large (equal to 1) where $\frac{dP}{dQ}$ is large, and vice versa.

Lemma 1 (Properties of $D_0^\delta(\cdot\|\cdot)$):
1) $D_0^\delta(P\|Q)$ is monotonically nondecreasing in $\delta$.
2) When $\delta = 0$, the supremum in (2) is achieved by choosing $\Phi$ to be 1 on $\mathrm{supp}(P)$ and 0 elsewhere, which yields $D_0^0(P\|Q) = D_0(P\|Q)$.
3) If $P$ has no point masses, then the supremum in (2) is achieved by letting $\Phi$ take values in $\{0,1\}$ only, and

$D_0^\delta(P\|Q) = \sup_{P':\, \|P'-P\| \leq \delta} D_0(P'\|Q)$.
4) (Data Processing Theorem) Let $P$ and $Q$ be probability measures on $(\Omega, \mathcal{F})$, and let $W$ be a stochastic kernel from $(\Omega, \mathcal{F})$ to $(\Omega', \mathcal{F}')$. For all $\delta \geq 0$, we have

$D_0^\delta(P\|Q) \geq D_0^\delta(W \circ P \,\|\, W \circ Q)$,   (3)

where $W \circ P$ denotes the probability distribution on $(\Omega', \mathcal{F}')$ induced by $P$ and $W$, and similarly for $W \circ Q$.

Proof:
The first three properties are immediate consequences of the definition and the remark. We therefore only prove 4).

For any $\Phi': \Omega' \to [0,1]$ such that

$\int_{\Omega'} \Phi'\, d(W \circ P) \geq 1-\delta$,

we choose $\Phi: \Omega \to \mathbb{R}$ to be

$\Phi(\omega) = \int_{\Omega'} \Phi'(\omega')\, W(d\omega'|\omega), \quad \omega \in \Omega$.

Then we have that $\Phi(\omega) \in [0,1]$ for all $\omega \in \Omega$. Further,

$\int_\Omega \Phi\, dP = \int_{\Omega'} \Phi'\, d(W \circ P) \geq 1-\delta$,

$\int_\Omega \Phi\, dQ = \int_{\Omega'} \Phi'\, d(W \circ Q)$.

Thus we have
$\sup_{\substack{\Phi:\,\Omega \to [0,1] \\ \int_\Omega \Phi\, dP \geq 1-\delta}} \left\{ -\log \int_\Omega \Phi\, dQ \right\} \geq \sup_{\substack{\Phi':\,\Omega' \to [0,1] \\ \int_{\Omega'} \Phi'\, d(W \circ P) \geq 1-\delta}} \left\{ -\log \int_{\Omega'} \Phi'\, d(W \circ Q) \right\}$,

which proves 4).

(We remark that for distributions defined on a finite alphabet $\mathcal{X}$, the equivalent of (1) is $D_0(P\|Q) = -\log \sum_{x:\, P(x) > 0} Q(x)$.)

A relation between $D_0^\delta(P\|Q)$, $D(P\|Q)$ and the information spectrum methods [12], [13] can be seen in the next lemma. A slightly different quantum version of this theorem has been proven in [10]. We include a classical proof of it in the Appendix.
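For finite alphabets, the supremum in (2) is a linear program, and (by the remark after Definition 2) its optimum puts $\Phi = 1$ on the outcomes with the largest likelihood ratio $P(x)/Q(x)$, with at most one fractional value at the threshold. The following Python sketch is a minimal illustration of this, not part of the paper; the function name is our own, and logarithms are taken to base 2.

```python
import numpy as np

def smooth_zero_divergence(P, Q, delta):
    """Compute D_0^delta(P||Q) for finite distributions (log base 2).

    The supremum in (2) is a linear program: minimize sum(Phi*Q) subject
    to sum(Phi*P) >= 1 - delta and 0 <= Phi <= 1. Its optimum sets
    Phi = 1 on the outcomes with the largest ratio P/Q (Neyman-Pearson
    style), with at most one fractional entry at the threshold.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Outcomes with Q = 0 but P > 0 can be included at no Q-cost.
    ratio = np.where(Q > 0, P / np.where(Q > 0, Q, 1.0), np.inf)
    order = np.argsort(-ratio)           # decreasing likelihood ratio
    p_needed = 1.0 - delta               # P-mass that Phi must capture
    q_paid = 0.0                         # Q-mass paid for that capture
    for i in order:
        if p_needed <= 1e-12:
            break
        if P[i] <= 0:
            continue                     # useless for the constraint
        phi = min(1.0, p_needed / P[i])  # fractional only at threshold
        q_paid += phi * Q[i]
        p_needed -= phi * P[i]
    return np.inf if q_paid == 0 else -np.log2(q_paid)

# delta = 0 recovers D_0(P||Q) = -log2 of Q(supp(P)); a larger delta
# lets Phi discard the P-atoms that are most expensive under Q.
P = [0.5, 0.3, 0.2, 0.0]
Q = [0.1, 0.2, 0.3, 0.4]
print(smooth_zero_divergence(P, Q, 0.0))   # -log2(0.6), about 0.737
print(smooth_zero_divergence(P, Q, 0.2))   # strictly larger
```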
Lemma 2: Let $P_n$ and $Q_n$ be probability measures on $(\Omega_n, \mathcal{F}_n)$ for every $n \in \mathbb{N}$. Then

$\lim_{\delta \downarrow 0} \liminf_{n\to\infty} \frac{1}{n} D_0^\delta(P_n\|Q_n) = \{P_n\}\text{-}\liminf_{n\to\infty} \frac{1}{n} \log \frac{dP_n}{dQ_n}$.   (4)

Here $\{P_n\}$-$\liminf$ means the lim inf in probability with respect to the sequence of probability measures $\{P_n\}$; that is, for a real stochastic process $\{Z_n\}$,

$\{P_n\}\text{-}\liminf_{n\to\infty} Z_n \triangleq \sup\left\{ a \in \mathbb{R} : \lim_{n\to\infty} P_n(\{Z_n < a\}) = 0 \right\}$.

In particular, let $P^{\times n}$ and $Q^{\times n}$ denote the product distributions of $P$ and $Q$, respectively, on $(\Omega^{\otimes n}, \mathcal{F}^{\otimes n})$; then

$\lim_{\delta \downarrow 0} \lim_{n\to\infty} \frac{1}{n} D_0^\delta\big(P^{\times n} \| Q^{\times n}\big) = D(P\|Q)$.   (5)

Proof:
See Appendix.

III. THE CONVERSE
We first state and prove a lemma.
Lemma 3:
Let $M$ be uniformly distributed over $\{1,\ldots,m\}$ and let $\hat{M}$ also take values in $\{1,\ldots,m\}$. If the probability that $\hat{M} \neq M$ is at most $\epsilon$, then

$\log m \leq D_0^\epsilon\big(P_{M\hat{M}} \,\|\, P_M \times P_{\hat{M}}\big)$,

where $P_{M\hat{M}}$ denotes the joint distribution of $M$ and $\hat{M}$, while $P_M$ and $P_{\hat{M}}$ denote its marginals.

Proof:
Let $\Phi$ be the indicator of the event $M = \hat{M}$, i.e.,

$\Phi(i,\hat{i}) \triangleq \begin{cases} 1, & i = \hat{i}, \\ 0, & \text{otherwise}, \end{cases} \qquad i, \hat{i} \in \{1,\ldots,m\}$.

Because, by assumption, the probability that $M \neq \hat{M}$ is not larger than $\epsilon$, we have

$\int_{\{1,\ldots,m\}^{\otimes 2}} \Phi\, dP_{M\hat{M}} \geq 1 - \epsilon$.

Thus, to prove the lemma, it suffices to show that

$\log m \leq -\log \int_{\{1,\ldots,m\}^{\otimes 2}} \Phi\, d\big(P_M \times P_{\hat{M}}\big)$.   (6)

To justify this we write

$\int_{\{1,\ldots,m\}^{\otimes 2}} \Phi\, d\big(P_M \times P_{\hat{M}}\big) = \sum_{i=1}^m P_M(\{i\}) \cdot P_{\hat{M}}(\{i\}) = \sum_{i=1}^m \frac{1}{m} \cdot P_{\hat{M}}(\{i\}) = \frac{1}{m}$,

from which it follows that (6) is satisfied with equality.

Theorem 1 (Converse): An $(m,\epsilon)$-code satisfies

$\log m \leq \sup_{P_X} D_0^\epsilon(P_{XY} \,\|\, P_X \times P_Y)$,   (7)

where $P_{XY}$ and $P_Y$ are probability distributions on $\mathcal{X} \times \mathcal{Y}$ and $\mathcal{Y}$, respectively, induced by $P_X$ and the channel law.

Proof:
Choose $P_X$ to be the distribution induced by the message uniformly distributed over $\{1,\ldots,m\}$; then

$\log m \leq D_0^\epsilon\big(P_{M\hat{M}} \,\|\, P_M \times P_{\hat{M}}\big) \leq D_0^\epsilon(P_{XY} \,\|\, P_X \times P_Y)$,

where the first inequality follows by Lemma 3, and the second inequality by Lemma 1 Part 4) and the fact that $M \to X \to Y \to \hat{M}$ forms a Markov chain. Theorem 1 follows.
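As a sanity check, Lemma 3 can be verified numerically: build any joint distribution $P_{M\hat{M}}$ with uniform marginal $P_M$ and error probability at most $\epsilon$, and compare $\log m$ against $D_0^\epsilon(P_{M\hat{M}} \| P_M \times P_{\hat{M}})$. The sketch below is our own illustration (the toy decoder kernel is not from the paper) and reuses the `smooth_zero_divergence` function from the Section II sketch.

```python
import numpy as np

m, eps = 4, 0.1
# Toy decoder output: correct with probability 1 - eps, otherwise
# uniform over the m - 1 wrong messages.
W = np.full((m, m), eps / (m - 1))
np.fill_diagonal(W, 1.0 - eps)

P_M = np.full(m, 1.0 / m)                 # uniform message
P_joint = (P_M[:, None] * W).ravel()      # P_{M, M_hat} as a flat vector
P_Mhat = (P_M[:, None] * W).sum(axis=0)   # marginal of M_hat
P_prod = np.outer(P_M, P_Mhat).ravel()    # P_M x P_{M_hat}

lhs = np.log2(m)
rhs = smooth_zero_divergence(P_joint, P_prod, eps)
print(lhs, rhs)   # here both equal 2.0: the bound holds with equality
```

In this symmetric example the optimal $\Phi$ in (2) is exactly the indicator used in the proof of Lemma 3, which is why (6), and hence the lemma, is tight here.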
IV. ACHIEVABILITY

Theorem 2 (Achievability):
For any channel, any $\epsilon > 0$ and $\epsilon' \in [0, \epsilon)$, there exists an $(m,\epsilon)$-code satisfying

$\log m \geq \sup_{P_X} D_0^{\epsilon'}(P_{XY} \,\|\, P_X \times P_Y) - \log \frac{1}{\epsilon - \epsilon'}$,   (8)

where $P_{XY}$ and $P_Y$ are induced by $P_X$ and the channel law.

The proof of Theorem 2 can be thought of as a generalization of Shannon's original achievability proof [1]. We use random coding as in [1]; for the decoder, we generalize Shannon's typicality decoder to allow, instead of the indicator for the jointly typical set, an arbitrary function on input-output pairs.

Proof:
For any distribution $P_X$ on $\mathcal{X}$ and any $m \in \mathbb{Z}^+$, we randomly generate a codebook of size $m$ such that the $m$ codewords are independent and identically distributed according to $P_X$. We shall show that, for any $\epsilon'$, there exists a decoding rule associated with each codebook such that the average probability of a decoding error, averaged over all such codebooks, satisfies

$\Pr(\text{error}) \leq (m-1) \cdot 2^{-D_0^{\epsilon'}(P_{XY}\|P_X \times P_Y)} + \epsilon'$.   (9)

Then there exists at least one codebook whose average probability of error is upper-bounded by the right-hand side (RHS) of (9). That this codebook satisfies (8) follows by rearranging terms in (9).

We shall next prove (9). For a given codebook and any $\Phi$:
$\mathcal{X} \times \mathcal{Y} \to [0,1]$ which satisfies

$\int_{\mathcal{X}\times\mathcal{Y}} \Phi\, dP_{XY} \geq 1 - \epsilon'$,   (10)

we use the following random decoding rule: when $y$ is received, select some or none of the messages such that message $j$ is selected with probability $\Phi(f(j), y)$, independently of the other messages. If only one message is selected, output this message; otherwise declare an error. (It is well-known that, for the channel model considered in this paper, the average probability of error cannot be improved by allowing random decoding rules.)
To analyze the error probability, suppose $i$ was the transmitted message. The error event is the union of $E_1$ and $E_2$, where $E_1$ denotes the event that some message other than $i$ is selected, and $E_2$ denotes the event that message $i$ is not selected.

We first bound $\Pr(E_1)$ averaged over all codebooks. Fix $f(i)$ and $y$. The probability, averaged over all codebooks, of selecting a particular message other than $i$ is given by

$\int_{\mathcal{X}} \Phi(x,y)\, P_X(dx)$.

Since there are $(m-1)$ such messages, we can use the union bound to obtain

$\mathbb{E}[\Pr(E_1 \,|\, f(i), y)] \leq (m-1) \cdot \int_{\mathcal{X}} \Phi(x,y)\, P_X(dx)$.   (11)

Since the RHS of (11) does not depend on $f(i)$, we further have

$\mathbb{E}[\Pr(E_1 \,|\, y)] \leq (m-1) \cdot \int_{\mathcal{X}} \Phi(x,y)\, P_X(dx)$.

Averaging this inequality over $y$ gives

$\mathbb{E}[\Pr(E_1)] \leq (m-1) \int_{\mathcal{Y}} \left( \int_{\mathcal{X}} \Phi(x,y)\, P_X(dx) \right) P_Y(dy) = (m-1) \int_{\mathcal{X}\times\mathcal{Y}} \Phi\, d(P_X \times P_Y)$.   (12)

On the other hand, the probability of $E_2$ averaged over all generated codebooks can be bounded as

$\mathbb{E}[\Pr(E_2)] = \int_{\mathcal{X}\times\mathcal{Y}} (1-\Phi)\, dP_{XY} \leq \epsilon'$.   (13)

Combining (12) and (13) yields

$\Pr(\text{error}) \leq (m-1) \int_{\mathcal{X}\times\mathcal{Y}} \Phi\, d(P_X \times P_Y) + \epsilon'$.   (14)

Finally, since (14) holds for every $\Phi$ satisfying (10), we establish (9) and thus conclude the proof of Theorem 2.
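The random decoding rule is easy to simulate. The sketch below is our own illustration, not the authors' code: it uses a small BSC and the deterministic test $\Phi(x,y) = \mathbf{1}\{d_H(x,y) \le t\}$ (a special case of the rule, under which "select message $j$ with probability $\Phi(f(j),y)$" reduces to selecting every codeword within Hamming distance $t$), and compares the Monte-Carlo error estimate with the RHS of (14).

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n, p, m, t = 16, 0.11, 4, 5   # block length, BSC flip prob., messages, radius

# With Phi = 1{d_H <= t}: eps' = P(Phi = 0 under P_XY) is a binomial tail,
# and for uniform inputs the Y-marginal is uniform, so the integral of Phi
# against P_X x P_Y is just the fraction of pairs within distance t.
eps_prime = sum(comb(n, k) * p**k * (1-p)**(n-k) for k in range(t+1, n+1))
q_mass = sum(comb(n, k) for k in range(t+1)) / 2.0**n
bound = (m - 1) * q_mass + eps_prime      # RHS of (14), i.e. (9)

errors, trials = 0, 20000
for _ in range(trials):
    book = rng.integers(0, 2, size=(m, n))   # i.i.d. uniform codewords
    y = book[0] ^ (rng.random(n) < p)        # send message 0 over the BSC
    selected = np.flatnonzero((book != y).sum(axis=1) <= t)
    if selected.size != 1 or selected[0] != 0:
        errors += 1
print(f"simulated error {errors/trials:.3f} <= bound {bound:.3f}")
```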
V. ASYMPTOTIC ANALYSIS

In this section we use the new bounds to study the capacity of a channel whose structure can be arbitrary. Such a channel is described by stochastic kernels from $\mathcal{X}^n$ to $\mathcal{Y}^n$ for all $n \in \mathbb{Z}^+$, where $\mathcal{X}$ and $\mathcal{Y}$ are the input and output alphabets, respectively. An $(n, M, \epsilon)$-code on such a channel consists of an encoder and a decoder such that a message of size $M$ can be transmitted by mapping it to an element of $\mathcal{X}^n$ while the probability of error is no larger than $\epsilon$. The capacity and the optimistic capacity [14] of a channel are defined as follows.

Definition 3 (Capacity and Optimistic Capacity):
The capacity $C$ of a channel is the supremum over all $R$ for which there exists a sequence of $(n, M_n, \epsilon_n)$-codes such that

$\frac{\log M_n}{n} \geq R, \quad n \in \mathbb{Z}^+$,   (15)

and $\lim_{n\to\infty} \epsilon_n = 0$.

The optimistic capacity $\overline{C}$ of a channel is the supremum over all $R$ for which there exists a sequence of $(n, M_n, \epsilon_n)$-codes such that (15) holds and $\liminf_{n\to\infty} \epsilon_n = 0$.

Given Definition 3, the next theorem is an immediate consequence of Theorems 1 and 2.
Theorem 3 (Capacity Formulas):
Any channel satisfies

$C = \lim_{\epsilon \downarrow 0} \liminf_{n\to\infty} \frac{1}{n} \sup_{P_{X^n}} D_0^\epsilon\big(P_{X^nY^n} \,\|\, P_{X^n} \times P_{Y^n}\big)$,   (16)

$\overline{C} = \lim_{\epsilon \downarrow 0} \limsup_{n\to\infty} \frac{1}{n} \sup_{P_{X^n}} D_0^\epsilon\big(P_{X^nY^n} \,\|\, P_{X^n} \times P_{Y^n}\big)$.   (17)

Remark:
According to Lemma 2, (16) is equivalent to [2, (1.4)]. It can also be shown that (17) is equivalent to [15, Theorem 4.4].

We can also use Theorems 1 and 2 to study the $\epsilon$-capacities, which are usually defined as follows (see, for example, [2], [15]).

Definition 4 ($\epsilon$-Capacity and Optimistic $\epsilon$-Capacity): The $\epsilon$-capacity $C_\epsilon$ of a channel is the supremum over all $R$ such that, for every large enough $n$, there exists an $(n, M_n, \epsilon)$-code satisfying

$\frac{\log M_n}{n} \geq R$.

The optimistic $\epsilon$-capacity $\overline{C}_\epsilon$ of a channel is the supremum over all $R$ for which there exist $(n, M_n, \epsilon)$-codes for infinitely many $n$'s satisfying

$\frac{\log M_n}{n} \geq R$.

The following bounds on the $\epsilon$-capacity and optimistic $\epsilon$-capacity of a channel are immediate consequences of Theorems 1 and 2. They can be shown to be equivalent to those in [2, Theorem 6], [16, Theorem 7] and [15, Theorem 4.3]. As in those previous results, the bounds for $C_\epsilon$ ($\overline{C}_\epsilon$) coincide except possibly at the points of discontinuity of $C_\epsilon$ ($\overline{C}_\epsilon$).

Theorem 4 (Bounds on $\epsilon$-Capacities): For any channel and any $\epsilon \in (0,1)$, the $\epsilon$-capacity of the channel satisfies

$C_\epsilon \leq \liminf_{n\to\infty} \frac{1}{n} \sup_{P_{X^n}} D_0^\epsilon\big(P_{X^nY^n} \,\|\, P_{X^n} \times P_{Y^n}\big)$,

$C_\epsilon \geq \lim_{\epsilon' \uparrow \epsilon} \liminf_{n\to\infty} \frac{1}{n} \sup_{P_{X^n}} D_0^{\epsilon'}\big(P_{X^nY^n} \,\|\, P_{X^n} \times P_{Y^n}\big)$;

and the optimistic $\epsilon$-capacity of the channel satisfies

$\overline{C}_\epsilon \leq \limsup_{n\to\infty} \frac{1}{n} \sup_{P_{X^n}} D_0^\epsilon\big(P_{X^nY^n} \,\|\, P_{X^n} \times P_{Y^n}\big)$,

$\overline{C}_\epsilon \geq \lim_{\epsilon' \uparrow \epsilon} \limsup_{n\to\infty} \frac{1}{n} \sup_{P_{X^n}} D_0^{\epsilon'}\big(P_{X^nY^n} \,\|\, P_{X^n} \times P_{Y^n}\big)$.
VI. NUMERICAL COMPARISON WITH EXISTING BOUNDS FOR THE BSC

In this section we compare the new achievability bound obtained in this paper with the bounds by Gallager [5] and Polyanskiy et al. [3]. We consider the memoryless binary symmetric channel (BSC) with crossover probability 0.11. Thus, for $n$ channel uses, the input and output alphabets are both $\{0,1\}^n$ and the channel law is given by

$P_{Y^n|X^n}(y^n|x^n) = 0.11^{|y^n - x^n|} \cdot 0.89^{n - |y^n - x^n|}$,

where $|\cdot|$ denotes the Hamming weight of a binary vector. The average block-error rate is chosen to be $10^{-3}$.

In the calculations of all three achievability bounds we choose $P_{X^n}$ to be uniform on $\{0,1\}^n$. For comparison we include the plot of the converse used in [3]. Our new converse bound involves optimization over input distributions and is thus difficult to compute. In fact, in this example it is less tight compared to the one in [3], since for the uniform input distribution $D_0^\epsilon(P_{X^nY^n} \| P_{X^n} \times P_{Y^n})$ coincides with the latter.

Comparison of the curves is shown in Figure 1.
Fig. 1. Comparison of the new achievability bound with Gallager [5] and Polyanskiy et al. [3] for the BSC with crossover probability 0.11 and average block-error rate $10^{-3}$. The converse is the one used in [3]. (The figure plots rate against block-length $n$ for the converse, the bound of Polyanskiy et al., the new achievability bound, and Gallager's bound.)

For the example we consider, the new achievability bound is always less tight than the one in [3], though the difference is small. It outperforms Gallager's bound for large block-lengths.
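For the uniform input distribution on the BSC, the likelihood ratio $\frac{dP_{X^nY^n}}{d(P_{X^n}\times P_{Y^n})} = 2^n p^d (1-p)^{n-d}$ depends on $(x^n, y^n)$ only through the Hamming distance $d = |y^n - x^n|$, so $D_0^{\epsilon'}$ can be evaluated over the $n+1$ distance classes rather than over all $4^n$ pairs. The sketch below is our own illustration of how such curves can be produced (it is not the authors' code); it computes the rate guaranteed by (8), optimizing the split between $\epsilon'$ and $\epsilon - \epsilon'$ by a crude grid search.

```python
import numpy as np
from math import comb, log2

def new_achievability_rate(n, p=0.11, eps=1e-3, grid=50):
    """Rate per channel use guaranteed by (8) for BSC(p), uniform input.

    The ratio 2^n p^d (1-p)^(n-d) is decreasing in d, so the optimal Phi
    fills the distance classes d = 0, 1, ... until it holds 1 - eps' of
    the P_{XY}-mass, paying C(n,d) 2^{-n} of (P_X x P_Y)-mass per class.
    """
    pxy = np.array([comb(n, d) * p**d * (1-p)**(n-d) for d in range(n+1)])
    pq  = np.array([comb(n, d) / 2.0**n for d in range(n+1)])
    best = 0.0
    for eps_prime in np.linspace(0.0, eps, grid, endpoint=False):
        need, paid = 1.0 - eps_prime, 0.0
        for d in range(n + 1):
            if need <= 1e-15:
                break
            if pxy[d] <= 0.0:      # underflowed tail class, no P-mass left
                continue
            phi = min(1.0, need / pxy[d])
            paid += phi * pq[d]
            need -= phi * pxy[d]
        log_m = -log2(paid) - log2(1.0 / (eps - eps_prime))
        best = max(best, log_m / n)
    return best

# Rates approach the capacity 1 - h(0.11) of about 0.5 bit from below.
for n in (200, 500, 1000):
    print(n, round(new_achievability_rate(n), 4))
```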
APPENDIX

In this appendix we prove Lemma 2. We first show that

$\lim_{\delta \downarrow 0} \liminf_{n\to\infty} \frac{1}{n} D_0^\delta(P_n\|Q_n) \geq \{P_n\}\text{-}\liminf_{n\to\infty} \frac{1}{n} \log \frac{dP_n}{dQ_n}$.   (18)

To this end, consider any $a$ satisfying

$0 < a < \{P_n\}\text{-}\liminf_{n\to\infty} \frac{1}{n} \log \frac{dP_n}{dQ_n}$.   (19)

Let $A_n(a) \in \mathcal{F}_n$, $n \in \mathbb{N}$, be the union of all measurable sets on which

$\frac{1}{n} \log \frac{dP_n}{dQ_n} \geq a$.   (20)

Let $\Phi_n: \Omega_n \to [0,1]$, $n \in \mathbb{N}$, equal 1 on $A_n(a)$ and equal 0 elsewhere; then by (19) we have

$\lim_{n\to\infty} \int_{\Omega_n} \Phi_n\, dP_n = \lim_{n\to\infty} P_n(A_n(a)) = 1$.   (21)

Thus we have

$\lim_{\delta \downarrow 0} \liminf_{n\to\infty} \frac{1}{n} D_0^\delta(P_n\|Q_n) \geq \liminf_{n\to\infty} \left( -\frac{1}{n} \log \int_{\Omega_n} \Phi_n\, dQ_n \right) \geq \liminf_{n\to\infty} \left( -\frac{1}{n} \log \left( \int_{\Omega_n} \Phi_n\, dP_n \cdot 2^{-na} \right) \right) = \liminf_{n\to\infty} \left( -\frac{1}{n} \log \big( P_n(A_n(a)) \cdot 2^{-na} \big) \right) = a$,   (22)

where the first inequality follows because, according to (21), for any $\delta > 0$, $\int_{\Omega_n} \Phi_n\, dP_n = P_n(A_n(a)) \geq 1-\delta$ for large enough $n$; the second inequality by (20) and the fact that $\Phi_n$ is zero outside $A_n(a)$; and the last equality by (21). Since (22) holds for every $a$ satisfying (19), we obtain (18).

We next show the other direction, namely, we show that

$\lim_{\delta \downarrow 0} \liminf_{n\to\infty} \frac{1}{n} D_0^\delta(P_n\|Q_n) \leq \{P_n\}\text{-}\liminf_{n\to\infty} \frac{1}{n} \log \frac{dP_n}{dQ_n}$.   (23)

To this end, consider any

$b > \{P_n\}\text{-}\liminf_{n\to\infty} \frac{1}{n} \log \frac{dP_n}{dQ_n}$.   (24)

Let $A'_n(b)$, $n \in \mathbb{N}$, be the union of all measurable sets on which

$\frac{1}{n} \log \frac{dP_n}{dQ_n} \leq b$.   (25)

By (24) we have that there exists some $c \in (0,1]$ such that

$\limsup_{n\to\infty} P_n(A'_n(b)) = c$.   (26)

For every $\delta \in (0,c)$, consider any sequence of $\Phi_n: \Omega_n \to [0,1]$ satisfying

$\int_{\Omega_n} \Phi_n\, dP_n \geq 1-\delta, \quad n \in \mathbb{N}$.   (27)

Combining (26) and (27) yields

$\limsup_{n\to\infty} \int_{A'_n(b)} \Phi_n\, dP_n \geq c - \delta$.   (28)

On the other hand, from (25) it follows that

$\int_{A'_n(b)} \Phi_n\, dQ_n \geq \int_{A'_n(b)} \Phi_n\, dP_n \cdot 2^{-nb}$.   (29)

Combining (28) and (29) yields

$\liminf_{n\to\infty} \left( -\frac{1}{n} \log \int_{A'_n(b)} \Phi_n\, dQ_n \right) \leq b$.

Thus we obtain that, for every $\delta \in (0,c)$ and every sequence $\Phi_n: \Omega_n \to [0,1]$ satisfying (27),

$\liminf_{n\to\infty} \left( -\frac{1}{n} \log \int_{\Omega_n} \Phi_n\, dQ_n \right) \leq \liminf_{n\to\infty} \left( -\frac{1}{n} \log \int_{A'_n(b)} \Phi_n\, dQ_n \right) \leq b$.

This implies that, for every $\delta \in (0,c)$,

$\liminf_{n\to\infty} \frac{1}{n} D_0^\delta(P_n\|Q_n) \leq b$.   (30)

Inequality (30) still holds when we take the limit $\delta \downarrow 0$. Since this is true for every $b$ satisfying (24), we establish (23). Combining (18) and (23) proves (4).

Finally, (5) follows from (4) because, by the law of large numbers,

$\frac{1}{n} \log \frac{d(P^{\times n})}{d(Q^{\times n})} \to \mathbb{E}_P\left[ \log \frac{dP}{dQ} \right] = D(P\|Q)$   (31)

as $n \to \infty$, $P^{\times n}$-almost surely. This completes the proof of Lemma 2.
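The convergence in (5) is easy to observe numerically for Bernoulli distributions: product atoms with the same number of ones share a single likelihood ratio, so the $n+1$ "type" classes form equivalent finite distributions and $D_0^\delta(P^{\times n}\|Q^{\times n})$ can be fed to the `smooth_zero_divergence` sketch from Section II directly. The snippet below is our own illustration; it fixes a small $\delta$ rather than taking $\delta \downarrow 0$, which suffices here since for product distributions the normalized divergence tends to $D(P\|Q)$ for any fixed $\delta \in (0,1)$ (a Stein's-lemma-type statement).

```python
import numpy as np
from math import comb, log2

p, q, delta = 0.3, 0.5, 0.01
D = p*log2(p/q) + (1-p)*log2((1-p)/(1-q))   # D(P||Q), about 0.1187 bits

for n in (10, 100, 1000):
    # Atoms of P^n and Q^n with k ones share one likelihood ratio
    # (p/q)^k ((1-p)/(1-q))^(n-k), so grouping them is lossless for (2).
    Pk = [comb(n, k) * p**k * (1-p)**(n-k) for k in range(n+1)]
    Qk = [comb(n, k) * q**k * (1-q)**(n-k) for k in range(n+1)]
    print(n, smooth_zero_divergence(Pk, Qk, delta) / n, "->", D)
```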
ACKNOWLEDGMENT

RR acknowledges support from the Swiss National Science Foundation (grant No. 200021-119868).
REFERENCES

[1] C. E. Shannon, "A mathematical theory of communication," Bell System Techn. J., vol. 27, pp. 379–423 and 623–656, July and Oct. 1948.
[2] S. Verdú and T. S. Han, "A general formula for channel capacity," IEEE Trans. Inform. Theory, vol. 40, no. 4, pp. 1147–1157, July 1994.
[3] Y. Polyanskiy, H. V. Poor, and S. Verdú, "New channel coding achievability bounds," in Proc. IEEE Int. Symposium on Inf. Theory, Toronto, Canada, July 6–11, 2008.
[4] A. Rényi, "On measures of entropy and information," in Proc. 4th Berkeley Symposium on Mathematical Statistics and Probability, University of California, Berkeley, CA, USA, 1961.
[5] R. G. Gallager, "A simple derivation of the coding theorem and some applications," IEEE Trans. Inform. Theory, vol. 11, no. 1, pp. 3–19, Jan. 1965.
[6] S. Arimoto, "Information measures and capacity of order α for discrete memoryless channels," in Topics in Information Theory, I. Csiszár and P. Elias, Eds., 1977, pp. 41–52.
[7] I. Csiszár, "Generalized cutoff rates and Rényi's information measures," IEEE Trans. Inform. Theory, vol. 41, no. 1, pp. 26–34, Jan. 1995.
[8] R. Renner, S. Wolf, and J. Wullschleger, "The single-serving channel capacity," in Proc. IEEE Int. Symposium on Inf. Theory, Seattle, Washington, USA, July 9–14, 2006.
[9] R. Renner and S. Wolf, "Smooth Rényi entropy and applications," in Proc. IEEE Int. Symposium on Inf. Theory, Chicago, Illinois, USA, June 27 – July 2, 2004.
[10] N. Datta, "Min- and max-relative entropies and a new entanglement monotone," 2008. [Online]. Available: http://arxiv.org/abs/0803.2770
[11] L. Wang and R. Renner, "One-shot classical capacity of quantum channels," presented at The Twelfth Workshop on Quantum Inform. Processing (QIP), poster session.
[12] T. S. Han and S. Verdú, "Approximation theory of output statistics," IEEE Trans. Inform. Theory, vol. 39, no. 3, pp. 752–772, 1993.
[13] T. S. Han, Information Spectrum Methods in Information Theory. Springer Verlag, 2003.
[14] S. Vembu, S. Verdú, and Y. Steinberg, "The source-channel separation theorem revisited," IEEE Trans. Inform. Theory, vol. 41, no. 1, pp. 44–54, Jan. 1995.
[15] P.-N. Chen and F. Alajaji, "Optimistic Shannon coding theorems for arbitrary single-user systems," IEEE Trans. Inform. Theory, vol. 45, no. 7, pp. 2623–2629, Nov. 1999.
[16] Y. Steinberg, "New converses in the theory of identification via channels," IEEE Trans. Inform. Theory, vol. 44, no. 3, pp. 984–998, May 1998.