Simple Channel Coding Bounds
Ligong Wang
Signal and Information Processing Laboratory, ETH Zurich, Switzerland — [email protected]
Roger Colbeck
Institute for Theoretical Physics and Institute of Theoretical Computer Science, ETH Zurich, Switzerland — [email protected]
Renato Renner
Institute for Theoretical Physics, ETH Zurich, Switzerland — [email protected]
Abstract—New channel coding converse and achievability bounds are derived for a single use of an arbitrary channel. Both bounds are expressed using a quantity called the "smooth 0-divergence", which is a generalization of Rényi's divergence of order 0. The bounds are also studied in the limit of large block-lengths. In particular, they combine to give a general capacity formula which is equivalent to the one derived by Verdú and Han.

I. INTRODUCTION
We consider the problem of transmitting information through a channel. A channel consists of an input alphabet $\mathcal{X}$, an output alphabet $\mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are each equipped with a $\sigma$-algebra, and a channel law, which is a stochastic kernel $P_{Y|X}$ from $\mathcal{X}$ to $\mathcal{Y}$. We consider average error probabilities throughout this paper; thus an $(m,\epsilon)$-code consists of an encoder $f: \{1,\ldots,m\} \to \mathcal{X},\ i \mapsto x$, and a decoder $g: \mathcal{Y} \to \{1,\ldots,m\},\ y \mapsto \hat{i}$, such that the probability that $i \neq \hat{i}$ is smaller than or equal to $\epsilon$, assuming that the message is uniformly distributed. Our aim is to derive upper and lower bounds on the largest $m$, given $\epsilon > 0$, such that an $(m,\epsilon)$-code exists for a given channel.

Such bounds are different from those in Shannon's original work [1] in the sense that they are nonasymptotic and do not rely on any channel structure such as memorylessness or information stability.

Previous works have demonstrated the advantages of such nonasymptotic bounds. They can lead to more general channel capacity formulas [2] as well as giving tight approximations to the maximal rate achievable for a desired error probability and a fixed block-length [3].

In this paper we prove a new converse bound and a new achievability bound. They are asymptotically tight in the sense that they combine to give a general capacity formula that is equivalent to [2, (1.4)]. We are mainly interested in proving simple bounds which offer theoretical intuition into channel coding problems. It is not our main concern to derive bounds which outperform the existing ones in estimating the largest achievable rates in finite block-length scenarios. In fact, as will be seen in Section VI, the new achievability bound is less tight than the one in [3], though the differences are small. (Note that Shannon's method of obtaining codes that have small maximum error probabilities from those that have small average error probabilities [1] can be applied to our codes. We shall not examine other such methods which might lead to tighter bounds for finite block-lengths.)
Both new bounds are expressed using a quantity which we call the smooth 0-divergence, denoted $D_0^\delta(\cdot\|\cdot)$, where $\delta$ is a nonnegative parameter. This quantity is a generalization of Rényi's divergence of order 0 [4]. Thus, our new bounds demonstrate connections between the channel coding problem and Rényi's divergence of order 0. Various previous works [5], [6], [7] have shown connections between channel coding and Rényi's information measures of order $\alpha$ for $\alpha \geq 0$. Also relevant is [8], where channel coding bounds were derived using the smooth min- and max-entropies introduced in [9].

As will be seen, proofs of the new bounds are simple and self-contained. The achievability bound uses random coding and suboptimal decoding, where the decoding rule can be thought of as a generalization of Shannon's joint typicality decoding rule [1]. The converse is proved by simple algebra combined with the fact that $D_0^\delta(\cdot\|\cdot)$ satisfies a Data Processing Theorem.

The quantity $D_0^\delta(\cdot\|\cdot)$ has also been defined for quantum systems [10], [11]. In [11] the present work is extended to quantum communication channels.

The remainder of this paper is arranged as follows: in Section II we introduce the quantity $D_0^\delta(\cdot\|\cdot)$; in Section III we state and prove the converse theorem; in Section IV we state and prove the achievability theorem; in Section V we analyze the bounds asymptotically for an arbitrary channel to study its capacity and $\epsilon$-capacity; finally, in Section VI we compare numerical results obtained using our new achievability bound with some existing bounds.

II. THE QUANTITY $D_0^\delta(\cdot\|\cdot)$

In [4] Rényi defined entropies and divergences of order $\alpha$ for every $\alpha > 0$. We denote these $H_\alpha(\cdot)$ and $D_\alpha(\cdot\|\cdot)$, respectively. They are generalizations of Shannon's entropy $H(\cdot)$ and relative entropy $D(\cdot\|\cdot)$. Letting $\alpha$ tend to zero in $D_\alpha(\cdot\|\cdot)$ yields the following definition of $D_0(\cdot\|\cdot)$.

Definition 1 (Rényi's Divergence of Order 0): For $P$ and $Q$, two probability measures on $(\Omega, \mathcal{F})$, $D_0(P\|Q)$ is defined as

$D_0(P\|Q) = -\log \int_{\mathrm{supp}(P)} dQ$,   (1)

where we use the convention $\log 0 = -\infty$.

We generalize $D_0(\cdot\|\cdot)$ to define $D_0^\delta(\cdot\|\cdot)$ as follows.

Definition 2 (Smooth 0-Divergence): Let $P$ and $Q$ be two probability measures on $(\Omega, \mathcal{F})$. For $\delta \geq 0$, $D_0^\delta(P\|Q)$ is defined as

$D_0^\delta(P\|Q) = \sup_{\substack{\Phi:\,\Omega \to [0,1] \\ \int_\Omega \Phi\, dP \geq 1-\delta}} \left\{ -\log \int_\Omega \Phi\, dQ \right\}$.   (2)
Remark: To achieve the supremum in (2), one should choose $\Phi$ to be large (equal to 1) where $\frac{dP}{dQ}$ is large, and vice versa.

Lemma 1 (Properties of $D_0^\delta(\cdot\|\cdot)$):
1) $D_0^\delta(P\|Q)$ is monotonically nondecreasing in $\delta$.
2) When $\delta = 0$, the supremum in (2) is achieved by choosing $\Phi$ to be 1 on $\mathrm{supp}(P)$ and 0 elsewhere, which yields $D_0^0(P\|Q) = D_0(P\|Q)$.
3) If $P$ has no point masses, then the supremum in (2) is achieved by letting $\Phi$ take values in $\{0,1\}$ only, and

$D_0^\delta(P\|Q) = \sup_{P':\, \|P'-P\| \leq \delta} D_0(P'\|Q)$.
4) (Data Processing Theorem) Let $P$ and $Q$ be probability measures on $(\Omega, \mathcal{F})$, and let $W$ be a stochastic kernel from $(\Omega, \mathcal{F})$ to $(\Omega', \mathcal{F}')$. For all $\delta \geq 0$, we have

$D_0^\delta(P\|Q) \geq D_0^\delta(W \circ P \,\|\, W \circ Q)$,   (3)

where $W \circ P$ denotes the probability distribution on $(\Omega', \mathcal{F}')$ induced by $P$ and $W$, and similarly for $W \circ Q$.

Proof:
The first three properties are immediate consequences of the definition and the remark. We therefore only prove 4).

For any $\Phi': \Omega' \to [0,1]$ such that

$\int_{\Omega'} \Phi'\, d(W \circ P) \geq 1-\delta$,

we choose $\Phi: \Omega \to \mathbb{R}$ to be

$\Phi(\omega) = \int_{\Omega'} \Phi'(\omega')\, W(d\omega'|\omega), \quad \omega \in \Omega$.

Then we have that $\Phi(\omega) \in [0,1]$ for all $\omega \in \Omega$. Further,

$\int_\Omega \Phi\, dP = \int_{\Omega'} \Phi'\, d(W \circ P) \geq 1-\delta$,

$\int_\Omega \Phi\, dQ = \int_{\Omega'} \Phi'\, d(W \circ Q)$.

Thus we have
$\sup_{\substack{\Phi:\,\Omega \to [0,1] \\ \int_\Omega \Phi\, dP \geq 1-\delta}} \left\{ -\log \int_\Omega \Phi\, dQ \right\} \geq \sup_{\substack{\Phi':\,\Omega' \to [0,1] \\ \int_{\Omega'} \Phi'\, d(W \circ P) \geq 1-\delta}} \left\{ -\log \int_{\Omega'} \Phi'\, d(W \circ Q) \right\}$,

which proves 4).

(We remark that for distributions defined on a finite alphabet $\mathcal{X}$, the equivalent of (1) is $D_0(P\|Q) = -\log \sum_{x:\, P(x) > 0} Q(x)$.)

A relation between $D_0^\delta(P\|Q)$, $D(P\|Q)$ and the information spectrum methods [12], [13] can be seen in the next lemma. A slightly different quantum version of this theorem has been proven in [10]. We include a classical proof of it in the Appendix.
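For finite alphabets, the supremum in (2) is a linear program, and (by the remark after Definition 2) its optimum puts $\Phi = 1$ on the outcomes with the largest likelihood ratio $P(x)/Q(x)$, with at most one fractional value at the threshold. The following Python sketch is a minimal illustration of this, not part of the paper; the function name is our own, and logarithms are taken to base 2.

```python
import numpy as np

def smooth_zero_divergence(P, Q, delta):
    """Compute D_0^delta(P||Q) for finite distributions (log base 2).

    The supremum in (2) is a linear program: minimize sum(Phi*Q) subject
    to sum(Phi*P) >= 1 - delta and 0 <= Phi <= 1. Its optimum sets
    Phi = 1 on the outcomes with the largest ratio P/Q (Neyman-Pearson
    style), with at most one fractional entry at the threshold.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Outcomes with Q = 0 but P > 0 can be included at no Q-cost.
    ratio = np.where(Q > 0, P / np.where(Q > 0, Q, 1.0), np.inf)
    order = np.argsort(-ratio)           # decreasing likelihood ratio
    p_needed = 1.0 - delta               # P-mass that Phi must capture
    q_paid = 0.0                         # Q-mass paid for that capture
    for i in order:
        if p_needed <= 1e-12:
            break
        if P[i] <= 0:
            continue                     # useless for the constraint
        phi = min(1.0, p_needed / P[i])  # fractional only at threshold
        q_paid += phi * Q[i]
        p_needed -= phi * P[i]
    return np.inf if q_paid == 0 else -np.log2(q_paid)

# delta = 0 recovers D_0(P||Q) = -log2 of Q(supp(P)); a larger delta
# lets Phi discard the P-atoms that are most expensive under Q.
P = [0.5, 0.3, 0.2, 0.0]
Q = [0.1, 0.2, 0.3, 0.4]
print(smooth_zero_divergence(P, Q, 0.0))   # -log2(0.6), about 0.737
print(smooth_zero_divergence(P, Q, 0.2))   # strictly larger
```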
Lemma 2: Let $P_n$ and $Q_n$ be probability measures on $(\Omega_n, \mathcal{F}_n)$ for every $n \in \mathbb{N}$. Then

$\lim_{\delta \downarrow 0} \liminf_{n\to\infty} \frac{1}{n} D_0^\delta(P_n\|Q_n) = \{P_n\}\text{-}\liminf_{n\to\infty} \frac{1}{n} \log \frac{dP_n}{dQ_n}$.   (4)

Here $\{P_n\}$-$\liminf$ means the lim inf in probability with respect to the sequence of probability measures $\{P_n\}$; that is, for a real stochastic process $\{Z_n\}$,

$\{P_n\}\text{-}\liminf_{n\to\infty} Z_n \triangleq \sup\left\{ a \in \mathbb{R} : \lim_{n\to\infty} P_n(\{Z_n < a\}) = 0 \right\}$.

In particular, let $P^{\times n}$ and $Q^{\times n}$ denote the product distributions of $P$ and $Q$, respectively, on $(\Omega^{\otimes n}, \mathcal{F}^{\otimes n})$; then

$\lim_{\delta \downarrow 0} \lim_{n\to\infty} \frac{1}{n} D_0^\delta\big(P^{\times n} \| Q^{\times n}\big) = D(P\|Q)$.   (5)

Proof:
See Appendix.

III. THE CONVERSE
We first state and prove a lemma.
Lemma 3:
Let $M$ be uniformly distributed over $\{1,\ldots,m\}$ and let $\hat{M}$ also take values in $\{1,\ldots,m\}$. If the probability that $\hat{M} \neq M$ is at most $\epsilon$, then

$\log m \leq D_0^\epsilon\big(P_{M\hat{M}} \,\|\, P_M \times P_{\hat{M}}\big)$,

where $P_{M\hat{M}}$ denotes the joint distribution of $M$ and $\hat{M}$, while $P_M$ and $P_{\hat{M}}$ denote its marginals.

Proof:
Let $\Phi$ be the indicator of the event $M = \hat{M}$, i.e.,

$\Phi(i,\hat{i}) \triangleq \begin{cases} 1, & i = \hat{i}, \\ 0, & \text{otherwise}, \end{cases} \qquad i, \hat{i} \in \{1,\ldots,m\}$.

Because, by assumption, the probability that $M \neq \hat{M}$ is not larger than $\epsilon$, we have

$\int_{\{1,\ldots,m\}^{\otimes 2}} \Phi\, dP_{M\hat{M}} \geq 1 - \epsilon$.

Thus, to prove the lemma, it suffices to show that

$\log m \leq -\log \int_{\{1,\ldots,m\}^{\otimes 2}} \Phi\, d\big(P_M \times P_{\hat{M}}\big)$.   (6)

To justify this we write

$\int_{\{1,\ldots,m\}^{\otimes 2}} \Phi\, d\big(P_M \times P_{\hat{M}}\big) = \sum_{i=1}^m P_M(\{i\}) \cdot P_{\hat{M}}(\{i\}) = \sum_{i=1}^m \frac{1}{m} \cdot P_{\hat{M}}(\{i\}) = \frac{1}{m}$,

from which it follows that (6) is satisfied with equality.

Theorem 1 (Converse): An $(m,\epsilon)$-code satisfies

$\log m \leq \sup_{P_X} D_0^\epsilon(P_{XY} \,\|\, P_X \times P_Y)$,   (7)

where $P_{XY}$ and $P_Y$ are probability distributions on $\mathcal{X} \times \mathcal{Y}$ and $\mathcal{Y}$, respectively, induced by $P_X$ and the channel law.

Proof:
Choose $P_X$ to be the distribution induced by the message uniformly distributed over $\{1,\ldots,m\}$; then

$\log m \leq D_0^\epsilon\big(P_{M\hat{M}} \,\|\, P_M \times P_{\hat{M}}\big) \leq D_0^\epsilon(P_{XY} \,\|\, P_X \times P_Y)$,

where the first inequality follows by Lemma 3, and the second inequality by Lemma 1 Part 4) and the fact that $M \to X \to Y \to \hat{M}$ forms a Markov chain. Theorem 1 follows.
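As a sanity check, Lemma 3 can be verified numerically: build any joint distribution $P_{M\hat{M}}$ with uniform marginal $P_M$ and error probability at most $\epsilon$, and compare $\log m$ against $D_0^\epsilon(P_{M\hat{M}} \| P_M \times P_{\hat{M}})$. The sketch below is our own illustration (the toy decoder kernel is not from the paper) and reuses the `smooth_zero_divergence` function from the Section II sketch.

```python
import numpy as np

m, eps = 4, 0.1
# Toy decoder output: correct with probability 1 - eps, otherwise
# uniform over the m - 1 wrong messages.
W = np.full((m, m), eps / (m - 1))
np.fill_diagonal(W, 1.0 - eps)

P_M = np.full(m, 1.0 / m)                 # uniform message
P_joint = (P_M[:, None] * W).ravel()      # P_{M, M_hat} as a flat vector
P_Mhat = (P_M[:, None] * W).sum(axis=0)   # marginal of M_hat
P_prod = np.outer(P_M, P_Mhat).ravel()    # P_M x P_{M_hat}

lhs = np.log2(m)
rhs = smooth_zero_divergence(P_joint, P_prod, eps)
print(lhs, rhs)   # here both equal 2.0: the bound holds with equality
```

In this symmetric example the optimal $\Phi$ in (2) is exactly the indicator used in the proof of Lemma 3, which is why (6), and hence the lemma, is tight here.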
IV. ACHIEVABILITY

Theorem 2 (Achievability):
For any channel, any $\epsilon > 0$ and $\epsilon' \in [0, \epsilon)$, there exists an $(m,\epsilon)$-code satisfying

$\log m \geq \sup_{P_X} D_0^{\epsilon'}(P_{XY} \,\|\, P_X \times P_Y) - \log \frac{1}{\epsilon - \epsilon'}$,   (8)

where $P_{XY}$ and $P_Y$ are induced by $P_X$ and the channel law.

The proof of Theorem 2 can be thought of as a generalization of Shannon's original achievability proof [1]. We use random coding as in [1]; for the decoder, we generalize Shannon's typicality decoder to allow, instead of the indicator for the jointly typical set, an arbitrary function on input-output pairs.

Proof:
For any distribution $P_X$ on $\mathcal{X}$ and any $m \in \mathbb{Z}^+$, we randomly generate a codebook of size $m$ such that the $m$ codewords are independent and identically distributed according to $P_X$. We shall show that, for any $\epsilon'$, there exists a decoding rule associated with each codebook such that the average probability of a decoding error, averaged over all such codebooks, satisfies

$\Pr(\text{error}) \leq (m-1) \cdot 2^{-D_0^{\epsilon'}(P_{XY}\|P_X \times P_Y)} + \epsilon'$.   (9)

Then there exists at least one codebook whose average probability of error is upper-bounded by the right-hand side (RHS) of (9). That this codebook satisfies (8) follows by rearranging terms in (9).

We shall next prove (9). For a given codebook and any $\Phi$:
$\mathcal{X} \times \mathcal{Y} \to [0,1]$ which satisfies

$\int_{\mathcal{X}\times\mathcal{Y}} \Phi\, dP_{XY} \geq 1 - \epsilon'$,   (10)

we use the following random decoding rule: when $y$ is received, select some or none of the messages such that message $j$ is selected with probability $\Phi(f(j), y)$, independently of the other messages. If only one message is selected, output this message; otherwise declare an error. (It is well-known that, for the channel model considered in this paper, the average probability of error cannot be improved by allowing random decoding rules.)
To analyze the error probability, suppose $i$ was the transmitted message. The error event is the union of $E_1$ and $E_2$, where $E_1$ denotes the event that some message other than $i$ is selected, and $E_2$ denotes the event that message $i$ is not selected.

We first bound $\Pr(E_1)$ averaged over all codebooks. Fix $f(i)$ and $y$. The probability, averaged over all codebooks, of selecting a particular message other than $i$ is given by

$\int_{\mathcal{X}} \Phi(x,y)\, P_X(dx)$.

Since there are $(m-1)$ such messages, we can use the union bound to obtain

$\mathbb{E}[\Pr(E_1 \,|\, f(i), y)] \leq (m-1) \cdot \int_{\mathcal{X}} \Phi(x,y)\, P_X(dx)$.   (11)

Since the RHS of (11) does not depend on $f(i)$, we further have

$\mathbb{E}[\Pr(E_1 \,|\, y)] \leq (m-1) \cdot \int_{\mathcal{X}} \Phi(x,y)\, P_X(dx)$.

Averaging this inequality over $y$ gives

$\mathbb{E}[\Pr(E_1)] \leq (m-1) \int_{\mathcal{Y}} \left( \int_{\mathcal{X}} \Phi(x,y)\, P_X(dx) \right) P_Y(dy) = (m-1) \int_{\mathcal{X}\times\mathcal{Y}} \Phi\, d(P_X \times P_Y)$.   (12)

On the other hand, the probability of $E_2$ averaged over all generated codebooks can be bounded as

$\mathbb{E}[\Pr(E_2)] = \int_{\mathcal{X}\times\mathcal{Y}} (1-\Phi)\, dP_{XY} \leq \epsilon'$.   (13)

Combining (12) and (13) yields

$\Pr(\text{error}) \leq (m-1) \int_{\mathcal{X}\times\mathcal{Y}} \Phi\, d(P_X \times P_Y) + \epsilon'$.   (14)

Finally, since (14) holds for every $\Phi$ satisfying (10), we establish (9) and thus conclude the proof of Theorem 2.
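The random decoding rule is easy to simulate. The sketch below is our own illustration, not the authors' code: it uses a small BSC and the deterministic test $\Phi(x,y) = \mathbf{1}\{d_H(x,y) \le t\}$ (a special case of the rule, under which "select message $j$ with probability $\Phi(f(j),y)$" reduces to selecting every codeword within Hamming distance $t$), and compares the Monte-Carlo error estimate with the RHS of (14).

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n, p, m, t = 16, 0.11, 4, 5   # block length, BSC flip prob., messages, radius

# With Phi = 1{d_H <= t}: eps' = P(Phi = 0 under P_XY) is a binomial tail,
# and for uniform inputs the Y-marginal is uniform, so the integral of Phi
# against P_X x P_Y is just the fraction of pairs within distance t.
eps_prime = sum(comb(n, k) * p**k * (1-p)**(n-k) for k in range(t+1, n+1))
q_mass = sum(comb(n, k) for k in range(t+1)) / 2.0**n
bound = (m - 1) * q_mass + eps_prime      # RHS of (14), i.e. (9)

errors, trials = 0, 20000
for _ in range(trials):
    book = rng.integers(0, 2, size=(m, n))   # i.i.d. uniform codewords
    y = book[0] ^ (rng.random(n) < p)        # send message 0 over the BSC
    selected = np.flatnonzero((book != y).sum(axis=1) <= t)
    if selected.size != 1 or selected[0] != 0:
        errors += 1
print(f"simulated error {errors/trials:.3f} <= bound {bound:.3f}")
```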
V. ASYMPTOTIC ANALYSIS

In this section we use the new bounds to study the capacity of a channel whose structure can be arbitrary. Such a channel is described by stochastic kernels from $\mathcal{X}^n$ to $\mathcal{Y}^n$ for all $n \in \mathbb{Z}^+$, where $\mathcal{X}$ and $\mathcal{Y}$ are the input and output alphabets, respectively. An $(n, M, \epsilon)$-code on such a channel consists of an encoder and a decoder such that a message of size $M$ can be transmitted by mapping it to an element of $\mathcal{X}^n$ while the probability of error is no larger than $\epsilon$. The capacity and the optimistic capacity [14] of a channel are defined as follows.

Definition 3 (Capacity and Optimistic Capacity):
The capacity $C$ of a channel is the supremum over all $R$ for which there exists a sequence of $(n, M_n, \epsilon_n)$-codes such that

$\frac{\log M_n}{n} \geq R, \quad n \in \mathbb{Z}^+$,   (15)

and $\lim_{n\to\infty} \epsilon_n = 0$.

The optimistic capacity $\overline{C}$ of a channel is the supremum over all $R$ for which there exists a sequence of $(n, M_n, \epsilon_n)$-codes such that (15) holds and $\liminf_{n\to\infty} \epsilon_n = 0$.

Given Definition 3, the next theorem is an immediate consequence of Theorems 1 and 2.
Theorem 3 (Capacity Formulas):
Any channel satisfies

$C = \lim_{\epsilon \downarrow 0} \liminf_{n\to\infty} \frac{1}{n} \sup_{P_{X^n}} D_0^\epsilon\big(P_{X^nY^n} \,\|\, P_{X^n} \times P_{Y^n}\big)$,   (16)

$\overline{C} = \lim_{\epsilon \downarrow 0} \limsup_{n\to\infty} \frac{1}{n} \sup_{P_{X^n}} D_0^\epsilon\big(P_{X^nY^n} \,\|\, P_{X^n} \times P_{Y^n}\big)$.   (17)

Remark:
According to Lemma 2, (16) is equivalent to [2, (1.4)]. It can also be shown that (17) is equivalent to [15, Theorem 4.4].

We can also use Theorems 1 and 2 to study the $\epsilon$-capacities, which are usually defined as follows (see, for example, [2], [15]).

Definition 4 ($\epsilon$-Capacity and Optimistic $\epsilon$-Capacity): The $\epsilon$-capacity $C_\epsilon$ of a channel is the supremum over all $R$ such that, for every large enough $n$, there exists an $(n, M_n, \epsilon)$-code satisfying

$\frac{\log M_n}{n} \geq R$.

The optimistic $\epsilon$-capacity $\overline{C}_\epsilon$ of a channel is the supremum over all $R$ for which there exist $(n, M_n, \epsilon)$-codes for infinitely many $n$'s satisfying

$\frac{\log M_n}{n} \geq R$.

The following bounds on the $\epsilon$-capacity and optimistic $\epsilon$-capacity of a channel are immediate consequences of Theorems 1 and 2. They can be shown to be equivalent to those in [2, Theorem 6], [16, Theorem 7] and [15, Theorem 4.3]. As in those previous results, the bounds for $C_\epsilon$ ($\overline{C}_\epsilon$) coincide except possibly at the points of discontinuity of $C_\epsilon$ ($\overline{C}_\epsilon$).

Theorem 4 (Bounds on $\epsilon$-Capacities): For any channel and any $\epsilon \in (0,1)$, the $\epsilon$-capacity of the channel satisfies

$C_\epsilon \leq \liminf_{n\to\infty} \frac{1}{n} \sup_{P_{X^n}} D_0^\epsilon\big(P_{X^nY^n} \,\|\, P_{X^n} \times P_{Y^n}\big)$,

$C_\epsilon \geq \lim_{\epsilon' \uparrow \epsilon} \liminf_{n\to\infty} \frac{1}{n} \sup_{P_{X^n}} D_0^{\epsilon'}\big(P_{X^nY^n} \,\|\, P_{X^n} \times P_{Y^n}\big)$;

and the optimistic $\epsilon$-capacity of the channel satisfies

$\overline{C}_\epsilon \leq \limsup_{n\to\infty} \frac{1}{n} \sup_{P_{X^n}} D_0^\epsilon\big(P_{X^nY^n} \,\|\, P_{X^n} \times P_{Y^n}\big)$,

$\overline{C}_\epsilon \geq \lim_{\epsilon' \uparrow \epsilon} \limsup_{n\to\infty} \frac{1}{n} \sup_{P_{X^n}} D_0^{\epsilon'}\big(P_{X^nY^n} \,\|\, P_{X^n} \times P_{Y^n}\big)$.
VI. NUMERICAL COMPARISON WITH EXISTING BOUNDS FOR THE BSC

In this section we compare the new achievability bound obtained in this paper with the bounds by Gallager [5] and Polyanskiy et al. [3]. We consider the memoryless binary symmetric channel (BSC) with crossover probability 0.11. Thus, for $n$ channel uses, the input and output alphabets are both $\{0,1\}^n$ and the channel law is given by

$P_{Y^n|X^n}(y^n|x^n) = 0.11^{|y^n - x^n|} \cdot 0.89^{n - |y^n - x^n|}$,

where $|\cdot|$ denotes the Hamming weight of a binary vector. The average block-error rate is chosen to be $10^{-3}$.

In the calculations of all three achievability bounds we choose $P_{X^n}$ to be uniform on $\{0,1\}^n$. For comparison we include the plot of the converse used in [3]. Our new converse bound involves optimization over input distributions and is thus difficult to compute. In fact, in this example it is less tight compared to the one in [3], since for the uniform input distribution $D_0^\epsilon(P_{X^nY^n} \| P_{X^n} \times P_{Y^n})$ coincides with the latter.

Comparison of the curves is shown in Figure 1.
Fig. 1. Comparison of the new achievability bound with Gallager [5] and Polyanskiy et al. [3] for the BSC with crossover probability 0.11 and average block-error rate $10^{-3}$. The converse is the one used in [3]. (The figure plots rate against block-length $n$ for the converse, the bound of Polyanskiy et al., the new achievability bound, and Gallager's bound.)

For the example we consider, the new achievability bound is always less tight than the one in [3], though the difference is small. It outperforms Gallager's bound for large block-lengths.
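For the uniform input distribution on the BSC, the likelihood ratio $\frac{dP_{X^nY^n}}{d(P_{X^n}\times P_{Y^n})} = 2^n p^d (1-p)^{n-d}$ depends on $(x^n, y^n)$ only through the Hamming distance $d = |y^n - x^n|$, so $D_0^{\epsilon'}$ can be evaluated over the $n+1$ distance classes rather than over all $4^n$ pairs. The sketch below is our own illustration of how such curves can be produced (it is not the authors' code); it computes the rate guaranteed by (8), optimizing the split between $\epsilon'$ and $\epsilon - \epsilon'$ by a crude grid search.

```python
import numpy as np
from math import comb, log2

def new_achievability_rate(n, p=0.11, eps=1e-3, grid=50):
    """Rate per channel use guaranteed by (8) for BSC(p), uniform input.

    The ratio 2^n p^d (1-p)^(n-d) is decreasing in d, so the optimal Phi
    fills the distance classes d = 0, 1, ... until it holds 1 - eps' of
    the P_{XY}-mass, paying C(n,d) 2^{-n} of (P_X x P_Y)-mass per class.
    """
    pxy = np.array([comb(n, d) * p**d * (1-p)**(n-d) for d in range(n+1)])
    pq  = np.array([comb(n, d) / 2.0**n for d in range(n+1)])
    best = 0.0
    for eps_prime in np.linspace(0.0, eps, grid, endpoint=False):
        need, paid = 1.0 - eps_prime, 0.0
        for d in range(n + 1):
            if need <= 1e-15:
                break
            if pxy[d] <= 0.0:      # underflowed tail class, no P-mass left
                continue
            phi = min(1.0, need / pxy[d])
            paid += phi * pq[d]
            need -= phi * pxy[d]
        log_m = -log2(paid) - log2(1.0 / (eps - eps_prime))
        best = max(best, log_m / n)
    return best

# Rates approach the capacity 1 - h(0.11) of about 0.5 bit from below.
for n in (200, 500, 1000):
    print(n, round(new_achievability_rate(n), 4))
```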
APPENDIX

In this appendix we prove Lemma 2. We first show that

$\lim_{\delta \downarrow 0} \liminf_{n\to\infty} \frac{1}{n} D_0^\delta(P_n\|Q_n) \geq \{P_n\}\text{-}\liminf_{n\to\infty} \frac{1}{n} \log \frac{dP_n}{dQ_n}$.   (18)

To this end, consider any $a$ satisfying

$0 < a < \{P_n\}\text{-}\liminf_{n\to\infty} \frac{1}{n} \log \frac{dP_n}{dQ_n}$.   (19)

Let $A_n(a) \in \mathcal{F}_n$, $n \in \mathbb{N}$, be the union of all measurable sets on which

$\frac{1}{n} \log \frac{dP_n}{dQ_n} \geq a$.   (20)

Let $\Phi_n: \Omega_n \to [0,1]$, $n \in \mathbb{N}$, equal 1 on $A_n(a)$ and equal 0 elsewhere; then by (19) we have

$\lim_{n\to\infty} \int_{\Omega_n} \Phi_n\, dP_n = \lim_{n\to\infty} P_n(A_n(a)) = 1$.   (21)

Thus we have

$\lim_{\delta \downarrow 0} \liminf_{n\to\infty} \frac{1}{n} D_0^\delta(P_n\|Q_n) \geq \liminf_{n\to\infty} \left( -\frac{1}{n} \log \int_{\Omega_n} \Phi_n\, dQ_n \right) \geq \liminf_{n\to\infty} \left( -\frac{1}{n} \log \left( \int_{\Omega_n} \Phi_n\, dP_n \cdot 2^{-na} \right) \right) = \liminf_{n\to\infty} \left( -\frac{1}{n} \log \big( P_n(A_n(a)) \cdot 2^{-na} \big) \right) = a$,   (22)

where the first inequality follows because, according to (21), for any $\delta > 0$, $\int_{\Omega_n} \Phi_n\, dP_n = P_n(A_n(a)) \geq 1-\delta$ for large enough $n$; the second inequality by (20) and the fact that $\Phi_n$ is zero outside $A_n(a)$; and the last equality by (21). Since (22) holds for every $a$ satisfying (19), we obtain (18).

We next show the other direction, namely, we show that

$\lim_{\delta \downarrow 0} \liminf_{n\to\infty} \frac{1}{n} D_0^\delta(P_n\|Q_n) \leq \{P_n\}\text{-}\liminf_{n\to\infty} \frac{1}{n} \log \frac{dP_n}{dQ_n}$.   (23)

To this end, consider any

$b > \{P_n\}\text{-}\liminf_{n\to\infty} \frac{1}{n} \log \frac{dP_n}{dQ_n}$.   (24)

Let $A'_n(b)$, $n \in \mathbb{N}$, be the union of all measurable sets on which

$\frac{1}{n} \log \frac{dP_n}{dQ_n} \leq b$.   (25)

By (24) we have that there exists some $c \in (0,1]$ such that

$\limsup_{n\to\infty} P_n(A'_n(b)) = c$.   (26)

For every $\delta \in (0,c)$, consider any sequence of $\Phi_n: \Omega_n \to [0,1]$ satisfying

$\int_{\Omega_n} \Phi_n\, dP_n \geq 1-\delta, \quad n \in \mathbb{N}$.   (27)

Combining (26) and (27) yields

$\limsup_{n\to\infty} \int_{A'_n(b)} \Phi_n\, dP_n \geq c - \delta$.   (28)

On the other hand, from (25) it follows that

$\int_{A'_n(b)} \Phi_n\, dQ_n \geq \int_{A'_n(b)} \Phi_n\, dP_n \cdot 2^{-nb}$.   (29)

Combining (28) and (29) yields

$\liminf_{n\to\infty} \left( -\frac{1}{n} \log \int_{A'_n(b)} \Phi_n\, dQ_n \right) \leq b$.

Thus we obtain that, for every $\delta \in (0,c)$ and every sequence $\Phi_n: \Omega_n \to [0,1]$ satisfying (27),

$\liminf_{n\to\infty} \left( -\frac{1}{n} \log \int_{\Omega_n} \Phi_n\, dQ_n \right) \leq \liminf_{n\to\infty} \left( -\frac{1}{n} \log \int_{A'_n(b)} \Phi_n\, dQ_n \right) \leq b$.

This implies that, for every $\delta \in (0,c)$,

$\liminf_{n\to\infty} \frac{1}{n} D_0^\delta(P_n\|Q_n) \leq b$.   (30)

Inequality (30) still holds when we take the limit $\delta \downarrow 0$. Since this is true for every $b$ satisfying (24), we establish (23). Combining (18) and (23) proves (4).

Finally, (5) follows from (4) because, by the law of large numbers,

$\frac{1}{n} \log \frac{d(P^{\times n})}{d(Q^{\times n})} \to \mathbb{E}_P\left[ \log \frac{dP}{dQ} \right] = D(P\|Q)$   (31)

as $n \to \infty$, $P^{\times n}$-almost surely. This completes the proof of Lemma 2.
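The convergence in (5) is easy to observe numerically for Bernoulli distributions: product atoms with the same number of ones share a single likelihood ratio, so the $n+1$ "type" classes form equivalent finite distributions and $D_0^\delta(P^{\times n}\|Q^{\times n})$ can be fed to the `smooth_zero_divergence` sketch from Section II directly. The snippet below is our own illustration; it fixes a small $\delta$ rather than taking $\delta \downarrow 0$, which suffices here since for product distributions the normalized divergence tends to $D(P\|Q)$ for any fixed $\delta \in (0,1)$ (a Stein's-lemma-type statement).

```python
import numpy as np
from math import comb, log2

p, q, delta = 0.3, 0.5, 0.01
D = p*log2(p/q) + (1-p)*log2((1-p)/(1-q))   # D(P||Q), about 0.1187 bits

for n in (10, 100, 1000):
    # Atoms of P^n and Q^n with k ones share one likelihood ratio
    # (p/q)^k ((1-p)/(1-q))^(n-k), so grouping them is lossless for (2).
    Pk = [comb(n, k) * p**k * (1-p)**(n-k) for k in range(n+1)]
    Qk = [comb(n, k) * q**k * (1-q)**(n-k) for k in range(n+1)]
    print(n, smooth_zero_divergence(Pk, Qk, delta) / n, "->", D)
```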
ACKNOWLEDGMENT

RR acknowledges support from the Swiss National Science Foundation (grant No. 200021-119868).
REFERENCES

[1] C. E. Shannon, "A mathematical theory of communication," Bell System Techn. J., vol. 27, pp. 379–423 and 623–656, July and Oct. 1948.
[2] S. Verdú and T. S. Han, "A general formula for channel capacity," IEEE Trans. Inform. Theory, vol. 40, no. 4, pp. 1147–1157, July 1994.
[3] Y. Polyanskiy, H. V. Poor, and S. Verdú, "New channel coding achievability bounds," in Proc. IEEE Int. Symposium on Inf. Theory, Toronto, Canada, July 6–11, 2008.
[4] A. Rényi, "On measures of entropy and information," in Proc. 4th Berkeley Symposium on Mathematical Statistics and Probability, University of California, Berkeley, CA, USA, 1961.
[5] R. G. Gallager, "A simple derivation of the coding theorem and some applications," IEEE Trans. Inform. Theory, vol. 11, no. 1, pp. 3–19, Jan. 1965.
[6] S. Arimoto, "Information measures and capacity of order α for discrete memoryless channels," in Topics in Information Theory, I. Csiszár and P. Elias, Eds., 1977, pp. 41–52.
[7] I. Csiszár, "Generalized cutoff rates and Rényi's information measures," IEEE Trans. Inform. Theory, vol. 41, no. 1, pp. 26–34, Jan. 1995.
[8] R. Renner, S. Wolf, and J. Wullschleger, "The single-serving channel capacity," in Proc. IEEE Int. Symposium on Inf. Theory, Seattle, Washington, USA, July 9–14, 2006.
[9] R. Renner and S. Wolf, "Smooth Rényi entropy and applications," in Proc. IEEE Int. Symposium on Inf. Theory, Chicago, Illinois, USA, June 27 – July 2, 2004.
[10] N. Datta, "Min- and max-relative entropies and a new entanglement monotone," 2008. [Online]. Available: http://arxiv.org/abs/0803.2770
[11] L. Wang and R. Renner, "One-shot classical capacity of quantum channels," presented at The Twelfth Workshop on Quantum Inform. Processing (QIP), poster session.
[12] T. S. Han and S. Verdú, "Approximation theory of output statistics," IEEE Trans. Inform. Theory, vol. 39, no. 3, pp. 752–772, 1993.
[13] T. S. Han, Information Spectrum Methods in Information Theory. Springer Verlag, 2003.
[14] S. Vembu, S. Verdú, and Y. Steinberg, "The source-channel separation theorem revisited," IEEE Trans. Inform. Theory, vol. 41, no. 1, pp. 44–54, Jan. 1995.
[15] P.-N. Chen and F. Alajaji, "Optimistic Shannon coding theorems for arbitrary single-user systems," IEEE Trans. Inform. Theory, vol. 45, no. 7, pp. 2623–2629, Nov. 1999.
[16] Y. Steinberg, "New converses in the theory of identification via channels," IEEE Trans. Inform. Theory, vol. 44, no. 3, pp. 984–998, May 1998.