Codes That Achieve Capacity on Symmetric Channels
Vishvajeet Nagargoje
Indian Institute of Technology Madras
Under the guidance of Prof. Prahladh Harsha, Tata Institute of Fundamental Research, Mumbai
As part of the Visiting Students Research Programme at the School of Technology and Computer Science, Tata Institute of Fundamental Research, Mumbai, India - 400005.
Email: [email protected], Contact: +91-9884299504. Email: [email protected]
ACKNOWLEDGEMENT
I would like to express my deep gratitude to Prof. Prahladh Harsha and Prof. Vinod Prabhakaran, my internship supervisors, for giving me an opportunity to work under them during the summer. I would also like to thank Sasank Mouli for helping me out with things. I would also like to acknowledge the Tata Institute of Fundamental Research, Mumbai, for warmly hosting me in the summer.

Abstract
Transmission of information reliably and efficiently across channels is one of the fundamental goals of coding and information theory. In this respect, provably capacity-achieving, efficiently decodable deterministic coding schemes were elusive until as recently as 2008, even though schemes which come close to capacity in practice existed. This survey tries to give the interested reader an overview of the area. Erdal Arikan came up with his landmark polar coding schemes, which achieve capacity on symmetric channels subject to the constraint that the input codewords are equiprobable. His idea is to convert any B-DMC into efficiently encodable and decodable channels whose capacities are 0 or 1, while conserving capacity in this transformation. A probability of error decreasing in the block length, independently of the code rate, is achieved for all rates less than the symmetric capacity. These codes perform well in practice since the encoding and decoding complexity is $O(N \log N)$. Guruswami et al. improved the above results by showing that the error probability can be made to decrease almost exponentially in the block length. We also study recent results by Urbanke et al. which show that 2-transitive codes achieve capacity on erasure channels under MAP decoding. Urbanke and his group use sharp-threshold results from the analysis of boolean functions to prove that EXIT functions, which capture the error probability, have a sharp threshold at 1 - R, thus proving that capacity is achieved. One of the oldest and most widely used families of codes, Reed-Muller codes, is 2-transitive. Polar codes are 2-transitive too, and we thus have a different proof of the fact that they achieve capacity, though the rate of polarization in Guruswami's paper is better.
Contents
Introduction
Polar Codes
  Channel Polarization
  Rate of Polarization
  Polar Coding
  Successive Cancellation Decoder
  Encoding and Decoding Complexity
Reed Muller Codes
Reed Muller Codes on Erasure Channels
Future Work
Bibliography
Introduction
Communication is a fundamental need of our lives. Even though in modern times the need for communication and the (ever-the-more sophisticated) tools available to us have increased, communication is something which we humans have been doing for a long time. Inherently, we have error correcting capabilities built into us, which help us understand our fellow humans even when there is corruption due to, say, speech defects, physical mediums, etc. We want our communication systems to do the same for us, but unfortunately there are fundamental questions which we have not been able to answer, and things don't look that easy. We, as people studying computer science, like to abstract things out and work with appropriate models. In this pursuit, Alice and Bob come to our rescue, as always. Suppose Alice and Bob want to communicate with each other efficiently over a 'noisy' physical transmission medium which corrupts the information. We abstract out the possible scheme of corruptions into something which we shall call a 'channel'.
Definition 1. Communication Channel
We define a discrete channel to be a system consisting of the following:
• An input alphabet $\chi$
• An output alphabet $\Upsilon$
• A conditional probability transition matrix $P(y \mid x)$
The conditional probability distribution expresses the probability of observing the output symbol y given that we send the symbol x. The channel is said to be memoryless if the probability distribution of the output depends only on the input at that particular time and is independent of the previous channel inputs or outputs. We will be talking about binary memoryless channels in the rest of this article. Let's quickly define the information-theoretic functions that we will be using in the course of this article.
Definition 2.
Entropy of a random variable
The entropy H(X) of a discrete random variable X is defined by
$$H(X) = -\sum_{x \in \chi} p(x) \log p(x).$$
Definition 3.
Mutual information between two random variables
The mutual information I(X;Y) between two random variables X and Y is defined to be
$$I(X;Y) = \sum_{x \in \chi} \sum_{y \in \Upsilon} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}.$$
Definition 4. (Shannon) Capacity of a Channel
The "information" channel capacity of a discrete memoryless channel is defined as
$$C = \max_{P(x)} I(X;Y).$$
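To make these definitions concrete, here is a minimal numerical sketch (plain Python; the helper names are our own, not from any reference) that computes H(X) and I(X;Y) directly from an input distribution and a transition matrix:

```python
from math import log2

def entropy(px):
    """H(X) = -sum_x p(x) log2 p(x)."""
    return -sum(p * log2(p) for p in px if p > 0)

def mutual_information(px, W):
    """I(X;Y) for input distribution px and channel transition matrix W[x][y]."""
    ny = len(W[0])
    py = [sum(px[x] * W[x][y] for x in range(len(px))) for y in range(ny)]
    return sum(px[x] * W[x][y] * log2(W[x][y] / py[y])
               for x in range(len(px)) for y in range(ny) if W[x][y] > 0)

# a binary channel that flips its input with probability 0.1, uniform input
W = [[0.9, 0.1], [0.1, 0.9]]
print(entropy([0.5, 0.5]))                # 1.0 bit
print(mutual_information([0.5, 0.5], W))  # about 0.531 bits
```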
Observe that the capacity of a channel is completely characterized once we specify the input-output conditional probability distribution. Let us look at a few frequently occurring discrete channels.
Definition 5.
Noiseless Binary Channel
Also called the 'perfect channel', this is a channel such that:
• $\chi = \{0, 1\}$
• $\Upsilon = \{0, 1\}$
• $P(y = 0 \mid x = 0) = 1$ and $P(y = 1 \mid x = 1) = 1$
Observe that we do not need any coding scheme if we use the noiseless channel.
Definition 6.
Binary Erasure Channel
An erasure channel with erasure probability p has the following parameters:
• $\chi = \{0, 1\}$
• $\Upsilon = \{0, 1, e\}$
• $P(y = 0 \mid x = 0) = 1 - p$, $P(y = 1 \mid x = 1) = 1 - p$, $P(y = e \mid x = 0) = p$ and $P(y = e \mid x = 1) = p$
Definition 7.
Binary Symmetric Channel
A binary symmetric channel with flip-over probability p has the following parameters:
• $\chi = \{0, 1\}$
• $\Upsilon = \{0, 1\}$
• $P(y = 0 \mid x = 0) = 1 - p$, $P(y = 1 \mid x = 1) = 1 - p$, $P(y = 1 \mid x = 0) = p$ and $P(y = 0 \mid x = 1) = p$
Definition 8.
Symmetric Channel
A binary-input channel W for which there exists a permutation $\pi$ of the output alphabet $\Upsilon$ such that:
• $\pi^{-1} = \pi$
• $W(y \mid 1) = W(\pi(y) \mid 0)$ for all $y \in \Upsilon$
The symmetric capacity I(W) is the mutual information I(X;Y) when the input X is uniform. Observe that the symmetric capacity I(W) equals the Shannon capacity when W is a symmetric channel. We can also see that the BEC and the BSC are symmetric channels.
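As a quick numerical check of this observation, the sketch below (our own code; it reuses the mutual_information helper from the earlier snippet) evaluates the symmetric capacity I(W) for a BEC and a BSC under uniform inputs; the values match the known closed forms $C_{BEC} = 1 - p$ and $C_{BSC} = 1 - H(p)$:

```python
# channel matrices W[x][y]; the BEC's third output column is the erasure symbol
p = 0.2
bec = [[1 - p, 0.0, p],
       [0.0, 1 - p, p]]
bsc = [[1 - p, p],
       [p, 1 - p]]

uniform = [0.5, 0.5]
# uses mutual_information() from the earlier snippet
print(mutual_information(uniform, bec))  # 0.8   = 1 - p
print(mutual_information(uniform, bsc))  # 0.278 = 1 - H(0.2), approximately
```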
Observe that in all channels except the noiseless channel we cannot decode correctly unless we use encoding and decoding schemes. Intuitively it is clear that if we add some amount of redundancy in the code, it would be easier for us to correct errors. But this leads to transmission of more bits than the message actually contains. The fundamental question in Information and Coding Theory is the tradeoff between redundancy and the number of errors that can be corrected. We shall formalize the notion of redundancy in a code.
Definition 9.
Rate of Code
The rate of a code is defined to be
$$R = \frac{\text{dimension}}{\text{block length}} = \frac{k}{n}.$$
Claude Shannon gave an operational definition of the channel capacity, which implies that it is the maximum rate at which we can transmit information across a channel reliably, with error going to zero in the limit of the block length going to infinity.
Theorem 1.
Shannon's coding theorem
Given a noisy channel with channel capacity C and information transmission rate R: if R < C, there exist codes that allow the probability of error at the receiver to be made arbitrarily small, and the converse is also true.
Shannon used a random coding approach in his landmark paper. But this random coding approach is not something which satisfies theoretical computer science people like us, and we want to lay our hands on a particular coding scheme which does the above for us reliably. We also want to be able to decode it efficiently. In other words, we want an efficiently decodable, capacity-achieving, deterministic coding scheme. We would like to analyze how bad a coding scheme is by looking at the error probability when the natural ML/MAP decoding scheme is used, which outputs the codeword most likely to have produced the received word. In doing so we require a parameter defined as follows:
Definition 10.
Bhattacharyya parameter
For a channel W the Bhattacharyya parameter is defined as
$$Z(W) \triangleq \sum_{y \in \Upsilon} \sqrt{W(y \mid 0)\, W(y \mid 1)}.$$
Theorem 2.
The Bhattacharyya parameter is an upper bound on the probability of error achieved by ML/MAP decoding.
We can quickly relate the symmetric capacity I(W) and the Bhattacharyya parameter Z(W). Intuitively one would expect that $I(W) \approx 1$ iff $Z(W) \approx 0$, and $I(W) \approx 0$ iff $Z(W) \approx 1$. The following bounds make this precise:
Proposition 1.
Relation between symmetric capacity and Bhattacharyya parameter:
$$I(W) \ge \log \frac{2}{1 + Z(W)}, \qquad I(W) \le \sqrt{1 - Z(W)^2}.$$
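To make the parameter concrete, here is a small sketch (our own Python) computing Z(W) from a transition matrix and checking the bounds of Proposition 1 numerically; for the BEC, Z = p exactly, and for the BSC, $Z = 2\sqrt{p(1-p)}$:

```python
from math import sqrt, log2

def bhattacharyya(W):
    """Z(W) = sum_y sqrt(W(y|0) W(y|1)) for a binary-input channel W[x][y]."""
    return sum(sqrt(W[0][y] * W[1][y]) for y in range(len(W[0])))

p = 0.2
bec = [[1 - p, 0.0, p], [0.0, 1 - p, p]]   # Z(BEC(p)) = p
bsc = [[1 - p, p], [p, 1 - p]]             # Z(BSC(p)) = 2 sqrt(p (1 - p))
for name, W, I in [("BEC", bec, 0.8), ("BSC", bsc, 0.2781)]:
    Z = bhattacharyya(W)
    # check Proposition 1: log2(2 / (1 + Z)) <= I(W) <= sqrt(1 - Z^2)
    print(name, round(Z, 4), log2(2 / (1 + Z)) <= I <= sqrt(1 - Z * Z))
```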
Polar Codes
A deterministic capacity-achieving coding scheme had been elusive for a long time, and in the process computer scientists came up with different types of schemes which come extremely close to achieving capacity in practice but fail to do so provably. It was in 2008 that Erdal Arikan came up with his landmark coding scheme which achieves capacity on symmetric channels. His work is the culmination of more than 20 years of research into the sequential decoding cutoff rate for different types of codes. This section tries to outline the idea behind Arikan's polar codes.
Channel Polarization
It can be observed that it is very easy to code for two types of channels: the perfect channel and the useless channel (a channel in which the output is independent of the input). The main idea behind Arikan's technique is that it manufactures, out of N independent copies of a given B-DMC (binary discrete memoryless channel) W, a second set of channels which is polarized, i.e. consisting of only perfect and useless channels, and this transformation also conserves capacity. The operation goes through two parts: combining and splitting. Let's look at these operations in detail.
• Channel combining
We take $N = 2^n$ copies of the channel W and manufacture a vector channel $W_N : \chi^N \to \Upsilon^N$ recursively. In general, two copies of $W_{N/2}$ are combined to produce $W_N$: the input $u_1^N$ to $W_N$ is first transformed into $s_1^N$, where
$$s_{2i-1} = u_{2i-1} \oplus u_{2i}, \qquad s_{2i} = u_{2i}, \qquad 1 \le i \le N/2.$$
For the 0-th level of recursion (n = 0) we set $W_1 \triangleq W$. The first level of recursion combines two copies of $W_1$ and obtains the channel $W_2$, where
$$W_2(y_1, y_2 \mid u_1, u_2) = W(y_1 \mid u_1 \oplus u_2)\, W(y_2 \mid u_2),$$
and so on.
The operator $R_N$ is a permutation called the reverse shuffle operation, and acts on the input $s_1^N$ to produce $v_1^N = (s_1, s_3, \ldots, s_{N-1}, s_2, s_4, \ldots, s_N)$, which is the input to the two copies of $W_{N/2}$. At each level of recursion, it should be observed that the mapping $u_1^N \to v_1^N$ is linear over GF(2). Inductively, it can be proved that the mapping $u_1^N \to x_1^N$, which takes the input of $W_N$ to the input of the raw channels $W^N$, is also linear and can be denoted by $x_1^N = u_1^N G_N$. We can thus relate the transition probabilities of the two channels $W_N$ and $W^N$ by the following equation:
$$W_N(y_1^N \mid u_1^N) = W^N(y_1^N \mid u_1^N G_N)$$
for appropriately defined alphabets on either side. We shall talk about the implementations of this transformation and the encoding complexity later in this article.
• Channel splitting
As mentioned earlier, we need to split $W_N$ back into a set of N channels. We call them virtual channels and denote them individually as $W_N^{(i)} : \chi \to \Upsilon^N \times \chi^{i-1}$, $1 \le i \le N$, and define them as
$$W_N^{(i)}(y_1^N, u_1^{i-1} \mid u_i) \triangleq \sum_{u_{i+1}^N \in \chi^{N-i}} \frac{1}{2^{N-1}} W_N(y_1^N \mid u_1^N).$$
This definition goes hand in hand with the successive cancellation decoder used in polar coding, as we describe below. Say we are trying to decode the i-th bit and we are given correctly decoded estimates for the first i - 1 bits. We use this vector of i - 1 estimates and the vector of observations $y_1^N$ to get the i-th bit. We assume that the inputs $u_1^N$ are uniformly distributed. Note that even though this vector is transformed into different vectors during the encoding process, the bits stay uniform, because the transformations are linear and invertible. This uniformity dictates the factor $2^{-(N-1)}$ in the sum.
Let us observe a few properties of the virtual channels we get after the above transformation. To make things easy, we shall see what happens when the channel is a BEC($\epsilon$) with uniform input. Since we are talking about uniform input, the capacity is I(W), and
$$I(W_{2N}^{(2i-1)}) = I(W_N^{(i)})^2, \qquad I(W_{2N}^{(2i)}) = 2\, I(W_N^{(i)}) - I(W_N^{(i)})^2, \qquad I(W_1^{(1)}) = 1 - \epsilon.$$
Observe that at every level of recursion we get one channel which has capacity better than the original one and one which has capacity worse than the original one. In order to analyse what happens for the general channel, we would like to see what happens locally to the above quantities. For this we relate the transition probabilities of the virtual channels directly to the individual channels one level up, instead of block by block: we map pairs of independent copies to the channels obtained after splitting, $(W, W) \to (W', W'')$; in more general terms, the following proposition maps $(W_N^{(i)}, W_N^{(i)}) \to (W_{2N}^{(2i-1)}, W_{2N}^{(2i)})$.
Proposition 2. Recursive channel transformations
For any $n \ge 0$, $N = 2^n$, $1 \le i \le N$,
$$W_{2N}^{(2i-1)}(y_1^{2N}, u_1^{2i-2} \mid u_{2i-1}) = \sum_{u_{2i}} \frac{1}{2}\, W_N^{(i)}(y_1^N, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} \mid u_{2i-1} \oplus u_{2i}) \cdot W_N^{(i)}(y_{N+1}^{2N}, u_{1,e}^{2i-2} \mid u_{2i})$$
and
$$W_{2N}^{(2i)}(y_1^{2N}, u_1^{2i-1} \mid u_{2i}) = \frac{1}{2}\, W_N^{(i)}(y_1^N, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} \mid u_{2i-1} \oplus u_{2i}) \cdot W_N^{(i)}(y_{N+1}^{2N}, u_{1,e}^{2i-2} \mid u_{2i}),$$
where $u_{1,o}^{2i-2}$ and $u_{1,e}^{2i-2}$ denote the odd- and even-indexed subvectors of $u_1^{2i-2}$ (the capacities here are with respect to uniform inputs). We are now ready to talk about how the capacity and the reliability change through a local transformation as above.
As we have seen for the erasure channel, we get one channel which is good and another which is bad; the following proposition formalizes this notion.
Proposition 3.
Local transformation of rate and reliability
If $(W, W) \to (W', W'')$ is the local transformation, then the following statements are true:
1. $I(W') + I(W'') = 2\, I(W)$
2. $I(W') \le I(W'')$
3. $Z(W'') = Z(W)^2$
4. $Z(W') \le 2\, Z(W) - Z(W)^2$
From the above proposition, we can see that $I(W') \le I(W) \le I(W'')$ and $Z(W') \ge Z(W) \ge Z(W'')$, which goes with our intuition that we have one good and one bad channel (in loose terms, a good channel has more capacity and a smaller Bhattacharyya parameter than the original channel, and the opposite is true for a bad channel). Also $Z(W') + Z(W'') \le 2\, Z(W)$, which means that the reliability parameter can, in total, only improve with our transformation; equality holds exactly when we have perfect or useless channels.
We apply these results to the recursive formulations in the previous proposition. We sketch the recursive transformations as a binary tree, in which every node gives birth to a good and a bad channel. The root is the original channel W, and the channel $W_{2^n}^{(i)}$ is located at the n-th level, i-th node from the top. We can label the nodes in this tree in a natural way, one in which each node is labelled with the path taken from the root (1 means up and 0 means down). If we repeat this operation many times, we expect that most of the obtained channels have capacities near zero or one. In other words, the channel set becomes significantly polarized after a few iterations. The following theorem formalizes our intuition.
Theorem 3.
For any binary channel W, the channels $\{W_N^{(i)}\}$ polarize, in the sense that for any fixed $\delta \in (0, 1)$, as N tends to infinity through powers of two, the fraction of indices i for which $I(W_N^{(i)}) \in (1 - \delta, 1]$ goes to I(W), and the fraction of those for which the capacity lies in $[0, \delta)$ goes to $1 - I(W)$.
Proof :
The sequence of random variables $I_n$, defined as the capacity of the channel obtained by starting at the root and following a uniformly random path of length n, is a martingale: the process is memoryless and $E[I_{n+1} \mid \text{path } n] = \frac{1}{2} I(W_{\text{path},0}) + \frac{1}{2} I(W_{\text{path},1}) = I_n$. Likewise, the sequence of random variables $Z_n$ is a supermartingale, because $E[Z_{n+1} \mid \text{path } n] = \frac{1}{2} Z(W_{\text{path},0}) + \frac{1}{2} Z(W_{\text{path},1}) \le Z_n$. Both processes are bounded, hence uniformly integrable, and hence converge by the martingale convergence theorem. It follows that $E[|Z_{n+1} - Z_n|] \to 0$ as $n \to \infty$. Since $Z_{n+1} = Z_n^2$ with probability $\frac{1}{2}$, we have $E[|Z_{n+1} - Z_n|] \ge \frac{1}{2} E[Z_n(1 - Z_n)] \ge 0$. By the sandwich theorem of limits, $E[Z_n(1 - Z_n)] \to E[Z_\infty(1 - Z_\infty)] = 0$. Hence $Z_\infty = 0$ or $Z_\infty = 1$ almost everywhere. Proposition 1 then implies that $I_\infty$ takes values in $\{0, 1\}$, and by the martingale property $P(I_\infty = 1) = E[I_\infty] = I(W)$ and $P(I_\infty = 0) = 1 - I(W)$.
We can thus see that polar codes indeed achieve capacity in the limit of the blocklength going to infinity.
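Since the BEC recursion is explicit, the polarization phenomenon is easy to watch numerically. Here is a minimal sketch (our own Python, not from Arikan's paper) applying the capacity recursion $I^- = I^2$, $I^+ = 2I - I^2$ and reporting the polarized fractions:

```python
def polarize_bec(capacity, n):
    """Apply n levels of the BEC recursion I(W-) = I^2, I(W+) = 2I - I^2.

    Returns the capacities of the 2**n synthetic channels."""
    caps = [capacity]
    for _ in range(n):
        caps = [v for c in caps for v in (c * c, 2 * c - c * c)]
    return caps

I0, delta = 0.5, 0.1            # start from BEC(0.5); "polarized" threshold
for n in (4, 8, 12, 16):
    caps = polarize_bec(I0, n)
    good = sum(c > 1 - delta for c in caps) / len(caps)
    bad = sum(c < delta for c in caps) / len(caps)
    # the two fractions should approach I(W) = 0.5 and 1 - I(W) = 0.5
    print(f"n={n:2d}  near-perfect: {good:.3f}  near-useless: {bad:.3f}")
```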
Rate of Polarization
Until now we have said that polar codes with sufficiently high blocklengths achieve capacity. We would like to know how fast this happens, and what the error probability is for a particular block length. This section tries to address these questions. The question was answered by Arikan in his paper and is outlined in the following theorem. Guruswami improved upon the result in his paper, and that result shall be mentioned later.
Theorem 4.
For a B-DMC W with $I(W) > 0$ and any fixed $R < I(W)$, there exists a sequence of sets $A_N \subset \{1, 2, \ldots, N\}$, $N = 2^n$, such that $|A_N| \ge NR$ and $Z(W_N^{(i)}) \le O(N^{-5/4})$ for all $i \in A_N$.
The above theorem essentially says that there exists a subset of the set of virtual channels which are 'good', large enough that capacity is not sacrificed, and on which the reliability parameter goes down polynomially in the blocklength.
Polar Coding
We have seen in the earlier sections that the synthetic channels are sufficiently polarized. We need a way to access the 'good' channels, the channels $W_N^{(i)}$ for which $Z(W_N^{(i)})$ is close to 0, and thus achieve the symmetric channel capacity. We define a class of codes called $G_N$-coset codes, in which $G_N$ is the generator matrix, i.e. $x_1^N = u_1^N G_N$. For an arbitrary subset A of the indices, we can write
$$x_1^N = u_A G_N(A) \oplus u_{A^c} G_N(A^c)$$
because it is a linear transformation. We have three parameters here, A, $u_A$ and $u_{A^c}$, and hence we talk about $(N, K, A, u_{A^c})$ codes. A is interpreted as the 'information set', the set of indices which coincide with 'good' channels, and $A^c$ as the set of 'bad' channels. The bits $u_{A^c}$ are the 'frozen bits', and we leave $u_A$ to be free variables. We need to give a rule for selecting the information set A. As we shall see later, the way we choose the frozen bits does not have any effect on how well the coding scheme performs over symmetric channels. We shall briefly talk about the decoder for polar codes, because that will give us insights on how we could possibly choose the information set and the frozen bits.
The Successive Cancellation Decoder
We shall be considering an $(N, K, A, u_{A^c})$ $G_N$-coset code in which $u_1^N$ has been encoded into a codeword $x_1^N$ and sent over the channel $W^N$. The decoder's task is to generate an estimate $\hat{u}_1^N$ of $u_1^N$, given the knowledge of A, $u_{A^c}$ and the channel output $y_1^N$. An obvious way to decode the $A^c$ bits is to set $\hat{u}_{A^c} = u_{A^c}$. We need a way to decode $u_A$. We do this by exploiting the structure of polar codes, in a way which uses the bits we have already decoded and which treats the bits which are not decoded yet as noise. We call this a successive cancellation (SC) decoder. Let's formalize this: our SC decoder outputs decisions $\hat{u}_i$ in order from i = 1 to N such that
$$\hat{u}_i \triangleq \begin{cases} u_i, & \text{if } i \in A^c \\ h_i(y_1^N, \hat{u}_1^{i-1}), & \text{if } i \in A \end{cases}$$
where
$$h_i(y_1^N, \hat{u}_1^{i-1}) \triangleq \begin{cases} 0, & \text{if } \dfrac{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \mid 0)}{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \mid 1)} \ge 1 \\ 1, & \text{otherwise.} \end{cases}$$
These functions are similar to the ML decoding functions, but differ in that they treat the bits which we have not seen yet as noise, in other words as random variables.
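Over the BEC the likelihood ratios above take only three values: a bit is 0, 1, or undetermined. The following is a minimal runnable sketch (our own Python and naming, not Arikan's code; it uses the common convention $x_1^N = u_1^N F^{\otimes n}$ without the bit-reversal permutation, which merely relabels the synthetic channels, and it guesses 0 for an undecidable information bit, which is exactly an SC decoding failure):

```python
ERASED = None  # over the BEC a "likelihood" is just 0, 1 or ERASED

def f_minus(a, b):
    # check-node step: x_i XOR x_{i+h} is known only if both observations are
    return a ^ b if a is not None and b is not None else ERASED

def g_plus(a, b, c):
    # variable-node step with decoded partial sum c: either observation works
    if b is not None:
        return b
    return a ^ c if a is not None else ERASED

def sc_decode(y, frozen):
    """SC decoding of a polar code over the BEC.

    y      : received word (entries 0/1/ERASED), length a power of two
    frozen : same length; None marks an information position, otherwise the
             known frozen-bit value
    Returns (u_hat, x_hat), where x_hat is the re-encoding of u_hat."""
    if len(y) == 1:
        u = frozen[0] if frozen[0] is not None else (y[0] if y[0] is not None else 0)
        return [u], [u]   # an erased information bit is guessed as 0 (a failure)
    h = len(y) // 2
    # first half of u sees the degraded "minus" channels
    u_a, x_a = sc_decode([f_minus(y[i], y[i + h]) for i in range(h)], frozen[:h])
    # second half sees the upgraded "plus" channels, using the partial sums x_a
    u_b, x_b = sc_decode([g_plus(y[i], y[i + h], x_a[i]) for i in range(h)],
                         frozen[h:])
    return u_a + u_b, [x_a[i] ^ x_b[i] for i in range(h)] + x_b

import random
info = [3, 5, 6, 7]   # the four best synthetic channels of BEC(0.3) at N = 8
frozen = [None if i in info else 0 for i in range(8)]
u = [random.randint(0, 1) if i in info else 0 for i in range(8)]
_, x = sc_decode([ERASED] * 8, list(u))   # freezing every bit to u encodes u
y = [b if random.random() > 0.3 else ERASED for b in x]   # pass through BEC(0.3)
u_hat, _ = sc_decode(y, frozen)
print(u, u_hat)   # mismatches correspond to SC decoding failures on erasures
```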
We need to analyse the probability of error in this SC decoding framework. The error probabilities are denoted in a natural way.
Definition 11. The probability of block error for an $(N, K, A, u_{A^c})$ code, assuming that each information vector $u_A$ is sent uniformly, is
$$P_e(N, K, A, u_{A^c}) \triangleq \sum_{u_A \in \chi^K} \frac{1}{2^K} \sum_{y_1^N :\, \hat{u}_1^N(y_1^N) \ne u_1^N} W_N(y_1^N \mid u_1^N).$$
We also denote the average of the above error probability over all choices of $u_{A^c}$ by $P_e(N, K, A)$. We claim that the reliability parameters still provide an upper bound on this error probability.
Proposition 4.
$$P_e(N, K, A) \le \sum_{i \in A} Z(W_N^{(i)})$$
Theorem 5. The average probability of block error for polar coding under SC decoding goes down as $O(N^{-1/4})$ for any B-DMC W and any fixed rate less than the capacity:
$$P_e(N, R) = O(N^{-1/4}).$$
The proof easily follows from Theorem 4 and the relation between block and bit error for the SC decoder. Note that the above can be viewed as an existential result, in the sense that there exists a way of setting the frozen bits so that the error probability goes down as stated. We have stronger results when the channel is symmetric. Let's observe a few properties of symmetric channels.
Proposition 5.
If a B-DMC W is symmetric, then $W^N$, $W_N$ and $W_N^{(i)}$ are also symmetric. The symmetries of the channel imply that Proposition 4 holds for any way of setting the frozen bits.
Theorem 6.
The probability of block error for polar coding under SC decoding goes down as $O(N^{-1/4})$ for any symmetric B-DMC W and any fixed rate less than the capacity, for $u_{A^c}$ fixed arbitrarily:
$$P_e(N, K, A, u_{A^c}) = O(N^{-1/4}).$$
The idea used in the proof is that the event of making an error in the block and the choice of the vector of frozen bits are independent, and thus we can freeze the bits (to, say, the all-zero vector) without affecting the probability of error.
Encoding and Decoding Complexity
Polar codes are quite remarkable in the sense that both encoding and decoding are polynomial in the block length; to be more precise, both take $O(N \log N)$ time steps on a sequential machine. It would be good to note that the structure of the encoding matrix can help us do it faster than the above on a parallel machine ($O(\log N)$ time). As for the encoding complexity, $G_N$ can be written in terms of tensor products (together with a bit-reversal permutation), and we can also exploit its relation to Fast Fourier Transforms; the $O(N \log N)$ time obtained is due to the bit-indexing methods frequently used in FFTs. It can be easily observed that the decoding complexity is also $O(N \log N)$, because we evaluate the ML-like decision functions recursively over $\log N$ levels, each taking O(N) time. We can thus see that polar coding is not just capacity achieving, but also something which is quite implementable in practice owing to the low encoding and decoding complexities.
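To make the encoding recursion concrete, here is a minimal sketch of the $O(N \log N)$ encoder (our own Python; it uses the $F^{\otimes n}$ convention without the bit-reversal permutation, which only reorders the synthetic channel indices):

```python
def polar_encode(u):
    """Compute x = u F^{(tensor) n} with F = [[1, 0], [1, 1]] (no bit reversal).

    T(N) = 2 T(N/2) + O(N), i.e. O(N log N) XOR operations overall."""
    if len(u) == 1:
        return list(u)
    h = len(u) // 2
    # one combining level: (u_a, u_b) -> ((u_a xor u_b) F, u_b F)
    return polar_encode([u[i] ^ u[i + h] for i in range(h)]) + polar_encode(u[h:])

print(polar_encode([1, 0, 1, 1]))  # -> [1, 1, 0, 1]
```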
Reed Muller Codes
Reed-Muller codes are one of the oldest families of error-correcting codes, and they use concepts from algebra for the encoding and decoding process. The idea is to look at the message as the coefficients of a polynomial of a suitable degree, and to transmit suitable evaluations of that polynomial. We first recall the closely related Reed-Solomon codes.
Definition 12.
Reed Solomon Codes
$$RS_{F,S,n,k}(m) = (f(\alpha_1), f(\alpha_2), \ldots, f(\alpha_n)), \quad \text{where } f(X) = m_0 + m_1 X + \cdots + m_{k-1} X^{k-1}.$$
We view a message of k symbols as the coefficients of a univariate polynomial f(X) of degree k - 1. We encode the message as the evaluations of this polynomial at n different points in the underlying field (or in a subset S which the code designer is left to choose). We should talk about how we are getting these evaluations across. We define a special matrix to make the representation easier to work with.
Definition 13.
The Vandermonde matrix
$$G = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ \alpha_1 & \alpha_2 & \cdots & \alpha_n \\ \alpha_1^2 & \alpha_2^2 & \cdots & \alpha_n^2 \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_1^{k-1} & \alpha_2^{k-1} & \cdots & \alpha_n^{k-1} \end{pmatrix}$$
is the generator matrix for $RS_{F,S,n,k}$.
Looking at the generator matrix we can see that RS codes are linear.
Proposition 6.
RS codes are linear.
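Concretely, RS encoding is a single vector-matrix product with the Vandermonde matrix, i.e. polynomial evaluation. A minimal sketch over a prime field (our own Python and naming):

```python
def rs_encode(msg, alphas, p):
    """Evaluate f(X) = msg[0] + msg[1] X + ... + msg[k-1] X^(k-1) at each alpha.

    This is exactly the vector-matrix product msg * G with the Vandermonde
    generator matrix above, so linearity of the code is immediate."""
    return [sum(m * pow(a, j, p) for j, m in enumerate(msg)) % p for a in alphas]

# RS over the prime field F_7 with n = 7 (all field points) and k = 3
print(rs_encode([2, 0, 1], list(range(7)), 7))  # evaluations of 2 + X^2 mod 7
```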
Proposition 7.
The minimum distance of RS codes is n - k + 1.
Proof: This is true because if $m' \ne m''$ are the messages, the corresponding polynomials of degree less than k can agree on at most k - 1 evaluation points, so the codewords differ in at least n - k + 1 locations.
RS codes are good in the sense that their distance is huge, but on the downside they require that the underlying field be sufficiently large, at least of order n. To address this difficulty, we talk about Reed-Muller codes. They are generalizations of Reed-Solomon codes in the sense that we use multivariate polynomials instead.
Definition 14.
Reed Muller Codes
Given a field size q, a number m of variables, and a total degree bound r, the $RM_{q,m,r}$ code is the linear code over $F_q$ defined by the encoding map
$$f(X_1, X_2, \ldots, X_m) \mapsto \langle f(\alpha) \rangle_{\alpha \in F_q^m},$$
applied to the domain of all polynomials in $F_q[X_1, X_2, \ldots, X_m]$ of total degree $\deg(f) \le r$.
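For intuition, here is a hedged sketch of the binary case q = 2 (our own Python): the message assigns one coefficient to each monomial of degree at most r, and the codeword is the evaluation table over $F_2^m$.

```python
from itertools import combinations, product

def monomial(point, S):
    # evaluate prod_{i in S} X_i at the given point of F_2^m
    out = 1
    for i in S:
        out &= point[i]
    return out

def rm_encode(msg, m, r):
    """Encode msg (one F_2 coefficient per monomial of degree <= r) as the
    evaluation table of the corresponding polynomial over F_2^m."""
    points = list(product([0, 1], repeat=m))
    monomials = [S for d in range(r + 1) for S in combinations(range(m), d)]
    assert len(msg) == len(monomials)   # k = sum_{d <= r} C(m, d)
    return [sum(c * monomial(pt, S) for c, S in zip(msg, monomials)) % 2
            for pt in points]

# RM(q=2, m=3, r=1): k = 4, N = 8; in fact the [8,4] extended Hamming code
print(rm_encode([1, 0, 1, 1], 3, 1))   # evaluation table of 1 + X_2 + X_3
```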
Let's talk about decoding these families of codes. We would expect that unique decoding is possible only if there are not too many errors in the received word.
Theorem 7.
Unique decoding from e errors is possible if and only if the minimum distance of the code is at least 2e + 1; for RS codes this means up to $\frac{n-k}{2}$ errors.
Proof: We look at Hamming balls of radius e around the codewords; at the boundary condition a received word can be close to exactly two of them, which gives us the result.
RS codes are decoded using the 'magical' Berlekamp-Welch algorithm, which involves fitting the bad points on a curve and then finding them out. Let $y_i$ be the (possibly corrupted) evaluations of the polynomial at distinct locations $x_i$ for $i \in \{1, 2, \ldots, n\}$, and let e be the number of errors. Our objective is to find a polynomial p(X) of degree at most k - 1 such that the number of errors e is respected. The following algorithm helps us in doing so.
Algorithm 1.
Berlekamp-Welch algorithm
1. If there is a polynomial p of degree at most k - 1 such that $p(x_i) = y_i$ for all $i = 1, \ldots, n$, then output p. Otherwise:
2. Find polynomials E(x) and N(x) such that:
• E is not identically zero
• E(x) has degree at most e and N(x) has degree at most e + k - 1
• For every $i = 1, \ldots, n$: $N(x_i) = E(x_i) \cdot y_i$
3. Output N(x)/E(x) if E(x) divides N(x); otherwise output 'error'.
We write the constraints for the polynomials E(x) and N(x) as a linear system and solve it; it can be proved that any solution of the constraints satisfies the conditions and gives us the correct message polynomial, as made concrete in the sketch below. If unique decoding is not possible, then we can do list decoding up to a particular fraction of errors, after which the list size becomes exponential. This is done using the even more magical Guruswami-Sudan algorithm, which uses the same idea of fitting the bad points on a curve and then finding them out.
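Since the constraints in step 2 are linear in the unknown coefficients, the whole algorithm reduces to Gaussian elimination plus one polynomial division. Below is a hedged, minimal sketch over a prime field $F_p$ (our own Python and naming, illustrative rather than optimized); we force E to be monic of degree exactly e, the standard way to rule out the all-zero solution, and assume $n \ge 2e + k$:

```python
def solve_mod_p(A, b, p):
    """Gaussian elimination over F_p; returns one solution of A x = b, or None."""
    rows, cols = len(A), len(A[0])
    M = [row[:] + [bi % p] for row, bi in zip(A, b)]
    r, piv_cols = 0, []
    for c in range(cols):
        piv = next((i for i in range(r, rows) if M[i][c]), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        inv = pow(M[r][c], p - 2, p)          # field inverse (p prime)
        M[r] = [v * inv % p for v in M[r]]
        for i in range(rows):
            if i != r and M[i][c]:
                M[i] = [(vi - M[i][c] * vr) % p for vi, vr in zip(M[i], M[r])]
        piv_cols.append(c)
        r += 1
        if r == rows:
            break
    if any(M[i][cols] for i in range(r, rows)):   # inconsistent system
        return None
    x = [0] * cols                                # free variables set to 0
    for i, c in enumerate(piv_cols):
        x[c] = M[i][cols]
    return x

def poly_divmod(num, den, p):
    """Divide polynomials (coefficient lists, lowest degree first) over F_p."""
    num, den = num[:], den[:]
    while len(den) > 1 and den[-1] == 0:
        den.pop()
    q = [0] * max(1, len(num) - len(den) + 1)
    inv = pow(den[-1], p - 2, p)
    for i in range(len(num) - len(den), -1, -1):
        q[i] = num[i + len(den) - 1] * inv % p
        for j, d in enumerate(den):
            num[i + j] = (num[i + j] - q[i] * d) % p
    return q, num                                 # quotient, remainder

def berlekamp_welch(xs, ys, k, e, p):
    """Recover the message polynomial (deg < k) from <= e errors; needs n >= 2e + k.
    Step 1 of the algorithm is subsumed: with zero errors E = x^e works."""
    rows, rhs = [], []
    for x, y in zip(xs, ys):
        xp = [pow(x, j, p) for j in range(e + k)]
        # unknowns: e+k coefficients of N, then the e low coefficients of monic E
        rows.append(xp + [(-y * xp[j]) % p for j in range(e)])
        rhs.append(y * pow(x, e, p))              # monic x^e term of E, moved to RHS
    sol = solve_mod_p(rows, rhs, p)
    if sol is None:
        return None
    Ncoef, Ecoef = sol[:e + k], sol[e + k:] + [1]
    q, rem = poly_divmod(Ncoef, Ecoef, p)
    return None if any(rem) else q[:k]            # message coefficients, or failure

# toy run over F_7: f(X) = 2 + X^2, two corrupted positions
p, k = 7, 3
xs = list(range(7))
ys = [(2 + x * x) % p for x in xs]
ys[1], ys[4] = (ys[1] + 3) % p, (ys[4] + 1) % p
print(berlekamp_welch(xs, ys, k, e=2, p=p))       # -> [2, 0, 1]
```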
Definition 15. C is said to be 1-transitive if for any $j_1, j_2 \in [N]$ satisfying $j_1 \ne j_2$, there exists a permutation $\pi : [N] \to [N]$ such that:
• $\pi(j_1) = j_2$
• $(y_{\pi(1)}, y_{\pi(2)}, \ldots, y_{\pi(N)}) \in C$ for every $(y_1, y_2, \ldots, y_N) \in C$
Definition 16. C is said to be 2-transitive if for any $j_1, j_2, j_3, j_4 \in [N]$ satisfying $j_1 \ne j_2$ and $j_3 \ne j_4$, there exists a permutation $\pi : [N] \to [N]$ such that:
• $\pi(j_1) = j_3$
• $\pi(j_2) = j_4$
• $(y_{\pi(1)}, y_{\pi(2)}, \ldots, y_{\pi(N)}) \in C$ for every $(y_1, y_2, \ldots, y_N) \in C$
Let's look at some important properties of the above codes.
Proposition 8.
Reed Solomon codes (with evaluations over the whole field) are 2-transitive.
Proof:
As we have seen before, RS codes are generated by the Vandermonde matrix G: the message $m_1^k$ is transformed into the codeword $y_1^N = m_1^k G$, which is eventually sent over the channel; a codeword is thus the evaluation table of a polynomial f of degree less than k over the whole field. We are given four locations in the code, say $a, b, c, d \in [N]$ (identified with field elements), such that $a \ne b$ and $c \ne d$, and we need to give a permutation $\pi : [N] \to [N]$ such that $\pi(a) = c$ and $\pi(b) = d$ which also preserves membership in the code. We cannot pick an arbitrary permutation satisfying these constraints; we need one that polynomials respect, and affine maps of the evaluation points provide it. Choose $\sigma(x) = \alpha x + \beta$ with $\alpha = (c - d)(a - b)^{-1} \ne 0$ and $\beta = c - \alpha a$, so that $\sigma(a) = c$ and $\sigma(b) = d$, and let $\pi = \sigma$ viewed as a permutation of the coordinate set. The permuted word has value $y_{\sigma(x)} = f(\sigma(x))$ at location x, i.e. it is the evaluation table of $g(X) = f(\alpha X + \beta)$. Composing with an affine map does not increase the degree, so $\deg g = \deg f < k$, and the permuted word is a codeword too.
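This argument is easy to check numerically; the following small sketch (our own Python, prime field $F_7$) permutes a codeword by an affine map of the evaluation points and verifies that the result is again an evaluation table of a low-degree polynomial:

```python
def rs_encode_all(msg, p):
    # codeword = evaluations of the message polynomial at every point of F_p
    return [sum(m * pow(a, j, p) for j, m in enumerate(msg)) % p for a in range(p)]

def fits_low_degree(word, p, k):
    """Check that word is the evaluation table of some polynomial of degree < k,
    by brute force over candidate coefficient vectors (fine for tiny p and k)."""
    from itertools import product
    return any(all(sum(c * pow(x, j, p) for j, c in enumerate(cand)) % p == word[x]
                   for x in range(p))
               for cand in product(range(p), repeat=k))

p, k = 7, 3
y = rs_encode_all([2, 0, 1], p)              # codeword: table of f(X) = 2 + X^2
a, b = 3, 5                                  # affine map sigma(x) = 3x + 5 on F_7
z = [y[(a * x + b) % p] for x in range(p)]   # coordinate-permuted word
print(fits_low_degree(z, p, k))              # True: z is again an RS codeword
```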
Reed Muller codes are 2-transitive.
The proof is similar to the one given above: the affine maps $x \mapsto Ax + b$ of $F_q^m$ (with A invertible) act 2-transitively on the evaluation points and preserve total degree, so the induced coordinate permutations preserve the code.
Note that if we want our codes to satisfy these transitivity properties, we must give evaluations of the polynomials on all points of the corresponding domains. We shall see in the coming sections that RM codes have more interesting properties.
Reed Muller Codes Achieve Capacity on Erasure Channels
As recently as July 2015, Urbanke et al. and Santhosh Kumar et al. independently proved that RM codes achieve capacity on erasure channels. Let us now build up the machinery required to prove the aforementioned result. We assume that the i-th bit of the codeword is transmitted through an erasure channel with erasure probability $p_i$, i.e. BEC($p_i$); we denote this vector channel by BEC(p), where p is the corresponding vector of erasure probabilities. In this setting we would like to analyze the probability that the MAP decoder is unable to decode the i-th bit, and then try to get a bound on the probability of error for the block MAP decoder. We also assume that the input distribution is uniform. Since we are talking about erasure channels and MAP decoding, we should define EXIT functions, which we will use later to capture the decoding errors.
Definition 17.
EXIT functions
The vector EXIT function associated with the i-th bit is defined to be
$$h_i(p) \triangleq H(X_i \mid Y_{\sim i}(p_{\sim i})).$$
We can define an average EXIT function, which we will use later.
Definition 18.
Average EXIT function
The average EXIT function is defined as
$$h(p) \triangleq \frac{1}{N} \sum_{i=1}^N h_i(p).$$
We denote the bit-MAP decoder's output for the i-th bit as $D_i$. On receiving the sequence Y, if the i-th bit $X_i$ can be recovered uniquely, then $D_i(Y) = X_i$; otherwise, $D_i$ declares an erasure and returns *. We claim that the probability of the decoder failing to decode is equal to the i-th EXIT function.
Proposition 10.
$$\Pr(D_i(Y) \ne X_i) = H(X_i \mid Y)$$
Proof:
Whenever bit i can be recovered from a received sequence Y = y, $H(X_i \mid Y = y) = 0$; otherwise, $H(X_i \mid Y = y) = 1$ because of the uniform codeword assumption. Observe that the conditional entropy of the i-th bit given a particular received sequence is either equal to zero or one; it cannot be anything in between.
Proposition 11.
The MAP EXIT function for the i-th bit satisfies
$$h_i(p) = \frac{\partial H(X \mid Y(p))}{\partial p_i}.$$
Proof:
By the chain rule of entropy,
$$H(X \mid Y(p)) = H(X_i \mid Y(p)) + H(X_{\sim i} \mid X_i, Y(p)).$$
Observe that the second term in the above expansion is independent of $p_i$, because $H(X_{\sim i} \mid X_i, Y(p)) = H(X_{\sim i} \mid X_i, Y_{\sim i}(p_{\sim i}))$. For the first term,
$$H(X_i \mid Y(p)) = \Pr(Y_i = *)\, H(X_i \mid Y_{\sim i}(p_{\sim i}), Y_i = *) + \Pr(Y_i = X_i)\, H(X_i \mid Y_{\sim i}(p_{\sim i}), Y_i = X_i) = p_i\, H(X_i \mid Y_{\sim i}(p_{\sim i})),$$
since the second entropy term is zero. Differentiating with respect to $p_i$ thus gives the proposition. The above proposition leads us to an important theorem in coding theory, the area theorem.
Theorem 8.
The Area Theorem
The average EXIT function satisfies the area theorem:
$$\int_0^1 h(p)\, dp = \frac{K}{N}.$$
Proof: The above proposition gives us the derivatives of $H(X \mid Y(p))$, so we can integrate from 0 to 1 along the fixed path $p = (t, t, \ldots, t)$:
$$H(X \mid Y(1)) - H(X \mid Y(0)) = \int_0^1 \Big( \sum_{i=1}^N h_i(t) \Big)\, dt.$$
Here $H(X \mid Y(1)) = H(X) = K$, since the input distribution is uniform and the encoding captures the same randomness as the original distribution, while $H(X \mid Y(0)) = 0$. Dividing by N gives the result.
We would like to look at the set of erasure patterns (vectors $Y_{\sim i}$) from which we cannot decode the i-th bit $X_i$ indirectly using MAP decoding. We claim that the following set correctly captures this notion and contains all the erasure patterns from which indirect recovery of the i-th bit is not possible. Hence the measure of this set determines the probability of error, which is what we eventually want.
Definition 19.
The set of erasure patterns 'bad' for bit i is contained in $\Omega_i$, defined as
$$\Omega_i \triangleq \{ A \subseteq [N] \setminus \{i\} \mid \exists B \subseteq [N] \setminus \{i\},\; B \cup \{i\} \in \mathcal{C},\; B \subseteq A \}$$
(here codewords are identified with their supports). From the above discussion, it is clear that the measure of this set is equal to the probability of error, which, in turn, is equal to the i-th EXIT function because of the uniform input assumption. We summarize this in the following proposition.
Proposition 12. $\Omega_i$ encodes $h_i(p)$:
$$h_i(p) = \mu_p(\Omega_i) = \sum_{A \in \Omega_i} \prod_{l \in A} p_l \prod_{l \in A^c \setminus \{i\}} (1 - p_l).$$
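For a small code all of this can be verified by brute force. The following sketch (our own Python; the helper names are ours) enumerates erasure patterns for the [8,4] extended Hamming code, evaluates $h_i(p)$ from the formula above, and integrates numerically to check the area theorem:

```python
from itertools import product

def codewords(G):
    # all F_2 linear combinations of the rows of the generator matrix G
    return [tuple(sum(m * g for m, g in zip(msg, col)) % 2 for col in zip(*G))
            for msg in product([0, 1], repeat=len(G))]

def exit_i(G, i, p):
    """h_i(p) = mu_p(Omega_i), by brute force over all erasure patterns A."""
    n = len(G[0])
    supports = [frozenset(j for j, bit in enumerate(c) if bit)
                for c in codewords(G) if c[i]]          # codewords with c_i = 1
    others = [j for j in range(n) if j != i]
    total = 0.0
    for bits in product([0, 1], repeat=n - 1):
        A = frozenset(j for j, b in zip(others, bits) if b)
        if any(S - {i} <= A for S in supports):         # A contains some such B
            total += p ** len(A) * (1 - p) ** (n - 1 - len(A))
    return total

# generator matrix of RM(2,3,1), the [8,4] extended Hamming code
G = [[1, 1, 1, 1, 1, 1, 1, 1],
     [0, 0, 0, 0, 1, 1, 1, 1],
     [0, 0, 1, 1, 0, 0, 1, 1],
     [0, 1, 0, 1, 0, 1, 0, 1]]
# midpoint-rule integral of h(p); all h_i agree since the code is 2-transitive
area = sum(exit_i(G, 0, (t + 0.5) / 200) for t in range(200)) / 200
print(round(area, 3), "vs rate K/N =", 4 / 8)
```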
We claim that if the code C is 2-transitive, then the set $\Omega_i$ is 1-transitive.
Proposition 13. If C is 2-transitive then $\Omega_i$ is 1-transitive.
Proof: Since the code is 2-transitive, for any $j \ne i$ there exists a code-preserving permutation $\pi$ such that $\pi(i) = j$. We need to show that this permutation preserves membership in the corresponding $\Omega$s; in other words, we need to show that if $A \in \Omega_i$ then $\pi(A) \in \Omega_{\pi(i)}$. Since $A \in \Omega_i$, there exists $B \subseteq A$ such that $B \cup \{i\} \in \mathcal{C}$, so $\pi(B \cup \{i\}) \in \mathcal{C}$. Observe that $\pi(B \cup \{i\}) = \pi(B) \cup \pi(i) = \pi(B) \cup \{j\}$. Since $\pi(B) \subseteq \pi(A)$, it follows that $\pi(A) \in \Omega_j$. This is a bijection because we can repeat the argument with the indices i and j interchanged.
Proposition 14.
All EXIT functions are equal.
Proof:
Since the code is transitive, any two locations have a permutation between them. The corresponding $\Omega$s have a bijection between them, and the EXIT function is equal to the measure of the corresponding set $\Omega$. The proposition thus follows.
Note that all EXIT functions are equal to the average EXIT function, and thus we are free to invoke the area theorem now. Intuitively, if we have an erasure pattern which is 'bad', we will not be able to decode the patterns which are obtained after adding more erasures at places where there were no erasures before. The following proposition formalizes this.
Proposition 15. $\Omega_i$ is monotone.
Proof: We have to prove that if $A \in \Omega_i$ and $A \subseteq A'$ then $A' \in \Omega_i$. Looking at the definition of $\Omega_i$, there exists $B \subseteq [N] \setminus \{i\}$ such that $B \subseteq A$ and $B \cup \{i\} \in \mathcal{C}$. Then $B \subseteq A \subseteq A'$, and hence it follows that $A' \in \Omega_i$.
Let's quickly look at what we know until now:
1. $h_i(p)$ captures the probability of error of the bit-MAP decoder.
2. $\Omega_i$ encodes $h_i(p)$.
3. All EXIT functions are equal to the average EXIT function h.
4. The area under the h vs. p curve is the rate (using the area theorem).
If we prove that the set $\Omega_i$ has a sharp threshold, then we would have proved that 2-transitive codes achieve capacity, since the threshold would occur at p = 1 - R. We define another set, which will be useful to prove the sharp threshold behaviour: the set of erasure patterns for which location j is pivotal in the indirect recovery of the i-th bit. In other words, flipping the j-th bit flips the erasure pattern between $\Omega_i$ and $\Omega_i^c$.
Definition 20.
The set of erasure patterns for which the j-th bit is pivotal in the indirect decoding of the i-th bit is
$$\partial_j \Omega_i \triangleq \{ A \subseteq [N] \setminus \{i\} \mid A \setminus \{j\} \notin \Omega_i,\; A \cup \{j\} \in \Omega_i \}.$$
Note that $\partial_j \Omega_i$ contains patterns from both $\Omega_i$ and $\Omega_i^c$. Intuitively we expect that once we permute the locations to another set of locations, the bits which were pivotal before stay pivotal in the permuted world, for if this were not to happen, we could have magically decoded the concerned bit using this permutation. We formalize this notion.
Proposition 16.
If a code C is 2-transitive, then for distinct $i, j, k \in [N]$ there exists a bijection between $\partial_j \Omega_i$ and $\partial_k \Omega_i$.
Proof: Consider an erasure pattern A. Since the code is 2-transitive, there exists a code-preserving permutation $\pi$ such that $\pi(i) = i$ and $\pi(j) = k$. We need to prove that if $A \in \partial_j \Omega_i$ then $\pi(A) \in \partial_k \Omega_i$. Suppose $A \cup \{j\} \in \Omega_i$ and $A \setminus \{j\} \notin \Omega_i$. Then $\pi(A \cup \{j\}) \in \Omega_i$ and $\pi(A \setminus \{j\}) \notin \Omega_i$ because of the transitivity of $\Omega_i$, and $\pi(A \cup \{j\}) = \pi(A) \cup \{k\}$, $\pi(A \setminus \{j\}) = \pi(A) \setminus \{k\}$. Thus $\pi(A) \in \partial_k \Omega_i$. If we interchange the indices j and k, a similar argument gives the inverse map; hence there is a one-to-one correspondence between the two sets, and their measures are equal, where
$$\mu_p(\partial_j \Omega_i) = \sum_{A \in \partial_j \Omega_i} \prod_{l \in A} p_l \prod_{l \in A^c \setminus \{i\}} (1 - p_l).$$
Our quest to prove that the set $\Omega_i$ (and in turn the average EXIT function) has a sharp threshold requires us to talk about the influences of the variables and invoke suitable results from boolean function analysis. Let's define a few terms first.
Definition 21.
Influence of a variable
Let $\Omega$ be a monotone set and let $\partial_j \Omega \triangleq \{x \in \{0,1\}^N \mid \mathbb{1}_\Omega(x) \ne \mathbb{1}_\Omega(x^{(j)})\}$, where $x^{(j)}$ is defined by $x^{(j)}_l = x_l$ for $l \ne j$ and $x^{(j)}_j = 1 - x_j$. The influence of bit $j \in [N]$ is defined by $I_j^{(p)}(\Omega) \triangleq \mu_p(\partial_j \Omega)$.
Definition 22.
Total Influence
The total influence of the variables is defined as $I^{(p)}(\Omega) = \sum_{l=1}^N I_l^{(p)}(\Omega)$.
Let's look at a result which talks about the derivative of the measure of monotone sets.
Lemma 1.
Margulis-Russo Lemma
Let $\Omega$ be a monotone set; then
$$\frac{d\mu_p(\Omega)}{dp} = I^{(p)}(\Omega).$$
As we shall see (Theorem 9 below), for our sets this derivative is lower bounded by a quantity which scales with N. But the value of the measure has to climb from 0 to 1, so the derivative cannot be high everywhere; hence the function must have a sharp threshold. First we shall show how this is related to the problem at hand: the derivative of an EXIT function is the measure of the set of pivotal bits.
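The lemma is easy to sanity-check numerically; here is a small sketch (our own Python) comparing a finite-difference derivative of $\mu_p$ with the total influence for the majority set on five bits:

```python
from itertools import product

def weight(x, p):
    # product-measure weight of the pattern x under bias p
    k = sum(x)
    return p ** k * (1 - p) ** (len(x) - k)

def mu(p, omega, n):
    return sum(weight(x, p) for x in product([0, 1], repeat=n) if omega(x))

def total_influence(p, omega, n):
    # I^(p)(Omega) = sum_j mu_p(partial_j Omega)
    infl = 0.0
    for x in product([0, 1], repeat=n):
        for j in range(n):
            x_flip = x[:j] + (1 - x[j],) + x[j + 1:]
            if omega(x) != omega(x_flip):
                infl += weight(x, p)
    return infl

def majority(x):          # a monotone set on {0,1}^5
    return sum(x) >= 3

n, p, eps = 5, 0.4, 1e-6
lhs = (mu(p + eps, majority, n) - mu(p - eps, majority, n)) / (2 * eps)
print(round(lhs, 4), round(total_influence(p, majority, n), 4))  # both agree
```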
Proposition 17.
$$\frac{\partial h_i(p)}{\partial p_j} = \sum_{A \in \partial_j \Omega_i} \prod_{l \in A} p_l \prod_{l \in A^c \setminus \{i\}} (1 - p_l)$$
Proof: We evaluate the partial derivative from the explicit formula for $h_i(p)$ given in Proposition 12:
$$h_i(p) = \mu_p(\Omega_i) = \sum_{A \in \Omega_i} \prod_{l \in A} p_l \prod_{l \in A^c \setminus \{i\}} (1 - p_l).$$
If we differentiate the above quantity with respect to $p_j$ and use the fact that $\Omega_i$ is monotone, we get the above result.
We still need to show how these influences scale with N and tie up the loose ends, and the following theorem, which we state without proof, helps us in doing so.
Theorem 9.
Let $\Omega$ be a monotone set and suppose that, for all $0 \le p \le 1$, the influences of all bits are equal: $I_1^{(p)}(\Omega) = I_2^{(p)}(\Omega) = \cdots = I_N^{(p)}(\Omega)$. The following is true:
1. There exists a universal constant $C \ge 1$, which is independent of p, $\Omega$ and N, such that
$$\frac{d\mu_p(\Omega)}{dp} \ge \frac{\log N}{C}\, \mu_p(\Omega)\,(1 - \mu_p(\Omega)).$$
2. For any $0 < \epsilon \le \frac{1}{2}$,
$$p_{1-\epsilon} - p_\epsilon \le \frac{C \log \frac{1-\epsilon}{\epsilon}}{\log N},$$
where $p_t \triangleq h^{-1}(t) = \inf\{p \in [0, 1] \mid h(p) \ge t\}$ is the inverse function of the average EXIT function (h(p) is a strictly increasing continuous polynomial function, hence the inverse is well defined on [0,1]).
It is important to note that for the above theorem to hold, the influences have to be spread quite uniformly. Let's see why this is intuitively true. Say there is a dictator function on N variables, in which only one bit is influential (the output depends only on this particular bit): then $\mu_p(\Omega) = p$, so the derivative with respect to p is 1, and it does not scale with N as the right-hand side of the first expression is supposed to.
Observe that the second statement implies that the set has a sharp threshold, because $p_{1-\epsilon} - p_\epsilon \to 0$ as $N \to \infty$. We have considered bit-MAP decoding for the above result. This result also implies that capacity is achieved under block-MAP decoding, if we consider the following proposition, which is quite elementary in itself.
Proposition 18.
Relation between the error probabilities of the bit-MAP and block-MAP decoders:
$$P_{\mathrm{block\text{-}MAP}} \le \frac{N\, P_{\mathrm{bit\text{-}MAP}}}{d_{\min}},$$
where $d_{\min}$ is the minimum distance of the code. We can see that if $P_{\mathrm{bit\text{-}MAP}} \to 0$ fast enough relative to $N / d_{\min}$, then $P_{\mathrm{block\text{-}MAP}} \to 0$.
Theorem 10 (Corollary). Reed Muller codes achieve capacity on erasure channels.
Theorem 11.
Polar codes are 2-transitive.
Proof:
The proof follows the same pattern as the one given while proving that RS codes are 2-transitive. As we have seen before, polar codes are $G_N$-coset codes in which the input $u_1^N$ is transformed into the codeword $x_1^N = u_1^N G_N$, which is eventually sent over the channel. We are given four locations in the code, say $a, b, c, d \in [N]$ such that $a \ne b$ and $c \ne d$, and we need to give a permutation $\pi : [N] \to [N]$ such that $\pi(a) = c$ and $\pi(b) = d$ which also preserves membership in the code. As in the RS case, the permutation cannot be arbitrary: it must come from the automorphism group of the code. Such a permutation only relabels the coordinates of a vector, so applying it to a codeword $x_1^N$ yields $(x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(N)})$, which is again a codeword by the automorphism property and the linearity of the code; what is used here is that the automorphism group of the polar code is rich enough to contain such a $\pi$ for every choice of the pairs (a, b) and (c, d). The above theorem, along with the results related to 2-transitive codes in this section, gives us another proof of the fact that polar codes achieve capacity on the binary erasure channel.
Polar codes achieve capacity on erasure channels under MAP decoding.
It should be noted that the rate of polarization in Guruswami's paper on the speed of polarization is faster than what the above proof would give us.
Future Work
The next big question we would like to ask is whether Reed-Muller codes achieve capacity on other symmetric channels. The approach of EXIT functions and monotone thresholds does not work even for the BSC. Another interesting question would be whether we can improve the above results by coming up with some non-trivial decoding scheme instead of the expensive MAP decoding.

Bibliography
[1] Erdal Arikan, "Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels", IEEE International Symposium on Information Theory, 2008.
[2] Venkatesan Guruswami, Patrick Xia, "Polar Codes: Speed of polarization and polynomial gap to capacity", IEEE Foundations of Computer Science (FOCS), 2013.
[3] Shrinivas Kudekar, Marco Mondelli, Eren Sasoglu, Rudiger Urbanke, "Reed-Muller Codes Achieve Capacity on the Binary Erasure Channel under MAP Decoding", arXiv:1505.05831, 2015.
[4] Santhosh Kumar, Henry D. Pfister, "Reed-Muller Codes Achieve Capacity on Erasure Channels", arXiv:1505.05123, 2015.
[5] Venkatesan Guruswami, Atri Rudra, "Error-correction up to the information-theoretic limit", Communications of the ACM, Volume 52, Issue 3, March 2009.
[6] C. E. Shannon, "A mathematical theory of communication", Bell System Technical Journal, 1948.