Codes That Achieve Capacity on Symmetric Channels
Vishvajeet Nagargoje
Indian Institute of Technology Madras
Under the guidance of Prof. Prahladh Harsha, Tata Institute of Fundamental Research, Mumbai
As part of the Visiting Students Research Programme at the School of Technology and Computer Science, Tata Institute of Fundamental Research, Mumbai, India - 400005.
Email: [email protected], Contact: +91-9884299504. Email: [email protected]
ACKNOWLEDGEMENT
I would like to express my deep gratitude to Prof. Prahladh Harsha and Prof. Vinod Prabhakaran, my internship supervisors, for giving me an opportunity to work under them during the summer. I would also like to thank Sasank Mouli for helping me out with things. I would also like to acknowledge the Tata Institute of Fundamental Research, Mumbai, for warmly hosting me in the summer.

Abstract
Transmission of information reliably and efficiently across channels is one of the fundamental goals of coding and information theory. In this respect, provably capacity-achieving, efficiently decodable deterministic coding schemes were elusive until as recently as 2008, even though schemes which come close to capacity in practice existed. This survey tries to give the interested reader an overview of the area. Erdal Arikan came up with his landmark polar coding schemes, which achieve capacity on symmetric channels subject to the constraint that the input codewords are equiprobable. His idea is to convert any B-DMC into efficiently encodable and decodable channels whose capacities are 0 or 1, while conserving capacity in this transformation. A probability of error decreasing in the block length, independently of the code rate, is achieved for all rates less than the symmetric capacity. These codes perform well in practice since the encoding and decoding complexity is $O(N \log N)$. Guruswami et al. improved the above results by showing that the error probability can be made to decrease almost exponentially in the block length. We also study recent results by Urbanke et al. which show that 2-transitive codes achieve capacity on erasure channels under MAP decoding. Urbanke and his group use sharp-threshold results from the analysis of boolean functions to prove that EXIT functions, which capture the error probability, have a sharp threshold at 1 - R, thus proving that capacity is achieved. One of the oldest and most widely used families of codes, Reed-Muller codes, is 2-transitive. Polar codes are 2-transitive too, and we thus have a different proof of the fact that they achieve capacity, though the rate of polarization in Guruswami's paper is better.
Contents
Introduction
Polar Codes
  Channel Polarization
  Rate of Polarization
  Polar Coding
  Successive Cancellation Decoder
  Encoding and Decoding Complexity
Reed Muller Codes
Reed Muller Codes on Erasure Channels
Future Work
Bibliography
Introduction
Communication is a fundamental need of our lives. Even though in modern times the need for communication and the (ever-the-more sophisticated) tools available to us have increased, communication is something which we humans have been doing for a long time. Inherently, we have error correcting capabilities built into us, which help us understand our fellow humans even when there is corruption due to, say, speech defects, physical mediums, etc. We want our communication systems to do the same for us, but unfortunately there are fundamental questions which we have not been able to answer, and things don't look that easy. We, as people studying computer science, like to abstract things out and work with appropriate models. In this pursuit, Alice and Bob come to our rescue, as always. Suppose Alice and Bob want to communicate with each other efficiently over a 'noisy' physical transmission medium which corrupts the information. We abstract out the possible scheme of corruptions into something which we shall call a 'channel'.
Definition 1. Communication Channel
We define a discrete channel to be a system consisting of the following:
• An input alphabet $\chi$
• An output alphabet $\Upsilon$
• A conditional probability transition matrix $P(y \mid x)$
The conditional probability distribution expresses the probability of observing the output symbol y given that we send the symbol x. The channel is said to be memoryless if the probability distribution of the output depends only on the input at that particular time and is independent of the previous channel inputs or outputs. We will be talking about binary memoryless channels in the rest of this article. Let's quickly define the information-theoretic functions that we will be using in the course of this article.
Definition 2.
Entropy of a random variable
The entropy H(X) of a discrete random variable X is defined by
$$H(X) = -\sum_{x \in \chi} p(x) \log p(x).$$
Definition 3.
Mutual information between two random variables
The mutual information I(X;Y) between two random variables X and Y is defined to be
$$I(X;Y) = \sum_{x \in \chi} \sum_{y \in \Upsilon} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}.$$
Definition 4. (Shannon) Capacity of a Channel
The "information" channel capacity of a discrete memoryless channel is defined as
$$C = \max_{P(x)} I(X;Y).$$
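To make these definitions concrete, here is a minimal numerical sketch (plain Python; the helper names are our own, not from any reference) that computes H(X) and I(X;Y) directly from an input distribution and a transition matrix:

```python
from math import log2

def entropy(px):
    """H(X) = -sum_x p(x) log2 p(x)."""
    return -sum(p * log2(p) for p in px if p > 0)

def mutual_information(px, W):
    """I(X;Y) for input distribution px and channel transition matrix W[x][y]."""
    ny = len(W[0])
    py = [sum(px[x] * W[x][y] for x in range(len(px))) for y in range(ny)]
    return sum(px[x] * W[x][y] * log2(W[x][y] / py[y])
               for x in range(len(px)) for y in range(ny) if W[x][y] > 0)

# a binary channel that flips its input with probability 0.1, uniform input
W = [[0.9, 0.1], [0.1, 0.9]]
print(entropy([0.5, 0.5]))                # 1.0 bit
print(mutual_information([0.5, 0.5], W))  # about 0.531 bits
```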
Observe that the capacity of a channel is completely characterized once we specify the input-output conditional probability distribution. Let us look at a few frequently occurring discrete channels.
Definition 5.
Noiseless Binary Channel
Also called the 'perfect channel', this is a channel such that:
• $\chi = \{0, 1\}$
• $\Upsilon = \{0, 1\}$
• $P(y = 0 \mid x = 0) = 1$ and $P(y = 1 \mid x = 1) = 1$
Observe that we do not need any coding scheme if we use the noiseless channel.
Definition 6.
Binary Erasure Channel
An erasure channel with erasure probability p has the following parameters:
• $\chi = \{0, 1\}$
• $\Upsilon = \{0, 1, e\}$
• $P(y = 0 \mid x = 0) = 1 - p$, $P(y = 1 \mid x = 1) = 1 - p$, $P(y = e \mid x = 0) = p$ and $P(y = e \mid x = 1) = p$
Definition 7.
Binary Symmetric Channel
A binary symmetric channel with flip-over probability p has the following parameters:
• $\chi = \{0, 1\}$
• $\Upsilon = \{0, 1\}$
• $P(y = 0 \mid x = 0) = 1 - p$, $P(y = 1 \mid x = 1) = 1 - p$, $P(y = 1 \mid x = 0) = p$ and $P(y = 0 \mid x = 1) = p$
Definition 8.
Symmetric Channel
A binary-input channel W for which there exists a permutation $\pi$ of the output alphabet $\Upsilon$ such that:
• $\pi^{-1} = \pi$
• $W(y \mid 1) = W(\pi(y) \mid 0)$ for all $y \in \Upsilon$
The symmetric capacity I(W) is the mutual information I(X;Y) when the input X is uniform. Observe that the symmetric capacity I(W) equals the Shannon capacity when W is a symmetric channel. We can also see that the BEC and the BSC are symmetric channels.
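As a quick numerical check of this observation, the sketch below (our own code; it reuses the mutual_information helper from the earlier snippet) evaluates the symmetric capacity I(W) for a BEC and a BSC under uniform inputs; the values match the known closed forms $C_{BEC} = 1 - p$ and $C_{BSC} = 1 - H(p)$:

```python
# channel matrices W[x][y]; the BEC's third output column is the erasure symbol
p = 0.2
bec = [[1 - p, 0.0, p],
       [0.0, 1 - p, p]]
bsc = [[1 - p, p],
       [p, 1 - p]]

uniform = [0.5, 0.5]
# uses mutual_information() from the earlier snippet
print(mutual_information(uniform, bec))  # 0.8   = 1 - p
print(mutual_information(uniform, bsc))  # 0.278 = 1 - H(0.2), approximately
```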
Observe that in all channels except the noiseless channel we cannot decode correctly unless we use encoding and decoding schemes. Intuitively it is clear that if we add some amount of redundancy in the code, it would be easier for us to correct errors. But this leads to transmission of more bits than the message actually contains. The fundamental question in Information and Coding Theory is the tradeoff between redundancy and the number of errors that can be corrected. We shall formalize the notion of redundancy in a code.
Definition 9.
Rate of Code
The rate of a code is defined to be
$$R = \frac{\text{dimension}}{\text{block length}} = \frac{k}{n}.$$
Claude Shannon gave an operational definition of the channel capacity, which implies that it is the maximum rate at which we can transmit information across a channel reliably, with error going to zero in the limit of the block length going to infinity.
Theorem 1.
Shannon's coding theorem
Given a noisy channel with channel capacity C and information transmission rate R: if R < C, there exist codes that allow the probability of error at the receiver to be made arbitrarily small, and the converse is also true.
Shannon used a random coding approach in his landmark paper. But this random coding approach is not something which satisfies theoretical computer science people like us, and we want to lay our hands on a particular coding scheme which does the above for us reliably. We also want to be able to decode it efficiently. In other words, we want an efficiently decodable, capacity-achieving, deterministic coding scheme. We would like to analyze how bad a coding scheme is by looking at the error probability when the natural ML/MAP decoding scheme is used, which outputs the codeword most likely to have produced the received word. In doing so we require a parameter defined as follows:
Definition 10.
Bhattacharyya parameter
For a channel W the Bhattacharyya parameter is defined as
$$Z(W) \triangleq \sum_{y \in \Upsilon} \sqrt{W(y \mid 0)\, W(y \mid 1)}.$$
Theorem 2.
The Bhattacharyya parameter is an upper bound on the probability of error achieved by ML/MAP decoding.
We can quickly relate the symmetric capacity I(W) and the Bhattacharyya parameter Z(W). Intuitively one would expect that $I(W) \approx 1$ iff $Z(W) \approx 0$, and $I(W) \approx 0$ iff $Z(W) \approx 1$. The following bounds make this precise:
Proposition 1.
Relation between symmetric capacity and Bhattacharyya parameter:
$$I(W) \ge \log \frac{2}{1 + Z(W)}, \qquad I(W) \le \sqrt{1 - Z(W)^2}.$$
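To make the parameter concrete, here is a small sketch (our own Python) computing Z(W) from a transition matrix and checking the bounds of Proposition 1 numerically; for the BEC, Z = p exactly, and for the BSC, $Z = 2\sqrt{p(1-p)}$:

```python
from math import sqrt, log2

def bhattacharyya(W):
    """Z(W) = sum_y sqrt(W(y|0) W(y|1)) for a binary-input channel W[x][y]."""
    return sum(sqrt(W[0][y] * W[1][y]) for y in range(len(W[0])))

p = 0.2
bec = [[1 - p, 0.0, p], [0.0, 1 - p, p]]   # Z(BEC(p)) = p
bsc = [[1 - p, p], [p, 1 - p]]             # Z(BSC(p)) = 2 sqrt(p (1 - p))
for name, W, I in [("BEC", bec, 0.8), ("BSC", bsc, 0.2781)]:
    Z = bhattacharyya(W)
    # check Proposition 1: log2(2 / (1 + Z)) <= I(W) <= sqrt(1 - Z^2)
    print(name, round(Z, 4), log2(2 / (1 + Z)) <= I <= sqrt(1 - Z * Z))
```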
Polar Codes
A deterministic capacity-achieving coding scheme had been elusive for a long time, and in the process computer scientists came up with different types of schemes which come extremely close to achieving capacity in practice but fail to do so provably. It was in 2008 that Erdal Arikan came up with his landmark coding scheme which achieves capacity on symmetric channels. His work is the culmination of more than 20 years of research into the sequential decoding cutoff rate for different types of codes. This section tries to outline the idea behind Arikan's polar codes.
Channel Polarization
It can be observed that it is very easy to code for two types of channels: the perfect channel and the useless channel (a channel in which the output is independent of the input). The main idea behind Arikan's technique is that it manufactures, out of N independent copies of a given B-DMC (binary discrete memoryless channel) W, a second set of channels which is polarized, i.e. consisting of only perfect and useless channels, and this transformation also conserves capacity. The operation goes through two parts: combining and splitting. Let's look at these operations in detail.
• Channel combining
We take $N = 2^n$ copies of the channel W and manufacture a vector channel $W_N : \chi^N \to \Upsilon^N$ recursively. In general, two copies of $W_{N/2}$ are combined to produce $W_N$: the input $u_1^N$ to $W_N$ is first transformed into $s_1^N$, where
$$s_{2i-1} = u_{2i-1} \oplus u_{2i}, \qquad s_{2i} = u_{2i}, \qquad 1 \le i \le N/2.$$
For the 0-th level of recursion (n = 0) we set $W_1 \triangleq W$. The first level of recursion combines two copies of $W_1$ and obtains the channel $W_2$, where
$$W_2(y_1, y_2 \mid u_1, u_2) = W(y_1 \mid u_1 \oplus u_2)\, W(y_2 \mid u_2),$$
and so on.
The operator $R_N$ is a permutation called the reverse shuffle operation, and acts on the input $s_1^N$ to produce $v_1^N = (s_1, s_3, \ldots, s_{N-1}, s_2, s_4, \ldots, s_N)$, which is the input to the two copies of $W_{N/2}$. At each level of recursion, it should be observed that the mapping $u_1^N \to v_1^N$ is linear over GF(2). Inductively, it can be proved that the mapping $u_1^N \to x_1^N$, which takes the input of $W_N$ to the input of the raw channels $W^N$, is also linear and can be denoted by $x_1^N = u_1^N G_N$. We can thus relate the transition probabilities of the two channels $W_N$ and $W^N$ by the following equation:
$$W_N(y_1^N \mid u_1^N) = W^N(y_1^N \mid u_1^N G_N)$$
for appropriately defined alphabets on either side. We shall talk about the implementations of this transformation and the encoding complexity later in this article.
• Channel splitting
As mentioned earlier, we need to split $W_N$ back into a set of N channels. We call them virtual channels and denote them individually as $W_N^{(i)} : \chi \to \Upsilon^N \times \chi^{i-1}$, $1 \le i \le N$, and define them as
$$W_N^{(i)}(y_1^N, u_1^{i-1} \mid u_i) \triangleq \sum_{u_{i+1}^N \in \chi^{N-i}} \frac{1}{2^{N-1}} W_N(y_1^N \mid u_1^N).$$
This definition goes hand in hand with the successive cancellation decoder used in polar coding, as we describe below. Say we are trying to decode the i-th bit and we are given correctly decoded estimates for the first i - 1 bits. We use this vector of i - 1 estimates and the vector of observations $y_1^N$ to get the i-th bit. We assume that the inputs $u_1^N$ are uniformly distributed. Note that even though this vector is transformed into different vectors during the encoding process, the bits stay uniform, because the transformations are linear and invertible. This uniformity dictates the factor $2^{-(N-1)}$ in the sum.
Let us observe a few properties of the virtual channels we get after the above transformation. To make things easy, we shall see what happens when the channel is a BEC($\epsilon$) with uniform input. Since we are talking about uniform input, the capacity is I(W), and
$$I(W_{2N}^{(2i-1)}) = I(W_N^{(i)})^2, \qquad I(W_{2N}^{(2i)}) = 2\, I(W_N^{(i)}) - I(W_N^{(i)})^2, \qquad I(W_1^{(1)}) = 1 - \epsilon.$$
Observe that at every level of recursion we get one channel which has capacity better than the original one and one which has capacity worse than the original one. In order to analyse what happens for the general channel, we would like to see what happens locally to the above quantities. For this we relate the transition probabilities of the virtual channels directly to the individual channels one level up, instead of block by block: we map pairs of independent copies to the channels obtained after splitting, $(W, W) \to (W', W'')$; in more general terms, the following proposition maps $(W_N^{(i)}, W_N^{(i)}) \to (W_{2N}^{(2i-1)}, W_{2N}^{(2i)})$.
Proposition 2. Recursive channel transformations
For any $n \ge 0$, $N = 2^n$, $1 \le i \le N$,
$$W_{2N}^{(2i-1)}(y_1^{2N}, u_1^{2i-2} \mid u_{2i-1}) = \sum_{u_{2i}} \frac{1}{2}\, W_N^{(i)}(y_1^N, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} \mid u_{2i-1} \oplus u_{2i}) \cdot W_N^{(i)}(y_{N+1}^{2N}, u_{1,e}^{2i-2} \mid u_{2i})$$
and
$$W_{2N}^{(2i)}(y_1^{2N}, u_1^{2i-1} \mid u_{2i}) = \frac{1}{2}\, W_N^{(i)}(y_1^N, u_{1,o}^{2i-2} \oplus u_{1,e}^{2i-2} \mid u_{2i-1} \oplus u_{2i}) \cdot W_N^{(i)}(y_{N+1}^{2N}, u_{1,e}^{2i-2} \mid u_{2i}),$$
where $u_{1,o}^{2i-2}$ and $u_{1,e}^{2i-2}$ denote the odd- and even-indexed subvectors of $u_1^{2i-2}$ (the capacities here are with respect to uniform inputs). We are now ready to talk about how the capacity and the reliability change through a local transformation as above.
As we have seen for the erasure channel, we get one channel which is good and another which is bad; the following proposition formalizes this notion.
Proposition 3.
Local transformation of rate and reliability
If $(W, W) \to (W', W'')$ is the local transformation, then the following statements are true:
1. $I(W') + I(W'') = 2\, I(W)$
2. $I(W') \le I(W'')$
3. $Z(W'') = Z(W)^2$
4. $Z(W') \le 2\, Z(W) - Z(W)^2$
From the above proposition, we can see that $I(W') \le I(W) \le I(W'')$ and $Z(W') \ge Z(W) \ge Z(W'')$, which goes with our intuition that we have one good and one bad channel (in loose terms, a good channel has more capacity and a smaller Bhattacharyya parameter than the original channel, and the opposite is true for a bad channel). Also $Z(W') + Z(W'') \le 2\, Z(W)$, which means that the reliability parameter can, in total, only improve with our transformation; equality holds exactly when we have perfect or useless channels.
We apply these results to the recursive formulations in the previous proposition. We sketch the recursive transformations as a binary tree, in which every node gives birth to a good and a bad channel. The root is the original channel W, and the channel $W_{2^n}^{(i)}$ is located at the n-th level, i-th node from the top. We can label the nodes in this tree in a natural way, one in which each node is labelled with the path taken from the root (1 means up and 0 means down). If we repeat this operation many times, we expect that most of the obtained channels have capacities near zero or one. In other words, the channel set becomes significantly polarized after a few iterations. The following theorem formalizes our intuition.
Theorem 3.
For any binary channel W, the channels $\{W_N^{(i)}\}$ polarize, in the sense that for any fixed $\delta \in (0, 1)$, as N tends to infinity through powers of two, the fraction of indices i for which $I(W_N^{(i)}) \in (1 - \delta, 1]$ goes to I(W), and the fraction of those for which the capacity lies in $[0, \delta)$ goes to $1 - I(W)$.
Proof :
The sequence of random variables $I_n$, defined as the capacity of the channel obtained by starting at the root and following a uniformly random path of length n, is a martingale: the process is memoryless and $E[I_{n+1} \mid \text{path } n] = \frac{1}{2} I(W_{\text{path},0}) + \frac{1}{2} I(W_{\text{path},1}) = I_n$. Likewise, the sequence of random variables $Z_n$ is a supermartingale, because $E[Z_{n+1} \mid \text{path } n] = \frac{1}{2} Z(W_{\text{path},0}) + \frac{1}{2} Z(W_{\text{path},1}) \le Z_n$. Both processes are bounded, hence uniformly integrable, and hence converge by the martingale convergence theorem. It follows that $E[|Z_{n+1} - Z_n|] \to 0$ as $n \to \infty$. Since $Z_{n+1} = Z_n^2$ with probability $\frac{1}{2}$, we have $E[|Z_{n+1} - Z_n|] \ge \frac{1}{2} E[Z_n(1 - Z_n)] \ge 0$. By the sandwich theorem of limits, $E[Z_n(1 - Z_n)] \to E[Z_\infty(1 - Z_\infty)] = 0$. Hence $Z_\infty = 0$ or $Z_\infty = 1$ almost everywhere. Proposition 1 then implies that $I_\infty$ takes values in $\{0, 1\}$, and by the martingale property $P(I_\infty = 1) = E[I_\infty] = I(W)$ and $P(I_\infty = 0) = 1 - I(W)$.
We can thus see that polar codes indeed achieve capacity in the limit of the blocklength going to infinity.
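Since the BEC recursion is explicit, the polarization phenomenon is easy to watch numerically. Here is a minimal sketch (our own Python, not from Arikan's paper) applying the capacity recursion $I^- = I^2$, $I^+ = 2I - I^2$ and reporting the polarized fractions:

```python
def polarize_bec(capacity, n):
    """Apply n levels of the BEC recursion I(W-) = I^2, I(W+) = 2I - I^2.

    Returns the capacities of the 2**n synthetic channels."""
    caps = [capacity]
    for _ in range(n):
        caps = [v for c in caps for v in (c * c, 2 * c - c * c)]
    return caps

I0, delta = 0.5, 0.1            # start from BEC(0.5); "polarized" threshold
for n in (4, 8, 12, 16):
    caps = polarize_bec(I0, n)
    good = sum(c > 1 - delta for c in caps) / len(caps)
    bad = sum(c < delta for c in caps) / len(caps)
    # the two fractions should approach I(W) = 0.5 and 1 - I(W) = 0.5
    print(f"n={n:2d}  near-perfect: {good:.3f}  near-useless: {bad:.3f}")
```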
Rate of Polarization
Until now we have said that polar codes with sufficiently high blocklengths achieve capacity. We would like to know how fast this happens, and what the error probability is for a particular block length. This section tries to address these questions. The question was answered by Arikan in his paper and is outlined in the following theorem. Guruswami improved upon the result in his paper, and that result shall be mentioned later.
Theorem 4.
For a B-DMC W with $I(W) > 0$ and any fixed $R < I(W)$, there exists a sequence of sets $A_N \subset \{1, 2, \ldots, N\}$, $N = 2^n$, such that $|A_N| \ge NR$ and $Z(W_N^{(i)}) \le O(N^{-5/4})$ for all $i \in A_N$.
The above theorem essentially says that there exists a subset of the set of virtual channels which are 'good', large enough that capacity is not sacrificed, and on which the reliability parameter goes down polynomially in the blocklength.
Polar Coding
We have seen in the earlier sections that the synthetic channels are sufficiently polarized. We need a way to access the 'good' channels, the channels $W_N^{(i)}$ for which $Z(W_N^{(i)})$ is close to 0, and thus achieve the symmetric channel capacity. We define a class of codes called $G_N$-coset codes, in which $G_N$ is the generator matrix, i.e. $x_1^N = u_1^N G_N$. For an arbitrary subset A of the indices, we can write
$$x_1^N = u_A G_N(A) \oplus u_{A^c} G_N(A^c)$$
because it is a linear transformation. We have three parameters here, A, $u_A$ and $u_{A^c}$, and hence we talk about $(N, K, A, u_{A^c})$ codes. A is interpreted as the 'information set', the set of indices which coincide with 'good' channels, and $A^c$ as the set of 'bad' channels. The bits $u_{A^c}$ are the 'frozen bits', and we leave $u_A$ to be free variables. We need to give a rule for selecting the information set A. As we shall see later, the way we choose the frozen bits does not have any effect on how well the coding scheme performs over symmetric channels. We shall briefly talk about the decoder for polar codes, because that will give us insights on how we could possibly choose the information set and the frozen bits.
The Successive Cancellation Decoder
We shall be considering an $(N, K, A, u_{A^c})$ $G_N$-coset code in which $u_1^N$ has been encoded into a codeword $x_1^N$ and sent over the channel $W^N$. The decoder's task is to generate an estimate $\hat{u}_1^N$ of $u_1^N$, given the knowledge of A, $u_{A^c}$ and the channel output $y_1^N$. An obvious way to decode the $A^c$ bits is to set $\hat{u}_{A^c} = u_{A^c}$. We need a way to decode $u_A$. We do this by exploiting the structure of polar codes, in a way which uses the bits we have already decoded and which treats the bits which are not decoded yet as noise. We call this a successive cancellation (SC) decoder. Let's formalize this: our SC decoder outputs decisions $\hat{u}_i$ in order from i = 1 to N such that
$$\hat{u}_i \triangleq \begin{cases} u_i, & \text{if } i \in A^c \\ h_i(y_1^N, \hat{u}_1^{i-1}), & \text{if } i \in A \end{cases}$$
where
$$h_i(y_1^N, \hat{u}_1^{i-1}) \triangleq \begin{cases} 0, & \text{if } \dfrac{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \mid 0)}{W_N^{(i)}(y_1^N, \hat{u}_1^{i-1} \mid 1)} \ge 1 \\ 1, & \text{otherwise.} \end{cases}$$
These functions are similar to the ML decoding functions, but differ in that they treat the bits which we have not seen yet as noise, in other words as random variables.
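Over the BEC the likelihood ratios above take only three values: a bit is 0, 1, or undetermined. The following is a minimal runnable sketch (our own Python and naming, not Arikan's code; it uses the common convention $x_1^N = u_1^N F^{\otimes n}$ without the bit-reversal permutation, which merely relabels the synthetic channels, and it guesses 0 for an undecidable information bit, which is exactly an SC decoding failure):

```python
ERASED = None  # over the BEC a "likelihood" is just 0, 1 or ERASED

def f_minus(a, b):
    # check-node step: x_i XOR x_{i+h} is known only if both observations are
    return a ^ b if a is not None and b is not None else ERASED

def g_plus(a, b, c):
    # variable-node step with decoded partial sum c: either observation works
    if b is not None:
        return b
    return a ^ c if a is not None else ERASED

def sc_decode(y, frozen):
    """SC decoding of a polar code over the BEC.

    y      : received word (entries 0/1/ERASED), length a power of two
    frozen : same length; None marks an information position, otherwise the
             known frozen-bit value
    Returns (u_hat, x_hat), where x_hat is the re-encoding of u_hat."""
    if len(y) == 1:
        u = frozen[0] if frozen[0] is not None else (y[0] if y[0] is not None else 0)
        return [u], [u]   # an erased information bit is guessed as 0 (a failure)
    h = len(y) // 2
    # first half of u sees the degraded "minus" channels
    u_a, x_a = sc_decode([f_minus(y[i], y[i + h]) for i in range(h)], frozen[:h])
    # second half sees the upgraded "plus" channels, using the partial sums x_a
    u_b, x_b = sc_decode([g_plus(y[i], y[i + h], x_a[i]) for i in range(h)],
                         frozen[h:])
    return u_a + u_b, [x_a[i] ^ x_b[i] for i in range(h)] + x_b

import random
info = [3, 5, 6, 7]   # the four best synthetic channels of BEC(0.3) at N = 8
frozen = [None if i in info else 0 for i in range(8)]
u = [random.randint(0, 1) if i in info else 0 for i in range(8)]
_, x = sc_decode([ERASED] * 8, list(u))   # freezing every bit to u encodes u
y = [b if random.random() > 0.3 else ERASED for b in x]   # pass through BEC(0.3)
u_hat, _ = sc_decode(y, frozen)
print(u, u_hat)   # mismatches correspond to SC decoding failures on erasures
```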
We need to analyse the probability of error in this SC decoding framework. The error probabilities are denoted in a natural way.
Definition 11. The probability of block error for an $(N, K, A, u_{A^c})$ code, assuming that each information vector $u_A$ is sent uniformly, is
$$P_e(N, K, A, u_{A^c}) \triangleq \sum_{u_A \in \chi^K} \frac{1}{2^K} \sum_{y_1^N :\, \hat{u}_1^N(y_1^N) \ne u_1^N} W_N(y_1^N \mid u_1^N).$$
We also denote the average of the above error probability over all choices of $u_{A^c}$ by $P_e(N, K, A)$. We claim that the reliability parameters still provide an upper bound on this error probability.
Proposition 4.
$$P_e(N, K, A) \le \sum_{i \in A} Z(W_N^{(i)})$$
Theorem 5. The average probability of block error for polar coding under SC decoding goes down as $O(N^{-1/4})$ for any B-DMC W and any fixed rate less than the capacity:
$$P_e(N, R) = O(N^{-1/4}).$$
The proof easily follows from Theorem 4 and the relation between block and bit error for the SC decoder. Note that the above can be viewed as an existential result, in the sense that there exists a way of setting the frozen bits so that the error probability goes down as stated. We have stronger results when the channel is symmetric. Let's observe a few properties of symmetric channels.
Proposition 5.
If a B-DMC W is symmetric, then $W^N$, $W_N$ and $W_N^{(i)}$ are also symmetric. The symmetries of the channel imply that Proposition 4 holds for any way of setting the frozen bits.
Theorem 6.
The probability of block error for polar coding under SC decoding goes down as $O(N^{-1/4})$ for any symmetric B-DMC W and any fixed rate less than the capacity, for $u_{A^c}$ fixed arbitrarily:
$$P_e(N, K, A, u_{A^c}) = O(N^{-1/4}).$$
The idea used in the proof is that the event of making an error in the block and the choice of the vector of frozen bits are independent, and thus we can freeze the bits (to, say, the all-zero vector) without affecting the probability of error.
Encoding and Decoding Complexity
Polar codes are quite remarkable in the sense that both encoding and decoding are polynomial in the block length; to be more precise, both take $O(N \log N)$ time steps on a sequential machine. It would be good to note that the structure of the encoding matrix can help us do it faster than the above on a parallel machine ($O(\log N)$ time). As for the encoding complexity, $G_N$ can be written in terms of tensor products (together with a bit-reversal permutation), and we can also exploit its relation to Fast Fourier Transforms; the $O(N \log N)$ time obtained is due to the bit-indexing methods frequently used in FFTs. It can be easily observed that the decoding complexity is also $O(N \log N)$, because we evaluate the ML-like decision functions recursively over $\log N$ levels, each taking O(N) time. We can thus see that polar coding is not just capacity achieving, but also something which is quite implementable in practice owing to the low encoding and decoding complexities.
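To make the encoding recursion concrete, here is a minimal sketch of the $O(N \log N)$ encoder (our own Python; it uses the $F^{\otimes n}$ convention without the bit-reversal permutation, which only reorders the synthetic channel indices):

```python
def polar_encode(u):
    """Compute x = u F^{(tensor) n} with F = [[1, 0], [1, 1]] (no bit reversal).

    T(N) = 2 T(N/2) + O(N), i.e. O(N log N) XOR operations overall."""
    if len(u) == 1:
        return list(u)
    h = len(u) // 2
    # one combining level: (u_a, u_b) -> ((u_a xor u_b) F, u_b F)
    return polar_encode([u[i] ^ u[i + h] for i in range(h)]) + polar_encode(u[h:])

print(polar_encode([1, 0, 1, 1]))  # -> [1, 1, 0, 1]
```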
Reed Muller Codes
Reed-Muller codes are one of the oldest families of error-correcting codes, and they use concepts from algebra for the encoding and decoding process. The idea is to look at the message as the coefficients of a polynomial of a suitable degree, and to transmit suitable evaluations of that polynomial. We first recall the closely related Reed-Solomon codes.
Definition 12.
Reed Solomon Codes
$$RS_{F,S,n,k}(m) = (f(\alpha_1), f(\alpha_2), \ldots, f(\alpha_n)), \quad \text{where } f(X) = m_0 + m_1 X + \cdots + m_{k-1} X^{k-1}.$$
We view a message of k symbols as the coefficients of a univariate polynomial f(X) of degree k - 1. We encode the message as the evaluations of this polynomial at n different points in the underlying field (or in a subset S which the code designer is left to choose). We should talk about how we are getting these evaluations across. We define a special matrix to make the representation easier to work with.
Definition 13.
The Vandermonde matrix
$$G = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ \alpha_1 & \alpha_2 & \cdots & \alpha_n \\ \alpha_1^2 & \alpha_2^2 & \cdots & \alpha_n^2 \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_1^{k-1} & \alpha_2^{k-1} & \cdots & \alpha_n^{k-1} \end{pmatrix}$$
is the generator matrix for $RS_{F,S,n,k}$.
Looking at the generator matrix we can see that RS codes are linear.
Proposition 6.
RS codes are linear.
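Concretely, RS encoding is a single vector-matrix product with the Vandermonde matrix, i.e. polynomial evaluation. A minimal sketch over a prime field (our own Python and naming):

```python
def rs_encode(msg, alphas, p):
    """Evaluate f(X) = msg[0] + msg[1] X + ... + msg[k-1] X^(k-1) at each alpha.

    This is exactly the vector-matrix product msg * G with the Vandermonde
    generator matrix above, so linearity of the code is immediate."""
    return [sum(m * pow(a, j, p) for j, m in enumerate(msg)) % p for a in alphas]

# RS over the prime field F_7 with n = 7 (all field points) and k = 3
print(rs_encode([2, 0, 1], list(range(7)), 7))  # evaluations of 2 + X^2 mod 7
```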
Proposition 7.
The minimum distance of RS codes is n - k + 1.
Proof: This is true because if $m' \ne m''$ are the messages, the corresponding polynomials of degree less than k can agree on at most k - 1 evaluation points, so the codewords differ in at least n - k + 1 locations.
RS codes are good in the sense that their distance is huge, but on the downside they require that the underlying field be sufficiently large, at least of order n. To address this difficulty, we talk about Reed-Muller codes. They are generalizations of Reed-Solomon codes in the sense that we use multivariate polynomials instead.
Definition 14.
Reed Muller Codes
Given a field size q, a number m of variables, and a total degree bound r, the $RM_{q,m,r}$ code is the linear code over $F_q$ defined by the encoding map
$$f(X_1, X_2, \ldots, X_m) \mapsto \langle f(\alpha) \rangle_{\alpha \in F_q^m},$$
applied to the domain of all polynomials in $F_q[X_1, X_2, \ldots, X_m]$ of total degree $\deg(f) \le r$.
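For intuition, here is a hedged sketch of the binary case q = 2 (our own Python): the message assigns one coefficient to each monomial of degree at most r, and the codeword is the evaluation table over $F_2^m$.

```python
from itertools import combinations, product

def monomial(point, S):
    # evaluate prod_{i in S} X_i at the given point of F_2^m
    out = 1
    for i in S:
        out &= point[i]
    return out

def rm_encode(msg, m, r):
    """Encode msg (one F_2 coefficient per monomial of degree <= r) as the
    evaluation table of the corresponding polynomial over F_2^m."""
    points = list(product([0, 1], repeat=m))
    monomials = [S for d in range(r + 1) for S in combinations(range(m), d)]
    assert len(msg) == len(monomials)   # k = sum_{d <= r} C(m, d)
    return [sum(c * monomial(pt, S) for c, S in zip(msg, monomials)) % 2
            for pt in points]

# RM(q=2, m=3, r=1): k = 4, N = 8; in fact the [8,4] extended Hamming code
print(rm_encode([1, 0, 1, 1], 3, 1))   # evaluation table of 1 + X_2 + X_3
```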
Let's talk about decoding these families of codes. We would expect that unique decoding is possible only if there are not too many errors in the received word.
Theorem 7.
Unique decoding from e errors is possible if and only if the minimum distance of the code is at least 2e + 1; for RS codes this means up to $\frac{n-k}{2}$ errors.
Proof: We look at Hamming balls of radius e around the codewords; at the boundary condition a received word can be close to exactly two of them, which gives us the result.
RS codes are decoded using the 'magical' Berlekamp-Welch algorithm, which involves fitting the bad points on a curve and then finding them out. Let $y_i$ be the (possibly corrupted) evaluations of the polynomial at distinct locations $x_i$ for $i \in \{1, 2, \ldots, n\}$, and let e be the number of errors. Our objective is to find a polynomial p(X) of degree at most k - 1 such that the number of errors e is respected. The following algorithm helps us in doing so.
Algorithm 1.
Berlekamp-Welch algorithm
1. If there is a polynomial p of degree at most k - 1 such that $p(x_i) = y_i$ for all $i = 1, \ldots, n$, then output p. Otherwise:
2. Find polynomials E(x) and N(x) such that:
• E is not identically zero
• E(x) has degree at most e and N(x) has degree at most e + k - 1
• For every $i = 1, \ldots, n$: $N(x_i) = E(x_i) \cdot y_i$
3. Output N(x)/E(x) if E(x) divides N(x); otherwise output 'error'.
We write the constraints for the polynomials E(x) and N(x) as a linear system and solve it; it can be proved that any solution of the constraints satisfies the conditions and gives us the correct message polynomial, as made concrete in the sketch below. If unique decoding is not possible, then we can do list decoding up to a particular fraction of errors, after which the list size becomes exponential. This is done using the even more magical Guruswami-Sudan algorithm, which uses the same idea of fitting the bad points on a curve and then finding them out.
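Since the constraints in step 2 are linear in the unknown coefficients, the whole algorithm reduces to Gaussian elimination plus one polynomial division. Below is a hedged, minimal sketch over a prime field $F_p$ (our own Python and naming, illustrative rather than optimized); we force E to be monic of degree exactly e, the standard way to rule out the all-zero solution, and assume $n \ge 2e + k$:

```python
def solve_mod_p(A, b, p):
    """Gaussian elimination over F_p; returns one solution of A x = b, or None."""
    rows, cols = len(A), len(A[0])
    M = [row[:] + [bi % p] for row, bi in zip(A, b)]
    r, piv_cols = 0, []
    for c in range(cols):
        piv = next((i for i in range(r, rows) if M[i][c]), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        inv = pow(M[r][c], p - 2, p)          # field inverse (p prime)
        M[r] = [v * inv % p for v in M[r]]
        for i in range(rows):
            if i != r and M[i][c]:
                M[i] = [(vi - M[i][c] * vr) % p for vi, vr in zip(M[i], M[r])]
        piv_cols.append(c)
        r += 1
        if r == rows:
            break
    if any(M[i][cols] for i in range(r, rows)):   # inconsistent system
        return None
    x = [0] * cols                                # free variables set to 0
    for i, c in enumerate(piv_cols):
        x[c] = M[i][cols]
    return x

def poly_divmod(num, den, p):
    """Divide polynomials (coefficient lists, lowest degree first) over F_p."""
    num, den = num[:], den[:]
    while len(den) > 1 and den[-1] == 0:
        den.pop()
    q = [0] * max(1, len(num) - len(den) + 1)
    inv = pow(den[-1], p - 2, p)
    for i in range(len(num) - len(den), -1, -1):
        q[i] = num[i + len(den) - 1] * inv % p
        for j, d in enumerate(den):
            num[i + j] = (num[i + j] - q[i] * d) % p
    return q, num                                 # quotient, remainder

def berlekamp_welch(xs, ys, k, e, p):
    """Recover the message polynomial (deg < k) from <= e errors; needs n >= 2e + k.
    Step 1 of the algorithm is subsumed: with zero errors E = x^e works."""
    rows, rhs = [], []
    for x, y in zip(xs, ys):
        xp = [pow(x, j, p) for j in range(e + k)]
        # unknowns: e+k coefficients of N, then the e low coefficients of monic E
        rows.append(xp + [(-y * xp[j]) % p for j in range(e)])
        rhs.append(y * pow(x, e, p))              # monic x^e term of E, moved to RHS
    sol = solve_mod_p(rows, rhs, p)
    if sol is None:
        return None
    Ncoef, Ecoef = sol[:e + k], sol[e + k:] + [1]
    q, rem = poly_divmod(Ncoef, Ecoef, p)
    return None if any(rem) else q[:k]            # message coefficients, or failure

# toy run over F_7: f(X) = 2 + X^2, two corrupted positions
p, k = 7, 3
xs = list(range(7))
ys = [(2 + x * x) % p for x in xs]
ys[1], ys[4] = (ys[1] + 3) % p, (ys[4] + 1) % p
print(berlekamp_welch(xs, ys, k, e=2, p=p))       # -> [2, 0, 1]
```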
Definition 15. C is said to be 1-transitive if for any $j_1, j_2 \in [N]$ satisfying $j_1 \ne j_2$, there exists a permutation $\pi : [N] \to [N]$ such that:
• $\pi(j_1) = j_2$
• $(y_{\pi(1)}, y_{\pi(2)}, \ldots, y_{\pi(N)}) \in C$ for every $(y_1, y_2, \ldots, y_N) \in C$
Definition 16. C is said to be 2-transitive if for any $j_1, j_2, j_3, j_4 \in [N]$ satisfying $j_1 \ne j_2$ and $j_3 \ne j_4$, there exists a permutation $\pi : [N] \to [N]$ such that:
• $\pi(j_1) = j_3$
• $\pi(j_2) = j_4$
• $(y_{\pi(1)}, y_{\pi(2)}, \ldots, y_{\pi(N)}) \in C$ for every $(y_1, y_2, \ldots, y_N) \in C$
Let's look at some important properties of the above codes.
Proposition 8.
Reed Solomon codes (with evaluations over the whole field) are 2-transitive.
Proof:
As we have seen before, RS codes are generated by the Vandermonde matrix G: the message $m_1^k$ is transformed into the codeword $y_1^N = m_1^k G$, which is eventually sent over the channel; a codeword is thus the evaluation table of a polynomial f of degree less than k over the whole field. We are given four locations in the code, say $a, b, c, d \in [N]$ (identified with field elements), such that $a \ne b$ and $c \ne d$, and we need to give a permutation $\pi : [N] \to [N]$ such that $\pi(a) = c$ and $\pi(b) = d$ which also preserves membership in the code. We cannot pick an arbitrary permutation satisfying these constraints; we need one that polynomials respect, and affine maps of the evaluation points provide it. Choose $\sigma(x) = \alpha x + \beta$ with $\alpha = (c - d)(a - b)^{-1} \ne 0$ and $\beta = c - \alpha a$, so that $\sigma(a) = c$ and $\sigma(b) = d$, and let $\pi = \sigma$ viewed as a permutation of the coordinate set. The permuted word has value $y_{\sigma(x)} = f(\sigma(x))$ at location x, i.e. it is the evaluation table of $g(X) = f(\alpha X + \beta)$. Composing with an affine map does not increase the degree, so $\deg g = \deg f < k$, and the permuted word is a codeword too.
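This argument is easy to check numerically; the following small sketch (our own Python, prime field $F_7$) permutes a codeword by an affine map of the evaluation points and verifies that the result is again an evaluation table of a low-degree polynomial:

```python
def rs_encode_all(msg, p):
    # codeword = evaluations of the message polynomial at every point of F_p
    return [sum(m * pow(a, j, p) for j, m in enumerate(msg)) % p for a in range(p)]

def fits_low_degree(word, p, k):
    """Check that word is the evaluation table of some polynomial of degree < k,
    by brute force over candidate coefficient vectors (fine for tiny p and k)."""
    from itertools import product
    return any(all(sum(c * pow(x, j, p) for j, c in enumerate(cand)) % p == word[x]
                   for x in range(p))
               for cand in product(range(p), repeat=k))

p, k = 7, 3
y = rs_encode_all([2, 0, 1], p)              # codeword: table of f(X) = 2 + X^2
a, b = 3, 5                                  # affine map sigma(x) = 3x + 5 on F_7
z = [y[(a * x + b) % p] for x in range(p)]   # coordinate-permuted word
print(fits_low_degree(z, p, k))              # True: z is again an RS codeword
```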
Reed Muller codes are 2-transitive.
The proof is similar to the one given above: the affine maps $x \mapsto Ax + b$ of $F_q^m$ (with A invertible) act 2-transitively on the evaluation points and preserve total degree, so the induced coordinate permutations preserve the code.
Note that if we want our codes to satisfy these transitivity properties, we must give evaluations of the polynomials on all points of the corresponding domains. We shall see in the coming sections that RM codes have more interesting properties.
Reed Muller Codes Achieve Capacity on Erasure Channels
As recently as July 2015, Urbanke et al. and Santhosh Kumar et al. independently proved that RM codes achieve capacity on erasure channels. Let us now build up the machinery required to prove the aforementioned result. We assume that the i-th bit of the codeword is transmitted through an erasure channel with erasure probability $p_i$, i.e. BEC($p_i$); we denote this vector channel by BEC(p), where p is the corresponding vector of erasure probabilities. In this setting we would like to analyze the probability that the MAP decoder is unable to decode the i-th bit, and then try to get a bound on the probability of error for the block MAP decoder. We also assume that the input distribution is uniform. Since we are talking about erasure channels and MAP decoding, we should define EXIT functions, which we will use later to capture the decoding errors.
Definition 17.
EXIT functions
The vector EXIT function associated with the i-th bit is defined to be
$$h_i(p) \triangleq H(X_i \mid Y_{\sim i}(p_{\sim i})).$$
We can define an average EXIT function, which we will use later.
Definition 18.
Average EXIT function
The average EXIT function is defined as
$$h(p) \triangleq \frac{1}{N} \sum_{i=1}^N h_i(p).$$
We denote the bit-MAP decoder's output for the i-th bit as $D_i$. On receiving the sequence Y, if the i-th bit $X_i$ can be recovered uniquely, then $D_i(Y) = X_i$; otherwise, $D_i$ declares an erasure and returns *. We claim that the probability of the decoder failing to decode is equal to the i-th EXIT function.
Proposition 10.
$$\Pr(D_i(Y) \ne X_i) = H(X_i \mid Y)$$
Proof:
Whenever bit i can be recovered from a received sequence Y = y, $H(X_i \mid Y = y) = 0$; otherwise, $H(X_i \mid Y = y) = 1$ because of the uniform codeword assumption. Observe that the conditional entropy of the i-th bit given a particular received sequence is either equal to zero or one; it cannot be anything in between.
Proposition 11.
The MAP EXIT function for the i-th bit satisfies
$$h_i(p) = \frac{\partial H(X \mid Y(p))}{\partial p_i}.$$
Proof:
By the chain rule of entropy,
$$H(X \mid Y(p)) = H(X_i \mid Y(p)) + H(X_{\sim i} \mid X_i, Y(p)).$$
Observe that the second term in the above expansion is independent of $p_i$, because $H(X_{\sim i} \mid X_i, Y(p)) = H(X_{\sim i} \mid X_i, Y_{\sim i}(p_{\sim i}))$. For the first term,
$$H(X_i \mid Y(p)) = \Pr(Y_i = *)\, H(X_i \mid Y_{\sim i}(p_{\sim i}), Y_i = *) + \Pr(Y_i = X_i)\, H(X_i \mid Y_{\sim i}(p_{\sim i}), Y_i = X_i) = p_i\, H(X_i \mid Y_{\sim i}(p_{\sim i})),$$
since the second entropy term is zero. Differentiating with respect to $p_i$ thus gives the proposition. The above proposition leads us to an important theorem in coding theory, the area theorem.
Theorem 8.
The Area Theorem
The average EXIT function satisfies the area theorem:
$$\int_0^1 h(p)\, dp = \frac{K}{N}.$$
Proof: The above proposition gives us the derivatives of $H(X \mid Y(p))$, so we can integrate from 0 to 1 along the fixed path $p = (t, t, \ldots, t)$:
$$H(X \mid Y(1)) - H(X \mid Y(0)) = \int_0^1 \Big( \sum_{i=1}^N h_i(t) \Big)\, dt.$$
Here $H(X \mid Y(1)) = H(X) = K$, since the input distribution is uniform and the encoding captures the same randomness as the original distribution, while $H(X \mid Y(0)) = 0$. Dividing by N gives the result.
We would like to look at the set of erasure patterns (vectors $Y_{\sim i}$) from which we cannot decode the i-th bit $X_i$ indirectly using MAP decoding. We claim that the following set correctly captures this notion and contains all the erasure patterns from which indirect recovery of the i-th bit is not possible. Hence the measure of this set determines the probability of error, which is what we eventually want.
Definition 19.
The set of erasure patterns 'bad' for bit i is contained in $\Omega_i$, defined as
$$\Omega_i \triangleq \{ A \subseteq [N] \setminus \{i\} \mid \exists B \subseteq [N] \setminus \{i\},\; B \cup \{i\} \in \mathcal{C},\; B \subseteq A \}$$
(here codewords are identified with their supports). From the above discussion, it is clear that the measure of this set is equal to the probability of error, which, in turn, is equal to the i-th EXIT function because of the uniform input assumption. We summarize this in the following proposition.
Proposition 12. $\Omega_i$ encodes $h_i(p)$:
$$h_i(p) = \mu_p(\Omega_i) = \sum_{A \in \Omega_i} \prod_{l \in A} p_l \prod_{l \in A^c \setminus \{i\}} (1 - p_l).$$
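For a small code all of this can be verified by brute force. The following sketch (our own Python; the helper names are ours) enumerates erasure patterns for the [8,4] extended Hamming code, evaluates $h_i(p)$ from the formula above, and integrates numerically to check the area theorem:

```python
from itertools import product

def codewords(G):
    # all F_2 linear combinations of the rows of the generator matrix G
    return [tuple(sum(m * g for m, g in zip(msg, col)) % 2 for col in zip(*G))
            for msg in product([0, 1], repeat=len(G))]

def exit_i(G, i, p):
    """h_i(p) = mu_p(Omega_i), by brute force over all erasure patterns A."""
    n = len(G[0])
    supports = [frozenset(j for j, bit in enumerate(c) if bit)
                for c in codewords(G) if c[i]]          # codewords with c_i = 1
    others = [j for j in range(n) if j != i]
    total = 0.0
    for bits in product([0, 1], repeat=n - 1):
        A = frozenset(j for j, b in zip(others, bits) if b)
        if any(S - {i} <= A for S in supports):         # A contains some such B
            total += p ** len(A) * (1 - p) ** (n - 1 - len(A))
    return total

# generator matrix of RM(2,3,1), the [8,4] extended Hamming code
G = [[1, 1, 1, 1, 1, 1, 1, 1],
     [0, 0, 0, 0, 1, 1, 1, 1],
     [0, 0, 1, 1, 0, 0, 1, 1],
     [0, 1, 0, 1, 0, 1, 0, 1]]
# midpoint-rule integral of h(p); all h_i agree since the code is 2-transitive
area = sum(exit_i(G, 0, (t + 0.5) / 200) for t in range(200)) / 200
print(round(area, 3), "vs rate K/N =", 4 / 8)
```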
We claim that if the code C is 2-transitive, then the set $\Omega_i$ is 1-transitive.
Proposition 13. If C is 2-transitive then $\Omega_i$ is 1-transitive.
Proof: Since the code is 2-transitive, for any $j \ne i$ there exists a code-preserving permutation $\pi$ such that $\pi(i) = j$. We need to show that this permutation preserves membership in the corresponding $\Omega$s; in other words, we need to show that if $A \in \Omega_i$ then $\pi(A) \in \Omega_{\pi(i)}$. Since $A \in \Omega_i$, there exists $B \subseteq A$ such that $B \cup \{i\} \in \mathcal{C}$, so $\pi(B \cup \{i\}) \in \mathcal{C}$. Observe that $\pi(B \cup \{i\}) = \pi(B) \cup \pi(i) = \pi(B) \cup \{j\}$. Since $\pi(B) \subseteq \pi(A)$, it follows that $\pi(A) \in \Omega_j$. This is a bijection because we can repeat the argument with the indices i and j interchanged.
Proposition 14.
All EXIT functions are equal.
Proof:
Since the code is transitive, any two locations have a permutation between them. The corresponding $\Omega$s have a bijection between them, and the EXIT function is equal to the measure of the corresponding set $\Omega$. The proposition thus follows.
Note that all EXIT functions are equal to the average EXIT function, and thus we are free to invoke the area theorem now. Intuitively, if we have an erasure pattern which is 'bad', we will not be able to decode the patterns which are obtained after adding more erasures at places where there were no erasures before. The following proposition formalizes this.
Proposition 15. $\Omega_i$ is monotone.
Proof: We have to prove that if $A \in \Omega_i$ and $A \subseteq A'$ then $A' \in \Omega_i$. Looking at the definition of $\Omega_i$, there exists $B \subseteq [N] \setminus \{i\}$ such that $B \subseteq A$ and $B \cup \{i\} \in \mathcal{C}$. Then $B \subseteq A \subseteq A'$, and hence it follows that $A' \in \Omega_i$.
Let's quickly look at what we know until now:
1. $h_i(p)$ captures the probability of error of the bit-MAP decoder.
2. $\Omega_i$ encodes $h_i(p)$.
3. All EXIT functions are equal to the average EXIT function h.
4. The area under the h vs. p curve is the rate (using the area theorem).
If we prove that the set $\Omega_i$ has a sharp threshold, then we would have proved that 2-transitive codes achieve capacity, since the threshold would occur at p = 1 - R. We define another set, which will be useful to prove the sharp threshold behaviour: the set of erasure patterns for which location j is pivotal in the indirect recovery of the i-th bit. In other words, flipping the j-th bit flips the erasure pattern between $\Omega_i$ and $\Omega_i^c$.
Definition 20.
The set of erasure patterns for which the j-th bit is pivotal in the indirect decoding of the i-th bit is
$$\partial_j \Omega_i \triangleq \{ A \subseteq [N] \setminus \{i\} \mid A \setminus \{j\} \notin \Omega_i,\; A \cup \{j\} \in \Omega_i \}.$$
Note that $\partial_j \Omega_i$ contains patterns from both $\Omega_i$ and $\Omega_i^c$. Intuitively we expect that once we permute the locations to another set of locations, the bits which were pivotal before stay pivotal in the permuted world, for if this were not to happen, we could have magically decoded the concerned bit using this permutation. We formalize this notion.
Proposition 16.
If a code C is 2-transitive, then for distinct $i, j, k \in [N]$ there exists a bijection between $\partial_j \Omega_i$ and $\partial_k \Omega_i$.
Proof: Consider an erasure pattern A. Since the code is 2-transitive, there exists a code-preserving permutation $\pi$ such that $\pi(i) = i$ and $\pi(j) = k$. We need to prove that if $A \in \partial_j \Omega_i$ then $\pi(A) \in \partial_k \Omega_i$. Suppose $A \cup \{j\} \in \Omega_i$ and $A \setminus \{j\} \notin \Omega_i$. Then $\pi(A \cup \{j\}) \in \Omega_i$ and $\pi(A \setminus \{j\}) \notin \Omega_i$ because of the transitivity of $\Omega_i$, and $\pi(A \cup \{j\}) = \pi(A) \cup \{k\}$, $\pi(A \setminus \{j\}) = \pi(A) \setminus \{k\}$. Thus $\pi(A) \in \partial_k \Omega_i$. If we interchange the indices j and k, a similar argument gives the inverse map; hence there is a one-to-one correspondence between the two sets, and their measures are equal, where
$$\mu_p(\partial_j \Omega_i) = \sum_{A \in \partial_j \Omega_i} \prod_{l \in A} p_l \prod_{l \in A^c \setminus \{i\}} (1 - p_l).$$
Our quest to prove that the set $\Omega_i$ (and in turn the average EXIT function) has a sharp threshold requires us to talk about the influences of the variables and invoke suitable results from boolean function analysis. Let's define a few terms first.
Definition 21.
Influence of a variable
Let $\Omega$ be a monotone set and let $\partial_j \Omega \triangleq \{x \in \{0,1\}^N \mid \mathbb{1}_\Omega(x) \ne \mathbb{1}_\Omega(x^{(j)})\}$, where $x^{(j)}$ is defined by $x^{(j)}_l = x_l$ for $l \ne j$ and $x^{(j)}_j = 1 - x_j$. The influence of bit $j \in [N]$ is defined by $I_j^{(p)}(\Omega) \triangleq \mu_p(\partial_j \Omega)$.
Definition 22.
Total Influence
The total influence of the variables is defined as $I^{(p)}(\Omega) = \sum_{l=1}^N I_l^{(p)}(\Omega)$.
Let's look at a result which talks about the derivative of the measure of monotone sets.
Lemma 1.
Margulis-Russo Lemma
Let $\Omega$ be a monotone set; then
$$\frac{d\mu_p(\Omega)}{dp} = I^{(p)}(\Omega).$$
As we shall see (Theorem 9 below), for our sets this derivative is lower bounded by a quantity which scales with N. But the value of the measure has to climb from 0 to 1, so the derivative cannot be high everywhere; hence the function must have a sharp threshold. First we shall show how this is related to the problem at hand: the derivative of an EXIT function is the measure of the set of pivotal bits.
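The lemma is easy to sanity-check numerically; here is a small sketch (our own Python) comparing a finite-difference derivative of $\mu_p$ with the total influence for the majority set on five bits:

```python
from itertools import product

def weight(x, p):
    # product-measure weight of the pattern x under bias p
    k = sum(x)
    return p ** k * (1 - p) ** (len(x) - k)

def mu(p, omega, n):
    return sum(weight(x, p) for x in product([0, 1], repeat=n) if omega(x))

def total_influence(p, omega, n):
    # I^(p)(Omega) = sum_j mu_p(partial_j Omega)
    infl = 0.0
    for x in product([0, 1], repeat=n):
        for j in range(n):
            x_flip = x[:j] + (1 - x[j],) + x[j + 1:]
            if omega(x) != omega(x_flip):
                infl += weight(x, p)
    return infl

def majority(x):          # a monotone set on {0,1}^5
    return sum(x) >= 3

n, p, eps = 5, 0.4, 1e-6
lhs = (mu(p + eps, majority, n) - mu(p - eps, majority, n)) / (2 * eps)
print(round(lhs, 4), round(total_influence(p, majority, n), 4))  # both agree
```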
Proposition 17.
$$\frac{\partial h_i(p)}{\partial p_j} = \sum_{A \in \partial_j \Omega_i} \prod_{l \in A} p_l \prod_{l \in A^c \setminus \{i\}} (1 - p_l)$$
Proof: We evaluate the partial derivative from the explicit formula for $h_i(p)$ given in Proposition 12:
$$h_i(p) = \mu_p(\Omega_i) = \sum_{A \in \Omega_i} \prod_{l \in A} p_l \prod_{l \in A^c \setminus \{i\}} (1 - p_l).$$
If we differentiate the above quantity with respect to $p_j$ and use the fact that $\Omega_i$ is monotone, we get the above result.
We still need to show how these influences scale with N and tie up the loose ends, and the following theorem, which we state without proof, helps us in doing so.
Theorem 9.
Let $\Omega$ be a monotone set and suppose that, for all $0 \le p \le 1$, the influences of all bits are equal: $I_1^{(p)}(\Omega) = I_2^{(p)}(\Omega) = \cdots = I_N^{(p)}(\Omega)$. The following is true:
1. There exists a universal constant $C \ge 1$, which is independent of p, $\Omega$ and N, such that
$$\frac{d\mu_p(\Omega)}{dp} \ge \frac{\log N}{C}\, \mu_p(\Omega)\,(1 - \mu_p(\Omega)).$$
2. For any $0 < \epsilon \le \frac{1}{2}$,
$$p_{1-\epsilon} - p_\epsilon \le \frac{C \log \frac{1-\epsilon}{\epsilon}}{\log N},$$
where $p_t \triangleq h^{-1}(t) = \inf\{p \in [0, 1] \mid h(p) \ge t\}$ is the inverse function of the average EXIT function (h(p) is a strictly increasing continuous polynomial function, hence the inverse is well defined on [0,1]).
It is important to note that for the above theorem to hold, the influences have to be spread quite uniformly. Let's see why this is intuitively true. Say there is a dictator function on N variables, in which only one bit is influential (the output depends only on this particular bit): then $\mu_p(\Omega) = p$, so the derivative with respect to p is 1, and it does not scale with N as the right-hand side of the first expression is supposed to.
Observe that the second statement implies that the set has a sharp threshold, because $p_{1-\epsilon} - p_\epsilon \to 0$ as $N \to \infty$. We have considered bit-MAP decoding for the above result. This result also implies that capacity is achieved under block-MAP decoding, if we consider the following proposition, which is quite elementary in itself.
Proposition 18.
Relation between the error probabilities of the bit-MAP and block-MAP decoders:
$$P_{\mathrm{block\text{-}MAP}} \le \frac{N\, P_{\mathrm{bit\text{-}MAP}}}{d_{\min}},$$
where $d_{\min}$ is the minimum distance of the code. We can see that if $P_{\mathrm{bit\text{-}MAP}} \to 0$ fast enough relative to $N / d_{\min}$, then $P_{\mathrm{block\text{-}MAP}} \to 0$.
Theorem 10 (Corollary). Reed Muller codes achieve capacity on erasure channels.
Theorem 11.
Polar codes are 2-transitive.
Proof:
The proof follows the same pattern as the one given while proving that RS codes are 2-transitive. As we have seen before, polar codes are $G_N$-coset codes in which the input $u_1^N$ is transformed into the codeword $x_1^N = u_1^N G_N$, which is eventually sent over the channel. We are given four locations in the code, say $a, b, c, d \in [N]$ such that $a \ne b$ and $c \ne d$, and we need to give a permutation $\pi : [N] \to [N]$ such that $\pi(a) = c$ and $\pi(b) = d$ which also preserves membership in the code. As in the RS case, the permutation cannot be arbitrary: it must come from the automorphism group of the code. Such a permutation only relabels the coordinates of a vector, so applying it to a codeword $x_1^N$ yields $(x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(N)})$, which is again a codeword by the automorphism property and the linearity of the code; what is used here is that the automorphism group of the polar code is rich enough to contain such a $\pi$ for every choice of the pairs (a, b) and (c, d). The above theorem, along with the results related to 2-transitive codes in this section, gives us another proof of the fact that polar codes achieve capacity on the binary erasure channel.
Polar codes achieve capacity on erasure channels under MAP decoding.
It should be noted that the rate of polarization in Guruswami's paper on the speed of polarization is faster than what the above proof would give us.
Future Work
The next big question we would like to ask is whether Reed-Muller codes achieve capacity on other symmetric channels. The approach of EXIT functions and monotone thresholds does not work even for the BSC. Another interesting question would be whether we can improve the above results by coming up with some non-trivial decoding scheme instead of the expensive MAP decoding.

Bibliography
[1] Erdal Arikan, "Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels", IEEE International Symposium on Information Theory, 2008.
[2] Venkatesan Guruswami, Patrick Xia, "Polar Codes: Speed of polarization and polynomial gap to capacity", IEEE Foundations of Computer Science (FOCS), 2013.
[3] Shrinivas Kudekar, Marco Mondelli, Eren Sasoglu, Rudiger Urbanke, "Reed-Muller Codes Achieve Capacity on the Binary Erasure Channel under MAP Decoding", arXiv:1505.05831, 2015.
[4] Santhosh Kumar, Henry D. Pfister, "Reed-Muller Codes Achieve Capacity on Erasure Channels", arXiv:1505.05123, 2015.
[5] Venkatesan Guruswami, Atri Rudra, "Error-correction up to the information-theoretic limit", Communications of the ACM, Volume 52, Issue 3, March 2009.
[6] C. E. Shannon, "A mathematical theory of communication", Bell System Technical Journal, 1948.