From sequential decoding to channel polarization and back again
Erdal Arıkan
Department of Electrical and Electronics Engineering, Bilkent University, Ankara, 06800, Turkey
Abstract—This note is a written and extended version of the Shannon Lecture I gave at the 2019 International Symposium on Information Theory. It gives an account of the original ideas that motivated the development of polar coding and discusses some new ideas for exploiting channel polarization more effectively in order to improve the performance of polar codes.
I. INTRODUCTION
We begin with the usual setup for the channel coding problem, as shown in Fig. 1. A message source produces a source word $d = (d_1, \ldots, d_K)$ uniformly at random over all possible source words of length $K$ over a finite set, the source word $d$ is encoded into a codeword $x = (x_1, \ldots, x_N)$, the codeword $x$ is transmitted over a channel, the channel produces an output word $y = (y_1, \ldots, y_N)$, and a decoder processes $y$ to produce an estimate $\hat d = (\hat d_1, \ldots, \hat d_K)$ of the source word $d$. The performance metrics for the system are the probability of frame error $P_e = \Pr(\hat d \neq d)$, the code rate $R = K/N$, and the complexity of implementation of the encoder and decoder.
Fig. 1. Channel coding system: source word $d$ → Encoder → $x$ → Channel → $y$ → Decoder → $\hat d$.
Shannon [1] proved that for a broad class of channels, there exists a channel parameter $C$, called capacity, such that arbitrarily reliable transmission (small $P_e$) is attainable at any given rate $R$ if $R < C$ (and unattainable if
$R > C$). Shannon's theorem settled the question about the trade-off between the rate ($R$) and reliability ($P_e$) in a communication system. However, the random-coding analysis Shannon used to prove the attainability part of his theorem left out complexity issues. Below, we present a track of ideas, as shown in Fig. 2, for constructing practically implementable codes that meet Shannon's capacity bound while providing reliable communication.

For the rest of the note, we restrict attention to binary-input memoryless channels (BMCs). By convention, the channel input alphabet will be $\{0, 1\}$, the channel output alphabet will be arbitrary, and the channel transition probabilities will be denoted by $W(y|x)$. We will also assume that the source alphabet is binary so that $d \in \{0,1\}^K$.

Channel coding problem → Convolutional codes and sequential decoding (complexity) → Pinsker's scheme (cutoff rate bottleneck) → Polar codes (complexity) → Polarization-adjusted convolutional codes and sequential decoding (performance).
Fig. 2. Order of main topics discussed in the note.
Two channel parameters of primary interest will be the symmetric versions of channel capacity and cutoff rate, which are defined respectively as
\[
C(W) = \sum_y \sum_{x \in \{0,1\}} \tfrac{1}{2} W(y|x) \log_2 \frac{W(y|x)}{\tfrac{1}{2}W(y|0) + \tfrac{1}{2}W(y|1)} \tag{1}
\]
and
\[
R_0(W) = 1 - \log_2\Big(1 + \sum_y \sqrt{W(y|0)\,W(y|1)}\Big). \tag{2}
\]
If the BMC under consideration happens to have certain symmetry properties as defined in [4, p. 94], then the symmetric capacity and symmetric cutoff rate coincide with their true versions (which are obtained by an optimization over all possible distributions on the channel input alphabet). For our purposes, the symmetric versions of the capacity and cutoff rate are more relevant than their true versions since throughout this note we will be considering linear codes. Linear codes are constrained to use the channel input symbols 0 and 1 with equal frequency, so they can at best achieve the symmetric capacity and symmetric cutoff rate. For brevity, in the rest of the note, we will omit the qualifier "symmetric" when referring to $C(W)$ and $R_0(W)$; the reader should remember that all such references are actually to the symmetric versions of these parameters as defined by (1) and (2).

A third channel parameter that will be useful in the following is the Bhattacharyya parameter, defined as
\[
Z(W) = \sum_y \sqrt{W(y|0)\,W(y|1)}. \tag{3}
\]
We note the relation $R_0(W) = 1 - \log_2\big[1 + Z(W)\big]$, which will be important in the sequel.

II. CONVOLUTIONAL CODES AND SEQUENTIAL DECODING
Convolutional codes are a class of linear codes introduced by Elias [2] with an encoder mapping of the form $x = dG$, where the generator matrix $G$ has a special structure that corresponds to a convolution operation. An example is a convolutional code whose encoding operation can be implemented using the convolution circuit in Fig. 3.

Fig. 3. Example of a convolutional code.
The codewords of a convolutional code can be represented in the form of a tree. For example, the first four levels of the tree corresponding to the convolutional code of Fig. 3 are shown in Fig. 4. Each source word $d = (d_1, \ldots, d_K)$ defines a path through the code tree (take the upper branch if $d_i$ is 0, the lower branch otherwise). Branches along a path are labeled with the codeword symbols corresponding to that path.

The tree representation of a convolutional code turns the decoding problem into a tree search problem. One of the paths through the tree is the correct path and all other paths are incorrect paths. Exhaustive search of the tree for the correct path corresponds to optimum decoding but is too complex to implement. There is need for low-complexity tree-search heuristics that can be used as decoders. A reasonable choice is a depth-first search heuristic. Sequential decoding is a depth-first search heuristic developed by Wozencraft [3] for decoding arbitrary tree codes.

The computational complexity of sequential decoding (the number of steps it takes to complete decoding) is a random variable whose statistical properties (mean, variance, distribution) depend on the code rate and the channel characteristics. Sequential decoding achieves the capacity $C(W)$ of any given BMC $W$ if no limit is placed on its search complexity. However, the average complexity of sequential decoding becomes prohibitive for practical purposes if the code rate is above the cutoff rate $R_0(W)$. More precisely, at rates $R > R_0(W)$, the average complexity of decoding the first $nR$ source bits correctly is lower-bounded roughly as $2^{n[R - R_0(W)]}$, while at rates $R < R_0(W)$ virtually error-free communication is possible at constant average complexity per decoded bit. Detailed accounts of the sequential decoding algorithm and its complexity may be found in [4, pp. 263-286] and [5, pp. 425-476].

Fig. 4. Tree representation of a convolutional code.

My interest in sequential decoding goes back to 1983, when I was a doctoral student at M.I.T. and my thesis supervisor Bob Gallager asked me to look at sequential decoding for multiaccess channels. This subject became my PhD thesis [6]. Multiaccess communications was an emerging subject, and sequential decoding was a good starting point for assessing the practical viability of coding for multiaccess channels (see [7] for the broader context of this problem). Historically, sequential decoding had briefly been a method of choice (used in space communications on Pioneer 9, 1968) before being superseded by Viterbi decoding in the 1970s. Despite having fallen out of favor, sequential decoding was still an interesting subject with rich connections to information theory and error exponents. In studying sequential decoding, I came across two fascinating papers by Pinsker [8] and Massey [9]. These papers showed how to "boost" the cutoff rate of sequential decoding in a sense described below. An extended discussion of both papers as they relate to my later work on polar coding can be found in [10]. In the following, I will focus mainly on [8] because of its general nature. However, before proceeding to [8], I will review [9] since it contains some of the essential ideas of this note in a very simple setting.

III. MASSEY'S EXAMPLE
Let $M = 2^m$ for some integer $m \ge 1$, and consider an $M$-ary erasure channel (MEC) with input alphabet $\mathcal{X} = \{0, 1, \ldots, M-1\}$, output alphabet $\mathcal{Y} = \mathcal{X} \cup \{?\}$ (where $?$ is an erasure symbol), and transition probabilities $W(y|x)$ such that, when $x \in \mathcal{X}$ is sent, the channel output $y$ has two possible values, $y = x$ and $y = ?$, which it takes with conditional probabilities $W(x|x) = 1 - \epsilon$ and $W(?|x) = \epsilon$. The capacity and cutoff rate of the MEC are readily calculated as $C^{(m)} = m(1-\epsilon)$ and $R_0^{(m)} = m - \log_2\big(1 + (2^m - 1)\epsilon\big)$.

Massey observed that the MEC can be split into $m$ binary erasure channels (BECs) by relabeling its inputs and outputs with vectors of length $m$. A specific labeling that achieves this is as follows. Each input symbol $x \in \mathcal{X}$ is relabeled with its binary representation $(x_1, \ldots, x_m) \in \{0,1\}^m$ so that $x = \sum_{i=1}^m x_i 2^{m-i}$. Each output symbol $y \in \mathcal{Y}$ is relabeled with a vector $(y_1, \ldots, y_m)$ which equals the binary representation of $y$ if $y \in \mathcal{X}$ and equals $(?, \ldots, ?)$ if $y = ?$. With this relabeling, a single transmission event $\{(x_1, \ldots, x_m) \to (y_1, \ldots, y_m)\}$ across the MEC can be thought of as a collection of $m$ transmission events $\{x_i \to y_i\}$ across the coordinate channels. An erasure event in the MEC causes an erasure event in all coordinate channels; if there is no erasure in the MEC, there is no erasure in any of the coordinate channels. Each coordinate channel is a BEC with erasure probability $\epsilon$. The coordinate channels are fully correlated in the sense that when an erasure occurs in one of them, an erasure occurs in all of them.

The capacity and cutoff rate of the BECs are given by $C^{(1)} = 1 - \epsilon$ and $R_0^{(1)} = 1 - \log_2(1 + \epsilon)$. It can be verified readily that $C^{(m)} = mC^{(1)}$ (capacity is conserved), while $R_0^{(m)} \le mR_0^{(1)}$ with strict inequality unless $\epsilon$ equals 0 or 1.
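The conservation and boosting effect above is easy to check numerically from the closed-form expressions for $C^{(m)}$, $R_0^{(m)}$, $C^{(1)}$, and $R_0^{(1)}$; the sketch below does so for example values of $m$ and $\epsilon$ chosen purely for illustration:

```python
from math import log2

# Numerical check of Massey's example: splitting the 2^m-ary erasure channel
# into m fully correlated BECs conserves capacity but boosts the total cutoff
# rate. The formulas are those derived in the text; m and eps are illustrative.

def mec_params(m, eps):
    """Capacity and cutoff rate of the 2^m-ary erasure channel."""
    C = m * (1 - eps)
    R0 = m - log2(1 + (2**m - 1) * eps)
    return C, R0

def bec_params(eps):
    """Capacity and cutoff rate of a single BEC(eps)."""
    return 1 - eps, 1 - log2(1 + eps)

m, eps = 4, 0.1                      # example values, not from the paper
C_m, R0_m = mec_params(m, eps)
C_1, R0_1 = bec_params(eps)

print(f"C(m)  = {C_m:.4f},  m*C(1)  = {m * C_1:.4f}")   # equal: conserved
print(f"R0(m) = {R0_m:.4f}, m*R0(1) = {m * R0_1:.4f}")  # R0(m) < m*R0(1)
```

The sum of the cutoff rates of the $m$ BECs exceeds the cutoff rate of the MEC for any $0 < \epsilon < 1$, which is precisely the "boost" discussed in the text.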
Thus, splitting the MEC does not cause a degradation in channel capacity but "improves" or "boosts" the cutoff rate. This example shows that one may break the cutoff rate barrier for the MEC by employing a separate convolutional encoder–sequential decoder pair on each coordinate BEC. The reader is advised to see [7] for an alternative look at this important example from the perspective of multiaccess channels. To learn about the communications engineering context in which Massey's example arose, we refer to [9].

Massey's example provides a basis for understanding the more complex schemes presented below. These more complex schemes begin with independent copies of a binary-input channel (raw channels), build up a large channel (akin to the MEC) through some channel combining operations, and then split the large channel back into a set of correlated binary-input channels (synthesized channels). One speaks of a "boosting" of the cutoff rate if the sum of the cutoff rates of the synthesized channels is larger than the sum of the cutoff rates of the raw channels.

IV. PINSKER'S SCHEME
Pinsker [8] observed that, for the binary symmetric channel (BSC) with crossover probability $p$ (a BMC with output alphabet $\{0,1\}$ and $W(1|0) = W(0|1) = p$), the ratio of the cutoff rate to capacity approaches 1 as $p$ goes to 0,
\[
\frac{R_0}{C} = \frac{1 - \log_2\big[1 + 2\sqrt{p(1-p)}\big]}{1 + p\log_2(p) + (1-p)\log_2(1-p)} \to 1 \quad \text{as } p \to 0,
\]
as illustrated in Fig. 5.

Fig. 5. Ratio of cutoff rate to capacity for the BSC.

Pinsker combined this observation with Elias' product coding idea [11] to construct a coding scheme that boosted the cutoff rate to capacity. Pinsker's scheme, as shown in Fig. 6, uses an inner block code and $K$ identical outer convolutional codes. Each round of operation of the inner block code comprises the encoder for the inner block code receiving one bit from the output of each outer convolutional encoder (for a total of $K$ bits) and encoding them into an inner code block of length $N$ bits. The inner code block is then sent over a BMC $W$ by $N$ uses of $W$. Since successive bits at the output of each outer convolutional encoder are carried in separate inner code blocks, they suffer i.i.d. error events. So, each outer convolutional code sees a memoryless bit-channel, as depicted in Fig. 7. We denote by $W_i : U_i \to \hat U_i$ the (virtual) BMC that connects the $i$th convolutional encoder to the $i$th sequential decoder.

To show that this scheme is capable of boosting the cutoff rate arbitrarily close to channel capacity, we may fix the rate
$K/N$ of the inner block code as $(1-\delta)C(W)$ for some constant $0 < \delta < 1$ and consider increasing the block length $N$ and choosing a good enough inner block code so as to ensure that the bit-channels $W_1, \ldots, W_K$ become near-perfect, with $R_0(W_i) > 1 - \epsilon$ for each $i$, where $\epsilon > 0$ is a second constant independent of $N$ and $i$. (We use capital letters $U_i$ and $\hat U_i$ to denote the random variables corresponding to $u_i$ and $\hat u_i$; this convention of using capital letters to denote random variables is followed throughout.) This ensures that each outer convolutional code can operate at a rate $1 - \epsilon$ and still be decoded by a sequential decoder at an average complexity bounded by a third constant, where the third constant depends on $\delta$ and $\epsilon$ but not on $N$. The overall rate for this scheme is $K(1-\epsilon)/N = (1-\delta)(1-\epsilon)C(W)$, which can be made
Fig. 6. Pinsker's scheme.

Fig. 7. Bit-channels created by Pinsker's scheme.

arbitrarily close to $C(W)$ by choosing $\delta$ and $\epsilon$ sufficiently small. In Pinsker's words, his scheme shows that "[f]or a very general class of channels operating below capacity it is possible to construct a code in such a way that the number of operations required for decoding is less than some constant that is independent of the error probability".

Pinsker's result complements Shannon's result by showing that, at any fixed rate $R$ below channel capacity $C(W)$, the average complexity per decoded bit can be kept bounded by a constant while achieving any desired frame error rate $P_e > 0$. Unfortunately, the recipe for choosing a good enough inner block code in Pinsker's scheme is to pick the code at random. The non-constructive nature of Pinsker's scheme and the complexity of ML decoding of a randomly chosen block code make Pinsker's scheme impractical. For our purposes, the takeaway from Pinsker's scheme is the demonstration that there is no "cutoff rate barrier to sequential decoding" in a fundamental sense. Our next goal will be to find a way of breaking the cutoff rate barrier in a practically implementable manner.

Before we end this section, it is instructive to compare Pinsker's scheme with Massey's example. In Massey's example, a given channel is split into multiple correlated bit-channels. In Pinsker's scheme, the first step is to synthesize a large channel from a collection of independent bit-channels; the large channel is then split back into a number of dependent bit-channels. Massey's example appears to be a very special case that cannot be generalized to arbitrary BMCs, while Pinsker's scheme is entirely general. Massey's example boosts the cutoff rate almost effortlessly but cannot boost it all the way to channel capacity. Pinsker's scheme is much more complex but can boost the cutoff rate to capacity.
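Pinsker's starting observation, the convergence of $R_0/C$ to 1 for the BSC, can be checked directly from definitions (1)–(3); the short sketch below evaluates the ratio at a few crossover probabilities chosen here for illustration:

```python
from math import log2, sqrt

# Evaluate R0/C for the BSC at a few illustrative crossover probabilities p,
# confirming the limit R0/C -> 1 as p -> 0 (the behavior plotted in Fig. 5).

def bsc_capacity(p):
    """C(W) = 1 - H(p) for the BSC, specializing (1)."""
    if p in (0.0, 1.0):
        return 1.0
    return 1 + p * log2(p) + (1 - p) * log2(1 - p)

def bsc_cutoff(p):
    """R0(W) = 1 - log2(1 + Z) with Z = 2*sqrt(p(1-p)), per (2)-(3)."""
    return 1 - log2(1 + 2 * sqrt(p * (1 - p)))

for p in (0.1, 0.01, 0.001, 1e-6):   # illustrative values
    print(f"p = {p:g}: R0/C = {bsc_cutoff(p) / bsc_capacity(p):.4f}")
```

The ratio climbs monotonically toward 1 as $p$ shrinks, which is why the cutoff rate bottleneck is most painful at the moderate noise levels of practical interest rather than in the near-noiseless regime.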
Both schemes use multiple sequential decoders. The use of multiple sequential decoders is a crucial aspect of both schemes. If a single sequential decoder were used in Pinsker's scheme to decode all $K$ convolutional codes jointly (using a joint tree representation), then a "data-processing" theorem by Gallager [4, pp. 149-150] would limit the achievable cutoff rate to $R_0(W)$. For more on this point, we refer to [10].

V. MULTI-LEVEL CODING
In order to reduce the complexity of Pinsker's scheme, in this section we look at multi-level coding (MLC) with multi-stage decoding (MSD), a scheme due to Imai and Hirakawa [12]. The MLC/MSD system makes better use of the information available at the receiver, and hence it has the potential to boost the cutoff rate at lower complexity. The particular MLC/MSD system we consider here is shown in Fig. 8. The mapper in the figure is a one-to-one transformation. The demapper is a device that calculates sufficient statistics in the form of log-likelihood ratios (LLRs) and feeds them to an MSD unit. Each decoder in the MSD chain is able to benefit from the decisions by the previous decoders in the chain. In effect, the MLC/MSD system creates $N$ bit-channels $W_1, \ldots, W_N$, as shown in Fig. 9, where the $i$th bit-channel is of the form $W_i : U_i \to (Y, \hat U^{i-1})$. More precisely, $W_i$ is the channel whose input $U_i$ is a bit taken from the output of the $i$th convolutional encoder and whose output $(Y, \hat U^{i-1})$ is the input to the $i$th sequential decoder in the MSD chain. Here, $Y = (Y_1, \ldots, Y_N)$ is the entire channel output vector and $\hat U^{i-1} = (\hat U_1, \ldots, \hat U_{i-1})$ is the vector of decisions provided by the decoders that precede decoder $i$ in the MSD chain. If the MLC/MSD system is configured so that the sequential decoders provide virtually error-free decisions, then the bit-channel $W_i$ takes the form $W_i : U_i \to (Y, U^{i-1})$, where the decisions fed forward by the previous stages are always
Fig. 8. Multi-level coding.

Fig. 9. Bit-channels created by MLC/MSD.

correct. For purposes of deriving polar codes, it suffices to consider only this ideal case with no decision errors. Hence, from now on, we suppose that $W_i$ has this ideal form.

An important property of the MLC/MSD scheme is the conservation of capacity,
\[
\sum_{i=1}^N C(W_i) = \sum_{i=1}^N I(U_i; Y U^{i-1}) = I(U^N; Y^N) = N C(W),
\]
where the second equality is obtained by writing $I(U_i; Y U^{i-1}) = I(U_i; Y \mid U^{i-1})$ based on the assumption that $U_i$ and $U^{i-1}$ are independent and then using the chain rule.

The MLC/MSD scheme conserves capacity at any finite construction size $N$, while Pinsker's scheme conserves capacity only in an asymptotic sense. Thus MLC/MSD uses information more efficiently and hence may be expected to achieve a given performance at a lower construction size (leading to a lower complexity).

On the other hand, unlike Pinsker's scheme in which the outer convolutional codes are all identical, the natural rate assignment for the MLC/MSD scheme is to set the rate $R_i$ of the $i$th convolutional code to a value just below $R_0(W_i)$. Using convolutional codes at various different rates $\{R_i\}$ as dictated by $\{R_0(W_i)\}$, and decoding them using a chain of sequential decoders, is a high price to pay for the greater information efficiency of the MLC/MSD scheme. Fortunately, this complexity issue regarding outer convolutional codes and sequential decoders is not as severe as it looks, thanks to a phenomenon called channel polarization.

Theorem 1:
Consider a sequence of MLC/MSD schemes over a BMC $W$, with the $n$th scheme in the sequence having size $N = 2^n$ and a mapper of the form
\[
P_n = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}^{\otimes n}, \tag{4}
\]
where the exponent "$\otimes n$" indicates the $n$th Kronecker power. Fix $0 < \delta < 1$. As $n$ increases, the idealized bit-channels $\{W_i\}_{i=1}^N$ for the $n$th MLC/MSD scheme polarize in the sense that the fraction of channels with $C(W_i) > 1 - \delta$ tends to $C(W)$ and the fraction with $C(W_i) < \delta$ tends to $1 - C(W)$. For each bit-channel $W_i$ that polarizes, its cutoff rate $R_0(W_i)$ polarizes to the same point (0 or 1) as its capacity $C(W_i)$. Furthermore, the mapper and demapper functions can be implemented at complexity $O(N \log N)$ per mapper block $u$. $\diamond$

We refer to [13] for a proof of this theorem.

The most important aspect of Theorem 1 is its statement that polarization can be achieved at complexity $O(\log N)$ per transmitted bit. In the absence of a complexity constraint, polarization alone is not hard to achieve. A randomly chosen mapper is likely to achieve polarization but is also likely to be too complex to implement. The recursive structure of the mappers $\{P_n\}$ used in Theorem 1 makes it possible to obtain polarization at low complexity. We will see below that the polarization effect brought about by the transforms $\{P_n\}$ is strong enough to simplify the rate assignment $\{R_i\}$ while also maintaining reliable transmission of source data bits after the MLC/MSD scheme is simplified. However, we first wish to illustrate the polarization phenomenon of Theorem 1 by an example.

In Fig. 10, we show a plot of $C(W_i)$ vs. $i$ for the bit-channels $\{W_i\}$ created by an MLC/MSD construction of size $N = 128$ using the transform $P_n$ with $n = 7$.
The channel in the example is a binary-input additive white Gaussian noise (BIAWGN) channel, which is a channel that receives a binary symbol $x \in \{0,1\}$ as input, maps it into a real number $s$ by setting $s = 1$ if $x = 0$ and $s = -1$ otherwise, and generates a channel output $y = s + z$, where $z \sim \mathcal{N}(0, \sigma^2)$ is additive Gaussian noise independent of $s$. The signal-to-noise ratio (SNR) for the BIAWGN channel is defined as $1/\sigma^2$. The SNR in Fig. 10 is 3 dB. The capacity $C(W)$ of the BIAWGN channel $W$ at 3 dB SNR is 0.72 bits; hence, by Theorem 1, we expect that roughly a fraction 0.72 of the capacity terms $C(W_i)$ in Fig. 10 will be near 1.
Fig. 10. Channel polarization for BIAWGN channel at 3 dB SNR.
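The polarization effect of Theorem 1 is easiest to reproduce for a binary erasure channel rather than the BIAWGN channel of Fig. 10, because for a BEC the bit-channels synthesized by $P_n$ are again BECs whose erasure probabilities obey an exact recursion ($Z \to 2Z - Z^2$ for the "minus" branch and $Z \to Z^2$ for the "plus" branch); the BIAWGN case requires numerical channel estimation instead. The following sketch, with an illustrative $\epsilon$ matched to $C(W) = 0.72$, shows the two-way clustering of the bit-channel capacities:

```python
# Illustration of Theorem 1 for a BEC, where the bit-channel Bhattacharyya
# parameters follow an exact one-step recursion and C(W_i) = 1 - Z(W_i).
# As n grows, the fraction of near-perfect channels tends to C(W) and the
# fraction of near-useless channels tends to 1 - C(W).

def bec_bit_channel_erasures(n, eps):
    """Erasure probabilities of the N = 2^n bit-channels for a BEC(eps)."""
    z = [eps]
    for _ in range(n):
        z = [w for zi in z for w in (2 * zi - zi * zi, zi * zi)]
    return z

n, eps = 10, 0.28                    # illustrative: C(W) = 1 - eps = 0.72
z = bec_bit_channel_erasures(n, eps)
capacities = [1 - zi for zi in z]    # C(W_i) = 1 - Z(W_i) for a BEC

delta = 0.01
good = sum(c > 1 - delta for c in capacities) / len(capacities)
bad = sum(c < delta for c in capacities) / len(capacities)
print(f"fraction near-perfect: {good:.3f} (tends to C(W) = {1 - eps})")
print(f"fraction near-useless: {bad:.3f} (tends to 1 - C(W) = {eps})")
```

Note that each split conserves capacity exactly, $(1 - Z^-) + (1 - Z^+) = 2(1 - Z)$, so the average of the $C(W_i)$ equals $C(W)$ at every $n$ while the individual values drift to 0 or 1.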
An alternative view of the channel polarization effect in the preceding example is presented in Fig. 11, where cumulative distributions (profiles) of various information parameters are plotted as a function of an index variable $i$ which takes values from 0 to $N = 128$. The polarized capacity profile is defined as the sequence of cumulative sums $\{\sum_{j=1}^i C(W_j)\}$ indexed by $i$. Likewise, the polarized cutoff rate profile is defined as $\{\sum_{j=1}^i R_0(W_j)\}$, the unpolarized capacity profile as $\{iC(W)\}$, and the unpolarized cutoff rate profile as $\{iR_0(W)\}$. By convention, we start each profile at 0 at $i = 0$. The two other curves in the figure (Reed-Muller and polar code rate profiles) will be discussed later.

The unpolarized capacity and cutoff rate profiles in Fig. 11 serve as benchmarks, corresponding to the case where the mapper in the MLC scheme is the identity transform. The polarized capacity and cutoff rate profiles demonstrate the polarization effect due to the transform $P_7$. The polarized and unpolarized capacity profiles coincide at $i = 0$ and $i = N$, but a gap exists between the two for $0 < i < N$ due to channel polarization. Ideally, the polarized capacity profile would stay at zero until $i$ is around $[1 - C(W)]N \approx 35.8$ and then climb with a slope of 1 until $i = N$. A mapper chosen at random is likely to create a near-ideal polarized capacity profile, but the corresponding demapper function is also likely to be too complex. By using $P_7$ as the mapper, we settle for a non-ideal polarized capacity profile in return for lower implementation complexity.

A beneficial by-product of channel polarization is the boosting of the cutoff rate, which is clearly visible in Fig. 11.
Fig. 11. Capacity and cutoff rate profiles over the BIAWGN channel (MLC/MSD construction of size $N = 128$ at 3 dB SNR).

The polarized cutoff rate profile has a final value $\sum_{i=1}^N R_0(W_i) \approx 86$ bits, compared to a final value $NR_0(W) \approx 69$ bits for the unpolarized cutoff rate profile. Theorem 1 ensures that, asymptotically as $N$ becomes large, the normalized sum cutoff rate $\frac{1}{N}\sum_{i=1}^N R_0(W_i)$ approaches $C(W)$. So, the MLC/MSD scheme, equipped with the transforms $\{P_n\}$, reproduces Pinsker's result by boosting the cutoff rate to channel capacity, with the important difference that here the mapper and demapper complexity per transmitted source bit is $O(\log N)$ for a construction of size $N$ (while the corresponding complexity in Pinsker's scheme is exponential in $N$).

Despite the reduced mapper/demapper complexity, the MLC/MSD scheme (with the transforms $\{P_n\}$) is still far from being practical since it calls for using $N$ outer convolutional codes at various code rates. At this point, we take advantage of the polarization effect and constrain the rates $R_i$ to 0 or 1. Such a 0-1 rate assignment in effect eliminates the outer codes. Setting $R_i = 0$ corresponds to fixing the input to the $i$th bit-channel $W_i$. Setting $R_i = 1$ corresponds to sending information in uncoded form over the $i$th bit-channel $W_i$. In either case, the MSD decisions can be made independently from one mapper block (of length $N$) to the next, eliminating the need for a sequential decoder.

The 0-1 rate assignment leads to a new type of stand-alone block code, which we will call a polar code. The simplified MSD function under the 0-1 rate assignment will be called successive cancellation (SC) decoding. An important new question that arises is whether polar codes, obtained by such drastic simplification of the MLC/MSD scheme, can provide reliable transmission of source data. An answer to this question is provided in the next section.

VI. POLAR CODES
In this section we will study polar codes as a stand-alone coding scheme. For simplicity, we will consider polar coding only for BMCs that are symmetric in the sense defined in [13] or [4, p. 94]. We begin by restating the definition of polar codes without any reference to their origin.

A polar code is a linear block code characterized by three parameters: a code block-length $N$, a code dimension $K$, and a data index set $\mathcal{A}$. The code block-length is constrained to be a power of two, $N = 2^n$ for some $n \ge 0$. The code dimension can be any integer in the range $0 \le K \le N$. The data index set $\mathcal{A}$ is a subset of $\{1, \ldots, N\}$ with size $|\mathcal{A}| = K$. (This set corresponds to the set of indices $i$ for which $R_i = 1$ in the MLC/MSD context.) A method of choosing $\mathcal{A}$ will be given below. The encoder for a polar code with parameters $(N, K, \mathcal{A})$ receives a source word $d$ of length $K$ and embeds it in a carrier vector $u$ so that $u_{\mathcal{A}} = d$ and $u_{\mathcal{A}^c} = 0$. (Here, $u_{\mathcal{A}} = (u_i : i \in \mathcal{A})$ is the subvector of $u$ obtained by discarding all coordinates outside $\mathcal{A}$.) Encoding is completed by computing the transform $x = uP_n$, where $P_n$ is as defined in (4). Henceforth, we will refer to $P_n$ as a polar transform.

The standard decoding method for polar codes is SC decoding. For details of SC decoding, we refer to [13]. As shown in [13], for a symmetric BMC $W$, the probability of frame error $P_e$ for a polar code under SC decoding is bounded as
\[
P_e \le \sum_{i \in \mathcal{A}} Z(W_i) \tag{5}
\]
where $Z(W_i)$ is the Bhattacharyya parameter of channel $W_i$. From now on, we will assume that the data index set $\mathcal{A}$ is chosen so as to minimize the bound (5) on $P_e$, i.e., that $\mathcal{A}$ is selected as a set of $K$ indices $i$ such that $Z(W_i)$ is among the $K$ smallest numbers in the list $Z(W_1), \ldots, Z(W_N)$. Since $Z(W_i) = 2^{1 - R_0(W_i)} - 1$, an equivalent rule for constructing a polar code is to select $\mathcal{A}$ as a set of $K$ indices $i$ such that $R_0(W_i)$ is among the $K$ largest cutoff rates in the list $R_0(W_1), \ldots, R_0(W_N)$.

Theorem 2:
A polar code with length $N$, dimension $K$, and rate $R = K/N$ over a symmetric BMC $W$ has the following properties.
• It can be constructed (the data index set $\mathcal{A}$ can be determined) in $O(N\,\mathrm{poly}(\log N))$ steps [14], [15], [16].
• It can be encoded and SC-decoded in $O(N \log N)$ steps [13].
• Its frame error rate $P_e$ under SC decoding is bounded as $O(2^{-N^{\beta}})$ for any fixed $\beta < 1/2$ and any fixed rate $R < C(W)$ [17]. $\diamond$

In summary, polar coding achieves the capacity of symmetric BMCs with low-complexity encoding, decoding, and construction methods. For a precise discussion of the novelty of polar codes as a capacity-achieving code construction, we refer to [18].

The performance of polar codes is far from optimal. Fig. 12 illustrates the frame error rate (FER) $P_e$ under SC decoding of a polar code with block-length $N = 128$ and rate $R = 1/2$ over a BIAWGN channel with the SNR ranging from 0 to 5 dB. This and other FER curves in Fig. 12 have been obtained by computer simulation. Also shown in Fig. 12 is the BIAWGN dispersion approximation [19] at block-length $N = 128$ and rate $R = 1/2$, which is an estimate of the average ML-decoding performance over the BIAWGN channel of a code chosen uniformly at random from the ensemble of all possible binary codes of block-length $N = 128$ and rate $R = 1/2$.
Fig. 12. Performance curves over the BIAWGN channel.
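The encoding rule $x = uP_n$ and the rule of selecting $\mathcal{A}$ by the smallest Bhattacharyya parameters can be sketched concretely. The sketch below uses a BEC, where the $Z(W_i)$ obey an exact recursion (over the BIAWGN channel of Fig. 12 they must be estimated numerically), and implements the transform with the $O(N \log N)$ butterfly recursion rather than an explicit matrix:

```python
# Sketch of polar encoding and code construction for a BEC(eps). The toy
# (N, K) = (8, 4) parameters below are illustrative, not from the paper.

def polar_transform(u):
    """Compute x = u P_n in O(N log N) via the butterfly recursion."""
    x, step = list(u), 1
    while step < len(x):
        for i in range(0, len(x), 2 * step):
            for j in range(i, i + step):
                x[j] ^= x[j + step]   # upper branch absorbs the lower one
        step *= 2
    return x

def construct(n, eps, K):
    """Pick A = indices of the K smallest Z(W_i), minimizing bound (5)."""
    z = [eps]
    for _ in range(n):                # exact BEC recursion for Z(W_i)
        z = [w for zi in z for w in (2 * zi - zi * zi, zi * zi)]
    order = sorted(range(len(z)), key=lambda i: z[i])
    A = sorted(order[:K])
    union_bound = sum(z[i] for i in A)   # right-hand side of (5)
    return A, union_bound

def encode(d, A, n):
    """Embed d in the carrier u (u_A = d, u_{A^c} = 0), then apply P_n."""
    u = [0] * (2 ** n)
    for bit, i in zip(d, A):
        u[i] = bit
    return polar_transform(u)

A, pe_bound = construct(n=3, eps=0.5, K=4)   # toy (8, 4) polar code
x = encode([1, 0, 1, 1], A, n=3)
print("A =", A, " P_e bound =", round(pe_bound, 4), " x =", x)
```

Indices in the code are 0-based, whereas the text indexes bit-channels from 1; the union bound printed here is loose for such a short code but becomes meaningful at practical block-lengths.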
The weak performance of polar codes is due in part to the suboptimality of the SC decoder and in part to the poor minimum distance of polar codes. An effective method of fixing both of these problems has been to use a concatenation scheme in which a high-rate outer code pre-codes the source bits before they go into an inner polar code. A particularly powerful example of such methods is CRC-aided SC list decoding (CA-SCL) [20], whose FER performance is shown in Fig. 12 for the case of $N = 128$, $R = 1/2$, CRC length 8, and list size 32. In the next section, we consider improving the polar code performance still further by shifting the burden of error correction entirely to an outer code.

VII. POLARIZATION-ADJUSTED CONVOLUTIONAL CODES
In this section, we consider a new class of codes that we will refer to as polarization-adjusted convolutional (PAC) codes. The motivating idea for PAC codes is the recognition that 0-1 rate assignments waste the capacities $C(W_i)$ of bit-channels $W_i$ whose inputs are fixed by the rate assignment $R_i = 0$. The capacity loss is especially significant at practical (small to moderate) block-lengths $N$ since polarization takes place relatively slowly. In order to prevent such capacity loss, we need a scheme that avoids fixing the input of any bit-channel. PAC codes achieve this by placing an outer convolutional coding block in front of the polar transform, as shown in Fig. 13.

As with polar codes, the natural block-lengths for PAC codes are powers of two, $N = 2^n$, $n \ge 1$. The code dimension $K$ can be any integer between 1 and $N$. The encoding operation for PAC codes is as follows. A rate-profiling block inserts the source word $d$ into a data carrier word $v$ in accordance with a data index set $\mathcal{A}$ so that $v_{\mathcal{A}} = d$ and $v_{\mathcal{A}^c} = 0$. The PAC codeword $x$ is obtained from $v$ by a one-to-one transformation $x = vTP_n$, where $T$ is a convolution operation and $P_n$ is the polar transform. A low-complexity encoding alternative is to compute first $u = vT$ and then $x = uP_n$.

Fig. 13. PAC coding scheme (rate-profiling → convolution → polar transform → channel; sequential decoder with metric calculator and data extraction at the receiver).

As usual, we characterize the convolution operation by an impulse response $c = (c_0, \ldots, c_m)$, where by convention we assume that $c_0 \neq 0$ and $c_m \neq 0$. The parameter $m + 1$ is called the constraint length of the convolution. The input-output relation for a convolution with a given impulse response $c = (c_0, \ldots, c_m)$ is
\[
u_i = \sum_{j=0}^m c_j v_{i-j},
\]
where it is understood that $v_{i-j} = 0$ for $j \ge i$.
The same convolution operation can be represented in matrix form as $u = vT$, where $T$ is an upper-triangular Toeplitz matrix (blank entries are zero),
\[
T = \begin{bmatrix}
c_0 & c_1 & c_2 & \cdots & c_m & & \\
 & c_0 & c_1 & c_2 & \cdots & c_m & \\
 & & c_0 & c_1 & \ddots & & \ddots \\
 & & & \ddots & \ddots & \ddots & c_m \\
 & & & & \ddots & \ddots & \vdots \\
 & & & & & c_0 & c_1 \\
 & & & & & & c_0
\end{bmatrix}.
\]

To illustrate the above encoding operation, consider a small example with $N = 8$, $K = 4$, $\mathcal{A} = \{4, 6, 7, 8\}$, and $c = (1, 1, 1)$. The rate-profiler maps the source word $d = (d_1, \ldots, d_4)$ into $v = (v_1, \ldots, v_8)$ so that $v = (0, 0, 0, d_1, 0, d_2, d_3, d_4)$. The convolution $u = vT$ generates an output word $u$ with $u_1 = v_1$, $u_2 = v_1 + v_2$, and $u_i = v_{i-2} + v_{i-1} + v_i$ for $i = 3, \ldots, 8$. (This convolution can be implemented as in Fig. 3 by taking the upper part of that circuit.) Encoding is completed by computing the polar transform $x = uP_3$.

Unlike ordinary convolutional codes, the convolution operation here generates an irregular tree code due to the constraint $v_{\mathcal{A}^c} = 0$. Fig. 14 illustrates the irregular tree code generated by the convolution in the above example. The tree in Fig. 14 branches only at time indices in the set $\mathcal{A}$, i.e., only when there is a new source bit $d_i$ going into the convolution operation. When there is a branching in the tree at some stage $i \in \mathcal{A}$, by convention, the upper branch corresponds to $v_i = 0$ and the lower branch to $v_i = 1$. Leaf nodes of the tree in Fig. 14 are in one-to-one correspondence with the convolution input words $v$ satisfying the constraint $v_{\mathcal{A}^c} = 0$. The branches on the path to a leaf node $v$ are labeled with the symbols of the convolution output word $u = vT$.

Fig. 14. Irregular tree code example.
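A minimal sketch of the full PAC encoding chain for the small example above ($N = 8$, $K = 4$, $\mathcal{A} = \{4, 6, 7, 8\}$, $c = (1,1,1)$), using 0-based indices in code where the text is 1-based:

```python
# PAC encoding for the text's small example: rate-profiling, convolution
# u = vT with impulse response c = (1, 1, 1), then the polar transform
# x = u P_3 computed by the butterfly recursion.

def rate_profile(d, A, N):
    """Insert source bits into the data carrier v (v_A = d, v_{A^c} = 0)."""
    v = [0] * N
    for bit, i in zip(d, A):
        v[i] = bit
    return v

def convolve(v, c):
    """u_i = sum_j c_j * v_{i-j} (mod 2): the operation u = vT."""
    return [sum(c[j] * v[i - j] for j in range(len(c)) if i - j >= 0) % 2
            for i in range(len(v))]

def polar_transform(u):
    """x = u P_n via the butterfly recursion."""
    x, step = list(u), 1
    while step < len(x):
        for i in range(0, len(x), 2 * step):
            for j in range(i, i + step):
                x[j] ^= x[j + step]
        step *= 2
    return x

A = [3, 5, 6, 7]                 # the set {4, 6, 7, 8} in 0-based indexing
d = [1, 0, 1, 1]                 # an arbitrary source word
v = rate_profile(d, A, N=8)      # -> [0, 0, 0, 1, 0, 0, 1, 1]
u = convolve(v, c=[1, 1, 1])     # u = vT
x = polar_transform(u)           # x = u P_3
print("v =", v, "\nu =", u, "\nx =", x)
```

Note that with $c = (1)$ the convolution is the identity, $u = v$, and the scheme reduces to ordinary polar encoding.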
To summarize, a PAC code is specified by four parameters (N, K, A, c). In simulation studies we observed that the performance of a PAC code is more sensitive to the choice of A than to c. As long as the constraint length of the convolution is sufficiently large, choosing c at random may be an acceptable design practice. Finding good design rules for A is a research problem.

A heuristic method of choosing A is to use a score function s : {1, ..., N} → R and select A as a set of indices i such that s(i) is among the largest K scores in the list s(1), ..., s(N) (with ties broken arbitrarily). Two examples of score functions (inspired by polar codes) are the capacity score function s(i) = C(W_i) and the cutoff rate score function s(i) = R_0(W_i), where {W_i} are the MLC/MSD bit-channels created by the polar transform P_n. The cutoff rate score function recovers polar codes when T is set to the identity transform (corresponding to c = (1)). A third example of a score function is the Reed-Muller (RM) score function s(i) = w(i−1), where w(i−1) is the number of ones in the binary representation of i−1, 0 ≤ i−1 ≤ N−1. For example, w(12) = 2 since 12 has the binary representation 1100. We refer to this score function as the RM score function since it generates the well-known RM codes [22], [23] when T is the identity transform.

We now turn to decoding of PAC codes. For purposes of discussing the decoding operation, it is preferable to segment the PAC coding system into three functional blocks, as shown by dashed rectangles in Fig. 13.
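The score-based selection of A can be sketched as follows. This is our own illustration (`choose_A` and `rm_score` are our names); ties are broken here in favor of larger indices, one instance of the arbitrary tie-breaking mentioned above:

```python
def choose_A(N, K, score):
    """Return the K indices i in {1, ..., N} with the largest scores s(i)."""
    ranked = sorted(range(1, N + 1), key=lambda i: (score(i), i))
    return set(ranked[-K:])           # ties broken in favor of larger i

def rm_score(i):
    """RM score: number of ones in the binary representation of i - 1."""
    return bin(i - 1).count("1")      # e.g. rm_score(13) = w(12) = 2

# For N = 8, K = 4 the RM rule selects A = {4, 6, 7, 8}.
A = choose_A(8, 4, rm_score)
```

The capacity and cutoff rate rules would feed the same routine with s(i) = C(W_i) or s(i) = R_0(W_i), as computed by a polar-code construction algorithm.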
According to this functional segmentation, a source word d is inserted into a data carrier v, the data carrier v is encoded into a codeword u from an irregular tree code, the codeword u is sent over a polarized channel, a sequential decoder is used to generate an estimate v̂ of v, and finally, an estimate d̂ of the source word d is extracted from v̂ by setting d̂ = v̂_A.

Irregular tree codes can be decoded by tree search heuristics in much the same way as regular tree codes. A particularly suitable tree search heuristic for PAC codes is sequential decoding, specifically, the Fano decoder [21]. The Fano decoder tries to identify the correct path in the code tree by using a metric that tends to drift up along the correct path and drift down as soon as a path diverges from the correct path. The Fano decoder generates metric requests along the path that it is currently exploring, and a metric calculator responds by sending back the requested metric values (denoted by m in Fig. 13). Unlike the usual metric in sequential decoding, the metrics here must have a time-varying bias so as to maintain the desired drift properties in the face of the irregular nature of the tree code. In computing the metric, the metric calculator can use a recursive method, as in SC decoding of polar codes.

Fig. 12 presents the result of a computer simulation with a PAC code with N = 128, R = 1/2, A chosen in accordance with the RM design rule, and c = (1, 0, 1, 1, 0, 1, 1). As seen in the figure, the FER performance of the PAC code in this example comes very close to the dispersion approximation for FER values larger than 10^{-5}.
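To make the tree-search viewpoint concrete, the following self-contained sketch (ours, not the note's Fano decoder) re-implements the earlier N = 8 example and decodes by brute force: it visits every leaf of the irregular code tree, i.e., every v with v_{A^c} = 0, and returns the source word whose codeword is nearest in Hamming distance to a hard-decision channel output. The Fano decoder explores the same tree, but selectively, guided by its metric:

```python
from itertools import product

def encode(d, A=(4, 6, 7, 8), c=(1, 1, 1), N=8):
    """PAC encoding for the small example (all arithmetic over GF(2))."""
    v = [0] * N
    for k, i in enumerate(A):                      # rate profiling: v_A = d
        v[i - 1] = d[k]
    m = len(c) - 1
    u = [sum(c[j] * v[i - j] for j in range(min(i, m) + 1)) % 2
         for i in range(N)]                        # convolution u = vT
    def pt(u):                                     # polar transform x = u P_n
        if len(u) == 1:
            return list(u)
        h = len(u) // 2
        return pt([a ^ b for a, b in zip(u[:h], u[h:])]) + pt(u[h:])
    return pt(u)

def ml_decode(y, K=4):
    """Exhaustive search over all 2^K leaves, minimizing Hamming distance to y."""
    return min(product([0, 1], repeat=K),
               key=lambda d: sum(a != b for a, b in zip(encode(list(d)), y)))

d = [1, 0, 1, 1]
assert list(ml_decode(encode(d))) == d             # noiseless channel: exact recovery
```

This exhaustive search costs 2^K encodings and is feasible only for toy parameters; the point of sequential decoding is to approximate such a search at a small average cost per decoded bit.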
Evidently, the product of the polar transform P_n and the convolution transform T creates an overall transform G = TP_n that looks sufficiently random to achieve a performance near the dispersion approximation. When we repeated this simulation experiment with a PAC code designed by the polar coding score function (keeping everything else the same), we observed that the performance became worse but the sequential decoder ran significantly faster. The RM design was the best design we could find in terms of FER performance.

As a heuristic guide to understanding the computational behavior of sequential decoding of a PAC code, we found it useful to associate a rate profile with each design rule or, equivalently, each data index set A. The rate profile for a data index set A is defined as the sequence of numbers {K_i}_{i=0}^{N}, where K_0 = 0 and K_i is the number of elements in A ∩ {1, 2, ..., i} for i ≥ 1. Thus, K_i is the number of source data bits carried in the first i coordinates of the data carrier word v. The rate profiles associated with the RM and polar code design rules are shown in Fig. 11 for N = 128 and K = 64. We expect that a design rule whose rate profile stays below the polarized cutoff rate profile at a certain SNR will generate a PAC code that has low complexity under sequential decoding at that SNR. In Fig. 11, both the RM and polar rate profiles lie below the polarized cutoff rate profile, but the polar rate profile leaves a greater safety margin, which may explain the experimental observation that the Fano decoder runs faster with the polar code design rule.

VIII. REMARKS AND OPEN PROBLEMS
We conclude the note with some complementary remarks about PAC codes and suggestions for further research.

One may view PAC codes as a concatenation scheme with an outer convolutional code and an inner polar code. However, PAC codes differ from typical concatenated coding schemes in that the inner code in PAC coding has rate one, so it has no error correction capability. It is more appropriate to view the inner polar transform and the metric calculator (mapper and demapper) in PAC coding as a pair of pre- and post-processing devices around a memoryless channel that provide polarized information to an outer decoder so as to increase the performance of the outer coding system.

In view of the data-processing theorem mentioned in connection with Pinsker's scheme, it may seem impossible for PAC codes to operate at low complexity at rates above the cutoff rate R_0(W) using only a single sequential decoder. This is true only in part. PAC codes use a convolutional code whose length spans only one use of the polarized channel. The sequential decoder in PAC coding stops searching for the correct path if a decision error is made after reaching level N in the irregular code tree, i.e., after a single use of the polarized channel. The R_0(W) bound on sequential decoding would hold if a convolutional code were used that extended over multiple uses of the polarized channel. A better understanding of the computational complexity of the sequential decoder in PAC coding is an open problem.

As stated above, the performance and complexity of PAC codes are yet to be studied rigorously. It is clear that in general PAC codes can achieve channel capacity since they contain polar codes as a special case.
The main question is to characterize the best performance attainable by PAC codes over variations of the data index set A and the convolution impulse response c.

The fact that PAC codes perform well under the RM design rule suggests that, unlike polar codes, PAC codes are robust against channel parameter variations and modeling errors. It is of interest to investigate whether PAC codes have universal design rules so that a given PAC code performs well uniformly over the class of all BMCs with a given capacity. In particular, it is of interest to check whether the RM design rule (together with a suitably chosen convolution impulse response c) is universal in this sense.

A disadvantage of the sequential decoding method is its variable complexity. It is of interest to study fixed-complexity search heuristics for decoding PAC codes. One possibility is to use a breadth-first search heuristic, such as a Viterbi decoder. However, a Viterbi decoder that tracks only the state of the convolutional encoder will be suboptimal, since PAC codes incorporate a polarized channel that, too, has a state. In fact, the number of states of the polarized channel is the same as the number of possible words u at the input of the polarized channel, namely 2^{NR} for a PAC code of length N and rate R. There is clearly a need for a suboptimal breadth-first search heuristic that tracks only a subset of all possible states. One option that may be considered here is list Viterbi decoding [24], which is a method that has proven effective for searching large state spaces. For some other alternatives of forward pruning methods in breadth-first search, such as beam search, we refer to [25, pp. 174-175].

In linear algebra, lower-upper decomposition (LUD) is a method for solving systems of linear equations. PAC coding may be regarded as one form of upper-lower decomposition (ULD) of a code generator matrix G for purposes of solving a redundant set of linear equations when the equations are corrupted by noise.
One may investigate whether there are other decompositions in linear algebra for synthesizing generator matrices that yield powerful codes with low-complexity encoding and decoding.

REFERENCES

[1] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379-423, July 1948.
[2] P. Elias, "Coding for noisy channels," in IRE Convention Record, Part 4, pp. 37-46, Mar. 1955.
[3] J. M. Wozencraft, "Sequential Decoding for Reliable Communication," Tech. Report 325, Res. Lab. Elect., M.I.T., Aug. 1957.
[4] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
[5] J. M. Wozencraft and I. M. Jacobs, Principles of Communication Engineering. New York: Wiley, 1965.
[6] E. Arıkan, "Sequential Decoding for Multiple Access Channels," Tech. Rep. LIDS-TH-1517, Lab. Inf. Dec. Syst., M.I.T., 1985.
[7] R. Gallager, "A perspective on multiaccess channels," IEEE Transactions on Information Theory, vol. 31, pp. 124-142, Mar. 1985.
[8] M. S. Pinsker, "On the complexity of decoding," Problemy Peredachi Informatsii, vol. 1, no. 1, pp. 84-86, 1965.
[9] J. Massey, "Capacity, cutoff rate, and coding for a direct-detection optical channel," IEEE Transactions on Communications, vol. 29, pp. 1615-1621, Nov. 1981.
[10] E. Arıkan, "On the origin of polar coding," IEEE Journal on Selected Areas in Communications, vol. 34, pp. 209-223, Feb. 2016.
[11] P. Elias, "Error-free coding," Transactions of the IRE Professional Group on Information Theory, vol. 4, pp. 29-37, Sept. 1954.
[12] H. Imai and S. Hirakawa, "A new multilevel coding method using error-correcting codes," IEEE Transactions on Information Theory, vol. 23, pp. 371-377, May 1977.
[13] E. Arıkan, "Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels," IEEE Transactions on Information Theory, vol. 55, pp. 3051-3073, July 2009.
[14] R. Mori and T. Tanaka, "Performance of polar codes with the construction using density evolution," IEEE Communications Letters, vol. 13, pp. 519-521, July 2009.
[15] R. Pedarsani, S. H. Hassani, I. Tal, and E. Telatar, "On the construction of polar codes," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), pp. 11-15, July 2011.
[16] I. Tal and A. Vardy, "How to construct polar codes," IEEE Transactions on Information Theory, vol. 59, pp. 6562-6582, Oct. 2013.
[17] E. Arıkan and E. Telatar, "On the rate of channel polarization," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), pp. 1493-1495, June 2009.
[18] V. Guruswami and P. Xia, "Polar codes: speed of polarization and polynomial gap to capacity," IEEE Transactions on Information Theory, vol. 61, pp. 3-16, Jan. 2015.
[19] Y. Polyanskiy, H. Poor, and S. Verdú, "Channel coding rate in the finite blocklength regime," IEEE Transactions on Information Theory, vol. 56, pp. 2307-2359, May 2010.
[20] I. Tal and A. Vardy, "List decoding of polar codes," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), pp. 1-5, July 2011.
[21] R. Fano, "A heuristic discussion of probabilistic decoding," IEEE Transactions on Information Theory, vol. 9, pp. 64-74, Apr. 1963.
[22] I. Reed, "A class of multiple-error-correcting codes and the decoding scheme," Transactions of the IRE Professional Group on Information Theory, vol. 4, pp. 38-49, Sept. 1954.
[23] D. E. Muller, "Application of Boolean algebra to switching circuit design and to error detection," Transactions of the I.R.E. Professional Group on Electronic Computers, vol. EC-3, pp. 6-12, Sept. 1954.
[24] N. Seshadri and C. E. W. Sundberg, "List Viterbi decoding algorithms with applications," IEEE Transactions on Communications, vol. 42, pp. 313-323, Feb. 1994.
[25] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Prentice Hall.