Doubly Residual Neural Decoder: Towards Low-Complexity High-Performance Channel Decoding
Siyu Liao, Chunhua Deng, Miao Yin, Bo Yuan
Department of Electrical and Computer Engineering, Rutgers University
[email protected], [email protected], [email protected], [email protected]
Abstract
Recently, deep neural networks have been successfully applied in channel coding to improve decoding performance. However, the state-of-the-art neural channel decoders cannot achieve high decoding performance and low complexity simultaneously. To overcome this challenge, in this paper we propose the doubly residual neural (DRN) decoder. By integrating both residual input and residual learning into the design of the neural channel decoder, DRN enables significant decoding performance improvement while maintaining low complexity. Extensive experimental results show that, on different types of channel codes, our DRN decoder consistently outperforms the state-of-the-art decoders in terms of decoding performance, model size and computational cost.
Starting from Claude Shannon's 1948 seminal paper (Shannon 1948), channel codes, also known as error correction codes, have provided data reliability for communication and storage systems over the last seven decades. Historically, every ten years or so information theorists discovered a new channel code that approaches the ultimate channel capacity more closely than the prior ones, thereby reshaping the way we transmit and store data. For instance, low-density parity check (LDPC) codes (Gallager 1962; MacKay and Neal 1996), re-discovered in 1996, and polar codes (Arikan 2009), invented in 2009, have become the adopted channel code solutions in the 5G standard. Nowadays, channel codes serve as key enablers for the dramatic advances of modern high-quality data transmission and high-density storage systems, including but not limited to the 5G air interface, deep space communication, solid-state disks (SSD), and high-speed Ethernet.
Channel Encoding & Decoding.
In general, the key idea of channel coding is to first encode certain redundancy into the bit-level message that will be transmitted over a noisy channel, and then, at the receiver end, to decode the corrupted message and recover the original one by utilizing the redundancy information. Based on this mechanism, a channel codec consists of one encoder at the transmitter end and one decoder at the receiver end (see Figure 1).

Figure 1: A channel codec uses one encoder and one decoder to recover the information after noisy transmission.

In most cases, the channel decoder is much more expensive than the encoder in terms of both space and computational complexity. This is because the encoding phase only needs simple exclusive-OR operations at the bit level, while the decoding phase needs more advanced and complicated algorithms to correct the errors caused by noisy transmission. To date, the most popular and powerful channel decoding algorithm is iterative belief propagation (BP) (Fossorier, Mihaljevic, and Imai 1999).
Deep Learning for Channel Decoder.
From the perspective of machine learning, the role of a channel decoder can be interpreted as a special multi-label binary classifier or denoiser. Based on this observation, and motivated by the current unprecedented success of deep neural networks (DNNs) in various science and engineering applications, both the information theory and machine learning communities have recently begun to study the potential integration of deep neural networks into channel codecs, especially for high-performance channel decoder design. A simple and natural idea along this direction is to use a classical deep autoencoder as the entire channel codec (O'Shea and Hoydis 2017). Although this domain-knowledge-free strategy can work for very short channel codes (e.g., code length less than 10), it cannot provide satisfactory decoding performance for moderate and long channel codes, which are much more important and popular in practical industrial standards and commercial systems.
The State of the Art: NBP & HGN Decoders.
Recently, several studies (Nachmani, Be'ery, and Burshtein 2016; Cammerer et al. 2017; Gruber et al. 2017; Lugosch and Gross 2017; Nachmani et al. 2018) have shown that, by integrating the existing mathematical structure and characteristics of classical decoding approaches, e.g., iterative BP, domain knowledge-based neural channel decoders can provide promising decoding performance for longer channel codes. Among this recent progress, both of the two state-of-the-art works, namely the Neural BP (NBP) decoder (Nachmani et al. 2018) and the Hyper Graph Neural (HGN) decoder (Nachmani and Wolf 2019), are based on the "deep unfolding" methodology (Hershey, Roux, and Weninger 2014). Specifically, the NBP decoder unfolds the original iterative BP decoder into a neural network format and then trains the scaling factors instead of setting them empirically. Following a similar strategy, the HGN decoder further replaces the original message updating step of the BP algorithm with a graph neural network (GNN) to form a hyper graph neural network (HGNN). As reported in their experiments on different types of channel codes, such proper utilization of domain knowledge directly makes these neural channel decoders outperform the traditional BP decoder.

Limitations of Existing Works.
Despite the current encouraging progress, the state-of-the-art neural channel decoders still face several challenging limitations. Specifically, the NBP decoder and its variants do not provide significant improvement in decoding performance over the traditional method. For some codes (e.g., Polar codes) with moderate or high code rates, the bit error rate (BER) improvement brought by the NBP decoder is very slight. On the other hand, though the HGN decoder indeed provides significant decoding gain over the conventional BP decoder (it currently maintains the best decoding performance among all neural channel decoders), its hyper graph neural network structure makes the entire decoder suffer from a very large model size, thereby causing high storage and computational cost in both the training and inference phases. Considering that channel codes are widely used in latency-restrictive and resource-restrictive scenarios, such as mobile devices and terminals, the expensive deployment cost of the HGN decoder makes it infeasible for practical applications.
Technical Preview & Contributions.
To overcome these limitations and fully unlock the potential of neural networks in high-performance channel decoder design, in this paper we propose a novel doubly residual neural decoder, namely the DRN decoder, to provide strong decoding performance with low storage and computational costs. As revealed by its name, a key feature of the DRN decoder is its built-in residual characteristics in both data processing and network structure, which jointly avoid the structural limitations of the existing neural channel decoders. Overall, we summarize the contributions and benefits of the DRN decoder as follows:

• Inspired by the historical success of ResNet (He et al. 2016), the DRN decoder imposes both residual input and residual learning on the neural channel decoder architecture. Such structure-level reformulation ensures that the DRN decoder can effectively and consistently learn strong error-correcting capability over various types of channel codes with different code lengths and code rates.

• Our experimental results show that the proposed DRN decoder achieves significant decoding performance improvement. Compared with the state-of-the-art NBP decoder, the DRN decoder uses at least 23× fewer parameters and at least 3.2× fewer computational operations. Compared with the HGN decoder, the DRN decoder achieves similar decoding performance while using at least 373× fewer parameters and at least 708× fewer computational operations over different channel codes.

Some recent studies also propose to use neural networks to design new channel codes (Kim et al. 2018; Ebada et al. 2019; Jiang et al. 2019; Burth Kurka and Gündüz 2020; Kim, Oh, and Viswanath 2020). In this paper we focus on designing neural channel decoders for the existing widely used channel codes (such as LDPC, Polar and BCH codes).

Focus on Block Codes.
Channel codes can be roughly categorized into two types: block codes and convolutional codes. This paper focuses on efficient neural channel decoder design for block codes, including LDPC, Polar and BCH codes. This is because block codes are the state-of-the-art channel codes, offering better error-correcting performance and more feasible decoder implementation than convolutional codes. Currently, most advanced communication systems (e.g., 5G) and storage systems (e.g., SSD) adopt block codes in their industrial standards and commercial products.
Channel Codes.
In general, an (n, k) channel code with n-bit code length and k-bit information length can be defined by a binary generator matrix G of size k × n. Meanwhile, it is also associated with a binary parity check matrix H of size (n − k) × n, where GH^T = 0. In the encoding phase, the original k-bit binary information vector m is encoded into an n-bit binary codeword x = mG, where all the arithmetic operations are in the binary domain. After x is transmitted over a noisy channel, the received codeword r is observed at the receiver end, and the goal of channel decoding is to recover x from r. (In practice, the encoder usually adopts a systematic encoding strategy (Lin and Costello 1983), so after the decoding phase m can be directly obtained by fetching the first k bits of the decoded x̂.)
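To make the matrix notation concrete, the following NumPy sketch encodes a message and verifies the parity constraint. The (7, 4) Hamming-style G and H below are a toy example of our own choosing, not one of the codes evaluated later in the paper:

```python
import numpy as np

# Hypothetical (7, 4) systematic code, for illustration only.
# G is k x n, H is (n-k) x n, and G @ H.T = 0 over GF(2).
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

m = np.array([1, 0, 1, 1])        # k-bit information vector
x = m @ G % 2                     # n-bit codeword, arithmetic in GF(2)

assert np.all(G @ H.T % 2 == 0)   # generator and parity check matrices are consistent
assert np.all(H @ x % 2 == 0)     # every valid codeword satisfies the parity checks
print(x)                          # [1 0 1 1 0 1 0]
```

Any valid codeword satisfies Hx^T = 0; the decoder's task is to map a noisy observation back onto this set of codewords.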
Factor Graph and BP Algorithm.

Channel decoding can be performed using various approaches. Among them, belief propagation (BP) is the most advanced decoding algorithm. The key idea of the BP algorithm is to perform iterative belief message passing over the factor graph, a bipartite graph entailed by the parity check matrix H. As illustrated in Figure 2a, the factor graph for an (n, k) channel code contains n variable nodes and (n − k) check nodes, and each edge in the graph corresponds to an entry 1 in matrix H.

Figure 2: (a) Parity check matrix and associated factor graph for channel codes. Iterative BP is performed on the factor graph, with belief messages propagated from variable to check nodes and from check to variable nodes. (b) The factor graph can be unfolded into a Trellis graph. Variable nodes are colored in blue, V-to-C messages in yellow, and C-to-V messages in orange.

At the initial stage of the BP algorithm, all the variable nodes receive the log likelihood ratio (LLR) l_v of the corresponding bit:

l_v = log [ P(x_v = 1 | r_v) / P(x_v = 0 | r_v) ],    (1)

where v ∈ [n] is the index of variable nodes, and x_v and r_v are the corresponding bits of x and r, respectively. Then, the belief messages between variable nodes and check nodes are iteratively calculated and propagated as follows:

u^t_{v→c} = l_v + ∑_{c′ ∈ N(v)\c} u^{t−1}_{c′→v}
u^t_{c→v} = 2 arctanh [ ∏_{v′ ∈ M(c)\v} tanh( u^t_{v′→c} / 2 ) ]
s^t_v = l_v + ∑_{c′ ∈ N(v)} u^t_{c′→v},    (2)

where c ∈ [n − k] is the index of check nodes and t is the iteration number. N(·) and M(·) represent the sets of nodes connected to the current variable node and check node, respectively. u^t_{c→v} denotes the message propagated from the index-c check node to the index-v variable node at the t-th iteration, and u^t_{v→c} denotes the message in the opposite direction. In addition, after the final iteration (L), s^L_v is used for the hard decision of the decoded bit x̂_v: if s^L_v > 0, then x̂_v = 1; otherwise x̂_v = 0.

From the perspective of neural networks, iterative BP decoding over the factor graph can be "unfolded" into a neural network. Specifically, since the unfolded factor graph is essentially a Trellis graph, where each edge in the factor graph becomes a node of the Trellis graph (see Figure 2b), the entire Trellis graph can be interpreted as a special neural network, thereby forming a neural BP (NBP) decoder. A very attractive advantage of this interpretation is that, with proper neural network training, each propagated message's associated scaling parameter, which was constant 1 or empirically set in the conventional BP decoder, can now be trained as a weight of the neural network to achieve better decoding performance. In general, the original message passing described in Eq. (2) becomes the forward propagation on the layers of the NBP decoder (Nachmani et al. 2018) as follows:

u^t_{v→c} = f( w^t_{v,in} l_v + ∑_{c′ ∈ N(v)\c} w^t_{c′→v} u^{t−1}_{c′→v} )
u^t_{c→v} = g( ∏_{v′ ∈ M(c)\v} u^t_{v′→c} )
s^t_v = σ( w^t_{v,out} l_v + ∑_{c′ ∈ N(v)} w^t_{c′→v} u^t_{c′→v} ),    (3)

where f(·), g(·) and σ(·) are the tanh, arctanh and sigmoid functions, respectively. From the perspective of the neural network, the weights w^t_{v,in}, w^t_{c′→v} and w^t_{v,out} can be learned by minimizing the multi-label binary classification loss:

loss = ∑^N_{v=1} −[ x_v log s_v + (1 − x_v) log(1 − s_v) ],    (4)

where s_v = s^L_v is the output of the last layer of the NBP decoder.
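As a concrete illustration of Eqs. (1)-(2), here is a small, unoptimized NumPy sketch of one BP iteration over an arbitrary parity check matrix. It is a didactic implementation of the standard update rules, not code released with the cited decoders; the NBP decoder of Eq. (3) additionally attaches a trainable weight to each message and channel LLR:

```python
import numpy as np

def bp_iteration(H, llr, u_c2v):
    """One BP iteration following Eq. (2) (didactic, unoptimized).

    H     : (n-k, n) binary parity check matrix
    llr   : (n,) channel LLRs l_v from Eq. (1)
    u_c2v : (n-k, n) check-to-variable messages from the previous iteration (zeros at t = 0)
    """
    m, n = H.shape
    u_v2c = np.zeros_like(u_c2v)
    u_c2v_new = np.zeros_like(u_c2v)

    # Variable-to-check: channel LLR plus all incoming check messages except the target edge.
    for v in range(n):
        checks = np.flatnonzero(H[:, v])
        total = llr[v] + u_c2v[checks, v].sum()
        for c in checks:
            u_v2c[c, v] = total - u_c2v[c, v]

    # Check-to-variable: 2*arctanh of the product of tanh(./2) over the other connected edges.
    for c in range(m):
        vs = np.flatnonzero(H[c])
        t = np.tanh(u_v2c[c, vs] / 2.0)
        for i, v in enumerate(vs):
            prod = np.prod(np.delete(t, i))
            u_c2v_new[c, v] = 2.0 * np.arctanh(np.clip(prod, -0.999999, 0.999999))

    # Soft outputs s_v; the hard decision is bit 1 if s_v > 0, following the paper's convention.
    s = llr + u_c2v_new.sum(axis=0)
    return u_v2c, u_c2v_new, s
```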
In (Nachmani and Wolf 2019), a hyper graph neural (HGN) decoder is proposed to further improve the performance of neural channel decoders. Beyond the weight-learning strategy adopted in the NBP decoder, at each iteration the HGN decoder directly learns the belief message calculation and propagation schemes between check nodes and variable nodes. Specifically, the update of u^t_{v→c} is now learned and performed via a graph neural network as follows:

u^t_{v→c} = GNN( l_v, u^{t−1}_{c′→v} ),  ∀ c′ ∈ N(v)\c.    (5)

Because training such a graph neural network is found to be quite challenging due to the large number of possible updating schemes, the HGN decoder further uses another neural network to learn and predict the weights of the graph neural network. Overall, unlike the NBP decoder, the HGN decoder adopts a flexible belief message update scheme because of its "hyper-network" structure. Such flexibility is believed to bring significant decoding performance improvement over the fixed-scheme NBP decoder.

Dilemma between Performance and Cost.
Although the NBP and HGN decoders show performance improvement over the traditional BP decoder, they face several inherent limitations. For the NBP decoder, the decoding performance improvement it provides is not consistently significant. As will be shown in Section 4, on some channel codes (e.g., Polar codes) and with some code parameters (e.g., higher code rates), the decoding performance of the NBP decoder is similar to that of the conventional BP decoder or even worse. On the other hand, the HGN decoder shows consistently much lower BER across different types of codes and parameters. However, its unique hyper graph neural network structure makes it very expensive in both computation and storage. Overall, this dilemma between performance and cost severely hinders the widespread deployment of the NBP and HGN decoders in practical applications.
Rethink-1: Why is the Performance of the NBP Decoder Limited?
As mentioned above, the underlying design methodology used for the NBP decoder, i.e., training the unfolded factor graph as a neural network, works but does not achieve the expected significant decoding performance improvement. We hypothesize this phenomenon is due to three reasons. 1) Depth. Once the factor graph is unfolded into a Trellis graph, the depth of the corresponding neural network is proportional to the number of iterations, which is at least 5 in typical settings. Therefore, the depth of the NBP decoder is at least 10 layers or more. For this type of deep and plain neural network without additional structures such as residual blocks, it is well known that performance suffers due to the vanishing gradient problem. 2) Sparsity. Because the factor graph of channel codes is inherently sparse, the underlying neural network of the NBP decoder is highly sparse as well. Therefore, training an NBP decoder is essentially training a sparse neural network from scratch. Unfortunately, extensive experiments in the literature have shown that the performance of a sparse model trained from scratch is usually inferior to a same-size model obtained by pruning from a dense one (Li et al. 2016; Luo, Wu, and Lin 2017; He, Zhang, and Sun 2017; Yu et al. 2018). This widely observed phenomenon probably also limits the performance of the NBP decoder. 3) Application. Unlike most other applications, channel decoding has extremely strict accuracy requirements, with targeted bit error rates that are extremely low. Therefore, even though learning the weights increases the classification accuracy, if that increase is not very significant, it will not translate into an obvious decoding performance improvement in terms of BER or coding gain (dB).

Rethink-2: Is the Flexible Message Update Scheme in the HGN Decoder a Must?
As introduced in Section 2.3, the HGN decoder uses a high-complexity hyper graph neural network to directly learn the message update schemes instead of only the weights. In other words, both how the messages are calculated and how they are propagated are now learnable and flexible. Although such flexibility is widely believed to be the key enabler for the promising performance of the HGN decoder, we question its necessity for high-performance neural channel decoder design. Recalling the structure of the state-of-the-art convolutional neural networks (CNNs), such as ResNet (He et al. 2016) and DenseNet (Huang et al. 2017), we can see that the propagation path of information during both inference and training is not flexible but always fixed. Although there is a set of works studying "adaptive inference" (Bolukbasi et al. 2017; Wang et al. 2018; Hu et al. 2019), the main benefit of introducing such flexibility is to accelerate inference rather than to improve accuracy; in fact, those adaptive inference works typically have to trade accuracy for faster inference.
Rethink-3: How to Break the Performance-Cost Dilemma?
Based on the above analysis and observations, we believe that designing a high-performance, low-complexity neural channel decoder is not only possible, but that the avenue is already available: a new network architecture is the key. The history of developing advanced CNNs, such as ResNet and DenseNet, has already demonstrated how important a new, rather than flexible, network architecture is to the accuracy of CNN models. Inspired by these historical successes, we propose to perform an architecture-level reformulation of the NBP decoder. This design strategy is attractive for breaking the performance-cost dilemma of neural channel decoders because 1) the NBP decoder itself has lower complexity than the HGN decoder, and 2) if properly performed, architecture reformulation will bring high decoding performance.
Residual Structure: From CNN to Channel Decoder.
To achieve this, we propose to integrate the residual structure, which is a key enabler of the success of ResNet in CNNs, into the design of a high-performance neural channel decoder. As analyzed and verified by numerous prior studies, the residual structure performs residual learning: it learns the residual mapping F(x) := H(x) − x instead of directly learning the underlying mapping H(x). This strategy effectively circumvents the vanishing gradient problem and makes training high-performance deep networks possible. As analyzed in our rethinking of the limitations of the NBP decoder, this benefit provided by the residual structure is particularly attractive for high-performance neural channel decoder design.
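The idea can be summarized in a few lines; the sketch below shows the generic ResNet-style shortcut, not yet the decoder-specific structure introduced next:

```python
import numpy as np

def residual_block(x, residual_branch):
    """Residual learning: the trainable branch fits F(x) = H(x) - x; the block outputs F(x) + x."""
    return x + residual_branch(x)   # identity shortcut gives gradients a direct path around the branch

# Toy illustration with a fixed (non-trainable) linear branch.
print(residual_block(np.array([1.0, 2.0]), lambda x: 0.1 * x))   # [1.1 2.2]
```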
Doubly Residual Structure.

Next we describe the proposed architecture reformulation of the neural channel decoder. As shown in Figure 3, the entire decoder consists of multiple blocks, where each block stacks two adjacent layers of the Trellis graph. Similar to the construction of the bottleneck block in ResNet, our architecture reformulation is performed on this two-layer-stacked component block of the decoder.
Mapping Challenge.
Imposing the residual structure on the block faces a structure-level challenge: each component block maps three inputs to three outputs. From the perspective of a DNN, such a multiple-input-to-multiple-output mapping is very difficult for the neural network model to learn properly and accurately, which would significantly limit its learning capability.

Figure 3: Three-step reformulation (Step 1: merge representations; Step 2: residual input; Step 3: residual learning) to form the DRN decoder.

Figure 4: BER of the BP decoder using s_v and l_v for hard decision after 1 iteration on different LDPC codes.

Step-1: Merge Representations.
To overcome this challenge, we propose to simplify the input-to-output mapping in each block (see Figure 3). Our first step is to merge s_v and l_v: we use s_v to replace l_v in the corresponding computation. This substitution is based on the observation that, as the soft output of each iteration, s_v should always be more reliable for the hard decision of each bit than l_v, since l_v is only the constant extrinsic LLR obtained from the noisy channel. For instance, as shown in Figure 4, when we simply use l_v and s_v after one BP iteration for the hard decision on different LDPC codes, the BER performance using s_v is much better than that using l_v. Based on this observation, in our proposed design we merge s_v and l_v at each block and only use s_v for the involved computation.

Step-2: Residual Input.
After merging s_v with l_v, there still exists a 2-input-to-2-output mapping in the block. Hence we further propose to use only the residual value between s_v and u_{c→v} as the input and output:

a_{cv} = s_v − u_{c→v}.    (6)

As shown in Figure 3, using the residual input ensures that the component block only needs to learn a one-to-one mapping, thereby reducing the learning difficulty.

Step-3: Residual Learning.
Based on the one-to-one mapping resulting from the previous two steps, we can now integrate shortcut-based residual learning into the decoder architecture. In general, the reformulated block learns the following mapping function:

b_{cv} = a_{cv} + h(w, a_{cv′}) = a_{cv} + g ∘ f(w, a_{cv′}),    (7)

where h(·) is the activation function formed as the composition of g(·) and f(·). Figure 3 shows the overall procedure of this 3-step architecture reformulation. Since this new structure contains both residual input and residual learning, we name the entire decoder the doubly residual neural (DRN) decoder.
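A minimal sketch of how one DRN component block could operate on the array of check-to-variable messages, under our reading of Eqs. (6)-(7), is given below. The activation h is passed in as a callable and the internal two-layer Trellis structure of a block is abstracted away, so this illustrates the data flow rather than the authors' implementation:

```python
import numpy as np

def drn_block(H, s, u_c2v, h):
    """One DRN component block as we read Eqs. (6)-(7) (illustrative sketch).

    H     : (n-k, n) parity check matrix
    s     : (n,) soft outputs from the previous block
    u_c2v : (n-k, n) check-to-variable messages from the previous block
    h     : activation applied to the residual inputs of the other edges of a check node
            (for instance the weighted min-sum form given in Eq. (9) below)
    """
    mask = H.astype(bool)
    a = np.where(mask, s[None, :] - u_c2v, 0.0)     # Eq. (6): residual input a_cv = s_v - u_{c->v}

    b = np.zeros_like(a)
    for c in range(H.shape[0]):
        vs = np.flatnonzero(H[c])
        for i, v in enumerate(vs):
            others = np.delete(vs, i)
            b[c, v] = a[c, v] + h(a[c, others])     # Eq. (7): shortcut adds a_cv back (residual learning)
    return b
```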
Further Complexity Reduction.

Besides the architecture reformulation, we also adopt two approaches to further reduce the complexity of the DRN decoder. First, during the training phase we keep the weights within the same block identical. Our experimental results show that such a weight sharing strategy significantly degrades the decoding performance of the NBP decoder, but it does not affect the DRN decoder at all. Second, considering the high complexity of the tanh and arctanh functions in Eq. (2), we adopt the widely used min-sum approximation (Hu et al. 2001) to simplify the computation:

y = 2 arctanh[ tanh(p/2) tanh(q/2) ] ≈ sign(p) · sign(q) · min(|p|, |q|),    (8)

where |·| returns the absolute value. Based on this approximation, h(·) can be computed as follows:

h(w, a_{cv}) = w · min_{v′ ∈ M(c)\v} |a_{cv′}| · ∏_{v′ ∈ M(c)\v} sign(a_{cv′}).    (9)
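For illustration, a short sketch of the min-sum activation and of how Eq. (8) tracks the exact two-input computation; the weight w and the sample values p, q are arbitrary choices of ours:

```python
import numpy as np

def minsum_h(a_others, w=1.0):
    """Weighted min-sum activation of Eq. (9), applied to the residual inputs of the other edges."""
    return w * np.min(np.abs(a_others)) * np.prod(np.sign(a_others))

# Eq. (8): the min-sum rule approximates the exact tanh/arctanh computation.
p, q = 1.7, -0.6
exact  = 2 * np.arctanh(np.tanh(p / 2) * np.tanh(q / 2))   # about -0.41
approx = np.sign(p) * np.sign(q) * min(abs(p), abs(q))     # -0.6; same sign, never smaller in magnitude
print(exact, approx)
```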
Code            | Conventional BP (SNR 4/5/6) | NBP (4/5/6)          | HGN (NeurIPS'19) (4/5/6) | DRN (Ours) (4/5/6)
Polar (64, 32)  | 4.45 / 5.41 / 6.46          | 4.48 / 5.35 / 6.50   | 4.25 / 5.49 / 7.02       | –
Polar (64, 48)  | 4.64 / 5.90 / 7.31          | 4.52 / 5.73 / 7.49   | 4.91 / 6.48 / 8.41       | –
Polar (128, 64) | 3.74 / 4.43 / 5.64          | 3.67 / 4.63 / 5.85   | 3.89 / 5.18 / 6.94       | –
Polar (128, 86) | 3.94 / 4.87 / 6.24          | 3.96 / 4.88 / 6.20   | 4.57 / 6.18 / 8.27       | –
Polar (128, 96) | 4.13 / 5.21 / 6.43          | 4.25 / 5.09 / 6.75   | 4.73 / 6.39 / 8.57       | –
LDPC (49, 24)   | 5.36 / 7.26 / 10.03         | 5.29 / 7.67 / 10.27  | 5.76 / – / –             | –
LDPC (121, 60)  | 4.76 / 7.20 / 11.07         | 4.96 / 8.00 / 12.35  | 5.22 / 8.29 / 13.00      | –
LDPC (121, 70)  | 5.85 / 8.93 / 13.75         | 6.43 / 9.53 / 13.83  | 6.39 / 9.81 / 14.04      | –
LDPC (121, 80)  | 6.54 / 9.64 / 14.78         | 7.04 / 10.56 / 14.97 | 6.95 / 10.68 / 15.80     | –
BCH (31, 16)    | 4.44 / 5.78 / 7.31          | 4.84 / 6.34 / 8.20   | –                        | –
BCH (63, 45)    | 3.84 / 4.92 / 6.35          | 4.37 / 5.61 / 7.20   | 4.48 / – / –             | –

Table 1: Negative logarithm of BER performance of different neural channel decoders. A higher value means better performance.
Code            | NBP     | HGN     | DRN (Ours)
Polar (64, 32)  | 41.1KB  | 596.3KB | 1.6KB
Polar (64, 48)  | 32.7KB  | 428.0KB | 860B
Polar (128, 64) | 88.6KB  | 1.4MB   | 3.8KB
Polar (128, 86) | 111.5KB | 1.4MB   | 3.3KB
Polar (128, 96) | 75.0KB  | 1.0MB   | 2.2KB
LDPC (49, 24)   | 43.1KB  | 447.6KB | 560B
LDPC (121, 60)  | 246.8KB | 1.6MB   | 1.3KB
LDPC (121, 70)  | 193.6KB | 1.4MB   | 1.1KB
LDPC (121, 80)  | 145.2KB | 1.1MB   | 880B
BCH (31, 16)    | 30.9KB  | 281.2KB | 300B
BCH (63, 36)    | 269.4KB | 1.1MB   | 540B
BCH (63, 45)    | 277.4KB | 981.0KB | 360B
BCH (63, 51)    | 229.4KB | 761.3KB | 240B

Table 2: Model sizes of different neural channel decoders.
In this section, we compare the DRN decoder with the traditional BP decoder and the state-of-the-art NBP and HGN decoders in terms of decoding performance (BER), model size and computational cost.
Channel Code Types.
All the decoders are evaluated on three types of popular (n, k) channel codes: LDPC, Polar and BCH codes with different code lengths and code rates. The parity check matrices are adopted from (Helmling et al. 2019).

Iteration Number and Channel Condition.
For fairness, the number of iterations for all the decoders is set to 5. The additive white Gaussian noise (AWGN) channel, the most widely used channel type in channel coding research, is adopted as the transmission channel. The signal-to-noise ratio (SNR) is swept over a range of dB values.

Experiment Environment.
Our experiment environment is Ubuntu 16.04 with 256GB of random access memory (RAM), an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz, and an Nvidia V100 GPU.

Code            | BP     | NBP    | HGN    | DRN
Polar (64, 32)  | 43.6K  | 52.5K  | 80.8M  | 16.4K
Polar (64, 48)  | 45.0K  | 52.2K  | 30.4M  | 15.1K
Polar (128, 64) | 93.1K  | 112.1K | 1.1G   | 36.6K
Polar (128, 86) | 141.9K | 166.7K | 935.0M | 48.1K
Polar (128, 96) | 90.2K  | 106.6K | 431.7M | 32.2K
LDPC (49, 24)   | 54.1K  | 63.9K  | 34.1M  | 17.6K
LDPC (121, 60)  | 316.4K | 374.5K | 1.6G   | 94.4K
LDPC (121, 70)  | 263.8K | 309.2K | 920.4M | 78.7K
LDPC (121, 80)  | 211.1K | 245.0K | 476.1M | 62.9K
BCH (31, 16)    | 38.0K  | 45.1K  | 8.5M   | 12.0K
BCH (63, 36)    | 347.8K | 412.7K | 481.6M | 97.2K
BCH (63, 45)    | 412.9K | 480.1K | 340.3M | 112.3K
BCH (63, 51)    | 375.0K | 430.6K | 162.8M | 100.8K

Table 3: FLOPs of different neural channel decoders to decode one codeword.
Training & Testing.
Each input batch mixes an equal number of samples from different SNR settings. The training batch size is 384, so 64 samples are generated at each SNR value. We use the RMSprop optimizer (Hinton, Srivastava, and Swersky 2012) with a learning rate of 0.001 and run 20,000 iterations. The training samples are generated on the fly, and testing samples are generated until at least 100 error samples are detected at each SNR setting.
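Assembled into a training loop, the recipe above could look roughly like the following PyTorch-style sketch. The decoder module, the data generator, and the SNR grid are placeholders of our own (the exact SNR range is not specified here); only the stated hyperparameters (RMSprop, learning rate 0.001, batch 384 with 64 samples per SNR, 20,000 iterations) are taken from the text:

```python
import torch

# Assumed SNR grid; 64 samples per SNR gives the stated batch size of 384.
snrs_db = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
per_snr = 64

def simulate_awgn(snr_db, num, n=63):
    """Placeholder generator: random bits stand in for encoded codewords, BPSK over AWGN, channel LLRs."""
    bits = torch.randint(0, 2, (num, n)).float()
    sigma2 = 10 ** (-snr_db / 10)                     # noise variance for unit-energy symbols
    r = (1 - 2 * bits) + sigma2 ** 0.5 * torch.randn(num, n)
    return -2 * r / sigma2, bits                      # LLR convention of Eq. (1)

decoder = torch.nn.Linear(63, 63)                     # placeholder standing in for the DRN decoder
optimizer = torch.optim.RMSprop(decoder.parameters(), lr=1e-3)
bce = torch.nn.BCEWithLogitsLoss()                    # multi-label binary cross-entropy, as in Eq. (4)

for step in range(20000):                             # 20,000 iterations, data generated on the fly
    batches = [simulate_awgn(s, per_snr) for s in snrs_db]
    llr = torch.cat([b[0] for b in batches])
    bits = torch.cat([b[1] for b in batches])
    loss = bce(decoder(llr), bits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```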
Since the BER can span several orders of magnitude, for simplicity we adopt the negative logarithm representation used in the HGN paper. Table 1 lists the negative logarithm of the BER performance of the different decoding methods; a higher number means better performance because it corresponds to a lower BER. From this table it can be seen that, with the built-in doubly residual structure, our DRN decoder obtains very strong error-correcting capability. It consistently achieves the best BER performance on most of the Polar and LDPC codes. For BCH codes, the DRN decoder achieves almost the same or better performance than the HGN decoder.

Figure 5: BER-vs-SNR curves of different decoders (BP, NBP, HGN, DRN) on different channel codes: (a) BCH codes with n = 63; (b) LDPC codes with n = 121; (c) Polar codes with n = 128.

Figure 5 shows the BER-vs-SNR curves for the different decoders on different channel codes. Notice that the HGN decoder only reports the BER over a limited SNR range starting at 4 dB.

Table 2 compares the model sizes of the different decoders. Based on its inherent lightweight structure and weight sharing strategy, our DRN decoder requires the smallest model size among all decoders over all the different channel codes. Compared with the NBP decoder, the DRN decoder brings at least a 23× reduction in model size. Notice that, as mentioned in Section 3.2, the weight sharing strategy cannot be applied to NBP due to the resulting severe decoding performance loss. Also, compared with the large hyper graph neural network-based HGN decoder, the DRN decoder enables at least a 373× reduction in model size while achieving similar or better decoding performance, as shown in Table 1 and Figure 5.

Table 3 compares the computational cost, in terms of floating point operations (FLOPs) for decoding one codeword, among the different decoders. It can be seen that the DRN decoder also enjoys the lowest computational cost because of its small model. Compared with the NBP decoder, the DRN decoder has at least 3.2× lower computational cost. Compared with the HGN decoder, the DRN decoder needs at least 708× fewer operations while achieving the same or better decoding performance.

Density (%) of the parity check matrix H for each evaluated code:
BCH (31, 16): 25.81   BCH (63, 36): 28.57   BCH (63, 45): 38.10   BCH (63, 51): 44.44
LDPC (49, 24): 14.29  LDPC (121, 60): 9.09  LDPC (121, 70): 9.09  LDPC (121, 80): 9.09
Polar (64, 32): 3.01  Polar (64, 48): 4.80  Polar (128, 64): 1.35  Polar (128, 86): 1.50  Polar (128, 96): 2.04
Figure 6: Density of H matrix on different codes.
From the simulation results, it can be seen that DRN achieves better BERs than HGN on 6 of the BCH code cases and very close BERs on the other 6 cases. Though this performance is already very promising, the improvement over HGN is not as large as that on the LDPC and Polar codes. We hypothesize that this phenomenon is related to the density of the H matrix: for BP-family decoders, like our DRN, the density of H strongly affects the BER performance. Figure 6 shows that the H matrices of the evaluated BCH codes have much higher density than those of the LDPC and Polar codes, which may explain why DRN performs better on LDPC and Polar codes than on BCH codes.
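For reference, the density plotted in Figure 6 is simply the fraction of nonzero entries in H; the toy matrix below is our own construction and merely reproduces the 14.29% value of the LDPC (49, 24) code:

```python
import numpy as np

def h_density_percent(H):
    """Percentage of ones in the parity check matrix, the quantity plotted in Figure 6."""
    return 100.0 * np.count_nonzero(H) / H.size

# Toy regular matrix of the same (n-k) x n shape as the LDPC (49, 24) code, with row weight 7.
H = np.zeros((25, 49), dtype=int)
for i in range(25):
    H[i, (7 * i + np.arange(7)) % 49] = 1
print(round(h_density_percent(H), 2))   # 14.29, the same density reported for LDPC (49, 24) in Figure 6
```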
This paper proposes the doubly residual neural (DRN) decoder, a low-complexity, high-performance neural channel decoder. Built upon its inherent residual input and residual learning structure, the DRN decoder achieves strong decoding performance with low storage and computational costs. Our evaluation on different channel codes shows that the proposed DRN decoder consistently outperforms the state-of-the-art neural channel decoders in terms of decoding performance, model size and computational cost.
Acknowledgement
This work is partially supported by National Science Foundation (NSF) award CCF-1854737.
References
Arikan, E. 2009. Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels. IEEE Transactions on Information Theory.
Bolukbasi, T.; Wang, J.; Dekel, O.; and Saligrama, V. 2017. Adaptive neural networks for efficient inference. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 527–536.
Burth Kurka, D.; and Gündüz, D. 2020. Joint source-channel coding of images with (not very) deep learning. In International Zurich Seminar on Information and Communication (IZS 2020). Proceedings, 90–94. ETH Zurich.
Cammerer, S.; Gruber, T.; Hoydis, J.; and Ten Brink, S. 2017. Scaling deep learning-based decoding of polar codes via partitioning. In GLOBECOM 2017-2017 IEEE Global Communications Conference, 1–6. IEEE.
Ebada, M.; Cammerer, S.; Elkelesh, A.; and ten Brink, S. 2019. Deep learning-based polar code design. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 177–183. IEEE.
Fossorier, M. P.; Mihaljevic, M.; and Imai, H. 1999. Reduced complexity iterative decoding of low-density parity check codes based on belief propagation. IEEE Transactions on Communications.
Gallager, R. 1962. Low-density parity-check codes. IRE Transactions on Information Theory.
Gruber, T.; Cammerer, S.; Hoydis, J.; and ten Brink, S. 2017. On deep learning-based channel decoding. In 2017 51st Annual Conference on Information Sciences and Systems (CISS), 1–6. IEEE.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
He, Y.; Zhang, X.; and Sun, J. 2017. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision.
Hershey, J. R.; Roux, J. L.; and Weninger, F. 2014. Deep unfolding: Model-based inspiration of novel deep architectures. arXiv preprint arXiv:1409.2574.
Hinton, G.; Srivastava, N.; and Swersky, K. 2012. Neural networks for machine learning, Lecture 6a: Overview of mini-batch gradient descent.
Hu, C.; Bao, W.; Wang, D.; and Liu, F. 2019. Dynamic adaptive DNN surgery for inference acceleration on the edge. In IEEE INFOCOM 2019-IEEE Conference on Computer Communications, 1423–1431. IEEE.
Hu, X.-Y.; Eleftheriou, E.; Arnold, D.-M.; and Dholakia, A. 2001. Efficient implementations of the sum-product algorithm for decoding LDPC codes. In GLOBECOM'01. IEEE Global Telecommunications Conference (Cat. No. 01CH37270), volume 2, 1036–1036E. IEEE.
Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708.
Jiang, Y.; Kim, H.; Asnani, H.; Kannan, S.; Oh, S.; and Viswanath, P. 2019. Turbo autoencoder: Deep learning based channel codes for point-to-point communication channels. In Advances in Neural Information Processing Systems, 2758–2768.
Kim, H.; Jiang, Y.; Kannan, S.; Oh, S.; and Viswanath, P. 2018. Deepcode: Feedback codes via deep learning. In Advances in Neural Information Processing Systems, 9436–9446.
Kim, H.; Oh, S.; and Viswanath, P. 2020. Physical layer communication via deep learning. IEEE Journal on Selected Areas in Information Theory.
Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; and Graf, H. P. 2016. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
Lin, S.; and Costello, D. J. 1983. Error Control Coding: Fundamentals and Applications. Prentice Hall.
Lugosch, L.; and Gross, W. J. 2017. Neural offset min-sum decoding. In 2017 IEEE International Symposium on Information Theory (ISIT), 1361–1365. IEEE.
Luo, J.-H.; Wu, J.; and Lin, W. 2017. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, 5058–5066.
MacKay, D. J.; and Neal, R. M. 1996. Near Shannon limit performance of low density parity check codes. Electronics Letters.
Nachmani, E.; Be'ery, Y.; and Burshtein, D. 2016. Learning to decode linear codes using deep learning. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 341–346. IEEE.
Nachmani, E.; Marciano, E.; Lugosch, L.; Gross, W. J.; Burshtein, D.; and Be'ery, Y. 2018. Deep learning methods for improved decoding of linear codes. IEEE Journal of Selected Topics in Signal Processing.
Nachmani, E.; and Wolf, L. 2019. Hyper-graph-network decoders for block codes. In Advances in Neural Information Processing Systems, 2329–2339.
O'Shea, T.; and Hoydis, J. 2017. An introduction to deep learning for the physical layer. IEEE Transactions on Cognitive Communications and Networking.
Shannon, C. E. 1948. A mathematical theory of communication. The Bell System Technical Journal.
Wang, X.; Yu, F.; Dou, Z.-Y.; Darrell, T.; and Gonzalez, J. E. 2018. SkipNet: Learning dynamic routing in convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), 409–424.
Yu, R.; Li, A.; Chen, C.-F.; Lai, J.-H.; Morariu, V. I.; Han, X.; Gao, M.; Lin, C.-Y.; and Davis, L. S. 2018. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.