Tb/s Polar Successive Cancellation Decoder 16nm ASIC Implementation

Abstract

This work presents an efficient ASIC implementation of a successive cancellation (SC) decoder for polar codes. SC is a low-complexity depth-first search decoding algorithm, favorable for beyond-5G applications that require extremely high throughput and low power. The ASIC implementation of SC in this work exploits many techniques, including pipelining and unrolling, to achieve Tb/s data throughput without compromising power and area metrics. To reduce the complexity of the implementation, an adaptive log-likelihood ratio (LLR) quantization scheme is used. This scheme optimizes the bit precision of the internal LLRs within the range of 1-5 bits by considering the irregular polarization and the entropy of the LLR distribution in the SC decoder. The performance cost of this scheme is less than 0.2 dB when the code block length is 1024 bits and the payload is 854 bits. Furthermore, some computations in SC take large space with a high degree of parallelization, while others take longer time steps. To optimize these computations and reduce both memory and latency, a register reduction/balancing (R-RB) method is used. The final decoder architecture is called optimized polar SC (OPSC). The post-placement-routing results at 16nm FinFET ASIC technology show that the OPSC decoder achieves 1.2 Tb/s coded throughput on 0.79 mm² area with 0.95 pJ/bit energy efficiency.


Altuğ Süral, E. Göksu Sezer, Ertuğrul Kolağasıoğlu, Veerle Derudder and Kaoutar Bertrand

A. Süral, E. G. Sezer and E. Kolağasıoğlu are with POLARAN, Ankara, Turkey (e-mails: {altug.sural, goksu.sezer, ekolagasioglu}@polaran.com). V. Derudder and K. Bertrand are with IMEC, Leuven, Belgium (e-mails: {veerle.derudder, kaoutar.bertrand}@imec.be). Manuscript received September 15, 2020.

Index Terms: Polar codes, successive cancellation decoding, high-throughput, ASIC implementation, energy efficiency.

I. INTRODUCTION

The Ethernet Alliance foresees a strong demand for Tb/s data throughput for data centers and mobile networks [1]. In the wireless domain, the Horizon 2020 project Enabling Practical Wireless Tb/s Communications with Next Generation Channel Coding (EPIC) [2] considers three well-known forward error correction (FEC) schemes, namely turbo [3], Low Density Parity Check (LDPC) [4], [5] and polar codes [6], for extremely high-speed beyond-5G applications. This paper aims to present an efficient polar code implementation to meet the Tb/s throughput demand. Polar codes have attracted significant interest from both academia and industry, and they have recently been adopted for the protection of the control channel in the enhanced mobile broadband (eMBB) service of the fifth generation (5G) cellular wireless technology [7].

Polar codes are a unique family of FEC schemes that can theoretically achieve capacity in a broad class of channels using a low-complexity successive cancellation (SC) decoder [6], [8]. To improve the error correction performance of the SC algorithm at moderate data block lengths, many algorithms have been proposed, the most popular being SC-list (SCL) [9]. The SCL algorithm tracks a list of possible decision candidates and can achieve near-ML performance at the cost of this additional complexity. Other SC-based decoding algorithms, SC-flip (SCF) [10] and soft-cancellation (SCAN) [11], use an iterative approach to correct the decision errors of SC. Using multiple iterations or multiple parallel decoders makes these algorithms much more power hungry at Tb/s data rates.

The sequential processing nature of SC limits parallelism within the decoder but promotes a pipelined approach for Tb/s throughput. There are some high-throughput, pipelined SC implementations [12], [13], [14], [15] in the literature. These implementations achieve only a few Gb/s throughput, even though the unrolled implementations [14], [15] take advantage of a set of shortcuts [16], [17], [18] to speed up the SC decoder. For higher throughput, discrete quantization of soft information [19] and register reduction/balancing (R-RB) [20] methods have been proposed. In [20], Tb/s throughput for a polar SC decoder has been investigated. A generic problem identified for the Tb/s throughput regime is the power density caused by excessive switching activity in a limited core area.

This work proposes an optimized SC (OPSC) decoder architecture based on pipelining and unrolling techniques. The OPSC decoder has low implementation complexity thanks to careful register balancing and an adaptive log-likelihood ratio (LLR) quantization (AQ) scheme. This scheme takes the LLR distribution of each polar code segment as input and optimizes the bit precision of the internal LLRs. In addition, OPSC utilizes the R-RB method to optimize the clock frequency by flattening the time delay of the pipeline stages.
The post-placement-routing (post-P&R) results at 16nm FinFET ASIC technology show that the OPSC decoder achieves 1.2 Tb/s coded throughput (corresponding to 1 Tb/s data throughput) on 0.79 mm² area with 0.95 pJ/bit energy efficiency.

A. Summary of the achievements

• The OPSC decoder and a channel simulator capable of simulating very low error rates are implemented on FPGA to verify the RTL code and measure error correction performance at very low BER. The FPGA implementation results show that the OPSC decoder achieves 200 Gb/s throughput and a bit error rate (BER) of […] × 10^−[…] at 8 dB Eb/No.
• The OPSC decoder exploits the AQ and R-RB methods to reduce design area and power.
• To the best of our knowledge, the OPSC decoder is the first polar decoder that exceeds 1 Tb/s throughput on ASIC based on post-P&R results.
• The results also show that the OPSC decoder achieves 0.95 pJ/bit energy efficiency at 0.79 mm² area.

The outline of this paper is as follows. Section II gives a short summary of polar coding and introduces SC decoding with shortcuts and AQ. Section III presents the proposed decoder architecture, including pipeline depth optimization. Section IV presents the FPGA verification and communication performance of the proposed decoder. Section V presents the ASIC implementation details and shows a comparison with state-of-the-art implementations. Finally, Section VI summarizes the main results with a brief conclusion.

II. POLAR CODES AND SC DECODING ALGORITHM

A. Polar codes

Polar codes are a class of linear block codes that exist with a block length $N = 2^n$ for every $n \ge 1$. The polar transform matrix $G_N = G^{\otimes n}$ is the $n$-th Kronecker power of the generator matrix $G = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. An input transform vector $u_N$ consists of a user data set $u_{\mathcal{A}}$, which carries the user data $d_K$ of dimension $K$, and a redundancy (frozen) set $u_{\mathcal{A}^c}$, which carries the frozen bits fixed to zero. The polar-encoded codeword $x_N$ is simply obtained as $x_N = u_N G_N$. We refer to [6] for a complete description of the polar coding technique.

B. SC Decoding Algorithm with Shortcuts

The data flow diagram of the SC algorithm with certain shortcuts is shown in Fig. 1. At the start of decoding, the channel log-likelihood ratio (LLR) vector $\ell_N$ is given to the input. For an AWGN channel $W$ with noise variance $\sigma^2$, $i$-th ($1 \le i \le N$) encoded symbol $x_i$ and received symbol $y_i$, the channel LLR $\ell_i$ is

$$\ell_i = \ln\left(\frac{W(y_i \mid x_i = 0)}{W(y_i \mid x_i = 1)}\right) = \ln\left(\frac{e^{-(y_i - 1)^2 / (2\sigma^2)}}{\sqrt{2\pi}\,\sigma}\right) - \ln\left(\frac{e^{-(y_i + 1)^2 / (2\sigma^2)}}{\sqrt{2\pi}\,\sigma}\right) = -\frac{(y_i - 1)^2}{2\sigma^2} - \left(-\frac{(y_i + 1)^2}{2\sigma^2}\right) = \frac{2 y_i}{\sigma^2},$$

where $W(y_i \mid x_i)$ is the channel transition probability density function. The forward LLR calculation module consists of F and G functions that calculate the inputs $\ell_M$ for the first and the second half of a polar code, where $M$ is the recursive block length parameter. Initially, $M = N$, and there are $M/2$ two-input F and G functions (denoted $F_{M/2}$ and $G_{M/2}$) at each recursion. The function $F(\ell_1, \ell_2)$ for any two LLR values $\ell_1$ and $\ell_2$ is defined as

$$F(\ell_1, \ell_2) = \mathrm{sgn}(\ell_1 \ell_2)\, \min(|\ell_1|, |\ell_2|).$$

The function $G(\ell_1, \ell_2, \hat{z})$ with a hard decision (HD) feedback $\hat{z}$ is

$$G(\ell_1, \ell_2, \hat{z}) = (1 - 2\hat{z})\,\ell_1 + \ell_2.$$

Fig. 1: Data flow diagram of polar SC decoding algorithm with shortcuts (forward LLR calculation with F and G, iterated until a shortcut; hard-decision making with R0, R1, SPC and REP, using a Wagner decoder and a MAP decoder; backward HD calculation, iterated until feedback; user data extraction)

After a number of F and G function iterations, the SC decoder becomes ready to make hard decisions using certain shortcuts. These shortcuts, named Rate-0 (R0), Rate-1 (R1), SPC and REP (first introduced in [17]), apply to easily decodable polar code segments. For the R0 shortcut, all values are assigned to 0. For the R1 shortcut, a simple threshold function assigns 0 to positive LLRs and 1 to negative LLRs. Moreover, Wagner [21] and MAP [22] decoders are employed for SPC and REP nodes, respectively. After one of these shortcuts is activated, the backward HD module (also called partial-sum update logic, PSUL) takes the HD estimate $\hat{u}_M$ and calculates the feedback $\hat{z}_{M/2}$ for the G functions. This module utilizes $M/2$ XOR ($\oplus$) functions at each iteration. After all polar code segments are decoded, the final HD estimate $\hat{u}_N$ is calculated, and the estimated user data $\hat{d}_K$ is extracted from $\hat{u}_N$ at the end of the decoding operation.

A formal representation of the SC algorithm with shortcuts is shown in Algorithm 1. $v_M$ is the indicator vector of the frozen coordinates for length-$M$ polar code segments. The $i$-th element of $v_M$ is defined as

$$v_i = \begin{cases} 1, & \text{if } i \in \mathcal{A}^c \\ 0, & \text{if } i \in \mathcal{A}. \end{cases}$$

When $v_M$ is all one, the R0 shortcut is calculated. When $v_M$ is all zero, the R1 shortcut is calculated using a threshold function on the LLRs. Both F and G functions are applied element-wise to the odd-indexed ($\ell_M^{\text{odd}}$) and even-indexed ($\ell_M^{\text{even}}$) elements of the $\ell_M$ vector.
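As a minimal floating-point sketch of the channel LLR and the F and G functions defined above, the following Python/NumPy snippet evaluates them on a few sample values. The function names (`channel_llr`, `f_func`, `g_func`) and the sample inputs are illustrative assumptions; the OPSC hardware itself operates on quantized sign-magnitude LLRs rather than floats.

```python
import numpy as np

def channel_llr(y, sigma2):
    """Channel LLR for BPSK (0 -> +1, 1 -> -1) over AWGN: 2*y / sigma^2."""
    return 2.0 * np.asarray(y, dtype=float) / sigma2

def f_func(l1, l2):
    """Min-sum F function: sgn(l1*l2) * min(|l1|, |l2|)."""
    return np.sign(l1) * np.sign(l2) * np.minimum(np.abs(l1), np.abs(l2))

def g_func(l1, l2, z_hat):
    """G function with hard-decision feedback z_hat in {0, 1}."""
    return (1 - 2 * np.asarray(z_hat)) * l1 + l2

# Four received symbols at an assumed noise variance of 0.5.
llr = channel_llr([0.9, -1.2, 0.3, 1.1], sigma2=0.5)
l_odd, l_even = llr[0::2], llr[1::2]      # odd / even elements (1-based indexing)

f_out = f_func(l_odd, l_even)             # LLRs passed to the first half
g_out = g_func(l_odd, l_even, [1, 0])     # LLRs passed to the second half
print(f_out, g_out)
```

The odd/even split mirrors the element-wise use of F and G described in the text; the feedback vector `[1, 0]` is an arbitrary example, not a decoded value.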

1) Shortcuts for ($N = 1024$, $K = 854$) polar code: For a specific implementation of polar codes, the code parameters are selected as $N = 1024$, $K = 854$, $R = K/N \approx 0.83$. The selected polar code is constructed using the Density Evolution algorithm [23] at 6.5 dB target Es/No. The number of shortcuts for this code is shown in Table I. Due to the high code rate, R1 and SPC shortcuts appear more frequently than R0 and REP shortcuts. By design choice, SPC and REP shortcuts are not allowed to be longer than $N_{\mathrm{LIM}} = 32$ in order to keep the target clock frequency high. Other shortcuts were discovered in [18]; however, they are not employed in this work due to negligible hardware gain for our target polar code.

Algorithm 1: SC with shortcuts

Inputs: $\ell_M$, $v_M$, $M$
Output: $\hat{u}_M$
if $v_M = 1$ then    // R0, R = 0
    $\hat{u}_M = d(\ell_M, v_M = 1) = 0$
else if $v_M = 0$ then    // R1, R = 1
    $\hat{u}_M = d(\ell_M, v_M = 0)$
else if $M \le N_{\mathrm{LIM}}$ and $v_1 = 1$ and $v_2^M = 0$ then    // R = (M-1)/M
    $\hat{u}_M = d(\ell_M, v_M = 0)$    // Wagner dec.
    $p = \mathrm{mod}\left(\sum_{i=1}^{M} \hat{u}_i, 2\right)$
    $r = \mathrm{argmin}(|\ell_M|)$
    $\hat{u}_r = \hat{u}_r \oplus p$
else if $M \le N_{\mathrm{LIM}}$ and $v_1^{M-1} = 1$ and $v_M = 0$ then    // R = 1/M
    $\hat{u}_M = d\left(\sum_{i=1}^{M} \ell_i, v = 0\right)$    // MAP dec.
else    // Conventional SC, for all R
    $l_{M/2} = F(\ell_M^{\text{odd}}, \ell_M^{\text{even}})$    // Fig. 4
    $\hat{z}_{M/2} = \mathrm{SC}(l_{M/2}, v_M^{\text{odd}}, M/2)$
    $r_{M/2} = G(\ell_M^{\text{odd}}, \ell_M^{\text{even}}, \hat{z}_{M/2})$    // Fig. 5
    $\hat{x}_{M/2} = \mathrm{SC}(r_{M/2}, v_M^{\text{even}}, M/2)$
    $\hat{u}_M^{\text{odd}} = \hat{z}_{M/2} \oplus \hat{x}_{M/2}$    // Backward HD
    $\hat{u}_M^{\text{even}} = \hat{x}_{M/2}$
if $M = N$ then    // User data extraction
    $\hat{u}_K = u_{\mathcal{A}}$
    return $\hat{u}_K$
else    // Decode remaining code segments
    return $\hat{u}_M$

TABLE I: Number of shortcuts in (1024,854) OPSC decoder

Node Length   R0    R1    SPC   REP   Total
2             2     2     -     -     4
4             2     4     8     10    24
8             1     3     6     3     13
16            1     3     3     2     9
32            1     6     4     -     3
64            -     3     -     -     7
128           -     1     -     -     1
Total         68    604   256   96    1024
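The recursion in Algorithm 1 can be sketched in floating-point Python as follows. This is an illustrative software model under stated assumptions, not the unrolled hardware: the names `sc_decode` and `polar_encode`, the random frozen set (which is not a designed polar code) and the noiseless LLR values are all hypothetical. As in Algorithm 1, the function returns hard decisions for each segment after the backward HD step.

```python
import numpy as np

N_LIM = 32  # maximum SPC/REP shortcut length stated in the text

def f_func(a, b):
    return np.sign(a) * np.sign(b) * np.minimum(np.abs(a), np.abs(b))

def g_func(a, b, z):
    return (1 - 2 * z) * a + b

def hard(llr):
    """Threshold decision: 0 for non-negative LLRs, 1 for negative LLRs."""
    return (llr < 0).astype(np.int64)

def sc_decode(llr, v):
    """Hard decisions for one segment; v[i] = 1 marks a frozen coordinate."""
    m = len(llr)
    if v.all():                                        # R0: everything frozen
        return np.zeros(m, dtype=np.int64)
    if not v.any():                                    # R1: nothing frozen
        return hard(llr)
    if m <= N_LIM and v[0] == 1 and not v[1:].any():   # SPC, Wagner decoding
        u = hard(llr)
        if u.sum() % 2:                                # parity check fails
            u[np.argmin(np.abs(llr))] ^= 1             # flip least reliable bit
        return u
    if m <= N_LIM and v[:-1].all() and v[-1] == 0:     # REP, MAP decoding
        return np.full(m, int(llr.sum() < 0), dtype=np.int64)
    l_odd, l_even = llr[0::2], llr[1::2]               # conventional SC step
    z = sc_decode(f_func(l_odd, l_even), v[0::2])
    x = sc_decode(g_func(l_odd, l_even, z), v[1::2])
    u = np.empty(m, dtype=np.int64)
    u[0::2] = z ^ x                                    # backward HD (PSUL)
    u[1::2] = x
    return u

def polar_encode(u):
    """Non-systematic transform that matches the odd/even split above."""
    if len(u) == 1:
        return u.copy()
    z, x = polar_encode(u[0::2]), polar_encode(u[1::2])
    out = np.empty(len(u), dtype=np.int64)
    out[0::2] = z ^ x
    out[1::2] = x
    return out

rng = np.random.default_rng(0)
n = 64
v = np.zeros(n, dtype=np.int64)
v[rng.choice(n, size=16, replace=False)] = 1   # arbitrary frozen set (illustration)
u = rng.integers(0, 2, n) * (1 - v)            # frozen bits forced to zero
x = polar_encode(u)
decoded = sc_decode(4.0 * (1 - 2 * x), v)      # noiseless BPSK LLRs
print(np.array_equal(decoded, x))
```

With a systematic code, as used by OPSC, user data extraction would amount to reading the information coordinates of the returned vector; the systematic encoder itself is outside the scope of this sketch.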

The savings in F/G functions and XOR gates obtained by using shortcuts are shown in Table II. In the standard SC decoder architecture, there are $N/2$ F functions and $N/2$ G functions at each of the $\log_2 N$ recursion levels. Therefore, the total number of required F and G functions is $N \log_2 N = 10{,}240$. It reduces to 5276 after applying shortcuts. The gain is mostly caused by the smaller polar code segments. The number of F functions is not equal to the number of G functions for small segments due to R0 shortcuts. For these specific nodes, G functions are used with all-zero decision feedback. Furthermore, the number of XOR gates required for the PSUL reduces from $(N/2) \log_2 N = 5120$ to 2672.
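The counts quoted above can be cross-checked with a few lines of Python. The sketch assumes that the F and G columns of Table II list the number of nodes of each length, and that a node of length M contains M/2 parallel functions; this reading is inferred from the table totals rather than stated in the text.

```python
import math

N = 1024
levels = int(math.log2(N))

# Table II: node length -> (number of F nodes, number of G nodes)
table2 = {4: (0, 2), 8: (11, 13), 16: (12, 13), 32: (10, 11), 64: (10, 11),
          128: (7, 7), 256: (4, 4), 512: (2, 2), 1024: (1, 1)}

standard_fg = N * levels            # N/2 F plus N/2 G functions per level
standard_xor = (N // 2) * levels    # PSUL XOR gates without shortcuts

f_total = sum(f * m // 2 for m, (f, g) in table2.items())
g_total = sum(g * m // 2 for m, (f, g) in table2.items())
fg_saving = 1 - (f_total + g_total) / standard_fg

print(standard_fg, standard_xor, f_total, g_total, round(100 * fg_saving, 1))
```

Under these assumptions the script reproduces the totals of Table II, which corresponds to removing a little under half of the F and G functions.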

2) Adaptive quantization of the LLRs: Adaptive quantization is an optimization method that reduces hardware complexity by decreasing the LLR bit precision of internal data paths in the SC decoder. Instead of storing and processing LLRs with a constant number of bits, a variable number of bits is used for each polar code segment. During SC decoding of polar codes, the polarization phenomenon becomes effective. As polarization increases reliability, the LLR resolution can be decreased without losing performance, and representing polarized code segments with a constant number of quantization bits becomes inefficient. Unlike [19], which uses lookup tables for the computation of LLRs, we use the input LLR distribution statistics of each polar code segment and apply the given F and G functions with an optimized bit precision. The adaptive quantization scheme for the (1024,854) polar code is shown in Fig. 2. The number of quantization bits is written on the lines between polar code segments. With adaptive quantization, the memory complexity is reduced by 15.1%, with an average LLR precision of 4.25 bits, compared to the SC decoder with regular 5-bit quantization. It further reduces combinational logic complexity and enables a shallower pipeline depth.

TABLE II: F/G function and XOR gate gain due to shortcuts

Node Length   F      G      Total F/G functions   XOR gates in PSUL
4             -      2      4                     2
8             11     13     96                    13
16            12     13     200                   13
32            10     11     336                   11
64            10     11     672                   11
128           7      7      896                   7
256           4      4      1024                  4
512           2      2      1024                  2
1024          1      1      1024                  1
Total         2604   2672   5276                  2672

(1024,854)
    (512,361), 5 bits
        (256,131), 5 bits
            (128,36), 5 bits
            (128,95), 4 bits
        (256,230), 4 bits
            (128,103), 4 bits
            (128,127), 3 bits
    (512,493), 4 bits
        (256,238), 4 bits
            (128,111), 4 bits
            (128,127), 3 bits
        (256,255), 3 bits
            (128,127), 3 bits
            (128,128), 1 bit

Fig. 2: Adaptive quantization of the constituent codes of SC(1024,854) for $128 \le M \le 1024$. The number of quantization bits is written on the lines.

III. OPSC DECODER ARCHITECTURE

The proposed OPSC decoder architecture exploits pipelining and unrolling techniques to achieve Tb/s data rates while keeping implementation complexity in check. The enabling methods for the OPSC decoder are as follows.
• Systematic polar code for improved BER performance.
• Min-sum decoding for simpler arithmetic operations with small area and power dissipation.
• Adaptive quantization of internal LLRs to reduce computational and memory complexity.
• Bit-reversed order computation to operate on neighboring LLRs.
• Unrolled and pipelined architecture for high throughput.
• Fully-parallel SC architecture with multi-bit hard decisions using shortcuts.
• R-RB using pipeline depth optimization for minimum delay and power. The pipeline depth of the OPSC decoder is optimized to 158 for FPGA and 60 for ASIC.

The OPSC decoder, denoted OPSC$(N, K)$, consists of two sub-decoders that have the same block length $N/2$ with different payloads $K_i = (N/2) R_i$. In general, OPSC$(N, K)$ is decoded in four steps: calculation of the F functions, OPSC$(N/2, K_1)$, calculation of the G functions and OPSC$(N/2, K_2)$. AQ is applied after the F and G functions to reduce the LLR bit resolution from $Q$ to $Q'$ bits. For example, the recursive OPSC decoder architecture for an $N = 16$, $K = 9$ polar code segment is shown in Fig. 3. The choice of lower-layer code rates is exemplary. The $\ell_{16}$ LLRs at the input, with $16 \times Q$ bits, are stored during the processing duration of the OPSC(8,2) decoder (denoted $L(\mathrm{OPSC}_1)$) until $\hat{z}_8$ becomes ready at the input of function block G. Likewise, $\hat{z}_8$ is stored until $\hat{x}_8$ is ready. A buffer memory structure is used because it gives faster data access than alternatives such as SRAM.

Fig. 3: OPSC(16,9) architecture

A similar hardware architecture is utilized to decode all polar code segments recursively. When shortcuts are detected, the corresponding sub-decoder blocks are replaced. For example, OPSC(8,7) is replaced with an SPC shortcut.

A. Building blocks of the OPSC decoder architecture

The basic building blocks of the OPSC decoder are the F and G functions. The $F_N$ function consists of $N/2$ copies of the F function shown in Fig. 4. The F function contains a compare-and-select (C&S) logic and an XOR gate.

Fig. 4: F function.

The G function is shown in Fig. 5. It contains an adder, a subtractor and a multiplexer. The select input $\hat{z}$ of the multiplexer may have a longer delay than the $\ell'_1 + \ell'_2$ and $\ell'_2 - \ell'_1$ inputs due to the XOR gate chain in the PSUL. To avoid timing problems, both results are calculated and the correct one is chosen. Since LLRs are stored in sign-magnitude form, the G function also utilizes two sign-magnitude to two's-complement converters (S2C) at the input and a two's-complement to sign-magnitude converter (C2S) at the output.

Fig. 5: G function.
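The structure of the two blocks can be mimicked in software with sign-magnitude pairs. The following Python sketch is a behavioral illustration only: the tuple representation `(sign, magnitude)`, the helper names `s2c`, `c2s`, `f_sm`, `g_sm`, and in particular the saturation applied in `c2s` are assumptions, since the text does not specify how the output width is limited.

```python
def s2c(sign, mag):
    """Sign-magnitude pair -> signed integer (the two's-complement value)."""
    return -mag if sign else mag

def c2s(value, q_bits):
    """Signed integer -> sign-magnitude pair, saturating to q_bits - 1 magnitude bits."""
    max_mag = (1 << (q_bits - 1)) - 1
    return (1 if value < 0 else 0, min(abs(value), max_mag))

def f_sm(a, b):
    """F function: XOR of the sign bits, compare-and-select of the magnitudes."""
    (s1, m1), (s2, m2) = a, b
    return (s1 ^ s2, min(m1, m2))

def g_sm(a, b, z_hat, q_bits):
    """G function: both candidates are computed, z_hat only drives the final mux."""
    x1, x2 = s2c(*a), s2c(*b)
    total, diff = x1 + x2, x2 - x1
    return c2s(diff if z_hat else total, q_bits)

print(f_sm((0, 5), (1, 3)))           # sign from XOR, magnitude from min
print(g_sm((0, 5), (1, 3), 0, 5))     # selects l1 + l2
print(g_sm((0, 5), (1, 3), 1, 5))     # selects l2 - l1
print(g_sm((0, 12), (0, 9), 0, 5))    # sum exceeds 4 magnitude bits and saturates
```

Computing `total` and `diff` before looking at `z_hat` reflects the design choice described above: the late-arriving feedback bit only has to pass through a multiplexer instead of an adder.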

1) Complexity analysis: The time complexity of the fully-parallel standard SC decoder is

$$T_N = 2T_{N/2} + 2 = \sum_{i=1}^{n} 2^i = 2N - 2 = \Theta(N).$$

The memory complexity of the unrolled and fully-pipelined standard SC decoder is

$$M_N = 2M_{N/2} + (Q + 0.5)(N^2 - 2N) + NQ = (Q + 0.5)\left((2 - 2^{-\log_2 N})N^2 - 2N\log_2 N - N\right) + QN\log_2 N = \Theta(N^2 Q),$$

where $M_2 = 2Q + 1$. This formula shows that, in the most general case, memory complexity increases almost quadratically with $N$. This memory is dominantly used in buffers, where soft decision values are stored. The size of these buffers decreases significantly after applying AQ and R-RB, as shown in Table III. The final memory complexity of the OPSC decoder is significantly smaller than that of the conventional SC decoder.

B. Register reduction/balancing

Pipeline stages are important to shorten the critical path and, thus, to increase the clock frequency. However, excessive pipeline stages may also increase memory complexity. To reduce the pipeline depth, we merge sets of consecutive short paths of the SC decoder as much as possible, based on the combinational delay obtained from timing simulations. Register reduction is challenging for the SC decoding algorithm due to its sequential nature. We overcome this problem by estimating the delay of each computation and exploiting shortcuts for parallel processing. This method enables the decoder to perform multiple calculations within a single clock cycle with remarkably reduced latency and memory consumption. Table III shows that the LLR buffer memory of the OPSC decoder is reduced from 1.1 Mb to 380 Kb using R-RB at $Q = 5$ bits. After applying R-RB, the pipeline depth becomes 60. These results include shortcuts but not AQ. Applying the proposed AQ scheme in Fig. 2 further reduces the LLR buffer memory to 361 Kb. Including the PSUL memory, the total buffer memory becomes 380 Kb. The memory gain of AQ is marginal because the R-RB scheme has already reduced the memory significantly.

TABLE III: Buffer memory of OPSC decoder

              Buffer memory depth w/o R-RB   Buffer memory depth with R-RB
Node Length   LLR       PSUL                 LLR       PSUL
4             -         18                   -         -
8             22        29                   -         5
16            44        36                   15        14
32            60        44                   20        15
64            76        50                   29        22
128           93        49                   29        21
256           103       53                   34        20
512           103       46                   37        17
1024          112       -                    41        -
Total size    1.1 Mb    49 Kb                380 Kb    19 Kb
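The "Total size" row of Table III can be reproduced from the per-node depths with a short calculation. The sketch below assumes that each entry is a buffer depth in pipeline stages, that a node of length M stores M values per stage, and that an LLR takes Q = 5 bits while a PSUL value takes 1 bit. These assumptions are inferred from how well the totals match; they are not spelled out in the text.

```python
Q = 5  # LLR width in bits (regular quantization, no AQ)

# node length -> (LLR depth w/o R-RB, PSUL depth w/o R-RB,
#                 LLR depth with R-RB, PSUL depth with R-RB); "-" entered as 0
table3 = {
    4: (0, 18, 0, 0),      8: (22, 29, 0, 5),     16: (44, 36, 15, 14),
    32: (60, 44, 20, 15),  64: (76, 50, 29, 22),  128: (93, 49, 29, 21),
    256: (103, 53, 34, 20), 512: (103, 46, 37, 17), 1024: (112, 0, 41, 0),
}

def total_bits(column, bits_per_value):
    """Sum of node_length * depth * width over all node lengths."""
    return sum(m * row[column] * bits_per_value for m, row in table3.items())

llr_without = total_bits(0, Q)
psul_without = total_bits(1, 1)
llr_with = total_bits(2, Q)
psul_with = total_bits(3, 1)

print(llr_without, psul_without, llr_with, psul_with)
```

Rounded to the precision used in the table, the four sums agree with the reported 1.1 Mb, 49 Kb, 380 Kb and 19 Kb.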

IV. FPGA VERIFICATION AND PERFORMANCE

The error correction performance of the OPSC decoder was verified on a Xilinx (xcvu37p-fsch2892-2L-e-es1) FPGA demo board. To attain real-time verification, an FPGA architecture is developed for both the polar systematic encoder and the OPSC decoder. The rest of this section presents the FPGA test platform and the error correction performance of the OPSC decoder.

A. FPGA test platform

Polar decoder implementations were verified on the FPGA test platform shown in Fig. 6. The test platform supports 200 Gb/s information throughput at a 234 MHz clock frequency. A linear feedback shift register (LFSR) array generates $K = 854$ bits of pseudo-random data for each transmitted polar codeword. A systematic polar encoder generates $N = 1024$ bits of encoded data from the pseudo-random data. The encoded data, $x_i$, consists of 854 systematic bits and 170 parity bits, both of which are mapped to BPSK symbols ($s_i$) using the mapping rule in Eq. 1. Additive white Gaussian noise (AWGN), generated by a built-in Gaussian random number generator, is added to the symbols. The BPSK demodulator generates LLR values with $Q = 5$ bits in sign-magnitude form. The polar SC decoder processes the LLRs and produces information bit estimates, which are compared with the original data to produce error statistics.

$$s_i = \begin{cases} 1, & \text{if } x_i = 0 \\ -1, & \text{otherwise.} \end{cases} \qquad (1)$$

B. FPGA performance results

The frame error statistics in Fig. 7 show that the FPGA implementation of the OPSC decoder causes less than 0.1 dB performance loss compared to floating-point software simulation (without AQ). The performance loss is caused by the 5-bit quantization of LLRs on the FPGA. The proposed AQ scheme causes almost 0.1 dB additional performance loss, which is tolerable given the hardware implementation gains.

Fig. 6: Block diagram of FPGA test platform (data source, polar encoder at 200 Gb/s, BPSK mapper, AWGN channel, SISO LLR demapper, OPSC decoder at 200 Gb/s, error statistics)

Fig. 7: FER performance of (1024,854) polar code

The BER performance results in Fig. 8 show that a coding gain of […] − […].72 = 6.[…] dB is attained at 10^−[…] BER relative to uncoded transmission.

The FPGA implementation results of the polar encoder and the OPSC decoder are shown in Table IV. Both the encoder and the decoder run at a 234 MHz clock frequency. To support this frequency, the OPSC decoder utilizes register array memory in the form of LUTs for storing internal LLRs. The received LLRs are stored in 143 block random access memories (BRAMs) with 16K capacity. The results also show that the OPSC decoder utilizes only 7.34% of the FPGA in terms of LUT consumption. The OPSC decoder can therefore fit low-cost FPGA boards such as Xilinx Artix-7.

TABLE IV: FPGA resource utilization

                LUT      FF       Power (mW)   Latency (ns)
Polar encoder   1848     1825     36           8.5
OPSC decoder    95653    50843    2373         672
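The FPGA throughput and latency figures follow from the clock frequency and the pipeline depth. The sketch below assumes that the unrolled, fully pipelined decoder accepts one codeword per clock cycle, which is consistent with the reported numbers but is an inference; the helper names are hypothetical.

```python
def throughput_gbps(bits_per_codeword, f_clk_mhz):
    """Throughput when one codeword leaves the pipeline every clock cycle."""
    return bits_per_codeword * f_clk_mhz / 1000.0

def pipeline_latency_ns(depth, f_clk_mhz):
    """Time for one codeword to traverse `depth` pipeline stages."""
    return depth * 1000.0 / f_clk_mhz

N, K = 1024, 854
F_CLK_MHZ = 234        # FPGA clock frequency from the text
FPGA_DEPTH = 158       # FPGA pipeline depth from Section III

info_gbps = throughput_gbps(K, F_CLK_MHZ)
coded_gbps = throughput_gbps(N, F_CLK_MHZ)
latency_ns = pipeline_latency_ns(FPGA_DEPTH, F_CLK_MHZ)

print(round(info_gbps, 1), round(coded_gbps, 1), round(latency_ns, 1))
```

The information throughput comes out just under 200 Gb/s, and the latency estimate of about 675 ns is close to the 672 ns listed in Table IV; the small difference may come from how the measured latency counts input and output stages.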

Fig. 8: BER performance of (1024,854) polar code

V. ASIC IMPLEMENTATION

This section presents the ASIC implementation procedure and results. General information about the ASIC implementation is as follows.
• The TSMC 16nm CMOS logic FinFET library (BWP16P90) is used for RTL synthesis and backend implementation.
• The process-voltage-temperature (PVT) values are 0.8 V and […] °C. Although timing is satisfied at 0.7 V at the end of RTL synthesis, 0.8 V is chosen for the backend implementation to make timing closure easier.
• Logic cells with the threshold options RVT, LVT, OPT-LVT and ULVT are used.
• The setup and hold times of each design are verified for the typical, worst-C, worst-RC, best-C and best-RC design corners.
• The power results are estimated by generating signal activity factors using a testbench. The testbench simulates consecutive decoding of 1000 codewords transmitted at 0-9 dB Eb/No; 100 codewords are tested for each Eb/No value.
• To achieve timing-clean results, a noticeable number of buffers and inverters have been added to the design.
• The final implementation results are obtained at the end of a timing-clean P&R.
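The testbench stimulus described in the list above can be approximated in software. The following sketch is not the authors' testbench: the Eb/No-to-noise conversion assumes unit-energy BPSK symbols, the rounding-and-clipping quantizer is an assumed model of the 5-bit sign-magnitude LLR input, and the function names are hypothetical.

```python
import numpy as np

N, K = 1024, 854
RATE = K / N

def ebno_db_to_sigma(ebno_db, rate=RATE):
    """Noise standard deviation for unit-energy BPSK at the given Eb/No."""
    ebno = 10.0 ** (ebno_db / 10.0)
    return float(np.sqrt(1.0 / (2.0 * rate * ebno)))

def quantize_llr(llr, q_bits=5):
    """Round and clip to the sign-magnitude range of q_bits (assumed quantizer)."""
    max_mag = (1 << (q_bits - 1)) - 1
    return np.clip(np.rint(llr), -max_mag, max_mag).astype(np.int64)

def make_stimulus(codeword, ebno_db, rng):
    """BPSK-map a codeword, add AWGN and return quantized channel LLRs."""
    sigma = ebno_db_to_sigma(ebno_db)
    y = (1 - 2 * codeword) + sigma * rng.standard_normal(codeword.shape)
    return quantize_llr(2.0 * y / sigma ** 2)

# 100 codewords at each integer Eb/No from 0 to 9 dB, as in the text.
schedule = [(ebno_db, 100) for ebno_db in range(0, 10)]
total_codewords = sum(count for _, count in schedule)

rng = np.random.default_rng(1)
example = make_stimulus(np.zeros(N, dtype=np.int64), 4, rng)  # all-zero codeword
print(total_codewords, example.min(), example.max())
```

Sweeping the operating point matters for power estimation because switching activity inside the decoder depends on how noisy the input LLRs are.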

A. Synthesis

The 0.8V TSMC 16nm logic FinFET plus 1P13M process is used for implementation. Physical synthesis is performed with Cadence Genus. The OPSC decoder has been implemented for a clock frequency of 1.2 GHz. To cope with the high clock frequency, retiming has been adopted to move the pipeline registers across the combinatorial logic. In addition, fine-grained clock gating has been performed during synthesis to reduce the dynamic power.

Initially, the OPSC decoder was synthesized with 0.7V and 0.8V supply voltages at 1.2 GHz clock frequency. The synthesis results in Table V show that 73 paths have timing violations at 0.7V. These violations can be fixed by reducing the clock frequency or increasing the number of pipeline stages. The former is not compatible with our Tb/s throughput target. The latter increases the number of DFFs and therefore the complexity of the clock distribution network, which may cause routing problems at the backend implementation stage. Since we want to keep the pipeline depth as small as possible, the backend study is performed only for the OPSC decoder at 0.8V.

TABLE V: Synthesis results of OPSC decoder at 0.7V and 0.8V supply voltages

Supply Voltage (V)   Cell Area (µm²)   […]

B. Floorplan

The physical floorplan for the OPSC decoder is shown in Fig. 9. Input and output pins are placed at the top and bottom of the chip, respectively. A rectangular shape is adopted to increase area utilization under flat placement. The total area of the chip is (1.[…] × 0.63) = 0.79 mm².

Fig. 9: Physical floorplan of the OPSC decoder

C. Placement and Routing

Virtual silicon tape-out is a technique to conceptually validate a design on silicon without making a real tape-out. It consists of going through all the implementation and signoff phases and doing all the simulations required to validate the design, without sending it to fabrication. In this study, we have completed the virtual silicon design flow up to the post-place-and-route stage with timing closure.

The physical implementation of this design was done with the Cadence INNOVUS tool, using the exact same libraries as the synthesis and the signoff timing recommendations provided by TSMC for OCV. Timing is closed only on the SSG0.72V and TT0.8V corners, at all temperatures. We also followed all the recommendations in terms of power and layout signoff. The additional cells added through the different steps of the implementation are shown in Table VI.

The spread of cells across libraries is shown in Table VII.

D. ASIC implementation results

The implementation results of the OPSC decoder in Table VIII show that the decoder utilizes 906K instances in 0.5 mm² cell area. Even after our optimizations, the design is still register dominated due to the deeply pipelined architecture needed to cope with the 1.2 GHz clock frequency. Since register cells are usually larger than the other cells, 70% of the area is occupied by registers. The second largest area belongs to the combinational logic that processes LLRs and produces hard decisions. The total power dissipation is 1.2 W, while the leakage power is only 2.4 mW. The registers clocked at 1.2 GHz account for 52.8% of the total power dissipation. Buffers, including the hold-fix and setup-fix buffers, also account for a significant share of the power dissipation (22.4%).

TABLE VI: Additional cells at post-place, post-cts and post-route steps

                               Post-place   Post-cts   Post-route
Density                        60.48        66.95      67.49
Cells to fix setup/transition  596          6119       450
Latency average                N/A          600 ps     600 ps
Cells added to fix hold        N/A          110452     116018

TABLE VII: Cell types of OPSC decoder

                  RVT       LVT       OPT-LVT   ULVT
Number of cells   637,149   207,964   234       61,618
% of cells        70.27     22.94     0.03      6.8

TABLE VIII: Implementation results of the OPSC decoder

Cell type      Instances   Instances (%)   Area (µm²)   Area (%)   Power (mW)   Power (%)
Registers      401,132     44.2            357,327      69.8       616.2        52.8
Inverters      29,575      3.3             4,055        0.8        35.3         3.0
Buffers        154,379     17.0            52,087       10.2       261.4        22.4
Clk. latches   353         0.04            453          0.1        6.0          0.5
Comb. logic    321,292     35.4            97,683       19.1       248.6        21.3
Total          906,731     100             511,605      100        1,167        100

E. ASIC implementation comparison with other high throughput polar decoders

A comparison of the proposed OPSC decoder implementation with state-of-the-art polar SC decoder implementations is shown in Table IX. The results are scaled to the same 0.8V supply voltage and 16nm process technology for a fair comparison. Following the common scaling factors given in [24], the area is scaled in proportion to the square of the process ratio; the power is scaled in proportion to the square of the voltage ratio and linearly with the process ratio; the energy efficiency is scaled in proportion to the square of the process ratio times the square of the voltage ratio.

The synthesis results show that the OPSC decoder has a noticeable area efficiency, energy efficiency and latency advantage compared to the others. The unrolled implementation [25] provides immense throughput at a high clock frequency; however, it consumes too much power. The combinational decoder implementation [26] is favorable in terms of power and power density; however, its throughput is too low to satisfy the high-throughput requirements of certain applications.

TABLE IX: Synthesis results comparison with the high throughput polar decoders

Implementation               This work   [25]    [26]
ASIC Technology              […]         […]     […]
Block Length                 […]         […]     […]
Code Rate                    […]         […]     […]
Supply Voltage (V)           0.8         1.0     1.3
Coded Throughput (Gb/s)      1229        1275    2.6
Frequency (MHz)              1200        1245    2.5
Latency (µs)                 0.05        0.29    0.40†
Latency (Clock Cycles)       60          365     1
Area (mm²)                   0.47        4.63    3.21
Power (mW)                   1072        8793    191

Scaled to 16nm and 0.8V using the common scaling factors in [24].

Coded Throughput (Gb/s)      1229        […]     […]
Frequency (MHz)              1200        […]     […]
Latency (µs)                 […]         […]     […]
Area (mm²)                   0.47        1.51    […]
Area Eff. (Gb/s/mm²)         […]         […]     […]
Power (mW)                   1072        3216    […]
Power Density (mW/mm²)       2260        2128    […]
Energy Eff. (pJ/bit)         […]         […]     […]

† Not presented in the paper, calculated from the presented results

The comparison of the post-P&R results of this work with the results of fabricated ASICs is shown in Table X. The scaled results show that this work has ultra-low latency and can achieve Tb/s throughput under a reasonable area and power budget. The area efficiency of this work is more than 10 times greater than that of the others. It is remarkable for this work to achieve 1.2 W power and 0.95 pJ/bit energy efficiency at 16nm FinFET technology.

TABLE X: Implementation results comparison with fabricated ASICs

Implementation               This work†  [27]    [28]
ASIC Technology              […]         […]     […]
Block Length                 […]         […]     […]
Code Rate                    […]         […]     […]
Supply Voltage (V)           0.8         0.9     0.9
Coded Throughput (Gb/s)      1229        9.23    4.44
Frequency (MHz)              1200        451     1000
Latency (µs)                 0.05        0.63    -
Latency (CCs)                60          283     -
Area (mm²)                   0.79        0.35    0.35
Power (mW)                   1167        10.6    -

Scaled to 16nm and 0.8V using the common scaling factors in [24].

Coded Throughput (Gb/s)      […]         […]     […]
Frequency (MHz)              […]         789     1000
Latency (µs)                 […]         […]     […]
Area (mm²)                   0.79        […]     […]
Area Eff. (Gb/s/mm²)         […]         143     12.7
Power (mW)                   1167        […]     -
Power Density (mW/mm²)       1473        […]     -
Energy Eff. (pJ/bit)         0.95        […]     -

† Post-P&R results are given for this work.

VI. CONCLUSION

In this paper, an optimized implementation of the SC decoder based on the AQ and R-RB methods is proposed for polar codes. These methods not only reduce implementation complexity, power and area but also enable Tb/s throughput on ASIC. Hardware architectures of the (1024,854) OPSC decoder are developed for FPGA and ASIC. For FPGA, 200 Gb/s throughput is achieved, and hardware verification of the OPSC decoder is completed by measuring a BER of […] × 10^−[…] at 8 dB Eb/No. The ASIC implementation results show that the OPSC decoder achieves 1.2 Tb/s coded throughput, 1554 Gb/s/mm² area efficiency and 0.95 pJ/b energy efficiency. When the OPSC decoder is compared with other fabricated ASICs for polar codes, it has 16 times more throughput, 7.2 times less latency and 10 times better area efficiency than the best alternative implementation.

ACKNOWLEDGMENT

This work has been supported by the EPIC project funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 760150. The authors would like to thank Prof. Norbert Wehn and Mr. Claus Kestel for useful discussions and comments.

REFERENCES

[1] "Ethernet roadmap 2019." [Online]. Available: https://ethernetalliance.org/wp-content/uploads/2019/08/EthernetRoadmap-2019-Side1-ToPrint.pdf.
[2] "EPIC - Enabling practical wireless Tb/s communications with next generation channel coding." [Online]. Available: https://epic-h2020.eu/results.
[3] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes. 1," in Proceedings of ICC '93 - IEEE International Conference on Communications, vol. 2, pp. 1064–1070, May 1993.
[4] R. Gallager, "Low-density parity-check codes," IRE Transactions on Information Theory, vol. 8, pp. 21–28, January 1962.
[5] D. J. C. MacKay and R. M. Neal, "Near Shannon limit performance of low density parity check codes," Electronics Letters, vol. 32, pp. 1645–, Aug 1996.
[6] E. Arikan, "Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels," IEEE Transactions on Information Theory, vol. 55, pp. 3051–3073, July 2009.
[7] 3GPP, "New radio NR; multiplexing and channel coding (3GPP TS 38.212, Version 15.7.0)," tech. rep., 2019.
[8] E. Sasaoglu, E. Telatar, and E. Arikan, "Polarization for arbitrary discrete memoryless channels," pp. 144–148, Nov. 2009.
[9] I. Tal and A. Vardy, "List decoding of polar codes," IEEE Transactions on Information Theory, vol. 61, pp. 2213–2226, May 2015.
[10] O. Afisiadis, A. Balatsoukas-Stimming, and A. Burg, "A low-complexity improved successive cancellation decoder for polar codes," in […], pp. 2116–2120, 2014.
[11] U. U. Fayyaz and J. R. Barry, "Low-complexity soft-output decoding of polar codes," IEEE Journal on Selected Areas in Communications, vol. 32, no. 5, pp. 958–966, 2014.
[12] C. Zhang and K. K. Parhi, "Low-latency sequential and overlapped architectures for successive cancellation polar decoder," IEEE Transactions on Signal Processing, vol. 61, pp. 2429–2441, May 2013.
[13] O. Dizdar and E. Arıkan, "A high-throughput energy-efficient implementation of successive cancellation decoder for polar codes using combinational logic,"

IEEE Transactions on Circuits and Systems I:Regular Papers , vol. 63, pp. 436–447, March 2016.[14] P. Giard, G. Sarkis, C. Thibeault, and W. J. Gross, “237 gbit/s unrolledhardware polar decoder,”

Electronics Letters , vol. 51, no. 10, pp. 762–763, 2015.[15] P. Giard, G. Sarkis, C. Thibeault, and W. J. Gross, “Multi-mode unrolledarchitectures for polar decoders,”

IEEE Transactions on Circuits andSystems I: Regular Papers , vol. 63, pp. 1443–1453, Sept 2016.[16] A. Alamdar-Yazdi and F. R. Kschischang, “A simpliﬁed successive-cancellation decoder for polar codes,”

IEEE Communications Letters ,vol. 15, pp. 1378–1380, December 2011.[17] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. J. Gross, “Fast polardecoders: Algorithm and implementation,”

IEEE Journal on SelectedAreas in Communications , vol. 32, pp. 946–957, May 2014.[18] M. Hanif and M. Ardakani, “Fast successive-cancellation decoding ofpolar codes: Identiﬁcation and decoding of new nodes,”

IEEE Commu-nications Letters , vol. 21, pp. 2360–2363, Nov 2017.[19] S. A. A. Shah, M. Stark, and G. Bauch, “Design of quantized decodersfor polar codes using the information bottleneck method,” in

SCC 2019;12th International ITG Conference on Systems, Communications andCoding , pp. 1–6, Feb 2019.[20] A. Süral, E. G. Sezer, Y. Ertu˘grul, O. Arıkan, and E. Arıkan, “Terabits-per-second throughput for polar codes,” in , pp. 1–7, Sep. 2019. [21] R. Silverman and M. Balser, “Coding for constant-data-rate systems,”

Transactions of the IRE Professional Group on Information Theory ,vol. 4, pp. 50–63, September 1954.[22] M. P. C. Fossorier, M. Mihaljevic, and H. Imai, “Reduced complexityiterative decoding of low-density parity check codes based on beliefpropagation,”

IEEE Trans. on Comm. , vol. 47, pp. 673–680, May. 1999.[23] I. Tal and A. Vardy, “How to construct polar codes,”

IEEE Transactionson Information Theory , vol. 59, no. 10, pp. 6562–6582, 2013.[24] C. Wong and H. Chang, “Reconﬁgurable turbo decoder with parallelarchitecture for 3gpp lte system,”

IEEE Transactions on Circuits andSystems II: Express Briefs , vol. 57, pp. 566–570, July 2010.[25] P. Giard, C. Thibeault, and W. J. Gross,

High-Speed Decoders for PolarCodes . Springer, 2017. pp. 66–67.[26] O. Dizdar,

High Throughput Decoding Methods and Architectures forPolar Codes with High Energy-Efﬁciency and Low Latency . PhD thesis,Bilkent University, 2017.[27] P. Giard, A. Balatsoukas-Stimming, T. C. Müller, A. Bonetti,C. Thibeault, W. J. Gross, P. Flatresse, and A. Burg, “Polarbear: A 28-nm fd-soi asic for decoding of polar codes,”

IEEE Journal on Emergingand Selected Topics in Circuits and Systems , vol. 7, pp. 616–629, Dec2017.[28] X. Liu, Q. Zhang, P. Qiu, J. Tong, H. Zhang, C. Zhao, and J. Wang,“A 5.16gbps decoder asic for polar code in 16nm ﬁnfet,” in2018 15thInternational Symposium on Wireless Communication Systems (ISWCS)