Multi-Stream LDPC Decoder on GPU of Mobile Devices
Roohollah Amiri
Electrical and Computer Engineering
Boise State University, Idaho
[email protected]

Hani Mehrpouyan
Electrical and Computer Engineering
Boise State University, Idaho
[email protected]
Abstract—Low-density parity check (LDPC) codes have been extensively applied in mobile communication systems due to their excellent error correcting capabilities. However, their broad adoption has been hindered by the high complexity of the LDPC decoder. Although to date, dedicated hardware has been used to implement low-latency LDPC decoders, recent advancements in the architecture of mobile processors have made it possible to develop software solutions. In this paper, we propose a multi-stream LDPC decoder designed for a mobile device. The proposed decoder uses the graphics processing unit (GPU) of a mobile device to achieve efficient real-time decoding. The proposed solution is implemented on an NVIDIA Tegra board as a system on a chip (SoC), where our results indicate that we can control the load on the central processing units through the multi-stream structure.
Index Terms—Parallel and Distributed Algorithms, Multiprocessor Architectures, LDPC Decoder, GPU Processing.
I. INTRODUCTION
Low-density parity check (LDPC) codes were originally proposed by Robert Gallager in 1962 [1] and rediscovered by MacKay and Neal in 1996 [2]. LDPC codes have been adopted by a wide range of communication standards such as IEEE 802.11n, 10 Gigabit Ethernet (IEEE 802.3an), Long Term Evolution (LTE), and DVB-S2. Chung and Richardson [3] showed that a class of LDPC codes could approach the Shannon limit to within 0.0045 dB. However, the error correcting strength of LDPC codes comes at the cost of very high decoding complexity [4]. Moreover, to date, there are no closed-form solutions to determine the performance of LDPC codes in various wireless channels and systems. Thus, performance evaluation is typically carried out via simulations on computers or dedicated hardware [5].

Since LDPC decoders are computationally intensive and need powerful computer architectures to achieve low latency and high throughput, most LDPC decoders to date are implemented using application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs) [6]. However, their high speed often comes at the price of high development cost and low programming flexibility [7]. Further, it is very challenging to design decoder hardware that supports various standards and multiple data rates [8]. Decoding of LDPC codes is implemented via belief propagation, also known as the sum-product algorithm (SPA). One advantage of iterative schemes based on the SPA is that they can be parallelized based on the architecture of the code graph [3]. In recent years, researchers have used multi-core architectures such as CPUs [9], [10], graphics processing units (GPUs) [5], [11], [12], and advanced RISC machines (ARMs) [10], [13] to develop high-throughput and low-latency software-defined radio (SDR) applications.
Therefore, designers have recently focused on software implementations of LDPC decoders on multi/many-core devices [11] to meet the performance requirements of current communication systems.

In microarchitectures, increasing clock frequencies to obtain faster processing has reached the limits of silicon-based architectures. Hence, to achieve gains in processing performance, other techniques based on parallel processing are being investigated [4]. Today's multi-core architectures support single instruction multiple data (SIMD), single program multiple data (SPMD), and single instruction multiple threads (SIMT). General-purpose multi-core processors homogeneously replicate a single core, typically with an x86 instruction set, and provide shared-memory hardware mechanisms [11]. Such multi-core structures can be programmed at a high level using different software technologies [14], such as Open Multi-Processing (OpenMP) [15], which provides a practical and relatively straightforward approach to general-purpose programming. On the other hand, newer microarchitectures provide increasingly large SIMD units for vector processing, such as streaming SIMD extensions (SSE), advanced vector extensions (AVX), and AVX2 [16] on Intel architectures. In [4], the authors used Intel SSE/AVX2 SIMD units to implement a high-throughput LDPC decoder efficiently. Meanwhile, the power consumption of x86 implementations is incompatible with most embedded mobile systems, which makes them useful for simulation purposes only.

Over the last decade, the performance of GPUs has significantly improved, mainly due to the demands of visualization technology in the gaming industry. Recent GPUs are composed of many cores driven by considerable memory bandwidth. Therefore, they are also being targeted for solving computationally intensive algorithms in a multithreaded and highly parallel fashion.
Hence, researchers in the high-performance computing field are applying GPUs to general-purpose applications (GPGPU). Pertaining to the field of communication, researchers have used the Compute Unified Device Architecture (CUDA) from NVIDIA [5], [8], [12], [17], [18] and the Open Computing Language (OpenCL) [19] platforms to develop LDPC decoders on GPUs. As an example, the authors in [17] have achieved near-Gbps decoding throughput for LDPC codes on GPU devices. Although these works can achieve extremely high throughputs, their latency, high power consumption, and cost make them incompatible with embedded mobile devices. The devices of end users usually have limited access to a large power source. As such, these devices must operate on limited resources such as small processors, tiny memory, and a low power budget. In other words, the limited available resources must be used as effectively and efficiently as possible.

ARM-based SDR systems have been proposed in recent years [10], [13] with the goal of developing an SDR-based LDPC decoder that provides high throughput and low latency on a low-power embedded system. The authors in [13] used the ARM processor's SIMD and SIMT programming models to implement an LDPC decoder. This approach allows reaching high throughput while maintaining low latency. However, the proposed ARM-based solution in [13] is based on the assumption that the ARM processor is solely used for LDPC decoding. In practice, mobile devices need to support multiple applications simultaneously, and the processing resources cannot be extensively dedicated to the LDPC decoder. Moreover, recent works on SDR LDPC embedded systems overlook the fact that today's mobile devices have powerful CUDA-enabled GPUs which can play a significant role as a computing resource in an embedded system.

This paper proposes a GPU-based LDPC decoder for an embedded device.
The structure of the proposed decoder is based on multiple data streams, which first makes it scalable to other architectures, and second, allows the processing load imposed by decoding to be controlled by choosing the appropriate number of data streams sent to the GPU device. Moreover, since the ARM and GPU of an embedded device are collocated on the same die, the latency issues associated with a GPU implementation are limited.

The remainder of the paper is structured as follows. Section II briefly introduces LDPC error correcting codes and their decoding algorithms. Then, the proposed heterogeneous algorithm on embedded mobile targets is described in Section III. Section IV gives experimental results, and finally, Section V concludes the paper.

II. LDPC CODES AND THEIR DECODING PROCESSES
LDPC codes are a class of linear block codes with a sparse parity check matrix called the H-matrix. Their main advantage is that they provide performance close to the channel capacity over various wireless channels. Furthermore, the decoding process of LDPC codes is suited to implementations that make heavy use of parallelism [20]. Here, we present a brief background on LDPC codes; the reader is referred to [21] for more information. There are two ways to represent LDPC codes. Like all linear block codes, they can be described by their H-matrix; they can also be represented by a Tanner graph, which is a bipartite graph. An LDPC graph consists of a set of variable nodes, a set of check nodes, and a set of edges E. Each edge connects a variable node to a check node. For example, when the (i, j) element of an H-matrix is '1', the ith check node is connected to the jth variable node of the equivalent Tanner graph. Fig. 1 illustrates the equivalent Tanner graph, with 10 variable nodes and 5 check nodes, for the (10, 5) LDPC code with the H-matrix in (1) [20].

H = (1)
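The H-matrix-to-Tanner-graph mapping described above can be made concrete with a small helper. The sketch below is our own illustration (not code from the paper) and uses a toy 2x4 matrix rather than the (10, 5) code of Fig. 1: a '1' at row i, column j of H creates an edge between check node i and variable node j.

```c
#include <stddef.h>

/* A '1' at H[i*n + j] connects check node i to variable node j in the
   Tanner graph.  This helper counts the total number of edges and the
   degree of each check node (illustrative, not from the paper). */
int tanner_edges(const int *H, int m, int n, int *cn_deg) {
    int edges = 0;
    for (int i = 0; i < m; ++i) {
        cn_deg[i] = 0;
        for (int j = 0; j < n; ++j)
            if (H[i * n + j]) { ++cn_deg[i]; ++edges; }
    }
    return edges;
}
```

The sparsity of H is what keeps the edge count, and hence the decoding work per iteration, low.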
Fig. 1: Tanner graph of the H-matrix in (1)

The general decoding algorithm of LDPC codes is based on the standard two-phase message passing (TPMP) principle described in [11]. This algorithm works in two phases. In the first phase, all the variable nodes send messages to their neighboring parity check nodes, and in the second phase, the parity check nodes send messages to their neighboring variable nodes. One practical variant of the message passing algorithms is the Min-Sum algorithm, which is preferred by designers [13]. The general steps taken in the Min-Sum algorithm are provided in Algorithm 1. In Algorithm 1, LLR stands for log-likelihood ratio, and CN_m and VN_n denote the mth check node and the nth variable node, respectively.

Algorithm 1 Min-Sum algorithm
  Loop 1: Initialization
  for all t = 1 → (Max Iterations) do
    Loop 2: LLR of message CN_m to VN_n
    Loop 3: LLR of message VN_n to CN_m
    Loop 4: Hard decision from soft-values
  end for

One major drawback of Algorithm 1 is that Loops 2 and 3 are updated by separate processing stages whose results are passed to each other iteratively. This means that the update loop of the variable nodes does not start until all check nodes are updated. This characteristic limits the efficiency of parallel implementations of the algorithm.

Due to the poor parallel mapping of the Min-Sum algorithm, more efficient schedules, such as the horizontal layered decoding algorithm, have been proposed; these allow updated information to be utilized more quickly, thus speeding up decoding [18], [22]. In fact, the H-matrix can be viewed as a layered graph that is decoded sequentially. The work in [17] applied a form of layered belief propagation to irregular LDPC codes to reach faster convergence for a given error rate; this method also reduces the memory bit usage. The layered decoding algorithm is denoted as Algorithm 2 and can be summarized as follows:
1) All check node computations are performed using the variable node messages linked to them.
2) Once a check node is calculated, the corresponding variable nodes are updated immediately after receiving its messages.
3) This process is repeated up to the maximum number of iterations.

In this paper, we propose a multi-stream structure for implementing the layered decoding of LDPC codes on the GPU device of a mobile processor with high throughput and low latency. By using the GPU device as the processing unit, significantly fewer resources of the ARM processor are used for decoding compared to similar work in [13]. Thus, the ARM processor retains more processing power for other applications running on the device. On the other hand, since the GPU and ARM of a mobile device sit on the same die, the latency issues in [17] are improved.

III. ALGORITHM MAPPING
An efficient implementation of the layered decoding algorithm is a challenging task. The main programming drawbacks of this algorithm are as follows:
1) The ratio of computations to memory accesses is low.
2) The data reuse between consecutive computations is low.
3) It requires a large number of random memory accesses due to the sparse nature of the H-matrix [4].
Therefore, a software-based decoder should take advantage of the different parallelism levels offered by the target architecture to achieve high throughput efficiency. In this section, we detail the different parallelism levels, the target architecture, and the structure of the proposed algorithm.
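As a concrete reference for the discussion that follows, the check-node half of the update (Loop 2 of Algorithm 1, also the core of the layered schedule) can be sketched in scalar form. This is our own illustration under a simplified flat-array layout, not the authors' optimized GPU kernel: the message sent to each variable node is the product of the signs of all other incoming LLRs times the minimum of their magnitudes.

```c
#include <math.h>

/* Min-Sum check-node update for one check node of degree 'deg'.
   llr_in holds the incoming LLRs from the connected variable nodes;
   msg_out receives the extrinsic message back to each of them. */
void checknode_update(const float *llr_in, float *msg_out, int deg) {
    for (int n = 0; n < deg; ++n) {
        float sign = 1.0f, min_mag = INFINITY;
        for (int k = 0; k < deg; ++k) {
            if (k == n) continue;               /* exclude the target edge */
            if (llr_in[k] < 0.0f) sign = -sign;
            float mag = fabsf(llr_in[k]);
            if (mag < min_mag) min_mag = mag;
        }
        msg_out[n] = sign * min_mag;
    }
}
```

In the layered schedule, each check node's messages are applied to the connected variable nodes immediately, rather than only after a full pass over all check nodes; this is what lets updated information propagate faster.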
A. Parallelism Levels in the Proposed Algorithm
To achieve high throughput, a software-based LDPC decoder has to exploit computational parallelism to take advantage of multi-core architectures. Different parallelism levels are present in a layered decoding algorithm:
1) The first parallelism level is located inside the check node computations. Executing such computations in parallel is possible. However, this atomic parallelism level is hard to exploit due to the low complexity of the computations. Alternatively, two check node computations can be done in parallel if there is no data dependency between them. Since this is rarely the case, this level is hard to exploit and inefficient.
2) The second parallelism level is located at the frame level (complete executions of Algorithm 2). The same computation sequence is executed over consecutive frames. This approach provides an efficient parallel processing algorithm.
Hence, here, we use the SIMD programming model to decode F frames in parallel. In subsection III-C, the parallel decoding of F frames is referred to as kernel 2 for the sake of simplicity.

B. Data Interleaving/Deinterleaving
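The reordering performed in this subsection is essentially a transpose between a frame-major layout and a symbol-major layout. A minimal sketch of the idea from [4] follows; the function name and flat-array layout are our own illustrative choices, not the paper's implementation.

```c
#include <stddef.h>

/* Interleave F frames of N LLRs each.  The input stores frame 0's N
   values, then frame 1's, and so on; the output stores symbol 0 of all
   F frames contiguously, then symbol 1, etc.  The decoder can then read
   the same VN of all F frames with one aligned, contiguous access.
   Deinterleaving is the inverse transpose. */
void interleave(const float *in, float *out, size_t F, size_t N) {
    for (size_t f = 0; f < F; ++f)
        for (size_t n = 0; n < N; ++n)
            out[n * F + f] = in[f * N + n];
}
```

The cost of this transpose is paid once before and once after decoding, in exchange for contiguous accesses inside every decoding iteration.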
Recall that the implementation of parallel frame processing is subject to massive irregular memory accesses due to the structure of the H-matrix. To process the same VN_n element of the F frames at the same time, non-contiguous memory accesses would degrade performance. To solve this issue, a data interleaving process has to be performed before, and a deinterleaving process after, the decoding stage to ensure that each set of F frames is reordered into an aligned memory data structure. We use the same procedure as in [4], and the reordering is shown in Fig. 2. In the proposed structure, the interleaving and deinterleaving of frames are called kernel 1 and kernel 3, respectively.

Fig. 2: Data interleaving/deinterleaving process [4]

C. Multi Stream Parallelism
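The stream layout described in this subsection pairs each GPU stream with one host pthread. The CPU-only sketch below is our own illustration of that worker structure: the per-stream kernel launches are replaced by a stub so the threading skeleton can run anywhere, and the names and sizes are illustrative.

```c
#include <pthread.h>

enum { W = 3, F = 4 };  /* streams/workers and frames per stream (illustrative) */

typedef struct {
    const float *frames;  /* this worker's F input frames (one value each here) */
    float *out;           /* this worker's F decoded outputs */
} worker_arg;

/* In the real decoder each worker would enqueue kernel 1 (interleave),
   kernel 2 (decode), and kernel 3 (deinterleave) on its own CUDA stream;
   here the whole pipeline is stubbed as a copy. */
static void *worker(void *p) {
    worker_arg *a = (worker_arg *)p;
    for (int f = 0; f < F; ++f)
        a->out[f] = a->frames[f];      /* stand-in for the decode pipeline */
    return NULL;
}

/* Launch W workers over a W*F frame buffer and wait for all of them,
   so W*F frames are processed concurrently. */
void decode_all(const float *in, float *out) {
    pthread_t tid[W];
    worker_arg args[W];
    for (int w = 0; w < W; ++w) {
        args[w].frames = in + w * F;
        args[w].out = out + w * F;
        pthread_create(&tid[w], NULL, worker, &args[w]);
    }
    for (int w = 0; w < W; ++w)
        pthread_join(tid[w], NULL);
}
```

Because each worker owns a disjoint slice of the buffer, no locking is needed between streams, mirroring the independence of the W sets of frames.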
The SIMT programming model is used to decode W sets of F frames concurrently, with W denoting the number of concurrent streams on the GPU device. This multi-core programming is specified by the CUDA API. Each GPU stream is controlled by a pthread, called a worker, on the host machine (which is an ARM processor in this case). Each worker is responsible for its own set of frames. By using stream-based processing, the system can decode W × F frames at the same time. The whole LDPC decoder system model is shown in Fig. 3.

Fig. 3: LDPC decoder data flow

IV. EXPERIMENTAL RESULTS
The experiments were carried out by decoding LDPC codes using the NVIDIA Tegra K1 SoC and various other platforms to show scalability. The programs were compiled with GCC 4.8 and CUDA 6.5. The TK1 is composed of four Cortex-A15 ARM processors and one NVIDIA Kepler "GK20a" GPU with 192 SM3.2 CUDA cores. The host platform runs a GNU/Linux kernel 3.10.40.
A. Performance Evaluation of the Proposed Algorithm
The first set of experiments evaluates the decoding throughput of LDPC codes with different frame lengths. The results are provided in Fig. 5 for the cases where one or three threads are used to handle one or three GPU streams, respectively. Measurements are performed for LDPC decoders that execute 10 layered decoding iterations.

Fig. 4: Tegra-TK1 development board

Three-stream decoding achieves a considerably higher aggregate throughput than single-stream decoding (Fig. 5). For a (4000, 2000) LDPC code, the measured data transfer, interleaving, and decoding times remain essentially unchanged when moving from one thread to three threads. Therefore, introducing more streams to the GPU device does not degrade its performance. In comparison with [17], the latency, i.e., the time for data transfer between the host and the GPU device, is reduced because of the architecture of the embedded mobile device. On the other hand, by introducing three streams to the GPU, its processing capacity is used more effectively, which results in a noticeable throughput improvement in most of our experiments.

Fig. 5: Measured throughputs for 10 layered decoding iterations

B. Performance Comparison with Related Works
To demonstrate the efficiency of the proposed decoder, its throughput was compared to that of the ARM-based related work in [13]. In [13], ARM SIMD units are used to perform vector data processing in parallel frame decoding. In the experiment, the throughput of the proposed decoder is compared to that of [13] while using one thread for the work in [13] and three threads for the proposed algorithm. This selection is motivated by the fact that the single thread from [13] fully uses one core (100%), while the three threads of the proposed algorithm each use only a fraction of a core, resulting in a lower overall CPU utilization. Decoding performed on the Tegra-K1 board gives the results shown in Table I. The work in [13] can achieve much higher throughputs by using more threads on the ARM processor, but each added thread consumes the whole capacity of one more ARM core. Table I shows that the proposed algorithm achieves throughput similar to that of [13] while using only a fraction of the ARM processing power together with its GPU device. Moreover, with a more powerful GPU device, the algorithm can achieve much higher throughputs, as shown in the next subsection. This shows that the proposed algorithm is scalable across platforms.

TABLE I: Throughput (Mbps) Comparison With Related Work

Code          ARM decoder [13], 1 thread        Proposed decoder, 3 threads
              Mbps     Processes used           Mbps     Processes used
(4000,2000)   35       100%                     34.5
(8000,4000)   34       100%                     33
C. Performance Comparison on Different GPU Devices
GPU devices have different characteristics, such as the number of streaming multiprocessors, the number of CUDA cores, and the operating frequency. A GPU-based algorithm should have the scalability to use all the processing capability of a GPU device. The proposed algorithm has been executed on multiple GPU devices: the GT540M and K620 are considered mid-range devices, while the GTX680 and Tesla K20 are considered high-power GPU devices. The algorithm is executed for three codes: (576, 288), (2304, 1152), and (4000, 2000). The performance is shown for 10 and 5 iterations in Fig. 6 and Fig. 7, respectively. These figures show that the proposed algorithm achieves high throughputs across devices. In this set of experiments, an x86 CPU is the host.

V. CONCLUSION
A stream-based approach for GPU-based LDPC decoding on embedded devices was introduced in this paper. The algorithm is based on running multiple concurrent kernels on GPU devices to utilize their processing capacity while freeing up resources on the ARM processor of mobile devices. Our results show that this approach helps achieve higher throughputs on embedded mobile devices. Experimental results demonstrate that the proposed algorithm is scalable and can achieve high throughputs on multiple GPU devices. Moreover, the proposed algorithm structure provides a trade-off that the operating system can exploit to balance performance and resource management by selecting the number of streams used for decoding.
Fig. 6: Measured throughputs (Mbps) on the K620, GT540M, Tesla K20, and GTX680 for the 10-iteration experiment: (a) code (576,288), (b) code (2304,1152), (c) code (4000,2000)

Fig. 7: Measured throughputs (Mbps) on the K620, GT540M, Tesla K20, and GTX680 for the 5-iteration experiment: (a) code (576,288), (b) code (2304,1152), (c) code (4000,2000)
REFERENCES

[1] R. Gallager, "Low-density parity-check codes," IRE Trans. Inform. Theory, vol. 8, no. 1, pp. 21–28, Jan. 1962.
[2] D. J. C. MacKay and R. M. Neal, "Near Shannon limit performance of low density parity check codes," Electronics Letters, vol. 33, no. 6, pp. 457–458, Mar. 1997.
[3] S.-Y. Chung, G. D. Forney, T. J. Richardson, and R. Urbanke, "On the design of low-density parity-check codes within 0.0045 dB of the Shannon limit," IEEE Commun. Lett., vol. 5, no. 2, pp. 58–60, Feb. 2001.
[4] B. L. Gal and C. Jego, "High-throughput multi-core LDPC decoders based on x86 processor," IEEE Trans. Parallel Distrib. Syst., vol. PP, no. 99, pp. 1–1, May 2015.
[5] S. Kang and J. Moon, "Parallel LDPC decoder implementation on GPU based on unbalanced memory coalescing," in Proc. IEEE Int. Conf. Communications (ICC), June 2012, pp. 3692–3697.
[6] J. Andrade, G. Falcao, and V. Silva, "Flexible design of wide-pipeline-based WiMAX QC-LDPC decoder architectures on FPGAs using high-level synthesis," Electronics Letters, vol. 50, no. 11, pp. 839–840, May 2014.
[7] Y. Hou, R. Liu, H. Peng, and L. Zhao, "High throughput pipeline decoder for LDPC convolutional codes on GPU," IEEE Commun. Lett., vol. 19, no. 12, pp. 2066–2069, Dec. 2015.
[8] J.-Y. Park and K.-S. Chung, "Parallel LDPC decoding using CUDA and OpenMP," EURASIP J. Wireless Commun. Netw., vol. 2011, no. 1, pp. 1–8, Nov. 2011.
[9] S. Grönroos, K. Nybom, and J. Björkqvist, "Efficient GPU and CPU-based LDPC decoders for long codewords," Analog Integrated Circuits and Signal Processing, vol. 73, no. 2, pp. 583–595, 2012. [Online]. Available: http://dx.doi.org/10.1007/s10470-012-9895-7
[10] S. Grönroos and J. Björkqvist, "Performance evaluation of LDPC decoding on a general purpose mobile CPU," in Proc. IEEE GlobalSIP, Dec. 2013, pp. 1278–1281.
[11] G. Falcao, L. Sousa, and V. Silva, "Massively LDPC decoding on multicore architectures," IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 2, pp. 309–322, Feb. 2011.
[12] G. Wang, M. Wu, B. Yin, and J. R. Cavallaro, "High throughput low latency LDPC decoding on GPU for SDR systems," in Proc. IEEE GlobalSIP, Dec. 2013, pp. 1258–1261.
[13] B. L. Gal and C. Jego, "High-throughput LDPC decoder on low-power embedded processors," IEEE Commun. Lett., vol. 19, no. 11, pp. 1861–1864, Nov. 2015.
[14] H. Kim and R. Bond, "Multicore software technologies," IEEE Signal Processing Mag., vol. 26, no. 6, pp. 80–89, Nov. 2009.
[15] B. Chapman, G. Jost, and R. van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press, 2007.
[16] M. Deilmann, "A guide to auto-vectorization with Intel C++ compilers," Intel Corporation, Apr. 2012.
[17] B. L. Gal, C. Jego, and J. Crenne, "A high throughput efficient approach for decoding LDPC codes onto GPU devices," IEEE Embedded Systems Letters, vol. 6, no. 2, pp. 29–32, June 2014.
[18] B. L. Gal and C. Jego, "GPU-like on-chip system for decoding LDPC codes," ACM Trans. Embed. Comput. Syst., vol. 13, no. 4, pp. 95:1–95:19, Mar. 2014.
[19] G. Falcao, V. Silva, L. Sousa, and J. Andrade, "Portable LDPC decoding on multicores using OpenCL [Applications Corner]," IEEE Signal Processing Mag., vol. 29, no. 4, pp. 81–109, July 2012.
[20] D. J. Costello Jr., "An introduction to low-density parity check codes," 2009.
[21] W. Ryan and S. Lin, Channel Codes: Classical and Modern. Cambridge University Press, 2009.
[22] D. E. Hocevar, "A reduced complexity decoder architecture via layered decoding of LDPC codes," in Proc. IEEE SiPS.