BRDS: An FPGA-based LSTM Accelerator with Row-Balanced Dual-Ratio Sparsification
Seyed Abolfazl Ghasemzadeh, Erfan Bank Tavakoli, Mehdi Kamal, Ali Afzali-Kusha, Massoud Pedram
Abstract — In this paper, first, a hardware-friendly pruning algorithm for reducing the energy consumption and improving the speed of Long Short-Term Memory (LSTM) neural network accelerators is presented. Next, an FPGA-based platform for efficient execution of networks pruned with the proposed algorithm is introduced. By considering the different sensitivities of the two weight matrices of the LSTM model to pruning, different sparsity ratios (i.e., dual-ratio sparsity) are applied to these weight matrices. To reduce memory accesses, a row-wise sparsity pattern is adopted. The proposed hardware architecture makes use of computation overlapping and pipelining to achieve low power consumption and high speed. The effectiveness of the proposed pruning algorithm and accelerator is assessed on benchmarks for natural language processing, binary sentiment classification, and speech recognition. Results show that, for example, compared to a recently published work in this field, the proposed accelerator provides up to 272% higher effective GOPS/W, and the perplexity error is reduced by up to 1.4% for the PTB dataset.
Index Terms — LSTM neural network, pruning, FPGA, energy efficiency, accuracy.
1 INTRODUCTION
For applications that require processing time-dependent sequences of data, such as speech recognition [1] and natural language processing (NLP) [2], Recurrent Neural Networks (RNNs) and, more specifically, Long Short-Term Memory (LSTM) neural networks [3] have been introduced. These networks create high computational loads and have to cope with resource limitations when they are implemented on hardware platforms such as FPGAs [4]. The limited number of computational resources (e.g., the number and type of available DSP blocks) and memory resources (e.g., the size and access speed of embedded block memories) in FPGAs makes the implementation of LSTM networks challenging. To overcome this problem, several works that invoke resource sharing along with timing optimization have been suggested in the literature. Examples of these research efforts include balancing the memory bandwidth and the internal storage utilization [5], optimizing computational performance and communication requirements [6], overlapping LSTM computations with memory accesses [4], and overlapping internal computations of the LSTM architecture [7].

The weight pruning technique reduces storage and computational costs by eliminating redundant elements in the weight matrices. Reducing the model size, however, does not necessarily lead to a more efficient hardware implementation, because fetching unstructured pruned data may require high memory bandwidth due to random accesses to the memory. This implies that unstructured pruning may limit the energy and performance gains that are achievable by model reduction and pruning [8]. Accordingly, when reducing the model size, the designer should try to minimize the accuracy loss due to the pruning while providing a sparsity pattern compatible with the hardware architecture.

The lower improvement achieved by unstructured pruning is due to the fact that, while the weight matrices are sparse, the input vector is dense, yielding a rather low improvement in matrix-vector multiplication (MxV). Since some elements in a row of the sparse matrix are zero, their products with the corresponding elements of the dense vector are zero and do not contribute to the final sum. To avoid these useless computations, we need to access only the non-zero elements, which requires irregular (random) accesses to memory and translates into inefficient utilization of the memory bandwidth and processing elements (PEs) of the NN accelerator. The irregularity in the positions of non-zero elements in the weight matrices makes variation in the number of PEs needed for each row of the matrix inevitable, reducing the efficiency of using the processing elements.

In this work, first, a Balanced Row Dual-ratio Sparsity-inducing pruning algorithm (called BRDS) is presented. In this algorithm, the input and recurrent weight matrices of the LSTM are pruned with different sparsity ratios, resulting in lower accuracy loss while providing the opportunity for more weight pruning for a given target accuracy level. These two sets of matrices have different sensitivities to the pruning owing to their different contributions to the final results of the LSTM model. To lower the required memory bandwidth, the pruning is performed in a row-wise manner. Moreover, since the number of non-zero elements in each row of the sparse matrices is known at design time, the number of PEs required to process a row can be determined. Both of these features provide an efficient hardware implementation of sparse matrix-vector multiplication (SpMxV) while creating regular memory accesses for performing the operation. Next, we describe the BRDS accelerator, which is an FPGA-based, row-balanced, dual-ratio sparsity-aware, low-power and high-performance architecture for LSTM networks. The accelerator takes full advantage of the efficiency of the proposed pruning algorithm. In this accelerator, to minimize the overhead of storing the positions of non-zero values in the rows of sparse matrices, a relative addressing method is exploited [22]. The contributions of this paper are given below:
• Devising a row-balanced dual-ratio sparsity algorithm for improving the accuracy of LSTM models while considering the hardware implementation (the BRDS algorithm).
• Presenting a low-energy yet high-speed FPGA-based hardware accelerator based on the above pruning algorithm for facilitating the implementation of BRDS-based sparse models.

The remainder of the paper is organized as follows. Section 2 provides basic concepts of LSTM as well as a review of prior work on FPGA-based LSTM architectures. The proposed row-balanced dual-ratio sparsity algorithm is presented in Section 3. Section 4 provides the details of the proposed hardware accelerator. In Section 5, the efficacies of the proposed algorithm and accelerator are evaluated, and finally, the paper is concluded in Section 6.

————————————————
• S. A. Ghasemzadeh, M. Kamal, and A. Afzali-Kusha are with the School of Electrical and Computer Engineering, University of Tehran, Tehran 14399-57131, Iran. E-mail: [email protected]; [email protected]; [email protected].
• E. Bank Tavakoli is with the School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85281, USA. E-mail: [email protected].
• M. Pedram is with the Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089, USA. E-mail: [email protected].
2 LSTM BASIC CONCEPTS AND RELATED WORK
In this section, first, the internal structure of the considered LSTM network layer is described, and then prior work dealing with the LSTM implementation on FPGA platforms, along with several LSTM pruning and compression algorithms, is briefly reviewed.
In this work, we use the LSTM network of [7], which has a simple structure with acceptable output accuracy. A layer of this network consists of cells (to store prior information) and gates (i.e., f_t, i_t, g_t, and o_t) that control whether to remember or forget the prior information (i.e., c_{t-1}), the inputs (i.e., x_t), and the outputs of the previous time step (i.e., h_{t-1}). Based on this, an LSTM layer may be described as [7]:

f_t = sig(W_fx x_t + W_fh h_{t-1} + b_f)
i_t = sig(W_ix x_t + W_ih h_{t-1} + b_i)
g_t = tanh(W_gx x_t + W_gh h_{t-1} + b_g)
o_t = sig(W_ox x_t + W_oh h_{t-1} + b_o)      (1)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)      (2)

where tanh and sig (i.e., Sigmoid) are logistic activation functions and ⊙ denotes pointwise (element-wise) multiplication. Also, W and b denote the weight matrices and bias vectors, respectively. Gates f, i, g, and o correspond to the forget, input, candidate cell, and output gates, respectively. Weight matrices (i.e., W_x and W_h) and a bias vector are determined for each gate (e.g., the weight matrices and bias vector for gate f are denoted as W_fx, W_fh, and b_f, respectively). In addition, t denotes the current time step. The size of vector x is X, while the sizes of vectors h, b, and c are H. Accordingly, the sizes of matrices W_x and W_h are H × X and H × H, respectively. The activation functions perform element-wise computations on their input vectors. The internal structure of the considered LSTM layer is shown in Fig. 1.

Fig. 1. The internal structure of the considered LSTM layer.
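For concreteness, the following is a minimal NumPy sketch of one time step of this layer, transcribing (1) and (2) directly; the dictionary-based weight layout and the random test values are illustrative conventions of ours, not the data layout used by the accelerator.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
        # Wx[g]: (H, X) input weights, Wh[g]: (H, H) recurrent weights,
        # b[g]: (H,) biases, for each gate g in {'f', 'i', 'g', 'o'}.
        f = sigmoid(Wx['f'] @ x_t + Wh['f'] @ h_prev + b['f'])  # forget gate
        i = sigmoid(Wx['i'] @ x_t + Wh['i'] @ h_prev + b['i'])  # input gate
        g = np.tanh(Wx['g'] @ x_t + Wh['g'] @ h_prev + b['g'])  # candidate cell
        o = sigmoid(Wx['o'] @ x_t + Wh['o'] @ h_prev + b['o'])  # output gate
        c = f * c_prev + i * g        # pointwise operations of Eq. (2)
        h = o * np.tanh(c)
        return h, c

    # Example with input size X = 4 and hidden size H = 3
    X, H = 4, 3
    rng = np.random.default_rng(0)
    Wx = {g: rng.standard_normal((H, X)) for g in 'figo'}
    Wh = {g: rng.standard_normal((H, H)) for g in 'figo'}
    b = {g: np.zeros(H) for g in 'figo'}
    h_t, c_t = lstm_step(rng.standard_normal(X), np.zeros(H), np.zeros(H), Wx, Wh, b)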
The compression of the network model can lead to speed and energy-efficiency improvements in the inference phase [10]. The improvements are achieved by reducing the memory usage and bandwidth and the computational requirements of NN-based inference. Well-known compression techniques include pruning [11], sparsity-inducing regularization [12], and quantization [4]. Several structured sparsity methods have been proposed in prior studies. The proposed algorithms impose constraints on the locality of non-zero weights to limit the scattering of the zero weights in the weight matrices [8], [13]. Compared to unstructured sparsity, accelerating structured sparsity with special hardware is more feasible and affordable.

The LSTM architecture proposed in [4] utilized weight compression and pruning techniques to increase speed and energy efficiency. The gains (in terms of energy efficiency and computational speed) were obtained at the cost of considerable hardware resource usage. The method proposed in [14] reduced the LSTM network size and controlled the network irregularity. It made use of block-circulant matrices [15] (i.e., circulant submatrices of arbitrary size) and applied the FFT algorithm to accelerate the compute-intensive circulant convolution operations. Variable submatrix sizes provided a tradeoff between the compression ratio and the accuracy degradation.

In [16], first, an algorithm for reducing the computations of the Gated Recurrent Unit (GRU) network was suggested. The algorithm induced sparsity in the inputs and activations, thereby lowering the computations. Next, an accelerator architecture, called DeltaRNN, which skips the update of an RNN when the input changes are below a certain threshold, was presented. In [9],
Bank-Balanced Sparsity (BBS), which partitions each weight matrix row into banks for parallel computing, was proposed. This method adopts fine-grained pruning inside each bank to maintain the model accuracy. The architecture, which was fully parallel, had the drawback of using the same PEs for the forward and recurrent weights.

The proposed BRDS hardware architecture, which is an extension of the POLAR accelerator in [7] designed for the inference phase of dense networks, has the ability to support sparse LSTM networks. By utilizing parallel modules and an addressing technique, sparse operations are performed efficiently, and a higher efficacy is achieved for BRDS compared to POLAR.
3 ROW-BALANCED DUAL-RATIO SPARSITY PRUNING
Fig. 2 illustrates an original matrix and the matrices pruned with a sparsity ratio of 50%. Fine-grained pruning simply omits the smallest 50% of the weights globally, which leads to an unstructured sparse matrix (Fig. 2(b)). Block sparsity induces a block sparse matrix (Fig. 2(c)) by setting the block size to m × m (2 × 2 in this example) and taking the block average as the block representative (the metric for pruning the blocks). Bank-balanced pruning induces a bank-balanced sparse matrix (Fig. 2(d)) by splitting each row into two equal-sized banks and applying fine-grained pruning inside each bank independently. In this work, we propose row-balanced sparsity, whereby the same number of elements is pruned from every row of a given weight matrix. The row-balanced sparse version of the matrix in Fig. 2(a) is shown in Fig. 2(e), where the smallest 50% of the elements in each row have been removed.

Fig. 2. Comparing different pruning methods with row-balanced sparsity: (a) original dense matrix; (b) unstructured sparse matrix by global pruning; (c) block sparse matrix by pruning 2×2 blocks according to the block average; (d) bank-balanced sparse matrix by local pruning inside each 1×4 bank; (e) row-balanced sparse matrix.

The pseudo-code of the row-balanced sparsity is shown in Fig. 3. The inputs to this algorithm are the weight matrix and the expected sparsity ratio, while the output is the pruned matrix. The algorithm prunes each row separately: based on the defined sparsity ratio (Spar%), the elements of the row are pruned in order from the smallest to the largest values.

Input: The matrix to be pruned, W; the expected sparsity, Spar%;
Output: The pruned matrix, W_p;
1: for each W_i ∈ W.rows do
2:     Prune the smallest Spar% of W_i;
3: end for
4: return the pruned matrix, W_p;
Fig. 3. The Row-Balanced Pruning Algorithm.
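A minimal NumPy sketch of this row-balanced pruning step follows (the function and variable names are ours, not the paper's); it zeroes the smallest-magnitude Spar% of the weights in every row, so all rows retain the same non-zero count.

    import numpy as np

    def row_balanced_prune(W, spar):
        # Zero the `spar` fraction of smallest-magnitude weights in each
        # row, so every row keeps the same number of non-zeros (Fig. 3).
        W_p = W.copy()
        n_prune = int(round(spar * W.shape[1]))  # elements dropped per row
        for row in W_p:
            drop = np.argsort(np.abs(row))[:n_prune]  # smallest |w| in row
            row[drop] = 0.0
        return W_p

    # Example: 50% row-balanced sparsity, as in Fig. 2(e)
    W = np.random.default_rng(1).standard_normal((4, 8))
    print(row_balanced_prune(W, 0.5))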
In LSTM networks, the proposed accelerator can consider two different sparsity ratios for the two different sets of weights (i.e., W_x and W_h). The sparsity ratios are denoted by Spar_h and Spar_x for W_h and W_x, respectively. Applying the aforesaid row-balanced sparsity approach, every feed-forward (recurrent) weight matrix, i.e., W_fx, W_ix, W_gx, and W_ox (W_fh, W_ih, W_gh, and W_oh), will have the same number of non-zero elements per row. Choosing different sparsity ratios for the feed-forward and recurrent weights alleviates the accuracy degradation resulting from the pruning process; this originates from the fact that the algorithm decides which weights are more important to keep. As an example, Fig. 4 shows the effect of having two different sparsity ratios on the accuracy of an LSTM model. The dataset here is PTB [17] with an input size of 1,500, and an overall sparsity (OS) of 65% is considered. When Spar_x and Spar_h were both set to 65%, the perplexity (a metric widely used in NLP) became large, while the best perplexity was achieved when we set Spar_h = 60% and Spar_x = 70%. A low value of the perplexity indicates a well-trained LSTM network [20]; more information about the perplexity metric may be found in [21].

Fig. 4. The effect of dual-ratio sparsity on the perplexity of the PTB dataset.

Similar to previous pruning methods (see, e.g., [4], [9]), we apply the row-balanced pruning method iteratively to a pre-trained network and retrain the network after each pruning iteration to partially restore the model accuracy. Since two matrices should be pruned simultaneously with different sparsity ratios, and the sensitivities of the output quality to the sparsity ratio differ between the two matrices, we determine the sparsity ratios considering an overall sparsity target provided by the designer. The (minimum) sparsity target is determined based on the number of weights that the designer wants to store in the on-chip memories. The goal, therefore, is to achieve the best model accuracy given a designer-specified lower bound on the sparsity factor. Since the values of Spar_x and Spar_h cannot be obtained directly from the value of OS, owing to their dependency on the dataset, these values should be determined by exploring their different possible combinations, as shown in Fig. 4.

Based on the above discussion, a heuristic algorithm for pruning the W_x and W_h weight matrices is presented. The pseudo-code of the pruning method, which induces row-balanced dual-ratio sparsity, is shown in Fig. 5. It iteratively explores the search space to find the best sparsity ratios (i.e., the best values for Spar_h and Spar_x). In each pruning iteration, based on the considered sparsity ratios, the weights with small importance are dropped; the importance of a weight is represented by its ranking within the row, which is dictated by its absolute value. To reduce the accuracy loss due to pruning, in each iteration, the pruned network is retrained to determine the accuracy corresponding to the chosen sparsity ratios. For retraining, we freeze the weights that are set to zero (i.e., the dropped ones) and tune the other network weights.

In the proposed algorithm, to lower the accuracy loss due to the pruning, in the first step, we increase the pruning ratios (i.e., Spar_h and Spar_x) gradually, with the same step size (α), from zero to the predefined overall sparsity (OS). The pruned network at this point is considered the initial point for searching; we denote it as NN_P,I. Next, to explore the search space, one of the sparsity ratios (e.g., Spar_h) is increased by a predefined step (δ_h) while the other one (e.g., Spar_x) is decreased by its predefined step (δ_x). In each iteration, the chosen tuple of sparsity ratios is applied to the pruned network of the previous iteration. Altering the sparsity ratios continues until one of them reaches 0 or 100%. Next, this process is repeated (starting again from NN_P,I) considering the opposite direction for the sparsity ratios. For each chosen tuple of sparsity ratios, the accuracy of the network is determined; at the end, the algorithm returns the best tuple.

To generate the sparsity ratios with the maximum model accuracy (Spar_x,MA and Spar_h,MA), the BRDS algorithm is executed only once, and the pruned network is then used for many inference runs, which amortizes the cost of the retraining algorithm. The execution time of the algorithm depends on OS, α, δ_x, δ_h, ept (the time needed for each epoch), and n_re (the number of epochs needed for the retraining). Assuming the pretrained network is available, the formulas below give the execution time of the algorithm. The parameter ept depends on both the size of the model being pruned and the hardware used to run the algorithm. The parameters ex_1, ex_2, ex_3, and ex_tot denote the execution times of lines 1-6, 7-14, 15-24, and the whole algorithm (Fig. 5), respectively:

ex_1 = (OS/α) × ept × n_re      (3)
ex_2 = min((100 − OS)/δ_x, OS/δ_h) × ept × n_re      (4)
ex_3 = min((100 − OS)/δ_h, OS/δ_x) × ept × n_re      (5)
ex_tot = ex_1 + ex_2 + ex_3      (6)
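As a quick numerical illustration with assumed values (not taken from the paper): for OS = 87.5%, α = 2.5%, and δ_x = δ_h = 2.5%, lines 1-6 perform OS/α = 35 prune-retrain iterations, and each directional sweep performs min((100 − 87.5)/2.5, 87.5/2.5) = 5 iterations, so ex_tot = (35 + 5 + 5) × ept × n_re = 45 × ept × n_re.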
4 HARDWARE ARCHITECTURE
As discussed before, unstructured sparsity leads to unbalanced computations as well as irregular memory accesses. To take advantage of the structured sparsity introduced by the BRDS algorithm, an efficient LSTM hardware accelerator, called the BRDS LSTM accelerator, is presented next. The internal structure of the accelerator, which is shown in Fig. 6, is based on the POLAR accelerator of [7], with some modifications to support dual sparsity. The BRDS accelerator consists of seven main modules: DRAM Controller, Embedded Memory, Address Decoder, Gate, Function, Buffer, and LSTM Controller. The bit width of the datapath is n, and the data is represented in fixed-point two's-complement n-bit binary format.

DRAM Controller performs load and store instructions related to the off-chip DRAM. A load instruction occurs when data should be read from the off-chip DRAM and written into the on-chip memories. Similarly, when outputs are ready, DRAM Controller reads the data and stores them in the off-chip DRAM. It should be mentioned that, in most cases, the weights fit entirely in the FPGA embedded memories; in the cases where, even with pruning, the data cannot fit in the FPGA memories, the proposed accelerator utilizes this module as the interface to the off-chip DRAM. Note that only the non-zero elements of the sparse matrices are fetched, consecutively, to efficiently utilize the off-chip memory bandwidth.
The module Gate (see Fig. 6) includes two Mult Arrays (MAs) working concurrently, an MA Selector, a Tree Adder, and an Accumulator. Since there are two different pruning ratios (Spar_h and Spar_x), we consider two MAs with different sizes, named Small and Large.
Input: The weights of the LSTM layer to be pruned, W_x and W_h; the expected overall sparsity, OS;
Output: The maximum model accuracy, MA; the sparsity ratios with the maximum model accuracy, Spar_x,MA and Spar_h,MA;
1:  set Spar_x and Spar_h to 0;
2:  while Spar_x < OS and Spar_h < OS do
3:      Increase Spar_x and Spar_h by α;
4:      Prune W_x and W_h;
5:      Retrain the network;
6:  Save the pruned network as NN_P,I;
7:  while Spar_x < 100% and Spar_h > 0% do
8:      Increase Spar_x by δ_x;
9:      Decrease Spar_h by δ_h;
10:     Prune W_x and W_h;
11:     Retrain the network and save the model accuracy to Acc;
12:     if Acc > MA do
13:         MA = Acc;
14:         (Spar_x,MA, Spar_h,MA) = (Spar_x, Spar_h);
15: Load the pruned network NN_P,I;
16: while Spar_x > 0% and Spar_h < 100% do
17:     Decrease Spar_x by δ_x;
18:     Increase Spar_h by δ_h;
19:     Prune W_x and W_h;
20:     Retrain the network and save the model accuracy to Acc;
21:     if Acc > MA do
22:         MA = Acc;
23:         (Spar_x,MA, Spar_h,MA) = (Spar_x, Spar_h);
24: return MA, (Spar_x,MA, Spar_h,MA);
Fig. 5. The BRDS Algorithm.
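For readability, the search of Fig. 5 can also be rendered compactly in Python. The sketch below rests on stated assumptions: prune_and_retrain is a hypothetical callback that prunes W_x and W_h in a row-balanced manner at the given ratios, retrains the network from its current state, and returns the model accuracy; the checkpointing and reloading of NN_P,I (lines 6 and 15) are elided.

    def brds_search(prune_and_retrain, OS, alpha, dx, dh):
        # Lines 1-6: grow both ratios together up to the overall target
        # OS, yielding the initial pruned network NN_P,I.
        spar_x = spar_h = 0.0
        while spar_x < OS and spar_h < OS:
            spar_x += alpha
            spar_h += alpha
            prune_and_retrain(spar_x, spar_h)

        best_acc, best = float('-inf'), (spar_x, spar_h)
        # Lines 7-14 and 15-24: sweep the two opposite directions,
        # stopping when either ratio would reach 0 or 100%.
        for step_x, step_h in ((dx, -dh), (-dx, dh)):
            sx, sh = spar_x, spar_h  # each sweep restarts from NN_P,I
            while 0.0 < sx + step_x < 100.0 and 0.0 < sh + step_h < 100.0:
                sx += step_x
                sh += step_h
                acc = prune_and_retrain(sx, sh)
                if acc > best_acc:
                    best_acc, best = acc, (sx, sh)
        return best_acc, best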
By using a multiplexer in the MA Selector, the accelerator can choose, for each set of weights, which MA to employ. The numbers of elements in each row of W_h and W_x, which are H and X, respectively, become H_SP and X_SP after running the BRDS algorithm. The weight matrix with the larger (smaller) number of elements in each row (i.e., H_SP or X_SP) utilizes the Large (Small) MA. The two MAs working together conduct R signed multiplications in parallel. The Large (Small) MA includes a Mult Array component that conducts R_L (R_S) parallel n-bit multiply operations in a single cycle (R = R_L + R_S). The parameters R_L and R_S set the level of parallelization for each weight matrix (i.e., W_h and W_x); the Large (Small) MA processes the weights with the larger (smaller) number of elements per row. To fully utilize the MAs, the BRDS accelerator chooses R_L and R_S such that R_S/R_L equals min{X_SP, H_SP}/max{X_SP, H_SP}. In this way, the ratio of the number of non-zero elements in the small and large matrices equals the ratio of the number of multipliers in the Small and Large MAs. Therefore, the number of clock cycles needed for processing each weight matrix is the same, and there is no time at which one MA is not utilized. It is worth mentioning that the parameter R, which is the number of parallel multiplication operations for every row, determines the latency and the resource usage.

Since, in the BRDS hardware, the input rows are pruned, to reach a higher level of parallelism we propose a new parallelization factor, called Q, denoting the number of rows whose corresponding calculations can be performed in parallel. Therefore, the parameter Q also gives the number of Gate modules (and, likewise, Buffer and Function modules) working in parallel.
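As a numerical illustration with assumed row lengths (not the paper's): for H_SP = 128 and X_SP = 32 non-zero elements per row, choosing R_L = 16 and R_S = 4 preserves R_S/R_L = 32/128 = 1/4, so the Large MA finishes a row in 128/16 = 8 cycles and the Small MA in 32/4 = 8 cycles, and neither array ever idles.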
The outputs of the Large and Small MAs are concatenated and passed to the Add Array component in the module Gate. The Add Array component utilizes a tree of n-bit adders, which produces the summation of its R input operands. To perform the additions and multiplications, we use DSP blocks in the FPGA. Owing to the architecture of the DSP blocks, we are able to perform several functions together: in the Tree Adder component, we use three-input adders where possible, minimizing resource utilization. The internal structure of a DSP block of the Xilinx FPGAs (DSP48E) is shown in Fig. 7; the specified path is utilized for realizing the three-input adders. The current output of this component is added to its previous one in an Accumulate component that is implemented by a DSP block. The output of the Accumulate component is passed to an Add unit to take the biases into account. All these components, as one unit, perform the computation of the MxV.

Fig. 7. The internal structure of a DSP block configured as a three-input adder.

The proposed accelerator truncates the output of each add and multiply unit to n bits. To alleviate the impact of overflow in the result, we utilize the Recovery units suggested in [7] after each Add and Multiply unit. The module Function (see Fig. 6) performs the pointwise operations (i.e., sig, tanh, and (2)). This module generates the output h and the cell state c, which are written to their corresponding space in the module Embedded Memory. The operations of this module are overlapped with those of the module Gate, where this overlap is provided by the module Buffer.

The proposed accelerator utilizes a piecewise linear approximation of the activation functions (e.g., sigmoid (σ) and tanh) to balance speed, accuracy, area, and power consumption. For each piece, two n-bit coefficients a and b are obtained and stored in LUTs; hence, the word size of the LUTs is 2n bits. In the activation function component, by employing the add and multiply operations of one DSP block, the output (a × x + b) of the activation function is determined. The operations of (2) are performed by deploying a multiply unit and an add unit in the module Function. Because the output of the multiply unit must be passed to the module Embedded Memory, these units are implemented separately by DSP blocks.
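The sketch below illustrates this scheme with assumed parameters (16 pieces over [−4, 4] with saturation outside, and floating point instead of the accelerator's n-bit fixed point); it builds an (a, b) LUT for the sigmoid and evaluates each input with a single a·x + b operation.

    import numpy as np

    # 16 equal pieces over [-4, 4); each piece stores (a, b) so that
    # sigmoid(x) ~= a*x + b, mirroring one 2n-bit LUT word per piece.
    SEGMENTS = np.linspace(-4.0, 4.0, 17)

    def _fit_piece(lo, hi):
        # chord through the true sigmoid at the segment end points
        s = lambda v: 1.0 / (1.0 + np.exp(-v))
        a = (s(hi) - s(lo)) / (hi - lo)
        return a, s(lo) - a * lo

    LUT = [_fit_piece(SEGMENTS[k], SEGMENTS[k + 1]) for k in range(16)]

    def sigmoid_pwl(x):
        if x <= SEGMENTS[0]:
            return 0.0                      # saturate low
        if x >= SEGMENTS[-1]:
            return 1.0                      # saturate high
        k = int((x - SEGMENTS[0]) // 0.5)   # piece index (width 0.5)
        a, b = LUT[k]
        return a * x + b                    # one DSP-style a*x + b op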
To transfer data from the module Gate to the module Function, and also to feed the result of the module Function back to itself, the module Buffer (see Fig. 6) is deployed, using the same approach as in the POLAR architecture [7].
The module Embedded Memory (see Fig. 6) stores the weights, biases, inputs, relative addresses of the inputs, cells, and outputs in the embedded memory banks of the FPGA. To store the weights, two memory arrays, denoted by M_WX and M_WH, are employed. Only the non-zero elements of the matrices W_fx, W_ix, W_gx, and W_ox are stored in M_WX. We use a relative row index and a cumulative pointer to store the sparse matrices; the relative row index of each element shows the number of zero elements before it. Each group of R_X (R_H) non-zero elements of W_x (W_h) is stored in four consecutive rows. Hence, the size of M_WX is 4H × X_SP × n bits, where the width of each row of this memory is R_X × n bits. Similarly, the elements of the matrices W_fh, W_ih, W_gh, and W_oh are stored in M_WH. The size of M_WH is 4H × H_SP × n bits, where the width of each row of this memory is R_H × n bits.

The memory array M_B of the module Embedded Memory is used to store the biases. The size of M_B is 4H × n bits, with a word width of n bits. The i-th rows of the biases b_f, b_i, b_g, and b_o are stored in four consecutive rows of the memory M_B. The memory array M_X stores N time steps of the input dataset, where the size of the input dataset in each time step is X × n bits. To gain more throughput, we use duplicate memories in parallel for the inputs; because dual-port RAMs are utilized, we need R_X/2 BRAMs (M_X) for storing the inputs. In this work, for simplicity, we consider a single time step (N = 1). The memory array M_H stores the outputs of the current and previous time steps (h_t and h_{t-1}), with a size of H × n bits. Similar to the input memory array, duplicated memories are utilized for the outputs (R_H/2 BRAMs for M_H).
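As a numerical illustration with assumed values (not the paper's): for H = 1024, X_SP = 32, and n = 16, M_WX occupies 4 × 1024 × 32 × 16 bits = 2 Mb, and with R_X = 8 each of its memory rows is R_X × n = 128 bits wide.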
Fig. 6. The internal structure of the proposed BRDS LSTM accelerator.

At the beginning of each time step, M_H contains h_{t-1}. After each element of h_t is generated, it is stored in its corresponding row in all replicated memories of M_H. By using duplicate memories, the proposed architecture achieves much better performance, increasing the throughput at the cost of a reasonable amount of embedded memory. The generated cells and the ones for the previous time step are accessible from the memory array M_C. The word size of this memory is n bits, while its size is H × n bits. In this memory approach, before the previous c_t (i.e., the current c_{t-1}) is replaced, the cell is fetched and then replaced with the current c_t.

We store the relative addresses corresponding to the memories M_WX and M_WH in the memories M_AdX and M_AdH, respectively. The module Address Decoder, which consists of small add units, decodes the relative addresses to obtain cumulative pointer addresses. The pattern of storing W_h in M_WH and the corresponding relative addresses in M_AdH is illustrated in Fig. 8, for example values of H, Spar_h, and R_H (the parallelization factor of W_h). Storing W_x and its relative addresses in the corresponding memories is performed similarly to that of W_h.

Fig. 8. Pattern of storing W_h and its relative addresses in M_WH and M_AdH. The elements of each row are distinguished by the square, circle, triangle, and cross shapes.
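A minimal sketch of this decoding, assuming the EIE-style convention of [22] in which each stored index counts the zeros skipped since the previous non-zero of the row:

    def decode_relative(rel_idx):
        # Convert per-element relative indices into absolute column
        # indices for one row; the hardware does the same with small
        # add units.
        cols, nxt = [], 0
        for r in rel_idx:
            nxt += r           # skip r zero entries
            cols.append(nxt)   # absolute column of this non-zero
            nxt += 1           # move past the stored element
        return cols

    # Row [0, 0, w0, 0, w1, w2, 0, w3] is stored as values
    # [w0, w1, w2, w3] with relative indices [2, 1, 0, 1],
    # which decode to columns [2, 4, 5, 7].
    assert decode_relative([2, 1, 0, 1]) == [2, 4, 5, 7]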
Based on the proposed datapath for the accelerator architecture, a designer may control the trade-off between the resource usage and the latency of the architecture simply by adjusting the parameters R and Q at design time. To switch from one parallelization factor to another, one needs to change the number of mult and add units in the MAs and the Add Array components, respectively; the designer should also change the size of the delay units in the module Buffer. It is worth mentioning that if the parallelization factor R_x (R_h) is chosen greater than X_SP (H_SP), the designer should use the parameter Q and utilize multiple Gate, Buffer, and Function modules.
The module LSTM Controller in the proposed architecture handles the complicated timing scheme of the LSTM network. This module generates the proper signals with the proper timing to meet the architecture requirements. The details of the timing of this architecture are the same as those of the POLAR architecture [7].
5 RESULTS AND DISCUSSION
In this section, the accuracy of the proposed pruning algorithm is evaluated by applying it to several LSTM networks. Also, the design parameters of the proposed accelerator, as well as its efficacy compared to several prior works, are assessed by implementing the accelerator on an FPGA.
To evaluate the accuracy of the BRDS pruning algorithm, the algorithm was applied to an LSTM language model on the PTB dataset [17], the IMDB Movie Reviews dataset [18], and the TIMIT dataset [19]. The PTB dataset, widely used in NLP research, includes 929K training, 73K validation, and 82K test words. The IMDB dataset has 50,000 highly polar positive and negative movie reviews for binary sentiment classification; it includes 25,000 reviews for training and 25,000 reviews for testing. The TIMIT dataset has been provided for the study of acoustic-phonetics; it includes recordings of 630 speakers of eight major dialects of American English. For the LSTM speech recognition model, we set the input size to 153 and the hidden state size to 1024, the same as in the prior studies ([4], [9]). In this study, the accuracy of BRDS is compared to three prior pruning approaches: unstructured sparsity, block sparsity, and bank-balanced sparsity (BBS) [9]. For the studies of this section, 64 banks were considered in the case of the BBS method.

Fig. 9 compares the accuracy-sparsity tradeoff of the considered methods. The error of the proposed algorithm is, on average, about 0.1% less than that of the BBS method. As the results in Fig. 9(c) show, our method outperforms the other algorithms in almost all cases, particularly at larger sparsity ratios. Compared to the BBS sparsity method, the proposed pruning algorithm resulted in 0.7% lower accuracy loss. Note that the efficacy of the proposed pruning algorithm is reduced in the case of this dataset compared to the other considered datasets, due to its small size.

Fig. 9. The accuracy-sparsity tradeoff on the (a) PTB, (b) TIMIT, and (c) IMDB datasets.

To evaluate the efficiency of the architecture of the proposed accelerator, it was implemented on an FPGA executing the TIMIT dataset with the same configurations provided in [4], [9], [14]. The design parameters of BRDS are compared with those of four state-of-the-art works, proposed in [4], [9], [14], [16]. The focus of all of these prior works was on implementing LSTM networks on FPGAs using weight pruning and compression. The work in [16] was evaluated on a GRU, which is simpler than an LSTM in terms of computational complexity; thus, the reported design parameters of [16] should be viewed as optimistic values when compared to those of the LSTM accelerators. That work used the delta network algorithm to reduce MxV operations, and skipped unimportant cell activation changes (changes below a threshold value) to reduce memory accesses. To perform a better comparison, we implemented the BRDS design on an FPGA (XCKU9P) of the same family as that of [4], using the Xilinx VIVADO 2018.2 tool.
In this study, the pruning ratio was set to 87.5% (the same as in [9], [14]). Also, since the data bit width was considered to be 16 bits in most of the prior architectures (e.g., [9], [14], [16]), without loss of generality, the same size was considered for the BRDS accelerator. For pruning the network, we applied the BRDS algorithm with an overall sparsity (OS) of 87.5% (the same sparsity ratio as in most of the prior works) on the TIMIT dataset. The best Spar_h and Spar_x given by the algorithm turned out to be the same for both weight matrices. It is worth mentioning that, because the parameters X and H were different in this design, having the same sparsity ratios for W_h and W_x did not mean the same number of elements in each row; thus, even with the same sparsity ratios, the numbers of elements to be pruned were different for W_h and W_x. Running the BRDS pruning algorithm yielded the corresponding per-row non-zero counts X_SP and H_SP. As mentioned in Section 4, to fully utilize the MAs in the BRDS accelerator, the parameters R_L and R_S were chosen such that R_S/R_L equals min{X_SP, H_SP}/max{X_SP, H_SP}; the parallelization factors R and Q were then set accordingly. TABLE 1 shows the resource utilization of the BRDS accelerator with this configuration on the XCKU9P Xilinx FPGA device. Also, TABLE 2 reports the frequency, sparsity ratio, accuracy degradation, GOPS, power, GOPS/W, effective GOPS and GOPS/W, and DSP and logic efficiency of the BRDS accelerator and of the considered prior works. The Effective Throughput (GOPS) is defined as GOPS/(1 − sparsity), which takes the impact of pruning into account [9]. In this work, although the operating frequency of the BRDS architecture could be increased to 238 MHz, it was set to 200 MHz to have a fair comparison with the selected prior works. The design parameters of the prior works were taken from their published papers. As the results in TABLE 2 show, the throughput (GOPS) of the BRDS accelerator is higher (by up to 52.7%) than those of [14], [16], while it is smaller (by up to 52%) than those of [4], [9]. The power consumption was extracted using the Xilinx Power Estimator; the switching activity was set by the tool based on the inputs and weights of the TIMIT dataset stored in the embedded memories. The power consumption of the BRDS accelerator is much less than that of the other selected works except for [16], mostly due to the latter's smaller operating frequency. Additionally, the GOPS/W of BRDS is higher than those of the other works except for [16], which has a slightly better energy efficiency.
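For example, plugging the BRDS figures from TABLE 2 into this definition gives 200/(1 − 0.875) = 1600 effective GOPS.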
TABLE 1
RESOURCE UTILIZATION OF THE BRDS ACCELERATOR IMPLEMENTED ON XCKU9P FOR THE TIMIT DATASET (OS = 87.5%).

                 LUT      FF       DSP    BRAM
Available        274080   548160   2520   912
Utilization      5600     83710    1600   724
Utilization (%)  2        15.3     63.5   79.4
The effective GOPS (GOPS considering the sparsity ratio) of BRDS is, on average, about 43% higher than those of [14], [16], while it is, on average, about 54% lower than those of [4], [9]. Moreover, the BRDS accelerator outperforms all of the other works in terms of effective GOPS/W: the effective GOPS/W of BRDS is, on average (up to), 2.3× (3.7×) higher than those of the other selected works. Finally, the logic efficiency is reported as GOPS per logic cell (i.e., ALM for Intel and LUT for Xilinx devices).

6 CONCLUSION
In this paper, first, BRDS, a row-balanced dual-ratio sparsity algorithm, was presented to improve the accuracy of LSTM models while considering their hardware implementation. Additionally, BRDS LSTM, an energy-efficient FPGA implementation for the inference phase of sparse LSTM networks, was proposed. Its architecture is compatible with the suggested pruning algorithm, utilizing two configurable processing elements with different sparsity ratios, and it takes advantage of the different sensitivities of the two weight matrices to pruning. Finally, the efficiency of the proposed pruning algorithm and accelerator was evaluated using selected benchmarks in the NLP, sentiment classification, and speech recognition fields. Compared to the state-of-the-art work, the proposed architecture and pruning algorithm provided, on average, 128% improvement in effective GOPS/W and a 0.7% reduction in perplexity.

REFERENCES
[1] A. Graves, A. R. Mohamed, and G. Hinton, "Speech Recognition with Deep Recurrent Neural Networks," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), pp. 6645-6649, May 2013.
[2] H. Palangi et al., "Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to Information Retrieval," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 24, no. 4, pp. 694-707, 2016.
[3] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[4] S. Han et al., "ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA," Dec. 2016.
[5] A. Chang and E. Culurciello, "Hardware Accelerators for Recurrent Neural Networks on FPGA," in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS), pp. 1-4, May 2017.
[6] Y. Guan et al., "FPGA-based Accelerator for Long Short-Term Memory Recurrent Neural Networks," in Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC), pp. 629-634, 2017.
[7] E. Bank-Tavakoli, S. A. Ghasemzadeh, M. Kamal, A. Afzali-Kusha, and M. Pedram, "POLAR: A Pipelined/Overlapped FPGA-Based LSTM Accelerator," IEEE Trans. Very Large Scale Integration (VLSI) Systems, 2019.
[8] S. Narang, E. Undersander, and G. F. Diamos, "Block-sparse Recurrent Neural Networks," CoRR, 2017.
[9] S. Cao et al., "Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity," in Proc. Int. Symp. Field-Programmable Gate Arrays (FPGA), pp. 63-72, 2019.
[10] S. Han, H. Mao, and W. J. Dally, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," in Proc. ICLR, 2016.
[11] S. Han et al., "Learning Both Weights and Connections for Efficient Neural Networks," in Proc. NIPS, pp. 1135-1143, 2015.
[12] W. Wen et al., "Learning Structured Sparsity in Deep Neural Networks," in Proc. NIPS, pp. 2074-2082, 2016.
[13] H. Mao et al., "Exploring the Regularity of Sparse Structure in Convolutional Neural Networks," in Proc. CVPR Workshop on Tensor Methods in Computer Vision, 2017.
[14] S. Wang et al., "C-LSTM: Enabling Efficient LSTM Using Structured Compression Techniques on FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), pp. 11-20, 2018.
[15] V. Y. Pan, Structured Matrices and Polynomials: Unified Superfast Algorithms. New York: Springer, 2001.
[16] C. Gao et al., "DeltaRNN: A Power-efficient Recurrent Neural Network Accelerator," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), pp. 21-30, 2018.
[17] M. Marcus et al., Treebank-3 LDC99T42, CD-ROM. Philadelphia, PA: Linguistic Data Consortium, 1999.
[19] J. S. Garofolo et al., DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM, NIST Speech Disc 1-1.1. NASA STI/Recon Technical Report N, vol. 93, 1993.
[20] J. Park et al., "Maximizing System Performance by Balancing Computation Loads in LSTM Accelerators," in Proc. Design, Automation and Test in Europe Conf. (DATE), pp. 7-12, March 2018.
[21] C. Huyen, "Evaluation Metrics for Language Modeling," The Gradient, 2019.
[22] S. Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," in Proc. ISCA, pp. 243-254, 2016.
TABLE 2
COMPARISON OF THE DESIGN PARAMETERS OF DIFFERENT STATE-OF-THE-ART LSTM ACCELERATORS.

                      ESE [4]    C-LSTM [14]  DeltaRNN [16]  BBS [9]          BRDS
Platform              XCKU060    Virtex-7     XC7Z100        Arria 10 GX1150  XCKU9P
Frequency (MHz)       200        200          125            200              200
Sparsity (%)          88.7       87.5         -              87.5             87.5
Quantization          fixed-12   fixed-16     fixed-16       fixed-16         fixed-16
Accuracy Degradation  0.30%      0.32%        -              0.25%            0.25%
Throughput (GOPS)     282        131          192            …                200
Power (W)             41.0       22.0         …              …                …