FTRANS: Energy-Efficient Acceleration of Transformers using FPGA
Bingbing Li, Santosh Pandey, Haowen Fang, Yanjun Lyv, Ji Li, Jieyang Chen, Mimi Xie, Lipeng Wan, Hang Liu, Caiwen Ding
University of Connecticut, Stevens Institute of Technology, Syracuse University, Microsoft Corporation, Oak Ridge National Laboratory, University of Texas at San Antonio
{bingbing.li, lyu.yanjun, caiwen.ding}@uconn.edu, {spande1, Hang.liu}@stevens.edu, [email protected], [email protected], {chenj3, wanl}@ornl.gov, [email protected]

ABSTRACT
In natural language processing (NLP), the “Transformer” architecture was proposed as the first transduction model relying entirely on self-attention mechanisms without using sequence-aligned recurrent neural networks (RNNs) or convolution, and it achieved significant improvements for sequence-to-sequence tasks. However, the intensive computation and storage introduced by these pre-trained language representations have impeded their deployment on computation- and memory-constrained devices. The field-programmable gate array (FPGA) is widely used to accelerate deep learning algorithms for its high parallelism and low latency. However, the trained models are still too large to fit into an FPGA fabric. In this paper, we propose an efficient acceleration framework, Ftrans, for transformer-based large scale language representations. Our framework includes an enhanced block-circulant matrix (BCM)-based weight representation to enable model compression on large-scale language representations at the algorithm level with little accuracy degradation, and an acceleration design at the architecture level. Experimental results show that our proposed framework significantly reduces the model size of NLP models by up to 16 times. Our FPGA design achieves 27.07× and 81× improvement in performance and energy efficiency compared to CPU, and up to 8.80× improvement in energy efficiency compared to GPU.

ACM Reference Format:
Bingbing Li, Santosh Pandey, Haowen Fang, Yanjun Lyv, Ji Li, Jieyang Chen, Mimi Xie, Lipeng Wan, Hang Liu and Caiwen Ding. 2020. FTRANS: Energy-Efficient Acceleration of Transformers using FPGA. In ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED ’20), August 10–12, 2020, Boston, MA, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3370748.3406567
RNN and its variants, the Long Short-Term Memory (LSTM) unit [6] and the Gated Recurrent Unit (GRU) [3], used to dominate sequence modeling, language modeling, machine translation, etc. However, they generally lack efficiency in transmitting global information,
due to the bottleneck in the memory (hidden state) and the complicated bypassing logic (additive and derivative branches) through which long-range information is passed. In addition, the inherently sequential nature precludes parallelization within training examples through backpropagation, which is critical at longer sequence lengths [9]. To overcome the shortcomings of RNNs, the “Transformer” architecture was proposed as the first transduction model relying entirely on self-attention mechanisms without using sequence-aligned RNNs or convolution. It achieved notable improvements for sequence-to-sequence tasks [18]. The breakthroughs and developments of new models have accelerated at an unprecedented pace since attention mechanisms became mainstream in the NLP domain with the invention of the Transformer. Many transformer-based NLP language models like BERT [4] and RoBERTa [10] introduced pretraining procedures to the transformer architecture and achieved record-breaking results on major NLP tasks, including question answering, sentiment analysis, and language inference. Nevertheless, the intensive computation and power footprint introduced by these pre-trained language representations have impeded their adoption on computation- and energy-constrained edge devices. Moreover, despite the rapid advancement achieved by the recent transformer-based NLP models, there is a serious lack of studies on compressing these models for embedded and internet-of-things (IoT) devices.

In this paper, we propose an energy-efficient acceleration framework, Ftrans, for transformer-based large scale language representations using FPGA. Ftrans is comprised of an enhanced BCM-based method enabling model compression on language representations at the algorithm level, and an acceleration design at the architecture level. Our contributions are summarized as follows:

• Enhanced BCM-based model compression for Transformer.
We address the accuracy degradation caused by traditional BCM compression and propose an enhanced BCM-based compression to reduce the footprint of weights in the Transformer. With small accuracy loss, Ftrans achieves up to 16 times compression ratio.

• Holistic optimization for Transformers on FPGA.
Given the large size and complex data flow of transformer-based models, even with model compression, we still need to schedule the computation resources carefully to optimize latency and throughput. We propose a two-stage optimization approach to mitigate the resource constraints and achieve high throughput.

• Low hardware footprint and low power (energy) consumption.
We propose an FPGA architecture design to support the model compression technique and we develop a design automation and optimization technique. Overall, the proposed Ftrans achieves the lowest hardware cost and energy consumption in implementing Transformer and RoBERTa compared to CPU and GPU references.

Experimental results show that our proposed framework significantly reduces the size of NLP models by up to 16 times. Our FPGA design achieves 27.07× and 81× improvement in performance and energy efficiency compared to CPU. The power consumption of GPU is up to 5.01× that of FPGA, and we achieve up to 8.80× improvement in energy efficiency compared to GPU.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks [9]. Evidence of the NLP community moving towards attention-based models can be found in the growing number of attention-based neural networks developed by companies like Amazon [8], Facebook [16], and Salesforce [2]. The novel approach of the Transformer is the first model to eliminate recurrence completely, using self-attention to handle the dependencies between input and output. BERT [4] and RoBERTa [10] extend the Transformer's capacity from a sequence-to-sequence model to a general language model by introducing the pretraining procedure, and achieved state-of-the-art results on major NLP benchmarks. Although RNNs and Convolutional Neural Networks (CNNs) are being replaced by Transformer-based models in the NLP community, there are only a few works that accelerate Transformers and focus on reducing the energy and power footprint; e.g., a case study of the Transformer is presented in [1] using one of the cutting-edge FPGA boards. However, it is noteworthy that [1] targets a specialized FPGA architecture, which employs High Bandwidth Memory (HBM) technology. Unlike conventional FPGAs, HBM is packaged directly within the FPGA fabric to alleviate the on-chip memory constraint. Moreover, work [1] did not adopt a model compression technique, and used sequence lengths of 8 and 16, which are too short and not favorable in practice. The model details, such as the number of encoders/decoders and the hidden size, are also not listed.
The “Transformer” architecture is at the heart of all state-of-the-art large scale language models. It has an encoder-decoder structure [18] as shown in Figure 1. The encoder maps a sequence of input symbols $x = (x_1, x_2, x_3, \ldots, x_n)$ to a sequence of continuous representations $z = (z_1, z_2, z_3, \ldots, z_n)$. Given $x$, the decoder then produces an output sequence $y = (y_1, y_2, y_3, \ldots, y_m)$ of symbols, one element per time step. For the next time step, the model takes the previously generated symbols as additional input when generating the next. The Transformer follows this overall architecture using stacked self-attention and fully-connected (FC) layers for both the encoder and decoder, as shown in Figure 1.
Figure 1: Model structure of Transformer.

Encoder: The encoder consists of a stack of N identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is an FC feed-forward network. There is a residual connection around each of the two sub-layers, followed by layer normalization.
Decoder: The decoder consists of a stack of N identical layers. Within each layer, there are three sub-layers, where the third sub-layer is the same as in the encoder. The inserted second sub-layer performs multi-head attention over the output of the encoder stack. The first sub-layer utilizes masked multi-head attention, to ensure that predictions for position $i$ depend only on its previous positions.

The attention function can be described as mapping a query $q$ and a set of key $k$ and value $v$ pairs to an output $o$, as shown in Figure 2 (a), named scaled dot-product attention, or single-head attention. In this paper, we select dot-product attention as the attention function since it is much faster and more space-efficient [18]. The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We denote $\sqrt{d_k}$ as the scaling factor for dot-product attention. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values. The attention function on $q$, $k$, and $v$ can be computed simultaneously by concatenating them into matrices $Q$, $K$, and $V$, respectively. Accordingly, the output matrix $O_{att}$ is:

$$O_{att} = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \quad (1)$$

Multiple single-head attentions are then concatenated as multi-head attention, as shown in Figure 2 (b): $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{Head}_1, \cdots, \mathrm{Head}_h) \times W^O$, where each head is defined as:

$$\mathrm{Head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \quad (2)$$

where the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, and $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$. Multi-head attention enables the model to jointly attend to information from different representation subspaces at different positions [18].

In this work, we implement a shallow Transformer and a large scale Transformer, i.e., RoBERTa. The shallow Transformer has $h = 2$ parallel attention layers with 4 attention heads, and RoBERTa (base configuration) has 12 layers with 12 heads. For each head we use $d_k = d_v = d_{model}/h$, with $d_{model}$ = 200 and 768 for the shallow Transformer and RoBERTa, respectively.
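To make Equations (1) and (2) concrete, the following NumPy sketch computes scaled dot-product attention and multi-head attention for a toy input; the dimensions and random tensors are illustrative only and are not taken from the released implementation.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Equation (1): softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(x, W_Q, W_K, W_V, W_O):
    # Equation (2): each head projects the input with its own W^Q, W^K, W^V;
    # the head outputs are concatenated and projected by W^O.
    heads = [scaled_dot_product_attention(x @ wq, x @ wk, x @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

# Toy sizes: d_model = 200, h = 4 heads, d_k = d_v = d_model / h = 50.
d_model, h, d_k, seq_len = 200, 4, 50, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))
W_Q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model))
out = multi_head_attention(x, W_Q, W_K, W_V, W_O)   # shape: (seq_len, d_model)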
Figure 2: (a) Scaled dot-product attention. (b) Multi-head attention.
The intensive computation and weight storage introduced by large pre-trained language representations have brought challenges to hardware implementation. Therefore, model compression is a natural method to mitigate these challenges.
CirCNN [5] and C-LSTM [19] have adopted BCM for model compression on small to medium scale datasets in image classification and speech recognition, respectively, and achieved significant improvements in performance and energy efficiency compared to the prior arts. Using this method, we can reduce weight storage by replacing the original weight matrix with one or multiple blocks of circulant matrices, where each row/column is the cyclic reformulation of the others. We use $b$ to represent the row/column size of each circulant matrix (or block size, FFT size). Suppose the shape of a weight matrix in the Transformer (e.g., $W_i^Q$, $W_i^K$, $W_i^V$) is $W \in \mathbb{R}^{m \times n}$; there will be $f \times g$ blocks after partitioning $W$, where $f = m \div b$ and $g = n \div b$. Then $W = [W_{ij}]$, $i \in \{1, \ldots, f\}$, $j \in \{1, \ldots, g\}$. The input $x$ is also partitioned as $x = [x_1^T, x_2^T, \ldots, x_g^T]^T$. In each BCM, only the first column/row is needed for storage and computation, and is termed the index vector, $p_{ij}$. The theoretical foundation is derived in [20], which demonstrates the universal approximation property and the error bounds of BCM-based neural networks, showing they are as effective as general neural networks.

Prior works [5, 19] have not investigated large-scale language representations. To better maintain the prediction accuracy, we use an enhanced BCM-based model compression. We modify the formulation of the index vector as follows:

$$p_{ij} = \left[\frac{1}{b}\sum_{j=1}^{b} W_{1j},\; \frac{1}{b}\sum_{j=1}^{b} W_{2j},\; \ldots,\; \frac{1}{b}\sum_{j=1}^{b} W_{bj}\right] \quad (3)$$

where $W_{ij}$ is a circulant matrix. We observe that in this way, we can better preserve the parameter information and maintain the overall prediction accuracy. The main reason is that prior works take the first column/row as the index vector, missing the effective representations of the other rows/columns.

Based on the circulant convolution theorem [14, 17], instead of directly performing the matrix-vector multiplication, we can use the fast Fourier transform (FFT)-based multiplication method, which is equivalent to the matrix-vector multiplication. The calculation of a BCM-based matrix-vector multiplication $W_{ij} x_j$ is: $W_{ij} x_j = p_{ij} \circledast x_j = \mathrm{IFFT}\big(\mathrm{FFT}(p_{ij}) \circ \mathrm{FFT}(x_j)\big)$, where ‘$\circledast$’ represents circular convolution and ‘$\circ$’ is element-wise multiplication. Therefore, the computational complexity is reduced from $O(b^2)$ to $O(b \log b)$.
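As an illustration of the enhanced index vector and the FFT-based block multiplication above, the following NumPy sketch applies one possible reading of Equation (3) (the index vector as the row-wise mean of each b×b block) and evaluates $W_{ij} x_j$ by circular convolution; the helper names and this reading of Equation (3) are ours, not from the released code.

import numpy as np

def enhanced_index_vector(W_block):
    # Enhanced BCM index vector (one reading of Eq. (3)): use the mean of each
    # row of the b x b block instead of only its first row/column.
    return W_block.mean(axis=1)

def circulant_matvec_fft(p, x):
    # W_ij x_j = p_ij (circular conv) x_j = IFFT(FFT(p_ij) o FFT(x_j)):
    # the per-block cost drops from O(b^2) to O(b log b).
    return np.real(np.fft.ifft(np.fft.fft(p) * np.fft.fft(x)))

def bcm_matvec(W, x, b):
    # Partition an m x n weight matrix into f x g blocks of size b and
    # accumulate the per-block circular convolutions.
    m, n = W.shape
    f, g = m // b, n // b
    y = np.zeros(m)
    for i in range(f):
        for j in range(g):
            p_ij = enhanced_index_vector(W[i*b:(i+1)*b, j*b:(j+1)*b])
            y[i*b:(i+1)*b] += circulant_matvec_fft(p_ij, x[j*b:(j+1)*b])
    return y

# Toy example with block size b = 8: only one length-b index vector per block
# needs to be stored, which gives the b-times reduction in weight storage.
rng = np.random.default_rng(1)
W = rng.standard_normal((16, 32))
x = rng.standard_normal(32)
y = bcm_matvec(W, x, b=8)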
FPGA is widely used to accelerate deep learning models for its high parallelism and low latency. As the large number of transformer parameters exceeds the on-chip memory or block RAM (BRAM) capacity of the FPGA fabric, even with the model compression technique, the full model cannot be stored on chip. To address this challenge, we partition a model into the embedding layer and the encoder/decoder stacks. The embedding layer contributes 30.89% of the parameters. Essentially it is a look-up table that transforms discrete tokens into a continuous space, and its computation is less than that of the encoder and decoder. Therefore, our basic idea is to off-load the embedding layer to off-chip memory, making it possible to deploy the most computationally intensive part, i.e., the encoder and decoder stacks, on chip and avoid frequent accesses to off-chip weights, hence accelerating computation. Second, to mitigate the I/O constraint, we develop inter-layer coarse-grained pipelining, intra-layer fine-grained pipelining, and computation scheduling.
As shown in Figure 3, the proposed hardware architecture consists of computation units for encoder/decoder computation, on-chip memory banks, a transformer controller, and an off-chip memory (DDR) with its DDR controller. The transformer controller communicates with the host and controls all the modules on the FPGA. The host PC loads the inputs (i.e., sentence pairs) to the FPGA for inference through PCIe. On the FPGA side, given the tokenized sentences, the embedding look-up module accesses DDR to fetch the embeddings. Next, the embeddings are fed into the pipelined encoder/decoder stacks to perform inference.

The computing units consist of multi-head attention, scaled dot-product attention, point-wise feed-forward layers, linear layers, and add/norm. The transformer controller orchestrates the computing flow and data flow of inputs from PCIe, BRAMs, and computing units on the FPGA fabric. Since the encoder and decoder share the same types of operations, we first decompose them into different computing primitives, including matrix multiplications of different sizes, vectorized exponentials, etc. The multi-head attention, linear, and add/norm modules are reconfigured to form an encoder or decoder under the transformer control logic. We have two pipeline strategies. For shallow networks, the entire network can be straightforwardly implemented, i.e., all layers can be implemented by dedicated FPGA resources. For state-of-the-art designs such as BERT and RoBERTa, there are multiple encoders/decoders, hence resources such as DSPs may not be enough. In such cases, reuse of certain PEs or an entire encoder/decoder module is necessary.
Multi-head attention includes multiple processing element (PE) banks for matrix multiplication, buffers (K buf, Q buf, and V buf), a normalization module (Norm), a masking function for masked multi-head attention, and a softmax module, as described in Equation (2) and shown in Fig. 4.
Figure 3: The overall hardware architecture on FPGA.
Figure 4: Multi-head attention (Head$_1$, $\cdots$, Head$_h$) design.

The inputs are fetched from DDR and fed into the encoder pipeline, then multiplied with the query matrices $Q$ and key matrices $K$ stored in BRAMs. The intermediate results $QW^Q$ and $KW^K$ are then propagated to the buffers (i.e., the K buffer and Q buffer store $KW^K$ and $QW^Q$, respectively). Next, we compute the matrix multiplication of the values stored in the K buffer and Q buffer. The product is loaded to the normalization (Norm) module, i.e., the product is divided by $\sqrt{d_k}$. After the softmax module, the results are propagated to a PE bank to perform matrix multiplication with the matrix stored in V buf, i.e., $VW^V$. Each head has a local controller to orchestrate the computing flow and data flow of the PE banks and buffers. To support masked multi-head attention, the local controller enables a multiplexer that sets the future tokens in a sequence to 0, such that the current output predictions cannot see later into the sentence. The decoder has one more multi-head attention than the encoder and thus takes longer to compute. In the case where the decoder module has to be reused, a buffer is placed between the encoder and decoder stacks to prevent encoder pipeline stalls. The buffer also stores the output from the last encoder to support the residual connection.

We develop three different configurable PEs, named PE-A, PE-B, and PE-FFT/IFFT. For the BCM-based matrix-vector multiplication in the FC layers, we use the FFT/IFFT-based processing elements (PEs); for the other layers, we use matrix-vector multiplication, i.e., PE-A and PE-B for different matrix sizes.
The major part of PE-A or PE-B is a multiplier for matrix multiplications of different sizes. It also consists of two accumulators, dividers, and exponential units to support the scaling and softmax required by multi-head attention. The outputs of the multipliers are fed into the divider or accumulator as a stream, hence the scaling and softmax layers can be overlapped with matrix multiplication.

Figure 5: FFT/IFFT-based PE and softmax module design.
Figure 5 shows the design of the FFT/IFFT-based PE and softmax, including an FFT/IFFT kernel, an accumulator, and an adder. The accumulator is an adder tree with N inputs (the size is chosen the same as the FFT/IFFT kernel size). We select the Radix-2 Cooley-Tukey algorithm [7] for the FFT implementation. Figure 5 (b) shows the implementation of the softmax function $\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$. The exponential function $\exp(x_i)$ or $\exp(x_j)$ is expensive in resource consumption for FPGAs. We adopt piece-wise linear functions to estimate their outputs, in order to simultaneously reduce the resource consumption and maintain the accuracy; a sketch of this approximation is given below. A buffer is used to store $\exp(x_i)$ and an accumulator is used to compute the summation of $\exp(x_j)$. Next, we perform the division and generate the softmax results.

We developed a workflow to prototype and explore the hardware architecture. First, we generate a data dependency graph based on the trained models to illustrate the computation flow. The operators in the graph are scheduled to compose the pipeline under the design constraints, to achieve maximum throughput. At last, a code generator receives the scheduling results and generates the final C/C++ implementation, which can be fed into the commercial HLS tool for synthesis. Our target synthesis backend is Xilinx SDx.

The major computationally intensive operations are shown in Figure 6. Other operations such as division and softmax consume much less time, and can be merged/overlapped with these major operations. The computation in different layers can be decomposed into common computing elements, i.e., PEs. The layers in the same color can be performed by the same PE, however, with unbalanced operations. For example, the time consumed by $KW^K$, $QW^Q$ and $VW^V$ is roughly 4 times the computation required by the $n$ heads.
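As a rough sketch of the piece-wise linear estimation of the exponential mentioned for the softmax module above, the code below approximates exp(x) with a small number of linear segments over a clipped range after shifting by max(x); the segment count, range, and breakpoints are illustrative choices, not the values used in the FPGA design.

import numpy as np

def make_pwl_exp(x_min=-8.0, x_max=0.0, segments=8):
    # Precompute breakpoints and per-segment slope/intercept for exp(x).
    xs = np.linspace(x_min, x_max, segments + 1)
    ys = np.exp(xs)
    slopes = np.diff(ys) / np.diff(xs)
    intercepts = ys[:-1] - slopes * xs[:-1]
    def pwl_exp(x):
        x = np.clip(x, x_min, x_max)
        idx = np.clip(np.searchsorted(xs, x, side='right') - 1, 0, segments - 1)
        return slopes[idx] * x + intercepts[idx]
    return pwl_exp

def softmax_pwl(x):
    # Shift by max(x) so all inputs fall at or below 0, buffer the approximated
    # exponentials, and divide by their accumulated sum, mirroring the
    # buffer/accumulator/divider structure of Fig. 5(b).
    pwl_exp = make_pwl_exp()
    e = pwl_exp(x - np.max(x))
    return e / e.sum()

scores = np.array([1.3, 0.2, -0.5, 2.1])
approx = softmax_pwl(scores)
exact = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()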
Figure 6: Data flow of major operations.
Figure 7: Fine-grained operation scheduling.
To improve the utilization of the pipeline, it is desirable that each layer consumes roughly the same time. This can be achieved by allocating more resources to the slowest layer. We adopt a two-stage optimization flow. In the first stage, we find a resource scheme that minimizes the maximum time required by the layers. In the second stage, under such resource constraints, we optimize the scheduling of an individual encoder/decoder.

The optimization starts from a basic implementation of an individual encoder/decoder, i.e., with no parallelization or resource reuse, such that we can obtain an estimation of the resource consumption, the number of operations, the execution time of each layer, and the throughput obtained by a unit amount of resources. Then we examine how much resource can be allocated to each encoder/decoder to minimize the execution time of the slowest layer:

$$\text{minimize } \max(T_1, T_2, \ldots, T_n), \quad \text{subject to } R_F[i] \geq M \sum_{j=1}^{n} R_j[i] + R_{misc}[i] \quad (4)$$

where $i \in \{1, \ldots, 4\}$, $j \in \{1, \ldots, n\}$, $n$ is the number of layers, $M$ is the total number of encoders/decoders, and $R_F = [R_{FF}, R_{LUT}, R_{DSP}, R_{BRAM}]$ gives the on-chip resource constraints for flip-flops (FF), look-up tables (LUT), digital signal processing units (DSP), and BRAM, respectively. $T_j$ is the time required by the $j$-th layer. $R_j$ is the resource utilization of the $j$-th layer, which is also represented as a vector: $R_j = [R_{j,FF}, R_{j,LUT}, R_{j,DSP}, R_{j,BRAM}]$. $R_{misc}$ is the resource utilization of the modules other than the encoder/decoder modules, such as the DDR controller, PCIe controller, etc. $T_j$ is given by:

$$T_j = \lceil N_{op}^{j} / (F_j \cdot K_j) \rceil, \quad j \in \{1, \ldots, n\} \quad (5)$$

where $N_{op}^{j}$ is the number of operations required by the $j$-th layer, $K_j$ is the resource allocation factor of the $j$-th layer, and $F_j$ is the throughput of the non-optimized design, which can be obtained empirically. Therefore, the throughput is:

$$Throughput = freq / (n \cdot \max(T_1, T_2, \ldots, T_n)) \quad (6)$$

The first stage finds the slowest layer, allocates more resources to it, then updates the resource consumption and execution time. If the resource constraints are satisfied, we repeat this procedure until there is no more speedup. Then the algorithm examines the fastest layer. If it takes significantly less time than the slowest layer, it is possible to allocate fewer resources to that layer, hence more resources can be assigned to the slowest layer. After this procedure, we obtain the resource constraints, e.g., the number of each type of PE in an encoder and decoder.
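A minimal sketch of this first optimization stage under the timing model of Equations (4)-(6), counting DSPs only for brevity; the layer names, operation counts, per-unit costs, and clock frequency below are placeholders rather than values from the actual design.

import math

# Placeholder per-layer operation counts N_op^j, baseline throughputs F_j,
# and DSP cost per allocated unit (all assumed for illustration).
n_ops = {'KQV_proj': 4.0e6, 'heads': 1.0e6, 'fc': 2.0e6, 'add_norm': 0.2e6}
F = {'KQV_proj': 1.0, 'heads': 1.0, 'fc': 1.0, 'add_norm': 1.0}
unit_dsp = {'KQV_proj': 200, 'heads': 100, 'fc': 150, 'add_norm': 20}
dsp_budget = 6840    # DSPs on the target device
M = 2                # number of encoder/decoder copies (shallow Transformer)

def latency(layer, k):
    # Equation (5): T_j = ceil(N_op^j / (F_j * K_j)) with K_j allocated units.
    return math.ceil(n_ops[layer] / (F[layer] * k))

def used_dsps(K):
    return M * sum(unit_dsp[l] * K[l] for l in K)

K = {layer: 1 for layer in n_ops}

# Greedily give one more unit to the slowest layer while the budget allows,
# which reduces max_j T_j as targeted by Equation (4).
while True:
    slowest = max(K, key=lambda l: latency(l, K[l]))
    if used_dsps(K) + M * unit_dsp[slowest] > dsp_budget:
        break
    K[slowest] += 1

freq = 200e6         # assumed clock frequency
T_max = max(latency(l, K[l]) for l in K)
throughput = freq / (len(K) * T_max)    # Equation (6)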
Under such resource constraints, each layer may not have dedicated computation resources, hence the matrix multipliers, adders, etc. have to be shared. Therefore, the computation has to be carefully scheduled to minimize the computation latency. The encoder/decoder can be represented as a Directed Acyclic Graph (DAG) $G(V, E)$, where $V$ is a set of vertices representing different computations and the edges $E$ indicate the dependencies. The available computation units such as PEs and adders are represented by a set $Op = \{PE\text{-}A_1, PE\text{-}A_2, \ldots, Adder\}$. The algorithm used for operation scheduling takes $G$ and $Op$ as input, and is shown in Algorithm 1.

Algorithm 1: Pseudo-code for operation scheduling
Input: Dependency graph $G(V, E)$, available PEs $Op = \{PE\text{-}A_1, PE\text{-}A_2, \ldots, Adder\}$
Output: The schedule $S$ with the designated PE for each layer

Q = TOPO_SORT(G(V, E))        \\ topological sort to obtain a priority queue of all layers
P = Q[0]                      \\ list of layers to be scheduled
E = ∅                         \\ list of layers being executed
S = ∅                         \\ the final schedule result
stage = 0
while Q ≠ ∅ ∧ E ≠ ∅ do
    for layer ∈ Q do
        if an available PE ∈ Op exists for layer then
            Q.pop()
            Op.remove(PE)
            E.push_back((layer, PE))
            for V ∈ NEIGHBOR(layer) do
                Q.push_back(V)
            end
        end
    end
    stage += 1
    for (layer, PE) ∈ E do
        if IS_FINISHED(layer) == True then
            E.pop()
            S.push_back((layer, stage, PE))
            Op.push_back(PE)
        end
    end
end
return S
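The following Python sketch is one possible rendering of the greedy, stage-by-stage scheduling that Algorithm 1 describes; the example dependency graph, PE types, and per-layer durations are invented for illustration, and a real run would take them from the dependency graph of the trained model.

def schedule(deps, duration, pe_type, pe_count):
    # deps:      layer -> set of predecessor layers (the DAG G(V, E))
    # duration:  layer -> number of pipeline stages the layer occupies a PE
    # pe_type:   layer -> required PE type; pe_count: PE type -> available units
    # Returns a list of (layer, start_stage, pe_type) tuples (the schedule S).
    free = dict(pe_count)
    finished, running, result = set(), [], []
    stage = 0
    while len(finished) < len(deps):
        # Launch every ready layer for which a PE of the right type is idle.
        for layer in sorted(deps):
            launched = any(layer == l for l, _, _ in running)
            ready = layer not in finished and not launched \
                    and all(p in finished for p in deps[layer])
            if ready and free.get(pe_type[layer], 0) > 0:
                free[pe_type[layer]] -= 1
                running.append((layer, stage, duration[layer]))
                result.append((layer, stage, pe_type[layer]))
        stage += 1
        # Retire layers whose duration has elapsed and release their PEs.
        still_running = []
        for layer, start, dur in running:
            if stage - start >= dur:
                finished.add(layer)
                free[pe_type[layer]] += 1
            else:
                still_running.append((layer, start, dur))
        running = still_running
    return result

# Illustrative encoder graph (names, durations, and PE counts are assumptions).
deps = {'KWk': set(), 'QWq': set(), 'VWv': set(),
        'heads': {'KWk', 'QWq', 'VWv'}, 'FC': {'heads'}, 'add_norm': {'FC'}}
duration = {'KWk': 4, 'QWq': 4, 'VWv': 4, 'heads': 1, 'FC': 2, 'add_norm': 1}
pe_type = {'KWk': 'PE-A', 'QWq': 'PE-A', 'VWv': 'PE-A',
           'heads': 'PE-B', 'FC': 'PE-FFT/IFFT', 'add_norm': 'Adder'}
pe_count = {'PE-A': 2, 'PE-B': 4, 'PE-FFT/IFFT': 1, 'Adder': 1}
print(schedule(deps, duration, pe_type, pe_count))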
In this section, we apply the enhanced BCM-based model compression on the linear layers and adopt a 16-bit fixed-point data representation for all the weights. We evaluate the accuracy impact with two representative Transformer structures, i.e., a shallow Transformer with both encoder and decoder, and a pretrained deep Transformer architecture, RoBERTa (base configuration), which only has an encoder [10]. The shallow Transformer is evaluated on a language modeling task, which is an unsupervised sequence-to-sequence problem that requires the decoder part. On the other hand, we run RoBERTa on a sentiment classification task, which is a supervised classification problem without the requirement for a decoder block. The software is implemented in the PyTorch deep learning framework [15] and the FairSeq sequence modeling toolkit [13]. Table 1 summarizes the key parameters of the shallow Transformer and RoBERTa models used in the experiments.

Table 1: Key parameters of Shallow Transformer and RoBERTa

Model Configuration | Transformer Structure | Transformer Layers | Hidden Size | Attention Heads | Total Params
Shallow Transformer | encoder-decoder | 2 | 200 | 4 | 6M
RoBERTa (base config.) | encoder only | 12 | 768 | 12 | 125M
We evaluate the proposed model compression method on finetuned RoBERTa [10] on the IMDB movie review sentiment classification task [11] to shed some light on training trial reductions. Starting from the saved state of the pretrained model in [10], we finetune the model until it reaches its best validation accuracy of 95.7%. To maintain the overall accuracy, we compress partial layers. The process suppresses randomness by using a deterministic seed. Thus the accuracy difference between the original RoBERTa and the compressed version is solely contributed by the compression techniques.
The language modeling task takes a sequence of words as input and determines how likely that sequence is actual human language. We consider the popular WikiText-2 dataset [12] in this experiment, which contains 2M training tokens with a vocabulary size of 33k. A shallow Transformer model with 4 attention heads and a hidden dimension of 200 is established. The baseline and model compression results of the shallow Transformer and RoBERTa, on WikiText-2 and IMDB review respectively, are shown in Table 2.
Table 2: Comparison among different model configurations
ID | Network Type | Block Size | WikiText-2 (ACC) % | ACC loss with BCM (%) | ACC loss with BCM & Quant. (%)
1 | Shallow Transformer | − | − | − | −

Table 3: Comparison among different model configurations
Shallow Transformer
Batch Size | DSP | FF | LUT | Latency (ms) | Power (W) | Throughput (FPS)
1 | 5647 | 304012 | 268933 | 2.94 | 22.45 | 680.91
4 | 5647 | 304296 | 269361 | 11.59 | 22.52 | 690.50
8 | 5647 | 305820 | 269753 | 22.90 | 22.66 | 698.72
16 | 5647 | 306176 | 270449 | 45.54 | 22.73 | 702.54

RoBERTa (base)
Batch Size | DSP | FF | LUT | Latency (ms) | Power (W) | Throughput (FPS)
1 | 6531 | 506612 | 451073 | 10.61 | 25.06 | 94.25
4 | 6531 | 506936 | 451545 | 40.33 | 25.13 | 99.13
8 | 6531 | 508488 | 452005 | 79.03 | 25.89 | 101.23
16 | 6531 | 508916 | 452661 | 157.18 | 25.96 | 101.79

We compress the models using the enhanced BCM-based method with a block size of 4 or 8. From Table 2, we observe that for the shallow Transformer, there is no accuracy loss with a block size of 4 and only 0.6% accuracy loss with a block size of 8. RoBERTa, on the other hand, incurs 4.2% and 4.3% accuracy drops after model compression using block sizes of 4 and 8, respectively. The accuracy drop on RoBERTa is slightly higher because its parameters are carefully pretrained on a 160GB text corpus using a masked language model [10] and are more sensitive to compression. We also observe that changing from 32-bit floating point to 16-bit fixed point does not cause accuracy loss. The comparable accuracy between the original model and the weight-compressed version demonstrates the effectiveness of the proposed model compression method.

The Xilinx Virtex UltraScale+ VCU118 board, comprising 345.9 Mb BRAM, 6,840 DSPs, 2,586K logic cells (LUT), and two 4GB DDR4 memories, is connected to the host machine through PCIe Gen3. We implement the compressed model on the FPGA to evaluate the performance and energy efficiency. For different batch sizes, we obtain the parallelism per stage for the 7 stages in the encoders/decoders of Transformer and RoBERTa based on Algorithm 1, as shown in Table 3. We report the resource utilization on FPGA including DSP, LUT, and FF. The latency (ms), throughput (frames/sequences per second), and power consumption (W) are also reported. Our results show that there is a trade-off between latency and power consumption. For both Transformer and RoBERTa, we achieve the best trade-off (the lowest ratio of latency to power) when the batch size is 8, since the latency increases significantly and the throughput does not increase when we use a larger batch size.
We compare the performance (throughput) and energy efficiency among CPU, GPU, and FPGA using the same model and the same benchmark (IMDB), as shown in Table 4. We also validate our method on an embedded low-power device by implementing our compressed model on the Jetson TX2, an embedded AI computing device built around a 256-core NVIDIA Pascal-family GPU with 8 GB of memory and 59.7 GB/s of memory bandwidth. Our FPGA design achieves 27.07× and 81× improvement in throughput and energy efficiency compared to the CPU. For the GPU (RTX5000), the power consumption is 5.01× that of the FPGA, and our FPGA design achieves 8.80× improvement in energy efficiency and 1.77× improvement in throughput compared to the GPU. Compared with the embedded GPU Jetson TX2, our FPGA design achieves 2.44× improvement in energy efficiency.

Table 4: The performance and energy efficiency comparison among CPU, GPU, FPGA using RoBERTa
Metric | CPU i7-8700K | GPU RTX5000 | FPGA VCU118 | Jetson TX2 (Embedded GPU)
Throughput (FPS) | 3.76 | 57.46 | 101.79 | 9.75
Power (W) | 80 | 126 | 25.13 | 5.86
Energy efficiency (FPS/W) | 0.05 | 0.46 | 4.05 | 1.66
In this paper, we propose an energy-efficient acceleration framework for transformer-based large scale language representations. Our framework includes an enhanced BCM-based method to enable model compression on large-scale language representations at the algorithm level, and an acceleration design at the architecture level. We propose an FPGA architecture design to support the model compression technique, and we develop a design automation and optimization technique to explore the parallelism and achieve high throughput and performance. Experimental results show that our proposed framework significantly reduces the size of NLP models with small accuracy loss on Transformer. Our FPGA-based implementation significantly outperforms CPU and GPU in terms of energy efficiency.
REFERENCES
[1] 2019. Supercharge Your AI and Database Applications with Xilinx's HBM-Enabled UltraScale+ Devices Featuring Samsung HBM2. Xilinx white paper, WP508 (v1.1.2) (2019).
[2] James Bradbury and Richard Socher. 2017. Towards Neural Machine Translation with Latent Tree Attention. EMNLP 2017 (2017), 12.
[3] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014).
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[5] Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, et al. 2017. CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. 395–408.
[6] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[7] S Lennart Johnsson and Robert L Krawitz. 1992. Cooley-Tukey FFT on the Connection Machine. Parallel Computing 18, 11 (1992), 1201–1221.
[8] Joo-Kyung Kim and Young-Bum Kim. 2018. Supervised Domain Enablement Attention for Personalized Domain Classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 894–899.
[9] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M Rush. 2017. Structured attention networks. arXiv preprint arXiv:1702.00887 (2017).
[10] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[11] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In ACL. Association for Computational Linguistics, 142–150.
[12] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. [n. d.]. Pointer Sentinel Mixture Models.
[13] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
[14] Victor Pan. 2012. Structured matrices and polynomials: unified superfast algorithms. Springer Science & Business Media.
[15] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).
[16] Peng Shi, Jinfeng Rao, and Jimmy Lin. 2018. Simple Attention-Based Representation Learning for Ranking Short Social Media Posts. arXiv preprint arXiv:1811.01013 (2018).
[17] Julius Orion Smith. 2007. Mathematics of the discrete Fourier transform (DFT): with audio applications. Julius Smith.
[18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[19] Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, Yanzhi Wang, and Yun Liang. 2018. C-LSTM: Enabling Efficient LSTM Using Structured Compression Techniques on FPGAs. In FPGA'18.
[20] Liang Zhao, Siyu Liao, Yanzhi Wang, Zhe Li, Jian Tang, and Bo Yuan. 2017. Theoretical Properties for Neural Networks with Weight Matrices of Low Displacement Rank. In International Conference on Machine Learning (ICML).