FIXAR: A Fixed-Point Deep Reinforcement Learning Platform with Quantization-Aware Training and Adaptive Parallelism
Je Yang Seongmin Hong Joo-Young Kim
School of Electrical Engineering, KAIST
{yangje, seongminhong, jooyoung1203}@kaist.ac.kr

Abstract — Deep reinforcement learning (DRL) is a powerful technology for decision-making problems in application domains such as robotics and gaming: an agent learns an action policy in an environment so as to maximize a cumulative reward. Unlike supervised models, which actively use data quantization, DRL still relies on single-precision floating-point arithmetic to preserve training accuracy, even though it suffers from computationally intensive deep neural network (DNN) computations.
In this paper, we present a deep reinforcement learning acceleration platform named FIXAR, which for the first time employs fixed-point data types and arithmetic units using a SW/HW co-design approach. We propose a quantization-aware training algorithm in fixed point, which reduces the data precision by half after a certain amount of training time without losing accuracy. We also design an FPGA accelerator that employs adaptive dataflow and parallelism to handle both inference and training operations. Its processing element has a configurable datapath to efficiently support the proposed quantization-aware training. We validate the FIXAR platform, in which the host CPU emulates the DRL environment and the FPGA accelerates the agent's DNN operations, by running multiple benchmarks in continuous action spaces based on a recent DRL algorithm called DDPG. The FIXAR platform achieves 25293.3 inferences per second (IPS) of training throughput, which is 2.7 times higher than the CPU-GPU platform. In addition, its FPGA accelerator delivers 53826.8 IPS and 2638.0 IPS/W energy efficiency, which are 5.5 times higher and 15.4 times more energy efficient than those of the GPU, respectively. FIXAR also shows the best IPS throughput and energy efficiency among state-of-the-art FPGA-based acceleration platforms, even though it targets one of the most complex DNN models.
Index Terms — Reinforcement Learning, Accelerator, Platform, Quantization, Deep Neural Network, FPGA

I. INTRODUCTION
Reinforcement learning (RL) is a promising area of machine learning that studies how an agent should take actions in an environment in order to maximize a long-term cumulative reward. It aims to solve complex decision-making problems in a setting where the agent constantly updates its action policy based on the reward feedback from the environment. Recently, deep reinforcement learning (DRL), which trains the action policy with a deep neural network (DNN), has been proposed [1]–[4]. This deep learning approach has become very popular, as in other machine learning disciplines, with widespread success in applications such as robotics, industrial control, and game playing [5]–[7]. Unlike supervised learning, which requires a large number of labeled input/output pairs to train a model, DRL uses its own inference samples to train an agent to take desirable actions. However, DRL's training process is computationally expensive because it requires repeated computations of forward and backward propagation.
Data quantization is an effective technique to reduce the size of DNN models by replacing 32-bit floating-point weights and activations with simpler representations such as lower-bit fixed-point numbers, with negligible or even no accuracy loss after re-training [8, 9]. It is beneficial for hardware because the reduced model requires less memory storage as well as less memory bandwidth. It also enables efficient computation by employing simpler, low bit-precision arithmetic units. Due to these strong benefits, many state-of-the-art DNN platforms use a low-bit fixed-point format instead of the expensive floating-point format whenever possible [10, 11]. Unfortunately, most of this quantization research has been done on supervised models and is focused on inference. It is questionable whether the same quantization techniques can be applied to deep reinforcement learning: as the agent's current decision heavily influences its future states and actions, it is hard to predict how quantization will affect the policy's long-term decisions in a complex environment [12].
In this paper, we propose a fixed-point deep reinforcement learning acceleration platform named FIXAR. FIXAR successfully employs dynamically changing dual fixed-point data types for the first time in the training process of deep reinforcement learning. Using a SW/HW co-design approach, its FPGA-based accelerator achieves the highest energy efficiency among existing state-of-the-art compute platforms. Our contributions are as follows.
• Algorithm Design: We propose a quantization-aware training algorithm that reduces the full fixed-point data precision by half after a certain number of training steps while maintaining DRL training accuracy. Based on this algorithm, the FIXAR hardware accelerator employs fast and energy-efficient fixed-point arithmetic units instead of expensive floating-point arithmetic units.
• Accelerator Design: We design an FPGA-based accelerator responsible for running DNN inference and training operations while supporting the proposed quantization-aware training. It is the first hardware accelerator that supports both inference and training with dual bit-precision in fixed point for DRL. To this end, we propose the adaptive array processing core, which exploits different dataflows and parallelisms to handle both forward and backward propagation. Its processing element has a configurable datapath that runs a quantized model seamlessly while doubling throughput. By eliminating external memory access through an efficient on-chip memory structure, FIXAR's accelerator achieves a very high training throughput.
• Platform Design: By integrating the proposed training algorithm and accelerator design, we implement the FIXAR platform on a CPU-FPGA system, where the host CPU emulates the environment and the FPGA accelerator runs the agent's DNN operations. We validate the FIXAR platform by running multiple benchmarks in continuous action spaces from the MuJoCo environment, based on the widely used DRL algorithm DDPG.
Finally, the FIXAR platform achieves 25293.3 inferences per second (IPS) of throughput and 2638.0 IPS/W of accelerator efficiency in system-level benchmarking, which is 2.7 times faster and 15.4 times more energy efficient than the CPU-GPU platform, without any accuracy degradation. Among state-of-the-art FPGA-based acceleration platforms, FIXAR shows the best IPS performance and energy efficiency, even though it targets one of the most complex DNN models.

II. BACKGROUND

Fig. 1: Actor-critic reinforcement learning algorithm
A. Deep Reinforcement Learning
In reinforcement learning, an agent interacts with its environment with the aim of learning reward-maximizing behavior (Figure 1). At each timestep $t$, the agent receives a state $s_t$ and selects an action $a_t$ based on its policy $\pi$. After sending the selected action $a_t$ to the environment, the agent receives a reward value $r_t$ and the next state $s_{t+1}$ from the environment. The agent continuously interacts with the environment with the goal of maximizing the cumulative reward $R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$, where $T$ is the total number of timesteps in an episode and $\gamma \in (0, 1]$ is a discount factor that reflects the importance of future rewards. The Q-value $Q^{\pi} = \mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t+2} + \ldots \mid S_t = s_t, A_t = a_t]$ represents how effective the action $a_t$ is in state $s_t$ at timestep $t$. In conventional reinforcement learning, the Q-learning algorithm, which stores the Q-values of all state-action pairs in a Q-table and recursively finds the optimal actions that maximize the Q-value, is commonly used. However, this table-based training easily becomes unstable when the number of states and actions grows in a complex continuous environment, because the chance of the agent visiting a particular state-action pair becomes increasingly small. The actor-critic algorithm was suggested to solve this problem.
The actor-critic algorithm uses a deep neural network to learn the action policy. The actor calculates the action based on the policy network $\pi(a_t \mid s_t; \theta)$, where $\theta$ is the network's model. The critic, on the other hand, evaluates how good the selected action is based on the value network $V(s_t; \theta)$. The DNN model parameters are updated based on the values calculated by the value network, in the direction where more rewards can be obtained through optimal actions. Among actor-critic algorithms, Deep Deterministic Policy Gradient (DDPG) [3] (and its variants [13, 14]) is known to be one of the best and has been successfully demonstrated in the continuous control domain.
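To make the return definition above concrete, the short Python sketch below (ours, not from the paper) computes the discounted return $R_t$ for a finite episode; the reward values and discount factor are illustrative placeholders.

# Minimal sketch: discounted return R_t = sum_{i=t}^{T} gamma^(i-t) * r_i for one episode.
def discounted_return(rewards, gamma=0.99, t=0):
    """Cumulative discounted reward from timestep t to the end of the episode."""
    return sum(gamma ** (i - t) * r for i, r in enumerate(rewards) if i >= t)

episode_rewards = [1.0, 0.5, 0.0, 2.0]               # made-up per-step rewards
print(discounted_return(episode_rewards, gamma=0.9))  # 1.0 + 0.9*0.5 + 0.0 + 0.729*2.0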
B. Quantization in Deep Reinforcement Learning

DNN model quantization is an active research area in machine learning that aims to use the smallest bit-precision possible in order to minimize both memory storage and bandwidth requirements. Since accuracy is the most important goal of model training, many training methods still use the standard 32-bit single-precision floating-point format to cover the wide dynamic range of the calculated gradients. One of the biggest problems of a quantized format in the training process is data truncation caused by a limited dynamic range. For fixed-point quantization, this truncation issue gets even worse. In addition, the inherent error introduced by quantizing higher-bit floating-point data to a lower-bit fixed-point representation must be handled as well. To overcome these challenges, Jain et al. [15] introduced a new fixed-point representation that contains compensation bits to dynamically adjust the error introduced during quantization. Zhang et al. [16] proposed a quantization scheme that changes the bit-width automatically per layer in fixed-point back-propagation.
Deep reinforcement learning addresses continuous decision-making problems, which are quite different from the recognition and classification tasks normally handled by conventional supervised models, so it is questionable whether the same quantization methodologies apply. Krishnan et al. [12] showed that certain reinforcement learning algorithms and environment tasks are more problematic to quantize because of their weight distributions. They also demonstrated that quantization may have a positive effect on training, as the induced noise diversifies the action exploration.
III. FIXAR PLATFORM

Fig. 2: Overall FIXAR platform
Figure 2 shows the overall architecture of the FIXAR platform, which broadly consists of the host CPU that emulates the RL environment and the FPGA accelerator that accelerates the compute-intensive DNN operations of the actor and critic networks. FIXAR is initiated by receiving the actor's current state and a random batch of B transitions, i.e., a set of required input vectors including a state, an action, and a reward, from the environment running on the CPU. The critic network evaluates the Q-value of each transition and executes backward propagation (BP) and weight update (WU) based on the estimated Q-value. With the updated weights, the critic network drives the BP and WU of the actor network in the direction where optimal actions can be obtained. Then, the actor network selects the action based on the updated weights in the given state, and this forward propagation (FP) result is sent to the host CPU. The environment takes the action computed by the FPGA, calculates the reward, and moves to a new state. It stores the transition information from the current step and samples a training batch to send to the FPGA. In this way, the FPGA continuously communicates with the host CPU through the PCIe interface. The detailed operation sequence between the host CPU and the FPGA accelerator is illustrated in Figure 3.

Fig. 3: FIXAR's operation sequence in a single timestep

FIXAR's FPGA accelerator is in charge of running all computationally heavy DNN inference and training workloads. It includes an activation memory that stores the input transitions from the host as well as the intermediate activations, a weight memory that stores the model parameters of the actor and critic networks, and a gradient memory that stores the intermediate gradient values. As the actor's model size (input: state, hidden: 400, hidden: 300, output: action) and the critic's model size (input: state+action, hidden: 400, hidden: 300, output: 1) are relatively small, we are able to store all the model parameters in the weight memory, whose size is 1.05MB, using only on-chip BRAMs. The size of the gradient memory is the same as the weight memory's, and it also uses only on-chip BRAMs. With the accumulated gradients, the weight update occurs in the Adam optimizer module, which is fully local to the FPGA because the entire set of model parameters is stored in on-chip BRAMs. The size of the activation memory is set to 2.94KB to hold the activation data of all 3 layers. Since the accelerator keeps all parameters and activations on-chip, it does not require any external DRAM memory accesses, which enables fast and efficient processing.
On the compute side, the accelerator has multiple adaptive array processing cores, each of which contains 16x16 processing elements and an activation line buffer for data broadcasting. Weight data are pre-loaded from the weight memory. The outputs of the array cores are aggregated in the accumulator and passed to the activation unit for nonlinear functions. The pseudo random number generator (PRNG) module injects random noise into the final results of the actor's inference to help action exploration.
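The per-timestep sequence above can be summarized by the following Python sketch; it is our illustration only, and every name in it (env, accelerator, ReplayBuffer) is a hypothetical placeholder rather than a FIXAR API.

import random

class ReplayBuffer:
    """Host-side transition storage (hypothetical helper, not part of FIXAR)."""
    def __init__(self, capacity=100_000):
        self.data, self.capacity = [], capacity
    def store(self, transition):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
        self.data.append(transition)
    def sample(self, batch_size):
        return random.sample(self.data, min(batch_size, len(self.data)))

def run_timestep(env, accelerator, buffer, state, batch_size=256):
    batch = buffer.sample(batch_size)            # host samples B transitions
    action = accelerator.step(state, batch)      # FPGA: critic/actor BP+WU, then actor FP
    next_state, reward, done = env.step(action)  # host environment applies the action
    buffer.store((state, action, reward, next_state, done))
    return next_state, done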
IV. DYNAMIC FIXED-POINT QUANTIZATION OF REINFORCEMENT LEARNING

FIXAR uses, for the first time, a dynamic fixed-point quantization that changes the bit-precision of activations during the training process of deep reinforcement learning. To ensure algorithmic accuracy, measured by the level of accumulated reward, we adapt the Quantization-Aware Training (QAT) algorithm from QuaRL [12] to a fixed-point version. The original QAT algorithm learns the RL policy model in the single-precision floating-point format for a certain time period, defined as the quantization delay, and then quantizes it to a narrow-bit floating-point representation for re-training. In our version, we start from the 32-bit fixed-point format and quantize it to half-precision after the quantization delay. Algorithm 1 describes the fixed-point version of the QAT algorithm used in the FIXAR platform.

Algorithm 1 Quantization-Aware Training for DRL
Input: Quantization bit $n$, quantization delay $d$
Output: Trained reinforcement learning model $M$
  Randomly initialize network parameter $\theta$ of $M$
  for $t = 1, T$ do
    if $t < d$ then
      Activation: fixed-point 32-bit, Weight: fixed-point 32-bit
      Monitor the maximum and minimum value of activations $A_{min}$, $A_{max}$
      Update $\theta$ with activation $A$
    else
      Activation: fixed-point 16-bit, Weight: fixed-point 32-bit
      $Q_n(A, A_{min}, A_{max}) = \lfloor A/\delta \rfloor + z$, where $\delta = (|A_{min}| + |A_{max}|)/2^n$ and $z = \lfloor -A_{min}/\delta \rfloor$
      Update $\theta$ with $Q_n(A, A_{min}, A_{max})$
    end if
    Evaluate($M$)
  end for

In the algorithm, the model parameter $\theta$ of both the actor and critic networks is initialized with random numbers. If the timestep $t$ is less than the quantization delay $d$, BP and WU are performed on the input activations in 32-bit fixed-point format. During this time, the minimum and maximum values of the activations are actively monitored and captured. Once the timestep reaches the quantization delay, the previously captured minimum and maximum values are used to quantize the activations. From this point, the activations are down-scaled to 16-bit fixed-point data, and both BP and WU are done with the quantized activations. Weights and gradients are kept in 32-bit fixed-point format for all timesteps. Because the weights are trained with full-precision activations during the quantization delay, a possible loss of accuracy can be compensated even though we train them with half-precision activations for the remaining time.
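As a rough functional sketch of the quantization step in Algorithm 1 (ours, written in NumPy rather than fixed-point hardware arithmetic, and assuming the step size $\delta = (|A_{min}| + |A_{max}|)/2^n$ reconstructed above):

import numpy as np

def quantize(a, a_min, a_max, n=16):
    """Map activations to n-bit integer levels using the range observed before the delay."""
    delta = (abs(a_min) + abs(a_max)) / (2 ** n)   # quantization step size
    z = np.floor(-a_min / delta)                   # zero-point offset
    return np.floor(a / delta) + z                 # quantized activation levels

activations = np.array([-0.8, 0.1, 1.3])           # made-up activation values
print(quantize(activations, a_min=-1.0, a_max=2.0, n=16))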
V. ACCELERATOR DESIGN

The goal of FIXAR's FPGA accelerator is twofold: to support the DNN inference and training operations of the actor and critic networks, and to support the dynamic fixed-point datapath for the proposed QAT algorithm. In this section, we mainly describe the design of the adaptive array processing core, which is the main compute engine of the accelerator.
A. Adaptive Array Processing Core
As illustrated in Figure 2, the FPGA accelerator has N adaptive array processing (AAP) cores for parallel DNN inference and training computations. Each core is arranged as a 2-dimensional array of configurable processing elements (PEs), where each PE performs a multiply-and-accumulate (MAC) operation between activations and weights. In an AAP core, a vector of input activations is copied from the activation memory to the 512-bit activation line buffer, and the weights are pre-loaded into each PE from the weight memory. During operation, each activation is broadcast to a row of PEs while the partial sums from the PEs in a column are accumulated vertically into the accumulator at the bottom. After the accumulated outputs are passed to the activation unit, they are saved in the activation memory and used as the input activations of the next layer.

B. Dataflow for Adaptive Parallelism
For each layer of a deep neural network, we need to calculate a matrix-vector multiplication (MVM) between the weight matrix W (size: PxQ) and an activation vector A (size: Qx1). Among the several ways to compute an MVM, we chose column-wise matrix decomposition, as illustrated in Figure 4(a). In this method, each column of the matrix is multiplied by the corresponding element of the vector (e.g., 1st column by 1st element, 2nd column by 2nd element, and so on) to generate Q partial-sum vectors. The output vector (size: Px1) is calculated by accumulating all the partial-sum vectors. We use this mechanism for both forward-propagating inference and back-propagating training.

Fig. 4: (a) Column-wise matrix decomposition (b) Dataflow for inference and training

Figure 4(b) shows the data mapping and flow for inference and training operations in the AAP cores. For inference, we map each column of the matrix to a row of the PE array and broadcast an element of the vector to that row. Then, by accumulating the products from the PEs vertically, we obtain the final vector at the bottom of the array. To achieve parallel processing within a single layer, i.e., intra-layer parallelism, we interleave the columns of the matrix, each of which is mapped to a row of the PE array, among multiple AAP cores. For example, with 4 AAP cores, the first core accumulates the partial-sum vectors from the 1st column, 5th column, and so on. In this case, once all the AAP cores finish their local accumulations, the results from the cores are accumulated again to get the final vector. For training, we need an MVM with the transposed matrix W^T and the error vector computed from the previous layer. Since we use the same column-wise mechanism in training as well, we map each column of the transposed matrix to a row of the PE array; therefore, each row of the original matrix maps to a row of the PE array. In training, we implement intra-batch parallelism by distributing the vectors of a batch across multiple AAP cores. As each AAP core works on its own MVM in parallel, we increase the overall throughput by the number of AAP cores. Based on the adaptive parallelism enabled by the different data mappings and flows, FIXAR is able to execute a single vector N times faster in forward propagation and compute N times more vectors of a batch in back-propagation, without any off-chip DRAM access, where N is the number of AAP cores.
The weight memory stores the model's matrix parameters row by row over 16 BRAM modules. We set the bit-width of the weight memory to 512 bits to read or write 16 weights in the same cycle. For inference, the controller reads a single row from the weight memory and distributes it to a column of the PE array. For training, on the other hand, it distributes the 16 weights to a row of the PE array. Based on the column-wise matrix decomposition mechanism, we efficiently solve the matrix transpose problem that arises in supporting both inference and training, by distributing the weights of a matrix row to either a column or a row of the physical PE array. The weight memory is the centralized storage for model parameters and is shared among the AAP cores. Without any duplication in the weight memory, we are able to store the entire set of model parameters for the DDPG algorithm on-chip.
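The following NumPy sketch (ours) illustrates the column-wise decomposition: the same accumulation loop computes both the forward MVM and, by passing the transposed matrix, the backward MVM, which is how the dataflow avoids an explicit transpose; the layer dimensions are examples only.

import numpy as np

def mvm_columnwise(W, a):
    """Compute W @ a by accumulating column-times-scalar partial-sum vectors."""
    P, Q = W.shape
    out = np.zeros(P)
    for q in range(Q):            # in hardware, columns are interleaved across AAP cores
        out += W[:, q] * a[q]     # one broadcast activation per row of the PE array
    return out

W = np.random.randn(400, 17)      # example layer: 17-dim input -> 400 hidden units
a = np.random.randn(17)
assert np.allclose(mvm_columnwise(W, a), W @ a)        # forward propagation
e = np.random.randn(400)
assert np.allclose(mvm_columnwise(W.T, e), W.T @ e)    # back-propagation on the transposed matrix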
C. Processing Element with Configurable Datapath

The processing element (PE) of the AAP core is the most basic compute unit in the accelerator; it performs the multiply-and-accumulate (MAC) operation in fixed-point format. As each AAP core contains 256 PEs, using fixed-point arithmetic units instead of floating-point arithmetic units greatly reduces the logic area and power consumption.

Fig. 5: Processing Element with Configurable Datapath

Fig. 6: FPGA layout

TABLE I: FPGA Resource Usage on Xilinx Alveo U50
Component         | LUT     | FF      | BRAM    | URAM    | DSP
PEs               | 216.3K  | 161.8K  | 0       | 0       | 2295
On-chip Memory    | 10.3K   | 0       | 584     | 128     | 0
Adam Optimizer    | 46.7K   | 70.2K   | 0       | 0       | 3
Control Unit      | 69.0K   | 45.4K   | 0       | 0       | 0
Kernel Interface  | 68.8K   | 15.2K   | 12      | 0       | 0
HBM Interface     | 8.2K    | 13.1K   | 2       | 0       | 0
PCIe DMA          | 88.8K   | 103.2K  | 176     | 0       | 4
Total             | 508.1K  | 408.8K  | 774     | 128     | 2302
(utilization)     | (58.4%) | (23.5%) | (57.6%) | (20.0%) | (38.8%)
In accordance with the QAT algorithm, the PE supports a configurable datapath for one 32-bit activation or two 16-bit activations, as shown in Figure 5. Leveraging the fact that a 32-bit by 32-bit multiplication can be decomposed into two 32-bit by 16-bit multiplications, we employ two 32-bit by 16-bit multipliers in the PE. In the full-precision case before quantization, the output of the upper-bits multiplier is left-shifted and added to the output of the lower-bits multiplier to generate the final result. In the half-precision case after quantization, the two partial sums are used separately as two final results. On the memory side, no changes are needed, since two 16-bit activations simply replace a single 32-bit activation. As a result, PEs with the configurable datapath efficiently support both full- and half-precision MACs in fixed point and double the throughput in the half-precision case without any overhead.
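A small integer-arithmetic sketch (ours, not the RTL) of the configurable datapath: in full-precision mode the two 32b x 16b products are recombined by shift-and-add, while in half-precision mode they are returned as two independent results.

def mac_full_precision(w, a32):
    a_hi = a32 >> 16                      # upper 16 bits (arithmetic shift keeps the sign)
    a_lo = a32 & 0xFFFF                   # lower 16 bits, treated as unsigned
    return (w * a_hi << 16) + w * a_lo    # shift-and-add recombination of the two products

def mac_half_precision(w, a16_first, a16_second):
    return w * a16_first, w * a16_second  # two independent products per cycle

w, a32 = 123456, -987654                  # made-up operands
assert mac_full_precision(w, a32) == w * a32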
VI. EVALUATION

In this section, we evaluate FIXAR's CPU-FPGA platform against the popular CPU-GPU platform. We used an Intel Xeon 6226R at 2.9 GHz as the host CPU for both platforms. We used a Xilinx Alveo U50 accelerator card for the FPGA and an Nvidia Titan RTX for the GPU.
A. Implementation Results
We implement FIXAR's hardware accelerator on the Xilinx Alveo U50 acceleration card. As the U50's FPGA chip uses a chiplet-based design with 2 super logic regions (SLRs), we carefully designed the pipelining in the PE array and the fan-out of the weight and gradient memories to overcome the severe SLR crossing penalty. As a result, we are able to integrate 2 AAP cores across the 2 SLRs at a 164MHz operating frequency while utilizing 58.4%, 57.6%, and 38.8% of the LUT, BRAM, and DSP resources, respectively. We used the Xilinx Vitis framework 2020.1 for PCIe communication between the host and the FPGA card. Figure 6 shows the final layout of the Xilinx U50 FPGA and Table I summarizes its resource usage.
B. Benchmarks for Continuous Action Space
To evaluate the FIXAR platform, we run multiple physical locomotion benchmarks, HalfCheetah, Hopper, and Swimmer, from the MuJoCo physics engine [17]. These benchmarks target continuous action spaces and are considered complex tasks, hence they are widely used for RL algorithm evaluation. For instance, the HalfCheetah benchmark aims to train a cheetah to run by giving 6 action outputs based on the cheetah's state, which includes 17 physical conditions, and the reward from the environment. Likewise, the Hopper benchmark has an 11-dimensional state and a 6-dimensional action, and the Swimmer benchmark has an 8-dimensional state and a 2-dimensional action. We use the DDPG algorithm to learn the agent's action policy in the continuous action space. In our DDPG implementation, we use a neural network with 2 hidden layers (input: state, hidden: 400, hidden: 300, output: action) for the actor. The critic's neural network receives both the state and the action as inputs and generates a single error value for the actor (input: state+action, hidden: 400, hidden: 300, output: 1). Both networks use the rectified linear unit (ReLU) in each layer, while the actor applies an additional tanh to the output. Both networks' parameters are optimized using the Adam optimizer. We run each task for 1 million timesteps in total while evaluating the reward every 5000 timesteps. For each evaluation, we calculate the average of the cumulative rewards until the agent falls down, over 10 random initial states.
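For reference, a shape-only NumPy sketch (ours) of the actor and critic networks described above, instantiated for HalfCheetah's 17-dimensional state and 6-dimensional action; the random weights are stand-ins for trained parameters.

import numpy as np
relu = lambda x: np.maximum(x, 0)

def actor(state, W1, W2, W3):
    h = relu(state @ W1)           # state -> 400
    h = relu(h @ W2)               # 400 -> 300
    return np.tanh(h @ W3)         # 300 -> action, squashed to [-1, 1]

def critic(state, action, V1, V2, V3):
    h = relu(np.concatenate([state, action]) @ V1)   # (state + action) -> 400
    h = relu(h @ V2)                                 # 400 -> 300
    return (h @ V3).item()                           # single Q-value estimate

s_dim, a_dim = 17, 6
W1, W2, W3 = np.random.randn(s_dim, 400), np.random.randn(400, 300), np.random.randn(300, a_dim)
V1, V2, V3 = np.random.randn(s_dim + a_dim, 400), np.random.randn(400, 300), np.random.randn(300, 1)
a = actor(np.random.randn(s_dim), W1, W2, W3)
q = critic(np.random.randn(s_dim), a, V1, V2, V3)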
C. Performance

Algorithm accuracy
To verify the algorithmic accuracy of FIXAR's fixed-point training, we measure its total reward per training episode (1 episode = 1000 timesteps). For comparison, we also measure the training results of the GPU when it runs the experiment with various data formats, including 32-bit single-precision floating-point, 32-bit fixed-point, and 16-bit fixed-point. The graph in Figure 7 presents the total reward obtained during the training process.

Fig. 7: Algorithm accuracy in HalfCheetah environment

It shows that FIXAR's dynamic fixed-point training successfully trains the DDPG DNNs, with the cumulative reward saturating towards 2000, as in the 32-bit floating-point and 32-bit fixed-point cases. Although FIXAR shows a dip in reward right after quantization, it recovers as it re-trains the model with the reduced bit-precision. This is possible because re-training starts from the model pre-trained at full bit-precision during the quantization delay. On the other hand, the GPU case that starts training with the 16-bit fixed-point format fails to train.
Training throughput
Following previous works, we use the IPS metric, the number of inferences processed per second, to evaluate training performance. It is the ratio of the total number of collected samples to the end-to-end system time taken in an entire timestep, including inference, training, and environment interactions. Figure 8 shows the IPS results of the FIXAR and CPU-GPU platforms for the chosen benchmarks when the batch size varies among 64, 128, 256, and 512. As the batch size increases, the throughput of both platforms improves.

Fig. 8: FIXAR platform's training throughput

Fig. 9: (a) Execution time of FIXAR platform (b) Execution time ratio of FIXAR platform

Figure 9(a) shows the detailed execution time breakdown of a single timestep on the FIXAR platform for different batch sizes. In all cases, the CPU time spent running the MuJoCo Python environment is roughly constant at around 2 ms. The time spent in the Xilinx run-time to import the input batch from the host CPU to the FPGA increases only marginally even when the batch size doubles, which tells us that the initial overhead of buffer allocation and PCIe communication in the run-time is quite large. The time spent on the FPGA accelerator is linear in the input batch size because FIXAR's AAP cores remain highly utilized across batch sizes thanks to the intra-batch parallelism. We can also observe that the system bottleneck shifts from the CPU to the FPGA accelerator as the batch size increases, as shown in Figure 9(b). The CPU-GPU platform benefits from a large batch size more than FIXAR because it significantly improves the GPU's utilization. As a result, the FIXAR platform performs 1.8-4.8 times better than the CPU-GPU platform, even though it has some inefficiency in the run-time system.
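The IPS metric described above reduces to the following calculation; the timing values here are hypothetical placeholders, not measurements from the paper.

def inferences_per_second(num_samples, t_env_s, t_runtime_s, t_fpga_s):
    """Samples processed per second of end-to-end timestep latency."""
    return num_samples / (t_env_s + t_runtime_s + t_fpga_s)

print(inferences_per_second(256, t_env_s=2e-3, t_runtime_s=5e-3, t_fpga_s=4e-3))  # ~23273 IPS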
Accelerator efficiency
Figure 10 shows the IPS throughput and energy efficiency measured only on the accelerators, i.e., the FPGA accelerator and the GPU, excluding the host CPU time.

Fig. 10: (a) Accelerator throughput (b) Accelerator energy efficiency

TABLE II: Comparison with Previous Works
                                 | ASPLOS'19 [19]     | FCCM'20 [20]       | FIXAR
Platform                         | Xilinx VCU1525     | Xilinx U200        | Xilinx U50
Clock                            | 180MHz             | 285MHz             | 164MHz
Algorithm                        | Actor-Critic (A3C) | Actor-Critic (PPO) | Actor-Critic (DDPG)
Task Env.                        | Discrete           | Continuous         | Continuous
Precision                        | Floating 32-bit    | Floating 32-bit    | Fixed 32/16-bit
DSP                              | 2348               | 3744               | 2302
Network Size                     | 2592.0 KB          | 229.6 KB           | 514.4 KB
Peak Perf.                       | 2550.0 IPS         | 15286.8 IPS        | 38779.8 IPS
Normalized Peak Perf. to FIXAR   | 12849.1 IPS        | 6823.2 IPS         | 38779.8 IPS
Energy Efficiency (Accelerator)  | 141.7 IPS/W        | -                  | 2638.0 IPS/W

The IPS performance of FIXAR's accelerator remains high at 53826.8 IPS across all batch sizes. This is possible because the multiple AAP cores (N=2 in this implementation) can run a single vector faster using intra-layer parallelism in the forward path and can run multiple vectors of a batch in parallel using intra-batch parallelism in the backward path, as described in Section V-B. Since the batch size is large compared to the number of AAP cores, the hardware utilization stays very high at 92.4%. On the other hand, the GPU's hardware utilization increases linearly with the batch size. For power estimation, we used the Xilinx Board Utility, which measures the power consumption of the overall acceleration card including the FPGA, the PCIe interface, and the on-board DRAMs. We measured that the FPGA and the GPU consume 20.4W and 56.7W on average, respectively, when running the DNN models of the DDPG algorithm on the 3 benchmarks. As a result, FIXAR's FPGA accelerator, at 2638.0 IPS/W, achieves 15.4 times higher energy efficiency than the Titan RTX GPU.

VII. RELATED WORK
There has been an increasing number of works on accelerating deep reinforcement learning over the last few years, but most of them implement table-based Q-learning algorithms [18]. A couple of recent works have focused on accelerating advanced actor-critic algorithms, like FIXAR. FA3C [19] presented a framework for accelerating a deep reinforcement learning algorithm named A3C [2] and applied it to an Atari game. However, FA3C was only evaluated in the discrete action space, which requires much lower precision than the continuous action space that FIXAR targets. In [20], an FPGA-based acceleration platform is proposed for the PPO algorithm [4]. While the PPO accelerator targets the continuous action space like FIXAR, it is still based on the resource-hungry floating-point format. In addition, its DNN model size is less than half of FIXAR's. Based on efficient fixed-point arithmetic units, FIXAR is able to process more operations with fewer DSPs, so its overall throughput is higher even with a more-than-twice larger network. Table II summarizes the comparison of the FIXAR platform against the state-of-the-art works.

VIII. CONCLUSION
In this paper, we present a DRL acceleration platform called FIXAR, which employs fixed-point data types and arithmetic units for the first time. We propose a quantization-aware training algorithm that reduces the fixed-point data precision by half after a certain number of training steps. We also design an FPGA accelerator that employs adaptive dataflow and parallelism to handle both inference and training operations while supporting dual fixed-point data types. We evaluate the implemented FIXAR platform against the conventional CPU-GPU platform by running multiple benchmarks for continuous action spaces. As a result, FIXAR achieves 25293.3 IPS training throughput and 2638.0 IPS/W accelerator efficiency, which are 2.7 times higher and 15.4 times more energy efficient than those of the CPU-GPU platform. It also shows the best performance and energy efficiency among other acceleration platforms, even though it runs one of the most complex DNN models, thanks to its fixed-point arithmetic implementation.

REFERENCES
[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[2] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.
[3] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[4] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
[5] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowski, "ViZDoom: A Doom-based AI research platform for visual reinforcement learning," 2016.
[6] F. Farahnakian, P. Liljeberg, and J. Plosila, "Energy-efficient virtual machines consolidation in cloud data centers using reinforcement learning," 2014, pp. 500–507.
[7] F. Zhang, J. Leitner, M. Milford, B. Upcroft, and P. Corke, "Towards vision-based deep reinforcement learning for robotic motion control," 2015.
[8] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[9] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160, 2016.
[10] J. Yang, X. Shen, J. Xing, X. Tian, H. Li, B. Deng, J. Huang, and X.-s. Hua, "Quantization networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7308–7316.
[11] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713.
[12] S. Krishnan, S. Chitlangia, M. Lam, Z. Wan, A. Faust, and V. J. Reddi, "Quantized reinforcement learning (QuaRL)," arXiv preprint arXiv:1910.01055, 2019.
[13] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. TB, A. Muldal, N. Heess, and T. Lillicrap, "Distributed distributional deterministic policy gradients," 2018.
[14] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," 2018.
[15] S. Jain, S. Venkataramani, V. Srinivasan, J. Choi, P. Chuang, and L. Chang, "Compensated-DNN: Energy efficient low-precision deep neural networks by compensating quantization errors," IEEE, 2018, pp. 1–6.
[16] X. Zhang, S. Liu, R. Zhang, C. Liu, D. Huang, S. Zhou, J. Guo, Q. Guo, Z. Du, T. Zhi et al., "Fixed-point back-propagation training," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2330–2338.
[17] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," 2012, pp. 5026–5033.
[18] J. Su, J. Liu, D. Thomas, and P. Cheung, "Neural network based reinforcement learning acceleration on FPGA platforms," ACM SIGARCH Computer Architecture News, vol. 44, pp. 68–73, 2017.
[19] H. Cho, P. Oh, J. Park, W. Jung, and J. Lee, "FA3C: FPGA-accelerated deep reinforcement learning," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 499–513.
[20] Y. Meng, S. Kuppannagari, and V. Prasanna, "Accelerating proximal policy optimization on CPU-FPGA heterogeneous platforms," in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2020.