Accelerating Deep Neuroevolution on Distributed FPGAs for Reinforcement Learning Problems
Alexis Asseman, Nicolas Antoine and Ahmet S. Ozcan
IBM Almaden Research Center, San Jose, CA, USA.
Abstract
Reinforcement learning, augmented by the representational power of deep neural networks, has shown promising results on high-dimensional problems, such as game playing and robotic control. However, the sequential nature of these problems poses a fundamental challenge for computational efficiency. Recently, alternative approaches such as evolutionary strategies and deep neuroevolution demonstrated competitive results with faster training time on distributed CPU cores. Here, we report record training times (running at about 1 million frames per second) for Atari 2600 games using deep neuroevolution implemented on distributed FPGAs. Combined hardware implementation of the game console, image pre-processing and the neural network in an optimized pipeline, multiplied by the system-level parallelism, enabled the acceleration. These results are the first application demonstration on the IBM Neural Computer, which is a custom-designed system that consists of 432 Xilinx FPGAs interconnected in a 3D mesh network topology. In addition to high performance, experiments also showed improvement in accuracy for all games compared to the CPU implementation of the same algorithm.
Introduction

In reinforcement learning (RL) [3][11], an agent learns an optimal behavior by observing and interacting with the environment, which provides a reward signal back to the agent. This loop of observing, interacting and receiving rewards applies to many problems in the real world, especially in control and robotics [16]. Video games can be easily modeled as learning environments in an RL setting [9], where the players act as agents. The most appealing part of video games for reinforcement learning research is the availability of the game score as a direct reward signal, as well as the low cost of running large amounts of virtual experiments on computers without actual consequences (e.g., crashing a car hundreds of times would not be acceptable).

Deep learning based game playing reached popularity when Deep Q-Network (DQN) [14] showed human-level scores for several Atari 2600 games. The most important aspect of this achievement was learning control policies directly from raw pixels in an end-to-end fashion (i.e., pixels to actions). Subsequent innovations in DQN [25], and new algorithms such as the Asynchronous Advantage Actor-Critic (A3C) [13] and Rainbow [8], made further progress and launched the field into explosive growth. A comprehensive and recent review of deep learning for video game playing can be found in [10].

However, gradient-based optimization algorithms, used for the training of neural networks, have performance limitations: they do not lend themselves to parallelization, and they require heavy computations and a large amount of memory, necessitating specialized hardware such as Graphics Processing Units (GPUs).

Compared to the gradient-descent-based optimization techniques mentioned above, derivative-free optimization methods such as evolutionary algorithms have recently shown great promise.
One of these approaches, called deep neuroevolution, can optimize a neural network's weights as well as its architecture. Recent work in [18] showed that a simple genetic algorithm with a Gaussian noise mutation can successfully evolve the parameters of a neural network and achieve competitive scores across several Atari games. Training neural networks with derivative-free methods opens the door for innovations in hardware beyond GPUs. The main implications are related to precision and data flow. Rather than floating point operations, fixed point precision is sufficient [6], and data flow is only forward (i.e., inference only, no backward flow). Moreover, genetic algorithms are population-based optimization techniques, which greatly benefit from distributed parallel computation.

These observations led us to conclude that genetic algorithm-based optimization of neural networks could be accelerated (and made more efficient) by the use of hardware optimized for fast inference, and that a multiplicity of such devices would easily take advantage of the inherent parallelism of the algorithm. Hence, we implemented our solution on the IBM Neural Computer [15], which is a custom-designed distributed FPGA system developed by IBM Research. By implementing two instances of the whole application on each of the 416 FPGAs we used (i.e., game console, image pre-processing and the neural net), we were able to run 832 instances in parallel, at an aggregated rate of 1.2 million frames per second. Our main contributions are:

• Introduction of an FPGA-accelerated Fitness Evaluation Module, consisting of a neural network and Atari 2600 pair, for use with evolutionary algorithms.

• The first demonstration of accelerated training of quantized neural networks using neuroevolution on distributed FPGAs.

• Extensive results on 59 Atari 2600 games trained for six billion frames using deep neuroevolution, and performance analysis of our results on the IBM Neural Computer compared to baselines.
Related Work

Most of the FPGA-based implementations of neural networks target inference applications due to the advantages related to energy efficiency and latency [21][23][22]. These are often based on high-level synthesis for FPGAs, while some of them utilize frameworks that convert and optimize neural network models into bitstreams. FPGA maker Xilinx recently launched a new software platform called Vitis to make it easier for software developers to convert neural network models to FPGA bitstreams.

In addition to the inference-only applications, a few studies utilized FPGAs to accelerate reinforcement learning and genetic algorithms. For example, [5] proposed the FA3C (FPGA-based Asynchronous Advantage Actor-Critic) platform, which targets both inference and training using single-precision floating point arithmetic in the FPGA. They show that the performance and energy efficiency of FA3C is better than a high-end GPU-based implementation. Similar to our work, they chose the Atari 2600 games (only six) to demonstrate their results. However, unlike our work, their Atari 2600 environment is the Arcade Learning Environment [4], which runs on the host CPU.

Genetic algorithms (GA) are another class of optimization methods that FPGA acceleration can help. For example, [19] implemented GA on FPGA hardware and proposed designs for genetic operations, such as mutation, crossover and selection. Their approach tried to exploit parallelism and pipelining to speed up the algorithm. Experimental results were limited to the optimization of a modified Witte and Holst's Strait Equation, f(x1, x2, x3) = |x1 − a| + |x2 − b| + |x3 − c|, and showed about an order of magnitude speed-up compared to a CPU implementation at the time.

A more recent study [20] proposed a parallel implementation of GA on FPGAs. They showed results for the optimization of various simple mathematical functions, which are trivial to implement and evaluate in the FPGA itself.
Compared to previous studies, they report speed-up values ranging from one to four orders of magnitude.

Even though these related studies are not a complete picture of the field, our approach is fundamentally different and unique in several aspects. Rather than accelerating the optimization algorithm (e.g., RL or GA), we have taken a different approach and addressed the data generation (i.e., the Atari game environment and obtaining frames). Moreover, we are pipelining the image pre-processing and neural network inference entirely within the FPGA, thus avoiding costly external memory accesses, which contributes significantly to our results.

The IBM Neural Computer (INC) [15] is a parallel processing system with a large number of compute nodes organized in a high bandwidth, low latency 3D mesh network. Within each node is a Zynq-7045 system-on-chip, which integrates a dual-core Cortex A9 ARM processor and an FPGA, alongside 1 GB of DRAM used both by the ARM CPU and the FPGA. The INC cage is comprised of a 3D network of nodes distributed over 16 cards of 27 nodes each (fig. 1). On each card, a gateway node (xyz) = (000) has supplementary control capabilities over its card, and also provides a 4-lane PCIe 2.0 connection to communicate with an external computer.

Figure 1: IBM Neural Computer: (a) Cage holding 16 cards (b) Card composed of 27 nodes (c) Node based on a Zynq-7045 with 1GB of dedicated RAM

The 3D mesh network is supported by the high frequency transceivers integrated into the Zynq chip. These are entirely controlled by the FPGA, thus enabling a low-level optimization of the network for the target applications. In particular, the network protocols currently implemented over the hardware network enable us to communicate from any node to any other node of the system, including reading and writing any address accessible over its AXI bus.
That last point enables us to control all the Atari 2600 fitness evaluation modules present on all the nodes of the system, from the gateway node connected to the computer through PCIe.

We elected to use 26 out of the 27 nodes of each card, leaving out the node (xyz) = (000) of each card. Therefore, all the computation carried out in the experiments described herein was on a total of 416 nodes.

The Atari 2600 → image pre-processing → ANN → Atari 2600 loop is integrated in a fitness evaluation module, which can communicate with the AXI bus in order to control the operation from the outside, i.e., by reading and writing memory-mapped registers exposed on the AXI bus (see fig. 2).

Figure 2: Schematic representation of the fitness evaluation module carrying out the evaluation loop entirely in FPGA.

The whole loop is pipelined together, and caching is reduced to the bare minimum to decrease the latency of the loop. Moreover, information exchange between the loop and the rest of the system is done asynchronously, such that the loop is never interrupted by external events. This enabled us to achieve 1,450 frames per second while running the Atari 2600 inside the loop described above.

The module exposes on the AXI address space:

• The Atari 2600's block RAM containing the game ROM (write), such that games can be loaded dynamically from the outside.

• The ANN's block RAM containing the parameters (write).

• The game identifier (write) – used by the fitness evaluation module to know where in the console's RAM the game status as well as the score are stored.

• The status of the game (read) – Alive or Dead.

• The game's score (read).

• A frame counter (read).

• A clock counter (read) – to deduce the wall time that passed since the game start.

• A command register (write) – to reset the whole loop (when a new game or new parameters are loaded) and start the game, or to forcibly stop the loop's execution.

Table 1 contains a summary of the hardware utilization of the different submodules comprising the fitness evaluation module, as reported by Xilinx's Vivado tool.

We implemented two instances of the fitness evaluation module per INC node, which brings us to a total of 832 instances used in parallel, for a total maximum of 1,206,400 frames per second.
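A register interface like the one above can be driven from the host side as in the following sketch. The register offsets, the base address, and the `FakeBus` transport are hypothetical placeholders for illustration only; they are not the actual INC memory map, which is not given here.

```python
# Hypothetical register offsets within one fitness evaluation module
# (illustrative only; not the real INC memory map).
REG_GAME_ID = 0x00  # write: game identifier
REG_STATUS  = 0x04  # read: 1 = alive, 0 = dead
REG_SCORE   = 0x08  # read: current game score
REG_FRAMES  = 0x0C  # read: frame counter
REG_COMMAND = 0x14  # write: 1 = reset + start, 2 = force stop

class FitnessModule:
    """Controls one Atari 2600/ANN evaluation loop over a memory-mapped bus."""
    def __init__(self, bus, base):
        self.bus, self.base = bus, base

    def start(self, game_id):
        self.bus.write(self.base + REG_GAME_ID, game_id)
        self.bus.write(self.base + REG_COMMAND, 1)  # reset the loop and start

    def poll(self):
        alive = self.bus.read(self.base + REG_STATUS) == 1
        score = self.bus.read(self.base + REG_SCORE)
        return alive, score

class FakeBus:
    """Dictionary-backed stand-in for the AXI bus, so the sketch is runnable."""
    def __init__(self):
        self.mem = {}
    def write(self, addr, value):
        self.mem[addr] = value
    def read(self, addr):
        return self.mem.get(addr, 0)

bus = FakeBus()
mod = FitnessModule(bus, base=0x40000000)
mod.start(game_id=7)
alive, score = mod.poll()
```

On the real system, the same reads and writes would be issued over the 3D mesh network from the gateway node, since any AXI-visible address on any node is reachable from any other node.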
To obtain the highest performance, we chose to avoid software emulation of the Atari 2600 console and took advantage of the FPGA instead, which can easily implement the original hardware functionality of the console at a much higher frequency. We used an open-source VHDL implementation from the open-source MiSTer project (https://github.com/MiSTer-devel/Main_MiSTer/wiki).

Table 1: Hardware Utilization of a Single Instance of the Fitness Evaluation Module.

Submodule | Slice LUTs | BRAM Tiles | DSPs
Atari 2600 | 1,875 | 9 | 0
Image pre-processing | 677 | 16.5 | 2
Neural network | 22,855 | 140 | 416
Miscellaneous | 1,337 | 0 | 0
Total | 26,744 | 165.5 | 418

Figure 3: Screenshots of some of the games we trained on FPGAs using Deep Neuroevolution: (a) Alien (b) Chopper Command (c) Fishing Derby (d) Freeway (e) Hero (f) River Raid.

We ran the Atari 2600's main clock at 150 MHz, instead of the original 3.58 MHz [1]. As we are using it in NTSC [17] picture mode, we obtain a proportionally higher frame rate from the console (roughly 60 Hz × 150/3.58, i.e., about 2,500 frames per second).

We chose to apply the same image pre-processing as in [14] and [18], for the dual purpose of enabling an easier comparison with those results, as well as reducing the hardware cost of the ANN (Artificial Neural Network). The entire pre-processing stack is implemented on the FPGA in a pipelined fashion for maximum throughput. The pre-processing stack is comprised of:

• A color conversion module that converts the console's 128-color palette to luminance, using the ITU BT.601 [2] conversion standard. This is done instead of just keeping the 3-bit luminance from the console's NTSC signal, such that the 128-color palette is converted to a 124-level grayscale palette (4 levels are lost due to some overlaps in the conversion).

• A frame-pooling module. Its purpose is to eliminate sprite flickering (where sprites show on the screen in half of the frames to bypass the sprite limitations of the console). This is achieved by keeping the previous frame in memory and, for each pixel, showing the one that has the highest luminance between the current frame and the previous frame.

• A re-scaling module, to re-scale the image from the original 160 × 210 pixels down to 84 × 84.

• A frame-stacking module, to stack the frames in groups of 4, where each of the 4 frames becomes a channel of the payload that is fed into the ANN. This has two purposes: it divides the number of inputs to the ANN by 4, and it also enables the ANN to see 4 frames at a time, therefore being able to deduce motion within those 4 frames.

The hardware architecture for the neural network was generated using the open-source tool DNNBuilder [24] (also known as AccDNN, available at https://github.com/IBM/AccDNN). It was chosen because it generates human-readable register transfer level (RTL) code, which describes a fully-pipelined neural network, optimized for low block RAM utilization and low latency. DNNBuilder makes this possible by implementing a Channel Parallelism Factor (CPF) and Kernel Parallelism Factor (KPF), which respectively unroll the input and output channels of an ANN layer, at the cost of higher hardware utilization. By alternating the CPF and KPF values at each stage of the ANN, caching, and therefore latency, can be reduced.

Table 2: Architecture of the Implemented Neural Network.

Layer | Kernel | Stride | Outputs | Activation | CPF | KPF
Convolution | 8 × 8 | 4 | 32 | ReLU | 4 | 32
Convolution | 4 × 4 | 2 | 64 | ReLU | 32 | 4
Convolution | 3 × 3 | 1 | 64 | ReLU | 4 | 32
Inner product | - | - | 18 | - | 4 | 1

Table 3: DNNBuilder Fixed-Point Numerical Precision Settings for All Layers.

Bit-width | 16
Weights radix | 13
Activations radix | 6

Table 2 illustrates the architecture of the model, which was implemented and trained in this study. Note that the model is similar to the one used in [14], but the convolutions are done without padding, and the first fully-connected layer is removed. This was necessary to bring the number of parameters down to a size that fits within the FPGA's resources.

The action selection submodule selects the joypad action to apply for the next 4 frames by selecting the action with the maximum reward as predicted by the ANN's output. To introduce stochasticity into the games, we used sticky actions, as recommended in [12]: with probability ς, the action sent to the environment at the previous frame is maintained during the current frame, instead of applying the latest selected action. We used the recommended stickiness parameter value ς = 0.25. The randomness is sampled from a rather large maximum-length 41-bit linear feedback shift register running independently from the rest of the module.
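The pre-processing stack above can be sketched in NumPy as follows. The function names and the nearest-neighbor down-scaling are our own simplifications for illustration; the hardware pipeline operates on streaming pixels rather than whole frames.

```python
import numpy as np

def pool_frames(prev, cur):
    """Flicker removal: per-pixel max luminance of two consecutive frames."""
    return np.maximum(prev, cur)

def rescale(frame, out_h=84, out_w=84):
    """Nearest-neighbor down-scaling from 210x160 to 84x84 (a simplification
    of the hardware re-scaling module)."""
    h, w = frame.shape
    rows = (np.arange(out_h) * h) // out_h
    cols = (np.arange(out_w) * w) // out_w
    return frame[np.ix_(rows, cols)]

def stack_frames(frames4):
    """Stack 4 pre-processed frames into the 4-channel ANN input."""
    return np.stack(frames4, axis=0)  # shape (4, 84, 84)

# Toy usage with random 'luminance' frames (210 rows x 160 columns).
rng = np.random.default_rng(0)
raw = [rng.integers(0, 124, size=(210, 160), dtype=np.uint8) for _ in range(5)]
pooled = [pool_frames(raw[i], raw[i + 1]) for i in range(4)]
net_input = stack_frames([rescale(f) for f in pooled])
```

The resulting `net_input` has shape (4, 84, 84), matching the 4-channel input of the network in Table 2.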
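The fixed-point settings of Table 3 amount to scaling by 2^radix, rounding, and saturating to the 16-bit range. The helper below is our own illustration of that quantization scheme, not DNNBuilder code.

```python
import numpy as np

def to_fixed(x, bits=16, radix=13):
    """Quantize to signed fixed-point with `radix` fractional bits,
    saturating to the representable range (Q2.13 for the weights of Table 3)."""
    scale = 1 << radix
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return np.clip(np.round(x * scale), lo, hi).astype(np.int16)

def from_fixed(q, radix=13):
    """Recover the approximate real value from the fixed-point code."""
    return q.astype(np.float64) / (1 << radix)

w = np.array([0.1, -0.25, 3.9999, -4.0])
q = to_fixed(w)          # 16-bit codes; values outside [-4, 4) saturate
w_hat = from_fixed(q)    # round-trip error is at most 2**-14
```

With 13 fractional bits, weights are representable in [-4, 4) with a resolution of 2^-13; activations use a coarser radix of 6, trading precision for dynamic range.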
The Genetic Algorithm runs on an external computer, connected to the INC through a PCIe connection that links it to node (000). Node (000) acts as a gateway to the 3D mesh network and enables us to send neural network weights and game ROMs, and to start games. It also allows us to gather results from the 832 instances of the fitness evaluation module that are scattered across the 3D mesh network.

The Genetic Algorithm we describe in Algorithm 1 is largely based upon [18]. It only includes mutation and selection. Each generation has a population P that is composed of N individuals. To iterate to the next generation, the top T fittest individuals are selected as parents of the next generation (truncation selection). Each offspring individual is generated from a randomly selected parent with parameter vector θ, to which a vector of random noise is added (mutation) to form the offspring's parameter vector θ′ = θ + σε, where σ is a mutation power hyper-parameter, and ε is a standard normal random vector. Moreover, the fittest parent (elite) is preserved (i.e., unmodified) as an individual for the subsequent generation.

We chose to run the training on 59 out of the 60 games evaluated in [12], excluding Wizard Of Wor, which presented some bugs on our Atari 2600 core. The training was carried out in 5 separate experiments to measure the run-to-run variance. Moreover, because the game environment is stochastic, during each run we average the fitness scores of the T fittest individuals over 5 evaluations before selecting the E elites out of those. This procedure helps with generalization of the trained agents. The hyper-parameters of the Genetic Algorithm are presented in Table 4.

A subset of the results is summarized in Table 5, with the corresponding training plots in Fig. 4. The complete table of results is available in Appendix A in Table 6, along with all the learning plots in Fig. 5.

Algorithm 1: Simple Genetic Algorithm

Input: mutation power σ, population size N, number of selected individuals T, Xavier random initialization [7] function xi, standard normal random vector generator function snrv, fitness function F.
for g = 1, ..., G generations do
    for i = 1, ..., N − 1 do
        if g = 1 then
            θ_i^{g=1} = xi()  {initialize random DNN}
        else
            k = uniformRandom(1, T)  {select parent}
            θ_i^g = θ_k^{g−1} + σ · snrv()  {mutate parent}
        end if
        Evaluate F_i = F(θ_i^g)
    end for
    Sort θ_i^g in descending order by F_i
    if g = 1 then
        Set elite candidates C ← θ_{1...T}^{g=1}
    else
        Set elite candidates C ← θ_{1...T}^g ∪ {Elite}
    end if
    Set Elite ← argmax_{θ∈C} Σ_j F(θ)  {fitness averaged over repeated evaluations}
    θ^g ← [Elite, θ^g − {Elite}]  {only include elite once}
end for
Return: Elite

All of our performance numbers are based on the average and variance over 5 training runs, where each run's performance is based on the average score of the best individual, which was evaluated 5 times. We are comparing with DQN (as does [18]) experiments carried out in [12] that use sticky actions as a source of stochasticity, as we do. We are also comparing with the results from [18], which implements very similar experiments in software with a larger neural network, with the caveat that it uses initial no-ops as a source of stochasticity.

We are also comparing the approximate wall-clock duration needed to complete a single training experiment with the corresponding algorithms and number of frames. We have measured an aggregate evaluation speed of ∼1.2 million frames per second across the 832 fitness evaluation modules.

Table 4: Genetic Algorithm Hyper-Parameters.

Population size (N) | 1000 + 1
Truncation size (T) | 20
Number of elites (E) | 1
Mutation power (σ) | 0.002

The success of a simple GA algorithm in solving complex RL problems was a surprising result [18] and attracted more research in this area, including this work. One of the hypotheses is the improved exploration compared to gradient-based methods. Potentially, GA can avoid being stuck in local minima, unlike gradient methods, which require additional tricks (e.g., momentum).
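Algorithm 1 maps to only a few lines of NumPy. In the sketch below, the quadratic `fitness` is a toy stand-in for the FPGA fitness evaluation modules, the Xavier initialization is replaced by a plain Gaussian, and the 5-fold re-evaluation of elite candidates is simplified away; parameter values are illustrative, not the ones of Table 4.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(theta):
    """Toy stand-in for the FPGA game-playing fitness evaluation."""
    return -np.sum((theta - 1.0) ** 2)

def evolve(n_params=10, N=50, T=10, sigma=0.05, generations=200):
    # Generation 1: random initialization (Xavier init in the paper).
    pop = [rng.normal(0.0, 0.5, n_params) for _ in range(N)]
    elite = None
    for g in range(generations):
        scores = np.array([fitness(t) for t in pop])
        order = np.argsort(scores)[::-1]        # sort by descending fitness
        parents = [pop[i] for i in order[:T]]   # truncation selection
        candidates = parents if elite is None else parents + [elite]
        # Elite: best candidate (the paper averages 5 re-evaluations here).
        elite = max(candidates, key=fitness)
        # Offspring: mutate a uniformly chosen parent, theta' = theta + sigma*eps,
        # and carry the elite over unmodified (included exactly once).
        pop = [elite] + [
            parents[rng.integers(T)] + sigma * rng.standard_normal(n_params)
            for _ in range(N - 1)
        ]
    return elite

best = evolve()
```

Because the elite is carried over unmodified and the toy fitness is deterministic, the best score is non-decreasing across generations; with a stochastic fitness, the repeated evaluations of the elite candidates are what keep lucky outliers from being locked in.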
The promise of GA for training deep neural networks on reinforcement learning problems also depends on the computational resources. Even though [18] showed that the wall clock time can be an order of magnitude smaller compared to RL in learning to play Atari games, the data efficiency does not compare favorably against modern RL methods (e.g., billions of game frames for GA vs. hundreds of millions for algorithms such as A3C). In our work, we attempted to accelerate the game environment and the neural network inference in order to alleviate this bottleneck.

Distributed hardware, such as CPUs in cloud data centers or custom-built systems such as ours, is a naturally good fit for GA-type population-based optimization methods. Depending on the application, computation vs. communication time needs to be considered carefully. For example, for game playing, a significant portion of the time is spent during the game itself, which results in a long sequence of inference of game frames and actions. Communicating game scores and updating neural network weights are sparse in comparison. Therefore, rather than accelerating the genetic algorithm, acceleration of the game environment and the inference can make a big difference, as our results have shown.

The analysis of the game scores agrees with the findings of [18] and shows that the simple approach of the GA is competitive against a basic RL model such as DQN. Our GA experiments surpass DQN on

Table 5: Game Scores for 13 Games From [18]. The Highest Scores for an Equal Number of Training Frames Are in Bold. Scores Are Averaged Over 5 Independent Training Runs.

Game | DQN [12] | GA 200M (ours) | GA 1B [18] | GA 1B (ours) | GA 6B [18] | GA 6B (ours)
Wall clock time | ∼10d [18] | – | – | – | – | ∼2h 30min
Amidar | – | 217.6 | – | 300.8 | – | 359.8
Assault | – | 906.4 | – | 1,388.2 | – | 2,374.6
Asterix | – | 1,972.0 | – | 2,616.0 | – | 2,912.0
Asteroids | 528.5 | – | – | 2,771.6 | – | 3,227.6
Atlantis | – | 55,472.0 | – | 77,832.0 | – | 136,132.0
Enduro | – | 76.2 | – | 100.6 | – | 119.6
Frostbite | 279.6 | – | – | 6,225.2 | – | 7,241.6
Gravitar | 154.9 | – | – | 1,636.0 | – | 1,948.0
Kangaroo | – | 2,564.0 | – | 6,148.0 | – | 8,232
Seaquest | 1,485.7 | – | – | 3,862.4 | – | 5,428
Skiing | -12,446.6 | -7,115.2 | -6,502 | -6,268.6 | -5,541 | -5,732.6
Venture | 3.2 | – | – | 1,052 | – | 1,428.0
Zaxxon | 3,852.1 | – | – | 6,408.0 | – | 8,324.0
30 out of 59 games for an equal number of 200 million training frames, while taking 3 orders of magnitude less wall clock time. When not taking data efficiency into account, GA with 6 billion training frames surpasses DQN with 200 million training frames in 36 out of 59 games, while still taking about 2 orders of magnitude less wall clock time.

Compared to [18], which demonstrated results on thirteen games, we obtained results for 59 games up to six billion frames. Our implementation is about twice as fast as the one in [18], which used 720 CPU cores in the cloud. In all instances, our game scores match [18], and in some cases even surpass them. Even though the GA algorithm and the experimental hyper-parameters (e.g., population size, mutation power, etc.) were identical, the neural network implementations differed. The most significant difference in our implementation is the removal of a fully connected layer and the drastic reduction in the number of weights.

In this work, we have shown the acceleration of the fitness evaluation of neural networks playing Atari 2600 games using FPGAs. Our results were obtained on the recently built IBM Neural Computer, a large distributed FPGA system, demonstrating the advantage of whole-application acceleration. We used that acceleration with a Genetic Algorithm from [18] applied to training a deep neural network on Atari 2600 games. Compared to the CPU implementation of the neural network in [18], the FPGA implementation used a significantly smaller network with quantized weights and activations. The improvements in the game scores compared to [18] might be due to these differences, which is worth further investigation. Our results successfully demonstrated that the GA, as a gradient-free optimization method, is an effective way of leveraging the power of hardware that is optimized for limited-precision computing and neural network inference.
We hope to leverage the accel-erator to pursue research on gradient-free optimiza-7 amidar assault asterix asteroids atlantis enduro frostbite gravitar kangaroo
Frames seaquest
Frames skiing
Frames venture
Frames zaxxon
GA (ours)GA (Such et al. 2018)DQN (Machado et al. 2018)
Figure 4: GA learning curve across generations on selected games compared to the final scores of DQN at200M frames (green circles) [12] and previous GA implementation at 1 and 6B frames (red triangles) from[18]. Plots for all of the 59 games can be found in the Supplementary section.tion methods. Moreover, we are convinced that sig-nificant further acceleration and efficiency gains couldbe achieved with state of the art FPGAs (the XilinxZynq-7000 family was released in 2011).
Acknowledgements
This paper and the research behind it would not have been possible without the exceptional work and dedication of Chuck Cox (IBM Research), who designed and built the INC system. The authors would also like to thank Winfried Wilcke (IBM Research) for his leadership, support and constant encouragement. Some of the early experiments were run by Miaochen Jin (University of Chicago) during his internship at IBM Research. The authors would like to acknowledge Kamil Rocki (previously at IBM Research), who contributed to the project during its conception.
References

[1] Stella Programmer's Guide. https://alienbill.com/2600/101/docs/stella.html, 1979. [Online; accessed 4-October-2019].
[2] Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios. International Telecommunications Union, 2011.
[3] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866, 2017.
[4] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
[5] H. Cho, P. Oh, J. Park, W. Jung, and J. Lee. FA3C: FPGA-accelerated deep reinforcement learning. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 499–513, 2019.
[6] M. Courbariaux, Y. Bengio, and J.-P. David. Training deep neural networks with low precision multiplications, 2014.
[7] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
[8] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[9] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castaneda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.
[10] N. Justesen, P. Bontrager, J. Togelius, and S. Risi. Deep learning for video game playing. IEEE Transactions on Games, 2019.
[11] Y. Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.
[12] M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
[13] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
[14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[15] P. Narayanan, C. E. Cox, A. Asseman, N. Antoine, H. Huels, W. W. Wilcke, and A. S. Ozcan. Overview of the IBM neural computer architecture. arXiv preprint arXiv:2003.11178, 2020.
[16] A. S. Polydoros and L. Nalpantidis. Survey of model-based reinforcement learning: Applications on robotics. Journal of Intelligent & Robotic Systems, 86(2):153–173, 2017.
[17] D. H. Pritchard. US color television fundamentals: A review. SMPTE Journal, 86(11):819–828, 1977.
[18] F. P. Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and J. Clune. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.
[19] W. Tang and L. Yip. Hardware implementation of genetic algorithms using FPGA. In The 2004 47th Midwest Symposium on Circuits and Systems, MWSCAS'04, volume 1, pages I–549. IEEE, 2004.
[20] M. F. Torquato and M. A. Fernandes. High-performance parallel implementation of genetic algorithm on FPGA. Circuits, Systems, and Signal Processing, 38(9):4014–4039, 2019.
[21] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers. FINN: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 65–74, 2017.
[22] X. Wei, C. H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, and J. Cong. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Proceedings of the 54th Annual Design Automation Conference 2017, pages 1–6, 2017.
[23] X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi. Scaling for edge inference of deep neural networks. Nature Electronics, 1(4):216–222, 2018.
[24] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen. DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs. In Proceedings of the International Conference on Computer-Aided Design, page 56. ACM, 2018.
[25] D. Zhao, H. Wang, K. Shao, and Y. Zhu. Deep reinforcement learning with experience replay based on SARSA. In , pages 1–6. IEEE, 2016.
Appendix A: Results on the 59 games
Table 6: Game Scores. All Scores Are Averaged Over 5 Independent Training Runs. Variance Is Given in Parentheses. The Highest Scores for 200M Frames Are in Bold.
Game | DQN [12] | GA 200M (ours) | GA 1B (ours) | GA 6B (ours)
Wall clock time | ∼10d [18] | – | – | ∼2h 30min
Alien | – (357.5) | 1,386.4 (280.5) | 1,942.4 (401.7) | 3,603.2 (746.8)
Amidar | – (220.4) | 217.6 (34.1) | 300.8 (45.0) | 359.8 (63.0)
Assault | – (106.8) | 906.4 (65.6) | 1,388.2 (247.9) | 2,374.6 (234.4)
Asterix | – (1,354.6) | 1,972.0 (332.3) | 2,616.0 (169.9) | 2,912.0 (267.1)
Asteroids | 528.5 (37.0) | – (157.6) | 2,771.6 (197.2) | 3,227.6 (187.8)
Atlantis | – (128,678.4) | 55,472.0 (1,621.4) | 77,832.0 (6,786.2) | 136,132.0 (10,796.2)
Bank Heist | – (82.3) | 144.0 (22.6) | 205.2 (39.2) | 247.2 (52.1)
Battle Zone | 20,547.5 (1,843.0) | – (5,681.5) | 29,600.0 (5,128.4) | 30,680.0 (5,347.1)
Beam Rider | – (362.5) | 1,276.2 (122.5) | 1,442.4 (215.9) | 1,486.8 (266.5)
Berzerk | 487.2 (29.9) | – (114.7) | 1,254.8 (207.3) | 1,425.6 (62.2)
Bowling | 33.6 (2.7) | – (18.0) | 188.2 (5.8) | 211.2 (11.7)
Boxing | – (4.9) | 21.6 (1.5) | 47.6 (19.8) | 70.6 (13.8)
Breakout | – (22.6) | 12.8 (0.4) | 15.4 (3.4) | 18.8 (5.4)
Carnival | – (189.0) | 4,274.4 (1,584.6) | 5,701.2 (1,581.7) | 6,268.0 (1,435.2)
Centipede | 2,838.9 (225.3) | – (1,710.7) | 21,163.4 (2,049.0) | 25,970.2 (2,945.4)
Chopper Command | 4,399.6 (401.5) | – (4,839.8) | 14,100.0 (6,566.7) | 19,932.0 (9,297.3)
Crazy Climber | – (1,967.3) | 5,896.0 (1,008.6) | 11,420.0 (1,159.0) | 30,888.0 (3,243.5)
Defender | 2,941.3 (106.2) | – (860.8) | 17,194.0 (1,500.2) | 20,978.0 (2,358.2)
Demon Attack | – (778.0) | 2,057.2 (244.9) | 2,601.2 (906.8) | 3,277.6 (984.0)
Double Dunk | -8.7 (4.5) | – (0.4) | 2.0 (0.0) | 2.2 (0.4)
Elevator Action | 6.0 (10.4) | – (1,131.4) | 3,360.0 (1,999.8) | 6,892.0 (3,071.3)
Enduro | – (32.4) | 76.2 (13.2) | 100.6 (9.6) | 119.6 (4.3)
Fishing Derby | – (1.9) | -49.0 (6.2) | -34.2 (10.1) | -6.2 (21.9)
Freeway | – (0.3) | 27.4 (0.5) | 29.0 (0.7) | 29.6 (1.1)
Frostbite | 279.6 (13.9) | – (342.2) | 6,225.2 (1,226.2) | 7,241.6 (1,183.4)
Gopher | – (521.4) | 1,091.2 (112.4) | 1,412.0 (198.9) | 1,740.0 (246.6)
Gravitar | 154.9 (17.7) | – (369.4) | 1,636.0 (639.6) | 1,948.0 (763.6)
Hero | – (2,234.9) | 10,940.2 (2,265.1) | 14,102.8 (2,828.5) | 17,803.2 (534.5)
Ice Hockey | -3.8 (4.7) | – (1.3) | 13.8 (1.1) | 15.8 (1.9)
James Bond | 581.0 (21.3) | – (479.2) | 1,778.0 (454.7) | 2,670.0 (569.1)
Journey Escape | -3,503.0 (488.5) | – (8,335.0) | 16,980.0 (8,329.7) | 22,468.0 (8,340.1)
Kangaroo | – (1,115.9) | 2,564.0 (506.0) | 6,148.0 (2,878.4) | 8,232 (2,788.5)
Krull | – (128.5) | 5,875.0 (1,004.0) | 7,841.2 (805.5) | 10,113.8 (749.4)
Kung-Fu Master | 16,472.7 (2,892.7) | – (8,356.2) | 46,088.0 (3,588.2) | 49,616.0 (2,197.6)
Montezuma's Revenge | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0)
Ms. Pacman | 3,116.2 (141.2) | – (632.1) | 5,654.4 (965.4) | 6,295.6 (882.9)
Name This Game | 3,925.2 (660.2) | – (119.7) | 5,102.8 (130.2) | 5,548.4 (282.8)
Phoenix | 2,831.0 (581.0) | – (913.5) | 6,809.6 (2,096.4) | 9,957.6 (2,187.6)
Pitfall | -21.4 (3.2) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0)
Pong | – (1.0) | -16.0 (2.1) | -10.4 (2.5) | -5.6 (2.2)
Pooyan | – (349.5) | 1,822.6 (92.9) | 2,051.8 (107.6) | 2,353.6 (119.4)
Private Eye | 3,967.5 (5,540.6) | – (91.7) | 15,107.0 (13.3) | 15,196.6 (7.6)
Q*bert | – (1,385.3) | 8,378.0 (3,430.4) | 9,730.0 (2,787.8) | 10,023.0 (2,438.9)
River Raid | – (435.0) | 1,919.6 (722.9) | 2,642.4 (874.8) | 3,502.0 (674.4)
Road Runner | – (1,492.0) | 9,744.0 (939.1) | 14,848.0 (2,792.5) | 21,356.0 (8,706.1)
Robotank | – (6.4) | 20.2 (1.3) | 22.4 (1.9) | 25.8 (1.6)
Seaquest | 1,485.7 (740.8) | – (335.7) | 3,862.4 (307.2) | 5,428 (966.5)
Skiing | -12,446.6 (1,257.9) | -7,115.2 (379.5) | -6,268.6 (655.8) | -5,732.6 (156.9)
Solaris | 1,210.0 (148.3) | – (964.9) | 6,201.6 (1,178.7) | 8,560.8 (718.8)
Space Invaders | 823.6 (335.0) | – (187.9) | 1,490.6 (212.6) | 1,919.8 (209.3)
Star Gunner | – (5,298.8) | 2,208.0 (224.8) | 2,908.0 (227.0) | 4,392.0 (453.6)
Tennis | -23.9 (0.0) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0)
Time Pilot | 2,061.8 (228.8) | – (851.7) | 9,632.0 (1,227.7) | 10,620.0 (1,199.6)
Tutankham | 60.0 (12.7) | – (16.7) | 190.6 (41.3) | 213.8 (49.5)
Up and Down | 4,750.7 (1,007.5) | – (2,327.8) | 21,458.8 (11,005.0) | 29,244.8 (14,693.3)
Venture | 3.2 (4.7) | – (125.4) | 1,052 (172.4) | 1,428.0 (198.3)
Video Pinball | 15,398.5 (2,126.1) | – (11,059.2) | 50,880.6 (13,469.1) | 62,769.2 (6,497.0)
Yar's Revenge | 13,073.4 (1,961.8) | – (3,455.0) | 34,935.4 (2,657.7) | 45,293.2 (7,313.4)
Zaxxon | 3,852.1 (1,120.7) | – (359.8) | 6,408.0 (366.2) | 8,324.0 (1,213.7)

Figure 5: GA learning curves across generations for all 59 games.