AutoML for Multilayer Perceptron and FPGA Co-design
Philip Colangelo
Intel PSG, San Jose, USA
Oren Segal
Hofstra University, Hempstead, USA
Alex Speicher
Hofstra University, Hempstead, USA
Martin Margala
University of Massachusetts Lowell, Lowell, USA
Abstract—State-of-the-art Neural Network Architectures (NNAs) are challenging to design and implement efficiently in hardware. In the past couple of years, this has led to an explosion in research and development of automatic Neural Architecture Search (NAS) tools. AutoML tools are now used to achieve state-of-the-art NNA designs and attempt to optimize for hardware usage and design. Much of the recent research in the auto-design of NNAs has focused on convolution networks and image recognition, ignoring the fact that a significant part of the workload in data centers is general-purpose deep neural networks. In this work, we develop and test a general multilayer perceptron (MLP) flow that can take arbitrary datasets as input and automatically produce optimized NNAs and hardware designs. We test the flow on six benchmarks. Our results show we exceed the performance of currently published MLP accuracy results and are competitive with non-MLP based results. We compare general and common GPU architectures with our scalable FPGA design and show we can achieve higher efficiency and higher throughput (outputs per second) for the majority of datasets. Further insights into the design space for both accurate networks and high-performing hardware show the power of co-design by correlating accuracy versus throughput, network size versus accuracy, and scaling to high-performance devices.
Index Terms—Evolutionary Algorithms, Machine Learning, FPGA, Automated Design
I. INTRODUCTION AND MOTIVATION
Optimizing neural network architectures (NNAs) is a difficult process, in part because of the vast number of hyperparameter combinations that exist. In cases where a combination is not optimal, performance will suffer. Research suggests that the parameters of a network can directly influence the accuracy, throughput, and energy consumption of that model in deployment [1]. The difficulty in designing performant neural networks has brought a recent surge of interest in the automatic design and optimization of neural networks. The focus of the existing body of research has been on optimizing NNAs for accuracy [2][3][4], but these efforts lack network-specific hardware optimizations. Our focus is to close this gap by using evolutionary algorithms to search an entire design space, including NNA and reconfigurable hardware. This co-design approach yields a custom hardware implementation for a specific NNA model.

Numerous articles have focused on image classification with convolutional neural networks, in part due to the introduction of AlexNet in 2012 and the ImageNet database [5]. Traditional neural networks such as multilayer perceptrons (MLPs) have taken a backseat from the spotlight. Yet their application is still very relevant: large data-centric companies such as Facebook [6][7] and Google [8] have published data showing that MLP workloads make up the majority of their application base. Facebook cites the use of MLPs for tasks such as determining which ads to display, which stories matter to see in a news feed, and which results to present from a search. In [9], Park et al. stress the importance of these networks, the current limitations of standard hardware, and the need for what this research aims to provide, i.e., software and hardware co-design. Google recently published a performance analysis of their TPU, showing MLPs constituting 61% of TPU workloads. Our findings are in line with the results presented in their paper, which state that MLP workloads are memory bound. Our co-design process is capable of designing optimal hardware structures that balance the compute and memory requirements. Section IV provides insights on memory bandwidth considerations.

At the heart of an MLP is a general matrix multiplication (GEMM), and libraries such as the Basic Linear Algebra Subprograms (BLAS) have existed for years to provide highly tuned, hardware-specific GEMM implementations. Standard hardware such as CPUs and GPUs provide these subroutines to make implementing performant MLPs easy. Still, two problems persist in this typical flow: 1. target hardware is normally not considered during the creation of the MLP, and 2. standard architectures are general purpose. General-purpose architectures do not necessarily offer performance scaling with respect to the ANN description (see Section IV). Our research aims to take advantage of the reconfigurable architecture of an FPGA device that is capable of molding to a specific workload and neural network structure. Leveraging evolutionary algorithms to search the entire design space of both MLP and target hardware simultaneously, we find unique solutions that achieve both top accuracy and optimal hardware performance. Section IV provides top results for MLP networks and hardware configurations derived from a series of tests over multiple datasets. Correlations between traditional approaches and our approach are discussed to provide insights into how we achieve top results.

II. RELATED WORK
Automating NNA search has been an ongoing effort for the past few decades but is becoming a focus of the NNA research community because of the difficulty of designing deep networks, which are ever growing in complexity [3][4][10]. Automatic Artificial Neural Network Architecture Search (NAS) can be conducted using different strategies such as random search, evolutionary algorithms, Reinforcement Learning (RL), Bayesian optimization, and gradient-based methods [10]. Using Evolutionary Algorithms (EAs) to search for performant architectures has been investigated extensively over the years [11][12]. Some recent results indicate that evolutionary algorithms offer better results than random search and reinforcement learning [4]. Recently, there has been growing interest in NAS for deep neural networks that specialize in image recognition [2][3][13]. A recent survey on available tool flows is given in [14]. The body of work on NAS concentrates on accuracy as the main measure of performance, though optimizing NAS can lead to more simplified NNAs that could in turn simplify and optimize hardware designs [4][10]. On the other hand, optimizing for hardware performance parameters (latency/throughput/power) is normally done on an existing NNA design, and there is no attempt to modify the NNA (layers/neurons, etc.) [14].

III. EVOLUTIONARY CELL AIDED DESIGN FLOW
The Evolutionary Cell Aided Design (ECAD) flow, previously described in [15], is shown in Figure 1. It starts with a general industrial/research problem for which (a) sufficient data exists, (b) there are well-defined inputs/outputs, and (c) the problem can benefit from software/hardware acceleration. Once such a problem is identified, a dataset is exported into a Comma Separated Value (CSV) tabular data format. In addition, a configuration file is created containing information on (a) the general NNA structure, including input and output sizes and the initial number of layers and neurons, (b) the hardware target, including the reconfigurable hardware device type, DSP count, memory size, and number of blocks, and (c) optimization targets such as accuracy, throughput, latency, and floating-point operations. Note that the configuration file can be generated automatically based on an existing template configuration file and the dataset.
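For concreteness, such a configuration might look like the following sketch. The key names and layout are our own illustration, not ECAD's actual file format; the device figures are the Arria 10 GX 1150 values used in Section IV.

```python
# Hypothetical ECAD-style configuration; key names are illustrative
# assumptions, not the actual ECAD format.
config = {
    "nna": {
        "input_size": 784,       # dataset feature count
        "output_size": 10,       # number of classes
        "initial_layers": 2,     # starting point for the search
        "initial_neurons": 128,
    },
    "hardware": {
        "device": "Arria 10 GX 1150",
        "dsp_count": 1518,       # hardened FP32 DSP blocks
        "m20k_blocks": 2713,     # on-chip SRAM blocks
        "clock_mhz": 250,
        "dram": {"type": "DDR4", "banks": 1, "bandwidth_gbps": 19.2},
    },
    "optimize_for": ["accuracy", "throughput"],  # fitness criteria
}
```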
A. Evolutionary Process
The ECAD evolutionary process, based on a steady-state model [16], generates a population of NNA/hardware co-design candidates, each with a complete set of parameters that affect both the accuracy and the hardware performance. The parameters we considered during our searches included the number of layers, layer size, activation function, and bias.

Fig. 1: ECAD flow.

Each candidate in the population is evaluated according to configurable and potentially multiple criteria, for example accuracy alone or accuracy versus throughput. The raw evaluation measurements are done on a software artifact, dubbed a Worker in ECAD terminology. The Worker returns the raw evaluation information to a Master process. The Master process orchestrates the evaluation process by distributing the co-design population and by evaluating the results. Result evaluation is done using user-defined fitness functions. For example, an accuracy fitness function can simply return the accuracy value obtained by a simulation worker (see the next section for more information on worker types), but it can also scale or weight the value, or specify whether to minimize or maximize it. Simple evaluation functions can be specified in the configuration file, and more complex ones are written in code and added by registering them with the framework.
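As a concrete illustration, here is a minimal sketch of a steady-state search loop with registered fitness functions. All names are our own simplification of the Master/Worker design, and the stub worker fabricates measurements in place of real training and hardware runs.

```python
import random

# Registered, user-defined fitness functions: each maps a Worker's raw
# measurements to a score to maximize (scaling/weighting happens here).
fitness_functions = {
    "accuracy":   lambda raw: raw["accuracy"],
    "throughput": lambda raw: raw["outputs_per_s"] / 1e6,
}

def stub_worker(candidate):
    """Stand-in Worker: a real one trains the candidate or runs it on
    target hardware, then reports raw measurements to the Master."""
    neurons = sum(candidate["layers"])
    return {"accuracy": random.random(), "outputs_per_s": 1e7 / neurons}

def mutate(candidate):
    """Perturb searched parameters (here only layer sizes, for brevity)."""
    child = dict(candidate)
    child["layers"] = [max(8, n + random.choice([-32, 0, 32]))
                       for n in candidate["layers"]]
    return child

def score(candidate, worker):
    raw = worker(candidate)
    return sum(f(raw) for f in fitness_functions.values())

def evolve(population, worker=stub_worker, steps=1000):
    """Steady-state model: replace one weak individual per step."""
    scored = [(score(c, worker), c) for c in population]
    for _ in range(steps):
        scored.sort(key=lambda sc: sc[0])          # worst candidate first
        parent = random.choice(scored[len(scored) // 2:])[1]  # fitter half
        child = mutate(parent)
        s = score(child, worker)
        if s > scored[0][0]:                       # displace current worst
            scored[0] = (s, child)
    return max(scored, key=lambda sc: sc[0])[1]

best = evolve([{"layers": [128, 64]} for _ in range(20)])
print(best)
```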
B. ECAD Hardware
The evolutionary search has three workers at its disposal to assess the fitness of various hardware platforms. The simulation worker is useful for assessing instruction-set-based architectures such as CPUs and GPUs, whereas the physical and hardware database workers are useful for hardware that requires design and synthesis.

Simulation workers take queries from the evolutionary engine and run them on the target hardware. For example, if the target hardware is a GPU, the evolutionary engine sends the simulation worker an ANN description, which the worker converts into a format the GPU can run. Once the GPU returns the result, the worker is responsible for recording all necessary metrics that are provided back to the engine to assess the fitness. Metrics include throughput, latency, and power consumption.

Hardware database workers provide a means of evaluating hardware platforms that are more easily modeled than executed directly. In our experiments, for example, we leveraged the hardware database worker to accept both an ANN description and a hardware configuration that together were run through a model to obtain the metrics for fitness evaluation. Once the engine finds performant designs by tracking each configuration's fitness, the model can be realized into hardware and run through the physical worker. The FPGA proved to be a suitable architecture for the hardware database worker: the reconfigurable nature of FPGAs, coupled with modeled overlay architectures, allows the worker to assess many configurations relatively swiftly compared to running them through synthesis tools.

Physical workers can be used to synthesize and evaluate hardware designs. While the hardware database worker provides fitness of the overall application with metrics such as throughput, the physical worker aims to provide the fitness of the hardware design itself through metrics such as power, logic utilization, and operating frequency. In the case of Intel FPGAs, the physical worker responds with ALM, M20K, and DSP utilization, power estimations, and clock frequency (Fmax).

Having access to these workers in the evolutionary engine is a powerful tool that provides various performance metrics about the hardware platform as a function of the ANN. Since ANN descriptions can vary widely, the throughput and power consumption on hardware devices can also vary widely. Understanding how a hardware device performs across ANN variations aids the decision on how to deploy ANN solutions, as seen in the experiments.

Beyond these tools providing insights into architecture performance, the Pareto frontiers that result after parsing the evolutionary design space define what the optimal solution is. Every industry deploying ANN solutions may have different opinions and demands for its use case. One size fits all, or one solution, is not realistic or optimal. Having the data to make decisions based on trade-offs is highly valuable. Since some ANN descriptions are approximations, i.e., they include redundancy (although we note that newer research is starting to limit the amount of redundancy) and educated guesses, it is an unreasonable expectation to hand optimize ANN descriptions for a specific hardware platform.
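Structurally, the three worker roles share one evaluation interface. The sketch below is our simplified rendering of that split; the class and method names are illustrative assumptions, not ECAD's actual API.

```python
from abc import ABC, abstractmethod

# Our simplified rendering of the three worker roles described above.
class Worker(ABC):
    @abstractmethod
    def evaluate(self, ann_description, hw_config):
        """Return raw metrics for the engine (throughput, latency, ...)."""

class SimulationWorker(Worker):
    """Runs the ANN directly on instruction-set hardware (CPU/GPU)."""
    def evaluate(self, ann_description, hw_config): ...

class HardwareDatabaseWorker(Worker):
    """Evaluates an analytic model, e.g., the FPGA overlay model."""
    def evaluate(self, ann_description, hw_config): ...

class PhysicalWorker(Worker):
    """Synthesizes the design; reports ALM/M20K/DSP usage, power, Fmax."""
    def evaluate(self, ann_description, hw_config): ...
```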
C. FPGA Design Model and Hardware Description
Workers can provide input to any hardware capable of being modeled efficiently. Given the reconfigurable architecture of FPGAs and modern HLS tools capable of describing overlay designs that interface nicely with software programs, we chose Intel FPGAs and OpenCL as our hardware platform and development environment. Any flavor of FPGA can use the overlay-style architecture we developed; all that is required to change the design search space, for example between Arria 10 and Stratix 10, is the hardware configuration used by the hardware database worker.

Hardware database workers receive configurations that include enough information to construct the model and derive the necessary performance results. Part of this configuration is a hardware-specific configuration file that defines the target accelerator. This information includes the name of the FPGA, the relevant primitive logic details such as DSP and SRAM counts, the target clock frequency, the type of global memory (DRAM) to be used, and its speed and rate. All of this information is used to estimate performance. The other part of the configuration is the ANN description. Low-level hardware-suitable parts are decomposed from the ANN description and define the work that must be done by the accelerator.

General matrix multiplications are commonly used to accelerate multilayer perceptrons, so the design we used is based on a 2D systolic array architecture that includes additional functionality to support activation functions and vector additions for bias operations. This "grid" architecture has various design space variables that we allow mutations to take place on. The variables are the number of rows and columns, the double-buffer cache sizes for each dimension, called interleaving, and the vector width of each processing element (PE). Every variable has implications in the design space, and each mutation has a unique fitness, not only as it applies to the hardware but also to the ANN. This work leverages and builds on the SGEMM architecture described in [17].

Our model returns values we deemed fundamental, including potential and effective performance, total time, outputs per second, and latency. Giga-operations per second (GOP/s) is the unit used to describe the potential and effective performance. The potential performance is typically the marketed performance that defines the roofline of the configuration, whereas the effective performance is the actual performance of the configuration under a workload. The delta between them provides a metric of how well the problem maps to the model. The total time for our model is defined as the difference between the timestamp when all necessary data persists in the accelerator DRAM and we enqueue the OpenCL kernels, and the timestamp once all results persist in DRAM and the last OpenCL kernel returns. We call this a run. Outputs per second is a generalization over the data type and gives the total results we can produce in one second; for example, if the data type is images, this metric can be interpreted as images per second. Lastly, latency is the time it takes from the start of a run to store one result into DRAM.

Calculating these results in the model starts with the baseline performance of a configuration. All data is 32-bit floating point and maps into the hardened floating-point DSP blocks of the Arria 10 device. Following [17], we calculate the baseline performance by determining how many DSP blocks are doing work; the number of utilized DSPs is the product of the grid dimensions and the vector width. This number gives the performance before considering bandwidth. Using the DRAM specs from the configuration, we determine the ratio of how much bandwidth is available to how much we need, where the bandwidth we need is calculated as the size of a block in bytes divided by the cycles spent per block of data. Applying this ratio yields the potential performance. Next, the grid configuration is used to break the ANN up into a series of blocked matrix multiplications. With the data blocked, we can derive all other results.
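The sketch below renders that calculation in code. It is our reading of the model, not the authors' implementation; the blocking values in the example call are hypothetical, while the device figures (1518 DSPs, 250 MHz, 19.2 GB/s) are the Arria 10 numbers used in Section IV.

```python
# Sketch of the bandwidth-adjusted roofline described above; the exact
# formulas are our reading of the text, not the authors' model code.
def potential_gflops(rows, cols, vec, freq_mhz,
                     block_bytes, cycles_per_block, dram_gbs):
    dsps = rows * cols * vec                 # DSP blocks doing work
    baseline = 2 * dsps * freq_mhz / 1e3     # 1 FP32 MAC = 2 ops/cycle
    # Required bandwidth: bytes per block over the cycles to consume it.
    needed_gbs = (block_bytes / cycles_per_block) * freq_mhz / 1e3
    return baseline * min(1.0, dram_gbs / needed_gbs)

# A 16x16 grid of vector-4 PEs at 250 MHz with one DDR4 bank (19.2 GB/s)
# uses 1024 DSPs; fully utilizing the Arria 10's 1518 DSPs at 250 MHz
# would reproduce the 759 GFLOP/s peak quoted in Section IV.
print(potential_gflops(16, 16, 4, 250, 64 * 1024, 4096, 19.2))  # 512.0
```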
D. MLP Mapping to Hardware
GEMM nomenclature can be used to describe the three key dimensions that make up the problem size for MLP layers. Two matrices A and B are multiplied together to form a new matrix C. If matrix A is m x k and matrix B is k x n, then matrix C is m x n. M is the number of inputs processed at once, commonly referred to as the batch size. Architectures such as GPUs typically batch with a larger m dimension to fill up compute cores and obtain higher throughput. Our design for the FPGA does not need to increase batching, because the PEs can be arranged in a manner that exploits parallelism in other dimensions. This results in a lower-batch, lower-latency accelerator. With sufficient bandwidth, hardware efficiency can be increased by keeping the m dimension smaller and further vectorizing on the k and n dimensions. The n dimension is the number of neurons, which also defines the subsequent layer's k. Lastly, the size of the dataset defines the first layer's k. Fitting an MLP into hardware is one aspect that makes the co-design approach so powerful.
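To make the mapping concrete, the sketch below enumerates the (m, k, n) GEMM shapes for a hypothetical MLP; the layer sizes and function name are ours, for illustration only.

```python
# Hypothetical illustration of how MLP layers decompose into GEMM
# dimensions (names are ours, not from the ECAD codebase).
def mlp_to_gemm_dims(input_size, layer_sizes, batch):
    """Yield (m, k, n) per layer: A(m x k) @ B(k x n) -> C(m x n)."""
    k = input_size          # first-layer k is the dataset feature count
    for n in layer_sizes:   # n is the number of neurons in the layer
        yield (batch, k, n)
        k = n               # this layer's n becomes the next layer's k

# Example: a 784-input MLP with two hidden layers and 10 outputs.
for dims in mlp_to_gemm_dims(784, [256, 128, 10], batch=4):
    print(dims)   # (4, 784, 256), (4, 256, 128), (4, 128, 10)
```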
IV. EXPERIMENTS

In this section, we show the results from running a series of evolutionary searches on six different data sets: MNIST [18], Fashion MNIST [19], Credit-g [20], Har [21], Phishing [20], and Bioresponse [22]. MNIST and Fashion MNIST were chosen so we could compare against widely used research benchmarks. Credit-g/Har/Phishing/Bioresponse were chosen since they represent potential real-world datasets.

FPGA hardware results assume the architecture described in Section III-C. The search was done on two different FPGAs: an Arria 10 1150 GX device [23] at a clock frequency of 250 MHz and a Stratix 10 2800 device at a clock frequency of 400 MHz. After many hardware compiles, 250 MHz was, on average, the frequency the OpenCL design achieved for the Arria 10. Running at 250 MHz provides a peak throughput of 759 GFLOP/s of FP32 single-precision performance. The accelerator card was the development kit from Intel, which contains a single bank of DDR4 memory providing a peak bandwidth of 19.2 GB/s. In many cases, the evolutionary algorithm requested configurations that led to bandwidth-constrained designs. We do provide a few results that included a design space with 2 and 4 DDR banks, providing 38.4 and 76.8 GB/s of bandwidth respectively. All Stratix 10 models were run with 4 banks of DDR.

GPU results were obtained using three different devices: an NVIDIA Quadro M5000 with 8 GB DDR5, capable of 4.3 TFLOP/s of FP32 single-precision performance with 211 GB/s memory bandwidth; a Titan X capable of 12 TFLOP/s of FP32 single-precision performance; and a Radeon VII with 16 GB HBM2, 13.44 TFLOP/s FP32, and 1 TB/s memory bandwidth. Profiling for the GPUs was done using trace files generated from TensorFlow. The timing report considers matrix multiplication, activation, and vector addition routines, but it does not appear to take into account DRAM transfers. FPGA timing reports do consider DRAM because memory buffering is an active component in the design. Overall, our conclusions are not affected by the difference between timing reports, which likely skews direct comparisons of FPGA and GPU (in favor of GPU).

Power consumption for the GPUs was captured during runs using the built-in nvidia-smi utility. We found the power management for GPU to be efficient since, in most cases, the effective performance was rather low (see Section IV-D), and so was the power. On average, the GPU power measured around 50 W for the 150 W (maximum) device. Quartus' Power Analyzer Tool was used to capture power estimates for the FPGA. Across the many Arria 10 designs compiled, we found the minimum power to be 22.5 W, the maximum power to be 31.89 W, and the average power to be 27 W. The difference between power numbers for FPGA and GPU is that FPGA represents chip power while GPU represents total board power. This discrepancy made it challenging to compare power between architectures. Given this discrepancy and the small variance across configurations, we decided to leave the topic of power outside the conclusions.
A. Overall Performance Results
We first examine the accuracy results we were able to achieve using the evolutionary flow. Table I presents the top results obtained from the evolutionary algorithm searching for accuracy using a 10-fold evaluation method [24][25]. This method splits the data set into 10 equal train/test folds and measures performance on each of the folds. As can be seen, ECAD MLP (our method) was able to achieve better accuracy results than all MLP-based classifiers. In addition, the ECAD evolutionary process managed to outperform non-MLP-based methods for the credit-g and phishing datasets. The MNIST and Fashion MNIST datasets were obtained outside of OpenML, use a traditional 1-fold train/test split, and are compared to results published in the literature [19][18]. As can be seen, our MNIST and Fashion MNIST accuracy results outperform the top reported MLP results. In addition, on Fashion MNIST our auto MLP network has the second-best reported result overall and is 0.0047 shy of the SVC method record holder. Table III shows ECAD run time statistics for the results reported in Tables I and II. It reports the number of different NNA/HW combinations that were automatically generated and evaluated by the ECAD system, the average time per evaluation, and the total evaluation time of all candidate architectures. Note that in order to optimize the search and run time of the system, potential NNA/HW candidates are first analyzed for similarities to previous evaluations, and duplicates are not evaluated twice.

Table IV shows the results for the two top Pareto frontier solutions for each data set. Note that this search was done outside of the OpenML spec, i.e., running a single fold. The solutions provide accuracy and throughput for a Stratix 10 (S10) FPGA and a Titan X (TX) GPU. In the majority of cases, the FPGA achieved higher performance than the GPU. Credit-g, for example, favored the GPU at the highest accuracy, but looking at the second row for credit-g, by sacrificing just one point of accuracy, the FPGA sees a very significant improvement in throughput.
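A minimal sketch of this 10-fold protocol follows, using scikit-learn and its bundled digits data as a stand-in; the actual flow instead trains each evolved candidate network on the benchmark datasets.

```python
# 10-fold evaluation: split into 10 equal train/test folds and measure
# accuracy on each; the model here is a placeholder, not an evolved NNA.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
scores = cross_val_score(clf, X, y, cv=10)   # one accuracy per fold
print(scores.mean(), scores.std())           # aggregate 10-fold accuracy
```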
B. Accuracy versus Throughput
Part of the motivation for this research is to show the adaptability of reconfigurable hardware to a neural network. This flexibility, molding the hardware over a series of GEMM calls, produces results aimed at fitting an optimal network description, i.e., highest accuracy, with an optimal hardware solution producing high and efficient throughput. We began by running the evolutionary search over the HAR dataset and observing how both GPU and FPGA reacted to each evolutionary step. Figure 2a provides the results for the Arria 10, and Figure 2b provides the results for the Quadro M5000.

The evolutionary process provided many high-performance results in terms of accuracy, with the top at 0.995, and provided many varying results for throughput. The broad swing of throughput results (especially for the GPU, being a fixed architecture) shows that many MLP solutions exist that achieve the top accuracy.

TABLE I: Top 10-fold Accuracy (Acc) for All Datasets Compared to Previous Works

Dataset     | Top Acc (Any) | Top Method              | Top Acc (MLP) | MLP Type       | ECAD MLP
Credit-g    | 0.7860        | mlr.classif.ranger      | 0.7470        | *MLPClassifier |
Har         | 0.9957        | *DecisionTreeClassifier | 0.1888        | *MLPClassifier | 0.9909
Phishing    | 0.9753        | *SVC                    | 0.9733        | *MLPClassifier |
Bioresponse | 0.8160        | mlr.classif.ranger      | 0.5423        | *MLPClassifier | 0.8038
TABLE II: Top 1-fold Accuracy (Acc) for All Datasets Compared to Previous Works

Dataset       | Top Acc (Any) | Top Method | Top Acc (MLP) | MLP Type                | ECAD MLP
MNIST         | 0.9979        | Manual     | 0.9840        | Manual (no distortions) | 0.9852
Fashion MNIST | 0.8970        | SVC        | 0.8770        | MLPClassifier           | 0.8923
TABLE III: Top Accuracy Run Time Statistics

Dataset       | Total Models Evaluated | Avg Model Evaluation Time (s) | Total Evaluation Time (s)
MNIST         | 553                    | 71.23                         | 39388.6
Fashion MNIST | 481                    | 82.55                         | 39708.7
Credit-g      | 10480                  | 2.24                          | 23495.2
Har           | 3229                   | 10.20                         | 33069.4
Phishing      | 3534                   | 9.24                          | 32661.3
Bioresponse   | 5309                   | 5.89                          | 31285.0

Note: Each model generated is a fully functional combination of NNA traits and hardware traits that is evaluated for performance on any of the measured metrics. The ECAD system caches similar configurations and avoids reevaluating them.
TABLE IV: Best Pareto Frontier Results for Searching Accuracy and Throughput

Dataset       | Accuracy | S10 (outputs/s) | TX (outputs/s)
MNIST         | 0.9841   | 7.97E5          | 7.73E5
MNIST         | 0.9763   | 2.45E6          | 1.97E6
Fashion MNIST | 0.893    | 4.8E5           | 8.1E5
Fashion MNIST | 0.8850   | 1.92E6          | 2.3E6
Har           | 0.996    | 1.16E6          | 9.59E5
Har           | 0.985    | 4.74E6          | 2.46E6
Credit-g      | 0.83     | 8.19E3          | 1.59E6
Credit-g      | 0.82     | 1.40E7          | 1.23E6
Bioresponse   | 0.798    | 4.64E5          | 1.34E6
Bioresponse   | 0.7952   | 1.36E6          | 1.66E6
Phishing      | 0.9675   | 6.81E6          | 2.27E6
Phishing      | 0.9656   | 1.16E7          | 2.27E6

GPUs accelerate each solution in the same way, so the varying levels of throughput mean the MLP structure is changing. The top GPU throughput achieved at an accuracy of 0.995 is approximately 1E6 results per second, while the bottom throughput result is approximately 6E5 results per second. By the same reasoning, as we move down a point of accuracy, the GPU's performance hardly changes; this is because the number of neurons remains roughly the same, and it is the distribution between layers that causes the effect on accuracy. For the GPU, there is roughly no relationship between the number of neurons and the throughput. The FPGA shows a different correlation: Figure 2a shows that the distribution of neurons in an MLP greatly affects the performance. Unlike the GPU, there is (potentially) a different hardware configuration for each data point. While top accuracy only reaches 1.6E5 outputs per second, moving down in accuracy by just 0.1% results in a giant leap to 1.56E6 outputs per second, an order of magnitude higher. Further, moving down another 0.2% results in approximately another 1.5x in performance. Each dataset tested displayed this same trend.
C. Performance Scaling against Bandwidth
Most designs returned by the evolutionary algorithm tended to be smaller, using only a fraction of the available DSPs. The reason is that scaling to more DSPs requires more data, which requires more memory bandwidth. We hit the memory bandwidth roofline many times due to only having a single bank of DDR. Observing this, we ran a series of tests that provided the hardware database model with 2 and 4 banks of DDR. We found mostly linear scaling going from 1 to 4 banks, so we show the effect bandwidth has on throughput for 1 and 4 DDR banks in Figure 3. Higher bandwidth did not produce greater efficiency but did result in higher throughput overall.
D. Hardware Efficiency and Scaling to Larger Devices
We modified the hardware database worker to return results based on a Stratix 10 2800 device with 4 banks of DDR. This device offers up to a 10x performance scaling from the Arria 10 device we used in previous experiments. While the Stratix 10 device is capable of performing up to 10 TFLOP/s, we used the same methodology as for the Arria 10 device and searched Stratix 10 devices at a clock frequency of 400 MHz, scaling back the roofline to 4.6 available TFLOP/s. To better match the performance of the Stratix 10 device, we used an Nvidia Titan X device capable of 12 TFLOP/s.
Fig. 2: Performance of FPGA and GPU at different levels of accuracy for the har dataset.

Fig. 3: Throughput and hardware efficiency for FPGA designs with 1 and 4 banks of DDR on the credit-g data set.

We found that overall, the reconfigurable architecture had a more significant potential to perform at a higher level; however, for some data sets, the Titan X raw throughput was able to achieve the highest outputs per second. Throughput is a popular metric to consider, but we also found efficiency to be of high importance. The reason the larger FPGA device handled these datasets well is the allocation of resources. When the evolutionary algorithm chooses a hardware configuration, there is an allotted space on the FPGA that carries a potential performance. Then, after mapping the MLP, we get an effective performance (see Section III-C for details on potential and effective performance). The ratio of effective performance to potential performance gives us hardware efficiency. Optimizing for efficiency can yield designs that use fewer FPGA resources while maintaining throughput, leaving logic available for other tasks such as pre- or post-processing.

Fig. 4: Hardware efficiency results for a Stratix 10 2800 and Titan X searching over the MNIST dataset.

Figure 4 shows the FPGA and GPU efficiency of various solutions from searching across the MNIST dataset. The top accuracy for this particular run was 0.9845, and throughput was 796,611 and 773,162 outputs per second for the FPGA and GPU respectively, almost identical. If we consider efficiency for this result, the FPGA utilized 41.5% of the allocated logic, while the GPU only utilized 0.3%. We calculated GPU efficiency as the number of operations per second obtained from a run out of the total potential operations per second of the device. The conclusion is that without target hardware in mind during MLP development, there is a good chance of losing efficiency. Running an evolutionary search on reconfigurable hardware can balance the fitness between throughput and accuracy.
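As a worked illustration of that efficiency calculation, the sketch below recomputes GPU efficiency from outputs per second. The layer sizes are hypothetical placeholders, so the resulting percentage depends entirely on the assumed network and will not reproduce the 0.3% figure exactly.

```python
# Hardware efficiency = effective ops/s over potential ops/s. The MLP
# layer sizes here are hypothetical, not the actual evolved network.
def ops_per_inference(input_size, layers):
    k, total = input_size, 0
    for n in layers:
        total += 2 * k * n    # one multiply-accumulate = 2 ops per weight
        k = n
    return total

ops = ops_per_inference(784, [256, 128, 10])  # hypothetical MNIST MLP
effective = 773_162 * ops                     # outputs/s from the Fig. 4 run
potential = 12e12                             # Titan X FP32 peak, ops/s
print(f"GPU efficiency: {effective / potential:.2%}")
```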
V. CONCLUSIONS
We address the difficulty of designing highly performant neural networks by leveraging evolutionary search algorithms capable of finding the fittest solutions for both classification accuracy and hardware throughput. This process is shown to be both highly efficient and effective compared to traditional approaches that first design a neural network to achieve a target accuracy and then run it on general-purpose hardware. Through a series of experiments, we present our results for state-of-the-art neural network configurations that surpass currently published work. We explain the power of co-design by discussing the results of experiments showing accuracy versus throughput, performance scaling with bandwidth, and scaling designs to larger devices.
VI. FUTURE DIRECTIONS
We believe a custom co-design framework for MLPs and reconfigurable hardware could be of use to a large audience. We plan to continue to develop this framework, streamline its use, and make it available to the general public.
REFERENCES

[1] A. Canziani, A. Paszke, and E. Culurciello. "An analysis of deep neural network models for practical applications". arXiv preprint arXiv:1605.07678 (2016).
[2] H. Liu et al. "Hierarchical representations for efficient architecture search". arXiv preprint arXiv:1711.00436 (2017).
[3] E. Real et al. "Large-scale evolution of image classifiers". In: Proceedings of the 34th International Conference on Machine Learning, Vol. 70. JMLR.org, 2017, pp. 2902–2911.
[4] E. Real et al. "Regularized evolution for image classifier architecture search". arXiv preprint arXiv:1802.01548 (2018).
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.
[6] K. Hazelwood et al. "Applied machine learning at Facebook: A datacenter infrastructure perspective". In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 620–629.
[7] C.-J. Wu et al. "Machine learning at Facebook: Understanding inference at the edge". In: 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 331–344.
[8] N. Jouppi et al. "Motivation for and evaluation of the first Tensor Processing Unit". In: IEEE Micro (2018).
[9] J. Park et al. "Deep learning inference in Facebook data centers: Characterization, performance optimizations and hardware implications". arXiv preprint arXiv:1811.09886 (2018).
[10] T. Elsken, J. H. Metzen, and F. Hutter. "Neural architecture search: A survey". arXiv preprint arXiv:1808.05377 (2018).
[11] G. F. Miller, P. M. Todd, and S. U. Hegde. "Designing neural networks using genetic algorithms". In: ICGA. Vol. 89. 1989, pp. 379–384.
[12] K. O. Stanley and R. Miikkulainen. "Evolving neural networks through augmenting topologies". In: Evolutionary Computation 10.2 (2002), pp. 99–127.
[13] B. Zoph et al. "Learning transferable architectures for scalable image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 8697–8710.
[14] S. I. Venieris, A. Kouris, and C.-S. Bouganis. "Toolflows for mapping convolutional neural networks on FPGAs: A survey and future directions". In: ACM Computing Surveys (CSUR) (2018).
[15] In: 2019 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2019.
[16] D. E. Goldberg and K. Deb. "A comparative analysis of selection schemes used in genetic algorithms". In: Foundations of Genetic Algorithms. Vol. 1. Elsevier, 1991, pp. 69–93.
[17] A. Vishwanath et al. "Enabling high-performance floating-point designs". Intel White Paper (2016).
[18] Y. LeCun and C. Cortes. "MNIST handwritten digit database". 2010. URL: http://yann.lecun.com/exdb/mnist/.
[19] H. Xiao, K. Rasul, and R. Vollgraf. "Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms". CoRR abs/1708.07747 (2017). URL: http://arxiv.org/abs/1708.07747.
[20] D. Dua and C. Graff. UCI Machine Learning Repository. 2017. URL: http://archive.ics.uci.edu/ml.
[21] D. Anguita et al. "A public domain dataset for human activity recognition using smartphones". In: ESANN. 2013.
[22] Boehringer Ingelheim. 2011.
[23] Intel Arria 10 Device Overview. 2018.