AutoML for Multilayer Perceptron and FPGA Co-design
Philip Colangelo
Intel PSG, San Jose, USA
Oren Segal
Hofstra University, Hempstead, USA
Alex Speicher
Hofstra University, Hempstead, USA
Martin Margala
University of Massachusetts Lowell, Lowell, USA
Abstract—State-of-the-art Neural Network Architectures (NNAs) are challenging to design and implement efficiently in hardware. In the past couple of years, this has led to an explosion in research and development of automatic Neural Architecture Search (NAS) tools. AutoML tools are now used to achieve state-of-the-art NNA designs and attempt to optimize for hardware usage and design. Much of the recent research in the auto-design of NNAs has focused on convolution networks and image recognition, ignoring the fact that a significant part of the workload in data centers is general-purpose deep neural networks. In this work, we develop and test a general multilayer perceptron (MLP) flow that can take arbitrary datasets as input and automatically produce optimized NNAs and hardware designs. We test the flow on six benchmarks. Our results show we exceed the performance of currently published MLP accuracy results and are competitive with non-MLP based results. We compare general and common GPU architectures with our scalable FPGA design and show we can achieve higher efficiency and higher throughput (outputs per second) for the majority of datasets. Further insights into the design space for both accurate networks and high-performing hardware show the power of co-design by correlating accuracy versus throughput, network size versus accuracy, and scaling to high-performance devices.
Index Terms—Evolutionary Algorithms, Machine Learning, FPGA, Automated Design
I. INTRODUCTION AND MOTIVATION
Optimizing neural network architectures (NNAs) is a difficult process, in part because of the vast number of hyperparameter combinations that exist. In cases where a combination is not optimal, performance will suffer. Research suggests that the parameters of a network can directly influence the accuracy, throughput, and energy consumption of that model in deployment [1]. The difficulty in designing performant neural networks has brought a recent surge of interest in the automatic design and optimization of neural networks. The focus of the existing body of research has been on optimizing NNAs for accuracy [2][3][4], but these efforts lack network-specific hardware optimizations. Our focus is to close this gap by using evolutionary algorithms to search an entire design space, including NNA and reconfigurable hardware. This co-design approach yields a custom hardware implementation for a specific NNA model.

Numerous articles have focused on image classification with convolutional neural networks, in part due to the introduction of AlexNet in 2012 and the ImageNet database [5]. Traditional neural networks such as multilayer perceptrons (MLPs) have taken a backseat from the spotlight. Yet their application is still very relevant: large data-centric companies such as Facebook [6][7] and Google [8] have published data showing that MLP workloads make up the majority of their application base. Facebook cites the use of MLPs for tasks such as determining which ads to display, which stories matter to see in a news feed, and which results to present from a search. In [9], Park et al. stress the importance of these networks, the current limitations of standard hardware, and the need for what this research aims to provide, i.e., software and hardware co-design. Google recently published a performance analysis of their TPU, showing MLPs constituting 61% of TPU workloads. Our findings are in line with the results presented in their paper, which state that MLP workloads are memory bound. Our co-design process is capable of designing optimal hardware structures that balance the compute and memory requirements. Section IV provides insights on memory bandwidth considerations.

At the heart of an MLP is a general matrix multiplication (GEMM), and libraries such as the Basic Linear Algebra Subprograms (BLAS) have existed for years to provide highly tuned, hardware-specific GEMM implementations. Standard hardware such as CPUs and GPUs provide these subroutines to make implementing performant MLPs easy. Still, two problems persist in this typical flow: 1. target hardware is normally not considered during the creation of the MLP, and 2. standard architectures are general purpose. General-purpose architectures do not necessarily offer performance scaling with respect to the ANN description (see Section IV). Our research aims to take advantage of the reconfigurable architecture of an FPGA device that is capable of molding to a specific workload and neural network structure. Leveraging evolutionary algorithms to search the entire design space of both MLP and target hardware simultaneously, we find unique solutions that achieve both top accuracy and optimal hardware performance. Section IV provides top results for MLP networks and hardware configurations derived from a series of tests over multiple datasets. Correlations between traditional approaches and our approach are discussed to provide insights into how we achieve top results.

II. RELATED WORK
Automating NNA search has been an ongoing effort for the past few decades but is becoming a focus of the NNA research community because of the difficulty of designing deep networks, which are ever growing in complexity [3][4][10]. Automatic Artificial Neural Network Architecture Search (NAS) can be conducted using different strategies such as random search, evolutionary algorithms, Reinforcement Learning (RL), Bayesian optimization, and gradient-based methods [10]. Using Evolutionary Algorithms (EAs) to search for performant architectures has been investigated extensively over the years [11][12]. Some recent results indicate that evolutionary algorithms offer better results than random search and reinforcement learning [4]. Recently, there has been growing interest in NAS for deep neural networks that specialize in image recognition [2][3][13]. A recent survey on available tool flows is given in [14]. The body of work on NAS concentrates on accuracy as the main measure of performance, though optimizing NAS can lead to more simplified NNAs that could in turn simplify and optimize hardware designs [4][10]. On the other hand, optimizing for hardware performance parameters (latency/throughput/power) is normally done on an existing NNA design, and there is no attempt to modify the NNA (layers/neurons, etc.) [14].

III. EVOLUTIONARY CELL AIDED DESIGN FLOW
The Evolutionary Cell Aided Design (ECAD) flow, previously described in [15], is shown in Figure 1. It starts with a general industrial/research problem for which (a) sufficient data exists, (b) there are well-defined inputs/outputs, and (c) the problem can benefit from software/hardware acceleration. Once such a problem is identified, a dataset is exported into a Comma Separated Value (CSV) tabular data format. In addition, a configuration file is created containing information on (a) the general NNA structure, including input and output sizes and the initial number of layers and neurons, (b) the hardware target, including the reconfigurable hardware device type, DSP count, memory size, and number of blocks, and (c) optimization targets such as accuracy, throughput, latency, and floating-point operations. Note that the configuration file can be generated automatically based on an existing template configuration file and the dataset.
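For concreteness, such a configuration might look like the following sketch. The key names and layout are our own illustration, not ECAD's actual file format; the device figures are the Arria 10 GX 1150 values used in Section IV.

```python
# Hypothetical ECAD-style configuration; key names are illustrative
# assumptions, not the actual ECAD format.
config = {
    "nna": {
        "input_size": 784,       # dataset feature count
        "output_size": 10,       # number of classes
        "initial_layers": 2,     # starting point for the search
        "initial_neurons": 128,
    },
    "hardware": {
        "device": "Arria 10 GX 1150",
        "dsp_count": 1518,       # hardened FP32 DSP blocks
        "m20k_blocks": 2713,     # on-chip SRAM blocks
        "clock_mhz": 250,
        "dram": {"type": "DDR4", "banks": 1, "bandwidth_gbps": 19.2},
    },
    "optimize_for": ["accuracy", "throughput"],  # fitness criteria
}
```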
A. Evolutionary Process
The ECAD evolutionary process, based on a steady-state model [16], generates a population of NNA/hardware co-design candidates, each with a complete set of parameters that affect both the accuracy and the hardware performance. The parameters we considered during our searches included the number of layers, layer size, activation function, and bias.

Fig. 1: ECAD flow.

Each candidate in the population is evaluated according to configurable and potentially multiple criteria, for example accuracy alone or accuracy versus throughput. The raw evaluation measurements are done on a software artifact, dubbed a Worker in ECAD terminology. The Worker returns the raw evaluation information to a Master process. The Master process orchestrates the evaluation process by distributing the co-design population and by evaluating the results. Result evaluation is done using user-defined fitness functions. For example, an accuracy fitness function can simply return the accuracy value obtained by a simulation worker (see the next section for more information on worker types), but it can also scale or weight the value, or specify whether to minimize or maximize it. Simple evaluation functions can be specified in the configuration file, and more complex ones are written in code and added by registering them with the framework.
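As a concrete illustration, here is a minimal sketch of a steady-state search loop with registered fitness functions. All names are our own simplification of the Master/Worker design, and the stub worker fabricates measurements in place of real training and hardware runs.

```python
import random

# Registered, user-defined fitness functions: each maps a Worker's raw
# measurements to a score to maximize (scaling/weighting happens here).
fitness_functions = {
    "accuracy":   lambda raw: raw["accuracy"],
    "throughput": lambda raw: raw["outputs_per_s"] / 1e6,
}

def stub_worker(candidate):
    """Stand-in Worker: a real one trains the candidate or runs it on
    target hardware, then reports raw measurements to the Master."""
    neurons = sum(candidate["layers"])
    return {"accuracy": random.random(), "outputs_per_s": 1e7 / neurons}

def mutate(candidate):
    """Perturb searched parameters (here only layer sizes, for brevity)."""
    child = dict(candidate)
    child["layers"] = [max(8, n + random.choice([-32, 0, 32]))
                       for n in candidate["layers"]]
    return child

def score(candidate, worker):
    raw = worker(candidate)
    return sum(f(raw) for f in fitness_functions.values())

def evolve(population, worker=stub_worker, steps=1000):
    """Steady-state model: replace one weak individual per step."""
    scored = [(score(c, worker), c) for c in population]
    for _ in range(steps):
        scored.sort(key=lambda sc: sc[0])          # worst candidate first
        parent = random.choice(scored[len(scored) // 2:])[1]  # fitter half
        child = mutate(parent)
        s = score(child, worker)
        if s > scored[0][0]:                       # displace current worst
            scored[0] = (s, child)
    return max(scored, key=lambda sc: sc[0])[1]

best = evolve([{"layers": [128, 64]} for _ in range(20)])
print(best)
```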
B. ECAD Hardware
The evolutionary search has three workers at its disposal to assess the fitness of various hardware platforms. The simulation worker is useful for assessing instruction-set-based architectures such as CPUs and GPUs, whereas the physical and hardware database workers are useful for hardware that requires design and synthesis.

Simulation workers take queries from the evolutionary engine and run them on the target hardware. For example, if the target hardware is a GPU, the evolutionary engine sends the simulation worker an ANN description, which the worker converts into a format the GPU can run. Once the GPU returns the result, the worker is responsible for recording all necessary metrics that are provided back to the engine to assess the fitness. Metrics include throughput, latency, and power consumption.

Hardware database workers provide a means of evaluating hardware platforms that are more easily modeled than executed directly. In our experiments, for example, we leveraged the hardware database worker to accept both an ANN description and a hardware configuration that together were run through a model to obtain the metrics for fitness evaluation. Once the engine finds performant designs by tracking each configuration's fitness, the model can be realized into hardware and run through the physical worker. The FPGA proved to be a suitable architecture for the hardware database worker: the reconfigurable nature of FPGAs, coupled with modeled overlay architectures, allows the worker to assess many configurations relatively swiftly compared to running them through synthesis tools.

Physical workers can be used to synthesize and evaluate hardware designs. While the hardware database worker provides fitness of the overall application with metrics such as throughput, the physical worker aims to provide the fitness of the hardware design itself through metrics such as power, logic utilization, and operating frequency. In the case of Intel FPGAs, the physical worker responds with ALM, M20K, and DSP utilization, power estimations, and clock frequency (Fmax).

Having access to these workers in the evolutionary engine is a powerful tool that provides various performance metrics about the hardware platform as a function of the ANN. Since ANN descriptions can vary widely, the throughput and power consumption on hardware devices can also vary widely. Understanding how a hardware device performs across ANN variations aids the decision on how to deploy ANN solutions, as seen in the experiments.

Beyond these tools providing insights into architecture performance, the Pareto frontiers that result after parsing the evolutionary design space define what the optimal solution is. Every industry deploying ANN solutions may have different opinions and demands for its use case. One size fits all, or one solution, is not realistic or optimal. Having the data to make decisions based on trade-offs is highly valuable. Since some ANN descriptions are approximations, i.e., they include redundancy (although we note that newer research is starting to limit the amount of redundancy) and educated guesses, it is an unreasonable expectation to hand optimize ANN descriptions for a specific hardware platform.
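Structurally, the three worker roles share one evaluation interface. The sketch below is our simplified rendering of that split; the class and method names are illustrative assumptions, not ECAD's actual API.

```python
from abc import ABC, abstractmethod

# Our simplified rendering of the three worker roles described above.
class Worker(ABC):
    @abstractmethod
    def evaluate(self, ann_description, hw_config):
        """Return raw metrics for the engine (throughput, latency, ...)."""

class SimulationWorker(Worker):
    """Runs the ANN directly on instruction-set hardware (CPU/GPU)."""
    def evaluate(self, ann_description, hw_config): ...

class HardwareDatabaseWorker(Worker):
    """Evaluates an analytic model, e.g., the FPGA overlay model."""
    def evaluate(self, ann_description, hw_config): ...

class PhysicalWorker(Worker):
    """Synthesizes the design; reports ALM/M20K/DSP usage, power, Fmax."""
    def evaluate(self, ann_description, hw_config): ...
```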
C. FPGA Design Model and Hardware Description
Workers can provide input to any hardware capable of being modeled efficiently. Given the reconfigurable architecture of FPGAs and modern HLS tools capable of describing overlay designs that interface nicely with software programs, we chose Intel FPGAs and OpenCL as our hardware platform and development environment. Any flavor of FPGA can use the overlay-style architecture we developed; all that is required to change the design search space, for example between Arria 10 and Stratix 10, is the hardware configuration used by the hardware database worker.

Hardware database workers receive configurations that include enough information to construct the model and derive the necessary performance results. Part of this configuration is a hardware-specific configuration file that defines the target accelerator. This information includes the name of the FPGA, the relevant primitive logic details such as DSP and SRAM counts, the target clock frequency, the type of global memory (DRAM) to be used, and its speed and rate. All of this information is used to estimate performance. The other part of the configuration is the ANN description. Low-level hardware-suitable parts are decomposed from the ANN description and define the work that must be done by the accelerator.

General matrix multiplications are commonly used to accelerate multilayer perceptrons, so the design we used is based on a 2D systolic array architecture that includes additional functionality to support activation functions and vector additions for bias operations. This "grid" architecture has various design space variables that we allow mutations to take place on. The variables are the number of rows and columns, the double-buffer cache sizes for each dimension, called interleaving, and the vector width of each processing element (PE). Every variable has implications in the design space, and each mutation has a unique fitness, not only as it applies to the hardware but also to the ANN. This work leverages and builds on the SGEMM architecture described in [17].

Our model returns values we deemed fundamental, including potential and effective performance, total time, outputs per second, and latency. Giga-operations per second (GOP/s) is the unit used to describe the potential and effective performance. The potential performance is typically the marketed performance that defines the roofline of the configuration, whereas the effective performance is the actual performance of the configuration under a workload. The delta between them provides a metric of how well the problem maps to the model. The total time for our model is defined as the difference between the timestamp when all necessary data persists in the accelerator DRAM and we enqueue the OpenCL kernels, and the timestamp once all results persist in DRAM and the last OpenCL kernel returns. We call this a run. Outputs per second is a generalization over the data type and gives the total results we can produce in one second; for example, if the data type is images, this metric can be interpreted as images per second. Lastly, latency is the time it takes from the start of a run to store one result into DRAM.

Calculating these results in the model starts with the baseline performance of a configuration. All data is 32-bit floating point and maps into the hardened floating-point DSP blocks of the Arria 10 device. Following [17], we calculate the baseline performance by determining how many DSP blocks are doing work; the number of utilized DSPs is the product of the grid dimensions and the vector width. This number gives the performance before considering bandwidth. Using the DRAM specs from the configuration, we determine the ratio of how much bandwidth is available to how much we need, where the bandwidth we need is calculated as the size of a block in bytes divided by the cycles spent per block of data. Applying this ratio yields the potential performance. Next, the grid configuration is used to break the ANN up into a series of blocked matrix multiplications. With the data blocked, we can derive all other results.
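The sketch below renders that calculation in code. It is our reading of the model, not the authors' implementation; the blocking values in the example call are hypothetical, while the device figures (1518 DSPs, 250 MHz, 19.2 GB/s) are the Arria 10 numbers used in Section IV.

```python
# Sketch of the bandwidth-adjusted roofline described above; the exact
# formulas are our reading of the text, not the authors' model code.
def potential_gflops(rows, cols, vec, freq_mhz,
                     block_bytes, cycles_per_block, dram_gbs):
    dsps = rows * cols * vec                 # DSP blocks doing work
    baseline = 2 * dsps * freq_mhz / 1e3     # 1 FP32 MAC = 2 ops/cycle
    # Required bandwidth: bytes per block over the cycles to consume it.
    needed_gbs = (block_bytes / cycles_per_block) * freq_mhz / 1e3
    return baseline * min(1.0, dram_gbs / needed_gbs)

# A 16x16 grid of vector-4 PEs at 250 MHz with one DDR4 bank (19.2 GB/s)
# uses 1024 DSPs; fully utilizing the Arria 10's 1518 DSPs at 250 MHz
# would reproduce the 759 GFLOP/s peak quoted in Section IV.
print(potential_gflops(16, 16, 4, 250, 64 * 1024, 4096, 19.2))  # 512.0
```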
D. MLP Mapping to Hardware
GEMM nomenclature can be used to describe the three key dimensions that make up the problem size for MLP layers. Two matrices A and B are multiplied together to form a new matrix C. If matrix A is m x k and matrix B is k x n, then matrix C is m x n. M is the number of inputs processed at once, commonly referred to as the batch size. Architectures such as GPUs typically batch with a larger m dimension to fill up compute cores and obtain higher throughput. Our design for the FPGA does not need to increase batching, because the PEs can be arranged in a manner that exploits parallelism in other dimensions. This results in a lower-batch, lower-latency accelerator. With sufficient bandwidth, hardware efficiency can be increased by keeping the m dimension smaller and further vectorizing on the k and n dimensions. The n dimension is the number of neurons, which also defines the subsequent layer's k. Lastly, the size of the dataset defines the first layer's k. Fitting an MLP into hardware is one aspect that makes the co-design approach so powerful.
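To make the mapping concrete, the sketch below enumerates the (m, k, n) GEMM shapes for a hypothetical MLP; the layer sizes and function name are ours, for illustration only.

```python
# Hypothetical illustration of how MLP layers decompose into GEMM
# dimensions (names are ours, not from the ECAD codebase).
def mlp_to_gemm_dims(input_size, layer_sizes, batch):
    """Yield (m, k, n) per layer: A(m x k) @ B(k x n) -> C(m x n)."""
    k = input_size          # first-layer k is the dataset feature count
    for n in layer_sizes:   # n is the number of neurons in the layer
        yield (batch, k, n)
        k = n               # this layer's n becomes the next layer's k

# Example: a 784-input MLP with two hidden layers and 10 outputs.
for dims in mlp_to_gemm_dims(784, [256, 128, 10], batch=4):
    print(dims)   # (4, 784, 256), (4, 256, 128), (4, 128, 10)
```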
IV. EXPERIMENTS

In this section, we show the results from running a series of evolutionary searches on six different data sets: MNIST [18], Fashion MNIST [19], Credit-g [20], Har [21], Phishing [20], and Bioresponse [22]. MNIST and Fashion MNIST were chosen so we could compare against widely used research benchmarks. Credit-g/Har/Phishing/Bioresponse were chosen since they represent potential real-world datasets.

FPGA hardware results assume the architecture described in Section III-C. The search was done on two different FPGAs: an Arria 10 1150 GX device [23] at a clock frequency of 250 MHz and a Stratix 10 2800 device at a clock frequency of 400 MHz. After many hardware compiles, 250 MHz was, on average, the frequency the OpenCL design achieved for the Arria 10. Running at 250 MHz provides a peak throughput of 759 GFLOP/s of FP32 single-precision performance. The accelerator card was the development kit from Intel, which contains a single bank of DDR4 memory providing a peak bandwidth of 19.2 GB/s. In many cases, the evolutionary algorithm requested configurations that led to bandwidth-constrained designs. We do provide a few results that included a design space with 2 and 4 DDR banks, providing 38.4 and 76.8 GB/s of bandwidth respectively. All Stratix 10 models were run with 4 banks of DDR.

GPU results were obtained using three different devices: an NVIDIA Quadro M5000 with 8 GB DDR5, capable of 4.3 TFLOP/s of FP32 single-precision performance with 211 GB/s memory bandwidth; a Titan X capable of 12 TFLOP/s of FP32 single-precision performance; and a Radeon VII with 16 GB HBM2, 13.44 TFLOP/s FP32, and 1 TB/s memory bandwidth. Profiling for the GPUs was done using trace files generated from TensorFlow. The timing report considers matrix multiplication, activation, and vector addition routines, but it does not appear to take into account DRAM transfers. FPGA timing reports do consider DRAM because memory buffering is an active component in the design. Overall, our conclusions are not affected by the difference between timing reports, which likely skews direct comparisons of FPGA and GPU (in favor of GPU).

Power consumption for the GPUs was captured during runs using the built-in nvidia-smi utility. We found the power management for GPU to be efficient since, in most cases, the effective performance was rather low (see Section IV-D), and so was the power. On average, the GPU power measured around 50 W for the 150 W (maximum) device. Quartus' Power Analyzer Tool was used to capture power estimates for the FPGA. Across the many Arria 10 designs compiled, we found the minimum power to be 22.5 W, the maximum power to be 31.89 W, and the average power to be 27 W. The difference between power numbers for FPGA and GPU is that FPGA represents chip power while GPU represents total board power. This discrepancy made it challenging to compare power between architectures. Given this discrepancy and the small variance across configurations, we decided to leave the topic of power outside the conclusions.
A. Overall Performance Results
We first examine the accuracy results we were able to achieve using the evolutionary flow. Table I presents the top results obtained from the evolutionary algorithm searching for accuracy using a 10-fold evaluation method [24][25]. This method splits the data set into 10 equal train/test folds and measures performance on each of the folds. As can be seen, ECAD MLP (our method) was able to achieve better accuracy results than all MLP-based classifiers. In addition, the ECAD evolutionary process managed to outperform non-MLP-based methods for the credit-g and phishing datasets. The MNIST and Fashion MNIST datasets were obtained outside of OpenML, use a traditional 1-fold train/test split, and are compared to results published in the literature [19][18]. As can be seen, our MNIST and Fashion MNIST accuracy results outperform the top reported MLP results. In addition, on Fashion MNIST our auto MLP network has the second-best reported result overall and is 0.0047 shy of the SVC method record holder. Table III shows ECAD run time statistics for the results reported in Tables I and II. It reports the number of different NNA/HW combinations that were automatically generated and evaluated by the ECAD system, the average time per evaluation, and the total evaluation time of all candidate architectures. Note that in order to optimize the search and run time of the system, potential NNA/HW candidates are first analyzed for similarities to previous evaluations, and duplicates are not evaluated twice.

Table IV shows the results for the two top Pareto frontier solutions for each data set. Note that this search was done outside of the OpenML spec, i.e., running a single fold. The solutions provide accuracy and throughput for a Stratix 10 (S10) FPGA and a Titan X (TX) GPU. In the majority of cases, the FPGA achieved higher performance than the GPU. Credit-g, for example, favored the GPU at the highest accuracy, but looking at the second row for credit-g, by sacrificing just one point of accuracy, the FPGA sees a very significant improvement in throughput.
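A minimal sketch of this 10-fold protocol follows, using scikit-learn and its bundled digits data as a stand-in; the actual flow instead trains each evolved candidate network on the benchmark datasets.

```python
# 10-fold evaluation: split into 10 equal train/test folds and measure
# accuracy on each; the model here is a placeholder, not an evolved NNA.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
scores = cross_val_score(clf, X, y, cv=10)   # one accuracy per fold
print(scores.mean(), scores.std())           # aggregate 10-fold accuracy
```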
B. Accuracy versus Throughput
Part of the motivation for this research is to show the adaptability of reconfigurable hardware to a neural network. This flexibility, molding the hardware over a series of GEMM calls, produces results aimed at fitting an optimal network description, i.e., highest accuracy, with an optimal hardware solution producing high and efficient throughput. We began by running the evolutionary search over the HAR dataset and observing how both GPU and FPGA reacted to each evolutionary step. Figure 2a provides the results for the Arria 10, and Figure 2b provides the results for the Quadro M5000.

The evolutionary process provided many high-performance results in terms of accuracy, with the top at 0.995, and provided many varying results for throughput. The broad swing of throughput results (especially for the GPU, being a fixed architecture) shows that many MLP solutions exist that achieve the top accuracy.

TABLE I: Top 10-fold Accuracy (Acc) for All Datasets Compared to Previous Works

Dataset     | Top Acc (Any) | Top Method              | Top Acc (MLP) | MLP Type       | ECAD MLP
Credit-g    | 0.7860        | mlr.classif.ranger      | 0.7470        | *MLPClassifier |
Har         | 0.9957        | *DecisionTreeClassifier | 0.1888        | *MLPClassifier | 0.9909
Phishing    | 0.9753        | *SVC                    | 0.9733        | *MLPClassifier |
Bioresponse | 0.8160        | mlr.classif.ranger      | 0.5423        | *MLPClassifier | 0.8038
TABLE II: Top 1-fold Accuracy (Acc) for All Datasets Compared to Previous Works

Dataset       | Top Acc (Any) | Top Method | Top Acc (MLP) | MLP Type                | ECAD MLP
MNIST         | 0.9979        | Manual     | 0.9840        | Manual (no distortions) | 0.9852
Fashion MNIST | 0.8970        | SVC        | 0.8770        | MLPClassifier           | 0.8923
TABLE III: Top Accuracy Run Time Statistics

Dataset       | Total Models Evaluated | Avg Model Evaluation Time (s) | Total Evaluation Time (s)
MNIST         | 553                    | 71.23                         | 39388.6
Fashion MNIST | 481                    | 82.55                         | 39708.7
Credit-g      | 10480                  | 2.24                          | 23495.2
Har           | 3229                   | 10.20                         | 33069.4
Phishing      | 3534                   | 9.24                          | 32661.3
Bioresponse   | 5309                   | 5.89                          | 31285.0

Note: Each model generated is a fully functional combination of NNA traits and hardware traits that is evaluated for performance on any of the measured metrics. The ECAD system caches similar configurations and avoids reevaluating them.
TABLE IV: Best Pareto Frontier Results for Searching Accuracy and Throughput

Dataset       | Accuracy | S10 (outputs/s) | TX (outputs/s)
MNIST         | 0.9841   | 7.97E5          | 7.73E5
MNIST         | 0.9763   | 2.45E6          | 1.97E6
Fashion MNIST | 0.893    | 4.8E5           | 8.1E5
Fashion MNIST | 0.8850   | 1.92E6          | 2.3E6
Har           | 0.996    | 1.16E6          | 9.59E5
Har           | 0.985    | 4.74E6          | 2.46E6
Credit-g      | 0.83     | 8.19E3          | 1.59E6
Credit-g      | 0.82     | 1.40E7          | 1.23E6
Bioresponse   | 0.798    | 4.64E5          | 1.34E6
Bioresponse   | 0.7952   | 1.36E6          | 1.66E6
Phishing      | 0.9675   | 6.81E6          | 2.27E6
Phishing      | 0.9656   | 1.16E7          | 2.27E6

GPUs accelerate each solution in the same way, so the varying levels of throughput mean the MLP structure is changing. The top GPU throughput achieved at an accuracy of 0.995 is approximately 1E6 results per second, while the bottom throughput result is approximately 6E5 results per second. By the same reasoning, as we move down a point of accuracy, the GPU's performance hardly changes; this is because the number of neurons remains roughly the same, and it is the distribution between layers that causes the effect on accuracy. For the GPU, there is roughly no relationship between the number of neurons and the throughput. The FPGA shows a different correlation: Figure 2a shows that the distribution of neurons in an MLP greatly affects the performance. Unlike the GPU, there is (potentially) a different hardware configuration for each data point. While top accuracy only reaches 1.6E5 outputs per second, moving down in accuracy by just 0.1% results in a giant leap to 1.56E6 outputs per second, an order of magnitude higher. Further, moving down another 0.2% results in approximately another 1.5x in performance. Each dataset tested displayed this same trend.
C. Performance Scaling against Bandwidth
Most designs returned by the evolutionary algorithm tended to be smaller, using only a fraction of the available DSPs. The reason is that scaling to more DSPs requires more data, which requires more memory bandwidth. We hit the memory bandwidth roofline many times due to only having a single bank of DDR. Observing this, we ran a series of tests that provided the hardware database model with 2 and 4 banks of DDR. We found mostly linear scaling going from 1 to 4 banks, so we show the effect bandwidth has on throughput for 1 and 4 DDR banks in Figure 3. Higher bandwidth did not produce greater efficiency but did result in higher throughput overall.
D. Hardware Efficiency and Scaling to Larger Devices
We modified the hardware database worker to return results based on a Stratix 10 2800 device with 4 banks of DDR. This device offers up to a 10x performance scaling from the Arria 10 device we used in previous experiments. While the Stratix 10 device is capable of performing up to 10 TFLOP/s, we used the same methodology as for the Arria 10 device and searched Stratix 10 devices at a clock frequency of 400 MHz, scaling back the roofline to 4.6 available TFLOP/s. To better match the performance of the Stratix 10 device, we used an Nvidia Titan X device capable of 12 TFLOP/s.
Fig. 2: Performance of FPGA and GPU at different levels of accuracy for the har dataset.

Fig. 3: Throughput and hardware efficiency for FPGA designs with 1 and 4 banks of DDR on the credit-g data set.

We found that overall, the reconfigurable architecture had a more significant potential to perform at a higher level; however, for some data sets, the Titan X raw throughput was able to achieve the highest outputs per second. Throughput is a popular metric to consider, but we also found efficiency to be of high importance. The reason the larger FPGA device handled these datasets well is the allocation of resources. When the evolutionary algorithm chooses a hardware configuration, there is an allotted space on the FPGA that carries a potential performance. Then, after mapping the MLP, we get an effective performance (see Section III-C for details on potential and effective performance). The ratio of effective performance to potential performance gives us hardware efficiency. Optimizing for efficiency can yield designs that use fewer FPGA resources while maintaining throughput, leaving logic available for other tasks such as pre- or post-processing.

Fig. 4: Hardware efficiency results for a Stratix 10 2800 and Titan X searching over the MNIST dataset.

Figure 4 shows the FPGA and GPU efficiency of various solutions from searching across the MNIST dataset. The top accuracy for this particular run was 0.9845, and throughput was 796,611 and 773,162 outputs per second for the FPGA and GPU respectively, almost identical. If we consider efficiency for this result, the FPGA utilized 41.5% of the allocated logic, while the GPU only utilized 0.3%. We calculated GPU efficiency as the number of operations per second obtained from a run out of the total potential operations per second of the device. The conclusion is that without target hardware in mind during MLP development, there is a good chance of losing efficiency. Running an evolutionary search on reconfigurable hardware can balance the fitness between throughput and accuracy.
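As a worked illustration of that efficiency calculation, the sketch below recomputes GPU efficiency from outputs per second. The layer sizes are hypothetical placeholders, so the resulting percentage depends entirely on the assumed network and will not reproduce the 0.3% figure exactly.

```python
# Hardware efficiency = effective ops/s over potential ops/s. The MLP
# layer sizes here are hypothetical, not the actual evolved network.
def ops_per_inference(input_size, layers):
    k, total = input_size, 0
    for n in layers:
        total += 2 * k * n    # one multiply-accumulate = 2 ops per weight
        k = n
    return total

ops = ops_per_inference(784, [256, 128, 10])  # hypothetical MNIST MLP
effective = 773_162 * ops                     # outputs/s from the Fig. 4 run
potential = 12e12                             # Titan X FP32 peak, ops/s
print(f"GPU efficiency: {effective / potential:.2%}")
```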
V. CONCLUSIONS
We address the difficulty of designing highly performant neural networks by leveraging evolutionary search algorithms capable of finding the fittest solutions for both classification accuracy and hardware throughput. This process is shown to be both highly efficient and effective compared to traditional approaches that first design a neural network to achieve a target accuracy and then run it on general-purpose hardware. Through a series of experiments, we present our results for state-of-the-art neural network configurations that surpass currently published work. We explain the power of co-design by discussing the results of experiments showing accuracy versus throughput, performance scaling with bandwidth, and scaling designs to larger devices.
VI. FUTURE DIRECTIONS
We believe a custom co-design framework for MLPs and reconfigurable hardware could be of use to a large audience. We plan to continue to develop this framework, streamline its use, and make it available to the general public.
REFERENCES

[1] A. Canziani, A. Paszke, and E. Culurciello. "An analysis of deep neural network models for practical applications". arXiv preprint arXiv:1605.07678 (2016).
[2] H. Liu et al. "Hierarchical representations for efficient architecture search". arXiv preprint arXiv:1711.00436 (2017).
[3] E. Real et al. "Large-scale evolution of image classifiers". In: Proceedings of the 34th International Conference on Machine Learning, Vol. 70. JMLR.org, 2017, pp. 2902–2911.
[4] E. Real et al. "Regularized evolution for image classifier architecture search". arXiv preprint arXiv:1802.01548 (2018).
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.
[6] K. Hazelwood et al. "Applied machine learning at Facebook: A datacenter infrastructure perspective". In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 620–629.
[7] C.-J. Wu et al. "Machine learning at Facebook: Understanding inference at the edge". In: 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 331–344.
[8] N. Jouppi et al. "Motivation for and evaluation of the first Tensor Processing Unit". In: IEEE Micro (2018).
[9] J. Park et al. "Deep learning inference in Facebook data centers: Characterization, performance optimizations and hardware implications". arXiv preprint arXiv:1811.09886 (2018).
[10] T. Elsken, J. H. Metzen, and F. Hutter. "Neural architecture search: A survey". arXiv preprint arXiv:1808.05377 (2018).
[11] G. F. Miller, P. M. Todd, and S. U. Hegde. "Designing neural networks using genetic algorithms". In: ICGA. Vol. 89. 1989, pp. 379–384.
[12] K. O. Stanley and R. Miikkulainen. "Evolving neural networks through augmenting topologies". In: Evolutionary Computation 10.2 (2002), pp. 99–127.
[13] B. Zoph et al. "Learning transferable architectures for scalable image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 8697–8710.
[14] S. I. Venieris, A. Kouris, and C.-S. Bouganis. "Toolflows for mapping convolutional neural networks on FPGAs: A survey and future directions". In: ACM Computing Surveys (CSUR) (2018).
[15] In: 2019 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2019.
[16] D. E. Goldberg and K. Deb. "A comparative analysis of selection schemes used in genetic algorithms". In: Foundations of Genetic Algorithms. Vol. 1. Elsevier, 1991, pp. 69–93.
[17] A. Vishwanath et al. "Enabling high-performance floating-point designs". Intel White Paper (2016).
[18] Y. LeCun and C. Cortes. "MNIST handwritten digit database". 2010. URL: http://yann.lecun.com/exdb/mnist/.
[19] H. Xiao, K. Rasul, and R. Vollgraf. "Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms". CoRR abs/1708.07747 (2017). URL: http://arxiv.org/abs/1708.07747.
[20] D. Dua and C. Graff. UCI Machine Learning Repository. 2017. URL: http://archive.ics.uci.edu/ml.
[21] D. Anguita et al. "A public domain dataset for human activity recognition using smartphones". In: ESANN. 2013.
[22] Boehringer Ingelheim. 2011.
[23] Intel Arria 10 Device Overview. 2018.