Evolutionary Neural Architecture Search Supporting Approximate Multipliers
Michal Pinos, Vojtech Mrazek, and Lukas Sekanina
Brno University of Technology, Faculty of Information Technology, IT4Innovations Centre of Excellence, Božetěchova 2, 612 66 Brno, Czech Republic
[email protected], [email protected], [email protected]
Accepted for publication at the 24th European Conference on Genetic Programming (EuroGP).
Abstract.
There is a growing interest in automated neural architecture search (NAS) methods. They are employed to routinely deliver high-quality neural network architectures for various challenging data sets and reduce the designer's effort. The NAS methods utilizing multi-objective evolutionary algorithms are especially useful when the objective is not only to minimize the network error but also to minimize the number of parameters (weights) or the power consumption of the inference phase. We propose a multi-objective NAS method based on Cartesian genetic programming for evolving convolutional neural networks (CNN). The method allows approximate operations to be used in CNNs to reduce the power consumption of a target hardware implementation. During the NAS process, a suitable CNN architecture is evolved together with approximate multipliers to deliver the best trade-offs between the accuracy, network size, and power consumption. The most suitable approximate multipliers are automatically selected from a library of approximate multipliers. Evolved CNNs are compared with common human-created CNNs of a similar complexity on the CIFAR-10 benchmark problem.
Keywords:
Approximate computing · Convolutional neural network · Cartesian genetic programming · Neuroevolution · Energy Efficiency.
1 Introduction

Machine learning technology based on deep neural networks (DNNs) is currently penetrating into many new application domains. It is deployed in such classification, prediction, control, and other tasks in which designers can collect the comprehensive data sets that are mandatory for training and validating the resulting model. In many cases (such as smart glasses or voice assistants), DNNs have to be implemented in low-power hardware operated on batteries. In particular, the inference process of a fully trained DNN is typically accelerated in hardware to meet real-time requirements and other constraints. Hence, drastic optimizations and approximations have to be introduced at the level of hardware [2]. On the other hand, DNN training is typically conducted on GPU servers.
Existing DNN architectures have mostly been developed by human experts manually, which is a time-consuming and error-prone process. The current approach to hardware implementations of DNNs is based on semi-automated simplifying of a network model, which was initially developed for a GPU and trained on a GPU without considering any hardware implementation aspects. There is a growing interest in automated DNN design methods known as neural architecture search (NAS) [4,26]. Evolutionary NAS, introduced over three decades ago [28], is now intensively adopted, mostly because it can easily be implemented as a multi-objective design method [20,12].

This paper is focused on NAS applied to the automated design of convolutional neural networks (CNNs) for image classification. Current NAS methods only partly reflect hardware-oriented requirements on the resulting CNNs. In addition to the classification accuracy, some of them try to minimize the number of parameters (such as multiply and accumulate operations) for a GPU implementation, which performs all operations in the floating-point number representation [12,4]. Our research aims to propose and evaluate a NAS method for the highly automated design of CNNs that reflect hardware-oriented requirements. We hypothesize that more energy-efficient hardware implementations of CNNs can be obtained if hardware-related requirements are specified, reflected, and exploited during the NAS. In this paper, we specifically focus on the automated co-design of the CNN topology and approximate arithmetic operations. The objective is to automatically generate CNNs showing good trade-offs between the accuracy, the network size (the number of multiplications), and the degree of approximation in the used multipliers.

The proposed method is based on a multi-objective Cartesian genetic programming (CGP) whose task is to maximize the classification accuracy and minimize the power consumption of the most dominant arithmetic operation, i.e., the multiplications conducted in convolutional layers. To avoid the time-consuming automated design of approximate multipliers, CGP selects suitable multipliers from a library of approximate multipliers [14]. While CGP delivers the network topology, the weights are obtained using TensorFlow. A NAS supporting approximate multipliers in CNNs is obviously more computationally expensive than the NAS of common CNNs. The reason is that TensorFlow does not support the fast execution of CNNs that contain non-standard operations such as approximate multipliers. We propose eliminating this issue by employing TFApprox [25], which extends TensorFlow to support approximate multipliers in CNN training and inference. Evolved CNNs are compared with common human-created CNNs of a similar complexity on the CIFAR-10 benchmark problem.

To summarize our key contributions: We present a method capable of an automated design of the CNN topology with an automated selection of suitable approximate multiplier(s). The methodology uniquely integrates a multi-objective CGP and TFApprox-based training and evaluation of CNNs containing approximate circuits. We demonstrate that the proposed method provides better trade-offs than a common approach based on introducing approximate multipliers to CNNs developed without reflecting any hardware aspects.

2 Related Work
Convolutional neural networks are deep neural networks employing, in addition to other layer types, the so-called convolutional layers. These layers are capable of processing large input vectors (tensors). At the same time, the number of parameters (the weights in the convolutional kernels) they use is small compared to common fully-connected layers. Because state-of-the-art CNNs consist of hundreds of layers and millions of network elements, they are demanding in terms of the execution time and energy requirements. For example, the inference phase of a trained CNN such as ResNet-50 requires performing approximately 3.9 · 10^9 multiply-and-accumulate operations to classify one single input image. Depending on the particular CNN and the hardware platform used to implement it, the arithmetic operations conducted in the inference are responsible for 10% to 40% of the total energy [23].

To reduce power consumption, hardware-oriented optimization techniques developed for CNNs focus on optimizing the data representation, pruning less important connections and neurons, approximating arithmetic operations, compression of weights, and employing various smart data transfer and memory storage strategies [8,16]. For example, the Ristretto tool is specialized in determining the number of bits needed for arithmetic operations [5] because the standard 32-bit floating-point arithmetic is too expensive and unnecessarily accurate for CNNs. According to [23], an 8-bit fixed-point multiply consumes 15.5× less energy (12.4× less area) than a 32-bit fixed-point multiply, and 18.5× less energy (27.5× less area) than a 32-bit floating-point multiply. Further savings in energy are obtained not only by the bit width reduction of arithmetic operations but also by introducing approximate operations, particularly to the multiplication circuits [15,18]. Many approximate multipliers are available in public circuit libraries, for example, EvoApproxLib [14]. All these techniques are usually applied to CNN architectures initially developed with no or minimal focus on a potential hardware implementation.

NAS has been introduced to automate the neural network design process. The best-performing CNNs obtained by NAS currently show superior performance with respect to human-designed CNNs [4,26]. NAS methods can be classified according to the search mechanism, which can be based on reinforcement learning [29], evolutionary algorithms (EA) [21], gradient optimization [11], random search [1], or sequential model-based optimization [10]. NAS methods were initially constructed as single-objective methods to minimize the classification error for a CNN running on a GPU [17,22]. Recent works have been devoted to multi-objective NAS approaches in which the error is optimized together with the cost, whose minimization is crucial for the sustainable operation of GPU clusters [7,12].

As our NAS method employs genetic programming, which is a branch of evolutionary algorithms, we briefly discuss the main components of the EA-based approaches. Regarding the problem representation, direct [12,22] and indirect (generative) [20] encoding schemes have been investigated. The selection of genetic operators is tightly coupled with the chosen problem representation. While mutation is the key operator for CGP [22], the crossover is crucial for binary encoding of CNNs as it allows population members to share common building blocks [12]. The non-dominated sorting, known from, e.g., the NSGA-II algorithm [3], enables maintaining diverse trade-offs between conflicting design objectives.
The evolutionary search is often combined with learning because it is very inefficient to let the evolution find the weights. A candidate CNN, constructed using the information available in its genotype, is trained using common learning algorithms available in popular DNN frameworks such as TensorFlow. The number of epochs and the training data size are usually limited to reduce the training time, despite the fact that by doing so the fitness score can be estimated incorrectly. The CNN accuracy, which is obtained using test data, is interpreted as the fitness score. The best-evolved CNNs are usually re-trained (fine-tuned) to further increase their accuracy.

The entire neuro-evolution is very time- and resource-demanding and, hence, only several hundreds of candidate CNNs can be generated and evaluated in one EA run. On common platforms, such as TensorFlow, all mathematical operations are highly optimized and work with standard floating-point numbers on GPUs. If one needs to replace these operations with approximate operations, these non-standard operations have to be expensively emulated. The CNN execution is then significantly slower than with the floating-point operations. This problem can partly be eliminated by using TFApprox, in which all approximate operations are implemented as look-up tables and accessed through the texture memory mechanism of CUDA-capable GPUs [25].

A very recent work [8] presents a method capable of jointly searching the neural architecture, hardware architecture, and compression model for FPGA-based CNN implementations. In contrast to our work, arithmetic operations are performed on 16 bits, and no approximate operations are employed. High-quality results are presented for the CIFAR-10 and ImageNet benchmark data sets.
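To give an intuition for the look-up-table idea behind TFApprox, the sketch below tabulates an 8-bit unsigned multiplier and replaces every product by a table look-up. It is only a CPU-side NumPy illustration: the `truncating_mult` circuit is a made-up placeholder, not an actual EvoApproxLib multiplier, and the real TFApprox stores such tables in GPU texture memory inside TensorFlow's convolution kernels.

```python
import numpy as np

def build_lut(approx_product):
    """Tabulate an 8x8-bit multiplier: lut[a, b] holds the (approximate) product."""
    a = np.arange(256, dtype=np.uint32)
    return approx_product(a[:, None], a[None, :])  # shape (256, 256)

# Hypothetical stand-in for one library circuit: zero the four least significant
# bits of the exact product (a crude truncation-style approximation).
def truncating_mult(a, b):
    return (a * b) & 0xFFF0

LUT = build_lut(truncating_mult)

def approx_multiply(activations, weights):
    """Replace exact 8-bit products by table look-ups, the mechanism TFApprox
    uses on the GPU (there, the table lives in texture memory)."""
    return LUT[activations, weights]

acts = np.random.randint(0, 256, size=(3, 3), dtype=np.uint8)
wts = np.random.randint(0, 256, size=(3, 3), dtype=np.uint8)
print(approx_multiply(acts, wts))
```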
3 Proposed Method

The proposed evolutionary NAS is inspired by [22], whose authors used CGP to evolve CNNs. We extend this work by (i) supporting a multi-objective search, (ii) using an efficient seeding strategy, and (iii) enabling approximate multipliers in convolutional layers. The method is evaluated on the design of CNNs for a common benchmark problem, the CIFAR-10 image classification data set [9]. The role of CGP is to provide a good CNN architecture. The weights are obtained using the Adam optimization algorithm implemented in TensorFlow. Our ultimate goal is to deliver new CNN architectures that are optimized for hardware accelerators of CNNs in terms of the parameter count and the usage of low-energy arithmetic operations.

In this section, we describe the proposed CGP-based NAS, which is developed for CNNs with floating-point arithmetic operations executed on a GPU. In Section 3.5, the proposed evolutionary selection of approximate multipliers for CNNs is presented.

3.1 CNN Representation
CGP was developed to automatically design programs and circuits that are modeled using directed acyclic graphs [13]. A candidate solution is represented using a two-dimensional array of n_c × n_r nodes, consuming n_i inputs and producing n_o outputs. In the case of the evolutionary design of CNNs, each node represents either one layer (e.g., fully connected, convolutional, max pooling, average pooling) or a module (e.g., residual or inception block) of a CNN. Each node of the j-th column reads a tensor coming from one of the columns 1, 2, ..., j − 1 or from the primary input. A candidate solution is defined by a template and a graph I = (V, E), where V denotes a set of vertices (the nodes of the template) and E is a set of edges. The individual representation (in the form of the graph I), in conjunction with the template, creates a candidate solution. The nodes that are employed in the CNN are called the active nodes and form a directed acyclic graph (DAG) that connects the input node with the output node. When a particular CNN has to be built and evaluated, this DAG is extracted (from the candidate solution) and transformed into a computational graph which is processed by TensorFlow. Figure 1 gives an example: a template with layer nodes L(1)-L(9), an input node (0), and an output node (10); an individual representation I = (V, E) with V = {0, 1, ..., 10} and E = {(0,1), (0,3), (1,5), (3,4), (5,7), (4,9), (9,10)}; and the resulting candidate CNN, whose active nodes are L(3), L(4), and L(9).
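As an illustration of how the active part of a genotype is obtained, the following sketch (our own simplification, not the authors' implementation) walks backwards from the output node of the Fig. 1 example and collects the nodes that form the DAG handed to TensorFlow.

```python
# Edge set E of the Fig. 1 example; node 0 is the input, node 10 the output,
# nodes 1-9 are the layer/module nodes of the template.
edges = {(0, 1), (0, 3), (1, 5), (3, 4), (5, 7), (4, 9), (9, 10)}

def active_nodes(edges, output_node):
    """Backward reachability from the output yields the active nodes, i.e. the
    DAG that is actually built and evaluated; the remaining nodes stay inactive."""
    predecessors = {}
    for src, dst in edges:
        predecessors.setdefault(dst, []).append(src)
    active, stack = set(), [output_node]
    while stack:
        node = stack.pop()
        if node not in active:
            active.add(node)
            stack.extend(predecessors.get(node, []))
    return active

print(sorted(active_nodes(edges, output_node=10)))  # -> [0, 3, 4, 9, 10]
```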
Fig. 1: A template combined with an individual representation creates a candidate solution, which is transformed into a CNN.

The following layers are supported: fully connected (FC), convolutional (CONV), summation (SUM), maximum pooling (MAX), and average pooling (AVG). Inspired by [27], CGP can also use inception (INC), residual (RES), and (residual) bottleneck (RES-B) modules [6] that are composed of several elementary layers as shown in Fig. 2. Selected layers and modules are introduced in the following paragraphs; the remaining ones are standard.

Fig. 2: Diagrams of (a) the bottleneck residual module, (b) the residual module, and (c) the inception module.

The summation layer accepts tensors t_1 and t_2 with shape(t_1) = (h_1, w_1, c_1) and shape(t_2) = (h_2, w_2, c_2), where h_x, w_x, and c_x are the height, width, and number of channels, respectively (x ∈ {1, 2}). It outputs t_o, i.e., the sum of t_1 and t_2, defined as t_o = t_1 + t_2, where t_o[i,j,k] = t_1[i,j,k] + t_2[i,j,k] for i = 0, ..., c_o − 1, j = 0, ..., h_o − 1, k = 0, ..., w_o − 1. It has to be ensured that the height and width of both tensors are identical. If it is not so, a pooling algorithm is applied to the 'bigger' tensor to unify these dimensions. The problem of an unmatched number of channels is resolved by zero padding applied to the 'smaller' tensor. Hence, shape(t_o) = (h_o, w_o, c_o), where h_o = min(h_1, h_2), w_o = min(w_1, w_2), and c_o = max(c_1, c_2).

The inception module, shown in Fig. 2c, performs in parallel three convolutions with 5x5, 3x3, and 1x1 filters and one maximum pooling. The results are then concatenated along the channel dimension. Additionally, 1x1 convolutions are used to reduce the number of input channels. Parameters C_1, C_2, and C_3 correspond to the number of filters in the 5x5, 3x3, and 1x1 convolutions, whereas R_1, R_2, and R_3 denote the number of filters in the 1x1 convolutions. All convolutional layers operate with stride 1 and are followed by the ReLU activation.

The residual module contains a sequence of NxN and MxM convolutions that can be skipped, which is implemented by the summation layer followed by the ReLU activation. The residual module, shown in Fig. 2b, consists of two convolutional layers with NxN and MxM filters, both followed by batch normalization and ReLU activation. In parallel, one convolution with a 1x1 filter is computed. The results of the MxM and 1x1 convolutions are added together to form the result. The convolutional layers with NxN and 1x1 filters operate with stride n.

We also support a bottleneck variant of the residual module, shown in Fig. 2a, which comprises one convolutional layer with an NxN filter, which applies batch normalization and ReLU activation to its input and output. This convolutional layer is surrounded by two 1x1 convolutional layers. In parallel, another 1x1 convolutional layer is employed. The first two parallel 1x1 convolutional layers operate with stride n, whereas all other layers use stride 1. The outputs of the last two parallel 1x1 convolutional layers are then batch-normalized and added together. The final output is obtained by applying the ReLU activation to the output of the addition layer.
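The shape-unification rule of the summation layer described above can be sketched as follows. This is a NumPy toy on single (h, w, c) tensors, assuming the larger spatial dimensions are integer multiples of the smaller ones; the real layers operate on batched TensorFlow tensors.

```python
import numpy as np

def sum_layer(t1, t2):
    """Add two (h, w, c) tensors: average-pool the spatially larger tensor down to
    (min(h1, h2), min(w1, w2)) and zero-pad channels up to max(c1, c2)."""
    h, w = min(t1.shape[0], t2.shape[0]), min(t1.shape[1], t2.shape[1])
    c = max(t1.shape[2], t2.shape[2])

    def pool(t):
        fh, fw = t.shape[0] // h, t.shape[1] // w      # pooling factors
        return t[:h * fh, :w * fw].reshape(h, fh, w, fw, -1).mean(axis=(1, 3))

    def pad(t):
        return np.pad(t, ((0, 0), (0, 0), (0, c - t.shape[2])))

    return pad(pool(t1)) + pad(pool(t2))

print(sum_layer(np.ones((8, 8, 16)), np.ones((4, 4, 32))).shape)  # -> (4, 4, 32)
```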
CGP usually employs only one genetic operator, mutation. The proposed mutation operator modifies the architecture of a candidate CNN; however, the functionality (layers and their parameters) implemented by the nodes is not directly changed, except in some specific cases, see below. A randomly selected node is mutated in such a way that all its incoming edges are removed and a new connection is established to a randomly selected node situated in up to L previous columns, where L is a user-defined parameter. This is repeated k times, where k is the node's arity. If the mutation does not hit an active node, it is repeated to avoid generating functionally identical networks. One mutation can thus modify several inactive nodes before finally modifying an active node. The weights associated with a newly added active node are randomly initialized. If the primary output undergoes a mutation, its destination is a randomly selected node of the last column containing FC layers.

The objectives are to maximize the CNN accuracy and to minimize the CNN complexity (which is expressed as the number of parameters) and the power consumption of the multiplications in the convolutional layers. The objective function expressing the accuracy of a candidate network x (evaluated using a data set D) is calculated using TensorFlow as f1(x, D) = accuracy(x, D). The number of parameters in the entire CNN x is captured by the fitness function f2(x). Power consumption is estimated as f3(x) = N_mult(x) · P_mult, where N_mult is the number of multiplications executed during inference in all convolutional layers and P_mult is the power consumption of the used multiplier.

The search algorithm (see Alg. 1) is constructed as a multi-objective evolutionary algorithm inspired by CGP-based NAS [22] and NSGA-II [3]. The initial population is heuristically initialized with networks created according to the template shown in Fig. 3. The template consists of typical layers of CNNs, i.e., convolutional layers in the first and middle parts and fully connected layers at the end. All connections in the template (including the link to the output tensor) and all associated weights are randomly generated. The proposed template ensures that even the networks of the initial population are reasonable CNNs, which reduces the computational requirements of the search process.
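The three objectives defined above can be sketched for a Keras model as follows. `count_conv_mults` is our own rough per-inference multiplication count and `p_mult_pj` is the per-multiplication energy of the chosen multiplier, so this is only an illustration of f1, f2, and f3, not the authors' evaluation code; f1 is maximized while f2 and f3 are minimized.

```python
import tensorflow as tf

def count_conv_mults(model):
    """Rough number of multiplications per inference in the convolutional layers:
    kernel_h * kernel_w * in_channels * out_h * out_w * out_channels per layer."""
    n = 0
    for layer in model.layers:
        if isinstance(layer, tf.keras.layers.Conv2D):
            kh, kw = layer.kernel_size
            _, oh, ow, oc = layer.output.shape
            ic = layer.input.shape[-1]
            n += kh * kw * ic * oh * ow * oc
    return n

def fitness(model, test_data, p_mult_pj):
    """f1: accuracy on the test set, f2: number of parameters,
    f3: estimated energy of all multiplications, N_mult * P_mult (in pJ).
    Assumes the model was compiled with an accuracy metric."""
    _, accuracy = model.evaluate(test_data, verbose=0)
    f1 = accuracy
    f2 = model.count_params()
    f3 = count_conv_mults(model) * p_mult_pj
    return f1, f2, f3
```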
Fig. 3: The template used to initialize CGP.

Training of a CNN is always followed by testing to obtain the fitness values f1(x), f2(x), and f3(x). To reduce the training time, a randomly selected subset D_train of the training data set can be used. The same subset is used for training all the individuals belonging to the same population. Training is conducted for E_train epochs. The accuracy of the candidate CNN (i.e., f1) is determined using the entire test data set D_test (Alg. 1, line 2). To overcome overfitting during the training, data augmentation and L2 regularization were employed [19].

The offspring population (O) is created by applying the mutation operator to each individual of the parental population P. The offspring population is evaluated in the same way as the parental population (Alg. 1, line 5). Populations P and O are joined to form an auxiliary population R (line 6). The new population is constructed by selecting non-dominated individuals from the Pareto fronts (PF) established in R (lines 9-10). If any front must be split, a crowding distance is used for the selection of individuals to P (lines 12-13) [3]. The search terminates after evaluating a given number of CNNs.

As the proposed algorithm is multi-objective, the result of a single CGP run is a set of non-dominated solutions. At the end of the evolution, the best-performing individuals from this set are re-trained (fine-tuned) for E_retrain epochs on the complete training data set D_retrain, and the final accuracy is reported on the complete test data set D_test.

Algorithm 1: Neuroevolution
 1: P ← initial_population(); g ← 0
 2: training_evaluation(P, E_train, D_train, D_test)
 3: repeat
 4:   P' ← replicate(P); O ← mutate(P')
 5:   training_evaluation(O, E_train, D_train, D_test)
 6:   R ← P ∪ O; P ← ∅
 7:   while |P| ≠ population_size do
 8:     PF ← non_dominated(R)
 9:     if |P ∪ PF| ≤ population_size then
10:       P ← P ∪ PF
11:     else
12:       n ← |PF ∪ P| − population_size
13:       P ← P ∪ crowding_reduce(PF, n)
14:     R ← R \ PF
15:   g ← g + 1
16: until stop_criteria_satisfied()
17: training_evaluation(P, E_retrain, D_retrain, D_test)
18: return P
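A compact Python rendering of this loop is given below. The helpers `random_individual`, `mutate`, and `evaluate` are placeholders for the CGP-specific operations (network construction from the template, the mutation operator described above, and TensorFlow training/testing), and the crowding computation is simplified, so treat it as a sketch of Algorithm 1 rather than the authors' code. Fitness tuples are assumed to be oriented so that larger is better in every objective (e.g., accuracy, negated parameter count, negated energy).

```python
import random

def dominates(a, b):
    """a dominates b if a is no worse in every objective and better in at least one
    (all objectives are expressed so that larger is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated(pop):
    """Members of pop that no other member dominates (one Pareto front)."""
    return [p for p in pop
            if not any(dominates(q["fit"], p["fit"]) for q in pop if q is not p)]

def crowding_reduce(front, n_remove):
    """Drop n_remove members of a front, keeping the most spread-out ones
    (a simplified crowding-distance criterion in the spirit of NSGA-II)."""
    distance = {id(p): 0.0 for p in front}
    for m in range(len(front[0]["fit"])):
        ordered = sorted(front, key=lambda p: p["fit"][m])
        span = (ordered[-1]["fit"][m] - ordered[0]["fit"][m]) or 1.0
        distance[id(ordered[0])] = distance[id(ordered[-1])] = float("inf")
        for i in range(1, len(ordered) - 1):
            distance[id(ordered[i])] += (ordered[i + 1]["fit"][m]
                                         - ordered[i - 1]["fit"][m]) / span
    kept = sorted(front, key=lambda p: distance[id(p)], reverse=True)
    return kept[:len(front) - n_remove]

def evolve(pop_size, generations, random_individual, mutate, evaluate):
    """Skeleton of Algorithm 1: mutation-only offspring plus NSGA-II-style
    environmental selection over the joined population R = P ∪ O."""
    P = [evaluate(random_individual()) for _ in range(pop_size)]
    for _ in range(generations):
        O = [evaluate(mutate(p)) for p in P]
        R, P = P + O, []
        while len(P) < pop_size:
            PF = non_dominated(R)
            if len(P) + len(PF) <= pop_size:
                P += PF
            else:
                P += crowding_reduce(PF, len(P) + len(PF) - pop_size)
            pf_ids = {id(p) for p in PF}
            R = [r for r in R if id(r) not in pf_ids]
    return P  # the final non-dominated set is then re-trained for E_retrain epochs

# Toy run with a random one-gene genotype and two artificial objectives.
result = evolve(
    pop_size=8, generations=10,
    random_individual=lambda: {"gene": random.random()},
    mutate=lambda p: {"gene": p["gene"] + random.gauss(0.0, 0.1)},
    evaluate=lambda p: dict(p, fit=(p["gene"], -abs(p["gene"] - 0.5))),
)
```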
3.5 Selection of Approximate Multipliers

So far, we have discussed a NAS utilizing standard floating-point arithmetic operations. In order to find the most suitable approximate multiplier for a CNN architecture, we introduce the following changes to the algorithm. (i) The problem representation is extended with one integer specifying the index I_m into the list of available 8-bit approximate multipliers, i.e., to one of the 14 approximate multipliers included in EvoApproxLib-Lite [14]. These approximate multipliers show different trade-offs between power consumption, error metrics, and other parameters. Please note that the selection of the exact 8-bit multiplier is not excluded. (ii) The mutation operator is modified to randomly change this index with probability p_mult. (iii) Before a candidate CNN is sent to TensorFlow for training or testing, all standard multipliers used in convolutional layers are replaced with the 8-bit approximate multiplier specified by index I_m. TensorFlow, with the help of TFApprox, then performs all multiplications in the convolution computations in the forward pass of the learning algorithm with the approximate multipliers, whereas all computations in the backward pass are done with the standard floating-point multiplications.

4 Results

Table 1 summarizes all parameters of CGP and the learning method used in our experiments. These parameters were experimentally selected based on a few trial runs. Because of limited computational resources, we could generate and evaluate only pop_size + G · pop_size = 88 candidate CNNs in each run.

Table 1: Parameters of the experiment.
Parameter                     Value  Description
n_r                           –      Number of rows in the CGP grid.
n_c                           23     Number of columns in the CGP grid.
L                             –      Maximum number of preceding columns a node can connect to.
pop_size                      –      Population size.
G                             10     Maximum number of generations.
D_train, D_retrain, D_test    –      Training, re-training, and test data (sub)sets.
E_train                       20     Number of epochs (during the evolution).
E_retrain                     200    Number of epochs (for re-training).
batch_size                    32     Batch size.
learning rate                 –      Learning rate.
p_arch, p_mult                –      Mutation probabilities (architecture; multiplier index I_m).

We consider four scenarios to analyze the role of approximate multipliers in NAS: (S1) the CNN is co-optimized with the approximate multiplier under fitness functions f1 and f3 (denoted 'CGP+auto-selected-mult-A/E' in the following figures); (S2) the CNN is co-optimized with the approximate multiplier under fitness functions f1, f2, and f3 (denoted 'CGP+auto-selected-mult-A/E/P'); (S3) a selected approximate multiplier is always used in NAS (denoted 'CGP+fixed-approx-mult-A/E'); (S4) the 8-bit exact multiplier is always used in NAS (denoted 'CGP+accurate-8-bit-mult-A/E'). Note that the symbols A, E, and P denote Accuracy, Energy, and Parameters. Because of limited resources, we executed 5, 2, 13, and 2 CGP runs for scenarios S1, S2, S3, and S4, respectively.

Fig. 4 plots a typical progress of a CGP run in the S1 scenario. The blue points represent the initial population; all parents and offspring are depicted in Fig. 4a. The remaining subfigures show generations 3, 6, and 9. The grey points are candidate solutions created in the previous generations, and their purpose is to emphasize the CGP progress. As the best trade-offs (Accuracy vs. Energy) move to the top-left corner of the figures, we observe that CGP can improve candidate solutions despite only 10 populations being generated.

Fig. 4: A typical progress of evolution in scenario S1 (Estimated Accuracy vs. Total Energy of multiplications in convolutional layers); subfigures (a)-(d) show generations 0, 3, 6, and 9. The blue points represent the current generation (all parents and offspring). The grey points are all previously generated solutions.

Trade-offs between the accuracy (estimated in the fitness function) and the total energy of multiplications performed in convolutional layers during inference are shown in Fig. 5. The Pareto front is mostly occupied by CNNs evolved in
scenario S3, i.e., with a pre-selected approximate multiplier. CNNs utilizing the 8-bit accurate multiplier are almost always dominated by CNNs containing some of the approximate multipliers. CNNs showing 70% and higher accuracy never use highly approximate multipliers (see the color bar on the right-hand side of Fig. 5). Fig. 6 shows that re-training of the best evolved CNNs conducted for E_retrain epochs significantly improves the accuracy.

Fig. 5: Trade-offs between the accuracy and the total energy of multiplications performed in convolutional layers during one inference obtained with different design scenarios; the color bar denotes the energy of the selected multiplier (in pJ).

Fig. 6: The impact of re-training on the accuracy of best-evolved CNNs. Crosses/points denote the accuracy before/after re-training.

Final Pareto fronts obtained (after re-training) in our four scenarios are highlighted in Fig. 7. When an approximate multiplier is fixed before the NAS is executed (S3), CGP is almost always able to deliver a better trade-off than if a suitable multiplier is automatically selected during the evolution (in S1 or S2). However, CGP has to be repeated with each of the pre-selected multipliers to find the best trade-offs. We hypothesize that longer CGP runs are needed to benefit from S1 and S2.

Fig. 7: Pareto fronts obtained in four scenarios compared with ResNet networks utilizing 8-bit multipliers (crosses) and the ALWANN method.

Finally, Table 2 lists key parameters of selected CNNs (the final and estimated accuracy, the energy needed for all multiplications in convolutional layers, and the number of multiplications) and the used approximate multipliers (the identifier and the energy per one multiplication). One of the evolved CNNs is depicted in Fig. 8.

Fig. 8: Evolved CNN whose parameters are given in the first row of Table 2.

Experiments were performed on a machine with two 12-core Intel Skylake Gold 6126 CPUs running at 2.6 GHz, 192 GB of RAM, and four NVIDIA Tesla V100-SXM2 GPU accelerators. A single CGP run with CNNs utilizing approximate multipliers takes 48 GPU hours; the final re-training requires an additional 56 GPU hours on average. When approximate multipliers are emulated by TFApprox, the average time needed for all inferences in ResNet-8 on CIFAR-10 is 1.7 s (initialization) + 1.5 s (data processing) = 3.2 s. If the same task is performed by TensorFlow in the 32-bit FP arithmetic, the time is 1.8 s + 0.2 s = 2.0 s. Hence, the time overhead introduced by the approximate operations amounts to 37.5% of the total execution time.
The results are compared with human-created ResNet networks of a similar complexity to the evolved CNNs. Parameters of ResNet-8 and ResNet-14 (utilizing the 8-bit exact multiplier) are depicted with crosses in Fig. 7. While ResNet-8 is dominated by several evolved CNNs, we were unable to evolve CNNs dominating ResNet-14. We further compared the evolved CNNs with the CNNs optimized using the ALWANN method [15]. ALWANN tries to identify the best possible assignment of an approximate multiplier to each layer of ResNet (i.e., different approximate multipliers can be assigned to different layers). For a good range of target energies, the proposed method produces better trade-offs than ALWANN.

Paper [26] reports 31 CNNs generated by various NAS methods and five CNNs designed by human experts. On CIFAR-10, the error is between 2.08% and 8.69%, the number of parameters is between 1.7 and 39.8 million, and the design time is between 0.3 and 22,400 GPU days. Our results are far from all these numbers as we address much smaller networks (operating with 8-bit multipliers) that must, in principle, show higher errors. However, paper [24] reports a number of human-created CNN hardware accelerators with a classification accuracy of 80.…–….53% on CIFAR-10, and with a total energy consumption of 34.…–… µJ (the energy of multiplications is not reported separately). These numbers are quite comparable with our results even under the conservative assumption that multiplication requires 20% of the total energy of the accelerator.

Table 2: Key parameters of selected CNNs and the used multipliers. Symbol ∗ denotes the 8-bit accurate multiplier.
Method     Accuracy           Energy   Mults      Approx.       Energy of
           Final   Estimated  [µJ]     [×10^6]    mult. ID      1 mult. [pJ]
Proposed   83.98   78.01      14.88    30.9       mul8u_JD      0.48
           83.50   76.88      13.82    30.9       mul8u_C1      0.45
           83.18   78.14      10.76    28.5       mul8u_GR      0.38
           83.01   77.02       6.79    22.9       mul8u_M1      0.30
           82.53   75.85       9.22    31.7       mul8u_85Q     0.29
           82.15   77.66      11.48    20.5       mul8u_JFF     –
           –       –           –       –          ∗             –
5 Conclusions

We developed a multi-objective evolutionary design method capable of automated co-design of the CNN topology and approximate multiplier(s). This is a challenging problem not addressed in the literature. On the standard CIFAR-10 classification benchmark, the CNNs co-optimized with approximate multipliers show excellent trade-offs between the classification accuracy and the energy needed for multiplication in convolutional layers when compared with common ResNet CNNs utilizing 8-bit multipliers and with CNNs optimized by the ALWANN method. Despite very limited computational resources, we demonstrated that it makes sense to co-optimize the CNN architecture together with approximate arithmetic operations in a fully automated way.

Our future work will be devoted to extending the proposed method, employing more computational resources, and showing its effectiveness on more complex problem instances. In particular, we will extend the CGP array size, whose current setting was chosen to be comparable with the ALWANN method. It also seems that we should primarily focus on optimizing the convolutional layers and leave the structure of the fully connected layers frozen during the evolution.
Acknowledgements
This work was supported by the Czech Science Foundation project 21-13001S. The computational experiments were supported by the Ministry of Education, Youth and Sports from the Large Infrastructures for Research, Experimental Development and Innovations project "e-Infrastructure CZ – LM2018140".
References
1. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. Journal of Machine Learning Research (10), 281–305 (Feb 2012)
2. Capra, M., Bussolino, B., Marchisio, A., Shafique, M., Masera, G., Martina, M.: An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks. Future Internet (7), 113 (2020)
3. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation (2), 182–197 (2002)
4. Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: A survey. Journal of Machine Learning Research (55), 1–21 (2019)
5. Gysel, P., Pimentel, J., Motamedi, M., Ghiasi, S.: Ristretto: A framework for empirical study of resource-efficient inference in convolutional neural networks. IEEE Transactions on Neural Networks and Learning Systems (11), 5784–5789 (2018)
6. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Computer Vision – ECCV 2016. pp. 630–645. Springer (2016)
7. Hsu, C., Chang, S., Juan, D., Pan, J., Chen, Y., Wei, W., Chang, S.: MONAS: Multi-objective neural architecture search using reinforcement learning. CoRR abs/1806.10332 (2018), http://arxiv.org/abs/1806.10332
8. Jiang, W., Yang, L., Dasgupta, S., Hu, J., Shi, Y.: Standing on the shoulders of giants: Hardware and neural architecture co-search with hot start (2020), https://arxiv.org/abs/2007.09087
9. Krizhevsky, A., Nair, V., Hinton, G.: CIFAR-10 (Canadian Institute for Advanced Research)
10. Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L.J., Fei-Fei, L., Yuille, A., Huang, J., Murphy, K.: Progressive neural architecture search. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. pp. 19–35. Springer International Publishing, Cham (2018)
11. Liu, H., Simonyan, K., Yang, Y.: DARTS: Differentiable architecture search. CoRR abs/1806.09055 (2018), http://arxiv.org/abs/1806.09055
12. Lu, Z., Whalen, I., Boddeti, V., Dhebar, Y.D., Deb, K., Goodman, E.D., Banzhaf, W.: NSGA-Net: Neural architecture search using multi-objective genetic algorithm. In: Proceedings of the Genetic and Evolutionary Computation Conference. pp. 419–427. ACM (2019)
13. Miller, J.F.: Cartesian Genetic Programming. Springer-Verlag (2011)
14. Mrazek, V., Hrbacek, R., et al.: EvoApprox8b: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods. In: Proc. of DATE'17. pp. 258–261 (2017)
15. Mrazek, V., Vasicek, Z., Sekanina, L., Hanif, A.M., Shafique, M.: ALWANN: Automatic layer-wise approximation of deep neural network accelerators without retraining. In: Proc. of the IEEE/ACM International Conference on Computer-Aided Design. pp. 1–8. IEEE (2019)
16. Panda, P., Sengupta, A., Sarwar, S.S., Srinivasan, G., Venkataramani, S., Raghunathan, A., Roy, K.: Invited – Cross-layer approximations for neuromorphic computing: From devices to circuits and systems. In: 53rd Design Automation Conference. pp. 1–6. IEEE (2016). https://doi.org/10.1145/2897937.2905009
17. Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y.L., Tan, J., Le, Q., Kurakin, A.: Large-scale evolution of image classifiers. arXiv e-prints arXiv:1703.01041 (Mar 2017)
18. Sarwar, S.S., Venkataramani, S., Ankit, A., Raghunathan, A., Roy, K.: Energy-efficient neural computing with approximate multipliers. J. Emerg. Technol. Comput. Syst. (2), 16:1–16:23 (2018)
19. Shorten, C., Khoshgoftaar, T.: A survey on image data augmentation for deep learning. Journal of Big Data, 1–48 (2019)
20. Stanley, K.O., Clune, J., Lehman, J., Miikkulainen, R.: Designing neural networks through neuroevolution. Nature Machine Intelligence, 24–35 (2019)
21. Stanley, K.O., Miikkulainen, R.: Evolving neural networks through augmenting topologies. Evolutionary Computation (2), 99–127 (Jun 2002)
22. Suganuma, M., Shirakawa, S., Nagao, T.: A genetic programming approach to designing convolutional neural network architectures. In: Proc. of the Genetic and Evolutionary Computation Conference. pp. 497–504. GECCO '17, ACM (2017)
23. Sze, V., Chen, Y., Yang, T., Emer, J.S.: Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE (12), 2295–2329 (2017)
24. Tann, H., Hashemi, S., Reda, S.: Lightweight deep neural network accelerators using approximate SW/HW techniques. pp. 289–305. Springer Verlag (2019)
25. Vaverka, F., Mrazek, V., Vasicek, Z., Sekanina, L.: TFApprox: Towards a fast emulation of DNN approximate hardware accelerators on GPU. In: Design, Automation and Test in Europe. pp. 1–4 (2020)
26. Wistuba, M., Rawat, A., Pedapati, T.: A survey on neural architecture search. CoRR abs/1905.01392 (2019), http://arxiv.org/abs/1905.01392
27. Xie, L., Yuille, A.: Genetic CNN. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 1388–1397. IEEE (2017)
28. Yao, X.: Evolving artificial neural networks. Proceedings of the IEEE (9), 1423–1447 (1999)
29. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. CoRR abs/1611.01578 (2016)