Energy Predictive Models for Convolutional Neural Networks on Mobile Platforms
Crefeda Faviola Rodrigues
The University of Manchester
[email protected]
Graham Riley
The University of Manchester
[email protected]
Mikel Luján
The University of Manchester
[email protected]
ABSTRACT
Energy use is a key concern when deploying deep learning models on mobile and embedded platforms. Current studies develop energy predictive models based on application-level features to provide researchers a way to estimate the energy consumption of their deep learning models. This information is useful for building resource-aware models that can make efficient use of the hardware resources. However, previous works on predictive modelling provide little insight into the trade-offs involved in the choice of features on the final predictive model accuracy and model complexity. To address this issue, we provide a comprehensive analysis of building regression-based predictive models for deep learning on mobile devices, based on empirical measurements gathered from the SyNERGY framework. Our predictive modelling strategy is based on two types of predictive models used in the literature: individual layers and layer-type. Our analysis of predictive models shows that simple layer-type features achieve a model complexity of 4 to 32 times less for convolutional layer predictions for a similar accuracy compared to predictive models using more complex features adopted by previous approaches. To obtain an overall energy estimate of the inference phase, we build layer-type predictive models for the fully-connected and pooling layers using 12 representative Convolutional Neural Networks (ConvNets) on the Jetson TX1 and the Snapdragon 820 using software backends such as OpenBLAS, Eigen and CuDNN. We obtain an accuracy between 76% to 85% and a model complexity of 1 for the overall energy prediction of the test ConvNets across different hardware-software combinations.
KEYWORDS
Convolutional Neural Networks, Energy Prediction, mobile platforms
There is growing importance to bringing deep neural network processing to mobile or embedded devices (also known as edge-devices) [22]. Deep neural networks such as Convolutional Neural Networks (hereafter referred to as ConvNets) have achieved greater accuracy compared to humans for a large variety of predictive tasks, for example, image classification in computer vision and text classification in natural language processing [7]. ConvNets are designed during a training phase where machine learning researchers search for the best model using accuracy as a metric suitably defined for the task. Once the ConvNet model is trained, the pre-trained model is available for use during the inference or testing phase. There are numerous benefits to performing inferences locally on edge devices, such as reducing energy costs of datacenters, lower (user) latency and reduced need for constant internet connectivity [15]. However, such devices have a unique set of constraints in terms of resources, for example, battery life, that are atypical of the environments in which the models are trained. This has paved the way for the exploration of energy-efficient ConvNet designs through manual and automated searches for low-cost neural network model designs - for example, MobileNet [8] and MnasNet [24] - exploration of compression and quantization and other software-based acceleration techniques, and the use of application-specific hardware accelerators [22].
Despite these efforts, there are very few studies that model the energy use of deep learning models in the context of these optimizations.
Such modelling approaches are useful in the areas of resource-aware ConvNet designs such as automatic
Neural Architecture Search [24], energy-aware pruning techniques [27] and in neural network accelerator simulators [19] that focus on designing energy-efficient deep learning hardware. Previous studies [18] on predictive modelling have indicated that relatively simple features, such as the sum of the multiply-accumulate (MAC) counts (we refer to this as layer-type), can be used to estimate hardware performance counters such as SIMD instruction counts and bus accesses that are useful for determining an application's performance. The performance counter information is then used to estimate the energy consumption for the convolutional layers in a ConvNet for real systems. Other works, such as [5], have relied on a large set of complex features, extracted from each layer's specification (we refer to this as individual layers), to yield highly complex predictive models. However, none of these works have investigated the trade-offs of choosing features on predictive model accuracy and complexity. The aim of this work is to perform a thorough analysis of algorithmic features of predictive models based on layer-type and individual layers that can offer the best trade-off in model complexity (defined in Section 5) and predictive accuracy, and compare our results to previous works. We first illustrate the techniques for layer-type versus individual layer features to build predictive models for the convolutional layers on a mobile CPU and, based on the results, we apply the method to other layers in a ConvNet to get an overall estimate of the ConvNet's inference phase running on different software and mobile hardware platforms. Our contribution is as follows: • An extension to the SyNERGY framework [18] to tie the energy measurements obtained on the mobile device to the application-level at a per-layer granularity and support the building of layer-wise energy predictive models (
Step 1 in Figure 1).

Figure 1: SyNERGY energy measurement and prediction framework.

• In Step 2, we perform an exhaustive search based on standard feature subset selection techniques to evaluate the individual layer features in terms of two metrics: predictive model accuracy and complexity, for building predictive models for the convolutional layers in the context of a given hardware-software combination. We then compare the best predictive model using individual layer features with predictive models based on layer-type features to determine the best features for building predictive models in terms of predictive model accuracy and complexity. • In Step 3, energy predictive models for different layers are built using the best features selected in Step 2. Our predictive models, based on simple algorithmic features such as the summation of multiply-accumulate (MAC) operation counts of all layers, outperform the predictive models using more complex features from the individual layers themselves. We achieve 4 to 32 times lower model complexity with similar accuracy compared to complex predictive models proposed in previous works. Our results are based on 12 representative ConvNet models chosen from existing deep learning frameworks (Caffe2 [13]), which is a larger number than used in previous studies [5, 18], including newer low-cost ConvNet models (for example, SqueezeNet and MobileNet) used in the mobile and embedded space. • We combine predictions from predictive models for different layers to get an overall estimate of the energy consumption of the deep learning model during the inference phase for multiple combinations of hardware and numerical software libraries: Eigen on a Snapdragon 820 (Eigen-Snapdragon820), Eigen on a Jetson TX1 (Eigen-TX1) and OpenBLAS on a Jetson TX1 (OpenBLAS-TX1).
Our choice includes two platforms with the same library, that is Eigen-Snapdragon820 and Eigen-TX1, and two libraries on a single platform, that is Eigen-TX1 and OpenBLAS-TX1. We also evaluate our methodology on a mobile GPU using CuDNN on the Jetson
http://eigen.tuxfamily.org
https://developer.qualcomm.com/software/snapdragon-neural-processing-engine
Figure 2: A simple two-layer ConvNet model.
TX1 (CuDNN-TX1). For all hardware and software combinations, we achieve a significant predictive test accuracy in the range 76% to 84% compared with empirical measurements on the platforms. The organisation of the paper is as follows. Section 2 provides details of the ConvNet model specifications for their different layers. Section 3 goes into more detail of the specific methodology for energy measurements across the two hardware platforms: Jetson TX1 and Snapdragon 820. Section 4 presents the empirical energy measurements obtained for the overall inference and for the different layer-types found in ConvNets. Section 5 covers an analysis of the features to be used for the predictive models and evaluates the models based on predictive accuracy and model complexity. Section 6 details the final results for the energy-predictive models for the convolutional, pooling and fully-connected layers. Section 7 compares related work in performance and energy measurement and modelling. Finally, Section 9 concludes and highlights possible future directions.
This section covers the necessary background to understand the different layers of a ConvNet at the algorithmic level. For each layer, we describe the candidate input features (highlighted in bold) that will be evaluated in the feature selection phase in Section 5. A ConvNet is an end-to-end pipeline of feature extraction and classification. The feature extractors are arranged into layers that extract high-level representations from image data [6]. As the number of layers increases, the level of abstraction increases, for example, from edges or colour blobs to object shapes. The final layer uses this information to provide a classification output (or a decision). Typical layers found are convolution (Conv), pooling (Pool), normalization (there are two types: batch normalization or Local Response Normalization (LRN)), Rectified Linear Unit (ReLU) and fully-connected (Fc), which transform the input data into a probabilistic output. We provide a description of the main layers targeted in our study: Conv, Pool and Fc. The bulk of a ConvNet model is made up of Conv layers. The computational complexity of a standard Conv layer can be represented by the number of
Multiply-accumulates (MAC) performed, which is given by Equation 1:

O_x × O_y × O_z × K_x × K_y × I_z (1)

where O_x, O_y and O_z represent the output feature map dimensions, K_x, K_y are the kernel filter dimensions and I_x, I_y and I_z are the input feature map dimensions in the x, y and z dimensions, as shown in Figure 2. The z dimension represents the number of channels in the feature maps. These dimensions are governed by the stride (which governs the step size by which the kernel filter slides across the input in x and y), padding (the number of zeros that need to be padded around the input border to allow whole filters to be applied), and kernel shape. The storage complexity or data volume includes the cost for storing the input feature map or input volume to each layer, the corresponding kernel or filter weights (K_x × K_y × I_z × O_z) and biases, and the output feature map or output volume. The volume of data (in number of elements) is given by Equation 2:

(I_x × I_y × I_z) + (K_x × K_y × I_z × O_z) + (O_x × O_y × O_z) (2)

In Figure 2, N refers to the number of images in the input, commonly known as the batch size.

Table 1: ConvNet models in the literature

ConvNet | Naming convention in graphs | Top-5 accuracy (%) | Dataset | Layers | Parameters | Model size
SqueezeNet | squeezenet [10] | 80.3 | ImageNet | 26 Conv + 3 MaxPool | 1.2 M | 5 MB
SqueezeNet with Residual Connections | squeezenetRes [10] | 82.5 | ImageNet | 26 Conv + 3 MaxPool | 1.2 M | 6.3 MB
ALL-CNN-C | allcnn [21] | 90.92 | CIFAR 10 | 9 Conv | 1.3 M | 5.5 MB
GoogleNet | googlenet [23] | 90.85 | ImageNet | 57 Conv + 1 Fc + 13 MaxPool | 6.9 M | 54 MB
DenseNet | densenet [9] | 92.12 | ImageNet | 121 Conv + 1 MaxPool | 7.8 M | 32.3 MB
Inception-v3 | inceptionv3 [23] | 90.92 | ImageNet | 94 Conv + 1 Fc + 5 MaxPool | 23 M | 95.5 MB
Residual-net 50 layers | resnet50 [7] | 93.29 | ImageNet | 53 Conv + 1 Fc + 1 MaxPool | 25 M | 103 MB
MobileNet | mobilenet [8] | 70.6 | ImageNet | 27 Conv | 29 M | 17 MB
Places-CDNS-8s | places [26] | 86.8 | ImageNet | 8 Conv + 3 Fc + 5 MaxPool | 60 M | 241.6 MB
AlexNet | alexnet [14] | 80.3 | ImageNet | 5 Conv + 3 Fc + 3 MaxPool | 62 M | 244 MB
VGG_CNN_S | vggsmall [20] | 86.9 | ImageNet | 5 Conv + 3 Fc + 3 MaxPool | 102 M | 393 MB
Inception-BN | inceptionbn [11] | 89.0 | ImageNet | 69 Conv + 1 Fc + 5 MaxPool | 1.4 B | 134.6 MB
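As a concrete sketch of Equations 1 and 2, the following Python helpers compute the MAC count and data volume of a standard Conv layer. The function names are illustrative (not from the paper's tooling); the example uses AlexNet's first Conv layer (227×227×3 input, 96 filters of 11×11, stride 4, giving a 55×55×96 output).

```python
def conv_macs(Ox, Oy, Oz, Kx, Ky, Iz):
    """Equation 1: multiply-accumulate count of a standard Conv layer."""
    return Ox * Oy * Oz * Kx * Ky * Iz

def conv_data_volume(Ix, Iy, Iz, Kx, Ky, Ox, Oy, Oz):
    """Equation 2: input volume + weight volume + output volume (element counts)."""
    return (Ix * Iy * Iz) + (Kx * Ky * Iz * Oz) + (Ox * Oy * Oz)

# Example: AlexNet's first Conv layer
macs = conv_macs(Ox=55, Oy=55, Oz=96, Kx=11, Ky=11, Iz=3)
print(macs)  # 105415200 MACs
```

Note that both quantities depend only on the layer specification, which is why they can be extracted without executing the network.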
Newer models such as MobileNet [8] leverage depth-wise separable convolutions and have the following computational complexity:

(O_x × O_y × K_x × K_y × I_z) + (I_z × O_z × O_x × O_y) (3)

A pooling and sub-sampling layer aggregates the output from the previous layer using a pooling window (K_px × K_py) in x and y. The Max pooling operator computes the maximum over this window and downsamples the output using the max value, while the
Average pooling finds the average value over the window and downsamples the output using the average value. This results in a pooling output of dimension (O_px × O_py). The computation of a max value within a single window involves a comparison operation with each of its elements, so a K_px × K_py window requires K_px × K_py − 1 comparisons. The Op count for the pooling layer is therefore given by Equation 4:

O_px × O_py × O_z × (K_px × K_py − 1) (4)

Unlike the Conv layer, the inputs to a Fc layer are connected to all the outputs of the previous layer. The Fc layers are similar to Conv layers with the exception that I_x, I_y, K_x, K_y are usually greater than 1 for the first Fc layer, and then it flattens out in later Fc layers to a 1-dimensional vector. (Referred to as Bandwidth in NeuralPower [5].)
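Equations 3 and 4 can be sketched in the same style; the layer shape used in the example is illustrative, chosen only to show how much cheaper a depth-wise separable convolution is than a standard one of the same shape.

```python
def depthwise_separable_macs(Ox, Oy, Kx, Ky, Iz, Oz):
    """Equation 3: depth-wise (per-channel Kx*Ky filter) plus point-wise (1x1) MACs."""
    return (Ox * Oy * Kx * Ky * Iz) + (Iz * Oz * Ox * Oy)

def maxpool_ops(Opx, Opy, Oz, Kpx, Kpy):
    """Equation 4: max over a k-element window costs k - 1 comparisons."""
    return Opx * Opy * Oz * (Kpx * Kpy - 1)

# Hypothetical 56x56 output, 3x3 kernel, 64 input / 128 output channels:
standard = 56 * 56 * 128 * 3 * 3 * 64          # Equation 1
separable = depthwise_separable_macs(56, 56, 3, 3, 64, 128)
print(standard // separable)  # roughly 8x fewer MACs
```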
Table 1 gives a list of ConvNet models chosen for this study. Column 5 represents the counts of each of the described layers - Conv, Fc and MaxPool - present in each model. Recently, in newer ConvNet models (for example, Inception-BN, Inception-V3, GoogleNet and Residual Nets), the traditional stack of fully-connected layers, seen in older models such as AlexNet and VGG, is replaced with a Global Average Pooling layer introduced in Lin, Chen, and Yan [16] and a single Fc layer. Column 7 shows the size of the model, which is stored in 32-bit floating point precision. We evaluate models ranging from 1.2 M to 1.4 B parameters (note, these are also referred to as weights), as given in Column 6 of Table 1. The top-5 accuracy of the model is a measure based on the top-5 predictions of the object category in a given image [4]. In our study, we evaluate the energy use of these pre-trained ConvNets and build layer-wise energy predictive models for inferences executing on a mobile device.
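The per-layer counts above aggregate naturally over a whole network: given a per-layer specification, the total Conv MAC count (the layer-type feature used later) is a single sum. The spec format below is a hypothetical sketch, not the paper's actual model representation.

```python
# Hypothetical per-layer spec list; the two layer shapes are illustrative.
layers = [
    {"type": "conv", "Ox": 55, "Oy": 55, "Oz": 96, "Kx": 11, "Ky": 11, "Iz": 3},
    {"type": "conv", "Ox": 27, "Oy": 27, "Oz": 256, "Kx": 5, "Ky": 5, "Iz": 96},
]

def total_conv_macs(layers):
    """Sum Equation 1 over all Conv layers of a network spec."""
    return sum(
        l["Ox"] * l["Oy"] * l["Oz"] * l["Kx"] * l["Ky"] * l["Iz"]
        for l in layers
        if l["type"] == "conv"
    )

print(total_conv_macs(layers))  # 553312800
```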
We extend the existing SyNERGY framework to collect power measurements and use it for developing energy predictive models. Figure 1 represents a high-level overview of the extended framework. The input to the deep learning software framework is the ConvNet model specification, the pre-trained weights from Caffe2's model repository and the input image. Inferences are executed in the deep learning software ecosystem with the necessary back-end acceleration libraries on the chosen hardware. The code annotations supply the information of a layer's beginning and end times. The host machine remotely collects data such as the annotations and power measurements from the target hardware. The target device runs the actual inference. The details of the target platforms in this study are provided in Table 2. Next, we describe the steps necessary to set up the software tools on both the host and target systems.
The host system runs ARM Streamline version 5.28.1, Linux 64-bit version, compiled using the sources from the DS-5 Development Studio [1]. To facilitate communication between the host machine and the target machine, we need the gator daemon, which communicates with the host's Streamline, and the gator driver as a loadable kernel module [1]. The Caffe and Caffe2 binaries can be built directly for the Jetson TX1, while for the Snapdragon we cross-compile Caffe2's Android binary using the android-ndk-r16 toolchain. To integrate ARM Streamline with Caffe and Caffe2, we use the Streamline annotation library. We identify the specific functions that call each layer in the software stack in Caffe and Caffe2 and place the code annotations. This includes Caffe's net.cpp and Caffe2's net_simple.cpp. We use the older Caffe with OpenBLAS support as done in the SyNERGY framework [18].

Table 2: Platform and software specification

System | Operating System | Deep learning framework | Backend acceleration library | Processor | Memory
Jetson TX1 | Ubuntu 16.04 LTS, Linux kernel 4.4.38+ | Caffe2; Caffe | Eigen; OpenBLAS (libopenblas_cortexa57p-r0.3.1.dev.so); CuDNN 6.0.21 | ARM Cortex A57/A53, Quad-Core, 64-bit, 1.9 GHz | 4 GB 64-bit LPDDR3, 25.6 GB/s
Open-Q 820 (APQ8096) (Intrinsyc) | Android 7.0, API 24.0 (8096_Open-Q_820_Android_BSP-N-3.3) | Caffe2 | Eigen | Qualcomm Kryo CPU, Quad-Core, 64-bit, 2.2 GHz | 3 GB 2 x 32-bit LPDDR4, 29.9 GB/s
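The per-layer energy extraction described next (Equation 5) amounts to summing power samples multiplied by the sampling interval inside a layer's annotated time window. A minimal sketch, assuming 1 kHz sampling and illustrative names (not SyNERGY's actual code):

```python
def layer_energy_mj(power_mw, timestamps_s, start_s, end_s):
    """Energy (mJ) = sum of P_{i+1} * dt over samples inside [start_s, end_s)."""
    energy = 0.0
    for i in range(len(power_mw) - 1):
        if start_s <= timestamps_s[i] < end_s:
            dt = timestamps_s[i + 1] - timestamps_s[i]
            energy += power_mw[i + 1] * dt  # mW * s = mJ
    return energy

# 5 samples at 1 kHz, constant 2000 mW over 4 ms -> ~8 mJ
ts = [0.000, 0.001, 0.002, 0.003, 0.004]
pw = [2000.0] * 5
print(layer_energy_mj(pw, ts, 0.0, 0.004))
```

Aligning the annotation timestamps with the power profile is what allows the same trace to be sliced per layer and summed per layer-type.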
The Jetson TX1 development board comes with an on-board TI-INA3221x power sensor chip that has to be enabled during kernel source cross-compilation, along with enabling its entry in the device tree binaries (dtb). This power sensor provides system-level power, CPU-level power and GPU-level power in mW. We use the system-level power to the SoC as this accounts for the power due to the processing core, DRAM memory and peripherals. The power measurements are gathered with the default interactive Linux governor. The Snapdragon 820 development board comes with on-board power pins. To capture energy measurements, we use the ARM energy probe to provide system-level power for the SoC. In our study, the power values are collected as the inference phase executes on either the target CPU (single-threaded) or the GPU. We report the execution time per image (sec/image) and energy per image (mJ/image) averaged over 5 separate runs for single-image inferences. The power sampling rate is fixed to 1 kHz. To extract per-layer measurements, the execution profile with time-stamped code annotations is aligned to the time-stamped power profile. We then use the extracted power measurements to calculate the energy E_inference consumed over the time duration as per Equation 5:

E_inference = Σ_{i=0}^{T} P_{i+1} × dt (5)

where P_{i+1} is the (i+1)-th power sample over the time duration dt = t_{i+1} − t_i and T is the total execution time of the inference application.

(A list of kernel modifications can be found at https://github.com/ARM-software/gator)

In this section, we use SyNERGY to provide empirical time and energy measurements for the overall inference as well as finer-grained layer-types for an example ConvNet model (in our case, GoogleNet). Layer-type for the convolutional layers is when we group individual convolutional layers regardless of their specifications (for example, 11 × 11, 3 × 3 or 1 × 1) into a broader category of Conv. We summarize our findings for Eigen-TX1 and Eigen-Snapdragon820 in Table 3, and OpenBLAS-TX1 in Table 4. In the case of Caffe2, the pooling layer is further split into two types, MaxPool and AveragePool, while in the case of the original Caffe framework both versions are grouped under the Pooling category. Certain layers like dropout and softmax execute too quickly and are too small to be captured. We also report the average percentage of energy and time of each layer when compared to the total energy and time. If we compare the total inference energy in all three software and hardware cases, the combination with the least amount of energy per inference is Caffe2's Eigen-Snapdragon820. However, comparing the energy of the Conv layer for the Jetson TX1 with both software backends, we observe that OpenBLAS consumes less energy than Eigen. If we compare the energy of the different layers, we observe that the Conv layer contributes the most to the total energy consumed in the inference phase. This layer-type abstraction will be useful in the next sections, where we focus on building predictive models at the granularity of layer-types and compare them with previous approaches that use individual layers for predictive models.

Previous work such as NeuralPower [5] builds layer-wise predictive models by using complex features extracted from individual layers, for example, higher-order terms for kernel size, input volume and others, while other works [18] use simple aggregate algorithmic features (we refer to this as layer-type), for example, an aggregate MAC count, to build predictive models for the convolutional layers. Therefore, in this section, we first aim to evaluate algorithmic features (highlighted in bold in Section 2) extracted for individual layers to build predictive models in terms of predictive model accuracy and complexity. Our feature selection is based on standard techniques of best subset selection using metrics such as
Bayesian Information Criterion (BIC) [12].

Table 3: GoogleNet per layer-type breakdown of energy and time for the Eigen library on Cortex-A57 and Kryo CPU (energy (mJ), time (sec), average energy (%) and average time (%) for Conv, Fc, MaxPool, LRN, ReLU, AveragePool, Concat, Dropout and Softmax).

Table 4: GoogleNet per layer-type breakdown of energy and time for OpenBLAS on the Cortex-A57 (Conv, InnerProduct, Pooling, LRN, ReLU, Split, Concat, Dropout and Softmax).

Table 5: BIC subset selection for the Convolutional layer (columns: polynomial degree (d), model complexity (m), model BIC (lower is better), features, relative comparison to the MAC model; rows: MAC model, best linear model, best non-linear model).

These methods are typically used
to evaluate the trade-off in model complexity and accuracy as features are added to the model. We refer to model complexity as the number of features in the final predictive model. We demonstrate this feature analysis for all the convolutional layers of all 12 target ConvNets executing on the CPU of the Jetson TX1 with the OpenBLAS backend. The linear features (or degree d=1) for these layers include: kernel shape, padding, stride, I_x (same as I_y), O_x (same as O_y), O_z, I_z, input size, output size, weights, data volume and MAC. In this case, each convolutional layer has a set of 12 features. The target response is the energy for an individual layer. Figure 3a shows that a model with 5 features (indicated by a red circle as the lowest BIC) would be a good choice. To model non-linear features, we extended the linear feature set to consist of higher-order polynomial terms and cross terms (of degree d =
2) for these features (this includes kernel², kernel × stride and others). For a degree-2 model, the model with the lowest BIC has a model complexity of 62 features, as shown in Figure 3b.

Figure 3: Subset feature selection. Lower BIC is better. (a) Linear features. (b) Non-linear features.

To understand which predictive model should yield greater predictive accuracy, in Table 5 we make a relative comparison of a single-feature model using MAC to the predictive model using the best combination of linear features and to the predictive model with the best combination of linear and non-linear features obtained from the previous step. Based on the relative comparison, a single-feature MAC model is found to be within 3% of the best linear feature model and within 29% of the best non-linear feature model. Therefore, to get a highly accurate predictive model, which is indicated by a lower BIC, the number of features extracted for individual layers would be in the order of 62 non-linear features. In the next section, we compare regression-based models based on individual features with predictive models based on layer-type features on the basis of predictive accuracy, to determine whether higher-complexity models indeed offer better accuracy when compared to lower-complexity models.
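The best-subset search scored by BIC can be sketched as follows. The data here is synthetic and `best_subset_by_bic` is an illustrative helper, not the paper's implementation; it exhaustively fits ordinary least squares on every feature subset up to a given size and keeps the subset with the lowest BIC.

```python
import itertools
import math
import numpy as np

def bic(y, y_hat, k):
    """BIC for a Gaussian-error least-squares fit with k parameters."""
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    return n * math.log(rss / n + 1e-12) + k * math.log(n)

def best_subset_by_bic(X, y, max_size=3):
    """Exhaustive subset selection: return (lowest BIC, chosen column indices)."""
    best = (math.inf, ())
    for size in range(1, max_size + 1):
        for cols in itertools.combinations(range(X.shape[1]), size):
            Xd = np.c_[np.ones(len(y)), X[:, cols]]  # add intercept column
            coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
            y_hat = Xd @ coef
            score = bic(y, y_hat, k=size + 1)
            if score < best[0]:
                best = (score, cols)
    return best
```

With 12 linear features per convolutional layer this search is cheap; for the degree-2 expansion the subset space grows combinatorially, which is why greedy or stepwise variants are the usual fallback.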
Based on our analysis of features in the previous section, we build regression-based models trained using the standard supervised learning approach in machine learning [12]. Cross-validation is performed 10 times and the convolutional layers used in the train and test sets are in the ratio 80:20. The layer-wise regression-based predictive model is given by Equation 6. Similar predictive models can be built for other layers to give the overall energy of the inference as the sum of the predictions from all the layer-wise predictive models, as given by Equation 7:

Ê_layer = x_1 × feature_1^{d_1} + ... + x_m × feature_m^{d_m} (6)

Ê_inference = Ê_conv + Ê_pool + ... + Ê_layer (7)

where d_m represents the degree of the m-th algorithmic feature.

Table 6: Comparison of energy predictive accuracy and model complexity (Dataset: OpenBLAS-TX1)

Model | Model type | 10-fold cross-validation accuracy (%) | Polynomial degree (d) | Model complexity (m)
MAC sum model (overall Conv model) | Energy | 81.84 ± … | … | …
MAC model | Energy | 67.02 ± … | … | …
Best linear model | Energy | 72.83 ± … | … | …
Best non-linear model | Energy | 79.58 ± … | … | …
NeuralPower [5] | Runtime | 77.48 ± … | … | …

As described in the previous sections, we have two types of predictive models based on the type of features: layer-wise predictive models use features aggregated across layers, while individual layer models use features from every layer. The individual convolutional layer models are of four categories, as given in Table 6: a single-feature model (MAC) without summation counts (as done in [18]), a model with the best BIC for linear features (best linear), a model with the best BIC for non-linear features (best non-linear) and, finally, we compare with a previous work, NeuralPower [5]. NeuralPower is based on predictions from run-time and power estimation models, to get an estimate of time and power, and subsequently energy, for individual layers. For this, we use the code provided by NeuralPower. By comparing the models based on individual layer features, as summarised in Table 6, we find that using a larger set of complex features does not provide a massive boost in accuracy compared with the use of simpler features, as indicated by the results from Table 5; for example, compare the data for complex models such as the Best non-linear model and NeuralPower in Table 6 with that for the simpler models such as the MAC model and Best linear model. Furthermore, for a single hardware and software configuration and a single split of their dataset, NeuralPower reports an overall accuracy of 97.21% (based on the Root-mean-squared-percentage-error (RMSPE)). When using NeuralPower on our dataset, we observe similarly high accuracies for certain splits (for example, considering the upper bound we get 77.47+21.21=98.69%). However, this behaviour is not consistently observed across other splits of train and test sets as done in our experimental evaluation. Our results indicate a mean and variance of the accuracy of 77.47 ± 21.21. This points to a cancellation effect when calculating energy to give an overall high accuracy, which may be misleading.
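Equations 6 and 7 (restricted to degree-1 features, as in the layer-type models) can be sketched as one fitted line per layer type whose predictions are summed for the whole inference. The function names and numbers below are illustrative, not the paper's measured data.

```python
import numpy as np

def fit_layer_model(features, energies):
    """Least-squares fit of energy = x1 * feature + x0 (Equation 6, degree 1)."""
    slope, intercept = np.polyfit(features, energies, deg=1)
    return slope, intercept

def predict_inference_energy(models, test_features):
    """Equation 7: overall energy = sum of per-layer-type predictions."""
    total = 0.0
    for layer_type, feats in test_features.items():
        slope, intercept = models[layer_type]
        total += sum(slope * f + intercept for f in feats)
    return total

# Synthetic example: a perfectly linear conv dataset (2 mJ per unit feature)
models = {"conv": fit_layer_model([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])}
print(predict_inference_energy(models, {"conv": [4.0]}))  # ~8.0
```

Splitting train and test by whole ConvNets (as done in Section 6) rather than by individual layers is what keeps every layer of a test network out of the training set.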
However, we do not observe such cancellation effects when using a simple MAC sum model trained directly on energy use information. In addition, the MAC sum model offers a higher accuracy and lower variance compared to using models based on individual layers (see Column 4 of Table 6). Given this behaviour, we conclude that predictive models based on the MAC sum offer a good first approximation to estimate the energy consumed in a ConvNet in terms of model complexity and accuracy. The MAC sum model yields a higher and more stable predictive accuracy but with a model complexity 4× and 17× lower compared to previous approaches that use data from individual layers (Column 6 of Table 6). Moreover, MAC (or, more generally, an operation count, Op) is a universal feature that can be extracted for other layer-types (see Section 6). Therefore, in the next section, we extend the MAC sum approach to other layer-types to get an overall estimate of the energy consumed by the deep learning model.

(NeuralPower code: https://github.com/cmu-enyac/NeuralPower)
In this section, we extend our method to construct per-layer energy prediction models for the Conv, Fc and pooling layers using the MAC count (or, equivalently, the Op count for pooling layers). To make a prediction for an entire ConvNet model, the training and test sets are split based on the ConvNet models themselves during cross-validation. This ensures that, for a given test ConvNet, all its layers are present only in the test set. The predictive models are evaluated in terms of their relative accuracy, given by Equation 8, which quantifies the relative performance of the predictor with respect to the baseline measured energy value [18]. We average the relative test errors across all test examples and across all folds of data.
Rel.Acc (%) = (1 − |predicted − measured| / measured) × 100 (8)

In Figure 4, we plot the linear regression model over the data points in our dataset. We observe that the relative positions of each data point in all three cases (that is, Eigen-Snapdragon820, Eigen-TX1 and OpenBLAS-TX1) follow a similar trend. At the bottom left corner, we observe models with low MAC count and low energy use, for example, squeezenet and mobilenet. It is interesting to observe that smaller-sized models, in terms of number of parameters, do not always result in better energy use. For example, resnet50 outperforms inceptionv3 in terms of energy use in all three cases despite being roughly the same size (see Table 1). We also observe that alexnet has lower energy use than squeezenet and mobilenet despite being approximately 3 and 2 times greater in model size, respectively. This is because the latter models use smaller kernel shapes such as 1 × 1 and 3 × 3. (This is considering only the convolutional layers.)

Table 7: Per layer-type energy prediction results for all software-hardware combinations (linear regression accuracy (%), 10-fold cross-validation)

Layer | Software-Hardware Combination | Linear regression accuracy (%)
Conv | Eigen-Snapdragon820 | …
Conv | Eigen-TX1 | …
Conv | OpenBLAS-TX1 | …
Fc | Eigen-Snapdragon820 | …
Fc | Eigen-TX1 | …
Fc | OpenBLAS-TX1 | …
Pool | Eigen-Snapdragon820 | 90.01 ± …
Pool | Eigen-TX1 | …
Pool | OpenBLAS-TX1 | …

The linear regression model using solely the MAC count as an input feature achieves a test accuracy between 75% to 82%. We aim to model the Fc layers and pooling layers using an equivalent feature to the MAC sum count as done previously for the Conv layers. However, as observed in Table 1 from Section 2, there are fewer ConvNets with Fc layers, and fewer Fc layers per ConvNet model. Despite the fact that we are using a larger number of ConvNets than previous studies, the data for the Fc layers is limited. As seen in Table 7, the lower accuracy between 56% to 76% using a linear fit could be a result of insufficient data points. We could address this issue by generating more points for the Fc layers, either by using individual Fc layers as adopted by previous approaches or by using varying batch sizes. For the pooling layers, we focussed on MaxPool operations as they account for more layers than average pooling in real ConvNets. We use the Op count given in Equation 4. The results in Figure 4 show that, using solely the Op count as an input feature, we can obtain a linear fit with test accuracy between 82% to 90%.
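Equation 8 is a one-liner in code; the function name is illustrative. Note the metric is asymmetric in `measured` (it normalises by the measured baseline, not the prediction), and a prediction off by more than 100% of the measured value yields a negative relative accuracy.

```python
def relative_accuracy(predicted, measured):
    """Equation 8: Rel.Acc (%) = (1 - |predicted - measured| / measured) * 100."""
    return (1.0 - abs(predicted - measured) / measured) * 100.0

# A 25% over-estimate gives 75% relative accuracy
print(relative_accuracy(125.0, 100.0))  # 75.0
```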
In this section, we obtain an estimate of the energy for the whole ConvNet by summing the predictions from the Conv, Fc and MaxPool layer-type predictive models. We select GoogleNet, AlexNet and VGG_CNN_S as test data points (see Table 8) because AlexNet and a variant of VGG with 16 layers were evaluated in NeuralPower [5], and GoogleNet is our running example. We use the remaining model points as the training set to form the linear model. Table 8 shows the prediction results for each layer-type Conv, Pool and Fc, given by Columns 3, 4 and 5, and the overall predicted results (Column 6: Total predicted) for the inference. The measured energy for the Conv, Pool and Fc layer-types is given in Columns 7, 8 and 9, and the overall measured energy in Column 10 of Table 8. Similar to the results obtained using empirical measurements in Section 4, we find that, using the predicted energy for the convolutional layers (given in Column 3) of GoogleNet, the OpenBLAS library is less energy consuming than the Eigen library for the TX1 platform. Finally, we also report accuracy using RMSPE (as per the metric used in NeuralPower) and relative test accuracy (see Equation 8). Both metrics provide similar results. We find that across the four software-hardware combinations, including mobile GPUs (CuDNN-TX1 in Table 8), we achieve a significant relative test accuracy of between 76% to 85% using solely the summation of MAC (or operation) counts as the input feature to a linear model. To enable efficiency in deep learning algorithms, software and hardware, a better understanding of the energy use of deep learning models is required. This section covers related work in the areas of performance and energy benchmarking, and performance and energy modelling.
Performance and energy benchmarking:
Performance, or execution time, is used as a metric to evaluate deep learning models on existing desktop and server systems, as done in Fathom [2]. These studies are representative of execution environments with powerful processors and larger memories, which are not typical of resource-constrained, low-powered devices. Our work instead provides both the performance and energy use of 12 representative ConvNet models executing on resource-constrained mobile systems, and identifies energy bottlenecks at a fine-grained level.

Recent energy benchmarking efforts, such as BenchIP [25], have emerged to understand the energy use of deep learning applications across different types of hardware systems. The authors develop a benchmark suite of single layers and full ConvNet models aimed at evaluating different hardware systems. However, it is unclear how usable this framework would be for the measurement and modelling studies described in this paper, as it is yet to be open-sourced.
Performance and energy modelling:
To overcome the requirement of having to execute every model to measure its performance, recent studies [17] have focussed on modelling the execution time and resource usage of only the convolutional layers in a ConvNet model. They treat matrix multiplication as the major component of a convolutional layer and execute different matrix sizes in isolation to model its performance and resource use. The authors found that such isolation fails to capture the dependencies between layers during actual inference runs, leading to an over-estimate of the predicted execution time compared to the actual execution time. Our work instead captures the energy use of the layers in the context of the execution environment of an entire inference and uses it to build predictive models.

Early studies [22] relied on counting the number of weights in the deep learning model and on energy look-up tables to estimate the energy cost of DRAM memory accesses during the inference phase on specialized hardware. However, such estimation models for deep learning on general-purpose processors such as CPUs and GPUs have only recently emerged (for example, for a mobile CPU [18] and for desktop GPUs [5]). Our work builds upon the former, which models energy consumption using platform-specific performance counter information. Specifically, we build predictive models at the application level using platform-agnostic neural network features.

Our work also shares similarity with NeuralPower [5], which builds predictive models for a desktop GPU.
However, we differ in three main aspects. First, NeuralPower develops per-layer power and runtime prediction models on desktop GPUs, such as the Titan X, to predict the energy of 5 ConvNet test models. Our work focusses on empirical power measurements obtained on resource-constrained mobile devices and comprehensively evaluates 12 representative ConvNets. Second, NeuralPower does not provide an analysis of how the features used to build their predictive models are selected. We use statistical analysis to select dominant input features extracted from the algorithm. Third, although the average energy prediction accuracy reported by NeuralPower appears high, it cannot be replicated on a different set of ConvNets, as shown in our results in Section 6.

[Figure 4: Linear regression-based energy prediction models using MAC for the Conv layer, plotting system energy (mJ) against MAC count over the 12 ConvNets. Fitted lines: y = 6.39e-07x (a: Eigen-Snapdragon820), y = 5.54e-06x (b: Eigen-TX1), y = 2.88e-06x (c: OpenBLAS-TX1).]

[Table 8: Aggregate energy prediction results for the Conv, Pool and Fc layers of the test ConvNets GoogleNet, AlexNet and VGG_CNN_S on the four software-hardware combinations Eigen-Snapdragon820, Eigen-TX1, OpenBLAS-TX1 and CuDNN-TX1. Columns give the predicted energy per layer type, the total predicted energy (mJ), the measured energy per layer type, the total measured energy (mJ), accuracy as "100 − RMSPE" (%), and relative test accuracy (%).]
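The per-platform Conv-layer models fitted in Figure 4, combined with the layer-type aggregation used for Table 8, can be sketched as follows. The coefficients are read from the figure; the MAC count, the Pool/Fc predictions and the measured total are illustrative placeholders, and reading Equation 8's relative test accuracy as 100% minus the absolute relative error of the total is our assumption:

```python
# Per-platform Conv-layer models from Figure 4: energy (mJ) = c * MAC count.
COEFF_MJ_PER_MAC = {
    "Eigen-Snapdragon820": 6.39e-07,
    "Eigen-TX1": 5.54e-06,
    "OpenBLAS-TX1": 2.88e-06,
}

def predict_conv_energy_mj(mac_count, backend_platform):
    """Predicted Conv-layer energy from the summed MAC count."""
    return COEFF_MJ_PER_MAC[backend_platform] * mac_count

# Hypothetical totals for one test ConvNet on Eigen-TX1 (not measured values).
pred_conv = predict_conv_energy_mj(1.59e9, "Eigen-TX1")  # illustrative MACs
pred_pool, pred_fc = 30.0, 12.0   # assumed Pool/Fc model outputs (mJ)
total_pred = pred_conv + pred_pool + pred_fc

total_meas = 9500.0               # assumed measured total (mJ)
# Relative test accuracy (our reading of Equation 8):
rel_acc = 100.0 * (1.0 - abs(total_pred - total_meas) / total_meas)
```

Only the per-platform coefficient changes across software-hardware combinations, which is why a new configuration only requires fitting a few ConvNets once.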
The overhead of the power-measuring software, introduced by the gator daemon executing on the target device, is negligible at approximately 3% [1]. Our data collection phase on each platform takes around 5 minutes for all 12 ConvNets, our feature selection process takes less than a minute, and predictive model training and testing takes approximately 5 ms. This low overhead is beneficial: for a new software-hardware configuration, we pay this cost only once, using a few ConvNets to approximate the energy use of the platform. Finally, the predictive models in our work are built at the layer level; any optimization that accelerates a layer, such as a fused-layer implementation [3], is typically applied below this level of abstraction and is thus automatically covered.
CONCLUSION

Deep neural network inference is becoming increasingly popular on low-power mobile devices. In this work, we focus on building energy predictive models by thoroughly investigating the impact of the choice of application-level features on the final predictive model's accuracy and complexity. To support the building of predictive models, we extended SyNERGY, a framework for gathering energy measurements on different mobile devices. We compare two types of predictive models found in the literature, based on features selected for layers at different levels: individual layers and layer-type. Our analysis using subset feature selection techniques for individual-layer models indicates that highly complex features are required to achieve greater predictive accuracy. However, unlike the results of previous works, we find that predictive models based on layer-type features (for example, summation of operation counts) offer a model complexity 4 to 32 times less than models using individual-layer features for a similar average accuracy.

ACKNOWLEDGMENTS
This research was conducted with support for C. Rodrigues and G.D. Riley from the IS-ENES2 project, funded under the European FP7-INFRASTRUCTURES-2012-1 call (GA No: 312979). C. Rodrigues is also part-funded by Arm under a PhD Studentship Agreement. M. Luján is supported by a Royal Society University Research Fellowship.
REFERENCES

[1] 2016. DS-5 Development Studio. Developer.arm.com. https://developer.arm.com/products/software-development-tools/5-development-studio/streamline
[2] Robert Adolf, Saketh Rama, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2016. Fathom: reference workloads for modern deep learning methods. In Workload Characterization (IISWC), 2016 IEEE International Symposium on. IEEE, 1–10.
[3] Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. Fused-layer CNN accelerators. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 22.
[4] A. Berg, J. Deng, and L. Fei-Fei. 2010. Large scale visual recognition challenge (ILSVRC), 2010.
[5] Ermao Cai, Da-Cheng Juan, Dimitrios Stamoulis, and Diana Marculescu. 2017. NeuralPower: Predict and deploy energy-efficient convolutional neural networks. arXiv preprint arXiv:1710.05420 (2017).
[6] Ian J. Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, Yingbo Zhou, Chetan Ramaiah, Fangxiang Feng, Ruifan Li, Xiaojie Wang, Dimitris Athanasakis, John Shawe-Taylor, Maxim Milakov, John Park, Radu Ionescu, Marius Popescu, Cristian Grozea, James Bergstra, Jingjing Xie, Lukasz Romaszko, Bing Xu, Zhang Chuang, and Yoshua Bengio. 2015. Challenges in representation learning: A report on three machine learning contests. Neural Networks 64 (2015), 59–63. https://doi.org/10.1016/j.neunet.2014.09.005. Special Issue on "Deep Learning of Representations".
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385 (2015).
[8] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[9] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. IEEE, 2261–2269.
[10] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv preprint arXiv:1602.07360 (2016).
[11] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning. 448–456.
[12] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. [n.d.]. An Introduction to Statistical Learning. Vol. 112. Springer.
[13] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia. ACM, 675–678.
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. (2012), 1097–1105.
[15] Nicholas D. Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, and Fahim Kawsar. 2015. An Early Resource Characterization of Deep Learning on Wearables, Smartphones and Internet-of-Things Devices. In Proceedings of the 2015 International Workshop on Internet of Things towards Applications. ACM, 7–12.
[16] Min Lin, Qiang Chen, and Shuicheng Yan. 2013. Network in network. arXiv preprint arXiv:1312.4400 (2013).
[17] Zongqing Lu, Swati Rallapalli, Kevin Chan, and Thomas La Porta. 2017. Modeling the resource requirements of convolutional neural networks on mobile devices. In Proceedings of the 2017 ACM on Multimedia Conference. ACM, 1663–1671.
[18] Crefeda Rodrigues, Graham Riley, and Mikel Lujan. 2018. SyNERGY: An energy measurement and prediction framework for Convolutional Neural Networks on Jetson TX1. CSREA Press, 375–382. https://csce.ucmss.com/cr/books/2018/LFS/CSREA2018/PDP4261.pdf
[19] Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2018. SCALE-Sim: Systolic CNN Accelerator. arXiv preprint arXiv:1811.02883 (2018).
[20] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[21] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. 2014. Striving for Simplicity: The All Convolutional Net. CoRR abs/1412.6806 (2014). arXiv:1412.6806 http://arxiv.org/abs/1412.6806
[22] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel Emer. 2017. Efficient processing of deep neural networks: A tutorial and survey. arXiv preprint arXiv:1703.09039 (2017).
[23] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2818–2826.
[24] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. 2018. MnasNet: Platform-Aware Neural Architecture Search for Mobile. arXiv preprint arXiv:1807.11626 (2018).
[25] Jin-Hua Tao, Zi-Dong Du, Qi Guo, Hui-Ying Lan, Lei Zhang, Sheng-Yuan Zhou, Ling-Jie Xu, Cong Liu, Hai-Feng Liu, Shan Tang, et al. 2018. BenchIP: Benchmarking Intelligence Processors. Journal of Computer Science and Technology.
[26] arXiv preprint arXiv:1505.02496 (2015).
[27] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. 2016. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv preprint arXiv:1611.05128 (2016).