Not All Ops Are Created Equal!
Liangzhen Lai
Naveen Suda
Vikas Chandra
ABSTRACT
Efficient and compact neural network models are essential for enabling deployment on mobile and embedded devices. In this work, we point out that the typical design metrics for gauging the efficiency of neural network architectures – total number of operations and parameters – are not sufficient. These metrics may not accurately correlate with the actual deployment metrics such as energy and memory footprint. We show that throughput and energy vary by up to 5X across different neural network operation types on an off-the-shelf Arm Cortex-M7 microcontroller. Furthermore, we show that the memory required for activation data also needs to be considered, apart from the model parameters, in network architecture exploration studies.
INTRODUCTION
Exploring efficient neural network (NN) architectures targeted for mobile and embedded devices with constrained energy and memory resources has been a recent trend in machine learning research [3, 5, 9, 12, 13]. Most research uses the number of operations (Ops) and/or parameters (i.e., weights) as the metrics for evaluating model complexity and compactness. While these metrics are sufficient when comparing significantly different NN models (e.g., AlexNet [7] vs. MobileNets [3]), they may not be accurate enough for comparing networks whose complexity and sizes are similar. Furthermore, as research shifts towards fine-grained optimization, e.g., network architecture search [2, 14] and hyperparameter search [10, 13], reductions in Ops or parameters may not always improve network efficiency.

Energy per inference and total memory footprint are the two main system metrics to be considered when deploying NN-based solutions on resource-constrained devices. In this work, we show examples of NNs with similar network design metrics that have very different deployment metrics when running on resource-constrained devices such as microcontrollers. In particular, we show:
• Throughput and energy efficiency for different types of NN operations can vary by up to 5X. This can result in a 30% difference in runtime and energy for NNs with similar Ops and accuracy.
• Different operations with the same number of weights can have varying amounts of activation data, and thus different memory footprints. This may not be an issue for large-scale systems, but it is critical for devices with limited memory.

All experiments are performed using the optimized neural network kernels in CMSIS-NN [8]. The delay/power results are measured on a NUCLEO-F746ZG mbed development board [1], which has an Arm Cortex-M7 core (running at 216 MHz), 1 MB of flash and 320 KB of SRAM.
ENERGY EFFICIENCY
Energy consumption per inference is a crucial metric that determines the battery life of an embedded system, and it is imperative that NN models are optimized for energy efficiency. Typically, the number of Ops is considered a proxy for the energy consumption per inference, but the type of operation also has a huge impact on the energy. For example, Fig. 1 shows the normalized throughput, power consumption and energy per Op of different NN operation types in the convolutional neural network (CNN) for the CIFAR-10 dataset from the Caffe examples [6]. The results show that throughput (i.e., Ops/s) can vary by 5X across different operation types, while the average power consumption remains almost the same. This implies that the overall energy consumption depends mostly on the throughput. Among all the operation types, max pooling is particularly slow because it is based on comparisons (i.e., branches) rather than computations. However, in a typical NN, convolution and fully-connected (FC) layers constitute more than 90% of the operations. These layers achieve good throughput by effectively utilizing the SIMD multiply-accumulate (MAC) instructions.

Figure 1: Normalized throughput and energy of different types of NN operations in a CNN for the CIFAR-10 dataset.

Fig. 2 shows the throughput of different MAC-based NN operations. Since the throughput depends heavily on the layer dimensions, we use the number of MAC operations per output to represent the effectiveness of the SIMD MAC instructions. In this case, the difference between operation types represents the relative overhead of fetching the MAC operands. In general, convolution is slower than the fully-connected layer because of the additional im2col overhead. However, 1x1 convolution does not require im2col. It uses matrix-matrix multiplication (GEMM) style computations, which are faster than the matrix-vector multiplication (GEMV) style computations used in the fully-connected layer due to better data reuse. Among all operation types, depthwise separable convolution (DS-Conv) is the slowest, as it has higher im2col overhead and typically fewer MACs per output.

Figure 2: Throughput variation with number of MACs per output for different types of NN operations. Depthwise separable convolution (DS-Conv) typically has far fewer MACs per output compared to other operation types.
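To make the MACs-per-output comparison concrete, the sketch below counts MACs per output element for the three MAC-based layer types; the layer shapes are hypothetical, the DS-Conv number is a simple per-stage average, and the counts follow directly from the layer definitions rather than from any measured data.

```python
# Rough MACs-per-output counts for common layer types (hypothetical shapes).
# More MACs per output generally amortize the operand-fetch/im2col overhead
# better over the SIMD MAC instructions.

def conv_macs_per_output(in_channels, kernel_h, kernel_w):
    # Standard convolution: each output element is a dot product over a
    # kernel_h x kernel_w x in_channels patch (one im2col column).
    return kernel_h * kernel_w * in_channels

def fc_macs_per_output(in_features):
    # Fully-connected: each output element is a dot product over the input vector.
    return in_features

def ds_conv_macs_per_output(in_channels, kernel_h, kernel_w):
    # Depthwise separable conv = depthwise stage (kernel_h*kernel_w MACs per
    # output element) + 1x1 pointwise stage (in_channels MACs per output element).
    # A simple average over the two stages already shows far fewer MACs per
    # output than a standard convolution with the same kernel size.
    depthwise = kernel_h * kernel_w
    pointwise = in_channels
    return (depthwise + pointwise) / 2.0

if __name__ == "__main__":
    # Hypothetical shapes: 64 input channels, 3x3 kernels, 256-d FC input.
    print("Conv   :", conv_macs_per_output(64, 3, 3))     # 576 MACs/output
    print("FC     :", fc_macs_per_output(256))            # 256 MACs/output
    print("DS-Conv:", ds_conv_macs_per_output(64, 3, 3))  # ~36.5 MACs/output
```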
Understanding the throughput differences between operation types is crucial for designing efficient NN architectures. Fig. 3 shows the normalized energy consumption, number of Ops and accuracy of 5 DS-CNN models [13] with different numbers of layers and features per layer, trained on the Google speech commands dataset [11]. It shows that the energy per inference varies by as much as 30% across these models, although they come from the same NN architecture family and have similar accuracy and total number of Ops.

Figure 3: Normalized energy consumption of 5 DS-CNN models with similar accuracy and number of Ops.

The distributions of the different operation types in the DS-CNN-3 and DS-CNN-4 models are shown in Fig. 4. Compared to the DS-CNN-3 model, DS-CNN-4 has a higher proportion of DS-Conv Ops, which have substantially lower throughput than the other operation types, as shown in Fig. 2. This results in a 30% reduction in overall throughput and hence energy efficiency. Using Ops as a metric without considering the throughput of the different operation types on the actual hardware may lead to sub-optimal efficiency.
Figure 4: Distribution of different operation types in the DS-CNN-3 (left) and DS-CNN-4 (right) models shown in Fig. 3.
When performing fine-grained NN optimization, the operation types and dimensions should be considered when evaluating network efficiency. The results we show in this work are based on a general-purpose processor, and the operation characteristics on other platforms (e.g., GPU, FPGA, DSP, accelerator) can be very different. The performance of different operation types, similar to the results in Fig. 2, can be pre-characterized for the target hardware platform and used to estimate network efficiency.
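One way to use such a pre-characterization is a small lookup table of effective throughput per operation type, measured once on the target hardware, from which a candidate network's runtime and energy are estimated from its per-layer Op counts. The sketch below is only illustrative; the throughput and power numbers are hypothetical placeholders, not measured values.

```python
# Estimate runtime and energy of a candidate network from a per-op-type
# throughput table pre-characterized on the target hardware.
# All numbers below are hypothetical placeholders.

THROUGHPUT_MOPS = {   # effective MOps/s per operation type (measured once)
    "conv":    90.0,
    "1x1conv": 110.0,
    "fc":      100.0,
    "dsconv":  40.0,
    "pool":    20.0,
}
AVG_POWER_MW = 300.0  # average power is roughly constant across op types (Fig. 1)

def estimate(layers):
    """layers: list of (op_type, mega_ops) pairs describing the network."""
    runtime_ms = 0.0
    for op_type, mops in layers:
        runtime_ms += mops / THROUGHPUT_MOPS[op_type] * 1e3  # ms for this layer
    energy_mj = AVG_POWER_MW * runtime_ms * 1e-3             # mW * s = mJ
    return runtime_ms, energy_mj

# Two candidate networks with identical total Ops (10 MOps) but a different
# mix of op types can differ noticeably in estimated runtime and energy.
net_a = [("conv", 6.0), ("1x1conv", 3.0), ("fc", 1.0)]
net_b = [("conv", 3.0), ("dsconv", 6.0), ("fc", 1.0)]
print(estimate(net_a))  # ~104 ms, ~31 mJ
print(estimate(net_b))  # ~193 ms, ~58 mJ
```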
MEMORY FOOTPRINT
System memory size is the other important limiting factor for running NNs on resource-constrained devices. For example, a typical microcontroller SoC has 100 KB - 1 MB of flash (to store the program binary and model weights) and 10-300 KB of SRAM (to store the activation data).

The number of model parameters, which can be used as a metric to quantify the compactness of a NN model, determines whether the model fits in the flash or not. However, it may not be a good metric for representing the total memory footprint, as it does not account for the activation data, which is typically stored in the SRAM. The amount of activation data can be a significant part of the total memory footprint and depends on the operation type as well. For example, Fig. 5 shows the memory footprint of four NN models for the keyword spotting application from [13]. The size of the maximum concurrent activation data varies between 1% and 30% of the total memory footprint.
Figure 5: Memory footprint (total weights and maximum concurrent activation data) breakdown for four different types of models from [13].
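A simple way to account for both contributions when comparing models is to sum the weights (flash) and track the largest set of activation data that must be live in SRAM at the same time. The sketch below assumes a plain feed-forward topology, where only the current layer's input and output activations are live at once, and uses illustrative layer sizes with 8-bit weights and activations; skip connections would need to be added to the live set.

```python
# Memory footprint estimate for a plain feed-forward network:
# flash holds all weights, SRAM must hold the largest input+output
# activation pair that is live at the same time.
# Sizes below are illustrative, assuming 8-bit (q7) weights and activations.

layers = [
    # (name, weight_count, input_activation_count, output_activation_count)
    ("conv1", 3 * 3 * 3 * 32,     32 * 32 * 3,  32 * 32 * 32),
    ("conv2", 3 * 3 * 32 * 32,    32 * 32 * 32, 16 * 16 * 32),
    ("fc",    16 * 16 * 32 * 10,  16 * 16 * 32, 10),
]

BYTES_PER_ELEMENT = 1  # 8-bit weights and activations

weight_bytes = sum(w for _, w, _, _ in layers) * BYTES_PER_ELEMENT
# For a feed-forward net, only the current layer's input and output
# activations are live simultaneously; take the worst case over layers.
activation_bytes = max(i + o for _, _, i, o in layers) * BYTES_PER_ELEMENT

print("Flash (weights):         %.1f KB" % (weight_bytes / 1024))
print("SRAM (peak activations): %.1f KB" % (activation_bytes / 1024))
print("Total footprint:         %.1f KB" % ((weight_bytes + activation_bytes) / 1024))
```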
Apart from operation types, the NN topology can also affect the size of the maximum concurrent activation data. A regular feed-forward network needs to store only the input and output data of the current layer. If there are additional feed-forward connections, such as in DenseNet [4], the amount of concurrent activation data increases. Also, some networks generated by automatic network architecture search can have many feed-forward connections [14], which can substantially increase the maximum concurrent activation data.

CONCLUSION
In this work, we show that the NN operation type has a significant impact on system efficiency. The commonly used network design metrics – number of operations and parameters – need to be rethought, as they may not accurately correlate with system design metrics such as energy efficiency and memory footprint. Experimental results on an off-the-shelf Arm Cortex-M microcontroller show that the energy per operation can vary by up to 5X across different NN operation types. Network activation data, which is typically overlooked, can contribute up to 30% of the total memory footprint. Network architecture exploration should account for both energy efficiency and total memory footprint to make inference more efficient on resource-constrained devices.
REFERENCES
[2] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. 2016. Designing Neural Network Architectures using Reinforcement Learning. arXiv preprint arXiv:1611.02167 (2016).
[3] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861 (2017).
[4] Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. 2016. Densely Connected Convolutional Networks. arXiv preprint arXiv:1608.06993 (2016).
[5] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016).
[6] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 675–678.
[7] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems. 1097–1105.
[8] Liangzhen Lai, Naveen Suda, and Vikas Chandra. 2018. CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs. arXiv preprint arXiv:1801.06601 (2018).
[9] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In European Conference on Computer Vision. Springer, 525–542.
[10] Dimitrios Stamoulis, Ermao Cai, Da-Cheng Juan, and Diana Marculescu. 2017. HyperPower: Power- and Memory-Constrained Hyper-Parameter Optimization for Neural Networks. arXiv preprint arXiv:1712.02446 (2017).
[11] Pete Warden. 2017. Speech Commands: A public dataset for single-word speech recognition. Dataset available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz (2017).
[12] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2017. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. arXiv preprint arXiv:1707.01083 (2017).
[13] Yundong Zhang, Naveen Suda, Liangzhen Lai, and Vikas Chandra. 2017. Hello Edge: Keyword Spotting on Microcontrollers. arXiv preprint arXiv:1711.07128 (2017).
[14] Barret Zoph and Quoc V. Le. 2016. Neural Architecture Search with Reinforcement Learning. arXiv preprint arXiv:1611.01578 (2016).