Towards Closing the Energy Gap Between HOG and CNN Features for Embedded Vision (Invited Paper)
Amr Suleiman*, Yu-Hsin Chen*, Joel Emer, Vivienne Sze
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
{suleiman, yhchen, jsemer, sze}@mit.edu
*These authors contributed equally to this work

Abstract: Computer vision enables a wide range of applications in robotics/drones, self-driving cars, smart Internet of Things, and portable/wearable electronics. For many of these applications, local embedded processing is preferred due to privacy and/or latency concerns. Accordingly, energy-efficient embedded vision hardware delivering real-time and robust performance is crucial. While deep learning is gaining popularity in several computer vision algorithms, a significant energy consumption difference exists compared to traditional hand-crafted approaches. In this paper, we provide an in-depth analysis of the computation, energy and accuracy trade-offs between learned features such as deep Convolutional Neural Networks (CNN) and hand-crafted features such as Histogram of Oriented Gradients (HOG). This analysis is supported by measurements from two chips that implement these algorithms. Our goal is to understand the source of the energy discrepancy between the two approaches and to provide insight about the potential areas where CNNs can be improved and eventually approach the energy efficiency of HOG while maintaining their outstanding accuracy.
I. INTRODUCTION
Computer vision (CV) is a critical technology for numerous smart embedded systems, such as advanced driver assistance systems, autonomous cars/drones, and robotics. It extracts meaningful information from visual data for further decision making. However, many modern CV algorithms have high computational complexity, which makes their deployment on battery-powered devices challenging due to the tight energy constraints. Near-sensor visual data processors should consume under 1 nJ/pixel with a logic gate count of around 1000 kgates and a memory capacity of a few hundred kBytes in order to be comparable with video codecs, which are present in most cameras [1]. For many applications, offloading computation to the cloud is also undesirable because of latency, connectivity, and security limitations. Thus, dedicated energy-efficient CV hardware is crucial.

Feature extraction is the first processing step in most CV tasks, such as object classification and detection (Fig. 1). It transforms the raw pixels into a high-dimensional space, where only meaningful and distinctive information is kept. Traditionally, features are designed by experts in the field through a hand-crafted process. For instance, many well-known hand-crafted features use image gradients, such as histogram of oriented gradients (HOG) [2] and scale invariant feature transform (SIFT) [3], based on the fact that human eyes are sensitive to edges. In contrast, learned features learn a representation with the desired characteristics directly from data using deep convolutional neural networks (CNNs) [4]. Learned features are gaining popularity, as they are outperforming hand-crafted features in many CV tasks [5, 6].

Fig. 1. General processing pipeline for object classification and detection: image pixels pass through feature extraction (hand-crafted features, e.g., HOG, or learned features, e.g., CNN) to produce features x, which are classified with trained weights w (w^T x) to give scores per class (the class is selected based on a maximum or a threshold).

The differences in design between hand-crafted and learned features lead to not only different performance in applications, but also different hardware implementation considerations, which have a strong implication for energy efficiency. In general, hardware implementations for hand-crafted features are widely understood to be more energy-efficient than those for learned features. However, there is no analysis that explains the energy gap between the two types of features. Also, an open question is whether the energy gap can be closed in the future.

In this paper, we provide an in-depth analysis of the causes of the energy gap between hand-crafted and learned features. We use results from two actual chip designs: [7] implements the hand-crafted feature using HOG, and [8] implements the learned feature using CNN. Both chips use 65nm CMOS technology and have similar hardware resource utilization in terms of logic gate count and memory capacity. Based on the insights derived from the two implementations, we will discuss techniques to help close the energy gap.

II. FEATURE EXTRACTION HARDWARE IMPLEMENTATIONS
A. Hardware for Hand-crafted Feature: HOG
The chip presented in [7] implements the entire object detection pipeline based on deformable parts models (DPM) [9] for high-throughput and low-power embedded vision applications. DPM extracts HOG features from the input image, and localizes objects by sweeping the features with support vector machine (SVM) classifiers [10]. Fig. 2 shows the feature extraction process using HOG: the image is divided into non-overlapping 8 × …

Fig. 2. Feature extraction with histogram of oriented gradients (HOG).

Fig. 3. General processing pipeline of CNN: a modern deep CNN is a stack of layers in which CONV layers produce low-level and then high-level features, followed by an FC layer that outputs the classes; the layer operations are labeled convolution, non-linearity, normalization, and pooling.

B. Hardware for Learned Feature: CNN
The second chip presented in [8], called Eyeriss, is an energy-efficient accelerator for deep CNNs. Fig. 3 shows a general CNN processing pipeline, consisting mainly of a series of convolutional (CONV) layers. In each CONV layer, a collection of 3D filters is applied to the input images or feature maps to generate output feature maps, which are then used as the input to the next CONV layer. Eyeriss is programmable in terms of the size and shape of filters as well as the number of layers. Therefore, it can accommodate different CNN architectures, such as AlexNet [5] and VGG-16 [11].
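As a rough illustration of the CONV layer computation described above, the following minimal Python sketch applies a set of 3D filters to a stack of input feature maps. The function name `conv_layer`, the nested-list data layout, and the choice of no padding are assumptions made for this sketch; it is not the Eyeriss implementation.

```python
def conv_layer(ifmaps, filters, stride=1):
    """Apply 3D filters to input feature maps (no padding).

    ifmaps:  [C][H][W] nested lists (C input channels)
    filters: [M][C][R][S] nested lists (M filters, each C x R x S)
    returns: [M][E][F] nested lists (one output feature map per filter)
    """
    C, H, W = len(ifmaps), len(ifmaps[0]), len(ifmaps[0][0])
    M, R, S = len(filters), len(filters[0][0]), len(filters[0][0][0])
    E = (H - R) // stride + 1
    F = (W - S) // stride + 1
    ofmaps = [[[0] * F for _ in range(E)] for _ in range(M)]
    for m in range(M):
        for e in range(E):
            for f in range(F):
                acc = 0
                # Accumulate over all channels and filter positions.
                for c in range(C):
                    for r in range(R):
                        for s in range(S):
                            acc += (ifmaps[c][e * stride + r][f * stride + s]
                                    * filters[m][c][r][s])
                ofmaps[m][e][f] = acc
    return ofmaps


# Tiny example: a 2-channel 3x3 input and a single 2x2x2 filter of ones.
ifmaps = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
          [[1, 1, 1], [1, 1, 1], [1, 1, 1]]]
filters = [[[[1, 1], [1, 1]], [[1, 1], [1, 1]]]]
ofmaps = conv_layer(ifmaps, filters)
print(ofmaps)  # [[[16, 20], [28, 32]]]
```

Every output value is a sum over all input channels, which is why the filters are 3D even though each output feature map is 2D.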
C. Performance Comparison
Table I shows the hardware specification and measured performance of the two designs for feature extraction. For CNN, Eyeriss is programmed to run two CNN models (the five CONV layers of AlexNet and the thirteen CONV layers of VGG-16) to demonstrate the hardware performance variations of running different CNN features. Both chip designs use around 1000 kgates and 150 kB of memory. While Eyeriss achieves approximately the same computation throughput (i.e., GOPS) as the HOG design when running AlexNet features, HOG processes 35× more input pixels per second; the gap is even larger between HOG and VGG-16 features. This is due to the differences in computational complexity (i.e., operations per pixel) between different features, as shown in Table II. However, the HOG design also consumes around 10× less power than Eyeriss. Thus, the HOG hardware consumes 311× and 13,486× less energy per pixel than Eyeriss running AlexNet and VGG-16 features, respectively. In terms of energy per operation, the HOG hardware consumes about 10× less than Eyeriss. In Section III, we will discuss the cause of this energy gap.

TABLE I
HARDWARE SPECIFICATION AND MEASURED PERFORMANCE OF HAND-CRAFTED FEATURE HOG [7] AND LEARNED FEATURE CNN [8].

                               [7]                  [8]              [8]
Implemented Feature            HOG                  CNN (AlexNet)    CNN (VGG-16)
Technology                     …                    …                …
Gate Counts (kgates)           …                    …                …
Memory (kB)                    …                    …                …
Multiplier Bitwidth            … × 11 – 22 × 22     16 × …           16 × …
Throughput (Mpixels/s)         …                    …                …
Throughput (GOPS)              …                    …                …
Power (mW)                     …                    …                …
DRAM access (B/pixel)          …                    …                …
Energy Efficiency (nJ/pixel)   …                    …                …
Energy Efficiency (GOPS/W)     …                    …                …

TABLE II
COMPUTATIONAL COMPLEXITY COMPARISON BETWEEN FEATURES

Feature                        GOP/Mpixel   Ratio
Hand-crafted  HOG              0.7          1.0×
Learned       CNN (AlexNet)    25.8         36.9×
Learned       CNN (VGG-16)     610.3        871.9×

D. Accuracy vs. Energy Efficiency
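Before turning to accuracy, the ratios quoted in Section II-C can be cross-checked with a few lines of arithmetic. The sketch below recomputes the Table II ratios from the GOP/Mpixel column, then divides the quoted energy-per-pixel gaps (311× and 13,486×) by those ratios to estimate what remains of each gap once the difference in operation count is factored out. The variable names are illustrative, and the per-operation figures are derived here rather than reported measurements.

```python
# GOP/Mpixel values from Table II.
gop_per_mpixel = {"HOG": 0.7, "AlexNet": 25.8, "VGG-16": 610.3}

# Energy-per-pixel gap relative to HOG, as quoted in Section II-C.
energy_per_pixel_gap = {"AlexNet": 311.0, "VGG-16": 13486.0}

# Operations-per-pixel ratio of each feature relative to HOG.
compute_ratio = {
    name: ops / gop_per_mpixel["HOG"] for name, ops in gop_per_mpixel.items()
}

# Residual gap = energy-per-pixel gap / operations-per-pixel ratio.
per_op_gap = {
    name: gap / compute_ratio[name] for name, gap in energy_per_pixel_gap.items()
}

for name, ratio in compute_ratio.items():
    print(f"{name}: {ratio:.1f}x operations per pixel")
for name, gap in per_op_gap.items():
    print(f"{name}: about {gap:.1f}x energy per operation")
```

The recomputed ratios match Table II (36.9× and 871.9×), and the residual factors come out at roughly 8× for AlexNet and 15× for VGG-16, which is consistent with the statement that the HOG hardware uses about 10× less energy per operation.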
While there is a large energy gap between the hardware for hand-crafted and learned features, the performance differences in applications, e.g., accuracy, between the features should also be taken into account. To measure accuracy, we use the features for object detection, which localizes and classifies objects in provided images (i.e., the outputs are the class and coordinates of each object in the images). The mean average precision (mAP) metric is used to quantify detection accuracy.

Fig. 4 shows the trade-off between detection accuracy and energy consumption. Note that the vertical axis is logarithmic. All reported detection accuracy numbers are measured on the PASCAL VOC 2007 dataset [12], which is a widely used image dataset containing 20 different object classes in 9,963 images. Matching the detection accuracy of HOG features requires only the features extracted from the first three CONV layers of AlexNet instead of all five [13], but this comes at the cost of 100× higher energy consumption per pixel to generate the features. Fortunately, mAP can be nearly doubled by using the output of all five layers of AlexNet with a minimal increase in energy consumption (only 22%). Even higher mAP has been demonstrated by using VGG-16 features [14]. However, the accompanying energy consumption per pixel becomes four orders of magnitude higher than using HOG features. If we look at the minimum hardware energy consumption required for a given accuracy in Fig. 4, we see that, roughly, a linear increase in detection accuracy requires an exponential increase in energy consumption per pixel.

Fig. 4. Energy versus accuracy trade-off comparison for hand-crafted and learned features.

III. HARDWARE IMPLEMENTATION CONSIDERATIONS
One of the key factors that differentiates the two types of features in hardware implementation is programmability, which defines the flexibility of a feature to be configured to deal with different data statistics or different tasks. While hand-crafted features can achieve a certain degree of invariance to data variations (e.g., images with different exposures), they are mostly designed for very specific tasks with known data statistics, leaving little to be programmed when deployed. In contrast, learned features isolate algorithm design from learning the data statistics, which leaves room for programmability to take advantage of the flexibility. We categorize programmability into two types: Programmability of Hyper-Parameters (PoHP) and Programmability of Parameters (PoP).
A. Programmability of Hyper-Parameters (PoHP)
Hyperparameters refer to the number and/or dimensionality of parameters of a feature, such as the number of layers and size of filters in a CNN, or the number of histogram bins in HOG. They are usually determined at design time by using heuristics or through experiments.
Advantages: For learned features, changes in hyperparameters with proper (re-)training can result in significant performance improvements. As a result, hardware that supports PoHP can easily trade off computational complexity for higher accuracy. For instance, a CNN object detection system with PoHP can choose between a lower-complexity, lower-accuracy model, e.g., AlexNet, and a higher-complexity, higher-accuracy model, e.g., VGG-16, according to the use case. For hand-crafted features, however, the impact of PoHP on performance is usually limited, since it can be hard to capture the changes in parameter dimensionality without redesign. As a result, PoHP is not commonly supported in the hardware implementation of hand-crafted features.
Energy Cost: Although PoHP can greatly benefit the performance of learned features, it comes at the price of lowered energy efficiency, since the hardware implementation must accommodate a wide range of possible configurations. First, it introduces irregular data movement. For example, a CNN processor that supports PoHP has to deal with filters of many shapes and sizes, which complicates the control logic, synchronization schemes between parallel blocks, and data tiling optimization. Second, it usually requires higher dynamic range and resolution for data representation, which has a negative impact on both computation and data movement. The bitwidth of the datapath and the bandwidth requirement of the memory both need to be designed for the worst-case scenario, which penalizes the average energy efficiency.
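To make the range of configurations concrete, the sketch below computes the parameter and multiply-accumulate (MAC) counts of a CONV layer directly from its hyperparameters. The layer shapes are the commonly cited AlexNet CONV configuration (with the two-group split in CONV2, CONV4, and CONV5) and a 227 × 227 input; these are assumptions brought in for illustration and are not listed in this paper. The `ConvShape` class and the `conv_layer_cost` helper are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConvShape:
    filters: int    # number of 3D filters (output channels)
    channels: int   # input channels seen by each filter
    size: int       # filter height = width
    out_size: int   # output feature map height = width


def conv_layer_cost(shape):
    """Return (weights, biases, MACs) for one CONV layer."""
    weights = shape.filters * shape.channels * shape.size ** 2
    macs = weights * shape.out_size ** 2
    return weights, shape.filters, macs


# Assumed AlexNet CONV shapes (not taken from this paper).
alexnet = [
    ConvShape(96, 3, 11, 55),
    ConvShape(256, 48, 5, 27),
    ConvShape(384, 256, 3, 13),
    ConvShape(384, 192, 3, 13),
    ConvShape(256, 192, 3, 13),
]

total_params = 0
total_macs = 0
for shape in alexnet:
    weights, biases, macs = conv_layer_cost(shape)
    total_params += weights + biases
    total_macs += macs

# Two operations (multiply and add) per MAC, normalized to input pixels.
input_pixels = 227 * 227
gop_per_mpixel = 2 * total_macs / input_pixels / 1000

print(total_params)              # weights + biases over the five layers
print(round(gop_per_mpixel, 1))  # operations per input pixel, in GOP/Mpixel
```

Under these assumptions the parameter total is 2,334,080, in line with the 2334k CONV weights quoted in Section IV, and the operation count works out to about 25.8 GOP/Mpixel, in line with Table II. The filter size alone varies from 11 × 11 to 3 × 3 across the five layers, which is the kind of variation a datapath supporting PoHP has to handle.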
B. Programmability of Parameters (PoP)
Parameters are the actual coefficients, such as the filter weights, that are used in the computation. For learned features, parameters are learned through training; for hand-crafted features, they are usually carefully designed according to the data statistics or pre-determined based on the desired properties. For instance, in HOG, the filter mask for the gradient extraction is simply [−1, 0, 1].

Advantages: With PoP, learned features can adapt to new data by simply retraining the parameters. For example, AlexNet has been shown to work for both RGB images and depth images [15]. For hand-crafted features, however, there is no PoP since all parameters are fixed at design time.
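The difference between a fixed and a programmable parameter can be sketched in a few lines. The example assumes the centered 1-D gradient mask [−1, 0, 1] commonly used for HOG [2]. With those coefficients hard-wired, the filter reduces to one subtraction per output, whereas a filter with programmable weights needs the weights supplied as data and one multiplication per tap. Both function names are hypothetical.

```python
def gradient_fixed(row):
    """Horizontal gradient with the mask [-1, 0, 1] hard-wired.

    No multiplications: each output is a single subtraction.
    Border pixels are skipped for simplicity.
    """
    return [row[x + 1] - row[x - 1] for x in range(1, len(row) - 1)]


def filter_programmable(row, weights):
    """Same 1-D filtering, but with the weights treated as data."""
    taps = len(weights)
    return [
        sum(weights[k] * row[x + k] for k in range(taps))
        for x in range(len(row) - taps + 1)
    ]


row = [10, 12, 15, 15, 9]
print(gradient_fixed(row))                   # [5, 3, -6]
print(filter_programmable(row, [-1, 0, 1]))  # [5, 3, -6]
```

Both calls give the same result, but only the second has to fetch its weights and multiply by them.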
Energy Cost: PoP also negatively impacts the hardware energy efficiency since the parameters have to be treated as data instead of fixed values during hardware implementation. This not only increases the required memory capacity and data movement, but also complicates the datapath design. In the case of CNN, the number of parameters is usually too large to fit on-chip, which increases the accesses to energy-consuming off-chip memory, such as DRAM. In contrast, hand-crafted features can be greatly optimized for energy efficiency by hard-wiring the fixed parameters in datapaths or ROM. In the case of HOG, multiplications between the input images and the gradient filter can be completely avoided.

IV. CLOSING THE ENERGY GAP

In Section II, we have shown the energy efficiency results of hardware implementations from two extremes: the HOG hand-crafted feature with no PoP or PoHP, and the CNN learned feature with both PoP and PoHP. The latter consumes an order of magnitude higher normalized energy per pixel than the former. A simple approach to reduce this energy gap is to remove all programmability in the hardware implementation of CNN, but this is not straightforward.

For example, the CONV layer weights from AlexNet can either be hard-wired in the multipliers, or stored in on-chip SRAM or ROM. This is not feasible if the available hardware resources are constrained to the level of the HOG design [7], i.e., 1000 kgates with 150 kB SRAM. Assuming each input and weight value takes 1 byte, only 10k multipliers with fixed weights can be implemented in 1000 kgates, and only 150k weight values can fit in the SRAM. This number of multipliers cannot even fit 1% of AlexNet's 2334k weights in the CONV layers, and 15× larger memory is required to store all weights. In this section, we will discuss some techniques that can be applied to reduce the energy efficiency gap.

TABLE III
CNN ENERGY AND MEMORY SAVINGS USING DIFFERENT TECHNIQUES.
*MEASURED ON ALEXNET **ASSUMING A …-BIT BASELINE

Method                              Energy      Memory Size
Reduced precision [17, 22]          2.56×       …×
Sparsity by pruning [25]            3.7×        …×*
Data Compression [22]               -           > …×**
Energy optimized dataflow [27]      1.4–2.5×    -

Reduced Precision: Reducing data precision, an active research area for deep neural networks (DNNs) [16], directly reduces the worst-case bitwidth requirement when supporting PoHP. Specifically, 8-bit integer precision has become popular in recent DNN hardware [17]. Non-uniform quantization [18] and bitwidth reduction to 1 bit [19–21] have also been demonstrated. Energy efficiency can also be improved if the hardware can adapt to the needs of the actual data. Custom datapath designs that adapt to the lower data precision show 2.56× energy savings in [22].

Sparsity: Increasing data sparsity reduces the intrinsic amount of computation and data movement, which improves energy efficiency. For CNN, the number of weights can be greatly reduced without reducing accuracy through pruning [23–25]. Specifically, [25] has shown that the number of CONV layer weights in AlexNet can be reduced from 2334k to 352k, which is only double the memory capacity used in HOG; furthermore, it reduces energy by 3.7×. Specialized hardware designs can be used to exploit sparsity for increased speed or reduced energy consumption [8, 18, 26].

Data Compression: Sparsity also suggests opportunities for compression, which saves memory space and data transfer bandwidth. Many lightweight lossless compression schemes, such as run-length coding [8] and Huffman coding [18], have been proposed to reduce the amount of off-chip data [8, 18, 22].
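As an illustration of how sparsity translates into compression, the following sketch run-length encodes the zeros in a sequence of values, storing each value together with the number of zeros that precede it. It is a simplified stand-in and not the exact format used in [8] or [18]; the function names and the `max_run` limit are assumptions.

```python
def rlc_encode(values, max_run=31):
    """Encode a sequence as (zero_run, value) pairs.

    zero_run counts the zeros that precede value. A run longer than
    max_run is split by emitting a (max_run, 0) pair, which stands for
    max_run zeros followed by one more zero.
    """
    pairs = []
    run = 0
    for v in values:
        if v == 0 and run < max_run:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:
        # Trailing zeros: the last zero is stored as the value itself.
        pairs.append((run - 1, 0))
    return pairs


def rlc_decode(pairs):
    """Invert rlc_encode: expand each pair into zeros plus the value."""
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        out.append(v)
    return out


data = [0, 0, 5, 0, 0, 0, 7, 3, 0, 0]
encoded = rlc_encode(data)
print(encoded)  # [(2, 5), (3, 7), (0, 3), (1, 0)]
assert rlc_decode(encoded) == data
```

The ten input values shrink to four pairs here; the sparser the data, the fewer pairs are needed, while dense data gains nothing from this scheme.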
Energy Optimized Dataflow: PoHP incurs irregular data movement, preventing memory access optimization during hardware design. Therefore, designing a hardware architecture that can adapt to the irregular data movement becomes critical to high energy efficiency, since data movement often consumes more energy than computation. Eyeriss demonstrates a reconfigurable architecture that can optimize data movement for various CNNs with a row stationary dataflow, and achieves 1.4× to 2.5× higher energy efficiency than existing designs [27].

Table III summarizes the energy and memory savings using the discussed techniques. Combining them all can potentially deliver an order of magnitude reduction. Taking into account the fundamental computation gap discussed in Section II, this reduction has the potential of closing the gap between HOG and CNN features.

V. CONCLUSION
The CNN learned features outperform the HOG hand-crafted features in visual object classification and detection tasks. This paper compares two chip designs for CNN and HOG to better understand the energy discrepancy between the two approaches and provide insight about the potential optimizations. Although learned features achieve more than 2× the accuracy, this comes at a large 311× to 13,486× overhead in energy consumption. While a fundamental computation overhead exists, another order of magnitude of the gap is mainly caused by the fact that the CNN architecture is programmable. A simple approach of removing all programmability in CNN and hard-wiring all multiplications does not work due to the significant area cost (i.e., logic gates and on-chip memory). Combining the techniques highlighted in the paper can potentially reduce the energy and memory sizes by an order of magnitude, and help reduce the gap between learned and hand-crafted features.

REFERENCES
[1] C. C. Ju et al., “A 0.5nJ/pixel 4K H.265/HEVC codec LSI for multi-format smartphone applications,” in ISSCC, pp. 1–3, Feb 2015.
[2] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
[3] D. G. Lowe, “Object recognition from local scale-invariant features,” in ICCV, 1999.
[4] Y. LeCun, K. Kavukcuoglu, and C. Farabet, “Convolutional networks and applications in vision,” in ISCAS, 2010.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in NIPS, 2012.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” in CVPR, 2014.
[7] A. Suleiman, Z. Zhang, and V. Sze, “A 58.6 mW real-time programmable object detector with multi-scale multi-object support using deformable parts model on 1920 × …,” in Sym. on VLSI, 2016.
[8] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” in ISSCC, 2016.
[9] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester, “Discriminatively trained deformable part models, release 5,” 2012.
[10] N. Cristianini and J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.
[11] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in ICLR, …
[12] …
[13] … in BMVC, 2014.
[14] R. Girshick, “Fast R-CNN,” in ICCV, 2015.
[15] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik, “Learning rich features from RGB-D images for object detection and segmentation,” in ECCV, 2014.
[16] P. Gysel, M. Motamedi, and S. Ghiasi, “Hardware-oriented Approximation of Convolutional Neural Networks,” in ICLR, 2016.
[17] S. Higginbotham, “Google Takes Unconventional Route with Homegrown Machine Learning Chips,” May 2016.
[18] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: efficient inference engine on compressed deep neural network,” in ISCA, 2016.
[19] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in NIPS, 2015.
[20] M. Courbariaux and Y. Bengio, “Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.
[21] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks,” in ECCV, 2016.
[22] B. Moons and M. Verhelst, “A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets,” in Sym. on VLSI, 2016.
[23] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal Brain Damage,” in NIPS, 1990.
[24] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both Weights and Connections for Efficient Neural Network,” in NIPS, 2015.
[25] T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning,” in CVPR, 2017.
[26] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: ineffectual-neuron-free deep neural network computing,” in ISCA, 2016.
[27] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” in …