[PDF] Towards Closing the Energy Gap Between HOG and CNN Features for Embedded Vision

Abstract

Computer vision enables a wide range of applications in robotics/drones, self-driving cars, smart Internet of Things, and portable/wearable electronics. For many of these applications, local embedded processing is preferred due to privacy and/or latency concerns. Accordingly, energy-efficient embedded vision hardware delivering real-time and robust performance is crucial. While deep learning is gaining popularity in several computer vision algorithms, a significant energy consumption difference exists compared to traditional hand-crafted approaches. In this paper, we provide an in-depth analysis of the computation, energy and accuracy trade-offs between learned features such as deep Convolutional Neural Networks (CNN) and hand-crafted features such as Histogram of Oriented Gradients (HOG). This analysis is supported by measurements from two chips that implement these algorithms. Our goal is to understand the source of the energy discrepancy between the two approaches and to provide insight about the potential areas where CNNs can be improved and eventually approach the energy-efficiency of HOG while maintaining its outstanding performance accuracy.

Full PDF

Towards Closing the Energy Gap Between HOG and CNN Features for Embedded Vision (Invited Paper)

Amr Suleiman*, Yu-Hsin Chen*, Joel Emer, Vivienne Sze

Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
{suleiman, yhchen, jsemer, sze}@mit.edu
*These authors contributed equally to this work

I. INTRODUCTION

Computer vision (CV) is a critical technology for numerous smart embedded systems, such as advanced driver assistance systems, autonomous cars/drones, and robotics. It extracts meaningful information from visual data for further decision making. However, many modern CV algorithms have high computational complexity, which makes their deployment on battery-powered devices challenging due to tight energy constraints. Near-sensor visual data processors should consume under 1 nJ/pixel with a logic gate count of around 1000 kgates and a memory capacity of a few hundred kBytes in order to be comparable with video codecs, which are present in most cameras [1]. For many applications, offloading computation to the cloud is also undesirable because of latency, connectivity, and security limitations. Thus, dedicated energy-efficient CV hardware becomes crucial.

Feature extraction is the first processing step in most CV tasks, such as object classification and detection (Fig. 1). It transforms the raw pixels into a high-dimensional space, where only meaningful and distinctive information is kept. Traditionally, features are designed by experts in the field through a hand-crafted process. For instance, many well-known hand-crafted features use image gradients, such as histogram of oriented gradients (HOG) [2] and scale invariant feature transform (SIFT) [3], based on the fact that human eyes are sensitive to edges. In contrast, learned features learn a representation with the desired characteristics directly from data using deep convolutional neural networks (CNNs) [4]. Learned features are gaining popularity, as they are outperforming hand-crafted features in many CV tasks [5, 6].

Fig. 1. General processing pipeline for object classification and detection: image pixels pass through feature extraction (hand-crafted features, e.g., HOG, or learned features, e.g., CNN) to produce features x, which are classified with trained weights w (w^T x) into scores per class; the class is selected based on the maximum score or a threshold.

The differences in design between hand-crafted and learned features lead not only to different performance in applications, but also to different hardware implementation considerations, which have strong implications for energy efficiency. In general, hardware implementations for hand-crafted features are widely understood to be more energy-efficient than those for learned features. However, there is no analysis that explains the energy gap between the two types of features. It is also an open question whether the energy gap can be closed in the future.

In this paper, we provide an in-depth analysis of the causes of the energy gap between hand-crafted and learned features. We use results from two actual chip designs: [7] implements the hand-crafted feature using HOG, and [8] implements the learned feature using CNN. Both chips use 65nm CMOS technology and have similar hardware resource utilization in terms of logic gate count and memory capacity. Based on the insights derived from the two implementations, we discuss techniques to help close the energy gap.

II. FEATURE EXTRACTION HARDWARE IMPLEMENTATIONS

A. Hardware for Hand-crafted Feature: HOG

The chip presented in [7] implements the entire object detection pipeline based on deformable parts models (DPM) [9] for high-throughput and low-power embedded vision applications. DPM extracts HOG features from the input image, and localizes objects by sweeping the features with support vector machine (SVM) classifiers [10]. Fig. 2 shows the feature extraction process using HOG: the image is divided into non-overlapping 8 × […]

Fig. 2. Feature extraction with histogram of oriented gradients (HOG).

Fig. 3. General processing pipeline of CNN: a modern deep CNN stacks CONV layers (convolution, non-linearity, normalization, pooling) that produce low-level to high-level features, followed by FC layers that output the classes.

B. Hardware for Learned Feature: CNN

The second chip presented in [8], called Eyeriss, is an energy-efficient accelerator for deep CNNs. Fig. 3 shows a general CNN processing pipeline, consisting mainly of a series of convolutional (CONV) layers. In each CONV layer, a collection of 3D filters is applied to the input images or feature maps to generate output feature maps, which are then used as the input to the next CONV layer. Eyeriss is programmable in terms of the size and shape of filters as well as the number of layers. Therefore, it can accommodate different CNN architectures, such as AlexNet [5] and VGG-16 [11].
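As a rough illustration of the CONV layer computation described above, the following minimal Python sketch applies a set of 3D filters to a stack of input feature maps and feeds the result into a second layer. The function name `conv_layer`, the nested-list data layout, and the toy sizes are assumptions made for illustration only; bias, non-linearity, normalization, and pooling are omitted, and the loop order says nothing about how Eyeriss actually schedules the work.

```python
def conv_layer(ifmap, filters, stride=1):
    """Naive CONV layer (illustrative only).

    ifmap:   input feature maps, indexed [channel][row][col]
    filters: 3D filters, indexed [filter][channel][row][col]
    returns: output feature maps, indexed [filter][row][col]
    """
    channels, height, width = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    num_filters = len(filters)
    f_rows, f_cols = len(filters[0][0]), len(filters[0][0][0])
    out_rows = (height - f_rows) // stride + 1
    out_cols = (width - f_cols) // stride + 1

    ofmap = [[[0] * out_cols for _ in range(out_rows)] for _ in range(num_filters)]
    for m in range(num_filters):
        for y in range(out_rows):
            for x in range(out_cols):
                acc = 0
                # One output value accumulates over all channels of one 3D filter.
                for c in range(channels):
                    for r in range(f_rows):
                        for s in range(f_cols):
                            acc += ifmap[c][y * stride + r][x * stride + s] * filters[m][c][r][s]
                ofmap[m][y][x] = acc
    return ofmap


# Toy example: one 3x3 input channel, one 2x2 filter per layer.
ifmap = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
filters1 = [[[[1, 1], [1, 1]]]]
layer1 = conv_layer(ifmap, filters1)    # output feature maps of the first layer
filters2 = [[[[1, 1], [1, 1]]]]
layer2 = conv_layer(layer1, filters2)   # first-layer output is the next layer's input
print(layer1, layer2)
```

Changing the filter count, filter size, or number of chained calls changes the amount of computation, which is the kind of flexibility the accelerator has to support.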

C. Performance Comparison

Table I shows the hardware specification and measured performance of the two designs for feature extraction. For CNN, Eyeriss is programmed to run two CNN models (five CONV layers of AlexNet and thirteen CONV layers of VGG-16) to demonstrate the hardware performance variations of running different CNN features. Both chip designs use around 1000 kgates and 150 kB memory. While Eyeriss achieves approximately the same computation throughput (i.e., GOPS) as the HOG design when running AlexNet features, HOG processes 35× more input pixels per second; the gap is even larger between HOG and VGG-16 features. This is due to the differences in computational complexity (i.e., operations per pixel) between different features, as shown in Table II. However, the HOG design also consumes around 10× less power than Eyeriss. Thus, the HOG hardware consumes 311× and 13,486× less energy per pixel than Eyeriss running AlexNet and VGG-16 features, respectively. In terms of energy per operation, the HOG hardware consumes 10× less than Eyeriss. In Section III, we will discuss the cause of this energy gap.

TABLE I
HARDWARE SPECIFICATION AND MEASURED PERFORMANCE OF HAND-CRAFTED FEATURE HOG [7] AND LEARNED FEATURE CNN [8].

Columns (Implemented Feature): HOG [7]; CNN (AlexNet) [8]; CNN (VGG-16) [8]
Rows: Technology; Gate Counts (kgates); Memory (kB); Multiplier Bitwidth; Throughput (Mpixels/s); Throughput (GOPS); Power (mW); DRAM access (B/pixel); Energy Efficiency (nJ/pixel); Energy Efficiency (GOPS/W)

TABLE II
COMPUTATIONAL COMPLEXITY COMPARISON BETWEEN FEATURES

Feature                        GOP/Mpixel    Ratio
Hand-crafted  HOG              0.7           1.0×
Learned       CNN (AlexNet)    25.8          36.9×
Learned       CNN (VGG-16)     610.3         871.9×

D. Accuracy vs. Energy Efficiency

While there is a large energy gap between the hardware for hand-crafted and learned features, the performance differences in applications, e.g., accuracy, between the features should also be taken into account. To measure accuracy, we use the features for object detection, which localizes and classifies objects in provided images (i.e., the outputs are the class and coordinates of each object in the images). The mean average precision (mAP) metric is used to quantify detection accuracy.

Fig. 4 shows the trade-off between detection accuracy and energy consumption. Note that the vertical axis is logarithmic. All reported detection accuracy numbers are measured on the PASCAL VOC 2007 dataset [12], a widely used image dataset containing 20 different object classes in 9,963 images. Matching the detection accuracy of HOG features requires only the features extracted from the first three CONV layers of AlexNet instead of all five [13], but this comes at the cost of 100× higher energy consumption per pixel to generate the features. Fortunately, mAP can be nearly doubled by using the output of all five layers of AlexNet with a minimal increase in energy consumption (only 22%). Even higher mAP has been demonstrated by using VGG-16 features [14]. However, the accompanying energy consumption per pixel becomes four orders of magnitude higher than using HOG features. If we look at the minimum hardware energy consumption required for a given accuracy in Fig. 4, we roughly see that a linear increase in detection accuracy requires an exponential increase in energy consumption per pixel.

Fig. 4. Energy versus accuracy trade-off comparison for hand-crafted and learned features.

III. HARDWARE IMPLEMENTATION CONSIDERATIONS

One of the key factors that differentiates the two types of features in hardware implementation is programmability, which defines the flexibility of a feature to be configured to deal with different data statistics or different tasks. While hand-crafted features can achieve a certain degree of invariance to data variations (e.g., images with different exposures), they are mostly designed for very specific tasks with known data statistics, leaving little to be programmed when deployed. In contrast, learned features isolate algorithm design from learning the data statistics, which leaves room for programmability to take advantage of the flexibility. We categorize programmability into two types: Programmability of Hyper-Parameters (PoHP) and Programmability of Parameters (PoP).

A. Programmability of Hyper-Parameters (PoHP)

Hyperparameters refer to the number and/or dimensionality of parameters of a feature, such as the number of layers and size of filters in a CNN, or the number of histogram bins in HOG. They are usually determined at design time by using heuristics or through experiments.
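To make the link between hyperparameters and parameter count concrete, the short Python sketch below derives the number of CONV-layer parameters from the layer shapes alone. The shapes in `ALEXNET_CONV` are an assumption taken from the published AlexNet architecture [5] (CONV2, CONV4, and CONV5 are two-group layers, so each filter sees half of the input channels), and the helper name `conv_params` is hypothetical. With one bias per filter included, the total rounds to the 2334k CONV weights quoted in Section IV.

```python
# (layer, number of filters, channels per filter, filter height, filter width)
# Assumed AlexNet CONV-layer hyperparameters; not listed in this paper.
ALEXNET_CONV = [
    ("CONV1", 96, 3, 11, 11),
    ("CONV2", 256, 48, 5, 5),
    ("CONV3", 384, 256, 3, 3),
    ("CONV4", 384, 192, 3, 3),
    ("CONV5", 256, 192, 3, 3),
]


def conv_params(num_filters, channels, rows, cols, bias=True):
    """Parameters of one CONV layer: filter weights plus one optional bias per filter."""
    weights = num_filters * channels * rows * cols
    return weights + (num_filters if bias else 0)


per_layer = {name: conv_params(m, c, r, s) for name, m, c, r, s in ALEXNET_CONV}
total_params = sum(per_layer.values())
print(per_layer)
print(total_params)  # 2334080, i.e. about 2334k
```

Changing any single hyperparameter, for example the filter size of CONV1, changes the storage and computation the hardware has to provide.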

Advantages: For learned features, changes in hyperparameters with proper (re-)training can result in significant performance improvements. As a result, hardware that supports PoHP can easily trade off computational complexity for higher accuracy. For instance, a CNN object detection system with PoHP can choose between a lower-complexity, lower-accuracy model, e.g., AlexNet, and a higher-complexity, higher-accuracy model, e.g., VGG-16, according to the use case. For hand-crafted features, however, the impact of PoHP on performance is usually limited, since it can be hard to capture the changes in parameter dimensionality without redesign. As a result, PoHP is not commonly supported in hardware implementations of hand-crafted features.

Energy Cost: Although PoHP can greatly benefit the performance of learned features, it comes at the price of lower energy efficiency, since the hardware implementation must accommodate a wide range of possible configurations. First, it introduces irregular data movement. For example, a CNN processor that supports PoHP has to deal with filters of many shapes and sizes, which complicates the control logic, the synchronization schemes between parallel blocks, and the data tiling optimization. Second, it usually requires higher dynamic range and resolution for data representation, which have a negative impact on both computation and data movement. The bitwidth of the datapath and the bandwidth of the memory both need to be designed for the worst-case scenario, which penalizes the average energy efficiency.
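The worst-case sizing argument can be sketched numerically. The Python snippet below estimates a conservative accumulator width for each CONV layer of a network: the sum of N products of b-bit inputs and b-bit weights fits in 2b + ceil(log2 N) bits. The filter shapes and the 16-bit operand width are assumptions chosen for illustration (the shapes follow the published AlexNet architecture [5]), the helper `accumulator_bits` is hypothetical, and real designs, including the chips discussed here, may truncate or round partial sums instead of keeping full precision.

```python
def accumulator_bits(data_bits, weight_bits, macs_per_output):
    """Conservative width that holds the sum of `macs_per_output` products without overflow."""
    product_bits = data_bits + weight_bits
    growth_bits = (macs_per_output - 1).bit_length()  # equals ceil(log2(N)) for N >= 1
    return product_bits + growth_bits


# Assumed (channels, rows, cols) of one filter in each AlexNet CONV layer.
FILTER_SHAPES = {
    "CONV1": (3, 11, 11),
    "CONV2": (48, 5, 5),
    "CONV3": (256, 3, 3),
    "CONV4": (192, 3, 3),
    "CONV5": (192, 3, 3),
}

widths = {
    name: accumulator_bits(16, 16, c * r * s)
    for name, (c, r, s) in FILTER_SHAPES.items()
}
worst_case = max(widths.values())
print(widths)
print(worst_case)  # 44: a full-precision datapath would be sized for CONV3
```

Under these assumptions a datapath that supports every layer carries the 44-bit worst case even while running CONV1, which needs only 41 bits; this is one way in which supporting many configurations costs average efficiency.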

B. Programmability of Parameters (PoP)

Parameters are the actual coefficients, such as the filter weights, that are used in the computation. For learned features, parameters are learned through training; for hand-crafted features, they are usually carefully designed according to the data statistics or pre-determined based on the desired properties. For instance, in HOG, the filter mask for the gradient extraction is simply [−1 0 1].

Advantages: With PoP, learned features can adapt to new data by simply retraining the parameters. For example, AlexNet has been shown to work for both RGB images and depth images [15]. For hand-crafted features, however, there is no PoP since all parameters are fixed at design time.
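The difference between fixed and programmable parameters can be seen in a small Python sketch. It assumes the centered [-1, 0, 1] derivative mask commonly used for HOG [2]; the function names and the sample pixel row are hypothetical. With the mask fixed at design time, each gradient value reduces to a single subtraction, whereas a programmable filter must fetch every coefficient and multiply by it.

```python
def gradient_x_fixed(row):
    """Horizontal gradient with the mask [-1, 0, 1] hard-wired: one subtraction per pixel."""
    return [row[i + 1] - row[i - 1] for i in range(1, len(row) - 1)]


def gradient_x_programmable(row, mask):
    """Same filter with the mask treated as data: one multiply-accumulate per tap."""
    taps = len(mask)
    return [
        sum(mask[j] * row[i + j] for j in range(taps))
        for i in range(len(row) - taps + 1)
    ]


row = [10, 12, 15, 15, 9]           # hypothetical pixel intensities
fixed = gradient_x_fixed(row)
programmable = gradient_x_programmable(row, [-1, 0, 1])
print(fixed, programmable)          # both give [5, 3, -6]
```

Both versions produce the same result here, but only the programmable one needs storage for the mask and a multiplier per tap.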

Energy Cost: PoP also negatively impacts hardware energy efficiency since the parameters have to be treated as data instead of fixed values during hardware implementation. This not only increases the required memory capacity and data movement, but also complicates the datapath design. In the case of CNN, the number of parameters is usually too large to fit on-chip, which increases accesses to energy-consuming off-chip memory, such as DRAM. In contrast, hand-crafted features can be greatly optimized for energy efficiency by hard-wiring the fixed parameters in datapaths or ROM. In the case of HOG, multiplications between the input images and the gradient filter can be completely avoided.

IV. CLOSING THE ENERGY GAP

In Section II, we have shown the energy efficiency results of hardware implementations from two extremes: the HOG hand-crafted feature with no PoP or PoHP, and the CNN learned feature with both PoP and PoHP. The latter consumes an order of magnitude higher normalized energy per pixel than the former. A simple approach to reduce this energy gap is to remove all programmability in the hardware implementation of CNN, but this is not straightforward.

For example, the CONV layer weights from AlexNet can either be hard-wired in the multipliers, or stored in on-chip SRAM or ROM. This is not feasible if the available hardware resources are constrained to the level of the HOG design [7], i.e., 1000 kgates with 150 kB SRAM. Assuming each input and weight value takes 1 byte, only 10k multipliers with fixed weights can be implemented in 1000 kgates, and only 150k weight values can fit in the SRAM. This number of multipliers cannot even fit 1% of AlexNet's 2334k weights in the CONV layers, and 15× larger memory is required to store all weights.

In this section, we will discuss some techniques that can be applied to reduce the energy efficiency gap.

TABLE III
CNN ENERGY AND MEMORY SAVINGS USING DIFFERENT TECHNIQUES.
*MEASURED ON ALEXNET  **ASSUMING A …-BIT BASELINE

Method                                 Energy       Memory Size
Reduced precision [17, 22]             2.56×        …×
Sparsity by pruning [25]               3.7×         …×
*Data Compression [22]                 -            >…×
**Energy optimized dataflow [27]       1.4–2.5×     -

Reduced Precision: Reducing data precision is an active research area for DNNs [16], and it directly reduces the worst-case bitwidth requirement when supporting PoHP. Specifically, 8-bit integer precision has become popular in recent DNN hardware [17]. Non-uniform quantization [18] and bitwidth reduction to 1-bit [19–21] have also been demonstrated. Energy efficiency can also be improved if the hardware can adapt to the needs of the actual data. Custom datapath designs that adapt to the lower data precision show 2.56× energy savings in [22].

Sparsity: Increasing data sparsity reduces the intrinsic amount of computation and data movement, which improves energy efficiency. For CNN, the number of weights can be greatly reduced without reducing accuracy through pruning [23–25]. Specifically, [25] has shown that the number of CONV layer weights in AlexNet can be reduced from 2334k to 352k, which is only double the memory capacity used in HOG; furthermore, it reduces energy by 3.7×. Specialized hardware designs can be used to exploit sparsity for increased speed or reduced energy consumption [8, 18, 26].

Data Compression: Sparsity also suggests opportunities for compression, which saves memory space and data transfer bandwidth. Many lightweight lossless compression schemes, such as run-length coding [8] and Huffman coding [18], have been proposed to reduce the amount of off-chip data [8, 18, 22].
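As a hedged illustration of why sparse data compresses well, the Python sketch below implements a generic zero run-length code: each nonzero value is stored together with the number of zeros that precede it. The pair-based format, the `None` marker for trailing zeros, and the function names are assumptions made for this sketch; it is not the exact bit-level format used in [8] or any other cited design.

```python
def rle_encode(values):
    """Encode a sequence as (zero_run_length, nonzero_value) pairs.

    A trailing run of zeros is stored as (run_length, None).
    """
    pairs = []
    run = 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:
        pairs.append((run, None))
    return pairs


def rle_decode(pairs):
    """Invert rle_encode, restoring the original sequence exactly (lossless)."""
    values = []
    for run, v in pairs:
        values.extend([0] * run)
        if v is not None:
            values.append(v)
    return values


data = [0, 0, 5, 0, 0, 0, 3, 7, 0]   # hypothetical sparse feature-map row
encoded = rle_encode(data)
decoded = rle_decode(encoded)
print(encoded)   # [(2, 5), (3, 3), (0, 7), (1, None)]
```

The sparser the data, the fewer pairs are needed, so pruning and compression reinforce each other in reducing off-chip traffic.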

Energy Optimized Dataflow: PoHP incurs irregular data movement, preventing memory access optimization during hardware design. Therefore, designing a hardware architecture that can adapt to the irregular data movement becomes critical to high energy efficiency, since data movement often consumes more energy than computation. Eyeriss demonstrates a reconfigurable architecture that can optimize data movement for various CNNs with a row stationary dataflow, and achieves 1.4× to 2.5× higher energy efficiency than existing designs [27].

Table III summarizes the energy and memory savings using the discussed techniques. Combining them all can potentially deliver an order of magnitude reduction. Taking into account the fundamental computation gap discussed in Section II, this reduction has the potential of closing the gap between HOG and CNN features.

V. CONCLUSION

The CNN learned features outperform the HOG hand-crafted features in visual object classification and detection tasks. This paper compares two chip designs for CNN and HOG to better understand the energy discrepancy between the two approaches and to provide insight about the potential optimizations. Although learned features achieve more than 2× the accuracy, this comes at a large 311× to 13,486× overhead in energy consumption. While a fundamental computation overhead exists, another order of magnitude of the gap is mainly caused by the fact that the CNN architecture is programmable. A simple approach of removing all programmability in CNN and hard-wiring all multiplications does not work due to the significant area cost (i.e., logic gates and on-chip memory). Combining the techniques highlighted in the paper can potentially reduce the energy and memory sizes by an order of magnitude, and help reduce the gap between learned and hand-crafted features.

REFERENCES

[1] C. C. Ju et al., “A 0.5nJ/pixel 4K H.265/HEVC codec LSI for multi-format smartphone applications,” in ISSCC, pp. 1–3, Feb 2015.
[2] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
[3] D. G. Lowe, “Object recognition from local scale-invariant features,” in ICCV, 1999.
[4] Y. LeCun, K. Kavukcuoglu, and C. Farabet, “Convolutional networks and applications in vision,” in ISCAS, 2010.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in NIPS, 2012.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” in CVPR, 2014.
[7] A. Suleiman, Z. Zhang, and V. Sze, “A 58.6 mW real-time programmable object detector with multi-scale multi-object support using deformable parts model on 1920 × […] Sym. on VLSI, 2016.
[8] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” in ISSCC, 2016.
[9] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester, “Discriminatively trained deformable part models, release 5,” 2012.
[10] N. Cristianini and J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.
[11] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in ICLR […]
[…] BMVC, 2014.
[14] R. Girshick, “Fast R-CNN,” in ICCV, 2015.
[15] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik, “Learning rich features from RGB-D images for object detection and segmentation,” in ECCV, 2014.
[16] P. Gysel, M. Motamedi, and S. Ghiasi, “Hardware-oriented Approximation of Convolutional Neural Networks,” in ICLR, 2016.
[17] S. Higginbotham, “Google Takes Unconventional Route with Homegrown Machine Learning Chips,” May 2016.
[18] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: efficient inference engine on compressed deep neural network,” in ISCA, 2016.
[19] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in NIPS, 2015.
[20] M. Courbariaux and Y. Bengio, “Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.
[21] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks,” in ECCV, 2016.
[22] B. Moons and M. Verhelst, “A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets,” in Sym. on VLSI, 2016.
[23] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal Brain Damage,” in NIPS, 1990.
[24] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both Weights and Connections for Efficient Neural Network,” in NIPS, 2015.
[25] T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning,” in CVPR, 2017.
[26] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: ineffectual-neuron-free deep neural network computing,” in ISCA, 2016.
[27] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” in