Design Automation for Efficient Deep Learning Computing
Song Han, Han Cai, Ligeng Zhu, Ji Lin, Kuan Wang, Zhijian Liu, Yujun Lin
Massachusetts Institute of Technology
{songhan, hancai, ligeng, jilin, kuanwang, zhijian, yujunlin}@mit.edu

Abstract: Efficient deep learning computing requires algorithm and hardware co-design to enable specialization: we usually need to change the algorithm to reduce memory footprint and improve energy efficiency. However, the extra degree of freedom from the algorithm makes the design space much larger: it is not only about designing the hardware but also about how to tweak the algorithm to best fit the hardware. Human engineers can hardly exhaust the design space with heuristics; doing so is labor-intensive and sub-optimal. We propose design automation techniques for efficient neural networks. We investigate automatically designing specialized fast models, auto channel pruning, and auto mixed-precision quantization. We demonstrate that such learning-based, automated design achieves better performance and efficiency than rule-based human design. Moreover, we shorten the design cycle by 200× compared with previous work, so that we can afford to design specialized neural network models for different hardware platforms.

Index Terms: AutoML, Neural Architecture Search, Channel Pruning, Mixed-Precision, Quantization, Specialization, Efficient Deep Learning.
INTRODUCTION
Algorithm and hardware co-design plays an important role in efficient deep learning computing. Unlike optimizing on the SPEC2006 benchmark, where we can treat the algorithm as a black box, there is plenty of room at the algorithm level to improve the hardware efficiency of deep learning. We should open the box and explore model optimization techniques. The benefit usually comes from memory saving and locality improvement. For example, model compression techniques [1], including pruning and quantization, can drastically reduce the memory footprint and save energy. Another example is small model design: SqueezeNet [2] and MobileNet [3] have model sizes of only 4.8 MB and 4.2 MB respectively, which can fit in on-chip SRAM and improve locality.

However, efficient model design and compression have a large design space. Many different neural network architectures can lead to similar accuracy but drastically different hardware efficiency. This space is difficult to exhaust with rule-based heuristics, since there is a shortage of deep learning and hardware experts to hand-tune each model to make it run fast. There is a strong demand to study systematically how to design efficient neural networks under hardware constraints. We propose hardware-centric AutoML techniques that automatically design hardware-efficient neural networks [4, 5, 6]. Such joint optimization is systematic and transfers well between tasks. It requires less engineering effort while designing better neural networks at low cost.

We explore three aspects of neural network design automation (Figure 1): automatic design of specialized models, automatic channel pruning, and automatic mixed-precision quantization. Each aspect is summarized as follows.

There is plenty of specialized hardware for neural networks, but little research has been done on specializing the neural network architecture for a given hardware architecture (the reverse specialization). A specialized model has several advantages: it can fully utilize the parallelism of the target hardware (e.g., fitting the channel size to the PE size). Besides, a specialized model can fully utilize the on-chip buffer and improve locality and reuse. Specialization can also match the model's computation intensity with the hardware's roofline model. However, designing a specialized neural network architecture used to be difficult. First, there are limited heuristics. Second, the computation cost used to be prohibitive: even searching a model on the CIFAR-10 dataset consumes a very large number of GPU hours [7, 8]. We cut the search cost by two orders of magnitude (actually more than that, since we directly search on ImageNet). The search cost is reduced by two techniques, path-level pruning and path-level binarization, which save GPU hours and GPU memory. Cutting the search cost enables us to design a specialized model for the target task and the target hardware. On the mobile phone, our searched model [4] runs 1.8× faster than the best human-designed model [9].

After designing a specialized model, compression and pruning are effective techniques to further reduce the memory footprint [1]. Conventional model compression techniques rely on hand-crafted heuristics and rule-based policies that require domain experts to explore the large design space. We propose an automated design flow that leverages reinforcement learning to give the best model compression policy. This learning-based compression policy outperforms conventional rule-based compression policies by achieving a higher compression ratio, better preserving accuracy, and freeing human labor. We applied this automated, push-button compression pipeline to MobileNet and achieved a 1.95× speedup of measured inference latency on an Android phone and a 1.53× speedup on the Titan XP GPU, with only 0.1% loss of ImageNet Top-1 accuracy.
[Figure 1: three panels. (a) Automatic Model Specialization: (1) update weight parameters, (2) update architecture parameters; binary gates (0: prune, 1: keep) select among candidate paths (CONV 3x3, CONV 5x5, POOL 3x3, Identity, ...) so that only the active path's feature map is in memory. (b) Automatic Channel Pruning: an automatic learner prunes layers t-1, t, t+1 in turn (e.g. 50%). (c) Automatic Mixed-Precision Quantization: an automatic learner assigns per-layer bitwidths (e.g. 2b/4b, 4b/5b, 3b/6b), and a hardware simulator maps actions to latency and energy rewards.]

Fig. 1. Design automation for model specialization, channel pruning and mixed-precision quantization.
The last step is automatic mixed-precision quantization. Emerging DNN hardware accelerators are beginning to support flexible bitwidths (1-8 bits), which raises a great challenge: finding the optimal bitwidth for each layer requires domain experts to explore a vast design space, trading off among accuracy, latency, energy, and model size. Conventional quantization algorithms ignore the differences between hardware architectures and quantize all layers uniformly. We introduce an automated design flow for model quantization that takes the hardware accelerator's feedback into the design loop. Our framework can specialize the quantization policy for different hardware architectures. Compared with fixed-bitwidth (8-bit) quantization, it reduces latency by 1.4-1.95× and also lowers energy consumption, with negligible loss of accuracy.

AUTOMATED MODEL SPECIALIZATION
In order to fully utilize the hardware resources, we propose to search for a specialized CNN architecture for the given hardware. The resulting model is compact and runs fast. We start with a large design space (Figure 1(a)) that includes many candidate paths and learn which one is best by gradient descent, rather than hand-picking with rule-based heuristics. Instead of learning only the weight parameters, we jointly learn the architecture parameters (shown in red in Figure 1(a)). The architecture parameters determine the probability of choosing each path. The search space for each block i consists of many choices:

• ConvOp: mobile inverted bottleneck conv [9] with various kernel sizes and expansion ratios
  – Kernel size: {3×3, 5×5, 7×7}
  – Expansion ratio: {3, 6}
• ZeroOp: if ZeroOp is chosen at the i-th block, the block is skipped.

Therefore, the number of possible architectures in the design space is

$$[\underbrace{(3 \times 2)}_{\text{ConvOp}} + \underbrace{1}_{\text{ZeroOp}}]^N = 7^N,$$

where N is the number of blocks (21 in our experiments).

Given the vast design space, it is infeasible to rely on domain experts to manually design the CNN model for each hardware platform, so we need automatic architecture design techniques.

However, early reinforcement learning-based NAS methods [7, 8] are very expensive to run in terms of GPU hours, since they need to iteratively sample an architecture, train it from scratch, and update the meta-controller. Typically, tens of thousands of networks must be trained to find a good neural network architecture.

We adopt a different approach to improve the efficiency of model specialization [4]. We first build a super network that comprises all candidate architectures. Concretely, it has a structure similar to a CNN model in the design space, except that each specific operation is replaced with a mixed operation that has n parallel paths. Each path in a mixed operation corresponds to a candidate operation $o_i(\cdot)$, and we introduce an architecture parameter $\alpha_i$ for each path to learn which paths are redundant and can thereby be pruned (i.e., path-level pruning).

In the forward step, to save GPU memory, we allow only one candidate path to actively reside in GPU memory. This is achieved by hard-thresholding the probability of each candidate path to either 0 or 1 (i.e., path-level binarization). As such, the output of a mixed operation is given as

$$x_l = \sum_i g_i \, o_i(x_{l-1}), \tag{1}$$

where $g_i$ is sampled according to the multinomial distribution derived from the architecture parameters, i.e., $p_i = \mathrm{softmax}(\alpha_i; \alpha) = \exp(\alpha_i) / \sum_j \exp(\alpha_j)$.

In the backward step, we update the weight parameters of the active paths using standard gradient descent. Since the architecture parameters are not directly involved in the computational graph (Eq. 1), we use the gradient w.r.t. the binary gates to update the corresponding architecture parameters:

$$\frac{\partial L}{\partial \alpha_i} = \sum_{j=1}^{n} \frac{\partial L}{\partial p_j} \frac{\partial p_j}{\partial \alpha_i} \approx \sum_{j=1}^{n} \frac{\partial L}{\partial g_j} \frac{\partial p_j}{\partial \alpha_i}.$$

In order to specialize the model for hardware, we need to use the latency measured on the hardware as a design reward. However, directly measuring the inference latency has three problems: (i) it is slow; (ii) it has high variance due to varying battery conditions and thermal throttling; (iii) latency is non-differentiable and cannot be directly optimized. To address these, we present our latency prediction model and hardware-aware loss.

To build the latency model, we pre-compute the latency of each operator with all possible inputs and store the results in a lookup table, which we query during the search. The overall latency of the i-th block is the weighted sum of the latency of each operator:

$$E[\mathrm{LAT}_i] = \alpha \times F(\text{mb3 3x3}) + \beta \times F(\text{mb3 5x5}) + \sigma \times F(\text{identity}) + \dots + \zeta \times F(\text{mb6 7x7}), \qquad E[\mathrm{LAT}] = \sum_{i}^{N} E[\mathrm{LAT}_i], \tag{2}$$

where F(·) denotes the pre-computed latency of an operator and the coefficients are the weights of the corresponding candidate paths in block i. Then we combine the latency and the training loss (e.g., cross-entropy loss) using the following formula:

$$L = L_{CE} \times \alpha \log\left(\frac{E[\mathrm{LAT}]}{\mathrm{LAT}_{ref}}\right)^{\beta}, \tag{3}$$

where α and β in Eq. 3 are hyper-parameters controlling the trade-off between accuracy and latency (distinct from the path weights in Eq. 2), and LAT_ref is the target latency. Note that our formulation not only provides a fast estimate of the latency of the searched model but also makes the search process fully differentiable.

We demonstrate the effectiveness of our model specialization on the ImageNet dataset with a CPU (Xeon E5-2640 v4), a GPU (Tesla V100) and a mobile phone (Google Pixel-1). We first search for a specialized CNN model for the mobile phone (Figure 2). Compared to MobileNet-V2 (the state-of-the-art human-engineered architecture), our model improves the top-1 accuracy by 2.6% while maintaining a similar latency. At the same level of top-1 accuracy (around 74.6%), MobileNet-V2 has 143 ms latency while ours has only 78 ms (1.83× faster). Compared with the state-of-the-art automatically designed model, MnasNet [11], our model achieves 0.6% higher top-1 accuracy with slightly lower mobile latency. More remarkably, we reduced the search cost by 200×, from 40,000 GPU hours to only 200 GPU hours.

Fig. 2. The automatically designed specialized model consistently outperforms the human-designed MobileNetV2 under various latency settings.

Table 1 reports the speedup on GPU. Our method achieves superior performance compared to both human-designed and automatically searched architectures. Compared with general-purpose models, our specialized model improves the top-1 accuracy by 1.1%-3.1% while being 1.2×-7.5× faster. Table 2 compares the specialized models on CPU, GPU and mobile. As expected, models specialized for GPU do not run fast on CPU and mobile phone, and vice versa. It is essential to learn specialized neural networks to cater for different hardware.

TABLE 1. ImageNet accuracy (%) and GPU latency (Tesla V100).

Model                        Top-1   Top-5   GPU latency
MobileNet-V2 [9]             72.0    91.0    6.1 ms
ResNet-34 [10]               73.3    91.4    8.0 ms
NASNet-A [8]                 74.0    91.3    38.3 ms
MnasNet [11]                 74.0    91.8    6.1 ms
Specialized model for GPU

TABLE 2. Hardware prefers specialized models. Models optimized for GPU do not run fast on CPU and mobile phone, and vice versa. Our method provides an efficient solution to search a specialized neural network architecture for a target hardware architecture, while cutting down the search cost by 200× compared with the state of the art [7, 11].

Model                  Top-1   GPU   CPU   Mobile
Specialized for GPU    75.1

Our automated design flow designed CNN architectures that were long dismissed as being too inefficient, but in fact they are very efficient. For instance, engineers have essentially stopped using 7×7 filters, preferring stacked 3×3 layers (a 7×7 filter covers the same receptive field as three stacked 3×3 layers, but bears 49 weights rather than 27). However, our AI-designed model found that 7×7 kernels can in fact be efficient.

AUTOMATED CHANNEL PRUNING
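To make the binary gating of Eq. 1 and the expected-latency estimate of Eq. 2 in the model specialization flow concrete, the following is a minimal Python sketch, not the authors' implementation. The function names, the three-operator candidate set, and the latency values in `lookup` are hypothetical and chosen only for illustration; a real lookup table would hold latencies measured on the target device.

```python
import math
import random


def softmax(alphas):
    """Turn architecture parameters alpha_i into path probabilities p_i."""
    m = max(alphas)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def sample_binary_gates(alphas, rng=random):
    """Sample a one-hot gate vector g so that exactly one path is active (Eq. 1)."""
    probs = softmax(alphas)
    active = rng.choices(range(len(alphas)), weights=probs, k=1)[0]
    return [1 if i == active else 0 for i in range(len(alphas))]


def expected_block_latency(alphas, op_latencies_ms):
    """Probability-weighted sum of per-operator latencies for one block (Eq. 2)."""
    probs = softmax(alphas)
    return sum(p * lat for p, lat in zip(probs, op_latencies_ms))


def expected_network_latency(all_alphas, all_op_latencies_ms):
    """Sum the expected block latencies over all blocks (Eq. 2)."""
    return sum(
        expected_block_latency(a, lat)
        for a, lat in zip(all_alphas, all_op_latencies_ms)
    )


# Hypothetical lookup-table entries (ms): three candidate ops in each of two blocks.
lookup = [[1.0, 2.0, 0.0], [1.5, 3.0, 0.0]]
alphas = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # uniform starting point

gates = sample_binary_gates(alphas[0], random.Random(0))
expected_ms = expected_network_latency(alphas, lookup)
```

With uniform architecture parameters each path has probability 1/3, so the expected latency is the average over each block's candidates, summed over blocks. Because this estimate is a smooth function of the architecture parameters, it can be combined with the training loss as in Eq. 3.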
Pruning [13] is widely used in model compression and acceleration. It is very important to find the optimal sparsity for each layer during pruning. Pruning too much hurts accuracy; pruning too little does not achieve a high compression ratio. The sparsity used to be determined manually in previous studies [1]. Our goal is to automatically find the effective sparsity for each layer. We train a reinforcement learning agent to predict the best sparsity for a given hardware platform [5]. We evaluate the accuracy and FLOPs after pruning. Then we update the agent by encouraging smaller, faster and more accurate models.

Our automatic model compression (AMC) leverages reinforcement learning to efficiently search the pruning ratio (Figure 1(b)). The reinforcement learning agent receives an embedding state $s_t$ of layer $L_t$ from the environment and then outputs a sparsity ratio as action $a_t$. The layer is compressed with $a_t$ (rounded to the nearest feasible fraction). Then the agent moves to the next layer $L_{t+1}$ and receives state $s_{t+1}$. After finishing the final layer $L_T$, the reward accuracy is evaluated on the validation set and returned to the agent.

With our framework, we are able to push the expert-tuned limit of fine-grained model pruning. For ResNet-50 on ImageNet, we can push the compression ratio from 3.4× to 5× without loss of accuracy. With further investigation, we find that AMC automatically learns to prune 3×3 ...

We compare against uniform channel shrinkage, i.e., using a width multiplier to uniformly reduce the channels of each layer by a fixed ratio. Both the MobileNet and MobileNet-V2 papers report results for different multipliers and input sizes, and we compare our pruned models with models of the same computation. Baselines are denoted uniform (depth multiplier - input size). Our method consistently outperforms the uniform baselines (Table 4). Even for the current state-of-the-art efficient model design, MobileNet-V2, AMC can still improve accuracy by 1.0% at the same computation.

TABLE 3. AMC speeds up MobileNet. On the Google Pixel-1 CPU, AMC achieves a 1.95× measured speedup with batch size one, while saving 34% of the memory. On the NVIDIA Titan XP GPU, AMC achieves a 1.53× speedup with a batch size of 50.

                     Million  Top-1   Top-5   GPU                   Android
Model                MAC      Acc.    Acc.    Latency   Speed       Latency    Speed      Memory
100% MobileNet       569      70.6%   89.5%   0.46 ms   2191 fps    123.3 ms   8.1 fps    20.1 MB
75% MobileNet        325      68.4%   88.2%   0.34 ms   2944 fps    72.3 ms    13.8 fps   14.8 MB
AMC (50% FLOPs)      285      70.5%   89.3%   0.32 ms               68.3 ms               14.3 MB
AMC (50% Latency)    272      70.2%   89.2%   0.30 ms               63.3 ms               13.2 MB

TABLE 4. Learning-based automated model compression (AMC) outperforms rule-based model compression. Rule-based heuristics are suboptimal.

Model          Policy                      FLOPs   ∆ Acc (%)
MobileNet-V1   uniform (0.75-224) [3]      56%     -2.5
               AMC (ours)                          -0.4
               uniform (0.75-192) [3]      41%     -3.7
               AMC (ours)                          -1.7
MobileNet-V2   uniform (0.75-224) [9]      70%     -2.0
               AMC (ours)                          -1.0

Mobile inference acceleration has drawn much attention in recent years. AMC can optimize not only FLOPs and model size but also inference latency. For all mobile inference experiments, we use the TensorFlow Lite framework for timing evaluation. Our experiment platform is a Google Pixel 1. Models pruned to 0.5× FLOPs and 0.5× inference time are shown in Table 3. In the 0.5× FLOPs setting, we achieve a 1.81× speedup on the Google Pixel 1 phone (123.3 ms to 68.3 ms). In the 0.5× inference time setting, we achieve a 1.95× speedup, which is very close to the actual 2× target, showing that AMC can directly optimize inference time and achieve an accurate speedup ratio. On GPUs, we also achieve up to a 1.5× speedup, which is less than on the mobile phone but still significant for an already very compact model. The speedup is smaller because a GPU has a higher degree of parallelism than a mobile phone.

AUTOMATED MIXED-PRECISION QUANTIZATION
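The layer-by-layer AMC procedure from the channel pruning discussion (one sparsity action per layer, rounded to a feasible fraction, with the reward returned only after the final layer) can be sketched as below. This is a simplified illustration under stated assumptions: `policy` is a placeholder for the reinforcement learning agent, `evaluate` is a placeholder for validating the pruned network, and the state is reduced to a tuple instead of a learned embedding. None of these names come from the AMC code.

```python
def channels_to_remove(action, num_channels):
    """Round a sparsity ratio in [0, 1) to a whole number of channels to remove."""
    removed = round(action * num_channels)
    # Always keep at least one channel so the layer stays valid.
    return min(max(removed, 0), num_channels - 1)


def run_episode(layer_channels, policy, evaluate):
    """Visit layers in order, prune each with the agent's action, then score once."""
    kept = []
    for t, channels in enumerate(layer_channels):
        state = (t, channels)       # stand-in for the embedding state s_t
        action = policy(state)      # sparsity ratio a_t proposed by the agent
        kept.append(channels - channels_to_remove(action, channels))
    reward = evaluate(kept)         # the reward is computed after the final layer
    return kept, reward


# Hypothetical usage: a fixed 50% policy and a toy reward that prefers fewer channels.
kept, reward = run_episode(
    layer_channels=[32, 64, 128],
    policy=lambda state: 0.5,
    evaluate=lambda kept_channels: -sum(kept_channels),
)
```

In the actual flow the reward would reflect validation accuracy together with the resource target (FLOPs or latency), and the agent would be updated from it; the toy reward here only exercises the control flow.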
Conventional quantization methods quantize each layer of the model to the same precision. Mixed-precision quantization is more flexible but suffers from a huge design space that is difficult to explore. Meanwhile, as demonstrated in Table 5, the quantization solution optimized for one hardware platform might not be optimal for another, which raises the demand for specialized policies for different hardware architectures and further increases the design space. Assuming the bitwidth is between 1 and 8 for both weights and activations, each layer has $8 \times 8 = 64$ choices. If we have M different neural network models, each with N layers, on H different hardware platforms, there are in total $O(H \times M \times 8^{2N})$ possible solutions. Rather than using rule-based heuristics, we propose an automated design flow to quantize different layers with mixed precision. Our hardware-aware automatic quantization (HAQ) [6] models the quantization task as a reinforcement learning problem. We use the actor-critic model to give the quantization policy. The policy should account for hardware behavior that differs across layer types (e.g., vanilla convolution has more data reuse and locality, while depthwise convolution has less reuse and worse locality, which makes it memory bounded).

TABLE 5. Inference latency of MobileNet-V1 [3] on three hardware architectures under different quantization policies. The quantization policy that is optimized for one hardware is not optimal for the other. This suggests we need a specialized quantization solution for different hardware architectures. (HW1: spatial accelerator [14], HW2: edge accelerator [15], HW3: cloud accelerator [15], batch = 16.)

                          Inference latency on
                          HW1         HW2         HW3
Best Q. policy for HW1    16.29 ms    85.24 ms    117.44 ms
Best Q. policy for HW2                            108.64 ms
Best Q. policy for HW3

In real-world applications, we have limited resource budgets (i.e., latency, energy, and model size). We would like to find the quantization policy with the best performance given the resource constraint. We encourage our agent to meet the computation budget by limiting the action space. After our RL agent gives actions $\{a_k\}$ to all layers, we measure the amount of resources that will be used by the quantized model. The feedback is directly obtained from the hardware simulator. If the current policy exceeds our resource budget (on latency, energy or model size), we sequentially decrease the bitwidth of each layer until the constraint is finally satisfied.

We applied HAQ to three different hardware architectures to show the importance of specialized quantization policies. Inference on edge devices and on cloud servers can be quite different, since (1) the batch size on cloud servers is larger, and (2) edge devices are usually limited to low computation resources and memory bandwidth. We use an embedded FPGA, the Xilinx Zynq-7020, as our edge device and a server FPGA, the Xilinx VU9P, as our cloud device to compare the specialized quantization policies.

Compared to fixed 8-bit quantization (PACT [16]), our automated quantization consistently achieved better accuracy under the same latency (see Table 6). With similar accuracy, HAQ can reduce the latency by 1.4-1.95× compared with the baseline.

TABLE 6. Latency-constrained quantization on the edge and cloud accelerators on ImageNet.

                        Edge accelerator       Cloud accelerator
          Bitwidths     Top-1    Latency       Top-1    Latency
PACT      4 bits        62.44    45.45 ms      61.39    52.15 ms
Ours      flexible

Our agent gave specialized quantization policies for the edge and cloud accelerators (Figure 3). The policies differ considerably between hardware. For activations, the depthwise convolution layers are assigned much lower bitwidths than the pointwise layers on the edge device, while on the cloud device the bitwidths of the two types of layers are similar. For weights, the bitwidths of the two types of layers are nearly the same on the edge device, while on the cloud device the depthwise convolution layers are assigned much higher bitwidths than the pointwise convolution layers.

[Figure 3: per-layer bitwidths chosen for the edge and cloud accelerators, with the operation intensity of each layer shown at the bottom. Annotations: edge, depthwise: fewer bits, pointwise: more bits; cloud, depthwise: more bits, pointwise: fewer bits.]

Fig. 3. Quantization policy under latency constraints for MobileNet-V1.

We interpret the difference in quantization policy between edge and cloud using the roofline model. Operation intensity is defined as operations (MACs in neural networks) per DRAM byte accessed. A lower operation intensity indicates that the model suffers more from memory access. The bottom of Figure 3 shows the operation intensity (OPs per byte) of the convolution layers in MobileNet-V1. On the edge accelerator, which has much less memory bandwidth, our RL agent allocates fewer activation bits to the depthwise convolutions, since the activations dominate the memory access. On the cloud accelerator, which has more memory bandwidth, our agent allocates more bits to the depthwise convolutions and fewer bits to the pointwise convolutions to prevent them from being computation bounded.

[Figure 4: two roofline plots, GOps/s versus Ops/Byte, comparing fixed 8-bit quantization with ours; higher is better. Panel titles: "HAQ improves depthwise layers' roofline performance" and "HAQ improves pointwise layers' roofline performance".]

Fig. 4. HAQ improves the roofline performance of pointwise layers in MobileNet-V1.
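The roofline argument above can be illustrated numerically. The sketch below uses the standard roofline bound, attainable throughput = min(peak compute, memory bandwidth × operation intensity). The peak and bandwidth figures and the layer intensities are hypothetical values, not measurements from the accelerators in this paper, and the assumption that halving the bitwidth halves the DRAM traffic is a simplification.

```python
def attainable_gops(intensity_ops_per_byte, peak_gops, bandwidth_gbytes_per_s):
    """Roofline bound: throughput is capped by compute or by memory traffic."""
    return min(peak_gops, bandwidth_gbytes_per_s * intensity_ops_per_byte)


def is_memory_bound(intensity_ops_per_byte, peak_gops, bandwidth_gbytes_per_s):
    """A layer is memory bound when the bandwidth roof lies below the compute roof."""
    return bandwidth_gbytes_per_s * intensity_ops_per_byte < peak_gops


# Hypothetical accelerator: 100 GOps/s peak compute, 10 GB/s DRAM bandwidth.
PEAK, BANDWIDTH = 100.0, 10.0

depthwise_8bit = attainable_gops(0.5, PEAK, BANDWIDTH)   # low-intensity layer
depthwise_4bit = attainable_gops(1.0, PEAK, BANDWIDTH)   # same ops, half the bytes
pointwise_8bit = attainable_gops(50.0, PEAK, BANDWIDTH)  # high-intensity layer
```

For the memory-bound layer, fewer bits per value raise the operation intensity and move the point up the bandwidth slope, whereas the compute-bound layer already sits on the flat part of the roof. This matches the intuition in the text for why the agent treats depthwise and pointwise layers differently on a bandwidth-limited device.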
Figure 4 shows the roofline model before and after HAQ. HAQ gives a more reasonable policy for allocating bits to each layer and pushes all the points toward the upper-right corner, which is more efficient.

Finally, we evaluate the transfer ability of our framework: we first train our agent on one network (MobileNet-V1), then directly apply the agent to another network (MobileNet-V2) (see Table 7). The quantization policy transferred from MobileNet-V1 to MobileNet-V2 performs much better than the fixed-bitwidth baseline and is only slightly worse than the quantization policy directly searched for MobileNet-V2.

TABLE 7. Our RL agent is able to generalize well to different network architectures: the quantization policy transferred from MobileNet-V1 to MobileNet-V2 performs very close to directly searching for MobileNet-V2. Both outperformed the fixed-bitwidth baseline.

                        Bitwidth    Top-1    Latency
PACT                    4 bits      61.39    52.15 ms
Ours (search for V2)    flexible

This experiment validates that our RL agent generalizes well to different network architectures. That is to say, given a new model that the agent has not seen before, it can use its past knowledge to give a decent quantization policy, shortening the design cycle.
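The budget enforcement step described earlier in this section (sequentially decreasing each layer's bitwidth until the constraint is satisfied) can be sketched as follows. This is an illustrative reading of that sentence rather than the HAQ implementation: the round-robin order, the `min_bits` floor, and the `model_size_bytes` resource function are assumptions, and in the paper the resource feedback comes from a hardware simulator rather than a closed-form size estimate.

```python
def enforce_budget(bitwidths, resource_fn, budget, min_bits=1):
    """Lower per-layer bitwidths one step at a time until the budget is met."""
    bits = list(bitwidths)
    while resource_fn(bits) > budget:
        lowered_any = False
        for k in range(len(bits)):
            if bits[k] > min_bits:
                bits[k] -= 1
                lowered_any = True
                if resource_fn(bits) <= budget:
                    return bits
        if not lowered_any:
            raise ValueError("budget cannot be met even at the minimum bitwidth")
    return bits


# Hypothetical three-layer model: parameter counts per layer.
PARAMS_PER_LAYER = [1000, 2000, 500]


def model_size_bytes(bits):
    """Stand-in resource estimate: total weight storage in bytes."""
    return sum(p * b for p, b in zip(PARAMS_PER_LAYER, bits)) / 8


policy = enforce_budget([8, 8, 8], model_size_bytes, budget=3000)
```

Starting from uniform 8-bit weights (3500 bytes), the loop lowers the layers in turn and stops at the first policy that fits within 3000 bytes. Swapping `model_size_bytes` for a latency or energy estimate would enforce a different budget with the same loop.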
CONCLUSION
We present design automation techniques for efficient deep learning computing. There is plenty of room at the algorithm level to improve hardware efficiency, but the large design space is difficult for humans to exhaust. We covered three aspects of design automation: specialized model design, compression and pruning, and mixed-precision quantization. Such learning-based design automation outperforms rule-based heuristics. Our framework reveals that the optimal design policies on different hardware architectures are drastically different; therefore specialization is important. We interpreted those design policies and believe the insights will inspire future software and hardware co-design for efficient deep learning computing.

REFERENCES

[1] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in ICLR, 2016.
[2] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.
[3] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[4] H. Cai, L. Zhu, and S. Han, "ProxylessNAS: Direct neural architecture search on target task and hardware," in ICLR, 2019.
[5] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, "AMC: AutoML for model compression and acceleration on mobile devices," in ECCV, 2018.
[6] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han, "HAQ: Hardware-aware automated quantization with mixed precision," in CVPR, 2019.
[7] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," in ICLR, 2017.
[8] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in CVPR, 2018.
[9] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in CVPR, 2018.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[11] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, "MnasNet: Platform-aware neural architecture search for mobile," arXiv preprint arXiv:1807.11626, 2018.
[12] M. News.
[13] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems, 2015, pp. 1135-1143.
[14] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Esmaeilzadeh, "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network," in ISCA, 2018.
[15] Y. Umuroglu, L. Rasnayake, and M. Sjalander, "BISMO: A scalable bit-serial matrix multiplication overlay for reconfigurable computing," in FPL, 2018.
[16] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, "PACT: Parameterized clipping activation for quantized neural networks," arXiv