Architecture-aware Network Pruning for Vision Quality Applications
Wei-Ting Wang, Han-Lin Li, Wei-Shiang Lin, Cheng-Ming Chiang, Yi-Min Tsai
MediaTek Inc.
ABSTRACT
Convolutional neural networks (CNNs) deliver impressive achievements in the computer vision and machine learning fields. However, CNNs incur high computational complexity, especially for vision quality applications, because of large image resolutions. In this paper, we propose an iterative architecture-aware pruning algorithm with an adaptive magnitude threshold that cooperates with quality-metric measurement simultaneously. We show the performance improvement on vision quality applications and provide a comprehensive analysis with flexible pruning configurations. With the proposed method, the Multiply-Accumulate (MAC) counts of state-of-the-art low-light imaging (SID) and super-resolution (EDSR) are reduced by 58% and 37%, respectively, without quality drop. The memory bandwidth (BW) requirements of convolutional layers can also be reduced by 20% to 40%.
Index Terms — Pruning, Vision Quality, Network Architecture
1. INTRODUCTION
CNNs are adopted as essential ingredients in the computer vision and machine learning areas [1, 2, 3]. Vision perception tasks including image classification, object detection and semantic segmentation are comprehensively investigated and associated with CNNs. Even in the image processing field, such as super resolution, high dynamic range imaging and de-noising, CNNs have delivered progressive and promising improvements in image quality in recent years [4, 5]. However, compared to perception tasks, vision quality tasks require higher computational complexity and BW. MobileNetV1 [6] is designed with 569M MAC for ImageNet classification. On the other hand, low-light photography SID [4] and super-resolution EDSR [5] take 560G MAC and 1.4T MAC per inference, respectively. It is therefore more challenging to deploy CNN models on mobile devices for vision quality applications.
Network pruning [7] is an effective methodology toward performance optimization. Sparsity is defined as the ratio of the number of zero weights divided by the number of total weights. A better pruning algorithm delivers higher sparsity and correspondingly reduces more MAC and BW. However, quality drop is one of the major challenges in network pruning. Fig. 1 shows a visible defect on SID even with only 0.1 PSNR degradation.
In this paper, we propose architecture-aware pruning to maximize sparsity and MAC reduction without quality-metric (PSNR or SSIM) drops. We also analyze the effects of MAC and BW reduction with different configurations associated with the pruned structures. The proposed method focuses on algorithms including but not limited to SID and EDSR.
2. RELATED WORKS
Network pruning has been widely explored in the existing literature. To decide which weights should be pruned, some works add evaluation terms to the loss function, such as group lasso [8] and MAC regularization [9]. However, it is difficult to find a proper ratio between the additional pruning-related loss and the original loss. Other works create evaluation functions, including sensitivity [10, 11] and weight magnitude [12]. The sensitivity method computes the impact of weights on the training loss and removes low-impact weights. The weight-magnitude method simply prunes a weight if its absolute value is less than a threshold, which is easier to apply to large-scale CNNs. In this work, we use the weight-magnitude method to prune the network.
Pruning granularity.
There are two granularities of pruning: fine-grained pruning [12, 13] and coarse-grained pruning [14, 15, 16, 17]. The fine-grained method prunes individual weights (i.e., within a filter kernel), whereas the coarse-grained method extensively considers network structures (i.e., along the output and the input channels). According to [18], fine-grained pruning needs additional dedicated hardware to handle irregular sparsity. The coarse-grained method may obtain a higher compression ratio without the need of a compression header [19]. Therefore, we focus on coarse-grained output-channel-wise pruning.
Fig. 1. A slight quality-metric drop (PSNR -0.09) may incur visible defects (SID). (a) PSNR: 25.41. (b) PSNR: 25.32.
Iterative pruning. To prevent catastrophic accuracy degradation, iterative pruning is viewed as an effective retraining procedure [20, 21, 22]. For vision quality applications, quality metrics are required as a reference judgement for terminating the pruning procedure.
3. PROPOSED METHOD
3.1. Architecture-aware Pruning
An output channel is pruned if its maximum absolute weight value is less than the magnitude threshold. For a convolutional layer, the weight kernel has tensor shape i × o × k × k, where i is the number of input channels, o is the number of output channels and k is the kernel size. Output-channel-wise pruning removes the weights along output channels. The kernel shape becomes i × (o − o′) × k × k if o′ output channels are pruned. The output-channel pruned ratio is defined as o′/o. Once output channels of a layer are pruned, the corresponding input channels of the following layer are also removed. We define one layer's sparsity as 1 − ((i − i′)(o − o′)) / (i × o), where i′ is the number of pruned input channels in the layer. The network sparsity is defined as the ratio of the number of zero weights of a pruned network divided by the number of total weights of the original network.
3.1.1. Keep Layer Depth
Usually, in vision quality applications, each layer in a network is semantically designed for quality-sensitive primitives, such as edge and chroma, with respect to different resolutions. Intensively removing a layer can severely degrade the quality. Therefore, we keep the network architecture by preserving a minimum number of output channels in every layer.
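To make the channel-selection rule and the sparsity definition above concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper; prune_output_channels and layer_sparsity are hypothetical helper names):

```python
import numpy as np

def prune_output_channels(weights, threshold):
    """Return indices of output channels to keep.

    weights has shape (i, o, k, k); an output channel is pruned when its
    maximum absolute weight is below the magnitude threshold (Sec. 3.1).
    """
    # Max |w| per output channel, reducing over input channels and the k x k kernel.
    channel_max = np.abs(weights).max(axis=(0, 2, 3))
    return np.nonzero(channel_max >= threshold)[0]

def layer_sparsity(i, o, i_pruned, o_pruned):
    """Layer sparsity = 1 - ((i - i')(o - o')) / (i * o)."""
    return 1.0 - ((i - i_pruned) * (o - o_pruned)) / (i * o)

# Toy usage: a layer with 16 input channels, 32 output channels, 3x3 kernels.
w = np.random.randn(16, 32, 3, 3) * 0.05
kept = prune_output_channels(w, threshold=0.12)
print(layer_sparsity(i=16, o=32, i_pruned=0, o_pruned=32 - len(kept)))
```

The output channels removed from one layer become the pruned input channels (i′) of the following layer, which is how the i′ term enters the layer sparsity.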
3.1.2. Enhance MAC Efficiency
A pruned network with higher weight sparsity may not imply a higher computation reduction. We define MAC/weight, R_l (Eq. 1), for each layer as an indicator of MAC efficiency.
Fig. 2. MAC/weight is much larger on the top and bottom layers in SID. However, MAC/weight for most layers is uniform in EDSR.
Fig. 3. An example of a residual block. There are 4 layers (Conv-A, Conv-B, Conv-C, Conv-D) with 6 output channels in each layer. Colored parts represent removed output channels. One color denotes one group of channels in the balanced pruned-output-channel method.
M_l and W_l are the number of MACs and weights of a layer, respectively. Fig. 2 shows that MAC/weight is much larger on both the top and bottom layers of SID because of its U-Net [23] network topology. Therefore, to productively reduce computation, output channels of a layer with a higher R_l tend to be pruned more because of a higher magnitude threshold.

R_l = log(M_l / W_l)    (1)

3.1.3. Balance Pruned Output Channel
Residual blocks are universally used in network topology design, e.g., in EDSR, which is a variation of ResNet [24] with a long shortcut. However, pruning output channels from a residual block is arduous because of element-wise operations (e.g., element-wise ADD) or concatenation. Fig. 3 illustrates an example in which a magnitude threshold is applied to a 4-layer residual block. Because of the element-wise ADD after layer Conv-B and layer Conv-D, the output channel of a given layer (Conv-D) and its preceding layer (Conv-B) with the same index (5) should be grouped and pruned at the same time. Layer Conv-B and layer Conv-D therefore have fewer pruned output channels compared to layer Conv-A and layer Conv-C.
We propose a guidance (Eq. 2) to make pruning output channels of a residual block easier by increasing the magnitude threshold on layers with a lower ratio of pruned output channels. The MAC efficiency mentioned in Sec. 3.1.2 is also applied. S_l is the ratio of pruned output channels of layer l, and T_b is the magnitude threshold base. Thus, output channels of a layer with a lower S_l have a higher tendency to be pruned.

T_l = T_b × (1 − S_l) × R_l    (2)
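Eqs. 1 and 2 combine into a single per-layer threshold rule: the threshold base T_b is scaled up for layers with high MAC/weight and scaled down for layers that have already lost many output channels. A minimal sketch of that rule follows (our own illustration; LayerStats, mac_efficiency and layer_threshold are hypothetical names, and a natural logarithm is assumed since the paper writes log without a base):

```python
import math
from dataclasses import dataclass

@dataclass
class LayerStats:
    macs: int            # M_l: multiply-accumulates of the layer
    num_weights: int     # W_l: number of weights of the layer
    pruned_ratio: float  # S_l: ratio of already-pruned output channels

def mac_efficiency(layer: LayerStats) -> float:
    """R_l = log(M_l / W_l), Eq. 1 (natural log assumed)."""
    return math.log(layer.macs / layer.num_weights)

def layer_threshold(t_base: float, layer: LayerStats) -> float:
    """T_l = T_b * (1 - S_l) * R_l, Eq. 2."""
    return t_base * (1.0 - layer.pruned_ratio) * mac_efficiency(layer)

# Toy usage: a 3x3 conv with 64 input/output channels on a 256x256 feature map.
conv = LayerStats(macs=256 * 256 * 64 * 64 * 3 * 3,
                  num_weights=64 * 64 * 3 * 3,
                  pruned_ratio=0.25)
print(layer_threshold(t_base=1e-3, layer=conv))  # higher R_l or lower S_l -> higher threshold
```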
3.2. Quality-Metric-Guaranteed Pruning
To maintain the quality metric (PSNR and SSIM) while maximizing the pruned MAC, our algorithm prunes and retrains the network iteratively. The iteration terminates when either the target quality-metric criterion or the maximum number of training steps is reached. The proposed overall flow is shown in Algorithm 1.

Algorithm 1: Architecture-aware and Quality-Metric-Guaranteed Pruning
 1: Input: target quality Q, target sparsity increment S_i, threshold increment T_i, total steps G
 2: S ← S_i + total-network-sparsity            // target sparsity
 3: T_b ← T_i                                   // initial threshold base
 4: repeat
 5:   for each layer in the network do
 6:     S_l ← pruned-output-channel-ratio(layer)
 7:     M_l ← MAC(layer)
 8:     W_l ← weight-size(layer)
 9:     R_l ← log(M_l / W_l)
10:     T_l ← T_b × (1 − S_l) × R_l
11:     prune-output-channels-by-threshold(T_l)
12:   end for
13:   S_c ← calculate-total-network-sparsity
14:   T_b ← T_b + T_i
15: until S_c > S
16: repeat
17:   retrain-pruned-network
18:   Q_t ← evaluate-quality-metric
19:   g ← get-current-step
20: until Q_t > Q or g ≥ G
21: if g < G then jump to line 2 end if
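As a complement to the pseudocode, the outer flow of Algorithm 1 can be sketched in Python roughly as follows. This is a schematic only: the network object is assumed to expose sparsity(), layers, and per-layer stats() and prune_output_channels(); retrain_fn stands in for the retraining and quality-evaluation routine; layer_threshold is the helper sketched above. None of these names come from the paper.

```python
def iterative_prune(network, quality_target, sparsity_increment,
                    threshold_increment, max_steps, retrain_fn):
    """Schematic of Algorithm 1 (our paraphrase, not the authors' code).

    Assumed (hypothetical) interfaces: network.sparsity(), network.layers,
    layer.stats(), layer.prune_output_channels(threshold), and retrain_fn,
    which retrains the pruned network and returns (steps_used, quality).
    """
    while True:
        # Raise the sparsity target above the current network sparsity.
        sparsity_target = network.sparsity() + sparsity_increment
        t_base = threshold_increment
        # Prune with a growing threshold base until the target sparsity is met.
        while network.sparsity() <= sparsity_target:
            for layer in network.layers:
                t_layer = layer_threshold(t_base, layer.stats())  # Eq. 2
                layer.prune_output_channels(threshold=t_layer)
            t_base += threshold_increment
        # Retrain until the quality target is met or the step budget is spent.
        steps_used, _ = retrain_fn(network, quality_target, max_steps)
        if steps_used >= max_steps:
            return network  # budget exhausted: stop here
        # Quality recovered within budget: loop back and prune toward a higher target.
```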
4. EXPERIMENTAL RESULTS
4.1. Experiment Setup
We investigate both SID for low-light photography and EDSR (baseline network, ×2) for super resolution. SID uses its own dataset [4] and EDSR adopts the DIV2K dataset [25]. The input size of the network is set to the maximum image resolution in the datasets (1424 × ...).
Fig. 4. SID results of our method compared to the original (without pruning). (a)(c) Original (PSNR: 28.54, SSIM: 0.767). (b)(d) Pruned with Method-D (PSNR: 28.55, SSIM: 0.768).
Fig. 5. EDSR results of our method compared to the original (without pruning). (a) Image sampled from DIV2K. (b) Original (PSNR: 34.42, SSIM: 0.942). (c) Proposed Method-D (PSNR: 34.42, SSIM: 0.942).
Table 1. Detailed results. BW, considering only convolutional layers, consists of both weights and activations. Each weight and activation is represented with 4-byte floating-point numerical precision.
Network | Solution | # of MAC (×) | # of Weights (×) | # of Activations (×) | BW (MByte/Inference) | Validation PSNR | Validation SSIM
SID  | Original | 560 (100%)  | 7757 (100%) | 1915 (100%) | 1922 (100%) | 28.54 | 0.767
SID  | Method-A | 458 (82%)   | 6918 (89%)  | 1632 (85%)  | 1639 (85%)  | 28.54 | 0.768
SID  | Method-B | 354 (63%)   | 5275 (68%)  | 1485 (78%)  | 1491 (78%)  | 28.54 | 0.771
SID  | Method-C | 270 (48%)   | 5584 (72%)  | 1219 (64%)  | 1225 (64%)  | 28.54 | 0.769
SID  | Method-D | 236 (42%)   | 4241 (55%)  | 1169 (61%)  | 1173 (61%)  | 28.55 | 0.768
EDSR | Original | 1428 (100%) | 1367 (100%) | 5076 (100%) | 5077 (100%) | 34.42 | 0.942
EDSR | Method-A | 1085 (76%)  | 1037 (76%)  | 4481 (88%)  | 4481 (88%)  | 34.43 | 0.942
EDSR | Method-B | 1085 (76%)  | 1037 (76%)  | 4481 (88%)  | 4481 (88%)  | 34.43 | 0.942
EDSR | Method-C | 1085 (76%)  | 1037 (76%)  | 4481 (88%)  | 4481 (88%)  | 34.43 | 0.942
EDSR | Method-D | 897 (63%)   | 857 (63%)   | 4083 (80%)  | 4083 (80%)  | 34.42 | 0.942
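As a side note on reading the BW column: per the Table 1 caption, BW counts the weights and activations of the convolutional layers at 4 bytes each. A minimal sketch of that accounting (illustrative only; it does not attempt to reproduce the table's numbers, whose per-column scale factors are not preserved in this copy):

```python
def conv_bandwidth_mbytes(num_weights, num_activations, bytes_per_value=4):
    """BW per inference in MByte, counting every weight and activation once
    at 4-byte floating-point precision, as described in the Table 1 caption."""
    return bytes_per_value * (num_weights + num_activations) / 1e6

# Toy example: one 3x3 conv, 64 -> 64 channels, producing a 256x256 feature map.
weights = 64 * 64 * 3 * 3          # 36,864 weights
activations = 256 * 256 * 64       # 4,194,304 output activations
print(conv_bandwidth_mbytes(weights, activations))  # ~16.9 MByte
```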
Fig. 6. Pruned output channels per layer on SID.
The comprehensive analysis is elaborated in Table 1. We evaluate four distinct approaches. Method-A stands for magnitude-threshold pruning without any structural hints. Method-B keeps the depth of the network (Sec. 3.1.1). Method-C further considers the MAC/weight ratio (Sec. 3.1.2) on the basis of Method-B. Method-D integrates all proposed techniques. All methods are conducted together with the quality-metric constraints (Sec. 3.2). For both SID and EDSR, there is no PSNR or SSIM drop with any method. Fig. 4 and Fig. 5 reveal the indistinguishable quality difference.
Keep Layer Depth.
In SID, Method-B reduces MAC from 82% to 63% compared to Method-A. As shown in Fig. 6, Method-A may remove all the output channels of a layer because it does not keep layer depth, which leads to severe quality-metric drops that cannot be recovered in the retraining steps.
Enhance MAC Efficiency.
In SID, Fig. 6 shows that Method-C prunes more weights on both the top and bottom layers, which have a larger MAC/weight (Eq. 1). Therefore, Method-C reduces MAC from 63% to 48% compared to Method-B.
Balance Pruned Output Channel.
Method-D increases weight sparsity by 17% but only reduces MAC by 6% compared to Method-C in SID. Fig. 6 illustrates that Method-D prunes less on the top and bottom layers, which have a larger MAC/weight. In EDSR, there is no difference among Method-A, Method-B and Method-C, because no layer is completely pruned by Method-A and MAC/weight is identical for all layers (the last layer cannot be pruned), as shown in Fig. 2. Fig. 7 shows that Method-D reduces MAC from 76% to 63% (more than 10%) in the shortcut-connected layers.
Fig. 7. Pruned output channels per layer on EDSR.
In summary, our methodology yields significant reductions in both MAC and BW, which implies a reduction in inference latency. For BW, we obtain 39% and 20% reductions on SID and EDSR, respectively. Our methodology also works well on complex network architectures.

5. CONCLUSION
To minimize computational complexity without quality drop in vision quality applications, our architecture-aware pruning prunes more where the complexity metric (e.g., MAC) is high in SID and in the shortcut-connected layers of EDSR. The MAC of SID and EDSR is reduced by 58% and 37%, respectively. Memory bandwidth is also reduced without degradation of PSNR, SSIM or subjective quality. The reduction of computational complexity and memory bandwidth can benefit general mobile devices without special hardware design.
6. REFERENCES
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., pp. 1097-1105. Curran Associates, Inc., 2012.
[2] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," CoRR, vol. abs/1311.2524, 2013.
[3] Jonathan Long, Evan Shelhamer, and Trevor Darrell, "Fully convolutional networks for semantic segmentation," CoRR, vol. abs/1411.4038, 2014.
[4] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun, "Learning to see in the dark," CoRR, vol. abs/1805.01934, 2018.
[5] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee, "Enhanced deep residual networks for single image super-resolution," CoRR, vol. abs/1707.02921, 2017.
[6] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," CoRR, vol. abs/1704.04861, 2017.
[7] R. Reed, "Pruning algorithms - a survey," IEEE Transactions on Neural Networks, vol. 4, no. 5, pp. 740-747, Sep. 1993.
[8] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li, "Learning structured sparsity in deep neural networks," CoRR, vol. abs/1608.03665, 2016.
[9] Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Tien-Ju Yang, and Edward Choi, "Morphnet: Fast & simple resource-constrained structure learning of deep networks," CoRR, vol. abs/1711.06798, 2017.
[10] Yann Le Cun, John S. Denker, and Sara A. Solla, "Optimal brain damage," in Proceedings of the 2nd International Conference on Neural Information Processing Systems, Cambridge, MA, USA, 1989, NIPS'89, pp. 598-605, MIT Press.
[11] A. P. Engelbrecht, "A new pruning heuristic based on variance analysis of sensitivity information," Trans. Neur. Netw., vol. 12, no. 6, pp. 1386-1399, Nov. 2001.
[12] Song Han, Jeff Pool, John Tran, and William J. Dally, "Learning both weights and connections for efficient neural networks," CoRR, vol. abs/1506.02626, 2015.
[13] Suraj Srinivas, Akshayvarun Subramanya, and R. Venkatesh Babu, "Training sparse neural networks," CoRR, vol. abs/1611.06694, 2016.
[14] Sajid Anwar and Wonyong Sung, "Compact deep convolutional neural networks with coarse pruning," CoRR, vol. abs/1610.09639, 2016.
[15] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz, "Pruning convolutional neural networks for resource efficient transfer learning," CoRR, vol. abs/1611.06440, 2016.
[16] A. Polyak and L. Wolf, "Channel-level acceleration of deep face representations," IEEE Access, vol. 3, pp. 2163-2175, 2015.
[17] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf, "Pruning filters for efficient convnets," CoRR, vol. abs/1608.08710, 2016.
[18] Yihui He and Song Han, "ADC: Automated deep compression and acceleration with reinforcement learning," CoRR, vol. abs/1802.03494, 2018.
[19] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J. Dally, "Exploring the regularity of sparse structure in convolutional neural networks," CoRR, vol. abs/1705.08922, 2017.
[20] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze, "Designing energy-efficient convolutional neural networks using energy-aware pruning," CoRR, vol. abs/1611.05128, 2016.
[21] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang, "Network trimming: A data-driven neuron pruning approach towards efficient deep architectures," CoRR, vol. abs/1607.03250, 2016.
[22] G. Castellano, A. M. Fanelli, and M. Pelillo, "An iterative pruning algorithm for feedforward neural networks," IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 519-531, May 1997.
[23] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-net: Convolutional networks for biomedical image segmentation," CoRR, vol. abs/1505.04597, 2015.
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.