PVANET: Deep but Lightweight Neural Networks for Real-time Object Detection

Kye-Hyeon Kim*, Sanghoon Hong*, Byungseok Roh*, Yeongjae Cheon, and Minje Park
Intel Imaging and Camera Technology
21 Teheran-ro 52-gil, Gangnam-gu, Seoul 06212, Korea
{kye-hyeon.kim, sanghoon.hong, peter.roh, yeongjae.cheon, minje.park}@intel.com

* These authors contributed equally. Corresponding author: Sanghoon Hong. The code and the trained models are available at https://github.com/sanghoon/pva-faster-rcnn

Abstract
This paper presents how we can achieve state-of-the-art accuracy in the multi-category object detection task while minimizing the computational cost, by adapting and combining recent technical innovations. Following the common pipeline of "CNN feature extraction + region proposal + RoI classification", we mainly redesign the feature extraction part, since the region proposal part is not computationally expensive and the classification part can be efficiently compressed with common techniques like truncated SVD. Our design principle is "less channels with more layers", together with the adoption of some building blocks including concatenated ReLU, Inception, and HyperNet. The designed network is deep and thin, and is trained with the help of batch normalization, residual connections, and learning rate scheduling based on plateau detection. We obtained solid results on well-known object detection benchmarks: 83.8% mAP (mean average precision) on VOC2007 and 82.5% mAP on VOC2012 (2nd place), while taking only 750ms/image on an Intel i7-6700K CPU with a single core and 46ms/image on an NVIDIA Titan X GPU. Theoretically, our network requires only 12.3% of the computational cost of ResNet-101, the winner on VOC2012.
1 Introduction

Convolutional neural networks (CNNs) have made impressive improvements in object detection over the last several years. Thanks to many innovative works, recent object detection systems have reached accuracies acceptable for commercialization in a broad range of markets, such as automotive and surveillance. In terms of detection speed, however, even the best algorithms still suffer from heavy computational cost. Although recent work on network compression and quantization shows promising results, it is important to reduce the computational cost at the network design stage.

This paper presents our lightweight feature extraction network architecture for object detection, named PVANET, which achieves real-time object detection performance without losing accuracy compared to the other state-of-the-art systems:

• Computational cost: 7.9 GMAC for feature extraction with a 1056x640 input (cf. ResNet-101 [1]: 80.5 GMAC; ResNet-101 also used multi-scale testing without mentioning the additional computational cost. If we take this into account, ours requires less than 7% of the computational cost of ResNet-101.)
• Runtime performance: 750ms/image (1.3FPS) on an Intel i7-6700K CPU with a single core; 46ms/image (21.7FPS) on an NVIDIA Titan X GPU
• Accuracy: 83.8% mAP on VOC-2007; 82.5% mAP on VOC-2012 (2nd place)
The key design principle is "less channels with more layers". Additionally, our networks adopt some recent building blocks, some of which have not yet been verified for their effectiveness on object detection tasks:

• Concatenated rectified linear unit (C.ReLU) [2] is applied to the early stage of our CNNs (i.e., the first several layers from the network input) to reduce the number of computations by half without losing accuracy.
• Inception [3] is applied to the remaining part of our feature generation sub-network. An Inception module produces output activations with different receptive field sizes, which increases the variety of receptive field sizes over the previous layer. We observed that stacking up Inception modules can capture widely varying-sized objects more effectively than a linear chain of convolutions.
• We adopted the idea of multi-scale representation, like HyperNet [4], which combines several intermediate outputs so that multiple levels of detail and non-linearity can be considered simultaneously.

We will show that our thin but deep network can be trained effectively with batch normalization [5], residual connections [1], and learning rate scheduling based on plateau detection [1].

In the remainder of the paper, we describe our network design briefly (Section 2) and summarize the detailed structure of PVANET (Section 3). Finally, we provide experimental results on the VOC-2007 and VOC-2012 benchmarks, together with detailed settings for training and testing (Section 4).

Figure 1: Our C.ReLU building block (Convolution - Negation - Concatenation - Scale/Shift - ReLU). Negation simply multiplies −1 to the output of Convolution. Scale/Shift applies a trainable weight and bias to each channel, allowing activations in the negated part to be adaptive.
2 Details on network design

2.1 C.ReLU

C.ReLU is motivated by an interesting observation of intermediate activation patterns in CNNs. In the early stage, output nodes tend to be "paired" such that one node's activation is the opposite side of another's. From this observation, C.ReLU reduces the number of output channels by half, and doubles it by simply concatenating the same outputs with negation, which leads to a 2x speed-up of the early stage without losing accuracy.

Figure 1 illustrates our C.ReLU implementation. Compared to the original C.ReLU, we append scaling and shifting after concatenation, so that each channel's slope and activation threshold can be different from those of its opposite channel.
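To make the mechanism concrete, here is a minimal numpy sketch of the C.ReLU forward pass; the function name and tensor layout are ours for illustration, not taken from the released code.

```python
import numpy as np

def crelu(x, scale, shift):
    """C.ReLU forward pass (sketch).

    x:     conv output of shape (N, C, H, W), where only C channels are
           actually computed by the convolution (half the usual count).
    scale: trainable per-channel weights of shape (2C,).
    shift: trainable per-channel biases of shape (2C,).
    """
    # Concatenating the activations with their negation doubles the
    # channels to 2C without any extra convolution cost.
    y = np.concatenate([x, -x], axis=1)
    # Channel-wise scale/shift lets each channel's slope and activation
    # threshold differ from those of its opposite channel (Figure 1).
    y = y * scale[None, :, None, None] + shift[None, :, None, None]
    # Shared ReLU.
    return np.maximum(y, 0.0)
```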
Figure 2: Example of a distribution of (expected) receptive field sizes (x-axis) versus nonlinearity level (y-axis) for intermediate outputs in a chain of 3 Inception modules. Each module concatenates 3 convolutional layers of different kernel sizes, 1x1, 3x3 and 5x5, respectively. The number of output channels of each layer is set to {1/2, 1/4, 1/4} of the number of channels from the previous module, respectively. A later Inception module can learn visual patterns over a wider range of sizes, as well as having a higher level of nonlinearity.

Figure 3: (Left) Our Inception building block. The 5x5 convolution is replaced with two 3x3 convolutional layers for efficiency. (Right) The Inception block for reducing the feature map size by half (stride 2).
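The receptive-field behavior sketched in Figure 2 can be checked with the standard recurrence rf += (k - 1) * jump, where jump is the product of strides seen so far. The snippet below (ours, purely for illustration) follows one branch per module through three stride-1 Inception modules.

```python
def receptive_field(chain):
    """Receptive field of a stack of convolutions, each given as
    (kernel_size, stride), using rf += (k - 1) * jump."""
    rf, jump = 1, 1
    for k, s in chain:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Following one branch per module through 3 stride-1 Inception modules:
print(receptive_field([(1, 1)] * 3))          # 1x1 path: rf stays 1
print(receptive_field([(3, 1)] * 3))          # 3x3 path: rf = 7
print(receptive_field([(3, 1), (3, 1)] * 3))  # 5x5-equivalent path: rf = 13
```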
2.2 Inception

For object detection tasks, Inception has neither been widely applied to existing networks nor verified for its effectiveness. We found that Inception can be one of the most cost-effective building blocks for capturing both small and large objects in an input image. To learn visual patterns for capturing large objects, the output features of a CNN should correspond to sufficiently large receptive fields, which can easily be fulfilled by stacking up convolutions with 3x3 or larger kernels. On the other hand, to capture small-sized objects, the output features should correspond to sufficiently small receptive fields, so that small regions of interest are localized precisely.

Figure 2 shows that Inception can fulfill both requirements. The 1x1 convolution plays the key role to this end by preserving the receptive field of the previous layer. Merely increasing the nonlinearity of input patterns, it slows down the growth of receptive fields for some output features, so that small-sized objects can be captured precisely. Figure 3 illustrates our Inception implementation, in which the 5x5 convolution is replaced with a sequence of two 3x3 convolutions.
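A minimal PyTorch sketch of the block in Figure 3 follows. The channel counts, argument names, and the exact placement of the stride in the stride-2 variant are illustrative assumptions (the real per-layer widths are listed in Table 1), and the batch normalization and ReLU after each conv are omitted for brevity.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Inception block as in Figure 3 (sketch; channel counts illustrative)."""
    def __init__(self, c_in, c1, c3, c5, stride=1):
        super().__init__()
        # Branch 1: 1x1 conv preserves the receptive field.
        self.b1 = nn.Conv2d(c_in, c1, 1, stride=stride)
        # Branch 2: 1x1 reduction followed by a 3x3 conv.
        self.b3 = nn.Sequential(
            nn.Conv2d(c_in, c3 // 2, 1),
            nn.Conv2d(c3 // 2, c3, 3, stride=stride, padding=1))
        # Branch 3: two 3x3 convs replace a single 5x5 conv.
        self.b5 = nn.Sequential(
            nn.Conv2d(c_in, c5 // 2, 1),
            nn.Conv2d(c5 // 2, c5 // 2, 3, stride=stride, padding=1),
            nn.Conv2d(c5 // 2, c5, 3, padding=1))
        # The stride-2 variant adds a pooling branch (Figure 3, right).
        self.pool = nn.Sequential(
            nn.MaxPool2d(3, stride=2, padding=1),
            nn.Conv2d(c_in, c1, 1)) if stride == 2 else None

    def forward(self, x):
        outs = [self.b1(x), self.b3(x), self.b5(x)]
        if self.pool is not None:
            outs.append(self.pool(x))
        # Channel-wise concatenation of all branches.
        return torch.cat(outs, dim=1)
```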
2.3 Multi-scale features

Multi-scale representation and its combination are proven to be effective in many recent deep learning tasks [4, 6, 7]. Combining fine-grained details with highly-abstracted information in the feature extraction layer helps the subsequent region proposal network and classification network detect objects of different scales. However, since the direct concatenation of all abstraction layers may produce redundant information at a much higher compute requirement, we need to choose the number of abstraction layers and their depths carefully. Layers that are too early in the network are of little help for object proposal and classification, given the additional computational complexity they introduce.
Table 1: The detailed structure of PVANET (columns: Name, Type, Stride, Output, Residual, C.ReLU, Inception; table body omitted). All conv layers are combined with batch normalization, channel-wise scaling and shifting, and ReLU activation layers. The theoretical computational cost is given as the number of adds and multiplications (MAC), assuming that the input image size is 1056x640. "KxK C.ReLU" refers to a sequence of "1x1 - KxK - 1x1" conv layers, where KxK is a C.ReLU block as in Figure 1; conv1_1 has no 1x1 conv layer. The "C.ReLU" column shows the number of output channels of each conv layer. For "Residual", a 1x1 conv is applied for projecting pool1_1 into conv2_1, conv2_3 into conv3_1, conv3_4 into conv4_1, and conv4_4 into conv5_1. "Inception" consists of four sub-sequences: a 1x1 conv, a 3x3 conv preceded by a 1x1 conv, two 3x3 convs (replacing a 5x5 conv) preceded by a 1x1 conv, and, in stride-2 modules, a 3x3 max-pooling followed by a 1x1 conv (Figure 3). Multi-scale features are obtained in four steps: conv3_4 is down-scaled into "downscale" by 3x3 max-pooling with stride 2; conv5_4 is up-scaled into "upscale" by 4x4 channel-wise deconvolution whose weights are fixed as bilinear interpolation; "downscale", conv4_4 and "upscale" are combined into "concat" by channel-wise concatenation; after a 1x1 conv, the final output (convf) is obtained.

Our design choice is not different from the observations of ION [6] and HyperNet [4], which combine 1) the last layer and 2) two intermediate layers whose scales are 2x and 4x of the last layer, respectively. We choose the middle-sized layer as the reference scale (= 2x), and concatenate the 4x-scaled layer and the last layer with down-scaling (pooling) and up-scaling (linear interpolation), respectively.
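The four steps above can be sketched in a few lines of PyTorch; here F.interpolate stands in for the fixed-bilinear 4x4 stride-2 deconvolution, proj is our placeholder for the final 1x1 conv (assumed to be an nn.Conv2d producing 512 channels), and we assume the three inputs have compatible spatial sizes.

```python
import torch
import torch.nn.functional as F

def multiscale_features(conv3_4, conv4_4, conv5_4, proj):
    """Combine three intermediate outputs into 'convf' (sketch)."""
    # Step 1: 3x3 max-pooling with stride 2 halves conv3_4's resolution.
    down = F.max_pool2d(conv3_4, kernel_size=3, stride=2, padding=1)
    # Step 2: bilinear 2x upsampling of conv5_4; equivalent in effect to a
    # 4x4 stride-2 channel-wise deconvolution with fixed bilinear weights.
    up = F.interpolate(conv5_4, size=conv4_4.shape[2:], mode='bilinear',
                       align_corners=False)
    # Step 3: channel-wise concatenation at the reference (2x) scale.
    concat = torch.cat([down, conv4_4, up], dim=1)
    # Step 4: a final 1x1 conv produces the 512-channel convf.
    return proj(concat)
```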
2.4 Deep network training

It is widely accepted that as networks go deeper, their training becomes more troublesome. We resolve this issue by adopting residual structures [1]. Unlike the original residual training idea, we add residual connections onto the Inception layers as well, to stabilize the later part of our deep network architecture.

We also add batch normalization [5] layers before all ReLU activation layers. Mini-batch sample statistics are used during pre-training, and moving-averaged statistics are used afterwards as fixed scale-and-shift parameters.

The learning rate policy is also important for training the network successfully. Our policy is to control the learning rate dynamically, based on plateau detection [1]. We measure the moving average of the loss, and decide it to be on-plateau if its improvement is below a threshold during a certain period of iterations. Whenever a plateau is detected, the learning rate is decreased by a constant factor. In our experiments, this learning rate policy gave a significant gain in accuracy.
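A pure-Python sketch of this policy might look as follows; the window length, improvement threshold, and moving-average momentum are illustrative assumptions, while the 1/√10 decay factor is the one we use in pre-training (Section 4).

```python
import math

class PlateauScheduler:
    """Decay the learning rate whenever the training loss plateaus (sketch)."""

    def __init__(self, lr=0.1, factor=1 / math.sqrt(10),
                 threshold=1e-3, window=2000, momentum=0.999):
        # threshold / window / momentum are illustrative, not tuned values.
        self.lr, self.factor = lr, factor
        self.threshold, self.window, self.momentum = threshold, window, momentum
        self.avg, self.best, self.stale = None, float('inf'), 0

    def step(self, loss):
        # A moving average of the loss smooths out mini-batch noise.
        self.avg = loss if self.avg is None else \
            self.momentum * self.avg + (1 - self.momentum) * loss
        if self.avg < self.best - self.threshold:
            self.best, self.stale = self.avg, 0   # still improving
        else:
            self.stale += 1
            if self.stale >= self.window:         # on-plateau: decay the LR
                self.lr *= self.factor
                self.best, self.stale = self.avg, 0
        return self.lr
```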
3 Faster R-CNN with our feature extraction network
Table 1 shows the whole structure of PVANET. In the early stage (conv1_1, ..., conv3_4), C.ReLU is adapted to the convolutional layers to reduce the computational cost of the KxK conv by half. 1x1 conv layers are added before and after the KxK conv, in order to reduce the input size and then enlarge the representation capacity, respectively.

Three intermediate outputs from conv3_4 (with down-scaling), conv4_4, and conv5_4 (with up-scaling) are combined into the 512-channel multi-scale output features (convf), which are fed into the Faster R-CNN modules:

• For computational efficiency, only the first 128 channels of convf are fed into the region proposal network (RPN). Our RPN is a sequence of "3x3 conv (384 channels) - 1x1 conv (25x(2+4) = 150 channels)" layers that generates regions of interest (RoIs): for each anchor, it produces 2 predicted scores (foreground and background) and 4 predicted bounding-box values, over 25 anchors of 5 scales (3, 6, 9, 16, 25) and 5 aspect ratios (0.5, 0.667, 1.0, 1.5, 2.0).
• R-CNN takes all 512 channels of convf. For each RoI, a 6x6x512 tensor is generated by RoI pooling, and then passed through a sequence of fully-connected layers with "4096 - 4096 - (21+84)" output nodes: for 20-class object detection, R-CNN produces 21 predicted scores (20 classes + 1 background) and 21x4 predicted values of 21 bounding boxes.

4 Experimental results

PVANET was pre-trained with ILSVRC2012 training images for 1000-class image classification. All images were resized to 256x256, and 192x192 patches were randomly cropped and used as the network input. The learning rate was initially set to 0.1, and then decreased by a factor of 1/√10 ≈ 0.3165 whenever a plateau was detected. Pre-training terminated when the learning rate dropped below 1e-4, which usually required about 2M iterations.

PVANET was then trained on the union of the MS COCO trainval, VOC2007 trainval and VOC2012 trainval sets (MS COCO: http://mscoco.org/dataset/; VOC2007: http://host.robots.ox.ac.uk/pascal/VOC/voc2007/; VOC2012: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/). Fine-tuning with VOC2007 trainval and VOC2012 trainval was also required afterwards, since the class definitions in the MS COCO and VOC competitions are slightly different. Training images were randomly resized such that the shorter edge of an image was between 416 and 864.

For the PASCAL VOC evaluations, each input image was resized such that its shorter edge was 640. All parameters related to Faster R-CNN were set as in the original work [8], except for the number of proposal boxes before non-maximum suppression (NMS) (= 12000) and the NMS threshold (= 0.4). All evaluations were done on an Intel i7-6700K CPU with a single core and an NVIDIA Titan X GPU.

Table 2 shows the accuracy of our models in different configurations. Thanks to Inception (Section 2.2) and multi-scale features (Section 2.3), our RPN generated initial proposals very accurately. Since the results imply that more than 200 proposals do not give a notable benefit to detection accuracy, we fixed the number of proposals to 200 in the remaining experiments. We also measured the performance with bounding-box voting [10], while iterative regression was not applied. (On Sep. 19, 2016, we updated the mAP numbers according to the latest version of the evaluation code in py-faster-rcnn.)

Table 2: Accuracy of our models in different configurations, measured as mAP at IoU ≥ 0.5 (table body omitted). PVANET+ denotes that bounding-box voting is applied, and PVANET+ (compressed) denotes that the fully-connected layers in R-CNN are compressed.
Model                      | Computational cost (GMAC)             | Running time      | mAP (%)
                           | Shared CNN   RPN   Classifier   Total | ms     x(PVANET)  |
PVANET+                    | 7.9          1.3   27.7         37.0  | 46     1.0        | 82.5
Faster R-CNN + ResNet-101  | 80.5         N/A   219.6        300.1 | 2240   48.6       | 83.8
Faster R-CNN + VGG-16      | 183.2        5.5   27.7         216.4 | 110    2.4        | 75.9
R-FCN + ResNet-101         | 122.9        0     0            122.9 | 133    2.9        | 82.0
Table 3: Comparisons between our network and some state-of-the-art networks in the PASCAL VOC2012 leaderboard. PVANET+ denotes PVANET with bounding-box voting. We assume that PVANET takes a 1056x640 image and that the number of proposals is 200. Competitors' GMAC numbers are estimated from their publicly available Caffe prototxt files. All testing-time configurations are the same as in the original articles [1, 12, 8]. Competitors' runtime performances are also taken from those articles, projected under the assumption that an NVIDIA Titan X is 1.5x faster than an NVIDIA K40.

Faster R-CNN consists of fully-connected layers, which can be compressed easily without a significant drop in accuracy [11]. We compressed the fully-connected layers of "4096 - 4096" into "512 - 4096 - 512 - 4096" by truncated singular value decomposition (SVD), with some fine-tuning afterwards. The compressed network achieved 82.9% mAP (-0.9%) and ran at 31.3 FPS (+9.6 FPS).
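The compression step is plain truncated SVD on the weight matrices; a numpy sketch (the function name and return convention are ours) is shown below. With rank=512, each 4096x4096 layer splits into a 512x4096 layer followed by a 4096x512 layer, cutting its cost by roughly 4x.

```python
import numpy as np

def compress_fc(W, b, rank=512):
    """Truncated-SVD compression of a fully-connected layer (sketch).

    Approximates y = W @ x + b (W: out x in) by two thinner layers,
    y = U_r @ (V_r @ x) + b, keeping the top 'rank' singular values.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # (out, rank); singular values folded in
    V_r = Vt[:rank, :]             # (rank, in)
    # The first thinner layer has no bias; the original bias stays on the second.
    return (V_r, np.zeros(rank)), (U_r, b)
```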
Table 3 summarizes comparisons between PVANET+ and some state-of-the-art networks [1, 8, 12] from the PASCAL VOC2012 leaderboard (http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?challengeid=11&compid=4). Our PVANET+ achieved 82.5% mAP, the 2nd place on the leaderboard, outperforming all other competitors except for "Faster R-CNN + ResNet-101". However, the top performer uses ResNet-101, which is much heavier than PVANET, as well as several time-consuming techniques such as global contexts and multi-scale testing, making it 40x (or more) slower than ours. In Table 3, we also compare mAP with respect to the computational cost. Among the networks achieving over 80% mAP, PVANET+ is the only network running in under 50 ms. Taking its accuracy and computational cost into account, PVANET+ is the most efficient network on the leaderboard.

5 Conclusion

In this paper, we showed that the current networks are highly redundant and that we can design a thin and light network which is capable enough for complex vision tasks. Elaborate adoption and combination of recent technical innovations in deep learning made it possible to redesign the feature extraction part of the Faster R-CNN framework to maximize computational efficiency. Even though the proposed network is designed for object detection, we believe our design principle can be widely applicable to other tasks such as face recognition and semantic analysis.

References

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[2] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. In Proceedings of the International Conference on Machine Learning (ICML), 2016.
[3] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[4] Tao Kong, Anbang Yao, Yurong Chen, and Fuchun Sun. HyperNet: Towards accurate region proposal generation and joint object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[5] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
[6] Sean Bell, C. Lawrence Zitnick, Kavita Bala, and Ross Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[7] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[8] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[9] Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. arXiv preprint arXiv:1509.09308, 2015.
[10] Spyros Gidaris and Nikos Komodakis. Object detection via a multi-region & semantic segmentation-aware CNN model. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[11] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[12] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems (NIPS), 2016.