On-the-fly Network Pruning for Object Detection
Workshop track - ICLR 2016
Marc Masana, Joost van de Weijer & Andrew D. Bagdanov
Computer Vision Centre
Universitat Autònoma de Barcelona
Barcelona, 08193, Spain
{mmasana,joost,bagdanov}@cvc.uab.cat

ABSTRACT
Object detection with deep neural networks is often performed by passing a few thousand candidate bounding boxes through a deep neural network for each image. These bounding boxes are highly correlated since they originate from the same image. In this paper we investigate how to exploit feature occurrence at the image scale to prune the neural network which is subsequently applied to all bounding boxes. We show that removing units which have near-zero activation in the image allows us to significantly reduce the number of parameters in the network. Results on the PASCAL 2007 Object Detection Challenge demonstrate that up to 40% of units in some fully-connected layers can be entirely eliminated with little change in the detection result.
INTRODUCTION
Deep neural networks are often trained for recognition problems over very many labels. This is partially to ensure wide applicability of the network and partially because networks are known to benefit from multi-label data (additional training examples from one class can increase performance of another class because they share features among several layers). At testing time, however, one might want to apply the neural network to a collection of examples which are highly correlated. They only contain a limited subset of the original labels and consequently will result in sparse node activations in the network. In these cases, application of the full neural network to the whole collection results in a considerable amount of wasted computation. In this paper we describe a method for pruning of neural networks based on analysis of internal unit activations with the objective of constructing more efficient networks.

In computer vision many problems have the structure described above. We briefly mention two here. Imagine you want to classify the semantic content in each frame (an example) of a video (the collection). A fast assessment of the video might reveal that it is an indoor birthday party. This knowledge might exclude many of the nodes in the neural network: those which correspond to 'snow', 'leopards', and 'rivers', for example, are unlikely to be needed in any of the thousands of frames in this video. Another example is object detection, where we extract thousands of bounding boxes (examples) from a single image (the collection) with the aim of locating all semantic objects in the image. Given an assessment of the image, we have knowledge of the node activations for the entire collection, and based on this we can propose a smaller network which is subsequently applied to the thousands of bounding boxes. We will here only consider the latter example in more detail.

Reducing the size and complexity of neural networks (or network compression) enjoys a long history in the learning community. Bucila et al. (2006) train a simpler neural network to mimic the output of a complex one, and Ba & Caruana (2014) compress deep and wide (i.e. with many feature maps) networks to shallow but wider ones. The technique of Knowledge Distillation was introduced in Hinton et al. (2015) as a model compression framework. The framework compresses an ensemble of deep networks (teacher) into a student network of similar depth. More recently, the FitNets approach leverages the Knowledge Distillation framework to exploit depth and train student networks that are thin but remain deep (Romero et al. (2014)). Another network compression strategy, proposed in Girshick (2015) and Xue et al. (2013), uses singular value decomposition to reduce the rank of weight matrices in fully connected layers in order to improve efficiency.

In this paper we are not interested in mimicking the operation of a deep neural network over all examples and all classes (as in the student-teacher compression paradigm common in the literature). Rather, our approach is to make a quick assessment of image content and then, based on analysis of unit activation on the entire image, to modify the network to use only those units likely to contribute to correct classification of labels of interest when applied to each candidate bounding box.
FORWARD AND BACKWARD UNIT PRUNING FOR OBJECT DETECTION
Figure 1: Example of backward and forward unit pruning. We use $\|\cdot\|$ to indicate the relu(·) activation function. Based on knowledge that some unit activations $h^k(x)$ are zero (indicated in green), we can reduce the parameters of $W^k$, $W^{k+1}$ and $b^k$ (indicated in red).

Consider the original neural network $f(x; \theta)$, where $\theta$ are the network parameters. We wish to compute a network defined by parameters $\theta^*$ for which:

$$ f(x; \theta^*) \approx f(x; \theta) \quad \forall x \in C \tag{1} $$

where $|\theta^*| < |\theta|$ (i.e. the number of parameters in $\theta^*$ is considerably lower than in the original network). In the case of object detection we will use the unit activations of the entire image to prune the network which will be applied to all the bounding box proposals. This is based on the observation that for some layers, nodes with zero activations on the whole image cannot have nonzero activation on any bounding box in the image.

The hidden layer activation of a fully connected layer $k$ can be written as:

$$ h^k(x) = \mathrm{relu}\left( b^k + W^k h^{k-1}(x) \right) \tag{2} $$

where $b^k$ and $W^k$ are the biases and weights of the $k$-th layer, and relu(·) indicates the rectified linear activation function. We first consider how knowledge of the absence of node activations in the image can be translated into a network with fewer parameters. We consider two cases: backward and forward unit pruning, as illustrated in Fig. 1.

Backward unit pruning: Without loss of generality, we order the activations in layer $h^k$ so that the $q$ non-active, zero nodes are at the end of vector $h^k$. Then we can write:

$$ \left[ h^k(x)_{1:(n-q)} \,;\, 0_{q,1} \right] = \mathrm{relu}\left( \left[ W^k_{1:(n-q),\,1:m} \,;\, 0_{q,m} \right] h^{k-1}(x) + \left[ b^k_{1:(n-q)} \,;\, 0_{q,1} \right] \right) \tag{3} $$

where we use $0_{m,n}$ to indicate the zero-matrix of dimension $m$ by $n$, and subscripts are used to indicate a selection of indices from the original vector or matrix. We use [., .] for horizontal and [. ; .] for vertical concatenation (following Matlab convention). Eq. 3 shows that backward unit pruning allows us to remove from $W^k$ and $b^k$ as many rows as there are zeros in $h^k$, without changing the output of the network.

Forward unit pruning: Here we look at how the zeros in the activation $h^k$ can be exploited to remove parameters from the following layer. The activation in layer $k+1$ can be written:

$$ h^{k+1}(x) = \mathrm{relu}\left( \left[ W^{k+1}_{1:p,\,1:(n-q)} \,,\, 0_{p,q} \right] \left[ h^k(x)_{1:(n-q)} \,;\, 0_{q,1} \right] + b^{k+1} \right) \tag{4} $$

In this case, the zeros in $h^k$ result in the removal of columns from $W^{k+1}$. These can be removed without changing the output of the network.

In practice there might only be a few zero activations in the image, and therefore we consider all node activations which are below a certain threshold to be zero. (If the activation function is not the ReLU, one should instead consider the absolute value of the activation function to be smaller than a threshold.) This allows us to prune $f(x; \theta^*)$ further, but at the cost of slight deviations from the original network $f(x; \theta)$. We also note that although the notation covers fully-connected layers for simplicity, our proposal is also applicable to convolutional layers.

RESULTS AND CONCLUSIONS

We evaluate our proposed methods on the VOC PASCAL 2007 dataset (Everingham et al. (2010)) with the Fast R-CNN framework by Girshick (2015). The VOC 2007 has a total of 24,640 annotated objects for training, with an average of 2.5 objects per image, and in the test set an average of 2.4 objects per image. The Fast R-CNN framework fits our purposes since it first passes the image through all the convolutional layers and only afterwards combines the extracted feature maps with the bounding boxes which we want to evaluate (usually 1,000+ boxes). The network used is a modification of the VGG16 network (Simonyan & Zisserman (2014)).
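Since many boxes are evaluated per image, the pruned layers can be built once from the image-level activations and then reused for every box. The following is a minimal sketch of that idea for two fully connected layers, following Eqs. 3 and 4; it is not the authors' implementation. The helper names (`select_units`, `prune_backward`, `prune_forward`), the toy layer sizes and the random weights are assumptions made for illustration, and plain Python lists are used to keep the example dependency-free. The last two units of the toy layer are constructed to be inactive for every input, so that the premise "zero on the image implies zero on every box" holds by design.

```python
import random

def relu_layer(W, b, h):
    """One fully connected layer, relu(b + W h), with W stored as a list of rows."""
    return [max(0.0, bi + sum(w * x for w, x in zip(row, h)))
            for row, bi in zip(W, b)]

def select_units(image_act, threshold=0.0):
    """Indices of units whose image-level activation exceeds the threshold."""
    return [i for i, a in enumerate(image_act) if a > threshold]

def prune_backward(W, b, keep):
    """Eq. 3: drop the rows of W^k and the entries of b^k of pruned units."""
    return [W[i] for i in keep], [b[i] for i in keep]

def prune_forward(W_next, keep):
    """Eq. 4: drop the columns of W^{k+1} that multiply pruned units."""
    return [[row[j] for j in keep] for row in W_next]

rng = random.Random(0)
m, n, p, q = 4, 6, 3, 2  # hypothetical sizes: inputs, units in k, units in k+1, dead units

# Toy layer k: the last q units have zero weights and a negative bias,
# so they stay at zero for every non-negative input.
W_k = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(n - q)]
W_k += [[0.0] * m for _ in range(q)]
b_k = [0.1] * (n - q) + [-1.0] * q
W_next = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(p)]
b_next = [rng.uniform(-0.5, 0.5) for _ in range(p)]

image = [rng.random() for _ in range(m)]                       # image-level input
boxes = [[rng.random() for _ in range(m)] for _ in range(5)]   # per-box inputs

# Prune once per image, then reuse the small layers for all boxes.
keep = select_units(relu_layer(W_k, b_k, image))
W_k_small, b_k_small = prune_backward(W_k, b_k, keep)
W_next_small = prune_forward(W_next, keep)

def full_net(x):
    return relu_layer(W_next, b_next, relu_layer(W_k, b_k, x))

def pruned_net(x):
    return relu_layer(W_next_small, b_next, relu_layer(W_k_small, b_k_small, x))

max_diff = max(abs(a - c) for x in boxes
               for a, c in zip(full_net(x), pruned_net(x)))
print("kept units:", keep, "max output difference:", max_diff)
```

With `threshold=0.0` only exactly inactive units are removed and the outputs agree; raising the threshold removes more units at the price of a deviation from the original network, as discussed above.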
Forward pruning.
Our first experiment uses forward unit pruning on the pool5 layer of the VGG16 network to reduce the number of parameters of the fc6 layer. This is the layer with the highest percentage of parameters (38.7% of the parameters in the network). The pool5 output has three dimensions: the first represents the feature maps, and the second and third are spatial dimensions (smaller than the original image size because of the resizing at each pooling layer). In order to decide which activations to prune, we first pass the whole image through the network and observe the activations at each unit in pool5. We sum over the spatial dimensions and apply a threshold to select units to prune from the network before applying it to all bounding boxes.

Figure 2: Performance loss as a function of parameter reduction. (x-axis: proportion of original units removed; y-axis: mAP loss; curves: pool5 forward unit pruning, fc8 backward unit pruning, pool5 forward random pruning, fc8 backward random pruning.)

Results show an initial minor improvement in the performance of the framework when removing parameters (see Fig. 2). The lack of propagation through the network of very low value activations could be the cause of the small difference in performance. Then, for reductions of 25-40% of the parameters of layer fc6, we obtain a mAP loss of less than 1. From that point on, further removal of parameters leads to higher loss. This happens because the removed activations become too relevant for the network's discriminative power.
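The selection step of this experiment (sum each pool5 feature map over its spatial dimensions, threshold, then drop the matching fc6 columns) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used in the experiments: the function names are hypothetical, the toy shape of 4 feature maps of size 2 x 2 and all values are made up, and pool5 is assumed to be flattened in channel-major order before entering fc6, so that each feature map owns one contiguous block of fc6 columns.

```python
def channel_scores(pool5):
    """Sum every feature map over its spatial dimensions (pool5 is C x H x W)."""
    return [sum(sum(row) for row in fmap) for fmap in pool5]

def kept_channels(scores, threshold):
    """Feature maps whose summed image-level activation exceeds the threshold."""
    return [c for c, s in enumerate(scores) if s > threshold]

def flatten(pool5, channels):
    """Flatten the selected feature maps in channel-major order."""
    return [v for c in channels for row in pool5[c] for v in row]

def prune_fc6_columns(W_fc6, channels, height, width):
    """Keep only the fc6 columns that are fed by the selected feature maps."""
    size = height * width
    cols = [c * size + i for c in channels for i in range(size)]
    return [[row[j] for j in cols] for row in W_fc6]

def fc_relu(W, b, x):
    """Fully connected layer followed by ReLU."""
    return [max(0.0, bi + sum(w * v for w, v in zip(row, x)))
            for row, bi in zip(W, b)]

n_channels, height, width = 4, 2, 2  # toy shape, far smaller than VGG16
image_pool5 = [
    [[0.9, 0.2], [0.0, 0.4]],
    [[0.0, 0.0], [0.0, 0.0]],   # inactive on the whole image
    [[0.3, 0.0], [0.7, 0.1]],
    [[0.0, 0.0], [0.0, 0.0]],   # inactive on the whole image
]
box_pool5 = [                    # one box: also zero in the inactive maps
    [[0.9, 0.0], [0.0, 0.4]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.7, 0.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
W_fc6 = [[0.01 * (r + 1) * (j + 1) for j in range(n_channels * height * width)]
         for r in range(3)]
b_fc6 = [0.05, -0.02, 0.1]

scores = channel_scores(image_pool5)
channels = kept_channels(scores, threshold=1e-6)
W_small = prune_fc6_columns(W_fc6, channels, height, width)

full = fc_relu(W_fc6, b_fc6, flatten(box_pool5, range(n_channels)))
pruned = fc_relu(W_small, b_fc6, flatten(box_pool5, channels))
removed = 1 - len(channels) / n_channels
print("kept feature maps:", channels, "proportion removed:", removed)
```

Pruning whole feature maps rather than individual spatial positions follows the summation over spatial dimensions described above; the proportion of removed feature maps equals the proportion of removed fc6 columns.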
Backward pruning.
The second experiment applies backward unit pruning to the fc8 layer to reduce the number of parameters in the weight and bias matrices used to compute the network outputs. In this case, we use an image classifier (based on VGG16 deep features) to decide which classes (activations) are more likely to appear in the original image. Based on that classification, we adopt a top-N strategy where we keep the N classes with the highest probability from the image classifier and remove the rest. This reduction affects the weight and bias matrices of fc8, whose removed units no longer propagate into the following layers (the softmax in this case). Results keeping 6 or more classes (reductions of 0-70%) show a mAP loss of less than 1. However, performance starts dropping beyond that point because some images contain more classes than the number of classes kept. It should be noted that only a small percentage of the total parameters of the network are in fc8. However, when considering object detection with thousands of classes, the relevance of this layer is comparable to fc6.

Conclusions.
We have presented a method to prune units in neural networks for object detection through analysis of unit activation on the entire image. We show that for some layers up to 40% of the parameters can be removed with minimal impact on performance. We are interested in combining our method with other parameter reduction methods such as Xue et al. (2013). Applying our method to other types of layers (e.g. convolutional) and evaluating on datasets with very many labels are also promising research directions. In addition, we are interested in applying our method to semantic segmentation where, as in our problem, a redundant network is applied to every pixel.
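For reference, the top-N strategy of the backward pruning experiment on fc8 can be sketched as follows. This is only an illustration under stated assumptions: the function names, the five toy classes, the weights and the image-level probabilities are all made up, and any special handling of a background class is not modelled.

```python
import math

def top_n_classes(probs, n):
    """Indices of the n most probable classes, returned in ascending order."""
    ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    return sorted(ranked[:n])

def prune_fc8(W, b, keep):
    """Backward pruning: keep only the rows and biases of the selected classes."""
    return [W[c] for c in keep], [b[c] for c in keep]

def class_scores(W, b, feat):
    """Linear class scores b + W feat (no ReLU on the output layer)."""
    return [bc + sum(w * f for w, f in zip(row, feat)) for row, bc in zip(W, b)]

def softmax(z):
    """Numerically stable softmax."""
    mx = max(z)
    e = [math.exp(v - mx) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Hypothetical image-level class probabilities from an image classifier.
image_probs = [0.05, 0.60, 0.10, 0.20, 0.05]
W_fc8 = [[0.2, -0.1, 0.4],
         [0.5, 0.3, -0.2],
         [-0.3, 0.1, 0.1],
         [0.1, 0.6, 0.2],
         [0.0, -0.4, 0.3]]
b_fc8 = [0.0, 0.1, -0.1, 0.05, 0.2]
box_feature = [1.0, 0.5, 2.0]   # toy input to fc8 for one bounding box

keep = top_n_classes(image_probs, n=2)
W_small, b_small = prune_fc8(W_fc8, b_fc8, keep)

full_scores = class_scores(W_fc8, b_fc8, box_feature)
kept_scores = class_scores(W_small, b_small, box_feature)
kept_probs = softmax(kept_scores)   # softmax now runs over the kept classes only
reduction = 1 - len(keep) / len(image_probs)
print("kept classes:", keep, "reduction:", reduction)
```

The scores of the kept classes are unchanged by the pruning, but the softmax is renormalised over fewer classes, so a class that is present in the image yet missing from the top N can no longer be detected. This matches the performance drop reported when fewer classes are kept than appear in the image.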
ACKNOWLEDGMENTS
This work is funded by the Projects TIN2013-41751-P of the Spanish Ministry of Science, the Catalan project 2014 SGR 221 and the CHIST ERA project PCIN-2015-226. We gratefully acknowledge the support of NVIDIA.

REFERENCES
Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems (NIPS), pp. 2654–2662, 2014.

C Bucila, R Caruana, and A Niculescu-Mizil. Model compression: Making big, slow models practical. In Proc. of the 12th International Conf. on Knowledge Discovery and Data Mining (KDD06), 2006.

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.

Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, 2015.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014. URL http://arxiv.org/abs/1412.6550.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Jian Xue, Jinyu Li, and Yifan Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In