Deep Neural Networks Under Stress
Micael Carvalho (1), Matthieu Cord (1), Sandra Avila (2), Nicolas Thome (1) & Eduardo Valle (2)
(1) Sorbonne Universités, UPMC Univ Paris 06, CNRS, LIP6 UMR 7606, 4 place Jussieu, 75005 Paris, France
(2) University of Campinas, RECOD Lab – DCA/FEEC/UNICAMP, Campinas, Brazil
ABSTRACT
In recent years, deep architectures have been used for transfer learning with state-of-the-art performance on many datasets. The properties of their features remain, however, largely unstudied under the transfer perspective. In this work, we present an extensive analysis of the resiliency of feature vectors extracted from deep models, with special focus on the trade-off between performance and compression rate. By introducing perturbations to image descriptions extracted from a deep convolutional neural network, we change their precision and number of dimensions, measuring how it affects the final score. We show that deep features are more robust to these disturbances when compared to classical approaches, achieving a compression rate of 98.4% while losing only 0.88% of their original score for Pascal VOC 2007.
Index Terms — feature robustness, deep learning, transfer learning, image classification, feature compression
1. INTRODUCTION
Deep Convolutional Neural Networks have swept the Computer Vision community, with state-of-the-art performance for many tasks [1, 2, 3]. However, an analytical understanding of their models is still lacking, shrouding their use under a cloud of ad hoc procedures (tricks of the trade) without which they simply fail to work. A full understanding of deep representations has therefore become the new Holy Grail of research in Machine Learning and Computer Vision [4, 5].

We explore here the properties of Deep Networks, measuring to what extent they preserve discriminative information about the input, i.e., measuring the robustness of the feature vectors they generate. Indeed, we may understand a deep model as one that first learns to extract a good representation (feature extraction step) and then uses that representation to make a decision (classification or regression step). Most of the challenge in understanding deep models is due to the unknown nature of the learned features.

Pursuing that understanding, we use transfer learning and "stress" tests to probe the networks. Transfer learning consists in recycling knowledge from one model to another, in the form of model weights, initialization, or architecture (e.g., [6, 7, 8, 9]), saving both computational resources and training data. Transfer learning is often used, with great success, on deep models, which are very greedy in terms of data and processing power. A straightforward scheme is to choose a pre-trained network, freeze the weights up to a certain layer, and introduce and train new layers for the new task. By picking different layers from the original network, one controls the degree of transfer between the models. Conceptually, the output of the frozen transferred layers for any image may be seen as a feature vector $x$. Thus, any classifier, like an SVM, may be used for classification on a target dataset.

Fig. 1: Overview of our framework. Input images are converted to stressed feature vectors by (1) extracting descriptions using a pre-trained deep network, and (2) transforming/stressing the feature vectors by reducing their precision or their number of dimensions.

In this paper, we propose stress tests, represented in Figure 1, which consistently interfere in the network to selectively destroy information. We explore two important aspects of deep architectures: the dimensionality and the numerical precision of their representations. Dimensionality stress tests introduce $T : \mathbb{R}^n \to \mathbb{R}^p$, with $p$ smaller than the original dimensionality $n$. Quantization stress tests introduce $T : \mathbb{R}^n \to Q^n$, where $Q$ is a more aggressively quantized subset of the real numbers $\mathbb{R}$. We also combine the two stresses.

Although recent studies reevaluate deep architectures with respect to the size and precision of their representations (e.g., [10, 11, 12, 13]), their primary focus is the practical impact upon the original tasks. Our framework is designed for transfer learning tasks and, as we try to shed light on general properties of the networks, we will see that they show a strong degree of redundancy, opening the opportunity to create powerful compact descriptors.

2. TRANSFER STRATEGIES

Our main objective is to explore the VGG-M deep convolutional model [14], which was originally trained on ImageNet, in a transfer scheme for the classification task of the Pascal VOC 2007 dataset [15].
We perform extensive experiments to study the robustness of this architecture, detailed in Table 1, against different types of stress.

Let us formalize the pre-trained deep model as a series of functions $\varphi_i : \mathbb{R}^{m_i} \to \mathbb{R}^{n_i}$, where $\varphi_i$ is the $i$-th layer of the network, $m_1$ is the dimensionality of the input data, and $n_i = m_{i+1}$ is the output dimensionality of layer $i$. In our stress tests, we choose a layer $i$ up to which we freeze the network (i.e., we keep layers $\varphi_1 \dots \varphi_i$ untouched). At first, we use the output of layer $\varphi_i$ to train an SVM. Then, we pick a stressing function $T$ and retrain the model using $T(\varphi_i)$ as input. Comparing the two scores, we can infer the network's resiliency to the chosen stress.

To better highlight inherent properties of deep models, instead of specific characteristics of VGG-M [14], we also evaluate part of our experiments with GoogLeNet [16]. Furthermore, in order to differentiate these deep models from classical approaches, we also report comparative results with BossaNova [17], a recent Bag-of-Words (BoW) model. In all cases, we pre-process the images according to each model's recommended protocol. We also explore how the results obtained for Pascal VOC 2007 extend to other data by repeating part of the experiments on two other datasets: MIT-67 Indoor [18] and UPMC Food-101 [19] (67 and 101 classes, respectively).

In order to understand how redundant the deep representation is, our first stress tests drop dimensions from the feature vector. The number of dimensions $p_i$ preserved at each step $1 \leq i \leq 20$ is proportional to the initial size $n$ of the feature vector, according to the expression $p_i = \lfloor n \cdot (21 - i) / 20 \rfloor$. We contrast two strategies for selecting the $p_{i-1} - p_i$ dimensions dropped at each step $i$: T_DR-1 drops them randomly, and T_DR-2 uses a PCA-based strategy, discarding the dimensions encoding less variance. To take into consideration the random choice in DR-1, we repeat the experiment 10 times. A minimal sketch of both strategies appears below.
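To make the two selection strategies concrete, here is a minimal sketch in NumPy/scikit-learn. It is our illustration, not the authors' released code: the function names are ours, and we interpret the PCA-based DR-2 as a projection onto the leading principal components.

```python
import numpy as np
from sklearn.decomposition import PCA

def drop_random(features, keep, seed=0):
    """DR-1: keep a random subset of `keep` dimensions.
    The same column subset must be reused on the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(features.shape[1], size=keep, replace=False)
    return features[:, idx], idx

def fit_drop_pca(train_features, keep):
    """DR-2: fit a PCA on the training features and keep the `keep`
    components of highest variance, discarding the dimensions that
    encode less variance; apply `.transform` to any new vectors."""
    return PCA(n_components=keep).fit(train_features)

# Step schedule from the paper: p_i = floor(n * (21 - i) / 20), 1 <= i <= 20.
n = 4096  # illustrative feature size (a VGG-M fully connected output)
steps = [n * (21 - i) // 20 for i in range(1, 21)]  # n, 0.95n, ..., 0.05n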
          G1   G2   G3   G4   G5   G6   G7   G8
Conv.     L1   L5   L9   L11  L13  –    –    –
Fully     –    –    –    –    –    L16  L18  L20
ReLU      L2   L6   L10  L12  L14  L17  L19  –
LRN       L3   L7   –    –    –    –    –    –
Pooling   L4   L8   –    –    L15  –    –    –
Softmax   –    –    –    –    –    –    –    L21

Table 1: VGG-M model. Description of layers (L) and groups (G) of the VGG-M model, from the MatConvNet toolbox [20], proposed by Chatfield et al. [14]. Conv. indicates a convolutional layer, Fully a fully connected layer, ReLU a Rectified Linear Unit layer, LRN a Local Response Normalization layer, Pooling a Max Pooling layer, and Softmax the activation of the Softmax function.

The other stressor diminishes the numerical precision of the representation, quantizing the feature vectors. Our objective is not to explore advanced quantization strategies here, but to consider two fast and simple scalar quantizations and to analyze their effect on a classification task. In our first one, Q-1, all dimensions are quantized into the same $h$ regular intervals, using the minimum ($min$) and maximum ($max$) scalar values observed in the training set over all dimensions. In our second one, Q-2, we adapt the limits for each dimension individually, according, again, to values observed in the training set.

Formally, Q-1, using the global step $st = \frac{max - min}{h}$, has a single dictionary $H$, generated by

$H = \{ (min + \frac{st}{2}) + st \cdot i \mid 0 \leq i < h \}$

For Q-2, let $x$ be the feature matrix of the training feature vectors, and $x_t$ the $t$-th element from all the vectors. Using one step $st_t = \frac{\max(x_t) - \min(x_t)}{h}$ per dimension, Q-2 has $n$ (number of dimensions) dictionaries, generated by

$H_t = \{ (\min(x_t) + \frac{st_t}{2}) + st_t \cdot i \mid 0 \leq i < h \}$

Finally, in the quantization step, we assign to each element the value of the closest point in the dictionary. For Q-1 and Q-2, respectively, this is defined by:

$T_{Q\text{-}1}(x_{ij}) = \operatorname{arg\,min}_{y \in H} \, |x_{ij} - y|$

$T_{Q\text{-}2}(x_{ij}) = \operatorname{arg\,min}_{y \in H_j} \, |x_{ij} - y|$

The final experiment, FC, applies both stressors DR-2 and Q-2 simultaneously, dropping dimensions of the feature vector and quantizing the values of the remaining elements. Our goal is to measure any cross-effects between DR-2 and Q-2.
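Before moving to the experiments, the two quantizers can be stated in a few lines. The sketch below follows the definitions above (midpoint dictionaries, closest-point assignment), replacing the explicit arg min over the dictionary with an equivalent interval-index computation; all names are ours, and the code assumes max > min on every dimension.

```python
import numpy as np

def quantize_q1(train, x, h):
    """Q-1: a single dictionary H of h midpoints, built from the global
    min/max observed over all dimensions of the training set."""
    lo, hi = train.min(), train.max()
    st = (hi - lo) / h
    # The closest point (min + st/2) + st*i of H is the midpoint of the
    # interval containing x; clip so out-of-range values snap to the ends.
    i = np.clip(np.floor((x - lo) / st), 0, h - 1)
    return lo + st / 2 + st * i

def quantize_q2(train, x, h):
    """Q-2: one dictionary H_t per dimension t, with per-dimension
    min/max taken from the training set (rows = vectors, columns = dims)."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    st = (hi - lo) / h
    i = np.clip(np.floor((x - lo) / st), 0, h - 1)
    return lo + st / 2 + st * i
```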
3. EXPERIMENTS
As explained, for a given experimental point, we freeze a pre-trained network at layer $\varphi_i$, discarding all upper layers. We then pick a stressing function $T$ and use the output of $T(\varphi_i)$ as a feature vector in a transfer learning classification task. We $\ell_2$-normalize those feature vectors and feed them to a linear SVM model [21], measuring the model's scores for different choices of $T$. By picking stressing functions of different kinds and intensities (including the identity $T(x) = x$), we gain insight into the resiliency of deep models to those stresses. For all setups, we use a regularization parameter $C = 1$; preliminary experiments showed very little variation when $C$ was cross-validated.

For all our experiments, we report the classification scores in Mean Average Precision (mAP) for Pascal VOC 2007, and Accuracy (Acc) for Food-101 and MIT-67, following the literature tradition on those datasets. Although we have tested the deep networks extensively, due to space constraints we only report the experiments with layer 19 for VGG-M and with layer 151 for GoogLeNet. Those results are representative of our observations throughout the networks.

Fig. 2: Results for dimensionality reduction (DR) on VOC 2007, with standard deviation shown as shaded regions around the lines. The horizontal axis indicates the percentage of the original dimensions that is kept, while the corresponding score, with respect to the initial one, is shown vertically. The right side of the figure shows the number of dimensions for each model when only 5% of their initial size is preserved.

Table 2 shows the scores for our vanilla experiments, using setups without perturbing the feature vectors (i.e., $T(x) = x$). We simplify BossaNova's pipeline for Pascal VOC 2007, disabling the concatenation with the classic Bag of Visual Words and using a linear SVM instead of the recommended RBF kernel. The evaluation loop is sketched after Table 2.

                         VGG-M    GoogLeNet   BossaNova
Pascal VOC 2007 (mAP)    76.95%   80.58%      51.02%
MIT-67 Indoor (Acc)      63.35%   –           –
UPMC Food-101 (Acc)      46.22%   –           –
Feature Dimensionality   ∗        ∗           ∗

Table 2: Classification scores for deep and BoW strategies in a vanilla transfer scheme, with a linear SVM as classifier.
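For reference, the evaluation loop behind Table 2 is a standard frozen-feature transfer pipeline. Below is a minimal single-label sketch with scikit-learn, assuming a hypothetical `extract_features` that returns the output of the frozen layer $\varphi_i$; for VOC's multi-label setting, one would instead train one binary SVM per class and average the per-class average precisions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def evaluate(stress, extract_features, train_x, train_y, test_x, test_y):
    """Stress the frozen-layer features, train a linear SVM with C = 1
    (as in all our setups), and return the test accuracy."""
    f_train = stress(extract_features(train_x))
    f_test = stress(extract_features(test_x))
    # l2-normalize each feature vector before classification.
    f_train /= np.linalg.norm(f_train, axis=1, keepdims=True)
    f_test /= np.linalg.norm(f_test, axis=1, keepdims=True)
    clf = LinearSVC(C=1.0).fit(f_train, train_y)
    return clf.score(f_test, test_y)

# The vanilla setup uses the identity stress: evaluate(lambda f: f, ...)
# Fitted stresses (PCA, quantizers) must be fit on training features only.
```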
The results for our dimensionality reduction (DR) experiments on the Pascal VOC 2007 dataset are shown in Figure 2. Strong redundancy in the representations is detected, since only small variations in the score were observed across runs. GoogLeNet was the most robust against the random dimensionality perturbation DR-1, with an average mAP drop of 4.74% for 95% of the dimensions removed. However, GoogLeNet's description is, from the start, 12 times bigger than VGG-M's. In a direct comparison of descriptions of approximately the same size, the scores of the two models were equivalent. Although BossaNova showed similar resiliency to dimensionality reduction, VGG-M held better scores for every test point, despite having feature vectors 15 times smaller (right side of Figure 2).

The PCA-based strategy was very effective for preserving information while dropping dimensions: DR-2 held 97.95% of the mAP with 95% of dimensions removed, while DR-1 could only keep 89.16% of the mAP. Choosing the right dimensions to drop improves the robustness of the feature vectors to dimensionality perturbations.

Fig. 3: Results for dimensionality reduction (DR) with VGG-M for Food-101, MIT-67 and VOC 2007. The datasets have 101, 67 and 20 classes, respectively.

The number of classes in the target dataset also seems to play an important role in performance resiliency, as seen in Figure 3. To classify the data correctly, diverse datasets may need complementary feature points, which can be lost with dimensionality reduction.

Our quantization (Q) experiments, on the other hand, reduce the size of the feature vectors, from the initial $32 \cdot m_i$ bits (for 32-bit single-precision floating-point numbers), by aggressively limiting their values. Q-2 performed better than Q-1, indicating that adaptiveness to scale plays an important role (Figure 4). Furthermore, Q-1 kept vanilla scores with 7 values, while Q-2 only needed 4. That represents a strong compression of the feature vectors, from $32 \cdot m$ bits to $\lceil \log_2 7 \rceil \cdot m = 3m$ bits for Q-1 and $\lceil \log_2 4 \rceil \cdot m = 2m$ bits for Q-2.

The main results for DR and Q for VOC 2007 are summarized in Table 3, where each column indicates the maximum desired loss with respect to the original score for an experiment, while the cells indicate the minimum representation which satisfies such a requirement. For example, the second line of the second column reveals that with only 10% of the dimensions preserved, the GoogLeNet score drops by less than 2%.

            Original Score   DR-1 – 2%   DR-1 – 5%   DR-2 – 1%   Q-1 – 1%   Q-1 – 4%   Q-2 – 1%   Q-2 – 3%
VGG-M       76.95%           25%         10%         10%         6 values   4 values   3 values   2 values
GoogLeNet   80.58%           10%         5%          –           –          –          –          –
BossaNova   39.59%           50%         25%         –           –          –          –          –

Table 3: Minimum representation rate for Pascal VOC 2007. Each column indicates a requirement, and each line represents a model. The cells reveal the minimum representation needed for losing at most the indicated percentage. For instance, (DR-1 – 2%) + GoogLeNet = 10% means that with only 10% of the dimensions, we lose at most 2% of the initial score.
Fig. 4: Results for quantization of features on the base setup (VGG-M and Pascal VOC 2007). We can keep vanilla performance while reducing the feature vectors from $32 \cdot m_i$ to $\lceil \log_2 7 \rceil \cdot m_i$ and $\lceil \log_2 4 \rceil \cdot m_i$ bits, using Q-1 and Q-2, respectively.

Finally, the results for FC with the base setup are shown in Figure 5. The flat region at the top represents combinations of parameters from DR-2 and Q-2 with complementary characteristics, indicating that the features can be compressed in terms of dimension and precision at the same time. We point out, with the circle, square and cross markers, specific combinations of DR-2 and Q-2 with compression rates of 99.1%, 98.4% and 96.9%, respectively, while maintaining 97.8%, 99.1% and 99.6% of the original score.

Fig. 5: Results for feature compression (FC). We reduce the number of dimensions and the precision of the feature vectors at the same time. The circle, square and cross mark configurations with compression rates of 99.1%, 98.4% and 96.9%, respectively, while maintaining 97.8%, 99.1% and 99.6% of the original score.

Supplementary results and resources, including the source code for our experiments, are available online at https://github.com/MicaelCarvalho/DNNsUnderStress.
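The compression rates above follow from simple arithmetic: keeping a fraction $p/n$ of the dimensions, each quantized to $b = \lceil \log_2 h \rceil$ bits instead of 32, removes $1 - \frac{p \cdot b}{n \cdot 32}$ of the representation. A quick check; the specific configuration below is our illustration, not necessarily one of the marked points:

```python
def compression_rate(dim_fraction, bits):
    """Fraction of the original 32-bit float representation removed."""
    return 1.0 - dim_fraction * bits / 32.0

# e.g., keeping 25% of the dimensions at 2 bits (4 values) per element:
print(f"{compression_rate(0.25, 2):.1%}")  # -> 98.4%
```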
4. DISCUSSION
In this paper, we evaluated the robustness of deep representations by introducing perturbations to feature vectors extracted from upper layers of deep networks. We explored in depth the resiliency of features transferred from the VGG-M model to the Pascal VOC 2007 dataset. Our findings show that there is a high level of redundancy in deep representations and that, thus, they may be heavily compressed. In our experiments, we achieve a compression rate of 98.4% while losing only 0.88% of the original score for Pascal VOC 2007. To ensure our conclusions are not dataset- nor model-specific, our two main approaches, Dimensionality Reduction and Quantization, were extensively tested, with supplementary results for MIT-67, Food-101, GoogLeNet and BossaNova. Furthermore, we observed that despite being more compact, deep architectures are also more robust to perturbations when compared to approaches based on Bags of Visual Words. Those findings are especially useful for image retrieval and metric learning [22], in which the size of the feature vector is crucial to achieve fast response times, and for applications involving portable devices or remote classification, in which data must be efficiently transferred over the network.
ACKNOWLEDGEMENTS
This research was partially supported by CNPq, Santander and Samsung Eletrônica da Amazônia Ltda., in the framework of law No. 8,248/91. We also thank CENAPAD-SP (Project 533), Microsoft Azure and Amazon Web Services for computational resources, and Michel Fornaciali for his valuable advice.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems (NIPS), pp. 1–9, 2012.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.
[3] T. Durand, N. Thome, and M. Cord, "WELDON: Weakly supervised learning of deep convolutional neural networks," in Computer Vision and Pattern Recognition (CVPR), 2016.
[4] J. Bruna and S. Mallat, "Invariant scattering convolution networks," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 35, no. 8, pp. 1872–1886, 2013.
[5] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[6] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[7] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?," in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3320–3328.
[8] M. Chevalier, N. Thome, M. Cord, J. Fournier, G. Henaff, and E. Dusch, "LR-CNN for fine-grained classification with varying resolution," Sep 2015.
[9] T. Durand, N. Thome, and M. Cord, "MANTRA: Minimum maximum latent structural SVM for image classification and ranking," in International Conference on Computer Vision (ICCV), 2015.
[10] V. Vanhoucke, A. Senior, and M. Mao, "Improving the speed of neural networks on CPUs," in Advances in Neural Information Processing Systems (NIPS), 2011, pp. 1–8.
[11] M. Courbariaux, Y. Bengio, and J.-P. David, "Training deep neural networks with low precision multiplications," in International Conference on Learning Representations (ICLR), 2015.
[12] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Advances in Neural Information Processing Systems (NIPS), 2015.
[13] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, N. Jerger, R. Urtasun, and A. Moshovos, "Reduced-precision strategies for bounded memory in deep neural nets," CoRR, vol. abs/1511.05236, 2015.
[14] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, "Return of the devil in the details: Delving deep into convolutional nets," in British Machine Vision Conference (BMVC), 2014.
[15] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, "The Pascal Visual Object Classes (VOC) challenge," International Journal of Computer Vision (IJCV), vol. 88, no. 2, pp. 303–338, 2010.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[17] S. Avila, N. Thome, M. Cord, E. Valle, and A. De A. Araújo, "Pooling in image representation: The visual codeword point of view," Computer Vision and Image Understanding (CVIU), vol. 117, no. 5, pp. 453–465, 2013.
[18] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 413–420.
[19] X. Wang, D. Kumar, N. Thome, M. Cord, and F. Precioso, "Recipe recognition with large multimodal food dataset," in IEEE International Conference on Multimedia & Expo (ICME), 2015, pp. 1–6.
[20] A. Vedaldi and K. Lenc, "MatConvNet – Convolutional neural networks for MATLAB," in ACM International Conference on Multimedia (MM), 2015.
[21] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," Journal of Machine Learning Research (JMLR), vol. 9, pp. 1871–1874, 2008.
[22] C. Le Barz, N. Thome, M. Cord, S. Herbin, and M. Sanfourche, "Exemplar based metric learning for robust visual localization," in IEEE International Conference on Image Processing (ICIP), 2015.