Exploiting Convolution Filter Patterns for Transfer Learning
Mehmet Aygün, Istanbul Technical University, Istanbul, Turkey, [email protected]
Yusuf Aytar, MIT, Cambridge, USA, [email protected]
Hazım Kemal Ekenel, Istanbul Technical University, Istanbul, Turkey, [email protected]
Abstract
In this paper, we introduce a new regularization technique for transfer learning. The aim of the proposed approach is to capture statistical relationships among the convolution filters learned by a well-trained network and transfer this knowledge to another network. Since the convolution filters of prevalent deep Convolutional Neural Network (CNN) models share a number of similar patterns, we capture such correlations with Gaussian Mixture Models (GMMs) and transfer them through a regularization term in order to speed up learning. We have conducted extensive experiments on the CIFAR-10, Places2, and CMPlaces datasets to assess the generalizability, task transferability, and cross-model transferability of the proposed approach, respectively. The experimental results show that feature representations are efficiently learned and transferred through the proposed statistical regularization scheme. Moreover, our method is architecture independent and applicable to a variety of CNN architectures.
1. Introduction
CNN models have proven successful at various computer vision tasks such as image classification [14, 27, 28], object detection [22], image segmentation [19, 21], and face recognition [20], where large-scale datasets [24, 32, 20] are available. Nevertheless, the performance of CNN models drops significantly when training data is limited or the domain of the training set is far from that of the test set. Today, the most successful and practical solution to the lack of a large annotated dataset is to train networks on large-scale annotated datasets like ImageNet [24] and Places [32], and then finetune these pre-trained networks for specific problems. Thanks to the community, pre-trained models of well-known architectures like AlexNet [18], VGG-16 [27], GoogLeNet [28], and ResNet [14] are available online. However, when architectural changes are needed, these pre-trained networks cannot be used. In such cases, it is necessary to train models on large datasets and then finetune for the particular problem. Unfortunately, while these models achieve near-human performance on many applications, training them on large datasets remains a significant problem and a very time-consuming process. With recent advances in deep learning such as residual learning [14], successful networks have become deeper and deeper, and training these models has become harder in terms of complexity and time. To find a solution to this problem, inspired by
What makes a good detector? [11], we have investigated
What makes a good CNN filter?
Two successful CNN models, AlexNet [18] and VGG-16 [27], are analyzed from a statistical perspective, and we observe that these models show similar patterns and redundancies. In addition to our findings, the authors of [7] show that 95% of the weights of neural networks can be predicted without any reduction in accuracy. This leads us to the idea that we can exploit these redundancies and patterns to learn better representations quickly through transfer learning. Similar to the methods used in [2, 4, 11], we introduce a regularization term for transferring this statistical information to improve the learning scheme. First, the statistical distribution of the convolution filters of a well-trained network is learned with a Gaussian Mixture Model. Next, the newly trained model is encouraged to show statistics similar to the source model via the regularization term. Extensive experiments on the CIFAR-10 [17], Places2 [32], and CMPlaces [4] datasets show that the proposed approach is generalizable and that networks can quickly learn a representation with statistical regularization which can efficiently be transferred to other tasks and across domains. An overview of the proposed method is given in Figure 1. The rest of the paper is organized as follows: Section 2 summarizes related work, Section 3 presents our detailed statistical analysis and introduces the regularization term, Section 4 presents and discusses the experimental results, and Section 5 concludes the paper.

Figure 1: Overview of the proposed system. The blue (top) network is a well-trained CNN whose filter distribution is learned via a GMM. When the new red (bottom) network is being trained, along with the classification loss, the new regularization term \sum \lambda \cdot R(w), which measures the statistical difference between the two distributions, is minimized. The statistical knowledge can thus be transferred from the source CNN to the target CNN.
Best viewed in color.
2. Related Work
Network Distillation:
The aim of network distillation approaches is to transform larger networks into smaller ones without losing much of the information learned by the large network. The pioneering work was conducted by Bucila et al. [3], whose aim was to compress an ensemble of models into a single model without significant accuracy loss. Later, Hinton et al. [15] optimized a smaller network to show a softmax output similar to that of the cumbersome model. Then, Romero et al. [23] suggested that, in addition to softmax outputs, intermediate representations could be used for distilling the network. Recently, distillation has also been applied in cross-modal settings [13, 1]. The major drawback of these methods is the necessity to train a large network before using it to train a smaller one. Also, these models mainly match the outputs of the networks, whereas we regularize the internal weights of the network.
Domain Adaptation:
Domain adaptation is the problem of learning a model that generalizes to target-domain examples in addition to source-domain ones while learning only from the source domain. With the success of CNNs, domain adaptation work has focused on CNNs, and several successful methods have been proposed. For instance, Ganin & Lempitsky [9] introduced the gradient reversal layer, which acts as an identity transform in the forward pass of the CNN and changes the sign of the gradient (and scales it) in the backward pass. In their work, they added a domain classifier on top of the feature map and tried to predict the domain of the examples. During backpropagation, they changed the gradient using the gradient reversal layer and forced the network to learn domain-invariant features by maximizing the loss of the domain classifier. Tzeng et al. [29] added new terms to the objective function of the CNN to both increase domain confusion and transfer inter-class knowledge. The first term, the domain confusion loss, forces the network to learn domain-invariant features, and the soft label loss forces the features of the same class to be similar in both the source and target domains. In contrast to Ganin & Lempitsky [9], some target labels must be available for optimizing the soft label loss. Moreover, several other recent works [10, 12, 30, 26] have focused on domain adaptation problems for CNNs. While these methods focus on transferring information about the structure of the data, our method focuses on transferring more local information.
Statistical Transfer:
Statistical transfer is learning statistical properties from a source and using this statistical knowledge to improve the learning procedure. For instance, Aytar & Zisserman [2] proposed part-level transfer regularization, which transfers parts of source detectors instead of whole detectors. Additionally, they take advantage of part co-occurrence statistics: for example, if there is a wheel in the picture, another wheel is likely to appear as well. They calculated these co-occurrence statistics using the source data and transferred them when a new object detector was learned. Moreover, in the era of hand-crafted features, Gao et al. [11] analyzed the well-known HOG [6] templates of successful object detectors and made two observations: first, the activations of individual cell models show some correlations, and second, local neighborhoods of cells show the same characteristic. Furthermore, since they wanted to transfer local information, in contrast to whole templates as in [2], they defined their priors such that the correlations of the source model could be transferred to the target model without global template alignment. In a recent work [4], GMMs are used for aligning cross-modal data. In this approach, the statistics of the activation maps of different layers are learned for one modality, and the other modalities are forced to show similar statistics in their activations via a regularization term. This method is capable of aligning modalities using the regularization term even when there is no strong alignment between them. Our work is influenced by Gao et al. [11] and Castrejon et al. [4]: the first directed us to analyze the weights of convolution filters and find the correlations between filters, and the second to use a GMM to enforce similar statistics in a non-convex optimization problem.
Different from these works, our regularization forces the weights to show similar statistics and transfers the local correlations of the convolution filters.
3. Statistical Regularization
In this section, we first present our analysis of CNN models from a statistical perspective. Next, we describe our approach for capturing statistical knowledge from a CNN and show how to transfer this knowledge via a regularization term.
To investigate the statistical behavior of CNN models, we used VGG-16 [27] and AlexNet [18] models trained on the ImageNet [24] and Places-365 [32] datasets. In particular, the weights of the convolution filters were analyzed for the four models. We tried to answer the following three questions: (i) are the filters separable into clusters? (ii) how similar are the filters inside a cluster? and (iii) how are the filters distributed over the clusters? To obtain this information, all 3 × 3 filters of a model were clustered into ten clusters using the k-means algorithm, and each cluster's covariance matrix and mean value were calculated and visualized. The covariance matrices provide information about how the members of a cluster are correlated, while the mean values show whether the clusters are similar to each other. For instance, as can be seen in Figure 2 (a), the covariance matrices of all four models show that there is some shared behavior of the learned filters across layers and architectures. The cluster centers also depict similarities: in general, all models have clusters whose mean values are concentrated on the left, right, top, and bottom. Moreover, when we look at the distribution of the filters over the clusters, the models show different characteristics. While the distributions are Gaussian-like in the VGG-16 models, in the AlexNet models they cannot be fitted to a known distribution and most of the clusters have roughly the same number of filters. However, both the AlexNet and the VGG-16 models show similar distributions model-wise even though they are trained on different datasets. The mean values are shown in Figure 2 (b), and the distributions can be seen in Figure 3. Since our aim is to capture the statistical properties of "good" convolution filters and transfer this statistical knowledge to another network, we model the distribution of the convolution filters and enforce the new network to show a similar distribution.
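The clustering analysis above can be sketched as follows. This is a minimal numpy sketch using random stand-in weights; in the paper the vectors are the actual 3 × 3 filters of converged VGG-16/AlexNet models, and the per-cluster means and covariances are what Figure 2 visualizes.

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means over flattened filter weights."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign every filter to its nearest cluster center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned filters.
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# Hypothetical stand-in weights; in the paper these are 3x3 filters
# extracted from trained models and flattened to 9-d vectors.
filters = np.random.default_rng(1).normal(size=(500, 9))

labels, centers = kmeans(filters, k=10)
cluster_stats = {
    j: (filters[labels == j].mean(axis=0),   # cluster mean (cf. Fig. 2b)
        np.cov(filters[labels == j].T))      # 9x9 covariance (cf. Fig. 2a)
    for j in range(10) if (labels == j).any()
}
```

The per-cluster covariance matrices reveal correlations among filter entries within a cluster, and the cluster means can be reshaped back to 3 × 3 for visualization.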
Similar to our work, [4] forces the activations of networks to show similar distributions across modalities in order to align cross-modal data. In [4], the authors used both mixture and single Gaussian distributions for modelling the activations, and the mixture models outperformed the single Gaussian model in their problem. Since our aim is also similar to theirs, capturing statistical knowledge and transferring it, we decided to use mixture models for modeling the distributions. The main difference is that we force the weights of the filters, instead of the activations, to show similar distributions across the networks.

Let x_n be a training image and y_n its corresponding label. We want to minimize

    \min_w \sum_n L(z(x_n, w), y_n)    (1)

where z(x_n, w) is the output of the network. We add a regularization term R to the loss, representing the negative log-likelihood of a convolution filter, to encourage the network to learn weights that are statistically similar to those of another network. For a filter w_i and a distribution P we define R as

    R(w_i) = -\log P(w_i)    (2)

The distribution P is modeled as a GMM, so

    P(w \mid \pi, \mu, \Sigma) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(w \mid \mu_k, \Sigma_k)    (3)

where K is the number of mixture components, \sum_k \pi_k = 1, and \pi_k \ge 0 for all k. The total negative log-likelihood for N convolution filters can then be defined as

    R(w \mid \pi, \mu, \Sigma) = \sum_{i=1}^{N} -\log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(w_i \mid \mu_k, \Sigma_k)    (4)

where \mathcal{N} is the multivariate Gaussian density

    \mathcal{N}(w \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(w - \mu)^T \Sigma^{-1} (w - \mu)\right)    (5)

with d the dimensionality of w. Since the network is trained with back-propagation, the derivative of the negative log-likelihood must be calculated; however, calculating the exact derivative in every iteration would be very expensive during training, so we calculate it approximately. To approximate the derivative for a convolution filter, we first pick the mixture component \mathcal{N}(\mu_s, \Sigma_s) that maximizes the probability of the filter; the derivative is then calculated using that single Gaussian. The partial derivative of R with respect to a convolution filter w_i is

    \frac{\partial R}{\partial w_i} = (w_i - \mu_s) \Sigma_s^{-1}    (6)

Finally, our complete loss is defined as

    \min_w \sum_n L(z(x_n, w), y_n) + \alpha \sum_i \|w_i\|^2 + \lambda \sum_i R(w_i)    (7)

where the first and second terms are the classification loss and the weight decay term, and the last one is our regularization term. λ is a hyperparameter that controls the strength of the regularization. For the experiments in this paper, we used a distribution P learned from the convolution filters of a VGG-16 model trained on ImageNet, fitted with the Expectation-Maximization (EM) algorithm using K = 1000 components. To reduce the number of parameters, we assumed the covariances \Sigma_k to be diagonal. In addition, the k-means algorithm was run on the filters to decrease the convergence time of EM.

Figure 2: (a) The covariance matrices of the ten clusters computed from the weights of the convolution filters. The converged VGG-16 and AlexNet models, trained on the Places and ImageNet datasets, are used for clustering and visualization. (b) Visualizations of the mean values of each cluster.

Figure 3: The distributions of the convolution filters over the clusters. For both Places and ImageNet, the converged VGG and AlexNet models show similar distributions.
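The regularization term R of Eqs. (2)-(5) can be sketched in a few lines of numpy. This is a minimal sketch assuming diagonal covariances, as in the paper; the authors' actual implementation was a custom Caffe convolution layer, so the function name and interface here are illustrative only.

```python
import numpy as np

def gmm_neg_log_likelihood(w, pi, mu, var):
    """R(w) = -log sum_k pi_k N(w | mu_k, diag(var_k)), cf. Eqs. (2)-(5).

    w:   (d,)    flattened filter weights
    pi:  (K,)    mixture weights
    mu:  (K, d)  component means
    var: (K, d)  diagonal covariance entries
    """
    d = w.shape[0]
    # log N(w | mu_k, Sigma_k) for every component, diagonal covariance.
    log_norm = -0.5 * (d * np.log(2.0 * np.pi) + np.log(var).sum(axis=1))
    log_comp = log_norm - 0.5 * (((w - mu) ** 2) / var).sum(axis=1)
    # Numerically stable log-sum-exp over the K components.
    a = np.log(pi) + log_comp
    m = a.max()
    return -(m + np.log(np.exp(a - m).sum()))
```

With K = 1 this reduces to the ordinary Gaussian negative log-likelihood; in the paper the GMM has K = 1000 components fitted with EM.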
While it is known that different layers, especially the first ones, show different characteristics than the others, we used a single GMM P for all filters, because if multiple distributions were used, an alignment between the layers of the source and target networks would be needed at transfer time.
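The approximate gradient of Eq. (6) can be sketched as follows, again as a numpy illustration under the paper's diagonal-covariance assumption rather than the authors' Caffe code. At each step, λ times this gradient would be added to the classification-loss and weight-decay gradients of each filter, following Eq. (7).

```python
import numpy as np

def approx_grad(w, pi, mu, var):
    """Approximate dR/dw_i (Eq. 6): differentiate through the single
    mixture component that assigns filter w the highest probability."""
    log_comp = (np.log(pi)
                - 0.5 * np.log(var).sum(axis=1)
                - 0.5 * (((w - mu) ** 2) / var).sum(axis=1))
    s = log_comp.argmax()            # best-matching component N(mu_s, Sigma_s)
    return (w - mu[s]) / var[s]      # (w - mu_s) Sigma_s^{-1}, diagonal case
```

Picking the dominant component avoids evaluating the exact mixture gradient, which would require a weighted sum over all K = 1000 components at every iteration.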
4. Experiments and Results
Our hypothesis was that well-trained CNN models show similar statistical patterns and that we could exploit this information in the training phase. Our experiments show that we can learn a better representation faster with the statistical regularization. For example, in Section 4.1 we validate that, with regularization, the convolution filters create more general representations and overfit less than regularly trained filters. In Section 4.2, we see how the regularization helps to learn representations that transfer successfully to another task. Furthermore, while transfer across tasks and domains is widely studied, transfer learning for cross-modal data is not a well-studied problem; we show that our method can be applied to cross-modal data as well. In Section 4.3, we present our experiments on the cross-modal dataset CMPlaces [4].

Figure 4: Difference between train and test losses of models on CIFAR-10. After freezing the weights, the models continued training; later freezing shows less overfitting. In panels (a)-(d) the effect of the regularization is compared: with regularization, models always overfit more slowly (the gap grows more slowly) than with regular training. Best viewed in color.
4.1. Generalizability

In this section, we evaluate whether the statistical regularization helps the generalizability of the learned representation. For this purpose, we used the mid-sized CIFAR-10 dataset [17] and the CNN architecture described in [25]. First, we trained the network with and without regularization and stopped the training at 10k, 15k, 20k, and 25k iterations. We then froze all 3 × 3 filters and continued learning on the training data. Since only the last layers change during this training and the features extracted from the convolution filters are not yet generic, the validation loss starts to increase and the validation accuracy drops.

We compared the networks whose training stopped at different iterations, with and without regularization. In our experiments, the training loss started to oscillate within a small interval and did not change much, i.e., it neither increased nor decreased, whereas the validation loss did change. The gap between training and test loss indicates how generalizable the network is. When we compared the models initialized at 10k, 15k, 20k, and 25k iterations, the test loss of the regularized versions always increased more slowly than that of the non-regularized versions. For instance, as can be seen in Figure 4 (a-d), the difference between training and test loss grows more slowly in regularized models than in non-regularized models. Interestingly, as the regular training time before freezing increases, the gap between regularized and normal training shrinks.

4.2. Task Transferability

In this section, we show that the quickly learned representations can also be transferred to another task more successfully with regularization. To validate this claim, we transfer filter distributions from ImageNet to the Places2 [32] dataset. The common solution for a classification problem is to train the network on ImageNet and finetune the model on the new task afterwards. We follow the same procedure in this section, but we want to finish the pre-training stage as early as possible.
We also want to show that our regularization can be used with finetuning. To evaluate the method's performance, we first train the VGG-F model introduced in [5] on ImageNet [24] data with and without regularization. As in Section 4.1, we take snapshots at 10k, 25k, and 50k iterations. Next, we finetune on Places2 data [32]. Comparing the effect of regularization, we see that it helps to learn better representations in the early iterations. For example, as can be seen in Figure 5 (a), among the models initialized with the weights learned in pre-training for only 10k iterations, the regularized version shows better performance than the non-regularized one. For the models initialized at 25k iterations, the performance difference between the regularized and non-regularized versions shrinks, and at 50k iterations there is nearly no difference in performance. The test loss/iteration plots are shown in Figure 5 (a), (b), and (c). This experiment shows that as pre-training time increases, the gain obtained from regularization decreases. However, for a limited amount of pre-training time, the regularization can increase the efficiency of pre-training.
4.3. Cross-Modal Transferability

While CNN performance is very good at various computer vision tasks on real-world images, most computer vision algorithms fail on non-real images. This shows that the generalization performance of computer vision algorithms is not good for cross-modal data. Some recent works [4, 8] have focused on this problem, and recently Castrejon et al. [4] introduced a new cross-modal dataset. In this dataset there are five different modalities for each scene type: natural image, sketch, clip art, spatial text, and description. As in Section 4.2, we first train VGG-F models on a large dataset, Places2, and finetune on CMPlaces data. We used the clipart and sketch data to evaluate our performance, since we are interested in the visual domain. We take snapshots at 25k and 50k iterations from the regularized and non-regularized networks. Next, we finetune on the sketch and clipart data and compare the accuracies. We also finetune a VGG-F model converged on ImageNet data and compare it with our pre-trained models. Consistent with our earlier experiments, increasing the pre-training time increases performance. The regularization also helps to learn better representations and increases the value of pre-training for finetuning. Although there is a significant gap between the converged ImageNet model and our regularized model on the sketch data, there is no substantial difference between the 50k-iteration regularized model and the converged ImageNet model on the clipart modality. The results for both modalities are given in Table 1 and Table 2. They show that, instead of training on ImageNet until the model converges, we can train the models using regularization for only a few iterations and employ these pre-trained networks for cross-modal transfer.

Figure 5: The pre-trained VGG-F models are finetuned on the Places2 dataset. Pre-training is stopped at 10k, 25k, and 50k iterations. For (a) and (b) it can be seen that with regularization the test loss decreases more quickly. Since test losses and accuracies are correlated, we only provide test losses.

Table 1: Top-5 accuracies after finetuning on clipart data; the first column describes how pre-training was done.

    Pre-Training          Top-5 Accuracy
    25k                   60.45
    25k w/ Reg.           62.25
    50k                   63.0
    50k w/ Reg.           64.5
    Converged ImageNet    64.8

Table 2: Top-5 accuracies after finetuning on sketch data; the first column describes how pre-training was done.

    Pre-Training          Top-5 Accuracy
    25k                   33.05
    25k w/ Reg.           40.75
    50k                   37.65
    50k w/ Reg.           41.1
    Converged ImageNet    53.6

We used the Caffe [16] deep learning framework in our experiments and implemented a special convolution layer for applying the statistical regularization. When the VGG-F model is trained on the ImageNet and Places datasets, stochastic gradient descent with a 0.01 learning rate is used for optimization. In the CIFAR experiments, we used the parameters described in [25]. Finally, the Gaussian mixture models are learned using the VLFeat library [31].
5. Conclusion
In this paper, we analyzed the convolution filters of well-known CNN architectures and found that they share a number of common patterns and redundancies that can be exploited for transfer learning. Gaussian Mixture Models are used for capturing these statistical patterns, and a new regularization term is introduced for transferring such patterns to other networks. Our experiments show that, with regularization, we can quickly learn good representations that are transferable to other tasks and across domains. For instance, we achieved around a 25% improvement on the sketch modality of the cross-modal dataset under limited pre-training time. Our method also reaches performance on clipart data similar to a model pre-trained to convergence on ImageNet, while our pre-training is stopped at 50k iterations.

References

[1] Y. Aytar, C. Vondrick, and A. Torralba. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, pages 892-900, 2016.
[2] Y. Aytar and A. Zisserman. Part level transfer regularization for enhancing exemplar SVMs. volume 138, pages 114-123, 2015.
[3] C. Bucilu, R. Caruana, and A. Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535-541. ACM, 2006.
[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2940-2949, 2016.
[5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014.
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886-893. IEEE, 2005.
[7] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148-2156, 2013.
[8] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa. Sketch-based image retrieval: Benchmark and bag-of-features descriptors. IEEE Transactions on Visualization and Computer Graphics, 17(11):1624-1636, 2011.
[9] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495, 2014.
[10] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1-35, 2016.
[11] T. Gao, M. Stark, and D. Koller. What makes a good detector? Structured priors for learning from few examples. Computer Vision - ECCV 2012, pages 354-367, 2012.
[12] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pages 597-613. Springer, 2016.
[13] S. Gupta, J. Hoffman, and J. Malik. Cross modal distillation for supervision transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2827-2836, 2016.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[15] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[17] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431-3440, 2015.
[20] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, volume 1, page 6, 2015.
[21] P. O. Pinheiro, R. Collobert, and P. Dollar. Learning to segment object candidates. In Advances in Neural Information Processing Systems, pages 1990-1998, 2015.
[22] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779-788, 2016.
[23] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[25] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901-909, 2016.
[26] O. Sener, H. O. Song, A. Saxena, and S. Savarese. Unsupervised transductive domain adaptation. arXiv preprint arXiv:1602.03534, 2016.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[29] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4068-4076, 2015.
[30] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. arXiv preprint arXiv:1702.05464, 2017.
[31] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. In Proceedings of the 18th ACM International Conference on Multimedia, pages 1469-1472. ACM, 2010.
[32] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva. Places: An image database for deep scene understanding. arXiv preprint arXiv:1610.02055.