Unsupervised Deep Feature Transfer for Low Resolution Image Classification
Yuanwei Wu∗, Ziming Zhang†, and Guanghui Wang
EECS, The University of Kansas, Lawrence, KS 66045
Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139
[email protected], [email protected], [email protected]
Abstract
In this paper, we propose a simple yet effective unsupervised deep feature transfer algorithm for low resolution image classification. No fine-tuning of convnet filters is required in our method. We use a pre-trained convnet to extract features for both high- and low-resolution images, and then feed them into a two-layer feature transfer network for knowledge transfer. An SVM classifier is learned directly on these transferred low resolution features. Our network can be embedded into state-of-the-art deep neural networks as a plug-in feature enhancement module. It preserves the data structure in the feature space of high resolution images, and transfers the discriminative features from a well-structured source domain (high resolution feature space) to a less organized target domain (low resolution feature space). Extensive experiments on the VOC2007 test set show that the proposed method achieves significant improvements over the feature extraction baseline.
1. Introduction
Recently, deep neural networks have demonstrated impressive results in image classification [22, 16, 5, 37], object detection [11, 29, 25, 41], instance segmentation [14], visual tracking [41, 33, 3, 2], depth estimation [17, 18], face recognition [6], and image translation [19, 36, 35, 34]. The success of DNNs has become possible mostly due to large amounts of annotated data [9], as well as advances in computing resources and better learning algorithms [12, 40]. Most of these works typically assume that the images are of sufficiently high resolution (e.g., × or larger).

The limitation of requiring large amounts of data to train DNNs has been alleviated by the introduction of transfer learning techniques. A common way to make use of transfer learning in the context of DNNs is to start from a model pre-trained on a similar task or domain, and then fine-tune the

∗ This work was done when the first author took an internship at MERL.
† Corresponding author.
Figure 1. The tSNE [26] of deep features (2048-D) of the VOC2007 train set extracted from the pool5 layer of pre-trained resnet-101 [16]. (a) Features of High Resolution (HR) images, and (b) features of Low Resolution (LR) images. The HR features are well separated; however, the LR features are mixed together.

parameters to the new task. For example, a model pre-trained on ImageNet for classification can be fine-tuned for object detection on Pascal VOC [11, 29].

In this paper, we focus on low resolution (e.g., × or less) image classification because, for privacy reasons, it is common to use low resolution images in real-world applications, such as face recognition in surveillance videos [42]. Without additional information, learning from low resolution images always reduces to an ill-posed optimization problem and achieves much degraded performance [28].

As shown in Fig. 1, the deep features of high resolution images extracted from a pre-trained convnet already encode discriminative per-class representations; therefore, they are well separated in the tSNE visualization. However, the extracted features of low resolution images are mixed together. A possible solution is to exploit transfer learning, leveraging the discriminative feature representation of high resolution images for low resolution images.

In this paper, we propose a simple yet effective unsupervised deep feature transfer approach that boosts classification performance on low resolution images. We assume that we have access to high resolution labeled images during training, but at test time we only have low resolution images. Most existing datasets are high resolution. Moreover, it is much easier to

Figure 2. The overview of the proposed unsupervised deep feature transfer algorithm. It consists of three modules. In the feature extraction module, a pre-trained deep convnet is used as a feature extractor to obtain HR and LR features from HR and LR images, respectively.
Then, we cluster the HR features to obtain pseudo-labels, which are used to guide the feature transfer learning of LR features in the feature transfer network. Finally, an SVM classifier is trained on the transferred LR features.

label subcategories in high resolution images. Therefore, we believe this is a reasonable assumption. We aim to transfer knowledge from such high resolution images to real-world scenarios that only have low resolution images. The basic intuition behind our approach is to utilize high quality discriminative representations in the training domain to guide feature learning for the target low resolution domain.

The contributions of our work are three-fold:

• No fine-tuning of convnet filters is required in our method. We use a pre-trained convnet to extract features for both high resolution and low resolution images, and then feed them into a two-layer feature transfer network for knowledge transfer. An SVM classifier is learned directly on these transferred low resolution features. Our network can be embedded into state-of-the-art DNNs as a plug-in feature enhancement module.

• It preserves the data structure in the feature space of high resolution images, by transferring the discriminative features from a well-structured source domain (high resolution feature space) to a less organized target domain (low resolution feature space).

• Our performance is better than that of the baseline feature extraction approach on the low resolution image classification task.
2. Related Work
Our method is closely related to unsupervised learning of features and to transfer learning.
Unsupervised learning of features:
Clustering has been widely used for image classification [4, 38, 20]. Ji et al. [20] propose invariant information clustering, relying on statistical learning by optimising mutual information between related pairs, for unsupervised image classification and segmentation. Caron et al. [4] present a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features. Yang et al. [38] propose an approach to jointly learn deep representations and image clusters by combining agglomerative clustering with CNNs, formulated as a recurrent process.
Transfer learning:
It is commonly used in scenarios where the training and testing data distributions are different. Saenko et al. [31] learn a regularized non-linear transformation in the context of object recognition to minimize the effect of domain-induced changes in the feature distribution. Chen et al. [8] transfer knowledge stored in a previous network into each new deeper or wider network to accelerate the training of a significantly larger neural network. Yosinski et al. [39] experimentally study the transferability of hierarchical features in deep neural networks. Azizpour et al. [1] investigate the factors affecting the transferability of generic deep convolutional networks, such as the network architecture, the distribution of the training data, etc. Tzeng et al. [32] learn a CNN architecture to optimize domain invariance and transfer information between tasks. Long et al. [24] propose a deep adaptation network architecture to match the mean embeddings of different domain distributions in a reproducing kernel Hilbert space. Guo et al. [13] propose an adaptive fine-tuning approach to find the optimal fine-tuning strategy per instance for the target data. Readers can refer to [27] and the references therein for details about transfer learning.
3. Proposed Approach
This section describes the proposed unsupervised deep feature transfer approach.
With the recent success of deep learning in computer vision, deep convnets have become a popular choice for representation learning, mapping raw images to an embedding vector space of fixed dimensionality. In the context of supervised learning, they can achieve better performance than human beings on standard classification benchmarks [15, 22] when trained with large amounts of labelled data.

Let f_θ denote the convnet mapping function, where θ is the set of corresponding learnable parameters. We refer to the vector obtained by applying this mapping to an image as its feature, or features. Given a training set X = {x_1, ..., x_N} of N images and the corresponding ground truth labels Y = {y_1, ..., y_N}, we want to find an optimal parameter θ* such that the mapping f_θ* predicts good general features. Each image x_i is associated with a class label y_i in {0, 1}^k. Let g_ω denote a classifier with parameter ω. The classifier predicts labels on top of the features f_θ(x_i). The parameter θ of the mapping function and the parameter ω of the classifier are learned jointly by optimizing the following objective function:

  min_{θ,ω} (1/N) Σ_{i=1}^{N} L(g_ω(f_θ(x_i)), y_i),    (1)

where L is the multinomial logistic loss measuring the difference between the predicted labels and the ground-truth labels over the training data samples.

The idea of this work is to boost feature learning for low resolution images by exploiting unsupervised deep feature transfer from the discriminative high resolution features. The overview of the proposed approach is shown in Fig. 2. It consists of three modules: feature extraction, unsupervised deep feature transfer, and classification, discussed below.
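To make Eq. (1) concrete, here is a minimal numpy sketch of the multinomial logistic loss with a linear classifier standing in for g_ω; the feature extractor f_θ is replaced by precomputed toy features, and all sizes and names are illustrative rather than the paper's settings.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multinomial_logistic_loss(W, feats, labels):
    """Eq. (1) for a fixed feature extractor: mean cross-entropy of a
    linear classifier g_w(v) = softmax(W @ v) over N samples."""
    probs = softmax(feats @ W.T)                 # (N, k) class probabilities
    n = feats.shape[0]
    return -np.log(probs[np.arange(n), labels]).mean()

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))                  # toy 4-D "convnet features"
labels = rng.integers(0, 3, size=8)              # k = 3 classes
W = rng.normal(size=(3, 4))                      # classifier parameters ω
loss = multinomial_logistic_loss(W, feats, labels)
```

In the paper this loss is minimized jointly over the transfer-network parameters and the classifier with mini-batch SGD; the sketch only evaluates it once.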
Feature extraction.
We observe that the deep features extracted from a convnet can form well separated clusters, as shown in Fig. 1. Therefore, we introduce transfer learning to boost low resolution feature learning under the supervision of high resolution features. We extract the features (N-dimensional) of both high and low resolution images from a pre-trained deep convnet. More details are given in Sec. 4.2.
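As a small illustration of the extraction step: pool5 in resnet-101 is a global average pool over the final convolutional feature map, producing one 2048-D vector per image. A numpy sketch of that pooling step (the 7×7 spatial size is our assumption for standard inputs and is not stated in this paper):

```python
import numpy as np

def pool5(conv_map):
    """Global average pooling: (C, H, W) conv features -> (C,) vector."""
    return conv_map.mean(axis=(1, 2))

rng = np.random.default_rng(0)
conv_map = rng.normal(size=(2048, 7, 7))  # stand-in for the final res5 map
feat = pool5(conv_map)                    # 2048-D image feature
```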
Unsupervised deep feature transfer.
We propose a feature transfer network to boost low resolution feature learning. However, under our assumption, the ground truth labels for low resolution images are absent. Therefore, we need to make use of the information from the high resolution features. To do so, we propose to cluster the high resolution features and use the resulting cluster assignments as "pseudo-labels" to guide the learning of the feature transfer network with low resolution features as input. Without loss of generality, we use a standard clustering algorithm, k-means. k-means takes the high resolution features, in our case the features f_θ(x_i) extracted from the convnet, and clusters them into k distinct groups based on a geometric criterion. Then, the pseudo-label of each low resolution feature is assigned by finding its nearest neighbor among the k centroids of the high resolution features. Finally, the parameters of the feature transfer network are updated by optimizing Eq. (1) with mini-batch stochastic gradient descent.

Classification.
The final step is to train a commonly used classifier, such as a Support Vector Machine (SVM), on the transferred low resolution features. At test time, given only low resolution images, our algorithm first extracts their features, then feeds them to the learned feature transfer network to obtain the transferred low resolution features, and finally runs the SVM to obtain the classification results directly.
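The three modules above (cluster HR features, pseudo-label LR features by nearest centroid, then classify) can be sketched end to end with scikit-learn on synthetic stand-ins for the real features. Every array, size, and threshold below is illustrative, and the learned transfer network itself is omitted (the LR features are passed through unchanged):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-ins: 16-D "HR features" and noisy "LR features"
# (the paper's features are 2048-D pool5 vectors).
hr_feats = rng.normal(size=(300, 16))
lr_feats = hr_feats + 0.1 * rng.normal(size=(300, 16))

# 1) Cluster the HR features; cluster assignments become pseudo-labels.
k = 5
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(hr_feats)

# 2) Pseudo-label each LR feature by its nearest HR centroid.
pseudo_labels = kmeans.predict(lr_feats)

# (In the paper these pseudo-labels supervise the two-layer transfer
# network via Eq. (1); here we pass the LR features through unchanged.)
transferred = lr_feats

# 3) Train an SVM on the transferred LR features and evaluate.
svm = LinearSVC(C=1.0, max_iter=5000).fit(transferred[:200], pseudo_labels[:200])
acc = svm.score(transferred[200:], pseudo_labels[200:])
```

`kmeans.predict` performs exactly the nearest-centroid assignment the paper describes for the pseudo-labeling step.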
4. Experiments
We conduct low resolution classification on the PASCAL VOC2007 dataset [10] with 20 object classes. There are , images in the VOC2007 trainval set and , images in the VOC2007 test set. However, the dataset contains high resolution images only. We follow [23] to generate the low resolution images. In this work, we generate high resolution images by resizing the original images to × using bicubic interpolation. We generate the low resolution images by down-sampling the originals to × and then up-sampling them back to × .

We conduct our experiments using Caffe [21]. We use resnet-101 [16] pre-trained on ILSVRC [30] as the backbone convnet to extract features from the high and low resolution images. We extract the features from the pool5 layer, which gives a feature vector of dimension N = 2048.

The feature transfer network is a two-layer fully connected network. We conduct a grid search to find the optimal network architecture, see Sec. 4.3. It is initialized using MSRA [21] initialization. We train the feature transfer network using stochastic gradient descent with weight decay . , momentum . , batch size , , epochs , , and total iterations , . The initial learning rate is . , and it is decreased after every , iterations.

The feature transfer network is shallow, with two fully connected layers. Let N1 and N2 denote the numbers of neurons in the first and second fully connected layers, respectively. We conduct a grid search to find the optimal combination of N1 and N2, as shown in Table 2. The number N1 is determined by the number of clusters k for the pseudo-labels in k-means.
We download the Caffe model from https://github.com/BVLC/caffe/wiki/Model-Zoo

            aero bike bird boat bottle bus  car  cat  chair cow  table dog  horse mbike persn plant sheep sofa train tv   mAP
Baseline-HR 97.6 92.7 89.2 85.8 90.6  87.5 96.2 94.3 81.4  83.3 80.0  86.9 84.2  90.0  95.4  95.0  88.3  71.6 96.0  95.9 89.1
Baseline-LR 87.5 84.8 77.5 77.4 80.4  76.5 90.6 72.1 75.1  72.9 69.5  65.0 71.7  73.9  92.8  90.8  78.3  48.6 83.3  92.3 78.1
Ours        89.1 86.5 80.1 78.1 79.6  77.4 92.4 75.4 79.4  73.2 72.5  68.5 74.0  77.1  95.0  91.9  77.6  53.4 86.1  92.5 80.0
Table 1. Per-class average precision (%) for object classification on the VOC2007 test set.

Figure 3. The tSNE of features on the VOC2007 test set. (a) Features (2048-D) of High Resolution (HR) images, (b) features (2048-D) of Low Resolution (LR) images, (c) transferred features (100-D) of LR images.

N1 \ N2   256    512    1024   2048   4096
20        0.704  0.741
100       0.718
500       0.717  0.743  0.766  0.784  0.795
1000      0.713  0.743  0.762  0.783  0.793
2048      0.718  0.739  0.765  0.783  0.794
Table 2. We use grid search to find the optimal combination of N1 and N2 for the two-layer feature transfer network, measured by mean average precision (mAP) on the VOC2007 test set.

As we can see, when N1 is fixed, the mAP increases as N2 increases. This is because the capacity of the two-layer feature transfer network grows with the number of neurons in N2. However, for a fixed N2, the mAP first increases and then decreases once N1 becomes large enough; there may be a threshold value for our two-layer network, as shown in the table. We observe that N1 = 100 and N2 = 4096 give the best performance, and we use these values in our experiments.

We evaluate image classification as a per-class binary classification task on the VOC2007 test set using an SVM [7] classifier in Matlab. We compare our algorithm with two baselines, Baseline-HR and Baseline-LR. Baseline-HR trains the SVM on the extracted high resolution features (2048-D) of the VOC2007 trainval set and reports the classification performance on the VOC2007 test set. Baseline-LR is analogous, but uses the extracted low resolution features (2048-D). Our method transfers the low resolution features from 2048-D to 100-D; we therefore train the SVM on the 100-D features for each class. The comparison is shown in Table 1.

Baseline-HR is the upper bound of our method, and Baseline-LR is the lower bound. As we can see from Table 1, the proposed unsupervised deep feature transfer boosts low resolution image classification by about 2% mAP. Except for the classes "bottle" and "sheep", our method outperforms Baseline-LR. As shown in Fig. 3, the transferred low resolution features are much better separated than the extracted low resolution features.
These results indicate that the proposed unsupervised deep feature transfer algorithm does help transfer more discriminative representations from the high resolution features, and thereby boosts the low resolution image classification task. The feature transfer network could also be embedded into state-of-the-art deep neural networks as a plug-in module to enhance the learned features.
5. Conclusion
In this paper, we propose an unsupervised deep feature transfer algorithm for low resolution image classification. The proposed two-layer feature transfer network is able to boost the classification by 2% in mAP. It can be embedded into state-of-the-art deep neural networks as a plug-in feature enhancement module. While our current experiments focus on generic classification, we expect our feature enhancement module to also be useful in detection, retrieval, and category discovery settings in the future.

Acknowledgment
Dr. Zhang was supported by MERL. Mr. Wu and Prof. Wang were supported in part by NSF NRI and USDA NIFA under award no. 2019-67021-28996, and by the KU General Research Fund (GRF).
References

[1] H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson. Factors of transferability for a generic convnet representation. IEEE TPAMI, 38(9):1790–1802, 2016.
[2] S. P. Bharati, S. Nandi, Y. Wu, Y. Sui, and G. Wang. Fast and robust object tracking with adaptive detection. In , pages 706–713. IEEE, 2016.
[3] S. P. Bharati, Y. Wu, Y. Sui, C. Padgett, and G. Wang. Real-time obstacle detection and tracking for sense-and-avoid mechanism in UAVs. IEEE Transactions on Intelligent Vehicles, 3(2):185–197, 2018.
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In ECCV, pages 132–149, 2018.
[5] F. Cen and G. Wang. Boosting occluded image classification via subspace decomposition-based estimation of deep features. IEEE Transactions on Cybernetics, pages 1–14, 2019.
[6] F. Cen and G. Wang. Dictionary representation of deep features for occlusion-robust face recognition. IEEE Access, 7:26595–26605, 2019.
[7] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. TIST, 2(3):27, 2011.
[8] T. Chen, I. Goodfellow, and J. Shlens. Net2Net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE CVPR, pages 248–255. IEEE, 2009.
[10] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, Jan. 2015.
[11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE CVPR, pages 580–587, 2014.
[12] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[13] Y. Guo, H. Shi, A. Kumar, K. Grauman, T. Rosing, and R. Feris. SpotTune: Transfer learning through adaptive fine-tuning. In IEEE CVPR, pages 4805–4814, 2019.
[14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In IEEE ICCV, pages 2961–2969, 2017.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE ICCV, pages 1026–1034, 2015.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE CVPR, pages 770–778, 2016.
[17] L. He, G. Wang, and Z. Hu. Learning depth from single images with deep neural network embedding focal length. IEEE Transactions on Image Processing, 27(9):4676–4689, 2018.
[18] L. He, M. Yu, and G. Wang. Spindle-Net: CNNs for monocular depth inference with dilation kernel method. In , pages 2504–2509. IEEE, 2018.
[19] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 1125–1134, 2017.
[20] X. Ji, J. F. Henriques, and A. Vedaldi. Invariant information distillation for unsupervised image segmentation and clustering. arXiv preprint arXiv:1807.06653, 2018.
[21] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, pages 675–678. ACM, 2014.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, pages 1097–1105, 2012.
[23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
[24] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In ICML, pages 97–105, 2015.
[25] W. Ma, Y. Wu, Z. Wang, and G. Wang. MDCN: Multi-scale, deep inception convolutional neural networks for efficient object detection. In ICPR, pages 2510–2515. IEEE, 2018.
[26] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 9(Nov):2579–2605, 2008.
[27] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[28] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NeurIPS, pages 1990–1998, 2015.
[29] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, pages 91–99, 2015.
[30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[31] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. ECCV, pages 213–226, 2010.
[32] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In IEEE ICCV, pages 4068–4076, 2015.
[33] Y. Wu, Y. Sui, and G. Wang. Vision-based real-time aerial object localization and tracking for UAV sensing system. IEEE Access, 5:23969–23978, 2017.
[34] W. Xu, S. Keshmiri, and G. Wang. Stacked Wasserstein autoencoder. Neurocomputing, 363:195–204, 2019.
[35] W. Xu, S. Keshmiri, and G. R. Wang. Adversarially approximated autoencoder for image generation and manipulation. IEEE Transactions on Multimedia, 2019.
[36] W. Xu, K. Shawn, and G. Wang. Toward learning a unified many-to-many mapping for diverse image translation. Pattern Recognition, 93:570–580, 2019.
[37] W. Xu, Y. Wu, W. Ma, and G. Wang. Adaptively denoising proposal collection for weakly supervised object localization. Neural Processing Letters, pages 1–14, 2019.
[38] J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. In IEEE CVPR, pages 5147–5156, 2016.
[39] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NeurIPS, pages 3320–3328, 2014.
[40] Z. Zhang, Y. Wu, and G. Wang. BPGrad: Towards global optimality in deep learning via branch and pruning. In IEEE CVPR, June 2018.
[41] P. Zhu, L. Wen, D. Du, X. Bian, H. Ling, Q. Hu, Q. Nie, H. Cheng, C. Liu, X. Liu, et al. VisDrone-DET2018: The vision meets drone object detection in image challenge results. In ECCV, pages 0–0, 2018.
[42] W. W. Zou and P. C. Yuen. Very low resolution face recognition problem.