Transfer learning based few-shot classification using optimal transport mapping from preprocessed latent space of backbone neural network
Tomáš Chobola, Daniel Vašata and Pavel Kordík
Faculty of Information Technology, Czech Technical University in Prague, Thakurova 9, Prague, Czech Republic
choboto1@fit.cvut.cz
Abstract
The MetaDL Challenge 2020 focused on image classification tasks in few-shot settings. This paper describes the second best submission in the competition. Our meta-learning approach modifies the distribution of each class in a latent space produced by a backbone network so that it better follows a Gaussian distribution. After this operation, which we call the Latent Space Transform algorithm, centers of classes are further aligned in an iterative fashion of the Expectation Maximisation algorithm to utilize information in unlabeled data that are often provided on top of the few labelled instances. For this task, we utilize optimal transport mapping using the Sinkhorn algorithm. Our experiments show that this approach outperforms previous works as well as other variants of the algorithm that use the K-Nearest Neighbour algorithm, Gaussian Mixture Models, etc.
Introduction
Few-shot learning is increasingly popular because it can handle machine learning tasks with just a few learning examples. It is also more biologically plausible and closer to what we observe in nature. While learning a new task, one normally does not start from a randomly initialised neural network presented with hundreds of thousands of examples over several thousand epochs.

When you are told to remember a person from a picture, you are able to distinguish this person from others even when you see her in different positions or environments. In machine learning, this is called one-shot learning. The task of one-shot learning is to learn new classes given only one instance available for each class. Three-way five-shot learning means learning three classes given five training instances each. One does not learn classifiers from scratch, but typically uses neural networks trained on similar tasks with much more data. This also reflects the natural situation where visual perception is already well trained on similar tasks when one tries to remember a new person from a picture. This process can also be called meta learning or transfer learning, as one uses a pretrained neural network called a backbone network. Also, in a few-shot learning scenario, one can often utilise unlabelled instances apart from the few labelled samples that are available for the task.

The MetaDL Challenge 2020 focused on few-shot learning of image classification tasks. Participants trained a meta-learner on a meta-train set and produced a learner which was subsequently trained on classification tasks generated from the meta-test set and evaluated. The goal was to discover learners with the ability to quickly adapt to new unseen image classification tasks.

Our submission scored second in the final leaderboard. This paper describes the methods we have experimented with and the architecture of the meta-learning pipeline responsible for the second best result in the competition.
The architecture of our solution mainly follows (Hu, Gripon, and Pateux 2020), with important improvements in the preprocessing of the latent space output of the backbone model b. The main improvement is a different normalization of the transformed feature vectors which better respects the Gaussian distribution assumption. Since this is the key assumption for the proper functionality of the Sinkhorn mapping algorithm, it leads to more accurate results.

Related Work
There are several different approaches to few-shot learning. The survey (Wang et al. 2020) is a good resource for a general overview and taxonomy of few-shot learning methods. Prototypical networks (Snell, Swersky, and Zemel 2017) and Siamese networks (Koch, Zemel, and Salakhutdinov 2015) focus on learning embeddings that transform the data in such a way that it can be recognised with a simple classifier. This approach is further enhanced by relation networks (Sung et al. 2018), which are able to classify images of new classes by predicting distances between query images and the few examples of each new class.

Another interesting direction aims at the learning process itself. In (Ravi and Larochelle 2017) a recurrent-network-based meta-learner model learns the exact optimization algorithm used to train another learner neural network classifier in the few-shot setup. Meta-transfer learning (Sun et al. 2019) adapts a deep neural network for few-shot learning tasks. Transfer is achieved by learning scaling and shifting functions of DNN weights for each task.

We further extend the direction of few-shot learning research that leverages the classification capabilities of robust backbone models (neural networks) pretrained on similar tasks.

MetaDL Challenge 2020 website: https://competitions.codalab.org/competitions/26638
These transfer-learning-based methods need to find a mapping of few-shot classes to similar classes used to train the backbone model.

In (Rohrbach, Ebert, and Schiele 2013) Propagated Semantic Transfer has been applied to employ semantic knowledge transfer to original classes, combine the transferred predictions with labels for the novel classes, exploit the manifold structure of novel classes by graph-based learning, and improve the local neighborhood in such graph structures by replacing the raw feature-based representation with an attribute-based representation.

When transferring knowledge, deep embeddings are far superior to weight transfer as a starting point for novel tasks, as investigated in (Scott, Ridgeway, and Mozer 2018). Another similar approach is TransMatch (Yu et al. 2020), where a feature extractor is pre-trained on original classes and subsequently used to initialize few-shot classifier weights for the novel classes; the classifier is also updated with a semi-supervised learning method.

Our research proceeds from (Hu, Gripon, and Pateux 2020), where the latent space produced by a backbone deep network is preprocessed by a power transform and an optimal-transport algorithm maps original classes to novel classes while the centres of new classes are iteratively adjusted. This approach has shown significant improvement in accuracy in our experiments. The importance of feature transformation for few-shot learning is confirmed by (Wang et al. 2019).

Model description
Formally, in a few-shot learning task one has a dataset D containing a part D_S with a few labelled samples from w classes and a part D_Q with some unlabelled samples. The goal is to predict the classes for samples in D_Q. We will assume that D_S contains exactly s labelled samples for each class and D_Q contains exactly q unlabelled samples for each class. Hence, there are ws samples in D_S and wq samples in D_Q. The i-th sample from D will be denoted by x_i, and if it is from D_S we will denote its label by y_i.

Moreover, let us assume that there is another dataset D_B corresponding to some related task, such as image classification over a different set of classes. This dataset can be used to train the backbone model b which maps the initial space into some latent feature space L = R^d. In order to train such a model one might train a neural network for classification and then remove the last classification layers, as we did in the experiments. Alternatively, the encoder part of an autoencoder might be used.

The next step is to preprocess the points in the latent space to prepare them for the final prediction algorithm that estimates the labels. As was recently shown, this step is crucial and may lead to significant improvements of the result, see (Wang et al. 2019). To proceed we will further assume that the features obtained from the backbone model b are non-negative, i.e. L = R^d_+. This is often the case when one extracts b as a part of some neural network with the ReLU activation function on inner layers. Let us denote by B the dataset D transformed by b, and by B_S and B_Q its parts corresponding to D_S and D_Q, respectively.

In the preprocessing we transform the dataset B of points in the latent space L to a final dataset F of points in the final feature space F = R^r, where the dimension r = min{d, w(s+q)} is the minimum of the dimension d of L and the number of points in the dataset D.
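To make the notation concrete, the following sketch (in NumPy, with a random stand-in for the backbone b; all names are illustrative and not taken from the released code) shows the tensors involved in a w-way, s-shot task with q unlabelled queries per class:

```python
import numpy as np

rng = np.random.default_rng(0)

w, s, q, d = 5, 1, 15, 640          # 5-way 1-shot, 15 queries per class, latent dim d

# Stand-in for the backbone b: any map into the non-negative latent space R^d_+
def backbone(images):
    return np.maximum(rng.normal(size=(len(images), d)), 0.0)  # ReLU-like output

support_images = [f"img_{i}" for i in range(w * s)]   # D_S: w*s labelled samples
query_images   = [f"img_{i}" for i in range(w * q)]   # D_Q: w*q unlabelled samples

B_S = backbone(support_images)      # latent features of D_S, shape (w*s, d)
B_Q = backbone(query_images)        # latent features of D_Q, shape (w*q, d)
support_labels = np.repeat(np.arange(w), s)

print(B_S.shape, B_Q.shape)         # (5, 640) (75, 640)
```

With s = 1 and q = 15 this gives ws = 5 labelled and wq = 75 unlabelled feature vectors, matching the dimensions used throughout the text.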
The preprocessing is a composition of three steps, and we will call it the Latent Space Transform (LST) algorithm. The first step is the power transform combined with a semi-normalization of each point, given by

    f1(u) = (u + ε)^β / ‖(u + ε)^β‖^δ   for all u ∈ L,

where the power is taken component-wise, ε = 10^{-6} is the normalization parameter, and ‖·‖ is the Euclidean norm. The hyperparameter β controls the strength of the power transform and the hyperparameter δ controls the strength of the normalization, where δ = 1 means full normalization and δ = 0 yields no normalization at all. The power transform is known to help stabilise the variance and make the data more Gaussian-like by reducing skewness, see (Box and Cox 1964). Full normalization, on the other hand, leads to a projection onto the unit sphere, which is not compatible with the assumption used later in the optimal transport that the components of points in the same class are independent with Gaussian distributions of the same variance. Hence, the semi-normalization controlled by the hyperparameter δ allows for some variance in the direction perpendicular to the unit sphere surface and thus does not a priori break the compatibility of the resulting distribution with the Gaussian assumption. Let us denote the dataset with all points of B transformed by f1 as F1, with F1,S and F1,Q defined analogously.

The second step is the removal of unnecessary dimensions using the QR decomposition of the transposition of the already preprocessed data matrix F1 ∈ R^{w(s+q) × d} corresponding to the dataset F1: from F1^T = QR we define F2 = F1 Q, so that F2 ∈ R^{w(s+q) × r}, where r = min{d, w(s+q)}, and the corresponding dataset is denoted by F2. We again denote by F2,S and F2,Q the parts of F2 that correspond to samples originally in D_S and D_Q, respectively.
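The first two LST steps can be sketched in NumPy as follows. The values β = 0.5 and δ = 1 are illustrative only (the tuned values are dataset-dependent hyperparameters), and ε = 10^{-6} as in the text:

```python
import numpy as np

def power_transform(U, beta=0.5, eps=1e-6, delta=1.0):
    """LST step 1: component-wise power transform with semi-normalization,
    f1(u) = (u + eps)^beta / ||(u + eps)^beta||^delta, applied row-wise."""
    V = (U + eps) ** beta
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V / norms ** delta

def drop_null_dimensions(F1):
    """LST step 2: QR decomposition of F1^T removes the dimensions that are
    identically zero for the data, reducing R^d to R^r with r = min(d, n)."""
    Q, _ = np.linalg.qr(F1.T)            # Q has orthonormal columns, shape (d, r)
    return F1 @ Q                        # shape (n, r)

rng = np.random.default_rng(0)
U = np.abs(rng.normal(size=(80, 640)))   # w(s+q) = 80 points in R^640_+
F1 = power_transform(U)
F2 = drop_null_dimensions(F1)
print(F1.shape, F2.shape)                # (80, 640) (80, 80)
```

With δ = 1 every transformed point lies on the unit sphere; δ < 1 leaves some spread in the radial direction, as discussed above.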
This QR step corresponds to a change of the orthonormal basis in R^d and discarding the dimensions that are zero for the data points.

The last preprocessing step is centering and a further semi-normalization given by

    f3(u) = (u − ū) / ‖u − ū‖^γ,   where  ū = (1 / (w(s+q))) Σ_{i=1}^{w(s+q)} u_i
Figure 1: In order to predict the class label of a test example, we transform the image using a backbone CNN to the latent space and preprocess the vectors by the Latent Space Transform algorithm, which helps to transform the distribution of individual classes to a Gaussian-like one. A test example is then processed and compared to the class centres that have been iteratively adjusted using a Sinkhorn mapping with unlabeled data projected to the latent space in the same way. The closest class is assigned to the test example as the prediction.

Here ū is the centroid (component-wise average) of the dataset F2. Again, the hyperparameter γ controls the strength of the normalization. For γ < 1 the resulting points are only partially normalized and one may expect them to better resemble the Gaussian distribution assumed in the next step. A typical result for the final Euclidean norms of the transformed points is shown in Figure 2.

Let us denote the final preprocessed dataset by F, and its respective parts corresponding to the original parts D_S and D_Q by F_S and F_Q, respectively.

Once the preprocessing of the dataset is finished, the actual optimal transport can begin. In this part we directly follow (Hu, Gripon, and Pateux 2020). The preliminary assumption of the method is that all components of points in each individual class are independently Gaussian distributed, with class centres c_1, ..., c_w as parameters. Moreover, it is assumed that all the Gaussian distributions have the same variance λ/2, where λ is a hyperparameter. Under this assumption the maximum a posteriori (MAP) estimate ŷ_1, ..., ŷ_wq of the labels of the unlabelled samples f_1, ..., f_wq from F_Q corresponds to

    {ŷ_j}_{j=1}^{wq}, {ĉ_k}_{k=1}^{w} = argmax_{ {y_j}, {c_k} } Π_i P(y_i | f_i)
                                      = argmax_{ {y_j}, {c_k} } Π_i P(f_i | y_i) P(y_i)
                                      = argmax_{ {y_j}, {c_k} } Π_i e^{−λ^{−1} ‖f_i − c_{y_i}‖²} P(y_i).
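The final LST step (centering and semi-normalization) described above can be sketched as follows; γ = 0.5 is an illustrative value only:

```python
import numpy as np

def center_and_seminormalize(F2, gamma=0.5):
    """LST step 3: subtract the dataset centroid, then divide each row by its
    Euclidean norm raised to gamma (gamma = 1 gives full normalization)."""
    centered = F2 - F2.mean(axis=0, keepdims=True)   # u - u_bar, row-wise
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / norms ** gamma

rng = np.random.default_rng(0)
F2 = rng.normal(size=(80, 80))       # stand-in for the QR-reduced dataset
F = center_and_seminormalize(F2)
print(F.shape)                       # (80, 80)
```

For γ = 1 all points would be projected exactly onto the unit sphere around the centroid; γ < 1 keeps some radial variance, which is what the Gaussian assumption in the next step requires.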
This MAP formulation is directly related to optimal transport theory, see (Hu, Gripon, and Pateux 2020; Cuturi 2013; Berman 2020; Villani 2003), and one may use an iterative expectation-maximization-like approach incorporating the Sinkhorn algorithm to obtain the MAP estimate. It consists of repeating two steps: the first is the construction of the mapping matrix M* with elements M*_ij = P(y_i = j), which maximizes the previous term for given centres c_1, ..., c_w; the second is the estimation of class centres that, for the fixed mapping matrix, again optimizes the previous term. For the Sinkhorn algorithm, see (Cuturi 2013); the mapping matrix is defined as

    M* = Sinkhorn(L, a, b, λ) = argmin_{M ∈ U(a,b)} Σ_{i,j} M_ij L_ij + λ H(M),

where U(a,b) is the set of positive matrices in R^{wq × w} whose rows sum to a vector a and whose columns sum to a vector b, L ∈ R^{wq × w} is the cost matrix of Euclidean distances between unlabelled instances and class centres, that is L_ij = ‖f_i − c_j‖, and H(M) = Σ_{ij} M_ij log M_ij is the negative entropy of M, so that the regularisation coefficient λ encourages smoother, higher-entropy mappings. The vector a denotes the distribution of the amount that each unlabelled example uses for class allocation, i.e. a is the vector of ones with wq elements, and b denotes the distribution of the amount of unlabelled examples allocated to each class, i.e. b is the vector with w elements all equal to q.

The iterative approach starts by initialising the class centres from the labelled samples in F_S. Then the mapping matrix M* is calculated using the Sinkhorn algorithm.

Figure 2: The Latent Space Transform algorithm produces a Gaussian-like distribution also for the norms of the transformed samples. The figure was produced for one batch from the CUB dataset with s = 5, q = 15, β = 0.…, δ = 0.…, and γ = 0.….
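A minimal Sinkhorn routine with the marginals used here (each unlabelled point carries unit mass, each class receives mass q) can be sketched as follows; the cost matrix is a random stand-in for the distances L_ij:

```python
import numpy as np

def sinkhorn(L, a, b, lam, n_iter=200):
    """Entropy-regularized optimal transport via Sinkhorn-Knopp scaling:
    returns M in U(a, b) minimizing <M, L> + lam * sum(M * log M)."""
    K = np.exp(-L / lam)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                # scale columns toward marginal b
        u = a / (K @ v)                  # scale rows toward marginal a
    return u[:, None] * K * v[None, :]

# wq = 75 unlabelled points, w = 5 classes:
rng = np.random.default_rng(0)
L = rng.random((75, 5))                  # stand-in for ||f_i - c_j||
a = np.ones(75)                          # each point allocates total mass 1
b = np.full(5, 15.0)                     # each class receives q = 15
M = sinkhorn(L, a, b, lam=0.1)
print(M.shape)                           # (75, 5)
```

Smaller λ makes M concentrate on the cheapest assignments; larger λ spreads each point's mass more uniformly over the classes.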
The mapping matrix M* is then used to re-estimate the class centres via the update

    μ_j = ( Σ_{f_i ∈ F_Q} M*_ij f_i + Σ_{f_k ∈ F_S, y_k = j} f_k ) / ( s + Σ_{i=1}^{wq} M*_ij ).

To avoid unnecessarily big steps in the centre estimation, the new centre is set to

    c_j = c_j + α (μ_j − c_j),

where α is the learning rate. The number of iterations is fixed to n_steps. Once the iteration process finishes, the labels of the samples from F_Q may be estimated from the last mapping matrix as ŷ_i = argmax_j M*_ij.

The overview of the algorithm is given in Algorithm 1. The overall process of our approach is depicted in Figure 1. The code is available at https://github.com/ctom2/latent-space-transform.
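The centre re-estimation update described above can be sketched as follows. Here M is a uniform stand-in for the soft-assignment matrix M* (in the actual algorithm it comes from the Sinkhorn step), and all variable names are illustrative:

```python
import numpy as np

def update_centres(c, F_S, y_S, F_Q, M, s, alpha=0.2):
    """One damped centre update: mu_j mixes soft-assigned query features with
    the labelled features of class j, then c_j moves toward mu_j by alpha."""
    w = c.shape[0]
    mu = np.empty_like(c)
    for j in range(w):
        labelled = F_S[y_S == j].sum(axis=0)           # sum over class-j supports
        mu[j] = (M[:, j] @ F_Q + labelled) / (s + M[:, j].sum())
    return c + alpha * (mu - c)

rng = np.random.default_rng(0)
w, s, q, r = 5, 1, 15, 80
F_S = rng.normal(size=(w * s, r))                      # labelled features
y_S = np.repeat(np.arange(w), s)
F_Q = rng.normal(size=(w * q, r))                      # unlabelled features
M = np.full((w * q, w), 1.0 / w)                       # uniform stand-in for M*
c = F_S.copy()                                         # with s = 1, supports are the initial centres
c = update_centres(c, F_S, y_S, F_Q, M, s)
print(c.shape)                                         # (5, 80)
```

Repeating this update for n_steps iterations, recomputing M* via Sinkhorn each time, gives the full MAP loop of Algorithm 1.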
Algorithm 1: Optimal map algorithm

Parameters: w, s, q, λ, α, n_steps
Initialisation: c_j = (1/s) Σ_{f_k ∈ F_S, y_k = j} f_k
repeat n_steps times:
    L_ij = ‖f_i − c_j‖  for all i, j
    M* = Sinkhorn(L, a, b, λ)  with  a = (1, ..., 1) ∈ R^{wq},  b = (q, ..., q) ∈ R^{w}
    calculate μ_j
    c_j = c_j + α (μ_j − c_j)
end
return ŷ_i = argmax_j M*_ij

Experiments
The performance of the stated methods was measured on the standardised few-shot classification datasets CIFAR-FS (Bertinetto et al. 2019) and CUB (Wah et al. 2011). The CIFAR-FS dataset consists of images of size 32 × 32 distributed into 100 classes, each containing 600 images. The dataset is split into 64 base classes, 16 validation classes and 20 novel classes. The CUB dataset contains 11,788 images of birds distributed over 200 classes, and is split into 100 base classes, 50 validation classes and 50 novel classes.

In each testing run, w classes are randomly and uniformly drawn from the novel classes, where each class consists of s instances with labels and q instances without labels.

Because of the high performance of WideResNet (Zagoruyko and Komodakis 2017) augmented with the S2M2 method (Mangla et al. 2020) in the few-shot setting, we chose it as the backbone architecture for our model. The latent representation of images produced by the backbone is a vector of dimension 640. The QR decomposition reduces this dimension to 80 in the 1-shot setting and to 100 in the 5-shot setting.

All experiments are based on w = 5, q = 15 and s = 1 or s = 5. To evaluate the performance of the models we run 10,000 random draws to obtain mean accuracy with confidence scores.

By tuning the hyperparameters of the model we observed an evolution in accuracy in both the 1-shot and 5-shot settings depending on the tested dataset. An overview of the hyperparameters can be found in Table 1. The final accuracy can be seen in Table 2 and Table 3 for the 1-shot and 5-shot settings, respectively. Moreover, the tables include results obtained by substituting MAP with different clustering algorithms, a Gaussian Mixture model and a k-means model, which take the transformed features as their input. The k-means model is initiated with centres corresponding to the labeled instances in a testing run. The centres are then iteratively refined to produce better representations of the class centres. Similarly, the Gaussian Mixture model is provided with initial means corresponding to the labeled examples at the beginning of each run. To compare our proposed transform method with the Power Transform (PT) (Hu, Gripon, and Pateux 2020), we performed the same substitutions for the PT+MAP model.

Table 1: Hyperparameters used in the final evaluation of the LST+MAP model.

               1-shot               5-shot
Parameter   CIFAR-FS    CUB     CIFAR-FS    CUB
β           …           …       …           …
λ           10          10      10          10
α           …           …       …           …
n_steps     20          30      20          20
δ           …           …       …           …
γ           …           …       …           …

Table 2: 1-shot accuracy of models based on the Power Transform (PT), our proposed Latent Space Transform (LST) and the WideResNet backbone.

Method     Backbone   CIFAR-FS         CUB
PT+MAP     WRN        ….. ± 0.23 %     91.… ± 0.… %
PT+GMM     WRN        ….. ± 0.22 %     90.… ± 0.… %
PT+KNN     WRN        ….. ± 0.19 %     89.… ± 0.… %
LST+MAP    WRN        ….. ± 0.… %      ….. ± 0.… %
LST+GMM    WRN        ….. ± 0.21 %     89.… ± 0.… %
LST+KNN    WRN        ….. ± 0.19 %     89.… ± 0.… %

The scores show that even omitting the MAP part from the architecture and replacing it with simpler classification approaches, while keeping the transformation intact, produces competitive results. Moreover, to assess the statistical significance of the superiority of the LST+MAP model over the PT+MAP model, we performed a paired t-test, with p-values presented in Table 4. We can see that, except for the CUB dataset in the 5-shot scenario, the LST+MAP model is significantly better than the PT+MAP model.

In terms of execution time, we measured an average of … s per run in the 1-shot setting and … s per run in the 5-shot setting with the GPU backend.

Table 3: 5-shot accuracy of models based on the Power Transform (PT), our proposed Latent Space Transform (LST) and the WideResNet backbone. The authors of the PT+MAP model presented an accuracy of ….. ± 0.… in the 5-shot setting for the CUB dataset; however, we were able to obtain higher accuracy with their described model configuration.

Method     Backbone   CIFAR-FS         CUB
PT+MAP     WRN        ….. ± 0.15 %     94.… ± 0.… %
PT+GMM     WRN        ….. ± 0.21 %     90.… ± 0.… %
PT+KNN     WRN        ….. ± 0.19 %     89.… ± 0.… %
LST+MAP    WRN        ….. ± 0.… %      ….. ± 0.… %
LST+GMM    WRN        ….. ± 0.20 %     90.… ± 0.… %
LST+KNN    WRN        ….. ± 0.18 %     89.… ± 0.… %

Table 4: p-values of the paired t-test with the null hypothesis that the accuracy of the PT+MAP model is greater than or equal to the accuracy of the LST+MAP model, against the alternative that the accuracy of the PT+MAP model is smaller than that of the LST+MAP model.

            1-shot               5-shot
            CIFAR-FS    CUB     CIFAR-FS    CUB
p-value     …           …       …           …

Challenge submission
In this section, we describe the modifications to our method that we elaborated for the MetaDL Challenge 2020. The main limitation of the challenge was the submission runtime, which had to include backbone training time and was limited to two hours. Therefore we were not able to utilise the WRN backbone suggested above.

Our best performing solution relied on a lighter backbone network based on the ResNet architecture. During the backbone training, the input images could either be left as they were, or their saturation or brightness could be changed, with the probability set to … for each alteration. Moreover, the training batches also included the same images rotated by 90, 180 and 270 degrees to further improve the backbone capabilities and augment the training overall.

Conclusion
Features extracted from backbones often do not resemble Gaussian-like distributions, even though multiple algorithms are built on that assumption. In this paper we showed how to transform feature vectors into better Gaussian-like distributions. By applying an iterative optimal-transport algorithm to estimate class centres empirically, the subsequent clustering method gains a significant improvement over other few-shot classification methods.

Our experiments confirmed that the Latent Space Transform algorithm introduced above outperforms other forms of feature preprocessing, including the Power Transform. We have also compared our approach based on optimal transport mapping to other classification methods based on Gaussian mixtures and nearest neighbours. For both the CIFAR-FS and CUB datasets, our approach proved to be superior in both the 1-shot and 5-shot learning scenarios.

We adjusted our method for the MetaDL Challenge 2020 competition and scored second in the final leaderboard.
Acknowledgment
This work was supported by the Student Summer Research Program 2020 of FIT CTU in Prague. Moreover, the research was supported by the Grant Agency of the Czech Technical University in Prague (SGS20/213/OHK3/3T/18) and the Czech Science Foundation (GAČR 18-18080S).

References

[Berman 2020] Berman, R. J. 2020. The Sinkhorn algorithm, parabolic optimal transport and geometric Monge–Ampère equations. Numerische Mathematik.

[Bertinetto et al. 2019] Bertinetto, L.; Henriques, J. F.; Torr, P. H. S.; and Vedaldi, A. 2019. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations.

[Box and Cox 1964] Box, G. E. P., and Cox, D. R. 1964. An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological).

[Cuturi 2013] Cuturi, M. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, volume 26, 2292–2300. Curran Associates, Inc.

[Hu, Gripon, and Pateux 2020] Hu, Y.; Gripon, V.; and Pateux, S. 2020. Leveraging the feature distribution in transfer-based few-shot learning. ArXiv abs/2006.03806.

[Koch, Zemel, and Salakhutdinov 2015] Koch, G.; Zemel, R.; and Salakhutdinov, R. 2015. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2. Lille.

[Mangla et al. 2020] Mangla, P.; Singh, M.; Sinha, A.; Kumari, N.; Balasubramanian, V. N.; and Krishnamurthy, B. 2020. Charting the right manifold: Manifold mixup for few-shot learning.

[Ravi and Larochelle 2017] Ravi, S., and Larochelle, H. 2017. Optimization as a model for few-shot learning. In International Conference on Learning Representations. OpenReview.net.

[Rohrbach, Ebert, and Schiele 2013] Rohrbach, M.; Ebert, S.; and Schiele, B. 2013. Transfer learning in a transductive setting. In Advances in Neural Information Processing Systems, 46–54.

[Scott, Ridgeway, and Mozer 2018] Scott, T.; Ridgeway, K.; and Mozer, M. C. 2018. Adapted deep embeddings: A synthesis of methods for k-shot inductive transfer learning. In Advances in Neural Information Processing Systems, 76–85.

[Snell, Swersky, and Zemel 2017] Snell, J.; Swersky, K.; and Zemel, R. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 4077–4087.

[Sun et al. 2019] Sun, Q.; Liu, Y.; Chua, T.-S.; and Schiele, B. 2019. Meta-transfer learning for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 403–412.

[Sung et al. 2018] Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P. H.; and Hospedales, T. M. 2018. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1199–1208.

[Villani 2003] Villani, C. 2003. Topics in Optimal Transportation. Providence, Rhode Island: American Mathematical Society.

[Wah et al. 2011] Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.

[Wang et al. 2019] Wang, Y.; Chao, W.-L.; Weinberger, K. Q.; and van der Maaten, L. 2019. SimpleShot: Revisiting nearest-neighbor classification for few-shot learning. ArXiv abs/1911.04623.

[Wang et al. 2020] Wang, Y.; Yao, Q.; Kwok, J. T.; and Ni, L. M. 2020. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys (CSUR).

[Yu et al. 2020] Yu, Z.; Chen, L.; Cheng, Z.; and Luo, J. 2020. TransMatch: A transfer-learning scheme for semi-supervised few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[Zagoruyko and Komodakis 2017] Zagoruyko, S., and Komodakis, N. 2017. Wide residual networks. ArXiv abs/1605.07146.