Partial Is Better Than All: Revisiting Fine-tuning Strategy for Few-shot Learning

Abstract

The goal of few-shot learning is to learn a classifier that can recognize unseen classes from limited labeled support data. A common practice for this task is to train a model on the base set first and then transfer it to novel classes through fine-tuning or meta-learning. (Here the fine-tuning procedure is defined as transferring knowledge from base to novel data, i.e., learning to transfer in the few-shot scenario.) However, as the base classes have no overlap with the novel set, simply transferring the whole knowledge from base data is not an optimal solution, since some knowledge in the base model may be biased or even harmful to the novel classes. In this paper, we propose to transfer partial knowledge by freezing or fine-tuning particular layer(s) in the base model. Specifically, layers chosen to be fine-tuned are assigned different learning rates to control the extent of preserved transferability. To determine which layers to recast and which learning rates to use for them, we introduce an evolutionary-search-based method that efficiently and simultaneously locates the target layers and determines their individual learning rates. We conduct extensive experiments on CUB and mini-ImageNet to demonstrate the effectiveness of our proposed method. It achieves state-of-the-art performance on both meta-learning and non-meta based frameworks. Furthermore, we extend our method to the conventional pre-training + fine-tuning paradigm and obtain consistent improvement.



Zhiqiang Shen*, Zechun Liu*, Jie Qin, Marios Savvides and Kwang-Ting Cheng
Carnegie Mellon University; Hong Kong University of Science and Technology; Inception Institute of Artificial Intelligence
{zhiqians,marioss}@andrew.cmu.edu; [email protected]; [email protected]; [email protected]


1. Introduction

Deep neural networks have shown enormous potential for understanding natural images (Krizhevsky, Sutskever, and Hinton 2012; Szegedy et al. 2015; Simonyan and Zisserman 2014; He et al. 2016; Huang et al. 2017) in recent years. The learning ability of deep neural networks increases significantly with more labeled training data. However, annotating such data is expensive, time-consuming and laborious. Furthermore, some classes (e.g., in medical images) are naturally rare and hard to collect. Conventional training approaches for deep neural networks often fail to obtain good performance when the training data is insufficient. Considering that humans can easily learn from very few examples and even generalize to many different new images, it would be greatly helpful if the network could also learn to generalize to new classes with only a few labeled samples from unseen classes.

[Figure 1 panels: (1) standard transfer learning procedure, with a fixed feature extractor; (2) our proposed partial transfer, with an adaptive feature extractor. Each panel shows a training stage on the base classes (many samples) and a fine-tuning stage on the novel classes (few samples), with a backbone followed by a classifier.]

Figure 1: Illustration of the conventional procedure of pre-training and fine-tuning for few-shot learning. (1) represents the standard transfer learning procedure, which uses the pre-trained model as a feature extractor whose parameters are fixed during fine-tuning. (2) is our proposed partial transfer strategy, which fine-tunes the model trained on base data with the few novel class data. Fine-tuning with different learning rates on different layers can optimize the feature extractor to better fit the novel classes while preventing the model from overfitting to them, since the novel data has limited samples.

Previous studies in this direction (i.e., few-shot learning) can be mainly divided into two categories. One is the meta-learning based methods (Snell, Swersky, and Zemel 2017; Vinyals et al. 2016; Finn, Abbeel, and Levine 2017), which model the few-shot learning process with samples belonging to the base classes and optimize the model for the target novel classes. The other is the plain solution (non-meta, also called the baseline method (Chen et al. 2019)), which trains a feature extractor on the abundant base classes and then directly predicts the weights of the classifier for the novel ones.

As the number of images in the support set of novel classes is extremely limited, directly training models from scratch on the support set is unstable and tends to overfit. Even utilizing the parameters pre-trained on base classes and fine-tuning all layers on the support set will still lead to poor performance due to the small proportion of target training data.
A common practice used by both meta-based and simple baseline methods relies heavily on the knowledge pre-trained on the sufficient base classes. It then transfers the representation by freezing the backbone parameters and solely fine-tuning the last fully-connected layer, or by directly extracting features for distance computation on the support data, to prevent overfitting and improve generalization. However, the base classes have no overlap with the novel ones, meaning that the representation and distribution required to recognize images differ considerably between them. Completely freezing the backbone network and simply transferring the whole knowledge therefore suffers from this domain discrepancy, even though the domain difference is currently not large in existing few-shot learning datasets.

To overcome the aforementioned drawback, in this work we propose a flexible way to transfer knowledge from base to novel classes. We introduce a partial transfer paradigm for the few-shot classification task, as shown in Figure 1. In our framework, we first pre-train a model on the base classes as previous methods did. Then, instead of transferring the learned representation by freezing the whole backbone network, we develop an efficient evolutionary search method to automatically determine which layer or layers need to be frozen and which will be fine-tuned on the support set (of the novel classes). During searching, the validation data is used as the ground truth to monitor the performance of the searched strategy. Intuitively, our strategy can achieve a better trade-off between using knowledge from base and support data than previous approaches, while avoiding the incorporation of biased or harmful knowledge from base classes into novel classes. Moreover, our method is orthogonal to meta-learning and non-meta based solutions, and thus can be seamlessly integrated with them. We perform extensive experiments on the CUB200-2011 and mini-ImageNet datasets.
Our results empirically show that the proposed method can favorably improve both of these two types of solutions. We further extend our method to the traditional pre-training + fine-tuning paradigm from ImageNet to CUB200-2011 and achieve consistent improvement, demonstrating the effectiveness and extensibility of our proposed method.

In summary, our contributions are three-fold:

• We present Partial Transfer (P-Transfer) for few-shot classification, a framework that enables searching transfer strategies on the backbone for flexible fine-tuning. Intuitively, the conventional fixed transfer is a special case of our proposed strategy in which all layers are frozen. Also, to the best of our knowledge, this is the first attempt to achieve partial transfer with different learning rates on this challenging task.

• We introduce a layer-wise search space for fine-tuning from base classes to novel ones. It helps the searched transfer strategy obtain encouraging accuracies under limited search complexity. For example, using one V100 GPU, our search algorithm only takes ∼[…]

• Our resulting network, the P-Transfer model, outperforms the complete transfer and the hand-crafted transfer strategies by a remarkable margin. Since the two baseline transfer strategies belong to our search space, the searched strategy should ideally perform at least as well as either of them. With the assistance of the designed search space, we show the effectiveness of P-Transfer in different few-shot learning methods on various datasets. The searched strategy has consistently better performance and meaningful structural patterns.

2. Background

Few-shot learning is defined as follows: given abundant labeled training examples in the base classes, the trained network should generalize to classify new classes with few labeled samples. Recently, few-shot learning, enabled by transferring knowledge from the base to novel data, has become increasingly important. Existing few-shot learning methods can mainly be categorized into meta-learning based methods and non-meta learning methods. We also review search-based methods for few-shot learning in this section.

Meta-based few-shot learning.

To tackle the data deficiency in few-shot learning, previous studies adopt meta-learning to learn a model or optimizer that can quickly update the weights to adapt to unseen tasks (Thrun and Pratt 2012; Andrychowicz et al. 2016; Ye et al. 2020; Tian et al. 2020; Kim, Kim, and Kim 2020). For example, MetaNetwork (Munkhdalai and Yu 2017) learned meta-level knowledge for rapid generalization. Ravi and Larochelle (Ravi and Larochelle 2016) proposed to use an LSTM-based meta-learner model to learn the optimization algorithm. MAML (Finn, Abbeel, and Levine 2017; Antoniou, Edwards, and Storkey 2018) simplified the aforementioned MetaNetwork by only learning the initial learner parameters, to achieve rapid adaptation w.r.t. those initial parameters and high generalizability to new tasks. Furthermore, meta-learning methods are utilized to learn the similarity between two images. MatchingNet (Vinyals et al. 2016) proposed to map a small labeled support set to its label, and to determine the class of an instance in the query set by finding its nearest labeled example. ProtoNet (Snell, Swersky, and Zemel 2017) further utilized the class-wise mean and the Euclidean distance to generalize MatchingNet from one-shot learning to few-shot learning. RelationNet (Sung et al. 2018) used CNN-based relation modules and Few-shot GNN (Garcia and Bruna 2017) employed graph neural networks to learn useful metrics.
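The class-mean and Euclidean-distance rule attributed to ProtoNet above can be illustrated with a minimal pure-Python sketch. The function names `prototypes` and `classify` and the toy two-dimensional features are assumptions made for illustration; in practice the features would come from a trained feature extractor, and this is not the ProtoNet authors' implementation.

```python
import math

def prototypes(support):
    """Compute one prototype (class-wise mean feature) per class.

    `support` maps a class label to a list of feature vectors.
    """
    protos = {}
    for label, feats in support.items():
        dim = len(feats[0])
        protos[label] = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
    return protos

def classify(query, protos):
    """Assign the query feature to the class with the nearest prototype."""
    return min(protos, key=lambda label: math.dist(query, protos[label]))

# Toy 2-way 2-shot support set with hand-written 2-D features (illustrative only).
support = {
    "cat": [[0.0, 0.0], [0.2, 0.0]],
    "dog": [[1.0, 1.0], [1.0, 0.8]],
}
protos = prototypes(support)
prediction = classify([0.1, 0.1], protos)
```

With a single support example per class the prototype is that example itself, which reduces the rule to nearest-neighbor matching in the one-shot case.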

Non-meta few-shot learning.

Besides those meta-learning based methods, there are non-meta methods which utilize cosine similarity to predict the novel class classifier with a weight generator (Gidaris and Komodakis 2018), directly set the weights based on the embedding layer's activations (Qi, Brown, and Lowe 2018), or use dense representations from image regions to calculate the distances (Zhang et al. 2020). Chen et al. (Chen et al. 2019) proposed to reduce intra-class variation with cosine similarity and achieved competitive performance. Both the meta and non-meta methods use a fixed feature extractor trained on the base classes, which can hardly take the domain discrepancy between the base and novel classes into consideration. Instead of learning more advanced optimizers or classification metrics, we tackle the few-shot problem by discovering a meta knowledge transfer scheme through evolutionary search, which is compatible with both meta and non-meta methods.

[Figure 2 panels: (a) Train the feature extractor (feature extractor and classifier trained with a cross-entropy loss); (b) Evolutionary search for fine-tuning configuration (layers of the feature extractor marked as fixed or fine-tuned across evolution iterations); (c) Classification with the searched feature extractor.]

Figure 2: Our overall framework consists of three stages: (a) train a feature extractor from scratch on the base dataset; (b) apply evolutionary search on the validation dataset to explore the optimal combination of layers that require fine-tuning (the blocks with dashed lines denote the fine-tuned layers in a specific evolution iteration); (c) use the best fine-tuning scheme discovered by the evolutionary search to fine-tune the selected layers on the support set of the novel dataset and infer the final accuracy on the query set.

Neural architecture search for few-shot learning.

Evolutionary algorithms have been adopted in neural architecture search (NAS) to obtain the optimal neural architecture (Miikkulainen et al. 2019; Real et al. 2017; Xie and Yuille 2017; Liu et al. 2017; Real et al. 2019). Evolutionary-based NAS evolves within a given architecture search space and updates a population of genes (i.e., the operation choices in an architecture) to select the top gene for the final model. A recent study (Elsken et al. 2020) proposed to integrate a NAS algorithm with gradient-based meta-learning for the few-shot learning task. Different from neural architecture search, our proposed P-Transfer utilizes an evolutionary algorithm to seek the optimal fine-tuning scheme instead of the network architecture. Meta-SGD (Li et al. 2017) and MAML++ (Antoniou, Edwards, and Storkey 2018) can also learn diverse learning rates for each layer in the network, but they were mainly designed for MAML-like methods and are only suitable for meta-based scenarios. In contrast, our proposed method can set the learning rate of a layer to exactly zero and thereby fix its weights, which is a more general design for the few-shot learning task.

3. Methodology

In this section, we start by introducing the problem definition of few-shot classification. We then present our whole framework, which consists of three steps: 1) train a base model on base class samples (left sub-figure in Figure 2); 2) apply evolutionary search to explore the optimal transfer strategy based on the accuracy metric (middle sub-figure, where the curved arrow indicates looping); and 3) transfer the base model to the novel classes with the searched strategy through partial fine-tuning. Lastly, we elaborate on how we design our search space for transferring and present our search algorithm in detail.

In the few-shot classification task, we are given abundant labeled images $X_b$ in base classes $L_b$ and a small proportion of labeled images $X_n$ in novel classes $L_n$, with $L_b \cap L_n = \emptyset$. Our goal is to train models for recognizing novel classes using the large amount of labeled base data and the limited novel data. Consider an $N$-way $K$-shot few-shot task, where the support set on the novel classes has $N$ classes with $K$ labeled images each, and the query set contains the same $N$ classes with $Q$ unlabeled images in each class. The few-shot classification algorithm is required to learn classifiers for recognizing the $N \times Q$ images in the query set of $N$ classes.

The objective of P-Transfer is to discover the best transfer learning scheme $V^{*}_{lr}$, such that the network achieves maximal accuracy when fine-tuned under that scheme:

$$V^{*}_{lr} = \arg\max_{V_{lr}} \mathrm{Acc}(W, V_{lr}), \qquad (1)$$

where $V_{lr} = [v_1, v_2, \ldots, v_L]$ defines the layer-wise learning rates for fine-tuning the feature extractor, $W$ denotes the network's parameters, and $L$ is the total number of layers.

As shown in Figure 2, our method consists of three steps: base class pre-training, evolutionary search, and partial transfer based on the searched strategy.
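To make the N-way K-shot setup concrete, the following is a minimal sketch of how one episode could be sampled. The helper name `sample_episode`, the toy dataset of string identifiers, and the use of Python's `random.Random` are illustrative assumptions and not part of the paper's code.

```python
import random

def sample_episode(images_by_class, n_way, k_shot, q_query, rng):
    """Sample one N-way K-shot episode.

    Returns (support, query) as lists of (image, label) pairs, with K support
    and Q query images per class and no image shared between the two sets.
    """
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(images_by_class[c], k_shot + q_query)
        support += [(x, c) for x in picked[:k_shot]]
        query += [(x, c) for x in picked[k_shot:]]
    return support, query

# Toy dataset: 10 classes with 30 image identifiers each (hypothetical names).
toy = {f"class{c}": [f"img{c}_{i}" for i in range(30)] for c in range(10)}
support, query = sample_episode(toy, n_way=5, k_shot=1, q_query=15,
                                rng=random.Random(0))
```

For a 5-way 1-shot task with 15 query images per class, this yields 5 support images and N × Q = 75 query images.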

Step 1: Base class pre-training.

Base class pre-training is the fundamental step of our whole pipeline. As shown in Figure 2 (a), for the simple baseline, we follow the common practice of training the model from scratch by minimizing a standard cross-entropy objective with the training samples in the base classes. For the meta-learning pipeline, the meta-pretraining also follows the conventional strategy in which a meta-learning classifier is conditioned on the base support set. More specifically, in the meta-pretraining stage, a support set and a query set on the base classes are first sampled randomly from N classes, and the parameters are then trained to minimize the N-way prediction loss.

Step 2: Evolutionary search.

The second step is to perform evolutionary search over different fine-tuning strategies to determine which layers will be fixed and which will be fine-tuned in the representation transfer stage. We again consider the above two circumstances: the simple baseline through pre-training + fine-tuning, and meta-based methods. In these two scenarios the evolutionary search operations are slightly different, as shown in Figure 2 (b) and Figure 3. Generally, our method searches for the optimal strategy for transferring from base classes to novel classes by fixing or re-activating particular layers that can help the novel classes. As this is the core of our framework, we elaborate on it in the following sections (Sections 3.4 and 3.5).

[Figure 3 panels: our general classification network; our framework when using the Baseline++ method (cosine distance between features and class weight vectors); our framework when using the meta method (sampled episodes with support and query sets, class means, Euclidean distance). The three-step search algorithm (train, evolutionary search, test) operates on the feature extractor.]

Figure 3: Our three-step search algorithm operates on the feature extractor $f_\theta(x)$. Our general framework can easily be incorporated into the baseline method with cosine distance, denoted as baseline++ (Chen et al. 2019), as well as into meta-learning based methods.

Step 3: Partially transfer via searched strategy.

As shown in Figure 2 (c), the final step is to apply our searched transfer strategy to the novel classes. Different from the simple baseline, which fixes the backbone and fine-tunes only the last linear layer, or meta-learning methods, which use the base network as a feature extractor for meta-testing, we partially fine-tune our base network on the novel support set based on the searched strategies for both types of methods. This is also the core component for achieving significant improvement.
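A searched strategy can be read as one learning rate per backbone layer, where a rate of zero means the layer stays frozen. The sketch below shows one way such a vector could be turned into a list of frozen layers plus per-layer optimizer settings. The function name `build_param_groups`, the layer names, and the example rates are hypothetical; the classifier rate of 0.01 follows the fine-tuning setting reported in the experiments. This is an illustration of the idea rather than the authors' implementation.

```python
def build_param_groups(layer_names, strategy, fc_lr=0.01):
    """Translate a layer-wise learning-rate vector into fine-tuning settings.

    A learning rate of 0 freezes the layer; any positive value fine-tunes it
    with that rate. The final classifier is always fine-tuned with `fc_lr`.
    """
    if len(layer_names) != len(strategy):
        raise ValueError("need exactly one learning rate per layer")
    frozen = [name for name, lr in zip(layer_names, strategy) if lr == 0]
    groups = [{"layer": name, "lr": lr}
              for name, lr in zip(layer_names, strategy) if lr > 0]
    groups.append({"layer": "classifier", "lr": fc_lr})
    return frozen, groups

# Hypothetical six-layer backbone and an illustrative strategy vector.
names = [f"conv{i}" for i in range(1, 7)]
strategy = [0.0, 0.0, 0.01, 0.0, 0.1, 1.0]
frozen, groups = build_param_groups(names, strategy)
```

In a deep learning framework, each entry of `groups` would typically become one optimizer parameter group, while the parameters of the layers in `frozen` would have gradient computation disabled. The fixed-backbone baseline corresponds to an all-zero strategy, for which only the classifier group remains.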

Our search space is related to the model architecture we utilize for few-shot classification. Generally, it contains the layer-level selection (fine-tuning or freezing) and the learning rate assignment for fine-tuning. The size of the search space can be formulated as $m^K$, where $m$ is the number of choices for learning rate values and $K$ is the number of layers in the network. For example, suppose we choose the learning rate from a zoo of four values, one of which is 0 ("learning rate = 0" indicates that we freeze this layer during fine-tuning), i.e., $m = 4$. For the Conv6 structure, the search space then includes $4^6 = 4096$ possible transfer strategies. Our search method can automatically match the optimal choice for each layer from the learning rate zoo during fine-tuning. A brief comparison of the search space is given in Table 1. It increases sharply if we choose deeper networks.

Table 1: Search space of P-Transfer.

Network       Conv6    ResNet-12    ResNet-K
Complexity    m^6      m^12         m^K

Our search step follows the evolutionary algorithm. Evolutionary algorithms, a.k.a. genetic algorithms, are based on the natural evolution of species. They contain reproduction, crossover, and mutation stages. In our scenario, a population of strategies is first embedded into vectors V and initialized randomly. Each individual v encodes one fine-tuning strategy. After initialization, we evaluate each individual strategy v to obtain its accuracy on the validation set. Among the evaluated strategies we select the top K as parents to produce offspring strategies. The next generation of strategies is produced by the mutation and crossover stages. By repeating this process over several iterations, we can find the fine-tuning strategy with the best validation performance. The detailed search pipeline is presented in Algorithm 1, and the hyper-parameters for this algorithm are introduced in Section 4.

In this work we conduct the evolutionary search in transfer learning for few-shot classification. We target fine-tuning with diverse learning rates to explore suitable knowledge transfer patterns with a simple and effective strategy design. At each layer, the learning rate is selected from a pre-defined zoo containing all possible choices.

As shown in Figure 3, we now describe how to incorporate our search algorithm into existing few-shot classification frameworks. We choose the non-meta baseline++ (Chen et al. 2019) and the meta-based ProtoNet (Snell, Swersky, and Zemel 2017) as examples.

Upon simple baseline++.

Baseline++ aims to explicitly reduce intra-class variation among features by applying cosine distances between the feature and the weight vector in the training and fine-tuning stages. As shown in Figure 3 (left sub-figure), we follow the design of the distance-based classifier in searching, but adjust the backbone feature extractor $f_\theta(x)$ by exploring different learning rates for different layers during fine-tuning. Intuitively, the backbone and distance-based classifier learned with our search method are more harmonious and powerful than freezing the backbone network and only fine-tuning the weight vectors for few-shot classification, since our whole model is tuned end-to-end.

Algorithm 1: Evolutionary algorithm for searching the best fine-tuning configuration.

Input: trained feature extractor N; layer index in a network l; the meta-validation loss L; number of random sampling operations R; number of mutations M; number of crossovers C; max number of iterations I.
Output: optimized fine-tuning configuration v*.

    define miniEval(v_i):
        N_i = Load(N)
        grad_l = 0 if v_i[l] = 0
        {v_i, accuracy} = miniFinetune(N_i)
        return {v_i, accuracy}

    for i = 0 : R do
        v_i = RandomChoice([0, m], L)               ▷ lr for each layer
        {v_i, accuracy} = miniEval(v_i)
    end for
    v_topK = TopK({V, accuracy})                    ▷ K vectors; V is the set of {v_i}
    for j = 0 : I do
        for i = 0 : M do
            v_i = Mutation(v_topK, L)
            {v_i, accuracy} = miniEval(v_i)
        end for
        for i = 0 : C do
            v_i = Crossover({v_topK, v_topK}, L)    ▷ crossover vectors between two parents
            {v_i, accuracy} = miniEval(v_i)
        end for
        v_topK = TopK({V, accuracy})                ▷ K vectors
    end for
    v*, acc* = Top1({V, accuracy})
    return v*

Upon meta-learning based methods.

As shown in Figure 3 (right sub-figure), we describe how to apply our search method to meta-learning methods for few-shot classification. In the meta-training stage, the algorithm first randomly chooses N classes, and samples a small base support set $x_b(s)$ and a base query set $x_b(q)$ from the samples within these classes. The objective is to learn a classification model M that minimizes the N-way prediction loss of the samples in the query set $x_b(q)$. Here, the classifier M is conditioned on the provided support set $x_b(s)$. Similar to baseline++, we train the classification model M by fine-tuning the backbone network and the classifier simultaneously, to discover the optimal fine-tuning strategy. As the predictions from a meta-based classifier are conditioned on the given support set, the meta-learning method can learn to learn from limited labeled data through a collection of episodes.
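The search loop of Algorithm 1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the learning-rate zoo values, the mutation probability, and the `top_k`, `n_mut` and `n_cross` settings are placeholders, and `toy_accuracy` is a stand-in for the expensive step of fine-tuning on support sets and measuring validation accuracy. Only the population size and iteration count of 20 follow the experimental settings.

```python
import random

LR_ZOO = [0.0, 0.01, 0.1, 1.0]            # assumed zoo; 0.0 means "freeze this layer"
SEARCH_SPACE_CONV6 = len(LR_ZOO) ** 6     # m^K strategies for a 6-layer backbone

def mutate(parent, rng, prob=0.3):
    """Resample each layer's learning rate from the zoo with probability `prob`."""
    return [rng.choice(LR_ZOO) if rng.random() < prob else v for v in parent]

def crossover(a, b, rng):
    """Pick each layer's learning rate from one of the two parents."""
    return [rng.choice(pair) for pair in zip(a, b)]

def evolutionary_search(evaluate, num_layers, rng, population=20,
                        iterations=20, top_k=5, n_mut=10, n_cross=10):
    scored = {}                            # strategy tuple -> accuracy

    def mini_eval(strategy):
        key = tuple(strategy)
        if key not in scored:              # avoid re-evaluating known strategies
            scored[key] = evaluate(key)

    for _ in range(population):            # random initialization
        mini_eval([rng.choice(LR_ZOO) for _ in range(num_layers)])
    for _ in range(iterations):
        parents = sorted(scored, key=scored.get, reverse=True)[:top_k]
        for _ in range(n_mut):
            mini_eval(mutate(rng.choice(parents), rng))
        for _ in range(n_cross):
            mini_eval(crossover(rng.choice(parents), rng.choice(parents), rng))
    best = max(scored, key=scored.get)
    return list(best), scored[best]

# Toy fitness: fraction of layers matching a hidden target strategy.
TARGET = (0.0, 0.0, 0.01, 0.0, 0.1, 1.0)

def toy_accuracy(strategy):
    return sum(a == b for a, b in zip(strategy, TARGET)) / len(TARGET)

best, acc = evolutionary_search(toy_accuracy, num_layers=6, rng=random.Random(0))
```

Because the all-zero vector (fixed backbone) is itself a point in this search space, the best strategy found can do no worse on the validation data than the fixed baseline whenever that vector has been evaluated.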

4. Experiments

Dataset.

We verify our method for few-shot learning on both the mini-ImageNet dataset and the CUB200-2011 dataset. mini-ImageNet is a commonly used dataset for few-shot classification. It consists of 100 classes from the ImageNet dataset (Deng et al. 2009), with 600 images for each class. We follow (Ravi and Larochelle 2016) to split the data into 64 base classes, 16 validation classes and 20 novel classes. CUB200-2011 contains 200 classes of birds (Wah et al. 2011). Following (Hilliard et al. 2018), we split the data into 100 base classes, 50 validation classes and 50 novel classes. We validate the effectiveness of our method for generic classification on mini-ImageNet, for fine-grained classification on CUB, and for cross-domain adaptation by transferring knowledge learned from mini-ImageNet to CUB.

Implementation.

For meta methods, we sample episodes with 5 classes from the target dataset. Then for each class, we sample k instances as the support set and 15 instances as the query set for a k-shot task. In training, we train for 60,000 episodes for 1-shot and 40,000 episodes for 5-shot tasks on the base dataset. In search, we sample 20 episodes from the validation dataset. We fine-tune the network on the support set for 100 iterations and evaluate the network on the query set. In evaluation, we fine-tune layers following the searched configuration on the support set and evaluate on the query set with episodes sampled from the novel dataset. We run 600 episodes and report the average accuracy and the 95% confidence intervals. The non-meta method differs only in the training stage, where we train the feature extractor for 400 epochs with a batch size of 16 on the base dataset.

We adopt the Adam optimizer with a learning rate of 1e-3 for training. In fine-tuning, we use SGD with a learning rate of 0.01 for the fully-connected layer and the searched learning rates for the corresponding layers. We use standard data augmentation including random crop, horizontal flip and color jitter. For Algorithm 1, we set the population size P = 20, the max iterations I = 20, and the numbers of random sampling (R), mutation (M) and crossover (C) operations to […].

Comparison to fixed and manually designed fine-tuning.

We first compare our proposed method with fixed and manually designed fine-tuning schemes using the Conv6 and ResNet-12 structures on CUB and mini-ImageNet. We compare with fine-tuning the last convolutional layer because the last layer is generally more domain-specific; thus, in manually designed fine-tuning schemes, researchers usually fine-tune the last convolutional layer as a solution. Our results are shown in Tables 2 (Conv6) and 3 (ResNet-12), where in general our evolutionary strategy achieves better accuracy than fixing the backbone and the human-defined strategy. As a baseline, we obtain 40.84 ± […] mini-ImageNet.

Table 2: We validate on the non-meta method with the Conv6 structure. We report the mean of 600 randomly generated episodes and the 95% confidence intervals. We compare the original learning algorithm (i.e., fine-tuning the fully-connected layer only, referred to as "Fixed" in the table) with fine-tuning the human-defined last convolutional layer (i.e., "Manual" in the table) and fine-tuning the layers based on the evolutionary-searched scheme (i.e., "Searched" in the table).

[Table 2 layout: rows Fixed, Manual, Searched; columns CUB 1-shot, CUB 5-shot, mini-ImageNet 1-shot, mini-ImageNet 5-shot. Numeric entries: […]]

Table 3: Few-shot classification results on the mini-ImageNet and CUB datasets with the ResNet-12 structure. We apply our proposed algorithm to the Baseline++ (non-meta) few-shot method as well as to the meta-learning method (ProtoNet). The results show that in most cases, the proposed algorithm can discover a better knowledge transfer scheme than the original scheme.

[Table 3 layout: one block per few-shot learning method (Baseline++, ProtoNet), each starting with a Fixed row; columns CUB 1-shot, CUB 5-shot, mini-ImageNet 1-shot, mini-ImageNet 5-shot. Surviving entry: Baseline++, Fixed, CUB 1-shot: 70.72 ± […]. Other entries: […]]

[Figure 4 layout: searched schemes for mini-ImageNet, CUB and the cross-domain setting, shown for Conv6 and ResNet-12.]

Figure 4: We visualize the fine-tuning schemes discovered by our evolutionary algorithm. The grey boxes denote layers without fine-tuning, and the colored boxes denote layers that require fine-tuning. Different scenarios have different searched schemes. For example, in cross-domain transfer learning, more layers need to be fine-tuned to adapt the knowledge to the target domain.

Comparison to different normalization layers in cross-domain setting.

Fine-tuning backbone networks is significantly affected by the batch size, yet in the few-shot classification scenario we do not have enough samples to increase the batch size. The conventional fixed-backbone solutions do not encounter this problem. Thus, we further explore whether a better normalization technique than batch norm can deliver further improvement. Our results are shown in Table 4: in the cross-domain settings, group norm (Wu and He 2018) can achieve much better accuracy (about 2∼[…]

To better understand our partial transfer method, we visualize the searched schemes in Figure 4. We observe two interesting phenomena which are in line with intuition: (1) deeper networks have more layers that need to be fine-tuned for few-shot learning; (2) when the domain difference between base and novel data increases (in the cross-domain scenario), more layers need to be fine-tuned.

Our final results are shown in Table 5. We can see that our partial transfer method consistently outperforms other state-of-the-art methods in both the 1-shot and 5-shot settings. Even without additional training techniques like DropBlock (Ghiasi, Lin, and Le 2018) and label smoothing (Szegedy et al. 2016), our method still obtains a significant improvement, as our flexible transfer/fine-tuning can benefit from the few support samples to adjust the backbone parameters.

We further explore traditional transfer learning from ImageNet (Deng et al. 2009) to CUB200-2011 with the Inception V3 network (Szegedy et al. 2016). We use the SGD optimizer with an initial learning rate of 0.01 that decays linearly to 0. In transferring, we observe that the weights learned with our method, i.e., re-initializing and fine-tuning a few layers for partial transfer, achieve higher accuracy than inheriting all the weights and fine-tuning them, as shown in Table 6.

5. Discussions

In the setting of few-shot learning, the pre-trained feature extractor is required to provide proper transferability from base to novel classes in the meta or non-meta learning stage. Basically, transfer learning aims to transfer the common knowledge from base objects to the novel classes. However, as we stated above, there is inevitably some unnecessary and even harmful information in the base classes. Since the novel data is scarce and sensitive to the feature extractor, the complete transfer strategy is not able to avoid this information, indicating that our method is a better solution for the few-shot scenario.

Table 4: We further evaluate our method on cross-domain few-shot learning tasks, i.e., transferring the knowledge from mini-ImageNet to CUB. We conduct experiments on both meta and non-meta methods. We find that when there is a domain difference, more layers need to be fine-tuned. Moreover, we find that using GroupNorm can also bridge the distribution difference between training and testing, outperforming the results obtained with BatchNorm.

[Table 4 layout: one block per few-shot learning method (the first is Baseline++), with rows for Conv6 and ResNet-12 starting with Fixed. Surviving leading entries of the Fixed rows: first block, Conv6 40.77 ± […], ResNet-12 43.14 ± […]; second block, Conv6 36.34 ± […], ResNet-12 39.14 ± […]. Other entries: […]]

Table 5: Comparison with the state-of-the-art results on the mini-ImageNet dataset. * denotes results re-implemented by us. † indicates results obtained with additional training techniques like DropBlock and label smoothing.

Method                                            Backbone     1-shot          5-shot
MatchingNet (Vinyals et al. 2016)                 Conv4        43.56 ± […]     […]
MatchingNet* (Vinyals et al. 2016)                ResNet-12    […]             […]
ProtoNet (Snell, Swersky, and Zemel 2017)         Conv4        48.70 ± […]     […]
ProtoNet* (Snell, Swersky, and Zemel 2017)        ResNet-12    […]             […]
Parameters from Activations (Qiao et al. 2018)    WRN-28-10    59.60 ± […]     […]
[…]
[…]† (Lee et al. 2019)                            ResNet-12    62.64 ± […]     […]
[…]

Table 6: Comparison between inheriting all weights and partially re-initializing weights in transfer learning.

[Table 6 layout: columns Baseline, Partial transfer; row Top-1 Accuracy: 82.9% […]]

Usually the base and novel classes are in the same domain, so pre-training the feature extractor on base data and then transferring it to novel data obtains good or moderate performance. However, as shown in Figure 4, in cross-domain transfer learning more layers need to be fine-tuned to adapt the knowledge to the target domain, since the source and target domains differ in content. In this circumstance, conventional transfer learning is no longer applicable. Our proposed partial transfer with different learning rates on different layers is well suited to this difficult situation. Intuitively, fixed transfer is a special case of our strategy, so ours has greater potential in few-shot learning.
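The observation that fixed transfer is a special case of partial transfer can be made concrete with a small sketch. It is a simplified illustration, not the authors' code: the layer names, weights, gradients, and learning rates are hypothetical, and a plain SGD update stands in for the full fine-tuning procedure. A layer with learning rate 0 is frozen, so setting every backbone rate to 0 recovers the fixed-backbone baseline.

```python
def sgd_step(params, grads, layer_lrs):
    """Apply one SGD update with a separate learning rate per layer.

    params and grads map layer names to lists of floats; layer_lrs maps
    layer names to learning rates. A missing or zero rate freezes the layer.
    """
    updated = {}
    for name, weights in params.items():
        lr = layer_lrs.get(name, 0.0)
        updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
    return updated


# Hypothetical three-layer model: two backbone layers and a classifier.
params = {"conv1": [0.5, -0.2], "conv2": [1.0, 0.3], "classifier": [0.1, 0.1]}
grads = {"conv1": [1.0, 1.0], "conv2": [2.0, -2.0], "classifier": [0.5, -0.5]}

# Partial transfer: conv1 stays frozen, conv2 is fine-tuned with a small rate.
partial_lrs = {"conv1": 0.0, "conv2": 0.01, "classifier": 0.1}
# Fixed transfer: every backbone rate is 0, only the classifier is trained.
fixed_lrs = {"conv1": 0.0, "conv2": 0.0, "classifier": 0.1}

after_partial = sgd_step(params, grads, partial_lrs)
after_fixed = sgd_step(params, grads, fixed_lrs)
```

In this toy setting the two configurations differ only in `conv2`, which moves under `partial_lrs` and stays at its pre-trained values under `fixed_lrs`.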

6. Conclusion

We have introduced a partial transfer (P-Transfer) method for few-shot classification. Our method is the first attempt to thoroughly explore the capability of transferring through search strategies in the few-shot scenario without any proxy. It boosts both meta and non-meta based methods by a large margin in the 1-shot and 5-shot settings, as our flexible transfer/fine-tuning can exploit the few support samples to adjust the backbone parameters. Intuitively, our partial transfer has greater potential for few-shot classification and even for traditional transfer learning. We hope our method can inspire more work in this direction. In the future, we will analyze further how partial transfer helps few-shot problems, and we will apply our method to other few-shot tasks such as detection and segmentation to explore the upper limit of the proposed transfer method.

Potential Ethical Impact

This work has the following potential positive and negative impacts on society. Machine learning algorithms have been highly studied in data-intensive applications, such as large-scale ImageNet classification, but are often hampered when the training data is small. This work tackles this problem by using prior knowledge from other sufficiently labeled data, and can rapidly generalize to new data containing only a few annotated samples. It can help researchers or industry build systems in areas where data is costly to collect, such as medical images or wild animals. However, we should be cautious about failures of such a system, which could lead to unreliable conclusions, such as a misclassified disease in medical images that misleads doctors.

References

Andrychowicz, M.; Denil, M.; Gomez, S.; Hoffman, M. W.; Pfau, D.; Schaul, T.; Shillingford, B.; and De Freitas, N. 2016. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 3981–3989.
Antoniou, A.; Edwards, H.; and Storkey, A. 2018. How to train your MAML. arXiv preprint arXiv:1810.09502.
Chen, W.-Y.; Liu, Y.-C.; Kira, Z.; Wang, Y.-C. F.; and Huang, J.-B. 2019. A Closer Look at Few-shot Classification. In International Conference on Learning Representations.
Chen, Y.; Wang, X.; Liu, Z.; Xu, H.; and Darrell, T. 2020. A new meta-baseline for few-shot learning. arXiv preprint arXiv:2003.04390.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In , 248–255. IEEE.
Elsken, T.; Staffler, B.; Metzen, J. H.; and Hutter, F. 2020. Meta-Learning of Neural Architectures for Few-Shot Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12365–12375.
Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 1126–1135. JMLR.org.
Garcia, V.; and Bruna, J. 2017. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043.
Ghiasi, G.; Lin, T.-Y.; and Le, Q. V. 2018. Dropblock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, 10727–10737.
Gidaris, S.; and Komodakis, N. 2018. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4367–4375.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Hilliard, N.; Phillips, L.; Howland, S.; Yankov, A.; Corley, C. D.; and Hodas, N. O. 2018. Few-shot learning with metric-agnostic conditional embeddings. arXiv preprint arXiv:1802.04376.
Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708.
Ioffe, S.; and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Kim, J.; Kim, H.; and Kim, G. 2020. Model-Agnostic Boundary-Adversarial Sampling for Test-Time Generalization in Few-Shot learning. In ECCV.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.
Lee, K.; Maji, S.; Ravichandran, A.; and Soatto, S. 2019. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 10657–10665.
Li, Z.; Zhou, F.; Chen, F.; and Li, H. 2017. Meta-sgd: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835.
Liu, H.; Simonyan, K.; Vinyals, O.; Fernando, C.; and Kavukcuoglu, K. 2017. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436.
Miikkulainen, R.; Liang, J.; Meyerson, E.; Rawal, A.; Fink, D.; Francon, O.; Raju, B.; Shahrzad, H.; Navruzyan, A.; Duffy, N.; et al. 2019. Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, 293–312. Elsevier.
Mishra, N.; Rohaninejad, M.; Chen, X.; and Abbeel, P. 2017. A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141.
Munkhdalai, T.; and Yu, H. 2017. Meta networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 2554–2563. JMLR.org.
Munkhdalai, T.; Yuan, X.; Mehri, S.; and Trischler, A. 2017. Rapid adaptation with conditionally shifted neurons. arXiv preprint arXiv:1712.09926.
Oreshkin, B.; López, P. R.; and Lacoste, A. 2018. Tadam: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, 721–731.
Qi, H.; Brown, M.; and Lowe, D. G. 2018. Low-shot learning with imprinted weights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5822–5830.
Qiao, S.; Liu, C.; Shen, W.; and Yuille, A. L. 2018. Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7229–7238.
Ravi, S.; and Larochelle, H. 2016. Optimization as a model for few-shot learning.
Real, E.; Aggarwal, A.; Huang, Y.; and Le, Q. V. 2019. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 4780–4789.
Real, E.; Moore, S.; Selle, A.; Saxena, S.; Suematsu, Y. L.; Tan, J.; Le, Q. V.; and Kurakin, A. 2017. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 2902–2911. JMLR.org.
Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Snell, J.; Swersky, K.; and Zemel, R. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 4077–4087.
Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P. H.; and Hospedales, T. M. 2018. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1199–1208.
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9.
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.
Thrun, S.; and Pratt, L. 2012. Learning to learn. Springer Science & Business Media.
Tian, Y.; Wang, Y.; Krishnan, D.; Tenenbaum, J. B.; and Isola, P. 2020. Rethinking Few-Shot Image Classification: a Good Embedding Is All You Need? arXiv preprint arXiv:2003.11539.
Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D.; et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 3630–3638.
Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The caltech-ucsd birds-200-2011 dataset.
Wu, Y.; and He, K. 2018. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), 3–19.
Xie, L.; and Yuille, A. 2017. Genetic cnn. In Proceedings of the IEEE International Conference on Computer Vision, 1379–1388.
Ye, H.-J.; Hu, H.; Zhan, D.-C.; and Sha, F. 2020. Few-shot learning via embedding adaptation with set-to-set functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8808–8817.
Zhang, C.; Cai, Y.; Lin, G.; and Shen, C. 2020. DeepEMD: Few-Shot Image Classification with Differentiable Earth Mover’s Distance and Structured Classifiers. In