BOIL: Towards Representation Change for Few-shot Learning
Does MAML really want feature reuse only?
Jaehoon Oh ∗  Graduate School of KSE, KAIST  [email protected]
Hyungjun Yoo ∗  Graduate School of KSE, KAIST  [email protected]
ChangHwan Kim
Graduate School of KSE, KAIST  [email protected]
Se-Young Yun
Graduate School of AI, KAIST  [email protected]
Abstract
Meta-learning, the effort to solve new tasks with only a few samples, has attracted great attention in recent years. Model-Agnostic Meta-Learning (MAML) is one of the most representative gradient-based meta-learning algorithms. MAML learns new tasks from a few data samples with inner updates from a meta-initialization point and learns the meta-initialization parameters with outer updates. Recently, it has been hypothesized that feature reuse, which makes little change in efficient representations, is the dominant factor in the performance of the meta-initialized model through MAML, rather than rapid learning, which makes a big change in representations. In this work, we propose a novel meta-learning algorithm, coined as
BOIL (Body Only update in Inner Loop), which updates only the body (extractor) of the model and freezes the head (classifier) of the model during inner loop updates. The BOIL algorithm thus heavily relies on rapid learning. Note that BOIL runs in the opposite direction to the hypothesis that feature reuse is more efficient than rapid learning. We validate the BOIL algorithm on various data sets and show significant performance improvement over MAML. The results imply that rapid learning in gradient-based meta-learning approaches is necessary.
∗ The authors contributed equally to this paper. Preprint. Under review.

One of the most promising fields in machine learning is few-shot learning. Meta-learning, also known as "learning to learn", is a methodology enabling fast adaptation of a model to new data through previous learning experiences. To address few-shot learning successfully, meta-learning with deep neural networks has mainly been studied through metric- and gradient-based approaches. Such approaches aim to learn a model with only a few data samples and have shown generalized performance on previously unseen data. Metric-based meta-learning [13, 34, 29, 31] compares the distance between feature embeddings, using models as a mapping function from data into an embedding space, whereas gradient-based meta-learning [25, 5, 40] learns parameters that can quickly adapt when the models encounter new tasks.

Model-agnostic meta-learning (MAML) [5] is the most representative gradient-based meta-learning algorithm; it learns the parameters through nested gradient update loops that consist of an inner loop and an outer loop. The inner loop conducts task-specific learning for each task, and the outer loop aims to represent the generalization across tasks. After considerable iterations, the model has meta-initialized parameters, which allow unseen tasks to be learned quickly from a few samples with a few inner updates. This algorithm has had a substantial impact on the research field of meta-learning, and numerous follow-up studies have been conducted [21, 23, 28, 32, 40, 30].

A very recent study by Raghu et al. [24] attempted to analyze why a meta-trained model can learn new tasks fast and argued that the main reason is that the meta-initialized parameters provide high-quality features prior to the inner updates. They claimed that MAML learns new tasks by updating the head (the last fully connected layer) with almost the same features (the output of the penultimate layer) from the meta-initialized network. A small change in the representations during task learning is named feature reuse, whereas a big change is named rapid learning. Herein, we pose an intriguing question: Does MAML really want feature reuse only?
Instead, it is reasonable for gradient-based meta-learning to conduct rapid learning in accordance with a given task from the meta-initialized body (extractor). In general, the potency of feature reuse is closely related to the similarity between the source and target domains: the higher the similarity, the higher the efficiency. However, because the ultimate goal of meta-learning is to solve unseen tasks even when there is no significant similarity between the old and the new, rapid learning should be considered as well.

From this consideration, we suggest a new algorithm to enable rapid learning in gradient-based meta-learning and investigate this algorithm's advantages compared to MAML. Our contributions are summarized as follows:

• We propose a simple but effective meta-learning algorithm that learns the Body (extractor) of the model Only in the Inner Loop, coined as BOIL.

• We demonstrate that the BOIL algorithm enjoys feature reuse on the low- and mid-level body and rapid learning on the high-level body, using the cosine similarity and the Centered Kernel Alignment (CKA).

• We study the optimal meta-initialization of the head (classifier) parameters and show that orthonormality of the head parameters is an important condition for optimizing the meta-initialization. Furthermore, we observe that learning the meta-initialized head from an orthonormal initialization improves the performance and convergence speed of BOIL but worsens them for MAML.

• We empirically show that BOIL improves the performance over all benchmark data sets and that this improvement is particularly noticeable on fine-grained data sets or cross-domain adaptation.

• For ResNet architectures, we propose a disconnection trick that removes the back-propagation path of the last skip connection. The disconnection trick strengthens feature reuse on the low- and mid-level body and rapid learning on the high-level body.
This section first describes MAML under the few-shot learning framework and then summarizes two hypotheses regarding the effectiveness of this algorithm.
The MAML algorithm [5] attempts to meta-learn the best initialization of parameters for a task-learner. It consists of two main optimization loops, i.e., an inner loop and an outer loop. We first sample a batch of tasks from a data set distribution. Each task τ_i consists of a support set S_{τ_i} and a query set Q_{τ_i}. When we sample a support set for each task, we first sample n labels from the label set and then sample k instances for each label; thus, each support set contains n × k instances. For a query set, we sample instances from the same labels as the support set.

With these composed tasks, the MAML algorithm performs meta-training and meta-testing. During meta-training, we first sample a meta-batch consisting of B tasks from the meta-training data set. In the inner loops, we update the meta-initialized parameters θ to task-specific parameters θ_{τ_i} using the task-specific loss L_{S_{τ_i}}(f_θ) as follows:

\[ \theta_{\tau_i} = \theta - \alpha \nabla_{\theta} \mathcal{L}_{S_{\tau_i}}(f_{\theta}) \tag{1} \]

Using the query set of the corresponding task, we compute the loss L_{Q_{τ_i}}(f_{θ_{τ_i}}) based on each inner-updated parameter. By summing all these losses, the meta-loss of each meta-batch, L_meta(θ), is computed. The meta-initialized parameters are then updated using the meta-loss in the outer loop through gradient descent:

\[ \theta' = \theta - \beta \nabla_{\theta} \mathcal{L}_{meta}(\theta), \quad \text{where } \mathcal{L}_{meta}(\theta) = \sum_{i=1}^{B} \mathcal{L}_{Q_{\tau_i}}(f_{\theta_{\tau_i}}) \tag{2} \]

In meta-testing, the inner loop, which can be interpreted as task-specific learning, is the same as in meta-training. However, the outer loop only computes the accuracy of the model using the query set of each task and does not perform gradient descent, and thus it does not update the meta-initialization parameters. Although the inner loop(s) can be applied in one or more steps, for the sake of simplicity, we consider only the case of a single inner loop.
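To make the nested loops concrete, the following is a minimal PyTorch-style sketch of one meta-training step (Equations 1 and 2). The toy two-layer model and names such as forward, support_x, and alpha are our own illustration, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def forward(params, x):
        # body: a single hidden layer; head: a linear classifier
        h = F.relu(x @ params["w_body"] + params["b_body"])
        return h @ params["w_head"] + params["b_head"]

    def maml_meta_step(params, tasks, alpha=0.01, beta=0.001):
        meta_loss = 0.0
        for support_x, support_y, query_x, query_y in tasks:  # a meta-batch of B tasks
            # inner loop (Eq. 1): one gradient step on the support set
            inner_loss = F.cross_entropy(forward(params, support_x), support_y)
            grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
            fast = {k: p - alpha * g for (k, p), g in zip(params.items(), grads)}
            # query loss of the task-adapted parameters, summed into the meta-loss (Eq. 2)
            meta_loss = meta_loss + F.cross_entropy(forward(fast, query_x), query_y)
        # outer loop: gradient descent on the meta-initialization
        meta_grads = torch.autograd.grad(meta_loss, list(params.values()))
        return {k: (p - beta * g).detach().requires_grad_()
                for (k, p), g in zip(params.items(), meta_grads)}

The second-order terms arise because the inner gradients are computed with create_graph=True, so the outer gradient flows through the inner update.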
Figure 1: Difference in task-specific (inner) updates between MAML and BOIL. In the figure, the lines denote the decision boundaries defined by the head (classifier) of the network, and the different shapes and colors denote different classes. (a) MAML mainly updates the head, with negligible change in the body (extractor), during inner updates; hence, the representations in the feature space are almost identical. (b) BOIL updates only the body, without any change in the head, during inner updates; hence, the representations in the feature space change significantly with respect to the fixed decision boundaries.
To reveal the effectiveness of MAML in solving new tasks, Raghu et al. [24] proposed two opposing hypotheses, rapid learning and feature reuse. These two hypotheses concern the body of the network, usually referring to the convolutional layers in a convolutional neural network (CNN). To summarize, the rapid learning hypothesis attributes the capability of MAML to the updates of the body during inner loops, whereas the feature reuse hypothesis considers the body to be universal to all tasks. The authors demonstrated that feature reuse is the dominant factor in the MAML performance by showing that there is little difference in accuracy even if all of the extractor layers are frozen in the inner loops.

Based on the feature reuse hypothesis, the authors proposed the ANIL (Almost No Inner Loop) algorithm, which updates only the head in the inner loops during training and testing, and the NIL (No Inner Loop) algorithm, which replaces the classifier with the distance between the representations of a support set and those of a query set during testing. Both algorithms have performance comparable to MAML, which implies that a body trained only through the outer loops is sufficient to achieve the desired performance.

Nevertheless, the authors mentioned that the development and inspection of novel meta-learning algorithms based on rapid learning are required because rapid learning might enlarge the problem-solving area. Based on this insight, we develop a rapid learning-based meta-learning algorithm and analyze it extensively.
Inspired by [24], we design an algorithm that updates only the body of the model and freezes the head of the model during task learning to enforce rapid learning. Because the gradients must be back-propagated to update the body, we set the learning rate of the head to zero in the inner updates during both meta-training and meta-testing. Otherwise, the learning and evaluation procedures of BOIL are the same as those of MAML. Therefore, the computational overhead does not change.

Formally speaking, with the notation used in Section 2.1, the meta-initialized parameters θ can be separated into body parameters θ_b and head parameters θ_h, i.e., θ = {θ_b, θ_h}. For a sample image x ∈ R^i, an output can be expressed as ŷ = f_θ(x) = f_{θ_h}(f_{θ_b}(x)) ∈ R^n, where f_{θ_b}(x) ∈ R^d. The task-specific body parameters θ_{b,τ_i} and head parameters θ_{h,τ_i} through an inner loop given task τ_i are then as follows:

\[ \theta_{b,\tau_i} = \theta_b - \alpha_b \nabla_{\theta_b} \mathcal{L}_{S_{\tau_i}}(f_{\theta}) \quad \text{and} \quad \theta_{h,\tau_i} = \theta_h - \alpha_h \nabla_{\theta_h} \mathcal{L}_{S_{\tau_i}}(f_{\theta}) \tag{3} \]

where α_b and α_h are the inner loop learning rates corresponding to the body and head, respectively. MAML usually sets α = α_b = α_h (≠ 0), whereas BOIL sets α_b ≠ 0 and α_h = 0.

This simple difference changes the dominant factor of task learning from the head to the body. Figure 1 shows the main difference in the inner updates between MAML and BOIL. To solve new tasks, the head mainly changes with MAML [24], whereas with BOIL, only the body changes. In the rest of this section, we demonstrate that BOIL enjoys both rapid learning and feature reuse and improves both the performance and convergence speed.

3.1 Rapid learning and feature reuse on the body of BOIL

We compute the cosine similarities and CKA values of the convolution layers to analyze whether the learning scheme of BOIL is rapid learning or feature reuse, using the meta-trained 4conv network (as detailed in Appendix A). We first investigate the cosine similarity between the representations of a query set, consisting of 5 classes and 15 samples per class, after every convolution module. In Figure 2, the orange line represents the average of the cosine similarities between samples of the same class, and the blue line represents the average of the cosine similarities between samples of different classes. In Figures 2a and 2b, the left panel is before inner loop adaptation and the right panel is after inner loop adaptation.
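Returning to the inner update in Equation 3, a minimal sketch of the BOIL inner step is shown below; it reuses forward and the parameter dictionary from the MAML sketch above, and the zero head learning rate is the only change (this is our illustration rather than the authors' code).

    def boil_inner_update(params, support_x, support_y, alpha_b=0.5):
        # task-specific loss on the support set
        loss = F.cross_entropy(forward(params, support_x), support_y)
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        fast = {}
        for (name, p), g in zip(params.items(), grads):
            alpha = 0.0 if "head" in name else alpha_b   # alpha_h = 0, alpha_b != 0 (Eq. 3)
            fast[name] = p - alpha * g
        return fast

The outer loop of Equation 2 is unchanged, so this adds no computational overhead over MAML.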
Figure 2: Cosine similarity of the 4conv network. (a) MAML; (b) BOIL. In each panel pair, the left panel is before inner loop adaptation and the right panel is after inner loop adaptation (orange: intra-class similarity, blue: inter-class similarity).

The key observations from Figure 2, which are discussed together with other experiments in Section 4.2.1, are as follows:

• Before inner loop adaptation, MAML makes the average of the cosine similarities decrease monotonically and makes the representations separable by class as the representations reach the last convolution layer. In contrast, BOIL reduces the average only up to conv3. More importantly, with BOIL, all the representations are concentrated regardless of their classes at the last convolution module. This implies that the body meta-initialized by MAML can distinguish classes after conv4, while the body meta-initialized by BOIL cannot.

• MAML does not show any noticeable difference after inner loop adaptation. In contrast, BOIL makes significant differences among different classes at the last convolution layer after inner loop adaptation. We believe that MAML follows the feature reuse training scheme, whereas BOIL follows both the feature reuse (before the last layer) and rapid learning (at the last layer) training schemes.
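The per-layer statistics behind Figure 2 and the CKA comparison in Figure 3 can be computed as below. This is a minimal sketch in PyTorch, assuming feats holds the flattened representations of a query set after a given layer and labels the class indices; it uses the standard linear CKA of Kornblith et al. [14], and the variable and function names are ours.

    import torch
    import torch.nn.functional as F

    def intra_inter_cosine(feats, labels):
        f = F.normalize(feats.flatten(1), dim=1)            # unit-norm feature vectors
        sim = f @ f.t()                                      # pairwise cosine similarities
        same = labels.unsqueeze(0) == labels.unsqueeze(1)    # same-class pairs
        off_diag = ~torch.eye(len(labels), dtype=torch.bool)
        intra = sim[same & off_diag].mean().item()           # orange lines in Figure 2
        inter = sim[~same].mean().item()                     # blue lines in Figure 2
        return intra, inter

    def linear_cka(x, y):
        # linear CKA between two representation matrices of the same samples [14]
        x = x.flatten(1) - x.flatten(1).mean(0, keepdim=True)
        y = y.flatten(1) - y.flatten(1).mean(0, keepdim=True)
        hsic = (x.t() @ y).norm() ** 2
        return (hsic / ((x.t() @ x).norm() * (y.t() @ y).norm())).item()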
Figure 3: CKA of the 4conv network.

Next, we demonstrate that BOIL enjoys both feature reuse on the low- and mid-level layers and rapid learning on the high-level layer by computing the CKA [14] between the representations before and after the inner update. When the CKA between two representations is close to 1, the representations are almost identical. In Figure 3, BOIL has a low CKA for the last convolution module and the subsequent head. This result indicates that the BOIL algorithm learns rapidly at the last layer of the body in the inner updates.

Figure 4: Ideal meta-initialization.

In this section, we start by discussing what the ideal meta-initialization is. Because few-shot classification tasks are constructed from sampled classes each time, every task consists of different classes. Since the class indices are randomly assigned at the beginning of each task's learning, the meta-initialized parameters cannot contain any prior information on the class indices. For instance, the meta-initialized parameters are not allowed to encode class similarities between class i and class j. Any biased initial guess could hinder the task learning. The meta-initialized parameters should lie in between the (local) optimal points of tasks, as depicted in Figure 4, so that the network can adapt to each task with a few task-specific updates.
Figure 5: Validation accuracy curves of (a) the centering algorithm and (b) the fix algorithm on Cars.

When the head parameters θ_h = [θ_{h,1}, ..., θ_{h,n}]^⊤ ∈ R^{n×d} have orthonormal rows (i.e., ‖θ_{h,i}‖ = 1 for all i and θ_{h,i}^⊤ θ_{h,j} = 0 for all i ≠ j), the meta-initialized model can have an unbiased classifier. Here, a^⊤ denotes the transpose of a and ‖·‖ denotes the Euclidean norm. With orthonormal rows, therefore, each logit value θ_{h,j}^⊤ f_{θ_b}(x) can be controlled independently of the other logit values. Recall that the softmax probability p_j for class j of sample x is computed as follows:

\[ p_j(x) = \frac{e^{\theta_{h,j}^{\top} f_{\theta_b}(x)}}{\sum_{i=1}^{n} e^{\theta_{h,i}^{\top} f_{\theta_b}(x)}} = \frac{1}{\sum_{i=1}^{n} e^{(\theta_{h,i} - \theta_{h,j})^{\top} f_{\theta_b}(x)}} \tag{4} \]

Indeed, in Equation 4 the softmax probability depends only on the differences of the rows of the head parameters, θ_{h,i} − θ_{h,j}. Adding a vector to all the rows (i.e., θ_{h,i} ← θ_{h,i} + c for all i) does not change the softmax vector. Hence, we can expect an equally good meta-initialized model whenever a parallel shift of the rows of the head parameters yields orthonormal rows. To support this experimentally, we design the centering algorithm, which performs a parallel shift of θ_h by subtracting the average of the row vectors of θ_h after every outer update, for both MAML and BOIL, i.e., [θ_{h,1} − θ̄_h, ..., θ_{h,n} − θ̄_h]^⊤ where θ̄_h = (1/n) Σ_{i=1}^{n} θ_{h,i}. Figure 5a shows that this parallel shift operation does not affect the performance of the two algorithms on Cars.

Figure 6: Average of cosine similarities between gaps.

Next, we investigate the cosine similarity between θ_{h,i} − θ_{h,k} and θ_{h,j} − θ_{h,k} for all different i, j and fixed k. During the training procedures of MAML and BOIL, the average of the cosine similarities between the two gaps stays near 0.5 throughout meta-training (Figure 6). Note that 0.5 is the cosine similarity between θ_{h,i} − θ_{h,k} and θ_{h,j} − θ_{h,k} when θ_{h,i}, θ_{h,j}, and θ_{h,k} are orthonormal. From these results, we find evidence that the orthonormality of θ_h is important for the meta-initialization and that meta-learning algorithms naturally keep this orthonormality.

From the above observation, we design the fix algorithm, which fixes θ_h to be orthonormal for the meta-initialized model. Namely, MAML-fix updates θ_h in inner loops only, and BOIL-fix does not update θ_h at all. The fix algorithm can be easily implemented by initializing θ_h to be orthonormal through the Gram–Schmidt method applied to a random matrix and setting the learning rate for the head of the model during the outer loop to zero.

Figure 5b depicts the validation accuracy curves of the fix algorithm on Cars. The experiments substantiate that orthonormal rows of θ_h are important and that BOIL improves the performance. (1) Comparing MAML to MAML-fix (the left panel of Figure 5b), MAML-fix outperforms MAML. This means that the outer loop computed through the task-specific head, as in MAML, is detrimental because it merely adds unnecessary task-specific information to the model. (2) Comparing the vanilla models to the fix models (both panels of Figure 5b), the fixed meta-initialized head with orthonormality is less over-fitted, which is explained through the train accuracy curves in Appendix B.
(3) Comparing BOIL to BOIL-fix (the right panel of Figure 5b), although BOIL-fix can achieve almost the same performance as BOIL given sufficient iterations, BOIL converges faster to a better local optimum. This is because θ_h is trained so that the inner loop can easily adapt f_{θ_b}(x) to each class.
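A minimal sketch of the two head manipulations discussed above, assuming a head weight matrix of shape n × d with d ≥ n; we use a QR decomposition in place of an explicit Gram–Schmidt procedure, and the function names are our own.

    import torch

    def orthonormal_head(n_classes, feat_dim):
        # "fix" initialization: orthonormal rows obtained from a random matrix
        # (assumes feat_dim >= n_classes)
        q, _ = torch.linalg.qr(torch.randn(feat_dim, n_classes))  # orthonormal columns
        return q.t().contiguous()                                  # rows of theta_h are orthonormal

    def center_head(theta_h):
        # "centering": subtract the mean row after every outer update;
        # by Eq. 4 this leaves the softmax outputs unchanged
        return theta_h - theta_h.mean(dim=0, keepdim=True)

For the fix algorithm, this orthonormal head is excluded from the outer-loop update: MAML-fix still updates it in the inner loop, while BOIL-fix never updates it.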
We used two backbone networks: a 4conv network with 64 channels (from [34]) and ResNet-12, which starts with 64 channels and doubles them after every block (from [23]). For the batch normalization layers, we used batch statistics instead of the running statistics during meta-testing, following the original MAML [5]. We trained all models for 30,000 epochs and then used the last-epoch models to verify performance. We applied a single inner update in both meta-training and meta-testing. All results were reproduced by our group and reported as the average and standard deviation of the accuracies over 5 ×. We used two general (coarse-grained) data sets, miniImageNet [34] and tieredImageNet [26], and two specific (fine-grained) data sets, CUB [36] and Cars [15]. Full details on the implementation and data sets are described in Appendix A. In addition, the results of the 4conv network with 32 channels (from [5]) and of the other data sets at a size of 32 × 32 are reported in Appendix C and Appendix D, respectively.
Table 1: Test accuracy (%) of the 4conv network on the benchmark data sets.
Domain     General (Coarse-grained)            Specific (Fine-grained)
Dataset    miniImageNet    tieredImageNet      CUB      Cars
MAML(1)    48.47 ±

Table 2: Test accuracy (%) of the 4conv network on cross-domain adaptation.

Adaptation   general to general              general to specific            specific to general             specific to specific
Meta-train   tieredImageNet  miniImageNet    miniImageNet   miniImageNet    CUB            CUB              Cars    CUB
Meta-test    miniImageNet    tieredImageNet  CUB            Cars            miniImageNet   tieredImageNet   CUB     Cars
MAML(1)      49.45 ±

Table 1 shows that BOIL outperforms MAML on all benchmark data sets, with a particularly wide margin on the specific-domain data sets such as CUB and Cars. These results demonstrate that it is effective for the meta-initialized parameters to be learned in a task-specific update using a rapid learning scheme in gradient-based meta-learning. This also means that the BOIL algorithm does not depend on the fineness of the domain and can be broadly adopted.

Furthermore, Table 2 shows the superiority of BOIL on cross-domain adaptation, where the source and target domains differ (i.e., the meta-training and meta-testing data sets are different). Recently, Guo et al. [8] noted that existing meta-learning algorithms have weaknesses in terms of cross-domain adaptation. We divide cross-domain adaptation into four cases: general to general, general to specific, specific to general, and specific to specific. Previous studies considered the cross-domain scenario starting from the general domain [3, 8]. However, we also evaluated the reverse cases, which are considered more difficult. BOIL outperforms MAML not only in the typical cross-domain adaptation scenario but also in the reverse scenario. We believe that the rapid learning property of BOIL enables the model to adapt to an unseen target domain that is entirely different from the source domain.
Table 3: Test accuracy (%) of the 4conv network on miniImageNet according to the head's existence; (a) before and (b) after the inner update.

           with classifier    without classifier
MAML(1)    20.10 ±

All implementations are based on Torchmeta [4], and all results were reproduced by us.

• Without a classifier, in Table 3a: the body of MAML creates efficient representations before an inner update, whereas the body of BOIL creates relatively inefficient representations. This result is related to the feature reuse of MAML (the left panel of Figure 2a) and the rapid learning of BOIL (the left panel of Figure 2b).

• Without a classifier, in Table 3b: the body of BOIL can achieve better representations through rapid learning than the body of MAML if an adequate number of samples are available. This result can be explained by the dramatic decrease in the cosine similarity between different classes after an inner update (the right panel of Figure 2b).

• With a classifier, in Table 3a: the heads of MAML and BOIL seem to be ideally meta-initialized, which means that they cannot classify input data before an inner update. This result supports our hypothesis on the optimal point of meta-initialization (Figure 4).

• With a classifier, in Table 3b: the head of BOIL, meta-learned across the tasks, is well matched with the representations produced by the body, resulting in improved performance. By contrast, the head of MAML deteriorates the performance significantly (Figure 5b).

To summarize, the meta-initialization by MAML provides efficient representations through the body, although a significant problem occurs in that the head decreases the efficiency of the representations. By contrast, although the meta-initialization by BOIL provides less efficient representations compared to MAML, the body can extract efficient representations through task-specific updates based on rapid learning, and furthermore, the head boosts the performance.
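The "without classifier" evaluation replaces the head with a similarity comparison between query and support representations, in the spirit of NIL [24]. A minimal sketch of one such similarity-based prediction is given below, assuming f_body maps images to features; the prototype-style averaging and all names are our own simplification, and the exact protocol used for Table 3 may differ.

    import torch
    import torch.nn.functional as F

    def predict_without_classifier(f_body, support_x, support_y, query_x, n_way):
        # class prototypes: mean support representation per class
        s = F.normalize(f_body(support_x).flatten(1), dim=1)
        q = F.normalize(f_body(query_x).flatten(1), dim=1)
        protos = torch.stack([s[support_y == c].mean(0) for c in range(n_way)])
        # assign each query to the class whose prototype is most cosine-similar
        return (q @ F.normalize(protos, dim=1).t()).argmax(dim=1)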
Table 4: 5-way 5-shot test accuracy (%) of ResNet-12. Here, lsc denotes the last skip connection.
Meta-train    miniImageNet                                      CUB
Meta-test     miniImageNet   tieredImageNet   CUB     CUB    miniImageNet   Cars
MAML w/ lsc   67.96 ±

Many recent studies [23, 35, 28, 30] have used deeper networks such as ResNet [9], Wide-ResNet [39], or DenseNet [11] as the backbone network. Such deeper networks, in general, use feature-wiring structures that connect layers to facilitate feature propagation. We explore the applicability of BOIL to a deeper network with this wiring structure, ResNet-12, and propose a simple trick to boost rapid learning by disconnecting the last skip connection. The trick is explained in Section 4.3.1.

Table 4 shows the test accuracy results of ResNet-12, which is meta-trained and meta-tested with various data sets according to the fineness of the domains. This result indicates that BOIL can be applied to other general architectures, as it shows better performance than MAML not only on the standard benchmark data sets but also on cross-domain adaptation. Note that BOIL achieves the best performance without the last skip connection in every experiment.
Connecting the two learning schemes of BOIL and the wiring structure of ResNet, we propose a simple trick that eliminates the skip connection of the last residual block, which we call the disconnection trick. In Section 3.1, we confirmed that a model learned with BOIL applies feature reuse at the low and mid levels of the body and rapid learning at the high level of the body.
Figure 7: Cosine similarity of ResNet-12. (a) BOIL with the last skip connection; (b) BOIL without the last skip connection. In each panel pair, the left panel is before and the right panel is after inner loop adaptation.

To investigate the effects of skip connections on the rapid learning scheme, we analyze the cosine similarity after every residual block in the same way as in Figure 2. Figure 7a shows that ResNet with skip connections on all blocks changes not only the last block but also the other blocks rapidly. Because skip connections strengthen the gradient back-propagation, the scope of rapid learning extends toward the front of the network. Therefore, to achieve both the effective feature reuse and the rapid learning of BOIL, we suggest weakening the gradient back-propagation from the loss function by removing the skip connection of the last block. As shown in Figure 7b, with this simple disconnection trick, ResNet improves the effectiveness of BOIL, recovering the feature reuse at the front blocks of the body and the rapid learning at the last block, and significantly improves the performance, as described in Table 4.
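A sketch of the disconnection trick on a basic residual block; the block structure is a generic ResNet block rather than the authors' exact ResNet-12 block, and the use_skip flag is our own naming. Setting use_skip=False for the last block removes the back-propagation path through the skip connection while leaving all other blocks untouched.

    import torch.nn as nn

    class ResBlock(nn.Module):
        def __init__(self, in_ch, out_ch, use_skip=True):
            super().__init__()
            self.use_skip = use_skip                      # False for the last block (disconnection trick)
            self.conv = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
            )
            self.shortcut = nn.Conv2d(in_ch, out_ch, 1)   # 1x1 projection for the skip path
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.conv(x)
            if self.use_skip:
                out = out + self.shortcut(x)              # normal residual connection
            return self.relu(out)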
MAML [5] is one of the most famous algorithms in gradient-based meta-learning, achieving competitive performance on few-shot learning benchmark data sets [34, 26, 1, 23]. To tackle the task ambiguity caused by data insufficiency in few-shot learning, numerous studies have sought to extend MAML in various ways. Some studies [23, 30, 35] have proposed feature modulators that make task-specific adaptation more amenable by shifting and scaling the representations extracted from the network body. In response to the lack of data for task-specific updates, there have also been attempts to adapt a small number of additional parameters rather than the entire model parameters [40, 28]. Others [7, 6, 38, 20] have taken a probabilistic approach using Bayesian modeling and variational inference. Unlike prior studies, we propose a new training paradigm that reinforces the task-specific update by the model itself.

Few-shot learning has recently been expanding beyond the standard n-way k-shot classification to tackle more realistic problems. Triantafillou et al. [32] constructed a more scalable and realistic data set, called Meta-Dataset, which contains several data sets collected from different sources. Lee et al. [17] addressed n-way any-shot classification, considering the imbalanced data distributions in the real world. Furthermore, some studies [2, 3] have recently explored few-shot learning under cross-domain adaptation, which is one of the ultimate goals of meta-learning. In addition, Guo et al. [8] suggested a new cross-domain benchmark data set for few-shot learning and showed that current meta-learning algorithms [5, 34, 29, 31, 18] underachieve compared to simple fine-tuning on cross-domain adaptation. We demonstrated that a task-specific update with rapid learning is efficient for cross-domain adaptation.

In this study, we proposed the BOIL algorithm, which enforces rapid learning by learning only the body of the model in the inner loop. Using the cosine similarity and the CKA, we demonstrated that BOIL trains a model to follow the feature reuse scheme on the low- and mid-level body but the rapid learning scheme on the high-level body. We further explored the crucial factor in the head and whether learning the head of the model helps the optimization of MAML and BOIL. It was observed that a model without outer updates of the head, with an orthonormal initialization, achieves better performance than the original model in MAML, whereas the opposite occurs in BOIL. This indicates that MAML does not use the head of the model correctly, while BOIL takes advantage of the learned head. Based on these analyses, we validated the BOIL algorithm on various data sets, including miniImageNet, tieredImageNet, CUB, and Cars, and on cross-domain adaptation, using a standard 4conv network and ResNet-12. The experimental results showed significant improvement over MAML, particularly on cross-domain adaptation, implying that rapid learning approaches should be considered for adaptation to unseen tasks. We hope our study inspires rapid learning in gradient-based meta-learning approaches.

Broader Impact
We expect that our work can open a new horizon in the gradient-based meta-learning field. First of all, our study of the optimal meta-initialization, which is entirely different from the conventional optimal point, gives meta-learning researchers inspiration to design new algorithms or analyze existing ones. Furthermore, under data shortage or cross-domain adaptation, our rapid learning-based algorithm shines. However, our approach is a first work on rapid learning and focuses on classification tasks. Hence, more studies are needed to develop and generalize the property of rapid learning.
References

[1] Luca Bertinetto, Joao F. Henriques, Philip H. S. Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.
[2] John Cai and Sheng Mei Shen. Cross-domain few-shot learning with meta fine-tuning. arXiv preprint arXiv:2005.10544, 2020.
[3] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. arXiv preprint arXiv:1904.04232, 2019.
[4] Tristan Deleu, Tobias Würfl, Mandana Samiei, Joseph Paul Cohen, and Yoshua Bengio. Torchmeta: A meta-learning library for PyTorch, 2019. Available at: https://github.com/tristandeleu/pytorch-meta.
[5] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1126–1135. JMLR.org, 2017.
[6] Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pages 9516–9527, 2018.
[7] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. arXiv preprint arXiv:1801.08930, 2018.
[8] Yunhui Guo, Noel C. F. Codella, Leonid Karlinsky, John R. Smith, Tajana Rosing, and Rogerio Feris. A new benchmark for evaluation of cross-domain few-shot learning. arXiv preprint arXiv:1912.07200, 2019.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[10] Nathan Hilliard, Lawrence Phillips, Scott Howland, Artëm Yankov, Courtney D. Corley, and Nathan O. Hodas. Few-shot learning with metric-agnostic conditional embeddings. arXiv preprint arXiv:1802.04376, 2018.
[11] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[13] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, Lille, 2015.
[14] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. arXiv preprint arXiv:1905.00414, 2019.
[15] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
[16] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[17] Hae Beom Lee, Hayeon Lee, Donghyun Na, Saehoon Kim, Minseop Park, Eunho Yang, and Sung Ju Hwang. Learning to balance: Bayesian meta-learning for imbalanced and out-of-distribution tasks. arXiv preprint arXiv:1905.12917, 2019.
[18] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10657–10665, 2019.
[19] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
[20] Donghyun Na, Hae Beom Lee, Saehoon Kim, Minseop Park, Eunho Yang, and Sung Ju Hwang. Learning to balance: Bayesian meta-learning for imbalanced and out-of-distribution tasks. arXiv preprint arXiv:1905.12917, 2019.
[21] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
[22] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
[23] Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pages 721–731, 2018.
[24] Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In International Conference on Learning Representations, 2020.
[25] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
[26] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B. Tenenbaum, Hugo Larochelle, and Richard S. Zemel. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676, 2018.
[27] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[28] Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.
[29] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
[30] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. Meta-transfer learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 403–412, 2019.
[31] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.
[32] Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Utku Evci, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, et al. Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096, 2019.
[33] Hung-Yu Tseng, Hsin-Ying Lee, Jia-Bin Huang, and Ming-Hsuan Yang. Cross-domain few-shot classification via learned feature-wise transformation. arXiv preprint arXiv:2001.08735, 2020.
[34] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
[35] Risto Vuorio, Shao-Hua Sun, Hexiang Hu, and Joseph J. Lim. Multimodal model-agnostic meta-learning via task-aware modulation. In Advances in Neural Information Processing Systems, pages 1–12, 2019.
[36] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[37] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
[38] Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pages 7332–7342, 2018.
[39] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[40] Luisa M. Zintgraf, Kyriacos Shiarlis, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast context adaptation via meta-learning. arXiv preprint arXiv:1810.03642, 2018.
A Implementation Details
A.1 n-way k-shot setting

We experimented in the 5-way 1-shot, 5-way 5-shot, and 5-way 20-shot settings; the number of shots is marked in parentheses in the algorithm-name column of all tables. During meta-training, models are inner-loop updated only once, and the meta-batch size for the outer loop is set to 4. During meta-testing, the number of task-specific (inner loop) updates is the same as in meta-training. All models are trained for 30,000 iterations, and all the reported results are based on the model at the last epoch.
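A sketch of how an n-way k-shot task can be sampled is given below. The 15-query-per-class figure matches the query set described in Section 3.1; the function and variable names are ours, and images_by_class is assumed to map each class to a tensor of its images.

    import random
    import torch

    def sample_task(images_by_class, n_way=5, k_shot=5, n_query=15):
        classes = random.sample(list(images_by_class), n_way)      # n sampled labels
        support_x, support_y, query_x, query_y = [], [], [], []
        for new_label, c in enumerate(classes):                    # class indices are re-assigned per task
            idx = torch.randperm(len(images_by_class[c]))[: k_shot + n_query]
            imgs = images_by_class[c][idx]
            support_x.append(imgs[:k_shot]);  support_y += [new_label] * k_shot
            query_x.append(imgs[k_shot:]);    query_y += [new_label] * n_query
        return (torch.cat(support_x), torch.tensor(support_y),
                torch.cat(query_x), torch.tensor(query_y))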
A.2 Model implementations
In our experiments, we employ the 4conv network and ResNet-12 for the MAML and BOIL algorithms. The 4conv network consists of 4 convolution modules, each composed of a 3 × 3 convolutional layer, a batch normalization layer [12], a ReLU activation [37], and a 2 × 2 max-pooling layer. α_b and α_h are the learning rates of the body and the head of the model during inner loops, and β_b and β_h are the learning rates of the body and the head of the model during outer loops.

Table 5: Learning rates according to the algorithms.

Model       α_b    α_h    β_b    β_h
MAML
BOIL
MAML-fix
BOIL-fix
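A sketch of one such convolution module in PyTorch; the 64-channel setting follows the main-text configuration, and this is our illustration of the standard module, not the authors' code.

    import torch.nn as nn

    def conv_module(in_ch, out_ch=64):
        # one of the four modules of the 4conv body
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )

    # 4conv body followed by a linear head (5-way classifier);
    # the flattened feature size 64 * 5 * 5 assumes 84 x 84 inputs
    body = nn.Sequential(conv_module(3), conv_module(64), conv_module(64), conv_module(64), nn.Flatten())
    head = nn.Linear(64 * 5 * 5, 5)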
A.3 Data sets

We validate the BOIL and MAML algorithms on several data sets, considering image size and fineness. Table 6 summarizes the data sets used.

Table 6: Summary of data sets.

Data set     miniImageNet     tieredImageNet    CUB         Cars
Source       ImageNet [27]    ImageNet [27]     CUB [36]    Cars [15]
Image size   84 × 84          84 × 84           84 × 84     84 × 84

Data set     CIFAR-FS         FC100             Aircraft        VGG-Flower
Source       CIFAR-100 [16]   CIFAR-100 [16]    Aircraft [19]   VGG-Flower [22]
Image size   32 × 32          32 × 32           32 × 32         32 × 32

B Over-fitting issue
Figure 8: Train accuracy (%) curves of the fix algorithm on Cars.

Figure 8 shows the train accuracy curves corresponding to Figure 5b. We confirm that MAML, MAML-fix, and BOIL are over-fitted from the early epochs, whereas BOIL-fix over-fits more slowly than the others. However, the degradation caused by over-fitting is much larger in the original algorithms, i.e., MAML and BOIL, than in the fix algorithms, i.e., MAML-fix and BOIL-fix. This implies that over-fitting of the head has a greater impact on performance degradation than over-fitting of the body.
C Results of 4conv network (32-32-32-32)
In the related papers [5, 24], a 4conv network with 32 filters was used to avoid the over-fitting issue. We chose 64 filters in the main paper because the models trained by BOIL are not over-fitted. Nevertheless, Table 7 shows that BOIL outperforms MAML when the 4conv network has 32 filters.

Table 7: Test accuracy (%) of the 4conv network (32 filters) on the benchmark data sets. The values in parentheses are the number of shots.
Meta-train    miniImageNet                                      CUB
Meta-test     miniImageNet   tieredImageNet   CUB     CUB    miniImageNet   Cars
MAML(1)       46.20 ±

D Results on Other Data Sets
We applied our algorithm to other data sets with an image size of 32 × 32. Similar to the analyses in Section 4, these data sets can be divided into two general data sets, CIFAR-FS [1] and FC100 [23], and two specific data sets, Aircraft [19] and VGG-Flower [22]. Table 8, Table 9, and Table 10 generally show the superiority of BOIL even when the image size is extremely small.

Table 8: Test accuracy (%) of the 4conv network on the benchmark data sets. The values in parentheses are the number of shots.
Domain     General (Coarse-grained)    Specific (Fine-grained)
Dataset    CIFAR-FS     FC100          Aircraft    VGG-Flower
MAML(1)    55.88 ±

Table 9: Test accuracy (%) of the 4conv network on cross-domain adaptation. The values in parentheses are the number of shots.

Adaptation   general to general    general to specific       specific to general    specific to specific
Meta-train   FC100      CIFAR-FS   CIFAR-FS   CIFAR-FS       Aircraft   Aircraft    VGG-Flower   Aircraft
Meta-test    CIFAR-FS   FC100      Aircraft   VGG-Flower     CIFAR-FS   FC100       Aircraft     VGG-Flower
MAML(1)

Table 10: 5-way 5-shot test accuracy (%) of ResNet-12. Here, lsc denotes the last skip connection.
Meta-train    CIFAR-FS                           Aircraft
Meta-test     CIFAR-FS   FC100   Aircraft     Aircraft   CIFAR-FS   VGG-Flower
MAML w/ lsc   74.38 ±