Meta-Learning with Network Pruning

Hongduan Tian, Bo Liu, Xiao-Tong Yuan, and Qingshan Liu
B-DAT Lab, Nanjing University of Information Science and Technology, Nanjing, 210044, China
JD Finance America Corporation, Mountain View, CA 94043, USA
{hongduan.tian, kfliubo, xtyuan1980}@gmail.com, [email protected]

Abstract.
Meta-learning is a powerful paradigm for few-shot learning. Although remarkable success has been witnessed in many applications, the existing optimization-based meta-learning models with over-parameterized neural networks have been evidenced to overfit on training tasks. To remedy this deficiency, we propose a network-pruning-based meta-learning approach for overfitting reduction via explicitly controlling the capacity of the network. A uniform concentration analysis reveals the benefit of the network capacity constraint for reducing the generalization gap of the proposed meta-learner. We have implemented our approach on top of Reptile assembled with two network pruning routines: Dense-Sparse-Dense (DSD) and Iterative Hard Thresholding (IHT). Extensive experimental results on benchmark datasets with different over-parameterized deep networks demonstrate that our method not only effectively alleviates meta-overfitting but also in many cases improves the overall generalization performance when applied to few-shot classification tasks.
Keywords:
Meta-Learning; Few-shot Learning; Network Pruning; Sparsity; Generalization Analysis.
The ability to adapt to a new task within a few trials is essential for artificial agents. The goal of few-shot learning [28] is to build a model which is able to get the knack of a new task with limited training samples. Meta-learning [29,3,33] provides a principled way to cast few-shot learning as the problem of learning-to-learn, which typically trains a hypothesis or learning algorithm to memorize the experience from previous tasks for learning a future task with very few samples. The practical importance of meta-learning has been witnessed in many vision and online/reinforcement learning applications including image classification [24,18], multi-armed bandits [32] and 2D navigation [5].

Among others, one particularly simple yet successful meta-learning paradigm is first-order optimization-based meta-learning, which aims to train hypotheses that can quickly adapt to unseen tasks by performing one or a few steps of (stochastic) gradient descent [24,5]. Reasons for the recent increasing attention to this class of gradient-optimization-based methods include their outstanding efficiency and scalability exhibited in practice [22].
Challenge and motivation.
A challenge in the existing meta-learning approaches is their tendency to overfit [21,36]. When training an over-parameterized meta-learner such as a very deep and/or wide convolutional neural network (CNN), which is powerful for representation learning, there are two sources of potential overfitting at play: the inter-task overfitting of the meta-learner (or meta-overfitting) to the training tasks and the inner-task overfitting of the task-specific learner to the task training data. There have been recent efforts devoted to dealing with inner-task overfitting [17,39]. The study of inter-task meta-overfitting, however, still remains underexplored. Since in principle optimization-based meta-learning is designed to learn fast from small amounts of data in new tasks, we expect meta-overfitting to play a more important role in influencing the overall generalization performance of the trained meta-learner.

Sparsity modeling is a promising tool for high-dimensional machine learning with guaranteed statistical efficiency and robustness to overfitting [20,37,1]. It has been theoretically and numerically justified by [2] that sparsity benefits considerably the generalization performance of deep neural networks. In the regime of compact deep learning, the so-called network pruning technique has been widely studied and evidenced to work favorably in generating sparse subnetworks without compromising generalization performance [6,11,7]. Inspired by this remarkable success of sparsity models, it is natural to conjecture that sparsity would also be beneficial for enhancing the robustness of optimization-based meta-learning to meta-overfitting.
Our contribution.
In this paper, we present a novel gradient-based meta-learning approach with an explicit network capacity constraint for overfitting reduction. The problem is formulated as learning a sparse meta-initialization network from training tasks such that in a new task the learned subnetwork can quickly converge to the optimal solution via gradient descent. The core idea is to reduce meta-overfitting by controlling the count of non-zero parameters in the meta-learner during the training phase. Theoretically, we have established a uniform generalization gap bound for the proposed sparse meta-learner showing the benefit of the capacity constraint for improving its generalization performance. Practically, we have implemented our approach in a joint algorithmic framework of Reptile [22] with network pruning, along with two instantiations using Dense-Sparse-Dense (DSD) [7] and Iterative Hard Thresholding (IHT) [11] as the network pruning subroutines, respectively. The actual performance of our approach has been extensively evaluated on few-shot classification tasks with over-parameterized wide CNNs. The obtained results demonstrate that our method can effectively alleviate overfitting and achieve similar or even superior generalization performance compared to the conventional dense models.
Optimization-based meta-learning.
The family of optimization-based meta-learning approaches typically learns a good hypothesis that can be quickly adapted to unseen tasks [24,5,22,12]. Compared to metric-based [13,30] and memory-based [35,28] meta-learning algorithms, optimization-based meta-learning algorithms are gaining increasing attention due to their simplicity, versatility and effectiveness. As a recent leading framework for optimization-based meta-learning, MAML [5] is designed to estimate a meta-initialization network which can be well fine-tuned in an unseen task via only one or a few steps of minibatch gradient descent. Although simple in principle, MAML requires computing Hessian-vector products for back-propagation, which can be computationally expensive when the model is large. The first-order MAML (FOMAML) was therefore proposed to improve the computational efficiency by simply ignoring the second-order derivatives in MAML. Reptile [22] is another approximate first-order algorithm which works favorably since it maximizes the inner product between gradients from the same task yet different minibatches, leading to improved model generalization. Recently, several hypothesis-biased regularized meta-learning approaches have been studied in [4,12,38] with provably strong generalization performance guarantees for convex problems. In [17], the meta-learner is treated as a feature embedding module whose output is used as input to train a multi-class kernel support vector machine as the base learner. To deal with overfitting, the CAVIA method [39] decomposes the meta-parameters into so-called context parameters and shared parameters. The context parameters are updated for task adaptation with limited capacity, while the shared parameters are meta-trained for generalization across tasks.
Network pruning.
Early network weight pruning algorithms date back to Optimal Brain Damage [16] and Optimal Brain Surgeon [10]. A dense-to-sparse algorithm was developed by [8] to first remove near-zero weights and then fine-tune the preserved weights. As a follow-up to dense-to-sparse training, the dense-sparse-dense (DSD) method [7] was proposed to re-initialize the pruned parameters to zero and retrain the entire network after the dense-to-sparse pruning phase. The iterative hard thresholding (IHT) method [11] shares a similar spirit with DSD and conducts multiple rounds of iteration between pruning and retraining. [31] proposed a data-free method to prune the neurons in a trained network. In [19], an L0-norm regularized risk minimization framework was proposed to learn sparse networks during training. More recently, [6] introduced and studied the "lottery ticket hypothesis", which assumes that once a network is initialized, there exists an optimal subnetwork, which can be found by pruning, that performs as well as the original network or even better.

Despite the remarkable success achieved by both meta-learning and network pruning, it remains largely open to investigate the impact of network pruning on alleviating the meta-overfitting of optimization-based meta-learning, which is of primary interest to our study in this paper.

We consider the N-way K-shot problem as defined in [34]. Tasks are sampled from a specific distribution p(T) and are divided into a meta-training set S_tr, a meta-validation set S_val, and a meta-testing set S_test. Classes in the different sets are disjoint (i.e., a class in S_tr will not appear in S_test). During training, each task is made up of a support set D_supp and a query set D_query. Both D_supp and D_query are sampled from the same classes of S_tr; D_supp is used for training while D_query is used for evaluation. For an N-way K-shot classification task, we sample N out of the C classes from the dataset, and then K samples are drawn from each of these classes to form D_supp, namely D_supp = {(x_c^k, y_c^k) : k = 1, ..., K; c = 1, ..., N}. For example, for a 5-way 2-shot task, we sample 2 data-label pairs from each of 5 classes, so such a task has 10 samples. Usually, several other samples from the same classes are drawn to compose D_query; for example, D_query is used in Reptile [22] in the evaluation steps. We use the loss function ℓ(v, y) to measure the discrepancy between the predicted score vector v ∈ R^C and the true label y ∈ {1, ..., C}.
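To make the task construction above concrete, the snippet below sketches N-way K-shot episode sampling in plain Python/NumPy. It is a minimal illustration only: the in-memory dataset layout (a dict mapping class labels to arrays of examples), the default query size and the function and argument names are assumptions, not details taken from the original setup.

import numpy as np

def sample_episode(class_to_examples, n_way=5, k_shot=1, query_size=15, rng=None):
    """Sample one N-way K-shot task: a support set D_supp and a query set D_query."""
    rng = rng or np.random.default_rng()
    # Pick N of the available classes, then K support and `query_size` query examples per class.
    classes = rng.choice(list(class_to_examples.keys()), size=n_way, replace=False)
    support_x, support_y, query_x, query_y = [], [], [], []
    for new_label, c in enumerate(classes):
        examples = class_to_examples[c]
        idx = rng.choice(len(examples), size=k_shot + query_size, replace=False)
        support_x.append(examples[idx[:k_shot]])
        support_y += [new_label] * k_shot
        query_x.append(examples[idx[k_shot:]])
        query_y += [new_label] * query_size
    return (np.concatenate(support_x), np.array(support_y),
            np.concatenate(query_x), np.array(query_y))

For a 5-way 2-shot task with query_size = 15, this returns 10 support samples and 75 query samples, matching the construction described above.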
Notation. For an integer n, we denote by [n] the index set {1, ..., n}. We use ⊙ to denote the element-wise product operator. We say a function g: R^p → R is G-Lipschitz continuous if |g(θ) − g(θ′)| ≤ G‖θ − θ′‖, and g is H-smooth if it obeys ‖∇g(θ) − ∇g(θ′)‖ ≤ H‖θ − θ′‖.

Our ultimate goal is to learn a good initialization of the parameters of a convolutional neural network f_θ: X → Y, where θ is the set of model parameters, from a set of training tasks, such that the learned initialization network generalizes well to future unseen tasks. Inspired by the recent remarkable success of MAML [5] and the strong generalization capability of sparse deep learning models [6,2], we propose to learn from previous task experience, during a sparse (network pruning) phase, a sparse subnetwork starting from which the future task-specific networks can be efficiently learned using first-order optimization methods. To this end, we introduce the following layer-wise sparsity-constrained stochastic first-order meta-learning formulation:

min_θ R(θ) := E_{T∼p(T)} [ L_{D_T^query}( θ − η ∇_θ L_{D_T^supp}(θ) ) ],   s.t.  ‖θ_l‖_0 ≤ k_l,  l ∈ [L],     (1)

where L_{D_T^supp}(θ) = (1/(NK)) Σ_{(x_c^k, y_c^k) ∈ D_T^supp} ℓ(f_θ(x_c^k), y_c^k) is the empirical risk for task T, L_{D_T^query}(θ) is similarly defined as the loss evaluated over the query set, and η is the learning rate. In the constraint, ‖θ_l‖_0 denotes the number of non-zero entries in the parameters θ_l of the l-th layer, which is required to be no larger than a user-specified sparsity level k_l, and L is the total number of network layers.

In general, the mathematical form of the task distribution p(T) is unknown, but we usually have access to a set of i.i.d. training tasks S = {T_i}_{i=1}^M sampled from p(T). Thus the following empirical version of the population formulation in equation (1) is alternatively considered for training:

min_θ R_S(θ) := (1/M) Σ_{i=1}^M [ L_{D_{T_i}^query}( θ − η ∇_θ L_{D_{T_i}^supp}(θ) ) ],   s.t.  ‖θ_l‖_0 ≤ k_l,  l ∈ [L].     (2)

Compared with MAML, our model shares an identical objective function, but with the layer-wise sparsity constraints ‖θ_l‖_0 ≤ k_l imposed for the purpose of enhancing the learnability of the over-parameterized meta-initialization network. In view of the "lottery ticket hypothesis" [6], the model in equation (2) can be interpreted as a first-order meta-learner for estimating a subnetwork, or a "winning ticket", for future task learning. Inspired by the strong statistical efficiency and generalization guarantees of sparsity models [37,2], we will shortly show that such a subnetwork is able to achieve advantageous generalization performance over the dense initialization networks learned by vanilla MAML.
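The two core quantities in equations (1) and (2) — the one-step adapted parameters θ − η∇_θ L_supp(θ) and the query loss evaluated at them — can be written compactly as follows. This is a minimal PyTorch sketch under the assumption of a recent PyTorch version (torch.func available) and a cross-entropy task loss; it only illustrates the objective, while the sparsity constraint itself is enforced by the pruning step of Algorithm 1 sketched later.

import torch
import torch.nn.functional as F

def adapted_query_loss(model, support, query, inner_lr):
    """L_query(θ − η ∇θ L_supp(θ)) for one task, cf. equations (1)-(2)."""
    xs, ys = support
    xq, yq = query
    params = list(model.parameters())
    support_loss = F.cross_entropy(model(xs), ys)                 # L_supp(θ)
    grads = torch.autograd.grad(support_loss, params)
    adapted = [p - inner_lr * g for p, g in zip(params, grads)]   # θ − η ∇θ L_supp(θ)
    named = {name: w for (name, _), w in zip(model.named_parameters(), adapted)}
    logits = torch.func.functional_call(model, named, (xq,))
    return F.cross_entropy(logits, yq)                            # query loss at the adapted parameters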
We provide in this section a task-level generalization performance analysis for the proposed model in equation (2). Let p be the total number of parameters in the over-parameterized network and Θ ⊆ R^p be the domain of interest for θ. Let k = Σ_{l=1}^L k_l be the total desired sparsity level of the subnetwork. The following uniform concentration bound is our main result.

Theorem 1. Assume that the domain of interest Θ is bounded by R and the loss function ℓ(f_θ(x), y) is G-Lipschitz continuous and H-smooth with respect to θ. Suppose that 0 ≤ ℓ(f_θ(x), y) ≤ B for all pairs {f_θ(x), y}. Then for any δ ∈ (0, 1), with probability at least 1 − δ over the random draw of S, the generalization gap is uniformly upper bounded for all θ satisfying ‖θ_l‖_0 ≤ k_l, l ∈ [L], as

|R(θ) − R_S(θ)| ≤ O( B √( [ k log( p√M GR(1 + ηH) / (Bk) ) + log(1/δ) ] / M ) ).

In comparison to the O(√(p/M)) uniform bound established in Lemma 1 (see Appendix A.1) for dense networks, the uniform bound established in Theorem 1 is substantially stronger when k ≪ p, which shows the benefit of the network capacity constraint for generalization.

Specially, for margin-based multiclass classification, let us consider the margin operator M(v, y) := max_j [v]_j − [v]_y associated with the score prediction vector v ∈ R^C and label y ∈ {1, ..., C}. Let ℓ_γ(f_θ(x), y) = h_γ(M(f_θ(x), y)) be a surrogate of the binary loss (i.e., 1[y ≠ argmax_j [f_θ(x)]_j]) defined with respect to a proper γ-margin-based loss h_γ such as the hinge/ramp losses and their smoothed variants [23]. By definition, we must have 1[y ≠ argmax_j [f_θ(x)]_j] ≤ ℓ_γ(f_θ(x), y). In this case, we denote by R_{γ,S} the meta-training risk with loss function ℓ_γ and by R̃_γ the corresponding population risk in which the task-level query loss L_{D_T^query} is evaluated using the binary loss as the classification error. Then, as a direct consequence of Theorem 1, we can establish the following result for margin-based prediction.

Corollary 1.
Suppose that the margin-based loss ℓ_γ is used for model training. Then under the conditions in Theorem 1, for any δ ∈ (0, 1), with probability at least 1 − δ the following bound holds for all θ satisfying ‖θ_l‖_0 ≤ k_l, l ∈ [L]:

R̃_γ(θ) ≤ R_{γ,S}(θ) + O( B √( [ k log( p√M GR(1 + ηH) / (Bk) ) + log(1/δ) ] / M ) ).
Remark 1.
We comment that the above O(√(k/M)) margin bound derived in the context of sparse meta-learning can be readily extended to sparse deep network training. Also, the bound can be easily generalized to arbitrary convex surrogates (e.g., the cross-entropy loss) of the binary loss under proper regularity conditions.

We have implemented the proposed model in equation (2) on top of Reptile [22] (see Algorithm 2), which is a scalable method for optimization-based meta-learning of the form of equation (2) but without the layer-wise sparsity constraints. In order to handle the sparsity constraints, we follow the principles behind the widely applied dense-sparse-dense (DSD) [7] and iterative hard thresholding (IHT) [11] network pruning algorithms and alternate the Reptile iteration between pruning insignificant weights in each layer and retraining the pruned network.
The algorithm of our network-pruning-based Reptile method is outlined in Algorithm 1. The learning procedure contains a pre-training phase followed by an iterative procedure of network pruning and retraining. We would like to stress that, since our ultimate goal is not network compression but rather reducing meta-overfitting via controlling the sparsity level of the meta-initialization network, the final output of our algorithm is typically dense after the retraining phase, which has been evidenced in practice to be effective for improving the generalization performance during the testing phase. In the following subsections, we describe the key components of our algorithm in detail.
Model Pretraining
For model pre-training, we run a number of Reptile iteration rounds to generate a relatively good initialization. In each loop of the Reptile iteration, we first sample a mini-batch of meta-tasks {T_i}_{i=1}^s from the task distribution p(T). Then for each task T_i, we compute the adapted parameters via (stochastic) gradient descent as θ̃_{T_i} = θ^(0) − η ∇_θ L_{D_{T_i}^supp}(θ^(0)), where θ̃_{T_i} denotes the task-specific parameters learned from task T_i, θ^(0) is the current initialization of the model parameters, η is the inner-task learning rate, and D_{T_i}^supp denotes the support set of task T_i. When all the task-specific parameters have been updated, the initialization parameters are updated according to θ^(0) = θ^(0) + β( (1/s) Σ_{i=1}^s θ̃_{T_i} − θ^(0) ) with learning rate β. Here we follow Reptile and use (1/s) Σ_{i=1}^s θ̃_{T_i} − θ^(0) as an approximation to the negative meta-gradient, which has been evidenced to be effective for scaling up MAML-type first-order meta-learning models [22].
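The pre-training iteration described above — per-task adaptation followed by the interpolation step θ^(0) ← θ^(0) + β((1/s)Σ_i θ̃_{T_i} − θ^(0)) — can be sketched as follows. This is an illustrative PyTorch sketch rather than the authors' code: it assumes each task is given as a (support_x, support_y) batch, uses plain SGD for the inner loop, and defaults to the inner learning rate of 0.001 reported in Appendix B.

import copy
import torch
import torch.nn.functional as F

def reptile_step(model, tasks, inner_lr=1e-3, outer_lr=1.0, inner_steps=8):
    """One Reptile outer iteration over a mini-batch of s tasks."""
    init = [p.detach().clone() for p in model.parameters()]
    adapted_sum = [torch.zeros_like(p) for p in init]
    for support_x, support_y in tasks:
        task_model = copy.deepcopy(model)                    # start each task from θ(0)
        opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                         # inner-task gradient descent
            opt.zero_grad()
            F.cross_entropy(task_model(support_x), support_y).backward()
            opt.step()
        for acc, p in zip(adapted_sum, task_model.parameters()):
            acc += p.detach()                                # accumulate θ̃_{T_i}
    with torch.no_grad():                                    # θ(0) ← θ(0) + β((1/s)Σ θ̃ − θ(0))
        for p, p0, acc in zip(model.parameters(), init, adapted_sum):
            p.copy_(p0 + outer_lr * (acc / len(tasks) - p0))
    return model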
Iterative Network Pruning and Retraining

After model pre-training, we proceed to the main loop of our Algorithm 1, which carries out iterative network pruning and retraining.
Algorithm 1: Reptile with Iterative Network Pruning

Input: inner loop learning rate η, outer loop learning rate β, layer-wise sparsity levels {k_l}_{l=1}^L, mini-batch size s for meta training.
Output: θ^(t).
Initialization: Randomly initialize θ^(0).
/* Pre-training with Reptile */
while the termination condition is not met do
    θ^(0) = Reptile(θ^(0), η, β, s);
end
for t = 1, 2, ... do
    /* Pruning phase */
    Generate a network zero-one mask M^(t) whose non-zero entries at each layer l are the top k_l entries of θ_l^(t);
    Compute θ_M^(t) = θ^(t) ⊙ M^(t);
    /* Subnetwork fine-tuning with Reptile */
    while the termination condition is not met do
        θ^(t) = Reptile(θ_M^(t), η, β, s);
    end
    /* Retraining phase */
    while the termination condition is not met do
        θ^(t) = Reptile(θ^(t), η, β, s);
    end
end
Pruning phase. In this phase, we first greedily truncate out of the model a portion of near-zero parameters which are unlikely to contribute significantly to the model performance. To do so, we generate a network binary mask M^(t) whose non-zero entries at each layer l are the top k_l (in magnitude) entries in θ_l^(t), and compute θ_M^(t) = θ^(t) ⊙ M^(t) as the sparse restriction of θ^(t). Then we fine-tune the subnetwork over the mask M^(t) by applying Reptile restrictively to this subnetwork with initialization θ_M^(t). Our numerical experience suggests that sufficient steps of subnetwork fine-tuning tend to substantially improve the stability and convergence behavior of the method.

The fine-tuned subnetwork θ_M^(t) at the end of the pruning phase is expected to reduce the chance of overfitting to noisy data. However, such a subnetwork also has reduced capacity, which could in turn lead to potentially biased learning with a higher training loss. To remedy this issue, inspired by the retraining trick introduced in [7] for network pruning, we propose to restore the pruned weights that would be beneficial for enhancing the model representation power, so as to improve the overall generalization performance.
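A sketch of the masking step θ_M^(t) = θ^(t) ⊙ M^(t) described above: for each layer, keep the k_l largest-magnitude entries and zero out the rest. Returning the masks allows them to be re-applied during the subnetwork fine-tuning stage so that pruned weights stay at zero; the helper name and the dict-based return value are illustrative assumptions.

import torch

def layerwise_magnitude_masks(model, sparsity_levels):
    """Build zero-one masks M^(t) keeping the top-k_l weights of each layer, and apply θ ⊙ M in place."""
    masks = {}
    with torch.no_grad():
        for (name, p), k_l in zip(model.named_parameters(), sparsity_levels):
            flat = p.abs().flatten()
            k_l = min(int(k_l), flat.numel())
            threshold = flat.topk(k_l).values.min()          # k_l-th largest magnitude
            mask = (p.abs() >= threshold).to(p.dtype)        # ties may keep slightly more entries
            p.mul_(mask)                                     # θ_M^(t) = θ^(t) ⊙ M^(t)
            masks[name] = mask
    return masks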
Retraining phase. In this phase, the layer-wise sparsity constraints are removed and the pruned parameters are re-activated for fine-tuning. The retraining procedure is almost identical to the pre-training phase, with the main difference that the former is initialized with the subnetwork generated by the pruning phase while the latter uses random initialization. Such a retraining operation restores the representation capacity of the pruned parameters, which tends to lead to improved generalization performance in practice. For theoretical justification, roughly speaking, since the sparse meta-initialization network obtained in the pruning phase generalizes well in light of Theorem 1, it is expected to serve as a good initialization for the subsequent retraining via gradient descent. Then, according to the stability theory of gradient descent methods [9], the output dense network will also generalize well if the retraining phase converges quickly.
Algorithm 2: Reptile Algorithm [22]

Input: model parameters φ, inner loop learning rate η, outer loop learning rate β, mini-batch size s for meta training.
Output: the updated φ.
Sample a mini-batch of tasks {T_i}_{i=1}^s of size s;
For each task T_i, compute the task-specific adapted parameters using gradient descent: φ̃_{T_i} = φ − η ∇_φ L_{D_{T_i}^supp}(φ);
Update the parameters: φ = φ + β( (1/s) Σ_{i=1}^s φ̃_{T_i} − φ ).

Reptile with DSD pruning. The DSD method is an effective network pruning approach for preventing the learned model from capturing noise during training [7]. By implementing the main loop with t = 1, the proposed Algorithm 1 reduces to a DSD-based Reptile method for first-order meta-learning.

Reptile with IHT pruning.
The IHT method [11] is another representative network pruning approach which shares a similar dense-sparse-dense spirit with DSD. Different from the one-shot weight pruning and retraining of DSD, IHT is designed to perform multiple rounds of iteration between pruning and retraining, and hence is expected to have a better chance of finding an optimal sparse subnetwork than DSD does. By implementing the main loop of Algorithm 1 with t > 1, we obtain a variant of Reptile with IHT-type network pruning.
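Putting the pieces together, the following sketch mirrors the overall structure of Algorithm 1, reusing the reptile_step and layerwise_magnitude_masks helpers from the earlier sketches. With num_rounds = 1 it corresponds to the DSD-style variant and with num_rounds > 1 to the IHT-style variant; the default iteration counts, the task-sampling callback and the re-application of the mask after every update are illustrative assumptions rather than the exact training schedule used in the experiments.

import torch

def reptile_with_pruning(model, sample_task_batch, sparsity_levels,
                         pretrain_iters=1000, prune_iters=1500,
                         retrain_iters=500, num_rounds=1):
    """Pre-train densely, then alternate masked fine-tuning with dense retraining."""
    for _ in range(pretrain_iters):                          # pre-training with Reptile
        reptile_step(model, sample_task_batch())
    for _ in range(num_rounds):                              # num_rounds = 1: DSD-style; > 1: IHT-style
        masks = layerwise_magnitude_masks(model, sparsity_levels)   # pruning phase
        for _ in range(prune_iters):                         # subnetwork fine-tuning
            reptile_step(model, sample_task_batch())
            with torch.no_grad():                            # keep pruned entries at zero
                for name, p in model.named_parameters():
                    p.mul_(masks[name])
        for _ in range(retrain_iters):                       # dense retraining phase
            reptile_step(model, sample_task_batch())
    return model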
In this section, we carry out a numerical study for algorithm performance evaluation, aiming to answer the following three questions empirically:
(Q1) Section 5.1: Does our method contribute to improving the generalization performance?
(Q2) Section 5.2: What roles do the pre-training phase and the retraining phase play in our method?
(Q3) Section 5.3: Can our method work on more complex models?
We first evaluate the prediction performance of our method on few-shot classification tasks on two popular benchmark datasets: MiniImageNet [34] and TieredImageNet [25]. We have also evaluated our method on Omniglot [15], with numerical results relegated to Appendix C.1 due to the space limit. The network used in our experiments is consistent with that considered for Reptile [22]. We test with varying channel numbers {32, 64, 128, 256} in each convolution layer to show the robustness of our algorithms to meta-overfitting. See Appendix B for more details about the model, datasets and hyperparameters.

MiniImageNet

The MiniImageNet dataset consists of 64 training classes, 12 validation classes and 24 test classes. For DSD-based Reptile with 32 channels, we set the iteration numbers for the pre-training, pruning and retraining phases respectively as 3 ×, 5 × and 2 ×, while with 64, 128 and 256 channels the corresponding numbers are 3 ×, 6 × and 10, respectively. For IHT-based Reptile model training, we first pre-train the model for 2 × iterations. Then we iterate between sparse-model fine-tuning (with 1.5 × 10^4 iterations) and dense-model retraining (with 5 × 10^3 iterations) for t = 4 rounds. The setting of the other model-training-related parameters is identical to those in [22].

Table 1. Results on MiniImageNet under varying numbers of channels and pruning rates. Columns: Methods, Backbone, Pruning Rate, 5-way 1-shot accuracy, and 5-way 5-shot accuracy (the "±" entries give 95% confidence intervals). The compared methods are the Reptile baseline (0% pruning) and DSD+Reptile and IHT+Reptile at pruning rates between 20% and 60%, each with 32-, 64-, 128- and 256-channel backbones.
Results. The experimental results are presented in Table 1, and some additional results are provided in Table 8 in Appendix C. From these results, we can observe that our methods consistently outperform the considered baselines. The key observations are highlighted below:
– In the 32-channel setting, in which the model is less prone to overfit, applying DSD-based Reptile with a 40% pruning rate still yields an accuracy gain, and our IHT-based Reptile approach respectively improves accuracy by about 1.5% and by more than 1%.
– In the 128-channel setting, the accuracies of our approaches are all higher than those of CAVIA, and the highest accuracy gain is over 3%.
Fig. 1. The generalization performance of DSD-based and IHT-based Reptile on 5-way 1-shot and 5-way 5-shot tasks under the 64-channel setting. Left: DSD-based Reptile; Right: IHT-based Reptile.
It is also worth noting from Table 1 that the accuracy of our algorithms tends to increase as the channel size increases, while the baselines behave oppositely. Although in the 256-channel case the performance of the IHT-based approach drops compared with the 128-channel setting, it still achieves a small accuracy gain over the baseline on both 5-way 1-shot and 5-way 5-shot tasks.

TieredImageNet

The TieredImageNet dataset consists of 351 training classes, 97 validation classes and 160 test classes. For the TieredImageNet dataset [25], in the DSD-based Reptile case, we set the iteration numbers for the pre-training, pruning and retraining phases respectively as 3 ×, 5 × and 2 × for all cases. For IHT-based Reptile, the iteration numbers are the same as those used in the previous experiments on MiniImageNet.

Results. The experimental results are partly presented in Table 2; more experimental results are available in Table 9 in Appendix C. On 5-way 1-shot classification tasks, both the DSD-based Reptile approach and the IHT-based Reptile approach outperform the baselines in all cases. In the 32-channel setting, with the DSD-based Reptile approach, the improvement in accuracy is 0.42% compared with the baseline. In the 64-channel setting, DSD-based Reptile and IHT-based Reptile achieve improvements of 0.64% and 1.24%, respectively. And in the 256-channel setting, the best performance is 0.44% better than the baseline.
Table 2. Results on TieredImageNet under varying numbers of channels and pruning rates. Columns: Methods, Backbone, Pruning Rate, 5-way 1-shot accuracy, and 5-way 5-shot accuracy (the "±" entries give 95% confidence intervals). The compared methods are the Reptile baseline and DSD-/IHT-based Reptile with 32- to 256-channel backbones at various pruning rates.

However, on most 5-way 5-shot classification tasks, the performance of our method drops. We conjecture that the reason is that the TieredImageNet dataset, compared with the MiniImageNet dataset, contains more classes.
We next conduct a set of experiments on MiniImageNet to better understand the impact of pre-training and dense retraining on the task-specific testing performance.
Table 3. Results of the ablation study in the 5-way setting. Columns: Methods, P.T, R.T, 5-way 1-shot accuracy, and 5-way 5-shot accuracy. The "±" shows 95% confidence intervals, "P.T" means "Pre-training" and "R.T" means "Retraining". The rows compare the Reptile baseline with DSD- and IHT-based Reptile variants (e.g., IHT+Reptile(64, 40%) and IHT+Reptile(128, 60%)) that keep or remove the pre-training and retraining phases.

We begin by performing an ablation study on the pre-training and retraining phases. Each time, only one of them is removed from our method. For a fair comparison, the other settings are the same as proposed in Section 5.1. This study is conducted for both DSD- and IHT-based Reptile approaches. The results of the experiments are listed in Table 3.
Impact of the Retraining phase.
It can be clearly seen from the results in Table 3 that the retraining phase plays an important role in the accuracy performance of our method. Under the same pruning rate, without the retraining phase, the accuracy of both the DSD-based and IHT-based Reptile approaches drops dramatically. For instance, in the 64-channel case with a 40% pruning rate, the variant of IHT-based Reptile without the retraining phase suffers from a ∼11% drop in accuracy compared with the baseline. On the other side, as shown in Figure 2, the sparsity structure of the network does help to reduce the gap between training accuracy and testing accuracy even without the retraining phase. This confirms the benefit of sparsity for generalization gap reduction as revealed by Theorem 1. Therefore, the network pruning phase makes the model robust to overfitting but meanwhile tends to suffer from a deteriorated training loss. The retraining phase helps to restore the capacity of the model and further improves the overall generalization performance.
Fig. 2. Ablation study on the retraining phase for both DSD-based Reptile and IHT-based Reptile in the 64-channel case (curves: training and test accuracy of the baseline, DSD and IHT variants). The gap between the training accuracy and the test accuracy of the variants of our method becomes smaller than that of the baseline.
Impact of the Pre-training phase.

From Table 3, we observe that without the pre-training phase, the variant algorithms still outperform the baselines. Such results demonstrate the importance of the pruning and retraining phases from another perspective: merely pruning and retraining the over-parameterized models can achieve empirical performance similar to our method. However, these variant algorithms fail to outperform our full method. Since in network pruning the pre-training phase is treated as a necessary step for finding a set of important model parameters [6,7], we conjecture that it is pruning prematurely, before the model is well trained, that leads to the drop in performance.

We now perform experiments to further show how the performance varies with different hyperparameters. The tested hyperparameters include (1) the number of pre-training and retraining iterations in DSD-based Reptile; (2) the number of iterations in an IHT pruning-retraining interval; (3) the ratio of pruning iterations in an interval. To be clear, we define ratio = (Iter_prune / Iter_interval) %. In the experiments above, we set Iter_prune = 1.5 × 10^4 in a 20000-iteration IHT interval, which means the ratio is 75%.
Fig. 3. Study of the hyperparameters of DSD-based Reptile. (a) Study on the number of pre-training iterations. (b) Study on the number of retraining iterations.

Fig. 4. Study of the hyperparameters of IHT-based Reptile. (a) Study on the ratio of pruning iterations. (b) Study on the number of interval iterations.
DSD-based Reptile.
Figure 3 shows how the performance of DSD-based Reptile varies with the two hyperparameters, namely the numbers of pre-training and retraining iterations. Figure 3(a) reveals that in most cases, too much or too little pre-training both lead to a deterioration in performance. This is consistent with our ablation study: pre-training helps find a set of robust sparse parameters, which is important, while excessive pre-training, which reduces the iterations of the pruning phase, undermines the generalization performance. Figure 3(b) shows that better performance can be obtained when the number of retraining iterations is smaller than 30K, which indicates that only a small number of retraining steps are required to restore the accuracy without overfitting again.
IHT-based Reptile.
Figure 4 shows the hyperparameter sensitivity results of IHT-based Reptile. Figure 4(a) shows the performance under different ratios of pruning iterations in an IHT interval. It is clear that better performance can be obtained when the ratio is larger than 50%, which means the pruning iterations outnumber the retraining iterations. This reveals that more pruning iterations are required to alleviate overfitting, and a small number of retraining steps are enough to compensate for the loss of accuracy. Figure 4(b) shows the performance under a varying number of iterations in an IHT interval. We can see that as the interval iterations increase from 5K to 20K, the accuracies improve. This suggests that sufficient steps are required to train a robust model in a pruning-retraining loop.
Table 4. Results on complex networks. Columns: Methods, Backbone, Pruning Rate, 5-way 1-shot accuracy, and 5-way 5-shot accuracy (the "±" entries give 95% confidence intervals). The compared methods include the rerun MetaOptNet baseline [17] with a ResNet-12 backbone and its pruned variants, and CAVIA with a 128-channel backbone and its DSD-/IHT-pruned variants.

We further implement our method on MetaOptNet [17] and CAVIA [39] to evaluate its performance on more complex network structures. For MetaOptNet, we select ResNet-12 as the network and an SVM as the head, and the dropout in the network is replaced by our method. We respectively set the numbers of iterations for pre-training, pruning and retraining to 5 epochs, 20 epochs and 15 epochs. The learning rate is 0.1 in the first 30 epochs, 0.006 in the next 5 epochs and 0.0012 in the final 5 epochs. The dataset used is MiniImageNet. Since our experiments are conducted on 4 RTX 2080Ti GPUs (11GB) while MetaOptNet is trained on 4 Titan X GPUs (12GB), we have to reduce the number of training shots from 15 to 10 in our experiments. For a fair comparison, we rerun the baseline on the same model as in that paper with 10 training shots.

For CAVIA, we apply our method directly to the network parameters, and the context parameters are not pruned. In the DSD-based CAVIA case, the numbers of iterations for the pre-training, pruning and retraining phases are respectively 20K, 20K and 20K. In the IHT-based CAVIA case, the iteration number for pre-training is 20K, and the iterative phase includes 2 sparse-dense processes. Each sparse-dense process contains 20K iterations, of which 16K iterations are for pruning fine-tuning and 4K iterations are for dense retraining. Other settings are the same as those in the CAVIA paper [39].

As shown in Table 4, for MetaOptNet, our method gains a 0.2% improvement on 5-way 1-shot tasks compared with the baseline. On 5-way 5-shot tasks, our method still obtains similar performance. However, there is a trade-off between accuracy and training time. For CAVIA, all cases outperform the baselines, which shows the strong power of our methods in alleviating overfitting. Overall, our method can to some extent improve the generalization performance even with complex models when facing scarce data.
In this paper, we proposed a cardinality-constrained meta-learning approach for improving generalization performance via explicitly controlling the capacity of over-parameterized neural networks. We have theoretically proved that the generalization gap bounds of the sparse meta-learner have polynomial dependence on the sparsity level rather than on the number of parameters. Our approach has been implemented in the scalable meta-learning framework of Reptile, with the sparsity level of the parameters maintained by network pruning routines including dense-sparse-dense and iterative hard thresholding. Extensive experimental results on benchmark few-shot classification tasks, along with a hyperparameter impact study and a study on complex networks, confirm our theoretical predictions and demonstrate the power of network pruning and retraining for improving the generalization performance of gradient-optimization-based meta-learning.
Acknowledgements
Xiao-Tong Yuan is supported in part by the National Major Project of China for New Generation of AI under Grant No. 2018AAA0100400 and in part by the Natural Science Foundation of China (NSFC) under Grants No. 61876090 and No. 61936005. Qingshan Liu is supported by NSFC under Grants No. 61532009 and No. 61825601.
References
1. Abramovich, F., Grinshtein, V.: High-dimensional classification by sparse logistic regression. IEEE Transactions on Information Theory 65(5), 3068–3079 (2019)
2. Arora, S., Ge, R., Neyshabur, B., Zhang, Y.: Stronger generalization bounds for deep nets via a compression approach. In: International Conference on Machine Learning. pp. 254–263 (2018)
3. Bengio, Y., Bengio, S., Cloutier, J.: Learning a synaptic learning rule. In: IJCNN (1990)
4. Denevi, G., Ciliberto, C., Grazzi, R., Pontil, M.: Learning-to-learn stochastic gradient descent with biased regularization. In: International Conference on Machine Learning. pp. 1566–1575 (2019)
5. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th International Conference on Machine Learning. pp. 1126–1135 (2017)
6. Frankle, J., Carbin, M.: The lottery ticket hypothesis: Finding sparse, trainable neural networks. In: International Conference on Learning Representations (2019)
7. Han, S., Pool, J., Narang, S., Mao, H., Gong, E., Tang, S., Elsen, E., Vajda, P., Paluri, M., Tran, J., et al.: DSD: Dense-sparse-dense training for deep neural networks. In: International Conference on Learning Representations (2016)
8. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural networks. In: Advances in Neural Information Processing Systems. pp. 1135–1143 (2015)
9. Hardt, M., Recht, B., Singer, Y.: Train faster, generalize better: Stability of stochastic gradient descent. In: International Conference on Machine Learning. pp. 1225–1234 (2016)
10. Hassibi, B., Stork, D.G., Wolff, G.J.: Optimal brain surgeon and general network pruning. In: IEEE International Conference on Neural Networks. pp. 293–299. IEEE (1993)
11. Jin, X., Yuan, X., Feng, J., Yan, S.: Training skinny deep neural networks with iterative hard thresholding methods. arXiv preprint arXiv:1607.05423 (2016)
12. Khodak, M., Balcan, M.F., Talwalkar, A.: Provable guarantees for gradient-based meta-learning. In: Advances in Neural Information Processing Systems (2019)
13. Koch, G., Zemel, R., Salakhutdinov, R.: Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop. vol. 2 (2015)
14. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
15. Lake, B., Salakhutdinov, R., Gross, J., Tenenbaum, J.: One shot learning of simple visual concepts. In: Proceedings of the Annual Meeting of the Cognitive Science Society. vol. 33 (2011)
16. LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Advances in Neural Information Processing Systems. pp. 598–605 (1990)
17. Lee, K., Maji, S., Ravichandran, A., Soatto, S.: Meta-learning with differentiable convex optimization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 10657–10665 (2019)
18. Li, Z., Zhou, F., Chen, F., Li, H.: Meta-SGD: Learning to learn quickly for few-shot learning. In: Advances in Neural Information Processing Systems (2017)
19. Louizos, C., Welling, M., Kingma, D.P.: Learning sparse neural networks through L0 regularization. In: International Conference on Learning Representations (2018)
20. Maurer, A., Pontil, M.: Structured sparsity and generalization. Journal of Machine Learning Research 13(Mar), 671–690 (2012)
21. Mishra, N., Rohaninejad, M., Chen, X., Abbeel, P.: A simple neural attentive meta-learner. In: International Conference on Learning Representations (2018), https://openreview.net/forum?id=B1DmUzWAW
22. Nichol, A., Achiam, J., Schulman, J.: On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999 (2018)
23. Pillutla, V.K., Roulet, V., Kakade, S.M., Harchaoui, Z.: A smoother way to train structured prediction models. In: Advances in Neural Information Processing Systems. pp. 4766–4778 (2018)
24. Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning. In: International Conference on Learning Representations (2016)
25. Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J.B., Larochelle, H., Zemel, R.S.: Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676 (2018)
26. Rigollet, P.: 18.S997: High-dimensional statistics. Lecture Notes, MIT OpenCourseWare, Cambridge, MA, USA (2015)
27. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)
28. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T.: Meta-learning with memory-augmented neural networks. In: International Conference on Machine Learning. pp. 1842–1850 (2016)
29. Schmidhuber, J.: Evolutionary principles in self-referential learning. (On learning how to learn: The meta-meta-... hook.) Diploma thesis, Institut f. Informatik, Tech. Univ. Munich (1987)
30. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Advances in Neural Information Processing Systems. pp. 4077–4087 (2017)
31. Srinivas, S., Babu, R.V.: Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149 (2015)
32. Sung, F., Zhang, L., Xiang, T., Hospedales, T., Yang, Y.: Learning to learn: Meta-critic networks for sample efficient learning. arXiv preprint arXiv:1706.09529 (2017)
33. Thrun, S., Pratt, L.: Learning to Learn. Springer Science & Business Media (2012)
34. Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. In: Advances in Neural Information Processing Systems. pp. 3630–3638 (2016)
35. Weston, J., Chopra, S., Bordes, A.: Memory networks. arXiv preprint arXiv:1410.3916 (2014)
36. Yoon, J., Kim, T., Dia, O., Kim, S., Bengio, Y., Ahn, S.: Bayesian model-agnostic meta-learning. In: Advances in Neural Information Processing Systems. pp. 7332–7342 (2018)
37. Yuan, X.T., Li, P., Zhang, T.: Gradient hard thresholding pursuit. Journal of Machine Learning Research 18, 1–43 (2018)
38. Zhou, P., Yuan, X., Xu, H., Yan, S., Feng, J.: Efficient meta learning via minibatch proximal update. In: Advances in Neural Information Processing Systems. pp. 1532–1542 (2019)
39. Zintgraf, L., Shiarli, K., Kurin, V., Hofmann, K., Whiteson, S.: Fast context adaptation via meta-learning. In: International Conference on Machine Learning. pp. 7693–7702 (2019)

A Proofs of Results
A.1 Proof of Theorem 1
We need the following lemma, which guarantees the uniform convergence of R_S(θ) towards R(θ) for all θ when the loss function is Lipschitz continuous and smooth and the optimization is restricted to a bounded domain.

Lemma 1.
Assume that the domain of interest Θ ⊆ R^p is bounded by R and the loss function ℓ(f_θ(x), y) is G-Lipschitz continuous and H-smooth with respect to θ. Also assume that 0 ≤ ℓ(f_θ(x), y) ≤ B for all {f_θ(x), y}. Then for any δ ∈ (0, 1), the following bound holds with probability at least 1 − δ over the random draw of the sample set S for all θ ∈ Θ:

|R(θ) − R_S(θ)| ≤ O( B √( [ log(1/δ) + p log( √M GR(1 + ηH) / B ) ] / M ) ).
Proof. For any task T, let us denote ℓ̃(θ; T) := L_{D_T^query}( θ − η ∇_θ L_{D_T^supp}(θ) ). Since ℓ(f_θ(x), y) is G-Lipschitz continuous with respect to θ, we can show that

|ℓ̃(θ; T) − ℓ̃(θ′; T)| ≤ G ‖ θ − η∇_θ L_{D_T^supp}(θ) − θ′ + η∇_θ L_{D_T^supp}(θ′) ‖
                     ≤ G ( ‖θ − θ′‖ + η ‖∇_θ L_{D_T^supp}(θ) − ∇_θ L_{D_T^supp}(θ′)‖ )
                     ≤ G (1 + ηH) ‖θ − θ′‖,

which indicates that ℓ̃(θ; T) is G(1 + ηH)-Lipschitz continuous for any task T. Since Θ is a subset of an ℓ2-ball of radius R, it is standard that the covering number of Θ with respect to the ℓ2-distance is upper bounded by

N(ε, Θ, ℓ2) ≤ O( (R/ε)^p ).

Since the task-level loss function ℓ̃(θ; T) is G(1 + ηH)-Lipschitz continuous as shown above, it can be verified that the covering number of the class of functions L̃ = { T ↦ ℓ̃(θ; T) : θ ∈ Θ } with respect to the L∞-distance L∞(ℓ̃(θ1; ·), ℓ̃(θ2; ·)) := sup_T |ℓ̃(θ1; T) − ℓ̃(θ2; T)| is bounded by

N(ε, L̃, L∞) ≤ N( ε / (G(1 + ηH)), Θ, ℓ2 ) ≤ O( ( GR(1 + ηH) / ε )^p ).

Therefore, there exists a set of points Ω ⊆ R^p with cardinality at most N(ε, L̃, L∞) such that the following bound holds for any θ ∈ Θ:

min_{ω ∈ Ω} |ℓ̃(θ; T) − ℓ̃(ω; T)| ≤ ε,  ∀ T.

For an arbitrary ω ∈ Ω, based on Hoeffding's inequality (note that ℓ(·, ·) ≤ B implies ℓ̃(·, ·) ≤ B), we have

P( |R_S(ω) − R(ω)| > t ) ≤ 2 exp( −2M t² / B² ).

For any θ ∈ Θ, based on the triangle inequality we can show that there exists ω_θ ∈ Ω such that

|R_S(θ) − R(θ)| = |R_S(θ) − R_S(ω_θ) + R_S(ω_θ) − R(ω_θ) + R(ω_θ) − R(θ)|
              ≤ 2ε + |R_S(ω_θ) − R(ω_θ)| ≤ 2ε + max_{ω ∈ Ω} |R_S(ω) − R(ω)|.

Applying the union bound, we know that

P( sup_{θ ∈ Θ} |R(θ) − R_S(θ)| ≥ 2ε + t ) ≤ N(ε, L̃, L∞) · 2 exp( −2M t² / B² ) ≤ O( ( GR(1 + ηH) / ε )^p exp( −2M t² / B² ) ).

Let us choose ε = B/√M and

t = O( B √( [ log(1/δ) + p log( GR(1 + ηH) / ε ) ] / M ) )

such that the right-hand side of the previous inequality equals δ. Then we obtain that, with probability at least 1 − δ,

sup_{θ ∈ Θ} |R(θ) − R_S(θ)| ≤ O( B √( [ log(1/δ) + p log( √M GR(1 + ηH) / B ) ] / M ) ).

This proves the desired result.

Based on this lemma, we can readily prove the main result in the theorem.
Proof (Proof of Theorem 1).
For any fixed support set J ∈ 𝒥, where 𝒥 denotes the collection of all index sets J ⊆ [p] with |J| = k, by applying Lemma 1 we obtain that the following uniform convergence bound holds for all θ with supp(θ) ⊆ J with probability at least 1 − δ over S:

|R(θ) − R_S(θ)| ≤ O( B √( [ log(1/δ) + k log( √M GR(1 + ηH) / B ) ] / M ) ).

Since by the constraint the parameter vector θ is always k-sparse, we have supp(θ) ∈ 𝒥. Then by the union bound over 𝒥 we get that, with probability at least 1 − δ, the following bound holds for all θ with ‖θ‖_0 ≤ k:

|R(θ) − R_S(θ)| ≤ O( B √( [ log(|𝒥|) + log(1/δ) + k log( √M GR(1 + ηH) / B ) ] / M ) ).

It remains to bound the cardinality |𝒥|. From [26, Lemma 2.7] we know that |𝒥| = C(p, k) ≤ (ep/k)^k, which then implies the desired generalization gap bound. This completes the proof.

A.2 Proof of Corollary 1
Proof.
Let R_γ be a population version of R_{γ,S} with the margin-based loss function ℓ_γ used for computing both L_{D_T^supp} and L_{D_T^query}. Since ℓ_γ is a surrogate of the binary loss used by R̃_γ for evaluating the query classification error, we must have R̃_γ ≤ R_γ. Then the desired bound follows directly by invoking Theorem 1 with the considered margin loss.

B Detailed Experimental Settings
B.1 Model
The model used in our experiments is consistent with that considered for Reptile [22]. The model used throughout the experiments contains 4 sequential modules, each containing a convolutional layer with 3 × 3 kernels. We test with varying channel numbers {32, 64, 128, 256} in each convolution layer to show the robustness of our algorithms to meta-overfitting.
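For concreteness, a sketch of such a 4-module backbone is given below. The text above specifies only four sequential modules with 3 × 3 convolutions and a configurable channel width; the batch-normalization, ReLU and max-pooling layers and the final linear classification head are assumptions borrowed from the standard few-shot backbone, not details recovered from the paper.

import torch.nn as nn

def make_backbone(channels=64, n_way=5, in_channels=3, image_size=84):
    """A 4-module CNN in the spirit of the backbone described above (details assumed)."""
    def block(c_in, c_out):
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
    feature_map = image_size // 16                           # four 2x poolings (valid for 84 and 28)
    return nn.Sequential(
        block(in_channels, channels),
        block(channels, channels),
        block(channels, channels),
        block(channels, channels),
        nn.Flatten(),
        nn.Linear(channels * feature_map * feature_map, n_way),
    )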
B.2 Datasets

There are three popular benchmark datasets used in our experiments.
Fig. 5. Tasks used in our experiments. (a) 5-way 1-shot tasks generated from Omniglot. (b) 5-way 1-shot tasks generated from the MiniImageNet or TieredImageNet dataset.
Omniglot
The Omniglot dataset has 1623 characters from 50 alphabets. Each character contains 20 instances drawn by different individuals. The size of each image is 28 × 28. We randomly select 1200 characters for meta-training and the rest are used for meta-testing. Following [28], we also adopt a data augmentation strategy based on image rotation to enhance performance.
Table 5. Detailed experimental settings for the Omniglot, MiniImageNet and TieredImageNet datasets with DSD-based Reptile.

Hyperparameters         Omniglot   MiniImageNet   TieredImageNet
classes                 5          5              5
shot                    1 or 5     1 or 5         1 or 5
inner batch             10         10             6
inner iterations        5          8              8
outer learning rate     1          1              1
meta batch              5          5              5
evaluation batch        5          5              5
evaluation iterations   50         50             50

The table further lists the inner learning rate (0.001 for Omniglot and MiniImageNet), the meta iterations, and the pre-training, pruning and retraining iteration counts, the latter given separately for the 32-channel and the 64/128/256-channel cases.

MiniImageNet
The MiniImageNet dataset consists of 100 classes from the ImageNet dataset [14], and each class contains 600 images of size 84 × 84 × 3.

TieredImageNet

The TieredImageNet dataset consists of 608 classes from the ILSVRC-12 dataset [27], and each image is scaled to 84 × 84 × 3. There are 351 classes used for training, 97 classes for validation and 160 classes used for testing.
B.3 Detailed Experimental Settings
The experimental details of DSD-based Reptile and IHT-based Reptile can respectively be seen in Table 5 and Table 6. Two points of the hyperparameter settings should be highlighted.
– The outer learning rate has an initial value of 1.
– For MiniImageNet [34] with DSD-based Reptile, the iteration number of the pruning phase for the 32-channel case is 5 ×, and the iteration number of the retraining phase for the 32-channel case is 2 ×; the corresponding values for the 64/128/256-channel cases are given in Section 5.1.

C Additional Experimental Results
This appendix contains the complete experimental results for the Omniglot, MiniImageNet and TieredImageNet datasets. We performed our methods on 4-layer CNNs with varying channel numbers {32, 64, 128, 256}, as mentioned in Appendix B.
Table 6. Detailed experimental settings for the Omniglot, MiniImageNet and TieredImageNet datasets with IHT-based Reptile.

Hyperparameters         Omniglot   MiniImageNet   TieredImageNet
classes                 5          5              5
shot                    1 or 5     1 or 5         1 or 5
inner batch             10         10             6
inner iterations        5          8              8
outer learning rate     1          1              1
meta batch              5          5              5
evaluation batch        5          5              5
evaluation iterations   50         50             50

The table further lists the inner learning rate (0.001 for Omniglot and MiniImageNet), the meta iterations, and the per-dataset pruning and retraining iteration counts.

C.1 Results on Omniglot dataset
The baselines and all the results on the Omniglot dataset are reported in Table 7. For each case, both the DSD-based Reptile approach and the IHT-based Reptile approach are evaluated at various pruning rates. The settings are the same as proposed in Section B.3.

For the 32-channel and 64-channel cases, which are less prone to overfitting, both the DSD-based Reptile approach and the IHT-based Reptile approach tend to achieve performance comparable to the baselines. When the channel size increases to 128 and 256, slightly improved performance can be observed. This is consistent with our analysis that overfitting is more likely to happen when the channel number is relatively large, and that weight pruning helps alleviate this phenomenon to improve the generalization performance, which then leads to accuracy improvement with the retraining operation.
C.2 Results on MiniImageNet dataset
In this section, we report the detailed results of the experiments on the MiniImageNet dataset. From the table, it can be clearly observed that our method consistently achieves remarkable performance. For one thing, as the number of channels increases, the accuracies of our methods keep improving while the baselines behave oppositely. For example, in the 32-channel setting, in which the model is less prone to overfit, applying DSD-based Reptile with 10% and 40% pruning rates yields accuracy gains of 0.35% and 0.5% on 5-way 1-shot tasks and of 1.02% and 1% on 5-way 5-shot tasks. In the 64-channel setting, DSD-based Reptile achieves consistent improvements over the 5-way 1-shot and 5-way 5-shot baselines with pruning rates of 20%, 30% and 40%. Meanwhile, our IHT-based Reptile approach improves the accuracy on both 5-way 1-shot and 5-way 5-shot tasks with pruning rates of 10%, 20% and 40%, including gains of about 1.51% on 5-way 1-shot and 1.95% on 5-way 5-shot tasks. In the 128-channel setting, all the cases of our method outperform the baseline remarkably, and the best accuracy of DSD-based Reptile on 5-way 1-shot tasks is nearly 3% higher than the baseline, while on 5-way 5-shot tasks the gain is about 4%.

C.3 Results on TieredImageNet dataset
In this section, we present the detailed results of the experiments on the TieredImageNet dataset in Table 9. From the table, we can observe that our method achieves good performance on 5-way 1-shot classification tasks. For example, in the 32-channel setting, the accuracy of DSD-based Reptile with a 10% pruning rate is ∼0.5% higher than the baseline; in the 64-channel setting, both DSD-based Reptile and IHT-based Reptile improve the performance evidently, by 0.64% and 1.24% respectively; and in the 256-channel setting, the best result is a 0.44% improvement over the baseline.

However, on most 5-way 5-shot classification tasks, the performance of our method drops. We conjecture that the reason is that the TieredImageNet dataset, compared with the MiniImageNet dataset, contains more classes, from which the networks can learn more prior knowledge and thus ease the overfitting.
Table 7. Few-shot classification results on the Omniglot dataset for the 4-layer convolutional network with different channel numbers on 5-way 1-shot and 5-way 5-shot tasks. The "±" shows 95% confidence intervals over tasks. The evaluation baselines are run by us. Rows compare the Reptile baseline (0% pruning) with DSD-based and IHT-based Reptile at various pruning rates for the 32-, 64-, 128- and 256-channel backbones.
Table 8. Few-shot classification results on the MiniImageNet dataset for the 4-layer convolutional network with different channel numbers in the 5-way setting. The "±" shows 95% confidence intervals over tasks. The evaluation baselines are run by us. Rows compare the Reptile baseline with DSD-based and IHT-based Reptile at pruning rates from 10% to 60% for the 32-, 64-, 128- and 256-channel backbones.
Table 9. Few-shot classification results on the TieredImageNet dataset for the 4-layer convolutional network with different channel numbers in the 5-way setting. The "±" shows 95% confidence intervals over tasks. The evaluation baselines are run by us. Rows compare the Reptile baseline with DSD-based and IHT-based Reptile at various pruning rates for the 32-, 64-, 128- and 256-channel backbones.