MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures
Jeongun Ryu*, Jaewoong Shin*, Hae Beom Lee*, Sung Ju Hwang
KAIST, AITRICS, South Korea
{rjw0205, shinjw148, haebeom.lee, sjhwang82}@kaist.ac.kr
*: Equal contribution

Abstract
Regularization and transfer learning are two popular techniques to enhance generalization on unseen data, which is a fundamental problem of machine learning. Regularization techniques are versatile, as they are task- and architecture-agnostic, but they do not exploit the large amount of data available. Transfer learning methods learn to transfer knowledge from one domain to another, but may not generalize across tasks and architectures, and may introduce new training cost for adapting to the target task. To bridge the gap between the two, we propose a transferable perturbation, MetaPerturb, which is meta-learned to improve generalization performance on unseen data. MetaPerturb is implemented as a set-based lightweight network that is agnostic to the size and the order of the input, and is shared across the layers. We then propose a meta-learning framework to jointly train the perturbation function over heterogeneous tasks in parallel. As MetaPerturb is a set function trained over diverse distributions across layers and tasks, it can generalize to heterogeneous tasks and architectures. We validate the efficacy and generality of MetaPerturb trained on a specific source domain and architecture by applying it to the training of diverse neural architectures on heterogeneous target datasets, against various regularizers and fine-tuning. The results show that the networks trained with MetaPerturb significantly outperform the baselines on most of the tasks and architectures, with a negligible increase in parameter size and no hyperparameters to tune.
1 Introduction

The success of Deep Neural Networks (DNNs) largely owes to their ability to accurately represent arbitrarily complex functions. However, at the same time, the excessive number of parameters, which enables such expressive power, renders them susceptible to overfitting, especially when we do not have a sufficient amount of data to ensure generalization. There are two popular techniques that can help with the generalization of deep neural networks: transfer learning and regularization.

Transfer learning [39] methods aim to overcome this data scarcity problem by transferring knowledge obtained from a source dataset to effectively guide the learning of the target task. Whereas existing transfer learning methods have been proven to be very effective, they also have some limitations. Firstly, their performance gain highly depends on the similarity between the source and target domains, and knowledge transfer across different domains may not be effective, or may even degrade the performance on the target task. Secondly, many transfer learning methods require the neural architectures for the source and the target tasks to be the same, as in the case of fine-tuning. Moreover, transfer learning methods usually require additional memory and computational cost for knowledge transfer. Many require storing the entire set of parameters of the source network (e.g. fine-tuning, LwF [22], attention transfer [48]), and some methods require extra training to transfer the source knowledge to the target task [14]. Such restrictions make transfer learning unappealing, and thus few of these methods are used in practice except for simple fine-tuning of networks pre-trained on large datasets (e.g. convolutional networks pretrained on ImageNet [33], or BERT [6] trained on Wikipedia).

Figure 1: Concepts. We learn our perturbation function at the meta-training stage and use it to solve diverse meta-testing tasks that come with diverse network architectures.

On the other hand, regularization techniques, which leverage human prior knowledge about the learning task to help with generalization, are more versatile as they are domain- and architecture-agnostic. Penalizing the ℓp-norm of the weights [28], dropping out random units or filters [9, 38], normalizing the distribution of latent features for each input [13, 41, 45], and randomly mixing or perturbing samples [42, 50] are instances of such domain-agnostic regularization. They are more favored in practice over transfer learning since they can work with any architecture and do not incur the extra memory or computational overhead that is often costly with many advanced transfer learning techniques. However, regularization techniques are limited in that they do not exploit the rich information in the large amount of data available.

These limitations of transfer learning and regularization techniques motivate us to come up with a transferable regularization technique that can bridge the gap between the two different approaches for enhancing generalization. Such a transferable regularizer should learn useful knowledge from the source task for regularization, while generalizing across different domains and architectures, with minimal extra cost. A recent work [20] proposes to meta-learn a noise generator for few-shot learning, to improve generalization on unseen tasks. Yet, the proposed gradient-based meta-learning scheme cannot scale to the standard learning setting, which requires a large number of steps to converge to good solutions, and is inapplicable to architectures that are different from the source network architecture.

To overcome these difficulties, we propose a novel lightweight, scalable perturbation function that is meta-learned to improve generalization on unseen tasks and architectures in standard training (see Figure 1 for the concept). Our model generates regularizing perturbations to latent features, given the set of original latent features at each layer. Since it is implemented as an order-equivariant set function, it can be shared across layers and across networks learned with different initializations. We meta-learn our perturbation function by simple joint training over multiple subsets of the source dataset in parallel, which largely reduces the computational cost of meta-learning.

We validate the efficacy and efficiency of our transferable regularizer MetaPerturb by training it on a specific source dataset and applying the learned function to the training of heterogeneous architectures on a large number of datasets with varying degrees of task similarity. The results show that networks trained with our meta-regularizer outperform recent regularization techniques and fine-tuning, and obtain largely improved performance even on largely different tasks on which fine-tuning fails. Also, since the optimal amount of perturbation is automatically learned at each layer, MetaPerturb does not have any hyperparameters, unlike most existing regularizers.
Such effectiveness, efficiency, and versatility of our method makes it an appealing transferable regularization technique that can replace or accompany fine-tuning and conventional regularization techniques.

The contribution of this paper is threefold:
• We propose a lightweight and versatile perturbation function that can transfer the knowledge of a source task to heterogeneous target tasks and architectures.
• We propose a novel meta-learning framework in the form of joint training, which allows us to efficiently perform meta-learning on large-scale datasets in the standard learning framework.
• We validate our perturbation function on a large number of datasets and architectures, on which it successfully outperforms existing regularizers and fine-tuning.
2 Related Work
Transfer Learning
Transfer learning [39] is one of the most popular tools in deep learning for solving the data scarcity problem. The most widely used method in transfer learning is fine-tuning [34], which first trains parameters on the source domain and then uses them as the initial weights when learning on the target domain. ImageNet [33] pre-trained network weights are widely used for fine-tuning, achieving impressive performance on various computer vision tasks (e.g. semantic segmentation [23], object detection [10]). However, if the source and target domains are semantically different, fine-tuning may result in negative transfer [46]. Further, it is inapplicable when the target network architecture differs from that of the source network. Transfer learning frameworks also often require extensive hyperparameter tuning (e.g. up to which layer to transfer, whether to fine-tune, etc.). Recently, Jang et al. [14] proposed a framework to overcome this limitation, which can automatically learn what knowledge to transfer from the source network and between which layers to perform knowledge transfer. However, it requires a large amount of additional training for knowledge transfer, which limits its practicality. Most existing transfer learning methods aim to transfer the features themselves, which may result in negative or zero transfer when the source and target domains are dissimilar. Contrary to existing frameworks, our framework transfers how to perturb the features in the latent space, which can yield performance gains even when the domains are dissimilar.
Regularization methods
Training with our input-dependent perturbation function is reminiscent of some existing input-dependent regularizers. Specifically, information bottleneck methods [40] with variational inference have an input-dependent form of perturbation function applied to both training and testing examples, as with ours. Variational Information Bottleneck [3] introduces additive noise, whereas Information Dropout [2] applies multiplicative noise as we do. The critical difference from those existing regularizers is that our perturbation function is meta-learned, while they do not involve such knowledge transfer. A recently proposed meta-regularizer, Meta Dropout [20], is relevant to ours as it learns to perturb the latent features of training examples for generalization. However, it specifically targets meta-level generalization in few-shot meta-learning, and does not scale to standard learning frameworks with a large number of inner gradient steps, as it runs on the MAML framework [7]. Meta Dropout also requires the noise generator to have the same architecture as the source network, which limits its practicality for large networks and makes it impossible to generalize over heterogeneous architectures.
Meta Learning
Our regularizer is meta-learned to generalize over heterogeneous tasks and architectures. Meta-learning [13] aims to learn common knowledge that can be shared over a distribution of tasks, such that the model can generalize to unseen tasks. While the literature on meta-learning is vast, we name a few works that are most relevant to ours. Finn et al. [7] proposed the model-agnostic meta-learning (MAML) framework to find a shared initialization parameter that can be fine-tuned to obtain good performance on an unseen target task within a few gradient steps. The main difficulty in our setting is that the number of inner-gradient steps is excessively large compared to few-shot learning problems. This led follow-up works to focus on reducing the computational cost of extending the inner-gradient steps [4, 8, 30, 31], but they still assume that we take at most hundreds of gradient steps from a shared initialization. On the other hand, Ren et al. [32] and its variant [35] propose to use an online approximation of the full inner-gradient steps, such that we look ahead only a single gradient step and the meta-parameter is optimized together with the main network parameters in an online manner. While effective for standard learning, they are still computationally inefficient due to the expensive bi-level optimization. By resorting to simple joint training on fixed subsets of the dataset, we efficiently extend the meta-learning framework from few-shot learning to standard learning frameworks for transfer learning.
3 Approach

In this section, we introduce our perturbation function, which is applicable to any convolutional network architecture and to any image dataset. We then further explain our meta-learning framework for efficiently learning the proposed perturbation function in the standard learning framework.
The conventional transfer learning method transfers the entire set or a subset of the main network parameters θ. However, such parameter transfer may become ineffective when we transfer knowledge across a dissimilar pair of source and target tasks. Further, if we need to use a different neural architecture for the target task, it becomes simply inapplicable. Thus, we propose to focus on transferring another set of parameters φ, which is disjoint from θ and is extremely lightweight. In this work, we let φ be the parameters of a perturbation function which is learned to regularize the latent features of convolutional neural networks. The important assumption here is that even if a disjoint pair of source and target tasks requires a different feature extractor for each, there may exist some general rule of perturbation that can effectively regularize both feature extractors at the same time.

Another property that we want to impose upon our perturbation function is its general applicability to any convolutional neural network architecture. The perturbation function should be applicable to:
• Neural networks with an undefined number of convolutional layers. We can solve this problem by allowing the function to be shared across the convolutional layers.
• Convolutional layers with an undefined number of channels. We can tackle this problem either by sharing the function across channels or by using permutation-equivariant set encodings.

Figure 2: Left: The architecture of the channel-wise permutation equivariant operation. Right: The architecture of the channel-wise scaling function, which takes a batch of instances as input.
We now describe our novel perturbation function,
MetaPerturb, that satisfies the above requirements. It consists of the following two components: an input-dependent stochastic noise generator and a batch-dependent scaling function.
Input-dependent stochastic noise generator
The first component is an input-dependent stochastic noise generator, which has been empirically shown by Lee et al. [20] to often outperform input-independent counterparts. To make the noise applicable to any convolutional layer, we propose to use a permutation-equivariant set encoding [49] across the channels. It allows us to consider interactions between the feature maps at each layer, while making the generated perturbations invariant to the channel re-orderings caused by random initializations.

Zaheer et al. [49] showed that for a linear transformation μ_{φ'} : R^C → R^C parameterized by a matrix φ' ∈ R^{C×C}, μ_{φ'} is permutation equivariant to the C input elements iff the diagonal elements of φ' are all equal and the off-diagonal elements of φ' are all equal as well, i.e. φ' = λ'I + γ'(11^T) with λ', γ' ∈ R and 1 = [1, ..., 1]^T. The diagonal elements map each of the input elements to themselves, whereas the off-diagonal elements capture the interactions between the input elements.

Here, we propose an equivalent form for the convolution operation, such that the output feature maps μ_φ(h) are equivariant to channel-wise permutations of the input feature maps h. We assume that φ consists of the following two types of parameters: a 3x3 kernel λ for the self-to-self convolution operation and a 3x3 kernel γ for the all-to-self convolution operation. We then similarly combine λ and γ to produce a convolutional weight tensor of dimension R^{C×C×3×3} for C input and output channels (see Figure 2 (left)). Zaheer et al. [49] also showed that a stack of multiple permutation equivariant operations is itself permutation equivariant. Thus we stack two layers of μ_φ with different parameters and a ReLU nonlinearity in between them, in order to increase the flexibility of μ_φ (see Figure 2 (left)).

Finally, we sample the input-dependent stochastic noise z from the following distribution:

    z = Softplus(a),    a ∼ N(μ_φ(h), I)        (1)

where we fix the variance of a to I following Lee et al. [20], which works well in practice.
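To make the construction concrete, the following is a minimal sketch of the channel-wise permutation-equivariant operation and the noise sampling of Eq. (1), assuming PyTorch. The class names (EquivariantConv, NoiseGenerator), the 3x3 kernel size, and the initialization scale are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the permutation-equivariant noise generator, assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EquivariantConv(nn.Module):
    """Convolution whose output is equivariant to channel permutations of the input."""
    def __init__(self, kernel_size=3):
        super().__init__()
        # lam: self-to-self kernel (diagonal part), gam: all-to-self kernel (off-diagonal part).
        self.lam = nn.Parameter(0.01 * torch.randn(1, 1, kernel_size, kernel_size))
        self.gam = nn.Parameter(0.01 * torch.randn(1, 1, kernel_size, kernel_size))
        self.pad = kernel_size // 2

    def forward(self, h):                               # h: (B, C, H, W)
        B, C, H, W = h.shape
        # Diagonal part: every channel is convolved with the same kernel lam.
        self_part = F.conv2d(h.reshape(B * C, 1, H, W), self.lam, padding=self.pad)
        self_part = self_part.reshape(B, C, H, W)
        # Off-diagonal part: the channel-summed map convolved with gam, broadcast to all channels.
        all_part = F.conv2d(h.sum(dim=1, keepdim=True), self.gam, padding=self.pad)
        return self_part + all_part

class NoiseGenerator(nn.Module):
    """Two stacked equivariant convs with ReLU; samples z = Softplus(a), a ~ N(mu(h), I)."""
    def __init__(self):
        super().__init__()
        self.mu = nn.Sequential(EquivariantConv(), nn.ReLU(), EquivariantConv())

    def forward(self, h):
        mu = self.mu(h)
        a = mu + torch.randn_like(mu)                   # unit variance, as in Eq. (1)
        return F.softplus(a)
```

Because the two kernels are shared by all channels, the module has the same (tiny) number of parameters regardless of the channel count, which is what allows it to be reused across layers and architectures.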
Algorithm 1 Meta-training
Input: (D_1^tr, D_1^te), ..., (D_T^tr, D_T^te); learning rate α
Output: φ*
  Randomly initialize θ_1, ..., θ_T, φ
  while not converged do
    for t = 1 to T do
      Sample B_t^tr ⊂ D_t^tr and B_t^te ⊂ D_t^te
      Compute L(B_t^tr; θ_t, φ) with perturbation
      θ_t ← θ_t − α ∇_{θ_t} L(B_t^tr; θ_t, φ)
      Compute L(B_t^te; θ_t, φ) with perturbation
    end for
    φ ← φ − (α/T) ∇_φ Σ_{t=1}^{T} L(B_t^te; θ_t, φ)
  end while

Algorithm 2 Meta-testing
Input: D^tr, D^te, φ*; learning rate α
Output: θ*
  Randomly initialize θ
  while not converged do
    Sample B^tr ⊂ D^tr
    Compute L(B^tr; θ, φ*) with perturbation
    θ ← θ − α ∇_θ L(B^tr; θ, φ*)
  end while
  Evaluate the test examples in D^te with MC approximation and the parameter θ*
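To illustrate Algorithm 1, here is a minimal single-process sketch of one meta-training step, assuming PyTorch. The names task_models (the θ_t), perturb (φ), loss_fn, and the batch tuples are hypothetical placeholders; in the paper the tasks are distributed over GPUs and only φ is shared between processes. The update structure (θ_t updated on its training batch, φ updated on the averaged test loss) corresponds to the stop-gradient objective described in the next subsection.

```python
# Hypothetical sketch of one step of Algorithm 1 (single process), assuming PyTorch.
import torch

def meta_train_step(task_models, perturb, task_batches, loss_fn, lr):
    meta_loss = 0.0
    for model, (train_batch, test_batch) in zip(task_models, task_batches):
        x_tr, y_tr = train_batch
        x_te, y_te = test_batch
        # Update theta_t on its training batch; phi participates in the forward pass
        # but only theta_t receives this gradient (stop-gradient on phi).
        loss_tr = loss_fn(model(x_tr, perturb), y_tr)
        grads = torch.autograd.grad(loss_tr, list(model.parameters()))
        with torch.no_grad():
            for p, g in zip(model.parameters(), grads):
                p -= lr * g
        # Accumulate the test loss; only phi will be updated with it.
        meta_loss = meta_loss + loss_fn(model(x_te, perturb), y_te)
    # Update the shared perturbation parameters phi on the averaged test loss.
    phi_params = list(perturb.parameters())
    phi_grads = torch.autograd.grad(meta_loss / len(task_models), phi_params)
    with torch.no_grad():
        for p, g in zip(phi_params, phi_grads):
            p -= lr * g
```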
Batch-dependent scaling function
The next component is a batch-dependent scaling function, which scales each channel to a different value in [0, 1] for the given batch of examples. The assumption here is that the optimal amount of parameter usage for each channel should be controlled differently for each dataset, using a soft multiplicative gating mechanism. In Figure 2 (right), at training time, we first collect the examples in a batch B, apply the convolution, and perform global average pooling (GAP) for each channel k to extract a vector representation of the channel. We then compute statistics of these representations, such as the mean and diagonal covariance over the batch, and further concatenate layer information, such as the number of channels C and the width W (or equivalently, the height H), to the statistics. We finally generate the scales s_1, ..., s_C with a shared affine transformation and a sigmoid function, and collect them into a single vector s = [s_1, ..., s_C] ∈ [0, 1]^C. At testing time, instead of using batch-wise scales, we use global scales accumulated by a moving average at training time, similarly to batch normalization [13].

Figure 3: The architecture of our perturbation function, applicable to any convolutional neural network (e.g. ResNet).
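A minimal sketch of this scaling function is given below, assuming PyTorch. The exact statistics, the layer-information features, and the momentum of the moving average are illustrative assumptions; only the overall structure (per-channel GAP statistics, a shared affine map with a sigmoid, and a moving average for test time) follows the description above.

```python
# Hypothetical sketch of the batch-dependent channel-wise scaling function, assuming PyTorch.
import torch
import torch.nn as nn

class ChannelScaler(nn.Module):
    def __init__(self, feature_dim=4, momentum=0.1):
        super().__init__()
        # Shared affine map from per-channel statistics (+ layer info) to a scale logit.
        self.fc = nn.Linear(feature_dim, 1)
        self.momentum = momentum
        self.register_buffer("running_scale", torch.tensor([]))

    def forward(self, h):                                # h: (B, C, H, W)
        B, C, H, W = h.shape
        if self.training:
            gap = h.mean(dim=(2, 3))                     # (B, C): GAP per channel
            mean = gap.mean(dim=0)                       # batch statistics per channel
            var = gap.var(dim=0, unbiased=False)
            layer_info = gap.new_tensor([C, W]).expand(C, 2)   # layer information
            feats = torch.cat([mean.unsqueeze(1), var.unsqueeze(1), layer_info], dim=1)
            s = torch.sigmoid(self.fc(feats)).squeeze(1)       # (C,) values in (0, 1)
            with torch.no_grad():                        # keep a moving average for test time
                if self.running_scale.numel() != C:
                    self.running_scale = s.detach().clone()
                else:
                    self.running_scale.mul_(1 - self.momentum).add_(self.momentum * s.detach())
        else:
            s = self.running_scale
        return s.view(1, C, 1, 1)                        # broadcastable channel-wise scale
```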
Final form
We lastly combine z and s to obtain the following form of the perturbation g_φ(h):

    g_φ(h) = s ∘ z        (2)

where ∘ denotes channel-wise multiplication. We then multiply g_φ(h) back onto the input feature maps h, at every layer (every block for ResNet [12]) of the network (see Figure 3). Note that the cost of knowledge transfer is marginal thanks to the small dimensionality of φ. Further, there is no hyperparameter to tune, since the optimal amount of the two perturbations is meta-learned and automatically decided for each layer and channel.

The next important question is how to efficiently meta-learn the parameter φ of the perturbation function. There are two challenges. First, because of the large size of each source task, it is costly to sequentially alternate between the tasks within a single GPU, unlike few-shot learning where each task is sufficiently small. Second, the computational cost of the lookahead operation and the second-order derivative in the online approximation proposed by Ren et al. [32] is still too expensive.
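Before turning to the meta-learning procedure, a minimal sketch of how the two components are combined into the final perturbation of Eq. (2) and applied to a feature map is given below; NoiseGenerator and ChannelScaler refer to the hypothetical sketches above, and the module would be shared across all layers or blocks.

```python
# Hypothetical sketch of the full perturbation g_phi(h) = s * z applied to a feature map.
import torch.nn as nn

class MetaPerturb(nn.Module):
    def __init__(self):
        super().__init__()
        self.noise = NoiseGenerator()    # input-dependent stochastic noise z
        self.scale = ChannelScaler()     # batch-dependent channel-wise scale s

    def forward(self, h):                # h: output of a conv (or residual) block
        z = self.noise(h)
        s = self.scale(h)
        return h * (s * z)               # multiply g_phi(h) back onto the features
```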
Distributed meta-learning
To solve the first problem, we class-wisely divide the source dataset to generate T tasks with fixed samples and distribute them across multiple GPUs for parallel learning of the tasks. Then, throughout the entire meta-training phase, we only need to share the low-dimensional meta-parameter φ between the GPUs, without sequentially alternating training over the tasks. Such a way of meta-learning is simple yet novel, and scalable in the number of tasks given a sufficient number of GPUs.
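One way this parallel scheme could be realized, assuming PyTorch's torch.distributed package: each process keeps its own task parameters θ_t, and after every backward pass only the gradients of the shared φ are averaged across processes before the optimizer step. This is an illustrative sketch, not the authors' implementation.

```python
# Hypothetical sketch: average only the gradients of the shared phi across processes.
import torch.distributed as dist

def share_perturb_grads(perturb):
    world_size = dist.get_world_size()
    for p in perturb.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```

Each process would call this right before its optimizer step on φ, while the per-task parameters θ_t are never synchronized.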
Knowledge transfer at the limit of convergence
To solve the second problem, we propose to further approximate the online approximation [32] by simply ignoring the bi-level optimization and the corresponding second-order derivative. This means that we focus on knowledge transfer across the tasks only at the limit of convergence of the tasks. Toward this goal, we propose to perform a joint optimization of θ = {θ_1, ..., θ_T} and φ, each of which maximizes the log-likelihood of the training dataset D_t^tr and the test dataset D_t^te, respectively:

    φ*, θ* = argmax_{φ, θ} Σ_{t=1}^{T} [ log p(y_t^te | X_t^te ; StopGrad(θ_t), φ) + log p(y_t^tr | X_t^tr ; θ_t, StopGrad(φ)) ]        (3)

where StopGrad(x) denotes that we do not compute the gradient with respect to x and treat it as a constant. See Algorithm 1 and Algorithm 2 for meta-training and meta-testing, respectively. The intuition is that, even with this naive approximation, the final φ* will be transferable if we confine the limit of transfer to around the convergence, since φ* already satisfies the desired property at the end of the convergence of the multiple meta-training tasks, i.e. over θ_1*, ..., θ_T*. It is natural to expect a similar consequence at meta-test time if we let the novel task T+1 jointly converge with the meta-learned φ* to obtain θ_{T+1}*. We empirically verified that gradually increasing the strength of our perturbation function g_φ performs much better than training without such annealing, which suggests that the knowledge transfer may be less effective at the early stage of training but becomes more effective at later steps, i.e. near convergence. With this naive approximation, we can largely reduce the computational cost of meta-training.

4 Experiments

We next validate our method on realistic learning scenarios where the target task can come with arbitrary image datasets and arbitrary convolutional network architectures. For the base regularization, we apply weight decay, random cropping, and horizontal flipping in all our experiments. We first validate whether our meta-learned perturbation function can generalize to multiple target datasets.

Table 1: Transfer to multiple datasets. Source and target network are ResNet20. TIN: Tiny ImageNet.

Figure 4: Convergence plots on the Aircraft [25] and Stanford Cars [18] datasets with ResNet20: (a, b) test cross-entropy and (c, d) test accuracy.
Datasets
We use
Tiny ImageNet [1] as the source dataset, which is a subset of the ImageNet [33] dataset. It consists of images from 200 classes, with a fixed number of training images per class. We class-wisely split the dataset into multiple splits to produce heterogeneous task samples. We then transfer our perturbation function to the following target tasks: STL10 [5],
CIFAR-100 [19],
Stanford Dogs [16],
Stanford Cars [18],
Aircraft [25], and
CUB [44]. STL10 and CIFAR-100 are benchmark classification datasets of general categories, which are similar to the source dataset. The other datasets are for fine-grained classification and thus quite dissimilar from the source dataset. We resize the images of the fine-grained classification datasets to a common size. Lastly, for CIFAR-100, we sub-sample images from the original training set in order to simulate a data-scarce scenario (indicated by the prefix s-). See the Appendix for more detailed information on the datasets.

Table 2: Transfer to multiple networks. Source dataset is Tiny ImageNet and the target dataset is small-SVHN. For the Finetuning baseline, we match the source and target network, since it cannot be applied to different networks. Target networks: Conv4, Conv6, VGG9, ResNet20, ResNet44, and WRN-28-2.

Figure 5: (a-c) Adversarial robustness against the PGD attack with varying size of the radius ε on CUB with ResNet20 (ℓ∞, ℓ2, and ℓ1 attacks). (d) Calibration plot on Stanford Cars with ResNet20.
Baselines
We consider the following well-known stochastic regularizers to compare our model with. We carefully tuned the hyperparameters of each baseline with a holdout validation set for each dataset. Note that MetaPerturb does not have any hyperparameters.
Information Dropout:
This model [2] is an instance of the Information Bottleneck (IB) method [40], where the bottleneck variable is defined as a multiplicative perturbation, as with ours.
DropBlock:
This model [9] is a type of structured dropout [38] developed specifically for convolutional networks, which randomly drops out units in a contiguous region of a feature map together.
Manifold Mixup:
A recently introduced stochastic regularizer [42] that randomly pairs training examples and linearly interpolates between their latent features. We also compare with
Base and
Finetuning which have no regularizer added.
Results
Table 1 shows that our MetaPerturb regularizer significantly outperforms all the baselines on most of the datasets, with only 82 dimensions of parameters transferred. MetaPerturb is especially effective on the fine-grained datasets. This is because the generated perturbations help focus on the correct part of the input, by injecting the noise z or down-weighting the scale s on the distracting parts of the input. Our model also outperforms the baselines with significant margins when used along with fine-tuning from the source dataset (Tiny ImageNet). All these results demonstrate that our model can effectively regularize networks trained on unseen tasks from heterogeneous task distributions. Figure 4 shows that MetaPerturb also converges better than the baselines in terms of test loss and accuracy. We next validate whether our meta-learned perturbation can generalize to multiple network architectures.
Dataset and Networks
We use a small version of the SVHN dataset [27]. We use networks with 4 or 6 convolutional layers (Conv4 [43] and Conv6), VGG9 (a small version of VGG [37] used in [36]), ResNet20, ResNet44 [12], and Wide ResNet 28-2 [47].
Results
Table 2 shows that our MetaPerturb regularizer significantly outperforms the baselines on all the network architectures we considered. Note that although the source network is fixed to ResNet20 during meta-training, the statistics of its layers are very diverse, such that the shared perturbation function is learned to generalize over diverse input statistics. We conjecture that such sharing across layers is the reason MetaPerturb effectively generalizes to diverse target networks.
Figure 5(a-c) shows that, unlike the typical adversarial training methods based on the PGD attack [24] (the adversarial baselines in Figure 5(a-c)), MetaPerturb improves both clean accuracy and adversarial robustness against all the ℓ1, ℓ2, and ℓ∞ attacks, without explicit adversarial training. Figure 5(d) shows that MetaPerturb also improves the calibration performance in terms of the expected calibration error (ECE [26]) and the calibration plot, while the other regularizers do not.

Table 3: Ablation study on s-CIFAR100, Aircraft, and CUB (ResNet20).

Figure 6: The scale s at each block of ResNet20, for STL10, small CIFAR-100, Stanford Dogs, Stanford Cars, Aircraft, and CUB.

Figure 7: Visualization of the training loss surface [21] (CUB, ResNet20): (a) Base, (b) DropBlock, (c) Manifold Mixup, (d) MetaPerturb.
Qualitative analysis
Figure 6 shows the learned scale s across the layers for each dataset. We see that s for each channel and layer is generated differently for each dataset, according to what has been learned in the meta-training stage. Whereas the amount of penalization at the lower layers is nearly constant across the datasets, the amount of perturbation at the upper layers is highly variable; for example, the fine-grained datasets (e.g. Aircraft and CUB) do not penalize the upper-layer feature activations much. Figure 7 shows that the MetaPerturb and Manifold Mixup models have a flatter loss surface than the baselines. It is known that a flatter loss surface is closely related to generalization performance [15, 29], which partly explains why our model generalizes well.
Ablation study
(a) Components of the perturbation function:
In Table 3(a), we can see that both components of our perturbation function, the input-dependent stochastic noise z and the channel-wise scaling s, jointly contribute to the good performance of our MetaPerturb regularizer.
(b) Location of the perturbation function: Also, in order to find the appropriate location of the perturbation function, we tried applying it to various parts of the networks in Table 3(b) (e.g. only before pooling layers, or only at the top or bottom layers). We can see that applying the function to a smaller subset of layers largely underperforms applying it to all the ResNet blocks, as done with MetaPerturb.
(c) Source task distribution:
Lastly, in order to verify the importance of a heterogeneous task distribution, we compare with a homogeneous task distribution obtained by splitting the source dataset across instances, rather than across classes as done with MetaPerturb. We see that this results in a large performance degradation on the fine-grained classification datasets, since the lack of diversity prevents the perturbation function from effectively extrapolating to the fine granularity of the target tasks.
5 Conclusion

We proposed a lightweight perturbation function that can transfer the knowledge of a source task to any convolutional architecture and image dataset, bridging the gap between regularization methods and transfer learning. This is done by implementing the noise generator as a permutation-equivariant set function that is shared across different layers of deep neural networks, and meta-learning it. To scale up meta-learning to standard learning frameworks, we proposed a simple yet effective meta-learning approach, which divides the dataset into multiple subsets and trains the noise generator jointly over the subsets, to regularize networks with different initializations. With extensive experimental validation on multiple architectures and tasks, we show that MetaPerturb trained on a single source task and architecture significantly improves the generalization of unseen architectures on unseen tasks, largely outperforming advanced regularization techniques and fine-tuning. MetaPerturb is highly practical as it requires a negligible increase in parameter size, with no adaptation cost and no hyperparameter tuning. We believe that with such effectiveness, versatility, and practicality, our regularizer has the potential to become a standard tool for regularization.

Broader Impact
Our MetaPerturb regularizer effectively eliminates the need for retraining on the source task, because it can generalize to any convolutional neural architecture and to any image dataset. This versatility is extremely helpful for lowering the energy consumption and training time required in transfer learning, because in the real world there exist extremely diverse learning scenarios that we have to deal with. Previous transfer learning or meta-learning methods have not been flexible and versatile enough to solve those diverse large-scale problems simultaneously, but our model can efficiently improve the performance with a single meta-learned regularizer. Also, MetaPerturb efficiently extends previous meta-learning to standard learning frameworks by avoiding the expensive bi-level optimization, which reduces the computational cost of meta-training and will result in a further reduction in energy consumption and training time.
References
[1] https://tiny-imagenet.herokuapp.com/.
[2] A. Achille and S. Soatto. Information Dropout: Learning Optimal Representations Through Noisy Computation. In TPAMI, 2018.
[3] A. Alemi, I. Fischer, J. Dillon, and K. Murphy. Deep Variational Information Bottleneck. In ICLR, 2017.
[4] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas. Learning to learn by gradient descent by gradient descent. In NIPS, 2016.
[5] A. Coates, A. Ng, and H. Lee. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In AISTATS, 2011.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In ACL, 2019.
[7] C. Finn, P. Abbeel, and S. Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML, 2017.
[8] S. Flennerhag, P. G. Moreno, N. Lawrence, and A. Damianou. Transferring Knowledge across Learning Processes. In ICLR, 2019.
[9] G. Ghiasi, T.-Y. Lin, and Q. V. Le. DropBlock: A regularization method for convolutional networks. In NIPS, 2018.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In CVPR, 2014.
[11] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In ICML, 2017.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[13] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, 2015.
[14] Y. Jang, H. Lee, S. J. Hwang, and J. Shin. Learning What and Where to Transfer. In ICML, 2019.
[15] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. In ICLR, 2017.
[16] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, CVPR, 2011.
[17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. 2013.
[19] A. Krizhevsky, G. Hinton, et al. Learning Multiple Layers of Features from Tiny Images. 2009.
[20] H. Lee, T. Nam, E. Yang, and S. J. Hwang. Meta Dropout: Learning to Perturb Latent Features for Generalization. In ICLR, 2020.
[21] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the Loss Landscape of Neural Nets. In NIPS, 2018.
[22] Z. Li and D. Hoiem. Learning without Forgetting. In TPAMI, 2017.
[23] J. Long, E. Shelhamer, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. In CVPR, 2015.
[24] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. In ICLR, 2018.
[25] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-Grained Visual Classification of Aircraft. arXiv preprint arXiv:1306.5151, 2013.
[26] M. P. Naeini, G. Cooper, and M. Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In AAAI, 2015.
[27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. 2011.
[28] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring Generalization in Deep Learning. In NIPS, 2017.
[29] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In NIPS, 2017.
[30] A. Nichol, J. Achiam, and J. Schulman. On First-Order Meta-Learning Algorithms. arXiv e-prints, 2018.
[31] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine. Meta-Learning with Implicit Gradients. In NeurIPS, 2019.
[32] M. Ren, W. Zeng, B. Yang, and R. Urtasun. Learning to Reweight Examples for Robust Deep Learning. In ICML, 2018.
[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[34] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN Features off-the-shelf: an Astounding Baseline for Recognition. In CVPR, 2014.
[35] J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, and D. Meng. Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting. In NeurIPS, 2019.
[36] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In ICLR Workshop, 2014.
[37] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
[38] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15:1929-1958, 2014.
[39] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu. A Survey on Deep Transfer Learning. In ICANN, 2018.
[40] N. Tishby, F. C. Pereira, and W. Bialek. The Information Bottleneck Method. In Annual Allerton Conference on Communication, Control and Computing, 1999.
[41] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv preprint arXiv:1607.08022, 2016.
[42] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio. Manifold Mixup: Better Representations by Interpolating Hidden States. In ICML, 2019.
[43] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching Networks for One Shot Learning. In NIPS, 2016.
[44] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[45] Y. Wu and K. He. Group Normalization. In ECCV, 2018.
[46] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
[47] S. Zagoruyko and N. Komodakis. Wide Residual Networks. In BMVC, 2016.
[48] S. Zagoruyko and N. Komodakis. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. In ICLR, 2017.
[49] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep Sets. In NIPS, 2017.
[50] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond Empirical Risk Minimization. In ICLR, 2018.
Figure 8: Adversarial robustness against the PGD attack [24] with varying size of the radius ε, using the STL10 dataset and ResNet20: (a) ℓ∞, (b) ℓ2, and (c) ℓ1 robustness, compared with adversarial-training baselines trained with different ε.
Organization
The appendix is organized as follows. In Section A, we show additional results and analysis of the robustness and calibration experiments. In Section B, we visualize what the perturbations look like in the latent feature space. In Section C, we provide the details of the datasets, network architectures, and experimental setups.
A More Results and Analysis on Robustness and Calibration
Robustness
In Figure 8, we measure the adversarial robustness on an additional dataset, STL10 [5]. We use a multi-step PGD attack over a range of ε, where the inner learning rate is set proportionally to ε for the ℓ∞, ℓ2, and ℓ1 attacks. We observe that the baseline regularizers are not as robust against PGD attacks as our method, meaning that it is not easy to defend against PGD attacks without explicit adversarial training; our MetaPerturb, however, provides an efficient way of doing so. We also compare with adversarial training baselines, which take projected gradient descent steps at training time. See Figure 8 for the ε value used for adversarial training in each case. We can see that whereas adversarial training is beneficial for the adversarial accuracies, it largely degrades the clean accuracies. On the other hand, our MetaPerturb regularizer improves both clean accuracy and adversarial robustness over the base model, even without explicit adversarial training.
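For reference, the following is a minimal sketch of an ℓ∞ PGD attack of the kind used in this evaluation, assuming PyTorch; the number of steps and the step size are placeholders, since the exact values are stated above only relative to ε.

```python
# Hypothetical sketch of an L-infinity PGD attack (step count / step size are placeholders).
import torch

def pgd_linf(model, x, y, loss_fn, eps, step_size, num_steps=10):
    x_adv = x.clone().detach()
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()           # ascend the loss
            x_adv = x.clone() + (x_adv - x).clamp(-eps, eps)  # project onto the L-inf ball
            x_adv = x_adv.clamp(0.0, 1.0)                     # keep a valid image range
    return x_adv.detach()
```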
Calibration
Table 4: ECE on multiple datasets. Source and target network are ResNet20. TIN: Tiny ImageNet.

Figure 9: Calibration plots on the STL10, s-CIFAR100, Stanford Dogs, Stanford Cars, Aircraft, and CUB datasets using ResNet20.

In the main paper, we showed that the predictions with the MetaPerturb regularizer are better calibrated than those of the baselines. In this section, we provide more results and analysis of calibration on various datasets. First of all, calibration performance is frequently quantified with the Expected Calibration Error (ECE) [26]. ECE is computed by dividing the confidence values into multiple bins and averaging the gap between the actual accuracy and the confidence value over all the bins. Formally, it is defined as

    ECE = E_confidence [ | p(correct | confidence) − confidence | ].        (4)

Table 4 and Figure 9 show that MetaPerturb produces better-calibrated confidence scores than the baselines on most of the datasets. We conjecture that this is because the parameters of the perturbation function have been meta-learned to lower the negative log-likelihood (NLL) of the test set, similarly to temperature scaling [11] and other popular calibration methods. In other words, we argue that the learning objective of meta-learning is inherently good for calibration, because it learns to lower the test NLL.
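A minimal sketch of how ECE with equal-width confidence bins can be computed from Eq. (4), assuming PyTorch; the number of bins is a placeholder.

```python
# Hypothetical sketch of Expected Calibration Error with equal-width confidence bins.
import torch

def expected_calibration_error(probs, labels, num_bins=15):
    conf, pred = probs.max(dim=1)                # confidence and predicted class
    correct = pred.eq(labels).float()
    edges = torch.linspace(0.0, 1.0, num_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = (correct[in_bin].mean() - conf[in_bin].mean()).abs()
            ece += in_bin.float().mean() * gap   # weight by the fraction of samples in the bin
    return ece
```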
B Visualizations of Perturbation Function

In this section, we visualize the feature maps before and after passing through the perturbation function on various datasets. We use the ResNet20 network for visualization. We visualize the feature maps from the top to the bottom layers in order to see the different levels of layers. Although it is not very straightforward to interpret the results, we can roughly observe that the activation strengths are suppressed by the scale s, and we can see how the stochastic noise z transforms the original feature maps.
C Experimental Setup
C.1 Meta-training Dataset
Tiny ImageNet
This dataset [1] is a subset of the ImageNet [33] dataset, consisting of images from 200 classes, divided into training, validation, and test sets. We use the training set for source training, resizing the images and dividing the dataset into class-wise splits to produce heterogeneous task samples.
C.2 Meta-testing Datasets
STL10
This dataset [5] consists of classes of general objects such as airplane, bird, and car; it is similar to the CIFAR-10 dataset but has higher-resolution images. We resized the images for our experiments.
small CIFAR-100
This dataset [19] consists of classes of general objects such as beaver, aquarium fish, and cloud. In order to demonstrate that our model performs well on small datasets, we randomly sample a subset of examples from the whole training set and use this smaller set for meta-testing.

Figure 10: (a) Original image. (b-e) Left: feature map before passing through the perturbation; Center: generated noise; Right: feature map after passing through the perturbation. Examples are shown for Stanford Dogs, Stanford Cars, Aircraft, and CUB at various layers.
Stanford Dogs
This dataset [16] is for fine-grained image categorization and contains images of dog breeds from around the world, split into training and test sets. We resized the images for our experiments.
Stanford Cars
This dataset [18] is also for fine-grained classification, distinguishing between the makes, models, and years of various cars, e.g. 2012 Tesla Model S or 2012 BMW M3 coupe. It contains images from many classes of cars, assigned to training and test sets. We resized the images for our experiments.
Aircraft
This dataset [25] consists of images from different aircraft model variants (most of them airplanes), split into training and test sets. We resized the images for our experiments.
CUB
This dataset [44] consists of bird classes such as Black Tern, Blue Jay, and Palm Warbler. It has separate training and test images, and we did not use bounding box information in our experiments. We resized the images for our experiments.
small SVHN
The original dataset [27] consists of color images of digits. In our experiments, we use only a subsampled set of examples for training in order to simulate a data-scarce scenario.
C.3 Networks
We use six networks (Conv4 [43], Conv6, VGG9 [37], ResNet20 [12], ResNet44, and Wide ResNet 28-2 [47]) in our experiments. For Conv4, Conv6, and VGG9, we add our perturbation function to every convolution block, before the activation. For the ResNet architectures, we add our perturbation function to every residual block, before the last activation. To briefly describe the networks, let Ck denote a sequence of a convolutional layer with k channels, batch normalization, and ReLU activation, let M denote a max-pooling layer, and let FC denote a fully-connected layer. We provide an implementation of the networks in our code.
Conv4
This network is frequently used in the few-shot classification literature. This model can be described with
C64-M-C64-M-C64-M-C64-M-FC.
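To illustrate the notation, a minimal sketch of Conv4 in PyTorch is given below. The 3x3 kernel size and the use of LazyLinear for the final FC layer are assumptions made for this illustration; the perturbation function would be applied inside each convolution block, before the activation, as described above.

```python
# Hypothetical sketch of Conv4 (C64-M-C64-M-C64-M-C64-M-FC), assuming PyTorch.
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Ck block: conv with k channels, batch norm, ReLU
    # (MetaPerturb would be applied to the conv output, before the activation).
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU())

def conv4(num_classes, in_ch=3):
    return nn.Sequential(
        conv_block(in_ch, 64), nn.MaxPool2d(2),
        conv_block(64, 64), nn.MaxPool2d(2),
        conv_block(64, 64), nn.MaxPool2d(2),
        conv_block(64, 64), nn.MaxPool2d(2),
        nn.Flatten(), nn.LazyLinear(num_classes))  # FC; LazyLinear infers the flattened size
```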
Conv6
This network is similar to the Conv4 network, except that we increase the depth by adding two more convolutional layers. This model can be described with
C64-M-C64-M-C64-C64-M-C64-C64-M-FC.
VGG9
This network is a small version of VGG [37] with a single fully-connected layer at the end. This model can be described with
C64-M-C128-M-C256-C256-M-C512-C512-M-C512-C512-M-FC.
ResNet20
This network is used for the CIFAR-10 classification task in [12]. The network consists of residual block layers, each of which consists of multiple residual blocks, where each residual block consists of two convolution layers. Down-sampling is performed with a stride in the first convolution layer of a residual block layer, and is used at the second and the third residual block layers. Let ResBlk(n,k) denote a residual block layer with n residual blocks of channel k, and let GAP denote global average pooling. Then, the network can be described with
C16-ResBlk(3,16)-ResBlk(3,32)-ResBlk(3,64)-GAP-FC.
ResNet44
This network is similar to the ResNet20 network, but with more residual blocks in each residual block layer. The network can be described with
C16-ResBlk(7,16)-ResBlk(7,32)-ResBlk(7,64)-GAP-FC.
Wide ResNet 28-2
This network is a variant of ResNet which decreases the depth and increases the width of the conventional ResNet architecture. We use Wide ResNet 28-2, which has depth d = 28 and widening factor k = 2.
C.4 Experimental Details
Meta-training
We use an Adam optimizer [17] and decay the learning rate at fixed step milestones during meta-training. For the base regularization during training, we use weight decay and simple data augmentations such as random resizing and cropping and random horizontal flipping. In order to efficiently train multiple tasks, we distribute the tasks to multiple processing units, and each process has its own main-model parameters θ and perturbation function parameter φ. After one gradient step of the whole model, we share only the perturbation function parameters across the processes.
Meta-testing
We use the same configuration as in the meta-training stage. After meta-training is done, only the perturbation function parameter φ is transferred to the meta-testing stage.