MetaMixUp: Learning Adaptive Interpolation Policy of MixUp with Meta-Learning
Zhijun Mai, Guosheng Hu, Dexiong Chen, Fumin Shen, Heng Tao Shen
Abstract—MixUp is an effective data augmentation method that regularizes deep neural networks via random linear interpolations between pairs of samples and their labels. It plays an important role in model regularization, semi-supervised learning and domain adaptation. However, despite its empirical success, the deficiency of its random sample mixing has been poorly studied. Since deep networks are capable of memorizing the entire dataset, the corrupted samples generated by vanilla MixUp with a badly chosen interpolation policy degrade the performance of networks. To overcome the underfitting caused by corrupted samples, and inspired by meta-learning (learning to learn), we propose a novel technique of learning to mix up, namely MetaMixUp. Unlike vanilla MixUp, which samples the interpolation policy from a predefined distribution, this paper introduces a meta-learning based online optimization approach to dynamically learn the interpolation policy in a data-adaptive way. The validation-set performance used by meta-learning captures the underfitting issue and provides additional information to refine the interpolation policy. Furthermore, we adapt our method to pseudo-label based semi-supervised learning (SSL) along with a refined pseudo-labeling strategy. In our experiments, our method achieves better performance than vanilla MixUp and its variants under the supervised learning configuration. In particular, extensive experiments show that our MetaMixUp adapted to SSL greatly outperforms MixUp and many state-of-the-art methods on the CIFAR-10 and SVHN benchmarks under the SSL configuration.
Index Terms—Deep Learning, MixUp, Meta-learning, Regularization.
I. INTRODUCTION

Despite their striking success in many challenging tasks, deep neural networks have been shown to be prone to overfitting, especially when the number of annotated samples is scarce, as in weakly-supervised [1], [2] and semi-supervised learning [3], [4]. In addition, this is reflected in high generalization errors when deep CNNs overfit or memorize corrupted samples that have slight distributional shifts, also known as imperceptible adversarial perturbations [5]. These issues can degrade the prediction performance of deep learning based systems in practice. It is thus desirable to design effective regularization methods to control the model complexity and reduce the gap between training error and generalization error.

Recently, owing to the great advances of machine learning, remarkable regularization methods have been proposed. In addition to manually designed regularization architectures such as Shake-Shake regularization [6], adding noise to the deep model is a typical way to reduce overfitting and learn more robust abstractions, e.g., dropout [7] and randomized data augmentation [8]. A simple and effective method, called MixUp [9], has recently been proposed as a data augmentation scheme to address generalization problems. Specifically, it generates additional virtual samples during training via a simple linear interpolation of randomly picked pairs of training samples, as well as their labels. However, its interpolation policy (the weights for interpolating paired samples) is randomly chosen from a prior distribution (e.g., a Beta distribution) for each pair of samples at each iteration, which may lead to manifold intrusion and thus underfitting [10]. We observe that the original MixUp is not robust: when the generated virtual samples are adjacent to real samples of different categories, the corresponding virtual labels become ambiguous. Nevertheless, the original MixUp method does not take such ambiguities into account.
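As a concrete illustration, the random interpolation described above can be sketched in a few lines of NumPy; the Beta parameter `alpha` and the toy batch are illustrative choices, not values prescribed by the method:

```python
import numpy as np

def mixup_batch(x, y, alpha=1.0, rng=None):
    """Vanilla MixUp: mix each sample with a randomly permuted partner.

    A single coefficient lam ~ Beta(alpha, alpha) is drawn for the whole
    mini-batch; the policy is random, not data-adaptive.
    """
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)          # random interpolation policy
    perm = rng.permutation(len(x))        # random pairing within the batch
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix, lam

# Toy batch: 4 samples, 3 features, one-hot labels over 2 classes.
x = np.arange(12, dtype=float).reshape(4, 3)
y = np.eye(2)[[0, 1, 0, 1]]
x_mix, y_mix, lam = mixup_batch(x, y)
```

Because the mixed labels are convex combinations of one-hot vectors, each row of `y_mix` still sums to one; the ambiguity discussed above arises when such a soft label sits close to a real sample of another class.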
Therefore, carefully choosing an adequate interpolation policy to avoid underfitting is crucial to achieving promising performance. Learning a better interpolation policy for the MixUp technique is not trivial, since deep CNNs are prone to memorizing corrupted samples, and improving deep CNNs on corrupted samples and labels is clearly an under-studied problem worthy of exploration. AdaMixUp [10] proposes to assess the quality of the interpolation policy with well-selected triplet data and uses an additional intrusion discriminator to judge whether a sample generated by a policy collides with a real data point. Nevertheless, by relying on training an additional carefully designed network that estimates the interpolation as a supervision signal, their method has additional hyperparameters (e.g., triplet selection, a larger model architecture, more optimization parameters) to tune and can be hard to deploy on a new dataset or task. In this paper, we also propose a new theoretical perspective on MixUp by showing that its empirical risk is a lower bound of the gradient Lipschitz constant of a neural network. This observation not only helps to understand vanilla MixUp better but also highlights the underfitting issue caused by a naive choice of the interpolation policy. Our method is inspired by the recent success of meta-learning, a learning paradigm inspired by the cognitive processes of humans and animals, in which a model learns to learn better using a validation set as meta-data. This paper tackles the manifold intrusion/underfitting issue by applying meta-learning to MixUp to learn the interpolation policy in a data-adaptive way. Meta-learning has been shown to be powerful in learning data-adaptive rules and policies, such as initial neural network weights [11], optimization hyperparameters [12] and unsupervised learning rules [13], making models more general and adaptive to new datasets and tasks.
For our problem, our intuition is that a meta model with a random interpolation policy learns knowledge from meta-data and can provide instructive supervision for vanilla MixUp to refine the interpolation policy in a data-driven style. A reasonable interpolation policy for MixUp can help deep CNNs alleviate the manifold intrusion problem caused by corrupted labels and samples. Our method, dubbed MetaMixUp, consists of learning the interpolation policy of MixUp by adapting a meta-learning method in a data-adaptive way. Specifically, we aim to learn an interpolation policy that minimizes the expected loss on the training set. The meta model learns to discover a new data-driven interpolation policy from meta-data. The learned data-driven interpolation policy can be updated a few times taking into account the main model's feedback. Once the interpolation policy is learned, we turn the deep CNN from the meta stage to the main stage to minimize the learning objective, where the main stage controls the training procedure to learn from each mixed sample. At test time, the deep CNN makes predictions alone in the main stage. Instead of searching a discrete set of candidate interpolation policies, we relax the optimization via online gradient-based meta-learning to make it continuous, so that the interpolation policy can be optimized with respect to its validation-set performance by gradient descent. The data-adaptivity of gradient-based optimization, as opposed to selection from a prior distribution, allows MetaMixUp to achieve competitive or even better performance on different tasks. To our best knowledge, our method is the first that applies meta-learning to guide interpolation policy learning for the MixUp technique.
It tackles the manifold intrusion problem in a more direct and simple way and leads to better performance than the original MixUp and the recently proposed AdaMixUp on typical image classification benchmarks: ImageNet, MNIST, SVHN, Fashion-MNIST, CIFAR-10 and CIFAR-100 under the supervised configuration. To demonstrate its adaptation to semi-supervised tasks, our method extends MixUp to pseudo-label based methods [14] and further adopts an asynchronous pseudo-labeling strategy. The resulting semi-supervised method improves over the original pseudo-label based SSL method by a large margin and achieves performance comparable to state-of-the-art methods on CIFAR-10 and SVHN under the semi-supervised configuration. Furthermore, we apply MetaMixUp to a powerful MixUp-augmented SSL method called MixMatch [15] and improve the previous state-of-the-art results, which suggests that our MetaMixUp is complementary to other semi-supervised learning methods.

To sum up, we highlight our threefold contributions as follows.

1) We address the underfitting issue caused by a badly chosen interpolation policy in vanilla MixUp, and we introduce a new theoretical perspective, namely that the MixUp risk is a lower bound of the Lipschitz constant of the gradient of the neural network, to further the understanding of vanilla MixUp.

2) We propose a gradient-based meta-learning algorithm that guides the refinement of the interpolation policy of MixUp in a data-driven way. The policy is optimized with respect to validation-set performance by gradient descent, and we relax the optimization with an online approximation to improve training efficiency. We find that MetaMixUp outperforms both vanilla MixUp and AdaMixUp.

3) We extend our MetaMixUp and MixUp to semi-supervised learning tasks with an asynchronous pseudo-labeling strategy.
Through extensive experiments we show that our extensions achieve highly competitive results on CIFAR-10 and SVHN, which we attribute to their adaptation to other tasks.

The rest of this paper is organized as follows. In Section II, we review the literature relevant to our work. In Section III, a new perspective on MixUp is introduced, and the proposed MetaMixUp along with its extensions to SSL are presented in detail. We provide experimental results and analysis in Section IV, and summarize this paper in Section V.

II. RELATED WORK
A. Regularization
Regularization is an ongoing subject in machine learning and has been widely studied. It refers to the general approach of penalizing the amount of information a neural network contains in order to keep its parameters simple [16]. Constraints and perturbations on the model keep it from overfitting the training data and thus, hopefully, make it generalize better to test data. In particular, a common regularization technique is to add a loss term that penalizes the L2 norm of the model parameters. Under simple gradient descent optimizers (though not under adaptive ones such as Adam [17]), this loss term is equivalent to weight decay, which exponentially decays the weight values toward zero during training. Data augmentation techniques are commonly used regularizers that leverage additional samples generated by appropriate domain-specific transformations. For instance, random cropping, flipping and rotating are typical data augmentation methods for image data [8], [18]. Dropout is another very helpful regularizer for avoiding overfitting, randomly dropping units from the neural network during training [7]. In contrast to these data-independent methods, AutoAugment [19] proposes a data-adaptive way to search for the best data augmentation policy in a huge space of policies, which are combinations of many sub-policies. On the other hand, instead of operating on a single image sample, MixUp [9] and between-class learning [20] augment training data points by interpolating multiple examples and labels. Manifold Mixup [21] leverages semantic interpolations in random layers as an additional training signal to train neural networks. Nonetheless, their interpolation policies are predefined and not data-driven. Our approach is a data-driven extension of MixUp via meta-learning, which leverages vicinal relations between examples and can alleviate the manifold intrusion problem introduced in [10]. It is closely related to AdaMixUp [10], which also learns the mixing policy from data.
While AdaMixUp requires training an additional network to infer the policy, as well as an intrusion discriminator with well-selected triplets, our method is applied directly to the original MixUp method without adding further components (e.g., a carefully designed discriminator) to the model.
B. Meta-learning
Meta-learning methods date back to the 1990s [22], [23] and have recently resurged with various techniques focused on learning how to learn and thus quickly adapting to new information [24]. Meta-learning approaches, many of which have been proposed to solve the few-shot learning problem, can be broadly categorized into three groups.
Gradient-based methods [11], [24] use gradient descent to adapt the model parameters.
Nearest-neighbor methods [25] learn a prediction rule over the embeddings based on the distance to the nearest class mean.
Neuron-based methods [26], [27] learn a meta procedure for adapting the connections between neurons to different tasks. Our method is tightly related to the gradient-based meta-learning algorithm MAML [11]. MetaMixUp also implicitly learns how to quickly adapt to new datasets and tasks through a gradient-based meta-learning algorithm. Unlike MAML, our optimization procedure works in an online fashion rather than relying on heavy offline training stages. Related to meta-learning, recent hashing research has focused on learning to hash [28], whose goal is to learn data-dependent hash functions that generate more compact codes to achieve good search accuracy [29], [30]. This paper is more similar to the optimization-based meta-learning method for sample reweighting [31], which focuses on imbalanced classification and noisy label problems.
C. Hyperparameter optimization
The performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. Recent interest in complex and computationally expensive machine learning models with many hyperparameters, such as automated machine learning (AutoML) frameworks and deep neural networks, has resulted in a resurgence of research on hyperparameter optimization (HPO). The current gold-standard methods for hyperparameter selection are black-box optimization methods. Due to the non-convex nature of the problem, global optimization algorithms are usually applied. The standard baselines involve training tens of models whose hyperparameter configurations are selected randomly and non-adaptively [32], [33], e.g., grid or random search. Moreover, the majority of recent work in this growing area focuses on Bayesian hyperparameter optimization [34], with the goal of optimizing hyperparameter configuration selection in an iterative fashion. However, recent gradient-based techniques for HPO have significantly increased the number of hyperparameters that can be optimized [35]. In this way, it is now possible to tune large-scale weight vectors as hyperparameters associated with neural networks. Such an approach is well suited to the MixUp technique, where the interpolation weight of each sample pair is treated as a hyperparameter across a set of training episodes.
D. Semi-Supervised learning
Semi-supervised learning has been extensively studied and comprises a large variety of approaches. Typical successful SSL methods involve some form of consistency regularization, such as the Π-model [36], VAT [3] and Mean Teacher [4], or simple label propagation such as pseudo-labeling [14] or, more generally, self-training [37], [38]. Our method is most similar to pseudo-labeling based methods. Basically, pseudo-labels are the current predictions of the classifier assigned to unlabeled examples. [38] proposed leveraging multiple networks to asymmetrically give pseudo-labels to unlabeled samples. [39] proposed moving-average centroid alignment to reduce the bias caused by false pseudo-labels. However, these methods do not consider the stability of the pseudo-labels. We extend MixUp and our MetaMixUp to SSL tasks with an asynchronous pseudo-labeling strategy to stabilize training. Another recently proposed method, MixMatch [15], works by guessing low-entropy labels for unlabeled examples under multiple data augmentations of each sample; likewise, it combines MixUp with consistency regularization.

III. DATA-ADAPTIVE MIXUP VIA META-LEARNING
In this section, we first introduce a new perspective on MixUp. We then detail our algorithm for learning the interpolation policy of MixUp. Finally, we adapt the proposed MetaMixUp to supervised and semi-supervised tasks, respectively.
A. MixUp as a Lower Bound of the Gradient Lipschitz Constant
MixUp, originally proposed by [9], augments the training set by linearly interpolating a random pair of examples and their corresponding labels selected in a mini-batch through permutation:

$$\tilde{x}_i = \lambda x_i + (1-\lambda) x_j, \quad \tilde{y}_i = \lambda y_i + (1-\lambda) y_j, \qquad (1)$$

where $(x_i, y_i)$ and $(x_j, y_j)$ are two data-target samples randomly drawn from the training set, and $\lambda \in [0, 1]$ is the interpolation weighting coefficient. The objective of a supervised problem then becomes minimizing the empirical risk over the MixUp-generated samples.

Despite the empirical effectiveness of MixUp, how it controls the smoothness of a neural network has hardly been investigated. As discussed in [10], a bad interpolation coefficient $\lambda$ can lead to underfitting caused by manifold intrusion. This occurs when a MixUp-generated sample collides with an existing real example whose label differs from the interpolated pair's, and thus leads to performance degradation. Here, we consider MixUp from a regularization point of view and show that its risk is a lower bound of the Lipschitz constant of the gradient of the neural network.

While many previous works have controlled the smoothness of a neural network by controlling its Lipschitz constant [40], [41], [3], we consider here controlling a stronger condition, namely the Lipschitz constant of its (sub)gradient. Specifically, we assume that the predictive function $f : \mathbb{R}^d \to \mathbb{R}$ is differentiable and that its gradient is $\kappa$-Lipschitz continuous:

$$\forall x, x' \in \mathbb{R}^d: \quad \|\nabla f(x) - \nabla f(x')\| \le \kappa \|x - x'\|. \qquad \text{(P1)}$$

On the other hand, we consider the following inequality:

$$|f(\lambda x + (1-\lambda) x') - [\lambda f(x) + (1-\lambda) f(x')]| \le \frac{\lambda(1-\lambda)\kappa}{2} \|x - x'\|^2, \qquad \text{(P2)}$$

[Fig. 1. Computation graph of our MetaMixUp: 1) forward the mixed image through the meta model; 2) backward to update the weights ($\theta \to \theta'$ by a gradient descent step); 3) forward the validation inputs; 4) backward to update the interpolation policy via $\nabla_\lambda$.]

where $\lambda \in [0, 1]$. Under the MixUp setting, $x$ and $x'$ in (P2) can represent $x_i$ and $x_j$ in (1). Note that the left-hand side of (P2) is equivalent to the empirical risk of MixUp after replacing the $\ell$ loss with a general loss function and the predictions $f(x)$ and $f(x')$ with their true labels; the MixUp loss can thus be considered a proxy of the left-hand side.

We now state the following proposition, which builds the relation between the Lipschitz continuity of the gradient and the empirical risk of MixUp.

Proposition 1 (Link between MixUp and gradient Lipschitz continuity). Property (P1) implies (P2).

Proof. For all $x$ and $x'$ in $\mathbb{R}^d$ we have

$$f(\lambda x + (1-\lambda) x') = f(x') + \lambda \int_0^1 \big\langle \nabla f(\lambda t x + (1-\lambda t) x'),\, x - x' \big\rangle \, dt$$
$$= f(x') + \lambda [f(x) - f(x')] + \lambda \left[ \int_0^1 \big\langle \nabla f(\lambda t x + (1-\lambda t) x'),\, x - x' \big\rangle \, dt - \big(f(x) - f(x')\big) \right]. \qquad (2)$$

Therefore,

$$|f(\lambda x + (1-\lambda) x') - (\lambda f(x) + (1-\lambda) f(x'))|$$
$$= \lambda \left| \int_0^1 \big\langle \nabla f(\lambda t x + (1-\lambda t) x') - \nabla f(t x + (1-t) x'),\, x - x' \big\rangle \, dt \right|$$
$$\le \lambda \int_0^1 \left| \big\langle \nabla f(\lambda t x + (1-\lambda t) x') - \nabla f(t x + (1-t) x'),\, x - x' \big\rangle \right| dt$$
$$\le \lambda \int_0^1 \|\nabla f(\lambda t x + (1-\lambda t) x') - \nabla f(t x + (1-t) x')\| \, \|x - x'\| \, dt$$
$$\le \lambda \int_0^1 (1-\lambda) t \kappa \|x - x'\|^2 \, dt = \frac{\lambda(1-\lambda)\kappa}{2} \|x - x'\|^2, \qquad (3)$$

where the second inequality follows from the Cauchy–Schwarz inequality and the third from property (P1).

This proposition suggests that controlling the Lipschitz constant of the gradient necessarily requires minimizing the MixUp loss. However, when $x$ is far from
$x'$, the mixing policy $\lambda$ has a much greater effect on the Lipschitz constant. Therefore, minimizing the MixUp loss with a badly chosen interpolation policy $\lambda$ cannot help control the Lipschitz constant, and instead leads to unexpected degradation. This observation shows the importance of elaborating a smarter way to choose $\lambda$, especially for dealing with distant pairs.

B. MetaMixUp: Learning a Data-Driven Interpolation Policy
To solve the above problem, we propose MetaMixUp, a meta-learning based method that optimizes the interpolation policy of MixUp via online optimization. Unlike the original MixUp in Eq. (1), which uses a predefined distribution for the interpolation coefficient $\lambda$ and a single value per mini-batch, we use a different interpolation coefficient $\lambda_i$ for each pair in the mini-batch to improve diversity and make all coefficients learnable. Our target is to tackle the manifold intrusion problem by learning an adaptive interpolation policy, rather than directly using the fixed interpolation coefficients of MixUp. Specifically, MetaMixUp is defined as:

$$\tilde{x}_i = \lambda_i x_i + (1-\lambda_i) x_j, \quad \tilde{y}_i = \lambda_i y_i + (1-\lambda_i) y_j, \qquad (4)$$

where $\lambda_i$ is optimized via meta-learning, i.e., by optimizing a meta-objective on a validation set [11], [31].

The optimal weights of the network are given by minimizing the loss function over the training set $D = \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^N$:

$$\theta^*(\lambda) = \arg\min_\theta \frac{1}{N} \sum_{i=1}^N \ell(f(\tilde{x}_i; \theta), \tilde{y}_i; \lambda_i), \qquad (5)$$

where $\theta$ denotes the network parameters. Analogous to architecture search using progressive evolution [42], the validation-set performance is treated as fitness. The optimal $\lambda$ is then obtained on the validation set $D_v = \{(x_i, y_i)\}_{i=1}^M$ via

$$\lambda^* = \arg\min_{\lambda \in [0,1]} \frac{1}{M} \sum_{i=1}^M \ell(f_v(x_i; \theta^*(\lambda)), y_i), \qquad (6)$$

where $f_v$ denotes the meta network, which has the same architecture as $f$; $f$ is used for prediction, while $f_v$ is used only to optimize $\lambda_i$. Similar to other meta-learning methods, in which the parameters of a network can be quickly adapted to a task guided by a meta-objective, our meta network ($f_v(\theta) = f(\theta; \lambda_i)$) aims to search for the optimal $\lambda_i$ to interpolate the training samples, but using gradient descent. Optimizing (6) is a bilevel optimization problem [43] with $\lambda$ as the upper-level variable and $\theta$ as the lower-level variable.
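The interpolation of Eq. (4) differs from Eq. (1) only in that $\lambda$ becomes a per-sample vector. A minimal NumPy sketch of this per-pair mixing (the pairing, coefficients and toy data are illustrative, not learned values):

```python
import numpy as np

def metamixup_batch(x, y, lam, perm):
    """MetaMixUp interpolation (Eq. (4)): one coefficient lam[i] per
    sample pair, broadcast over the feature dimensions."""
    lam_x = lam.reshape(-1, 1)            # (B,) -> (B, 1) for broadcasting
    x_mix = lam_x * x + (1.0 - lam_x) * x[perm]
    y_mix = lam_x * y + (1.0 - lam_x) * y[perm]
    return x_mix, y_mix

# Toy batch: sample i has every feature equal to i, one-hot labels.
x = np.ones((4, 3)) * np.arange(4).reshape(-1, 1)
y = np.eye(4)
perm = np.array([1, 0, 3, 2])                 # pairing within the batch
lam = np.array([0.9, 0.5, 0.2, 0.7])          # per-sample policy
x_mix, y_mix = metamixup_batch(x, y, lam, perm)
# e.g. sample 0 becomes 0.9*0 + 0.1*1 = 0.1 in every feature
```

In MetaMixUp the vector `lam` is exactly the quantity updated by the meta-gradient step described next, rather than being drawn from a fixed prior.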
The nested formulation also arises in gradient-based hyperparameter optimization [35], which is relevant in the sense that the interpolation policy can be regarded as a special type of hyperparameter, although its dimensionality is substantially higher than that of scalar-valued hyperparameters such as the learning rate, and it is harder to optimize.

Given the original MixUp examples, at the $t$-th step we first take a step forward with the meta network to update the model weights $\theta$:

$$\theta^{t+1} = \theta^t - \eta \nabla_\theta \frac{1}{n} \sum_i \ell(f_v(\tilde{x}_i; \theta), \tilde{y}_i), \qquad (7)$$

where $\eta$ is the learning rate. Then, the ideal optimal $\lambda^*$ can be calculated on the validation set by

$$\lambda^* = \arg\min_{\lambda \in [0,1]} \frac{1}{M} \sum_{i=1}^M \ell(f_v(x_i; \theta^{t+1}), y_i). \qquad (8)$$

However, solving the above objective exactly can be prohibitive due to the expensive inner optimization. We therefore propose a simple approximation scheme with some relaxation to make it more scalable in practice. We use an online approximation that performs only a single gradient descent step on the validation set, without solving the inner optimization completely by training until convergence.

The generalization performance of the model is measured with a validation loss based on an unregularized meta model; hence the value of the loss depends only on elementary parameter updates [44]. The gradient of the validation loss with respect to the interpolation policy $\lambda$ is:

$$\nabla_\lambda \ell(f_v, D_v) = \frac{\partial}{\partial \lambda} \frac{1}{m} \sum_{i=1}^m \ell(f_v(x_i; \theta^{t+1}), y_i), \qquad (9)$$

where $m$ denotes the batch size. To achieve our objective, we only consider the influence of the interpolation policy on the current elementary parameter update, $\nabla_\lambda \ell(f_v, D_v)$. The interpolation policy update is therefore:

$$\lambda := \lambda - \alpha \nabla_\lambda \ell(f_v, D_v), \qquad (10)$$

where $\alpha$ is the step size for updating $\lambda$.
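The single-step lookahead of Eqs. (7)–(10) can be illustrated on a toy one-parameter least-squares model. The finite-difference gradient below stands in for the exact $\nabla_\lambda$ that an autodiff framework would provide, and all data values and step sizes are illustrative choices:

```python
def lookahead_val_loss(lam, w, xi, yi, xj, yj, xv, yv, eta):
    """Meta step (Eq. (7)): mix one pair with coefficient lam, take one
    SGD step on the meta model, and return the validation loss (Eq. (8))."""
    x_mix = lam * xi + (1 - lam) * xj
    y_mix = lam * yi + (1 - lam) * yj
    grad_w = 2 * (w * x_mix - y_mix) * x_mix   # d/dw of the squared error
    w_meta = w - eta * grad_w                  # one-step lookahead weights
    return (w_meta * xv - yv) ** 2             # validation loss

# Toy data: one training pair, one validation point, scalar weight.
xi, yi, xj, yj = 1.0, 1.0, 3.0, 0.0
xv, yv = 2.0, 0.5
w, eta, alpha, lam = 0.0, 0.1, 1.0, 0.5

# Gradient of the validation loss w.r.t. lam (Eq. (9)), central differences.
eps = 1e-3
g = (lookahead_val_loss(lam + eps, w, xi, yi, xj, yj, xv, yv, eta)
     - lookahead_val_loss(lam - eps, w, xi, yi, xj, yj, xv, yv, eta)) / (2 * eps)
lam_new = lam - alpha * g                      # policy update (Eq. (10))

before = lookahead_val_loss(lam, w, xi, yi, xj, yj, xv, yv, eta)
after = lookahead_val_loss(lam_new, w, xi, yi, xj, yj, xv, yv, eta)
```

Updating $\lambda$ along the negative meta-gradient lowers the post-update validation loss, which is precisely the feedback signal the online approximation exploits.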
Since $\lambda$ is free to span the entire set of real numbers, we project $\lambda$ back to $[0, 1]$ with a sigmoid function:

$$\lambda^* := \mathrm{sigmoid}(\lambda). \qquad (11)$$

The complete computation graph of MetaMixUp is summarized and visualized in Fig. 1. Given the refined interpolation policy, we re-mix the training examples and update the network weights in the training stage. It is straightforward to adopt MetaMixUp in supervised learning (SL); we detail the outline of our MetaMixUp algorithm in Algorithm 1.

Algorithm 1 MetaMixUp for supervised learning

Input: Training data $D$, validation data $D_v$.
Parameters: Deep neural network $\Phi(\theta)$, batch size $B$, learning rate $\eta$, step size $\alpha$.
Output: Deep neural network $\Phi(\theta)$ and $\lambda^*$.
for $t = 1, 2, \dots, Iter_{max}$ do
    Randomly initialize $\lambda = \{\lambda_i\}_{i=1}^B$;
    Turn the network to the meta stage $\Phi(\theta) \to \Phi'(\theta)$;
    MixUp examples with $\lambda$ to construct $\tilde{D}$;
    Update $\theta' = \theta - \eta \nabla_\theta \ell(\Phi'(\theta), \tilde{D})$;
    Update $\lambda^* = \lambda - \alpha \nabla_\lambda \ell(\Phi'(\theta'), D_v)$;
    MixUp examples with the updated $\lambda^*$ to reconstruct $\tilde{D}$;
    Update the network $\theta := \theta - \eta \nabla_\theta \ell(\Phi(\theta), \tilde{D})$;
end for
return $\Phi(\theta)$ and $\lambda^*$

C. Extension of MetaMixUp to Semi-Supervised Learning
It is non-trivial to apply MixUp and its variants to semi-supervised learning (SSL) tasks, since MixUp performs interpolation in both the data and label spaces, which is not applicable to unlabeled data. In this section, we present the strategy for adapting MetaMixUp to SSL.

Before presenting the SSL extension of MetaMixUp, we first introduce the notation. We denote by $D = L \cup U$ the entire training set, with a relatively small labeled set $L = \{(x_i, y_i) \mid i = 1, 2, \dots, L\}$ and a large unlabeled set $U = \{x_i \mid i = L+1, \dots, L+U\}$. Considering a $C$-class classification problem, each $y_i = [y_i^1, y_i^2, \dots, y_i^C]^T \in \{0, 1\}^C$ denotes the corresponding one-hot true label, such that $y_i^j = 1$ if $x_i$ belongs to the $j$-th class and $y_i^j = 0$ otherwise. Let $N = L + U$ be the total number of training samples; usually $L \ll U$.

As the main challenge in adapting MixUp and MetaMixUp to SSL is the lack of labels for unlabeled data, pseudo-labeling is a natural choice as the basic SSL method for MixUp. Typically, pseudo-labeling based methods simply treat high-confidence predictions as true labels, which yields high performance in practice. We employ the standard cross-entropy loss as the classification loss. For a pseudo-label based SSL method, the loss function is

$$\ell(X, \bar{Y}; \theta) = \frac{1}{N} \sum_{i=1}^N \ell(f(x_i; \theta), \bar{y}_i), \qquad (12)$$

where $\bar{Y} = \{\bar{y}_i\}_{i=1}^N$ denotes all the true or pseudo labels for the training set $X = \{x_i\}_{i=1}^N$. If $x_i \in L$, $\bar{y}_i$ is fixed to its true label vector $\bar{y}_i = y_i$ throughout the entire training process. For an unlabeled training sample $x_i \in U$, $\bar{y}_i$ is the label vector estimated by the network at the current iteration. Utilizing pseudo-labels, we can implicitly generate the 'augmented' set $\tilde{D} = \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^N$ during training; unsupervised MetaMixUp can thus be transformed into the supervised one.

Asynchronous Pseudo Labeling.
A commonly known issue of pseudo-label based methods is that incorrect pseudo-labels of newly labeled examples can propagate errors and thus degrade performance. To solve this issue, we propose an asynchronous strategy, named asynchronous pseudo labeling (APL), to improve the quality of the deduced pseudo labels for unlabeled data and thus stabilize training. Specifically, to filter out unconfident pseudo-labels, we predefine a threshold $\sigma$ such that all unlabeled samples whose maximal prediction probability is below $\sigma$ are excluded from the back propagation of the loss function. This threshold mitigates the influence of uncertain predictions on the unlabeled samples. Accordingly, unlabeled training samples are dynamically relabeled with more accurate labels, since easy examples are generally predicted with high confidence while those with low confidence are more likely to be hard examples. Instead of using a constant threshold, we asynchronously decrease the threshold so that more unlabeled data become labeled in later epochs, as shown to be effective in [38]. The threshold is decreased by $\sigma_d$ every $K$ epochs, so that at epoch $t$ it is given by $\sigma_t = \sigma_0 - \sigma_d \times [t/K]$, where $[\cdot]$ denotes the round-off operation. We set the initial threshold $\sigma_0 = 0.$ and $K = 30$ in all experiments to avoid overly frequent updates of the pseudo-labels. The framework is summarized in Algorithm 2.

IV. EXPERIMENTS
We study here the regularization properties of our MetaMixUp on typical image classification benchmarks. The aim of the following experiments is threefold. First, we investigate the impact of the vanilla interpolation policy of the original MixUp on the quality of the solution in multiclass classification problems. Second, we test the proposed MetaMixUp in the contexts of supervised and semi-supervised learning. Finally, we contrast the MetaMixUp technique against classical MixUp approaches for learning models with better regularization properties.
A. Datasets
MNIST and Fashion.
Both datasets contain 60,000 training and 10,000 test images (28 × 28) of 10 classes. Both datasets
Algorithm 2 MetaMixUp for Semi-Supervised Learning

Input: Labeled training data $D_L$, unlabeled training data $D_{UL}$, validation data $D_v$.
Parameters: Deep neural network $\Phi(\theta)$, batch size $B$, learning rate $\eta$, step size $\alpha$, pseudo-label threshold $\sigma$, decrease step $\sigma_d$, maximum iterations $Iter_{max}$.
Output: Deep neural network $\Phi(\theta)$ and $\lambda^*$.
for $t = 1, 2, \dots, Iter_{max}$ do
    Randomly initialize $\lambda = \{\lambda_i\}_{i=1}^B$;
    Label $\bar{y}_j = \arg\max \Phi(x_j, \theta)$ if $\Phi(x_j, \theta) > \sigma$;
    if $t$ reaches an update period then
        $\sigma = \sigma - \sigma_d$;
    end if
    Turn the network to the meta stage $\Phi(\theta) \to \Phi'(\theta)$;
    MixUp examples with $\lambda$ to construct $\tilde{D}_L$ and $\tilde{D}_{UL}$;
    Calculate MetaLoss $= \bar{L}_s(\Phi'(\theta), \tilde{D}_L) + \bar{L}_{us}(\Phi'(\theta), \tilde{D}_{UL})$;
    Update $\theta' = \theta - \eta \nabla_\theta \mathrm{MetaLoss}$;
    Update $\lambda^* = \lambda - \alpha \nabla_\lambda \ell(\Phi'(\theta'), D_v)$;
    MixUp examples with $\lambda^*$ to reconstruct $\tilde{D}_L$ and $\tilde{D}_{UL}$;
    Calculate Loss $= L_s(\Phi(\theta), \tilde{D}_L) + L_{us}(\Phi(\theta), \tilde{D}_{UL})$;
    Update $\theta := \theta - \eta \nabla_\theta \mathrm{Loss}$;
end for
return $\Phi(\theta)$ and $\lambda^*$

are used for supervised learning under the standard training and test splits.

CIFAR-10 and CIFAR-100.
CIFAR-10 and CIFAR-100 have 10 and 100 classes of natural images (32 × 32), respectively. For supervised learning, we use the standard data split for training (50,000) and test (10,000). For semi-supervised learning, we follow [45], where 1,000 or 4,000 images (100 or 400 per class) in CIFAR-10 are selected from the training set as the labeled training examples, and the rest serve as unlabeled data. CIFAR-100 is only used for the supervised learning task.

SVHN.
The Street View House Numbers (SVHN) dataset [46] contains real-world 32 × 32 images of house numbers, with 73,257 training and 26,032 test images. We use the standard training/test split for supervised learning. For semi-supervised learning, we follow [45] and randomly select 50 or 100 samples per class from the training set as labeled data, with the rest as unlabeled data.

ImageNet.
ImageNet-2012 [47] is a classification dataset with 1.3 million training images, 50,000 test images, and 1,000 classes. We use the standard data split for supervised training and follow the data processing approaches used in AdaMixUp [10], in which the crop size is 100 × 100 instead of 224 × 224 due to limited computational resources.
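The per-class labeled/unlabeled splits described above can be sketched as follows. This is an illustrative NumPy helper (the function name and toy label vector are our own, not from the released code); it selects a fixed number of labeled examples per class and treats the rest as unlabeled, as in the protocol of [45]:

```python
import numpy as np

def split_labeled_unlabeled(labels, n_per_class, seed=0):
    """Select n_per_class labeled examples per class; all remaining
    indices are treated as unlabeled (semi-supervised protocol)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    labeled_idx = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        labeled_idx.extend(rng.choice(idx, size=n_per_class, replace=False))
    labeled_idx = np.sort(np.array(labeled_idx))
    unlabeled_idx = np.setdiff1d(np.arange(len(labels)), labeled_idx)
    return labeled_idx, unlabeled_idx

# Toy example: 10 classes with 500 samples each; keep 100 labels per class,
# which mirrors the CIFAR-10 "1K labels" setting in miniature.
toy_labels = np.repeat(np.arange(10), 500)
lab, unlab = split_labeled_unlabeled(toy_labels, n_per_class=100)
```

The same helper covers the SVHN settings by changing `n_per_class` to 50 or 100.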
B. Implementation Details
To make a fair comparison and follow the settings of existing works, we train all the models from scratch. We set the step size α to 5.0 and use SGD as the optimizer for all experiments, with momentum 0.9 and weight decay 10⁻⁴. We train our models on two NVIDIA GTX 1080 Ti GPUs. Supervised Learning (SL).
Following [10], a 3-layer CNN is used for the tasks on MNIST and Fashion. For CIFAR-10, CIFAR-100, SVHN and ImageNet, the PreAct-ResNet-18 [48], PreAct-ResNet-34 and Wide-ResNet-28-10 architectures are used for all the considered methods. We set the batch size to 128, the initial learning rate to 0.1 followed by cosine annealing [49], and the number of epochs to 600. We randomly sample 1,000 images (100 per class) from the original training set to construct our meta validation set. For data augmentation, we only use horizontal flipping, following [9], for all training datasets.
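The cosine-annealed learning rate used above can be sketched as a single-cycle schedule in the style of [49] (the function name and the lr_min default are our assumptions; the paper only states the initial rate of 0.1 and the annealing):

```python
import math

def cosine_annealed_lr(epoch, total_epochs, lr_init=0.1, lr_min=0.0):
    """Single-cycle cosine annealing: decays from lr_init at epoch 0
    to lr_min at epoch == total_epochs."""
    cos = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_init - lr_min) * cos

# The schedule starts at 0.1, passes 0.05 at the midpoint, and ends near 0.
lrs = [cosine_annealed_lr(e, total_epochs=600) for e in range(601)]
```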
Semi-Supervised Learning (SSL)
We follow the unified evaluation platform [45] for SSL to make fair comparisons. We set the batch size to 100 for both CIFAR-10 and SVHN. All models are trained for 200 epochs using data augmentation (horizontal flip and 2-pixel translation) following [45]. We select 500 images (50 per class) from the training set as our meta validation set, and use a fixed step size σ_d for the asynchronous decrease of the pseudo-labeling threshold (its sensitivity is analyzed in Table V). Following [50], the learning rate starts from 0.1 and is divided by 10 after 60, 120 and 180 epochs, respectively. For a fair comparison, we run all methods with WideResNet-28-2 [50], and the average accuracy is obtained over 5 runs.

C. Results
Results of Supervised Learning.
We compare MetaMixUp with the baseline method (w/o MixUp) and two counterpart methods (MixUp and AdaMixUp) on five popular supervised databases, training a variety of residual networks for each method. The error rates presented in Table I show that our MetaMixUp substantially outperforms the baseline (w/o MixUp), MixUp and AdaMixUp on all five datasets. Surprisingly, training with MixUp does not always improve performance. For instance, the baseline (w/o MixUp) beats MixUp (0.54% vs. 0.59% on MNIST and 52.84% vs. 55.06% on ImageNet). The error-rate increase of MixUp over the baseline (w/o MixUp) on ImageNet is more distinct, and is also observed with the PreActResNet34 and Wide-ResNet-28-10 architectures (3.01% and 2.55% top-1 error-rate increases, respectively). This strongly indicates that MixUp, with its default interpolation policy, is not robust across datasets and tasks, and that a smarter refinement method is needed.

As described in [9], the interpolation policy λ of vanilla MixUp is drawn from a Beta distribution, λ ∼ Beta(α, α), where α is an extra hyperparameter that needs to be tuned. With α = 1.0, this is equivalent to sampling from a uniform distribution U(0, 1). To investigate the impact of α and λ on performance, we tune the hyperparameter α, which determines the distribution of λ. The resulting error rates are presented in Table II. We find that vanilla MixUp behaves differently across datasets for a given α; in particular, original MixUp actually increases the error rate on MNIST and SVHN compared to the baseline. Conversely, this suggests that mixing samples in a data-adaptive way tends to have a positive impact on generalization performance. To further study how the interpolation policy impacts generalization, we directly fix λ for each pair of training samples throughout training; in this scenario, performance becomes even more unstable, and can be worse than training without MixUp.
The findings above substantially support our goal of directly optimizing λ to obtain a data-adaptive interpolation policy for better MixUp. As shown, the proposed MetaMixUp significantly improves vanilla MixUp without the need to tune the interpolation policy distribution. Our MetaMixUp improves vanilla MixUp by a large margin on easier tasks (0.38% vs. 0.59% on MNIST and 5.15% vs. 7.31% on Fashion) as well as on harder tasks: SVHN (2.96% vs. 3.83%), CIFAR-10 (3.12% vs. 4.57%) and CIFAR-100 (20.36% vs. 21.35%) with PreActResNet18. Compared with the original MixUp, the improvement is even more pronounced on ImageNet, where MetaMixUp outperforms the original version by 7.71% top-1 accuracy. This improvement is consistent across architectures: MetaMixUp (top-1 error 47.35%, top-5 error 24.43%) outperforms the original MixUp (top-1 error 55.06%, top-5 error 31.32%) with PreActResNet34, and reaches top-1 error 46.38% and top-5 error 23.91% versus top-1 error 52.96% and top-5 error 29.28% with Wide-Resnet-28-10. These comparisons suggest the significance of learning a suitable interpolation policy. As a competitive counterpart of MetaMixUp, AdaMixUp has a generator that outputs the interpolation policy and a discriminator that judges the quality of the generated policy. As Table I shows, MetaMixUp distinctly outperforms AdaMixUp, even when the latter is strengthened with a discriminator: 0.38% vs. 0.49% on MNIST, 5.15% vs. 6.21% on Fashion, 2.96% vs. 3.12% on SVHN, 3.12% vs. 3.52% on CIFAR-10, and 20.36% vs. 20.97% and 47.55% vs. 49.17% on the more challenging CIFAR-100 and ImageNet benchmarks, respectively. Moreover, experiments on three different architectures consistently demonstrate the remarkable improvement gained by MetaMixUp, with the best results achieved on Wide-ResNet-28-10. These comparisons firmly demonstrate the effectiveness of MetaMixUp. Results of Semi-Supervised Learning.
Using a standard, unified and fair SSL evaluation framework [45], we compare our pseudo-labeling extensions of MixUp and MetaMixUp with state-of-the-art methods on the CIFAR-10 and SVHN benchmarks. The error rates are presented in Table III. The best performance among pseudo-label based methods is achieved by MetaMixUp combined with asynchronous pseudo-labeling (MetaMixUp+APL), introduced in Section 3.4. The improvement over its purely pseudo-labeling counterpart is about 6.2% on CIFAR-10 (4K labels) and 2.5% on SVHN (1K labels) in terms of accuracy. In the fewer-label settings, our MetaMixUp+APL achieves an error rate of 20.66% on CIFAR-10 (1K labels) and 6.05% on SVHN (500 labels). More importantly, all pseudo-label based MetaMixUp variants outperform their MixUp counterparts, showing the effectiveness of the interpolation policies learned by MetaMixUp. We also notice that asynchronous pseudo-labeling (APL) provides a slight additional benefit for both MixUp and MetaMixUp. To illustrate the capability of correctly
TABLE I
ERROR RATES (%) OF SUPERVISED LEARNING ON TEST SET. ‡ REFERS TO THE RESULTS FROM [10].

Architecture: 3-layer CNN (MNIST, Fashion) / PreActResNet18 (other datasets)
| Method | MNIST | Fashion | SVHN | CIFAR-10 | CIFAR-100 | ImageNet Top-1 | ImageNet Top-5 |
| Baseline | 0.54 | 7.31 | 4.62 | 5.62 | 25.20 | 52.84 | 29.48 |
| MixUp [9] | 0.59 | 6.74 | 3.83 | 4.57 | 21.35 | 55.06 | 31.32 |
| AdaMixup w/o Discriminator ‡ [10] | - | - | - | 3.83 | 24.75 | - | - |
| AdaMixup w/ Discriminator ‡ [10] | 0.49 | 6.21 | 3.12 | 3.52 | 20.97 | 49.17 | 25.78 |
| MetaMixUp (ours) | 0.38 | 5.15 | 2.96 | 3.12 | 20.36 | 47.55 | |

Architecture: PreActResNet34
| Baseline | - | - | 4.46 | 5.32 | 24.63 | 50.25 | 28.59 |
| MixUp [9] | - | - | 3.35 | 4.14 | 20.85 | 53.26 | 29.83 |
| MetaMixUp (ours) | - | - | | | | 47.35 | 24.43 |

Architecture: Wide-Resnet-28-10
| Baseline | - | - | 4.34 | 4.74 | 22.33 | 50.41 | 28.38 |
| MixUp [9] | - | - | 3.31 | 3.07 | 19.21 | 52.96 | 29.28 |
| MetaMixUp (ours) | - | - | | | | 46.38 | 23.91 |
TABLE II
TEST ERROR RATES (%) OF SUPERVISED LEARNING FOR DIFFERENT α AND λ ON THE TEST SETS OF CIFAR-10, CIFAR-100, SVHN AND MNIST. NOTE THAT α = 0 INDICATES STANDARD TRAINING WITHOUT MIXUP.

| Method | CIFAR-10 | CIFAR-100 | SVHN | MNIST |
| MixUp (α = 0) | 5.62 | 25.20 | 4.62 | 0.54 |
| MixUp (α = 0.) | 4.65 | 21.67 | 4.71 | 0.66 |
| MixUp (α = 1) | | | | |
| MixUp (α = 2) | 4.78 | 21.49 | | |
| MixUp (λ = 0.) | 5.02 | 22.56 | 4.21 | |
| MixUp (λ = 0.) | | | | |
| MixUp (λ = 0.) | 4.71 | | | |
| MixUp (λ = 0.) | 5.29 | 23.29 | 4.73 | 0.71 |
| MixUp (λ = 0.) | 5.88 | 23.62 | 4.78 | 0.70 |
| MetaMixUp (ours) | | | | |

TABLE III
TEST ERROR RATES (%) OF SSL APPROACHES ON THE TEST SETS OF CIFAR-10 (1K MEANS 1,000 LABELED EXAMPLES) AND SVHN. 'SUPERVISED-ONLY' REFERS TO TRAINING WITH NO UNLABELED DATA. † REFERS TO THE RESULTS REPORTED IN [45].
| Method | CIFAR-10 (1K Labels) | CIFAR-10 (4K Labels) | SVHN (500 Labels) | SVHN (1K Labels) |
| Supervised-Only | 35.95 | | | |
| Π-Model † [36] | 24.81 | | | |
| Mean Teacher † [4] | 23.38 | | | |
| VAT † [3] | 21.52 | | | |
| VAT + EntMin † [3] | 21.28 | | | |
| MixMatch w/o MixUp | - | 10.97 | | |
| MetaMixUp + APL (ours) | 20.66 | | 6.05 | |
| MixMatch [15] | | | 3.79 | |
| MixMatch + MetaMixUp (ours) | 7.69 | 6.21 | 3.63 | |

updating pseudo labels for unlabeled data, we also perform a specialized comparison between APL and the baseline pseudo-labeling method (see Table IV). APL gradually exploits more reliable and stable pseudo labels to enforce the classification loss, and hence outperforms the pseudo-labeling method [14] on both CIFAR-10 and SVHN without additional computation cost. As a complementary experiment to provide further evidence of the improvement obtained by MetaMixUp, we replace MixUp with our MetaMixUp in MixMatch [15] to generate

TABLE IV
TEST ERROR RATES (%) OBTAINED WITH A UNIFIED IMPLEMENTATION OF VARIOUS SSL METHODS.

| Method | CIFAR-10 (1K Labels) | CIFAR-10 (4K Labels) | SVHN (500 Labels) | SVHN (1K Labels) |
| Pseudo-Label | 25.13 | | | |

interpolation policy for data augmentation. We find that the error rates of MixMatch + MetaMixUp are consistently lower than
MixMatch over all datasets, surpassing the published state-of-the-art approaches by a significant margin to the best of our knowledge. On CIFAR-10, MixMatch + MetaMixUp achieves a test error of 7.69% with 1K labels and 6.21% with 4K labels. On SVHN, we achieve an error rate of 3.63% with only 500 labels, compared to the MixMatch performance of 3.79%, showing the successful adaptation of MetaMixUp to semi-supervised learning.
Fig. 2. The loss of supervised training with MetaMixUp and vanilla MixUp on CIFAR-10.
D. Further Discussion
In this section, we experimentally study some properties of MetaMixUp and try to answer the following questions: (i) how sensitive is MetaMixUp to the validation size; (ii) does MetaMixUp really mitigate the manifold intrusion issue; (iii) how does the relative frequency of the learned interpolation coefficient differ from the Beta distribution used in MixUp [9] on MNIST; (iv) how does the distribution of the learned interpolation policy change during training on CIFAR-10; and (v) how sensitive is MetaMixUp with APL to the confidence threshold hyperparameter in SSL tasks?
1) Trade-off of validation size:
The validation size controls the number of examples per class available for the meta stage. To make a trade-off and understand the sensitivity to this size, Figure 3 plots the classification error rate as we vary the size of the validation set for MetaMixUp under the supervised configuration. Surprisingly, using only 10 examples per class results in merely a 0.25% drop on CIFAR-10 and 0.1% on SVHN. Moreover, the overall error rate does not improve further once the validation set has more than 100 examples per class. This suggests that our method does not rely on a large validation set for better performance.
Fig. 3. Error rate for different validation sizes of MetaMixUp on CIFAR-10 and SVHN under the supervised configuration.
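The meta stage that consumes this validation set (cf. the algorithm above) can be illustrated on a toy linear model. Everything below is our own minimal sketch: we use a central finite-difference approximation of the meta-gradient ∇_λ ℓ(Φ′(θ′), D_v) instead of backpropagating through the inner step, squared loss instead of cross-entropy, and all names and constants are illustrative assumptions:

```python
import numpy as np

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

def inner_step(w, X1, y1, X2, y2, lam, eta):
    """One SGD step of the copied model on the lambda-mixed batch."""
    lx = lam[:, None]
    Xm = lx * X1 + (1 - lx) * X2      # mixed inputs
    ym = lam * y1 + (1 - lam) * y2    # mixed targets
    grad = 2 * Xm.T @ (Xm @ w - ym) / len(ym)
    return w - eta * grad

def meta_update_lambda(w, batch, val, lam, eta=0.1, alpha=0.02, eps=1e-5):
    """Approximate the gradient of the one-step validation loss w.r.t. each
    lambda_i by central finite differences, then take one clipped step."""
    X1, y1, X2, y2 = batch
    Xv, yv = val
    grad = np.zeros_like(lam)
    for i in range(len(lam)):
        hi, lo = lam.copy(), lam.copy()
        hi[i] += eps
        lo[i] -= eps
        grad[i] = (mse(inner_step(w, X1, y1, X2, y2, hi, eta), Xv, yv)
                   - mse(inner_step(w, X1, y1, X2, y2, lo, eta), Xv, yv)) / (2 * eps)
    return np.clip(lam - alpha * grad, 0.0, 1.0)

# Toy regression data: the meta step should not increase the one-step
# validation loss compared with the randomly initialized lambda.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X1, X2 = rng.normal(size=(8, 3)), rng.normal(size=(8, 3))
y1, y2 = X1 @ w_true, X2 @ w_true
Xv, yv = rng.normal(size=(16, 3)), None
yv = Xv @ w_true
w0 = np.zeros(3)
lam0 = np.full(8, 0.5)
lam_star = meta_update_lambda(w0, (X1, y1, X2, y2), (Xv, yv), lam0)
val_before = mse(inner_step(w0, X1, y1, X2, y2, lam0, 0.1), Xv, yv)
val_after = mse(inner_step(w0, X1, y1, X2, y2, lam_star, 0.1), Xv, yv)
```

In the paper the meta-gradient is obtained analytically through the copied network Φ′, which avoids the 2B extra inner steps this finite-difference sketch needs.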
2) Alleviation of manifold intrusion issue:
Manifold intrusion [10] means that an improper selection of the MixUp weight λ can cause conflicts between the labels of the original data and those of the mixed data, leading to degraded performance. To verify the effectiveness of our method in mitigating manifold intrusion, we train a simple CNN on MNIST with MetaMixUp under the supervised setting and extract the features of all the mixed images during training. For each mixed sample, we compute the minimal Euclidean distance, in the feature space of the trained network, between this mixed sample and all samples with different labels. We then compute the average and the minimum of these distances for MixUp and MetaMixUp, respectively. The results are reported in Figure 5: the samples generated by MetaMixUp turn out to be farther from the set of existing real samples. In addition, Figure 4 plots the features learned by a network with 2 hidden dimensions. The representation learned with MetaMixUp is more discriminative, so collisions in feature space occur with lower probability. Both experiments confirm that MetaMixUp mitigates manifold intrusion to a great extent.

Figure 2 plots the training loss curves of vanilla MixUp and MetaMixUp under a representative setting, ResNet-50 on CIFAR-10, where the x-axis denotes the training epochs and the y-axis the training loss. The figure shows two insights. First, the training error of MetaMixUp approaches zero, which empirically verifies the convergence of the model. Second, the loss curves generally satisfy the condition in Proposition 1, i.e., minimizing the MixUp loss helps control the Lipschitz constant. The training loss of MetaMixUp is substantially lower than that of vanilla MixUp, suggesting that the proposed meta-learning scheme for learning mixing policies alleviates the manifold intrusion/underfitting issue of vanilla MixUp and thus optimizes an underlying robust objective.
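The distance diagnostic used above can be sketched as follows; the function name and the synthetic feature clusters are our own illustration of the measurement, not the paper's code:

```python
import numpy as np

def min_dist_to_other_labels(mixed_feats, feats, labels, mixed_labels):
    """For every mixed sample, the minimal Euclidean distance in feature
    space to any real sample whose label differs from the mixed sample's
    dominant label; small values flag potential manifold intrusion."""
    mixed_feats = np.atleast_2d(mixed_feats)
    dists = np.empty(len(mixed_feats))
    for k, (f, c) in enumerate(zip(mixed_feats, mixed_labels)):
        others = feats[np.asarray(labels) != c]
        dists[k] = np.linalg.norm(others - f, axis=1).min()
    return dists

# Two well-separated classes; a mixed point that lands near the other
# class gets a small distance, i.e. a possible manifold intrusion.
feats = np.array([[0.0, 0.0], [0.2, 0.1], [10.0, 10.0], [10.1, 9.9]])
labels = np.array([0, 0, 1, 1])
mixed = np.array([[0.5, 0.5], [9.5, 9.5]])
d = min_dist_to_other_labels(mixed, feats, labels, mixed_labels=[0, 0])
```

Averaging (or taking the minimum of) these per-sample distances over training gives the curves reported in Figure 5.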
3) Relative frequency of interpolation coefficient on MNIST:
The relative frequencies of the interpolation coefficient λ generated via MetaMixUp and via vanilla MixUp with α = 1.0, when trained on MNIST, are shown in Figure 6. Values near 0 and 1 occur more frequently for MetaMixUp than for MixUp. Since manifold intrusion is more likely to occur on MNIST, a possible explanation of this observation is that, when mixed samples may cause useless or negative effects, MetaMixUp tends to retain the original examples to avoid possible collisions or unexpected performance degradation, and thus mitigates underfitting.

Fig. 4. Features of MixUp (left) and MetaMixUp (right) on MNIST.

Fig. 5. The distance between mixed samples and original examples with different labels.
Fig. 6. Relative frequency of λ generated via MixUp (red) and MetaMixUp (blue) on MNIST.
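For reference, the vanilla MixUp sampling that Figure 6 compares against can be sketched as below; MetaMixUp replaces the Beta draw with the learned λ*. The helper name and the fake batch are our own illustration:

```python
import numpy as np

def mixup_batch(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Vanilla MixUp: one coefficient per pair drawn from Beta(alpha, alpha)
    (alpha = 1.0 is equivalent to Uniform(0, 1)); inputs and one-hot labels
    are interpolated with the same coefficient."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha, size=len(x1))
    lx = lam.reshape(-1, *([1] * (x1.ndim - 1)))  # broadcast over input dims
    ly = lam.reshape(-1, *([1] * (y1.ndim - 1)))
    return lx * x1 + (1 - lx) * x2, ly * y1 + (1 - ly) * y2, lam

# Fake 8x8 "images" with 3-class one-hot labels.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(32, 8, 8)), rng.normal(size=(32, 8, 8))
y1 = np.eye(3)[rng.integers(0, 3, size=32)]
y2 = np.eye(3)[rng.integers(0, 3, size=32)]
xm, ym, lam = mixup_batch(x1, y1, x2, y2, alpha=1.0, rng=rng)
```

Histogramming `lam` over many batches reproduces the red (Beta/uniform) curve of Figure 6.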
4) Distribution of learned interpolation coefficient on CIFAR-10:
Figure 7 visualizes the distribution of the generated interpolation policy during training on CIFAR-10 in our experiments, where the z-axis denotes the λ learned by our MetaMixUp, and the y and x axes denote the classes of the training samples x_i and x_j. Two observations can be made. First, the learned interpolation policy changes during training: Figure 7 (a), (b), (c) and (d) show the distributions of λ learned at different epochs, evolving from a random form to a data-specific distribution. Second, the learned λ gradually converges to class-specific ranges for samples of different classes. This matches our intuition that the proposed method learns a data-driven interpolation policy for the MixUp technique.

Fig. 7. The data-driven interpolation policy distribution learned by MetaMixUp with ResNet-50 on CIFAR-10 at epoch 3 (a), epoch 30 (b), epoch 90 (c) and epoch 120 (d).
5) Sensitivity to σ_d: As introduced in Section 3.4, σ_d controls the decay rate of the confidence threshold for pseudo-labeling in SSL. We conduct experiments with σ_d varying from 0.01 to 0.2 to understand the sensitivity to this hyperparameter. The results (see Table V) show that the influence of changing σ_d on performance is small; MetaMixUp is not sensitive to σ_d.

TABLE V
TEST ERROR RATE (%) FOR METAMIXUP+APL WITH DIFFERENT σ_d.

V. CONCLUSIONS AND FUTURE WORK
In this paper, we show that the vanilla MixUp loss is a lower bound of the Lipschitz constant of the gradient of the classifier function. Without smartly choosing the interpolation coefficient for each pair of samples, the model suffers from underfitting, leading to a degradation of performance. The proposed MetaMixUp addresses this problem by optimizing the interpolation policy of MixUp with a meta-learning scheme in an online fashion. The interpolation policy of MetaMixUp is learned data-adaptively to improve the generalization performance of the model. Experimental results illustrate that MetaMixUp adapts to both supervised and semi-supervised learning scenarios with remarkable performance improvements over the original MixUp and its variants, and our proposed methods achieve competitive performance across multiple supervised and semi-supervised benchmarks. In the future, it would be interesting to explore the power of MetaMixUp in other challenging tasks. We believe that application-specific adaptation of the MetaMixUp training objective and optimization trajectories will further improve results over a wide range of application areas, including domain adaptation [51], generative adversarial networks, and semi-supervised natural language processing.

REFERENCES

[1] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Is object localization for free? Weakly-supervised learning with convolutional neural networks,” in
CVPR, 2015.
[2] T. Durand, T. Mordan, N. Thome, and M. Cord, “WILDCAT: Weakly supervised learning of deep ConvNets for image classification, pointwise localization and segmentation,” in CVPR, 2017.
[3] T. Miyato, S.-i. Maeda, S. Ishii, and M. Koyama, “Virtual adversarial training: A regularization method for supervised and semi-supervised learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[4] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in NeurIPS, 2017.
[5] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” CoRR, vol. abs/1312.6199, 2013.
[6] X. Gastaldi, “Shake-shake regularization of 3-branch residual networks,” 2017.
[7] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, 2017.
[9] H. Zhang, M. Cissé, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” CoRR, vol. abs/1710.09412, 2017.
[10] H. Guo, Y. Mao, and R. Zhang, “MixUp as locally linear out-of-manifold regularization,” CoRR, vol. abs/1809.02499, 2018.
[11] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in ICML, 2017.
[12] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian optimization of machine learning algorithms,” in NeurIPS, 2012.
[13] L. Metz, N. Maheswaranathan, B. Cheung, and J. Sohl-Dickstein, “Learning unsupervised learning rules,” CoRR, vol. abs/1804.00222, 2018.
[14] D.-H. Lee, “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in Workshop on Challenges in Representation Learning, ICML, 2013.
[15] D. Berthelot, N. Carlini, I. J. Goodfellow, N. Papernot, A. Oliver, and C. Raffel, “MixMatch: A holistic approach to semi-supervised learning,” CoRR, vol. abs/1905.02249, 2019.
[16] G. E. Hinton and D. van Camp, “Keeping the neural networks simple by minimizing the description length of the weights,” in COLT, 1993, pp. 5–13.
[17] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
[19] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “AutoAugment: Learning augmentation policies from data,” arXiv preprint arXiv:1805.09501, 2018.
[20] Y. Tokozume, Y. Ushiku, and T. Harada, “Between-class learning for image classification,” in CVPR, 2018.
[21] V. Verma, A. Lamb, C. Beckham, A. C. Courville, I. Mitliagkas, and Y. Bengio, “Manifold mixup: Encouraging meaningful on-manifold interpolation as a regularizer,” CoRR, vol. abs/1806.05236, 2018.
[22] S. Thrun and L. Pratt, “Learning to learn: Introduction and overview,” in Learning to Learn. Springer, 1998.
[23] Y. Bengio, S. Bengio, and J. Cloutier, Learning a Synaptic Learning Rule. Université de Montréal, Département d'informatique et de recherche opérationnelle, 1990.
[24] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in ICLR, 2017.
[25] J. Snell, K. Swersky, and R. S. Zemel, “Prototypical networks for few-shot learning,” in NeurIPS, 2017, pp. 4080–4090.
[26] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel, “A simple neural attentive meta-learner,” in ICLR, 2018.
[27] T. Munkhdalai, X. Yuan, S. Mehri, and A. Trischler, “Rapid adaptation with conditionally shifted neurons,” in ICML, 2018, pp. 3661–3670.
[28] F. Shen, C. Shen, A. van den Hengel, and Z. Tang, “Approximate least trimmed sum of squares fitting and applications in image analysis,” IEEE Trans. Image Processing, vol. 22, no. 5, pp. 1836–1847, 2013.
[29] F. Shen, X. Gao, L. Liu, Y. Yang, and H. T. Shen, “Deep asymmetric pairwise hashing,” in ACM MM, 2017, pp. 1522–1530.
[30] F. Shen, C. Shen, W. Liu, and H. T. Shen, “Supervised discrete hashing,” in CVPR, 2015, pp. 37–45.
[31] M. Ren, W. Zeng, B. Yang, and R. Urtasun, “Learning to reweight examples for robust deep learning,” in ICML, 2018.
[32] J. Bergstra, D. Yamins, and D. D. Cox, “Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures,” in ICML, 2013, pp. 115–123.
[33] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms,” in KDD, 2013, pp. 847–855.
[34] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian optimization of machine learning algorithms,” in NeurIPS, 2012, pp. 2960–2968.
[35] F. Pedregosa, “Hyperparameter optimization with approximate gradient,” in ICML, 2016, pp. 737–746.
[36] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,” CoRR, vol. abs/1610.02242, 2016.
[37] V. R. de Sa, “Learning classification with unlabeled data,” in NeurIPS, 1994.
[38] K. Saito, Y. Ushiku, and T. Harada, “Asymmetric tri-training for unsupervised domain adaptation,” in ICML, 2017, pp. 2988–2997.
[39] S. Xie, Z. Zheng, L. Chen, and C. Chen, “Learning semantic representations for unsupervised domain adaptation,” in ICML, 2018, pp. 5419–5428.
[40] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier, “Parseval networks: Improving robustness to adversarial examples,” in ICML, 2017.
[41] Y. Tsuzuku, I. Sato, and M. Sugiyama, “Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks,” in NeurIPS, 2018.
[42] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. L. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture search,” in ECCV, 2018, pp. 19–35.
[43] B. Colson, P. Marcotte, and G. Savard, “An overview of bilevel optimization,” Annals of Operations Research, vol. 153, no. 1, pp. 235–256, 2007.
[44] J. Luketina, T. Raiko, M. Berglund, and K. Greff, “Scalable gradient-based tuning of continuous regularization hyperparameters,” in ICML, 2016, pp. 2952–2960.
[45] A. Oliver, A. Odena, C. A. Raffel, E. D. Cubuk, and I. J. Goodfellow, “Realistic evaluation of deep semi-supervised learning algorithms,” in NeurIPS, 2018.
[46] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” 2011.
[47] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li, “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[48] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in ECCV, 2016.
[49] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” in ICLR, 2017.
[50] S. Zagoruyko and N. Komodakis, “Wide residual networks,” in BMVC, 2016.
[51] G. French, M. Mackiewicz, and M. H. Fisher, “Self-ensembling for visual domain adaptation,” in ICLR, 2018.