Understanding and Improving Virtual Adversarial Training
Dongha Kim∗ ([email protected]), Yongchan Choi∗ ([email protected]), Yongdai Kim∗ ([email protected])

September 17, 2019
Abstract
In semi-supervised learning, the virtual adversarial training (VAT) approach is one of the most attractive methods due to its intuitive simplicity and strong performance. VAT finds a classifier which is robust to data perturbation toward the adversarial direction. In this study, we provide a fundamental explanation of why VAT works well in the semi-supervised setting, and we propose new techniques, simple but powerful, to improve the VAT method. In particular, we employ the idea of the Bad GAN approach, which utilizes bad samples distributed on the complement of the support of the input data, without any additional deep generative architectures. We generate high-quality bad samples by means of the adversarial training used in VAT, and we also give theoretical explanations of why adversarial training is good at generating bad samples. An advantage of our proposed method is that it achieves performance competitive with other recent studies with much less computation. We demonstrate the advantages of our method through various experiments with well-known benchmark image datasets.
∗ Department of Statistics, Seoul National University

Introduction

Deep learning has accomplished unprecedented success due to the development of deep architectures, learning techniques and hardware (Krizhevsky et al., 2012; Ioffe and Szegedy, 2015; Szegedy et al., 2015; Hinton et al., 2012; Kingma and Ba, 2014). However, deep learning has also suffered from the need to collect large amounts of labeled data, which requires both cost and time. Thus it has become important to develop semi-supervised methodologies that learn a classifier (or discriminator) using a small amount of labeled data together with a large amount of unlabeled data.

Various semi-supervised learning methods have been proposed for deep learning. Weston et al. (2012) employ a manifold embedding technique using a pre-constructed graph of the unlabeled data, and Rasmus et al. (2015) use a specially designed auto-encoder to extract essential features for classification. The variational auto-encoder (Kingma and Welling, 2013) has also been used in the context of semi-supervised learning by maximizing the variational lower bound of both labeled and unlabeled data (Kingma et al., 2014; Maaløe et al., 2016).

Recently, a simple and powerful idea for semi-supervised learning has been proposed, called virtual adversarial training (VAT, Miyato et al. (2015, 2017)).
VAT inherits the core idea of the adversarial training method (Goodfellow et al., 2014b) for supervised learning, which enhances the invariance of the classifier with respect to perturbations of the inputs, and applies this idea to the semi-supervised case. There are several follow-up studies applying VAT to sequential data or combining VAT with their own methods in order to strengthen prediction performance (Miyato et al., 2016; Clark et al., 2018). Though VAT is intuitively simple and performs well, it is still not clear why VAT works well in the semi-supervised setting.

Semi-supervised learning based on generative adversarial networks (GAN, Goodfellow et al. (2014a)) has also received much attention. For K-class classification problems, Salimans et al. (2016) solve the (K+1)-class classification problem in which the additional (K+1)-th class consists of synthetic images made by a generator of the GAN learned from unlabeled data. Dai et al. (2017) notice that not a good generator but a bad generator, one that generates synthetic images quite different from the observed images, is crucial, and develop a semi-supervised learning algorithm called Bad GAN which achieves strong performance over multiple benchmark datasets. However, Bad GAN needs two additional deep architectures besides the one for the classifier: a bad generator and a pre-trained density estimation model. Learning these multiple deep architectures requires substantial computation and memory. In particular, PixelCNN++ (Salimans et al., 2017) is used for the pre-trained density estimation model, which consumes very large computational resources.

In this study, we give fundamental explanations of why VAT works well for semi-supervised learning. One standard assumption for semi-supervised learning is the cluster assumption, which says that the optimal decision boundary is located in low-density regions (Chapelle et al., 2006). We find that VAT pushes the decision boundary away from the high-density regions of the data and thus helps to find a desirable classifier under the cluster assumption. More specifically, the objective function of the VAT method can be interpreted as a differentiable version of an ideal loss function whose optimizer is a perfect classifier.

Based on these findings, we propose new techniques to enhance the performance of VAT. First, we employ the idea of Bad GAN, which utilizes bad samples distributed on the complement of the support of the input data, without any additional deep generative architectures. Our proposed technique is motivated by a close investigation of the adversarial direction in VAT. Here, the adversarial direction for a given datum is the direction in which the probabilities of each class change most. We prove that data perturbed toward their adversarial directions can serve as 'good' bad samples. Dai et al. (2017) prove that bad samples play a role in pulling the decision boundary toward the low-density regions of the data. By using the adversarial directions both for measuring invariance and for generating bad samples, the proposed method combines the advantages of VAT and Bad GAN. That is, our method accelerates the learning procedure by using both pushing and pulling operations simultaneously.

Secondly, we modify the approximation method of Miyato et al. (2017) for calculating the adversarial direction. Miyato et al. (2017) propose an approximation based on the second-order Taylor expansion. We modify this idea by also considering the reverse directions of the dominant eigenvectors, and we find that this slight modification helps to improve the VAT method. We call the modified VAT with the newly proposed techniques FAT (Fast Adversarial Training). We show that FAT achieves almost state-of-the-art performance with much fewer training epochs. In particular, for the MNIST dataset, FAT achieves test accuracies similar to those of Bad GAN and VAT with 5 times and 7 times fewer training epochs, respectively.

This paper is organized as follows. In Section 2, we review the VAT and Bad GAN methods briefly. Theoretical analysis of VAT is given in Section 3. In Section 4, the technique to generate bad samples using the adversarial directions is described, and our proposed semi-supervised learning method is presented. Results of various experiments are presented in Section 5, and conclusions follow in Section 6.
VAT approach
VAT (Miyato et al., 2017) is a regularization method inspired by adversarial training (Goodfellow et al., 2014b). The regularization term of VAT is given as

$$
\mathcal{L}_{\mathrm{VAT}}(\theta;\hat\theta,\mathbf{x},\epsilon) = D_{\mathrm{KL}}\!\left( p(\cdot|\mathbf{x};\hat\theta)\,\big\|\,p(\cdot|\mathbf{x}+r_{\mathrm{adv}}(\mathbf{x},\epsilon);\theta)\right) = -\sum_{k=1}^{K} p(k|\mathbf{x};\hat\theta)\log p(k|\mathbf{x}+r_{\mathrm{adv}}(\mathbf{x},\epsilon);\theta) + C,
$$

where

$$
r_{\mathrm{adv}}(\mathbf{x},\epsilon) = \operatorname*{argmax}_{r;\,\|r\|\le\epsilon} D_{\mathrm{KL}}\!\left( p(\cdot|\mathbf{x};\hat\theta)\,\big\|\,p(\cdot|\mathbf{x}+r;\hat\theta)\right), \qquad (1)
$$

$\epsilon > 0$ is a tuning parameter, $\theta$ is the parameter of the discriminator to train, $\hat\theta$ is the current estimate of $\theta$, and $C$ is a constant. Combining this with the cross-entropy term of the labeled data, we get the final objective function of VAT:

$$
-\mathbb{E}_{\mathbf{x},y\sim\mathcal{L}^{tr}}[\log p(y|\mathbf{x};\theta)] + \mathbb{E}_{\mathbf{x}\sim\mathcal{U}^{tr}}\!\left[\mathcal{L}_{\mathrm{VAT}}(\theta;\hat\theta,\mathbf{x},\epsilon)\right], \qquad (2)
$$

where $\mathcal{L}^{tr}$ and $\mathcal{U}^{tr}$ are the labeled and unlabeled datasets, respectively.

Bad GAN approach
Bad GAN (Dai et al., 2017) is a method that trains a good discriminator with a bad generator. Let $\mathcal{D}_G(\phi)$ be the bad samples generated by a bad generator $p_G(\cdot;\phi)$ parametrized by $\phi$. Here, the 'bad generator' is a deep architecture that generates samples different from the observed data. Let $p^{pt}(\cdot)$ be a pre-trained density estimation model. For a given discriminator with a feature vector $v(\mathbf{x};\theta)$ of a given input $\mathbf{x}$ parameterized by $\theta$, Bad GAN learns the bad generator by minimizing the following with respect to $\phi$:

$$
\mathbb{E}_{\mathbf{x}\sim\mathcal{D}_G(\phi)}\!\left[\log p^{pt}(\mathbf{x})\, I(p^{pt}(\mathbf{x}) > \tau)\right] + \left\| \mathbb{E}_{\mathbf{x}\sim\mathcal{U}^{tr}}\, v(\mathbf{x};\hat\theta) - \mathbb{E}_{\mathbf{x}\sim\mathcal{D}_G(\phi)}\, v(\mathbf{x};\hat\theta) \right\|,
$$

where $\tau > 0$ is a tuning parameter, $\mathcal{U}^{tr}$ is the unlabeled data, $\hat\theta$ is the current estimate of $\theta$, and $\|\cdot\|$ is the Euclidean norm.

In turn, to train the discriminator, we consider the $K$-class classification problem as a $(K+1)$-class classification problem in which the $(K+1)$-th class is an artificial label for the bad samples generated by the bad generator. We estimate the parameter $\theta$ of the discriminator by minimizing the following for given $\phi$:

$$
-\mathbb{E}_{\mathbf{x},y\sim\mathcal{L}^{tr}}[\log p(y|\mathbf{x}, y\le K;\theta)] - \mathbb{E}_{\mathbf{x}\sim\mathcal{U}^{tr}}\!\left[\log\left\{\sum_{k=1}^{K} p(k|\mathbf{x};\theta)\right\}\right] - \mathbb{E}_{\mathbf{x}\sim\mathcal{D}_G(\phi)}[\log p(K+1|\mathbf{x};\theta)] - \mathbb{E}_{\mathbf{x}\sim\mathcal{U}^{tr}}\!\left[\sum_{k=1}^{K} p(k|\mathbf{x};\theta)\log p(k|\mathbf{x};\theta)\right], \qquad (3)
$$

where $\mathcal{L}^{tr}$ is the labeled set. See Dai et al. (2017) for details of the objective function (3).

VAT for semi-supervised learning
In this section, we give theoretical insight into the role of the regularization term of VAT in semi-supervised learning. We show that this regularization term pushes the decision boundary away from the high-density regions of the unlabeled data and thus helps to find a desirable classifier under the cluster assumption.
Let $(\mathcal{X}, d)$ be a given metric space and $(X, Y) \in \mathcal{X} \times \{1, \ldots, K\}$ be input and output random variables with density function $p(\mathbf{x}, y)$. We assume that $p(\mathbf{x}|y=k)$ is positive and continuous on $\mathcal{X}$ and $p(y=k) > 0$ for all $k$. For $a > 0$, define $\mathcal{X}(a) := \cup_{k=1}^{K} \mathcal{X}_k(a)$ and $\delta(a) := \Pr(X \in \mathcal{X} - \mathcal{X}(a))$, where

$$
\mathcal{X}_k(a) := \left\{ \mathbf{x} \in \mathcal{X} : \Pr(Y=k|\mathbf{x}) - \max_{k' \ne k} \Pr(Y=k'|\mathbf{x}) > a \right\}.
$$

Also define $P_a$ to be the probability measure of the truncated random variable $X$ on $\mathcal{X}(a)$, whose density function is given as $p_a(\mathbf{x}) \propto p(\mathbf{x}) \cdot I(\mathbf{x} \in \mathcal{X}(a))$.

For a given function $f : \mathcal{X} \to \mathbb{R}^K$, let $C(\cdot; f) : \mathcal{X} \to \{1, \ldots, K\}$ be the induced classifier defined as $C(\mathbf{x}; f) := \operatorname*{argmax}_{k=1,\ldots,K} f_k(\mathbf{x})$, and let $C_B(\cdot)$ be the Bayes classifier. We denote the decision boundary of a given induced classifier $C(\cdot; f)$ by $D(f)$ and the decision boundary of the Bayes classifier by $D_B$.

Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be $n$ independent copies of $(X, Y)$. We propose two measures of a given function $f$: one is the standard empirical risk on the labeled data, given as $l_n(f) := \frac{1}{n}\sum_{i=1}^{n} I(C(X_i; f) \ne Y_i)$, and the other is an invariance measure under perturbation, defined as

$$
u_{a,\epsilon}(f) := \mathbb{E}_{P_a}\!\left[ I\!\left( C(\mathbf{x}; f) \ne C(\mathbf{x}+r; f) \text{ for some } r \in B(0,\epsilon) \right) \right], \qquad (4)
$$

where $a, \epsilon > 0$ and $B(0,\epsilon) := \{v : d(0, v) < \epsilon\}$.

Now we are ready to establish a proposition stating that a classifier which is invariant with respect to all small perturbations and classifies the labeled data correctly converges to the Bayes classifier exponentially fast in the number of labeled data. The proof is in the supplementary materials.

Proposition 1
Let $\mathcal{F}$ be a set of continuous functions including $f_B$, whose corresponding classifier is the Bayes classifier. For $a, \epsilon > 0$, let

$$
\hat f_n \in \operatorname*{argmin}_{f \in \mathcal{F}_{a,\epsilon}} l_n(f), \quad \text{where} \quad \mathcal{F}_{a,\epsilon} := \left\{ f \in \mathcal{F} : f \in \operatorname*{argmin}_{f' \in \mathcal{F}} u_{a,\epsilon}(f') \right\}.
$$

Then there exist $\delta^*, \epsilon^*, c_1^*, c_2^* > 0$ not depending on $n$ such that, if $\delta(a) < \delta^*$ and $\epsilon < \epsilon^*$, then

$$
P^{(n)}\!\left[ \Pr\!\left( C(X; \hat f_n) = C_B(X) \right) \ge 1 - \delta(a) \right] \ge 1 - c_1^* \exp(-n c_2^*), \qquad (5)
$$

where $P^{(n)}$ is the product probability measure of $(X_1, Y_1), \ldots, (X_n, Y_n)$.

Note that the Bayes decision boundary $D_B$ is located in $\mathcal{X} - \mathcal{X}(a)$; hence a small value of $\delta(a)$ essentially encodes the cluster assumption. The term $u_{a,\epsilon}(f)$ encourages the decision boundary not to be located inside the support $\mathcal{X}(a)$, or equivalently pushes the decision boundary away from the high-density regions of the data, which helps to find a good classifier under the cluster assumption.
Here we claim that the objective terms in VAT can be interpreted as modified versions of $l_n$ and $u_{a,\epsilon}$, respectively. Let $f(\mathbf{x};\theta) := (p(y=k|\mathbf{x};\theta))_{k=1}^{K}$ and $C(\mathbf{x};\theta) := \operatorname*{argmax}_k f_k(\mathbf{x};\theta)$. Proposition 1 implies that it is desirable to pursue a classifier which predicts the labeled data correctly and at the same time is invariant with respect to all local perturbations of the unlabeled data. For this purpose, a plausible candidate objective function is

$$
\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{L}^{tr}}\!\left[ I(y \ne C(\mathbf{x};\theta)) \right] + \mathbb{E}_{\mathbf{x}\sim\mathcal{U}^{tr}}\!\left[ I\!\left( C(\mathbf{x};\theta) \ne C(\mathbf{x}+r;\theta) \text{ for some } r \in B(0,\epsilon) \right) \right], \qquad (6)
$$

where $\mathcal{L}^{tr}$ and $\mathcal{U}^{tr}$ are the labeled and unlabeled data, respectively.

The objective function (6) is not practically usable, since neither optimizing the indicator function nor checking $C(\mathbf{x};\theta) \ne C(\mathbf{x}+r;\theta)$ over all $r \in B(0,\epsilon)$ is possible. To resolve these problems, we replace the indicator functions in (6) with cross-entropies, and the neighborhood $B(\mathbf{x},\epsilon)$ in the second term with the adversarial direction. By doing so, we obtain the following alternative objective function:

$$
-\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{L}^{tr}}[\log p(y|\mathbf{x};\theta)] - \mathbb{E}_{\mathbf{x}\sim\mathcal{U}^{tr}}\!\left[ \sum_{k=1}^{K} p(k|\mathbf{x};\theta) \log p(k|\mathbf{x}+r_{\mathrm{adv}}(\mathbf{x},\epsilon);\theta) \right]. \qquad (7)
$$

Finally, we replace $p(\cdot|\mathbf{x};\theta)$ in the second term of (7) by $p(\cdot|\mathbf{x};\hat\theta)$ to obtain the objective function of VAT (2).
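To make the regularization term concrete, the sketch below evaluates $\mathcal{L}_{\mathrm{VAT}}$ for a toy softmax classifier, approximating $r_{\mathrm{adv}}$ in (1) by a brute-force search over random directions on the $\epsilon$-sphere. This is a simplification for illustration only (the method itself uses an eigenvector-based approximation), and the model parameters are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy 3-class linear model standing in for p(.|x; theta).
W, b = rng.normal(size=(3, 2)), rng.normal(size=3)
predict = lambda x: softmax(W @ x + b)

def kl(p, q):
    # KL divergence between two discrete distributions.
    return float(np.sum(p * np.log(p / q)))

def vat_loss(x, eps, n_dirs=512):
    # L_VAT = KL(p(.|x) || p(.|x + r_adv)), with r_adv approximated by
    # the best of n_dirs random directions on the eps-sphere.
    p = predict(x)
    dirs = rng.normal(size=(n_dirs, x.size))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return max(kl(p, predict(x + eps * d)) for d in dirs)

x = np.array([0.3, -0.8])
print(vat_loss(x, eps=0.5))  # nonnegative; shrinks as eps shrinks
```

Minimizing this quantity over $\theta$ (here the weights are frozen for illustration) is exactly what pushes the decision boundary out of the high-density regions.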
Generating bad samples using adversarial directions
The key role of bad samples in Bad GAN is to pull the decision boundary toward the low-density regions of the unlabeled data. In this section, we propose a novel technique to generate 'good' bad samples using only a given classifier.
Let us consider the two-class linear logistic regression model parametrized by $\eta = \{\mathbf{w}, b\}$, that is, $p(y=1|\mathbf{x};\eta) = \left(1 + \exp(-b - \mathbf{w}'\mathbf{x})\right)^{-1}$. Note that the decision boundary is $\{\mathbf{x} : b + \mathbf{w}'\mathbf{x} = 0\}$, and for any given $\mathbf{x}$, the distance between $\mathbf{x}$ and the decision boundary is $|b + \mathbf{w}'\mathbf{x}| / \|\mathbf{w}\|$. The key result is that moving $\mathbf{x}$ along the adversarial direction $r_{\mathrm{adv}}(\mathbf{x},\epsilon)$ is equivalent to moving $\mathbf{x}$ toward the decision boundary, which is stated rigorously in the following proposition. The proof is in the supplementary materials.

Proposition 2
For sufficiently small $\epsilon > 0$, we have

$$
|b + \mathbf{w}'\mathbf{x}| / \|\mathbf{w}\| > |b + \mathbf{w}'(\mathbf{x} + r_{\mathrm{adv}}(\mathbf{x},\epsilon))| / \|\mathbf{w}\|
$$

unless $|b + \mathbf{w}'\mathbf{x}| = 0$.

Hence, we can treat $\mathbf{x} + C\, r_{\mathrm{adv}}(\mathbf{x},\epsilon) / \|r_{\mathrm{adv}}(\mathbf{x},\epsilon)\|$ for an appropriately chosen $C > 0$ as a bad sample. For the decision boundary induced by a DNN model with ReLU activation function, see the supplementary materials.

Figure 1: Demonstration of how the bad samples generated by the adversarial training are distributed. We consider two cases: a 3-class classification problem (Left) and a 4-class classification problem (Right). True data and bad data are colored blue and orange, respectively.
Motivated by Proposition 2, we propose a bad sample generator as follows. Let $C > 0$ be fixed and $\hat\theta$ be the current estimate of $\theta$. For a given datum $\mathbf{x}$ and a classifier $p(\cdot|\mathbf{x};\hat\theta)$, we calculate the adversarial direction $r_{\mathrm{adv}}(\mathbf{x},\epsilon)$ for a given $\epsilon$ by (1). Then we consider

$$
\mathbf{x}^{bad} = \mathbf{x} + C\, r_{\mathrm{adv}}(\mathbf{x},\epsilon) / \|r_{\mathrm{adv}}(\mathbf{x},\epsilon)\|
$$

as a bad sample. It may happen that a generated bad sample is not sufficiently close to the decision boundary to be a 'good' bad sample, in particular when $C$ is too large or too small. To avoid this, we exclude any $\mathbf{x}^{bad}$ which satisfies the following condition for a pre-specified $\alpha > 0$:

$$
\max_k p(k|\mathbf{x}^{bad};\hat\theta) > 1 - \alpha.
$$

In Figure 1, we illustrate how the bad samples generated by the proposed adversarial training are distributed for multi-class problems. With a good classifier, we can clearly see that most bad samples are located well within the low-density regions of the data.
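The generator and rejection rule above can be sketched as follows. The classifier here is an invented toy softmax model, and the adversarial direction is found by brute-force random search rather than the approximation actually used:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy 3-class linear classifier standing in for p(.|x; theta_hat).
W = rng.normal(size=(3, 2))
b = rng.normal(size=3)
predict = lambda x: softmax(W @ x + b)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def adversarial_direction(x, eps, n_dirs=256):
    # Crude stand-in for Eq. (1): search random unit directions for the
    # eps-perturbation maximizing KL(p(.|x) || p(.|x+r)).
    p = predict(x)
    dirs = rng.normal(size=(n_dirs, x.size))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    kls = [kl(p, predict(x + eps * d)) for d in dirs]
    return eps * dirs[int(np.argmax(kls))]

def bad_sample(x, eps=0.1, C=1.0, alpha=0.3):
    # x_bad = x + C * r_adv / ||r_adv||; reject over-confident samples.
    r = adversarial_direction(x, eps)
    x_bad = x + C * r / np.linalg.norm(r)
    if predict(x_bad).max() > 1 - alpha:
        return None  # too far from the decision boundary to be useful
    return x_bad

x = np.array([1.0, -0.5])
xb = bad_sample(x)
print("rejected" if xb is None else xb)
```

The rejection step guarantees that every accepted bad sample has predicted confidence at most $1-\alpha$, i.e., it lies near the current decision boundary.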
Remark 1. Note that samples near the decision boundary are not always located in low-density regions; they serve as bad samples only when the classifier is reasonable. To reflect this observation in the learning procedure, we employ the warm-up technique which is described in Section 4.
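The eigenvector-based approximation of $r_{\mathrm{adv}}$ discussed in the next paragraph, together with our sign selection, can be sketched numerically with a toy softmax model. Here the Hessian is computed by finite differences for transparency; Miyato et al. (2017) use power iteration instead, and the model weights are made up for the example:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy 3-class linear model standing in for p(.|x; theta_hat).
W = np.array([[1.0, -0.5], [0.2, 0.8], [-1.0, 0.3]])
b = np.array([0.1, -0.2, 0.0])
predict = lambda x: softmax(W @ x + b)

def kl_from(x, r):
    # KL(p(.|x) || p(.|x + r))
    p, q = predict(x), predict(x + r)
    return float(np.sum(p * np.log(p / q)))

def kl_hessian(x, h=1e-4):
    # Numerical Hessian of r -> KL(p(.|x) || p(.|x+r)) at r = 0
    # (the gradient vanishes there, so KL is locally quadratic).
    d = x.size
    H = np.zeros((d, d))
    E = np.eye(d) * h
    for i in range(d):
        for j in range(d):
            H[i, j] = (kl_from(x, E[i] + E[j]) - kl_from(x, E[i])
                       - kl_from(x, E[j]) + kl_from(x, 0 * x)) / h**2
    return H

def r_adv(x, eps):
    # Dominant eigenvector of H with magnitude eps; eigenvectors come in
    # +/- pairs, so pick the sign giving the larger exact KL.
    vals, vecs = np.linalg.eigh(kl_hessian(x))
    v = eps * vecs[:, np.argmax(vals)]
    return v if kl_from(x, v) >= kl_from(x, -v) else -v

x = np.array([0.7, -0.3])
r = r_adv(x, eps=0.5)
print(r)
```

The explicit sign comparison at the end is the modification described below; the second-order expansion alone cannot distinguish $v$ from $-v$.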
Miyato et al. (2017) propose a fast approximation method to calculate the adversarial direction $r_{\mathrm{adv}}(\mathbf{x},\epsilon)$ using the second-order Taylor expansion. Define

$$
H(\mathbf{x},\hat\theta) = \nabla\nabla_{r}\, D_{\mathrm{KL}}\!\left( p(\cdot|\mathbf{x};\hat\theta)\,\big\|\,p(\cdot|\mathbf{x}+r;\hat\theta)\right)\Big|_{r=0}.
$$

They claim that $r_{\mathrm{adv}}$ emerges as the first dominant eigenvector $v(\mathbf{x},\hat\theta)$ of $H(\mathbf{x},\hat\theta)$ with magnitude $\epsilon$. But there always exist two such eigenvectors, $\pm v(\mathbf{x},\hat\theta)$, and the sign should be selected carefully. Thus, we slightly modify the approximation method of Miyato et al. (2017) to

$$
r_{\mathrm{adv}}(\mathbf{x},\epsilon) = \operatorname*{argmax}_{r \in \{v(\mathbf{x},\hat\theta),\, -v(\mathbf{x},\hat\theta)\}} D_{\mathrm{KL}}\!\left( p(\cdot|\mathbf{x};\hat\theta)\,\big\|\,p(\cdot|\mathbf{x}+r;\hat\theta)\right).
$$

With the new techniques described in Section 4, our proposed method, called
FAT, updates $\theta$ by minimizing the following objective function:

$$
-\mathbb{E}_{\mathbf{x},y\sim\mathcal{L}^{tr}}[\log p(y|\mathbf{x};\theta)] + \mathbb{E}_{\mathbf{x}\sim\mathcal{U}^{tr}}\!\left[\mathcal{L}_{\mathrm{VAT}}(\theta;\hat\theta,\mathbf{x},\epsilon)\right] + \lambda\!\left[ \mathbb{E}_{\mathbf{x}\sim\mathcal{U}^{tr}}\!\left[\mathcal{L}_{\mathrm{true}}(\theta;\mathbf{x})\right] + \mathbb{E}_{\mathbf{x}\sim\mathcal{D}^{bad}(\hat\theta,\epsilon,C)}\!\left[\mathcal{L}_{\mathrm{fake}}(\theta;\mathbf{x})\right] \right], \qquad (8)
$$

where $\mathcal{D}^{bad}(\hat\theta,\epsilon,C)$ is the set of bad samples generated with $\hat\theta$, $\epsilon$ and $C$,

$$
\mathcal{L}_{\mathrm{true}}(\theta;\mathbf{x}) = -\sum_{k=1}^{K} \frac{\exp(g_k(\mathbf{x};\theta))}{1 + \sum_{k'=1}^{K}\exp(g_{k'}(\mathbf{x};\theta))}\, \log \frac{\exp(g_k(\mathbf{x};\theta))}{1 + \sum_{k'=1}^{K}\exp(g_{k'}(\mathbf{x};\theta))},
$$

$$
\mathcal{L}_{\mathrm{fake}}(\theta;\mathbf{x}) = -\log \frac{1}{1 + \sum_{k=1}^{K}\exp(g_k(\mathbf{x};\theta))},
$$

$g(\mathbf{x};\theta) \in \mathbb{R}^K$ is the pre-softmax vector of the given architecture, and $\lambda > 0$. We treat $\epsilon$ and $C$ as tuning parameters to be selected based on validation accuracy. Note that $\mathcal{L}_{\mathrm{true}}$ and $\mathcal{L}_{\mathrm{fake}}$ together play the same role as the sum of the third and fourth terms in (3).

As described in Remark 1, the $\mathbf{x}^{bad}$ are distributed in low-density regions only when the classifier performs well, which means that they hamper the learning procedure in the early learning phase. Thus we use the warm-up strategy of Bowman et al. (2015): we start with $\lambda = 0$ and increase it gradually as learning proceeds. We compare the prediction performance of
FAT with that of other semi-supervised learning algorithms on the benchmark datasets. We consider the most widely used datasets: MNIST (LeCun et al., 1998), SVHN (Marlin et al., 2010) and CIFAR10 (Krizhevsky and Hinton, 2009). For fair comparison, we use the same architectures as those used in Miyato et al. (2017) for MNIST, SVHN and CIFAR10; see the supplementary materials for details. The optimal tuning parameters $(\epsilon, C, \alpha)$ in FAT are chosen based on validation accuracy. We initialize $\lambda$ at zero and increase it by 0.1 after every training epoch up to one. We use the Adam algorithm (Kingma and Ba, 2014) to update the parameters and do not use any data augmentation techniques. The results are summarized in Table 1, which shows that FAT achieves state-of-the-art accuracies for MNIST (20) and SVHN (500, 1000) and competitive accuracies in the other settings. Since FAT needs only one deep architecture, we can conclude that FAT is a powerful and computationally efficient method.

Table 1: Prediction accuracies (%) of various semi-supervised learning algorithms for the three benchmark datasets. $|\mathcal{L}|$ is the number of labeled data; results marked with ∗ are our implementations.

| Method | MNIST 20 | MNIST 100 | SVHN 500 | SVHN 1000 | CIFAR10 1000 | CIFAR10 4000 |
|---|---|---|---|---|---|---|
| DGN (Kingma et al., 2014) | - | 96.67 | - | 63.98 | - | - |
| Ladder (Rasmus et al., 2015) | - | 98.94 | - | - | - | 79.6 |
| FM-GAN (Salimans et al., 2016) | 83.23 | 99.07 | 81.56 | 91.89 | 78.13 | 81.37 |
| FM-GAN-Tan (Kumar et al., 2017) | - | - | 95.13 | 95.61 | 80.48 | 83.80 |
| Bad GAN (Dai et al., 2017) | 80.16∗ | | | | | |
| VAT (Miyato et al., 2017) | 67.04∗ | | | | | |
| Tri-GAN (LI et al., 2017) | 95.19 | 99.09 | - | 94.23 | - | 83.01 |
| CCLP (Kamnitsas et al., 2018) | - | - | - | 94.31 | - | 81.43 |
| FAT | | | | | | |

Another advantage of FAT is its stability over the learning phase. With a small amount of labeled data, Figure 2 shows that the per-epoch test accuracies of VAT and Bad GAN tend to fluctuate considerably and even degrade, while FAT yields much more stable results. This may be partly because the bad samples help to stabilize the objective function.
FAT introduces three tuning parameters $\epsilon$, $C$ and $\alpha$, where $\epsilon$ is the constant used to find the adversarial direction, $C$ is the radius used to generate artificial samples, and $\alpha$ is used to determine whether an artificial sample is 'good'. We investigate the sensitivity of the prediction performance to the values of these tuning parameters. When we vary one of the tuning parameters, the others are fixed at the optimal values chosen on the validation data. The results are reported in Table 2. Unless $\alpha$ is too small or too large, the prediction performance does not change much. For $\epsilon$ and $C$, more care is needed: with $\epsilon$ larger than the optimal $C$ (i.e. 2) or with $C$ smaller than the optimal $\epsilon$ (i.e. 1.5), the prediction performance is suboptimal. Apparently, choosing $\epsilon$ and $C$ with $\epsilon$ slightly smaller than $C$ gives the best result.

Figure 2: Trace plot of the test accuracies for MNIST with 20 labeled data.

Table 2: Test accuracies of MNIST (100) for various values of $\epsilon$, $C$ and $\alpha$. In each case the other parameters are fixed at the optimal values.

| $\epsilon$ | 1. | 1.5 | 2. | 4. |
|---|---|---|---|---|
| Test acc. | 97.94 | 98.89 | 98.61 | 95.65 |

Table 3: Learning-time ratios per training epoch relative to supervised learning with cross-entropy for CIFAR10. Bad GAN is run without PixelCNN++.

| Method | VAT | FAT | Bad GAN |
|---|---|---|---|
| Time ratio | 1.37 | 2.09 | 3.20 |
We investigate the computational efficiency of our method in terms of learning speed and computation time per training epoch. For Bad GAN, we did not use PixelCNN++ on the SVHN and CIFAR10 datasets since the pre-trained PixelCNN++ models are not publicly available; without PixelCNN++, Bad GAN is similar to FM-GAN (Salimans et al., 2016). Figure 3 shows bar plots of the number of epochs needed to achieve pre-specified test accuracies. We can clearly see that FAT requires far fewer epochs.

We also calculate the ratio of the computing time of each semi-supervised learning algorithm to the computing time of the corresponding supervised learning algorithm for the CIFAR10 dataset; the results are summarized in Table 3. These ratios are almost the same across datasets. The computation time of FAT is less than that of Bad GAN and competitive with that of VAT. From the results of Figure 3 and Table 3, we can conclude that FAT achieves the pre-specified performances efficiently. Note that the learning time of PixelCNN++ is not counted in this experiment, so a comparison of the computing time of FAT against Bad GAN with PixelCNN++ is not meaningful.

Figure 3: The number of epochs needed to achieve the pre-specified test accuracies (98%, 90% and 80%) with the three methods for (Left) MNIST (100), (Middle) SVHN (1000) and (Right) CIFAR10 (4000). Bad GAN is run without PixelCNN++ for the SVHN and CIFAR10 datasets.
Role of the regularization term of VAT
For samples generated by adversarial training to be 'good' bad samples, the adversarial directions should point toward the decision boundary. While this always happens for the linear model by Proposition 2, adversarial directions can point away from the decision boundary for deep models. To avoid such undesirable cases as much as possible, it is helpful to smooth the classifier with a regularization term. Here, we claim that the regularization term of VAT plays exactly this role.

The adversarial direction obtained by maximizing the KL divergence is sensitive to local fluctuations of the class probabilities, which is exemplified in Figure 4. The regularization term of VAT helps to find a correct adversarial direction, one pointing toward the decision boundary, by eliminating unnecessary local fluctuations of the class probabilities. In Figure 5, we compare bad samples generated by adversarial training with and without the regularization term of VAT for the MNIST dataset. While the bad samples generated without the regularization term of VAT are visually similar to the given input vectors, the bad samples generated with the regularization term of VAT look like mixtures of two different digits and thus serve as 'better' bad samples.
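The sensitivity to local fluctuations can be illustrated in one dimension. Below, both invented classifiers place their main decision boundary at $x=0$, but the second one carries a small high-frequency wiggle; at a suitably chosen point, the adversarial step for the wiggly classifier points away from the true boundary, toward a spurious wiggle-induced crossing:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bern_kl(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q).
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def adv_dir_1d(logit, x0, eps):
    # Adversarial direction in 1-D: the eps-step (left or right) that
    # maximizes KL(p(.|x0) || p(.|x0 + r)).
    p = sigmoid(logit(x0))
    kl_plus = bern_kl(p, sigmoid(logit(x0 + eps)))
    kl_minus = bern_kl(p, sigmoid(logit(x0 - eps)))
    return eps if kl_plus >= kl_minus else -eps

smooth = lambda x: x                          # clean model: boundary at x = 0
wiggly = lambda x: x + 0.5 * np.sin(20 * x)   # same boundary + local fluctuations

r_s = adv_dir_1d(smooth, x0=1.0, eps=0.1)
r_w = adv_dir_1d(wiggly, x0=np.pi / 20, eps=0.1)
print(r_s, r_w)
```

For the smooth model the step is negative, i.e., toward the boundary at $x=0$; for the wiggly model it is positive, chasing a local fluctuation. Smoothing the classifier, as the VAT regularizer does, removes such misleading directions.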
We investigate how 'good' the artificial samples generated by FAT are. The left two plots of Figure 6 show a scatter plot of the synthetic data and a trace plot of the prediction accuracies of FAT and VAT, and the right four plots of Figure 6 show scatter plots of the generated artificial samples at various epochs. We can clearly see that the artificial samples are distributed near the current decision boundary. We also compare artificial images generated by FAT and bad images generated by Bad GAN for the MNIST data at the end of the learning procedure. In Figure 7, the images generated by FAT do not look like real images and do not appear to have collapsed, which indicates that FAT consistently generates diverse and good artificial samples. Bad GAN also generates diverse bad samples, but some 'realistic' images can be found among them.

Figure 4: Examples of $P(y=1|\mathbf{x})$ for smooth (Left) and wiggly (Right) cases. We plot three points and their adversarial directions in each case.

Figure 5: (Upper) 10 randomly sampled original MNIST images. (Middle and Lower) Bad samples obtained by the classifier learned with and without the regularization term of VAT, respectively.
Conclusion

In this paper, we gave fundamental explanations of why VAT works well for semi-supervised learning: VAT pushes the decision boundary away from the high-density regions of the data, which helps to find a good classifier under the cluster assumption. We also proposed a new method for semi-supervised learning, called FAT, which modifies VAT by introducing simple but powerful techniques. FAT is devised to combine the advantages of VAT and Bad GAN. In numerical experiments, we showed that FAT achieves almost state-of-the-art performance with much fewer epochs. Unlike Bad GAN, FAT only needs to learn a discriminator; hence, it could be extended without much effort to other learning problems. For example, FAT could easily be modified for recurrent neural networks and hence applied to sequential data. We leave this extension as future work.
Acknowledgments
This work is supported by Samsung Electronics Co., Ltd.
Figure 6: (Upper left) Scatter plot of the synthetic data, which consist of 1000 unlabeled data points (gray) and 4 labeled data points for each class (red and blue with black edges). (Upper right) Accuracy on the unlabeled data at each epoch for VAT and FAT; we use a 2-layer NN with 100 hidden units per layer. (Lower) Artificial samples and unlabeled data classified by color at the 20, 40, 60 and 80 training epochs of FAT, respectively.

Figure 7: 100 randomly sampled artificial images and bad images with (Left) FAT and (Right) Bad GAN, respectively.

References

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. (2015). Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.

Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT Press, Cambridge, MA.

Clark, K., Luong, T., and Le, Q. V. (2018). Cross-view training for semi-supervised learning.

Dai, Z., Yang, Z., Yang, F., Cohen, W. W., and Salakhutdinov, R. R. (2017). Good semi-supervised learning that requires a bad GAN. In Advances in Neural Information Processing Systems, pages 6513–6523.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014a). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014b). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Kamnitsas, K., Castro, D., Folgoc, L. L., Walker, I., Tanno, R., Rueckert, D., Glocker, B., Criminisi, A., and Nori, A. (2018). Semi-supervised learning via compact latent space clustering. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2459–2468.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Kumar, A., Sattigeri, P., and Fletcher, T. (2017). Semi-supervised learning with GANs: Manifold invariance with improved inference. In Advances in Neural Information Processing Systems, pages 5534–5544.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Li, C., Xu, T., Zhu, J., and Zhang, B. (2017). Triple generative adversarial nets. In Advances in Neural Information Processing Systems 30, pages 4088–4098.

Maaløe, L., Sønderby, C. K., Sønderby, S. K., and Winther, O. (2016). Auxiliary deep generative models. arXiv preprint arXiv:1602.05473.

Maas, A. L., Hannun, A. Y., and Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3.

Marlin, B., Swersky, K., Chen, B., and Freitas, N. (2010). Inductive principles for restricted Boltzmann machine learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 509–516.

Miyato, T., Dai, A. M., and Goodfellow, I. (2016). Adversarial training methods for semi-supervised text classification. arXiv e-prints.

Miyato, T., Maeda, S.-i., Koyama, M., and Ishii, S. (2017). Virtual adversarial training: A regularization method for supervised and semi-supervised learning. arXiv preprint arXiv:1704.03976.

Miyato, T., Maeda, S.-i., Koyama, M., Nakae, K., and Ishii, S. (2015). Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677.

Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.

Rasmus, A., Berglund, M., Honkala, M., Valpola, H., and Raiko, T. (2015). Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546–3554.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016). Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242.

Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. (2017). PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9.

Weston, J., Ratle, F., Mobahi, H., and Collobert, R. (2012). Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer.
Let $d(U,V) := \min_{u \in U, v \in V} d(u,v)$ for two sets $U$ and $V$. For each $\mathcal{X}^k(a)$, there exists a partition of open sets $\{\mathcal{X}_j^k(a)\}_{j=1}^{m_k}$ such that $\min_{j \neq j'} d(\mathcal{X}_j^k(a), \mathcal{X}_{j'}^k(a)) > 0$. Set $\delta^* = a \cdot \min_{k,j} \Pr(X \in \mathcal{X}_j^k(a))$ and $\epsilon^* = d(\mathcal{X}(a), D_B)$. The proof consists of three steps.

[Step 1.] First we show that if $f \in \mathcal{F}_{a,\epsilon}$ then $d(\mathcal{X}(a), D(f)) \geq \epsilon$. It is easy to check that $u_{a,\epsilon}(f_B) = 0$, which means $u_{a,\epsilon}(f) = 0$ for all $f \in \mathcal{F}_{a,\epsilon}$. Assume that $d(\mathcal{X}(a), D(f)) < \epsilon$ for some $f \in \mathcal{F}_{a,\epsilon}$. Then there exists an open ball $B$ such that $B \subset \mathcal{X}(a)$ and $d(B, D(f)) < \epsilon$. Since $P_a(B) > 0$, we have $u_{a,\epsilon}(f) \leq -P_a(B) < 0$, which is a contradiction. Therefore $d(\mathcal{X}(a), D(f)) \geq \epsilon$ for all $f \in \mathcal{F}_{a,\epsilon}$. We highlight that if $f$ satisfies $d(\mathcal{X}(a), D(f)) \geq \epsilon$, then for any $j, k$, $C(x; f)$ is constant on each $\mathcal{X}_j^k(a)$.

[Step 2.] Let $W$ be the subset of $\mathcal{X}^n \times \mathcal{Y}^n$ such that
\[
\sum_{i=1}^n I(X_i \in \mathcal{X}_j^k(a), Y_i = k) > \max_{k' \neq k} \left\{ \sum_{i=1}^n I(X_i \in \mathcal{X}_j^k(a), Y_i = k') \right\} + \sum_{i=1}^n I(X_i \in \mathcal{X} - \mathcal{X}(a)) \tag{9}
\]
for all $(k,j)$, where $\mathcal{Y} := \{1, \ldots, K\}$. Note that $C_B(x) = k$ on $\mathcal{X}_j^k(a)$ for $j = 1, \ldots, m_k$ and $k = 1, \ldots, K$. Hence on $W$,
\[
\sum_{i=1}^n I(Y_i = C_B(X_i), X_i \in \mathcal{X}_j^k(a)) = \sum_{i=1}^n I(Y_i = k, X_i \in \mathcal{X}_j^k(a)) > \sum_{i=1}^n I(Y_i = k', X_i \in \mathcal{X}_j^k(a)) + \sum_{i=1}^n I(X_i \in \mathcal{X} - \mathcal{X}(a))
\]
for any $k' \neq k$. If there exists a tuple $(k,j)$ such that $C(x; \hat{f}) = k' \neq k$ on $x \in \mathcal{X}_j^k(a)$, we have the following inequalities on $W$:
\[
\sum_{i=1}^n I(Y_i = C_B(X_i)) \geq \sum_{i=1}^n \sum_{k,j} I(X_i \in \mathcal{X}_j^k(a), Y_i = k),
\]
\[
\sum_{i=1}^n I(Y_i = C(X_i; \hat{f})) \leq \sum_{i=1}^n I(Y_i = C(X_i; \hat{f}), X_i \in \mathcal{X}(a)) + \sum_{i=1}^n I(X_i \in \mathcal{X} - \mathcal{X}(a)) \\
\leq \sum_{i=1}^n \sum_{(\tilde{k}, \tilde{j}) \neq (k,j)} I(X_i \in \mathcal{X}_{\tilde{j}}^{\tilde{k}}(a), Y_i = \tilde{k}) + \sum_{i=1}^n I(X_i \in \mathcal{X}_j^k(a), Y_i = k') + \sum_{i=1}^n I(X_i \in \mathcal{X} - \mathcal{X}(a)).
\]
Therefore, with (9), we have $\sum_{i=1}^n I(Y_i = C_B(X_i)) > \sum_{i=1}^n I(Y_i = C(X_i; \hat{f}))$ on $W$, which contradicts the definition of $\hat{f}$. Thus $C(x; \hat{f}) = C_B(x)$ on $x \in \mathcal{X}(a)$. Since $\Pr\{X \in \mathcal{X}(a)\} = 1 - \delta(a)$, we have $\Pr\{C(X; \hat{f}) = C_B(X)\} \geq 1 - \delta(a)$ on $W$.

[Step 3.] Let $m := \sum_{k=1}^K m_k$ and
\[
W_{i,(k,j,k')} := I(X_i \in \mathcal{X}_j^k(a), Y_i = k) - I(X_i \in \mathcal{X}_j^k(a), Y_i = k') - I(X_i \in \mathcal{X} - \mathcal{X}(a))
\]
for $k' \neq k$. Note that
\[
\mathbb{E}(W_{1,(k,j,k')}) = \int_{\mathcal{X}_j^k(a)} \left( p(y = k \mid x) - p(y = k' \mid x) \right) p(x) \, dx - \delta(a) > a \cdot \Pr(X \in \mathcal{X}_j^k(a)) - \delta(a) > 0.
\]
Let $c_1^* = Km$ and $c_2^* = \min_{k,j} \left\{ a \cdot \Pr(X \in \mathcal{X}_j^k(a)) - \delta(a) \right\}$. Then Hoeffding's inequality implies that
\[
P^{(n)} \left\{ \sum_{i=1}^n W_{i,(k,j,k')} < 0 \right\} \leq \exp\left( -n (c_2^*)^2 / 2 \right).
\]
By the union bound, we have
\[
P^{(n)}(W^c) \leq P^{(n)} \left\{ \min_{k,j,k'} \sum_{i=1}^n W_{i,(k,j,k')} < 0 \right\} \leq c_1^* \cdot \exp\left( -n (c_2^*)^2 / 2 \right),
\]
which completes the proof. $\Box$

Without loss of generality, we assume that $w'x + b > 0$, that is, $p(y=1 \mid x; \eta) > p(y=0 \mid x; \eta)$. We will show that there exists $\epsilon^* > 0$ such that $w' r_{adv}(x, \epsilon) < 0$ for all $0 < \epsilon < \epsilon^*$.
Note that
\[
\mathop{\mathrm{argmax}}_{r,\ \|r\| \leq \epsilon,\ w'r > 0} KL(x, r; \eta) = \epsilon \frac{w}{\|w\|} \ (=: r_1^*)
\quad \text{and} \quad
\mathop{\mathrm{argmax}}_{r,\ \|r\| \leq \epsilon,\ w'r < 0} KL(x, r; \eta) = -\epsilon \frac{w}{\|w\|} \ (=: r_2^*).
\]
Hence it suffices to show that $KL(x, r_2^*; \eta) > KL(x, r_1^*; \eta)$. By a simple calculation we obtain
\[
KL(x, r_2^*; \eta) - KL(x, r_1^*; \eta) = -p(y=1 \mid x; \eta)\, w'(r_2^* - r_1^*) - \log \frac{\exp(w'(x + r_1^*) + b) + 1}{\exp(w'(x + r_2^*) + b) + 1}.
\]
Using Taylor's expansion up to the third order, we obtain
\[
\log\left[ \exp(w'(x + r) + b) + 1 \right] = \log\left[ \exp(w'x + b) + 1 \right] + p(y=1 \mid x; \eta)\, w'r + \frac{1}{2} p(y=1 \mid x; \eta) p(y=0 \mid x; \eta) (w'r)^2 - \frac{1}{6} p(y=1 \mid x; \eta) p(y=0 \mid x; \eta) \left\{ p(y=1 \mid x; \eta) - p(y=0 \mid x; \eta) \right\} \sum_{i,j,k=1}^p w_i w_j w_k r_i r_j r_k + o(\|r\|^3).
\]
So,
\[
\log \frac{\exp(w'(x + r_1^*) + b) + 1}{\exp(w'(x + r_2^*) + b) + 1} = p(y=1 \mid x; \eta)\, w'(r_1^* - r_2^*) - \frac{1}{3} p(y=1 \mid x; \eta) p(y=0 \mid x; \eta) \left\{ p(y=1 \mid x; \eta) - p(y=0 \mid x; \eta) \right\} \epsilon^3 \|w\|^3 + o(\epsilon^3).
\]
Thus, we have
\[
KL(x, r_2^*; \eta) - KL(x, r_1^*; \eta) = \frac{1}{3} p(y=1 \mid x; \eta) p(y=0 \mid x; \eta) \left\{ p(y=1 \mid x; \eta) - p(y=0 \mid x; \eta) \right\} \epsilon^3 \|w\|^3 + o(\epsilon^3) = C^* \cdot \epsilon^3 + o(\epsilon^3),
\]
where $C^* > 0$. Therefore, there exists $\epsilon^* > 0$ such that $KL(x, r_2^*; \eta) > KL(x, r_1^*; \eta)$, and hence $w' r_{adv}(x, \epsilon) < 0$, for all $0 < \epsilon < \epsilon^*$. $\Box$

Consider a binary classification DNN model with a ReLU-like activation function, $p(y=1 \mid x; \theta) = (1 + \exp(-g(x; \theta)))^{-1}$, parameterized by $\theta$. Here, a ReLU-like function is an activation function which is piecewise linear, such as ReLU (Nair and Hinton, 2010), lReLU (Maas et al., 2013) and PReLU (He et al., 2015).
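As a quick numerical check of the preceding logistic-regression argument, the sketch below (our addition, not from the paper; `w`, `b`, `x` and the values of `eps` are arbitrary toy choices) computes the KL divergence for the two candidate perturbations $r_1^* = \epsilon w / \|w\|$ and $r_2^* = -\epsilon w / \|w\|$ and verifies that the divergence is larger in the $-w$ direction, with the gap shrinking at the cubic rate $\epsilon^3$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kl_bernoulli(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_after_shift(w, b, x, r):
    # KL( p(.|x) || p(.|x+r) ) for the logistic model p(y=1|x) = sigmoid(w'x + b)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    z_r = sum(wi * (xi + ri) for wi, xi, ri in zip(w, x, r)) + b
    return kl_bernoulli(sigmoid(z), sigmoid(z_r))

# arbitrary toy example with w'x + b > 0, i.e. p(y=1|x) > 1/2
w, b, x = [1.0, 0.0], 0.0, [0.5, 0.0]
norm_w = math.sqrt(sum(wi * wi for wi in w))

def gap(eps):
    r1 = [eps * wi / norm_w for wi in w]    # r_1^* =  eps * w / ||w||
    r2 = [-eps * wi / norm_w for wi in w]   # r_2^* = -eps * w / ||w||
    return kl_after_shift(w, b, x, r2) - kl_after_shift(w, b, x, r1)

print(gap(0.05) > 0)           # KL is larger toward -w: the adversarial
                               # perturbation points at the decision boundary
print(gap(0.05) / gap(0.025))  # ~8, i.e. the cubic rate gap(eps) ≈ C* eps^3
```

Halving $\epsilon$ shrinks the gap by roughly a factor of $2^3 = 8$, matching the $C^* \epsilon^3$ leading term of the derivation.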
Since $g(x; \theta)$ is piecewise linear, we can write it as
\[
g(x; \theta) = \sum_{j=1}^N I(x \in A_j) \cdot (w_j' x + b_j),
\]
where $A_j$ is a linear region and $N$ is the number of linear regions.

For a given $x$, suppose $g(x; \theta) > 0$. If $g(x; \theta)$ is estimated reasonably, we expect that $g(x; \theta)$ decreases as $x$ moves toward the decision boundary. A formal statement of this expectation is that $x - r \nabla_x g(x; \theta)$ arrives at the decision boundary for a finite value of $r > 0$, where $\nabla_x$ is the gradient with respect to $x$. Likewise, for $x$ with $g(x; \theta) < 0$, we expect that $x + r \nabla_x g(x; \theta)$ arrives at the decision boundary for a finite value of $r > 0$. We say that $x$ is normal if there is $r > 0$ such that $x - r \nabla_x g(x; \theta) \, \mathrm{sign}\{g(x; \theta)\}$ lies on the decision boundary, and that a linear region $A_j$ is normal if all $x \in A_j$ are normal. We expect that most of the $A_j$ are normal if $g(x; \theta)$ is reasonably estimated, so that the probability decreases or increases, depending on $\mathrm{sign}\{g(x; \theta)\}$, as $x$ gets closer to the decision boundary.

The following lemma proves that the adversarial direction is toward the decision boundary for all $x$ in normal linear regions.

Lemma 1
If a linear region $A_j$ is normal, then for any $x \in \mathrm{int}(A_j)$, there exist $\epsilon > 0$ and $C > 0$ such that $x_A = x + C \, r_{adv}(x, \epsilon) / \|r_{adv}(x, \epsilon)\|$ is on the decision boundary.

Proof)
Take $\tilde{\epsilon} > 0$ such that $x + r \in A_j$ for all $r \in B(x, \tilde{\epsilon})$, which is possible since $x \in \mathrm{int}(A_j)$. Then by Proposition 1, there exists $0 < \epsilon^* < \tilde{\epsilon}$ such that for all $0 < \epsilon < \epsilon^*$,
\[
r_{adv}(x, \epsilon) = \epsilon \cdot \mathrm{sign}(-b_j - w_j' x) \cdot \frac{w_j}{\|w_j\|} \propto -\nabla_x g(x; \theta) \, \mathrm{sign}\{g(x; \theta)\}.
\]
Since $x$ is normal, there exists $C > 0$ such that $x_A = x + C \, r_{adv}(x, \epsilon) / \|r_{adv}(x, \epsilon)\|$ belongs to the decision boundary. $\Box$

All model architectures used in the experiments are based on Miyato et al. (2017).
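Before turning to the architectures, Lemma 1 above can be illustrated on a toy piecewise-linear network (our illustration, not an experiment from the paper): for $g(x) = \mathrm{relu}(x_1) + \mathrm{relu}(x_2) - 1$, we move from a point with $g(x) > 0$ along the fixed unit direction $-\mathrm{sign}\{g(x)\} \nabla_x g(x) / \|\nabla_x g(x)\|$ and find, by bisection, a finite $C$ at which the path reaches the decision boundary $g = 0$.

```python
def relu(v):
    return max(v, 0.0)

def g(x):
    # toy one-hidden-layer ReLU network: g(x) = relu(x1) + relu(x2) - 1
    return relu(x[0]) + relu(x[1]) - 1.0

def grad_g(x, h=1e-6):
    # central finite-difference gradient of g at x
    out = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        out.append((g(xp) - g(xm)) / (2 * h))
    return out

x0 = [2.0, 0.5]                      # g(x0) = 1.5 > 0
gr = grad_g(x0)
norm = sum(v * v for v in gr) ** 0.5
sign = 1.0 if g(x0) > 0 else -1.0
d = [-sign * v / norm for v in gr]   # unit direction -sign{g} * grad / ||grad||

def g_along(t):
    return g([xi + t * di for xi, di in zip(x0, d)])

# bracket the sign change of g along the path, then bisect for the crossing C
hi = 0.0
while g_along(hi) > 0:
    hi += 0.1
lo = hi - 0.1
for _ in range(60):
    mid = (lo + hi) / 2
    if g_along(mid) > 0:
        lo = mid
    else:
        hi = mid
C = (lo + hi) / 2
print(C, g_along(C))                 # finite C with g ≈ 0: the boundary is reached
```

Here every point of the region containing `x0` is normal, so the search terminates at a finite `C`, exactly as the lemma asserts.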
For the MNIST dataset, we used a fully connected NN with four hidden layers, whose numbers of nodes were (1200, 600, 300, 150), with the ReLU activation function (Nair and Hinton, 2010). All the fully connected layers are followed by batch normalization (BN; Ioffe and Szegedy, 2015).
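A minimal pure-Python sketch of this forward pass is given below. It is our illustration, not the authors' code: the Gaussian initialization, the batch of two inputs, and applying BN to the affine output before ReLU are assumptions not specified in the text.

```python
import math, random

random.seed(0)

def matmul(X, W):
    # X: batch of row vectors (n x a), W: weight matrix (a x b)
    return [[sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))] for x in X]

def relu(X):
    return [[max(v, 0.0) for v in row] for row in X]

def batch_norm(X, eps=1e-5):
    # normalize each feature over the batch (learned scale/shift omitted for brevity)
    n, d = len(X), len(X[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [X[i][j] for i in range(n)]
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col) / n
        s = math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = (X[i][j] - mu) / s
    return out

sizes = [784, 1200, 600, 300, 150, 10]  # input, four hidden layers, 10 classes
weights = [[[random.gauss(0, 1 / math.sqrt(a)) for _ in range(b)] for _ in range(a)]
           for a, b in zip(sizes[:-1], sizes[1:])]

def forward(X):
    H = X
    for l, W in enumerate(weights):
        H = matmul(H, W)
        if l < len(weights) - 1:    # hidden layers: BN after the affine map, then ReLU
            H = relu(batch_norm(H))
    return H                        # logits for the 10 classes

batch = [[random.random() for _ in range(784)] for _ in range(2)]
logits = forward(batch)
print(len(logits), len(logits[0]))  # -> 2 10
```

A practical implementation would of course use a tensor library; the sketch only fixes the layer sizes and the place of BN in the pipeline.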
For the SVHN, CIFAR10 and CIFAR100 datasets, we used CNN architectures. More details are in Table 4.

Table 4: CNN architectures.

SVHN | CIFAR10 | CIFAR100
32×32 RGB images
3×3 conv. 64 lReLU | 3×3 conv. 96 lReLU | 3×3 conv. 128 lReLU
3×3 conv. 64 lReLU | 3×3 conv. 96 lReLU | 3×3 conv. 128 lReLU
3×3 conv. 64 lReLU | 3×3 conv. 96 lReLU | 3×3 conv. 128 lReLU
2×2 max-pool, stride 2
dropout, p = 0.5
3×3 conv. 128 lReLU | 3×3 conv. 192 lReLU | 3×3 conv. 256 lReLU
3×3 conv. 128 lReLU | 3×3 conv. 192 lReLU | 3×3 conv. 256 lReLU
3×3 conv. 128 lReLU | 3×3 conv. 192 lReLU | 3×3 conv. 256 lReLU
2×2 max-pool, stride 2
dropout, p = 0.5
3×3 conv. 128 lReLU | 3×3 conv. 192 lReLU | 3×3 conv. 512 lReLU
1×1 conv. 128 lReLU | 1×1 conv. 192 lReLU | 1×1 conv. 256 lReLU
1×1 conv. 128 lReLU | 1×1 conv. 192 lReLU | 1×1 conv. 128 lReLU
global average pool, 6×6 → 1×1
dense 128 → 10 | dense 192 → 10 | dense 128 → 100
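As a sanity check on these shapes, the spatial sizes through the CIFAR100 column can be traced with the standard convolution output-size formula. This is our sketch; the padding choices ('same' padding for the 3×3 stacks in the first two blocks, no padding afterwards) follow the Conv-Large network of Miyato et al. (2017) and are assumptions.

```python
def conv_out(size, kernel, stride=1, pad=0):
    # standard convolution/pooling output-size formula
    return (size + 2 * pad - kernel) // stride + 1

s = 32                          # 32x32 input images
for _ in range(3):              # 3x3 conv. 128, pad 1 ('same'): 32 -> 32
    s = conv_out(s, 3, pad=1)
s = conv_out(s, 2, stride=2)    # 2x2 max-pool, stride 2: 32 -> 16
for _ in range(3):              # 3x3 conv. 256, pad 1: 16 -> 16
    s = conv_out(s, 3, pad=1)
s = conv_out(s, 2, stride=2)    # 2x2 max-pool, stride 2: 16 -> 8
s = conv_out(s, 3)              # 3x3 conv. 512, no padding: 8 -> 6
s = conv_out(s, 1)              # 1x1 conv. 256: 6 -> 6
s = conv_out(s, 1)              # 1x1 conv. 128: 6 -> 6
print(s)                        # spatial size entering global average pooling
```

The result is a 6×6 feature map, consistent with the "global average pool, 6×6 → 1×1" row of the table.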