Augmenting Data with Mixup for Sentence Classification: An Empirical Study
Hongyu Guo
National Research Council Canada
1200 Montreal Road, Ottawa
[email protected]

Yongyi Mao
Electrical Engineering & Computer Science
University of Ottawa, Ottawa, Ontario
[email protected]

Richong Zhang
BDBC, School of Computer Science and Engineering
Beihang University, Beijing, China
[email protected]
Abstract
Mixup (Zhang et al., 2017), a recently proposed data augmentation method that linearly interpolates the inputs and modeling targets of random sample pairs, has demonstrated its capability of significantly improving the predictive accuracy of state-of-the-art networks for image classification. However, how this technique can be applied to, and how effective it is on, natural language processing (NLP) tasks has not been investigated. In this paper, we propose two strategies for adapting Mixup to sentence classification: one performs interpolation on word embeddings and the other on sentence embeddings. We conduct experiments evaluating our methods on several benchmark datasets. Our studies show that such interpolation strategies serve as an effective, domain-independent data augmentation approach for sentence classification, and can yield significant accuracy improvements for both CNN (Kim, 2014b) and LSTM (Hochreiter and Schmidhuber, 1997) models.
Introduction

Deep learning models have achieved state-of-the-art performance on many NLP applications, including parsing (Socher et al., 2011), text classification (Kim, 2014b; Tai et al., 2015), and machine translation (Sutskever et al., 2014). These models typically have millions of parameters, and thus require large amounts of training data to avoid over-fitting and to generalize well. However, collecting large annotated datasets is time-consuming and expensive.

One technique aiming to address this data-hungry problem is automatic data augmentation: synthetic samples are generated as additional training data to regularize the learning models. Data augmentation has been actively and successfully used in computer vision (Simard et al., 1998; Krizhevsky et al., 2017; Zhang et al., 2017) and speech recognition (Jaitly and Hinton, 2015; Ko et al., 2015). Most of these methods, however, rely on human knowledge for label-invariant data transformations, such as image scaling, flipping, and rotation. Unlike for images, there is no simple rule for label-invariant transformation in natural language: often, a slight change to a single word can dramatically alter the meaning of a sentence. To this end, popular data augmentation approaches in NLP transform the text through word replacement, using either synonyms from a handcrafted ontology (e.g., WordNet (Zhang et al., 2015)) or word similarity (Wang and Yang, 2015; Kobayashi, 2018). Such synonym-based transformation, however, can be applied to only a portion of the vocabulary, because words with exactly or nearly the same meaning are rare.
Other NLP data augmentation methods are often devised for specific domains, which makes them difficult to apply elsewhere (Sahin and Steedman, 2018).

Recently, a simple yet extremely effective augmentation method, Mixup (Zhang et al., 2017), was proposed and shown to substantially enhance the accuracy of image classification models. By linearly interpolating the pixels of random image pairs and their training targets, Mixup generates synthetic examples for training. Such training has been shown to act as an effective regularization strategy for image classification networks.

How can Mixup be applied to NLP tasks, and how effective is it there? We aim to answer these questions in this paper. Specifically, we propose two strategies for applying Mixup to sentence classification: one performs interpolation on word embeddings and the other on sentence embeddings. We empirically show that such interpolation strategies serve as a simple yet effective data augmentation method for sentence classification, and can yield significant accuracy improvements for both CNN (Kim, 2014b) and LSTM (Hochreiter and Schmidhuber, 1997) models. Promisingly, unlike traditional data augmentation in NLP, these interpolation-based strategies are domain independent, require no human knowledge for data transformation, and incur low additional computational cost.
Zhang et al. (2017) proposed the Mixup method for image classification. The idea is to generate a synthetic sample by linearly interpolating a pair of training samples as well as their modeling targets. In detail, consider a pair of samples (x_i, y_i) and (x_j, y_j), where x denotes the input and y the one-hot encoding of the sample's class. The synthetic sample is generated as follows:

x̃_{ij} = λ x_i + (1 − λ) x_j,   (1)
ỹ_{ij} = λ y_i + (1 − λ) y_j,   (2)

where λ is the mixing policy, or mixing ratio, for the sample pair. λ is sampled from a Beta(α, α) distribution with a hyper-parameter α; it is worth noting that when α equals one, the Beta distribution reduces to a uniform distribution. The generated synthetic data are then fed into the model, which is trained to minimize a loss function such as the cross-entropy for supervised classification. For efficient computation, the mixing randomly picks one sample and pairs it with another sample drawn from the same mini-batch.

Unlike an image, which consists of pixels, a sentence is composed of a sequence of words. A sentence representation is therefore typically constructed by aggregating information over that sequence. Specifically, in a standard CNN or LSTM model, a sentence is first represented as a sequence of word embeddings and then fed into a sentence encoder, the most popular encoders being CNNs and LSTMs. The sentence embedding generated by the CNN or LSTM is then passed through a softmax layer to produce the predictive distribution over the possible target classes.

Figure 1: Illustration of wordMixup (left) and senMixup (right), where the part added to the standard sentence classification model is in the red rectangle.

To this end, we propose two variants of Mixup for sentence classification. The first conducts sample interpolation in the word embedding space (denoted wordMixup); the second operates on the final hidden layer of the network, before it is passed to a standard softmax layer that generates the predictive distribution over classes (denoted senMixup). The two models are illustrated in Figure 1, where the standard CNN (Kim, 2014b) or LSTM (Hochreiter and Schmidhuber, 1997) model for sentence classification corresponds to the architecture without the red rectangle.

Specifically, in wordMixup, all sentences are zero-padded to the same length, and interpolation is then conducted for each dimension of each word in a sentence. A piece of text, such as a sentence with N words, can be represented as a matrix B ∈ R^{N×d}. Each row t of the matrix corresponds to one word (denoted B_t), represented by a d-dimensional vector provided either by a learned word embedding table or generated randomly. Formally, consider a pair of samples (B_i, y_i) and (B_j, y_j), where B_i and B_j denote the embedding matrices of the input sentence pair, and y_i and y_j denote the corresponding one-hot class labels.
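The core interpolation of Eqs. (1)-(2), with λ drawn from Beta(α, α) and pairs formed within a mini-batch, can be sketched as follows. This is an illustrative NumPy sketch under our own naming, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup_batch(x, y, alpha=1.0):
    """Mix a mini-batch as in Eqs. (1)-(2): each sample is paired with
    another sample drawn from the same mini-batch, and both inputs and
    one-hot targets are linearly interpolated with ratio lam.

    With alpha = 1.0, Beta(alpha, alpha) reduces to Uniform(0, 1).
    """
    lam = rng.beta(alpha, alpha)      # mixing policy lambda
    perm = rng.permutation(len(x))    # random pairing within the batch
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix
```

Because each mixed target is a convex combination of one-hot vectors, it still sums to one and can be trained against with the usual cross-entropy loss.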
Then, for the t-th word in the sentence, the linear interpolation process can be formulated as:

B̃_{ij,t} = λ B_{i,t} + (1 − λ) B_{j,t},   (3)
ỹ_{ij} = λ y_i + (1 − λ) y_j.   (4)

The resulting new sample (B̃_{ij}, ỹ_{ij}) is then used for training.

In senMixup, the hidden embeddings (of the same dimension) for the two sentences are first generated by an encoder such as a CNN or LSTM, and the pair of sentence embeddings is then interpolated linearly. Specifically, let f denote the sentence encoder; a pair of sentences B_i and B_j is first encoded into a pair of sentence embeddings f(B_i) and f(B_j), respectively. The mixing is then conducted for each k-th dimension of the sentence embedding:

B̃_{ij}^{(k)} = λ f(B_i)^{(k)} + (1 − λ) f(B_j)^{(k)},   (5)
ỹ_{ij} = λ y_i + (1 − λ) y_j.   (6)

Finally, the embedding vector B̃_{ij} is passed to a softmax layer to produce a distribution over the possible target classes. For training, we use the multi-class cross-entropy loss.

We evaluate the proposed methods on five benchmark sentence classification tasks:

• TREC is a question dataset for categorizing a question into six question types (Li and Roth, 2002).
• MR is a movie review dataset for detecting positive/negative reviews (Pang and Lee, 2005).
• SST-1 is the Stanford Sentiment Treebank, with five category labels (Socher et al., 2013).
• SST-2 is the same as SST-1 but with neutral reviews removed and binary labels.
• Subj is a subjectivity detection dataset for classifying a sentence as subjective or objective (Pang and Lee, 2004).

A summary of the datasets is presented in Table 1. Note that, for comparison purposes on the SST-1 and SST-2 datasets, following (Kim, 2014b; Tai et al., 2015), we trained the models using both phrases and sentences, but only evaluated on sentences at test time.

We evaluate our wordMixup and senMixup using both CNN and LSTM models for sentence classification. We implement the CNN model exactly as reported in (Kim, 2014b,a).
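The wordMixup and senMixup interpolations formulated in Eqs. (3)-(6) can be sketched as below. This is a minimal NumPy illustration under stated assumptions: mean pooling stands in for the CNN/LSTM encoder, and the softmax-layer weights `W` are hypothetical placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def word_mixup(B_i, B_j, y_i, y_j, lam):
    """wordMixup, Eqs. (3)-(4): interpolate every dimension of every
    (zero-padded) word position of two sentence matrices in R^{N x d}."""
    return lam * B_i + (1 - lam) * B_j, lam * y_i + (1 - lam) * y_j

def sen_mixup(B_i, B_j, y_i, y_j, lam, encoder):
    """senMixup, Eqs. (5)-(6): encode each sentence first, then
    interpolate the fixed-size sentence embeddings dimension-wise."""
    h = lam * encoder(B_i) + (1 - lam) * encoder(B_j)
    return h, lam * y_i + (1 - lam) * y_j

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, y_soft):
    """Multi-class cross-entropy against the interpolated soft target."""
    return -np.sum(y_soft * np.log(p + 1e-12))

# Toy example: two zero-padded sentences (N = 6 words, d = 8) with
# one-hot labels over 2 classes; mean pooling stands in for the encoder.
B_i, B_j = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
y_i, y_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])
lam = rng.beta(1.0, 1.0)

h, y_mix = sen_mixup(B_i, B_j, y_i, y_j, lam,
                     encoder=lambda B: B.mean(axis=0))
W = rng.normal(size=(8, 2))  # hypothetical softmax-layer weights
loss = cross_entropy(softmax(h @ W), y_mix)
```

The only difference between the two variants is where the convex combination is taken: wordMixup mixes before the encoder, senMixup after it.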
For the LSTM, we simply replace the convolution/pooling components of the CNN with standard LSTM units as implemented in (Abadi et al., 2016). The final feature map of the CNN and the final state of the LSTM are passed to a logistic regression classifier for label prediction.

To evaluate our models in terms of their regularization effect on training, we consider four word embedding settings: randomly initialized, trainable embeddings (denoted RandomTune); randomly initialized, fixed embeddings (denoted RandomFix); pre-trained, trainable embeddings (denoted PretrainTune); and pre-trained, fixed embeddings (denoted PretrainFix).

Data    c   l    N      V      Test
TREC    6   10   5952   9592   500
SST-1   5   18   11855  17836  2210
SST-2   2   19   9613   16185  1821
Subj    2   23   10000  21323  CV
MR      2   20   10662  18765  CV

Table 1: Summary of the datasets after tokenization. c: number of target labels. l: average sentence length. N: number of samples. V: vocabulary size. Test: test set size (CV means no standard train/test split was provided and thus 10-fold CV was used).
In our experiments, following the exact implementation and settings in (Kim, 2014a), we use filter sizes of 3, 4, and 5, each with 100 feature maps, a dropout rate of 0.5, and L2 regularization of 0.2 for the baseline CNN and LSTM. For datasets without a standard dev set, we randomly select 10% of the training data as the dev set. Training is done with Adam (Kingma and Ba, 2014) over mini-batches of size 50. The pre-trained word embeddings are 300-dimensional GloVe vectors (Pennington et al., 2014). The hidden state dimension for the LSTM is 100. For senMixup and wordMixup, the mixing policy α is set to the default value of one. Also, following the original Mixup (Zhang et al., 2017), we did not use dropout or an L2 constraint for the wordMixup and senMixup models. We train each model 10 times, each for 20000 steps, and report mean test errors and standard deviations.

RandomTune has the largest number of parameters, compared to RandomFix, PretrainTune, and PretrainFix, and thus requires strong regularization to avoid over-fitting the training data. We therefore focus our experiments on the RandomTune setting. The results for this setting are presented in Table 2.

RandomTune                    Trec   SST-1  SST-2  Subj   MR
CNN-KIM Impl. (Kim, 2014b)    91.2   45.0   82.7   89.6   76.1
CNN-HarvardNLP Impl.
CNN + wordMixup
CNN + senMixup

Table 2: Accuracy (%) of the testing methods using CNN (with randomly initialized, trainable embeddings). We report mean scores over 10 runs with standard deviations. Best results highlighted in bold.

RandomTune                                 Trec   SST-1  SST-2  Subj   MR
LSTM-StanfordNLP Impl. (Tai et al., 2015)  N/A    46.4   84.9   N/A    N/A
LSTM-AgrLearn Impl. (Guo et al., 2018a)    N/A    N/A    N/A    90.2   76.2
LSTM - Our Impl.                           86.5
LSTM + wordMixup
LSTM + senMixup                            89.4

Table 3: Accuracy (%) obtained by the testing methods using LSTM (with randomly initialized, trainable embeddings). We report mean scores over 10 runs with standard deviations. Best results highlighted in bold.

The results in Table 2 show that wordMixup and senMixup provide good regularization to the CNN, resulting in accuracy improvements on all five testing datasets. For example, on SST-1 and MR, the relative improvement was over 3.3%. Interestingly, neither wordMixup nor senMixup significantly improved over the baseline on the SST-2 dataset, with senMixup outperforming the baseline by only 0.7%. The results in Table 2 also suggest that senMixup and wordMixup were quite competitive with each other in terms of predictive performance across the five testing datasets. For example, on the Trec dataset senMixup outperformed wordMixup by 1.2%, but on the other four datasets the two methods obtained very similar predictive accuracy.
Regularization Effect
We plot the training and testing cross-entropy loss over the first 12K training steps on the MR dataset in Figure 2. The figure shows that, without (top-left subfigure) or with (top-right subfigure) dropout, the training loss of the CNN quickly drops to zero and provides no training signal for further tuning the network. In contrast, the training loss of wordMixup (bottom-right subfigure) stays above zero throughout training, continuously providing a training signal for network learning. The training loss curve of senMixup (bottom-left subfigure) likewise maintains a relatively high level, allowing the model to keep tuning. The relatively higher training loss of both wordMixup and senMixup is due to the much larger space of mixed samples, which prevents the model from over-fitting to a limited number of individual examples.
LSTM Networks as Sentence Encoder
We also evaluate the effect of using an LSTM as the sentence encoder. The results in Table 3 show that, similar to the case of the CNN encoder, wordMixup and senMixup with an LSTM encoder also improved the predictive performance of the baseline models. For example, the largest improvements came from Trec and SST-1 (relative improvements of 4.62% and 5.22%), which have six and five classes, respectively. The results in the table also suggest that, on the Subj dataset, wordMixup outperformed senMixup by 1.2%, while on the other four datasets the two methods performed comparably well.

One notable fact, compared with the CNN-based models in Table 2, is that on the SST-2 dataset both wordMixup and senMixup with an LSTM were able to improve over the baseline by about 2%.
Results for RandomFix, PretrainTune, and PretrainFix
Figure 2: Training and testing cross-entropy loss obtained by the baseline CNN without dropout (top-left), the baseline CNN with dropout and L2 (top-right), wordMixup (bottom-right), and senMixup (bottom-left).

Results for the RandomFix, PretrainTune, and PretrainFix settings are presented in Tables 4, 5, and 6, respectively.

RandomFix          Trec   SST-1  SST-2  Subj   MR
CNN                88.4
CNN + wordMixup
CNN + senMixup     88.8

Table 4: Accuracy (%) obtained by the testing methods using CNN with randomly initialized and fixed embeddings. Best results highlighted in bold.

PretrainTune       Trec   SST-1  SST-2  Subj   MR
CNN                92.1
CNN + wordMixup
CNN + senMixup

Table 5: Accuracy (%) of the testing methods using CNN with pre-trained GloVe and trainable embeddings. Best results highlighted in bold.

PretrainFix        Trec   SST-1  SST-2  Subj   MR
CNN                92.0
CNN + wordMixup
CNN + senMixup

Table 6: Accuracy (%) obtained by the testing methods using CNN with pre-trained GloVe and fixed embeddings. Best results highlighted in bold.

The results in these three tables further confirm that data augmentation through wordMixup and senMixup can improve the predictive performance of the base models, except on the SST-2 dataset. On SST-2, both wordMixup and senMixup degraded the predictive accuracy of the baseline when the word embeddings were not allowed to be tuned during training. With learnable word embeddings, although both methods failed to significantly improve over the baseline on this dataset, they did obtain predictive performance similar to it. In short, our experiments suggest that when the word embeddings are tuned, both wordMixup and senMixup are able to improve the predictive accuracy of the base models.
Conclusion and Future Work
Inspired by the success of Mixup, a simple and effective data augmentation method based on sample interpolation for image recognition, we investigated two variants of Mixup for sentence classification. We empirically showed that they can improve accuracy over both CNN and LSTM sentence classification models. Interestingly, our studies show that such interpolation strategies can serve as an effective, domain-independent regularizer for avoiding over-fitting in sentence classification.

We plan to investigate some recently proposed variants of Mixup, such as Manifold Mixup (Verma et al.), where interpolation is performed in a randomly selected layer of the network, and AdaMixup (Guo et al., 2018b), which addresses the manifold intrusion issue in Mixup. We are also interested in questions such as what the mixed sentences look like and why interpolation works for sentence classification.
References
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 265-283.

Hongyu Guo, Yongyi Mao, and Richong Zhang. 2018a. Aggregated learning: A vector quantization approach to learning with neural networks. CoRR, abs/1807.10251.

Hongyu Guo, Yongyi Mao, and Richong Zhang. 2018b. Mixup as locally linear out-of-manifold regularization. CoRR, abs/1809.02499.

S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Navdeep Jaitly and Geoffrey E. Hinton. 2015. Vocal tract length perturbation (VTLP) improves speech recognition.

Kim. 2014a. https://github.com/yoonkim/CNN_sentence.

Yoon Kim. 2014b. Convolutional neural networks for sentence classification. In EMNLP 2014, pages 1746-1751.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio augmentation for speech recognition. In INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 6-10, 2015, pages 3586-3589.

Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. In NAACL-HLT 2018.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM, 60(6):84-90.

Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, COLING '02.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 21-26 July, 2004, Barcelona, Spain, pages 271-278.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL '05, pages 115-124.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Gözde Gül Şahin and Mark Steedman. 2018. Data augmentation via dependency tree morphing for low-resource languages. In EMNLP 2018, pages 5004-5009.

Patrice Simard, Yann LeCun, John S. Denker, and Bernard Victorri. 1998. Transformation invariance in pattern recognition - tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop, pages 239-274.

Richard Socher, Cliff C. Lin, Andrew Y. Ng, and Christopher D. Manning. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 26th International Conference on Machine Learning (ICML).

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '13, Seattle, USA. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215.

Kai Sheng Tai, Richard Socher, and Christopher Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of ACL.

Vikas Verma, Alex Lamb, Christopher Beckham, Aaron C. Courville, Ioannis Mitliagkas, and Yoshua Bengio. Manifold mixup: Encouraging meaningful on-manifold interpolation as a regularizer. CoRR.

William Yang Wang and Diyi Yang. 2015. That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In EMNLP 2015.

Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. 2017. mixup: Beyond empirical risk minimization.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NIPS 2015.