A Comparative Study on Regularization Strategies for Embedding-based Neural Networks
Hao Peng∗, Lili Mou∗, Ge Li†, Yunchuan Chen, Yangyang Lu, Zhi Jin
Software Institute, Peking University, 100871, P. R. China
{penghao.pku, doublepower.mou}@gmail.com, {lige, luyy11, zhijin}@sei.pku.edu.cn
University of Chinese Academy of Sciences, [email protected]
∗ Equal contribution. † Corresponding author.
Abstract
This paper compares different regularization strategies to address a common phenomenon, severe overfitting, in embedding-based neural networks for NLP. We chose two widely studied neural models and tasks as our testbed. We tried several frequently applied or newly proposed regularization strategies, including penalizing weights (embeddings excluded), penalizing embeddings, re-embedding words, and dropout. We also emphasized incremental hyperparameter tuning and the combination of different regularizations. The results provide a picture of tuning hyperparameters for neural NLP models.
Neural networks have exhibited considerable potential in various fields (Krizhevsky et al., 2012; Graves et al., 2013). In the early years of neural NLP research, neural networks were used in language modeling (Bengio et al., 2003; Morin and Bengio, 2005; Mnih and Hinton, 2009); recently, they have been applied to various supervised tasks, such as named entity recognition (Collobert and Weston, 2008), sentiment analysis (Socher et al., 2011; Mou et al., 2015), and relation classification (Zeng et al., 2014; Xu et al., 2015). In NLP, neural networks are typically combined with word embeddings, which are usually first pretrained by unsupervised algorithms such as Mikolov et al. (2013); they are then fed forward to standard neural models and fine-tuned during supervised learning. However, embedding-based neural networks usually suffer from severe overfitting because of the high dimensionality of their parameters.
A curious question is whether we can regularize embedding-based NLP neural models to improve generalization. Although existing and newly proposed regularization methods might alleviate the problem, their inherent performance in neural NLP models is not clear: the use of embeddings is sparse, and their behaviors may differ from those in other scenarios like image recognition. Further, selecting hyperparameters to pursue the best performance by validation is extremely time-consuming, as suggested in Collobert et al. (2011). Therefore, new studies are needed to provide a more complete picture regarding regularization for neural natural language processing. Specifically, we focus on the following research questions in this paper.

RQ1: How do different regularization strategies typically behave in embedding-based neural networks?
RQ2: Can regularization coefficients be tuned incrementally during training so as to ease the burden of hyperparameter tuning?
RQ3: What is the effect of combining different regularization strategies?

In this paper, we systematically and quantitatively compared four different regularization strategies, namely penalizing weights, penalizing embeddings, the newly proposed word re-embedding (Labutov and Lipson, 2013), and dropout (Srivastava et al., 2014). We analyzed these regularization methods on two widely studied models and tasks. We also emphasized incremental hyperparameter tuning and the combination of different regularization methods.

Our experiments provide some interesting results: (1) Regularizations do help generalization, but their effect depends largely on the dataset's size. (2) Penalizing the ℓ2-norm of embeddings helps optimization as well, improving training accuracy unexpectedly. (3) Incremental hyperparameter tuning achieves similar performance, indicating that regularizations mainly serve as a "local" effect. (4) Dropout performs slightly worse than the ℓ2 penalty in our experiments; however, provided a very small ℓ2 penalty, dropping out hidden units and penalizing the ℓ2-norm are generally complementary. (5) The newly proposed re-embedding words method is not effective in our experiments.

Experiment I: Relation extraction. The dataset in this experiment comes from SemEval-2010 Task 8. The goal is to classify the relationship between two marked entities in each sentence. We refer interested readers to recent advances, e.g., Hashimoto et al. (2013), Zeng et al. (2014), and Xu et al. (2015). To make our task and model general, however, we do not consider entity tagging information; nor do we distinguish the order of the two entities. In total, there are 10 labels, i.e., 9 different relations plus a default other.

Regarding the neural model, we applied Collobert's convolutional neural network (CNN) (Collobert and Weston, 2008) with minor modifications. The model comprises a fixed-window convolutional layer with window size equal to 5, padded at the end of each sentence; a max pooling layer; a tanh hidden layer; and a softmax output layer. (A code sketch of this architecture is given below.)
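The following is a minimal NumPy sketch of such a fixed-window convolutional architecture, assuming window size 5, end-of-sentence padding, max pooling, a tanh hidden layer, and a softmax output over 10 labels; layer sizes, initialization, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cnn_forward(Phi, word_ids, params, window=5):
    """Forward pass of a Collobert-style CNN: fixed-window convolution over
    word embeddings, max pooling, a tanh hidden layer, and a softmax output.
    The sentence is padded at the end so every window position is full."""
    W_conv, b_conv, W_hid, b_hid, W_out, b_out = params
    d = Phi.shape[1]
    x = np.vstack([Phi[word_ids], np.zeros((window - 1, d))])
    conv = np.array([np.tanh(W_conv @ x[i:i + window].ravel() + b_conv)
                     for i in range(len(word_ids))])   # one vector per window
    pooled = conv.max(axis=0)                           # max pooling over positions
    hidden = np.tanh(W_hid @ pooled + b_hid)            # tanh hidden layer
    return softmax(W_out @ hidden + b_out)              # distribution over labels

# Hypothetical configuration with all layers 50-dimensional and 10 labels,
# mirroring the experimental setup; vocabulary size is a placeholder.
rng = np.random.default_rng(0)
d = n_conv = n_hid = 50
n_labels, vocab_size = 10, 1000
Phi = rng.normal(scale=0.1, size=(vocab_size, d))       # embedding table
params = (rng.normal(scale=0.1, size=(n_conv, 5 * d)), np.zeros(n_conv),
          rng.normal(scale=0.1, size=(n_hid, n_conv)), np.zeros(n_hid),
          rng.normal(scale=0.1, size=(n_labels, n_hid)), np.zeros(n_labels))
probs = cnn_forward(Phi, [3, 17, 42, 8], params)        # toy 4-word sentence
```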
Experiment II: Sentiment analysis. This is another testbed for neural NLP, aiming to predict the sentiment of a sentence. The dataset is the Stanford sentiment treebank (Socher et al., 2011; http://nlp.stanford.edu/sentiment/); the target labels are strongly/weakly positive/negative, or neutral. We used the recursive neural network (RNN), which was proposed in Socher et al. (2011) and further developed in Socher et al. (2012) and Irsoy and Cardie (2014). RNNs make use of binarized constituency trees and recursively encode children's information into their parent's; the root vector is finally used for sentiment classification.

Experimental Setup. To set up a fair comparison, we set all layers to be 50-dimensional in advance (rather than by validation). Such a setting has been used in previous work like Zhao et al. (2015). Our embeddings are pretrained on the Wikipedia corpus using Collobert and Weston (2008). The learning rate is 0.1 and fixed in Experiment I; for the RNN, however, we found that learning rate decay helps to prevent parameter blowup (probably due to the recursive, and thus chaotic, nature). Therefore, we applied power decay (Senior et al., 2013). For each strategy, we tried a large range of regularization coefficients, varied by factors of 10, extensively from underfitting to no effect. We ran the model 5 times with different initializations. We used mini-batch stochastic gradient descent; gradients are computed by standard backpropagation. For source code, please refer to our project website (https://sites.google.com/site/regembeddingnn/).

It should be noted that the goal of this paper is not to outperform or reproduce state-of-the-art results. Instead, we would like to have a fair comparison. The testbed of our work is two widely studied models and tasks, which were not chosen on purpose. During the experiments, we tried to make the comparison as fair as possible. Therefore, we think that the results of this work can be generalized to similar scenarios.
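As a concrete illustration of the training loop described above (mini-batch SGD with an optional power-decay learning rate), here is a minimal sketch; the decay's functional form, the constants r and c, and the function names are assumptions, not the exact settings used in the paper.

```python
def power_decay_lr(lr0, step, r=1000.0, c=1.0):
    """A power-decay schedule in the spirit of Senior et al. (2013):
    lr_t = lr0 * (1 + t / r) ** (-c).  r and c are placeholder values."""
    return lr0 * (1.0 + step / r) ** (-c)

def train_epoch(params, grad_fn, batches, lr0=0.1, step=0, decay=False):
    """One epoch of mini-batch SGD.  `grad_fn(params, batch)` is assumed to
    return gradients from standard backpropagation, matching `params`.
    Experiment I uses a fixed lr0 = 0.1; the RNN uses the decayed rate."""
    for batch in batches:
        lr = power_decay_lr(lr0, step) if decay else lr0
        grads = grad_fn(params, batch)
        params = [p - lr * g for p, g in zip(params, grads)]
        step += 1
    return params, step
```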
In this section, we describe the four regularization strategies used in our experiments.

• Penalizing the ℓ2-norm of weights. Let E be the cross-entropy error for classification and R be a regularization term. The overall cost function is J = E + λR, where λ is the coefficient. In this case, R = ‖W‖², and the coefficient is denoted as λ_W.

• Penalizing the ℓ2-norm of embeddings. Some studies do not distinguish embeddings from connectional weights for regularization (Tai et al., 2015). However, we would like to analyze their effects separately, because embeddings are sparse in use. Let Φ denote the embeddings; then we have R = ‖Φ‖².

• Re-embedding words (Labutov and Lipson, 2013). Suppose Φ₀ denotes the original embeddings trained on a large corpus, and Φ denotes the embeddings fine-tuned during supervised training. We would like to penalize the norm of the difference between Φ and Φ₀, i.e., R = ‖Φ − Φ₀‖². In the limit of the penalty to infinity, the model is mathematically equivalent to "frozen embeddings," where word vectors are used as surface features.

• Dropout (Srivastava et al., 2014). In this strategy, each neural node is set to 0 with a predefined dropout probability p during training; when testing, all nodes are used, with activations multiplied by 1 − p.

A code sketch of these four strategies is given after Figure 1.
Figure 1: Averaged learning curves. Left: Experiment I, relation extraction with CNN. Right: Experiment II, sentiment analysis with RNN. From top to bottom, we penalize weights, penalize embeddings, re-embed words, and drop out. Dashed lines refer to training accuracies; solid lines are validation accuracies. Panels: (a) penalizing weights in Experiment I; (b) penalizing weights in Experiment II; (c) penalizing embeddings in Experiment I; (d) penalizing embeddings in Experiment II; (e) re-embedding words in Experiment I; (f) re-embedding words in Experiment II; (g) applying dropout in Experiment I; (h) applying dropout in Experiment II.
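To make the four strategies concrete, the following is a minimal sketch assuming squared ℓ2 penalties and standard (non-inverted) dropout with test-time scaling; function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np

def regularized_cost(cross_entropy, W, Phi, Phi0,
                     lam_W=0.0, lam_embed=0.0, lam_reembed=0.0):
    """Overall cost J = E + lambda * R for the three penalty-based strategies:
      penalizing weights:     R = ||W||^2          (embeddings excluded)
      penalizing embeddings:  R = ||Phi||^2
      re-embedding words:     R = ||Phi - Phi0||^2 (Labutov and Lipson, 2013)
    The squared-norm form and separate coefficients follow the text above."""
    return (cross_entropy
            + lam_W * np.sum(W ** 2)
            + lam_embed * np.sum(Phi ** 2)
            + lam_reembed * np.sum((Phi - Phi0) ** 2))

def dropout(h, p_drop, training, rng=None):
    """Dropout (Srivastava et al., 2014): zero each unit with probability
    p_drop during training; at test time keep all units and multiply
    activations by (1 - p_drop)."""
    if not training:
        return h * (1.0 - p_drop)
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
    return h * mask
```

Setting lam_reembed very large approximates the "frozen embeddings" limit described above, where word vectors act as surface features.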
Individual Regularization Behaviors
This section compares the behavior of each strategy. We first conducted both experiments without regularization to obtain baseline accuracies. Then we plot in Figure 1 the learning curves when each regularization strategy is applied individually. We report training and validation accuracies throughout this paper. The main findings are as follows.

• Penalizing the ℓ2-norm of weights helps generalization; the effect depends largely on the size of the training set. Experiment I contains 7,000 training samples and the improvement is 6.98%; Experiment II contains more than 150k samples, and the improvement is only 2.07%. Such results are consistent with other machine learning models.

• Penalizing the ℓ2-norm of embeddings unexpectedly helps optimization (improves training accuracy). One plausible explanation is that, since embeddings are trained on a large corpus by unsupervised methods, they tend to settle down to large values and may not perfectly agree with the tasks of interest. The ℓ2 penalty pushes the embeddings towards small values and thus helps optimization. Regarding validation accuracy, Experiment I is improved by 6.89%, whereas Experiment II shows no significant difference.

• Re-embedding words does not improve generalization. In particular, in Experiment II, the ultimate accuracy is improved by only 0.44%, which is not large. Further, too much penalty hurts the models in both experiments. In the limit of λ_reembed to infinity, re-embedding words is mathematically equivalent to using embeddings as surface features, that is, freezing embeddings. Such a strategy is sometimes applied in the literature, e.g., Hu et al. (2014), but is not favorable as suggested by our experiments.

• Dropout helps generalization. Under the best settings, the eventual accuracy is improved by 3.12% and 1.76%, respectively. In our experiments, dropout alone is not as useful as the ℓ2 penalty. However, other studies report that dropout is very effective (Irsoy and Cardie, 2014). Our results are not consistent with theirs; different dimensionality may contribute to this disagreement, but more experiments are needed to confirm the hypothesis.

Incremental Hyperparameter Tuning

The above experiments show that regularization generally helps prevent overfitting. To pursue the best performance, we need to try out different hyperparameters through validation. Unfortunately, training deep neural networks is time-consuming, preventing full grid search from being a practical technique. Things would get easier if we could tune hyperparameters incrementally, that is, train the model without regularization first, and then add the penalty.

In this section, we study whether the ℓ2 penalty of weights and embeddings can be tuned incrementally. We exclude the dropout strategy because it does not make much sense to incrementally drop out hidden units. Besides, from this section on, we focus only on Experiment I due to time and space limits.

Before continuing, we may envision several possibilities on how regularization works.

• (On initial effects) As the ℓ2-norm prevents parameters from growing large, adding it at early stages may cause parameters to settle down to local optima. If this is the case, delayed penalty would help parameters get over local optima, leading to better performance.

• (On eventual effects) The ℓ2 penalty lifts the error surface of large weights. Adding such a penalty may cause parameters to settle down to (a) almost the same catchment basin, or (b) different basins. In case (a), when the penalty is added does not matter much.
In case (b), however, it makes a difference, because parameters would have already gravitated to catchment basins of larger values before regularization is added, which means incremental hyperparameter tuning would be ineffective.

To verify the above conjectures, we designed four settings: adding the penalty (1) at the beginning, (2) before overfitting at epoch 2, (3) at peak performance (epoch 5), and (4) after overfitting (when validation accuracy drops) at epoch 10. A minimal sketch of such a schedule is given below.

Figure 2 plots the learning curves for penalizing weights and embeddings, respectively; the baseline (without regularization) is also included. For both weights and embeddings, all settings yield similar ultimate validation accuracies. This shows that ℓ2 regularization mainly serves as a "local" effect: it changes the error surface, but parameters tend to settle down to the same catchment basin.
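The incremental settings above can be expressed as a simple coefficient schedule; a minimal sketch follows, where the epoch thresholds and the target coefficient are placeholders, not values from the paper.

```python
def incremental_coefficient(epoch, start_epoch, target_lambda):
    """Return the L2 coefficient in effect at `epoch`: zero before
    `start_epoch` (train without regularization), then `target_lambda`.
    start_epoch = 0, 2, 5, 10 correspond to the four settings above."""
    return target_lambda if epoch >= start_epoch else 0.0

# Example: switch on the weight penalty at peak performance (epoch 5);
# 1e-4 is a placeholder coefficient, not a value from the paper.
for epoch in range(12):
    lam_W = incremental_coefficient(epoch, start_epoch=5, target_lambda=1e-4)
    # ... run one training epoch with cost J = E + lam_W * ||W||^2 ...
```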
Figure 2: Tuning hyperparameters incrementally in Experiment I. Penalty is added at epochs 0, 2, 5, and 10, respectively. We chose the coefficients yielding the best performance in Figure 1. The controlled trial (no regularization) is early stopped because the accuracy has already decreased. Panels: (a) incrementally penalizing the ℓ2-norm of weights; (b) incrementally penalizing the ℓ2-norm of embeddings.

Table 1: Accuracy (in percentage) when we combine the ℓ2-norms of weights and embeddings (Experiment I). Bold numbers are among the highest accuracies (greater than the peak performance minus 1.5 times the standard deviation, i.e., 1.26 in percentage).

Table 2: Combining ℓ2 regularization and dropout. Left: connectional weights. Right: embeddings. (p refers to the dropout rate.)

We notice that a recent report also shows that local optima may not play an important role in training neural networks, if the effect of parameter symmetry is ruled out (Breuel, 2015). We also observe that regularization helps generalization as soon as it is added (Figure 2a), and that regularizing embeddings helps optimization right after the penalty is applied (Figure 2b).

Combining Different Regularizations

We are further curious about the behaviors when different regularization methods are combined. Table 1 shows that combining the ℓ2-norms of weights and embeddings results in a further accuracy improvement of 3–4 percentage points over applying either one of them alone. In a certain range of coefficients, weights and embeddings are complementary: given one hyperparameter, we can tune the other to achieve a result among the highest ones. Such compensation is also observed when comparing the ℓ2 penalty with dropout (Table 2): although the peak performance is obtained by pure ℓ2 regularization, applying dropout with a small ℓ2 penalty also achieves a similar accuracy. The dropout rate is not very sensitive, provided it is small.

Conclusion

In this paper, we systematically compared four regularization strategies for embedding-based neural networks in NLP. Based on the experimental results, we answer our research questions as follows. (1) Regularization methods (except re-embedding words) basically help generalization. Penalizing the ℓ2-norm of embeddings unexpectedly helps optimization as well. Regularization performance depends largely on the dataset's size. (2) The ℓ2 penalty mainly acts as a local effect; hyperparameters can be tuned incrementally. (3) Combining the ℓ2-norms of weights and embeddings, or dropout and the ℓ2 penalty, further improves generalization; their coefficients are mostly complementary within a certain range. These empirical results of regularization strategies shed some light on tuning neural models for NLP.

Acknowledgments
This research is supported by the National Basic Research Program of China (the 973 Program) under Grant No. 2015CB352201 and the National Natural Science Foundation of China under Grant No. 61232015. We would also like to thank Hao Jia and Ran Jia.

References

[Bengio et al.2003] Yoshua Bengio, R. Ducharme, P. Vincent, and C. Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

[Breuel2015] T. Breuel. 2015. The effects of hyperparameters on SGD training of neural networks. arXiv preprint arXiv:1508.02788.

[Collobert and Weston2008] Ronan Collobert and J. Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning.

[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

[Graves et al.2013] Alex Graves, A. Mohamed, and G. Hinton. 2013. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[Hashimoto et al.2013] Kazuma Hashimoto, M. Miwa, Y. Tsuruoka, and T. Chikayama. 2013. Simple customization of recursive neural networks for semantic relation classification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

[Hu et al.2014] Baotian Hu, Z. Lu, H. Li, and Q. Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems.

[Irsoy and Cardie2014] Ozan Irsoy and C. Cardie. 2014. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems.

[Krizhevsky et al.2012] Alex Krizhevsky, I. Sutskever, and G. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.

[Labutov and Lipson2013] Igor Labutov and H. Lipson. 2013. Re-embedding words. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2).

[Mikolov et al.2013] Tomas Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

[Mnih and Hinton2009] Andriy Mnih and G. Hinton. 2009. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems.

[Morin and Bengio2005] Frederic Morin and Y. Bengio. 2005. Hierarchical probabilistic neural network language model. In Proceedings of the International Conference on Artificial Intelligence and Statistics.

[Mou et al.2015] Lili Mou, Hao Peng, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2015. Tree-based convolution: A new neural architecture for sentence modeling. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (to appear).

[Senior et al.2013] Andrew Senior, G. Heigold, M. Ranzato, and K. Yang. 2013. An empirical study of learning rates in deep neural networks for speech recognition. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[Socher et al.2011] Richard Socher, J. Pennington, E. Huang, A. Ng, and C. Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

[Socher et al.2012] Richard Socher, B. Huval, C. Manning, and A. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

[Srivastava et al.2014] Nitish Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, pages 1929–1958.

[Tai et al.2015] Kai Sheng Tai, R. Socher, and C. D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.

[Xu et al.2015] Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. 2015. Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (to appear).

[Zeng et al.2014] Daojian Zeng, K. Liu, S. Lai, G. Zhou, and J. Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of the International Conference on Computational Linguistics.

[Zhao et al.2015] Han Zhao, Z. Lu, and P. Poupart. 2015. Self-adaptive hierarchical sentence model. In Proceedings of the International Joint Conference on Artificial Intelligence.