A Comparative Study on Regularization Strategies for Embedding-based Neural Networks
Hao Peng∗, Lili Mou∗, Ge Li†, Yunchuan Chen, Yangyang Lu, Zhi Jin
Software Institute, Peking University, 100871, P. R. China
{penghao.pku, doublepower.mou}@gmail.com, {lige, luyy11, zhijin}@sei.pku.edu.cn
University of Chinese Academy of Sciences, [email protected]
∗ Equal contribution. † Corresponding author.
Abstract
This paper compares different regularization strategies to address a common phenomenon, severe overfitting, in embedding-based neural networks for NLP. We chose two widely studied neural models and tasks as our testbed. We tried several frequently applied or newly proposed regularization strategies, including penalizing weights (embeddings excluded), penalizing embeddings, re-embedding words, and dropout. We also emphasized incremental hyperparameter tuning and the combination of different regularizations. The results provide a picture of tuning hyperparameters for neural NLP models.
Neural networks have exhibited considerable potential in various fields (Krizhevsky et al., 2012; Graves et al., 2013). In the early years of neural NLP research, neural networks were used in language modeling (Bengio et al., 2003; Morin and Bengio, 2005; Mnih and Hinton, 2009); recently, they have been applied to various supervised tasks, such as named entity recognition (Collobert and Weston, 2008), sentiment analysis (Socher et al., 2011; Mou et al., 2015), and relation classification (Zeng et al., 2014; Xu et al., 2015). In NLP, neural networks are typically combined with word embeddings, which are usually first pretrained by unsupervised algorithms such as Mikolov et al. (2013); they are then fed forward to standard neural models and fine-tuned during supervised learning. However, embedding-based neural networks usually suffer from severe overfitting because of the high dimensionality of their parameters.
A curious question is whether we can regularize embedding-based NLP neural models to improve generalization. Although existing and newly proposed regularization methods might alleviate the problem, their inherent performance in neural NLP models is not clear: the use of embeddings is sparse, and their behaviors may differ from those in other scenarios like image recognition. Further, selecting hyperparameters to pursue the best performance by validation is extremely time-consuming, as suggested in Collobert et al. (2011). Therefore, new studies are needed to provide a more complete picture regarding regularization for neural natural language processing. Specifically, we focus on the following research questions in this paper.

RQ1: How do different regularization strategies typically behave in embedding-based neural networks?
RQ2: Can regularization coefficients be tuned incrementally during training so as to ease the burden of hyperparameter tuning?
RQ3: What is the effect of combining different regularization strategies?

In this paper, we systematically and quantitatively compared four different regularization strategies, namely penalizing weights, penalizing embeddings, the newly proposed word re-embedding (Labutov and Lipson, 2013), and dropout (Srivastava et al., 2014). We analyzed these regularization methods on two widely studied models and tasks. We also emphasized incremental hyperparameter tuning and the combination of different regularization methods.

Our experiments provide some interesting results: (1) Regularizations do help generalization, but their effect depends largely on the dataset's size. (2) Penalizing the ℓ2-norm of embeddings helps optimization as well, improving training accuracy unexpectedly. (3) Incremental hyperparameter tuning achieves similar performance, indicating that regularizations mainly serve as a "local" effect. (4) Dropout performs slightly worse than the ℓ2 penalty in our experiments; however, provided a very small ℓ2 penalty, dropping out hidden units and penalizing the ℓ2-norm are generally complementary. (5) The newly proposed re-embedding words method is not effective in our experiments.

Experiment I: Relation extraction. The dataset in this experiment comes from SemEval-2010 Task 8. The goal is to classify the relationship between two marked entities in each sentence. We refer interested readers to recent advances, e.g., Hashimoto et al. (2013), Zeng et al. (2014), and Xu et al. (2015). To make our task and model general, however, we do not consider entity tagging information; nor do we distinguish the order of the two entities. In total, there are 10 labels, i.e., 9 different relations plus a default other.

Regarding the neural model, we applied Collobert's convolutional neural network (CNN) (Collobert and Weston, 2008) with minor modifications. The model comprises a fixed-window convolutional layer with window size equal to 5, padded at the end of each sentence; a max pooling layer; a tanh hidden layer; and a softmax output layer. (A code sketch of this architecture is given below.)
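The following is a minimal NumPy sketch of such a fixed-window convolutional architecture, assuming window size 5, end-of-sentence padding, max pooling, a tanh hidden layer, and a softmax output over 10 labels; layer sizes, initialization, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cnn_forward(Phi, word_ids, params, window=5):
    """Forward pass of a Collobert-style CNN: fixed-window convolution over
    word embeddings, max pooling, a tanh hidden layer, and a softmax output.
    The sentence is padded at the end so every window position is full."""
    W_conv, b_conv, W_hid, b_hid, W_out, b_out = params
    d = Phi.shape[1]
    x = np.vstack([Phi[word_ids], np.zeros((window - 1, d))])
    conv = np.array([np.tanh(W_conv @ x[i:i + window].ravel() + b_conv)
                     for i in range(len(word_ids))])   # one vector per window
    pooled = conv.max(axis=0)                           # max pooling over positions
    hidden = np.tanh(W_hid @ pooled + b_hid)            # tanh hidden layer
    return softmax(W_out @ hidden + b_out)              # distribution over labels

# Hypothetical configuration with all layers 50-dimensional and 10 labels,
# mirroring the experimental setup; vocabulary size is a placeholder.
rng = np.random.default_rng(0)
d = n_conv = n_hid = 50
n_labels, vocab_size = 10, 1000
Phi = rng.normal(scale=0.1, size=(vocab_size, d))       # embedding table
params = (rng.normal(scale=0.1, size=(n_conv, 5 * d)), np.zeros(n_conv),
          rng.normal(scale=0.1, size=(n_hid, n_conv)), np.zeros(n_hid),
          rng.normal(scale=0.1, size=(n_labels, n_hid)), np.zeros(n_labels))
probs = cnn_forward(Phi, [3, 17, 42, 8], params)        # toy 4-word sentence
```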
Experiment II: Sentiment analysis. This is another testbed for neural NLP, aiming to predict the sentiment of a sentence. The dataset is the Stanford sentiment treebank (Socher et al., 2011; http://nlp.stanford.edu/sentiment/); the target labels are strongly/weakly positive/negative, or neutral. We used the recursive neural network (RNN), which was proposed in Socher et al. (2011) and further developed in Socher et al. (2012) and Irsoy and Cardie (2014). RNNs make use of binarized constituency trees and recursively encode children's information into their parent's; the root vector is finally used for sentiment classification.

Experimental Setup. To set up a fair comparison, we set all layers to be 50-dimensional in advance (rather than by validation). Such a setting has been used in previous work like Zhao et al. (2015). Our embeddings are pretrained on the Wikipedia corpus using Collobert and Weston (2008). The learning rate is 0.1 and fixed in Experiment I; for the RNN, however, we found that learning rate decay helps to prevent parameter blowup (probably due to the recursive, and thus chaotic, nature). Therefore, we applied power decay (Senior et al., 2013). For each strategy, we tried a large range of regularization coefficients, varied by factors of 10, extensively from underfitting to no effect. We ran the model 5 times with different initializations. We used mini-batch stochastic gradient descent; gradients are computed by standard backpropagation. For source code, please refer to our project website (https://sites.google.com/site/regembeddingnn/).

It should be noted that the goal of this paper is not to outperform or reproduce state-of-the-art results. Instead, we would like to have a fair comparison. The testbed of our work is two widely studied models and tasks, which were not chosen on purpose. During the experiments, we tried to make the comparison as fair as possible. Therefore, we think that the results of this work can be generalized to similar scenarios.
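As a concrete illustration of the training loop described above (mini-batch SGD with an optional power-decay learning rate), here is a minimal sketch; the decay's functional form, the constants r and c, and the function names are assumptions, not the exact settings used in the paper.

```python
def power_decay_lr(lr0, step, r=1000.0, c=1.0):
    """A power-decay schedule in the spirit of Senior et al. (2013):
    lr_t = lr0 * (1 + t / r) ** (-c).  r and c are placeholder values."""
    return lr0 * (1.0 + step / r) ** (-c)

def train_epoch(params, grad_fn, batches, lr0=0.1, step=0, decay=False):
    """One epoch of mini-batch SGD.  `grad_fn(params, batch)` is assumed to
    return gradients from standard backpropagation, matching `params`.
    Experiment I uses a fixed lr0 = 0.1; the RNN uses the decayed rate."""
    for batch in batches:
        lr = power_decay_lr(lr0, step) if decay else lr0
        grads = grad_fn(params, batch)
        params = [p - lr * g for p, g in zip(params, grads)]
        step += 1
    return params, step
```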
In this section, we describe the four regularization strategies used in our experiments.

• Penalizing the ℓ2-norm of weights. Let E be the cross-entropy error for classification and R be a regularization term. The overall cost function is J = E + λR, where λ is the coefficient. In this case, R = ‖W‖², and the coefficient is denoted as λ_W.

• Penalizing the ℓ2-norm of embeddings. Some studies do not distinguish embeddings from connectional weights for regularization (Tai et al., 2015). However, we would like to analyze their effects separately, because embeddings are sparse in use. Let Φ denote the embeddings; then we have R = ‖Φ‖².

• Re-embedding words (Labutov and Lipson, 2013). Suppose Φ₀ denotes the original embeddings trained on a large corpus, and Φ denotes the embeddings fine-tuned during supervised training. We would like to penalize the norm of the difference between Φ and Φ₀, i.e., R = ‖Φ − Φ₀‖². In the limit of the penalty to infinity, the model is mathematically equivalent to "frozen embeddings," where word vectors are used as surface features.

• Dropout (Srivastava et al., 2014). In this strategy, each neural node is set to 0 with a predefined dropout probability p during training; when testing, all nodes are used, with activations multiplied by 1 − p.

A code sketch of these four strategies is given after Figure 1.
Figure 1: Averaged learning curves. Left: Experiment I, relation extraction with CNN. Right: Experiment II, sentiment analysis with RNN. From top to bottom, we penalize weights, penalize embeddings, re-embed words, and drop out. Dashed lines refer to training accuracies; solid lines are validation accuracies. Panels: (a) penalizing weights in Experiment I; (b) penalizing weights in Experiment II; (c) penalizing embeddings in Experiment I; (d) penalizing embeddings in Experiment II; (e) re-embedding words in Experiment I; (f) re-embedding words in Experiment II; (g) applying dropout in Experiment I; (h) applying dropout in Experiment II.
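To make the four strategies concrete, the following is a minimal sketch assuming squared ℓ2 penalties and standard (non-inverted) dropout with test-time scaling; function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np

def regularized_cost(cross_entropy, W, Phi, Phi0,
                     lam_W=0.0, lam_embed=0.0, lam_reembed=0.0):
    """Overall cost J = E + lambda * R for the three penalty-based strategies:
      penalizing weights:     R = ||W||^2          (embeddings excluded)
      penalizing embeddings:  R = ||Phi||^2
      re-embedding words:     R = ||Phi - Phi0||^2 (Labutov and Lipson, 2013)
    The squared-norm form and separate coefficients follow the text above."""
    return (cross_entropy
            + lam_W * np.sum(W ** 2)
            + lam_embed * np.sum(Phi ** 2)
            + lam_reembed * np.sum((Phi - Phi0) ** 2))

def dropout(h, p_drop, training, rng=None):
    """Dropout (Srivastava et al., 2014): zero each unit with probability
    p_drop during training; at test time keep all units and multiply
    activations by (1 - p_drop)."""
    if not training:
        return h * (1.0 - p_drop)
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
    return h * mask
```

Setting lam_reembed very large approximates the "frozen embeddings" limit described above, where word vectors act as surface features.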
Individual Regularization Behaviors
This section compares the behavior of each strategy. We first conducted both experiments without regularization to obtain baseline accuracies. Then we plot in Figure 1 the learning curves when each regularization strategy is applied individually. We report training and validation accuracies throughout this paper. The main findings are as follows.

• Penalizing the ℓ2-norm of weights helps generalization; the effect depends largely on the size of the training set. Experiment I contains 7,000 training samples and the improvement is 6.98%; Experiment II contains more than 150k samples, and the improvement is only 2.07%. Such results are consistent with other machine learning models.

• Penalizing the ℓ2-norm of embeddings unexpectedly helps optimization (improves training accuracy). One plausible explanation is that, since embeddings are trained on a large corpus by unsupervised methods, they tend to settle down to large values and may not perfectly agree with the tasks of interest. The ℓ2 penalty pushes the embeddings towards small values and thus helps optimization. Regarding validation accuracy, Experiment I is improved by 6.89%, whereas Experiment II shows no significant difference.

• Re-embedding words does not improve generalization. In particular, in Experiment II, the ultimate accuracy is improved by only 0.44%, which is not large. Further, too much penalty hurts the models in both experiments. In the limit of λ_reembed to infinity, re-embedding words is mathematically equivalent to using embeddings as surface features, that is, freezing embeddings. Such a strategy is sometimes applied in the literature, e.g., Hu et al. (2014), but is not favorable as suggested by our experiments.

• Dropout helps generalization. Under the best settings, the eventual accuracy is improved by 3.12% and 1.76%, respectively. In our experiments, dropout alone is not as useful as the ℓ2 penalty. However, other studies report that dropout is very effective (Irsoy and Cardie, 2014). Our results are not consistent with theirs; different dimensionality may contribute to this disagreement, but more experiments are needed to confirm the hypothesis.

Incremental Hyperparameter Tuning

The above experiments show that regularization generally helps prevent overfitting. To pursue the best performance, we need to try out different hyperparameters through validation. Unfortunately, training deep neural networks is time-consuming, preventing full grid search from being a practical technique. Things would get easier if we could tune hyperparameters incrementally, that is, train the model without regularization first, and then add the penalty.

In this section, we study whether the ℓ2 penalty of weights and embeddings can be tuned incrementally. We exclude the dropout strategy because it does not make much sense to incrementally drop out hidden units. Besides, from this section on, we focus only on Experiment I due to time and space limits.

Before continuing, we may envision several possibilities on how regularization works.

• (On initial effects) As the ℓ2-norm prevents parameters from growing large, adding it at early stages may cause parameters to settle down to local optima. If this is the case, delayed penalty would help parameters get over local optima, leading to better performance.

• (On eventual effects) The ℓ2 penalty lifts the error surface of large weights. Adding such a penalty may cause parameters to settle down to (a) almost the same catchment basin, or (b) different basins. In case (a), when the penalty is added does not matter much.
In case (b), however, it makes a difference, because parameters would have already gravitated to catchment basins of larger values before regularization is added, which means incremental hyperparameter tuning would be ineffective.

To verify the above conjectures, we designed four settings: adding the penalty (1) at the beginning, (2) before overfitting at epoch 2, (3) at peak performance (epoch 5), and (4) after overfitting (when validation accuracy drops) at epoch 10. A minimal sketch of such a schedule is given below.

Figure 2 plots the learning curves for penalizing weights and embeddings, respectively; the baseline (without regularization) is also included. For both weights and embeddings, all settings yield similar ultimate validation accuracies. This shows that ℓ2 regularization mainly serves as a "local" effect: it changes the error surface, but parameters tend to settle down to the same catchment basin.
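The incremental settings above can be expressed as a simple coefficient schedule; a minimal sketch follows, where the epoch thresholds and the target coefficient are placeholders, not values from the paper.

```python
def incremental_coefficient(epoch, start_epoch, target_lambda):
    """Return the L2 coefficient in effect at `epoch`: zero before
    `start_epoch` (train without regularization), then `target_lambda`.
    start_epoch = 0, 2, 5, 10 correspond to the four settings above."""
    return target_lambda if epoch >= start_epoch else 0.0

# Example: switch on the weight penalty at peak performance (epoch 5);
# 1e-4 is a placeholder coefficient, not a value from the paper.
for epoch in range(12):
    lam_W = incremental_coefficient(epoch, start_epoch=5, target_lambda=1e-4)
    # ... run one training epoch with cost J = E + lam_W * ||W||^2 ...
```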
Figure 2: Tuning hyperparameters incrementally in Experiment I. Penalty is added at epochs 0, 2, 5, and 10, respectively. We chose the coefficients yielding the best performance in Figure 1. The controlled trial (no regularization) is early stopped because the accuracy has already decreased. Panels: (a) incrementally penalizing the ℓ2-norm of weights; (b) incrementally penalizing the ℓ2-norm of embeddings.

Table 1: Accuracy (in percentage) when we combine the ℓ2-norms of weights and embeddings (Experiment I). Bold numbers are among the highest accuracies (greater than the peak performance minus 1.5 times the standard deviation, i.e., 1.26 in percentage).

Table 2: Combining ℓ2 regularization and dropout. Left: connectional weights. Right: embeddings. (p refers to the dropout rate.)

We notice that a recent report also shows that local optima may not play an important role in training neural networks, if the effect of parameter symmetry is ruled out (Breuel, 2015). We also observe that regularization helps generalization as soon as it is added (Figure 2a), and that regularizing embeddings helps optimization right after the penalty is applied (Figure 2b).

Combining Different Regularizations

We are further curious about the behaviors when different regularization methods are combined. Table 1 shows that combining the ℓ2-norms of weights and embeddings results in a further accuracy improvement of 3–4 percentage points over applying either one of them alone. In a certain range of coefficients, weights and embeddings are complementary: given one hyperparameter, we can tune the other to achieve a result among the highest ones. Such compensation is also observed when comparing the ℓ2 penalty with dropout (Table 2): although the peak performance is obtained by pure ℓ2 regularization, applying dropout with a small ℓ2 penalty also achieves a similar accuracy. The dropout rate is not very sensitive, provided it is small.

Conclusion

In this paper, we systematically compared four regularization strategies for embedding-based neural networks in NLP. Based on the experimental results, we answer our research questions as follows. (1) Regularization methods (except re-embedding words) basically help generalization. Penalizing the ℓ2-norm of embeddings unexpectedly helps optimization as well. Regularization performance depends largely on the dataset's size. (2) The ℓ2 penalty mainly acts as a local effect; hyperparameters can be tuned incrementally. (3) Combining the ℓ2-norms of weights and embeddings, or dropout and the ℓ2 penalty, further improves generalization; their coefficients are mostly complementary within a certain range. These empirical results of regularization strategies shed some light on tuning neural models for NLP.

Acknowledgments
This research is supported by the National Basic Research Program of China (the 973 Program) under Grant No. 2015CB352201 and the National Natural Science Foundation of China under Grant No. 61232015. We would also like to thank Hao Jia and Ran Jia.

References

[Bengio et al.2003] Yoshua Bengio, R. Ducharme, P. Vincent, and C. Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

[Breuel2015] T. Breuel. 2015. The effects of hyperparameters on SGD training of neural networks. arXiv preprint arXiv:1508.02788.

[Collobert and Weston2008] Ronan Collobert and J. Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning.

[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

[Graves et al.2013] Alex Graves, A. Mohamed, and G. Hinton. 2013. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[Hashimoto et al.2013] Kazuma Hashimoto, M. Miwa, Y. Tsuruoka, and T. Chikayama. 2013. Simple customization of recursive neural networks for semantic relation classification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

[Hu et al.2014] Baotian Hu, Z. Lu, H. Li, and Q. Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems.

[Irsoy and Cardie2014] Ozan Irsoy and C. Cardie. 2014. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems.

[Krizhevsky et al.2012] Alex Krizhevsky, I. Sutskever, and G. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.

[Labutov and Lipson2013] Igor Labutov and H. Lipson. 2013. Re-embedding words. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2).

[Mikolov et al.2013] Tomas Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

[Mnih and Hinton2009] Andriy Mnih and G. Hinton. 2009. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems.

[Morin and Bengio2005] Frederic Morin and Y. Bengio. 2005. Hierarchical probabilistic neural network language model. In Proceedings of the International Conference on Artificial Intelligence and Statistics.

[Mou et al.2015] Lili Mou, Hao Peng, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2015. Tree-based convolution: A new neural architecture for sentence modeling. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (to appear).

[Senior et al.2013] Andrew Senior, G. Heigold, M. Ranzato, and K. Yang. 2013. An empirical study of learning rates in deep neural networks for speech recognition. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[Socher et al.2011] Richard Socher, J. Pennington, E. Huang, A. Ng, and C. Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

[Socher et al.2012] Richard Socher, B. Huval, C. Manning, and A. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

[Srivastava et al.2014] Nitish Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, pages 1929–1958.

[Tai et al.2015] Kai Sheng Tai, R. Socher, and C. D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.

[Xu et al.2015] Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. 2015. Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (to appear).

[Zeng et al.2014] Daojian Zeng, K. Liu, S. Lai, G. Zhou, and J. Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of the International Conference on Computational Linguistics.

[Zhao et al.2015] Han Zhao, Z. Lu, and P. Poupart. 2015. Self-adaptive hierarchical sentence model. In Proceedings of the International Joint Conference on Artificial Intelligence.