A Comparative Study of Word Embeddings for Reading Comprehension
Bhuwan Dhingra, Hanxiao Liu, Ruslan Salakhutdinov, William W. Cohen
School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
{bdhingra, hanxiaol, rsalakhu, wcohen}@cs.cmu.edu

Abstract
The focus of past machine learning research for Reading Comprehension tasks has been primarily on the design of novel deep learning architectures. Here we show that seemingly minor choices made on (1) the use of pre-trained word embeddings, and (2) the representation of out-of-vocabulary tokens at test time, can turn out to have a larger impact than architectural choices on the final performance. We systematically explore several options for these choices, and provide recommendations to researchers working in this area.
1 Introduction

Systems that can read documents and answer questions about their content are a key language technology. The field, which has been termed Reading Comprehension (RC), has attracted a tremendous amount of interest in the last two years, primarily due to the introduction of large-scale annotated datasets, such as CNN (Hermann et al., 2015) and SQuAD (Rajpurkar et al., 2016).

Powerful statistical models, including deep learning models (also termed readers), have been proposed for RC, most of which employ the following recipe: (1) Tokens in the document and question are represented using word vectors obtained from a lookup table (either initialized randomly, or from a pre-trained source such as GloVe (Pennington et al., 2014)). (2) A sequence model such as an LSTM (Hochreiter and Schmidhuber, 1997), augmented with an attention mechanism (Bahdanau et al., 2014), updates these vectors to produce contextual representations. (3) An output layer uses these contextual representations to locate the answer in the document. The focus so far in the literature has been on steps (2) and (3), and several novel architectures have been proposed (see Section 2.1).

Figure 1: Test set accuracies and std error on the Who-Did-What dataset for Stanford AR and GA Reader, trained after initializing with word vectors induced from different corpora. Without controlling for the initialization method, different conclusions may be drawn about which architecture is superior. Corpus 1: BookTest dataset (Bajgar et al., 2016), Corpus 2: Wikipedia + Gigaword.

In this work, we show that seemingly minor choices made in step (1), such as the use of pre-trained word embeddings and the handling of out-of-vocabulary tokens at test time, can lead to substantial differences in the final performance of the reader. These differences are usually much larger than the gains reported due to architectural improvements. As a concrete example, in Figure 1 we compare the performance on the Who-Did-What dataset (Onishi et al., 2016) of two RC models, the Stanford Attentive Reader (AR) (Chen et al., 2016) and the Gated Attention (GA) Reader (Dhingra et al., 2016), initialized with word embeddings trained on different corpora. Clearly, comparison between architectures is meaningful only under a controlled initialization method.

To justify our claims, we conduct a comprehensive set of experiments comparing the effect of utilizing embeddings pre-trained on several corpora. We experiment with RC datasets from different domains using different architectures, and obtain consistent results across all settings. Based on our findings, we recommend the use of certain pre-trained GloVe vectors for initialization. These consistently outperform other off-the-shelf embeddings such as word2vec (Mikolov et al., 2013), as well as those pre-trained on the target corpus itself and, perhaps surprisingly, those trained on a large corpus from the same domain as the target dataset.

Another important design choice is the handling of out-of-vocabulary (OOV) tokens at test time. A common approach (e.g. Chen et al., 2016; Shen et al., 2016) is to replace infrequent words during training with a special token UNK, and use this token to model the OOV words at the test phase. In reading comprehension, where the target answers are often rare words, we find that this approach leads to significantly worse performance in certain cases. A superior strategy is to assign each OOV token either a pre-trained vector, if available, or a random but unique vector at test time. We discuss and compare these two strategies, as well as a mixed variant of the two, in Section 3.2.
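To make step (1) concrete, the following is a minimal sketch, assuming numpy and a GloVe-format text file, of how a reader's embedding lookup table can be initialized from pre-trained vectors; the file path, function names, and random-initialization range are illustrative and not taken from any of the models studied here.

```python
import numpy as np

def load_glove(path):
    """Read GloVe text-format vectors into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def build_embedding_matrix(vocab, glove, dim=100, seed=0):
    """Row 0 is reserved for UNK; row i+1 holds the vector for vocab[i].
    Words found in GloVe are copied over, the rest stay randomly initialized.
    `dim` must match the dimensionality of the GloVe file."""
    rng = np.random.RandomState(seed)
    emb = rng.uniform(-0.1, 0.1, size=(len(vocab) + 1, dim)).astype(np.float32)
    for i, word in enumerate(vocab, start=1):
        if word in glove:
            emb[i] = glove[word]
    return emb

# glove = load_glove("glove.6B.100d.txt")   # off-the-shelf GloVe; path is illustrative
# emb = build_embedding_matrix(train_vocab, glove, dim=100)
```

How the rows for words that never received a pre-trained vector are treated at test time (a shared UNK vector versus unique random vectors) is exactly the choice analyzed in Section 3.2.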
Many datasets aimed at measuring the performance of RC have been proposed (Nguyen et al., 2016; Trischler et al., 2016). For our purposes, we pick two of these benchmarks from different domains: Who-Did-What (WDW) (Onishi et al., 2016), constructed from news stories, and the Children's Book Test (CBT) (Hill et al., 2015), constructed from children's books. For CBT we only consider the questions where the answer is a named entity (CBT-NE). Several RC models based on deep learning have been proposed (Cui et al., 2016; Munkhdalai and Yu, 2016; Sordoni et al., 2016; Shen et al., 2016; Kobayashi et al., 2016; Henaff et al., 2016; Wang and Jiang, 2016; Wang et al., 2016; Seo et al., 2016; Xiong et al., 2016; Yu et al., 2016). For our experiments we pick two of these models: the simple, but competitive, Stanford AR, and the high-performing GA Reader.
Stanford AR:
The Stanford AR consists of single-layer bidirectional GRU encoders for both the document and the query, followed by a bilinear attention operator for computing a weighted average representation of the document. The original model, which was developed for the anonymized CNN / Daily Mail datasets, used an output lookup table W_a to select the answer. However, without anonymization the number of answer candidates can become very large. Hence, we instead select the answer from the document representation itself, followed by an attention sum mechanism (Kadlec et al., 2016). This procedure is very similar to the one used in GA Reader, and is described in detail in Appendix A.
Emb.  Corpus                         Domain        Size        Vocab
OTS   Wiki + Gigaword / GoogleNews   Wiki / News   6B / 100B   400K / 3M
WDW   Who-Did-What                   News          50M         91K
BT    BookTest                       Fiction       8B          1.2M
CBT   Children's Book Test           Fiction       50M         48K

Table 1: Details of corpora used for training word embeddings. OTS: Off-The-Shelf embeddings provided with GloVe / word2vec. Corpus size is in number of tokens.
GA Reader:
The GA Reader is a multi-hop architecture which updates the representation of document tokens through multiple bidirectional GRU layers (Cho et al., 2014). At the output of each intermediate layer, the token representations are re-weighted by taking their element-wise product with an attention-weighted representation of the query. The outputs of the final layer are further matched with the query representation with an inner product to produce a distribution over the candidate answers in the document, and multiple mentions are aggregated using attention sum. We use the publicly available code (https://github.com/bdhingra/ga-reader) with the default hyperparameter settings of Dhingra et al. (2016), detailed in Appendix B.
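As a rough illustration of the gated-attention mechanism described above, here is a minimal numpy sketch of a single hop; it operates on one (document, query) pair and leaves out the GRU encoders, batching, and learned parameters of the actual reader.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(doc, query):
    """One gated-attention hop (illustrative sketch, not the released code).
    doc:   (m, d) contextual embeddings of document tokens
    query: (k, d) contextual embeddings of query tokens
    Each document token attends over the query tokens and is then gated by an
    element-wise product with its attention-weighted query summary."""
    scores = doc @ query.T            # (m, k) similarities between token pairs
    alpha = softmax(scores, axis=1)   # attention of each doc token over the query
    q_tilde = alpha @ query           # (m, d) per-token query summaries
    return doc * q_tilde              # element-wise gating of the document tokens
```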
The two most popular methods for inducing word embeddings from text corpora are GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013) (https://code.google.com/archive/p/word2vec/). These packages also provide off-the-shelf (OTS) embeddings trained on large corpora. While the GloVe package provides embeddings with varying sizes (50-300), word2vec only provides embeddings of size 300. This is an important difference, which we discuss in detail later. We also train three additional embeddings, listed in Table 1, including those trained on the target datasets themselves. In summary, we test with two in-domain corpora for WDW: one large (OTS) and one small (WDW), and two in-domain corpora for CBT: one large (BT) and one small (CBT). Note that the word2vec package contains embeddings for both capitalized and lowercase words; we convert all words to lowercase, and if a word has both lowercase and uppercase embeddings we use the lowercase version.

When training embeddings, we set hyperparameters to their default values in the provided packages (see Appendix B for details). This is by no means an optimal choice; in fact, previous studies (Levy et al., 2015) have shown that hyperparameter choices may have a significant impact on downstream performance. However, training a single RC model can take anywhere from several hours to several days, and tuning hyperparameters for the embedding method on this downstream task is both infeasible and rarely done in practice. Instead, our objective is to provide guidelines to researchers using these methods out-of-the-box.
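For illustration, the following shows roughly what such an out-of-the-box training run looks like using the gensim library; gensim is an assumption made for the example (the paper uses the original GloVe and word2vec packages), and the corpus path, window size, and sub-sampling threshold are placeholders rather than the exact values from Appendix B.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# corpus.txt is a placeholder: one tokenized, lowercased sentence per line.
sentences = LineSentence("corpus.txt")

model = Word2Vec(
    sentences,
    vector_size=100,    # embedding dimension used for most experiments
    sg=1,               # skip-gram architecture
    hs=1, negative=0,   # hierarchical softmax instead of negative sampling
    sample=1e-5,        # sub-sample very frequent words (illustrative value)
    min_count=5,        # drop rare words
    window=10,          # context window (illustrative value)
    workers=4,
)
model.wv.save_word2vec_format("word2vec_embeddings.txt")
```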
We repeat each experiment twice with different random seeds and report the average test set accuracy across the two runs. Figure 2 shows a comparison of the RC performance for GA Reader and Stanford AR after initializing with various pre-trained embeddings, and also after initializing randomly. We see consistent results across the two datasets and the two models.

The first observation is that using embeddings trained on the right corpora can improve anywhere from 3-6% over random initialization. However, the corpus and method used for pre-training are important choices: for example, word2vec embeddings trained on CBT perform worse than random. Also note that in every single case, GloVe embeddings outperform word2vec embeddings trained on the same corpora. It is difficult to claim that one method is better than the other, since previous studies (Levy et al., 2015) have shown that these methods are sensitive to hyperparameter tuning. However, if used out-of-the-box, GloVe seems to be the preferred method for pre-training.

The single best performance is given by off-the-shelf GloVe embeddings (d = 100) in each case, which outperform off-the-shelf word2vec embeddings (d = 300). To understand if the difference comes from the differing dimension sizes, we plot the performance of GloVe embeddings as the dimension size is increased in Figure 3 (left). Performance drops as the embedding dimension size is increased (most likely due to over-fitting); however, even at d = 300, GloVe embeddings outperform word2vec embeddings.

On both test datasets, embeddings trained on formal domains, like news (OTS, WDW), perform at least as well as those trained on informal ones, like fiction (BT, CBT). This is surprising for the CBT-NE dataset, which is itself constructed from the informal domain of children's books. For example, WDW (50M tokens) does significantly better than CBT-NE (50M tokens) in most of the settings, and also significantly better than the much larger BT (8B tokens) in one setting (and comparably in the other settings). A key distinguishing feature between these two domains is the fraction of text composed of stopwords (we use the list of stopwords available at http://research.microsoft.com/en-us/um/redmond/projects/mctest/data/stopwords.txt): WDW consists of 54% stopwords while BT consists of 68% stopwords. Both GloVe and word2vec induce word vectors by minimizing the Euclidean distance between vectors of frequently co-occurring words. Co-occurrence with stopwords, however, provides little meaningful information about the semantics of a particular word, and hence corpora with a high percentage of these may not produce high-quality vectors. This effect may be mitigated during pre-training by either (1) removing a fraction of stopwords from the corpus, or (2) increasing the window size for counting co-occurring words. Figure 3 (right) shows the effect of both these methods on downstream RC performance. There is an improvement as the fraction of stopwords decreases or the window size increases, up to a certain limit. In fact, with proper tuning, BT embeddings give roughly the same performance as OTS GloVe, emphasizing the importance of hyperparameter tuning when training word vectors. There is also evidence that stopword removal can be beneficial when training word vectors; however, this needs further verification on other downstream tasks.
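The stopword statistics and the stopword-removal preprocessing above can be reproduced with a few lines of Python; this is a sketch assuming a plain-text stopword list (such as the MCTest list referenced above) and a pre-tokenized corpus, with illustrative function names.

```python
import random

def load_stopwords(path):
    """One stopword per line, e.g. the MCTest stopword list linked above."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def stopword_fraction(tokens, stopwords):
    """Fraction of corpus tokens that are stopwords (54% for WDW, 68% for BT)."""
    hits = sum(1 for t in tokens if t.lower() in stopwords)
    return hits / max(len(tokens), 1)

def drop_stopwords(tokens, stopwords, fraction, seed=0):
    """Randomly drop `fraction` of the stopword occurrences before training
    embeddings, leaving all non-stopword tokens untouched."""
    rng = random.Random(seed)
    return [t for t in tokens
            if t.lower() not in stopwords or rng.random() >= fraction]
```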
In this section we study some common techniques for dealing with OOV tokens at test time. Based on the results from the previous section, we conduct this study using only the off-the-shelf GloVe pre-trained embeddings. Let the training, test and GloVe vocabularies be denoted by V_T, V_E and V_G respectively. Also define V_T^n = {t ∈ V_T : #t > n}, where #t denotes the count of token t in the training corpus. Before training a neural network for RC, the developer must first decide on the set of words V which will be assigned word vectors. Any token outside V is treated as an OOV token (denoted by UNK) and is assigned the same fixed vector.
Figure 2: Test set accuracy and std error for GA Reader and Stanford AR on WDW (left, middle-left) and CBT-NE (middle-right, right) when trained after initializing with pre-trained embeddings induced from different corpora (Table 1), or randomly.
Figure 3: Test set accuracy and std error on (left) WDW when initialized with off-the-shelf GloVe embeddings of different sizes, and (right) CBT-NE when initialized with embeddings trained on the BT corpus after removing a fraction of stopwords (red), or using different window sizes (green).
Figure 4: Test set accuracy and std error for GA Reader on WDW (left) and CBT-NE (right) when trained after assigning different cuts of the vocabulary with word vectors. Min frequency refers to the minimum count of a word type for it to be included in the vocabulary.
By far the most common technique in the NLP literature (e.g. Chen et al., 2016; Shen et al., 2016) for constructing this vocabulary is to decide on a minimum frequency threshold n (typically 5-10) and set V = V_T^n. Out of these, vectors for the words which also appear in V_G are initialized to their GloVe embeddings, and the rest are randomly initialized. Remaining tokens in V_T and those in V_E - V_T are all assigned the UNK vector, which is itself updated during training. This method ignores the fact that many of the words assigned as UNK may already have trained embeddings available in V_G. Hence, here we propose another strategy of constructing the vocabulary as V = V_T^n ∪ V_G. Then at test time, any new token would be assigned its GloVe vector if it exists, or the vector for UNK. A third approach, used in (Dhingra et al., 2016), is motivated by the fact that many of the RC models rely on computing fine-grained similarity between document and query tokens. Hence, instead of assigning all OOV tokens a common UNK vector, it might be better to assign them untrained but unique random vectors. This can be done by setting the vocabulary to V = V_T^n ∪ V_E ∪ V_G. Hence, at test time any new token will be assigned its GloVe vector if it exists, or a random vector. Note that for this approach access to V_E at training time is not needed.
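A minimal sketch of the three vocabulary strategies is given below, assuming dictionaries mapping words to vectors; the function, argument, and strategy names are illustrative and not taken from the released implementations.

```python
import numpy as np

_oov_cache = {}  # one fixed random vector per OOV token type (third strategy)

def _unique_random(token, dim):
    if token not in _oov_cache:
        _oov_cache[token] = np.random.uniform(-0.1, 0.1, dim).astype(np.float32)
    return _oov_cache[token]

def test_time_vector(token, trained, glove, strategy, dim=100):
    """Vector assigned to a test-time token under the three strategies above.
    trained: vectors learned during training for V_T^n (plus an 'UNK' entry)
    glove:   pre-trained GloVe vectors (V_G)"""
    if token in trained:                      # in-vocabulary token
        return trained[token]
    if strategy == "unk_only":                # V = V_T^n
        return trained["UNK"]
    if strategy == "unk_plus_glove":          # V = V_T^n ∪ V_G
        return glove.get(token, trained["UNK"])
    if strategy == "unique_random":           # V = V_T^n ∪ V_E ∪ V_G (recommended)
        return glove[token] if token in glove else _unique_random(token, dim)
    raise ValueError("unknown strategy: %s" % strategy)
```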
Figure 4 shows a comparison of all three approaches with varying n for the GA Reader on the WDW and CBT-NE datasets. The gap between the best and worst settings for WDW and CBT-NE clearly indicates the importance of using the correct setting. The commonly used method of setting V = V_T^n is not a good choice for RC, and gets worse as n is increased. It performs particularly poorly for the CBT-NE dataset, where a large fraction of the test set answers do not appear in the training set (compared to only a small fraction in WDW). The other two approaches perform comparably for WDW, but for CBT-NE assigning random vectors rather than UNK to OOV tokens gives better performance. This is also easily explained by looking at the fraction of test set answers which do not occur in V_T ∪ V_G: it is much larger for CBT-NE than for WDW. Since in general it is not possible to compute these fractions without access to the test set, we recommend setting V = V_T^n ∪ V_E ∪ V_G.

We have shown that the choice of pre-trained embeddings for initializing word vectors has a significant impact on the performance of neural models for reading comprehension. So does the method for handling OOV tokens at test time. We argue that different architectures can only be compared when these choices are controlled for. Based on our experiments, we recommend the use of off-the-shelf GloVe embeddings, and assigning pre-trained GloVe vectors, if available, or random but unique vectors to OOV tokens at test time.

Acknowledgments
This work was funded by NSF under CCF-1414030 and Google Research.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Ondrej Bajgar, Rudolf Kadlec, and Jan Kleindienst. 2016. Embracing data abundance: Booktest dataset for reading comprehension. arXiv preprint arXiv:1610.00956.

Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. arXiv preprint arXiv:1606.02858.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2016. Attention-over-attention neural networks for reading comprehension. arXiv preprint arXiv:1607.04423.

Bhuwan Dhingra, Hanxiao Liu, William W. Cohen, and Ruslan Salakhutdinov. 2016. Gated-attention readers for text comprehension. arXiv preprint arXiv:1606.01549.

Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. 2016. Tracking the world state with recurrent entity networks. arXiv preprint arXiv:1612.03969.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684-1692.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The goldilocks principle: Reading children's books with explicit memory representations. arXiv preprint arXiv:1511.02301.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. 2016. Text understanding with the attention sum reader network. arXiv preprint arXiv:1603.01547.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Sosuke Kobayashi, Ran Tian, Naoaki Okazaki, and Kentaro Inui. 2016. Dynamic entity representations with max-pooling improves machine reading. In NAACL-HLT.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211-225.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

Tsendsuren Munkhdalai and Hong Yu. 2016. Neural semantic encoders. arXiv preprint arXiv:1607.04315.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.

Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. In EMNLP.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP).

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.

Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. 2016. ReasoNet: Learning to stop reading in machine comprehension. arXiv preprint arXiv:1609.05284.

Alessandro Sordoni, Phillip Bachman, and Yoshua Bengio. 2016. Iterative alternating neural attention for machine reading. arXiv preprint arXiv:1606.02245.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830.

Shuohang Wang and Jing Jiang. 2016. Machine comprehension using match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905.

Zhiguo Wang, Haitao Mi, Wael Hamza, and Radu Florian. 2016. Multi-perspective context matching for machine comprehension. arXiv preprint arXiv:1612.04211.

Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for question answering. arXiv preprint arXiv:1611.01604.

Yang Yu, Wei Zhang, Kazi Hasan, Mo Yu, Bing Xiang, and Bowen Zhou. 2016. End-to-end answer chunk extraction and ranking for reading comprehension. arXiv preprint arXiv:1610.09996.

A Answer Selection for Stanford AR
Using the notation from (Chen et al., 2016), let p̃_1, p̃_2, ..., p̃_m be the contextual embeddings of the tokens in the document, and let o be the attention-weighted document representation; then we compute the probability that token i answers the question as:

P(a = d_i | d, q) = s_i = softmax(p̃_i^T o)    (1)

The probability of a particular candidate c ∈ C being the answer is then computed by aggregating the probabilities of all document tokens which appear in c and renormalizing over the candidates:

Pr(c | d, q) ∝ Σ_{i ∈ I(c,d)} s_i    (2)

where I(c, d) is the set of positions where a token in c appears in the document d.
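A minimal numpy sketch of this answer-selection step, assuming the candidate occurrence positions I(c, d) have been precomputed; it mirrors Equations (1) and (2) but is not the released implementation.

```python
import numpy as np

def attention_sum_answer(p_tilde, o, candidates):
    """p_tilde:    (m, d) contextual embeddings of the document tokens
    o:          (d,) attention-weighted document representation
    candidates: dict mapping each candidate c to the list of positions I(c, d)
                at which a token of c appears in the document"""
    scores = p_tilde @ o
    s = np.exp(scores - scores.max())
    s = s / s.sum()                                   # Eq. (1): s_i over tokens
    agg = {c: s[list(idx)].sum() for c, idx in candidates.items()}
    z = sum(agg.values())
    return {c: v / z for c, v in agg.items()}         # Eq. (2): renormalize
```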
B Hyperparameter Details

For the WDW dataset we use hidden state size d = 128 for the GRU, with dropout; for the CBT-NE dataset we also use d = 128, with dropout. The Stanford AR has only 1 layer, as proposed in the original paper, while the GA Reader has multiple layers. For Stanford AR dropout is applied to the input of the layer, and for GA Reader it is applied in between layers. Embedding sizes for the word vectors were set to d_w = 100 for all experiments, except those using off-the-shelf word2vec embeddings. To enable a fair comparison, we utilize the qe-comm feature for Stanford AR, which was used in the implementation of GA Reader. Since our purpose is to study the effect of word vectors, we do not use character embeddings in our experiments.

We train the models using the Adam (Kingma and Ba, 2014) optimizer with an initial learning rate of 0.0005, which is halved every epoch after the first 3 epochs. We track performance on the validation set, and select the model with the highest validation accuracy for testing.

When training word vectors we retain the default settings provided with the GloVe and word2vec packages, with the only exception that the window size was set to the same value for both (to ensure consistency). For word2vec, we used the skip-gram architecture with hierarchical softmax, and sub-sampled frequent words (see (Mikolov et al., 2013) for details). For GloVe, the number of training iterations was set separately for the small WDW and CBT corpora and for the large BT corpus. In any corpus, words occurring less than 5 times were discarded.
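For concreteness, the learning-rate schedule described above can be expressed as a small helper function (a sketch; 1-indexed epochs are an assumption):

```python
def learning_rate(epoch, base_lr=0.0005, constant_epochs=3):
    """Initial learning rate held for the first `constant_epochs` epochs,
    then halved every subsequent epoch."""
    if epoch <= constant_epochs:
        return base_lr
    return base_lr * (0.5 ** (epoch - constant_epochs))

# epochs 1..6 -> 0.0005, 0.0005, 0.0005, 0.00025, 0.000125, 0.0000625
```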