Context encoders as a simple but powerful extension of word2vec
Franziska Horn
Machine Learning Group, Technische Universität Berlin, Germany
[email protected]
Abstract
With a simple architecture and the ability to learn meaningful word embeddings efficiently from texts containing billions of words, word2vec remains one of the most popular neural language models used today. However, as only a single embedding is learned for every word in the vocabulary, the model fails to optimally represent words with multiple meanings. Additionally, it is not possible to create embeddings for new (out-of-vocabulary) words on the spot. Based on an intuitive interpretation of the continuous bag-of-words (CBOW) word2vec model's negative sampling training objective in terms of predicting context based similarities, we motivate an extension of the model we call context encoders (ConEc). By multiplying the matrix of trained word2vec embeddings with a word's average context vector, out-of-vocabulary (OOV) embeddings and representations for a word with multiple meanings can be created based on the word's local contexts. The benefits of this approach are illustrated by using these word embeddings as features in the CoNLL 2003 named entity recognition (NER) task.
Introduction

Representation learning is very prominent in the field of natural language processing (NLP). For example, word embeddings learned by neural language models (NLM) were shown to improve the performance when used as features for supervised learning tasks such as named entity recognition (NER) (Collobert et al., 2011; Turian et al., 2010). The popular word2vec model (Mikolov et al., 2013a,b) learns meaningful word embeddings by considering only the words' local contexts. Thanks to its shallow architecture it can be trained very efficiently on large corpora. The model, however, only learns a single representation for words from a fixed vocabulary. Consequently, if in a task we encounter a new word that was not present in the texts used for training, we cannot create an embedding for this word without repeating the time-consuming training procedure of the model. Furthermore, a single embedding does not optimally represent a word with multiple meanings. For example, "Washington" is both the name of a US state as well as a former president, and only by taking into account the word's local context can one identify the proper sense.

Based on an intuitive interpretation of the continuous bag-of-words (CBOW) word2vec model's negative sampling training objective, we propose an extension of the model we call context encoders (ConEc). This allows for an easy creation of OOV embeddings as well as a better representation of words with multiple meanings by simply multiplying the trained word2vec embeddings with the words' average context vectors. As demonstrated by the CoNLL 2003 NER challenge, the classification performance can be significantly improved when using as features the word embeddings created with ConEc instead of word2vec.
Related work
In the past, NLM have addressed the issue of polysemy in various ways. For example, sense2vec is an extension of word2vec, where in a preprocessing step all words in the training corpus are annotated with their part-of-speech (POS) tag and then the embeddings are learned for tokens consisting of the words themselves and their POS tags. This way, different representations are generated e.g. for words that are used both as a noun and verb (Trask et al., 2015). Other methods first cluster the contexts in which the words appear (Huang et al., 2012) or use additional resources such as wordnet to identify multiple meanings of words (Rothe and Schütze, 2015). One possibility to create OOV embeddings is to learn representations for all character n-grams in the texts and then compute the embedding of a word by combining the embeddings of the n-grams occurring in it (Bojanowski et al., 2016). However, none of these NLM are designed to solve both the OOV and polysemy problem at the same time. Furthermore, compared to word2vec they require more parameters, resources, or additional steps in the training procedure. ConEc on the other hand can generate OOV embeddings as well as improved representations for words with multiple meanings by simply multiplying the matrix of trained word2vec embeddings with the words' average context vectors.

(In practice the model is trained on such a large vocabulary that it is rare to encounter a word that does not have an embedding. Yet there are still scenarios where this is the case: for example, it is unlikely that the term "W10281545" is encountered in a regular training corpus, but we might still want its embedding to represent a search query like "whirlpool W10281545 ice maker part".)

Word2vec

Word2vec (Fig. 3 in the Appendix) learns d-dimensional vector representations, referred to as word embeddings, for all N words in the vocabulary. It is a shallow NLM with parameter matrices W_0, W_1 ∈ R^{N×d}, which are tuned iteratively by scanning huge amounts of text sentence by sentence. Based on some context words, the algorithm tries to predict the target word between them. Mathematically, this is realized by first computing the sum of the embeddings of the context words by selecting the appropriate rows from W_0. This vector is then multiplied by several rows selected from W_1: one of these rows corresponds to the target word, while the others correspond to k 'noise' words selected at random (negative sampling). After applying a non-linear activation function, the backpropagation error is computed by comparing this output to a label vector t ∈ R^{k+1}, which is 1 at the position of the target word and 0 for all k noise words. After the training of the model is complete, the word embedding for a target word is the corresponding row of W_0.

Context encoders (ConEc)

Similar words appear in similar contexts (Harris, 1954). For example, two words synonymous with each other could be exchanged for one another in almost all contexts without a reader noticing. Based on the context word co-occurrences, pairwise similarities between all N words of the vocabulary can be computed, resulting in a similarity matrix S ∈ R^{N×N} (or, for a single word w, the vector s_w ∈ R^N) with similarity scores between 0 and 1. These similarities should be preserved in the word embeddings, e.g. the cosine similarity between the embedding vectors of two words used in similar contexts should be close to 1, or, more generally, the scalar product of the matrix with word embeddings Y ∈ R^{N×d} should approximate S.
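To make the word2vec training step described above concrete, the following minimal numpy sketch performs a single CBOW negative-sampling update. It is an illustration written for this text rather than the reference word2vec implementation; all names (W0, W1, cbow_negative_sampling_step, the learning rate) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 10000, 200, 13                    # vocabulary size, embedding dim, noise words
W0 = rng.normal(scale=0.1, size=(N, d))     # input weights: context words -> embedding
W1 = np.zeros((N, d))                       # output weights: one row per vocabulary word

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbow_negative_sampling_step(context_ids, target_id, lr=0.025):
    """One illustrative CBOW update with negative sampling."""
    # 1) sum the embeddings of the context words (rows of W0)
    y = W0[context_ids].sum(axis=0)                      # shape (d,)
    # 2) select the target word and k random 'noise' words from W1
    noise_ids = rng.integers(0, N, size=k)
    out_ids = np.concatenate(([target_id], noise_ids))   # shape (k+1,)
    out = sigmoid(W1[out_ids] @ y)                        # non-linear activation
    # 3) compare to the binary label vector t: 1 for the target, 0 for the noise words
    t = np.zeros(k + 1)
    t[0] = 1.0
    err = t - out
    # 4) backpropagate: update the selected output rows and the context rows of W0
    grad_y = W1[out_ids].T @ err                          # gradient w.r.t. the embedding
    np.add.at(W1, out_ids, lr * np.outer(err, y))
    for c in context_ids:
        W0[c] += lr * grad_y

# e.g. predict the word "cat" from the context of "the black cat slept on the bed"
cbow_negative_sampling_step(context_ids=[11, 42, 7, 99], target_id=3)
```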
Obviously, the most straightforward way of obtaining word embeddings satisfying Y·Y^T ≈ S would be to compute the singular value decomposition (SVD) of the similarity matrix S and use the eigenvectors corresponding to the d largest eigenvalues (Levy et al., 2014, 2015). As our vocabulary typically comprises tens of thousands of words, performing an SVD of the corresponding similarity matrix is computationally far too expensive. Yet, while the similarity matrix would be huge, it would also be quite sparse, as many words are of course not synonymous with each other. If we picked a small number k of random words, chances are their similarities to a target word would be close to 0. Therefore, while the product of a single word's embedding y_w ∈ R^d and the matrix of all embeddings Y should result in a vector ŝ_w ∈ R^N close to the true similarities s_w of this word, if we only consider a small subset of ŝ_w corresponding to the word itself and k random words, it is sufficient if this approximates the binary vector t_w ∈ R^{k+1}, which is 1 for the word itself and 0 elsewhere.

The CBOW word2vec model trained with negative sampling can therefore be interpreted as a neural network (NN) that predicts a word's similarities to other words (Fig. 1). During training, for each occurrence i of a word w in the texts, a binary vector x_{w_i} ∈ R^N, which is 1 at the positions of the context words of w and 0 elsewhere, is used as input to the network and multiplied by a set of weights W_0 to arrive at an embedding y_{w_i} ∈ R^d (the summed rows of W_0 corresponding to the context words). This embedding is then multiplied by another set of weights W_1, which corresponds to the full matrix of word embeddings Y, to produce the output of the network, a vector ŝ_{w_i} ∈ R^N containing the approximated similarities of the word w to all other words. The training error is then computed by comparing a subset of the output to a binary target vector t_{w_i} ∈ R^{k+1}, which serves as an approximation of the true similarities s_w when considering only a small number of random words. We refer to this interpretation of the model as context encoders (ConEc), as it is closely related to similarity encoders (SimEc), a dimensionality reduction method used for learning similarity preserving representations of data points (Horn and Müller, 2017).

Figure 1: Context encoder (ConEc) NN architecture corresponding to the CBOW word2vec model trained with negative sampling.

While the training procedure of ConEc is identical to that of word2vec, there is a difference in the computation of a word's embedding after the training is complete. In the case of word2vec, the word embedding is simply the corresponding row of the tuned W_0 matrix. When considering the idea behind the optimization procedure, we instead propose to create the representation of a target word w by multiplying W_0 with the word's average context vector x_w, as this better resembles how the word embeddings are computed during training.

We distinguish between a word's 'global' and 'local' average context vector (CV): The global CV is computed as the average of all binary CVs x_{w_i} corresponding to the M_w occurrences of w in the whole training corpus:

x_w^{global} = (1 / M_w) Σ_{i=1}^{M_w} x_{w_i},

while the local CV x_w^{local} is computed likewise but considering only the m_w occurrences of w in a single document (this implicitly assumes a word is only used in a single sense in one document).
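Both kinds of average context vectors can be computed directly from the tokenized texts. The sketch below is a minimal, dense numpy version; in practice one would use sparse vectors, and the function name, the symmetric window size, and the assumed document structure are illustrative, not taken from the original code.

```python
import numpy as np

def average_context_vectors(documents, vocab, window=5):
    """Compute x_w^global for every word w; restricting `documents` to a single
    document yields the corresponding local context vectors x_w^local."""
    N = len(vocab)                                  # vocabulary size
    counts = np.zeros(N)                            # M_w: number of occurrences of each word
    cv_sum = np.zeros((N, N))                       # running sum of binary CVs x_{w_i}
    for sentences in documents:                     # each document is a list of sentences
        for sentence in sentences:                  # each sentence is a list of tokens
            ids = [vocab[t] for t in sentence if t in vocab]
            for i, w in enumerate(ids):
                x = np.zeros(N)
                lo, hi = max(0, i - window), min(len(ids), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        x[ids[j]] = 1.0             # 1 at the positions of the context words
                cv_sum[w] += x
                counts[w] += 1
    return cv_sum / np.maximum(counts, 1)[:, None]  # average over all occurrences of each word
```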
We can now compute the embedding of a word w by multiplying W_0 with the weighted average between both CVs:

y_w = (a · x_w^{global} + (1 − a) · x_w^{local})^T W_0    (1)

with a ∈ [0, 1]. The choice of a determines how much emphasis is placed on the word's local context, which helps to distinguish between multiple meanings of the word (Melamud et al., 2015). As an out-of-vocabulary word does not have a global CV (as it never occurred in the training corpus), its embedding is computed solely based on the local context, i.e. setting a = 0.

With this new perspective on the model and optimization procedure, another advancement is feasible. Since the context words are merely a sparse feature vector used as input to a NN, there is no reason why this input vector should not contain other features about the target word as well. For example, the feature vector x_w could be extended to contain information about the word's case, part-of-speech (POS) tag, or other relevant details. While this would increase the dimensionality of the first weight matrix W_0 to include the additional features when mapping the input to the word's embedding, the training objective and therefore also W_1 would remain unchanged. These additional features could be especially helpful if details about the words would otherwise get lost in preprocessing (e.g. by lowercasing) or to retain information about a word's position in the sentence, which is ignored in a BOW approach. These extended ConEcs are expected to create embeddings that even better distinguish between the words' different senses by taking into account, for example, whether the word is used as a noun or verb in the current context, similar to the sense2vec algorithm (Trask et al., 2015). But instead of explicitly learning multiple embeddings per term, like sense2vec, only the dimensionality of the input vector is increased to include the POS tag of the current word as a feature, which is expected to improve generalization if few training examples are available.

Experiments

The word embeddings learned by word2vec and context encoders are evaluated on the CoNLL 2003 NER benchmark task (Tjong Kim Sang and De Meulder, 2003). We use a CBOW word2vec model trained with negative sampling as described above, with k = 13 noise words and fixed settings for the embedding dimensionality d and the context window size.
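As a direct transcription of Eq. 1 above, a word's ConEc embedding can be computed as follows. This is only a sketch: `W0` is assumed to hold the trained word2vec embeddings, the context vectors are assumed to come from a routine like the one shown earlier, and the final length normalization is an optional choice rather than something prescribed by the text.

```python
import numpy as np

def conec_embedding(x_global, x_local, W0, a):
    """ConEc embedding y_w = (a * x_w^global + (1 - a) * x_w^local)^T W0 (Eq. 1).

    x_global: average global context vector of w (all zeros for an OOV word)
    x_local:  average local context vector of w in the current document
    W0:       N x d matrix of trained word2vec embeddings
    a:        weight in [0, 1]; set a = 0 for out-of-vocabulary words
    """
    x = a * x_global + (1.0 - a) * x_local
    y = x @ W0
    norm = np.linalg.norm(y)
    return y / norm if norm > 0 else y    # optional length normalization

# For a word from the training vocabulary, an intermediate a mixes both contexts;
# for an OOV word, only the local context contributes:
# y_oov = conec_embedding(np.zeros(N), x_local, W0, a=0.0)
```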
Figure 2: Results of the NER task based on three random initializations of the word2vec model.
Left panel: Overall results, where the mean performance using word2vec embeddings (dashed lines) is considered as our baseline; all other embeddings are computed with ConEcs using various combinations of the words' global and local CVs. Right panel: Increased performance (mean and standard deviation) on the test fold when using ConEc: Multiplying the word2vec embeddings with global CVs yields a first improvement in the F1-score (A). By additionally using local CVs to create OOV word embeddings, the score improves further (B). When using a combination of global and local CVs (with an intermediate value of a) to distinguish between the different meanings of words, the F1-score increases once more (C), marking a significant improvement compared to the score reached with word2vec features.

The word embeddings created by ConEc are built directly on top of the word2vec model by multiplying the original embeddings (W_0) with the respective context vectors. Code to replicate the experiments is available online (https://github.com/cod3licious/conec). Additionally, the performance on a word analogy task (Mikolov et al., 2013a) is reported in the Appendix.
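An end-to-end version of this setup could look like the sketch below. Gensim is used here purely for illustration and is not the implementation linked above; the toy corpus and all hyperparameter values are placeholders rather than the settings of the actual experiments.

```python
import numpy as np
from gensim.models import Word2Vec

# toy corpus standing in for the CoNLL 2003 training documents
sentences = [
    ["the", "black", "cat", "slept", "on", "the", "bed"],
    ["the", "chicago", "bears", "beat", "the", "packers"],
    ["she", "visited", "chicago", "last", "summer"],
]

# CBOW (sg=0) with negative sampling, as described in the text
model = Word2Vec(sentences, sg=0, negative=13, vector_size=50,
                 window=5, min_count=1, epochs=25, seed=1)

W0 = model.wv.vectors                  # N x d matrix of trained word2vec embeddings
vocab = model.wv.key_to_index          # word -> row index

# ConEc embedding for one occurrence of "chicago": multiply its (local) binary
# context vector with W0 instead of just looking up the word2vec row
x = np.zeros(len(vocab))
for ctx in ["the", "bears", "beat", "packers"]:
    x[vocab[ctx]] = 1.0
y_chicago = x @ W0                     # context-dependent embedding of "chicago"
```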
Named Entity Recognition
The main advantage of context encoders is their ability to use local context to create OOV embeddings and distinguish between the different senses of words. The effects of this are most prominent in a task such as NER, where the local context of a word can make all the difference, e.g. to distinguish between the "Chicago Bears" (an organization) and the city of Chicago (a location). We tested this on the CoNLL 2003 NER task by using the word embeddings as features together with a logistic regression classifier. The reported F1-scores were computed using the official evaluation script. The results achieved with various word embeddings on the training, development, and test parts of the CoNLL task are reported in Fig. 2. It should be noted that we are using this task as an extrinsic evaluation to illustrate the advantages of ConEc embeddings over the regular word2vec embeddings. To isolate the effects on the performance, we are only using these word embeddings as features, while typically the performance on this NER challenge is much higher when other features such as a word's case or POS tag are included as well.

The word2vec embeddings were trained on the documents used in the training part of the task (since this is a very small corpus, we trained word2vec for 25 iterations on these documents). OOV words in the development and test parts are represented as zero vectors. With three parameter settings, we illustrate the advantages of ConEc:
A) Multiplying the word2vec embeddings by the words' average context vectors generally improves the embeddings. To show this, ConEc word embeddings were computed using only global CVs (Eq. 1 with a = 1), which means OOV words again have a zero representation. With these embeddings (labeled 'global' in Fig. 2), the performance improves on the dev and test folds of the task.

B) Useful OOV embeddings can be created from the local context of a new word. To show this, the ConEc embeddings for words from the training vocabulary (w ∈ N) were computed as in A), but now the embeddings for OOV words (w′ ∉ N) were computed using local CVs (Eq. 1 with a = 1 ∀ w ∈ N and a = 0 ∀ w′ ∉ N; referred to as 'OOV' in the figure). The training performance obviously stays the same, because here all words have an embedding based on their global contexts. However, there is a jump in the ConEc performance on the dev and test folds, where OOV words now have a representation based on their local contexts.

C) Better embeddings for a word with multiple meanings can be created by using a combination of the word's average global and local CVs as input to the ConEc. To show this, the OOV embeddings were computed as in B), but now for the words occurring in the training vocabulary, the local context was taken into account as well by setting a < 1 (Eq. 1 with a ∈ [0, 1) ∀ w ∈ N and a = 0 ∀ w′ ∉ N). The best performances on all folds are achieved when averaging the global and local CVs with an intermediate value of a before multiplying them with the word2vec embeddings. This clearly shows that ConEc embeddings created by incorporating local context can help distinguish between multiple meanings of words.

Conclusion

Context encoders are a simple but powerful extension of the CBOW word2vec model trained with negative sampling. By multiplying the matrix of trained word2vec embeddings with the words' average context vectors, ConEcs are easily able to create OOV embeddings on the spot as well as distinguish between multiple meanings of words based on their local contexts. The benefits of this were demonstrated in the CoNLL 2003 NER challenge.
Acknowledgments
I would like to thank Antje Relitz, Ivana Balažević, Christoph Hartmann, Andreas Nowag, Klaus-Robert Müller, and other anonymous reviewers for their helpful comments on earlier versions of this manuscript. Franziska Horn acknowledges funding from the Elsa-Neumann scholarship of the TU Berlin.
References
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12:2493–2537.

Yoav Goldberg and Omer Levy. 2014. word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

Zellig S Harris. 1954. Distributional structure. Word 10(2-3):146–162.

Franziska Horn and Klaus-Robert Müller. 2017. Predicting pairwise relations with neural similarity encoders. arXiv preprint arXiv:1702.01824.

Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. ACL, pages 873–882.

Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In Proceedings of CoNLL. pages 171–180.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3:211–225.

Oren Melamud, Ido Dagan, and Jacob Goldberger. 2015. Modeling word meaning in context with substitute vectors. In Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. pages 3111–3119.

Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. arXiv preprint arXiv:1507.01127.

Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Walter Daelemans and Miles Osborne, editors, Proceedings of CoNLL-2003. Edmonton, Canada, pages 142–147.

Andrew Trask, Phil Michalak, and John Liu. 2015. sense2vec - A fast and accurate method for word sense disambiguation in neural word embeddings. arXiv preprint arXiv:1511.06388.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 384–394.
Appendix
Analogy task
To show that the word embeddings created with context encoders capture meaningful semantic and syntactic relationships between words, we evaluated them on the original analogy task published together with the word2vec model (Mikolov et al., 2013a) (https://code.google.com/archive/p/word2vec/). This task consists of many questions of the form "man is to king as woman is to XXX", where the model is supposed to find the correct answer queen. This is accomplished by taking the word embedding for king, subtracting from it the embedding for man and then adding the embedding for woman. This new word vector should then be most similar (with respect to the cosine similarity) to the embedding for queen. Readers familiar with Levy et al. (2015) will recognize this as the 3CosAdd method; we tried 3CosMul as well, but found that the results did not improve significantly and therefore omit them here. The word2vec model was trained for ten iterations on the text8 corpus (http://mattmahoney.net/dc/text8.zip), which contains around 17 million words and a vocabulary of about 70k unique words, as well as on the training part of the 1-billion word benchmark dataset (http://code.google.com/p/1-billion-word-language-modeling-benchmark/), which contains over 768 million words with a vocabulary of 486k unique words (in this experiment we ignore all words that occur less than 5 times in the training corpus). The ConEc embeddings were then constructed by multiplying the word2vec embeddings with the words' average global context vectors obtained from the same corpus as the word2vec model was trained on. To achieve the best results, we also had to include the target word itself in these context vectors.

The results of the analogy task are shown in Table 1. To capture some of the semantic relations between words (e.g. the first four task categories) it can be advantageous to use context encoders instead of word2vec. One reason for the ConEcs' superior performance on some of the task categories, but not others, might be that the city and country names compared in the first four task categories only have a single sense (referring to the respective location), while the words asked for in other task categories can have multiple meanings. For example, "run" can be used as both a noun and a verb; additionally, in some contexts it refers to the sport activity while other times it is used in a more abstract sense, e.g. in the context of someone running for president. Therefore, the results in the other task categories might improve if the words' context vectors are first clustered and then the ConEc embedding is generated by multiplying the word2vec embeddings with the average of only those context vectors corresponding to the word's sense most appropriate for the task category.
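The 3CosAdd evaluation described above can be written down in a few lines. This is a sketch that assumes `embeddings` is an N x d matrix of length-normalized word vectors and `vocab` maps words to row indices; both names are illustrative rather than taken from the original code.

```python
import numpy as np

def solve_analogy(a, b, c, embeddings, vocab):
    """Answer 'a is to b as c is to ?' with 3CosAdd, e.g. man : king :: woman : queen."""
    inv_vocab = {i: w for w, i in vocab.items()}
    # king - man + woman should be closest (in cosine similarity) to queen
    query = embeddings[vocab[b]] - embeddings[vocab[a]] + embeddings[vocab[c]]
    query /= np.linalg.norm(query)
    sims = embeddings @ query                        # cosine similarities to all words
    sims[[vocab[a], vocab[b], vocab[c]]] = -np.inf   # the question words are excluded
    return inv_vocab[int(np.argmax(sims))]

# e.g. solve_analogy("man", "king", "woman", embeddings, vocab) should return "queen"
```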
Figure 3: Continuous bag-of-words (CBOW) word2vec model trained with negative sampling (Mikolov et al., 2013a,b; Goldberg and Levy, 2014). The figure illustrates the training phase on the example sentence "The black cat slept on the bed." (predicting the target word from its context words): 1) take the sum of the context embeddings, 2) select the target and k noise weights (negative sampling), 3) compute the error between the sigmoid output, with σ(z) = 1/(1 + e^{−z}), and the binary label vector t, and backpropagate; after training, the target embedding is the corresponding row of W_0.

Table 1: Accuracy on the analogy task with mean and standard deviation computed using three random seeds when initializing the word2vec model. The best results for each category and corpus are in bold. (Columns: word2vec and Context Encoder, each trained on text8 (10 iter) and on the 1-billion word corpus; rows: the task categories, starting with capital-common-countries.)