Contextual Skipgram: Training Word Representation Using Context Information
A Preprint
Dongjae Kim
School of Electrical Engineering, Korea University, Anam-dong, Seoul, South Korea

Jong-Kook Kim
School of Electrical Engineering, Korea University, Anam-dong, Seoul, South Korea

February 18, 2021

Abstract
The skip-gram (SG) model learns word representations by predicting the words surrounding a center word in unstructured text data. However, not all words in the context window contribute to the meaning of the center word. For example, less relevant words can appear in the context window, hindering the SG model from learning a higher quality representation. In this paper, we propose an enhanced version of the SG that leverages context information to produce word representations. The proposed model, Contextual Skip-gram, is designed to predict contextual words using both the center word and the context information. This simple idea helps to reduce the impact of irrelevant words on the training process, thus enhancing the final performance.

Keywords: machine learning · word embedding · skip-gram · language model

1 Introduction

Distributed representations of words have been an essential approach to achieving good performance in natural language processing (NLP) tasks. Most deep learning NLP models use pre-trained word embeddings for successful training. Earlier word embedding models were trained with the Neural Network Language Model (NNLM), which involves dense matrix multiplications [1, 2]. As a result, they need long training times on large corpora.

Two popular word representation models, the skip-gram (SG) and continuous bag-of-words (CBOW), were proposed in 2013 [3, 4]. The main idea behind these models is that words that are similar to each other are likely to share a similar co-occurrence of nearby words. Predicting the surrounding words from the center word, or vice versa, is how the SG and CBOW models train word representations. This training process is done by moving a sliding window through the corpus. As a result, a |V| × d embedding matrix is trained, where V and |V| refer to the vocabulary and the number of distinct words in the given corpus respectively, and d is a hyperparameter defining the dimension of each word vector. Because Word2Vec models do not have non-linear hidden layers, they can process a large corpus much faster than the earlier NNLM-based models.

The SG architecture learns word embeddings by predicting contextual words given a center word, while the CBOW architecture learns by predicting the center word given contextual words. Because the CBOW compacts nearby word vectors into a single average vector, CBOW executes its task faster. On the other hand, the SG has more chances of learning from the same amount of text than the CBOW, because all possible pairs of center words and contextual words are used for learning. As a result, the skip-gram tends to work better with a smaller corpus. For a large |V|, using the softmax function requires a great deal of computation, so negative sampling, a simplified variant of Noise Contrastive Estimation (NCE) [5], is preferred when training on a large corpus.

Due to their huge success in many NLP tasks, there have been various studies on improving the performance of the SG model, and leveraging external linguistic resources such as semantic lexicons is one of them. The information from these resources is incorporated to refine the objective function [6, 7] or utilized in a retrofitting scheme [7, 8]. Though these approaches improve the semantic quality of the SG model, they require reliable external linguistic resources, which are hard to obtain and produce.

To train better word vectors from only a given corpus, [9] proposed leveraging word order information in the local context window. The structured skip-gram (SSG) and continuous window (CWin) models increase the output embedding size proportionally to the context window size. [10] introduced the directional skip-gram (DSG) and simplified structured skip-gram (SSSG) models to train with direction information. These approaches showed improvement in some word similarity and Part-of-Speech (POS) tagging tasks.

The FastText model proposed representing each word as a bag of character n-grams to better utilize the morphological information of words [11]. Even if specific words are rarely seen in the corpus, their subwords or sub-features can still be trained during the training of other full words. As a result, this model provides better embeddings for rare words. Moreover, embeddings for words that do not appear in the training corpus can be estimated from the subword information.

The difference between previous methods and our proposed method, the Contextual Skip-Gram (CSG) model, is that our scheme uses a context vector built from the local context window. As shown in the results, the CSG model achieves good overall performance on similarity tasks for both small and large corpora compared to conventional models. Section 2 describes our method, the experiments and results are presented in Section 3, and Section 4 summarizes the research.
2 Contextual Skip-Gram

The SG model learns word embeddings by predicting nearby words given the center word. Thus, the training objective of the SG model is to maximize the overall log probability:

$$\mathcal{L}_{SG} = \frac{1}{|V|} \sum_{t=1}^{|V|} \sum_{0 < |i| \le c} \log p(w_{t+i} \mid w_t)$$

where $w_t$ and $w_{t+i}$ refer to the center word and the nearby word to predict. Given a sampled negative word set $V^-$, its negative sampling objective is defined as:

$$\log \sigma(v_{w_t}^{\top} v'_{w_{t+i}}) + \sum_{w_j \in V^-} \log \sigma(-v_{w_t}^{\top} v'_{w_j})$$

where $v$ and $v'$ are the input and output vector representations of the corresponding word.

While all the context words participate in building the center word vector, they do not always make an equal contribution. Consider the sentence:

    Water becomes solid ice when it is cold enough.
In the example sentence above, the following pairs can be made when the window size is five and the center word is "ice":

    (ice, water), (ice, becomes), (ice, solid), (ice, when), (ice, it), (ice, cold), (ice, enough)
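As a minimal sketch of this pair extraction (hypothetical Python, not the authors' implementation; the function and variable names are ours), a sliding window over a tokenized sentence yields exactly these (center, context) training pairs:

```python
def skipgram_pairs(tokens, window=5):
    """Enumerate (center, context) pairs inside a sliding window."""
    pairs = []
    for t, center in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        for i in range(lo, hi):
            if i != t:                      # skip the center word itself
                pairs.append((center, tokens[i]))
    return pairs

sentence = "water becomes solid ice when it is cold enough".split()
pairs_for_ice = [p for p in skipgram_pairs(sentence) if p[0] == "ice"]
```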
In the human sense, "water" and "cold" contribute to the meaning of the center word "ice" more than other context words such as "when" and "becomes". In the SG manner, given the center word "ice", "water" and "cold" should be more predictable. Though "when" and "becomes" could contribute syntactically to the meaning of "ice", their co-occurrence is less convincing semantically. We call words that are less relevant to the center word weak co-occurrences, and they can disturb training due to the nature of the SG. Weak co-occurrences can be frequently used words such as articles, typos, etc. Generally, a large training corpus relieves this issue: as long as the training corpus is large enough, "water" and "ice" are likely to share similar context word co-occurrences, and a weak co-occurrence such as "when" is likely to be used with many other words. As a result, the SG model can learn decent word representations without human effort such as annotation.

However, as training progresses, weak co-occurrences can still degrade performance. At the initial epoch, both relevant and less relevant nearby words have low probabilities given a center word, because the vectors are randomly initialized. However, as the SG model trains through the corpus, less relevant pairs are likely to end up with lower probabilities than relevant words. Consequently, the SG model increases the probabilities of weak co-occurrences rather than those of more relevant words in order to increase the overall prediction probability. This issue can hinder the word embeddings from reaching higher quality. The experiment in Section 3.4 addresses this issue in detail.
This problem is mainly due to the SG model's nature: it trains word embeddings assuming a direct relationship between the center word and the surrounding word. To alleviate this issue, the CSG model predicts nearby words with context information, introducing indirect relationships. Our objective is to maximize the following loss:

$$\mathcal{L}_{CSG} = \frac{1}{|V|} \sum_{t=1}^{|V|} \sum_{0 < |i| \le c} \log p(w_{t+i} \mid w_t, w_{con})$$

where $w_t$, $w_{t+i}$ and $w_{con}$ refer to the center word, the nearby word to predict, and the surrounding words as context information. $c$ is a hyperparameter that defines the context window size. Our negative sampling loss is defined as:

$$s(v'_{w_{t+i}}, v_{w_t}, v_{con}) + \sum_{w_j \in V^-} s(-v'_{w_j}, v_{w_t}, v_{con})$$

where $v$ and $v'$ denote the input and output embeddings of the corresponding words and $V^-$ denotes the negative samples. $v_{con}$ and $s$ are further described in Sections 2.1 and 2.2.

2.1 Context Function

To make the probability calculation between the context information and nearby words easy, we first aggregate the context information into a context embedding $v_{con}$ of dimension d via the context function. The simplest way to build the context embedding is to average the input embeddings of the surrounding words. However, we use a summing function to make the vector updates simple and fast; Section 2.3 describes the detailed reason.
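A sketch of this aggregation (hypothetical code, not the paper's; `in_emb` is assumed to be the |V| × d input embedding matrix and `token_ids` the word-id sequence of a sentence):

```python
import numpy as np

def context_vector(in_emb, token_ids, t, window=5, reduce="sum"):
    """Aggregate the input embeddings around position t into v_con."""
    idx = [i for i in range(max(0, t - window), min(len(token_ids), t + window + 1))
           if i != t]
    vecs = in_emb[[token_ids[i] for i in idx]]       # shape: (#context words, d)
    # The paper uses the sum for speed; the average is the simpler alternative.
    return vecs.sum(axis=0) if reduce == "sum" else vecs.mean(axis=0)
```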
2.2 Fusion Strategy

The prediction probabilities based on the center word and the context embedding must be combined to produce the final probability. The CSG model has two different fusion strategies for combining them.

The Early Fusion (EF) method executes an element-wise weighted sum of the two vectors first, and then calculates the log probability with the fused vector and the output embedding of the nearby word:

$$s_{EF} = \log \sigma\big((\gamma v_{con} + (1 - \gamma) v_{w_t})^{\top} v'_{w_{t+i}}\big)$$

The function σ in the above equation denotes the sigmoid function, and γ is the fusion weight, a hyperparameter in the range 0 ≤ γ ≤ 1. Early fusion is simple, but it does not guarantee the intended mixing ratio because of the difference in vector magnitude between the context vector and the center word vector.

The Late Fusion (LF) method calculates the dot products first and then performs the weighted summation:

$$s_{LF} = \gamma\, \sigma(v_{con}^{\top} v'_{w_{t+i}}) + (1 - \gamma)\, \sigma(v_{w_t}^{\top} v'_{w_{t+i}})$$

The fusion weight γ decides where to focus during training. If γ is high, predictions depend more on the context vector, while the center word vector controls minor adjustments. The value of γ can be static or dynamic during training. In the experiments, we used several fixed weights as well as a linear weight scheme and a random weight scheme, defined as:

$$\gamma_{linear,\,0 \to 1} = \frac{epoch_{current} - 1}{epoch_{total} - 1}, \qquad \gamma_{ran} = \mathrm{uniform}(0, 1)$$

where $epoch_{current}$ denotes the current epoch count during training and $epoch_{total}$ is a hyperparameter defining the total number of training epochs.
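The two fusion scores translate directly into code; a hedged sketch following the equations above (our own naming):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def s_ef(v_out, v_center, v_con, gamma):
    """Early fusion: mix the vectors first, then score against the output embedding."""
    fused = gamma * v_con + (1.0 - gamma) * v_center
    return np.log(sigmoid(fused @ v_out))

def s_lf(v_out, v_center, v_con, gamma):
    """Late fusion: score with each vector separately, then mix the probabilities."""
    return gamma * sigmoid(v_con @ v_out) + (1.0 - gamma) * sigmoid(v_center @ v_out)
```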
2.3 Vector Update

Given α as the learning rate, the vector update in the SG model can be written as the following procedure:

$$
\begin{aligned}
g &= label - \sigma(v_{w_t}^{\top} v'_{w_k}) \\
v_{w_t} &\mathrel{+}= \alpha \cdot g \cdot v'_{w_k} \\
v'_{w_k} &\mathrel{+}= \alpha \cdot g \cdot v_{w_t}
\end{aligned}
$$

where $label$ is one when $w_k$ is a positive sample and zero for negative samples. After the gradient $g$ is calculated, the vectors are updated according to the gradient and the learning rate. The CSG model predicts words with the context representation $v_{con}$; hence, vector updates should also happen to the surrounding words according to the loss. However, the additional computation and word vector updates proportional to the window size could seriously harm the training speed of the CSG model.

The CSG model with an averaging context function predicts a nearby word with the center word and the context representation vector, weighted by γ. Given a sentence $S = \{..., w_{t-2}, w_{t-1}, w_t, w_{t+1}, w_{t+2}, ...\}$, the input vector $v_{w_t}$ is updated both when $w_t$ is the center word and when it is a surrounding word. When $w_t$ is the center word, its weight is $1 - \gamma$ according to the weighted fusion. Otherwise, the weight of $w_t$ is $\gamma / 2c$, because the context vector is an average of the $2c$ nearby words. As a result, the vector $v_{w_t}$ is updated as follows:

$$v_{w_t} \mathrel{+}= (1 - \gamma) \cdot \alpha \cdot \sum_{0 < |i| \le c} g_{t,t+i} \cdot v'_{w_{t+i}}$$

$$v_{w_t} \mathrel{+}= \sum_{\substack{m = t-c \\ m \ne t}}^{t+c} \frac{\gamma}{2c} \cdot \alpha \cdot \sum_{0 < |i| \le c} g_{m,m+i} \cdot v'_{w_{m+i}}, \qquad t - 2c \le m + i \le t + 2c$$

Since the positions of $w_t$ and $w_m$ are close, they share at least 50% of their surrounding words. If we assume they share the same surrounding words, the equations above can be simplified as:

$$v_{w_t} \mathrel{+}= \alpha \underbrace{\Big( (1 - \gamma) + \sum^{2c} \frac{\gamma}{2c} \Big)}_{=\,1} \cdot \Big( \sum_{0 < |i| \le c} g_{t,t+i} \cdot v'_{w_{t+i}} \Big)$$

As a result, we can approximate the vector update of the CSG with the following procedure:

$$
\begin{aligned}
g &= label - s(v'_{w_{t+i}}, v_{w_t}, v_{con}) \\
v_{w_t} &\mathrel{+}= \alpha \cdot g \cdot v'_{w_{t+i}} \\
v'_{w_{t+i}} &\mathrel{+}= \alpha \cdot g \cdot v_{w_t}
\end{aligned}
$$

To preserve the magnitude of the gradient, we use the summing context function with this update procedure.
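A sketch of one approximated CSG update step under late fusion (hypothetical code; `in_emb`/`out_emb` are the input and output embedding matrices, and the negative samples are passed in as `neg_ids`):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def csg_update_step(in_emb, out_emb, center_id, pos_id, neg_ids, v_con, gamma, lr):
    """One approximated CSG update for a (center, target) pair plus negatives."""
    v_center = in_emb[center_id].copy()              # freeze the pre-update vector
    for wid, label in [(pos_id, 1.0)] + [(j, 0.0) for j in neg_ids]:
        # Late-fusion prediction probability for word wid.
        p = gamma * sigmoid(v_con @ out_emb[wid]) \
            + (1.0 - gamma) * sigmoid(v_center @ out_emb[wid])
        g = label - p                                # approximated gradient
        in_emb[center_id] += lr * g * out_emb[wid]
        out_emb[wid] += lr * g * v_center
```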
3 Experiments

3.1 Experimental Setup

In our experiments, we compared the CSG model to the CBOW and SG models from [3], the DSG and SSSG from [10], and the FastText model from [11]. For all tests, we trained 200-dimensional embeddings. The context window size, negative sample size, and total iteration count were each set to five. The starting learning rate was 0.025. Words that appeared fewer than five times were not used in training. In the case of FastText, n-grams of 3 to 6 characters were used.
3.2 CSG Configurations

To explore the properties of the CSG parameters, we tested various combinations, increasing γ in steps of 0.25. The notation CSG_{EF,0.25} represents the CSG trained with EF as the fusion method and 0.25 as γ. Models trained with the γ_{linear,0→1} scheme are denoted as CSG_{EF or LF,0→1}. CSG with γ = 0 is omitted for both EF and LF because it is equivalent to the SG. Similarly, CSG with γ = 1 and LF is discarded due to its equivalence to γ = 1 and EF.

3.3 Training Corpora

We prepared two different corpora for training the word embeddings. A large corpus was extracted from the latest Wikipedia dump and pre-processed with the following steps, sketched in code below. We first lowercased the extracted text and split it into sentences. Sentences with fewer than ten tokens were filtered out to remove partial sentences created during the splitting process. The large corpus comprised about 4.14 billion tokens. A small corpus, a subset of the large corpus, was created by randomly sampling 1% of the sentences; it comprised 39.3 million tokens.
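A rough sketch of this pipeline (hypothetical code; real sentence splitting would use a proper sentence segmenter, and is reduced to line splitting here):

```python
import random

def build_corpus(raw_text, min_tokens=10, sample_rate=1.0):
    """Lowercase, split, drop short sentences, and optionally subsample."""
    sentences = []
    for line in raw_text.lower().splitlines():
        tokens = line.split()
        if len(tokens) < min_tokens:
            continue                                 # remove partial sentences
        if random.random() <= sample_rate:           # 0.01 for the small corpus
            sentences.append(tokens)
    return sentences
```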
3.4 Nearby Word Prediction

In this experiment, we manually analyzed how well the SG and CSG models predict surrounding words. To check prediction performance, we logged σ(v_{w_t}^⊤ v'_{w_k}) for the SG and s_{EF}(v'_{w_{t+i}}, v_{w_t}, v_{con}) for the CSG with γ = 0.5 when the center word is "ice". Table 1 shows the averages of the logged values at epochs one and five. Weak co-occurrences such as "an" and "for" tend to get poor predictions from the SG model, and they become worse by epoch five. On the other hand, the CSG model tends to predict those words better, as expected. For relevant words, the CSG showed a slight decline in prediction due to the indirect relationship.

            SG                   CSG
word        epoch 1   epoch 5    epoch 1   epoch 5
an          36.65     35.75      54.30     69.24
for         34.22     24.87      53.19     46.20
in          26.38     18.60      47.76     52.85
who         35.37     33.85      64.99     76.74
hockey      90.82     98.84      88.95     98.06
water       72.21     70.68      66.08     64.80
winter      70.47     87.22      73.72     86.91
cream       73.81     96.74      69.95     95.15

Table 1: Nearby word prediction performance experiment. All values are multiplied by 100.

3.5 Word Similarity Evaluation

To compare the quality of the embeddings, we performed a word similarity evaluation with the SimLex-999 [12], WordSim-353 [13] and MEN-3k [14] datasets. Similarity scores were measured by calculating the cosine similarity between the two normalized word vectors for every pair in each dataset. The Spearman's rank correlation coefficient between the obtained scores and the human-judged scores was then calculated.
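This evaluation is straightforward to sketch (hypothetical code; `emb` is assumed to be a word-to-vector mapping and `pairs`/`human_scores` a loaded similarity dataset):

```python
import numpy as np
from scipy.stats import spearmanr

def word_similarity_score(emb, pairs, human_scores):
    """Cosine similarity per word pair, then Spearman correlation with human scores."""
    model_scores = []
    for w1, w2 in pairs:
        v1 = emb[w1] / np.linalg.norm(emb[w1])
        v2 = emb[w2] / np.linalg.norm(emb[w2])
        model_scores.append(float(v1 @ v2))
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```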
            Sim-999   WS-353   MEN-3k
CBOW        27.52     60.98    56.52
SG          33.57     66.15    63.36
DSG         33.04     65.48    63.09
SSSG        32.01     63.90    60.16
FastText    32.17     64.65
CSG_{EF,0.25}
CSG_{LF,0.25}
CSG_{EF,0.50}
CSG_{LF,0.50}
CSG_{EF,0.75}
CSG_{LF,0.75}
CSG_{EF,1.00}
CSG_{EF,0→1}
CSG_{LF,0→1}
CSG_{EF,ran}
CSG_{LF,ran}

Table 2: Word similarity evaluation results (ρ × 100) on the small corpus. Sim-999 denotes the SimLex-999 dataset.

            Sim-999   WS-353   MEN-3k
CBOW        37.56     63.46    70.53
SG          36.88     71.65    74.81
DSG         38.32     70.07    73.38
SSSG        37.75     70.66    73.83
FastText    37.36     73.53    76.43
CSG_{EF,0.25}
CSG_{LF,0.25}
CSG_{EF,0.50}
CSG_{LF,0.50}
CSG_{EF,0.75}
CSG_{LF,0.75}
CSG_{EF,1.00}
CSG_{EF,0→1}
CSG_{LF,0→1}
CSG_{EF,ran}
CSG_{LF,ran}

Table 3: Word similarity evaluation results (ρ × 100) on the large corpus. Sim-999 denotes the SimLex-999 dataset.

Our results are reported in Tables 2 and 3; word similarity evaluation results on a different corpus are given in the Appendix. The CBOW model shows degraded performance on the small corpus because it has fewer effective training samples than the SG-based models. The FastText model yielded results comparable to the CSG models on this task, especially on the MEN-3k dataset.

On both corpora, the CSG models show better overall performance than the baseline models. The effect of the fusion method depends on the corpus size and pre-processing style: for static fusion weights, LF works better on the small corpus, while EF works better on the large corpus. Furthermore, the results show that a higher γ leads to better scores. However, when γ becomes one, the WS-353 and MEN-3k scores drop on both corpora. We consider this to happen due to the absence of a direct relationship between the center word and the surrounding word during training. The linear fusion weight scheme, γ_{linear,0→1}, helps to settle this issue; as a result, it achieved a balanced and superior overall score. The γ_{ran} scheme shows results similar to a mid-range fixed γ but seems less dependent on the context information.
            Google                  MSR
            Semantic   Syntactic    Syntactic
CBOW        57.95      64.57        52.55
SG          54.54      59.58        46.68
DSG         55.74      63.02        49.67
SSSG        56.38      62.56        49.15
FastText    40.24      55.13        43.22
CSG_{EF,0.25}
CSG_{LF,0.25}
CSG_{EF,0.50}
CSG_{LF,0.50}
CSG_{EF,0.75}
CSG_{LF,0.75}
CSG_{EF,1.00}
CSG_{EF,0→1}
CSG_{LF,0→1}
CSG_{EF,ran}
CSG_{LF,ran}

Table 4: Word analogy task results on the large corpus.
3.6 Word Analogy

The word analogy task is to solve questions that require predicting a word D from given words A, B, and C and the relationship "A is to B as C is to D". The task is solved by finding the word whose vector is most similar to B + C - A. The accuracy of each model is measured by counting the correct answers. We employed two datasets for this task: the Google analogy dataset [3] and the MSR analogy dataset [15]. The Google dataset contains 10,675 syntactic and 8,869 semantic questions, while the MSR dataset is composed of 8,000 syntactic questions. The large corpus was used to minimize the number of unanswerable questions.

Table 4 presents the results on the word analogy task. As in the word similarity task, the CSG models trained with a high γ give decent results. Accordingly, γ_{linear,0→1} achieves the best overall result in this task. Interestingly, the CBOW shows the best performance among the baselines, in contrast with the word similarity evaluation task. Hence, leveraging context information seems to improve performance in this task.
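A hedged sketch of the analogy search (our own code, not the authors'; `emb` is assumed to be a row-normalized |V| × d embedding matrix and `vocab` a word-to-row-index map):

```python
import numpy as np

def solve_analogy(emb, vocab, a, b, c):
    """Return the word whose vector is most similar to B - A + C."""
    query = emb[vocab[b]] - emb[vocab[a]] + emb[vocab[c]]
    query /= np.linalg.norm(query)
    sims = emb @ query                    # cosine similarity, rows are normalized
    for w in (a, b, c):                   # question words are excluded as answers
        sims[vocab[w]] = -np.inf
    inverse = {i: w for w, i in vocab.items()}
    return inverse[int(np.argmax(sims))]
```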
            dev F1    test F1
CBOW        93.21     89.52
SG          94.22     90.22
DSG
FastText    94.43     90.55
CSG_{EF,0.25}
CSG_{LF,0.25}
CSG_{EF,0.50}
CSG_{LF,0.50}
CSG_{EF,0.75}
CSG_{LF,0.75}
CSG_{EF,1.00}
CSG_{EF,0→1}
CSG_{LF,0→1}
CSG_{EF,ran}
CSG_{LF,ran}

Table 5: NER task F1 scores on the CoNLL-2003 dataset.
3.7 Named Entity Recognition

For extrinsic evaluation, we conducted a named entity recognition (NER) task. The CoNLL-2003 English benchmark dataset [16], containing train/dev/test sets, was used. A bidirectional LSTM-CRF model [17, 18] initialized with the produced embeddings was used to make predictions. During training, the parameter set with the best dev set F1 score was selected as the output.

Table 5 shows the results. Contrary to the previous tasks, the CSG models trained with a high γ achieved similar or degraded performance on the NER task. On the other hand, low context weights and γ_{ran} present similar or better F1 scores than the original SG model. Still, the other SG augmentations are better than γ_{ran} on the NER task. From the poor result of the CBOW, we conjecture that utilizing context information does not fit the NER task well.

4 Conclusion

In this paper, we presented a simple but strong augmentation that utilizes both the context information and the center word without extra out-of-corpus resources. The experimental results show that the CSG model can provide finer pre-trained word representations. In addition, the CSG model could potentially achieve better performance with further research on more sophisticated context functions and fusion weight schemes.
References

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, "A neural probabilistic language model," J. Mach. Learn. Res., vol. 3, pp. 1137–1155, Mar. 2003. [Online]. Available: http://dl.acm.org/citation.cfm?id=944919.944966
[2] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proceedings of the 25th International Conference on Machine Learning, ser. ICML '08. New York, NY, USA: ACM, 2008, pp. 160–167. [Online]. Available: http://doi.acm.org/10.1145/1390156.1390177
[3] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013. [Online]. Available: http://arxiv.org/abs/1301.3781
[4] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS'13. USA: Curran Associates Inc., 2013, pp. 3111–3119. [Online]. Available: http://dl.acm.org/citation.cfm?id=2999792.2999959
[5] M. U. Gutmann and A. Hyvärinen, "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics," J. Mach. Learn. Res., vol. 13, no. 1, pp. 307–361, Feb. 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2503308.2188396
[6] M. Yu and M. Dredze, "Improving lexical embeddings with semantic knowledge," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2014.
[7] In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
[8] M. Faruqui, J. Dodge, S. K. Jauhar, C. Dyer, E. Hovy, and N. A. Smith, "Retrofitting word vectors to semantic lexicons," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015.
[9] W. Ling, C. Dyer, A. W. Black, and I. Trancoso, "Two/too simple adaptations of word2vec for syntax problems," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015.
[10] Y. Song, S. Shi, J. Li, and H. Zhang, "Directional skip-gram: Explicitly distinguishing left and right context for word embeddings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018.
[11] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[12] F. Hill, R. Reichart, and A. Korhonen, "SimLex-999: Evaluating semantic models with (genuine) similarity estimation," CoRR, vol. abs/1408.3456, 2014. [Online]. Available: http://arxiv.org/abs/1408.3456
[13] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin, "Placing search in context: The concept revisited," in Proceedings of the 10th International Conference on World Wide Web, ser. WWW '01. New York, NY, USA: ACM, 2001, pp. 406–414. [Online]. Available: http://doi.acm.org/10.1145/371920.372094
[14] E. Bruni, N. K. Tran, and M. Baroni, "Multimodal distributional semantics," J. Artif. Int. Res., vol. 49, no. 1, pp. 1–47, Jan. 2014. [Online]. Available: http://dl.acm.org/citation.cfm?id=2655713.2655714
[15] T. Mikolov, W.-t. Yih, and G. Zweig, "Linguistic regularities in continuous space word representations," in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013.
[16] E. F. Tjong Kim Sang and F. De Meulder, "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition," in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003.
[17] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," CoRR, vol. abs/1508.01991, 2015. [Online]. Available: http://arxiv.org/abs/1508.01991
[18] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.
A Experiments
A.1 Word Similarity Evaluation
Table 6 shows the full results on the text8 corpus, which has a different pre-processing style and a much smaller vocabulary. From the results with static fusion weights, a trade-off between the dataset scores is observed as the fusion weight grows. The CSG models trained with γ_{linear,0→1} also present good performance on this corpus.

            Sim-999   WS-353   MEN-3k
CBOW        30.62     70.83    59.28
SG          33.11     68.57    59.43
DSG         30.02     69.18    66.95
SSSG        32.60     66.77    56.99
FastText    31.21     64.37    59.91
CSG_{EF,0.25}
CSG_{LF,0.25}
CSG_{EF,0.50}
CSG_{LF,0.50}
CSG_{EF,0.75}
CSG_{LF,0.75}
CSG_{EF,1.00}
CSG_{EF,0→1}
CSG_{LF,0→1}
CSG_{EF,ran}
CSG_{LF,ran}

Table 6: Word similarity evaluation results (ρ × 100) on the text8 corpus. Sim-999 denotes the SimLex-999 dataset. The text8 corpus is available at http://mattmahoney.net/dc/textdata.html