Temporal dynamics of semantic relations in word embeddings: an application to predicting armed conflict participants
Andrey Kutuzov
Department of Informatics, University of Oslo
[email protected]

Erik Velldal
Department of Informatics, University of Oslo
[email protected]

Lilja Øvrelid
Department of Informatics, University of Oslo
[email protected]
Abstract
This paper deals with using word embedding models to trace the temporal dynamics of semantic relations between pairs of words. The set-up is similar to the well-known analogies task, but expanded with a time dimension. To this end, we apply incremental updating of the models with new training texts, including incremental vocabulary expansion, coupled with learned transformation matrices that let us map between members of the relation. The proposed approach is evaluated on the task of predicting insurgent armed groups based on geographical locations. The gold standard data for the time span 1994–2010 is extracted from the UCDP Armed Conflicts dataset. The results show that the method is feasible and outperforms the baselines, but also that important work still remains to be done.
1 Introduction

In this research, we make an attempt to model the dynamics of worldwide armed conflicts on the basis of English news texts. To this end, we employ the well-known framework of Continuous Bag-of-Words modeling (Mikolov et al., 2013c) for training word embeddings on the English Gigaword news text corpus (Parker et al., 2011). We learn linear projections from the embeddings of geographical locations where violent armed groups were active to the embeddings of these groups. These projections are then applied to the embeddings and gold standard data from the subsequent year, thus predicting what entities act as violent groups in the next time slice. To evaluate our approach, we adapt the UCDP Armed Conflict Dataset (Gleditsch et al., 2002; Allansson et al., 2017) (see Section 2 for details).

Here is a simplified example of the task: given that in 2003, the Kashmir Liberation Front and ULFA were involved in armed conflicts in India, and the Lord's Resistance Army in Uganda, predict entities playing the same role in 2004 in Iraq (the correct answers are Ansar al-Islam, al-Mahdi Army and Islamic State). The nature of the task is conceptually similar to that of analogical reasoning, but with the added complexity of temporal change.

Attempts to detect semantic change using unsupervised methods have a long history. Significant results have already been achieved in employing word embeddings to study diachronic language change. Among others, Eger and Mehler (2016) show that the embedding of a given word for a given time period is to a large extent a linear combination of its embeddings for the previous time periods. Hamilton et al. (2016) proposed an important distinction between cultural shifts and linguistic drifts. They showed that global embedding-based measures (comparing the similarities of words to all other words in the lexicon) are sensitive to regular processes of linguistic drift, while local measures (comparing nearest neighbors' lists) are a better fit for more irregular cultural shifts in word meaning.

Our focus here is on cultural shifts: it is not the dictionary meanings of the names denoting locations and armed groups that change, but rather their 'image' in the analyzed texts. Our measurement approach can also be defined as 'local' to some extent: the linear projections that we learn are mostly based and evaluated on the nearest neighborhood data. However, this method is different in that its scope is not single words but pairs of typed entities ('location' and 'armed group' in our case) and the semantic relations between them.

1.1 Contributions
The main contributions of this paper are:

1. We show that distributional semantic models, in particular word embeddings, can be used not only to trace diachronic semantic shifts in words, but also the temporal dynamics of semantic relations between pairs of words.

2. The necessary prerequisites for achieving decent performance in this task are incremental updating of the models with new textual data (instead of training from scratch each time new data is added) and some way of expanding the vocabulary of the models.
2 Armed conflict data

The UCDP/PRIO Armed Conflict Dataset, maintained by the Uppsala Conflict Data Program and the Peace Research Institute Oslo, is a manually annotated geographical and temporal dataset with information on armed conflicts in the time period from 1946 to the present (Gleditsch et al., 2002; Allansson et al., 2017). It encodes conflicts where at least one party is the government of a state. The Armed Conflict Dataset is widely used in statistical and macro-level conflict research; however, it was adapted and introduced to the NLP field only recently, starting with Kutuzov et al. (2017). Whereas that work was focused on detecting the onset/endpoint of armed conflicts, the current paper further extends on this by using the dataset to evaluate the detection of changes in the semantic relation holding between participants of armed conflicts and their locations.

Two essential notions in the UCDP data are those of event and armed conflict. Events can evolve into full-scale armed conflicts, defined as contested incompatibilities that concern government and/or territory where the use of armed force between two parties, of which at least one is the government of a state, results in at least 25 battle-related deaths (Sundberg and Melander, 2013).

The subset of the data that we employ is the UCDP Conflict Termination dataset (Kreutz, 2010). It contains entries on the starting and ending dates of about 2000 conflicts. We limit ourselves to the conflicts taking place between 1994 and 2010 (the Gigaword time span). Almost always, the first actor of the conflict (sideA) is the government of the corresponding location, and the second actor (sideB) is some insurgent armed group we are interested in. We omitted the conflicts where both sides were governments (about 2% of the entries) or where one of the sides was mentioned in the Gigaword less than 100 times (about 1% of the entries). In cases when the UCDP described the conflict as featuring several groups on the sideB, we created a separate entry for each.

This resulted in a test set of 673 conflicts, with 137 unique Location–Insurgent pairs throughout the whole time span (many pairs appear several times in different years). In total, it mentions 52 locations (with India being the most frequent) and 128 armed insurgent groups (with ULFA, or the United Liberation Front of Assam, being the most frequent). This test set is available for subsequent reuse (http://ltr.uio.no/~andreku/armedconflicts/).
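For illustration, the extraction of such Location–Insurgent pairs could be implemented roughly as follows. This is a minimal sketch in Python assuming a CSV export of the UCDP Conflict Termination dataset; the file name and the columns year, location and side_b are hypothetical stand-ins for the actual UCDP field names, and the corpus frequency filter is only indicated in a comment.

```python
import pandas as pd

# Hypothetical file and column names; the real UCDP Conflict Termination
# export uses its own field names and coding conventions.
ucdp = pd.read_csv("ucdp_conflict_termination.csv")

# Keep only conflict-years within the Gigaword time span (1994-2010).
ucdp = ucdp[(ucdp["year"] >= 1994) & (ucdp["year"] <= 2010)]

# Drop the ~2% of entries where both sides are governments.
ucdp = ucdp[~ucdp["side_b"].str.startswith("Government of")]

# One test entry per insurgent group listed on side B.
pairs = set()
for _, row in ucdp.iterrows():
    for group in str(row["side_b"]).split(","):
        pairs.add((row["location"].strip(), group.strip(), int(row["year"])))

# The additional filter (both sides mentioned at least 100 times in Gigaword)
# would be applied against corpus frequency counts, not shown here.
print(len(pairs), "location-insurgent-year entries")
```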
3 Approach

In this section, we provide a detailed description of our approach, starting with a synchronic example in 3.1 and then moving on to a toy diachronic example on one pair of years in 3.2. In Section 4, we conduct the evaluation on the full test set.

3.1 Synchronic projections

We first conducted preliminary experiments to assess the hypothesis that the embeddings contain semantic relationships of the type 'insurgent participant of an armed conflict in the location'. To this end, we trained a CBOW model on the full English Gigaword corpus (about 4.8 billion tokens in total), with a symmetric context window of 5 words, vector size 300, 10 negative samples and 5 iterations. Words with a frequency less than 100 were ignored during training. We used Gensim (Řehůřek and Sojka, 2010) for training, and in terms of corpus pre-processing we performed lemmatization, PoS-tagging and NER using Stanford CoreNLP (Manning et al., 2014). Named entities were concatenated into one token (for example, United States became United::States_PROPN).

Then, we used the 137 Location–Insurgent pairs derived in Section 2 to learn a projection matrix from the embeddings for locations to the embeddings for insurgents. The idea and the theory behind this approach are extensively described in Mikolov et al. (2013b) and Kutuzov et al. (2016), but essentially it involves training a linear regression which minimizes the error in transforming one set of vectors into another. Finding the optimal transformation matrix amounts to solving i normal equations (where i is the vector size in the embedding model being used), as shown in Equation 1:

β_i = (X^⊺ X + λL)^(−1) X^⊺ y_i    (1)

where X is the matrix of 137 location word vectors (input), y_i is the array of the i-th components of the 137 corresponding insurgent word vectors (correct predictions), L is the identity matrix of size i, with 0 at the top left cell, and λ is a real number used to tune the influence of the regularization term (if λ = 0, there is no regularization). β_i is the array of i optimal coefficients which transform an arbitrary location vector into the i-th component of the corresponding insurgent vector. After learning such an array for each vector component, we have a linear projection matrix which can 'predict' an insurgent embedding from a location embedding.

To evaluate the resulting projections, we employed leave-one-out cross-validation, i.e., measuring the average accuracy of predictions on each pair from the test set, after training the matrix on all the pairs except the one used for testing. The transformation matrix was dot-multiplied by the location vector from the test pair. Then, we found the n nearest neighbors in the word embedding model for this predicted vector. If the real insurgent in the test pair was present in these n neighbors, the accuracy for this pair was 1, otherwise 0. Table 1 reports the average accuracies with different values of λ and n.

Table 1: Accuracies for synchronic projections from locations to armed groups (loc → group) and vice versa (group → loc), at @1, @5 and @10, for different values of λ.

Relations of this kind are not symmetric: it is much easier to predict the location based on the insurgent than vice versa (see the right part of Table 1). Moreover, we find that the achieved results are roughly consistent with the performance of the same approach on the Google Analogies test set (Mikolov et al., 2013a). We converted the semantic sections of the Analogies test set containing only nouns (capitals–common, capitals–world, cities in states, currency and family) to sets of unique pairs. Then, linear projections with λ = 1.0 were learned and evaluated for each of them. The average accuracies over these sections were 13.0 @1 and 48.77 @5.

The results on predicting armed groups are still worse than on the Google Analogies, because of 3 factors: 1) one-to-many relationships in the UCDP dataset (multiple armed groups can be active in the same location) make learning the transformation matrix more difficult; 2) the frequency of words denoting armed groups is lower than that of any of the words in the Google Analogies data set; thus, their embeddings are of lower quality; 3) training the matrix on the whole Gigaword model is suboptimal, as the majority of armed groups were not active throughout its entire time span.

All our experiments were also conducted using the very similar Continuous Skipgram models. However, as CBOW proved to consistently outperform Skipgram for our tasks, we only report results for CBOW, due to limited space (it seems CBOW is often better than Skipgram with linear projections; cf. the same claim in Kutuzov et al. (2016)).
To sum up this section: many-to-one semantic relations between locations and insurgents do exist in the word embedding models. They are less strongly expressed than one-to-one relations like those in the Google Analogies test set, but can still be found using linear projections. In the next section, we trace the dynamics of these relations as the models are updated with new data.
3.2 Diachronic projections

Our approach to using learned transformation matrices to trace armed conflict dynamics through time consists of the following. We first train a CBOW model on the subsection of Gigaword texts belonging to the year 1994. Then, we incrementally update (train) this same model with new texts, saving a new model after each subsequent year. The size of the yearly subcorpora is about 250–320 million content words each. Importantly, we also use vocabulary expansion: new words are added to the vocabulary of the model if their frequency in the new yearly data satisfies our minimal threshold of 15 (we did not experiment with different thresholds; the value was initially set so as to produce a reasonable vocabulary size of several hundred thousand words). Each yearly training session is performed in 5 iterations, with a linearly decreasing learning rate. Note that we do not use any model alignment method (Procrustes, etc.): since each yearly model is produced by further training the previous one, all models remain in the same vector space.
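The yearly update loop can be sketched with Gensim (4.x interface) along the following lines. The file names are hypothetical stand-ins for the preprocessed yearly Gigaword slices; the hyperparameters follow Section 3.1, and applying the min_count of 100 to the initial 1994 model is our assumption.

```python
from gensim.models.word2vec import Word2Vec, LineSentence

# Initial model, trained on the 1994 slice of Gigaword.
model = Word2Vec(
    LineSentence("gigaword_1994.txt.gz"),
    sg=0,                # CBOW
    vector_size=300,
    window=5,
    negative=10,
    min_count=100,
    epochs=5,
)
model.save("model_1994.model")

for year in range(1995, 2011):
    sentences = LineSentence(f"gigaword_{year}.txt.gz")
    # Vocabulary expansion: add new words seen in this year's texts.
    # (The paper uses a lower frequency threshold of 15 for these updates;
    # how to override min_count here depends on the Gensim version.)
    model.build_vocab(sentences, update=True)
    # Continue training the same model; the learning rate again decays
    # linearly over the 5 epochs. No Procrustes-style alignment is needed,
    # since all yearly models stay in the same vector space.
    model.train(sentences, total_examples=model.corpus_count, epochs=5)
    model.save(f"model_{year}.model")
```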
As an isolated example, we want to check whether the Location–Insurgent projection learned on the model trained on texts up to 2000 is able to reveal conflicts that appear in 2001. Thus, we extract from the UCDP dataset all the pairs related to the conflicts which took place between 1994 and 2000 (91 pairs in total). The projection is trained on their embeddings from the first model (actually, on 79 pairs, as 12 armed group names were not present in the 2000 model and were subsequently skipped). Then, this projection is applied to the second model's embeddings of the 47 locations which are subject to armed conflicts in the year 2001 (38 after skipping pairs with out-of-vocabulary elements). Table 2 shows the resulting performance (reflecting how close the predicted vectors are to the actual armed groups active in a given location).

Pairs (size)   @1     @5     @10
All (38)       44.7   73.7   84.2
New (7)        14.3   28.6   42.9

Table 2: Projection accuracy for the isolated example experiment mapping from 2000 to 2001.

Note that out of the 38 pairs from 2001, 31 were already present in the previous data set (ongoing conflicts). This explains why the evaluation on all the pairs gives high results. However, even for the new conflicts, the projection performance is encouraging. Among others, it managed to precisely spot the 2001 insurgency of the members of the
Kosovo Liberation Army in Macedonia, notwithstanding the fact that the initial set of training pairs did not mention Macedonia at all. Thus, it seems that the models at least partially 'align' new data along the existing semantic axis trained before.
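Concretely, the prediction step of this 2000 → 2001 example could look roughly as follows. This sketch reuses the hypothetical learn_projection helper from Section 3.1; model_2000 and model_2001 are the incrementally trained yearly models, pairs_1994_2000 the conflict pairs known up to 2000, and locations_2001 the locations with conflicts in 2001 (all names are ours).

```python
from gensim.models.word2vec import Word2Vec

model_2000 = Word2Vec.load("model_2000.model")
model_2001 = Word2Vec.load("model_2001.model")

# Learn the Location -> Insurgent projection on pairs known up to 2000,
# skipping pairs with out-of-vocabulary elements.
train_pairs = [(loc, grp) for loc, grp in pairs_1994_2000
               if loc in model_2000.wv and grp in model_2000.wv]
X = [model_2000.wv[loc] for loc, _ in train_pairs]
Y = [model_2000.wv[grp] for _, grp in train_pairs]
beta = learn_projection(X, Y, lmbd=1.0)

# Apply it to the 2001 model: predict insurgents for 2001 conflict locations.
for location in locations_2001:
    if location not in model_2001.wv:
        continue
    predicted = model_2001.wv[location] @ beta
    print(location, model_2001.wv.similar_by_vector(predicted, topn=5))
```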
In the next section, we systematically evaluate our approach on the whole set of UCDP conflicts in the Gigaword years (1994–2010).

4 Evaluation

To evaluate our approach on all the UCDP data, we again tested how good it is at predicting future conflicts based on the projection matrices learned from the previous years. We did this for all the years between 1994 and 2010. The evaluation metrics are the same as in Section 3: we calculated the ratio of correctly predicted armed group names over the conflict pairs which the UCDP dataset lists as active in that particular year. As before, the models employed in the experiment were incrementally trained on each successive year with vocabulary expansion. Words present in the gold standard but absent from the models under analysis were skipped. In the worst case, 25% of the pairs were skipped from the test set; on average, 13% were skipped each year (but see the note below about the incr. static baseline). At test time, all the entities were lowercased.

We employ 3 baselines: 1) yearly models trained separately from scratch on the corpora containing texts from each year only (referred to as separate hereafter); 2) yearly models trained from scratch on all the texts from the particular year and the previous years (cumulative hereafter); 3) incrementally trained models without vocabulary expansion (incr. static hereafter).

Initially, the linear projections for all models were trained on all the conflict pairs from the past and present years, similar to Section 3.2 (dubbed up-to-now hereafter). However, the information about conflicts having ended several years before might not be strongly expressed in the model after it was incrementally updated with the data from all the subsequent years. For example, the 2005 model hardly contains much knowledge about the conflict relations between Mexico and the
Popular Revolutionary Army (EPR), which stopped its activities after 1996. Thus, we additionally conducted a similar experiment in which the projections were learned only on the salient pairs (dubbed previous): that is, the pairs active in the last year up to which the model was trained.

Table 3 presents the results for these experiments, as well as for the baselines (averaged across 15 years).

Table 3: Average accuracies of predicting next-year insurgents on the basis of locations, using projections trained on the conflicts from all the preceding years (up-to-now) or the preceding year only (previous). Results for the 3 baselines (Separate, Cumulative, Incr. static) are shown along with the proposed Incr. dynamic approach; accuracies @1, @5 and @10 are reported both over only the in-vocabulary pairs and over all pairs, including OOV.

For the proposed incr. dynamic approach, the performance of the previous projections is
comparable to that of the up-to-now projections on the accuracies @5 and @10, and is even higher on the accuracy @1 (statistically significant according to a t-test). Thus, the single-year projections are somewhat more 'focused', while taking much less time to learn because of the smaller number of training pairs.

The fact that our models were incrementally updated, not trained from scratch, is crucial. The results of the separate baseline look more like random jitter. The cumulative baseline results are slightly better, probably simply because they are trained on more data. However, they still perform much worse than the models trained using incremental updates. This is because the former models are not connected to each other, and thus are initialized with a different layout of words in the vector space. This gives rise to formally different directions of semantic relations in each yearly model (the relations themselves are still there, but they are rotated and scaled differently).

The results for the incr. static baseline, when tested only on the words present in the test model vocabulary (the left part of the table), seem better than those of the proposed incr. dynamic approach. This stems from the fact that incremental updating with a static vocabulary means that we never add new words to the models; thus, they contain only the vocabulary learned from the 1994 texts. The result is that at test time we skip many more pairs than with the other approaches (about 62% on average). Consequently, the projections are tested only on a minor part of the test sets.

Of course, skipping large parts of the data would be a major drawback for any realistic application, so the incr. static baseline is not really plausible. For comparison, the right part of Table 3 provides the accuracies for the setup in which all the pairs are evaluated (for pairs with OOV words the accuracy is always 0). The other tested approaches are not much affected by this change, but for incr. static the performance drops drastically. As a result, for the all pairs scenario, incremental updating with vocabulary expansion outperforms all the baselines (the differences are statistically significant according to a t-test).

5 Conclusion and future work

We have here shown how incrementally updated word embedding models with vocabulary expansion and linear projection matrices are able to trace the dynamics of subtle semantic relations over time. We applied this approach to the task of predicting armed groups active in particular geographical locations and showed that it significantly outperforms the baselines. However, it can be used for any kind of semantic relation. We believe that studying temporal shifts of such projections can lead to interesting findings far beyond the usual example of 'king is to queen as man is to woman'. To our best knowledge, the behavior of semantic relations in updated word embedding models has not been explored before.
Our experiments show that the models do preserve these 'directions' and that the learned projections not only hold for the word pairs known to the initial model, but can also be used to predict relations for new words.

In terms of future work, we plan to trace how quickly incremental updates to the model 'dilute' the projections, rendering them useless with time. We observed this performance drop in our experiments, and it would be interesting to know more about the regularities governing this deterioration. Also, for the particular task of analyzing armed conflicts, we plan to research ways of improving accuracy in predicting completely new armed groups not present in the training data, and methods of filtering out locations not involved in armed conflicts.

References
Marie Allansson, Erik Melander, and Lotta Themnér. 2017. Organized violence, 1989–2016. Journal of Peace Research, 54(4).

Steffen Eger and Alexander Mehler. 2016. On the linearity of semantic change: Investigating meaning variation via dynamic graph models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 52–58, Berlin, Germany.

Nils Petter Gleditsch, Peter Wallensteen, Mikael Eriksson, Margareta Sollenberg, and Håvard Strand. 2002. Armed conflict 1946–2001: A new dataset. Journal of Peace Research, 39(5):615–637.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Cultural shift or linguistic drift? Comparing two computational measures of semantic change. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2116–2121, Austin, Texas, USA.

Nobuhiro Kaji and Hayato Kobayashi. 2017. Incremental skip-gram model with negative sampling. arXiv preprint arXiv:1704.03956.

Joakim Kreutz. 2010. How and when armed conflicts end: Introducing the UCDP conflict termination dataset. Journal of Peace Research, 47(2):243–250.

Andrey Kutuzov, Mikhail Kopotev, Tatyana Sviridenko, and Lyubov Ivanova. 2016. Clustering comparable corpora of Russian and Ukrainian academic texts: Word embeddings and semantic fingerprints. In Proceedings of the Ninth Workshop on Building and Using Comparable Corpora, pages 3–10.

Andrey Kutuzov, Erik Velldal, and Lilja Øvrelid. 2017. Tracing armed conflicts with diachronic word embedding models. In Proceedings of the Events and Stories in the News Workshop, Vancouver, Canada. ACL.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland, USA.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Quoc Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013c. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword Fifth Edition LDC2011T07. Technical report, Linguistic Data Consortium, Philadelphia.

Hao Peng, Jianxin Li, Yangqiu Song, and Yaopeng Liu. 2017. Incrementally learning the hierarchical softmax function for neural language models. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3267–3273, San Francisco, California, USA.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta.

Ralph Sundberg and Erik Melander. 2013. Introducing the UCDP georeferenced event dataset. Journal of Peace Research, 50(4):523–532.