SChME at SemEval-2020 Task 1: A Model Ensemble for Detecting Lexical Semantic Change
Maurício Gruppi, Sibel Adalı
Rensselaer Polytechnic Institute, Troy, NY, USA
{gouvem, adalis}@rpi.edu

Pin-Yu Chen
IBM Research, Yorktown Heights, NY, USA
[email protected]
Abstract
This paper describes SChME (Semantic Change Detection with Model Ensemble), a method used in SemEval-2020 Task 1 on unsupervised detection of lexical semantic change. SChME uses a model ensemble combining signals from distributional models (word embeddings) and word frequency models, where each model casts a vote indicating the probability that a word suffered semantic change according to that feature. More specifically, we combine the cosine distance of word vectors, a neighborhood-based metric we named Mapped Neighborhood Distance (MAP), and a word frequency differential metric as input signals to our model. Additionally, we explore alignment-based methods to investigate the importance of the landmarks used in this process. Our results show evidence that the number of landmarks used for alignment has a direct impact on the predictive performance of the model. Moreover, we show that languages that suffer less semantic change tend to benefit from using a large number of landmarks, whereas languages with more semantic change benefit from a more careful choice of the number of landmarks used for alignment.
Introduction

The problem of detecting Lexical Semantic Change (LSC) consists of measuring and identifying change in word sense across time, such as in the study of language evolution, or across domains, such as determining discrepancies in word usage over specific communities (Schlechtweg et al., 2019). One of the greatest challenges of this problem is the difficulty of assessing and evaluating models and results, as well as the limited amount of annotated data (Schlechtweg and Walde, 2020). For that reason, the vast majority of the related work in the literature pursues this problem from an unsupervised perspective, that is, detecting semantic change without prior knowledge of "truth". The importance of this task is manifold: to humans, it can be a powerful tool for studying language change and its cultural implications; to machines, it can be used to improve language models in downstream tasks such as unsupervised word translation and fine-tuning of word embeddings (Joulin et al., 2018; Bojanowski et al., 2019). In this task, the goal is to develop a method for unsupervised detection of lexical semantic change over time by comparing two corpora from different time periods in four languages: English, German, Latin, and Swedish (Schlechtweg et al., 2020). In particular, we are required to solve two subtasks: binary classification of semantic change (Subtask 1), and semantic change ranking (Subtask 2).
There are many ways in which a word may change. Specifically, a word w may change sense because it has been completely replaced by a synonym w_s (lexical replacement), or because it gains a new meaning, in which case word w may keep or lose its previous meaning across time and domain (Kutuzov et al., 2018). Each type of change has its unique characteristics and may require different approaches in order to be detected. In this paper we describe a novel model ensemble method based on different features (signals) that we can extract from the text using distributional models (skip-gram word embeddings) and word frequency. Our model is primarily based on features extracted from independently trained Word2Vec embeddings aligned with orthogonal procrustes (Schönemann, 1966), such as cosine distance, but it also introduces two novel measures based on second-order distances and word frequency. Based on the distribution of each feature, we predict the probability that a word has suffered change through an anomaly detection approach. The final decision is made by soft voting (averaging) over all the probabilities. For binary classification (Subtask 1), a threshold is applied to the final vote; for ranking (Subtask 2), the output of the soft voting is used as the ranking prediction.

Our results show that second-order methods and different feature combinations outperform the frequently used cosine distance in some subtasks and languages. Furthermore, we illustrate that the methods are sensitive to the degree of change in the language. It is possible to improve the performance of these methods by aligning two embeddings of the same language from different time slices on a subset of words instead of all words. This opens a new avenue of research on finding optimal words for alignment. The code for the model can be obtained at https://github.com/mgruppi/schme.

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.

Related Work

Most methods for detecting semantic change are based on the distributional property of word semantics. The general idea is to compute contextual information of a word w in each time or domain, and apply a measure of difference or distance between the observed contexts of w. Some of the first methods for detecting semantic change compute context information using a co-occurrence matrix within a pre-defined window of size L (Sagi et al., 2009; Cook and Stevenson, 2010). This means that, for a vocabulary of size n, one computes an n × n matrix M where M_{i,j} is the frequency with which words i and j co-occur within a window of L words. This often yields a highly sparse matrix M, which is typically reduced in dimensionality by techniques such as Singular Value Decomposition (SVD). Once the matrices are computed, the contextual difference is given by the cosine distance between the vectors.

Distributed word vector representations such as the ones obtained by skip-gram with negative sampling (SGNS) (Mikolov et al., 2013) are a way of learning distributional information without the need for computing sparse co-occurrence matrices. A work by Hamilton et al. (2016b) presents a method for detecting semantic change using SGNS word embeddings learned from each corpus and aligned with orthogonal procrustes. The semantic change is, again, computed by the cosine distance between a word's vectors in each time/domain.
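For concreteness, the early co-occurrence pipeline described above (count matrix, SVD reduction, cosine distance) can be sketched as follows; this is an illustrative sketch with our own function names and defaults, not code from any of the cited works.

```python
import numpy as np

def cooccurrence_matrix(tokens, vocab, L=5):
    """Count how often each vocabulary pair co-occurs within a window of L words."""
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        if w not in idx:
            continue
        window = tokens[max(0, i - L):i] + tokens[i + 1:i + 1 + L]
        for c in window:
            if c in idx:
                M[idx[w], idx[c]] += 1.0
    return M

def svd_reduce(M, d=100):
    """Reduce the sparse co-occurrence rows to dense d-dimensional vectors."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :d] * S[:d]

def cosine_distance(u, v):
    """Contextual difference between two context vectors."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```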
In another study (Hamilton et al., 2016a), the authors introduce a measure of semantic change named Local Neighborhood Change, based on how the neighborhood of a word changes as measured by the number of neighboring words in common.

To eliminate the need for alignment, several authors have proposed dynamic word embedding techniques, which jointly learn distributional word representations under the assumption that words are connected across time (Bamler and Mandt, 2017; Rudolph and Blei, 2018; Yao et al., 2018). The main assumption in such methods is that word changes are considerably small between adjacent time stamps t_1 and t_2, i.e., words evolve smoothly, and thus word representations should be close between these periods. We argue that the assumption that all words in t_1 and t_2 should be smoothly connected through time does not always hold. This is because the corpora are aggregated over several years/decades/centuries, so the semantic change may be drastic, and more similar to a cross-domain scenario than a diachronic one. We illustrate this through the corpora in this task and the use of a subset of landmarks for alignment, which has not been investigated in the literature.

Data

The data provided in this task consists of two corpora for each language, each corpus corresponding to a different time period (t_1 or t_2), as well as a list of target words for which we have to predict a binary class and a rank with respect to the magnitude of the semantic change between t_1 and t_2. The corpora used for each language are summarized in Table 1.

Language   Corpora                                       t_1        t_2
English    CCOHA (Alatrash et al., 2020)                 1810-1860  1960-2010
German     DTA + BZ + ND (Schlechtweg et al., 2020)      1800-1900  1946-1990
Latin      LatinISE (McGillivray and Kilgarriff, 2013)   -200-0     0-2000
Swedish    KubHist (Borin et al., 2012)                  1790-1830  1895-1903

Table 1: Data provided for the task. In addition to the corpora, a set of target words is given, for which we need to generate outputs in Subtasks 1 and 2.

System Description

Most of our features are based on the alignment of word embeddings. Thus, the first step of our system is to train a Word2Vec model on corpora C_1 and C_2 for each language; let W_1 and W_2 denote the resulting word embeddings, respectively. Since W_1 and W_2 are learned independently, we cannot directly compare their vectors. Hence, similarly to Hamilton et al. (2016b), we apply orthogonal procrustes (OP) (Schönemann, 1966) to align the word embeddings of the corpora.
Given matrices A and B, the objective of OP is to learn an orthogonal transformation matrix Q that minimizes the sum of squared distances ‖AQ − B‖². Because Q is orthogonal, the transformation AQ is only subject to rotation and reflection, which preserves the relationships between the word vectors in A. We learn the transformation matrix Q from the alignment of W_1 and W_2, updating W_1 ← W_1 Q. Now the word vectors in W_1 can be directly compared to those in W_2. In the following sections, we discuss the distance metrics used by the model to measure semantic change.

Cosine Distance (COS). One of the most commonly used metrics for comparing word vectors is the cosine distance. The cosine distance between two vectors from a single source indicates how closely distributed the words are. In the semantic change scenario, we compute the cosine distance for a word w as d_cos(w) = 1 − cos(v_1, v_2), where v_1 and v_2 are the word vectors of w in W_1 and W_2, respectively. Ideally, a small value of d_cos would imply that the context of w is similar in both corpora C_1 and C_2.
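A minimal sketch of the OP alignment and the COS feature, assuming the two embeddings are stored as numpy matrices with a shared row-to-word mapping; the closed-form SVD solution is the standard one for orthogonal procrustes, and the function names are ours.

```python
import numpy as np

def orthogonal_procrustes_align(W1, W2):
    """Return W1 rotated into W2's space: Q = argmin_Q ||W1 Q - W2||_F
    over orthogonal Q, solved in closed form via the SVD of W1^T W2."""
    U, _, Vt = np.linalg.svd(W1.T @ W2)
    Q = U @ Vt            # orthogonal: rotation and/or reflection only
    return W1 @ Q

def cos_feature(W1_aligned, W2, i):
    """d_cos for the word stored at row i of both (row-aligned) matrices."""
    v1, v2 = W1_aligned[i], W2[i]
    return 1.0 - (v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
```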
Mapped Neighborhood Change (MAP). This measure looks at how a word moves away from its neighborhood across both corpora. To that end, we compute a second-order cosine distance vector s(v_1, N_1) between v_1 and its k nearest neighbors in W_1, which we denote as the set N_1. Then we compute another second-order vector s(v_2, N_1) using v_2, but looking up the corresponding vectors of each word in N_1 in the space of the second corpus, W_2. The mapped neighborhood change is then computed as the cosine distance d_map(w) = d_cos(s(v_1, N_1), s(v_2, N_1)). Although this method uses second-order distances like Local Neighborhood Change (LNC) (Hamilton et al., 2016a), it differs from it by computing the distances between the aligned input embeddings, whereas LNC only computes such distances for vectors within a single embedding matrix.
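Under the same row-aligned matrix assumption, MAP can be sketched as follows; the choice k = 10 here is purely illustrative.

```python
import numpy as np

def second_order(v, M):
    """Vector of cosine distances between v and each row of M."""
    sims = (M @ v) / (np.linalg.norm(M, axis=1) * np.linalg.norm(v))
    return 1.0 - sims

def map_feature(W1_aligned, W2, i, k=10):
    """Mapped Neighborhood Change for the word stored at row i."""
    v1, v2 = W1_aligned[i], W2[i]
    d1 = second_order(v1, W1_aligned)
    N1 = np.argsort(d1)[1:k + 1]        # k nearest neighbors of v1 (skip itself)
    s1 = d1[N1]                          # s(v1, N1)
    s2 = second_order(v2, W2[N1])        # s(v2, N1): same words, second space
    return 1.0 - (s1 @ s2) / (np.linalg.norm(s1) * np.linalg.norm(s2))
```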
Frequency Differential (FREQ). Let f_1 and f_2 be the relative frequencies of word w in C_1 and C_2. We define the frequency differential for w as d_freq(w) = (f_2 − f_1) / (f_1 + f_2). Positive values indicate an increase, while negative values indicate a decrease in frequency across the corpora. We argue that a steep increase in frequency may indicate change more strongly than a frequency decrease, since a decrease may happen due to a word becoming less popular or being replaced by another word without losing its original sense. This assumption is only viable because we know that C_1 always comes earlier in time than C_2.

We compute the aforementioned features for all words in the intersection of the vocabularies of C_1 and C_2, and use the observed feature distributions to determine potentially changed words. Let X_i denote the random variable associated with the distribution of feature i. We work under the assumption that small values of X_i denote little or no semantic change to a word, whereas unlikely high values of X_i indicate a high chance that the word suffered change according to metric i. We define small and large values with respect to all the computed values in the distribution. For instance, if the cosine distance computed for a word is large when compared to the cosine distances of the other words, it is likely that the word has changed. Therefore, we define the probability of change for a word whose feature value is x_i as P_i(x_i) = Pr(X_i ≤ x_i). Thus, P_i is the cumulative distribution function (CDF) of X_i, describing how unlikely a value as high as x_i is according to the distribution of X_i. We aggregate the probability outputs P_i(x_i) of the features by applying soft voting to each feature's prediction. The final prediction for a feature vector x = (x_1, x_2, ..., x_k) is P(x) = (1/k) Σ_{i=1}^{k} P_i(x_i). For classification, a threshold is applied to P(x) in order to determine the class; for ranking, the score P(x) is used directly.

Experiments and Results

We conduct all experiments on the data provided for SemEval-2020 Task 1 for all four languages. Given that most of the corpora have been pre-processed with lemmatization and tokenization, our pre-processing consists of removing words whose count is less than 10, and tokenizing words at spaces. In this section we present the experiments and results for the model submitted to the task, as well as additional analysis of the model parameters.

We begin by learning the distributional representations of words in each corpus using Gensim's (Řehůřek and Sojka, 2010) implementation of Word2Vec. The parameters for Word2Vec are: vector size d = 300, window L = 10, negative samples ng = 5, and minimum word count min_wc = 10. Next, we align the learned word vectors via OP using the intersecting vocabulary as landmarks. Then, we compute the distance metrics and their distributions so that we can obtain the votes Pr(X_i ≤ x_i).
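The anomaly-detection vote described above can be written compactly; the sketch below is our own illustration, estimating each P_i with an empirical CDF computed from the ranks of the feature values over all words, with the frequency differential as one example raw feature and an illustrative default threshold.

```python
import numpy as np

def freq_differential(f1, f2):
    """d_freq = (f2 - f1) / (f1 + f2), for arrays of relative frequencies."""
    return (f2 - f1) / (f1 + f2)

def empirical_cdf(X):
    """X: (n_words, k) raw feature values. Returns P[w, i] ~ Pr(X_i <= x_wi),
    estimated from the rank of each value within its feature column."""
    ranks = X.argsort(axis=0).argsort(axis=0)
    return (ranks + 1) / X.shape[0]

def ensemble_vote(X, threshold=0.5):
    """Soft vote P(x) = mean_i P_i(x_i). Threshold the score for Subtask 1;
    use the raw score directly as the ranking for Subtask 2.
    The default threshold here is illustrative, not the submitted setting."""
    scores = empirical_cdf(X).mean(axis=1)
    labels = (scores > threshold).astype(int)
    return scores, labels
```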
Finally, we apply the model ensemble to different feature configurations to predict a final score. For classification, we apply a threshold t to the model output P(x), such that the predicted class is y = 1 if P(x) > t, and y = 0 otherwise. For ranking, the final score P(x) is used.

Since there was no validation data during the evaluation phase, our submissions included multiple feature and threshold settings. The feature configurations are combinations of the cosine distance (COS), mapped neighborhood distance (MAP), and frequency differential (FREQ), each submitted at three threshold levels. Our team (RPI-Trust) ranked 4th in Subtask 1 with a score of 0.660, and 6th in Subtask 2 with a score of 0.427 in the evaluation phase.

We evaluate our model on the provided test data in the post-evaluation phase. First, we fix a threshold t; then, we use different feature combinations to evaluate the performance on each language. Classification results, shown in Table 2, indicate that there is no single best feature configuration for all languages. This may happen because each language evolved differently between t_1 and t_2, and each feature model captures different types of change. For example, many events between t_1 and t_2 for the English corpora may have contributed to the evolution of the language, such as the Second Industrial Revolution and the World Wars. Technological development introduced several new concepts, such as (air)plane and (record) player, which were unheard of in t_1; detecting such change relies on signals that can indicate a completely new use of a word while potentially keeping its previous senses.

[Table 2: accuracy of each feature model per language in Subtask 1. Majority class (Maj. Class) is a baseline classifier that outputs the most common class for each language (classes 0 or 1). The decay column indicates the accuracy deviation from the best performance for each feature model across languages; a smaller decay means the method performs close to optimal in all languages.]

The results for the ranking task are shown in Table 3. Notice that the best feature configurations for classification are not necessarily the best for ranking. MAP performs best for Latin, which might be due to a potentially large semantic shift in this language that is better captured by incorporating neighborhood information. As seen in the decay column, COS and COS+MAP+FREQ (used in our submission) are the overall best performing methods across the two tasks.

[Table 3: ranking performance (Spearman's ρ) of each feature model per language. The decay column indicates the Spearman's ρ deviation from the best performing method in each language; decay is defined as in Table 2.]

When executing procrustes alignment, one must choose which and how many words to align on. Since alignment seeks to enforce short distances between landmark words, we hypothesize that this method may mask some of the semantic shift involving the landmark words. To test this, we analyze the effect of the number of landmark words on the model predictions by executing procrustes alignment using the top n most frequent landmark words, with n ∈ [300, N], where N is the size of the intersecting vocabulary, keeping the classifier threshold fixed. Figure 1 shows the results for all four languages. These results provide evidence for our argument: using more landmark words in the alignment procedure favors German and Swedish, which likely have less semantic shift compared to Latin and English. Notice that both of these languages present class imbalance leaning towards unchanged words, and show increased accuracy as the number of landmarks increases. On the other hand, the same is not true for English, which has more balanced classes, nor for Latin, which is unbalanced towards changed words. In both of these languages, the classification accuracy peaks at some n < N and then decreases, showing that using all possible words as landmarks may decrease the accuracy. A sketch of this landmark-restricted alignment follows.
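As an illustration of the landmark experiment, the following sketch aligns on only the top-n most frequent shared words and then applies the learned rotation to the whole space; it assumes gensim 4.x KeyedVectors (whose index_to_key list is ordered by descending frequency), and the helper name is ours.

```python
import numpy as np
from gensim.models import KeyedVectors

def align_on_landmarks(wv1: KeyedVectors, wv2: KeyedVectors, n: int):
    """Learn Q from the top-n most frequent shared words, apply it everywhere."""
    shared = [w for w in wv1.index_to_key if w in wv2.key_to_index]
    landmarks = shared[:n]                 # frequency-ordered in gensim 4.x
    A = np.stack([wv1[w] for w in landmarks])
    B = np.stack([wv2[w] for w in landmarks])
    U, _, Vt = np.linalg.svd(A.T @ B)      # closed-form orthogonal procrustes
    return wv1.vectors @ (U @ Vt)          # all of W1 mapped into W2's space
```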
Figure 1: (a) Accuracy in Subtask 1 using different numbers of landmark words for each language. Notice how German and Swedish do not show a decrease in accuracy despite the large number of landmarks used, whereas English and Latin reach optimal performance at some point before the maximum. (b) Ranking performance (Spearman's ρ) as a function of the number of landmarks shows a different trend from that of binary classification, with Swedish decreasing in performance as the number of landmarks grows.
Conclusion

We presented a model for unsupervised detection of semantic change based on anomaly detection over a selection of features. SChME works directly on the input corpora, not requiring language-specific pre-trained models. The model ensemble is agnostic to the feature models, which means any measure of change could easily be incorporated into it, if desired. Our results show that the model parameters must be chosen carefully for each task and language. In particular, we have shown that the choice of landmarks for alignment is closely related to the degree of change of a language. In future work, we plan to address this issue by developing principled ways of choosing the words to align so that semantic change is revealed more accurately.
Acknowledgments

This work was supported by the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons).

References
Reem Alatrash, Dominik Schlechtweg, Jonas Kuhn, and Sabine Schulte im Walde. 2020. CCOHA: Clean corpus of historical American English. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC'20). European Language Resources Association (ELRA).

Robert Bamler and Stephan Mandt. 2017. Dynamic word embeddings. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 380–389. JMLR.org.

Piotr Bojanowski, Onur Celebi, Tomas Mikolov, Edouard Grave, and Armand Joulin. 2019. Updating pre-trained word vectors and text classifiers using monolingual alignment. arXiv preprint arXiv:1910.06241.

Lars Borin, Markus Forsberg, and Johan Roxendal. 2012. Korp — the corpus infrastructure of Språkbanken. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 474–478, Istanbul, Turkey, May. European Language Resources Association (ELRA).

Paul Cook and Suzanne Stevenson. 2010. Automatically identifying changes in the semantic orientation of words. In LREC.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016a. Cultural shift or linguistic drift? Comparing two computational measures of semantic change. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2016, page 2116. NIH Public Access.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016b. Diachronic word embeddings reveal statistical laws of semantic change. arXiv preprint arXiv:1605.09096.

Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave. 2018. Loss in translation: Learning bilingual word mapping with a retrieval criterion. arXiv preprint arXiv:1804.07745.

Andrey Kutuzov, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. 2018. Diachronic word embeddings and semantic shifts: A survey. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1384–1397, Santa Fe, New Mexico, USA, August. Association for Computational Linguistics.

Barbara McGillivray and Adam Kilgarriff. 2013. Tools for historical corpus research, and a corpus of Latin. New Methods in Historical Corpus Linguistics, 1(3):247–257.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May. ELRA. http://is.muni.cz/publication/884893/en.

Maja Rudolph and David Blei. 2018. Dynamic embeddings for language evolution. In Proceedings of the 2018 World Wide Web Conference, pages 1003–1011.

Eyal Sagi, Stefan Kaufmann, and Brady Clark. 2009. Semantic density analysis: Comparing word meaning across time and phonetic space. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 104–111. Association for Computational Linguistics.

Dominik Schlechtweg and Sabine Schulte im Walde. 2020. Simulating lexical semantic change from sense-annotated data. arXiv preprint arXiv:2001.03216.

Dominik Schlechtweg, Anna Hätty, Marco Del Tredici, and Sabine Schulte im Walde. 2019. A wind of change: Detecting and evaluating lexical semantic change across times and domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 732–746, Florence, Italy, July. Association for Computational Linguistics.

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised lexical semantic change detection. In To appear in SemEval@COLING2020.

Peter H. Schönemann. 1966. A generalized solution of the orthogonal procrustes problem. Psychometrika, 31(1):1–10.

Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, and Hui Xiong. 2018. Dynamic word embeddings for evolving semantic discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM '18), pages 673–681.