Pitfalls of Static Language Modelling
Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Sebastian Ruder, Dani Yogatama, Kris Cao, Tomas Kocisky, Susannah Young, Phil Blunsom
DeepMind, London, UK
{angeliki,akuncoro,egribovskaya}@google.com

Abstract
Our world is open-ended, non-stationary and constantly evolving; thus what we talk about and how we talk about it changes over time. This inherent dynamic nature of language stands in stark contrast to the current static language modelling paradigm, which constructs training and evaluation sets from overlapping time periods. Despite recent progress, we demonstrate that state-of-the-art Transformer models perform worse in the realistic setup of predicting future utterances from beyond their training period, a consistent pattern across three datasets from two domains. We find that, while increasing model size alone (a key driver behind recent progress) does not provide a solution for the temporal generalization problem, having models that continually update their knowledge with new information can indeed slow down the degradation over time. Hence, given the compilation of ever-larger language modelling training datasets, combined with the growing list of language-model-based NLP applications that require up-to-date knowledge about the world, we argue that now is the right time to rethink our static language modelling evaluation protocol, and to develop adaptive language models that can remain up-to-date with respect to our ever-changing and non-stationary world.

To track progress and encourage the development of adaptive language models, we release our dynamic (streaming) language modelling benchmark for WMT and ARXIV at https://github.com/deepmind/deepmind-research/tree/master/pitfalls_static_language_models.

A language model defines a distribution over utterances. Whether these are sentences, documents, or whole conversations, we usually aim to learn a language model from a set of observations so that it assigns high probability to utterances observed in the future. In this decidedly simple definition lurks a crucial yet often overlooked detail: language modelling is a dynamic task in which experience of the past is used to predict the future. In contrast, the current practice in large-scale language modelling is to draw training and test data from large web crawls that overlap in time.

In this work, we demonstrate that such a static evaluation protocol provides an overly optimistic assessment of a language model's efficacy. Hence, throughout this paper, we argue for embracing the temporal dynamics at the heart of language modelling in order to maximize its real-world potential, as reflected by the growing list of NLP applications, many of which currently rely on language model pretraining (Devlin et al., 2019), that require up-to-date factual knowledge of our ever-changing world. Examples of such tasks include flagging the most recent batch of fake news (Thorne and Vlachos, 2018; Zellers et al., 2019; Augenstein et al., 2019, inter alia), and answering questions like "How many people have been infected by COVID-19 worldwide?", "Has there ever been a female Vice President of the USA?", and "Is Pluto a planet?", whose answers can vary depending on when the question was posed. Furthermore, the vast majority of practical NLP today happens within the context of commercial systems, such as machine translation and automatic speech recognition, that are deployed on future utterances whilst being trained on past ones.

Given the practical importance of building adaptive language models that update their knowledge in response to our non-stationary world, combined with the compilation of ever-larger language modelling benchmarks that do not necessarily assess how well our language models can generalize over time (Chelba et al., 2013; Radford et al., 2019; Brown et al., 2020; Gao et al., 2021), we argue that now is the right time to revisit the question of whether, and to what extent, our current state-of-the-art Transformer (Vaswani et al., 2017) language models are able to generalize well across time in a dynamic streaming setup (Jelinek et al., 1991; Wang et al., 2008; Yogatama et al., 2014; Osborne et al., 2014, inter alia), which can help measure progress and spur further advances in this direction. More concretely, and as a first step, we stress-test current state-of-the-art static language models by evaluating their temporal generalization ability on two English domains, news and scientific articles: two sources of data with a rapidly changing distribution, where new content with reliable time information is generated in a continuous fashion (§2). To assess how well these models can generalize across time, we design a time-stratification evaluation protocol where Transformer-XL models (Dai et al., 2019) are trained on the past, but are asked to predict future articles that are published after the end of their training period. We find that, despite remarkable recent progress on language modelling benchmarks that draw training and evaluation sets from overlapping time periods, the same models perform worse in the more realistic use case where models trained on the past are evaluated on their ability to generalize well to future data (§3.1).
We further observe an alarming trend where the model performs increasingly badly when asked to make predictions about test documents that are further away from the training period, demonstrating that model performance degrades more substantially with time. Given these findings, we conduct a comprehensive analysis of what kinds of predictions the model is struggling with, and observe that model performance degrades most substantially for open-class words, including nouns and verbs (§3.2), rapidly changing topics such as sports (§3.3), and emerging concepts that occur frequently in the test period but only rarely (or sometimes never) occurred in the training period, such as "Novichok" and "5G" (§3.4).

Since temporal generalization poses a challenge for current large-scale language models, what, then, is the remedy? One option is to periodically retrain the model from scratch on the new and old data, although this approach is expensive in terms of both computational resources and carbon emissions (Strubell et al., 2019), and further runs the risk of the model becoming outdated in between long retraining cycles. More recently, increasing the size (i.e., the number of parameters) of language models has been shown to improve perplexity and downstream task performance (Kaplan et al., 2020), and constitutes a key driving factor behind recent language modelling progress (Brown et al., 2020). Hence, we ask: can increasing model size also improve temporal generalization? We find that increasing model size alone is not a solution for the temporal generalization problem (§4), which highlights the need for approaches that more directly tackle some of the problems introduced by an ever-changing world, and that can rapidly adapt to and integrate new information as it becomes available. We then explore a simple way of keeping our models up-to-date and mitigating this temporal degradation by performing dynamic evaluation (Mikolov et al., 2010; Krause et al., 2018, 2019; §5) incrementally on streams of new data. We find that this approach can mitigate, but not completely eliminate, the temporal degradation problem, leaving a large potential room for improvement, for instance through better continual and lifelong learning approaches (§6).

Altogether, our findings: (i) empirically highlight the limitations of current language models with respect to temporal generalization; (ii) demonstrate the need to rethink our static language modelling evaluation paradigm, which trains and evaluates models on data from the same, overlapping time periods; (iii) provide a benchmark to systematically measure progress and encourage more research on temporal generalization and adaptive language modelling; and (iv) highlight the fact that succeeding in this setup necessitates approaches that go above and beyond scaling models in terms of parameters or amounts of training data, thus paving the way for better and more efficient continual learning approaches.

In a non-stationary and rapidly changing world like ours, any model begins to become outdated immediately after training concludes. In this section, we propose an experimental setup that enables us to assess whether, and to what extent, the performance of state-of-the-art Transformer language models degrades if they are asked to generalize to future data based on the past.
Indeed, the question of how we can better measure generalization in large-scale language models is a timely one, given the current trend of training on ever-larger collections of web-crawled data, in which "test data contamination" (Brown et al., 2020) can present an impediment to fair and reliable evaluations.
Concretely, we identify news and scientific articles as two sources of dynamic streaming data with a naturally changing distribution over time, which lend themselves well to evaluating how language models generalize over time. For the scientific domain, we use the publicly available arXiv abstracts (ARXIV; https://arxiv.org/help/oa/index), limiting the dataset to papers with a single version only. For the news domain, we use the publicly available WMT News Crawl corpus (WMT; http://data.statmt.org/news-crawl/README). We ensure that any trends we observe also generalize to models trained on larger datasets, which improve language modelling and representation learning performance (Liu et al., 2019), by compiling a larger news corpus that we term CUSTOMNEWS. This dataset consists of crawled English news sources from the web covering the 1969-2020 period and includes a variety of topics, e.g., politics, financial news, sport, and lifestyle. In Appendix A we present a simple frequency analysis over time conducted on WMT and ARXIV. We conduct all experiments in English, although we show in Appendix D that our findings generalize to another language, German. We apply minimal preprocessing through: (i) removal of non-English documents, (ii) deduplication using a custom implementation of the MinHash algorithm (a sketch follows below), and (iii) tokenization using Moses (https://github.com/alvations/sacremoses). Table 1 summarizes key statistics of our datasets.
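The deduplication step above is described only at a high level; the sketch below illustrates one way MinHash-based near-duplicate detection can work. It is not the authors' implementation: the shingle size, number of hash functions, similarity threshold, and function names are all illustrative assumptions.

```python
import hashlib
from itertools import combinations

def shingles(text, n=5):
    """Set of word n-grams (shingles) representing a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(doc_shingles, num_hashes=128):
    """For each seeded hash function, keep the minimum hash value over all shingles."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in doc_shingles
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def near_duplicate_pairs(docs, threshold=0.8):
    """Return index pairs of documents whose estimated similarity exceeds the threshold."""
    sigs = [minhash_signature(shingles(d)) for d in docs]
    return [(i, j) for i, j in combinations(range(len(docs)), 2)
            if estimated_jaccard(sigs[i], sigs[j]) >= threshold]
```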
We now describe our training and evaluation protocols in more detail.

Evaluation period and test set. For each dataset, we pick the last two years (i.e., 2018 and 2019) as our evaluation period, and sub-sample a test set of 24k test documents (1k per test month). In Appendix B we demonstrate that our findings hold for other test year periods beyond 2018/2019.

TIME-STRATIFIED setup. In order to assess temporal generalization, we design a time-stratification evaluation protocol, where we construct training and evaluation splits from large corpora by taking into account the timestamp of each document, such that models trained on the past (i.e., the training set) are evaluated on their ability to predict future articles that are published after the time period of their training data (i.e., the test set). More concretely, we use all documents from the beginning of each dataset's time period up until September 2017 as training data, and use the last three months of 2017 as our validation period, where we sub-sample a total of 9k validation documents for WMT and CUSTOMNEWS, and 15.6k for ARXIV; we refer to this as the TIME-STRATIFIED setup. We then evaluate the model on the 2018-2019 test set described above, which in practice evaluates the model's ability to generalize across time by predicting articles up to two years after the end of its training period: a realistic time frame for which we expect large-scale language models to be used without retraining on more recent corpora. We argue that such a time-stratification procedure is a natural way to evaluate models on out-of-sample distributions and truly unseen data, above and beyond distribution shifts in terms of topic or domain (Daumé III, 2007; Gururangan et al., 2020; §6).
CONTROL setup. We assess whether time stratification (i.e., generalizing to the future based on the past) poses a challenge for current language models by comparing it with a CONTROL setup, where we construct the training and evaluation sets from overlapping time periods. This setup is similar to the prevailing (static) language modelling evaluation protocol; more concretely, the training set in the CONTROL setup includes documents that come from the same 2018-2019 time period as the evaluation set. Hence, the CONTROL setup does not present the same temporal generalization challenges as its TIME-STRATIFIED counterpart.

Crucially, we control such that the two setups differ only in the time periods of their training data rather than the absolute training data size: both the TIME-STRATIFIED and CONTROL training data are of the exact same size. Concretely, we construct the CONTROL training data by taking the most recent documents starting from the end of the evaluation period (excluding the test documents and including the same number of documents per test month), and keep including documents from previous time periods until we reach the TIME-STRATIFIED setup's training data size. In Table 1, we report the proportion of documents in the CONTROL setups' training data that come from the same 2018-2019 time period as the evaluation set, which constitutes the only effective difference between the two setups: for the TIME-STRATIFIED setup, this proportion is exactly zero. In Appendix C, we observe an even stronger degradation as we increase the gap between the training and test periods beyond the two-year gap considered here.

Dataset      Domain           Time period  Avg. doc length (words)  Size (GB)  CONTROL's training data from the test period
WMT          News             2007-2019    551                      22.65      6.3%
CUSTOMNEWS   News             1969-2020    491                      395.59     34.8%
ARXIV        Scientific text  1986-2020    172                      0.72       14.5%

Table 1: Statistics and time periods of the datasets used in this study.

For validation, we sub-sample a total of 9k documents for WMT and CUSTOMNEWS and 15.6k for ARXIV from the 2018-2019 evaluation period (again, excluding the 24k test documents). Note that we evaluate both the TIME-STRATIFIED and CONTROL models on the exact same test set from the 2018-2019 period, which facilitates a fair perplexity comparison between the two setups.
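To make the two setups concrete, the sketch below builds TIME-STRATIFIED and CONTROL training sets of identical size from a list of timestamped documents, mirroring the construction described above. It is an illustrative reimplementation, not the released benchmark code; the helper names, date handling, and per-month test sampling are assumptions.

```python
from collections import defaultdict
from datetime import date

def sample_test_set(docs, start, end, per_month=1000):
    """Pick up to `per_month` documents per calendar month within [start, end]."""
    by_month = defaultdict(list)
    for ts, text in docs:
        if start <= ts <= end:
            by_month[(ts.year, ts.month)].append((ts, text))
    return [d for month in sorted(by_month) for d in by_month[month][:per_month]]

def build_training_sets(docs, train_size,
                        train_cutoff=date(2017, 10, 1),
                        test_start=date(2018, 1, 1), test_end=date(2019, 12, 31)):
    """docs: list of (timestamp, text) pairs sorted by timestamp.

    TIME-STRATIFIED: the `train_size` most recent documents dated strictly before
    the training cutoff (here, the end of September 2017).
    CONTROL: the `train_size` most recent documents counted backwards from the end
    of the evaluation period, excluding the held-out test documents, so that both
    training sets contain exactly the same number of documents."""
    test_docs = sample_test_set(docs, test_start, test_end)
    held_out = set(map(id, test_docs))

    time_stratified = [d for d in docs if d[0] < train_cutoff][-train_size:]
    control = [d for d in docs if d[0] <= test_end and id(d) not in held_out][-train_size:]
    return time_stratified, control, test_docs
```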
Relative perplexity comparison. One of our goals is to measure temporal degradation, i.e., whether LM performance degrades more when predicting test documents further into the future. Unfortunately, absolute perplexity degradation over time (e.g., comparing the perplexity of January 2017 vs. December 2018) is not a reliable measure due to the inherent variability of different test months; e.g., some months have longer documents than others, resulting in higher perplexities. Hence, we report relative perplexity changes between the TIME-STRATIFIED and the CONTROL models.
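The relative measure used throughout the paper can be written down in a few lines; the sketch below is only meant to pin the definition down, with illustrative function names.

```python
import math

def perplexity(word_logprobs):
    """Perplexity from a list of per-word natural-log probabilities."""
    return math.exp(-sum(word_logprobs) / len(word_logprobs))

def relative_ppl_increase(ppl_time_stratified, ppl_control):
    """Relative perplexity increase (%) of the stale TIME-STRATIFIED model over the
    CONTROL model, computed separately for each test month so that month-specific
    difficulty cancels out."""
    return 100.0 * (ppl_time_stratified - ppl_control) / ppl_control
```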
We perform our experiments on autoregressive, left-to-right language modelling, and leave the extension to the representation learning case through masked language modelling (Devlin et al., 2019) to future work. More concretely, we conduct our experiments using the state-of-the-art Transformer-XL model (Dai et al., 2019). We use 18 layers and set the model size to 1,024, resulting in 287M parameters, roughly 15% smaller than the second smallest GPT-2 model (Radford et al., 2019) and BERT-Large (Devlin et al., 2019), although we experiment with larger models in §4. For both training and test sets, we set the sequence length to 1,024; we set the memory attention length to 384 during training and 1,600 during test. We use a vocabulary of 50,259 subwords, obtained through a SentencePiece tokenizer (Kudo and Richardson, 2018) trained on a random subset (up to 15GB) of the training data of each respective experiment, i.e., CONTROL and TIME-STRATIFIED. Whilst training and validation are done on subword tokens, all reported perplexities are computed over actual test word tokens as produced by the Moses tokenizer; we compute these per-word probabilities by summing the log-probabilities of the subword tokens that form each respective word.
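A minimal sketch of this word-level perplexity computation is shown below, assuming the per-subword log-probabilities and the number of subwords per Moses word token are already available; the interface is illustrative rather than the paper's evaluation code.

```python
import math

def word_level_perplexity(subword_logprobs, subwords_per_word):
    """Perplexity over word tokens from subword log-probabilities.

    subword_logprobs: natural-log probabilities the model assigned to each subword
        token of the evaluation text, in order.
    subwords_per_word: for each Moses word token, how many subword pieces the
        SentencePiece tokenizer split it into. The log-probability of a word is the
        sum of the log-probabilities of its subword pieces."""
    word_logprobs, i = [], 0
    for n in subwords_per_word:
        word_logprobs.append(sum(subword_logprobs[i:i + n]))
        i += n
    return math.exp(-sum(word_logprobs) / len(word_logprobs))
```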
In this section, we empirically validate our hypothesis that state-of-the-art Transformer language models perform worse in the realistic case where models trained on the past are tested on their ability to generalize to the future. We begin by comparing the perplexity of the TIME-STRATIFIED setup against the CONTROL setup, and demonstrate how the perplexity of the TIME-STRATIFIED model degrades more over time (§3.1). We then perform a fine-grained analysis that helps us better understand what kinds of predictions the model is struggling with (§3.2-3.4), and identify that increasing model size alone, a key driver behind recent language modelling success (Brown et al., 2020), does not solve the temporal degradation problem (§4). In Appendix E we assess the effect of an outdated tokenizer on language modelling perplexity; having an outdated tokenizer harms performance, but not being able to update the weights of the language model is a larger problem.

Setup             WMT     CUSTOMNEWS   ARXIV
CONTROL           21.11   -            -
TIME-STRATIFIED   22.45   -            -
∆, absolute       +1.34   +2.95        +1.79
∆, relative (%)   6.34    16.04        8.37

Table 2: Perplexity of Transformer-XL when trained in the two different setups and evaluated on the same test set from the 2018/2019 period.

Table 2 presents the results of our first experiment, with a clear pattern across all datasets. Although we train both models (i) on the exact same dataset sizes and (ii) with the same model architectures, a stale TIME-STRATIFIED model performs markedly worse than the CONTROL model, which has seen training data from the test period and thus does not have to make the same temporal generalization. We attribute the higher relative degradation on CUSTOMNEWS and ARXIV to their recent exponential growth in new documents, resulting in a higher proportion of documents from the test period being present in the training data (i.e., 34.8% for CUSTOMNEWS, 14.5% for ARXIV, and 6.3% for WMT). Compared to prior work on streaming (n-gram) language modelling (Yogatama et al., 2014; Osborne et al., 2014), the magnitude of the perplexity numbers is overall much lower in this work, although the exact perplexity values are not directly comparable. This finding suggests that current neural LMs are, to some extent, already able to mitigate some of the problems posed by temporal generalization; in fact, we later discuss how conditioning on long-range information in Transformer-XL is beneficial for this challenging task (§3.4).

Figure 1: Relative perplexity increase across test months of the TIME-STRATIFIED model over the CONTROL one. The former has not seen documents from the test period, but the latter has.

Nevertheless, analyzing the performance of the model across test months yields a troubling trend, where the stale models become increasingly outdated with time. Fig. 1 plots the relative perplexity increase of the TIME-STRATIFIED over the CONTROL model. As evidenced by the upward slope of all three lines, the model deteriorates more as we ask it to predict data further away from the end of the training period (i.e., September 2017).
We plot the relative perplexity increase of the TIME-STRATIFIED over the CONTROL model broken down by part-of-speech (POS) tags in Fig. 2; we also plot the corresponding perplexity change across time in Fig. 3.

Figure 2: Relative perplexity increase of the TIME-STRATIFIED over CONTROL models per POS tag, alongside the relative POS tag frequency, on WMT.

First, we see that common nouns, the most frequent POS tag in our test set, contribute among the largest perplexity increases and drive the overall trends observed in Figure 1. We also observe a large temporal performance degradation for other open-class categories, such as adjectives and verbs. In contrast, conjunctions and pronouns, closed word classes containing mostly function words, are unaffected by this degradation (Fig. 2).

Notably, the performance of the TIME-STRATIFIED model degrades most rapidly when making temporal generalizations about proper nouns. As proper nouns closely relate to events and facts about the world, this finding indicates that the model requires up-to-date knowledge about the world to generalize well on these types of predictions. Qualitative analysis indicates that the model performs badly on named entities in politics whose position changed in some way during our 2018-2019 evaluation period. For instance, "Bolsonaro" was elected to and assumed the office of the Brazilian presidency in 2018 and 2019, respectively, which falls outside of our training period; "Pompeo" was appointed Secretary of State in 2018; while "Khashoggi" was subject to widespread discourse in 2018 following his assassination. Interestingly, we also found the model struggling with concepts associated with cultural and sociological changes on which public perception and discourse have evolved over time, such as "MeToo".

Figure 3: Relative perplexity increase of TIME-STRATIFIED over CONTROL models, broken down by POS tags, across test months on WMT.

Figure 4: Relative perplexity increase by topic for WMT, TIME-STRATIFIED over the CONTROL model. Documents are clustered using LDA, and perplexity is aggregated by topic. In the news domain, politics and sports are topics that change more rapidly than average due to a faster-changing context.
A priori, we expect the speed of perplexity deterioration to be causally linked to the speed of incoming new information. Here, we aim to understand how the perplexity deterioration is distributed across different topics in the corpora, as we expect different topics to shift more or less rapidly over time. We first cluster the documents using Latent Dirichlet Allocation (Blei et al., 2003; LDA), representing each document as a mixture of topics and each topic as a distribution over words, and then aggregate the perplexity of words by topic. We present the results for WMT in Fig. 4.

We see that politics and sports are topics that change more rapidly than the average. Moreover, we find that this is not just caused by new words entering these topics (e.g., named entities), but also by how and what we choose to talk about within these topics, i.e., the context around the existing named entities changes too. This problem is directly related to concept drift and out-of-distribution generalization, which we discuss in §6. For instance, we look at one of the sub-topics present in the articles covering politics in WMT, "Brexit", and analyze its changing context: in early 2018, far from the Brexit deadline, the words "remain" and "leave" both frequently occur in the articles, highlighting the fact that the media, and by extension the people, were still ruminating on the results of the original referendum. Throughout the second half of 2018 and into 2019, as the deadline loomed and attention shifted towards achieving a Brexit deal, these words ceased to appear as frequently, overtaken by "deal" and "Boris Johnson". While terms like "deal" and "Brexit" were not necessarily rare, the local word co-occurrences and context surrounding these words had changed in a meaningful way, and this change affects the performance of the TIME-STRATIFIED model, which lacks access to more recent information.
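The topic-level aggregation described above can be approximated with off-the-shelf tools. The sketch below uses scikit-learn's LDA implementation and, as a simplification, assigns each test document to its dominant topic before averaging per-word log-loss; the paper does not specify its exact implementation, so the library choice, parameter values, and names should all be read as illustrative.

```python
from collections import defaultdict

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def perplexity_by_topic(test_docs, doc_word_logprobs, n_topics=20):
    """test_docs: list of raw document strings.
    doc_word_logprobs: for each document, the per-word natural-log probabilities
        assigned by the language model.
    Returns {topic_id: perplexity}, with each document assigned to its dominant topic."""
    counts = CountVectorizer(max_features=50000, stop_words="english").fit_transform(test_docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
    dominant_topic = lda.transform(counts).argmax(axis=1)

    logprobs_per_topic = defaultdict(list)
    for topic, logprobs in zip(dominant_topic, doc_word_logprobs):
        logprobs_per_topic[topic].extend(logprobs)
    return {t: float(np.exp(-np.mean(lp))) for t, lp in logprobs_per_topic.items()}
```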
Our realistic time-stratification setup entails that there are new data every day, resulting in a natural and constant growth of the vocabulary. Indeed, analysis of our datasets shows that a sizeable fraction of word types in the test period are previously unseen in WMT and ARXIV, rising to 27% in CUSTOMNEWS. These novel words represent 0.1-0.3% of all tokens in WMT and CUSTOMNEWS, and 0.6-0.8% in ARXIV. Crucially, a substantial number of these remain in our discourse for a while, as 14-35% of these novel words reappear in at least four future months. In ARXIV, examples of these reappearing terms include terms related to new research directions (e.g., "pretrain"). In WMT and CUSTOMNEWS, these words often correspond to political terms (e.g., "Brexiteers") and social movements (e.g., "MeToo"), as well as common nouns. More examples of novel words can be found at https://twitter.com/NYT_First_Said, a resource that has been used for novel word research (Pinter et al., 2020). We denote these as EMERGING NEW WORDS.

While it is unrealistic to expect models to be able to handle all rare words, we argue that these concepts in particular are of great importance, as they reflect precisely the dynamically changing nature of our non-stationary world.
EMERGING NEW WORDS is“COVID-19”, which has zero unigram probabilitybefore the end of 2019, and yet constitutes an ex-tremely important use case of NLP systems today.We continue with assessing how well the
TIME - STRATIFIED model is able to predict to these words,by confining the perplexity analysis to these words.Concretely, we define
EMERGING NEWWORDS as those that occur frequently on thetest set (at least 50 times), but either: (i) were previously unseen on the training set, or (ii)occurred much less frequently on the training setthan on the test set, as indicated by an at least 5times lower unigram probability, giving rise to 287
EMERGING NEW WORDS and 87,636 mentions.Indeed, many of these words reflect strongtemporal dynamics: e.g. “Ardern” (occurring 30times more frequently on the test set, since JacindaArdern became the Prime Minister of New Zealandin late-2017) and “Novichok” (appearing 20,000times more frequently on our test set, since Sergeyand Yulia Skripal were poisoned by Novichoknerve agents in 2018). Table 3 shows that the
TIME - STRATIFIED model performs substantiallyworse for
EMERGING NEW WORDS , a ∼ Perplexity of first and subsequent occurrencesof
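The selection criterion just described is straightforward to state in code. The sketch below operates on plain token lists; the function name, arguments, and defaults simply restate the criterion above and are not taken from the paper's codebase.

```python
from collections import Counter

def emerging_new_words(train_tokens, test_tokens, min_test_count=50, ratio=5.0):
    """Words occurring at least `min_test_count` times in the test set that are either
    unseen in training, or whose training unigram probability is at least `ratio`
    times lower than their test unigram probability."""
    train_counts, test_counts = Counter(train_tokens), Counter(test_tokens)
    n_train, n_test = len(train_tokens), len(test_tokens)

    selected = set()
    for word, test_count in test_counts.items():
        if test_count < min_test_count:
            continue
        p_test = test_count / n_test
        p_train = train_counts[word] / n_train if n_train else 0.0
        if train_counts[word] == 0 or p_train * ratio <= p_test:
            selected.add(word)
    return selected
```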
Perplexity of first and subsequent occurrences of EMERGING NEW WORDS. It is a well-known fact that language is characterized by burstiness (Church and Gale, 1995; Church, 2000): a word is more likely to occur again in a document if it has already appeared in the same document. We thus argue that models with strong temporal generalization should be able to leverage this property of language and dynamically adapt their knowledge to predict subsequent word occurrences in a document better than the first occurrence, something of particular relevance for EMERGING NEW WORDS, some of which never even appeared in the training set (e.g., "Skripals", referring collectively to Sergey and Yulia Skripal). Table 4 shows the perplexity obtained by the TIME-STRATIFIED model under two conditions: for the first and second occurrences of EMERGING NEW WORDS in a document.

We find that, although the model has a high perplexity for generating EMERGING NEW WORDS for the first time in the document (ppl. of 695, compared to the overall ppl. of 22.45 in Table 2), it has a much lower perplexity for generating the same words the second time, but only if the first occurrence is available in the Transformer context (for evaluation, the total context length is 1,024 standard context tokens plus 1,600 cached Transformer-XL memory tokens, i.e., the 2,624 most recent BPE tokens; §2.3). In this case, the model can simply copy the same word from the context, which is consistent with prior findings on the strong copying ability of the attention block within Transformers (Bahdanau et al., 2015; Vinyals et al., 2015). This means that the ability of Transformer models to condition on long-range context is already a useful feature for facilitating temporal generalization, even when we are not explicitly updating the model parameters to better account for the new data (which we later do in §5). Nevertheless, the perplexity of the second occurrence is still remarkably high (more than 100 times worse than the overall perplexity) when the first occurrence falls outside the Transformer context, which highlights the need to scale Transformers to even longer sequence lengths to achieve better temporal generalization. This is challenging because Transformer computation scales quadratically with the input length, although recent work has made substantial progress in this direction (Child et al., 2019; Correia et al., 2019; Kitaev et al., 2020; Beltagy et al., 2020, inter alia).

Model                              First occurrence   Second occurrence        Second occurrence
                                                      (first in context)       (first NOT in context)
TIME-STRATIFIED                    694.95             75.95                    2,719.25
TIME-STRATIFIED w/ dynamic eval.   357.40             44.21                    1,430.34

Table 4: Test perplexities of EMERGING NEW WORDS in WMT, broken down by whether the word is encountered for the first time in the test document (high perplexity), or the second time (often better perplexity because the model has observed the word before in the test prefix). For the second occurrence, we report perplexity on cases where the first occurrence is in the Transformer-XL context (lower perplexity because the model can copy the exact same word from memory), and where the first occurrence is not available in the context (much higher perplexity because the previous occurrence and its context are not accessible in the Transformer-XL memory). Dynamic evaluation (last row; §5) improves performance across the board.

Scaling language models in numbers of parameters has led to improved perplexity, downstream task performance, and few-shot learning abilities (Kaplan et al., 2020). Hence, a natural question is: can increasing model size also help improve the temporal generalization ability of stale models? For this experiment, we train a bigger TIME-STRATIFIED model with 448M parameters, a 60% increase over the 287M model used thus far. We train this TIME-STRATIFIED 448M model on our bigger datasets, WMT and CUSTOMNEWS.

Figure 5: Relative perplexity increase of TIME-STRATIFIED models with 287M (dotted lines) and 448M parameters (solid lines) over the CONTROL model with 287M parameters, for WMT and CUSTOMNEWS.

To assess temporal degradation, we need to look at the performance of the models as more time passes from their training time (see Figure 5). As such, similar to the analysis in Section 3.1, where we reported the perplexity increase of the TIME-STRATIFIED 287M model over the CONTROL 287M one (here shown with dotted lines), we report the respective perplexity increase of the newly trained TIME-STRATIFIED 448M model over the same CONTROL 287M one (solid lines). If increased model size were able to delay temporal degradation, we would expect the solid lines produced by the bigger models to have reduced (i.e., flatter) slopes compared to the dotted lines produced by the smaller models.

While larger models, as expected, achieve overall lower perplexities, as indicated by the solid lines being consistently below the dotted ones, model size has no statistically significant effect on the slope of these lines (t-test). In hindsight, this is an expected result in light of our analyses in Section 3: regardless of its number of parameters, a stale model will not always be able to anticipate and forecast everything that happens in a changing world. Having models that perform well in the realistic setup of predicting future unseen data thus requires solutions that more directly tackle some of the specific challenges we have highlighted so far, and that can rapidly adapt to new incoming information about our non-stationary world.

As state-of-the-art language models perform worse in the realistic scenario of predicting the future based on the past, one way to keep our models up-to-date, and hence mitigate this temporal degradation, is to continually update their knowledge with new information as new documents arrive in our stream. The simplest way to achieve this is through dynamic evaluation (Mikolov et al., 2010; Graves, 2013; Krause et al., 2018, 2019), a form of online learning that continually updates the parameters of a pretrained model through gradient descent on the observed test set prefix. Dynamic evaluation has been shown to improve overall perplexity in the standard (non-temporal) language modelling setup, allowing the model to adapt to local topic shifts within a document. Here we aim to use dynamic evaluation to adapt the model to the temporal dynamics that occur within a stream of chronologically ordered documents, allowing the model to capture temporal dependencies across temporally related documents (e.g., news articles published within the same time period often describe similar events).
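The following is a minimal sketch of dynamic evaluation over a chronologically ordered document stream, written here with PyTorch as one concrete choice. The model interface (a callable returning the mean token-level cross-entropy), the use of plain SGD, and the learning rate are assumptions made for illustration; the paper's exact optimization details are not reproduced here.

```python
import math

import torch

def dynamic_evaluation(model, doc_stream, tokenize, lr=1e-4):
    """Score each incoming document with the current parameters, then take one
    gradient step on it before moving to the next document in the stream."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll, total_tokens = 0.0, 0

    for doc in doc_stream:                 # documents sorted by publication date
        token_ids = tokenize(doc)          # -> LongTensor of subword ids
        # Evaluate first, so the perplexity reflects predictions made before the
        # model has been updated on this document.
        with torch.no_grad():
            loss = model(token_ids)        # assumed: mean cross-entropy over tokens
        total_nll += loss.item() * token_ids.numel()
        total_tokens += token_ids.numel()

        # Then perform one online update on the observed document.
        optimizer.zero_grad()
        model(token_ids).backward()
        optimizer.step()

    return math.exp(total_nll / total_tokens)   # corpus-level perplexity
```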
Figure 6: Relative perplexity increase, with (solid lines) and without (dotted lines) dynamic evaluation.

We plot the results in Fig. 6: dotted lines reflect the perplexity increase when comparing the CONTROL model to the TIME-STRATIFIED model, i.e., the same curves as in Figure 1, whereas solid lines reflect the perplexity increase obtained when comparing the same CONTROL model with the TIME-STRATIFIED model augmented with dynamic evaluation (TIME-STRATIFIED-dyn). On all datasets, dynamic evaluation reduces the speed at which the model becomes outdated, as is evident from the reduced upward slopes, with a statistically significant effect for ARXIV and WMT (t-test). The improvements are particularly pronounced on ARXIV, where a more granular analysis over weeks (instead of months) reveals that the model needs only approximately one week's worth of data to overtake the CONTROL model. The fact that TIME-STRATIFIED-dyn outperforms the CONTROL model on ARXIV hints that the recency bias imposed by dynamic evaluation is particularly advantageous, whereas the CONTROL model does not necessarily see the training documents in order.

When aiming to keep models up-to-date, efficiency considerations are paramount: lightweight yet effective approaches are preferable because they allow the model to rapidly digest new information with minimal time and computation costs. Since updating the whole model is expensive, we experiment with updating only a smaller subset of it. As our findings identify lexical semantic shifts as one problem, we design a setup where we only update the embedding layer (i.e., 52M parameters). Moreover, following recent work (Ben-Zaken et al., 2021), we also experiment with updating only the bias terms at all layers (i.e., only a small fraction of all parameters).
Type of parameters updated   WMT     CUSTOMNEWS   ARXIV
all parameters               22.17   -            -
only bias                    -       -            -

Table 5: Perplexity on the three datasets of the TIME-STRATIFIED model when updating different subsets of parameters with dynamic evaluation.
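For the parameter-subset variants above, the only change to the dynamic evaluation sketch is which parameters the optimizer is allowed to touch. The sketch below shows this with PyTorch parameter filtering; the name matching (e.g., "embedding", ".bias") assumes conventional module names and would need to be adapted to the actual model.

```python
import torch

def select_updatable_parameters(model, mode="all"):
    """Freeze everything except the chosen subset before dynamic evaluation.

    mode: "all" updates every parameter; "embeddings" updates only embedding
    weights; "bias" updates only bias terms at all layers, in the spirit of
    BitFit (Ben-Zaken et al., 2021)."""
    trainable = []
    for name, param in model.named_parameters():
        if mode == "all":
            keep = True
        elif mode == "embeddings":
            keep = "embedding" in name
        elif mode == "bias":
            keep = name.endswith(".bias")
        else:
            raise ValueError(f"unknown mode: {mode}")
        param.requires_grad = keep
        if keep:
            trainable.append(param)
    return trainable

# Illustrative usage with the dynamic evaluation sketch above:
#   params = select_updatable_parameters(model, mode="bias")
#   optimizer = torch.optim.SGD(params, lr=1e-4)
```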
Dynamic evaluation helps more for EMERGING NEW WORDS. We now repeat the perplexity analysis on the EMERGING NEW WORDS (§3.4) that undergo the largest temporal shifts between the training and test periods. Since dynamic evaluation allows the model to update its parameters based on the test prefix, we hypothesize that it is particularly helpful for predicting EMERGING NEW WORDS. The findings in Table 3 indeed affirm our hypothesis: the perplexity improvements are much more substantial for EMERGING NEW WORDS (a 39.62% perplexity reduction, from 109.73 to 66.26) than the overall perplexity reduction (a 1.25% reduction, from 22.45 to 22.17, for WMT; Table 5).

We also repeat our analysis of how the perplexity of EMERGING NEW WORDS evolves between the first and subsequent occurrences of these words. As shown in the bottom row of Table 4, dynamic evaluation reduces the perplexity of both the first and second occurrences. More concretely, we observe the most substantial relative perplexity improvement when predicting the first occurrences of rare words in a document, which we attribute to the fact that dynamic evaluation can store and reuse relevant information about EMERGING NEW WORDS from previous documents, hence affirming our hypothesis that dynamic evaluation helps the model capture cross-document temporal structure within a stream of chronologically ordered documents. Moreover, these improvements are also substantial for predicting the second occurrences when the first occurrence is not in the Transformer memory (bottom right of Table 4), corresponding to a 47.40% perplexity reduction. This is because dynamic evaluation enables the model to store and update its representation of EMERGING NEW WORDS directly in the model parameters, hence reducing the reliance on the Transformer context length. Nevertheless, the absolute perplexity of such second occurrences is still high (more than 1,000), demonstrating a large room for improvement.
Limitations of dynamic evaluation. Dynamic evaluation alone does not completely solve the temporal degradation problem, as evidenced by the remaining (albeit gentler) upward slopes on WMT and CUSTOMNEWS (Fig. 6). Dynamic evaluation relies on performing gradient descent on the new data, which is prone to catastrophic forgetting (McCloskey and Cohen, 1989; Kirkpatrick et al., 2017), where the model discards important information from the past. Beyond adapting to local temporal shifts, we want our models to achieve positive forward transfer (Lopez-Paz and Ranzato, 2017): the longer we train the model, the better it should become at adapting to new data, which dynamic evaluation alone does not directly optimize for. Hence, the question of better understanding the limitations of dynamic evaluation, and whether more sophisticated continual and lifelong learning approaches can achieve even better results, is an exciting avenue for future work.
Concept drift. The problem of detecting changes in data streams, also known as concept drift, has a long history (Kifer et al., 2004; Baena-García et al., 2006; Dries and Rückert, 2009). In NLP, much of the recent work in this area models lexical change by training word embeddings (Hamilton et al., 2016; Szymanski, 2017; Yin et al., 2018) and deep neural networks (Rosenfeld and Erk, 2018; Bjerva et al., 2019) on data from different time spans.

Out-of-distribution (OoD) generalization. Achieving OoD generalization, primarily to domain shifts, has a long history in NLP (Blitzer et al., 2006; Daumé III, 2007; Axelrod et al., 2011), and has recently been addressed in the context of neural LMs and transfer learning (Fried et al., 2019; Oren et al., 2019; Hendrycks et al., 2020; Gururangan et al., 2020). To this end, prior work has shown that pretraining language models on large datasets leads to substantial improvements and increased robustness compared to non-pretrained models (Hendrycks et al., 2020). Most prior work on OoD generalization, however, puts a greater emphasis on distributional shifts in terms of topic and domain. In contrast, this work puts an emphasis on distributional shifts in terms of time, where models trained on the past must generalize to the future, which is a realistic yet challenging use case of real-world NLP systems. A recent exception comes from Søgaard et al. (2020), who also consider temporal shifts in NLP.

Continual learning and streaming LMs. Our work is closely related to continual and lifelong learning, which aim to design models that continually accumulate new knowledge without forgetting relevant information about the past (McCloskey and Cohen, 1989; Thrun and Mitchell, 1995; French, 1999; Mitchell et al., 2015; Rusu et al., 2016; Kirkpatrick et al., 2017; Al-Shedivat et al., 2018; Hadsell et al., 2020). The distribution of words and contexts in natural language, just like the world around us, changes rapidly with time, and hence constitutes an important test bed for developing and evaluating continual learning systems. More specific to the language modelling literature, prior work has proposed ways of designing language models that can efficiently adapt their knowledge to continuous streams of new information (Jelinek et al., 1991; Wang et al., 2008; Goyal et al., 2009; Osborne et al., 2014; Yogatama et al., 2014, inter alia), often known as streaming language models. Despite substantial recent progress in language modelling, we show that state-of-the-art Transformer language models similarly suffer from the temporal degradation problem. The use of large-scale neural models also introduces an additional complication: whereas the n-gram models of prior work can be kept up-to-date by updating the counts of each n-gram on the new data, how we can adapt neural LMs without retraining the whole model from scratch remains an open research question, with notable progress in other NLP tasks (d'Autume et al., 2019; Sun et al., 2020).

We systematically evaluated the extent to which our current language models can generalize well in the realistic setup of predicting the future based on the past. Despite substantial recent progress in language modelling, we found that this setup poses a challenge even for state-of-the-art Transformer-XL models, and that model performance degrades more substantially with time. We conducted a thorough analysis to better understand the failure modes of the model with respect to temporal generalization, and found that increasing model size alone, a key driver behind recent language modelling progress, fails to provide a solution for this task.

We conclude by outlining three broader implications of our findings. First, these findings show that the prevailing language modelling evaluation paradigm, which draws training and evaluation sets from overlapping time periods, provides an overly optimistic assessment of model generalization. Second, as new and ever-larger datasets are presently compiled using web crawls (Radford et al., 2019; Gao et al., 2021), it has never been more timely to rethink how our splits are constructed (Søgaard et al., 2020). We argue that time stratification is a realistic evaluation setup that enables us to evaluate models on genuinely unseen data, which constitutes a fairer assessment of models' out-of-distribution generalization. Lastly, a more realistic dynamic language modelling benchmark, such as the one we propose here, can be used to measure progress and ultimately encourage the development of models that can handle non-stationary text data and remain up-to-date with respect to the world: an exciting research domain that we believe can spur further advances in continual and lifelong learning.

Future work. Here we have primarily assessed the effect of outdated language models using an intrinsic perplexity metric, which directly relates to the optimized loss and hence constitutes a natural way to assess language modelling performance. However, a multi-task benchmark is needed to obtain a more holistic picture and better track our progress on downstream tasks, the vast majority of which currently rely on a language model pretraining backbone (Devlin et al., 2019). To this end, one important open question is how we can create and maintain benchmarks that are not static (e.g., see recent attempts that utilize human feedback for adversarial creation of benchmarks; Nie et al., 2019; Potts et al., 2020) and fixed at a particular point in time, but rather are updated online and reflect the non-stationarity of the real world. Finally, beyond better evaluations, our findings call for the development of adaptive language models. Due to the computational costs of training large models, brute-force solutions like retraining from scratch are unrealistic in practice and (among other issues) run the risk of the models becoming outdated in between long retraining cycles. Generalizing to the future necessitates the ability to quickly adapt to new data without forgetting the important past: a delicate balance for which continual and lifelong learning techniques can offer promising solutions. Another promising direction is to disentangle the acquisition of up-to-date knowledge (for instance by retrieving external information) from the language learning itself, as recently proposed by Guu et al. (2020) and Yogatama et al. (2021), among others. All in all, above and beyond impressive scaling efforts towards ever-larger language models (Brown et al., 2020; Fedus et al., 2021), we strongly argue for the necessity of adaptive language models that can remain up-to-date with respect to our open and non-stationary world.
Acknowledgements
We thank Paul Michel, Laura Rimell, and Chris Dyer for useful feedback throughout the different stages of this project. We would also like to thank Katie Millican, Sebastian Borgeaud, Trevor Cai, Roman Ring, Jack Rae, and Geoffrey Irving for their initial work on the codebase.
References
Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. 2018. Continuous adaptation via meta-learning in nonstationary and competitive environments. In Proc. of ICLR.
Isabelle Augenstein, Christina Lioma, Dongsheng Wang, Lucas Chaves Lima, Casper Hansen, Christian Hansen, and Jakob Grue Simonsen. 2019. MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims. In Proc. of EMNLP-IJCNLP.
Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proc. of EMNLP.
Manuel Baena-García, José del Campo-Ávila, Raúl Fidalgo, Albert Bifet, R. Gavaldà, and R. Morales-Bueno. 2006. Early drift detection method. In Fourth International Workshop on Knowledge Discovery from Data Streams, volume 6, pages 77-86.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. CoRR, abs/2004.05150.
Elad Ben-Zaken, Shauli Ravfogel, and Yoav Goldberg. 2021. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models.
Johannes Bjerva, Wouter M. Kouw, and Isabelle Augenstein. 2019. Back to the future: Sequential alignment of text representations. Association for the Advancement of Artificial Intelligence.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3.
John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proc. of EMNLP.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. CoRR, abs/2005.14165.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. CoRR, abs/1904.10509.
Kenneth Church. 2000. Empirical estimates of adaptation: The chance of two Noriegas is closer to p/2 than p^2. In COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics.
Kenneth Ward Church and William A. Gale. 1995. Poisson mixtures. Natural Language Engineering, pages 163-190.
Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively sparse transformers. In Proc. of EMNLP-IJCNLP.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proc. of ACL.
Hal Daumé III. 2007. Frustratingly easy domain adaptation. In Proc. of ACL.
Cyprien de Masson d'Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. 2019. Episodic memory in lifelong language learning. In Proc. of NeurIPS.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186.
Anton Dries and Ulrich Rückert. 2009. Adaptive concept drift detection. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2(5-6):311-327.
William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961.
Robert M. French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4).
Daniel Fried, Nikita Kitaev, and Dan Klein. 2019. Cross-domain generalization of neural constituency parsers. In Proc. of ACL.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2021. The Pile: An 800GB dataset of diverse text for language modeling.
Amit Goyal, Hal Daumé III, and Suresh Venkatasubramanian. 2009. Streaming for large scale NLP: Language modeling. In Proc. of NAACL HLT.
Alex Graves. 2013. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850.
Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proc. of ACL.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909.
Raia Hadsell, Dushyant Rao, Andrei A. Rusu, and Razvan Pascanu. 2020. Embracing change: Continual learning in deep neural networks. Trends in Cognitive Sciences, 24(12).
William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1489-1501.
Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. 2020. Pretrained transformers improve out-of-distribution robustness. In Proc. of the 58th Annual Meeting of the Association for Computational Linguistics.
F. Jelinek, B. Merialdo, S. Roukos, and M. Strauss. 1991. A dynamic language model for speech recognition. In Speech and Natural Language: Proceedings of a Workshop Held at Pacific Grove, California, February 19-22, 1991.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models.
Daniel Kifer, Shai Ben-David, and Johannes Gehrke. 2004. Detecting change in data streams. In VLDB, volume 4, pages 180-191. Toronto, Canada.
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13).
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In Proc. of ICLR.
Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2018. Dynamic evaluation of neural sequence models. In Proc. of ICML.
Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2019. Dynamic evaluation of transformer language models. arXiv preprint arXiv:1904.08378.
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66-71.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
David Lopez-Paz and Marc'Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. In Proc. of NeurIPS.
Michael McCloskey and Neil J. Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation, 24.
Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proc. of ACL, pages 4975-4989.
Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proc. of INTERSPEECH.
T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling. 2015. Never-ending learning. In Proc. of AAAI.
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2019. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.
Yonatan Oren, Shiori Sagawa, Tatsunori Hashimoto, and Percy Liang. 2019. Distributionally robust language modeling. In Proc. of EMNLP-IJCNLP.
Miles Osborne, Ashwin Lall, and Benjamin Van Durme. 2014. Exponential reservoir sampling for streaming language models. In Proc. of ACL.
Yuval Pinter, Cassandra L. Jacobs, and Max Bittker. 2020. NYTWIT: A dataset of novel words in the New York Times. CoRR, abs/2003.03444.
Christopher Potts, Zhengxuan Wu, Atticus Geiger, and Douwe Kiela. 2020. DynaSent: A dynamic benchmark for sentiment analysis. arXiv preprint arXiv:2012.15349.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.
Alex Rosenfeld and Katrin Erk. 2018. Deep neural models of semantic shift. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 474-484.
Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive neural networks. CoRR, abs/1606.04671.
Anders Søgaard, Sebastian Ebert, Joost Bastings, and Katja Filippova. 2020. We need to talk about random splits. arXiv preprint arXiv:2005.00636.
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proc. of ACL.
Fan-Keng Sun, Cheng-Hao Ho, and Hung-Yi Lee. 2020. LAMOL: LAnguage MOdeling for Lifelong Language Learning. In Proceedings of ICLR 2020.
Terrence Szymanski. 2017. Temporal word analogies: Identifying lexical replacement with diachronic word embeddings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 448-453.
James Thorne and Andreas Vlachos. 2018. Automated fact checking: Task formulations, methods and future directions. In Proc. of ICCL.
Sebastian Thrun and Tom M. Mitchell. 1995. Lifelong robot learning. Robotics and Autonomous Systems, 15(1).
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. of NeurIPS.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Proc. of NeurIPS.
Chong Wang, David Blei, and David Heckerman. 2008. Continuous time dynamic topic models. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 579-586.
Zi Yin, Vin Sachidananda, and Balaji Prabhakar. 2018. The global anchor method for quantifying linguistic shifts and domain adaptation. Advances in Neural Information Processing Systems, 31:9412-9423.
Dani Yogatama, Cyprien de Masson d'Autume, and Lingpeng Kong. 2021. Adaptive semiparametric language models. Transactions of the Association for Computational Linguistics.
Dani Yogatama, Chong Wang, Bryan R. Routledge, Noah A. Smith, and Eric P. Xing. 2014. Dynamic language models for streaming text. TACL.
Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. In Proc. of NeurIPS.

A Frequency analysis on our datasets
A simple frequency analysis reveals a number of interesting phenomena regarding word usage over time, which point to interesting challenges that a model might need to overcome in order to generalize temporally. "Trump" was mostly mentioned in the context of his business and media career prior to 2015, after which the relative frequency of the term saw a several-fold increase in the WMT dataset and a context shift towards politics, while the term "Obama" has seen a relative decrease in frequency since then. Moreover, there are many terms that show specific seasonal patterns: the term "Christmas" sees a pronounced increase in frequency every year in December, while "Olympics" sees an increase every two years. Yet other terms, such as "occupy" (from the Occupy Wall Street movement), peak once or twice before returning to a lower frequency base. Datasets can also undergo a shift in the proportions of individual topics, as shown for example by the growing relative frequency of the word "language" in ARXIV coupled with a relative decrease in the frequency of the word "radiation". Some of these frequency patterns are illustrated in Figure 7.
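Analyses like the ones above reduce to counting a term's relative frequency per month. A minimal sketch is shown below; the document representation (timestamp, text pairs) and whitespace tokenization are simplifying assumptions rather than the paper's analysis code.

```python
from collections import defaultdict

def monthly_relative_frequency(docs, term):
    """docs: iterable of (timestamp, text) pairs.
    Returns {(year, month): frequency}, where frequency is the term's count in that
    month divided by the month's total token count."""
    counts, totals = defaultdict(int), defaultdict(int)
    term = term.lower()
    for ts, text in docs:
        tokens = text.lower().split()
        month = (ts.year, ts.month)
        totals[month] += len(tokens)
        counts[month] += tokens.count(term)
    return {m: counts[m] / totals[m] for m in sorted(totals) if totals[m]}
```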
B The effect of outdated models persists beyond the 2018/2019 test period.

We test whether the temporal degradation trends we observe in §3 are an artifact of some particularity of the chosen test period (i.e., the test years Yr1 = 2018 and Yr2 = 2019). We design new test sets by shifting Yr1 and Yr2 in increments of one year towards the past, for a total of five such test sets. Following §2.2, we derive corresponding TIME-STRATIFIED(Yr1,Yr2) and CONTROL(Yr1,Yr2) training and validation splits. Note that each TIME-STRATIFIED(Yr1,Yr2) and CONTROL(Yr1,Yr2) setup is: (i) trained on the same amount of training data, and (ii) evaluated on the same test set covering Yr1 and Yr2. Fig. 8 shows similar temporal degradation across all test years.

C The effect of outdated models persists beyond the two-year gap.

In §3, we observe a performance degradation when the TIME-STRATIFIED model is out of date by up to two years, but does the degradation persist beyond the two-year gap? To answer this question, we train models with training data from different time periods with increasingly larger gaps from the 2018-2019 evaluation period of §2.2; all models are evaluated on the same test set introduced in §2.2. The most up-to-date model covers the same time period as the original TIME-STRATIFIED model, and we "push" the training period back in 6-month increments, up to September 2012, for a total of 11 training sets of the same size, used to train 11 models. Fig. 9 shows that the perplexity deterioration continues to grow in response to larger gaps between the training and test periods.
D The effect of outdated models persists beyond English: A German study.

We test whether the temporal degradation is a generalizable pattern that holds across languages. We use the German subset of WMT, apply the same preprocessing steps as in §2.1, follow the same experimental setup as in §2.2, and train two Transformer-XL models on TIME-STRATIFIED-de and CONTROL-de, achieving test set perplexities of 30.87 and 26.79, respectively. These perplexities are indeed higher than the ones in Table 2, a pattern consistent with prior findings on the difficulty of modelling German (Mielke et al., 2019). Nevertheless, we still see the exact same pattern where the stale TIME-STRATIFIED-de model performs worse than the CONTROL-de one (a substantial 15.23% relative increase). Moreover, similar to the English experiments, the model degrades more as the gap between the training and test period widens, an effect particularly pronounced for proper nouns and for words that are broken down by the TIME-STRATIFIED-de tokenizer into more tokens.

E Outdated tokenizer does not contribute substantially to performance deterioration
In our current experimental setup, the CONTROL and TIME-STRATIFIED models differ in two aspects: the time period used to train the weights of the language models, and the time period used to train their tokenizers. A subtle observation is that, since the tokenizer in the CONTROL setup is aware of the evaluation period, it is also aware of its most frequent vocabulary items, whereas the tokenizer in the TIME-STRATIFIED setup is not. Considering only words that have been tokenized differently by the two tokenizers (cf. Table 6), we see that the TIME-STRATIFIED model assigns lower probability to these words than the CONTROL model (a perplexity of 650 vs. 100, respectively). To assess the effect of an outdated tokenizer on language modelling perplexity, we ran an experiment training a Transformer-XL on the TIME-STRATIFIED training data using the CONTROL setup's tokenizer, resulting in a perplexity of 22.02, compared to 22.45 for the TIME-STRATIFIED model and 21.11 for the CONTROL model. From this we conclude that, while an outdated tokenizer harms performance, not being able to update the weights of the language model is the larger problem.

Figure 7: A sample of the types of term-frequency-over-time patterns seen in the WMT and ARXIV datasets.

Figure 8: Relative increase of perplexity of TIME-STRATIFIED(Yr1,Yr2) over the CONTROL(Yr1,Yr2) model.

Figure 9: Perplexity of models trained with data covering time periods with an increasingly large gap from the test set period (2018, 2019).

Figure 10: Relative increase of perplexity of TIME-STRATIFIED-de over CONTROL-de.

                            CONTROL           TIME-STRATIFIED
Language model perplexity   100               650
Examples                    Brexite+er        Bre+x+ite+er
                            Skywalker         Sky+walker
                            MeToo             Me+T+oo
                            cryptocur+rency   crypt+oc+ur+rency
                            reciprocal        reciproc+al
                            impeach           impe+ach
                            un+biased         un+bi+ased

Table 6: Examples of words that the TIME-STRATIFIED and CONTROL tokenizers have tokenized differently, as well as the perplexity over all words where the two tokenizers differ. The TIME-STRATIFIED tokenizer attributes a fragmented tokenization to certain words not seen frequently in the TIME-STRATIFIED training set, e.g., "Me+T+oo" or "impe+ach".