Adaptive Semiparametric Language Models
Dani Yogatama, Cyprien de Masson d’Autume, Lingpeng Kong
DeepMind, London, United Kingdom
{dyogatama,cyprien,lingpenk}@google.com
Abstract
We present a language model that combines a large parametric neural network (i.e., a transformer) with a non-parametric episodic memory component in an integrated architecture. Our model uses extended short-term context by caching local hidden states (similar to transformer-XL) and global long-term memory by retrieving a set of nearest neighbor tokens at each timestep. We design a gating function to adaptively combine multiple information sources to make a prediction. This mechanism allows the model to use either local context, short-term memory, or long-term memory (or any combination of them) on an ad hoc basis depending on the context. Experiments on word-based and character-based language modeling datasets demonstrate the efficacy of our proposed method compared to strong baselines.
Human language processing is facilitated by complex systems interacting together. A core component that enables such a process is human memory. Memory in humans consists of specialized systems, which form a basis for intelligent behaviors (Tulving, 1985; Rolls, 2000; Eichenbaum, 2012). For language processing, working (short-term) memory is a temporary storage that can be used to comprehend sentences and follow conversations. Episodic (long-term) memory stores individual experiences and events. Semantic memory stores facts and knowledge about words and concepts. (We refer readers to Nematzadeh et al. (2020) for discussions on human and artificial language processing memory systems.)

In artificial language processing systems (e.g., language models), a popular approach to designing a better model is to encode all of the desired knowledge (e.g., to produce grammatical sentences, process long text, remember events, etc.) in the weights of a large parametric neural network via end-to-end training. We have seen increasingly large transformers become better language models (Radford et al., 2018, 2019; Shoeybi et al., 2019; Brown et al., 2020). In this scaling approach, the knowledge is implicitly represented in the weights of a parametric neural network, and it is not straightforward to interpret whether a model contains particular knowledge without asking the model to produce a response, e.g., via a cloze-style question (Petroni et al., 2020) or a prompt (Brown et al., 2020).

An alternative strategy is to design a modular architecture that separates memory storage and computational processing, where each module has a clear purpose. Recent progress in memory-augmented neural networks has given rise to many variants of memory-augmented transformer language models that fall under this category. For example, attempts to incorporate extended local context into a neural network, such as those found in the neural cache (Grave et al., 2017c), transformer-XL (Dai et al., 2019), compressive transformer (Rae et al., 2020), performers (Choromanski et al., 2021), longformer (Beltagy et al., 2020), and reformer (Kitaev et al., 2020), can be seen as models of working memory. Models of episodic memory include kNN-LM (Khandelwal et al., 2020) and architectures that are designed for more complicated tasks such as question answering (de Masson d'Autume et al., 2019; Guu et al., 2020) and machine translation (Khandelwal et al., 2021). In machine learning and natural language processing, the term memory-augmented neural networks is used to refer to all of these types of memory systems.

In this paper, inspired by the modular design of human memory systems, we present a language model architecture
(SPALM) with storage modules that resemble working and episodic memory systems, which we combine with a large parametric neural network that is responsible for computation (§2). Our hypothesis is that encouraging each component to focus on a specific function (e.g., storing long-term information, capturing extended context, modeling local information) facilitates easier training that produces an overall better language model.

Specifically, we follow transformer-XL (Dai et al., 2019) to capture extended context by caching hidden states in a temporary short-term memory. For long-term context, we use a persistent key-value database and perform sparse retrieval with (approximate) k-nearest neighbors. In contrast to previous language models that either interpolate output probabilities (Merity et al., 2017; Grave et al., 2017c; Khandelwal et al., 2020; Kassner and Schütze, 2020) or use input concatenation (Guu et al., 2020; Xu et al., 2020) to combine information from different sources, we design a context-dependent gating mechanism to incorporate local, extended, and global context. We discuss similarities and differences to related work in §3.

In language modeling, many tokens can be predicted from their local context without requiring long-term information. Our model can adaptively decide whether the current (local) context is enough, or whether it needs to use information from the short-term and/or long-term memory.

In §4, we compare SPALM with strong baselines (including transformer-XL and kNN-LM) on word-based and character-based language modeling. Our positive results establish the benefit of the proposed architecture. They also indicate the generality of our approach and its potential applicability to other sequence modeling tasks.

We analyze how SPALM uses long- vs. short-term context (§5) to better understand how the model operates when making predictions. We conclude by discussing limitations and future directions (§6).
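As a concrete (and highly simplified) illustration of the long-term memory just described, the sketch below stores (context representation, next token) pairs and retrieves the nearest stored tokens for a query state. The class and method names are ours, not the paper's; in practice the store covers the whole training corpus and the search is approximate (e.g., with ScaNN), whereas this toy version keeps everything in memory and uses exact L2 search.

```python
import torch

class TokenDatastore:
    """Toy long-term memory of (context representation -> next token) pairs.

    Keys are hidden states computed over a corpus; values are the tokens that
    followed those contexts. This is an illustrative stand-in for the
    persistent key-value database described in the text, not the paper's code.
    """

    def __init__(self, dim: int):
        self.keys = torch.empty(0, dim)                  # [num_entries, dim]
        self.values = torch.empty(0, dtype=torch.long)   # [num_entries]

    def add(self, hidden_states: torch.Tensor, next_tokens: torch.Tensor) -> None:
        """Append a batch of (hidden state, next token) pairs to the store."""
        self.keys = torch.cat([self.keys, hidden_states], dim=0)
        self.values = torch.cat([self.values, next_tokens], dim=0)

    def query(self, h: torch.Tensor, k: int = 4):
        """Return the k nearest stored tokens and their distances for a query state h of shape [dim]."""
        dists = torch.cdist(h.unsqueeze(0), self.keys).squeeze(0)  # [num_entries]
        nearest = torch.topk(dists, k, largest=False)              # smallest distances first
        return self.values[nearest.indices], nearest.values
```

At each timestep, SPALM queries such a store with the current hidden state and feeds the retrieved neighbor tokens to the model through its gating mechanism, rather than only interpolating output probabilities as kNN-LM does.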
We consider a language model that takes as input a sequence of words x_{\le t} = \{x_0, \ldots, x_t\} and outputs a probability distribution of the next word p(x_{t+1} | x_{\le t}; W). Given a corpus of T words, the log likelihood of the corpus is:

L = \sum_{t=0}^{T} \log p(x_{t+1} | x_{\le t}; W).

(We note that SPALM is not intended to be a model of the human language processing system. We merely take inspiration from human memory systems to design a better artificial language model.)
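As a concrete reference point, the corpus log likelihood above is the standard autoregressive objective; the short sketch below computes it from next-token logits. The model interface `lm` is a hypothetical stand-in, not SPALM's actual training code.

```python
import torch
import torch.nn.functional as F

def corpus_log_likelihood(lm, token_ids: torch.Tensor) -> torch.Tensor:
    """Compute L = sum_t log p(x_{t+1} | x_{<=t}) for a [batch, T+1] tensor of token ids.

    `lm` is any autoregressive model mapping token ids to next-token logits
    of shape [batch, length, vocab].
    """
    logits = lm(token_ids[:, :-1])                 # predict positions 1..T from 0..T-1
    log_probs = F.log_softmax(logits, dim=-1)
    targets = token_ids[:, 1:]                     # the next words x_{t+1}
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum()

# Perplexity, as reported in the experiments, is exp(-L / number_of_predicted_tokens).
```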
We use the transformer (Vaswani et al., 2017) as our base model. Given the input sequence x_{\le t}, the transformer performs multiple layers of self-attention between every pair of tokens in the input sequence to produce token representations.

A core limitation of the transformer is that its computational complexity is quadratic in the input sequence length. As a result, instead of considering all previous tokens x_{\le t}, the transformer truncates the input to the most recent N words \tilde{x}_{\le t} = \{x_{t-N+1}, \ldots, x_t\} and only operates on this fixed-length window in practice. A large transformer, no matter how many parameters it has, is limited by the input sequence length.

We use transformer-XL (Dai et al., 2019) as our working memory model. Given the current context \tilde{x}, transformer-XL extends it by caching and attending over hidden states computed for previous segments (the temporary short-term memory described above). For the long-term memory, approximate nearest neighbor search is performed with ScaNN (Guo et al., 2020): https://github.com/google-research/google-research/tree/master/scann.

There are several language models that are related to our proposed method. The closest one is kNN-LM (Khandelwal et al., 2020), which is another language model that is augmented with a nearest neighbor retrieval mechanism. kNN-LM is an ensemble technique that is designed to be used only at evaluation time. In kNN-LM, a pretrained language model (e.g., a transformer) is combined with another retrieval-based language model by interpolating their probabilities:

p(x_{t+1} | x_{\le t}) = \lambda p_{LM}(x_{t+1} | x_{\le t}) + (1 - \lambda) p_{kNN}(x_{t+1} | x_{\le t}).

The interpolation weight \lambda is tuned at the corpus level on a development set.

While this post hoc integration method used by kNN-LM has its merits (e.g., it is very practical and fast to incorporate into any model since it does not require additional training), our focus is on designing a model that combines short-term and long-term memory at the architecture level. Our motivation is twofold. First, interpolating the language model weights at the corpus level forces the model to use the same interpolation weight \lambda for p_{LM} and p_{kNN} for every token in the corpus. It cannot adaptively combine short-term and long-term information at the token level based on the context. In addition, \lambda needs to be tuned on an extra development set. SPALM, on the other hand, is able to adjust the weights placed on m_t and h^R_t when constructing z_t differently for different tokens. Second, we believe that the integration of different memory modules at the architectural level is a more natural approach that could help pave the way for applications with other memory sources (e.g., knowledge bases, images, videos), where the memory output is not in the same space as the prediction output (i.e., words) and an interpolation technique cannot be used.

We compare with kNN-LM in our experiments. Since interpolating model probabilities is an ensembling technique that is independent of the architecture, we also show that our language model can be further ensembled with p_{kNN} if necessary.

Cache-based language models and pointer networks. Cache-based language models (Grave et al., 2017c; Merity et al., 2017) store pairs of hidden states and output tokens from previously seen tokens (within a limited context length) in a cache. The best variant of the method uses an interpolation (ensemble) method similar to kNN-LM to combine information from the cache and the backbone language model. This class of models temporarily stores M past hidden states (typically on the order of thousands), so it is a working-memory model as opposed to a long-term memory.
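To make the corpus-level interpolation above concrete, the sketch below combines a backbone LM distribution with a distribution derived from retrieved neighbors. It is a simplified stand-in: we assume p_kNN is formed by a softmax over negative neighbor distances aggregated per token (in the spirit of Khandelwal et al., 2020), and the function names are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def knn_distribution(distances: torch.Tensor,
                     neighbor_tokens: torch.Tensor,
                     vocab_size: int,
                     temperature: float = 1.0) -> torch.Tensor:
    """Turn k retrieved neighbors into a distribution over the vocabulary.

    distances:       [k] distances between the query state and neighbor keys.
    neighbor_tokens: [k] long tensor with the token id stored as each neighbor's value.
    Neighbors are weighted by a softmax over negative distances, and weights of
    neighbors that share a token id are summed. This follows the spirit of
    kNN-LM but is not its exact implementation.
    """
    weights = F.softmax(-distances / temperature, dim=-1)   # closer neighbors get more mass
    p_knn = torch.zeros(vocab_size)
    p_knn.scatter_add_(0, neighbor_tokens, weights)         # aggregate weights per token
    return p_knn

def knn_lm_interpolate(p_lm: torch.Tensor, p_knn: torch.Tensor, lam: float) -> torch.Tensor:
    """kNN-LM: p(x_{t+1} | x_{<=t}) = lam * p_LM + (1 - lam) * p_kNN,
    with a single corpus-level lam tuned on a development set."""
    return lam * p_lm + (1.0 - lam) * p_knn
```

The key point of contrast is that lam here is a single scalar shared by every token; SPALM instead computes a gate from the current context, so the mixing can differ from token to token (a sketch appears later, alongside the analysis of the gate values).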
In addition, cache-based models also rely on interpolating the probabilities of a backbone language model and a cache component (similar to kNN-LM when the cache size is unbounded). (We note that it is possible to incorporate this interpolation technique during the training phase of a language model as well, to avoid having to tune \lambda on a development set. For example, Neubig and Dyer (2016) show how to train a mixture of experts language model, where the mixture weights are inferred. However, the efficacy of this approach as a memory-augmented language model has not been explored.)

Other retrieval-augmented methods. An early version of a neural language model that includes a retrieval component is presented in Guu et al. (2018). They follow a retrieve-then-edit approach to generate a sentence, which requires approximating an expectation over an edit prior.

Outside language modeling, there are several recent retrieval-augmented methods that have been used for question answering (de Masson d'Autume et al., 2019; Guu et al., 2020; Xiong et al., 2021; Kassner and Schütze, 2020), controllable generation (Xu et al., 2020), machine translation (Bapna and Firat, 2019; Khandelwal et al., 2021), and one-shot learning (Kaiser et al., 2017). These methods share some similarities with our proposed model since they involve a retrieval component. However, the difference in the downstream tasks (language modeling vs. question answering vs. machine translation) results in different items being stored in and retrieved from the key-value database. For example, de Masson d'Autume et al. (2019) store and retrieve question-answer pairs, Guu et al. (2020) have a database of passages of an article, and Khandelwal et al. (2021) use source and target sentences. Our gating mechanism resembles the gate that is used to incorporate information from a non-parametric memory component into a machine translation model in Bapna and Firat (2019), although the memory entries, the decoder architecture, and the downstream task are different.

In addition, these models are only models of long-term memory. Their evaluation tasks often do not need working memory because the entire input sequence is short enough that it can be fed as an input to a transformer as a whole.

We use word-based and character-based English language modeling datasets (WikiText-103, WMT, and enwik8) to evaluate our proposed method. We provide descriptive statistics in Table 1 and discuss each dataset in the respective section below.

Table 1: Descriptive statistics of datasets used in our experiments. For each split, we show the number of (sub)words for WikiText and WMT and the number of characters for enwik8.

Dataset    Train  Dev   Test  Vocab
WikiText   110M   0.2M  0.3M  33,060
WMT        852M   1M    1M    50,259
enwik8     94M    5.2M  5.2M  256

We use Adam (Kingma and Ba, 2015) as our optimizer. For word-based language modeling, we use adaptive softmax (Grave et al., 2017b). We apply dropout with a rate of 0.25. All models are trained on 128 Tensor Processing Units until convergence with batch size 256.

Our first dataset is WikiText-103 (Merity et al., 2017). We compare four models: vanilla transformer, transformer-XL, kNN-LM, and SPALM. For WikiText-103, all of our models have 18 layers and 512 hidden dimension size, with a total of 142M parameters. We set the sequence length to 512. For transformer-XL, we set the short-term memory length to a fixed value during training and report two memory-length settings at test time. We use 4 nearest neighbors for kNN-LM and SPALM and analyze the effect of varying the number of neighbors in §5.4.
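For reference, the WikiText-103 setup just described can be collected into a single configuration. The field names below are ours, not the paper's, and values that are not stated in the text (e.g., the transformer-XL memory lengths) are omitted.

```python
# Hyperparameters for the WikiText-103 experiments as described in the text.
# Field names are illustrative; this is not the actual training configuration file.
wikitext103_config = {
    "num_layers": 18,
    "hidden_size": 512,
    "total_parameters": "142M",
    "sequence_length": 512,
    "num_retrieved_neighbors": 4,   # for both SPALM and kNN-LM
    "optimizer": "Adam",
    "dropout_rate": 0.25,
    "batch_size": 256,
    "output_layer": "adaptive softmax (word-level models only)",
    "hardware": "128 TPUs",
}
```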
For kNN-LM, we use the transformer-XL model to obtain p_{LM}, compute p_{kNN} based on the nearest neighbor distances similar to Khandelwal et al. (2020), and tune \lambda over a grid of five values on the development set.

Table 2 shows perplexity on WikiText-103. Our implementation produces results that are in the same range as state-of-the-art numbers, demonstrating the strength of our baselines. Transformer-XL outperforms the transformer, and interpolating the probability of transformer-XL with kNN (i.e., kNN-LM) improves the result further. This holds for both transformer-XL (short-term) memory lengths. Comparing kNN-LM with SPALM, kNN-LM is marginally better on the test set even though SPALM is marginally better on the development set.

We observe further improvements in SPALM by interpolating its output probability with the output probability p_{kNN} used by kNN-LM, resulting in the best model with a perplexity of 17.6. We find this interesting since SPALM and p_{kNN} use the exact same four neighbors for each token. It indicates that there are some complementary benefits in incorporating long-term memory into training and interpolating probabilities at test time.

Table 2: Perplexity on WikiText-103. The top rows contain results taken from other papers: (a) transformer-XL (Dai et al., 2019), (b) adaptive input embeddings (Baevski and Auli, 2019), (c) compressive transformer (Rae et al., 2020), and (d) kNN-LM (Khandelwal et al., 2020). The (log likelihood) difference between the best model (SPALM + kNN) and transformer-XL on the test set is statistically significant (Wilcoxon signed-rank test).

Model                         #Params  Dev   Test
Transformer-XL (a)
Adaptive input embeddings (b)
Compressive transformer (c)
kNN-LM (d)
Shorter test-time memory:
Transformer                   142M     20.8  21.8
Transformer-XL                142M     18.7  19.6
kNN-LM                        142M     18.1  18.5
SPALM                         142M
  + kNN                                17.6  18.0
Longer test-time memory:
Transformer-XL                142M     18.3  19.1
kNN-LM                        142M     17.7  18.0
SPALM                         142M
  + kNN                                17.2

In the second experiment, our goal is to evaluate on a much larger dataset. We construct a language modeling dataset from the English portion of the WMT 2019 dataset, which is publicly available. WMT contains news articles from different months. We use articles from January to October for training, a portion of articles in November for development, and a portion of articles in December for test. (We sample articles written in November and December in chronological order to create development and test sets of approximately 1 million tokens each; there are almost 100 million tokens if we use all of the articles in each month.) The resulting WMT dataset is approximately ten times larger than the WikiText-103 dataset.

Similar to the previous experiment, we evaluate models with 18 layers and 512 hidden dimension size, with a total of 148 million parameters. We set the sequence length to 512, the transformer-XL short-term memory length to 512 for training and evaluation, and the number of neighbors for SPALM and kNN-LM to 4.

Table 3 shows results on this dataset. Consistent with the previous experiment, kNN-LM outperforms transformer-XL and the transformer. SPALM outperforms all of them by a considerable margin on the test set. Unlike WikiText-103, we observe no further improvement from interpolating the probabilities of SPALM with p_{kNN}. The results also indicate that when the distributions of the dev and test sets can differ (e.g., articles from different months), kNN-LM, which relies on tuning \lambda on the dev set, is more sensitive to performance discrepancy between the dev and test sets.
Table 3: Perplexity on the WMT dataset. The (log likelihood) difference between SPALM and transformer-XL on the test set is statistically significant (Wilcoxon signed-rank test).

Model            #Params  Dev   Test
Transformer       148M    16.0  16.3
Transformer-XL    148M    15.6  15.5
kNN-LM            148M    13.1  15.2
SPALM             148M

In the third experiment, we evaluate our models on character-level language modeling. Compared to word-level language modeling, character-level modeling has a much smaller output space (on the order of hundreds instead of tens of thousands) and a different characteristic in how much local vs. global context is needed to make a good prediction.

The enwik8 dataset (Hutter, 2012) is a benchmark for character-level language modeling. We use a 24 layer model with 512 hidden size. In total, our model has 100 million parameters. We set the sequence length to 768 and the transformer-XL short-term memory length to 1536 for training and 4096 for evaluation. Since character-level language models have a much smaller output space, we only retrieve two neighbors per character.

We show the results in Table 4. Unlike the previous two word-level language modeling results, kNN-LM underperforms transformer-XL. However, SPALM outperforms all other models. We note that a decrease of 0.01 is considerable on this dataset under the BPC metric. Similar to WMT, interpolating the probabilities of SPALM with p_{kNN} does not improve performance. These results highlight a major strength of our proposed model: uniformly setting interpolation weights at the corpus level decreases performance (i.e., kNN-LM), but allowing the model to flexibly decide when to use long-term vs. short-term memory is beneficial.

Figure 2: A sequence of words from WMT and its four nearest neighbors at each position. We break down the sequence into four blocks. The bottom row of each block in blue represents the original sequence, which is Elizabeth Warren on Friday ... the middle class. Each row above it represents a nearest neighbor token (starting from the first neighbor at the second bottom to the fourth neighbor at the top) that is used when predicting that particular word. We highlight matching neighbor-target words in green. We provide a more detailed discussion in §5.1.

Since character-level and word-based language modeling are characteristically different, the success of our model on this dataset indicates its applicability to other sequence modeling problems. We leave such explorations to future work.

Table 4: Bits per character (BPC) on enwik8. The top rows contain results taken from other papers: (a) transformer-XL (Dai et al., 2019), (b) longformer (Beltagy et al., 2020), and (c) compressive transformer (Rae et al., 2020).
The (log likelihood) difference between SPALM and transformer-XL on the test set is statistically significant (Wilcoxon signed-rank test).

Model                      #Params  Dev   Test
18L Transformer-XL (a)      88M     -     1.03
24L Transformer-XL (a)
kNN-LM                      104M    1.04  1.02
SPALM

We have demonstrated the efficacy of our proposed method on three language modeling tasks. In this section, we analyze the model to gain more insights into how it works.

We inspect the neighbor tokens that are retrieved from the long-term memory for news articles in the WMT development set. We provide a cherry-picked example in Figure 2. As the model sees more tokens in a sequence, the long-term memory model becomes more accurate. We observe interesting cases such as when predicting a named entity (e.g., Elizabeth Warren): even if the long-term memory model fails to retrieve the correct first name, it usually is able to retrieve the correct last name after seeing the first name (because the entity exists in the training corpus). We observe this phenomenon in many other examples as well. We can also see that the retrieved neighbors are generally relevant even when they do not match a target word exactly, e.g., when predicting names of days, dollar amounts, time quantifiers, and common phrases.

We next investigate neighbors on the enwik8 development set (Figure 3). We observe that information from the long-term memory helps when completing common words (e.g., before and invasion), named entities (e.g., Soviet), and corpus-specific formats (e.g., double square brackets).

We note that the above examples are only provided to give better insight into our model. It is entirely plausible that a baseline parametric model is already able to predict correctly from the local context. Nonetheless, directly providing this information as a long-term context helps our model learn better, as evident from the superior performance of SPALM on our three evaluation datasets.

Figure 3: A sequence of characters from enwik8 and its two nearest neighbors at each position. We break down the sequence into two blocks. The bottom row of each block in blue represents the original character sequence, which is Even before ... [[1979]]. The two rows above it represent the nearest neighbors (the first nearest neighbors at the second bottom row and the second nearest neighbors at the top row) that are used when predicting that particular character. We highlight matching neighbor-target characters in green. We provide a more detailed discussion in §5.1.

Figure 4: Three example sequences from the WMT test set:
... Several companies have pulled their advertising from the TV show following the revelations ...
... Liberal Democrat leader Jo Swinson has said she would work with Donald Trump in government as ...
... Additionally, the airline has purchased six Boeing 787-9 Dreamliner aircraft that are scheduled ...
We highlight words where both p_{TXL} and p_{SPALM} exceed p_{transformer} by a margin in green, and where p_{SPALM} exceeds p_{TXL} by a margin in blue. See §5.2 for details.

We search for predictions where SPALM significantly outperforms transformer-XL and the transformer to understand when modeling local information is sufficient (i.e., vanilla transformer), when adding extended context helps (i.e., transformer-XL), and when storing long-term information is useful (i.e., SPALM).
We show three examples from the WMT test set in Figure 4. While it is difficult to find consistent patterns, we observe that SPALM is generally better than both the transformer and transformer-XL at predicting (completing) common phrases and named entities that exist in the training set, especially when they are encountered for the first time and have not appeared in the extended context (e.g., pulled their advertising from, Liberal Democrat, Jo Swinson, Boeing 787-9 Dreamliner).

On the other hand, we also see a few cases where transformer-XL outperforms SPALM. These are usually associated with scenarios where the same word has appeared in the extended context. While SPALM uses information from the extended context as well, its probability is smoothed over by information from the long-term memory, resulting in a more peaky distribution for transformer-XL.

Our model has a gating mechanism to regulate information flow from the current context, short-term memory, and long-term memory. We analyze the values of the gate for tokens in WMT and enwik8. Figure 5 shows histograms of the distribution of gate values.

Figure 5: Distributions of values of z for the WMT (left) and enwik8 (right) development sets.

We observe different characteristics for WMT and enwik8. On enwik8, the gate values are concentrated around 1. This indicates that the model relies on local context most of the time. This can explain why kNN-LM does not work well on this dataset. On WMT, the values are less concentrated around 1. This suggests that the model uses long-term memory more than on enwik8. SPALM is able to learn when the long-term memory is needed and when it is not in both cases.

We next look into the values of the gates for a specific sequence in the development set in Figure 6. We note that we only show a small subset of dimensions from the gate vector for readability, so we caution against drawing a conclusion about how the model works from this. Our goal is only to get a better understanding of what happens when the model makes predictions.

Figure 6: Heatmaps of z values on a partial sequence from the WMT development set (left) and enwik8 (right). Each row is a token (word or character); each column is a dimension from z. Blue indicates a value closer to 1.0, whereas red indicates a value closer to 0.0. The darker the shade, the closer the value is to the extreme. We see vertical patterns on WMT, indicating that these dimensions are reserved to flow information from long-term memory. Horizontal patterns on enwik8 indicate that the model relies on long-term memory to predict a target token (e.g., when forming the word Egypt). The z vector has 512 dimensions; we only zoom in to a small dimension subset here. There are more horizontal and vertical patterns on both datasets as a whole.

Comparing WMT and enwik8, we see that in general on WMT the model tends to reserve some dimensions to propagate information from the long-term memory, as indicated by vertical red lines. On enwik8, the model relies on long-term information when completing a known word such as Egypt, as shown by more horizontal red patterns when forming this word. For other characters, the values of the gates are closer to one, which shows that the model relies more on local and extended short-term context.
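As an illustration of the kind of gate analyzed above, the sketch below shows a per-token, per-dimension sigmoid gate z that mixes the transformer(-XL) hidden state with an aggregated long-term memory vector. This is a minimal stand-in consistent with the description of z in this section and with the m_t and h^R_t notation used in the comparison with kNN-LM; the exact way SPALM builds m_t from retrieved neighbors and parameterizes the gate is not reproduced here.

```python
import torch
import torch.nn as nn

class MemoryGate(nn.Module):
    """Illustrative context-dependent gate (not SPALM's exact formulation).

    h_r: [batch, d] hidden state from the transformer(-XL) over local and
         extended short-term context.
    m:   [batch, d] aggregated representation of retrieved long-term
         neighbors (e.g., an attention-weighted sum of their embeddings).
    Gate values near 1 keep the local/short-term signal; values near 0 let
    long-term memory flow through, matching the reading of Figures 5 and 6.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model)

    def forward(self, h_r: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        z = torch.sigmoid(self.gate_proj(h_r))   # per-token, per-dimension gate
        return z * h_r + (1.0 - z) * m           # mixed representation fed to the output softmax
```

Unlike the corpus-level \lambda of kNN-LM, this gate is a function of the current hidden state, so different tokens (and different dimensions, as in Figure 6) can weight long-term memory differently.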
We use four neighbors for our word-based and two neighbors for our character-based language models. These values are chosen from preliminary experiments on a small subset of the datasets.

We show SPALM perplexity on the WikiText-103 development set when we vary the number of neighbors in Table 5. We see that using one nearest neighbor is enough to obtain good performance, with a slight advantage when we use four neighbors. The performance starts to degrade as we use 8 and 16 neighbors. We choose to use four neighbors in our experiments since kNN-LM (which also uses the same set of neighbors) performs better with four neighbors instead of one, and we want to keep the comparison as fair as possible.

Table 5: SPALM perplexity on the WikiText-103 development set with different numbers of neighbors.

One notable difference between our neighbors and those used in kNN-LM (Khandelwal et al., 2020) is that we do not limit the search of the neighbors to the same token as the current input token (I(x_i = x_t)). While this allows the model to combine information from related words (not constrained to an exact match), it could introduce noise when the number of neighbors is large.

We observe that our representation learning model (i.e., the baseline transformer) is able to retrieve relevant neighbors most of the time. It retrieves the exact output token as the first neighbor 33%, 44%, and 70% of the time on the WikiText-103, WMT, and enwik8 development sets, respectively.

Summary of contributions. We present a semiparametric language model (SPALM) that combines local context, short-term memory, and long-term memory to make predictions. Experiments on word-based and character-based language models demonstrate the benefit of our proposed method.

Limitations. The biggest limitation is the necessity to retrieve neighbors for each training token. Such a process, even though it can be fully parallelized, is time consuming. In our experiments, it takes 6-8 hours to obtain neighbors for WikiText-103 and enwik8 with 1,000 CPUs and 18 hours for WMT with 9,000 CPUs.

Future directions. Our modular approach that combines multiple memory systems at the architectural level opens up the possibility to incorporate additional memory from other modalities (e.g., images) or structured knowledge bases. We also envision a next-generation model that does not have to retrieve information from long-term memory for every token and only does so for tokens that require global context. A model that learns how to do this would save a considerable amount of training and test time, since it would significantly reduce the number of searches that need to be performed. Our language model that integrates retrieval into training is a first step in this direction.

Acknowledgements

We thank the action editor (Mihai Surdeanu) and three anonymous reviewers for helpful comments on an earlier draft of this article.

References

Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In Proc. of ICLR.
Ankur Bapna and Orhan Firat. 2019. Non-parametric adaptation for neural machine translation. In Proc. of NAACL-HLT.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150v2.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Proc. of NeurIPS.
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. 2021. Rethinking attention with performers. In Proc. of ICLR.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proc. of ACL.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL.
Howard Eichenbaum. 2012. Memory systems. Handbook of Psychology, Second Edition, 3.
Edouard Grave, Moustapha Cissé, and Armand Joulin. 2017a. Unbounded cache model for online language modeling with open vocabulary. In Proc. of NeurIPS.
Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. 2017b. Efficient softmax approximation for GPUs. In Proc. of ICML.
Edouard Grave, Armand Joulin, and Nicolas Usunier. 2017c. Improving neural language models with a continuous cache. In Proc. of ICLR.
Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating large-scale inference with anisotropic vector quantization. In Proc. of ICML.
Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. 2018. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437-450.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. In Proc. of ICML.
Marcus Hutter. 2012. The human knowledge compression contest.
Hakan Inan, Khashayar Khosravi, and Richard Socher. 2017. Tying word vectors and word classifiers: A loss framework for language modeling. In Proc. of ICLR.
Łukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. 2017. Learning to remember rare events. In Proc. of ICLR.
Nora Kassner and Hinrich Schütze. 2020. BERT-kNN: Adding a kNN search component to pretrained language models for better QA. In Proc. of Findings of EMNLP.
Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2021. Nearest neighbor machine translation. In Proc. of ICLR.
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. In Proc. of ICLR.
Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In Proc. of ICLR.
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In Proc. of ICLR.
Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2018. Dynamic evaluation of neural sequence models. In Proc. of ICML.
Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2019. Dynamic evaluation of transformer language models.
arXiv preprint arXiv:1904.08378v1.
Cyprien de Masson d'Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. 2019. Episodic memory in lifelong language learning. In Proc. of NeurIPS.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In Proc. of ICLR.
Aida Nematzadeh, Sebastian Ruder, and Dani Yogatama. 2020. On memory in human and artificial language processing systems. In Proc. of ICLR Workshop on Bridging AI and Cognitive Science.
Graham Neubig and Chris Dyer. 2016. Generalizing and hybridizing count-based and neural language models. In Proc. of EMNLP.
Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2020. Language models as knowledge bases? In Proc. of EMNLP.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In Proc. of ICLR.
Edmund T. Rolls. 2000. Memory systems in the brain. Annual Review of Psychology, 51(1):599-630.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053v4.
E. Tulving. 1985. How many memory systems are there? American Psychologist, 40:385-398.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. of NIPS.
Wenhan Xiong, Xiang Lorraine Li, Srini Iyer, Jingfei Du, Patrick Lewis, William Yang Wang, Yashar Mehdad, Wen-tau Yih, Sebastian Riedel, Douwe Kiela, and Barlas Oguz. 2021. Answering complex open-domain questions with multi-hop dense retrieval. In Proc. of ICLR.
Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Raul Puri, Pascale Fung, Anima Anandkumar, and Bryan Catanzaro. 2020. Megatron-CNTRL: Controllable story generation with external knowledge using large-scale language models. In Proc. of EMNLP.