Language Models as Knowledge Bases: On Entity Representations, Storage Capacity, and Paraphrased Queries
Benjamin Heinzerling¹,² and Kentaro Inui²,¹
¹ RIKEN AIP   ² Tohoku University
[email protected] | [email protected]

Abstract
Pretrained language models have been suggested as a possible alternative or complement to structured knowledge bases. However, this emerging LM-as-KB paradigm has so far only been considered in a very limited setting, which only allows handling 21k entities whose single-token name is found in common LM vocabularies. Furthermore, the main benefit of this paradigm, namely querying the KB using a variety of natural language paraphrases, is underexplored so far. Here, we formulate two basic requirements for treating LMs as KBs: (i) the ability to store a large number of facts involving a large number of entities and (ii) the ability to query stored facts. We explore three entity representations that allow LMs to represent millions of entities and present a detailed case study on paraphrased querying of world knowledge in LMs, thereby providing a proof-of-concept that language models can indeed serve as knowledge bases.
Language models (LMs) appear to memorize world knowledge facts during training. For example, BERT (Devlin et al., 2019) correctly answers the query "Paris is the capital of [MASK]" with "France". This observation prompted Petroni et al. (2019) to ask if LMs can serve as an alternative or complement to structured knowledge bases (KBs), thereby introducing the idea of treating LMs as KBs: During training, the LM encounters world knowledge facts expressed in its training data, some of which are stored in some form in the LM's parameters. After training, some of the stored facts can be recovered from the LM's parameters by means of a suitable natural language query (Fig. 1). However, this emerging LM-as-KB paradigm is faced with several foundational questions.
Figure 1: The LM-as-KB paradigm, first introduced by Petroni et al. (2019). A LM memorizes facts in the form of statements, which can then be queried in natural language.
First question: KBs contain millions of entities, while the vocabulary size of common LMs usually does not exceed 100k entries. How can millions of entities be represented in LMs? Previous work (Petroni et al., 2019) circumvents this problem by only considering the roughly 21k entities whose canonical name corresponds to a single token in the LM vocabulary, e.g., entities like "France" or "Bert", but not "United Kingdom" or "Sesame Street". Hence, this approach cannot handle entities that are not contained in the LM's vocabulary, and a query like "Bert is a character on [MASK]" is not answerable in this simplified setting.

To answer this first question, we compare three methods for scaling LM-as-KB to millions of entities:

1. Symbolic representation, i.e., extending the LM vocabulary with entries for all entities;
2. Surface form representation, i.e., each entity is represented by its subword-encoded canonical name, which is stored and queried by extending the LM with a sequence decoder for entity names; and
3. Continuous representation, i.e., each entity is represented as an embedding.

We find that, while all three entity representations allow LMs to store millions of world knowledge facts involving a large number of entities, each representation comes with different trade-offs: Symbolic representation allows the most accurate storage, but is computationally expensive and requires entity-linked training data. Surface form representation is computationally efficient and does not require entity-linked training data, but is less accurate, especially for longer entity names. Continuous representation also requires entity-linked training data, but is computationally more efficient than symbolic representation.

Second question: What is the capacity of LMs for storing world knowledge?
Can a LM store, say, all relation triples contained in a knowledge base like Wikidata (Vrandečić and Krötzsch, 2014)? Here we conduct experiments using synthetic data to study the scaling behaviour of current LM architectures. Varying the number of trainable model parameters and recording the number of relation triples memorized at a given accuracy level, we find that, e.g., a Transformer (Vaswani et al., 2017) with 125 million parameters (12 layers of size 768) has the capacity to memorize 1 million Wikidata relation triples with 95 percent accuracy or 5 million relation triples with 79 percent accuracy. Assuming linear scaling, this finding suggests that larger LMs with tens or hundreds of billions of parameters (Raffel et al., 2019; Brown et al., 2020) can be used to store sizable portions, if not all, of a large knowledge base like Wikidata.
Third question: How robustly is world knowledge stored in LMs? Is the LM able to recall a fact even if the query is slightly different from what was memorized during training? For example, if the LM memorized "Barack Obama was born in Hawaii" during training, can it answer queries like "Barack Obama is from [MASK]" or "Where was Barack Obama born? [MASK]"? Here we conduct controlled experiments to measure how well the LM transfers knowledge from memorized statements to query variants, both in a zero-shot setting, in which the model is not exposed to the target query variant during training, and a few-shot setting, in which the model is finetuned on a small number of statements containing the target query variant. We observe zero-shot transfer in the case of highly similar query variants, and see successful few-shot transfer after finetuning with 5 to 100 instances in the case of less similar queries. This ability to handle soft, natural language queries, as opposed to hard, symbolic queries in a language like SQL or SPARQL, is one of the key motivations for using language models as knowledge bases.
Contributions. We formulate two requirements for treating LMs as KBs: (i) the ability to store a large number of facts involving a large number of entities and (ii) the ability to query stored facts. After providing background on world knowledge in language models, our contributions are:

• A comparison of entity representations for scaling LM-as-KB to millions of entities;
• Empirical lower bounds on LM capacity for storing world knowledge facts; and
• A controlled study of zero-shot and few-shot knowledge transfer from memorized statements to paraphrased queries.

Terminology. In this work we are interested in storing and retrieving world knowledge facts in and from a language model. World knowledge is knowledge pertaining to entities, such as Barack Obama. A fact is a piece of world knowledge that can be expressed with a concise natural language statement, such as the English sentence "Barack Obama was born in Hawaii", or with a relation triple, such as ⟨Barack Obama, wasBornIn, Hawaii⟩. A relation triple, or relation for short, consists of a head or subject entity (Barack Obama), a predicate (wasBornIn), and a tail or object entity (Hawaii). A knowledge base is a set of relations. Knowledge bases, such as Wikidata, typically contain hundreds or thousands of predicates, millions of entities, and millions or billions of relations.
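To make this terminology concrete, the following minimal Python sketch (names and facts are illustrative, not from the paper's code) represents a relation triple and a toy knowledge base as a set of triples with symbolic lookup:

```python
from typing import List, NamedTuple

class Triple(NamedTuple):
    """A world knowledge fact as a (subject, predicate, object) relation triple."""
    subj: str
    pred: str
    obj: str

# A toy knowledge base: a set of relation triples.
kb = {
    Triple("Barack Obama", "wasBornIn", "Hawaii"),
    Triple("Bert", "isCharacterOn", "Sesame Street"),
}

def query(subj: str, pred: str) -> List[str]:
    """Symbolic retrieval: return all objects stored for (subject, predicate)."""
    return [t.obj for t in kb if t.subj == subj and t.pred == pred]

print(query("Barack Obama", "wasBornIn"))  # ['Hawaii']
```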
Large pretrained LMs have been the main driver of recent progress in natural language processing (Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2019; Devlin et al., 2019). While the trend towards larger LMs is likely to continue (Raffel et al., 2019; Kaplan et al., 2020; Brown et al., 2020), it has fundamental limitations: (i) A model trained only on surface forms, i.e., text, lacks grounding in perception and experience and hence cannot learn meaning (Bender and Koller, 2020). (ii) Reporting bias leads to certain knowledge rarely or never being expressed in text. For example, a LM will easily learn that Barack Obama is a former U.S. President, but will less likely learn that he is a male human being, since the latter fact is rarely stated explicitly in text. In contrast, this type of knowledge is readily available in knowledge bases. (iii) A large number of rare entities (Hoffart et al., 2014; Derczynski et al., 2017; Ilievski et al., 2018) are, by definition, rarely mentioned, making it difficult for LMs to acquire knowledge about this long tail of entities from text alone.

These limitations have motivated efforts to explicitly equip LMs with world knowledge, as opposed to the LM acquiring world knowledge implicitly as a side effect of its training objective. Table 1 situates these efforts on a spectrum from purely text-based language modeling to representations of structured knowledge graphs. Models based on text generation (Raffel et al., 2019; Roberts et al., 2020) and retrieval (Guu et al., 2020) (denoted with Text in the Output column) have proven most successful in knowledge-intensive tasks. However, we argue that models which reify entities (Logan et al., 2019), i.e., models in which entities are "first-class citizens" that can be directly predicted (denoted by Target entity in the Output column), as opposed to generating or retrieving a surface form which may or may not correspond to an entity, are a promising research direction, since the direct links into a KB can be seen as a form of grounding. This is one of our main motivations for considering symbolic and continuous entity representations.

Paradigm / Task | Input | Output | Models and objectives
Language modeling | Text | Text | Next word prediction (Shannon, 1948; Elman, 1990; Bengio et al., 2003), masked token prediction (Devlin et al., 2019)
LM-as-KB? | Text | Text / single-token entity name | Closed-book QA (LAMA probe, Petroni et al., 2019)
Sequence-to-sequence | Text | Text | Text-to-text transformer (T5, Raffel et al., 2019), closed-book QA (Roberts et al., 2020)
Retrieval | Text | Text, answer span | Answer-span selection (Chen et al., 2017), retrieval-augmented LM (Guu et al., 2020), open-book QA
Entity replacement | Text, entity mention spans | Text | Detecting replaced entity mentions (Xiong et al., 2019)
Entity linking (EL) | Text, entity mention spans | Target entity | AIDA (Hoffart et al., 2011), neural EL (Francis-Landau et al., 2016; Kolitsas et al., 2018)
Entity embeddings | Text, entity mention spans | Entity embeddings | Joint embedding of entities and text (Yamada et al., 2016)
LM with entity embeddings | Text, linked entity mentions, entity embeddings | Text | ERNIE (Zhang et al., 2019), E-BERT (Poerner et al., 2019)
LM with integrated EL | Text, entity embeddings | Text | KnowBert (Peters et al., 2019)
LM-as-KB (this work) | Natural language query | Target entity | Fact memorization, paraphrased queries, closed-book QA
Knowledge-aware LM | Text, knowledge (sub)graph | Target entity, text | Neural Knowledge LM (Ahn et al., 2016), Reference-aware LM (Yang et al., 2017), Knowledge graph LM (Logan et al., 2019)
Semantic parsing | Natural language query | Meaning representation, target entity | SEMPRE (Berant et al., 2013), GNNs for KBQA (Sorokin and Gurevych, 2018)
Universal Schema | Relation triples, text patterns | Entity tuple and relation embeddings | Matrix factorization (Riedel et al., 2013)
Knowledge graph embeddings | Relation triples | Node and edge embeddings | Link prediction; RESCAL (Nickel et al., 2011), TransE (Bordes et al., 2013), ComplEx (Trouillon et al., 2016), ConvE (Dettmers et al., 2018)
Graph neural networks | Nodes, node features, edges | Node embeddings | DeepWalk (Perozzi et al., 2014), graph neural networks (Kipf and Welling, 2017)
Knowledge graphs | Nodes, edges | Nodes, edges | Storage and retrieval, SQL/SPARQL queries, symbolic reasoning (Coppens et al., 2013)

Table 1: Approaches for using world knowledge in natural language processing, ranging from unstructured, purely text-based approaches (top), over approaches that mix text and structured KBs to varying degrees (middle), to approaches operating on structured KBs (bottom).
We now address our first question: How can millions of entities be represented in a LM? To answer this question, we compare three types of entity representations, namely symbolic, surface form, and continuous representations.
Experimental setup. We evaluate entity representations by measuring how well they allow a LM to store and retrieve world knowledge facts. For example, if the model's training data contains the statement "Bert is a character on Sesame Street", the model should be able to memorize this statement and recall the correct object Sesame Street when queried with a query like "Bert is a character on [MASK]."

Synthetic data. It is not a priori clear how many facts a given text from the LM's training data, say, a Wikipedia article, expresses. Since we want to precisely measure how well a LM can store and retrieve facts, we create synthetic data by generating statements from relation triples and then train the model to memorize these statements in an idealized setting. Using Wikidata as knowledge source, we first define two sets of entities: a smaller set consisting of the top 1 million Wikidata entities according to node outdegree, and a larger set consisting of the roughly 6 million Wikidata entities that have a corresponding entry in the English edition of Wikipedia.

Next, we select the 100 most frequent Wikidata predicates and manually create one statement template for each predicate. For example, for the Wikidata predicate P19 ("place of birth"), we create the template "S was born in O" and generate English statements by filling the S and O slots with entities from the sets defined above for which this relation holds. To make queries for an object entity unique given subject and predicate, we arbitrarily select exactly one fact if there are multiple possible objects and discard the other facts. This process yields 5 million statements involving up to 1 million entities, and 10 million statements involving up to 6 million entities. These statements then serve as training instances, i.e., given the query "Barack Obama was born in [MASK]", the model should predict Hawaii. As our goal is to store facts in a LM, there is no distinction between training and test data. Templates and a sample of the generated statements are shown in Appendices A and B.
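The statement generation described above can be sketched as follows; the deduplication of multiple objects per (subject, predicate) pair mirrors the description, while the function names, placeholder syntax, and toy triples are our own:

```python
# Statement templates for two Wikidata predicates, written with format placeholders.
templates = {
    "P19": "{S} was born in {O}",  # place of birth
    "P20": "{S} died in {O}",      # place of death
}

# Relation triples, e.g. extracted from a Wikidata dump (here: toy examples).
triples = [
    ("Barack Obama", "P19", "Hawaii"),
    ("Albert Einstein", "P19", "Ulm"),
    ("Albert Einstein", "P20", "Princeton"),
]

def generate_statements(triples, templates):
    """Fill the S and O slots; keep exactly one object per (subject, predicate) pair."""
    seen, statements = set(), []
    for subj, pred, obj in triples:
        if pred not in templates or (subj, pred) in seen:
            continue  # skip predicates without a template and additional objects
        seen.add((subj, pred))
        statements.append(templates[pred].format(S=subj, O=obj))
    return statements

print(generate_statements(triples, templates))
# ['Barack Obama was born in Hawaii', 'Albert Einstein was born in Ulm',
#  'Albert Einstein died in Princeton']
```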
Models and training. We consider two common LM architectures: LSTMs (Hochreiter and Schmidhuber, 1997) and Transformers (Vaswani et al., 2017). For LSTMs, we compare two model configurations, namely a randomly initialized two-layer LSTM with 256 hidden units per layer (LSTM 256) and one with 1024 hidden units per layer (LSTM 1024). For Transformers, we compare a pretrained model, namely RoBERTa-base (Liu et al., 2019), and RoBERTa without pretraining, i.e., a randomly initialized Transformer of the same size. For consistent tokenization across all four models, we subword-tokenize all statements with the RoBERTa tokenizer. To store statements with symbolic and continuous representation, we train until the model reaches 99 percent memorization accuracy, i.e., achieves almost perfect overfitting, or stop early if accuracy does not improve for 20 epochs. Further training details are given in Appendix C.

Figure 2: Accuracy of statement memorization with symbolic representation of object entities (x-axis: number of statements, 1 million entities; y-axis: memorization accuracy; models: LSTM 1024, LSTM 256, RoBERTa-base, RoBERTa-base without pretraining).
With symbolic representation, each entity is represented as an entry in the LM's vocabulary. Prediction is done via masked language modeling (Devlin et al., 2019), by encoding the query with the LM, projecting the final hidden state of the [MASK] token onto the vocabulary and then taking a Softmax over the vocabulary. As the results show (Fig. 2), symbolic representation yields very high memorization accuracies with a vocabulary of 1 million entities. RoBERTa-base without pretraining, i.e., a randomly initialized Transformer, works best and memorizes 97 percent of 5 million statements correctly.

Unfortunately, the Softmax computation becomes prohibitively slow as the vocabulary size increases (Morin and Bengio, 2005), making symbolic representation with a Softmax over a vocabulary consisting of the full set of 6 million Wikipedia entities impractical. Imposing a hierarchy is a common approach for dealing with large vocabularies, but did not work well in this case (see Appendix E.1).
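A minimal sketch of this prediction setup, assuming the transformers library and a freshly initialized projection onto an entity vocabulary (the function and variable names are ours, not the paper's code):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

N_ENTITIES = 1_000_000  # entity vocabulary; note the output layer alone has ~0.8B parameters

tok = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")
entity_head = nn.Linear(encoder.config.hidden_size, N_ENTITIES)  # projection onto entities

def entity_logits(query: str) -> torch.Tensor:
    """Score all entities for the <mask> position of a query."""
    enc = tok(query, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state                        # (1, seq_len, hidden)
    mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()
    return entity_head(hidden[0, mask_pos])                          # (N_ENTITIES,)

logits = entity_logits("Bert is a character on <mask>.")
# Training would minimize cross-entropy against the gold entity id, e.g.:
# loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([gold_entity_id]))
print(logits.softmax(-1).argmax().item())  # predicted entity id (arbitrary before training)
```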
With surface form representation, each entity is represented by its canonical name, for which we use the English Wikidata label. Since this name generally consists of more than one token, we cast memorizing statements and querying facts as a sequence-to-sequence task (Sutskever et al., 2014): Given the source sequence "Bert is a character on [MASK]", the model needs to generate the target sequence "Sesame Street". We include the [MASK] token since the target entity does not always occur at the end of a statement. To make models memorize statements, we train until perplexity on the training data reaches 1.0 or does not improve for 20 epochs. For evaluation, we generate surface forms of target entities, i.e., the answer to a given query, via a beam search with beam size 10. We measure perfect-match accuracy of the full entity name, i.e., there is no partial credit for partial token matches.

The four models in our comparison are now treated as sequence-to-sequence encoders and extended with a matching decoder of the same size, i.e., LSTM decoders for LSTM encoders (LSTM2LSTM) and randomly initialized Transformers for Transformer encoders (RoBERTa2Transformer and Transformer2Transformer).

Unlike symbolic representation, surface form representation is able to handle the entire set of 6 million Wikipedia entities. As with symbolic representation, the randomly initialized Transformer model (Fig. 3, dash-dotted red line) has the highest capacity, memorizing up to 10 million statements with 90 percent accuracy. A pretrained LM as encoder (RoBERTa2Transformer) appears to have a deleterious effect, with much lower accuracies compared to the randomly initialized Transformer2Transformer. While the larger LSTM2LSTM model (1024 hidden units per layer) almost matches the performance of the best Transformer model, the smaller LSTM2LSTM (256 hidden units per layer) has insufficient capacity, memorizing less than 50 percent of 5 million statements correctly.

An analysis of the results produced by the Transformer2Transformer model (Fig. 4) reveals, perhaps unsurprisingly, that statements involving infrequent and long entity mentions are most difficult to memorize. For example, the model fails to memorize most of the entity mentions that occur only in one to ten statements and have a length of 12 or more subword tokens (blue cluster, upper left). We speculate that this drawback of surface form representation can be mitigated by shortening canonical names as much as possible while ensuring a one-to-one mapping between entities and names, but leave this to future work.
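A sketch of such an encoder-decoder setup and of the perfect-match evaluation, using the transformers library's EncoderDecoderModel as a stand-in for a RoBERTa2Transformer-style configuration (this is not the paper's exact implementation; training is omitted and all names are ours):

```python
from transformers import AutoTokenizer, EncoderDecoderModel

tok = AutoTokenizer.from_pretrained("roberta-base")
# RoBERTa encoder paired with a RoBERTa-shaped decoder whose cross-attention is newly initialized.
model = EncoderDecoderModel.from_encoder_decoder_pretrained("roberta-base", "roberta-base")

def generate_name(query: str) -> str:
    """Decode an entity name for a masked query via beam search with beam size 10."""
    enc = tok(query, return_tensors="pt")
    out = model.generate(
        enc["input_ids"], num_beams=10, max_length=16,
        decoder_start_token_id=tok.cls_token_id, pad_token_id=tok.pad_token_id,
    )
    return tok.decode(out[0], skip_special_tokens=True).strip()

def perfect_match_accuracy(queries, gold_names):
    """No partial credit: the generated name must match the gold entity name exactly."""
    hits = sum(generate_name(q) == gold for q, gold in zip(queries, gold_names))
    return hits / len(queries)

# After finetuning on the synthetic statements, one would evaluate e.g.:
# perfect_match_accuracy(["Bert is a character on <mask>."], ["Sesame Street"])
```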
Figure 3: Accuracy of statement memorization with object entities represented by surface forms (x-axis: number of statements, 6 million entities; y-axis: memorization accuracy; models: LSTM2LSTM (1024), LSTM2LSTM (256), RoBERTa2Transformer, Transformer2Transformer).

Figure 4: Error analysis of statements memorized via surface form representation. Correctly memorized objects orange, wrong ones blue. Selected clusters are annotated with the name of the corresponding entity (green). Large frequencies clipped to a maximum value of 200, jitter applied for visual clarity.

With continuous representation, each entity e_i, i ∈ [1, N_entities], is represented by a d-dimensional embedding y_i ∈ R^d. After encoding the given query with the LM, prediction is performed by projecting the final hidden state corresponding to the [MASK] token onto R^d, thereby obtaining the predicted embedding ŷ ∈ R^d. We use fixed, pretrained entity embeddings and train with the cosine loss L = 1 − cos(ŷ, y_i). At test time, the model prediction ŷ is mapped to the closest pretrained entity embedding y_i via approximate nearest-neighbor search (Johnson et al., 2017).

Continuous prediction with fixed, pretrained embeddings. When training randomly initialized embeddings with a cosine similarity or Euclidean distance objective, a degenerate solution is to make all embeddings the same, e.g., all-zero vectors. To prevent this, it is common practice to use negative samples (Bordes et al., 2013). When training with fixed, pretrained embeddings as supervision signal, negative sampling is not necessary, since the target embeddings are not updated during training and therefore cannot become degenerate.
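The loss and the nearest-neighbor lookup could look as follows, assuming PyTorch for the cosine loss and FAISS (Johnson et al., 2017) for the search; the embedding matrix here is a random placeholder and the scale is reduced to keep the sketch small:

```python
import numpy as np
import torch
import torch.nn.functional as F
import faiss

d, n_entities = 64, 100_000  # the paper uses 6 million entities; fewer here for the sketch
entity_embs = np.random.randn(n_entities, d).astype("float32")  # stand-in for pretrained embeddings

def cosine_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L = 1 - cos(y_hat, y_i), averaged over the batch."""
    return (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()

# Cosine similarity search is inner-product search over L2-normalized vectors.
normed = entity_embs.copy()
faiss.normalize_L2(normed)
index = faiss.IndexFlatIP(d)  # exact search; FAISS also provides approximate indexes (IVF, HNSW)
index.add(normed)

def nearest_entity(pred_emb: torch.Tensor, k: int = 1) -> np.ndarray:
    """Map a predicted embedding to the id(s) of the closest pretrained entity embedding(s)."""
    q = pred_emb.detach().cpu().numpy().astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    _, ids = index.search(q, k)
    return ids[0]

print(nearest_entity(torch.randn(d)))  # e.g. [42137]
```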
Figure 5: Accuracy of statement memorization with continuous entity representation (x-axis: number of statements, 2M to 10M, 6 million entities; y-axis: memorization accuracy; models: LSTM 1024, LSTM 256, RoBERTa-base, RoBERTa-base without pretraining).

Figure 6: Error analysis of a subsample of 5 million statements memorized by a randomly initialized Transformer with continuous representation (entity mention frequency for relation objects not memorized, 195k mentions, vs. memorized, 805k mentions).
Wikidata embeddings. We train embeddings for 6 million Wikidata entities using feature-specific autoencoders to encode entity features such as names, aliases, descriptions, entity types, and numeric attributes. This approach follows prior work on multimodal KB embeddings (Pezeshkpour et al., 2018) and learning of KB embeddings with autoencoders (Takahashi et al., 2018). Embedding training is detailed in Appendix D.
Results. Fig. 5 shows memorization accuracies achieved with continuous representation. Like surface form representation, continuous representation scales to 6 million entities, and we see the same relative order of models, but with overall lower accuracies. RoBERTa without pretraining has the highest capacity for storing world knowledge statements, memorizing 67 percent of 10 million statements, while the small LSTM 256 model has the lowest capacity, memorizing 42 percent. Although far from fully understood, sequence-to-sequence architectures are relatively mature, with highly optimized toolkits and hyperparameter settings publicly available (Ott et al., 2019). In contrast, prediction of continuous representations is still in an early stage of research (Kumar and Tsvetkov, 2019). Compared to surface form representation, we therefore see the results presented in this subsection as lower bounds for LM capacity with continuous representations.

By design, memorization with continuous representations does not rely on entity names, and hence, in contrast to surface form representation, does not lead to difficulties in handling entities with long names. However, as with surface form representation, infrequent entities are more difficult to memorize than frequent ones. As shown in Fig. 6, most of the memorization errors (blue, left) involve infrequent entities with a median frequency of 3, while most of the correctly memorized statements (orange, right) involve entities that occur more than 100 times.
We now turn to the question of how model capacity scales with model size (Figure 7). For a 12-layer Transformer with layer size 96 or 192 (top subfigure, solid red and dashed green lines), memorization accuracy quickly drops as the number of facts to memorize increases. As expected, larger models are able to memorize more facts, but accuracy drops rather quickly, e.g., to 65 percent of 3 million facts memorized with a layer size of 384 (dotted orange line, 2nd from top).

Assuming a desired memorization accuracy level, e.g., 80 percent, we analyze the maximum number of facts a model of a given size can memorize at that level (Figure 7, bottom). For the model sizes considered here, storage capacity appears to scale linearly, with a model of layer size 384 (55M parameters) able to store one million facts, and a model of layer size 960 (160M parameters) storing up to 7 million facts.

Apart from the number of facts to be stored, we hypothesize that storage capacity depends on two more factors: the number of entities involved and the entropy of their distribution. As expected, a large entity vocabulary makes memorization more difficult (Table 2). The impact of entity vocabulary size is smaller with surface representation (2 percent drop), while for continuous representation, memorization accuracy drops from 85 percent to 79 percent as the vocabulary size increases from 1 to 6 million entities. We also observe an impact of the entity distribution, with an example given in Appendix F, but leave a more detailed analysis to future work.

Figure 7: Scaling of model capacity with model size. The model is a 12-layer Transformer with continuous representation of 6 million entities. The top figure shows the decrease in memorization accuracy as the number of facts to be stored in a model of a given size increases (x-axis: number of statements, 2M to 10M; lines: hidden layer size). The bottom figure shows the maximum number of facts a model of a given layer size (and parameter count) can memorize with an accuracy of 80 percent.
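The capacity-at-threshold analysis behind the bottom panel of Figure 7 can be sketched as follows; some of the placeholder values echo numbers reported above, the rest are made up, and only the procedure (largest fact count memorized at ≥80% accuracy per model size, plus a linear fit) follows the description:

```python
import numpy as np

# Hypothetical measurements: parameter count -> {number of facts: memorization accuracy}.
runs = {
    55e6:  {1e6: 0.82, 3e6: 0.65, 5e6: 0.51},   # layer size 384
    125e6: {1e6: 0.95, 3e6: 0.88, 5e6: 0.79},   # layer size 768
    160e6: {5e6: 0.84, 7e6: 0.80, 10e6: 0.71},  # layer size 960
}

def max_facts_at(accuracies: dict, threshold: float = 0.80) -> float:
    """Largest number of facts memorized with at least `threshold` accuracy."""
    return max((n for n, acc in accuracies.items() if acc >= threshold), default=0.0)

params = np.array(sorted(runs))
capacity = np.array([max_facts_at(runs[p]) for p in params])

# Linear fit of capacity vs. parameter count ("assuming linear scaling").
slope, intercept = np.polyfit(params, capacity, deg=1)
print(f"fitted capacity of a 1B-parameter model: {slope * 1e9 + intercept:,.0f} facts")
```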
Representation | Accuracy (1M entities) | Accuracy (6M entities)
Symbolic | 0.97 | n/a
Surface | 0.92 | 0.90
Continuous | 0.85 | 0.79

Table 2: Impact of entity vocabulary size on model capacity. The model is a 12-layer Transformer, hidden layer size 768, memorizing 1 million facts.

So far, we saw that it is possible to store millions of facts in a LM by finetuning the model to predict the masked object of simple English statements like "Barack Obama was born in [MASK]." However, given the large number of model parameters and the effort necessary to train them, mere storage is not a compelling achievement: The underlying relation triples, in this case ⟨Barack Obama, wasBornIn, Hawaii⟩, can easily be stored more compactly and with 100 percent accuracy in a symbolic knowledge graph.

One of the potential benefits of the LM-as-KB paradigm is the LM's ability to handle paraphrases. If the LM's representation of the statement above is sufficiently similar to its representation of queries like "Barack Obama is from [MASK]" or even "Where is Barack Obama from? [MASK]", it is conceivable that this similarity allows transfer from the memorized statement to these unseen queries. Is this soft querying of facts stored in a LM possible? In this section we conduct a controlled experiment to answer this question, expecting one of the following three outcomes:
1. Rote memorization. The model memorizes statements with little or no abstraction, so that even small, meaning-preserving changes to the query prevent the model from recalling the correct object.
2. Generic association. The model memorizes pairs of subject and object entities with little or no consideration of the predicate. For example, the model will always predict Hawaii whenever the query contains the phrase Barack Obama, regardless of context. This pathological behaviour could be especially prevalent if the distribution of object entities co-occurring with a particular subject is dominated by a single object.
3. Fact memorization. The model memorizes facts expressed in statements by forming abstractions corresponding to entities and predicates. This would allow retrieving a fact with a variety of queries.

The results presented in previous sections already established that a model of sufficient size is able to perform rote memorization of millions of statements. We now design an experiment to test whether LMs are capable of fact memorization, while taking care to distinguish this capability from generic association.

To repeat, our goal is to test if a LM that has memorized a statement like "Barack Obama was born in Hawaii." can transfer this knowledge to answer a query like "Barack Obama is from [MASK]." Conveniently, wasBornIn relations are among the most frequent in Wikidata and hold for a diverse set of subject and object entities. This diversity of entities makes this predicate a good candidate for our case study, since statements involving a predicate with a less diverse set of possible subject or object entities are easier to memorize. For example, with the predicate isA and relations like ⟨Barack Obama, isA, human⟩, the model would do well by always predicting human if the subject mention matches a frequent person name pattern like two capitalized words.

Statements and controls. We randomly sample 100k statements generated by the "S was born in O" template. Since the mapping from S (i.e., mentions of persons) to O (i.e., locations) is injective, the model could take the shortcut of memorizing statements via generic association and wrongly answer any query involving entity S, e.g., "Barack Obama is a [MASK]", with the associated entity O, i.e., Hawaii. To prevent this shortcut, we introduce control facts. Given a fact ⟨S, P, O⟩, its control ⟨S, P', O'⟩ involves the same subject S, but a distinct predicate P' and object O'. For example, a control for the fact ⟨Albert Einstein, wasBornIn, Ulm⟩ is the fact ⟨Albert Einstein, diedIn, Princeton⟩. We add 100k control statements generated from the template "S died in O" and train RoBERTa-base to memorize all 200k statements with 99 percent accuracy. The combination of statements and control statements prevents the model from relying on generic association: To correctly answer the query "Albert Einstein died in [MASK].", the model needs to take into account the predicate, since two distinct object entities are associated with Albert Einstein.

Target query variants. Next, we collect target query variants, such as "S is from O" (row labels in Fig. 8). Expecting good transfer for variants that are very similar to the original statement template, we include variants with small changes, such as varying punctuation or prepositions. To include more diverse variants, we select frequent relation patterns, e.g., "S (b. 1970, O)" and "born in O, S is a", from the "place of birth" and "place of death" portions of the Google-RE corpus (https://github.com/google-research-datasets/relation-extraction-corpus), as well as a query in question form. Finally we add irrelevant distractors ("It is true that, S was born in O") and misleading ones ("S was born in O, but died somewhere else"). From each query variant template, we generate 100k query variants using the same entity pairs to fill the S and O slots as for the original statements. To balance the distribution between statements and control statements when finetuning towards target queries (see next paragraph), we also create a matching number of query variant templates and generate a matching number of control statements. Our experiments, which test whether a LM is able to transfer a memorized fact to given target paraphrases, can be seen as the converse of the probing setup by Jiang et al. (2019), which aims to find the best paraphrase for querying a given fact from a LM.
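The construction of statements and controls can be sketched as follows; the birth and death facts below are toy placeholders (the paper samples 100k of each from the Wikidata-derived statements):

```python
# One birthplace and one place of death per subject, e.g. sampled from Wikidata.
born_in = {"Barack Obama": "Hawaii", "Albert Einstein": "Ulm"}
died_in = {"Albert Einstein": "Princeton"}  # control predicate P' with object O' != O

statements, controls = [], []
for subj, birthplace in born_in.items():
    statements.append(f"{subj} was born in {birthplace}")
    if subj in died_in and died_in[subj] != birthplace:
        # Control fact <S, diedIn, O'>: same subject, distinct predicate and object,
        # so the model cannot rely on a generic S -> O association.
        controls.append(f"{subj} died in {died_in[subj]}")

print(statements)  # ['Barack Obama was born in Hawaii', 'Albert Einstein was born in Ulm']
print(controls)    # ['Albert Einstein died in Princeton']
```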
Transfer results. We evaluate knowledge transfer from memorized statements to query variants using pretrained RoBERTa-base (Fig. 8, left), measuring accuracy over the 100k statements generated with the target query variant template. To measure the effect pretraining has on paraphrasing ability, we compare to RoBERTa-base without pretraining (Fig. 8, right). We consider zero-shot transfer, i.e., without any finetuning towards the target query variant, and a finetuning setting, in which the LM is first trained to memorize all 100k original statements, and then finetuned until it memorizes a small number of statements in the target query format.

In the zero-shot setting (leftmost column), even small variations to the query lead to a drop in fact recall: Adding an ellipsis (4th row) causes the model to answer 95% of queries correctly, a 3% drop from the 98% memorization accuracy of the original statements (first row). Adding an exclamation mark (5th row) has an even larger effect, resulting in an 8% drop. For two paraphrases, namely the relative clausal "S, who is from O" (7th row) and "S is from O", zero-shot transfer works only in about 35% and 20% of cases. The question format (11th row) allows zero-shot transfer with 32% accuracy. For the remaining paraphrases, e.g., those with parentheticals or the distractor "died", zero-shot transfer is poor, with accuracies ranging from 3% to 13%.

A clear overall trend is visible: Zero-shot transfer works best for similar statements and worst for dissimilar ones. To quantify this trend, we compute a representation of a statement template by averaging over its 100k mean-pooled, LM-encoded statements, and then measure the Euclidean distance between the original template representation and the target query variant representations. Correlating Euclidean distance and accuracy of zero-shot transfer, we obtain a Pearson coefficient indicating a strong negative correlation between distance and knowledge transfer. In other words, transfer tends to work well for paraphrased queries the LM deems similar to the originally memorized statement. Conversely, transfer fails in case the LM's representation of a query is too dissimilar to its representation of the original statement.

Figure 8: Transfer from memorized statements ("S was born in O") to query variants. Rows (query variants): S was born in O; S was born in O .; S was born at O; S was born in O ...; S was born in O !; S, who was born in O; S, who is from O; S was born in O, but they did not die there; S (born in 1970 in O); S is from O; Where was S born? O; S (b. 1970, O); S was born in O, but died somewhere else; It is true that S was born in O; It is known that S was born in O; According to Wikidata, S was born in O; born in O, S is a. Columns: number of facts used for finetuning towards the query variant. Cells: accuracy of recalling entity O, shown for pretrained RoBERTa-base and for RoBERTa without pretraining.
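A sketch of the template-similarity analysis described above; the mean-pooling over RoBERTa token states, the Euclidean distances, and the Pearson correlation follow the description, while the function names and the assumed inputs (statements and zero-shot accuracies per template) are illustrative:

```python
import numpy as np
import torch
from scipy.stats import pearsonr
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
lm = AutoModel.from_pretrained("roberta-base").eval()

@torch.no_grad()
def template_representation(statements):
    """Average of mean-pooled LM encodings over all statements of one template."""
    reps = []
    for i in range(0, len(statements), 32):
        enc = tok(statements[i:i + 32], return_tensors="pt", padding=True, truncation=True)
        hidden = lm(**enc).last_hidden_state                # (B, T, H)
        mask = enc["attention_mask"].unsqueeze(-1)          # (B, T, 1)
        reps.append((hidden * mask).sum(1) / mask.sum(1))   # mean over real tokens
    return torch.cat(reps).mean(0).numpy()

def distance_accuracy_correlation(statements_by_template, zero_shot_acc, original):
    """Correlate distance to the original template with zero-shot transfer accuracy."""
    orig_rep = template_representation(statements_by_template[original])
    dists, accs = [], []
    for tmpl, stmts in statements_by_template.items():
        dists.append(np.linalg.norm(template_representation(stmts) - orig_rep))
        accs.append(zero_shot_acc[tmpl])
    r, _ = pearsonr(dists, accs)
    return r  # expected to be strongly negative
```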
This trend is also reflected in the finetuning setting, with less similar variants requiring up to 500 instances until the model achieves 90 percent accuracy (last row), while for more similar variants transfer already works well after finetuning on 5 to 50 target instances.

When using RoBERTa without pretraining to memorize statements, knowledge transfer to query variants is much worse. While transfer still works for the most similar variants (right, top rows), less similar variants require more finetuning instances compared to pretrained RoBERTa (right, middle rows). Transfer does not work for some of the least similar variants, with accuracies as low as 1 to 4 percent even after finetuning with 500 instances (right, bottom rows). Similar results for control statements are presented in Figure 9. We take these results as evidence that pretraining gives LMs the ability to handle paraphrased queries well, and that LMs are able to memorize facts beyond mere rote memorization and generic association.

Figure 9: Transfer from memorized statements ("S died in O") to query variants. Rows (query variants): S died in O; S died in O ...; S died at O; S died in O .; S died in O !; S, who died in O; S died in O, but they were not born there; S died in O, but was born somewhere else; S (died 2010 in O); S (d. 2010, O); It is true that S died in O; It is known that S died in O; According to Wikidata, S died in O; Where did S die? O; After their death in O, S was; S, who spent the last days of their life in O; S spent the last days of their life in O. Columns: number of facts used for finetuning towards the query variant. Cells: accuracy of recalling entity O, shown for pretrained RoBERTa-base and for RoBERTa without pretraining.
Limitations. This work is not without limitations. We only consider one knowledge graph, Wikidata, in our experiments. Arguably, as the largest publicly available source of world knowledge, Wikidata is the most promising resource for equipping LMs with such knowledge, but attempts to store different knowledge graphs in a LM might result in different outcomes than the ones presented here. For example, certain types of graphs, such as uniform random graphs, are easier for a LM to memorize than others, such as scale-free graphs (see Appendix F).

While we use language like "train a LM to memorize statements" for simplicity throughout this work, what we do in the case of pretrained LMs is more akin to adaptive pretraining (Gururangan et al., 2020). It is possible that integrating entity supervision directly into LM pretraining (Févry et al., 2020) allows more efficient fact storage.

Our analysis was entirely focused on entity representations and ignored the question of how to represent relation predicates or entire relation triples. Here, incorporating relation learning (Baldini Soares et al., 2019) and learning to represent relation triples in a LM, e.g., from large, fact-aligned corpora (Elsahar et al., 2018), are exciting avenues for future work.

Finally, we formulated the LM-as-KB paradigm in terms of storing and retrieving relation triples. While structured KBs such as Wikidata indeed consist of such triples, and hence our experiments showing storage and retrieval of triples in LMs are sufficient as a proof-of-concept in principle, structured KBs also allow more complex queries than the ones considered here, such as 1-to-n relations, multi-hop inference, queries involving numerical ranges, or facts qualified by time and location (Hoffart et al., 2013).
Conclusions and outlook. In this work, we give a positive answer to Petroni et al. (2019)'s question of whether language models can serve as knowledge bases. We argued that treating LMs as KBs requires representing a large number of entities, storing a large number of facts, and the ability to query a given fact with a variety of queries. We then showed that current LM architectures fulfill these requirements when extended with a component for representing entities. In addition to the ability to handle paraphrased queries, we envision further benefits from the LM-as-KB paradigm. For example, the fact-memorization and paraphrase-finetuning setting introduced in Section 5 allows precise control over which facts a LM learns during training, while it is much less clear which facts are contained in unstructured text. Selecting paraphrases to increase the variety with which a LM can be queried is an interesting problem for future work. For example, selecting maximally dissimilar paraphrases and choosing the number of finetuning instances by similarity may be more efficient than finetuning on large numbers of paraphrases in brute-force fashion.
References
Sungjin Ahn, Heeyoul Choi, Tanel Prnamaa, andYoshua Bengio. 2016. A neural knowledge languagemodel.Livio Baldini Soares, Nicholas FitzGerald, JeffreyLing, and Tom Kwiatkowski. 2019. Matching theblanks: Distributional similarity for relation learn-ing. In
Proceedings of the 57th Annual Meetingof the Association for Computational Linguistics ,pages 2895–2905, Florence, Italy. Association forComputational Linguistics.Emily M. Bender and Alexander Koller. 2020. Climb-ing towards NLU: On meaning, form, and under-standing in the age of data. In
Proceedings of ACL2020 (to appear) .Yoshua Bengio, R´ejean Ducharme, Pascal Vincent, andChristian Jauvin. 2003. A neural probabilistic lan-guage model.
Journal of machine learning research ,3(Feb):1137–1155.Jon Louis Bentley. 1975. Multidimensional binarysearch trees used for associative searching.
Commu-nications of the ACM , 18(9):509–517.Jonathan Berant, Andrew Chou, Roy Frostig, and PercyLiang. 2013. Semantic parsing on Freebase fromquestion-answer pairs. In
Proceedings of the 2013Conference on Empirical Methods in Natural Lan-guage Processing , pages 1533–1544, Seattle, Wash-ington, USA. Association for Computational Lin-guistics.Antoine Bordes, Nicolas Usunier, Alberto Garcia-Dur´an, Jason Weston, and Oksana Yakhnenko.2013. Translating embeddings for modeling multi-relational data. In
Proceedings of the 26th Interna-tional Conference on Neural Information ProcessingSystems - Volume 2 , NIPS13, page 27872795, RedHook, NY, USA. Curran Associates Inc.om B. Brown, Benjamin Mann, Nick Ryder, MelanieSubbiah, Jared Kaplan, Prafulla Dhariwal, ArvindNeelakantan, Pranav Shyam, Girish Sastry, AmandaAskell, Sandhini Agarwal, Ariel Herbert-Voss,Gretchen Krueger, Tom Henighan, Rewon Child,Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,Clemens Winter, Christopher Hesse, Mark Chen,Eric Sigler, Mateusz Litwin, Scott Gray, BenjaminChess, Jack Clark, Christopher Berner, Sam Mc-Candlish, Alec Radford, Ilya Sutskever, and DarioAmodei. 2020. Language models are few-shot learn-ers.Danqi Chen, Adam Fisch, Jason Weston, and AntoineBordes. 2017. Reading Wikipedia to answer open-domain questions. In
Proceedings of the 55th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1: Long Papers) , pages 1870–1879, Vancouver, Canada. Association for Computa-tional Linguistics.Sam Coppens, Miel Vander Sande, Ruben Verborgh,Erik Mannens, and Rik Van de Walle. 2013. Rea-soning over sparql. In
LDOW .Leon Derczynski, Eric Nichols, Marieke van Erp, andNut Limsopatham. 2017. Results of the WNUT2017shared task on novel and emerging entity recogni-tion. In
Proceedings of the 3rd Workshop on NoisyUser-generated Text , pages 140–147, Copenhagen,Denmark. Association for Computational Linguis-tics.Tim Dettmers, Minervini Pasquale, Stenetorp Pon-tus, and Sebastian Riedel. 2018. Convolutional 2dknowledge graph embeddings. In
Proceedings ofthe 32th AAAI Conference on Artificial Intelligence ,pages 1811–1818.Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. BERT: Pre-training ofdeep bidirectional transformers for language under-standing. In
Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers) ,pages 4171–4186, Minneapolis, Minnesota. Associ-ation for Computational Linguistics.Jeffrey L. Elman. 1990. Finding structure in time.
Cog-nitive Science , 14(2):179–211.Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci,Christophe Gravier, Jonathon Hare, FrederiqueLaforest, and Elena Simperl. 2018. T-REx: A largescale alignment of natural language with knowledgebase triples. In
Proceedings of the Eleventh Interna-tional Conference on Language Resources and Eval-uation (LREC 2018) , Miyazaki, Japan. EuropeanLanguage Resources Association (ELRA).Thibault F´evry, Livio Baldini Soares, Nicholas FitzGer-ald, Eunsol Choi, and Tom Kwiatkowski. 2020. En-tities as experts: Sparse memory access with entitysupervision.
CoRR , abs/2004.07202. Matthew Francis-Landau, Greg Durrett, and Dan Klein.2016. Capturing semantic similarity for entity link-ing with convolutional neural networks. In
Proceed-ings of the 2016 Conference of the North Ameri-can Chapter of the Association for ComputationalLinguistics: Human Language Technologies , pages1256–1261, San Diego, California. Association forComputational Linguistics.Daniel Gillick, Sayali Kulkarni, Larry Lansing,Alessandro Presta, Jason Baldridge, Eugene Ie, andDiego Garcia-Olano. 2019. Learning dense repre-sentations for entity retrieval. In
Proceedings ofthe 23rd Conference on Computational Natural Lan-guage Learning (CoNLL) , pages 528–537, HongKong, China. Association for Computational Lin-guistics.Suchin Gururangan, Ana Marasovi, SwabhaSwayamdipta, Kyle Lo, Iz Beltagy, Doug Downey,and Noah A. Smith. 2020. Don’t stop pretraining:Adapt language models to domains and tasks.Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasu-pat, and Ming-Wei Chang. 2020. Realm: Retrieval-augmented language model pre-training.Sepp Hochreiter and J¨urgen Schmidhuber. 1997.Long short-term memory.
Neural Comput. ,9(8):17351780.Johannes Hoffart, Yasemin Altun, and GerhardWeikum. 2014. Discovering emerging entities withambiguous names. In
Proceedings of the 23rd inter-national conference on World wide web, WWW 2014,Seoul, South Korea , pages 385–396.Johannes Hoffart, Fabian M Suchanek, KlausBerberich, and Gerhard Weikum. 2013. Yago2: Aspatially and temporally enhanced knowledge basefrom wikipedia.
Artificial Intelligence , 194:28–61.Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bor-dino, Hagen F¨urstenau, Manfred Pinkal, Marc Span-iol, Bilyana Taneva, Stefan Thater, and GerhardWeikum. 2011. Robust disambiguation of named en-tities in text. In
Proceedings of the 2011 Conferenceon Empirical Methods in Natural Language Process-ing , pages 782–792, Edinburgh, Scotland, UK. Asso-ciation for Computational Linguistics.Jeremy Howard and Sebastian Ruder. 2018. Universallanguage model fine-tuning for text classification. In
Proceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1:Long Papers) , pages 328–339, Melbourne, Australia.Association for Computational Linguistics.Filip Ilievski, Piek Vossen, and Stefan Schlobach. 2018.Systematic study of long tail phenomena in entitylinking. In
Proceedings of the 27th InternationalConference on Computational Linguistics , pages664–674, Santa Fe, New Mexico, USA. Associationfor Computational Linguistics.hengbao Jiang, Frank F. Xu, Jun Araki, and GrahamNeubig. 2019. How can we know what languagemodels know?
CoRR , abs/1911.12543.Jeff Johnson, Matthijs Douze, and Herv´e J´egou. 2017.Billion-scale similarity search with gpus. arXivpreprint arXiv:1702.08734 .Jared Kaplan, Sam McCandlish, Tom Henighan,Tom B. Brown, Benjamin Chess, Rewon Child,Scott Gray, Alec Radford, Jeffrey Wu, and DarioAmodei. 2020. Scaling laws for neural languagemodels.
CoRR , abs/2001.08361.Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutionalnetworks. In .OpenReview.net.Nikolaos Kolitsas, Octavian-Eugen Ganea, andThomas Hofmann. 2018. End-to-end neural entitylinking. In
Proceedings of the 22nd Conferenceon Computational Natural Language Learning ,pages 519–529, Brussels, Belgium. Association forComputational Linguistics.Sachin Kumar and Yulia Tsvetkov. 2019. Von mises-fisher loss for training sequence to sequence modelswith continuous outputs. In
Proc. of ICLR .Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,Luke Zettlemoyer, and Veselin Stoyanov. 2019.Roberta: A robustly optimized bert pretraining ap-proach.Robert Logan, Nelson F. Liu, Matthew E. Peters, MattGardner, and Sameer Singh. 2019. Barack’s wifehillary: Using knowledge graphs for fact-aware lan-guage modeling. In
Proceedings of the 57th AnnualMeeting of the Association for Computational Lin-guistics , pages 5962–5971, Florence, Italy. Associa-tion for Computational Linguistics.Federico L´opez, Benjamin Heinzerling, and MichaelStrube. 2019. Fine-grained entity typing in hyper-bolic space. In
Proceedings of the 4th Workshopon Representation Learning for NLP (RepL4NLP-2019) , pages 169–180, Florence, Italy. Associationfor Computational Linguistics.Frederic Morin and Yoshua Bengio. 2005. Hierarchi-cal probabilistic neural network language model. In
Aistats , volume 5, pages 246–252.Maximilian Nickel, Volker Tresp, and Hans-PeterKriegel. 2011. A three-way model for collectivelearning on multi-relational data. In
ICML , pages809–816. Omnipress.Yusuke Oda, Philip Arthur, Graham Neubig, KoichiroYoshino, and Satoshi Nakamura. 2017. Neural ma-chine translation via binary code prediction. In
Pro-ceedings of the 55th Annual Meeting of the Associa-tion for Computational Linguistics (Volume 1: Long Papers) , pages 850–860, Vancouver, Canada. Asso-ciation for Computational Linguistics.Myle Ott, Sergey Edunov, Alexei Baevski, AngelaFan, Sam Gross, Nathan Ng, David Grangier, andMichael Auli. 2019. fairseq: A fast, extensibletoolkit for sequence modeling. In
Proceedings ofNAACL-HLT 2019: Demonstrations .Bryan Perozzi, Rami Al-Rfou, and Steven Skiena.2014. Deepwalk: Online learning of social represen-tations. In
Proceedings of the 20th ACM SIGKDDinternational conference on Knowledge discoveryand data mining , pages 701–710.Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and LukeZettlemoyer. 2018. Deep contextualized word rep-resentations. In
Proceedings of the 2018 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics: Human Lan-guage Technologies, Volume 1 (Long Papers) , pages2227–2237, New Orleans, Louisiana. Associationfor Computational Linguistics.Matthew E. Peters, Mark Neumann, Robert Logan, RoySchwartz, Vidur Joshi, Sameer Singh, and Noah A.Smith. 2019. Knowledge enhanced contextual wordrepresentations. In
Proceedings of the 2019 Con-ference on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP) , pages 43–54, Hong Kong, China. Associ-ation for Computational Linguistics.Fabio Petroni, Tim Rockt¨aschel, Sebastian Riedel,Patrick Lewis, Anton Bakhtin, Yuxiang Wu, andAlexander Miller. 2019. Language models as knowl-edge bases? In
Proceedings of the 2019 Confer-ence on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP) , pages 2463–2473, Hong Kong, China. As-sociation for Computational Linguistics.Pouya Pezeshkpour, Liyan Chen, and Sameer Singh.2018. Embedding multimodal relational data forknowledge base completion. In
Proceedings of the2018 Conference on Empirical Methods in Natu-ral Language Processing , pages 3208–3218, Brus-sels, Belgium. Association for Computational Lin-guistics.Nina Poerner, Ulli Waltinger, and Hinrich Schtze. 2019.E-bert: Efficient-yet-effective entity embeddings forbert.Alec Radford, Jeff Wu, Rewon Child, David Luan,Dario Amodei, and Ilya Sutskever. 2019. Languagemodels are unsupervised multitask learners. Techni-cal report, OpenAI.Colin Raffel, Noam Shazeer, Adam Roberts, KatherineLee, Sharan Narang, Michael Matena, Yanqi Zhou,Wei Li, and Peter J Liu. 2019. Exploring the limitsf transfer learning with a unified text-to-text trans-former. arXiv preprint arXiv:1910.10683 .Jonathan Raiman and Olivier Raiman. 2018. Deep-type: Multilingual entity linking by neural type sys-tem evolution. In
Proceedings of the Thirty-SecondAAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial In-telligence (IAAI-18), and the 8th AAAI Symposiumon Educational Advances in Artificial Intelligence(EAAI-18), New Orleans, Louisiana, USA, February2-7, 2018 , pages 5406–5413. AAAI Press.Sebastian Riedel, Limin Yao, Andrew McCallum, andBenjamin M. Marlin. 2013. Relation extraction withmatrix factorization and universal schemas. In
Pro-ceedings of the 2013 Conference of the North Amer-ican Chapter of the Association for ComputationalLinguistics: Human Language Technologies , pages74–84, Atlanta, Georgia. Association for Computa-tional Linguistics.Adam Roberts, Colin Raffel, and Noam Shazeer. 2020.How much knowledge can you pack into the param-eters of a language model?Claude E. Shannon. 1948. A mathematical theory ofcommunication.
Bell Syst. Tech. J. , 27(3):379–423.Daniil Sorokin and Iryna Gurevych. 2018. Model-ing semantics with gated graph neural networks forknowledge base question answering. In
Proceed-ings of the 27th International Conference on Compu-tational Linguistics , pages 3306–3317. Associationfor Computational Linguistics.Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014.Sequence to sequence learning with neural networks.In
Proceedings of the 27th International Conferenceon Neural Information Processing Systems - Volume2 , NIPS14, page 31043112, Cambridge, MA, USA.MIT Press.Ryo Takahashi, Ran Tian, and Kentaro Inui. 2018. In-terpretable and compositional relation learning byjoint training with an autoencoder. In
Proceedingsof the 56th Annual Meeting of the Association forComputational Linguistics (Volume 1: Long Papers) ,pages 2148–2159, Melbourne, Australia. Associa-tion for Computational Linguistics.Th´eo Trouillon, Johannes Welbl, Sebastian Riedel, ´EricGaussier, and Guillaume Bouchard. 2016. Complexembeddings for simple link prediction. In
Proceed-ings of the 33rd International Conference on Inter-national Conference on Machine Learning - Volume48 , page 20712080. JMLR.org.Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In
Advances in neural information pro-cessing systems , pages 5998–6008. Pauli Virtanen, Ralf Gommers, Travis E. Oliphant,Matt Haberland, Tyler Reddy, David Courna-peau, Evgeni Burovski, Pearu Peterson, WarrenWeckesser, Jonathan Bright, St´efan J. van der Walt,Matthew Brett, Joshua Wilson, K. Jarrod Millman,Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones,Robert Kern, Eric Larson, CJ Carey, ˙Ilhan Po-lat, Yu Feng, Eric W. Moore, Jake Vand erPlas,Denis Laxalde, Josef Perktold, Robert Cimrman,Ian Henriksen, E. A. Quintero, Charles R Harris,Anne M. Archibald, Antˆonio H. Ribeiro, Fabian Pe-dregosa, Paul van Mulbregt, and SciPy 1.0 Contribu-tors. 2020. SciPy 1.0: Fundamental Algorithms forScientific Computing in Python.
Nature Methods ,17:261–272.Denny Vrandeˇci´c and Markus Kr¨otzsch. 2014. Wiki-data: a free collaborative knowledgebase.
Commu-nications of the ACM , 57(10):78–85.Wenhan Xiong, Jingfei Du, William Yang Wang, andVeselin Stoyanov. 2019. Pretrained encyclopedia:Weakly supervised knowledge-pretrained languagemodel.Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, andYoshiyasu Takefuji. 2016. Joint learning of the em-bedding of words and entities for named entity dis-ambiguation. In
Proceedings of The 20th SIGNLLConference on Computational Natural LanguageLearning , pages 250–259. Association for Compu-tational Linguistics.Zichao Yang, Phil Blunsom, Chris Dyer, and WangLing. 2017. Reference-aware language models. In
Proceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing , pages1850–1859, Copenhagen, Denmark. Association forComputational Linguistics.Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang,Maosong Sun, and Qun Liu. 2019. ERNIE: En-hanced language representation with informative en-tities. In
Proceedings of the 57th Annual Meet-ing of the Association for Computational Linguis-tics , pages 1441–1451, Florence, Italy. Associationfor Computational Linguistics.
Templates for generating English statements from Wikidata relations
ID | Template
P31 | S is an instance of O
P106 | S has the occupation O
P17 | S belongs to the country O
P131 | S is located in the administrative territorial entity O
P27 | S is citizen of O
P47 | S shares a border with O
P19 | S was born in O
P161 | S has the cast member O
P421 | S is located in time zone O
P166 | S received the award O
P54 | S is a member of the sports team O
P20 | S died in O
P136 | S has the genre O
P69 | S was educated at O
P1412 | S is a language spoken, written or signed in O
P190 | S is a twinned administrative body of O
P641 | S participates in the sport O
P150 | S contains the administrative territorial entity O
P463 | S is a member of O
P735 | S has the given name O
P1343 | S is described by source O
P361 | S is a part of O
P159 | the headquarters of S are located in O
P1344 | S is participant of O
P495 | S has the country of origin O
P39 | S held the position of O
P910 | S has the main category O
P105 | S has the taxon rank O
P527 | S has the part O
P108 | S is employed by O
P279 | S is a subclass of O
P171 | S has the parent taxon O
P140 | S has the religion O
P407 | S is in the O language
P1303 | S plays the instrument O
P1411 | S has been nominated for O
P102 | S is a member of political party O
P3373 | S is a sibling of O
P1376 | S is the capital of O
P509 | S died because of O
P937 | S works in O
P264 | S was produced by the record label O
P119 | S is buried in O
P138 | S is named after O
P530 | S has diplomatic relations with O
P40 | S is a child of O
P155 | S follows O
P276 | S is located in O
P156 | S is followed by O
P36 | S has the capital O
P1196 | S has the manner of death O
P127 | S is owned by O
P101 | S works in the field O
P607 | S participated in the conflict O
P364 | S is a film or TV show with the original language O
P6379 | S has works in the collection O
P1346 | S is a winner of the O
P22 | S is the father of O
P137 | S is operated by O
P413 | S plays the position O
P26 | S is spouse of O
P1830 | S is owner of O
P1454 | S has the legal form O
P206 | S is located in or next to body of water O
P710 | S is a participant of O
P1441 | S is present in the work O
P1532 | S represents O when playing sport
P86 | S was composed by O
P840 | S is set in the location O
P172 | S belongs to the ethnic group O
P175 | S is performed by O
P57 | S is directed by O
P1889 | S is different from O
P162 | S is produced by O
P118 | S belongs to the league O
P58 | S is screenwritten by O
P551 | S has the residence O
P103 | S has the native language O
P2789 | S connects with O
P750 | S has the distributor O
P725 | S is voiced by O
P272 | S is produced by the company O
P112 | S was founded by O
P452 | S belongs to the industrial sector O
P81 | S is connected to line O
P97 | S has noble title O
P740 | S formed in the location O
P360 | S is a list of O
P793 | S is associated with the significant event O
P915 | S was filmed at O
P410 | S has military rank O
P1001 | S applies to the jurisdiction of O
P30 | S is located on the continent O
P749 | S has parent organization O
P1435 | S has heritage designation O
P53 | S belongs to the family of O
P400 | S was developed for the platform O
P921 | S has the main subject O
P37 | S has the official language O
P734 | S has the family name O
Table 3: Templates used to generate English statements from Wikidata facts.
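To make the template mechanism concrete, the following minimal sketch shows how a statement is generated from a Wikidata triple whose subject and object labels have already been resolved. The TEMPLATES dictionary and the generate_statement function are illustrative names, not taken from the actual implementation.

```python
import re

# Each Wikidata property ID maps to an English pattern in which the standalone
# tokens S and O are placeholders for the subject and object entity labels.
TEMPLATES = {
    "P31": "S is an instance of O",
    "P19": "S was born in O",
    "P1376": "S is the capital of O",
    # ... one entry per property in Table 3
}

def generate_statement(subject_label: str, property_id: str, object_label: str) -> str:
    """Fill the property's template with the subject and object labels."""
    template = TEMPLATES[property_id]
    # Substitute the standalone S and O placeholders in a single pass, so that
    # entity labels containing the letters S or O are never touched.
    return re.sub(
        r"\b[SO]\b",
        lambda m: subject_label if m.group() == "S" else object_label,
        template,
    )

# Example: the fact (Paris, P1376, France)
print(generate_statement("Paris", "P1376", "France"))
# -> "Paris is the capital of France"
```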
Random sample of English statements generated from Wikidata relations
• The Underfall Yard is followed by English Electric Part One
• Gazi Beg is a child of Farrukh Yassar
• George Best A Tribute is performed by Peter Corry
• Gamecock Media Group is owned by SouthPeak Games
• Nennslingen is located in or next to body of water Anlauter
• Shock to the System is a part of Cyberpunk
• Ramya Krishnan has the spouse Krishna Vamsi
• The Cloud Minders follows The Way to Eden
• Curve is followed by Somethingness
• Austin Road is named after John Gardiner Austin
• Dione juno has the parent taxon Dione
• Spirit Bound Flesh is followed by The Wake
• Sidnei da Silva has the given name Sidnei
• In Memoriam is performed by Living Sacrifice
• Tracks and Traces is followed by Live 1974
• Grumman Gulfstream I is operated by Phoenix Air
• Timeline of Quebec history has the part Timeline of Quebec history (1982–present)
• Edwin C. Johnson held the position of Lieutenant Governor of Colorado
• Here Comes the Summer follows Jimmy Jimmy
• In Custody is screenwritten by Anita Desai
• Bertie Charles Forbes is the father of Malcolm Forbes
• The Mambo Kings has the cast member Helena Carroll
• Carnival of Souls has the cast member Art Ellison
• John Harley is the father of Edward Harley, 5th Earl of Oxford and Earl Mortimer
• Jane Fellowes, Baroness Fellowes has the spouse Robert Fellowes, Baron Fellowes
• Francis of Assisi is buried in Basilica of San Francesco d’Assisi
• Makabana Airport is named after Makabana
• Calvin Booth was born in Reynoldsburg
• The Telltale Head is followed by Life on the Fast Lane
• Alajos Keserű is a sibling of Ferenc Keserű
• Long An contains the administrative territorial entity Châu Thành
Hyperparameter settings
Entity representation / Architecture — Hyperparameter: Value

Symbolic / LSTM
    layers: 2
    hidden size: 256, 1024
    dropout: 0.0
    learning rate: 0.001
    lr-scheduler: plateau
    optimizer: Adam

Symbolic / Transformer
    model name: RoBERTa-base
    layers: 12
    hidden size: 768
    learning rate: 5e-5
    lr-scheduler: plateau
    optimizer: Adam

Surface form / LSTM
    layers (enc): 2
    hidden size (enc): 256, 1024
    layers (dec): 2
    hidden size (dec): 256, 1024
    learning rate: 0.001
    lr-scheduler: plateau
    optimizer: Adam

Surface form / Transformer
    model name (enc): RoBERTa-base
    layers (enc): 12
    hidden size (enc): 768
    dropout: 0.0
    model name (dec): random init.
    layers (dec): 12
    hidden size (dec): 768
    learning rate: 5e-4
    lr-scheduler: inverse sqrt
    optimizer: Adam

Continuous / LSTM
    layers: 2
    hidden size: 256, 1024
    dropout: 0.0
    learning rate: 0.001
    lr-scheduler: plateau
    optimizer: Adam
    entity emb. dim: 64
    entity emb. trainable: no

Continuous / Transformer
    model name: RoBERTa-base
    layers: 12
    hidden size: 768
    learning rate: 5e-5
    lr-scheduler: plateau
    optimizer: Adam
    entity emb. dim: 64
    entity emb. trainable: no
Table 4: Hyperparameter settings used in our experiments.
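For reference, the Transformer settings used with the continuous representation can be summarized in a configuration sketch like the one below; the dictionary keys are our own illustrative names rather than an actual configuration file.

```python
# Sketch of the Transformer configuration for the continuous entity
# representation (cf. Table 4). Key names are illustrative only.
continuous_transformer_config = {
    "model_name": "roberta-base",    # pretrained encoder
    "num_layers": 12,
    "hidden_size": 768,
    "learning_rate": 5e-5,
    "lr_scheduler": "plateau",       # reduce LR when validation loss plateaus
    "optimizer": "adam",
    "entity_emb_dim": 64,            # dimensionality of the target entity embeddings
    "entity_emb_trainable": False,   # embeddings are kept fixed during training
}
```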
Embeddings of Wikidata entities
Figure 10: Training embeddings of Wikidata entities with feature-specific autoencoders.
We train the embedding of a given Wikidata entity by collecting its features from Wikidata, encoding each feature to obtain a dense feature representation, and then concatenating the feature representations. For textual features, we use RoBERTa-base as encoder and train corresponding decoders in a standard sequence-to-sequence auto-encoding setup. For quantities, we select the 100 most common quantity types to obtain a fixed-sized representation and then follow a standard auto-encoding setup. Similarly, we obtain a fixed-size entity type representation by selecting the 1000 most common entity types. The concatenated feature representations are then compressed to embedding size d, using a separate autoencoder. Preliminary experiments with embedding sizes d ∈ { , , , } showed similar memorization accuracies for all d, but faster convergence for smaller sizes. We set d = 64 in our main experiments.
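A minimal sketch of the final compression step is given below, assuming the per-feature representations (text, quantities, entity types) have already been computed. The module name, feature dimensionalities, and single-layer encoder/decoder are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class FeatureCompressionAutoencoder(nn.Module):
    """Compress concatenated feature representations into a d-dimensional
    entity embedding (d = 64 in the main experiments) and reconstruct them."""

    def __init__(self, text_dim=768, quantity_dim=100, type_dim=1000, d=64):
        super().__init__()
        in_dim = text_dim + quantity_dim + type_dim
        self.encoder = nn.Linear(in_dim, d)
        self.decoder = nn.Linear(d, in_dim)

    def forward(self, text_repr, quantity_repr, type_repr):
        # Concatenate the feature-specific representations of one entity.
        features = torch.cat([text_repr, quantity_repr, type_repr], dim=-1)
        entity_emb = self.encoder(features)          # the entity embedding
        reconstruction = self.decoder(entity_emb)    # auto-encoding target
        return entity_emb, reconstruction

# Training minimizes a reconstruction loss, e.g.:
# loss = nn.functional.mse_loss(reconstruction, features.detach())
```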
Things that didn’t work

E.1 Hierarchical entity representation with binary codes
Since imposing a hierarchy is a common method for dealing with large vocabulary sizes (Morin and Bengio, 2005) in general, and large inventories of entities and entity types in particular (Raiman and Raiman, 2018; López et al., 2019), we created a hierarchy of all entities in Wikidata, using a given entity’s position in this hierarchy as training signal. Specifically, we created the entity hierarchy by fitting a KD-tree (Bentley, 1975; Virtanen et al., 2020) with leaf size 1 over pretrained entity embeddings, thereby obtaining a binary partitioning of the embedding space in which each final partition contains exactly one entity embedding. The path from the KD-tree’s root to a leaf can be represented as a binary code, which we use as training signal (Oda et al., 2017). Memorization accuracy of world knowledge facts with object entities represented in the form of these binary codes was substantially lower compared to the three approaches described in the main part of this work.
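A simplified sketch of how such binary codes can be derived is shown below. It uses a plain median-split partitioning rather than SciPy’s KD-tree implementation and only illustrates the idea of turning root-to-leaf paths into codes.

```python
import numpy as np

def binary_codes(embeddings, ids, depth=0, prefix=""):
    """Recursively median-split the embedding space and assign each entity
    the bit string of its root-to-leaf path (leaf size 1, as in our setup)."""
    if len(ids) == 1:
        return {ids[0]: prefix}
    dim = depth % embeddings.shape[1]            # cycle through split dimensions
    order = np.argsort(embeddings[:, dim])
    mid = len(ids) // 2
    left, right = order[:mid], order[mid:]
    codes = {}
    codes.update(binary_codes(embeddings[left], [ids[i] for i in left],
                              depth + 1, prefix + "0"))
    codes.update(binary_codes(embeddings[right], [ids[i] for i in right],
                              depth + 1, prefix + "1"))
    return codes

# Example with random 4-dimensional "entity embeddings":
emb = np.random.randn(8, 4)
print(binary_codes(emb, [f"Q{i}" for i in range(8)]))
```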
E.2 Training entity embeddings with negative sampling
Instead of using fixed, pretrained entity embeddings as training signal, we experimented with randomly initialized embeddings that are updated during training, using between 1 and 50 in-batch negative samples, which is a standard method in the knowledge base embedding literature (Bordes et al., 2013) and has been used successfully for entity retrieval (Gillick et al., 2019). However, compared to using fixed, pretrained entity embeddings without negative sampling, we observed lower memorization accuracies and slower convergence in our experiments.
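For reference, a common formulation of in-batch negative sampling is sketched below, assuming the model predicts one vector per statement and the remaining target embeddings in the batch act as negatives; this illustrates the general technique, not the exact training code used here.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_sampling_loss(predicted, target_embs):
    """predicted: (B, d) model outputs; target_embs: (B, d) entity embeddings.
    Each example's own target is the positive; the other B-1 targets in the
    batch act as negatives."""
    scores = predicted @ target_embs.t()                 # (B, B) similarity matrix
    labels = torch.arange(predicted.size(0), device=predicted.device)
    return F.cross_entropy(scores, labels)               # maximize the diagonal entries
```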
E.3 Updating pretrained entity embeddings during training
Instead of using fixed entity embeddings, we tried updating them during training with in-batch negative sampling. This increased the number of trainable parameters, memory usage, and training time, but did not lead to higher memorization accuracies.
E.4 Continuous representation with Euclidean distance loss
Instead of normalizing entity embeddings to the unit hypersphere and training with cosine loss, we experimented with predicting the original pretrained entity embeddings and using the Euclidean distance as loss. Compared to using spherical entity embeddings as prediction targets, we observed slower convergence and lower memorization accuracies.
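The two objectives can be sketched as follows (a simplified version; pred and target denote the model’s predicted vector and the pretrained entity embedding, respectively, and are names introduced here for illustration).

```python
import torch
import torch.nn.functional as F

def cosine_loss(pred, target):
    """Objective described in the main part: both vectors are projected onto
    the unit hypersphere and their cosine similarity is maximized."""
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    return 1.0 - (pred * target).sum(dim=-1).mean()

def euclidean_loss(pred, target):
    """Alternative tried here: predict the original (unnormalized) pretrained
    embedding and minimize the Euclidean distance."""
    return torch.linalg.norm(pred - target, dim=-1).mean()
```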