Combining pre-trained language models and structured knowledge
Pedro Colon-Hernandez, Catherine Havasi, Jason Alonso, Matthew Huggins, Cynthia Breazeal
Pedro Colon-Hernandez ∗ MIT Media Lab
Catherine Havasi
Dalang Health
Jason Alonso
Dalang Health
Matthew Huggins
MIT Media Lab
Cynthia Breazeal
MIT Media Lab

∗ 75 Amherst St, Cambridge, MA, 02139. E-mail: [email protected].
In recent years, transformer-based language models have achieved state-of-the-art performance in various NLP benchmarks. These models are able to extract mostly distributional information, with some semantics, from unstructured text; however, it has proven challenging to integrate structured information, such as knowledge graphs, into these models. We examine a variety of approaches to integrate structured knowledge into current language models and determine challenges and possible opportunities to leverage both structured and unstructured information sources. From our survey, we find that there are still opportunities in exploiting adapter-based injections and that it may be possible to further combine several of the explored approaches into one system.
1. Introduction
Recent developments in Language Modeling (LM) techniques have greatly improved the performance of systems in a wide range of Natural Language Processing (NLP) tasks. Many of the current state-of-the-art systems are based on variations of the transformer (Vaswani et al. 2017) architecture. The transformer architecture, along with modifications such as the Transformer-XL (Dai et al. 2019) and various training regimes such as the Masked Language Modeling (MLM) used in BERT (Devlin et al. 2018) or the Permutation Language Modeling (PLM) used in XLNet (Yang et al. 2019), uses an attention-based mechanism to model long-range dependencies in text. This modeling encodes syntactic knowledge and, to a certain extent, some of the semantic knowledge contained in unstructured texts.

There has been interest in understanding what kinds of knowledge are encoded in these models' weights. Hewitt and Manning (2019) devise a system that generates a distance metric between embeddings for words in language models such as BERT. They show that there is some evidence of syntax trees being embedded in transformer language models, which could explain the performance of these models in tasks that rely on syntactic elements of text.
Petroni et al. (2019) build a system (LAMA) to gauge what kinds of knowledge are encoded in these weights. They discover that language models embed some facts and relationships in their weights during pre-training. This in turn can help explain the performance of these models in semantic tasks. However, these transformer-based language models have some tendency to hallucinate knowledge (whether through bias or incorrect knowledge in the training data). This also means that some of the semantic knowledge they incorporate is not rigidly enforced or utilized effectively.

Avenues of research have begun to open up on how to prevent this hallucination and how to inject additional knowledge from external sources into transformer-based language models. One promising avenue is the integration of knowledge graphs such as Freebase (Bollacker et al. 2008), WordNet (Miller 1998), ConceptNet (Speer, Chin, and Havasi 2017), and ATOMIC (Sap et al. 2019).

A knowledge graph (used somewhat interchangeably with knowledge base, although they are different concepts) is defined as "a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities of interest and whose edges represent relations between these entities" (Hogan et al. 2020); we use the formal definitions found in Appendix B of that work. Formally, a knowledge graph is a set of triples that represents nodes and edges between these nodes. Let us define a set of vertices (which we will refer to as concepts) as V, a set of edges as E (which we will refer to as assertions, as per Speer and Havasi (2012)), and a set of labels L (which we will refer to as relations). A knowledge graph is a tuple G := (V, E, L). The set of edges E, or assertions, is composed of triples E ⊆ V × L × V, read as a subject (a concept), a relation (a label), and an object (another concept) respectively, i.e. (subject, relation, object). These edges can in some cases carry weights to represent the strength of the assertion. Broadly speaking, knowledge graphs (KGs) are a collection of tuples that represent things that should be true within the knowledge of the world that we are representing. An example assertion is "a dog is an animal", and its representation as a tuple would be (dog, IsA, animal).

Ideally, we would want to "inject" this structured collection of confident information (i.e. the knowledge graph) into the high-coverage, contextual information found in language models. This injection would permit the model to incorporate some of the information found in the KG to improve its performance in inference tasks.

There are currently various approaches that try to achieve this injection. The approaches generally take one, or a combination, of three forms: input focused injections, architecture focused injections, and output focused injections. We define an input focused injection as any technique that modifies the data pre-processing or the pre-transformer layer inputs that the base model uses (i.e. injecting knowledge graph triples into the training data to pre-train/fine-tune on them, or combining entity embeddings into the static word embeddings that the models have). We define architecture focused injections as techniques that alter a base model's transformer layers (i.e. adding additional layers that inject in some representation). Lastly, we define an output focused injection as any technique that either modifies the output of the base models or modifies/adds custom loss functions.
In addition to these three basic types, there are approaches that utilize combinations of these (i.e. a system that uses both input and output injections), which we call combination injections. Figure 1 gives an abstract visualization of the types of injections that we describe.

To be consistent across the types of injections, we will now give some definitions and nomenclature. Let us define a sequence of words (unstructured text) as S. Typically, in a transformer-based model, this sequence of words is converted by a tokenization function 𝒯 into a token sequence T, i.e. 𝒯(S) = T. This sequence T is used as a lookup in an embedding layer E to produce context-independent token vector embeddings: E(T) = E. These are then passed sequentially through various contextualization layers (i.e. transformers), which we define as the set H = (H_1, ..., H_n). The successive application of these ultimately produces a sequence of contextual embeddings C: C = H_n(H_{n-1}(...H_1(E))). We additionally define 𝒢 as the graph embeddings of a knowledge graph G, which are the result of some embedding function E_g: 𝒢 = E_g(G). The final sequence of contextual embeddings is run through a final layer L_LM that is used to calculate the language modeling loss ℒ, which is optimized through back-propagation. The notation that we utilize is intentionally vague on the definition of the functions, in order for us to fit the different works that we survey.

In the following sections we will look at attempts at injecting knowledge graph information that fall into the aforementioned categories. Additionally, we will highlight relevant benefits of these approaches. We conclude with possible opportunities and future directions for injecting structured knowledge into language models.

Figure 1
Visualization of boundaries of the different categories of knowledge injections.
Combination injections involve combinations of the three categories.
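To make the notation above concrete, the following is a minimal sketch using the HuggingFace transformers library; the model name and the toy knowledge graph are illustrative choices, not taken from any of the surveyed works.

```python
# Minimal sketch of the notation above (illustrative model and toy KG only).
import torch
from transformers import AutoTokenizer, AutoModel

# A knowledge graph G as a set of (subject, relation, object) assertions.
G = {("dog", "IsA", "animal"), ("cat", "AtLocation", "house")}

S = "a dog is an animal"                                   # word sequence S
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

T = tokenizer(S, return_tensors="pt")                      # token sequence T = 𝒯(S)
with torch.no_grad():
    out = model(**T, output_hidden_states=True)

E = out.hidden_states[0]   # embedding-layer output, i.e. context-independent E(T)
C = out.last_hidden_state  # contextual embeddings C = H_n(H_{n-1}(...H_1(E)))
print(E.shape, C.shape)    # both (1, sequence_length, hidden_size)
```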
2. Input Focused Injections
In this section we describe knowledge injections whose techniques center around modifying either the structure of the input or the data that is selected to be fed into the base transformer models. A common approach to inject information from a knowledge graph is to convert its assertions into a set of words (possibly including separator tokens) and pre-train or fine-tune a language model with these inputs. We discuss two particular papers that focus on structuring the input in different ways so as to capture the semantic information from triples found in a KG. These approaches start from a pre-trained model and fine-tune on their knowledge-infusing datasets. A summary of these approaches can be found in Table 1.

Input focused injections can be seen as any technique whose output is a modified E, hereafter E′. This modification can be achieved by modifying S, T, 𝒯, E, or directly E (i.e. the word sequence, the token sequence, the tokenization function, the context-less embedding function, or the actual context-less embeddings). The hope of input focused injections is that the knowledge in E′ will be distributed and contextualized through H as the language models are trained.

2.1 Align, Mask, Select (AMS) (Ye et al. 2019)

AMS is an approach in which a question answering dataset is created, whose questions and possible answers are generated by aligning a knowledge graph (in this particular case ConceptNet) with plain text. A BERT model is trained on this dataset to inject it with the knowledge.

Taking an example from their work, the ConceptNet triple (population, AtLocation, city) is aligned with a sentence from the English Wikipedia that contains both concepts in the triple (i.e. "The largest city by population is Birmingham, which has long been the most industrialized city."). They then proceed to mask out one of the concepts with a special token ([QS]) and produce four plausible concepts as answers to the masking task by looking at the neighbors in ConceptNet that share the masked token and relationship. Lastly, they concatenate the generated question with the plausible answers and run it through a BERT model tailored for question answering (QA), following the same architecture and loss as the SWAG task in the original BERT. At the output, they run the classification token ([CLS]) through a softmax classifier to determine whether the selected concept is the correct one.

The authors note that the work is sensitive to what it has seen in pre-training: when asked a question that needs to disambiguate a pronoun, the model tries to match what it has seen the most in the training data. This may mean that the generalization or understanding of the structured knowledge (here, commonsense information) is overshadowed by the distributional information that the model is learning; however, more testing would need to be done to verify this. Overall, some highlights of the work are:

• Automated pre-training approach which constructs a QA dataset aligned to a KG
• Utilization of graph-based confounders in generated dataset entries

2.2 COMET (Bosselut et al. 2019)

COMET is a GPT (Radford et al. 2018) based system which is trained on triples from KGs (ConceptNet and ATOMIC) to learn to predict the object of a triple (the triples being defined as (subject, relation, object)). The triples are fed as a concatenated sequence of words into the model (i.e. the words for the subject, the relationship, and the object) along with some separators.
The authors initialize the GPT model to the final weights from the training of Radford et al. (2018) and proceed to train it to predict the words that belong to the object of the triple. A very interesting part of this work is that it is directly capable of performing knowledge graph completion, in the form of sentences, for nodes and relations that may not have been seen during training.

Some plausible shortcomings of this work are that the model still has to extract the semantic information from the distributional one, possibly suffering from the same bias as AMS. In addition, by training on the text version of these triples, it may be the case that some of the syntax the model has learned is lost due to awkwardly formatted inputs (i.e. "cat located at housing" rather than "a cat is located at a house"); however, further testing of these two points needs to be performed.

There is some relevant derivative work for COMET by Bosselut and Choi (2019), which looks into how effective COMET is at building KGs on the fly given a certain context, a question, and a proposed answer. They combine the context with a relation from ATOMIC and feed it into COMET to represent reasoning hops. They do this for multiple relations and keep redoing this with the generated outputs to represent a reasoning chain for which they can derive a probability. They use this in a zero-shot evaluation of a question-answering system and find that it is effective. Overall, some highlights of COMET are:

• Generative language model that can provide natural language representations of triples
• Useful model for zero-shot KG completion
• Simple pre-processing of triples for training
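As an illustration of this style of input injection, the sketch below flattens a KG assertion into a source/target text pair on which a generative LM could be fine-tuned; the separator tokens are hypothetical and do not reproduce COMET's exact format.

```python
# Illustrative only: flatten a KG triple into text for generative fine-tuning.
# The separator convention here is hypothetical, not COMET's exact one.
def triple_to_sequence(subject: str, relation: str, obj: str):
    source = f"{subject} <{relation}>"   # conditioning text: subject + relation
    target = f" {obj} <END>"             # tokens the LM is trained to generate
    return source, target

src, tgt = triple_to_sequence("PersonX goes to the mall", "xIntent", "to buy clothes")
print(src + tgt)
# The language-modeling loss is then computed only over the target (object) tokens,
# so the model learns P(object | subject, relation).
```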
Model | Summary of Injection | Example of Injection
Align, Mask, Select | Aligns a knowledge base with textual sentences, masks entities in the sentences, and selects alternatives with confounders to create a QA dataset | KG Assertion: (population, AtLocation, city). Model Input: "The largest [QW] by population is Birmingham, which has long been the most industrialized city?" Candidate answers: city, Michigan, Petri dish, area with people inhabiting, country
COMET | Ingests a formatted sentence version of a triple from ConceptNet and ATOMIC | KG Assertion: (PersonX goes to the mall, xIntent, to buy clothes). Model Input: PersonX goes to the mall [MASK] ⟨xIntent⟩ to buy clothes
Table 1
Input Injection System Comparisons
3. Architecture Injections
In this section we describe approaches that focus on architectural changes to language models. This involves either adding additional layers that integrate knowledge in some way with the contextual representations, or modifying existing layers to manipulate things such as attention mechanisms. We discuss two approaches within this category that fall under layer modifications. These approaches utilize adapter-like mechanisms to inject information into the models. A summary of these approaches can be found in Table 2.
3.1 KnowBERT (Peters et al. 2019)

KnowBERT modifies BERT's architecture by integrating layers that the authors call Knowledge Attention and Recontextualization (KAR) layers. These layers take graph entity embeddings, based on TuckER tensor decompositions for KG completion (Balažević, Allen, and Hospedales 2019), and run them through an attention mechanism to generate entity span embeddings. These span embeddings are then added to the regular BERT contextual representations. The summed representations are then uncompressed and passed on to the next layer of a regular BERT. Once the KAR entity linker has been trained, the rest of the BERT model is unfrozen and trained in the pre-training. These KAR layers are trained for every KG that is to be injected; in this work the authors use data from Wikipedia and WordNet.

An interesting observation is that the injection happens in the later layers, which means that the contextual representation up to that point may be unaltered by the injected knowledge. This is done to stabilize the training, but could present an opportunity to inject knowledge at earlier levels. Additionally, the way the system is trained, the entity linking is trained first, and then the whole system is unfrozen to incorporate the additional knowledge into BERT. This strategy could lead to the catastrophic forgetting (Kirkpatrick et al. 2017) problem, where the knowledge from the underlying BERT model or the additional structured injection may be forgotten or ignored.

This technique falls into a broader category of what are called Adapters (Houlsby et al. 2019). Adapters are layers that are added into a language model and are subsequently fine-tuned to a specific task. The interesting aspect of adapters is that they add a minimal amount of additional parameters and freeze the original model weights. The added parameters are also initialized to produce a close-to-identity output. It is worth noting that KnowBERT is not explicitly an Adapter technique, as the model is unfrozen during training. Some highlights of KnowBERT are the following:

• Fusion of contextual and graph representations of entities
• Attention-enhanced entity-span knowledge infusion
• Permits the injection of multiple KGs at varying levels of the model
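Since adapters recur in several of the surveyed systems, the following is a minimal sketch of a Houlsby-style bottleneck adapter; the hidden and bottleneck sizes are illustrative, and real implementations add such blocks inside every transformer layer.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter in the style of Houlsby et al. (2019): a small
    down-project/up-project block added inside a frozen transformer layer and
    initialized so that it starts close to the identity function."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        # Near-zero initialization of the up-projection keeps the initial
        # output close to the residual input (near-identity behaviour).
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states):
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))
```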
3.2 Common Sense or World Knowledge? (Lauscher et al. 2020)

This work explores what kinds of knowledge are infused by fine-tuning an adapter-equipped version of BERT on ConceptNet. The authors generate and test models trained on sentences from the Open Mind Common Sense (OMCS) (Singh et al. 2002) corpus and on walks through the ConceptNet graph. They note that with simple adapters and as little as 25k/100k update steps on their training sentences, they are able to greatly improve the encoded "world knowledge" (another name for the knowledge found in ConceptNet). However, it is worth noting that the information is presented as sentences to which the adapters are fine-tuned. This may mean that the model has possible shortcomings similar to those of the input-focused approaches (the model may rely more on the distributional rather than the semantic information); however, testing needs to be performed to confirm this. Overall, some highlights of this work are the following:

• Adapter-based approach which fine-tunes a minimal amount of parameters
• Shows that a relatively small amount of additional iterations can inject the knowledge into the adapters
• Shows that adapters trained on KGs do indeed boost the semantic performance of transformer-based models
Model | Injected in Pre-Training | Injected in Fine-Tuning | Summary of Injection
KnowBERT | Yes | Yes | Sandwiched adapter-like layers which sum the layer's contextual representation with a graph representation of entities and distribute it over an entity span
Common sense or world knowledge? [...] | No | Yes | Uses sandwiched adapters to fine-tune on a KG
Table 2
Architecture Injection System Comparisons
4. Output Injections
In this section we describe approaches that focus on changing either the output structure or the losses of the base model in some way to incorporate knowledge. Only one model falls strictly under this category; it injects entity embeddings into the output of a BERT model.
4.1 SemBERT (Zhang et al. 2019)

SemBERT uses a subsystem that generates embedding representations of the output of a semantic role labeling (Màrquez et al. 2008) system. These representations are then concatenated with the contextualized representations from BERT to help incorporate relational knowledge. The approach, although clever, may fall short in that, although it provides a representation for the roles, it leaves the model to figure out the exact relationship that the roles are performing; however, testing would need to be performed to check this. Some highlights of SemBERT are:

• Encodes semantic roles in an entity embedding that is combined at the output
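A minimal sketch of this kind of output-side fusion is shown below; the role vocabulary size, embedding dimension, and projection layer are assumptions for illustration, not SemBERT's exact configuration.

```python
import torch
import torch.nn as nn

class OutputConcatFusion(nn.Module):
    """Illustrative sketch of an output-focused injection in the spirit of
    SemBERT: embeddings derived from semantic role labels are concatenated
    with the contextual representations produced by the base model."""
    def __init__(self, hidden_size: int = 768, num_roles: int = 20, role_dim: int = 32):
        super().__init__()
        self.role_embedding = nn.Embedding(num_roles, role_dim)
        self.project = nn.Linear(hidden_size + role_dim, hidden_size)

    def forward(self, contextual, role_ids):
        # contextual: (batch, seq_len, hidden_size) output of the base model
        # role_ids:   (batch, seq_len) semantic-role label ids per token
        roles = self.role_embedding(role_ids)
        return self.project(torch.cat([contextual, roles], dim=-1))
```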
5. Combination and Hybrid Injections
Here we describe approaches that use combinations of injection types, such as input/output injections or architecture/output injections. We start by looking at models that perform input injections and reinforce these with output injections (LIBERT, KALM). We then look at models that manipulate the attention mechanisms to mimic graph connections (BERT-MK, K-BERT). We follow this by looking into KG-BERT, a model that operates on KG triples, and K-Adapter, a modification of RoBERTa that encodes KGs into adapter layers and fuses them. After this, we look into the approach presented as Cracking the Contextual Commonsense Code [...], which determines that there are areas lacking in BERT that could be addressed by supplying appropriate data, and we look at ERNIE 2.0, a framework for multi-task training of semantically aware models. Lastly, we look at two hybrid approaches which extract LM knowledge and leverage it for different tasks. A summary of these injections can be found in Table 3.

5.1 Knowledge-Aware Language Model (KALM) Pre-Training (Rosset et al. 2020)
KALM is a system that does not modify the internal architecture of the model into which it injects knowledge; rather, it modifies the input of the model by fusing entity embeddings with the normal word embeddings that the language model (in KALM's case, GPT-2) uses. The model is then forced at the output to uphold the entity information through an additional loss component in the pre-training, which applies a max margin between the cosine distance of the output contextual representation to the input entity embedding and the cosine distance of the contextual representation to a confounder entity. Altogether, this forces the model to notice when there is an entity and tries to make the contextual representation carry the semantics of the correct input entity. Some highlights of KALM are:

• Sends an entity signal at the input and enforces it at the output of a generative model so that the model notices its semantics
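A minimal sketch of the kind of max-margin objective described above is given below; the margin value and tensor shapes are illustrative assumptions rather than KALM's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def entity_max_margin_loss(contextual, entity_emb, confounder_emb, margin: float = 0.3):
    """Illustrative sketch of the output-side signal described for KALM:
    the contextual representation at an entity position should be closer
    (in cosine distance) to the correct entity embedding than to a sampled
    confounder, by at least `margin`. All shapes are (batch, dim)."""
    pos = 1.0 - F.cosine_similarity(contextual, entity_emb, dim=-1)      # distance to true entity
    neg = 1.0 - F.cosine_similarity(contextual, confounder_emb, dim=-1)  # distance to confounder
    return torch.clamp(margin + pos - neg, min=0.0).mean()
```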
5.2 Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning (Shen et al. 2020)

This work masks informative entities, drawn from a knowledge graph, in BERT's MLM objective. In addition, it has an auxiliary objective that uses a max-margin loss for a ranking task, built from a bilinear model that calculates a similarity score between the contextual representation of an entity mention and the representation of the [CLS] token for the text. This is used to determine whether the mention is a relevant entity or a distractor. Both KALM and this work are very similar, but a key difference is that KALM uses a generative model without any kind of MLM objective, and KALM does not do any kind of filtering of the entities. Some highlights of this work are:

• Filters relevant entities to incorporate their information into the model
• Enforces the entity signal at the beginning and end of the model through masking and max-margin losses

5.3 LIBERT (Lauscher et al. 2019)

LIBERT converts batches of lexical constraints and negative examples into a BERT-compatible format. The lexical constraints are synonyms and direct hyponym-hypernym (specific, broad) pairs and take the form of a set of word-pair tuples C = {(w_1, w_2)_i}, i = 1, ..., N. In addition to this set, the authors generate negative examples by finding the words that are semantically close to w_1 and w_2 in a given batch. They then format the examples into something BERT can use, which is simply the wordpieces that pertain to words in the batch, separated by the separator token. They pass this input through BERT and use the [CLS] token as input to a softmax classifier to determine whether the example is a valid lexical relation.

During pre-training they alternate between a batch of sentences and a batch of constraints. LIBERT outperforms BERT with fewer (1M) iterations of pre-training. It is worth noting that as the number of training iterations increases, the gap between the two systems, although present, becomes smaller. This may indicate that, although the additional training objective is effective, it may be getting overshadowed by the regular MLM coupled with large amounts of data; however, more testing needs to be performed. It is also worth noting that the authors do not align the sentences with the constraint batches or combine the training tuples, which may hinder training as BERT has to alternate between different training input structures, and lastly, they do not incorporate antonymy constraints in their confounder selection, so further experimentation would be required to verify the effects of these choices. Some highlights of LIBERT are the following:

• Incorporates lexical constraints from entity embeddings
• Good performance with constrained amounts of data

5.4 BERT-MK (He et al. 2019)

BERT-MK utilizes a combination of an architecture injection and an output injection (an additional training loss). In BERT-MK, the authors utilize KG-transformer modules, which are transformer layers combined with learned entity representations. These entity representations are generated from another set of transformer layers that are trained on a KG converted to natural language sentences. The interesting aspect is that these additional layers incorporate an attention mask that mimics the connections in the KG, to a certain extent incorporating the structure of the graph and propagating it back into the embeddings. These additional layers are trained to reconstruct the input set of triples. The authors evaluate the system on medical knowledge (MK); however, it may be interesting to evaluate it on the GLUE benchmark along with utilizing other KGs such as ATOMIC or ConceptNet.
Some highlights of BERT-MK are:

• Utilization of a modified attention mechanism to mimic the KG structure between terms
• Incorporation of a triple reconstruction loss to train the KG-transformer modules
• Merges the KG-transformer with the regular transformer for a contextual + knowledge-informed representation
5.5 K-BERT (Liu et al. 2020)

K-BERT uses a combination of input and architecture injections. For a given sentence, it injects relevant triples for the entities that are present both in the sentence and in a KG. These triples are injected in between the actual text, and a soft-position embedding is used to determine the order in which the triples are evaluated. These soft-position embeddings simply add positional embeddings to the injected triple tokens. This in turn creates a problem: the tokens are injected wherever entities appear in a sentence, and hence the ordering of the tokens is altered.

To remedy this, the authors utilize a masked self-attention similar to BERT-MK. What this means is that the attention mechanism should only be able to see everything up to the entity that matched the injected triple. This attention mechanism helps the model focus on what relevant knowledge it should incorporate. It would have been good to see a comparison with simply adding these triples as sentences in the input, rather than having to fix the attention mechanism to compensate for the erratic placement. Some highlights of K-BERT are:

• Utilization of the attention mechanism to mimic connected subgraphs of injected triples
• Injection of relevant triples as text inputs
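The sketch below illustrates the shared idea behind the BERT-MK and K-BERT attention masks: injected triple tokens are only visible to themselves and to the entity they attach to, while ordinary sentence tokens see each other. The exact construction in each paper differs; this is only meant to convey the mechanism.

```python
import torch

def build_visibility_mask(seq_len, sentence_idx, branch_spans):
    """Illustrative visible matrix: sentence tokens (indices in `sentence_idx`)
    attend to each other; tokens of an injected triple branch attend to the
    other tokens in the same branch plus the entity position the branch hangs
    from. branch_spans: list of (entity_position, [branch_token_indices])."""
    visible = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    sent = torch.tensor(sentence_idx)
    visible[sent.unsqueeze(1), sent.unsqueeze(0)] = True        # sentence <-> sentence
    for entity_pos, branch in branch_spans:
        idx = torch.tensor(branch + [entity_pos])
        visible[idx.unsqueeze(1), idx.unsqueeze(0)] = True      # branch <-> branch + entity
    return visible  # used as an attention mask (False -> attention score set to -inf)

# Example: a 4-token sentence with a 2-token triple branch attached to token 1.
mask = build_visibility_mask(8, sentence_idx=[0, 1, 2, 3], branch_spans=[(1, [4, 5])])
```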
5.6 KG-BERT (Yao, Mao, and Luo 2019)

The authors present a combination approach which fine-tunes a BERT model on the text of triples from a KG, similar to COMET. The authors also feed confounders, in the form of random samples of entities, into the training of the system. The model utilizes a binary classification task to determine whether a triple is valid, and a relationship type prediction task to determine which relations are present between pairs of entities. Although this system is useful for KG completion, there is no evidence of its performance on other tasks. Additionally, they train on one triple at a time, which may limit the model's ability to learn the extended relationships for a given set of entities. Some highlights of KG-BERT are the following:

• Fine-tunes BERT to complete triples from a KG
• Uses binary classification to predict whether a triple is valid
• Uses multi-class classification to predict the relation type
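A minimal sketch of this style of triple classification with an off-the-shelf BERT classifier head is shown below; the serialization of the triple and the label convention are illustrative assumptions.

```python
# Illustrative sketch: serialize a triple and classify it as valid/corrupted.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

subject, relation, obj = "dog", "is a", "animal"
# Sequence-pair input: [CLS] subject [SEP] relation object [SEP]
inputs = tokenizer(subject, f"{relation} {obj}", return_tensors="pt")
outputs = model(**inputs, labels=torch.tensor([1]))  # 1 = valid, 0 = corrupted triple
print(outputs.loss, outputs.logits)
```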
5.7 K-Adapter (Wang et al. 2020)

A work based on adapters, K-Adapter works by adding projection layers before and after a subset of transformer layers. The authors do this only for some specific layers of a pre-trained RoBERTa model (the first, the middle, and the last layer). They then freeze RoBERTa, as per the Adapter work in (Houlsby et al. 2019), and train two adapters to learn factual knowledge from Wikipedia triples (Elsahar et al. 2019) and linguistic knowledge from outputs of the Stanford parser (Chen and Manning 2014). They train the adapters with a triple classification task (whether the triple is true or not), similar to KG-BERT.

It is worth noting that the authors compare RoBERTa and their K-Adapter approach against BERT, and BERT has considerably better performance on the LAMA probes. The authors attribute RoBERTa's byte pair encodings (Shibata et al. 1999) (BPE) as the major performance delta between their approach and BERT. Another possible reason may be that they only perform the injection in a few layers rather than throughout the entire model, although testing needs to be done to confirm this. Some highlights of K-Adapter are:

• The approach provides a framework for continual learning
• Uses a fusion of trained adapter outputs for evaluation tasks
5.8 Cracking the Contextual Commonsense Code (Da and Kusai 2019)

The authors analyze BERT and determine that it is deficient in certain attribute representations of entities. The authors use the RACE (Lai et al. 2017) dataset and, based on five attribute categories (Visual, Encyclopedic, Functional, Perceptual, Taxonomic), select samples from the dataset that may help a BERT model compensate for deficiencies in those areas. They then fine-tune on this data. In addition, the authors concatenate the fine-tuned BERT embeddings with knowledge graph embeddings. These graph embeddings are generated based on assertions that involve the entities present in the questions and passages on which they train their final joint model (MCScript 2.0 (Ostermann, Roth, and Pinkal 2019)). Their selection of additional fine-tuning data for BERT improves performance on MCScript 2.0, highlighting that their selection addressed missing knowledge.

It is worth noting that the graph embeddings that they concatenate boost the performance of their system, which shows that there is still some information in KGs that is not in BERT. We classify this approach as a combination approach because they concatenate the BERT embeddings and the KG embeddings and fine-tune both at the same time. The authors, however, gave no insight as to how the KG embeddings could have been incorporated in the fine-tuning/pre-training of BERT with the RACE dataset. Some highlights of this work are:

• BERT has some commonsense information in some areas, but is lacking in others
• Fine-tuning on the deficient areas increases performance accordingly
• The combination of graph embeddings plus contextual representations is useful
5.9 ERNIE 2.0 (Sun et al. 2020)

The authors develop a framework that constructs pre-training tasks centered around word-aware, structure-aware, and semantic-aware pre-training, and proceed to train a transformer-based model on these tasks. An interesting aspect is that as they finish training on new tasks, they keep training on the older tasks so that the model does not forget what it has learned. In ERNIE 2.0 the authors do not incorporate KG information explicitly. They do have a sub-task within the word-aware pre-training that masks entities and phrases, with the hope that the model learns the dependencies of the masked elements, which may help to incorporate assertion information.

A possible shortcoming of this model is that some tasks intended to infuse semantic information into the model (i.e. the semantic-aware tasks, which are a discourse relation task and an information retrieval (IR) relevance task) rely on the model to pick this information up from the distributional examples. This could have the same possible issue as the input injections and would need to be investigated further. Additionally, they do not explicitly use KGs in the work. Some highlights of ERNIE 2.0 are:

• A continual learning platform that keeps training on older tasks to maintain their information
• The framework permits flexibility in the underlying model
• A wide variety of semantic pre-training tasks

5.10 Graph-Based Reasoning over Heterogeneous External Knowledge for Commonsense Question Answering (Lv et al. 2020)
This is a hybrid approach in which the authors do not inject knowledge into a language model (namely XLNet (Yang et al. 2019)); rather, they utilize the language model as a way to unify graph knowledge and contextual information. They combine XLNet embeddings as nodes in a Graph Convolutional Network (GCN) to answer questions.

They generate relevant subgraphs of ConceptNet and Wikipedia (from ConceptNet, the relations that include entities in a question/answer exercise, and from Wikipedia, the top 10 most relevant sentences retrieved with ElasticSearch). They then perform a topological sort on the combined graphs and pass them as input to XLNet. XLNet then generates contextual representations that are used as representations for nodes in the GCN. They then utilize graph attention to generate a graph-level representation and combine it with XLNet's input ([CLS] token) representation to determine whether an answer is valid for a question. In this model they do not fine-tune XLNet, which could have been done on the dataset to give better contextual representations, and additionally they do not leverage the different levels of representation present in XLNet. Some highlights of this work are the following:

• Combination of a GCN, a generative language model, and a search system to answer questions
• Uses XLNet as contextual embeddings for GCN nodes
• Performs QA reasoning with the GCN output
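The sketch below shows the basic pattern of using language-model embeddings as node features in a graph convolution; it is a simplification of the system above, which additionally uses graph attention pooling and a final classifier over the combined representation.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution step over LM-derived node features (illustrative)."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, node_feats, adjacency):
        # node_feats: (num_nodes, dim) contextual embeddings from the language model
        # adjacency:  (num_nodes, num_nodes) normalized adjacency matrix of the subgraph
        return torch.relu(self.linear(adjacency @ node_feats))

# Usage: node_feats could be XLNet/BERT representations of ConceptNet nodes or
# evidence sentences; after one or more layers, the node states are pooled and
# combined with the [CLS] representation to score a candidate answer.
```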
5.11 Commonsense Knowledge Base Completion with Structural and Semantic Context (Malaviya et al. 2020)

In another hybrid approach, the authors fine-tune a BERT model on a list of the unique phrases that are used to represent nodes in a KG. They then take this BERT representation and concatenate it with node representations obtained from a graph convolutional network (GCN) over a subgraph (in this case a combination of ConceptNet and ATOMIC). They treat this concatenation as an encoded representation and run combinations of these through a convolutional decoder that additionally takes an embedding of a relation type. The result of the convolutional decoder is run through a bilinear model and a sigmoid function to determine the validity of the assertion.

It seems interesting that the authors only run the convolution through one side: the convolution of (e_i, e_rel) rather than both the convolution of (e_i, e_rel) and (e_rel, e_j) (where e_i and e_j are the entity embeddings for entities i and j respectively, and e_rel is the embedding for a specific relationship) followed by a concatenation. They rely on the bilinear model to join the two representations. It is worth noting that the two hybrid projects possibly benefited from the ability of these language models to encode assertions, as shown by Davison, Feldman, and Rush (2019) and Petroni et al. (2019). Some highlights of this work are the following:

• Uses a GCN and an LM to generate contextualized assertion representations
• Uses BERT to generate contextual embeddings for nodes
• Uses an encoder-decoder structure to learn triples
6. Future Directions

6.1 Input Injections
Most input injections format KG information into whatever form a transformer model can ingest. Although KALM has explored incorporating an entity signal into the input representations, it would be interesting to add additional information, such as the lexical constraints mentioned in LIBERT, to the word embeddings that are trained with transformer-based models like BERT. A possible approach could be to build a post-specialization system that could generate retrofitted (Faruqui et al. 2014) representations that could then be fed into language models.
6.2 Architecture Injections

Adapters seem to be a promising field of research in language models overall. The idea that one can fine-tune a small number of parameters may simplify the injection of knowledge, and KnowBERT has explored some of these benefits. It would be interesting to apply a similar approach to generative models and see the results.

Another possible avenue of research would be to incorporate neural memory models/modules, such as the ones by Munkhdalai et al. (2019), into adapter-based injections. The reasoning is that the model could simply look up relevant information encoded in a memory architecture and fuse it into a contextual representation.
6.3 Combination Injections

There are a variety of combined approaches, but none of them tackle all three areas (input, architecture, and output) at the same time. It seems promising to test a signaling method such as KALM's and see how it would work with an adapter-based method similar to KnowBERT. The idea is that the input signal could help the entity embeddings contextualize better within the injected layers. Additionally, it would be interesting to see how the aforementioned combination would look with a system similar to LIBERT, such that one could fuse entity embeddings with some lexical-semantic information.
7. Conclusion
Infusing structured information from knowledge graphs into pre-trained language models has had some success. Overall, the works reviewed here give evidence that the models benefit from the incorporation of structured information. By analyzing the existing works, we give some research avenues that may help to develop more tightly coupled language/KG models.
References
Balažević, Ivana, Carl Allen, and Timothy M. Hospedales. 2019. TuckER: Tensor factorization for knowledge graph completion. arXiv preprint arXiv:1901.09590.
Bodenreider, Olivier. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl_1):D267–D270.
Bollacker, Kurt, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250.
Bosselut, Antoine and Yejin Choi. 2019. Dynamic knowledge graph construction for zero-shot commonsense question answering. arXiv preprint arXiv:1911.03876.
Bosselut, Antoine, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. arXiv preprint arXiv:1906.05317.
Chen, Danqi and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750.
Da, Jeff and Jungo Kusai. 2019. Cracking the contextual commonsense code: Understanding commonsense reasoning aptitude of deep contextual representations. arXiv preprint arXiv:1910.01157.
Dai, Zihang, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
Davison, Joe, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1173–1178.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Elsahar, Hady, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Elena Simperl, and Frederique Laforest. 2019. T-REx: A large scale alignment of natural language with knowledge base triples.
Faruqui, Manaal, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2014. Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166.
Gabrilovich, Evgeniy, Michael Ringgaard, and Amarnag Subramanya. 2013. FACC1: Freebase annotation of ClueWeb corpora, version 1 (release date 2013-06-26, format version 1, correction level 0). Note: http://lemurproject.org/clueweb09/FACC1/, Cited by, 5:140.
He, Bin, Di Zhou, Jinghui Xiao, Qun Liu, Nicholas Jing Yuan, Tong Xu, et al. 2019. Integrating graph contextualized knowledge into pre-trained language models. arXiv preprint arXiv:1912.00147.
Hewitt, John and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.
Hogan, Aidan, Eva Blomqvist, Michael Cochez, Claudia d'Amato, Gerard de Melo, Claudio Gutierrez, José Emilio Labra Gayo, Sabrina Kirrane, Sebastian Neumaier, Axel Polleres, et al. 2020. Knowledge graphs. arXiv preprint arXiv:2003.02320.
Houlsby, Neil, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. arXiv preprint arXiv:1902.00751.
Kirkpatrick, James, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.
Lai, Guokun, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
Lauscher, Anne, Olga Majewska, Leonardo F. R. Ribeiro, Iryna Gurevych, Nikolai Rozanov, and Goran Glavaš. 2020. Common sense or world knowledge? Investigating adapter-based knowledge injection into pretrained transformers. arXiv preprint arXiv:2005.11787.
Lauscher, Anne, Ivan Vulić, Edoardo Maria Ponti, Anna Korhonen, and Goran Glavaš. 2019. Informing unsupervised pretraining with external linguistic knowledge. arXiv preprint arXiv:1909.02339.
Liu, Weijie, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2020. K-BERT: Enabling language representation with knowledge graph. In AAAI, pages 2901–2908.
Lv, Shangwen, Daya Guo, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, and Songlin Hu. 2020. Graph-based reasoning over heterogeneous external knowledge for commonsense question answering. In AAAI, pages 8449–8456.
Malaviya, Chaitanya, Chandra Bhagavatula, Antoine Bosselut, and Yejin Choi. 2020. Commonsense knowledge base completion with structural and semantic context. In AAAI, pages 2925–2933.
Màrquez, Lluís, Xavier Carreras, Kenneth C. Litkowski, and Suzanne Stevenson. 2008. Semantic role labeling: an introduction to the special issue.
Miller, George A. 1998. WordNet: An electronic lexical database. MIT Press.
Munkhdalai, Tsendsuren, Alessandro Sordoni, Tong Wang, and Adam Trischler. 2019. Metalearned neural memory. In Advances in Neural Information Processing Systems, pages 13331–13342.
Ostermann, Simon, Michael Roth, and Manfred Pinkal. 2019. MCScript2.0: A machine comprehension corpus focused on script events and participants. arXiv preprint arXiv:1905.09531.
Peters, Matthew E., Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. arXiv preprint arXiv:1909.04164.
Petroni, Fabio, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066.
Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
Rosset, Corby, Chenyan Xiong, Minh Phan, Xia Song, Paul Bennett, and Saurabh Tiwary. 2020. Knowledge-aware language model pretraining. arXiv preprint arXiv:2007.00655.
Sap, Maarten, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019. ATOMIC: An atlas of machine commonsense for if-then reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3027–3035.
Shen, Tao, Yi Mao, Pengcheng He, Guodong Long, Adam Trischler, and Weizhu Chen. 2020. Exploiting structured knowledge in text via graph-guided representation learning. arXiv preprint arXiv:2004.14224.
Shibata, Yusuxke, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, and Setsuo Arikawa. 1999. Byte pair encoding: A text compression scheme that accelerates pattern matching. Technical Report DOI-TR-161, Department of Informatics, Kyushu University.
Singh, Push, Thomas Lin, Erik T. Mueller, Grace Lim, Travell Perkins, and Wan Li Zhu. 2002. Open Mind Common Sense: Knowledge acquisition from the general public. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", pages 1223–1237, Springer.
Speer, Robyn, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.
Speer, Robyn and Catherine Havasi. 2012. Representing general relational knowledge in ConceptNet 5. In LREC, pages 3679–3686.
Sun, Yu, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE 2.0: A continual pre-training framework for language understanding. In AAAI, pages 8968–8975.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Wang, Ruize, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Cuihong Cao, Daxin Jiang, Ming Zhou, et al. 2020. K-Adapter: Infusing knowledge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808.
Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763.
Yao, Liang, Chengsheng Mao, and Yuan Luo. 2019. KG-BERT: BERT for knowledge graph completion. arXiv preprint arXiv:1909.03193.
Ye, Zhi-Xiu, Qian Chen, Wen Wang, and Zhen-Hua Ling. 2019. Align, mask and select: A simple method for incorporating commonsense knowledge into language representation models. arXiv preprint arXiv:1908.06725.
Zhang, Zhuosheng, Yuwei Wu, Hai Zhao, Zuchao Li, Shuailiang Zhang, Xi Zhou, and Xiang Zhou. 2019. Semantics-aware BERT for language understanding. arXiv preprint arXiv:1909.02209.

Model | Injected in Pre-Training | Injected in Fine-Tuning | Summary of Input Injection | Summary of Architecture Injection | Summary of Output Injection
KALM | Yes | No | Combines an entity embedding with the model's word embeddings | N/A | Incorporates a max-margin loss with cosine distances to enforce semantic information
Exploiting structured knowledge in text [...] | Yes | No | Uses a KG-informed masking scheme to exploit MLM learning | N/A | Incorporates a max-margin loss with distractors from a KG and a bilinear scoring model for the max-margin loss
LIBERT | Yes | No | Alternates between batches of sentences and batches of tuples that are lexically related | N/A | Adds a binary classifier as a third training task to determine if the tuples form a valid lexical relation
BERT-MK | Yes | No | N/A | Combines a base transformer with KG-transformer modules, which are trained to learn contextual entity representations and have an attention mechanism that mimics the connections of a graph | Uses a triple reconstruction loss, similar to MLM, but for triples
K-BERT | Yes | No | Incorporates, as part of the training batches, assertions from entities present in a sample | Modifies the attention mechanism and position embeddings to reorder injected information and mimic KG connections | N/A
KG-BERT | No | Yes | Feeds triples from a knowledge graph as input examples | N/A | Uses a binary classification objective to determine if a triple is correct and a multi-class classification objective to determine what kind of relation the triple has
K-Adapter | No | Yes | N/A | Fine-tunes adapter layers to store information from a KG | Combines the model and different adapter layers to give a contextual representation with information from different sources; additionally uses a relation classification loss for each trained adapter
Cracking the contextual commonsense code [...] | Yes | No | Pre-processes the data to address commonsense relation properties that are deficient in BERT | N/A | Concatenates a graph embedding to the output of the BERT model
ERNIE 2.0 | Yes | No | Constructs data for pre-training tasks | N/A | Provides a battery of tasks that are trained in parallel to enforce different semantic areas in a model
Table 3
Combination Injection Systems Comparisons

Knowledge Injection Approach | Underlying Language Model | Type of Injection | Knowledge Sources | Training Objective
Align, Mask, Select | BERT | Input | ConceptNet | Binary Cross Entropy
COMET | GPT | Input | ATOMIC, ConceptNet | Language Modeling
KnowBERT | BERT | Architecture | Wikipedia, WordNet | Masked Language Modeling
Common sense or world knowledge? [...] | BERT | Architecture | ConceptNet | Masked Language Modeling
SemBERT | BERT | Output | Semantic Role Labeling of pre-training data | Masked Language Modeling
KALM | GPT-2 | Combination (Input, Output) | FACC1 and FAKBA entity annotation (Gabrilovich, Ringgaard, and Subramanya 2013) | Language Modeling + Max Margin
Exploiting Structured [...] | BERT | Combination (Input + Output) | ConceptNet | Masked Language Modeling, Max Margin
LIBERT | BERT | Combination (Input, Output) | WordNet, Roget's Thesaurus | Masked Language Modeling + Max Margin
Graph-based reasoning over heterogeneous external knowledge | XLNet + Graph Convolutional Network | Hybrid (Language Model + Graph Reasoning) | Wikipedia, ConceptNet | Cross Entropy
Commonsense knowledge base completion with structural and semantic context | BERT + Graph Convolutional Network | Hybrid (Language Model + GCN Embeddings) | ATOMIC, ConceptNet | Binary Cross Entropy
ERNIE 2.0 | Transformer-Based Model | Combination (Input + Output) | Wikipedia, BookCorpus, Reddit, Discovery Data (various types of relationships extracted from these datasets) | Various tasks, among them Knowledge Masking, Token-Document Relation Prediction, Sentence Distance Task, IR Relevance Task
BERT-MK | BERT | Combination (Architecture + Output) | Unified Medical Language System (UMLS) (Bodenreider 2004) | Masked Language Modeling, Max Margin
K-BERT | BERT | Combination (Input + Architecture) | TBD | Same as BERT
KG-BERT | BERT | Combination (Input + Output) | Freebase, WordNet, UMLS | Binary + Categorical Cross Entropy
K-Adapter | RoBERTa | Combination (Architecture + Output) | Wikipedia, Dependency Parsing from BookCorpus | Binary Cross Entropy
Cracking the Commonsense Code | BERT | Combination (Input + Output) | N/A: fine-tuning on a RACE dataset subset | Binary Cross Entropy
Table 4
Knowledge Injection Models Overview

Knowledge Injection Approach | Benchmark Name | Base Model | Base Model Benchmark Performance | Knowledge Injected Model Performance | Percent Difference
BERT-CS base (Align, Mask, Select) | GLUE (Average) | BERT-Base | 78.975 | 79.612 | 0.81%
BERT-CS large (Align, Mask, Select) | GLUE (Average) | BERT-Base-large | 81.5 | 81.45 | -0.06%
LIBERT (2M) | GLUE (Average) | BERT baseline trained with 2M examples | 72.775 | 74.275 | 2.06%
SemBERT base | GLUE (Average) | BERT-Base | 78.975 | 80.35 | 1.74%
SemBERT large | GLUE (Average) | BERT-Base | 81.5 | 84.262 | 3.39%
K-Adapter F+L | CosmosQA, TACRED | RoBERTa + multitask training | 81.19, 71.62 | 81.83, 71.93 | 1.54%, 0.95%
ERNIE 2.0 (large) | GLUE (Average) | BERT-Base-Large | 81.5 | 84.65 | 3.87%
BERT-MK | Entity Typing, Rel. Classification (using UMLS) | BERT-base | 96.55, 77.75 | 97.26, 83.02 | 0.74%, 6.78%
K-BERT | XNLI | BERT-base | 75.4 | 76.1 | 0.93%
Cracking the contextual commonsense code (BERT large + KB + RACE) | MCScript 2.0 | BERT-Large | 82.3 | 85.5 | 3.89%
KnowBERT | TACRED | BERT-Base | 66 | 71.5 | 8%
Common Sense or World Knowledge? (OM-ADAPT 100K) | GLUE (Average) | BERT-Base | 78.975 | 79.225 | 0.40%
Common Sense or World Knowledge? (CN-ADAPT 50K) | GLUE (Average) | BERT-Base | 78.975 | 79.225 | 0.32%
Table 5