An End-to-end Model for Entity-level Relation Extraction using Multi-instance Learning
Markus Eberts and Adrian Ulges
RheinMain University of Applied Sciences, Wiesbaden, Germany
{markus.eberts, adrian.ulges}@hs-rm.de

Abstract
We present a joint model for entity-level relation extraction from documents. In contrast to other approaches – which focus on local intra-sentence mention pairs and thus require annotations on mention level – our model operates on entity level. To do so, a multi-task approach is followed that builds upon coreference resolution and gathers relevant signals via multi-instance learning with multi-level representations combining global entity and local mention information. We achieve state-of-the-art relation extraction results on the DocRED dataset and report the first entity-level end-to-end relation extraction results for future reference. Finally, our experimental results suggest that a joint approach is on par with task-specific learning, though more efficient due to shared parameters and training steps.
Introduction

Information extraction addresses the inference of formal knowledge (typically, entities and relations) from text. The field has recently experienced a significant boost due to the development of neural approaches (Zeng et al., 2014; Zhang and Wang, 2015; Kumar, 2017). This has led to two shifts in research: First, while earlier work has focused on sentence-level relation extraction (Hendrickx et al., 2010; Han et al., 2018; Zhang et al., 2017), more recent models extract facts from longer text passages (document-level). This enables the detection of inter-sentence relations that may only be implicitly expressed and require reasoning across sentence boundaries. Current models in this area do not rely on mention-level annotations and aggregate signals from multiple mentions of the same entity. The second shift has been towards multi-task learning: While earlier approaches tackle entity mention detection and relation extraction with separate models, recent joint models address these tasks at once (Bekoulis et al., 2018; Nguyen and Verspoor, 2019; Wadden et al., 2019). This does not only improve simplicity and efficiency, but is also commonly motivated by the fact that tasks can benefit from each other: For example, knowledge of two entities' types (such as person + organization) can boost certain relations between them (such as ceo_of).

Figure 1: Our goal is to perform end-to-end entity-level relation extraction on whole documents. We extract entity mentions ("PGC"), entity clusters ({Portland Golf Club, PGC, golf club}), their types (ORG) and relations to other entities in the document, such as ({Portland Golf Club, PGC, golf club} ORG, inception, { } TIME), with a single, joint model. Note that document-level relation extraction requires the aggregation of relevant information from multiple sentences, such as in ({Raleigh Hills} LOC, country, {United States, U.S.} LOC). Other entities in the example document are omitted for clarity.

We follow this line of research, and present JEREX ("Joint Entity-Level Relation EXtractor"), a novel approach for joint information extraction. (The code for reproducing our results is available at https://github.com/lavis-nlp/jerex.) JEREX is to our knowledge the first approach that combines a multi-task model with entity-level relation extraction: In contrast to previous work, our model jointly learns relations and entities without annotations on mention level, but extracts document-level entity clusters and predicts relations between those clusters using a multi-instance learning (MIL) (Dietterich et al., 1997; Riedel et al., 2010; Surdeanu et al., 2012) approach. The model is trained jointly on mention detection, coreference resolution, entity classification and relation extraction (Figure 1).

While we follow best practices for the first three tasks, we propose a novel representation for relation extraction, which combines global entity-level representations with localized mention-level ones. We present experiments on the DocRED (Yao et al., 2019) dataset for entity-level relation extraction. Though it is arguably simpler compared to recent graph propagation models (Nan et al., 2020) or special pre-training (Ye et al., 2020), our approach achieves state-of-the-art results. We also report the first results for end-to-end relation extraction on DocRED as a reference for future work. In ablation studies we show that (1) combining global and local representations is beneficial, and (2) that joint training appears to be on par with separate per-task models.

Related Work

Relation extraction is one of the most studied natural language processing (NLP) problems to date. Most approaches focus on classifying the relation between a given entity mention pair. Here various neural network based models, such as RNNs (Zhang and Wang, 2015), CNNs (Zeng et al., 2014), recursive neural networks (Socher et al., 2012) or Transformer-type architectures (Wu and He, 2019) have been investigated. However, these approaches are usually limited to local, intra-sentence, relations and are not suited for document-level, inter-sentence, classification. Since complex relations require the aggregation of information distributed over multiple sentences, document-level relation extraction has recently drawn attention (e.g. Quirk and Poon 2017; Verga et al. 2018; Gupta et al. 2019; Yao et al. 2019). Still, these models rely on specific entity mentions to be given.
While progress in the joint detection of entity mentions and intra-sentence relations has been made (Gupta et al., 2016; Bekoulis et al., 2018; Luan et al., 2018), the combination of coreference resolution with relation extraction for entity-level reasoning in a single, jointly-trained, model is widely unexplored.
Document-level Relation Extraction
Recent work on document-level relation extraction directly learns relations between entities (i.e. clusters of mentions referring to the same entity) within a document, requiring no relation annotations on mention level. To gather relevant information across sentence boundaries, multi-instance learning has successfully been applied to this task. In multi-instance learning, the goal is to assign labels to bags (here, entity pairs), each containing multiple instances (here, specific mention pairs). Verga et al. (2018) apply multi-instance learning to detect domain-specific relations in biological text. They compute relation scores for each mention pair of two entity clusters and aggregate these scores using a smooth max-pooling operation. Christopoulou et al. (2019) and Sahu et al. (2019) improve upon Verga et al. (2018) by constructing document-level graphs to model global interactions. While the aforementioned models tackle very specific domains with few relation types, the recently released DocRED dataset (Yao et al., 2019) enables general-domain research on a rich relation type set (96 types). Yao et al. (2019) provide several baseline architectures, such as CNN-, LSTM- or Transformer-based models, that operate on global, mention-averaged, entity representations. Wang et al. (2019) use a two-step process by identifying related entities in a first step and classifying them in a second step. Tang et al. (2020) employ a hierarchical inference network, combining entity representations with attention over individual sentences to form the final decision. Nan et al. (2020) apply a graph neural network (Kipf and Welling, 2017) to construct a document-level graph of mention, entity and meta-dependency nodes. The current state of the art is the CorefRoBERTa model proposed by Ye et al. (2020), a RoBERTa (Liu et al., 2019) variant that is pre-trained on detecting co-referring phrases. They show that replacing RoBERTa with CorefRoBERTa improves performance on DocRED. All these models have in common that entities and their mentions are both assumed to be given. In contrast, our approach extracts mentions, clusters them to entities, and classifies relations jointly.

Joint Entity Mention and Relation Extraction
Prior joint models focus on the extraction of mention-level relations in sentences. Here, most approaches detect mentions by BIO (or BILOU) tagging and pair detected mentions for relation classification, e.g. (Gupta et al., 2016; Zhou et al., 2017; Zheng et al., 2017; Bekoulis et al., 2018; Nguyen and Verspoor, 2019; Miwa and Bansal, 2016). However, these models are not able to detect relations between overlapping entity mentions. Recently, so-called span-based approaches (Lee et al., 2017) were successfully applied to this task (Luan et al., 2018; Eberts and Ulges, 2019): By enumerating each token span of a sentence, these models handle overlapping mentions by design. Sanh et al. (2019) train a multi-task model on named entity recognition, coreference resolution and relation extraction. By adding coreference resolution as an auxiliary task, Luan et al. (2019) propagate information through coreference chains. Still, these models rely on mention-level annotations and only detect intra-sentence relations between mentions, whereas our model explicitly constructs clusters of co-referring mentions and uses these clusters to detect complex entity-level relations in long documents using multi-instance reasoning.
Approach

JEREX processes documents containing multiple sentences and extracts entity mentions, clusters them to entities, and outputs types and relations on entity level. JEREX consists of four task-specific components, which are based on the same encoder and mention representations, and are trained in a joint manner. An input document is first tokenized, yielding a sequence of n byte-pair encoded (BPE) (Sennrich et al., 2016) tokens. We then use the pre-trained Transformer-type network BERT (Devlin et al., 2019) to obtain a contextualized embedding sequence (e_1, e_2, ..., e_n) of the document. Since our goal is to perform end-to-end relation extraction, neither entities nor their corresponding mentions in the document are known in inference.

We suggest a multi-level model: First, we localize all entity mentions in the document (a) by a span-based approach (Lee et al., 2017). After this, detected mentions are clustered into entities by coreference resolution (b). We then classify the type (such as person or company) of each entity cluster by a fusion over local mention representations (entity classification) (c). Finally, relations between entities are extracted by a reasoning over mention pairs (d). The full model architecture is illustrated in Figure 2.

(a) Entity Mention Localization
Here our model performs a search over all document token subsequences (or spans). In contrast to BIO/BILOU-based approaches for entity mention localization, span-based approaches are able to detect overlapping mentions. Let s := (e_i, e_{i+1}, ..., e_{i+k}) denote an arbitrary candidate span. Following Eberts and Ulges (2019), we first obtain a span representation by max-pooling the span's token embeddings:

    e(s) := max-pool(e_i, e_{i+1}, ..., e_{i+k})    (1)

Our mention classifier takes the span representation e(s) as well as a span size embedding w^s_{k+1} (Lee et al., 2017) as meta information. We perform binary classification and use a sigmoid activation to obtain a probability for s to constitute an entity mention:

    ŷ_s = σ(FFNN_s(e(s) ∘ w^s_{k+1}))    (2)

where ∘ denotes concatenation and FFNN_s is a two-layer feedforward network with an inner ReLU activation. Span classification is carried out on all token spans up to a fixed length L. We apply a filter threshold α_s on the confidence scores, retaining all spans with ŷ_s ≥ α_s and leaving a set S of spans supposedly constituting entity mentions.

(b) Coreference Resolution
Entity mentions referring to the same entity (e.g. "Elizabeth II." and "the Queen") can be scattered throughout the input document. To later extract relations on entity level, local mentions need to be grouped to document-level entity clusters by coreference resolution. We use a simple mention-pair (Soon et al., 2001) model: Our component classifies pairs (s_1, s_2) ∈ S × S of detected entity mentions as coreferent or not, by combining the span representations e(s_1) and e(s_2) with an edit distance embedding w^c_d: We compute the Levenshtein distance (Levenshtein, 1966) between spans, d := D(s_1, s_2), and use a learned embedding w^c_d. A mention pair representation x^c is constructed by concatenation:

    x^c := e(s_1) ∘ e(s_2) ∘ w^c_d    (3)
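For illustration, the following is a minimal PyTorch sketch of the span filtering in Equations (1)–(2); the hidden size, span-size embedding dimension and maximum span length are illustrative assumptions, not necessarily the values of the released implementation. The pair representation of Equation (3) is built by an analogous concatenation.

```python
import torch
import torch.nn as nn

class MentionClassifier(nn.Module):
    """Sketch of Equations (1)-(2): a span is represented by max-pooling
    its token embeddings, concatenated with a span-size embedding, and
    scored by a two-layer FFNN with a sigmoid output."""

    def __init__(self, hidden_size: int = 768, size_dim: int = 25, max_span_len: int = 10):
        super().__init__()
        self.size_embedding = nn.Embedding(max_span_len + 1, size_dim)
        self.ffnn = nn.Sequential(
            nn.Linear(hidden_size + size_dim, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, token_embs: torch.Tensor, start: int, end: int) -> torch.Tensor:
        # Equation (1): e(s) := max-pool(e_i, ..., e_{i+k})
        span_repr = token_embs[start:end].max(dim=0).values
        size_emb = self.size_embedding(torch.tensor(end - start))
        # Equation (2): y_s = sigmoid(FFNN_s(e(s) concat w^s_{k+1}))
        return torch.sigmoid(self.ffnn(torch.cat([span_repr, size_emb])))

# Usage: all spans with a score >= alpha_s are kept as entity mentions.
```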
Figure 2: Our approach combines entity mention localization (a), coreference resolution (b), entity classification (c) and relation classification (d) within a joint multi-task model, which is trained jointly on entity-level relation extraction. The sub-components share a single BERT encoder for document encoding. Each input document is only encoded once (single-pass) to speed up training/inference, with sub-components operating on the contextualized embeddings. Both entity classification and relation classification use multi-instance learning to synthesize relevant signals scattered throughout the input document.
Similar to span classification, we conduct binary classification using a sigmoid activation, obtaining a similarity score between the two mentions:

    ŷ_c := σ(FFNN_c(x^c))    (4)

where FFNN_c follows the same architecture as FFNN_s. We construct a similarity matrix C ∈ R^{m×m} (with m referring to the document's overall number of mentions) containing the similarity scores between every mention pair. By applying a filter threshold α_c, we cluster mentions using complete linkage (Müllner, 2011), yielding a set E containing clusters of entity mentions. We refer to these clusters as entities or entity clusters in the following.

(c) Entity Classification
Next, we map each entity to a type such as location or person: We first fuse the mention representations of an entity cluster {s_1, s_2, ..., s_t} ∈ E by max-pooling:

    x^e := max-pool(e(s_1), e(s_2), ..., e(s_t))    (5)

Entity classification is then carried out on the entity representation x^e, allowing the model to draw information from mentions spread across different parts of the document. x^e is fed into a softmax classifier, yielding a probability distribution over the entity types:

    ŷ_e := softmax(FFNN_e(x^e))    (6)

We assign the highest scored type to the entity.
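The clustering step can be illustrated with SciPy's agglomerative clustering. This is a sketch under the assumption that the similarity matrix C of Equation (4) is converted to distances as 1 − C; the exact implementation is not prescribed by the description above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_mentions(C: np.ndarray, alpha_c: float) -> list:
    """Group mentions into entity clusters by complete linkage: a cluster
    is only formed if *every* mention pair within it has a similarity of
    at least alpha_c (i.e. a distance of at most 1 - alpha_c)."""
    dist = 1.0 - C                    # turn similarities into distances
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="complete")
    labels = fcluster(Z, t=1.0 - alpha_c, criterion="distance")
    clusters = {}
    for mention_idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(mention_idx)
    return list(clusters.values())

# Equation (5) then pools each cluster into an entity representation, e.g.:
# x_e = span_reprs[cluster].max(axis=0)   # for each cluster of mention indices
```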
(d) Relation Classification
Our final component assigns relation types to pairs of entities. Note that the directionality, i.e. which entity constitutes the head/tail of the relation, needs to be inferred, and that the input document can express multiple relations between different mentions of the same entity pair. Let R denote a set of pre-defined relation types. The relation classifier processes each entity pair (e_1, e_2) ∈ E × E, estimating which, if any, relations from R are expressed between these entities. To do so, we score every candidate triple (e_1, r_i, e_2), expressing that e_1 (as head) is in relation r_i with e_2 (as tail). We design two types of relation classifiers: A global relation classifier, serving as a baseline, which consumes the entity cluster representations x^e, and a multi-instance classifier, which assumes that certain entity mention pairs support specific relations and synthesizes this information into an entity-pair level representation.

Global Relation Classifier (GRC)
The global classifier builds upon the max-pooled entity cluster representations x^{e_1} and x^{e_2} of an entity pair (e_1, e_2). We further embed the corresponding entity types (w_{e_1} / w_{e_2}), which was shown to be beneficial in prior work (Yao et al., 2019), and compute an entity-pair representation by concatenation:

    x^p := (x^{e_1} ∘ w_{e_1}) ∘ (x^{e_2} ∘ w_{e_2})    (7)

This representation is fed into a 2-layer FFNN (similar to FFNN_s), mapping it to the number of relation types |R|. The final layer features sigmoid activations for multi-label classification and assigns any relation type exceeding a threshold α_r:

    ŷ_r := σ(FFNN_p(x^p))    (8)
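A PyTorch sketch of the global classifier (Equations (7)–(8)); the embedding sizes are assumptions, while the 6 entity types and 96 relation types follow DocRED:

```python
import torch
import torch.nn as nn

class GlobalRelationClassifier(nn.Module):
    """Sketch of Equations (7)-(8): concatenate the two entity cluster
    representations with their entity-type embeddings and score all
    relation types with independent sigmoids (multi-label)."""

    def __init__(self, hidden_size: int = 768, type_dim: int = 25,
                 num_entity_types: int = 6, num_relations: int = 96):
        super().__init__()
        self.type_embedding = nn.Embedding(num_entity_types, type_dim)
        self.ffnn = nn.Sequential(
            nn.Linear(2 * (hidden_size + type_dim), hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_relations),
        )

    def forward(self, x_e1, x_e2, type_e1, type_e2):
        # Equation (7): x_p := (x_e1 concat w_e1) concat (x_e2 concat w_e2)
        x_p = torch.cat([x_e1, self.type_embedding(type_e1),
                         x_e2, self.type_embedding(type_e2)], dim=-1)
        # Equation (8): every relation scoring >= alpha_r is assigned
        return torch.sigmoid(self.ffnn(x_p))
```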
Multi-instance Relation Classifier (MRC)
In contrast to the global classifier (GRC), the multi-instance relation classifier operates on mention level: Since only entity-level labels are available, we treat entity mention pairs as latent variables and estimate relations by a fusion over these mention pairs. For any pair of entity clusters e_1 and e_2, we compute a mention-pair representation for any (s_1, s_2) ∈ e_1 × e_2. This representation is obtained by concatenating the global entity embeddings (Equation (5)) with the mentions' local span representations (Equation (1)):

    u(s_1, s_2) := (e(s_1) ∘ x^{e_1}) ∘ (e(s_2) ∘ x^{e_2})    (9)

Further, as we expect close-by mentions to be stronger indicators of relations, we add meta embeddings for the distances d_s, d_t between the two mentions, both in sentences (d_s) and in tokens (d_t). In addition, following Eberts and Ulges (2019), the max-pooled context between the two mentions (c(s_1, s_2)) is added. This localized context provides a more focused view on the document and was found to be especially beneficial for long, and therefore noisy, inputs:

    u'(s_1, s_2) := u(s_1, s_2) ∘ c(s_1, s_2) ∘ w^r_{d_s} ∘ w^{r'}_{d_t}    (10)

This mention-pair representation is mapped by a single feed-forward layer to the original token embedding size:

    u''(s_1, s_2) := FFNN_p(u'(s_1, s_2))    (11)

These focused representations are then combined by max-pooling:

    x^r = max-pool({u''(s_1, s_2) | s_1 ∈ e_1, s_2 ∈ e_2})    (12)

Akin to GRC, we concatenate x^r with entity type embeddings w_{e_1} / w_{e_2} and apply a two-layer FFNN (again, similar to FFNN_s). Note that for both classifiers (GRC/MRC), we need to score both (e_1, r_i, e_2) and (e_2, r_i, e_1) to infer the direction of asymmetric relations. A code sketch of this fusion and of the joint loss is given after the training procedure below.

Training

We perform a supervised multi-task training, where each training document features ground truth for all four subtasks (mention localization, coreference resolution, as well as entity and relation classification). We optimize the joint loss of all four components:

    L := β_s · L_s + β_c · L_c + β_e · L_e + β_r · L_r    (13)

L_s, L_c and L_r denote the binary cross entropy losses of the span, coreference and relation classifiers. We use a cross entropy loss (L_e) for the entity classifier. A batch is formed by drawing positive and negative samples from a single document for all components. We found such a single-pass approach to offer significant speed-ups both in learning and inference:

• Entity mention localization: We utilize all ground truth entity mentions S_gt of a document as positive training samples, and sample a fixed number N_s of random non-mention spans up to a pre-defined length L_s as negative samples. Note that we only train and evaluate on the full tokens according to the dataset's tokenization, i.e. not on byte-pair encoded tokens, to limit computational complexity. Also, we only sample intra-sentence spans as negative samples. Since we found intra-mention spans to be especially challenging ("New York" versus "New York City"), we sample up to N_s intra-mention spans as negative samples.

• Coreference resolution: The coreference classifier is trained on all span pairs drawn from ground truth entity clusters E_gt as positive samples.
We further sample a fixed number N_c of pairs of random ground truth entity mentions that do not belong to the same cluster as negative samples.

• Entity classification: Since the entity classifier only receives clusters that supposedly constitute an entity during inference, it is trained on all ground truth entity clusters of a document.

• Relation classification: Here we use ground truth relations between entity clusters as positive samples and N_r negative samples drawn from E_gt × E_gt that are unrelated according to the ground truth.

Each component's loss is obtained by averaging over all samples. We learn the weights and biases of sub-component specific layers as well as the meta embeddings during training. BERT is fine-tuned in the process.
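The multi-instance fusion (Equations (11)–(12)) and the joint objective (Equation (13)) can be sketched as follows; since the exact loss weights are not shown here, the betas default to 1 in this illustration:

```python
import torch
import torch.nn as nn

def multi_instance_fusion(u_prime: torch.Tensor, proj: nn.Linear) -> torch.Tensor:
    """Sketch of Equations (11)-(12). `u_prime` stacks the representations
    u'(s1, s2) of all mention pairs of one entity pair, with shape
    (num_pairs, pair_dim). Each pair is projected back to the token
    embedding size, then the pairs are fused by max-pooling into x_r."""
    u_pp = proj(u_prime)              # Equation (11): u'' := FFNN_p(u')
    return u_pp.max(dim=0).values     # Equation (12): x_r = max-pool over pairs

def joint_loss(l_s, l_c, l_e, l_r, beta_s=1.0, beta_c=1.0, beta_e=1.0, beta_r=1.0):
    """Equation (13): weighted sum of the four task losses (binary cross
    entropy for spans, coreference and relations; cross entropy for
    entity types). The betas are tuned manually."""
    return beta_s * l_s + beta_c * l_c + beta_e * l_e + beta_r * l_r
```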
Table 1: Test set evaluation results of our multi-level end-to-end system JEREX on DocRED (using the end-to-end split). We either train the model jointly on all four sub-components (left) or arrange separately trained models in a pipeline (right) (* joint results are for MRC except for the last row; digits shown as "–" are illegible in the source).

Level / Task                      | Joint Model* (P / R / F1) | Pipeline (P / R / F1)
(a) Mention Localization          | –.29 / 92.70 / 92.99      | 92.87 / 92.46 / 92.–
(b) Coreference Resolution        | –.52 / 83.06 / 82.79      | 82.11 / 82.66 / 82.–
(c) Entity Classification         | –.84 / 80.36 / 80.10      | 79.00 / 79.52 / 79.–
(d) Relation Classification       | –.76 / 38.25 / 40.38      | 43.61 / 37.50 / 40.–
    Relation Classification (GRC) | –.69 / 37.32 / 37.98      | 39.07 / 36.44 / 37.–

Experiments

We evaluate JEREX on the DocRED dataset (Yao et al., 2019). DocRED is the most diverse relation extraction dataset to date (6 entity and 96 relation types). It includes over 5,000 documents, each consisting of multiple sentences. According to Yao et al. (2019), DocRED requires multiple types of reasoning, such as logical or common-sense reasoning, to infer relations.

Note that previous work only uses DocRED for relation extraction (which equals our relation classifier component) and assumes entities to be given (e.g. Wang et al. 2019; Nan et al. 2020). On the other hand, DocRED is exhaustively annotated with mentions, entities and entity-level relations, making it suitable for end-to-end systems. Therefore, we evaluate JEREX both as a relation classifier (to compare it with the state-of-the-art) and as a joint model (as reference for future work on joint entity-level relation extraction).

While prior joint models focus on mention-level relations (e.g. Gupta et al. 2016; Bekoulis et al. 2018; Chi et al. 2019), we extend the strict evaluation setting to entity level: A mention is counted as correct if its span matches a ground truth mention span. An entity cluster is considered correct if it matches the ground truth cluster exactly and the corresponding mention spans are correct. Likewise, an entity is considered correct if the cluster as well as the entity type matches a ground truth entity. Lastly, we count a relation as correct if its argument entities as well as the relation type are correct. We measure precision, recall and micro-F1 for each sub-task and report micro-averaged scores.
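This strict criterion amounts to exact matching of hashable items; the tuple layout below is an illustrative assumption, not the data format of our implementation:

```python
def micro_f1(predicted: set, ground_truth: set) -> float:
    """Strict micro-averaged F1: an item only counts as a true positive
    if it matches a ground truth item exactly. For relations, an item is
    e.g. (head_entity, relation_type, tail_entity), where an entity is a
    (frozenset of mention spans, entity type) pair."""
    tp = len(predicted & ground_truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```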
Table 2: DocRED dataset split used for end-to-end re-lation extraction.
Dataset split
The original DocRED dataset is split into a train (3,053 documents), dev (1,000) and test (1,000) set. However, test relation labels are hidden and evaluation requires the submission of results via CodaLab. To evaluate end-to-end systems, we form a new split by merging train and dev. We randomly sample a train (3,008 documents), dev (300 documents) and test set (700 documents). Note that we removed 45 documents since they contained wrongly annotated entities with mentions of different types. Table 2 contains statistics of our end-to-end split. We release the split as a reference for future work.
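A sketch of how such a split can be formed; the seed and the filtering condition on DocRED's JSON format are illustrative assumptions (for comparability, the released split should be used):

```python
import random

def end_to_end_split(train_docs: list, dev_docs: list, seed: int = 0):
    """Merge the original DocRED train and dev sets, drop documents with
    inconsistently typed entity mentions, and randomly sample new
    train/dev/test sets of 3,008 / 300 / 700 documents."""
    docs = [d for d in train_docs + dev_docs if consistent_entity_types(d)]
    random.Random(seed).shuffle(docs)
    return docs[:3008], docs[3008:3308], docs[3308:4008]

def consistent_entity_types(doc) -> bool:
    # DocRED's "vertexSet" lists, per entity, all of its mentions with a type
    return all(len({m["type"] for m in entity}) == 1 for entity in doc["vertexSet"])
```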
Hyperparameters
We use BERT_BASE (cased) for document encoding, an attention-based language model pre-trained on English text (Devlin et al., 2019); we use the implementation from Wolf et al. (2019). Hyperparameters were tuned on the end-to-end dev set: We adopt several settings from Devlin et al. (2019), including the usage of the Adam optimizer with a linear warmup and linear decay learning rate schedule, a peak learning rate of 5e-5 (selected by a grid search over [5e-6, 1e-5, 5e-5, 1e-4, 5e-4]) and application of dropout throughout the model. We set the size of the meta embeddings (w^s, w^c, w^e, w^r_{d_s}, w^{r'}_{d_t}) and the number of training epochs to fixed values tuned on the dev set.
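This optimization setup maps onto common PyTorch/transformers utilities; in the sketch below, `model`, the warmup fraction and the step count are placeholders rather than values from this work, and AdamW stands in for the Adam variant described above:

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(10, 1)        # stand-in for the JEREX model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # peak learning rate

num_training_steps = 10_000           # placeholder: epochs * batches per epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # assumed warmup fraction
    num_training_steps=num_training_steps,
)

for step in range(num_training_steps):
    ...                               # forward pass, joint loss, backward pass
    optimizer.step()
    scheduler.step()                  # linear warmup, then linear decay to 0
    optimizer.zero_grad()
```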
Performance is measured once per epoch on the dev set, out of which the best performing model is used for the final evaluation on the test set. A grid search with a step size of 0.05 is performed for the mention, coreference and relation filter thresholds (α_s, α_c, α_r(GRC), α_r(MRC)). The number of negative samples (N_s = N_c = N_r = 200) and sub-task loss weights (β_s = β_c = β_r = 1, with β_e set to a value below 1) are manually tuned. Note that some documents in DocRED exceed the maximum context size of BERT (512 BPE tokens). In this case we train the remaining position embeddings from scratch.

Model                           | Ign F1 | F1
CNN (Yao et al., 2019)          | 40.33  | 42.26
LSTM (Yao et al., 2019)         | 47.71  | 50.07
Ctx-Aware (Yao et al., 2019)*   | 48.40  | 50.70
BiLSTM (Yao et al., 2019)       | 48.78  | 51.06
Two-Step (Wang et al., 2019)*   | –      | –
HIN (Tang et al., 2020)*        | 53.70  | 55.60
JEREX (GRC)*                    | –.76   | 55.–
LSR (Nan et al., 2020)*         | 56.97  | 59.05
CorefRoBERTa (Ye et al., 2020)* | 57.90  | 60.25
JEREX (MRC)*                    | –      | –

Table 3: Comparison of our relation classification component (GRC/MRC) with the state-of-the-art on the DocRED relation extraction task. We report test set results on the original DocRED split. Ign F1 ignores relational facts also present in the train set. Models marked with * use a Transformer-type model for document encoding. Values shown as "–" are illegible in the source.
JEREX is trained and evaluated on the end-to-end dataset split (see Table 2). We perform 5 runs for each experiment and report the averaged results. To study the effects of joint training, we experiment with two approaches: (a) All four sub-components are trained jointly in a single model as described in Section 3.2, and (b) we construct a pipeline system by training each task separately and not sharing the document encoder.

Table 1 illustrates the results for the joint (left) and pipeline (right) approach. As described in Section 3, each sub-task builds on the results of the previous component during inference. We observe the biggest performance drop for the relation classification task, underlining the difficulty in detecting document-level relations.
Task                          | JM* F1 | SM F1
Mention Localization          | –.99   | 92.–
Coreference Resolution        | –.54   | 90.–
Entity Classification         | –.66   | 95.–
Relation Classification       | –.46   | 59.–
Relation Classification (GRC) | –.45   | 56.–

Table 4: Single-task performance of the joint model (JM, left) and separate models (SM, right) on the end-to-end split (* joint results are for MRC except for the last row; digits shown as "–" are illegible in the source).

Furthermore, the multi-instance based relation classifier (MRC) outperforms the global relation classifier (GRC) by about 2.4% F1 score. We reason that the fusion of local evidences by multi-instance learning helps the model to focus on appropriate document sections and alleviates the impact of noise in long documents. Moreover, we found the multi-instance selection to offer good interpretability, usually selecting the most relevant instances (see Figure 3 for examples). Overall, we observe a comparable performance by joint training versus using the pipeline system.

This is also confirmed by the results reported in Table 4, where we evaluate the four components independently, i.e. each component receives ground truth samples from the previous step in the hierarchy (e.g. ground truth mentions for coreference resolution). Again, we observe the performance difference between the joint and pipeline model to be negligible. This shows that it is not necessary to build separate models for each task, which would result in training and inference overhead due to multiple expensive BERT passes. Instead, a single neural model is able to jointly learn all tasks necessary for document-level relation extraction, therefore easing training, inference and maintenance.

We also compare our model with the state-of-the-art on DocRED's relation extraction task. Here, entity clusters are assumed to be given. We train and test our relation classification component on the original DocRED dataset split. Since test set labels are hidden, we submit the best out of 5 runs on the development set via CodaLab to retrieve the test set results. Table 3 includes previously reported results from current state-of-the-art models. Note that our global classifier (GRC) is similar to
the baseline by Yao et al. (2019). However, we replace mention span averaging with max-pooling and also choose max-pooling to aggregate mentions into an entity representation, yielding considerable improvement over the baseline. Using the multi-instance classifier (MRC) instead further improves performance by about 4.5%. Here our model also outperforms complex methods based on graph attention networks (Nan et al., 2020) or specialized pre-training (Ye et al., 2020), achieving a new state-of-the-art result on DocRED's relation extraction task.

Figure 3: Two example documents of the DocRED dataset. Highlighted are relations "creator" between "Queequeg" and "Herman Melville" (top) and "developer" between "Shadowrun Returns" and "Harebrained Schemes" (bottom). Bordered pairs are the top selections of the multi-instance relation classifier.
Ablation Studies

We perform several ablation studies to evaluate the contributions of our proposed multi-instance relation classifier enhancements: We remove either the global entity representations x^{e_1}, x^{e_2} (Equation 5) (a) or the localized context representation c(s_1, s_2) (Equation 10) (b). Performance drops when global entity representations are omitted, indicating that multi-instance reasoning benefits from the incorporation of entity-level context. When the localized context representation is omitted, performance is likewise reduced, confirming the importance of guiding the model to relevant input sections. Finally, we limit the model to fusing only intra-sentence mention pairs (c). In case no such instance exists for an entity pair, the closest (in token distance) mention pair is selected. Obviously, this modification reduces computational complexity and memory consumption, especially for large documents. Nevertheless, while we observe intra-sentence pairs to cover most relevant signals, exhaustively pairing all mentions of an entity pair yields an improvement (Table 5).
Table 5: Ablation studies for the multi-level relation classifier (MRC) using the end-to-end split: starting from the full MRC, we either remove global entity representations (a), the localized context (b) or only use intra-sentence mention pairs (c). The results are averaged over 5 runs. (The F1 values of this table are illegible in the source.)
Conclusions

We have introduced JEREX, a novel multi-task model for end-to-end relation extraction. In contrast to prior systems, JEREX combines entity mention localization with coreference resolution to extract entity types and relations on an entity level. We report first results for entity-level, end-to-end, relation extraction as a reference for future work. Furthermore, we achieve state-of-the-art results on the DocRED relation extraction task by enhancing multi-instance reasoning with global entity representations and a localized context, outperforming several more complex solutions. We showed that training a single model jointly on all sub-tasks instead of using a pipeline approach performs roughly on par, eliminating the need of training separate models and accelerating inference. One of the remaining shortcomings lies in the detection of false positive relations, which are plausible according to the entities' types but are not actually expressed in the document. Exploring options to reduce these false positive predictions seems to be an interesting challenge for future work.

Acknowledgments
This work was funded by the German Federal Ministry of Education and Research (Program FHprofUnt, Project DeepCA (13FH011PX6)).
References
Giannis Bekoulis, Johannes Deleu, Thomas Demeester, and Chris Develder. 2018. Joint entity recognition and relation extraction as a multi-head selection problem. Expert Systems with Applications, 114:34–45.

Renjun Chi, Bin Wu, Linmei Hu, and Yunlei Zhang. 2019. Enhancing Joint Entity and Relation Extraction with Language Modeling and Hierarchical Attention. In Proc. APWeb-WAIM, LNCS 11641, pages 314–328.

Fenia Christopoulou, Makoto Miwa, and Sophia Ananiadou. 2019. Connecting the Dots: Document-level Neural Relation Extraction with Edge-oriented Graphs. In Proc. of EMNLP and IJCNLP 2019, pages 4925–4936, Hong Kong, China. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proc. of NAACL-HLT 2019, pages 4171–4186, Minneapolis, Minnesota. ACL.

Thomas G. Dietterich, Richard H. Lathrop, and Tomás Lozano-Pérez. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1):31–71.

Markus Eberts and Adrian Ulges. 2019. Span-based Joint Entity and Relation Extraction with Transformer Pre-training. In Proc. of ECAI 2020, pages 2006–2013.

Pankaj Gupta, Subburam Rajaram, Hinrich Schütze, and Thomas A. Runkler. 2019. Neural Relation Extraction within and across Sentence Boundaries. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6513–6520. AAAI Press.

Pankaj Gupta, Hinrich Schütze, and Bernt Andrassy. 2016. Table Filling Multi-Task Recurrent Neural Network for Joint Entity and Relation Extraction. In Proc. of COLING 2016, pages 2537–2547, Osaka, Japan. The COLING 2016 Organizing Committee.

Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018. FewRel: A Large-Scale Supervised Few-Shot Relation Classification Dataset with State-of-the-Art Evaluation. In Proc. of EMNLP 2018, pages 4803–4809, Brussels, Belgium. ACL.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 Task 8: Multi-way Classification of Semantic Relations Between Pairs of Nominals. In Proc. of the 5th International Workshop on Semantic Evaluation, SemEval '10, pages 33–38, Stroudsburg, PA, USA. ACL.

Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In Proc. of ICLR 2017.

Shantanu Kumar. 2017. A Survey of Deep Learning Methods for Relation Extraction. CoRR, abs/1705.03645.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end Neural Coreference Resolution. In Proc. of EMNLP 2017, pages 188–197, Copenhagen, Denmark. ACL.

V. I. Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10:707–710.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, abs/1907.11692.

Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction. In Proc. of EMNLP 2018, pages 3219–3232, Brussels, Belgium. ACL.

Yi Luan, Dave Wadden, Luheng He, Amy Shah, Mari Ostendorf, and Hannaneh Hajishirzi. 2019. A General Framework for Information Extraction using Dynamic Span Graphs. In Proc. of NAACL-HLT 2019, volume 1, pages 3036–3046, Minneapolis, Minnesota. ACL.

Makoto Miwa and Mohit Bansal. 2016. End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures. In Proc. of ACL 2016, pages 1105–1116, Berlin, Germany. ACL.

Daniel Müllner. 2011. Modern hierarchical, agglomerative clustering algorithms. CoRR, abs/1109.2378.

Guoshun Nan, Zhijiang Guo, Ivan Sekulic, and Wei Lu. 2020. Reasoning with Latent Structure Refinement for Document-Level Relation Extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1546–1557, Online. Association for Computational Linguistics.

Dat Quoc Nguyen and Karin Verspoor. 2019. End-to-end neural relation extraction using deep biaffine attention. In Advances in Information Retrieval, pages 729–738, Cham. Springer International Publishing.

Chris Quirk and Hoifung Poon. 2017. Distant Supervision for Relation Extraction beyond the Sentence Boundary. In Proceedings of EACL: Volume 1, Long Papers, pages 1171–1182, Valencia, Spain. Association for Computational Linguistics.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling Relations and Their Mentions without Labeled Text. In Machine Learning and Knowledge Discovery in Databases, pages 148–163, Berlin, Heidelberg. Springer Berlin Heidelberg.

Sunil Kumar Sahu, Fenia Christopoulou, Makoto Miwa, and Sophia Ananiadou. 2019. Inter-sentence Relation Extraction with Document-level Graph Convolutional Neural Network. In Proc. of the 57th ACL, pages 4309–4316, Florence, Italy. Association for Computational Linguistics.

Victor Sanh, Thomas Wolf, and Sebastian Ruder. 2019. A Hierarchical Multi-Task Approach for Learning Embeddings from Semantic Tasks. Proceedings of the AAAI Conference on Artificial Intelligence, 33:6949–6956.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic Compositionality Through Recursive Matrix-Vector Spaces. In Proc. of EMNLP-CoNLL 2012, pages 1201–1211, Stroudsburg, PA, USA. ACL.

Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A Machine Learning Approach to Coreference Resolution of Noun Phrases. Computational Linguistics, 27(4):521–544.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance Multi-label Learning for Relation Extraction. In Proceedings of EMNLP-CoNLL 2012, pages 455–465, Jeju Island, Korea. Association for Computational Linguistics.

Hengzhu Tang, Yanan Cao, Zhenyu Zhang, Jiangxia Cao, Fang Fang, Shi Wang, and Pengfei Yin. 2020. HIN: Hierarchical Inference Network for Document-Level Relation Extraction. In Advances in Knowledge Discovery and Data Mining, pages 197–209, Cham. Springer International Publishing.

Patrick Verga, Emma Strubell, and Andrew McCallum. 2018. Simultaneously Self-Attending to All Mentions for Full-Abstract Biological Relation Extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 872–884, New Orleans, Louisiana. Association for Computational Linguistics.

David Wadden, Ulme Wennberg, Yi Luan, and Hannaneh Hajishirzi. 2019. Entity, Relation, and Event Extraction with Contextualized Span Representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5784–5789, Hong Kong, China. Association for Computational Linguistics.

Hong Wang, Christfried Focke, Rob Sylvester, Nilesh Mishra, and William W. J. Wang. 2019. Fine-tune BERT for DocRED with Two-step Process. ArXiv, abs/1909.11898.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv, abs/1910.03771.

Shanchan Wu and Yifan He. 2019. Enriching Pre-trained Language Model with Entity Information for Relation Classification. CoRR, abs/1905.08284.

Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. 2019. DocRED: A Large-Scale Document-Level Relation Extraction Dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 764–777, Florence, Italy. Association for Computational Linguistics.

Deming Ye, Yankai Lin, Jiaju Du, Zhenghao Liu, Peng Li, Maosong Sun, and Zhiyuan Liu. 2020. Coreferential Reasoning Learning for Language Representation. In Proc. of EMNLP, pages 7170–7186, Online. Association for Computational Linguistics.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation Classification via Convolutional Deep Neural Network. In Proc. of COLING 2014, pages 2335–2344, Dublin, Ireland. Dublin City University and ACL.

Dongxu Zhang and Dong Wang. 2015. Relation Classification via Recurrent Neural Network. CoRR, abs/1508.01006.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware Attention and Supervised Data Improve Slot Filling. In Proc. of EMNLP 2017, pages 35–45. ACL.

Suncong Zheng, Feng Wang, Hongyun Bao, Yuexing Hao, Peng Zhou, and Bo Xu. 2017. Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme. In Proc. of ACL 2017, pages 1227–1236, Vancouver, Canada. ACL.

Peng Zhou, Suncong Zheng, Jiaming Xu, Zhenyu Qi, Hongyun Bao, and Bo Xu. 2017. Joint Extraction of Multiple Relations and Entities by Using a Hybrid Neural Network. In