Multi-Sentence Argument Linking
Seth Ebner*, Patrick Xia*, Ryan Culkin, Kyle Rawlins, Benjamin Van Durme
Johns Hopkins University
{seth, paxia}@cs.jhu.edu, {rculkin, kgr, vandurme}@jhu.edu
* Equal contribution. Data and code at http://nlp.jhu.edu/rams/.

Abstract
We present a novel document-level model for finding argument spans that fill an event's roles, connecting related ideas in sentence-level semantic role labeling and coreference resolution. Because existing datasets for cross-sentence linking are small, development of our neural model is supported through the creation of a new resource, Roles Across Multiple Sentences (RAMS), which contains 9,124 annotated events across 139 types. We demonstrate strong performance of our model on RAMS and other event-related datasets.

Textual event descriptions may span multiple sentences, yet large-scale datasets predominately annotate for events and their arguments at the sentence level. This has driven researchers to focus on sentence-level tasks such as semantic role labeling (SRL), even though perfect performance at such tasks would still enable a less than complete understanding of an event at the document level. In this work, we approach event understanding as a form of linking, more akin to coreference resolution than sentence-level SRL. An event trigger evokes a set of roles regarded as latent arguments, with these implicit arguments then potentially linked to explicit mentions in the text.

Consider the example in Figure 1: the AirstrikeMissileStrike event (triggered by "bombarding") gives rise to a frame or set of type-level roles (attacker, target, instrument, place) with the referents ("Russians", "rebel outpost", "aircraft", "Syria"). Intuitively we recognize the possible existence of fillers for these roles, for example, the place of the particular AirstrikeMissileStrike event. These implicit arguments are linked to explicit arguments in the document (i.e., text spans); $\epsilon$ would indicate there is no explicit referent in the text. We refer to the task of finding explicit argument(s) to fill each role for an event as argument linking.

Figure 1: A passage annotated for an event's type, trigger, and arguments. Each arc points from the trigger to the argument that fills the labeled role.

Prior annotation of cross-sentence argument links has produced small datasets, with a focus either on a small number of predicate types (Gerber and Chai, 2010, 2012; Feizabadi and Padó, 2014) or on a small number of documents (Ruppenhofer et al., 2010). To enable the development of a neural model for argument linking, we produce Roles Across Multiple Sentences (RAMS), a dataset of 9,124 annotated events from news based on an ontology of 139 event types and 65 roles. In a 5-sentence window around each event trigger, we annotate the closest argument span for each role.

Our model builds on recent ideas in span selection models (Lee et al., 2018; He et al., 2018; Ouchi et al., 2018), used in this work for the multi-sentence argument linking task for RAMS and for several other event-based datasets (Gerber and Chai, 2012; Pradhan et al., 2013; Pavlick et al., 2016, AIDA Phase 1). On RAMS our best model achieves 68.3 F1, and it achieves 73.3 F1 when event types are also known, outperforming strong baselines. We also demonstrate effective use of RAMS as pre-training for a related dataset.

Our main contributions are a novel model for argument linking and a new large-scale dataset for the task. Our dataset is annotated for arguments across multiple sentences and has broader coverage of event types and more examples than similar work. Our experiments highlight our model's adaptability to multiple datasets. Together, these contributions further the automatic understanding of events at the document level.

We are not the first to consider non-local event arguments; here we review prior work and refer to O'Gorman (2019) for further reading. Whereas local (sentence-level) event arguments are well studied as semantic role labeling—utilizing large datasets such as OntoNotes 5.0 (Weischedel et al., 2013; Pradhan et al., 2013)—existing datasets annotated for non-local arguments are too small for training neural models.
Much of the effort on non-local arguments, sometimes called implicit SRL, has focused on two datasets: SemEval-2010 Task 10 (Ruppenhofer et al., 2010) and Beyond NomBank (henceforth BNB) (Gerber and Chai, 2010, 2012). These datasets are substantially smaller than RAMS: the SemEval Task 10 training set contains 1,370 frame instantiations over 438 sentences, while BNB contains 1,247 examples covering just 10 nominal predicate types. Multi-sentence AMR (MS-AMR) (O'Gorman et al., 2018; Knight et al., 2020) contains 293 documents annotated with a document-level adaptation of the Abstract Meaning Representation (AMR) formalism. O'Gorman (2019) notes that the relatively small size of the MS-AMR and SemEval datasets hinders supervised training. In contrast to these datasets, RAMS contains 9,124 annotated examples covering a wide range of nominal and verbal triggers.

Under the DARPA AIDA program, the Linguistic Data Consortium (LDC) has annotated document-level event arguments under a three-level hierarchical event ontology (see Figure 2) influenced by prior LDC-supported ontologies such as ERE and ACE. These have been packaged as the AIDA Phase 1 Practice and Eval releases (henceforth AIDA-1; Practice: LDC2019E04 data and LDC2019E07 annotations; Eval: LDC2019E42 data and LDC2019E77 annotations), currently made available to performers in the AIDA program and participants in related NIST evaluations. (While rarely freely released, historically such collections are eventually made available under a license to anyone, under some timeline established within a program.) AIDA-1 documents focus on recent geopolitical events relating to interactions between Russia and Ukraine. Unless otherwise noted, statistics about AIDA-1 pertain only to the Practice portion of the dataset.

Figure 2: Subset of the AIDA-1 ontology illustrating the three-level Type/Subtype/Sub-subtype event hierarchy. Dashed gray edges point to roles for two event nodes, which have one role in common (Place).

For each document in LDC's collection, only AIDA-salient events are annotated. This protocol does not guarantee coverage over the event ontology: 1,559 event triggers are annotated in the text portion of the collection, accounting for only 88 of the 139 distinct event sub-subtypes in the ontology. Our dataset, RAMS, employs the same annotation ontology but is substantially larger and covers all 139 types in the ontology (see Figure 3).

Prior work has also combined existing corpora to increase and diversify sources of model supervision. Cheng and Erk (2018, 2019) approached the data scarcity problem by recasting implicit SRL as a cloze task and as a reading comprehension task, for which data can be generated automatically.

The TAC KBP event argument extraction task also seeks arguments from document contexts. However, in our work we are concerned with reified events (explicit mentions) and links between event mentions and argument mentions rather than entity-level arguments (coreference clusters).
Motivated by the scarcity of data for training neural models to predict non-local arguments, we constructed Roles Across Multiple Sentences (RAMS), a crowd-sourced dataset with annotations for 9,124 events following the AIDA ontology. We employed the AIDA ontology in RAMS so as to be most similar to an existing corpus already being investigated by various members of the community. Each example consists of a typed trigger span and 0 or more argument spans in an English document. A trigger span is a word or phrase that evokes a certain event type in context, while argument spans denote role-typed participants in the event (e.g., the Recipient). Trigger and argument spans are token-level [start, end] offsets into a tokenized document.

Typically, event and relation datasets annotate only the argument spans that are in the same sentence as the trigger, but we present annotators with a multi-sentence context window surrounding the trigger. Annotators may select argument spans in any sentence in the context window.

We used Reddit, a popular internet forum, to filter a collection of news articles to be topically similar to AIDA-1. After applying a set of criteria based on keywords, time period, and popularity (listed in Appendix A.1), we identified approximately 12,000 news articles with an average length of approximately 40 sentences.
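To make the example format concrete, the sketch below shows how one RAMS-style example could be represented as token-offset spans. The field names and token indices are illustrative assumptions, not the released file format.

```python
# Illustrative sketch of a single RAMS-style example (hypothetical field names
# and token indices; not the released file format). Spans are token-level
# [start, end] offsets into the tokenized 5-sentence context window.
example = {
    "tokens": ["...", "bombarding", "..."],  # tokenized 5-sentence window
    "trigger": {"span": [27, 27], "event_type": "AirstrikeMissileStrike"},
    "arguments": [
        {"role": "attacker",   "span": [12, 12]},  # e.g., "Russians"
        {"role": "target",     "span": [29, 30]},  # e.g., "rebel outpost"
        {"role": "instrument", "span": [33, 33]},  # e.g., "aircraft"
        {"role": "place",      "span": [40, 40]},  # e.g., "Syria"
    ],
}
```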
Annotation
We manually constructed a mapping from each event ((sub-)sub)type to a list of lexical units (LUs) likely to evoke that type (for example, Conflict/Attack/SetFire is evoked by inferno, blaze, and arson, and their word forms). This mapping was designed to give high precision and low recall, in that for a given (Type, LUs) pair, the items in LUs are all likely to evoke the Type, although LUs can omit items that also evoke the Type. On average, each mapping contains about three LUs.

We performed a soft match between every LU and every word in our text collection to select candidate sentences for each event type (for the soft match, we stem all words and ignore case). This matching procedure produced approximately 94,000 candidates, which we balanced by sampling the same number of sentences for each LU. Candidate sentences were then vetted by crowdsourcing to ensure that they evoked their associated event type and had positive factuality. We collected judgments on approximately 17,500 candidate sentences, of which 52% were determined to satisfy these constraints, yielding 9,124 sentences containing a LU trigger. Using these sentences we then collected multi-sentence annotations, presenting annotators with a 5-sentence window containing two sentences of context before the sentence with the trigger and two sentences after (if fewer than two sentences appeared before/after the trigger, annotators were shown as many sentences as were available). Annotators then selected in the context window a span to fill each of the event's roles.

A window size of five sentences was chosen based on internal pilots and supported by our finding that 90% of event arguments in AIDA-1 are recoverable in this window size. Similarly, Gerber and Chai (2010) found that in their data (in which arguments following the trigger were not annotated) almost 90% of implicit arguments can be resolved in the two sentences preceding the trigger. Arguments fall close to the trigger in RAMS as well: 82% of arguments occur in the same sentence as the trigger. On average, we collected 66 full annotations (trigger and arguments) per event type. Table 1 shows dataset size and coverage. All aspects of the protocol, including the annotation interface and instructions, are included in Appendix A.

              Train    Dev    Test   Total
Docs          3,194    399    400    3,993
Examples      7,329    924    871    9,124
Event Types   139      131    –      139
Roles         65       62     –      65
Arguments     17,026   2,188  2,023  21,237

Table 1: Sizes and coverage of RAMS splits. RAMS covers all of the 139 event types and 65 role types in the AIDA Phase 1 ontology.
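The candidate selection step reduces to a stemmed, case-insensitive match between each LU and each document token. A minimal sketch is below; the choice of NLTK's Porter stemmer is an assumption, since the text only says that words are stemmed and case is ignored.

```python
from nltk.stem import PorterStemmer  # assumed stemmer; the paper only specifies stemming + case folding

stemmer = PorterStemmer()

def soft_match(lexical_unit: str, token: str) -> bool:
    """Case-insensitive, stemmed comparison between an LU and a document token."""
    return stemmer.stem(lexical_unit.lower()) == stemmer.stem(token.lower())

def candidate_sentences(sentences, lu_map):
    """Yield (event_type, lu, sentence) triples in which some token soft-matches one of the type's LUs.

    sentences: list of token lists; lu_map: dict mapping event type -> list of LUs.
    """
    for event_type, lus in lu_map.items():
        for lu in lus:
            for sentence in sentences:
                if any(soft_match(lu, token) for token in sentence):
                    yield event_type, lu, sentence
```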
We randomly se-lected 93 tasks for redundant annotation in orderto measure inter-annotator agreement, collectingfive responses per task from distinct users. 68.5%of the time, all annotators mark the role as eitherabsent or present. Less frequently (21.7%), fourof the five annotators agree, and rarely (9.8%) isthere strong disagreement.We compute pairwise agreement for spanboundaries. For each annotated (event, role) com-bination, we compare pairs of spans for whichboth annotators believe the role is present. 55.3%of the pairs agree exactly. Allowing for a fuzziermatch, such as to account for whether one in-cludes a determiner, spans whose boundaries dif-fer by one token have a much higher agreementof 69.9%. Fewer spans agree on the start bound-ary (59.8%) than on the end (73.5%), while 78.0%match at least one of the two boundaries. Wedemonstrate data quality in § F r equen cy RAMS trainAIDA-1BNB
Figure 3: Comparison of frequency of event types invarious datasets sorted by decreasing frequency in thatdataset. RAMS has a heavier tail than AIDA-1 andBNB and broader coverage of events.
Comparisons to Related Datasets
Comparisons of event type coverage among RAMS, AIDA-1, and BNB (Gerber and Chai, 2010, 2012) are given in Figure 3. RAMS provides larger and broader coverage of event types than do AIDA-1 and BNB. By design, BNB focuses on only a few predicate types, but we include its statistics for reference. More figures regarding type and role coverage are included in Appendix A.4.
Related Protocols
Feizabadi and Padó (2014) also considered the case of crowdsourcing annotations for cross-sentence arguments. Like us, they provided annotators with a context window rather than the whole document, annotating two frames each with four roles over 384 predicates. Annotators in that work were shown the sentence containing the predicate and the three previous sentences, unlike ours, which shows two preceding and two following sentences.

Rather than instructing annotators to highlight spans in the text ("marking"), Feizabadi and Padó (2014) directed annotators to fill in blanks in templatic sentences ("gap filling"). We in contrast require annotators to highlight mention spans directly in the text.

Our protocol of event type verification followed by argument finding is similar to the protocol supported by interfaces such as SALTO (Burchardt et al., 2006) and that of Fillmore et al. (2002).
We formulate argument linking as follows, similar to the formulation in Das et al. (2010). Assume a document $D$ contains a set of described events $\mathcal{E}$, each designated by a trigger—a text span in $D$. The type of an event $e$ determines the set of roles the event's arguments may take, denoted $\mathcal{R}_e$. For each $e \in \mathcal{E}$, the task is to link the event's roles with arguments—text spans in $D$—if they are attested. Specifically, one must find for each $e$ all $(r, a)$ pairs such that $r \in \mathcal{R}_e$ and $a \in D$. This formulation does not restrict each role to be filled by only one argument, nor does it restrict each explicit argument to take at most one role.

Our model architecture is related to recent models for SRL (He et al., 2018; Ouchi et al., 2018). Contextualized text embeddings are used to form candidate argument span representations, $\mathcal{A}$. These are then pruned and scored alongside the trigger span and learned role embeddings to determine the best argument span (possibly none) for each event and role, i.e., $\operatorname*{argmax}_{a \in \mathcal{A}} P(a \mid e, r)$ for each event $e \in \mathcal{E}$ and role $r \in \mathcal{R}_e$.
Representations

To represent text spans, we adopt the convention from Lee et al. (2017) that has been used for a broad suite of core NLP tasks (Swayamdipta et al., 2018; He et al., 2018; Tenney et al., 2019b). A bidirectional LSTM encodes each sentence's contextualized embeddings (Peters et al., 2018; Devlin et al., 2018). The hidden states at the start and end of the span are concatenated along with a feature vector for the size of the span and a soft head word vector produced by a learned attention mask over the word vectors (GloVe embeddings (Pennington et al., 2014) and character-level convolutions) within the span.

We use this method to form representations of trigger spans, $\mathbf{e}$, and of candidate argument spans, $\mathbf{a}$. We learn a separate embedding, $\mathbf{r}$, for each role in the ontology, $r \in \mathcal{R}$. Since our objective is to link candidate arguments to event-role pairs, we construct an event-role representation by applying a feed-forward neural network ($F_{\tilde{a}}$) to the event trigger span and role embedding:

$$\tilde{\mathbf{a}}_{e,r} = F_{\tilde{a}}([\mathbf{e}; \mathbf{r}]) \quad (1)$$

This method is similar to one for forming edge representations for cross-sentence relation extraction (Song et al., 2018), but contrasts with prior work which limits the interaction between $\mathbf{r}$ and $\mathbf{e}$ (He et al., 2018; Tenney et al., 2019b).
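A minimal sketch of this span representation (endpoint hidden states, an attention-weighted soft head over the span's word vectors, and a span-width feature) is below. Dimensions and module names are illustrative assumptions, not the authors' implementation; the same module would be applied to trigger spans and candidate argument spans.

```python
import torch
import torch.nn as nn

class SpanRepresentation(nn.Module):
    """Sketch of the Lee et al. (2017)-style span embedding: [h_start; h_end; soft_head; width_feature]."""
    def __init__(self, hidden_dim, word_dim, width_buckets=10, width_dim=20):
        super().__init__()
        self.width_embedding = nn.Embedding(width_buckets, width_dim)
        self.head_scorer = nn.Linear(word_dim, 1)  # learned attention over word vectors in the span

    def forward(self, lstm_states, word_vectors, start, end):
        # lstm_states: (seq_len, 2 * hidden_dim); word_vectors: (seq_len, word_dim)
        span_words = word_vectors[start:end + 1]                       # (width, word_dim)
        attn = torch.softmax(self.head_scorer(span_words).squeeze(-1), dim=0)
        soft_head = attn @ span_words                                  # attention-weighted head vector
        width = torch.tensor(min(end - start, self.width_embedding.num_embeddings - 1))
        return torch.cat([lstm_states[start], lstm_states[end],
                          soft_head, self.width_embedding(width)])
```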
Pruning

Given a document with $n$ tokens, there are $O(n^2)$ candidate argument text spans, which leads to intractability for large documents. Following Lee et al. (2017) and He et al. (2018), we consider within-sentence spans up to a certain width (giving $O(n)$ spans) and score each span, $a$, using a learned unary function of its representation: $s_A(a) = \mathbf{w}_A^\top F_A(\mathbf{a})$. We keep the top $\lambda_A n$ spans ($\lambda_A$ is a hyperparameter) and refer to this set of high-scoring candidate argument spans as $\mathcal{A}$.

In an unpruned model, we need to create at least $\sum_e |\mathcal{R}_e|$ event-role representations and evaluate $\Omega(n \sum_e |\mathcal{R}_e|)$ combinations of events, roles, and arguments, which can become prohibitively large when there are numerous events and roles. Assuming the number of events is linear in document length, the number of combinations would be quadratic in document length (rather than quadratic in sentence length as in He et al. (2018)).

Lee et al. (2018) addressed this issue in coreference resolution, a different document-level task, by implementing a coarse pruner to limit the number of candidate spans that are subsequently scored. For our model, any role can potentially be filled (if the event type is not known). Thus, we do not wish to prematurely prune $(e, r)$ pairs, so we must further prune $\mathcal{A}$. Rather than scoring $a \in \mathcal{A}$ with every event-role pair $(e, r)$, we assign a score between $a$ and every event $e$. This relaxation reflects a loose notion of how likely an argument span is to participate in an event, which can be determined irrespective of a role:

$$s_c(e, a) = \mathbf{e}^\top \mathbf{W}_c \mathbf{a} + s_A(a) + s_E(e) + \phi_c(e, a)$$

where $\mathbf{W}_c$ is learned and $\phi_c(e, a)$ are task-specific features. We use $\mathcal{A}_e \subseteq \mathcal{A}$ to refer to the top-$k$-scoring candidate argument spans in relation to $e$. (As a role for an event evokes an implicit discourse referent, the event-role representation can be regarded as an implicit discourse referent representation.)
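A sketch of the two pruning stages described above: a unary score $s_A(a)$ keeps the top $\lambda_A n$ spans, and a coarse event-argument score $s_c(e, a)$ then keeps the top-$k$ spans per event. Tensor shapes, the default values, the omission of the feature term $\phi_c$, and treating $n$ as the number of candidate spans are all simplifying assumptions.

```python
import torch

def prune_spans(span_reps, span_scorer, lam_a=0.8):
    """Unary pruning: keep roughly the top lambda_A * n candidate spans by s_A(a).

    span_reps: (num_spans, d_a); span_scorer: module mapping (num_spans, d_a) -> (num_spans, 1).
    lam_a = 0.8 is an arbitrary illustrative value, not the paper's setting.
    """
    scores = span_scorer(span_reps).squeeze(-1)
    k = max(1, int(lam_a * span_reps.size(0)))
    keep = torch.topk(scores, k).indices
    return keep, scores

def coarse_prune(event_rep, span_reps, span_scores, W_c, event_score, k=50):
    """Coarse pruning: s_c(e, a) = e^T W_c a + s_A(a) + s_E(e); keep the top-k spans for this event.

    event_rep: (d_e,); W_c: (d_e, d_a); span_scores: (num_spans,); event_score: scalar s_E(e).
    The feature term phi_c(e, a) is omitted in this sketch.
    """
    bilinear = span_reps @ (event_rep @ W_c)   # (num_spans,)
    s_c = bilinear + span_scores + event_score
    return torch.topk(s_c, min(k, span_reps.size(0))).indices
```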
Scoring

We introduce a link scoring function, $l(a, \tilde{a}_{e,r})$, between candidate spans $a \in \mathcal{A}_e$ and event-role pairs $\tilde{a}_{e,r} = (e, r) \in \mathcal{E} \times \mathcal{R}$. The scoring function decomposes as:

$$l(a, \tilde{a}_{e,r}) = s_{E,R}(e, r) + s_{A,R}(a, r) + s_l(a, \tilde{a}_{e,r}) + s_c(e, a), \quad a \neq \epsilon \quad (2)$$

$$
\begin{aligned}
s_E(e) &= \mathbf{w}_E^\top F_E(\mathbf{e}) \\
s_{E,R}(e, r) &= \mathbf{w}_{E,R}^\top F_{E,R}([\mathbf{e}; \mathbf{r}]) \\
s_{A,R}(a, r) &= \mathbf{w}_{A,R}^\top F_{A,R}([\mathbf{a}; \mathbf{r}]) \\
s_l(a, \tilde{a}_{e,r}) &= \mathbf{w}_l^\top F_l([\mathbf{a}; \tilde{\mathbf{a}}_{e,r}; \mathbf{a} \circ \tilde{\mathbf{a}}_{e,r}; \phi_l(a, \tilde{a}_{e,r})])
\end{aligned} \quad (3)
$$

where $\phi_l(a, \tilde{a}_{e,r})$ is a feature vector containing information such as the (bucketed) token distance between $e$ and $a$, with distance $= \max(e_{\text{start}} - a_{\text{end}}, a_{\text{start}} - e_{\text{end}})$. The $F_x$ are feed-forward neural networks, and the $\mathbf{w}_x$ are learned weights. The decomposition is inspired by Lee et al. (2017) and He et al. (2018), while the direct scoring of candidate arguments against event-role pairs, $s_l(a, \tilde{a}_{e,r})$, bears similarities to the approach taken by Schenk and Chiarcos (2016), which finds the candidate argument whose representation is most similar to the prototypical filler of a frame element (role).
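A sketch of the decomposed link score in Equations 2 and 3, with the event-role representation from Equation 1. Layer sizes are illustrative, and the output dimension of $F_{\tilde{a}}$ is assumed to match the span dimension so that the elementwise product is defined.

```python
import torch
import torch.nn as nn

class LinkScorer(nn.Module):
    """Sketch of the decomposed link score l(a, a~_{e,r}) from Eqs. 2-3; sizes are illustrative."""
    def __init__(self, span_dim, role_dim, feat_dim, hidden=150):
        super().__init__()
        pair_dim = span_dim + role_dim
        self.f_er = nn.Sequential(nn.Linear(pair_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.f_ar = nn.Sequential(nn.Linear(pair_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # Eq. 1: a~_{e,r} = F_a~([e; r]); output dim assumed equal to span_dim
        self.f_tilde = nn.Sequential(nn.Linear(pair_dim, span_dim), nn.ReLU())
        self.f_l = nn.Sequential(nn.Linear(3 * span_dim + feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, e, r, a, phi_l, s_c=0.0):
        a_tilde = self.f_tilde(torch.cat([e, r]))                      # event-role representation
        s_er = self.f_er(torch.cat([e, r]))                            # s_{E,R}(e, r)
        s_ar = self.f_ar(torch.cat([a, r]))                            # s_{A,R}(a, r)
        s_l = self.f_l(torch.cat([a, a_tilde, a * a_tilde, phi_l]))    # s_l(a, a~_{e,r})
        return (s_er + s_ar + s_l).squeeze(-1) + s_c                   # l(a, a~_{e,r})
```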
Learning

We denote "no explicit argument" by $\epsilon$ and assign it a fixed link score $l(\epsilon, \tilde{a}_{e,r})$, which acts as a threshold for the link function. For every event-role-argument triple $(e, r, a)$, we maximize

$$P(a \mid e, r) = \frac{\exp\{l(a, \tilde{a}_{e,r})\}}{\sum_{a' \in \mathcal{A}_e \cup \{\epsilon\}} \exp\{l(a', \tilde{a}_{e,r})\}}.$$
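The training objective is then a softmax over the pruned candidates plus the $\epsilon$ option. In the sketch below, $\epsilon$ is given a fixed score of 0 purely as an illustrative choice of threshold.

```python
import torch
import torch.nn.functional as F

def link_nll(link_scores, gold_index, epsilon_score=0.0):
    """Negative log P(a | e, r) over A_e plus the epsilon option.

    link_scores: tensor of shape (|A_e|,) with l(a, a~_{e,r}) for each candidate span.
    gold_index: index of the gold argument in A_e, or -1 if the gold answer is epsilon.
    epsilon_score: illustrative fixed threshold score for the "no explicit argument" option.
    """
    scores = torch.cat([link_scores, torch.tensor([epsilon_score])])  # epsilon appended as the last option
    target = torch.tensor(gold_index if gold_index >= 0 else scores.size(0) - 1)
    return F.cross_entropy(scores.unsqueeze(0), target.unsqueeze(0))
```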
Decoding

We experiment with three decoding strategies: argmax, greedy, and type-constrained (a sketch of all three appears at the end of this section). If we assume each role is satisfied by exactly one argument (potentially $\epsilon$), we can perform argmax decoding independently for each role (if the type of $e$ is known, then we could restrict $r \in \mathcal{R}_e$):

$$\hat{a} = \operatorname*{argmax}_{a \in \mathcal{A}_e \cup \{\epsilon\}} P(a \mid e, r)$$

To instead predict multiple non-overlapping arguments per role, we could use $P(\epsilon \mid e, r)$ as a threshold in greedy decoding (Ouchi et al., 2018).

We may know the gold event types and the mapping between events $e$ and their permitted roles, $\mathcal{R}_e$. While this information can be used during training, we take a simpler approach of using it for type-constrained decoding (TCD). If an event type allows $m_r$ arguments for role $r$, we keep only the top-scoring $m_r$ arguments based on link scores.

Our model is inspired by several recent span selection models (He et al., 2018; Lee et al., 2018; Ouchi et al., 2018), as well as the long line of neural event extraction models (Chen et al., 2015; Nguyen et al., 2016, inter alia). O'Gorman (2019) envisions a joint coreference and SRL model in which implicit discourse referents are generated for each event predicate and subsequently clustered with the discovered referent spans using a model for coreference, which is similar to the approach of Silberer and Frank (2012). O'Gorman (2019) further claims that span selection models would be difficult to scale to the document level, which is the regime we are most interested in. We focus on the implicit discourse referents (i.e., the event-role representations) for an event and link them to argument mentions, rather than cluster them using a coreference resolution system or aggregate event structures across multiple events and documents (Wolfe et al., 2015). Our approach is also similar to the one used by Das et al. (2010) for FrameNet parsing.
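Returning to decoding, the sketch below illustrates the three strategies on the scores for a single event-role pair. It omits the non-overlap constraint for greedy decoding and assumes role multiplicities are supplied externally.

```python
import torch

def argmax_decode(probs):
    """probs: (num_candidates + 1,), last entry = P(epsilon | e, r). Returns one index or None."""
    best = int(torch.argmax(probs))
    return None if best == probs.size(0) - 1 else best

def greedy_decode(probs):
    """Return every candidate scoring above the epsilon threshold (several arguments per role allowed)."""
    eps = probs[-1]
    return [i for i in range(probs.size(0) - 1) if probs[i] > eps]

def type_constrained_decode(role_to_scores, allowed_counts):
    """Keep only the top-m_r arguments for each role permitted by the (gold) event type."""
    predictions = {}
    for role, scores in role_to_scores.items():
        if role not in allowed_counts:        # role not licensed by the event type: drop it
            continue
        k = min(allowed_counts[role], scores.size(0))
        predictions[role] = torch.topk(scores, k).indices.tolist()
    return predictions
```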
CoNLL 2012 SRL
As our model bears similarities to the SRL models proposed by He et al. (2018) and Ouchi et al. (2018), we evaluate our model on the sentence-level CoNLL 2012 dataset as a sanity check. We use ELMo (Peters et al., 2018) in these experiments. Based on a small hyperparameter sweep, our model achieves 81.4 F1 when given gold predicate spans and 81.2 F1 when not given gold predicates (for comparison, He et al. (2018) achieve 85.5 F1 with gold predicates and 82.9 F1 without, and Ouchi et al. (2018) achieve 86.2 F1 with gold predicates). Our model's recall is harmed because our span pruning occurs at the document level rather than at the sentence level, which leads to overpruning in some sentences. Although our model is designed to accommodate cross-sentence links, it maintains competitive performance on sentence-level SRL.
Model               Dev F1   P      R      F1
Our model           69.9     62.8   –      –
Most common role    17.3     15.7   15.7   15.7

Table 2: P(recision), R(ecall), and F1 on RAMS development and test data. TCD designates the use of ontology-aware type-constrained decoding.

In the following experiments, for each event the model is given the (gold) trigger span and the (gold) spans of the arguments. The model finds for each role the best argument(s) to fill it. Predictions are returned as trigger-role-argument triples. We use feature-based BERT-base (Devlin et al., 2018)—mixing layers 9 through 12—by splitting the documents into segments of size 512 subtokens and encoding each segment separately.

We perform preliminary sweeps across hyperparameter values, which are then fixed while we perform a more exhaustive sweep across scoring features. We also compare argmax decoding with greedy decoding during training. The best model is selected based on F1 on the development set, and ablations are reported in Table 3. Our final model uses greedy decoding, $s_{A,R}$, and $s_l$ and omits $s_{E,R}$ and $s_c$ (see Equation 2). More details can be found in Appendix B.

The results using our model with greedy decoding and TCD are reported in Table 2. We also report performance of the following baselines: 1) choosing for each link the most common role (place), 2) using the same fixed trigger representation across examples, and 3) using the full context window as the trigger. Additionally, we experiment with two other data conditions: 1) linking the correct argument(s) from among a set of distractor candidate arguments provided by a constituency parser (Kitaev and Klein, 2018), and 2) finding the correct argument(s) from among all possible spans up to a fixed length. (We take as the distractor arguments all, potentially overlapping, NPs predicted by the parser. On average, this yields 44 distractors per training document.)
Model / link score component     Greedy   TCD
Our model                        –        –
$s_l(a, \tilde{a}_{e,r})$        –        –
$s_{A,R}(a, r)$                  –        –
$s_{E,R}(e, r)$                  –        –
$s_c(e, a)$                      –        –
ELMo                             68.5     75.2

Table 3: F1 on RAMS dev data when link score components are separately included/excluded (Equation 2) or other contextualized encoders are used in the best performing model. TCD = type-constrained decoding.

For the distractor experiment, we use the same hyperparameters as for the main experiment. When not given gold argument spans, we consider all spans up to 5 tokens long and change only the hyperparameters that would prune less aggressively. We hypothesize that the low performance in this setting is due to the sparsity of annotated spans compared to the set of all enumerated spans. In contrast, datasets such as CoNLL 2012 are more densely annotated, so the training signal is not as affected when the model must determine argument spans in addition to linking them.

Finally, we examine the effect of TCD to see whether the model effectively uses gold event types if they are given. TCD filters out illegal predictions, boosting precision. Recall is still affected by this decoding strategy because the model may be more confident in the wrong argument for a given role, thus filtering out the less confident, correct one. Nevertheless, using gold types at test time generally leads to gains in performance.
Ablation studies on development data for components of the link score as well as the contextualized encoder and decoding strategy are shown in Table 3. Type-constrained decoding based on knowledge of gold event types improves F1 in all cases because it removes predictions that are invalid with respect to the ontology.

The most important link score component is the score between a combined event-role and a candidate argument. This result follows intuitions that $s_l$ is the primary component of the link score since it directly captures the compatibility of the explicit argument and the implicit argument represented by the event-role pair.

Dist.   Gold         Predicted    P      R      F1
-2      79 (26)      69 (21)      81.2   70.9   75.7
-1      164 (33)     151 (27)     76.8   70.7   73.7
0       1,811 (61)   1,688 (51)   77.7   72.4   75.0
1       87 (24)      83 (22)      78.3   74.7   76.5
2       47 (18)      39 (14)      87.2   72.3   79.1
Total   2,189 (62)   2,030 (52)   78.0   72.3   75.1

Table 4: Performance breakdown by distance (number of sentences) between argument and event trigger for our model using TCD over the development data. Negative distances indicate that the argument occurs before the trigger.
We also experiment with both ELMo (Peters et al., 2018) and BERT layers 6–9, which were found to have the highest mixture weights for SRL by Tenney et al. (2019a). We found that BERT generally improves over ELMo, and layers 9–12 often perform better than layers 6–9.
Argument–Trigger Distance
One of the differentiating components of RAMS compared to SRL datasets is its non-local annotation of arguments. At the same time, RAMS uses naturally occurring text, so arguments are still heavily distributed within the same sentence as the trigger (Figure 5). This setting allows us to ask whether our model accurately finds arguments outside of the sentence containing the trigger despite the non-uniform distribution. In Table 4, we report F1 based on distance on the development set and find that performance on distant arguments is comparable to performance on local arguments, demonstrating the model's ability to handle non-local arguments.
We present in Figure 4 the cosine similarities between the learned 50-dimensional role embeddings in our model and also the errors made by the model under argmax decoding on the dev set. (Analysis of the confusion matrix with type-constrained decoding is less meaningful because the constraints, which rely on gold event types, filter out major classes of errors.) Some roles are highly correlated. For example, origin and destination have the most similar embeddings, possibly because they co-occur frequently and have the same entity type. Conversely, negatively correlated roles have different entity types or occur in different events, such as communicator compared to destination and artifact. We also observe that incorrect predictions are made more often between highly correlated roles and err on the side of the more frequent role, as most errors occur below the diagonal.

Figure 4: Embedding similarity (top) and row-normalized confusion (bottom) between roles for the 15 most frequent roles with our model. The full figures are included in Appendix C. Best viewed in color.
Examples
We present predictions from the development set which demonstrate some phenomena of interest. These are made without TCD, illustrating the model's predictions without knowledge of gold event types.

In Table 5, the first example demonstrates the model's ability to link a non-local argument which occurs in the sentence before the trigger. Greedy decoding helps the model find multiple arguments satisfying the same participant role, which also appear on either side of the trigger. In the second example, the model correctly predicts the driverpassenger, one of the rarer roles in RAMS (17 instances in the training set), consistent with the gold AccidentCrash event type.

In Table 6, the model fills roles corresponding to both the Death and the gold JudicialConsequences event types, thereby mixing roles from different event types. The predictions are plausible when interpreted in context and would be more accurate under TCD.
The EU's leaders [PARTICIPANT] in Brussels are expected to play hardball in negotiating Britain's exit, to send a message to other states that might be contemplating a similar move. "Informal meeting of EU 27 next week without PM in the room to decide common negotiating position vs UK [PARTICIPANT] on exit negotiations" —Faisal Islam.

SPEAKER: I'm Mary Ann Mendoza, the mother of Sergeant Brandon Mendoza [DRIVERPASSENGER], who was killed in a violent head-on collision in Mesa [PLACE].

Table 5: Two examples of correct predictions on the development set.

"Many people are saying that the Iranians [KILLER] killed the scientist who helped the US because of Hillary Clinton's hacked emails." —8 August, Twitter. Shahran Amiri [VICTIM, DEFENDANT], the nuclear scientist executed in Iran [PLACE] last week, ...

"Many people are saying that the Iranians [JUDGECOURT] killed the scientist who helped the US [CRIME] because of Hillary Clinton's hacked emails." —8 August, Twitter. Shahran Amiri [DEFENDANT], the nuclear scientist executed in Iran [PLACE] last week, ...

Table 6: A partially correct prediction (top) and its corresponding gold annotations (bottom).
We also investigate how well RAMS serves as pre-training data for AIDA-1. A model using the hyperparameters of our best-performing RAMS model and trained on just English AIDA-1 Practice data achieves 19.1 F1 on the English AIDA-1 Eval data under greedy decoding and 18.2 F1 with TCD. When our best-performing RAMS model is fine-tuned to the AIDA task by further training on the AIDA-1 data, performance is improved to 24.4 F1 under greedy decoding and 24.8 F1 with TCD. The crowdsourced annotations in RAMS are therefore of sufficient quality to serve as augmentation to LDC's AIDA-1. Experimental details are available in Appendix D.

The Beyond NomBank (BNB) dataset collected by Gerber and Chai (2010) and refined by Gerber and Chai (2012) contains nominal predicates (event triggers) and multi-sentence arguments, both of which are properties shared with RAMS. To accommodate our formulation of the argument linking task, we modify the BNB data in two ways: 1) we merge "split" arguments, which in all but one case are already contiguous spans; and 2) we reduce each cluster of acceptable argument fillers to a set containing only the argument closest to the trigger. We also make modifications to the data splits for purposes of evaluation. Gerber and Chai (2012) suggest evaluation be done using cross-validation on shuffled data, but this may cause document information to leak between the train and evaluation folds. To prevent such leakage and to have a development set for hyperparameter tuning, we separate the data into train, dev, and test splits with no document overlap. Additional data processing details and hyperparameters are given in Appendix E. When given gold triggers and argument spans, our model achieves 75.4 F1 on dev data and 76.6 F1 on test data.

The Gun Violence Database (GVDB) (Pavlick et al., 2016) is a collection of news articles from the early 2000s to 2016 with annotations specifically related to a gun violence event. We split the corpus chronologically into a training set of 5,056 articles, a development set of 400, and a test set of 500. We use this dataset to perform a MUC-style information extraction task (Sundheim, 1992). While GVDB's schema permits any number of shooters or victims, we simply predict the first mention of each type. Pavlick et al. (2016) perform evaluation in two settings: a strict match is awarded if the predicted string matches the gold string exactly, while an approximate match is awarded if either string contains the other.

Assuming each document contains a single gun violence event triggered by the full document, our goal is to predict the value (argument) for each slot (role) for the event. As each slot is filled by exactly one value, we use argmax decoding. While the baseline experiments of Pavlick et al. (2016) made sentence-level predictions focusing on five attributes, we make document-level predictions and consider the larger set of attributes. Table 7 shows our model's performance on the shared subset of attributes, but the numerical values are not directly comparable because the prior work makes predictions on the full dataset and also combines some roles. Our results show that our model is suitable for information extraction tasks like slot filling. Appendix F contains information on hyperparameters and performance on the full set of roles. To our knowledge, our results are a substantial improvement over prior attempts to predict attributes of gun violence event reports, and we make our models available in the hopes of assisting social scientists in their corpus studies.

Field          Baseline*      Our Model
Victim Name    9.3 (54.1)     62.2 (69.6)
Shooter Name   4.7 (24.1)     53.1 (57.8)
Location       12.2 (18.9)    34.9 (63.3)
Time           68.1 (69.3)    62.9 (69.4)
Weapon         1.1 (17.9)     32.5 (49.6)

Table 7: Strict (and approximate) match F1 on GVDB. *Due to the different data splits and evaluation conditions, we are not directly comparable to the baseline (Pavlick et al., 2016), provided only for reference.
We introduced a novel model for document-level argument linking. Because of the small amount of existing data for the task, to support training our neural framework we constructed the RAMS dataset consisting of 9,124 events covering 139 event types. Our model outperforms strong baselines on RAMS, and we also illustrated its applicability to a variety of related datasets. We hope that RAMS will stimulate further work on multi-sentence argument linking.
Acknowledgments
We thank Craig Harman for his help in developing the annotation interface. We also thank Tongfei Chen, Yunmo Chen, members of JHU CLSP, and the anonymous reviewers for their helpful discussions and feedback. This work was supported in part by DARPA AIDA (FA8750-18-2-0015) and IARPA BETTER.
References
Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics.

Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, and Sebastian Pado. 2006. SALTO - a versatile multi-level annotation tool. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), Genoa, Italy. European Language Resources Association (ELRA).

Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 167–176, Beijing, China. Association for Computational Linguistics.

Pengxiang Cheng and Katrin Erk. 2018. Implicit argument prediction with event knowledge. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 831–840, New Orleans, Louisiana. Association for Computational Linguistics.

Pengxiang Cheng and Katrin Erk. 2019. Implicit argument prediction as reading comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6284–6291.

Dipanjan Das, Nathan Schneider, Desai Chen, and Noah A. Smith. 2010. Probabilistic frame-semantic parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 948–956, Los Angeles, California. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Parvin Sadat Feizabadi and Sebastian Padó. 2014. Crowdsourcing annotation of non-local semantic roles. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, pages 226–230, Gothenburg, Sweden. Association for Computational Linguistics.

Parvin Sadat Feizabadi and Sebastian Padó. 2015. Combining seemingly incompatible corpora for implicit semantic role labeling. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, pages 40–50, Denver, Colorado. Association for Computational Linguistics.

Charles J. Fillmore. 1986. Pragmatically controlled zero anaphora. In Annual Meeting of the Berkeley Linguistics Society, volume 12, pages 95–107.

Charles J. Fillmore, Collin F. Baker, and Hiroaki Sato. 2002. The FrameNet database and software tools. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'02), Las Palmas, Canary Islands - Spain. European Language Resources Association (ELRA).

Matthew Gerber and Joyce Chai. 2010. Beyond NomBank: A study of implicit arguments for nominal predicates. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1583–1592, Uppsala, Sweden. Association for Computational Linguistics.

Matthew Gerber and Joyce Y. Chai. 2012. Semantic role labeling of implicit arguments for nominal predicates. Computational Linguistics, 38(4):755–798.

Luheng He, Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2018. Jointly predicting predicates and arguments in neural semantic role labeling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 364–369, Melbourne, Australia. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2676–2686, Melbourne, Australia. Association for Computational Linguistics.

Kevin Knight, Bianca Badarau, Laura Baranescu, Claire Bonial, Madalina Bardocz, Kira Griffitt, Ulf Hermjakob, Daniel Marcu, Martha Palmer, Tim O'Gorman, and Nathan Schneider. 2020. Abstract Meaning Representation (AMR) annotation release 3.0 LDC2020T02. Linguistic Data Consortium, Philadelphia, PA.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 188–197, Copenhagen, Denmark. Association for Computational Linguistics.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 687–692, New Orleans, Louisiana. Association for Computational Linguistics.

Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 300–309, San Diego, California. Association for Computational Linguistics.

Tim O'Gorman, Michael Regan, Kira Griffitt, Ulf Hermjakob, Kevin Knight, and Martha Palmer. 2018. AMR beyond the sentence: the Multi-sentence AMR corpus. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3693–3702, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Timothy J. O'Gorman. 2019. Bringing Together Computational and Linguistic Models of Implicit Role Interpretation. PhD dissertation, University of Colorado at Boulder.

Hiroki Ouchi, Hiroyuki Shindo, and Yuji Matsumoto. 2018. A span selection model for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1630–1642, Brussels, Belgium. Association for Computational Linguistics.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Ellie Pavlick, Heng Ji, Xiaoman Pan, and Chris Callison-Burch. 2016. The Gun Violence Database: A new task and data set for NLP. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1018–1024, Austin, Texas. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 143–152, Sofia, Bulgaria. Association for Computational Linguistics.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Michael Roth and Anette Frank. 2013. Automatically identifying implicit arguments to improve argument linking and coherence modeling. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 306–316, Atlanta, Georgia, USA. Association for Computational Linguistics.

Josef Ruppenhofer, Caroline Sporleder, Roser Morante, Collin Baker, and Martha Palmer. 2010. SemEval-2010 task 10: Linking events and their participants in discourse. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 45–50, Uppsala, Sweden. Association for Computational Linguistics.

Niko Schenk and Christian Chiarcos. 2016. Unsupervised learning of prototypical fillers for implicit semantic role labeling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1473–1479, San Diego, California. Association for Computational Linguistics.

Carina Silberer and Anette Frank. 2012. Casting implicit role linking as an anaphora resolution task. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 1–10, Montréal, Canada. Association for Computational Linguistics.

Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2018. N-ary relation extraction using graph-state LSTM. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2226–2235, Brussels, Belgium. Association for Computational Linguistics.

Beth M. Sundheim. 1992. Overview of the fourth Message Understanding Evaluation and Conference. In Fourth Message Understanding Conference (MUC-4): Proceedings of a Conference Held in McLean, Virginia, June 16-18, 1992.

Swabha Swayamdipta, Sam Thomson, Kenton Lee, Luke Zettlemoyer, Chris Dyer, and Noah A. Smith. 2018. Syntactic scaffolds for semantic structures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3772–3782, Brussels, Belgium. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019b. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA.

Travis Wolfe, Mark Dredze, and Benjamin Van Durme. 2015. Predicate argument alignment using a global coherence model. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 11–20, Denver, Colorado. Association for Computational Linguistics.
A RAMS Data
A.1 Collection
On Reddit, users make submissions containing links to news articles, images, videos, or other kinds of documents, and other users may then vote or comment on the submitted content. We collected news articles matching the following criteria: 1) posted to the r/politics sub-forum between January and October 2016; 2) resulted in threads with at least 25 comments; and 3) contained at least one mention of the string "Russia". The resulting subset of articles tended to describe geopolitical events and relations like the ones in the AIDA ontology. In order to filter out low-quality, fake, or disreputable news articles, we treat the number of comments in the discussion as a signal of information content. Our approach of gathering user-submitted and curated content through Reddit is similar to those used for creating large datasets for language model pre-training (Radford et al., 2019). Documents were split into sentences using NLTK 3.4.3, and sentences were split into tokens using SpaCy 2.1.4.

A.2 Annotation
To assess whether a lexical unit (LU) evoked an event with positive factuality, the vetting task contained an event definition and several candidate sentences, each with a highlighted LU. Annotators were asked to judge how well each highlighted LU, in the context of its sentence, matched the provided event definition. In the same task, they were also asked to assess the factuality of the sentence. Annotation instructions and examples are shown in Figure 9 and Figure 10.

Figure 5: Distances between triggers and arguments in RAMS and proportion of arguments at that distance (counts are shown above each bar). Negative distances indicate that the argument occurs before the trigger.
Each argument selection task contained five tokenized sentences, a contiguous set of tokens marking the trigger, a definition of the event type, and a list of roles and their associated definitions. For each role, annotators were asked whether a corresponding argument was present in the 5-sentence window, and if so, to highlight the argument span that was closest to the event trigger, as there could be multiple. In cases near the beginning or end of a document, annotators were shown up to two sentences before or after the sentence containing the trigger. Annotators were allowed to highlight any set of (within-sentence) contiguous tokens within the 5-sentence window aside from the trigger tokens. The distribution of distances between triggers and arguments is shown in Figure 5. Annotation instructions and an example are shown in Figure 11 and Figure 12.
A.3 Agreement
We additionally compute the frequency with which annotators agreed a given role was or was not present in the context window. To measure the frequency with which annotators agree whether a given role is present, we treat the majority annotation as the gold standard. Then, we calculated the precision, recall, and F1 of the annotations. Across the set of redundantly annotated tasks, there were 83 false negatives, 60 false positives, and 892 true positives, giving a precision of 93.7, recall of 91.5, and an F1 of 92.6.

Threshold   Conjunctive   Disjunctive   Start   End
0           55.3          78.0          59.8    73.5
1           69.9          80.3          74.9    75.3
2           73.9          82.0          78.2    77.8
3           76.4          83.6          80.9    79.1
4           78.8          84.3          82.7    80.4

Table 8: Pairwise span boundary inter-annotator agreement statistics for various span difference thresholds.
We consider a wider range of span difference thresholds, where span difference is calculated using the absolute difference of the (start, end) token indices from each pair. These are presented in Table 8. In conjunctive agreement, both $|start_1 - start_2|$ and $|end_1 - end_2|$ must be no greater than the given threshold; therefore, conjunctive agreement at threshold 0 is the percent of pairs that exactly agree (55.3%). Disjunctive agreement is less strict, requiring that either the absolute difference of start offsets or of end offsets be within the threshold. Start and end agreement are determined by considering whether the absolute difference of the pair's start or end offsets (respectively) is within the given threshold.

Figure 6: Comparison of frequency (top) and amount of dataset covered (bottom) of event types sorted by decreasing frequency. RAMS has more annotations for a more diverse set of event types than do AIDA Phase 1 and Beyond NomBank.
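A minimal sketch of the conjunctive, disjunctive, and single-boundary agreement computations over pairs of annotator spans (token offsets; helper names are illustrative):

```python
def boundary_agreement(pairs, threshold):
    """Pairwise span-boundary agreement at a given token threshold.

    pairs: list of ((start1, end1), (start2, end2)) span pairs from two annotators.
    Returns the fraction of pairs that agree under each criterion described above.
    """
    def diffs(pair):
        (s1, e1), (s2, e2) = pair
        return abs(s1 - s2), abs(e1 - e2)

    n = len(pairs)
    conjunctive = sum(1 for p in pairs if max(diffs(p)) <= threshold) / n
    disjunctive = sum(1 for p in pairs if min(diffs(p)) <= threshold) / n
    start_only = sum(1 for p in pairs if diffs(p)[0] <= threshold) / n
    end_only = sum(1 for p in pairs if diffs(p)[1] <= threshold) / n
    return conjunctive, disjunctive, start_only, end_only
```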
A.4 Event and Role Type Coverage
Event type and role type coverage are shown in Figure 6 and Figure 7. Figure 6 illustrates that RAMS contains more annotations for a larger set of event types than does AIDA-1. In addition, the distribution of annotations in RAMS is less skewed (more entropic) than in AIDA-1, in that in order to cover a given percentage of the dataset, more event types must be considered in RAMS than in AIDA-1. Figure 7 shows a similar pattern for role type coverage.

Figure 8 shows role coverage per event type, a measure of how much of each event type's role set is annotated on average. Role coverage per event type is calculated as the average number of filled roles per instance of the event type divided by the number of roles specified for that event type by the ontology. For the RAMS training set, the 25th percentile is 55.6%, the 50th percentile is 61.9%, and the 75th percentile is 68.6% coverage.

Figure 7: Comparison of frequency (top) and amount of dataset covered (bottom) of roles sorted by decreasing frequency. RAMS has more annotations for a more diverse set of role types than the AIDA Phase 1 data.
B RAMS Hyperparameters
Table 9 lists the numerical hyperparameters shared by all models discussed in this paper. Models may ignore some link score components if they were found to be unhelpful during our sweep of Equation 2 and Equation 3. For our model, we learn a linear combination of the top layers (9, 10, 11, 12) of BERT-base cased, while we use the middle layers (6, 7, 8, 9) for the 6–9 ablation. For ELMo, we use all three layers and encode each sentence separately. We apply a lexical dropout of 0.5 to these embeddings.
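The learned linear combination over BERT layers can be implemented as a scalar mix with softmax-normalized weights, as in ELMo-style mixing; this particular formulation is an assumption, since the text only states that the combination is learned.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Learned convex combination of a fixed set of encoder layers (e.g., BERT layers 9-12)."""
    def __init__(self, num_layers=4):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one weight per mixed layer
        self.gamma = nn.Parameter(torch.ones(1))               # global scale

    def forward(self, layers):
        # layers: list of tensors, each (seq_len, hidden_dim), one per mixed layer
        alphas = torch.softmax(self.weights, dim=0)
        mixed = sum(alpha * layer for alpha, layer in zip(alphas, layers))
        return self.gamma * mixed
```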
Figure 8: Number of event types for which a given percentage of roles are filled in the RAMS train set.

Figure 9: Annotation instructions for determining whether a lexical unit (in context) evokes an event type.

Figure 10: Annotation interface for determining whether a lexical unit (in context) evokes an event type.

Figure 11: Annotation instructions for selecting arguments for an event.

Figure 12: Annotation interface for selecting arguments for an event.

Hyperparameter                    Value
Role embedding size               50
Feature (φ_l) embedding size      20
LSTM size                         200
LSTM layers                       3
LSTM dropout                      0.4
Argument (F_A) size               150
Argument (F_A) layers             2
Event-role (F_{E,R}) size         150
Event-role (F_{E,R}) layers       2
F_ã (Eqn. 1) layers               2
Arg-role (F_{A,R}) size           150
Arg-role (F_{A,R}) layers         2
F_l size                          150
F_l layers                        2
Distance FFNN size                150
Distance FFNN layers              2
k                                 –
Training steps                    –
Patience                          10

Table 9: Hyperparameters of the model trained on RAMS. Sizes of learned weights that are omitted from the table can be determined from these hyperparameters. As the argument spans are given to the model in our experiments, we skip the first pass of pruning. We do not clip gradients.
In our best model, we use learned bucketed distance embeddings (Lee et al., 2017). These embeddings are scored as part of $\phi_c$ in computing $s_c(e, a)$ in Equation 2 and are also scored as part of $\phi_l$ in $s_l$ (Equation 3). Since span boundaries are given in our primary experiments, we do not include a score $s_A$ or $s_E$ in $s_c$. Our best model uses both $s_{A,R}$ and $s_l(a, \tilde{a}_{e,r})$ in Equation 2. These features were chosen as the result of a sweep over possible features, with other ablations reported in Table 3.

We adopt the span embedding approach of Lee et al. (2017), which uses character convolutions (50 8-dimensional filters of sizes 3, 4, and 5) and 300-dimensional GloVe embeddings. The default dropout applied to all connections is 0.2. We optimize using Adam (Kingma and Ba, 2015) with patience-based early stopping, resulting in the best checkpoint after 19 epochs (9 hours on an NVIDIA 1080Ti), using F1 as the evaluation metric.

Hyperparameters for the condition with distractor candidate arguments are the same as those in Table 9. For the condition with no given argument spans, we consider all intrasentential spans up to 5 tokens in length. We include the score of each candidate argument span when pruning to encourage the model to keep correct spans. We modify the hyperparameters in Table 9 to prune less aggressively, setting $k = 100$ and using a larger $\lambda_A$ (both defined in the pruning discussion above).
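The bucketed distances can follow the common coarse-to-fine convention of Lee et al. (2017): exact values for small distances, then exponentially wider buckets. The exact boundaries below are an assumption rather than values from the paper.

```python
def distance_bucket(distance: int) -> int:
    """Map a non-negative token distance to a small number of buckets."""
    if distance < 5:
        return distance     # buckets 0..4 for exact small distances
    if distance < 8:
        return 5            # 5-7
    if distance < 16:
        return 6            # 8-15
    if distance < 32:
        return 7            # 16-31
    if distance < 64:
        return 8            # 32-63
    return 9                # 64+
```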
C Full Role Confusion and Similarity Matrices

Figure 13 shows the similarity between all 65 role embeddings, while Figure 14 visualizes all the errors made by the model on the development set. These are expansions of the per-role results in the main text. When counting errors, if the gold roles are {destination, origin} and the model predicts {origin, place}, then we only mark place as an error for destination.

D AIDA Phase 1
D.1 Data Processing
We filter and process the AIDA-1 Practice and Eval data in the following way. Because annotations are available for only a subset of the documents in AIDA-1, we consider only the documents that have textual event triggers. We then take from this set only the English documents, which, due to noisy language ID in the original annotations, were selected by manual inspection of the first 5 sentences of each document by one of the authors of this work.

In addition, the argument spans in each example are only those that participate in events. In other words, arguments of relations (that are not also arguments of events) are not included. Additionally, a document may contain multiple events, unlike in RAMS.

Strategy                        Dev. F1     P       R       F1
No pre-training                 25.0        36.6    12.9    19.1
No pre-training + TCD
Pre-training on RAMS
Pre-training on RAMS + TCD

Table 10: P(recision), R(ecall), and F1 on AIDA-1 English development and test data. TCD designates the use of ontology-aware type-constrained decoding.

The training and development sets come from AIDA-1 Practice, and the test set comes from AIDA-1 Eval. As the AIDA-1 Eval documents are about different topics than the Practice documents, we emulate the mismatch in topic distribution by using a development set that is about a different topic than the training set. We use Practice topics R103 and R107 for training and R105 for development because R105 is the smallest of the three Practice topics, both by number of documents and by number of annotations. The test set consists of all 3 topics (E101, E102, E103) from the (unsequestered) Eval set. After the filtering process described above, we obtain a training set of 46 documents, a development set of 17 documents, and a test set of 69 documents. There are 389 events in the training set, and the training documents have an average length of 50 sentences.
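The filtering and topic-based split described above can be sketched as follows. The record fields (language, topic, event triggers) are hypothetical names for illustration, not the actual AIDA-1 schema, and the language check stands in for the manual inspection the authors describe.

def make_aida_splits(documents):
    """Keep English documents that have textual event triggers, then split
    by Practice/Eval topic as described above (field names are assumptions)."""
    kept = [d for d in documents
            if d["language"] == "eng" and len(d["event_triggers"]) > 0]

    train = [d for d in kept if d["topic"] in {"R103", "R107"}]
    dev   = [d for d in kept if d["topic"] == "R105"]
    test  = [d for d in kept if d["topic"] in {"E101", "E102", "E103"}]
    return train, dev, test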
D.2 Hyperparameters
We use the same hyperparameters as the best model for RAMS, shown in Table 9.
D.3 Pre-training on RAMS
Both the models with and without pre-training on RAMS were trained on AIDA-1 for 100 epochs with an early-stopping patience of 50 epochs, using the same hyperparameters as the best RAMS model. All parameters were updated during fine-tuning (none were frozen). The vocabulary of the pre-trained model was not expanded when trained on AIDA-1.

The models' lower performance on AIDA-1 than on RAMS may be in part explained by the presence of distractors in AIDA-1. Moving from RAMS (one trigger per example) to AIDA-1 (many triggers per example) introduces distractor "negative" links: an argument for one event might not participate in a different event in the same document. When given gold argument spans, a model learns from RAMS that every argument gets linked to the trigger, but there are many negative links in the AIDA-1 data, which the model must learn to not predict.

Full results are given in Table 10. Type-constrained decoding does not improve performance on AIDA-1 as much as it did in Table 3, possibly because the AIDA-1 data often does not adhere to the multiplicity constraints of the ontology. For example, many attack events have more than one annotated attacker or target. Under TCD, correct predictions made in excess of what the ontology allows are deleted, hurting recall.

Interestingly, type-constrained decoding hurts performance on AIDA-1 Eval when there is no pre-training. As discussed in §5, type-constrained decoding tends to improve precision and lower recall. Despite the same behavior here, F1 is nonetheless decreased. We see similar behavior in this experiment to the RAMS experiment involving distractor candidate arguments: low performance, which is reduced further when using TCD.
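One way such a multiplicity constraint can be enforced at decoding time is to keep, for each role, only the highest-scoring predicted arguments up to the count the ontology allows. The sketch below is illustrative only (the data structures are assumptions, not the released implementation); it also shows why correct predictions in excess of the limit are dropped, hurting recall when the data violates the constraint.

from collections import defaultdict

def type_constrained_decode(predictions, max_per_role):
    """predictions: list of (role, argument_span, score) for one event;
    max_per_role: ontology multiplicity limit for each role of the event type.
    Keep only the top-scoring predictions allowed by the ontology."""
    by_role = defaultdict(list)
    for role, span, score in predictions:
        by_role[role].append((score, span))

    kept = []
    for role, scored in by_role.items():
        limit = max_per_role.get(role, 1)
        # Predictions beyond the allowed count are deleted, even if correct.
        for score, span in sorted(scored, key=lambda x: x[0], reverse=True)[:limit]:
            kept.append((role, span, score))
    return kept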
E BNB Data Processing and Hyperparameters

E.1 Data Processing
We use the data from Gerber and Chai (2012), available at http://lair.cse.msu.edu/projects/implicit_argument_annotations.zip; information about the data and its fields is available at http://lair.cse.msu.edu/projects/implicit_annotations.html. We processed the data in the following way. The annotations were first aligned to text in the Penn Treebank. Because our model assumes that arguments are contiguous spans, we then manually merged all "split" arguments, which with one exception were already contiguous spans of text. For the one split argument that was not a contiguous span, we replaced it with its maximal span; this instance is a quote broken by speaker attribution, where the split argument consists of the two halves of the quote, and it appears in our training set. We then removed special parsing tokens such as "trace" terminals from the text and realigned the spans. While BNB gives full credit as long as one argument in each argument "cluster" is found, our training objective assumes one argument per role. We therefore automatically reduced each argument cluster to a singleton set containing the argument closest to the trigger (sketched below, after Table 11). This reformulation of the problem limits our ability to compare to prior work.

Once all the data had been processed, we created training, development, and test splits. To avoid leaking information across splits, we bucketed examples by document and randomly assigned documents to the splits so that the splits contained instances in the proportions 80% (train), 10% (dev), and 10% (test).

Hyperparameter                         Value
Embeddings
    role size                          50
    feature (φ_l) size                 20
LSTM
    size                               200
    layers                             3
    dropout                            0.4
argument (F_A)
    size                               150
    layers                             2
event-role (F_{E,R})
    size                               150
    layers                             2
F_ã (Eqn. 1)
    layers                             2
F_l
    size                               150
    layers                             2
positional FFNN
    size                               150
    layers                             2
λ_A
k
steps
patience                               20
gradient clipping                      10.0

Table 11: Hyperparameters of the model trained on GVDB.
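Returning to the BNB processing above, a small sketch of the cluster-reduction step follows: each argument cluster is replaced by the single argument whose span is closest to the trigger. The (start, end) span representation and the token-gap distance measure are assumptions for illustration; the paper only states that the closest argument is kept.

def reduce_cluster_to_closest(trigger_span, cluster):
    """trigger_span and each argument span are (start, end) token offsets.
    Return the argument in the cluster closest to the trigger, measured by
    the gap between span boundaries (0 if the spans overlap)."""
    t_start, t_end = trigger_span

    def distance(span):
        s, e = span
        if e < t_start:
            return t_start - e
        if s > t_end:
            return s - t_end
        return 0  # overlapping spans

    return min(cluster, key=distance)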
E.2 Hyperparameters
We use the same hyperparameters as the best model for RAMS, shown in Table 9.
F GVDB Hyperparameters and Additional Results
The entire GVDB corpus consists of 7,366 articles. We exclude articles that do not have a reliable publication date or lack annotated spans for the roles we are interested in. Additionally, a buffer of 100 articles spanning roughly one week between the dev and test set is discarded, limiting the possibility of events occurring in both the development and test sets. We also filter out spans whose start and end boundaries are in different sentences, as these are unlikely to be well-formed argument spans. For evaluation, a slot's value is marked as correct under the strict setting if any of the predictions for that slot match the string of the correct answer exactly, while an approximate match is awarded if either a prediction contains the correct answer or the correct answer contains the predicted string. The approximate setting is necessary due to inconsistent annotations (e.g., omitting first or last names).

We experiment with the feature-based version of BERT-base and with ELMo as our contextualized encoder. Table 11 lists the numerical hyperparameters for this model. Since there is only one event per document and no explicit trigger, e is represented by a span embedding of the full document. We use the top four layers (9–12) of BERT-base cased (all three layers for ELMo) with a lexical dropout of 0.5. Everywhere else, we apply a dropout of 0.4. We train with the Adam optimizer (Kingma and Ba, 2015) and use patience-based early stopping. Our best checkpoint was after 8 epochs (roughly 9 hours on a single NVIDIA 1080Ti). Even though the official evaluation is string based, we used a span-based micro F1 metric for early stopping.

For this model, φ_l corresponds to a learned (bucketed) positional embedding of the argument span (i.e., distance from the start of the document). In computing the coarse score, we omit φ_c. When computing Equation 2, we omit s_{A,R} but keep all other terms. We adopt the character convolution of 50 8-dimensional filters of window sizes 3, 4, and 5 (Lee et al., 2017). With the same hyperparameters and feature choices, we perform an identical evaluation using ELMo instead of BERT. As the original documents are not tokenized, we use SpaCy 2.1.4 for finding sentence boundaries and tokenization.

The complete list of annotated fields is: VICTIM (name, age, race), SHOOTER (name, age, race), LOCATION (specific location or city), TIME (time of day or clock time), and WEAPON (weapon type, number of shots fired). A specific location is, for example, a park or a laundromat. While Pavlick et al. (2016) only make predictions for VICTIM.NAME, SHOOTER.NAME, LOCATION.(CITY | LOCATION), TIME.(TIME | CLOCK), and WEAPON.WEAPON, we perform predictions over all annotated span-based fields. The full results for both BERT and ELMo are reported in Table 12 and Table 13, respectively. BERT generally improves over ELMo across the board, but not by a sizeable margin. Despite the inability to directly compare, we nonetheless present a stronger and more comprehensive baseline for future work with GVDB.

Figure 13: Full version of Figure 4, showing cosine similarity between role embeddings. Best viewed in color.
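The two matching criteria used for GVDB evaluation can be sketched as follows; the function names and the string-level treatment are ours, for illustration only.

def strict_match(predictions, answer):
    """A slot is correct under the strict setting if any prediction matches
    the correct answer string exactly."""
    return any(p == answer for p in predictions)

def approximate_match(predictions, answer):
    """Approximate credit: a prediction contains the answer or the answer
    contains the prediction (tolerating, e.g., a missing first or last name)."""
    return any(p in answer or answer in p for p in predictions)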
[Table 12 rows: VICTIM.Name, SHOOTER.Name, LOCATION.City, TIME.Time, WEAPON.Weapon; columns: Strict and Partial match, each reporting P, R, and F1 for the Baseline and for Us.]

Table 12: P(recision), R(ecall), and F1 on event-based slot filling (GVDB) using BERT as the document encoder. Due to the different data splits and evaluation conditions, the results are not directly comparable to the baseline (Pavlick et al., 2016), which is provided only for reference. Fields that were aggregated in the baseline are predicted separately in our model. '–' indicates the result is not reported in the baseline.

Figure 14: Full version of Figure 4, showing row-normalized confusion between roles. Note that roles not predicted at all would result in empty rows and so are omitted from the table.

[Table 13 rows and columns match those of Table 12.]

Table 13: P(recision), R(ecall), and F1 on event-based slot filling (GVDB) using ELMo as the document encoder.