End-to-End Neural Event Coreference Resolution
Yaojie Lu, Hongyu Lin, Jialong Tang, Xianpei Han, Le Sun
Chinese Information Processing Laboratory, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
University of Chinese Academy of Sciences, Beijing, China
{yaojie2017, hongyu2016, jialong2019, xianpei, sunle}@iscas.ac.cn

Abstract
Traditional event coreference systems usually rely on a pipeline framework and hand-crafted features, which often face the error propagation problem and have poor generalization ability. In this paper, we propose an End-to-End Event Coreference approach – the E3C neural network, which can jointly model the event detection and event coreference resolution tasks, and learn to extract features from raw text automatically. Furthermore, because event mentions are highly diversified and event coreference is intricately governed by long-distance, semantic-dependent decisions, a type-guided event coreference mechanism is further proposed in our E3C neural network. Experiments show that our method achieves new state-of-the-art performance on two standard datasets.
Introduction

Event coreference resolution aims to identify which event mentions in a document refer to the same event (Ahn, 2006; Hovy et al., 2013). For example, the two event mentions in Figure 1, departing and leave, refer to the same EndPosition event of Nokia's CEO.

Traditional event coreference resolution methods usually rely on a series of upstream components (Lu and Ng, 2018), such as entity recognition and event detection. Such a pipeline framework, unfortunately, often suffers from the error propagation problem. For instance, the best event detection system in KBP 2017 only achieved 56 F1 (Jiang et al., 2017), which undoubtedly limits the performance of the follow-up event coreference task (35 Avg F1 on KBP 2017). Furthermore, most previous approaches use hand-crafted features (Chen et al., 2011; Lu and Ng, 2017a), which heavily depend on other NLP components (e.g., POS tagging, NER, syntactic parsing) and are thus hard to generalize to new languages/domains/datasets.
Huge Payday for Nokia's [departing]_EndPosition CEO ... Nokia's CEO prepares to [leave]_EndPosition the company and [rejoin]_StartPosition Microsoft ... It is an expensive [goodbye]_EndPosition for the executive, ...

Figure 1: An example of event coreference resolution, which contains two coreferential chains: an EndPosition event chain {departing, leave, goodbye} and a StartPosition chain {rejoin}.

In this paper, we propose an End-to-End Event Coreference method – the E3C neural network, which can predict event chains from a raw text in an end-to-end manner. For example, taking the raw text in Figure 1 as input, E3C will directly output two event coreference chains, {departing, leave, goodbye} and {rejoin}. By jointly modeling event detection and event coreference, the E3C neural network does not require any prior components, and the representations/pieces of evidence between different tasks and different decisions can be shared and reinforced. Besides, E3C is learned in an end-to-end manner, which can inherently resolve the error propagation problem.

End-to-end event coreference, however, is challenging due to the mention diversity and the long-distance coreference.
First, event mentions are highly diversified (Humphreys et al., 1997; Chen and Ji, 2009): they may be a variety of syntactic objects, including nouns, verbs, and even adjectives. For example, an EndPosition event can be triggered by departing (noun), leave (verb), goodbye (noun) and former (adj). By contrast, mentions in entity coreference are mostly noun phrases (Lu and Ng, 2018). Second, coreferential event mentions commonly appear across distant sentences, so event coreference is intricately governed by long-distance, semantic-dependent decisions (Choubey and Huang, 2018; Goyal et al., 2013; Peng et al., 2016). For example, in Figure 1, the closest antecedent of the mention goodbye – leave – is far from it. To resolve the coreference between these two distant, diverse event mentions, a system can only rely on their semantic meanings, i.e., they both describe the same EndPosition event (the departing of Nokia's CEO) but from different perspectives. By contrast, most entity mentions' closest antecedents are in the same or immediately preceding sentence (Choubey and Huang, 2018), and can therefore be resolved more easily using local and syntactic clues.

To resolve the mention diversity problem and the long-distance coreference problem, this paper further introduces a type-guided mechanism into our E3C neural network. This mechanism bridges distant, diverse event mentions by exploiting event type information in three ways: 1) a type-informed antecedent network, which enables E3C to capture more semantic information of event mentions by predicting coreferential scores and type scores simultaneously (in this paper, antecedents are coreferential mentions that appear earlier in the document); 2) a type-refined mention representation, which enhances the mention representation with type information, so that even lexically dissimilar mentions can be bridged together, such as the two diverse EndPosition mentions goodbye and departing; 3) a type-guided decoding algorithm, which can exploit global type consistency for more accurate event chains.

The main contributions of this paper are:

1. We propose an end-to-end neural network for event coreference resolution – the E3C neural network. E3C can jointly model event detection and event coreference, and learn to automatically extract features from raw text. To the best of our knowledge, this is the first end-to-end neural event coreference model that can achieve state-of-the-art performance.

2. We design a type-guided mechanism for event coreference, which can effectively resolve the mention diversity problem and the long-distance coreference problem in event coreference resolution.

3. We conduct experiments on two standard datasets, KBP 2016 and KBP 2017, which show that E3C achieves new state-of-the-art performance. Additional ablation experiments verify the effectiveness of the proposed type-guided mechanism.
Figure 2: The framework of our E3C neural network. Given a document, the mention proposal network proposes mention candidates (e.g., departing, leave, rejoin, goodbye), the type-informed antecedent network scores antecedent pairs (e.g., ⟨departing, leave⟩) and mention types (e.g., ⟨departing, EndPosition⟩), and type-guided decoding outputs the event chains [departing, leave, goodbye]_EndPosition and [rejoin]_StartPosition.

E3C: End-to-end Neural Event Coreference Resolution
Given a document $D = \{w_1, ..., w_n\}$, an end-to-end event coreference system needs to: 1) detect event mentions $\{m_1, ..., m_l\}$ (event detection); 2) predict all coreference chains $\{ev^*\}$ (event coreference resolution). For example, in Figure 1, the mentions are {departing, ..., goodbye} and two coreference chains will be predicted: {departing, leave, goodbye} and {rejoin}.

To this end, our E3C method first detects mention candidates via a mention proposal network, then identifies all mentions' antecedents via an antecedent prediction network. To resolve the mention diversity problem and the long-distance coreference problem, a type-guided event coreference mechanism is designed for E3C. Figure 2 shows the framework of our method. All components in E3C are differentiable and can be trained in an end-to-end manner. In the following, we describe them in detail.
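For concreteness, the sketch below illustrates the task interface on the running example of Figure 1. The data structures are illustrative, not the authors' actual format.

```python
# Illustrative input/output for the end-to-end task (running example from
# Figure 1): the input is a raw token sequence, the output typed event chains.
document = ("Huge Payday for Nokia's departing CEO ... Nokia's CEO prepares "
            "to leave the company and rejoin Microsoft ... It is an expensive "
            "goodbye for the executive ...").split()

# Expected output: typed event chains, each chain a list of trigger mentions.
expected_chains = [
    {"type": "EndPosition", "mentions": ["departing", "leave", "goodbye"]},
    {"type": "StartPosition", "mentions": ["rejoin"]},
]
```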
Mention Proposal Network

The mention proposal network detects all event mentions in a document, e.g., identifying {departing, ..., rejoin} as event mentions in Figure 1. Because event mentions are highly diversified expressions (e.g., goodbye, former and leave for EndPosition), we first capture the semantic information of all tokens via a contextualized representation layer, then identify mention candidates via a mention proposal layer. The details are as follows.
Contextualized Word Representation Layer.
To capture the semantic information for proposing event mentions, we learn a contextualized representation for each token. Concretely, we first obtain a task-independent representation for each token based on pre-trained BERT embeddings (Devlin et al., 2019). Following Tenney et al. (2019), a token $w_i$'s representation $h_i \in \mathbb{R}^d$ is pooled across different BERT layers using scalar mixing (Peters et al., 2018) as $h_i = \gamma \sum_{j=1}^{L} \alpha_j x_i^{(j)}$, where $x_i^{(j)}$ is the embedding of token $i$ from BERT layer $j$, $d$ is the size of the BERT embedding, $\alpha_j$ are softmax-normalized weights, and $\gamma$ is a scalar parameter.

Because event arguments can provide critical evidence (Bejan and Harabagiu, 2010; Lee et al., 2012; McConky et al., 2012; Cybulska and Vossen, 2013), we further obtain an event-specific token representation by distilling argument information from raw text implicitly. Specifically, we design a mask attention strategy (Dong et al., 2019). Given the task-independent token representations $H = \{h_1, h_2, ..., h_n\}$, our attention mechanism first models the relevance between tokens via a scaled dot-product attention (Vaswani et al., 2017) without linear projection, and then computes the final contextualized word representations $C = \{c_1, c_2, ..., c_n\}$ as:

$$C = \mathrm{softmax}\Big(\frac{HH^T}{\sqrt{d}} + M\Big)H, \qquad M_{ij} = \begin{cases} 0, & |i - j| < c \\ -\infty, & \text{otherwise} \end{cases} \qquad (1)$$

where $c$ is the size of the local window (this paper focuses on the local context since arguments empirically appear around event mentions: in the KBP 2017 training set, about 90% of arguments appear within this local window of the trigger; we set $c = 10$ in this paper), and $\sqrt{d}$ is the scaling factor.
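The following PyTorch sketch illustrates the two operations above: scalar mixing over BERT layers and the masked local attention of Eq. (1). Tensor shapes and function names are our own, and batching is omitted for clarity.

```python
import torch
import torch.nn.functional as F

def scalar_mix(layer_embeddings, alpha_logits, gamma):
    """h_i = gamma * sum_j softmax(alpha)_j * x_i^(j), pooled over BERT layers."""
    # layer_embeddings: [L, n, d]; alpha_logits: [L]; gamma: scalar parameter.
    alpha = F.softmax(alpha_logits, dim=0)       # softmax-normalized layer weights
    return gamma * torch.einsum("l,lnd->nd", alpha, layer_embeddings)

def masked_local_attention(H, window=10):
    """Eq. (1): C = softmax(HH^T / sqrt(d) + M) H with a +/-window local mask."""
    n, d = H.shape
    scores = H @ H.T / d ** 0.5                  # scaled dot-product, no projections
    idx = torch.arange(n)
    M = torch.where((idx[None, :] - idx[:, None]).abs() < window,
                    torch.zeros(n, n), torch.full((n, n), float("-inf")))
    return F.softmax(scores + M, dim=-1) @ H     # contextualized representations C
```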
Mention Proposal Layer. Given the token representations, the mention proposal layer assigns a mention score $s_m(i)$ to each span, which indicates the likelihood of span $i$ being an event mention. For example, in Figure 1 the mention proposal layer will assign spans {departing, leave, rejoin, ...} high $s_m(i)$ scores because they are highly likely to be event mentions, and assign spans {prepares to, company, ...} low $s_m(i)$ scores because they are unlikely to be event mentions.

Given all spans within a restricted length in a document, the mention proposal layer represents each span $i$ as $g_i = \hat{c}_i$, where $\hat{c}_i$ is the soft head attention-based aggregation of all token representations in span $i$ (Lee et al., 2016). (This paper restricts the span length to 1, which covers 96.6% of mentions in the KBP 2017 training set; in this case, the attended span representation $g_i$ is equivalent to $c_i$.) Given $g_i$, the mention score $s_m(i)$ is computed via a standard feed-forward neural network:

$$s_m(i) = \mathrm{FFNN}_m(g_i) \qquad (2)$$

Finally, we rank all spans according to their mention scores (Lee et al., 2017), and only retain the top-$l$ mentions $\{m_1, m_2, ..., m_l\}$ as event mention candidates for computational efficiency (in this paper, $l$ is set to a fixed fraction of the document length).
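A minimal sketch of the mention proposal layer, assuming precomputed span representations; the FFNN depth here is illustrative (Table 6 lists the actual hyper-parameters).

```python
import torch
import torch.nn as nn

class MentionProposal(nn.Module):
    """Score each span with FFNN_m (Eq. (2)) and keep the top-l candidates."""
    def __init__(self, d, hidden=150):
        super().__init__()
        self.ffnn_m = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))

    def forward(self, span_reprs, top_l):
        # span_reprs: [num_spans, d]; with span length 1, g_i equals c_i.
        s_m = self.ffnn_m(span_reprs).squeeze(-1)          # mention scores s_m(i)
        keep = torch.topk(s_m, k=min(top_l, s_m.numel())).indices
        return keep, s_m                                   # top-l candidates + scores
```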
Type-informed Antecedent Network

Given an event mention, the type-informed antecedent network predicts its antecedents, and the antecedent predictions can be used as local pair-wise coreference decisions. For example, our method will predict the antecedent of leave as departing in Figure 1, and ⟨departing, leave⟩ can be used as a pair-wise coreference decision.

For each mention $m_i$ in $\{m_1, ..., m_l\}$, the type-informed antecedent network produces two kinds of scores simultaneously: 1) $s(i, j)$ – the score for mention $m_j$ being an antecedent of $m_i$, where $m_j$ must appear before $m_i$ in the document; 2) $s(i, t_k)$ – the score for mention $m_i$'s type being $t_k$.

Antecedent Score. Given a mention $m_i$, the antecedent network computes an antecedent score $s(i, j)$ for each mention pair $\langle m_i, m_j \rangle$:

$$s(i, j) = s_m(i) + s_m(j) + s_a(i, j) \qquad (3)$$

where $j < i$, and $s_m(i)$ and $s_m(j)$ are the mention scores from the mention proposal layer. $s_a(i, j)$ measures the semantic similarity between $m_i$ and $m_j$, computed via a standard feed-forward neural network:

$$s_a(i, j) = \mathrm{FFNN}_a([g_i, g_j, g_i \circ g_j, \Phi(i, j)]) \qquad (4)$$

where $g_i \circ g_j$ is the element-wise similarity of the mention pair $\langle m_i, m_j \rangle$, and $\Phi(i, j)$ is the distance encoding between the two mentions.
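A sketch of the pairwise scorer of Eqs. (3)-(4); the bucketed distance embedding is our simplification of the distance encoding $\Phi(i, j)$, whose exact form the paper does not spell out.

```python
import torch
import torch.nn as nn

class AntecedentScorer(nn.Module):
    """s(i, j) = s_m(i) + s_m(j) + s_a(i, j), with s_a from Eq. (4)."""
    def __init__(self, d, hidden=150, num_buckets=10):
        super().__init__()
        self.dist_emb = nn.Embedding(num_buckets, d)       # simplified Phi(i, j)
        self.ffnn_a = nn.Sequential(nn.Linear(4 * d, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))

    def forward(self, g_i, g_j, s_m_i, s_m_j, dist_bucket):
        phi = self.dist_emb(dist_bucket)
        # g_i * g_j is the element-wise similarity term g_i o g_j of Eq. (4).
        pair = torch.cat([g_i, g_j, g_i * g_j, phi], dim=-1)
        s_a = self.ffnn_a(pair).squeeze(-1)
        return s_m_i + s_m_j + s_a                         # antecedent score s(i, j)
```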
Event Type Score. As described in §1, event coreference is intricately governed by long-distance, semantic-dependent decisions. To address this issue, this paper exploits event type information for better event coreference resolution. Specifically, besides antecedent prediction for each mention, we further predict its event type so that: 1) the neural network will be guided to capture more semantic information about event mentions (Durrett and Klein, 2014); 2) the type information ensures global type consistency during coreference resolution, i.e., mentions in the same coreference chain will have the same event type.

Specifically, we first embed all event types $T = \{t_1, ..., t_T\}$ via a hierarchical embedding algorithm. The embedding of $t_k$ is $g_{t_k} = W_e \cdot [e_{event}, e_{type}(t_k)]$, where $e_{event}$ is shared by all event types, $e_{type}(t_k)$ is the embedding of $t_k$, and $W_e$ is a mapping matrix. The dimension of $g_{t_k}$ is the same as that of the mention embedding $g_i$. Then the type scores $s(i, t_k)$ are computed via the same scoring functions as antecedent prediction:

$$s_m(t_k) = \mathrm{FFNN}_m(g_{t_k})$$
$$s_a(i, t_k) = \mathrm{FFNN}_a([g_i, g_{t_k}, g_i \circ g_{t_k}, \Phi(i, t_k)])$$
$$s(i, t_k) = s_m(i) + s_m(t_k) + s_a(i, t_k) \qquad (5)$$

where the distance $\Phi(i, t_k)$ is zero in the type score computation.

For non-mention spans, we add a dummy antecedent $\varepsilon$ and assign the antecedents of all non-mention spans to $\varepsilon$, e.g., company and prepares to in Figure 1. We fix the score $s(i, \varepsilon)$ to 0, and identify a span $i$ as a non-mention span if all its antecedent scores $s(i, j) \leq 0$ and all its type scores $s(i, t_k) \leq 0$.

In this way, we obtain the antecedent scores and the type scores for each mention via our type-informed antecedent network.
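The type scores of Eq. (5) reuse the mention and antecedent scorers; a sketch, assuming the same FFNN modules as in the previous snippets and a zero distance encoding for mention-type pairs.

```python
import torch

def type_scores(g_i, s_m_i, type_embeddings, ffnn_m, ffnn_a, zero_phi):
    """Eq. (5): score mention i against every event type t_k.

    type_embeddings: [T, d] hierarchical type embeddings g_{t_k};
    zero_phi: [d] zero distance encoding used for mention-type pairs.
    """
    T = type_embeddings.size(0)
    g = g_i.expand(T, -1)                            # broadcast mention repr over types
    s_m_t = ffnn_m(type_embeddings).squeeze(-1)      # s_m(t_k), same scorer as mentions
    pair = torch.cat([g, type_embeddings, g * type_embeddings,
                      zero_phi.expand(T, -1)], dim=-1)
    s_a_t = ffnn_a(pair).squeeze(-1)                 # s_a(i, t_k)
    return s_m_i + s_m_t + s_a_t                     # s(i, t_k) for all types
```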
Type-refined Mention Representation

We now describe how to further refine a mention's representation using its type information, so that it can capture more semantic information for event coreference resolution. For example, although goodbye and departing are lexically dissimilar, we can still capture their semantic similarity by further encoding their event type information, i.e., both of them have the same event type – EndPosition.

To refine the mention representation, we first define a probability distribution $Q(t_k)$ over all event types $T \cup \{\varepsilon\}$ for each mention span $m_i$:

$$Q(t_k) = \frac{e^{s(i, t_k)}}{\sum_{t'_k \in T \cup \{\varepsilon\}} e^{s(i, t'_k)}} \qquad (6)$$

where $s(i, t_k)$ is the type score. We then obtain an expected event type representation $\tilde{g}_i$ for each span $m_i$ using the type distribution $Q(t_k)$:

$$\tilde{g}_i = \sum_{t'_k \in T} Q(t_k = t'_k) \cdot g_{t'_k} + Q(t_k = \varepsilon) \cdot g_i \qquad (7)$$

Then, we obtain a refined span representation $g'_i$ by combining the expected event type representation $\tilde{g}_i$ and the original span representation $g_i$ via a learnable adaptive gate $f_i$:

$$g'_i = f_i \circ g_i + (1 - f_i) \circ \tilde{g}_i, \qquad f_i = \sigma(W_f \cdot [g_i, \tilde{g}_i]) \qquad (8)$$

where $W_f$ is a weight matrix. Finally, the antecedent network recomputes the coreferential antecedent score $s'(i, j)$ and the event type score $s'(i, t_k)$ using the refined span representation $g'_i$.

Type-guided Decoding

The type-informed antecedent network produces pairwise coreference scores for mention pairs. To form coreference chains, a naive approach is to directly connect all mentions via their highest-scored antecedents. Unfortunately, such a greedy decoding algorithm only considers local pair-wise consistency, and its results may not be globally optimal; e.g., a coreference chain may contain mentions with different event types.

To address this issue, we propose a decoding algorithm that ensures the global consistency of a coreference chain through a type-guided mechanism. For example, to resolve the chain {departing, leave} in Figure 1, E3C considers both the antecedent score of ⟨departing, leave⟩ and the type consistency that both departing and leave are EndPosition mentions.

Concretely, given the mentions $\{m_1, ..., m_l\}$ in a document $D$, E3C constructs the event coreference chains by sequentially identifying the best antecedent of each mention while further considering type consistency. For each mention $m_i$, we first find the mention $a_i$ that has the maximum coreferential score with $m_i$, where $m_j$ appears before $m_i$:

$$a_i = \arg\max_{m_j, j < i} s(i, j) \qquad (9)$$

Let $t_i$ be $m_i$'s highest-scoring type. When $s(i, a_i) > s(i, t_i)$, E3C considers $m_i$ and $a_i$ as type consistent and links mention $m_i$ to $a_i$; otherwise, when $s(i, a_i) \leq s(i, t_i)$, E3C considers $m_i$ and $a_i$ as type inconsistent and starts a new event chain for $m_i$ with its type $t_i$.
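A sketch of the type-guided decoding loop just described; the score containers are plain Python structures for readability, not the model's tensors.

```python
def type_guided_decode(mentions, s_pair, s_type):
    """Link each mention to its best antecedent unless the best type score wins.

    s_pair[i][j]: antecedent score s(i, j) for j < i;
    s_type[i]: dict mapping event type -> s(i, t_k). Mentions in document order.
    """
    chains, chain_of = [], {}
    for i in range(len(mentions)):
        t_i, s_t = max(s_type[i].items(), key=lambda kv: kv[1])  # best type for m_i
        best_j, s_a = None, float("-inf")
        for j in range(i):                                       # Eq. (9): best earlier antecedent
            if s_pair[i][j] > s_a:
                best_j, s_a = j, s_pair[i][j]
        if best_j is not None and s_a > s_t:                     # type consistent: link to a_i
            chain_of[i] = chain_of[best_j]
            chains[chain_of[i]]["mentions"].append(mentions[i])
        else:                                                    # type inconsistent: new chain
            chain_of[i] = len(chains)
            chains.append({"type": t_i, "mentions": [mentions[i]]})
    return chains
```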
Model Learning

This section describes how to learn the E3C neural network in an end-to-end manner. Given a training corpus $D = \{D_1, ..., D_N\}$, where each instance $D_i$ is a document with its event mention, mention type, and coreference annotations, our objective function contains two parts: $L_{antecedent}(\Theta)$ – the antecedent loss, and $L_{proposal}(\Theta)$ – the mention proposal loss:

$$L(\Theta) = L_{antecedent}(\Theta) + \lambda L_{proposal}(\Theta) \qquad (10)$$

where $\lambda$ is the coefficient of the mention proposal loss (we set $\lambda = 1$ in this paper). This paper optimizes $\Theta$ by maximizing $L(\Theta)$ via Adamax (Kingma and Ba, 2015). The two losses are as follows.
Antecedent Loss. It measures whether a mention links to its correct antecedents. For each mention $m_i$, this paper identifies its gold antecedent set GOLD(i) as shown in Figure 3:

1) For the first mention of an event chain, the gold antecedent is its event type. For example, the gold antecedent set of departing is {EndPosition}.

2) For the remaining mentions in a chain, the gold antecedents are all of their coreferential antecedents. For example, the gold antecedent set of goodbye is {departing, leave}.

3) For non-mention spans, the gold antecedent is the dummy antecedent $\varepsilon$. For example, the gold antecedent set of company is $\{\varepsilon\}$.

Given GOLD(i) for each $m_i$ in the top-$l$ mention set of document $D$, the antecedent loss is a marginal log-likelihood function:

$$L_{antecedent}(\Theta) = \log \prod_{i=1}^{l} \sum_{\hat{y} \in \mathcal{Y}(i) \cap \mathrm{GOLD}(i)} P(\hat{y} \mid D)$$
$$P(y_i \mid D) = \frac{\exp(s(i, y_i))}{\sum_{y' \in \mathcal{Y}(i)} \exp(s(i, y'))} \qquad (11)$$

where $\mathcal{Y}(i)$ is the valid antecedent set for $m_i$.
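Eq. (11) can be implemented as a masked log-sum-exp over each mention's candidate scores; a sketch, assuming dense score and gold-mask tensors (hypothetical shapes).

```python
import torch

def antecedent_loss(scores, gold_mask):
    """Marginal log-likelihood over gold antecedents (Eq. (11)).

    scores: [l, A] scores s(i, y) over the valid candidate set Y(i), which
            includes earlier mentions, event types, and the dummy antecedent;
    gold_mask: [l, A] boolean, True where candidate y is in GOLD(i).
    """
    log_probs = torch.log_softmax(scores, dim=-1)         # log P(y | D)
    gold = log_probs.masked_fill(~gold_mask, float("-inf"))
    marginal = torch.logsumexp(gold, dim=-1)              # log sum over gold candidates
    return marginal.sum()                                 # maximized during training
```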
Mention Proposal Loss. It measures whether our model can accurately identify event mentions. Specifically, the mention proposal loss is the binary cross-entropy of the mention proposal network:

$$L_{proposal}(\Theta) = \sum_{i=1}^{n} y_i \log \sigma(s_m(i)) + (1 - y_i) \log(1 - \sigma(s_m(i))) \qquad (12)$$

where $\sigma$ is the sigmoid function, and $y_i = 1$ indicates that span $i$ is an event mention, otherwise $y_i = 0$.
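A sketch of Eq. (12); note that the standard PyTorch helper applies the sigmoid internally and returns the negated (loss) form of the log-likelihood written above.

```python
import torch.nn.functional as F

def proposal_loss(s_m, y):
    """Binary cross-entropy over span mention scores (Eq. (12)).

    s_m: [n] raw mention scores; y: [n] with 1.0 for gold event mentions else 0.0.
    """
    # Returns -sum_i [ y_i log sigma(s_m(i)) + (1 - y_i) log(1 - sigma(s_m(i))) ].
    return F.binary_cross_entropy_with_logits(s_m, y, reduction="sum")
```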
Annotated event chains:
EndPosition – departing – leave – goodbye
StartPosition – rejoin
ε – company

Gold antecedent sets:
GOLD(departing) = {EndPosition}
GOLD(rejoin) = {StartPosition}
GOLD(leave) = {departing}
GOLD(goodbye) = {departing, leave}
GOLD(company) = {ε}

Figure 3: An illustration of gold antecedent sets.
Experiments

Datasets. Following previous studies (Lu and Ng, 2016b, 2017a; Jiang et al., 2017; Huang et al., 2019), we use the KBP 2016 and KBP 2017 English datasets for evaluation:

KBP 2016. For KBP 2016, we use the same setup as Lu and Ng (2017a), i.e., 509 documents for training, 139 documents for parameter tuning, and the official KBP 2016 eval set for evaluation.
KBP 2017.
Following Huang et al. (2019), we use the English portion of the KBP 2015 and 2016 datasets for training, and the KBP 2017 dataset for evaluation. We sample 50 documents from the 2016 evaluation dataset as the validation set.
Baselines. We compare E3C with the following baselines:

Multi-Pass Sieve (Lu and Ng, 2016a) is an iterative pipeline-based method that uses both hand-crafted rules and automatic classifiers. We compare with two such systems: the Top 1 system in TAC 2016 (Lu and Ng, 2016b) and the Top 1 system in TAC 2017 (Jiang et al., 2017), both of which use an additional ensemble strategy for better event detection performance.

(There are also other public event coreference datasets: OntoNotes (Pradhan et al., 2007), ECB+ (Bejan and Harabagiu, 2008; Cybulska and Vossen, 2014), and ACE (LDC, 2005). OntoNotes and ECB+ are not annotated with event type information and are therefore not appropriate for evaluating our end-to-end event coreference model. The ACE dataset has a strict notion of event identity (Song et al., 2015; Lu and Ng, 2017a), which considers two event mentions coreferential if and only if "they had the same agent(s), patient(s), time, and location"; because E3C does not perform argument extraction for event coreference, ACE is not used in this paper. For fair comparison, we choose the KBP datasets (Ellis et al., 2015, 2016; Getman et al., 2017) so that different systems can be compared in the same settings. Also, different from the official type-constraint settings in KBP 2016 and KBP 2017, Choubey and Huang (2018) used relaxed constraints without considering event mention type, so we exclude their system for fair comparison.)
KBP 2016                                Type-F1   B³      CEAF_e   MUC     BLANC   AVG-F
Top 1 in TAC 2016 (Lu and Ng, 2016b)    46.99     37.49   34.21    26.37   22.25   30.08
Mention Ranking (Lu and Ng, 2017b)      46.99     38.64   36.16    26.30   23.59   31.17
Joint Model (Lu and Ng, 2017a)          49.30     40.90   39.00    27.41   25.00   33.08
Interact Model_BERT
E3C (this paper)

KBP 2017                                Type-F1   B³      CEAF_e   MUC     BLANC   AVG-F
Top 1 in TAC 2017 (Jiang et al., 2017)  56.19     43.84   39.86    30.63   26.97   35.33
Interact Model (Huang et al., 2019)     -         42.84   39.01    31.12   24.99   34.49
+ Transfer (Huang et al., 2019)         -         43.20   40.02    35.66
Interact Model_BERT
E3C (this paper)

Table 1: Overall performance on the KBP 2016 and KBP 2017 datasets; the baseline results are adapted from their original papers.
Mention Ranking (Lu and Ng, 2017b) ranks the candidate antecedents of all event mentions and selects the top-ranked antecedent for each mention.

Joint Model (Lu and Ng, 2017a) is a hand-crafted feature-based system that addresses the error propagation problem by jointly learning the event trigger detection, event coreference resolution, and event anaphoricity prediction tasks.

Interact Model (Huang et al., 2019) is the state-of-the-art pair-wise method, which decides whether two mentions are coreferential using an interactive binary classifier, and then links coreferential mentions to produce the final event chains. We also compare with an enhanced model, Interact Model + Transfer, which transfers argument compatibility features from external unlabeled data. In addition, we reimplement the Interact Model using BERT as its feature extractor – Interact Model_BERT – so that E3C and the Interact Model can be compared with the same feature extractor.
Evaluation. We use the standard evaluation metrics in the KBP evaluation, and compute them using the official evaluation toolkit (https://github.com/hunterhector/EvmEval). We use four measures: MUC (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998), CEAF_e (Luo, 2005), and BLANC (Recasens and Hovy, 2011). Following previous studies (Lu and Ng, 2017a; Huang et al., 2019), the primary metric AVG-F is the unweighted average of the above four F-scores. We also report event detection performance using the typed F1-score (Type-F1).

Overall Results. Table 1 shows the overall performance on KBP 2016 and KBP 2017. We can see that:

1. The E3C neural network achieves state-of-the-art performance on both datasets.
Compared with all baselines, E3C gains at least 11.6% and 6.9% AVG-F improvements on KBP 2016 and KBP 2017, respectively. This verifies the effectiveness of the end-to-end framework and the type-guided event coreference mechanism.

2. By jointly modeling all tasks together and learning all components in an end-to-end manner, the E3C neural network significantly outperforms pipeline baselines. Compared with Interact Model_BERT, which uses the same BERT-based feature extractor, E3C still gains 3.2% Type-F1 and 6.9% AVG-F improvements on KBP 2017. This verifies the effectiveness of end-to-end training in reducing the error propagation problem. Besides, by modeling all tasks together, representations and pieces of evidence can be shared and reinforced between different decisions and tasks.
Detailed Analysis

In this section, we analyze the effects of the type-guided mechanism, end-to-end learning, and pre-trained models.
Effect of Type-Guided Mechanism. To investigate the effect of the type-guided mechanism in E3C, we conduct ablation experiments by ablating the type-refined representation (-Type-Refined) and by replacing type-guided decoding with naive best-antecedent decoding (-Type-Guided). Type Rule is a simple heuristic method that regards all event mentions of the same type as coreferential. The results are shown in Table 2. We can see that:

1) Type-guided decoding is effective for event coreference. By considering both type consistency and antecedent scores, E3C obtains an 8.1% (3.05) AVG-F improvement over naive decoding.

2) The type-refined representation helps resolve the mention diversity problem and the long-distance coreference problem. By incorporating type information into the mention representation, E3C obtains a 3.1% (1.24) AVG-F improvement.
Effect of End-to-end Learning. To investigate the effect of end-to-end learning, we conduct experiments on three variations of E3C: E3C_TwoStage, which models event mention detection and coreferential antecedent prediction as two independent models that share span embeddings; E3C_w/o Proposal Loss, which removes the mention proposal loss; and E3C_GoldMention, which uses gold mentions for coreference resolution and type scoring, but still needs to predict the type of each mention. Table 3 shows the performance of the three systems. We can find that:

1) The one-pass paradigm of E3C can effectively share and reinforce the decisions between the two tasks. Compared with E3C_TwoStage, which has comparable event detection performance (Type-F1), E3C gains 5.2% AVG-F on the downstream event coreference task.

2) Incorporating the mention proposal loss can significantly enhance mention detection performance. By removing the mention proposal loss, E3C loses 2.3% and 4.2% on Type-F1 and AVG-F, respectively. Additionally, the coreference performance can be further significantly improved if gold mentions are used – from 40.85 of E3C to 53.72 of E3C_GoldMention. This shows that event detection is still a bottleneck for event coreference.
Effect of Pre-trained Models. Pre-trained models are important for neural network-based methods. To investigate their effect on E3C, Table 4 shows the performance of E3C using ELMo (Peters et al., 2018), BERT_BASE-Cased, BERT_LARGE-Uncased, and BERT_LARGE-WWM-Uncased (Devlin et al., 2019), as well as GloVe (Pennington et al., 2014) 300-dimensional word embeddings plus character embeddings, where the contextual layer is a BiLSTM. We can find that:

1) Due to the diversity of event mentions, pre-trained contextualized embeddings are critical for mention representation. All contextualized embeddings outperform GloVe by a large margin in both event detection and event coreference.

2) E3C can be further improved by employing better pre-trained contextual embeddings. Compared with the BERT_BASE-Uncased model used in this paper, E3C equipped with BERT_LARGE-WWM-Uncased gains further improvements.
                  AVG-F   Δ
E3C               40.85
- Type-Refined    39.61   -1.24
- Type-Guided     37.80   -3.05
Type Rule         31.68   -9.17

Table 2: Ablation results of the type-guided mechanism on KBP 2017.
                        Type-F1   AVG-F
E3C                     58.33     40.85
E3C_TwoStage
E3C_w/o Proposal Loss
E3C_GoldMention                   53.72

Table 3: Performance of different E3C settings on KBP 2017.
E3C                               Type-F1   AVG-F
BERT_BASE-Uncased (this paper)    58.33     40.85
GloVe + Char + BiLSTM             52.45     36.43
ELMo                              55.24     37.27
BERT_BASE-Cased
BERT_LARGE-Uncased
BERT_LARGE-WWM-Uncased

Table 4: Performance using different pre-trained models for E3C on KBP 2017.
Discussion

Event Detection. From the above experiments, we find that one main bottleneck of event coreference is event detection. As shown in Table 3, using gold mentions results in a massive improvement in AVG-F, from 40.85 to 53.72. Besides, even if we fix all coreference link errors in the predicted event detection results, the growth of AVG-F is still limited, from 40.85 to 42.80. Event detection is challenging because: 1) event mentions are diversified and ambiguous, and detecting them requires a deep understanding of contexts; 2) some event mentions are multi-tagged, i.e., one span triggers multiple events. Because this paper does not consider this issue, it misses some mentions.
Domain Adaptation. We find that domain adaptation is another challenge for event coreference. Table 5 shows the results of our E3C model on different genres of the KBP 2017 evaluation dataset: 83 newswire documents (NW) and 84 discussion forum threads (DF).

       Type-F1   AVG-F
NW     59.27     42.39
DF     57.38     39.28

Table 5: Results on subsets of different genres in KBP 2017. NW indicates newswire documents, while DF indicates discussion forum threads.

There is a significant performance gap between the two genres, probably because: 1) different from formal NW documents, DF threads are often informal and lack coherent discourse structures (Choubey and Huang, 2018); 2) event chains in a discussion forum thread are relevant not only to the contents, but also to speaker information and the discussion topic. Solving this problem requires a deep understanding of dialogue contexts.
Argument Modeling. In this paper, we exploited argument information implicitly via a mask attention strategy, without explicitly extracting argument roles. However, we believe event coreference can be further enhanced by modeling argument information more effectively: 1) incorporating explicit argument information can effectively capture the semantic information of events for better feature representation (Peng et al., 2016; Choubey and Huang, 2017); 2) the coreference/compatibility of arguments is crucial for deciding coreference relations between events (Lee et al., 2012; Huang et al., 2019). Unfortunately, traditional argument-based pipeline event coreference methods (Chen and Ng, 2014; Yang et al., 2015) suffer from the error propagation problem of upstream components, e.g., argument extraction and entity coreference. Denoising feature composition algorithms or joint modeling of entity/event coreference may effectively solve the argument error propagation problem.
Related Work

Event Coreference. Event coreference aims to cluster textual mentions of the same event. Different from cross-document event coreference works (Yang et al., 2015; Zhang et al., 2015; Choubey and Huang, 2017; Kenyon-Dean et al., 2018; Barhom et al., 2019), this paper focuses on the within-document event coreference task.

Traditional approaches (Chen and Ji, 2009; Chen and Ng, 2014; Liu et al., 2014) are mostly pipeline-based systems that depend on several upstream components and thus often suffer from the error propagation problem. To address this problem, many joint models have been proposed, e.g., joint inference (Chen and Ng, 2016; Lu et al., 2016) and joint modeling (Araki and Mitamura, 2015; Lu and Ng, 2017a). Furthermore, the above methods use hand-crafted features, which are hard to generalize to new languages/domains/datasets. Several neural network models (Krause et al., 2016; Chao et al., 2019) and transfer techniques (Huang et al., 2019) have been proposed to complement these methods with automatic feature learning abilities.

Compared to previous approaches, E3C is the first fully end-to-end neural event coreference resolution approach. It can extract features, detect event mentions, and resolve event chains in the same network.
End-to-end Entity Coreference. Recently, end-to-end neural networks (Lee et al., 2017, 2018; Kantor and Globerson, 2019; Fei et al., 2019; Joshi et al., 2019) have achieved significant progress in entity coreference. These methods also motivate this study. Due to the mention diversity and long-distance coreference problems, event coreference is usually considered more challenging than entity coreference (Lu and Ng, 2018; Choubey and Huang, 2018). This paper proposes a type-guided mechanism, which can resolve the above challenges by incorporating type information, learning semantic event mention representations, and modeling long-distance, semantic-dependent evidence.
Conclusions

This paper proposes a state-of-the-art, end-to-end neural network for event coreference resolution – the E3C neural network, which jointly models event detection and event coreference, and learns to extract features from the raw text directly. A type-guided mechanism is further proposed for resolving the mention diversity problem and the long-distance coreference problem, which: 1) informs coreference prediction with type scoring, 2) refines mention representations using type information, and 3) guides decoding under type consistency. Experiments show that our method achieves state-of-the-art performance on KBP 2016 and KBP 2017. For future work, we will focus on the bottlenecks of event coreference, e.g., event detection and argument modeling.

References
David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, pages 1–8, Sydney, Australia. Association for Computational Linguistics.

Jun Araki and Teruko Mitamura. 2015. Joint event trigger identification and event coreference resolution with structured perceptron. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2074–2080, Lisbon, Portugal. Association for Computational Linguistics.

Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, volume 1, pages 563–566, Granada.

Shany Barhom, Vered Shwartz, Alon Eirew, Michael Bugert, Nils Reimers, and Ido Dagan. 2019. Revisiting joint modeling of cross-document entity and event coreference resolution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4179–4189, Florence, Italy. Association for Computational Linguistics.

Cosmin Bejan and Sanda Harabagiu. 2008. A linguistic resource for discovering event structures and resolving event coreference. In Proceedings of the Sixth International Conference on Language Resources and Evaluation, Marrakech, Morocco.

Cosmin Bejan and Sanda Harabagiu. 2010. Unsupervised event coreference resolution with rich linguistic features. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1412–1422, Uppsala, Sweden. Association for Computational Linguistics.

Wenhan Chao, Ping Wei, Zhunchen Luo, Xiao Liu, and Guobin Sui. 2019. Selective expression for event coreference resolution on Twitter, pages 1–8. IEEE.

Bin Chen, Jian Su, Sinno Jialin Pan, and Chew Lim Tan. 2011. A unified event coreference resolution by integrating multiple resolvers. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 102–110, Chiang Mai, Thailand.

Chen Chen and Vincent Ng. 2014. SinoCoreferencer: An end-to-end Chinese event coreference resolver. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 4532–4538, Reykjavik, Iceland.

Chen Chen and Vincent Ng. 2016. Joint inference over a lightly supervised information extraction pipeline: Towards event coreference resolution for resource-scarce languages. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2913–2920. AAAI Press.

Zheng Chen and Heng Ji. 2009. Graph-based event coreference resolution. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, pages 54–57, Suntec, Singapore. Association for Computational Linguistics.

Prafulla Kumar Choubey and Ruihong Huang. 2017. Event coreference resolution by iteratively unfolding inter-dependencies among events. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2124–2133, Copenhagen, Denmark. Association for Computational Linguistics.

Prafulla Kumar Choubey and Ruihong Huang. 2018. Improving event coreference resolution by modeling correlations between event coreference chains and document topic structures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 485–495, Melbourne, Australia. Association for Computational Linguistics.

Agata Cybulska and Piek Vossen. 2013. Semantic relations between events and their time, locations and participants for event coreference resolution. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pages 156–163, Hissar, Bulgaria.

Agata Cybulska and Piek Vossen. 2014. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 4545–4552, Reykjavik, Iceland.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054.

Greg Durrett and Dan Klein. 2014. A joint model for entity analysis: Coreference, typing, and linking. Transactions of the Association for Computational Linguistics, 2:477–490.

Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies, and Stephanie Strassel. 2015. Overview of linguistic resources for the TAC KBP 2015 evaluations: Methodologies and results. In TAC 2015.

Joe Ellis, Jeremy Getman, Neil Kuster, Zhiyi Song, Ann Bies, and Stephanie Strassel. 2016. Overview of linguistic resources for the TAC KBP 2016 evaluations: Methodologies and results. In TAC 2016.

Hongliang Fei, Xu Li, Dingcheng Li, and Ping Li. 2019. End-to-end deep reinforcement learning based coreference resolution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 660–665, Florence, Italy. Association for Computational Linguistics.

Jeremy Getman, Joe Ellis, Zhiyi Song, Jennifer Tracey, and Stephanie Strassel. 2017. Overview of linguistic resources for the TAC KBP 2017 evaluations: Methodologies and results. In TAC 2017.

Kartik Goyal, Sujay Kumar Jauhar, Huiying Li, Mrinmaya Sachan, Shashank Srivastava, and Eduard Hovy. 2013. A structured distributional semantic model for event co-reference. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 467–473, Sofia, Bulgaria. Association for Computational Linguistics.

Eduard Hovy, Teruko Mitamura, Felisa Verdejo, Jun Araki, and Andrew Philpot. 2013. Events are not simple: Identity, non-identity, and quasi-identity. In Workshop on Events: Definition, Detection, Coreference, and Representation, pages 21–28, Atlanta, Georgia. Association for Computational Linguistics.

Yin Jou Huang, Jing Lu, Sadao Kurohashi, and Vincent Ng. 2019. Improving event coreference resolution by learning argument compatibility from unlabeled data. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 785–795, Minneapolis, Minnesota. Association for Computational Linguistics.

Kevin Humphreys, Robert Gaizauskas, and Saliha Azzam. 1997. Event coreference for information extraction. In Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts.

Shanshan Jiang, Yihan Li, Tianyi Qin, Qian Meng, and Bin Dong. 2017. SRCB entity discovery and linking (EDL) and event nugget systems for TAC 2017. In TAC 2017.

Mandar Joshi, Omer Levy, Luke Zettlemoyer, and Daniel Weld. 2019. BERT for coreference resolution: Baselines and analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 5802–5807, Hong Kong, China. Association for Computational Linguistics.

Ben Kantor and Amir Globerson. 2019. Coreference resolution with entity equalization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 673–677, Florence, Italy. Association for Computational Linguistics.

Kian Kenyon-Dean, Jackie Chi Kit Cheung, and Doina Precup. 2018. Resolving event coreference with supervised representation learning and clustering-oriented regularization. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 1–10, New Orleans, Louisiana. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.

Sebastian Krause, Feiyu Xu, Hans Uszkoreit, and Dirk Weissenborn. 2016. Event linking with sentential features from convolutional neural networks. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 239–249, Berlin, Germany. Association for Computational Linguistics.

LDC. 2005. ACE (Automatic Content Extraction) English annotation guidelines for events. Technical report, Linguistic Data Consortium.

Heeyoung Lee, Marta Recasens, Angel Chang, Mihai Surdeanu, and Dan Jurafsky. 2012. Joint entity and event coreference resolution across documents. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 489–500, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 188–197, Copenhagen, Denmark. Association for Computational Linguistics.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 687–692, New Orleans, Louisiana. Association for Computational Linguistics.

Kenton Lee, Shimi Salant, Tom Kwiatkowski, Ankur Parikh, Dipanjan Das, and Jonathan Berant. 2016. Learning recurrent span representations for extractive question answering. arXiv preprint arXiv:1611.01436.

Zhengzhong Liu, Jun Araki, Eduard Hovy, and Teruko Mitamura. 2014. Supervised within-document event coreference using information propagation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 4539–4544, Reykjavik, Iceland.

Jing Lu and Vincent Ng. 2016a. Event coreference resolution with multi-pass sieves. In Proceedings of the Tenth International Conference on Language Resources and Evaluation, pages 3996–4003, Portorož, Slovenia.

Jing Lu and Vincent Ng. 2016b. UTD's event nugget detection and coreference system at KBP 2016. In TAC 2016.

Jing Lu and Vincent Ng. 2017a. Joint learning for event coreference resolution. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 90–101, Vancouver, Canada. Association for Computational Linguistics.

Jing Lu and Vincent Ng. 2017b. Learning antecedent structures for event coreference resolution, pages 113–118.

Jing Lu and Vincent Ng. 2018. Event coreference resolution: A survey of two decades of research. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 5479–5486.

Jing Lu, Deepak Venugopal, Vibhav Gogate, and Vincent Ng. 2016. Joint inference for event coreference resolution. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3264–3275, Osaka, Japan.

Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 25–32, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Katie McConky, Rakesh Nagi, Moises Sudit, and William Hughes. 2012. Improving event co-reference by context extraction and dynamic feature weighting, pages 38–43. IEEE.

Haoruo Peng, Yangqiu Song, and Dan Roth. 2016. Event detection and co-reference with minimal supervision. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 392–402, Austin, Texas. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Sameer S. Pradhan, Lance Ramshaw, Ralph Weischedel, Jessica MacBride, and Linnea Micciulla. 2007. Unrestricted coreference: Identifying entities and events in OntoNotes. In Proceedings of the International Conference on Semantic Computing, pages 446–453, Washington, DC, USA.

Marta Recasens and Eduard Hovy. 2011. BLANC: Implementing the Rand index for coreference evaluation. Natural Language Engineering, 17(4):485–510.

Zhiyi Song, Ann Bies, Stephanie Strassel, Tom Riese, Justin Mott, Joe Ellis, Jonathan Wright, Seth Kulick, Neville Ryant, and Xiaoyi Ma. 2015. From light to rich ERE: Annotation of entities, relations, and events. In Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation, pages 89–98, Denver, Colorado. Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proceedings of the International Conference on Learning Representations.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008.

Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Sixth Message Understanding Conference (MUC-6), pages 45–52, Columbia, Maryland.

Bishan Yang, Claire Cardie, and Peter Frazier. 2015. A hierarchical distance-dependent Bayesian model for event coreference resolution. Transactions of the Association for Computational Linguistics, 3:517–528.

Tongtao Zhang, Hongzhi Li, Heng Ji, and Shih-Fu Chang. 2015. Cross-document event coreference resolution based on cross-media features. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 201–206, Lisbon, Portugal. Association for Computational Linguistics.
A Experiment Details

Table 6 presents the detailed hyper-parameters of the E3C model used in our experiments. All experiments were conducted on an Nvidia TITAN RTX GPU.
Parameter name                        Parameter value
Mini-batch size                       1
Max epochs for stopping training      150
Patience for early stopping           10
Max antecedents number                50
Max document length for training      1024
Dropout for word representation       0.5
Dropout for FFNN                      0.2
Hidden layers for FFNN                2
Hidden units for FFNN                 150
Optimizer                             Adamax
Initial learning rate                 0.001
Learning rate anneal factor           0.5
Learning rate anneal patience         5

Table 6: Hyper-parameters of the E3C model used in our experiments. FFNN indicates the feed-forward neural networks for mention proposal and antecedent scoring.
B Data Sets
We used the Stanford CoreNLP toolkit (https://stanfordnlp.github.io/CoreNLP/) to preprocess all documents for XML tag cleaning, sentence splitting, and tokenization. Since only 18 categories were used for the official evaluation, we filtered out event instances with other categories from the training data.

C Reproducibility
In this section, we present the reproducibility information of the paper. Table 7 shows the corresponding validation performance for all reported KBP 2016 and KBP 2017 results. In addition, Table 8 presents the average runtime for each approach and the number of parameters in each model.
                              Type-P   Type-R   Type-F1   B³      CEAF_e   MUC     BLANC   AVG-F
KBP 2016
Interact Model_BERT
E3C                           62.94    59.10    60.96     49.02   46.76    42.80   33.00   42.89
KBP 2017
Interact Model_BERT
E3C                           65.85    54.61    59.71     51.60   51.48    39.45   35.01   44.38
E3C w/o Type-Refined
E3C w/o Type-Guided
E3C_TwoStage
E3C w/o Proposal Loss
E3C_GloVe+Char
E3C_ELMo
E3C_BERT-BASE-Cased
E3C_BERT-LARGE-Uncased
E3C_BERT-LARGE-WWM-Uncased
E3C_GoldMention

Table 7: Corresponding validation performance for each reported KBP 2016/2017 result. Type Rule and E3C_GoldMention use oracle experimental setups to explore the upper-bound performance of E3C. Type Rule is a simple heuristic method that regards all event mentions of the same type as coreferential, and it directly uses the event detection results from E3C. E3C_GoldMention uses gold mentions instead of mentions proposed by the mention proposal layer, but the model still needs to predict the type of each mention.
                              Time for one epoch (s)   |Θ_update|
E3C                           82.76                    2,886,108
Interact Model_BERT
E3C w/o Type-Refined
E3C w/o Type-Guided
E3C_TwoStage
E3C w/o Proposal Loss
E3C_GloVe + Char
E3C_ELMo
E3C_BERT-BASE-Cased
E3C_BERT-LARGE-Uncased
E3C_BERT-LARGE-WWM-Uncased

Table 8: Average runtime for each approach and the number of parameters in each model. |Θ_update| indicates the number of updated parameters.