End-to-End Neural Event Coreference Resolution
Yaojie Lu, Hongyu Lin, Jialong Tang, Xianpei Han, Le Sun
Chinese Information Processing Laboratory, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
University of Chinese Academy of Sciences, Beijing, China
{yaojie2017, hongyu2016, jialong2019, xianpei, sunle}@iscas.ac.cn

Abstract
Traditional event coreference systems usually rely on a pipeline framework and hand-crafted features, which often face the error propagation problem and have poor generalization ability. In this paper, we propose an End-to-End Event Coreference approach – the E3C neural network, which can jointly model the event detection and event coreference resolution tasks, and learn to extract features from raw text automatically. Furthermore, because event mentions are highly diversified and event coreference is intricately governed by long-distance, semantic-dependent decisions, a type-guided event coreference mechanism is further proposed in our E3C neural network. Experiments show that our method achieves new state-of-the-art performance on two standard datasets.
Introduction

Event coreference resolution aims to identify which event mentions in a document refer to the same event (Ahn, 2006; Hovy et al., 2013). For example, the two event mentions in Figure 1, departing and leave, refer to the same EndPosition event of Nokia's CEO.

Traditional event coreference resolution methods usually rely on a series of upstream components (Lu and Ng, 2018), such as entity recognition and event detection. Such a pipeline framework, unfortunately, often suffers from the error propagation problem. For instance, the best event detection system in KBP 2017 only achieved 56 F1 (Jiang et al., 2017), which undoubtedly limits the performance of the follow-up event coreference task (35 Avg F1 on KBP 2017). Furthermore, most previous approaches use hand-crafted features (Chen et al., 2011; Lu and Ng, 2017a), which heavily depend on other NLP components (e.g., POS tagging, NER, syntactic parsing) and are thus hard to generalize to new languages/domains/datasets.
Huge Payday for Nokia's [departing]_EndPosition CEO ... Nokia's CEO prepares to [leave]_EndPosition the company and [rejoin]_StartPosition Microsoft ... It is an expensive [goodbye]_EndPosition for the executive, ...

Figure 1: An example of event coreference resolution, which contains two coreferential chains: an EndPosition event chain {departing, leave, goodbye} and a StartPosition chain {rejoin}.

In this paper, we propose an End-to-End Event Coreference method – the E3C neural network, which can predict event chains from a raw text in an end-to-end manner. For example, taking the raw text in Figure 1 as input, E3C will directly output two event coreference chains, {departing, leave, goodbye} and {rejoin}. By jointly modeling event detection and event coreference, the E3C neural network does not require any prior components, and the representations/pieces of evidence between different tasks and different decisions can be shared and reinforced. Besides, E3C is learned in an end-to-end manner, which can inherently resolve the error propagation problem.

End-to-end event coreference, however, is challenging due to the mention diversity and the long-distance coreference.
First, event mentions are highly diversified (Humphreys et al., 1997; Chen and Ji, 2009): they may be a variety of syntactic objects, including nouns, verbs, and even adjectives. For example, an EndPosition event can be triggered by departing (noun), leave (verb), goodbye (noun) and former (adj). By contrast, mentions in entity coreference are mostly noun phrases (Lu and Ng, 2018). Second, coreferential event mentions commonly appear across distant sentences, so event coreference is intricately governed by long-distance, semantic-dependent decisions (Choubey and Huang, 2018; Goyal et al., 2013; Peng et al., 2016). For example, in Figure 1, the closest antecedent of the mention goodbye – leave – is far from it. To resolve the coreference between these two distant, diverse event mentions, a system can only rely on their semantic meanings, i.e., they both describe the same EndPosition event (the departing of Nokia's CEO) but from different perspectives. By contrast, most entity mentions' closest antecedents are in the same or immediately preceding sentence (Choubey and Huang, 2018), and can therefore be resolved more easily using local and syntactic clues.

To resolve the mention diversity problem and the long-distance coreference problem, this paper further introduces a type-guided mechanism into our E3C neural network. This mechanism bridges distant, diverse event mentions by exploiting event type information in three ways: 1) a type-informed antecedent network, which enables E3C to capture more semantic information of event mentions by predicting coreferential scores and type scores simultaneously (in this paper, antecedents are coreferential mentions that appear earlier in the document); 2) a type-refined mention representation, which enhances the mention representation with type information, so that even lexically dissimilar mentions can be bridged together, such as the two diverse EndPosition mentions goodbye and departing; 3) a type-guided decoding algorithm, which can exploit global type consistency for more accurate event chains.

The main contributions of this paper are:

1. We propose an end-to-end neural network for event coreference resolution – the E3C neural network. E3C can jointly model event detection and event coreference, and learn to automatically extract features from raw text. To the best of our knowledge, this is the first end-to-end neural event coreference model that can achieve state-of-the-art performance.

2. We design a type-guided mechanism for event coreference, which can effectively resolve the mention diversity problem and the long-distance coreference problem in event coreference resolution.

3. We conduct experiments on two standard datasets, KBP 2016 and KBP 2017, which show that E3C achieves new state-of-the-art performance. Additional ablation experiments verify the effectiveness of the proposed type-guided mechanism.
Figure 2: The framework of our E3C neural network. Given a document, the mention proposal network proposes mention candidates (e.g., departing, leave, rejoin, goodbye), the type-informed antecedent network scores antecedent pairs (e.g., ⟨departing, leave⟩) and mention types (e.g., ⟨departing, EndPosition⟩), and type-guided decoding outputs the event chains [departing, leave, goodbye]_EndPosition and [rejoin]_StartPosition.

E3C: End-to-end Neural Event Coreference Resolution
Given a document $D = \{w_1, ..., w_n\}$, an end-to-end event coreference system needs to: 1) detect event mentions $\{m_1, ..., m_l\}$ (event detection); 2) predict all coreference chains $\{ev^*\}$ (event coreference resolution). For example, in Figure 1, the mentions are {departing, ..., goodbye} and two coreference chains will be predicted: {departing, leave, goodbye} and {rejoin}.

To this end, our E3C method first detects mention candidates via a mention proposal network, then identifies all mentions' antecedents via an antecedent prediction network. To resolve the mention diversity problem and the long-distance coreference problem, a type-guided event coreference mechanism is designed for E3C. Figure 2 shows the framework of our method. All components in E3C are differentiable and can be trained in an end-to-end manner. In the following, we describe them in detail.
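For concreteness, the sketch below illustrates the task interface on the running example of Figure 1. The data structures are illustrative, not the authors' actual format.

```python
# Illustrative input/output for the end-to-end task (running example from
# Figure 1): the input is a raw token sequence, the output typed event chains.
document = ("Huge Payday for Nokia's departing CEO ... Nokia's CEO prepares "
            "to leave the company and rejoin Microsoft ... It is an expensive "
            "goodbye for the executive ...").split()

# Expected output: typed event chains, each chain a list of trigger mentions.
expected_chains = [
    {"type": "EndPosition", "mentions": ["departing", "leave", "goodbye"]},
    {"type": "StartPosition", "mentions": ["rejoin"]},
]
```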
Mention Proposal Network

The mention proposal network detects all event mentions in a document, e.g., identifying {departing, ..., rejoin} as event mentions in Figure 1. Because event mentions are highly diversified expressions (e.g., goodbye, former and leave for EndPosition), we first capture the semantic information of all tokens via a contextualized representation layer, then identify mention candidates via a mention proposal layer. The details are as follows.
Contextualized Word Representation Layer.
To capture the semantic information for proposing event mentions, we learn a contextualized representation for each token. Concretely, we first obtain a task-independent representation for each token based on pre-trained BERT embeddings (Devlin et al., 2019). Following Tenney et al. (2019), a token $w_i$'s representation $h_i \in \mathbb{R}^d$ is pooled across different BERT layers using scalar mixing (Peters et al., 2018) as $h_i = \gamma \sum_{j=1}^{L} \alpha_j x_i^{(j)}$, where $x_i^{(j)}$ is the embedding of token $i$ from BERT layer $j$, $d$ is the size of the BERT embedding, $\alpha_j$ are softmax-normalized weights, and $\gamma$ is a scalar parameter.

Because event arguments can provide critical evidence (Bejan and Harabagiu, 2010; Lee et al., 2012; McConky et al., 2012; Cybulska and Vossen, 2013), we further obtain an event-specific token representation by distilling argument information from raw text implicitly. Specifically, we design a mask attention strategy (Dong et al., 2019). Given the task-independent token representations $H = \{h_1, h_2, ..., h_n\}$, our attention mechanism first models the relevance between tokens via a scaled dot-product attention (Vaswani et al., 2017) without linear projection, and then computes the final contextualized word representations $C = \{c_1, c_2, ..., c_n\}$ as:

$$C = \mathrm{softmax}\Big(\frac{HH^T}{\sqrt{d}} + M\Big)H, \qquad M_{ij} = \begin{cases} 0, & |i - j| < c \\ -\infty, & \text{otherwise} \end{cases} \qquad (1)$$

where $c$ is the size of the local window (this paper focuses on the local context since arguments empirically appear around event mentions: in the KBP 2017 training set, about 90% of arguments appear within this local window of the trigger; we set $c = 10$ in this paper), and $\sqrt{d}$ is the scaling factor.
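The following PyTorch sketch illustrates the two operations above: scalar mixing over BERT layers and the masked local attention of Eq. (1). Tensor shapes and function names are our own, and batching is omitted for clarity.

```python
import torch
import torch.nn.functional as F

def scalar_mix(layer_embeddings, alpha_logits, gamma):
    """h_i = gamma * sum_j softmax(alpha)_j * x_i^(j), pooled over BERT layers."""
    # layer_embeddings: [L, n, d]; alpha_logits: [L]; gamma: scalar parameter.
    alpha = F.softmax(alpha_logits, dim=0)       # softmax-normalized layer weights
    return gamma * torch.einsum("l,lnd->nd", alpha, layer_embeddings)

def masked_local_attention(H, window=10):
    """Eq. (1): C = softmax(HH^T / sqrt(d) + M) H with a +/-window local mask."""
    n, d = H.shape
    scores = H @ H.T / d ** 0.5                  # scaled dot-product, no projections
    idx = torch.arange(n)
    M = torch.where((idx[None, :] - idx[:, None]).abs() < window,
                    torch.zeros(n, n), torch.full((n, n), float("-inf")))
    return F.softmax(scores + M, dim=-1) @ H     # contextualized representations C
```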
Mention Proposal Layer. Given the token representations, the mention proposal layer assigns a mention score $s_m(i)$ to each span, which indicates the likelihood of span $i$ being an event mention. For example, in Figure 1 the mention proposal layer will assign spans {departing, leave, rejoin, ...} high $s_m(i)$ scores because they are highly likely to be event mentions, and assign spans {prepares to, company, ...} low $s_m(i)$ scores because they are unlikely to be event mentions.

Given all spans within a restricted length in a document, the mention proposal layer represents each span $i$ as $g_i = \hat{c}_i$, where $\hat{c}_i$ is the soft head attention-based aggregation of all token representations in span $i$ (Lee et al., 2016). (This paper restricts the span length to 1, which covers 96.6% of mentions in the KBP 2017 training set; in this case, the attended span representation $g_i$ is equivalent to $c_i$.) Given $g_i$, the mention score $s_m(i)$ is computed via a standard feed-forward neural network:

$$s_m(i) = \mathrm{FFNN}_m(g_i) \qquad (2)$$

Finally, we rank all spans according to their mention scores (Lee et al., 2017), and only retain the top-$l$ mentions $\{m_1, m_2, ..., m_l\}$ as event mention candidates for computational efficiency (in this paper, $l$ is set to a fixed fraction of the document length).
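A minimal sketch of the mention proposal layer, assuming precomputed span representations; the FFNN depth here is illustrative (Table 6 lists the actual hyper-parameters).

```python
import torch
import torch.nn as nn

class MentionProposal(nn.Module):
    """Score each span with FFNN_m (Eq. (2)) and keep the top-l candidates."""
    def __init__(self, d, hidden=150):
        super().__init__()
        self.ffnn_m = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))

    def forward(self, span_reprs, top_l):
        # span_reprs: [num_spans, d]; with span length 1, g_i equals c_i.
        s_m = self.ffnn_m(span_reprs).squeeze(-1)          # mention scores s_m(i)
        keep = torch.topk(s_m, k=min(top_l, s_m.numel())).indices
        return keep, s_m                                   # top-l candidates + scores
```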
Type-informed Antecedent Network

Given an event mention, the type-informed antecedent network predicts its antecedents, and the antecedent predictions can be used as local pair-wise coreference decisions. For example, our method will predict the antecedent of leave as departing in Figure 1, and ⟨departing, leave⟩ can be used as a pair-wise coreference decision.

For each mention $m_i$ in $\{m_1, ..., m_l\}$, the type-informed antecedent network produces two kinds of scores simultaneously: 1) $s(i, j)$ – the score for mention $m_j$ being an antecedent of $m_i$, where $m_j$ must appear before $m_i$ in the document; 2) $s(i, t_k)$ – the score for mention $m_i$'s type being $t_k$.

Antecedent Score. Given a mention $m_i$, the antecedent network computes an antecedent score $s(i, j)$ for each mention pair $\langle m_i, m_j \rangle$:

$$s(i, j) = s_m(i) + s_m(j) + s_a(i, j) \qquad (3)$$

where $j < i$, and $s_m(i)$ and $s_m(j)$ are the mention scores from the mention proposal layer. $s_a(i, j)$ measures the semantic similarity between $m_i$ and $m_j$, computed via a standard feed-forward neural network:

$$s_a(i, j) = \mathrm{FFNN}_a([g_i, g_j, g_i \circ g_j, \Phi(i, j)]) \qquad (4)$$

where $g_i \circ g_j$ is the element-wise similarity of the mention pair $\langle m_i, m_j \rangle$, and $\Phi(i, j)$ is the distance encoding between the two mentions.
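A sketch of the pairwise scorer of Eqs. (3)-(4); the bucketed distance embedding is our simplification of the distance encoding $\Phi(i, j)$, whose exact form the paper does not spell out.

```python
import torch
import torch.nn as nn

class AntecedentScorer(nn.Module):
    """s(i, j) = s_m(i) + s_m(j) + s_a(i, j), with s_a from Eq. (4)."""
    def __init__(self, d, hidden=150, num_buckets=10):
        super().__init__()
        self.dist_emb = nn.Embedding(num_buckets, d)       # simplified Phi(i, j)
        self.ffnn_a = nn.Sequential(nn.Linear(4 * d, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))

    def forward(self, g_i, g_j, s_m_i, s_m_j, dist_bucket):
        phi = self.dist_emb(dist_bucket)
        # g_i * g_j is the element-wise similarity term g_i o g_j of Eq. (4).
        pair = torch.cat([g_i, g_j, g_i * g_j, phi], dim=-1)
        s_a = self.ffnn_a(pair).squeeze(-1)
        return s_m_i + s_m_j + s_a                         # antecedent score s(i, j)
```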
Event Type Score. As described in §1, event coreference is intricately governed by long-distance, semantic-dependent decisions. To address this issue, this paper exploits event type information for better event coreference resolution. Specifically, besides antecedent prediction for each mention, we further predict its event type so that: 1) the neural network will be guided to capture more semantic information about event mentions (Durrett and Klein, 2014); 2) the type information ensures global type consistency during coreference resolution, i.e., mentions in the same coreference chain will have the same event type.

Specifically, we first embed all event types $T = \{t_1, ..., t_T\}$ via a hierarchical embedding algorithm. The embedding of $t_k$ is $g_{t_k} = W_e \cdot [e_{event}, e_{type}(t_k)]$, where $e_{event}$ is shared by all event types, $e_{type}(t_k)$ is the embedding of $t_k$, and $W_e$ is a mapping matrix. The dimension of $g_{t_k}$ is the same as that of the mention embedding $g_i$. Then the type scores $s(i, t_k)$ are computed via the same scoring functions as antecedent prediction:

$$s_m(t_k) = \mathrm{FFNN}_m(g_{t_k})$$
$$s_a(i, t_k) = \mathrm{FFNN}_a([g_i, g_{t_k}, g_i \circ g_{t_k}, \Phi(i, t_k)])$$
$$s(i, t_k) = s_m(i) + s_m(t_k) + s_a(i, t_k) \qquad (5)$$

where the distance $\Phi(i, t_k)$ is zero in the type score computation.

For non-mention spans, we add a dummy antecedent $\varepsilon$ and assign the antecedents of all non-mention spans to $\varepsilon$, e.g., company and prepares to in Figure 1. We fix the score $s(i, \varepsilon)$ to 0, and identify a span $i$ as a non-mention span if all its antecedent scores $s(i, j) \leq 0$ and all its type scores $s(i, t_k) \leq 0$.

In this way, we obtain the antecedent scores and the type scores for each mention via our type-informed antecedent network.
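The type scores of Eq. (5) reuse the mention and antecedent scorers; a sketch, assuming the same FFNN modules as in the previous snippets and a zero distance encoding for mention-type pairs.

```python
import torch

def type_scores(g_i, s_m_i, type_embeddings, ffnn_m, ffnn_a, zero_phi):
    """Eq. (5): score mention i against every event type t_k.

    type_embeddings: [T, d] hierarchical type embeddings g_{t_k};
    zero_phi: [d] zero distance encoding used for mention-type pairs.
    """
    T = type_embeddings.size(0)
    g = g_i.expand(T, -1)                            # broadcast mention repr over types
    s_m_t = ffnn_m(type_embeddings).squeeze(-1)      # s_m(t_k), same scorer as mentions
    pair = torch.cat([g, type_embeddings, g * type_embeddings,
                      zero_phi.expand(T, -1)], dim=-1)
    s_a_t = ffnn_a(pair).squeeze(-1)                 # s_a(i, t_k)
    return s_m_i + s_m_t + s_a_t                     # s(i, t_k) for all types
```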
Type-refined Mention Representation

We now describe how to further refine a mention's representation using its type information, so that it can capture more semantic information for event coreference resolution. For example, although goodbye and departing are lexically dissimilar, we can still capture their semantic similarity by further encoding their event type information, i.e., both of them have the same event type – EndPosition.

To refine the mention representation, we first define a probability distribution $Q(t_k)$ over all event types $T \cup \{\varepsilon\}$ for each mention span $m_i$:

$$Q(t_k) = \frac{e^{s(i, t_k)}}{\sum_{t'_k \in T \cup \{\varepsilon\}} e^{s(i, t'_k)}} \qquad (6)$$

where $s(i, t_k)$ is the type score. We then obtain an expected event type representation $\tilde{g}_i$ for each span $m_i$ using the type distribution $Q(t_k)$:

$$\tilde{g}_i = \sum_{t'_k \in T} Q(t_k = t'_k) \cdot g_{t'_k} + Q(t_k = \varepsilon) \cdot g_i \qquad (7)$$

Then, we obtain a refined span representation $g'_i$ by combining the expected event type representation $\tilde{g}_i$ and the original span representation $g_i$ via a learnable adaptive gate $f_i$:

$$g'_i = f_i \circ g_i + (1 - f_i) \circ \tilde{g}_i, \qquad f_i = \sigma(W_f \cdot [g_i, \tilde{g}_i]) \qquad (8)$$

where $W_f$ is a weight matrix. Finally, the antecedent network recomputes the coreferential antecedent score $s'(i, j)$ and the event type score $s'(i, t_k)$ using the refined span representation $g'_i$.

Type-guided Decoding

The type-informed antecedent network produces pairwise coreference scores for mention pairs. To form coreference chains, a naive approach is to directly connect all mentions via their highest-scored antecedents. Unfortunately, such a greedy decoding algorithm only considers local pair-wise consistency, and its results may not be globally optimal; e.g., a coreference chain may contain mentions with different event types.

To address this issue, we propose a decoding algorithm that ensures the global consistency of a coreference chain through a type-guided mechanism. For example, to resolve the chain {departing, leave} in Figure 1, E3C considers both the antecedent score of ⟨departing, leave⟩ and the type consistency that both departing and leave are EndPosition mentions.

Concretely, given the mentions $\{m_1, ..., m_l\}$ in a document $D$, E3C constructs the event coreference chains by sequentially identifying the best antecedent of each mention while further considering type consistency. For each mention $m_i$, we first find the mention $a_i$ that has the maximum coreferential score with $m_i$, where $m_j$ appears before $m_i$:

$$a_i = \arg\max_{m_j, j < i} s(i, j) \qquad (9)$$

Let $t_i$ be $m_i$'s highest-scoring type. When $s(i, a_i) > s(i, t_i)$, E3C considers $m_i$ and $a_i$ as type consistent and links mention $m_i$ to $a_i$; otherwise, when $s(i, a_i) \leq s(i, t_i)$, E3C considers $m_i$ and $a_i$ as type inconsistent and starts a new event chain for $m_i$ with its type $t_i$.
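A sketch of the type-guided decoding loop just described; the score containers are plain Python structures for readability, not the model's tensors.

```python
def type_guided_decode(mentions, s_pair, s_type):
    """Link each mention to its best antecedent unless the best type score wins.

    s_pair[i][j]: antecedent score s(i, j) for j < i;
    s_type[i]: dict mapping event type -> s(i, t_k). Mentions in document order.
    """
    chains, chain_of = [], {}
    for i in range(len(mentions)):
        t_i, s_t = max(s_type[i].items(), key=lambda kv: kv[1])  # best type for m_i
        best_j, s_a = None, float("-inf")
        for j in range(i):                                       # Eq. (9): best earlier antecedent
            if s_pair[i][j] > s_a:
                best_j, s_a = j, s_pair[i][j]
        if best_j is not None and s_a > s_t:                     # type consistent: link to a_i
            chain_of[i] = chain_of[best_j]
            chains[chain_of[i]]["mentions"].append(mentions[i])
        else:                                                    # type inconsistent: new chain
            chain_of[i] = len(chains)
            chains.append({"type": t_i, "mentions": [mentions[i]]})
    return chains
```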
Model Learning

This section describes how to learn the E3C neural network in an end-to-end manner. Given a training corpus $D = \{D_1, ..., D_N\}$, where each instance $D_i$ is a document with its event mention, mention type, and coreference annotations, our objective function contains two parts: $L_{antecedent}(\Theta)$ – the antecedent loss, and $L_{proposal}(\Theta)$ – the mention proposal loss:

$$L(\Theta) = L_{antecedent}(\Theta) + \lambda L_{proposal}(\Theta) \qquad (10)$$

where $\lambda$ is the coefficient of the mention proposal loss (we set $\lambda = 1$ in this paper). This paper optimizes $\Theta$ by maximizing $L(\Theta)$ via Adamax (Kingma and Ba, 2015). The two losses are as follows.
Antecedent Loss. It measures whether a mention links to its correct antecedents. For each mention $m_i$, this paper identifies its gold antecedent set GOLD(i) as shown in Figure 3:

1) For the first mention of an event chain, the gold antecedent is its event type. For example, the gold antecedent set of departing is {EndPosition}.

2) For the remaining mentions in a chain, the gold antecedents are all of their coreferential antecedents. For example, the gold antecedent set of goodbye is {departing, leave}.

3) For non-mention spans, the gold antecedent is the dummy antecedent $\varepsilon$. For example, the gold antecedent set of company is $\{\varepsilon\}$.

Given GOLD(i) for each $m_i$ in the top-$l$ mention set of document $D$, the antecedent loss is a marginal log-likelihood function:

$$L_{antecedent}(\Theta) = \log \prod_{i=1}^{l} \sum_{\hat{y} \in \mathcal{Y}(i) \cap \mathrm{GOLD}(i)} P(\hat{y} \mid D)$$
$$P(y_i \mid D) = \frac{\exp(s(i, y_i))}{\sum_{y' \in \mathcal{Y}(i)} \exp(s(i, y'))} \qquad (11)$$

where $\mathcal{Y}(i)$ is the valid antecedent set for $m_i$.
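Eq. (11) can be implemented as a masked log-sum-exp over each mention's candidate scores; a sketch, assuming dense score and gold-mask tensors (hypothetical shapes).

```python
import torch

def antecedent_loss(scores, gold_mask):
    """Marginal log-likelihood over gold antecedents (Eq. (11)).

    scores: [l, A] scores s(i, y) over the valid candidate set Y(i), which
            includes earlier mentions, event types, and the dummy antecedent;
    gold_mask: [l, A] boolean, True where candidate y is in GOLD(i).
    """
    log_probs = torch.log_softmax(scores, dim=-1)         # log P(y | D)
    gold = log_probs.masked_fill(~gold_mask, float("-inf"))
    marginal = torch.logsumexp(gold, dim=-1)              # log sum over gold candidates
    return marginal.sum()                                 # maximized during training
```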
Mention Proposal Loss. It measures whether our model can accurately identify event mentions. Specifically, the mention proposal loss is the binary cross-entropy of the mention proposal network:

$$L_{proposal}(\Theta) = \sum_{i=1}^{n} y_i \log \sigma(s_m(i)) + (1 - y_i) \log(1 - \sigma(s_m(i))) \qquad (12)$$

where $\sigma$ is the sigmoid function, and $y_i = 1$ indicates that span $i$ is an event mention, otherwise $y_i = 0$.
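A sketch of Eq. (12); note that the standard PyTorch helper applies the sigmoid internally and returns the negated (loss) form of the log-likelihood written above.

```python
import torch.nn.functional as F

def proposal_loss(s_m, y):
    """Binary cross-entropy over span mention scores (Eq. (12)).

    s_m: [n] raw mention scores; y: [n] with 1.0 for gold event mentions else 0.0.
    """
    # Returns -sum_i [ y_i log sigma(s_m(i)) + (1 - y_i) log(1 - sigma(s_m(i))) ].
    return F.binary_cross_entropy_with_logits(s_m, y, reduction="sum")
```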
Annotated event chains:
EndPosition – departing – leave – goodbye
StartPosition – rejoin
ε – company

Gold antecedent sets:
GOLD(departing) = {EndPosition}
GOLD(rejoin) = {StartPosition}
GOLD(leave) = {departing}
GOLD(goodbye) = {departing, leave}
GOLD(company) = {ε}

Figure 3: An illustration of gold antecedent sets.
Experiments

Datasets. Following previous studies (Lu and Ng, 2016b, 2017a; Jiang et al., 2017; Huang et al., 2019), we use the KBP 2016 and KBP 2017 English datasets for evaluation:

KBP 2016. For KBP 2016, we use the same setup as Lu and Ng (2017a), i.e., 509 documents for training, 139 documents for parameter tuning, and the official KBP 2016 eval set for evaluation.
KBP 2017.
Following Huang et al. (2019), we use the English portion of the KBP 2015 and 2016 datasets for training, and the KBP 2017 dataset for evaluation. We sample 50 documents from the 2016 evaluation dataset as the validation set.
Baselines. We compare E3C with the following baselines:

Multi-Pass Sieve (Lu and Ng, 2016a) is an iterative pipeline-based method that uses both hand-crafted rules and automatic classifiers. We compare with two such systems: the Top 1 system in TAC 2016 (Lu and Ng, 2016b) and the Top 1 system in TAC 2017 (Jiang et al., 2017), both of which use an additional ensemble strategy for better event detection performance.

(There are also other public event coreference datasets: OntoNotes (Pradhan et al., 2007), ECB+ (Bejan and Harabagiu, 2008; Cybulska and Vossen, 2014), and ACE (LDC, 2005). OntoNotes and ECB+ are not annotated with event type information and are therefore not appropriate for evaluating our end-to-end event coreference model. The ACE dataset has a strict notion of event identity (Song et al., 2015; Lu and Ng, 2017a), which considers two event mentions coreferential if and only if "they had the same agent(s), patient(s), time, and location"; because E3C does not perform argument extraction for event coreference, ACE is not used in this paper. For fair comparison, we choose the KBP datasets (Ellis et al., 2015, 2016; Getman et al., 2017) so that different systems can be compared in the same settings. Also, different from the official type-constraint settings in KBP 2016 and KBP 2017, Choubey and Huang (2018) used relaxed constraints without considering event mention type, so we exclude their system for fair comparison.)
KBP 2016                                Type-F1   B³      CEAF_e   MUC     BLANC   AVG-F
Top 1 in TAC 2016 (Lu and Ng, 2016b)    46.99     37.49   34.21    26.37   22.25   30.08
Mention Ranking (Lu and Ng, 2017b)      46.99     38.64   36.16    26.30   23.59   31.17
Joint Model (Lu and Ng, 2017a)          49.30     40.90   39.00    27.41   25.00   33.08
Interact Model_BERT
E3C (this paper)

KBP 2017                                Type-F1   B³      CEAF_e   MUC     BLANC   AVG-F
Top 1 in TAC 2017 (Jiang et al., 2017)  56.19     43.84   39.86    30.63   26.97   35.33
Interact Model (Huang et al., 2019)     -         42.84   39.01    31.12   24.99   34.49
+ Transfer (Huang et al., 2019)         -         43.20   40.02    35.66
Interact Model_BERT
E3C (this paper)

Table 1: Overall performance on the KBP 2016 and KBP 2017 datasets; the baseline results are adapted from their original papers.
Mention Ranking (Lu and Ng, 2017b) ranks the candidate antecedents of all event mentions and selects the top-ranked antecedent for each mention.

Joint Model (Lu and Ng, 2017a) is a hand-crafted feature-based system that addresses the error propagation problem by jointly learning the event trigger detection, event coreference resolution, and event anaphoricity prediction tasks.

Interact Model (Huang et al., 2019) is the state-of-the-art pair-wise method, which decides whether two mentions are coreferential using an interactive binary classifier, and then links coreferential mentions to produce the final event chains. We also compare with an enhanced model, Interact Model + Transfer, which transfers argument compatibility features from external unlabeled data. In addition, we reimplement the Interact Model using BERT as its feature extractor – Interact Model_BERT – so that E3C and the Interact Model can be compared with the same feature extractor.
Evaluation. We use the standard evaluation metrics in the KBP evaluation, and compute them using the official evaluation toolkit (https://github.com/hunterhector/EvmEval). We use four measures: MUC (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998), CEAF_e (Luo, 2005), and BLANC (Recasens and Hovy, 2011). Following previous studies (Lu and Ng, 2017a; Huang et al., 2019), the primary metric AVG-F is the unweighted average of the above four F-scores. We also report event detection performance using the typed F1-score (Type-F1).

Overall Results. Table 1 shows the overall performance on KBP 2016 and KBP 2017. We can see that:

1. The E3C neural network achieves state-of-the-art performance on both datasets.
Compared with all baselines, E3C gains at least 11.6% and 6.9% AVG-F improvements on KBP 2016 and KBP 2017, respectively. This verifies the effectiveness of the end-to-end framework and the type-guided event coreference mechanism.

2. By jointly modeling all tasks together and learning all components in an end-to-end manner, the E3C neural network significantly outperforms pipeline baselines. Compared with Interact Model_BERT, which uses the same BERT-based feature extractor, E3C still gains 3.2% Type-F1 and 6.9% AVG-F improvements on KBP 2017. This verifies the effectiveness of end-to-end training in reducing the error propagation problem. Besides, by modeling all tasks together, representations and pieces of evidence can be shared and reinforced between different decisions and tasks.
Detailed Analysis

In this section, we analyze the effects of the type-guided mechanism, end-to-end learning, and pre-trained models.
Effect of Type-Guided Mechanism. To investigate the effect of the type-guided mechanism in E3C, we conduct ablation experiments by ablating the type-refined representation (-Type-Refined) and by replacing type-guided decoding with naive best-antecedent decoding (-Type-Guided). Type Rule is a simple heuristic method that regards all event mentions of the same type as coreferential. The results are shown in Table 2. We can see that:

1) Type-guided decoding is effective for event coreference. By considering both type consistency and antecedent scores, E3C obtains an 8.1% (3.05) AVG-F improvement over naive decoding.

2) The type-refined representation helps resolve the mention diversity problem and the long-distance coreference problem. By incorporating type information into the mention representation, E3C obtains a 3.1% (1.24) AVG-F improvement.
Effect of End-to-end Learning. To investigate the effect of end-to-end learning, we conduct experiments on three variations of E3C: E3C_TwoStage, which models event mention detection and coreferential antecedent prediction as two independent models that share span embeddings; E3C_w/o Proposal Loss, which removes the mention proposal loss; and E3C_GoldMention, which uses gold mentions for coreference resolution and type scoring, but still needs to predict the type of each mention. Table 3 shows the performance of the three systems. We can find that:

1) The one-pass paradigm of E3C can effectively share and reinforce the decisions between the two tasks. Compared with E3C_TwoStage, which has comparable event detection performance (Type-F1), E3C gains 5.2% AVG-F on the downstream event coreference task.

2) Incorporating the mention proposal loss can significantly enhance mention detection performance. By removing the mention proposal loss, E3C loses 2.3% and 4.2% on Type-F1 and AVG-F, respectively. Additionally, the coreference performance can be further significantly improved if gold mentions are used – from 40.85 of E3C to 53.72 of E3C_GoldMention. This shows that event detection is still a bottleneck for event coreference.
Effect of Pre-trained Models. Pre-trained models are important for neural network-based methods. To investigate their effect on E3C, Table 4 shows the performance of E3C using ELMo (Peters et al., 2018), BERT_BASE-Cased, BERT_LARGE-Uncased, and BERT_LARGE-WWM-Uncased (Devlin et al., 2019), as well as GloVe (Pennington et al., 2014) 300-dimensional word embeddings plus character embeddings, where the contextual layer is a BiLSTM. We can find that:

1) Due to the diversity of event mentions, pre-trained contextualized embeddings are critical for mention representation. All contextualized embeddings outperform GloVe by a large margin in both event detection and event coreference.

2) E3C can be further improved by employing better pre-trained contextual embeddings. Compared with the BERT_BASE-Uncased model used in this paper, E3C equipped with BERT_LARGE-WWM-Uncased gains further improvements.
                  AVG-F   Δ
E3C               40.85
- Type-Refined    39.61   -1.24
- Type-Guided     37.80   -3.05
Type Rule         31.68   -9.17

Table 2: Ablation results of the type-guided mechanism on KBP 2017.
                        Type-F1   AVG-F
E3C                     58.33     40.85
E3C_TwoStage
E3C_w/o Proposal Loss
E3C_GoldMention                   53.72

Table 3: Performance of different E3C settings on KBP 2017.
E3C                               Type-F1   AVG-F
BERT_BASE-Uncased (this paper)    58.33     40.85
GloVe + Char + BiLSTM             52.45     36.43
ELMo                              55.24     37.27
BERT_BASE-Cased
BERT_LARGE-Uncased
BERT_LARGE-WWM-Uncased

Table 4: Performance using different pre-trained models for E3C on KBP 2017.
Discussion

Event Detection. From the above experiments, we find that one main bottleneck of event coreference is event detection. As shown in Table 3, using gold mentions results in a massive improvement in AVG-F, from 40.85 to 53.72. Besides, even if we fix all coreference link errors in the predicted event detection results, the growth of AVG-F is still limited, from 40.85 to 42.80. Event detection is challenging because: 1) event mentions are diversified and ambiguous, and detecting them requires a deep understanding of contexts; 2) some event mentions are multi-tagged, i.e., one span triggers multiple events. Because this paper does not consider this issue, it misses some mentions.
Domain Adaptation. We find that domain adaptation is another challenge for event coreference. Table 5 shows the results of our E3C model on different genres of the KBP 2017 evaluation dataset: 83 newswire documents (NW) and 84 discussion forum threads (DF).

       Type-F1   AVG-F
NW     59.27     42.39
DF     57.38     39.28

Table 5: Results on subsets of different genres in KBP 2017. NW indicates newswire documents, while DF indicates discussion forum threads.

There is a significant performance gap between the two genres, probably because: 1) different from formal NW documents, DF threads are often informal and lack coherent discourse structures (Choubey and Huang, 2018); 2) event chains in a discussion forum thread are relevant not only to the contents, but also to speaker information and the discussion topic. Solving this problem requires a deep understanding of dialogue contexts.
Argument Modeling. In this paper, we exploited argument information implicitly via a mask attention strategy, without explicitly extracting argument roles. However, we believe event coreference can be further enhanced by modeling argument information more effectively: 1) incorporating explicit argument information can effectively capture the semantic information of events for better feature representation (Peng et al., 2016; Choubey and Huang, 2017); 2) the coreference/compatibility of arguments is crucial for deciding coreference relations between events (Lee et al., 2012; Huang et al., 2019). Unfortunately, traditional argument-based pipeline event coreference methods (Chen and Ng, 2014; Yang et al., 2015) suffer from the error propagation problem of upstream components, e.g., argument extraction and entity coreference. Denoising feature composition algorithms or joint modeling of entity/event coreference may effectively solve the argument error propagation problem.
Related Work

Event Coreference. Event coreference aims to cluster textual mentions of the same event. Different from cross-document event coreference works (Yang et al., 2015; Zhang et al., 2015; Choubey and Huang, 2017; Kenyon-Dean et al., 2018; Barhom et al., 2019), this paper focuses on the within-document event coreference task.

Traditional approaches (Chen and Ji, 2009; Chen and Ng, 2014; Liu et al., 2014) are mostly pipeline-based systems that depend on several upstream components and thus often suffer from the error propagation problem. To address this problem, many joint models have been proposed, e.g., joint inference (Chen and Ng, 2016; Lu et al., 2016) and joint modeling (Araki and Mitamura, 2015; Lu and Ng, 2017a). Furthermore, the above methods use hand-crafted features, which are hard to generalize to new languages/domains/datasets. Several neural network models (Krause et al., 2016; Chao et al., 2019) and transfer techniques (Huang et al., 2019) have been proposed to complement these methods with automatic feature learning abilities.

Compared to previous approaches, E3C is the first fully end-to-end neural event coreference resolution approach. It can extract features, detect event mentions, and resolve event chains in the same network.
End-to-end Entity Coreference. Recently, end-to-end neural networks (Lee et al., 2017, 2018; Kantor and Globerson, 2019; Fei et al., 2019; Joshi et al., 2019) have achieved significant progress in entity coreference. These methods also motivate this study. Due to the mention diversity and long-distance coreference problems, event coreference is usually considered more challenging than entity coreference (Lu and Ng, 2018; Choubey and Huang, 2018). This paper proposes a type-guided mechanism, which can resolve the above challenges by incorporating type information, learning semantic event mention representations, and modeling long-distance, semantic-dependent evidence.
Conclusions

This paper proposes a state-of-the-art, end-to-end neural network for event coreference resolution – the E3C neural network, which jointly models event detection and event coreference, and learns to extract features from the raw text directly. A type-guided mechanism is further proposed for resolving the mention diversity problem and the long-distance coreference problem, which: 1) informs coreference prediction with type scoring, 2) refines mention representations using type information, and 3) guides decoding under type consistency. Experiments show that our method achieves state-of-the-art performance on KBP 2016 and KBP 2017. For future work, we will focus on the bottlenecks of event coreference, e.g., event detection and argument modeling.

References
David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, pages 1–8, Sydney, Australia. Association for Computational Linguistics.

Jun Araki and Teruko Mitamura. 2015. Joint event trigger identification and event coreference resolution with structured perceptron. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2074–2080, Lisbon, Portugal. Association for Computational Linguistics.

Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, volume 1, pages 563–566, Granada.

Shany Barhom, Vered Shwartz, Alon Eirew, Michael Bugert, Nils Reimers, and Ido Dagan. 2019. Revisiting joint modeling of cross-document entity and event coreference resolution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4179–4189, Florence, Italy. Association for Computational Linguistics.

Cosmin Bejan and Sanda Harabagiu. 2008. A linguistic resource for discovering event structures and resolving event coreference. In Proceedings of the Sixth International Conference on Language Resources and Evaluation, Marrakech, Morocco.

Cosmin Bejan and Sanda Harabagiu. 2010. Unsupervised event coreference resolution with rich linguistic features. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1412–1422, Uppsala, Sweden. Association for Computational Linguistics.

Wenhan Chao, Ping Wei, Zhunchen Luo, Xiao Liu, and Guobin Sui. 2019. Selective expression for event coreference resolution on Twitter, pages 1–8. IEEE.

Bin Chen, Jian Su, Sinno Jialin Pan, and Chew Lim Tan. 2011. A unified event coreference resolution by integrating multiple resolvers. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 102–110, Chiang Mai, Thailand.

Chen Chen and Vincent Ng. 2014. SinoCoreferencer: An end-to-end Chinese event coreference resolver. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 4532–4538, Reykjavik, Iceland.

Chen Chen and Vincent Ng. 2016. Joint inference over a lightly supervised information extraction pipeline: Towards event coreference resolution for resource-scarce languages. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2913–2920. AAAI Press.

Zheng Chen and Heng Ji. 2009. Graph-based event coreference resolution. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, pages 54–57, Suntec, Singapore. Association for Computational Linguistics.

Prafulla Kumar Choubey and Ruihong Huang. 2017. Event coreference resolution by iteratively unfolding inter-dependencies among events. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2124–2133, Copenhagen, Denmark. Association for Computational Linguistics.

Prafulla Kumar Choubey and Ruihong Huang. 2018. Improving event coreference resolution by modeling correlations between event coreference chains and document topic structures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 485–495, Melbourne, Australia. Association for Computational Linguistics.

Agata Cybulska and Piek Vossen. 2013. Semantic relations between events and their time, locations and participants for event coreference resolution. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pages 156–163, Hissar, Bulgaria.

Agata Cybulska and Piek Vossen. 2014. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 4545–4552, Reykjavik, Iceland.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054.

Greg Durrett and Dan Klein. 2014. A joint model for entity analysis: Coreference, typing, and linking. Transactions of the Association for Computational Linguistics, 2:477–490.

Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies, and Stephanie Strassel. 2015. Overview of linguistic resources for the TAC KBP 2015 evaluations: Methodologies and results. In TAC 2015.

Joe Ellis, Jeremy Getman, Neil Kuster, Zhiyi Song, Ann Bies, and Stephanie Strassel. 2016. Overview of linguistic resources for the TAC KBP 2016 evaluations: Methodologies and results. In TAC 2016.

Hongliang Fei, Xu Li, Dingcheng Li, and Ping Li. 2019. End-to-end deep reinforcement learning based coreference resolution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 660–665, Florence, Italy. Association for Computational Linguistics.

Jeremy Getman, Joe Ellis, Zhiyi Song, Jennifer Tracey, and Stephanie Strassel. 2017. Overview of linguistic resources for the TAC KBP 2017 evaluations: Methodologies and results. In TAC 2017.

Kartik Goyal, Sujay Kumar Jauhar, Huiying Li, Mrinmaya Sachan, Shashank Srivastava, and Eduard Hovy. 2013. A structured distributional semantic model for event co-reference. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 467–473, Sofia, Bulgaria. Association for Computational Linguistics.

Eduard Hovy, Teruko Mitamura, Felisa Verdejo, Jun Araki, and Andrew Philpot. 2013. Events are not simple: Identity, non-identity, and quasi-identity. In Workshop on Events: Definition, Detection, Coreference, and Representation, pages 21–28, Atlanta, Georgia. Association for Computational Linguistics.

Yin Jou Huang, Jing Lu, Sadao Kurohashi, and Vincent Ng. 2019. Improving event coreference resolution by learning argument compatibility from unlabeled data. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 785–795, Minneapolis, Minnesota. Association for Computational Linguistics.

Kevin Humphreys, Robert Gaizauskas, and Saliha Azzam. 1997. Event coreference for information extraction. In Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts.

Shanshan Jiang, Yihan Li, Tianyi Qin, Qian Meng, and Bin Dong. 2017. SRCB entity discovery and linking (EDL) and event nugget systems for TAC 2017. In TAC 2017.

Mandar Joshi, Omer Levy, Luke Zettlemoyer, and Daniel Weld. 2019. BERT for coreference resolution: Baselines and analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 5802–5807, Hong Kong, China. Association for Computational Linguistics.

Ben Kantor and Amir Globerson. 2019. Coreference resolution with entity equalization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 673–677, Florence, Italy. Association for Computational Linguistics.

Kian Kenyon-Dean, Jackie Chi Kit Cheung, and Doina Precup. 2018. Resolving event coreference with supervised representation learning and clustering-oriented regularization. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 1–10, New Orleans, Louisiana. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.

Sebastian Krause, Feiyu Xu, Hans Uszkoreit, and Dirk Weissenborn. 2016. Event linking with sentential features from convolutional neural networks. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 239–249, Berlin, Germany. Association for Computational Linguistics.

LDC. 2005. ACE (Automatic Content Extraction) English annotation guidelines for events. Technical report, Linguistic Data Consortium.

Heeyoung Lee, Marta Recasens, Angel Chang, Mihai Surdeanu, and Dan Jurafsky. 2012. Joint entity and event coreference resolution across documents. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 489–500, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 188–197, Copenhagen, Denmark. Association for Computational Linguistics.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 687–692, New Orleans, Louisiana. Association for Computational Linguistics.

Kenton Lee, Shimi Salant, Tom Kwiatkowski, Ankur Parikh, Dipanjan Das, and Jonathan Berant. 2016. Learning recurrent span representations for extractive question answering. arXiv preprint arXiv:1611.01436.

Zhengzhong Liu, Jun Araki, Eduard Hovy, and Teruko Mitamura. 2014. Supervised within-document event coreference using information propagation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 4539–4544, Reykjavik, Iceland.

Jing Lu and Vincent Ng. 2016a. Event coreference resolution with multi-pass sieves. In Proceedings of the Tenth International Conference on Language Resources and Evaluation, pages 3996–4003, Portorož, Slovenia.

Jing Lu and Vincent Ng. 2016b. UTD's event nugget detection and coreference system at KBP 2016. In TAC 2016.

Jing Lu and Vincent Ng. 2017a. Joint learning for event coreference resolution. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 90–101, Vancouver, Canada. Association for Computational Linguistics.

Jing Lu and Vincent Ng. 2017b. Learning antecedent structures for event coreference resolution, pages 113–118.

Jing Lu and Vincent Ng. 2018. Event coreference resolution: A survey of two decades of research. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 5479–5486.

Jing Lu, Deepak Venugopal, Vibhav Gogate, and Vincent Ng. 2016. Joint inference for event coreference resolution. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3264–3275, Osaka, Japan.

Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 25–32, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Katie McConky, Rakesh Nagi, Moises Sudit, and William Hughes. 2012. Improving event co-reference by context extraction and dynamic feature weighting, pages 38–43. IEEE.

Haoruo Peng, Yangqiu Song, and Dan Roth. 2016. Event detection and co-reference with minimal supervision. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 392–402, Austin, Texas. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Sameer S. Pradhan, Lance Ramshaw, Ralph Weischedel, Jessica MacBride, and Linnea Micciulla. 2007. Unrestricted coreference: Identifying entities and events in OntoNotes. In Proceedings of the International Conference on Semantic Computing, pages 446–453, Washington, DC, USA.

Marta Recasens and Eduard Hovy. 2011. BLANC: Implementing the Rand index for coreference evaluation. Natural Language Engineering, 17(4):485–510.

Zhiyi Song, Ann Bies, Stephanie Strassel, Tom Riese, Justin Mott, Joe Ellis, Jonathan Wright, Seth Kulick, Neville Ryant, and Xiaoyi Ma. 2015. From light to rich ERE: Annotation of entities, relations, and events. In Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation, pages 89–98, Denver, Colorado. Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proceedings of the International Conference on Learning Representations.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008.

Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Sixth Message Understanding Conference (MUC-6), pages 45–52, Columbia, Maryland.

Bishan Yang, Claire Cardie, and Peter Frazier. 2015. A hierarchical distance-dependent Bayesian model for event coreference resolution. Transactions of the Association for Computational Linguistics, 3:517–528.

Tongtao Zhang, Hongzhi Li, Heng Ji, and Shih-Fu Chang. 2015. Cross-document event coreference resolution based on cross-media features. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 201–206, Lisbon, Portugal. Association for Computational Linguistics.
A Experiment Details

Table 6 presents the detailed hyper-parameters of the E3C model used in our experiments. All experiments were conducted on an Nvidia TITAN RTX GPU.
Parameter name                        Parameter value
Mini-batch size                       1
Max epochs for stopping training      150
Patience for early stopping           10
Max antecedents number                50
Max document length for training      1024
Dropout for word representation       0.5
Dropout for FFNN                      0.2
Hidden layers for FFNN                2
Hidden units for FFNN                 150
Optimizer                             Adamax
Initial learning rate                 0.001
Learning rate anneal factor           0.5
Learning rate anneal patience         5

Table 6: Hyper-parameters of the E3C model used in our experiments. FFNN indicates the feed-forward neural networks for mention proposal and antecedent scoring.
B Data Sets
We used the Stanford CoreNLP toolkit (https://stanfordnlp.github.io/CoreNLP/) to preprocess all documents for XML tag cleaning, sentence splitting, and tokenization. Since only 18 categories were used for the official evaluation, we filtered out event instances with other categories from the training data.

C Reproducibility
In this section, we present the reproducibility information of the paper. Table 7 shows the corresponding validation performance for all reported KBP 2016 and KBP 2017 results. In addition, Table 8 presents the average runtime for each approach and the number of parameters in each model.
                              Type-P   Type-R   Type-F1   B³      CEAF_e   MUC     BLANC   AVG-F
KBP 2016
Interact Model_BERT
E3C                           62.94    59.10    60.96     49.02   46.76    42.80   33.00   42.89
KBP 2017
Interact Model_BERT
E3C                           65.85    54.61    59.71     51.60   51.48    39.45   35.01   44.38
E3C w/o Type-Refined
E3C w/o Type-Guided
E3C_TwoStage
E3C w/o Proposal Loss
E3C_GloVe+Char
E3C_ELMo
E3C_BERT-BASE-Cased
E3C_BERT-LARGE-Uncased
E3C_BERT-LARGE-WWM-Uncased
E3C_GoldMention

Table 7: Corresponding validation performance for each reported KBP 2016/2017 result. Type Rule and E3C_GoldMention use oracle experimental setups to explore the upper-bound performance of E3C. Type Rule is a simple heuristic method that regards all event mentions of the same type as coreferential, and it directly uses the event detection results from E3C. E3C_GoldMention uses gold mentions instead of mentions proposed by the mention proposal layer, but the model still needs to predict the type of each mention.
                              Time for one epoch (s)   |Θ_update|
E3C                           82.76                    2,886,108
Interact Model_BERT
E3C w/o Type-Refined
E3C w/o Type-Guided
E3C_TwoStage
E3C w/o Proposal Loss
E3C_GloVe + Char
E3C_ELMo
E3C_BERT-BASE-Cased
E3C_BERT-LARGE-Uncased
E3C_BERT-LARGE-WWM-Uncased

Table 8: Average runtime for each approach and the number of parameters in each model. |Θ_update| indicates the number of updated parameters.