Resolving Event Coreference with Supervised Representation Learning and Clustering-Oriented Regularization
Kian Kenyon-Dean
School of Computer Science, McGill University, [email protected]
Jackie Chi Kit Cheung
School of Computer Science, McGill University, [email protected]
Doina Precup
School of Computer Science, McGill University, [email protected]
Abstract
We present an approach to event coreference resolution by developing a general framework for clustering that uses supervised representation learning. We propose a neural network architecture with novel Clustering-Oriented Regularization (CORE) terms in the objective function. These terms encourage the model to create embeddings of event mentions that are amenable to clustering. We then use agglomerative clustering on these embeddings to build event coreference chains. For both within- and cross-document coreference on the ECB+ corpus, our model obtains better results than models that require significantly more pre-annotated information. This work provides insight and motivating results for a new general approach to solving coreference and clustering problems with representation learning.
1 Introduction

Event coreference resolution is the task of determining which event mentions expressed in language refer to the same real-world event instances. The ability to resolve event coreference has improved the quality of downstream tasks such as automatic text summarization (Vanderwende et al., 2004), question answering (Berant et al., 2014), headline generation (Sun et al., 2015), and text mining in the medical domain (Ferracane et al., 2016).

Event mentions are comprised of an action component (or, head) and surrounding arguments. Consider the following passages, drawn from two different documents; the heads of the event mentions are in boldface and the subscripts indicate mention IDs:

(1) The president's speech_m1 shocked_m2 the audience. He announced_m3 several new controversial policies.

(2) The policies proposed_m4 by the president will not surprise_m5 those who followed_m6 his campaign_m7.

In this example, m1, m3, and m4 form a chain of coreferent event mentions (underlined), because they refer to the same real-world event in which the president gave a speech. The other four are singletons, meaning that they all refer to separate events and do not corefer with any other mention.

This work investigates how to learn useful representations of event mentions. Event mentions are complex objects, and both the event mention heads and the surrounding arguments are important for the event coreference resolution task. In our example above, the head words of mentions m2, shocked, and m5, surprise, are lexically similar, but the event mentions do not corefer. This task therefore necessitates a model that can capture the distributional relationships between event mentions and their surrounding contexts.

We hypothesize that prior knowledge about the task itself can be usefully encoded into the representation learning objective. For our task, this prior means that the embeddings of coreferential event mentions should be similar to each other (a "natural clustering", using the terminology of Bengio et al. (2013)). With this prior, our model creates embeddings of event mentions that are directly conducive to the clustering task of building event coreference chains. This is contrary to the indirect methods of previous work that rely on pairwise decision making followed by a separate model that aggregates the sometimes inconsistent decisions into clusters (Section 2).

We demonstrate these points by proposing a method that learns to embed event mentions into a space that is tuned specifically for clustering. The representation learner is trained to predict which event cluster the event mention belongs to, using an hourglass-shaped neural network. We propose a mechanism to modulate this training by introducing Clustering-Oriented Regularization (CORE) terms into the objective function of the learner; these terms impel the model to produce similar embeddings for coreferential event mentions, and dissimilar embeddings otherwise.

Our model obtains strong results on within- and cross-document event coreference resolution, matching or outperforming the system of Cybulska and Vossen (2015) on the ECB+ corpus on all six evaluation measures. We achieve these gains despite the fact that our model requires significantly less pre-annotated or pre-detected information in terms of the internal event structure.
Our model's improvements upon the baselines show that our supervised representation learning framework creates new embeddings that capture the abstract distributional relations between samples and their clusters, suggesting that our framework can be generalized to other clustering tasks. All code used in this paper can be found here: https://github.com/kiankd/events.

2 Related Work

The recent work on event coreference can be categorized according to the assumed level of event representation. In the predicate-argument alignment paradigm (Roth and Frank, 2012; Wolfe et al., 2013), links are simply drawn between predicates in different documents. This work only considers cross-document event coreference (Wolfe et al., 2013, 2015), and no within-document coreference. At the other extreme, the ACE and ERE datasets annotate rich internal event structure, with specific taxonomies that describe the annotated events and their types (Linguistic Data Consortium, 2005, 2016). In these datasets, only within-document coreference is annotated.

The creators of the ECB (Bejan and Harabagiu, 2008) and ECB+ (Cybulska and Vossen, 2014) annotate events according to a level of abstraction between that of the predicate-argument approach and the ACE approach, being most similar to the TimeML paradigm (Pustejovsky et al., 2003). In these datasets, both within-document and cross-document coreference relations are annotated. We use the ECB+ corpus in our experiments because it solves the lack of lexical diversity found within the ECB by adding 502 new annotated documents, providing a total of 982 documents.
Previous work on model design for event coreference has focused on clustering over a linguistically rich set of features. Most models require a pairwise-prediction-based supervised learning step which predicts whether or not a pair of event mentions is coreferential (Bagga and Baldwin, 1999; Chen et al., 2009; Cybulska and Vossen, 2015). Other work focuses on the clustering step itself, aggregating local pairwise decisions into clusters, for example by graph partitioning (Chen and Ji, 2009). There has also been work using non-parametric Bayesian clustering techniques (Bejan and Harabagiu, 2014; Yang et al., 2015), as well as other probabilistic models (Lu and Ng, 2017). Some recent work uses intuitions combining representation learning with clustering, but does not augment the loss function for the purpose of building clusterable representations (Krause et al., 2016; Choubey and Huang, 2017).
3 Model

We formulate the task of event coreference resolution as creating clusters of event mentions which refer to the same event. For the purposes of this work, we define an event mention to be a set of tokens that correspond to the action of some event. Consider the sentence below (borrowed from Cybulska and Vossen (2014)):

(3) On Monday Lindsay Lohan checked into rehab in Malibu, California after a car crash.

Our model would take, as input, feature vectors (see Section 4) extracted from the two event mentions (in bold) independently. In this paper, we use the gold-standard event mentions provided by the dataset, and leave mention detection to other work.
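As a concrete illustration, the following minimal Python sketch (purely illustrative; the field names and the example token indices are our own, not part of the corpus format) shows one way a gold-standard event mention and its coreference label could be represented before feature extraction:

from dataclasses import dataclass
from typing import List

@dataclass
class EventMention:
    doc_id: str            # document in which the mention occurs
    token_ids: List[int]   # indices of the action tokens within the document
    tokens: List[str]      # surface form of the action, e.g. ["checked", "into"]
    gold_chain: int        # id of the gold coreference chain (cluster)

# The two mentions in sentence (3); document id and chain ids are arbitrary placeholders.
m1 = EventMention("ecbplus_doc", [4, 5], ["checked", "into"], gold_chain=0)
m2 = EventMention("ecbplus_doc", [14], ["crash"], gold_chain=1)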
Our approach to resolving event coreference consists of the following steps:

1. Train a supervised neural network model which learns event mention embeddings by predicting the event cluster in the training set to which the mention belongs (Figure 1).

2. At test time, use the previously trained model's embedding layer to derive representations of unseen event mentions. Then, perform agglomerative clustering with these embeddings to create event coreference chains (Figure 2).

Figure 1: Our supervised representation learning model during the training step. Dashed arrows indicate contributions to the loss function.

Figure 2: Our trained model at inference time, used for validation tuning and final testing. Note that H and Y are not used in this step.

We propose a representation learning framework based on training a multi-layer artificial neural network, with one layer H_e chosen to be the embedding layer. In the training set, there are a certain number of event mentions, each of which belongs to some gold standard cluster, making C total non-singleton clusters in the training set. The network is trained as if it were encountering a (C+1)-class classification problem, where the class of an event mention corresponds to a single output node, and all singleton mentions belong to class C+1. (If each singleton mention, i.e., a mention that does not corefer with anything else, had its own class, then the model would be confronted with a classification problem with thousands of classes, many of which would only have one sample; this is much too ill-posed, so we merge all singletons together during the training step.)

When using this model to cluster a new set of mentions, the final layer's output will not be directly informative since the output node structure corresponds to the clusters within the training set. However, we hypothesize that the trained model will have learned to capture the abstract distributional relationships between event mentions and clusters in the intermediate layer H_e. We thus use the activations in H_e as the embedding of an event mention for the clustering step (see Figure 2). A similar hourglass-like neural architecture design has been successful in automatic speech recognition (Grézl et al., 2007; Gehring et al., 2013), but has not to our knowledge been used to pre-train embeddings for clustering.
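A minimal sketch of this hourglass architecture, assuming a PyTorch implementation (this is not the authors' released code, and the hidden-layer size, embedding size, and dropout rate are placeholders):

import torch.nn as nn

class MentionEncoder(nn.Module):
    """Hourglass network: features -> hidden layers -> embedding H_e -> C+1 classes."""

    def __init__(self, n_features, n_train_clusters, hidden_dim=1024, embed_dim=64, p_drop=0.5):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(n_features, hidden_dim), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden_dim, embed_dim), nn.ReLU(),   # narrow embedding layer H_e
        )
        # one output unit per non-singleton training cluster, plus one shared
        # class for all singleton mentions (C + 1 classes in total)
        self.classifier = nn.Linear(embed_dim, n_train_clusters + 1)

    def embed(self, x):
        # H_e activations, reused later as the mention embedding for clustering
        return self.trunk(x)

    def forward(self, x):
        # logits over the C + 1 training clusters (softmax is applied inside the loss)
        return self.classifier(self.trunk(x))

At inference time only embed() is needed, mirroring Figure 2, where the output layer is discarded.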
Using CCE as the loss function trains the model to correctly predict a training set mention's corresponding cluster. With model prediction y_{ic} as the probability that sample i belongs to class c, and indicator variable t_{ic} = 1 if sample i belongs to class c (else t_{ic} = 0), we have the mean categorical cross-entropy loss over a randomly sampled training input batch X:

L_{CCE} = -\frac{1}{|X|} \sum_{i=1}^{|X|} \sum_{c=1}^{C+1} t_{ic} \log(y_{ic})    (1)

With CCE, the model may overfit towards accurate prediction performance for those particular clusters found in the training set without learning an embedding that captures the nature of events in general. This therefore motivates introducing regularization terms based on the intuition that embeddings of mentions belonging to the same cluster should be similar, and that embeddings of mentions belonging to different clusters should be dissimilar. Accordingly, we define dissimilarity between two vector embeddings (e_1, e_2) according to the cosine-distance function d:

d(e_1, e_2) = \frac{1}{2}\left(1 - \frac{e_1 \cdot e_2}{\|e_1\| \, \|e_2\|}\right)    (2)

Given input batch X, we create two sets S and D, where S is the set of all pairs (a, b) of mentions in X that belong to the same cluster, and D is the set of all pairs (c, d) in X that belong to different clusters. Note that all vector embeddings e_i = H_e(i); i.e., they are obtained by feeding the event mention i's features through to embedding layer H_e. We now define the following Attractive and Repulsive CORE terms.
The first desirable property for the embeddings is that mentions that belong to the same cluster should have low cosine distance between each others' embeddings, since the agglomerative clustering algorithm uses cosine distance to make coreference decisions. Formally, for all pairs of mentions a and b that belong to the same cluster, we would like to minimize the distance between their embeddings e_a and e_b. We call this "attractive" regularization because we want to attract embeddings closer to each other by minimizing their distance d(e_a, e_b) so that they will be as similar as possible:

L_{attract} = \frac{1}{|S|} \sum_{(a,b) \in S} d(e_a, e_b)    (3)

The second desirable property is that the embeddings corresponding to mentions that belong to different clusters should have high cosine distance between each other. Thus, for all pairs of mentions c and d that belong to different clusters, the goal is to maximize their distance d(e_c, e_d). This is "repulsive" because we train the model to push the embeddings away from each other to be as distant as possible:

L_{repulse} = 1 - \frac{1}{|D|} \sum_{(c,d) \in D} d(e_c, e_d)    (4)

Equation 5 below shows the final loss function. The attractive and repulsive terms are weighted by hyperparameter constants λ_1 and λ_2, respectively:

L = L_{CCE} + λ_1 L_{attract} + λ_2 L_{repulse}    (5)

Note that, while we present Equations 3 and 4 as summations over pairs from the input batch, the computation is actually reasonable when written in terms of matrix multiplications; the most expensive operation is multiplying the embedded batch of input samples by its transpose.

By adding these regularization terms to the loss function, we hypothesize that the new embeddings of test set mentions (obtained by feeding their features forward through the trained model) will exemplify the desired properties represented by the loss function, thus assisting the agglomerative clustering task in producing correct coreference chains.
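A minimal sketch of the combined objective in Equation 5, assuming the MentionEncoder sketch above and a PyTorch implementation (the pairwise loops are written for clarity rather than speed, and treating the merged singleton class like any other cluster is our simplification):

import torch
import torch.nn.functional as F

def cosine_distance(u, v):
    # d(e1, e2) = (1 - cos(e1, e2)) / 2, as in Equation 2
    return 0.5 * (1.0 - F.cosine_similarity(u, v, dim=-1))

def core_cce_loss(model, x, cluster_ids, lambda1=2.0, lambda2=0.0):
    """x: (batch, n_features) floats; cluster_ids: (batch,) long tensor of gold classes."""
    l_cce = F.cross_entropy(model(x), cluster_ids)       # Equation 1
    emb = model.embed(x)

    same, diff = [], []                                   # the pair sets S and D
    for i in range(len(cluster_ids)):
        for j in range(i + 1, len(cluster_ids)):
            d = cosine_distance(emb[i], emb[j])
            (same if cluster_ids[i] == cluster_ids[j] else diff).append(d)

    zero = emb.new_zeros(())
    l_attract = torch.stack(same).mean() if same else zero           # Equation 3
    l_repulse = (1.0 - torch.stack(diff).mean()) if diff else zero   # Equation 4
    return l_cce + lambda1 * l_attract + lambda2 * l_repulse         # Equation 5

The default weights λ_1 = 2 and λ_2 = 0 mirror the validation-tuned CORE+CCE setting reported in Section 5.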
Agglomerative clustering is a non-parametric "bottom-up" approach to hierarchical clustering, in which each sample starts as its own cluster, and at each step, the two most similar clusters are merged, where similarity between two clusters is measured according to some similarity metric. After each merge, clustering similarities are recomputed according to a preset criterion (e.g., single- or complete-linkage). In our models, clustering proceeds until a pre-determined similarity threshold, τ, is reached. We tuned τ on the validation set by grid search to maximize B^3 accuracy. (We optimize with B^3 F1-score because the other measures are either too expensive to compute (CEAF-M, CEAF-E, BLANC), or are less discriminative (MUC).) Preliminary experimentation led us to use cosine similarity (see cosine distance in Equation 2) to measure vector similarity, and single-linkage for clustering decisions.

We experimented with two initialization schemes for agglomerative clustering. In the first scheme, each event mention is initialized as its own cluster, as is standard. In the second, we initialized clusters using the lemma-δ baseline defined by Upadhyay et al. (2016). This baseline merges all event mentions with the same head lemma that are in documents with document-level similarity higher than a threshold δ. Upadhyay et al. showed that it is a strong indicator of event coreference, so we experimented with initializing our clustering algorithm in this way. We call this model variant CORE+CCE+LEMMA, and describe the parameter tuning procedures in more detail in Section 5.
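A minimal sketch of the clustering step, using SciPy's hierarchical clustering as a stand-in for whichever implementation was actually used; note that SciPy cuts the merge tree at a cosine-distance threshold, so the tuned similarity threshold τ would have to be converted accordingly:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_mentions(embeddings, distance_threshold):
    """embeddings: (n_mentions, d) array of H_e activations for the mentions to cluster."""
    dists = pdist(np.asarray(embeddings, dtype=float), metric="cosine")  # 1 - cos(u, v)
    tree = linkage(dists, method="single")          # single-linkage agglomeration
    # stop merging once the closest pair of clusters exceeds the threshold
    return fcluster(tree, t=distance_threshold, criterion="distance")

The threshold would be selected by grid search on the validation set to maximize B^3 F1, as described above.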
4 Feature Extraction

We extract features that do not require the preprocessing step of event-template construction to represent the context (unlike Cybulska and Vossen (2015), see Table 1); instead, we represent the surrounding context by using the tokens in the general vicinity of the event's action. We thus only require the event's action (which is what we define as an event mention) to be previously detected, not all of its arguments. We motivate this by arguing that it would be preferable to build high quality coreference chains without event template features, since extracting event templates can be a difficult process, with the possibility of errors cascading into the event coreference step.

action: checked into, crash
time: On Monday
location: rehab in Malibu, California
participant (human): Lindsay Lohan
participant (non-human): car

Table 1: An event template of the sentence in Example 3, borrowed from Cybulska and Vossen (2014; 2015). Our model only requires as input the action, not the time, location, nor participant arguments.

Inspired by the approach of Clark and Manning (2016) in the entity coreference task, we extract, for each event mention em and for each of the token sets below, (i) the token's word2vec word embedding (Mikolov et al., 2013) (or average if there are multiple); and (ii) the one-hot count vector of the token's lemma (or sum if there are multiple):

• the first token of em;
• the last token of em;
• all tokens in em;
• each of the two tokens preceding em;
• each of the two tokens following em;
• all of the five tokens preceding em;
• all of the five tokens following em;
• all of the tokens in em's sentence.

(The one-hot lemma vector's entries correspond to the most frequently occurring lemmas in the training set, with a final entry indicating that the lemma is not in that set of most frequently occurring lemmas.)

It is necessary to include features that characterize the mention's document, hoping that the model learns a latent understanding of relations between documents. We extract features from the event mention's document by building lemma-based TF-IDF vector representations of the document. We use log normalization of the raw term frequency of token lemma t in document d, f_{t,d}, where TF_t = 1 + log(f_{t,d}). For the IDF term we use smoothed inverse document frequency: with N as the number of documents and n_t as the number of documents that contain the lemma, we have IDF_t = log(1 + N/n_t). By performing a component-wise multiplication of the IDF vector with each row in the term-frequency matrix TF, we create TF-IDF vectors of each document in the training and test sets (with length corresponding to the number of unique lemmas in the training set). We compress these vectors with principal component analysis fitted onto the train set document vectors, which is used to transform the validation and test set document vectors. A small sketch of this document-vector construction follows.
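This minimal sketch assumes raw per-document lemma counts are already available; the vocabulary size and the number of PCA components are placeholders, and computing IDF separately for each document collection is our assumption rather than something stated above:

import numpy as np
from sklearn.decomposition import PCA

def tfidf_vectors(lemma_counts):
    """lemma_counts: (n_docs, vocab) raw counts of training-set lemmas per document."""
    tf = np.where(lemma_counts > 0, 1.0 + np.log(np.maximum(lemma_counts, 1)), 0.0)  # 1 + log(f_{t,d})
    n_docs = lemma_counts.shape[0]
    n_t = np.maximum((lemma_counts > 0).sum(axis=0), 1)   # documents containing each lemma
    idf = np.log(1.0 + n_docs / n_t)                      # smoothed IDF: log(1 + N / n_t)
    return tf * idf                                       # component-wise multiplication

def compress_documents(train_counts, eval_counts, n_components=50):
    pca = PCA(n_components=n_components)                  # fitted on training documents only
    train_vecs = pca.fit_transform(tfidf_vectors(train_counts))
    eval_vecs = pca.transform(tfidf_vectors(eval_counts))
    return train_vecs, eval_vecs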
We include comparative features to relate a mention to the other mentions in its document and to the mentions in the set of documents the model would be requested to extract event coreference chains from. This is motivated by the fact that coreference decisions must be informed by the relationship mentions have with each other. Firstly, we encode the position of the mention in its document with binary features indicating if it is first or last; for example, if there were five mentions and it were the third, this feature would correspond to the vector [0, 0]. Next, we define two sets of mentions we would like to compare with: the first contains all mentions in the same document as the current mention em, and the second contains all mentions in the data we are asked to cluster. For each of these sets, we compute the average word overlap and average lemma overlap (measured by harmonic similarity) between em and each of the other mentions in the set. We thus add two feature vector entries for each of the sets: the average word overlap between em and the other mentions in the set, and the average lemma overlap between em and the other mentions in the set.

5 Experiments

We run our experiments on the ECB+ corpus, the largest corpus that contains both within- and cross-document event coreference annotations. We followed the train/test split of Cybulska and Vossen (2015). During training, we split off a validation set for hyperparameter tuning (topics 2, 5, 12, 18, 21, 23, 34, 35, randomly chosen).

Following Cybulska and Vossen, we used the portion of the corpus that has been manually reviewed and checked for correctness. Some previous work (Yang et al., 2015; Upadhyay et al., 2016; Choubey and Huang, 2017) does not appear to have followed this guideline from the corpus creators, as they report different corpus statistics compared to those reported by Cybulska and Vossen. As a result, those papers may report results on a data set with known annotation errors.

Since there is no consensus in the coreference resolution literature on the best evaluation measure, we present results obtained according to six different measures, as is common in previous work. We use the scorer presented by Pradhan et al. (2014). In this task, the term "coreference chain" is synonymous with "cluster".
MUC (Vilain et al., 1995). Link-level measure which counts the minimum number of link changes required to obtain the correct clustering from the predictions; it does not account for correctly predicted singletons.

B^3 (Bagga and Baldwin, 1998). Mention-level measure which computes precision and recall for each individual mention, overcoming the singleton problem of MUC, but which can problematically count the same coreference chain multiple times.

CEAF-M (Luo, 2005). Mention-level measure which reflects the percentage of mentions that are in the correct coreference chains. Note that precision and recall are the same in this measure since we use pre-annotated mentions.
CEAF-E (Luo, 2005). Entity-level measure computed by aligning predicted chains with the gold chains, not allowing one chain to have more than one alignment, overcoming the problem of B^3.

BLANC (Luo et al., 2014). Computes two F-scores in terms of the pairwise quality of coreference decisions and non-coreference decisions, and averages these scores together for the final results.
CoNLL. The mean of MUC, B^3, and CEAF-E.

We compare our representation-learning model variants to three baselines: a deterministic lemma-based baseline, a lemma-δ baseline, and an unsupervised baseline which clusters the originally extracted features. We also compare with the results of Cybulska and Vossen (2015).

LEMMA. This algorithm clusters event mentions which share the same head word lemma into the same coreference chains across all documents.

LEMMA-δ. Proposed by Upadhyay et al. (2016), this method provides a difficult baseline to beat. A δ-similarity threshold is introduced, and we merge two mentions with the same head lemma if and only if the cosine similarity between the TF-IDF vectors of their corresponding documents is greater than δ (a small sketch of this procedure is given below). This δ parameter is tuned to maximize B^3 performance on the validation set.

UNSUPERVISED. This is the result obtained by agglomerative clustering over the original unweighted features. Again, we optimize the τ similarity threshold over the validation set.

Cybulska and Vossen (2015) propose a model that uses sentence-level event templates (see Table 1), requiring more annotated information than our models. See Vossen and Cybulska (2017) for further elaboration of this model. To our knowledge, this is the best previous model on ECB+ using the same data and evaluation criteria as our work.
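A minimal sketch of the LEMMA-δ baseline described above; union-find is our implementation choice, and the head-lemma and document-vector inputs are assumed to be precomputed:

import numpy as np

def lemma_delta_chains(head_lemmas, doc_vecs, delta):
    """head_lemmas[i]: head lemma of mention i; doc_vecs[i]: TF-IDF vector of its document."""
    n = len(head_lemmas)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    for i in range(n):
        for j in range(i + 1, n):
            if head_lemmas[i] == head_lemmas[j] and cos(doc_vecs[i], doc_vecs[j]) > delta:
                parent[find(i)] = find(j)   # merge the two mentions' chains
    return [find(i) for i in range(n)]      # chain id per mention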
We test four different model variants:

• CCE: uses only categorical cross-entropy in the loss function (Equation 1);
• CORE: uses only Clustering-Oriented Regularization, i.e., the attract and repulse terms (Equations 3 and 4);
• CORE+CCE: includes categorical cross-entropy and the attract and repulse terms (Equation 5);
• CORE+CCE+LEMMA: initializes the agglomerative clustering with clusters computed by lemma-δ (with a differently tuned value of δ than the baseline) and continues the clustering process using the similarities between the embeddings created by CORE+CCE.
Table 2: Model comparison based on validation set B^3 accuracy with the optimized τ cluster-similarity threshold (columns: Model, λ_1, λ_2, B^3, τ; rows cover the Unsupervised, Lemma, and Lemma-δ baselines and the CCE, CORE, CORE+CCE, and CORE+CCE+LEMMA model variants, the latter under several λ_1, λ_2 settings).

For the representation learning models, we performed a non-exhaustive hyperparameter search optimized for validation set performance. We keep the following parameters constant across the model variants:

• the number of neurons in hidden layers H_1 and H_2, and in H_e, the embedding layer (see Figure 1);
• Softmax output layer with C + 1 units;
• ReLU activation functions for all neurons;
• Adam gradient descent (Kingma and Ba, 2014);
• dropout between each layer;
• the learning rate (reduced for CORE);
• randomly sampled batches of mentions, where a batch is forced to contain pairs of coreferential and non-coreferential mentions.

Models are trained for 100 epochs. At each epoch, we optimize τ (our agglomerative clustering similarity threshold) using a two-pass approach: we first test several settings of τ, then τ is further optimized around the best value from the first pass. For CORE+CCE+LEMMA, we tune the δ parameter of the lemma-δ clustering approach to the validation set by testing different values of δ; these different δ values initialize the clusters, and we then continue clustering by testing validation results obtained when using the similarities between the embeddings created by CORE+CCE for different values of τ.

Some of the results of hyperparameter tuning on the validation set are shown in Table 2. Interestingly, we observe that CORE+CCE performs slightly better with λ_2 = 0, i.e., without repulsive regularization. This suggests that enforcing representation similarity is more important than enforcing division, although we cannot conclusively state that repulsive regularization would not be useful for other tasks. Nonetheless, for test set results we use the optimal hyperparameter configurations found during this validation-tuning step; e.g., for CORE+CCE we set λ_1 = 2 and λ_2 = 0.

6 Results

Table 3 presents the performance of the models for combined within- and cross-document event coreference. Results for these models are obtained with the hyperparameter settings that achieved optimal accuracy during validation tuning. Firstly, we observe that CORE+CCE offers marked improvements upon the UNSUPERVISED baseline, the CORE model, and the CCE model.
From these results we conclude: (i) supervised representation learning provides more informative embeddings than the original feature vectors; and (ii) combining Clustering-Oriented Regularization with categorical cross-entropy is better than using only one or the other, indicating that our introduction of these novel terms into the loss function is a useful contribution.

We also note that CORE+CCE+LEMMA (which obtains the best validation set results) beats the strong LEMMA-δ baseline. Our model offers marked improvements or roughly equivalent scores in each evaluation measure except BLANC, where the baseline offers an F-score improvement. This is due to the very high precision of the baseline, whereas CORE+CCE+LEMMA seems to trade precision for recall.

We finally observe that CORE+CCE+LEMMA improves upon the results of Cybulska and Vossen (2015). We obtain improvements in MUC, entity-based CEAF, CoNLL, and BLANC, with equivalent results in B^3 and mention-based CEAF. These results suggest that high quality coreference chains can be built without necessitating event templates.
                 MUC           B^3           CM     CE            BLANC         CoNLL
Model            R  P  F       R  P  F       F      R  P  F       R  P  F       F
Baselines
LEMMA            66 58 62      66 58 62      51     87 39 54      64 61 63      61
LEMMA-δ          55 68 61      61 80 69      59     73 60 66      62 80         
UNSUPERVISED     39 63 48      55 81 66      51     72 49 58      57 58 58      57
Previous Work
CV2015           43 77 55      58 86         58     -  -  66      60 69 63      64
Model Variants
CCE              66 63 65      69 60 64      50     59 63 61      69 56 59      63
CORE             58 58 58      66 58 62      44     53 53 53      66 54 56      57
CORE+CCE         62 70 66      67 69 68      56     73 64 68      68 59 62      67
CORE+CCE+LEMMA   67 71         71 67         58     71 67         72 60 64      

Table 3: Combined within- and cross-document test set results on ECB+. Measures CM and CE stand for mention-based CEAF and entity-based CEAF, respectively.
                 MUC           B^3           CM     CE            BLANC         CoNLL
Model            R  P  F       R  P  F       F      R  P  F       R  P  F       F
Baselines
LEMMA-δ          41 77 53      86 97         85     92 82 87      65 86 71      77
UNSUPERVISED     32 36 34      85 86 85      74     80 78 79      65 55 57      66
Model Variants
CCE              44 49 46      87 89 88      79     82 80 81      67 67 67      72
CORE             55 32 40      89 70 78      65     64 79 71      75 54 56      63
CORE+CCE         43 68 53      87 95 91      84     90 82 86      67 76 70      76
CORE+CCE+LEMMA   57 69         90 94 92      86     90 86         73 78 75      81

Table 4: Within-document test set results on ECB+. Note that LEMMA is equivalent to LEMMA-δ in the within-document setting. Cybulska and Vossen (2015) did not report the performance of their model in this setting.

In Table 4, we see the performance of our models on within-document coreference resolution in isolation. These results are obtained by cutting all links drawn across documents for both the gold standard chains and the predicted chains. We observe that, across all models, scores on the mention- and entity-based measures are substantially higher than on the link-based measures (e.g., MUC and BLANC). The usefulness of CORE+CCE+LEMMA (which initializes the clustering with the lemma-δ predictions and then continues to cluster with CORE+CCE) is exemplified by the improvements or matches in every measure when compared to both LEMMA-δ and CORE+CCE. The most vivid improvements here are the gain in MUC over both models, as well as the gains in BLANC, where the higher recall entails that CORE+CCE+LEMMA confidently predicts coreference links that would otherwise have been false negatives.
7 Conclusion

We have presented a novel approach to event coreference resolution by combining supervised representation learning with non-parametric clustering. We train an hourglass-shaped neural network to learn how to represent event mentions in a way that is useful for an agglomerative clustering algorithm. By adding the novel Clustering-Oriented Regularization (CORE) terms into the loss function, the model learns to construct embeddings that are easily clusterable; i.e., it encodes the prior that embeddings of samples belonging to the same cluster should be similar, and those of samples belonging to different clusters should be dissimilar.

Our results suggest that clustering embeddings created with representation learning is much better than clustering the original feature vectors, when using the same agglomerative clustering algorithm. We show that including CORE in the loss function improves performance more than only using categorical cross-entropy to train the representation learner. Our top-performing model obtains results that improve upon previous work despite the fact that our model requires less annotated information in order to perform the task.

Future work involves applying our model to automatically annotated event mentions and other event coreference datasets, and extending this framework toward a full end-to-end system that does not rely on manual feature engineering at the input level. Additionally, our model may be useful for other clustering tasks, such as entity coreference and document clustering. Lastly, we seek to determine how CORE and its imposition of a clusterable latent space structure may or may not assist in improving the quality of latent representations in general.
Acknowledgements
This work was funded with grants from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Fonds de recherche du Québec - Nature et Technologies (FRQNT). We thank the anonymous reviewers for their helpful comments and suggestions.
References
Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, volume 1, pages 563-566.

Amit Bagga and Breck Baldwin. 1999. Cross-document event coreference: Annotations, experiments, and observations. In Proceedings of the Workshop on Coreference and its Applications, pages 1-8. ACL.

Cosmin Adrian Bejan and Sanda Harabagiu. 2014. Unsupervised Event Coreference Resolution. Computational Linguistics, 40(2):311-347.

Cosmin Adrian Bejan and Sanda M Harabagiu. 2008. A Linguistic Resource for Discovering Event Structures and Resolving Event Coreference. In LREC.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828.

Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Abby Vander Linden, Brittany Harding, Brad Huang, Peter Clark, and Christopher D Manning. 2014. Modeling Biological Processes for Reading Comprehension. In Proceedings of the 2014 Conference on EMNLP, pages 1499-1510.

Zheng Chen and Heng Ji. 2009. Graph-based event coreference resolution. In Proceedings of the 2009 Workshop on Graph-based Methods for NLP, pages 54-57. ACL.

Zheng Chen, Heng Ji, and Robert Haralick. 2009. A Pairwise Event Coreference Model, Feature Impact and Evaluation for Event Coreference Resolution. In Proceedings of the Workshop on Events in Emerging Text Types, pages 17-22. ACL.

Prafulla Kumar Choubey and Ruihong Huang. 2017. Event coreference resolution by iteratively unfolding inter-dependencies among events. arXiv preprint arXiv:1707.07344.

Kevin Clark and Christopher D Manning. 2016. Improving Coreference Resolution by Learning Entity-Level Distributed Representations. arXiv preprint arXiv:1606.01323.

Agata Cybulska and Piek Vossen. 2014. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In LREC, pages 4545-4552.

Agata Cybulska and Piek Vossen. 2015. Translating Granularity of Event Slots into Features for Event Coreference Resolution. In Proceedings of the 3rd Workshop on EVENTS at the NAACL-HLT, pages 1-10.

Elisa Ferracane, Iain Marshall, Byron C Wallace, and Katrin Erk. 2016. Leveraging coreference to identify arms in medical abstracts: An experimental study. EMNLP, pages 86-95.

Jonas Gehring, Yajie Miao, Florian Metze, and Alex Waibel. 2013. Extracting deep bottleneck features using stacked auto-encoders. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 3377-3381. IEEE.

Frantisek Grézl, Martin Karafiát, Stanislav Kontár, and Jan Cernocky. 2007. Probabilistic and bottle-neck features for LVCSR of meetings. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4, pages IV-757. IEEE.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Sebastian Krause, Feiyu Xu, Hans Uszkoreit, and Dirk Weissenborn. 2016. Event linking with sentential features from convolutional neural networks. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 239-249.

Linguistic Data Consortium. 2005. ACE (Automatic Content Extraction) English Annotation Guidelines for Events. Version 5.4.3 2005.07.01.

Linguistic Data Consortium. 2016. Rich ERE Annotation Guidelines Overview. V4.2.

Jing Lu and Vincent Ng. 2017. Learning antecedent structures for event coreference resolution. In Machine Learning and Applications (ICMLA), 2017 16th IEEE International Conference on, pages 113-118. IEEE.

Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of the Conference on Human Language Technology and EMNLP, pages 25-32. ACL.

Xiaoqiang Luo, Sameer Pradhan, Marta Recasens, and Eduard H Hovy. 2014. An Extension of BLANC to System Mentions. In ACL (2), pages 24-29.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

Sameer Pradhan, Xiaoqiang Luo, Marta Recasens, Eduard H Hovy, Vincent Ng, and Michael Strube. 2014. Scoring coreference partitions of predicted mentions: A reference implementation. In ACL (2), pages 30-35.

James Pustejovsky, José M Castano, Robert Ingria, Roser Sauri, Robert J Gaizauskas, Andrea Setzer, Graham Katz, and Dragomir R Radev. 2003. TimeML: Robust specification of event and temporal expressions in text. New Directions in Question Answering, 3:28-34.

Michael Roth and Anette Frank. 2012. Aligning predicate argument structures in monolingual comparable texts: A new corpus for a new task. In Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 218-227. ACL.

Rui Sun, Yue Zhang, Meishan Zhang, and Donghong Ji. 2015. Event-driven headline generation. In Proceedings of ACL, pages 462-472.

Shyam Upadhyay, Nitish Gupta, Christos Christodoulopoulos, and Dan Roth. 2016. Revisiting the Evaluation for Cross Document Event Coreference. In COLING.

Lucy Vanderwende, Michele Banko, and Arul Menezes. 2004. Event-centric summary generation. Working Notes of DUC, pages 127-132.

Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Proceedings of the 6th Conference on Message Understanding, pages 45-52. ACL.

Piek Vossen and Agata Cybulska. 2017. Identity and granularity of events in text. arXiv preprint arXiv:1704.04259.

Travis Wolfe, Mark Dredze, and Benjamin Van Durme. 2015. Predicate argument alignment using a global coherence model. In Proceedings of the 2015 Conference of NAACL: Human Language Technologies, pages 11-20.

Travis Wolfe, Benjamin Van Durme, Mark Dredze, Nicholas Andrews, Charley Beller, Chris Callison-Burch, Jay DeYoung, Justin Snyder, Jonathan Weese, Tan Xu, et al. 2013. PARMA: A Predicate Argument Aligner. In ACL (2), pages 63-68.

Bishan Yang, Claire Cardie, and Peter Frazier. 2015. A Hierarchical Distance-dependent Bayesian Model for Event Coreference Resolution.