CD²CR: Co-reference Resolution Across Documents and Domains
James Ravenscroft, Arie Cattan, Amanda Clare, Ido Dagan, Maria Liakata
Centre for Scientific Computing, University of Warwick, CV4 7AL, United Kingdom; Department of Computer Science, Aberystwyth University, SY23 3DB, United Kingdom; Alan Turing Institute, 96 Euston Rd, London, NW1 2DB, United Kingdom; Filament AI, 1 King William St, London, EC4N 7BJ, United Kingdom; Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel
Abstract
Cross-document co-reference resolution (CDCR) is the task of identifying and linking mentions to entities and concepts across many text documents. Current state-of-the-art models for this task assume that all documents are of the same type (e.g. news articles) or fall under the same theme. However, it is also desirable to perform CDCR across different domains (type or theme). A particular use case we focus on in this paper is the resolution of entities mentioned across scientific work and newspaper articles that discuss them. Identifying the same entities and corresponding concepts in both scientific articles and news can help scientists understand how their work is represented in mainstream media. We propose a new task and English-language dataset for cross-document cross-domain co-reference resolution (CD²CR). The task aims to identify links between entities across heterogeneous document types. We show that in this cross-domain, cross-document setting, existing CDCR models do not perform well and we provide a baseline model that outperforms current state-of-the-art CDCR models on CD²CR. Our dataset, annotation tool and guidelines, as well as our model for cross-document cross-domain co-reference, are all supplied as open-access, open-source resources.
1 Introduction

Cross-document co-reference resolution (CDCR) is the task of recognising when multiple documents mention and refer to the same real-world entity or concept. CDCR is a useful NLP process that has many downstream applications. For example, CDCR carried out on separate news articles that refer to the same politician can facilitate the inter-document sentence alignment required for stance detection and natural language inference models. Furthermore, CDCR can improve information retrieval and multi-document summarisation by grouping documents based on the entities that are mentioned within them.

Recent CDCR work (Dutta and Weikum, 2015; Barhom et al., 2019; Cattan et al., 2020) has primarily focused on resolution of entity mentions across news articles. Despite differences in tone and political alignment, most news articles are relatively similar in terms of grammatical and lexical structure. Models based on modern transformer networks such as BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018) have been pre-trained on large news corpora and are therefore well suited to news-based CDCR (Barhom et al., 2019).

However, there are cases where CDCR across documents from different domains (i.e. domains that differ much more significantly in style, vocabulary and structure) is useful. One such example is the task of resolving references to concepts across scientific papers and related news articles. This can help scientists understand how their work is being presented to the public by mainstream media, or facilitate fact checking of journalists' work (Wadden et al., 2020). A chatbot or recommender that is able to resolve references to current affairs in both news articles and user input could be more effective at suggesting topics that interest the user. Finally, it may be helpful for e-commerce companies to know when product reviews gathered from third-party websites refer to one of their own listings. The work we present here focuses on the first cross-document, cross-domain co-reference resolution (CD²CR) use case, namely co-reference resolution between news articles and scientific papers.

The objective of CD²CR is to identify co-referring entities from documents belonging to different domains. In this case co-reference resolution is made more challenging by the differences in language use (lexical but also syntactic) across the different domains. Specifically, authors of scientific papers aim to communicate novel scientific work in an accurate and unambiguous way by using precise scientific terminology. Whilst scientific journalists also aim to accurately communicate novel scientific work, their work is primarily funded by newspaper sales and thus they also aim to captivate as large an audience as possible. Therefore journalists tend to use simplified vocabulary and structure, a creative and unexpected writing style, slang, simile, metaphor and exaggeration to make their work accessible, informative and entertaining in order to maximise readership (Louis and Nenkova, 2013). Success at the CD²CR task in this setting depends on a context-sensitive understanding of how the accessible but imprecise writing of journalists maps on to the precise terminology used in scientific writing. For example, a recent study found that "convalescent plasma derived from donors who have recovered from COVID-19 can be used to treat patients sick with the disease" (DOI: 10.1101/2020.03.16.20036145).
A news article discussing this work says that "...blood from recovered Covid-19 patients in the hope that transfusions...[can help to treat severely ill patients]" (https://tinyurl.com/ycnq9xg7). In this example the task is to link 'blood' to 'convalescent plasma' and 'recovered Covid-19 patients' to 'donors'. These cross-document, cross-domain co-reference chains can be used as contextual anchors for downstream analysis of the two document settings via tasks such as natural language inference, stance detection and frame analysis.

The contributions in this paper are the following:

• A novel task setting for CDCR that is more challenging than those that already exist due to linguistic variation between different domains and document types (we call this CD²CR).
• An open-source English-language CD²CR dataset with 7602 co-reference pair annotations over 528 documents and detailed 11-page annotation guidelines (Section 3.1).
• A novel annotation tool to support ongoing data collection and annotation for CD²CR, including a novel sampling mechanism for calculating inter-annotator agreement (Section 3.4).
• A series of experiments on our dataset using different baseline models and an in-depth capability-based evaluation of the best-performing baseline (Section 5).

2 Related Work

2.1 Cross-Document Co-reference Resolution

Intra-document co-reference resolution is a well-understood task with mature training datasets (Weischedel et al., 2013) and academic tasks (Recasens et al., 2010). The current state-of-the-art model by Joshi et al. (2020) is based on Lee et al. (2017, 2018) and uses a modern BERT-based (Devlin et al., 2019) architecture. Comparatively, CDCR, which involves co-reference resolution across multiple documents, has received less attention in recent years (Bagga and Baldwin, 1998; Rao et al., 2010; Dutta and Weikum, 2015; Barhom et al., 2019). Cattan et al. (2020) jointly learn both entity and event co-reference tasks, achieving current state-of-the-art performance for CDCR, and as such provide a strong baseline for experiments in CD²CR. Both the Cattan et al. (2020) and Barhom et al. (2019) models are trained and evaluated using the ECB+ corpus (Cybulska and Vossen, 2014), which contains news articles annotated with both entity and event mentions.
2.2 Entity Linking

Entity Linking (EL) focuses on the alignment of mentions in documents to resources in an external knowledge resource (Ji et al., 2010) such as SNOMED CT (https://tinyurl.com/yy7g4ttz) or DBPedia (https://wiki.dbpedia.org/). EL is challenging due to the large number of pairwise comparisons between document mentions and knowledge resource entities that may need to be carried out. Raiman and Raiman (2018) provide state-of-the-art performance by building on Ling et al. (2015)'s work, in which an entity type system is used to limit the number of required pairwise comparisons to related types. Yin et al. (2019) achieved comparable results using a graph-traversal method to similarly constrain the problem space to candidates within a similar graph neighbourhood. EL can be considered a narrow sub-task of CDCR since it cannot resolve novel and rare entities or pronouns (Shen et al., 2015). Moreover, EL's dependency on expensive-to-maintain external knowledge graphs is also problematic when limited human expertise is available. Given these limitations, EL is inappropriate within our task setting, hence our CDCR-based approach.

2.3 Language Models and Lexical Knowledge

Like earlier static vector language models, contextual language models such as BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018) use the distributional knowledge (Harris, 1954) inherent in large text corpora to learn context-aware word embeddings that can be used for downstream NLP tasks. However, these models do not learn about formal lexical constraints, often conflating different types of semantic relatedness (Ponti et al., 2018; Lauscher et al., 2020). This is a weakness of all distributional language models that is particularly problematic in the context of CD²CR for entity mentions that are related but not co-referent (e.g. "Mars" and "Jupiter"), as shown in Section 5. A number of solutions have been proposed for adding lexical knowledge to static word embeddings (Yu and Dredze, 2014; Wieting et al., 2015; Ponti et al., 2018) but contextual language models have received comparatively less attention. Lauscher et al. (2020) propose adding a lexical relation classification step to BERT's language model pre-training phase to allow the model to integrate both lexical and distributional knowledge. Their model, LIBERT, has been shown to facilitate statistically significant performance boosts on a variety of downstream NLP tasks.
3 The CD²CR Dataset

Our dataset is composed of pairs of news articles and scientific papers gathered automatically (Section 3.1). Our annotation process begins by obtaining summaries of the news and science document pairs (extractive news summaries and scientific abstracts, respectively) (Section 3.2). Candidate co-reference pairs from each summary-abstract pair are identified and scored automatically (Section 3.3). Candidate co-reference pairs are then presented to human annotators via a bespoke annotation interface for scoring (Section 3.4). Annotation quality is measured on an ongoing basis as new candidates are added to the system (Section 3.5).
3.1 Document Collection

We have developed a novel dataset that allows us to train and evaluate a CD²CR model. The corpus is approximately 50% of the size of the ECB+ corpus (918 documents) (Cybulska and Vossen, 2014) and is split into training, development and test sets (statistics for each subset are provided in Table 1).
Subset | Documents | Mentions | Clusters
Train  | 300       | 4,604    | 426
Dev    | 142       | 1,821    | 199
Test   | 86        | 1,177    | 101

Table 1: Total individual documents, mentions and co-reference clusters of each subset, excluding singletons.
Each pair of documents consists of a scientific paper and a newspaper article that discusses the scientific work. In order to detect pairs of documents, we follow the approach of Ravenscroft et al. (2018), using approximate matching of author name and affiliation metadata and date of publication, and exact DOI matching where available, to connect news articles to scientific publications.

We built a web scraper that scans for new articles from the 'Science' and 'Technology' sections of 3 well-known online news outlets (BBC, The Guardian, New York Times) and press releases from EurekAlert, a widely popular scientific press release aggregator. Once a newspaper article and related scientific paper are detected, the full text of the news article and the scientific paper abstract and metadata are stored. Where available, the full scientific paper content is also collected. We ran the scraper between April and June 2020, collecting news articles and scientific papers, including preprints, discussing a range of topics such as astronomy, computer science and biology (incl. coverage of COVID-19). New relevant content is downloaded and ingested into our annotation tool (see Section 3.4) on an ongoing basis as it becomes available.
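A minimal sketch of this pairing heuristic, under stated assumptions, is given below. The record shapes, similarity function and thresholds are illustrative choices on our part rather than the exact implementation, which follows Ravenscroft et al. (2018) and also makes use of affiliation metadata.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher
from typing import List, Optional

@dataclass
class News:
    mentioned_doi: Optional[str]   # a DOI found in the article body, if any
    author_names: List[str]        # scientist names quoted in the article
    published: date

@dataclass
class Paper:
    doi: str
    author_names: List[str]
    published: date

def name_sim(a: str, b: str) -> float:
    """Cheap approximate string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_pair(news: News, paper: Paper,
            name_threshold: float = 0.85, max_day_gap: int = 90) -> bool:
    """Decide whether a news article reports on a given paper."""
    # Exact DOI matching is used where a DOI is available.
    if news.mentioned_doi is not None:
        return news.mentioned_doi == paper.doi
    # Otherwise require a plausible publication date gap and at least one
    # approximately matching author name.
    if abs((news.published - paper.published).days) > max_day_gap:
        return False
    return any(name_sim(n, p) >= name_threshold
               for n in news.author_names for p in paper.author_names)
```

In practice the DOI branch is the high-precision path and the fuzzy author/date branch trades some precision for recall on articles that never cite a DOI.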
3.2 Document Summaries

Newspaper articles and scientific papers are long and often complex documents, usually spanning multiple pages, particularly the latter. Moreover, the two document types differ significantly in length. Comparing documents of such uneven length is a difficult task for human annotators. We also assume that asking human annotators to read the documents in their entirety to identify co-references would be particularly hard, with a very low chance of good inter-annotator agreement (IAA). We therefore decided to simplify the task by asking annotators to compare summaries of the newspaper article (5-10 sentences long) and the scientific paper (its abstract). For each document pair, we ask the annotators to identify co-referent mentions between the scientific paper abstract and a summary of the news article that is of similar length (e.g. 5-10 sentences). Scientific paper abstracts act as a natural summary of a scientific work and have been used as a strong baseline or even a gold standard in scientific summarisation tasks (Liakata et al., 2013). Furthermore, abstracts are almost always openly available, whereas full-text articles are often behind paywalls. For news summarisation, we used a state-of-the-art extractive model (Grenander et al., 2019) to extract sentences forming a summary of the original text. This model provides a summary de-biasing mechanism preventing it from focusing on specific parts of the full article, preserving the summary's informational authenticity as much as possible. A simple stand-in for this summarisation step is sketched at the end of this section.

The difference in style between the two document types is preserved by both types of summary, since abstracts are written in the same scientific style as full papers and the extractive summaries use verbatim excerpts of the original news articles.
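The sketch below shows how a (news summary, abstract) annotation pair could be assembled. The extractive step here is a deliberately naive stand-in (ranking sentences by named-entity density), not the de-biased Grenander et al. (2019) model our pipeline actually uses, and it assumes spaCy's en_core_web_sm model is installed.

```python
from typing import Tuple
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; any pipeline with NER works

def extractive_summary(news_text: str, k: int = 8) -> str:
    """Naive extractive stand-in: keep the k sentences containing the most
    named entities, restored to their original document order."""
    sents = list(nlp(news_text).sents)
    top = sorted(sents, key=lambda s: len(s.ents), reverse=True)[:k]
    return " ".join(s.text for s in sorted(top, key=lambda s: s.start))

def annotation_pair(news_text: str, abstract: str) -> Tuple[str, str]:
    """The (news summary, abstract) pair shown to annotators; the abstract
    is used verbatim as the scientific summary."""
    return extractive_summary(news_text), abstract
```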
3.3 Candidate Mention Generation

To populate our annotation tool, we generate pairs of candidate cross-document mentions to be evaluated by the user. Candidate mentions are identified by using spaCy (Honnibal and Montani, 2017) for the recognition of noun phrases and named entities in each input document pair (abstract-news summary). For each pair of documents, all possible cross-document mention combinations are generated and stored for annotation.

In any given pair of documents, the majority of mention pairs (M1, M2) generated automatically in this way will not co-refer, resulting in a vastly imbalanced dataset and also running the risk of demotivating annotators. To ensure that annotators are exposed to both positive and negative examples, we use a similarity score to rank examples based on how likely they are to co-refer. The first step in generating a similarity score s is to feed each concatenated abstract-news-summary pair, "summary [SEP] abstract", into a pre-trained BERT-large model. Then we take the mean of the word vectors that correspond to the mention spans within the documents and calculate the cosine similarity of these vectors. We find that this BERT-based similarity score performs well in practice. We also use it in combination with a thresholding policy as one of our baseline models in Section 4. We sketch this generation and scoring step in code at the end of Section 3.4 below.

3.4 Annotation Tool

We developed an open-source annotation tool (https://github.com/ravenscroftj/cdcrtool) that allows humans to identify cross-document co-reference between each pair of related documents. Whilst designing this tool, we made a number of decisions to simplify the task and provide clear instructions for the human annotators in order to encourage consistent annotation behaviour. To maximise the quality and consistency of annotations in our corpus, we simplified the task as much as possible for the end user. Annotation tasks were framed as a single yes-or-no question: "Are x and y mentions of the same entity?". Mentions in context were shown in bold font, whereas mentions already flagged as co-referent were shown in green. This enabled annotators to understand the implications for existing co-reference chains before responding (see Figure 3). Questions were generated and ranked via our task generation pipeline (see Section 3.3 above).

We added two additional features to our annotation interface to improve annotators' experience and to speed up the annotation process. Firstly, if the candidate pair is marked as co-referent, the user is allowed to add more mentions to the co-reference cluster at once. Secondly, inspired by Li et al. (2020), if the automatically shown mention pair is not co-referent, the user can select a different mention that is co-referent. The upstream automated mention detection mechanism can sometimes introduce incomplete or erroneous mentions, leading to comparisons that do not make sense or that are particularly difficult. Therefore, annotators can also move or resize the mention spans they are annotating. We use the string offsets of mention span pairs to check that spans do not overlap with each other, in order to prevent the creation of duplicates. Figure 1 shows an illustrated example of the generation pipeline for mention pairs.
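The candidate generation and scoring step from Section 3.3 is sketched below, assuming HuggingFace Transformers and spaCy. The bert-large-uncased checkpoint matches the "pre-trained BERT-large model" described above, but the spaCy model name, helper names and overlap test are illustrative assumptions rather than the exact implementation.

```python
import itertools
import spacy
import torch
from transformers import AutoModel, AutoTokenizer

nlp = spacy.load("en_core_web_sm")
tok = AutoTokenizer.from_pretrained("bert-large-uncased")
bert = AutoModel.from_pretrained("bert-large-uncased").eval()

def mentions(text: str):
    """Candidate mentions: noun chunks plus named entities, as char spans."""
    doc = nlp(text)
    return [(s.start_char, s.end_char, s.text)
            for s in itertools.chain(doc.noun_chunks, doc.ents)]

@torch.no_grad()
def mention_vectors(summary: str, abstract: str):
    """Encode 'summary [SEP] abstract' once, then mean-pool the wordpiece
    vectors under each mention span to get one vector per mention."""
    enc = tok(summary, abstract, return_offsets_mapping=True,
              return_tensors="pt", truncation=True)
    hidden = bert(input_ids=enc["input_ids"],
                  attention_mask=enc["attention_mask"]).last_hidden_state[0]
    seq_ids = enc.sequence_ids(0)          # 0 = summary, 1 = abstract
    offsets = enc["offset_mapping"][0].tolist()

    def pool(segment, lo, hi):
        rows = [hidden[i] for i, (s, (a, b)) in enumerate(zip(seq_ids, offsets))
                if s == segment and a < hi and b > lo]   # wordpiece overlaps span
        return torch.stack(rows).mean(0) if rows else None

    news_vecs = [(m, pool(0, m[0], m[1])) for m in mentions(summary)]
    abs_vecs = [(m, pool(1, m[0], m[1])) for m in mentions(abstract)]
    return news_vecs, abs_vecs

def ranked_candidates(summary: str, abstract: str):
    """All cross-document mention pairs, ranked by BERT cosine similarity."""
    news_vecs, abs_vecs = mention_vectors(summary, abstract)
    pairs = [(m1[2], m2[2], torch.cosine_similarity(v1, v2, dim=0).item())
             for (m1, v1), (m2, v2) in itertools.product(news_vecs, abs_vecs)
             if v1 is not None and v2 is not None]
    return sorted(pairs, key=lambda p: -p[2])
```

Ranking by this score ensures annotators see plausible positives near the top of the queue rather than an unbroken run of easy negatives.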
Figure 1: Illustration of the generation process for pairs of potentially co-referring expressions. Left boxes represent a related news summary (top) and abstract (bottom); co-referent entity pairs in the middle boxes are shown with the same formatting (underline, italic).

3.5 Annotation Quality

We recruited three university-educated human annotators and provided them with detailed annotation guidelines for the resolution of yes/no questions on potentially co-referring entities drawn from the ordered queue described above. By default each entity pair resolution is carried out once, allowing us to quickly expand our dataset. However, we pseudo-randomly sample 5% of mention pairs in order to calculate inter-annotator agreement (IAA) and make sure that the data collected from the tool is consistent and suitable for modelling. New entity pairs for IAA are continually sampled as new document pairs and mention tuples are added to the corpus by the web scraper (Section 3.1). The annotation system puts mention pairs flagged for IAA first in the annotation queue. Thus, all annotators are required to complete IAA comparisons before moving on to novel mention pairs. This allows us to ensure that all annotators are well represented in the IAA exercise. To avoid annotators being faced with a huge backlog of IAA comparisons before being able to proceed with novel annotations, we also limited the number of IAA comparisons required of each user to a maximum of 150 per week. This queueing policy is sketched below.

We anticipated that annotation of the CD²CR corpus would be difficult in nature due to its dependencies on context and lexical style. We invited users to provide feedback regularly to help us refine and clarify our guidelines and annotation tool in an iterative fashion. Users could alert us to examples they found challenging by flagging them as difficult in the tool. Qualitative analysis of the subset of 'difficult' cases showed that the resolution of mention pairs is often perceived by annotators as difficult when:

• Deep subject-matter expertise is required to understand the mentions, e.g. is "jasmonic acid" the same as "regulator cis-(+)-12-oxophytodienoic acid"?
• Mentions involve non-commutable set membership ambiguity, e.g. "Diplodocidae" and "the dinosaurs".
• Mentions are context dependent, e.g. "the struggling insect" and "the monarch butterfly".

This feedback prompted the introduction of highlighting for existing co-reference chains in the user interface (as described in Section 3.4 above) to make it easier to tell when non-commutable set membership would likely introduce inconsistencies into the dataset. For mention pairs requiring subject-matter expertise, annotators were encouraged to research the terms online. For context-sensitive mention pairs, annotators were encouraged to read the full news article and full scientific paper in order to make a decision. In our 11-page annotation guidelines document (appendix) we describe the use of our annotation tool and illustrate some challenging CD²CR tasks and resolution strategies. For example, precise entities mentioned in the scientific document may be referenced using ambiguous exophoric mentions in the news article (e.g. 'a mountain breed of sheep' vs 'eight ovis aries'). Our guidelines require resolving these cases based on the journalist's intent (e.g. 'a mountain breed' refers to the 'ovis aries' sheep involved in the experiment).
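A minimal sketch of the IAA routing policy follows. The in-memory queues and the fixed random seed are illustrative; the deployed tool implements the same policy against its own storage.

```python
import random
from collections import deque

IAA_RATE = 0.05        # fraction of mention pairs every annotator must label
IAA_WEEKLY_CAP = 150   # max IAA comparisons per annotator per week

iaa_queue, normal_queue = deque(), deque()

def ingest(pairs, rng=random.Random(42)):
    """Pseudo-randomly route newly scraped mention pairs to the shared IAA
    queue (labelled by all annotators) or the single-annotation queue."""
    for pair in pairs:
        (iaa_queue if rng.random() < IAA_RATE else normal_queue).append(pair)

def next_task(iaa_done_this_week: int):
    """IAA pairs are served first, up to the weekly cap, then novel pairs."""
    if iaa_queue and iaa_done_this_week < IAA_WEEKLY_CAP:
        return iaa_queue.popleft()
    return normal_queue.popleft() if normal_queue else None
```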
We evaluated the final pairwise agreement between annotators using Cohen's Kappa (Cohen, 1960) (κ_cohen) and an aggregate 'n-way' agreement score using Fleiss' Kappa (Fleiss, 1971) (κ_fleiss). Pairwise κ_cohen is shown in Table 2 along with the total number of tasks each annotator completed. Annotator 3 (A3) shows the most consistent agreement with the other two annotators. Our Fleiss' Kappa analysis of tasks common across the three annotators gave a κ_fleiss score indicating moderate agreement. We note that Fleiss' Kappa is a relatively harsh metric and that values, like ours, between 0.41 and 0.60 are considered to demonstrate 'moderate agreement' (Landis and Koch, 1977).

Annotator | Annotations | A1    | A2    | A3
A1        | 10,685      | -     | 0.492 | 0.600
A2        | 3,051       | 0.492 | -     |
A3        |             | 0.600 |       | -

Table 2: Number of annotations and pairwise Cohen's Kappa scores (κ_cohen) between annotators.

We also carried out a Fleiss' Kappa analysis on the subset of mention pairs that were completed by all annotators and were also marked as difficult by at least one user (180 mention pairs in total). We found that for this subset of pairs the κ_fleiss score was lower, in the range considered to indicate only fair agreement (Landis and Koch, 1977).
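Both agreement statistics can be reproduced with standard libraries. The sketch below uses scikit-learn for Cohen's Kappa and statsmodels for Fleiss' Kappa; the toy labels stand in for the real annotations exported from the tool.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy yes/no judgements (1 = co-referent) over mention pairs that all three
# annotators completed; real labels come from the annotation tool's IAA set.
a1 = np.array([1, 0, 1, 1, 0, 0, 1, 0])
a2 = np.array([1, 0, 1, 0, 0, 0, 1, 1])
a3 = np.array([1, 0, 1, 1, 0, 1, 1, 0])

# Pairwise Cohen's kappa, as reported in Table 2.
print("k(A1,A2) =", cohen_kappa_score(a1, a2))
print("k(A1,A3) =", cohen_kappa_score(a1, a3))

# n-way Fleiss' kappa over the tasks common to all annotators:
# aggregate_raters converts (items x raters) labels to category counts.
table, _ = aggregate_raters(np.stack([a1, a2, a3], axis=1))
print("k_fleiss =", fleiss_kappa(table))
```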
4 Baseline Models

Below we describe several baseline models, including state-of-the-art CDCR models, that we use to evaluate how well current approaches transfer to our CD²CR task setting.

4.1 BCOS: BERT Cosine Similarity Threshold

In this model we calculate the cosine similarity between embeddings of the two mentions in context (M1, M2), encoded using a pre-trained BERT model as discussed above in Section 3.3. We define a thresholding function f to decide whether M1 and M2 are co-referent (f = 1) or not (f = 0):

$f(M_1, M_2) = \begin{cases} 1 & \text{if } \mathrm{COSSIM}(M_1, M_2) \geq t \\ 0 & \text{otherwise} \end{cases}$

During inference, we pass this function over all pairs (M1, M2) and infer missing links such that if f(A, B) = 1 and f(B, C) = 1 then f(A, C) = 1. Based on Figure 2, we test values in increments of 0.01 between 0.3 and 0.8 inclusive for the threshold cut-off t. We evaluated the baseline by measuring its accuracy at predicting co-reference for each mention pair in the CD²CR development set. The best performance was attained when t = 0.65. A visualisation of the BERT cosine similarity distributions of co-referent and non-co-referent annotated mention pairs can be seen in Figure 2. Co-referent mention pairs tend to have a slightly higher BERT cosine similarity than non-co-referent mention pairs, but there is significant overlap between the two distributions, suggesting that in many cases BERT similarity is too simplistic a measure.

Figure 2: BERT cosine similarity frequency distribution for co-referent (Yes) and non-co-referent (No) mention pairs in the CD²CR corpus.
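A sketch of the BCOS decision rule and its transitive closure follows. Union-find is our illustrative choice for realising the closure (connected components of the thresholded similarity graph), not necessarily the original implementation.

```python
from typing import List
import torch

def bcos_clusters(vecs: List[torch.Tensor], t: float = 0.65) -> List[int]:
    """BCOS baseline: link mention pairs with cosine similarity >= t and take
    the transitive closure, so f(A,B) = f(B,C) = 1 implies A, B and C share
    a cluster. Returns a cluster id per mention."""
    parent = list(range(len(vecs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            sim = torch.cosine_similarity(vecs[i], vecs[j], dim=0).item()
            if sim >= t:
                parent[find(i)] = find(j)   # union: merge the two clusters

    return [find(i) for i in range(len(vecs))]
```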
4.2 CA: Cattan et al. Baseline

We use a state-of-the-art model for cross-document co-reference resolution (Cattan et al., 2020), henceforth CA. In this model, each document is separately encoded using a RoBERTa encoder (without fine-tuning) to get contextualised representations for each token. Then, similarly to the within-document co-reference model of Lee et al. (2017), the mention spans are represented by the concatenation of four vectors: the vectors of the first and last tokens in the span, an attention-weighted sum of the span token vectors, and a feature vector encoding the span width. Two mention representations are then concatenated and fed to a feed-forward network to learn a likelihood score for whether the two mentions co-refer. At inference time, agglomerative clustering is applied to the pairwise scores to form co-reference clusters.

The CA model is trained to perform both event and entity recognition on the ECB+ corpus (Cybulska and Vossen, 2014). In our setting there is no event detection subtask so, for fair comparison, we pre-train the CA model on ECB+ entity annotations only and evaluate it on our new CD²CR task to see how well it generalises to our task setting. The span representation and pairwise scorer are sketched below.
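A hedged PyTorch sketch of the span representation and pairwise scorer described above follows. Layer sizes, the width-embedding capacity and the single-hidden-layer scorer are illustrative assumptions; the original CA implementation may differ in these details.

```python
import torch
import torch.nn as nn

class SpanScorer(nn.Module):
    """Sketch of the CA span representation and pairwise scorer.

    A span is represented as [first token; last token; attention-weighted
    sum of span tokens; width embedding]; two span representations are
    concatenated and fed to a feed-forward network that scores the pair.
    """

    def __init__(self, hidden: int = 1024, width_dim: int = 20,
                 max_width: int = 30):
        super().__init__()
        self.attn = nn.Linear(hidden, 1)          # per-token attention logit
        self.width_emb = nn.Embedding(max_width, width_dim)
        span_dim = 3 * hidden + width_dim
        self.ffnn = nn.Sequential(
            nn.Linear(2 * span_dim, 1024), nn.ReLU(), nn.Linear(1024, 1))

    def span_repr(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (span_len, hidden) contextual vectors for one mention."""
        weights = torch.softmax(self.attn(tokens), dim=0)     # (span_len, 1)
        head = (weights * tokens).sum(dim=0)                  # weighted sum
        width = min(tokens.size(0), self.width_emb.num_embeddings) - 1
        return torch.cat([tokens[0], tokens[-1], head,
                          self.width_emb(torch.tensor(width))])

    def forward(self, span_a: torch.Tensor, span_b: torch.Tensor):
        pair = torch.cat([self.span_repr(span_a), self.span_repr(span_b)])
        return self.ffnn(pair)    # higher score = more likely co-referent
```

At inference time the pairwise scores would be converted to distances and passed to an agglomerative clustering routine to produce the final co-reference clusters.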
Figure 3: An example of a cross-document co-reference task presented within our annotation tool.

4.3 CA-FT: Fine-Tuned CA

Here we aim to evaluate whether fine-tuning the CA model from Section 4.2 on the CD²CR corpus can improve its performance in the new task setting. The CA model is first trained on the ECB+ corpus in the manner described above. We then further fine-tune the feed-forward model (without affecting the RoBERTa encoder) on the CD²CR corpus for 10 epochs with early stopping. Pseudo-random sub-sampling is carried out on the training set to ensure a balance of co-referent and non-co-referent mention pairs.
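A minimal sketch of this training setup, reusing the SpanScorer sketch from Section 4.2; the exact RoBERTa checkpoint, optimiser and learning rate are assumptions, as the paper does not specify them.

```python
import torch
from transformers import AutoModel

encoder = AutoModel.from_pretrained("roberta-large")  # assumed variant
scorer = SpanScorer(hidden=1024)   # the sketch from Section 4.2 above

# CA-FT: the encoder stays frozen, exactly as during ECB+ training,
# and only the feed-forward scorer is fine-tuned on CD²CR.
for param in encoder.parameters():
    param.requires_grad = False
encoder.eval()

# Only the scorer's parameters receive gradient updates.
optimiser = torch.optim.Adam(scorer.parameters(), lr=1e-4)
```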
4.4 CA-V

Here we aim to evaluate whether training the CA model on the CD²CR dataset from the RoBERTa baseline, without first training on the ECB+ corpus, allows it to fit well to the new task setting. We re-initialise the CA encoder (Section 4.2) using weights from RoBERTa (Liu et al., 2019) and randomly initialise the remaining model parameters. We then train the model on the CD²CR corpus for up to 20 epochs with early stopping, with pseudo-random sub-sampling as above.
4.5 CA-S

This model is the same as CA-V but we replace the RoBERTa encoder with SciBERT (Beltagy et al., 2019), a version of BERT pre-trained on scientific literature, in order to test whether the scientific terms and context captured by SciBERT improve performance at the CD²CR task compared to RoBERTa. Similarly to CA-V in Section 4.4, we initialise the encoder with weights from SciBERT scivocab-uncased (Beltagy et al., 2019) and randomly initialise the remaining model parameters, training on the CD²CR corpus for up to 20 epochs with early stopping. The encoder swap for both CA-V and CA-S is sketched below.
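The encoder swap amounts to loading different pre-trained weights beneath an otherwise randomly initialised model. The checkpoints below are the standard public releases; whether the original code used the large or base RoBERTa variant is an assumption on our part, and SpanScorer is again the sketch from Section 4.2.

```python
from transformers import AutoModel, AutoTokenizer

# CA-V: encoder re-initialised from a RoBERTa checkpoint, no ECB+ training.
cav_tokenizer = AutoTokenizer.from_pretrained("roberta-large")
cav_encoder = AutoModel.from_pretrained("roberta-large")
cav_scorer = SpanScorer(hidden=1024)  # randomly initialised scorer

# CA-S: identical architecture, but the encoder (and its tokenizer) are
# swapped for SciBERT's scivocab-uncased checkpoint.
cas_tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
cas_encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
cas_scorer = SpanScorer(hidden=768)   # SciBERT uses BERT-base sized states
```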
5 Results and Discussion

We evaluate each of the baseline models described in Section 4 above on the test subset of our CD²CR corpus. Results are shown in Table 3. For the purposes of evaluation, we use the manually annotated named entity spans from CD²CR as the "gold standard" in all experiments, rather than using the end-to-end named entity recognition capabilities provided by some of the models. We evaluate the models using the metrics described by Vilain et al. (1995) (henceforth MUC) and Bagga and Baldwin (1998) (henceforth B³). MUC F1, precision and recall are defined in terms of pairwise co-reference relationships between mentions. B³ F1, precision and recall are defined in terms of the presence or absence of specific entities in the cluster. When measuring B³, we remove entities with no co-references (singletons) from the evaluation to avoid inflation of results (Cattan et al., 2020). A simplified implementation of the B³ calculation is sketched below.

The threshold baseline (BCOS) gives the highest MUC recall but also poor MUC precision and the poorest B³ precision. The B³ metric is highly specific with respect to false-positive entity mentions and strongly penalises BCOS for linking all non-co-referent pairs with COSSIM(M1, M2) ≥ 0.65. Furthermore, Figure 2 shows that a thresholding strategy is clearly sub-optimal given that there is significant overlap between co-referent and non-co-referent pairs, with only a small minority of pairs at the top and bottom of the distribution that do not overlap.
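For concreteness, a simplified B³ computation might look as follows. This is a sketch of the metric's definition (assuming gold mention spans, identical mention sets in key and response, and gold singletons removed), not the evaluation code we ran.

```python
from typing import Dict, Tuple

def b_cubed(gold: Dict[str, int], system: Dict[str, int]) -> Tuple[float, float, float]:
    """Simplified B³ (Bagga and Baldwin, 1998). `gold` and `system` map each
    mention id to a cluster id over the same mention set; assumes at least
    one non-singleton gold cluster."""
    def clusters(assignment):
        out = {}
        for mention, cluster in assignment.items():
            out.setdefault(cluster, set()).add(mention)
        return out

    gold_c, sys_c = clusters(gold), clusters(system)
    mentions = [m for m in gold if len(gold_c[gold[m]]) > 1]  # drop singletons

    # Per-mention overlap between its system cluster and its gold cluster.
    overlap = {m: len(sys_c[system[m]] & gold_c[gold[m]]) for m in mentions}
    p = sum(overlap[m] / len(sys_c[system[m]]) for m in mentions) / len(mentions)
    r = sum(overlap[m] / len(gold_c[gold[m]]) for m in mentions) / len(mentions)
    return p, r, 2 * p * r / (p + r)
```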
Table 3: MUC and B³ precision (P), recall (R) and F1 results from running the baseline models (BCOS, CA, CA-FT, CA-V, CA-S) on the CD²CR test subset; BCOS threshold = 0.65.

Test Type | Co-referent? | Pass Rate (Total Tests) | Example test case and outcome
Anaphora and exophora resolution | Yes | 47.1% (16/34) | M1: ...to boost the struggling insect's numbers... M2: the annual migration of the monarch butterfly... [PASS]
Anaphora and exophora resolution | No | 76.5% (26/34) | M1: ...monarchs raised in captivity... M2: ...rearing wild-caught monarchs in an indoor environment... [FAIL]
Subset relationship resolution | Yes | 24.3% (9/37) | M1: ...it was in fact a hive of human activity... M2: ...this region for Pre-Columbian cultural developments... [FAIL]
Subset relationship resolution | No | 60.0% (18/30) | M1: ...the carnivore's skull... M2: ...the gigantic extinct Agriotherium africanum... [FAIL]
Paraphrase resolution | Yes | 33.3% (13/39) | M1: ...a giant short-faced bear... M2: ...the gigantic extinct Agriotherium africanum... [PASS]
Paraphrase resolution | No | 80.5% (29/36) | M1: ...half the energy that existing techniques require... M2: ...the lack of efficient catalysts for ammonia synthesis... [FAIL]

Table 4: A breakdown of specific tests carried out on the CA-V model against three challenging types of relationship found in the CD²CR corpus. [PASS] or [FAIL] indicates CA-V model correctness. Pass rate is mathematically equivalent to recall for these test sets.

Therefore, despite its promising MUC F1 score, it is clear that BCOS is not useful in practical terms. Whilst our thresholding baseline above uses BERT, RoBERTa is used by Cattan et al. (2020) as the basis for their state-of-the-art model and thus for our models based on their work. Although the two models have the same architecture, RoBERTa has been shown to outperform BERT at a range of tasks (Liu et al., 2019). However, as shown in Figure 4, the cosine similarity distribution of mention pair embeddings produced by RoBERTa is compressed into a smaller area of the potential distribution space compared to that of BERT (Figure 2). This compression of similarities may imply a reduction in RoBERTa's ability to discriminate in our task setting. Liu et al. (2019) explain that their byte-pair encoding (BPE) mechanism, which expands RoBERTa's sub-word vocabulary and simplifies pre-processing, can reduce model performance for some tasks, although this is not further explored in their work. We leave further exploration of RoBERTa's BPE scheme and its effects on the CD²CR task setting to future work.

All of the models specifically trained on the CD²CR corpus (CA-V, CA-FT, CA-S) outperform the CA model by a large margin. Furthermore, the CA-V model (without pre-training on the ECB+ corpus) outperforms the CA-FT model (with ECB+ pre-training) by 6% MUC and 3% B³.
Figure 4: RoBERTa cosine similarity frequency distribution for co-referent (Yes) and non-co-referent (No) mention pairs in the CD²CR corpus. The distribution is compressed between 0.8 and 1.0.

These results suggest that the CD²CR task setting is distinct from the CDCR and ECB+ task settings and that this distinction is not solvable by fine-tuning alone. In terms of both MUC and B³, CA-S performs much worse than CA-V, suggesting that SciBERT embeddings are less effective than RoBERTa embeddings in this task setting. We hypothesise that SciBERT's specialisation towards scientific text may come at the cost of significantly worse news summary embeddings when compared to those produced by RoBERTa.

We next evaluate our best-performing CD²CR baseline model (CA-V) on the entity resolution CDCR task using the ECB+ test corpus, to see how well it generalises to the original CDCR task. Results are presented in Table 5 alongside Cattan et al.'s original model results (CA). The CA-V model still shows good performance, despite a small drop, when compared to the original CA model. The drop in B³ F1 is more pronounced than in MUC but is still broadly in line with other contemporary CDCR systems (Cattan et al., 2020). The CA-V model demonstrates a promising ability to generalise beyond our corpus to other tasks and reveals an interesting correspondence between the CDCR and CD²CR settings.
Table 5: MUC and B³ results from running the CD²CR baseline model (CA-V) on the ECB+ dataset, compared with the original Cattan et al. (2020) model (CA).
Finally, the best model (CA-V) is analysed using a series of challenging test cases inspired by Ribeiro et al. (2020). These test cases were created from 210 manually annotated mention pairs found in the test subset of the CD²CR corpus, organised according to the type of relationship illustrated (anaphora and exophora, subset relationships, paraphrases). We collected a balanced set of 30-40 examples of both co-referent and non-co-referent-but-challenging pairs for each type of relationship (exact numbers in Table 4). We then recorded whether the model correctly predicted co-reference for these pairs. The results, along with illustrative examples of each relationship type, are shown in Table 4. A sketch of this style of behavioural testing is given at the end of this section.

The results suggest that the model is better at identifying non-co-referent pairs than co-referent pairs and that it struggles with positive co-referent mentions for all three types of relationship. The model struggles to relate general, reader-friendly descriptions of entities from news articles to the precise and clinical descriptions found in scientific papers. The model often successfully identifies related concepts such as 'the carnivore's skull' and 'Agriotherium africanum'. However, it is unable to deal with the complexity of these relationships and appears to conflate 'related' with 'co-referent', which is likely due to a lack of lexical knowledge, as we discussed in Section 2.3. Figure 5 shows significant overlap between co-referent and non-co-referent RoBERTa-based cosine similarities, which can also be observed for the wider corpus in Figure 4, but is especially pronounced for these test examples. This overlap suggests that disentangling these pairs is likely to be a challenging task for the downstream classification layer in the CA-V model. These challenges are less likely to occur in homogeneous corpora like ECB+, where descriptions and relationships remain consistent in detail and complexity.

Figure 5: RoBERTa-based mention pair similarity frequency distribution for test examples from Table 4; 'yes' and 'no' denote 'co-referent' and 'not co-referent' respectively.
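A CheckList-style harness (after Ribeiro et al., 2020) for this analysis can be sketched in a few lines. Here `predict_coreferent` is a hypothetical stand-in for a wrapped CA-V model, and the cases shown are abbreviated from Table 4.

```python
def pass_rate(cases, predict_coreferent, slice_type, gold_label):
    """Pass rate for one Table 4 slice; for co-referent cases this is
    mathematically the model's recall on that slice."""
    relevant = [(m1, m2, gold) for m1, m2, kind, gold in cases
                if kind == slice_type and gold is gold_label]
    hits = sum(predict_coreferent(m1, m2) == gold for m1, m2, gold in relevant)
    return hits / len(relevant) if relevant else float("nan")

# Abbreviated cases in the spirit of Table 4: (mention1, mention2, type, gold).
CASES = [
    ("the struggling insect", "the monarch butterfly", "anaphora", True),
    ("monarchs raised in captivity",
     "rearing wild-caught monarchs in an indoor environment", "anaphora", False),
    ("the carnivore's skull", "the gigantic extinct Agriotherium africanum",
     "subset", False),
    ("a giant short-faced bear", "the gigantic extinct Agriotherium africanum",
     "paraphrase", True),
]
```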
6 Conclusion

We have defined cross-document, cross-domain co-reference resolution (CD²CR), a special and challenging case of cross-document co-reference resolution involving the comparison of mentions across documents of different types and/or themes. We have constructed a specialised, annotated CD²CR dataset, available, along with our annotation guidelines and tool, as a free and open resource for future research. We have shown that state-of-the-art CDCR models do not perform well on the CD²CR dataset without specific training. Furthermore, even with task-specific training, models perform modestly, leaving room for further research and improvement. Finally, we have shown that the understanding of semantic relatedness offered by current-generation transformer-based language models may not be precise enough to reliably resolve complex linguistic relationships such as those found in CD²CR, as well as in other types of co-reference resolution and relationship extraction tasks. The use of semantic enrichment techniques (such as those discussed in Section 2.3) to improve model performance on the CD²CR task should be investigated in future work.
Acknowledgments
This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1 and the University of Warwick's CDT in Urban Science under EPSRC grant EP/L016400/1.

References
Amit Bagga and Breck Baldwin. 1998. Entity-based cross-document coreferencing using the vector space model. In Proceedings of ACL '98/COLING '98, pages 79-85, USA. Association for Computational Linguistics.

Shany Barhom, Vered Shwartz, Alon Eirew, Michael Bugert, Nils Reimers, and Ido Dagan. 2019. Revisiting joint modeling of cross-document entity and event coreference resolution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4179-4189.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615-3620.

Arie Cattan, Alon Eirew, Gabriel Stanovsky, Mandar Joshi, and Ido Dagan. 2020. Streamlining cross-document coreference resolution: Evaluation and modeling. arXiv:2009.11032.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37-46.

Agata Cybulska and Piek Vossen. 2014. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 4545-4552.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171-4186. Association for Computational Linguistics.

Sourav Dutta and Gerhard Weikum. 2015. Cross-document co-reference resolution using sample-based clustering with knowledge enrichment. Transactions of the Association for Computational Linguistics, 3:15-28.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378-382.

Matt Grenander, Yue Dong, Jackie Chi Kit Cheung, and Annie Louis. 2019. Countering the effects of lead bias in news summarization via multi-stage training and auxiliary losses. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6019-6024.

Zellig S. Harris. 1954. Distributional structure. WORD, 10(2-3):146-162.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Heng Ji, Ralph Grishman, Hoa Trang Dang, Kira Griffitt, and Joe Ellis. 2010. Overview of the TAC 2010 knowledge base population track. In Proceedings of the 2010 Text Analysis Conference.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64-77.

J. R. Landis and G. G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159-174.

Anne Lauscher, Ivan Vulić, Edoardo Maria Ponti, Anna Korhonen, and Goran Glavaš. 2020. Specializing unsupervised pretraining models for word-level semantic similarity. Computing Research Repository, arXiv:1909.02339.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 188-197, Copenhagen, Denmark. Association for Computational Linguistics.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 687-692, New Orleans, Louisiana. Association for Computational Linguistics.

Belinda Z. Li, Gabriel Stanovsky, and Luke Zettlemoyer. 2020. Active learning for coreference resolution using discrete annotation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8320-8331.

Maria Liakata, Simon Dobnik, Shyamasree Saha, Colin Batchelor, and Dietrich Rebholz-Schuhmann. 2013. A discourse-driven content model for summarising scientific articles evaluated in a complex question answering task. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, pages 747-757. Association for Computational Linguistics.

Xiao Ling, Sameer Singh, and Daniel S. Weld. 2015. Design challenges for entity linking. Transactions of the Association for Computational Linguistics, 3:315-328.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. Computing Research Repository, arXiv:1907.11692.

Annie Louis and Ani Nenkova. 2013. A corpus of science journalism for analyzing writing quality. Dialogue and Discourse, 4(2):87-117.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2227-2237.

Edoardo Maria Ponti, Ivan Vulić, Goran Glavaš, Nikola Mrkšić, and Anna Korhonen. 2018. Adversarial propagation and zero-shot cross-lingual transfer of word vector specialization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 282-293.

Jonathan Raiman and Olivier Raiman. 2018. DeepType: Multilingual entity linking by neural type system evolution. In AAAI Conference on Artificial Intelligence.

Delip Rao, Paul McNamee, and Mark Dredze. 2010. Streaming cross document entity coreference resolution. In Coling 2010: Posters, pages 1050-1058, Beijing, China. Coling 2010 Organizing Committee.

James Ravenscroft, Amanda Clare, and Maria Liakata. 2018. HarriGT: A tool for linking news to science. In Proceedings of ACL 2018, System Demonstrations, pages 19-24.

Marta Recasens, Lluís Màrquez, Emili Sapena, M. Antònia Martí, Mariona Taulé, Véronique Hoste, Massimo Poesio, and Yannick Versley. 2010. SemEval-2010 task 1: Coreference resolution in multiple languages. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 1-8. Association for Computational Linguistics.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902-4912, Online. Association for Computational Linguistics.

W. Shen, J. Wang, and J. Han. 2015. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering, 27(2):443-460.

Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In MUC6 '95: Proceedings of the 6th Conference on Message Understanding, pages 45-52.

David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. Fact or fiction: Verifying scientific claims. Computing Research Repository, arXiv:2004.14974.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the Association for Computational Linguistics, 3:345-358.

Xiaoyao Yin, Yangchen Huang, Bin Zhou, Aiping Li, Long Lan, and Yan Jia. 2019. Deep entity linking via eliminating semantic ambiguity with BERT. IEEE Access, 7:169434-169445.
Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).