LOME: Large Ontology Multilingual Extraction

Patrick Xia*, Guanghui Qin*, Siddharth Vashishtha, Yunmo Chen, Tongfei Chen, Chandler May, Craig Harman, Kyle Rawlins, Aaron Steven White, Benjamin Van Durme
Johns Hopkins University, University of Rochester
{paxia,qin,vandurme}@jhu.edu
Abstract
We present LOME, a system for performing multilingual information extraction. Given a text document as input, our core system identifies spans of textual entity and event mentions with a FrameNet (Baker et al., 1998) parser. It subsequently performs coreference resolution, fine-grained entity typing, and temporal relation prediction between events. By doing so, the system constructs an event- and entity-focused knowledge graph. We can further apply third-party modules for other types of annotation, like relation extraction. Our (multilingual) first-party modules either outperform or are competitive with the (monolingual) state-of-the-art. We achieve this through the use of multilingual encoders like XLM-R (Conneau et al., 2020) and by leveraging multilingual training data. LOME is available as a Docker container on Docker Hub. In addition, a lightweight version of the system is accessible as a web demo.

As information extraction capabilities continue to improve due to advances in modeling, encoders, and data collection, we can now look (back) toward making richer predictions at the document level, with a large ontology, and across multiple languages. Recently, Li et al. (2020) noted that despite a growth of open-source NLP software in general, there is still a lack of available software for knowledge extraction. We wish to provide a starting point that allows others to build increasingly comprehensive document-level knowledge graphs of events and entities from text in many languages.

In Figure 1, we follow a multilingual input example. A sentence-level parser can identify both INGESTION events and their arguments. To connect these events cross-sententially, we can cluster coreferent mentions and predict the temporal relations between the events. We can also augment the entities with fine-grained entity types, like the rabbit entity with LIVING THING / ANIMAL.

* Equal contribution. Information on using the Docker container and the web demo at https://nlp.jhu.edu/demos.

Several prior packages have also used advances in state-of-the-art models to build comprehensive information extraction systems. Li et al. (2019) present an event, relation, and entity extraction and coreference system for three languages: English, Russian, and Ukrainian. Li et al. (2020, GAIA) extend that work to support cross-media documents. However, neither of these is truly multilingual: they are limited to three languages and rely on language-specific models to operate on monolingual documents. On the other hand, work focused on multilinguality is limited in its scope for extraction (Akbik and Li, 2016; Pan et al., 2017).

Like prior work, LOME is focused on extracting entities and events from raw text documents. However, LOME is language-agnostic; all components prioritize multilinguality. Using
XLM-R (Conneau et al., 2020) as the underlying encoder paves the way for both training on multilingual data (where it exists) and inference in many languages. XLM-R itself is trained on one hundred languages. Our pipeline includes a full FrameNet parser for events and their arguments, neural coreference resolution, an entity typing model over large ontologies, and temporal resolution between events.

Figure 1: Architecture of LOME. The system processes text documents as input and first uses a FrameNet parser to detect entities and events. Then, a suite of models enrich the entities and events with additional predictions. Each individual model can be trained and tuned independently, ensuring modularity of the pipeline. Annotations between models are transferred using CONCRETE, a data schema for NLP.

Our system is designed to be modular: each component is trained independently and tuned on task-specific data. To communicate between modules, we use CONCRETE (Ferraro et al., 2014), a data schema used in other text processing systems (Peng et al., 2015). One advantage of using a standardized data schema is that it enables modularization and extension. Unless there are annotation dependencies, individual modules can be inserted, replaced, merged, or bypassed depending on the application. We discuss two example applications of our CONCRETE-based modules, one of which further extracts relations while the other performs cross-sentence argument linking for events.
The overarching application of LOME is to extract an entity- and event-centric knowledge graph from a textual document. In particular, we are interested in using these graphs to support a multilingual schema induction task (KAIROS), for which data is currently being annotated by the LDC (Cieri et al., 2020). As a result, some parts of the system are designed for compatibility with the KAIROS event and entity ontology. Nonetheless, there is significant overlap with publicly available datasets, which we describe for those tasks.

Figure 1 presents the architecture of our pipeline. Besides the FrameNet parser, which is run first, the remaining modules can be run in any order, if at all. In addition, our use of a standardized data schema for communication allows for the integration of third-party systems. In this section, we will go into further detail for each task.
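The pipeline ordering described above (the FrameNet parser runs first; the remaining modules are optional and reorderable) can be sketched as follows. The Communication class and the module bodies here are simplified stand-ins for illustration, not the real CONCRETE schema or the trained models:

```python
from dataclasses import dataclass, field

# A minimal, hypothetical stand-in for a CONCRETE communication:
# one object that every module reads from and annotates in place.
@dataclass
class Communication:
    text: str
    mentions: list = field(default_factory=list)   # spans from the FrameNet parser
    clusters: list = field(default_factory=list)   # coreference output
    types: dict = field(default_factory=dict)      # entity typing output
    temporal: list = field(default_factory=list)   # temporal relations

def framenet_parse(comm):
    # Must run first: produces the candidate spans (dummy values here).
    comm.mentions = [(0, 3), (10, 16)]
    return comm

def coreference(comm):
    # Consumes mentions, produces clusters (dummy: one cluster of everything).
    comm.clusters = [list(range(len(comm.mentions)))]
    return comm

def entity_typing(comm):
    # Consumes mentions, produces types (dummy placeholder type).
    comm.types = {i: "/unknown" for i in range(len(comm.mentions))}
    return comm

# The parser is mandatory and first; the rest can be reordered,
# dropped, or replaced, mirroring the modularity described above.
pipeline = [framenet_parse, coreference, entity_typing]

def run(text, modules=pipeline):
    comm = Communication(text)
    for module in modules:
        comm = module(comm)
    return comm
```

Because every module shares the same consume-and-annotate signature, a third-party component can be spliced into `pipeline` without touching the others.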
FrameNet parsing is a semantic role labeling style task. The goal is to find all the frames and their roles, as well as the trigger spans associated with them in a sentence. Frames are concepts, such as events or entities, in a sentence. Every frame is associated with some roles, and both are triggered by spans in the sentence.

Unlike most previous work (Yang and Mitchell, 2017; Peng et al., 2018; Swayamdipta et al., 2018), our system is not conditioned on the trigger spans or frames. We perform "full parsing" (Das et al., 2014), where the input is a raw sentence and the output is the complete structure prediction.

As the first model in the pipeline, the FrameNet parser provides the trigger spans it finds as candidate spans for all other tasks.
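The two-stage recipe (find trigger spans with a BIO tagger, then label them and find roles conditioned on each predicted frame) can be sketched as below. The tagger and classifier arguments are hypothetical stand-ins for the learned XLM-R-based components:

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into (start, end) token spans, end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O":
            if start is not None:
                spans.append((start, i))
                start = None
        # "I" continues the current span
    if start is not None:
        spans.append((start, len(tags)))
    return spans

def parse_sentence(tokens, tag_frames, tag_roles, classify_frame, classify_role):
    """Full parsing sketch: first pass tags frame triggers; second pass,
    run per predicted frame, tags role spans conditioned on that frame."""
    parse = []
    for f_span in bio_to_spans(tag_frames(tokens)):
        frame = classify_frame(tokens, f_span)
        roles = [(r_span, classify_role(tokens, r_span, frame))
                 for r_span in bio_to_spans(tag_roles(tokens, frame))]
        parse.append((f_span, frame, roles))
    return parse
```

The conditioning shows up in the signature: `tag_roles` and `classify_role` both receive the frame predicted in the first stage.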
In coreference resolution, the goal is to cluster mentions in the text that refer to the same entity. Neural models for doing so typically encode the text first before identifying possible mentions (Lee et al., 2017; Joshi et al., 2019, 2020). These mentions are scored pairwise to determine whether two mentions refer to each other. These scores then determine coreference clusters by decoding under a variety of strategies (Lee et al., 2018; Xu and Choi, 2020).

In this work, we choose a constant-memory variant of that model which also achieves high performance (Xia et al., 2020). The motivation here is robustness: we prioritize the ability to soundly run on all document lengths over slightly better performing but fragile systems. In addition, because this coreference resolution model is part of a broader entity-centric system, the module used in this system does not perform the mention detection step (which is left to the FrameNet parser). Instead, both training and inference assume given mentions, and the task is therefore to perform linking.

Figure 2: A portion of the AIDA entity type ontology.

Entity typing assigns a fine-grained semantic label to a span of text, where the span is a mention of some entity found in our previous span finder module. Traditionally, labels include
PER, GPE, ORG, etc., but recent work in fine-grained entity typing seeks to classify spans into types defined by hierarchical type ontologies (e.g., BBN (Weischedel and Brunstein, 2005), FIGER (Ling and Weld, 2012), UltraFine (Choi et al., 2018), COLLIE (Allen et al., 2020)). Such ontologies refine coarse types like PER to fine-grained types such as /person/artist/singer that sit on a type hierarchy. A portion of the AIDA (LDC2019E07) ontology is illustrated in Figure 2.

We employ a recent coarse-to-fine-decoding entity typing model (Chen et al., 2020a) that is specifically designed to assign types defined by hierarchical ontologies. We swap the underlying encoder from ELMo (Peters et al., 2018) to XLM-R to be able to assign types over mentions in different languages using a single multilingual model, and to enable transfer between languages.

The base typing model in Chen et al. (2020a) supports entity typing on entity mentions. We extend this model to gain the ability to perform entity typing on entities, i.e., clusters of entity mentions. Since our decoder is coarse-to-fine and predicts a type at each level of the type hierarchy, we employ Borda voting on each level. Specifically, given a coreference chain comprising mentions m_1, ..., m_n, with s_{i,t} the score for mention m_i being typed as type t, we perform Borda counting to select the most confident type t* = argmax_t Σ_i r(i, t) over all t's in a specific type level, where r(i, t) = 1 / rank_t(s_{i,t}) is the ranking relevance score used in Borda counting.

UltraFine is slightly different in that the types are bucketed into 3 categories of different granularity, but without explicit subtyping relations.

The task of temporal relation extraction focuses on finding the chronology of events (e.g., Before, After, Overlaps) in text. Extracting temporal relations is useful for various downstream tasks: curating structured clinical data (Savova et al., 2010; Soysal et al., 2018), text summarization (Glavas and Snajder, 2014; Kedzie et al., 2015), question answering (Llorens et al., 2015; Zhou et al., 2019), etc. The task is most commonly viewed as a classification task: given a pair of events and their textual context, the temporal relation between them needs to be identified.

The construction of the TimeBank corpus (Pustejovsky et al., 2003) largely spurred research on temporal relation extraction. It included 14 temporal relation labels. Other corpora (Verhagen et al., 2007, 2010; Sun et al., 2013; Cassidy et al., 2014) reduced the labels to a smaller set owing to lower inter-annotator agreement and sparse annotations. Various types of models (Chambers et al., 2014; Cheng and Miyao, 2017; Leeuwenberg and Moens, 2017; Ning et al., 2019; Vashishtha et al., 2019; Zhou et al., 2020) have been used in recent years to extract temporal relations from text.

In this work, we use Vashishtha et al. (2019)'s best model and retrain it using XLM-R. We evaluate their model using the transfer learning approach described in their work and retrain it on TimeBank-Dense (TBD) (Cassidy et al., 2014). TBD uses a reduced set of 5 temporal relation labels: before, after, includes, is included, and vague.

Our system is modularized into separate models and libraries that communicate with each other using CONCRETE, a data format for richly annotating natural language documents (Ferraro et al., 2014). Each component is independent of the others, which allows for both inserting additional modules and deleting those provided in the default pipeline. We choose this loosely-affiliated design to enable faster and independent prototyping of individual components, as well as better compartmentalization of our models.

We emphasize that the system is a pipeline: while individual modules can be further improved, the system is not designed to be trained end-to-end, and benchmarking the richly-annotated output depends on the application and priorities. In this paper, we only benchmark individual components and describe a couple of applications.
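Returning to the coreference module: the pairwise-scoring-then-clustering recipe it builds on can be sketched as greedy antecedent linking. This is a toy illustration with a hand-written scorer; the real module uses learned XLM-R-based scores over gold mentions:

```python
def link_mentions(mentions, score):
    """Greedy antecedent linking: each mention links to its highest-scoring
    previous mention, or starts a new cluster if no score is positive.
    `score(a, b)` stands in for the learned pairwise scorer."""
    cluster_of = {}   # mention index -> cluster index
    clusters = []
    for i in range(len(mentions)):
        best, best_score = None, 0.0
        for j in range(i):
            s = score(mentions[j], mentions[i])
            if s > best_score:
                best, best_score = j, s
        if best is None:
            cluster_of[i] = len(clusters)
            clusters.append([i])
        else:
            cluster_of[i] = cluster_of[best]
            clusters[cluster_of[best]].append(i)
    return clusters
```

Decoding strategies differ across the cited systems; greedy best-antecedent linking is one simple instance of the family.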
The system can consume as input either tokenized or untokenized text; untokenized text is first tokenized either by whitespace or with a multilingual tokenizer, Polyglot. However, this tokenization is not necessarily used by all modules, which may choose either to operate on the raw text itself or on a SentencePiece (Kudo and Richardson, 2018) retokenization.

The system outputs a CONCRETE communication file for each input document. This output file contains annotations including entities, events, coreference, entity types, and temporal relations. The schema used is entirely self-contained, and the well-documented library also contains tools for visualizing and inspecting CONCRETE files. For the web demo, the output is displayed in the browser.
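Because some modules operate on the raw text while others consume a tokenization, keeping character offsets alongside tokens is what makes the two views interchangeable. A minimal offset-preserving tokenizer sketch (a simplified stand-in for the Polyglot/SentencePiece tokenizers the real system uses):

```python
import re

def tokenize_with_offsets(text):
    """Whitespace tokenization that records each token's character span,
    so span annotations can be mapped between the token view and the
    raw-text view of the same document."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\S+", text)]
```

A retokenization (e.g., into subwords) can then be aligned to the original annotations through these character spans rather than through token indices.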
The FrameNet parser is composed of an XLM-R encoder, a BIO tagger, and a typing module. It encodes the input sentences into a list of vectors, used by both the BIO tagger and the typing module. The goal of the BIO tagger is to find trigger spans, which are then labeled by the typing module. To parse a sentence, we run the model to find all frames, and then find their roles conditioned on the frames.

We train the FrameNet parser on the FrameNet v1.7 corpus following Das et al. (2014), with statistics in Table 1. We evaluate the results with exact matching as our metric, and obtain 63.91 labeled F1 and 69.01 unlabeled F1. Since we are not aware of previous work on both full parsing and a metric for its evaluation, we do not have a baseline. However, we can force the model to perform frame identification given the trigger span, like prior work. These results are shown in Table 2.

https://github.com/aboSamoor/polyglot
http://hltcoe.github.io/concrete/
A role is considered to be correctly predicted only when its frame is precisely predicted.
Table 1: Statistics of FrameNet v1.7
Model                     | Accuracy
Yang and Mitchell (2017)  | 88.2
Hermann et al. (2014)     | 88.4
Peng et al. (2018)        | 90.0
This work                 |

Table 2: Result on frame identification.
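The exact-match evaluation used above can be sketched as a set comparison between gold and predicted items; this is our reading of the metric, not the authors' evaluation script. For labeled F1 the items are (span, label) pairs; for unlabeled F1, just spans:

```python
def span_f1(gold, pred):
    """Exact-match F1 over two collections of items.
    Pass (span, label) pairs for labeled F1, bare spans for unlabeled F1."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # exact matches only
    p = tp / len(pred) if pred else 0.0        # precision
    r = tp / len(gold) if gold else 0.0        # recall
    return 2 * p * r / (p + r) if p + r else 0.0
```

Unlabeled F1 is necessarily at least as high as labeled F1, since dropping labels can only merge items, never separate matches, which is consistent with the 69.01 vs. 63.91 figures above.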
We retrain the model by Xia et al. (2020) with XLM-R (large) as the underlying encoder and with additional multilingual data. The model is a constant-memory variant of neural coreference resolution models. We refer the reader to Xia et al. (2020) for model and training details.

Unlike that work, we operate under the assumption that we are provided gold spans. This is motivated by the location of coreference in LOME. In addition, while they use a frozen encoder, we found that finetuning improves performance. Finally, we train on the full OntoNotes 5.0 (Weischedel et al., 2013; Pradhan et al., 2013), a subset of SemEval 2010 Task 1 (Recasens et al., 2010), and two additional sources of Russian data, RuCor (Toldova et al., 2014) and AnCor (Budnikov et al., 2019).

We benchmark the performance of our model on each language. We report the average F1 of MUC (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998), and CEAF_φ4 (Luo, 2005) by language in Table 3. We can compare the model's performance to monolingual gold-only baselines, where they exist. For English, we can train an identical model but instead use SpanBERT (Joshi et al., 2020), an English-only encoder finetuned for English OntoNotes coreference. That model achieves 92.2 average (dev.) F1, compared to our 92.7. There is also a comparable system for Russian AnCor from Le et al. (2019), which achieves 79.9 F1 using the model from Lee et al. (2018) and RuBERT (Kuratov and Arkhipov, 2019). This shows that our single multilingual model can perform similarly to monolingual models, with the advantage that our model does not need to perform language ID. This finding mirrors prior findings showing multilingual encoders are strong cross-lingually (Wu and Dredze, 2019).

We use AdamW.

Language    |     |     | Avg. F1
Arabic^O    | 359 | 44  | 71.3
Catalan^S   | 829 | 142 | 58.7
Chinese^OS  | 145 | 23  | 63.5
English^OS  | 80  | 17  | 47.2
Russian^A   | 573 | 127 | 77.3
Spanish^S   | 875 | 140 | 63.5

Table 3: Average F1 scores by language with gold mentions. The superscript O indicates data from OntoNotes 5.0 (dev), S indicates data from SemEval 2010 Task 1 (dev), and A is the AnCor data (test).

We retrain the coarse-to-fine entity typer by Chen et al. (2020a) with XLM-R as the underlying encoder, and using the AIDA ontology as the type label inventory. The dataset annotated from AIDA is relatively small. To make the model more robust, we pre-train the model using extra training data from GAIA (Li et al., 2020), where they obtained YAGO fine-grained types (Suchanek et al., 2008) from the results of Freebase entity linking and mapped these types to the AIDA ontology. After pre-training, we fine-tune the model using the AIDA M18 and M36 data with 3-fold cross-validation, where each fold is distinct in the topics of its documents.
Data source   | Language   | Count
AIDA M18      | English    | 4,433
(LDC2019E07)  | Russian    | 4,826
              | Ukrainian  | 4,261
AIDA M36      | English    | 703
(LDC2020E29)  | Spanish    | 557
              | Russian    | 729
GAIA          | English    | 42.8M
              | Spanish    | 11.1M
              | Russian    | 2.4M

Table 4: Statistics of various datasets for entity typing.
Our models perform well on these datasets. Using one third of the AIDA M36 data as dev, our method obtains a 60.1% micro-F1 score; with pre-training on the extra GAIA data, we get 76.5%. Our system can also be extended to support other commonly used fine-grained entity type ontologies. We report the results in micro-F1 in Table 5.

Please refer to Chen et al. (2020a) for the exact definitions of the evaluation metric.
Ontology   | Prior state-of-the-art        | Ours
BBN        | 78.1 (Lin and Ji, 2019)       |
FIGER      | 79.8 (Lin and Ji, 2019)       |
UltraFine  | 40.1 (Onoe and Durrett, 2019) |

Table 5: Performance of our hierarchical entity typing model across several typing ontologies.
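The cluster-level typing described earlier selects one type per hierarchy level by Borda voting over mention-level scores, with r(i, t) = 1 / rank_t(s_{i,t}). A minimal sketch of that vote (toy scores; in the real system the scores come from the coarse-to-fine typing model):

```python
def borda_select(mention_scores):
    """Pick one type for an entity cluster at a given hierarchy level.
    mention_scores: one {type: score} dict per mention in the chain.
    Each mention contributes 1/rank for each type, ranked by its scores."""
    votes = {}
    for s in mention_scores:
        ranked = sorted(s, key=s.get, reverse=True)   # best type gets rank 1
        for rank, t in enumerate(ranked, start=1):
            votes[t] = votes.get(t, 0.0) + 1.0 / rank
    return max(votes, key=votes.get)                  # t* = argmax_t sum_i r(i, t)
```

Because the vote uses ranks rather than raw scores, a single mention with an extreme score cannot dominate a chain whose other mentions consistently prefer a different type.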
We retrain Vashishtha et al. (2019)'s best fine-grained temporal relation model on UDS-T (Vashishtha et al., 2019) using XLM-R (large). We then use their transfer learning approach and train an SVM model on event-event relations in TimeBank-Dense to predict categorical temporal relation labels. With this approach, we see a micro-F1 score of 56 on the test set of TBD.

For better performance, we train the same model on the additional TempEval3 (TE3) dataset (UzZaman et al., 2013). Since TE3 and TBD use different sets of temporal relations, we consider only those instances that are labeled with 4 temporal relations from both TE3 and TBD for joint training: before, after, includes (container), and is included (contained). We retrain Vashishtha et al. (2019)'s transfer learning model on the combined TE3 and TBD dataset considering only these 4 relations and evaluate on their combined test set. Results on the combined test set are reported in Table 6.
Relation     | Precision | Recall | F1
before       | 68        | 89     | 77
after        | 74        | 69     | 71
includes     | 83        | 5      | 10
is included  | 44        | 15     | 22

Table 6: Results on the combined test set of TempEval3 and TimeBank-Dense when trained with just 4 temporal relation labels.
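The four-label scheme above can be viewed as a reduction of interval relations between event time spans. The mapping below is our own illustration of that reduction, not the trained SVM from the pipeline; it treats each event as a (start, end) interval:

```python
def categorical_relation(e1, e2):
    """Map two event intervals (start, end) to one of the four temporal
    relations used in the joint TE3/TBD training, falling back to 'vague'
    for partial overlaps the 4-label scheme cannot express."""
    s1, t1 = e1
    s2, t2 = e2
    if t1 <= s2:
        return "before"            # e1 ends before e2 starts
    if t2 <= s1:
        return "after"             # e2 ends before e1 starts
    if s1 <= s2 and t2 <= t1:
        return "includes"          # e1 contains e2
    if s2 <= s1 and t1 <= t2:
        return "is included"       # e2 contains e1
    return "vague"                 # partial overlap
```

The severe recall gap for includes and is included in Table 6 is unsurprising under this view: containment requires getting both endpoints right, while before/after only constrain one.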
Besides the core components described above, we also discuss the viability of including additional modules that may not fit directly in the core pipeline but can be included depending on the downstream application. For example, the system described above does not predict any relation information, which was needed for a motivating application of downstream schema inference. To do so, we wrote a CONCRETE and Docker wrapper around OneIE (Lin et al., 2020) and attached it at the end of the pipeline. With our CONCRETE-based design, the integration of any third-party module can be done by implementing the AnnotateCommunicationService service interface, which ensures compatibility between LOME and external modules. The OneIE wrapper is one example of an external module.

The train and dev sets of TBD have a total of 4,590 instances, and the test set has 1,405 instances of event-event relations.
We consider only event-event relations; the combined dataset has 5,987 (1,249) instances in the train (test) set.
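Schematically, integrating an external module amounts to exposing a single annotate method that takes a communication and returns it with one more layer of annotation. The sketch below uses illustrative names only: a dict stands in for a real CONCRETE communication, and the wrapped relation model is assumed, not the real Thrift-generated service API:

```python
class OneIEStyleWrapper:
    """Illustrative wrapper in the spirit of AnnotateCommunicationService:
    anything exposing annotate(comm) -> comm can be appended to the pipeline."""

    def __init__(self, relation_model):
        self.model = relation_model          # assumed: text -> list of relations

    def annotate(self, comm):
        comm.setdefault("relations", [])     # comm: dict standing in for CONCRETE
        comm["relations"].extend(self.model(comm["text"]))
        return comm

def run_extensions(comm, services):
    """Apply third-party annotators in order, each enriching the communication."""
    for service in services:
        comm = service.annotate(comm)
    return comm
```

Since every service sees and returns the same communication object, external modules compose with the first-party pipeline without either side knowing the other's internals.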
As another example application, we reconfigured our pipeline for the NIST SM-KBP 2020 Task 1 evaluation, which aims to produce document-level knowledge graphs. Each given document may be in English, Russian, or Spanish. On a development set consisting solely of text-only documents, we started with initial predictions, through GAIA (Li et al., 2020), for entity clusters, entity types, events, and relations. Our goal was to recluster and relabel these initial predictions.

Our pipeline consisted of multilingual coreference resolution (using predetermined mention spans), the hierarchical entity typing model, and a separate state-of-the-art argument linking model (Chen et al., 2020b). We found improved performance with entity coreference (from 29.1 F1 to 33.3 F1), especially in Russian (from 26.2 F1 to 33.3 F1), likely due to our use of multilingual data and encoders. The improved entity clusters also led to downstream improvements in entity typing and argument linking. This example highlights the ability to pick out subcomponents of LOME and customize them according to the downstream task.

We present two methods to interact with the pipeline. The first is a Docker container which contains the libraries, code, and trained models of our pipeline. This is intended to run on batches of documents. As a lighter demo of some of the system capabilities, we also have a web demo intended to interactively run on shorter documents.

https://tac.nist.gov/2020/KBP/SM-KBP/index.html
AIDA M36, LDC2020E29.
This evaluation metric is specific to the NIST SM-KBP 2020 task. It takes into account entity types.
Docker Our Docker image consists of the four core modules: FrameNet parser, coreference resolution, entity typing, and temporal resolution. Furthermore, there are two options for entity typing: a fine-grained hierarchical model (with the AIDA typing ontology) and a coarse-grained model (with the KAIROS typing ontology). The container and documentation are available on Docker Hub.

As some modules depend on GPU libraries, the image also requires NVIDIA-Docker support. Since there is a high start-up (time) cost for using Docker and loading models, we recommend using this container for batch processing of documents. Further instructions for running can be found on the LOME Docker Hub page.

Web Demo We make a few changes for the web demo. To reduce latency, we preload the models into memory and we do not write the CONCRETE communications to disk. At the cost of modularity, this makes the demo lightweight and fast, allowing us to run it on a single 16GB CPU-only server. To present the predictions, our front-end uses AllenNLP-demo.

In addition, the web demo is currently limited to FrameNet parsing and coreference resolution, as other models would increase latency and may impede usability. The web demo is intended to highlight only some of the system's capabilities, like its ability to process multilingual documents.
To facilitate increased interest in large-ontology, multilingual, document-level knowledge extraction, we create and demonstrate LOME, a system for event and entity knowledge graph creation. Given input text documents, LOME runs a full FrameNet parser, coreference resolution, fine-grained entity typing, and temporal relation prediction. Furthermore, each component uses XLM-R, allowing our system to support a broader set of languages than previous systems. The pipeline uses a standardized data schema, which invites extending the pipeline with additional modules. By both releasing a Docker image and presenting a lightweight web demo, we hope to enable the community to build on top of LOME for even more comprehensive information extraction.

https://hub.docker.com/r/hltcoe/lome
https://nlp.jhu.edu/demos/lome/
https://github.com/allenai/allennlp-demo

Acknowledgments

We thank Kenton Murray, Manling Li, Varun Iyer, and Zhuowan Li for helpful discussions and feedback. This work was supported in part by DARPA AIDA (FA8750-18-2-0015) and KAIROS (FA8750-19-2-0034). The views and conclusions contained in this work are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, or endorsements of DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
References
Alan Akbik and Yunyao Li. 2016. POLYGLOT: Multilingual semantic role labeling with unified labels. In Proceedings of ACL-2016 System Demonstrations, pages 1–6, Berlin, Germany. Association for Computational Linguistics.

James Allen, Hannah An, Ritwik Bose, Will de Beaumont, and Choh Man Teng. 2020. A broad-coverage deep semantic lexicon for verbs. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3243–3251, Marseille, France. European Language Resources Association.

Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, pages 563–566.

C. F. Baker, C. J. Fillmore, and J. B. Lowe. 1998. The Berkeley FrameNet Project. In Association for Computational Linguistics and International Conference on Computational Linguistics (ACL-COLING).

A. E. Budnikov, S. Yu. Toldova, D. S. Zvereva, D. M. Maximova, and M. I. Ionov. 2019. RU-EVAL-2019: Evaluating anaphora and coreference resolution for Russian.

Taylor Cassidy, Bill McDowell, Nathanael Chambers, and Steven Bethard. 2014. An annotation framework for dense event ordering. Technical report, Carnegie Mellon University, Pittsburgh, PA.

Nathanael Chambers, Taylor Cassidy, Bill McDowell, and Steven Bethard. 2014. Dense event ordering with a multi-pass architecture. Transactions of the Association for Computational Linguistics, 2:273–284.

Tongfei Chen, Yunmo Chen, and Benjamin Van Durme. 2020a. Hierarchical entity typing via multi-level learning to rank. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8465–8475, Online. Association for Computational Linguistics.

Yunmo Chen, Tongfei Chen, and Benjamin Van Durme. 2020b. Joint modeling of arguments for event understanding. In Proceedings of the First Workshop on Computational Approaches to Discourse, pages 96–101, Online. Association for Computational Linguistics.

Fei Cheng and Yusuke Miyao. 2017. Classifying temporal relations by bidirectional LSTM over dependency paths. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–6.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 87–96, Melbourne, Australia. Association for Computational Linguistics.

Christopher Cieri, James Fiumara, Stephanie Strassel, Jonathan Wright, Denise DiPersio, and Mark Liberman. 2020. A progress report on activities at the Linguistic Data Consortium benefitting the LREC community. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3449–3456, Marseille, France. European Language Resources Association.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

D. Das, D. Chen, A. F. T. Martins, N. Schneider, and N. A. Smith. 2014. Frame-semantic parsing. Computational Linguistics, 40(1).

Francis Ferraro, Max Thomas, Matthew R. Gormley, Travis Wolfe, Craig Harman, and Benjamin Van Durme. 2014. Concretely annotated corpora.

Goran Glavas and J. Snajder. 2014. Event graphs for information retrieval and multi-document summarization. Expert Systems with Applications, 41:6904–6916.

K. M. Hermann, D. Das, J. Weston, and K. Ganchev. 2014. Semantic frame identification with distributed word representations. In Association for Computational Linguistics (ACL), pages 1448–1458.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Mandar Joshi, Omer Levy, Luke Zettlemoyer, and Daniel Weld. 2019. BERT for coreference resolution: Baselines and analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5803–5808, Hong Kong, China. Association for Computational Linguistics.

Chris Kedzie, K. McKeown, and Fernando Diaz. 2015. Predicting salient updates for disaster summarization. In ACL.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Yuri Kuratov and Mikhail Arkhipov. 2019. Adaptation of deep bidirectional multilingual transformers for Russian language.

T. A. Le, M. A. Petrov, Y. M. Kuratov, and M. S. Burtsev. 2019. Sentence level representation and language models in the task of coreference resolution for Russian. In Proceedings of the International Conference "Dialogue 2019".

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 188–197, Copenhagen, Denmark. Association for Computational Linguistics.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 687–692, New Orleans, Louisiana. Association for Computational Linguistics.

Artuur Leeuwenberg and Marie Francine Moens. 2017. Structured learning for temporal relation extraction from clinical records. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1150–1158.

Manling Li, Ying Lin, Joseph Hoover, Spencer Whitehead, Clare Voss, Morteza Dehghani, and Heng Ji. 2019. Multilingual entity, relation, event and human value extraction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 110–115, Minneapolis, Minnesota. Association for Computational Linguistics.

Manling Li, Alireza Zareian, Ying Lin, Xiaoman Pan, Spencer Whitehead, Brian Chen, Bo Wu, Heng Ji, Shih-Fu Chang, Clare Voss, Daniel Napierski, and Marjorie Freedman. 2020. GAIA: A fine-grained multimedia knowledge extraction system. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 77–86, Online. Association for Computational Linguistics.

Ying Lin and Heng Ji. 2019. An attentive fine-grained entity typing model with latent type representation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6198–6203, Hong Kong, China. Association for Computational Linguistics.

Ying Lin, Heng Ji, Fei Huang, and Lingfei Wu. 2020. A joint neural model for information extraction with global features. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7999–8009. Association for Computational Linguistics.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, July 22-26, 2012, Toronto, Ontario, Canada, pages 94–100.

Hector Llorens, Nathanael Chambers, Naushad UzZaman, Nasrin Mostafazadeh, James Allen, and James Pustejovsky. 2015. SemEval-2015 task 5: QA TempEval - evaluating temporal information understanding with question answering. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 792–800, Denver, Colorado. Association for Computational Linguistics.

Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 25–32, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Qiang Ning, Zhili Feng, and Dan Roth. 2019. A structured learning approach to temporal relation extraction. arXiv preprint arXiv:1906.04943.

Yasumasa Onoe and Greg Durrett. 2019. Learning to denoise distantly-labeled data for entity typing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2407–2417, Minneapolis, Minnesota. Association for Computational Linguistics.

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958, Vancouver, Canada. Association for Computational Linguistics.

H. Peng, S. Thomson, S. Swayamdipta, and N. A. Smith. 2018. Learning joint semantic parsers from disjoint data. In North American Association for Computational Linguistics (NAACL), pages 1492–1502.

Nanyun Peng, Francis Ferraro, Mo Yu, Nicholas Andrews, Jay DeYoung, Max Thomas, Matthew R. Gormley, Travis Wolfe, Craig Harman, Benjamin Van Durme, and Mark Dredze. 2015. A concrete Chinese NLP pipeline. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 86–90, Denver, Colorado. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Proceedings of the 2018 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics: Human Lan-guage Technologies, Volume 1 (Long Papers) , pages2227–2237, New Orleans, Louisiana. Associationfor Computational Linguistics.Sameer Pradhan, Alessandro Moschitti, Nianwen Xue,Hwee Tou Ng, Anders Bj¨orkelund, Olga Uryupina,Yuchen Zhang, and Zhi Zhong. 2013. Towards ro-bust linguistic analysis using ontonotes. In
Proceed-ings of CoNLL .James Pustejovsky, Patrick Hanks, Roser Sauri, An-drew See, Robert Gaizauskas, Andrea Setzer,Dragomir Radev, Beth Sundheim, David Day, LisaFerro, et al. 2003. The timebank corpus. In
Corpuslinguistics , volume 2003, page 40. Lancaster, UK.Marta Recasens, Llu´ıs M`arquez, Emili Sapena,M. Ant`onia Mart´ı, Mariona Taul´e, V´eroniqueHoste, Massimo Poesio, and Yannick Versley. 2010.SemEval-2010 task 1: Coreference resolution inmultiple languages. In
Proceedings of the 5th Inter-national Workshop on Semantic Evaluation , pages1–8, Uppsala, Sweden. Association for Computa-tional Linguistics.Guergana K Savova, James J Masanz, Philip V Ogren,Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, and Christopher G Chute. 2010. Mayo clin-ical text analysis and knowledge extraction system(ctakes): architecture, component evaluation and ap-plications.
Journal of the American Medical Infor-matics Association , 17(5):507–513.Ergin Soysal, Jingqi Wang, Min Jiang, YonghuiWu, Serguei Pakhomov, Hongfang Liu, and HuaXu. 2018. Clamp–a toolkit for efficiently build-ing customized clinical natural language processingpipelines.
Journal of the American Medical Infor-matics Association , 25(3):331–336.Fabian M. Suchanek, Gjergji Kasneci, and GerhardWeikum. 2008. YAGO: A large ontology fromwikipedia and wordnet.
Journal of Web Semantics ,6(3):203–217. Weiyi Sun, Anna Rumshisky, and Ozlem Uzuner. 2013.Evaluating temporal relations in clinical text: 2012i2b2 challenge.
Journal of the American MedicalInformatics Association , 20(5):806–813.S. Swayamdipta, S. Thomson, K. Lee, L. S. Zettle-moyer, C. Dyer, and N. A. Smith. 2018. Syntac-tic Scaffolds for Semantic Structures. In
EmpiricalMethods in Natural Language Processing (EMNLP) ,pages 3772–3782.S. Toldova, A. Roytberg, A. A. Ladygina, M. D. Vasi-lyeva, I. L. Azerkovich, M. Kurzukov, G. Sim, D. V.Gorshkov, A. Ivanova, A. Nedoluzhko, and Y. Gr-ishina. 2014. Evaluating anaphora and coreferenceresolution for russian. In
Komp’juternaja lingvistikai intellektual’nye tehnologii. Po materialam ezhe-godnoj Mezhdunarodnoj konferencii (Dialog) , pages681–695.Naushad UzZaman, Hector Llorens, Leon Derczyn-ski, James Allen, Marc Verhagen, and James Puste-jovsky. 2013. SemEval-2013 task 1: TempEval-3:Evaluating time expressions, events, and temporalrelations. In
Second Joint Conference on Lexicaland Computational Semantics (*SEM), Volume 2:Proceedings of the Seventh International Workshopon Semantic Evaluation (SemEval 2013) , pages 1–9, Atlanta, Georgia, USA. Association for Computa-tional Linguistics.Siddharth Vashishtha, Benjamin Van Durme, andAaron Steven White. 2019. Fine-grained temporalrelation extraction. In
Proceedings of the 57th An-nual Meeting of the Association for ComputationalLinguistics , pages 2906–2919, Florence, Italy. Asso-ciation for Computational Linguistics.Marc Verhagen, Robert Gaizauskas, Frank Schilder,Mark Hepple, Graham Katz, and James Pustejovsky.2007. Semeval-2007 task 15: Tempeval tempo-ral relation identification. In
Proceedings of thefourth international workshop on semantic evalua-tions (SemEval-2007) , pages 75–80.Marc Verhagen, Roser Sauri, Tommaso Caselli, andJames Pustejovsky. 2010. Semeval-2010 task 13:Tempeval-2. In
Proceedings of the 5th internationalworkshop on semantic evaluation , pages 57–62.Marc Vilain, John Burger, John Aberdeen, Dennis Con-nolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In
Sixth Mes-sage Understanding Conference (MUC-6): Proceed-ings of a Conference Held in Columbia, Maryland,November 6-8, 1995 .Ralph Weischedel and Ada Brunstein. 2005. BBN pro-noun coreference and entity type corpus.
Philadel-phia: Linguistic Data Consortium .Ralph Weischedel, Martha Palmer, Mitchell Marcus,Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Ni-anwen Xue, Ann Taylor, Jeff Kaufman, MichelleFranchini, et al. 2013. OntoNotes release.0 LDC2013T19.
Linguistic Data Consortium,Philadelphia, PA .Shijie Wu and Mark Dredze. 2019. Beto, bentz, be-cas: The surprising cross-lingual effectiveness ofBERT. In
Proceedings of the 2019 Conference onEmpirical Methods in Natural Language Processingand the 9th International Joint Conference on Natu-ral Language Processing (EMNLP-IJCNLP) , pages833–844, Hong Kong, China. Association for Com-putational Linguistics.Patrick Xia, Jo˜ao Sedoc, and Benjamin Van Durme.2020. Incremental neural coreference resolution inconstant memory. In
Proceedings of the 2020 Con-ference on Empirical Methods in Natural LanguageProcessing (EMNLP) , pages 8617–8624, Online. As-sociation for Computational Linguistics.Liyan Xu and Jinho D. Choi. 2020. Revealing the mythof higher-order inference in coreference resolution.In
Proceedings of the 2020 Conference on EmpiricalMethods in Natural Language Processing (EMNLP) ,pages 8527–8533, Online. Association for Computa-tional Linguistics.B. Yang and T. M. Mitchell. 2017. A joint sequentialand relational model for frame-semantic parsing. In
Empirical Methods in Natural Language Processing(EMNLP) , pages 1247–1256.Ben Zhou, Daniel Khashabi, Qiang Ning, and DanRoth. 2019. ” going on a vacation” takes longerthan” going for a walk”: A study of tempo-ral commonsense understanding. arXiv preprintarXiv:1909.03065 .Yichao Zhou, Yu Yan, Rujun Han, J Harry Caufield,Kai-Wei Chang, Yizhou Sun, Peipei Ping, and WeiWang. 2020. Clinical temporal relation extrac-tion with probabilistic soft logic regularization andglobal inference. arXiv preprint arXiv:2012.08790arXiv preprint arXiv:2012.08790