Fine-grained Entity Recognition with Reduced False Negatives and Large Type Coverage
Abhishek Abhishek, Sanya Bathla Taneja, Garima Malik, Ashish Anand, Amit Awekar
Automated Knowledge Base Construction (2019) Conference paper
Abhishek [email protected]
Indian Institute of Technology Guwahati, Guwahati, Assam, India
Sanya Bathla Taneja∗ [email protected]
Garima Malik∗ [email protected]
Indira Gandhi Delhi Technical University for Women, Kashmere Gate, Delhi, India
Ashish Anand [email protected]
Amit Awekar [email protected]
Indian Institute of Technology Guwahati, Guwahati, Assam, India
Abstract
Fine-grained Entity Recognition (FgER) is the task of detecting and classifying entity mentions to a large set of types spanning diverse domains such as biomedical, finance and sports. We observe that when the type set spans several domains, detection of entity mentions becomes a limitation for supervised learning models. The primary reason is the lack of datasets where entity boundaries are properly annotated while covering a large spectrum of entity types. Our work directly addresses this issue. We propose the Heuristics Allied with Distant Supervision (HAnDS) framework to automatically construct a quality dataset suitable for the FgER task. The HAnDS framework exploits the high degree of interlinking between Wikipedia and Freebase in a pipelined manner, reducing annotation errors introduced by naively using the distant supervision approach. Using the HAnDS framework, we create two datasets, one suitable for building FgER systems recognizing up to 118 entity types based on the FIGER type hierarchy and another for up to 1115 entity types based on the TypeNet hierarchy. Our extensive empirical experimentation warrants the quality of the generated datasets. Along with this, we also provide a manually annotated dataset for benchmarking FgER systems.
1. Introduction
In the literature, the problem of recognizing a handful of coarse-grained types such as person, location and organization has been extensively studied [Nadeau and Sekine, 2007, Marrero et al., 2013]. We term this the Coarse-grained Entity Recognition (CgER) task. For CgER, there exist several datasets, including manually annotated datasets such as CoNLL [Tjong Kim Sang and De Meulder, 2003] and automatically generated datasets such as WP2 [Nothman et al., 2013]. Manually constructing a dataset for the FgER task is an expensive and time-consuming process, as an entity mention could be assigned multiple types from a set of thousands of types.

∗. The authors contributed to the work during their internship at the Indian Institute of Technology Guwahati.
1. The code and datasets to replicate the experiments are available at https://github.com/abhipec/HAnDS.

In recent years, one of the subproblems of FgER, the Fine Entity Categorization or Typing (Fine-ET) problem, has received lots of attention, particularly in expanding its type coverage from a handful of coarse-grained types to thousands of fine-grained types [Murty et al., 2018, Choi et al., 2018]. The primary driver for this rapid expansion is the exploitation of cheap but fairly accurate annotations from Wikipedia and Freebase [Bollacker et al., 2008] via the distant supervision process [Craven and Kumlien, 1999]. The Fine-ET problem assumes that the entity boundaries are provided by an oracle.

We observe that the detection of entity mentions at the granularity of Fine-ET is a bottleneck. The existing FgER systems, such as FIGER [Ling and Weld, 2012], follow a two-step approach in which the first step is to detect entity mentions and the second step is to categorize the detected entity mentions. For entity detection, it is assumed that all the fine categories are subtypes of the following four categories: person, location, organization and miscellaneous. Thus, a model trained on the CoNLL dataset [Tjong Kim Sang and De Meulder, 2003], which is annotated with these types, can be used for entity detection. Our analysis indicates that in the context of FgER, this assumption is not valid. At face value, the miscellaneous type should ideally cover all entity types other than person, location, and organization. However, it only covers 68% of the remaining types of the FIGER hierarchy and 42% of the TypeNet hierarchy. Thus, models trained using CoNLL data are highly likely to miss a significant portion of entity mentions relevant to automatic knowledge base construction applications.

Our work bridges this gap between entity detection and Fine-ET. We propose to automatically construct a quality dataset suitable for FgER, i.e., both Fine-ED and Fine-ET, using the proposed HAnDS framework. HAnDS is a three-stage pipelined framework wherein each stage uses different heuristics to combat the errors introduced by naively using the distant supervision paradigm, including but not limited to the presence of large numbers of false negatives. The heuristics are data-driven and use information provided by hyperlinks, alternate names of entities, and orthographic and morphological features of words.

Using the HAnDS framework and the two popular type hierarchies available for Fine-ET, the FIGER type hierarchy [Ling and Weld, 2012] and TypeNet [Murty et al., 2018], we automatically generated two corpora suitable for the FgER task.
The first corpus contains around 38 million entity mentions annotated with 118 entity types. The second corpus contains around 46 million entity mentions annotated with 1115 entity types. Our extensive intrinsic and extrinsic evaluation of the generated datasets warrants their quality. As compared with existing automatically generated datasets, supervised learning models trained on our induced training datasets perform significantly better (approx. 20 point improvement on micro-F1 score). Along with the automatically generated datasets, we provide a manually annotated corpus of around one thousand sentences annotated with 117 entity types for benchmarking of FgER models. Our contributions are highlighted as follows:

• We analyzed that the existing practice of using models trained on the CoNLL dataset has poor recall for entity detection in the Fine-ET setting, where the type set spans several diverse domains. (Section 3)

• We propose the HAnDS framework, a heuristics allied with distant supervision approach, to automatically construct datasets suitable for the FgER problem, i.e., both fine entity detection and fine entity typing. (Section 4)

• We establish the state-of-the-art baselines on our new manually annotated corpus, which covers 2.7 times more finer entity types than the FIGER gold corpus, the current de facto FgER evaluation corpus. (Section 5)

The rest of the paper is organized as follows. We describe the related work in Section 2, followed by a case study on the entity detection problem in the Fine-ET setting in Section 3. Section 4 describes our proposed HAnDS framework, followed by empirical evaluation of the datasets in Section 5. In Section 6 we conclude our work.
2. Related Work
We majorly divide the related work into two parts. First, we describe work related to automatic dataset construction in the context of the entity recognition task, followed by related work on noise reduction techniques in the context of the automatic dataset construction task.

In the context of the FgER task, [Ling and Weld, 2012] proposed to use the distant supervision paradigm [Black et al., 1998] to automatically generate a dataset for the Fine-ET problem, which is a sub-problem of FgER. We term this the Naive Distant Supervision (NDS) approach. In NDS, the linkage between Wikipedia and Freebase is exploited. If there is a hyperlink in a Wikipedia sentence, and that hyperlink is assigned to an entity present in Freebase, then the hyperlinked text is an entity mention whose types are obtained from Freebase. However, this process can only generate positive annotations, i.e., if an entity mention is not hyperlinked, no types will be assigned to that entity mention. The positive-only annotations are suitable for the Fine-ET problem, but they are not suitable for learning entity detection models as there are a large number of false negatives (Section 3). This dataset is publicly available as the FIGER dataset, along with a manually annotated evaluation corpus.

The NDS approach is also used to generate datasets for some variants of the Fine-ET problem, such as corpus-level fine-entity typing [Yaghoobzadeh et al., 2018] and fine-entity typing utilizing knowledge base embeddings [Xin et al., 2018]. More recently, [Choi et al., 2018] generated an entity typing dataset with a very large type set of size 10k using head words as a source of distant supervision as well as using crowdsourcing.

In the context of the CgER task, [Nothman et al., 2008, 2009, 2013] proposed an approach to create a training dataset for the CgER task using a combination of a bootstrapping process and heuristics. The bootstrapping was used to classify a Wikipedia article into five categories, namely PER, LOC, ORG, MISC and NON-ENTITY. The bootstrapping requires initial manually annotated seed examples for each type, which limits its scalability to thousands of types. The heuristics were used to infer additional links in un-linked text; however, the proposed heuristics limit the scope of entity and non-entity mentions. For example, one of the heuristics used mostly restricts entity mentions to have at least one character capitalized. This assumption is not true in the context of FgER, where entity mentions are from several diverse domains, including the biomedical domain.
Figure 1: The entity type coverage analysis of the FIGER and the TypeNet type sets. This illustrates that a significant portion of entity types are not a descendant of any of the four CoNLL types.

There are other notable works which combine NDS with heuristics for generating entity recognition training datasets, such as [Al-Rfou et al., 2014] and [Ghaddar and Langlais, 2017]. However, their scope is limited to the application of CgER. Our work revisits the idea of automatic corpus construction in the context of FgER. In the HAnDS framework, our main contribution is to design data-driven heuristics which are generic enough to work for over a thousand diverse entity types while maintaining a good annotation quality.

An automatic dataset construction process involving heuristics and distant supervision will inevitably introduce noise, and its characteristics depend on the dataset construction task. In the context of the Fine-ET task [Ren et al., 2016b, Gillick et al., 2014], the dominant noise is false positives, whereas for the relation extraction task, both false negative and false positive noise is present [Roth et al., 2013, Phi et al., 2018].
3. Case study: Entity Detection in the Fine Entity Typing Setting
In this section, we systematically analyze existing entity detection systems in the setting of fine entity typing. Our aim is to answer the following question: How good are entity detection systems when it comes to detecting entity mentions belonging to a large set of diverse types? We performed two analyses. The first analysis is about the type coverage of entity detection systems and the second analysis is about the actual performance of entity detection systems on two manually annotated FgER datasets.
Models                    FIGER                         1k-WFB-g
                          Precision   Recall   F1       Precision   Recall   F1
LSTM-CNN-CRF (FIGER)      87.17       28.95    43.47    91.41       37.13    52.81
CoreNLP                   83.82       80.99    82.38    75.46       64.12    69.33
NER Tagger                80.44       84.01    82.19    77.25       68.52    72.62

Table 1: Performance of entity detection models trained on existing datasets evaluated on the FIGER and 1k-WFB-g datasets.

3.1 Type coverage analysis

For this analysis we manually inspected the most commonly used CgER dataset, CoNLL 2003. We analyzed how many entity types in the two popular Fine-ET hierarchies, FIGER and TypeNet, are actual descendants of the four coarse types present in the CoNLL dataset, namely person, location, organization and miscellaneous. The results are available in Figure 1. We can observe that in the FIGER type set, 14% of types are not descendants of the CoNLL types. This share increases in TypeNet, where 25% of types are not descendants of CoNLL types. These types are from various diverse domains, including biomedical, legal processes and entertainment, and it is important for knowledge base construction applications to detect entity mentions of these types. These differences can be attributed to the fact that since 2003, the entity recognition problem has evolved a lot, both in going towards finer categorization as well as in capturing entities from diverse domains.
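The coverage numbers above can be reproduced mechanically once a hierarchy is available as a child-to-parent mapping. The following minimal sketch uses a toy, hypothetical hierarchy (the real FIGER and TypeNet hierarchies are far larger) and shows the intended computation: a fine type is covered only if walking up its ancestor chain reaches one of the four CoNLL types.

```python
# Sketch of the type-coverage analysis, using a toy child -> parent mapping.
# The type names below are illustrative, not the actual FIGER/TypeNet entries.
CONLL_ROOTS = {"person", "location", "organization", "miscellaneous"}

parent = {
    "city": "geopolitical_entity",
    "geopolitical_entity": "location",
    "athlete": "person",
    "enzyme": "chemical_compound",   # not under any CoNLL root
    "chemical_compound": None,
}

def is_covered(entity_type, parent):
    """True if entity_type is a CoNLL root or one of its descendants."""
    while entity_type is not None:
        if entity_type in CONLL_ROOTS:
            return True
        entity_type = parent.get(entity_type)
    return False

covered = [t for t in parent if is_covered(t, parent)]
print(f"{len(covered)} of {len(parent)} types are descendants of the CoNLL types")
```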
3.2 Performance analysis

For this analysis we evaluate two publicly available state-of-the-art entity detection systems, the Stanford CoreNLP [Manning et al., 2014] and the NER Tagger system proposed in [Lample et al., 2016]. Along with these, we also train an LSTM-CNN-CRF based sequence labeling model proposed in [Ma and Hovy, 2016] on the FIGER dataset. The learning models were evaluated on the manually annotated FIGER corpus and the 1k-WFB-g corpus, a new in-house developed corpus specifically for FgER model evaluations. The results are presented in Table 1.

From the results, we can observe that a state-of-the-art sequence labeling model, LSTM-CNN-CRF, trained on a dataset generated using the NDS approach, such as the FIGER dataset, has lower recall compared with precision. On average the recall is 58% lower than precision. This is primarily because the NDS approach generates positive-only annotations and the remaining un-annotated tokens contain a large number of entity mentions. Thus the resulting dataset has a large number of false negatives.

On the other hand, learning models trained on the CoNLL dataset (CoreNLP and NER Tagger) have a much more balanced performance in precision and recall. This is because, being a manually annotated dataset, it is less likely that any entity mention (according to the annotation guidelines) will remain un-annotated. However, the recall is much lower (16% lower) on the 1k-WFB-g corpus than on the FIGER corpus. This is because, when designing 1k-WFB-g, we ensured that it has sufficient examples covering 117 entity types, whereas the FIGER evaluation corpus has only 42 types of entity mentions and 80% of mentions are from the person, location and organization coarse types. These results also highlight the coverage issue mentioned in Section 3.1. When the evaluation set is balanced, covering a large spectrum of entity types, the performance of models trained on the CoNLL dataset goes down because of the presence of out-of-scope entity types. An ideal entity detection system should be able to work on the traditional as well as the other entities relevant to the FgER problem, i.e., have good performance across all types. A statistical comparison of the FIGER and 1k-WFB-g corpora is provided in Table 2.

The use of CoreNLP or learning models trained on the CoNLL dataset is a standard practice to detect entity mentions in existing FgER research [Ling and Weld, 2012]. Our analysis conveys that this practice has its limitations in terms of detecting entities which are out of the scope of the CoNLL dataset. In the next section, we will describe our approach of automatically creating a training dataset for the FgER task. The same learning models, when trained on our generated training datasets, have a better and more balanced precision and recall.

Figure 2: An overview of the HAnDS framework (left) along with an illustration of the framework in action on an example document (right). (a) A high-level overview of the three stages of the HAnDS framework along with each stage's objective. (b) The top four boxes illustrate how annotations change during the different stages. The bottom box illustrates the outgoing links and the candidate names for the example document.
4. HAnDS Framework
The objective of the HAnDS framework is to automatically create a corpus of sentences where every entity mention is correctly detected and is characterized into one or more entity types. The scope of entities, i.e., what types of entities should be annotated, is decided by a type hierarchy, which is one of the inputs of the framework. Figure 2 gives an overview of the HAnDS framework.
The framework requires three inputs: a linked text corpus, a knowledge base, and a type hierarchy.
Linked text corpus:
A linked text corpus is a collection of documents where sporadically important concepts are hyperlinked to another document. For example, Wikipedia is a large-scale multi-lingual linked text corpus. The framework considers the spans of hyperlinked text (or anchor text) as potential candidates for entity mentions.
Knowledge base:
A knowledge base (KB) captures concepts, their properties, and inter-concept properties. Freebase, WikiData [Vrandečić and Krötzsch, 2014] and UMLS [Bodenreider, 2004] are examples of popular knowledge bases. A KB usually has a type property where multiple fine-grained semantic types/labels are assigned to each concept.
Type hierarchy:
A type hierarchy (T) is a hierarchical organization of various entity types. For example, an entity type city is a descendant of the type geopolitical entity. There have been various hierarchical organization schemes of fine-grained entity types proposed in the literature, which include a 200 type scheme proposed in [Sekine, 2008], a 113 type scheme proposed in [Ling and Weld, 2012], an 87 type scheme proposed in [Gillick et al., 2014] and a 1081 type scheme proposed in [Murty et al., 2018]. However, in our work, we use two such hierarchies, FIGER and TypeNet, FIGER being the most extensively used hierarchy and TypeNet being the latest and largest entity type hierarchy.

Automatic corpora creation using distantly supervised methods will inevitably contain errors. For example, in the context of FgER, the errors could be in annotating entity boundaries, i.e., entity detection errors, or in assigning an incorrect type, i.e., entity linking errors, or both. The three-step process in our proposed HAnDS framework tries to reduce these errors.
Stage I: Link categorization. The objective of this stage is to reduce false positive entity mentions, where an incorrect anchor text is detected as an entity mention. To do so, we first categorize all hyperlinks of the document being processed as entity links and non-entity links. Further, every link is assigned a tag of being a referential link or not.
Entity links:
These are a subset of links whose anchor text represents candidate entity mentions. If the labels obtained from a KB for a link belong to T, we categorize that link as an entity link. Here, T decides the scope of entities in the generated dataset. For example, if T is the FIGER type hierarchy, then the hyperlink photovoltaic cell is not an entity link, as its labels obtained from Freebase are not present in T. However, if T is the TypeNet hierarchy, then photovoltaic cell is an entity link of type invention.

Non-entity links:
These are a subset of links whose anchor text does not represent an entity mention. Since knowledge bases are incomplete, if a link is not categorized as an entity link, it does not mean that the link does not represent an entity. We exploit corpus-level context to categorize a link as a non-entity link using the following criteria: across the complete corpus, the link should be mentioned at least 50 times (support threshold) and at least 50% of the time (confidence threshold) with a lowercase anchor text. The intuition of this criterion is that we want to be certain that a link actually represents a non-entity. For example, this heuristic categorizes RBI as a non-entity link, as there is no label present for this link in Freebase. Here RBI refers to the term "run batted in", frequently used in the context of baseball and softball. Unlike [Nothman et al., 2008], which does not allow non-entity mentions to have capitalized words, our data-driven heuristic does not put any such hard constraint.
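A minimal sketch of this corpus-level heuristic is shown below. The data structures and function names are our own illustration, not the released implementation; only the two thresholds, 50 occurrences and a 50% lowercase ratio, come from the description above.

```python
from collections import defaultdict

SUPPORT_THRESHOLD = 50      # link must occur at least 50 times in the corpus
CONFIDENCE_THRESHOLD = 0.5  # at least 50% of its anchor texts are lowercase

def find_non_entity_links(hyperlinks, kb_types, type_hierarchy):
    """hyperlinks: iterable of (target_page, anchor_text) pairs over the whole corpus.
    kb_types: dict mapping a target page to its set of KB type labels (may be empty).
    type_hierarchy: the set of entity types in scope (T)."""
    total = defaultdict(int)
    lowercase = defaultdict(int)
    for target, anchor in hyperlinks:
        total[target] += 1
        if anchor == anchor.lower():
            lowercase[target] += 1

    non_entity = set()
    for target, count in total.items():
        # Links whose KB labels fall inside T are entity links, never non-entity.
        if kb_types.get(target, set()) & type_hierarchy:
            continue
        if count >= SUPPORT_THRESHOLD and lowercase[target] / count >= CONFIDENCE_THRESHOLD:
            non_entity.add(target)
    return non_entity
```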
Referential links:
A link is said to be referential if its anchor text has a direct case-insensitive match with the list of allowed candidate names for the linked concept. A KB can provide such a list. For example, for the entity Bill Gates, the candidate names provided by Freebase include Gates and William Henry Gates. However, in Wikipedia there exist hyperlinks such as Bill and Melinda Gates linking to the Bill Gates page, which is erroneous as the hyperlinked text is not the correct referent of the entity Bill Gates.

After categorization of links, we unlink all links except referential entity links. Unlinking non-referential links such as Bill and Melinda Gates reduces entity detection errors by eliminating false positive entity mentions. The unlinked text span or a part of it can be a referential mention for some other entities, as in the above example Bill and Melinda Gates. Figure 2b also illustrates this process, where Lahti, Finland gets unlinked after this stage. The next stage tries to re-link the unlinked tokens correctly.

2. Based on our observations, we made a few changes to the original FIGER hierarchy (seven additions, one correction, one merger, one deletion, and one substitute).
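The referential check itself is a simple case-insensitive match against the KB-provided candidate names, as in the sketch below (the alias list is an assumed example, not an exhaustive Freebase entry).

```python
def is_referential(anchor_text, candidate_names):
    """A link is referential if its anchor text matches one of the KB-provided
    candidate names for the linked entity (case-insensitive)."""
    anchor = anchor_text.strip().lower()
    return any(anchor == name.lower() for name in candidate_names)

# Hypothetical alias list for the entity Bill Gates.
aliases = ["Bill Gates", "Gates", "William Henry Gates"]
print(is_referential("Gates", aliases))                   # True  -> link is kept
print(is_referential("Bill and Melinda Gates", aliases))  # False -> link is removed
```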
Stage II: Re-linking unlinked mentions. The objective of this stage is to reduce false negative entity mentions, where an entity mention is not annotated. This is done by linking the correct referential name of the entity mention to the correct node in the KB.

To reduce entity linking errors, we use the document-level context by restricting the candidate links (entities or non-entities) to the outgoing links of the current document being processed. For example, in Figure 2b, while processing an article about the Finnish-American luger Tristan Jeskanen, it is unlikely to observe a mention of a 1903 German novel having the same name, i.e., Tristan.

To reduce false negative entity mentions, we construct two trie trees capturing the outgoing links and their candidate referential names for each document. The first trie contains all links and the second trie only contains links of entities which are predominantly expressed as lowercase phrases (e.g., names of diseases). For each non-linked token starting with an uppercase character, we match the longest matching prefix string within the first trie and assign the matching link. In the remaining non-linked phrases, we match the longest matching prefix string within the second trie and assign the matching link. Linking the candidate entities in unlinked phrases reduces entity detection errors by eliminating false negative entity mentions.

Unlike [Nothman et al., 2008], the two-step string matching process allows for the possibility of a lowercase phrase being an entity mention (e.g., lactic acid, apple juice, bronchoconstriction) and a word with an uppercase first character being a non-entity (e.g., Jazz, RBI).

Figure 2b shows an example of the input and output of this stage. In this stage, the phrases Tristan, Lahti, Finland and
Jeskanen get linked.
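A simplified sketch of this re-linking step is given below. It shows the core longest-prefix matching against a token-level trie of candidate names; the actual framework uses two such tries (one for all outgoing links, one restricted to predominantly lowercase entities) and the uppercase/lowercase ordering described above, which we collapse into a single trie here for brevity. The KB identifiers and names are hypothetical.

```python
class TokenTrie:
    """Token-level trie mapping candidate referential names to KB links."""
    def __init__(self):
        self.root = {}

    def add(self, name, link):
        node = self.root
        for token in name.lower().split():
            node = node.setdefault(token, {})
        node["__link__"] = link

    def longest_match(self, tokens, start):
        """Length (in tokens) and link of the longest candidate name starting at `start`."""
        node, best = self.root, (0, None)
        for i in range(start, len(tokens)):
            node = node.get(tokens[i].lower())
            if node is None:
                break
            if "__link__" in node:
                best = (i - start + 1, node["__link__"])
        return best


def relink(tokens, is_linked, trie):
    """Greedily annotate unlinked token spans with document-level candidate entities."""
    annotations = []
    i = 0
    while i < len(tokens):
        if not is_linked[i]:
            length, link = trie.longest_match(tokens, i)
            if length:
                annotations.append((i, i + length, link))  # [start, end) token span
                i += length
                continue
        i += 1
    return annotations


# Example: re-link mentions in a sentence from the Tristan Jeskanen article.
trie = TokenTrie()
trie.add("Tristan Jeskanen", "/m/tristan_jeskanen")  # hypothetical KB ids
trie.add("Jeskanen", "/m/tristan_jeskanen")
trie.add("Lahti", "/m/lahti")
tokens = ["Jeskanen", "was", "born", "in", "Lahti", "."]
print(relink(tokens, [False] * len(tokens), trie))
# -> [(0, 1, '/m/tristan_jeskanen'), (4, 5, '/m/lahti')]
```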
Stage III: Sentence selection. The objective of this stage is to further reduce entity detection errors. This stage is motivated by the incomplete nature of practical knowledge bases. KBs do not capture all entities present in a linked text corpus and do not provide all the referential names for an entity mention. Thus, after Stage II there will still be a possibility of having both types of entity detection errors, false positives and false negatives.

To reduce such errors in the induced corpus, we select sentences where it is most likely that all entity mentions are annotated correctly. The resultant corpus of selected sentences will be our final dataset.
3. More than 50% of anchor texts across the corpus should be a lowercase phrase.
4. A run batted in (RBI) is a statistic in baseball and softball.
Table 2: Statistics of the different datasets generated or used in this work.

To select these sentences, we exploit sentence-level context by using POS tags and a list of frequent sentence-starting words. We only select sentences where all unlinked tokens are most likely to be non-entity mentions. If an unlinked token has a capitalized character, then it is likely to be an entity mention. We do not select such sentences, except in the following cases. In the first case, the token is a sentence starter, and is either in a list of frequent sentence starter words or its POS tag is among the list of permissible tags. In the second case, the token is an adjective, or belongs to occupational titles, or is the name of a day or month (a sketch of this selection rule is given at the end of this section).

Figure 2b shows an example of the input and output of this stage. Here only the first sentence of the document is selected, because in the other sentence the name Sami is not linked. The sentence selection stage ensures that the selected sentences have high-quality annotations. We observe that only around 40% of sentences are selected by Stage III in our experimental setup. Our extrinsic analysis in Section 5.2 shows that this stage helps models to have a significantly better recall.

In the next section, we describe the datasets generated using the HAnDS framework along with their evaluations.
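Below is a minimal sketch of this sentence-level filter, assuming POS tags and link flags are already available for each token. The word lists are small illustrative stand-ins for the 150 frequent sentence starters, occupational titles, and day/month names used by the framework.

```python
PERMISSIBLE_SENTENCE_START_TAGS = {"DT", "IN", "PRP", "CC", "WDT"}  # from footnote 6
FREQUENT_SENTENCE_STARTERS = {"the", "in", "it", "he", "she"}  # illustrative; 150 words in the paper
DAYS_AND_MONTHS = {"monday", "january"}          # illustrative subset
OCCUPATIONAL_TITLES = {"president", "dr."}       # illustrative subset

def select_sentence(tokens, pos_tags, is_linked):
    """Keep a sentence only if every unlinked capitalized token is explainable
    as a non-entity by the heuristics described above."""
    for i, (token, tag, linked) in enumerate(zip(tokens, pos_tags, is_linked)):
        if linked or not token[:1].isupper():
            continue
        lower = token.lower()
        sentence_starter_ok = i == 0 and (
            lower in FREQUENT_SENTENCE_STARTERS or tag in PERMISSIBLE_SENTENCE_START_TAGS)
        other_ok = tag == "JJ" or lower in OCCUPATIONAL_TITLES or lower in DAYS_AND_MONTHS
        if not (sentence_starter_ok or other_ok):
            return False  # likely an unannotated entity mention; discard the sentence
    return True
```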
5. Dataset Evaluation
Using the HAnDS framework we generated two datasets as described below:
WikiFbF:
A dataset generated using Wikipedia, Freebase and the FIGER hierarchy as inputs to the HAnDS framework. This dataset contains around 38 million entity mentions annotated with 118 different types.
WikiFbT:
A dataset generated using Wikipedia, Freebase and the TypeNet hierarchy as inputs to the HAnDS framework. This dataset contains around 46 million entity mentions annotated with 1115 different types.

In our experiments, we use the September 2016 Wikipedia dump. Table 2 lists various statistics of these datasets. In the next subsections, we estimate the quality of the generated datasets, both intrinsically and extrinsically. Our intrinsic evaluation is focused on quantitative analysis, and the extrinsic evaluation is used as a proxy to estimate the precision and recall of annotations.
5. The 150 most frequent words were used in the list.
6. POS tags such as DT, IN, PRP, CC, WDT, etc., that are least likely to be candidates for entity mentions.
7. An analysis of several characteristics of the discarded and retained sentences is available in the supplementary material at: https://github.com/abhipec/HAnDS.

Table 3: Quantitative analysis of the datasets generated using the HAnDS framework compared with the NDS approach of dataset generation, in terms of (a) entity mentions and (b) entities. Here H denotes the set of entity mentions in Table 3a and the set of entities in Table 3b generated by the HAnDS framework, and N denotes the set of entity mentions in Table 3a and the set of entities in Table 3b generated by the NDS approach. (Rows: |H|, |N|, |H − N|, |H ∩ N|, |N − H|, for the WikiFbT and WikiFbF corpora.)

5.1 Intrinsic evaluation

In the intrinsic evaluation, we perform a quantitative analysis of the annotations generated by the HAnDS framework compared with the NDS approach. The results of this analysis are presented in Table 3. We can observe that on the same sentences, the HAnDS framework is able to generate about 1.9 times more entity mention annotations and about 1.6 times more entities for the WikiFbT corpus compared with the NDS approach. Similarly, there are around 1.8 times more entity mentions and about 1.6 times more entities in the WikiFbF corpus. In Section 5.2.4, we will observe that despite the around 1.6 to 1.9 times more new annotations, these annotations have a very high linking precision. Also, there is a large overlap among the annotations generated using the HAnDS framework and the NDS approach. Above 95% of the entity mention (and entity) annotations generated using the NDS approach are present in the HAnDS framework induced corpora. This indicates that the existing links present in Wikipedia are of high quality. The remaining 5% of links were removed by the HAnDS framework as false positive entity mentions.
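The comparison in Table 3 boils down to set operations over the two annotation sets. A minimal sketch, assuming each annotation is represented as a hashable tuple:

```python
def overlap_statistics(hands_mentions, nds_mentions):
    """Compare mention annotations from HAnDS (H) with NDS (N).
    Each annotation is a hashable tuple, e.g. (doc_id, start, end, link)."""
    H, N = set(hands_mentions), set(nds_mentions)
    return {
        "|H|": len(H),
        "|N|": len(N),
        "|H - N|": len(H - N),   # new annotations added by HAnDS
        "|H & N|": len(H & N),   # NDS annotations retained by HAnDS
        "|N - H|": len(N - H),   # NDS annotations removed as false positives
    }
```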
5.2 Extrinsic evaluation

In the extrinsic evaluation, we evaluate the performance of learning models when trained on datasets generated using the HAnDS framework. Due to resource constraints, we perform this evaluation only for the WikiFbF dataset and its variants.
Following [Ling and Weld, 2012], we divided the FgER task into two subtasks: Fine-ED, a sequence labeling problem, and Fine-ET, a multi-label classification problem. We use existing state-of-the-art models for the respective subtasks. The FgER model is a simple pipeline combination of a Fine-ED model followed by a Fine-ET model.
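Conceptually, the pipeline is just function composition over the two models. The sketch below assumes hypothetical `predict` interfaces (span detection returning token offsets, typing returning a set of types); it is not the actual API of the cited implementations.

```python
def fger_pipeline(tokens, fine_ed_model, fine_et_model):
    """Two-step FgER: detect mention spans, then type each detected span.
    fine_ed_model.predict is assumed to return [(start, end), ...] token spans and
    fine_et_model.predict a set of fine-grained types for a (tokens, span) pair."""
    spans = fine_ed_model.predict(tokens)
    return [(start, end, fine_et_model.predict(tokens, (start, end)))
            for start, end in spans]
```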
Fine-ED model:
For the Fine-ED task, we use a state-of-the-art sequence labeling based LSTM-CNN-CRF model as proposed in [Ma and Hovy, 2016].
Fine-ET model:
For the Fine-ET task, we use a state-of-the-art LSTM based model as proposed in [Abhishek et al., 2017]. Please refer to the respective papers for model details. The values of the various hyper-parameters used in the models, along with the training procedure, are mentioned in the supplementary material available at: https://github.com/abhipec/HAnDS.

The two learning models are trained on the following datasets:

(1) Wiki-FbF: Dataset created by the HAnDS framework.
(2) Wiki-FbF-w/o-III: Dataset created by the HAnDS framework without using Stage III of the pipeline.
(3) Wiki-NDS: Dataset created using the naive distant supervision approach with the same Wikipedia version used in our work.
(4) FIGER: Dataset created using the NDS approach, as shared by [Ling and Weld, 2012].

Except for the FIGER dataset, for the other datasets we randomly sampled two million sentences for model training due to computational constraints. However, during model training, as described in the supplementary material, we ensured that every model, irrespective of the dataset, is trained on approximately the same number of examples to reduce any bias introduced due to differences in the number of entity mentions present in each dataset. All extrinsic evaluation experiments subsequently reported in this section are performed on these randomly sampled datasets. Also, the same dataset is used to train the Fine-ED and Fine-ET learning models. This setting is different from [Ling and Weld, 2012], where the entity detection model is trained on the CoNLL dataset. Hence, the results reported in their work are not directly comparable.

We evaluated the learning models on the following two datasets:

(1) FIGER: This is a manually annotated evaluation corpus created by [Ling and Weld, 2012]. It contains 563 entity mentions and overall 43 different entity types. The type distribution in this corpus is skewed, as only 11 entity types are mentioned more than 10 times.
(2) 1k-WFB-g: This is a new manually annotated evaluation corpus developed specifically to cover a large type set. It contains 2420 entity mentions and overall 117 different entity types. In this corpus, 84 entity types are mentioned more than 10 times. The sentences for this dataset construction were sampled from Wikipedia text.

The statistics of these datasets are available in Table 2.
For the Fine-ED task, we evaluated the models' performance using the precision, recall and F1 metrics as computed by the standard CoNLL evaluation script. For the Fine-ET and the FgER tasks, we use the strict, loose macro-average and loose micro-average evaluation metrics described in [Ling and Weld, 2012].
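For reference, a minimal sketch of the three Fine-ET metrics, following their standard definitions in [Ling and Weld, 2012]; gold and predicted types are represented here as one set per mention.

```python
def fine_et_metrics(gold, pred):
    """gold, pred: parallel lists with one set of types per mention.
    Returns strict accuracy, loose macro F1 and loose micro F1."""
    n = len(gold)
    strict = sum(g == p for g, p in zip(gold, pred)) / n

    # Loose macro: average per-mention precision and recall, then combine into F1.
    macro_p = sum(len(g & p) / len(p) for g, p in zip(gold, pred) if p) / n
    macro_r = sum(len(g & p) / len(g) for g, p in zip(gold, pred) if g) / n
    macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r) if macro_p + macro_r else 0.0

    # Loose micro: aggregate overlaps over all mentions before computing P/R.
    overlap = sum(len(g & p) for g, p in zip(gold, pred))
    micro_p = overlap / max(sum(len(p) for p in pred), 1)
    micro_r = overlap / max(sum(len(g) for g in gold), 1)
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0

    return strict, macro_f1, micro_f1


# Example: one perfectly typed mention and one partially typed mention.
gold = [{"/person", "/person/athlete"}, {"/location"}]
pred = [{"/person", "/person/athlete"}, {"/location", "/location/city"}]
print(fine_et_metrics(gold, pred))
```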
8. Please note that there are several other models with competitive or better performance, such as [Chiu and Nichols, 2016, Lample et al., 2016, Ma and Hovy, 2016] for the sequence labeling problem and [Ren et al., 2016a, Shimaoka et al., 2017, Abhishek et al., 2017, Xin et al., 2018, Xu and Barbosa, 2018] for the multi-label classification problem. Our criteria for model selection were simple: an easy-to-use, publicly available, efficient implementation.
9.
Table 4: Performance of the entity detection models on the FIGER and 1k-WFB-g datasets.

The results of the entity detection models on the two evaluation datasets are presented in Table 4. From these results we perform two analyses: first, the effect of the training datasets on the models' performance, and second, the performance comparison between the two manually annotated datasets.

In the first analysis, we observe that the LSTM-CNN-CRF model, when trained on the Wiki-FbF dataset, has the highest F1 score on both evaluation corpora. Moreover, the average difference in precision and recall for this model is the lowest, which indicates a balanced performance across both evaluation corpora. When compared with the models trained on the NDS generated datasets (Wiki-NDS and FIGER), we observe that these models have the best precision across both corpora, but the lowest recall. The result indicates that a large number of false negative entity mentions are present in the NDS induced datasets. In the case of the model trained on the Wiki-FbF-w/o-III dataset, the performance is in between the performance of the models trained on the Wiki-NDS and Wiki-FbF datasets. However, it has a significantly lower recall, on average around 28% lower than the model trained on Wiki-FbF. This highlights the role of Stage III: by selecting only quality annotated sentences, erroneous annotations are removed, resulting in learning models trained on WikiFbF having a better and more balanced performance.

In the second analysis, we observe that models trained on datasets generated using Wikipedia as the sentence source perform better on the 1k-WFB-g evaluation corpus than on the FIGER evaluation corpus. These datasets are the FIGER training corpus, WikiFbF, Wiki-NDS and Wiki-FbF-w/o-III. The primary reason for the better performance is that the sentences constituting the 1k-WFB-g dataset were sampled from Wikipedia. Thus, this evaluation is a same-domain evaluation. On the other hand, the FIGER evaluation corpus is based on sentences sampled from news and specialized magazines (photography and veterinary domains). It has been observed in the literature that in a cross-domain evaluation setting, learning model performance is reduced compared to a same-domain evaluation [Nothman et al., 2009]. Moreover, this result also conveys that, to some extent, a learning model trained on the large Wikipedia text corpus is also able to generalize to an evaluation dataset consisting of sentences from news and specialized magazines.
10. We ensured that the test sentences are not present in any of the training datasets.
Training Datasets     FIGER                        1k-WFB-g
                      Strict   Ma-F1    Mi-F1      Strict   Ma-F1    Mi-F1
FIGER                 25.07    34.56    36.47      27.76    35.14    37.31
Wiki-NDS              30.07    37.89    38.55      39.12    49.28    51.60
Wiki-FbF
Table 5: Performance comparison for the FgER task.

Our analysis in this section as well as in Section 3.1 indicates that although the type coverage of the FIGER evaluation corpus is low (43 types), it helps to better measure a model's generalizability in a cross-domain evaluation, whereas 1k-WFB-g helps to measure performance across a large spectrum of entity types (117 types). Learning models trained on Wiki-FbF perform best on both of the evaluation corpora. This warrants the usability of the generated corpus as well as the framework used to generate the corpus.
We observe that for the Fine-ET task, there is not a significant difference between the performance of learning models trained on the Wiki-NDS dataset and models trained on the Wiki-FbF dataset. The latter model performs approximately 1% better in the micro-F1 metric computed on the 1k-WFB-g corpus. This indicates that Stage II of the HAnDS framework, where false negative entity mentions were reduced by re-linking them to Freebase, has a very high linking precision, similar to NDS, which is estimated to be about 97-98% [Wang et al., 2016].

The results for the complete FgER system, i.e., Fine-ED followed by Fine-ET, are available in Table 5. These results support our claim in Section 3.1 that the current bottleneck for the FgER task is Fine-ED, specifically the lack of resources with quality entity boundary annotations covering a large spectrum of entity types. Our work directly addresses this issue. In the FgER task performance measure, the learning model trained on WikiFbF has an average absolute performance improvement of at least 18% on all three evaluation metrics.
6. Conclusion and Discussion
In this work, we initiate a push towards moving from CgER systems to FgER systems, i.e., from recognizing entities from a handful of types to thousands of types. We propose the HAnDS framework to automatically construct quality training datasets for different variants of FgER tasks. The two datasets constructed in our work, along with the evaluation resource, are currently the largest available training and testing datasets for the entity recognition problem. They are backed by empirical experimentation warranting the quality of the constructed corpora.

The datasets generated in our work open up two new research directions related to the entity recognition problem. The first direction is towards an exploration of sequence labeling approaches in the setting of FgER, where each entity mention can have more than one type. The existing state-of-the-art sequence labeling models for the CgER task cannot be directly applied in the FgER setting due to state space explosion in the multi-label setting. The second direction is towards noise-robust sequence labeling models, where some of the entity boundaries are incorrect. For example, in our induced datasets, there are still entity detection errors, which are inevitable in any heuristic approach. There has been some work explored in [Dredze et al., 2009] assuming that it is a priori known which tokens have noise. This information is not available in our generated datasets.

Additionally, the generated datasets are much richer in entity types compared to any existing entity recognition datasets. For example, the generated datasets contain entities from several domains such as biomedical, finance, sports, products and entertainment. In several downstream applications where NER is used on a text writing style different from Wikipedia, the generated dataset is a good candidate as a source dataset for transfer learning to improve domain-specific performance.
Acknowledgments
We would like to thank the anonymous reviewers for their insightful comments. We also thank Nitin Nair for his help with code to convert data from the brat annotation tool to different formats. We acknowledge the use of computing resources made available from the Board of Research in Nuclear Science (BRNS), Dept. of Atomic Energy (DAE), Govt. of India sponsored project (No. 2013/13/8-BRNS/10026) by Dr. Aryabartta Sahu at the Department of Computer Science and Engineering, IIT Guwahati. Abhishek is supported by an MHRD fellowship, Government of India.
References
Abhishek Abhishek, Ashish Anand, and Amit Awekar. Fine-grained entity type classification by jointly learning representations and label embeddings. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 797–807, Valencia, Spain, April 2017. Association for Computational Linguistics.

Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. POLYGLOT-NER: Massive Multilingual Named Entity Recognition. In Proceedings of the 2015 SIAM International Conference on Data Mining, pages 586–594, 2014. doi: 10.1137/1.9781611974010.66.

William J Black, Fabio Rinaldi, and David Mowatt. FACILE: Description of the NE System Used for MUC-7. In Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 - May 1, 1998.

Olivier Bodenreider. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32:D267–D270, 2004. doi: 10.1093/nar/gkh061.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1247–1250. ACM, 2008. doi: 10.1145/1376616.1376746.

Jason Chiu and Eric Nichols. Named Entity Recognition with Bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370, 2016. ISSN 2307-387X. URL https://transacl.org/ojs/index.php/tacl/article/view/792.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. Ultra-fine entity typing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 87–96. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/P18-1009.

Mark Craven and Johan Kumlien. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 77–86. AAAI Press, 1999. ISBN 1-57735-083-9.

Mark Dredze, Partha Pratim Talukdar, and Koby Crammer. Sequence learning from data with multiple labels. ECML/PKDD Workshop on Learning from Multi-Label Data (MLD), pages 39–48, 2009. URL http://lpis.csd.auth.gr/workshops/mld09/mld09.pdf.

Abbas Ghaddar and Phillippe Langlais. WiNER: A Wikipedia annotated corpus for named entity recognition. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 413–422, Taipei, Taiwan, November 2017. Asian Federation of Natural Language Processing.

Dan Gillick, Nevena Lazic, Kuzman Ganchev, Jesse Kirchner, and David Huynh. Context-dependent fine-grained entity type tagging. arXiv preprint arXiv:1412.1820, 2014.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California, June 2016. Association for Computational Linguistics.

Xiao Ling and Daniel S. Weld. Fine-grained entity recognition. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI'12, pages 94–100. AAAI Press, 2012.

Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074, Berlin, Germany, August 2016. Association for Computational Linguistics.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014.

Mónica Marrero, Julián Urbano, Sonia Sánchez-Cuadrado, Jorge Morato, and Juan Miguel Gómez-Berbís. Named entity recognition: Fallacies, challenges and opportunities. Computer Standards & Interfaces, 35(5):482–489, 2013. ISSN 0920-5489. doi: 10.1016/j.csi.2012.09.004.

Shikhar Murty, Patrick Verga, Luke Vilnis, Irena Radovanovic, and Andrew McCallum. Hierarchical Losses and New Resources for Fine-grained Entity Typing and Linking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 97–109, Melbourne, Australia, July 2018. Association for Computational Linguistics.

David Nadeau and Satoshi Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26, 2007.

Joel Nothman, James R. Curran, and Tara Murphy. Transforming Wikipedia into named entity training data. In Proceedings of the Australasian Language Technology Association Workshop 2008, pages 124–132, Hobart, Australia, December 2008.

Joel Nothman, Tara Murphy, and James R. Curran. Analysing Wikipedia and gold-standard corpora for NER training. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 612–620, Athens, Greece, March 2009. Association for Computational Linguistics.

Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, and James R Curran. Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194:151–175, 2013.

Van-Thuy Phi, Joan Santoso, Masashi Shimbo, and Yuji Matsumoto. Ranking-Based Automatic Seed Selection and Noise Reduction for Weakly Supervised Relation Extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 89–95, Melbourne, Australia, July 2018. Association for Computational Linguistics.

Xiang Ren, Wenqi He, Meng Qu, Lifu Huang, Heng Ji, and Jiawei Han. AFET: Automatic Fine-Grained Entity Typing by Hierarchical Partial-Label Embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1369–1378, Austin, Texas, November 2016a. Association for Computational Linguistics. URL https://aclweb.org/anthology/D16-1144.

Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1825–1834, New York, NY, USA, 2016b. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939822.

Benjamin Roth, Tassilo Barth, Michael Wiegand, and Dietrich Klakow. A Survey of Noise Reduction Methods for Distant Supervision. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, AKBC '13, pages 73–78, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2411-3. doi: 10.1145/2509558.2509571.

Satoshi Sekine. Extended Named Entity Ontology with Attribute Information. In LREC 2008, 2008.

Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. Neural Architectures for Fine-grained Entity Type Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1271–1280, Valencia, Spain, April 2017. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL '03, pages 142–147, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics. doi: 10.3115/1119176.1119195.

Denny Vrandečić and Markus Krötzsch. Wikidata: A Free Collaborative Knowledgebase. Commun. ACM, 57(10):78–85, September 2014. ISSN 0001-0782. doi: 10.1145/2629489.

Chengyu Wang, Rong Zhang, Xiaofeng He, and Aoying Zhou. Error Link Detection and Correction in Wikipedia. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM '16, pages 307–316, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4073-1. doi: 10.1145/2983323.2983705.

Ji Xin, Yankai Lin, Zhiyuan Liu, and Maosong Sun. Improving Neural Fine-Grained Entity Typing With Knowledge Attention. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. URL https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16321.

Peng Xu and Denilson Barbosa. Neural Fine-Grained Entity Type Classification with Hierarchy-Aware Loss. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 16–25, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.

Yadollah Yaghoobzadeh, Heike Adel, and Hinrich Schuetze. Corpus-Level Fine-Grained Entity Typing.