Context-Dependent Fine-Grained Entity Type Tagging
Dan Gillick, Nevena Lazic, Kuzman Ganchev, Jesse Kirchner, David Huynh
Google Inc.
Abstract
Entity type tagging is the task of assigning category labels to each mention of an entity in a document. While standard systems focus on a small set of types, recent work (Ling and Weld, 2012) suggests that using a large fine-grained label set can lead to dramatic improvements in downstream tasks. In the absence of labeled training data, existing fine-grained tagging systems obtain examples automatically, using resolved entities and their types extracted from a knowledge base. However, since the appropriate type often depends on context (e.g. Washington could be tagged either as city or government), this procedure can result in spurious labels, leading to poorer generalization.

We propose the task of context-dependent fine type tagging, where the set of acceptable labels for a mention is restricted to only those deducible from the local context (e.g. sentence or document). We introduce new resources for this task: 12,017 mentions annotated with their context-dependent fine types, and we provide baseline experimental results on this data.

1 Introduction

The standard entity type tagging task involves identifying entity mentions (such as
Barack Obama, president, or he) in natural language text, and classifying them into predefined categories like person, location, or organization. Type tagging is useful in a variety of related natural language tasks like coreference resolution and relation extraction, as well as for downstream processing like question answering (Lin et al., 2012). Most tagging systems only consider a small set of 3-18 type labels (Hirschman and Chinchor, 1997; Tjong Kim Sang and De Meulder, 2003; Doddington et al., 2004). However, recent work by Ling and Weld (2012) suggests that using a much larger set of fine-grained types can lead to substantial improvements in relation extraction and possibly other tasks.

There exists no labeled training dataset for the fine-grained tagging task. Current systems use labels derived from knowledge bases; for example, FIGER (Ling and Weld, 2012) uses 112 Freebase types, and HYENA (Yosef et al., 2012) uses 505 YAGO types, which are Wikipedia categories mapped to WordNet synsets. In both cases, the training and evaluation examples are obtained automatically from entities resolved in Wikipedia. The resolved entities are first assigned Freebase or Wikipedia types, and these are mapped to the final set of labels.

One issue that remains unaddressed in using distant supervision to obtain labeled examples for fine type tagging is label noise. Most resolved entities have multiple type labels; however, not all of these labels typically apply in the context of a given document. For example, the entity Clint Eastwood has 30 labels in YAGO, including actor, medalist, entertainer, mayor, film director, composer, and defender. Arguably, only a few of these labels are deducible from a given news article about Clint Eastwood; similarly, only a few types can be considered "common knowledge" and thus inferrable from the mention text itself.

For applications such as database completion, finding as many correct types as possible might be appropriate, and this task has been proposed in prior work. For other applications, such as information retrieval, it is more appropriate to find only the types evoked by a mention in its context. For example, a user might want to find news articles about Clint Eastwood as mayor, and every mention which talks about Clint Eastwood as an entertainer should be considered wrong in this context.

The focus of our work is context-dependent fine-type tagging, where the set of acceptable labels for a mention is constrained to only those that are relevant to the local context (e.g. sentence, paragraph, or document). Similarly to existing systems, our label set is derived from Freebase, and training data is generated automatically from resolved entities. However, unlike existing systems, we address the presence of spurious labels in the training data by applying a number of label-pruning heuristics to the training examples. These heuristics demonstrably improve performance on manually annotated data.

We have prepared extensive new resources related to the context-dependent fine type tagging task, available at https://arxiv.org/e-print/1412.1820v2. This includes 12,017 manually annotated mentions in the OntoNotes test corpus (Weischedel et al., 2011). We hope these resources will enable more research on the problem and provide a useful basis for experimental comparison.

The paper is organized as follows. Section 2 describes the process of selecting the set of type labels (which we organize into a hierarchy), as well as the manual annotation guidelines and the ablation procedure.
Section 3 describes the distant supervision process for producing training data and the label-pruning heuristics. Section 4 describes the features we used, baseline tagging models, and inference. Section 5 outlines evaluation metrics and walks through a series of experiments that produced the best results.

2 Type Taxonomy and Annotation

Our set of fine-grained type labels T is derived from Freebase similarly to FIGER; however, we additionally organize the labels into a hierarchy. The motivation for this is that a hierarchy allows us to incorporate simple domain knowledge (for example, that an athlete is also a person, but not a location) and ensure label consistency. Furthermore, if the number of possible labels is very large, it allows for faster inference by assigning labels in a top-down manner.

The labels are organized into a tree-structured taxonomy, where each label is related to its parent in the tree via the asymmetric, anti-reflexive, transitive "IS-A" relationship. The root of the tree is a default node encompassing all types. Labels at the first level of the tree are the commonly used coarse types person, location, organization, and other. These are then subdivided into more fine-grained categories, as illustrated in Figure 1. The taxonomy was assembled manually using an iterative process, starting with Ling and Weld's non-hierarchical types. We organized the types into a hierarchy and then refined them, removing labels that seemed rare or ambiguous, and including new labels if there were enough examples to justify the addition. In general, we preferred a taxonomy that would yield a single label path for each mention and have as little ambiguity as possible.

Once this process was finalized, we mapped the common (non-hierarchical) Freebase types to our types. Whenever there was any ambiguity in the mapping, we backed off to a common parent in the hierarchy.

Finally, we note that the set of 505 YAGO types used in (Yosef et al., 2012) is also hierarchical, with 5 top-level types and 100 labels in each subcategory. Most of our types can be mapped to YAGO types in a straightforward manner. However, we found that using the YAGO labels directly would lead to much ambiguity in manual annotations, primarily due to the large number of labels, the directed-acyclic-graph nature of the hierarchy, and the presence of 'qualitative' labels (e.g. person/good person, event/happening/beginning).

Type annotations are meant to be context dependent; that is, the only assigned types should be those that are deducible from the sentence, or perhaps the paragraph. Of course, the notions of deducibility and local context are subjective. The annotators were instructed to label "San Francisco" as a location/city even if this is not made explicit in the context, since this can be considered common knowledge and should be inferrable from the mention text itself. On the other hand, in the case of any uncertainty, the annotators were instructed to back off to the parent type (in this case, location).

The corpus we annotated for this work includes all news documents in the OntoNotes test set except the longest 4, which we dropped to reduce annotator burden. Table 1 summarizes some of the corpus statistics and provides example annotations.
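To make the IS-A semantics of the taxonomy concrete before presenting it in Figure 1, here is a minimal sketch of one way such a tree of label paths could be represented. The paper does not specify an implementation; the class and method names below are illustrative choices, while the label strings follow the paper's path notation.

```python
# A minimal sketch of a tree-structured type taxonomy with IS-A semantics.
# Label paths follow the paper's notation (e.g. "person/artist/actor");
# the class and method names are illustrative, not the authors' code.

class TypeTaxonomy:
    def __init__(self, paths):
        # The set of valid label paths, e.g. {"person", "person/artist", ...}.
        self.paths = set(paths)

    def ancestors(self, label):
        # "person/artist/actor" -> ["person", "person/artist"]
        parts = label.split("/")
        return ["/".join(parts[:i]) for i in range(1, len(parts))]

    def parent(self, label):
        anc = self.ancestors(label)
        return anc[-1] if anc else None  # top-level types have no parent

    def expand(self, labels):
        # Each assigned type implies all of its ancestors (the IS-A closure).
        closed = set(labels)
        for label in labels:
            closed.update(self.ancestors(label))
        return closed

taxonomy = TypeTaxonomy(["person", "person/artist", "person/artist/actor",
                         "location", "location/city"])
print(taxonomy.expand({"person/artist/actor"}))
# {'person', 'person/artist', 'person/artist/actor'}
```

Note how assigning the leaf type actor automatically implies the more general person/artist and person, mirroring the caption of Figure 1; partial paths (root to internal node) are valid labels in their own right.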
[Figure 1: the full type taxonomy, listing the fine-grained subtypes of PERSON, LOCATION, ORGANIZATION, and OTHER.]
Figure 1: Our type taxonomy includes types at three levels, e.g. PERSON (level 1), artist (level 2), actor (level 3). Each assigned type (such as artist) also implies the more general ancestor types (such as PERSON). The top-level types were chosen to align with the most common type set used in traditional entity tagging systems.

Note that labels at level 2 (e.g. person/artist) are approximately half as common as labels at level 1 (e.g. person), but are 10 times as common as labels at level 3. The main reason for this is that we allow labels to be partial paths in the hierarchy tree (i.e. root to internal node, as opposed to root to leaf), and some of the level 3 labels rarely occur in the training data. Furthermore, many of the level 2 types have no sub-types; for example, person/athlete does not have separate sub-categories for swimmers and runners.

We built an interactive web interface for annotators to quickly apply types to mentions (including named, nominal, and pronominal mentions); on average, this task took about 10 minutes per document. Six annotators independently labeled each document and we kept the labels with support from at least two of the annotators (about 1 of every 4 labels was pruned as a result). It is worth distinguishing between two kinds of label disagreements.
Specificity disagreements arise from differing interpretations of the appropriate depth for a label, like person/artist vs. person/artist/actor. More problematic are type disagreements, arising from differing interpretations of a mention in context or of the type definitions.

Applying the agreement pruning reduces the total number of pairwise disagreements from 3900 to 1303 (specificity) and from 3700 to 774 (type). The most common remaining disagreements are shown in Table 2. Some of these could probably be eliminated by extra documentation. For example, in the sentence "Olivetti has denied that it violated Cocom rules", the mention "rules" is labeled as both other and other/legal. While it is clear from context that this is indeed a legal issue, the examples provided in the annotation guidelines are more specific to laws and courts ("5th Amendment", "Treaty of Versailles", "Roe v. Wade"). In other cases, the assignment of multiple types may well be correct: "Syrians" in "...whose lobbies and hallways were decorated with murals of ancient Syrians..." is labeled with both person and other/heritage.

We assessed the difficulty of the annotation task using average annotator precision, recall, and F1 relative to the consensus (pruned) types, shown in Table 3. As expected, there is less agreement over types that are deeper in the hierarchy, but the high precision (92% at depth 2 and 89% at depth 3) reassures us that the context-dependent annotation task is reasonably well defined.

Finally, we compared the manual annotations to the labels obtained automatically from Freebase for the resolved entities in our data. The overall recall was fairly high (80%), which is unsurprising since the Freebase-mapped types are typically a superset of the context-specific types. However, precision was considerably lower: many of the Freebase-mapped types are not deducible from the local context.
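As a concrete illustration of the agreement pruning described above, the sketch below keeps only the labels proposed by at least two of the six annotators. The function name and data layout are our own assumptions; the paper describes the rule only in prose.

```python
from collections import Counter

def consensus_labels(annotations, min_support=2):
    """Keep mention labels proposed by at least `min_support` annotators.

    `annotations` holds one set of labels per annotator for a single mention.
    """
    counts = Counter(label for labels in annotations for label in labels)
    return {label for label, c in counts.items() if c >= min_support}

# Six annotators label the mention "rules" from the Olivetti example.
annotations = [{"other/legal"}, {"other"}, {"other/legal"},
               {"other"}, {"other/legal"}, {"other"}]
print(consensus_labels(annotations))  # {'other/legal', 'other'}
```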
Statistic | Value
Documents | 77
Entity mentions | 12,017
Labels | 17,704
Labels at Level 1 | 11,909
Labels at Level 2 | 5,209
Labels at Level 3 | 586
Text: If a hostile [predator] emerges for [Saatchi & Saatchi Co.], [co-founders Charles and Maurice Saatchi] will lead ...

Fine types:
predator: organization/company, other
Saatchi & Saatchi Co.: organization/company
co-founders Charles and Maurice Saatchi: person/business
Table 1: Corpus statistics (top) and an example annotation (bottom). Levels 1, 2, and 3 correspond to levels in the label hierarchy in Figure 1. For example, Level 2 includes labels such as person/artist, while Level 3 labels are one level lower, such as person/artist/actor. Entity mentions in the example are marked with brackets.
Label 1 | Label 2
Other | Other/legal
Other | Other/product
Person | Person/business
Other | Other/currency
Person | Person/political-figure
Other | Organization/company
Other | Person
Other | Location
Other | Organization
Person/title | Person/business

Table 2: The most common remaining pairwise label disagreements after agreement pruning.
[Table 3: average annotator precision, recall, and F1 relative to the consensus types, by label depth.]
3 Training Data

3.1 Distant Supervision

Ling and Weld use the internal links in Wikipedia as training data: a linked entity inherits the Freebase types associated with the landing page. We adopt a similar strategy, but rely instead on an entity resolution system that assigns Freebase types to resolved entities, which we then map to our types.

We use a set of 133,000 news documents as the training corpus. Each document is processed by a standard NLP pipeline. This includes a part-of-speech (POS) tagger and dependency parser, comparable in accuracy to the current Stanford dependency parser (Klein and Manning, 2003), and an NP extractor that makes use of POS tags and dependency edges to identify a set of entity mentions. Thus we separate the type tagging task from the identification of entity mentions, which are often performed jointly by entity recognition systems. Lastly, our entity resolver links entity mentions to Freebase profiles; the system maps string aliases ("Barack Obama", "Obama", "Barack H. Obama", etc.) to profiles with probabilities derived from Wikipedia anchors.

Next, we apply the types induced from Freebase to each entity. As already discussed, this can introduce label noise. For example, Barack Obama is both a person/political-figure and a person/artist/author, even though only one of these may be deducible from the local context of a mention. This issue is discussed by Ritter et al. (2011) in relation to entity recognition (with 10 types) for Twitter messages, and is addressed there by constraining the set of types to those consistent with a topic model. Instead, we attempt to reduce the mismatch between the training data and our manually-annotated test data using a set of heuristics.
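The sketch below illustrates this distant supervision step under simplifying assumptions: `resolve` stands in for the entity resolution system and `FREEBASE_TO_TAXONOMY` for the manual type mapping, neither of which the paper specifies in code form.

```python
# Illustrative distant supervision: label each resolved mention with the
# taxonomy types mapped from its Freebase types. The resolver and the
# mapping table are stand-ins for components described only in prose.

FREEBASE_TO_TAXONOMY = {
    "/government/politician": "person/political-figure",
    "/book/author": "person/artist/author",
    "/location/citytown": "location/city",
}

def resolve(mention):
    # Stand-in for the resolver mapping string aliases to Freebase profiles.
    profiles = {"Barack Obama": ["/government/politician", "/book/author"]}
    return profiles.get(mention, [])

def distant_labels(mention):
    labels = set()
    for fb_type in resolve(mention):
        if fb_type in FREEBASE_TO_TAXONOMY:
            labels.add(FREEBASE_TO_TAXONOMY[fb_type])
    return labels

print(distant_labels("Barack Obama"))
# {'person/political-figure', 'person/artist/author'}
```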
3.2 Label Pruning Heuristics

Sibling pruning

The first heuristic that we apply to refine the training data removes sibling types associated with a single entity, leaving only the parent type. For example, an entity with types person/political-figure and person/athlete would end up with the single type person. The motivation for this heuristic is that it is uncommon for several sibling types to be relevant in the same context. This may remove some correct labels; for example, instances of Barack Obama will only be tagged with person, even though in many cases, person/political-figure is correct. However, less common entities associated with few Freebase types are better for generating training data, as they are usually annotated with types relevant to the context. Thus we learn about politicians from mayors and governors rather than from presidents.
Coarse type pruning
The second heuristic removes types that do not agree with the output of a standard coarse-grained type classifier trained on the set of types {person, location, organization, other}. We use a softmax classifier trained on labeled data derived from ACE (Doddington et al., 2004). We apply a simple label mapping to the four coarse types, and use features similar to those described in Klein et al. (2003). The motivation here is to reduce ambiguity by encouraging type labels to correspond to a single subtree of the hierarchy. Furthermore, if the entity is annotated with conflicting types (e.g. location and organization), this heuristic can help select the type more appropriate to the context.
Minimum count pruning

The third heuristic removes types that appear fewer than some minimum number of times in the document (in our experiments, we require each type to appear at least twice). The intuition is that types relevant to the document context (for example organization/sports-team in a sports article) are likely to apply to multiple mentions in a document.

Because the heuristics prune potentially spurious labels, they decrease the total number of training examples. Table 7 in the Experiments section shows the number of resulting training instances with each type of heuristic. Finally, we note that there exist non-trivial interactions between the heuristics. For example, Barack Obama is associated with the types person, person/political-figure, and person/artist, and the Sibling heuristic would normally prune these to person. However, if another heuristic prunes out person/artist, then the input to the Sibling heuristic would be just person and person/political-figure, resulting in no additional pruning. The heuristics are applied in the order in which they are introduced above.
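Putting the three heuristics together, here is a minimal sketch of the pruning pipeline. The helper names, the data layout, and the `coarse_type` argument (standing in for the output of the coarse classifier) are our own illustrative choices; the real system may differ in details.

```python
# A minimal sketch of the three pruning heuristics, applied in the order
# sibling -> coarse -> minimum count, as in the paper.
from collections import Counter

def parent(label):
    return label.rsplit("/", 1)[0] if "/" in label else None

def sibling_pruning(labels):
    # Replace any group of two or more sibling types with their parent.
    parent_counts = Counter(parent(l) for l in labels if parent(l))
    pruned = set(labels)
    for p, n in parent_counts.items():
        if n >= 2:
            pruned = {l for l in pruned if parent(l) != p}
            pruned.add(p)
    return pruned

def coarse_pruning(labels, coarse_type):
    # Keep only labels in the subtree of the predicted coarse type
    # (one of person, location, organization, other).
    return {l for l in labels if l.split("/")[0] == coarse_type}

def min_count_pruning(doc_mention_labels, min_count=2):
    # Drop labels appearing fewer than `min_count` times in the document.
    counts = Counter(l for ls in doc_mention_labels.values() for l in ls)
    return {m: {l for l in ls if counts[l] >= min_count}
            for m, ls in doc_mention_labels.items()}

print(sibling_pruning({"person/political-figure", "person/athlete"}))
# {'person'}
print(sibling_pruning({"person", "person/political-figure"}))
# unchanged: a single child type has no sibling to trigger the back-off
```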
4 Model

4.1 Features

For each mention of a resolved entity with at least one type, we extract a training instance (x, y), where x is a vector of binary feature indicators and y ∈ {0,1}^|T| is the binary vector of label indicators. The feature set includes the lexical and syntactic features described in Table 4, similar to those used in previous work. We also use a more semantic document topic feature, the result of training a simple bag-of-words topic model with eight topics (arts, business, entertainment, health, mayhem, politics, scitech, sport), to try to capture longer-range context. The word clusters are derived from the class-based exchange clustering algorithm described by Uszkoreit and Brants (2008).

Intuitively, the features describing the mention phrase itself are most relevant for the top level of the type taxonomy, while distinguishing types deeper in the taxonomy requires more contextual features. We use the same feature representation for all types; the relevant features for each type get weighted appropriately during learning. However, it may be worthwhile to make this distinction explicit in future work, and the hierarchy levels are a convenient structure for applying different feature sets.

Feature | Description | Example
Head | The syntactic head of the mention phrase | "Obama"
Non-head | Each non-head word in the mention phrase | "Barack", "H."
Cluster | Word cluster id for the head word | "59"
Characters | Each character trigram in the mention head | ":ob", "oba", "bam", "ama", "ma:"
Shape | The word shape of the words in the mention phrase | "Aa A. Aa"
Role | Dependency label on the mention head | "nsubj"
Context | Words before and after the mention phrase | "B:who", "A:first"
Parent | The head's lexical parent in the dependency tree | "picked"
Topic | The most likely topic label for the document | "politics"

Table 4: List of features used in type tagging. Features are extracted from each mention. Context used for the example features: "... who [Barack H. Obama] first picked ..."
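A minimal sketch of extracting a few of the features in Table 4 follows. The argument values (`head`, `dep_label`, etc.) are assumed to come from the upstream pipeline, and the feature-string format is our own convention, not the authors' exact representation.

```python
def mention_features(head, non_head_words, dep_label, before, after, topic):
    """Build a sparse binary feature set for one mention.

    Argument values are assumed to come from the POS tagger, dependency
    parser, and topic model described in Section 4.1.
    """
    feats = {"HEAD=" + head, "ROLE=" + dep_label, "TOPIC=" + topic,
             "B:" + before, "A:" + after}
    feats.update("NONHEAD=" + w for w in non_head_words)
    # Character trigrams over the head word with boundary markers, as in
    # the Characters row of Table 4.
    padded = ":" + head.lower() + ":"
    feats.update("TRI=" + padded[i:i + 3] for i in range(len(padded) - 2))
    return feats

feats = mention_features(head="Obama", non_head_words=["Barack", "H."],
                         dep_label="nsubj", before="who", after="first",
                         topic="politics")
# e.g. {'HEAD=Obama', 'ROLE=nsubj', 'TRI=:ob', 'TRI=oba', ..., 'TOPIC=politics'}
```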
4.2 Classification

Hierarchical classification can be seen as a special case of structured multilabel classification, where the output space is a class taxonomy. A recent survey (Silla and Freitas, 2011) categorizes existing approaches as: flat, using a single multiclass classifier; local, using a binary classifier for each label and enforcing label consistency at test time; local per parent node, using a multiclass classifier for all children of a node; and global, training a single multiclass classifier but replacing the standard zero-one loss with a function that reflects label similarity.

We explore the baseline flat and local approaches, and acknowledge that results could possibly be improved using more complex models. In particular, we use maximum entropy discriminative local and flat classifiers (i.e. logistic and softmax regression). We note that existing fine-type tagging systems also rely on simple linear classifiers; FIGER uses a flat multi-class perceptron, allowing multiple labels as output, while HYENA employs multiple binary support vector machine (SVM) classifiers with some postprocessing of the outputs. In general, the discriminative ability of any classifier diminishes as the number of classes increases, so we expect local classifiers to outperform a flat one. This is confirmed empirically in our experiments, as well as in existing work (i.e. HYENA outperforms FIGER).

4.2.1 Local Classifiers

In the local approach, a binary classifier is independently trained for each label, and label consistency is enforced at inference time. For each label t, we train a binary logistic regression classifier with L2 regularization.

Defining the positive and negative training examples for each binary classifier is not entirely straightforward, due to the asymmetric IS-A relationships between the labels. We set the positive examples for a type to itself and all its descendants in the hierarchy; for example, a mention labeled person/artist is considered a positive example for person. We experiment with setting the negative examples for a type as (1) all other types with the same parent, (2) all other types at the same depth, or (3) all other types.

At inference time, given the learned parameters and a test feature vector x, we first independently evaluate the probability of each type. We then consider the following three inference strategies for assigning labels:

• Independent. We assign all types whose probability exceeds some decision threshold, without enforcing the labels to correspond to a single path in the hierarchy.

• Conditional. We multiply the probability of each label t by the probability of its parent pa(t) for all types other than the top-level coarse types. This strategy ensures that if a label t is assigned at a given decision threshold, pa(t) must be assigned as well; however, it does allow for multiple paths in the hierarchy tree.

• Marginalizing out IS-A constraints. We refine the probability of each label by marginalizing out the hierarchy constraints (see the sketch at the end of this section). Specifically, we first compute the probability of each valid label configuration (each path from root to a leaf or internal node in the hierarchy) as

p(y) ∝ ∏_t p(y_t) if y is a path, and 0 otherwise. (1)

We then set the probability of an individual label t to the sum of the probabilities of configurations in which y_t = 1. Since the number of paths is not too large, we simply list all paths; with larger label sets, the marginalization can be done more efficiently using the sum-product algorithm. We assign all labels whose refined probabilities are above a given threshold.

4.2.2 Flat Classifier

In this approach, we train a flat softmax regression classifier (Berger et al., 1996) to discriminate between all possible types. This classifier expects a single type label for each instance, whereas our training examples are labeled with multiple types. To account for this, at training time, we convert each multi-label instance to multiple single-label instances. For example, an occurrence of "Canada" could be both location and organization. Rather than constructing a learning objective appropriate for such multi-label training data, we produce two training examples, one with label location and the other with label organization. At inference time, we assign all labels whose probability exceeds a threshold, rather than selecting a single highest scoring label.
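As a concrete illustration of the marginalization strategy from Section 4.2.1, the sketch below enumerates all valid root-to-node paths, scores each configuration as in Eq. (1), renormalizes, and sums path probabilities per label. The explicit enumeration mirrors the paper's observation that the number of paths is small; the function signature is our own.

```python
def marginalize(label_probs, paths):
    """Refine independent label probabilities via the IS-A path constraint.

    `label_probs` maps each label to its independent classifier probability;
    `paths` lists the valid configurations (root-to-leaf or root-to-internal
    paths, each given as a tuple of labels).
    """
    all_labels = set(label_probs)
    scores = {}
    for path in paths:
        # Unnormalized probability of this configuration (Eq. 1): multiply
        # p(y_t = 1) for on-path labels and p(y_t = 0) for all others.
        on = set(path)
        s = 1.0
        for t in all_labels:
            s *= label_probs[t] if t in on else (1.0 - label_probs[t])
        scores[path] = s
    z = sum(scores.values())
    # Marginal of each label: total mass of configurations containing it.
    return {t: sum(s for p, s in scores.items() if t in p) / z
            for t in all_labels}

probs = {"person": 0.9, "person/artist": 0.6, "location": 0.2}
paths = [("person",), ("person", "person/artist"), ("location",)]
refined = marginalize(probs, paths)
# Labels whose refined probability exceeds a threshold are then assigned.
```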
5 Experiments

Assessing the performance of a hierarchical classifier is not straightforward. Previous work introduces a variety of loss measures to evaluate hierarchical classification errors; see for example Cesa-Bianchi et al. (2006) or Weinberger and Chapelle (2008). For simplicity, we evaluate performance using precision, recall, F-score, and area under the precision/recall curve. Since the overall metrics are dominated by the level 1 types, we additionally report precision, recall, and F-score at each level (see Table 6).

We split the gold data into a development set with 16 documents and a test set with 61 documents, and report results on the test set. We only evaluate named and nominal mentions (11,197 non-pronominal mentions), as is standard in the named entity recognition literature. For the sake of simplicity, we choose a single threshold that maximizes overall F-score on the development set. We do observe a wide range of precision/recall numbers for the individual labels, so using label-specific thresholds might give better results.
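As an illustration of the threshold choice, here is a minimal sweep over candidate thresholds that maximizes micro-averaged F-score on the development set. The helper names and the candidate grid are assumptions for illustration; the paper does not describe how the threshold was searched.

```python
def f_score(gold, predicted):
    # Micro-averaged F1 over (mention, label) pairs.
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    if tp == 0:
        return 0.0
    prec, rec = tp / n_pred, tp / n_gold
    return 2 * prec * rec / (prec + rec)

def best_threshold(dev_probs, dev_gold,
                   grid=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)):
    # `dev_probs` holds one {label: probability} dict per dev mention;
    # `dev_gold` holds the corresponding gold label sets.
    def predict(tau):
        return [{l for l, p in probs.items() if p > tau}
                for probs in dev_probs]
    return max(grid, key=lambda tau: f_score(dev_gold, predict(tau)))
```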
We start by evaluating the local classifier approach described in Section 4.2.1. We compare the three strategies for selecting negative examples, as well as the three inference methods for assigning labels. For each training strategy, we report the results of the best corresponding inference method, and vice versa. The results are presented in Table 5; the best results were obtained using same-depth labels as negative training examples and marginalizing out hierarchy constraints.

Next, we compare the best local classifier results to the flat classifier described in Section 4.2.2. Note that the features and the total number of model parameters are identical for the two approaches. The results are presented in Table 6 and indicate that the local classifier outperforms the flat classifier, especially at deeper levels. The area under the precision/recall curve (AUC) is 63.7% for the flat classifier and 69.3% for the local classifier.
Level | Classifier | Precision | Recall | F1
1 | Flat | 84.39 | 79.01 | 81.61
1 | Local | 87.12 | 78.84 | 82.77
2 | Flat | 46.61 | 25.99 | 33.37
2 | Local | 56.76 | 30.88 | 40.00
3 | Flat | 75.00 | 1.78 | 3.47
3 | Local | 24.00 | 8.28 | 12.31
Table 6: Precision, recall, and F-score given by the flat and local classifiers at each level of the type taxonomy. We use all heuristics and Depth negative examples for the local classifiers. Level 1 labels are those immediately below the root of our tree: person, location, organization, and other. Level 2 labels are those below them, such as person/artist, while Level 3 labels are one level lower, such as person/artist/actor.

We compare the effects of the different heuristics for pruning training labels in Table 7, with the best settings for our models: local classifiers with same-depth negative examples and marginalizing over constraints at inference time. Table 7 also lists the number of training examples extracted from the data, as discussed in Section 3.2. It is evident that the heuristics have a significant effect on system performance, with the coarse pruning being particularly important. Together, the heuristics improve overall F1 by 11.3% and the AUC by 7.2%.
Negatives | Prec | Rec | F1 | AUC
All | 77.98 | 59.55 | 67.53 | 66.56
Sibling | 79.93 | 58.94 | 67.85 | 66.50
Depth | | | |

Inference | Prec | Rec | F1 | AUC
Independent | 77.06 | 61.54 | 68.43 | 67.74
Conditional | 77.89 | | |
Marginals | | | |

Table 5: Local classifier results for the three negative example strategies (top) and the three inference methods (bottom).

[Figure: precision/recall curves comparing the None and All heuristic settings.]
Heuristic | Examples (millions) | Precision | Recall | F1 | AUC
None | 8.58 | 67.19 | 57.56 | 62.00 | 63.62
Min-Count=2 | 8.15 | 67.96 | 59.05 | 63.20 | 64.87
Sibling | 3.07 | 74.12 | 57.23 | 64.59 | 66.44
Coarse | 6.45 | 73.21 | 59.38 | 65.57 | 67.36
All | 5.08 | 80.05 | 62.20 | 70.01 | 69.29

Table 7: A comparison of the effects of the label pruning heuristics on system performance. Examples refers to the total number of training examples extracted from the data. Each heuristic alone improves on the baseline, and together the improvement is largest, particularly in precision.

6 Conclusions

Entity type tagging is a key component of many natural language systems such as relation extraction and coreference resolution. Evidence suggests that the performance of such systems can be dramatically improved by using fine-grained type tags in place of the standard limited set of coarse tags. In the absence of labeled training data, fine type tagging systems typically obtain training data automatically, using resolved entities and types extracted from a knowledge base. As entities often have multiple assigned types, this process can result in spurious type labels, which are neither obvious from the local context nor considered common knowledge. This subtle issue is not addressed in existing systems, which are both trained and evaluated on automatically generated data.

In this paper, we strive to make fine-type tagging more meaningful by requiring context-dependence; that is, we require the assigned labels to be deducible from the local context. To this end, we introduce several distant supervision heuristics that are aimed at pruning irrelevant labels from the training data. The heuristics reduce the mismatch between the training and gold data, and lead to a significant improvement in performance. Finally, in order to provide a meaningful basis for experimental comparison, we introduce new resources for the task, including 12,017 manually-annotated mentions in 77 OntoNotes news documents with 17,704 type labels.

Our experimental results highlight some of the difficulties in performing type tagging with a large label set, especially in the case of very specific labels for which there are relatively few examples. There exist many directions for future work in this area. For example, we could consider jointly labeling multiple mentions within the same document, since their labels are likely correlated and some may be coreferent. In our current system, such correlations are only handled implicitly, through the document topic feature. Since our labels are organized in a natural hierarchy, it is also worth considering richer models designed specifically for hierarchical classification problems. Finally, we can consider adding more specific label constraints in addition to those imposed by the hierarchy; for example, we might allow some multi-path labels (e.g. location, organization), but not others (e.g. person, location).

We believe that our problem formulation, training heuristics, and new resources will help provide a meaningful framework for future research on this problem.

References
[Berger et al. 1996] Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71.

[Cesa-Bianchi et al. 2006] Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni, et al. 2006. Incremental algorithms for hierarchical classification. Journal of Machine Learning Research, 7(1).

[Doddington et al. 2004] George R. Doddington, Alexis Mitchell, Mark A. Przybocki, Lance A. Ramshaw, Stephanie Strassel, and Ralph M. Weischedel. 2004. The Automatic Content Extraction (ACE) program: tasks, data, and evaluation. In LREC.

[Hirschman and Chinchor 1997] L. Hirschman and N. Chinchor. 1997. MUC-7 named entity task definition. In Proceedings of the 7th Message Understanding Conference (MUC-7).

[Klein and Manning 2003] Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Volume 1, pages 423-430.

[Klein et al. 2003] Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning. 2003. Named entity recognition with character-level models. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 180-183. Association for Computational Linguistics.

[Lin et al. 2012] Thomas Lin, Mausam, and Oren Etzioni. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '12), pages 893-903, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Ling and Weld 2012] Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In AAAI.

[Ritter et al. 2011] Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524-1534. Association for Computational Linguistics.

[Silla and Freitas 2011] Carlos N. Silla, Jr. and Alex A. Freitas. 2011. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1-2):31-72.

[Tjong Kim Sang and De Meulder 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 142-147. Association for Computational Linguistics.

[Uszkoreit and Brants 2008] Jakob Uszkoreit and Thorsten Brants. 2008. Distributed word clustering for large scale class-based language modeling in machine translation. In ACL, pages 755-762.

[Weinberger and Chapelle 2008] Kilian Weinberger and Olivier Chapelle. 2008. Large margin taxonomy embedding with an application to document categorization. Advances in Neural Information Processing Systems, 21:1737-1744.

[Weischedel et al. 2011] Ralph Weischedel, Eduard Hovy, Mitchell Marcus, Martha Palmer, Robert Belvin, Sameer Pradhan, Lance Ramshaw, and Nianwen Xue. 2011. OntoNotes: A large training corpus for enhanced processing. In Handbook of Natural Language Processing and Machine Translation. Springer.

[Yosef et al. 2012] Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, and Gerhard Weikum. 2012. HYENA: Hierarchical Type Classification for Entity Names. In COLING.