ANEA: Distant Supervision for Low-Resource Named Entity Recognition
Michael A. Hedderich, Lukas Lange & Dietrich Klakow
Saarland University, Saarland Informatics Campus, Germany
Bosch Center for Artificial Intelligence, Germany
{mhedderich,dietrich.klakow}@lsv.uni-saarland.de, lukas.lange@de.bosch.com

Abstract
Distant supervision allows obtaining labeled training corpora for low-resource settings where only limited hand-annotated data exists. However, to be used effectively, the distant supervision must be easy to obtain. In this work, we present ANEA, a tool to automatically annotate named entities in text based on entity lists. It spans the whole pipeline from obtaining the lists to analyzing the errors of the distant supervision. A tuning step allows the user to improve the automatic annotation with their linguistic insights without having to manually label or check all tokens. In six low-resource scenarios, we show that the F1-score can be increased by on average 18 points through distantly supervised data obtained by ANEA.
Introduction

Named Entity Recognition (NER) is a core NLP task necessary for a variety of applications from information retrieval to virtual assistants. While there exist some large, hand-annotated corpora like (Tjong Kim Sang and De Meulder, 2003) or (Weischedel et al., 2011), these are limited to a selected set of languages and domains. For many low-resource languages and domains, it is not possible to manually label every token of large corpora due to time and resource constraints. To overcome this problem, weak or distant supervision methods have become popular which can automatically annotate unlabeled, raw text (Mintz et al., 2009). Even in low-resource settings, unlabeled text is often available and research has shown that it can be a useful training resource in the absence of expensive, high-quality labels.

For NER, a very common approach is to use lists, dictionaries or gazetteers of named entities (e.g. a list of person names or cities) and assign each word in the corpus the corresponding named entity label if it appears in this list of entities. This is done e.g. in (Ratinov and Roth, 2009; Gerner et al., 2010; Yang et al., 2018; Lange et al., 2019; Peng et al., 2019). However, this idea has several difficulties, such as obtaining these dictionaries (e.g. a list of city names in Estonian) or adapting the matching procedure to the specific language and domain (e.g. deciding for or against lemmatization, trading off recall and precision). In practice, distant supervision can only be beneficial and save resources if it is easy and fast to deploy.

The ANEA tool we present provides the functionality to actually use this distant supervision approach in practice for many languages and named entity types while minimizing the amount of manual effort and labeling cost. A process is provided to automatically extract entity names from Wikidata, a free and open knowledge base. This information is used to automatically annotate named entities in large amounts of unlabeled text. The tool also supports the user in tuning the automatic annotation process. This enables language experts to efficiently include their knowledge without having to manually annotate many tokens. Both a library and a graphical user interface are provided to assist users of varying technical backgrounds and different use-cases. In an experimental study on six different scenarios, we show that ANEA outperforms two baselines in nearly all cases regarding the quality of the automatic annotation. When used to provide distantly supervised training data for a neural network model, it creates on average a boost of 18 F1 points with less than 30 minutes of manual interaction.

Related Work
A variety of open-source tools exist to manually annotate text. While their focus is on the manual annotation of data, some of them support the user with certain degrees of automation. A token can be labeled automatically if it has been labeled before by the user in WebAnno (Yimam et al., 2014) and TALEN (Mayhew and Roth, 2018). In TALEN, a bilingual lexicon can be integrated, but just to support annotators that do not speak the language of the text. WebAnno and brat (Stenetorp et al., 2012) allow importing the annotations of external tools as suggestions for the user. The focus is, however, still on the user manually checking all tokens. Also, the annotator is not able to use their insight to directly influence and improve the external tool like in the tuning process of ANEA.

In the area of information extraction, the tools by Gupta and Manning (2014), Li et al. (2015) and Dalvi et al. (2016) allow the user to create rules or patterns, e.g. "[Material] conducts [Energy]". They can, however, require a large amount of manual rule creation effort to obtain good coverage for NER. With Snorkel (Ratner et al., 2019), a user can define similar and more general labeling functions. Oiwa et al. (2017) presented a tool to manually create entity lists. These lists could be imported into ANEA.

NER is closely related to entity linking. Zhang et al. (2018) presented a system to automatically link entities in many languages but focus on disaster monitoring.
ANEA

The workflow is visualized in Figure 1a and we provide an online video that shows an exemplary walk-through. The process is split into four parts:

Extraction: The user starts by searching for the category names of the entity types that should be extracted (e.g. person or film). The tool will then automatically extract the names of all the corresponding entities (e.g. for person: "Alan Turing", "Edward Sapir", ...). As the source for the extractions, we use a dump of Wikidata. It is a free and open knowledge base that is created both by manual edits and automatic processes. At the time of writing, it contains over 85 million items. For most items, the names are available in multiple languages (e.g. for city names, 8k in English and 2.5k in Estonian). Additionally, the user can also provide existing lists of entity names in case of a very specific domain.

Automatic Annotation: The automatic annotation is performed by checking each word against the list of extracted entities. A word (or token) is assigned the label of the entity name it matches. If matches of several entity names overlap, the longest match is used, e.g. for the string "United Arab Emirates", the entity name of the country is preferred over the substring "United" (the airline) if both are in the lists of entities.
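To make the longest-match rule concrete, the following is a minimal sketch of such a dictionary annotator (our own illustration, not ANEA's actual implementation; a real system would index the entity lists instead of scanning them):

    from typing import Dict, List, Tuple

    def annotate(tokens: List[str],
                 entities: Dict[str, List[Tuple[str, ...]]]) -> List[str]:
        """Assign each token the label of the longest entity name matching at its position."""
        labels = ["O"] * len(tokens)
        i = 0
        while i < len(tokens):
            best_len, best_label = 0, None
            for label, names in entities.items():
                for name in names:
                    n = len(name)
                    # Keep the longest entity name that matches starting at token i.
                    if n > best_len and tuple(tokens[i:i + n]) == name:
                        best_len, best_label = n, label
            if best_label is None:
                i += 1
            else:
                labels[i:i + best_len] = [best_label] * best_len
                i += best_len
        return labels

    entities = {
        "LOC": [("United", "Arab", "Emirates")],  # the country
        "ORG": [("United",)],                     # the airline
    }
    print(annotate(["Flying", "to", "the", "United", "Arab", "Emirates"], entities))
    # -> ['O', 'O', 'O', 'LOC', 'LOC', 'LOC']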
Evaluation: If a small set of labeled data exists, it can be used to evaluate the automatic annotation. The tool can calculate precision, recall and F1-score directly. It also reports the tokens that were most often labeled incorrectly or not labeled. For a more in-depth analysis, one can check for each token which label was assigned, which alternative labels could have been assigned and to which entities they correspond. This allows a user to easily understand issues of the automatic annotation.
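On such a labeled development set, the token-level scores for a single label can be computed directly, for instance as in the following minimal sketch (a simplification, since NER is often also scored on full entity spans):

    def precision_recall_f1(gold, pred, label):
        # Token-level counts for one entity label.
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    gold = ["LOC", "O", "LOC", "O"]
    pred = ["LOC", "LOC", "O", "O"]
    print(precision_recall_f1(gold, pred, "LOC"))  # -> (0.5, 0.5, 0.5)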
Tuning: ANEA provides a list of options with which the automatic annotation can be improved. Guided by the evaluation from the previous step, this allows the user to easily insert expertise about the language into the annotation process and prevent common mistakes while still avoiding having to annotate or post-edit many tokens manually. The options include filtering common false positives, stopword removal, adding alias names (like "COLING" for the "International Conference on Computational Linguistics"), splitting entity names, removing diacritics, requiring a minimum character length for the entities or fuzzy matching of entities. The effects of such a tuning process are visualized in Figure 1b for an Estonian dataset and the location label. Adding lemmatization in tuning-step 1 increases recall due to the rich morphological structure of the language that can hinder the matching. In step 3, location entities are given a higher priority if they conflict with person entities on the same token. In the last tuning-step, another gain can be obtained by extracting additional entity lists for Estonian locations based on the evaluation feedback. After the (optional) tuning process, unlabeled text can be automatically annotated for use as distant supervision.

[Figure 1a (workflow diagram): specification of entity types; automatic extraction of entity names or manual entry of entity names; tuning of annotation settings; automatic annotation of dev set; automatic annotation of unlabeled data; iterations of improvement.]
[Figure 1b (plot of LOC precision and LOC recall over the tuning steps): default setting; lemmatization; PER: splitting names, min. length; LOC highest priority; filter common false positives; adding additional retrievals; test evaluation.]

Figure 1: Overall workflow of ANEA (a) and development of precision and recall during the tuning process on the Estonian data (b). On the x-axis, the setting changes over time are reported.
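To make the tuning options listed above concrete, a configuration for this step could look as follows (a purely hypothetical sketch; the option names are illustrative and do not reflect ANEA's actual interface):

    # Hypothetical tuning configuration; option names are illustrative only.
    tuning_options = {
        "lemmatize": True,                  # match lemmas, helps rich morphology
        "min_entity_length": 4,             # ignore very short, ambiguous names
        "remove_stopwords": True,           # never label stopwords as entities
        "filter_false_positives": ["Of"],   # e.g. the Turkish city "Of"
        "aliases": {"International Conference on Computational Linguistics": ["COLING"]},
        "split_entity_names": ["PER"],      # also match first and last names alone
        "remove_diacritics": False,
        "fuzzy_matching": False,
        "label_priority": ["LOC", "PER"],   # LOC wins when labels conflict on a token
    }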
Experiments

We selected a variety of datasets that reflect different languages and entity granularities. The first 1500 tokens of each dataset are used as labeled training instances. Garrette and Baldridge (2013) reported that this is a number of tokens that can be annotated within two hours for a low-resource POS task. We think that this is a reasonable amount of labeled data that one can expect even in a low-resource setting, and it is also necessary for training the baselines we compare to. For English (En), the CoNLL03 dataset is probably the most popular NER dataset. It was created for the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003). To obtain a dataset with a high-resource language, but a more specialized domain, we manually annotated the location labels from the CoNLL03 dataset with more specific labels. To evaluate a non-English scenario with a fine-grained and less common label, we manually annotated Spanish (Es) news articles with the label movie. For evaluating the low-resource language scenario, datasets in Estonian (Et) (Tkachenko et al., 2013), West Frisian (Fy) (Pan et al., 2017) and Yorùbá (Yo) (Alabi et al., 2020) were chosen. The manually labeled data we created for this evaluation is made publicly available.

We evaluate against two baselines that should, like ANEA, be easy and quick to use and do not require extensive development of hand-engineered features. The Stanford NER tagger is a popular tool based on Conditional Random Fields (CRF) which we use in its suggested configuration (https://nlp.stanford.edu/software/crf-faq.html). For the second baseline, a deep-learning model, we did a preliminary study on held-out CoNLL03 data to find good settings for a low-resource scenario. This Neural Network (NN) performed better in the low-resource setting than more complex ones with a larger context or a Bi-LSTM+CRF architecture like in (Lample et al., 2016). To easily apply the model to many different languages, we used pretrained fastText embeddings (Grave et al., 2018) which are available in 157 languages. Model details are given in our published code.

Experiment A: Here, the quality of the automatic annotation is evaluated. The CRF is trained on the 1500 labeled training tokens of each dataset. Similarly for the Bi-GRU, the first 1000 tokens are used for the training and the remaining 500 tokens are held out as the development set to select the best performing epoch and avoid overfitting. For ANEA, we report the scores with and without the tuning phase.
ANEA No Tuning just uses the default settings without any labeled supervision and no manual interaction. For ANEA + Tuning, the 1500 labeled training tokens are used for the manual tuning. It was limited to no more than 10 manual steps and 30 minutes of user interaction per dataset.

(a)
            CRF             NN              ANEA No Tuning  ANEA + Tuning
            P    R    F1    P    R    F1    P    R    F1    P    R    F1
En PER      –    14   23    54   40   46    –    36   42    –    67   49
En LOC      66   22   33    54   52   52    –    45   55    56   74   64
En ORG      –    08   12    23   13   16    17   07   10    21   09   13
En CITY     –    14   25    27   43   33    16   30   21    29   51   37
En COUN.    –    05   10    63   51   56    93   80   86    84   90   87
En CONTI.   00   00   00    00   00   00    75   94   83    75   94   83
Es MOVIE    –    02   05    08   07   08    32   35   33    40   40   40
Et PER      66   24   35    61   30   40    –    17   27    41   51   45
Et LOC      59   27   37    44   25   32    71   36   48    76   63   69
Et ORG      00   00   00    17   09   12    75   12   21    81   17   29
Fy PER      07   06   07    04   03   04    55   42   48    55   42   48
Fy LOC      32   55   41    33   42   37    –    24   37    61   34   43
Fy ORG      00   00   00    00   00   00    89   07   13    90   08   14
Yo PER      33   05   10    15   22   18    11   13   12    49   43   46
Yo LOC      –    07   12    48   27   35    64   72   68    65   74   69
Yo ORG      00   00   00    07   08   08    16   28   20    46   52   49

(b)
            NN + Distant Supervision by ...
            CRF    NN     ANEA
En PER      -35    +5     +15
En LOC      -20    +1     +13
En ORG      -6     -5     –
En CITY     -13    +1     +6
En COUN.    -45    -6     +30
En CONTI.   0      0      +88
Es MOVIE    -7     +2     +14
Et PER      -7     -7     +14
Et LOC      +10    -1     +39
Et ORG      -2     0      +17
Fy PER      +1     0      +26
Fy LOC      +4     +1     +4
Fy ORG      +1     +1     +7
Yo PER      -4     +6     -5
Yo LOC      -25    +4     +5
Yo ORG      -1     +1     +20

Table 1: Results of Experiment A (a) and Experiment B (b) on the test data. For (a), we report precision (P), recall (R) and F1-score in percentage; for (b), the change in F1-score of the NN when retrained with the respective distantly supervised data (higher is better).
Experiment B: For evaluating the effect of the distant supervision, unlabeled tokens are automatically annotated by the CRF, the NN and ANEA with Tuning. The NN model is then retrained on both the manually labeled and the distantly supervised instances. 200k tokens from each of the datasets are used as unlabeled data. For Spanish, West Frisian and Yorùbá, ca. 15k, 70k and 18k tokens are used respectively due to the smaller dataset sizes. These texts are disjoint from the labeled training and test data.
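An illustrative sketch of this data setup, reusing the annotate function from the sketch above (the inline data is toy data, not the real corpora):

    gold = [(["Tallinn", "on", "ilus"], ["LOC", "O", "O"])]   # small hand-labeled set

    raw_sentences = [["Elan", "Tartus"], ["Tere", "maailm"]]  # unlabeled text
    entities = {"LOC": [("Tartus",), ("Tallinn",)]}           # extracted entity lists
    distant = [(s, annotate(s, entities)) for s in raw_sentences]

    # The NN is then retrained on the concatenation of both sets.
    train_set = gold + distant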
Results

The results of Experiment A are given in Table 1a. The CRF approach can provide a high precision but often has a very low recall due to the limited amount of training data. The NN can leverage the pre-training of the embeddings on large amounts of unlabeled text. However, the training data does not seem to be enough to reach a competitive performance. Our tool struggles most with organizations, as these are stored as several different entity types in Wikidata. Another issue is the existence of false positives for words that have other meanings beyond entity names, e.g. the Turkish city "Of". Nevertheless, reasonable results are obtained even if the amount of labeled tokens is too low for the baselines to learn anything meaningful (cf. En CONTINENT or Et ORG). Even without any labeled data, we are often able to reach competitive performance. Using the tuning process is helpful to boost the performance further. The possibility for the user to trade off precision and recall can be seen in several cases (e.g. En LOC or Et PER). Overall, ANEA outperforms the other baselines in all metrics in a majority of the settings. It achieves the best F1-score in all but one case.

The higher quality of the automatic annotation is also reflected in Experiment B (Table 1b). For 14 out of 16 evaluated entity types, the distant supervision provided by ANEA achieves the largest improvements. On average, it increases the classifier's performance by 18 points F1-score.
Conclusion

We presented a tool to obtain large amounts of distantly supervised training data for NER quickly and with little manual effort and cost. While the annotation itself is automatic, the user is able to tune it to add their expertise. To support users of varying technical backgrounds, both a library and a graphical user interface are provided. The experiments showed its usefulness in six different language and domain settings. The tool, further information, technical documentation, the additional model code and the evaluation data are available under the Apache 2 license online: https://github.com/uds-lsv/anea

References

Jesujoba O. Alabi, Kwabena Amponsah-Kaakyire, David I. Adelani, and Cristina España-Bonet. 2020. Massive vs. curated word embeddings for low-resourced languages: The case of Yorùbá and Twi. In Proc. of LREC 2020.
Bhavana Dalvi, Sumithra Bhakthavatsalam, Chris Clark, Peter Clark, Oren Etzioni, Anthony Fader, and Dirk Groeneveld. 2016. IKE - an interactive tool for knowledge extraction. In Proc. of AKBC 2016.

Dan Garrette and Jason Baldridge. 2013. Learning a part-of-speech tagger from two hours of annotation. In Proc. of NAACL 2013.

Martin Gerner, Goran Nenadic, and Casey M. Bergman. 2010. LINNAEUS: A species name identification system for biomedical literature. BMC Bioinformatics, 11(1):85.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proc. of LREC 2018.

Sonal Gupta and Christopher Manning. 2014. SPIED: Stanford pattern based information extraction and diagnostics. In Proc. of the Workshop on Interactive Language Learning, Visualization, and Interfaces.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proc. of NAACL 2016.

Lukas Lange, Michael A. Hedderich, and Dietrich Klakow. 2019. Feature-dependent confusion matrices for low-resource NER labeling with noisy labels. In Proc. of EMNLP 2019.

Yunyao Li, Elmer Kim, Marc A. Touchette, Ramiya Venkatachalam, and Hao Wang. 2015. VINERy: A visual IDE for information extraction. Proc. of the VLDB Endowment, 8(12).

Stephen Mayhew and Dan Roth. 2018. TALEN: Tool for annotation of low-resource entities. In Proc. of ACL 2018: System Demonstrations.

Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proc. of ACL 2009.

Hidekazu Oiwa, Yoshihiko Suhara, Jiyu Komiya, and Andrei Lopatenko. 2017. A lightweight front-end tool for interactive entity population. CoRR, abs/1708.00481.

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proc. of ACL 2017.

Minlong Peng, Xiaoyu Xing, Qi Zhang, Jinlan Fu, and Xuanjing Huang. 2019. Distantly supervised named entity recognition using positive-unlabeled learning. In Proc. of ACL 2019.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proc. of CoNLL 2009.

Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2019. Snorkel: Rapid training data creation with weak supervision. The VLDB Journal.

Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. brat: A web-based tool for NLP-assisted text annotation. In Proc. of the Demonstrations at EACL 2012.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proc. of the Seventh Conference on Natural Language Learning.

Alexander Tkachenko, Timo Petmanson, and Sven Laur. 2013. Named entity recognition in Estonian. In Proc. of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing.

Ralph Weischedel, Sameer Pradhan, Lance Ramshaw, Martha Palmer, Nianwen Xue, Mitchell Marcus, Ann Taylor, Craig Greenberg, Eduard Hovy, Robert Belvin, et al. 2011. OntoNotes Release 4.0. LDC2011T03.

Yaosheng Yang, Wenliang Chen, Zhenghua Li, Zhengqiu He, and Min Zhang. 2018. Distantly supervised NER with partial annotation learning and reinforcement learning. In Proc. of COLING 2018.

Seid Muhie Yimam, Chris Biemann, Richard Eckart de Castilho, and Iryna Gurevych. 2014. Automatic annotation suggestions and custom annotation layers in WebAnno. In Proc. of ACL 2014: System Demonstrations.

Boliang Zhang, Ying Lin, Xiaoman Pan, Di Lu, Jonathan May, Kevin Knight, and Heng Ji. 2018. ELISA-EDL: A cross-lingual entity extraction, linking and localization system. In Proc. of NAACL 2018: System Demonstrations.