Low-Resource Name Tagging Learned with Weakly Labeled Data
Yixin Cao, Zikun Hu, Tat-Seng Chua, Zhiyuan Liu, Heng Ji
School of Computing, National University of Singapore, Singapore
Department of CST, Tsinghua University, Beijing, China
Department of CS, University of Illinois Urbana-Champaign, U.S.A.
Abstract
Name tagging in low-resource languages or domains suffers from inadequate training data. Existing work relies heavily on additional information, while leaving unexplored the noisy annotations that exist extensively on the web. In this paper, we propose a novel neural model for name tagging based solely on weakly labeled (WL) data, so that it can be applied in any low-resource setting. To take the best advantage of all WL sentences, we split them into high-quality and noisy portions for two modules, respectively: (1) a classification module focusing on the large portion of noisy data efficiently and robustly pre-trains the tag classifier by capturing textual context semantics; and (2) a costly sequence labeling module focusing on high-quality data utilizes Partial-CRFs with non-entity sampling to achieve a global optimum. The two modules are combined via shared parameters. Extensive experiments involving five low-resource languages and a fine-grained food domain demonstrate our superior performance (6% and 7.8% F1 gains on average) as well as efficiency. Our project can be found at https://github.com/zig-kwin-hu/Low-Resource-Name-Tagging.

Introduction

Name tagging (sometimes called Named Entity Recognition, NER) is the task of identifying the boundaries of entity mentions in texts and classifying them into pre-defined entity types (e.g., person). It plays a fundamental role in providing essential inputs for many IE tasks, such as Entity Linking (Cao et al., 2018a) and Relation Extraction (Lin et al., 2017). Many recent methods utilize a neural network (NN) with Conditional Random Fields (CRFs) (Lafferty et al., 2001) by treating name tagging as a sequence labeling problem (Lample
[Figure 1 contrasts two example sentences: a weakly labeled one, in which only some mentions (e.g., Barangay Ginebra, Formula Shell) carry labels, and a fully labeled one (e.g., "Formula shell won game one in Philippines") with complete B-ORG/I-ORG/B-LOC/O tags. Weakly labeled data exists extensively on the web, while fully labeled data is expensive to obtain.]
Figure 1: Example of weakly labeled data. B-NT and I-NT denote incomplete labels without types.

et al., 2016), which has become a basic architecture due to its superior performance. Nevertheless, NN-CRFs require exhaustive human effort for training annotations, and may not perform well in low-resource settings (Ni et al., 2017). Many approaches thus focus on transferring cross-domain, cross-task and cross-lingual knowledge into name tagging (Yang et al., 2017; Peng and Dredze, 2016; Mayhew et al., 2017; Pan et al., 2017; Lin et al., 2018; Xie et al., 2018). However, they are usually limited by the extra knowledge resources, which are effective only in specific languages or domains. Actually, in many low-resource settings, there are extensive noisy annotations that naturally exist on the web yet to be explored (Ni et al., 2017). In this paper, we propose a novel model for name tagging that maximizes the potential of weakly labeled (WL) data. As shown in Figure 1, sentence s is weakly labeled, since only Formula shell and
Barangay Ginebra are annotated, leaving the remaining words unannotated. WL data is more practical to obtain, since it is difficult for people to accurately annotate entities that they do not know or are not interested in. We can construct such data from online resources, such as the anchors in Wikipedia. However, the following natures of WL data make learning name tagging from them more challenging:
Partially-Labeled Sequence
Automatically derived WL data does not contain complete annotations, and thus cannot be directly used for training. Ni et al. (2017) select the sentences with highest confidence and assume missing labels to be O (i.e., non-entity), but this introduces a bias toward recognizing mentions as non-entity. Another line of work replaces CRFs with Partial-CRFs (Täckström et al., 2013), which assign unlabeled words all possible labels and maximize the total probability (Yang et al., 2018; Shang et al., 2018). However, these methods still rely on seed annotations or domain dictionaries for high-quality training.

Massive Noisy Data
WL corpora are usually generated with massive noisy data, including missing labels and incorrect boundaries and types. Previous work filtered out WL sentences by statistical methods (Ni et al., 2017) or the output of a trainable classifier (Yang et al., 2018). However, abandoning training data may exacerbate the issue of inadequate annotation. Therefore, maximizing the potential of the massive noisy data as well as the high-quality part, while remaining efficient, is challenging.

To address these issues, we first differentiate noisy data from high-quality WL sentences via a lightweight scoring strategy, which accounts for the annotation confidence as well as the coverage of all mentions in one sentence. To take the best advantage of all WL data, we then propose a unified neural framework that solves name tagging from two perspectives: sequence labeling and classification, for the two types of data, respectively. Specifically, the classification module focuses on noisy data to efficiently pre-train the tag classifier by capturing textual context semantics. It is trained only on annotated words, ignoring the noisy unannotated words, and is thus robust and efficient during training. The costly sequence labeling module achieves a sequential optimum among word tags. It further alleviates the burden of seed annotations in Partial-CRFs and increases randomness via a Non-entity Sampling strategy, which samples O words according to linguistic natures. These two modules are combined via shared parameters. Our main contributions are as follows:

• We propose a novel neural name tagging model that relies merely on WL data without feature engineering. It can thus be adapted for both low-resource languages and domains, while no previous work deals with both at the same time.

• We consider name tagging from the two perspectives of sequence labeling and classification, to efficiently take the best advantage of both high-quality and noisy WL data.
• We conduct extensive experiments on five low-resource languages and a fine-grained domain. Since little work has been done in both types of low-resource settings simultaneously, we derive two types of baselines from state-of-the-art methods. Our model achieves significant improvements (6% and 7.8% F1 on average) while remaining efficient, as demonstrated in further ablation studies.
Related Work

Name tagging is a fundamental task of extracting entity information, which benefits many applications, such as information extraction (Zhang et al., 2017; Kuang et al., 2019; Cao et al., 2019a) and recommendation (Wang et al., 2019; Cao et al., 2019b). It can be treated as either a multi-class classification problem (Hammerton, 2003; Xu et al., 2017) or a sequence labeling problem (Collobert et al., 2011), but very little work has combined the two. The difference between them mainly lies in whether the method models sequential label constraints, which have been demonstrated effective in many NN-CRFs models (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016). However, these models require a large amount of human-annotated corpora, which are usually expensive to obtain.

The above issue motivates a lot of work on name tagging in low-resource languages or domains. A typical line of effort focuses on introducing external knowledge via transfer learning (Fritzler et al., 2018; Hofer et al., 2018), such as the use of cross-domain (Yang et al., 2017), cross-task (Peng and Dredze, 2016; Lin et al., 2018) and cross-lingual resources (Ni et al., 2017; Xie et al., 2018; Zafarian et al., 2015; Zhang et al., 2016; Mayhew et al., 2017; Tsai et al., 2016; Feng et al., 2018; Pan et al., 2017). Although they achieve promising results, there is a large amount of weak annotation on the Web that has not been well studied (Nothman et al., 2008; Ehrmann et al., 2011). Yang et al. (2018) and Shang et al. (2018) utilized Partial-CRFs (Täckström et al., 2013) to model incomplete annotations for specific domains, but they still rely on seed annotations or a domain dictionary. Therefore, we aim at filling the gap in low-resource name tagging research by using only WL data, and adapt our model to arbitrary low-resource languages or domains, which can be further improved by the above transfer-based methods.

Figure 2: Framework. Rectangles denote the main components of the two steps, and rounded rectangles constitute the two modules of the neural model. In input sentences, bold fonts denote labeled words, with the corresponding outputs at the top. We use Partial-CRFs to model all possible label sequences (red paths from left to right, picking one label per column), controlled by non-entity sampling (strikethrough labels according to the distribution). We replace the "UN" and "x-NT" labels with the corresponding possible labels to clarify the principle of PCRFs.
Problem Formulation

We formally define the name tagging task as follows: given a sequence of words X = ⟨x_1, ..., x_i, ..., x_|X|⟩, it aims to infer a sequence of labels Y = ⟨y_1, ..., y_i, ..., y_|X|⟩, where |X| is the length of the sequence and y_i ∈ Y is the label of word x_i. Each label consists of boundary and type information, such as B-ORG indicating that the word is the Beginning of an ORGanization entity. To make notations consistent, we use Y~ = Y ∪ {UN, B-NT, I-NT} to denote the label set of WL data, where UN indicates that the word is unlabeled, and NT denotes that only the type is unlabeled. In other words, a word with UN may take any label in Y, and a word with NT may take any type. We define Y~ for notation clarity.

To deal with the issue of limited annotations, we construct WL data D = {(X, Y~)} based on Wikipedia anchors and taxonomy, where Y~ = ⟨y~_1, ..., y~_i, ..., y~_|X|⟩ and y~_i ∈ Y~. An anchor ⟨m, e⟩ ∈ A links a mention m to an entity e ∈ E, where m contains one or several consecutive words of length |m|. In particular, we define A(X) as the set of anchors in X. Most entities are mapped to hierarchically organized categories, namely the taxonomy T, which provides category information C = {c}. We define C(e) as the category set of e, and T↓(c) as the children of c.

The goal of our method is to extract WL data from Wikipedia and use them as training corpora for name tagging. As shown in Figure 2, there are two steps in our framework:
Weakly Labeled Data Generation generates as much WL data as possible for higher tagging recall. It contains two components: label induction and a data selection scheme. First, label induction assigns each word a label based on Wikipedia anchors and taxonomy. Then, the data selection scheme computes quality scores for the WL sentences by considering the coverage of mentions as well as the label confidence. According to the scores, we split the entire set into two parts: a small set of high-quality data for the sequence labeling module, and a large amount of noisy data for the classification module.
Neural Name Tagging Model aims at efficiently and robustly utilizing both high-quality and noisy WL data, ensuring satisfactory tagging precision. It makes the best use of labeled words via the sequence labeling module and the classification module. More specifically, we pre-train the classification module to capture textual context semantics from the massive noisy data, and then the sequence labeling module further fine-tunes the shared neural networks using a Partial-CRFs layer with Non-Entity Sampling.
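To make the weak label set Y~ from the problem formulation concrete, the following sketch expands each weak label into the set of complete labels it may stand for, which is exactly the candidate set the Partial-CRFs layer later traverses. The three entity types are the ones used in our experiments (Person, Location, Organization); the helper name is illustrative.

```python
# Illustrative sketch: expanding weak labels (UN, B-NT, I-NT) into the
# candidate label sets consumed by the Partial-CRFs layer.
TYPES = ["PER", "LOC", "ORG"]
Y = ["O"] + [f"{b}-{t}" for t in TYPES for b in ("B", "I")]  # complete labels

def candidate_labels(weak_label):
    """Map a weak label to the set of complete labels it may stand for."""
    if weak_label == "UN":            # unlabeled word: any label is possible
        return set(Y)
    if weak_label == "B-NT":          # boundary known, type unknown
        return {f"B-{t}" for t in TYPES}
    if weak_label == "I-NT":
        return {f"I-{t}" for t in TYPES}
    return {weak_label}               # already a complete label in Y
```

For the weakly labeled sentence in Figure 1, the two anchored words keep singleton candidate sets, while every unannotated word receives the full label set Y.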
Weakly Labeled Data Generation

Existing methods use Wikipedia (Ni et al., 2017; Pan et al., 2017; Geiß et al., 2017) to train an extra classifier to predict entity categories for name tagging training. Instead, we aim at lowering the requirements on additional resources in order to support more low-resource settings. We thus utilize a lightweight strategy to generate WL data, consisting of label induction and a data selection scheme.
Label Induction

Given a sentence X including anchors A(X) and the taxonomy T, we aim at inducing a label y~ ∈ Y~ for each word x ∈ X. Obviously, the words outside of anchors should be labeled UN, indicating that they are unlabeled and could be O or unannotated mentions. For the words in an anchor ⟨m, e⟩, we label them according to the entity categories. For example, the words Formula and Shell (Figure 1) in s are labeled B-ORG and I-ORG, respectively, because the mention Formula Shell is linked to the entity Shell Turbo Chargers, which belongs to the category Basketball teams. We trace it along the taxonomy T: Basketball teams → ... → Organizations, and find that it is a child of Organizations. According to a manually defined mapping Γ(Y) → C (e.g., Γ(ORG) = Organizations), we denote all the classes and their children with the same type (e.g., ORG).

However, there are two minor issues. First, for entities without category information, C(e) = ∅, we label the words B-NT or I-NT, indicating that they have no type information. Second, for entities referring to multiple categories, we induce the label that maximizes the conditional probability:

argmax_{y*} p(y* | C(e)) = Σ_{c ∈ C(e)} 1(c ∈ T↓(Γ(y*))) / |C(e)|    (1)

where 1(·) is 1 if the condition holds true, and 0 otherwise. By doing so, we obtain a set of WL sentences D = {(X, Y~)}. However, the induction process may introduce incorrect boundaries and types due to the crowdsourcing nature of the source data. We thus design a data selection scheme to deal with these issues.

Data Selection Scheme

Following Ni et al. (2017), we compute quality scores for sentences to distinguish high-quality from noisy data from two aspects: the annotation confidence and the annotation coverage. The annotation confidence measures the likelihood of the text spans being mentions (i.e., correctness of boundaries) and being assigned the correct types. We define it as follows:

q(X, Y~) = Σ_{(x_i, y~_i)} 1(y~_i ∈ Y) p(y~_i | C(e)) p(C(e) | x_i) / |X|    (2)

where p(C(e) | x_i) is the conditional probability of x_i linking to an entity belonging to category set C(e); we compute it based on statistical frequency among Wikipedia anchors. The annotation coverage measures the ratio of words being labeled in the sentence:

n(X, Y~) = Σ_{(x_i, y~_i)} 1(y~_i ∈ Y) / |X|    (3)

We select high-quality sentences D_hq satisfying:

q(X, Y~) ≥ θ_q;  n(X, Y~) ≥ θ_n    (4)

where θ_q and θ_n are hyperparameters. The remaining sentences form the noisy set D_noise. For example (Figure 2), the sentence ... Barangay Ginebra and Formula Shell ... is high-quality, and The team is owned by Ginebra is noisy. This is because more anchors link Formula Shell to an organization entity, and the anchors within the first sentence account for a large proportion of its words, leading to a higher quality score. Note that Barangay and Ginebra are labeled B-NT and I-NT, indicating that the type information is missing. Our model may learn the textual semantics for classifying Ginebra as ORG from the noisy sentence, where Ginebra is labeled B-ORG.
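The data selection scheme above (Equations 2-4) can be sketched as follows. Each token is a dict with illustrative field names of our choosing: "complete" marks labels in Y (not UN/B-NT/I-NT), "p_label" stands for p(y~ | C(e)), and "p_link" for p(C(e) | x); in the real pipeline both probabilities come from Wikipedia anchor statistics.

```python
# Illustrative sketch of the data selection scheme (Equations 2-4).
def quality_score(tokens):
    """q(X, Y~): summed label confidence of completely labeled tokens over |X|."""
    s = sum(t["p_label"] * t["p_link"] for t in tokens if t["complete"])
    return s / len(tokens)

def coverage_score(tokens):
    """n(X, Y~): fraction of tokens carrying a complete label."""
    return sum(1 for t in tokens if t["complete"]) / len(tokens)

def split_corpus(sentences, theta_q, theta_n):
    """Equation 4: sentences passing both thresholds are high-quality."""
    high, noisy = [], []
    for sent in sentences:
        good = quality_score(sent) >= theta_q and coverage_score(sent) >= theta_n
        (high if good else noisy).append(sent)
    return high, noisy
```

High-quality sentences feed the sequence labeling module; everything else goes to the classification module.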
Neural Name Tagging Model

Our neural model contains two modules that share the same NN architecture except for the Partial-CRFs layer. Given D_hq and D_noise, we first pre-train the classification module using the massive noisy data D_noise to efficiently capture textual semantics. Then, we use the sequence labeling module to fine-tune the classification module on the high-quality data D_hq by considering the transitional constraints among sequential labels.

Sequence Labeling Module

Before describing the NN of the classification module, we first introduce the sequence labeling module. Different from conventional NN-CRFs models, we utilize a Partial-CRFs layer to maximize the probability of all possible sequential labels for a sentence with transitional constraints, where the probability of missing word labels is controlled by non-entity sampling.
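The two-stage training schedule just described can be sketched as below; every component function is an illustrative placeholder standing in for the shared encoder, the classification pre-training, and the Partial-CRFs fine-tuning described in the following sections.

```python
# Illustrative sketch of the two-stage training schedule: pre-train the
# classifier on noisy data, then fine-tune the shared encoder with the
# sequence labeling module on high-quality data.
def train_name_tagger(d_hq, d_noise, build_encoder, pretrain_cls, finetune_pcrf):
    encoder = build_encoder()          # char-CNN + word embeddings + Bi-LSTM
    pretrain_cls(encoder, d_noise)     # classification module on noisy data
    finetune_pcrf(encoder, d_hq)       # sequence labeling module on clean data
    return encoder
```

Because the encoder object is shared, whatever context semantics the classifier learns from the noisy portion is inherited by the sequence labeling module before fine-tuning starts.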
Partial-CRFs
Partial-CRFs (PCRFs) were first proposed in the field of Part-of-Speech tagging (Täckström et al., 2013). They can be trained when the coupled word and label constraints provide only a partial signal, by assuming that the uncoupled words may refer to multiple labels. Given (X, Y~), we traverse all possible labels in Y for each unannotated word {x_i | y~_i ∈ {UN, B-NT, I-NT}} (e.g., the red paths in Figure 2), and compute the total probability of the possible fully labeled sentences Y(X, Y~) = {(X, Y)}:

p(Y~ | X) = Σ_{(X,Y) ∈ Y(X, Y~)} p(Y | X)    (5)

where p(Y | X) = softmax(s(X, Y)), the same as in CRFs, and the score function s(X, Y) is:

s(X, Y) = Σ_{i=0}^{|X|} A_{y_i, y_{i+1}} + Σ_{i=1}^{|X|} P_{x_i, y_i}    (6)

where P_{x_i, y_i} is the score indicating how likely x_i is to be labeled y_i, defined as the output of the NN and detailed in the next section, and A_{y_i, y_{i+1}} is the transition score from label y_i to y_{i+1}, learned in this layer. Instead of the single correct label sequence as in CRFs, the loss function of Partial-CRFs minimizes the negative log-probability of the ground truth over all possible labeled sequences:

L = − Σ_{(X, Y~) ∈ D_hq} log p(Y~ | X)    (7)

Non-entity Sampling
A crucial drawback of using Partial-CRFs on WL sentences is that there are no words labeled O (i.e., non-entity words) for training (Section 6.5). To further alleviate the reliance on seed annotations, we introduce non-entity sampling, which samples O labels from unlabeled words as follows:

p(y_i = O | x_i, y~_i = UN) = α (λ_1 f_1 + λ_2 (1 − f_2) + λ_3 f_3)    (8)

where α is the non-entity ratio balancing how many unlabeled words are sampled as O; we set α in experiments according to Augenstein et al. (2017). The weighting parameters satisfy 0 ≤ λ_1, λ_2, λ_3 ≤ 1, and f_1, f_2, f_3 are feature scores. We define f_1 = 1(x_i adjoins an entity), which reflects that the words around a mention are likely to be O; f_2 is the ratio of the number of occurrences of x_i labeled within entities to its total occurrences, reflecting how frequently a word appears in a mention; and f_3 = tf · df, where tf is the term frequency and df the document frequency in Wikipedia articles.

As shown in Figure 2, the three words and, forming and an are labeled UN since they are outside of anchors. During training, they would be assigned all labels of Y in Partial-CRFs, but we sample some of them as O words according to Equation 8. Thus, and and an are instead treated as O words, because they do not appear in any anchor and are too general, due to a high f_3 value.

Classification Module

To efficiently utilize the noisy WL sentences, this module regards name tagging as a multi-class classification problem. On one hand, it predicts each word's label separately, naturally addressing the issue of inconsecutive labels. On the other hand, we focus only on the labeled words, so that the module is robust to noise, since most noise arises from the unlabeled words, and enjoys an efficient training procedure. Formally, given a noisy sentence (X, Y~) ∈ D_noise, we classify the words {x_i | y~_i ∈ Y} by capturing textual semantics within the context.
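The Partial-CRFs computation (Equations 5-7) and the non-entity sampling rule (Equation 8) of the sequence labeling module can be sketched together as follows. The forward recursion restricts each word to its candidate-label set; the feature values, the α default, and the λ weights shown are illustrative assumptions, not the paper's tuned settings.

```python
# Illustrative sketch of Partial-CRFs scoring plus non-entity sampling.
import math
import random

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def partial_forward(emit, trans, allowed):
    """Log of the summed exp-score over every label path compatible with the
    weak labels: emit[i][y] plays the role of P_{x_i,y}, trans[p][y] of A_{p,y},
    and allowed[i] is the candidate-label set of word i (all of Y for UN)."""
    alpha = {y: emit[0][y] for y in allowed[0]}
    for i in range(1, len(emit)):
        alpha = {
            y: emit[i][y] + log_sum_exp([alpha[p] + trans[p][y] for p in alpha])
            for y in allowed[i]
        }
    # Equation 7's loss subtracts this restricted sum from the partition
    # function computed with every label allowed, then negates it.
    return log_sum_exp(list(alpha.values()))

def o_probability(f1, f2, f3, alpha_ratio, l1, l2, l3):
    """Equation 8: p(y_i = O | x_i, UN) = alpha * (l1*f1 + l2*(1-f2) + l3*f3).
    f1: adjoins an entity (0/1); f2: fraction of occurrences inside anchors;
    f3: tf * df score from Wikipedia articles."""
    return alpha_ratio * (l1 * f1 + l2 * (1.0 - f2) + l3 * f3)

def sample_o(feats, alpha_ratio=0.9, weights=(0.0, 0.5, 0.5), rng=random):
    """Decide whether an unlabeled word is treated as O for this pass."""
    p = o_probability(*feats, alpha_ratio, *weights)
    return rng.random() < min(p, 1.0)
```

For instance, with zero emission and transition scores over two labels and two positions, the unrestricted sum returns log 4 (four equally scored paths) while restricting both positions to a single label returns 0.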
Independently of languages and domains, we combine character and word embeddings for each word, then feed them into an encoder layer to capture contextual information for the classification layer.

Character and Word Embeddings
As inputs, we introduce character information to enhance word representations, improving robustness to morphological and misspelling noise, following Ma and Hovy (2016). Concretely, we represent a word x by concatenating its word embedding w and a Convolutional Neural Network (CNN) (LeCun et al., 1989) based character embedding c, which is obtained through convolution operations over the characters in the word, followed by max pooling and dropout.

Encoder Layer

Given a sentence X of arbitrary length, this component encodes the semantics of the words as well as their compositionality into a low-dimensional vector space. The most common encoders are CNNs, Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and the Transformer (Vaswani et al., 2017). We use the bi-directional LSTM (Bi-LSTM) due to its superior performance; we discuss this choice in Section 6.2. The Bi-LSTM (Graves et al., 2013) has been widely used for modeling sequential words, capturing both past and future input features for a given word. It stacks a forward LSTM and a backward LSTM, so that the output for a word x_i is h_i = [←h_i; →h_i], where →h_i = LSTM(X_{1:i}) and ←h_i = LSTM(X_{i:|X|}).

Classification Layer
The classification layer makes independent labeling decisions for each word, so that we can focus only on labeled words while robustly and efficiently skipping the noisy unlabeled words. In this layer, we estimate the score P_{x_i, y_i} (Equation 6) for word x_i taking label y_i. We use a fully connected layer followed by softmax to output a probability-like score:

P_{x_i, y_i} = softmax(W h_i + b)    (9)

where W ∈ R^|Y|. Note that we have no training instances for O words; thus, we also use non-entity sampling (Section 5.1). Given (X, Y~) ∈ D_noise, this module is trained to minimize the cross-entropy between the predictions and the ground truth:

L_c = − Σ_{(X, Y~) ∈ D_noise} Σ_i 1(y~_i ∈ Y) y~_i log P_{x_i, y~_i}    (10)

Training and Inference

To distill the knowledge derived from the noisy data, we first pre-train the classification module, then share the overall NN with the sequence labeling module. If we choose loose thresholds θ_q and θ_n, there is no noisy data and our model degrades to the sequential model without the pre-trained classifier. When the thresholds are strict, there is no high-quality data and our model degrades to the classification module only (Section 6.4). For inference, we use the sequence labeling module to predict the output label sequence with the largest score, as in Equation 6.

Experiments

We verify our model on five low-resource languages and a specific domain. Furthermore, we investigate the impacts of the main components as well as the hyperparameters in an ablation study.
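The noise-skipping behavior of Equation 10 can be sketched as below: cross-entropy is accumulated only at positions whose weak label is a complete label in Y, so unlabeled positions never contribute. This pure-Python stand-in replaces what would be a vectorised deep-learning operation; the function name is illustrative.

```python
# Illustrative sketch of the classification-module loss (Equation 10).
import math

def masked_cross_entropy(probs, weak_labels, complete_labels):
    """probs[i][y] is the softmax score P_{x_i,y}; weak_labels[i] is y~_i.
    Positions labeled UN / B-NT / I-NT are skipped entirely."""
    loss = 0.0
    for p, y in zip(probs, weak_labels):
        if y in complete_labels:      # only annotated words contribute
            loss -= math.log(p[y])
    return loss
```

Because the unlabeled positions are excluded rather than forced to O, the module avoids the non-entity bias discussed in the introduction.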
Datasets

Since most datasets for low-resource languages are not publicly available, we use Wikipedia data as the "ground truth" following Pan et al. (2017). Thus, we can test name tagging in low-resource languages as well as domains. We choose five languages at different low-resource levels: Welsh, Bengali, Yoruba, Mongolian and Egyptian Arabic (CY, BN, YO, MN and ARZ for short), and select 3 types: Person, Location and Organization. For the food domain, we reorganized the entities in the Wikipedia category Food and drink into 5 types: Drinks, Meat, Vegetables, Condiments and Breads, and extracted sentences containing those entities from all English Wikipedia articles to obtain as much data as possible.
Train Test
CY 106,541 146,524 1,193 3,256
BN 66,915 127,932 870 2,838
YO 36,548 10,405 77 232
MN 19,250 27,820 173 439
ARZ 18,700 28,928 195 377
Food Domain 27,798 32,155 207 253
Drinks 8,615 9,218 62 67
Meat 7,685 8,841 53 68
Vegetables 6,155 7,235 45 58
Condiments 3,737 4,084 27 30
Breads 2,515 2,777 24 30
Table 1: Statistics of the weakly labeled datasets.
We use the 20190120 Wikipedia dump for WL data construction, where the ratio of words in anchors to the whole sentence is nearly 0.12, 0.07, 0.14, 0.07 and 0.06 for CY, BN, YO, MN and ARZ, and 0.13 for the food domain, demonstrating that unlabeled words are dominant. By heuristically setting θ_q and θ_n, we obtain 56,571, 16,718, 4,131, 8,332, 6,266 and 11,297 high-quality and 49,970, 50,197, 32,417, 10,918, 12,434 and 16,501 noisy WL sentences for CY, BN, YO, MN and ARZ, and the food domain, respectively. For correctness, we then pick as test data the 25% of sentences that have the highest annotation confidence and exceed 0.3 coverage. We randomly choose 25% of the high-quality data as validation for early stopping, and use the rest for training. The statistics are shown in Table 1.

CY BN YO MN ARZ
P R F1 P R F1 P R F1 P R F1 P R F1
CNN-CRFs 84.4 76.2 80.1 92.0 89.1 90.5

Table 2: Performance (%) on low-resource languages.

Training Details
For hyper-parameter tuning, we set the non-entity feature weights λ_1, λ_2 and λ_3 heuristically. We pre-train word embeddings using GloVe (Pennington et al., 2014), and fine-tune the embeddings during training. We set the dimensions of word and character embeddings to 100 and 30, respectively. We use 30 filter kernels, each of size 3, in the character CNN, and the dropout rate is set to 0.5. For the Bi-LSTM, the hidden state has 150 dimensions. The batch size is set to 32 and 64 for the sequence labeling module and the classification module, respectively. We adopt Adam with L2 regularization for optimization, and tune the learning rate and weight decay.

Baselines
Since most low-resource name tagging methods introduce external knowledge (Section 2), which has limited availability and is out of scope for this paper, we derive two types of baselines from weakly supervised models:
Typical NN-CRFs models (Ni et al., 2017), which select high-quality WL data and regard unlabeled words as O, usually achieving very competitive results. NN denotes CNN, Transformer (Trans for short) or Bi-LSTM.
NN-PCRFs models (Yang et al., 2018; Shang et al., 2018). Although they achieve state-of-the-art performance, methods of this type have only been evaluated in specific domains and require a small set of seed annotations or a domain dictionary. We thus carefully adapt them to low-resource languages and domains by selecting the highest-quality WL data (with the strictest coverage threshold θ_n) as seeds. (We adopt the common parts of their models related to handling weakly labeled data, removing the parts that are specifically designed for domains, such as the instance selector (Yang et al., 2018), which hurts performance here since we have already selected the high-quality data. Note also that the dataset statistics include the noisy data, which greatly increases the size but cannot be used for evaluation.)

Results on Low-Resource Languages

Table 2 shows the overall performance of our proposed model as well as the baseline methods (P and R denote Precision and Recall; we report performance using the same hyper-parameters across languages for fairness). We can see:

Our method consistently outperforms all baselines in the five languages w.r.t. F1, mainly because we greatly improve recall (2.7% to 9.34% on average) by taking the best advantage of WL data and being robust to noise via the two modules. As for precision, Partial-CRFs perform poorly compared with CRFs due to the uncertainty of unlabeled words, while our method alleviates this issue by introducing linguistic features in non-entity sampling. An exception occurs in CY, because it has the most training data, which may bring more accurate information than sampling. Actually, we can tune the non-entity ratio α to improve precision; more studies can be found in Section 6.5. Besides, the sampling technique can utilize more prior features if available; we leave this for future work.

Among all encoders, Bi-LSTM has the greatest ability for feature abstraction and achieves the highest precision in most languages. An unexpected exception is Yoruba, where CNN achieves higher performance. This indicates that the three encoders capture textual semantics from different perspectives, so it is better to choose the encoder by considering the linguistic natures.

As for the impact of resources, all models perform worst in Yoruba. Interestingly, we conclude that the performance of name tagging in low-resource languages does not depend entirely on the absolute number of mentions in the training data, but largely on the average number of annotations per sentence. For example, Bengali has a high number of mentions per sentence and all methods achieve their best results there, while the opposite holds for Welsh, with far fewer mentions per sentence. This verifies our data selection scheme (e.g., annotation coverage n(·)), and we give more discussion in Section 6.4.

Figure 3: Ablation study of our model in Mongolian. (a) Efficiency analysis. (b) Impact of non-entity sampling ratio. (c) Impact of non-entity features.

Food D M V C B All
CNN-CRFs 67.8 69.8 57.9 42.8 46.5 60.9
BiLSTM-CRFs 64.9 69.0 62.8 50.0 62.2 63.5
Trans-CRFs 62.1 68.9 59.6 43.4 54.5 60.6
BiLSTM-PCRFs 66.1 70.7 67.2 44.4 58.3 64.4
Ours
Table 3: F1-score (%) on food domain.
Table 3 shows the overall performance in the food domain, where D, M, V, C and B denote Drinks, Meat, Vegetables, Condiments and Breads. We can observe a performance drop compared to the low-resource languages, mainly because of the larger number of types and sparser training data. Our model outperforms all baselines in all food types, by 7.8% on average. The performance on condiments is relatively low, because most of them are composed of meat or vegetables, such as steak sauce, which overlaps with other types and makes recognition more difficult.
Figure 4: Our predictions on a noisy WL sentence.
Here is a representative case demonstrating that our model is robust to noise induced by unlabeled words. In Figure 4, the sentence is from the noisy WL training data of the food domain, and only Maize is labeled B-V. Although our model is trained on this sentence, it successfully predicts yams as B-V. This example shows that our two-module design can utilize the noisy data while avoiding the side effects caused by incomplete annotation.
We utilize θ n , the main factor to annotation qual-ity (Section 6.2), to trade off between high-qualityand noisy WL data. As shown in Figure 3(a),the red curve denotes the training time and theblue curve denotes F1. We can see that the per-formance of our model is relatively stable when θ n ∈ [0 , . , while the time cost drops dramat-ically (from 90 to 20 minutes), demonstrating therobustness and efficiency of two-modules design.When θ n ∈ [0 . , . , the performance decreasesgreatly due to less available high-quality data forsequence labeling module; meanwhile, little timeis saved through classification module. Thus, wepick up θ n = 0 . in experiments. A specialcase happens when θ n = 0 , our model degradesto sequence labeling without pre-trained classifier.We can see the performance is worse than that of θ n = 0 . due to massive noisy data. We use non-entity ratio α to control sampling,and a higher α denotes that more unlabeled wordsare labeled with O. As shown in Figure 3(b), theprecision increases as more words are assignedwith labels, while the recall achieves two peaks( α = 0 . , . ), leading to the highest F1 when α = 0 . , which conforms to the statistics in Au-genstein et al. (2017). There are two special cases.When α = 0 , our model degrades to a NN-PCRFsmodel without non-entity sampling and there isno seed annotations for training. We can see themodel performs poorly due to the dominant unla-beled words (Section 5.1). When α = 1 indicatingall unlabeled words are sampled as O, our modelegrades to NN-CRFs model, which has higherprecision at the cost of recall. Clearly, the modelsuffers from the bias to O labeling. We propose three features for non-entity samples:nearby entities ( f ), ever within entities ( f ) andterm/document frequency ( f ). We now investi-gate how effective each feature is. Figure 3(c)shows the performance of our model that sam-ples non-entity words using each feature as well astheir combinations. 
The first bar denotes the performance of sampling without any features. It is not satisfying but competitive, indicating the importance of non-entity sampling to partial-CRFs. The single f1 contributes the most, and is enhanced by f2 because they provide complementary information. Surprisingly, f3 seems better than f2, but makes the model worse when combined with f1 and f2; thus we set λ3 = 0.

Conclusions

In this paper, we propose a novel name tagging model that consists of two modules, sequence labeling and classification, which are combined via shared parameters. We automatically construct WL data from Wikipedia anchors and split them into high-quality and noisy portions for training each module. The sequence labeling module focuses on high-quality data and is costly due to the partial-CRFs layer with non-entity sampling, which models all possible label combinations. The classification module focuses on the annotated words in noisy data to pretrain the tag classifier efficiently. Experimental results in five low-resource languages and a specific domain demonstrate its effectiveness and efficiency. In the future, we are interested in incorporating entity structural knowledge to enhance text representation (Cao et al., 2017, 2018b), applying transfer learning (Sun et al., 2019) to deal with massive rare words and entities for low-resource name tagging, or introducing external knowledge for further improvement.
Acknowledgments
NExT++ research is supported by the National Re-search Foundation, Prime Minister’s Office, Sin-gapore under its IRC@SG Funding Initiative.
References
Isabelle Augenstein, Leon Derczynski, and Kalina Bontcheva. 2017. Generalisation in named entity recognition: A quantitative analysis. Computer Speech & Language.

Yixin Cao, Lei Hou, Juanzi Li, and Zhiyuan Liu. 2018a. Neural collective entity linking. In COLING.

Yixin Cao, Lei Hou, Juanzi Li, Zhiyuan Liu, Chengjiang Li, Xu Chen, and Tiansi Dong. 2018b. Joint representation learning of cross-lingual words and entities via attentive distant supervision. In EMNLP.

Yixin Cao, Lifu Huang, Heng Ji, Xu Chen, and Juanzi Li. 2017. Bridge text and knowledge by learning multi-prototype entity mention embedding. In ACL.

Yixin Cao, Zhiyuan Liu, Chengjiang Li, Juanzi Li, and Tat-Seng Chua. 2019a. Multi-channel graph neural network for entity alignment. In ACL.

Yixin Cao, Xiang Wang, Xiangnan He, Zikun Hu, and Tat-Seng Chua. 2019b. Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences. In WWW.

Jason Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional lstm-cnns. TACL.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. JMLR.

Maud Ehrmann, Marco Turchi, and Ralf Steinberger. 2011. Building a multilingual named entity-annotated corpus using annotation projection. In Proceedings of the International Conference Recent Advances in Natural Language Processing.

Xiaocheng Feng, Xiachong Feng, Bing Qin, Zhangyin Feng, and Ting Liu. 2018. Improving low resource named entity recognition using cross-lingual knowledge transfer. In IJCAI.

Alexander Fritzler, Varvara Logacheva, and Maksim Kretov. 2018. Few-shot classification in named entity recognition task. arXiv preprint arXiv:1812.06158.

Johanna Geiß, Andreas Spitz, and Michael Gertz. 2017. Neckar: a named entity classifier for wikidata. In International Conference of the German Society for Computational Linguistics and Language Technology. Springer.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In .

James Hammerton. 2003. Named entity recognition with long short-term memory. In NAACL.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Maximilian Hofer, Andrey Kormilitzin, Paul Goldberg, and Alejo Nevado-Holgado. 2018. Few-shot learning for named entity recognition in medical text. arXiv preprint arXiv:1811.05468.

Jun Kuang, Yixin Cao, Jianbing Zheng, Xiangnan He, Ming Gao, and Aoying Zhou. 2019. Improving neural relation extraction with implicit mutual relations. arXiv preprint arXiv:1907.05333.

John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL.

Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation.

Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2017. Neural relation extraction with multi-lingual attention. In ACL.

Ying Lin, Shengqi Yang, Veselin Stoyanov, and Heng Ji. 2018. A multi-lingual multi-task architecture for low-resource sequence labeling. In ACL.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. In ACL.

Stephen Mayhew, Chen-Tse Tsai, and Dan Roth. 2017. Cheap translation for cross-lingual named entity recognition. In EMNLP.

Jian Ni, Georgiana Dinu, and Radu Florian. 2017. Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In ACL.

Joel Nothman, James R Curran, and Tara Murphy. 2008. Transforming wikipedia into named entity training data. In Proceedings of the Australasian Language Technology Association Workshop 2008.

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In ACL.

Nanyun Peng and Mark Dredze. 2016. Improving named entity recognition for chinese social media with word segmentation representation learning. In ACL.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In EMNLP.

Jingbo Shang, Liyuan Liu, Xiaotao Gu, Xiang Ren, Teng Ren, and Jiawei Han. 2018. Learning named entity tagger using domain-specific dictionary. In EMNLP.

Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. 2019. Meta-transfer learning for few-shot learning. In CVPR.

Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan McDonald, and Joakim Nivre. 2013. Token and type constraints for cross-lingual part-of-speech tagging. TACL.

Chen-Tse Tsai, Stephen Mayhew, and Dan Roth. 2016. Cross-lingual named entity recognition via wikification. In CoNLL.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.

Xiang Wang, Dingxian Wang, Canran Xu, Xiangnan He, Yixin Cao, and Tat-Seng Chua. 2019. Explainable reasoning over knowledge graphs for recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5329–5336.

Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A Smith, and Jaime Carbonell. 2018. Neural cross-lingual named entity recognition with minimal resources. In EMNLP.

Mingbin Xu, Hui Jiang, and Sedtawut Watcharawittayakul. 2017. A local detection approach for named entity recognition and mention detection. In ACL.

Yaosheng Yang, Wenliang Chen, Zhenghua Li, Zhengqiu He, and Min Zhang. 2018. Distantly supervised ner with partial annotation learning and reinforcement learning. In COLING.

Zhilin Yang, Ruslan Salakhutdinov, and William W Cohen. 2017. Transfer learning for sequence tagging with hierarchical recurrent networks. In ICLR.

Atefeh Zafarian, Ali Rokni, Shahram Khadivi, and Sonia Ghiasifard. 2015. Semi-supervised learning for named entity recognition using weakly labeled training data. In AISP.

Boliang Zhang, Xiaoman Pan, Tianlu Wang, Ashish Vaswani, Heng Ji, Kevin Knight, and Daniel Marcu. 2016. Name tagging for low-resource incident languages based on expectation-driven learning. In NAACL.

Jing Zhang, Yixin Cao, Lei Hou, Juanzi Li, and Hai-Tao Zheng. 2017. Xlink: An unsupervised bilingual entity linking system. In