Cross-Lingual Named Entity Recognition Using Parallel Corpus: A New Approach Using XLM-RoBERTa Alignment
Bing Li
Microsoft [email protected]
Yujie He
Microsoft [email protected]
Wenjin Xu
Microsoft [email protected]
Abstract
We propose a novel approach for cross-lingual Named Entity Recognition (NER) zero-shot transfer using parallel corpora. We built an entity alignment model on top of XLM-RoBERTa to project the entities detected on the English part of the parallel data onto the target-language sentences, and its accuracy surpasses all previous unsupervised models. With the alignment model we obtain a pseudo-labeled NER data set in the target language to train a task-specific model. Unlike translation-based methods, this approach benefits from the natural fluency and nuances of text originally written in the target language. We also propose a modified loss function that is similar to focal loss but assigns weights in the opposite direction, to further improve model training on the noisy pseudo-labeled data set. We evaluated the proposed approach on benchmark data sets for 4 target languages and achieved competitive F1 scores compared to the most recent SOTA models. We also discuss the impact of parallel corpus size and domain on the final transfer performance.
Named entity recognition (NER) is a fundamental task in natural language processing, which seeks to classify words in a sentence into predefined semantic types. Because the ground-truth labels exist at the word level, supervised training of NER models often requires a large amount of human annotation effort. In real-world use cases where one needs to build multilingual models, the required human labor scales at least linearly with the number of languages, or even worse for low-resource languages. Cross-lingual transfer for Natural Language Processing (NLP) tasks has been widely studied in recent years (Conneau et al., 2018; Kim et al., 2017; Ni et al., 2017; Xie et al., 2018; Ni and Florian, 2019; Wu and Dredze, 2019; Bari et al., 2019; Jain et al., 2019), in particular zero-shot transfer, which leverages advances in a high-resource language such as English to benefit other low-resource languages. In this paper, we focus on cross-lingual transfer of the NER task, and more specifically on using parallel corpora and pretrained multilingual language models such as mBERT (Devlin, 2018) and XLM-RoBERTa (XLM-R) (Lample and Conneau, 2019; Conneau et al., 2020).

Our motivations are threefold. (1) Parallel corpora are a great resource for transfer learning and are rich between many language pairs. Some recent research focuses on using completely unsupervised machine translation (e.g., word alignment (Conneau et al., 2017)) for cross-lingual NER; however, inaccurate translations can harm transfer performance. For example, in the word-to-word translation approach, word ordering may not be well represented during translation, and such gaps in translation quality may hurt model performance on downstream tasks. (2) A method can still provide business value even if it only works for major languages that have sufficient parallel corpora, as long as it has satisfactory performance. It is a common situation in industry practice that a heavily customized task needs to be extended into major markets without annotating large amounts of data in other languages. (3) Previous attempts using parallel corpora are mostly heuristic and statistical-model based (Jain et al., 2019; Xie et al., 2018; Ni et al., 2017). Recent breakthroughs in multilingual language models have not yet been applied to such scenarios. Our work bridges this gap and revisits the topic with new technologies.

We propose a novel semi-supervised method for cross-lingual NER transfer, bridged by parallel corpora. First, we train an NER model on a source-language data set, in this case English, assuming that we have labeled task-specific data. Second, we label the English part of the parallel corpus with this model. Then, we project the recognized entities onto the target language, i.e., label the span of the same entity in the target-language portion of the parallel corpus. In this step we leverage the recent XLM-R model (Lample and Conneau, 2019; Conneau et al., 2020), which is a major distinction between our work and previous attempts. Lastly, we use this pseudo-labeled data to train the task-specific model in the target language directly. For the last step we explored the option of continuing training from a multilingual model fine-tuned on English NER data, to maximize the benefit of model transfer.
We also tried a series of methods to mitigate the noisy-label issue in this semi-supervised approach. The main contributions of this paper are as follows:

• We leverage the powerful multilingual model XLM-R for entity alignment. It was trained in a supervised manner with easy-to-collect data, in sharp contrast to previous attempts that mainly rely on unsupervised methods and human-engineered features.

• Pseudo-labeled data sets typically contain a lot of noise, so we propose a novel loss function inspired by focal loss (Lin et al., 2017). Instead of using the native focal loss, we go in the opposite direction and weight hard examples less, as those are more likely to be noise.

• By leveraging existing natural parallel corpora we obtained competitive F1 scores for NER transfer on multiple languages. We also show that the domain of the parallel corpus is critical for an effective transfer.

There are different ways to conduct zero-shot multilingual transfer. In general, there are two categories: model-based transfer and data-based transfer. Model-based transfer often uses the source language to train an NER model with language-independent features, then directly applies the model to the target language for inference (Wu and Dredze, 2019; Wu et al., 2020). Data-based transfer focuses on combining a source-language task-specific model, translations, and entity projection to create weakly supervised training data in the target language. Previous attempts include using annotation projection on aligned parallel corpora or translations between a source and a target language (Ni et al., 2017; Ehrmann et al., 2011), or utilizing the Wikipedia hyperlink structure to obtain anchor text and context as weak labels (Al-Rfou et al., 2015; Tsai et al., 2016). Different variants of annotation projection exist; e.g., Ni et al. (2017) used a maximum entropy alignment model and data selection to project English annotated labels onto parallel target-language sentences. Other work combined bilingual mappings with lexical heuristics, or used an embedding approach to perform word-level translation from which annotation projection naturally follows (Mayhew et al., 2017; Xie et al., 2018; Jain et al., 2019). This kind of translation + projection approach is used not just for NER but for other NLP tasks as well, such as relation extraction (Kumar, 2015). There are obvious limitations to the translation + projection approach: word- or phrase-based translation makes annotation projection easier, but sacrifices native fluency and language nuances. In addition, orthographic and phonetic features for entity matching may only be applicable to languages that are alike, and require extensive human-engineered features. To address these limitations, we propose a novel approach which utilizes machine translation training data combined with a pretrained multilingual language model for entity alignment and projection.
We will describe the entity alignment model component and the full training pipeline of our work in the following sections.
Translation from a source language to a target language may break the word ordering, so an alignment model is needed to project entities from the source-language sentence to the target language, allowing the labels from the source language to be zero-shot transferred. In this work, we use the XLM-R series of models (Lample and Conneau, 2019; Conneau et al., 2020), which introduced the translation language model (TLM) pretraining task. TLM trains the model to predict a masked word using information from both the context and the parallel sentence in another language, giving the model strong cross-lingual and potentially alignment capability.

Our alignment model is constructed by concatenating the English name of the entity and the target-language sentence, as segment A and segment B of the input sequence respectively. For the token-level outputs of segment B, we predict 1 if the token is inside the translated entity and 0 otherwise. This formulation transforms the entity alignment problem into a token classification task. An implicit assumption is that the entity name remains a consecutive phrase after translation. The model structure is illustrated in Fig 1.
Figure 1: Entity alignment model. The query entity on the left is 'Cologne', and the German sentence on the right is 'Köln liegt in Deutschland', which is 'Cologne is located in Germany' in English; 'Köln' is the German translation of 'Cologne'. The model predicts the word span aligned with the query entity.
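This formulation maps directly onto a standard token classification setup. The following minimal sketch (using the HuggingFace Transformers tokenizer API) illustrates how one aligned training example could be encoded; the helper name and the exact span representation are illustrative rather than a description of released code.

```python
from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-large")

def build_alignment_example(entity_en, target_sentence, target_span):
    """Encode (English entity, target-language sentence) as one input sequence.

    entity_en:       English entity name, e.g. "Cologne" (segment A)
    target_sentence: target-language sentence, e.g. "Köln liegt in Deutschland" (segment B)
    target_span:     (char_start, char_end) of the aligned mention in target_sentence
    Returns the encoding plus one label per token: 1 for segment-B tokens inside
    the aligned mention, 0 for other segment-B tokens, -100 elsewhere (ignored).
    """
    enc = tokenizer(entity_en, target_sentence,
                    return_offsets_mapping=True, truncation=True, max_length=256)
    labels = []
    for (start, end), seq_id in zip(enc["offset_mapping"], enc.sequence_ids()):
        if seq_id != 1:
            labels.append(-100)   # special tokens and segment A do not contribute to the loss
        elif start >= target_span[0] and end <= target_span[1]:
            labels.append(1)      # token lies inside the projected entity span
        else:
            labels.append(0)
    enc["labels"] = labels
    return enc

example = build_alignment_example("Cologne", "Köln liegt in Deutschland", (0, 4))
```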
Fig 2 shows the whole training/evaluation pipeline, which includes 5 stages: (1) fine-tune the pretrained language model on CoNLL2003 (Sang and Meulder, 2003) to obtain an English NER model; (2) infer labels for the English sentences in the parallel corpus; (3) run the entity alignment model from the previous subsection to find the detected English entities in the target-language sentences, filtering out examples that fail to align; (4) fine-tune the multilingual model with the data generated in step (3); (5) evaluate the new model on the target-language test sets.
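The five stages can be read as a simple sequential procedure, sketched schematically below. The callables passed in stand for the components described above; they are placeholders rather than an actual API.

```python
from typing import Callable, Iterable, Tuple

def cross_lingual_ner_pipeline(
    finetune_ner: Callable,       # used in stages (1) and (4)
    predict_entities: Callable,   # stage (2): English NER inference
    align_entity: Callable,       # stage (3): XLM-R alignment model
    evaluate: Callable,           # stage (5): evaluation on the target test set
    pretrained_lm,
    conll_en,
    parallel_corpus: Iterable[Tuple[str, str]],
    target_test_set,
):
    # (1) Fine-tune the pretrained multilingual LM on English CoNLL2003.
    en_model = finetune_ner(pretrained_lm, conll_en)

    # (2) Label the English half of each parallel sentence pair, then
    # (3) project the detected entities onto the target sentence; drop pairs
    #     where any entity fails to align.
    pseudo_labeled = []
    for en_sent, tgt_sent in parallel_corpus:
        entities = predict_entities(en_model, en_sent)
        spans = [align_entity(ent, tgt_sent) for ent in entities]
        if entities and all(s is not None for s in spans):
            pseudo_labeled.append((tgt_sent, list(zip(entities, spans))))

    # (4) Continue fine-tuning from the English model on the pseudo-labeled data.
    tgt_model = finetune_ner(en_model, pseudo_labeled)

    # (5) Evaluate the resulting model on the target-language test set.
    return evaluate(tgt_model, target_test_set)
```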
In our method, we leverage the availability of large-scale parallel corpora to transfer the NER knowledge obtained in English to other languages. Existing parallel corpora are easier to obtain than annotated NER data. We used parallel corpora crawled from the OPUS website (Tiedemann, 2012). In our experiments, we used the following data sets:
• Ted2013: volunteer transcriptions and translations from the TED web site, created as a training data resource for the International Workshop on Spoken Language Translation 2013.

• OpenSubtitles: a new collection of translated movie subtitles that contains 62 languages (Lison and Tiedemann, 2016).

• WikiMatrix: parallel sentences mined from Wikipedia in different languages; only pairs with scores above 1.05 are used (Schwenk et al., 2019).

• UNPC: manually translated United Nations documents from 1994 to 2014 (Ziemski et al., 2016).

• Europarl: a parallel corpus extracted from the European Parliament web site (Koehn, 2005).

• WMT-News: a parallel corpus of News Test Sets provided by WMT for training SMT, containing 18 languages.

• NewsCommentary: a parallel corpus of News Commentaries provided by WMT for training SMT, containing 12 languages (http://opus.nlpl.eu/News-Commentary.php).

• JW300: parallel sentences mined from the magazines Awake! and Watchtower (Agić and Vulić, 2019).

In this work, we focus on 4 languages: German, Spanish, Dutch and Chinese. We randomly select data points from all the data sets above with equal weights. There might be slight differences in data distribution between languages due to data availability and relevance.
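As a rough illustration of this equal-weight selection, sentence pairs could be drawn as follows; the corpus contents and the total budget in this sketch are placeholders, not the exact splits used in our experiments.

```python
import random

def sample_parallel_pairs(corpora, total_pairs, seed=0):
    """Draw roughly the same number of sentence pairs from each corpus.

    corpora: dict mapping corpus name -> list of (english, target) sentence pairs
    """
    rng = random.Random(seed)
    per_corpus = total_pairs // len(corpora)
    sampled = []
    for name, pairs in corpora.items():
        k = min(per_corpus, len(pairs))  # a corpus may be smaller than its quota
        sampled.extend(rng.sample(pairs, k))
    rng.shuffle(sampled)
    return sampled
```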
The objective of the alignment model is to find the entity in a foreign paragraph given its English name. We feed the English name and the paragraph as segment A and segment B into the XLM-R model (Lample and Conneau, 2019; Conneau et al., 2020).
Figure 2: Training pipeline diagram. Yellow pages represent English documents while light blue pages represent German documents. Steps 1 and 5 use the original CoNLL data for training and testing respectively; steps 2, 3, and 4 use machine translation data from the OPUS website. The pretrained model is either mBERT or XLM-R. The final model was first fine-tuned on the English NER data set and then fine-tuned on the target-language pseudo-labeled NER data set.

Unlike the NER task, the alignment task has no requirement for label completeness, since we only need one entity to be labeled in each training example. The training data set can be created from Wikipedia documents, where anchor text in hyperlinks naturally indicates the location of entities and one can obtain the English entity name via linking by Wikipedia entity Id. An alternative way to obtain the English name for mentions in another language is through a state-of-the-art translation system. We took the latter approach for simplicity and leveraged Microsoft's Azure Cognitive Services to do the translation.

During training, we also added negative examples with fake English entities that do not appear in the other language's sentence. The intuition is to force the model to focus on the English entity (segment A) and its translation, instead of doing pure NER and picking out any entity in the other language's sentence (segment B). We also added examples of noun phrases or nominal entities to make the model more robust. We generated a training set of 30K samples with 25% negatives for each language and trained an XLM-R-large model (Conneau et al., 2020) for 3 epochs with batch size 64. The initial learning rate was 5e-5 and the other hyperparameters were the defaults from the HuggingFace Transformers library for the token classification task. The precision/recall/F1 on the reserved test set reached 98%. The model training was done on 2 Tesla V100 GPUs and took about 20 minutes.

We used the CoNLL2003 (Sang and Meulder, 2003) and CoNLL2002 data sets to test our cross-lingual transfer method for German, Spanish and Dutch. We ignored the training sets in those languages and only evaluated our model on the test sets. For Chinese, we used People's Daily (https://github.com/zjy-ucas/ChineseNER) as the main evaluation set, and we also report numbers on the MSRA (Levow, 2006) and Weibo (Peng and Dredze, 2015) data sets in the next section. One notable difference for the People's Daily data set is that it only covers three entity types, LOC, ORG and PER, so we suppressed the MISC type from English during the transfer by training the English NER model with 'MISC' marked as 'O'.

To enable cross-lingual transfer, we first trained an English teacher model on the CoNLL2003 EN training set with XLM-R-large as the base model. We trained with focal loss (Lin et al., 2017) for 5 epochs. We then ran inference with this model on the English part of the parallel data. Finally, with the alignment model, we projected the entity labels onto the other languages. To ensure the quality of the target-language training data, we discarded examples if any English entity failed to map to tokens in the target language. We also discarded examples with overlapping target entities, because they would cause conflicts in token labels. Furthermore, when one entity maps to multiple target spans, we only keep the example if all the target mention phrases are the same; this accommodates the situation where the same entity is mentioned more than once in one sentence.
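These filtering rules can be summarized as a short post-processing check. The sketch below is a paraphrase of the rules above; the span representation and helper name are illustrative.

```python
def keep_projected_example(projected_spans):
    """Decide whether a pseudo-labeled target sentence is kept.

    projected_spans: for each detected English entity, the list of
                     (start, end, text) spans the alignment model produced in
                     the target sentence (possibly empty if projection failed).
    """
    # Rule 1: drop the example if any entity failed to project.
    if any(len(spans) == 0 for spans in projected_spans):
        return False

    # Rule 2: drop the example if projected spans overlap, since overlapping
    # spans would produce conflicting token labels.
    flat = sorted(s for spans in projected_spans for s in spans)
    for (s1, e1, _), (s2, e2, _) in zip(flat, flat[1:]):
        if s2 < e1:
            return False

    # Rule 3: if one entity maps to multiple target spans (e.g. it is mentioned
    # twice), keep the example only if all its surface forms agree.
    for spans in projected_spans:
        if len({text for _, _, text in spans}) > 1:
            return False

    return True
```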
Model                | DE (P / R / F1)      | ES (P / R / F1)      | NL (P / R / F1)      | ZH (P / R / F1)
Bari et al. (2019)   | – / – / 65.24        | – / – / 75.9         | – / – / 74.6         | – / – / –
Wu and Dredze (2019) | – / – / 71.1         | – / – / 74.5         | – / – / 79.5         | – / – / –
Moon et al. (2019)   | – / – / 71.42        | – / – / 75.67        | – / – / 80.38        | – / – / –
Wu et al. (2020)     | – / – / 73.16        | – / – / 76.75        | – / – / 80.44        | – / – / –
Wu et al. (2020)     | – / – / 73.22        | – / – / 76.94        | – / – / 80.89        | – / – / –
Wu et al. (2020)     | – / – / –            | – / – / –            | – / – / –            | – / – / –
Our models:
mBERT zero-transfer  | 67.6 / 77.4 / 72.1   | 72.4 / 78.2 / 75.2   | 77.8 / 79.3 / 78.6   | 64.1 / 65.0 / 64.6
mBERT fine-tune      | 73.1 / 76.2 / (+2.5) | – / – / (+2.4)       | – / – / (+0.0)       | – / – / (+6.4)
XLM-R zero-transfer  | 67.9 / 79.8 / 73.4   | 79.8 / 81.9 / –      | – / – / –            | – / – / –
XLM-R fine-tune      | – / – / (+3.5)       | – / – / (−1.9)       | – / – / (−1.5)       | – / – / (+4.2)
Table 1: Cross-lingual transfer results on German, Spanish, Dutch and Chinese. Experiments are done with both mBERT and the XLM-RoBERTa model. For each of them we compare the zero-transfer result (trained on CoNLL2003 English only) and the fine-tuned result using the zero-transfer pseudo-labeled target-language NER data; values in parentheses are F1 changes relative to the corresponding zero-transfer row. Test sets for German, Spanish and Dutch are from CoNLL2003 and CoNLL2002, and the People's Daily data set is used for Chinese.

As the last step, we fine-tuned the multilingual model pre-trained on the English data set, with the lower n (0, 3, 6, etc.) layers frozen, on the target-language pseudo-labeled data. We used both mBERT (Devlin et al., 2019; Devlin, 2018) and XLM-R (Conneau et al., 2020) with about 40K training samples for 1 epoch. The results are shown in Table 1. All the inference, entity projection and model training experiments were done on 2 Tesla V100 32G GPUs, and the whole pipeline takes about 1-2 hours. All numbers are reported as an average of 5 random runs with the same settings.

For the loss function we used something similar to focal loss (Lin et al., 2017) but with the opposite weight assignment. Focal loss was designed to weight hard examples more, and this intuition holds true only when the training data is clean. In scenarios such as this cross-lingual transfer task, the pseudo-labeled training data contains a lot of noise propagated from earlier stages of the pipeline; in that case, the 'hard' examples are more likely to be errors or outliers and could hurt the training process. We went in the opposite direction and lowered their weights instead, so that the model focuses on the less noisy labels. More specifically, we weight the regular cross entropy loss by $(1 + p_t)^\gamma$ instead of $(1 - p_t)^\gamma$; for the hyper-parameter $\gamma$ we experimented with values from 1 to 5 and found that $\gamma = 4$ works best.
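A minimal PyTorch-style sketch of this re-weighted loss, under the reading above (tensor shapes and the mean reduction are illustrative), is shown below. Setting $\gamma = 0$ recovers the plain cross entropy used as the CE baseline in the ablation study.

```python
import torch
import torch.nn.functional as F

def reverse_focal_loss(logits, labels, gamma=4.0, ignore_index=-100):
    """Cross entropy re-weighted by (1 + p_t)^gamma.

    In contrast to focal loss, confident (easy) tokens receive *larger* weights,
    so the model focuses less on hard tokens that are likely to be label noise.
    logits: (num_tokens, num_classes); labels: (num_tokens,)
    """
    ce = F.cross_entropy(logits, labels, reduction="none", ignore_index=ignore_index)
    pt = torch.exp(-ce)                      # p_t, probability of the true class
    weight = (1.0 + pt) ** gamma             # focal loss would use (1 - p_t) ** gamma
    mask = (labels != ignore_index).float()  # ignore padding / special tokens
    return (weight * ce * mask).sum() / mask.sum().clamp(min=1.0)
```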
From Table 1, we can see that for the mBERT model, fine-tuning with pseudo-labeled data has a significant effect on all languages except Dutch. The largest improvement is on Chinese, a 6.4% increase in F1 over the zero-transfer result; this number is 2.5% for German and 2.4% for Spanish. The same experiment with the XLM-R model shows a different pattern: F1 increased by 3.5% for German but dropped slightly on Spanish and Dutch after fine-tuning. For Chinese, we see an improvement of 4.2%, comparable to mBERT. The negative result on Spanish and Dutch is probably because XLM-R already has very good pretraining and language alignment for those European languages, as can be seen from the high zero-transfer numbers, so a relatively noisy data set does not bring much gain. On the contrary, Chinese is a relatively distant language from a linguistic perspective, so the added value of task-specific fine-tuning with natural data is larger.

Another pattern we observed from Table 1 is that across all languages, the fine-tuning step with pseudo-labeled data benefits precision more than recall: we observed a consistent improvement in precision but a small drop in recall in most cases.

One advantage of the parallel-corpus method is high data availability compared to the supervised approach. A natural next question is therefore whether more data is beneficial for the cross-lingual transfer task. To answer this question, we ran a series of experiments with the number of training examples varying from 5K to 200K; the model F1 score increases with the amount of data at the beginning and plateaus around 40K. All the numbers in Table 1 are reported for training on a generated data set of about 40K sentences. One possible explanation for the plateau is the propagated error in the pseudo-labeled data set. Domain mismatch may also limit the effectiveness of transfer between languages. We discuss this topic further in the next section.
Figure 3: F1 scores evaluated on three data sets using different domains' parallel data. The blue column on the left is the result of zero-shot model transfer. To its right are F1 scores for 3 different domains and all domains combined.
Learnings from the machine translation community show that the quality of neural machine translation models usually depends strongly on the domain they are trained on, and the performance of a model can drop significantly when evaluated on a different domain (Koehn and Knowles, 2017). A similar observation explains the challenges of cross-lingual NER transfer. In NER transfer, the first domain mismatch comes from the natural gap in entity distributions between corpora of different languages. Many entities only live inside the ecosystem of a specific group of languages and may not translate naturally to others. The second domain mismatch is between the parallel corpora and the NER data set: the English model might not have good domain adaptation ability and could perform well on the CoNLL2003 data set but poorly on the parallel data.

To study the impact of the domain of the parallel data on transfer performance, we ran an experiment on Chinese using parallel data from different domains.
Domain               | PER    | ORG    | LOC    | All
OpenSubtitles        | 24,036 | 3,809  | 5,196  | 33,041
UN                   | 1,094  | 25,875 | 12,718 | 39,687
News                 | 10,977 | 9,568  | 28,168 | 48,713
All Domains Combined | –      | –      | –      | –
Table 2: Entity count by type in the pseudo-labeled Chinese NER training data set. We list multiple domains that were extracted from different parallel data sources; All Domains Combined is a combination of all of them.

We picked three representative data sets from OPUS (Tiedemann, 2012): OpenSubtitles, UN (containing UN and UNPC) and News (containing WMT and News-Commentary), plus a fourth setting with all three combined. OpenSubtitles comes from movie subtitles and its language style is informal and conversational. UN comes from United Nations reports and its language style is more formal and political. The News data comes from newspapers and its content is more diverse and closer to the CoNLL data sets. We evaluated F1 on three Chinese test sets, Weibo, MSRA and People's Daily, where Weibo contains messages from social media, MSRA is more formal and political, and the People's Daily data set consists of newspaper articles.

From Fig 3 we see that OpenSubtitles performs best on Weibo but poorly on the other two test sets. Conversely, UN performs worst on Weibo but better on the other two. The News domain performs best on People's Daily, which is consistent with intuition because both come from newspaper articles. The all-domains-combined approach has decent performance on all three test sets.

Different domains of data have quite large gaps in the density and distribution of entities; for example, OpenSubtitles contains more sentences that do not have any entity. In the experiments above we filtered the data to keep the same ratio of 'empty' sentences across all domains. We also examined the difference in entity-type distributions. In Table 2 we report entity counts by type and domain. OpenSubtitles has very few ORG and LOC entities, whereas the UN data has very few PER entities. The News and all-domain data are more balanced.

In Fig 4, we show the evaluation on People's Daily by type. We want to understand how the domain of the parallel data impacts transfer performance for different entity types. The News data has the best performance on all types, and Subtitles performs very badly on ORG. All these observations are consistent with the type distribution in Table 2.

Figure 4: F1 score by type evaluated on the People's Daily data set. As in Fig 3, we compare the results using different domains of parallel data for the NER transfer.
Setting                      | Test P | Test R | F1
Sequential fine-tune with RW | 73.1   | –      | –
Zero-transfer                | 67.6   | –      | –

Table 3: Ablation study results evaluated on the CoNLL2003 German NER data. All experiments used the mBERT-base model. RW denotes the re-weighted loss we propose in this paper; CE denotes the regular cross entropy loss.
To better understand the contribution of each stage in the training process, we conducted an ablation study on the German data set with the mBERT model. We compared 6 different settings: (1) the approach proposed in this work, i.e., fine-tune the English model with pseudo-labeled data using the new loss, denoted as re-weighted (RW) loss; (2) zero-transfer, i.e., direct model transfer trained only on English NER data; (3) fine-tune the English model with pseudo-labeled data using the regular cross-entropy (CE) loss; (4) skip the English fine-tuning and directly fine-tune mBERT on the pseudo-labeled data; (5)(6) fine-tune the model directly on mixed English and pseudo-labeled data simultaneously, with the RW and CE losses respectively.

From Table 3, we see that both pretraining on English and fine-tuning with pseudo German data are essential to obtain the best score. The RW loss performed better in sequential fine-tuning than in simultaneous training with mixed English and German data. This is probably because the noise proportion in the English training set is much smaller than in the German pseudo-labeled training set, so using the RW loss on the English data fails to exploit the fine-grained information in some hard examples and results in an insufficiently optimized model. Another observation is that by training mBERT on a combination of English and German with the cross entropy loss, we can get almost the same score as our best model, which is trained in two stages.
In this paper, we proposed a new method for cross-lingual NER transfer with parallel corpora. By leveraging the XLM-R model for entity projection, we are able to make the whole pipeline automatic and free from human-engineered features or data, so that it can be applied to any other language with a rich resource of translation data at no extra cost. This method also has the potential to be extended to other NLP tasks, such as question answering. We thoroughly tested the new method on four languages, and it is most effective for Chinese. We also discussed the impact of the parallel data domain on NER transfer performance and found that a combination of parallel corpora from different domains yielded the best average results. We further verified the contribution of the pseudo-labeled parallel data through an ablation study. In the future we will improve the precision of the alignment model and explore alternative ways of transfer, such as self-teaching instead of direct fine-tuning. We are also interested in how the proposed approach generalizes to cross-lingual transfer for other NLP tasks.
References
Jian Ni, Georgiana Dinu, and Radu Florian. 2017. Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1470-1480. Association for Computational Linguistics.

Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya, and Eric Fosler-Lussier. 2017. Cross-lingual transfer learning for POS tagging without cross-lingual resources. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2832-2838, Copenhagen, Denmark. Association for Computational Linguistics.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475-2485, Brussels, Belgium. Association for Computational Linguistics.

Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A. Smith, and Jaime Carbonell. 2018. Neural cross-lingual named entity recognition with minimal resources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 369-379, Brussels, Belgium. Association for Computational Linguistics.

Jian Ni and Radu Florian. 2019. Neural cross-lingual relation extraction based on bilingual word embedding mapping. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 399-409, Hong Kong, China. Association for Computational Linguistics.

T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. 2017. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002.

Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833-844, Hong Kong, China. Association for Computational Linguistics.

Alankar Jain, Bhargavi Paranjape, and Zachary C. Lipton. 2019. Entity projection via machine translation for cross-lingual NER. In EMNLP, pages 1083-1092.

M Saiful Bari, Shafiq Joty, and Prathyusha Jwalapuram. 2019. Zero-resource cross-lingual named entity recognition. arXiv preprint arXiv:1911.09812.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, Held in cooperation with HLT-NAACL 2003, Edmonton, Canada, May 31 - June 1, 2003, pages 142-147.

Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. arXiv preprint arXiv:1904.09077.

Jacob Devlin. 2018. Multilingual BERT readme document.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440-8451, Online. Association for Computational Linguistics.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2019. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. arXiv preprint arXiv:1907.05791.

Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations Parallel Corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the Tenth Machine Translation Summit, pages 79-86, Phuket, Thailand. AAMT.

Željko Agić and Ivan Vulić. 2019. JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204-3210, Florence, Italy. Association for Computational Linguistics.

Maud Ehrmann, Marco Turchi, and Ralf Steinberger. 2011. Building a multilingual named entity-annotated corpus using annotation projection. In Proceedings of Recent Advances in Natural Language Processing, pages 118-124. Association for Computational Linguistics. http://aclweb.org/anthology/R11-1017.

Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. 2015. Polyglot-NER: Massive multilingual named entity recognition. In Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, British Columbia, Canada. SIAM. https://doi.org/10.1137/1.9781611974010.66.

Chen-Tse Tsai, Stephen Mayhew, and Dan Roth. 2016. Cross-lingual named entity recognition via wikification. In CoNLL, pages 219-228.

Stephen Mayhew, Chen-Tse Tsai, and Dan Roth. 2017. Cheap translation for cross-lingual named entity recognition. In EMNLP, pages 2526-2535.

Manaal Faruqui and Shankar Kumar. 2015. Multilingual open relation extraction using cross-lingual projection. In Proceedings of NAACL-HLT, pages 1351-1356.

Qianhui Wu, Zijia Lin, Guoxin Wang, Hui Chen, Börje F. Karlsson, Biqing Huang, and Chin-Yew Lin. 2020. Enhanced meta-learning for cross-lingual named entity recognition with minimal resources. In AAAI.

Taesun Moon, Parul Awasthy, Jian Ni, and Radu Florian. 2019. Towards lingua franca named entity recognition with BERT. arXiv preprint arXiv:1912.01389.

Qianhui Wu, Zijia Lin, Börje F. Karlsson, Jian-Guang Lou, and Biqing Huang. 2020. Single-/multi-source cross-lingual NER via teacher-student learning on unlabeled data in target language. In Association for Computational Linguistics.

Qianhui Wu, Zijia Lin, Börje F. Karlsson, Biqing Huang, and Jian-Guang Lou. 2020. UniTrans: Unifying model transfer and data transfer for cross-lingual named entity recognition with unlabeled data. In IJCAI 2020.

Gina-Anne Levow. 2006. The third international Chinese language processing bakeoff: Word segmentation and named entity recognition. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing.

Nanyun Peng and Mark Dredze. 2015. Named entity recognition for Chinese social media with jointly trained embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. In