Cross-Lingual Named Entity Recognition Using Parallel Corpus: A New Approach Using XLM-RoBERTa Alignment
Bing Li
Microsoft [email protected]
Yujie He
Microsoft [email protected]
Wenjin Xu
Microsoft [email protected]
Abstract
We propose a novel approach for cross-lingual Named Entity Recognition (NER) zero-shot transfer using parallel corpora. We built an entity alignment model on top of XLM-RoBERTa to project the entities detected on the English part of the parallel data onto the target-language sentences, and its accuracy surpasses all previous unsupervised models. With the alignment model we obtain a pseudo-labeled NER data set in the target language to train a task-specific model. Unlike translation-based methods, this approach benefits from the natural fluency and nuances of text originally written in the target language. We also propose a modified loss function that is similar to focal loss but assigns weights in the opposite direction, to further improve model training on the noisy pseudo-labeled data set. We evaluated the proposed approach on benchmark data sets for 4 target languages and achieved competitive F1 scores compared to the most recent SOTA models. We also discuss the impact of parallel corpus size and domain on the final transfer performance.
Named entity recognition (NER) is a fundamental task in natural language processing, which seeks to classify words in a sentence into predefined semantic types. Because the ground-truth labels exist at the word level, supervised training of NER models often requires a large amount of human annotation effort. In real-world use cases where one needs to build multilingual models, the required human labor scales at least linearly with the number of languages, or even worse for low-resource languages. Cross-lingual transfer for Natural Language Processing (NLP) tasks has been widely studied in recent years (Conneau et al., 2018; Kim et al., 2017; Ni et al., 2017; Xie et al., 2018; Ni and Florian, 2019; Wu and Dredze, 2019; Bari et al., 2019; Jain et al., 2019), in particular zero-shot transfer, which leverages advances in a high-resource language such as English to benefit other low-resource languages. In this paper, we focus on cross-lingual transfer of the NER task, and more specifically on using parallel corpora and pretrained multilingual language models such as mBERT (Devlin, 2018) and XLM-RoBERTa (XLM-R) (Lample and Conneau, 2019; Conneau et al., 2020).

Our motivations are threefold. (1) Parallel corpora are a great resource for transfer learning and are rich between many language pairs. Some recent research focuses on using completely unsupervised machine translation (e.g., word alignment (Conneau et al., 2017)) for cross-lingual NER; however, inaccurate translations can harm transfer performance. For example, in the word-to-word translation approach, word ordering may not be well represented during translation, and such gaps in translation quality may hurt model performance on downstream tasks. (2) A method can still provide business value even if it only works for major languages that have sufficient parallel corpora, as long as it has satisfactory performance. It is a common situation in industry practice that a heavily customized task needs to be extended into major markets without annotating large amounts of data in other languages. (3) Previous attempts using parallel corpora are mostly heuristic and statistical-model based (Jain et al., 2019; Xie et al., 2018; Ni et al., 2017). Recent breakthroughs in multilingual language models have not yet been applied to such scenarios. Our work bridges this gap and revisits the topic with new technologies.

We propose a novel semi-supervised method for cross-lingual NER transfer, bridged by parallel corpora. First, we train an NER model on a source-language data set, in this case English, assuming that we have labeled task-specific data. Second, we label the English part of the parallel corpus with this model. Then, we project the recognized entities onto the target language, i.e., label the span of the same entity in the target-language portion of the parallel corpus. In this step we leverage the recent XLM-R model (Lample and Conneau, 2019; Conneau et al., 2020), which is a major distinction between our work and previous attempts. Lastly, we use this pseudo-labeled data to train the task-specific model in the target language directly. For the last step we explored the option of continuing training from a multilingual model fine-tuned on English NER data, to maximize the benefit of model transfer.
We also tried a series of methods to mitigate the noisy-label issue in this semi-supervised approach. The main contributions of this paper are as follows:

• We leverage the powerful multilingual model XLM-R for entity alignment. It was trained in a supervised manner with easy-to-collect data, in sharp contrast to previous attempts that mainly rely on unsupervised methods and human-engineered features.

• Pseudo-labeled data sets typically contain a lot of noise, so we propose a novel loss function inspired by focal loss (Lin et al., 2017). Instead of using the native focal loss, we go in the opposite direction and weight hard examples less, as those are more likely to be noise.

• By leveraging existing natural parallel corpora we obtained competitive F1 scores for NER transfer on multiple languages. We also show that the domain of the parallel corpus is critical for an effective transfer.

There are different ways to conduct zero-shot multilingual transfer. In general, there are two categories: model-based transfer and data-based transfer. Model-based transfer often uses the source language to train an NER model with language-independent features, then directly applies the model to the target language for inference (Wu and Dredze, 2019; Wu et al., 2020). Data-based transfer focuses on combining a source-language task-specific model, translations, and entity projection to create weakly supervised training data in the target language. Previous attempts include using annotation projection on aligned parallel corpora or translations between a source and a target language (Ni et al., 2017; Ehrmann et al., 2011), or utilizing the Wikipedia hyperlink structure to obtain anchor text and context as weak labels (Al-Rfou et al., 2015; Tsai et al., 2016). Different variants of annotation projection exist; e.g., Ni et al. (2017) used a maximum entropy alignment model and data selection to project English annotated labels onto parallel target-language sentences. Other work combined bilingual mappings with lexical heuristics, or used an embedding approach to perform word-level translation from which annotation projection naturally follows (Mayhew et al., 2017; Xie et al., 2018; Jain et al., 2019). This kind of translation + projection approach is used not just for NER but for other NLP tasks as well, such as relation extraction (Kumar, 2015). There are obvious limitations to the translation + projection approach: word- or phrase-based translation makes annotation projection easier, but sacrifices native fluency and language nuances. In addition, orthographic and phonetic features for entity matching may only be applicable to languages that are alike, and require extensive human-engineered features. To address these limitations, we propose a novel approach which utilizes machine translation training data combined with a pretrained multilingual language model for entity alignment and projection.
We will describe the entity alignment model component and the full training pipeline of our work in the following sections.
Translation from a source language to a target language may break the word ordering, so an alignment model is needed to project entities from the source-language sentence to the target language, allowing the labels from the source language to be zero-shot transferred. In this work, we use the XLM-R series of models (Lample and Conneau, 2019; Conneau et al., 2020), which introduced the translation language model (TLM) pretraining task. TLM trains the model to predict a masked word using information from both the context and the parallel sentence in another language, giving the model strong cross-lingual and potentially alignment capability.

Our alignment model is constructed by concatenating the English name of the entity and the target-language sentence, as segment A and segment B of the input sequence respectively. For the token-level outputs of segment B, we predict 1 if the token is inside the translated entity and 0 otherwise. This formulation transforms the entity alignment problem into a token classification task. An implicit assumption is that the entity name remains a consecutive phrase after translation. The model structure is illustrated in Fig 1.
Figure 1: Entity alignment model. The query entity on the left is 'Cologne', and the German sentence on the right is 'Köln liegt in Deutschland', which is 'Cologne is located in Germany' in English; 'Köln' is the German translation of 'Cologne'. The model predicts the word span aligned with the query entity.
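This formulation maps directly onto a standard token classification setup. The following minimal sketch (using the HuggingFace Transformers tokenizer API) illustrates how one aligned training example could be encoded; the helper name and the exact span representation are illustrative rather than a description of released code.

```python
from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-large")

def build_alignment_example(entity_en, target_sentence, target_span):
    """Encode (English entity, target-language sentence) as one input sequence.

    entity_en:       English entity name, e.g. "Cologne" (segment A)
    target_sentence: target-language sentence, e.g. "Köln liegt in Deutschland" (segment B)
    target_span:     (char_start, char_end) of the aligned mention in target_sentence
    Returns the encoding plus one label per token: 1 for segment-B tokens inside
    the aligned mention, 0 for other segment-B tokens, -100 elsewhere (ignored).
    """
    enc = tokenizer(entity_en, target_sentence,
                    return_offsets_mapping=True, truncation=True, max_length=256)
    labels = []
    for (start, end), seq_id in zip(enc["offset_mapping"], enc.sequence_ids()):
        if seq_id != 1:
            labels.append(-100)   # special tokens and segment A do not contribute to the loss
        elif start >= target_span[0] and end <= target_span[1]:
            labels.append(1)      # token lies inside the projected entity span
        else:
            labels.append(0)
    enc["labels"] = labels
    return enc

example = build_alignment_example("Cologne", "Köln liegt in Deutschland", (0, 4))
```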
Fig 2 shows the whole training/evaluation pipeline, which includes 5 stages: (1) fine-tune the pretrained language model on CoNLL2003 (Sang and Meulder, 2003) to obtain an English NER model; (2) infer labels for the English sentences in the parallel corpus; (3) run the entity alignment model from the previous subsection to find the detected English entities in the target-language sentences, filtering out examples that fail to align; (4) fine-tune the multilingual model with the data generated in step (3); (5) evaluate the new model on the target-language test sets.
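The five stages can be read as a simple sequential procedure, sketched schematically below. The callables passed in stand for the components described above; they are placeholders rather than an actual API.

```python
from typing import Callable, Iterable, Tuple

def cross_lingual_ner_pipeline(
    finetune_ner: Callable,       # used in stages (1) and (4)
    predict_entities: Callable,   # stage (2): English NER inference
    align_entity: Callable,       # stage (3): XLM-R alignment model
    evaluate: Callable,           # stage (5): evaluation on the target test set
    pretrained_lm,
    conll_en,
    parallel_corpus: Iterable[Tuple[str, str]],
    target_test_set,
):
    # (1) Fine-tune the pretrained multilingual LM on English CoNLL2003.
    en_model = finetune_ner(pretrained_lm, conll_en)

    # (2) Label the English half of each parallel sentence pair, then
    # (3) project the detected entities onto the target sentence; drop pairs
    #     where any entity fails to align.
    pseudo_labeled = []
    for en_sent, tgt_sent in parallel_corpus:
        entities = predict_entities(en_model, en_sent)
        spans = [align_entity(ent, tgt_sent) for ent in entities]
        if entities and all(s is not None for s in spans):
            pseudo_labeled.append((tgt_sent, list(zip(entities, spans))))

    # (4) Continue fine-tuning from the English model on the pseudo-labeled data.
    tgt_model = finetune_ner(en_model, pseudo_labeled)

    # (5) Evaluate the resulting model on the target-language test set.
    return evaluate(tgt_model, target_test_set)
```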
In our method, we leverage the availability of large-scale parallel corpora to transfer the NER knowledge obtained in English to other languages. Existing parallel corpora are easier to obtain than annotated NER data. We used parallel corpora crawled from the OPUS website (Tiedemann, 2012). In our experiments, we used the following data sets:
• Ted2013: volunteer transcriptions and translations from the TED web site, created as a training data resource for the International Workshop on Spoken Language Translation 2013.

• OpenSubtitles: a new collection of translated movie subtitles that contains 62 languages (Lison and Tiedemann, 2016).

• WikiMatrix: parallel sentences mined from Wikipedia in different languages; only pairs with scores above 1.05 are used (Schwenk et al., 2019).

• UNPC: manually translated United Nations documents from 1994 to 2014 (Ziemski et al., 2016).

• Europarl: a parallel corpus extracted from the European Parliament web site (Koehn, 2005).

• WMT-News: a parallel corpus of News Test Sets provided by WMT for training SMT, containing 18 languages.

• NewsCommentary: a parallel corpus of News Commentaries provided by WMT for training SMT, containing 12 languages (http://opus.nlpl.eu/News-Commentary.php).

• JW300: parallel sentences mined from the magazines Awake! and Watchtower (Agić and Vulić, 2019).

In this work, we focus on 4 languages: German, Spanish, Dutch and Chinese. We randomly select data points from all the data sets above with equal weights. There might be slight differences in data distribution between languages due to data availability and relevance.
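As a rough illustration of this equal-weight selection, sentence pairs could be drawn as follows; the corpus contents and the total budget in this sketch are placeholders, not the exact splits used in our experiments.

```python
import random

def sample_parallel_pairs(corpora, total_pairs, seed=0):
    """Draw roughly the same number of sentence pairs from each corpus.

    corpora: dict mapping corpus name -> list of (english, target) sentence pairs
    """
    rng = random.Random(seed)
    per_corpus = total_pairs // len(corpora)
    sampled = []
    for name, pairs in corpora.items():
        k = min(per_corpus, len(pairs))  # a corpus may be smaller than its quota
        sampled.extend(rng.sample(pairs, k))
    rng.shuffle(sampled)
    return sampled
```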
The objective of the alignment model is to find the entity in a foreign paragraph given its English name. We feed the English name and the paragraph as segment A and segment B into the XLM-R model (Lample and Conneau, 2019; Conneau et al., 2020).
Figure 2: Training pipeline diagram. Yellow pages represent English documents while light blue pages represent German documents. Steps 1 and 5 use the original CoNLL data for training and testing respectively; steps 2, 3, and 4 use machine translation data from the OPUS website. The pretrained model is either mBERT or XLM-R. The final model was first fine-tuned on the English NER data set and then fine-tuned on the target-language pseudo-labeled NER data set.

Unlike the NER task, the alignment task has no requirement for label completeness, since we only need one entity to be labeled in each training example. The training data set can be created from Wikipedia documents, where anchor text in hyperlinks naturally indicates the location of entities and one can obtain the English entity name via linking by Wikipedia entity Id. An alternative way to obtain the English name for mentions in another language is through a state-of-the-art translation system. We took the latter approach for simplicity and leveraged Microsoft's Azure Cognitive Services to do the translation.

During training, we also added negative examples with fake English entities that do not appear in the other language's sentence. The intuition is to force the model to focus on the English entity (segment A) and its translation, instead of doing pure NER and picking out any entity in the other language's sentence (segment B). We also added examples of noun phrases or nominal entities to make the model more robust. We generated a training set of 30K samples with 25% negatives for each language and trained an XLM-R-large model (Conneau et al., 2020) for 3 epochs with batch size 64. The initial learning rate was 5e-5 and the other hyperparameters were the defaults from the HuggingFace Transformers library for the token classification task. The precision/recall/F1 on the reserved test set reached 98%. The model training was done on 2 Tesla V100 GPUs and took about 20 minutes.

We used the CoNLL2003 (Sang and Meulder, 2003) and CoNLL2002 data sets to test our cross-lingual transfer method for German, Spanish and Dutch. We ignored the training sets in those languages and only evaluated our model on the test sets. For Chinese, we used People's Daily (https://github.com/zjy-ucas/ChineseNER) as the main evaluation set, and we also report numbers on the MSRA (Levow, 2006) and Weibo (Peng and Dredze, 2015) data sets in the next section. One notable difference for the People's Daily data set is that it only covers three entity types, LOC, ORG and PER, so we suppressed the MISC type from English during the transfer by training the English NER model with 'MISC' marked as 'O'.

To enable cross-lingual transfer, we first trained an English teacher model on the CoNLL2003 EN training set with XLM-R-large as the base model. We trained with focal loss (Lin et al., 2017) for 5 epochs. We then ran inference with this model on the English part of the parallel data. Finally, with the alignment model, we projected the entity labels onto the other languages. To ensure the quality of the target-language training data, we discarded examples if any English entity failed to map to tokens in the target language. We also discarded examples with overlapping target entities, because they would cause conflicts in token labels. Furthermore, when one entity maps to multiple target spans, we only keep the example if all the target mention phrases are the same; this accommodates the situation where the same entity is mentioned more than once in one sentence.
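These filtering rules can be summarized as a short post-processing check. The sketch below is a paraphrase of the rules above; the span representation and helper name are illustrative.

```python
def keep_projected_example(projected_spans):
    """Decide whether a pseudo-labeled target sentence is kept.

    projected_spans: for each detected English entity, the list of
                     (start, end, text) spans the alignment model produced in
                     the target sentence (possibly empty if projection failed).
    """
    # Rule 1: drop the example if any entity failed to project.
    if any(len(spans) == 0 for spans in projected_spans):
        return False

    # Rule 2: drop the example if projected spans overlap, since overlapping
    # spans would produce conflicting token labels.
    flat = sorted(s for spans in projected_spans for s in spans)
    for (s1, e1, _), (s2, e2, _) in zip(flat, flat[1:]):
        if s2 < e1:
            return False

    # Rule 3: if one entity maps to multiple target spans (e.g. it is mentioned
    # twice), keep the example only if all its surface forms agree.
    for spans in projected_spans:
        if len({text for _, _, text in spans}) > 1:
            return False

    return True
```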
Model                | DE (P / R / F1)      | ES (P / R / F1)      | NL (P / R / F1)      | ZH (P / R / F1)
Bari et al. (2019)   | – / – / 65.24        | – / – / 75.9         | – / – / 74.6         | – / – / –
Wu and Dredze (2019) | – / – / 71.1         | – / – / 74.5         | – / – / 79.5         | – / – / –
Moon et al. (2019)   | – / – / 71.42        | – / – / 75.67        | – / – / 80.38        | – / – / –
Wu et al. (2020)     | – / – / 73.16        | – / – / 76.75        | – / – / 80.44        | – / – / –
Wu et al. (2020)     | – / – / 73.22        | – / – / 76.94        | – / – / 80.89        | – / – / –
Wu et al. (2020)     | – / – / –            | – / – / –            | – / – / –            | – / – / –
Our models:
mBERT zero-transfer  | 67.6 / 77.4 / 72.1   | 72.4 / 78.2 / 75.2   | 77.8 / 79.3 / 78.6   | 64.1 / 65.0 / 64.6
mBERT fine-tune      | 73.1 / 76.2 / (+2.5) | – / – / (+2.4)       | – / – / (+0.0)       | – / – / (+6.4)
XLM-R zero-transfer  | 67.9 / 79.8 / 73.4   | 79.8 / 81.9 / –      | – / – / –            | – / – / –
XLM-R fine-tune      | – / – / (+3.5)       | – / – / (−1.9)       | – / – / (−1.5)       | – / – / (+4.2)
Table 1: Cross-lingual transfer results on German, Spanish, Dutch and Chinese. Experiments are done with both mBERT and the XLM-RoBERTa model. For each of them we compare the zero-transfer result (trained on CoNLL2003 English only) and the fine-tuned result using the zero-transfer pseudo-labeled target-language NER data; values in parentheses are F1 changes relative to the corresponding zero-transfer row. Test sets for German, Spanish and Dutch are from CoNLL2003 and CoNLL2002, and the People's Daily data set is used for Chinese.

As the last step, we fine-tuned the multilingual model pre-trained on the English data set, with the lower n (0, 3, 6, etc.) layers frozen, on the target-language pseudo-labeled data. We used both mBERT (Devlin et al., 2019; Devlin, 2018) and XLM-R (Conneau et al., 2020) with about 40K training samples for 1 epoch. The results are shown in Table 1. All the inference, entity projection and model training experiments were done on 2 Tesla V100 32G GPUs, and the whole pipeline takes about 1-2 hours. All numbers are reported as an average of 5 random runs with the same settings.

For the loss function we used something similar to focal loss (Lin et al., 2017) but with the opposite weight assignment. Focal loss was designed to weight hard examples more, and this intuition holds true only when the training data is clean. In scenarios such as this cross-lingual transfer task, the pseudo-labeled training data contains a lot of noise propagated from earlier stages of the pipeline; in that case, the 'hard' examples are more likely to be errors or outliers and could hurt the training process. We went in the opposite direction and lowered their weights instead, so that the model focuses on the less noisy labels. More specifically, we weight the regular cross entropy loss by $(1 + p_t)^\gamma$ instead of $(1 - p_t)^\gamma$; for the hyper-parameter $\gamma$ we experimented with values from 1 to 5 and found that $\gamma = 4$ works best.
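A minimal PyTorch-style sketch of this re-weighted loss, under the reading above (tensor shapes and the mean reduction are illustrative), is shown below. Setting $\gamma = 0$ recovers the plain cross entropy used as the CE baseline in the ablation study.

```python
import torch
import torch.nn.functional as F

def reverse_focal_loss(logits, labels, gamma=4.0, ignore_index=-100):
    """Cross entropy re-weighted by (1 + p_t)^gamma.

    In contrast to focal loss, confident (easy) tokens receive *larger* weights,
    so the model focuses less on hard tokens that are likely to be label noise.
    logits: (num_tokens, num_classes); labels: (num_tokens,)
    """
    ce = F.cross_entropy(logits, labels, reduction="none", ignore_index=ignore_index)
    pt = torch.exp(-ce)                      # p_t, probability of the true class
    weight = (1.0 + pt) ** gamma             # focal loss would use (1 - p_t) ** gamma
    mask = (labels != ignore_index).float()  # ignore padding / special tokens
    return (weight * ce * mask).sum() / mask.sum().clamp(min=1.0)
```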
From Table 1, we can see that for the mBERT model, fine-tuning with pseudo-labeled data has a significant effect on all languages except Dutch. The largest improvement is on Chinese, a 6.4% increase in F1 over the zero-transfer result; this number is 2.5% for German and 2.4% for Spanish. The same experiment with the XLM-R model shows a different pattern: F1 increased by 3.5% for German but dropped slightly on Spanish and Dutch after fine-tuning. For Chinese, we see an improvement of 4.2%, comparable to mBERT. The negative result on Spanish and Dutch is probably because XLM-R already has very good pretraining and language alignment for those European languages, as can be seen from the high zero-transfer numbers, so a relatively noisy data set does not bring much gain. On the contrary, Chinese is a relatively distant language from a linguistic perspective, so the added value of task-specific fine-tuning with natural data is larger.

Another pattern we observed from Table 1 is that across all languages, the fine-tuning step with pseudo-labeled data benefits precision more than recall: we observed a consistent improvement in precision but a small drop in recall in most cases.

One advantage of the parallel-corpus method is high data availability compared to the supervised approach. A natural next question is therefore whether more data is beneficial for the cross-lingual transfer task. To answer this question, we ran a series of experiments with the number of training examples varying from 5K to 200K; the model F1 score increases with the amount of data at the beginning and plateaus around 40K. All the numbers in Table 1 are reported for training on a generated data set of about 40K sentences. One possible explanation for the plateau is the propagated error in the pseudo-labeled data set. Domain mismatch may also limit the effectiveness of transfer between languages. We discuss this topic further in the next section.
Figure 3: F1 scores evaluated on three data sets using different domains' parallel data. The blue column on the left is the result of zero-shot model transfer. To its right are F1 scores for 3 different domains and all domains combined.
Learnings from the machine translation community show that the quality of neural machine translation models usually depends strongly on the domain they are trained on, and the performance of a model can drop significantly when evaluated on a different domain (Koehn and Knowles, 2017). A similar observation explains the challenges of cross-lingual NER transfer. In NER transfer, the first domain mismatch comes from the natural gap in entity distributions between corpora of different languages. Many entities only live inside the ecosystem of a specific group of languages and may not translate naturally to others. The second domain mismatch is between the parallel corpora and the NER data set: the English model might not have good domain adaptation ability and could perform well on the CoNLL2003 data set but poorly on the parallel data.

To study the impact of the domain of the parallel data on transfer performance, we ran an experiment on Chinese using parallel data from different domains.
Domain               | PER    | ORG    | LOC    | All
OpenSubtitles        | 24,036 | 3,809  | 5,196  | 33,041
UN                   | 1,094  | 25,875 | 12,718 | 39,687
News                 | 10,977 | 9,568  | 28,168 | 48,713
All Domains Combined | –      | –      | –      | –
Table 2: Entity count by type in the pseudo-labeled Chinese NER training data set. We list multiple domains that were extracted from different parallel data sources; All Domains Combined is a combination of all of them.

We picked three representative data sets from OPUS (Tiedemann, 2012): OpenSubtitles, UN (containing UN and UNPC) and News (containing WMT and News-Commentary), plus a fourth setting with all three combined. OpenSubtitles comes from movie subtitles and its language style is informal and conversational. UN comes from United Nations reports and its language style is more formal and political. The News data comes from newspapers and its content is more diverse and closer to the CoNLL data sets. We evaluated F1 on three Chinese test sets, Weibo, MSRA and People's Daily, where Weibo contains messages from social media, MSRA is more formal and political, and the People's Daily data set consists of newspaper articles.

From Fig 3 we see that OpenSubtitles performs best on Weibo but poorly on the other two test sets. Conversely, UN performs worst on Weibo but better on the other two. The News domain performs best on People's Daily, which is consistent with intuition because both come from newspaper articles. The all-domains-combined approach has decent performance on all three test sets.

Different domains of data have quite large gaps in the density and distribution of entities; for example, OpenSubtitles contains more sentences that do not have any entity. In the experiments above we filtered the data to keep the same ratio of 'empty' sentences across all domains. We also examined the difference in entity-type distributions. In Table 2 we report entity counts by type and domain. OpenSubtitles has very few ORG and LOC entities, whereas the UN data has very few PER entities. The News and all-domain data are more balanced.

In Fig 4, we show the evaluation on People's Daily by type. We want to understand how the domain of the parallel data impacts transfer performance for different entity types. The News data has the best performance on all types, and Subtitles performs very badly on ORG. All these observations are consistent with the type distribution in Table 2.

Figure 4: F1 score by type evaluated on the People's Daily data set. As in Fig 3, we compare the results using different domains of parallel data for the NER transfer.
Setting                      | Test P | Test R | F1
Sequential fine-tune with RW | 73.1   | –      | –
Zero-transfer                | 67.6   | –      | –

Table 3: Ablation study results evaluated on the CoNLL2003 German NER data. All experiments used the mBERT-base model. RW denotes the re-weighted loss we propose in this paper; CE denotes the regular cross entropy loss.
To better understand the contribution of each stage in the training process, we conducted an ablation study on the German data set with the mBERT model. We compared 6 different settings: (1) the approach proposed in this work, i.e., fine-tune the English model with pseudo-labeled data using the new loss, denoted as re-weighted (RW) loss; (2) zero-transfer, i.e., direct model transfer trained only on English NER data; (3) fine-tune the English model with pseudo-labeled data using the regular cross-entropy (CE) loss; (4) skip the English fine-tuning and directly fine-tune mBERT on the pseudo-labeled data; (5)(6) fine-tune the model directly on mixed English and pseudo-labeled data simultaneously, with the RW and CE losses respectively.

From Table 3, we see that both pretraining on English and fine-tuning with pseudo German data are essential to obtain the best score. The RW loss performed better in sequential fine-tuning than in simultaneous training with mixed English and German data. This is probably because the noise proportion in the English training set is much smaller than in the German pseudo-labeled training set, so using the RW loss on the English data fails to exploit the fine-grained information in some hard examples and results in an insufficiently optimized model. Another observation is that by training mBERT on a combination of English and German with the cross entropy loss, we can get almost the same score as our best model, which is trained in two stages.
In this paper, we proposed a new method for cross-lingual NER transfer with parallel corpora. By leveraging the XLM-R model for entity projection, we are able to make the whole pipeline automatic and free from human-engineered features or data, so that it can be applied to any other language with a rich resource of translation data at no extra cost. This method also has the potential to be extended to other NLP tasks, such as question answering. We thoroughly tested the new method on four languages, and it is most effective for Chinese. We also discussed the impact of the parallel data domain on NER transfer performance and found that a combination of parallel corpora from different domains yielded the best average results. We further verified the contribution of the pseudo-labeled parallel data through an ablation study. In the future we will improve the precision of the alignment model and explore alternative ways of transfer, such as self-teaching instead of direct fine-tuning. We are also interested in how the proposed approach generalizes to cross-lingual transfer for other NLP tasks.
References
Jian Ni, Georgiana Dinu, and Radu Florian. 2017. Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1470-1480. Association for Computational Linguistics.

Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya, and Eric Fosler-Lussier. 2017. Cross-lingual transfer learning for POS tagging without cross-lingual resources. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2832-2838, Copenhagen, Denmark. Association for Computational Linguistics.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475-2485, Brussels, Belgium. Association for Computational Linguistics.

Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A. Smith, and Jaime Carbonell. 2018. Neural cross-lingual named entity recognition with minimal resources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 369-379, Brussels, Belgium. Association for Computational Linguistics.

Jian Ni and Radu Florian. 2019. Neural cross-lingual relation extraction based on bilingual word embedding mapping. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 399-409, Hong Kong, China. Association for Computational Linguistics.

T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. 2017. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002.

Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833-844, Hong Kong, China. Association for Computational Linguistics.

Alankar Jain, Bhargavi Paranjape, and Zachary C. Lipton. 2019. Entity projection via machine translation for cross-lingual NER. In EMNLP, pages 1083-1092.

M Saiful Bari, Shafiq Joty, and Prathyusha Jwalapuram. 2019. Zero-resource cross-lingual named entity recognition. arXiv preprint arXiv:1911.09812.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, Held in cooperation with HLT-NAACL 2003, Edmonton, Canada, May 31 - June 1, 2003, pages 142-147.

Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. arXiv preprint arXiv:1904.09077.

Jacob Devlin. 2018. Multilingual BERT readme document.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440-8451, Online. Association for Computational Linguistics.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2019. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. arXiv preprint arXiv:1907.05791.

Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations Parallel Corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the Tenth Machine Translation Summit, pages 79-86, Phuket, Thailand. AAMT.

Željko Agić and Ivan Vulić. 2019. JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204-3210, Florence, Italy. Association for Computational Linguistics.

Maud Ehrmann, Marco Turchi, and Ralf Steinberger. 2011. Building a multilingual named entity-annotated corpus using annotation projection. In Proceedings of Recent Advances in Natural Language Processing, pages 118-124. Association for Computational Linguistics. http://aclweb.org/anthology/R11-1017.

Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. 2015. Polyglot-NER: Massive multilingual named entity recognition. In Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, British Columbia, Canada. SIAM. https://doi.org/10.1137/1.9781611974010.66.

Chen-Tse Tsai, Stephen Mayhew, and Dan Roth. 2016. Cross-lingual named entity recognition via wikification. In CoNLL, pages 219-228.

Stephen Mayhew, Chen-Tse Tsai, and Dan Roth. 2017. Cheap translation for cross-lingual named entity recognition. In EMNLP, pages 2526-2535.

Manaal Faruqui and Shankar Kumar. 2015. Multilingual open relation extraction using cross-lingual projection. In Proceedings of NAACL-HLT, pages 1351-1356.

Qianhui Wu, Zijia Lin, Guoxin Wang, Hui Chen, Börje F. Karlsson, Biqing Huang, and Chin-Yew Lin. 2020. Enhanced meta-learning for cross-lingual named entity recognition with minimal resources. In AAAI.

Taesun Moon, Parul Awasthy, Jian Ni, and Radu Florian. 2019. Towards lingua franca named entity recognition with BERT. arXiv preprint arXiv:1912.01389.

Qianhui Wu, Zijia Lin, Börje F. Karlsson, Jian-Guang Lou, and Biqing Huang. 2020. Single-/multi-source cross-lingual NER via teacher-student learning on unlabeled data in target language. In Association for Computational Linguistics.

Qianhui Wu, Zijia Lin, Börje F. Karlsson, Biqing Huang, and Jian-Guang Lou. 2020. UniTrans: Unifying model transfer and data transfer for cross-lingual named entity recognition with unlabeled data. In IJCAI 2020.

Gina-Anne Levow. 2006. The third international Chinese language processing bakeoff: Word segmentation and named entity recognition. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing.

Nanyun Peng and Mark Dredze. 2015. Named entity recognition for Chinese social media with jointly trained embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. In