Towards Fully Bilingual Deep Language Modeling
Li-Hsin Chang [email protected]
Sampo Pyysalo [email protected]
Jenna Kanerva [email protected]
Filip Ginter [email protected]
TurkuNLP Group, Department of Future Technologies, University of Turku, Turku, Finland

October 23, 2020

ABSTRACT
Language models based on deep neural networks have facilitated great advances in natural language processing and understanding tasks in recent years. While models covering a large number of languages have been introduced, their multilinguality has come at a cost in terms of monolingual performance, and the best-performing models at most tasks not involving cross-lingual transfer remain monolingual. In this paper, we consider the question of whether it is possible to pre-train a bilingual model for two remotely related languages without compromising performance at either language. We collect pre-training data, create a Finnish-English bilingual BERT model and evaluate its performance on datasets used to evaluate the corresponding monolingual models. Our bilingual model performs on par with Google's original English BERT on GLUE and nearly matches the performance of monolingual Finnish BERT on a range of Finnish NLP tasks, clearly outperforming multilingual BERT. We find that when the model vocabulary size is increased, the BERT-Base architecture has sufficient capacity to learn two remotely related languages to a level where it achieves comparable performance with monolingual models, demonstrating the feasibility of training fully bilingual deep language models. The model and all tools involved in its creation are freely available at https://github.com/TurkuNLP/biBERT.

Keywords: BERT · Multilingual Language Model · Finnish
In recent years, there has been an increased focus on the use of unannotated texts for modeling human language and on transfer learning in natural language processing (NLP). A wide variety of models have been proposed, ranging from context-independent word embeddings (Mikolov et al., 2013; Pennington et al., 2014) to the more recent contextual representations (Peters et al., 2018; Devlin et al., 2019). In particular, the Transformer-based (Vaswani et al., 2017) BERT (Bidirectional Encoder Representations from Transformers) model (Devlin et al., 2019) has generated considerable interest in the NLP community since its release. BERT outperformed the then state-of-the-art systems on a wide range of benchmark datasets when published, and has served as the basis of many studies since. These efforts include work that proposes improvements and/or modifications to the training objectives (Liu et al., 2019; Lan et al., 2020), knowledge distillation (Sanh et al., 2019), multilinguality (Pires et al., 2019), and interpretation (Kovaleva et al., 2019), to name a few. As a mark of its popularity, the term BERTology was coined to refer to the field of research relating to BERT (Rogers et al., 2020).

A thriving branch of BERTology involves BERT models for languages other than English. Devlin et al. (2019) released multilingual BERT (mBERT) models trained on over a hundred languages. Wu and Dredze (2019) analyze the representations produced by multilingual BERT and find evidence that these representations generalize across languages
for various downstream tasks, though language-specific information is retained. This language-agnostic subspace of multilingual BERT has also been observed in other studies and is deemed to be the factor that allows for zero-shot transfer (Pires et al., 2019; Cao et al., 2020). Furthermore, the embeddings can be further aligned through a fine-tuning-based alignment procedure, improving the performance of multilingual models (Cao et al., 2020). While multilingual training can also benefit monolingual performance, as the number of languages covered by a multilingual model increases, the fraction of the model capacity available for any single language decreases. Conneau et al. (2020) call this the curse of multilinguality: increasing the number of languages included in a model initially leads to better cross-lingual performance for low-resource languages, but eventually leads to overall degradation of both monolingual and cross-lingual performance. Work on language-specific BERT models has also shown that monolingual models tend to outperform multilingual models of the same size in monolingual settings (de Vries et al., 2019; Martin et al., 2020; Virtanen et al., 2019; Pyysalo et al., 2020). However, the question of whether it is possible to train multilingual models without loss of monolingual performance remains largely open.

In this paper, we study whether it is feasible to pre-train a bilingual model for two remotely related languages without compromising performance at either language. Specifically, we train a Finnish-English bilingual BERT model (henceforth, bBERT) using a combination of the pre-training data of the original English BERT model and the Finnish BERT model introduced by Virtanen et al. (2019), using an extended model vocabulary but otherwise fixing model capacity at BERT-Base size and retaining the number of pre-training steps. We evaluate the performance of the introduced bilingual model on a range of natural language understanding (NLU) tasks used to evaluate the monolingual models, which, to the best of our knowledge, has not been the focus of studies on bilingual BERT models. We find that bBERT achieves performance comparable to the original English BERT on the GLUE (General Language Understanding Evaluation) benchmark (Wang et al., 2019b) and nearly matches the performance of the Finnish BERT on Finnish NLP tasks. Our results indicate that an extension of the vocabulary size is sufficient to allow the creation of fully bilingual models that perform on par with their monolingual counterparts in both of their languages.
The BERT variants available can be categorized according to the number of languages they are trained on: monolingual BERTs, multilingual BERTs covering a few languages, and multilingual BERTs covering a large number of languages. The original authors of BERT released several versions of BERT varying in model size, casing, and language (Devlin et al., 2019). Among these models is the cased English BERT-Base model. Its architecture, the BERT-Base model architecture, has 12 layers with a hidden dimension of 768 and 12 attention heads, resulting in a total of 110M parameters. The English BERT was evaluated on GLUE, the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016), and the Situations With Adversarial Generations dataset (Zellers et al., 2018).

Other monolingual BERT models have been trained and released by the NLP community. Virtanen et al. (2019) train Finnish cased and uncased versions of BERT-Base models (the cased Finnish model is referred to as FinBERT henceforth). The Finnish BERTs have been evaluated on part-of-speech (POS) tagging, named entity recognition (NER), dependency parsing, text classification, and probing tasks. FinBERT outperforms mBERT on nearly all of these tasks, illustrating the advantages of monolingual models over multilingual ones. To benefit a wider range of languages, Pyysalo et al. (2020) construct an automatic pipeline to train monolingual BERTs on Wikipedia. They train 42 monolingual BERT models using this pipeline and test their parsing performance. They find that while language-specific models lead to improvements in parsing performance on average, the relative performance of the models varies substantially depending on the language.

Apart from monolingual models trained for languages with more resources, the cross-lingual transferability of models has also been studied to empower languages with fewer resources. Artetxe et al. (2020) successfully transfer monolingual representations to other languages by freezing the model weights and retraining only the vocabulary weights. However, the performance of these cross-lingually transferred monolingual models tends not to match that of bilingual models in their experiments. Thus, for languages that have sufficient resources to train monolingual BERTs but do not have as much labelled data as English, training bilingual BERTs presents a potential solution: it combines the benefit of BERT's language-agnostic subspace with the advantage that models covering fewer languages avoid the curse of multilinguality. Whether the monolingual capability of such models is compromised, however, remains an open question.

Studies on multilingual BERT models with few languages have mainly focused on cross-lingual aspects. Karthikeyan et al. (2020) study the cross-lingual capability of multilingual BERTs by training bilingual BERT models and studying the effect of linguistic properties of languages, model architectures, and learning objectives. Their models are evaluated on NER and textual entailment tasks. Compared to our study, their focus is on cross-linguality, and their models are trained for fewer steps than ours. In work performed concurrently with ours,
Ulčar and Robnik-Šikonja (2020) train Finnish-Estonian-English and Croatian-Slovenian-English trilingual BERT models and evaluate their performance on POS tagging, NER, and dependency parsing. Their baselines are multilingual contextual representation models; there is no evaluation of monolingual model performance. This focus on the cross-lingual ability of the models differs from ours, which is on the language-specific capability of bilingual models compared to monolingual models.

There have also been studies on multilingual BERTs with a larger number of languages, though these models tend to suffer from the curse of multilinguality. In the study of Artetxe et al. (2020), bilingual BERT models tend to outperform their multilingual BERT model jointly trained on 15 languages. Devlin et al. (2019) release cased and uncased multilingual BERT-Base models trained on over a hundred languages (the multilingual cased model is referred to as mBERT henceforth). These multilingual models, however, tend to underperform monolingual models (de Vries et al., 2019; Martin et al., 2020; Virtanen et al., 2019; Pyysalo et al., 2020).
This section introduces the sources of unannotated English and Finnish texts used for pre-training, as well as the collection and filtering of these texts, the generation of the model vocabulary, and the pre-training process.
The original English BERT was trained on the BooksCorpus (Zhu et al., 2015) and English Wikipedia, which consist of 800M words and 2,500M words, respectively. Since BooksCorpus is no longer available, we reconstruct an approximation from URLs collected in a separate crawl of the corpus sources (https://github.com/soskek/bookcorpus). The collected books are filtered to exclude non-English text by language detection. Hand-written heuristics are used to remove short sentences and sentences with high ratios of uppercase characters, digits, or foreign characters. Since the data come from books, tables of contents, copyright messages, and references are likewise removed by hand-written heuristics. Finally, content duplications are removed with the corpus tool Onion (Pomikálek, 2011). The English Wikipedia data were obtained using parts of the pipeline introduced by Pyysalo et al. (2020) (https://github.com/spyysalo/wiki-bert-pipeline). More specifically, these components were used to download a Wikipedia database backup dump, extract plain text from the XML sources using WikiExtractor (https://github.com/attardi/wikiextractor), segment and tokenize the text, and perform heuristic document filtering.

For the Finnish pre-training data, we use the same data used for training FinBERT (Virtanen et al., 2019). Briefly, the data come from three sources: news from Yle and the Finnish News Agency, online discussion from the Suomi24 forum, and an internet crawl. After filtering and deduplication, the data is about 30 times the size of the Finnish Wikipedia included in the data that mBERT was trained on. A detailed description of the data and its preprocessing can be found in Virtanen et al. (2019).

The statistics of the pre-training data are presented in Table 1. We note that in terms of the total number of tokens, the data sources for the two languages are remarkably closely balanced, with 3.8B tokens for English and 3.3B for Finnish.

Data                            Sentences   Tokens
English                         198M        3.8B
  Wikipedia                     130M        2.8B
  BooksCorpus (reconstructed)    68M        1.0B
Finnish                         234M        3.3B
  News                           36M        0.5B
  Online discussion             118M        1.7B
  Internet crawl                 79M        1.1B
Table 1: Statistics for pre-training data
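To make the kind of filtering concrete, the following is a minimal sketch of a hand-written sentence filter of the sort described above. The thresholds and character classes are illustrative assumptions only; they are not the exact rules applied to the BooksCorpus and Wikipedia data.

    # Hypothetical sentence-level filter illustrating the hand-written heuristics described
    # above (short sentences and sentences with high ratios of uppercase characters, digits,
    # or foreign characters are dropped). Thresholds are illustrative, not the actual values.
    def keep_sentence(sentence,
                      min_tokens=5,
                      max_upper_ratio=0.3,
                      max_digit_ratio=0.2,
                      max_foreign_ratio=0.1):
        if len(sentence.split()) < min_tokens:          # drop very short sentences
            return False
        chars = [c for c in sentence if not c.isspace()]
        if not chars:
            return False
        upper = sum(c.isupper() for c in chars) / len(chars)
        digit = sum(c.isdigit() for c in chars) / len(chars)
        foreign = sum(not (c.isascii() or c in "äöåÄÖÅ") for c in chars) / len(chars)
        return (upper <= max_upper_ratio
                and digit <= max_digit_ratio
                and foreign <= max_foreign_ratio)

    print(keep_sentence("TABLE OF CONTENTS ....... 3"))                 # False
    print(keep_sentence("A normal sentence from the body of a book."))  # True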
For vocabulary generation, we take a sample of the cleaned and filtered sentences and tokenize them using the BERT BasicTokenizer. To balance the vocabulary for the two languages, the same number of sentences is sampled for each language, proportionally to the size of each source. In total, 10 million sentences are sampled, half of which are English and the other half Finnish. As the downstream evaluation results reported by Virtanen et al. (2019) and others tend to favour cased over uncased models, we here choose to train a cased bilingual model. The SentencePiece (Kudo and Richardson, 2018) implementation of byte-pair encoding (Sennrich et al., 2016) is used to generate the bilingual vocabulary. The generated vocabulary is then converted to a WordPiece (Wu et al., 2016) vocabulary. Taking into account the observation of Artetxe et al. (2020) that the effective vocabulary size per language plays a more important role in model performance than either the choice between a joint or disjoint vocabulary or the number of languages in a multilingual model, we fix the bilingual vocabulary to be a joint vocabulary of 80,000 words, matching the combined size of the English BERT (30,000 words) and FinBERT (50,000 words) vocabularies.

Comparing the bilingual vocabulary to the monolingual ones, we find that the bBERT vocabulary contains 87.5% of the WordPieces in the original Google BERT vocabulary and 61.5% of the WordPieces in the FinBERT vocabulary. The lower coverage of the FinBERT vocabulary is expected, as the sampling strategy for vocabulary generation balances English and Finnish, whereas the English BERT vocabulary is smaller than that of the Finnish BERT.
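The following sketch illustrates this two-step process: a SentencePiece BPE model is trained on the balanced sentence sample and its pieces are then rewritten in WordPiece notation. The file names, the character coverage setting, and the exact conversion rules are assumptions made for illustration and are not necessarily the settings used to build the released vocabulary.

    # Sketch of bilingual vocabulary generation: SentencePiece BPE training followed by
    # conversion to a WordPiece-style vocab.txt. Paths and options are illustrative assumptions.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="sampled_en_fi_sentences.txt",  # 10M sentences, half English, half Finnish
        model_prefix="bibert_bpe",
        vocab_size=80000,                     # 30,000 (English BERT) + 50,000 (FinBERT)
        model_type="bpe",
        character_coverage=0.9999,            # assumed setting
    )

    # SentencePiece marks word-initial pieces with "▁"; WordPiece instead marks
    # word-internal continuation pieces with "##".
    sp = spm.SentencePieceProcessor(model_file="bibert_bpe.model")
    with open("vocab.txt", "w", encoding="utf-8") as out:
        for i in range(sp.get_piece_size()):
            piece = sp.id_to_piece(i)
            if sp.is_control(i) or sp.is_unknown(i) or piece == "▁":
                continue  # special tokens ([CLS], [SEP], [MASK], ...) are added separately
            out.write((piece[1:] if piece.startswith("▁") else "##" + piece) + "\n")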
Following the approaches of Devlin et al. (2019) and Virtanen et al. (2019), pre-training examples are created for the masked language modeling and next sentence prediction tasks. Duplication factors are set so that there are roughly the same number of training examples for Finnish and English, and so that the number of examples in the whole training data matches that of FinBERT. For Finnish, each source (news, discussion, and crawl) has a separate duplication factor so that there is a balanced distribution of examples from each of them. For English, we do not balance the number of examples between the reconstructed BooksCorpus and Wikipedia, as no comparable balancing was applied in the pre-training of the original BERT. As in Virtanen et al. (2019), whole-word masking is used, and the other parameters and the data creation process match those in Devlin et al. (2019).
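As a simplified illustration of whole-word masking (ignoring the 80/10/10 mask/random/keep split and other details of the actual data creation scripts), consecutive WordPieces of a word are grouped so that a word is either masked in full or left intact:

    import random

    def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
        # Group WordPiece indices into whole words: a piece starting with "##" belongs
        # to the same word as the preceding piece.
        words = []
        for i, tok in enumerate(tokens):
            if tok.startswith("##") and words:
                words[-1].append(i)
            else:
                words.append([i])
        masked, targets = list(tokens), {}
        for word in words:
            if random.random() < mask_prob:   # mask the whole word or none of it
                for i in word:
                    targets[i] = tokens[i]    # original pieces become prediction targets
                    masked[i] = mask_token
        return masked, targets

    print(whole_word_mask(["Hel", "##sin", "##ki", "is", "the", "capital", "of", "Finland"],
                          mask_prob=0.3))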
We primarily follow the pre-training process and implementation described in Virtanen et al. (2019). Briefly, the model architecture is that of BERT-Base, with 110M parameters excluding the word embeddings. The model is trained for 1M steps. For the first 0.9M steps of training, we use a sequence length of 128 and a batch size of 140. For the remaining 0.1M steps, the sequence length is set to 512 and (in contrast to FinBERT training) a batch size of 16 is used due to memory constraints. We use the LAMB optimizer (You et al., 2020) with warmup over 10K steps to a learning rate of 1e-4, followed by decay. The model is trained on 8 Nvidia V100 GPUs for approximately 12 days.

We evaluate bBERT on the English and Finnish benchmarks that have been used to evaluate the corresponding monolingual BERT models to allow direct comparison of the performance of the bilingual and monolingual models. For English, we choose the GLUE benchmark. For Finnish, we choose the benchmarks that have been used to evaluate FinBERT, which cover the following tasks: POS tagging, NER, dependency parsing, and text classification. We follow the procedures used in Virtanen et al. (2019) for Finnish evaluation (https://github.com/TurkuNLP/FinBERT/blob/master/nlpl_tutorial).
The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019b) is a collection of nine NLU datasets. The test sets of some of these datasets are only available through the online GLUE evaluation server (https://gluebenchmark.com/). Following Devlin et al. (2019), we exclude the Winograd Schema Challenge dataset (Levesque et al., 2011) from evaluation, using the following eight datasets:
• CoLA: The Corpus of Linguistic Acceptability (Warstadt et al., 2019) is a collection of sentences from published linguistics literature. Its training set consists of 8.5K examples, each a sentence with a binary label indicating whether the sentence is linguistically acceptable or not.
• SST-2: The Stanford Sentiment Treebank (Socher et al., 2013) consists of a collection of sentence excerpts from movie reviews, with a training set size of 67K examples. The GLUE benchmark uses sentence-level human judgments of sentiment, i.e., each example consists of a sentence with a label, either positive or negative.
• MRPC: The Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005) consists of sentence pairs from online news. Its training set has 3.7K examples, each a pair of automatically extracted sentences and a human judgment of whether they are semantically equivalent.
• QQP: The Quora Question Pairs dataset is a collection of question pairs extracted from the Quora website. The training set has 364K examples, each a pair of questions and a judgment of whether they are semantically equivalent.
• STS-B: The Semantic Textual Similarity Benchmark (Cer et al., 2017) consists of sentence pairs taken from various sources such as news headlines and image captions. The training set has 7K examples, each a pair of sentences and a human judgment of their similarity, ranging from 1 to 5.
• MNLI: The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018) is a collection of crowd-sourced sentence pairs and human judgments of their textual entailment relations (entailment, contradiction, or neutral). The corpus has a training set of 383K examples, and in- and out-of-domain development and test sets. The in-domain data (matched) are drawn from the same genres as the training data, while the out-of-domain data (mismatched) are drawn from different genres.
• QNLI: The Stanford Question Answering Dataset (Rajpurkar et al., 2016) is a set of question and answer pairs converted into the NLI format. The questions are collected from Wikipedia, while the answers are written by annotators. The training set of QNLI consists of 105K examples, each a question, an answer, and a label (either entailment or not_entailment).
• RTE: The Recognizing Textual Entailment (RTE) datasets are compiled from the datasets of four textual entailment challenges: RTE1 (Dagan et al., 2006), RTE2 (Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009). The training set consists of 2.5K examples drawn from news and Wikipedia, and each example has two sentences and a label (entailment or not_entailment).

We follow the fine-tuning approach introduced by Devlin et al. (2019) for hyperparameter selection and the task-specific additions to the model architecture; the implementation published at https://github.com/google-research/bert is used. For hyperparameter selection, we use a batch size of 32 and fine-tune for 3 epochs for all GLUE tasks. For each task, we search for the best-performing learning rate among {2e-5, 3e-5, 4e-5, 5e-5} on the development set with three replicates. For all tasks, a task-specific layer is added after the final Transformer layer. For tasks with two input sentences, both sentences are given as input at the same time, in the form [CLS] Sentence A [SEP] Sentence B, as in pre-training. For tasks with one input sentence, the second sentence is treated as degenerate, and the input is only [CLS] Sentence A.

The results on the test sets are shown in Table 2. Overall, bBERT performs comparably with Google's BERT. While most of the differences are within half a percentage point, the original BERT obtains better results on the single-sentence classification tasks, and bBERT performs better on the RTE task.
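As an illustration of the fine-tuning recipe described above, the sketch below runs the learning-rate grid for one sentence-pair task (RTE). It uses the Hugging Face transformers and datasets libraries rather than the google-research/bert implementation used in our experiments, and the model path is a placeholder for the released bBERT checkpoint.

    import numpy as np
    from datasets import load_dataset
    from transformers import (BertForSequenceClassification, BertTokenizerFast,
                              Trainer, TrainingArguments)

    MODEL = "path/to/bBERT"  # placeholder for the released bilingual checkpoint

    tokenizer = BertTokenizerFast.from_pretrained(MODEL)
    dataset = load_dataset("glue", "rte")

    def encode(batch):
        # Sentence-pair tasks are packed as [CLS] Sentence A [SEP] Sentence B [SEP];
        # single-sentence tasks would pass only one text argument.
        return tokenizer(batch["sentence1"], batch["sentence2"],
                         truncation=True, padding="max_length", max_length=128)

    dataset = dataset.map(encode, batched=True)

    def accuracy(eval_pred):
        logits, labels = eval_pred
        return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

    best = None
    for lr in (2e-5, 3e-5, 4e-5, 5e-5):                         # learning-rate grid
        model = BertForSequenceClassification.from_pretrained(MODEL, num_labels=2)
        args = TrainingArguments(output_dir=f"rte-lr{lr}",
                                 learning_rate=lr,
                                 per_device_train_batch_size=32,  # batch size 32
                                 num_train_epochs=3)              # 3 epochs for all GLUE tasks
        trainer = Trainer(model=model, args=args,
                          train_dataset=dataset["train"],
                          eval_dataset=dataset["validation"],
                          compute_metrics=accuracy)
        trainer.train()
        acc = trainer.evaluate()["eval_accuracy"]
        if best is None or acc > best[0]:
            best = (acc, lr)
    print("selected learning rate:", best[1])

In our experiments each grid point is additionally run with three replicates; the sketch shows a single run per learning rate.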
We use four tasks that have been used in the FinBERT evaluation to evaluate bBERT: POS tagging, NER, dependency parsing, and text classification.
We briefly introduce the datasets used for these four tasks in this section and refer to Virtanen et al. (2019) for detailed descriptions of the data.
Table 2: Results on the GLUE test sets for BERT-Base and bBERT (tasks: CoLA, SST-2, MRPC, QQP, STS-B, MNLI-m/mm, QNLI, RTE, and the average over tasks), as scored by the GLUE evaluation server (https://gluebenchmark.com/leaderboard). The row of numbers under the task names is the size of each training set. For CoLA, the metric is Matthews correlation; F-score for MRPC and QQP; Spearman correlation for STS-B; and accuracy for the rest of the tasks. The average scores are higher than those reported on the GLUE leaderboard because the WNLI task is excluded. The BERT-Base results are taken from Devlin et al. (2019).
• POS tagging: The three Finnish treebanks in the Universal Dependencies (UD) collection (Nivre et al., 2016) are used: the Turku Dependency Treebank (TDT) (Pyysalo et al., 2015), FinnTreeBank (FTB) (Voutilainen et al., 2012), and the Parallel Universal Dependencies treebank (PUD) (Zeman et al., 2017). UD version 2.2 is used for comparability with the FinBERT results, which in turn chose this version for comparability with the results of the CoNLL shared tasks in 2017 and 2018 (Zeman et al., 2017, 2018). For the PUD corpus, which has no training or development sets, the corresponding sets from the TDT corpus are used for training and parameter selection.
• NER: FinBERT was only evaluated on the FiNER corpus (Ruokolainen et al., 2020) because it was the only NER corpus available for Finnish during its development. The text source of FiNER is a Finnish online technology news site, and the corpus contains CoNLL-style annotations of person, organization, location, product, and event mentions, as well as dates. It also contains an out-of-domain test set, whose text source is Wikipedia. Though the corpus contains a small number of nested annotations, we follow the FinBERT evaluation and use only non-nested annotations.
• Dependency parsing: The same versions of the three Finnish UD treebanks used for POS tagging are also used for this task.
• Text classification: We use the text classification corpora created by Virtanen et al. (2019). These are the Yle news corpus, which contains formal text, and the Ylilauta corpus, whose text source is the online forum Ylilauta and which thus contains informal text. The Yle news corpus contains documents, each annotated with one of ten topics, such as sports, politics, and economy. Due to license restrictions, the corpus cannot be redistributed, but the code for recreating the dataset is available (https://github.com/spyysalo/yle-corpus). The Ylilauta corpus contains online postings and the topics of the boards onto which they were posted. As with the Yle news corpus, the Ylilauta corpus also contains documents for ten topics. Both corpora have a training set size of 100K examples and a balanced distribution of classes. All documents are truncated to at most 256 tokens to reduce the advantage that a compact representation of Finnish may give language-specific models.

For ease of comparison of model performance, we organize the results similarly to the GLUE benchmark to obtain a single number for each model. Specifically, there are four tasks, and each task has its own subsets: POS tagging (TDT, FTB, and PUD treebanks), NER (in-domain and out-of-domain test sets), dependency parsing (TDT, FTB, and PUD treebanks), and text classification (Yle and Ylilauta corpora). For dependency parsing, Virtanen et al. (2019) report results for both predicted and gold segmentation; we choose the results for gold segmentation to focus specifically on model performance for parsing. For text classification, Virtanen et al. (2019) also evaluate on downsampled versions of the corpora, but we include only the full versions. We adopt a similar approach to GLUE and take the algebraic mean as the final score for comparison, though we recognize that (as in GLUE) these evaluations use different metrics.
We follow the task-specific model architectures implemented in Virtanen et al. (2019) for all tasks.
• POS tagging: The BERT POS tagger (https://github.com/spyysalo/bert-pos) adds a time-distributed dense output layer on top of the BERT model and represents each word by the first WordPiece of its tokenization (a sketch of this alignment appears after this list). The official CoNLL 2018 evaluation script is used, which reports the UPOS metric.
• NER: The NER implementation (https://github.com/jouniluoma/keras-bert-ner) follows the method employed by Devlin et al. (2019), which attaches a dense layer on top of BERT and independently predicts IOB tags. The evaluation metric is mention-level F-score, as implemented in the standard conlleval script.
• Dependency parsing: UDify (Kondratyuk and Straka, 2019) is a multi-task model for BERT that jointly predicts POS tags, morphological features, lemmas, and dependency trees. It has task-specific prediction layers on top of BERT, as well as task-specific layer attention that calculates a weighted sum of the representations of each token from all layers. The original UDify model is fine-tuned from mBERT and was trained on 75 languages with UD treebanks. We use bBERT and train separately on the TDT and FTB treebanks. As PUD has no training set, the model trained on TDT is used for prediction on the PUD test set. The metric used for parsing is the Labeled Attachment Score (LAS).
• Text classification: A task-specific layer on top of the [CLS] token is added for class prediction, following Devlin et al. (2019) and Virtanen et al. (2019), and performance is evaluated in terms of accuracy.

Unless otherwise specified, parameter selection is carried out in the same way as in Virtanen et al. (2019), with a grid over the learning rate, the number of epochs, and the batch size. For parameter selection on the development sets, 3-5 replicates are run, and 5-10 replicates are run with the selected parameters for final evaluation on the test sets. No replicates are done for dependency parsing due to the large training set. For text classification, the grid is narrowed down to batch sizes of 16 and 20, while fixing the learning rate at 2e-5 and the number of epochs at 4 based on previous experiments.
           POS tagging        NER          Dependency parsing  Text classification  Average
           (TDT/FTB/PUD)      (ID/OOD)     (TDT/FTB/PUD)       (Yle/Ylilauta)
FinBERT                                                                             92.34
bBERT      98.14/98.16/98.07  92.23/81.08  93.16/93.50/93.02   91.37/81.42          92.02
mBERT      96.97/95.87/97.58  90.29/76.15  87.99/87.46/89.75   90.28/76.51          88.88
Table 3: Results for evaluation on all Finnish tasks. All BERT models are cased models. The metric for POS tagging is UPOS, for NER F-score, for dependency parsing LAS, and for text classification accuracy. Apart from the bBERT numbers, the results are taken from Virtanen et al. (2019). For NER, ID stands for the in-domain test set (news) and OOD for the out-of-domain test set (Wikipedia).

The results are reported in Table 3. Overall, FinBERT achieves the best average performance of 92.34%, while bBERT scores slightly below at 92.02%. Although bBERT does not quite match FinBERT, it clearly outperforms mBERT, whose overall score is 88.88%: bBERT covers 90.75% of the performance difference between mBERT and FinBERT. The slightly lower scores obtained by bBERT could be due to randomness in training, or to the fact that the effective number of vocabulary entries for Finnish is smaller for bBERT than for FinBERT, as the 80,000 tokens are shared between English and Finnish.

We note that although POS tagging, NER, and dependency parsing results were also recently reported by Ulčar and Robnik-Šikonja (2020) for their Finnish-Estonian-English BERT model, these results are not directly comparable to ours: the UD version of the treebanks they used is not specified for POS tagging and dependency parsing, and although they used the same corpus (FiNER) for NER, the named entity types were reduced to only person, location, and organization. We thus refrain from direct numerical comparison to their results.
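For concreteness, the aggregate numbers discussed above can be reproduced from the per-dataset scores in Table 3 (exact decimal arithmetic is used to avoid floating-point rounding):

    from fractions import Fraction as F

    # bBERT scores from Table 3: POS (TDT/FTB/PUD), NER (ID/OOD), parsing (TDT/FTB/PUD),
    # text classification (Yle/Ylilauta).
    bbert = [F("98.14"), F("98.16"), F("98.07"),
             F("92.23"), F("81.08"),
             F("93.16"), F("93.50"), F("93.02"),
             F("91.37"), F("81.42")]
    print(float(sum(bbert) / len(bbert)))     # 92.015, reported as 92.02

    # Share of the mBERT-to-FinBERT gap covered by bBERT, from the reported averages.
    gap_covered = (F("92.02") - F("88.88")) / (F("92.34") - F("88.88"))
    print(float(100 * gap_covered))           # about 90.75, matching the 90.75% in the text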
Multilingual BERT models have been shown to learn a language-agnostic subspace that allows cross-lingual transfer (Wu and Dredze, 2019; Pires et al., 2019; Cao et al., 2020). However, current research has shown that model capacity needs to be increased to cover multiple languages without loss in model quality (Conneau et al., 2020). There are several potential ways in which BERT capacity could be increased. Taking into account the observation by Artetxe et al. (2020) that the effective vocabulary size per language plays an important role in model performance, we have here explored an approach where the vocabulary size was increased without increasing the model size in other ways or increasing the number of training steps.
Setting the vocabulary size to be the sum of the sizes of those of the monolingual models, we trained a Finnish-English bilingual BERT model and compared its performance to English and Finnish monolingual BERTs on various established benchmark tasks. We have shown that it is possible to create a bilingual model without compromising the performance on either language. As multilingual BERT models have been shown to be capable of cross-lingual transfer (Artetxe et al., 2020; Ulčar and Robnik-Šikonja, 2020), we expect that our bilingual model can be used for cross-lingual tasks as well as for the monolingual tasks considered here, but direct study of the cross-lingual capabilities is currently hindered by the lack of Finnish-English cross-lingual datasets. Our approach increased the vocabulary size but not other aspects of the model, leaving other potential approaches as future work. We note that, during training, the bilingual BERT saw roughly half as many Finnish examples as the Finnish BERT did, since the total number of examples seen by the two models is approximately the same. The ability of the bilingual BERT to perform almost on par with the Finnish BERT may be due to the language-agnostic subspace reported to be learned by BERT models.

Potential follow-up questions to our study include to what degree the approach of increasing the vocabulary size can be extended to cover more languages, and whether the effective vocabulary size per language needed to achieve comparable performance remains the same when the number of languages increases, if comparable performance is possible at all. One limitation of our evaluation is the use of the GLUE benchmark without evaluation on more challenging datasets or benchmarks such as SQuAD or SuperGLUE (Wang et al., 2019a). Question answering is considered to be a more challenging task, whereas most of the tasks in GLUE are two- or three-way classification tasks, some of which have been criticized for the presence of artifacts (Gururangan et al., 2018). We hope to explore these questions as well as the cross-lingual capabilities of the model in future work.
We have studied the feasibility of training a fully bilingual deep neural language model, i.e. a model that approaches or matches the performance of monolingual models at language-specific tasks. We trained a bilingual Finnish-English BERT-Base model by expanding the vocabulary size to the sum of the sizes of the two individual vocabularies, and compared the model performance to monolingual models. We found that, on a range of NLU tasks, the bilingual model performs comparably or nearly comparably with the monolingual models. We conclude that, for the BERT-Base architecture, it is possible to train a fully bilingual deep contextual model for two remotely related languages. We release the newly introduced bBERT model and all tools introduced to create the model under open licenses at https://github.com/TurkuNLP/biBERT.

Acknowledgements
We are thankful for the support of the CSC IT Center for Science for the computing resources it provides. L.H.C. is funded by the DigiCampus project overseen by the EXAM Consortium. We are grateful to Juhani Luotolahti and Antti Virtanen for the technical help they provided. We also thank Jouni Luoma for his consultation on the Finnish NER experiments.
References
M. Artetxe, S. Ruder, and D. Yogatama. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, 2020.
L. Bentivogli, I. Dagan, H. T. Dang, D. Giampiccolo, and B. Magnini. The fifth PASCAL recognizing textual entailment challenge. In TAC, 2009. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.232.1231&rep=rep1&type=pdf.
S. Cao, N. Kitaev, and D. Klein. Multilingual alignment of contextual word representations. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1xCMyBtPS.
D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, 2017.
A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, 2020.
I. Dagan, O. Glickman, and B. Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pages 177–190. Springer, 2006.
W. de Vries, A. van Cranenburgh, A. Bisazza, T. Caselli, G. van Noord, and M. Nissim. BERTje: A Dutch BERT model. arXiv preprint arXiv:1912.09582, 2019.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
W. B. Dolan and C. Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing, 2005.
D. Giampiccolo, B. Magnini, I. Dagan, and B. Dolan. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9. Association for Computational Linguistics, 2007.
S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, 2018.
R. B. Haim, I. Dagan, B. Dolan, L. Ferro, D. Giampiccolo, B. Magnini, and I. Szpektor. The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, 2006. URL http://u.cs.biu.ac.il/~nlp/RTE2/Proceedings/01.pdf.
K. Karthikeyan, Z. Wang, S. Mayhew, and D. Roth. Cross-lingual ability of multilingual BERT: An empirical study. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJeT3yrtDr.
D. Kondratyuk and M. Straka. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, 2019.
O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, 2019.
T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, 2018.
Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1eA7AEtvS.
H. J. Levesque, E. Davis, and L. Morgenstern. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47, 2011.
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
L. Martin, B. Muller, P. J. O. Suárez, Y. Dupont, L. Romary, É. V. de la Clergerie, D. Seddah, and B. Sagot. CamemBERT: A tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, pages 3111–3119, 2013. URL https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.
J. Nivre, M.-C. de Marneffe, F. Ginter, Y. Goldberg, J. Hajič, C. D. Manning, R. McDonald, S. Petrov, S. Pyysalo, N. Silveira, R. Tsarfaty, and D. Zeman. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666, 2016.
J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, 2018.
T. Pires, E. Schlinger, and D. Garrette. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, 2019.
J. Pomikálek. Removing boilerplate and duplicate content from web corpora. PhD thesis, Masaryk University, Faculty of Informatics, Brno, Czech Republic, 2011.
S. Pyysalo, J. Kanerva, A. Missilä, V. Laippala, and F. Ginter. Universal Dependencies for Finnish. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pages 163–172, 2015.
S. Pyysalo, J. Kanerva, A. Virtanen, and F. Ginter. WikiBERT models: Deep transfer learning for many languages. arXiv preprint arXiv:2006.01538, 2020.
P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP, pages 2383–2392. Association for Computational Linguistics, 2016.
A. Rogers, O. Kovaleva, and A. Rumshisky. A primer in BERTology: What we know about how BERT works. arXiv preprint arXiv:2002.12327, 2020.
T. Ruokolainen, P. Kauppinen, M. Silfverberg, and K. Lindén. A Finnish news corpus for named entity recognition. Language Resources and Evaluation, 54:247–272, 2020.
V. Sanh, L. Debut, J. Chaumond, and T. Wolf. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, 2016.
R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.
M. Ulčar and M. Robnik-Šikonja. FinEst BERT and CroSloEngual BERT: Less is more in multilingual models. arXiv preprint arXiv:2006.07890, 2020.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017. URL https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
A. Virtanen, J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, and S. Pyysalo. Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076, 2019.
A. Voutilainen, K. Muhonen, T. Purtonen, and K. Lindén. Specifying treebanks, outsourcing parsebanks: FinnTreeBank 3. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 1927–1931, 2012.
A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3261–3275, 2019a. URL https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf.
A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019b. URL https://openreview.net/forum?id=rJ4km2R5t7.
A. Warstadt, A. Singh, and S. R. Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019. doi: 10.1162/tacl_a_00290. URL https://doi.org/10.1162/tacl_a_00290.
A. Williams, N. Nangia, and S. R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, 2018.
S. Wu and M. Dredze. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, 2019.
Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. S. Corrado, M. Hughes, and J. Dean. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, and C.-J. Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Syx4wnEtvH.
R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, 2018.
D. Zeman, M. Popel, M. Straka, J. Hajič, J. Nivre, F. Ginter, J. Luotolahti, S. Pyysalo, S. Petrov, M. Potthast, et al. CoNLL 2017 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–19, 2017.
D. Zeman, J. Hajič, M. Popel, M. Potthast, M. Straka, F. Ginter, J. Nivre, and S. Petrov. CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21, 2018.
Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, 2015.