Effects of Language Relatedness for Cross-lingual Transfer Learning in Character-Based Language Models
Mittul Singh∗, Peter Smit∗‡, Sami Virpioja†, Mikko Kurimo∗
∗Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland
†Department of Digital Humanities, University of Helsinki, Helsinki, Finland
‡Inscripta, Helsinki, Finland
firstname.lastname@{aalto,helsinki}.fi

Abstract
Character-based Neural Network Language Models (NNLM) have the advantage of a smaller vocabulary and thus faster training times in comparison to NNLMs based on multi-character units. However, in low-resource scenarios, both character and multi-character NNLMs suffer from data sparsity. In such scenarios, cross-lingual transfer has improved multi-character NNLM performance by allowing information transfer from a source to the target language. In the same vein, we propose to use cross-lingual transfer for character NNLMs applied to low-resource Automatic Speech Recognition (ASR). However, applying cross-lingual transfer to character NNLMs is not as straightforward. We observe that the relatedness of the source language plays an important role in cross-lingual pretraining of character NNLMs. We evaluate this aspect on ASR tasks for two target languages: Finnish (with English and Estonian as sources) and Swedish (with Danish, Norwegian, and English as sources). Prior work has observed no difference between using a related or an unrelated language for multi-character NNLMs. We, however, show that for character-based NNLMs, only pretraining with a related language improves the ASR performance, and using an unrelated language may deteriorate it. We also observe that the benefits are larger when there is much less target data than source data.
Keywords:
Cross-lingual transfer, Character language models, Low-resource ASR
1. Introduction
Multilingual training of language models has successfully leveraged datasets from other languages to improve Neural Network Language Modeling (NNLM) performance in low-resource scenarios (Kim et al., 2019; Conneau and Lample, 2019; Conneau et al., 2019; Aharoni et al., 2019). One such method for training NNLMs is the multi-task-based approach, where multiple language corpora train the model simultaneously (Aharoni et al., 2019). Another approach is cross-lingual pretraining, where the NNLM is trained on a set of source languages followed by fine-tuning on the target language (Kim et al., 2019; Conneau and Lample, 2019; Conneau et al., 2019). The second approach, explored in this work, is favorable when re-training with the large source data is time-consuming, as an existing trained source model's weights can be transferred to the target model and then fine-tuned on the smaller target data.

Cross-lingually pretrained NNLMs have utilized multi-character units to construct a large shared vocabulary to allow the positive transfer of information from source to target. Instead of multi-character units, we explore a single character as the modeling unit for applying cross-lingual pretraining. This choice has the advantage of reducing the vocabulary size by several orders of magnitude and providing a larger intersection of vocabulary terms than multi-character units. In this paper, we apply cross-lingual pretraining to character NNLMs. However, this off-the-shelf application is not trivial. For multi-character-based NNLMs, cross-lingual pretraining works by sharing information across various source languages independent of relatedness to the target language in terms of closeness in the language family tree (https://en.wikipedia.org/wiki/Language_family). In contrast, for character-based NNLMs, a source language in the same family subtree as the target (related) affects the downstream performance more positively than an unrelated source language.

We experiment with available Finnish and Swedish Automatic Speech Recognition (ASR) systems in a simulated low-resource ASR scenario by limiting the language modeling resources. We apply pretraining with two source languages (Estonian and English) for Finnish ASR and three source languages (Danish, English, and Norwegian) for Swedish ASR. In our experiments, we observe perplexity and ASR performance improvements when pretraining NNLMs with related languages (i.e., Estonian for Finnish, and Danish and Norwegian for Swedish), whereas pretraining NNLMs on English performs adversely.

We also study the impact on cross-lingual transfer of the target data size and the number of source model layers transferred. Relatively smaller amounts of target language data than source language data lead to more considerable ASR performance improvements. Moreover, we find that pretrained NNLMs perform best when we transfer only the parameters of the lowest layer of the source model.
2. Related Work
In our work, we follow the cross-lingual pretraining scheme utilizing a shared vocabulary as proposed by Zhuang et al. (2017), where they transfer all the hidden layers except the final layer from the source model to the target model. For NNLMs, such an application does not obtain the best results. In Sections 6. and 7., we present results to support this observation.

Concurrently, Lample and Conneau (Conneau and Lample, 2019) have also shown that cross-lingual pretraining can improve the performance of language models on intrinsic measures like perplexity.

Language         Vocabulary   Train   Dev
Finnish ASR
English (En)     232K         116M    107K
Estonian (Et)    1.7M          97M     33K
Finnish (Fi)     1.1M          17M    130K
Swedish ASR
Danish (Da)      2.7M         365M    222K
English (En)     466K         366M    107K
Norwegian (No)   2.4M         381M    194K
Swedish (Sv)     936K          45M    158K
Thousands (K), Millions (M)
Table 1: The table reports the word vocabulary, training set (Train) and development set (Dev) sizes of the languages used in the experiments.

They train a multi-character transformer-based language model with a masked language model training procedure for cross-lingual pretraining. In their model, multi-character units from both the source and target languages are combined to form one large vocabulary. This large shared vocabulary leads to a large output layer, which can be inefficient to train. The layer size can be reduced by shortlists and class-based models (Goodman, 2001; Le et al., 2011), or approximated by applying a hierarchical softmax (Morin and Bengio, 2005). Instead, we choose characters as the basic unit of modeling, which provides a more natural way of reducing the vocabulary size. Simultaneously, this choice supports the cross-lingual information transfer by providing a larger intersection of vocabulary terms than multi-character units.

For cross-lingual pretraining, language relatedness remains an unexplored factor, which becomes the focus of our work. Prior work has applied cross-lingual transfer by using several unrelated languages as a source. Using a related language can be crucial in low-resource scenarios, as we discover in Sections 6. and 7. In our work, we limit cross-lingual transfer to one source language, allowing a simpler setup for better analysis; in the future, we would like to explore the impact of relatedness when the number of source languages is increased dramatically.
3. Datasets
We create two setups to evaluate cross-lingual pretraining for NNLMs. In the first setup, English (En) and Estonian (Et) are the high-resource sources of language modeling corpora, and Finnish (Fi) is the low-resource target language. In the second setup, Danish (Da), English, and Norwegian (No) are the high-resource source languages, and Swedish (Sv) is the low-resource target language.

Estonian and Finnish are contained in the Finnic language subtree, and Danish, Norwegian, and Swedish belong to the North Germanic language subtree. Thus, these source-target pairs are considered related languages. For both Finnish and Swedish, English, being part of the West Germanic language subtree, is considered a more unrelated language. We also chose English because it has a large intersection for the character set, but is less mutually intelligible in comparison.

The English text is obtained from the training data of the 2015 MGB Challenge (Bell et al., 2015) and consists of BBC news transcripts. The Estonian corpus consists of web crawl text and spontaneous conversational transcripts from Meister et al. (2012) and has been used by Enarvi et al. (2017).
[Figure 1 graphic: the source (trained) and target (initialized) networks, each with the layer stack Input, Projection, LSTM, Dropout, Highway, Softmax.]
Figure 1: The figure displays the source and target NNLMs with the hidden layers used in our experiments. In cross-lingual pretraining, the source-language-trained hidden layers initialize parts (dotted lines) of the target-language network shown by the arrows. In contrast, the rest is randomly initialized (bold lines in the target network).

The Finnish corpus is from the Finnish Text Collection containing text from newspapers, books and novels (CSC - IT Center for Science, 1998) and has been used by Smit et al. (2017). The Swedish, Danish, and Norwegian corpora, containing newspaper articles, are downloaded from the Språkbanken corpus and have been used by Smit et al. (2018). For Finnish as the target, more data was available for English than for Estonian, so we extract only a portion of the English dataset to allow for a similar average number of words per line in both datasets. We list the corpora statistics for the various languages used in our experiments in Table 1.
4. Building Language Models
We train character NNLMs for our experiments and mark both the left and right ends of characters except when at the beginning or the end of a word (e.g., model = m+ +o+ +d+ +e+ +l) to achieve the best results (Smit et al., 2017). With this marking scheme, we can differentiate the characters from a word into beginning (B), middle (M), end (E) and singleton units. This notation becomes relevant in Section 6., where we analyze the differences in perplexity per word position.

We build Recurrent Neural Network Language Models (RNNLM) with a projection layer (200 neurons), an LSTM layer (1000 neurons), a highway layer (1000 neurons) and a softmax output layer (displayed in Figure 1). In our experiments, both the source- and target-language neural networks have the same architecture. We train the RNNLMs using TheanoLM (Enarvi and Kurimo, 2016), applying the adaptive gradient (Adagrad) algorithm to update the model parameters after processing a mini-batch of training examples. The mini-batch size for the models was 64, with a sequence length of 100. We used an initial learning rate of 0.1 in all the experiments, and a dropout of 0.2 was used to regularize the parameter learning.
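To make the marking scheme concrete, the following minimal Python sketch (our illustration, not code from the paper or from TheanoLM) splits a word into character units with the boundary markers described above.

```python
def mark_characters(word):
    """Split a word into character units, adding '+' on every side that is
    not a word boundary, e.g. 'model' -> ['m+', '+o+', '+d+', '+e+', '+l'].
    A one-letter word stays a singleton unit without markers."""
    units = []
    for i, ch in enumerate(word):
        left = "+" if i > 0 else ""               # not the first character
        right = "+" if i < len(word) - 1 else ""  # not the last character
        units.append(left + ch + right)
    return units

print(" ".join(mark_characters("model")))  # m+ +o+ +d+ +e+ +l
```

Here 'm+' is a beginning (B) unit, '+o+' a middle (M) unit, and '+l' an end (E) unit, matching the categories used in the analysis of Section 6.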
Finnish test set perplexity
Fi (baseline):
En → Fi:  4195  4617  5458  4211
Et → Fi:  3402  3585  3901
Swedish test set perplexity
Sv (baseline):
En → Sv:  334  322  337  315
No → Sv:  311  312  287
Da → Sv:  291  292  317  291

Table 2: The table reports the NNLMs' test set perplexity for Finnish and Swedish using different cross-lingual initializations. For Finnish, English and Estonian are used as the source languages for pretraining. For Swedish, we use Danish, English and Norwegian as source languages. The best results in each category are marked in boldface.
5. Exploring Cross-Lingual Pretraining
Cross-lingual pretraining involves first training the neural network on a source language. Then, starting from the input layer, the source network's hidden layers initialize the target-language neural network partially or wholly. In a partial initialization, we initialize the remaining layers randomly. This initialization step is followed by training on the target language, also referred to as the fine-tuning step. In both the pretraining and the fine-tuning step, the output-layer vocabulary consists of character units from all the source languages and the target language. The pretraining step transfers coarser-level information from the input to higher layers into the target model, and during fine-tuning, the target model refines this transferred information to a more fine-grained level.

We study the neural network models across three dimensions: the source language used for the pretraining step; the number of target-model hidden layers (l) initialized, counting from the input layer; and the amount of target language data. We represent the LM pretrained using the source language y and fine-tuned using the target language z as y → z. We vary l from 1 to 4 for the architecture in Figure 1, which also shows an example for l = 3. Here l = 1 refers to initializing only the projection layer, and l = 4 refers to initializing all the layers. We increase the amount of target data to match the source data size. Varying these parameters allows us to understand their effect on the transfer capacity of cross-lingual pretraining.
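The transfer step amounts to copying the parameters of the lowest l layer groups from the trained source model into a freshly initialized target model. The sketch below illustrates this in PyTorch; it is our own illustration under the layer sizes of Section 4., not the authors' TheanoLM implementation, and the grouping of the highway gate and transform into one "layer" for the l counter is our assumption.

```python
import torch
import torch.nn as nn

class CharRNNLM(nn.Module):
    """Character RNNLM with the layer stack from Section 4.:
    projection (200) -> LSTM (1000) -> dropout -> highway (1000) -> softmax output."""
    def __init__(self, vocab_size, proj=200, hidden=1000, dropout=0.2):
        super().__init__()
        self.projection = nn.Embedding(vocab_size, proj)        # l = 1
        self.lstm = nn.LSTM(proj, hidden, batch_first=True)     # l = 2
        self.dropout = nn.Dropout(dropout)
        self.highway_gate = nn.Linear(hidden, hidden)            # l = 3
        self.highway_transform = nn.Linear(hidden, hidden)       # l = 3
        self.output = nn.Linear(hidden, vocab_size)              # l = 4

    def forward(self, x):                    # x: (batch, seq) of character ids
        h, _ = self.lstm(self.projection(x))
        h = self.dropout(h)
        gate = torch.sigmoid(self.highway_gate(h))
        h = gate * torch.relu(self.highway_transform(h)) + (1.0 - gate) * h
        return self.output(h)                # logits over the shared character set

LAYER_GROUPS = [("projection",), ("lstm",),
                ("highway_gate", "highway_transform"), ("output",)]

def transfer_lowest_layers(source, target, l):
    """Copy the lowest l layer groups of the trained source-language model into
    the target-language model; higher layers keep their random initialization."""
    for group in LAYER_GROUPS[:l]:
        for name in group:
            getattr(target, name).load_state_dict(getattr(source, name).state_dict())

# Both models use the shared source+target character vocabulary, so the
# projection and output layers have matching shapes and can be copied directly.
shared_vocab_size = 200                     # hypothetical size of the shared character set
source_lm = CharRNNLM(shared_vocab_size)    # assumed already trained on the source language
target_lm = CharRNNLM(shared_vocab_size)
transfer_lowest_layers(source_lm, target_lm, l=1)
```

After the copy, the target model is fine-tuned on the target-language data as described above.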
6. Perplexity Experiments
Table 2 presents the test set perplexity of the Finnish and Swedish LMs. When using related source languages (Estonian for Finnish, or Danish and Norwegian for Swedish) to pretrain the models, we obtain better perplexity than the baseline and than when English (the unrelated source language) is used, which leads to a worse perplexity for all values of l. Using related source languages, the pretrained target LMs outperform the baseline results for most values of l, notably when initialized with the l = 1 configuration of the source model. Here, we note that the Finnish perplexity values are large due to long words and the domain mismatch between the training (books, newspaper articles and journals) and test (broadcast news) sets.
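For the per-position analysis in Figure 2, each marked character unit is binned by its position in the word (B/M/E) and by whether the character is a consonant or a vowel (C/V). A minimal sketch of such a binning, with a hypothetical Finnish vowel set, could look as follows.

```python
FINNISH_VOWELS = set("aeiouyäö")   # assumed vowel inventory for this illustration

def unit_type(unit):
    """Classify a marked character unit (see Section 4.) into BC/MC/EC/BV/MV/EV;
    singleton units (one-letter words) fall into SC/SV and are not plotted."""
    if unit.startswith("+") and unit.endswith("+"):
        position = "M"            # middle of a word
    elif unit.endswith("+"):
        position = "B"            # beginning of a word
    elif unit.startswith("+"):
        position = "E"            # end of a word
    else:
        position = "S"            # singleton
    char = unit.strip("+")
    sound = "V" if char.lower() in FINNISH_VOWELS else "C"
    return position + sound

print([unit_type(u) for u in ["m+", "+o+", "+d+", "+e+", "+l"]])
# ['BC', 'MV', 'MC', 'MV', 'EC']
```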
[Figure 2 graphic: relative character-PPL difference (%) of X minus Fi, for X in {Et → Fi, En → Fi} with l = 1 to 4, over the unit types BC, MC, EC, BV, MV and EV.]
Figure 2: The figure shows the relative differences (%) in character perplexity (PPL) for three different types of character units of different Finnish NNLMs on the test set. These character units exist due to the marking scheme used here: beginning (B), middle (M) and end (E), which can further be classified into consonants (C) and vowels (V).

On characters, similar trends of perplexity improvement for related vs. unrelated source languages and different values of l are observed. Character perplexity differences for Finnish are presented in Figure 2 for different types of units, i.e., consonants (C) and vowels (V) depending on their position in the word: beginning (B), middle (M) and end (E). In Figure 2, most perplexity improvements are obtained for middle consonants (MC) and middle vowels (MV), which are more frequent than other character units. For the other character units, small but consistent improvements are obtained by the Et → Fi (l = 1) LM over the baseline and the other LMs. For Swedish, improvements similar to the Finnish results are observed for MC and MV, but some dips are seen for Danish-pretrained LMs on end consonants (EC). For brevity, we do not present this result in the paper. Overall, the improvement from related-language pretraining impacts the different types of characters, enabled by a large intersection in the source-target character set.

We suspect that pretraining with a related language finds more useful information than with an unrelated one. To investigate this effect, we calculate the cosine similarity between the pretrained and baseline LMs' output layer embeddings. We first find an affine transformation to align the pretrained LM's embeddings with the baseline's embedding space, and then calculate the average similarity between the two sets. On Finnish, the English-pretrained embeddings have a higher average similarity (0.53) to the baseline embeddings than the Estonian-pretrained embeddings (0.51). On Swedish, similar results are observed, with cosine similarities for the English-, Norwegian- and Danish-pretrained embeddings at 0.43, 0.42 and 0.42. These numbers suggest that the related-language pretrained LMs differ more from the baseline LM than the English-pretrained LMs do. As they also perform better in terms of perplexity, the related-language pretraining seems to learn information that is complementary to the baseline LM.
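The embedding comparison can be reproduced schematically as follows. This is a sketch under our own assumptions: the paper states that an affine alignment is used but not how it is estimated, so here it is fitted by least squares over the shared character vocabulary.

```python
import numpy as np

def mean_aligned_cosine(pretrained_emb, baseline_emb):
    """Align the pretrained LM's output embeddings to the baseline LM's
    embedding space with a least-squares affine map, then return the average
    cosine similarity between corresponding character embeddings.

    Both inputs are (num_characters, dim) arrays with rows in the same
    character order."""
    # Affine map: append a bias column, then solve min ||X W - baseline_emb||^2.
    X = np.hstack([pretrained_emb, np.ones((pretrained_emb.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, baseline_emb, rcond=None)
    mapped = X @ W
    cosines = np.sum(mapped * baseline_emb, axis=1) / (
        np.linalg.norm(mapped, axis=1) * np.linalg.norm(baseline_emb, axis=1))
    return float(np.mean(cosines))
```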
7. Speech Recognition Experiments
For training the Finnish acoustic models, we used 1500 hours of Finnish audio from three different sources, namely, the Speecon corpus (Iskra et al., 2002), the Speechdat database (Rosti et al., 1998) and the parliament corpus (Mansikkaniemi et al., 2017).

Fi (baseline):   16.44
En → Fi:         16.70  16.90  *  *
Et → Fi:         16.20  16.14  *  *
Linear interpolations
En → Fi + Fi:    *  *  *
Et → Fi + Fi:    *  *  *  *

Table 3: The table reports WER on the Finnish ASR task using different cross-lingual initializations for the RNNLMs used in rescoring. Here English and Estonian are used as the source languages for pretraining. Asterisks (*) denote statistical significance when comparing against Fi (16.44) using the matched pairs test with p < . The best results in each section are marked in boldface.
[Figure 3 graphic: WER as a function of the target language data portion (1x to 10x) for Fi, Et → Fi and En → Fi.]

Figure 3: The figures display WERs on Finnish ASR measured when varying the amount of source language data and when varying the amount of target language data.

For testing, we used a broadcast news dataset from the Finnish national broadcaster (Yle) containing 5 hours of speech and 35k words (Mansikkaniemi et al., 2017). For training the Swedish acoustic models, we used 354 hours of audio provided by the Språkbanken corpus. From the original evaluation set, we used a total of 9 hours for development and evaluation. The acoustic models were trained with the Kaldi toolkit (Povey et al., 2011) with a recipe similar to Smit et al. (2017). Instead of phonemes, we use grapheme units, as this allows for a trivial lexicon that maps between the acoustic and language modeling units. We evaluate the ASR performance in terms of Word Error Rate (WER).

For the first pass, we train a variable-length Kneser-Ney (Kneser and Ney, 1995) n-gram LM using the VariKN toolkit (Siivola et al., 2007). Then, the RNNLMs built in Section 4. are used to rescore the lattices. We also linearly interpolate the cross-lingually pretrained NNLMs with the target-only NNLM while optimizing the interpolation weight. We test the statistical significance of our results using the Matched Pairs Sentence Segment Word Error Test from the NIST Scoring Toolkit (SCTK) to compare different systems.
Sv (baseline):   4.41
En → Sv:         4.43  4.46  4.62  4.41
No → Sv:         4.18  4.42  4.38  4.17*
Da → Sv:         4.24  4.16  *  *
Linear interpolations
En → Sv + Sv:    *  *  *
No → Sv + Sv:    *  *
Da → Sv + Sv:    *  *  *  *

Table 4: The table reports WER on the Swedish ASR task for different configurations of the RNNLMs used in rescoring. Here Danish, English and Norwegian are used as the source languages for cross-lingual pretraining. Asterisks (*) denote statistical significance when comparing to Sv (4.41) using the matched pairs test with p < . The best results in each section are marked in boldface.

Tables 3 and 4 outline the performance of rescoring with the RNNLMs (Section 4.) on a Finnish and a Swedish ASR task. The first row of both tables displays the performance of the target-only trained RNNLMs (baseline). The second part reports the performance of the cross-lingually pretrained models (Section 5.), and the third part reports their linear interpolations with the target-only baseline models.

Similar to the perplexity results (Section 6.), related source language pretraining improves the ASR performance over the baseline models, and unrelated source language pretraining degrades it. On Finnish ASR, the English-pretrained RNNLM (En → Fi) lags behind the Estonian-pretrained RNNLM (Et → Fi), which also outperforms the Finnish-only models. On Swedish ASR, the Danish- (Da → Sv) and Norwegian-pretrained (No → Sv) models outperform the baseline and the English-pretrained models (En → Sv). In contrast with the perplexity results, the lower-layer (l = 1) initialization shows the most benefit over the higher-layer (l = 2, 3, 4) initializations for both Finnish and Swedish ASR. We note that, much like for the perplexity results, the WERs on Swedish are lower than on Finnish, as the Swedish task is easier than the Finnish one.

In Figure 3, we observe little performance increase from cross-lingual pretraining when we grow the target data size to sizes comparable to the source language data. At least for Estonian, increasing the Finnish (target) data closes the gap between the cross-lingually pretrained and the target-only model. Cross-lingual transfer seems to work best when the related source language has considerably more resources than the target language.

Furthermore, interpolations of the baseline model with the cross-lingually pretrained models improve over their constituent models. On both Finnish and Swedish ASR, cross-lingual pretraining with English combined with the baseline model can outperform the baseline model, unlike when used individually. This improvement can be attributed to the regularization effect of such an interpolation. Linear interpolations based on the other source languages, Estonian, Danish and Norwegian, further improve the results consistently across the different initialization schemes. We hypothesize that this effect is due to the complementary information learned by these related-language models. Overall, the individual systems and the interpolations based on related source languages show a significant and the most substantial improvement in performance.
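The interpolation itself is a standard linear mixture of the two LMs' probabilities; the sketch below shows one way to combine per-token log-probabilities and to pick the mixture weight on development data. The grid search is our assumption; the paper only states that the interpolation weight is optimized.

```python
import numpy as np

def interpolate_log_probs(logp_pretrained, logp_baseline, lam):
    """Log-probability of the mixture lam * P_pretrained + (1 - lam) * P_baseline,
    combined stably in log space for arrays of per-token log-probabilities."""
    return np.logaddexp(np.log(lam) + logp_pretrained,
                        np.log(1.0 - lam) + logp_baseline)

def pick_interpolation_weight(dev_logp_pretrained, dev_logp_baseline,
                              grid=np.linspace(0.05, 0.95, 19)):
    """Choose the weight that maximizes dev-set log-likelihood (minimizes perplexity)."""
    scores = [interpolate_log_probs(dev_logp_pretrained, dev_logp_baseline, lam).sum()
              for lam in grid]
    return float(grid[int(np.argmax(scores))])
```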
8. Concluding Remarks
We investigated cross-lingual transfer for character-based neural network language models in a low-resource scenario. Cross-lingual pretraining with a related source language significantly improved (3-6% relative) over no pretraining, whereas pretraining with an unrelated source language had adverse effects. At the character level, we suspect cross-lingual pretraining works for related languages because they share a large portion of the character set. The large shared vocabulary provides soft alignments between characters in related languages, supporting the transfer of relevant information from the source to the target models. This information transfer is in contrast to multi-character units, where the transfer depends on shared anchor tokens (like numbers and proper nouns). However, we still lack an empirical understanding of this phenomenon, and we hope to explore it in future work.

Additionally, transferring the lower-layer information and having more source data than target data proved important for low-resource ASR. As a follow-up to our study, we will investigate the effects of language relatedness for cross-lingual pretraining in transformer-based language models.
9. Bibliographical References
Aharoni, R., Johnson, M., and Firat, O. (2019). Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Bell, P., Gales, M. J., Hain, T., Kilgour, J., Lanchantin, P., Liu, X., McParland, A., Renals, S., Saz, O., Wester, M., et al. (2015). The MGB challenge: Evaluating multi-genre broadcast media recognition. In ASRU, pages 687–693.

Conneau, A. and Lample, G. (2019). Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 7057–7067.

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116.

CSC - IT Center for Science. (1998). The Helsinki Korp Version of the Finnish Text Collection.

Enarvi, S. and Kurimo, M. (2016). TheanoLM - an extensible toolkit for neural network language modeling. In INTERSPEECH, pages 3052–3056.

Enarvi, S., Smit, P., Virpioja, S., and Kurimo, M. (2017). Automatic speech recognition with very large conversational Finnish and Estonian vocabularies. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(11):2085–2097.

Goodman, J. (2001). Classes for fast maximum entropy training. In ICASSP, pages 561–564.

Iskra, D. J., Grosskopf, B., Marasek, K., van den Heuvel, H., Diehl, F., and Kießling, A. (2002). Speecon - speech databases for consumer devices: Database specification and validation. In LREC. European Language Resources Association.

Kim, Y., Gao, Y., and Ney, H. (2019). Effective cross-lingual transfer of neural machine translation models without shared vocabularies. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 1246–1257. Association for Computational Linguistics.

Kneser, R. and Ney, H. (1995). Improved backing-off for m-gram language modeling. In ICASSP, volume 1, pages 181–184.

Le, H. S., Oparin, I., Messaoudi, A. K., Allauzen, A., Gauvain, J. L., and Yvon, F. (2011). Large vocabulary SOUL neural network language models. In INTERSPEECH, pages 1469–1472.

Mansikkaniemi, A., Smit, P., and Kurimo, M. (2017). Automatic construction of the Finnish parliament speech corpus. In INTERSPEECH, pages 3762–3766.

Meister, E., Meister, L., and Metsvahi, R. (2012). New speech corpora at IoC. In XXVII Fonetiikan päivät - Phonetics Symposium, pages 30–33.

Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language model. In AISTATS.

Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., and Vesely, K. (2011). The Kaldi speech recognition toolkit. In ASRU. IEEE Signal Processing Society, December.

Rosti, A., Rämö, A., Saarelainen, T., and Yli-Hietanen, J. (1998). Speechdat Finnish database for the fixed telephone network. Technical report, Tampere University of Technology.

Siivola, V., Hirsimaki, T., and Virpioja, S. (2007). On growing and pruning Kneser-Ney smoothed n-gram models. IEEE Transactions on Audio, Speech, and Language Processing, 15(5):1617–1624.

Smit, P., Gangireddy, S. R., Enarvi, S., Virpioja, S., and Kurimo, M. (2017). Character-based units for unlimited vocabulary continuous speech recognition. In ASRU, pages 149–156.

Smit, P., Virpioja, S., and Kurimo, M. (2018). Advances in subword-based HMM-DNN speech recognition across languages. Technical report, Aalto University.

Zhuang, X., Ghoshal, A., Rosti, A., Paulik, M., and Liu, D. (2017). Improving DNN Bluetooth narrowband acoustic models by cross-bandwidth and cross-lingual initialization. In INTERSPEECH.