UNSUPERVISED PRETRAINING TRANSFERS WELL ACROSS LANGUAGES

Morgane Rivière†, Armand Joulin†, Pierre-Emmanuel Mazaré†, Emmanuel Dupoux†‡∗

†Facebook AI Research, ‡École des Hautes Études en Sciences Sociales

∗Code and data available at https://github.com/facebookresearch/CPC_audio. This is the extended reprint of: Rivière, M., Joulin, A., Mazaré, P.-E. and Dupoux, E. (2020). Unsupervised pretraining transfers well across languages. In ICASSP 2020.
ABSTRACT
Cross-lingual and multi-lingual training of Automatic Speech Recognition (ASR) has been extensively investigated in the supervised setting. This assumes the existence of a parallel corpus of speech and orthographic transcriptions. Recently, contrastive predictive coding (CPC) algorithms have been proposed to pretrain ASR systems with unlabelled data. In this work, we investigate whether unsupervised pretraining transfers well across languages. We show that a slight modification of the CPC pretraining extracts features that transfer well to other languages, on par with or even outperforming supervised pretraining. This shows the potential of unsupervised methods for languages with few linguistic resources.
Index Terms— Unsupervised pretraining, low resources, cross-lingual
1. INTRODUCTION
Learning phoneme representations remains a challenge for a large number of languages with limited supervised resources. A common approach is to pre-train these representations on a large supervised corpus in other languages and transfer them to the low-resource languages [1, 2]. For example, Veselý et al. [3] learn a shared representation on a supervised multilingual dataset and finetune it on the target language. This pre-training works even between distant languages, but requires massive supervised corpora in the same domain.

Recently, several works [4, 5] have proposed promising methods to train monolingual audio representations without supervision. In particular, Schneider et al. [6] show that the unsupervised pre-training method of van den Oord et al. [4] improves the quality of automatic speech recognition (ASR) on several competitive benchmarks. In this paper, we are interested in whether similar unsupervised pre-training methods can be leveraged in a cross-lingual setting to improve the quality of phoneme representations for low-resource languages.

We focus on the contrastive predictive coding (CPC) method of van den Oord et al. [4], since Schneider et al. [6] have shown its benefit for pre-training features for ASR. CPC is a form of forward modeling in the feature space [7]: it predicts the near-future windows in an audio sequence while contrasting with windows from other sequences or windows more distant in time. We introduce several modifications to the original approach that stabilize the training and lead to better phoneme representations. We use our modified CPC model to pre-train phoneme representations in English, namely on Librispeech, and transfer them to several low-resource languages from the Common Voice database.

In this paper, we obtain several results on transferring features pre-trained without supervision across languages. First, pre-trained phoneme representations outperform representations trained from scratch in the target language, even though we do not use any supervision for the pre-training. Surprisingly, we also observe that the gap between unsupervised and supervised pre-training is relatively small if we use the same pre-training corpora. Finally, scaling unsupervised pre-training to larger unlabelled datasets further reduces the gap with the supervised pre-trained features, and even surpasses them in some low-resource languages.
2. RELATED WORK

2.1. Multilingual pre-training for speech recognition
A common way to improve speech recognition in low-resource languages is to train multilingual speech recognition with shared components [8, 9, 2]. For example, Stolcke et al. [9] train features for phoneme classification in a different language. Burget et al. [10] share the parameters of a Gaussian Mixture Model. Closer to our work, several works have shared the parameters of a neural network encoder, using feedforward networks [3, 1, 2] or LSTMs [11]. The model is then finetuned on the target low-resource language to fit its specificities [12]. The sampling of the languages during pre-training can focus on languages related to the targeted language [11]. Another approach is to encourage a language-independent encoder with an adversarial loss [13]. As opposed to our work, this line of research focuses on supervised pre-training, which restricts its impact to domains or languages with large resources for supervision.
2.2. Unsupervised learning of audio representations

Many unsupervised learning approaches have been proposed for speech, and we focus on those based on contrastive learning [7, 14, 15]. In particular, Time Contrastive Learning [5] learns audio features by discriminating between time windows. Our work closely follows van den Oord et al. [4], where a contrastive loss is used to predict forward representations in an audio sequence. Their Contrastive Predictive Coding (CPC) objective function is similar to the objective of word2vec [16], applied to sequences instead of words. Contrastive approaches are also related to exemplar self-supervision [17, 18, 19]. However, CPC has the advantage of making no assumption about the nature or number of the training data samples. Recently, variants of CPC have been applied to monolingual ASR [6] and images [20].
3. APPROACH
In this section, we briefly introduce the approach of van den Oord et al. [4] and refer the reader to the original paper for details. We also present several modifications that improve the resulting representations and stabilize the training. Our code and experiments are publicly available at https://github.com/facebookresearch/CPC_audio.

Unsupervised training of neural networks relies on building a pretext task that requires discriminative features to be solved. The pretext task used in Contrastive Predictive Coding (CPC) [4] is forward modeling, i.e., predicting the future states of a sequence from its past. The particularity of CPC is to frame forward modeling as a reconstruction of future representations, not future inputs. Past and future representations are built from the same model, and a contrastive loss ensures that temporally nearby representations are pushed closer together than temporally distant ones.

More precisely, given an audio sequence split into T discrete time steps, or windows, we embed the input signal x_t at each time step t with an encoder. Then, we form the current phoneme representation z_t by applying a sequence model to the resulting sequence of t embeddings, i.e.,

z_t = \psi_\rho(\phi_\theta(x_1), \dots, \phi_\theta(x_t)),

where \phi_\theta is the encoder and \psi_\rho is the sequence model, parametrized by \theta and \rho respectively. In CPC, the encoder is a 5-layer convolutional network (kernel sizes: 10, 8, 4, 4, 4; stride sizes: 5, 4, 2, 2, 2) and the sequence model is a 1-layer Gated Recurrent Unit (GRU). The encoder thus has a down-sampling factor of 160, meaning that, for a 16kHz input, each feature encodes 10ms of audio.

Given this phoneme embedding z_t, the pretext task in CPC is to predict the next K future representations, i.e., \phi_\theta(x_{t+k}) for k \in \{1, \dots, K\}. CPC also pushes away representations from a random subset \mathcal{N}_t of negative examples, or "distant" windows. Overall, the loss function at a time step t is thus:

L_t = -\frac{1}{K} \sum_{k=1}^{K} \log\left[\frac{\exp\left(\phi_\theta(x_{t+k})^\top A_k z_t\right)}{\sum_{n \in \mathcal{N}_t} \exp\left(\phi_\theta(n)^\top A_k z_t\right)}\right], \quad (1)

where A_k is a linear classifier. There are many ways to sample the "distant" windows, and we follow van den Oord et al. [4] by sampling negatives within speaker. The parameters \theta, \rho and A_1, \dots, A_K are learned with stochastic gradient descent.

We observe empirically that the training of CPC is unstable and can converge to poor solutions. The reason is the presence of batch normalization [21] between the layers of the encoder. Indeed, batch normalization parameters are learned by computing statistics over the whole batch. Since the encoder is shared across a sequence, these parameters leak information between past and future windows. This makes minimizing eq. (1) trivial when batch normalization is activated, resulting in instability. We fix this issue by replacing batch normalization with a channel-wise normalization that plays a similar role of conditioning internal representations. As opposed to batch normalization, its parameters are not shared across the sequence and do not leak information (see Supplementary Section S1.1 for details).
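To make the objective concrete, the following is a minimal PyTorch sketch of the encoder and of eq. (1). It is an illustration, not the reference implementation from the CPC_audio repository: the padding, the number of negatives (n_negatives) and the batch-wide negative sampling are simplifying assumptions (the paper samples negatives within speaker).

```python
# Minimal sketch of the CPC pretext task (eq. 1), assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPCEncoder(nn.Module):
    """5-layer strided conv encoder: 160x down-sampling of 16kHz audio."""
    def __init__(self, dim=256):
        super().__init__()
        kernels, strides = [10, 8, 4, 4, 4], [5, 4, 2, 2, 2]
        layers, in_ch = [], 1
        for k, s in zip(kernels, strides):
            layers += [nn.Conv1d(in_ch, dim, k, stride=s, padding=k // 2),
                       nn.ReLU()]
            in_ch = dim
        self.net = nn.Sequential(*layers)

    def forward(self, x):                     # x: (batch, 1, samples)
        return self.net(x).transpose(1, 2)    # (batch, T, dim)

def cpc_loss(z, c, predictors, n_negatives=128):
    """z: encoder outputs phi(x) of shape (B, T, D); c: sequence-model
    outputs z_t of shape (B, T, D); predictors: K linear maps A_k.
    Negatives are drawn batch-wide here for brevity."""
    B, T, D = z.shape
    K, loss = len(predictors), 0.0
    for k, A in enumerate(predictors, start=1):
        pred = A(c[:, :T - k])                  # A_k z_t, shape (B, T-k, D)
        pos = (pred * z[:, k:]).sum(-1)         # positive logits
        idx = torch.randint(0, B * T, (n_negatives,))
        negs = z.reshape(B * T, D)[idx]         # sampled "distant" windows
        neg = pred @ negs.t()                   # negative logits (B, T-k, N)
        logits = torch.cat([pos.unsqueeze(-1), neg], dim=-1)
        target = torch.zeros(logits.shape[:-1], dtype=torch.long)  # index 0 = positive
        loss = loss + F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                      target.reshape(-1))
    return loss / K
```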
The prediction of future representations is made by linear classifiers on top of a phoneme embedding, as shown in eq. (1). The motivation is to encourage the phoneme embeddings to encode linearly separable phonemes. However, the future representations are not phoneme representations themselves; they are embeddings of the time window. Comparing the outputs of a sequence model and an encoder with a linear classifier may not result in linearly separable phoneme representations. Several alternatives are possible, such as adding a sequence model on the future representations. In practice, we find that replacing each linear classifier with a 1-layer Transformer network [22] works well (see Supplementary Section S1.2 for details). This layer accesses the entire sequence z_1, ..., z_t to predict a particular \phi_\theta(x_{t+k}). We also observe that reducing the dimension of the convolutional layers from 512 to 256 does not impact performance while reducing the memory footprint. Finally, using an LSTM instead of a GRU slightly improves the performance.

In this work, we evaluate the quality of phoneme representations trained with no supervision when transferred across languages. Standard cross-lingual approaches finetune their pre-trained network on the targeted language. While this improves the quality of the resulting representations, it does not assess the quality of the pre-trained representations. Instead, we freeze the model after the pre-training and simply learn a linear classifier for the targeted language. Specifically, we perform the linear classification on a concatenation of windows to match the average size of a phoneme. We then use the CTC loss between our model predictions and the non-aligned phoneme transcriptions [23]. This procedure explicitly measures the linear separability of the phoneme representations, once transferred to a target language.
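A sketch of this probing protocol, assuming PyTorch: the pre-trained model is frozen and only a linear classifier is trained with a CTC loss on non-aligned phoneme strings. The number of concatenated windows (n_concat) and the phoneme inventory size (n_phones) are placeholders, not values taken from the paper.

```python
# Sketch of the frozen-feature linear probe trained with CTC.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearPhonemeProbe(nn.Module):
    def __init__(self, pretrained, feat_dim=256, n_phones=44, n_concat=4):
        super().__init__()
        self.pretrained = pretrained
        for p in self.pretrained.parameters():   # freeze: only the probe trains
            p.requires_grad = False
        self.n_concat = n_concat
        self.classifier = nn.Linear(feat_dim * n_concat, n_phones + 1)  # +1 CTC blank

    def forward(self, audio):
        with torch.no_grad():
            z = self.pretrained(audio)           # (B, T, D), frozen features
        B, T, D = z.shape
        T = (T // self.n_concat) * self.n_concat
        # concatenate consecutive windows to roughly match a phoneme's duration
        z = z[:, :T].reshape(B, T // self.n_concat, D * self.n_concat)
        return self.classifier(z)                # (B, T', n_phones + 1)

def probe_step(model, audio, phones, phone_lens, optimizer):
    """phones: padded LongTensor of phoneme indices, no alignment needed."""
    logits = model(audio)
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)   # (T', B, C)
    in_lens = torch.full((audio.size(0),), log_probs.size(0), dtype=torch.long)
    loss = F.ctc_loss(log_probs, phones, in_lens, phone_lens,
                      blank=logits.size(-1) - 1)
    loss.backward(); optimizer.step(); optimizer.zero_grad()
    return loss.item()
```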
4. EXPERIMENTAL SETTING
We pre-train models on the English Librispeech dataset (LS). We consider both the 100h and 360h splits of clean data. For the supervised pre-training model, we use the aligned phone labels provided by [4] for Librispeech-100h.

After the pre-training, we freeze the parameters of our models and transfer the features across languages. We consider the Common Voice database (https://voice.mozilla.org) as it comes in many languages. We retrieve the non-aligned phoneme transcription of each audio sample by running the open-source tool phonemizer (https://gitlab.coml.lscp.ens.fr/mbernard/phonemizer) on the corresponding text scripts. We split our dataset into train, validation and test sets along speakers, to reduce the influence of speakers on the performance of phoneme predictions. We consider two train sets of either 1 or 5 hours. We will open-source our train-test splits along with our code.

Zerospeech2017 is a dataset made to measure the phoneme separability of unsupervised models in different languages. We consider the English, Mandarin and French benchmarks and report the ABX score on them [24]. The ABX score measures the discriminability between phonemes by estimating the probability that speech segments are closer to one another if they encode the same phoneme than if they don't (the distance being the DTW-realigned average frame-wise cosine).
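The ABX measure just described can be sketched as follows. This is a simplified scorer written with NumPy, not the official Zerospeech evaluation code: the distance between two segments is the average frame-wise cosine distance along the best DTW alignment, and the error counts the triples where X (same phoneme as A) fails to be closer to A than to the distractor B.

```python
# Simplified ABX discriminability sketch (lower error is better).
import numpy as np

def frame_cosine(u, v):
    """Pairwise cosine distances between feature sequences (Tu, D), (Tv, D)."""
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return 1.0 - u @ v.T

def dtw_distance(u, v):
    """Average frame-wise cosine distance along the best DTW path."""
    d = frame_cosine(u, v)
    Tu, Tv = d.shape
    cost = np.full((Tu + 1, Tv + 1), np.inf)
    steps = np.zeros((Tu + 1, Tv + 1))
    cost[0, 0] = 0.0
    for i in range(1, Tu + 1):
        for j in range(1, Tv + 1):
            prev = min((cost[i - 1, j], (i - 1, j)),
                       (cost[i, j - 1], (i, j - 1)),
                       (cost[i - 1, j - 1], (i - 1, j - 1)))
            cost[i, j] = d[i - 1, j - 1] + prev[0]
            steps[i, j] = steps[prev[1]] + 1
    return cost[Tu, Tv] / steps[Tu, Tv]

def abx_error(a_set, b_set):
    """a_set, b_set: lists of segments (arrays) of phonemes A and B.
    Fraction of (A, B, X) triples where X is not closer to A than to B."""
    errors, total = 0, 0
    for i, x in enumerate(a_set):
        for a in a_set[:i] + a_set[i + 1:]:
            for b in b_set:
                errors += dtw_distance(x, a) >= dtw_distance(x, b)
                total += 1
    return errors / total
```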
5. RESULTS

5.1. Within-language results
In this set of experiments, we compare the original CPC with our modified version on two within-language tasks: phoneme discriminability on the English Zerospeech2017 dataset, and phoneme linear separability on Librispeech-100h [4]. In Table 1, we compare our ABX score with that of the toplines from the Zerospeech leaderboards. It is interesting to note that CPC does not perform well on this metric, but our modified version is on par with the state of the art. Overall, our modified CPC surpasses the original model on phoneme classification and even matches unsupervised approaches dedicated to phoneme separability. In Table 2, we show that our modifications to CPC lead to an improvement of 3.4 points in phoneme classification compared to the original CPC implementation.
Table 1. Phoneme discriminability within languages. Within- and across-speakers ABX scores for the English Zerospeech2017 test set. We compare CPC and modified CPC trained on Librispeech-360 to the best performing models.

                                  Across   Within
Trained on Zerospeech2017 (45h)
  Supervised topline [25]           6.9      5.3
  Heck et al. [26]                  8.7      6.2
  Chorowski et al. [27]             8.0      5.5
Trained on Librispeech-360
  CPC [4]                          13.0      9.6
  Modified CPC                      8.5      6.5

Table 2. Phone classification within language. Accuracy on the English LibriSpeech-100h dataset for a linear classifier trained on top of frozen features obtained with the original and our modified CPC models.

                       Phone accuracy
Supervised topline          76.3
CPC [4]                     65.5
Modified CPC                68.9
5.2. Across-language results

In a first experiment, we consider the problem of phoneme classification across languages on the Common Voice database. In Table 3, we report the phone error rate (PER) for the linear classifiers trained on top of the phoneme features pretrained with and without supervision. We also compare with a model trained from scratch on the target dataset. The training set of each target dataset is only 1 hour long. The model trained from scratch thus performs poorly. On the other hand, pre-trained features significantly improve the performance in all languages, even without any finetuning. First, on 100 hours of Librispeech, our modified CPC outperforms the original CPC by 5.4 points on average. However, supervised pre-training still performs slightly better (1.3 points) than our unsupervised pre-training on the same corpus. An advantage of unsupervised pre-training is that we can apply it to any larger unannotated dataset. We show the benefits of this by pre-training our modified CPC on 360 hours of unlabelled data from Librispeech and match the performance of the supervised model. This result not only confirms the findings of [6], but it also shows that unsupervised pre-training can match supervised pre-training with enough data (see Supplementary Section S2 with the larger Libri-light dataset [29]).

Table 3. Transfer of pre-trained phoneme features across languages. We pre-train the features on 100h and 360h of Librispeech with supervision ("Supervised") or not ("CPC" and "Modified CPC"). We also include multilingual bottleneck features ("Bottleneck") pre-trained on 1070h from the Babel dataset. We train a linear classifier on the frozen features using 1h of speech from the Common Voice database in different languages. We also report a supervised model trained entirely from scratch on the 1h of speech. We report Phone Error Rate. The languages are: Dutch (du), Spanish (es), French (fr), Italian (it), Kyrgyz (ky), Russian (ru), Swedish (sv), Turkish (tr), Tatar (tt) and Mandarin (zh).

Model            Pretraining  Frozen   du    es    fr    it    ky    ru    sv    tr    tt    zh    Avg
From scratch     -            No      84.7  95.9  95.1  95.0  81.5  97.7  86.1  83.1  72.9  84.3  87.6
Bottleneck [28]  Babel-1070h  Yes     47.9  36.6  48.3  39.0  38.7  45.2  52.6  43.4  42.5  54.3  44.9
Supervised       LS-100h      Yes     42.4  36.4  47.0  40.5  41.0  43.6  47.0  48.5  41.5  56.8  44.5
CPC [4]          LS-100h      Yes     51.5  44.2  54.5  47.0  44.8  49.0  54.0  54.7  48.9  60.1  50.9
Modified CPC     LS-100h      Yes     44.4  38.7  49.3  42.1  40.7  45.2  48.8  49.7  44.0  55.5  45.8
Modified CPC     LS-360h      Yes     42.5  38.0  47.1  40.5  41.2  43.7  47.5  47.3  42.0  55.0  44.5

In a second experiment, we compare the quality of our pre-trained features against other unsupervised methods on Zerospeech2017. In Table 4, we compare, on French and Mandarin, the ABX scores of our approach trained on English Librispeech with those of unsupervised methods trained on these languages. Surprisingly, our English features transferred to other languages are competitive with the toplines of the leaderboard. This result further shows that unsupervised pre-trained features generalize well across languages.

Table 4. Phoneme discriminability of unsupervised features across languages. Across- ("A.") and within-speakers ("W.") ABX scores on French and Mandarin speech for CPC features pre-trained on English. For comparison: the best systems plus the supervised topline of the Zerospeech leaderboard trained within-language.

                                   French         Mandarin
                                  A.     W.      A.     W.
Trained within language
  Supervised topline              9.1    6.8     5.7    4.2
  Heck et al. [26]               11.7    8.7     7.4    7.9
  Chorowski et al. [27]          10.8    7.5    11.2   10.7
Trained on English (Librispeech-360)
  CPC [4]                        18.0   12.3    11.5   10.0
  Modified CPC                   14.6   10.0     9.5    8.9
5.3. Fine-tuning versus frozen features

We also study the impact of fine-tuning the phoneme features instead of freezing them. We use 5 hours of speech in the target languages for this experiment. In Table 5, we compare the difference between frozen features and fine-tuning. As for the experiments on 1h of speech, our approach is on par with supervised pre-training when the features are frozen. We also observe a significant boost in performance for all the pre-training methods when we fine-tune the features. Our approach is still relatively competitive with supervised pre-training, but slightly worse on average.

Table 5. Comparison between frozen and fine-tuned features. PER averaged over five languages (Spanish, French, Italian, Russian and Tatar). The training set for each language contains 5 hours extracted from the Common Voice database.

Model         Pretraining  Frozen  Fine-tuned
From scratch  -              -        38.3
Supervised    LS-100       37.6        –
CPC [4]       LS-100       43.5       33.3
Mod. CPC      LS-100       38.8       31.0
Mod. CPC      LS-360         –         –
6. CONCLUSION
Pre-training in a given language, with or without supervision, can produce features usable across other languages and other domains. Moreover, these features can be matched to a set of phonemes even with extremely low-resource datasets and unaligned labels. They are usable with a very simple linear model and can be trained at low cost. Finally, though supervised pre-training tends to be better than unsupervised pre-training, the gap between them is small and can be greatly reduced by using a larger amount of unlabelled data. We did not attempt to push numbers in order to achieve good phone error rates in the low-resource languages, as we only tested a linear separation layer for phoneme classification. Further work needs to be done to establish how these pretrained features can be best used in the low-resource setting (see [30]), and with other ASR tasks [29].
7. REFERENCES

[1] G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. A. Ranzato, M. Devin, and J. Dean, "Multilingual acoustic models using distributed deep neural networks," in ICASSP, 2013.
[2] J.-T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers," in ICASSP, 2013.
[3] K. Veselý, M. Karafiát, F. Grézl, M. Janda, and E. Egorova, "The language-independent bottleneck features," in SLT, 2012.
[4] A. van den Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv:1807.03748, 2018.
[5] A. Hyvarinen and H. Morioka, "Unsupervised feature extraction by time-contrastive learning and nonlinear ICA," in NIPS, 2016.
[6] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," arXiv:1904.05862, 2019.
[7] S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in CVPR, 2005.
[8] T. Schultz and A. Waibel, "Language-independent and language-adaptive acoustic modeling for speech recognition," Speech Communication, 2001.
[9] A. Stolcke, F. Grezl, M.-Y. Hwang, X. Lei, N. Morgan, and D. Vergyri, "Cross-domain and cross-language portability of acoustic features estimated by multilayer perceptrons," in ICASSP, 2006.
[10] L. Burget, P. Schwarz, M. Agarwal, P. Akyazi, K. Feng, A. Ghoshal, O. Glembek, N. Goel, M. Karafiát, and D. Povey, "Multilingual acoustic modeling for speech recognition based on subspace Gaussian mixture models," in ICASSP, 2010.
[11] X. Li, S. Dalmia, A. W. Black, and F. Metze, "Multilingual speech recognition with corpus relatedness sampling," arXiv:1908.01060, 2019.
[12] S. Dalmia, R. Sanabria, F. Metze, and A. W. Black, "Sequence-based multi-lingual low resource speech recognition," in ICASSP, 2018.
[13] O. Adams, M. Wiesner, S. Watanabe, and D. Yarowsky, "Massively multilingual adversarial speech recognition," arXiv:1904.02210, 2019.
[14] K. Q. Weinberger and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," JMLR, 2009.
[15] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in CVPR, 2015.
[16] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in NIPS, 2013.
[17] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox, "Discriminative unsupervised feature learning with convolutional neural networks," in NIPS, 2014.
[18] P. Bojanowski and A. Joulin, "Unsupervised learning by predicting noise," in ICML, 2017.
[19] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio, "Learning deep representations by mutual information estimation and maximization," arXiv:1808.06670, 2018.
[20] Y. Tian, D. Krishnan, and P. Isola, "Contrastive multiview coding," arXiv:1906.05849, 2019.
[21] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv:1502.03167, 2015.
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in NIPS, 2017.
[23] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in ICML, 2006.
[24] T. Schatz, V. Peddinti, F. Bach, A. Jansen, H. Hermansky, and E. Dupoux, "Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline," INTERSPEECH, 2013.
[25] E. Dunbar, X. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, X. Anguera, and E. Dupoux, "The zero resource speech challenge 2017," arXiv:1712.04313, 2017.
[26] M. Heck, S. Sakti, and S. Nakamura, "Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to ZeroSpeech 2017," in ASRU, 2017.
[27] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, "Unsupervised speech representation learning using WaveNet autoencoders," arXiv:1901.08810, 2019.
[28] R. Fer, P. Matějka, F. Grézl, O. Plchot, K. Veselý, and J. H. Černocký, "Multilingually trained bottleneck features in spoken language recognition," Computer Speech & Language, vol. 46, pp. 252–267, 2017.
[29] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P.-E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux, "Libri-light: A benchmark for ASR with limited or no supervision," in INTERSPEECH, 2020.
[30] K. Kawakami, L. Wang, C. Dyer, P. Blunsom, and A. van den Oord, "Learning robust and multilingual speech representations," 2020.
S1. SUPPLEMENTARY METHODS
We describe here ablation experiments comparing our reimplementation of the original CPC model [4] with the improvements we made to this model.
S1.1. Changing the normalization method
In order to make the training more stable, we replaced the batch normalization in the original model with layer normalization. The results are shown in Table S1.

Table S1. Impact of the normalization method on phoneme discriminability. Within- and across-speakers ABX scores for the English Zerospeech2017 test set.

                            Across   Within
Trained on Librispeech-100
  CPC [4]                    13.0      9.6
  CPC + Layer norm (LN)      12.0      8.7
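A minimal sketch of the swap, assuming PyTorch. Batch normalization pools statistics over the whole batch (and hence across the windows of each sequence), while this channel-wise variant normalizes each time step independently over its channels. It is one plausible instantiation; the exact normalization in the released code may differ in details.

```python
# Channel-wise normalization: per-time-step statistics, no cross-window leak.
import torch
import torch.nn as nn

class ChannelNorm(nn.Module):
    """Normalize each time step over its channels, with a learned affine."""
    def __init__(self, n_channels, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, n_channels, 1))
        self.beta = nn.Parameter(torch.zeros(1, n_channels, 1))
        self.eps = eps

    def forward(self, x):               # x: (batch, channels, time)
        mean = x.mean(dim=1, keepdim=True)
        var = x.var(dim=1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

# e.g. replace nn.BatchNorm1d(dim) with ChannelNorm(dim) in the encoder.
```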
S1.2. Choosing the right predictor design
We compared several alternatives to the linear prediction model initially presented in [4]. We hypothesized that if the prediction network is too simple, then the auto-regressive network performs a significant part of the prediction task; a more complex architecture should therefore improve the quality of the output features. The results of our experiments are compiled in Table S2.

Table S2. Phoneme discriminability for various predictor designs. Within- and across-speakers ABX scores for the English Zerospeech2017 test set.

                                      Across   Within
Trained on Librispeech-100
  CPC + LN                             12.0     8.7
  CPC + LN + Conv8                     13.4     9.2
  CPC + LN + FFD                       11.7     8.56
  CPC + LN + transformer                9.5     7.3
  CPC + LN + transformer + dropout       –       –
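A sketch of the transformer predictor variant, assuming PyTorch. The number of prediction heads K, the dimensions, and the dropout rate are assumptions; the causal mask restricts each position t to the sequence z_1..z_t, as described in Section 3.

```python
# 1-layer transformer predictor replacing the K linear classifiers A_k.
import torch
import torch.nn as nn

class TransformerPredictor(nn.Module):
    """Shared 1-layer transformer over the context sequence, followed by
    one output head per prediction offset k = 1..K."""
    def __init__(self, dim=256, n_heads=8, K=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           dim_feedforward=4 * dim, dropout=0.1)
        self.transformer = nn.TransformerEncoder(layer, num_layers=1)
        self.heads = nn.ModuleList(nn.Linear(dim, dim) for _ in range(K))

    def forward(self, c):                 # c: (T, B, D) sequence-model outputs
        T = c.size(0)
        # causal mask: position t only attends to z_1..z_t
        mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        h = self.transformer(c, mask=mask)
        return [head(h) for head in self.heads]   # K tensors, each (T, B, D)
```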
S2. SUPPLEMENTARY RESULTS
Here, we present results for the CPC features trained on the recently released Libri-light 60k-hour dataset [29]. As seen in Table S3, we now beat both the Bottleneck and the Supervised features on all languages except one. The comparison between the Bottleneck and CPC features is displayed in Figure S1.

Table S3. Transfer of pre-trained phoneme features across languages. Phone Error Rate on linear classification of phonemes based on features pre-trained on 60kh of Libri-light, compared to multilingual bottleneck features ("Bottleneck") trained on 1070h from the Babel dataset and a supervised baseline trained on LibriSpeech 100h clean. The linear classifier is trained on the frozen features using 1h of speech from the Common Voice database in different languages. We report Phone Error Rate. The languages are: Dutch (du), Spanish (es), French (fr), Italian (it), Kyrgyz (ky), Russian (ru), Swedish (sv), Turkish (tr), Tatar (tt) and Mandarin (zh).

Model            Pretraining       Frozen   du    es    fr    it    ky    ru    sv    tr    tt    zh    Avg
Bottleneck [28]  Babel-1070h       Yes     47.9  36.6  48.3  39.0  38.7  45.2  52.6  43.4  42.5  54.3  44.9
Supervised       LS-100h           Yes     42.4  36.4  47.0  40.5  41.0  43.6  47.0  48.5  41.5  56.8  44.5
Modified CPC     Libri-light-60kh  Yes      –     –     –     –     –     –     –     –     –     –     –

[Figure S1. Per-language Phone Error Rate of the CPC-60k features versus the Bottleneck features.]