Libri-Light: A Benchmark for ASR with Limited or No Supervision
J. Kahn*†, M. Rivière*†, W. Zheng*†, E. Kharitonov*†, Q. Xu*†, P.-E. Mazaré*†, J. Karadayi*‡, V. Liptchinsky†, R. Collobert†, C. Fuegen†, T. Likhomanenko†, G. Synnaeve†, A. Joulin†, A. Mohamed†, E. Dupoux†‡

*Contributed equally, in random order. †Facebook AI, ‡EHESS, ENS, PSL-University, CNRS, INRIA
ABSTRACT
We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.
Index Terms: unsupervised and semi-supervised learning, distant supervision, dataset, zero- and low-resource ASR.
1. INTRODUCTION
Automatic Speech Recognition (ASR) has made striking progress in recent years with the deployment of increasingly large deep neural networks trained on increasingly large sets of annotated speech (from thousands to tens of thousands of hours). This approach hits diminishing returns as the costs of annotating even larger datasets become prohibitive. It is also difficult to scale beyond a handful of high-resource languages and address the needs of a long tail of low-resource languages, dialectal and idiolectal variants, accents, and registers. As such, there has been a recent surge of interest in weakly supervised solutions that use datasets with fewer human annotations. In the semi-supervised setting, only a fraction of the dataset is labelled and the rest is unlabelled [1, 2], while in a distant supervision setting, the dataset is mostly or entirely unlabelled, but large quantities of unaligned text provide a language model corpus [3, 4]. Other approaches have addressed pretraining with labels from other languages [5, 6] or pretraining using unsupervised objectives [7, 8]. At the extreme of this continuum, zero resource ASR discovers its own units from raw speech [9, 10, 11]. Despite many interesting results, the field lacks a common benchmark (datasets, evaluations, or baselines) for comparing ideas and results across these settings. Here, we introduce Libri-light, a large open-source corpus (60K hours) of unlabelled speech and a common set of metrics to evaluate three settings: (1) the zero-resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER and CER), and (3) the distant supervision setting (WER). The last two settings use a limited-resource training set (10 min, 1h, 10h), and the last one uses large in-domain and out-of-domain text corpora to train language models. The test sets are identical to LibriSpeech [12] so as to facilitate comparison of weakly supervised results with the state-of-the-art in supervised learning. We also provide baseline systems for these three settings. All datasets, metrics and baseline systems are open source (https://github.com/facebookresearch/libri-light).
2. RELATED WORK
The release of open-source software and datasets has facilitated rapid progress in machine learning and in particular large-vocabulary ASR. LibriSpeech is one of the first large-scale open-source datasets and contains over 1000 hours of audio books, together with textual annotations aligned at the sentence level. Mozilla's CommonVoice project (https://voice.mozilla.org) has facilitated data collection across several languages and currently contains 2900 hours of read speech in 37 languages. A. Black at CMU has compiled the Wilderness dataset, which consists of the text of the Bible read in 750 languages [13]. Other open-source resources are available from OpenSLR (http://openslr.org/). The Zero Resource Challenge (https://zerospeech.com) has released a series of datasets and metrics for the unsupervised setting [9, 10], but the datasets are generally small (between 2.5 and 50 h). In this work, we substantially expand dataset size and use the same evaluation metrics (ABX [14]) for comparability. The IARPA Babel program [15] has also initiated a push towards limited supervision for less-studied languages. In its most difficult track, the dataset contains only 10 hours of transcribed speech in conjunction with larger amounts of untranscribed audio. Here, we retain 10 hours as an upper bound, and add lower-resource sets containing 1 hour and 10 minutes of labeled audio. Distant supervision has been the focus of two JHU-JSALT workshops (2016 [16], 2019 [17]), but no benchmark has yet emerged.
3. DATASET AND METRICS

3.1. Dataset
Fig. 1: Corpus statistics. (a) Duration in hours per speaker. (b) Durations for the 25 most frequent genres. [Figure not reproduced; the legend of panel (b) lists per-genre totals in hours: Literature 42,760; Science, Craft & Essay 8,978; Undefined 5,510; Religion 5,373; Poetry 3,462; Theater 591; Ancient 331.]

The dataset is composed of four parts: a train set with unlabelled speech, a train set with limited labels, dev/test sets, and a train set containing unaligned text; see Table 1.
Unlabelled Speech Training Set. This dataset was obtained by extracting audio files of English speech from the LibriVox repository (https://librivox.org) of open-source audio books. Files were downloaded and converted to 16 kHz FLAC. We then removed corrupted files, files with unknown or multiple speakers, and speakers appearing in the LibriSpeech dev and test sets. Potentially duplicated versions of books, identified by title, were set aside (and distributed as a duplicate subset totalling 4500 hours). We then ran a Voice Activity Detection (VAD) model built with the wav2letter++ framework [18] on the recordings to tag onsets and offsets of speech segments.

Unlabelled Speech Training Set
subset       hours     books   files    per-spk hours   total spkrs
unlab-60k    57706.4   9860    219041   7.84            7439
unlab-6k     5770.7    1106    21327    3.31            1742
unlab-600    577.2     202     2588     1.18            489

Limited Resource Training Set
subset       hours    per-spk minutes   female spkrs   male spkrs   total spkrs
train-10h    10       25                12             12           24
train-1h     1        2.5               12             12           24
train-10m*   10 min   2.5               2              2            4

Dev & Test Sets (from LibriSpeech)
subset       hours    per-spk minutes   female spkrs   male spkrs   total spkrs
dev-clean    5.4      8                 20             20           40
dev-other    5.3      10                16             17           33
test-clean   5.4      8                 20             20           40
test-other   5.1      10                17             16           33

Unaligned Text Training Set
subset                       tokens   vocab
librispeech-LM (in-domain)   800M     200K
Table 1: Dataset statistics in Libri-light. *Six different versions of the 10 min dataset have been constructed; the union of these small datasets makes up the 1h dataset.

The VAD segments were used to derive an average SNR for each file. For each file, we constructed JSON metadata including title, unique speaker ID, SNR, genre, and a list of valid VAD blocks (blocks of more than 500 ms of speech indicated by onsets and offsets). We created three dataset splits of different sizes (unlab-60k, unlab-6k and unlab-600), matched in genre distribution (the smaller cuts are included in the larger ones; see the stats in Table 1). The distributions by speaker and genre are shown in Figure 1. The total amount of speech in the dataset is 62.2K hours, including the duplicate files.
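As an illustration of how the per-file JSON metadata can be consumed, here is a minimal Python sketch that loads one record and keeps its usable speech segments. The field names (`snr`, `voice_activity`) and the one-JSON-per-file layout are assumptions for illustration; the released repository defines the actual schema.

```python
import json

def load_speech_segments(json_path, min_snr=0.0, min_dur=0.5):
    """Load one Libri-light style metadata record and return the VAD
    segments (onset, offset) longer than min_dur seconds, provided the
    file-level SNR exceeds min_snr. Field names are illustrative."""
    with open(json_path) as f:
        meta = json.load(f)
    if meta["snr"] < min_snr:                        # file-level average SNR tag
        return []
    segments = []
    for onset, offset in meta["voice_activity"]:     # list of VAD blocks (seconds)
        if offset - onset >= min_dur:                # keep blocks of > 500 ms
            segments.append((onset, offset))
    return segments
```

A filter like this also makes it easy to exploit the SNR tags for data selection, as suggested in the conclusion.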
Limited-resource Training Set . For training with lim-ited supervision, we selected three subsets of the LibriSpeechtraining set: a 10 hour set, a 1 hour set, and six 10-minutesets (the six 10-minute sets together make up the 1h set, andthe 1h set is included in the 10h set). In each set, half ofthe utterances are from the clean and other training sets, re-spectively. We additionally provide orthographic transcrip-tions from LibriSpeech and phonetic transcriptions generatedfrom phonemizer . Dev and Test Set . The dev and test sets are the sameas that of LibriSpeech (5.4 hours for dev-clean, 5.3 hoursfor dev-other, 5.4 hours for test-clean, and 5.1 hours for test-other) and are intended for testing and tuning. All dev or testset audio has been removed from training sets. The ground- https://gitlab.coml.lscp.ens.fr/mbernard/phonemizer BX within speaker ABX across speakerSystem dev-clean dev-other test-clean test-other dev-clean dev-other test-clean test-otherMFCC Baseline 10.95 13.55 10.58 13.60 20.94 29.41 20.45 28.5CPC unlab-600 7.36 9.39 6.90 9.59 9.58 14.67 9.00 15.1CPC unlab-6k 6.51 8.42 6.22 8.55 8.48 13.39 8.05 13.81CPC unlab-60k : ABX errors on unsupervised CPC trained features.
Within- and across-speaker phoneme discriminability scores(lower is better) on the LibriSpeech dev and test sets as a function of varying quantities of unlabelled speech.truth phonetic sequences for the dev and test sets were gener-ated as above; in addition, for ABX evaluation, forced align-ment was obtained using a model trained on LibriSpeech.
Unaligned Text Training Set. For training a language model in the distant supervision setting, we consider the LM corpus provided with LibriSpeech (https://openslr.org/11/), which contains 800M tokens with a vocabulary size of 200k, drawn from 14.5k public books from Project Gutenberg. This corpus only partially overlaps with the content of our unlabelled training set and can thus be considered in-domain. Several options exist for publicly available out-of-domain corpora (wikitext-103, 1 Billion Word, etc.).

3.2. Metrics

We provide 3 sets of metrics for the unsupervised, semi-supervised, and distantly-supervised settings.

For the unsupervised setting, the aim is to extract speech representations (discrete or continuous) which encode the phonetic content of the language while ignoring irrelevant information (channel, speaker, etc.). The representation is evaluated using ABX error, a distance-based metric used in previous zero resource challenges [9, 10, 11]. For a given pair of sounds (for instance, "bit" versus "bet"), we compute the probability that a random token of "bit" (X) is closer to another token of "bit" (A) than to a token of "bet" (B). The ABX error rate is obtained by averaging across all such minimal pairs of phone trigrams in the corpus. For the "within-speaker" score, A, B and X are from the same speaker; in the "across-speaker" score, A and B are from the same speaker, but X is from a different speaker (see [19]). A simplified sketch of this computation is given after Table 3.

For the semi-supervised setting, we evaluate the quality of learned acoustic representations with little annotated data. Models can be trained with either character or phonetic targets using limited data, and are measured by Character Error Rate (CER) or Phoneme Error Rate (PER) respectively.

For distant supervision, we evaluate how the learned representations can be used to decode speech at the word level using a pre-trained language model. We use Word Error Rate (WER) for the evaluation. Because the dev and test sets are from LibriSpeech, this allows distant supervision to be compared directly with SOTA supervised models. More details on the dataset and metrics are given in Supplementary Section S1.

System                       dev-clean  dev-other  test-clean  test-other
no pretraining + train-10h   45.9       55.7       43.7        58.6
CPC unlab-60k + train-10m    40.1       51.5       39.4        53.3
CPC unlab-60k + train-1h     32.2       44.6       31.6        46.8
CPC unlab-60k + train-10h

Table 3: PER/CER in the semi-supervised setting. A pre-trained CPC system plus a linear classifier trained on limited amounts of labels, compared to the same system trained from scratch (PER).
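The following is a toy sketch of the ABX computation described in Section 3.2, assuming each token is summarized by a single fixed-size vector; the benchmark's actual implementation compares frame sequences with a time-warping distance, so this is a simplification for exposition only.

```python
import itertools
import numpy as np

def abx_error(tokens_x, tokens_y, dist=lambda a, b: np.linalg.norm(a - b)):
    """Simplified ABX error between category x (e.g. "bit") and category y
    (e.g. "bet"). tokens_x, tokens_y: lists of numpy vectors, one per token.
    An error is counted whenever X is closer to B than to A (ties count half)."""
    errors, count = 0.0, 0
    for a, c in itertools.permutations(range(len(tokens_x)), 2):  # A and X from x
        for b in range(len(tokens_y)):                            # B from y
            d_ax = dist(tokens_x[a], tokens_x[c])
            d_bx = dist(tokens_y[b], tokens_x[c])
            errors += 1.0 if d_bx < d_ax else (0.5 if d_bx == d_ax else 0.0)
            count += 1
    return errors / count
```

In the full metric, this score is averaged over all minimal pairs of phone trigrams (and symmetrized over the two categories), with the within/across-speaker variants constraining which speakers A, B and X come from.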
4. BASELINE SYSTEMS
In the unsupervised setting, we use a PyTorch implementation of the Contrastive Predictive Coding (CPC) system [7], trained to predict the hidden states of N future speech frames and containing an encoder, a sequence model, and a predictor. The encoder maps waveforms to hidden states (one 512-dimensional embedding every 10 ms) using a stack of 5 convolutional layers. The sequence model encodes the hidden states into a 512-dimensional phonetic embedding with one layer of Gated Recurrent Units (GRUs). The predictor maps the last phonetic embedding onto a future hidden state using a linear projection (one linear projection per time step, varying from 1 to 12). To avoid collapsing to a trivial solution, the model is trained discriminatively; the loss function aims at decreasing the dot product between predicted and actual future frames while increasing it for frames belonging to negative sequences (distant time windows). We used a reimplementation of the original paper, which we modified to increase stability and performance, as we could not reproduce the original paper's results from the provided descriptions. These changes are as follows: we replaced batch-norm with channel-wise normalization, reduced the hidden and phonetic embeddings to 256 dimensions, used an LSTM instead of a GRU, and used a 1-layer transformer network instead of a linear projection. The original paper obtained 65.5% accuracy on phoneme classification with a linear classifier trained on top of the frozen system's phonetic embedding. Our modified system obtained 68.9% accuracy, while using 4 times fewer parameters in the encoder+sequence model part of the system. We trained it on the three cuts (unlab-600, unlab-6k and unlab-60k).
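To make the architecture concrete, here is a minimal PyTorch sketch of a CPC-style model in the spirit of the modified baseline above (channel-normalized convolutional encoder, LSTM sequence model, per-step predictors trained with an InfoNCE-style contrastive loss). It is a simplification, not the released implementation: negatives are drawn from other time steps of the same sequence, the predictors are linear rather than a 1-layer transformer, and GroupNorm stands in for the paper's channel-wise normalization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPCEncoder(nn.Module):
    """5-layer conv encoder: 16 kHz waveform -> one 256-d frame every 10 ms."""
    def __init__(self, dim=256):
        super().__init__()
        layers, in_ch = [], 1
        for stride, width in zip([5, 4, 2, 2, 2], [10, 8, 4, 4, 4]):
            layers += [nn.Conv1d(in_ch, dim, width, stride=stride, padding=width // 2),
                       nn.GroupNorm(1, dim),  # stands in for channel-wise normalization
                       nn.ReLU()]
            in_ch = dim
        self.net = nn.Sequential(*layers)

    def forward(self, wav):                    # wav: (batch, 1, samples)
        return self.net(wav).transpose(1, 2)   # (batch, frames, dim)

class CPCModel(nn.Module):
    def __init__(self, dim=256, n_steps=12):
        super().__init__()
        self.encoder = CPCEncoder(dim)
        self.context = nn.LSTM(dim, dim, batch_first=True)
        # One predictor per future step 1..n_steps (linear here for brevity).
        self.predictors = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_steps))

    def forward(self, wav):
        z = self.encoder(wav)    # targets x_t
        c, _ = self.context(z)   # phonetic embeddings z_t
        return z, c

def info_nce_loss(z, c, predictors):
    """Contrastive loss: each predicted future frame must match its true
    target better than the other time positions of the same sequence,
    which serve as negatives (a simplification of the paper's sampling)."""
    loss = 0.0
    for k, pred in enumerate(predictors, start=1):
        p = pred(c[:, :-k])                          # predictions for step t+k
        t = z[:, k:]                                 # true frames at t+k
        logits = torch.einsum("btd,bsd->bts", p, t)  # scores vs. all positions
        labels = torch.arange(p.size(1), device=z.device).expand(p.size(0), -1)
        loss = loss + F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    return loss / len(predictors)
```

Training then amounts to minimizing `info_nce_loss(*model(wav), model.predictors)` over batches of raw audio; downstream tasks use the output of the recurrent layer as features.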
System                               dev-clean  dev-other  test-clean  test-other

Supervised systems (LibriSpeech 1000 h)
Gated Cnv + 4gram LM [20]            4.6        13.8       4.8         14.5
Hybrid + seq. disc. + 4gram LM [21]  3.4        8.3        3.8         8.8

CPC pretrain + CTC fine-tuning + 4gram LM
CPC unlab-600 + train-10m            97.3       97.6       97.1        97.7
CPC unlab-600 + train-1h             72.2       84.5       70.1        86.3
CPC unlab-600 + train-10h            52.5       71.6       49.3        74.1
CPC unlab-6k + train-10m             93.6       95.2       93.2        94.9
CPC unlab-6k + train-1h              67.5       81.3       65.4        82.0
CPC unlab-6k + train-10h             46.4

MFSC + TDS + CTC + Grapheme + 4gram LM
train-1h                             79.4       88.1       78.4        88.0
  + 60k pseudo-label                 78.6       86.5       77.2        86.3
train-10h                            34.0       60.9       33.5        62.1
  + 60k pseudo-label                 30.5       55.8       30.1        57.2

MFSC + TDS + CTC + Phoneme + 4gram LM
train-1h                             81.1       88.5       80.2        88.7
  + 60k pseudo-label                 84.3       90.0       84.0        90.5
train-10h                            44.1       64.2       43.8        65.1
  + 60k pseudo-label

Table 4: WER in the distant supervision setting.
Top: state-of-the-art supervised systems using our 4-gram LMs. Middle: a CPC system trained with unlabelled speech, fine-tuned with limited data and integrated with a 4-gram word language model (LibriSpeech-LM). Bottom: a small MFSC TDS system trained on limited labeled data (graphemes or phonemes); pseudo-labels for the 60k corpus, segmented into 36-second chunks, are generated and used to retrain a larger TDS system.

In the semi-supervised setting, we use our baseline pre-trained CPC system supplemented with a linear classifier trained with CTC loss on the limited-resource set's phone labels (only the linear layer is fine-tuned). We also provide a from-scratch control with the same architecture trained end-to-end.

For the distant supervision setting, we run two experiments: (1) we use our pretrained CPC system with an improved CTC layer (LSTM), which we fine-tune with orthographic labels from the limited-resource set. We decode with a Python-wrapped version of the wav2letter++ decoder [18], using a 4-gram KenLM [22] language model trained on the unaligned text set. (2) We use CTC to train small Mel-filterbank-based TDS systems [23] (7 TDS blocks, 20M parameters, total stride 2) on phonemes and graphemes respectively. We create pseudo-labels by beam-search decoding the 60k hours of unlabelled data with a 4-gram KenLM decoder trained on LibriSpeech-LM. These labels are used to train larger TDS systems (11 TDS blocks, 37M parameters) from scratch, which generate WERs when decoding with the same LM. More details on the baselines are given in Supplementary Section S2.
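As a concrete illustration of what the CTC-trained acoustic models output, here is a minimal greedy CTC decoder. The actual baselines use the wav2letter++ beam-search decoder with a 4-gram KenLM; greedy decoding is shown only because the collapse rule (merge repeats, then drop blanks) fits in a few lines.

```python
import torch

def ctc_greedy_decode(log_probs, blank=0):
    """Collapse a frame-level CTC posterior into a label sequence:
    take the argmax per frame, merge consecutive repeats, drop blanks.
    log_probs: (frames, n_labels) tensor of log-posteriors."""
    best = log_probs.argmax(dim=-1).tolist()
    out, prev = [], blank
    for p in best:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

# Example: 6 frames over labels {0: blank, 1: 'a', 2: 'b'}
frames = torch.log(torch.tensor([
    [0.1, 0.8, 0.1],   # 'a'
    [0.1, 0.8, 0.1],   # 'a' (repeat, merged)
    [0.8, 0.1, 0.1],   # blank
    [0.1, 0.1, 0.8],   # 'b'
    [0.8, 0.1, 0.1],   # blank
    [0.1, 0.8, 0.1],   # 'a'
]))
print(ctc_greedy_decode(frames))  # [1, 2, 1] -> "aba"
```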
5. RESULTS
The results for the unsupervised setting are shown in Table 2. CPC constructs embeddings with good ABX scores compared to an MFCC baseline, in the same range as the SOTA in the Zero Resource Speech Challenge 2017 for English (6.2% within and 8.7% across speakers [24]). The results in the semi-supervised setting (Table 3) show gains in PER from unsupervised pretraining for several different amounts of fine-tuning. The results in the distant supervision setting (Table 4), while far from the supervised state-of-the-art, show that increasing the amount of unsupervised pretraining helps. Pseudo-labels are beneficial, but only if the generating and fine-tuned models are initially good (Tables 3 and 4).
6. CONCLUSION
We have introduced a new large dataset for benchmarking ASR systems trained with limited or no supervision. We found that unsupervised training with increasingly larger datasets yields better features and can significantly boost the performance of systems trained with limited amounts of labels (from 10 min to 10 hours), both for a phoneme recognition task in a semi-supervised setting and for a word recognition task in a distant-supervision setting. The baselines were not particularly optimized for the tasks and are provided only as a proof-of-concept; there is a significant margin with fully-supervised systems. Obvious improvements include using larger models, speaker-adversarial losses, fine-tuning the entire system (not just the top layers), and pseudo-label retraining in all settings. Active learning [25] could further select useful parts of the dataset (we have provided SNR data to facilitate this effort). Yet another approach might apply language modeling techniques directly on unlabelled audio to improve the representations before fine-tuning them [26, 27].
7. REFERENCES

[1] G. Tur, D. Hakkani-Tür, and R.E. Schapire, "Combining active and semi-supervised learning for spoken language understanding," Speech Communication, vol. 45, no. 2, pp. 171–186, 2005.
[2] J. Kahn, A. Lee, and A. Hannun, "Self-training for end-to-end speech recognition," arXiv preprint arXiv:1909.09116, 2019.
[3] Y.-C. Chen, C.-H. Shen, S.-F. Huang, H.-y. Lee, and L.-s. Lee, "Almost-unsupervised speech recognition with close-to-zero resource based on phonetic structures learned from very small unpaired speech and text data," arXiv preprint arXiv:1810.12566, 2018.
[4] Y.-A. Chung, W.-H. Weng, S. Tong, and J. Glass, "Unsupervised cross-modal alignment of speech and text embedding spaces," in Advances in Neural Information Processing Systems, 2018, pp. 7354–7364.
[5] J.-T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers," in ICASSP. IEEE, 2013, pp. 7304–7308.
[6] K. Veselý, M. Karafiát, F. Grézl, M. Janda, and E. Egorova, "The language-independent bottleneck features," in Spoken Language Technology Workshop (SLT). IEEE, 2012, pp. 336–341.
[7] A. van den Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," CoRR, vol. abs/1807.03748, 2018.
[8] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," arXiv preprint arXiv:1904.05862, 2019.
[9] M. Versteegh, X. Anguera, A. Jansen, and E. Dupoux, "The zero resource speech challenge 2015: Proposed approaches and results," in SLTU – Procedia Computer Science, 2016, vol. 81, pp. 67–72.
[10] E. Dunbar, X.-N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, X. Anguera, and E. Dupoux, "The zero resource speech challenge 2017," in ASRU, 2017.
[11] E. Dunbar, R. Algayres, J. Karadayi, M. Bernard, J. Benjumea, X.-N. Cao, L. Miskic, C. Dugrain, L. Ondel, A. Black, L. Besacier, S. Sakriani, and E. Dupoux, "The zero resource speech challenge 2019: TTS without T," in INTERSPEECH, 2019.
[12] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in ICASSP. IEEE, 2015, pp. 5206–5210.
[13] A.W. Black, "CMU Wilderness multilingual speech dataset," in ICASSP. IEEE, 2019, pp. 5971–5975.
[14] T. Schatz, V. Peddinti, F. Bach, A. Jansen, H. Hermansky, and E. Dupoux, "Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline," in INTERSPEECH, 2013.
[15] M. Harper, "IARPA Babel program," 2014.
[16] C. Liu, J. Yang, M. Sun, S. Kesiraju, A. Rott, L. Ondel, P. Ghahremani, N. Dehak, L. Burget, and S. Khudanpur, "An empirical evaluation of zero resource acoustic unit discovery," in ICASSP. IEEE, 2017, pp. 5305–5309.
[17] J. Chorowski, "Distant supervision for representation learning in speech and handwriting," 2014.
[18] V. Pratap, A. Hannun, Q. Xu, J. Cai, J. Kahn, G. Synnaeve, V. Liptchinsky, and R. Collobert, "Wav2letter++: A fast open-source speech recognition system," in ICASSP. IEEE, 2019, pp. 6460–6464.
[19] T. Schatz, ABX-discriminability measures and applications, Ph.D. thesis, 2016.
[20] V. Liptchinsky, G. Synnaeve, and R. Collobert, "Letter-based speech recognition with gated convnets," arXiv preprint arXiv:1712.09444, 2017.
[21] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney, "RWTH ASR systems for LibriSpeech: Hybrid vs attention - w/o data augmentation," arXiv preprint arXiv:1905.03072, 2019.
[22] K. Heafield, "KenLM: Faster and smaller language model queries," in Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011, pp. 187–197.
[23] A. Hannun, A. Lee, Q. Xu, and R. Collobert, "Sequence-to-sequence speech recognition with time-depth separable convolutions," CoRR, vol. abs/1904.02619, 2019.
[24] M. Heck, S. Sakti, and S. Nakamura, "Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to ZeroSpeech 2017," in Automatic Speech Recognition and Understanding Workshop. IEEE, 2017, pp. 740–746.
[25] D. Hakkani-Tür, G. Riccardi, and A. Gorin, "Active learning for automatic speech recognition," in ICASSP. IEEE, 2002, vol. 4, pp. IV-3904.
[26] Y.-A. Chung and J. Glass, "Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech," arXiv preprint arXiv:1803.08976, 2018.
[27] Anonymous, "vq-wav2vec: Self-supervised learning of discrete speech representations," in Submitted to International Conference on Learning Representations, 2020, under review.
[28] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.
S1. SUPPLEMENTARY DATASETS AND METRICS

S1.1. Datasets and meta-data
The dataset is constructed according to the following pipeline: data download, exclusion of bad files, conversion to FLAC, extraction of VAD, SNR, and perplexity, construction of JSON files, and the split into three cuts of varying sizes.
S1.1.1. VAD
Voice Activity Detection is performed using a TDS acoustic model [23] trained with CTC loss [28] on the LibriSpeech dataset using the orthographic transcriptions. The trained model was used to perform inference (greedy frame-by-frame decoding) with the wav2letter++ [18] audio analysis pipeline on the unlabeled audio, mapping all letters to SPEECH and the silence symbol to NONSPEECH. The posterior SPEECH probability is added as metadata to the JSON of each file.

The TDS model used for VAD has 100 million parameters and consists of clusters of 2, 3, 4, and 5 TDS blocks separated by 2D convolutions. The duration of audio covered by each label depends on the stride of the underlying acoustic model; the model used has a stride of 8. The model is trained with word-pieces using the recipe outlined in [23].
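The step from per-frame speech posteriors to the VAD blocks stored in the JSON files can be sketched as below. The 0.5 decision threshold is an assumption (the paper does not state the rule); the 80 ms frame duration and the 500 ms minimum block length come from the text.

```python
import numpy as np

def vad_segments(speech_posterior, frame_dur=0.08, threshold=0.5, min_dur=0.5):
    """Turn per-frame P(SPEECH) values into (onset, offset) blocks in seconds.
    speech_posterior: one posterior per 80 ms frame, obtained by pooling the
    letter tokens into SPEECH vs. the silence token (NONSPEECH).
    Blocks shorter than min_dur seconds (500 ms) are discarded."""
    is_speech = np.asarray(speech_posterior) > threshold
    segments, start = [], None
    for i, s in enumerate(is_speech):
        if s and start is None:
            start = i                                   # block onset
        elif not s and start is not None:
            if (i - start) * frame_dur >= min_dur:      # keep blocks > 500 ms
                segments.append((start * frame_dur, i * frame_dur))
            start = None
    if start is not None and (len(is_speech) - start) * frame_dur >= min_dur:
        segments.append((start * frame_dur, len(is_speech) * frame_dur))
    return segments
```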
S1.1.2. SNR
The Signal-to-Noise Ratio (SNR) is calculated using the VAD labels predicted above. For each 80 ms frame, the VAD returns a posterior of whether the frame is speech or not. We compute

$$\mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10} \left( \frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}} \right)$$

Fig. S1: LibriVox SNR histogram. [Figure not reproduced; SNR in dB on the x-axis, ranging roughly from -10 to 50, normalized frequency on the y-axis.]
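A minimal sketch of this computation, assuming the signal power is estimated on VAD-labeled speech samples and the noise power on the remaining samples; the exact power estimator used for the released tags is not specified in the text.

```python
import numpy as np

def snr_db(wav, segments, sr=16000):
    """Average SNR of a file in dB: power ratio of speech samples (inside the
    VAD blocks, given in seconds) to the remaining non-speech samples."""
    mask = np.zeros(len(wav), dtype=bool)
    for onset, offset in segments:
        mask[int(onset * sr):int(offset * sr)] = True
    p_signal = np.mean(wav[mask] ** 2)
    p_noise = np.mean(wav[~mask] ** 2) + 1e-12   # guard against division by zero
    return 10.0 * np.log10(p_signal / p_noise)
```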
S1.1.3. Perplexity
Perplexity was obtained by performing beam-search decoding with the trained TDS model defined above, supplemented by a 4-gram word language model trained on LibriSpeech-LM. It was computed as the mean of the log probability of the posterior on each file.
S1.1.4. JSON files and splits
To construct the JSON files, we duplicated the metadata of the original book JSON file from LibriVox into each of the files associated with a given book (including unique book ID and speaker ID), and we added tags for SNR, perplexity and our own macro-genre tags, obtained by folding the existing genres into 7 categories: "Literature", "Science, Craft & Essay", "Ancient", "Religion", "Poetry", "Theater", and "Undefined". We also added VAD information as a list of onsets and offsets of voice activity.

The files were split into cuts of different sizes while trying to maintain the same distribution of macro-genres across the three cuts; a sketch of one way to do this follows.
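One simple genre-matched sampling strategy is to give each macro-genre a duration budget proportional to its share of the full corpus and fill it greedily. This is an illustrative reconstruction, not the released split code.

```python
import random
from collections import defaultdict

def genre_balanced_cut(files, target_hours):
    """Sample a cut whose macro-genre proportions match the full corpus.
    files: list of (path, genre, duration_hours) tuples."""
    total = sum(d for _, _, d in files)
    by_genre = defaultdict(list)
    for f in files:
        by_genre[f[1]].append(f)
    cut = []
    for genre, group in by_genre.items():
        random.shuffle(group)
        # Budget proportional to this genre's share of the whole corpus.
        budget = target_hours * sum(d for _, _, d in group) / total
        acc = 0.0
        for f in group:
            if acc >= budget:
                break
            cut.append(f)
            acc += f[2]
    return cut
```

Nesting the cuts (unlab-600 inside unlab-6k inside unlab-60k), as done for the release, additionally requires sampling the smaller cut from within the larger one.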
S1.2. The ABX metric
Given a frame-wise distance metric, we want to check that features coding for the same phoneme have a more similar representation than features coding for different phonemes. To quantify this property, we use a minimal-pair ABX task as defined in [14]: given a set of sounds S(x) from a category x and a set of sounds S(y) from a category y, with m = |S(x)| and n = |S(y)|, we compute

$$\theta(x, y) = \frac{1}{nm(m-1)} \sum_{a \in S(x)} \sum_{b \in S(y)} \sum_{c \in S(x) \setminus \{a\}} \hat{\theta}(a, b, c), \qquad \hat{\theta}(a, b, c) = \mathbb{1}\{ d(a, c) < d(b, c) \}$$

S2. SUPPLEMENTARY BASELINES

S2.1. CPC model and training

S2.1.1. Method

To train a feature model in an unsupervised fashion, we used the method implemented by Rivière et al. [?], inspired by the Contrastive Predictive Coding (CPC) algorithm presented in [7]. We briefly introduce the algorithm in this section, and refer the reader to the original papers for more details.

CPC relies on forward modeling: given an input sequence of features, we try to predict the k future representations of the sequence. The network must discriminate each future ground-truth feature from negative examples randomly sampled in the batch. More precisely, the model works as follows:

1. The raw waveform w goes through a convolutional network g_c, resulting in a feature sequence (x_t), t = 1...T.
2. We form the current phonetic representation z_t by applying a recurrent network g_ar to x_t.
3. Finally, we predict (x_{t+1}, ..., x_{t+k}) from (z_{t'})_{t' <= t} using a prediction network g_p.

When using these features for another task, we always consider z_t, the output of the recurrent layer.

S2.1.2. Architecture details

For g_c, we use five convolutional layers with strides [5, 4, 2, 2, 2], filter sizes [10, 8, 4, 4, 4] and 256 hidden units with ReLU activations. The features are normalized channel-wise between each convolution. In the end, this network has a downsampling factor of 160, meaning that on a 16 kHz audio input each feature encodes 10 ms of data. Furthermore, g_ar is a one-layer LSTM, also with a 256-dimensional hidden state. Finally, the predictor g_p is a one-layer transformer.

S2.1.3. Training details

Input sequences of fixed duration were gathered in batches of several sequences per GPU, across multiple GPUs. Training took approximately two days on NVIDIA Tesla V100-SXM2-16GB GPUs. Within a given batch, all sequences were sampled from the same speaker.

S2.2. TDS model and training

Given the limited amount of supervised training data, we select a smaller TDS model [23] with 20 million parameters. The model has stride 2 in its first convolution, and three groups of TDS blocks with (10, 14, 18) channels and (2, 2, 3) blocks per group. On the whole 60k-hour training set, we use the original architecture introduced in [23] with 37 million parameters; the only difference is that we reduce the overall stride from 8 to 2. We use dropout to prevent over-fitting, with values of 0.4 and 0.1 in the 20M and 37M models respectively.

In terms of model optimization, we use plain SGD with momentum. The initial learning rate and momentum are set to 0.1 and 0.5 respectively. In the supervised setting, the models are trained for 1500 epochs on 8 GPUs in total, with the learning rate halved after every 200 epochs and a total batch size of 64 and 16 for the character- and phone-based systems.
In the semi-supervised scenario, the models are trained for 150 epochs on 32 GPUs in total, with the learning rate halved after every 30 epochs and a total batch size of 256. The beam-search decoding parameters are tuned on dev-other only for the 20M TDS models used to generate pseudo-labels, while they are tuned independently for the final 37M model. The same official LibriSpeech 4-gram LM is used in both decoding procedures. The decoding beam size is 1000 in all experiments.

S3. SUPPLEMENTARY RESULTS

S3.1. Pseudo-labels experiment

System                  dev-clean  dev-other  test-clean  test-other
MFSC TDS + train-1h     44.4       57.7       55.6        65.9
  + 60k pseudo-label    57.6       68.1       59.5        72.3
MFSC TDS + train-10h    22.5       40.2       22.2        41.3
  + 60k pseudo-label
MFSC TDS + train-1h     44.3       53.9       46.9        55.4
  + 60k pseudo-label    43.4       52.6       43.2        53.9
MFSC TDS + train-10h    20.7       36.4       21.8        38.0
  + 60k pseudo-label

Table S1: PER/CER of acoustic models trained with pseudo-labels. Top: small phone-based TDS [23] models with limited labels using wav2letter++ [18], generating pseudo-labels on the 60K dataset with an in-domain LM, and retraining a larger TDS acoustic model (PER).