"This is Houston. Say again, please". The Behavox system for the Apollo-11 Fearless Steps Challenge (phase II)
Arseniy Gorin, Daniil Kulko, Steven Grima, Alex Glasman
Behavox Limited, Montreal, Canada
{arseniy.gorin, daniil.kulko, steven.grima, alex.glasman}@behavox.com

Abstract
We describe the speech activity detection (SAD), speaker diarization (SD), and automatic speech recognition (ASR) experiments conducted by the Behavox team for the Interspeech 2020 Fearless Steps Challenge (FSC-2). A relatively small amount of labeled data, a large variety of speakers and channel distortions, and a specific lexicon and speaking style resulted in high error rates for systems trained on this data alone. In addition to approximately 36 hours of annotated NASA mission recordings, the organizers provided a much larger but unlabeled 19k-hour Apollo-11 corpus, which we also explore for semi-supervised training of ASR acoustic and language models, observing more than 17% relative word error rate improvement compared to training on the FSC-2 data only. We also compare several SAD and SD systems to approach the most difficult tracks of the challenge (track 1 for diarization and ASR), where long 30-minute audio recordings are provided for evaluation without segmentation or speaker information. For all systems, we report substantial performance improvements compared to the FSC-2 baseline systems, and we achieved a first-place ranking for SD and ASR and fourth place for SAD in the challenge.
Index Terms: speech recognition, speaker diarization, speech activity detection
1. Introduction
The Apollo-11 corpus is not only a digitized audio archive of an important historical event; it is also a naturalistic dataset that exposes problems which remain hard for state-of-the-art speech processing systems, including speech activity detection (SAD), speaker diarization (SD), automatic speech recognition (ASR), and speaker identification [1, 2]. The second Fearless Steps challenge [3] (FS-2) extends the previous year's inaugural challenge [4] with streamlined diarization and automatic speech recognition tasks, which are very realistic for both researchers and engineers working with single-channel noisy recordings.

Coincidentally, we encounter similar challenges in the field of commercial speech processing for the financial industry. Trading floor recordings tend to be long, single-channel, and consist of rapid speech, including short phrases without contextual information and a specific style of speech (trading jargon, abbreviations, etc.). The recordings come from a variety of commercial recording devices and are frequently corrupted by heavy background noise and lossy compression codecs.

Our FS-2 submission and the experiments described here target five out of the six tracks provided by the FS-2 challenge, covering SAD, SD, and ASR. For SD and ASR, both track 2 with the reference SAD and track 1 with long audio files are evaluated. The SAD system was ranked fourth, while SD and ASR were ranked as the top-performing systems in the FS-2 challenge (https://fearless-steps.github.io/ChallengePhase2/Final.html).

The submitted SAD system is based on a multilayer perceptron (MLP), which has been shown to be simple and efficient [5, 6, 7]. We significantly improved its performance by tuning post-processing parameters, and we experimented with untranscribed data by means of ASR transcriptions of the 19k-hour Apollo-11 (A11) dataset. We achieve an order of magnitude lower detection cost function (DCF) than the baseline system (0.9% on development and 1.6% on evaluation data versus 12.5% and 13.6% for the baseline).

For the SD system, we compared a state-of-the-art x-vector [8] model trained on out-of-domain data with a conventional i-vector based system [9] trained on the FS-2 dataset only. In addition, we experimented with a combination of these representations and employed variational Bayes (VB) re-segmentation [10] to achieve 26.7% and 27.6% diarization error rates (DER) on the track 1 and track 2 development sets, respectively. This is a substantial improvement compared to the 79.7% and 68.6% DER of the baseline system.

Our Kaldi-based ASR [11] system heavily utilizes the large untranscribed A11 dataset by means of semi-supervised training (SST), which has been shown to be useful in the past for low- and moderately-resourced conditions [12, 13, 14]. In our experiments, we relied on lattice-free MMI SST [15] for "chain" [16, 17] acoustic models based on factorized time-delay neural networks (TDNN-F) proposed in [18]. Our experiments suggest that the SST framework proposed in [15] is very efficient "in the wild", where the available large untranscribed speech data are very noisy and not segmented. We also found that automatically transcribed data was useful for improving language models for this task. Our best-performing ASR systems achieve 21.8% and 22.4% word error rate (WER) on tracks 2 and 1, respectively, which is almost four times better than the performance of the baseline systems (80.5% and 83.5%, respectively).
2. Speech Activity Detection
The SAD system is based on an MLP trained with conventional 13 MFCC features. A context of 30 frames was used as the MLP input to produce frame-wise predictions based on a speech probability threshold F_thd. The frame-wise predictions were further grouped into segments, followed by simple post-processing which filtered out speech segments with an average frame probability below a given threshold S_thd or shorter than a given duration S_min, similar to [19, 20]. For all three systems summarized in Table 1, the post-processing parameters were selected to optimize DCF on the development data, resulting in F_thd = 0.…, S_min = 25 frames, and S_thd = 0.…. This simple post-processing greatly reduced false-positive speech segments, improving DCF by a large margin.
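A minimal sketch of the post-processing described above, assuming frame-wise speech probabilities from the MLP; the threshold values used here are illustrative placeholders, not the tuned values from the paper:

```python
# Group frame-wise speech probabilities into segments and drop segments whose
# mean probability is below S_thd or whose duration is shorter than S_min frames.
import numpy as np

def postprocess_sad(frame_probs, f_thd=0.5, s_thd=0.5, s_min=25):
    """Return a list of (start_frame, end_frame) speech segments."""
    speech = frame_probs >= f_thd
    segments, start = [], None
    for i, is_speech in enumerate(np.append(speech, False)):  # sentinel flushes the last segment
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            if (i - start) >= s_min and frame_probs[start:i].mean() >= s_thd:
                segments.append((start, i))
            start = None
    return segments

# Example: with 10 ms frames, s_min=25 corresponds to a 250 ms minimum duration.
print(postprocess_sad(np.random.rand(3000)))
```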
Figure 1: Snippet of FS02_dev_020.wav [4:46-4:55] showing unlabeled speech found by the speech activity detector (audio signal with reference/hypothesis speech labels, and power spectrogram).

The initial SAD (system 1 in Table 1) was trained on the FS-2 training data, which consisted of 18 hours of speech and 44 hours of non-speech. It is based on a two-layer MLP with 256 hidden rectified linear units (ReLUs). To take advantage of the larger 19k-hour A11 dataset, we relied on SST. First, the initial SAD was used to segment the A11 corpus. Then, word-level transcriptions (including silence and noise units) with confidence scores were created from the ASR lattices produced by the system described later in Section 4.1 (system 3 in Table 3). We further selected 206 hours of non-speech and 780 hours of speech from these transcriptions and re-trained the MLP (system 2 in Table 1). While using the extra data did not result in a significant improvement of DCF, it allowed us to further increase the capacity of the model to three layers with 400 ReLUs, resulting in our best-performing SAD (system 3 in Table 1).
Table 1: Detection cost function (DCF) on the development and evaluation sets of the FS-2 speech activity detection challenge for raw and post-processed classifications, comparing two- and three-layer MLPs trained with or without semi-supervised (A11) data.

No  Model size  Train data   DCF Dev, %       DCF Eval, %
                             Raw     Post.    Post.
1   2 x 256     FS-2         1.68    1.25     1.87
2   2 x 256     FS-2 + A11   3.03    1.23     2.03
3   3 x 400     FS-2 + A11   2.01    0.92     1.56

Interestingly, the results indicate that the raw classification from the MLP performs better for system 1, trained on less data, but after applying post-processing, the systems trained with A11 data perform best.

Assessment of SAD performance on the development set revealed that certain files had an abnormally high false-positive rate. Specifically, for FS02_dev_020.wav the false-positive probability was more than seven times the mean of the development set. Upon further investigation, we discovered that a major factor contributing to this was missing annotations for low-level speech in heavy noise; the SAD presented here correctly detected such speech. One example is shown in Figure 1.
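A small sketch of the per-file error analysis mentioned above; the function names, the outlier factor, and the frame-level label representation are assumptions for illustration:

```python
# Given frame-level reference and hypothesis speech labels per file, compute the
# false-positive rate (false alarms over reference non-speech) and flag files
# far above the development-set mean.
import numpy as np

def false_positive_rate(ref, hyp):
    """ref, hyp: boolean arrays, True = speech."""
    non_speech = ~ref
    if non_speech.sum() == 0:
        return 0.0
    return float((hyp & non_speech).sum() / non_speech.sum())

def flag_outliers(per_file_labels, factor=3.0):
    """per_file_labels: dict mapping file name -> (ref, hyp) boolean arrays."""
    rates = {f: false_positive_rate(r, h) for f, (r, h) in per_file_labels.items()}
    mean_rate = np.mean(list(rates.values()))
    return {f: r for f, r in rates.items() if r > factor * mean_rate}, mean_rate
```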
3. Speaker Diarization
One challenge with the data provided for FS-2 track 2, for both SD and ASR, is that it mostly consists of very short speech segments (50% of the segments are shorter than one second) separated by long silence. Figure 2 shows the distribution of reference SAD segment durations for the development and training data.
Figure 2: Reference segment duration for diarization track 2.
Our diarization experiments were conducted on both track 2 (reference SAD) and track 1 (long files without SAD). Initial experiments for track 1 were done using SAD system 3 from Table 1. However, using this SAD resulted in a large degradation on track 1 compared to track 2 (up to 1.5 times higher DER) due to many silence segments being recognized as speech. This is expected given the DCF metric definition for the SAD task:
DCF(θ) = 0.75 · P_FN(θ) + 0.25 · P_FP(θ)    (1)

where P_FN is the false negative duration divided by the reference speech duration, P_FP is the false positive duration divided by the reference silence duration, and θ denotes a threshold setting.

The imbalanced coefficients in the DCF metric encourage systems not to miss speech. However, as the FS-2 files contain much more silence than speech, even a small P_FP implies a large absolute duration of false positives. To reduce the number of false positives for DER and the number of insertions for WER, in this and later experiments with SD and ASR we used the same SAD but with thresholds optimized for the inverted cost:

DCF_INV(θ) = 0.25 · P_FN(θ) + 0.75 · P_FP(θ)    (2)
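A worked sketch of the two cost functions above, assuming frame-level boolean labels for simplicity; the coefficient values follow the reconstructed Eqs. (1) and (2):

```python
# Compute DCF and its inverted variant from frame-level reference/hypothesis labels.
import numpy as np

def sad_costs(ref, hyp):
    """ref, hyp: boolean frame arrays, True = speech. Returns (DCF, DCF_INV)."""
    p_fn = (ref & ~hyp).sum() / max(ref.sum(), 1)       # missed speech / reference speech
    p_fp = (~ref & hyp).sum() / max((~ref).sum(), 1)    # false alarms / reference silence
    dcf = 0.75 * p_fn + 0.25 * p_fp                     # Eq. (1): penalizes missed speech
    dcf_inv = 0.25 * p_fn + 0.75 * p_fp                 # Eq. (2): penalizes false alarms
    return float(dcf), float(dcf_inv)
```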
We started with an i-vector based SD system described in [21] that used agglomerative hierarchical clustering (AHC) of probabilistic linear discriminant analysis (PLDA) scores [22]. The Kaldi toolkit [11] was used for modeling and implementation of the baseline, similar to the publicly available Kaldi CALLHOME diarization recipe. In addition, we investigated neural network-based speaker embeddings (x-vectors), which have demonstrated good performance for diarization [8, 23]. Training an x-vector extractor generally requires a much larger amount of speaker-annotated training data than is available in FS-2. We therefore used the x-vector extractor trained on out-of-domain data (NIST SRE 04, 05, 06, and 08) provided by the authors of [23, 8].

In our experiments, both the x-vector and i-vector extractors used 23 MFCC features. The i-vector extractor also used first and second derivatives and was trained on the FS-2 training dataset using a 1024-component UBM and a 128-dimensional subspace. X-vectors were extracted from a 128-dimensional pre-softmax layer of the network described in [23]. For both i/x-vectors, a global mean subtraction and the PCA whitening transform were applied to the combined development and training data.

For PLDA training, the i/x-vectors were extracted from three-second chunks of speech, where speaker-dependent segments were merged within each audio file to collect sufficiently long chunks. For inference, i/x-vectors were extracted from two-second chunks with a one-second overlap. These parameters were selected to optimize DER on the development data. For both systems, AHC was used within each audio file for clustering PLDA scores between all pairs of vectors computed on each segment. Before estimating pairwise similarity scores, the PLDA and the vectors were projected into a low-dimensional space using a PCA estimated per audio recording. The number of recording-dependent PCA components and the AHC stopping criteria thresholds were optimized on the development set.

The performance of the i-vector and x-vector based systems with AHC is shown in rows 1 and 2 of Table 2. In our experiments, both systems demonstrated similar results for track 2, but x-vectors performed better on track 1.
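A minimal sketch of the per-recording clustering step, not the Kaldi implementation used in the paper: a generic pairwise similarity function stands in for PLDA scoring, and SciPy's average-linkage AHC with a distance threshold stands in for the tuned stopping criterion.

```python
# Cluster per-segment speaker embeddings with agglomerative hierarchical clustering
# over pairwise similarity scores.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ahc_cluster(embeddings, similarity_fn, stop_threshold):
    """embeddings: (n_segments, dim) array. Returns integer speaker labels."""
    n = len(embeddings)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = similarity_fn(embeddings[i], embeddings[j])
    dist = sim.max() - sim                 # turn similarities into distances
    np.fill_diagonal(dist, 0.0)
    z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(z, t=stop_threshold, criterion="distance")

# Example with a cosine-similarity stand-in for PLDA scores:
cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
labels = ahc_cluster(np.random.randn(20, 256), cos, stop_threshold=0.6)
```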
Table 2: Diarization error rate for diarization track 2 (reference segments) and track 1 (unsegmented), comparing i-vector (iv), x-vector (xv), and combined (iv+xv) systems with agglomerative hierarchical clustering (AHC), with and without variational Bayesian (VB) re-segmentation and under-clustering (UC).

No  System           DER, track 2      DER, track 1
                     Dev      Eval     Dev      Eval
1   iv + AHC         30.87    32.10    34.56    35.98
2   xv + AHC         30.85    33.20    30.86    34.78
3   iv+xv + AHC      27.86    29.00    31.20    31.74
4   + VB             26.14    26.98    27.29    29.45
5   + AHC(UC) + VB   27.65    26.55    26.73    28.85

A fusion of x-vectors and i-vectors was shown to improve DER in [23]. In our experiments, raw x-vectors and i-vectors were concatenated, and the resulting 256-dimensional vectors were used as input for the same PCA and PLDA training/scoring. Such a combination (system 3 in Table 2) results in a significant DER improvement on the development set of track 2 but does not improve the performance of the x-vector system for track 1.

For the two final systems, a variational Bayesian hidden Markov model (VBHMM, or simply VB) re-segmentation at the vector level (described in [24, 10] and denoted by the authors as VBx) was used to refine the initial AHC clustering. First, VB re-segmentation was initialized with the best-performing AHC (system 3 for track 2 and system 2 for track 1 in Table 2). The resulting system (system 4 in Table 2) significantly improves DER over the corresponding AHC-based system (from 27.86% to 26.14% and from 30.86% to 27.29% on the development sets of tracks 2 and 1, respectively).

Next, VB re-segmentation was used with an AHC system that was tuned to produce more speakers than expected (under-clustering), relying on VB to merge some of the fragmented speaker clusters. A similar approach was shown to be efficient in [25], where the authors controlled AHC under-clustering by increasing the AHC stopping criterion and then relied on the ability of VB to merge some of the speaker candidates. Instead of increasing the AHC threshold, we increased the number of recording-dependent PCA components, leaving the AHC threshold unchanged. In our preliminary experiments, this approach performed better on the development data and thus yields a different way of controlling the number of speakers. The performance of the diarization system using AHC with under-clustering and VB is shown in row 5 of Table 2. Overall, we observed a significant improvement on the evaluation data compared to VB on top of an optimized AHC system.

Although the presence of the same speakers in both the development and training data (potentially even the evaluation data) may simplify the SD task, the large (over 50 for several files) and varying number of speakers (from seven to 61 in the development set) presents a challenge for all diarization systems. Figure 3 shows a per-file analysis of the reference number of speakers in the development set for track 2, along with the number of speakers predicted by AHC and by AHC+VB with and without under-clustering. While the error in determining the number of speakers is lower for AHC (9.6 for AHC, 12.4 for AHC+VB, and 10.8 for VB with under-clustered AHC), VB results in better DER in all our experiments.
4. Automatic Speech Recognition
For the ASR experiments, the Kaldi toolkit was used [11]. We started by optimizing system performance using the ground-truth SAD (track 2) and then analyzed how segmentation impacted performance on unsegmented audio (track 1). While track 2 provides speaker labels for the training and development data, no such information was released for the evaluation data. In an effort to maintain experimental consistency, we decided not to use speaker information in testing for track 2.
The baseline speaker-adapted HMM-GMM system was trained using the FS-2 training data (about 20 hours of speech). The model consists of 3k HMM states and about 100k Gaussians. The 5k-word lexicon was created using the CMU dictionary [26], and pronunciations for missing words were generated with the Sequitur toolkit [27]. Pronunciation and silence probabilities were also estimated from the training data alignments [28]. Row 1 of Table 3 summarizes the results achieved with this system. Later, this model was used for creating training targets for the neural network acoustic models (AMs).

Using the same dataset, a small TDNN-F model was trained following [18]. Here and in the following experiments, the frontend uses 40 MFCC features and 100-dimensional i-vector features, which were shown to be efficient for introducing speaker and channel information for ASR [29, 9]. The i-vector extractor was trained only on the FS-2 training data; using only the MFCC frontend or augmenting the i-vector training data with a large subset of the A11 corpus did not yield good results in our preliminary experiments. For comparison, a fixed context-dependent tree was used in all experiments. Given the limited amount of data, a relatively small network was trained (15 1024-dimensional layers factorized with 128-dimensional linear bottlenecks). The model was trained for 10 epochs with data augmented by speed and volume perturbation [30]. Row 2 of Table 3 shows the performance of the TDNN-F baseline.
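A rough PyTorch sketch of one factorized TDNN (TDNN-F) block of the kind used above, included only for illustration; the paper's models were trained in Kaldi, and the periodic semi-orthogonal constraint on the first factor is omitted here.

```python
# One TDNN-F block: a temporal convolution factorized through a low-rank linear
# bottleneck, followed by a residual ("bypass") connection, ReLU, and batch norm.
import torch
import torch.nn as nn

class TdnnFBlock(nn.Module):
    def __init__(self, dim=1024, bottleneck=128, context=1):
        super().__init__()
        # factor 1: project to the bottleneck over a temporal context window
        self.linear_a = nn.Conv1d(dim, bottleneck, kernel_size=2 * context + 1,
                                  padding=context, bias=False)
        # factor 2: project back to the full layer dimension
        self.linear_b = nn.Conv1d(bottleneck, dim, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x):                  # x: (batch, dim, time)
        y = self.linear_b(self.linear_a(x))
        y = torch.relu(y + x)              # residual connection
        return self.bn(y)

# The larger SST model described later stacks blocks of this kind
# (1536-dimensional layers with 256-dimensional bottlenecks).
block = TdnnFBlock(dim=1024, bottleneck=128)
out = block(torch.randn(4, 1024, 200))
```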
We further explored the 19k-hour A11 corpus for improving ASR performance. The A11 corpus also contains automatically generated transcriptions and SAD segments. However, using these annotations for training the AMs (both GMM and TDNN-F) did not improve WER in our preliminary experiments; it is likely that the provided A11 transcriptions are quite inaccurate. We still found it useful, however, to add these transcriptions for language modeling (LM).

The WER achieved by including the A11 transcriptions available in the dataset for LM is summarized in row 3 of Table 3. For all LM experiments we used the pocolm toolkit (https://github.com/danpovey/pocolm), which interpolates data sources at the level of data counts to minimize the perplexity of the development set. Using the A11-provided transcriptions in the LM results in a significant WER improvement and an overall vocabulary expansion to 25k words. This 25k lexicon is used in all remaining experiments.
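A much-simplified illustration of the idea behind this interpolation step; the actual system used pocolm, which works at the level of n-gram counts, whereas this sketch simply grid-searches an interpolation weight between two stand-in source models to minimize development-set perplexity.

```python
# Choose the interpolation weight between two source LMs (here represented as
# token-probability dictionaries) by minimizing perplexity on development text.
import math

def perplexity(dev_tokens, p1, p2, lam, floor=1e-10):
    log_sum = 0.0
    for tok in dev_tokens:
        p = lam * p1.get(tok, floor) + (1 - lam) * p2.get(tok, floor)
        log_sum += math.log(p)
    return math.exp(-log_sum / len(dev_tokens))

def best_weight(dev_tokens, p1, p2, grid=21):
    lams = [i / (grid - 1) for i in range(grid)]
    return min(lams, key=lambda lam: perplexity(dev_tokens, p1, p2, lam))
```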
Figure 3: The number of reference and estimated speakers by the AHC, AHC+VB, and under-clustered AHC+VB systems.
Table 3: Word error rate on the development and evaluation sets of the FS-2 ASR challenge, track 2 (reference segmentation), comparing models trained on the FS-2 data only, with the Apollo-11 original noisy transcriptions (A11), and with semi-supervised training (SST) of the acoustic model (AM) and the language model (LM), using n-gram and recurrent neural network (RNN) LMs.

No  AM               LM                        WER, %
    Model   Data     Model    Data             Dev    Eval
1   GMM     FS-2     3-gram   FS-2             53.8   55.6
2   TDNN    FS-2     3-gram   FS-2             28.6   31.4
3   TDNN    FS-2     3-gram   FS-2+A11         26.3   29.1
4   TDNN    +SST1    3-gram   FS-2+A11         23.7   26.0
5   TDNN    +SST1    3-gram   FS-2+A11+SST1    23.5   25.8
6   TDNN    +SST1    4-gram   FS-2+A11+SST1    23.0   25.6
7   TDNN    +SST1    +RNN     FS-2+A11+SST1    21.8   24.3
8   TDNN    +SST2    +RNN     FS-2+A11+SST2    22.2   24.6

In order to leverage the A11 data for AM training, a new segmentation was generated using a lightweight SAD (system 1 in Table 1). Only segments longer than two seconds and shorter than 20 seconds were selected, resulting in 980 hours of untranscribed speech segments (a segment-selection sketch is shown below). Next, word lattices were generated with our ASR system 3. The lattices were then converted to TDNN-F targets and combined with the original training data as described in [15]. Supervised and automatically transcribed samples were equally weighted and used for training a larger TDNN-F from scratch (1536-dimensional layers with 256-dimensional bottlenecks) for three epochs. The resulting system yields 23.7% WER on the development set (row 4 of Table 3).

The 1-best decoding hypotheses of the automatically transcribed A11 data were then added for LM training. This resulted in a small improvement for the 3-gram LM (system 5 in Table 3) and a significant improvement for 4-grams (system 6 in Table 3). We did not observe any improvement with 4-grams when not using semi-supervised data. While we still found it useful to preserve the original A11 transcripts, performance was only slightly worse if they were discarded at this stage.

With a sufficient amount of LM data, we then trained a recurrent neural network language model (RNNLM) with letter-based features and importance sampling [31]; see [32] for a detailed description of the architecture. An efficient pruned lattice re-scoring algorithm [33] results in a further 1.2% WER improvement on the development set compared to the 4-gram model trained on the same data (system 7 in Table 3).

Finally, we experimented with a second-pass SST (system 8 in Table 3), which followed the system 7 training methodology but used system 6 to transcribe the A11 data again, this time also using a larger unsupervised set (1M segments and 1.9k hours) produced by SAD system 3. While this system did not improve track 2 WER due to an increased number of insertion errors, it did achieve the best result on track 1.
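A small sketch of the segment-selection step referenced above; the file paths are hypothetical, and only the standard Kaldi "segments" file layout (utterance-id, recording-id, start time, end time) is assumed.

```python
# Keep only automatically detected segments between 2 and 20 seconds before
# generating lattices for semi-supervised training.
def filter_segments(segments_in, segments_out, min_dur=2.0, max_dur=20.0):
    """Filter a Kaldi-style 'segments' file: <utt-id> <reco-id> <start> <end>."""
    kept = 0
    with open(segments_in) as fin, open(segments_out, "w") as fout:
        for line in fin:
            utt, reco, start, end = line.split()
            dur = float(end) - float(start)
            if min_dur <= dur <= max_dur:
                fout.write(line)
                kept += 1
    return kept

# e.g. filter_segments("data/a11_sad/segments", "data/a11_sst/segments")
```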
Similar to SD, ASR track 1 evaluates transcription without reference segments or speaker information. Table 4 reports WERs on the development and evaluation data of track 1, where various segmentation approaches were compared for our two best ASR systems (systems 7 and 8 from Table 3). The first evaluation (system 1) was performed using only the segments generated by our best SAD (system 3 in Table 1). Similarly to our SD experiments, we found that this SAD produced too many false positives. With SAD parameters minimizing DCF_INV (Eq. 2), ASR performance was significantly improved (system 2 in Table 4). Two SD systems were also compared for ASR: the simple i-vector based system (system 1 in Table 2) and our best-performing x-vector+VB system (system 5 in Table 2). The corresponding WERs are shown in rows 3 and 4 of Table 4.

Table 4: Word error rate on the development and evaluation sets of ASR track 1 with speech activity detection (SAD) system 3 (Table 1) optimized for the DCF (Eq. 1) and DCF_INV (Eq. 2) metrics, with and without speaker diarization (SD). The two best-performing ASR systems are reported (systems 7 and 8 in Table 3).

No  SAD metric  SD         WER, sys. 7     WER, sys. 8
                           Dev     Eval    Dev     Eval
1   DCF         no         24.6    26.1    23.6    25.0
2   DCF_INV     no         23.8    25.2    22.6    24.1
3   DCF_INV     system 1   23.9    25.8    23.0    24.8
4   DCF_INV     system 5   23.1    24.9    22.4    24.0

The conclusions of these experiments are quite straightforward. First, as for SD, better WERs are achieved with the SAD optimized for the DCF_INV metric. Second, it appears consistently better to use no diarization than a poorly performing one (compare rows 2 and 3). Finally, a good diarization system significantly improves ASR performance (compare rows 2 and 4).
5. Conclusions
The FS-2 challenge clearly demonstrates that there is still much room for improvement in speech processing technology when working with realistic noisy recordings and a limited amount of labeled data. Our experiments have shown that out-of-domain models can be effective (the x-vector SD) and that SST is very efficient in such conditions for both acoustic and language modeling in ASR as well as for SAD.
6. Acknowledgements
We would like to thank the organizers of the FS-2 challenge for coordinating this interesting and realistic task, and our colleagues Alan de Zwaan and Alex Viall for valuable suggestions.

7. References

[1] J. H. Hansen, A. Sangwan, A. Joglekar, A. E. Bulut, L. Kaushik, and C. Yu, "Fearless Steps: Apollo-11 Corpus Advancements for Speech Technologies from Earth to the Moon," in Proceedings of INTERSPEECH, 2018.
[2] A. Sangwan, L. Kaushik, C. Yu, J. H. Hansen, and D. W. Oard, "'Houston, We have a solution': Using NASA Apollo Program to advance Speech and Language Processing Technology," in Proceedings of INTERSPEECH, 2013.
[3] A. Joglekar, J. H. Hansen, M. C. Shekhar, and A. Sangwan, "FEARLESS STEPS Challenge (FS-2): Supervised Learning with Massive Naturalistic Apollo Data," in Proceedings of INTERSPEECH, 2020.
[4] J. H. Hansen, A. Joglekar, M. C. Shekhar, V. Kothapally, C. Yu, L. Kaushik, and A. Sangwan, "The 2019 Inaugural Fearless Steps Challenge: A Giant Leap for Naturalistic Audio," in Proceedings of INTERSPEECH, 2019.
[5] Z.-C. Fan, Z. Bai, X.-L. Zhang, S. Rahardja, and J. Chen, "AUC Optimization for Deep Learning Based Voice Activity Detection," in Proceedings of ICASSP, 2019.
[6] S. Dwijayanti, K. Yamamori, and M. Miyoshi, "Enhancement of speech dynamics for voice activity detection using DNN," EURASIP Journal on Audio, Speech, and Music Processing, no. 1, p. 10, 2018.
[7] N. Ryant, M. Liberman, and J. Yuan, "Speech activity detection on YouTube using deep neural networks," in Proceedings of INTERSPEECH, 2013.
[8] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proceedings of ICASSP, 2018.
[9] M. J. Alam, V. Gupta, P. Kenny, and P. Dumouchel, "Use of multiple front-ends and i-vector-based speaker adaptation for robust speech recognition," in Proceedings of the REVERB Challenge Workshop, 2014.
[10] F. Landini, S. Wang, M. Diez, L. Burget, P. Matějka, K. Žmolíková, L. Mošner, A. Silnova, O. Plchot, O. Novotný et al., "BUT System for the Second DIHARD Speech Diarization Challenge," in Proceedings of ICASSP, 2020.
[11] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in Proceedings of ASRU, 2011.
[12] L. Lamel, J.-L. Gauvain, and G. Adda, "Lightly supervised and unsupervised acoustic model training," Computer Speech & Language, vol. 16, no. 1, pp. 115–129, 2002.
[13] M. Karafiát, M. K. Baskar, P. Matejka, K. Veselý, F. Grezl, L. Burget, and J. Černocký, "2016 BUT Babel System: Multilingual BLSTM Acoustic Model with i-vector Based Adaptation," in Proceedings of INTERSPEECH, 2017.
[14] V. Manohar, "Semi-supervised training for automatic speech recognition," Ph.D. dissertation, Johns Hopkins University, 2019.
[15] V. Manohar, H. Hadian, D. Povey, and S. Khudanpur, "Semi-supervised training of acoustic models using lattice-free MMI," in Proceedings of ICASSP, 2018.
[16] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Proceedings of INTERSPEECH, 2016.
[17] H. Hadian, D. Povey, H. Sameti, J. Trmal, and S. Khudanpur, "Improving LF-MMI using unconstrained supervisions for ASR," in Proceedings of SLT, 2018.
[18] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, "Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks," in Proceedings of INTERSPEECH, 2018.
[19] A. Vafeiadis, E. Fanioudakis, I. Potamitis, K. Votis, D. Giakoumis, D. Tzovaras, L. Chen, and R. Hamzaoui, "Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection," in Proceedings of INTERSPEECH, 2019.
[20] T. Pham, C. Tang, and M. Stadtschnitzer, "Using Artificial Neural Network for Robust Voice Activity Detection Under Adverse Conditions," in Proceedings of the IEEE-RIVF International Conference on Computing and Communication Technologies, 2009.
[21] G. Sell and D. Garcia-Romero, "Speaker diarization with PLDA i-vector scoring and unsupervised calibration," in Proceedings of SLT, 2014.
[22] S. Ioffe, "Probabilistic linear discriminant analysis," in Proceedings of the European Conference on Computer Vision, 2006.
[23] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe et al., "Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge," in Proceedings of INTERSPEECH, 2018.
[24] M. Diez, L. Burget, S. Wang, J. Rohdin, and J. Černocký, "Bayesian HMM Based x-vector Clustering for Speaker Diarization," in Proceedings of INTERSPEECH, 2019.
[25] F. Landini, S. Wang, M. Diez, L. Burget, P. Matějka, K. Žmolíková, L. Mošner, O. Plchot, O. Novotný, H. Zeinali et al., "BUT System Description for DIHARD Speech Diarization Challenge 2019," arXiv preprint arXiv:1910.08847, 2019.
[26] R. L. Weide, "The CMU pronouncing dictionary," 1998.
[27] M. Bisani and H. Ney, "Joint-sequence models for grapheme-to-phoneme conversion," Speech Communication, vol. 50, no. 5, pp. 434–451, 2008.
[28] G. Chen, H. Xu, M. Wu, D. Povey, and S. Khudanpur, "Pronunciation and silence probability modeling for ASR," in Proceedings of INTERSPEECH, 2015.
[29] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, "Speaker adaptation of neural network acoustic models using i-vectors," in Proceedings of ASRU, 2013.
[30] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," in Proceedings of INTERSPEECH, 2015.
[31] H. Xu, K. Li, Y. Wang, J. Wang, S. Kang, X. Chen, D. Povey, and S. Khudanpur, "Neural network language modeling with letter-based features and importance sampling," in Proceedings of ICASSP, 2018.
[32] F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Estève, "TED-LIUM 3: Twice as Much Data and Corpus Repartition for Experiments on Speaker Adaptation," in Proceedings of SPECOM, 2018.
[33] H. Xu, T. Chen, D. Gao, Y. Wang, K. Li, N. Goel, Y. Carmiel, D. Povey, and S. Khudanpur, "A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition," in Proceedings of ICASSP, 2018.