An Investigation of End-to-End Models for Robust Speech Recognition
© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Archiki Prasad†, Preethi Jyothi‡, and Rajbabu Velmurugan†
†Department of Electrical Engineering, Indian Institute of Technology Bombay
‡Department of Computer Science & Engineering, Indian Institute of Technology Bombay
ABSTRACT
End-to-end models for robust automatic speech recognition (ASR) have not been sufficiently well-explored in prior work. With end-to-end models, one could choose to preprocess the input speech using speech enhancement techniques and train the model using enhanced speech. Another alternative is to pass the noisy speech as input and modify the model architecture to adapt to noisy speech. A systematic comparison of these two approaches for end-to-end robust ASR has not been attempted before. We address this gap and present a detailed comparison of speech enhancement-based techniques and three different model-based adaptation techniques covering data augmentation, multi-task learning, and adversarial learning for robust ASR. While adversarial learning is the best-performing technique on certain noise types, it comes at the cost of degrading clean speech WER. On other relatively stationary noise types, a new speech enhancement technique outperformed all the model-based adaptation techniques. This suggests that knowledge of the underlying noise type can meaningfully inform the choice of adaptation technique.
Index Terms — Robust ASR, Speech Enhancement, Multi-task and Adversarial Learning, Data Augmentation
1. INTRODUCTION
End-to-end (E2E) models, which directly convert a spoken utterance into a sequence of characters, are becoming an increasingly popular choice for ASR systems. They have been shown to outperform traditional cascaded ASR systems when large amounts of labeled speech are available. E2E ASR systems for low-resource scenarios have also emerged as an active new area of research. While E2E ASR systems for clean speech are growing rapidly in number, there have been relatively few investigations on the use of E2E models for noisy speech recognition [1, 2]. To the best of our knowledge, we are the first to provide a detailed comparison of speech enhancement-based techniques with a number of E2E model-based adaptation techniques for noisy speech across a diverse range of noise types. This comparison highlights the strengths and limitations of both types of approaches and offers prescriptions for which techniques are best suited for different noise types. Our code and datasets are publicly available at https://github.com/archiki/Robust-E2E-ASR.

Prior work has predominantly used a two-pass approach to tackle the problem of robust ASR: the input speech is first passed through a speech enhancement (SE) module, and the enhanced speech is subsequently passed through a standard speech recognition system. We adopt this as one of our approaches as well and investigate the use of three different speech enhancement techniques in conjunction with an E2E ASR system. As opposed to modifying the input speech with front-end processing modules like speech enhancement, one could use the noisy speech as-is and adapt the E2E model itself to handle the noisy input speech. We examine three model-based adaptation techniques:

1. Data Augmentation-based Training (DAT): Speech samples are augmented with varying noise types at varying signal-to-noise ratio (SNR) values and fed as input to an E2E model. Larger gradient updates are made in the lower layers of the E2E model compared to the higher layers. This technique first appeared in [2].

2. Multi-task Learning (MTL): The E2E model is jointly trained with an auxiliary noise type classifier. MTL with a noise type classifier has not been previously explored for robust ASR and turns out to be quite effective as an adaptation technique.

3. Adversarial Training (AvT): Unlike MTL, which drives the learned representations to be more noise-aware, in AvT we train an E2E model with a gradient reversal layer in conjunction with the noise classifier to learn more noise-invariant representations.
2. RELATED WORK
We review some relevant supervised robust ASR techniques under noisy, single-channel conditions, and do not consider reverberation or multi-channel approaches. A summary of various deep learning techniques for robust ASR, datasets, and benchmarks is provided in [3]. In [4], a noise-aware training (NAT) technique that uses a mean noise estimate (assuming stationarity) in the input was proposed and shown to give good results. This was further improved in [5] by jointly training a source separation model for noise estimation together with the acoustic model. The work in [6] uses deep convolutional neural networks (CNNs) to achieve the best ASR results on the Aurora-4 task [7].

One approach to improving the performance of E2E models is to use data augmentation along with fine-tuning, as in [8]. Another approach is to formulate this as a domain adaptation problem and use variational autoencoders (VAEs) [9]; here, the source domain is clean speech and the target domain is noisy speech. Similarly, [1] uses penalty terms on an encoder-decoder ASR model to achieve noise-invariant representations. The work in [10] uses a joint adversarial training framework, with a mask-based SE front-end along with an attention-based E2E ASR model.

Models that operate in the time domain, such as SEGAN [11] and SE-VCAE [12], have been effective as SE front-ends in a two-pass approach. However, such models rely on large datasets for training, and better enhancement need not necessarily translate into better ASR. Deep Xi [13], a recent SE front-end, has been shown to improve ASR when used with DeepSpeech2 [14] (DS2) as the back-end ASR system. A more recent approach that operates on the raw waveform for real-time speech enhancement is DEMUCS [15]. These two methods will be further explained in Section 4.1 and will be used in our analysis.
3. DATASETS AND E2E ASR SYSTEM

3.1. Implementation Details
In this work, we use DS2 as our main E2E ASR system. DS2 is trained using the Connectionist Temporal Classification (CTC) objective function and comprises two 2D convolutional layers, followed by five bidirectional LSTM layers with a final fully-connected (FC) softmax layer. The input features are derived from short-time Fourier transform (STFT) magnitude spectrograms, and the baseline model is trained on 100 hours of clean speech from the Librispeech dataset [16]. Additional training-specific details can be found in [17]. We use the DS2 implementation available at https://github.com/SeanNaren/deepspeech.pytorch.

We are specifically interested in applications that cater to certain noise types for which we do not have a lot of data. Not many existing datasets are designed to be low-resource (the DEMAND dataset [18] exists, but has few samples per noise type), so we constructed our own custom dataset. It consists of the following noise types: 'Babble', 'Airport/Station', 'Car', 'Metro', 'Cafe', 'Traffic', and 'AC/Vacuum'. Airport/Station comprises background sounds containing announcements at both airports and stations, while AC/Vacuum comprises room sounds with an air conditioner or vacuum cleaner in the background; within these noise types, the noise sub-types are equally distributed between the train and test sets. The noise samples are sampled at 16 kHz and were collected from FreeSound [19]. For each noise type, we have 10 and 8 distinct samples in the train and test sets, respectively. The total duration of the noise train and test sets is close to 2 hours.

For clean speech, we use the Librispeech corpus [16]. We train the DS2 models using the train-clean-100 set and add simulated noise using the training samples of our noise dataset. Our development set is constructed using the dev-clean set of Librispeech (duration of 5.4 hours) with training noise samples from our noise dataset. During both training and validation, a noise type was picked randomly and the SNR of the additive noise was chosen randomly from { } dB. For testing, we randomly picked 120 files from the test-clean set of Librispeech. To create noisy speech, for each utterance and each noise type from our noise database, a random section of the noise is added at SNR levels ranging from 0 to 20 dB in increments of 5 dB, resulting in a total of 4200 noisy speech test utterances. (This process of constructing noisy test samples was outlined in [13].)
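To make the noise-addition recipe above concrete, the following is a minimal sketch of mixing a randomly chosen noise segment into a clean utterance at a target SNR. The function and variable names are our own illustration and are not taken from the released code.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix a random segment of `noise` into `clean` at the target SNR (dB)."""
    # Pick a random noise section as long as the clean utterance.
    start = np.random.randint(0, len(noise) - len(clean) + 1)
    segment = noise[start:start + len(clean)]

    # Scale the noise so that 10 * log10(P_clean / P_noise) == snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(segment ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * segment

# Example: corrupt one utterance at each test SNR from 0 to 20 dB in 5 dB steps.
clean = np.random.randn(16000)        # stand-in for a 1 s, 16 kHz utterance
noise = np.random.randn(16000 * 60)   # stand-in for a 1 min noise recording
noisy = {snr: add_noise_at_snr(clean, noise, snr) for snr in range(0, 25, 5)}
```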
4. APPROACHES

4.1. Front-End Speech Enhancement
We experiment with three state-of-the-art speech enhancement techniques, detailed below.
SE-VCAE.
Speech Enhancement Variance Constrained Autoencoder (SE-VCAE) [12] learns the distribution over latent features given noisy data and acts directly in the time domain. It outperforms SEGAN [11], a popular generative modeling SE technique. We finetune the pretrained SE-VCAE model on noisy speech samples from our dataset.
Deep Xi.
Deep Xi was specifically proposed as a front-end to be used with DeepSpeech [13]. This enhancement technique acts on a noisy speech spectrogram and uses a priori SNR estimation to design a mask that produces an estimate of the clean speech spectrogram. DS2 is then fine-tuned for 10 epochs on the enhanced examples from Librispeech, while ensuring minimal loss in performance on clean speech data.

DEMUCS. [15] proposes an alternate encoder-decoder architecture for denoising speech samples. The model is trained on a low-resource noisy dataset, along with reverberation, using data augmentation techniques to compensate for the limited data. We use their pretrained model and fine-tune it on our dataset for 20 epochs. Similar to Deep Xi, DS2 is further finetuned on the denoised samples. Table 1 shows objective scores measuring the quality of enhanced speech using the three SE techniques discussed here. DEMUCS clearly outperforms the other two techniques on all four metrics.

Method     MOS-LQO   PESQ   STOI    eSTOI
Baseline   1.29      1.80   80.68   60.66
SE-VCAE    1.38      1.83   80.64   62.85
Deep Xi    1.98      2.50   86.89   74.52
DEMUCS     -         -      -       -

Table 1. Objective scores for various enhancement methods. Larger scores are better.
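As a rough illustration of how an a priori SNR estimate turns into a spectral mask, the sketch below applies a Wiener-style gain to a noisy STFT. Deep Xi itself estimates the a priori SNR with a neural network and uses an MMSE-based gain, so treat this simplified stand-in as illustrative only, not as the actual Deep Xi front-end.

```python
import numpy as np

def wiener_gain_from_xi(xi):
    """Wiener filter gain derived from the a priori SNR xi, per T-F bin."""
    return xi / (1.0 + xi)

def enhance_stft(noisy_mag, noisy_phase, xi_hat):
    """Apply a mask derived from an a priori SNR estimate to a noisy STFT.

    noisy_mag, noisy_phase: magnitude/phase of the noisy STFT (freq x time).
    xi_hat: estimated a priori SNR per bin (in Deep Xi this comes from a
            neural network; here it is simply an input).
    """
    mask = wiener_gain_from_xi(xi_hat)
    enhanced_mag = mask * noisy_mag
    # Reuse the noisy phase, as is standard for magnitude-domain enhancement.
    return enhanced_mag * np.exp(1j * noisy_phase)
```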
4.2. Data Augmentation-based Training (DAT)

For this technique, clean speech samples are augmented with noise with a probability of 0.5 and subsequently used to train DS2. The model was trained for 25 epochs with a batch size of 32 and a learning rate of 0.0001. At the end, the model that performed best on the development set (with similar noise augmentation) was chosen. We refer to this model as "Vanilla DAT". To enable better transfer learning from clean to noisy speech, we incorporate the soft-freezing scheme proposed in [2]: the learning rate of the FC layer, along with the last two LSTM layers, is scaled down by a factor of 0.5 (further discussed in Section 5). This training strategy has the effect of forcing the lower layers (which act as a feature extractor) to learn noise-invariant characteristics of the noisy speech. This model will henceforth be referred to as "Soft-Freeze DAT".
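A minimal PyTorch sketch of the soft-freezing scheme follows: the last two BiLSTM layers and the FC layer get a learning rate scaled down by 0.5, while the convolutional front-end and the lower LSTMs train at the base rate. The toy model definition and the optimizer choice are our own illustrative assumptions, not the released DS2 configuration.

```python
import torch
import torch.nn as nn

# Stand-in for a DS2-style model: conv front-end, five BiLSTMs, FC output.
# (Layer sizes are arbitrary; only the parameter grouping matters here.)
class TinyDS2(nn.Module):
    def __init__(self, feat=161, hidden=64, vocab=29):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
        self.lstms = nn.ModuleList(
            [nn.LSTM(feat if i == 0 else 2 * hidden, hidden,
                     bidirectional=True, batch_first=True) for i in range(5)])
        self.fc = nn.Linear(2 * hidden, vocab)

model = TinyDS2()
base_lr = 1e-4

# Soft-freezing via per-group learning rates.
lower = list(model.conv.parameters()) + \
        [p for l in model.lstms[:3] for p in l.parameters()]
upper = [p for l in model.lstms[3:] for p in l.parameters()] + \
        list(model.fc.parameters())

optimizer = torch.optim.Adam([
    {"params": lower, "lr": base_lr},        # lower layers: full rate
    {"params": upper, "lr": 0.5 * base_lr},  # last two LSTMs + FC: halved
])
```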
4.3. Multi-task Learning (MTL)

Fig. 1. Framework used in MTL (in black) and AvT (in red). 2D-convolutions, BiLSTM and FC layers are shown using orange, blue and green circles, respectively.

Figure 1 describes our MTL setup. The auxiliary classifier predicts noise type labels and uses representations from an intermediate LSTM layer as its input. (There are 8 noise labels: 7 for the noise types and 1 for clean speech.) This noise classifier comprises one bidirectional LSTM layer followed by two linear layers. The model is trained with a hybrid loss, L_H = λ·L_CTC + η(1 − λ)·L_CE, where L_CTC and L_CE are the CTC loss from DS2 and the cross-entropy loss on the noise labels, respectively. η and λ are scaling factors, and η is annealed by a factor of 1.05 every epoch. In our experiments, we initialized the model using Soft-Freeze DAT and set λ = 0. and η = 10. (We observed that, when starting from the baseline model, the initial 10-15 epochs behaved similarly to DAT; hence, Soft-Freeze DAT served as a good initialization and led to faster convergence.)

4.4. Adversarial Training (AvT)

Contrary to MTL, where the model jointly minimizes the CTC loss and the noise classification loss, adversarial training invokes the use of a gradient reversal layer (GRL) [20] before the auxiliary classifier, as shown in Fig. 1. This forces the representations before the GRL to be noise-invariant, thus making it hard for the noise classifier to distinguish between noise types. (A setup similar to AvT was explored in [1] using the Musan corpus [21]; due to the lack of noise labels, their classifier output two labels: clean and noisy.) AvT had to be carefully trained, with different learning rates λ_f, λ_r and λ_n corresponding to the feature extractor, recognition model and noise classifier, respectively. For our model, the base learning rate was set to . , with λ_f = 0. , λ_r = 0. and λ_n = 1. Similar to MTL, we initialized the model with Soft-Freeze DAT. A minimal sketch of the GRL and the hybrid loss follows.
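The sketch below shows the two pieces shared by MTL and AvT: the hybrid loss L_H = λ·L_CTC + η(1 − λ)·L_CE, and a gradient reversal layer that, when placed in front of the noise classifier, converts the MTL setup into AvT. Module and tensor names are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, alpha=1.0):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Gradient w.r.t. x is reversed; alpha gets no gradient.
        return -ctx.alpha * grad_output, None

ctc_loss = nn.CTCLoss()
ce_loss = nn.CrossEntropyLoss()

def hybrid_loss(log_probs, targets, in_lens, tgt_lens,
                noise_logits, noise_labels, lam, eta):
    """L_H = lam * L_CTC + eta * (1 - lam) * L_CE.

    In the paper, eta is annealed by a factor of 1.05 every epoch.
    """
    l_ctc = ctc_loss(log_probs, targets, in_lens, tgt_lens)
    l_ce = ce_loss(noise_logits, noise_labels)
    return lam * l_ctc + eta * (1.0 - lam) * l_ce

# MTL: noise_logits = classifier(features)
# AvT: noise_logits = classifier(GradReverse.apply(features))
```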
5. EXPERIMENTS AND RESULTS
Table 2 lists an exhaustive comparison of all previously mentioned techniques on seven different noise types and five different SNR values. We also report the WER on clean speech to observe the degradation incurred by each noise adaptation technique. We make a number of observations. Among the SE techniques, DEMUCS, which performs best on objective SE scores (as shown in Table 1), also performs best in the ASR experiments. SE-VCAE suffers from distortion of content in the speech signal, which is also reflected in its large degradation of the clean speech WER. We observe that Deep Xi is outperformed (in most noise conditions, and otherwise matched) by techniques as simple as DAT. The mismatch between the numbers reported for Deep Xi in [13] and ours could be attributed to the difference in the sizes of the noise datasets; their dataset was much larger than ours. Interestingly, even with the relatively small amount of noise samples in our dataset, DEMUCS is able to generalize well. On the relatively stationary noise types, namely 'Car', 'Metro' and 'Traffic' (Noise A), DEMUCS outperforms all techniques, including all the model-adaptation techniques. For the relatively non-stationary noise types, namely 'Babble', 'Airport/Station', 'Cafe' and 'AC/Vacuum' (Noise B), DEMUCS and MTL are statistically very close. AvT yields the largest reductions in WER on Noise B samples (with few exceptions).

Method           | Babble 0/5/10/15/20 dB     | Airport/Station 0/5/10/15/20 | AC/Vacuum 0/5/10/15/20    | Cafe 0/5/10/15/20
Baseline         | 104.2/98.3/91.3/79.7/65.0  | 91.9/84.1/73.7/60.6/50.0     | 93.0/83.1/71.5/59.5/45.8  | 83.8/72.7/59.5/44.3/33.4
SE-VCAE          | 85.6/76.4/61.9/54.7/39.7   | 78.0/68.3/56.8/46.3/39.3     | 81.3/71.1/61.3/53.6/42.7  | 61.6/53.9/44.9/35.8/31.0
Deep Xi          | 81.4/69.4/54.0/44.5/31.9   | 71.4/60.9/46.5/37.8/27.4     | 73.9/58.2/45.4/35.1/27.0  | 52.3/39.7/32.8/25.0/20.4
DEMUCS           | -/-/-/-/-                  | -/-/-/-/-                    | -/-/-/-/-                 | -/-/-/-/-
Vanilla DAT      | 80.6/68.1/53.6/41.8/30.3   | 67.1/55.4/41.9/31.2/24.9     | 66.4/49.8/38.3/31.3/24.5  | 52.8/41.6/34.5/24.5/19.2
Soft-Freeze DAT  | 77.4/65.5/52.2/38.5/28.3   | 64.2/52.9/39.0/29.2/23.7     | 63.1/46.8/37.1/30.2/23.9  | 49.1/40.1/33.1/24.4/19.0
MTL              | 71.4/58.8/45.9/35.5/25.8   | 55.7/46.8/35.3/26.2/-        | -/-/-/-/-                 | -/-/-/-/-
AvT              | -/-/-/-/-                  | -/-/-/-/-                    | -/-/-/-/-                 | -/-/-/-/-

Method           | Traffic 0/5/10/15/20 dB    | Metro 0/5/10/15/20           | Car 0/5/10/15/20          | Clean WER
Baseline         | 72.4/62.5/50.2/41.0/33.6   | 68.4/54.4/46.4/34.9/27.6     | 35.0/28.1/24.3/21.7/16.7  | 10.3
SE-VCAE          | 60.0/51.9/44.2/39.7/32.5   | 54.0/43.6/38.6/33.0/29.6     | 35.4/32.7/28.3/27.3/26.0  | 15.9
Deep Xi          | 48.0/40.6/29.8/26.0/22.7   | 44.8/30.5/28.1/20.2/20.5     | 23.0/19.4/15.4/16.0/14.1  | 10.9
DEMUCS           | -/-/-/-/-                  | -/-/-/-/-                    | -/-/-/-/-                 | -
Vanilla DAT      | -/-/-/-/-                  | -/-/-/-/-                    | -/-/-/-/-                 | -
Soft-Freeze DAT  | -/-/-/-/-                  | -/-/-/-/-                    | -/-/-/-/-                 | -
MTL              | -/-/-/-/-                  | -/-/-/-/-                    | -/-/-/-/-                 | -
AvT              | -/-/-/-/-                  | -/-/-/-/-                    | -/-/-/-/-                 | -

Table 2. Comparison of the performance (WER % after greedy decoding) of all techniques for various noise types and SNRs in the test set. The lowest SE and E2E WERs are shown in bold, and the lower WER among the two is highlighted in green.

Another important distinction between the SE techniques and the model-based techniques is that the SE techniques rely on high-quality pretrained models as a starting point; with our small noise dataset, training an SE model from scratch would not be an option. In contrast, our model-based techniques are expressive enough to learn from our limited noise dataset without any prior pretraining.

The overall takeaways from this investigation are the following: 1) Among the SE techniques, DEMUCS clearly outperforms the other two by large margins. 2) Among the model-based techniques, AvT is largely the best-performing technique across all noise types, with some exceptions. While adversarial training drives the representations to be noise-invariant and helps the noisy speech WERs (evident in low SNR conditions), it has an adverse effect on clean speech and high SNR conditions: AvT incurs the highest clean speech WER (after SE-VCAE) and sometimes comes second to MTL in high SNR conditions. 3) MTL and AvT are both significantly better than the DAT techniques. In summary, either DEMUCS or AvT would be a good choice for noise adaptation, but the underlying noise type should also factor into the choice. Table 3 provides an ablation analysis justifying our choices of which LSTM layers to soft-freeze in Soft-Freeze DAT, which LSTM layer feeds the noise classifier in MTL, and the value of λ_r in AvT. The other hyperparameters were less influential and were selected based on performance on the development set.
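All WERs in Tables 2 and 3 are computed after greedy (best-path) CTC decoding. For reference, a minimal version of greedy decoding and WER computation is sketched below; the blank index, the `idx2char` mapping, and the function names are our illustrative assumptions, not the DS2 decoder.

```python
import numpy as np

BLANK = 0  # assumed CTC blank index

def greedy_ctc_decode(log_probs, idx2char):
    """Best-path decoding: argmax per frame, collapse repeats, drop blanks."""
    best = log_probs.argmax(axis=-1)  # (T,) frame-wise argmax
    out, prev = [], BLANK
    for idx in best:
        if idx != prev and idx != BLANK:
            out.append(idx2char[idx])
        prev = idx
    return "".join(out)

def wer(ref, hyp):
    """Word error rate via edit distance over words."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1, j - 1] + (r[i - 1] != h[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(r), len(h)] / max(len(r), 1)
```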
6. CONCLUSIONS
In this work, we present a detailed comparison of three speech enhancement techniques and three model-based adaptation techniques for robust E2E ASR across a set of diverse noise types. We observe different trends for different noise types; while adversarial learning yields the largest improvements in performance on non-stationary noise types, a new SE technique, DEMUCS, gives the best results on relatively stationary noise types. In future work, we aim to extend our analysis to transformer-based ASR systems and existing noisy datasets.
Method                                 | Clean WER | Babble 0/15 | Airport/Station 0/15 | Cafe 0/15 | AC/Vacuum 0/15 | Car 0/15  | Metro 0/15 | Traffic 0/15
Soft-Freeze DAT (LSTM )                | 11.13     | 79.2/40.5   | 64.7/30.6            | 51.6/-    | -/-            | -/-       | -/-        | -/-
Soft-Freeze DAT (LSTM + LSTM )         | -         | -/-         | -/-                  | -/-       | -/-            | -/-       | -/-        | -/-
Soft-Freeze DAT (LSTM + LSTM + LSTM )  | 10.97     | 79.4/40.4   | 66.3/30.8            | 52.0/24.3 | 63.4/-         | -/-       | -/-        | -/-
Soft-Freeze DAT ( )                    | 11.05     | -/-         | -/-                  | -/-       | -/-            | -/-       | -/-        | -/-
MTL (noise classifier after LSTM )     | 11.24     | 75.8/38.7   | 60.4/27.5            | 49.0/23.5 | 60.0/28.2      | 23.0/15.7 | 40.4/21.6  | 46.0/22.4

Method                                 | Clean WER | Babble 15/20 | Airport/Station 15/20 | Cafe 15/20 | AC/Vacuum 15/20 | Car 15/20 | Metro 15/20 | Traffic 15/20
AvT (λ_r = 0. )                        | 13.79     | 31.9/26.2    | 27.8/22.1             | 21.9/18.9  | 29.1/23.5       | 17.1/16.2 | 20.7/19.4   | 21.9/19.2
AvT (λ_r = 0. )                        | 13.37     | -/-          | -/-                   | -/-        | -/-             | -/-       | -/-         | -/-
AvT (λ_r = 0. )                        | -         | -/-          | -/-                   | -/-        | -/-             | -/-       | -/-         | -/-

Table 3. WER % after greedy decoding under different settings for DAT, MTL and AvT. The best numbers are shown in bold.

7. REFERENCES

[1] Davis Liang, Zhiheng Huang, and Zachary C. Lipton, "Learning noise-invariant representations for robust speech recognition," in Proc. IEEE SLT, 2018, pp. 56–63.

[2] Shucong Zhang, Cong-Thanh Do, Rama Doddipatla, and Steve Renals, "Learning noise invariant features through transfer learning for robust end-to-end speech recognition," in Proc. IEEE ICASSP, 2020, pp. 7024–7028.

[3] Zixing Zhang, Jürgen Geiger, Jouni Pohjalainen, Amr El-Desoky Mousa, Wenyu Jin, and Björn Schuller, "Deep learning for environmentally robust speech recognition: An overview of recent developments," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 9, no. 5, pp. 1–28, 2018.

[4] Michael L. Seltzer, Dong Yu, and Yongqiang Wang, "An investigation of deep neural networks for noise robust speech recognition," in Proc. IEEE ICASSP, 2013, pp. 7398–7402.

[5] Arun Narayanan and DeLiang Wang, "Joint noise adaptive training for robust automatic speech recognition," in Proc. IEEE ICASSP, 2014, pp. 2504–2508.

[6] Yanmin Qian and Philip C. Woodland, "Very deep convolutional neural networks for robust speech recognition," in Proc. IEEE SLT, 2016, pp. 481–488.

[7] David Pearce and J. Picone, "Aurora working group: DSR front end LVCSR evaluation AU/384/02," Inst. for Signal & Inform. Process., Mississippi State Univ., Tech. Rep., 2002.

[8] Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L. Seltzer, and Sanjeev Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proc. IEEE ICASSP, 2017, pp. 5220–5224.

[9] Wei-Ning Hsu, Yu Zhang, and James Glass, "Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation," in Proc. IEEE ASRU, 2017, pp. 16–23.

[10] Bin Liu, Shuai Nie, Shan Liang, Wenju Liu, Meng Yu, Lianwu Chen, Shouye Peng, and Changliang Li, "Jointly adversarial enhancement training for robust end-to-end speech recognition," in Proc. Interspeech, 2019, pp. 491–495.

[11] Santiago Pascual, Antonio Bonafonte, and Joan Serrà, "SEGAN: Speech enhancement generative adversarial network," in Proc. Interspeech, 2017, pp. 3642–3646.

[12] Daniel T. Braithwaite and W. Bastiaan Kleijn, "Speech enhancement with variance constrained autoencoders," in Proc. Interspeech, 2019, pp. 1831–1835.

[13] Aaron Nicolson and Kuldip K. Paliwal, "Deep Xi as a front-end for robust automatic speech recognition," arXiv preprint arXiv:1906.07319, 2019.

[14] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, 2016, pp. 173–182.

[15] Alexandre Defossez, Gabriel Synnaeve, and Yossi Adi, "Real time speech enhancement in the waveform domain," in Proc. Interspeech, 2020.

[16] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. IEEE ICASSP, 2015, pp. 5206–5210.

[17] Archiki Prasad and Preethi Jyothi, "How accents confound: Probing for accent information in end-to-end speech recognition systems," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3739–3753.

[18] Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent, "DEMAND: A collection of multi-channel recordings of acoustic noise in diverse environments," in Proc. Meetings Acoust., 2013.

[19] Frederic Font, Gerard Roma, and Xavier Serra, "Freesound technical demo," in Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 411–412.

[20] Yaroslav Ganin and Victor Lempitsky, "Unsupervised domain adaptation by backpropagation," in International Conference on Machine Learning, 2015, pp. 1180–1189.

[21] David Snyder, Guoguo Chen, and Daniel Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.