MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition
Linghui Meng, Jin Xu, Xu Tan, Jindong Wang, Tao Qin, Bo Xu

Institute of Automation, Chinese Academy of Sciences, China
School of Artificial Intelligence, University of Chinese Academy of Sciences, China
Institute for Interdisciplinary Information Sciences, Tsinghua University, China
Microsoft Research Asia

{menglinghui2019, xubo}@ia.ac.cn, [email protected], {xuta, jindong.wang, taoqin}@microsoft.com

ABSTRACT
In this paper, we propose MixSpeech, a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR). MixSpeech trains an ASR model by taking a weighted combination of two different speech features (e.g., mel-spectrograms or MFCCs) as the input and recognizing both text sequences, where the two recognition losses use the same combination weight. We apply MixSpeech to two popular end-to-end speech recognition models, LAS (Listen, Attend and Spell) and the Transformer, and conduct experiments on several low-resource datasets including TIMIT, WSJ, and HKUST. Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation, and outperforms the strong data augmentation method SpecAugment on these recognition tasks. Specifically, MixSpeech outperforms SpecAugment with a relative PER improvement of 10.6% on the TIMIT dataset, and achieves a strong WER of 4.7% on the WSJ dataset.

Index Terms — Speech Recognition, Data Augmentation, Low-resource, Mixup
1. INTRODUCTION
Automatic speech recognition (ASR) has achieved rapid progress with the development of deep learning. Advanced models such as DNNs [1], CNNs [2], RNNs [3] and end-to-end models [4, 5] deliver better recognition accuracy than conventional hybrid models [6]. However, as a side effect, deep learning-based models require a large amount of labeled training data to combat overfitting and ensure high accuracy, which is especially challenging for speech recognition tasks with limited training data.
This work is supported by the National Key Research and Development Program of China under No. 2017YFB1002102.
Therefore, many data augmentation methods for ASR [7, 8, 9, 10, 11, 12] have been proposed, mainly focusing on augmenting the speech data. For example, speed perturbation [7], pitch adjustment [8], noise addition [9] and vocal tract length perturbation increase the quantity of speech data by adjusting the speed or pitch of the audio, adding noise to the original clean audio, or transforming the spectrograms. Recently, SpecAugment [10] was proposed to mask the mel-spectrogram along the time and frequency axes, and achieves good improvements in recognition accuracy. Furthermore, [13] masks the speech sequence in the time domain according to its alignment with the text to exploit the semantic relationship. As can be seen, most previous methods focus on augmenting the speech input while keeping the corresponding label (text) unchanged, which requires careful tuning of the augmentation policy. For example, SpecAugment needs many hyper-parameters (the time warp parameter W, the time and frequency mask parameters T and F, a time mask upper bound p, and the numbers of time and frequency masks m_T and m_F) to determine how to perform speech augmentation; improper parameters may cause too much information loss so that the text cannot be generated correctly, or may change the speech too little and thus have no augmentation effect.

Recently, the mixup technique [14] was proposed to overcome the generalization limitations of empirical risk minimization (ERM) [15]. Vanilla mixup randomly draws a weight from a distribution and combines two samples and their corresponding labels with the same weight. Recent works apply mixup to different tasks, including image classification [16] and sequence classification [17]. Different from classification tasks with one-hot labels, where mixup can be easily incorporated, conditional sequence generation tasks such as automatic speech recognition, image captioning, and handwritten text recognition cannot directly apply mixup to the target sequences because of their different lengths. [18] directly applies mixup to train a neural acoustic model such as LF-MMI [19], which simply combines inputs and targets frame by frame with a shared mixture weight, because in that setting the input features and labels are aligned frame-wise.

In this paper, we propose MixSpeech, a simple yet effective data augmentation method for automatic speech recognition. MixSpeech trains an ASR model by taking a weighted combination of two different speech sequences (e.g., mel-spectrograms or MFCCs) as input and recognizing both text sequences. Different from mixup [14], which uses the weighted combination of two labels as the new label for the mixed input, MixSpeech uses each label to calculate a recognition loss and combines the two losses using the same weight as in the speech input. MixSpeech is much simpler, with only a single hyper-parameter (the combination weight λ), unlike the complicated hyper-parameters used in SpecAugment. Meanwhile, MixSpeech augments the speech input by introducing another speech utterance, which acts as a contrastive signal that forces the ASR model to recognize the correct text of the corresponding speech instead of being misled by the other speech signal. Therefore, MixSpeech is more effective than previous augmentation methods that only change the speech input via masking, warping, or pitch and duration adjustment, without introducing contrastive signals.
2. METHOD
In this section, we first briefly recap the concept of mixup [14], an efficient augmentation approach for single-label tasks. We then introduce MixSpeech, which extends it to handle sequence generation tasks such as ASR. Finally, we describe the two popular ASR models on which MixSpeech is implemented to verify its effectiveness.
Mixup [14] is an effective data augmentation method for supervised learning tasks. It trains a model on convex combinations of pairs of inputs and their targets to make the model more robust to adversarial samples. The mixup procedure can be written as:

    X_mix = λ X_i + (1 − λ) X_j,
    Y_mix = λ Y_i + (1 − λ) Y_j,                                   (1)

where X_i and Y_i are the input and target of the i-th data sample, X_mix and Y_mix represent the mixed data obtained by combining a pair of data samples (i and j), and λ ∼ Beta(α, α) with α ∈ (0, ∞) is the combination weight. Mixup can be easily applied to classification tasks where the target is a single label.
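To make Eq. 1 concrete, the following is a minimal sketch of vanilla mixup for a classification batch; it is our own PyTorch-style illustration rather than code from [14], and the tensor shapes and default α are assumptions:

```python
import torch

def mixup_batch(x, y, alpha=0.5):
    """Vanilla mixup (Eq. 1): convex combination of paired inputs and labels.

    x: (batch, ...) input features; y: (batch, num_classes) one-hot or soft labels.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))           # pair each sample i with a random partner j
    x_mix = lam * x + (1.0 - lam) * x[perm]    # X_mix = lambda * X_i + (1 - lambda) * X_j
    y_mix = lam * y + (1.0 - lam) * y[perm]    # Y_mix = lambda * Y_i + (1 - lambda) * Y_j
    return x_mix, y_mix, lam
```

The same λ weights both the inputs and the labels, which is exactly what breaks down for ASR targets of different lengths, as discussed next.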
Fig. 1. The pipeline of MixSpeech. X_mix is the mixture of two mel-spectrograms, X_i and X_j. Loss_i is calculated from the output of X_mix and the corresponding label Y_i; Loss_mix is the weighted combination of the two losses Loss_i and Loss_j.
However, it is difficult to apply mixup to sequence generation tasks such as ASR for the following reasons: 1) two text sequences (Y_i, Y_j) may have different lengths and thus cannot be directly mixed up as in Eq. 1; 2) text sequences are discrete and cannot be directly added; 3) if two text sequences were added in the embedding space, the model would learn to predict a mixture embedding of the two text sequences, which would confuse the ASR model and may hurt performance, since the final goal is to recognize a single text from the speech input.

To ensure effective mixup for the sequential data (speech and text) in ASR, we propose MixSpeech, which mixes two speech sequences in the input and mixes the two loss functions regarding the text output, as shown in Fig. 1. The formulation of MixSpeech is as follows:

    X_mix = λ X_i + (1 − λ) X_j,
    L_i = L(X_mix, Y_i),
    L_j = L(X_mix, Y_j),
    L_mix = λ L_i + (1 − λ) L_j,                                   (2)

where X_i and Y_i are the input speech sequence and target text sequence of the i-th sample, and X_mix is the mixed speech sequence obtained by adding the two speech sequences frame-wise with weight λ. L(·, ·) calculates the ASR loss (which also includes the recognition process), and L_mix combines the two losses with the same weight λ as used for the speech input during training. Following the original mixup, we choose λ ∼ Beta(α, α) with α ∈ (0, ∞).

LAS [4] and the Transformer [20] architecture have been widely used and achieve strong performance on ASR tasks. In this paper, we implement MixSpeech on these two popular models to demonstrate its effectiveness. Both models leverage the joint CTC-attention [21] structure for training, which consists of two loss functions: 1) the CTC (Connectionist Temporal Classification) [22] loss on the encoder output, and 2) the cross-entropy (CE) loss on the decoder output at each timestep. To train the CTC part, a sequence of labels, denoted as l, is used to compute the loss with an efficient algorithm such as the Baum-Welch (forward-backward) algorithm [23]. The CTC loss can be written as:

    L_CTC(ŷ, l) = − log Σ_{π ∈ β⁻¹(l)} ∏_{t=1}^{T} y^t_{π_t},       (3)

where β is the function that removes repeated and blank labels, π is an intermediate label sequence containing blanks, and l is the target sequence without blanks. The cross-entropy loss can be written as:

    L_CE = − Σ_u log P(y_u | x, y_{<u}),                           (4)

where y_u is the u-th target token. In LAS [4], the encoder and decoder are a stacked BiLSTM and an LSTM, respectively; in the Transformer-based structure [20], they are stacks of multi-head attention and feed-forward layers.

During training, following [21], we use multi-task learning by combining the CTC and cross-entropy losses:

    L_MTL = β L_CTC + (1 − β) L_CE,                                (5)

where β ∈ [0, 1] is a tunable hyper-parameter. Combined with MixSpeech in Eq. 2, the final training objective can be written as:

    L_mix = λ L_MTL(X_mix, Y_i) + (1 − λ) L_MTL(X_mix, Y_j).       (6)
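As an illustration of Eq. 2 and Eq. 6, the sketch below computes the MixSpeech objective for one pair of utterances. It is not the ESPnet implementation; the `model` callable, assumed to return the pair (CTC loss, cross-entropy loss) of the joint CTC-attention model, and the padding assumptions are hypothetical:

```python
import torch

def mixspeech_loss(model, x_i, x_j, y_i, y_j, alpha=0.5, beta=0.3):
    """MixSpeech objective (Eq. 2 and Eq. 6) for one pair of utterances.

    x_i, x_j: padded speech features of identical shape (batch, time, feat_dim);
    y_i, y_j: the corresponding token sequences.
    model(x, y) is assumed to return (ctc_loss, ce_loss).
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_i + (1.0 - lam) * x_j               # frame-wise mixture of the two inputs

    def mtl_loss(y):                                    # Eq. 5: beta * L_CTC + (1 - beta) * L_CE
        ctc_loss, ce_loss = model(x_mix, y)
        return beta * ctc_loss + (1.0 - beta) * ce_loss

    loss_i = mtl_loss(y_i)                              # recognize the first transcript from X_mix
    loss_j = mtl_loss(y_j)                              # recognize the second transcript from X_mix
    return lam * loss_i + (1.0 - lam) * loss_j          # Eq. 6
```

Note that only the input mixing requires the two feature sequences to be padded to the same length; the two transcripts are never mixed, so their different lengths pose no problem.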
3. EXPERIMENTAL RESULTS

3.1. Datasets
We evaluate our proposed MixSpeech on several datasets including TIMIT [24], WSJ [25] and the Mandarin ASR corpus HKUST [26]. For the TIMIT dataset, we use the standard train/test split, randomly sampling 10% of the training set for validation and keeping the rest for training. WSJ [25] is a corpus of read news articles with approximately 81 hours of clean data; following standard practice, we use Dev93 for development and Eval92 for evaluation. For the HKUST dataset, the development set contains 4,000 utterances (5 hours) extracted from the original training set of 197,387 utterances (173 hours), the rest are used for training, and 5,413 utterances (5 hours) are used for evaluation. The speech in these corpora is sampled at 16 kHz, and fbank features are extracted as the model input: 23-dimensional fbank features per frame for TIMIT, and 80-dimensional fbank features per frame for the other two datasets, following common practice.

We implement MixSpeech based on the ESPnet codebase (https://github.com/espnet/espnet). MixSpeech is designed to extract and separate targets from mixed inputs in order to train a more generalizable model, which is a harder task than training without MixSpeech. Thus we give more training time to the model enhanced with MixSpeech (more training time for the baseline model does not yield better results according to our preliminary experiments). The baselines of LAS [4] and Transformer [20] are implemented with the following settings. For LAS (Listen, Attend and Spell), the encoder is a 4-layer Bi-LSTM with width 320 and the decoder is a single-layer LSTM with width 300. For the Transformer-based architecture, the encoder has 12 layers and the decoder has 6 layers with width 256, where each layer is composed of multi-head attention and a fully connected layer. The CTC and decoder cross-entropy losses are used for joint decoding with beam search (beam size 20). The default α of the Beta distribution is set to 0.5 following [14], and the multi-task learning parameter β is set to 0.3. In practice, we randomly select part of the paired data within one batch to train the model with MixSpeech and use the remaining data in the batch to train the model as usual. The proportion of a batch trained with MixSpeech is denoted as τ, which is set to 15% by default and achieves good performance according to our experiments in Table 4.
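To illustrate how only a proportion τ of each batch might be routed through MixSpeech, here is a sketch under our own assumptions (padded feature and label tensors, the hypothetical `mixspeech_loss` from the previous sketch, and a `model(x, y)` returning the CTC and cross-entropy losses); it is not the actual ESPnet training loop:

```python
import torch

def batch_loss(model, feats, tokens, tau=0.15, alpha=0.5, beta=0.3):
    """Training loss for one batch: MixSpeech on roughly a proportion `tau`
    of the utterances, the ordinary multi-task objective on the rest.
    """
    n = feats.size(0)
    mix_mask = torch.rand(n) < tau
    if mix_mask.sum() < 2:                  # need at least two utterances to form mix pairs
        mix_mask[:] = False
    mix_idx = mix_mask.nonzero(as_tuple=True)[0]
    rest_idx = (~mix_mask).nonzero(as_tuple=True)[0]

    losses = []
    if mix_idx.numel() > 0:
        partner = mix_idx[torch.randperm(mix_idx.numel())]   # shuffled partners within the subset
        losses.append(mixspeech_loss(model, feats[mix_idx], feats[partner],
                                     tokens[mix_idx], tokens[partner], alpha, beta))
    if rest_idx.numel() > 0:
        ctc_loss, ce_loss = model(feats[rest_idx], tokens[rest_idx])
        losses.append(beta * ctc_loss + (1.0 - beta) * ce_loss)  # ordinary loss (Eq. 5)
    return torch.stack(losses).mean()
```

A per-utterance random mask is one simple way to realize the proportion τ; the paper does not prescribe a specific selection scheme beyond randomly choosing part of the paired data in each batch.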
We compare MixSpeech with two settings: 1) models trained without MixSpeech (Baseline) and 2) models trained with SpecAugment [10], an effective data augmentation technique. Table 1 shows the performance of MixSpeech with the LAS model on the TIMIT dataset, compared with the Baseline and SpecAugment. Table 2 further compares our method with the Baseline and SpecAugment on three different low-resource datasets with the Transformer model. We find that MixSpeech improves the baseline model on the TIMIT dataset relatively by 10.6% with LAS and 6.9% with the Transformer. As shown in Table 2, MixSpeech achieves 4.7% WER on WSJ, outperforming both the baseline model alone and the baseline model with SpecAugment. MixSpeech also achieves better performance on HKUST [26], a large Mandarin corpus. The results demonstrate that MixSpeech consistently improves recognition accuracy across datasets with different languages and data sizes.

Table 1. The phone error rate (PER) results of LAS (Listen, Attend and Spell), and LAS enhanced with SpecAugment or MixSpeech on the TIMIT dataset.

Method          PER
Baseline        21.8%
+SpecAugment    20.5%
+MixSpeech

Table 2. The results of Transformer, and Transformer enhanced with SpecAugment or MixSpeech on the TIMIT, WSJ, and HKUST datasets. The metrics are phone error rate (PER) for TIMIT and word error rate (WER) for WSJ and HKUST.
Method          TIMIT (PER)    WSJ (WER)    HKUST (WER)
Baseline        23.1%          5.9%         23.9%
+SpecAugment    22.5%          5.4%         23.5%
+MixSpeech
Both the LAS and Transformer based models leverage multi-task learning to boost performance, and β in Equation 5 is a hyper-parameter that adjusts the weight between the two losses. When β is set to 0 or 1, only the cross-entropy loss or only the CTC loss is used, respectively. We vary β and report the results of the baseline in Table 3. The results show that the model cannot perform well with only one objective and requires the alignment information provided by the CTC loss to help the attention decoder. In Table 3, β = 0.3 gives the lowest PER, and is therefore set as the default for the baselines and the models enhanced with MixSpeech.

Table 3. The phone error rate (PER) results obtained by varying the multi-task learning parameter β on the TIMIT dataset.

β       PER

As mentioned above, we randomly select a proportion (τ) of the paired data within one batch to train the model with MixSpeech and use the other data within the batch to train the model as usual. We further study whether the proportion τ has much influence on the results, as shown in Table 4. We can see that the model enhanced with MixSpeech consistently outperforms the baseline (τ = 0) as τ varies, which shows that MixSpeech is not sensitive to τ.

Table 4. The phone error rate (PER) results obtained by varying the proportion τ of a batch used for MixSpeech on the TIMIT dataset.

τ       0%      15%     20%     30%
PER

Rather than mixing two inputs, we can also extend mixup to three inputs (Tri-mixup), setting the weight λ to 1/3 ([14] points out that a more complex prior such as the Dirichlet distribution for generating λ fails to provide further gains while increasing the computation cost, so we simply use λ = 1/3). Similar to MixSpeech, the proportion τ for Tri-mixup is also set to 15%. Table 5 shows that the performance of Tri-mixup drops dramatically compared with the baseline and MixSpeech. Tri-mixup may introduce too much information beyond the original inputs, which hinders the model from learning useful patterns from the data and leads to the performance drop.

Table 5. The phone error rate (PER) results of different data augmentation methods including Tri-mixup, Noise Regularization and MixSpeech on the TIMIT dataset. The Baseline refers to the Transformer model without data augmentation. Tri-mixup means that we select three inputs and conduct mixup with weight λ = 1/3.

Method      PER
Noise regularization [27, 28, 29], which adds noise to the inputs, is a common data augmentation method used to make the model more robust to noise and more generalizable to unseen data. To compare it with MixSpeech, we add Gaussian noise with an SNR of 5 dB to the training set and report the results in Table 5. We can see that MixSpeech outperforms noise regularization.
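For reference, adding white Gaussian noise to a waveform at a target SNR of 5 dB can be sketched as below; this is our own minimal illustration of such noise regularization, not the exact recipe from [27, 28, 29]:

```python
import numpy as np

def add_gaussian_noise(waveform, snr_db=5.0):
    """Add white Gaussian noise to a waveform at the given signal-to-noise ratio (dB)."""
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))   # SNR = 10 * log10(P_signal / P_noise)
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise
```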
4. CONCLUSION AND FUTURE WORK
In this paper, we proposed MixSpeech, a new data augmentation method that applies the mixup technique to ASR tasks in low-resource scenarios. Experimental results show that our method improves recognition accuracy compared with the baseline model and the previous data augmentation method SpecAugment, and achieves competitive WER performance on the WSJ dataset. In the future, we will consider another mixup strategy that concatenates different segments of speech and their corresponding text sequences, which can create more data samples with novel distributions.

5. REFERENCES

[1] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2011.

[2] T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 8614–8618.

[3] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6645–6649.

[4] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.

[5] A. Mohamed, D. Okhonko, and L. Zettlemoyer, "Transformers with convolutional context for ASR," arXiv preprint arXiv:1904.11660, 2019.

[6] E. Trentin and M. Gori, "A survey of hybrid ANN/HMM models for automatic speech recognition," Neurocomputing, vol. 37, no. 1–4, pp. 91–126, 2001.

[7] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[8] S. Shahnawazuddin, A. Dey, and R. Sinha, "Pitch-adaptive front-end features for robust children's ASR," in INTERSPEECH, 2016, pp. 3459–3463.

[9] L. Tóth, G. Kovács, and D. Van Compernolle, "A perceptually inspired data augmentation method for noise robust CNN acoustic models," in International Conference on Speech and Computer. Springer, 2018, pp. 697–706.

[10] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," arXiv preprint arXiv:1904.08779, 2019.

[11] Y. Ren, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "Almost unsupervised text to speech and automatic speech recognition," in International Conference on Machine Learning. PMLR, 2019, pp. 5410–5419.

[12] J. Xu, X. Tan, Y. Ren, T. Qin, J. Li, S. Zhao, and T.-Y. Liu, "LRSpeech: Extremely low-resource speech synthesis and recognition," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 2802–2812.

[13] C. Wang, Y. Wu, Y. Du, J. Li, S. Liu, L. Lu, S. Ren, G. Ye, S. Zhao, and M. Zhou, "Semantic mask for transformer based end-to-end speech recognition," arXiv preprint arXiv:1912.03010, 2019.

[14] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," arXiv preprint arXiv:1710.09412, 2017.

[15] V. N. Vapnik, "An overview of statistical learning theory," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 988–999, 1999.

[16] Z. Zhang, S. Xu, S. Cao, and S. Zhang, "Deep convolutional neural network with mixup for environmental sound classification," in Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 2018, pp. 356–367.

[17] H. Guo, Y. Mao, and R. Zhang, "Augmenting data with mixup for sentence classification: An empirical study," arXiv preprint arXiv:1905.08941, 2019.

[18] I. Medennikov, Y. Y. Khokhlov, A. Romanenko, D. Popov, N. A. Tomashenko, I. Sorokin, and A. Zatvornitskiy, "An investigation of mixup training strategies for acoustic models in ASR," in Interspeech, 2018, pp. 2903–2907.

[19] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Interspeech, 2016, pp. 2751–2755.

[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[21] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4835–4839.

[22] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.

[23] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," The Annals of Mathematical Statistics, vol. 41, no. 1, pp. 164–171, 1970.

[24] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, 1993.

[25] D. B. Paul and J. Baker, "The design for the Wall Street Journal-based CSR corpus," in Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23–26, 1992, 1992.

[26] Y. Liu, P. Fung, Y. Yang, C. Cieri, S. Huang, and D. Graff, "HKUST/MTS: A very large scale Mandarin telephone speech corpus," in International Symposium on Chinese Spoken Language Processing. Springer, 2006, pp. 724–735.

[27] M. L. Seltzer, D. Yu, and Y. Wang, "An investigation of deep neural networks for noise robust speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 7398–7402.

[28] D. Yu, L. Deng, J. Droppo, J. Wu, Y. Gong, and A. Acero, "A minimum-mean-square-error noise reduction algorithm on mel-frequency cepstra for robust speech recognition," in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.