Internal Language Model Training for Domain-Adaptive End-to-End Speech Recognition
Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy, Eric Sun, Liang Lu, Xie Chen, Jinyu Li, Yifan Gong
Microsoft Corporation, Redmond, WA, USA
ABSTRACT
The efficacy of external language model (LM) integration with existing end-to-end (E2E) automatic speech recognition (ASR) systems can be improved significantly using the internal language model estimation (ILME) method [1]. In this method, the internal LM score is subtracted from the score obtained by interpolating the E2E score with the external LM score during inference. To improve the ILME-based inference, we propose an internal LM training (ILMT) method to minimize an additional internal LM loss by updating only the E2E model components that affect the internal LM estimation. ILMT encourages the E2E model to form a standalone LM inside its existing components, without sacrificing ASR accuracy. After ILMT, the more modular E2E model with matched training and inference criteria enables a more thorough elimination of the source-domain internal LM, and therefore leads to a more effective integration of the target-domain external LM. Evaluated with 30K-hour trained recurrent neural network transducer and attention-based encoder-decoder models, ILMT with ILME-based inference achieves up to 31.5% and 11.4% relative word error rate reductions from standard E2E training with Shallow Fusion on out-of-domain LibriSpeech and in-domain Microsoft production test sets, respectively.
Index Terms — Speech recognition, language model, recurrent neural network transducer, attention-based encoder-decoder
1. INTRODUCTION
End-to-end (E2E) automatic speech recognition (ASR) has achieved state-of-the-art performance by directly mapping speech to word sequences through a single neural network. The most popular E2E models include connectionist temporal classification [2, 3, 4, 5], recurrent neural network transducer (RNN-T) [6, 7, 8, 9], and attention-based encoder-decoder (AED) models [10, 11, 12, 13, 14]. However, E2E models tend to overfit the audio-transcript pairs in the source-domain training data, and suffer from performance degradation when evaluated in a mismatched target domain. Numerous ideas have been explored to adapt ASR models, such as regularization methods [15, 16, 17, 18, 19], teacher-student learning [20, 21, 22, 23], transformation methods [24, 25, 26], and adversarial learning [27, 28, 29, 30]. Nevertheless, all these methods require audio data for adaptation when applied to E2E models [31, 32, 33]. One promising solution without using audio is to train a language model (LM) with a large amount of text readily available in the target domain and fuse it with the E2E model during inference. However, an E2E model does not have a modular LM component as in a traditional hybrid system [34], making external LM integration a challenging task.

Among the many approaches proposed for LM integration, Shallow Fusion [3, 35, 36, 37] is a simple yet effective method in which the log probabilities of the E2E model and the LM are linearly interpolated during inference. Towards a better integration, the Density Ratio method [38, 39] subtracts the source-domain LM score from the interpolated score of Shallow Fusion, and shows improved performance. Further, as a new type of E2E model, the hybrid autoregressive transducer (HAT) was proposed in [40] to preserve the modularity of a traditional hybrid system. HAT allows us to estimate the internal LM scores and subtract them from the Shallow Fusion scores for external LM integration.

More recently, we proposed an internal LM estimation (ILME) method in [1] to facilitate the integration of an external LM with any pre-existing E2E model, including RNN-T and AED models, without any additional training. With ILME-based inference, the internal LM score of an E2E model is estimated by eliminating the contribution of the acoustic encoder, and is then subtracted from the log-linear interpolation between the E2E and external LM scores. However, ILME-based inference in [1] is performed with an ASR model trained to optimize a standard E2E loss by updating all model parameters. The accuracy of the internal LM estimation is not guaranteed when the E2E model is not structured in a way that strictly satisfies the conditions of Proposition 1 in [40, Appendix A].

To compensate for the mismatch between the E2E training and the ILME-based inference, we propose an internal LM training (ILMT) of the E2E model to minimize an additional internal LM loss by updating only the model components engaged in the prediction of internal LM scores during inference. ILMT encourages the E2E model to form a standalone LM inside its existing components while maintaining ASR accuracy. ILMT improves the effectiveness of the ILME-based LM integration with a more modular E2E model and well-aligned training and inference criteria. Evaluated with 30 thousand (K)-hour trained RNN-T and AED models, ILMT with ILME-based inference achieves up to 31.5% and 11.4% relative word error rate (WER) reductions from Shallow Fusion on cross- and intra-domain evaluations, respectively, far outperforming the reductions with standard E2E training.
2. RELATED E2E METHODS
An E2E model predicts the conditional distribution $P(Y|X; \theta_{\text{E2E}})$ of token sequences $Y = \{y_1, \ldots, y_U\}$ given a speech-feature sequence $X = \{x_1, \ldots, x_T\}$ as the input, where $y_u \in \mathcal{V}$ and $x_t$ is a feature vector at time $t$. $\mathcal{V}$ is the set of all possible output tokens, i.e., word pieces. We insert a start-of-sentence token $y_0 = \langle\text{sos}\rangle$ at the beginning of each token sequence $Y$.
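As a minimal illustration of this notation, the sketch below builds a toy token sequence $Y$ and feature sequence $X$. The toy vocabulary, the ID assignments, and the 80-dimensional features are illustrative assumptions, not the paper's actual word-piece inventory or front end.

```python
import torch

# Toy word-piece vocabulary V (illustrative only; the paper uses 3999 BPE units).
vocab = {"<sos>": 0, "_hel": 1, "lo": 2, "_world": 3}

Y = [vocab["_hel"], vocab["lo"], vocab["_world"]]   # token sequence y_1, ..., y_U
Y_in = [vocab["<sos>"]] + Y                          # prepend y_0 = <sos> for the decoder / prediction net
X = torch.randn(120, 80)                             # T = 120 feature vectors x_t (80-dim, assumed)

print(len(Y), len(Y_in), tuple(X.shape))             # U, U + 1, (T, feature_dim)
```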
3. INTERNAL LM ESTIMATION (ILME)
From audio-transcript training pairs, an E2E model implicitly learns an internal language model (LM) that characterizes the distribution of the source-domain training text. The exact computation of the internal LM is intractable, but it can be approximated by Proposition 1 in [40, Appendix A], which suggests that the E2E internal LM $P(y_u | Y_{0:u-1}; \theta_{\text{E2E}})$ is approximately equal to the E2E model output $\mathrm{softmax}[J(g_u)]$ after zeroing out the acoustic embedding $f_t$, provided that $P(y_u | X, Y_{0:u-1}; \theta_{\text{E2E}}) = \mathrm{softmax}[J(f_t + g_u)]$ and $J(f_t + g_u) \approx J(f_t) + J(g_u)$ are satisfied, where $g_u$ is a language embedding.

As shown in [1], the conditional probability of the RNN-T internal LM, $P(y_u | Y_{0:u-1}; \theta_{\text{RNN-T}})$, is estimated as a softmax normalization of the non-blank token logits when the hidden states of the encoder are eliminated from the input of the joint network:

$$z_u^{\text{ILM}} = W_j \phi\left(W_p h_u^{\text{pred}} + b_p\right) + b_j, \qquad (8)$$
$$P(y_u | Y_{0:u-1}; \theta_{\text{RNN-T}}) = \mathrm{softmax}\left(z_u^{\text{ILM,NB}}\right), \qquad (9)$$

where $z_u^{\text{ILM}}$ is a $(|\mathcal{V}|+1)$-dimensional vector with a designated logit for the blank token, and $z_u^{\text{ILM,NB}}$ is a logit vector of dimension $|\mathcal{V}|$ created by removing the blank logit from $z_u^{\text{ILM}}$. Without the encoder input, the RNN-T is completely driven by the prediction and joint networks with the token sequence $Y$ as the only input.

Similarly, [1] has also shown that the conditional probability of the AED internal LM is estimated by the decoder output after zeroing out the context vector, i.e.,

$$P(y_u | Y_{0:u-1}; \theta_{\text{AED}}) = \mathrm{softmax}\left[W_d \cdot \mathrm{DecoderRNN}\left(h_{u-1}^{\text{dec}}, e_{u-1}\right) + b_d\right]. \qquad (10)$$

Without the context vector, the AED model is entirely driven by the decoder with the token sequence $Y$ as the only input, acting exactly the same as an RNN-LM.

During ILME-based inference [1], we subtract the log of the internal LM probability $P(Y; \theta_{\text{E2E}})$ from the log-linear combination of the conditional probability of the E2E model and the external LM probability $P(Y; \theta_{\text{LM}})$, and search for the optimal token sequence $\hat{Y}$ via a left-to-right beam search:

$$\hat{Y} = \arg\max_Y \left[\log P(Y|X; \theta_{\text{E2E}}) + \lambda_E \log P(Y; \theta_{\text{LM}}) - \lambda_I \log P(Y; \theta_{\text{E2E}})\right], \qquad (11)$$

where $\lambda_E$ and $\lambda_I$ are the weights of the external and internal LMs, respectively.
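The following PyTorch-style sketch shows one way Eqs. (8), (9), and (11) could be realized for an RNN-T. The module and parameter names (h_pred, W_p, b_p, W_j, b_j) and the choice of tanh for $\phi$ are assumptions for illustration, not the exact implementation used in [1].

```python
import torch
import torch.nn.functional as F

def rnnt_internal_lm_logprob(h_pred, W_p, b_p, W_j, b_j, blank_id):
    """Eqs. (8)-(9): internal LM log-probabilities of an RNN-T.
    h_pred is the prediction-network state h^pred_u computed from the label
    history Y_{0:u-1} only; no encoder states reach the joint network."""
    z_ilm = F.linear(torch.tanh(F.linear(h_pred, W_p, b_p)), W_j, b_j)  # Eq. (8), phi = tanh assumed
    keep = [i for i in range(z_ilm.size(-1)) if i != blank_id]          # drop the blank logit
    return F.log_softmax(z_ilm[..., keep], dim=-1)                       # Eq. (9)

def ilme_score(log_p_e2e, log_p_ext_lm, log_p_int_lm, lam_e, lam_i):
    """Eq. (11): per-hypothesis score accumulated during ILME-based beam search."""
    return log_p_e2e + lam_e * log_p_ext_lm - lam_i * log_p_int_lm
```

During beam search, each hypothesis expansion is scored twice: once with the encoder output to obtain the E2E score and once with the acoustic contribution removed to obtain the internal LM score, and the two are combined with the external LM score as in ilme_score.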
4. INTERNAL LM TRAINING OF E2E MODELS
In standard E2E training, an internal LM is implicitly learned to minimize the E2E loss by updating all parameters of the E2E model. However, during ILME-based inference, only a part of the E2E model contributes to the prediction of the internal LM scores. The estimation of the internal LM scores is not accurate when the conditions of Proposition 1 in [40, Appendix A] are not strictly satisfied by the E2E model.

In this work, we propose an internal LM training of the E2E model to mitigate the mismatch between the E2E training and the ILME-based inference. Through standard E2E training, the decoder of an AED or the prediction and joint networks of an RNN-T act as an acoustically-conditioned LM that takes both the token and acoustic embeddings as the input to predict the conditional probability of the next token. From Eqs. (9) and (10), the internal LM scores are estimated entirely by the acoustically-conditioned LM of an E2E model during ILME-based inference. Therefore, the goal of ILMT is to encourage the acoustically-conditioned LM of an E2E model to also behave like a standalone internal LM, without sacrificing ASR accuracy. To achieve that, we jointly minimize an internal LM loss together with the standard E2E loss during ILMT.

The internal LM loss of an RNN-T model is obtained by summing up the negative log probabilities of the internal LM over the training corpus $\mathcal{D}$:

$$\mathcal{L}_{\text{ILM}}(X, Y; \theta_{\text{pred}}, \theta_{\text{joint}}) = -\sum_{Y \in \mathcal{D}} \sum_{u=1}^{U} \log P(y_u | Y_{0:u-1}; \theta_{\text{pred}}, \theta_{\text{joint}}). \qquad (12)$$

Note that, from Eqs. (8) and (9), the RNN-T internal LM loss is conditioned only on the parameters of the prediction and joint networks, $\theta_{\text{pred}}$ and $\theta_{\text{joint}}$. For RNN-T, the ILMT loss is constructed as a weighted sum of the RNN-T loss in Eq. (3) and the internal LM loss:

$$\mathcal{L}_{\text{ILMT}}(X, Y; \theta_{\text{RNN-T}}) = \mathcal{L}_{\text{RNN-T}}(X, Y; \theta_{\text{RNN-T}}) + \alpha \mathcal{L}_{\text{ILM}}(X, Y; \theta_{\text{pred}}, \theta_{\text{joint}}), \qquad (13)$$

where $\alpha$ is the weight of the internal LM loss. By minimizing the RNN-T ILMT loss, we maximize the internal LM probability of the E2E training transcripts by updating only the prediction and joint networks, while maximizing the conditional probability of the training transcripts given the input speech by updating the entire RNN-T.

The internal LM loss of AED is formulated as a summation of the negative log probabilities of the internal LM over the training corpus $\mathcal{D}$:

$$\mathcal{L}_{\text{ILM}}(X, Y; \theta_{\text{dec}}) = -\sum_{Y \in \mathcal{D}} \sum_{u=1}^{U+1} \log P(y_u | Y_{0:u-1}; \theta_{\text{dec}}). \qquad (14)$$

Note that, from Eq. (10), the AED internal LM loss is conditioned only on the parameters of the decoder $\theta_{\text{dec}}$. For AED, the ILMT loss is computed as a weighted sum of the AED loss and the internal LM loss:

$$\mathcal{L}_{\text{ILMT}}(X, Y; \theta_{\text{AED}}) = \mathcal{L}_{\text{AED}}(X, Y; \theta_{\text{AED}}) + \alpha \mathcal{L}_{\text{ILM}}(X, Y; \theta_{\text{dec}}). \qquad (15)$$

By minimizing the AED ILMT loss, we maximize the internal LM probability of the E2E training transcripts by updating only the AED decoder, while maximizing the conditional probability of the training transcripts given the input speech by updating the entire AED model.

The procedure of ILMT with ILME-based inference for LM integration with an E2E model is as follows:

1. Train an E2E model with source-domain audio-transcript pairs to minimize the ILMT loss in Eq. (13) for RNN-T or in Eq. (15) for AED.
2. Train an external LM with target-domain text-only data.
3. Integrate the ILMT E2E model from Step 1 with the external LM from Step 2 by performing ILME-based inference as in Section 3.

With ILMT, a standalone internal LM with a significantly lower perplexity is learned only by the E2E components used to compute the internal LM scores during the ILME-based inference.
With increased modularity, the E2E model is more adaptable to the target domain, with greater flexibility to eradicate the effect of the source-domain internal LM through ILME-based inference.
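A minimal sketch of the RNN-T ILMT objective in Eqs. (12)-(13) is given below, assuming the internal LM log-probabilities have already been computed from the prediction and joint networks as in the previous sketch; the tensor shapes and helper name are illustrative.

```python
import torch
import torch.nn.functional as F

def ilmt_loss(rnnt_loss, ilm_logprobs, targets, alpha=0.4):
    """Eqs. (12)-(13): ILMT loss = RNN-T loss + alpha * internal LM loss.
    ilm_logprobs: (B, U, |V|) internal LM log-probs from Eqs. (8)-(9); since they
                  depend only on the prediction and joint networks, the internal
                  LM term back-propagates only into those components.
    targets:      (B, U) ground-truth word-piece IDs y_1, ..., y_U.
    alpha = 0.4 is the RNN-T internal LM loss weight reported in the experiments."""
    ilm_nll = F.nll_loss(ilm_logprobs.transpose(1, 2),  # (B, |V|, U): class dim second
                         targets, reduction="sum")       # Eq. (12): summed negative log-likelihood
    return rnnt_loss + alpha * ilm_nll                   # Eq. (13)
```

The AED case (Eqs. (14)-(15)) follows the same pattern, with the decoder as the only component receiving the internal LM gradient.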
5. EXPERIMENTS
In this work, we perform ILMT of RNN-T and AED models and integrate them with external long short-term memory (LSTM) [41, 42, 43] LMs using different methods. We conduct both cross-domain and intra-domain evaluations to investigate the effectiveness of ILMT. As in [1], we perform beam search inference with a beam size of 25 for all evaluations, and use the 3999 word-piece units generated by byte-pair encoding [44] as $\mathcal{V}$ for both the E2E models and the LSTM-LMs.

We perform ILMT of the E2E models with the same 30K hours of anonymized and transcribed data as in [1], collected from Microsoft services, including voice search, short message dictation, command and control, and conversations recorded in various conditions.

The RNN-T model is initialized with the parameters of the RNN-T in [1], which was well-trained until full convergence with the 30K-hour data. The encoder and prediction networks are both uni-directional LSTMs with 6 and 2 hidden layers, respectively, and 1024 hidden units in each layer. The joint network has 4000-dimensional output units. The RNN-T has 76M parameters. During ILMT, the weight of the internal LM loss is set to 0.4. The internal LM perplexities of the ILMT RNN-T and the standard RNN-T in [1] are 52.0 and 99.4, respectively, on the validation set of the 30K-hour data.

The AED model [10, 45, 46] is randomly initialized and shares the same architecture as the one in [1]. The encoder is a bi-directional LSTM with 6 hidden layers and 780 hidden units in each layer. The decoder is a uni-directional LSTM with 2 hidden layers, each with 1280 hidden units. The decoder has 4000-dimensional output units. The AED model has 97M parameters. During ILMT, the weight of the internal LM loss is set to 1.0. The internal LM perplexities of the ILMT AED and the standard AED in [1] are 46.1 and 796.7, respectively, on the validation set of the 30K-hour training data.
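The internal LM perplexities above can be measured by scoring the validation transcripts with the estimated internal LM alone. The sketch below shows one way this could be done; model.internal_lm_logprobs is a hypothetical wrapper around Eq. (9) or (10), and the choice of 0 as the <sos> ID is an assumption.

```python
import math
import torch

@torch.no_grad()
def internal_lm_perplexity(model, transcripts, sos_id=0):
    """Perplexity of the estimated internal LM over validation transcripts.
    transcripts: iterable of 1-D LongTensors of word-piece IDs y_1, ..., y_U."""
    total_nll, total_tokens = 0.0, 0
    for ids in transcripts:
        history = torch.cat([torch.tensor([sos_id]), ids[:-1]])   # Y_{0:u-1} for each position u
        logp = model.internal_lm_logprobs(history)                 # hypothetical: (U, |V|) via Eq. (9) or (10)
        total_nll -= logp.gather(-1, ids.unsqueeze(-1)).sum().item()
        total_tokens += ids.numel()
    return math.exp(total_nll / total_tokens)
```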
We evaluate the 30K-hour E2E models on the LibriSpeech clean test set by integrating an LSTM-LM trained with LibriSpeech text. Collected from read English audiobooks, the LibriSpeech corpus [47] is outside the domains covered by the 30K-hour training speech. The test-clean and dev-clean sets consist of 2620 and 2703 utterances, respectively. We tune the LM weights on dev-clean. We use the same external LSTM-LM as in [1], trained with the transcript of the 960-hour LibriSpeech training speech and the additional 813M-word text from the LibriSpeech corpus. With 58M parameters, the LSTM-LM has 2 hidden layers with 2048 hidden units in each layer. For Density Ratio, we use the same source-domain LSTM-LM as in [1], with 2 hidden layers, 2048 hidden units, and 57M parameters, trained with the transcript of the 30K-hour speech.

We list the results of RNN-T in Table 1 with an excerpt of the standard E2E training results from [1]. With ILMT, all three LM integration methods show 27.9%-40.9% relative WER reductions from the baseline with standard RNN-T training and inference, significantly larger than the corresponding reductions without ILMT in the range of 16.1%-29.1%. ILMT with ILME inference performs the best, achieving 29.6% and 16.6% relative WER reductions from the standard RNN-T with Shallow Fusion and ILME inference, respectively. As shown in Table 2, the AED results are similar to RNN-T. ILMT with ILME inference performs the best, achieving 57.6%, 31.5% and 25.1% relative WER reductions from the standard AED training with AED inference, Shallow Fusion and ILME inference, respectively.
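The LM weights $\lambda_E$ and $\lambda_I$ of Eq. (11) are tuned on the dev set (dev-clean here). A simple grid search such as the sketch below is one plausible way to do this; the decode_wer helper and the grid values are illustrative assumptions, not the paper's actual procedure.

```python
import itertools

def tune_lm_weights(decode_wer, dev_set,
                    lam_e_grid=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
                    lam_i_grid=(0.0, 0.1, 0.2, 0.3, 0.4)):
    """Grid-search the external/internal LM weights of Eq. (11) on a dev set.
    decode_wer(dev_set, lam_e, lam_i) is assumed to run ILME-based beam search
    with the given weights and return the resulting dev WER."""
    best_lam_e, best_lam_i, best_wer = None, None, float("inf")
    for lam_e, lam_i in itertools.product(lam_e_grid, lam_i_grid):
        wer = decode_wer(dev_set, lam_e, lam_i)
        if wer < best_wer:
            best_lam_e, best_lam_i, best_wer = lam_e, lam_i, wer
    return best_lam_e, best_lam_i, best_wer
```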
Table 1: WERs (%) of 30K-hour RNN-T models trained with the RNN-T or ILMT loss and evaluated with different LM integration methods on the out-of-domain LibriSpeech set and the in-domain dictation and conversation dev and test sets. WERR is the relative WER reduction (%).

Train Loss  Eval. Method     Params |  LibriSpeech        |  In-House Dictation  |  In-House Conversation
                                    |  Dev   Test   WERR  |  Dev    Test   WERR  |  Dev    Test   WERR
RNN-T       No LM            76M    |  9.27  8.97   -     |  23.40  16.16  -     |  14.92  14.26  -
RNN-T       Shallow Fusion   134M   |  7.44  7.53   16.1  |  22.19  15.77  2.4   |  14.88  14.08  1.3
RNN-T       Density Ratio    191M   |  6.80  6.74   24.9  |  21.54  15.64  3.2   |  14.76  14.20  0.4
RNN-T       ILME             134M   |  6.41  6.36   29.1  |  21.04  14.70  9.0   |  14.61  14.03  1.6
ILMT        No LM            76M    |  8.58  8.37   6.7   |  22.61  15.73  2.7   |  14.00  13.59  4.7
ILMT        Shallow Fusion   134M   |  6.60  6.47   27.9  |  21.31  15.04  6.9   |  13.83  13.29  6.8
ILMT        Density Ratio    191M   |  5.86  5.61   37.5  |  20.61  14.76  8.7   |  13.71  13.29  6.8
ILMT        ILME             134M   |                     |                      |

Table 2: WERs (%) of 30K-hour AED models trained with the AED or ILMT loss and evaluated with different LM integration methods on the out-of-domain LibriSpeech set and the in-domain dictation and conversation dev and test sets. WERR is the relative WER reduction (%).

Train Loss  Eval. Method     Params |  LibriSpeech        |  In-House Dictation  |  In-House Conversation
                                    |  Dev   Test   WERR  |  Dev    Test   WERR  |  Dev    Test   WERR
AED         No LM            97M    |  8.56  8.61   -     |  20.17  14.08  -     |  14.05  13.43  -
AED         Shallow Fusion   155M   |  5.00  5.33   38.1  |  18.55  12.96  8.0   |  13.45  12.95  3.6
AED         Density Ratio    212M   |  4.74  5.09   40.9  |  18.76  12.89  8.5   |  13.55  12.95  3.6
AED         ILME             155M   |  4.42  4.87   43.4  |  18.26  12.36  12.2  |  13.33  12.67  5.7
ILMT        No LM            97M    |  7.31  7.47   13.2  |  21.06  13.72  2.6   |  12.60  12.19  9.2
ILMT        Shallow Fusion   155M   |  6.54  6.61   23.2  |  19.09  12.32  12.5  |  12.42  11.90  11.4
ILMT        Density Ratio    212M   |  4.28  4.85   43.7  |  18.30  12.23  13.1  |  12.23  11.85  11.8
ILMT        ILME             155M   |                     |                      |

We evaluate the 30K-hour E2E models on an in-house dictation test set by integrating a strong external LSTM-LM trained with a large amount of multi-domain text. As in [1], we use the same 2K in-house dictation utterances collected from keyboard input as the test set, and the same 442 email dictation utterances as the validation set. The test set has a similar style to the dictation data in the 30K-hour corpus and is thus considered an in-domain evaluation. We use the same multi-domain LSTM-LM as in [1], trained with 2 billion (B) words of text comprising short message dictation and conversational data such as talks, interviews, and meeting transcripts.

Table 1 lists the RNN-T results with an excerpt of the standard E2E training results from [1]. With ILMT, all three LM integration methods show 6.9%-13.6% relative WER reductions from the baseline with standard RNN-T training and inference, significantly larger than the corresponding reductions without ILMT in the range of 2.4%-9.0%. ILMT with ILME inference performs the best, achieving 11.4% and 5.0% relative WER reductions from the standard RNN-T training with Shallow Fusion and ILME inference, respectively. As shown in Table 2, the AED results are similar to RNN-T. ILMT with ILME inference performs the best, achieving 17.6%, 10.5% and 6.1% relative WER reductions from the standard AED training with AED inference, Shallow Fusion and ILME inference, respectively.
We evaluate the 30K-hour E2E models on an in-house conversation test set by integrating a strong multi-domain external LSTM-LM. From Microsoft telecommunication applications, we collect 2560 in-house conversational utterances as the test set, and another 1280 conversational utterances as the validation set. The test set has a similar style to the conversational data in the 30K-hour corpus and is thus considered an in-domain evaluation. For the external LM, we use the same 2B-word multi-domain LSTM-LM as in the dictation evaluation above.

As shown in Table 1, with the ILMT RNN-T, all three LM integration methods show 6.8%-9.1% relative WER reductions from the baseline with standard RNN-T training and inference, significantly larger than the corresponding reductions without ILMT in the range of 0.4%-1.6%. ILMT with ILME inference performs the best, achieving 8.0% and 7.6% relative WER reductions from the standard RNN-T training with Shallow Fusion and ILME inference, respectively. As shown in Table 2, the AED results are similar to RNN-T. ILMT with ILME inference performs the best, achieving 13.8%, 10.6% and 8.6% relative WER reductions from the standard AED training with AED inference, Shallow Fusion and ILME inference, respectively.
From the results, we have the following observations for both the RNN-T and AED models, and for both cross- and intra-domain evaluations. All LM integration methods consistently achieve remarkably lower WERs with ILMT than with standard E2E training. Among all methods, ILMT with ILME inference consistently performs the best, with 29.6%-31.5% and 8.0%-11.4% relative WER reductions from standard E2E training with Shallow Fusion for the cross-domain and intra-domain evaluations, respectively. ILME inference consistently outperforms Density Ratio in terms of lower WER, with ILMT or standard E2E training, despite having 26.8%-29.8% fewer model parameters. All of these results demonstrate the advantage of ILMT over standard E2E training for ILME inference and other LM integration methods.

Note that, with or without ILMT, ILME inference is effective even for intra-domain evaluation because it replaces the weak E2E internal LM with a powerful external LM trained with orders of magnitude more multi-domain text than the E2E training transcript. All three LM fusion methods perform better for AED than for RNN-T, achieving larger relative WER reductions from a stronger baseline. The internal LM perplexity of an E2E model is remarkably reduced by ILMT.
6. CONCLUSION
We propose an internal LM training of the E2E model which minimizes an internal LM loss in addition to the standard E2E loss to improve the effectiveness of ILME-based external LM integration. With ILMT, ILME inference achieves 29.6%-31.5% and 8.0%-11.4% relative WER reductions from the standard E2E training with Shallow Fusion for cross-domain and intra-domain evaluations, respectively. With ILME inference, ILMT outperforms the standard E2E training by 16.6%-25.1% and 5.0%-8.6% relative WER reductions for cross-domain and intra-domain evaluations, respectively.

7. REFERENCES

[1] Z. Meng, S. Parthasarathy, E. Sun, et al., "Internal language model estimation for domain-adaptive end-to-end speech recognition," in Proc. SLT. IEEE, 2021.
[2] A. Graves, S. Fernández, F. Gomez, et al., "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML. ACM, 2006.
[3] A. Hannun, C. Case, et al., "Deep speech: Scaling up end-to-end speech recognition," arXiv preprint arXiv:1412.5567, 2014.
[4] H. Soltau, H. Liao, and H. Sak, "Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition," Proc. Interspeech, 2016.
[5] J. Li, G. Ye, A. Das, et al., "Advancing acoustic-to-word CTC model," in Proc. ICASSP, 2018.
[6] A. Graves, "Sequence transduction with recurrent neural networks," arXiv preprint arXiv:1211.3711, 2012.
[7] M. Jain, K. Schubert, J. Mahadeokar, et al., "RNN-T for latency controlled ASR with improved beam search," arXiv preprint arXiv:1911.01629, 2019.
[8] T. Sainath, Y. He, B. Li, et al., "A streaming on-device end-to-end model surpassing server-side conventional model quality and latency," in Proc. ICASSP, 2020, pp. 6059–6063.
[9] J. Li, R. Zhao, Z. Meng, et al., "Developing RNN-T models surpassing high-performance hybrid models with customization capability," in Proc. Interspeech, 2020.
[10] J. K. Chorowski, D. Bahdanau, D. Serdyuk, et al., "Attention-based models for speech recognition," in Proc. NIPS, 2015.
[11] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. ICASSP. IEEE, 2016.
[12] C.-C. Chiu, T. N. Sainath, Y. Wu, et al., "State-of-the-art speech recognition with sequence-to-sequence models," in Proc. ICASSP. IEEE, 2018, pp. 4774–4778.
[13] S. Karita, N. Chen, et al., "A comparative study on transformer vs RNN in speech applications," in Proc. ASRU, 2019.
[14] J. Li, Y. Wu, Y. Gaur, et al., "On the comparison of popular end-to-end models for large scale speech recognition," in Proc. Interspeech, 2020.
[15] D. Yu, K. Yao, H. Su, et al., "KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition," in Proc. ICASSP, May 2013.
[16] Z. Meng, J. Li, and Y. Gong, "Adversarial speaker adaptation," in Proc. ICASSP, 2019.
[17] H. Liao, "Speaker adaptation of context dependent deep neural networks," in Proc. ICASSP, May 2013.
[18] Z. Meng, H. Hu, J. Li, et al., "L-vector: Neural label embedding for domain adaptation," in Proc. ICASSP. IEEE, 2020.
[19] P. Swietojanski and S. Renals, "Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models," in Proc. SLT. IEEE, 2014, pp. 171–176.
[20] J. Li, R. Zhao, et al., "Learning small-size DNN with output-distribution-based criteria," in Proc. Interspeech, 2014.
[21] Z. Meng, J. Li, Y. Gong, et al., "Adversarial teacher-student learning for unsupervised domain adaptation," in Proc. ICASSP, 2018.
[22] V. Manohar, P. Ghahremani, D. Povey, et al., "A teacher-student learning approach for unsupervised domain adaptation of sequence-trained ASR models," in Proc. SLT. IEEE, 2018.
[23] Z. Meng, J. Li, Y. Zhao, and Y. Gong, "Conditional teacher-student learning," in Proc. ICASSP, 2019.
[24] R. Gemello, F. Mana, S. Scanzio, et al., "Linear hidden transformations for adaptation of hybrid ANN/HMM models," Speech Communication, vol. 49, no. 10, pp. 827–835, 2007.
[25] T. Tan, Y. Qian, M. Yin, et al., "Cluster adaptive training for deep neural network," in Proc. ICASSP. IEEE, 2015.
[26] O. Abdel-Hamid and H. Jiang, "Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code," in Proc. ICASSP, May 2013.
[27] Y. Shinohara, "Adversarial multi-task learning of deep neural networks for robust speech recognition," in Proc. Interspeech, 2016.
[28] Z. Meng, J. Li, Z. Chen, et al., "Speaker-invariant training via adversarial learning," in Proc. ICASSP, 2018.
[29] D. Serdyuk, K. Audhkhasi, P. Brakel, et al., "Invariant representations for noisy speech recognition," in Proc. NIPS Workshop, 2016.
[30] Z. Meng, Z. Chen, V. Mazalov, J. Li, and Y. Gong, "Unsupervised adaptation with domain separation networks for robust speech recognition," in Proc. ASRU, 2017.
[31] T. Ochiai, S. Watanabe, et al., "Speaker adaptation for multichannel end-to-end speech recognition," in Proc. ICASSP, 2018.
[32] Z. Meng, Y. Gaur, J. Li, et al., "Speaker adaptation for attention-based end-to-end speech recognition," Proc. Interspeech, 2019.
[33] Z. Meng, J. Li, Y. Gaur, and Y. Gong, "Domain adaptation via teacher-student learning for end-to-end speech recognition," in Proc. ASRU. IEEE, 2019.
[34] G. Hinton, L. Deng, D. Yu, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, 2012.
[35] C. Gulcehre, O. Firat, K. Xu, et al., "On using monolingual corpora in neural machine translation," CoRR, 2015.
[36] J. Chorowski and N. Jaitly, "Towards better decoding and language model integration in sequence to sequence models," CoRR, vol. abs/1612.02695, 2016.
[37] S. Kim, Y. Shangguan, J. Mahadeokar, et al., "Improved neural language model fusion for streaming recurrent neural network transducer," arXiv preprint arXiv:2010.13878, 2020.
[38] E. McDermott, H. Sak, and E. Variani, "A density ratio approach to language model fusion in end-to-end automatic speech recognition," in Proc. ASRU. IEEE, 2019, pp. 434–441.
[39] N. Kanda, X. Lu, and H. Kawai, "Maximum a posteriori based decoding for CTC acoustic models," in Proc. Interspeech, 2016.
[40] E. Variani, D. Rybach, C. Allauzen, et al., "Hybrid autoregressive transducer (HAT)," in Proc. ICASSP, 2020.
[41] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Proc. Interspeech, 2014.
[42] Z. Meng, S. Watanabe, J. R. Hershey, et al., "Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition," in Proc. ICASSP. IEEE, 2017.
[43] H. Erdogan, T. Hayashi, J. R. Hershey, et al., "Multi-channel speech recognition: LSTMs all the way through," in CHiME-4 Workshop, 2016, pp. 1–4.
[44] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," arXiv preprint arXiv:1508.07909, 2015.
[45] Z. Meng, Y. Gaur, J. Li, et al., "Character-aware attention-based end-to-end speech recognition," in Proc. ASRU, 2019.
[46] Y. Gaur, J. Li, Z. Meng, et al., "Acoustic-to-phrase end-to-end speech recognition," in Proc. Interspeech, 2019.
[47] V. Panayotov, G. Chen, et al., "Librispeech: An ASR corpus based on public domain audio books," in Proc. ICASSP, 2015.