Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition
Wenyong Huang, Wenchao Hu, Yu Ting Yeung, Xiao Chen
Huawei Noah's Ark Lab
{wenyong.huang, huwenchao, yeung.yu.ting, chen.xiao2}@huawei.com

Abstract
Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR), and requires significantly less training time than RNN-based models. The original Transformer, with its encoder-decoder architecture, is only suitable for offline ASR. It relies on an attention mechanism to learn alignments, and encodes input audio bidirectionally. The high computation cost of Transformer decoding also limits its use in production streaming systems. To make Transformer suitable for streaming ASR, we explore the Transducer framework as a streamable way to learn alignments. For audio encoding, we apply a unidirectional Transformer with interleaved convolution layers. The interleaved convolution layers are used for modeling future context, which is important to performance. To reduce computation cost, we gradually downsample the acoustic input, also with the interleaved convolution layers. Moreover, we limit the length of history context in self-attention to maintain constant computation cost for each decoding step. We show that this architecture, named Conv-Transformer Transducer, achieves competitive performance on the LibriSpeech dataset (3.6% WER on test-clean) without external language models. The performance is comparable to previously published streamable Transformer Transducer and strong hybrid streaming ASR systems, and is achieved with a smaller look-ahead window (140 ms), fewer parameters, and a lower frame rate.
Index Terms: speech recognition, Transformer Transducer, end-to-end, RNN-T
1. Introduction
Transformer [1] models have achieved state-of-the-art results in many natural language processing tasks [2, 3]. Recently, Transformer has been gaining popularity in the speech recognition research community [4–7]. Transformer-based speech recognition models have achieved comparable or better performance than state-of-the-art models, which are mostly based on recurrent neural networks (RNN). Transformer is more parallelizable than RNN, and is thus significantly faster to train [1, 7].

Despite its success, the original Transformer with encoder-decoder architecture is only suitable for offline speech recognition. There are a number of challenges in applying the original Transformer to streaming speech recognition. First, the original Transformer relies on an attention mechanism over the full encoder output to learn alignments between input and output sequences [8]. Second, the original Transformer encodes input audio in a bidirectional way, and thus requires a full utterance as input. Moreover, the computation of Transformer grows quadratically with the length of the input sequence. Naive implementations usually lead to much slower decoding than RNN-based models [7].

For alignment learning, there exist many streamable approaches for speech recognition, e.g., Connectionist Temporal Classification (CTC) [9], Transducer [10], Monotonic Chunkwise Attention (MoChA) [11], and Triggered Attention [12]. All of these alignment learning methods can be combined with Transformer. In this work, we focus on Transducer, as previous works [13, 14] suggest that Transducer models outperform traditional hybrid models for streaming speech recognition in production settings. Several research groups have been working on combining Transformer with Transducer for speech recognition, which is usually referred to as Transformer Transducer [15–17].

To make the Transformer encoder streamable, previous works on Transformer Transducer [16, 17] limit future context (right-context) in self-attention for input audio encoding. Although each layer only requires a little future context, the overall future context aggregates over all Transformer layers, so the overall look-ahead window is still large. In this work, we explore a unidirectional Transformer with interleaved convolution layers for audio encoding. A unidirectional Transformer requires no future context; thus it is streamable and does not introduce latency into the model. However, future context is important to speech recognition performance. Therefore, we add interleaved convolutions between Transformer layers to model future context. We carefully control the number of layers and the filter sizes of the convolution layers. In our model, the look-ahead window introduced by the convolution layers is 140 ms.

We explore a number of options for practical streaming speech recognition with Transformer. First, we apply interleaved convolutions to gradually downsample the input audio sequence. This enables our model to run at a low frame rate of 80 ms, which reduces computation cost significantly [18]. Second, we limit the length of history context (left-context) of self-attention in Transformer layers to maintain constant computation cost for each decoding step. Finally, we apply relative position encoding [3], which enables hidden-state reuse for Transformer. Hidden-state reuse improves the decoding speed of Transformer significantly [3].

We name our design the Conv-Transformer Transducer.
We show that this architecture achieves competitive performance on the LibriSpeech dataset [19], with word error rates (WER) of 3.5% on test-clean and 8.3% on test-other without external language models. When we further limit the history context of self-attention, our proposed model still achieves WERs of 3.6% and 8.9%, respectively. The performance is comparable to the previously published streamable Transformer Transducer [17] and strong hybrid speech recognition systems, with a smaller look-ahead window, fewer parameters, and a lower frame rate.

This paper is organized as follows. In the next section, we introduce the Transducer framework and the detailed architecture of the Conv-Transformer Transducer. In Section 3, we present our experimental results on the LibriSpeech dataset. Finally, we conclude our work in Section 4.

Figure 1: Transducer model architecture.
2. Conv-Transformer Transducer
Transducer, proposed in [10], is a neural sequence transduction framework which can be used for end-to-end speech recognition, i.e., transduction of an input acoustic sequence into an output transcription label sequence. We denote an input acoustic sequence with T frames as x = (x_1, ..., x_T), and a transcription label sequence of length U as y = (y_1, ..., y_U), where y_u \in Z and Z is the vocabulary of output labels. As depicted in Figure 1, a Transducer model first encodes an input acoustic sequence with an audio encoder network, producing a sequence of encoder states denoted as h = (h_1, ..., h_T). For each encoder state h_t, the model predicts either a label or a special blank symbol <b> with a joint net. When the model predicts a label, it continues to predict the next output. When the model predicts a blank symbol, which indicates that no more labels can be predicted from the current state, the model proceeds to the next encoder state. This process is similar to CTC [9], with one difference: a Transducer model makes use of previously predicted non-blank labels as an input condition for predicting the next output. The previously predicted labels are encoded with another network referred to as the prediction net.

During the process described above, the Transducer model defines a conditional distribution,

    P(\hat{y} | x) = \prod_{i=1}^{T+U} P(\hat{y}_i | x_1, ..., x_{t_i}, y_0, ..., y_{u_{i-1}})    (1)

where \hat{y} = (\hat{y}_1, ..., \hat{y}_{T+U}) \in (Z \cup {<b>})^{T+U} is an alignment path with T blank symbols and U labels such that removing all blank symbols in \hat{y} yields y, and y_0 represents the start of sentence. To obtain the probability of the target sequence, Transducer computes a marginalized distribution,

    P(y | x) = \sum_{\hat{y} \in A_ALIGN(x, y)} P(\hat{y} | x)    (2)

where A_ALIGN(x, y) is the set of all valid alignment paths such that removing the blank symbols in \hat{y} yields y. The summation of probabilities over all alignment paths is computed efficiently with the forward-backward algorithm [10].

In most previous works in the Transducer framework, the audio encoder and prediction net are composed of RNNs [10]. This type of Transducer is often referred to as a Recurrent Neural Network Transducer (RNN-T). Other types of networks are also allowed in the Transducer model.
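To make the frame-by-frame blank/label stepping concrete, the following is a minimal sketch of greedy (best-path) Transducer decoding for a single utterance. The callables encode, predict, and joint stand in for the audio encoder, prediction net, and joint net; their names, the blank_id value, and the per-frame symbol cap are illustrative assumptions rather than details taken from this paper.

import torch

def transducer_greedy_decode(encode, predict, joint, features,
                             blank_id=0, max_symbols_per_frame=10):
    """Greedy (best-path) Transducer decoding for a single utterance.

    encode, predict and joint are placeholder callables standing in for the
    audio encoder, prediction net and joint net described above.
    """
    enc_states = encode(features)            # (T, D_enc), one state per frame
    labels = []                              # non-blank labels emitted so far
    for h_t in enc_states:                   # advance over encoder states
        for _ in range(max_symbols_per_frame):
            g_u = predict(labels)            # prediction-net output for the label history
            logits = joint(h_t, g_u)         # scores over labels plus blank
            k = int(torch.argmax(logits))
            if k == blank_id:                # blank: move on to the next encoder state
                break
            labels.append(k)                 # label: emit it and keep predicting
    return labels

In our experiments (Section 3) we use beam search rather than this greedy rule, but the underlying lattice of blank/label decisions is the same.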
Figure 2: Audio encoder of the Conv-Transformer Transducer. ConvND-BN-RELU denotes N-dimensional convolution layers followed by batch normalization and Rectified Linear Unit (ReLU) activation. For convolution layers, F denotes the number of filters, S denotes the stride, and C denotes the number of output channels. For 2D convolutions, the first values of F and S correspond to the time dimension, and the second values correspond to the frequency dimension. For Transformer, L denotes the number of layers, N denotes the number of heads, DH denotes the dimension of each head, DM denotes the input and output dimension of the self-attention and feed-forward network (FFN) layers, and DF denotes the dimension of the intermediate hidden layer in the FFN.

We now describe the architecture of our Conv-Transformer Transducer. The audio encoder is illustrated in Figure 2 and is composed of three blocks. Each block is composed of three convolution layers followed by a unidirectional Transformer. The unidirectional Transformer is similar to the encoder of the original Transformer, except that self-attention is masked, preventing it from attending to positions beyond the current position. The unidirectional Transformer does not require future context, and is widely used in language modeling [3, 20] and text generation [21].

However, future acoustic context is helpful for improving recognition accuracy. Therefore, we apply convolution layers before the Transformer in each block. In our model, all future context comes from the convolution layers. The size of the look-ahead window required by the current input frame is illustrated in Figure 3. In our 3-block setting, the size of the look-ahead window is 140 ms, which is acceptable for streaming speech recognition. We also downsample the input representation with the convolution layers. As shown in Figure 2, at the second convolution layer of each block, the stride is 2 along the time dimension. With an input frame rate of 10 ms, the frame rate becomes 20 ms in the first block, 40 ms in the second block, and finally 80 ms in the third block, as shown in Figure 3. We find that this progressive downsampling scheme causes no loss in accuracy, and reduces the memory requirement for training significantly, so we can use a bigger batch size for efficient training. Moreover, downsampling greatly reduces the computation cost of inference. To further reduce the computation complexity of the audio encoder, we place most of the Transformer layers in the third block with the 80 ms frame rate. The idea of using interleaved convolution for gradual downsampling and incorporating future context is inspired by [22].

Figure 3: Illustration of the context window and frame rate change of the convolution layers in the audio encoder of the Conv-Transformer Transducer at the current input frame (frame index = 0). The input and output of each block are represented by squares. Convolution layers are represented by circles. Only the time dimension is shown. For each layer, activations with dashed outlines are skipped in computation.

The prediction net is illustrated in Figure 4. An embedding layer converts previously predicted non-blank labels into vector representations. Then a linear layer projects the embedding vectors to match the input dimension of the unidirectional Transformer layers.

Figure 4: Prediction net of the Conv-Transformer Transducer. We denote the output dimensions of the embedding layer and the linear layer as D.

The joint net is a fully-connected feed-forward neural network with a single hidden layer. There are 512 units in the hidden layer, with Rectified Linear Unit (ReLU) as the activation function. We concatenate the outputs of the audio encoder and the prediction net as the input of the joint net.
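As a rough illustration of one audio-encoder block described above, the sketch below interleaves three Conv-BN-ReLU layers (the second with stride 2 in time, halving the frame rate, and with no causal padding, so the convolutions are what supply the look-ahead) with causally masked Transformer layers. It uses standard PyTorch modules; the kernel size, channel counts, number of heads, and the 1x1 input projection are illustrative assumptions, not the exact values of Figure 2.

import torch
import torch.nn as nn

class ConvTransformerBlock(nn.Module):
    """One audio-encoder block: three Conv-BN-ReLU layers (the second with
    stride 2 in time) followed by unidirectional, i.e. causally masked,
    Transformer layers. Dimensions here are illustrative placeholders."""

    def __init__(self, in_dim=256, d_model=256, n_layers=2, n_heads=4,
                 d_ffn=1024, kernel_size=3):
        super().__init__()

        def conv_bn_relu(stride):
            # no padding: the kernel reaches a few frames beyond the current
            # one, which is where the model's look-ahead window comes from
            return nn.Sequential(
                nn.Conv1d(d_model, d_model, kernel_size, stride=stride),
                nn.BatchNorm1d(d_model),
                nn.ReLU(),
            )

        self.in_proj = nn.Conv1d(in_dim, d_model, kernel_size=1)  # dimension matching (assumed)
        self.convs = nn.Sequential(conv_bn_relu(1), conv_bn_relu(2), conv_bn_relu(1))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ffn, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):
        # x: (batch, time, in_dim), acoustic features or the previous block's output
        y = self.in_proj(x.transpose(1, 2))
        y = self.convs(y).transpose(1, 2)      # time axis roughly halved by the stride-2 conv
        t = y.size(1)
        # boolean mask: True above the diagonal blocks attention to future frames
        causal_mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=y.device),
                                 diagonal=1)
        return self.transformer(y, mask=causal_mask)

Stacking three such blocks, with most of the Transformer depth in the last one, follows the 10 ms -> 20 ms -> 40 ms -> 80 ms frame-rate progression described above.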
During decoding, self-attention of the unidirectional Transformer attends to all history context. Computation cost therefore continues to grow with the length of the history context as decoding goes on, which is undesirable for a streaming system. We limit the history context of self-attention with a fixed-size window, so the computational cost of each step becomes constant. The idea of limiting the history context of self-attention for streamable speech recognition has been explored in [16, 17, 23].
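A minimal sketch of such a fixed-size window, following the PyTorch convention that True entries of a boolean attention mask are blocked; the window size used in the example is arbitrary:

import torch

def limited_left_context_mask(seq_len: int, left_context: int) -> torch.Tensor:
    """Self-attention mask for streaming decoding: position i may attend only
    to positions max(0, i - left_context) .. i (no future, bounded history)."""
    idx = torch.arange(seq_len)
    rel = idx.unsqueeze(0) - idx.unsqueeze(1)   # rel[i, j] = j - i
    # block future positions (j > i) and positions older than the window
    return (rel > 0) | (rel < -left_context)

# Example: 6 frames with a left-context window of 2 previous frames
print(limited_left_context_mask(6, 2).int())

With such a mask, each decoding step attends to at most left_context + 1 encoder states, so the per-step cost no longer grows with the utterance length.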
We apply relative position encoding to model sequential order in self-attention [3]. Relative position encoding was proposed in [24], and leads to significant improvement over the absolute position encoding of the original Transformer. Moreover, relative position encoding is necessary for reusing hidden states when applying self-attention with limited attention context [3].
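The sketch below illustrates the flavor of relative position information with a scalar per-head bias indexed by the clipped relative distance between query and key. Both Shaw et al. [24] and Transformer-XL [3] use richer, vector-valued formulations, so this is only a simplified stand-in for the idea; the class name and sizes are assumptions.

import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Learnable bias added to attention scores that depends only on how far
    back a key position lies from the query position, not on absolute positions.
    Position-relative scores are what make cached hidden states reusable."""

    def __init__(self, n_heads: int, max_left_context: int):
        super().__init__()
        # one bias per head for each relative distance 0 .. max_left_context
        self.bias = nn.Embedding(max_left_context + 1, n_heads)
        self.max_left_context = max_left_context

    def forward(self, seq_len: int) -> torch.Tensor:
        idx = torch.arange(seq_len)
        # dist[i, j] = how many steps key j lies behind query i, clipped to the window
        dist = (idx.unsqueeze(1) - idx.unsqueeze(0)).clamp(0, self.max_left_context)
        # (seq_len, seq_len, n_heads) -> (n_heads, seq_len, seq_len); added to attention logits
        return self.bias(dist).permute(2, 0, 1)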
3. Experiments
We perform our experiments with the LibriSpeech dataset [19], which is a publicly available English read-speech corpus of audiobooks from the LibriVox project. The corpus consists of 960 hours of training data, 10.7 hours of development data, and 10.5 hours of test data. The audio is sampled at 16 kHz and quantized at 16 bits. The development and test data are further divided into clean and other subsets according to their quality, as assessed with a speech recognition system. We train our models with the entire training set. For clarity, we only report results on the test sets (test-clean and test-other), with the best-performing decoding configuration on the development sets.
All Conv-Transformer Transducer models are trained using 8 GPUs with the per-GPU batch bucketing configuration shown in Table 1. We use NovoGrad [25] as our optimizer. The learning rate is warmed up from 0 to 0.01 in the first 10k steps, then decayed polynomially over the following 200k steps.

We use 128-dimensional log-mel filterbanks as acoustic features, calculated with a 20 ms window and 10 ms stride. We use 4k subwords as output target units, generated from the training transcripts of LibriSpeech using SentencePiece [26]. We apply several techniques to avoid overfitting. We add additive Gaussian noise and apply speed perturbation [27] in the time domain. After feature extraction, we apply SpecAugment as described in [28]. We do not apply time warping for SpecAugment, as we have already applied speed perturbation. Transformer layers are regularized with 10% dropout. Due to limited computation resources, we choose hyperparameters without intensive tuning.
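For illustration, the warmup-then-polynomial-decay schedule described above can be sketched as a function of the training step; the final learning rate and the polynomial power below are assumptions, as the exact final value is not stated here.

def learning_rate(step: int,
                  peak_lr: float = 0.01,
                  warmup_steps: int = 10_000,
                  decay_steps: int = 200_000,
                  final_lr: float = 0.0,
                  power: float = 2.0) -> float:
    """Linear warmup to peak_lr, then polynomial decay towards final_lr."""
    if step < warmup_steps:
        # warm up linearly from 0 to peak_lr over the first warmup_steps steps
        return peak_lr * step / warmup_steps
    # decay polynomially from peak_lr towards final_lr over the next decay_steps steps
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return final_lr + (peak_lr - final_lr) * (1.0 - progress) ** power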
Table 1: Batch bucketing configuration

Sample length    Batch size

Our baseline system is based on Kaldi's [29] chain-model recipe [30]. The TDNN-LSTM based acoustic model (AM) [22] is trained with LibriSpeech, plus additional CommonVoice (CV) data (en 1488h 2019-12-10, validated set excluding dev and test sets) [31]. The AM is monophone-based, and contains 33.5M parameters. Its look-ahead window size is 170 ms, which is comparable to our Conv-Transformer Transducer. We apply SpecAugment during AM training. We perform language model (LM) training with the normalized training text of the official LibriSpeech LM [32], plus the transcriptions of the CV training data. The vocabulary size of the lexicon is about 200k. We deploy a two-pass decoding strategy, with a smaller 4-gram LM with 5.8M n-grams in the first pass, and a bigger 4-gram LM with 91M n-grams for second-pass rescoring. Note that we train the baseline system with additional CommonVoice data.
We compare the results of our Conv-Transformer Transducer (ConvT-T) with the results of a previously published streamable Transformer Transducer model (T-T) [17], a hybrid model with a Transformer AM [23], and our hybrid TDNN-LSTM AM baseline. We use beam search without an external language model for Conv-Transformer Transducer decoding. As shown in Table 2, the Conv-Transformer Transducer achieves the lowest WER on both test-clean and test-other. Our model operates with fewer parameters, a smaller look-ahead window, and a lower frame rate. We do not list the number of parameters of the hybrid models, as they both rely on large external n-gram language models.

We also evaluate whether the low frame rate of 80 ms leads to performance degradation. We train a model with a 40 ms frame rate (HFR ConvT-T) by setting the stride to 1 instead of 2 in the second convolution layer of the third block. As this model requires more memory, we train the 40-ms model with only half of the original batch size, but with twice as many steps to maintain the same number of epochs. As shown in Table 2, the performance difference between the 40-ms model and the 80-ms model is insignificant.
Table 2: WER comparison of a previously published streamable Transformer Transducer model, a hybrid streamable model with a Transformer-based AM, our conventional hybrid baseline model, and our streamable Conv-Transformer Transducer models on LibriSpeech test sets.
Model                      Params   Look-ahead   Frame rate   WER (%) clean / other
Hybrid TDNN-LSTM           -        170 ms       30 ms        4.0 / 8.9
Hybrid Transformer [23]    -        2480 ms      20 ms        3.65 / 9.01
T-T [17]                   139M     1080 ms      30 ms        3.6 / 10.0
ConvT-T (Ours)             67M      140 ms       80 ms        3.5 / 8.3
We then perform experiments to analyze the effects of varying the number of Transformer layers over the blocks of different frame rates. The results are shown in Table 3. Recall that in the original model, there are 2 Transformer layers in the first (20 ms) block, 2 Transformer layers in the second (40 ms) block, and 8 Transformer layers in the third (80 ms) block. Moving Transformer layers from the third to the second block does not affect recognition performance significantly, but increases computation cost as more Transformer layers operate at a higher frame rate. However, when we move all the Transformer layers in the second block to the third block, as shown in Row 4 of Table 3, small performance degradation begins. We further move all the Transformer layers in the first two blocks to the third block, as shown in Row 5. This is equivalent to running a convolutional neural network (CNN) followed by a multi-layer Transformer. There is further performance degradation. Note that there are fewer parameters for the Transformer layers in the first block; we reduce the total number of Transformer layers from 12 to 11 to maintain a similar number of parameters. The experimental results support our strategies of placing most of the Transformer layers in the third block with the lowest frame rate, and applying interleaved convolution in the model.

We further evaluate the effects of limiting the amount of history context (left-context) in self-attention of the Transformer layers. The results are shown in Table 4.
Table 3: WER for different numbers of Transformer layers over the three blocks in the audio encoder. Layer distribution B1-B2-B3 corresponds to the number of Transformer layers in the first, the second, and the third block, respectively.

Layer distribution (B1-B2-B3)   WER (%) clean / other
2-2-8                           3.5 / 8.3
Table 4: WER for different window sizes of history context (left-context) in self-attention of the Transformer layers. Inf means that all history context is used.

Left-context window size (audio encoder / prediction net)   WER (%) clean / other
Inf / Inf    3.5 / 8.3
Inf / 2      3.5 / 8.6
Inf / 8      3.6 / 8.5
Inf / 16
4. Conclusions
We propose the Conv-Transformer Transducer model, which is suitable for streaming speech recognition. Our model is end-to-end trainable using the Transducer framework. By applying a unidirectional Transformer with interleaved convolution for audio encoding, our model requires only a 140 ms look-ahead window. A smaller look-ahead window generally leads to lower latency. We gradually downsample the audio input to a frame rate of 80 ms to reduce computation cost, while recognition accuracy is still maintained. The experimental results also support our strategies of applying interleaved convolution and placing most of the Transformer layers in the block with the lowest frame rate. We also limit the size of the history context window in self-attention of the Transformer layers. Although there is some performance degradation, the results are still comparable to published LibriSpeech results. Our model achieves results competitive with the previously published streamable Transformer Transducer model and state-of-the-art hybrid models, with lower latency, a lower frame rate, and fewer parameters.

5. References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, Jun. 2019, pp. 4171–4186.
[3] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. Le, and R. Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, Jul. 2019, pp. 2978–2988.
[4] L. Dong, S. Xu, and B. Xu, "Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition," in ICASSP 2018, 2018, pp. 5884–5888.
[5] S. Karita, N. E. Y. Soplin, S. Watanabe, M. Delcroix, A. Ogawa, and T. Nakatani, "Improving Transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration," in Interspeech 2019, 2019, pp. 1408–1412.
[6] A. Zeyer, P. Bahar, K. Irie, R. Schlüter, and H. Ney, "A comparison of Transformer and LSTM encoder decoder models for ASR," in ASRU 2019, 2019, pp. 8–15.
[7] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang, S. Watanabe, T. Yoshimura, and W. Zhang, "A comparative study on Transformer vs RNN in speech applications," in ASRU 2019, 2019, pp. 449–456.
[8] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in International Conference on Learning Representations 2015 (ICLR 2015), 2015, arXiv preprint arXiv:1409.0473.
[9] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006, pp. 369–376.
[10] A. Graves, "Sequence transduction with recurrent neural networks," in International Conference on Machine Learning 2012 (ICML) Workshop on Representation Learning, 2012, arXiv preprint arXiv:1211.3711.
[11] C.-C. Chiu and C. Raffel, "Monotonic chunkwise attention," in International Conference on Learning Representations 2018 (ICLR 2018), 2018, arXiv preprint arXiv:1712.05382.
[12] N. Moritz, T. Hori, and J. Le Roux, "Triggered attention for end-to-end speech recognition," in ICASSP 2019, 2019, pp. 5666–5670.
[13] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang et al., "Streaming end-to-end speech recognition for mobile devices," in ICASSP 2019, 2019, pp. 6381–6385.
[14] T. N. Sainath, Y. He, B. Li, A. Narayanan, R. Pang, A. Bruguier, S. Chang, W. Li, R. Alvarez, Z. Chen, C. Chiu, D. Garcia, A. Gruenstein, K. Hu, A. Kannan, Q. Liang, I. McGraw, C. Peyser, R. Prabhavalkar, G. Pundak, D. Rybach, Y. Shangguan, Y. Sheth, T. Strohman, M. Visontai, Y. Wu, Y. Zhang, and D. Zhao, "A streaming on-device end-to-end model surpassing server-side conventional model quality and latency," in ICASSP 2020, 2020, pp. 6059–6063.
[15] Z. Tian, J. Yi, J. Tao, Y. Bai, and Z. Wen, "Self-attention transducers for end-to-end speech recognition," in Interspeech 2019, 2019, pp. 4395–4399.
[16] C.-F. Yeh, J. Mahadeokar, K. Kalgaonkar, Y. Wang, D. Le, M. Jain, K. Schubert, C. Fuegen, and M. L. Seltzer, "Transformer-Transducer: End-to-end speech recognition with self-attention," arXiv preprint arXiv:1910.12977, 2019.
[17] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar, "Transformer Transducer: A streamable speech recognition model with transformer encoders and RNN-T loss," in ICASSP 2020, 2020, pp. 7829–7833.
[18] G. Pundak and T. N. Sainath, "Lower frame rate neural network acoustic models," in Interspeech 2016, 2016, pp. 22–26.
[19] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP 2015, 2015, pp. 5206–5210.
[20] R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones, "Character-level language modeling with deeper self-attention," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3159–3166.
[21] P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, and N. Shazeer, "Generating Wikipedia by summarizing long sequences," arXiv preprint arXiv:1801.10198, 2018.
[22] V. Peddinti, Y. Wang, D. Povey, and S. Khudanpur, "Low latency acoustic modeling using temporal convolution and LSTMs," IEEE Signal Processing Letters, vol. 25, no. 3, pp. 373–377, 2017.
[23] Y. Wang, A. Mohamed, D. Le, C. Liu, A. Xiao, J. Mahadeokar, H. Huang, A. Tjandra, X. Zhang, F. Zhang, C. Fuegen, G. Zweig, and M. L. Seltzer, "Transformer-based acoustic modeling for hybrid speech recognition," in ICASSP 2020, 2020, pp. 6874–6878.
[24] P. Shaw, J. Uszkoreit, and A. Vaswani, "Self-attention with relative position representations," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, Jun. 2018, pp. 464–468.
[25] B. Ginsburg, P. Castonguay, O. Hrinchuk, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, H. Nguyen, and J. M. Cohen, "Stochastic gradient methods with layer-wise adaptive moments for training of deep networks," arXiv preprint arXiv:1905.11286, 2019.
[26] T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, Nov. 2018, pp. 66–71.
[27] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," in Interspeech 2015, 2015, pp. 3586–3589.
[28] D. S. Park, Y. Zhang, C. Chiu, Y. Chen, B. Li, W. Chan, Q. V. Le, and Y. Wu, "SpecAugment on large scale datasets," in ICASSP 2020, 2020, pp. 6879–6883.
[29] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.
[30] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Interspeech 2016, 2016.