Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition
Wei Li*, James Qin*, Chung-Cheng Chiu, Ruoming Pang, Yanzhang He
Google Inc., USA
{mweili, jamesqin, chungchengc, rpang, yanzhanghe}@google.com
Abstract
Recent advances in end-to-end models have allowed them to outperform conventional models by employing a two-pass model. The two-pass model provides better speed-quality trade-offs for on-device speech recognition: a 1st-pass model generates hypotheses in a streaming fashion, and a 2nd-pass model rescores the hypotheses with full audio sequence context. The 2nd-pass model plays a key role in the quality improvement that lets the end-to-end model surpass the conventional model. One main challenge of the two-pass model is the computation latency introduced by the 2nd-pass model. Specifically, the original design of the two-pass model uses LSTMs for the 2nd-pass model, which are subject to long latency because their recurrent nature forces inference to run sequentially. In this work we explore replacing the LSTM layers in the 2nd-pass rescorer with Transformer layers, which can process the entire hypothesis sequence in parallel and can therefore utilize the on-device computation resources more efficiently. Compared with an LSTM-based baseline, our proposed Transformer rescorer achieves a substantial latency reduction together with a quality improvement.
Index Terms: streaming speech recognition, Transformer, latency, rescoring
1. Introduction
There has been growing interest in building on-device streaming speech recognition models, which provide recognition results instantly as words are being spoken [1]. Such models make predictions based on partial context under strict latency requirements [2, 3, 4]. As a result, streaming models tend to be less accurate than non-streaming models, which have access to the entire utterance.
Previous work has shown that this issue can be alleviated by combining a second-pass rescoring model [5] with streaming models, where the rescoring model uses the Listen, Attend, and Spell (LAS) architecture [6]. LAS has access to the full context of the utterance and therefore provides better quality than the streaming models [7]. From the user's perspective, such a two-pass speech model exhibits the advantages of both streaming and non-streaming models: words are recognized as they are spoken and the final results have high accuracy.
The canonical architecture of the LSTM-based LAS model, however, is designed for beam search and is not efficient as a 2nd-pass rescoring model. The LSTM [8] layers process hypothesis tokens sequentially, with temporal dependency between timesteps. For 2nd-pass rescoring, on the other hand, all hypothesis tokens are already available, so a more efficient design of the rescorer model is to rescore all tokens in parallel.
* Equal contribution.
Figure 1: The architecture of the two-pass model with Transformer. (Components shown: acoustic frames x_1, ..., x_T, the RNN-T encoder and RNN-T decoder producing the streaming RNN-T hypothesis y_1, ..., y_s between SOS and EOS, the additional encoder outputs e_1, ..., e_T, and the Transformer rescorer.)
In recent years there has been growing success in applying Transformer [9] to machine translation and language modeling [10], as well as to speech recognition [11, 12, 13, 14]. Transformer applies self-attention to capture the sequential relations among input features and therefore does not have the recurrent constraint. This allows Transformer to compute self-attention in parallel and significantly increases computation efficiency. The Transformer architecture proposed in [9] consists of an encoder and a decoder, where each decoder layer has an additional cross-attention that summarizes the encoder output based on the self-attention output.
In this work, we address the sequential dependency issue of the original LSTM-based rescoring model with Transformer. Specifically, the paper proposes to use Transformer as the second-pass rescorer for parallel rescoring of hypothesis tokens. Unlike beam search, where the Transformer decoder still has to run autoregressively, the rescoring scenario allows parallel processing of the full hypothesis sequence. Such parallelism reduces the length of the temporal dependency path from O(n) to O(1), where n corresponds to the hypothesis length, and allows the Transformer rescorer to utilize on-device computation capacity much more efficiently. We further improve the inference speed of the Transformer rescorer by reducing the number of cross-attention layers in the decoder. The Transformer rescorer improves the Word Error Rate (WER) on Google's Voice Search query test set over LSTM rescoring. On Librispeech [15] the Transformer rescorer also improves the WER on both test clean and test other compared with LSTM rescoring. The second-pass latency, benchmarked on a Google Pixel4 phone on CPUs, is substantially reduced relative to LSTM rescoring.
Figure 2:
Transformer rescorer. The Transformer rescorer combines conventional Transformer decoders (containing cross-attention) and Transformer self-decoders (without cross-attention) for more efficient inference. The figure omits the normalization and residual links to simplify the illustration.
2. Transformer Rescorer
A two-pass model consists of a 1st-pass model and a 2nd-pass model. Here we use RNN-T [16, 17] as the 1st-pass model and Transformer for the 2nd-pass model. Specifically, our Transformer-based two-pass model, as demonstrated in Figure 1, consists of four components: RNN-T encoder, RNN-T decoder, additional encoder, and Transformer decoder as the rescorer. The input acoustic frames are denoted as x = (x_1, ..., x_T), where x_t ∈ R^d are stacked log-mel filterbank energies (d = 512) and T is the number of frames in x. In the 1st pass, each acoustic frame x_t is passed through the RNN-T encoder, consisting of a multi-layer LSTM [8], to get the encoder output. The RNN-T decoder takes the acoustic features from the RNN-T encoder and generates the hypotheses in a streaming fashion, denoted as y = (y_1, ..., y_s), where s is the label sequence length. Here y is a sequence of word-piece tokens [18]. In the 2nd pass, the full output of the RNN-T encoder is passed through a small additional encoder to generate e_1, ..., e_T, which is then passed to the Transformer decoder. The additional encoder is added because it is found useful for adapting the encoder output to be more suitable for the second-pass model [2]. The RNN-T model structure and the additional encoder are exactly the same as in [2]. During training, the Transformer decoder computes the output label sequence according to the full audio sequence e_1, ..., e_T. More details about rescorer training are given in Section 2.3. During decoding, the Transformer decoder rescores multiple top hypotheses from RNN-T, y_1, ..., y_s.
The architecture of our Transformer rescorer is based on the conventional Transformer decoder [9] with some cross-attention layers removed. The conventional Transformer decoder layer contains both the self-attention and the cross-attention, where the query of the cross-attention originates from the output of the self-attention. In the Transformer rescorer, we improve efficiency by removing the cross-attention from some decoder layers and interleaving those layers with the conventional decoder layers. The decoder layer without the cross-attention shares the same architecture as the conventional Transformer encoder layer [9]. The architecture of the resulting rescorer is illustrated in Figure 2, where layers without cross-attention are annotated as self-decoder. The Transformer rescorer takes the RNN-T hypothesis as input and feeds the tokens to the self-attention layers, while the cross-attention layers attend to the encoder output to summarize the acoustic signals. In our rescorer model there are four Transformer layers, each with attention model dimension d_model = 640 and feed-forward dimension d_ff = 2560. Both the cross-attention and self-attention layers use multi-headed attention. A minimal sketch of this interleaved layer stack is given below.
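The sketch below illustrates the interleaved stack in TensorFlow; it is not the paper's Lingvo implementation. The d_model and d_ff values follow the numbers above, while the head count, the post-norm residual placement, and the choice of which layers keep cross-attention (here the 1st and 3rd, consistent with the ablation in Section 4) are assumptions for illustration.

```python
import tensorflow as tf

D_MODEL, D_FF, NUM_HEADS, NUM_LAYERS = 640, 2560, 8, 4   # head count is assumed
CROSS_ATTENTION_LAYERS = {0, 2}  # assumed: keep cross-attention on the 1st and 3rd layers

class RescorerLayer(tf.keras.layers.Layer):
    """One rescorer layer: a 'decoder' layer if use_cross_attention is True,
    otherwise a 'self-decoder' layer (same shape as a Transformer encoder layer)."""

    def __init__(self, use_cross_attention):
        super().__init__()
        self.use_cross_attention = use_cross_attention
        self.self_attention = tf.keras.layers.MultiHeadAttention(
            NUM_HEADS, D_MODEL // NUM_HEADS)
        if use_cross_attention:
            self.cross_attention = tf.keras.layers.MultiHeadAttention(
                NUM_HEADS, D_MODEL // NUM_HEADS)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(D_FF, activation="relu"),
            tf.keras.layers.Dense(D_MODEL),
        ])
        # Post-norm residual placement assumed (the figure omits these details).
        self.norms = [tf.keras.layers.LayerNormalization() for _ in range(3)]

    def call(self, hyp_embeddings, encoder_outputs):
        # Causal self-attention over the hypothesis token embeddings.
        length = tf.shape(hyp_embeddings)[1]
        causal_mask = tf.linalg.band_part(tf.ones((length, length)), -1, 0)
        x = self.norms[0](hyp_embeddings + self.self_attention(
            hyp_embeddings, hyp_embeddings, attention_mask=causal_mask))
        if self.use_cross_attention:
            # Cross-attention summarizes the additional-encoder output e_1..e_T.
            x = self.norms[1](x + self.cross_attention(x, encoder_outputs))
        return self.norms[2](x + self.ffn(x))

rescorer_layers = [RescorerLayer(i in CROSS_ATTENTION_LAYERS)
                   for i in range(NUM_LAYERS)]
```

Because all hypothesis positions are fed at once, a single call over the whole hypothesis replaces the per-token recurrence of an LSTM rescorer.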
Our design of keeping only two cross-attention layers in the rescorer is based on observing the attention mechanism of the Transformer decoder. In the first Transformer decoder layer, the self-attention conditions only on the hypothesis tokens, so the resulting cross-attention generates its query solely based on language modeling information. The lack of acoustic information when generating the attention query inherently limits the effectiveness of the first cross-attention. After the first cross-attention layer, the output of the first decoder layer contains acoustic information, and the following decoder layers can condition on both the acoustic and language modeling information to generate effective cross-attention queries. Thus, it is critical to have the second cross-attention layer in the decoder. On the other hand, the additional cross-attention layers beyond the second one do not introduce additional modality and have diminishing returns in terms of model quality. As a comparison, the cross-attention of the LAS model conditions on both the previous attention context and the text tokens, and requires only one cross-attention in the decoder. We demonstrate these properties with an ablation study in Section 3.
As with the LAS rescoring training described in [5], the Transformer rescorer model is trained after the 1st-pass model training. During 2nd-pass training, the RNN-T encoder and RNN-T decoder are frozen. The additional encoder and the Transformer rescorer are trained in two stages: cross entropy (CE) and minimum word error rate (MWER) training [19]. During CE training, the frozen RNN-T encoder generates the acoustic features for the additional encoder, and the Transformer rescorer is trained to predict the groundtruth sequence given the full audio context from the additional encoder and the prefix of the label sequence context: p(y_l | x, y_1 ... y_{l-1}), where l is the label to predict. During MWER training, the Transformer rescorer is trained to re-rank the hypotheses generated from RNN-T, which bridges the gap from CE training to inference [5]. More specifically, given acoustic input x, groundtruth transcript y*, the probability computed by the rescorer model P(y_m | x) for any given target sequence y_m, and a set of hypotheses H_m = {h_1, ..., h_b} where b is the beam size, the MWER loss is defined as

L_{\mathrm{MWER}}(x, y^{*}) = \sum_{y_m \in H_m(x)} P'(y_m \mid x, H_m)\,\bigl[\,W'(y_m, y^{*}) - \widehat{W}\,\bigr],

where P'(y_m \mid x, H_m) = \frac{P(y_m \mid x)}{\sum_{y_i \in H_m} P(y_i \mid x)} represents the conditional probability the Transformer rescorer assigns to hypothesis y_m among all hypotheses in H_m, W'(y_m, y^{*}) is the number of word errors of y_m, and \widehat{W} is the average number of word errors among H_m. In our MWER training we use the N-best approximation approach for calculating the expected word errors [19].
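As a concrete illustration, here is a minimal numpy sketch of the N-best MWER loss above, assuming the per-hypothesis log-probabilities from the rescorer and the word-error counts against the reference are precomputed (names are illustrative, not from the paper's implementation).

```python
import numpy as np

def mwer_loss(log_probs, word_errors):
    """log_probs: rescorer log P(y_m | x) for each of the b hypotheses.
    word_errors: number of word errors W'(y_m, y*) for each hypothesis."""
    log_probs = np.asarray(log_probs, dtype=np.float64)
    word_errors = np.asarray(word_errors, dtype=np.float64)
    # P'(y_m | x, H_m): probabilities renormalized over the N-best list.
    p = np.exp(log_probs - log_probs.max())
    p /= p.sum()
    # Subtracting the mean word errors W_hat reduces variance of the
    # expected word-error objective without changing its minimizer.
    return float(np.dot(p, word_errors - word_errors.mean()))

# Example: a 4-best list whose hypotheses have 0, 1, 2 and 1 word errors.
print(mwer_loss([-1.2, -1.5, -2.0, -1.7], [0, 1, 2, 1]))
```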
Table 1: Librispeech test sets word error rate

Model                  Test clean   Test other
RNN-T only                 –            –
LSTM rescorer              –            –
Transformer rescorer       –            –
3. Quality Experiments
We conduct experiments on the Librispeech [15] dataset and a large-scale internal dataset. We use SpecAugment [20] with the same configuration as described in [21] during training. Similar to [2], we apply a constant learning rate and maintain an exponential moving average (EMA) [22] of the weights during training, and use the EMA weights for evaluation. Both the LSTM and Transformer rescorers are trained with CE and MWER. The N-best size of MWER training matches the rescoring behavior during evaluation, where the top hypotheses from RNN-T are used for rescoring. The prediction targets are wordpieces [18] derived from a large corpus of text transcripts. All models are implemented in TensorFlow [23] using the Lingvo [24] toolkit and trained on Tensor Processing Unit (TPU) slices with a large global batch size.

In this experiment, the models are trained on the Librispeech 960h training set and evaluated on the clean and noisy test sets without an external language model. In order to maintain low-latency streaming speech recognition, the 1st-pass RNN-T models in all compared systems use a uni-directional LSTM encoder with 0 right context frames. As shown in Table 1, both the LSTM rescorer and the Transformer rescorer significantly improve the WER on the clean and noisy test sets compared to the RNN-T only model, alleviating the limited-context problem of the 1st-pass model while still maintaining low-latency streaming recognition. The Transformer rescorer further improves the WER slightly over the LSTM rescorer, and also significantly reduces the 2nd-pass latency, which is studied in detail in Section 4.

We also perform a large-scale experiment on an internal task, Google Voice Search, and show that the proposed Transformer rescorer is effective there as well. In this experiment, the models are trained on a multi-domain training set as described in [25]. These multi-domain utterances span the domains of search, farfield, telephony and YouTube. The test set consists of Voice Search utterances (VS) extracted from Google traffic. All datasets are anonymized and hand-transcribed. The transcription of YouTube utterances is done in a semi-supervised fashion [26, 27]. Following [28, 29, 2], we train the first-pass RNN-T to also emit the end-of-sentence decision to reduce the endpointing latency, allowing 2nd-pass rescoring to execute early. As shown in Table 2, the Transformer rescorer improves the WER on the VS test set compared with the LSTM rescorer, both of which are trained with CE and MWER. Compared with the 1st-pass model, the Transformer rescorer achieves a clear relative WER improvement.

Table 2: Voice Search test set word error rate
Model                        VS
RNN-T only                   –
LSTM rescorer                –
Transformer rescorer CE      –
Transformer rescorer MWER    –
An additional capability that the Transformer rescorer can bring is to utilize the full hypothesis when rescoring every target token. The original LSTM-based rescorer scores each target token conditioned only on the tokens before it. Specifically, the LSTM rescorer learns a conditional probability p(y_t | x, y_1, ..., y_{t-1}) for each prediction target y_t, where y denotes hypothesis tokens from RNN-T and x denotes acoustic features. A conventional Transformer decoder uses causal self-attention and also learns p(y_t | x, y_1, ..., y_{t-1}). We explored extending the self-attention to also access the future label context, so that the model learns to score target tokens with p(y_t | x, y). During CE training, using the groundtruth sequence as the full context makes the training target trivial. Thus we randomly swap different proportions of the groundtruth tokens fed to the self-attention layers with alternative tokens sampled from the word-piece vocabulary. Sentinel tokens such as SOS, EOS, UNKNOWN and RNN-T's blank symbol are excluded from being used as random tokens. The prediction targets are the original groundtruth sequence. During MWER training, the RNN-T hypothesis is used as the decoder input to match the inference scenario. In this experiment, the best random swap proportion achieved the same WER on the Voice Search task as the causal model. Thus, we report results with causal self-attention for the experiments throughout the paper.
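A minimal sketch of the random token-swapping step described above is given below; the sentinel token ids, the swap probability and the toy vocabulary size are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical sentinel ids; the real values depend on the word-piece model.
BLANK_ID, SOS_ID, EOS_ID, UNK_ID = 0, 1, 2, 3
SENTINELS = {BLANK_ID, SOS_ID, EOS_ID, UNK_ID}

def corrupt_groundtruth(labels, vocab_size, swap_prob, rng):
    """Swaps a proportion of groundtruth tokens with random word-piece tokens.
    The corrupted sequence is fed to the (non-causal) self-attention layers,
    while the prediction targets remain the original groundtruth sequence."""
    corrupted = np.array(labels, copy=True)
    for i, token in enumerate(corrupted):
        if token in SENTINELS or rng.random() >= swap_prob:
            continue
        # Resample until we draw a non-sentinel replacement token.
        candidate = rng.integers(vocab_size)
        while candidate in SENTINELS:
            candidate = rng.integers(vocab_size)
        corrupted[i] = candidate
    return corrupted

# Example: corrupt roughly 20% of a toy label sequence over a toy vocabulary.
rng = np.random.default_rng(0)
print(corrupt_groundtruth([1, 17, 805, 42, 2], vocab_size=4096, swap_prob=0.2, rng=rng))
```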
4. Latency Optimizations
In this section, we measure the additional latency introduced by the 2nd-pass rescorer on a Google Pixel4 phone on CPUs. For efficient on-device execution, all models are converted to TensorFlow Lite format with post-training dynamic range quantization using the TensorFlow Lite Converter [30]. Matrix multiplication is operated in 8 bits with little accuracy loss. The benchmark suite consists of 89 utterances with voice action queries. The LSTM rescorer latency baseline is fully optimized and is measured with the lattice rescoring with batching described in [2].

We investigate the impact of the number of cross-attention layers on quality and latency. As shown in Table 3, we start with cross-attention on the 1st decoder layer and gradually add more. We observe a noticeable quality improvement at first, which later quickly diminishes. Specifically, with two cross-attentions the rescorer achieves a WER improvement over one cross-attention, but no further improvement is realized by adding more. In addition, when two cross-attentions are used, we find that applying them on the 1st and 3rd layers yields a better WER than on the 1st and 2nd layers. In the end, by selectively applying cross-attention, we achieve a latency reduction (Table 4) and a parameter size reduction without compromising quality.
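The on-device conversion step mentioned at the start of this section follows the standard TensorFlow Lite post-training dynamic range quantization flow, sketched below; the saved-model path and output file name are illustrative placeholders, not the paper's actual export.

```python
import tensorflow as tf

# Convert an exported rescorer SavedModel to TensorFlow Lite with post-training
# dynamic range quantization (weights stored in 8 bits, float activations).
converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/transformer_rescorer")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables dynamic range quantization
tflite_model = converter.convert()

with open("transformer_rescorer.tflite", "wb") as f:
    f.write(tflite_model)
```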
Table 3: Effect of cross-attention layers

Cross attention layers    WER
1st                        –
1st, 2nd                   –
1st, 3rd                   –
All 4 layers               –

Figure 3: Parallel rescoring with Transformer.

As illustrated in Figure 3, with hypothesis labels ready from the 1st-pass decoder output, the Transformer rescorer can finish its computation in a single batch step, as opposed to the series of sequential steps in the LSTM rescorer, and can therefore better leverage multi-threading during inference. The batch size for the Transformer rescorer corresponds to number of hyps × hyp length × number of attention heads. This large batch size provides better parallelism and as a result benefits more from using multiple threads, which further reduces latency (Table 4). The multi-threading benefit is not witnessed in the LSTM-based rescorer. Potentially this is due to (1) the limited parallelism in the LSTM, where batching is done within each inference step with a relatively smaller batch size of number of hyps × number of gates, and (2) the extra overhead of context switching across inference steps and layers when multi-threading is used within each inference step.

An overall breakdown of the latency optimizations is shown in Table 4. The Transformer rescorer achieves a large latency reduction compared to the LSTM rescorer, measured on the utterance at a high latency percentile for the LSTM rescorer. The initial latency of the Transformer rescorer with cross-attention in all layers improves further when only two cross-attentions are kept. Compared to the LSTM baseline, the latency improvement comes from the reduced FLOPs: the Transformer rescorer requires substantially fewer FLOPs than the LSTM rescorer, and removing cross-attention layers reduces the FLOPs further.
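For illustration, here is a minimal numpy sketch of the single-batch-step rescoring described above: assuming one forward pass of the rescorer has already produced per-position output logits for every hypothesis, the sequence-level scores fall out of a single vectorized gather-and-sum (names are illustrative).

```python
import numpy as np

def score_hypotheses(logits, hyp_tokens):
    """Scores all hypotheses in one batched step.

    logits: [num_hyps, hyp_len, vocab] rescorer outputs from a single forward
            pass over the right-shifted hypothesis tokens (no decoding loop).
    hyp_tokens: [num_hyps, hyp_len] integer hypothesis tokens to be scored.
    Returns one sequence-level log-probability score per hypothesis.
    """
    # Numerically stable log-softmax over the vocabulary dimension.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Gather log P(y_t | x, context) for every position of every hypothesis.
    per_token = np.take_along_axis(log_probs, hyp_tokens[..., None], axis=-1)[..., 0]
    return per_token.sum(axis=-1)

# Example with random logits for 8 hypotheses of length 12 over a toy vocabulary.
rng = np.random.default_rng(0)
scores = score_hypotheses(rng.normal(size=(8, 12, 64)), rng.integers(64, size=(8, 12)))
print(scores.shape)  # (8,)
```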
Using two threads reduces the latency by an additional margin for the Transformer rescorer, while the LSTM rescorer does not benefit from multi-threading.

Table 4: Computational latency for the Transformer rescorer with various optimizations, benchmarked on Pixel4 CPUs.
Optimizations                                       Latency (ms)
Initial latency (cross-attention in all layers)          –
2 cross-attention layers                                 –
Parallelism with two threads                             –
LSTM baseline                                            –
We also compare the latency distribution over the full benchmark suite, as shown in Figure 4. The benchmark covers a range of speech durations and output label sequence lengths. The Transformer rescorer is consistently faster than the LSTM rescorer at almost every latency percentile.

Figure 4: Latency comparison by percentile.
5. Conclusion
In this work we present a Transformer rescorer for a two-pass model. Our proposed Transformer rescorer substantially reduces the on-device computation latency of the second-pass model by taking advantage of the parallelism in the Transformer decoder and by reducing the number of cross-attention layers. On a Google Voice Search task the Transformer rescorer achieves a lower WER than an LSTM rescorer, and on Librispeech it also achieves lower WER on both test clean and test other than the LSTM rescorer.
6. Acknowledgements
We thank the TF-Lite team for their help getting the Transformer model running on device, especially T.J. Alumbaugh, Jared Duke, Jian Li, Feng Liu and Renjie Liu. We are also grateful for the insightful discussions with Shuo-yiin Chang, Ian McGraw, Tara Sainath and Yonghui Wu.

7. References

[1] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S.-y. Chang, K. Rao, and A. Gruenstein, "Streaming end-to-end speech recognition for mobile devices," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[2] T. N. Sainath, Y. He, B. Li, A. Narayanan, R. Pang, A. Bruguier, S.-y. Chang, W. Li, R. Alvarez, Z. Chen, et al., "A streaming on-device end-to-end model surpassing server-side conventional model quality and latency," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020.
[3] S.-Y. Chang, B. Li, D. Rybach, Y. He, W. Li, T. Sainath, and T. Strohman, "Low Latency Speech Recognition using End-to-End Prefetching," in Proc. of Interspeech, 2020.
[4] H. Inaguma, Y. Gaur, L. Lu, J. Li, and Y. Gong, "Minimum latency training strategies for streaming sequence-to-sequence asr," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6064–6068.
[5] T. N. Sainath, R. Pang, D. Rybach, Y. He, R. Prabhavalkar, W. Li, M. Visontai, Q. Liang, T. Strohman, Y. Wu, I. McGraw, and C.-C. Chiu, "Two-Pass End-to-End Speech Recognition," in Proc. of Interspeech, 2019.
[6] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell," 2015.
[7] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani, "State-of-the-art speech recognition with sequence-to-sequence models," in Proc. of ICASSP, 2018.
[8] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., pp. 1735–1780, Nov. 1997.
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems 30, 2017, pp. 5998–6008.
[10] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," JMLR, 2019.
[11] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang et al., "A comparative study on transformer vs rnn in speech applications," arXiv preprint arXiv:1909.06317, 2019.
[12] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar, "Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020.
[13] C.-F. Yeh, J. Mahadeokar, K. Kalgaonkar, Y. Wang, D. Le, M. Jain, K. Schubert, C. Fuegen, and M. L. Seltzer, "Transformer-transducer: End-to-end speech recognition with self-attention," 2019.
[14] L. Dong, S. Xu, and B. Xu, "Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition," in Proc. of ICASSP, 2018, pp. 5884–5888.
[15] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An asr corpus based on public domain audio books," in Proc. of ICASSP, 2015, pp. 5206–5210.
[16] A. Graves, "Sequence transduction with recurrent neural networks," CoRR, vol. abs/1211.3711, 2012.
[17] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. of ICASSP, 2013.
[18] M. Schuster and K. Nakajima, "Japanese and Korean Voice Search," in Proc. of ICASSP, 2012, pp. 5149–5152.
[19] R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C.-C. Chiu, and A. Kannan, "Minimum word error rate training for attention-based sequence-to-sequence models," in Proc. of ICASSP, Apr. 2018.
[20] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "Specaugment: A simple data augmentation method for automatic speech recognition," in Interspeech, 2019.
[21] D. S. Park, Y. Zhang, C.-C. Chiu, Y. Chen, B. Li, W. Chan, Q. V. Le, and Y. Wu, "Specaugment on large scale datasets," in ICASSP, 2020.
[22] B. Polyak and A. Juditsky, "Acceleration of Stochastic Approximation by Averaging," SIAM Journal on Control and Optimization, vol. 30, no. 4, 1992.
[23] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "Tensorflow: A System for Large-scale Machine Learning," pp. 265–283, 2016.
[24] J. Shen, P. Nguyen, Y. Wu, Z. Chen, et al., "Lingvo: a modular and scalable framework for sequence-to-sequence modeling," 2019.
[25] A. Narayanan, R. Prabhavalkar, C.-C. Chiu, D. Rybach, T. Sainath, and T. Strohman, "Recognizing Long-Form Speech Using Streaming End-to-End Models," in Proc. of ASRU, 2019.
[26] H. Liao, E. McDermott, and A. Senior, "Large scale deep neural network acoustic modeling with semi-supervised training data for youtube video transcription," in Proc. of ASRU, 2013.
[27] H. Soltau, H. Liao, and H. Sak, "Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition," in Proc. of Interspeech, 2017, pp. 3707–3711.
[28] B. Li, S.-Y. Chang, T. N. Sainath, R. Pang, Y. He, T. Strohman, and Y. Wu, "Towards fast and accurate streaming end-to-end asr," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6069–6073.
[29] S.-Y. Chang, R. Prabhavalkar, Y. He, T. N. Sainath, and G. Simko, "Joint endpointing and decoding with end-to-end models," in