Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition
Wei Li*, James Qin*, Chung-Cheng Chiu, Ruoming Pang, Yanzhang He
Google Inc., USA
{mweili, jamesqin, chungchengc, rpang, yanzhanghe}@google.com
Abstract
Recent advances in end-to-end models have allowed them to outperform conventional models by employing a two-pass model. The two-pass model provides better speed-quality trade-offs for on-device speech recognition: a 1st-pass model generates hypotheses in a streaming fashion, and a 2nd-pass model rescores the hypotheses with full audio sequence context. The 2nd-pass model plays a key role in the quality improvement that lets the end-to-end model surpass the conventional model. One main challenge of the two-pass model is the computation latency introduced by the 2nd-pass model. Specifically, the original design of the two-pass model uses LSTMs for the 2nd-pass model, which are subject to long latency because their recurrent nature forces inference to run sequentially. In this work we explore replacing the LSTM layers in the 2nd-pass rescorer with Transformer layers, which can process the entire hypothesis sequence in parallel and can therefore utilize the on-device computation resources more efficiently. Compared with an LSTM-based baseline, our proposed Transformer rescorer achieves a substantial latency reduction together with a quality improvement.
Index Terms: streaming speech recognition, Transformer, latency, rescoring
1. Introduction
There has been growing interest in building on-device streaming speech recognition models, which provide recognition results instantly as words are being spoken [1]. Such models make predictions based on partial context under strict latency requirements [2, 3, 4]. As a result, streaming models tend to be less accurate than non-streaming models, which have access to the entire utterance.
Previous work has shown that this issue can be alleviated by combining a second-pass rescoring model [5] with streaming models, where the rescoring model uses the Listen, Attend, and Spell (LAS) architecture [6]. LAS has access to the full context of the utterance and therefore provides better quality than the streaming models [7]. From the user's perspective, such a two-pass speech model exhibits the advantages of both streaming and non-streaming models: words are recognized as they are spoken and the final results have high accuracy.
The canonical architecture of the LSTM-based LAS model, however, is designed for beam search and is not efficient as a 2nd-pass rescoring model. The LSTM [8] layers process hypothesis tokens sequentially, with temporal dependency between timesteps. For 2nd-pass rescoring, on the other hand, all hypothesis tokens are already available, so a more efficient design of the rescorer model is to rescore all tokens in parallel.
* Equal contribution.
Figure 1: The architecture of the two-pass model with Transformer. (Components shown: acoustic frames x_1, ..., x_T, the RNN-T encoder and RNN-T decoder producing the streaming RNN-T hypothesis y_1, ..., y_s between SOS and EOS, the additional encoder outputs e_1, ..., e_T, and the Transformer rescorer.)
In recent years there has been growing success in applying Transformer [9] to machine translation and language modeling [10], as well as to speech recognition [11, 12, 13, 14]. Transformer applies self-attention to capture the sequential relations among input features and therefore does not have the recurrent constraint. This allows Transformer to compute self-attention in parallel and significantly increases computation efficiency. The Transformer architecture proposed in [9] consists of an encoder and a decoder, where each decoder layer has an additional cross-attention that summarizes the encoder output based on the self-attention output.
In this work, we address the sequential dependency issue of the original LSTM-based rescoring model with Transformer. Specifically, the paper proposes to use Transformer as the second-pass rescorer for parallel rescoring of hypothesis tokens. Unlike beam search, where the Transformer decoder still has to run autoregressively, the rescoring scenario allows parallel processing of the full hypothesis sequence. Such parallelism reduces the length of the temporal dependency path from O(n) to O(1), where n corresponds to the hypothesis length, and allows the Transformer rescorer to utilize on-device computation capacity much more efficiently. We further improve the inference speed of the Transformer rescorer by reducing the number of cross-attention layers in the decoder. The Transformer rescorer improves the Word Error Rate (WER) on Google's Voice Search query test set over LSTM rescoring. On Librispeech [15] the Transformer rescorer also improves the WER on both test clean and test other compared with LSTM rescoring. The second-pass latency, benchmarked on a Google Pixel4 phone on CPUs, is substantially reduced relative to LSTM rescoring.
Figure 2:
Transformer rescorer. The Transformer rescorer combines conventional Transformer decoders (containing cross-attention) and Transformer self-decoders (without cross-attention) for more efficient inference. The figure omits the normalization and residual links to simplify the illustration.
2. Transformer Rescorer
A two-pass model consists of a 1st-pass model and a 2nd-pass model. Here we use RNN-T [16, 17] as the 1st-pass model and Transformer for the 2nd-pass model. Specifically, our Transformer-based two-pass model, as demonstrated in Figure 1, consists of four components: RNN-T encoder, RNN-T decoder, additional encoder, and Transformer decoder as the rescorer. The input acoustic frames are denoted as x = (x_1, ..., x_T), where x_t ∈ R^d are stacked log-mel filterbank energies (d = 512) and T is the number of frames in x. In the 1st pass, each acoustic frame x_t is passed through the RNN-T encoder, consisting of a multi-layer LSTM [8], to get the encoder output. The RNN-T decoder takes the acoustic features from the RNN-T encoder and generates the hypotheses in a streaming fashion, denoted as y = (y_1, ..., y_s), where s is the label sequence length. Here y is a sequence of word-piece tokens [18]. In the 2nd pass, the full output of the RNN-T encoder is passed through a small additional encoder to generate e_1, ..., e_T, which is then passed to the Transformer decoder. The additional encoder is added because it is found useful for adapting the encoder output to be more suitable for the second-pass model [2]. The RNN-T model structure and the additional encoder are exactly the same as in [2]. During training, the Transformer decoder computes the output label sequence according to the full audio sequence e_1, ..., e_T. More details about rescorer training are given in Section 2.3. During decoding, the Transformer decoder rescores multiple top hypotheses from RNN-T, y_1, ..., y_s.
The architecture of our Transformer rescorer is based on the conventional Transformer decoder [9] with some cross-attention layers removed. The conventional Transformer decoder layer contains both the self-attention and the cross-attention, where the query of the cross-attention originates from the output of the self-attention. In the Transformer rescorer, we improve efficiency by removing the cross-attention from some decoder layers and interleaving those layers with the conventional decoder layers. The decoder layer without the cross-attention shares the same architecture as the conventional Transformer encoder layer [9]. The architecture of the resulting rescorer is illustrated in Figure 2, where layers without cross-attention are annotated as self-decoder. The Transformer rescorer takes the RNN-T hypothesis as input and feeds the tokens to the self-attention layers, while the cross-attention layers attend to the encoder output to summarize the acoustic signals. In our rescorer model there are four Transformer layers, each with attention model dimension d_model = 640 and feed-forward dimension d_ff = 2560. Both the cross-attention and self-attention layers use multi-headed attention. A minimal sketch of this interleaved layer stack is given below.
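The sketch below illustrates the interleaved stack in TensorFlow; it is not the paper's Lingvo implementation. The d_model and d_ff values follow the numbers above, while the head count, the post-norm residual placement, and the choice of which layers keep cross-attention (here the 1st and 3rd, consistent with the ablation in Section 4) are assumptions for illustration.

```python
import tensorflow as tf

D_MODEL, D_FF, NUM_HEADS, NUM_LAYERS = 640, 2560, 8, 4   # head count is assumed
CROSS_ATTENTION_LAYERS = {0, 2}  # assumed: keep cross-attention on the 1st and 3rd layers

class RescorerLayer(tf.keras.layers.Layer):
    """One rescorer layer: a 'decoder' layer if use_cross_attention is True,
    otherwise a 'self-decoder' layer (same shape as a Transformer encoder layer)."""

    def __init__(self, use_cross_attention):
        super().__init__()
        self.use_cross_attention = use_cross_attention
        self.self_attention = tf.keras.layers.MultiHeadAttention(
            NUM_HEADS, D_MODEL // NUM_HEADS)
        if use_cross_attention:
            self.cross_attention = tf.keras.layers.MultiHeadAttention(
                NUM_HEADS, D_MODEL // NUM_HEADS)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(D_FF, activation="relu"),
            tf.keras.layers.Dense(D_MODEL),
        ])
        # Post-norm residual placement assumed (the figure omits these details).
        self.norms = [tf.keras.layers.LayerNormalization() for _ in range(3)]

    def call(self, hyp_embeddings, encoder_outputs):
        # Causal self-attention over the hypothesis token embeddings.
        length = tf.shape(hyp_embeddings)[1]
        causal_mask = tf.linalg.band_part(tf.ones((length, length)), -1, 0)
        x = self.norms[0](hyp_embeddings + self.self_attention(
            hyp_embeddings, hyp_embeddings, attention_mask=causal_mask))
        if self.use_cross_attention:
            # Cross-attention summarizes the additional-encoder output e_1..e_T.
            x = self.norms[1](x + self.cross_attention(x, encoder_outputs))
        return self.norms[2](x + self.ffn(x))

rescorer_layers = [RescorerLayer(i in CROSS_ATTENTION_LAYERS)
                   for i in range(NUM_LAYERS)]
```

Because all hypothesis positions are fed at once, a single call over the whole hypothesis replaces the per-token recurrence of an LSTM rescorer.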
Our design of keeping only two cross-attention layers in the rescorer is based on observing the attention mechanism of the Transformer decoder. In the first Transformer decoder layer, the self-attention conditions only on the hypothesis tokens, so the resulting cross-attention generates its query solely based on language modeling information. The lack of acoustic information when generating the attention query inherently limits the effectiveness of the first cross-attention. After the first cross-attention layer, the output of the first decoder layer contains acoustic information, and the following decoder layers can condition on both the acoustic and language modeling information to generate effective cross-attention queries. Thus, it is critical to have the second cross-attention layer in the decoder. On the other hand, the additional cross-attention layers beyond the second one do not introduce additional modality and have diminishing returns in terms of model quality. As a comparison, the cross-attention of the LAS model conditions on both the previous attention context and the text tokens, and requires only one cross-attention in the decoder. We demonstrate these properties with an ablation study in Section 3.
As with the LAS rescoring training described in [5], the Transformer rescorer model is trained after the 1st-pass model training. During 2nd-pass training, the RNN-T encoder and RNN-T decoder are frozen. The additional encoder and the Transformer rescorer are trained in two stages: cross entropy (CE) and minimum word error rate (MWER) training [19]. During CE training, the frozen RNN-T encoder generates the acoustic features for the additional encoder, and the Transformer rescorer is trained to predict the groundtruth sequence given the full audio context from the additional encoder and the prefix of the label sequence context: p(y_l | x, y_1 ... y_{l-1}), where l is the label to predict. During MWER training, the Transformer rescorer is trained to re-rank the hypotheses generated from RNN-T, which bridges the gap from CE training to inference [5]. More specifically, given acoustic input x, groundtruth transcript y*, the probability computed by the rescorer model P(y_m | x) for any given target sequence y_m, and a set of hypotheses H_m = {h_1, ..., h_b} where b is the beam size, the MWER loss is defined as

L_{\mathrm{MWER}}(x, y^{*}) = \sum_{y_m \in H_m(x)} P'(y_m \mid x, H_m)\,\bigl[\,W'(y_m, y^{*}) - \widehat{W}\,\bigr],

where P'(y_m \mid x, H_m) = \frac{P(y_m \mid x)}{\sum_{y_i \in H_m} P(y_i \mid x)} represents the conditional probability the Transformer rescorer assigns to hypothesis y_m among all hypotheses in H_m, W'(y_m, y^{*}) is the number of word errors of y_m, and \widehat{W} is the average number of word errors among H_m. In our MWER training we use the N-best approximation approach for calculating the expected word errors [19].
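As a concrete illustration, here is a minimal numpy sketch of the N-best MWER loss above, assuming the per-hypothesis log-probabilities from the rescorer and the word-error counts against the reference are precomputed (names are illustrative, not from the paper's implementation).

```python
import numpy as np

def mwer_loss(log_probs, word_errors):
    """log_probs: rescorer log P(y_m | x) for each of the b hypotheses.
    word_errors: number of word errors W'(y_m, y*) for each hypothesis."""
    log_probs = np.asarray(log_probs, dtype=np.float64)
    word_errors = np.asarray(word_errors, dtype=np.float64)
    # P'(y_m | x, H_m): probabilities renormalized over the N-best list.
    p = np.exp(log_probs - log_probs.max())
    p /= p.sum()
    # Subtracting the mean word errors W_hat reduces variance of the
    # expected word-error objective without changing its minimizer.
    return float(np.dot(p, word_errors - word_errors.mean()))

# Example: a 4-best list whose hypotheses have 0, 1, 2 and 1 word errors.
print(mwer_loss([-1.2, -1.5, -2.0, -1.7], [0, 1, 2, 1]))
```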
Table 1: Librispeech test sets word error rate

Model                  Test clean   Test other
RNN-T only                 –            –
LSTM rescorer              –            –
Transformer rescorer       –            –
3. Quality Experiments
We conduct experiments on the Librispeech [15] dataset and a large-scale internal dataset. We use SpecAugment [20] with the same configuration as described in [21] during training. Similar to [2], we apply a constant learning rate and maintain an exponential moving average (EMA) [22] of the weights during training, and use the EMA weights for evaluation. Both the LSTM and Transformer rescorers are trained with CE and MWER. The N-best size of MWER training matches the rescoring behavior during evaluation, where the top hypotheses from RNN-T are used for rescoring. The prediction targets are wordpieces [18] derived from a large corpus of text transcripts. All models are implemented in TensorFlow [23] using the Lingvo [24] toolkit and trained on Tensor Processing Unit (TPU) slices with a large global batch size.

In this experiment, the models are trained on the Librispeech 960h training set and evaluated on the clean and noisy test sets without an external language model. In order to maintain low-latency streaming speech recognition, the 1st-pass RNN-T models in all compared systems use a uni-directional LSTM encoder with 0 right context frames. As shown in Table 1, both the LSTM rescorer and the Transformer rescorer significantly improve the WER on the clean and noisy test sets compared to the RNN-T only model, alleviating the limited-context problem of the 1st-pass model while still maintaining low-latency streaming recognition. The Transformer rescorer further improves the WER slightly over the LSTM rescorer, and also significantly reduces the 2nd-pass latency, which is studied in detail in Section 4.

We also perform a large-scale experiment on an internal task, Google Voice Search, and show that the proposed Transformer rescorer is effective there as well. In this experiment, the models are trained on a multi-domain training set as described in [25]. These multi-domain utterances span the domains of search, farfield, telephony and YouTube. The test set consists of Voice Search utterances (VS) extracted from Google traffic. All datasets are anonymized and hand-transcribed. The transcription of YouTube utterances is done in a semi-supervised fashion [26, 27]. Following [28, 29, 2], we train the first-pass RNN-T to also emit the end-of-sentence decision to reduce the endpointing latency, allowing 2nd-pass rescoring to execute early. As shown in Table 2, the Transformer rescorer improves the WER on the VS test set compared with the LSTM rescorer, both of which are trained with CE and MWER. Compared with the 1st-pass model, the Transformer rescorer achieves a clear relative WER improvement.

Table 2: Voice Search test set word error rate
Model                        VS
RNN-T only                   –
LSTM rescorer                –
Transformer rescorer CE      –
Transformer rescorer MWER    –
An additional capability that the Transformer rescorer can bring is to utilize the full hypothesis when rescoring every target token. The original LSTM-based rescorer scores each target token conditioned only on the tokens before it. Specifically, the LSTM rescorer learns a conditional probability p(y_t | x, y_1, ..., y_{t-1}) for each prediction target y_t, where y denotes hypothesis tokens from RNN-T and x denotes acoustic features. A conventional Transformer decoder uses causal self-attention and also learns p(y_t | x, y_1, ..., y_{t-1}). We explored extending the self-attention to also access the future label context, so that the model learns to score target tokens with p(y_t | x, y). During CE training, using the groundtruth sequence as the full context makes the training target trivial. Thus we randomly swap different proportions of the groundtruth tokens fed to the self-attention layers with alternative tokens sampled from the word-piece vocabulary. Sentinel tokens such as SOS, EOS, UNKNOWN and RNN-T's blank symbol are excluded from being used as random tokens. The prediction targets are the original groundtruth sequence. During MWER training, the RNN-T hypothesis is used as the decoder input to match the inference scenario. In this experiment, the best random swap proportion achieved the same WER on the Voice Search task as the causal model. Thus, we report results with causal self-attention for the experiments throughout the paper.
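A minimal sketch of the random token-swapping step described above is given below; the sentinel token ids, the swap probability and the toy vocabulary size are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical sentinel ids; the real values depend on the word-piece model.
BLANK_ID, SOS_ID, EOS_ID, UNK_ID = 0, 1, 2, 3
SENTINELS = {BLANK_ID, SOS_ID, EOS_ID, UNK_ID}

def corrupt_groundtruth(labels, vocab_size, swap_prob, rng):
    """Swaps a proportion of groundtruth tokens with random word-piece tokens.
    The corrupted sequence is fed to the (non-causal) self-attention layers,
    while the prediction targets remain the original groundtruth sequence."""
    corrupted = np.array(labels, copy=True)
    for i, token in enumerate(corrupted):
        if token in SENTINELS or rng.random() >= swap_prob:
            continue
        # Resample until we draw a non-sentinel replacement token.
        candidate = rng.integers(vocab_size)
        while candidate in SENTINELS:
            candidate = rng.integers(vocab_size)
        corrupted[i] = candidate
    return corrupted

# Example: corrupt roughly 20% of a toy label sequence over a toy vocabulary.
rng = np.random.default_rng(0)
print(corrupt_groundtruth([1, 17, 805, 42, 2], vocab_size=4096, swap_prob=0.2, rng=rng))
```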
4. Latency Optimizations
In this section, we measure the additional latency introduced by the 2nd-pass rescorer on a Google Pixel4 phone on CPUs. For efficient on-device execution, all models are converted to TensorFlow Lite format with post-training dynamic range quantization using the TensorFlow Lite Converter [30]. Matrix multiplication is operated in 8 bits with little accuracy loss. The benchmark suite consists of 89 utterances with voice action queries. The LSTM rescorer latency baseline is fully optimized and is measured with the lattice rescoring with batching described in [2].

We investigate the impact of the number of cross-attention layers on quality and latency. As shown in Table 3, we start with cross-attention on the 1st decoder layer and gradually add more. We observe a noticeable quality improvement at first, which later quickly diminishes. Specifically, with two cross-attentions the rescorer achieves a WER improvement over one cross-attention, but no further improvement is realized by adding more. In addition, when two cross-attentions are used, we find that applying them on the 1st and 3rd layers yields a better WER than on the 1st and 2nd layers. In the end, by selectively applying cross-attention, we achieve a latency reduction (Table 4) and a parameter size reduction without compromising quality.
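The on-device conversion step mentioned at the start of this section follows the standard TensorFlow Lite post-training dynamic range quantization flow, sketched below; the saved-model path and output file name are illustrative placeholders, not the paper's actual export.

```python
import tensorflow as tf

# Convert an exported rescorer SavedModel to TensorFlow Lite with post-training
# dynamic range quantization (weights stored in 8 bits, float activations).
converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/transformer_rescorer")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables dynamic range quantization
tflite_model = converter.convert()

with open("transformer_rescorer.tflite", "wb") as f:
    f.write(tflite_model)
```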
Table 3: Effect of cross-attention layers

Cross attention layers    WER
1st                        –
1st, 2nd                   –
1st, 3rd                   –
All 4 layers               –

Figure 3: Parallel rescoring with Transformer.

As illustrated in Figure 3, with hypothesis labels ready from the 1st-pass decoder output, the Transformer rescorer can finish its computation in a single batch step, as opposed to the series of sequential steps in the LSTM rescorer, and can therefore better leverage multi-threading during inference. The batch size for the Transformer rescorer corresponds to number of hyps × hyp length × number of attention heads. This large batch size provides better parallelism and as a result benefits more from using multiple threads, which further reduces latency (Table 4). The multi-threading benefit is not witnessed in the LSTM-based rescorer. Potentially this is due to (1) the limited parallelism in the LSTM, where batching is done within each inference step with a relatively smaller batch size of number of hyps × number of gates, and (2) the extra overhead of context switching across inference steps and layers when multi-threading is used within each inference step.

An overall breakdown of the latency optimizations is shown in Table 4. The Transformer rescorer achieves a large latency reduction compared to the LSTM rescorer, measured on the utterance at a high latency percentile for the LSTM rescorer. The initial latency of the Transformer rescorer with cross-attention in all layers improves further when only two cross-attentions are kept. Compared to the LSTM baseline, the latency improvement comes from the reduced FLOPs: the Transformer rescorer requires substantially fewer FLOPs than the LSTM rescorer, and removing cross-attention layers reduces the FLOPs further.
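For illustration, here is a minimal numpy sketch of the single-batch-step rescoring described above: assuming one forward pass of the rescorer has already produced per-position output logits for every hypothesis, the sequence-level scores fall out of a single vectorized gather-and-sum (names are illustrative).

```python
import numpy as np

def score_hypotheses(logits, hyp_tokens):
    """Scores all hypotheses in one batched step.

    logits: [num_hyps, hyp_len, vocab] rescorer outputs from a single forward
            pass over the right-shifted hypothesis tokens (no decoding loop).
    hyp_tokens: [num_hyps, hyp_len] integer hypothesis tokens to be scored.
    Returns one sequence-level log-probability score per hypothesis.
    """
    # Numerically stable log-softmax over the vocabulary dimension.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Gather log P(y_t | x, context) for every position of every hypothesis.
    per_token = np.take_along_axis(log_probs, hyp_tokens[..., None], axis=-1)[..., 0]
    return per_token.sum(axis=-1)

# Example with random logits for 8 hypotheses of length 12 over a toy vocabulary.
rng = np.random.default_rng(0)
scores = score_hypotheses(rng.normal(size=(8, 12, 64)), rng.integers(64, size=(8, 12)))
print(scores.shape)  # (8,)
```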
Using two threads reduces the latency by an additional margin for the Transformer rescorer, while the LSTM rescorer does not benefit from multi-threading.

Table 4: Computational latency for the Transformer rescorer with various optimizations, benchmarked on Pixel4 CPUs.
Optimizations                                       Latency (ms)
Initial latency (cross-attention in all layers)          –
2 cross-attention layers                                 –
Parallelism with two threads                             –
LSTM baseline                                            –
We also compare the latency distribution over the full benchmark suite, as shown in Figure 4. The benchmark covers a range of speech durations and output label sequence lengths. The Transformer rescorer is consistently faster than the LSTM rescorer at almost every latency percentile.

Figure 4: Latency comparison by percentile.
5. Conclusion
In this work we present a Transformer rescorer for a two-pass model. Our proposed Transformer rescorer substantially reduces the on-device computation latency of the second-pass model by taking advantage of the parallelism in the Transformer decoder and by reducing the number of cross-attention layers. On a Google Voice Search task the Transformer rescorer achieves a lower WER than an LSTM rescorer, and on Librispeech it also achieves lower WER on both test clean and test other than the LSTM rescorer.
6. Acknowledgements
We thank the TF-Lite team for their help getting the Transformer model running on device, especially T.J. Alumbaugh, Jared Duke, Jian Li, Feng Liu and Renjie Liu. We are also grateful for the insightful discussions with Shuo-yiin Chang, Ian McGraw, Tara Sainath and Yonghui Wu.

7. References

[1] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S.-y. Chang, K. Rao, and A. Gruenstein, "Streaming end-to-end speech recognition for mobile devices," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[2] T. N. Sainath, Y. He, B. Li, A. Narayanan, R. Pang, A. Bruguier, S.-y. Chang, W. Li, R. Alvarez, Z. Chen, et al., "A streaming on-device end-to-end model surpassing server-side conventional model quality and latency," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020.
[3] S.-Y. Chang, B. Li, D. Rybach, Y. He, W. Li, T. Sainath, and T. Strohman, "Low Latency Speech Recognition using End-to-End Prefetching," in Proc. of Interspeech, 2020.
[4] H. Inaguma, Y. Gaur, L. Lu, J. Li, and Y. Gong, "Minimum latency training strategies for streaming sequence-to-sequence asr," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6064–6068.
[5] T. N. Sainath, R. Pang, D. Rybach, Y. He, R. Prabhavalkar, W. Li, M. Visontai, Q. Liang, T. Strohman, Y. Wu, I. McGraw, and C.-C. Chiu, "Two-Pass End-to-End Speech Recognition," in Proc. of Interspeech, 2019.
[6] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell," 2015.
[7] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani, "State-of-the-art speech recognition with sequence-to-sequence models," in Proc. of ICASSP, 2018.
[8] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., pp. 1735–1780, Nov. 1997.
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems 30, 2017, pp. 5998–6008.
[10] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," JMLR, 2019.
[11] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang et al., "A comparative study on transformer vs rnn in speech applications," arXiv preprint arXiv:1909.06317, 2019.
[12] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar, "Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020.
[13] C.-F. Yeh, J. Mahadeokar, K. Kalgaonkar, Y. Wang, D. Le, M. Jain, K. Schubert, C. Fuegen, and M. L. Seltzer, "Transformer-transducer: End-to-end speech recognition with self-attention," 2019.
[14] L. Dong, S. Xu, and B. Xu, "Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition," in Proc. of ICASSP, 2018, pp. 5884–5888.
[15] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An asr corpus based on public domain audio books," in Proc. of ICASSP, 2015, pp. 5206–5210.
[16] A. Graves, "Sequence transduction with recurrent neural networks," CoRR, vol. abs/1211.3711, 2012.
[17] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. of ICASSP, 2013.
[18] M. Schuster and K. Nakajima, "Japanese and Korean Voice Search," in Proc. of ICASSP, 2012, pp. 5149–5152.
[19] R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C.-C. Chiu, and A. Kannan, "Minimum word error rate training for attention-based sequence-to-sequence models," in Proc. of ICASSP, Apr. 2018.
[20] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "Specaugment: A simple data augmentation method for automatic speech recognition," in Interspeech, 2019.
[21] D. S. Park, Y. Zhang, C.-C. Chiu, Y. Chen, B. Li, W. Chan, Q. V. Le, and Y. Wu, "Specaugment on large scale datasets," in ICASSP, 2020.
[22] B. Polyak and A. Juditsky, "Acceleration of Stochastic Approximation by Averaging," SIAM Journal on Control and Optimization, vol. 30, no. 4, 1992.
[23] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "Tensorflow: A System for Large-scale Machine Learning," pp. 265–283, 2016.
[24] J. Shen, P. Nguyen, Y. Wu, Z. Chen, et al., "Lingvo: a modular and scalable framework for sequence-to-sequence modeling," 2019.
[25] A. Narayanan, R. Prabhavalkar, C.-C. Chiu, D. Rybach, T. Sainath, and T. Strohman, "Recognizing Long-Form Speech Using Streaming End-to-End Models," in Proc. of ASRU, 2019.
[26] H. Liao, E. McDermott, and A. Senior, "Large scale deep neural network acoustic modeling with semi-supervised training data for youtube video transcription," in Proc. of ASRU, 2013.
[27] H. Soltau, H. Liao, and H. Sak, "Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition," in Proc. of Interspeech, 2017, pp. 3707–3711.
[28] B. Li, S.-Y. Chang, T. N. Sainath, R. Pang, Y. He, T. Strohman, and Y. Wu, "Towards fast and accurate streaming end-to-end asr," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6069–6073.
[29] S.-Y. Chang, R. Prabhavalkar, Y. He, T. N. Sainath, and G. Simko, "Joint endpointing and decoding with end-to-end models," in