Online Automatic Speech Recognition with Listen, Attend and Spell Model

Roger Hsiao*, Dogan Can*, Tim Ng*, Ruchir Travadi and Arnab Ghoshal

R. Hsiao, D. Can, T. Ng, R. Travadi and A. Ghoshal are with Apple Inc. *: These authors contributed equally to this work.
Abstract—The Listen, Attend and Spell (LAS) model and other attention-based automatic speech recognition (ASR) models have known limitations when operated in a fully online mode. In this paper, we analyze the online operation of LAS models to demonstrate that these limitations stem from the handling of silence regions and the reliability of the online attention mechanism at the edge of input buffers. We propose a novel and simple technique that can achieve fully online recognition while meeting accuracy and latency targets. For the Mandarin dictation task, our proposed approach can achieve a character error rate in online operation that is within 4% relative to an offline LAS model. The proposed online LAS model operates at 12% lower latency relative to a conventional neural network hidden Markov model hybrid of comparable accuracy. We have validated the proposed method through a production scale deployment, which, to the best of our knowledge, is the first such deployment of a fully online LAS model.
Index Terms —end-to-end ASR, online recognition
I. INTRODUCTION

The Listen, Attend, and Spell (LAS) model is a widely-used architecture for so-called end-to-end automatic speech recognition (ASR) [1]. Compared to other end-to-end methods proposed in the literature, LAS tends to be easier to train and can achieve higher accuracy. It has been shown that LAS may outperform deep neural network based hidden Markov model (DNN-HMM) [2], [3] hybrids in large scale ASR tasks [4]. However, online recognition with such a model remains a challenge because the generation of outputs from the model is not synchronized with the consumption of inputs by the model (referred to as asynchronous decoding subsequently). In recent years, progress has been made with alternatives like the RNN transducer (RNN-T) [5], [6] that allows time synchronous decoding, multi-model techniques [7], [8], [9], [10], [11], [12] where a time-synchronous model like RNN-T or CTC provides timing information to the LAS decoder, or specially designed loss functions that aim to penalize latency [13], [12]. While these approaches are effective, they can involve more complicated training algorithms [14], [13] or decoding procedures [10], [11], [12]. In this paper, we aim to understand and address the fundamental issues related to online recognition for LAS models and create a stand-alone LAS model that is capable of online recognition.

Online recognition with LAS models requires three components: 1) streamable encoder, 2) online attention mechanism and 3)
online asynchronous decoding. The first two issues have been sufficiently addressed in the literature. It has been proposed to use a uni-directional encoder or a latency controlled bi-directional encoder [15]. For a streamable attention mechanism, there are multiple proposals including, but not limited to, monotonic attention [16], monotonic chunkwise attention (MoChA) [17] and local attention [18]. However, addressing these two issues is insufficient for practical use of LAS in online recognition, without also addressing issues related to asynchronous decoding. This is a key difference compared to the time synchronous decoding adopted by DNN-HMM hybrids, CTC, or RNN-T, where the time synchronous decoder creates a monotonic map between audio features and output tokens as it consumes audio features. We show in later sections that this asynchronous nature is a major obstacle for LAS and similar attention-based models when performing online recognition. For such models, the ability to do online decoding is contingent upon the model learning when to terminate the decoding process.

This paper has two main contributions. First, we perform a thorough analysis of the main obstacles for online recognition with LAS models. Second, we propose a novel, yet simple approach that allows the LAS model to perform online recognition. The underlying concept is based on silence modeling and a buffering scheme that is applicable to any end-to-end model that uses asynchronous decoding, especially a LAS model that uses an online attention mechanism, like MoChA and local attention. We think this work could contribute to advancing end-to-end technology, and could be useful to anyone who wants to deploy such technology.
II. LAS MODEL AND ONLINE ATTENTION
LAS computes the posterior probability of a sentence given a sequence of acoustic features by

P(Y | X) = \prod_{i=1}^{N} P(y_i | X, y_{1,...,i-1})    (1)

where X = (x_1, x_2, ..., x_T) is a sequence of T acoustic features and Y = (y_1, y_2, ..., y_N) is a sequence of N output tokens. An output token could be a character or a subword, and the token set also includes the beginning of sentence (BOS) and end of sentence (EOS) symbols.

The probability of each predicted token, P(y_i | X, y_{1,...,i-1}), is computed by an encoder and a decoder. The encoder maps the acoustic features into a sequence of hidden features. The mapping can be achieved through a pyramidal RNN [1],

h_t^k, r_t^k = RNN^k([h_{2t}^{k-1}, h_{2t+1}^{k-1}], r_{t-1}^k)    if k > 1
h_t^1, r_t^1 = RNN^1(x_t, r_{t-1}^1)                               if k = 1    (2)

where h_t^k is the hidden feature of the k-th encoder layer and r_t^k is the RNN internal state at time t. Each RNN layer takes the outputs of the previous layer by concatenating neighboring features. As a result, the length of the output sequence of each layer is half of the corresponding input sequence. After K layers of encoder RNN, the decoder applies an attention mechanism to compute

c_i = a(s_i, h_{1:T'})    (3)

where c_i is the context vector for the i-th decoding step computed by the attention mechanism a; s_i is the internal state of the decoder; h_{1:T'} is the sequence of output hidden features from the pyramidal RNN, whose length T' is the input length T reduced by a factor of two per pyramidal layer. This context vector is then used by the decoder RNN to compute

o_i, s_i = RNN(s_{i-1}, y_{i-1}, c_{i-1})    (4)

where o_i is the output of the decoder RNN at the i-th decoding step, and s_i is the internal state of the decoder RNN. Finally, we compute the posterior probability from o_i and c_i using a function f, which can be as simple as a softmax over an affine transform that projects o_i and/or c_i,

P(y_i | X, y_{1,...,i-1}) = f(o_i, c_i)    (5)

The attention mechanism described by equation (3) is global if a can access the entire utterance h_{1:T'} at any decoding step. This is not compatible with online recognition for an obvious reason. The monotonic attention of [16] limits the visible region of an utterance by allowing one and only one frame of encoder output to be accessible by the decoder at each decoding step, and the encoder output chosen at step i + 1 has to be either the same frame chosen at step i, or some future frame. Therefore, it ensures that the attention mechanism is monotonic. MoChA attention [17] can be considered an extension of monotonic attention. Instead of only allowing one frame to be accessible, MoChA attention allows the decoder to access a chunk at a time, and monotonicity is guaranteed at the chunk level.
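For concreteness, the following is a minimal PyTorch sketch of the computation in equations (1)-(5). It is illustrative only: a plain dot-product attention stands in for the monotonic and MoChA mechanisms discussed above, beam search over equation (1) is omitted, and the module names and sizes are placeholders rather than the model evaluated in Section V.

```python
# Minimal sketch of the LAS computation in Eqs. (1)-(5) using PyTorch.
# Sizes and names are illustrative; a global dot-product attention stands in
# for the MoChA attention used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidalEncoder(nn.Module):
    """Stack of LSTM layers; each layer above the first concatenates
    neighboring frames, halving the sequence length (Eq. 2)."""
    def __init__(self, feat_dim=40, hidden=256, layers=3):
        super().__init__()
        self.rnns = nn.ModuleList()
        for k in range(layers):
            in_dim = feat_dim if k == 0 else 2 * hidden
            self.rnns.append(nn.LSTM(in_dim, hidden, batch_first=True))

    def forward(self, x):                          # x: (batch, T, feat_dim)
        h = x
        for k, rnn in enumerate(self.rnns):
            if k > 0:                              # concatenate neighboring frames
                if h.size(1) % 2:                  # drop a trailing frame if needed
                    h = h[:, :-1, :]
                h = h.reshape(h.size(0), h.size(1) // 2, 2 * h.size(2))
            h, _ = rnn(h)
        return h                                   # (batch, T', hidden)

class Decoder(nn.Module):
    """One decoding step: attention (Eq. 3), decoder RNN (Eq. 4), softmax (Eq. 5)."""
    def __init__(self, vocab, hidden=256, emb=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTMCell(emb + hidden, hidden)
        self.out = nn.Linear(2 * hidden, vocab)

    def step(self, y_prev, state, c_prev, enc):
        # Eq. 4: update decoder state from previous token, context and state
        s, m = self.rnn(torch.cat([self.embed(y_prev), c_prev], dim=-1), state)
        # Eq. 3: dot-product attention over the encoder output (global here)
        scores = torch.einsum("bh,bth->bt", s, enc)
        alpha = F.softmax(scores, dim=-1)
        c = torch.einsum("bt,bth->bh", alpha, enc)
        # Eq. 5: softmax over a projection of decoder output and context
        logp = F.log_softmax(self.out(torch.cat([s, c], dim=-1)), dim=-1)
        return logp, (s, m), c, alpha

# Tiny usage example: 80 feature frames -> 20 encoder frames with 3 layers
enc = PyramidalEncoder()(torch.randn(1, 80, 40))
dec = Decoder(vocab=7975)
logp, state, c, alpha = dec.step(torch.tensor([1]), None, torch.zeros(1, 256), enc)
```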
III. SYNCHRONOUS AND ASYNCHRONOUS DECODING

We call a decoding process time synchronous if, as it consumes audio features, it always generates a monotonic many-to-one mapping between the features and the output targets. For example, the Viterbi decoder for a DNN-HMM hybrid produces a monotonic map between every frame of audio features and the HMM states (usually tied context-dependent states or senones) that are the outputs of the DNN. The same is true for CTC or RNN-T, except that the output targets for such models may be letters or word pieces and a special blank symbol. In such cases, the decoding process stops once it consumes all audio features.

In contrast, the decoding process of LAS and other attention-based ASR models is asynchronous and it does not guarantee that any output is generated as inputs are consumed. Although the model creates the encoded representation of the inputs and may even calculate the attention weights, the decoding process may require the consumption of an arbitrary length of input before generating any output. Similarly, after having consumed a sufficient length of inputs, the decoding process may generate an arbitrary length of outputs. In other words, given a segment of speech, an asynchronous decoder could produce none or an arbitrary number of output tokens. Such a condition is not possible with time synchronous decoders. As a result, the stopping criterion of such a decoding process cannot be based on the exhaustion of input audio, but on the model learning when to self-terminate. For the LAS model, the BOS and EOS symbols exist for this reason.
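To make the distinction concrete, the two stopping criteria can be contrasted schematically. The decoder interfaces below (score_frame, encode, decode_step) are hypothetical placeholders and beam search is omitted; only the loop termination matters here.

```python
# Schematic contrast of the two stopping criteria. The model interfaces are
# hypothetical placeholders; only the termination condition differs.

def time_synchronous_decode(frames, model):
    """DNN-HMM / CTC / RNN-T style: one pass over the frames; decoding stops
    when the input audio is exhausted."""
    tokens = []
    for frame in frames:
        tokens.append(model.score_frame(frame))    # monotonic frame-to-output map
    return tokens

def asynchronous_decode(frames, model, max_steps=1000):
    """LAS style: the number of decoding steps is decoupled from the number of
    frames, so the only real exit is the model emitting EOS."""
    enc = model.encode(frames)
    tokens = []
    for _ in range(max_steps):                     # safety bound, not a stopping criterion
        y = model.decode_step(enc, tokens)         # may attend anywhere in enc
        if y == "<eos>":
            break                                  # self-termination
        tokens.append(y)
    return tokens
```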
It is important to appreciate this key difference in order to understand why online recognition is difficult with asynchronous decoding. With LAS models, even though we may use monotonic or MoChA attention to guarantee monotonicity, without controlling the way the model learns to self-terminate it is not possible to guarantee complete online operation.

Fig. 1. A hypothetical example showing the differences between training and inference. In this utterance, the reference is "<BOS> a b <EOS>" and there is a long silence region between a and b. The upper part and the lower part show the training and inference process respectively. The black boxes show what the model is processing at each step (indexed by i) and the orange dotted line is the peak of the attention at different steps. This example shows that during inference, the model cannot jump to the next segment as it does in training, since the features have not arrived yet.

Figure 1 illustrates the difficulty of online recognition by comparing the training and inference processes. As shown in the example, this utterance consists of two speech segments, and there is a long silence region in between. During training, the LAS model can jump from the first segment to the second one, since the entire utterance is visible at training time. This is true even for monotonic and MoChA attention, since they compute expected attended regions for each output token during training. As a result, the model may learn to skip attending to the silence segments, and there is no constraint to prohibit this kind of behavior.

The situation, however, is different during online recognition. As the acoustic features are being streamed, the asynchronous decoder can only look at a portion of the utterance. As long as the silence segment is long enough, the decoder only sees a trailing silence after the first speech segment (shown in the gray area in figure 1). At this point, the LAS model may terminate the decoding prematurely by emitting an EOS symbol, even though there are still incoming features. In our experience, this leads to a large number of deletion errors, and it is the main obstacle preventing the LAS model from performing online recognition.

Fig. 2. A hypothetical example with explicit silence modeling. Compared to figure 1, we allow the model to output silence tokens and insert silence labels into the reference. As a result, during online recognition, the model outputs a silence token instead of an EOS token when it sees a trailing silence segment.
IV. SILENCE MODELING FOR LAS

Appreciating the issues related to the asynchronous nature of attention-based decoders allows us to view the existing methods in the literature as forcing a greater degree of time-synchrony on attention-based decoding. This includes methods like triggered attention [9], as well as those that penalize the generation latency [13], [12]. We propose that the same effect can be achieved simply by modeling the silence regions and having the LAS model generate silence tokens. We notice that silence modeling also improves the end of utterance latency without explicitly controlling for it (cf. Section V).

We insert silence tokens into the references so that the training process teaches the model when, and how often, it should output a silence token. This idea is similar to the explicit silence models used in HMM-based ASR systems [19]. In HMM-based ASR, the silence model often has multiple states and loops so it can generate multiple silence frames. Our idea is similar in principle. By adding multiple silence tokens, the LAS model learns to output silence tokens repeatedly depending on the length of the silence segment. The only difference compared to DNN-HMM is that we help the model learn this only through the data instead of having an explicit structure in the model.

Figure 2 is an example to explain the idea. In this example, we insert multiple silence tokens into the reference. Each silence token corresponds to a fixed number of silence frames, and each silence segment could have one or more silence tokens depending on its length. This number of frames per silence label is called the duration of silence in this paper. After training, the model learns to output silence tokens when it sees a silence segment. Therefore, during online recognition, the decoder can choose to output a silence token instead of an EOS symbol in figure 2. To insert silence tokens into the references, we can use a hybrid DNN-HMM system to perform forced alignment. Based on the alignment information, we insert a silence token for every N silence frames. This preprocessing procedure requires a hybrid DNN-HMM model. In practice, this is not a problem, since one can build a useful hybrid DNN-HMM system with a small amount of data [20] and the pronunciation lexicon can be replaced by a graphemic dictionary [21], [22].
While adding this explicit silence model helps online recognition, the model could still output an EOS symbol prematurely. In figure 2, at decoding step i = 2, the model could output EOS instead of silence. To help reduce this confusion, we propose a buffering scheme for online recognition. In this buffering scheme, the audio data is streamed to the decoder in batches of T ms. For each batch, we designate a restricted buffer at the end of the batch. During decoding, if the peak of the attention distribution of any attention head falls within this restricted buffer, this suggests the attention may benefit from more data, and so the decoding for this batch is voided and backtracked. The decoder then waits for more audio data, unless it is the end of an utterance, in which case decoding proceeds until the model emits EOS.

The purpose of this buffering scheme is to ensure there is enough audio data, so the decoder does not output a token when the corresponding audio is truncated. This idea can be extended so that the buffer size depends on the previous output token. For example, if the decoder is processing a silence segment, we might use a larger buffer so the model is less likely to emit EOS prematurely. Therefore, we can have different buffer sizes for regular and silence tokens. When the encoder has processed all audio features, the buffering scheme is disabled so the decoder can access all the encoder outputs and finish the decoding process. Generally speaking, a smaller buffer means lower latency; however, the buffer size does not imply a lower bound on the latency, since the last batch may be smaller.
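One possible realization of this buffering scheme is sketched below. It assumes a streaming decoder that exposes the attention peak position for each emitted token and supports state snapshots for backtracking; these interfaces, the per-token (rather than per-batch) check, and the omission of beam search and termination safeguards are our simplifications, while the sizes mirror Section V (320 ms audio batches, 480 ms regular buffer, a larger buffer after a silence token).

```python
# One possible realization of the buffering scheme, assuming a streaming decoder
# with hypothetical encode / decode_one_token / save_state / restore_state methods.
FRAME_MS = 10 * 8          # encoder frame duration: 10 ms features after 8x reduction

def frames(ms):
    return ms // FRAME_MS

def streaming_decode(batches, decoder, min_buf_ms=480, sil_buf_ms=960):
    """batches: iterable of (is_last, audio_batch) covering one utterance."""
    enc, hyp = [], []
    for is_last, batch in batches:                 # each batch is e.g. 320 ms of features
        enc.extend(decoder.encode(batch))          # append encoder output to the buffer
        while True:
            # restricted buffer size depends on the previously emitted token
            buf_ms = sil_buf_ms if hyp and hyp[-1] == "<sil>" else min_buf_ms
            restricted_start = len(enc) - frames(buf_ms)
            if restricted_start <= 0 and not is_last:
                break                              # buffer still too small; wait for audio
            snapshot = decoder.save_state()
            token, peak = decoder.decode_one_token(enc, hyp)
            if not is_last and peak >= restricted_start:
                decoder.restore_state(snapshot)    # void this step and backtrack
                break                              # wait for the next audio batch
            hyp.append(token)
            if token == "<eos>":
                return hyp
    return hyp
```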
V. EXPERIMENTAL RESULTS

We test our proposed approach on our proprietary Mandarin voice assistant and dictation tasks. We use a research training set consisting of 5500 hours of transcribed audio, sampled from both use cases. We evaluate the models on two test sets, one for the voice assistant use case and another for the dictation use case, each with 10 hours of audio data.

For the audio features, we extract 40 dimensional filter-bank coefficients with a standard 25ms window and 10ms frame shift. The LAS model has three layers of pyramidal LSTM with a reduction factor of two for each layer. Each encoder LSTM has 1200 hidden units and a 600 dimensional recurrent projection. The decoder uses MoChA attention with a chunk size of three, and the dimension of the context vector is 600. The decoder has three LSTM layers, each with 800 hidden units and a 300 dimensional recurrent projection. The output layer is 7975 dimensional and each output corresponds to a Chinese character, Latin character, digit, punctuation, a special token like BOS and EOS, or a silence token. The entire model has around 64 million parameters.

To train the model, we use the block momentum algorithm [23] to optimize the model with the cross entropy loss. The training uses scheduled sampling [24] with a probability of 0.2, and it also uses spectral augmentation [25] and label smoothing [26] with a smoothing factor of 0.2. During decoding, we use a decoding beam of eight, unless otherwise specified. The audio data is streamed to the decoder in 320ms batches. Each batch of audio data is processed by the model's encoder, and the encoder output is then appended to the buffer. As mentioned in section IV, decoding can only happen when the buffer is larger than a predefined minimum buffer size.
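For reference, the setup described above can be collected into a single configuration sketch; the key names below are illustrative, while the values are those reported in this section.

```python
# The experimental setup of this section as a config dict. Key names are
# illustrative; values are the ones reported in the paper.
config = {
    "features": {"type": "filterbank", "dim": 40, "window_ms": 25, "shift_ms": 10},
    "encoder": {"type": "pyramidal LSTM", "layers": 3, "reduction_per_layer": 2,
                "hidden_units": 1200, "recurrent_projection": 600},
    "attention": {"type": "MoChA", "chunk_size": 3, "context_dim": 600},
    "decoder": {"type": "LSTM", "layers": 3,
                "hidden_units": 800, "recurrent_projection": 300},
    "output": {"vocab_size": 7975},   # characters, digits, punctuation, BOS/EOS, <sil>
    "training": {"optimizer": "block momentum", "loss": "cross entropy",
                 "scheduled_sampling_prob": 0.2, "label_smoothing": 0.2,
                 "spec_augment": True},
    "decoding": {"beam": 8, "audio_batch_ms": 320},
}
```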
Fig. 3. An experiment on the dictation evaluation set to evaluate how the duration of silence affects online recognition. The black dotted lines are the error rates obtained by restarting the decoder when the baseline model emits an end of sentence token prematurely. The beam is set to eight for this experiment.

Figure 3 studies how different durations of silence, which is the length of the silence segment represented by each silence label, interact with different minimum buffer sizes. The results give us some interesting insights. First, without an explicit silence model ("no silence label"), although the LAS model achieves 10.1% character error rate (CER) on our dictation test set for offline decoding, the accuracy is greatly degraded for online recognition regardless of the minimum buffer size (all over 20.0% CER). The reason for the much higher CER is a high deletion rate (12-15%), and these deletions are caused by the early stopping discussed in section III. For this baseline model, we also try to restart the decoder whenever the model emits an EOS token while there are still unconsumed audio features. While this technique alleviates part of the deletion problem, it still shows significant degradation compared to the offline accuracy. This confirms that adopting an online attention mechanism and a uni-directional encoder alone is not enough for the LAS model to perform online decoding. Second, we find that with silence modeling, the accuracy gap between online and offline decoding is greatly reduced. For a 240ms duration of silence, the CER is 10.4%-11.3% depending on the minimum buffer size, which is close to the offline result of 10.1% CER. Third, while a bigger buffer generally improves accuracy, it might impact latency, so a smaller duration of silence is favorable as it allows a smaller buffer.
TABLE I
CER (%) OF DIFFERENT SILENCE BUFFER SIZES. THE BUFFER FOR OTHER TOKENS IS 480 MS AND THE SILENCE DURATION IS 240 MS.

Test set  | Offline | sil buf=480ms | sil buf=640ms | sil buf=800ms
Assistant |   9.3   |     11.9      |     10.9      |      9.8
Dictation |  10.1   |     11.3      |     11.1      |     10.5

In table I, we investigate the effect of having different buffer sizes for speech and silence segments. As discussed in section IV, the purpose is to use a larger buffer for silence segments to reduce the risk of generating an EOS symbol prematurely. While the regular minimum buffer is set to 480ms, we use a larger buffer when the last output is silence. This adjustment improves the CER from 11.3% to 10.5% for the dictation test set and from 11.9% to 9.8% for the assistant test set.
Fig. 4. An experiment on the dictation evaluation set to compare consumer perceived latency (CPL) vs character error rate (CER). For each beam, we perform a grid search over the minimum buffer size and silence buffer size. This measurement is done on an Intel Xeon E5-2640 2.4GHz server. The black triangle is the chosen operating point; its CER is 10.5% and the average CPL is 320ms.

Figure 4 studies how latency is related to accuracy given our findings so far. In this experiment, we enumerate the decoder settings, including beam size, minimum buffer size and silence buffer size. Then, for each setting, we measure the CER and latency. In this work, latency is the time between the last word spoken and the time when the decoder no longer changes the hypothesis presented to the user. This definition aims to capture the latency perceived by the user and hence we call it consumer perceived latency (CPL). In this paper, we report the average CPL. The data point marked with a black triangle is the chosen operating point, where the beam is one, and both the minimum buffer and the silence buffer are set to 960ms. The resulting CER is 10.5% and the average CPL is 320ms. Compared to the offline baseline, the difference in CER is only 0.4% absolute or 4.0% relative.
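To make the CPL definition concrete, the sketch below computes it from a log of timestamped partial hypotheses; the (timestamp, hypothesis) log format is a hypothetical stand-in for whatever the decoder actually records.

```python
# Illustrative computation of consumer perceived latency (CPL) as defined above:
# the time from the end of the last spoken word to the moment the displayed
# hypothesis stops changing.

def consumer_perceived_latency(last_word_end, partials):
    """partials: list of (timestamp_sec, hypothesis_str) in decoding order."""
    final = partials[-1][1]
    stable_time = partials[-1][0]
    # walk backwards to find when the hypothesis last changed to its final value
    for t, hyp in reversed(partials):
        if hyp != final:
            break
        stable_time = t
    return stable_time - last_word_end

partials = [(1.10, "ni"), (1.45, "ni hao"), (1.80, "ni hao ma"), (2.05, "ni hao ma")]
print(consumer_perceived_latency(last_word_end=1.75, partials=partials))  # ~0.05 s
```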
VI. CONCLUSIONS
In this paper, we explain the challenges of online recognition for LAS models. We show that an online encoder and attention mechanism alone are not enough for online recognition, and the model can suffer from early stopping. However, with our proposed silence modeling and buffering scheme, we show that the LAS model is capable of online recognition. On our dictation evaluation set, the online LAS model achieves 10.5% CER and 320ms average CPL, which is only 0.4% absolute or 4.0% relative worse than the offline baseline.
VII. ACKNOWLEDGMENTS
We would like to thank Ossama Abdelhamid, John Bridle, Pawel Swietojanski, Russ Webb and Manhung Siu for their support and useful discussions. We also want to thank Ron Huang and Xinwei Li for their initial experiments on MoChA and spectral augmentation respectively.

REFERENCES

[1] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2016, pp. 4960–4964.
[2] Hervé Bourlard and Nelson Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers.
[3] G. Hinton, L. Deng, D. Yu, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, G. Dahl, and B. Kingsbury, "Deep Neural Networks for Acoustic Modeling in Speech Recognition," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[4] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, et al., "State-of-the-art speech recognition with sequence-to-sequence models," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2018, pp. 4774–4778.
[5] A. Graves, "Sequence transduction with recurrent neural networks," in arXiv:1211.3711, 2012.
[6] Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, Qiao Liang, Deepti Bhatia, Yuan Shangguan, Bo Li, Golan Pundak, Khe Chai Sim, Tom Bagby, Shuo-yiin Chang, Kanishka Rao, and Alexander Gruenstein, "Streaming end-to-end speech recognition for mobile devices," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019, pp. 6381–6385.
[7] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2017, pp. 4835–4839.
[8] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi, "Hybrid CTC/Attention Architecture for End-to-End Speech Recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[9] Niko Moritz, Takaaki Hori, and Jonathan Le Roux, "Triggered attention for end-to-end speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019, pp. 5666–5670.
[10] Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, Chung-Cheng Chiu, David Garcia, Alex Gruenstein, Ke Hu, Minho Jin, Anjuli Kannan, Qiao Liang, Ian McGraw, Cal Peyser, Rohit Prabhavalkar, Golan Pundak, David Rybach, Yuan Shangguan, Yash Sheth, Trevor Strohman, Mirkó Visontai, Yonghui Wu, Yu Zhang, and Ding Zhao, "A streaming on-device end-to-end model surpassing server-side conventional model quality and latency," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2020.
[11] Tara N. Sainath, Ruoming Pang, Ron J. Weiss, Yanzhang He, Chung-Cheng Chiu, and Trevor Strohman, "An attention-based joint acoustic and text on-device end-to-end model," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2020.
[12] Bo Li, Shuo-yiin Chang, Tara N. Sainath, Ruoming Pang, Yanzhang He, Trevor Strohman, and Yonghui Wu, "Towards fast and accurate streaming end-to-end ASR," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2020.
[13] Hirofumi Inaguma, Yashesh Gaur, Liang Lu, Jinyu Li, and Yifan Gong, "Minimum latency training strategies for streaming sequence-to-sequence ASR," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2020.
[14] Jinyu Li, Rui Zhao, Hu Hu, and Yifan Gong, "Improving RNN Transducer Modeling for End-to-End Speech Recognition," in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, 2019.
[15] Ruchao Fan, Pan Zhou, Wei Chen, Jia Jia, and Gang Liu, "An online attention-based model for speech recognition," in Proceedings of INTERSPEECH, 2019, pp. 4390–4394.
[16] Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, and Douglas Eck, "Online and linear-time attention by enforcing monotonic alignments," in Proceedings of the International Conference on Machine Learning, 2017.
[17] Chung-Cheng Chiu and Colin Raffel, "Monotonic chunkwise attention," in Proceedings of the International Conference on Learning Representations, 2018.
[18] André Merboldt, Albert Zeyer, Ralf Schlüter, and Hermann Ney, "An analysis of local monotonic attention variants," in Proceedings of INTERSPEECH, 2019.
[19] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukás Burget, O. Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, Jan Silovský, Georg Stemmer, and Karel Veselý, "The Kaldi Speech Recognition Toolkit," in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, 2011.
[20] Jia Cui, Xiaodong Cui, Bhuvana Ramabhadran, Janice Kim, Brian Kingsbury, Jonathan Mamou, Lidia Mangu, Michael Picheny, Tara N. Sainath, and Abhinav Sethy, "Developing speech recognition systems for corpus indexing under the IARPA Babel program," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2013.
[21] S. Kanthak and H. Ney, "Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002.
[22] Mirjam Killer, Sebastian Stüker, and Tanja Schultz, "Grapheme based speech recognition," in Proceedings of the European Conference on Speech Communication and Technology, 2003, pp. 3141–3144.
[23] Kai Chen and Qiang Huo, "Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2016.
[24] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proceedings of NIPS, 2015.
[25] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proceedings of INTERSPEECH, 2019.
[26] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.