Do End-to-End Speech Recognition Models Care About Context?
Lasse Borgholt, Jakob Drachmann Havtorn, Željko Agić, Anders Søgaard, Lars Maaløe, Christian Igel
Department of Computer Science, University of Copenhagen, Denmark
Corti, Copenhagen, Denmark
{borgholt, sogaard, igel}@di.ku.dk, {lb, jdh, za, lm}@corti.ai

Abstract
The two most common paradigms for end-to-end speech recognition are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. It has been argued that the latter is better suited for learning an implicit language model. We test this hypothesis by measuring temporal context sensitivity and evaluate how the models perform when we constrain the amount of contextual information in the audio input. We find that the AED model is indeed more context sensitive, but that the gap can be closed by adding self-attention to the CTC model. Furthermore, the two models perform similarly when contextual information is constrained. Finally, in contrast to previous research, our results show that the CTC model is highly competitive on WSJ and LibriSpeech without the help of an external language model.
Index Terms: automatic speech recognition, end-to-end speech recognition, connectionist temporal classification, attention-based encoder-decoder
1. Introduction
Connectionist temporal classification (CTC) [1] and attention-based encoder-decoder (AED) models [2, 3] are arguably the most popular choices for end-to-end automatic speech recognition (E2E ASR). However, it has been unclear if CTC and AED models process speech in qualitatively different ways. The use of sentence-level context is important for human speech perception [4], but has not been studied for ASR. Previous research has claimed that AED models learn a better implicit language model given enough training data [5]. Furthermore, comparisons of the two models have suggested that CTC models are inferior without the help of an external language model [5, 6], which leads to the hypothesis that CTC models are incapable of exploiting long temporal dependencies.

We study how the two E2E ASR models utilize temporal context. For this purpose, we consider first-order derivatives [7, 8] and the occlusion of input features [9]. While these methods have been frequently used to analyze natural language processing models [10, 11, 12, 13], their application to speech recognition has been limited [14, 15].

We first highlight three general architectural differences between the two approaches and argue that these may enable AED models to utilize more temporal context than CTC models (section 2). Further, we define an intrinsic measure of context sensitivity based on the partial derivative of individual character predictions with respect to the input audio (section 3.1 and figure 1). This allows us to analyze the sensitivity across the temporal dimension of the input for any E2E ASR model. Finally, we devise an experiment to directly compare model performance when context is constrained (section 3.2). For this, we use hand-annotated word-alignments to accurately occlude temporal context. Our contributions are as follows:

1. Through a derivative-based sensitivity analysis we show that the AED model is more context sensitive than the CTC model. Our ablation study attributes this difference to the attention-mechanism, which closes the gap when applied to the CTC model.
2. Although the AED model is more context sensitive than the CTC model without an attention-mechanism, we find that the two models perform similarly when contextual information is constrained by occluding surrounding words in the input audio.
3. In contrast to previous comparisons, we show that the CTC model is highly competitive with the AED model without the help of an external language model. Using a deep and densely connected architecture, both models reach a new E2E state-of-the-art on the WSJ task.
[Figure 1 plots the sensitivity score r_{u,t} against t (time steps of 0.01 seconds), with accumulated-sensitivity spans from 10% (0.03 s) to 100% (1.59 s) over the utterance "who is going to stop me" and its hand-annotated word and phone alignment.]
Figure 1:
Sensitivity scores for the character "p" in the correctly predicted sentence "who is going to stop me" by the CTC model trained on LibriSpeech. Hand-annotated word and phone alignments from the TIMIT dataset are shown in the bottom. The temporal spans corresponding to different levels of accumulated sensitivity are shown in the top. By averaging these across all non-blank character predictions in a test set, we obtain a measurement of the model's context sensitivity.

2. End-to-End Speech Recognition

2.1. Connectionist temporal classification

Given a sequence of real-valued input vectors x = (x_1, ..., x_T), CTC models compute an output sequence ŷ = (ŷ_1, ..., ŷ_U), where each ŷ_u is a categorical probability distribution over the target character set. Apart from the letters a-z, white-space and apostrophe ('), the character set also includes the special blank token (-). The input and output lengths, T and U, are related by U = ⌈T/R⌉, where R is a constant reduction factor achieved by striding or stacking adjacent temporal representations. In this study, we never use an external language model. Instead, we rely on a simple greedy decoder β(·) that collapses repeated characters and removes blank tokens (e.g., -c-aatt- ↦ cat). The β(·) function operates on the predicted alignment path q̂ = (q̂_1, ..., q̂_U) obtained by letting q̂_u = argmax_q ŷ_{u,q}.

This decoding mechanism results from the CTC loss function. The loss is computed by summing the probability of all alignment paths q = (q_1, ..., q_U) that translate to the target sequence y. The probability of a single path is given by:

    P(q | x) = ∏_{u=1}^{U} ŷ_{u,q_u}    (1)

Given the set of paths {q | β(q) = y} = β^{-1}(y) that translate to a given target transcript, the total probability is:

    P(y | x) = Σ_{q ∈ β^{-1}(y)} P(q | x)    (2)

The loss is simply L(y, ŷ) = -ln(P(y | x)), which can be computed efficiently with dynamic programming [1].
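For concreteness, here is a minimal sketch of the greedy decoding step; the (U, Q) array layout, the blank index and the alphabet string are assumptions made for illustration, not details fixed by the paper.

```python
import numpy as np

# Character set with the blank token "-" at index 0 (an assumed convention).
ALPHABET = "-abcdefghijklmnopqrstuvwxyz '"
BLANK = 0

def greedy_ctc_decode(y_hat: np.ndarray) -> str:
    """Apply beta(.) to the argmax path: collapse repeats, drop blanks."""
    path = y_hat.argmax(axis=1)        # q_u = argmax_q y_hat[u, q]
    chars, prev = [], None
    for q in path:
        if q != prev and q != BLANK:   # keep first of each repeated run
            chars.append(ALPHABET[q])
        prev = q
    return "".join(chars)
```

On an argmax path reading -c-aatt-, this returns "cat", matching the example above.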
2.2. Attention-based encoder-decoder

AED models first encode the input x to a sequence of vectors h = (h_1, ..., h_U) = ENCODE(x), which is passed to an autoregressive decoder function DECODE(·). We reuse U to denote the length of h to emphasize that, as with CTC models, it is defined by a constant reduction factor R. However, AED models are typically robust to a higher reduction factor than CTC models [5]. Operating at a lower temporal resolution should make it easier for recurrent encoder layers (section 2.3) to pass information across longer time spans.

[Figure 2 diagram: stacked 2D-convolution blocks (2D-conv. with C filters of H x W and stride F x T, each followed by a clipped ReLU) feeding dense LSTM blocks of L bidirectional LSTM layers with N units, connected through bottleneck layers (concatenate (and stack*), position-wise fully-connected with N units, clipped ReLU).]

Figure 2:
Default encoder architecture used for both CTC and AED models. *Only applied in the bottleneck layer of the first dense LSTM block for the AED model to achieve R = 4.

Roughly speaking, we could write the decoder as ŷ_k = DECODE(h, ŷ_{k-1}, s_{k-1}, a_{k-1}), where ŷ_k is a probability distribution over characters, s_k is the decoder state and a_k is the attention vector. Unlike CTC models, there are no repeated characters or blank tokens to interleave the final predictions. Thus, given the same output sequence, we have K ≤ U. As with the encoder, lower temporal resolution between decoder steps could make it easier to pass information between predictions. Emphasizing more detail, we split the DECODE(·) function into the following sequence of computations:

    s_k = RECURRENT(s_{k-1}, [Φ(ŷ_{k-1}); a_{k-1}])    (3)
    a_k = ATTEND(s_k, h)    (4)
    ŷ_k = PREDICT(a_k)    (5)

Here [·;·] denotes the concatenation of two vectors and Φ(·) is a non-differentiable embedding lookup. (Note that the lookup is not captured by the gradient-based sensitivity analysis presented in section 3.1.) The
RECURRENT(·) function can take the form of any recurrent neural network architecture. We use a single LSTM [16] cell for all our experiments. The PREDICT(·) function is a single fully-connected layer followed by the softmax function. The following steps define the ATTEND(·) function:

    e_{k,u} = v^T tanh(W_s s_k + W_h h_u)    (6)
    α_{k,u} = exp(e_{k,u}) / Σ_{u'=1}^{U} exp(e_{k,u'})    (7)
    c_k = Σ_{u=1}^{U} α_{k,u} h_u    (8)
    a_k = tanh(W_a [c_k; s_k])    (9)

where v, W_s, W_h and W_a are trainable parameters. The computation of the energy coefficient e_{k,u} is taken from [17]. Note that each energy coefficient, and thus each attention weight α_{k,u}, is computed identically for all encoder representations h_u. Unlike recurrent network connections, combining information from time steps far apart does not require propagating the information through a number of computations proportional to the distance between the time steps.

Thus, we have highlighted three components that could make AED models more context sensitive: (I) encoder resolution, (II) decoder resolution and (III) the attention-mechanism.
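To make equations (6)-(9) concrete, here is a minimal PyTorch sketch of the ATTEND(·) function for a single decoder step over an unbatched encoder sequence; the layer dimensions, the bias-free linear maps and the output size of W_a are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class Attend(nn.Module):
    """Additive attention of equations (6)-(9)."""

    def __init__(self, dec_dim: int, enc_dim: int, att_dim: int):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, att_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)
        self.W_a = nn.Linear(enc_dim + dec_dim, dec_dim, bias=False)

    def forward(self, s_k: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # s_k: (dec_dim,) decoder state; h: (U, enc_dim) encoder states.
        e_k = self.v(torch.tanh(self.W_s(s_k) + self.W_h(h))).squeeze(-1)  # (6)
        alpha_k = torch.softmax(e_k, dim=0)                                # (7)
        c_k = (alpha_k.unsqueeze(-1) * h).sum(dim=0)                       # (8)
        return torch.tanh(self.W_a(torch.cat([c_k, s_k], dim=-1)))         # (9)
```

Note that every h_u enters equation (6) through the same two matrix products, which is what makes the attention weights independent of the distance between time steps.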
2.3. Architecture

Whereas the main contribution of the CTC framework is the loss function, the AED model relies on a more complex architecture that allows it to be trained with a simple cross-entropy loss. To see this, note that the CTC forward pass can be stated as a subset of the functions introduced in section 2.2:

    h = ENCODE(x)    (10)
    ŷ_u = PREDICT(h_u)    (11)

As in previous work, we use convolutions followed by a sequence of bidirectional recurrent neural networks [18, 19, 20]. Our final encoder has 10 bidirectional LSTM layers with skip-connections inspired by [21]. The outputs of the forward and backward cells are summed after each LSTM layer. The default reduction factor is R = 2 for CTC and R = 4 for AED; see figure 2.

Table 1: Word error rates on the clean and other test sets of LibriSpeech. None of the above use an external language model.

Model                     | clean | other | type | params
Li et al., 2019 [22]      | 3.86  | 11.95 | CTC  | 333 M
Kim et al., 2019 [23]     | 3.66  | 12.39 | AED  | ~320 M
Park et al., 2019 [18]    | 2.80  | 6.80  | AED  | ~280 M
Irie et al., 2019 [24]:   |       |       |      |
  Small - Grapheme        |       |       |      |
  Small - Word-piece      |       |       |      |
  Medium - Grapheme       |       |       |      |
  Medium - Word-piece     |       |       |      |
Our work:                 |       |       |      |
  Deep LSTM               | 5.13  | 16.03 | CTC  | 17.7 M
  Deep LSTM               | 5.45  | 17.05 | AED  | 19.8 M
3. Method
We used two different approaches for analyzing temporal context utilization of the two E2E ASR models. The derivative-based sensitivity analysis (section 3.1) can be used to compare a set of models on any dataset. However, as we will see, there is no guarantee that the differences found with this approach translate to better performance. The occlusion-based analysis (section 3.2) allows us to evaluate how the models respond when we remove temporal context. This measure is easy to interpret and can be used to directly assess the importance of temporal context, but it requires hand-annotated word-alignments, which are rarely available in publicly available datasets.

3.1. Sensitivity analysis
We define a sequence of sensitivity scores r_k (r_u for CTC models) across the temporal dimension of the input space for each predicted character. Let F be the number of spectral input features and Q the size of the output character set:

    r_{k,t} = Σ_{q=1}^{Q} Σ_{f=1}^{F} |∂ŷ_{k,q} / ∂x_{t,f}|    (12)

An example is shown in figure 1. Our goal is to measure the dispersion of these scores across the input time steps. We do so by summing the scores from largest to smallest and measure the temporal span of the scores accumulated for a certain percentage of the total sensitivity. For example, if the scores needed to account for a given percentage of the total sensitivity are {r_{k,t} : t ∈ S} for some set of time steps S, the temporal span is max(S) - min(S) time steps, with each time step corresponding to 0.01 seconds. We take the mean of this span for a fixed percentage across all character predictions in a given data set to summarize the temporal context sensitivity of a model. This allows us to evaluate how the sensitivity disperses as we increase the accumulated percentage. A higher dispersion of sensitivity scores equals a higher context sensitivity. Note that the derivative-based measure considers a linearization of the models and, thus, does not capture non-linear effects.
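A minimal PyTorch sketch of equation (12) and of the span measurement may clarify the procedure; the model interface (one (T, F) spectrogram in, a (K, Q) sequence of output distributions out) is an assumption made for this sketch.

```python
import torch

def sensitivity_scores(model, x: torch.Tensor, k: int) -> torch.Tensor:
    """Equation (12): r_{k,t} = sum_q sum_f |d y_hat[k,q] / d x[t,f]|."""
    x = x.clone().requires_grad_(True)   # x: (T, F) spectrogram
    y_hat = model(x)                     # y_hat: (K, Q) output distributions
    r = torch.zeros(x.shape[0])
    for q in range(y_hat.shape[1]):
        grad, = torch.autograd.grad(y_hat[k, q], x, retain_graph=True)
        r += grad.abs().sum(dim=1)       # sum absolute gradients over f
    return r                             # one score per input time step t

def temporal_span(r: torch.Tensor, fraction: float) -> int:
    """Span (in time steps) of the largest scores that together
    account for `fraction` of the total sensitivity."""
    order = torch.argsort(r, descending=True)
    cum = torch.cumsum(r[order], dim=0)
    n = int(torch.searchsorted(cum, fraction * float(r.sum()))) + 1
    kept = order[:n]                     # indices of the selected scores
    return int(kept.max() - kept.min() + 1)
```

Averaging temporal_span over all character predictions in a test set, for a grid of accumulation percentages, yields curves like those in figure 3.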
Chorowski & Jaitly, 2016 [25] 10.60 AEDZhang et al., 2017 [20] 10.53 AEDChan et al., 2016 [26] 9.6 AEDSabour et al., 2018 [27] 9.3 AEDOur work:Deep LSTM 9.25 CTCDeep LSTM 9.25 AEDTable 2:
Word error rates on the eval92 test set of WSJ. None ofthe above use an external language model. the word given different levels of context. That is, we crop outthe audio segment corresponding to w t − C , ..., w t + C where C is the maximum number of context words visible on each side. If the target word w t is in the predicted sequence, we acceptthe hypothesis. To avoid ambiguous situations where the targetword is identical to one of the C context words, we only makeuse of sentences that consist of a sequence of unique words.
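The cropping logic is simple enough to sketch; the alignment representation and the recognize() interface are assumptions made for this sketch, and the silence padding mentioned above is omitted.

```python
import numpy as np

def occlusion_trial(audio: np.ndarray, words: list, t: int, C: int,
                    recognize) -> bool:
    """Crop w_{t-C}, ..., w_{t+C} and accept if w_t is recognized.

    `words` is a list of (word, start_sample, end_sample) alignments and
    `recognize` maps a waveform to a transcript (assumed interfaces).
    """
    lo = max(0, t - C)                   # clamp the context window
    hi = min(len(words) - 1, t + C)
    segment = audio[words[lo][1]:words[hi][2]]
    return words[t][0] in recognize(segment).split()
```

Running this for C = 0, 1, 2, 3 and for the full sentence (C = ∞) gives target-word error rates like those reported in figure 3.6.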
4. Experiments
4.1. Setup

We trained the models on the Wall Street Journal CSR corpus (WSJ) [28] and the LibriSpeech ASR corpus [29]. WSJ contains approximately 81 hours of read newspaper articles and LibriSpeech contains 960 hours of audio book samples. We used 80-dimensional log-mel spectrograms as input. The models were trained for 600 epochs on WSJ and 120 epochs on LibriSpeech. We used Adam [30] with a fixed learning rate for the first 100 epochs on WSJ and 20 epochs on LibriSpeech, before annealing it to a fraction of its original size. We used dropout after each convolutional block [31] and each bidirectional LSTM layer [32]. The dropout rate was set to 0.10 for models trained on LibriSpeech and 0.40 for WSJ. Similar to [33], we constructed batches of similar length samples, such that one batch consisted of up to 320 seconds of audio and contained a variable number of samples. For the AED model, we used teacher-forcing with a 10% sampling rate.

For the occlusion-based analysis, we considered the hand-annotated word-alignments from the TIMIT dataset [34]. We excluded all sentences repeated by multiple speakers in order to avoid biasing the results towards certain sentence constructions (i.e., we only use the SI-files of the TIMIT dataset).

4.2. Model performance

We compare the default configuration of our CTC and AED models trained on WSJ and LibriSpeech to other notable E2E ASR models in tables 1 and 2. Both the CTC and AED model compare favorably to more sophisticated approaches on WSJ. On LibriSpeech, our models do not perform as well as larger models, but are still on par with models of comparable size from [24], which is the same model as in [18] at smaller scale. The slightly worse performance of the AED model on LibriSpeech can be attributed to longer sentences, which have a tendency to destabilize training. Similar issues have been reported in prior work [2, 5].
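As an illustration of the batch construction described in the setup above, here is a minimal sketch that buckets similar-length samples into batches of at most 320 seconds of audio; the (duration, sample) interface is an assumption.

```python
import random

def length_batches(samples, max_seconds=320.0):
    """Group length-sorted samples into batches of up to `max_seconds`."""
    batches, batch, total = [], [], 0.0
    for duration, sample in sorted(samples, key=lambda s: s[0]):
        if batch and total + duration > max_seconds:
            batches.append(batch)        # close the current batch
            batch, total = [], 0.0
        batch.append(sample)
        total += duration
    if batch:
        batches.append(batch)
    random.shuffle(batches)              # randomize batch order per epoch
    return batches
```

Sorting by duration keeps padding within a batch small, while the number of samples per batch varies with their lengths.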
[Figure 3 panels; the sensitivity panels plot accumulated sensitivity (%, 0.1-1.0) against mean temporal span (seconds):
Figure 3.1: CTC vs. AED (test source: WSJ). Curves: CTC and AED trained on WSJ.
Figure 3.2: CTC vs. AED (test source: LibriSpeech). Curves: CTC and AED trained on LibriSpeech.
Figure 3.3: CTC vs. AED (test source: TIMIT). Curves: CTC and AED trained on WSJ and on LibriSpeech.
Figure 3.4: Temporal resolution (test source: WSJ). Curves: AED trained on WSJ with encoder resolutions R = 2, 4, 8 and with increased decoder lengths.
Figure 3.5: Attention-mechanism (test source: WSJ). Curves: CTC, CTC w/ self-attention and AED trained on WSJ.
Figure 3.6: Occlusion performance (test source: TIMIT). Plots maximum number of context words (0, 1, 2, 3, ∞) against target-word error rate (%). Curves: CTC and AED trained on WSJ and on LibriSpeech.]
Figure 3:
Sensitivity analysis (figures 3.1-3.5) and occlusion-based analysis (figure 3.6). See corresponding subsections.
4.3. CTC vs. AED

As hypothesized, figures 3.1 and 3.2 reveal that our AED models utilized a larger temporal context than the CTC models based on the sensitivity scores. The trend was consistent across all levels of accumulated sensitivity scores. In figure 3.3, we see the same pattern when evaluated on the TIMIT dataset, which will be used for the occlusion-based analysis.
4.4. Temporal resolution

We trained the AED model with three different temporal encoder resolutions, R = 2, 4, 8, on the WSJ dataset. R was configured by increasing the stride in each of the three convolutional layers. As seen in figure 3.4, encoder resolution had no impact on context sensitivity.

To test decoder resolution, we interleaved the target transcript with one or two redundant blank tokens to effectively increase the target length to K · 2 or K · 3. Figure 3.4 shows that decoder resolution had no impact on context sensitivity.

4.5. Attention-mechanism

To test how the attention-mechanism affects context sensitivity, we incorporated the
ATTEND(·) function in the CTC architecture. Instead of passing h_u directly to PREDICT(·), we first applied self-attention:

    ŷ_u = PREDICT(ATTEND(h_u, h))    (13)

(A short sketch of this read-out is given at the end of this section.) We trained this model on the WSJ dataset and compared it to the AED model and the CTC model without attention in figure 3.5. The attention-mechanism closed the gap in context sensitivity between the two models. Thus, the difference found in section 4.3 is likely a result of this architectural component, which can be easily incorporated in a CTC model. However, a large U results in high memory consumption. Therefore, for the experiments shown in figures 3.4 and 3.5, we used a smaller model where the two dense LSTM blocks are replaced by three LSTM layers with 128 units.

4.6. Occlusion performance

Figure 3.6 shows how model performance is affected under different context constraints. We see that both the CTC and AED model suffered severely when contextual information was completely removed. The models came close to optimal performance when approximately three words were allowed on each side of the target word. Thus, temporal context is an important factor for both models. This result aligns well with the common n-gram size (3-4) when decoding with the help of a statistical language model [25, 35, 36].

Based on the results in section 4.3, we would expect that the AED models rely more on the temporal context than the CTC model. However, we do not see such a trend in figure 3.6. Indeed, there was no pronounced or consistent difference between the two models regardless of training source. This result implies that the architectural differences between the AED and CTC models do not necessarily translate to a performance difference. It may be that the AED model included more evidence from context than the CTC model, but the results in figure 3.6 indicate that this did not add any additional value in terms of lowering word error rate.
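Returning to equation (13): the self-attentive read-out can reuse the Attend module sketched in section 2.2, with the query taken from the encoder itself (so the "decoder" dimension equals the encoder dimension); as before, the interfaces are assumptions made for this sketch.

```python
import torch

def self_attentive_ctc_logits(attend, predict, h: torch.Tensor) -> torch.Tensor:
    """Equation (13): y_hat_u = PREDICT(ATTEND(h_u, h)) for every u."""
    a = torch.stack([attend(h_u, h) for h_u in h])  # (U, dim) attended states
    return predict(a)                               # (U, Q) character scores
```

Here predict would be the usual position-wise fully-connected layer of the CTC model, applied to the attended state instead of h_u directly.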
5. Conclusions
We show that AED models are generally more context sensitive than CTC models and that this difference is largely explained by the attention-mechanism of AED models. Adding a self-attention layer to the CTC model bridges the gap between the models. Analyzing performance by constraining temporal context, we also find that the initial difference between the two models is not crucial in terms of word error rate performance, although both models rely heavily on context for optimal performance. Our experiments on WSJ and LibriSpeech show that CTC models are capable of delivering state-of-the-art results on par with AED models without an external language model. Because of its simplicity and more stable training, CTC is our preferred E2E ASR framework.

6. References
[1] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in International Conference on Machine Learning (ICML). ACM, 2006, pp. 369-376.
[2] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960-4964.
[3] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4945-4949.
[4] K. M. Hutchinson, "Influence of sentence context on speech perception in young and older adults," Journal of Gerontology, vol. 44, no. 2, pp. 36-44, 1989.
[5] E. Battenberg, J. Chen, R. Child, A. Coates, Y. G. Y. Li, H. Liu, S. Satheesh, A. Sriram, and Z. Zhu, "Exploring neural transducers for end-to-end speech recognition," in Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 206-213.
[6] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly, "A comparison of sequence-to-sequence models for speech recognition," in INTERSPEECH. ISCA, 2017, pp. 939-943.
[7] L. Fu and T. Chen, "Sensitivity analysis for input vector in multilayer feedforward neural networks," in International Conference on Neural Networks. IEEE, 1993, pp. 215-218.
[8] Y. Dimopoulos, P. Bourret, and S. Lek, "Use of some sensitivity criteria for choosing networks with good generalization ability," Neural Processing Letters, vol. 2, no. 6, pp. 1-4, 1995.
[9] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 818-833.
[10] L. Arras, G. Montavon, K.-R. Müller, and W. Samek, "Explaining recurrent neural network predictions in sentiment analysis," in EMNLP Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 2017.
[11] J. Li, X. Chen, E. Hovy, and D. Jurafsky, "Visualizing and understanding neural models in NLP," in North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2016.
[12] L. Arras, A. Osman, K.-R. Müller, and W. Samek, "Evaluating recurrent neural network explanations," in ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2019.
[13] J. Li, W. Monroe, and D. Jurafsky, "Understanding neural networks through representation erasure," arXiv preprint arXiv:1612.08220, 2016.
[14] A. Krug and S. Stober, "Introspection for convolutional automatic speech recognition," in EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018.
[15] H. Bharadhwaj, "Layer-wise relevance propagation for explainable deep learning based speech recognition," in International Symposium on Signal Processing and Information Technology (ISSPIT). IEEE, 2018, pp. 168-174.
[16] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[17] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in International Conference on Learning Representations (ICLR), 2015.
[18] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in INTERSPEECH. ISCA, 2019, pp. 2613-2617.
[19] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning (ICML), 2016, pp. 173-182.
[20] Y. Zhang, W. Chan, and N. Jaitly, "Very deep convolutional networks for end-to-end speech recognition," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4845-4849.
[21] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 4700-4708.
[22] J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde, "Jasper: An end-to-end convolutional neural acoustic model," in INTERSPEECH. ISCA, 2019, pp. 71-75.
[23] C. Kim, M. Shin, A. Garg, and D. Gowda, "Improved vocal tract length perturbation for a state-of-the-art end-to-end speech recognition system," in INTERSPEECH. ISCA, 2019, pp. 739-743.
[24] K. Irie, R. Prabhavalkar, A. Kannan, A. Bruguier, D. Rybach, and P. Nguyen, "On the choice of modeling unit for sequence-to-sequence speech recognition," in INTERSPEECH. ISCA, 2019, pp. 3800-3804.
[25] J. Chorowski and N. Jaitly, "Towards better decoding and language model integration in sequence to sequence models," in INTERSPEECH. ISCA, 2017, pp. 523-527.
[26] W. Chan, Y. Zhang, Q. Le, and N. Jaitly, "Latent sequence decompositions," in International Conference on Learning Representations (ICLR), 2017.
[27] S. Sabour, W. Chan, and M. Norouzi, "Optimal completion distillation for sequence learning," in International Conference on Learning Representations (ICLR), 2019.
[28] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 357-362.
[29] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206-5210.
[30] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[31] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, "Efficient object localization using convolutional networks," in Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015, pp. 648-656.
[32] D. P. Kingma, T. Salimans, and M. Welling, "Variational dropout and the local reparameterization trick," in Neural Information Processing Systems (NeurIPS), 2015, pp. 2575-2583.
[33] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," in INTERSPEECH. ISCA, 2019, pp. 3465-3469.
[34] J. S. Garofolo, "TIMIT acoustic phonetic continuous speech corpus," Linguistic Data Consortium, 1993.
[35] N. Zeghidour, Q. Xu, V. Liptchinsky, N. Usunier, G. Synnaeve, and R. Collobert, "Fully convolutional speech recognition," arXiv preprint arXiv:1812.06864, 2018.
[36] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney, "RWTH ASR systems for LibriSpeech: Hybrid vs attention," in INTERSPEECH. ISCA, 2019.