Intermediate Loss Regularization for CTC-based Speech Recognition
Jaesong Lee (Naver Corporation), Shinji Watanabe (Johns Hopkins University)
ABSTRACT
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective. The proposed objective, an intermediate CTC loss, is attached to an intermediate layer in the CTC encoder network. This intermediate CTC loss regularizes CTC training well and improves performance, requiring only a small modification of the code, a small overhead during training, and no overhead during inference. In addition, we propose to combine this intermediate CTC loss with stochastic depth training, and apply this combination to the recently proposed Conformer network. We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus, based on CTC greedy search without a language model. In particular, the AISHELL-1 result is comparable to other state-of-the-art ASR systems based on an autoregressive decoder with beam search.
Index Terms — end-to-end speech recognition, connectionist temporal classification, multitask learning, non-autoregressive
1. INTRODUCTION
End-to-end automatic speech recognition (ASR) has become a promising approach for the speech recognition community. It simplifies model design, training, and decoding procedures compared to conventional approaches such as hybrid systems using hidden Markov models (HMMs). However, the improvement comes with a computational cost: many state-of-the-art ASR architectures employ an attention-based deep encoder-decoder architecture [1-5], which requires heavy computation and a large model size. Also, the decoder runs in an autoregressive fashion and requires sequential computation, i.e., the generation of an output token can be started only after the completion of the previous token.

Compared to encoder-decoder modeling, connectionist temporal classification (CTC) [6] does not require a separate decoder, thus allowing more compact and fast models. Also, CTC provides a greedy decoding algorithm for generating sentences in a fast and parallel way, especially compared to the autoregressive decoder of encoder-decoder models. Although recent advances in architectural design [7, 8] and pre-training methods [9] have improved the performance of CTC, it is usually weaker than encoder-decoder models, which is often credited to its strong conditional independence assumption, and closing the performance gap often requires external language models (LMs) and a beam search algorithm [10, 11], which demand extra computation and effectively make the model autoregressive. Therefore, it is important to improve CTC modeling to reduce the overall computational overhead, ideally without the help of an LM and beam search.

There has also been great interest in non-autoregressive speech recognition toward reaching the performance of autoregressive models [12-16], inspired by the success of non-autoregressive models in neural machine translation [17-20]. Non-autoregressive ASR allows faster token generation than autoregressive ASR, as the generation of a token does not directly depend on the previous token. CTC itself can be viewed as an early instance of non-autoregressive ASR, and recently proposed methods, Mask CTC [13] and Imputer [14], use CTC as a part of non-autoregressive modeling: they first generate an initial output from CTC, then refine it via another network. Therefore, improving CTC is also important for improving non-autoregressive methods in general.

In this work, we show that the performance of CTC can be improved with a proposed auxiliary task. The proposed task, named intermediate CTC loss, is constructed by first obtaining an intermediate representation of the model and then computing its corresponding CTC loss. The model is trained with the original CTC loss in conjunction with the proposed loss, with a very small computational overhead. During inference, the usual CTC decoding algorithm is used, so there is no overhead.

We show the proposed method can improve Transformer [21] with various depths, as well as Conformer [5], a recently proposed architecture combining self-attention and convolution layers. We also show the method can be combined with another regularization method, stochastic depth [22, 23], for further improvement.

The contributions of this paper are as follows:

• We present a simple yet efficient auxiliary loss, called intermediate CTC loss, for improving the performance of CTC ASR networks.

• We combine the intermediate CTC loss and stochastic depth regularization to achieve better performance than using only one of them.

• We show an application to the Conformer encoder, a recently proposed architecture, and show the proposed method is also effective for Conformer.
• We achieve results comparable to the state of the art, specifically a word error rate (WER) of 9.9% on the Wall Street Journal (WSJ) corpus and a character error rate (CER) of 5.2% on AISHELL-1, using CTC modeling and greedy decoding only.
2. ARCHITECTURE
We consider a multi-layer architecture with the CTC loss function. For a given input x ∈ R^{T×D} of length T and dimension D, the encoder consists of L layers as follows:

x_l = EncoderLayer_l(x_{l-1}),    (1)

where EncoderLayer_l is the l-th layer of the network, explained in Section 2.2.

2.1. Connectionist Temporal Classification

CTC [6] computes the likelihood of the target sequence y by considering all possible alignments for the label and the input length T. For the encoder output x_L and target sequence y, the likelihood is defined as:

P_CTC(y | x_L) := \sum_{a ∈ β^{-1}(y)} P(a | x_L),    (2)

where β^{-1}(y) is the set of alignments a of length T compatible with y, including the special blank token. The alignment probability P(a | x_L) is factorized with the following conditional independence assumption:

P(a | x_L) = \prod_t P(a[t] | x_L[t]),    (3)

where a[t] and x_L[t] denote the t-th symbol of a and the t-th representation vector of x_L, respectively.

At training time, we minimize the negative log-likelihood induced by CTC, using P_CTC(y | x_L) in Eq. (2):

L_CTC := − log P_CTC(y | x_L).    (4)

At test time, we use greedy search to find the most probable alignment for fast inference.

2.2. Transformer and Conformer Encoders

We use two encoder architectures: Transformer [21] and Conformer [5]. Transformer uses self-attention (SelfAttention(·) shown in Eq. (5)) for learning global representations, and layer normalization [24] and residual connections [25] for stabilizing learning. With Transformer, EncoderLayer(·) in Eq. (1) consists of:

x_l^{MHA} = SelfAttention(x_{l-1}) + x_{l-1},    (5)
x_l = FFN(x_l^{MHA}) + x_l^{MHA},    (6)

where FFN(·) denotes the feed-forward layers.

Conformer combines Transformer and convolutional layers for efficient learning of both global and local representations. With Conformer, EncoderLayer(·) in Eq. (1) consists of:

x_l^{FFN} = (1/2) FFN(x_{l-1}) + x_{l-1},    (7)
x_l^{MHA} = SelfAttention(x_l^{FFN}) + x_l^{FFN},    (8)
x_l^{Conv} = Convolution(x_l^{MHA}) + x_l^{MHA},    (9)
x_l = LayerNorm((1/2) FFN(x_l^{Conv}) + x_l^{Conv}).    (10)
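As a concrete illustration of the CTC objective in Eqs. (2)-(4) and the greedy search used at test time, the following is a minimal PyTorch sketch. It assumes the encoder output has already been projected to vocabulary logits of shape (T, B, V) with the blank symbol at index 0; it is not the ESPnet implementation used in this paper.

```python
import torch
import torch.nn.functional as F

def ctc_loss_and_greedy(logits, targets, input_lengths, target_lengths):
    """logits: (T, B, V) vocabulary logits from the encoder (blank = index 0)."""
    log_probs = F.log_softmax(logits, dim=-1)          # per-frame P(a[t] | x_L[t]), Eq. (3)
    # Eq. (4): negative log-likelihood, marginalized over alignments as in Eq. (2).
    loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                      blank=0, reduction="mean")
    # Greedy search: per-frame argmax, then collapse repeats and drop blanks.
    best = log_probs.argmax(dim=-1).transpose(0, 1)    # (B, T)
    hypotheses = []
    for frame_ids in best.tolist():
        out, prev = [], None
        for token in frame_ids:
            if token != prev and token != 0:
                out.append(token)
            prev = token
        hypotheses.append(out)
    return loss, hypotheses
```

Since the argmax is taken independently per frame, greedy decoding is trivially parallel apart from the final collapse step, which is the source of CTC's inference speed mentioned above.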
2.3. Stochastic Depth

Stochastic depth [22, 23] is a regularization technique for residual networks. It helps the training of very deep networks by randomly skipping some layers, and it can be viewed as training an ensemble of 2^L sub-models induced by removing some layers of the model.

Consider EncoderLayer(·) in Eq. (1) with a residual connection:

x_l = x_{l-1} + f_l(x_{l-1}),    (11)

for some layer f_l(·). Let b_l be a Bernoulli random variable which takes value 1 with probability p_l. During training, the layer is computed as:

x_l = x_{l-1},                        if b_l = 0,
x_l = x_{l-1} + f_l(x_{l-1}) / p_l,   otherwise.    (12)

Thus, with probability 1 − p_l, the layer skips the f_l(x_{l-1}) part. The denominator p_l ensures that the expectation matches Eq. (11). During testing, we do not skip the layers and use Eq. (11).

The per-layer survival probability is given as p_l = 1 − (l/L)(1 − p_L) with hyper-parameter p_L. This assigns a higher skipping probability to higher layers, as skipping lower layers may harm the overall performance [22]. We use the same fixed p_L for all experiments.
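The layer-dropping rule of Eqs. (11)-(12) and the survival-probability schedule can be sketched as a small PyTorch wrapper around an arbitrary residual branch f_l. The module and helper names are ours, and the sketch assumes the branch maps a tensor to a tensor of the same shape.

```python
import torch
import torch.nn as nn

def survival_prob(l: int, L: int, p_L: float) -> float:
    """p_l = 1 - (l / L) * (1 - p_L): higher layers are skipped more often."""
    return 1.0 - (l / L) * (1.0 - p_L)

class StochasticDepthResidual(nn.Module):
    """x_l = x_{l-1} + f_l(x_{l-1}) with random skipping during training (Eq. (12))."""
    def __init__(self, branch: nn.Module, p_l: float):
        super().__init__()
        self.branch = branch   # f_l
        self.p_l = p_l         # survival probability of this layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training:
            return x + self.branch(x)              # Eq. (11): keep every layer at test time
        if torch.rand(1).item() >= self.p_l:       # b_l = 0 with probability 1 - p_l
            return x                               # skip the branch entirely
        return x + self.branch(x) / self.p_l       # divide by p_l so E[x_l] matches Eq. (11)
```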
3. INTERMEDIATE CTC LOSS
Stochastic depth aims to improve the training of a multi-layer network using a stochastic ensemble approach, but experiments show the improvement only comes with sufficiently deep networks, e.g., with 24 or more layers [23]. We hypothesize that while stochastic depth is effective for regularizing higher layers, it is not effective for regularizing lower layers, due to its ensemble strategy. As each layer has its own random variable for skipping, the probability of skipping all higher layers is very low. Therefore, in most cases, the lower layers may rely on the remaining higher layers rather than learn a regularized representation by themselves.

In this context, we propose to skip the higher layers as a whole. We choose a layer, called the "intermediate layer", and induce a sub-model by skipping all layers after the intermediate layer. The sub-model relies on the lower layers rather than the higher layers, so training the sub-model regularizes the lower part of the full model. For the position of the intermediate layer, this paper mainly uses ⌊L/2⌋, as it seems a safe choice between lower and higher layers. We discuss other choices in Section 3.1.

As the sub-model and the full model share the lower structure, the output of the sub-model can be denoted as x_{⌊L/2⌋}, the intermediate representation of the full model. Like the full model, we use a CTC loss for the sub-model:

L_InterCTC := − log P_CTC(y | x_{⌊L/2⌋}).    (13)

The sub-model representation x_{⌊L/2⌋} is naturally obtained when we compute the full model. Thus, after computing the CTC loss of the full model, we can compute the CTC loss of the sub-model with very small overhead. The proposed training objective is the weighted sum of the two losses:

L := (1 − w) L_CTC + w L_InterCTC,    (14)

where w is a fixed weight used for all experiments. During testing, we do not use the intermediate prediction and only use the final representation x_L for decoding.

The intermediate loss can also be used jointly with stochastic depth. We expect the intermediate loss to regularize the lower layers and stochastic depth to regularize the higher layers, so combining them further improves the whole model. We show the empirical results in Section 4.
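A minimal sketch of the proposed objective in Eqs. (13)-(14): the intermediate representation is captured during the normal forward pass and fed to a second CTC loss. It assumes batch-first encoder tensors; the module names (`encoder_layers`, `ctc_head`) are hypothetical, and sharing a single CTC projection for both losses is our assumption, not a statement about the paper's exact implementation.

```python
import torch.nn.functional as F

def intermediate_ctc_objective(x, encoder_layers, ctc_head,
                               targets, input_lengths, target_lengths, w):
    """Eq. (14): L = (1 - w) * L_CTC + w * L_InterCTC, reusing x_{floor(L/2)}."""
    mid = len(encoder_layers) // 2          # intermediate position floor(L/2)
    x_mid = None
    for l, layer in enumerate(encoder_layers, start=1):
        x = layer(x)                        # Eq. (1)
        if l == mid:
            x_mid = x                       # already computed: negligible extra cost

    def ctc_nll(rep):                       # Eqs. (4) and (13)
        log_probs = F.log_softmax(ctc_head(rep), dim=-1).transpose(0, 1)  # (T, B, V)
        return F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)

    return (1 - w) * ctc_nll(x) + w * ctc_nll(x_mid)
```

Because the intermediate representation is produced anyway during the forward pass, the only extra work is one additional projection and CTC forward-backward, which matches the "very small overhead" claim above.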
3.1. Position Variants

We also consider different sub-model configurations and investigate their effects. We consider the following variants:

• Lower than the middle. Depending on the number of layers L, the optimal ratio of lower layers to higher layers may differ. To study the effect of the position of the intermediate loss, we consider a position lower than the middle, e.g., ⌊L/4⌋, for the sub-model.

• Multiple sub-models. We consider multiple sub-models rather than only one. For K sub-models, we compute the following loss (sketched in code at the end of Section 3.2):

− (1/K) \sum_{k=1}^{K} log P_CTC(y | x_{⌊kL/(K+1)⌋}).    (15)

For K = 1, the loss corresponds to Eq. (13).

• Random position. We also consider randomly choosing the sub-model among multiple candidates. We introduce a random variable u drawn uniformly from a range of intermediate layer positions below L, and choose the u-th layer for the intermediate representation.

We show the experimental results in Section 4.2.

3.2. Stochastic Variant

In Eq. (14), we compute the weighted sum of the two losses. Instead, we may compute a stochastic variant of the loss, in the manner of stochastic depth, as follows. Let b be a Bernoulli random variable which takes value 1 with probability w. The stochastic intermediate CTC objective is:

L' := L_CTC,        if b = 0,
L' := L_InterCTC,   otherwise.    (16)

This loss coincides with Eq. (14) in expectation. We argue that the deterministic version is better than the stochastic one for gradient-based learning, even though they have the same expected value. For the stochastic variant, the loss and its gradient only have access to L_InterCTC when b = 1, and the model may forget features useful for L_CTC but not for L_InterCTC. The deterministic variant, on the other hand, always computes the two losses at the same time, so the risk of forgetting features is low. In Section 4.2, we experimentally show that while the stochastic variant also improves the model, it is not as effective as the deterministic one.
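The following sketch covers the two variants above: the averaged multi-position loss of Eq. (15) and the stochastic objective of Eq. (16). It assumes the per-layer outputs x_1, ..., x_L have been collected in a list during the forward pass; all names are ours, not from the paper's codebase.

```python
import torch
import torch.nn.functional as F

def multi_position_ctc_loss(layer_outputs, ctc_head,
                            targets, input_lengths, target_lengths, K):
    """Eq. (15): average the CTC losses at layers floor(kL/(K+1)), k = 1..K."""
    L = len(layer_outputs)
    total = 0.0
    for k in range(1, K + 1):
        pos = (k * L) // (K + 1)                           # 1-based layer position
        log_probs = F.log_softmax(ctc_head(layer_outputs[pos - 1]),
                                  dim=-1).transpose(0, 1)  # (T, B, V)
        total = total + F.ctc_loss(log_probs, targets,
                                   input_lengths, target_lengths, blank=0)
    return total / K

def stochastic_intermediate_loss(loss_ctc, loss_inter, w):
    """Eq. (16): train on L_InterCTC with probability w, otherwise on L_CTC."""
    return loss_inter if torch.rand(1).item() < w else loss_ctc
```

Note that for K = 1 the multi-position loss reduces to the single intermediate loss at ⌊L/2⌋, and that the stochastic form equals Eq. (14) only in expectation, since each update backpropagates through just one of the two losses.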
3.3. Application to Mask CTC

Mask CTC [13] consists of an encoder, a CTC layer on top of the encoder, and a conditional masked language model (CMLM) [18]. During decoding, the model first generates an initial hypothesis from the CTC layer, replaces any token of low probability (below a given threshold) with a special mask token, and lets the CMLM predict the masked tokens conditioned on the unmasked ones. Since the final output depends on the quality of the initial CTC hypothesis, we apply the intermediate CTC loss to the Mask CTC encoder during training and examine its effect in Section 4.3.
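To make the decoding procedure concrete, here is a simplified sketch of the confidence-based masking step; the tensor layout, `mask_id`, and the helper name are our assumptions, not the implementation of [13].

```python
import torch

def mask_low_confidence(ctc_tokens: torch.Tensor,
                        ctc_probs: torch.Tensor,
                        threshold: float,
                        mask_id: int) -> torch.Tensor:
    """Replace CTC-predicted tokens whose confidence is below `threshold` with
    the mask token; the CMLM then re-predicts only the masked positions."""
    masked = ctc_tokens.clone()
    masked[ctc_probs < threshold] = mask_id
    return masked
```

With a threshold of 0.0 nothing is masked, so the CMLM is bypassed and the CTC output becomes the final prediction, which is the "threshold 0.0" setting discussed in Section 4.3.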
4. EXPERIMENTS
We evaluate the performance of the intermediate CTC loss on three corpora: Wall Street Journal (WSJ) [33] (English, 81 hours), TED-LIUM2 [34] (English, 207 hours), and AISHELL-1 [35] (Chinese, 170 hours). We use ESPnet [29] for all experiments. We use 80-dimensional log-mel features and 3-dimensional pitch features as input, and apply SpecAugment [36] during training. For WSJ and AISHELL-1, we tokenize label sentences as characters. For TED-LIUM2, we tokenize label sentences as sub-words with SentencePiece [37].

For WSJ, the model is trained for 100 epochs. For TED-LIUM2 and AISHELL-1, the model is trained for 50 epochs. After training, the final model parameters are obtained by averaging the models from the last 10 epochs. Note that we do not use any external language model (LM) or beam search, and only use greedy decoding for CTC. Thus, all experiments are based on the non-autoregressive setup in order to keep the benefit of fast and parallel inference of CTC.
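As a practical aside, the final-model averaging described above can be sketched as follows; the helper and file handling are our illustration (assuming each checkpoint is a plain PyTorch state_dict of floating-point tensors), not ESPnet's own averaging script.

```python
import torch

def average_checkpoints(paths):
    """Average the parameters of several checkpoints (e.g., the last 10 epochs)."""
    states = [torch.load(p, map_location="cpu") for p in paths]
    averaged = {}
    for key in states[0]:
        # Assumes all entries are floating-point tensors of identical shape.
        stacked = torch.stack([s[key].float() for s in states])
        averaged[key] = stacked.mean(dim=0)
    return averaged
```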
4.1. Main Results

We show the experimental results for the Transformer and Conformer architectures. For each architecture, we compare four regularization configurations:

• Baseline (no regularization)
• Intermediate CTC ("InterCTC")
• Stochastic depth ("StochDepth")
• Intermediate CTC + stochastic depth ("both")

For Transformer, we use 12-layer, 24-layer, and 48-layer models. Table 1 shows the word error rates (WERs) for WSJ and TED-LIUM2 and the character error rates (CERs) for AISHELL-1. In all of these experiments, intermediate CTC gives an improvement over the baseline model. Stochastic depth improves the 24-layer and 48-layer models, but does not improve the 12-layer models well for WSJ and AISHELL-1. Using both the intermediate loss and stochastic depth gives better results than using only one of them. Thus, we conclude the two methods have complementary effects. Additionally, we apply intermediate CTC to a 6-layer Transformer for WSJ, and obtain a WER improvement from 21.1% to 18.3%. This suggests the intermediate CTC loss is still beneficial for smaller networks.

Table 1. Word error rates (WERs) for WSJ (dev93/eval92) and TED-LIUM2 (dev/test) and character error rates (CERs) for AISHELL-1 (dev/test) with the Transformer encoder. See Section 4.1 for details.

For Conformer, we use a 12-layer model. The results are shown in Table 2. Again, intermediate CTC gives a consistent improvement over the baseline. Stochastic depth gives an improvement for WSJ and AISHELL-1, but not for TED-LIUM2. The combination of the intermediate loss and stochastic depth achieves a WER of 9.9% for WSJ and a CER of 5.2% for AISHELL-1. For WSJ, it outperforms the previously published non-autoregressive results [13, 14, 38], and is close to the state-of-the-art autoregressive result (9.3%) [39]. Also, for AISHELL-1, it outperforms Transformer-based encoder-decoder models [40, 41], and is close to the state-of-the-art autoregressive result (5.1%) [42]. Note that the referenced state-of-the-art results use an autoregressive decoder, and [42] also uses an external LM. On the other hand, our result is based solely on CTC with greedy decoding, without an LM or beam search.

Table 2. Word error rates (WERs) for WSJ (dev93/eval92) and TED-LIUM2 (dev/test) and character error rates (CERs) for AISHELL-1 (dev/test) with the Conformer encoder. See Section 4.1 for details.
4.2. Comparison of Variants

To compare the proposed intermediate loss with the position variants (Section 3.1) and the stochastic variant (Section 3.2), we conduct additional experiments on the WSJ corpus. The results are shown in Table 3. We conduct the following experiments:
• Lower position. We conduct this variant for the 24-layer model, which is sufficiently deep to consider a lower position. We use the 6th layer for the experiment. Despite the deep network, the variant performs slightly worse than the default.
• Multiple positions. We conduct this variant for the 24-layer and 48-layer models, which are very deep, so more sub-models may help. We use K = 3 for the 24-layer model and K = 7 for the 48-layer model, so that the selected layer positions are all multiples of 6. It gives a small improvement for the 24-layer model, but a mixed result for the 48-layer model.

• Random position. We conduct this variant for all models. The result is mixed: it gives no improvement for the 12-layer and 24-layer models, although a small improvement for the 48-layer model.
• Stochastic variant. We conduct this variant for the 12-layer model. As discussed in Section 3.2, the stochastic variant is worse than the deterministic one, although it is still better than no regularization.

From these experimental results, we conclude that the proposed design is a simple yet reasonable choice among the variants.

Table 3. Word error rates (WERs) of the intermediate loss variants for WSJ. See Section 4.2 for details.

                         dev93   eval92
  12-layer  Default       17.5    13.6
            Random        17.4    14.3
            Stochastic    19.0    15.0
  24-layer  Default       15.3    12.4
            Lower         15.8    12.9
            Multiple      15.1    12.0
            Random        15.4    12.4
  48-layer  Default       14.9    12.6
            Multiple      15.4    12.1
            Random        14.7    12.0

Table 4. Word error rates (WERs) of Mask CTC-based non-autoregressive ASR for WSJ. See Section 4.3 for details.

                       threshold   dev93   eval92
  Mask CTC [13]          0.999      15.4    12.1
  Align-Refine [38]        -        13.7    11.4
4.3. Mask CTC with Intermediate CTC

We present an experimental result on Mask CTC-based non-autoregressive ASR with the intermediate loss, as described in Section 3.3. The WSJ corpus is used for the experiment. We use a fixed CTC weight w_CTC, and for the intermediate CTC variant we additionally use an intermediate-loss weight w_InterCTC. We use a Transformer model with a 12-layer encoder and a 6-layer decoder. The model is trained for 500 epochs, and the parameters of the last 60 epochs are averaged.

Table 4 shows the WERs for Mask CTC. The threshold column indicates the probability threshold for the CTC prediction; Mask CTC uses 0.999 by default. If the threshold is 0.0, the model does not use the decoder and simply treats the CTC result as the final prediction. We see that the intermediate CTC loss improves the performance of the CTC prediction, from 13.5% to 11.6%. We also see that the improvement of CTC leads to an overall improvement of Mask CTC, as the WER is reduced from 12.9% to 11.3%. This is also lower than Align-Refine [38] (11.4%), which improves Mask CTC by modifying the role of the CMLM. This shows the intermediate loss helps the training of Mask CTC.
5. CONCLUSION
We present the intermediate CTC loss, an auxiliary task for improving CTC-based speech recognition. The proposed loss is easy to implement, has a small overhead at training time and no overhead at test time. We empirically show that the intermediate CTC loss improves the Transformer and Conformer architectures, and that combining the loss with stochastic depth further improves training, reaching a word error rate (WER) of 9.9% on WSJ and a character error rate (CER) of 5.2% on AISHELL-1, without an autoregressive decoder or external language model.

6. REFERENCES
[1] Jan K. Chorowski et al., "Attention-based models for speech recognition," in Proc. NeurIPS, 2015.
[2] W. Chan et al., "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. ICASSP, 2016.
[3] L. Dong, S. Xu, and B. Xu, "Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition," in Proc. ICASSP, 2018.
[4] Shigeki Karita et al., "Improving Transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration," in Proc. Interspeech, 2019.
[5] Anmol Gulati et al., "Conformer: Convolution-augmented Transformer for speech recognition," in Proc. Interspeech, 2020.
[6] Alex Graves et al., "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML, 2006.
[7] Vineel Pratap et al., "Scaling up online speech recognition using ConvNets," in Proc. Interspeech, 2020.
[8] S. Kriman et al., "QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions," in Proc. ICASSP, 2020.
[9] Alexei Baevski et al., "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Proc. NeurIPS, 2020.
[10] Yajie Miao, Mohammad Gowayyed, and Florian Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in Proc. ASRU, 2015.
[11] Sei Ueno et al., "Acoustic-to-word attention-based model complemented with character-level CTC-based model," in Proc. ICASSP, 2018.
[12] Nanxin Chen et al., "Listen and fill in the missing letters: Non-autoregressive transformer for speech recognition," 2020.
[13] Yosuke Higuchi et al., "Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict," in Proc. Interspeech, 2020.
[14] William Chan et al., "Imputer: Sequence modelling via imputation and dynamic programming," in Proc. ICML, 2020.
[15] Yuya Fujita et al., "Insertion-based modeling for end-to-end automatic speech recognition," in Proc. Interspeech, 2020.
[16] Zhengkun Tian et al., "Spike-triggered non-autoregressive transformer for end-to-end speech recognition," in Proc. Interspeech, 2020.
[17] Jiatao Gu et al., "Non-autoregressive neural machine translation," in Proc. ICLR, 2018.
[18] Marjan Ghazvininejad et al., "Mask-Predict: Parallel decoding of conditional masked language models," in Proc. EMNLP-IJCNLP, 2019.
[19] Xuezhe Ma et al., "FlowSeq: Non-autoregressive conditional sequence generation with generative flow," in Proc. EMNLP-IJCNLP, 2019.
[20] Raphael Shu et al., "Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior," in Proc. AAAI, 2020.
[21] Ashish Vaswani et al., "Attention is all you need," in Proc. NeurIPS, 2017.
[22] Gao Huang et al., "Deep networks with stochastic depth," in Proc. ECCV, 2016.
[23] Ngoc-Quan Pham et al., "Very deep self-attention networks for end-to-end speech recognition," in Proc. Interspeech, 2019.
[24] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, "Layer normalization," 2016.
[25] K. He et al., "Deep residual learning for image recognition," in Proc. CVPR, 2016.
[26] Santiago Fernández, Alex Graves, and Jürgen Schmidhuber, "Sequence labelling in structured domains with hierarchical recurrent neural networks," in Proc. IJCAI, 2007.
[27] Kalpesh Krishna, Shubham Toshniwal, and Karen Livescu, "Hierarchical multitask learning for CTC-based speech recognition," 2019.
[28] Shubham Toshniwal et al., "Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition," in Proc. Interspeech, 2017.
[29] Shinji Watanabe et al., "ESPnet: End-to-end speech processing toolkit," in Proc. Interspeech, 2018.
[30] A. Tjandra et al., "Deja-vu: Double feature presentation and iterated loss in deep transformer networks," in Proc. ICASSP, 2020.
[31] Chunxi Liu et al., "Improving RNN transducer based ASR with auxiliary tasks," in Proc. SLT, 2020.
[32] Alex Graves, "Sequence transduction with recurrent neural networks," 2012.
[33] Douglas B. Paul and Janet M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proc. Workshop on Speech and Natural Language, 1992.
[34] Anthony Rousseau, Paul Deléglise, and Yannick Estève, "Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks," in Proc. LREC, 2014.
[35] Hui Bu et al., "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in Proc. O-COCOSDA, 2017.
[36] Daniel S. Park et al., "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proc. Interspeech, 2019.
[37] Taku Kudo and John Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," in Proc. EMNLP: System Demonstrations, 2018.
[38] Ethan A. Chi, Julian Salazar, and Katrin Kirchhoff, "Align-Refine: Non-autoregressive speech recognition via iterative realignment," 2020.
[39] Sara Sabour, William Chan, and Mohammad Norouzi, "Optimal completion distillation for sequence learning," in Proc. ICLR, 2019.
[40] S. Karita et al., "A comparative study on Transformer vs RNN in speech applications," in Proc. ASRU, 2019.
[41] Zhifu Gao et al., "SAN-M: Memory equipped self-attention for end-to-end speech recognition," in Proc. Interspeech, 2020.
[42] Xinyuan Zhou et al., "Self-and-mixed attention decoder with deep acoustic structure for transformer-based LVCSR," in Proc. Interspeech, 2020.