End-to-End Automatic Speech Recognition with Deep Mutual Learning
Ryo Masumura, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Takanori Ashihara
NTT Media Intelligence Laboratories, NTT Corporation, Japan
E-mail: [email protected]
Abstract—This paper is the first study to apply deep mutual learning (DML) to end-to-end ASR models. In DML, multiple models are trained simultaneously and collaboratively by mimicking each other throughout the training process, which helps to attain the global optimum and prevents models from making over-confident predictions. While previous studies applied DML to simple multi-class classification problems, there are no studies that have used it on more complex sequence-to-sequence mapping problems. For this reason, this paper presents a method to apply DML to state-of-the-art Transformer-based end-to-end ASR models. In particular, we propose to combine DML with recent representative training techniques, i.e., label smoothing, scheduled sampling, and SpecAugment, each of which is essential for powerful end-to-end ASR models. We expect these training techniques to work well with DML because DML has complementary characteristics. We experimented with two setups for Japanese ASR tasks: large-scale modeling and compact modeling. We demonstrate that DML improves the ASR performance of both modeling setups compared with conventional learning methods, including knowledge distillation. We also show that combining DML with the existing training techniques effectively improves ASR performance.
Index Terms: end-to-end ASR, deep mutual learning, Transformer, scheduled sampling, SpecAugment

I. INTRODUCTION
In the automatic speech recognition (ASR) field, there has been growing interest in developing end-to-end ASR systems that directly convert input speech into text. While traditional ASR systems have been built from noisy channel formulations using several component models (i.e., an acoustic model, language model, and pronunciation model), end-to-end ASR systems can learn the overall conversion in one step without any intermediate processing.

Modeling methods and training techniques help to achieve powerful end-to-end ASR models. Recent studies have developed modeling methods that include connectionist temporal classification [1], [2], a recurrent neural aligner [3], a recurrent neural network (RNN) transducer [4], and an RNN encoder-decoder [5]-[9]. In particular, Transformer-based modeling methods have shown the strongest performance in recent studies [10]-[15]. In addition, a few effective training techniques are label smoothing [16], scheduled sampling [17], and SpecAugment [18], [19]. These techniques effectively prevent over-fitting problems caused by maximum likelihood estimation, and combining them can improve end-to-end ASR systems [20]. Furthermore, recent studies have focused on building compact models because computation complexity and memory efficiency must be considered in practice. The most representative technique is knowledge distillation [21] (i.e., teacher-student learning), which trains compact student models to mimic a pre-trained large-scale teacher model. In fact, knowledge distillation is an effective compact end-to-end ASR modeling technique [22]-[24].

To achieve a more powerful and compact end-to-end ASR model, we focused on deep mutual learning (DML) [25], one of the most successful learning strategies in recent machine learning studies. In DML, multiple student models simultaneously learn to solve a target task collaboratively without introducing pre-trained teacher models. In fact, each student model is constrained to mimic the other student models, thereby helping it to find a global optimum and preventing it from making over-confident predictions. DML enables us to construct stronger models using a unified network structure rather than independent learning. In addition, DML can be used to obtain compact models that perform better than those distilled from a strong but static teacher. In previous studies, DML was used on simple multi-class classification problems, such as image classification [25]-[28]. However, no studies have tried DML on more complex sequence-to-sequence mapping problems.

This paper presents a method to incorporate DML into state-of-the-art Transformer-based end-to-end ASR models. In particular, we propose to combine DML with the existing training techniques for end-to-end ASR models. DML is closely related to label smoothing [16]; both aim to prevent models from making over-confident predictions. While label smoothing uses a uniform distribution to smooth the ground-truth distribution, DML leverages the distributions predicted by other student models. Combining both kinds of smoothing should efficiently prevent over-confident predictions. In addition, DML is related to scheduled sampling [17] and SpecAugment [18], [19]. While scheduled sampling and SpecAugment aim to maintain consistency between similar conditioning contexts, DML aims to maintain consistency between different student models.
We expect that these consistency strategies complement each other.

Our experiments using the Corpus of Spontaneous Japanese (CSJ) [29] examined two experimental setups: large-scale modeling and compact modeling. We found that DML improves the ASR performance of both modeling setups compared with conventional learning methods, including knowledge distillation. We also found that combining DML with the existing training techniques effectively improves ASR performance.

II. END-TO-END ASR WITH TRANSFORMER
This section briefly describes end-to-end ASR that uses Transformer-based encoder-decoder models based on auto-regressive generative modeling [10]-[14], [30]. The encoder-decoder models predict a generation probability of a text $W = \{w_1, \cdots, w_N\}$ given speech $X = \{x_1, \cdots, x_M\}$, where $w_n$ is the $n$-th token in the text and $x_m$ is the $m$-th acoustic feature in the speech. $N$ is the number of tokens in the text and $M$ is the number of acoustic features in the speech. In the auto-regressive generative models, the generation probability of $W$ is defined as

$P(W \mid X; \Theta) = \prod_{n=1}^{N} P(w_n \mid W_{1:n-1}, X; \Theta)$,  (1)

where $\Theta$ represents the trainable model parameter sets and $W_{1:n-1} = \{w_1, \cdots, w_{n-1}\}$. In our Transformer-based end-to-end ASR models, $P(w_n \mid W_{1:n-1}, X; \Theta)$ is computed using a speech encoder and a text decoder, both of which are composed of a couple of Transformer blocks.

A. Network structure
Speech encoder:
The speech encoder converts input acoustic features $X$ into the hidden representations $H^{(I)}$ using $I$ Transformer encoder blocks. The $i$-th Transformer encoder block composes the $i$-th hidden representations $H^{(i)}$ from the lower layer inputs $H^{(i-1)}$ as

$H^{(i)} = \mathrm{TransformerEncoderBlock}(H^{(i-1)}; \Theta)$,  (2)

where $\mathrm{TransformerEncoderBlock}()$ is a Transformer encoder block that consists of a scaled dot product multi-head self-attention layer and a position-wise feed-forward network [10]. The hidden representations $H^{(0)} = \{h^{(0)}_1, \cdots, h^{(0)}_{M'}\}$ are produced by

$h^{(0)}_{m'} = \mathrm{AddPositionalEncoding}(h_{m'})$,  (3)

where $\mathrm{AddPositionalEncoding}()$ is a function that adds a continuous vector in which position information is embedded. $H = \{h_1, \cdots, h_{M'}\}$ is produced by

$H = \mathrm{ConvolutionPooling}(x_1, \cdots, x_M; \Theta)$,  (4)

where $\mathrm{ConvolutionPooling}()$ is a function composed of convolution layers and pooling layers. $M'$ is the subsampled sequence length depending on the function.

Text decoder:
The text decoder computes the generative probability of a token from preceding tokens and the hidden representations of the speech. The predicted probabilities of the $n$-th token $w_n$ are calculated as

$P(w_n \mid W_{1:n-1}, X; \Theta) = \mathrm{Softmax}(u^{(J)}_{n-1}; \Theta)$,  (5)

where $\mathrm{Softmax}()$ is a softmax layer with a linear transformation. The input hidden vector $u^{(J)}_{n-1}$ is computed from $J$ Transformer decoder blocks. The $j$-th Transformer decoder block composes the $j$-th hidden representation $u^{(j)}_{n-1}$ from the lower layer inputs $U^{(j-1)}_{1:n-1} = \{u^{(j-1)}_1, \cdots, u^{(j-1)}_{n-1}\}$ as

$u^{(j)}_{n-1} = \mathrm{TransformerDecoderBlock}(U^{(j-1)}_{1:n-1}, H^{(I)}; \Theta)$,  (6)

where $\mathrm{TransformerDecoderBlock}()$ is a Transformer decoder block that consists of a scaled dot product multi-head masked self-attention layer, a scaled dot product multi-head source-target attention layer, and a position-wise feed-forward network [10]. The hidden representations $U^{(0)}_{1:n-1} = \{u^{(0)}_1, \cdots, u^{(0)}_{n-1}\}$ are produced by

$u^{(0)}_{n-1} = \mathrm{AddPositionalEncoding}(\bar{w}_{n-1})$,  (7)

$\bar{w}_{n-1} = \mathrm{Embedding}(w_{n-1}; \Theta)$,  (8)

where $\mathrm{Embedding}()$ is a linear layer that embeds an input token in a continuous vector.
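As a concrete illustration, the following is a minimal PyTorch sketch of the encoder-decoder described in Eqs. (1)-(8). The module names, the ConvolutionPooling design, and the use of PyTorch's built-in Transformer layers are illustrative assumptions rather than the authors' exact implementation; the dimensions follow the experimental setup in Section IV.

```python
# Minimal sketch of the Transformer-based end-to-end ASR model (assumed layer choices).
import math
import torch
import torch.nn as nn

class ConvolutionPooling(nn.Module):
    """Two conv + max-pooling stages; each pooling halves the time axis (cf. Eq. (4))."""
    def __init__(self, feat_dim: int, d_model: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, d_model, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
        )
    def forward(self, x):                       # x: (batch, M, feat_dim)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)   # (batch, M', d_model)

def add_positional_encoding(h):                 # sinusoidal positions, cf. Eqs. (3) and (7)
    _, length, d_model = h.shape
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return h + pe.to(h.device)

class TransformerASR(nn.Module):
    def __init__(self, feat_dim=120, vocab=3262, d_model=256, heads=4, d_ff=2048, I=8, J=6):
        super().__init__()
        self.frontend = ConvolutionPooling(feat_dim, d_model)
        self.embed = nn.Embedding(vocab, d_model)                               # Eq. (8)
        enc_layer = nn.TransformerEncoderLayer(d_model, heads, d_ff, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, heads, d_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, I)                      # Eq. (2)
        self.decoder = nn.TransformerDecoder(dec_layer, J)                      # Eq. (6)
        self.softmax_linear = nn.Linear(d_model, vocab)                         # Eq. (5)

    def forward(self, speech, prev_tokens):
        """speech: (batch, M, feat_dim); prev_tokens: (batch, n-1) conditioning token ids."""
        memory = self.encoder(add_positional_encoding(self.frontend(speech)))   # H^(I)
        u0 = add_positional_encoding(self.embed(prev_tokens))
        n = prev_tokens.size(1)
        causal_mask = torch.triu(torch.full((n, n), float("-inf"), device=speech.device), diagonal=1)
        u = self.decoder(u0, memory, tgt_mask=causal_mask)
        return self.softmax_linear(u)            # logits over the vocabulary for each next token
```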
B. Typical objective function

In end-to-end ASR, a model parameter set can be optimized from the utterance-level training data set $\mathcal{U} = \{(X^1, W^1), \cdots, (X^T, W^T)\}$, where $T$ is the number of utterances in the training data set. An objective function based on the maximum likelihood estimation is defined as

$\mathcal{L}_{\mathrm{mle}}(\Theta) = - \sum_{t=1}^{T} \sum_{n=1}^{N_t} \sum_{w^t_n \in \mathcal{V}} \hat{P}(w^t_n \mid W^t_{1:n-1}, X^t) \log P(w^t_n \mid W^t_{1:n-1}, X^t; \Theta)$,  (9)

where $w^t_n$ is the $n$-th token for the $t$-th utterance and $W^t_{1:n-1} = \{w^t_1, \cdots, w^t_{n-1}\}$. $\mathcal{V}$ represents the vocabulary set, and $N_t$ is the number of tokens in the $t$-th utterance. The ground-truth probability $\hat{P}(w^t_n \mid W^t_{1:n-1}, X^t)$ is defined as

$\hat{P}(w^t_n \mid W^t_{1:n-1}, X^t) = \begin{cases} 1 & (w^t_n = \hat{w}^t_n) \\ 0 & (w^t_n \neq \hat{w}^t_n) \end{cases}$,  (10)

where $\hat{w}^t_n$ is the $n$-th reference token in the $t$-th utterance.
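Eq. (9) with the one-hot target in Eq. (10) is ordinary token-level cross-entropy under teacher forcing. A short sketch, assuming the hypothetical TransformerASR model above and token sequences that start with a sentence-start symbol:

```python
# Illustrative sketch of the maximum likelihood objective in Eqs. (9)-(10).
import torch
import torch.nn.functional as F

def mle_loss(model, speech, tokens):
    """tokens: (batch, N) reference ids with <sos> at position 0 and <eos> at the end."""
    logits = model(speech, tokens[:, :-1])               # condition on W_{1:n-1}
    return F.cross_entropy(                              # -log P(w_n | W_{1:n-1}, X; Theta)
        logits.reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
```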
C. Training techniques

There are several training techniques for end-to-end ASR modeling. This paper introduces the following three techniques.
Label smoothing:
Label smoothing is a regularization technique that can prevent the model from making over-confident predictions [16]. It encourages the model to have higher entropy in its predictions. This paper introduces a uniform distribution over all tokens in the vocabulary to smooth the ground-truth probabilities. Thus, an objective function that uses label smoothing is defined as

$\mathcal{L}_{\mathrm{ls}}(\Theta) = - \sum_{t=1}^{T} \sum_{n=1}^{N_t} \sum_{w^t_n \in \mathcal{V}} \tilde{P}(w^t_n \mid W^t_{1:n-1}, X^t) \log P(w^t_n \mid W^t_{1:n-1}, X^t; \Theta)$,  (11)

$\tilde{P}(w^t_n \mid W^t_{1:n-1}, X^t) = (1-\alpha) \hat{P}(w^t_n \mid W^t_{1:n-1}, X^t) + \frac{\alpha}{|\mathcal{V}|}$,  (12)

where $\alpha$ is a smoothing weight to adjust the smoothing term.

Fig. 1. Deep mutual learning using two student models.
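A minimal sketch of the smoothed target in Eqs. (11)-(12), assuming the smoothing weight of 0.1 used in Section IV; the function name is illustrative:

```python
# Label-smoothed cross-entropy: interpolate the one-hot target with a uniform distribution.
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, alpha=0.1):
    """logits: (batch, N, |V|); targets: (batch, N) reference token ids."""
    vocab = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, alpha / vocab)                       # alpha / |V| term
    smooth.scatter_(-1, targets.unsqueeze(-1), (1.0 - alpha) + alpha / vocab)  # (1-alpha) on reference
    return -(smooth * log_probs).sum(dim=-1).mean()                          # Eq. (11)
```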
Scheduled sampling:
Scheduled sampling is a technique that randomly uses predicted tokens as conditioning tokens in the text decoder [17]. This technique helps reduce the gap between teacher forcing in the training phase and free running in the testing phase. An objective function that uses scheduled sampling is defined as

$\mathcal{L}_{\mathrm{ss}}(\Theta) = - \sum_{t=1}^{T} \sum_{n=1}^{N_t} \sum_{w^t_n \in \mathcal{V}} \hat{P}(w^t_n \mid W^t_{1:n-1}, X^t) \log P(w^t_n \mid \mathcal{S}(W^t_{1:n-1}), X^t; \Theta)$,  (13)

where $\mathcal{S}()$ is a scheduled sampling function with random behavior for the conditioning tokens.
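One possible realization of $\mathcal{S}()$ for Transformer decoders is the common two-pass approximation sketched below: a teacher-forced pass produces greedy predictions, which then randomly replace the ground-truth conditioning tokens with probability p. This is an assumed implementation, not necessarily the authors'; the linear ramp-up of p over epochs (Section IV) is assumed to happen in the outer training loop.

```python
# Sketch of a scheduled sampling function S() (two-pass approximation).
import torch

def scheduled_sampling(model, speech, tokens, p=0.3):
    """tokens: (batch, N) references starting with <sos>; returns mixed conditioning tokens."""
    decoder_in = tokens[:, :-1].clone()
    with torch.no_grad():
        pred = model(speech, decoder_in).argmax(dim=-1)   # pred[:, i] estimates tokens[:, i+1]
    swap = torch.rand(decoder_in.shape, device=decoder_in.device) < p
    swap[:, 0] = False                                    # never replace the <sos> token
    # position i of the decoder input holds w_i, whose greedy prediction is pred[:, i-1]
    decoder_in[:, 1:][swap[:, 1:]] = pred[:, :-1][swap[:, 1:]]
    return decoder_in
```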
SpecAugment:
SpecAugment is a technique that augments input acoustic feature representations [18], [19]. This technique consists of three kinds of deformations: time warping, time masking, and frequency masking. Time warping is a deformation of the acoustic features in the time direction. Time masking and frequency masking mask a block of consecutive time steps or frequency channels. An objective function that uses SpecAugment is defined as

$\mathcal{L}_{\mathrm{sa}}(\Theta) = - \sum_{t=1}^{T} \sum_{n=1}^{N_t} \sum_{w^t_n \in \mathcal{V}} \hat{P}(w^t_n \mid W^t_{1:n-1}, X^t) \log P(w^t_n \mid W^t_{1:n-1}, \mathcal{G}(X^t); \Theta)$,  (14)

where $\mathcal{G}()$ is the SpecAugment deformation function with random behavior for the input acoustic features.
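A sketch of $\mathcal{G}()$ restricted to the frequency and time masking actually used in our experiments (two masks each; widths drawn uniformly, cf. Section IV). This is an illustrative re-implementation under those assumptions, not the authors' code.

```python
# SpecAugment deformation G(): frequency masking and time masking only.
import torch

def spec_augment(features, n_freq_masks=2, max_freq_width=20, n_time_masks=2, max_time_width=100):
    """features: (frames, freq_bins) log mel-filterbank features; returns a masked copy."""
    x = features.clone()
    frames, bins = x.shape
    for _ in range(n_freq_masks):
        width = int(torch.randint(0, max_freq_width + 1, (1,)))
        start = int(torch.randint(0, max(1, bins - width), (1,)))
        x[:, start:start + width] = 0.0               # frequency masking
    for _ in range(n_time_masks):
        width = int(torch.randint(0, min(max_time_width, frames) + 1, (1,)))
        start = int(torch.randint(0, max(1, frames - width), (1,)))
        x[start:start + width, :] = 0.0               # time masking
    return x
```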
III. PROPOSED METHOD

This section details deep mutual learning (DML) for end-to-end ASR. In addition, we present objective functions when combining DML with several training techniques.
A. Deep mutual learning for end-to-end ASR
In DML, $K$ different model parameters $\{\Theta_1, \cdots, \Theta_K\}$ are simultaneously trained to mimic each other, while the conventional training method learns the model parameters to predict ground-truth probabilities for the training instances. Figure 1 represents DML using two student model parameters. A DML-based objective function for training the $k$-th model parameter $\Theta_k$ is defined as

$\mathcal{L}_{\mathrm{dml}}(\Theta_k) = (1-\lambda) \mathcal{L}_{\mathrm{mle}}(\Theta_k) + \frac{\lambda}{K-1} \sum_{i=1, i \neq k}^{K} \mathcal{D}(\Theta_i \| \Theta_k)$,  (15)

where $\mathcal{D}(\Theta_i \| \Theta_k)$ is a mimicry loss to mimic the $i$-th model, and $\lambda$ is an interpolation weight to adjust the influence of the mimicry loss. The mimicry loss is computed from

$\mathcal{D}(\Theta_i \| \Theta_k) = - \sum_{t=1}^{T} \sum_{n=1}^{N_t} \sum_{w^t_n \in \mathcal{V}} P(w^t_n \mid W^t_{1:n-1}, X^t; \Theta_i) \log P(w^t_n \mid W^t_{1:n-1}, X^t; \Theta_k)$.  (16)

In mini-batch training, the $K$ model parameters are optimized jointly and collaboratively. Thus, the $K$ models are learned with the same mini-batches. In each mini-batch step, we compute predicted probability distributions using the $K$ models and update each parameter according to the predicted probability distributions of the others. These optimizations are conducted iteratively until convergence. We finally pick the single model with the smallest validation loss or a pre-defined compact model.
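A minimal sketch of one DML mini-batch step for $K \geq 2$ students, following Eqs. (15)-(16): each student is updated against its MLE loss plus the mimicry loss toward the other students' (detached) predicted distributions. Optimizer construction and data loading are assumed; the function name is illustrative.

```python
# One DML training step over a mini-batch (speech, tokens) for K student models.
import torch
import torch.nn.functional as F

def dml_step(models, optimizers, speech, tokens, lam=0.4):
    K = len(models)
    logits = [m(speech, tokens[:, :-1]) for m in models]        # one forward pass per student
    targets = tokens[:, 1:]
    for k in range(K):
        log_p_k = F.log_softmax(logits[k], dim=-1)
        mle = F.nll_loss(log_p_k.reshape(-1, log_p_k.size(-1)), targets.reshape(-1))
        mimicry = 0.0
        for i in range(K):
            if i == k:
                continue
            p_i = F.softmax(logits[i], dim=-1).detach()          # other students act as soft targets
            mimicry = mimicry - (p_i * log_p_k).sum(dim=-1).mean()   # Eq. (16)
        loss = (1.0 - lam) * mle + lam / (K - 1) * mimicry       # Eq. (15)
        optimizers[k].zero_grad()
        loss.backward()
        optimizers[k].step()
```

At test time only one student is kept, so decoding cost is unchanged.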
B. Deep mutual learning with training techniques

DML can be combined with existing training techniques for end-to-end ASR. This paper proposes new objective functions specific to using DML with label smoothing, scheduled sampling, and SpecAugment. Note that all techniques can be simultaneously combined with DML.
Deep mutual learning with label smoothing:
Both label smoothing and DML avoid peaky predictions with very low entropy. When combining label smoothing with DML, we define an objective function that trains the $k$-th model parameter $\Theta_k$ as

$\mathcal{L}_{\mathrm{dml+ls}}(\Theta_k) = (1-\lambda) \mathcal{L}_{\mathrm{ls}}(\Theta_k) + \frac{\lambda}{K-1} \sum_{i=1, i \neq k}^{K} \mathcal{D}(\Theta_i \| \Theta_k)$,  (17)

where $\mathcal{L}_{\mathrm{ls}}$ is the same as Eq. (11).
Deep mutual learning with scheduled sampling:
When combining scheduled sampling with DML, we aim to make the model more robust to various conditioning tokens by maintaining consistency between different models with different conditioning contexts. Thus, an objective function for the $k$-th model parameter is defined as

$\mathcal{L}_{\mathrm{dml+ss}}(\Theta_k) = (1-\lambda) \mathcal{L}_{\mathrm{ss}}(\Theta_k) + \frac{\lambda}{K-1} \sum_{i=1, i \neq k}^{K} \mathcal{D}_{\mathrm{ss}}(\Theta_i \| \Theta_k)$,  (18)

$\mathcal{D}_{\mathrm{ss}}(\Theta_i \| \Theta_k) = - \sum_{t=1}^{T} \sum_{n=1}^{N_t} \sum_{w^t_n \in \mathcal{V}} P(w^t_n \mid \mathcal{S}_i(W^t_{1:n-1}), X^t; \Theta_i) \log P(w^t_n \mid \mathcal{S}_k(W^t_{1:n-1}), X^t; \Theta_k)$,  (19)

where $\mathcal{L}_{\mathrm{ss}}$ is the same as Eq. (13). $\mathcal{S}_i()$ and $\mathcal{S}_k()$ are the scheduled sampling functions with different random seeds.
Deep mutual learning with SpecAugment:
When combining SpecAugment with DML, we aim to make the model more robust to various acoustic feature examples by maintaining consistency between different models with different deformations. An objective function for the $k$-th model parameter is defined as

$\mathcal{L}_{\mathrm{dml+sa}}(\Theta_k) = (1-\lambda) \mathcal{L}_{\mathrm{sa}}(\Theta_k) + \frac{\lambda}{K-1} \sum_{i=1, i \neq k}^{K} \mathcal{D}_{\mathrm{sa}}(\Theta_i \| \Theta_k)$,  (20)

$\mathcal{D}_{\mathrm{sa}}(\Theta_i \| \Theta_k) = - \sum_{t=1}^{T} \sum_{n=1}^{N_t} \sum_{w^t_n \in \mathcal{V}} P(w^t_n \mid W^t_{1:n-1}, \mathcal{G}_i(X^t); \Theta_i) \log P(w^t_n \mid W^t_{1:n-1}, \mathcal{G}_k(X^t); \Theta_k)$,  (21)

where $\mathcal{L}_{\mathrm{sa}}$ is the same as Eq. (14). $\mathcal{G}_i()$ and $\mathcal{G}_k()$ are the SpecAugment functions with different random seeds.
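Putting Eqs. (17)-(21) together, each student sees its own randomly deformed input and its own sampled conditioning context, and mimics the other students' distributions computed under their own views. A sketch of one combined step, building on the hypothetical helpers sketched earlier (spec_augment, scheduled_sampling, label_smoothing_loss); the exact composition is an assumption for illustration.

```python
# One combined DML step with label smoothing, scheduled sampling, and SpecAugment.
import torch
import torch.nn.functional as F

def dml_combined_step(models, optimizers, speech, tokens, lam=0.4, alpha=0.1, p=0.3):
    K = len(models)
    logits = []
    for m in models:
        aug = torch.stack([spec_augment(utt) for utt in speech])   # G_k: per-student deformation
        context = scheduled_sampling(m, aug, tokens, p)            # S_k: per-student sampling
        logits.append(m(aug, context))
    targets = tokens[:, 1:]
    for k in range(K):
        log_p_k = F.log_softmax(logits[k], dim=-1)
        ls = label_smoothing_loss(logits[k], targets, alpha)       # L_ls under student k's view
        mimicry = 0.0
        for i in range(K):
            if i == k:
                continue
            p_i = F.softmax(logits[i], dim=-1).detach()            # student i's view of the batch
            mimicry = mimicry - (p_i * log_p_k).sum(dim=-1).mean() # Eqs. (19)/(21)
        loss = (1.0 - lam) * ls + lam / (K - 1) * mimicry
        optimizers[k].zero_grad()
        loss.backward()
        optimizers[k].step()
```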
IV. EXPERIMENTS

We experimented using the CSJ [29]. We divided the CSJ into a training set (512.6 hours), a validation set (4.8 hours), and three test sets (1.8 hours, 1.9 hours, and 1.3 hours). We used the validation set to choose several hyperparameters and to conduct early stopping. Each discourse-level speech was segmented into utterances in accordance with our previous work [31]. We used characters as the tokens.
A. Setups
We examined two types of experimental setups: large-scale modeling and compact modeling.

• Large-scale modeling: We set I = 8 for the encoder blocks and J = 6 for the decoder blocks. When introducing DML, we prepared 4 large-scale models and evaluated the single model with the least validation loss.

• Compact modeling: We set I = 2 for the encoder blocks and J = 1 for the decoder blocks, where other parameters were the same as in the large-scale modeling. When introducing knowledge distillation [21] or deep mutual learning, we prepared 1 compact model and 3 large-scale models and evaluated the compact model.

In both setups, Transformer blocks were composed under the following conditions: the dimensions of the output continuous representations were set to 256, the dimensions of the inner outputs in the position-wise feed-forward networks were set to 2,048, and the number of heads in the multi-head attentions was set to 4. For the speech encoder, we used 40 log mel-scale filterbank coefficients appended with delta and acceleration coefficients as acoustic features. The frame shift was 10 ms. The acoustic features passed through two convolution and max pooling layers with a stride of 2, so we downsampled them to 1/4 along the time axis. In the text decoder, we used 256-dimensional word embeddings. We set the vocabulary size to 3,262.

For the optimization, we used the Adam optimizer and varied the learning rate based on the update rule presented in previous studies [10]. The training steps were stopped based on early stopping using the validation set. We set the mini-batch size to 32 utterances and the dropout rate in the Transformer blocks to 0.1. When we introduced label smoothing, we set α to 0.1. Our scheduled sampling-based optimization process used teacher forcing at the beginning of the training steps, and we linearly ramped up the probability of sampling to the specified probability at the specified epoch (20 epochs). Our SpecAugment applied only frequency masking and time masking, where the number of frequency masks and time masks was set to 2, the frequency masking width was randomly chosen from 0 to 20 frequency bins, and the time masking width was randomly chosen from 0 to 100 frames. λ was set to 0.4 in DML. We used a beam search algorithm in which the beam size was set to 20.
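For reference, the settings above can be collected into an illustrative configuration sketch; the field names are assumptions, while the values are taken from this section.

```python
# Hypothetical configuration summary of the two setups (field names are illustrative).
LARGE_SCALE = dict(encoder_blocks=8, decoder_blocks=6, d_model=256, d_ff=2048, heads=4,
                   vocab=3262, dropout=0.1, label_smoothing=0.1, dml_lambda=0.4,
                   dml_students=4, batch_utterances=32, beam_size=20)
COMPACT = dict(LARGE_SCALE, encoder_blocks=2, decoder_blocks=1)
# Compact DML/distillation: 1 compact student trained together with 3 large-scale students.
```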
B. Results

We evaluated various setups using DML and the training techniques in the large-scale modeling and compact modeling setups. Table I shows the experimental results in terms of character error rate.

TABLE I
EXPERIMENTAL RESULTS IN TERMS OF CHARACTER ERROR RATE (%).

Setup                  Label      Scheduled  SpecAugment  Knowledge     DML  Test 1  Test 2  Test 3
                       smoothing  sampling                distillation
Large-scale modeling   -          -          -            -             -     8.83    6.49    7.19
                       √          -          -            -             -     8.41    6.23    6.74
                       -          √          -            -             -     8.59    6.31    6.30
                       -          -          √            -             -     7.48    5.59    5.86
                       √          √          √            -             -     7.24    5.13    5.40
                       -          -          -            -             √
                       √          √          √            -             √
Compact modeling       -          -          -            -             -    12.80    9.43   10.01
                       √          √          √            -             -    11.37    7.88    8.44
                       -          -          -            √             -    11.67    8.28    9.08
                       √          √          √            √             -    11.15    7.58    8.31
                       -          -          -            -             √
                       √          √          √            -             √

First, in the large-scale modeling setup, the results show that each training technique improves Transformer-based end-to-end ASR performance, and combining the techniques effectively improved ASR performance. SpecAugment significantly improved performance in particular. These results indicate that training techniques are important for the Transformer-based end-to-end ASR models. In addition, we improved ASR performance by introducing DML into the Transformer-based end-to-end ASR models, both with and without the training techniques. It is thought that DML could help discover the global optimum and prevent models from making over-confident predictions. The highest results were attained by combining DML and all the training techniques. This suggests that combining DML with existing training techniques effectively improves ASR performance.

Next, in the compact modeling setups, the results show that DML improved performance even more than knowledge distillation. This indicates that DML, in which student models interact with each other during all the training steps, effectively transfers knowledge in large-scale end-to-end ASR models to the compact end-to-end ASR models. These results confirm that DML is a good solution for building Transformer-based end-to-end ASR models.

V. CONCLUSIONS
We have presented a method to incorporate deep mutual learning (DML) into Transformer-based end-to-end automatic speech recognition models. The key advance of our method is to introduce combined training strategies of DML with representative training techniques (label smoothing, scheduled sampling, and SpecAugment) for end-to-end ASR models. Our experiments demonstrated that DML improves the ASR performance of both large-scale modeling and compact modeling setups compared with conventional learning methods, including knowledge distillation. We also showed that combining DML with existing training techniques effectively improves ASR performance.
REFERENCES

[1] G. Zweig, C. Yu, J. Droppo, and A. Stolcke, "Advances in all-neural speech recognition," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4805-4809, 2017.
[2] K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Nahamoo, "Direct acoustics-to-word models for English conversational speech recognition," in Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 959-963, 2017.
[3] H. Sak, M. Shannon, K. Rao, and F. Beaufays, "Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping," in Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1298-1302, 2017.
[4] K. Rao, H. Sak, and R. Prabhavalkar, "Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer," in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 193-199, 2017.
[5] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4945-4949, 2015.
[6] L. Lu, X. Zhang, K. Cho, and S. Renals, "A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition," in Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 3249-3253, 2015.
[7] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4960-4964, 2016.
[8] L. Lu, X. Zhang, and S. Renals, "On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5060-5064, 2016.
[9] R. Masumura, T. Tanaka, T. Moriya, Y. Shinohara, T. Oba, and Y. Aono, "Large context end-to-end automatic speech recognition via extension of hierarchical recurrent encoder-decoder models," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5661-5665, 2019.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Advances in Neural Information Processing Systems (NIPS), pp. 5998-6008, 2017.
[11] L. Dong, S. Xu, and B. Xu, "Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5884-5888, 2018.
[12] S. Zhou, L. Dong, S. Xu, and B. Xu, "Syllable-based sequence-to-sequence speech recognition with the Transformer in Mandarin Chinese," in Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 791-795, 2018.
[13] Y. Zhao, J. Li, X. Wang, and Y. Li, "The SpeechTransformer for large-scale Mandarin Chinese speech recognition," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 7095-7099, 2019.
[14] S. Li, D. Raj, X. Lu, P. Shen, T. Kawahara, and H. Kawai, "Improving Transformer-based speech recognition systems with compressed structure and speech attribute augmentation," in Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 4400-4404, 2019.
[15] R. Masumura, N. Makishima, M. Ihori, A. Takashima, T. Tanaka, and S. Orihashi, "Phoneme-to-grapheme conversion based large-scale pre-training for end-to-end automatic speech recognition," in Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 2822-2826, 2020.
[16] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818-2826, 2016.
[17] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proc. Advances in Neural Information Processing Systems (NIPS), pp. 1171-1179, 2015.
[18] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 2613-2617, 2019.
[19] D. S. Park, Y. Zhang, C.-C. Chiu, Y. Chen, B. Li, W. Chan, Q. V. Le, and Y. Wu, "SpecAugment on large scale datasets," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 6874-6878, 2020.
[20] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani, "State-of-the-art speech recognition with sequence-to-sequence models," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4774-4778, 2018.
[21] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," in Proc. NIPS Deep Learning and Representation Learning Workshop, 2015.
[22] M. Huang, Y. You, Z. Chen, Y. Qian, and K. Yu, "Knowledge distillation for sequence model," in Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 3703-3707, 2018.
[23] H.-G. Kim, H. Na, H. Lee, J. Lee, T. G. Kang, M.-J. Lee, and Y. S. Choi, "Knowledge distillation using output errors for self-attention end-to-end models," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 6181-6185, 2019.
[24] R. Masumura, M. Ihori, A. Takashima, T. Moriya, A. Ando, and Y. Shinohara, "Sequence-level consistency training for semi-supervised end-to-end automatic speech recognition," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 7049-7053, 2020.
[25] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, "Deep mutual learning," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4320-4328, 2018.
[26] H. Zhao, G. Yang, D. Wang, and H. Lu, "Lightweight deep neural network for real-time visual tracking with mutual learning," in Proc. IEEE International Conference on Image Processing (ICIP), pp. 3063-3067, 2019.
[27] R. Wu, M. Feng, W. Guan, D. Wang, H. Lu, and E. Ding, "A mutual learning method for salient object detection with intertwined multi-supervision," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8150-8159, 2019.
[28] F. M. Thoker and J. Gall, "Cross-modal knowledge distillation for action recognition," in Proc. IEEE International Conference on Image Processing (ICIP), pp. 6-10, 2019.
[29] K. Maekawa, H. Koiso, S. Furui, and H. Isahara, "Spontaneous speech corpus of Japanese," in Proc. International Conference on Language Resources and Evaluation (LREC), pp. 947-952, 2000.
[30] S. Karita, N. E. Y. Soplin, S. Watanabe, M. Delcroix, A. Ogawa, and T. Nakatani, "Improving Transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration," in Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1408-1412, 2019.
[31] R. Masumura, H. Sato, T. Tanaka, T. Moriya, Y. Ijima, and T. Oba, "End-to-end automatic speech recognition with a reconstruction criterion using speech-to-text and text-to-speech encoder-decoders,"