On the Usefulness of Self-Attention for Automatic Speech Recognition with Transformers
Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals
Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK
ABSTRACT
Self-attention models such as Transformers, which can capture temporal relationships without being limited by the distance between events, have given competitive speech recognition results. However, we note the range of the learned context increases from the lower to the upper self-attention layers, whilst acoustic events often happen within short time spans in a left-to-right order. This leads to a question: for speech recognition, is a global view of the entire sequence useful for the upper self-attention encoder layers in Transformers? To investigate this, we train models whose encoders have lower self-attention and upper feed-forward layers on Wall Street Journal and Switchboard. Compared to baseline Transformers, no performance drop but minor gains are observed. We further developed a novel metric of the diagonality of attention matrices and found that the learned diagonality indeed increases from the lower to the upper encoder self-attention layers. We conclude the global view is unnecessary in training the upper encoder layers.
Index Terms — speech recognition, transformer, self-attention, end-to-end
1. INTRODUCTION
Self-attention networks (SANs) have recently become a popular research topic in the speech recognition community [1-9], and they can yield superior results compared to recurrent neural networks (RNNs), which are conventionally used to model sequential data. However, due to vanishing gradients, it is difficult for RNNs to model long-range dependencies [10], even with gated structures such as Long Short-Term Memory (LSTM) [11] and Gated Recurrent Unit (GRU) [12]. In SANs, self-attention layers encode contextual information through attention mechanisms [13, 14]. With this mechanism, when learning the hidden representation for each time step of a sequence, a self-attention layer has a global view of the entire sequence and thus can capture temporal relationships without the limitation of range. This is believed to be a key factor in the success of SANs [14].

Previous work on attention-based RNN end-to-end models has shown that for speech recognition, since acoustic events usually happen in a left-to-right order within small time spans, restricting the attention to be monotonic along the time axis improves the model's performance [15-17]. This appears to be in contrast to the reason given for the success of SANs: if the global view provided by the attention module of self-attention layers is beneficial, why then does forcing the attention mechanism to focus on local information result in performance gains for RNN end-to-end models?

To investigate this, we study Transformers [14], which are end-to-end SAN-based models. We explore training Transformers whose encoders have upper (further from the input) feed-forward layers and lower self-attention layers. The feed-forward layers can be viewed as "monotonic left-to-right diagonal attention". We performed extensive experiments on the Wall Street Journal (WSJ) read speech corpus [18] and the Switchboard (SWBD) conversational telephone speech corpus [19], finding that the upper feed-forward layers do not lead to higher error rates – they even give improved accuracy.

To further analyse each self-attention layer, we have developed a novel metric for the diagonality of attention matrices. Based on this metric we found that the overall trend of the average diagonality of each layer increases from the lower layers to the upper layers. Thus, even given a global view of the inputs, the upper layers learned to attend only to local information during training. The lower layers, on the other hand, learned to capture long-range context through the self-attention mechanism.

These observations resolve the seeming contradiction between the previous studies on RNN-based end-to-end models, which restrict the attention to be diagonal, and the explanation for the success of SAN-based models. For attention-based RNN models, the attention mechanism interacts with both the decoder and the encoder. Since an output unit (e.g. a character) is often related to a short time span of acoustic features, the attention layer should attend to a small window of the encoded input sequence in a left-to-right order. In this work we study a self-attention encoder which learns the hidden representation for each time step of the input sequence. The global view of the input sequences enables the lower layers to encode context information well. Once the lower layers have captured sufficient contextual information, the self-attention mechanism is no longer useful for the upper layers. Thus, we conclude the upper self-attention layers are not useful and they can be replaced by feed-forward layers.
2. RELATED WORK

Self-attention and its multi-head attention module [14], which uses multiple attention mechanisms to encode context, are key components of Transformers. Michel et al. [20] remove a proportion of the heads in the multi-head attention of each self-attention layer in trained Transformers, finding it leads to minor performance drops. This implies that not all the attention heads are equally useful. In our work, instead of removing some attention heads in trained models, we replace entire self-attention layers with feed-forward layers and train models with feed-forward layers as the upper layers in the encoder.

In a self-attention layer, a single-layer feed-forward module is stacked on the multi-head attention module. Irie et al. [21] extend the single-layer feed-forward module to a multi-layer module, arguing it can bring more representation power, and show that a SAN with fewer of these modified self-attention layers (and fewer parameters) incurs only minor performance drops compared to a SAN with a larger number of the original self-attention layers. In this work we study the effect of the stacked context among the self-attention layers of the encoder. We do not change the architecture of the self-attention layers; we replace the upper self-attention layers in the encoder of Transformers with feed-forward layers.

Previous works have investigated restricting each self-attention layer to attend to a small window of context and observed a decrease in accuracy [6, 9]. In this work we observe that the lower self-attention layers tend to learn a larger window of context compared to the upper layers, so assigning a uniform window length to each layer may not be optimal. An encoder with upper feed-forward and lower self-attention layers can be viewed as imposing a window of length one on the upper layers, without restricting the window length of the lower layers.

When the upper self-attention layers are replaced with feed-forward layers, the architecture of the encoder is similar to the CLDNN (Convolutional, Long Short-Term Memory Deep Neural Network) [22]. The CLDNN uses an LSTM to model the sequential information and a deep neural network (DNN) to learn further abstract representations for each time step. Stacking a DNN on an LSTM results in a notable error rate reduction compared to pure LSTM models. While we found the upper self-attention layers of the Transformer encoder can be replaced with feed-forward layers, stacking more feed-forward layers does not result in further performance gains. The main goal of this work is to understand the self-attention encoder.
3. MODEL ARCHITECTURE

3.1. Multi-head Attention
Fig. 1: Architectures of (a) a self-attention (SA) encoder layer with multi-head attention (MHA) and (b) a feed-forward (FF) encoder layer. LN is layer normalization [23]. We omit LN and dropout [24] in the equations of the encoder layers, but they are applied in the experiments.

Multi-head attention uses attention mechanisms to encode sequences [14]. We first consider a single attention head. The input sequences to the attention mechanism are mapped to a query sequence $Q$, a key sequence $K$ and a value sequence $V$, where $K$ and $V$ have the same length. For the $i$-th element $Q[i]$ of $Q$, an attention vector is generated by computing the similarity between $Q[i]$ and each element of $K$. Using the attention vector as weights, the output is a weighted sum over the value sequence $V$. Thus, an attention head $\mathrm{A}$ of the multi-head attention can be described as:

$$\mathrm{A}(X_Q, X_K, X_V) = \mathrm{softmax}\Big(\frac{QK^{T}}{\sqrt{d_K}}\Big)\,V \quad (1)$$

$$(Q, K, V) = (X_Q W_Q,\ X_K W_K,\ X_V W_V) \quad (2)$$

where $X_Q \in \mathbb{R}^{n \times d_M}$ and $X_K, X_V \in \mathbb{R}^{m \times d_M}$ are inputs, $m, n$ denote the lengths of the input sequences, and $W_Q, W_K \in \mathbb{R}^{d_M \times d_K}$ and $W_V \in \mathbb{R}^{d_M \times d_V}$ are trainable matrices. The three input sequences $(X_Q, X_K, X_V)$ can be the same sequence, e.g., the speech signal to be recognised. The multi-head attention $\mathrm{MHA}$ uses $h$ attention heads $(\mathrm{A}_1, \mathrm{A}_2, \cdots, \mathrm{A}_h)$ and a trainable matrix $U_H \in \mathbb{R}^{d_H \times d_M}$, $d_H = h \times d_V$, to combine the outputs of the attention heads:

$$\mathrm{MHA}(X_Q, X_K, X_V) = (\mathrm{A}_1, \mathrm{A}_2, \cdots, \mathrm{A}_h)\,U_H \quad (3)$$

3.2. Self-attention and Feed-forward Encoder Layers

The self-attention encoder in a Transformer is a stack of self-attention layers. The $j$-th layer reads the output sequence $X_{j-1}$ from the layer below and uses multi-head attention to process it, that is, $(X_Q, X_K, X_V) = (X_{j-1}, X_{j-1}, X_{j-1})$. The multi-head attention only contains linear operations. Thus, in a self-attention layer, a non-linear feed-forward layer is stacked on the multi-head attention module. A self-attention layer in the encoder of a Transformer can be described as:

$$X'_{j-1} = X_{j-1} + \mathrm{MHA}(X_{j-1}, X_{j-1}, X_{j-1}) \quad (4)$$

$$X_j = X'_{j-1} + \mathrm{ReLU}(X'_{j-1} S + b)\,Z + r \quad (5)$$

where $S \in \mathbb{R}^{d_M \times d_{FF}}$, $Z \in \mathbb{R}^{d_{FF} \times d_M}$, $b \in \mathbb{R}^{d_{FF}}$ and $r \in \mathbb{R}^{d_M}$ are trainable matrices and vectors.

In the encoder, since each self-attention layer learns contextual information from the layer below, the span of the learned context increases from the lower layers to the upper layers. Since acoustic events often happen within small time spans in a left-to-right order, if the inputs to the upper layers have encoded a sufficiently large span of context, then it is unnecessary for the upper layers to learn further temporal relationships. Thus, the multi-head attention module which extracts the contextual information could be redundant, and the self-attention layer will not be essential. If the upper layers of the encoder are self-attention layers and the lower layers have already seen a sufficiently wide context, then the attention mechanism will focus on a narrow range of inputs, since no further contextual information is required. Assuming that acoustic events often happen left-to-right, the attention matrix will tend to be diagonal. Then, since
$\mathrm{MHA}(X_{j-1}, X_{j-1}, X_{j-1}) \approx X_{j-1}$ and self-attention is not helpful, replacing self-attention layers with feed-forward layers will not lead to a drop in accuracy. The architecture of the feed-forward layers is:

$$X_j = X_{j-1} + \mathrm{ReLU}(X_{j-1} S + b)\,Z + r \quad (6)$$

Figure 1 shows the architectures of a self-attention layer and a feed-forward layer. Furthermore, a feed-forward layer can be viewed as a self-attention layer with an identity matrix as its attention matrix.
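To make Eqs. (1)-(6) concrete, the following is a minimal PyTorch sketch of the two encoder layer types. This is not the ESPnet implementation used in the experiments: the class names are illustrative, layer normalization and dropout are omitted (as in the equations above), and PyTorch's nn.MultiheadAttention stands in for Eqs. (1)-(3) with $d_K = d_V = d_M/h$.

```python
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """Encoder self-attention layer, Eqs. (4)-(5): MHA plus a position-wise FF block."""
    def __init__(self, d_model=256, n_heads=4, d_ff=2048):
        super().__init__()
        # nn.MultiheadAttention realises Eqs. (1)-(3): h heads with
        # d_K = d_V = d_model / n_heads, combined by an output projection (U_H).
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(                 # ReLU(X S + b) Z + r
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):                        # x: (batch, time, d_model)
        attn_out, _ = self.mha(x, x, x)          # query = key = value = X_{j-1}
        x = x + attn_out                         # Eq. (4), residual connection
        return x + self.ff(x)                    # Eq. (5), residual connection

class FeedForwardLayer(nn.Module):
    """Eq. (6): in effect a self-attention layer whose attention matrix is the identity."""
    def __init__(self, d_model=256, d_ff=2048):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return x + self.ff(x)                    # Eq. (6), residual connection

# An encoder of the kind studied here: lower self-attention layers,
# upper feed-forward layers (e.g. 10 SA + 2 FF in the 12-layer setup).
encoder = nn.Sequential(*([SelfAttentionLayer() for _ in range(10)] +
                          [FeedForwardLayer() for _ in range(2)]))
x = torch.randn(8, 100, 256)                     # (batch, frames, d_M)
print(encoder(x).shape)                          # torch.Size([8, 100, 256])
```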
4. EXPERIMENTS AND DISCUSSION

4.1. Experimental Setup
We experiment on two datasets: Wall Street Journal (WSJ), which contains 81 hours of read speech training data, and Switchboard (SWBD), which contains 260 hours of conversational telephone speech training data. We use the WSJ dev93 and eval92 test sets and the SWBD eval2000 SWBD/Callhome test sets. We use Kaldi [25] for data preparation and feature extraction – 83-dimensional log-mel filterbank frames with pitch [26]. The output units for the WSJ experiments are 26 characters, and the apostrophe, period, dash, space, noise and sos/eos tokens. The output tokens for the SWBD experiments are tokenized using Byte Pair Encoding (BPE) [27].

We compare Transformers with different types of encoders. The baseline Transformer encoders comprise self-attention layers and are compared with Transformers whose encoders have feed-forward layers following the self-attention layers. Each self-attention/feed-forward layer is counted as a single layer, and encoders with the same number of layers are compared. All the components of each model have the same architecture, except for the number of self-attention/feed-forward layers in the encoder. We employ 12-layer encoders, since a 12-layer architecture is consistent with previous works and has been widely used for Transformer models [1, 7-9, 20]. We also test 6-layer encoders on the WSJ dataset. Other settings of the models follow [7].

In each model, below the Transformer encoder there are two convolutional neural network layers with 256 channels, a stride of 2 and a kernel size of 3, which map the dimension of the input sequence to $d_M$. The multi-head attention components of the self-attention layers have 4 attention heads and $d_V = d_K = 64$, $d_M = 256$. For the feed-forward module of the self-attention layers, as well as for the proposed feed-forward encoder layers, $d_{FF} = 2048$. A dropout rate of 0.1 is used when dropout is applied. The Transformer decoder has 6 layers. Input sequences to the encoder and the decoder are concatenated with sinusoidal positional encoding [14]. Models are implemented using ESPnet [28] and PyTorch [29].

The training schedule (warm-up steps/learning rate decay) follows [1]. Adam [30] is used as the optimizer. The batch size is 32. Label smoothing with smoothing weight 0.1 is used. We train each model for 100 epochs and the averaged parameters of the last 10 epochs are used as the parameters of the final model [1]. Besides the loss from the Transformer decoder $L_D$, a connectionist temporal classification (CTC) [31] loss $L_{CTC}$ is also applied to the Transformer encoder [16]. Following previous work [7], the final loss $L$ for the model is:

$$L = (1 - \lambda) L_D + \lambda L_{CTC} \quad (7)$$

where $\lambda = 0.3$ for WSJ and $\lambda = 0.2$ for SWBD.
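The interpolation in Eq. (7) is simple to assemble. Below is a PyTorch sketch with random tensors standing in for the decoder and encoder outputs; the shapes and names are ours rather than ESPnet's actual interface, and label smoothing is omitted.

```python
import torch
import torch.nn as nn

# Random stand-ins for model outputs; vocab=32 and batch=8 are arbitrary.
lam = 0.3                                        # lambda in Eq. (7) (WSJ setting)
dec_logits = torch.randn(8, 20, 32)              # decoder outputs: (batch, target_len, vocab)
enc_logprobs = torch.randn(50, 8, 32).log_softmax(-1)  # encoder outputs: (frames, batch, vocab)
targets = torch.randint(1, 32, (8, 20))          # reference token ids (0 = CTC blank)
frame_lens = torch.full((8,), 50, dtype=torch.long)    # encoder frame lengths
target_lens = torch.full((8,), 20, dtype=torch.long)   # reference lengths

loss_dec = nn.CrossEntropyLoss()(dec_logits.transpose(1, 2), targets)           # L_D
loss_ctc = nn.CTCLoss(blank=0)(enc_logprobs, targets, frame_lens, target_lens)  # L_CTC
loss = (1 - lam) * loss_dec + lam * loss_ctc     # Eq. (7)
```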
4.2. Experiments on WSJ

For the experiments on WSJ, we first train a baseline model with a 12-layer self-attention encoder. Then, we use this model to decode WSJ eval92 and compute the attention matrices of a randomly sampled utterance from eval92. Figure 2 shows the plots of the attention matrices for each attention head of the lowest layer, a middle layer and the highest layer. The lowest layer attends to a wide range of context. The middle layer puts more attention weight on the diagonal, and the middle two heads of the topmost layer have close to pure diagonal attention matrices, which can be described as $\mathrm{MHA}(X_{j-1}, X_{j-1}, X_{j-1}) \approx X_{j-1}$.

Fig. 2: A sample of the attention vectors of the encoder self-attention layers, generated by the baseline Transformer with a 12-layer encoder: (a) layer 12, (b) layer 5, (c) layer 1. The sampled utterance is from WSJ eval92. While the lowest layer (layer 1, near the input) attends to a wide range of context, the middle layer focuses more on local information and the topmost layer assigns nearly all the attention weight to the diagonal.
This implies that even given a global view of the inputs during training, the topmost layer learned to focus only on local information. Section 4.5 discusses the statistics of the "diagonality" of the attention matrices for each head of every layer.

After training the baseline, we train models whose encoders are built from different numbers of self-attention layers and feed-forward layers. The encoders of these models have 12 layers in total; the lower layers are self-attention layers while the upper layers are feed-forward layers. We start from an encoder with 6 self-attention layers and 6 feed-forward layers. Then, we increase the number of self-attention layers and decrease the number of feed-forward layers. Table 1 shows that as the number of self-attention layers increases, the character error rate (CER) decreases, which implies learning further contextual information is beneficial.

However, when the number of self-attention layers increases to 10, with 2 upper feed-forward layers, the encoder gives almost identical results to the 12-layer self-attention baseline, although the pure 10-layer self-attention encoder has notably higher CERs. Furthermore, although the 11-layer self-attention encoder gives worse results than the 12-layer baseline, the encoder which has 11 self-attention layers and one upper feed-forward layer yields slightly lower CERs than the baseline.

Table 1: Character error rate (CER) on WSJ for the Transformer models with different encoders. The evaluation sets are WSJ eval92 and dev93. SA denotes self-attention layer and FF denotes feed-forward layer.

Total  SA  FF    eval92  dev93
12     12  0     3.5     4.6
12     11  1
12     10  2     3.6     4.6
12     9   3     3.8     4.8
12     8   4     3.9     4.9
12     7   5     4.0     5.1
12     6   6     4.2     5.3
11     11  0     3.6     4.7
10     10  0     4.0     5.2
13     12  1     3.6     4.7
13     11  2     3.6     4.6
14     11  3     3.7     4.6
6      6   0     4.2     5.4
6      5   1     4.2

These results imply it is crucial for the layers below the 11th layer to encode temporal relationships. Above the 10th layer the global view of the sequence is not useful, indicating the contextual information is well captured by the layers beneath.

We further tested whether stacking more feed-forward layers to make deeper encoders is beneficial. As shown in Table 1, this does not give performance gains. We also investigated modifications to the architecture of the stacked feed-forward layers, such as removing residual connections or using an identity mapping [32]. These modifications did not result in a CER reduction compared to the 11-layer self-attention + 1-layer feed-forward encoder.

We also tested the 6-layer encoder architecture; the results are shown in Table 1. The baseline model has 6 self-attention layers as its encoder. We then replace the top one, two and three layers with feed-forward layers respectively. We observe that replacing the topmost layer of the 6-layer self-attention encoder does not lead to reductions in accuracy but to minor improvements, which is consistent with the experimental results for the 12-layer encoder.

4.3. Experiments on SWBD

We further test replacing upper self-attention layers on the larger and more challenging SWBD corpus. The results are shown in Table 2.

Table 2: Word error rate (WER) on SWBD for the Transformer models with different encoders. The evaluation sets are eval2000 SWBD/Callhome. SA denotes self-attention layer and FF denotes feed-forward layer.

Total  SA  FF    SWBD   Callhome
12     12  0     9.0    18.1
12     11  1     9.0    17.8
12     10  2
12     9   3     9.5    18.5
11     11  0     9.0    17.7
10     10  0     9.2    18.4
Transformer [7]  9.0    18.1
Transformer [3]  10.4   18.6
Transformer [5]  10.6   22.3

The encoder with 10 self-attention layers is less accurate than the encoders with 11 and 12 self-attention layers. Also, the 12-layer self-attention encoder has higher word error rates (WERs) than the 11-layer encoder. However, the encoder with 10 self-attention layers and 2 feed-forward layers, which has 12 layers in total, gives the lowest WERs. The encoder with 9 self-attention layers and 3 feed-forward layers yields higher WERs. Thus, the layers below the 10th layer are crucial in learning contextual information. Above the 10th self-attention layer, feed-forward layers are sufficient to learn further abstract representations.

4.4. A Metric for the Diagonality of Attention Matrices

To further analyse each attention layer and each attention head, we propose a novel metric for the diagonality of attention matrices. The $j$-th element in the $i$-th row of the attention matrix is the attention weight between the $i$-th element and the $j$-th element of the input sequence of the self-attention layer. The attention weights sum to 1 in each row, so each row of the attention matrix can be viewed as a probability distribution. If, in the $i$-th row, all the probability mass is allocated to the $i$-th element, then for this row all the attention weight is on the diagonal of the attention matrix. When, for every row $i$, the probability mass is assigned as far as possible from the $i$-th element, the attention matrix has the lowest diagonality. Based on this, we first define the centrality $C_i$ of row $i$:

$$C_i = 1 - \frac{\sum_{j=1}^{n} a_{ij}\,|i-j|}{\max(|i-1|, |i-2|, \cdots, |i-n|)} \quad (8)$$

where $j$ denotes the index of each column, $n$ denotes the length of the input sequence, $a_{ij}$ denotes the attention weight between the $i$-th element and the $j$-th element of the input sequence, and $|i-j|$ is the distance between the $i$-th element and the $j$-th element of the input sequence.

Based on this definition, consider the first row of a $5 \times 5$ attention matrix. For such a matrix, $(1, 0, 0, 0, 0)$ will have centrality 1, $(0, 0, 0, 0, 1)$ will have centrality 0, and $(0.2, 0.2, 0.2, 0.2, 0.2)$ will have centrality 0.5. We define the diagonality $D$ of an attention matrix as the average over the centrality of all its rows:

$$D = \frac{\sum_{i=1}^{n} C_i}{n} \quad (9)$$

4.5. Analysis of the Diagonality

To further evaluate the usefulness of self-attention for each layer, we compute on WSJ eval92 the average diagonality of each attention head for every layer of the baseline 12-layer encoder model, and the average diagonality over all attention heads of each layer. As shown in Figure 4, the overall trend of the average diagonality indeed increases from the lower layers to the upper layers. In the experiments on replacing self-attention layers, models with more than 2 feed-forward layers and fewer than 10 self-attention layers yield higher error rates (Table 1). Figure 4 shows the average diagonality from the 9th layer to the 10th layer is relatively low compared to the topmost two layers.

Fig. 3: Heat map of the averaged diagonality of each attention head in each layer for (a) the 12-layer encoder and (b) the 6-layer encoder. The last column shows the average diagonality over all heads of each layer. Red denotes high diagonality and blue denotes low diagonality.

Fig. 4: The averaged diagonality, with ± standard deviation, of each self-attention layer of the 12-layer encoder baseline. Layer 12 is the topmost layer. The dashed line is the trend line.
These consistent observations indicate contextual information is necessary for the 9th and 10th layers, and thus the self-attention mechanism is essential for these two layers. For the topmost two layers, even with the self-attention mechanism, the diagonality is close to 1, which shows they focus on local information. This is also consistent with the finding in Table 1 that replacing these self-attention layers with feed-forward layers leads to no increase in error rate.

Another interesting observation is that the average diagonality of the 7th and 8th layers is also high. Thus, it is possible that self-attention is also not useful for these two layers. The reason for the high CERs when replacing the 7th to the 10th self-attention layers with feed-forward layers (Table 1) could be the loss of the global view of the 9th and 10th layers. We propose that layers could be replaced not only based on their position but also based on their diagonality, such as replacing the 7th, 8th, 11th and 12th layers with feed-forward layers and leaving the 9th and 10th layers with self-attention.

The average diagonality of each attention head of every layer is shown in Figure 3. In the intermediate layers the diagonality of each head varies significantly – two heads have diagonality close to 1 and two heads have relatively low diagonality. These heads with high diagonality are candidates for replacement with diagonal attention (feed-forward networks). To investigate how the diagonality changes after the upper layers are replaced by feed-forward layers, we compute the average diagonality of each layer of the models with one and two feed-forward layers in the encoder, for which the performance does not drop. Figure 5 shows the overall trend of the average diagonality still increases from the lower layers to the upper layers.

Fig. 5: The averaged diagonality of each layer of the encoders. Layer 12 is the topmost layer. Feed-forward layers have diagonality 1.

We also computed the average diagonality of each layer of the baseline 6-layer Transformer encoder. Figure 6 shows the trend of the average diagonality increases from the lower layers to the upper layers. After replacing the top one or two self-attention layers with feed-forward layers, the trend of the average diagonality remains increasing (Figure 7). For the baseline 6-layer encoder, the 6th layer achieves the highest diagonality. Thus it is possible that replacing only the 6th layer will not harm the performance of the model. Also, as Figure 3 shows, the first layer of the 6-layer encoder has a head whose diagonality is clearly an outlier among the heads in the first layer, and a candidate for replacement with a feed-forward network.

Fig. 6: The averaged diagonality, with ± standard deviation, of each self-attention layer of the 6-layer encoder baseline. Layer 6 is the topmost layer. The dashed line is the trend line.

Fig. 7: The averaged diagonality of each layer of the encoders. Layer 6 is the topmost layer. Feed-forward layers have diagonality 1.
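The metric in Eqs. (8) and (9) is straightforward to compute from a saved attention matrix. Below is a small NumPy sketch (our own illustrative code, using 0-indexed rows where the equations are written 1-indexed); it reproduces the 5 × 5 worked example of Section 4.4.

```python
import numpy as np

def centrality(row, i):
    """Eq. (8): centrality of row i of an attention matrix (0-indexed)."""
    dist = np.abs(i - np.arange(len(row)))   # |i - j| for every column j
    return 1.0 - float(row @ dist) / dist.max()

def diagonality(attn):
    """Eq. (9): diagonality of an attention matrix, the mean centrality of
    its rows. 1 means purely diagonal attention, 0 the least diagonal."""
    return float(np.mean([centrality(row, i) for i, row in enumerate(attn)]))

# The worked example from Section 4.4: first row (i = 0) of a 5 x 5 matrix.
assert np.isclose(centrality(np.array([1., 0., 0., 0., 0.]), 0), 1.0)
assert np.isclose(centrality(np.array([0., 0., 0., 0., 1.]), 0), 0.0)
assert np.isclose(centrality(np.array([.2, .2, .2, .2, .2]), 0), 0.5)

# The identity matrix (a feed-forward "attention" pattern) has diagonality 1.
assert np.isclose(diagonality(np.eye(5)), 1.0)
```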
5. CONCLUSION
In this paper, based on the argument that acoustic events often happen in short time spans with a left-to-right ordering, and that the encoded context increases from the lowest self-attention layer to the highest self-attention layer through the Transformer encoder, we investigated the usefulness of self-attention for the upper layers of the encoder. Our experiments on WSJ and SWBD show that replacing the upper self-attention layers with feed-forward layers does not increase the model's error rate. We developed a novel metric for the diagonality of the attention matrix, finding the overall diagonality indeed increases from the lower layers to the upper layers. These observations imply self-attention is not useful for the upper layers of the encoder. Future work includes replacing self-attention heads and self-attention layers based on their diagonality, and designing novel network architectures based on our findings.
6. REFERENCES

[1] Linhao Dong, Shuang Xu, and Bo Xu, "Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition," in ICASSP. IEEE, 2018, pp. 5884-5888.
[2] Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu, "Syllable-based sequence-to-sequence speech recognition with the Transformer in Mandarin Chinese," Proc. INTERSPEECH 2018, pp. 791-795, 2018.
[3] Ngoc-Quan Pham, Thai-Son Nguyen, Jan Niehues, Markus Müller, and Alex Waibel, "Very deep self-attention networks for end-to-end speech recognition," Proc. INTERSPEECH 2019, pp. 66-70, 2019.
[4] Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, and Sanjeev Khudanpur, "A time-restricted self-attention layer for ASR," in ICASSP. IEEE, 2018, pp. 5874-5878.
[5] Albert Zeyer, Parnia Bahar, Kazuki Irie, Ralf Schlüter, and Hermann Ney, "A comparison of Transformer and LSTM encoder decoder models for ASR," in IEEE ASRU, 2019.
[6] Yongqiang Wang, Abdelrahman Mohamed, Duc Le, Chunxi Liu, Alex Xiao, Jay Mahadeokar, Hongzhao Huang, Andros Tjandra, Xiaohui Zhang, Frank Zhang, et al., "Transformer-based acoustic modeling for hybrid speech recognition," arXiv preprint arXiv:1910.09799, 2019.
[7] Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al., "A comparative study on Transformer vs RNN in speech applications," arXiv preprint arXiv:1909.06317, 2019.
[8] Tomohiro Nakatani, "Improving Transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration," Proc. INTERSPEECH 2019, 2019.
[9] Liang Lu, Changliang Liu, Jinyu Li, and Yifan Gong, "Exploring Transformers for large-scale speech recognition," arXiv preprint arXiv:2005.09684, 2020.
[10] Yoshua Bengio, Patrice Simard, Paolo Frasconi, et al., "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157-166, 1994.
[11] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[12] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[13] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," in ICLR, 2015.
[14] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
[15] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura, "Local monotonic attention mechanism for end-to-end speech and language processing," in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2017, pp. 431-440.
[16] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in ICASSP. IEEE, 2017, pp. 4835-4839.
[17] Shucong Zhang, Erfan Loweimi, Peter Bell, and Steve Renals, "Windowed attention mechanisms for speech recognition," in ICASSP. IEEE, 2019, pp. 7100-7104.
[18] Douglas B. Paul and Janet M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 357-362.
[19] J.J. Godfrey, E.C. Holliman, and J. McDaniel, "SWITCHBOARD: Telephone speech corpus for research and development," in ICASSP, 1992, pp. 517-520.
[20] Paul Michel, Omer Levy, and Graham Neubig, "Are sixteen heads really better than one?," in NeurIPS, 2019, pp. 14014-14024.
[21] Kazuki Irie, Alexander Gerstenberger, Ralf Schlüter, and Hermann Ney, "How much self-attention do we need? Trading attention for feed-forward layers," in ICASSP, 2020.
[22] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Hasim Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in ICASSP, 2015, pp. 4580-4584.
[23] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[24] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[25] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE ASRU, 2011.
[26] Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur, "A pitch extraction algorithm tuned for automatic speech recognition," in ICASSP, 2014, pp. 2494-2498.
[27] Rico Sennrich, Barry Haddow, and Alexandra Birch, "Neural machine translation of rare words with subword units," in ACL, 2016, pp. 1715-1725.
[28] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, "ESPnet: End-to-end speech processing toolkit," Proc. INTERSPEECH 2018, pp. 2207-2211, 2018.
[29] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, "Automatic differentiation in PyTorch," in NIPS 2017 Workshop on Autodiff, 2017.
[30] Diederik P. Kingma and Jimmy Lei Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[31] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369-376.
[32] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Identity mappings in deep residual networks," in ECCV, 2016.