Gaussian Kernelized Self-Attention for Long Sequence Data and Its Application to CTC-based Speech Recognition
Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
Sony Corporation, Japan; Johns Hopkins University, USA
ABSTRACT
Self-attention (SA) based models have recently achieved significant performance improvements in hybrid and end-to-end automatic speech recognition (ASR) systems owing to their flexible context modeling capability. However, it is also known that the accuracy degrades when applying SA to long sequence data. This is mainly due to the length mismatch between the inference and training data, because the training data are usually divided into short segments for efficient training. To mitigate this mismatch, we propose a new architecture based on the Gaussian kernel, which is a shift-invariant kernel. First, we mathematically demonstrate that self-attention with shared weight parameters for queries and keys is equivalent to a normalized kernel function. By replacing this kernel function with the proposed Gaussian kernel, the architecture becomes completely shift-invariant, with the relative position information embedded using a frame indexing technique. The proposed Gaussian kernelized SA was applied to connectionist temporal classification (CTC) based ASR. An experimental evaluation with the Corpus of Spontaneous Japanese (CSJ) and TED-LIUM 3 benchmarks shows that the proposed SA achieves a significant improvement in accuracy (e.g., from 24.0% WER to 6.0% in CSJ) on long sequence data without any windowing techniques.
Index Terms: speech recognition, end-to-end, self-attention, long sequence data
1. INTRODUCTION
In recent years, automatic speech recognition (ASR) using self-attention (SA) [1] has attracted considerable attention. Transformer-based speech recognition [2-6] as well as hybrid [7, 8] and connectionist temporal classification (CTC) [9, 10] models have shown high recognition performance with SA. The SA network has a mathematically simple structure that fully uses matrix-vector operations designed for efficient parallel computation. Thus, the recurrent neural network (RNN) based architecture has been replaced with the SA network because of this efficient computation property and its high performance. However, SA is unsuitable for decoding long sequence data because it has a high computational complexity on the order of the square of the sequence length. In addition, the recognition accuracy degrades in long utterances owing to its excessive flexibility in context modeling. In this paper, we focus on the problem of accuracy degradation in long sequence data. In general, self-attention requires dividing original long recordings into short segments during training for efficient GPU computing. This leads to a mismatch between the sequence lengths of the training and test data, resulting in a performance degradation.

To solve this problem, several approaches have been proposed. Masking [11] limits the range of self-attention by using a Gaussian window, whereas relative positional encoding [12, 13] uses relative embedding in a self-attention architecture to eliminate the effect of the length mismatch. However, masking does not take into account the correlation between input features and relative distance. In addition, relative positional encoding does not mathematically limit the attention to the neighborhood.

Inspired by the mathematical expression of the shared-QK attention used in Reformer [14], in this paper, yet another self-attention reformulation based on a Gaussian kernel is proposed. First, we mathematically demonstrate that the linear layers and softmax functions in the shared-QK attention can be represented as normalized kernel functions, similarly to [15] and [16], which interpret bilinear pooling as a kernel function. These kernel functions are replaced with a Gaussian kernel, and thus we call our model
Gaussian kernelized self-attention. The Gaussian kernel, also known as the radial basis function kernel, has several useful features and has been widely used with support vector machines (SVMs) [17-21]. The Gaussian kernelization applied in our new formulation also provides shift-invariance to the self-attention architecture. This shift-invariance is a highly desirable property for controlling the relative position. To take advantage of this property, we propose concatenating the bare frame index to the input feature, which is called a frame indexing technique.

To compare the differences in SA structure, this paper applies the proposed Gaussian kernelized SA to CTC-based ASR because the decoder network of CTC is rather simple compared with other end-to-end architectures, and we can purely evaluate the effectiveness of the proposed SA against conventional SA methods. An experimental evaluation shows that our proposed SA with frame indexing achieved a significant improvement on long sequence data.
2. SELF-ATTENTION FOR LONG SEQUENCE DATA

2.1. Self-attention
Let $X_i$ and $X_j$ be $D$-dimensional input features of the self-attention network with time indexes $i$ and $j$ in a sequence, respectively. The scaled dot-product attention [1] calculates the attention weight as follows:

$\mathrm{Attn}(i,j) = \mathrm{softmax}\left(\frac{(W^{(Q)} X_i)^\top (W^{(K)} X_j)}{\sqrt{d_k}}\right),$   (1)

where $W^{(Q)}$ and $W^{(K)}$ represent $d_k \times D$ trainable matrices in the linear operation for $X_i$ and $X_j$, respectively. Note that the bias term is included in each matrix. Multi-head attention, which individually calculates the above attention in multiple heads, is effectively used in every layer. For simplicity, we omit the head and layer indexes in our formulation.

Self-attention itself attends to the target frames without any positional limitations. This flexibility is an advantage over conventional neural networks. However, in typical speech recognition encoders, local information is more important than global characteristics for representing phonetic features, particularly in long sequences. Therefore, several masking approaches have been studied to control the attention and make it more local.

2.2. Masking

Sperber et al. used a self-attention architecture with a weighting technique applying a hard or soft mask in acoustic modeling [11]. They limited the target frames to be calculated by adding a mask that has values within a range of $-\infty$ to zero to the attention before the softmax function. In [11], the authors reported that the soft mask is more effective than the hard mask with proper initialization. The soft mask $M^{\mathrm{soft}}$ is equal to a Gaussian window, which is defined as

$M^{\mathrm{soft}}_{i,j} = -\frac{(i-j)^2}{2\sigma^2},$   (2)

where $\sigma$ is a trainable parameter and the standard deviation of the Gaussian, which controls the window size.

2.3. Relative positional encoding

Relative positional encoding [12, 13] is an extension of the absolute positional encoding technique that allows self-attention to handle relative positional information. The absolute positional encoding is defined as follows:

$U_{i,d} = \sin\left(i / 10000^{d/D}\right)$ if $d = 2k$, and $U_{i,d} = \cos\left(i / 10000^{(d-1)/D}\right)$ if $d = 2k+1$,   (3)

where $i$ is the frame index, $d$ is the index of the feature dimension, and $k$ is an arbitrary integer. Typically, the positional encoding is added to the input speech feature, i.e., $X_i \rightarrow X_i + U_i$. However, this encoding depends on absolute positional information. When the test data are longer than the training data, the indexes near the end of the utterance become unseen. This mismatch degrades the speech recognition performance in long sequence data. Relative positional encoding can remove the effects of this mismatch.

Relative positional encoding modifies the attention value before the softmax function as follows:

$A^{\mathrm{rel}}_{ij} = X_i^\top W^{(Q)\top} W^{(K,X)} X_j + X_i^\top W^{(Q)\top} W^{(K,R)} R_{i-j} + u^\top W^{(K,X)} X_j + v^\top W^{(K,R)} R_{i-j}.$   (4)

Here, $R_{i-j}$ is a sinusoid encoding matrix based on Eq. (3), and $u$ and $v$ are trainable parameters.
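As a concrete illustration of Eqs. (1) and (2), the following minimal NumPy sketch adds the Gaussian soft mask to the scaled dot-product logits before the softmax. The single-head setting, the toy dimensions, and all function and variable names are assumptions made for this sketch, not the implementation of [11].

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def soft_masked_attention(X, W_q, W_k, sigma):
    """Scaled dot-product attention (Eq. (1)) with the Gaussian soft mask of Eq. (2).

    X: (T, D) input features, W_q / W_k: (d_k, D) projections, sigma: window size.
    Returns the (T, T) attention weight matrix.
    """
    T = X.shape[0]
    d_k = W_q.shape[0]
    Q = X @ W_q.T                       # (T, d_k) queries
    K = X @ W_k.T                       # (T, d_k) keys
    logits = (Q @ K.T) / np.sqrt(d_k)   # scaled dot products
    idx = np.arange(T)
    # Gaussian window in the log domain: -(i - j)^2 / (2 sigma^2), added before softmax
    mask = -((idx[:, None] - idx[None, :]) ** 2) / (2.0 * sigma ** 2)
    return softmax(logits + mask, axis=-1)

# toy usage
rng = np.random.default_rng(0)
T, D, d_k = 6, 8, 4
X = rng.standard_normal((T, D))
A = soft_masked_attention(X, rng.standard_normal((d_k, D)),
                          rng.standard_normal((d_k, D)), sigma=2.0)
print(A.shape, A.sum(axis=-1))  # each row sums to 1
```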
3. GAUSSIAN KERNELIZED SELF-ATTENTION

3.1. Motivation
Although relative positional encoding can reduce the mismatch, which is a problem with absolute positional encoding, the relative encoding itself does not limit the attention to the neighborhood of the frame. This still allows the self-attention to attend to distant frames that are not important in the encoder of the speech recognition model. By contrast, masking is a reasonable approach for the encoder of speech recognition models because it structurally limits the range of attention. However, the effective window length $\sigma$ in Eq. (2) is fixed at testing time as a trained parameter. Thus, the window length is constant for any input features. To eliminate mismatches and improve the recognition accuracy in long sequence data, our aim is to extend the masking technique to be more adaptive, allowing the range of attention to be trained depending on both the input features and their positions.

3.2. Shared-QK attention

Before describing our proposed Gaussian kernelized self-attention, we describe an important related approach, i.e., shared-QK attention [14]. Shared-QK attention is a variant of self-attention that computes $Q_i$ and $K_j$ in Eq. (1) using shared linear transformation parameters, $W^{(Q)} = W^{(K)} \triangleq W^{(S)}$. The shared-QK attention is then calculated from Eq. (1) as follows:

$\mathrm{Attn}^{(S)}(i,j) = \mathrm{softmax}\left(\frac{(W^{(S)} X_i)^\top (W^{(S)} X_j)}{\sqrt{d_k}}\right).$   (5)

It has been reported to achieve a performance comparable to that of non-shared self-attention with a smaller parameter size [14].

The set of shared-QK attentions is a non-negative symmetric matrix because $Q$ and $K$ are identical and exponential calculations exist in the softmax function. We interpret the attention as a normalized Gram matrix. By introducing the normalization term $Z^{(S)}$, we can rewrite the shared-QK attention in Eq. (5) as follows:

$\mathrm{Attn}^{(S)}(i,j) = \frac{1}{Z^{(S)}} \exp\left(X_i^\top \Sigma^{-1} X_j\right),$   (6)

where $\Sigma$ and $Z^{(S)}$ are defined in the following manner:

$\Sigma^{-1} \triangleq \hat{W}^{(S)\top} \hat{W}^{(S)}, \quad \hat{W}^{(S)} \triangleq W^{(S)} / (d_k)^{1/4},$   (7)

$Z^{(S)} \triangleq \sum_j \exp\left(X_i^\top \hat{W}^{(S)\top} \hat{W}^{(S)} X_j\right).$   (8)

Here, $\Sigma^{-1}$ in Eq. (7) is a positive semidefinite matrix, which can be regarded as the inverse of a full-covariance matrix.

$\mathrm{Attn}^{(S)}(i,j)$ is further rewritten by completing the square of Eq. (6) as follows:

$\mathrm{Attn}^{(S)}(i,j) = \frac{1}{Z^{(S)}} \exp\left(-\frac{1}{2}(X_i - X_j)^\top \Sigma^{-1} (X_i - X_j)\right) \times \exp\left(\frac{1}{2} X_i^\top \Sigma^{-1} X_i\right) \exp\left(\frac{1}{2} X_j^\top \Sigma^{-1} X_j\right),$   (9)

where we have three matrix square forms based on $X_i - X_j$, $X_i$, and $X_j$. We call the matrix square forms of $X_i$ and $X_j$ energy terms.
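The equivalence between the softmax form in Eq. (5) and the normalized kernel form in Eqs. (6)-(8) can be checked numerically. The NumPy sketch below is only an illustration of this algebra under assumed toy dimensions and random weights; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, d_k = 5, 8, 4
X = rng.standard_normal((T, D))
W_s = rng.standard_normal((d_k, D))          # shared projection W^(S)

# Eq. (5): softmax form of shared-QK attention
Q = X @ W_s.T
logits = (Q @ Q.T) / np.sqrt(d_k)
attn_softmax = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Eqs. (6)-(8): normalized exponential kernel with Sigma^{-1} = W_hat^T W_hat
W_hat = W_s / d_k ** 0.25                    # absorbs the 1/sqrt(d_k) scaling
Sigma_inv = W_hat.T @ W_hat
gram = np.exp(X @ Sigma_inv @ X.T)           # exp(X_i^T Sigma^{-1} X_j)
attn_kernel = gram / gram.sum(axis=-1, keepdims=True)

print(np.allclose(attn_softmax, attn_kernel))  # True: both forms coincide
```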
We propose replacing the kernel function of the self-attention in Eq. (9) with a Gaussian kernel having a full-covariance trainable matrix. The Gaussian kernelized attention $\mathrm{Attn}^{(G)}$ is defined as

$\mathrm{Attn}^{(G)}(i,j) = \frac{1}{Z^{(G)}} \exp\left(-\frac{1}{2}(X_i - X_j)^\top \Sigma^{-1} (X_i - X_j)\right).$   (10)

The normalization term $Z^{(G)}$ is defined as follows:

$Z^{(G)} \triangleq \sum_j \exp\left(-\frac{1}{2}(X_i - X_j)^\top \Sigma^{-1} (X_i - X_j)\right).$   (11)

Eq. (10) can be interpreted as the removal of the energy terms from the conventional shared-QK attention in Eq. (9). The Gaussian kernel depends only on the difference between the input features $(X_i - X_j)$, which is shift-invariant. In addition, because it is an exponential function, the attention value approaches zero as the difference increases.

The proposed self-attention architecture requires the query and key to be computed with the same matrix. Therefore, this approach cannot be used in source-target attention and can only be used in self-attention.

Fig. 1: Comparison of self-attention, relative positional encoding, and Gaussian kernelized self-attention with frame indexing for each sequence length (character error rate).
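Returning to Eqs. (10) and (11), the following minimal NumPy sketch computes the Gaussian kernelized attention for a toy sequence. Parameterizing $\Sigma^{-1}$ through a projection matrix as in Eq. (7), as well as the function and variable names, are assumptions made for this sketch rather than details taken from the paper.

```python
import numpy as np

def gaussian_kernel_attention(X, W_s):
    """Gaussian kernelized self-attention, a sketch of Eqs. (10)-(11).

    The energy terms of Eq. (9) are dropped, leaving a kernel that depends
    only on the pairwise differences X_i - X_j (shift-invariant in feature space).
    X: (T, D) inputs, W_s: (d_k, D) shared projection defining Sigma^{-1}.
    """
    d_k = W_s.shape[0]
    W_hat = W_s / d_k ** 0.25
    Sigma_inv = W_hat.T @ W_hat                          # positive semidefinite
    diff = X[:, None, :] - X[None, :, :]                 # (T, T, D) pairwise differences
    quad = np.einsum('ijd,de,ije->ij', diff, Sigma_inv, diff)
    kernel = np.exp(-0.5 * quad)                         # Gaussian kernel values
    return kernel / kernel.sum(axis=-1, keepdims=True)   # normalization Z^(G)

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 8))
A = gaussian_kernel_attention(X, rng.standard_normal((4, 8)))
print(A.shape, np.allclose(A.sum(axis=-1), 1.0))
```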
The Gaussian kernel is a function that depends only on the $X_i - X_j$ term, as in Eq. (10). However, the Gaussian kernel itself does not have the ability to obtain relative positional information. Therefore, we include the absolute positional information by simply appending the frame index $i$ to $X_i$. Owing to the shift-invariant nature of the Gaussian kernelized self-attention, the $X_i - X_j$ term is rewritten as follows:

$\hat{X}_i - \hat{X}_j = [(X_i - X_j)^\top, (i - j)/\alpha]^\top,$   (12)

where $\alpha$ is a scaling factor used to control the scales of the relative position and the input features, which are normalized through a layer normalization function. We set $\alpha$ to a fixed constant in this paper.

By substituting Eq. (12) into Eq. (10), the frame indexing element becomes similar to Eq. (2). However, because $\Sigma^{-1}$ in Eq. (7) is trained by considering both $X_i$ and the frame index, the standard deviation of the Gaussian window, which is proportional to $\Sigma$, is statistically adaptive to the input features. Thus, the proposed method properly embeds relative positional information into the model by concatenating the frame index to the input features. Note that the original self-attention has energy terms, as in Eq. (9). When the frame indexes are concatenated to the input feature in the same way as in the Gaussian kernel, these energy terms become dependent on the absolute indexes. Therefore, this indexing is ineffective unless the attention architecture is shift-invariant, as shown in the experiments below.
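A minimal sketch of the frame indexing of Eq. (12) is shown below: the scaled frame index is concatenated to each feature vector before the Gaussian kernelized attention is applied. The value of $\alpha$ used here is a hypothetical placeholder, since the paper's setting is not given in the text above.

```python
import numpy as np

def append_frame_index(X, alpha):
    """Frame indexing (Eq. (12)): concatenate the scaled frame index to each feature.

    With the shift-invariant Gaussian kernel, the appended dimension contributes
    an (i - j)/alpha term to X_hat_i - X_hat_j, i.e., a trainable Gaussian window.
    X: (T, D) features, alpha: scaling factor between position and features.
    """
    T = X.shape[0]
    idx = np.arange(T, dtype=X.dtype)[:, None] / alpha   # (T, 1) scaled frame indexes
    return np.concatenate([X, idx], axis=-1)             # (T, D + 1)

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 8))
X_hat = append_frame_index(X, alpha=100.0)  # alpha is a placeholder value
# Feeding X_hat to the Gaussian kernelized attention above makes the attention
# depend on both the feature difference and the relative position i - j.
print(X_hat.shape)
```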
4. EXPERIMENTAL EVALUATION

4.1. Experimental setup
The Gaussian kernelized self-attention was evaluated using the CSJ dataset [22] and the TED-LIUM 3 dataset [23]. We compared the Gaussian kernelized self-attention with an RNN, self-attention (Sec. 2.1), masking (Sec. 2.2), relative encoding (Sec. 2.3), and shared-QK attention (Sec. 3.2). The CTC [24] model based on self-attention [9] was used as our baseline architecture to purely compare the difference between the proposed and other self-attention methods, because CTC has a simple decoder architecture compared with other end-to-end models. The methods other than the RNN were implemented under the same conditions except for the structure corresponding to the self-attention. The baseline model consisted of convolutional layers and subsequent 12-layer self-attention blocks. In each self-attention block, the number of dimensions $d_k$ in Eq. (1) was 256 and the number of heads was 4. A middle linear layer followed each self-attention network, and a position-wise feedforward network expanded the dimension of the middle layer to 2,048. The input features were 80-dimensional Mel filter banks and pitch features. The SpecAugment [25] technique was applied to the data. In addition, the features were subsampled to reduce their number by a factor of 4 in the convolutional layers. A positional encoding was added to the input feature just before the first self-attention block. The RNN-based encoder consisted of 4 RNN layers with 1,204 units. All methods were evaluated using greedy decoding without any external language model to purely evaluate the performance of the proposed self-attention network.

For the CSJ data, the training and development sets consisted of 413,408 and 4,000 utterances, respectively. The tokens consisted of 3,262 Japanese characters, including a blank label. For the evaluation data, we prepared the standard evaluation sets, eval1, eval2, and eval3, which were split into short segment units (short eval (1, 2, 3)). To investigate the recognition performance in long sequence data, we also used the original long data without splitting into segments as an additional evaluation set (long eval (1, 2, 3)). The average sequence length was approximately 4.7 sec. for the short evaluation and 772.6 sec. for the long evaluation.

For the TED-LIUM 3 data, the training and development sets consisted of 268,262 and 507 utterances, respectively. The tokens consisted of 654 English tokens, which were encoded using the unigram language model [26], including a blank label. For the evaluation data, a longer dataset was prepared in addition to the standard set, as in CSJ. The average sequence length was 8.2 sec. and 1,004.1 sec., respectively. Note that the longest single talk (1,772 sec.) was extremely long and was evaluated by splitting it in half to avoid a memory shortage.

Fig. 2: Examples of the attention heat map of (a) self-attention in short data, (b) Gaussian kernelized self-attention w/o frame indexing in short data, and (c) Gaussian kernelized self-attention w/ frame indexing in short data. (d), (e), and (f) describe the attentions in long data corresponding to (a), (b), and (c), respectively. The vertical axis represents the source frame index and the horizontal axis represents the target frame index.
Table 1: Comparison of the recognition performance (character error rate) for short and long data in CSJ data. The relative encoding in long sequence data was skipped due to the huge memory requirement of over 700 GB (-).

                          short data                   long data
                          eval1  eval2  eval3  avg.    eval1  eval2  eval3  avg.
  average length (sec.)     5.2    5.4    3.4   4.7    829.5  871.7  616.6  772.6
  RNN                       7.9    5.8    6.3   6.7      9.7    6.5    7.4    7.9
  self-attention            6.5    4.7

Table 2: Comparison of the recognition performance (token error rate) for short and long data in TED-LIUM 3 data.

                          short data          long data
                          dev     test        dev      test
  average length (sec.)   11.35    8.16       771.50   1004.14
  RNN                     21.7    25.6         22.4      30.8
  self-attention          15.2    17.3         82.8      84.6
  + soft mask

Table 1 shows the performance of different model architectures on the CSJ data. In our experiments, the self-attention and shared-QK attention achieved similarly low error rates on the short dataset. However, on the long dataset, the accuracy of these methods decreased. The reason is that the structure of self-attention itself cannot limit the attention to its neighborhood. By contrast, masking was effective, and the performance difference between short and long sequence data was small. However, the recognition performance for short utterances was worse than that of simple self-attention because the flexibility of the attention was suppressed by the fixed-length window. As with self-attention, Gaussian kernelization achieved low error rates in short sequence data, but its performance degraded significantly in long sequence data. By using frame indexing to take into account the relative positional information along with the input features, the Gaussian kernelized attention significantly improved the recognition performance in long sequence data. However, when frame indexing was used for self-attention, the recognition accuracy significantly degraded in both short and long utterances. This was because the energy term of the self-attention became dependent on the absolute positional information, which greatly reduced its generalization ability.

Unfortunately, we could not decode the long data using the relative positional encoding. Because the second term of Eq. (4) requires the same amount of memory as the self-attention, relative positional encoding required more than twice as much memory, at over 700 GB. Therefore, we further investigated the speech recognition performance per utterance length, including the relative positional encoding within a decodable range. We sampled 100 segments each from the CSJ eval1 set to create subsets such that the average utterance length of each subset became 10 seconds, 20 seconds, and so on. Figure 1 shows the speech recognition performance of the self-attention, the relative positional encoding, and the Gaussian kernelized attention with frame indexing for each sequence length of the evaluated data. The relative positional encoding was found to be more robust to the length mismatch than absolute positional encoding. Although self-attention achieved a better performance than the Gaussian kernelized attention in short data, the performance of self-attention degraded as the sequence length increased. By contrast, the Gaussian kernel with frame indexing did not degrade much as the sequence length increased. Therefore, we can confirm that the Gaussian kernelized attention with frame indexing was more robust to a length mismatch than self-attention with either absolute or relative positional encoding.

Figure 2 visualizes the attention weights obtained by a standard self-attention network based on
$\mathrm{Attn}(i,j)$ in Eq. (1) (Figure 2 (a) and (d)), the Gaussian kernelized self-attention $\mathrm{Attn}^{(G)}(i,j)$ in Eq. (10) (Figure 2 (b) and (e)), and the Gaussian kernelized self-attention with the frame indexing in Eq. (12) (Figure 2 (c) and (f)). Self-attention was flexible in short utterances, as indicated in Figure 2 (a). However, when there was a length mismatch between the training and testing data, the attention was dispersed and the attention weights became smaller, as shown in Figure 2 (d). In the case of the Gaussian kernel, the diagonal components mathematically became peaky, as in Figure 2 (b). However, the attention was dispersed in long sequence data, as in the case of self-attention shown in Figure 2 (e). By contrast, with frame indexing, the components around the diagonal were correctly attended even in long speech, as indicated in Figure 2 (f).

Table 2 shows the performance for the TED-LIUM 3 data. In this case, masking maintained the recognition performance even for short utterances. The performance of self-attention with frame indexing was significantly worse than that for the CSJ data. This may be because the average length of the evaluation data was longer than that of CSJ. By contrast, the Gaussian kernelized self-attention with frame indexing achieved a low token error rate similar to masking for both short and long data.
5. CONCLUSION
In this paper, we proposed a new SA architecture called Gaussian kernelized SA. This structure is a natural combination of conventional masking with the kernel structure of SA. With frame indexing, the attention can statistically adapt depending on both the input features and their relative positions. We applied this novel structure to the encoder of a CTC-based ASR model to improve the recognition performance in long sequence data, which exhibit length mismatches between the training and testing data. In the experiments using CSJ and TED-LIUM 3 data, the Gaussian kernelized SA with frame indexing achieved a performance close to that of conventional SA in short sequence data. In addition, our model achieved a significant accuracy improvement (e.g., from 24.0% WER to 6.0% in the Corpus of Spontaneous Japanese (CSJ) benchmark) in long sequence data. In the future, we will attempt to apply the Gaussian kernelized self-attention to RNN-T. In addition, we will expand the Gaussian kernel to include asymmetric attention and source-target attention and use our architecture in a transformer-based ASR system.

6. REFERENCES

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
[2] L. Dong, S. Xu, and B. Xu, "Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884-5888.
[3] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang et al., "A comparative study on Transformer vs RNN in speech applications," arXiv preprint arXiv:1909.06317, 2019.
[4] A. Mohamed, D. Okhonko, and L. Zettlemoyer, "Transformers with convolutional context for ASR," arXiv preprint arXiv:1904.11660, 2019.
[5] A. Zeyer, P. Bahar, K. Irie, R. Schlüter, and H. Ney, "A comparison of Transformer and LSTM encoder decoder models for ASR," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 8-15.
[6] X. Chang, W. Zhang, Y. Qian, J. Le Roux, and S. Watanabe, "End-to-end multi-speaker speech recognition with transformer," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6134-6138.
[7] D. Povey, H. Hadian, P. Ghahremani, K. Li, and S. Khudanpur, "A time-restricted self-attention layer for ASR," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5874-5878.
[8] Y. Wang, A. Mohamed, D. Le, C. Liu, A. Xiao, J. Mahadeokar, H. Huang, A. Tjandra, X. Zhang, F. Zhang et al., "Transformer-based acoustic modeling for hybrid speech recognition," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6874-6878.
[9] N.-Q. Pham, T.-S. Nguyen, J. Niehues, M. Müller, and A. Waibel, "Very deep self-attention networks for end-to-end speech recognition," Proc. Interspeech 2019, pp. 66-70, 2019.
[10] J. Salazar, K. Kirchhoff, and Z. Huang, "Self-attention networks for connectionist temporal classification in speech recognition," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7115-7119.
[11] M. Sperber, J. Niehues, G. Neubig, S. Stüker, and A. Waibel, "Self-attentional acoustic models," in Proc. Interspeech 2018, 2018, pp. 3723-3727. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1910
[12] P. Shaw, J. Uszkoreit, and A. Vaswani, "Self-attention with relative position representations," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018.
[13] arXiv preprint arXiv:2005.09940, 2020.
[14] N. Kitaev, Ł. Kaiser, and A. Levskaya, "Reformer: the efficient Transformer," arXiv preprint arXiv:2001.04451, 2020.
[15] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell, "Compact bilinear pooling," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 317-326.
[16] M. Raginsky and S. Lazebnik, "Locality-sensitive binary codes from shift-invariant kernels," in Advances in Neural Information Processing Systems, 2009, pp. 1509-1517.
[17] B.-C. Kuo, H.-H. Ho, C.-H. Li, C.-C. Hung, and J.-S. Taur, "A kernel-based feature selection method for SVM with RBF kernel for hyperspectral image classification," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 1, pp. 317-326, 2013.
[18] P. P. Dahake, K. Shaw, and P. Malathi, "Speaker dependent speech emotion recognition using MFCC and support vector machine," IEEE, 2016, pp. 1080-1084.
[19] Y. Shao and C.-H. Chang, "Wavelet transform to hybrid support vector machine and hidden Markov model for speech recognition," IEEE, 2005, pp. 3833-3836.
[20] J. Stadermann and G. Rigoll, "A hybrid SVM/HMM acoustic modeling approach to automatic speech recognition," in Proc. Int. Conf. on Spoken Language Processing (ICSLP), 2004.
[21] A. Ganapathiraju, J. E. Hamaker, and J. Picone, "Applications of support vector machines to speech recognition," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2348-2355, 2004.
[22] K. Maekawa, "Corpus of Spontaneous Japanese: its design and evaluation," Proceedings of SSPR, 2003.
[23] F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Estève, "TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation," in International Conference on Speech and Computer. Springer, 2018, pp. 198-208.
[24] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369-376.
[25] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: a simple augmentation method for automatic speech recognition," in INTERSPEECH, 2019.
[26] T. Kudo, "Subword regularization: improving neural network translation models with multiple subword candidates," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018.