Thank you for Attention: A survey on Attention-based Artificial Neural Networks for Automatic Speech Recognition
Priyabrata Karmakar, Shyh Wei Teng, Guojun Lu
Abstract—Attention is a very popular and effective mechanism in artificial neural network-based sequence-to-sequence models. In this survey paper, a comprehensive review of the different attention models used in developing automatic speech recognition systems is provided. The paper focuses on the development and evolution of attention models for offline and streaming speech recognition within recurrent neural network- and Transformer-based architectures.
Index Terms—Automatic speech recognition (ASR), attention mechanism, recurrent neural network (RNN), Transformer, offline ASR, streaming ASR.
I. INTRODUCTION
Automatic speech recognition (ASR) is a type of sequence-to-sequence (seq2seq) task: the input speech sequence is transcribed into a sequence of symbols. The majority of existing state-of-the-art ASR systems consist of three modules: acoustic, pronunciation and language [1]. These three modules are trained separately. The acoustic module predicts phonemes based on input speech features such as Mel Frequency Cepstral Coefficients (MFCC) [2]. The pronunciation module is a hidden Markov model [3] which maps the phonemes predicted by the acoustic module to word sequences. Finally, the language module, which is pre-trained on a large corpus, scores the word sequences. In other words, the language model estimates the probability of the next word based on previously predicted words to establish a meaningful sentence. This traditional approach has some limitations. First, the modules are trained separately with different objective functions, which may result in incompatibility between modules; separate training is also time expensive. Second, the pronunciation module requires a dictionary for mapping between phonemes and word sequences, and this pronunciation dictionary, developed by linguistic experts, is prone to human errors [4], [5].

Over the last decade, deep learning has been applied extensively in various domains, such as image and video processing, machine translation and text processing. Speech recognition is no exception. Early deep learning-based ASR systems mostly followed a hybrid approach in which the acoustic module is replaced by a deep neural network while the remaining modules follow the traditional approach [6], [7], [8]. The recent trend in building ASR systems is to develop an end-to-end deep neural network which maps the input speech sequence directly to a sequence of graphemes, characters or words. In end-to-end ASR systems, the acoustic, pronunciation and language modules are trained jointly to optimise a common objective function, and the network thereby overcomes the limitations of traditional ASR systems. In the literature, two major end-to-end ASR architectures can generally be found: (a) connectionist temporal classification (CTC)-based, and (b) attention-based. CTC uses Markov assumptions to solve the sequence-to-sequence problem with a forward-backward algorithm [9]. The attention mechanism aligns the relevant speech frames for predicting symbols at each output time step [10], [11].

End-to-end ASR models are mainly based on an encoder-decoder architecture. The encoder converts the speech frames and their temporal dependencies into a high-level representation, which is used by the decoder for output predictions. The initial versions of the encoder-decoder architecture for ASR were modelled with a recurrent neural network (RNN) as the main component for sequence processing [12], [13]. An RNN is a type of artificial neural network typically used for modelling sequential data. Apart from the vanilla RNN, variants like long short-term memory (LSTM) [14] and the gated recurrent unit (GRU) [15] are also popular for modelling sequential data, and RNNs can be used in a unidirectional as well as a bi-directional fashion [16], [17]. Convolutional neural networks (CNN), coupled with RNNs [18] or stand-alone [19], have also been used to build effective ASR models. However, processing data sequentially is inefficient and may not capture temporal dependencies effectively. To address the limitations of RNNs, the Transformer network [20] has recently been proposed for sequence-to-sequence transduction. The Transformer is a recurrence-free encoder-decoder architecture in which sequence tokens are processed in parallel using the self-attention mechanism.

Automatic speech recognition operates in two different modes: offline (when recorded speech is available before transcription starts) and online or streaming (when transcription starts as soon as the speaker(s) start speaking). In this paper, we review the attention-based ASR literature for both offline and streaming speech recognition, considering only models built with either a recurrent neural network (RNN) or a Transformer. Nowadays, ASR models are widely embedded in systems like smart devices and chatbots, and the application of the attention mechanism is showing great potential in achieving higher effectiveness and efficiency for ASR. Since the middle of the last decade, a lot of progress has been made on attention-based models.

Priyabrata Karmakar, Shyh Wei Teng and Guojun Lu are with the School of Engineering, IT and Physical Sciences, Federation University Australia. E-mail: {p.karmakar, shyh.wei.teng, guojun.lu}@federation.edu.au.
TABLE I
DIFFERENT TYPES OF ATTENTION MECHANISM FOR ASR

Global/Soft [10]: At each decoder time step, all encoder hidden states are attended.
Local/Hard [23]: At each decoder time step, a set of encoder hidden states (within a window) are attended.
Content-based [24]: Attention is calculated using only the content information of the encoder hidden states.
Location-based [25]: Attention calculation depends only on the decoder states and not on the encoder hidden states.
Hybrid [11]: Attention is calculated using both content and location information.
Self [20]: Attention is calculated over different positions (or tokens) of a sequence itself.
2D [26]: Attention is calculated over both the time and frequency domains.
Hard monotonic [27]: At each decoder time step, only one encoder hidden state is attended.
Monotonic chunkwise [28]: At each decoder time step, a chunk of encoder states (prior to and including the hidden state identified by hard monotonic attention) is attended.
Adaptive monotonic chunkwise [29]: At each decoder time step, the chunk of encoder hidden states to be attended is computed adaptively.

Recently, some survey papers [21], [22] have presented the development of attention-based models in natural language processing (NLP). These surveys document the advancement of a wide range of NLP applications such as machine translation, text and document classification, text summarisation, question answering, sentiment analysis and speech processing. However, the existing literature still lacks a survey specifically targeted at the evolution of attention-based models for ASR, which has motivated us to write this paper.

The rest of the paper is organised as follows. Section II provides a simple explanation of the attention mechanism. A brief introduction to the attention-based encoder-decoder architecture is given in Section III. Section IV discusses the evolution of offline speech recognition, followed by the evolution of streaming speech recognition in Section V. Finally, Section VI concludes the paper.
II. ATTENTION
The attention mechanism can be defined as a method for aligning the relevant frames of the input sequence for predicting the output at a particular time step. In other words, the attention mechanism helps decide which input frame(s) to focus on, and how much, for the output prediction at the corresponding time step. With the help of a toy example, the attention mechanism for a sequence-to-sequence model is explained in this section. Consider an input source sequence X and an output target sequence Y. For simplicity, we consider the number of frames (or tokens) in the input and output sequences to be the same: X = [x_1, x_2, ..., x_n]; Y = [y_1, y_2, ..., y_n].
TABLE II
LIST OF LITERATURE

Attention | Offline ASR | Streaming ASR
RNN-based | [10], [11], [24], [30], [25], [23], [31], [32], [33], [34], [35], [36], [37] | [38], [27], [39], [28], [40], [29], [41], [42], [43], [44], [45]
Transformer-based | [26], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59] | [60], [61], [62], [63], [57], [64], [52], [65], [66], [67], [68], [69], [70]
An encoder processes X into a high-level representation (hidden states) and passes it to the decoder, where the prediction of Y happens. In most cases, the information required to predict a particular frame y_t is confined to a small number of input frames. Therefore, for decoding y_t, it is not necessary to look at every input frame. The attention model aligns the input frames with y_t by assigning a match score to each pair of input frame and y_t. The match scores convey how relevant a particular input frame is to y_t and, accordingly, the decoder decides the degree of focus on each input frame for predicting y_t.

Depending on how the alignments between output and input frames are designed, different types of attention mechanism have been presented in the literature. A list of existing attention models along with short descriptions is provided in Table I. The different attention models are explained in detail throughout the paper. In this survey, we only consider models built within the RNN or Transformer architecture. Table II lists the literature reviewed in the later sections of this paper.
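To make the toy example concrete, the following sketch computes attention weights and a context vector for one decoder time step. It is illustrative only: the dot-product scoring function and all values are our own assumptions, not taken from any particular paper surveyed here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))          # subtract max for numerical stability
    return e / e.sum()

# Toy encoder hidden states: n = 4 input frames, each a 3-dim vector.
H = np.array([[0.1, 0.3, 0.5],
              [0.9, 0.2, 0.1],
              [0.4, 0.4, 0.4],
              [0.0, 0.8, 0.6]])

# Toy decoder state at the current time step.
s_t = np.array([0.7, 0.1, 0.2])

scores = H @ s_t                       # match score for each (input frame, y_t) pair
alpha = softmax(scores)                # attention weights: degree of focus per frame
context = alpha @ H                    # weighted sum of encoder states fed to the decoder

print(alpha, context)
```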
III. ATTENTION-BASED ENCODER-DECODER

For ASR, the attention-based encoder-decoder architecture is broadly classified into two categories: (a) RNN-based, and (b) Transformer-based. In this section, we provide an overview of both categories; a detailed survey follows in the subsequent sections.
A. RNN-based encoder-decoder architecture
Sequence-to-sequence RNN-based ASR models are based on an encoder-decoder architecture. The encoder is an RNN which takes the input sequence and converts it into hidden states. The decoder is also an RNN; it takes the last encoder hidden state as input and processes it into decoder hidden states, which are in turn used for output predictions. This traditional encoder-decoder structure has some limitations:

• The encoder hidden state h_T (the last one) which is fed to the decoder has the entire input sequence information compressed into it. For longer input sequences, this may cause information loss, as h_T may not capture long-range dependencies effectively.

• There is no alignment between the input sequence frames and the output. For predicting each output symbol, instead of focusing on the relevant ones, the decoder considers all input frames with the same importance.

Fig. 1. RNN-based encoder-decoder architecture with attention.

The above issues can be overcome by letting the decoder access all the encoder hidden states (instead of only the last one) so that, at each decoder time step, relevant input frames are given higher priority than others. This is achieved by incorporating an attention mechanism into the encoder-decoder model. As a part of sequence-to-sequence modelling, the attention mechanism was introduced in [71] for machine translation. Inspired by its effectiveness in [71], the attention mechanism was introduced to ASR in [11]. An earlier version of this work was presented in [10].

The model in [11] is named the attention-based recurrent sequence generator (ARSG). The graphical representation of this model is shown in Figure 1. The encoder of the ARSG processes the input audio frames into encoder hidden states, which are then used to predict output phonemes. By focusing on the relevant encoder hidden states, the prediction of phoneme y_i at the i-th decoder time step is given by (1):

y_i = Spell(s_{i-1}, c_i),    (1)

where c_i is the context, given by (2), generated by the attention mechanism at the i-th decoder time step, and s_i, given by (3), is the decoder hidden state at the i-th time step; it is the output of a recurrent function like an LSTM or GRU. Spell(., .) is a feed-forward neural network with softmax output activation.

c_i = Σ_{j=1}^{L} α_{i,j} h_j,    (2)

where h_j is the encoder hidden state at the j-th encoder time step, and α_{i,j}, given by (4), is the attention probability assigned to the j-th encoder hidden state for the output prediction at the i-th decoder time step. In other words, α_{i,j} captures the importance of the j-th input speech frame (or encoder hidden state) for decoding the i-th output word (or phoneme or character). The α_i values are also considered as the alignment of the encoder hidden states (h_j, j ∈ [1, ..., L]) for predicting an output at the i-th decoder time step. Therefore, c_i is the sum of the products of the attention probabilities and the hidden states over all encoder time steps at the i-th decoder time step, and it provides a context to the decoder to decode (or predict) the corresponding output.

s_i = Recurrent(s_{i-1}, c_i, y_{i-1}).    (3)

α_{i,j} = exp(e_{i,j}) / Σ_{j'=1}^{L} exp(e_{i,j'}),    (4)

where e_{i,j} is the matching score between the i-th decoder state and the j-th encoder hidden state. It is computed using a hybrid attention mechanism, given by (5) in a general form and by (6) in a parametric form:

e_{i,j} = Attend(s_{i-1}, α_{i-1}, h_j).    (5)

e_{i,j} = w^T tanh(W s_{i-1} + V h_j + U f_{i,j} + b),    (6)

where w and b are vectors, and W, V and U are matrices; these are all trainable parameters. f_i = F * α_{i-1} is a set of vectors, one extracted for every encoder state h_j, obtained by convolving the previous alignment α_{i-1} with a trainable matrix F. The tanh function produces a vector, whereas e_{i,j} is a single score; therefore, a dot product of the tanh outcome and w is performed. The mechanism in (5) is referred to as hybrid attention because it considers both location (α) and content (h) information. By dropping either α_{i-1} or h_j, the Attend mechanism becomes content-based or location-based attention, respectively.
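The sketch below shows how the hybrid scores of (6), the weights of (4) and the context of (2) can be computed for one decoder step. It is a minimal illustration: the dimensionalities, the random parameter values and the single 1-D location filter are our own assumptions, not taken from [11].

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_enc, d_dec, d_att, k = 6, 4, 4, 8, 3   # assumed sizes; k = location-filter width

H = rng.normal(size=(L, d_enc))             # encoder hidden states h_1..h_L
s_prev = rng.normal(size=d_dec)             # previous decoder state s_{i-1}
alpha_prev = np.full(L, 1.0 / L)            # previous alignment alpha_{i-1}

W = rng.normal(size=(d_att, d_dec))         # trainable parameters of (6)
V = rng.normal(size=(d_att, d_enc))
U = rng.normal(size=(d_att, 1))             # one location filter for simplicity
b = np.zeros(d_att)
w = rng.normal(size=d_att)

# Location features: convolve the previous alignment with a trainable filter F.
F = rng.normal(size=k)
f = np.convolve(alpha_prev, F, mode="same")[:, None]   # one feature per h_j

# Hybrid scores (6), then attention weights (4) and context (2).
e = np.array([w @ np.tanh(W @ s_prev + V @ H[j] + U @ f[j] + b) for j in range(L)])
alpha = np.exp(e - e.max()); alpha /= alpha.sum()
c = alpha @ H
print(alpha, c)
```

Dropping the V h_j term gives a location-based scorer, and dropping the U f_{i,j} term gives a content-based one, mirroring the taxonomy of Table I.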
B. Transformer-based encoder-decoder architecture
The RNN-based encoder-decoder architecture is sequential in nature. To capture the dependencies, hidden states are generated sequentially, and at each time step the generated hidden state is the output of a function of the previous hidden state. This sequential process is time consuming. Also, during training, the error back-propagates through time, which is again time consuming.

To overcome the limitations of RNNs, the Transformer network was proposed, based entirely on the attention mechanism. In the Transformer network, no recurrent connection is used. Instead, the input frames are processed in parallel at the same time and, during training, no back-propagation through time is needed.

The Transformer network was introduced in [20] for machine translation and later applied successfully to ASR tasks. In this section, the idea of the Transformer is given as described in [20]. The graphical representation of the Transformer is shown in Figure 2.

The Transformer network is composed of an encoder-decoder architecture, but no recurrent or convolutional neural network is involved. Instead, the authors use self-attention to incorporate the dependencies in the seq2seq framework. The encoder is composed of six identical layers where each layer is divided into two sub-layers. The first sub-layer is a multi-head self-attention module and the second one is a position-wise feed-forward neural network. The decoder is also composed of six identical layers but has an additional sub-layer to perform multi-head self-attention over the encoder output.

Fig. 2. Transformer-based encoder-decoder architecture [20].

Around each sub-layer, a residual connection [72] is employed, followed by layer normalisation [73]. In the decoder, of the two multi-head attention blocks, the first one is masked to prevent positions from attending to subsequent positions.

The attention function is used here to obtain an output which is the weighted sum of values, based on matching a query with keys from the corresponding key-value pairs using a scaled dot product. The dimensionalities of the query, key and value vectors are d_k, d_k and d_v, respectively. In practice, attention is computed on a set of queries, keys and values together by stacking these vectors in matrix form. Mathematically, it is given by (7):
Attention(Q, K, V) = softmax(QK^T / √d_k) V,    (7)

where Q, K and V are matrices which represent the queries, keys and values, respectively.

Positional information is added to the input sequence to generate the input embedding upon which the attention is performed. Instead of directly applying attention on the input embeddings, they are linearly projected to d_k- and d_v-dimensional vectors using learned projections, given by (8):

q = XW^q, k = XW^k, v = XW^v,    (8)

where W^q ∈ R^{d_model × d_k}, W^k ∈ R^{d_model × d_k} and W^v ∈ R^{d_model × d_v} are trainable parameters and d_model is the dimension of the input embeddings. X is the input embedding for the encoder and the output embedding for the masked multi-head block of the decoder. For the second multi-head block of the decoder, X is the encoder output for the k and v projections; for the q projection, X is the output of the masked multi-head block.

In the Transformer network [20], the attention mechanism is used in three different ways:

1) Encoder self-attention: In the encoder, the attention mechanism is applied over the input sequences to find the similarity of each token of a sequence with the rest of the tokens.

2) Decoder masked self-attention: Similar to the encoder self-attention, output (target) sequence tokens attend to each other at this stage. However, instead of accessing the entire output sequence at once, the decoder can only access the tokens preceding the token which the decoder attempts to predict. This is done by masking the current and all future tokens of a particular decoder time step, which prevents the training phase from being biased.

3) Encoder-decoder attention: This occurs in the decoder after the masked self-attention stage. With reference to (7), at this stage Q is the linear projection of the vector coming from the decoder's masked self-attention block, whereas K and V are obtained by linearly projecting the vector resulting from the encoder self-attention block. This is the stage where the mapping between the input and output (target) sequences happens. The output of this block is the attention vectors containing the relationships between the tokens of the input and output sequences.

In each sub-layer, attention is performed h times in parallel, hence the name "multi-head attention". In [20], the value of h is 8. According to the authors, multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. The outputs from the attention heads are then concatenated and projected using (9) to obtain the final output of the corresponding sub-layer:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,    (9)

where head_i, i ∈ [1, h], is computed using (8) and W^O ∈ R^{h·d_v × d_model} is a trainable parameter.
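As an illustration of (7)-(9), the following sketch computes scaled dot-product attention and combines h heads for a single sequence. It is a minimal version under assumed dimensions, not the reference implementation of [20].

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_model, h = 5, 16, 4                 # sequence length, model dim, number of heads
d_k = d_v = d_model // h

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Eq. (7): scaled dot-product attention.
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

X = rng.normal(size=(T, d_model))        # input embeddings (positions already added)

heads = []
for _ in range(h):
    # Eq. (8): learned projections per head.
    Wq, Wk = rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k))
    Wv = rng.normal(size=(d_model, d_v))
    heads.append(attention(X @ Wq, X @ Wk, X @ Wv))

# Eq. (9): concatenate the heads and project back to d_model.
Wo = rng.normal(size=(h * d_v, d_model))
out = np.concatenate(heads, axis=-1) @ Wo
print(out.shape)                         # (T, d_model)
```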
IV. OFFLINE SPEECH RECOGNITION

In this section, the evolution of attention-based models is discussed for offline speech recognition. The section is divided into four sub-sections which explore global and local attention with RNN-based models, joint attention-CTC with RNN-based models, and RNN-free Transformer-based models.
A. Global Attention with RNN
Global attention is computed over the entire set of encoder hidden states at every decoder time step. The mechanism illustrated in Section III-A as per [11] is an example of global attention. Since [11], a lot of progress has been made by many researchers.

The authors of [24] presented a global attention mechanism in their
Listen, Attend and Spell (LAS) model. Here,
the Spell function takes as inputs the current decoder state s_i and the context c_i: y_i = Spell(s_i, c_i). s_i is computed using a recurrent function which takes as inputs the previous decoder state (s_{i-1}), the previous output prediction (y_{i-1}) and the previous context (c_{i-1}): s_i = Recurrent(s_{i-1}, y_{i-1}, c_{i-1}). The authors use only the content information to calculate the matching scores, given by (10); the attention probabilities are then calculated by (4) using the matching scores.

e_{i,j} = w^T tanh(W s_{i-1} + V h_j + b).    (10)

A similar content-based global attention was proposed in [30], where a feedback factor is incorporated in addition to the content information when calculating the matching scores, for better numerical stability. In generalised form, it is given by (11):

e_{i,j} = w^T tanh(W [s_i, h_j, β_{i,j}]),    (11)

where β_{i,j} is the attention weight feedback computed using the previously aligned attention vectors, given by (12):

β_{i,j} = σ(w_b^T h_j) · Σ_{k=1}^{i-1} α_{k,j},    (12)

where w_b is a trainable weight vector. Here, the Spell function is computed over s_i, y_{i-1} and c_i, i.e. y_i = Spell(s_i, y_{i-1}, c_i). A numerical sketch of this feedback computation is given at the end of this sub-section.

A character-aware (CA) attention is proposed in [25] to incorporate morphological relations for predicting words and sub-word units (WSU). A separate RNN (named CA-RNN by the authors), which dynamically generates WSU representations, is connected to the decoder in parallel with the encoder network. The decoder hidden state s_{t-1} is required to obtain the attention weights at time step t. s_t is computed using the recurrent function over s_{t-1}, w_{t-1} (the WSU representation) and c_{t-1}. The matching scores required to compute the attention vectors at decoder time step t are calculated using (6). In contrast to [11], the authors use ReLU instead of the tanh function and claim that it provides better ASR performance.
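As promised above, here is a small sketch of the attention weight feedback of (11)-(12). It is illustrative only: the dimensions, the random values and the concatenation-based scorer are assumptions on our part, not the exact parameterisation of [30].

```python
import numpy as np

rng = np.random.default_rng(2)
L, d, d_att = 6, 4, 8                     # encoder length and assumed dimensions

H = rng.normal(size=(L, d))               # encoder hidden states
s_i = rng.normal(size=d)                  # current decoder state
alpha_hist = rng.dirichlet(np.ones(L), size=3)   # alignments from earlier steps 1..i-1

w_b = rng.normal(size=d)                  # trainable feedback vector of (12)
W = rng.normal(size=(d_att, 2 * d + 1))   # scorer over the concatenation in (11)
w = rng.normal(size=d_att)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Eq. (12): feedback from the attention mass each frame has already received.
beta = sigmoid(H @ w_b) * alpha_hist.sum(axis=0)

# Eq. (11): score each frame from [s_i, h_j, beta_{i,j}], then normalise as in (4).
e = np.array([w @ np.tanh(W @ np.concatenate([s_i, H[j], [beta[j]]])) for j in range(L)])
alpha = np.exp(e - e.max()); alpha /= alpha.sum()
print(alpha)
```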
B. Local attention with RNN
In the global attention model, all encoder hidden states are attended at each decoder time step. This results in quadratic computation complexity. In addition, the prediction of a particular decoder output mostly depends on a small number of encoder hidden states, so it is not necessary to attend the entire set of encoder hidden states at each decoder time step. Local attention reduces the computation complexity by focusing on the relevant encoder hidden states. The local attention mechanism is most popular in streaming speech recognition, but it has been applied to offline speech recognition as well. The core idea of local attention is to attend a set of encoder hidden states within a window or range at each decoder time step, instead of attending the entire set of encoder hidden states. Local attention was introduced in [74] for machine translation and has since been applied to ASR as well.

In [23], the window upon which the attention probabilities are computed is [m_{t-1} − w_l, m_{t-1} + w_r], where m_{t-1} is the median of the previous alignment α_{t-1} (i.e. the attention probabilities computed at the previous decoder time step). w_l and w_r are user-defined fixed parameters which determine the span of the window in the left and right directions, respectively. A similar local attention was proposed in [31].

To obtain the attention window, a position difference Δp_t is calculated for the prediction at decoder time step t in [32]. Δp_t is the position difference between the centres of the attention windows of the previous and current decoder time steps. Therefore, given p_{t-1} (the centre of the previous attention window) and Δp_t, the centre of the current attention window can be calculated. After that, the attention window at the t-th decoder time step is set as [p_t − Δp_t, p_t + Δp_t]. Two methods were proposed to estimate Δp_t, given by (13) and (14):

Δp_t = C_max · sigmoid(V_p^T tanh(W_p h_t^d)),    (13)

where V_p and W_p are a trainable vector and matrix, respectively, and C_max is a hyper-parameter which maintains the condition 0 < Δp_t < C_max;

Δp_t = exp(V_p^T tanh(W_p h_t^d)).    (14)

Equations (13) and (14) are named the constrained and unconstrained position predictions, respectively.
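A minimal sketch of the constrained position prediction (13) and the resulting local window follows; the decoder-state dimension and parameter values are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d, d_att, C_max = 4, 8, 10.0             # assumed sizes; C_max bounds the step size

W_p = rng.normal(size=(d_att, d))        # trainable parameters of (13)
V_p = rng.normal(size=d_att)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h_dec = rng.normal(size=d)               # current decoder state h_t^d
p_prev = 42.0                            # centre of the previous attention window

# Eq. (13): constrained position difference, guaranteed to lie in (0, C_max).
delta_p = C_max * sigmoid(V_p @ np.tanh(W_p @ h_dec))
p_t = p_prev + delta_p                   # window centre moves monotonically right

window = (int(p_t - delta_p), int(p_t + delta_p))   # frames attended at step t
print(delta_p, window)
```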
C. Joint attention-CTC with RNN

The two main approaches for end-to-end encoder-decoder ASR are attention-based and CTC [75]-based. In the attention-based approach, the decoder network finds an alignment of the encoder hidden states during the prediction of each element of the output sequence. The task of speech recognition is mostly monotonic; the possibility of a right-to-left dependency is significantly smaller than that of a left-to-right dependency in ASR tasks. However, due to the flexible nature of the attention mechanism, non-sequential alignments are also considered, so noise and irrelevant frames (encoder hidden states) may result in misalignment. This issue becomes worse for longer sequences, as the lengths of the input and output sequences vary due to factors such as the rate of speech, accent and pronunciation; the risk of misalignment in longer sequences is thus higher. In contrast, CTC allows a strict monotonic alignment of speech frames using the forward-backward algorithm [9], [76], but assumes the targets to be conditionally independent of each other. Therefore, temporal dependencies are not properly utilised in CTC, unlike in the attention mechanism. For effective ASR performance, many researchers have combined the advantages of both attention and CTC in a single model, so that the CTC probabilities correct erroneous predictions by the attention mechanism.

A full discussion of CTC and its application to ASR is beyond the scope of this paper; however, a brief introduction to CTC and how it is jointly used with attention is provided here [33], [34]. CTC monotonically maps an input sequence to an output sequence. Consider that the model outputs an L-length letter sequence Y = {y_l ∈ U | l = 1, ..., L} over a set of distinct characters U, given the input sequence X. CTC introduces a frame-wise letter sequence with an additional "blank" symbol, Z = {z_t ∈ U ∪ {blank} | t = 1, ..., T}. By using conditional independence assumptions, the posterior distribution p(Y|X) is factorised as follows:

p(Y|X) ≈ Σ_Z Π_t p(z_t | z_{t-1}, Y) p(z_t | X) p(Y) ≜ p_ctc(Y|X).    (15)

By the Bayes theorem, CTC has three distribution components, similar to the traditional or hybrid ASR: the frame-wise posterior distribution p(z_t|X) (acoustic module), the transition probability p(z_t | z_{t-1}, Y) (pronunciation module), and the letter-based language module p(Y).

Compared with the CTC approach, the attention-based approach does not make any conditional independence assumptions and directly estimates the posterior p(Y|X) based on the chain rule:

p(Y|X) = Π_l p(y_l | y_1, ..., y_{l-1}, X) ≜ p_att(Y|X).    (16)

p_ctc(Y|X) and p_att(Y|X) are the CTC-based and attention-based objective functions, respectively. Finally, the logarithmic linear combination of the CTC- and attention-based objective functions, given by (17), is maximised to leverage the CTC and attention mechanisms together in an ASR model:

L = λ log p_ctc(Y|X) + (1 − λ) log p_att(Y|X),    (17)

where λ is a tunable parameter in the range [0, 1].

In [33], [34], the CTC objective function was incorporated into the attention-based model during training only. However, motivated by the effectiveness of this joint approach, in [35], [36] it is used in the decoding or inference phase as well.
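A one-line sketch of the joint objective (17); the probability values below are placeholders, and in practice the two log-likelihoods would come from a CTC head and an attention decoder sharing one encoder.

```python
import numpy as np

def joint_ctc_attention_loss(log_p_ctc, log_p_att, lam=0.3):
    # Eq. (17): log-linear interpolation of the two objectives
    # (negated here so it can be minimised as a training loss).
    return -(lam * log_p_ctc + (1.0 - lam) * log_p_att)

# Placeholder log-likelihoods for one utterance.
print(joint_ctc_attention_loss(np.log(0.02), np.log(0.05)))
```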
A triggered attention mechanism is proposed in [37]. At each decoder time step, the encoder states which the attention model looks at are controlled by a trigger model. The encoder states are shared with the trigger model, which is a CTC-based network, as well as with the attention model. The trigger sequence, computed from the CTC-generated sequence, provides alignment information that controls the attention mechanism. Finally, the objective functions of the CTC and attention models are optimised jointly.

D. RNN-free Transformer-based models
Self-attention is a mechanism to capture the dependencies within a sequence. It computes the similarity between different frames in the same sequence; in other words, self-attention finds to what extent different positions of a sequence relate to each other. The Transformer network [20] is built entirely on self-attention for seq2seq processing and has been used successfully in ASR as well.

The Transformer was introduced to the ASR domain in [26] with the Speech-Transformer. Instead of capturing only temporal dependencies, the authors of [26] also capture spectral dependencies by computing attention along the time and frequency axes of the input spectrogram features; hence, this attention mechanism is named "2D attention". The set of (q, k, v) for time-domain attention is computed using (8), where the input embedding X is the convolutional features of the spectrogram. For frequency-domain attention, the set of (q, k, v) is the transpose of the same parameters in the time domain. At each block of multi-head attention, the time-domain and frequency-domain attentions are computed in parallel and then concatenated using (9); in this case, the attention heads belong to both the time and frequency domains. The Speech-Transformer was built to output word predictions and was later explored for different modelling units like phonemes, syllables and characters in [46], [47], and for large-scale speech recognition in [48].

A very deep Transformer model for ASR is proposed in [49]. The authors claim that depth is an important factor for obtaining effective ASR performance with the Transformer network. Therefore, instead of using the original configuration of six stacked layers for both the encoder and decoder, more layers (a deep configuration) are used, and the authors experimentally identify the most effective deep encoder-decoder configuration. To facilitate the training of this deep network, a stochastic residual connection is employed around each sub-layer, before the layer normalisation. Another deep Transformer model is proposed in [50], where it is shown that ASR performance increases continually as the number of layers grows up to 42 and the number of attention heads up to 16. The effect on performance beyond 42 layers and 16 attention heads is not provided, probably due to the increased computation complexity. The authors also show experimentally that sinusoidal positional encoding [20] is not required for a deep Transformer model. To increase the model capacity efficiently, the deep Transformer proposed in [51] replaces the single-layer feed-forward network in each Transformer sub-layer with a deep neural network with residual connections.

Training deep Transformers can be difficult, as training often gets caught in a bad local optimum. Therefore, to enable the training of deep Transformers, an iterated loss [77] is used in [52]: the outputs of some intermediate Transformer layers are used to calculate auxiliary cross-entropy losses, which are interpolated to form the final loss function. Apart from that, the "gelu" (Gaussian error linear unit) [78] activation function is used in the feed-forward network of each Transformer layer.
Out ofthe different explored approaches, positional embedding with a convolutional block before each Transformer layer has shownthe best performance.A self-attention based ASR model has been proposed in[53] by replacing the pyramidal recurrent block of LASmodel at the encoder side with multi-head self-attention block.As self-attention computes similarity of each pair of inputframes, the memory grows quadratically with respect to thesequence length. To overcome this, authors have applied adownsampling to the sequence length before feeding it toevery self-attention block. This downsampling is done byreshaping the sequences and it is a trade-off between thesequence length and the dimension. If the sequence lengthis reduced by a factor a , then the dimension increased bythe same factor. Specifically, X ∈ R l × d → (cid:124)(cid:123)(cid:122)(cid:125) reshape ˆ X ∈ R la × ad .Therefore, memory consumption to compute the attentionmatrices is reduced by a . Unlike in [20] where positioninformation is added to input sequence before feeding to theself-attention block, in [53], authors have claimed that addingpositional information to the acoustic sequence makes themodel difficult to read content. Therefore, position informationis concatenated to the acoustic sequence representation andthis concatenated sequence is passed to the self-attentionblocks. In addition, to enhance the context relevance whilecalculating the similarity between speech frames, a Gaussiandiagonal mask with learnable variance is added to the attentionheads. Specifically, an additional bias matrix is added toEquation (7) as given by (18). Attention ( Q, K, V ) =
Sof tmax ( QK T (cid:112) ( d k ) + M ) V, (18)where M is matrix whose values around the diagonal are setto a higher value to force the self-attention attending in a localrange around each speech frame. The elements of this matrixare calculated by a Gaussian function: M i,j = − ( j − k ) σ , σ isa learnable parameter.The quadratic computation complexity during the self-attention computation using (7) has been reduced down tolinear in [54] where the authors have proposed to use the dotproduct of kernel feature maps for the similarity calculationbetween the speech frames followed by the use of associativeproperty of matrix products.For better incorporating long-term dependency using Trans-formers, in [55] Transformer-XL was proposed for machine-translation. In Transformer-XL, a segment-level recurrencemechanism is introduced which enables the reuse of pastencoder states (output of the previous layers) at the trainingtime to maintain a longer history of contexts until they becomesufficiently old. Therefore, queries at current layer have accessto the key-value pairs of current layer as well as previous lay-ers. Based on this concept, Compressive Transformer [56] wasproposed and it was applied to ASR to effectively incorporatelong-term dependencies. In [56], instead of discarding olderencoder states, they were preserved in a compressed form.[51] also explored sharing previous encoder states but reusedonly key vectors from previous layers. Another Transformer-based ASR model is proposed in [57]as an adaptation of RNN-Transducer based model [79] whichuses two RNN-based encoders for audio and labels respec-tively to learn the alignment between them. In [57], audioand label encoders are designed with Transformer networks.Given the previous predicted label from the target label space,the two encoder outputs are combined by a joint network.Vanilla Transformer and the deep Transformer models havea number of layers stacked in both encoder and decoder sides.Each layers and their sub-layers have their own parameters andprocessing them is computationally expensive. In [58], a pa-rameter sharing approach has been proposed for Transformernetwork. The parameters are initialised at the first encoderand decoder layers and thereafter, re-used in the other layers.If the number of encoder and decoder layers is N and thetotal number of parameters in each layer is M , then insteadof using N × M parameters in both encoder and decoder sides,in [58] only M parameters are used. There is a performancedegradation due to sharing the parameters. To overcome that,speech attributes such as, duration of the utterance, sex andage of the speaker are augmented with the ground truth labelsduring training.In self-attention based Transformer models, each speechframe attends all other speech frames of the entire sequenceor within a window. However, some of them like framesrepresenting silence are not crucial for modelling long-range dependencies and may present multiple times in theattended sequence. Therefore, these frames should be avoided.The attention weights (or probabilities) are obtained using sof tmax function which generates non-zero probabilitiesand therefore, insignificant frames are also assigned to someattention weights. To overcome this, in [59] weak-attentionsuppression (WAS) mechanism is proposed. WAS inducedsparsity over the attention probability distribution by settingattention probabilities to zero which are smaller than a dynam-ically determined threshold. More specifically, the threshold isdetermined by (19). 
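A sketch of the Gaussian diagonal mask of (18) follows; the 2σ² denominator is our reconstruction of the standard Gaussian form, and the exact normalisation in [53] may differ.

```python
import numpy as np

def gaussian_mask(T, sigma):
    # M[j, k] = -(j - k)^2 / (2 * sigma^2): zero on the diagonal and
    # increasingly negative away from it, biasing attention to stay local.
    idx = np.arange(T)
    return -((idx[:, None] - idx[None, :]) ** 2) / (2.0 * sigma ** 2)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(4)
T, d_k = 6, 8
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))

# Eq. (18): add the mask to the scaled logits before the softmax.
A = softmax(Q @ K.T / np.sqrt(d_k) + gaussian_mask(T, sigma=2.0)) @ V
print(A.shape)
```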
The quadratic computation complexity of the self-attention computation in (7) is reduced to linear in [54], where the authors propose using the dot product of kernel feature maps for the similarity calculation between speech frames, followed by the use of the associative property of matrix products.

To better incorporate long-term dependencies with Transformers, Transformer-XL [55] was proposed for machine translation. In Transformer-XL, a segment-level recurrence mechanism is introduced which enables the reuse of past encoder states (outputs of the previous layers) at training time, maintaining a longer history of contexts until they become sufficiently old. Therefore, queries at the current layer have access to the key-value pairs of the current layer as well as previous layers. Based on this concept, the Compressive Transformer [56] was proposed and applied to ASR to effectively incorporate long-term dependencies: instead of discarding older encoder states, they are preserved in a compressed form. [51] also explored sharing previous encoder states, but reused only the key vectors from previous layers.

Another Transformer-based ASR model is proposed in [57] as an adaptation of the RNN-Transducer model [79], which uses two RNN-based encoders, for audio and labels respectively, to learn the alignment between them. In [57], the audio and label encoders are designed with Transformer networks. Given the previously predicted label from the target label space, the two encoder outputs are combined by a joint network.

The vanilla Transformer and the deep Transformer models have a number of layers stacked on both the encoder and decoder sides. Each layer and its sub-layers have their own parameters, and processing them is computationally expensive. In [58], a parameter sharing approach is proposed for the Transformer network: the parameters are initialised at the first encoder and decoder layers and thereafter re-used in the other layers. If the number of encoder and decoder layers is N and the total number of parameters in each layer is M, then instead of using N × M parameters on both the encoder and decoder sides, only M parameters are used in [58]. There is a performance degradation due to sharing the parameters; to overcome it, speech attributes such as the duration of the utterance and the sex and age of the speaker are augmented with the ground-truth labels during training.

In self-attention-based Transformer models, each speech frame attends to all other speech frames of the entire sequence or within a window. However, some frames, like those representing silence, are not crucial for modelling long-range dependencies and may be present multiple times in the attended sequence; such frames should be avoided. The attention weights (or probabilities) are obtained using the softmax function, which generates non-zero probabilities, so insignificant frames are also assigned some attention weight. To overcome this, the weak-attention suppression (WAS) mechanism is proposed in [59]. WAS induces sparsity over the attention probability distribution by setting to zero the attention probabilities which are smaller than a dynamically determined threshold; the remaining non-zero probabilities are then re-normalised by passing them through a softmax function. Specifically, the threshold is determined by (19):

θ_i = m_i − γ·σ_i,    (19)

where θ_i is the threshold, and m_i and σ_i are the mean and standard deviation of the attention probabilities for the i-th frame in the query sequence. γ is a scaling factor whose best value was determined experimentally.
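A minimal sketch of WAS thresholding per (19); γ is set arbitrarily here, whereas [59] determines it experimentally.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def weak_attention_suppression(alpha, gamma=0.5):
    # Eq. (19): threshold = mean - gamma * std of the attention probabilities.
    theta = alpha.mean() - gamma * alpha.std()
    logits = np.where(alpha > theta, np.log(alpha), -np.inf)
    return softmax(logits)          # re-normalise the surviving probabilities

alpha = np.array([0.02, 0.05, 0.40, 0.30, 0.20, 0.03])
print(weak_attention_suppression(alpha))
```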
V. STREAMING SPEECH RECOGNITION

For offline speech recognition, the entire speech signal is already available before transcription starts. In a streaming environment, however, it is not possible to pass the entire speech through the encoder before prediction starts. Therefore, to transcribe streaming speech, the attention mechanism mostly focuses on a range or window of input speech frames; that is, streaming speech recognition relies on local attention. In this section, we discuss the development of attention models for streaming speech recognition. The section is divided into two sub-sections covering the RNN- and Transformer-based literature.
A. RNN-based models
In this section, we discuss the literature where the attention mechanism is applied to streaming speech recognition with RNN-based encoder-decoder models. To work with streaming speech, it is first necessary to obtain the speech frame, or set of speech frames, on which the attention mechanism will work. A Gaussian prediction-based attention mechanism is proposed in [38] for streaming speech recognition. Instead of looking at the entire set of encoder hidden states, at each decoder time step only a subset of encoder hidden states is attended, based on a Gaussian window. The centre and the size of the window at a particular decoder time step t are determined by its mean (µ_t) and variance (σ_t²), which are predicted from the previous decoder state. Specifically, the current window centre is determined by a predicted forward increment (Δµ_t) and the last window centre: µ_t = Δµ_t + µ_{t-1}. An approach different from (5) is used to calculate the similarity between the j-th encoder state (within the current window) and the i-th decoder state, given by (20):

e_{i,j} = exp(−(j − µ_t)² / (2σ_t²)).    (20)

A hard monotonic attention mechanism is proposed in [27]. Only a single encoder hidden state h_i (where i represents a decoder time step and h_i represents the only encoder state selected for the output prediction at the i-th decoder time step), the one which scores the highest similarity with the current decoder state, is selected by passing the relevant attention probabilities through a categorical function. A stochastic process is used so that encoder hidden states are attended only in the left-to-right direction. At each decoder time step, the attention mechanism starts processing from h_{i-1}, the encoder state attended at the previous decoder time step, and moves to the succeeding states. Each calculated similarity score (e_{i,j}) is sequentially passed through a logistic sigmoid function to produce a selection probability (p_{i,j}), followed by a Bernoulli distribution; once it outputs 1, the attention process stops. The last attended encoder hidden state h_i is then set as the context for the current decoder time step, i.e. c_i = h_i. Although the encoder states within the window of boundary [h_{i-1}, h_i] are processed, only a single encoder state is finally selected for the current prediction. A sketch of this selection process is given at the end of this passage.

While [27] provides linear time complexity and online speech decoding, it attends only a single encoder state for each output prediction, which may degrade performance. Therefore, monotonic chunkwise attention (MoChA) is proposed in [28], where the decoder attends small "chunks" of encoder states within a window containing a fixed number of encoder states prior to and including h_i. Due to its effectiveness, MoChA has also been used to develop an on-device commercialised ASR system [40]. To increase the effectiveness of the matching scores used to calculate the attention probabilities between the decoder state and the chunk of encoder states, multi-head monotonic chunkwise attention (MTH-MoChA) is proposed in [39]. MTH-MoChA splits the encoder and decoder hidden states into K heads (the value of K is set experimentally). For each head, matching scores, attention probabilities and context vectors are calculated to extract the dependencies between the encoder and decoder hidden states. Finally, the context vector averaged over all the heads takes part in decoding.
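Returning to the hard monotonic attention of [27], the following sketch shows the left-to-right selection process at inference time. The dot-product energy function and the fallback to the final state are our own simplifying assumptions; the original work uses a trained energy network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_monotonic_attend(H, s, start, rng):
    # Scan encoder states left to right from the last attended position.
    for j in range(start, len(H)):
        p_select = sigmoid(H[j] @ s)          # selection probability p_{i,j}
        if rng.random() < p_select:           # Bernoulli draw: stop on 1
            return j, H[j]                    # context c_i = h_j
    return len(H) - 1, H[-1]                  # assumed fallback: use the final state

rng = np.random.default_rng(5)
H = rng.normal(size=(20, 4))                  # encoder hidden states
s = rng.normal(size=4)                        # current decoder state

j, c = hard_monotonic_attend(H, s, start=7, rng=rng)  # resume from step i-1's position
print(j, c)
```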
The pronunciation rate varies among speakers, so attention calculated over a fixed chunk size may not be effective. To overcome this, an adaptive monotonic chunkwise attention (AMoChA) is proposed in [29], where the attention at the current decoder time step is computed over a window whose boundary [h_{i-1}, h_i] is computed as in [27]. Within the window, whichever encoder states yield p_{i,j} > 0.5, i.e. e_{i,j} > 0, are attended; hence the chunk size is adaptive instead of constant.

In [41], the input sequence, i.e. the encoder states of length L, is divided equally into W blocks, so each block contains B = L/W encoder states, while the last block may contain fewer than B. In this model, each block is responsible for a set of output predictions, and attention is computed over only the concerned block and not the entire set of encoder states. Once the model has finished attending all the encoder states of a block and predicting the required outputs, it emits a special symbol, <epsilon>, which marks the end of the corresponding block's processing, and the model proceeds to attend the next block. The effectiveness of this model is enhanced in [42] by extending the attention span: the attention mechanism looks not only at the current block but also at the k previous blocks (the value of k is set experimentally).

The authors of [44] identify a latency issue in streaming attention-based models. In most streaming models, the encoder states are attended based on a local window, and computing the precise boundaries of these local windows is a computationally expensive process which in turn delays the speech-to-text conversion. To overcome this issue, in [44] external hard alignments obtained from a hybrid ASR system are used for frame-wise supervision, forcing the MoChA model to learn accurate boundaries and alignments. In [80], latency is reduced by proposing a unidirectional encoder with no future dependency; since no position depends on future context, the decoder hidden states need not be re-computed every time a new input chunk arrives, and the overall delay is reduced.

In [43], an attention mechanism is incorporated into the RNN-Transducer (RNN-T) [12], [13] to make streaming speech recognition more effective and efficient. The RNN-T consists of three sections: (i) an RNN encoder which processes the input sequence into encoder hidden states; (ii) an RNN decoder, analogous to a language model, which takes the previously predicted symbol as input and outputs decoder hidden states; and (iii) a joint network which takes the encoder and decoder hidden states at the current time step and computes the output logits, from which the output symbol is predicted through a softmax layer. In [43], on the encoder side, a multi-head self-attention layer is added on top of the RNN layers to learn contextual dependencies. In addition, the joint network attends a chunk of encoder hidden states instead of only the current hidden state at each time step.

The LAS model was primarily proposed for offline speech recognition; however, it has been modified with silence modelling to work in the streaming environment in [45]. Given a streamable encoder and a suitable attention mechanism (hard monotonic, chunkwise or local window-based instead of global), the main limitation of the LAS model in a streaming environment is that a long enough silence between utterances makes the decoder believe it is the end of speech; the LAS decoder then terminates the transcription process while the speaker is still active (i.e. early stopping).
This limitationis addressed in [45] by incorporating reference silence tokensduring the training phase to supervise the model when tooutput a silence token instead of terminating the process duringthe inference phase. B. RNN-free Transformer-based models
In this section, we discuss the literature where RNN-free self-attention models are used for streaming speech recognition. The self-attention aligner [60], which is designed based on the Transformer model, proposes a chunk-hopping mechanism to support online speech recognition. A Transformer-based network requires the entire sequence before prediction starts and is hence unsuitable for online speech recognition. In [60], the entire sequence is partitioned into several overlapped chunks, each of which contains three parts: current, past and future. The speech frames (encoder states) of the current part are attended to produce the output predictions belonging to the corresponding chunk, while the past and future parts provide context for the identification of the current part. After attending a chunk, the mechanism hops to a new chunk; the number of frames hopped between two chunks equals the size of the current part of each chunk. A similar method is proposed in the augmented memory Transformer [61] where, apart from partitioning the input speech sequence, an augmented memory bank is included. The augmented memory bank carries information across the chunks, specifically by extracting key-value pairs from the projection of the concatenation of the augmented memory bank and the relevant chunk (including its past, current and future parts).

The Transformer transducer model [62] uses truncated self-attention to support streaming ASR. Instead of attending the entire speech sequence at each time step t, the truncated self-attention mechanism attends the speech frames within the window of [t − L, t + R] frames, where L and R are the frame limits to the left and right, respectively. In [62], the positional encoding of the input embedding is done by causal convolution [63] to support online ASR. In another variation of the Transformer transducer [57], the model restricts attention to the left of the current frame only, by masking the attention scores to the right of the current frame; the attention span is further restricted by attending the frames within a fixed-size window at each time step.

A chunk-flow mechanism is proposed in [64] to support streaming speech recognition in a self-attention-based transducer model. The chunk-flow mechanism restricts the span of self-attention to a fixed-length chunk instead of the whole input sequence; the chunk proceeds along time over the input sequence. Not attending the entire input sequence may degrade performance; however, performance is kept satisfactory by using multiple self-attention heads to model longer dependencies. The chunk-flow mechanism at time t for attention head h_i is given by (21):

h_{i,t} = Σ_{τ=t−N_l}^{t+N_r} α_{i,τ} s_τ,    (21)

where N_l and N_r denote the number of speech frames to the left and right of the current time t; they determine the chunk span and are experimentally chosen as 20 and 10, respectively. s_τ represents the τ-th vector in the input sequence, and α_{i,τ} = Attention(s_τ, K, V) with K = V = chunk_τ.

A streaming-friendly self-attention mechanism named time-restricted self-attention is proposed in [65]. It restricts the speech frame at the current time step to attend only a fixed number of frames to its left and right, and thus does not allow each speech frame to attend all other speech frames. Experimentally, these numbers are set to 15 and 6 for the left and right sides, respectively.
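A sketch of a time-restricted (left/right-limited) self-attention mask in the spirit of [65] follows; the mask construction is generic, and only the limits 15 and 6 are taken from the text.

```python
import numpy as np

def time_restricted_mask(T, left=15, right=6):
    # mask[t, tau] is True iff frame tau lies within [t - left, t + right],
    # so each frame can only attend a bounded local neighbourhood.
    t = np.arange(T)
    return (t[None, :] >= t[:, None] - left) & (t[None, :] <= t[:, None] + right)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(6)
T, d_k = 40, 8
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))

logits = Q @ K.T / np.sqrt(d_k)
logits[~time_restricted_mask(T)] = -np.inf       # forbid out-of-window attention
out = softmax(logits) @ V
print(out.shape)
```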
Similarly, in [52], each Transformer layer is restricted to attend a fixed, limited right context during inference. A special position embedding approach is also proposed: a one-hot encoded vector is added to the value vectors. The one-hot vector consists of all zeros except a single one corresponding to the attending time step with respect to all the time steps in the current attention span. This mechanism is also used on the encoder side of the streaming Transformer model [66].

The synchronous Transformer [67] is proposed to support streamable speech recognition using the self-attention mechanism, overcoming the requirement of processing all speech frames before decoding starts. While calculating the self-attention, every speech frame is restricted to process only the frames to its left, ignoring the right side. Also, at decoding time, the encoded speech frames are processed chunk-wise: they are divided into overlapped chunks to maintain a smooth transition of information between chunks. At each decoder time step, the decoder predicts an output based on the last predicted output and the attention calculated over the frames belonging to a single chunk, thereby avoiding attending the entire speech sequence.

To make the Transformer streamable, a chunk self-attention encoder and a monotonic truncated attention-based self-attention decoder are proposed in [68]. On the encoder side, the input speech is split into isolated chunks of fixed length, inspired by MoChA. On the decoder side, the encoder-decoder attention mechanism [20] is replaced by truncated attention [69]: the encoder embedding is truncated in a monotonic left-to-right manner and attention is then applied over the truncated outputs. After that, the model is optimised by the online joint CTC-attention method [69].

Monotonic multi-head attention (MMA) is proposed in [81] to enable online decoding in the Transformer network by replacing each encoder-decoder attention head with a monotonic attention (MA) head. Each MA head needs to be activated to predict an output symbol; if any MA head fails, or is delayed, in learning alignments, it causes delay during inference. The authors of [70] found that only a few (dominant) MA heads learn alignments effectively while the others do not. To prevent this and to let each head learn alignments effectively, HeadDrop regularisation is proposed: a part of the heads is entirely masked at random, forcing the remaining non-masked heads to learn the alignment effectively. In addition, redundant MA heads in the lower layers are pruned to further improve the teamwork among the attention heads. Since MA is a hard attention, chunkwise attention is applied on top of each MA head to enhance the quality of the context information.
VI. CONCLUSION

In this survey, we have presented how different types of attention models have been successfully applied to build automatic speech recognition systems. We have discussed various approaches to deploying attention models within the RNN-based encoder-decoder framework. We have also discussed how self-attention replaces the need for recurrence and can build effective and efficient ASR models. Speech recognition can be performed offline as well as online, and in this paper we have discussed various aspects of both offline and online ASR development.
EFERENCES[1] F. Jelinek, “Continuous speech recognition by statistical methods,”
Proceedings of the IEEE , vol. 64, no. 4, pp. 532–556, 1976.[2] L. Muda, B. KM, and I. Elamvazuthi, “Voice recognition algorithmsusing mel frequency cepstral coefficient (mfcc) and dynamic timewarping (dtw) techniques,”
Journal of Computing , vol. 2, no. 3, pp.138–143, 2010.[3] M. Gales and S. Young,
The application of hidden Markov models inspeech recognition . Now Publishers Inc, 2008.[4] T. Kudo, K. Yamamoto, and Y. Matsumoto, “Applying conditionalrandom fields to japanese morphological analysis,” in
Proceedingsof the 2004 Conference on Empirical Methods in Natural LanguageProcessing , 2004, pp. 230–237.[5] S. Bird, “Nltk: The natural language toolkit,” in
COLING• ACL 2006 .Citeseer, 2006, p. 69.[6] A.-r. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks forphone recognition,” in
NIPS Workshop on Deep Learning for SpeechRecognition and Related Applications , vol. 1, no. 9. Vancouver, Canada,2009, p. 39.[7] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly,A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al. , “Deep neuralnetworks for acoustic modeling in speech recognition: The shared viewsof four research groups,”
IEEE Signal Processing Magazine , vol. 29,no. 6, pp. 82–97, 2012.[8] A. Graves, N. Jaitly, and A.-r. Mohamed, “Hybrid speech recognitionwith deep bidirectional lstm,” in . IEEE, 2013, pp. 273–278.[9] A. Graves, S. Fern´andez, F. Gomez, and J. Schmidhuber, “Connection-ist temporal classification: labelling unsegmented sequence data withrecurrent neural networks,” in
Proceedings of the 23rd InternationalConference on Machine Learning , 2006, pp. 369–376.[10] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, “End-to-endcontinuous speech recognition using attention-based recurrent nn: Firstresults,” in
NIPS 2014 Workshop on Deep Learning, December 2014 ,2014.[11] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio,“Attention-based models for speech recognition,” in
Advances in NeuralInformation Processing Systems , 2015, pp. 577–585.[12] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711 , 2012.[13] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition withdeep recurrent neural networks,” in . IEEE, 2013, pp. 6645–6649.[14] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
NeuralComputation , vol. 9, no. 8, pp. 1735–1780, 1997. [15] K. Cho, B. van Merrienboer, C¸ . G¨ulc¸ehre, D. Bahdanau, F. Bougares,H. Schwenk, and Y. Bengio, “Learning phrase representations using rnnencoder-decoder for statistical machine translation,” in
EMNLP , 2014.[16] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural net-works,”
IEEE Transactions on Signal Processing , vol. 45, no. 11, pp.2673–2681, 1997.[17] A. Graves, S. Fern´andez, and J. Schmidhuber, “Bidirectional lstmnetworks for improved phoneme classification and recognition,” in
International Conference on Artificial Neural Networks . Springer, 2005,pp. 799–804.[18] Y. Zhang, W. Chan, and N. Jaitly, “Very deep convolutional networks forend-to-end speech recognition,” in , 2017, pp. 4845–4849.[19] Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent, Y. Bengio,and A. Courville, “Towards end-to-end speech recognition with deepconvolutional neural networks,”
Interspeech 2016 , pp. 410–414, 2016.[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in
Advancesin Neural Information Processing Systems , 2017, pp. 5998–6008.[21] S. Chaudhari, G. Polatkan, R. Ramanath, and V. Mithal, “An attentivesurvey of attention models,” arXiv preprint arXiv:1904.02874 , 2019.[22] A. Galassi, M. Lippi, and P. Torroni, “Attention in natural languageprocessing,”
IEEE Transactions on Neural Networks and LearningSystems , 2020.[23] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio,“End-to-end attention-based large vocabulary speech recognition,” in . IEEE, 2016, pp. 4945–4949.[24] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: Aneural network for large vocabulary conversational speech recognition,”in . IEEE, 2016, pp. 4960–4964.[25] Z. Meng, Y. Gaur, J. Li, and Y. Gong, “Character-aware attention-based end-to-end speech recognition,” in . IEEE, 2019, pp.949–955.[26] L. Dong, S. Xu, and B. Xu, “Speech-transformer: a no-recurrencesequence-to-sequence model for speech recognition,” in . IEEE, 2018, pp. 5884–5888.[27] C. Raffel, M.-T. Luong, P. J. Liu, R. J. Weiss, and D. Eck, “Online andlinear-time attention by enforcing monotonic alignments,” in
[27] C. Raffel, M.-T. Luong, P. J. Liu, R. J. Weiss, and D. Eck, “Online and linear-time attention by enforcing monotonic alignments,” in International Conference on Machine Learning, 2017, pp. 2837–2846.
[28] C.-C. Chiu and C. Raffel, “Monotonic chunkwise attention,” in International Conference on Learning Representations, 2018.
[29] R. Fan, P. Zhou, W. Chen, J. Jia, and G. Liu, “An online attention-based model for speech recognition,” Proc. Interspeech 2019, pp. 4390–4394, 2019.
[30] A. Zeyer, K. Irie, R. Schlüter, and H. Ney, “Improved training of end-to-end attention models for speech recognition,” Proc. Interspeech 2018, pp. 7–11, 2018.
[31] W. Chan and I. Lane, “On online attention-based speech recognition and joint Mandarin character-pinyin training,” in Interspeech, 2016, pp. 3404–3408.
[32] A. Tjandra, S. Sakti, and S. Nakamura, “Local monotonic attention mechanism for end-to-end speech and language processing,” in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2017, pp. 431–440.
[33] S. Kim, T. Hori, and S. Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4835–4839.
[34] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[35] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,” Proc. Interspeech 2017, pp. 949–953, 2017.
[36] S. Watanabe, T. Hori, and J. R. Hershey, “Language independent end-to-end architecture for joint language identification and speech recognition,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 265–271.
[37] N. Moritz, T. Hori, and J. Le Roux, “Triggered attention for end-to-end speech recognition,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5666–5670.
[38] J. Hou, S. Zhang, and L.-R. Dai, “Gaussian prediction based attention for online end-to-end speech recognition,” in Interspeech, 2017, pp. 3692–3696.
[39] B. Liu, S. Cao, S. Sun, W. Zhang, and L. Ma, “Multi-head monotonic chunkwise attention for online speech recognition,” arXiv preprint arXiv:2005.00205, 2020.
[40] K. Kim, K. Lee, D. Gowda, J. Park, S. Kim, S. Jin, Y.-Y. Lee, J. Yeo, D. Kim, S. Jung et al., “Attention based on-device streaming speech recognition with large speech corpus,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 956–963.
[41] N. Jaitly, Q. V. Le, O. Vinyals, I. Sutskever, D. Sussillo, and S. Bengio, “An online sequence-to-sequence model using partial conditioning,” in Advances in Neural Information Processing Systems, 2016, pp. 5067–5075.
[42] T. N. Sainath, C.-C. Chiu, R. Prabhavalkar, A. Kannan, Y. Wu, P. Nguyen, and Z. Chen, “Improving the performance of online neural transducer models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5864–5868.
[43] B. Wang, Y. Yin, and H. Lin, “Attention-based transducer for online speech recognition,” arXiv preprint arXiv:2005.08497, 2020.
[44] H. Inaguma, Y. Gaur, L. Lu, J. Li, and Y. Gong, “Minimum latency training strategies for streaming sequence-to-sequence ASR,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6064–6068.
[45] R. Hsiao, D. Can, T. Ng, R. Travadi, and A. Ghoshal, “Online automatic speech recognition with listen, attend and spell model,” IEEE Signal Processing Letters, vol. 27, pp. 1889–1893, 2020.
[46] S. Zhou, L. Dong, S. Xu, and B. Xu, “Syllable-based sequence-to-sequence speech recognition with the Transformer in Mandarin Chinese,” Proc. Interspeech 2018, pp. 791–795, 2018.
[47] S. Zhou, L. Dong, S. Xu, and B. Xu, “A comparison of modeling units in sequence-to-sequence speech recognition with the Transformer on Mandarin Chinese,” in International Conference on Neural Information Processing. Springer, 2018, pp. 210–220.
[48] J. Li, X. Wang, Y. Li et al., “The SpeechTransformer for large-scale Mandarin Chinese speech recognition,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7095–7099.
[49] N.-Q. Pham, T.-S. Nguyen, J. Niehues, M. Müller, S. Stüker, and A. Waibel, “Very deep self-attention networks for end-to-end speech recognition,” Proc. Interspeech 2019, pp. 66–70, 2019.
[50] K. Irie, A. Zeyer, R. Schlüter, and H. Ney, “Language modeling with deep Transformers,” Proc. Interspeech 2019, pp. 3905–3909, 2019.
[51] K. Irie, A. Gerstenberger, R. Schlüter, and H. Ney, “How much self-attention do we need? Trading attention for feed-forward layers,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6154–6158.
[52] Y. Wang, A. Mohamed, D. Le, C. Liu, A. Xiao, J. Mahadeokar, H. Huang, A. Tjandra, X. Zhang, F. Zhang et al., “Transformer-based acoustic modeling for hybrid speech recognition,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6874–6878.
[53] M. Sperber, J. Niehues, G. Neubig, S. Stüker, and A. Waibel, “Self-attentional acoustic models,” Proc. Interspeech 2018, pp. 3723–3727, 2018.
[54] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are RNNs: Fast autoregressive transformers with linear attention,” arXiv preprint arXiv:2006.16236, 2020.
[55] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. Le, and R. Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 2978–2988.
[56] J. W. Rae, A. Potapenko, S. M. Jayakumar, C. Hillier, and T. P. Lillicrap, “Compressive transformers for long-range sequence modelling,” in International Conference on Learning Representations, 2019.
[57] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar, “Transformer transducer: A streamable speech recognition model with Transformer encoders and RNN-T loss,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7829–7833.
[58] S. Li, D. Raj, X. Lu, P. Shen, T. Kawahara, and H. Kawai, “Improving Transformer-based speech recognition systems with compressed structure and speech attributes augmentation,” in Interspeech, 2019, pp. 4400–4404.
[59] Y. Shi, Y. Wang, C. Wu, C. Fuegen, F. Zhang, D. Le, C.-F. Yeh, and M. L. Seltzer, “Weak-attention suppression for Transformer based speech recognition,” arXiv preprint, 2020.
[60] L. Dong, F. Wang, and B. Xu, “Self-attention aligner: A latency-control end-to-end model for ASR using self-attention network and chunk-hopping,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5656–5660.
[61] C. Wu, Y. Wang, Y. Shi, C.-F. Yeh, and F. Zhang, “Streaming Transformer-based acoustic models using self-attention with augmented memory,” arXiv preprint, 2020.
[62] C.-F. Yeh, J. Mahadeokar, K. Kalgaonkar, Y. Wang, D. Le, M. Jain, K. Schubert, C. Fuegen, and M. L. Seltzer, “Transformer-transducer: End-to-end speech recognition with self-attention,” arXiv preprint, 2019.
[63] A. Mohamed, D. Okhonko, and L. Zettlemoyer, “Transformers with convolutional context for ASR,” arXiv preprint, 2019.
[64] Z. Tian, J. Yi, J. Tao, Y. Bai, and Z. Wen, “Self-attention transducers for end-to-end speech recognition,” Proc. Interspeech 2019, pp. 4395–4399, 2019.
[65] D. Povey, H. Hadian, P. Ghahremani, K. Li, and S. Khudanpur, “A time-restricted self-attention layer for ASR,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5874–5878.
[66] N. Moritz, T. Hori, and J. Le Roux, “Streaming automatic speech recognition with the Transformer model,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6074–6078.
[67] Z. Tian, J. Yi, Y. Bai, J. Tao, S. Zhang, and Z. Wen, “Synchronous Transformers for end-to-end speech recognition,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7884–7888.
[68] H. Miao, G. Cheng, C. Gao, P. Zhang, and Y. Yan, “Transformer-based online CTC/attention end-to-end speech recognition architecture,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6084–6088.
[69] H. Miao, G. Cheng, P. Zhang, T. Li, and Y. Yan, “Online hybrid CTC/attention architecture for end-to-end speech recognition,” in Interspeech, 2019.
[70] H. Inaguma, M. Mimura, and T. Kawahara, “Enhancing monotonic multihead attention for streaming ASR,” Proc. Interspeech 2020, pp. 2137–2141, 2020.
[71] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in International Conference on Learning Representations, 2015.
[72] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[73] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[74] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in EMNLP, 2015.
[75] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep Speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
[76] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning, 2014, pp. 1764–1772.
[77] A. Tjandra, C. Liu, F. Zhang, X. Zhang, Y. Wang, G. Synnaeve, S. Nakamura, and G. Zweig, “Deja-vu: Double feature presentation in deep transformer networks,” arXiv preprint arXiv:1910.10324, 2019.
[78] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv preprint arXiv:1606.08415, 2016.
[79] K. Rao, H. Sak, and R. Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 193–199.
[80] D. Liu, G. Spanakis, and J. Niehues, “Low-latency sequence-to-sequence speech recognition and translation by partial hypothesis selection,” arXiv preprint arXiv:2005.11185, 2020.
[81] X. Ma, J. M. Pino, J. Cross, L. Puzon, and J. Gu, “Monotonic multihead attention,” in