Gated Recurrent Context: Softmax-free Attention for Online Encoder-Decoder Speech Recognition
Hyeonseung Lee, Woo Hyun Kang, Sung Jun Cheon, Hyeongju Kim, Nam Soo Kim
Abstract—Recently, attention-based encoder-decoder (AED) models have shown state-of-the-art performance in automatic speech recognition (ASR). As the original AED models with global attentions are not capable of online inference, various online attention schemes have been developed to reduce ASR latency for a better user experience. However, a common limitation of the conventional softmax-based online attention approaches is that they introduce an additional hyperparameter related to the length of the attention window, requiring multiple trials of model training for tuning the hyperparameter. In order to deal with this problem, we propose a novel softmax-free attention method and its modified formulation for online attention, which does not need any additional hyperparameter at the training phase. Through a number of ASR experiments, we demonstrate that the tradeoff between the latency and performance of the proposed online attention technique can be controlled by merely adjusting a threshold at the test phase. Furthermore, the proposed methods showed competitive performance to the conventional global and online attentions in terms of word error rates (WERs).
Index Terms—automatic speech recognition, online speech recognition, attention-based encoder-decoder model
I. INTRODUCTION
In the last few years, the performance of deep learning-based end-to-end automatic speech recognition (ASR) systems has improved significantly through numerous studies, mostly on the architecture designs and training schemes of neural networks (NNs). Among many end-to-end ASR systems, attention-based encoder-decoder (AED) models [1], [2] have shown better performance than the others, such as connectionist temporal classification (CTC) [3] and the recurrent neural network transducer (RNN-T) [4], and have even outperformed the conventional DNN-hidden Markov model (HMM) hybrid systems in case a large training set of transcribed speech is available [5]. Such successful results of AED models come from the tightly integrated language modeling capability of the label-synchronous decoder, supported by the attention mechanism that provides proper acoustic information at each step [6].

A major drawback of the conventional AED models is that they cannot infer the ASR output in an online fashion, which degrades the user experience due to the large latency [7]. This problem is mainly caused by the following aspects of the AED models.
The authors are with the Institute of New Media and Communications, Department of Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Firstly, the encoders of most high-performance AED models make use of layers with global receptive fields, such as bidirectional long short-term memory (BiLSTM) or self-attention layers. More importantly, a conventional global attention mechanism (e.g., Bahdanau attention) considers the entire utterance to obtain the attention context vector at every step. The former issue can be solved by replacing the global-receptive encoder with an online encoder, where the encoded representation for a particular frame depends on only a limited number of future frames. The online encoder can be built straightforwardly by employing layers with a finite future receptive field, such as latency-controlled BiLSTM (LC-BiLSTM) [8] and masked self-attention layers. However, reformulating the global attention methods for an online purpose is still a challenging problem.

Conventional techniques for online attention are usually two-step approaches where the window (i.e., chunk) for the current attention is determined first at each decoder step, and then the attention weights are calculated using the softmax function defined over the window. As the softmax function is a common solution for representing discrete probability distributions (i.e., attention weights), existing online attentions mainly differ in how they determine the window. Neural transducers [9], [10] divide an encoded sequence into multiple chunks with a fixed length, and the attention-decoder produces an output sequence for each input chunk. In the windowed attention techniques [11]–[13], the position of each fixed-size window is decided by a position prediction algorithm. The window position is monotonically increasing in time, and some approaches employ a trainable position prediction model with a fixed Gaussian window. In MoChA-based approaches [14]–[16], a fixed-size chunk is obtained using a monotonic endpoint prediction model, which is jointly trained considering all possible chunk endpoints.

A common limitation of the aforementioned approaches is that the fixed length of the window needs to be tuned according to the training data. Merely choosing a large window of a constant size causes a large latency, while setting the window size too small results in degraded performance. Therefore, multiple trials of model training are required to find a proper value of the window length, consuming excessive computational resources. Moreover, the trained model is not guaranteed to perform well on an unseen test set, since the window size is fixed for all datasets.

Although a few variants of MoChA utilize an adaptive window length to remove the need for tuning the window size, such variants induce other problems. MAtChA [14] regards the previous endpoint as the beginning of the current chunk. Occasionally, the window can be too short to contain enough speech content when two consecutive endpoints are too close, which may degrade the performance. AMoChA [17] employs an auxiliary model that predicts the chunk size but also introduces an additional loss term for the prediction model. As the coefficient for the new loss needs tuning, AMoChA still requires repeated training sessions. Besides, several recent approaches [18], [19] utilize strictly monotonic windows.
However, these methods have a limitation in that the decoder state is not used for determining the window, which means such algorithms might not fully exploit the advantage of AED models, i.e., the inherent capability of autoregressive language modeling.

The aforementioned inefficiency in training the conventional online attentions is essentially caused by the fact that the softmax function needs a predetermined attention window to obtain the attention weights, which results in a repetitive tuning process for the window-related hyperparameter. To overcome this drawback, we propose a novel softmax-free global attention method called gated recurrent context (GRC), inspired by the gate-based update in the gated recurrent unit (GRU) [20]. Whereas conventional attentions are based on a kernel smoother (e.g., the softmax function) [21], [22], GRC obtains an attention context vector by recursively aggregating the encoded vectors in a time-synchronous manner, using update gates. GRC can be reformulated for the purpose of online attention, which we refer to as decreasing GRC (DecGRC), where the update gates are constrained to be decreasing over time. DecGRC is capable of deciding the attention endpoint by thresholding the update gate values at the inference phase. DecGRC, as well as GRC, introduces no hyperparameter to be tuned at the training phase.

The main contributions of this paper can be summarized as follows:
• We propose a novel softmax-free attention method called Gated Recurrent Context (GRC). To the best of our knowledge, GRC is the first attention method that represents discrete probability distributions without a kernel smoother.
• We present a novel online attention method, Decreasing GRC (DecGRC), a constrained variant of GRC. DecGRC does not need any new hyperparameter to be tuned at the training phase. At test time, the tradeoff between performance and latency can be adjusted using a simple thresholding technique.

• We experimentally show that GRC and DecGRC perform competitively with the conventional global and online attention methods on the LibriSpeech test set.

The remainder of this paper is organized as follows. In Section II, the general framework of attention-based encoder-decoder ASR is formally described, followed by conventional online attention methods and their common limitation. Section III proposes the formulations of both GRC and DecGRC and the algorithm for online inference. The experimental results with various attention methods are given in Section IV. Conclusions are presented in Section V.
II. BACKGROUND
A. Attention-based Encoder-Decoder for ASR
An attention-based encoder-decoder model consists of two sub-modules,
Encoder(·) and AttentionDecoder(·), and it predicts the posterior probability of the output transcription given the input speech features as follows:

    h = \mathrm{Encoder}(x), \quad P(y \mid x) = \mathrm{AttentionDecoder}(h, y),

where x = [x_1, x_2, \ldots, x_{T_{in}}] and h = [h_1, h_2, \ldots, h_T] are sequences of input speech features and encoded vectors respectively, and y = [y_1, y_2, \ldots, y_U] is a sequence of output text units. Either the start or the end of the text is considered as one of the text units.

In general, Encoder(·) reduces its output length T to be smaller than the input length T_{in}, cutting down the memory and computational footprint. A global Encoder(·) is implemented with NN layers having powerful sequence modeling capacity, e.g., BiLSTM or self-attention layers with subsampling layers. On the other hand, an online Encoder(·) must consist only of layers with a finite future receptive field.

AttentionDecoder(·) operates at each output step recursively, emitting an estimated posterior probability over all possible text units given the outputs produced at the previous steps. This procedure can be summarized as follows:

    s_u = \mathrm{RecurrentState}(s_{u-1}, y_{u-1}, c_{u-1}),
    c_u = \mathrm{AttentionContext}(s_u, h),    (1)
    P(y_u \mid y_{<u}, x) = \mathrm{ReadOut}(s_u, y_{u-1}, c_u),

where s_u is the decoder state and c_u is the attention context vector at the u-th output step. In conventional global soft attention (GSA), the attention context vector is a weighted average of the encoded vectors,

    c_u = \sum_{t=1}^{T} \alpha_{u,t} h_t,    (2)

where the attention weights are obtained with the softmax function over the entire encoded sequence,

    \alpha_{u,t} = \frac{\exp(e_{u,t})}{\sum_{j=1}^{T} \exp(e_{u,j})},    (3)

    e_{u,t} = \mathrm{Score}(s_u, h_t, \alpha_{u-1,t}).    (4)
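For concreteness, the following minimal numpy sketch illustrates one GSA decoder step according to Eqs. (2)-(4). The additive score function and all weight names here are illustrative stand-ins (not the paper's exact parameterization of Score(·)), and the shapes are toy values.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def gsa_step(s_u, h, alpha_prev, W_s, W_h, w_a, v):
    """One global-soft-attention step, Eqs. (2)-(4).

    s_u        : decoder state, shape (d_s,)
    h          : encoded vectors, shape (T, d_h)
    alpha_prev : previous attention weights alpha_{u-1}, shape (T,)
    The additive score below is a simplified stand-in for Score(.).
    """
    # Eq. (4): one scalar score per encoder frame (simplified additive form)
    e_u = np.tanh(h @ W_h + s_u @ W_s + np.outer(alpha_prev, w_a)) @ v   # (T,)
    # Eq. (3): softmax over the *entire* encoded sequence (hence not online)
    alpha_u = softmax(e_u)                                               # (T,)
    # Eq. (2): attention context as the weighted sum of encoded vectors
    c_u = alpha_u @ h                                                    # (d_h,)
    return c_u, alpha_u

# toy usage
T, d_h, d_s, d_a = 7, 4, 3, 5
rng = np.random.default_rng(0)
h = rng.normal(size=(T, d_h))
c_u, alpha_u = gsa_step(rng.normal(size=d_s), h, np.full(T, 1.0 / T),
                        rng.normal(size=(d_s, d_a)), rng.normal(size=(d_h, d_a)),
                        rng.normal(size=d_a), rng.normal(size=d_a))
assert np.isclose(alpha_u.sum(), 1.0)
```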
To achieve online attention, the context vector c_u in Eq. (2) must have local dependency on the encoded vectors h. Windowed attention and MoChA are widely used online attention methods that show high performance; in both, only the AttentionContext(·) function in Eq. (1) is modified within the general framework.
1) Windowed attention
Among various formulations of windowed attention, a simple heuristic using argmax for window boundary prediction [13] has shown the best performance. This method can be described as follows:

    p_1 = 0, \quad p_u = \operatorname*{argmax}_{t}\,(\alpha_{u-1,t}), \;\; 1 \le t \le T,    (5)

    c_u = \sum_{t=p_u}^{p_u+w-1} \alpha_{u,t} h_t, \quad \alpha_{u,t} = \frac{\exp(e_{u,t})}{\sum_{j=p_u}^{p_u+w-1} \exp(e_{u,j})},    (6)

where p_u is the start point of the attention window at the u-th step, and w is the window size. The windowed attention is online, as the attention context c_u derived through Eqs. (5)-(6) does not depend on the entire encoded vector sequence h. The tradeoff between performance and latency of windowed attention relies on the window length w.
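A rough numpy sketch of Eqs. (5)-(6), assuming the scores e_{u,t} are already computed, might look as follows; note that the window size w enters as an explicit argument, which is exactly the hyperparameter discussed above.

```python
import numpy as np

def windowed_attention_step(e_u, alpha_prev, h, w):
    """Windowed attention step per Eqs. (5)-(6).

    e_u        : precomputed scores e_{u,t}, shape (T,)
    alpha_prev : previous-step attention weights alpha_{u-1}, shape (T,)
    h          : encoded vectors, shape (T, d_h)
    w          : window size (the hyperparameter that must be tuned)
    """
    T = h.shape[0]
    p_u = int(np.argmax(alpha_prev))          # Eq. (5): window start from previous weights
    end = min(p_u + w, T)
    scores = e_u[p_u:end]
    alpha_win = np.exp(scores - scores.max())
    alpha_win /= alpha_win.sum()              # Eq. (6): softmax restricted to the window
    alpha_u = np.zeros(T)
    alpha_u[p_u:end] = alpha_win
    c_u = alpha_u @ h                         # context depends only on frames inside the window
    return c_u, alpha_u
```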
2) MoChA
In MoChA [14], an attention window endpoint is first decided, followed by attention weight calculation within a fixed-size window as follows:

    c_u = \sum_{t=\tau_u - w + 1}^{\tau_u} \beta_{u,t} h_t,    (7)

    \beta_{u,t} = \frac{\exp(e_{u,t})}{\sum_{j=\tau_u - w + 1}^{\tau_u} \exp(e_{u,j})},    (8)

    \tau_u = \mathrm{MonotonicEndpoint}(\tilde{e}_{u, t \ge \tau_{u-1}}), \quad \tilde{e}_{u,t} = \mathrm{MonotonicScore}(s_u, h_t, \alpha_{u-1,t}),    (9)

where \tau_u is the attention endpoint at the u-th step, w is the window (chunk) size, and \tilde{e}_{u,t} is a monotonic score. Since the hard endpoint decision is not differentiable, at the training phase the endpoint decision is relaxed into selection probabilities p_{u,t} = \sigma(\tilde{e}_{u,t}), and the expected attention weights

    \alpha_{u,t} = p_{u,t} \sum_{j=1}^{t} \Big( \alpha_{u-1,j} \prod_{k=j}^{t-1} (1 - p_{u,k}) \Big)    (10)

are used instead, so that the endpoint prediction model is jointly trained considering all possible chunk endpoints [14].
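As a rough illustration of the test-time behavior of Eqs. (7)-(9) (not the paper's implementation), the sketch below assumes the scores are precomputed and uses a simple sigmoid-thresholding rule as a stand-in for MonotonicEndpoint(·); the 0.5 threshold and the fallback to the last frame are assumptions for this sketch only.

```python
import numpy as np

def mocha_test_step(e_u, e_tilde_u, h, tau_prev, w):
    """Test-time MoChA step per Eqs. (7)-(9).

    e_u       : chunkwise scores e_{u,t}, shape (T,)
    e_tilde_u : monotonic scores, shape (T,)
    tau_prev  : endpoint chosen at the previous decoder step
    w         : fixed chunk (window) size
    """
    T = h.shape[0]
    # Stand-in for MonotonicEndpoint(.): first frame at or after tau_prev whose
    # selection probability sigmoid(e_tilde) exceeds 0.5; otherwise the last frame.
    tau_u = T - 1
    for t in range(tau_prev, T):
        if 1.0 / (1.0 + np.exp(-e_tilde_u[t])) > 0.5:
            tau_u = t
            break
    start = max(tau_u - w + 1, 0)
    scores = e_u[start:tau_u + 1]
    beta = np.exp(scores - scores.max())
    beta /= beta.sum()                    # Eq. (8): softmax inside the fixed-size chunk
    c_u = beta @ h[start:tau_u + 1]       # Eq. (7): chunkwise context
    return c_u, tau_u
```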
3) A limitation of the conventional methods
As mentioned in Sec. I, the softmax function in the conventional online attentions (e.g., Eqs. (6)-(8)) requires a predetermined attention window, which induces a limitation in training efficiency, since multiple trials of training are inevitable for tuning either the window length or the coefficient of an additional loss term. To overcome this limitation, in the next section we propose a novel softmax-free global attention approach and its online version, which are free from hyperparameter tuning during training.
III. PROPOSED METHODS
A. Gated Recurrent Context (GRC)
We propose a novel softmax-free global attention method called GRC, which recursively aggregates the information of the encoded sequence into an attention context vector in a time-synchronous manner. Specifically, the following formulas are employed in place of Eqs. (2)-(4):

    c_u = d_{u,T},    (11)

    d_{u,t} = (1 - z_{u,t}) d_{u,t-1} + z_{u,t} h_t,    (12)

    z_{u,1} = 1, \quad z_{u,t} = \sigma(e_{u,t}) = \frac{1}{1 + \exp(-e_{u,t})} \;\; \text{for } t = 2, \ldots, T,    (13)

    e_{u,t} = \mathrm{Score}(s_u, h_t, \alpha_{u-1,t}) + b,    (14)

where d_{u,t} is an intermediate context vector, z_{u,t} is an update gate, and b is a trainable scalar bias.
1) Relation to GSA
The update gate z_{u,t} of GRC in Eq. (13) and the attention weight \alpha_{u,t} of GSA in Eq. (3) have a one-to-one correspondence according to the following theorem:

Theorem 1 (GRC-GSA duality). For arbitrary n \in \mathbb{N}, let Z_n = \{x \in \mathbb{R}^n \mid x_1 = 1, \; 0 \le x_j \le 1 \text{ for } j = 2, 3, \ldots, n\} and A_n = \{x \in \mathbb{R}^n \mid \sum_{j=1}^{n} x_j = 1, \; 0 \le x_j \le 1 \text{ for } j = 1, 2, \ldots, n\}. There exists a bijective function \bar{\alpha} : Z_T \to A_T such that for any h = [h_1, h_2, \ldots, h_T] and z_u = [z_{u,1}, z_{u,2}, \ldots, z_{u,T}], the following holds:

    d_{u,T} = \sum_{t=1}^{T} \bar{\alpha}(z_u)_t \, h_t,

where \bar{\alpha}(z_u)_t denotes the t-th element of \bar{\alpha}(z_u), and d_{u,T} is obtained from z_u and h according to Eq. (12).

Proof. Using the recursion in Eq. (12),

    d_{u,T} = (1 - z_{u,T}) d_{u,T-1} + z_{u,T} h_T
            = (1 - z_{u,T})(1 - z_{u,T-1}) d_{u,T-2} + (1 - z_{u,T}) z_{u,T-1} h_{T-1} + z_{u,T} h_T
            = \ldots
            = \sum_{t=1}^{T} \Big( \prod_{j=t+1}^{T} (1 - z_{u,j}) \Big) z_{u,t} h_t.

Therefore the function \bar{\alpha} that satisfies the equation in Thm. 1 is given by

    \bar{\alpha}(z_u)_t := z_{u,t} \prod_{j=t+1}^{T} (1 - z_{u,j}) \quad \text{for } t = 1, 2, \ldots, T.    (15)

Given that z_u \in Z_T, the output \bar{\alpha}(z_u) is an element of A_T: it is trivial to show that 0 \le \bar{\alpha}(z_u)_t \le 1 for t = 1, 2, \ldots, T, and \sum_{t=1}^{T} \bar{\alpha}(z_u)_t = 1 also holds as follows:

    \sum_{t=1}^{T} \bar{\alpha}(z_u)_t
      = \sum_{t=1}^{T} z_{u,t} \prod_{j=t+1}^{T} (1 - z_{u,j})
      = \sum_{t=2}^{T} z_{u,t} \prod_{j=t+1}^{T} (1 - z_{u,j}) + \prod_{j=2}^{T} (1 - z_{u,j})
      = \sum_{t=3}^{T} z_{u,t} \prod_{j=t+1}^{T} (1 - z_{u,j}) + \prod_{j=3}^{T} (1 - z_{u,j})
      = \ldots
      = z_{u,T} + (1 - z_{u,T}) = 1,

where the second equality uses z_{u,1} = 1. The function \bar{\alpha} is bijective since its inverse mapping exists as follows:

    z_{u,T} = \bar{\alpha}(z_u)_T,
    z_{u,T-1} = \frac{\bar{\alpha}(z_u)_{T-1}}{1 - z_{u,T}} = \frac{\bar{\alpha}(z_u)_{T-1}}{1 - \bar{\alpha}(z_u)_T} \text{ if } \bar{\alpha}(z_u)_T < 1; \quad 1 \text{ otherwise},
    z_{u,T-2} = \frac{\bar{\alpha}(z_u)_{T-2}}{1 - \bar{\alpha}(z_u)_{T-1} - \bar{\alpha}(z_u)_T} \text{ if } \sum_{j=T-1}^{T} \bar{\alpha}(z_u)_j < 1; \quad 1 \text{ otherwise},
    \vdots
    \therefore \; z_{u,t} = \frac{\bar{\alpha}(z_u)_t}{1 - \sum_{j=t+1}^{T} \bar{\alpha}(z_u)_j} \text{ if } \sum_{j=t+1}^{T} \bar{\alpha}(z_u)_j < 1; \quad 1 \text{ otherwise},

for t = 1, 2, \ldots, T. It is also trivial to show that z_{u,1} = 1 and 0 \le z_{u,t} \le 1 for t = 2, \ldots, T, given that \bar{\alpha}(z_u) \in A_T. Therefore, z_u \in Z_T. ∎

Note that \bar{\alpha}(z_u)_t in Eq. (15) corresponds to the attention weight \alpha_{u,t} in Eq. (3) of GSA. By Thm. 1, the attention context vector c_u of GRC is capable of expressing all possible weighted averages of the encoded representations over time, as in GSA. Thus the range of c_u in GRC and in GSA is the same. Nonetheless, we empirically show that GRC performs comparably to or even better than GSA, and the experimental results are given in Sec. IV.
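The duality can be checked numerically. The following self-contained numpy sketch computes the context both through the recursion of Eq. (12) and through the weights of Eq. (15), and verifies that they coincide; it is an illustrative check, not part of the model.

```python
import numpy as np

def grc_context(z, h):
    """Recursive GRC aggregation, Eq. (12): d_t = (1 - z_t) d_{t-1} + z_t h_t."""
    d = h[0]                      # z_1 = 1 implies d_1 = h_1
    for t in range(1, len(z)):
        d = (1.0 - z[t]) * d + z[t] * h[t]
    return d

def grc_weights(z):
    """Equivalent attention weights, Eq. (15): alpha_t = z_t * prod_{j>t} (1 - z_j)."""
    T = len(z)
    return np.array([z[t] * np.prod(1.0 - z[t + 1:]) for t in range(T)])

rng = np.random.default_rng(0)
T, d_h = 8, 4
h = rng.normal(size=(T, d_h))
z = np.concatenate(([1.0], rng.uniform(size=T - 1)))   # z_1 = 1, 0 <= z_t <= 1

alpha = grc_weights(z)
assert np.isclose(alpha.sum(), 1.0)                     # weights form a distribution (Thm. 1)
assert np.allclose(grc_context(z, h), alpha @ h)        # recursion equals the weighted sum
```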
2) Relation to sMoChA
sMoChA [15] is a variant of MoChA in which Eq. (10) is replaced by the following formula:

    \alpha_{u,t} = p_{u,t} \prod_{j=1}^{t-1} (1 - p_{u,j}),    (16)

which enables the optimization process to be more stable. Eq. (16) is almost identical to the function \bar{\alpha} in Eq. (15), which implies evidence for the stability of GRC training. Despite this fact, sMoChA is an algorithm independent of GRC, as Eq. (16) is merely used as the selection-probability-based component within the whole set of training formulas and is not even used for inference.

B. Decreasing GRC (DecGRC)
The intermediate context d_{u,t} of GRC is a weighted average of the encoded representations over time according to the following corollary:

Corollary 1.1 (Weighted average). Let Z_n and A_n be the sets defined in Thm. 1. For any \tau \in \{1, 2, \ldots, T\}, z_u \in Z_\tau, and h = [h_1, h_2, \ldots, h_T], there exists a function \bar{a} : Z_\tau \to A_T that satisfies the following equation:

    d_{u,\tau} = \sum_{t=1}^{T} \bar{a}(z_u)_t \, h_t,

where \bar{a}(z_u)_t denotes the t-th element of \bar{a}(z_u), and d_{u,\tau} is obtained from z_u and h according to Eq. (12).

Proof. By substituting every T in the proof of Thm. 1 with \tau, there exists a bijective function \bar{\alpha} : Z_\tau \to A_\tau given by

    \bar{\alpha}(z_u)_t := z_{u,t} \prod_{j=t+1}^{\tau} (1 - z_{u,j}) \quad \text{for } t = 1, 2, \ldots, \tau,

such that d_{u,\tau} = \sum_{t=1}^{\tau} \bar{\alpha}(z_u)_t h_t. It is trivial to show that the following function \bar{a} : Z_\tau \to A_T satisfies the equation in Coroll. 1.1:

    \bar{a}(z_u)_t = \bar{\alpha}(z_u)_t \text{ if } t \le \tau; \quad 0 \text{ otherwise}, \quad \text{for } t = 1, 2, \ldots, T. ∎

Global attention methods including GSA and GRC cannot compute the attention weights without the entire sequence of the encoded vectors h. However, considering that the attention techniques are methods for calculating a weighted average of the encoded vectors, Coroll. 1.1 enables us to treat an intermediate context d_{u,\tau} as a valid attention context vector computed from only the first \tau encoded vectors.

In DecGRC, the update gates of GRC are constrained to be monotonically non-increasing over time by replacing Eq. (13) with

    z_{u,1} = 1, \quad z_{u,t} = \frac{1}{1 + \sum_{j=1}^{t} \exp(e_{u,j})} \;\; \text{for } t = 2, \ldots, T,

so that once an update gate becomes small, all subsequent gates stay small.
Algorithm 1 Online inference using DecGRC
Input: encoded vectors h of length T, threshold \nu
State: s_0 = \vec{0}, u = 1, y_0 = StartOfSequence
while y_{u-1} \neq EndOfSequence do
    d_u = h_1, \quad e_{u,1} = \mathrm{Score}(s_u, h_1, \alpha_{u-1,1}) + b
    for t = 2 to T do
        e_{u,t} = \mathrm{Score}(s_u, h_t, \alpha_{u-1,t}) + b
        z_{u,t} = 1 / (1 + \sum_{j=1}^{t} \exp(e_{u,j}))
        d_u = (1 - z_{u,t}) d_u + z_{u,t} h_t
        if z_{u,t} < \nu then
            break
        end if
    end for
    s_u = \mathrm{RecurrentState}(s_{u-1}, y_{u-1}, d_u)
    \tilde{P}(y_u \mid y_{<u}, x) = \mathrm{ReadOut}(s_u, y_{u-1}, d_u)
    determine y_u from \tilde{P}(y_u \mid y_{<u}, x) (e.g., via beam search), \; u = u + 1
end while

If the update gate values are sufficiently small for every time index after some index t_end, the difference between d_{u,t_end} and d_{u,T} is small, as the numerical change |d_{u,t} - d_{u,t-1}| for t > t_end induced by the recursion rule in Eq. (12) is negligible if z_{u,t} is small enough. Intuitively, the intermediate context vectors roughly converge after the endpoint.

DecGRC can operate as an online attention method if such an endpoint index t_end exists at each decoder step and the index can be decided by the model. We experimentally observed that DecGRC models adequately learn the alignment between encoded vectors and output text units, and that the intermediate context nearly converges after the aligned time index at each decoder step. Relevant experimental results are given in Sec. IV-D.

Accordingly, with an online encoder, online inference can be implemented via a well-trained DecGRC model. We describe the online inference technique in Alg. 1, where the endpoint index is decided simply by thresholding the update gate values.
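For illustration, a minimal numpy sketch of one DecGRC decoder step following Alg. 1 is given below. The bilinear score function and all shapes are hypothetical placeholders for the trained Score(·) and decoder state; only the gate recursion and the thresholded early exit reflect the algorithm above.

```python
import numpy as np

def decgrc_step(score_fn, s_u, h, nu, b=0.0):
    """One DecGRC decoder step following Alg. 1.

    score_fn : callable(s_u, h_t) -> scalar score e_{u,t} (stand-in for Score(.))
    s_u      : current decoder state
    h        : encoded vectors, shape (T, d_h)
    nu       : threshold on the update gate; smaller nu = later endpoint, higher latency
    """
    T = h.shape[0]
    d_u = h[0].copy()                      # z_{u,1} = 1, so d_{u,1} = h_1
    exp_sum = np.exp(score_fn(s_u, h[0]) + b)
    t_end = T
    for t in range(1, T):
        exp_sum += np.exp(score_fn(s_u, h[t]) + b)
        z = 1.0 / (1.0 + exp_sum)          # non-increasing update gate
        d_u = (1.0 - z) * d_u + z * h[t]   # Eq. (12) recursion
        if z < nu:                         # endpoint: later frames barely change d_u
            t_end = t + 1
            break
    return d_u, t_end                      # d_u approximates the full-sequence context c_u

# toy usage with a random bilinear score, a hypothetical stand-in for the trained Score(.)
rng = np.random.default_rng(0)
d_h, d_s, T = 4, 3, 50
W = rng.normal(size=(d_s, d_h))
score = lambda s, ht: float(s @ W @ ht)
context, t_end = decgrc_step(score, rng.normal(size=d_s), rng.normal(size=(T, d_h)), nu=0.05)
print(t_end, "of", T, "frames used")
```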
C. Computational efficiency of proposed methods

GRC and DecGRC add a negligible amount of memory footprint, since only one trainable parameter b in Eq. (14) is added to the standard GSA-based AED model.

Both proposed methods have a computational complexity of O(TU), the same as the conventional global attentions. In practice, however, a speech sequence is linearly aligned with the text sequence on average. As Alg. 1 only attends to encoded vectors before the endpoint indices, the total number of steps in the for loop is typically only slightly larger than TU/2, if the threshold \nu is set to an appropriate value. Therefore, DecGRC is computationally more efficient than global attentions such as GRC and GSA.

The recursive update in Eq. (12) induces a negligible amount of computation compared to the whole training or inference process. There is still room for faster computation by enabling parallel computation over time. The parallel computation can be implemented by utilizing Eq. (2) with \alpha_{u,t} replaced by \bar{\alpha}(z_u)_t from Eq. (15), instead of Eqs. (11)-(12).

Most importantly, both proposed methods introduce no hyperparameter at the training phase. Thus the proposed methods do not need repeated training to find a proper value of such a hyperparameter. Though the DecGRC inference in Alg. 1 introduces a new hyperparameter (i.e., the threshold \nu) at the test phase, the threshold search on development sets does not take long, because the size of the development sets is minor compared to the training set. Hence the total time spent to prepare an ASR system can be saved. Furthermore, the tradeoff between latency and performance can be adjusted by resetting the threshold value \nu at the inference phase, unlike the conventional online attention methods [9], [13], [14]. In these existing methods, the inference algorithms' decision rules on the attention endpoints are determined at the training phase and remain unchanged at the test stage. The experiments on DecGRC with different thresholds are presented in Sec. IV-E.
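To make the time-parallel formulation concrete, the following numpy sketch (assuming the update gates z for every decoder step are already available from Eqs. (13)-(14)) computes all attention contexts at once via Eqs. (15) and (2), and checks the result against the sequential recursion of Eq. (12). It is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def grc_contexts_parallel(z, h):
    """Compute GRC contexts for all decoder steps at once.

    z : update gates, shape (U, T), with z[:, 0] == 1
    h : encoded vectors, shape (T, d_h)
    Returns contexts of shape (U, d_h), equal to running Eq. (12) per step.
    """
    one_minus = 1.0 - z                                        # (U, T)
    # suffix products prod_{j=t+1}^{T} (1 - z_{u,j}) along the time axis
    rev_cum = np.cumprod(one_minus[:, ::-1], axis=1)[:, ::-1]
    suffix = np.concatenate([rev_cum[:, 1:], np.ones((z.shape[0], 1))], axis=1)
    alpha = z * suffix                                         # Eq. (15), shape (U, T)
    return alpha @ h                                           # Eq. (2): weighted sums

# consistency check against the sequential recursion of Eq. (12)
rng = np.random.default_rng(1)
U, T, d_h = 5, 12, 3
z = np.concatenate([np.ones((U, 1)), rng.uniform(size=(U, T - 1))], axis=1)
h = rng.normal(size=(T, d_h))

d = np.tile(h[0], (U, 1))
for t in range(1, T):
    d = (1.0 - z[:, t:t + 1]) * d + z[:, t:t + 1] * h[t]
assert np.allclose(grc_contexts_parallel(z, h), d)
```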
IV. EXPERIMENTS

A. Configurations
TABLE I
WORD ERROR RATES (WERs) COMPARISON BETWEEN ATTENTION METHODS ON LIBRISPEECH DATASET.

Exp. ID | Attention method     | Param. init. from | Attention online? | Encoder online?  | Can infer online? | dev clean | dev other | test clean | test other
E1      | GSA (Bahdanau)       | -                 | No                | No (BiLSTM)      | No                |           |           |            |
E2      | GRC                  | -                 |                   |                  |                   |           |           |            |
E3      | Windowed att. (w=11) | E1                | Yes               |                  |                   |           |           |            |
E4      | Windowed att. (w=20) | E1                |                   |                  |                   | 5.78      | 14.82     | 5.71       | 15.90
E5      | MoChA (w=2)          | -                 |                   |                  |                   | 6.49      | 17.11     | 6.17       | 18.18
E6      | MoChA (w=8)          | -                 |                   |                  |                   |           |           |            |
E7      | DecGRC (ν=0.01)      | -                 |                   |                  |                   | 4.91      | 14.85     | 5.10       | 15.85
E8      | DecGRC               | E2                |                   |                  |                   | 4.97      |           |            |
E9      | GSA (Bahdanau)       | -                 | No                | Yes (LC-BiLSTM)  |                   | 5.54      | 15.49     | 5.51       | 16.91
E10     |                      | E1                |                   |                  |                   |           |           |            |
E13     | Windowed att. (w=11) | E10               | Yes               | Yes              |                   |           |           |            |
E14     | Windowed att. (w=20) | E10               |                   |                  |                   | 5.62      | 15.86     | 5.56       | 16.96
E15     | MoChA (w=2)          | E5                |                   |                  |                   | 6.48      | 18.35     | 6.55       | 19.33
E16     | MoChA (w=8)          | E6                |                   |                  |                   |           |           |            |
E17     | DecGRC (ν=0.01)      | E8                |                   |                  |                   | 5.77      | 16.24     | 5.87       | 17.04
E18     | DecGRC (ν=0.08)      | E12               |                   |                  |                   | 5.79      | 15.67     | 6.04       |

All experiments were conducted on the LibriSpeech dataset, which contains 16 kHz read English speech with transcriptions. The dataset consists of 960 hours of training data from 2,338 speakers, 10.8 hours of dev data from 80 speakers, and 10.4 hours of test data from 66 speakers, with no overlapping speakers between different sets. Both the dev and test sets are split in half into clean and other subsets, depending on the ASR difficulty of each speaker. We randomly chose 1,500 utterances from the dev set as a validation set.

The scripts for all experiments are available at https://github.com/GRC-anonymous/GRC.

All experiments shared the same network architecture and training scheme of a recipe of the RETURNN toolkit [24], [25], except for the attention methods. Input features were 40-dimensional mel-frequency cepstral coefficients (MFCCs) extracted with a Hanning window of 25 ms length and 10 ms hop size, followed by global mean-variance normalization. Output text units were 10,025 byte-pair encoding (BPE) units extracted from the transcriptions of the LibriSpeech training set. The Encoder(·) consisted of 6 BiLSTM layers of 1,024 units for each direction, and max-pooling layers of stride 3 and 2 were applied after the first two BiLSTM layers, respectively. For the online Encoder(·), 6 LC-BiLSTM layers were employed in place of the BiLSTM layers, where the future context sizes were set to 36, 12, 6, 6, 6, and 6 for each layer from bottom to top and the chunk sizes were twice the future context sizes. Both the Score(·) and MonotonicScore(·) functions were implemented using the Bahdanau score with fertility-based weight feedback and a 1,024-dimensional attention key, as in [26]. RecurrentState(·) was implemented with a unidirectional LSTM layer with 1,000 units. ReadOut(·) consisted of a max-out layer with 2 × 500 units, followed by a softmax output layer with 10,025 units.
Weight parameters were initialized with the Glorot uniform method [27], and biases were initially set to zero. The following optimization techniques were utilized during training: teacher forcing, the Adam optimizer, learning rate scheduling, curriculum learning, and a layer-wise pre-training scheme. Briefly, the models were trained for 13.5 epochs using a learning rate of 8 × 10^-4 with a linear warm-up starting from 3 × 10^-4 and the Newbob decay rule [28]. Only the first two layers of Encoder(·), with half width, were used at the beginning of training, and then the width and the number of layers gradually increased to the original size at 1.5 epochs. CTC multi-task learning [29] with a lambda of 0.5 was employed to stabilize the learning, where the CTC loss is measured with another 10,025-unit softmax layer on top of Encoder(·). For the models that began learning from the parameters of a pre-trained model, the layer-wise pre-training was skipped. Every model was regularized by applying a dropout rate of 0.3 to the Encoder(·) layers and the softmax layer and by employing label smoothing of 0.1. For each epoch of training, both the cross-entropy (CE) loss and the output error rate were measured 20 times on the validation set with teacher forcing. During the inference phase, the model with the lowest WER on the dev-other set among all checkpoints was selected as the final model, and beam search was performed once on the dev and test sets with a beam size of 12.

We trained MoChA models for 17.5 epochs with five times longer layer-wise pre-training to make them converge. A small learning rate of 1e-5 was used for training the windowed attention models, as in [13]. Though the numbers of total epochs for different experiments were not the same, each model was optimized until convergence and showed negligible improvement afterwards.

B. Performance comparison between attentions
All experimental results are summarized in Tbl. I. For each experiment, we performed two trials of training with the same configuration and chose the model with the lowest word error rate (WER), a word-level Levenshtein distance divided by the number of ground-truth words, on the dev-other set.

In E1 to E2 and E9 to E12, GRC showed better performance than the other attention methods on the test-other set, with a 3.7% and 3.2% relative error-reduction rate (RERR) compared to GSA when evaluated with the BiLSTM and LC-BiLSTM encoders, respectively.

In E3 to E6 and E13 to E16, the performance of the conventional online attentions, i.e., windowed attention and MoChA, was shown to be highly dependent on the choice of the window size hyperparameter w. On the other hand, DecGRC is trained without any additional hyperparameter and only involves a threshold ν at the inference phase.

In E3 to E8 and E13 to E18, DecGRC outperformed the conventional online attention techniques with the BiLSTM encoder. With the LC-BiLSTM encoder, the performance of DecGRC on the test-other set surpassed the conventional attentions including GSA, while the scores on the test-clean set were worse than the competitors'. The overall performance of GRC and DecGRC
is degraded with LC-BiLSTM compared to their preferable performance with BiLSTM, which we conjecture is caused by the following aspect of the proposed methods: \bar{\alpha}(z_u) in Eq. (15) has a dependency on the update-gate values of future time steps, so the short future receptive field of LC-BiLSTM may have contributed to the degradation.

Fig. 1. Cross-entropy loss curves of various attention methods. All the models were trained from scratch (w/ BiLSTM encoder).

C. Optimization speed
The cross-entropy loss curves on the training and dev sets in E1, E2, E6, and E7 are depicted in Fig. 1. The model based on each attention method was trained from scratch until convergence, with a few spikes in its training loss curve. A spike in the loss curve occurs when the layer-wise pre-training algorithm inserts a new layer and units into the encoder, as the untrained new components make the performance worse right after the insertion.

GRC converged later than GSA, and DecGRC converged slightly later than GRC. MoChA showed the slowest optimization speed, which was partly due to its layer-wise pre-training schedule being 5 times longer than the others'. Such long pre-training was employed to stabilize the training of MoChA, whereas both GRC and DecGRC successfully converged with the standard pre-training. Note that the longer pre-training of MoChA was adopted because it had failed to converge with a short pre-training in our initial experiments. The relatively stable learning of the proposed methods compared with MoChA can be explained in relation to sMoChA, as described in Sec. III-A2: sMoChA stabilized the training of MoChA by utilizing a modified selection probability formula, which is in fact almost identical to the attention weight \bar{\alpha}(z_u) of GRC in Eq. (15).

D. Attention analysis
GRC and DecGRC accurately learned alignments between encoded representations and output text units, as illustrated in Fig. 2.

Fig. 2. An input spectrogram, attention plots with the output BPE sequence of GSA (E1), GRC (E2), and DecGRC (E8), and the update gates of the DecGRC, from top to bottom. All results were obtained with the BiLSTM encoder on the utterance 8254-84205-0009 in the dev-other set. The update gates were obtained with teacher forcing, and the attention plots are results of the beam search w/ beam size 12. "__" was inserted after a BPE unit end if it was not a word-end.

An interesting characteristic of GRC was observed: it tended to put much weight on the latter time indices of the attention weights, compared to GSA. This can be regarded as an innate behavior of GRC, as the attention weight \bar{\alpha}(z_u) in Thm. 1 is designed to weigh the latter indices when the update gates z_{u,t} have similar values over several consecutive time indices. This latter-time-weighing attribute could be especially effective for a long text unit (e.g., the BPE unit "swinging" in Fig. 2), as a long BPE unit often ends with a suffix that might be crucial to distinguish words (e.g., "-ing", "-n't", or "-est" in English). A piece of statistical evidence is presented in Fig. 3; GRC outperformed GSA when the median length of BPE units was larger than or equal to 6, while it showed similar performance for shorter median lengths.

Fig. 3. WER for each utterance-wise median length of the BPE units (w/ BiLSTM encoder). WERs for GSA (E1) and GRC (E2) were measured on the test set (i.e., both test-clean and test-other).

Attention weights of DecGRC tended to be much smoother (i.e., spread over a longer time span) than those of GRC and GSA. Such smoothness was hypothesized to be caused by the decreasing update gates, which made the model train to be cautious about a sharp descent of update gate values, as such a descent is irreversible in DecGRC. In addition, DecGRC did not attend to the first time index, unlike GSA and GRC. This is an intrinsic property of DecGRC, as the earliest update gates have values close to 1 and it is therefore difficult to carry their information to later time steps. As the initial frames of an utterance usually contain helpful information such as background noise, this might cause DecGRC to be degraded compared to the global attentions.

The last two plots in Fig. 2 show that the update gate values of DecGRC mostly changed near the attention region. As the update gates rapidly decreased after the attention region, tight attention endpoints could easily be found by setting the threshold value approximately in the range of [0.001, 0.2]. For instance, with the inference threshold shown in Fig. 2, the total number of steps in the for loop in Alg. 1 was 459, which was approximately 54% of TU = 13 × 65 = 845. This implies that insignificant time indices were properly ignored during the inference.
E. Ablation study on DecGRC inference threshold
TABLE II
ABLATION STUDY ABOUT THE INFERENCE THRESHOLD OF DECGRC ON LIBRISPEECH DATASET.

Threshold (ν) | Encoder online?  | dev clean | dev other | test clean | test other
0             | No (BiLSTM)      | 5.03      | 14.08     | 4.89       | 14.83
0.001         |                  | 4.96      |           |            |
0             | Yes (LC-BiLSTM)  | 6.07      | 15.80     | 6.30       | 16.51
0.001         |                  | 6.08      | 15.77     | 6.32       | 16.50
0.01          |                  | 6.01      | 15.75     | 6.22       | 16.45
0.05          |                  | 5.83      |           |            |

We evaluated the WERs of DecGRC models for different threshold values, and the results are summarized in Tbl. II. Setting the threshold to a value larger than 0.2 was found to be detrimental to the performance with the BiLSTM encoder (E8), with larger thresholds giving higher WERs. This means that some encoded vectors in the correct attention region were ignored due to the high threshold, as shown in the last two plots of Fig. 2. Impressively, the best performance was obtained with ν between 0.001 and 0.1, not ν = 0. This may be attributed to the fact that the thresholding not only reduced the latency but also eliminated undesirable updates after the correct attention region. With thresholds higher than the best-performing threshold, the latency could be further reduced by taking a performance penalty, and vice versa.

The best-performing threshold on the LC-BiLSTM encoder (E18) was ν = 0.08, which means that the optimal threshold for performance depends on the model architecture; thus threshold searching is required before the test phase. Nevertheless, as the beam search inference on the dev set took less than 15 minutes using a single GPU, the time spent on the tuning process of the threshold was no more than 2.5 hours, which is much shorter than the model training time: a single epoch of training took 9 hours on average, and the total time for training a model from scratch was more than 5 days.

V. CONCLUSION
We proposed a novel softmax-free global attention method called GRC and its variant for online attention, namely DecGRC. Unlike the conventional online attentions, DecGRC introduces no additional hyperparameter to be tuned at the training phase. Thus DecGRC does not require multiple trials of training, saving time for model preparation. Moreover, at the inference phase of DecGRC, the tradeoff between ASR latency and performance can be controlled by adapting the scalar threshold that is related to the attention endpoint decision, whereas the conventional online attentions are not capable of changing the endpoint decision rule at the test phase. Both GRC and DecGRC showed comparable ASR performance to the conventional global attentions.

For further research, the proposed attention methods will be investigated in various applications which leverage AED models. We are particularly interested in applying DecGRC to simultaneous machine translation [30] and real-time scene text recognition [31], where the latency can be reduced by exploiting an online attention method.
REFERENCES

[1] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964.
[2] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4945–4949.
[3] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the International Conference on Machine Learning (ICML), 2006, pp. 369–376.
[4] A. Graves, "Sequence transduction with recurrent neural networks," in Representation Learning Workshop in International Conference on Machine Learning (ICML), 2012.
[5] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., "State-of-the-art speech recognition with sequence-to-sequence models," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4774–4778.
[6] A. Garg, D. Gowda, A. Kumar, K. Kim, M. Kumar, and C. Kim, "Improved multi-stage training of online attention-based encoder-decoder models," arXiv preprint arXiv:1912.12384, 2019.
[7] T. N. Sainath, R. Pang, D. Rybach, Y. He, R. Prabhavalkar, W. Li, M. Visontai, Q. Liang, T. Strohman, Y. Wu et al., "Two-pass end-to-end speech recognition," in Proceedings of Interspeech, 2019, pp. 2773–2778.
[8] Y. Zhang, G. Chen, D. Yu, K. Yaco, S. Khudanpur, and J. Glass, "Highway long short-term memory RNNs for distant speech recognition," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5755–5759.
[9] N. Jaitly, D. Sussillo, Q. V. Le, O. Vinyals, I. Sutskever, and S. Bengio, "A neural transducer," arXiv preprint arXiv:1511.04868, 2015.
[10] T. N. Sainath, C.-C. Chiu, R. Prabhavalkar, A. Kannan, Y. Wu, P. Nguyen, and Z. Chen, "Improving the performance of online neural transducer models," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5864–5868.
[11] J. Hou, S. Zhang, and L.-R. Dai, "Gaussian prediction based attention for online end-to-end speech recognition," in Proceedings of Interspeech, 2017, pp. 3692–3696.
[12] A. Tjandra, S. Sakti, and S. Nakamura, "Local monotonic attention mechanism for end-to-end speech and language processing," in Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), vol. 1, 2017, pp. 431–440.
[13] A. Merboldt, A. Zeyer, R. Schlüter, and H. Ney, "An analysis of local monotonic attention variants," in Proceedings of Interspeech, 2019, pp. 1398–1402.
[14] C.-C. Chiu and C. Raffel, "Monotonic chunkwise attention," in Proceedings of International Conference on Learning Representations (ICLR), 2018.
[15] H. Miao, G. Cheng, P. Zhang, T. Li, and Y. Yan, "Online hybrid CTC/attention architecture for end-to-end speech recognition," in Proceedings of Interspeech, 2019, pp. 2623–2627.
[16] E. Tsunoo, Y. Kashiwagi, T. Kumakura, and S. Watanabe, "Towards online end-to-end transformer automatic speech recognition," arXiv preprint arXiv:1910.11871, 2019.
[17] R. Fan, P. Zhou, W. Chen, J. Jia, and G. Liu, "An online attention-based model for speech recognition," in Proceedings of Interspeech, 2019, pp. 4390–4394.
[18] N. Moritz, T. Hori, and J. Le Roux, "Triggered attention for end-to-end speech recognition," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5666–5670.
[19] L. Dong and B. Xu, "CIF: Continuous integrate-and-fire for end-to-end speech recognition," arXiv preprint arXiv:1905.11235, 2019.
[20] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
[21] L. Wasserman, All of Nonparametric Statistics. Springer Science & Business Media, 2006.
[22] Y.-H. H. Tsai, S. Bai, M. Yamada, L.-P. Morency, and R. Salakhutdinov, "Transformer dissection: An unified understanding for transformer's attention via the lens of kernel," in Proceedings of Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 4343–4352.
[23] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015, pp. 1412–1421.
[24] A. Zeyer, T. Alkhouli, and H. Ney, "RETURNN as a generic flexible neural toolkit with application to translation and speech recognition," in Annual Meeting of the Association for Computational Linguistics (ACL), 2018.
[25] A. Zeyer, A. Merboldt, R. Schlüter, and H. Ney, "A comprehensive analysis on attention models," in Interpretability and Robustness in Audio, Speech, and Language (IRASL) Workshop in Conference on Neural Information Processing Systems (NeurIPS), Montreal, Canada, 2018.
[26] A. Zeyer, K. Irie, R. Schlüter, and H. Ney, "Improved training of end-to-end attention models for speech recognition," in Proceedings of Interspeech, 2018, pp. 7–11.
[27] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, pp. 249–256.
[28] A. Zeyer, P. Doetsch, P. Voigtlaender, R. Schlüter, and H. Ney, "A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2462–2466.
[29] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4835–4839.
[30] N. Arivazhagan, C. Cherry, W. Macherey, C.-C. Chiu, S. Yavuz, R. Pang, W. Li, and C. Raffel, "Monotonic infinite lookback attention for simultaneous machine translation," in Annual Meeting of the Association for Computational Linguistics (ACL), 2019.
[31] Z. Liu, Y. Li, F. Ren, W. L. Goh, and H. Yu, "SqueezedText: A real-time scene text recognition by binary convolutional encoder-decoder network," in