Wake Word Detection with Streaming Transformers
Yiming Wang¹, Hang Lv², Daniel Povey³, Lei Xie², Sanjeev Khudanpur¹

¹Center for Language and Speech Processing & Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, USA
²ASLP@NPU, School of Computer Science, Northwestern Polytechnical University, Xi'an, China
³Xiaomi Corporation, Beijing, China

{yiming.wang,khudanpur}@jhu.edu, {hanglv,lxie}@nwpu-aslp.org, [email protected]
ABSTRACT
Modern wake word detection systems usually rely on neural networks for acoustic modeling. Transformers have recently shown superior performance over LSTMs and convolutional networks in various sequence modeling tasks, thanks to their stronger temporal modeling power. However, it is not clear whether this advantage still holds for short-range temporal modeling such as wake word detection. Besides, the vanilla Transformer is not directly applicable to the task due to its non-streaming nature and its quadratic time and space complexity. In this paper we explore the performance of several variants of chunk-wise streaming Transformers tailored for wake word detection in a recently proposed LF-MMI system, including looking ahead to the next chunk, gradient stopping, different positional embedding methods, and adding same-layer dependency between chunks. Our experiments on the Mobvoi wake word dataset demonstrate that our proposed Transformer model outperforms the baseline convolutional network by 25% on average in false rejection rate at the same false alarm rate with a comparable model size, while still maintaining linear complexity w.r.t. the sequence length.
Index Terms — wake word detection, Transformer, streaming, LF-MMI
1. INTRODUCTION
Voice interactions between humans and digital assistants integrated in smartphones and home voice command devices are becoming ubiquitous in our daily lives. This necessitates a built-in wake word detection system which constantly listens to its environment, expecting a predefined word to be spotted before turning into a more power-consumptive state to understand users' intentions (e.g., [1]).

Similar to automatic speech recognition (ASR), modern wake word detection systems can be constructed with either HMM/DNN hybrids [2, 3, 4, 5] or pure neural networks [6, 7, 8, 9, 10]. In either of these two categories, some type of neural network is used for acoustic modeling, to encode the input features of an audio recording into a high-level representation from which the decoder determines whether the wake word has been detected within some range of frames.

A wake word detection system usually runs on devices, and it needs to be triggered as soon as the wake word actually appears in a stream of audio. Hence the neural networks are limited to: 1) a small memory footprint; 2) a small computational cost; and 3) low latency in terms of the number of future frames needed to compute the score for the current frame.
This work was partially supported by an unrestricted gift from Applications Technology (AppTek). The authors thank Tongfei Chen and Hainan Xu for valuable comments.

Under these criteria, the family of recurrent neural networks [11, 12] is not suitable, because its sequential nature in computation prevents it from being parallelized across time in the chunk-streaming case even with GPUs. So most of the current systems adopt convolutional networks. A convolution kernel spans a small and fixed range of frames, and is repeatedly applied by sliding across time or frequencies. Although each kernel only captures a local pattern, the receptive field can be enlarged by stacking several convolution layers together: higher layers can "see" a longer range of frames than lower layers, capturing more global patterns.

Recently the self-attention structure, as a building block of the Transformer networks [13], has gained popularity in both the NLP and speech communities for its capability of modeling context dependency in sequence data without recurrent connections [13, 14]. Self-attention replaces recurrence with direct interactions between all pairs of frames in a layer, making each frame aware of its context. The computations are more parallelized, in the sense that the processing of a frame does not depend on the completion of processing other frames in the same layer. [15] also explored self-attention in the keyword search (KWS) task. However, the original self-attention requires the entire input sequence to be available before any frames can be processed, and its computational complexity and memory usage are both O(T^2). Time-restricted self-attention [16] restricts the self-attention to a small context window around each frame with attention masks. But it still lacks a mechanism for saving the currently computed state for future computations, and is thus not applicable to streaming data. Transformer-XL [17] performs chunk-wise training where the previous chunk is cached as hidden state for the current chunk to attend to for long-range temporal modeling, so it can be used for streaming tasks. The time and space complexity are both reduced to O(T), and the within-chunk computation across time can be parallelized with GPUs. While there has been recent work [18, 19, 20, 21, 22] with similar ideas showing that such streaming Transformers achieve competitive performance compared with latency-controlled BiLSTMs [23] or non-streaming Transformers for ASR, it remains unclear how streaming Transformers work for a shorter sequence modeling task like wake word detection.

In this paper we investigate various aspects of streaming Transformers applied to wake word detection in the recently proposed alignment-free LF-MMI system [5]. This paper makes the following contributions: 1) we explore how the gradient stopping point during back-propagation affects the performance; 2) we show how different positional embedding methods affect the performance; and 3) we compare the performance of obtaining the hidden state from either the current or the previous layer. In addition, we propose an efficient way to compute the relative positional embedding in streaming Transformers. To the best of our knowledge, this is the first time that streaming Transformers are applied to this task.
2. THE ALIGNMENT-FREE LF-MMI SYSTEM
We build our system on top of the state-of-the-art system described in [5]. We briefly explain that system below to provide some background; interested readers can refer to [5] for details.

This is a hybrid HMM/DNN system with an alignment-free LF-MMI loss [24, 25], where the positive wake word (denoted wake word) and the negative non-silence speech (denoted freetext) are each modeled with a single left-to-right 4-state HMM, regardless of how many actual phonemes there are. In addition, a 1-state HMM is dedicated to modeling optional silence [26] (denoted SIL).
The motivation behind this is that we believe the proposed design has sufficient modeling power for this task.

In the LF-MMI loss, the numerator represents the likelihood of the input features given the correct output state sequence, while the denominator represents the likelihood given incorrect state sequences; so the model is trained to maximize the posterior of the correct sequence among other competing sequences. "Alignment-free" here refers to the fact that unfolded FSTs serving as the numerator graphs are directly derived from the true labels ("positive" or "negative" in our task). The denominator graph is specified manually, containing one path corresponding to the positive recordings and two paths corresponding to the negatives. Since the alignment-free LF-MMI system outperforms the cross-entropy HMM-based and other purely neural systems [5], we base the work in this paper on this specific system.

The work in [5] adopts dilated and strided 1D convolutional networks (or "TDNN" [27, 28]) for acoustic modeling, which is straightforward as the computation of convolution is both streamable by nature and highly parallelizable. In the next section, we detail our approach to streaming Transformers for modeling the acoustics in our task.
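For reference, the objective described above can be written compactly. The following is our paraphrase of the standard LF-MMI formulation from [24, 25], shown here only to make the numerator/denominator roles explicit:

    F_MMI = Σ_{n=1}^{N} log [ p(X^(n) | G_num^(n)) / p(X^(n) | G_den) ]

where X^(n) is the n-th input feature sequence, G_num^(n) is its numerator graph derived from the utterance label ("positive" or "negative"), and G_den is the shared denominator graph. Maximizing this quantity raises the posterior of the correct label sequence relative to all the competing sequences encoded in the denominator graph.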
3. STREAMING TRANSFORMERS
We recapitulate the computation of a vanilla single-headed Transformer here (a multi-headed extension is straightforward and irrelevant to our discussion). Assume the input to a self-attention layer is X = [x_1, ..., x_T] ∈ R^{d_x×T} where x_j ∈ R^{d_x}. The tensors of query Q, key K, and value V are obtained via

    Q = W_Q X,  K = W_K X,  V = W_V X ∈ R^{d_h×T}    (1)

where the weight matrices W_Q, W_K, W_V ∈ R^{d_h×d_x}. The output at the i-th time step is computed as

    h_i = V a_i ∈ R^{d_h},   a_i = softmax([Q^T K]_i / sqrt(d_h)) ∈ R^T    (2)

where [·]_i means taking the i-th row of a matrix. All the operations mentioned above are homogeneous across time, and thus can be parallelized on GPUs. Note that here Q, K, and V are computed from the same input X, which represents the entire input sequence.

Such dependency of each output frame on the entire input sequence makes the model unable to process streaming data, where in each step only a limited number of input frames can be processed. Also, the self-attention is conducted between every pair of frames within the whole sequence, making the memory usage and computation cost both O(T^2). Transformer-XL-like architectures address these concerns by performing chunk-wise processing of the sequence.
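As a reference point, the full-sequence computation of Eqs. (1)–(2) can be sketched in a few lines of PyTorch. This is our illustration (single head, batch dimension dropped, time as the last dimension, function name ours), not code from the paper:

```python
import torch
import torch.nn.functional as F

def vanilla_self_attention(x, w_q, w_k, w_v):
    """Single-headed self-attention over a whole sequence, Eqs. (1)-(2).

    x:   (d_x, T) input sequence, time as the last dimension
    w_*: (d_h, d_x) projection matrices W_Q, W_K, W_V
    """
    q, k, v = w_q @ x, w_k @ x, w_v @ x            # each (d_h, T), Eq. (1)
    d_h = q.size(0)
    # (T, T) weights: row i holds a_i, the attention of frame i to all frames
    a = F.softmax(q.t() @ k / d_h ** 0.5, dim=-1)  # Eq. (2)
    return v @ a.t()                               # (d_h, T), column i is h_i
```

Note how every output column depends on all T input frames, which is exactly what makes this form non-streaming.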
Fig. 1: Two different types of node dependency when computing self-attention in streaming Transformers: (a) dependency on the previous layer of the previous chunk; (b) dependency on the same layer of the previous chunk. The figures use 3-layer networks with 2 chunks (delimited by the thick vertical line in each sub-figure) of size 2 as examples. The grey arcs illustrate the node dependency within the current chunk, while the green arcs show the dependency from the current chunk to the previous one.

The whole input sequence is segmented into several equal-length chunks (except the last one, whose length can be shorter). As shown in Fig. 1a, the hidden state from the previous chunk is cached to extend the keys and values of the current chunk, providing extra history to be attended to. In this case, K̃ (or Ṽ) is longer than Q due to the prepended history. To alleviate the gradient explosion/vanishing issue introduced by this kind of recurrent structure, the gradient is set to not go beyond the cached state, i.e.,

    K̃_c = [SG(K_{c−1}); K_c],   Ṽ_c = [SG(V_{c−1}); V_c]    (3)

where c is the chunk index, [·;·] represents concatenation along the time dimension, and SG(·) is the stop-gradient operation (in PyTorch, for example, this would be Tensor.detach()). The memory usage and computation cost are both reduced to O(T), given that the chunk size is constant.

3.1. Look-ahead to the Next Chunk

Our preliminary experiments show that only having history to the left is not sufficient for good performance in our task. So we also allow a chunk to "look ahead" to the next chunk to get future context when making predictions for the current chunk. For the right context, the gradient in back-propagation does not just stop at K_{c+1} and V_{c+1}, but rather goes all the way down to the input within chunk c+1. On the other hand, we can optionally treat the left context (i.e., the history state) the same way as well. Intuitively, letting the weights have more information while being updated should always be beneficial, as long as their gradient flow is constrained within a small range of time steps. This can be achieved by splicing the left chunk together with the current chunk and then only selecting the output of the current chunk for loss evaluation, at the cost of one more forward computation for each chunk because chunks are no longer cached. We will show a performance comparison between training with and without such state-caching in the experiments.
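A minimal sketch of one layer's chunk-wise computation, combining the gradient-stopped cache of Eq. (3) with the look-ahead described above, might look as follows. The function name, the equal-sized chunks, and the cache layout are our illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def streaming_attention_chunk(x_cur, x_next, cache, w_q, w_k, w_v):
    """One chunk of streaming self-attention with a cached left context.

    x_cur:  (d_x, C) current chunk; x_next: (d_x, C) look-ahead chunk
    cache:  (K_{c-1}, V_{c-1}) from the previous chunk, or None for chunk 0
    Returns the (d_h, C) output for the current chunk and the new cache.
    """
    C = x_cur.size(1)
    x = torch.cat([x_cur, x_next], dim=1)  # extend keys/values to the right
    q = w_q @ x_cur                        # queries only for the current chunk
    k, v = w_k @ x, w_v @ x                # (d_h, 2C)
    new_cache = (k[:, :C], v[:, :C])       # cache only the current chunk's K, V
    if cache is not None:
        k_prev, v_prev = cache
        # SG(.) of Eq. (3): no gradient flows into the cached state
        k = torch.cat([k_prev.detach(), k], dim=1)
        v = torch.cat([v_prev.detach(), v], dim=1)
    a = F.softmax(q.t() @ k / q.size(0) ** 0.5, dim=-1)  # (C, 2C) or (C, 3C)
    return v @ a.t(), new_cache            # (d_h, C)
```

For the same-layer dependency variant of Sec. 3.2 below (Eq. (4)), one would cache the layer output H^l_{c−1} instead of (K_{c−1}, V_{c−1}); for the no-state-caching variant, one would splice the previous chunk's input onto x_cur and recompute instead of keeping a cache.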
3.2. Same-layer Dependency

Note that when there are multiple stacked self-attention layers, the output of the c-th chunk of the l-th layer actually depends on the output of the (c−1)-th chunk of the (l−1)-th layer. So the receptive field of each chunk grows linearly with the number of self-attention layers, and the current chunk does not have access to previous chunks in the same layer (Fig. 1a). This may limit the model's temporal modeling capability, as not all parts of the network within the receptive field are utilized. Hence, we instead take the output of the same layer from the previous chunk, H^l_{c−1} = [h_1, ..., h_T] ∈ R^{d_h×T}, where h_i is defined in Eq. (2). Then Eq. (3) becomes

    K̃^l_c = [SG(H^l_{c−1}); K^l_c],   Ṽ^l_c = [SG(H^l_{c−1}); V^l_c]    (4)

where we use the superscript l to emphasize tensors from the same layer. Fig. 1b illustrates the node dependency in such computation flows.

3.3. Positional Embedding

The self-attention in Transformers is invariant to sequence reordering, i.e., any permutation of the input sequence yields exactly the same output for each frame. So it is crucial to encode positions. The original Transformer [13] employs deterministic sinusoidal functions to encode absolute positions; in our chunk-streaming setting, we can also enable this by adding an offset to the frame positions within each chunk. However, since our goal in wake word detection is just to spot the word rather than to recognize the whole utterance, a relative positional encoding may be more appropriate. One type of relative positional embedding, as shown in [29], encodes the relative positions from a query frame to key/value frames in the self-attention, where pairs of frames having the same position difference share the same trainable embedding vector. The embedding vectors E ∈ R^{d_h×T} are then added to the key (optionally to the value as well) of each self-attention layer. So Eq. (2) is modified as

    h_i = (V + E) a_i ∈ R^{d_h},   a_i = softmax([Q^T (K + E)]_i / sqrt(d_h)) ∈ R^T    (5)

As suggested in [29], the relative positional embedding is fed into every self-attention layer and jointly trained with the other model parameters.

Different from the case in [29], where the query and the key (or value) have the same sequence length, here extra hidden state is prepended to the left of the key and the value in the current chunk, making the resulting key and value longer than the query. By leveraging the special structure of the layout of relative positions between the query and the key, we design a series of simple but efficient tensor operations to compute self-attention with positional embeddings. Next we show how the dot product between the query Q and the positional embedding E for the key K can be obtained; the procedure when adding the embedding to the value V is similar. (We drop the batch and head dimensions for clarity, so all tensors become 2D matrices in our description.)

Let us denote the lengths of the query and the extended key as l_q and l_k, respectively, where l_q < l_k. There are (2 l_k − 1) possible relative positions from the query to the key, ranging over [−l_k + 1, l_k − 1]. Given an embedding matrix E ∈ R^{d_h×(2 l_k − 1)}, we first compute its dot product with the query Q, resulting in a matrix M = Q^T E ∈ R^{l_q×(2 l_k − 1)}. Then for the i-th row of M, we select l_k consecutive elements corresponding to the l_k different relative positions from the i-th frame of the query to each frame of the key, and rearrange them into M′ ∈ R^{l_q×l_k}. This process is illustrated in Fig. 2. In the 0-th row, we keep the elements corresponding to the relative positions in the range [−l_k + l_q, l_q − 1]; in the i-th row, the range is left-shifted by 1 from the one in the (i−1)-th row; finally, in the (l_q − 1)-th row, the range is [−l_k + 1, 0]. This process can be conveniently implemented by reusing most of the memory configuration of M for M′, without copying the underlying storage of M, via the following steps: 1) point M′ to the position of the first selected (yellow) cell in M; 2) change the row stride of M′ from (2 l_k − 1) to (2 l_k − 2); and 3) change the number of columns of M′ from (2 l_k − 1) to l_k.
Fig. 2: The process of selecting the relevant cells from the matrix M ∈ R^{l_q×(2 l_k − 1)} (left) and rearranging them into M′ ∈ R^{l_q×l_k} (right). The relevant cells are in yellow; the others are unselected. Note that the yellow block in each row of M is left-shifted by 1 cell from the yellow block in the row above.
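The three steps above amount to re-viewing the storage of M with a different shape and stride. A minimal PyTorch sketch using Tensor.as_strided (the helper name and the usage lines are ours, not the paper's code):

```python
import torch

def rel_shift(m: torch.Tensor, l_k: int) -> torch.Tensor:
    """Rearrange M of shape (l_q, 2*l_k - 1) into M' of shape (l_q, l_k)
    as in Fig. 2, reusing M's storage instead of copying it.

    Row i of M' is M[i, l_q - 1 - i : l_q - 1 - i + l_k], i.e. the window
    of relative positions [-l_k + l_q - i, l_q - 1 - i].
    Assumes m is contiguous and starts at offset 0 of its storage
    (true for a fresh matmul result).
    """
    l_q = m.size(0)
    assert m.is_contiguous() and m.size(1) == 2 * l_k - 1
    # Shrinking the row stride by 1 shifts the selected window one cell
    # to the left on every successive row; the storage offset picks the
    # first selected cell of row 0.
    return m.as_strided(size=(l_q, l_k),
                        stride=(2 * l_k - 2, 1),
                        storage_offset=l_q - 1)

# Usage sketch: q is (d_h, l_q), e is the (d_h, 2*l_k - 1) embedding E.
# m = (q.t() @ e).contiguous()        # M = Q^T E, shape (l_q, 2*l_k - 1)
# m_prime = rel_shift(m, l_k)         # added to Q^T K before the softmax
```

A quick sanity check with l_q = 2, l_k = 3: row 0 of M′ picks flat storage indices 1..3 of M and row 1 picks 5..7, matching the left-shift-by-one pattern of Fig. 2.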
4. EXPERIMENTS

4.1. The Dataset
We use the Mobvoi (SLR87) dataset [30], which includes two wake words: "Hi Xiaowen" and "Nihao Wenwen". It contains 144 hours of training data with 43,625 positive examples out of 174,592, and 74 hours of test data with 21,282 positive examples out of 73,459. We do not report results on the other datasets mentioned in [5], because both the numbers reported there and those in our own experiments are too good (FRR below 1%) to demonstrate any significant difference.
All the experiments in this paper are conducted in ESPRESSO, a PyTorch-based end-to-end ASR toolkit [31], using PYCHAIN, a fully parallelized PyTorch implementation of LF-MMI [32]. We follow exactly the same data preparation and preprocessing pipeline as in [5], including the HMM and decoding graph topologies, feature extraction, negative-recording sub-segmentation, and data augmentation. During evaluation, when one of the two wake words is considered, the other one is treated as negative. The operating points are obtained by varying the positive path cost while fixing the negative path cost at 0 in the decoding graph. It is worth mentioning that all the results reported here are from an offline decoding procedure, as Kaldi [33] does not currently support online decoding with PyTorch-trained neural acoustic models. However, we believe that the offline decoding results would not deviate significantly from the online ones.

The baseline system is a 5-layer dilated 1D convolutional network with 48 filters and a kernel size of 5 for each layer, leading to 30 frames of both left and right context (25% less than that in [5]) and only 58k parameters (60% less than that in [5]). For the streaming Transformer models, the first two layers are 1D convolutions. They are followed by 3 self-attention layers with embedding dimension 32 and 4 heads, resulting in 48k parameters without any relative embedding (see Table 1 for model sizes with different relative embedding settings). To make sure that the outputs can "see" approximately the same amount of context as those in the baseline, the chunk size is set to 27, so that in the no-state-caching setting the right-most frame in a chunk depends on 27 input frames (still fewer than 30) as its right context; in the state-caching case, the receptive field covers one more chunk (or 27 more frames) on the left, as it increases linearly with the number of self-attention layers. Our experiments (not shown here) also suggest that 27 is optimal in this setting: a smaller chunk hurts the performance, and a larger one does not bring significant improvement but incurs more latency.

All the models are optimized using Adam, with the learning rate halved whenever the validation loss at the end of an epoch does not improve over the previous epoch. The training process stops when the number of epochs exceeds 15 or the learning rate becomes sufficiently small. We found that learning rate warm-up is not necessary to train our Transformer-based systems, probably due to the relatively simple supervisions in our task.

Table 1: Results of streaming Transformers with state-caching: false rejection rate (FRR, %) at 0.5 false alarms per hour (FAH).

    Model                           Size    Hi Xiaowen    Nihao Wenwen
    1D Conv (baseline)              58k     0.8           0.8
    Transformer (w/o look-ahead)    48k     3.5           4.7
      +look-ahead to next chunk     48k     1.3           1.2
      +abs. emb.                    48k     1.2           1.2
      +rel. emb. to key             52k     1.0           1.1
      +rel. emb. to value           57k     0.7           0.5

4.2. Results with State-caching

We first evaluate our streaming Transformer models with state-caching. The results are reported in Table 1 as false rejection rate (FRR) at 0.5 false alarms per hour (FAH). (We do not compare with other published systems because, to our best knowledge, this baseline is the state of the art reported on this dataset at the time of submission.) If we only rely on the current chunk and the cached state from the previous chunk, without any look-ahead to the future chunk, the detection results (row 2 in Table 1) are much worse than the baseline. This is actually expected, as the symmetric shape of convolution kernels allows the baseline network to take future frames into consideration; it validates that look-ahead to future frames is important in the chunk-wise training of Transformers. Adding absolute positional embeddings does not seem to improve the performance significantly. One possible explanation is that the goal of wake word detection is not to transcribe the whole recording but just to spot the word of interest, so the absolute encoding of positions does not have much effective impact. On the contrary, when we add relative positional embeddings to the key of the self-attention layers instead, there is a slight improvement over adding the absolute embedding, which supports our previous hypothesis that relative embeddings make more sense in such a task. When the embedding is also added to the value, FRR reaches 0.7% and 0.5% at FAH=0.5 for the two wake words respectively (i.e., a 25% relative improvement over the baseline on average), showing that the embedding is not only useful when calculating the attention weights, but also beneficial when encoding the positions into the layer's hidden values.
4.3. Results without State-caching

Table 2: Results of streaming Transformers without state-caching.

Next we explore whether letting the gradient be back-propagated into the history state helps train a better model. As mentioned in Sec. 3.1, this can be done by concatenating the current chunk with the previous chunk of input, instead of caching the internal state of the previous chunk. Table 2 shows several results. Looking at Table 2 by itself, we observe a similar trend as in the state-caching models from the previous section: relative positional embeddings are advantageous over the absolute sinusoidal positional embedding, and adding the embedding to both key and value is again the best. Furthermore, comparing the rows in Table 2 with their corresponding entries in Table 1, we observe that, except for the last row, regardless of the choice of positional embedding and how it is applied, the models without state-caching outperform the models with state-caching. This indicates the benefit of updating the model parameters with more gradient information back-propagated from the current chunk into the previous chunk. However, in the case where the relative positional embedding is also added to the value, the gap diminishes, suggesting that when the positional embedding is utilized in a better way, there is no need to recompute the cached state in order to reach the best performance.

We provide DET curves of the baseline convolutional network and the two proposed streaming Transformers in Fig. 3, for a more comprehensive demonstration of their performance differences.
Fig. 3: DET curves (false rejection rate (%) vs. false alarms per hour) for the baseline 1D convolution network and our two proposed streaming Transformers (with and without state-caching), for both wake words.
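For readers who want to reproduce such curves, a DET operating-point sweep can be sketched as follows. In our system the operating points come from varying the positive path cost in the decoding graph; this generic score-threshold version (function and variable names ours) is only an illustrative stand-in:

```python
import numpy as np

def det_points(scores, labels, hours_of_audio, thresholds):
    """Compute (FAH, FRR) operating points by sweeping a decision threshold.

    scores: per-recording detection scores (higher = more confident)
    labels: 1 for recordings containing the wake word, 0 otherwise
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    points = []
    for t in thresholds:
        detected = scores >= t
        # False rejections: positives that were not detected
        frr = 100.0 * np.sum(~detected & (labels == 1)) / np.sum(labels == 1)
        # False alarms: negatives that were detected, normalized per hour
        fah = np.sum(detected & (labels == 0)) / hours_of_audio
        points.append((fah, frr))
    return points
```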
4.4. Results with Same-layer Dependency

We now explore the architectural variant introduced in Sec. 3.2. Note that if the relative positional embedding is added to the value V^l_{c−1} as shown in Eq. (5), H^l_{c−1} will no longer be in the same semantic space as V^l_c, so it is problematic to concatenate H^l_{c−1} and V^l_c in Eq. (4). A similar issue arises if the parameters W_K and W_V of the same layer are not tied, because H^l_{c−1} is concatenated to both K̃^l_c and Ṽ^l_c. Our solution is to only add the positional embedding to K^l_c, and to tie W_K and W_V. However, this only achieves FRR=1.3% at FAH=0.5; with absolute embeddings, FRR=1.1% at the same FAH. This contradicts the observations in [21, 22], where same-layer dependency was found to be more advantageous for ASR, attributed to the fact that the receptive field is maximized at every layer (the type of positional embedding used there was not specified). A better way of incorporating relative positional information in this case is left as future work.
5. CONCLUSIONS
We propose using streaming Transformers for wake word detection with the latest alignment-free LF-MMI system. We explore how look-ahead to the future chunk, and different gradient stopping, layer dependency, and positional embedding strategies affect the system performance. Along the way we also propose a series of simple tensor operations to efficiently compute the self-attention in the streaming setting when relative positional embeddings are involved. Experiments on Mobvoi (SLR87) show the advantage of the proposed streaming Transformers over the 1D convolution baseline.

6. REFERENCES

[1] Yiming Wang, Xing Fan, I-Fan Chen, Yuzong Liu, Tongfei Chen, and Björn Hoffmeister, "End-to-end anchored speech recognition," in Proc. ICASSP, 2019.
[2] Sankaran Panchapagesan, Ming Sun, Aparna Khare, Spyros Matsoukas, Arindam Mandal, Björn Hoffmeister, and Shiv Vitaladevuni, "Multi-task learning and weighted cross-entropy for DNN-based keyword spotting," in Proc. Interspeech, 2016.
[3] Ming Sun, David Snyder, Yixin Gao, Varun K. Nagaraja, Mike Rodehorst, Sankaran Panchapagesan, Nikko Strom, Spyros Matsoukas, and Shiv Vitaladevuni, "Compressed time delay neural network for small-footprint keyword spotting," in Proc. Interspeech, 2017.
[4] Minhua Wu, Sankaran Panchapagesan, Ming Sun, Jiacheng Gu, Ryan Thomas, Shiv Naga Prasad Vitaladevuni, Björn Hoffmeister, and Arindam Mandal, "Monophone-based background modeling for two-stage on-device wake word detection," in Proc. ICASSP, 2018.
[5] Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, and Sanjeev Khudanpur, "Wake word detection with alignment-free lattice-free MMI," in Proc. Interspeech, 2020.
[6] Guoguo Chen, Carolina Parada, and Georg Heigold, "Small-footprint keyword spotting using deep neural networks," in Proc. ICASSP, 2014.
[7] Tara N. Sainath and Carolina Parada, "Convolutional neural networks for small-footprint keyword spotting," in Proc. Interspeech, 2015.
[8] Yanzhang He, Rohit Prabhavalkar, Kanishka Rao, Wei Li, Anton Bakhtin, and Ian McGraw, "Streaming small-footprint keyword spotting using sequence-to-sequence models," in Proc. ASRU, 2017.
[9] Changhao Shan, Junbo Zhang, Yujun Wang, and Lei Xie, "Attention-based end-to-end models for small-footprint keyword spotting," in Proc. Interspeech, 2018.
[10] Jingyong Hou, Yangyang Shi, Mari Ostendorf, Mei-Yuh Hwang, and Lei Xie, "Mining effective negative training samples for keyword spotting," in Proc. ICASSP, 2020.
[11] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[12] Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proc. EMNLP, 2014.
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Proc. NeurIPS, 2017.
[14] Shigeki Karita, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, and Ryuichi Yamamoto, "A comparative study on Transformer vs RNN in speech applications," in Proc. ASRU, 2019.
[15] Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengqi Wen, Zhengkun Tian, Chenghao Zhao, and Cunhang Fan, "A time delay neural network with shared weight self-attention for small-footprint keyword spotting," in Proc. Interspeech, 2019.
[16] Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, and Sanjeev Khudanpur, "A time-restricted self-attention layer for ASR," in Proc. ICASSP, 2018.
[17] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, and Ruslan Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context," in Proc. ACL, 2019.
[18] Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, and Shinji Watanabe, "Transformer ASR with contextual block processing," in Proc. ASRU, 2019.
[19] Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, and Zhengqi Wen, "Synchronous transformers for end-to-end speech recognition," in Proc. ICASSP, 2020.
[20] Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar, "Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss," in Proc. ICASSP, 2020.
[21] Liang Lu, Changliang Liu, Jinyu Li, and Yifan Gong, "Exploring transformers for large-scale speech recognition," in Proc. Interspeech, 2020.
[22] Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, and Frank Zhang, "Streaming transformer-based acoustic models using self-attention with augmented memory," in Proc. Interspeech, 2020.
[23] Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James R. Glass, "Highway long short-term memory RNNs for distant speech recognition," in Proc. ICASSP, 2016.
[24] Hossein Hadian, Hossein Sameti, Daniel Povey, and Sanjeev Khudanpur, "End-to-end speech recognition using lattice-free MMI," in Proc. Interspeech, 2018.
[25] Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Proc. Interspeech, 2016.
[26] Guoguo Chen, Hainan Xu, Minhua Wu, Daniel Povey, and Sanjeev Khudanpur, "Pronunciation and silence probability modeling for ASR," in Proc. Interspeech, 2015.
[27] Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Proc. Interspeech, 2015.
[28] Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohammadi, and Sanjeev Khudanpur, "Semi-orthogonal low-rank matrix factorization for deep neural networks," in Proc. Interspeech, 2018.
[29] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani, "Self-attention with relative position representations," in Proc. NAACL-HLT, 2018.
[30] Jingyong Hou, Yangyang Shi, Mari Ostendorf, Mei-Yuh Hwang, and Lei Xie, "Region proposal network based small-footprint keyword spotting," IEEE Signal Processing Letters, vol. 26, no. 10, pp. 1471–1475, 2019.
[31] Yiming Wang, Tongfei Chen, Hainan Xu, Shuoyang Ding, Hang Lv, Yiwen Shao, Nanyun Peng, Lei Xie, Shinji Watanabe, and Sanjeev Khudanpur, "Espresso: A fast end-to-end neural speech recognition toolkit," in Proc. ASRU, 2019.
[32] Yiwen Shao, Yiming Wang, Daniel Povey, and Sanjeev Khudanpur, "PyChain: A fully parallelized PyTorch implementation of LF-MMI for end-to-end ASR," in Proc. Interspeech, 2020.
[33] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in Proc. ASRU, 2011.