Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting
Ming Sun, Anirudh Raju, George Tucker, Sankaran Panchapagesan, Gengshen Fu, Arindam Mandal, Spyros Matsoukas, Nikko Strom, Shiv Vitaladevuni
Amazon.com, Cambridge, MA, USA · Amazon.com, Sunnyvale, CA, USA · Google Brain, Mountain View, CA, USA · Amazon.com, Seattle, WA, USA
{mingsun,ranirudh,panchi,gengshef,arindamm,matsouka,nikko,shivnaga}@amazon.com, [email protected]

ABSTRACT
We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. The max-pooling loss training can further be guided by initializing with a cross-entropy loss trained network. A posterior smoothing based evaluation approach is employed to measure keyword spotting performance. Our experimental results show that LSTM models trained using either cross-entropy loss or max-pooling loss outperform a cross-entropy loss trained baseline feed-forward Deep Neural Network (DNN). In addition, the max-pooling loss trained LSTM with a randomly initialized network performs better than the cross-entropy loss trained LSTM. Finally, the max-pooling loss trained LSTM initialized with a cross-entropy pre-trained network shows the best performance, yielding a 67.6% relative reduction in the Area Under the Curve (AUC) measure compared to the baseline feed-forward DNN.

Index Terms — LSTM, keyword spotting, max-pooling loss, small-footprint
1. INTRODUCTION
Keyword spotting has been an active research area for decades. Different approaches have been proposed to detect words of interest in speech utterances. As one solution, a general large vocabulary continuous speech recognition (LVCSR) system is applied to decode the audio signal, and keyword searching is conducted in the resulting lattices or confusion networks [1, 2, 3, 4]. These methods require relatively high computational resources for the LVCSR decoding, and also introduce latency.

Small-footprint keyword spotting systems have been attracting increasing attention. Voice assistant systems such as Alexa on Amazon Echo deploy a keyword spotting system on device, and only stream audio to the cloud for LVCSR when the keyword is detected on device. For such applications, accurate on-device keyword spotting running with low CPU and memory usage is critical [5]. It needs to run with high recall to make devices easy to use, while having low false accepts to mitigate privacy concerns. Latency has to be low as well. A traditional approach employs Hidden Markov Models (HMMs) to model both the keyword and the background [6, 7, 8]. The background includes non-keyword speech, non-speech noise, etc. This background model is also called a filler model in some of the literature. It may involve loops over simple speech/non-speech phones, or, in more complicated cases, the normal phone set or a set of confusable words. Viterbi decoding is used to search for the best path in the decoding graph. The keyword spotting decision can be made by comparing the likelihoods of the keyword and background models. In the past, a Gaussian Mixture Model (GMM) was commonly used to model the observed acoustic features.

*Work conducted while the author was at Amazon.com.
With DNNs becoming mainstream for acoustic modeling, this approach can be extended to include discriminative information by incorporating a hybrid DNN-HMM decoding framework [9]. In recent years, keyword spotting systems have been built directly on DNNs or Convolutional Neural Networks (CNNs), with no HMM involved [10, 11, 12, 13]. During decoding, framewise keyword posteriors are smoothed, and the system is triggered when the smoothed keyword posteriors exceed a pre-defined threshold. The trade-off between false rejects and false accepts can be balanced by tuning this threshold. Context information is captured by stacking frames as input. Some keyword spotting systems are built directly on Recurrent Neural Networks (RNNs). In particular, bidirectional LSTMs have been used to search for keywords in audio streams when latency is not a hard constraint [14, 15, 16, 17].

We are interested in a small-footprint keyword spotting system that runs with low CPU and memory utilization, and with low latency. The low latency constraint makes a bidirectional LSTM unsuitable in principle. Instead, we focus on training a unidirectional LSTM model using two different loss functions: cross-entropy loss and a max-pooling based loss [18]. Applying the max-pooling loss function to LSTM training for keyword spotting is the main contribution of this paper. During decoding, the system is triggered when the keyword posterior, smoothed by averaging the output over a sliding window, is above a threshold. Considering the practical use case, our keyword spotting system is designed to lock out for some time after each detection, to avoid unnecessary false accepts and to reduce decoding computational cost.

The remainder of this paper is organized as follows: Section 2 describes our LSTM based keyword spotting system, including the LSTM model, the training loss functions and the performance evaluation details. Experimental setup and results are given in Section 3.
Section 4 concludes and outlines future work.
2. SYSTEM OVERVIEW
As shown in Figure 1, Log Mel Filter-Bank Energies (LFBEs) are used as input acoustic features for our keyword spotting system. We extract 20-dimensional LFBEs over 25 ms frames with a 10 ms frame shift. The LSTM model processes the input LFBEs. Our system has two targets in the output layer: non-keyword and keyword. The output of the keyword spotting system is passed to an evaluation module for decision making.
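As a rough sketch of the 25 ms / 10 ms framing described above (a minimal NumPy illustration, not the system's actual front end; the 16 kHz sample rate is an assumption, and the mel filterbank step is omitted):

```python
import numpy as np

def frame_count(num_samples, sample_rate=16000, win_ms=25, shift_ms=10):
    """Number of full 25 ms analysis windows at a 10 ms shift."""
    win = sample_rate * win_ms // 1000      # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000  # 160 samples at 16 kHz
    return 0 if num_samples < win else 1 + (num_samples - win) // shift

def frame_signal(signal, sample_rate=16000, win_ms=25, shift_ms=10):
    """Slice a 1-D waveform into overlapping analysis frames (one per row)."""
    win = sample_rate * win_ms // 1000
    shift = sample_rate * shift_ms // 1000
    n = frame_count(len(signal), sample_rate, win_ms, shift_ms)
    return np.stack([signal[i * shift:i * shift + win] for i in range(n)])

frames = frame_signal(np.zeros(16000))  # one second of 16 kHz audio
print(frames.shape)                     # (98, 400)
```

Each row would then be windowed, transformed, and passed through a 20-band mel filterbank followed by a log, producing one 20-dimensional LFBE vector per frame.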
Fig. 1. Keyword spotting system
2.1. LSTM

Unlike feed-forward DNNs, RNNs contain cyclic connections that can be used to model sequential data. This makes RNNs a natural fit for modeling temporal information within continuous speech frames. However, traditional RNN structures suffer from the vanishing gradient problem, which prevents them from effectively modeling long context in the data. To overcome this, LSTMs contain memory blocks [19, 20]. Each block contains one or more memory cells, as well as input, output and forget gates. These three gates control the information flow within the associated memory block. Sometimes a projection layer is added on top of the LSTM output to reduce model complexity [21]. A typical LSTM component with a projection layer is shown in Figure 2. For the sake of clarity, a single LSTM block is shown.
Fig. 2. Architecture of LSTM with projection layer

Given a sequence of T frames X = (x_1, ..., x_T), let i, o, f, c denote the input gate, output gate, forget gate, and memory cell, and let Y = (y_1, ..., y_T) be the output. The LSTM computes the gate activations and output at time t as follows:

i_t = σ(W_ix x_t + W_ir r_{t-1} + W_ic c_{t-1} + b_i)
f_t = σ(W_fx x_t + W_fr r_{t-1} + W_fc c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_cx x_t + W_cr r_{t-1} + b_c)
o_t = σ(W_ox x_t + W_or r_{t-1} + W_oc c_t + b_o)
m_t = o_t ⊙ h(c_t)
r_t = W_rm m_t
y_t = φ(W_yr r_t + b_y)    (1)

Here the W_* matrices denote the connection weights. For example, W_ix, W_ir and W_ic represent the weight matrices from the input x, the recurrent feedback r and the cell c, respectively. Note that the peephole connections W_ic, W_fc and W_oc are diagonal matrices. The b_* terms represent the bias vectors for the different components of the model; e.g., b_i is the bias for the input gate activation.

A projection layer is added to the LSTM output. That is, W_rm linearly maps m_t to a lower dimensional representation r_t, which serves as the recurrent signal. The network output y_t is computed from the projection layer output r_t as well. Regarding the activation functions, we use the logistic sigmoid as σ(·) for the gate activations, tanh as g(·) and h(·) for the cell input and output, and softmax as φ(·) for the output layer. ⊙ denotes the element-wise product of vectors.

Finally, the number of weight parameters of the model described above can be calculated as

n = n_c × n_r × 4 + n_i × n_c × 4 + n_r × n_o + n_c × n_r + n_c × 3    (2)

where n_c is the number of memory cells (we only consider the case of a single memory cell per block, so n_c is also the number of memory blocks), n_r is the dimension of the projection layer, and n_i and n_o denote the input and output dimensions, respectively.

2.2. Training loss functions

For our experiments, we consider two different types of loss functions: cross-entropy loss and max-pooling loss.
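The LSTMP recursion and the parameter count above can be sketched numerically as follows (a minimal NumPy illustration; the weight-dictionary layout, dimensions, and random initialization are assumptions of this demo, and the diagonal peepholes W_ic, W_fc, W_oc are stored as vectors):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lstmp_step(x_t, r_prev, c_prev, W, b):
    """One step of the LSTM-with-projection recursion defined above."""
    i_t = sigmoid(W['ix'] @ x_t + W['ir'] @ r_prev + W['ic'] * c_prev + b['i'])
    f_t = sigmoid(W['fx'] @ x_t + W['fr'] @ r_prev + W['fc'] * c_prev + b['f'])
    c_t = f_t * c_prev + i_t * np.tanh(W['cx'] @ x_t + W['cr'] @ r_prev + b['c'])
    o_t = sigmoid(W['ox'] @ x_t + W['or'] @ r_prev + W['oc'] * c_t + b['o'])
    m_t = o_t * np.tanh(c_t)                # cell output
    r_t = W['rm'] @ m_t                     # projection = recurrent signal
    y_t = softmax(W['yr'] @ r_t + b['y'])   # network output
    return y_t, r_t, c_t

def lstmp_num_params(n_c, n_r, n_i, n_o):
    """Weight count of the model (biases omitted, as in the count above)."""
    return n_c * n_r * 4 + n_i * n_c * 4 + n_r * n_o + n_c * n_r + n_c * 3

rng = np.random.default_rng(0)
n_i, n_c, n_r, n_o = 20, 64, 32, 2  # dims chosen to mirror the paper's setup
shapes = {'ix': (n_c, n_i), 'ir': (n_c, n_r), 'ic': (n_c,),
          'fx': (n_c, n_i), 'fr': (n_c, n_r), 'fc': (n_c,),
          'cx': (n_c, n_i), 'cr': (n_c, n_r),
          'ox': (n_c, n_i), 'or': (n_c, n_r), 'oc': (n_c,),
          'rm': (n_r, n_c), 'yr': (n_o, n_r)}
W = {k: rng.normal(0.0, 0.1, s) for k, s in shapes.items()}
b = {k: np.zeros(n) for k, n in [('i', n_c), ('f', n_c), ('c', n_c), ('o', n_c), ('y', n_o)]}
y, r, c = lstmp_step(rng.normal(size=n_i), np.zeros(n_r), np.zeros(n_c), W, b)
```

Note that the closed-form count agrees exactly with the total size of the weight matrices instantiated in the demo: four input matrices, four recurrent matrices, three diagonal peepholes, the projection, and the output layer.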
Cross-entropy (xent) has been widely applied as a loss function for DNN and RNN training [22]. Let K be the total number of classes. Given a sequence of T frames X = (x_1, ..., x_T), where x_t is the feature vector of the t-th frame, let y_t = (y_t^1, ..., y_t^K) denote the K-dimensional output of the network for x_t, and let z_t = (z_t^1, ..., z_t^K) denote the corresponding target vector. The cross-entropy loss for the t-th frame is calculated as follows:

L_t^xent = − Σ_{k=1}^{K} z_t^k ln y_t^k    (3)

The 1-of-K coding is usually used for the target vector z_t. That is, if the t-th frame vector x_t is aligned with class k, the K-dimensional vector z_t has value 1 for the k-th element, with all other elements being 0. Let k_t^† denote the aligned class for the t-th frame. The cross-entropy loss for the t-th frame can then be formulated as

L_t^xent = − ln y_t^{k_t^†}    (4)

The cross-entropy loss for the whole T-frame sequence is then:

L_T^xent = Σ_{t=1}^{T} L_t^xent = − Σ_{t=1}^{T} ln y_t^{k_t^†}    (5)

We propose to train the LSTM for keyword spotting using a max-pooling based loss function. Given that the LSTM has the ability to model long context information, we hypothesize that there is no need to teach the LSTM to fire at every frame within the keyword segment. Instead, we want to teach the LSTM to fire at its highest-confidence time. The LSTM should in general fire near the end of the keyword segment, where it has seen enough context to make a decision. A simple approach is to back-propagate the loss only from the last frame, or the last several frames, when updating the weights. However, our initial experiments indicate that the LSTM does not learn much from this scheme. Hence we employ a max-pooling based loss function to let the LSTM pick the most informative keyword frames to teach itself. This also helps mitigate issues potentially caused by inaccurate frame alignment around keyword segment boundaries.
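A minimal sketch of the sequence cross-entropy loss of Eq. (5), assuming framewise posteriors and 1-of-K targets given as class indices:

```python
import numpy as np

def xent_loss(posteriors, targets):
    """Eq. (5): sum over frames of -ln of the posterior assigned to the
    aligned class. posteriors is (T, K); targets holds k_t per frame."""
    T = len(targets)
    return float(-np.log(posteriors[np.arange(T), targets]).sum())

# Two frames, K = 2 classes (0 = background, 1 = keyword).
p = np.array([[0.9, 0.1],
              [0.2, 0.8]])
loss = xent_loss(p, np.array([0, 1]))  # -ln(0.9) - ln(0.8)
```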
Max-pooling loss can be viewed as a transition from a frame-level loss to a segment-level loss for keyword spotting model training. Alternative segment-level loss functions include other statistics of the frame-level keyword posteriors within a keyword segment, e.g., the geometric mean. There is also literature on training LSTMs with Connectionist Temporal Classification (CTC) [14, 15, 16, 23] for keyword spotting tasks. In addition, architectures that combine LSTMs and CNNs have been applied to different tasks [24, 25]. Typically the LSTM is added on top of CNN layers, where the CNN layers with pooling are used to extract features as LSTM input, and the LSTM output is used for prediction.

Let Q denote the cardinality of the target keyword set. When we consider word-level labels, there are in total Q + 1 classes (K = Q + 1), with one additional class used to label the frames aligned with background. For the T input frames X = (x_1, ..., x_T), if there are P keyword instances inside, we use l_p to denote the continuous frame index range whose frames are aligned with the p-th keyword. As a result, the K-dimensional target vector z_p is the same for all frames within l_p. Let L = (l_1, ..., l_P) represent the collection of frame index ranges for all P keyword instances in X, and let L̂ be the collection of indices of the remaining frames, which are not aligned to any keyword (i.e., background frame indices). We use k_p^† to represent the target label for frames inside l_p, and l_p^† to label the specific frame within l_p whose posterior for k_p^† is the maximum. The max-pooling loss proposed for the input sequence X can then be calculated as

L_T^maxpool = − Σ_{t ∈ L̂} ln y_t^{k_t^†} − Σ_{p=1}^{P} ln y_{l_p^†}^{k_p^†}    (6)

The first term states that we calculate the cross-entropy loss for input frames not aligned to any keyword.
The second term shows how we max-pool over keyword-aligned frames. In more detail, the frames of the p-th segment (index range l_p) are aligned to keyword k_p^†. We only back-propagate for the single frame (index l_p^†) whose posterior for target k_p^† is the largest among all frames within the current segment l_p, and discard all other frames within that segment.

The idea of max-pooling loss is shown in Figure 3, where filled frames are aligned with the keywords, and empty frames are background. Given an input sequence of frames, within each keyword segment only the frame with the maximum posterior for the corresponding keyword target is kept, while all other frames within the same keyword segment are discarded. All background frames are kept.

Fig. 3. Idea of max-pooling loss

We consider two cases for max-pooling loss based LSTM training: one starts with a randomly initialized model, and the other uses a cross-entropy loss pre-trained model. With a randomly initialized model, max-pooling loss based LSTM training may not learn well in the first few epochs, with rather random keyword firing. The idea is therefore to take advantage of both cross-entropy and max-pooling loss training: a cross-entropy trained LSTM used as the initial model for max-pooling training has already learned some basic knowledge about the target keywords. This could provide a better initialization point, and faster convergence to a better local optimum.

2.3. Evaluation method

We consider a posterior smoothing based evaluation scheme. To detect the keyword in given input audio, the system computes smoothed posteriors based on a sliding context window containing N_ctx frames. When the smoothed posterior for the keyword exceeds a pre-defined threshold, this is counted as a firing spike. The system is then designed to shut down for the following N_lck frames.
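Returning to the max-pooling loss of Eq. (6), a minimal sketch (the segment and background bookkeeping, and the background class index, are simplifying assumptions of this demo):

```python
import numpy as np

def maxpool_loss(posteriors, segments, background_idx, bg_class=0):
    """Eq. (6): full cross-entropy on background frames, plus -ln of the
    single best-scoring frame per keyword segment. posteriors is (T, K);
    segments is a list of (frame_indices, keyword_class) pairs."""
    loss = -np.log(posteriors[background_idx, bg_class]).sum()
    for idx, k in segments:
        loss -= np.log(posteriors[idx, k].max())  # keep only the max frame
    return float(loss)

# Six frames; frames 2-4 belong to one keyword (class 1), the rest background.
p = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7],
              [0.1, 0.9], [0.4, 0.6], [0.7, 0.3]])
loss = maxpool_loss(p, [(np.arange(2, 5), 1)], np.array([0, 1, 5]))
# Equals -ln(0.9) - ln(0.8) - ln(0.7) - ln(0.9): only frame 3 of the segment counts.
```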
This lockout period of length N_lck serves to reduce unnecessary duplicated detections during the same keyword segment, as well as to reduce decoding computational cost.

For our use case, we also allow a short latency period of N_lat frames after each keyword segment. That is, if the system fires within the N_lat-frame window right after a keyword segment, we still consider the firing as being aligned with the corresponding keyword. This latency window does not introduce a perceptible delay, and it mitigates possible issues from inaccurate keyword alignment boundaries in evaluation.

Finally, the first firing spike within each keyword segment plus its latency window is considered a valid detection. Any other firing spikes, whether within a keyword segment plus latency window that already fired, or outside any keyword segment plus latency window, are counted as false accepts. Two metrics are used to measure system performance: miss rate, which is one minus recall, and false accept rate, which is a normalized count of false accepts.

Figure 4 illustrates our evaluation approach with two example input audio streams. The keyword segment length varies depending on how the keyword is spoken. Each keyword segment is followed by a fixed-length latency window. The keyword segments are labeled by blocks with vertical line fill, while the follow-on latency windows are labeled by blocks with horizontal line fill. There is a system lockout period by design after each firing spike. For the first audio stream, there are two false accepts (FAs), where the system fires in a region outside any keyword segment plus latency window. The true accepts (TAs) occur as the first detection in each keyword segment plus latency window. True accepts can occur either in the keyword segment or in the following latency window.
For the second audio stream, false accepts occur as additional firing spikes within a keyword segment plus latency window that already has a true accept. For our system, we use 30 frames (N_ctx = 30) for posterior smoothing, 40 frames (N_lck = 40) as the lockout period, and 20 frames (N_lat = 20) as the allowed latency window length after each aligned wake word segment.
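The smoothing, thresholding, lockout, and latency logic above can be sketched as follows (a simplified single-keyword illustration using the paper's N_ctx = 30, N_lck = 40 and N_lat = 20 as defaults; the threshold and the toy posterior trace are arbitrary):

```python
import numpy as np

def detect(keyword_post, threshold, n_ctx=30, n_lck=40):
    """Fire when the keyword posterior, averaged over a sliding n_ctx-frame
    window, exceeds the threshold; then lock out for n_lck frames."""
    spikes, t = [], 0
    while t < len(keyword_post):
        lo = max(0, t - n_ctx + 1)
        if keyword_post[lo:t + 1].mean() > threshold:
            spikes.append(t)
            t += n_lck  # lockout: suppress duplicate firings
        else:
            t += 1
    return spikes

def is_true_accept(spike, segment_start, segment_end, n_lat=20):
    """A spike aligns with a keyword if it falls inside the segment or
    within the n_lat-frame latency window right after it."""
    return segment_start <= spike <= segment_end + n_lat

post = np.zeros(200)
post[50:90] = 1.0  # a 40-frame keyword region with saturated posterior
spikes = detect(post, threshold=0.5)
print(spikes)  # [65]: fires once the window majority is keyword, then locks out
```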
3. EXPERIMENTAL RESULTS
For our experiments, the word 'Alexa' is chosen as the keyword. We use an in-house far-field corpus which contains far-field data collected under different conditions. This dataset contains an order of magnitude more instances of keyword and background speech utterances than the largest previous studies [10, 12], for both training and testing. Our data is collected in a far-field environment, which is a more challenging task by nature. Given the large size of our corpus, the development set partition is sufficient to tune parameters, and the test set partition is large enough to show strong statistical differences.

Fig. 4. Illustration of evaluation method

Since we target only the single keyword 'Alexa', a binary target set is used for our experiments. Frames of background have target label 0, while frames aligned with the keyword have target label 1. We train a feed-forward DNN model as the baseline, based on the model structure and training described in [10], with some adaptations to our experimental setup and use case. We compare it with the LSTM models trained with cross-entropy loss and max-pooling loss.

The GPU-based distributed trainer described in [26] is used for our experiments. A performance-based learning rate schedule is used for model training. To elaborate, for each training epoch, if the loss on the dev set degrades compared to the previous epoch, the learning rate is halved and the current epoch is repeated with the reduced rate. The training process terminates when either the minimal learning rate (a preset fraction of the initial learning rate) or the maximum number of training epochs is reached. The initial learning rate and batch size are tuned on the development set.

The baseline feed-forward DNN has four hidden layers, with 128 nodes per hidden layer. The sigmoid function is used as the activation. A stack of 20 frames on the left and 10 frames on the right is used to form the input feature vector.
Note that the right context cannot be too large, since it introduces latency. Layerwise pre-training is used for the DNN, with its initial learning rate and batch size tuned on the development set.

For LSTM training with the different loss functions, we use a single layer of unidirectional LSTM with 64 memory blocks and a projection layer of dimension 32, which serves the purpose of low CPU and memory usage as well as low latency. For input context, we still stack left-context frames for the LSTM input, even though the LSTM learns past frames' information by definition; this aligns the DNN and LSTM training setups for better comparison, and further imposes past information on LSTM training. For random initialization, the LSTM parameters are initialized with a uniform distribution for the weights and a small constant for the biases. The initial learning rates are tuned separately for the three cases: cross-entropy loss, max-pooling loss with a randomly initialized model, and max-pooling loss initialized with a cross-entropy pre-trained model.

We use the evaluation approach described in Section 2.3 on our test dataset. The performance of the DNN and LSTM models is shown in Figure 5. We plot detection error trade-off (DET) curves in a low miss rate range. Here the false accept rate is computed by normalizing the false accept counts by the total number of test data utterances. The x-axis shows the false accept rate and the y-axis the miss rate; lower curves indicate better performance. The blue solid curve represents the baseline feed-forward DNN trained using cross-entropy loss. The LSTM models trained using cross-entropy loss, max-pooling loss with random initialization, and max-pooling loss with cross-entropy pre-training are labeled by the green dashed, red dash-dot and cyan dotted curves, respectively.
Absolute numbers of false accepts have been obscured in this paper for confidentiality reasons; instead, we plot false accept rates up to a multiplicative constant.

Fig. 5. Performance of DNN and LSTM models

The false accept range considered in our experiments is aligned with the low range that would be considered for production deployment. In the selected low miss rate range, the LSTM models outperform the baseline feed-forward DNN. In terms of the different loss functions for LSTM training, max-pooling loss with random initialization is superior to cross-entropy loss, and the LSTM trained using max-pooling loss with cross-entropy pre-training yields the best results. We compute Area Under the Curve (AUC) numbers for a quantitative comparison of the different models. AUC is computed on the DET curves, hence lower is better. The relative changes in AUC for the LSTM models compared to the baseline DNN are summarized in Table 1. Our experimental results indicate that, in the selected low miss rate range, the cross-entropy loss trained LSTM reduces AUC relative to the cross-entropy loss trained baseline DNN, the LSTM trained using max-pooling loss with random initialization reduces it further, and the best performance comes from the LSTM trained using max-pooling loss with cross-entropy pre-training, which yields a 67.6% relative AUC reduction compared to the baseline DNN.

Table 1. Relative change of AUC, compared to the baseline feed-forward DNN, for the LSTM trained with cross-entropy loss, with max-pooling loss from random initialization, and with max-pooling loss from cross-entropy pre-training. Lower AUC indicates better performance.
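AUC over a DET curve can be computed with a simple trapezoid rule, as sketched below (the operating points are made up for illustration; the paper's actual curves and AUC values are not reproduced here):

```python
import numpy as np

def det_auc(fa_rates, miss_rates):
    """Trapezoid-rule area under a DET curve (miss rate vs. false accept
    rate); points must be sorted by false accept rate. Lower is better."""
    fa = np.asarray(fa_rates, dtype=float)
    miss = np.asarray(miss_rates, dtype=float)
    return float(((miss[:-1] + miss[1:]) / 2.0 * np.diff(fa)).sum())

# Hypothetical operating points: the second curve hugs the axes more tightly.
baseline = det_auc([0.0, 1.0, 2.0], [0.30, 0.20, 0.10])  # 0.40
improved = det_auc([0.0, 1.0, 2.0], [0.15, 0.08, 0.04])  # 0.175
rel_change = (improved - baseline) / baseline  # negative means improvement
```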
4. CONCLUSION AND FUTURE WORK
We present our work on training a small-footprint LSTM to spot the keyword 'Alexa' in far-field conditions. Two loss functions are employed for LSTM training: one is cross-entropy loss, and the other is the max-pooling loss proposed in this paper. A smoothed posterior thresholding approach is used for evaluation. Keyword spotting performance is measured using miss rate and false accept rate. We show that the LSTM generally performs better than the DNN. The best LSTM system, which is trained using max-pooling loss with cross-entropy loss pre-training, reduces the AUC by 67.6% relative in the low miss rate range.

For future work, we plan to add weighting to max-pooling loss based LSTM training, i.e., to scale the back-propagated loss for the selected keyword frames. It is of interest to see whether LSTM performance can be further improved by varying the model structure, e.g., adding feed-forward layers on top of the LSTM component. We also plan to benchmark max-pooling loss against other segment-level loss functions, e.g., the geometric mean of framewise keyword posteriors within each keyword segment, CTC, etc., in our keyword spotting experiments.
5. REFERENCES

[1] Miller, D.R., Kleber, M., Kao, C.L., Kimball, O., Colthurst, T., Lowe, S.A., Schwartz, R.M., and Gish, H., "Rapid and accurate spoken term detection", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2007.
[2] Parlak, S. and Saraclar, M., "Spoken term detection for Turkish broadcast news", in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5244-5247, 2008.
[3] Chen, G., Yilmaz, O., Trmal, J., Povey, D. and Khudanpur, S., "Using proxies for OOV keywords in the keyword search task", in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 416-421, 2013.
[4] Tsakalidis, S., Hsiao, R., Karakos, D., Ng, T., Ranjan, S., Saikumar, G., Zhang, L., Nguyen, L., Schwartz, R. and Makhoul, J., "The 2013 BBN Vietnamese telephone speech keyword spotting system", in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7829-7833, 2014.
[5] Sun, M., Nagaraja, V., Hoffmeister, B. and Vitaladevuni, S., "Model shrinking for embedded keyword spotting", in IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 2015.
[6] Rose, R.C. and Paul, D.B., "A hidden Markov model based keyword recognition system", in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 129-132, 1990.
[7] Wilpon, J.G., Rabiner, L., Lee, C.H. and Goldman, E.R., "Automatic recognition of keywords in unconstrained speech using hidden Markov models", IEEE Transactions on Acoustics, Speech and Signal Processing, 38(11):1870-1878, 1990.
[8] Wilpon, J.G., Miller, L.G. and Modi, P., "Improvements and applications for key word recognition using hidden Markov modeling techniques", in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 309-312, 1991.
[9] Panchapagesan, S., Sun, M., Khare, A., Matsoukas, S., Mandal, A., Hoffmeister, B., and Vitaladevuni, S., "Multi-task learning and weighted cross-entropy for DNN-based keyword spotting", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2016.
[10] Chen, G., Parada, C. and Heigold, G., "Small-footprint keyword spotting using deep neural networks", in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4087-4091, 2014.
[11] Nakkiran, P., Alvarez, R., Prabhavalkar, R. and Parada, C., "Compressing deep neural networks using a rank-constrained topology", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2015.
[12] Sainath, T. and Parada, C., "Convolutional neural networks for small-footprint keyword spotting", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2015.
[13] Tucker, G., Wu, M., Sun, M., Panchapagesan, S., Fu, G. and Vitaladevuni, S., "Model compression applied to small-footprint keyword spotting", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2016.
[14] Fernandez, S., Graves, A. and Schmidhuber, J., "An application of recurrent neural networks to discriminative keyword spotting", in Artificial Neural Networks - ICANN, pp. 220-229, 2007.
[15] Wollmer, M., Schuller, B. and Rigoll, G., "Keyword spotting exploiting long short-term memory", Speech Communication, 55(2), pp. 252-265, 2013.
[16] Baljekar, P., Lehman, J.F., and Singh, R., "Online word-spotting in continuous speech with recurrent neural networks", in IEEE Spoken Language Technology Workshop (SLT), 2014.
[17] Sundar, H., Lehman, J.F. and Singh, R., "Keyword spotting in multi-player voice driven games for children", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2015.
[18] Scherer, D., Muller, A. and Behnke, S., "Evaluation of pooling operations in convolutional architectures for object recognition", in Proceedings of the International Conference on Artificial Neural Networks (ICANN), pp. 92-101, 2010.
[19] Hochreiter, S. and Schmidhuber, J., "Long short-term memory", Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[20] Gers, F.A., Schraudolph, N.N. and Schmidhuber, J., "Learning precise timing with LSTM recurrent networks", Journal of Machine Learning Research, vol. 3, pp. 115-143, 2002.
[21] Sak, H., Senior, A.W. and Beaufays, F., "Long short-term memory recurrent neural network architectures for large scale acoustic modeling", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2014.
[22] Bishop, C., "Pattern Recognition and Machine Learning", Springer, 2006.
[23] Graves, A., Fernandez, S., Gomez, F. and Schmidhuber, J., "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks", in Proceedings of the International Conference on Machine Learning (ICML), 2006.
[24] Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R. and Toderici, G., "Beyond short snippets: deep networks for video classification", in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[25] Xu, Z., Li, S. and Deng, W., "Learning temporal features using LSTM-CNN architecture for face anti-spoofing", in Proceedings of the IAPR Asian Conference on Pattern Recognition (ACPR), 2015.
[26] Strom, N., "Scalable distributed DNN training using commodity GPU cloud computing", in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2015.