End-to-End ASR-free Keyword Search from Speech
Kartik Audhkhasi, Andrew Rosenberg, Abhinav Sethy, Bhuvana Ramabhadran, Brian Kingsbury
IBM Watson, IBM T. J. Watson Research Center, Yorktown Heights, New York
ABSTRACT
End-to-end (E2E) systems have achieved competitive results compared to conventional hybrid hidden Markov model (HMM)-deep neural network based automatic speech recognition (ASR) systems. Such E2E systems are attractive because they do not depend on alignments between the input acoustic sequence and the output grapheme or HMM state sequence during training. This paper explores the design of an ASR-free end-to-end system for text query-based keyword search (KWS) from speech, trained with minimal supervision. Our E2E KWS system consists of three sub-systems. The first sub-system is a recurrent neural network (RNN)-based acoustic auto-encoder trained to reconstruct the audio through a finite-dimensional representation. The second sub-system is a character-level RNN language model using embeddings learned from a convolutional neural network. Since the acoustic and text query embeddings occupy different representation spaces, they are input to a third feed-forward neural network that predicts whether the query occurs in the acoustic utterance or not. This E2E ASR-free KWS system performs respectably despite lacking a conventional ASR system, and trains much faster.
Index Terms — End-to-end systems, neural networks, keyword search, automatic speech recognition
1. INTRODUCTION
Deep neural networks (DNNs) have pushed the state of the art for automatic speech recognition (ASR) systems [1]. This has led to significant performance improvements on several well-known ASR benchmarks such as Switchboard [2, 3]. End-to-end (E2E) or fully-neural architectures have become an alternative to the hybrid hidden Markov model (HMM)-DNN architecture. These include the connectionist temporal classification [4, 5] loss-based recurrent neural network (RNN) and attention-based RNNs [6, 7].

Automatic speech recognition is often not the end goal of real-world speech information processing systems. Instead, an important end goal is information retrieval, in particular keyword search (KWS), which involves retrieving the speech utterances containing a user-specified text query from a large database. Conventional KWS from speech uses an ASR system as a front-end that converts the speech database into a finite-state transducer (FST) index containing all hypothesized word sequences with their associated confidence scores and time stamps [8]. The user-specified text query is then composed with this FST index to find putative keyword locations and confidence scores.
The authors thank Dogan Can and Shrikanth Narayanan of the Signal Analysis and Interpretation Lab, University of Southern California for useful discussions. This paper uses the IARPA-Babel404b-v1.0a full language pack. Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD/ARL) contract number W911NF-12-C-0012. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
Training a good ASR system is time-consuming and requires a substantial amount of transcribed audio data. The main novelty of this paper is an end-to-end ASR-free KWS system motivated by the recent success of E2E systems for ASR. Our fully-neural E2E KWS system lacks both an ASR system and an FST index, which makes it fast to train. We train our system in only 2 hours with minimal supervision and do not need fully transcribed training audio. Our system performs significantly better than chance and respectably compared to a state-of-the-art hybrid HMM-DNN ASR-based system, which takes over 72 hours to train.

The next section gives an overview of related prior work. Section 2 introduces our E2E ASR-free system by discussing its three constituent sub-systems: an RNN-based acoustic encoder-decoder, a convolutional neural network (CNN)-RNN character language model (LM), and a feed-forward KWS network. Section 3 discusses the experimental setup, training of the E2E KWS system, and analysis of the results. Section 4 gives directions for future work.
1.1. Related Prior Work

Most prior work relevant to this paper has focused on query-by-example (QbyE) retrieval of speech from a database. The user provides a speech utterance of the query to be searched, in contrast to the text query used in the KWS setup of this paper. Dynamic time warping (DTW) of acoustic features extracted from the speech query and speech utterances from the database is a classic technique in such QbyE systems [9, 10]. The cost of DTW alignment serves as a matching score for retrieval.

Chen, Parada, and Sainath [11] present a system for QbyE where the audio database contains examples of certain key-phrases, such as "hello genie". The last k state vectors from the final hidden layer of an RNN acoustic model give a fixed-dimensional representation for both the speech query and each utterance in the speech database. The KWS detector then computes cosine similarity between the speech query and utterance representations. Levin et al. [12] compare several non-neural-network techniques for computing fixed-dimensional representations of speech segments for QbyE, including principal component analysis and Laplacian eigenmaps. Chung et al. [13] also present a similar QbyE system using an acoustic RNN encoder-decoder network. Kamper, Wang, and Livescu [14] use a Siamese convolutional neural network (CNN) to obtain acoustic embeddings of the audio query and utterances from the database, and train this network by minimizing a triplet hinge loss.

Our paper is closely related to the recent work of Palaz, Synnaeve, and Collobert [15], where the authors propose a CNN-based ASR system trained on a bag of words in the speech utterance. The CNN emits a time sequence of posterior distributions over the vocabulary, which is then aggregated over time to produce an estimate of the bag of words. However, we note that their system is trained on a fixed vocabulary of 1000 words, whereas our proposed open-vocabulary system does not use word identity. In addition, their training examples use the stronger supervision of word identity, compared with the much weaker supervision in our proposed model. The next section presents the architecture of our E2E ASR-free KWS system.
2. END-TO-END ASR-FREE KWS ARCHITECTURE
Our E2E ASR-free KWS system is philosophically similar to a conventional hybrid ASR-based KWS system, with three sub-systems that model the acoustics, language, and keyword search. However, there are several differences in the structures of these sub-systems and their training.
2.1. Acoustic RNN Auto-Encoder

Motivated by prior work [11, 13] on computing fixed-dimensional representations from a variable-length sequence of acoustic feature vectors, we use an RNN-based auto-encoder as shown in Figure 1. The encoder processes $T$ acoustic feature vectors $(\mathbf{x}_1, \ldots, \mathbf{x}_T)$ with a uni-directional RNN with gated recurrent unit (GRU) [16] hidden units unrolled over $T$ time steps. A fully-connected layer with weight matrix $\mathbf{W}$ and $D$ rectified linear units (ReLUs), denoted by $g$, then processes the hidden state vector $\mathbf{h}^e_T$ from time step $T$. A decoder GRU-RNN takes the resulting $D$-dimensional acoustic representation $g(\mathbf{W}\mathbf{h}^e_T)$ as input at each time step $t \in \{1, \ldots, T\}$ to reconstruct the original sequence of $T$ acoustic feature vectors.

Fig. 1. This figure shows an RNN acoustic auto-encoder for a $T$-length input sequence of acoustic feature vectors through a $D$-dimensional encoded representation $g(\mathbf{W}\mathbf{h}^e_T)$, where $g$ denotes a ReLU activation function.

We use the $D$-dimensional output $g(\mathbf{W}\mathbf{h}^e_T)$ of the RNN encoder as the vector representation of the input acoustic feature vector sequence. This "acoustic model" does not use any transcribed speech data for training, in contrast with an acoustic model in a conventional ASR system. This is in line with our overall goal of making the entire KWS system ASR-free and trainable with less supervision. The next section describes the language model used to produce a fixed-dimensional representation of the text query.

2.2. CNN-RNN Character Language Model

We use the CNN-based character RNN-LM architecture from Kim et al. [17] for deriving query embeddings. Figure 2 shows this LM for the simple case of 2 convolutional masks. We map the input sequence of $N$ characters $(c_1, \ldots, c_N)$ to a matrix of $d$-dimensional character embeddings via a look-up table. Next, $M$ convolutional masks of size $d \times w$ operate on the resulting $d \times N$ embedding matrix to produce a set of $M$ $N$-dimensional vectors, one per mask. We then perform max-pooling over time on each of these vectors to obtain a scalar per mask, and hence an $M$-dimensional embedding vector. This embedding vector then feeds into a GRU-RNN that predicts one out of $K$ characters at each time step.

Fig. 2. This figure shows a character CNN-RNN LM for encoding text queries. We show two convolutional masks for simplicity.

One key difference between our LM and the one from Kim et al. [17] is that we train our LM to predict a sequence of characters instead of words.
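To make the two encoders concrete, the sketch below re-implements the architecture just described in Keras, the toolkit the paper reports using [19]. It is a minimal sketch, not the authors' code: it targets the modern tf.keras API (the original ran on a Theano backend), dimensions follow Sections 2 and 3 where stated (80-dimensional features, 300 GRU units, 1500-frame padding, 39 graphemes, 50-dimensional character embeddings, 300 masks), and the embedding size D = 300, the mask width w = 5, and the tanh convolution activation are assumptions made only for illustration.

```python
from tensorflow.keras import layers, models

# ----- Acoustic RNN auto-encoder (Fig. 1) -----
T, F, D = 1500, 80, 300    # padded length, feature dim, embedding dim (D = 300 assumed)

feats = layers.Input(shape=(T, F), name="acoustic_features")
masked = layers.Masking(mask_value=0.0)(feats)              # skip zero-padded frames in the encoder
h_T = layers.GRU(300, name="encoder_gru")(masked)           # final hidden state h^e_T
acoustic_embedding = layers.Dense(D, activation="relu",
                                  name="acoustic_embedding")(h_T)   # g(W h^e_T)
repeated = layers.RepeatVector(T)(acoustic_embedding)       # same embedding at every decoder step
decoded = layers.GRU(300, return_sequences=True, name="decoder_gru")(repeated)
recon = layers.TimeDistributed(layers.Dense(F), name="reconstruction")(decoded)

autoencoder = models.Model(feats, recon)
# Mean-squared reconstruction error; the paper also excludes padded frames from
# the loss, which would need per-frame sample weights (not shown here).
autoencoder.compile(optimizer="adam", loss="mse")
acoustic_encoder = models.Model(feats, acoustic_embedding)  # used at KWS time

# ----- CNN-RNN character LM / query encoder (Fig. 2) -----
N, K, d, M, w = 50, 39, 50, 300, 5   # max chars, graphemes, char-emb dim, masks, mask width (w assumed)

chars = layers.Input(shape=(N,), dtype="int32", name="query_characters")
char_emb = layers.Embedding(input_dim=K + 1, output_dim=d)(chars)      # d x N embedding matrix (index 0 = padding)
conv = layers.Conv1D(filters=M, kernel_size=w, padding="same",
                     activation="tanh")(char_emb)                      # M masks of size d x w
query_embedding = layers.GlobalMaxPooling1D(name="query_embedding")(conv)  # max-pool over time -> M-dim vector
dec = layers.GRU(256, return_sequences=True)(layers.RepeatVector(N)(query_embedding))
char_probs = layers.TimeDistributed(layers.Dense(K, activation="softmax"))(dec)

char_lm = models.Model(chars, char_probs)
char_lm.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
query_encoder = models.Model(chars, query_embedding)
```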
The next section presents our overall KWS system, which uses the learned acoustic and query embeddings to predict whether the query occurs in the utterance or not.

2.3. KWS Neural Network

The final block in the overall KWS system is a neural network that takes the speech utterance and text query embeddings as input and predicts whether the query occurs in the utterance or not. This is in contrast to previous work on QbyE in speech, where both the speech utterance and the speech query lie in the same acoustic representation space and cosine similarity is enough to match the two. Figure 3 shows the overall E2E KWS system. We extract the encoders from both the acoustic RNN auto-encoder and the CNN-RNN character LM, and feed their outputs into a feed-forward neural network. In contrast to conventional ASR-based approaches to KWS from speech, the E2E system in Figure 3 is also jointly trainable after the utterance and query encoders have been pre-trained. The next section presents our data preparation, experiments, results, and analysis.
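A minimal sketch of this KWS network follows, under the same assumptions as the earlier encoder sketch. The concatenated embedding size (300 + 300 = 600), the 256-unit ReLU hidden layer, the two-way softmax output, and the 50% dropout come from Sections 2 and 3 of the paper; the joint fine-tuning at the end is shown only as commented-out usage, since it reuses names from the previous sketch.

```python
from tensorflow.keras import layers, models

acoustic_in = layers.Input(shape=(300,), name="acoustic_embedding_in")
query_in = layers.Input(shape=(300,), name="query_embedding_in")
joint = layers.Concatenate()([acoustic_in, query_in])     # 600-dimensional joint vector
joint = layers.Dropout(0.5)(joint)                        # 50% dropout on all layers (Sec. 3)
hidden = layers.Dropout(0.5)(layers.Dense(256, activation="relu")(joint))
occurs = layers.Dense(2, activation="softmax", name="query_occurs")(hidden)

kws_net = models.Model([acoustic_in, query_in], occurs)
kws_net.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Joint fine-tuning sketch: chain the pre-trained encoders and the KWS network
# into one end-to-end model (feats, chars, acoustic_encoder, query_encoder refer
# to the earlier sketch).
# full_model = models.Model([feats, chars],
#                           kws_net([acoustic_encoder(feats), query_encoder(chars)]))
# full_model.compile(optimizer="adam", loss="categorical_crossentropy")
```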
3. EXPERIMENTS, RESULTS, AND ANALYSIS

3.1. Data Description and Preparation
We used the Georgian full language pack from Option Period 3 of the IARPA Babel program for keyword search from low-resource languages. The training data contains 40 hours of transcribed audio corresponding to 45k utterances and a graphemic pronunciation dictionary containing 35k words. We use the 15-hour development audio, 2k in-vocabulary (IV) keywords, and 400 out-of-vocabulary (OOV) keywords for testing. The keywords include both single-word and multi-word queries. Multilingual acoustic features have been successful in the Babel program compared to conventional features such as Mel-frequency cepstral coefficients. We use an 80-dimensional multilingual acoustic front-end [18] trained on all 24 Babel languages from the Base Period to Option Period 3, excluding Georgian.

The training of the acoustic auto-encoder RNN and CNN-RNN character LM does not require special selection of training examples. However, the final KWS neural network requires a set of positive and negative examples to train. We wanted to keep the training of this network independent of the list of test queries. Hence we constructed positive examples by taking all words in the 35k vocabulary and finding a maximum of 100 utterances that contain each word. We then selected an equal number of negative examples by picking utterances that do not contain the particular vocabulary word. We ensured good coverage of the acoustic training data by constraining each acoustic utterance to be picked in only a maximum of 5 positive and negative examples. We also excluded acoustic utterances that did not contain speech. This resulted in 62k positive and 62k negative examples for training the KWS network. We processed the development audio and test queries in similar fashion, and ended up with 7.5k positive and negative examples each for testing. We implemented the system in Keras [19] and Theano [20].
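The example construction just described can be sketched as follows. The function and data structure names (build_kws_examples, utt_to_words) are hypothetical; only the stated constraints (at most 100 utterances per word, each utterance reused a limited number of times, equal numbers of positives and negatives) come from the text, and the per-utterance reuse cap is interpreted here as at most 5 picks.

```python
import random
from collections import defaultdict

def build_kws_examples(vocab, utt_to_words, max_per_word=100, max_uses=5):
    """utt_to_words: hypothetical dict {utterance_id: set of words in that utterance}.
    Returns (word, utterance_id, label) triples with label 1 = word occurs, 0 = does not."""
    examples, use_count = [], defaultdict(int)
    utt_ids = list(utt_to_words)
    for word in vocab:
        usable = [u for u in utt_ids if use_count[u] < max_uses]
        # Positive examples: up to max_per_word utterances containing the word.
        positives = [u for u in usable if word in utt_to_words[u]][:max_per_word]
        # Equal number of negatives: utterances that do not contain the word.
        negative_pool = [u for u in usable if word not in utt_to_words[u]]
        negatives = random.sample(negative_pool, min(len(positives), len(negative_pool)))
        for u in positives + negatives:
            use_count[u] += 1
        examples += [(word, u, 1) for u in positives] + [(word, u, 0) for u in negatives]
    return examples
```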
3.2. Acoustic Auto-Encoder Training

We used 300 hidden GRU neurons in the acoustic encoder and decoder RNNs. We sorted the acoustic feature sequences in the training data set in increasing order of length, since it improved training convergence. We unrolled both RNNs for 15 seconds, or 1500 time steps, which is the length of the longest acoustic utterance in the training data set. We padded all acoustic feature sequences to a length of 1500 time steps and excluded the extra frames from the loss function and gradient computation. We used a linear dense layer of size 300 to compute the embedding from the final hidden state vector of the encoder RNN, so that the acoustic and query embeddings together form the 600-dimensional vector used by the KWS network in Section 3.4. We trained the acoustic auto-encoder by minimizing the mean-squared reconstruction error of the input sequence of acoustic feature vectors using the Adam optimization algorithm [21] with a mini-batch size of 40 utterances and a fixed initial learning rate. We used the "newbob" annealing schedule, reducing the learning rate by half whenever the validation set loss did not decrease sufficiently. The top plot of Figure 4 shows the progress of the training set loss as training proceeds. We find that the loss drops significantly in the initial part of training and is nearly constant after 15k utterances. Further epochs through the training data did not yield significant improvements in loss. The next section discusses details of the training of the CNN-RNN character LM.
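In Keras, a rough stand-in for this newbob-style schedule is the ReduceLROnPlateau callback, which halves the learning rate when the validation loss stops improving. This is an approximation of the authors' schedule, not their code; the patience and min_delta values below are illustrative assumptions, and padded_feats is a hypothetical array holding the length-sorted, zero-padded 1500-frame feature sequences.

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Halve the learning rate whenever validation loss fails to improve enough.
newbob_like = ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                patience=1, min_delta=1e-3, verbose=1)

# Hypothetical usage with the auto-encoder sketched in Section 2.1:
# autoencoder.fit(padded_feats, padded_feats, batch_size=40, epochs=5,
#                 validation_split=0.1, callbacks=[newbob_like])
```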
Table 2. This table compares the KWS accuracy (%) of the E2E KWS and DNN-HMM hybrid ASR systems for IV and OOV queries.

System                        | IV   | OOV
DNN-HMM (2-gram word LM)      | 76.7 | 50.0 (chance)
DNN-HMM (4-gram grapheme LM)  | 70.7 | 55.5
E2E ASR-free                  | 55.6 | 57.7
3.3. CNN-RNN Character LM Training

We used all the acoustic transcripts and converted them to sentences over 39 unique graphemes. We broke sentences longer than 50 graphemes into smaller chunks for preparing mini-batches, since the maximum length of a query is 23 graphemes. We used 50-dimensional embeddings for each grapheme and 300 convolutional masks, which resulted in a 300-dimensional embedding for each input sequence. The decoder RNN used 256 GRUs and a softmax layer of 39 neurons at each time step. We minimized the cross-entropy of the output grapheme sequence, and used Adam with a mini-batch size of 256 sequences and the newbob annealing schedule. The bottom plot in Figure 4 shows the progress of the training cross-entropy. Unlike the acoustic RNN auto-encoder, we trained this network for a few epochs.

Fig. 4. This figure shows the training loss of the RNN acoustic encoder-decoder (top plot) and the CNN-RNN character LM (bottom plot) as training proceeds.
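A minimal sketch of the transcript chunking step mentioned in this subsection; treating each character of the transcript as a grapheme is a simplifying assumption, and Georgian-specific grapheme handling is not modeled.

```python
def chunk_transcript(sentence, max_len=50):
    """Split a transcript into grapheme chunks of at most max_len characters.
    Each character is treated as one grapheme here, which is a simplification."""
    graphemes = list(sentence)
    return [graphemes[i:i + max_len] for i in range(0, len(graphemes), max_len)]
```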
3.4. KWS Network Training, Results, and Analysis

After training the acoustic RNN auto-encoder and CNN-RNN character LM, we removed the decoders from both models, concatenated the encoder outputs (resulting in a 600-dimensional vector), and fed them into a fully-connected feed-forward neural network with one ReLU hidden layer of size 256. The output layer contained two softmax neurons that detect whether the input query occurred in the acoustic utterance or not. We applied 50% dropout to all layers of this network, since it improved classification performance. We used Adam to train this network by minimizing the cross-entropy loss with a batch size of 128 examples and the newbob learning rate annealing schedule. We first back-propagated the errors through this KWS network only for a few epochs, and then through the entire network. The latter did not have a significant impact on KWS accuracy. Figure 5 shows the training loss and classification accuracy of this KWS neural network over several epochs. We observe that the network gradually reaches an above-chance classification accuracy of approximately 56%.

Fig. 5. This figure shows the training loss and classification accuracy of the KWS neural network as training proceeds.

We then tested the KWS network on the test set of 2k in-vocabulary (IV) queries, 400 out-of-vocabulary (OOV) queries, and 15 hours of development audio. To get a topline performance, we also trained a hybrid DNN-HMM system for Georgian. This ASR system used 6000 context-dependent states in the HMM and a five-layer deep neural network with 1024 neurons in each layer. We trained this network first using the frame-wise cross-entropy criterion and then using Hessian-free sequence minimum Bayes risk (sMBR) training [22, 23]. We used two LMs: a bigram word LM trained on a vocabulary of 35k words and a 4-gram grapheme LM trained over 39 graphemes. The word error rate (WER) of this hybrid ASR system is 41.9%. We then performed KWS over the 1-best transcript obtained by Viterbi decoding of the development audio, instead of full lattice-based KWS, for simplicity and a fair comparison to the E2E KWS approach.

Table 2 shows the classification accuracies of the DNN-HMM ASR system and the proposed E2E ASR-free KWS system. We obtain a classification accuracy of 55.6% on IV and 57.7% on OOV queries, which is significantly above chance. As expected, the IV performance is lower than that of the hybrid ASR system using the 2-gram word LM. But it is interesting to note that the E2E ASR-free system and the hybrid system using the 4-gram grapheme LM have closer accuracies, especially for OOV queries, where the E2E KWS system performs better by 2.2% absolute. This result is encouraging, since the hybrid system uses word-level transcriptions for training the acoustic model and 36 times more training time than the E2E ASR-free KWS system. We performed further analysis of the dependence of KWS performance on query length. Table 1 shows the classification accuracy as a function of the number of graphemes in the query. We observe that both the ASR-based and E2E KWS systems have difficulty detecting short queries. In the case of the E2E system, this is because it is difficult to derive a reliable representation for short queries due to the lack of context. A key advantage of the E2E KWS system is that it takes 36 times less time to train than the DNN-HMM system.
4. CONCLUSIONS AND FUTURE WORK
This paper presented a novel end-to-end ASR-free approach to text query-based KWS from speech. This is in contrast to ASR-based approaches and to previous work on query-by-example retrieval of audio. The proposed system trains with minimal supervision, without requiring fully transcribed acoustic data. The system uses an RNN acoustic auto-encoder, a CNN-RNN character LM, and a KWS neural network that decides whether the input text query occurs in the acoustic utterance. We show that the system performs respectably on a Georgian keyword search task from the Babel program, and trains 36 times faster than a conventional DNN-HMM hybrid ASR system. Future work should focus on closing the performance gap with the hybrid ASR system and estimating times of the detected keywords.
5. REFERENCES

[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[2] G. Saon, T. Sercu, S. Rennie, and H. J. Kuo, "The IBM 2016 English conversational telephone speech recognition system," in Proc. Interspeech, 2016.
[3] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, "The Microsoft 2016 conversational speech recognition system," arXiv preprint arXiv:1609.03528, 2016.
[4] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML. ACM, 2006, pp. 369–376.
[5] Y. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in Proc. ASRU. IEEE, 2015, pp. 167–174.
[6] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[7] D. Bahdanau, J. Chorowski, D. Serdyuk, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in Proc. ICASSP. IEEE, 2016, pp. 4945–4949.
[8] D. Can and M. Saraclar, "Lattice indexing for spoken term detection," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, pp. 2338–2347, 2011.
[9] T. J. Hazen, W. Shen, and C. White, "Query-by-example spoken term detection using phonetic posteriorgram templates," in Proc. ASRU. IEEE, 2009, pp. 421–426.
[10] Y. Zhang and J. R. Glass, "Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams," in Proc. ASRU. IEEE, 2009, pp. 398–403.
[11] G. Chen, C. Parada, and T. N. Sainath, "Query-by-example keyword spotting using long short-term memory networks," in Proc. ICASSP. IEEE, 2015, pp. 5236–5240.
[12] K. Levin, K. Henry, A. Jansen, and K. Livescu, "Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings," in Proc. ASRU. IEEE, 2013, pp. 410–415.
[13] Y. Chung, C. Wu, C. Shen, H. Lee, and L. Lee, "Audio Word2Vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder," in Proc. Interspeech, 2016.
[14] H. Kamper, W. Wang, and K. Livescu, "Deep convolutional acoustic word embeddings using word-pair side information," in Proc. ICASSP. IEEE, 2016, pp. 4950–4954.
[15] D. Palaz, G. Synnaeve, and R. Collobert, "Jointly learning to locate and classify words using convolutional networks," in Proc. Interspeech, 2016.
[16] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[17] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," arXiv preprint arXiv:1508.06615, 2015.
[18] J. Cui, B. Kingsbury, B. Ramabhadran, A. Sethy, K. Audhkhasi, X. Cui, E. Kislal, L. Mangu, M. Nussbaum-Thom, M. Picheny, et al., "Multilingual representations for low resource speech recognition and keyword search," in Proc. ASRU. IEEE, 2015, pp. 259–266.
[19] F. Chollet, "Keras," https://github.com/fchollet/keras, 2015.
[20] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, May 2016.
[21] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[22] J. Martens, "Deep learning via Hessian-free optimization," in Proc. ICML, 2010, pp. 735–742.
[23] B. Kingsbury, T. N. Sainath, and H. Soltau, "Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization," in Proc. Interspeech, 2012.

Fig. 3. This figure shows the overall E2E KWS system. The finite-dimensional embeddings from the acoustic and query encoders feed into the KWS neural network that predicts whether the query occurs in the utterance or not.
Table 1. This table compares the KWS accuracy of the E2E KWS and DNN-HMM hybrid ASR systems for different IV query lengths (in graphemes).