Automatic Data Expansion for Customer-care Spoken Language Understanding


A PREPRINT

Shahab Jalalvand

Senior Research Scientist, Interactions Corp., Murray Hill, NJ, [email protected]

Andrej Ljolje

Principal Research Scientist, Interactions Corp., Murray Hill, NJ, [email protected]

Srinivas Bangalore

Director AI Research, Interactions Corp., Murray Hill, NJ, [email protected]

October 2, 2018

ABSTRACT

Spoken language understanding (SLU) systems are widely used to handle customer-care calls. A traditional SLU system consists of an acoustic model (AM) and a language model (LM) that are used to decode the utterance, and a natural language understanding (NLU) model that predicts the intent. While the AM can be shared across different domains, the LM and NLU models need to be trained specifically for every new task. However, preparing enough data to train these models is prohibitively expensive. In this paper, we introduce an efficient method to expand the limited in-domain data. The process starts with training a preliminary NLU model based on logistic regression on the in-domain data. Since the features are based on n-grams with n = 1, 2, we can detect the most informative n-grams for each intent class. Using these n-grams, we find the samples in the out-of-domain corpus that 1) contain the desired n-gram and/or 2) have a similar intent label. The ones that meet the first constraint are used to train a new LM and the ones that meet both constraints are used to train a new NLU model. Our results on two divergent experimental setups show that the proposed approach reduces the absolute classification error rate (CER) by 30% compared to the preliminary models, and that it significantly outperforms traditional data expansion algorithms such as those based on semi-supervised learning, TF-IDF and embedding vectors.

1 Introduction

The use of spoken language understanding (SLU) in customer-care applications is increasing every day. A traditional SLU system consists of a pipeline of automatic speech recognition (ASR) and natural language understanding (NLU) [8]. ASR has two main components: an acoustic model (AM) and a language model (LM) [14]. The AM is trained on acoustic information and the LM is trained on text corpora. While the AM can be shared across different domains, the LM and NLU models need to be trained specifically for every new task.
Assuming that the AM is generic enough to be used across different tasks, in this paper we focus on the LM and NLU components and address two main issues:

1. The LM is usually trained independently of the NLU model. Therefore, it does not receive any feedback from the NLU model about the important words (or n-grams) in the task.

2. The availability of in-domain data is usually very limited. Therefore, training accurate LM and NLU models is not possible on day zero.

Addressing the first issue, [19] proposes an integrative and discriminative technique to update the parameters of the LM and NLU models. In this technique, the n-best hypotheses generated by the ASR decoder are rescored and reranked using the AM, LM and NLU scores. Then the best hypothesis (i.e. the one whose intent is correctly predicted with the highest score) and the most competitive hypothesis (i.e. the one whose intent is wrongly predicted with a high score) are detected and used to update the n-gram probabilities in the LM and NLU models.

To solve the second issue, there are many data expansion techniques in the literature. These techniques fall into two main categories: generative and selective. In generative techniques, a generative model such as a recurrent neural network language model (RNNLM) [10] is trained on the in-domain data and then used to generate similar samples [1, 16]. Selective approaches are mostly based on searching through a large out-of-domain (OOD) corpus and finding the most similar samples according to an appropriate similarity measure. Some of these methods are based on cross entropy [12], term frequency-inverse document frequency (TF-IDF) [15] and word embeddings [11].

This paper addresses the two issues above and presents an automatic data expansion algorithm that:

• takes advantage of a preliminary NLU model to detect the most informative words (or n-grams), and
• uses those n-grams to select relevant data from an out-of-domain corpus.

The process starts with training a preliminary NLU model based on logistic regression on the in-domain data. Since the features are based on n-grams with n = 1, 2, we can detect the most informative n-grams for each intent class. Therefore, for each intent we prepare a list of representative n-grams by inspecting the trained feature weight matrix of the NLU model. Table 1 shows some examples of in-domain sentences and the obtained lists of n-grams.

Audio utterance              Intent         n-gram list
i'm calling about charges    BILLING        (charges, paying)
uh password help             ACCOUNT        (pass code, password, my account)
making appointment           APPOINTMENT    (reservation, reservations)
complain about delivery      COMPLAINT      (complain, bad experience)

Table 1: In-domain data examples for task A.

For every n-gram, we find samples in the OOD corpus that 1) contain the desired n-gram and/or 2) have a similar intent label. The OOD samples that meet the first constraint are used to train the new LM and the ones that meet both constraints are used to train the new NLU model. Our results on two divergent tasks show that the proposed approach reduces the classification error rate (CER) by 30% compared to the preliminary models, and that it significantly outperforms traditional data expansion algorithms such as those based on semi-supervised learning [2], TF-IDF [15] and embedding approaches [11].

The rest of this paper is organized as follows. Section 2 describes the related work. Section 3 briefly describes the different SLU components. Section 4 describes the proposed automatic data expansion approaches. Section 5 contains the experimental setup and results and, finally, Section 6 concludes the paper.

2 Related Work

The scientific literature related to the work described in this article spans two main lines of investigation: a) integrative training of LM and NLU models and b) automatic data expansion.

Integrative training of LM and NLU. [19] proposes an integrative and discriminative technique to update the parameters of the LM and NLU models. In this technique, the n-best hypotheses generated by an ASR decoder are first rescored and reranked using the AM, LM and NLU scores. Then the best hypothesis (i.e. the one whose intent is correctly predicted with the highest score) and the most competitive hypothesis (i.e. the one whose intent is wrongly predicted with a high score) are detected. Finally, these discriminant hypotheses are used to update the probabilities of the n-grams in both the LM and NLU models. The performance of this approach is highly dependent on the performance of the preliminary LM that is used to generate the n-best list. Moreover, optimizing the weights of the AM, LM and NLU scores requires a lot of in-domain data. Our proposed approach also exploits the NLU information to update the LM and NLU models, but it works even when very limited in-domain data is available.

Other work tries to remove the need for ASR in the SLU system and builds an end-to-end deep neural network that directly predicts the intent from acoustic information [4]. Although these methods have obtained competitive results, they have some deficiencies that prevent them from being used in industry. For example, whenever a new intent is introduced, the whole network needs to be retrained.

Data expansion. An intelligent method to expand the LM training data is proposed in [12]. In this method, two language models are trained: one on in-domain and the other on out-of-domain data. Then all the out-of-domain sentences are scored by a criterion based on the cross-entropy difference between the in-domain and out-of-domain LMs. The larger the difference, the higher the chance of being selected. This approach, however, needs a reasonable amount of in-domain data to train the in-domain LM, while, as mentioned before, our available data is as small as several hundred utterances.
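The cross-entropy difference criterion of [12] can be illustrated with a minimal sketch. The unigram LMs with add-one smoothing, the function names and the sign convention (out-of-domain entropy minus in-domain entropy, so that larger is better) are assumptions made for this illustration; they are not the setup of [12], and sentences are assumed to be non-empty.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Return a unigram log-probability function with add-one smoothing."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen words

    def logprob(word):
        return math.log((counts[word] + 1) / (total + vocab))

    return logprob

def cross_entropy(sentence, logprob):
    """Per-word cross entropy (in nats) of a sentence under a unigram LM."""
    words = sentence.split()
    return -sum(logprob(w) for w in words) / len(words)

def select_by_ce_difference(in_domain, ood, top_k):
    """Rank OOD sentences by H_ood(s) - H_in(s) and keep the top_k."""
    lp_in = train_unigram(in_domain)
    lp_ood = train_unigram(ood)
    # A large score means the in-domain LM explains the sentence
    # better than the out-of-domain LM does.
    ranked = sorted(
        ood,
        key=lambda s: cross_entropy(s, lp_ood) - cross_entropy(s, lp_in),
        reverse=True,
    )
    return ranked[:top_k]
```

With only a few hundred in-domain utterances, the in-domain LM in such a scheme is estimated from very few counts, which is the limitation noted above.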

To expand data for NLU models, one solution is semi-supervised learning [13]. In this approach, a preliminary NLU model is trained on the in-domain data and then used to label a large amount of unlabeled data. The new NLU model is trained on the predicted labels. However, as we will see in our experiments, the preliminary model can be so poor that it fails to assign proper intent labels to the unlabeled utterances.

3 SLU Components

Figure 1 shows a block diagram of a customer-care SLU system, highlighting the modules that we focus on in this paper. A transaction starts with a prompt like "How may I help you?". The human usually responds with a short utterance: "I want to make an appointment". ASR generates a hypothesis for this utterance and NLU predicts the intent label for this hypothesis as APPOINTMENT. Based on the predicted intent label and its confidence score, the Dialog Manager provides the proper response to this request and the dialog continues.

Figure 1: Basic SLU system architecture for customer-care calls.

ASR aims to find the most probable sequence of words, $\hat{W}$, given a sequence of acoustic observations, $X$:

$$\hat{W} = \arg\max_{W} \left\{ P^{1/L}[X \mid W] \times P[W] \right\}$$

where $X = x_1, x_2, \ldots, x_T$ indicates the sequence of acoustic observations, $P[X \mid W]$ computes the acoustic model likelihood and $P[W]$ computes the language model likelihood. Since the AM and LM scores have different dynamic ranges, an LM scale factor $L$ is used to balance the scores.

Since this paper focuses on the LM and NLU models, we assume that the AM is unchanged from one experiment to the other.

Our language model is a traditional back-off n-gram LM [3]. Thanks to its simple structure, fast training and easy implementation, the n-gram LM has been widely used in industry. In a back-off n-gram LM, the conditional probability of a word $w_i$ given the $n-1$ previous words $w_{i-n+1}^{i-1}$ is computed by:

$$P_{bo}(w_i \mid w_{i-n+1}^{i-1}) = \begin{cases} d_{w_{i-(n-1)}^{i}} \times \dfrac{C(w_{i-(n-1)}^{i})}{C(w_{i-(n-1)}^{i-1})}, & \text{if } C(w_{i-(n-1)}^{i}) > k \\[2ex] \alpha_{w_{i-(n-1)}^{i-1}} \times P_{bo}(w_i \mid w_{i-(n-2)}^{i-1}), & \text{otherwise} \end{cases} \qquad (1)$$

In this formula, $P_{bo}(w_i \mid w_{i-n+1}^{i-1})$ is the back-off probability of observing $w_i$; $C(w)$ is the frequency of $w$ in the training set; $k$ is a threshold for the least acceptable number of appearances; $d$ is the Good-Turing discounting estimate; and $\alpha$ is the back-off weight. Other extensions such as modified Kneser-Ney (also called modified shift-beta) smoothing have shown very good performance [3].

Our proposed data expansion algorithm modifies the probability of the specific n-grams that contribute the most to intent prediction. This modification is done by adding more hypotheses that contain those specific n-grams.

An NLU model classifies the hypothesis $\hat{W}$ into one of $L$ intent classes, $\hat{Y} \in \{Y_1, \ldots, Y_L\}$. In SLU, $\hat{W}$ is generated by the ASR decoder (see the decoding equation above). To achieve a more robust intent prediction, the n-best lists are used instead of a single hypothesis $\hat{W}$.

We use binary n-grams with n = 1, 2 to form the feature vector for a recognized hypothesis. The dimension of the feature vector is equal to the total number of uni- and bi-grams in the training data. The $k$-th element of the vector is 1 if its corresponding n-gram is seen in the hypothesis, and 0 otherwise.

For the classifier, we use logistic regression with hinge loss, averaged stochastic gradient descent (SGD) and L-norm feature normalization. The objective function for training is:

$$\min_{M,b} \; \frac{1}{N} \sum_{i=1}^{N} \xi_i \cdot \mathrm{Loss}(y_i, M^{T} x_i + b) + \alpha \lVert M \rVert \qquad (2)$$

where $M$ is the estimated weight matrix; $b$ is the bias; $N$ is the total number of training samples; $y_i$ is the label for the $i$-th training sample and $x_i$ is the feature vector for the $i$-th training sample.

Matrix $M$ is $[I \times L]$ dimensional, where $I$ is the total number of uni- and bi-grams and $L$ is the total number of intent classes. Our proposed data expansion method makes use of this matrix as a source of information to identify the most informative n-grams for each intent.
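The binary uni- and bi-gram feature vector described above can be sketched as follows. The function names and the whitespace tokenization are assumptions made for this illustration, not details taken from the system described here.

```python
def ngrams(tokens, max_n=2):
    """Return all n-grams of order 1..max_n as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

def build_vocabulary(hypotheses, max_n=2):
    """Map every uni- and bi-gram seen in the training data to a feature index."""
    vocab = {}
    for hyp in hypotheses:
        for gram in ngrams(hyp.split(), max_n):
            vocab.setdefault(gram, len(vocab))
    return vocab

def binary_features(hypothesis, vocab, max_n=2):
    """Element k is 1 if the k-th training n-gram occurs in the hypothesis, else 0."""
    vec = [0] * len(vocab)
    for gram in ngrams(hypothesis.split(), max_n):
        if gram in vocab:
            vec[vocab[gram]] = 1
    return vec
```

Because each feature corresponds to exactly one n-gram, each row of the weight matrix can be read directly as the contribution of that n-gram to every intent, which is what makes the matrix interpretable.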

We are aware of other classifiers based on deep neural networks, such as those in [5, 20, 6]. However, we use simple logistic regression because we are interested in evaluating different data expansion techniques. Analyzing the classifier performance is beyond the scope of this paper.

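4 Automatic Data Expansion

In this section, we describe four techniques for data expansion: semi-supervised learning, TF-IDF based selection, embedding-based selection and the proposed n-gram based selection.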

In the semi-supervised approach, an NLU model is trained on the in-domain data and then used to assign an intent label to the utterances in the OOD corpus. The samples with a high confidence score are used to train a second NLU model. This process is repeated for several iterations until it converges to an optimal model [13].
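The iterative labeling loop can be sketched as below. The keyword-overlap classifier is a toy stand-in for the logistic regression NLU model, and the confidence threshold, iteration cap and convergence test (the selected set stops changing) are assumptions made for this illustration. Utterances are assumed to be non-empty.

```python
def train_keyword_model(samples):
    """Toy stand-in for the NLU model: the set of words seen per intent."""
    model = {}
    for text, intent in samples:
        model.setdefault(intent, set()).update(text.split())
    return model

def predict(model, text):
    """Return (intent, confidence); confidence is the share of known words."""
    words = text.split()
    scores = {
        intent: sum(w in vocab for w in words) / len(words)
        for intent, vocab in model.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

def self_train(labeled, unlabeled, threshold=0.6, max_iters=5):
    """Label the unlabeled pool, keep confident samples, retrain, and repeat."""
    model = train_keyword_model(labeled)
    selected = []
    for _ in range(max_iters):
        predictions = [(text, *predict(model, text)) for text in unlabeled]
        new_selected = [(t, y) for t, y, conf in predictions if conf >= threshold]
        if new_selected == selected:  # nothing changed: treat as converged
            break
        selected = new_selected
        model = train_keyword_model(labeled + selected)
    return model, selected
```

The sketch also shows the weakness discussed earlier: every selected sample inherits the label predicted by the previous model, so a poor preliminary model propagates its errors.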

Figure 2: NLU-driven data expansion architecture.

Another strategy to select relevant data is to use similarity measures. One of the most popular methods to compute sentence similarity is term frequency-inverse document frequency (TF-IDF). The similarity of two sentences can be computed as the cosine similarity between their TF-IDF vectors:

$$\mathrm{cos\_sim}(s_1, s_2) = \mathrm{tfidfVect}(s_1) \cdot \mathrm{tfidfVect}(s_2) \qquad (3)$$

$$\mathrm{tfidfVect}(s) = [\,\mathrm{tfidf}(w_1, s), \mathrm{tfidf}(w_2, s), \ldots\,]$$

$$\mathrm{tfidf}(w, s) = \mathrm{tf}(w, s) \times \mathrm{idf}(w)$$

$$\mathrm{idf}(w) = 1 + \log\left(\frac{N}{|\{s : w \in s\}|}\right)$$

where $N$ is the total number of sentences and $\mathrm{tf}(w, s)$ is the frequency of $w$ in sentence $s$. The dot product in Eq. 3 equals the cosine similarity when the vectors are normalized to unit length.

In the experiments, we collect all the in-domain and out-of-domain utterances into a large document and compute a TF-IDF vector for each sentence. Then the cosine similarity is computed between every in-domain sample and all out-of-domain samples. For every in-domain sample, the most similar OOD samples are selected to train the new LM and NLU models.

Continuous space embeddings such as word2vec [11] project terms into a continuous and dense feature space. When words and sentences are represented by continuous feature vectors, their similarity can be measured using simple metrics such as Euclidean distance.

We apply embedding methods to expand our limited in-domain data. To do this, a word2vec model is trained on all available sentences, including the in-domain and out-of-domain corpora. This model represents each word with a continuous feature vector. To extend the word vectors to sentence vectors, we use average pooling. Although average pooling is not the most efficient approach, it is a common way to convert word vectors into sentence vectors. For every in-domain sentence, we find the most similar ones in the OOD corpus using Euclidean distance. Among the selected sentences, the ones whose intent label matches that of the in-domain sample are selected for NLU training and the rest for LM training.
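The two similarity-based baselines of this section can be sketched together. The first part follows the TF-IDF definitions around Eq. 3 with explicit length normalization; the second part shows average pooling and Euclidean distance. The function names, the whitespace tokenization and the dictionary used in place of a trained word2vec model are assumptions made for this illustration.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """One sparse TF-IDF vector (dict) per sentence, idf(w) = 1 + log(N / df(w))."""
    tokenized = [s.split() for s in sentences]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    return [
        {w: tf * (1 + math.log(n / df[w])) for w, tf in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0

def most_similar(in_domain, ood, top_k=1):
    """For each in-domain sentence, return its top_k most similar OOD sentences."""
    vecs = tfidf_vectors(in_domain + ood)  # one shared collection, as in the text
    in_vecs, ood_vecs = vecs[:len(in_domain)], vecs[len(in_domain):]
    result = {}
    for sent, vec in zip(in_domain, in_vecs):
        ranked = sorted(
            range(len(ood)), key=lambda j: cosine(vec, ood_vecs[j]), reverse=True
        )
        result[sent] = [ood[j] for j in ranked[:top_k]]
    return result

def sentence_vector(sentence, word_vectors):
    """Average pooling of word vectors; returns None if no word has a vector."""
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def euclidean(a, b):
    """Euclidean distance between two dense vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Both measures treat every word of the sentence alike, apart from the corpus-level idf weighting; neither knows which words matter for predicting the intent.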

Figure 2 shows our proposed algorithm for data expansion using n-gram features. The available resources are:

• out-of-domain (OOD) corpus: a large corpus containing 58 million utterances from 29 customer-care applications. Each utterance comes with an audio recording, a recognized hypothesis and an intent label.
• genericAM: an acoustic model trained on hundreds of telephone calls from the 29 applications.
• genericLM: an interpolated language model built from 29 LMs, each trained for one application.
• in-domain data: several utterance examples per intent from the new application.

We first use genericAM and genericLM to decode the in-domain utterances. Then, NLU_0 is trained on the resulting <hypothesis, intent> pairs using logistic regression with n-grams (n = 1, 2) as features. The trained weight matrix of NLU_0 is inspected to create a list of the most informative n-grams for each intent (see Table 1 for examples). These n-grams are defined as the ones with the highest positive weights, the ones with the lowest negative weights, or a combination of both. For each n-gram, we find utterances in the OOD corpus that meet two constraints: 1) they contain the desired n-gram and 2) they have a similar intent label. The ones that meet the first constraint are used to train LM_1. The ones that meet both constraints are used to train NLU_1. Note that the sentences in the NLU training set are a subset of the LM training set.
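The selection step described above can be sketched as follows. The weight matrix is represented as a nested dictionary, only the highest positive weights are used, and "similar intent label" is simplified to an exact label match; these choices and all names are assumptions made for this illustration.

```python
def top_ngrams_per_intent(weights, k):
    """weights: {intent: {ngram: weight}} -> {intent: k highest positive-weight n-grams}."""
    return {
        intent: [
            gram
            for gram, w in sorted(ws.items(), key=lambda kv: kv[1], reverse=True)[:k]
            if w > 0
        ]
        for intent, ws in weights.items()
    }

def contains_ngram(text, ngram):
    """True if the n-gram occurs as a contiguous token sequence in the text."""
    toks, ng = text.split(), ngram.split()
    return any(toks[i:i + len(ng)] == ng for i in range(len(toks) - len(ng) + 1))

def select(ood, informative):
    """ood: list of (hypothesis, intent) pairs. Returns (lm_data, nlu_data)."""
    lm_data, nlu_data = [], []
    for text, intent in ood:
        hit_intents = {
            i
            for i, grams in informative.items()
            if any(contains_ngram(text, g) for g in grams)
        }
        if hit_intents:
            lm_data.append(text)  # constraint 1: contains an informative n-gram
            if intent in hit_intents:
                nlu_data.append((text, intent))  # constraints 1 and 2
    return lm_data, nlu_data
```

By construction, every sentence in the NLU set is also in the LM set, matching the subset relation noted above.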

5 Experimental Setup and Results

We compare the performance of our data expansion algorithm against three alternative approaches:

• semi-supervised approach;
• TF-IDF based data selection;
• embedding approach.

We conduct the experiments in two divergent circumstances:

1. Task_A: high intent coverage. Most of the intent labels in the new application task_A already exist in the OOD corpus.

2. Task_B: low intent coverage. Most of the new intent labels do not exist in the OOD corpus.

Since genericAM, genericLM and the OOD corpus are shared across the two tasks, we describe them beforehand.

genericAM. Our generic acoustic model is a hybrid DNN-HMM model [7] trained with the cross-entropy criterion followed by the state-level Minimum Bayes Risk (sMBR) objective. The training set consists of about 380 hours of transcribed utterances.

genericLM. The generic language model that we use in the first pass of decoding is a 3-gram language model interpolated from 29 individual LMs. Each individual LM corresponds to one application and is trained using 3-gram Katz back-off [9]. The largest and smallest LMs are trained on 588 million and 8 million words, respectively.

Out-of-domain (OOD) corpus. This corpus consists of the utterances from 29 existing applications (excluding applications A and B). Table 2 shows the statistics of this corpus.

Description                 Stats
number of applications      29
number of utterances        58 M
number of words             202 M
size of vocabulary          23 K
number of unique intents    3392

Table 2: Statistics of the out-of-domain corpus.

5.1 Task_A

In this task, the intent labels of the new application A are well covered in the OOD corpus. About 93% of the new intents are already used in the previous applications.

In-domain data A. Application A has 78 different intents and for each intent only 10 audio samples are provided. Therefore, in total there are 780 <audio, intent> samples. Table 1 shows some examples of these utterances.

Test set A. The test set of this application consists of 4891 customer-care calls that are transcribed and labeled by a human transcriber. The whole set is 7 hours long, contains 28 K words and has 78 unique intent labels. Recall that 93% of these intent labels are covered in the OOD corpus.

Results A. Table 3 shows the word error rate (WER [%]) and classification error rate (CER [%]) results obtained with the different data expansion approaches.

In the first row, we use genericLM for recognition. For intent prediction, we use inNLU, which is trained on the in-domain data only. The WER is 17.1% and the CER is 61.43%. This is the setup that we consider as the baseline.

The next two rows report the performance of the semi-supervised approach over two iterations. In the first iteration, inNLU is used to label all the utterances in the OOD corpus. The ones with a high confidence score (together with the in-domain data) are used to train in+semi-supNLU1. We tune the confidence threshold so that we retrieve 60 K samples. This number is chosen simply for the sake of consistency when comparing the different data selection approaches. In the second iteration, we use in+semi-supNLU1 to label the OOD samples. We observe that the semi-supervised approach improves the CER to 57.31% and 54.63% in the first and second iterations, respectively.

The third set of rows reports the performance of the TF-IDF based approach. For each in-domain sample, we keep the 10K most similar samples in the OOD corpus according to Eq. 3. This number provides the desired 60 K samples to train in+tfidfNLU. This model yields 52.52% CER. The same approach can be used to collect more data (i.e. by removing the constraint of intent label match) for training a new LM. This data, along with the data for NLU training, is used to build tfidfLM. This new LM yields a worse WER (26.1%), though it improves the CER to 51.65%. Such seemingly strange behaviour, i.e. a WER increase together with a CER decrease, may happen when the LM is trained together with the NLU model [18, 19].

The fourth set of results is obtained with the embedding approach. Using embedding vectors with thresholds on the Euclidean distance between in-domain and out-of-domain sentence vectors, we collect 60 K utterances to train the in+embedNLU model. This model yields 53.67% CER, which is 1.15% worse than the TF-IDF approach when genericLM is used to decode the utterances. Then, we train embedLM on the data selected by the embedding approach. This LM results in a much better WER (19.9%) compared to its TF-IDF counterpart (26.1%), and it also outperforms the TF-IDF method in terms of CER.

In the last set of rows, we evaluate our proposed data expansion algorithm using n-gram features from inNLU. The number of n-grams per intent and the number of samples per n-gram are parameters that are again tuned to retrieve about 60 K instances to train ngselNLU1. Using this model along with genericLM, we obtain 49.98% CER. This result is an absolute improvement of 2.54% CER over the TF-IDF method (52.52%) and of 3.69% over the embedding approach. Again, this method can be used to select data for training a customized LM, ngselLM1. Although this LM increases the WER to 18.5%, it slightly reduces the CER to 49.35%. In the second iteration, we use the same approach, this time using n-gram features from in+ngselNLU1. The selected data are used to train ngselLM2 and ngselNLU2, yielding 48.43% CER.

Figure 3 shows the Lift curves [17] obtained by the different selection strategies. A Lift curve shows the error rate within the accepted samples as a function of the rejection rate applied to the confidence score of the classifier. As we can see in this figure, the proposed n-gram based data selection approach consistently outperforms the other approaches at all rejection rates.

To investigate why n-gram based data expansion outperforms the TF-IDF and embedding approaches, we look at the diversity of the data selected by each method. The vocabulary size of the data selected by TF-IDF, embedding and n-gram based data selection is 2169, 2572 and 3030 words, respectively.

That is, n-gram based data expansion is able to find a greater variety of data, whereas TF-IDF and embedding, due to the way they work, try to find the sentences most similar to the in-domain data, ignoring the fact that some words (or n-grams) are more important for predicting the intent of a hypothesis.

Figure 3: Error rate within accepted samples as a function of rejection rate (Lift curve).

5.2 Task_B

In this task, the new application B has a lot of unseen intent labels. Only 25% of the intents in application B have previously been seen in the OOD corpus. Our goal is to verify the performance of our data selection algorithm in a challenging scenario where there are many new intent labels.

In-domain examples                                 Out-of-domain examples
Utterance                        Intent            Utterance             Intent
credit card bill                 BILLING           account charges       BILLING
i need my account unlocked       ACCOUNT           trouble logging in    ACCOUNT
problem with the website         ONLINE_WEB_HELP   online support        WEB_HELP
i need update for version two    UPDATE_SOFTWARE   edit my name          UPDATE_PROFILE
remove a form                    FORMS             –                     –

Table 4: Left: in-domain data examples for task B. Right: the matched intent examples from the out-of-domain corpus.

In-domain data B. The provided in-domain data for this task has 91 different intents. Again, we consider having only 10 audio samples per intent. Therefore, in total there are 910 <audio, intent> samples in the in-domain data. Table 4 shows some examples.

Test set B. The test set contains 3657 customer-care utterances, about 6.5 hours in total. No human transcription is available for this set, so we are not able to evaluate WER. However, the true intent labels are available for all test utterances. There are 100 different labels and only 25% of them are seen in the OOD corpus.

Results B. Our first observation about the unseen intent labels is that, although there are many of them, for many we can find similar labels in the OOD corpus. Therefore, as a naive solution, we use string similarity to find matching intent labels for the unseen ones. To do this, we use the get_close_matches function of Python's difflib library to find the closest matches for the unseen labels. As we see in Table 4, sometimes we find exact matches in the OOD corpus, sometimes we find close matches and sometimes there is no match.

From Table 5, we again observe a large improvement in CER, from 71.28% to 44.21%. This result already verifies the effectiveness of data expansion using n-gram features. It is worth analyzing the improvement gap as a function of in-domain data size.
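The label-matching step for unseen intents can be sketched with difflib as follows. The wrapper function, the choice of keeping a single best match and the cutoff of 0.6 (the difflib default) are assumptions made for this illustration; the settings actually used in the experiments are not stated here. The labels are taken from Table 4.

```python
from difflib import get_close_matches

def map_unseen_labels(new_labels, ood_labels, cutoff=0.6):
    """Map each new intent label to its closest OOD label, or None if none is close."""
    mapping = {}
    for label in new_labels:
        matches = get_close_matches(label, ood_labels, n=1, cutoff=cutoff)
        mapping[label] = matches[0] if matches else None
    return mapping

# Intent labels from Table 4.
ood_labels = ["BILLING", "ACCOUNT", "WEB_HELP", "UPDATE_PROFILE"]
new_labels = ["BILLING", "ONLINE_WEB_HELP", "UPDATE_SOFTWARE", "FORMS"]
mapping = map_unseen_labels(new_labels, ood_labels)
```

In this sketch BILLING is matched exactly, ONLINE_WEB_HELP and UPDATE_SOFTWARE are mapped to WEB_HELP and UPDATE_PROFILE, and FORMS has no match. The UPDATE_SOFTWARE case shows why this solution is naive: the labels are close as strings but the intents differ.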

In further analyses, we are interested in knowing to what extent our data expansion approach improves the CER when more in-domain data is available. In the previous experiments, we assumed that only 10 samples per intent were available. In the next experiments, we enlarge this number to 50, 100 and 150 and conduct experiments on task B.

Figure 4: Improvement gap as a function of in-domain data size (i.e. 10, 50, 100 and 150 samples per intent).

Figure 4 shows the Lift curves of the models trained only on in-domain data (solid lines) and of the ones trained on data selected by n-gram based data expansion (dotted lines). As can be seen in Figure 4, when the in-domain data is enlarged to 50 and 100 samples per intent, the baseline models improve consistently. From 100 to 150 samples per intent, we see a slight increase in CER. Regardless of the size of the in-domain data, n-gram based data expansion significantly improves the CER.

6 Conclusion

We introduced a novel technique to expand the limited in-domain data for a new spoken language understanding (SLU) task. This technique takes advantage of the key information that a preliminary NLU model (trained only on in-domain data) provides and uses this information to find more relevant data in an out-of-domain corpus. Using this method, we were able to significantly enlarge the training data for both the language model and the NLU model. The new models showed significantly better performance in terms of classification error rate in two divergent experimental setups.

Future work is dedicated to a) designing a more efficient algorithm to match the unseen intent labels and b) preparing the out-of-domain corpus using clustering approaches.

References

[1] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

[2] Asli Celikyilmaz, Dilek Hakkani-Tur, Gokhan Tur, and Ruhi Sarikaya. Semi-supervised semantic tagging of conversational understanding using markov topic regression. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 914–923, 2013.

[3] Stanley F Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–393, 1999.

[4] Yuan-Ping Chen, Ryan Price, and Srinivas Bangalore. Spoken language understanding without speech recognition. In Acoustics, Speech, and Signal Processing, IEEE International Conference on, pages 6189–6193. IEEE, 2018.

[5] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.

[6] Yoav Goldberg. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 57:345–420, 2016.

[7] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, 9(3):82–97, 2012.

[8] Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, and Raj Reddy. Spoken language processing: A guide to theory, algorithm, and system development, volume 95. Prentice hall PTR Upper Saddle River, 2001.

[9] Slava Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE transactions on acoustics, speech, and signal processing, 35(3):400–401, 1987.

[10] Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukas Burget, and Jan Cernocky. Rnnlm-recurrent neural network language modeling toolkit. In Proc. of the 2011 ASRU Workshop, pages 196–201, 2011.

[11] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[12] Robert C Moore and William Lewis. Intelligent selection of language model training data. In Proceedings of the ACL 2010 conference short papers, pages 220–224. Association for Computational Linguistics, 2010.

[13] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom Mitchell. Text classification from labeled and unlabeled documents using em. Machine learning, 39(2-3):103–134, 2000.

[14] Lawrence R. Rabiner and Biing H. Juang. Fundamentals of Speech Recognition. Prentice-Hall Signal Processing Series: Advanced monographs. PTR Prentice Hall, 1993.

[15] Gerard Salton and Chris Buckley. Term weighting approaches in automatic text retrieval. Technical report, Cornell University, 1987.

[16] Yik-Cheung Tam, Yangyang Shi, Hunk Chen, and Mei-Yuh Hwang. Rnn-based labeled data generation for spoken language understanding. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[17] Stéphane Tufféry. Data mining and statistics for decision making, volume 2. Wiley Chichester, 2011.

[18] Ye-Yi Wang, Alex Acero, and Ciprian Chelba. Is word error rate a good indicator for spoken language understanding accuracy. In Automatic Speech Recognition and Understanding, 2003. ASRU'03. 2003 IEEE Workshop on, pages 577–582. IEEE, 2003.

[19] Sibel Yaman, Li Deng, Dong Yu, Ye-Yi Wang, and Alex Acero. An integrative and discriminative technique for spoken utterance classification. IEEE Transactions on Audio, Speech, and Language Processing, 16(6):1207–1214, 2008.

[20] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In