Future Vector Enhanced LSTM Language Model for LVCSR
Qi Liu, Yanmin Qian, Kai Yu
Key Lab. of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering
SpeechLab, Department of Computer Science and Engineering
Brain Science and Technology Research Center
Shanghai Jiao Tong University, Shanghai, China
Emails: {liuq901, yanminqian, kai.yu}@sjtu.edu.cn

ABSTRACT
Language models (LM) play an important role in large vocabulary continuous speech recognition (LVCSR). However, traditional language models only predict the next single word given the history, while consecutive predictions on a sequence of words are usually demanded and useful in LVCSR. The mismatch between single word prediction in training and the long term sequence prediction demanded in real use may lead to performance degradation. In this paper, a novel enhanced long short-term memory (LSTM) LM using the future vector is proposed. In addition to the given history, the rest of the sequence is also embedded by future vectors. This future vector can be incorporated into the LSTM LM, so it has the ability to model much longer term sequence level information. Experiments show that the proposed LSTM LM achieves better BLEU scores on long term sequence prediction. For speech recognition rescoring, although the proposed LSTM LM obtains only very slight gains on its own, the new model appears highly complementary to the conventional LSTM LM. Rescoring using both the new and conventional LSTM LMs achieves a very large improvement on word error rate.
Index Terms: speech recognition, language model, recurrent neural network, n-best rescoring
1. INTRODUCTION
Language models play an important role in LVCSR. N-gram models [1, 2] have been widely used in LVCSR systems for a long time. However, an n-gram model only uses a limited history, which makes it hard to deal with long context sequences. RNN and LSTM language models [3, 4], which can store the whole history of a sequence, have been proposed to deal with this problem and have obtained great success in many fields [5, 6].

However, many sequence level tasks including machine translation [7], speech recognition [8] and handwriting recognition [9] need long term sequence prediction, while the traditional RNN language model only predicts single words one by one. According to [10], there is a gap between the commonly used word level metric for language model evaluation, perplexity (PPL), and true sequence level metrics such as the BLEU score in machine translation [11] and the word error rate (WER) in speech recognition [12].

Several lines of research have addressed this problem. [13, 14, 15] studied bidirectional LSTM language models, which can retrieve information not only from the past context but also from the future context. [10] combined reinforcement learning and deep learning, directly training the neural network with the estimated BLEU score. [16, 17] applied sequence-to-sequence training methods to language modeling.

In this paper, a novel enhanced LSTM language model is proposed. The enhanced LSTM language model predicts not only a single word, but also the whole future of the input sequence. It is believed that the enhanced LSTM language model can perform well with more sequence level information.

The enhanced LSTM language model first trains a reversed LSTM language model, and the activation values of the last hidden layer of this reversed LSTM are used as bottleneck features [18] which embed the future of the sequence. These bottleneck features are called future vectors of the sequence. These future vectors, which contain sequence level information, are then used to train the enhanced LSTM language model. The model is trained to predict not only the next word but also the future vector, and the predicted future vector is also used as an input feature when predicting the next word.

The experiments show that the enhanced LSTM language model performs well on the sequence prediction task. It is also observed that in the n-best rescoring task, the WER can obtain a very large improvement by combining the normal and enhanced LSTM language models.

The rest of the paper is organized as follows: Section 2 gives the background. Section 3 describes the methodology of the enhanced LSTM language model and Section 4 shows the experimental setup and results. Finally, the conclusion is given in Section 5 and a discussion can be found in Section 6.

Fig. 1. One LSTM memory cell [25]. There are three gates (input gate, output gate and forget gate) in each cell to control the data flow. In practice, $h_{t-1}$ is also an input to the cell together with $x_t$.
2. BACKGROUND

2.1. Long Short-Term Memory
An RNN [19] is a neural network with cycles in its structure, which is effective in dealing with sequential data. Suppose there is a sequence of data $x_1, x_2, \ldots, x_T$ as the input and let $h_1, h_2, \ldots, h_T$ be the output of the RNN. The most commonly used RNN formula is
$$h_t = f(W_x x_t + W_h h_{t-1} + b),$$
where $W_x$ and $W_h$ are weight matrix parameters, $b$ is the bias and $f$ is the activation function.

Due to the gradient vanishing and explosion problems [20, 21], LSTM [3], which is a unit-structured RNN, has been used to replace the traditional RNN. The LSTM-RNN shows better performance [22, 23, 24], and the LSTM formulas are shown below:
$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \\
m_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
c_t &= f_t \cdot c_{t-1} + i_t \cdot m_t \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \\
h_t &= o_t \cdot \tanh(c_t),
\end{aligned}
$$
where $W_{**}$ are the weight matrix parameters, $b_*$ are the biases and $\sigma$ is the sigmoid function. The detail of the structure can be found in Figure 1.

2.2. LSTM Language Model

The LSTM language model uses the current word as the input and the next word as the output. In detail, suppose $x_1, x_2, \ldots, x_T$ is the input sequence, $x_i$ is the $i$-th word, and the vocabulary size is $n$. The input layer of the LSTM is a word embedding layer with size $n$, and the output layer of the LSTM is a softmax layer with size $n$. The detailed formulas are shown below:
$$
\begin{aligned}
\bar{x}_i &= f(x_i) \\
h_i &= \mathrm{LSTM}(\bar{x}_i, h_{i-1}) \\
p_i &= \mathrm{softmax}(W h_i + b) \\
x_{i+1} &= \arg\max p_i,
\end{aligned}
$$
where $f$ represents the word embedding and $W, b$ are the network parameters. Figure 2 shows the structure of the LSTM language model. At the $i$-th time step, $x_i$ is the input to the LSTM, and the output value $p_i = (p_i^{(1)}, p_i^{(2)}, \ldots, p_i^{(n)})$ is considered to be the probability of observing each word at time step $i+1$, i.e. $p(x_{i+1} \mid x_1, x_2, \ldots, x_i) = p_i^{(x_{i+1})}$.

Fig. 2. The structure of the LSTM language model. Here $x_1, x_2, \ldots, x_T$ is the input sequence.

To train the LSTM language model, the cross entropy (CE) between the output distribution $p_i$ and the ground truth distribution $g_i = (0, \ldots, 0, 1, 0, \ldots, 0)$, with the 1 at position $x_{i+1}$, is used as the criterion, i.e. the loss function is
$$L = \mathrm{CE}(g_i, p_i) = -\sum_{j=1}^{n} g_i^{(j)} \log p_i^{(j)}.$$
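To make the LSTM language model of Section 2.2 concrete, the following is a minimal sketch in PyTorch (the toolkit is an assumption; the class name and sizes such as `embed_dim` and `hidden_dim` are illustrative, not the authors' configuration).

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Minimal LSTM LM: embed the current word, predict the next word (Section 2.2)."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=300, num_layers=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)      # f(x_i)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)           # W h_i + b

    def forward(self, word_ids, state=None):
        # word_ids: (batch, seq_len) integer word indices
        emb = self.embedding(word_ids)                            # \bar{x}_i
        hidden, state = self.lstm(emb, state)                     # h_i
        logits = self.output(hidden)                              # pre-softmax scores
        return logits, state

# Training with cross entropy against the next word (the one-hot g_i):
model = LSTMLanguageModel(vocab_size=10000)
criterion = nn.CrossEntropyLoss()
words = torch.randint(0, 10000, (8, 20))             # toy batch of word ids
logits, _ = model(words[:, :-1])                     # predict from the history
loss = criterion(logits.reshape(-1, 10000), words[:, 1:].reshape(-1))
loss.backward()
```

`nn.CrossEntropyLoss` applies the softmax internally, so the model only produces logits; this mirrors the CE criterion above.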
3. METHODOLOGY

3.1. Future Vector Extraction
Traditional LSTM language models only predict a single word for the given history, which may lose information about the whole future. In contrast, in the newly proposed enhanced LSTM language model, the rest of the sequence is embedded into a sequence vector. This sequence vector, which is called the future vector in this paper, contains the information about the whole future of the sequence.

Fig. 3. The structure of the future vector extractor. Here $x_1, x_2, \ldots, x_T$ is the input sequence and $z_1, z_2, \ldots, z_T$ are the extracted future vectors.

There are several ways [26, 27, 28] to extract future vectors. What is needed here is that, for a given input sequence, each suffix needs to be embedded and the relationship among them must be kept. Therefore a method similar to [29] has been chosen. A normal LSTM language model with reversed input sequence order is trained, which means this LSTM language model predicts the previous word given the future. The future vector is extracted from the activation values of the last hidden layer of this reversed LSTM language model. Figure 3 shows the detailed structure and the formulas are shown below:
$$
\begin{aligned}
\bar{x}_i &= f(x_i) \\
z_i &= \mathrm{LSTM}(\bar{x}_i, z_{i+1}) \\
p_i &= \mathrm{softmax}(W z_i + b) \\
x_{i-1} &= \arg\max p_i,
\end{aligned}
$$
where $f$ is the word embedding and $W, b$ are model parameters. $z_1, z_2, \ldots, z_T$ are the extracted future vectors.

3.2. Enhanced LSTM Language Model

Future vectors cannot be directly used to train a language model. For an input sequence $x_1, x_2, \ldots, x_T$ and its future vectors $z_1, z_2, \ldots, z_T$, only the history $x_1, x_2, \ldots, x_i$ is known while the language model is trying to predict the word $x_{i+1}$. However, the future vector $z_{i+1}$ is a function of the unknown future $x_{i+1}, x_{i+2}, \ldots, x_T$, so it cannot be generated at prediction time.

One additional LSTM network is trained to solve this problem. This network is similar to the normal LSTM language model but predicts the future vector rather than the next word. The detailed formulas are
$$
\begin{aligned}
\bar{x}_i &= f(x_i) \\
h_i &= \mathrm{LSTM}(\bar{x}_i, h_{i-1}) \\
y_{i+1} &= W h_i + b,
\end{aligned}
$$
where $f$ is the word embedding and $W, b$ are network parameters. The criterion to train this network is the mean squared error (MSE) between the future vector prediction $y_i$ and the truly extracted future vector $z_i$ described in Section 3.1, i.e. the error function is
$$L = \mathrm{MSE}(y_i, z_i) = \frac{1}{m} \sum_{j=1}^{m} \left(y_i^{(j)} - z_i^{(j)}\right)^2,$$
where $m$ is the dimension of the future vector.

Fig. 4. The structure of the enhanced LSTM language model. Here $x_1, x_2, \ldots, x_T$ is the input sequence and $y_1, y_2, \ldots, y_T$ are the predicted future vectors. In practice, the two LSTM networks are trained separately.

$y_i$ is a function of $x_1, x_2, \ldots, x_{i-1}$, which means it can be directly used to train a language model. In the enhanced LSTM language model, $y_{i+1}$ is combined together with $x_i$ as the new input of the LSTM language model, i.e.
$$
\begin{aligned}
\bar{x}_i &= f(x_i) \\
h_i &= \mathrm{LSTM}(\bar{x}_i, y_{i+1}, h_{i-1}) \\
p_i &= \mathrm{softmax}(W h_i + b) \\
x_{i+1} &= \arg\max p_i,
\end{aligned}
$$
where $f$ indicates the word embedding and $W, b$ are network parameters. The criterion is CE, the same as for the normal LSTM language model in Section 2.2. The detailed structure is illustrated in Figure 4.
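As a concrete illustration of Sections 3.1 and 3.2, the sketch below (PyTorch assumed; class and variable names such as `ReversedLSTMExtractor` and `fv_dim` are illustrative, not from the paper) extracts future vectors with a reversed LSTM and feeds a predicted future vector into the enhanced language model. Concatenating $y_{i+1}$ with the word embedding is one plausible realization of the combined input $\mathrm{LSTM}(\bar{x}_i, y_{i+1}, h_{i-1})$; the paper does not spell out the exact combination.

```python
import torch
import torch.nn as nn

class ReversedLSTMExtractor(nn.Module):
    """Reversed LSTM LM (Section 3.1): run over the reversed sequence; the last
    hidden layer activations z_i embed the suffix (future) at each position."""
    def __init__(self, vocab_size, embed_dim=300, fv_dim=300, num_layers=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, fv_dim, num_layers, batch_first=True)
        self.output = nn.Linear(fv_dim, vocab_size)   # used when training it to predict the previous word

    def extract(self, word_ids):
        # Reverse time so the recurrence carries information from the future.
        rev = torch.flip(word_ids, dims=[1])
        z_rev, _ = self.lstm(self.embedding(rev))
        return torch.flip(z_rev, dims=[1])             # z_i aligned with position i

class EnhancedLSTMLM(nn.Module):
    """Enhanced LSTM LM (Section 3.2): the predicted future vector y_{i+1} is an
    extra input, concatenated here with the embedding of the current word x_i."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=300, fv_dim=300):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + fv_dim, hidden_dim, 3, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, predicted_fv):
        # predicted_fv: (batch, seq_len, fv_dim), the y_{i+1} from the predictor LSTM
        inp = torch.cat([self.embedding(word_ids), predicted_fv], dim=-1)
        hidden, _ = self.lstm(inp)
        return self.output(hidden)                      # logits over the next word
```

The future vector predictor itself would have the same shape as the plain LSTM LM of Section 2.2, with the softmax output layer replaced by a linear layer of size `fv_dim` trained with MSE against the extracted $z$ vectors.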
Compared with the normal LSTM language model, the enhanced LSTM language model has one additional input, the future vector $y_{i+1}$, for predicting the next word. This results in an enhanced LSTM language model with a stronger ability to model future sequence level information.

3.3. Multi-task Training

The enhanced LSTM language model has two networks: one is the future vector prediction LSTM and the other is the language model LSTM. It is observed that these two networks can be trained together, and multi-task training [30, 31, 32] is a suitable method for such joint training.

The prediction of the next word and of the corresponding future vector can be optimized at the same time in the multi-task enhanced LSTM language model. The predicted future vector is also used as an input, as in the non multi-task version. The detailed formulas are
$$
\begin{aligned}
\bar{x}_i &= f(x_i) \\
h_i &= \mathrm{LSTM}(\bar{x}_i, y_i, h_{i-1}) \\
u_i &= \mathrm{LSTM}(h_i, u_{i-1}) \\
y_{i+1} &= W_u u_i + b_u \\
v_i &= \mathrm{LSTM}(h_i, v_{i-1}) \\
p_i &= \mathrm{softmax}(W_v v_i + b_v) \\
x_{i+1} &= \arg\max p_i,
\end{aligned}
$$
where $f$ is the word embedding and $W_*, b_*$ are network parameters. The two criteria used to train this multi-task network are MSE for future vector prediction and CE for word prediction, as in the non multi-task version in Section 3.2, i.e. the loss function is
$$L = \mathrm{CE}(g_i, p_i) + \lambda\, \mathrm{MSE}(y_{i+1}, z_{i+1}),$$
with $\lambda = 1$ in this implementation. The structure is shown in Figure 5.

Fig. 5. The structure of the multi-task enhanced LSTM language model. Here $x_1, x_2, \ldots, x_T$ is the input sequence and $y_1, y_2, \ldots, y_T$ are the predicted future vectors. $y_1$ is a zero vector. In practice, the three LSTM networks are trained together.

The multi-task enhanced LSTM language model obtains not only explicit sequence level information from the input but also implicit sequence level information from the future vector prediction task.
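As a rough sketch of this joint objective (again PyTorch is assumed and the shared/split layer sizes are only illustrative), a shared LSTM feeds two task heads, one regressing the future vector and one predicting the next word. How $y_i$ is fed back during training (for example, the previous step's prediction, with zeros at the first step) is glossed over here.

```python
import torch
import torch.nn as nn

class MultiTaskFVLM(nn.Module):
    """Multi-task enhanced LSTM LM (Section 3.3): a shared LSTM feeds two heads,
    one for the future vector y_{i+1} (MSE) and one for the next word (CE)."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=300, fv_dim=300):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.shared = nn.LSTM(embed_dim + fv_dim, hidden_dim, 2, batch_first=True)
        self.fv_branch = nn.LSTM(hidden_dim, hidden_dim, 1, batch_first=True)
        self.fv_out = nn.Linear(hidden_dim, fv_dim)        # y_{i+1} = W_u u_i + b_u
        self.word_branch = nn.LSTM(hidden_dim, hidden_dim, 1, batch_first=True)
        self.word_out = nn.Linear(hidden_dim, vocab_size)  # softmax over the next word

    def forward(self, word_ids, prev_fv):
        # prev_fv holds y_i (the prediction from the previous step; zeros at i = 1).
        h, _ = self.shared(torch.cat([self.embedding(word_ids), prev_fv], dim=-1))
        u, _ = self.fv_branch(h)
        v, _ = self.word_branch(h)
        return self.fv_out(u), self.word_out(v)

def multitask_loss(word_logits, next_words, fv_pred, fv_target, lam=1.0):
    # L = CE(g_i, p_i) + lambda * MSE(y_{i+1}, z_{i+1}), with lambda = 1 in the paper.
    ce = nn.functional.cross_entropy(word_logits.reshape(-1, word_logits.size(-1)),
                                     next_words.reshape(-1))
    mse = nn.functional.mse_loss(fv_pred, fv_target)
    return ce + lam * mse
```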
Model   | Input          | Output
LSTM    | x_i            | x_{i+1}
FV      | x_i            | y_{i+1}
        | x_i, y_{i+1}   | x_{i+1}
MT-FV   | x_i, y_i       | x_{i+1}, y_{i+1}

Table 1. Brief comparison among the three LSTM language model structures. FV indicates the future vector enhanced LSTM, and MT-FV indicates the future vector enhanced LSTM with multi-task training. x_* indicates the original input sequence and y_* the predicted future vectors.

Table 1 gives a brief comparison of the structures of the normal LSTM, enhanced LSTM and multi-task enhanced LSTM language models.
4. EXPERIMENTS

4.1. Experimental Setup
The experiments are designed to evaluate the performance of the proposed enhanced LSTM language model. Two corpora are used: the PTB English corpus and a Chinese short message (SMS) corpus. The PTB corpus contains 49,199 utterances and the Chinese short message corpus has 403,218 utterances; the vocabulary sizes are 10,000 and 40,697 respectively. Almost the same structure is used in all the systems. Every LSTM block in Figures 2, 3 and 4 is a stacked LSTM with three hidden layers. In Figure 5, the multi-task network has two hidden LSTM layers in the shared part and one hidden LSTM layer in each separate part. All the LSTM hidden layers contain 300 cells.

Both sequence prediction and speech recognition n-best rescoring are evaluated, using the BLEU score and WER respectively.
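Mapping this configuration onto layer definitions, the blocks would be instantiated roughly as below (a sketch only; the input dimension of 300 is an assumption, since the embedding size is not stated in the paper).

```python
import torch.nn as nn

# Each LSTM block in Figures 2-4: a stacked three-layer LSTM, 300 cells per layer.
lstm_block = nn.LSTM(input_size=300, hidden_size=300, num_layers=3, batch_first=True)

# Multi-task network (Figure 5): two shared LSTM layers plus one layer per task head.
shared = nn.LSTM(input_size=300, hidden_size=300, num_layers=2, batch_first=True)
fv_head = nn.LSTM(input_size=300, hidden_size=300, num_layers=1, batch_first=True)
word_head = nn.LSTM(input_size=300, hidden_size=300, num_layers=1, batch_first=True)
```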
4.2. Sequence Prediction

The results of sequence prediction can be found in Table 2. For each test sequence, five different history lengths (0, 1, 2, 3 and 5) were used. The BLEU score calculated between the ground truth and the prediction is used as the evaluation metric.

Corpus | Model      | Perplexity | BLEU (history length 0 / 1 / 2 / 3 / 5)
PTB    | LSTM       | 122        | 0.076 / 0.083 / 0.092 / 0.097 / 0.106
PTB    | FV-LSTM    | 120        | 0.081 / 0.094 / 0.099 / 0.104 / 0.112
PTB    | MT-FV-LSTM | 120        | 0.076 / 0.084 / 0.091 / 0.098 / 0.105
SMS    | LSTM       | 105        | 0.179 / 0.222 / 0.241 / 0.262 / 0.277
SMS    | FV-LSTM    | 102        | 0.212 / 0.243 / 0.261 / 0.273 / 0.285
SMS    | MT-FV-LSTM | 104        | 0.187 / 0.225 / 0.243 / 0.265 / 0.284

Table 2. PPL and BLEU comparison on the sequence prediction task. FV-LSTM indicates the future vector enhanced LSTM, and MT-FV-LSTM indicates the future vector enhanced LSTM with multi-task training. The numbers 0/1/2/3/5 in the BLEU columns give the length of the history.

It can be observed that the PPL stays almost the same across all three systems. This is not surprising, since the enhanced LSTM language model focuses on improving sequence level performance while PPL is a word level metric. However, the enhanced LSTM language model performs consistently better on the BLEU score for all history lengths. This demonstrates that the enhanced LSTM language model can retrieve more sequence level information and obtain better results on a sequence level metric.

To give a better understanding of the comparison, an example is given with the history "Japan however has"; the continuations produced by the three models (traditional LSTM, enhanced LSTM, multi-task enhanced LSTM) are:

• Japan however has a N of its million;
• Japan however has been a major brand for the market;
• Japan however has been a major part of the company.

It can be observed that the enhanced LSTM language model gives more natural results on sequence prediction.
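The evaluation protocol above (greedily continue a sequence from a fixed-length history, then score the continuation with BLEU against the reference suffix) could look roughly like the following. `continue_sequence` assumes a model with the interface of the earlier LSTM LM sketch (returning logits and recurrent state), `sos_id` is a hypothetical start token, and the use of NLTK's sentence-level BLEU with smoothing is an assumption, since the paper does not say how BLEU was computed.

```python
import torch
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def continue_sequence(model, history_ids, target_len, sos_id=0):
    """Greedily extend a history (arg max over p_i at every step)."""
    ids = list(history_ids)
    # Feed the start token plus the history so the LSTM state summarizes it.
    logits, state = model(torch.tensor([[sos_id] + ids]), None)
    ids.append(int(logits[0, -1].argmax()))
    while len(ids) < target_len:
        logits, state = model(torch.tensor([[ids[-1]]]), state)
        ids.append(int(logits[0, -1].argmax()))
    return ids

def bleu_for_history_length(model, reference_ids, history_len):
    """BLEU between the reference suffix and the model's predicted suffix."""
    prediction = continue_sequence(model, reference_ids[:history_len], len(reference_ids))
    smoother = SmoothingFunction().method1
    return sentence_bleu([reference_ids[history_len:]],
                         prediction[history_len:],
                         smoothing_function=smoother)
```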
4.3. N-best Rescoring

The Chinese SMS corpus is used for speech recognition n-best rescoring. In the speech decoding stage, the 100 highest-probability hypotheses are generated for each audio. In the language model rescoring, the language model score is re-calculated by the LSTM and enhanced LSTM language models, and the best path is obtained by combining both the language model score and the acoustic model score. The WER comparison of n-best rescoring with the different LSTM language models is given in Table 3.

Table 3. WER (%) comparison of speech recognition n-best rescoring on the Chinese SMS corpus. FV indicates the future vector enhanced LSTM, and MT-FV indicates the future vector enhanced LSTM with multi-task training. All the models use equally interpolated weights.

It can be observed that all LSTM language models obtain a large improvement over the 3-gram language model, and that the new proposed LSTM language model enhanced with the future vector only obtains a slight gain over the traditional LSTM language model in single model rescoring. However, when multiple LSTM language models are combined for rescoring, as shown in the bottom part of Table 3, the new proposed future vector enhanced LSTM language models appear highly complementary to the traditional LSTM language model. Rescoring using both the new and conventional LSTM language models together achieves another significant improvement compared to single LSTM language model rescoring.
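The rescoring step itself is a weighted score combination over the n-best list. A minimal sketch follows, assuming log-domain scores, an acoustic weight `am_weight`, and equal-weight linear interpolation of the language model probabilities as stated in Table 3; the function and parameter names are illustrative, not from the paper.

```python
import math

def rescore_nbest(hypotheses, lm_scorers, am_weight=1.0):
    """Pick the best hypothesis from an n-best list.

    hypotheses: list of (word_sequence, acoustic_log_score) pairs.
    lm_scorers: list of functions mapping a word sequence to an LM log-probability,
                e.g. the conventional LSTM LM and the FV-enhanced LSTM LM.
    """
    best, best_score = None, -math.inf
    for words, am_logp in hypotheses:
        # Equal-weight interpolation of the language model probabilities.
        lm_logps = [scorer(words) for scorer in lm_scorers]
        lm_logp = math.log(sum(math.exp(s) for s in lm_logps) / len(lm_logps))
        total = am_weight * am_logp + lm_logp
        if total > best_score:
            best, best_score = words, total
    return best
```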
5. CONCLUSION
The traditional LSTM language model only predicts a single word given the history. However, LVCSR needs sequence level predictions, and this mismatch may cause performance degradation. In this paper, a novel enhanced LSTM language model has been proposed. The enhanced LSTM language model retrieves sequence level information from the future vector, which is a special kind of sequence vector, and is therefore able to predict the long term future rather than only the immediate next word. The experiments demonstrated that the proposed enhanced LSTM language model with the future vector performs better on n-best rescoring than the traditional LSTM language model, and that the new and normal LSTM language models are strongly complementary. The results of sequence prediction also indicate that the enhanced LSTM language model can be used for other sequence level tasks.

6. DISCUSSION
The enhanced LSTM language model is an enhanced version of the traditional LSTM language model; it is still a word level supervised neural network model. This is an advantage: in the pipeline of other applications, the traditional LSTM language model can be straightforwardly replaced by the enhanced LSTM language model. However, it also means that the performance of the enhanced LSTM language model relies on the information contained in the future vector and on the prediction accuracy of the future vector prediction network. If the extracted future vector or the predicted future vector is not generated properly, the enhanced LSTM language model may give worse results than the normal LSTM language model. The future work is therefore listed here:

1. add gates to the network to control the balance of word level and sequence level information;
2. try other ways to extract the future vector;
3. implement different methods to predict the future vector;
4. use reinforcement learning to train the network directly with the sequence level evaluation metric;
5. use other sequence level tasks to test the enhanced LSTM language model.
7. ACKNOWLEDGEMENT
This work was supported by the Shanghai Sailing Program No. 16YF1405300, the China NSFC projects (No. 61573241 and No. 61603252) and the Interdisciplinary Program (14JCZ03) of Shanghai Jiao Tong University in China. Experiments have been carried out on the PI supercomputer at Shanghai Jiao Tong University.
8. REFERENCES

[1] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig, "Syntactic clustering of the web," Computer Networks, vol. 29, no. 8-13, pp. 1157–1166, 1997.

[2] Ted Dunning, Statistical Identification of Language, Computing Research Laboratory, New Mexico State University, 1994.

[3] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[4] Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur, "Recurrent neural network based language model," in Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010, 2010, pp. 1045–1048.

[5] X. Chen, T. Tan, Xunying Liu, Pierre Lanchantin, M. Wan, Mark J. F. Gales, and Philip C. Woodland, "Recurrent neural network language model adaptation for multi-genre broadcast speech recognition," in Proceedings of the 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015, 2015, pp. 3511–3515.

[6] Yuening Hu, Michael Auli, Qin Gao, and Jianfeng Gao, "Minimum translation modeling with recurrent neural networks," in Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014, 2014, pp. 20–29.

[7] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz Josef Och, and Jeffrey Dean, "Large language models in machine translation," in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2007, 2007, pp. 858–867.

[8] Xie Chen, Xunying Liu, Mark J. F. Gales, and Philip C. Woodland, "Recurrent neural network language model training with noise contrastive estimation for speech recognition," in Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, 2015, pp. 5411–5415.

[9] Qi Liu, Lijuan Wang, and Qiang Huo, "A study on effects of implicit and explicit language model information for DBLSTM-CTC based handwriting recognition," in Proceedings of the 13th International Conference on Document Analysis and Recognition, ICDAR 2015, 2015, pp. 461–465.

[10] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba, "Sequence level training with recurrent neural networks," in Proceedings of the 4th International Conference on Learning Representations, ICLR 2016, 2016.

[11] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, "Bleu: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL 2002, 2002, pp. 311–318.

[12] Dietrich Klakow and Jochen Peters, "Testing the correlation of word error rate and perplexity," Speech Communication, vol. 38, no. 1-2, pp. 19–28, 2002.

[13] Tianxing He, Yu Zhang, Jasha Droppo, and Kai Yu, "On training bi-directional neural network language model with noise contrastive estimation," in Proceedings of the 10th International Symposium on Chinese Spoken Language Processing, ISCSLP 2016, 2016.

[14] Yangyang Shi, Martha Larson, Pascal Wiggers, and Catholijn Jonker, "Exploiting the succeeding words in recurrent neural network language models," in Proceedings of the 14th Annual Conference of the International Speech Communication Association, INTERSPEECH 2013, 2013, pp. 632–636.

[15] Ebru Arisoy, Abhinav Sethy, Bhuvana Ramabhadran, and Stanley Chen, "Bidirectional recurrent neural network language models for automatic speech recognition," in Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, 2015, pp. 5421–5425.

[16] Sam Wiseman and Alexander M. Rush, "Sequence-to-sequence learning as beam-search optimization," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, 2016, pp. 1296–1306.

[17] Karl Pichotta and Raymond J. Mooney, "Using sentence-level LSTM language models for script inference," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, 2016, pp. 279–289.

[18] Jonas Gehring, Yajie Miao, Florian Metze, and Alex Waibel, "Extracting deep bottleneck features using stacked auto-encoders," in Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, 2013, pp. 3377–3381.

[19] Jeffrey L. Elman, "Finding structure in time," Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.

[20] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," 2001.

[21] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio, "On the difficulty of training recurrent neural networks," in Proceedings of the 30th International Conference on Machine Learning, ICML 2013, 2013, pp. 1310–1318.

[22] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber, "Learning precise timing with LSTM recurrent networks," Journal of Machine Learning Research, vol. 3, pp. 115–143, 2002.

[23] Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, ICML 2006, 2006, pp. 369–376.

[24] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, 2016, pp. 4960–4964.

[25] Alex Graves and Navdeep Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proceedings of the 31st International Conference on Machine Learning, ICML 2014, 2014, pp. 1764–1772.

[26] Quoc V. Le and Tomas Mikolov, "Distributed representations of sentences and documents," in Proceedings of the 31st International Conference on Machine Learning, ICML 2014, 2014, pp. 1188–1196.

[27] Mohit Iyyer, Varun Manjunatha, Jordan L. Boyd-Graber, and Hal Daumé III, "Deep unordered composition rivals syntactic methods for text classification," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, 2015, pp. 1681–1691.

[28] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom, "A convolutional neural network for modelling sentences," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, 2014, pp. 655–665.

[29] Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab K. Ward, "Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 4, pp. 694–707, 2016.

[30] Maofan Yin, Sunil Sivadas, Kai Yu, and Bin Ma, "Discriminatively trained joint speaker and environment representations for adaptation of deep neural network acoustic models," in Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, 2016, pp. 5065–5069.

[31] Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram, "Multi-task learning for classification with Dirichlet process priors," Journal of Machine Learning Research, vol. 8, pp. 35–63, 2007.

[32] Roi Reichart, Katrin Tomanek, Udo Hahn, and Ari Rappoport, "Multi-task active learning for linguistic annotations," in Proceedings of ACL-08: HLT, 2008.