Learning Spoken Language Representations with Neural Lattice Language Modeling
Chao-Wei Huang Yun-Nung (Vivian) Chen
Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
[email protected] [email protected]
Abstract
Pre-trained language models have achieved huge improvement on many NLP tasks. However, these methods are usually designed for written text, so they do not consider the properties of spoken language. Therefore, this paper aims at generalizing the idea of language model pre-training to lattices generated by recognition systems. We propose a framework that trains neural lattice language models to provide contextualized representations for spoken language understanding tasks. The proposed two-stage pre-training approach reduces the demand for speech data and has better efficiency. Experiments on intent detection and dialogue act recognition datasets demonstrate that our proposed method consistently outperforms strong baselines when evaluated on spoken inputs.

The source code is available at: https://github.com/MiuLab/Lattice-ELMo

Introduction

The task of spoken language understanding (SLU) aims at extracting useful information from spoken utterances. Typically, SLU can be decomposed with a two-stage method: 1) an accurate automatic speech recognition (ASR) system transcribes the input speech into text, and then 2) language understanding techniques are applied to the transcribed text. These two modules can be developed separately, so most prior work developed the backend language understanding systems based on manual transcripts (Yao et al., 2014; Guo et al., 2014; Mesnil et al., 2014; Goo et al., 2018).

Despite the simplicity of the two-stage method, prior work showed that a tighter integration between the two components can lead to better performance. Researchers have extended the ASR 1-best results to n-best lists or word confusion networks in order to preserve the ambiguity of the transcripts
(Tur et al., 2002; Hakkani-Tür et al., 2006; Henderson et al., 2012; Tür et al., 2013; Masumura et al., 2018). Another line of research focused on using lattices produced by ASR systems. Lattices are directed acyclic graphs (DAGs) that represent multiple recognition hypotheses; an example of an ASR lattice is shown in Figure 1.

Figure 1: Illustration of a lattice.

Ladhak et al. (2016) introduced LatticeRNN, a variant of recurrent neural networks (RNNs) that generalizes RNNs to lattice-structured inputs in order to improve SLU. Zhang and Yang (2018) proposed a similar idea for Chinese named entity recognition. Sperber et al. (2019), Xiao et al. (2019) and Zhang et al. (2019) proposed extensions that enable the transformer model (Vaswani et al., 2017) to consume lattice inputs for machine translation. Huang and Chen (2019) proposed to adapt a transformer model originally pre-trained on written text to consume lattices in order to improve SLU performance. Buckman and Neubig (2018) also found that utilizing lattices that represent multiple granularities of sentences can improve language modeling.

With the recent introduction of large pre-trained language models (LMs) such as ELMo (Peters et al., 2018), GPT (Radford, 2018) and BERT (Devlin et al., 2019), we have observed huge improvements on natural language understanding tasks. These models are pre-trained on large amounts of written text so that they provide the downstream tasks with high-quality representations. However, applying these models to spoken scenarios poses several discrepancies between the pre-training task and the target task, such as the domain mismatch between written texts and spoken utterances with ASR errors. It has been shown that fine-tuning the pre-trained language models on data from the target tasks can mitigate the domain mismatch problem (Howard and Ruder, 2018; Chronopoulou et al., 2019). Siddhant et al. (2018) focused on pre-training a language model specifically for spoken content with a huge amount of automatic transcripts, which requires a large collection of in-domain speech.

In this paper, we propose a novel spoken language representation learning framework, which focuses on learning contextualized representations of lattices based on our proposed lattice language modeling objective. The proposed framework consists of two stages of LM pre-training to reduce the demand for lattice data. We conduct experiments on benchmark datasets for spoken language understanding, including intent classification and dialogue act recognition. The proposed method consistently achieves superior performance, with relative error reduction ranging from 3% to 42% compared to a pre-trained sequential LM.

Proposed Framework

The two-stage framework that learns contextualized representations for spoken language is proposed and detailed below.

In the SLU task, the model input is an utterance X containing a sequence of words X = [x_1, x_2, ..., x_{|X|}], and the goal is to map X to its corresponding class y. The inputs can also be stored in lattice form; we use edge-labeled lattices in this work. A lattice L = {N, E} is defined by a set of |N| nodes N = {n_1, n_2, ..., n_{|N|}} and a set of |E| transitions E = {e_1, e_2, ..., e_{|E|}}. A weighted transition is defined as e = {prev[e], next[e], w[e], P(e)}, where prev[e] and next[e] denote the previous and next node respectively, w[e] denotes the associated word, and P(e) denotes the transition probability. We use in[n] and out[n] to denote the sets of incoming and outgoing transitions of a node n.
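To make the structure concrete, the following sketch shows one possible in-memory representation of such an edge-labeled lattice. The class and field names are ours rather than the paper's, and the example probabilities are purely illustrative:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    """A weighted transition e = {prev[e], next[e], w[e], P(e)}."""
    prev: int    # prev[e]: index of the source node
    next: int    # next[e]: index of the destination node
    word: str    # w[e]: the associated word
    prob: float  # P(e): the transition probability

@dataclass
class Lattice:
    """Edge-labeled DAG; nodes are assumed to be topologically ordered."""
    num_nodes: int
    transitions: List[Transition]

    def out_edges(self, n: int) -> List[Transition]:
        """out[n]: the outgoing transitions of node n."""
        return [e for e in self.transitions if e.prev == n]

    def in_edges(self, n: int) -> List[Transition]:
        """in[n]: the incoming transitions of node n."""
        return [e for e in self.transitions if e.next == n]

# A lattice in the spirit of Figure 1: two competing hypotheses for the
# first word, then a shared suffix (the probabilities are made up).
lattice = Lattice(
    num_nodes=4,
    transitions=[
        Transition(0, 1, "airfare", 0.3),
        Transition(0, 1, "fair", 0.7),
        Transition(1, 2, "to", 1.0),
        Transition(2, 3, "LA", 1.0),
    ],
)
```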
Lattice Language Modeling

Language models usually estimate p(X) by factorizing it into

p(X) = ∏_{t=1}^{|X|} p(x_t | X_{<t}),

where X_{<t} denotes the words preceding x_t. We generalize this objective to lattice inputs: for each node n, the model predicts the distribution over the words on its outgoing transitions given the lattice prefix L_{<n},

p̂(w | L_{<n}) = softmax(W h_n), with h_n = LatticeRNN(L_{<n}; θ),

where θ denotes the parameters of the LatticeRNN and W denotes the trainable parameters of the decoder. We train our lattice language model by minimizing the KL divergence between the ground truth distribution p(w | L_{<n}), given by the transition probabilities of out[n], and the predicted distribution p̂(w | L_{<n}). Since a sequence of words is simply a linear lattice, the lattice LM can first be pre-trained on sequential texts (Stage 1) and then on lattices (Stage 2); this two-stage pre-training reduces the demand for lattice data.
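As a concrete illustration of this objective, the sketch below computes the loss for a single lattice of the form defined earlier, assuming node states have already been produced by some LatticeRNN encoder. The function name, the per-node averaging, and the `vocab` mapping are our assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def lattice_lm_loss(node_states, lattice, decoder_W, vocab):
    """Lattice LM loss for one lattice (a sketch, not the paper's code).

    node_states: (num_nodes, d) tensor of node representations h_n from a
    LatticeRNN encoder (omitted here); decoder_W: (|V|, d) decoder weights;
    vocab: dict mapping words to integer ids.
    """
    losses = []
    for n in range(lattice.num_nodes):
        out = lattice.out_edges(n)
        if not out:  # the final node has nothing to predict
            continue
        # Ground-truth distribution p(w | L_{<n}): the transition
        # probabilities of out[n], accumulated on their word ids.
        target = torch.zeros(len(vocab))
        for e in out:
            target[vocab[e.word]] += e.prob
        target /= target.sum()  # renormalize in case P(e) do not sum to 1
        # Predicted distribution p̂(w | L_{<n}) = softmax(W h_n), in log space.
        log_pred = F.log_softmax(decoder_W @ node_states[n], dim=-1)
        # KL(p || p̂); F.kl_div expects log-probabilities as its first argument.
        losses.append(F.kl_div(log_pred, target, reduction="sum"))
    return torch.stack(losses).mean()
```

Averaging the per-node KL terms is one of several reasonable reductions; minimizing this loss drives the decoder's predictions toward the lattice's transition probabilities at every node.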
Figure 2: Illustration of the proposed framework (Stage 1: pre-training on sequential texts; Stage 2: pre-training on lattices). The weights of the pre-trained LatticeLSTM LM are fixed when training the target task classifier, while the weights of the newly added LatticeLSTM classifier are trained from scratch.

After pre-training, our model is capable of providing representations for lattices. Following Peters et al. (2018), the pre-trained lattice LM is used to produce contextualized node embeddings for downstream classification tasks, as illustrated in the right part of Figure 2. We use the same strategy as Peters et al. (2018) to linearly combine the hidden states from different layers into a representation for each node. The classifier is a newly added 2-layer LatticeLSTM, which takes the node representations as input, followed by max pooling over nodes, a linear layer and finally a softmax layer. We use the cross entropy loss to train the classifier on each target classification task. Note that the parameters of the pre-trained lattice LM are fixed during this stage.
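A sketch of this classification head is shown below. All module and argument names are our assumptions, and the LatticeLSTM is stubbed as an opaque module since its lattice message passing is beyond the scope of this sketch:

```python
import torch
import torch.nn as nn

class LatticeClassifier(nn.Module):
    """ELMo-style classification head over a frozen lattice LM (a sketch).

    `lattice_lstm` is any module mapping node representations
    (num_nodes, dim) to node states (num_nodes, dim); a faithful
    implementation would also consume the lattice topology.
    """

    def __init__(self, num_lm_layers, dim, num_classes, lattice_lstm):
        super().__init__()
        # Per-layer mixing weights and a global scale (Peters et al., 2018).
        self.scalars = nn.Parameter(torch.zeros(num_lm_layers))
        self.gamma = nn.Parameter(torch.ones(1))
        self.lattice_lstm = lattice_lstm  # newly added, trained from scratch
        self.out = nn.Linear(dim, num_classes)

    def forward(self, lm_layer_states):
        # lm_layer_states: (num_lm_layers, num_nodes, dim) from the frozen LM.
        w = torch.softmax(self.scalars, dim=0)
        node_repr = self.gamma * (w[:, None, None] * lm_layer_states).sum(dim=0)
        h = self.lattice_lstm(node_repr)   # contextualize over the lattice
        pooled, _ = h.max(dim=0)           # max pooling over nodes
        return self.out(pooled)            # class logits for cross entropy
```

During target-task training, only the mixing scalars, the LatticeLSTM classifier and the output layer receive gradients; the lattice LM that produces `lm_layer_states` stays frozen, matching Figure 2.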
Experiments

In order to evaluate the quality of the pre-trained lattice LM, we conduct experiments on two common spoken language understanding tasks: intent detection and dialogue act recognition.

The benchmark datasets used for intent detection are ATIS (Airline Travel Information Systems) (Hemphill et al., 1990; Dahl et al., 1994; Tur et al., 2010) and SNIPS (Coucke et al., 2018). We use the NXT-format of the Switchboard (Stolcke et al., 2000) Dialogue Act Corpus (SWDA) (Calhoun et al., 2010) and the ICSI Meeting Recorder Dialogue Act Corpus (MRDA) (Shriberg et al., 2004) for benchmarking dialogue act recognition. The SNIPS corpus only contains written text, so we synthesize a spoken version of the dataset using a commercial text-to-speech service. We use an ASR system trained on WSJ (Paul and Baker, 1992) with Kaldi (Povey et al., 2011) to transcribe ATIS, and an ASR system released by Kaldi to transcribe the other datasets. The statistics of the datasets are summarized in Table 1. All tasks are evaluated with overall classification accuracy.

Table 1: Data statistics.

                    ATIS     SNIPS    SWDA      MRDA
  Train
  Valid             500      700      8,989     15,037
  Test              869      700      15,927    14,800
  #Classes          22       7        43        5
  WER (%)
  Oracle WER (%)

In order to conduct a fair comparison with ELMo (Peters et al., 2018), we directly adopt their pre-trained model as our pre-trained sequential LM. The hidden size of the LatticeLSTM classifier is set to 300. We use Adam as the optimizer, with learning rate 0.0001 for LM pre-training and 0.001 for training the classifier. The checkpoint with the best validation accuracy is used for evaluation.

The results in terms of classification accuracy are shown in Table 2. All reported numbers are averaged over at least three training runs. Rows (a) and (b) can be considered as the performance upper bound, where we use manual transcripts to train and evaluate the models. We also use BERT-base (Devlin et al., 2019) as a strong baseline, which takes the ASR 1-best as input (row (g)). Compared with the results on manual transcripts, using ASR results largely degrades the performance due to recognition errors, as shown in rows (e)-(g). In addition, adding pre-trained ELMo embeddings brings consistent improvement over the biLSTM baseline, except for SNIPS when using manual transcripts (row (b)). The baseline models trained on ASR 1-best are also evaluated on lattice oracle paths; we report these results as the performance upper bound for the baseline models (rows (c)-(d)).

Table 2: Results of our experiments in terms of accuracy (%). Some audio files in ATIS are missing, so the testing sets of manual transcripts and ASR transcripts are different. Hence, we do not report the results for ATIS using manual transcripts. The best results obtained by using ASR output for each dataset are marked in bold.

                                  ATIS    SNIPS   SWDA    MRDA
  Manual
  (a) biLSTM                      -       97.00   71.19   79.99
  (b) (a) + ELMo                  -       96.80   72.18   81.48
  Lattice oracle
  (c) biLSTM                      92.97   94.02   63.92   70.49
  (d) (c) + ELMo                  96.21   95.14   65.14   73.34
  ASR 1-best
  (e) biLSTM                      91.60   91.89   60.54   67.35
  (f) (e) + ELMo                  94.99   91.98   61.65   68.52
  (g) BERT-base
  Lattice
  (h) LatticeLSTM
  (i) Proposed
  (j) (i) w/o Stage 1             94.65   95.19   61.81   71.71
  (k) (i) w/o Stage 2             95.35   94.58   62.41   71.66
  (l) (i) evaluated on 1-best     95.05   92.40   61.12   68.04

In the lattice setting, the baseline bidirectional LatticeLSTM (Ladhak et al., 2016) (row (h)) consistently outperforms the biLSTM with 1-best input (row (e)), demonstrating the importance of taking lattices into account. Our proposed method achieves the best results on all datasets except for ATIS (row (i)), with relative error reduction ranging from 3.2% to 42% compared to biLSTM + ELMo (row (f)). The proposed method also achieves performance comparable to BERT-base on ATIS. We perform an ablation study for the proposed two-stage pre-training method and report the results in rows (j) and (k). It is clear that skipping either stage degrades the performance on all datasets, demonstrating that both stages are crucial in the proposed framework. We also evaluate the proposed model on 1-best results (row (l)); the results show that it is still beneficial to use lattices as input after fine-tuning.

Conclusion

In this paper, we propose a spoken language representation learning framework that learns contextualized representations of lattices. We introduce the lattice language modeling objective and a two-stage pre-training method that efficiently trains a neural lattice language model to provide downstream tasks with contextualized lattice representations. The experiments show that our proposed framework is capable of providing high-quality representations of lattices, yielding consistent improvements on SLU tasks.

Acknowledgement

We thank the reviewers for their insightful comments. This work was financially supported by the Young Scholar Fellowship Program of the Ministry of Science and Technology (MOST) in Taiwan, under Grant 109-2636-E-002-026.

References

Jacob Buckman and Graham Neubig. 2018. Neural lattice language models. Transactions of the Association for Computational Linguistics, 6:529–541.

Sasha Calhoun, Jean Carletta, Jason M. Brenier, Neil Mayo, Dan Jurafsky, Mark Steedman, and David Beaver. 2010. The NXT-format Switchboard corpus: a rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue. Language Resources and Evaluation, 44(4):387–419.

Alexandra Chronopoulou, Christos Baziotis, and Alexandros Potamianos. 2019. An embarrassingly simple approach for transfer learning from pretrained language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2089–2095. ACL.

Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.

Deborah A. Dahl, Madeleine Bates, Michael Brown, William Fisher, Kate Hunicke-Smith, David Pallett, Christine Pao, Alexander Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Proceedings of the Workshop on Human Language Technology, pages 43–48.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. ACL.

Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. 2018. Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 753–757. ACL.

Daniel Guo, Gokhan Tur, Wen-tau Yih, and Geoffrey Zweig. 2014. Joint semantic utterance classification and slot filling with recursive neural networks. In Proceedings of the 2014 IEEE Spoken Language Technology Workshop (SLT), pages 554–559.

Dilek Hakkani-Tür, Frédéric Béchet, Giuseppe Riccardi, and Gokhan Tur. 2006. Beyond ASR 1-best: Using word confusion networks in spoken language understanding. Computer Speech & Language, 20(4):495–514.

Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.

Matthew Henderson, Milica Gašić, Blaise Thomson, Pirros Tsiakoulis, Kai Yu, and Steve Young. 2012. Discriminative spoken language understanding using word confusion networks. In Proceedings of the 2012 IEEE Spoken Language Technology Workshop (SLT), pages 176–181.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339. ACL.
Chao-Wei Huang and Yun-Nung Chen. 2019. Adapting pretrained transformer to lattices for spoken language understanding. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop, pages 845–852.

Faisal Ladhak, Ankur Gandhe, Markus Dreyer, Lambert Mathias, Ariya Rastrow, and Björn Hoffmeister. 2016. LatticeRNN: Recurrent neural networks over lattices. In Proceedings of INTERSPEECH, pages 695–699.

Ryo Masumura, Yusuke Ijima, Taichi Asami, Hirokazu Masataki, and Ryuichiro Higashinaka. 2018. Neural confnet classification: Fully neural network based spoken utterance classification using word confusion networks. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6039–6043.

Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, and Geoffrey Zweig. 2014. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):530–539.

Douglas B. Paul and Janet M. Baker. 1992. The design for the Wall Street Journal-based CSR corpus. In Proceedings of the Workshop on Speech and Natural Language, HLT '91.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. ACL.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. 2011. The Kaldi speech recognition toolkit. Technical report.

Alec Radford. 2018. Improving language understanding by generative pre-training.

Elizabeth Shriberg, Raj Dhillon, Sonali Bhagat, Jeremy Ang, and Hannah Carvey. 2004. The ICSI meeting recorder dialog act (MRDA) corpus. In Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004, pages 97–100, Cambridge, Massachusetts, USA. ACL.

Aditya Siddhant, Anuj Goyal, and Angeliki Metallinou. 2018. Unsupervised transfer learning for spoken language understanding in intelligent agents. arXiv preprint arXiv:1811.05370.

Matthias Sperber, Graham Neubig, Ngoc-Quan Pham, and Alex Waibel. 2019. Self-attentional models for lattice inputs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1185–1197. ACL.

Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–374.

Gökhan Tür, Anoop Deoras, and Dilek Z. Hakkani-Tür. 2013. Semantic parsing using word confusion networks with conditional random fields. In Proceedings of INTERSPEECH.

Gokhan Tur, Dilek Hakkani-Tür, and Larry Heck. 2010. What is left to be understood in ATIS? In Proceedings of the 2010 IEEE Spoken Language Technology Workshop (SLT), pages 19–24.

Gokhan Tur, Jerry Wright, Allen Gorin, Giuseppe Riccardi, and Dilek Hakkani-Tür. 2002. Improving spoken language understanding using word confusion networks. In Seventh International Conference on Spoken Language Processing.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010. Curran Associates Inc.

Fengshun Xiao, Jiangtong Li, Hai Zhao, Rui Wang, and Kehai Chen. 2019. Lattice-based transformer encoder for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3090–3097. ACL.

Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. 2014. Spoken language understanding using long short-term memory neural networks. In Proceedings of the 2014 IEEE Spoken Language Technology Workshop (SLT), pages 189–194.

Pei Zhang, Niyu Ge, Boxing Chen, and Kai Fan. 2019. Lattice transformer for speech translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6475–6484. ACL.

Yue Zhang and Jie Yang. 2018. Chinese NER using lattice LSTM. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). ACL.