Do as I mean, not as I say: Sequence Loss Training for Spoken Language Understanding
Milind Rao, Pranav Dheram, Gautam Tiwari, Anirudh Raju, Jasha Droppo, Ariya Rastrow, Andreas Stolcke
Amazon Alexa, USA, {milinrao,pddheram,tgautam,ranirudh,drojasha,arastrow,stolcke}@amazon.com
ABSTRACT
Spoken language understanding (SLU) systems extract transcriptions, as well as semantics of intent or named entities, from speech, and are essential components of voice-activated systems. SLU models, which either directly extract semantics from audio or are composed of pipelined automatic speech recognition (ASR) and natural language understanding (NLU) models, are typically trained via differentiable cross-entropy losses, even when the relevant performance metrics of interest are word or semantic error rates. In this work, we propose non-differentiable sequence losses based on SLU metrics as a proxy for semantic error and use the REINFORCE trick to train ASR and SLU models with this loss. We show that custom sequence loss training is the state-of-the-art on open SLU datasets and leads to 6% relative improvement in both ASR and NLU performance metrics on large proprietary datasets. We also demonstrate how the semantic sequence loss training paradigm can be used to update ASR and SLU models without transcripts, using semantic feedback alone.
Index Terms — speech recognition, spoken language understanding, REINFORCE, multitask training, neural interfaces
1. INTRODUCTION
Spoken language understanding systems that aim to understand user commands are an integral part of voice interfaces or spoken dialogue systems. Our focus is on developing compact models that can be deployed on edge devices, allowing low-latency processing without transmitting audio and/or transcripts to cloud servers and enabling offline use in remote, medical, vehicular, or emergency environments. Table 1 shows an example of the transcript and semantics of an utterance. A conventional deployment for SLU comprises two distinct pipelined stages: (1) ASR to transcribe utterances, and (2) an NLU system that consumes the transcription and produces utterance intent and named entities or slots.
A pipelined or compositional deployment would make use of end-to-end (E2E) ASR architectures such as RNN-T [1], CTC [2], Transformer-transducers [3], LAS [4], or conventional RNN-HMM hybrid ASR systems [5]. Extracting intent and slots from transcripts is a long-running problem in NLU [6, 7, 8] that uses LSTMs or Transformers [9, 10]. The interface between ASR and NLU systems has traditionally been the single best hypothesis generated by ASR, although richer interfaces such as lattices and word confusion networks have also been proposed [11, 12, 13, 14].
Table 1: An example of intent and slots for an utterance.

  Transcript  set an alarm for six a.m.
  Intent      SetNotificationIntent
  Slots       NotificationType - alarm, Time - six a.m.

With the compositional approach listed above, ASR errors cascade down to the NLU system; ASR is not trained aware of downstream NLU use, and NLU is not trained to compensate for ASR ambiguity or errors. [15] first introduced multi-stage, multi-task, and joint models for E2E SLU. Most prior work in this space [16, 17, 18, 19, 20] directly computes a serialization of the semantics without intermediate text output. Another common approach uses transfer learning of pretrained ASR models to SLU tasks by replacing the final layer. In contrast, [21] used pretrained ASR models and NLU architectures and replaced the one-best ASR hypothesis interface with a neural network interface, allowing joint training of ASR and NLU.

ASR systems are typically first trained with differentiable losses such as cross-entropy (CE), CTC, or RNN-T. NLU systems are trained using CE losses for classification problems like intent, domain, or named entity tags. E2E SLU systems make use of cross-entropy on either transcripts, intents, slots, or some serialization of semantics. The CE metric is simply a proxy for, and does not directly minimize, the SLU metrics of interest. REINFORCE [22] can be used to train with arbitrary non-differentiable loss functions. This was extended to mWER training for ASR [23], LAS [24], and RNN-T [25]. REINFORCE corresponds to the policy gradient approach among other reinforcement learning methods for seq2seq networks [26].
We consider the class of SLU models composed of multistage ASR and NLU subsystems, connected via text, subword tokens, or neural interfaces, that can be jointly trained. In these systems, ASR is trained with backpropagation of semantic feedback from NLU, and NLU is trained to be aware of ASR ambiguity and errors, preventing the downward cascade of ASR errors seen in compositional systems. We consider ASR systems based on LAS [4] and LSTM- or Transformer-encoder-based NLU systems.

We first develop custom sequence loss training approaches to make use of arbitrary non-differentiable risk values or losses on the entire sequence of outputs. Similar to minimum word error rate (mWER) training, we develop minimum semantic error rate (mSemER) training that directly minimizes intent and slot errors. We introduce alternatives that additionally factor in interpretation (concept) and word errors. Using datasets with complete ASR transcriptions and NLU annotations, we first show significant gains in both ASR and NLU metrics from sequence loss training across datasets ranging from 15 to 15,000 hours, covering limited to general use cases. We beat all known external benchmarks on the open Fluent speech dataset [20]. As a further application of sequence loss training, we show how a dataset with audio and semantic annotations, but without human transcriptions, can still be used to drive ASR and SLU model improvements.

[Fig. 1: E2E SLU architectures, including the ASR subsystem, the neural NLU subsystem, and 3 interfaces: token, text, and neural.]

[Fig. 2: Training SLU models with non-differentiable sequence losses. The dotted box encompasses the ASR model including the decoder.]
2. TECHNICAL APPROACH

2.1. ASR-Interface-NLU Models
We consider SLU models that comprise an ASR subsystem and an NLU subsystem connected by an interface that passes the 1-best or sampled ASR hypotheses, or a neural network hidden layer.
The ASR subsystem is an attention-based Listen, Attend and Spell (LAS) model, shown in the green box of Fig. 1. The LAS used here primarily comprises two components: a stacked RNN encoder that encodes audio frames $x$ to generate representations, and an auto-regressive RNN decoder that sequentially generates logits or a subword probability distribution $p_{w,i}(w_i) = P(w_i \mid \{w\}_{<i}, x)$ at each decoding step by using multiple attention heads to attend to the audio encoding. In this work, we focus on the LAS ASR subsystem, but the results can be extended to other architectures, such as streaming-compatible RNN-T systems or Transformer-based ASR architectures.

The neural NLU subsystem, shown in the yellow box of Fig. 1, accepts a sequence of embeddings or features of tokens decoded by ASR and passes it through multiple BLSTM layers. The outputs of the final layer are used to generate the probability distribution $p_{s,i}(s_i) = P(s_i \mid w, x)$ of the slots $s$ for each subword token; the slot of a word is the slot of the last token of the word. The outputs or features of the final layer are also max-pooled and passed through feed-forward networks to perform the sequence classification task of obtaining logits over the utterance intent, $p_{\text{intent}}(\text{intent}) = P(\text{intent} \mid w, x)$.

We also present results using a Transformer-encoder NLU architecture. The Transformer encoder replaces the BLSTM and applies multiple layers of self-attention to the embeddings/features of the transcript tokens to produce latent representations that are used to obtain slot and intent logits.
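To make the subsystem description concrete, the following is a minimal PyTorch sketch of the BLSTM-based neural NLU subsystem: per-token slot logits from a BLSTM encoder, and intent logits from max-pooled features passed through feed-forward layers. PyTorch, the class name, and all dimensions are illustrative assumptions, not the exact production configuration.

```python
import torch
import torch.nn as nn

class NeuralNLU(nn.Module):
    """Sketch of the neural NLU subsystem: BLSTM layers over per-token
    features from the ASR interface, a dense layer for slot logits, and
    max-pooling plus ReLU feed-forward layers for intent logits."""

    def __init__(self, feat_dim=512, hidden=512, num_slots=40, num_intents=18):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.slot_head = nn.Linear(2 * hidden, num_slots)
        self.intent_head = nn.Sequential(
            nn.Linear(2 * hidden, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, num_intents),
        )

    def forward(self, token_feats):               # (B, T, feat_dim)
        h, _ = self.encoder(token_feats)          # (B, T, 2 * hidden)
        slot_logits = self.slot_head(h)           # per-token slot logits
        pooled, _ = h.max(dim=1)                  # max-pool over time
        intent_logits = self.intent_head(pooled)  # utterance-level intent
        return slot_logits, intent_logits
```

The same module applies unchanged whether `token_feats` comes from text-token embeddings or from the neural interface described next; only the feature dimension differs.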
ASR-NLU Interfaces. We design the SLU system, comprising multistage ASR and NLU subsystems, with a choice of interfaces:

• Text from the ASR hypothesis is the interface between ASR and NLU, and embeddings of the tokenized text are the inputs to the NLU system. This allows the use of pre-trained ASR and NLU models, using transcribed audio datasets for the former and text-only NLU datasets for the latter. We also term this the compositional baseline model that chains pretrained ASR and NLU.
• Subword tokens sampled from the posteriors produced by the ASR decoder form the interface with NLU. ASR and NLU can be jointly trained, with NLU trained aware of ASR errors. The Gumbel-softmax sampling approach [27] allows backpropagation of semantic feedback to ASR through the categorical subword token interface.
• A neural network interface computes the feature for the token at decoder step $i$ using the token embedding concatenated with the hidden output layer of the LAS decoder LSTM. This interface allows NLU to be trained aware of ASR errors, local ASR decoding ambiguity, and the audio context, and allows ASR to be trained with semantic backpropagation. A pretrained ASR model, trained on transcribed audio, can be used. These models are also termed joint models in this work.

Traditionally, differentiable cross-entropy loss functions are used in ASR or SLU model training. The ASR system is teacher-forced with the ground-truth transcript subword sequence $w$, and the cross-entropy loss $CE_{asr} = -\sum_i \log p_{w,i}(w_i)$ is calculated using the one-step-ahead decoded subword probability sequence. NLU consumes the features from ASR and is trained with an intent loss $CE_{intent} = -\log p_{\text{intent}}(\text{intent})$ and a slot loss $CE_{slot} = -\sum_i \log p_{s,i}(s_i)$ using the ground-truth intent and slot sequence $s$. During joint ASR-NLU multi-task training, a linear combination of these loss functions is used:

$CE_{total} = CE_{asr} + CE_{intent} + CE_{slot}$   (1)
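A minimal sketch of the multi-task cross-entropy of Eq. (1) under teacher forcing follows; padding masks are omitted for brevity, and all shapes and names are illustrative assumptions.

```python
import torch.nn.functional as F

def multitask_ce(asr_logits, ref_tokens, slot_logits, ref_slots,
                 intent_logits, ref_intent):
    """Sketch of Eq. (1): CE_total = CE_asr + CE_intent + CE_slot.
    asr_logits (B, T, V_subword) and slot_logits (B, T, V_slot) are
    one-step-ahead logits under teacher forcing; ref_tokens and ref_slots
    are (B, T) label ids; intent_logits is (B, V_intent), ref_intent (B,)."""
    ce_asr = F.cross_entropy(asr_logits.transpose(1, 2), ref_tokens)
    ce_intent = F.cross_entropy(intent_logits, ref_intent)
    ce_slot = F.cross_entropy(slot_logits.transpose(1, 2), ref_slots)
    return ce_asr + ce_intent + ce_slot
```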
While the versatile cross-entropy metric allows for end-to-end model training, it serves merely as a differentiable proxy and does not directly optimize the final SLU metrics of interest, such as:

Speech recognition: Word error rate (WER), computed as the ratio of word edit distance (the length of the shortest sequence of insert, delete, and substitute operations over words that transforms the hypothesis into the reference) to sequence length. A slot-WER metric that upweights critical words over carrier phrases may also be used.

Intent classification: Intent classification error rate (ICER) is the primary metric for evaluating intent. This is a recall-based metric.

Slot filling: The semantic error rate (SemER) metric is used to evaluate jointly the intent and slot-filling (NLU) performance. Comparing against a reference of words and their accompanying tags, performance is classified as: (1) correct slots, where slot name and slot value are correctly identified; (2) deletion errors, where a slot name is present in the reference but not the hypothesis; (3) insertion errors, where extraneous slot names are included in the hypothesis; and (4) substitution errors, where the hypothesis has the correct slot name but an incorrect slot value. Intent classification errors are counted as substitution errors.

$\text{SemER} = \dfrac{\#\text{Deletions} + \#\text{Insertions} + \#\text{Substitutions}}{\#\text{Correct} + \#\text{Deletions} + \#\text{Substitutions}}$   (2)

The related interpretation error rate (IRER) metric, also known as concept error rate or simply SLU accuracy, is the fraction of utterances for which a semantic error has been made. For internal datasets, we report relative % improvements in these metrics; for example, IRERR is the relative IRER reduction.

We make use of the REINFORCE framework [22, 24] to directly optimize a non-differentiable semantic metric $M(C)$ of interest on a random candidate $C = \{w, s, \text{intent}\} \in \mathcal{C}$ that the SLU model with weights $\theta$ produces with probability $\Pr(C = c \mid x) = p(c; \theta) = p_{\text{intent}}(\text{intent}) \prod_i p_{w,i}(w_i)\, p_{s,i}(s_i)$. To train the SLU model, we minimize the expected value of the metric $M$ for each utterance, coupled with the cross-entropy loss $CE$ weighted by a parameter $\lambda$:

$\theta^* = \operatorname{argmin}_\theta\; \mathbb{E}[M(C)] + \lambda\, CE$   (3)

Sub-gradient descent solvers require access to $\nabla_\theta \mathbb{E}[M(C)]$. In the sampling approximation to this term, we use an empirical average of an equivalent quantity:

$\nabla_\theta \mathbb{E}[M(C)] = \mathbb{E}\big[(M(C) - \bar{M})\, \nabla_\theta \log p(C; \theta)\big] \approx \frac{1}{n} \sum_{c_i \overset{\text{iid}}{\sim} p(c;\theta)} (M(c_i) - \bar{M})\, \nabla_\theta \log p(c_i; \theta)$,   (4)

where the constant $\bar{M}$ is used to reduce the variance of the estimate. In the n-best approximation,

$\nabla_\theta \mathbb{E}[M(C)] \approx \sum_{c \in \bar{\mathcal{C}}} M(c)\, \nabla_\theta\, \bar{p}(c; \theta)$,   (5)

$\bar{p}(c; \theta) = \dfrac{p(c; \theta)}{\sum_{c' \in \bar{\mathcal{C}}} p(c'; \theta)} \quad \forall\, c \in \bar{\mathcal{C}}$,   (6)

where $\bar{\mathcal{C}}$ is a subset of candidates, here the n-best candidates produced by beam-decoding the ASR subsystem and then applying the NLU model to obtain the intent and slots for each candidate, used to obtain a finite-sample approximation of the expectation. Probabilities $\bar{p}(c; \theta)$ are obtained by zeroing out the probabilities of candidates not in $\bar{\mathcal{C}}$ and normalizing. In either approximation, backpropagation with the non-differentiable metric $M$ is enabled because solvers have access to $\nabla_\theta\, p(c; \theta)$, as $p(c; \theta)$ is a differentiable function of the weights $\theta$. We make use of the n-best approximation in the results section. Thus we run both teacher forcing, to obtain $CE_{total}$, and beam decoding, to obtain candidates $\bar{\mathcal{C}}$, as demonstrated in Fig. 2. As noted in prior work [24], the cross-entropy lends stability to sequence loss training.
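In code, the n-best approximation reduces to a differentiable expectation: renormalize candidate probabilities over the beam (Eq. 6) and weight the detached metric values (Eq. 5), adding the $\lambda$-weighted cross-entropy of Eq. (3). A minimal sketch follows, with a toy SemER counter per Eq. (2) that assumes slots are represented as name-to-value dicts with unique names; all names and the $\lambda$ value are illustrative assumptions.

```python
import torch

def semer(hyp_slots, ref_slots, intent_correct):
    """Toy SemER per Eq. (2) over slot dicts {name: value}; an intent error
    counts as one substitution. Assumes unique slot names per utterance."""
    correct = dels = subs = 0
    for name, value in ref_slots.items():
        if name not in hyp_slots:
            dels += 1                      # slot in reference, not hypothesis
        elif hyp_slots[name] == value:
            correct += 1                   # name and value both match
        else:
            subs += 1                      # right name, wrong value
    ins = sum(1 for name in hyp_slots if name not in ref_slots)
    if not intent_correct:
        subs += 1                          # intent error counts as substitution
    denom = correct + dels + subs
    return (dels + ins + subs) / denom if denom else float(ins > 0)

def nbest_sequence_loss(cand_log_probs, cand_metrics, ce_total, lam=0.1):
    """Eq. (3) with the n-best approximation of Eqs. (5)-(6).
    cand_log_probs: (n,) differentiable log p(c; theta) per candidate (sum of
    ASR token, slot, and intent log-probs). cand_metrics: (n,) metric values
    M(c), e.g. SemER, computed outside the graph and treated as constants."""
    p_bar = torch.softmax(cand_log_probs, dim=0)              # Eq. (6)
    expected_metric = (p_bar * cand_metrics.detach()).sum()   # Eq. (5)
    return expected_metric + lam * ce_total                   # Eq. (3)
```

Minimizing this loss shifts probability mass within the beam toward candidates with lower semantic risk; the gradient flows only through the candidate probabilities, never through the metric itself.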
In Table 2, we describe the choices of semantic metric $M$, candidate probability $p(c; \theta)$, and regularizing cross-entropy function for the custom sequence loss training methods (mSemER, mSLU, mNLU) we propose for joint ASR-NLU model training. We also recover standard mWER training with the WER metric, ASR candidate probability, and cross-entropy. mSLU-ASR is an example of using semantic sequence losses from an external NLU model for ASR model training.

Table 2: By varying the metric $M$ of interest, the candidate probability $p(c; \theta)$, and the regularizing $CE$, different sequence loss training methods can be realized for SLU or ASR models.

  Training         Metric $M$           hyp-prob $p(c;\theta)$                 $CE$
  mWER             WER                  $\bar{p}(w;\theta)$ (ASR)              $CE_{asr}$
  mSLU-ASR         WER + SemER          $\bar{p}(w;\theta)$ (ASR)              $CE_{asr}$
  mSemER           SemER                $\bar{p}(c;\theta)$ (ASR, NLU), Eq. (6)  $CE_{total}$ as in Eq. (1)
  mNLU             SemER + IRER         $\bar{p}(c;\theta)$ (ASR, NLU), Eq. (6)  $CE_{total}$ + $CE_{intent}$
  mSLU             WER + SemER + IRER   $\bar{p}(c;\theta)$ (ASR, NLU), Eq. (6)  $CE_{total}$ + $CE_{intent}$
  Transcript-free  SemER + IRER         $\bar{p}(c;\theta)$, Eq. (6)           $\widetilde{CE}_{total}$, Eq. (7)

For ASR model training, ground-truth transcripts are normally required, primarily for the computation of $CE_{asr}$. We now show how a dataset with audio and only semantic or NLU annotations (intents, slots), with no transcripts, can be used to update ASR models. This weak-label learning problem is motivated by deployments where human transcriptions are not available, but where inferred semantic feedback from downstream dialogue management systems, applications, or user interactions can be used to drive ASR model improvements. We focus on the case where semantic labels are available. In the absence of a reference transcript, the 1-best ASR hypothesis tokens and slots, together with the reference intent, $\tilde{c} = \{\tilde{w}, \tilde{s}, \text{intent}\}$, are treated as the reference in order to prevent catastrophic forgetting of the ASR task. The ASR subsystem is teacher-forced [28] with the sequence $\tilde{w}$, and NLU obtains the intent and slots for the resulting sequence. The cross-entropy can then be computed as

$\widetilde{CE}_{total} = CE_{intent} - \sum_i \big[ \log p_{w,i}(\tilde{w}_i) + \log p_{s,i}(\tilde{s}_i) \big]$,   (7)

without requiring access to a reference transcript. The NLU metrics ICER, SemER, and IRER can be computed from the available labels. The sequence loss training procedure minimizes NLU errors, which also results in better ASR performance. Note that this is not the only approach to obtaining the cross-entropy regularizer: teacher ASR or NLU labels, or mixing in a dataset with transcribed audio, are some alternatives.
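The transcript-free regularizer of Eq. (7) then stands in for Eq. (1). A minimal sketch, assuming the 1-best hypothesis tokens and their decoded slots are available from beam decoding, with shapes as in the earlier cross-entropy sketch:

```python
import torch.nn.functional as F

def transcript_free_ce(asr_logits, hyp_tokens, slot_logits, hyp_slots,
                       intent_logits, ref_intent):
    """Eq. (7): with no reference transcript, the 1-best ASR tokens and the
    slots decoded for them serve as pseudo-references, while the intent term
    uses the true semantic label."""
    ce_intent = F.cross_entropy(intent_logits, ref_intent)
    ce_asr = F.cross_entropy(asr_logits.transpose(1, 2), hyp_tokens)
    ce_slot = F.cross_entropy(slot_logits.transpose(1, 2), hyp_slots)
    return ce_intent + ce_asr + ce_slot
```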
3. DATA AND EXPERIMENTAL SETUP
We use datasets that include parallel speech transcripts and NLU annotations of intents and slots:

• Fluent speech dataset: a public dataset [20] of 23k utterances (15 hours) that has been processed to fit the intent and named-entity framework, with 10 intents and 2 slots. (Actions are treated as intents; in addition, (inc/dec)rease_(volume/heat) and (de)activate_music are added to form the 10 intents and the 2 slots of object and location.)
• 18-intent: a dataset of approximately 5.6M utterances (3.3k hours) with utterances from 18 intents in home automation, global, and notifications, and 40 slots
• MoreIntent: 22M utterances (16k hours) spanning 64 intents accounting for 90% of the data and 122 slots accounting for 99% of the slots in the data
• An ASR-only 23k-hour dataset for pretraining the ASR model

Table 3: Performance results on open and proprietary SLU datasets.

(a) The ASR-interface-NLU or joint modeling approach with mSLU sequence loss training beats all baselines on the test and dev splits of the open Fluent speech dataset on IRER or accuracy:

  Model                          Test IRER %  Dev IRER %
  Transformer audio-intent [29]  2.5          -
  Baseline [20]                  1.2          -
  AT-AT (SOTA) [30]              0.5          -
  Oracle neural NLU              0.00         0.00
  Compositional ASR → NLU        0.42         2.15
  ASR-Gumbel-NLU                 0.40         2.05
  Joint SLU - no seq training
(b) Comparison of compositional models and joint models with various sequence loss approaches on the 18-intent eval set of 700k utterances. Performance figures are relative % improvements from row 2, shown as 0%:

  Model                              WERR %  SemERR %  IRERR %  ICERR %
  2   LAS → NLU                      0       0         0        0
  2a  Comp. mWER-LAS → NLU           6.23    1.07      0.96     2.82
  3a  LAS-Gumbel-NLU                 2.04    1.50      0.12     3.87
  4a  Joint mSemER                   6.87    5.66      3.12     7.68
  4b  Joint mNLU                     5.45
  T1  Transformer-NLU, Joint mSemER  7.46

(c) Comparison of compositional baselines and sequence loss approaches on the MoreIntent eval set of 500k utterances. Performance figures are relative % improvements from row M1, shown as 0%:
  Model              WERR %  SemERR %  ICERR %
  M1  Compositional  0       0         0
  M2  Comp mWER

(d) Relative % improvement from a baseline joint ASR-NLU model with transcript-free training on the 18-intent and MoreIntent datasets:
  Dataset  WERR %  SemERR %  ICERR %

Training details: The audio feature is composed of 3 stacked 25 ms LFBE frames with a 10 ms shift. The LAS model has 77M parameters: a 5x512 BLSTM encoder and a 2x1024 LSTM decoder with 4 attention heads of depth 256, projection 728, and a 4500-subword vocabulary. The NLU model has 4 (text interface) to 11 (neural network interface) million parameters, with a 2x512 BLSTM encoder, a dense layer for slots, and 2x512 ReLU feed-forward layers for intent. We also experimented with a 3M-parameter Transformer-encoder NLU model (2 layers, 8 attention heads, 256 units). The LAS model is first pretrained on the 23k-hour dataset and fine-tuned on the specific dataset. With ASR then frozen, NLU is first trained in joint systems, followed by joint ASR-NLU fine-tuning using sequence losses. On the 18-intent dataset, NLU is trained in the joint system for 6 epochs followed by sequence loss training for 2 epochs, taking 1 day on 8 Nvidia Tesla V100 GPUs.
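The staged schedule in the training details can be summarized by the sketch below; `ce_loss_fn` and `seq_loss_fn` are assumed to compute the losses of Eqs. (1) and (3) on a batch, and the optimizer choice, learning rates, and epoch counts are illustrative, not the exact recipe.

```python
import torch

def staged_finetune(asr, nlu, loader, ce_loss_fn, seq_loss_fn,
                    nlu_epochs=6, joint_epochs=2):
    # Stage 1: ASR is pretrained and fine-tuned beforehand, then frozen
    # while the NLU warms up on the cross-entropy loss (Eq. 1).
    for p in asr.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(nlu.parameters(), lr=1e-4)
    for _ in range(nlu_epochs):
        for batch in loader:
            opt.zero_grad()
            ce_loss_fn(asr, nlu, batch).backward()
            opt.step()
    # Stage 2: unfreeze ASR and fine-tune the whole SLU system jointly
    # with the sequence loss (Eq. 3).
    for p in asr.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(list(asr.parameters()) + list(nlu.parameters()),
                           lr=1e-5)
    for _ in range(joint_epochs):
        for batch in loader:
            opt.zero_grad()
            seq_loss_fn(asr, nlu, batch).backward()
            opt.step()
```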
4. RESULTS AND DISCUSSION

Sequence loss training beats baselines
On the open Fluent speech dataset in Table 3a, all ASR-interface-NLU models beat external baselines that directly extract semantics from audio without an intermediate transcript, showing the utility of ASR pretraining. Both the neural and Gumbel-softmax interface joint models outperform compositional text baselines. The joint model with mSLU sequence loss training is the best-performing model, as seen in results on both dev and test splits. This can be categorized as a small dataset of lower semantic complexity, as the oracle NLU model perfectly recovers semantics from ground-truth transcripts.

In the 18-intent dataset results of Table 3b, the NLU metrics degrade substantially from row 1 (NLU consuming the ground-truth transcript) to row 2 (NLU consuming the ASR hypothesis), showing the impact of ASR errors. In row 2a, the LAS model is further trained with the mWER sequence loss, leading to gains in WER as well as NLU metrics. The joint model with mSLU sequence loss training yields the best ASR and NLU metrics. From rows 2a and 4b, we see worse WER for the joint model but better NLU metrics, showing that joint training improves ASR performance where it is relevant to downstream NLU. In row T1, mSemER training was used with a jointly trained LAS ASR and Transformer-encoder NLU system; this has 1M fewer parameters than the joint models with LSTM-based NLU, but shows better performance.
Sequence loss training optimizes a metric of interest
Table 3b shows the impact of the choice of non-differentiable metric $M$ on ASR and NLU performance. mWER training optimizes for WER, but this does not necessarily yield the best NLU metrics (row 2a vs. 4). In rows 4a-c, we use metrics rooted in different definitions of semantic error. The mNLU training, which optimizes SemER, IRER, and ICER, leads to better ICER and IRER than mSemER training, which optimizes only SemER. mSLU training (which adds WER to mNLU) shows the best ASR performance, reflecting the importance of semantic feedback even for ASR training. Thus we can customize any sequence loss to optimize the model performance metric(s) of interest.
Results on a general dataset

The conclusions from the 15-hour and 3.3k-hour datasets carry over to the large 16k-hour MoreIntent dataset, as seen in Table 3c. Row M2 primarily shows improvements in WER from mWER training of ASR, resulting in fewer SLU errors. M2a is an example of semantic sequence loss training of ASR. However, the joint model of M3, trained to optimize SLU metrics, shows the best NLU performance. We thus have a recipe to improve ASR and NLU model performance: train an ASR model with the mWER sequence loss, initialize the ASR subsystem of the joint model with these weights, and train the entire system to minimize SLU sequence losses.
Both ASR and NLU improve with transcript-free training
In Table 3d, we update models from a common starting point using weak-feedback training with only NLU labels. A 5% relative improvement in SemER is seen for both the 18-intent and MoreIntent datasets, along with modest ASR improvements, suggesting that semantic feedback alone can be used to improve both ASR and SLU.
5. CONCLUSION
Edge deployments of ASR and SLU systems for voice-activated assistants require the development of low-footprint, performant models. Prior approaches involving either pipelined ASR and NLU models or end-to-end SLU models use differentiable cross-entropy losses for training, but these do not map to the metrics of interest, such as word and semantic error rates. In this work, we propose non-differentiable semantic sequence losses and use the REINFORCE framework to train ASR and SLU models. Joint training with custom sequence losses lets ASR be trained with semantic feedback from NLU, and NLU be trained aware of ASR errors. We show that both the ASR and NLU performance metrics of SLU systems improve across a range of open and proprietary datasets, beating state-of-the-art models. We also improve and update ASR systems without access to transcripts, using weak feedback via NLU labels alone.
Acknowledgement:
We thank Bach, Ehry, Chul, Shehzad, and the reviewers for helpful technical comments. Abhinav Khattar assisted with Transformers, Jinxi Guo with mWER discussions, and Zhe Zhang with data preparation.

6. REFERENCES

[1] Alex Graves, "Sequence transduction with recurrent neural networks," arXiv preprint arXiv:1211.3711, 2012.
[2] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proc. 23rd International Conference on Machine Learning (ICML), 2006, pp. 369–376.
[3] Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al., "A comparative study on Transformer vs RNN in speech applications," in Proc. IEEE ASRU, 2019, pp. 449–456.
[4] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, "Listen, Attend and Spell: A neural network for large vocabulary conversational speech recognition," in Proc. IEEE ICASSP, 2016.
[5] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, et al., "State-of-the-art speech recognition with sequence-to-sequence models," in Proc. IEEE ICASSP, 2018, pp. 4774–4778.
[6] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer, "Neural architectures for named entity recognition," arXiv preprint arXiv:1603.01360, 2016.
[7] Young-Bum Kim, Sungjin Lee, and Karl Stratos, "OneNet: Joint domain, intent, slot prediction for spoken language understanding," in Proc. IEEE ASRU, 2017, pp. 547–553.
[8] Yonghui Wu, Min Jiang, Jun Xu, Degui Zhi, and Hua Xu, "Clinical named entity recognition using deep learning models," in AMIA Annual Symposium Proceedings, 2017, vol. 2017, p. 1812.
[9] Qian Chen, Zhu Zhuo, and Wen Wang, "BERT for joint intent classification and slot filling," arXiv preprint arXiv:1902.10909, 2019.
[10] Tanja Bunk, Daksh Varshneya, Vladimir Vlasov, and Alan Nichol, "DIET: Lightweight language understanding for dialogue systems," arXiv preprint arXiv:2004.09936, 2020.
[11] Dilek Hakkani-Tür, Frédéric Béchet, Giuseppe Riccardi, and Gokhan Tur, "Beyond ASR 1-best: Using word confusion networks in spoken language understanding," Computer Speech & Language, vol. 20, no. 4, pp. 495–514, 2006.
[12] Matthew Henderson, Milica Gašić, Blaise Thomson, Pirros Tsiakoulis, Kai Yu, and Steve Young, "Discriminative spoken language understanding using word confusion networks," in Proc. IEEE SLT, 2012, pp. 176–181.
[13] Gokhan Tur, Jerry Wright, Allen Gorin, Giuseppe Riccardi, and Dilek Hakkani-Tür, "Improving spoken language understanding using word confusion networks," in Proc. Seventh International Conference on Spoken Language Processing, 2002.
[14] Chao-Wei Huang and Yun-Nung Chen, "Adapting pretrained transformer to lattices for spoken language understanding," in Proc. IEEE ASRU, 2019, pp. 845–852.
[15] Parisa Haghani, Arun Narayanan, Michiel Bacchiani, Galen Chuang, Neeraj Gaur, Pedro Moreno, Rohit Prabhavalkar, Zhongdi Qu, and Austin Waters, "From audio to semantics: Approaches to end-to-end spoken language understanding," in Proc. IEEE SLT, 2018, pp. 720–726.
[16] Sahar Ghannay, Antoine Caubrière, Yannick Estève, Nathalie Camelin, Edwin Simonnet, Antoine Laurent, and Emmanuel Morin, "End-to-end named entity and semantic concept extraction from speech," in Proc. IEEE SLT, 2018, pp. 692–699.
[17] Dmitriy Serdyuk, Yongqiang Wang, Christian Fuegen, Anuj Kumar, Baiyang Liu, and Yoshua Bengio, "Towards end-to-end spoken language understanding," in Proc. IEEE ICASSP, 2018, pp. 5754–5758.
[18] Yao Qian, Rutuja Ubale, Vikram Ramanaryanan, Patrick Lange, David Suendermann-Oeft, Keelan Evanini, and Eugene Tsuprun, "Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system," in Proc. IEEE ASRU, 2017, pp. 569–576.
[19] Natalia Tomashenko, Christian Raymond, Antoine Caubrière, Renato De Mori, and Yannick Estève, "Dialogue history integration into end-to-end signal-to-concept spoken language understanding systems," in Proc. IEEE ICASSP, 2020, pp. 8509–8513.
[20] Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, and Yoshua Bengio, "Speech model pre-training for end-to-end spoken language understanding," arXiv preprint arXiv:1904.03670, 2019.
[21] Milind Rao, Anirudh Raju, Pranav Dheram, Bach Bui, and Ariya Rastrow, "Speech to Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces," in Proc. Interspeech, 2020, pp. 876–880.
[22] Ronald J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3–4, pp. 229–256, 1992.
[23] Biing-Hwang Juang, Wu Hou, and Chin-Hui Lee, "Minimum classification error rate methods for speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 5, no. 3, pp. 257–265, 1997.
[24] Rohit Prabhavalkar, Tara N. Sainath, Yonghui Wu, Patrick Nguyen, Zhifeng Chen, Chung-Cheng Chiu, and Anjuli Kannan, "Minimum word error rate training for attention-based sequence-to-sequence models," in Proc. IEEE ICASSP, 2018, pp. 4839–4843.
[25] Jinxi Guo, Gautam Tiwari, Jasha Droppo, Maarten Van Segbroeck, Che-Wei Huang, Andreas Stolcke, and Roland Maas, "Efficient Minimum Word Error Rate Training of RNN-Transducer for End-to-End Speech Recognition," in Proc. Interspeech, 2020, pp. 2807–2811.
[26] Yaser Keneshloo, Tian Shi, Naren Ramakrishnan, and Chandan K. Reddy, "Deep reinforcement learning for sequence-to-sequence models," IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 7, pp. 2469–2489, 2019.
[27] Eric Jang, Shixiang Gu, and Ben Poole, "Categorical reparameterization with Gumbel-softmax," arXiv preprint arXiv:1611.01144, 2016.
[28] Ronald J. Williams and David Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, vol. 1, no. 2, pp. 270–280, 1989.
[29] Martin Radfar, Athanasios Mouchtaris, and Siegfried Kunzmann, "End-to-End Neural Transformer Based Spoken Language Understanding," in Proc. Interspeech, 2020, pp. 866–870.
[30] Subendhu Rongali, Beiye Liu, Liwei Cai, Konstantine Arkoudas, Chengwei Su, and Wael Hamza, "Exploring transfer learning for end-to-end spoken language understanding," arXiv preprint arXiv:2012.08549, 2020.