End2End Acoustic to Semantic Transduction
Valentin Pelloin, Nathalie Camelin, Antoine Laurent, Renato De Mori, Antoine Caubrière, Yannick Estève, Sylvain Meignier
LIUM - Le Mans University - France; LIA - Avignon University - France; McGill University - Montréal, Canada
ABSTRACT
In this paper, we propose a novel end-to-end sequence-to-sequence spoken language understanding model using an attention mechanism. It reliably selects contextual acoustic features in order to hypothesize semantic contents. An initial architecture capable of extracting all pronounced words and concepts from acoustic spans is designed and tested. With a shallow fusion language model, this system reaches a 13.6% concept error rate (CER) and an 18.5% concept value error rate (CVER) on the French MEDIA corpus, achieving an absolute 2.8 point reduction compared to the state of the art. Then, an original model is proposed for hypothesizing concepts and their values. This transduction reaches a 15.4% CER and a 21.6% CVER without any new type of context.
Index Terms — spoken language understanding, neural networks, attention mechanisms, sequence-to-sequence, transfer learning
1. INTRODUCTION
Spoken Language Understanding (SLU) systems extract semantic contents mentioned in spoken sentences for task-oriented dialogues, question answering, and other conversational applications.

Semantic contents are fragments of a domain application ontology that can be defined with a frame language, as described in [1]. Frame structures are rich representations of knowledge structures with important properties, such as type models of slot fillers and inheritance of slot properties through chains of frame-slot relations. With a frame language, a concept is represented as a relation between a frame name (e.g. ADDRESS), a frame property, often called slot (e.g. city), and a frame value, often called slot filler (e.g. Paris). Properties can be inherited through a chain of slots. Using this semantic representation, speech acts such as requests can be defined as frames with slots filled by instances of frame structure fragments.
For the sake of simplicity, in the following, slots and slot fillers are denoted as concept and value. Concepts can be annotated with word sequences called supports. Supports can be instantiated with hypotheses obtained by an Automatic Speech Recognition (ASR) system, as described in [2, 3, 4, 5], or by direct transduction that generates instances of frame structures from acoustic features.

Supports are word spans expressed by a span of acoustic features. Such a span may have variable duration and depends on the speaker and other factors. If a value is mentioned, then it is often included in a support. Spans of acoustic feature supports may not be sufficient for unambiguously mentioning a concept instance, even if they contain relevant semantic features. This equivocation can be reduced by selecting relevant spans of acoustic feature contexts using an attention mechanism.

Human experience in understanding spoken sentences, especially in foreign languages, provides evidence that concepts can be hypothesized from spoken language tokens without thinking about, or even knowing, the exact spelling of some relevant words. This motivates the use of a model based on a transduction of acoustics into semantic content supports. In this paper, an end-to-end (E2E) SLU architecture is proposed in order to hypothesize fragments of frame instances from acoustic features. It is applied to user turns of the French MEDIA corpus, corresponding to a complex negotiation dialogue task (see Section 4.1). This task is particularly difficult since a dialogue turn may contain a large variety of concept-value instances, including repetitions and self-corrections with specific semantic relations between these contents.

A novelty of the proposed model is the introduction of an attention mechanism that selects contextual acoustic features in order to directly hypothesize concept symbols and their values. An initial architecture is first designed and tested in order to hypothesize words and concepts. An enriched architecture is then proposed in order to hypothesize concepts, delimiting their supports of underlying acoustic representations while limiting the generation of character hypotheses to values only. As a consequence, no human prior knowledge is needed to extract values from the words of a concept support.

The paper is organized as follows. Section 2 presents related work in the SLU domain. Section 3 presents our E2E SLU architecture, while corpora, experiments and results are described in Section 4. Finally, conclusions and perspectives are given in Section 5.

This work has been funded by the AISSPER project supported by the French National Research Agency (ANR) under contract ANR-19-CE23-0004-01.
2. RELATED WORK
Early speech understanding systems based on natural language semantic parsers and artificial intelligence approaches to beam search and knowledge representation are reviewed in [6]. Further developments, stressing comparisons of different methods using a common annotated corpus, new parsing techniques and statistical models, are reviewed in [7]. Later spoken language understanding systems, with more advanced statistical models and paradigms applied in various types of applications, are described in [1]. More recently, deep neural networks (DNN) for generating semantic domain, intent and slot-filler hypotheses were proposed using a pipeline of automatic speech recognition and SLU models. These architectures provided results better than or comparable to those obtained with previous architectures. Examples can be found in [2, 3, 4]. Result examples and comparisons using SLU in negotiation dialogues are reported in [5]. Attempts to integrate ASR and SLU functions in end-to-end (E2E) compact trainable DNN architectures are proposed in [8, 9, 10, 11, 12, 13].

In this paper, a novel E2E SLU system is introduced and tested on the French MEDIA corpus. It is motivated by the evidence that the perception of spoken semantic entities and relations is associated with the perception of word spans expressing semantic properties and their values. A basic architecture is introduced with attention on suitable acoustic contexts, whose relevance depends on decoder semantic representations. Then, a novel architecture is proposed in order to detect acoustic spans expressing lexical features for clue words of semantic entities and their values. This introduction is motivated by explaining, with prior knowledge, frequent errors observed in the annotated development set.

Recently, methods have been proposed to learn high-level representations from surface acoustic features for speech-to-speech translation [14, 15, 16, 17]. In this paper, we aim to use high-level representations from surface acoustic features for speech-to-semantic transduction.

In [18], relations in semantic knowledge graphs have been used for inferring answers in question-answering dialogues. In this paper, application domain model relations, explicit or inferred by inheritance, are considered for selecting useful context for the interpretation of acoustic spans of domain-relevant concepts.
3. E2E SLU ARCHITECTURE
In this work, we use an encoder-decoder neural network. An attention mechanism, as described in [19], is designed in order to align the input with the output. Hypothesizing spoken phonemes with an ASR system may be difficult if a training set of phonemes aligned with speech is not available. For this reason, a first approach relying on experience with character recognition is used in this paper. Hypothesizing concepts from acoustics through phonemes is left for future work.

The input of the network consists of 40-dimensional MelFBanks extracted with a Hamming window of 25 ms and 10 ms strides. The system output is a sequence of characters predicting words, normalized values, or concept labels. Note that each concept label is represented by one special character (e.g. ´g for the concept hotel-services).

Our attention-based encoder-decoder architecture, depicted in Figure 1, is based on the Espresso recipe initially developed for the WSJ ASR task, as described in [20]. Let \(X = (x_1, \ldots, x_{T_x})\) be the vector of input features and \(Y = (y_1, \ldots, y_{T_y})\) be the outputs of the model. Our model computes the outputs \(Y\) from an input sequence \(X\). The encoder first uses 4 two-dimensional convolutional blocks (each convolution layer is followed by a batch norm). Then, 4 biLSTM layers are used to obtain the encoder hidden states:

\[ \mathrm{Encoder}(X) = (\overleftrightarrow{h}_1, \ldots, \overleftrightarrow{h}_{T_x}) \tag{1} \]

The decoder uses 4 LSTM layers followed by 2 fully-connected layers and a softmax. The decoder states are computed using the attention mechanism, by aligning the encoder hidden states and previous decoder hidden states:

\[ Y = \mathrm{Decoder}(H^{att}) \tag{2} \]

The attention mechanism is applied as described in [19]:

\[ H^{att} = (H^{att}_1, \ldots, H^{att}_{T_y}) \tag{3} \]

\[ H^{att}_i = \sum_{j=1}^{T_x} \alpha_{ij}\, \overleftrightarrow{h}_j \quad \text{with} \quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \tag{4} \]

where \(e_{ij}\) is the relevance of the acoustic representation \(\overleftrightarrow{h}_j\) for the \(i\)-th semantic output. Note that the encoder vectors \(\overleftrightarrow{h}\) are computed using only the information from the top encoder hidden states.

In order to improve the encoder-decoder performance, we use an additional LSTM-based Language Model (LM) that uses look-ahead word probabilities to obtain both word and character probabilities [20, 21]. Shallow-fusion [22] score combination is performed in order to take into account the output of the encoder-decoder SLU system and the LSTM-based LM. During the decoding phase with beam search, hypotheses are scored using the weighted sum of the encoder-decoder model and LM scores, i.e. \(\log P_{SLU}(Y|X) + \lambda \log P_{LM}(Y)\), with \(\lambda\) tuned on the development set.
Fig. 1. The proposed encoder-decoder architecture: the encoder is made of 4 Conv2D blocks followed by 4 biLSTM layers; the decoder applies the attention mechanism, 4 LSTM layers, and 2 fully-connected layers.
4. EXPERIMENTS

4.1. Datasets
Our architecture is pretrained for ASR and then adapted to SLU. Training is detailed in the next section, while this section briefly describes the different corpora.

For the transcription model, we use several French generic broadcast news corpora described in [23], in addition to the training sets of in-domain SLU telephone conversation corpora: PORTMEDIA [24], MEDIA [25] and DECODA [26]. As a result, we obtained a training corpus of 414 hours for ASR. Such a model needs a lot of audio data along with manual transcriptions, even if a frame-level acoustic alignment with words is not needed: the system trains itself to align each output character with input acoustic features thanks to its attention mechanism [27, 28, 29].

For the SLU models, only the 24 hours of the MEDIA (M) and PORTMEDIA (PM) training sets are used during training. These two corpora both contain telephone conversations between two humans in a Wizard-of-Oz mode: one plays a computer, the other a user who wants to obtain information or make a reservation. MEDIA contains requests and lodging bookings in France, and PORTMEDIA contains conversations about theater shows during the French Avignon Festival. All conversations are manually transcribed, and user turns are semantically annotated with concept labels, their supports and the corresponding values. In the Table 1 example, the concept label linkref-coref is supported by the word that, annotated with the normalized value singular.

Turns of these datasets contain word spans expressing semantic content mentions, mixed with other spans that do not express application domain contents. For this task, user turns may express concepts such as logical AND/OR connectors, references to previously mentioned concepts, and mentions of entities relative to other entities. Furthermore, system turns tend to solicit the user to provide information useful for composing a state of the world to obtain a frame structure in an application domain database.

In the experiments section, we present SLU results on the MEDIA corpus, split according to the official training, development (Dev) and test sets, containing respectively 727, 79 and 208 dialogues, and 13k, 1.2k and 3.6k user utterances. User transcripts are annotated using 76 semantic concepts, along with their supports and the corresponding normalized values.
4.2. Models

Inspired by the curriculum-based transfer learning procedure for SLU proposed in [23], we first train a model for the ASR downstream task. For that purpose, we use out-of-domain and in-domain ASR data (as described in the previous section), and then fine-tune using our in-domain corpora. Unlike [23], we did not train our model on the Named Entity Recognition (NER) task. This first model is denoted the ASR model.

Thereafter, the model is retrained for the SLU task using all the in-domain SLU data (MEDIA and PORTMEDIA train sets, denoted M+PM), and then fine-tuned with the target MEDIA data. The transition from ASR to SLU is made by enriching the output with concept symbols. In contrast with [23, 30], the last fully-connected layer is simply extended; as a consequence, the model benefits entirely from its previous training (see the sketch at the end of this section). This second model is denoted the AllWords-C model.

[30] showed that hypothesizing only the words belonging to concept supports enhances the SLU performance evaluation. Thus, we introduce a third model that adapts the AllWords-C model to these specific outputs: concepts and words from supports. A new character is added to the last layer in order to output a single global character * instead of all out-of-concept-support words. This third, SupWords-C, model is obtained with transfer learning on these outputs.

Finally, a novel solution is introduced to perform a direct transduction from acoustic features to concept-value pairs. To do so, we re-adapt the AllWords-C model one last time in order to directly output the normalized values and the concepts. In that case, characters no longer represent pronounced words: they represent either one concept or one character of a normalized value. This model is denoted the NormValues-C model. In that last architecture, frame structure fragments are obtained with a fully automatic E2E sequence-to-sequence architecture.

All models and their output formats are summarized in Table 1. Note that, for the sake of clarity, concepts are written in full instead of their special output characters.
4.3. Results

The architecture shown in Figure 1 is used in order to generate outputs with the different components, all evaluated according to the concept error rate (CER) and concept-value error rate (CVER) metrics. Insertion, deletion, and substitution errors are used to compute the evaluation measures and to perform the error analysis.

Model        | Chain of training              | Output
ASR          | ASR → ASR M+PM                 | is there a swimming-pool in that one
AllWords-C   | ASR model → SLU M+PM → SLU M   | is there a <hotel-services> swimming-pool </hotel-services> in <linkref-coref> that </linkref-coref> <objectbd> one </objectbd>
SupWords-C   | AllWords-C model → SLU M*      | * <hotel-services> swimming-pool </hotel-services> * <linkref-coref> that </linkref-coref> <objectbd> one </objectbd>
NormValues-C | AllWords-C model → SLU Norm M* | * <hotel-services> swimming-pool </hotel-services> * <linkref-coref> singular </linkref-coref> <objectbd> hotel </objectbd>

Table 1. Model (chain of training) and output for each proposed configuration, based on the user utterance "Is there a swimming-pool in that one?".

Normalized values, i.e. the slot fillers, needed for the CVER evaluation, are obtained with a set of manually designed regular expressions in the AllWords-C and SupWords-C configurations. Expressions are applied to the outputs of each concept support, as in [12, 23, 30, 5]. With the NormValues-C outputs, these handmade rules are no longer necessary.
Table 2 summarizes the results obtained with the three considered architectures, without and with LM, on both the MEDIA Dev and Test sets.

Without LM, the best CER results are obtained with the NormValues-C model, both for Dev and Test. This supports the conjecture that concepts can be perceived without considering orthographic word transcriptions. As could be expected, when CERs are similar, the best CVER is achieved using human rules. Nevertheless, very encouraging results are observed, notably on Dev, where the CER is lower than the others. For the sake of comparison, the [23] state-of-the-art results for an E2E approach without LM are 21.6% CER and 27.7% CVER, showing an absolute gain of 6 points.

Unlike in [23], where the best results are obtained with the SupWords-C configuration, our architecture does not benefit from this kind of representation. We suppose that this is due to the architectural difference: attention mechanism (AM) vs. biLSTM/CTC. In the latter, the star is used as a correspondence to the input acoustic frame when predicting outside-of-concept characters.
                  |      Dev      |      Test
% CER / CVER      |  CER  | CVER  |  CER  | CVER
Without a Language Model
AllWords-C [23]   |   –   |   –   | 21.6  | 27.7
AllWords-C        |       |       |       |
SupWords-C        |       |       |       |
NormValues-C      |       |       | 15.4  | 21.6
With a Language Model
AllWords-C [23]   |   –   |   –   | 18.1  | 22.1
SupWords-C [23]   |   –   |   –   | 16.4  | 20.9
AllWords-C        |       |       | 13.6  | 18.5
SupWords-C        |       |       |       |
NormValues-C      |       |       |       |
Table 2. Results (%) obtained on the MEDIA Dev and Test corpora by our models, compared to the state of the art.

In our AM system, such a star is not strictly needed, as the output sequence is not constrained to have the same length as the input: the AM is designed to directly select relevant acoustic spans.

Using word hypotheses and the LM, the best results, 13.6% CER and 18.5% CVER, are observed for AllWords-C, showing a gain of 2 points from the LM and of 2.8 points CER over the best [23] state-of-the-art result, even though the LM brings no significant contribution on NormValues-C and SupWords-C. This can be explained by the LM weight, optimized on the Dev set, which is close to zero; this needs further investigation.

A first analysis of detailed concept errors on Dev shows that the most frequent ones are insertions and deletions of logical connectors and co-references.
5. CONCLUSIONS AND PERSPECTIVES
An E2E SLU architecture has been introduced, based on an encoder-decoder model with an attention mechanism focusing on acoustic representations, useful for generating hypotheses of words, concepts, or values. Several combinations have been investigated, with and without an LM, reaching better results than the state of the art.

The results indicate that word knowledge can be a relevant context for semantic interpretation, particularly if it is selected by an appropriate attention mechanism. The solutions evaluated in this paper are based on either word or value representations using characters. In future work, we plan to use two decoders, one that outputs each character (AllWords-C) and another that focuses on concept-value pairs. This could, at least, improve our CVER without degrading the CER using the NormValues-C representation. Finally, new types of LM including semantic hypotheses will also be investigated in future work.
6. REFERENCES

[1] G. Tur and R. De Mori, "Chapter 1: Spoken language understanding for human/machine interactions," in Spoken Language Understanding: Systems for Extracting Semantic Information from Speech, 2011.
[2] D. Hakkani-Tür, G. Tür, A. Celikyilmaz, Y. N. Chen, J. Gao, L. Deng, and Y. Y. Wang, "Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM," in INTERSPEECH, 2016, pp. 715-719.
[3] X. Zhang and H. Wang, "A joint model of intent determination and slot filling for spoken language understanding," in IJCAI, 2016, pp. 2993-2999.
[4] B. Liu and I. Lane, "Attention-based recurrent neural network models for joint intent detection and slot filling," in INTERSPEECH, 2016, pp. 685-689.
[5] E. Simonnet, S. Ghannay, N. Camelin, Y. Estève, and R. De Mori, "ASR error management for improving spoken language understanding," in INTERSPEECH, 2017.
[6] D. H. Klatt, "Review of the ARPA speech understanding project," The Journal of the Acoustical Society of America, vol. 62, pp. 1345, 1977.
[7] R. Kuhn and R. De Mori, "The application of semantic classification trees to natural language understanding," TPAMI, vol. 17, pp. 449-460, 1995.
[8] Y. Qian, R. Ubale, V. Ramanaryanan, P. Lange, D. Suendermann-Oeft, K. Evanini, and E. Tsuprun, "Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system," ASRU, pp. 569-576, 2017.
[9] D. Serdyuk, Y. Wang, C. Fuegen, A. Kumar, B. Liu, and Y. Bengio, "Towards end-to-end spoken language understanding," ICASSP, pp. 5754-5758, 2018.
[10] R. Price, M. Mehrabani, and S. Bangalore, "Improved end-to-end spoken utterance classification with a self-attention acoustic classifier," ICASSP, pp. 8504-8508, 2020.
[11] P. Haghani, A. Narayanan, M. Bacchiani, G. Chuang, N. Gaur, P. Moreno, R. Prabhavalkar, Z. Qu, and A. Waters, "From audio to semantics: Approaches to end-to-end spoken language understanding," SLT, pp. 720-726, 2018.
[12] N. Tomashenko, C. Raymond, A. Caubrière, R. De Mori, and Y. Estève, "Dialogue history integration into end-to-end signal-to-concept spoken language understanding systems," ICASSP, pp. 8509-8513, 2020.
[13] P. Wang, L. Wei, Y. Cao, J. Xie, and Z. Nie, "Large-scale unsupervised pre-training for end-to-end spoken language understanding," ICASSP, pp. 7994-7998, 2020.
[14] A. Baevski, S. Schneider, and M. Auli, "vq-wav2vec: Self-supervised learning of discrete speech representations," in ICLR, 2020.
[15] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, "Unsupervised speech representation learning using WaveNet autoencoders," TASLP, pp. 2041-2053, 2019.
[16] Y. A. Chung and J. Glass, "Improved speech representations with multi-target autoregressive predictive coding," in ACL, 2020, pp. 2353-2358.
[17] S. Khurana, A. Laurent, and J. Glass, "CSTNet: Contrastive speech translation network for self-supervised speech representation learning," 2020, arXiv:2006.02814.
[18] Z. He, Y. He, Q. Wu, and J. Chen, "Fg2seq: Effectively encoding knowledge for end-to-end task-oriented dialog," in ICASSP, 2020, pp. 8029-8033.
[19] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in ICLR, 2015.
[20] Y. Wang, T. Chen, H. Xu, S. Ding, H. Lv, Y. Shao, N. Peng, L. Xie, S. Watanabe, and S. Khudanpur, "Espresso: A fast end-to-end neural speech recognition toolkit," in ASRU, 2019, pp. 136-143.
[21] T. Hori, J. Cho, and S. Watanabe, "End-to-end speech recognition with word-based RNN language models," in SLT, 2018, pp. 389-396.
[22] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H. C. Lin, F. Bougares, H. Schwenk, and Y. Bengio, "On using monolingual corpora in neural machine translation," 2015, arXiv:1503.03535.
[23] A. Caubrière, N. Tomashenko, A. Laurent, E. Morin, N. Camelin, and Y. Estève, "Curriculum-based transfer learning for an effective end-to-end spoken language understanding and domain portability," INTERSPEECH, pp. 1198-1202, 2019.
[24] F. Lefèvre, D. Mostefa, L. Besacier, Y. Estève, M. Quignard, N. Camelin, B. Favre, B. Jabaian, and Lina M. Rojas B., "Leveraging study of robustness and portability of spoken language understanding systems across languages and domains: the PORTMEDIA corpora," in LREC, 2012.
[25] L. Devillers, H. Maynard, S. Rosset, P. Paroubek, K. McTait, D. Mostefa, K. Choukri, L. Charnay, C. Bousquet, N. Vigouroux, F. Béchet, L. Romary, J. Y. Antoine, J. Villaneau, M. Vergnes, and J. Goulian, "The French MEDIA/EVALDA project: the evaluation of the understanding capability of spoken language dialogue systems," in LREC, 2004.
[26] F. Bechet, B. Maza, N. Bigouroux, T. Bazillon, M. El-Beze, R. De Mori, and E. Arbillot, "DECODA: a call-centre human-human spoken conversation corpus," in LREC, 2012, pp. 1343-1347.
[27] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, "End-to-end continuous speech recognition using attention-based recurrent NN: First results," in NIPS, 2014.
[28] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in ICASSP, 2016.
[29] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in ICASSP, 2016, pp. 4945-4949.
[30] S. Ghannay, A. Caubrière, Y. Estève, N. Camelin, E. Simonnet, A. Laurent, and E. Morin, "End-to-end named entity and semantic concept extraction from speech," in SLT, 2018.