Leveraging Acoustic and Linguistic Embeddings from Pretrained Speech and Language Models for Intent Classification
Bidisha Sharma, Maulik Madhavi and Haizhou Li
Department of Electrical and Computer Engineering, National University of Singapore, Singapore
ABSTRACT
Intent classification is a task in spoken language understanding. An intent classification system is usually implemented as a pipeline process, with a speech recognition module followed by text processing that classifies the intents. There are also studies of end-to-end systems that take acoustic features as input and classify the intents directly. Such systems do not take advantage of relevant linguistic information, and suffer from limited training data. In this work, we propose a novel intent classification framework that employs acoustic features extracted from a pretrained speech recognition system and linguistic features learned from a pretrained language model. We use a knowledge distillation technique to map the acoustic embeddings towards linguistic embeddings. We perform fusion of both acoustic and linguistic embeddings through a cross-attention approach to classify intents. With the proposed method, we achieve 90.86% and 99.07% accuracy on the ATIS and Fluent speech commands corpora, respectively.
Index Terms— intent classification, acoustic embeddings, linguistic embeddings, transfer learning
1. INTRODUCTION
With the increasing adoption of voice-operated interfaces in smart devices, one of the research topics in spoken language understanding (SLU) is to achieve more natural, intuitive, robust and effective interaction [1–4]. Intent classification refers to inferring the meaning or intention of a spoken utterance, which is a crucial component of SLU [2]. The voice interfaces decide how to respond according to the perceived intent of a particular spoken utterance from the user. The typical SLU system architecture follows a pipeline approach that consists of two components: first, an automatic speech recognition (ASR) system decodes the input speech into a text transcription, which is followed by a natural language understanding (NLU) module that classifies the intent from the ASR output text [1].

The pipeline approach has a few limitations. First, the pipeline components are optimized separately under different criteria. The ASR module is optimized to minimize the word error rate (WER), while the NLU module is typically trained on clean text (original transcriptions). ASR performance varies largely depending on noisy environmental conditions, resulting in erroneous transcriptions, which subsequently affect the performance of the NLU module. As a result, there is an obvious mismatch between training and testing conditions, which may limit the performance. Second, not all words are equally important for the intent classification task. Some words carry more weight towards the meaning of the utterance than others, which is reflected in the linguistic prosody. Unfortunately, the pipeline approach does not take such prosodic information into consideration.

End-to-end approaches represent one of the solutions to the above issues arising from pipeline approaches [5, 6]. They are widely adopted in ASR [7–9], speech synthesis [10], and machine translation [11–13]. Inspired by their success, Serdyuk et al. [6] introduce an end-to-end modeling approach for intent classification, where the features extracted from the speech signal are directly mapped to intents or other SLU targets, without intermediate ASR and NLU components. Lugosch et al. [15] study an end-to-end encoder-decoder framework for speech-to-domain and intent classification. Other studies reveal that along with the speech features, an intermediate text representation is also crucial for the quality of the predicted semantics. Haghani et al. [14] propose a joint model that predicts both words and semantics and achieves good performance using a large training database.

To benefit from the phonetic recognition of speech, another study [15] explores the use of an ASR model for intent classification. In particular, a pre-trained ASR model is used to extract acoustic features from speech signals. Similarly, to address the issue of limited data, different neural network architectures are introduced in [15–18]. Due to the lack of large speech databases for SLU tasks, the authors of [19] propose to use synthesized speech for training end-to-end SLU systems.

Transfer learning has been successfully applied in intent classification tasks to align the acoustic embeddings extracted from an ASR system to linguistic embeddings extracted from a language model, such as bidirectional encoder representations from transformers (BERT) [20, 21]. These approaches use linguistic representations derived from acoustic features using transfer learning to classify intents. We consider that the original acoustic features or embeddings also carry significant information for SLU tasks.

Human perception and interpretation of a spoken utterance rely on both the prosodic and linguistic information embedded in the speech signal. In this work, we aim to benefit from both acoustic and linguistic representations, and leverage information extracted from pre-trained acoustic and language models. In particular, the use of pre-trained models helps to overcome the unavailability of a large speech database for targeted SLU tasks. The transfer learning approach assists us to derive a linguistic representation from the acoustic embeddings of speech.

Specifically, we derive the acoustic embeddings from a pre-trained ASR model, and learn to derive linguistic embeddings that are close to those from a pre-trained BERT model by applying transfer learning. We then combine both acoustic and linguistic embeddings into a unified representation through a cross-attention module. With cross attention, the attention mask for one modality is used to highlight the extracted features in another modality. The strategy of combining both acoustic and linguistic embeddings makes this work significantly different from the existing transfer learning based intent classification methods [20, 21]. We note that the proposed framework exploits information extracted from only the speech signal, and we do not use the ASR output text.

This research is supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (Project No. A18A2b0046).
2. ACOUSTIC-LINGUISTIC NETWORK WITH TRANSFER LEARNING
In Figure 1, we provide an overview of the proposed acoustic-linguistic network (ALN) with transfer learning. We leverage the information extracted from two pretrained models, namely an ASR model and a BERT model. Initially, we extract the acoustic embeddings from the pretrained ASR model as described in [15]. The latent space of these acoustic embeddings is transformed to the NLU output space for the downstream task using the transfer learning layer. We refer to the derived embeddings as ALN linguistic embeddings. The acoustic embeddings extracted from ASR and the ALN linguistic embeddings are effectively fused together for intent classification. The framework is optimized using two loss functions, which are the transfer learning loss and the intent loss. Each of these components is described in detail below.
2.1. Pretrained ASR model

We follow the same strategy to use a pre-trained ASR model as described in [15] (code available at https://github.com/lorenlugosch/end-to-end-SLU). This model is a deep neural network consisting of a stack of modules. The first module takes the audio signal as input and outputs a sequence of hidden representations that describe the phonetic content of the input speech. The second module takes the hidden phonetic representations as input and outputs 256-dimensional hidden word representations, which are used as acoustic embeddings ($e_{acoust}$) in this work.
Fig. 1. Block diagram of the proposed acoustic-linguistic network (ALN) with transfer learning for intent classification, where the dotted block highlights our contribution.

Further details of the pre-trained ASR model can be found in [15]. We freeze this pre-trained ASR model and use it as it is for the downstream intent classification task.
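To make the extraction step concrete, the following is a minimal PyTorch sketch of using the frozen ASR model as a feature extractor. The wrapper name `pretrained_asr` and its method `get_word_embeddings` are illustrative assumptions standing in for the phoneme and word modules of [15]; the actual interface is defined in that repository.

```python
import torch

# Minimal sketch: use the frozen pretrained ASR stack as a feature extractor.
# `pretrained_asr` and `get_word_embeddings` are hypothetical stand-ins for
# the phoneme and word modules of [15]; only the freezing logic is essential.
def extract_acoustic_embeddings(pretrained_asr, waveform):
    pretrained_asr.eval()
    for p in pretrained_asr.parameters():
        p.requires_grad = False          # the ASR model is not fine-tuned
    with torch.no_grad():
        e_acoust = pretrained_asr.get_word_embeddings(waveform)  # (T, 256)
    return e_acoust
```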
2.2. Finetuned BERT model

BERT achieves state-of-the-art performance in sentence classification and other natural language processing tasks [22, 23]. In order to incorporate domain-specific linguistic information, there is a need to finetune the original BERT model. We fine-tune a pre-trained bert-base-uncased model [22] for the intent classification task. Inspired by [20], we perform a two-stage finetuning approach: first, we fine-tune the BERT weights for the masked language model (MLM) task with language model heads, following the masking procedure recommended in [22]. Next, using these weights, we further fine-tune the BERT model with the intent classification label as target. The experimental details are given in Section 3.2.

We apply the fine-tuned BERT model to capture linguistic information for intent classification. From the sequence output of BERT ($e_{lang}$), we use mean pooling to obtain the BERT embeddings $E_{lang}$, where $E_{lang} = \mathrm{MeanPool}(e_{lang})$. We further use this finetuned model as the teacher model in the transfer learning stage.

2.3. Transfer learning layer

Using the transfer learning layer, we aim to derive a linguistic representation from the acoustic embeddings of speech that closely resembles the BERT embeddings. This is a linear layer that transforms the 256-dimensional acoustic embeddings into 768-dimensional linguistic embeddings. We follow the teacher-student learning method described in [24, 25], with the transfer learning layer acting as the student model and the BERT model discussed in Section 2.2 as the teacher model. We learn the transfer learning layer to produce embeddings that are close to the BERT embeddings. We obtain the ALN linguistic embeddings ($\hat{e}_{lang}$) as the output of the transfer learning layer, which are used in the next modules for intent classification. We note that the ALN linguistic embeddings are frame-aligned with the acoustic embeddings; both sequences have the same length. We use the mean square error (MSE) loss for backpropagation of this layer, which is defined as

$$Loss_{TL} = \mathrm{MSE}(E_{lang}, \hat{E}_{lang}), \qquad (1)$$

where $\hat{E}_{lang}$ is the ALN linguistic embedding after mean pooling over $\hat{e}_{lang}$, i.e., $\hat{E}_{lang} = \mathrm{MeanPool}(\hat{e}_{lang})$.
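The teacher-student step can be sketched in a few lines of PyTorch. This is a minimal illustration under the stated dimensions (256 for acoustic embeddings, 768 for BERT embeddings); the variable names are ours, not the authors'.

```python
import torch
import torch.nn as nn

# Student: linear transfer learning layer mapping acoustic embeddings into
# the BERT embedding space; trained with the MSE loss of Eq. (1).
transfer_layer = nn.Linear(256, 768)

def transfer_learning_loss(e_acoust, e_lang):
    # e_acoust: (T, 256) from the frozen ASR model (student input)
    # e_lang:   (S, 768) sequence output of the fine-tuned BERT teacher
    e_lang_hat = transfer_layer(e_acoust)        # ALN linguistic embeddings
    E_lang = e_lang.mean(dim=0)                  # MeanPool over BERT tokens
    E_lang_hat = e_lang_hat.mean(dim=0)          # MeanPool over speech frames
    return nn.functional.mse_loss(E_lang_hat, E_lang)   # Eq. (1)
```

Mean pooling on both sides removes the length mismatch between the teacher's token sequence and the student's frame sequence, so the MSE is computed between two fixed-size vectors.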
2.4. Cross-attention layer

We design the transfer learning layer to map the acoustic embeddings to linguistic embeddings, as it is optimized with respect to the BERT embeddings. We employ a cross-attention method as in [26, 27] to generate the alignment between the two streams of embedding vectors, namely the acoustic embeddings and the ALN linguistic embeddings. In a traditional attention network, a decoder learns to attend to some parts of the encoder output [12, 28]. Here, we leverage the attention mechanism to learn the alignment weights between the two embeddings. Before applying the cross-attention, we use a mapping layer to match the dimension (256) of both embedding sequences. The output of the cross-attention layer can be expressed as

$$z = \mathrm{softmax}\left(\frac{f_q(X_q)\,(f_k(X_k))^T}{\sqrt{d_k}}\right) f_v(X_v), \qquad (2)$$

where $f_q$, $f_k$ and $f_v$ represent the linear layers for the query $X_q$, key $X_k$ and value $X_v$ components, respectively, $d_k$ is the embedding dimension (256), and $z$ is the output of the cross-attention layer, which we refer to as the ALN embeddings. We use $e_{acoust}$ as the query component, while $\hat{e}_{lang}$ serves as the key and value components.
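A minimal PyTorch sketch of Eq. (2) is given below, with the acoustic embeddings as query and the ALN linguistic embeddings as key and value; the 768-to-256 mapping layer follows the text, while the remaining details are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

d_k = 256                          # embedding dimension used in Eq. (2)
map_lang = nn.Linear(768, d_k)     # mapping layer to match both sequences
f_q = nn.Linear(d_k, d_k)          # query projection for e_acoust
f_k = nn.Linear(d_k, d_k)          # key projection for the mapped e_lang_hat
f_v = nn.Linear(d_k, d_k)          # value projection for the mapped e_lang_hat

def cross_attention(e_acoust, e_lang_hat):
    # e_acoust: (T, 256); e_lang_hat: (T, 768); the sequences are frame-aligned
    x_kv = map_lang(e_lang_hat)
    q, k, v = f_q(e_acoust), f_k(x_kv), f_v(x_kv)
    scores = q @ k.transpose(0, 1) / math.sqrt(d_k)   # (T, T) alignment weights
    z = torch.softmax(scores, dim=-1) @ v             # ALN embeddings, (T, 256)
    return z
```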
2.5. Intent layer

The intent layer aims to classify intents from the ALN embeddings. The first part of this module is a recurrent neural network, namely a gated recurrent unit (GRU), followed by max pooling and a linear layer for intent classification. The intent classification loss is referred to as $Loss_{intent}$:

$$Loss_{intent} = \mathrm{CrossEntropy}(f(z), y), \qquad (3)$$

where $f(z)$ is the intent layer output and $y$ represents the intent labels. We combine $Loss_{intent}$ with the transfer learning loss ($Loss_{TL}$) using the weight $\alpha$ to derive the total loss, which is used to backpropagate through the ALN framework:

$$Loss_{total} = \alpha\, Loss_{TL} + (1-\alpha)\, Loss_{intent}. \qquad (4)$$
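The intent layer and the combined objective can be sketched as follows; the GRU hidden size and other hyperparameters here are illustrative guesses, not values reported in this work.

```python
import torch
import torch.nn as nn

class IntentLayer(nn.Module):
    # GRU over the ALN embeddings, max pooling over time, linear classifier.
    def __init__(self, num_intents, d_in=256, d_hidden=128):
        super().__init__()
        self.gru = nn.GRU(d_in, d_hidden, batch_first=True)
        self.out = nn.Linear(d_hidden, num_intents)

    def forward(self, z):              # z: (B, T, 256) ALN embeddings
        h, _ = self.gru(z)
        pooled, _ = h.max(dim=1)       # max pooling over the time axis
        return self.out(pooled)        # f(z): intent logits

def total_loss(logits, y, loss_tl, alpha):
    loss_intent = nn.functional.cross_entropy(logits, y)   # Eq. (3)
    return alpha * loss_tl + (1 - alpha) * loss_intent     # Eq. (4)
```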
3. EXPERIMENTS

3.1. Database
We evaluate the systems on two databases. The ATIS corpus is one of the most commonly used long-standing datasets for text-based SLU research [29]. To use the ATIS database for the acoustic feature based SLU task, we have rearranged the train and test sets so that the corresponding audio files are available from the original ATIS corpus recordings. We have filtered the ATIS database to make sure that all utterances have original speech recordings. The utterances in the Fluent speech commands (FSC) dataset [15] serve as speech commands to a virtual assistant. Each command consists of three slot values, namely action, object and location; the combination of the three slot values represents the intent, as illustrated in the toy example after Table 1. The details of both databases are presented in Table 1.
Table 1: Statistics of ATIS and Fluent speech commands (FSC) databases.

Specification        ATIS                    FSC
                     Train      Test         Train      Test
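As a toy illustration of the FSC labeling scheme, an intent label can be viewed as the combination of the three slot values; the command and slot values below are made up for illustration.

```python
# Hypothetical FSC-style example: the intent is the (action, object, location)
# slot combination of a command such as "turn on the lights in the kitchen".
slots = {"action": "activate", "object": "lights", "location": "kitchen"}
intent = "_".join(slots[k] for k in ("action", "object", "location"))
print(intent)   # activate_lights_kitchen
```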
3.2. Experimental setup

To show the comparative performance of the proposed ALN framework, we develop two baseline systems. The first is the conventional pipeline approach (Baseline-1), where we initially pass the speech signal through an ASR system to derive the transcription, which is then fed to the fine-tuned BERT model for intent classification. The ASR system employed here is an in-house general-purpose ASR trained on a combination of several speech databases. We have not adapted any component of this ASR system to the current application. The second baseline (Baseline-2) is deployed using acoustic features extracted from the pretrained ASR model of Section 2.1 [15], followed by the intent layer discussed in Section 2.5. We note that, in this case, only the intent loss of Equation (3) is used to optimize the network.
In the proposed ALN framework, we use a pretrained ASR model and a finetuned BERT model to extract the acoustic and linguistic embeddings, respectively. The pretrained ASR model used in this work is adopted from [15]; it is frozen and not modified for the downstream task. For the finetuned BERT model, we use the train sets of both databases as per Table 1. For the two-stage finetuning of the BERT model described in Section 2.2, during the MLM stage we use the Adam optimizer with learning rate 5e-5, β1 = 0.9, β2 = 0.999 and 3 epochs [30]. For the intent classification stage of finetuning, we use the Adam optimizer with a learning rate of 2e-5 and 4 epochs. A minimal sketch of this two-stage setup is given after Table 2.

To demonstrate the effectiveness of the transfer learning strategy and the subsequent improvement with the proposed ALN framework, we develop two systems. In the first system (ALN linguistic), we initially pass the acoustic embeddings through the transfer learning layer to obtain the ALN linguistic embeddings, and then use only the ALN linguistic embeddings in the intent layer described in Section 2.5 for intent classification. In the second system (ALN), we obtain the alignment between the ALN linguistic embeddings and the acoustic embeddings using the cross-attention layer. The output of the cross-attention layer is passed through the intent layer to derive the intent classification, as shown in Figure 1. During training of the ALN, the total loss of Equation (4) is backpropagated through the transfer learning layer, cross-attention layer and intent layer simultaneously. The ALN linguistic, ALN, and Baseline-2 frameworks are implemented using PyTorch. We use the Adam optimizer with a learning rate of 0.001, batch size 64 and 100 epochs.

Table 2: Intent classification accuracy (%) of Baseline-1 (pipeline approach) using ASR output and ground truth text, and the Baseline-2 (pretrained ASR) framework.

Database    Baseline-1                                Baseline-2 [15]
            Ground truth    ASR output    WER (%)
ATIS        93.73           89.65         11.57       85.34
FSC         100             95.33         20.43       98.80
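The following is a minimal sketch of the two-stage BERT finetuning, using standard Hugging Face model classes and the optimizer settings above; data preparation, masking, and the training loops are omitted, and `num_intents` is an assumption that depends on the database.

```python
import torch
from transformers import BertForMaskedLM, BertForSequenceClassification

num_intents = 31   # illustrative; set to the intent inventory of the corpus

# Stage 1: masked language model finetuning on in-domain transcriptions.
mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
mlm_opt = torch.optim.Adam(mlm_model.parameters(), lr=5e-5, betas=(0.9, 0.999))
# ... train for 3 epochs with the masking procedure of [22] ...
mlm_model.save_pretrained("bert-mlm-finetuned")

# Stage 2: intent classification finetuning, initialized from stage 1 weights.
cls_model = BertForSequenceClassification.from_pretrained(
    "bert-mlm-finetuned", num_labels=num_intents)
cls_opt = torch.optim.Adam(cls_model.parameters(), lr=2e-5)
# ... train for 4 epochs with the intent labels as targets ...
```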
3.3. Results and discussion

Table 2 lists the intent classification accuracy of Baseline-1 and Baseline-2 for both databases. We also show the pipeline approach when ground truth text is used instead of ASR output text. We note that the WERs for the ATIS and FSC databases are 11.57% and 20.43%, respectively. Baseline-2 gives better performance (98.80%) when the WER is higher, as in the case of the FSC dataset, whereas Baseline-1 gives better performance on the ATIS database (89.65%), where the WER is lower. However, the performance of the pipeline approach using ground truth transcriptions is better than that of Baseline-2. This indicates that linguistic information helps to improve intent classification.

To illustrate the effect of transfer learning in ALN, we visualize t-Distributed Stochastic Neighbor Embedding (t-SNE) representations [31] of the embeddings. Figure 2(a) and Figure 2(b) show the t-SNE plots across BERT embeddings and ALN linguistic embeddings in an initial epoch and the final epoch, respectively, for the ATIS database. We observe that the ALN linguistic embeddings are closer to the BERT embeddings in Figure 2(b) than in Figure 2(a). We note similar embedding visualization behavior for the FSC database (a minimal sketch of this visualization step follows the Figure 2 caption).

In Table 3 we report the performance of the ALN linguistic and ALN frameworks. We also demonstrate the effect of the weight α in Equation (4), with experiments for two values of α (0.5 and 0.8) in the total loss computation. In both frameworks, we observe relatively better performance using α = 0.8. This indicates that the transfer learning loss contributes more to learning better ALN linguistic embeddings. The intent classification accuracy of ALN linguistic (87.75%) is better than that of Baseline-2 (85.34%) for the ATIS database. For the FSC database the performance of Baseline-2 (98.80%) is slightly better than that of ALN linguistic (98.31%). This shows the efficacy of the ALN linguistic embeddings derived through the transfer learning layer.
Fig. 2. t-SNE visualization plots to compare BERT embeddings with ALN linguistic embeddings in (a) an initial epoch, (b) the final epoch, for 6 intent classes of the ATIS database.
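A minimal sketch of the visualization step, assuming `bert_emb` and `aln_emb` are (N, 768) arrays of mean-pooled BERT and ALN linguistic embeddings collected over the evaluation utterances:

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_points(bert_emb, aln_emb):
    # Embed both sets jointly so that the two clouds share one t-SNE space.
    joint = np.concatenate([bert_emb, aln_emb], axis=0)
    points = TSNE(n_components=2).fit_transform(joint)
    return points[: len(bert_emb)], points[len(bert_emb):]
```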
It is evident from Table 3 that the proposed ALN framework outperforms both baseline systems presented in Table 2. For α = 0.8 the accuracy of the proposed ALN framework is 90.86% for the ATIS database and 99.07% for the FSC database, while the corresponding Baseline-2 accuracies are 85.34% and 98.80%, respectively. We observe that the performance of the ALN framework is consistently higher than that of Baseline-2 for both values of α on the two databases. For the ATIS database, however, the performance of the ALN framework with α = 0.5 is lower than that of Baseline-1. This may be because of the lower WER of the ATIS database and an insufficient representation of the ALN linguistic features using α = 0.5.

Table 3: Intent classification accuracy (%) using the ALN linguistic and proposed ALN frameworks for different values of weight α.

Database    α value    ALN linguistic    ALN
ATIS        0.5        85.00             86.37
            0.8        87.75             90.86
FSC         0.5        97.20             99.02
            0.8        98.31             99.07
4. CONCLUSION

This paper presents an end-to-end intent classification framework using both acoustic and linguistic embeddings extracted from the speech signal, without using any intermediate text representation. We extract the acoustic features from a pretrained ASR model and learn the linguistic features from a pretrained BERT model. We use a transfer learning technique to map the acoustic features towards linguistic features. In particular, we employ a teacher-student transfer learning approach to leverage the linguistic information and incorporate it into the intent classification network. Finally, we fuse both the acoustic and linguistic embeddings effectively through a cross-attention module.

The two modalities of a spoken utterance, speech and text, carry distinct yet crucial information for interpreting its meaning. Through the proposed method, we establish the impact of using both acoustic and linguistic modalities for the SLU task. Our experimental results indicate that transfer learning can capture the information in the linguistic embeddings and perform better than using only acoustic information, as well as the conventional pipeline SLU approach.

5. REFERENCES
[1] Gökhan Tür and Renato De Mori, Spoken Language Understanding: Systems for Extracting Semantic Information from Speech, John Wiley & Sons, 2011.
[2] Aditya Bhargava, Asli Celikyilmaz, Dilek Hakkani-Tür, and Ruhi Sarikaya, "Easy contextual intent prediction and slot detection," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 8337–8341.
[3] Suman Ravuri and Andreas Stolcke, "Recurrent neural network and LSTM models for lexical utterance classification," in INTERSPEECH, 2015, pp. 135–139.
[4] Ruhi Sarikaya, Geoffrey E. Hinton, and Anoop Deoras, "Application of deep belief networks for natural language understanding," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 778–784, 2014.
[5] Yao Qian, Rutuja Ubale, Vikram Ramanaryanan, Patrick Lange, David Suendermann-Oeft, Keelan Evanini, and Eugene Tsuprun, "Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system," in Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 569–576.
[6] Dmitriy Serdyuk, Yongqiang Wang, Christian Fuegen, Anuj Kumar, Baiyang Liu, and Yoshua Bengio, "Towards end-to-end spoken language understanding," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5754–5758.
[7] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, 2016, pp. 173–182.
[8] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964.
[9] Hagen Soltau, Hank Liao, and Hasim Sak, "Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition," in INTERSPEECH, 2017, pp. 3707–3711.
[10] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," in ISCA Speech Synthesis Workshop, 2016, p. 125.
[11] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems 27, 2014, pp. 3104–3112.
[12] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," in International Conference on Learning Representations (ICLR), 2015.
[13] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin, "Convolutional sequence to sequence learning," in International Conference on Machine Learning (ICML), 2017, pp. 1243–1252.
[14] Parisa Haghani, Arun Narayanan, Michiel Bacchiani, Galen Chuang, Neeraj Gaur, Pedro Moreno, Rohit Prabhavalkar, Zhongdi Qu, and Austin Waters, "From audio to semantics: Approaches to end-to-end spoken language understanding," in IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 720–726.
[15] Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, and Yoshua Bengio, "Speech model pre-training for end-to-end spoken language understanding," in INTERSPEECH, 2019, pp. 814–818.
[16] Vincent Renkens and Hugo Van hamme, "Capsule networks for low resource spoken language understanding," in INTERSPEECH, 2018, pp. 601–605.
[17] Jakob Poncelet and Hugo Van hamme, "Multitask learning with capsule networks for speech-to-intent applications," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8494–8498.
[18] Vikramjit Mitra, Sue Booker, Erik Marchi, David Scott Farrar, Ute Dorothea Peitz, Bridget Cheng, Ermine Teves, Anuj Mehta, and Devang Naik, "Leveraging acoustic cues and paralinguistic embeddings to detect expression from voice," in INTERSPEECH, 2019, pp. 1651–1655.
[19] Loren Lugosch, Brett Meyer, Derek Nowrouzezahrai, and Mirco Ravanelli, "Using speech synthesis to train end-to-end spoken language understanding models," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8499–8503.
[20] Yinghui Huang, Hong-Kwang Kuo, Samuel Thomas, Zvi Kons, Kartik Audhkhasi, Brian Kingsbury, Ron Hoory, and Michael Picheny, "Leveraging unpaired text data for training end-to-end speech-to-intent systems," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 5754–5758.
[21] Pavel Denisov and Ngoc Thang Vu, "Pretrained semantic speech embeddings for end-to-end spoken language understanding via cross-modal teacher-student learning," arXiv preprint arXiv:2007.01836, 2020.
[22] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019, pp. 4171–4186.
[23] Nils Reimers and Iryna Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3982–3992.
[24] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[25] Jinyu Li, Michael L. Seltzer, Xi Wang, Rui Zhao, and Yifan Gong, "Large-scale domain adaptation via teacher-student learning," in INTERSPEECH, 2017, pp. 2386–2390.
[26] Haiyang Xu, Hui Zhang, Kun Han, Yun Wang, Yiping Peng, and Xiangang Li, "Learning alignment for multimodal emotion recognition from speech," in INTERSPEECH, 2019, pp. 3569–3573.
[27] Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov, "Multimodal transformer for unaligned multimodal language sequences," in Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2019.
[28] Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[29] Patti Price, "Evaluation of spoken language systems: The ATIS domain," in Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990, 1990.
[30] "huggingface/transformers," https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py, [Online; accessed 21-October-2020].
[31] Laurens van der Maaten and Geoffrey Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.