Pretrained Semantic Speech Embeddings for End-to-End Spoken Language Understanding via Cross-Modal Teacher-Student Learning
Pavel Denisov, Ngoc Thang Vu
Institute for Natural Language Processing (IMS), University of Stuttgart, Germany
{pavel.denisov, thang.vu}@ims.uni-stuttgart.de

Abstract
Spoken language understanding is typically based on pipeline architectures including speech recognition and natural language understanding steps. These components are optimized independently to allow usage of available data, but the overall system suffers from error propagation. In this paper, we propose a novel training method that enables pretrained contextual embeddings to process acoustic features. In particular, we extend it with an encoder of pretrained speech recognition systems in order to construct end-to-end spoken language understanding systems. Our proposed method is based on the teacher-student framework across speech and text modalities that aligns the acoustic and the semantic latent spaces. Experimental results on three benchmarks show that our system reaches performance comparable to the pipeline architecture without using any training data and outperforms it after fine-tuning with ten examples per class on two out of three benchmarks.
Index Terms: spoken language understanding, transfer learning, teacher-student learning
1. Introduction
Recent developments in the fields of electronics, computation and data processing have led to an increased interest in smart assistants with speech interfaces. This is likely driven by the fact that people can usually learn to use speech for interaction intuitively, without any special training [1], and make it a primary medium of information exchange. However, speech poses a major challenge to a machine when it comes to the task of extracting the information intended to be transmitted by a human speaker, also known as Spoken Language Understanding (SLU) [2]. The key difficulty here is that speech is highly variable, e.g. depending on room acoustics, and contains rich information about speakers [3], some of which is not useful for SLU. The information extraction task is often performed on the text representation using Natural Language Understanding (NLU) methods [4], while Automatic Speech Recognition (ASR) systems [5, 6] convert speech to text. The ASR step removes redundant information from the input and provides a kind of normalized form at the output. At the same time, it causes loss of potentially useful information that cannot be encoded in the text representation, such as prosody, loudness and speech rate. The operation of finding the most probable sequence of words for a speech input is computationally expensive. This is partly solved by various heuristics that avoid exploration of less probable hypotheses [7, 8, 9], which in turn introduces additional errors propagated to the NLU component. Finally, the sequential design of the pipeline approach leads to an unavoidable source of latency, because the NLU component cannot start its work before ASR is finished, which is not desirable in the interactive context of a smart assistant. The problems of the pipeline approach described above can be solved by end-to-end SLU methods.

Existing works on end-to-end SLU modeling either focus on supervised downstream tasks, for example dialog act classification [10], intent detection [12], slot filling [14], independent intent detection and domain classification [11], and joint intent detection, domain classification and slot filling [13], or target a generic semantic embedding [15, 16, 17], usually inspired by such successful models as the Word2Vec word embeddings [18] and the BERT contextual text embeddings [19]. The highly variable and complex nature of speech leads to large amounts of both data and computational resources required for SLU training compared to NLU training, especially for the recently popular approach based on contextual embeddings. While the data requirements could be satisfied for unsupervised approaches, computational resources are still a problem. Fortunately, most modern language processing methods, including ASR and NLU, are based on neural networks and deep learning. Deep learning offers an easy way to transfer knowledge between learned tasks. This technique is referred to as transfer learning and it is successfully applied in both ASR [20, 21] and NLU [22, 19]. Therefore, transfer learning should be a promising direction to explore for SLU as well. Several reports [23, 12, 13, 24, 14, 17] indicate that transfer learning from the audio modality, through pretraining on an ASR task or, alternatively, speech autoencoding, is helpful for downstream SLU tasks.
Transfer learning from the text modality, however, has been applied only for Speech2Vec [16] and SpeechBERT [17] so far.

We propose a novel method that combines parameter transfer from well trained end-to-end ASR systems [25], such as pretrained ESPnet [26], and end-to-end NLU models, such as pretrained BERT [19], with teacher-student learning [27, 28] for the final alignment of the SLU output space to the NLU output space, in order to construct an end-to-end SLU model allowing few-shot transfer of downstream tasks from text to speech. By doing so, we enable pretrained end-to-end contextual embeddings such as BERT to process acoustic features. In particular, we aim to generate fixed-length vectors with a semantic representation from speech segments of variable length. Transfer learning from both text and audio modalities makes our approach most similar to [17] and [16]. In this work, we investigate the utterance classification task and focus on zero-shot and few-shot cases, but the described method could be adapted to many types of SLU tasks. Although previous works described a number of experiments for such utterance classification tasks as dialog act classification [29] and intent classification [24], and we use the same datasets for the evaluation, we do not compare our results directly to these works, as this is outside the scope of our work.
2. Method
Figure 1 provides the overview of the proposed method. Our SLU model is a combination of two pretrained models. First, we use the encoder block of a pretrained end-to-end ASR model [25] in order to convert the acoustic features of the speech signal to a hidden representation. Second, we feed the hidden representation through a learnable linear mapping to a pretrained masked language model [19], fine-tuned to produce a semantic sentence embedding, which serves as the NLU model. Finally, we utilize the teacher-student learning method in order to align the output of our SLU model to the output of the pretrained NLU model. Both the ASR and NLU models are based on the Transformer architecture [30], widely used for sequence processing.

Figure 1: End-to-end SLU using cross-modal T-S learning. (Diagram: speech passes through ASR encoder layers and NLU layers to an utterance embedding in the SLU model; the transcription passes through subword embeddings and NLU layers to a sentence embedding in the NLU model; the distance between the two embeddings connects the forward and backward passes.)
The ASR model implements the sequence-to-sequence approach and contains two major blocks, an encoder and a decoder. The encoder takes a sequence X of acoustic features and outputs an encoded sequence X^E with a hidden representation. The decoder takes the encoded sequence X^E as input and outputs a target sequence Y of text tokens representing the transcription of the input utterance. The ASR model is trained to minimize the weighted sum of the cross-entropy objective function, calculated from the decoder output Ŷ and the ground truth transcription Y, and the CTC objective function, calculated from a learnable linear mapping of the encoder output X^E and the ground truth transcription Y.

The NLU model is a neural network that takes a sequence X of text tokens as input and produces an encoded sequence X^E. The semantic sentence embedding vector y is obtained by applying a pooling operation to the encoded sequence X^E. The model's parameters are initially pretrained with the tasks of masked token and next sentence prediction from the encoded sequence X^E representing contextual text token embeddings, as is done with the BERT model [19]. After that, the model is extended with the pooling operation over the encoded sequence X^E, producing the pooled output y, and is fine-tuned on specialized datasets to encode more semantic information into the pooled output y.

Teacher-student learning minimizes a distance-based objective function between the outputs of two models on the same or equivalent inputs, with the aim of updating the student model's parameters so that its output becomes more similar to the output of the teacher model. The parameters of the teacher model are not updated during this process. The final stage of our method is the alignment of the SLU output to the NLU output with the teacher-student learning method, where the SLU model consumes speech recordings and plays the student role, while the NLU model consumes ground truth transcriptions and plays the teacher role.
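For clarity, the two training stages described above can be written compactly. This is a notational sketch only; the interpolation weight \alpha, the linear mapping W applied to the encoder output before the CTC loss, and the distance function d are symbols introduced here for illustration rather than taken from the cited implementations:

\mathcal{L}_{\text{ASR}} = \alpha \, \mathcal{L}_{\text{CTC}}\big(W X^{E}, Y\big) + (1 - \alpha) \, \mathcal{L}_{\text{CE}}\big(\hat{Y}, Y\big)

\mathcal{L}_{\text{TS}} = d\big(\text{SLU}(X_{\text{speech}}), \text{NLU}(Y_{\text{text}})\big), \quad d \in \{\text{cosine}, L_2, L_1\}

During the teacher-student stage only the selected student (SLU) parameters are updated by \mathcal{L}_{\text{TS}}; the teacher (NLU) parameters remain frozen.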
3. Experimental setup
We adopt the latest LibriSpeech recipe [31] from the ESPnet toolkit. The Transformer network has attention dimension 512, feed-forward inner dimension 2048, 8 heads, 12 blocks in the encoder and 6 blocks in the decoder. Input features are 80-dimensional log Mel filterbank coefficients with 3-dimensional pitch values; the frame size is 25 ms and the shift is 10 ms. Output labels are 100 subword units, automatically learned with the unigram language model algorithm [32] from the lowercased concatenation of the LibriSpeech and TED-LIUM LM training data with the transcriptions of the acoustic training data. The training data combines the LibriSpeech, Switchboard, TED-LIUM 3, AMI, WSJ, Common Voice 3, SWC, VoxForge and M-AILABS datasets with a total amount of 3249 hours. The validation data combines the validation subsets of the LibriSpeech, TED-LIUM 3 and AMI datasets with a total amount of 38 hours. The training is performed on 4 GPUs using the Adam optimizer and square root learning rate scheduling [30] with 25,000 warmup steps and learning rate coefficient 10. The SpecAugment data augmentation method [33] is applied dynamically during each batch generation. The model is trained for 24 epochs and evaluated on the validation data after each epoch. The final model is obtained by averaging the parameters of the seven best performing models.
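For reference, the square root schedule [30] with warmup and a learning rate coefficient, as used above (25,000 warmup steps, coefficient 10, attention dimension 512), can be sketched as a small function. This is the standard Noam formula, not an excerpt from the ESPnet code:

def noam_lr(step: int, d_model: int = 512, warmup: int = 25000, k: float = 10.0) -> float:
    # Square root learning rate schedule [30]: linear warmup followed by
    # inverse square root decay, scaled by the model dimension and a coefficient k.
    step = max(step, 1)
    return k * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)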
We use the pretrained bert-base-nli-stsb-mean-tokens Sentence-BERT model [34]. The model itself is fine-tuned from the well-known pretrained bert-base-uncased model [19]. The Transformer network has attention dimension 768, feed-forward inner dimension 3072, 12 heads and 12 blocks. Input text is tokenized into 30,000 subword units. The model is pretrained with the masked LM and next sentence prediction tasks on the BooksCorpus and English Wikipedia datasets. The MEAN pooling operation is added to obtain the sentence embedding y from the encoded sequence X^E. The sentence embedding is first fine-tuned on the SNLI and MultiNLI datasets for 3-way classification between the contradiction, entailment and neutral classes for a given pair of sentences using the cross-entropy objective function. After that, the sentence embedding is fine-tuned on the STSb dataset for prediction of the cosine similarity for a given pair of sentences using the mean-squared-error objective function.

The SLU model is constructed by combining the ASR model's encoder with the self-attention blocks of the NLU model, so that the NLU model receives the hidden representation from the ASR encoder instead of the output of its own input embedding layer. A linear layer is added between the ASR encoder and the NLU blocks to map the dimension of the hidden representation from 512 to 768. Fine-tuning is performed using the teacher-student approach by minimizing the distance between the output of the SLU model for speech recordings and the output of the NLU model for the corresponding transcriptions. We conduct fine-tuning experiments with cosine, L2 and L1 distance based objective functions. The SLU model acts as the student, and we select empirically which parameters to update during the fine-tuning. The NLU model acts as the teacher, and we freeze its parameters. We employ a smaller acoustic dataset consisting of the LibriTTS, Common Voice 3 and M-AILABS corpora with a total amount of 1453 hours for the fine-tuning. Our motivation here is to utilize the richer transcriptions with punctuation available in these datasets and to supply the NLU model with extra information for potentially semantically finer sentence embeddings. We use the transcriptions as is and do not apply any text preprocessing that is usually done in ASR training, including the training of our end-to-end ASR model. The validation data is the validation subset of the LibriTTS corpus with a total duration of 15 hours. We do not apply SpecAugment during the fine-tuning, because it yielded worse results in our early experiments.
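The construction of the student model and a single cross-modal teacher-student step can be sketched in PyTorch as follows. The asr_encoder, bert_encoder_layers and teacher_nlu objects are placeholders for the pretrained ESPnet encoder, the BERT self-attention blocks and the frozen Sentence-BERT model described above, not actual APIs of those toolkits:

import torch
import torch.nn as nn
import torch.nn.functional as F


class SLUStudent(nn.Module):
    # Student: pretrained ASR encoder -> linear mapping (512 -> 768)
    # -> pretrained BERT self-attention blocks -> MEAN pooling.
    def __init__(self, asr_encoder: nn.Module, bert_encoder_layers: nn.Module):
        super().__init__()
        self.asr_encoder = asr_encoder                  # outputs (batch, frames, 512)
        self.proj = nn.Linear(512, 768)                 # maps ASR hidden size to BERT hidden size
        self.bert_encoder_layers = bert_encoder_layers  # outputs (batch, frames, 768)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        hidden = self.bert_encoder_layers(self.proj(self.asr_encoder(feats)))
        return hidden.mean(dim=1)                       # MEAN pooling -> utterance embedding (batch, 768)


def teacher_student_step(student, teacher_nlu, optimizer, feats, transcripts):
    # One alignment step: pull the student's utterance embedding for speech
    # towards the frozen teacher's sentence embedding for the transcription.
    with torch.no_grad():                               # teacher parameters stay frozen
        target = teacher_nlu(transcripts)               # (batch, 768) sentence embeddings
    pred = student(feats)
    loss = (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()  # cosine distance objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()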
The SLU model is evaluated on two downstream tasks, dialog act (DA) classification and intent classification, both of which are utterance classification tasks. DA classification is evaluated on two corpora: the ICSI Meeting Recorder Dialog Act Corpus (MRDA) and the NXT-format Switchboard Corpus (SwDA). Intent classification is evaluated on the Fluent Speech Commands (FSC) corpus. Table 1 summarizes the datasets.
Table 1: SLU evaluation datasets

Dataset   Number of classes   Number of utterances (Train / Valid / Test)
SwDA      42                  97,756 / 8,591 / 2,507
MRDA      6                   77,596 / 15,721 / 15,398
FSC       31                  23,132 / 3,118 / 3,793
In order to perform utterance classification, we first train a one-layer feed-forward classifier on the sentence embeddings produced by the NLU model from the ground truth transcriptions of the training subset, using the cross-entropy objective function. After that, we test the classifier on the semantic utterance embeddings extracted from the recordings of the testing subset using the SLU model. We report accuracy values as the percentage of correctly classified utterances out of the total number of utterances.
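The evaluation protocol can be summarized with the following sketch, where teacher.encode and student.embed are hypothetical helpers standing in for the NLU sentence embedding of a transcription and the SLU utterance embedding of a recording:

import torch
import torch.nn as nn
import torch.nn.functional as F


def train_and_eval_classifier(train_texts, train_labels, test_feats, test_labels,
                              teacher, student, num_classes, epochs=50, lr=1e-3):
    # Train a one-layer feed-forward classifier on teacher (text) embeddings,
    # then evaluate it on student (speech) embeddings of the test recordings.
    x_train = teacher.encode(train_texts)                # (N_train, 768) sentence embeddings
    clf = nn.Linear(x_train.size(-1), num_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(clf(x_train), train_labels)
        loss.backward()
        opt.step()
    with torch.no_grad():
        x_test = student.embed(test_feats)               # (N_test, 768) utterance embeddings
        accuracy = (clf(x_test).argmax(dim=-1) == test_labels).float().mean()
    return accuracy.item()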
The traditional approach to SLU tasks is a pipeline of ASR followed by NLU, and we adopt it as a baseline while employing the same ASR and NLU models as in the rest of the experiments. Table 2 reports the results of NLU on the ASR output as well as on the ground truth transcriptions. The ground truth results represent an upper bound of the accuracy achievable on these datasets with the NLU model we use in case of perfect transcriptions. The effect of imperfect ASR output varies between datasets depending on the difficulty of the recording conditions, the differences between the formats of the manual transcriptions used to train the classifiers, and the tolerance of the downstream tasks to the type of noise that ASR introduces. The amount of errors in the ASR output is indicated by the Word Error Rate (WER), which is also reported in the table. We select the best performing hyperparameters for the classifier training, but do not fine-tune the NLU component for the downstream tasks, because our main goal is SLU as a generic speech equivalent of NLU rather than the best possible model for some particular downstream task.
Table 2: Accuracy of NLU on the ASR output and on the ground truth transcriptions, and WER of the ASR

Transcriptions   Accuracy on Test, % (SwDA / MRDA / FSC)
Ground truth     71.72 / 77.72 / 100.0
ASR output       57.23 / 64.06 / 94.57

Transcriptions   WER on Test, % (SwDA / MRDA / FSC)
ASR output       28.0 / 29.7 / 7.9
4. Results
Our first set of experiments is designed to determine which layers of the SLU model should be fine-tuned after the combination of the parameters transferred from the ASR encoder and the NLU. As mentioned before, we insert a linear mapping layer between the former ASR and NLU layers because of the difference in dimensionality. It is initialized randomly and its parameters are always updated during the fine-tuning step. In addition to that, we try to fine-tune various numbers of layers closest to the mapping layer, meaning the top layers of the former ASR encoder and the bottom layers of the former NLU. We do so because, for these layers, the output (for the ASR encoder) or the input (for the NLU) is expected to change after the parameter transfer, in contrast to the bottom layers of the former ASR encoder and the top layers of the former NLU, where the input and the output should not change. We run fine-tuning for 10 epochs using square root learning rate scheduling [30] with 300,000 warmup steps and learning rate coefficient 50, and use the cosine distance based objective function. The results are given in Table 3. While it is not completely clear how many layers should be fine-tuned, we can conclusively tell that fine-tuning the former ASR encoder layers is more beneficial than fine-tuning the former NLU layers. We decide to fine-tune the two top former ASR encoder layers. The results also illustrate that the optimization of the SLU model for a smaller distance of its output from the output of the NLU model is general enough and translates to accuracy improvements in the downstream tasks, although not in all cases.
Table 3: Effect of layer fine-tuning

Fine-tuned ASR layers   Fine-tuned NLU layers   Accuracy on Test, % (SwDA / MRDA / FSC)   Validation loss
0                       0                       43.76 / 56.08 / 68.07                     0.26
0                       1                       37.61 / 56.47 / 85.53                     0.19
1                       0                       52.37 /
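The layer selection can be implemented by freezing everything except the mapping layer and the chosen layers adjacent to it. The sketch below assumes the SLUStudent module sketched earlier and list-like .layers attributes; the attribute names are illustrative, not ESPnet or BERT API:

def select_trainable(student, n_asr_top=2, n_nlu_bottom=0):
    # Freeze all parameters, then re-enable gradients for the mapping layer,
    # the top n former ASR encoder layers and the bottom n former NLU layers.
    for p in student.parameters():
        p.requires_grad = False
    for p in student.proj.parameters():                  # mapping layer is always updated
        p.requires_grad = True
    top_asr = student.asr_encoder.layers[-n_asr_top:] if n_asr_top > 0 else []
    bottom_nlu = student.bert_encoder_layers.layers[:n_nlu_bottom]
    for layer in list(top_asr) + list(bottom_nlu):
        for p in layer.parameters():
            p.requires_grad = True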
After deciding which layers to fine-tune, we run a series of experiments to determine the best learning rate schedule. Table 4 presents the combinations of the learning rate constant and the number of warmup steps that we explored. When we increase the number of warmup steps, we notice a positive effect from the slower learning rate ramp up. However, as the number of warmup steps becomes close to the total number of fine-tuning steps, we have to increase the number of epochs from 10 to 20 in order to see the whole fine-tuning process.
Table 4: Effect of the learning rate schedule

Warmup steps   LR constant   Epochs   Accuracy on Test, % (SwDA / MRDA / FSC)   Validation loss
300,000        50            10       52.93 / 59.42 / 85.76                     0.15
600,000        50            10       51.18 / 59.95 / 86.84                     0.14
600,000        50            20       54.00 /
Figure 2: Accuracy comparison for the utterances grouped by ASR WER. (Panels (a) SwDA, (b) MRDA and (c) FSC plot accuracy, %, against WER, %, for the pipeline and end-to-end architectures.)

4.1.3. Objective function
A comparison of the objective functions on the downstream tasks, as well as a cross-comparison of how the selected objective function influences the values of the others on the validation subset, is provided in Table 5. Overall, these results indicate that the evaluated objective functions behave similarly in this task; however, the L1 distance based objective function yields slightly better results.
Table 5: Effect of the objective function and longer training

Objective function   Accuracy on Test, % (SwDA / MRDA / FSC)   Validation value (Cosine / L2 / L1)
Cosine               55.56 / 59.64 / 89.45                     0.13 / 0.08 / 0.21
L2                   53.73 / 59.91 / 88.64                     0.13 / 0.07 / 0.20
L1                   56.32 /
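The three distance based objective functions compared in Table 5 can be written as follows; since the exact reductions (per-dimension vs. per-utterance averaging) are not spelled out in the text, the normalization here is illustrative:

import torch
import torch.nn.functional as F


def ts_distance(pred: torch.Tensor, target: torch.Tensor, kind: str = "cosine") -> torch.Tensor:
    # Distance between student and teacher embeddings, averaged over the batch.
    if kind == "cosine":
        return (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()
    if kind == "l2":
        return (pred - target).pow(2).sum(dim=-1).sqrt().mean()
    if kind == "l1":
        return (pred - target).abs().sum(dim=-1).mean()
    raise ValueError(f"unknown distance: {kind}")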
The resulting end-to-end utterance classification model is fully differentiable and can be further optimized for a downstream task by applying standard supervised neural network training methods with a few labeled speech samples. This feature should be helpful for the full exploitation of the task-relevant information encoded in speech, which is less trivial to implement in the traditional ASR and NLU pipeline setup, where the intermediate representation has to be discrete and, for example, flattens the rich variety of prosodic events to a few punctuation characters. We examine whether it is useful in practice by running standard supervised classifier training on a few samples from the training subsets. Table 6 compares the results of fine-tuning the output layer alone and together with the two former ASR encoder layers. We conclude that the end-to-end approach can indeed overcome the error propagation problem of the pipeline SLU approach by the automatic propagation of the error signal back to the relevant parts of the SLU system. However, additional training samples may sometimes easily skew the small training dataset away from the testing dataset and cause worse results, so more attention should be paid to the selection of training samples.

Table 6: Effect of supervised fine-tuning on downstream tasks

Num. of samples   Output layer only, accuracy on Test, %   Output and hidden layers, accuracy on Test, %
per class         (SwDA / MRDA / FSC)                      (SwDA / MRDA / FSC)
0                 58.60 / 60.18 / 91.12                    58.60 / 60.18 / 91.12
1                 58.60 / 60.59 / 93.62                    58.60 / 60.41 / 94.15
2                 58.60 / 60.22 / 93.44                    58.60 / 60.40 / 95.04
3                 58.83 / 60.22 / 93.33                    58.83 / 60.16 / 94.83
4                 58.55 / 60.35 /
5. Qualitative Analysis
We attempt to assess the differences between the pipeline and end-to-end SLU approaches in greater detail by looking at the accuracy values on groups of utterances split by the WER levels of the baseline ASR system. Our hypothesis is that the pipeline system would make more mistakes on the more challenging recordings characterized by higher WER values, because ASR systems are optimized for phonetic or graphemic similarity to the ground truth and are more likely to lose semantic information in case of errors. Figure 2 shows the accuracy values of the pipeline system and the best end-to-end SLU model (without fine-tuning). The utterances are grouped by the WER of the baseline ASR in 10% ranges. The ranges with WER >
6. Conclusions
We proposed to combine parameter transfer from well trained ASR and NLU models with teacher-student learning for the final alignment of the SLU output space to the NLU output space, in order to construct an end-to-end SLU model allowing few-shot transfer of downstream tasks from text to speech. We outlined the necessary steps and settings for the practical adaptation of a pretrained NLU model in SLU via cross-modal transfer. Our system reaches an accuracy of 58.60%, 60.18% and 91.12% on the SwDA, MRDA and FSC datasets without fine-tuning, and 60.22%, 61.32% and 95.49% after fine-tuning on ten labeled samples per class, compared to 57.23%, 64.06% and 94.57% reached by the pipeline system. The results of this research support the idea that text pretrained contextual embeddings can be useful for tasks outside of the text modality. The present study also adds new tasks to the growing body of research on language processing methods using Transformer neural networks.

7. References

[1] S. Pinker and P. Bloom, "Natural language and natural selection," Behavioral and Brain Sciences, vol. 13, no. 4, pp. 707–727, 1990.
[2] R. De Mori, F. Bechet, D. Hakkani-Tur, M. McTear, G. Riccardi, and G. Tur, "Spoken language understanding," IEEE Signal Processing Magazine, vol. 25, no. 3, pp. 50–58, 2008.
[3] J. L. Kröger, O. H.-M. Lutz, and P. Raschke, "Privacy implications of voice and speech analysis–information disclosure by inference," in IFIP International Summer School on Privacy and Identity Management. Springer, 2019, pp. 242–258.
[4] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[5] J. Baker, "The DRAGON system–an overview," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 23, no. 1, pp. 24–29, 1975.
[6] F. Jelinek, Statistical Methods for Speech Recognition. MIT Press, 1997.
[7] A. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260–269, 1967.
[8] J. Chorowski and N. Jaitly, "Towards better decoding and language model integration in sequence to sequence models," Proc. Interspeech 2017, pp. 523–527, 2017.
[9] T. Hori, J. Cho, and S. Watanabe, "End-to-end speech recognition with word-based RNN language models," in SLT. IEEE, 2018, pp. 389–396.
[10] D. Ortega and N. T. Vu, "Lexico-acoustic neural-based models for dialog act classification," in ICASSP. IEEE, 2018, pp. 6194–6198.
[11] D. Serdyuk, Y. Wang, C. Fuegen, A. Kumar, B. Liu, and Y. Bengio, "Towards end-to-end spoken language understanding," in ICASSP. IEEE, 2018, pp. 5754–5758.
[12] Y.-P. Chen, R. Price, and S. Bangalore, "Spoken language understanding without speech recognition," in ICASSP. IEEE, 2018, pp. 6189–6193.
[13] P. Haghani, A. Narayanan, M. Bacchiani, G. Chuang, N. Gaur, P. Moreno, R. Prabhavalkar, Z. Qu, and A. Waters, "From audio to semantics: Approaches to end-to-end spoken language understanding," in SLT. IEEE, 2018, pp. 720–726.
[14] N. Tomashenko, A. Caubrière, and Y. Estève, "Investigating adaptation and transfer learning for end-to-end spoken language understanding from speech," Proc. Interspeech 2019, pp. 824–828, 2019.
[15] Y.-A. Chung and J. Glass, "Speech2Vec: A sequence-to-sequence framework for learning word embeddings from speech," arXiv preprint arXiv:1803.08976, 2018.
[16] Y.-A. Chung, W.-H. Weng, S. Tong, and J. Glass, "Unsupervised cross-modal alignment of speech and text embedding spaces," in Advances in Neural Information Processing Systems, 2018, pp. 7354–7364.
[17] Y.-S. Chuang, C.-L. Liu, and H.-Y. Lee, "SpeechBERT: Cross-modal pre-trained language model for end-to-end spoken question answering," arXiv preprint arXiv:1910.11559, 2019.
[18] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[19] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[20] N. T. Vu and T. Schultz, "Multilingual multilayer perceptron for rapid language adaptation between and across language families," in Interspeech, 2013, pp. 515–519.
[21] J. Kunze, L. Kirsch, I. Kurenkov, A. Krug, J. Johannsmeier, and S. Stober, "Transfer learning for speech recognition on a budget," in Proceedings of the 2nd Workshop on Representation Learning for NLP, 2017, pp. 168–177.
[22] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," 2019.
[23] Y. Qian, R. Ubale, V. Ramanaryanan, P. Lange, D. Suendermann-Oeft, K. Evanini, and E. Tsuprun, "Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system," in ASRU. IEEE, 2017, pp. 569–576.
[24] L. Lugosch, M. Ravanelli, P. Ignoto, V. S. Tomar, and Y. Bengio, "Speech model pre-training for end-to-end spoken language understanding," arXiv preprint arXiv:1904.03670, 2019.
[25] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in ICASSP, 2017, pp. 4835–4839.
[26] P. Denisov and N. T. Vu, "IMS-Speech: A speech to text tool," Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2019, pp. 170–177, 2019.
[27] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[28] J. Li, M. L. Seltzer, X. Wang, R. Zhao, and Y. Gong, "Large-scale domain adaptation via teacher-student learning," Proc. Interspeech 2017, pp. 2386–2390, 2017.
[29] D. Ortega, C.-Y. Li, G. Vallejo, P. Denisov, and N. T. Vu, "Context-aware neural-based dialog act classification on automatically generated transcriptions," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7265–7269.
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[31] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang et al., "A comparative study on Transformer vs RNN in speech applications," arXiv preprint arXiv:1909.06317, 2019.
[32] T. Kudo, "Subword regularization: Improving neural network translation models with multiple subword candidates," arXiv preprint arXiv:1804.10959, 2018.
[33] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," Proc. Interspeech 2019, pp. 2613–2617, 2019.
[34] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP), 2019.