Large-scale Transfer Learning for Low-resource Spoken Language Understanding
Xueli Jia, Jianzong Wang*, Zhiyong Zhang, Ning Cheng, Jing Xiao

Ping An Technology (Shenzhen) Co., Ltd.

*Corresponding author: [email protected]
Abstract
End-to-end Spoken Language Understanding (SLU) models are made increasingly large and complex to achieve state-of-the-art accuracy. However, the increased complexity of a model can also introduce a high risk of over-fitting, which is a major challenge in SLU tasks due to the limited available data. In this paper, we propose an attention-based SLU model together with three encoder enhancement strategies to overcome the data sparsity challenge. The first strategy focuses on a transfer-learning approach to improve the feature extraction capability of the encoder. It is implemented by pre-training the encoder component with a large quantity of Automatic Speech Recognition annotated data, relying on the standard Transformer architecture, and then fine-tuning the SLU model with a small amount of target labelled data. The second strategy adopts multi-task learning: the SLU model integrates the speech recognition model by sharing the same underlying encoder, thereby improving robustness and generalization ability. The third strategy, learning from the Component Fusion (CF) idea, involves a Bidirectional Encoder Representations from Transformers (BERT) model and aims to boost the capability of the decoder with an auxiliary network, reducing the risk of over-fitting and indirectly augmenting the ability of the underlying encoder. Experiments on the FluentAI dataset show that the cross-language transfer learning and multi-task strategies improve intent classification accuracy by up to 2.95 and 3.37 points respectively, compared to the baseline.
1. Introduction
A conventional SLU pipeline mainly consists of two components [1]: an Automatic Speech Recognition (ASR) module that generates transcriptions or N-best hypotheses, and a Natural Language Understanding (NLU) module that classifies the transcriptions into intents. In such a pipeline, speech recognition errors are propagated and amplified during the subsequent NLU process. Although the performance of SLU has been significantly improved with the rapid development of end-to-end speech recognition systems [2, 3, 4, 5, 6, 7], it still cannot satisfy application requirements, due to the complexity of real-world scenarios.

Not all errors from speech recognition harm the SLU module; some errors have no impact on the eventual performance [8, 9]. The SLU component attends mainly to keywords while discarding most other irrelevant words [10]. A joint optimization approach can therefore strengthen the focus of the model on improving the transcription accuracy of the words that relate to target events [11, 12]. Recently, many efforts have been dedicated to end-to-end SLU, in which the domain and the intent are predicted directly from the input audio [13, 14, 15, 16, 17, 18, 19]. Previous research has shown that a large amount of data is the determining factor for the excellent performance of a model [14]. However, due to the lack of audio and the ambiguity of intents, it is difficult to obtain sufficient in-domain labeled data. Transfer learning has become a common strategy to address the insufficient data problem [20, 21, 22]. Different transfer learning strategies have been applied to SLU models, and all of them yield competitive, complementary results [23, 24]. In this paper, this strategy is also applied to amplify the feature extraction capability of the encoder component: we pre-train the encoder with a large amount of speech recognition labeled data and then transfer the encoder to the SLU model.

Recently, [13] proposed and compared various encoder-decoder approaches that optimize each module of SLU in an end-to-end manner, and proved that an intermediate text representation is crucial for SLU and that jointly training the full model is advantageous. Attention-based models have been widely used in speech recognition and provide impressive performance [5, 6, 7, 25, 26, 27]. Inspired by this, we propose a Transformer-based multi-task strategy to adopt textual information in the SLU model. Since text information only acts on the decoder component in the speech recognition task, it can be treated as an adaptive regularizer that adjusts the encoder parameters and thereby contributes to improving intent prediction performance. It should be noticed that the lack of textual corpora is also a major challenge when training language models; various methods have been proposed to expand corpora over the past decade [28, 29, 30]. In addition, a textual-level transfer learning strategy that merges a pre-trained representation into the decoder is also explored. The pre-trained representation is obtained with the BERT model, which is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers [31].

The encoder and decoder are mutually independent but are connected by the attention block, through which they obtain a collaborative optimization during training. To maximize performance, both the encoder and the decoder are optimized with transfer learning strategies.
In this paper, we first propose a self-attention based end-to-end SLU structure and apply a cross-lingual transfer learning method to address the insufficient acoustic data problem. Then we propose a Transformer-based multi-task strategy that conducts intent classification and speech recognition in parallel. Finally, a textual-level transfer learning structure is designed to aggregate the pre-trained BERT model into the decoder component, which indirectly improves the feature extraction capability of the decoder.
2. Methodology
In this section, a self-attention based end-to-end SLU model is first proposed. Next, a Transformer-based multi-task structure is designed to take intermediate textual information into account. Finally, the CF structure is implemented in the decoder as an enhancement of the auxiliary network for the multi-task structure.

2.1. Attention-based SLU Model

Self-attention layers have been proved to be superior to recurrent layers in computational complexity when the sequence length is smaller than the representation dimensionality, and they can also yield more interpretable models than convolutional layers [32]. Inspired by these advantages, an attention-based encoder-decoder structure is designed to solve SLU problems. The architecture consists of several stacks of layers. Each layer of the encoder and decoder consists of a multi-head attention module and a position-wise fully connected feed-forward network. A max-pooling layer is used to aggregate the output of the encoder along the time axis, as illustrated in Figure 1. A softmax function is used to estimate the posterior probabilities of intents.

We denote the input acoustic frames as x = (x_1, ..., x_T), where x_t ∈ R^d (1 ≤ t ≤ T) indicates the log-mel filter-bank (FBank) features used in this work, d is the dimension of the FBank features, and T indicates the number of frames in x. The ground-truth posterior distribution for utterance u is defined as q^u = (q^u_1, ..., q^u_I), represented in one-hot format. The cross-entropy criterion is used to evaluate the model, and the cost function L^u_slu for each utterance is defined as Equation 1:

    L^u_slu(θ) = − Σ_{i=1}^{I} q^u_i log p(y^u_i | x; θ)    (1)

where u is the index of the speech utterance, θ indicates the model parameters, I represents the number of intents, y^u_i indicates the i-th predicted intent, and p(y^u_i | x; θ) denotes the posterior probability of y^u_i given x and θ.
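To make this structure concrete, the following is a minimal PyTorch sketch of the base model and the loss of Equation 1. It is our illustration rather than the authors' implementation; positional encodings and padding masks are omitted, and the intent count of 31 is an assumption taken from the FluentAI label set rather than stated in the paper. The 3-layer, 8-head, 512/1024 configuration matches the best setting later reported in Table 3.

```python
# A minimal sketch (ours, not the authors' code) of the Section 2.1 model:
# a Transformer encoder over FBank frames, max-pooling along time, and a
# softmax intent classifier. n_intents=31 is an assumed FluentAI intent count.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionSLU(nn.Module):
    def __init__(self, feat_dim=80, d_model=512, n_heads=8,
                 d_inner=1024, n_layers=3, n_intents=31):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)   # FBank -> model dim
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=d_inner, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_intents)

    def forward(self, x):                      # x: (batch, T, feat_dim)
        h = self.encoder(self.input_proj(x))   # (batch, T, d_model)
        h = h.max(dim=1).values                # max-pool along the time axis
        return self.classifier(h)              # intent logits

# Equation 1: with one-hot targets q^u, the per-utterance cost reduces to the
# standard cross-entropy between the softmax output and the intent index.
model = AttentionSLU()
x = torch.randn(4, 200, 80)                    # a batch of FBank sequences
intents = torch.randint(0, 31, (4,))           # ground-truth intent indices
loss = F.cross_entropy(model(x), intents)      # L_slu
loss.backward()
```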
Figure 1: Structures of the base model and augmentation strategies: (1) the attention-based SLU model (left); (2) the left encoder together with the right decoder forms the basic Transformer structure; (3) the intent classification model together with the Transformer produces the multi-task structure.

2.2. Encoder Enhancement Strategies

2.2.1. Cross-lingual Transfer Learning

Human languages share some commonality in both acoustic and phonetic aspects. Features extracted from some languages can be shared with other languages at some level of abstraction. [33] showed that adapting English phones on Hungarian data yields substantial gains in performance over models trained only with Hungarian data. Inspired by that, we concentrate on the study of cross-lingual transfer learning for the attention-based SLU model. It is achieved by pre-training the encoder with speech from a language different from the target language; the encoder-decoder model is then fine-tuned with a small amount of target annotated data.

The key approach is to first train a Transformer-based speech recognition model with a large quantity of rich-resource speech and word-level transcriptions, and then migrate the well-trained encoder component to the intent model. This is possible because the encoder in SLU maps the source acoustic features to a high-dimensional representation and depends on large amounts of data for good representation capability, exactly as in speech recognition applications. Acoustic transfer learning makes it possible to transfer the representation capability of an encoder trained with rich-resource data to an intent classification task with insufficient data. In this work, we adopt the encoder from speech recognition for intent recognition directly and explore its effectiveness.
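The transfer step can be sketched as below. This is again illustrative: the checkpoint file name and the "encoder." parameter key prefix are hypothetical assumptions, and the classifier head is trained from scratch on the target data.

```python
# Hypothetical sketch of Section 2.2.1: reuse the encoder of a Transformer
# ASR model pre-trained on AISHELL inside the SLU model. The checkpoint
# name and parameter key prefix are assumptions for illustration.
import torch

asr_state = torch.load("aishell_asr_transformer.pt")   # pre-trained ASR weights
# Keep only the encoder parameters (keys like "encoder.layers.0....").
enc_state = {k: v for k, v in asr_state.items() if k.startswith("encoder.")}

slu_model = AttentionSLU()                             # sketch from Section 2.1
# strict=False: the intent classifier stays randomly initialized and is
# learned on the small FluentAI training set.
slu_model.load_state_dict(enc_state, strict=False)

# The two variants compared later in Table 4:
for p in slu_model.encoder.parameters():
    p.requires_grad_(False)     # "EP Fix": freeze the transferred encoder
# leaving requires_grad=True instead gives the "EP Fine-tune" variant
```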
2.2.2. Multi-task Training

The multi-task structure consists of three components: an encoder module for acoustic representation, a decoder for the speech recognition task, and another decoder for the intent prediction task. The intent prediction decoder is placed directly after the acoustic encoder, which is a compromise compared with the conventional pipelined SLU model, given the inaccurate text predictions of the speech recognition module. The multi-task structure is illustrated in Figure 1.

In this work, the intent prediction task aims at mapping the acoustic feature sequence into a semantic space, treating it as a semantic classification task. During this procedure, a latent operation is translating the sequence of acoustic features into text, just like the task of speech recognition. Speech recognition and intent prediction therefore share the same procedure of translating acoustic features into a high-level semantic representation. Thus the multi-task architecture is designed to share the same acoustic representation between speech recognition and SLU, and the two tasks are optimized jointly. Since our ultimate goal is to predict intents directly from input acoustic features, the speech recognition component can be thought of as a regularizer for the SLU task, offering an inductive bias to it.

The same attention-based model described in Section 2.1 is used for intent prediction. In order to perform intent prediction and speech recognition in parallel, additional stacked self-attention layers and a linear layer followed by a softmax classification layer are coupled with the encoder to output the posterior probabilities for speech recognition. As illustrated in Figure 1, this model consists of two sub-models: an attention-based intent prediction sub-model with only acoustic features as input, and a speech recognition sub-model that accepts both the acoustic representation from the encoder and text input to the decoder. The encoder part in the bottom-left area together with the decoder component, which is detailed in Figure 2(a), forms the typical Transformer architecture.

Figure 2: Structure of the speech recognition decoder in the auxiliary network. It consists of two sections: (a) shows the decoder structure, used in both the encoder pre-training and multi-task strategies; (b) is used for the BERT fusion strategy, which boosts the linguistic extraction capacity of the decoder.

In the training procedure, the loss function for speech recognition is the cross-entropy criterion, with output labels represented in one-hot format. The loss function for each utterance in the speech recognition task is described in Equations 2 and 3:

    L^u_asr(θ) = Σ_{t=1}^{T} L^t_asr(θ)    (2)

    L^t_asr(θ) = − Σ_{v=1}^{V} q^t_v log p(y^t_v | x; y_{<t}; θ)    (3)

where T here denotes the length of the output token sequence, V is the vocabulary size, q^t_v is the ground-truth label, and p(y^t_v | x; y_{<t}; θ) is the posterior probability of token y^t_v given the acoustic input x and the previously emitted tokens y_{<t}.
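A sketch of the joint objective follows; it is our illustration of the description above, not released code. The vocabulary size and the loss weight (written `scale` below, corresponding to the λ studied in Section 3 and Table 4) are assumptions.

```python
# A sketch (ours) of the multi-task objective: the shared encoder feeds both
# the intent classifier (Eq. 1) and an autoregressive Transformer ASR decoder
# (Eqs. 2-3). vocab_size and `scale` (lambda) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size, scale = 512, 4000, 1.0

dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8,
                                       dim_feedforward=1024, batch_first=True)
asr_decoder = nn.TransformerDecoder(dec_layer, num_layers=3)
token_emb = nn.Embedding(vocab_size, d_model)
asr_head = nn.Linear(d_model, vocab_size)

def multitask_loss(enc_out, intent_logits, intents, tokens):
    """enc_out: (B, T, d) shared encoder output; tokens: (B, U) transcript."""
    l_slu = F.cross_entropy(intent_logits, intents)        # Equation 1
    # Teacher forcing: predict token t from x and y_{<t} (Equations 2-3).
    tgt_in, tgt_out = tokens[:, :-1], tokens[:, 1:]
    causal = nn.Transformer.generate_square_subsequent_mask(tgt_in.size(1))
    dec = asr_decoder(token_emb(tgt_in), enc_out, tgt_mask=causal)
    l_asr = F.cross_entropy(asr_head(dec).transpose(1, 2), tgt_out)
    # The ASR term acts as a regularizer on the shared encoder.
    return l_slu + scale * l_asr
```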
3. Experimental setups

In the experiments, two datasets are used to train and test the different structures. The FluentAI dataset described in [35] is used to train and evaluate the baseline model and the SLU model with the different strategies. As shown in Table 1, this dataset is sampled in 16 kHz single-channel wav format. Each audio recording includes a single command and is labeled with three slots: action, object, and location. There are 248 different phrases, with a total of 19 hours of audio. The second dataset, shown in Table 2, is the open-source Mandarin speech corpus AISHELL-ASR0009-OS1, which is used to pre-train the encoder component. This dataset consists of 165 hours of speech recorded by 400 people from different accent areas in China.

Table 1: FluentAI Speech Command dataset

  Split   Speakers   Utterances   Hours
  Train   77         23,132       14.7
  Valid   10         3,118        1.9
  Test    10         3,793        2.4
  Total   97         30,043       19.0

Table 2: AISHELL-ASR0009-OS1 dataset

  Split   Hours   Male   Female
  Train   150     161    179
  Valid   10      12     28
  Test    5       13     7
  Total   165     186    214

All experiments are conducted using 80-dim FBank features with a frame length of 25 ms and a frame shift of 10 ms. Mean and variance normalization is applied at the utterance level; the features are then down-sampled along the time axis, with consecutive vectors stacked.
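A possible realization of this front-end is sketched below, assuming torchaudio's Kaldi-compatible FBank; the paper names neither a toolkit nor the stacking factor, so the factor of 3 is our assumption.

```python
# Sketch of the front-end described above: 80-dim log-mel FBank with
# utterance-level mean/variance normalization and stacking of consecutive
# frames. torchaudio and the stacking factor of 3 are our assumptions.
import torchaudio

def extract_features(wav_path, stack=3):
    wav, sr = torchaudio.load(wav_path)                # 16 kHz mono audio
    fbank = torchaudio.compliance.kaldi.fbank(
        wav, num_mel_bins=80, sample_frequency=sr,
        frame_length=25.0, frame_shift=10.0)           # (T, 80)
    # Utterance-level mean and variance normalization.
    fbank = (fbank - fbank.mean(dim=0)) / (fbank.std(dim=0) + 1e-8)
    # Stack consecutive frames, down-sampling the time axis by `stack`;
    # the SLU model's feat_dim must then match stack * 80.
    T = fbank.size(0) // stack * stack
    return fbank[:T].reshape(-1, stack * 80)           # (T // stack, 240)
```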
The SLU model described in Section 2.1 is treated as the baseline. The results in Table 3 show that the performance of the SLU model has a strong correlation with the number of parameters, which is attributable to the small training set and the complexity of the model. The best accuracy, 90.38, is achieved with the third configuration; these parameters are adopted in the subsequent experiments to compare the different enhancement strategies.

Table 3: Results of various configurations of the SLU model

  N_enc/dec   N_head   d_k/v   d_model/inner   Acc
  2/0         2        32      256/512         88.67
  3/0         8        64      256/512         86.68
  3/0         8        64      512/1024        90.38

Cross-lingual transfer learning is implemented by first training a Transformer-based speech recognition model with the 150 hours of AISHELL training data, and then transferring the well-trained encoder to the SLU model directly. Two experiments, fixing the parameters of the encoder and fine-tuning them, have been conducted. The results in Table 4 indicate that both strategies improve the performance of the SLU model, which means that an encoder trained on an unrelated language can be migrated to another language in the acoustic space. Table 4 also shows that a better result, 94.86, is obtained when the encoder parameters are fixed. This reflects the simplicity of the FluentAI dataset: training tends to over-fit when more parameters are involved. If the encoder were trained with more data, it would have more robust generalization capabilities.

The multi-task experiment is implemented with the FluentAI dataset as well. Table 4 gives results with different speech recognition loss scales. It indicates that the best performance, 95.28, is obtained when the scale is set to 1.0, which shows that the speech recognition model can benefit SLU when given an appropriate scale. In practice it is difficult to balance the parameter λ; the main point is that we want the auxiliary task to promote the shared part of the two tasks in a data-driven manner, or to act as a regularizer for the SLU task. The scale 1.0 is applied in the following experiments.

The BERT fusion strategy is conducted on top of the multi-task structure. The BERT model consists of 12 layers, each with 768 hidden units and 12 attention heads, for about 110M parameters in total; the parameters of the BERT model are fixed in all the subsequent experiments. Table 4 indicates that this strategy gives 3.58 and 0.21 improvements compared with the baseline and the multi-task method, respectively, which shows that the BERT model is capable of improving the performance of the SLU model.

In addition, different combinations of these strategies are explored. Table 4 demonstrates that the combination of cross-lingual pre-training and multi-task strategies obtains an accuracy of 96.07, and the combination of all three strategies gives 94.91. Both methods produce better performance than the baseline. Figure 3 depicts the validation intent loss over training epochs; both compound strategies obtain lower losses and converge faster. Theoretically, the combination of all three strategies should give the best performance. However, the experiments show that cross-lingual encoder pre-training with the multi-task strategy gives the largest gain in accuracy. The reason is attributed to data sparsity: models are difficult to train well with limited data, and the sparsity of labeled data usually comes with over-fitting, which aggravates tuning and optimization during the training procedure.

Table 4: Intent prediction accuracy for different strategies

  Methodologies     Tune/Fix         Scale   Accuracy
  Baseline          -                -       91.91
  EP                Fix              -       94.86
  EP                Fine-tune        -       93.25
  MT                -                0.1     92.41
  MT                -                0.5     95.25
  MT                -                1.0     95.28
  MT and BF         Fix              1.0     95.49
  EP and MT         Fine-tune        1.0     96.07
  EP, MT and BF     Fine-tune, Fix   1.0     94.91

  EP: Encoder Pre-training. FT: Fine-tune. MT: Multi-task. BF: BERT Fusion.

Figure 3: Validation intent losses for the baseline, multi-task with encoder pre-training, and multi-task with encoder pre-training and BERT fusion.

4. Conclusion

In this paper, we proposed an attention-based end-to-end SLU model and evaluated different augmentation strategies based on this model. We showed that cross-lingual encoder pre-training, the multi-task strategy, and BERT fusion are all capable of improving intent classification performance. These enhancement strategies can also be extended to other areas to improve their performance. Due to the limitation of data, the model is prone to over-fitting and sensitive to model parameters. More investigation on how to efficiently address data sparsity in model training will be conducted in future work.

5. Acknowledgement

This paper is supported by the National Key Research and Development Program of China under grants No. 2018YFB1003500, No. 2018YFB0204400 and No. 2017YFB1401202. The corresponding author is Jianzong Wang from Ping An Technology (Shenzhen) Co., Ltd.

6. References

[1] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[2] H. Inaguma, J. Cho, M. K. Baskar, T. Kawahara, and S. Watanabe, "Transfer learning of language-independent end-to-end ASR with language model fusion," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6096–6100.
[3] N. Moritz, T. Hori, and J. Le Roux, "Triggered attention for end-to-end speech recognition," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5666–5670.
[4] Y. Jia, M. Johnson, W. Macherey, R. J. Weiss, Y. Cao, C.-C. Chiu, N. Ari, S. Laurenzo, and Y. Wu, "Leveraging weakly supervised data to improve end-to-end speech-to-text translation," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7180–7184.
[5] N. Moritz, T. Hori, and J. Le Roux, "Streaming automatic speech recognition with the transformer model," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.
[6] H. Miao, G. Cheng, C. Gao, P. Zhang, and Y. Yan, "Transformer-based online CTC/attention end-to-end speech recognition architecture," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.
[7] H. Inaguma, Y. Gaur, L. Lu, J. Li, and Y. Gong, "Minimum latency training strategies for streaming sequence-to-sequence ASR," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.
[8] S. Bhosale, I. Sheikh, S. H. Dumpala, and S. K. Kopparapu, "End-to-end spoken language understanding: Bootstrapping in low resource scenarios," in Proc. Interspeech 2019, 2019, pp. 1188–1192.
[9] R. Masumura, T. Tanaka, A. Ando, H. Kamiyama, T. Oba, S. Kobashikawa, and Y. Aono, "Improving conversation-context language models with multiple spoken language understanding models," in Proc. Interspeech 2019, 2019, pp. 834–838.
[10] A. Ray, Y. Shen, and H. Jin, "Robust spoken language understanding via paraphrasing," in Proc. Interspeech 2018, 2018, pp. 3454–3458.
[11] Y. Li, X. Zhao, W. Xu, and Y. Yan, "Cross-lingual multi-task neural architecture for spoken language understanding," in Proc. Interspeech 2018, 2018, pp. 566–570.
[12] R. Gupta, A. Rastogi, and D. Hakkani-Tür, "An efficient approach to encoding context for spoken language understanding," in Proc. Interspeech 2018, 2018, pp. 3469–3473.
[13] P. Haghani, A. Narayanan, M. Bacchiani, G. Chuang, N. Gaur, P. Moreno, R. Prabhavalkar, Z. Qu, and A. Waters, "From audio to semantics: Approaches to end-to-end spoken language understanding," in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 720–726.
[14] D. Serdyuk, Y. Wang, C. Fuegen, A. Kumar, B. Liu, and Y. Bengio, "Towards end-to-end spoken language understanding," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5754–5758.
[15] Y.-P. Chen, R. Price, and S. Bangalore, "Spoken language understanding without speech recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 6189–6193.
[16] V. Renkens et al., "Capsule networks for low resource spoken language understanding," arXiv preprint arXiv:1805.02922, 2018.
[17] P. Wang, L. Wei, Y. Cao, J. Xie, and Z. Nie, "Large-scale unsupervised pre-training for end-to-end spoken language understanding," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7999–8003.
[18] Y. Huang, H. Kuo, S. Thomas, Z. Kons, K. Audhkhasi, B. Kingsbury, R. Hoory, and M. Picheny, "Leveraging unpaired text data for training end-to-end speech-to-intent systems," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7984–7988.
[19] P. Wang, L. Wei, Y. Cao, J. Xie, and Z. Nie, "Large-scale unsupervised pre-training for end-to-end spoken language understanding," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7999–8003.
[20] M. Xie, N. Jean, M. Burke, D. Lobell, and S. Ermon, "Transfer learning from deep features for remote sensing and poverty mapping," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[21] Z. Huang, Z. Pan, and B. Lei, "Transfer learning with deep convolutional neural network for SAR target classification with limited labeled data," Remote Sensing, vol. 9, no. 9, p. 907, 2017.
[22] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu, "A survey on deep transfer learning," in International Conference on Artificial Neural Networks. Springer, 2018, pp. 270–279.
[23] N. Tomashenko, A. Caubrière, and Y. Estève, "Investigating adaptation and transfer learning for end-to-end spoken language understanding from speech," in Proc. Interspeech 2019, 2019.
[24] A. Caubrière, N. Tomashenko, A. Laurent, E. Morin, N. Camelin, and Y. Estève, "Curriculum-based transfer learning for an effective end-to-end spoken language understanding and domain portability," in Proc. Interspeech 2019, 2019, pp. 1198–1202.
[25] P. Wang, J. Cui, C. Weng, and D. Yu, "Large margin training for attention based end-to-end speech recognition," in Proc. Interspeech 2019, 2019, pp. 246–250.
[26] J. Li, X. Wang, Y. Li et al., "The SpeechTransformer for large-scale Mandarin Chinese speech recognition," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7095–7099.
[27] O. Hrinchuk, M. Popova, and B. Ginsburg, "Correction of automatic speech recognition with transformer sequence-to-sequence model," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7074–7078.
[28] E. Dikici and M. Saraçlar, "Semi-supervised and unsupervised discriminative language model training for automatic speech recognition," Speech Communication, vol. 83, pp. 54–63, 2016.
[29] I. Sutskever, J. Martens, and G. E. Hinton, "Generating text with recurrent neural networks," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 1017–1024.
[30] A. Zgank, "Cross-lingual speech recognition between languages from the same language family," Proceedings of the Romanian Academy Series A - Mathematics, Physics, Technical Sciences, Information Science, vol. 20, no. 2, pp. 184–191, 2019.
[31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," NAACL, 2018.
[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[33] L. Tóth, J. Frankel, G. Gosztolya, and S. King, "Cross-lingual portability of MLP-based tandem features – a case study for English and Hungarian," in Proc. Interspeech 2008, 2008.
[34] C. Shan, C. Weng, G. Wang, D. Su, M. Luo, D. Yu, and L. Xie, "Component fusion: Learning replaceable language model component for end-to-end speech recognition system," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5361–5365.
[35] L. Lugosch, M. Ravanelli, P. Ignoto, V. S. Tomar, and Y. Bengio, "Speech model pre-training for end-to-end spoken language understanding," in Proc. Interspeech 2019, 2019, pp. 814–818.