Constrained Output Embeddings for End-to-End Code-Switching Speech Recognition with Only Monolingual Data
Yerbolat Khassanov, Haihua Xu, Van Tung Pham, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma
School of Computer Science and Engineering, Nanyang Technological University, Singapore
Temasek Laboratories, Nanyang Technological University, Singapore
Machine Intelligence Technology, Alibaba Group
{yerbolat002, haihuaxu, vantung001, zengzp, aseschng}@ntu.edu.sg, {ni.chongjia, b.ma}@alibaba-inc.com

Abstract
The lack of code-switching training data is one of the major concerns in the development of end-to-end code-switching automatic speech recognition (ASR) models. In this work, we propose a method to train an improved end-to-end code-switching ASR using only monolingual data. Our method encourages the distributions of the output token embeddings of the monolingual languages to be similar, and hence promotes the ASR model to easily code-switch between languages. Specifically, we propose to use Jensen-Shannon divergence and cosine distance based constraints. The former enforces the output embeddings of the monolingual languages to possess similar distributions, while the latter simply brings the centroids of the two distributions close to each other. Experimental results demonstrate the high effectiveness of the proposed method, yielding substantial absolute mixed error rate improvements on a Mandarin-English code-switching ASR task.

Index Terms: code-switching, embeddings, Jensen-Shannon divergence, cosine distance, speech recognition, end-to-end
1. Introduction
Code-switching (CS) is the practice of using more than one language within a single discourse, and it poses a serious problem to many speech and language processing applications. Recently, end-to-end code-switching automatic speech recognition (E2E-CS-ASR) has gained increasing interest, and impressive improvements have been reported [1, 2, 3, 4]. The improvements are mainly achieved for CS language pairs for which a sufficient amount of transcribed CS data is available, such as Mandarin-English [5]. Unfortunately, for the vast majority of other CS language pairs the CS data remains too small or even non-existent.

Several attempts have been made to alleviate the CS data scarcity problem. Notably, [6, 7] used semi-supervised approaches to utilize untranscribed CS speech data. On the other hand, [2, 3, 4] employed transfer learning techniques where additional monolingual speech corpora are used either for pre-training or for joint training. On account of the increased training data, these approaches achieved significant improvements. However, all these approaches rely on a cross-lingual signal imposed by some CS data or other linguistic resources such as a word-aligned parallel corpus.

In this work, we aim to build an E2E-CS-ASR using only monolingual data, without any form of cross-lingual resource. The only assumption we make is the availability of a monolingual speech corpus for each of the CS languages. This setup is important and common to many low-resource CS languages, but it has not received much research attention. Besides, it will serve as a strong baseline performance that any system trained on CS data should reach.

However, due to the absence of CS training data, the E2E-CS-ASR model will fail to learn cross-lingual relations between the monolingual languages. Consequently, the output token embeddings of the monolingual languages will diverge from each other and hence prevent the E2E-CS-ASR model from switching between languages. Indeed, we examined the shared output token embedding space learned by the E2E-CS-ASR and observed that the output token embeddings of the two monolingual languages are differently distributed and located apart from each other (see Figure 3a). We hypothesize that the difference between the output token embedding distributions restricts the E2E-CS-ASR model from correctly recognizing CS utterances.

To address this problem, we propose to impose additional constraints which encourage the output token embeddings of the monolingual languages to be similar. Specifically, we propose to use Jensen-Shannon divergence and cosine distance based constraints. The former enforces the output token embeddings of the monolingual languages to possess similar distributions, while the latter simply brings the centroids of the two distributions close to each other. In addition, the imposed constraints act as a regularization term to prevent overfitting. Our method is inspired by [8, 9], where intermediate feature representations of text and speech are forced to be close to each other. We evaluated our method on the Mandarin-English CS language pair from the SEAME [5] corpus, where we removed all CS utterances from the training data. Experimental results demonstrate the high effectiveness of the proposed method, yielding substantial absolute mixed error rate improvements.

The rest of the paper is organized as follows. In Section 2, we review related works addressing the CS data scarcity problem. In Section 3, we briefly describe the baseline E2E-CS-ASR model. In Section 4, we present the constrained output embeddings method.
Section 5 describes the experimental setup and discusses the obtained results. Lastly, Section 6 concludes the paper.
2. Related works
An early approach to building a CS-ASR using only monolingual data is the so-called "multi-pass" system [10]. The multi-pass system is based on traditional ASR and consists of three main steps. First, the CS utterances are split into monolingual speech segments using a language boundary detection system. Next, the obtained segments are labeled with specific languages using a language identification system. Lastly, the labeled segments are decoded using the corresponding monolingual ASR system. However, this approach is prone to error propagation between the different steps. Moreover, the language boundary detection and language identification tasks are considered difficult.

More recently, semi-supervised approaches have been explored to circumvent the CS data scarcity problem. For instance, [6] used their best CS-ASR to transcribe raw CS speech; the transcribed speech is then used to re-train the CS-ASR. In a similar manner, [7] employed their best CS-ASR to re-transcribe the poorly transcribed portion of the training set and then used it to re-train the model. Although the semi-supervised approaches are promising, they still require CS data as well as other systems such as language identification.

In the context of end-to-end ASR models, transfer learning techniques are widely used to alleviate the CS data scarcity problem. For example, [2, 3] used monolingual data to pre-train the model, followed by fine-tuning with CS data. On the other hand, [4] used both CS and monolingual data for pre-training, followed by standard fine-tuning with the CS data only. While being effective, the transfer learning based techniques highly rely on the CS data.

Generating synthesized CS data using only monolingual data has also been explored in [11, 12, 13, 14]; however, these works only address the textual data scarcity problem.
3. Baseline E2E-CS-ASR
Figure 1 illustrates the baseline E2E-CS-ASR model based on the hybrid CTC/Attention architecture [15], which incorporates the advantages of both the Connectionist Temporal Classification (CTC) model [16] and the attention-based encoder-decoder model [17]. Specifically, the CTC and attention-based decoder modules share a common encoder network and are jointly trained.
Encoder.
The shared encoder network takes a sequence of T-length speech features x = (x_1, ..., x_T) and transforms them into L-length high-level representations h = (h_1, ..., h_L), where L < T. The encoder is modeled as a deep convolutional neural network (CNN) based on the VGG network [18], followed by several bidirectional long short-term memory (BLSTM) layers:

h = BLSTM(CNN(x))    (1)
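As a concrete illustration of Eq. (1), the following is a minimal PyTorch sketch of a VGG-style CNN front-end followed by stacked BLSTM layers. It only mirrors the architecture described above: the class name SharedEncoder, the layer counts, filter sizes, and hidden dimensions are illustrative assumptions, not the exact configuration used in this work.

# Illustrative sketch of the shared encoder (Eq. 1): a VGG-style CNN front-end
# that subsamples the input frames, followed by stacked BLSTM layers.
# Layer sizes are placeholders, not the paper's configuration.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, feat_dim=83, lstm_units=320, lstm_layers=3):
        super().__init__()
        # Two VGG-like blocks, each halving the time and frequency resolution,
        # so the L-length output satisfies L < T as described above.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.blstm = nn.LSTM(
            input_size=128 * (feat_dim // 4),
            hidden_size=lstm_units,
            num_layers=lstm_layers,
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, x):
        # x: (batch, T, feat_dim) speech features
        c = self.cnn(x.unsqueeze(1))          # (batch, 128, T//4, feat_dim//4)
        c = c.transpose(1, 2).flatten(2)      # (batch, T//4, 128 * (feat_dim//4))
        h, _ = self.blstm(c)                  # (batch, L = T//4, 2 * lstm_units)
        return h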
CTC module.
The CTC module sits on top of the encoder and computes the posterior distribution P_CTC(y|x) of the N-length output token sequence y = (y_1, ..., y_N). The CTC loss is defined as the negative log-likelihood of the ground-truth sequence y*:

L_CTC = −log P_CTC(y* | x)    (2)
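To illustrate Eq. (2), the snippet below computes the CTC objective on top of the encoder output h with PyTorch's built-in CTC loss. The vocabulary size, batch shapes, and encoder dimensions are assumptions carried over from the sketch above rather than values used in this work.

# Minimal sketch of the CTC branch (Eq. 2) on top of the shared encoder output h.
import torch
import torch.nn as nn

vocab_size = 3000                          # hypothetical joint Mandarin/English token set (index 0 = blank)
ctc_proj = nn.Linear(2 * 320, vocab_size)  # projects encoder states to token posteriors
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

h = torch.randn(4, 50, 2 * 320)            # (batch, L, encoder_dim), e.g. from SharedEncoder above
log_probs = ctc_proj(h).log_softmax(-1).transpose(0, 1)    # (L, batch, vocab), as CTCLoss expects
targets = torch.randint(1, vocab_size, (4, 12))            # ground-truth token ids y*
input_lengths = torch.full((4,), 50, dtype=torch.long)
target_lengths = torch.full((4,), 12, dtype=torch.long)

loss_ctc = ctc_loss(log_probs, targets, input_lengths, target_lengths)   # -log P_CTC(y*|x), with CTCLoss's default mean reduction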
Attention-based decoder module.
The attention-based decoder computes the probability distribution P_ATT(y|x) over the output token sequence y given the previously emitted tokens. At each decoding step n, the Attention(·) module produces an attention weight vector α_n, which is used to form a context vector c_n encapsulating the information in the input speech features required to generate the next token, and a unidirectional long short-term memory (LSTM) network produces the decoder hidden state s_n. InputProj(·) and OutputProj(·) are input and output linear projection layers with learnable matrix parameters, respectively; the input and output learnable matrices hold the input and output embedding representations of the tokens. The loss function of the attention-based decoder module is the negative log-likelihood of the ground-truth sequence:

L_ATT = −log P_ATT(y* | x)    (8)

Figure 1: Hybrid CTC/Attention end-to-end ASR architecture with constrained output token embeddings (shared encoder, joint decoder, output token embeddings). The output token embeddings are learned by the parametric matrix of the linear output projection layer (OutputProj).

Finally, the CTC and attention-based decoder modules are jointly trained within a multi-task learning (MTL) framework as follows:

L_MTL = λ L_CTC + (1 − λ) L_ATT    (9)

where λ controls the contribution of the two losses. Our proposed method appends an additional constraint to the MTL framework, which mainly impacts the learnable matrix parameter of the OutputProj(·) layer, as will be explained in the following section.

4. Constrained output embeddings

In this work, we aim to build an E2E-CS-ASR using only monolingual data. This setup is essential for the vast majority of CS language pairs for which CS data is non-existent. However, an E2E-CS-ASR model trained on monolingual data will fail to learn language switch-points, and hence will perform sub-optimally on CS speech input. We investigated the E2E-CS-ASR model and found that the output token representations of the monolingual languages, modeled by the linear projection layer OutputProj(·), are differently distributed and located apart from each other (see Figure 3a). We hypothesize that the difference between the output token distributions of the monolingual languages restricts the E2E-CS-ASR model from switching between languages.

To reduce the discrepancy between these distributions, we propose to constrain the output token embeddings using Jensen-Shannon divergence (JSD) and cosine distance (CD) based constraints. These constraints act as a cross-lingual signal source which forces the output token embedding representations of the monolingual languages to be similar.

Jensen-Shannon divergence.
First, we assume that the learned output token embeddings of the monolingual language pair L_1 and L_2 follow z-dimensional multivariate Gaussian distributions:

L_1 ∼ Normal(μ_1, Σ_1)    (10)
L_2 ∼ Normal(μ_2, Σ_2)    (11)

The JSD between these distributions is then computed as:

L_JSD = tr(Σ_1^{-1} Σ_2 + Σ_2^{-1} Σ_1) + (μ_1 − μ_2)^T (Σ_1^{-1} + Σ_2^{-1}) (μ_1 − μ_2) − 2z    (12)

Lastly, we fuse the JSD constraint with the loss function of the E2E-CS-ASR in Eq. (9) as follows:

L_MTL = λ L_CTC + (1 − λ)(α L_ATT + (1 − α) L_JSD)    (13)

where α ∈ [0, 1] controls the importance of the constraint.
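The following is a minimal sketch of how the constraint in Eq. (12) could be computed from the two sets of output token embeddings, i.e. the rows of the OutputProj matrix belonging to Mandarin and to English tokens. The function name jsd_constraint, the diagonal-covariance simplification, and the small ridge term eps are illustrative assumptions for numerical stability rather than details specified above.

# Sketch of the JSD-style constraint of Eq. (12) between the two languages'
# output token embeddings, using diagonal covariances for cheap, stable inverses.
import torch

def jsd_constraint(emb_l1, emb_l2, eps=1e-5):
    # emb_l1: (V1, z) embeddings of language L1 tokens; emb_l2: (V2, z) of L2 tokens
    z = emb_l1.size(1)
    mu1, mu2 = emb_l1.mean(0), emb_l2.mean(0)
    var1 = emb_l1.var(0, unbiased=False) + eps   # diagonal of Sigma_1
    var2 = emb_l2.var(0, unbiased=False) + eps   # diagonal of Sigma_2
    tr_term = (var2 / var1 + var1 / var2).sum()  # tr(Sigma_1^{-1} Sigma_2 + Sigma_2^{-1} Sigma_1)
    diff = mu1 - mu2
    quad_term = (diff * (1.0 / var1 + 1.0 / var2) * diff).sum()
    return tr_term + quad_term - 2 * z           # zero when the two distributions coincide

In practice, emb_l1 and emb_l2 would be obtained by indexing the OutputProj weight matrix with the Mandarin and English token ids, so the gradient of this term flows only into the output token embeddings.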
Cosine distance.
We first compute the centroid vectors C_1 and C_2 by taking the mean of all output token embeddings of the monolingual language pair L_1 and L_2, respectively. The cosine distance between the two centroids is then computed as follows:

L_CD = 1 − (C_1 · C_2) / (‖C_1‖ ‖C_2‖)    (14)

The CD constraint is integrated into the loss function in a similar way as Eq. (13).

5. Experiment

We evaluate our method on the Mandarin-English CS language pair from the SEAME [5] corpus (see Table 1). We used a standard data split, on par with previous works [1, 7], which consists of three sets: train, test_man and test_eng. To match the no-CS-data scenario, where we assume that we only possess monolingual data, we removed all CS utterances from the train set. The test_man and test_eng sets were used for evaluation (available at https://github.com/zengzp0912/SEAME-dev-set). Both evaluation sets are gender balanced, but the matrix language of the speakers is different, i.e. Mandarin for test_man and English for test_eng. The matrix language is the dominant language into which elements from the embedded language are inserted.

Table 1: SEAME dataset statistics after removing the CS utterances from the train set. 'Man' and 'Eng' refer to Mandarin and English, respectively.

We used the ESPnet toolkit [19] to train our baseline E2E-CS-ASR model. The encoder module consists of a VGG network followed by BLSTM layers each with units. The attention-based decoder module consists of a single LSTM layer with units and employs a multi-headed hybrid attention mechanism [20] with heads. The CTC module consists of a single linear layer with units, and its weight λ in Eq. (9) is set to . The network was optimized using Adadelta with gradient clipping. During the decoding stage, the beam size was set to . For reference, the baseline model achieves . and . mixed error rates (MER) on test_man and test_eng, respectively, when trained on the entire SEAME train set including the CS utterances. The term "mixed" refers to the different token units used for English (words) and Mandarin (characters).

The experimental results are shown in Table 2. We split the test sets into monolingual and CS utterances to analyze the impact of the proposed method on each of them. We first report the MER performance of a conventional ASR model built using the Kaldi toolkit [21] (row 1); the model specifications can be found in [7]. The MER performance of the baseline E2E-CS-ASR model is shown in the second row. We followed recent trends [1, 2, 4] to obtain a much stronger baseline model. Specifically, we applied speed perturbation (SP) based data augmentation [22] and used byte pair encoding (BPE) based subword units [23] to balance the Mandarin and English tokens (rows 3 and 4). We tried different vocabulary sizes for BPE and found k units to work best in our case.

Table 2: The MER (%) performance of different ASR models built using monolingual data. The test sets are further split into monolingual (mono) and code-switching (CS) utterances.

No.  Model      test_man                     test_eng
                mono utts.  CS utts.  all    mono utts.  CS utts.  all
1    Kaldi      -           -         39.1   -           -         45.2
2    Baseline   57.7        73.3      70.6   73.7        80.6      78.3
3    + SP       39.4        56.0      53.2   54.2        65.9      62.2
4    + BPE      38.1        51.8      49.5   52.9        61.4      58.9
5    + CD       34.4        49.0      46.3

The performance of the models employing the proposed CD and JSD constraints is shown in rows 5 and 6; the interpolation weights for CD and JSD are set to . and . , respectively. Both constraints gain considerable MER improvements. Notably, we found that the CD constraint is more effective on monolingual utterances, whereas the JSD constraint is more effective on CS utterances. To complement the advantages of both constraints, we combined them as follows:

L_MTL = λ L_CTC + (1 − λ)(α L_ATT + (1 − α)(β L_JSD + (1 − β) L_CD))    (15)

where α and β are set to . and . , respectively.
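Analogously, a small sketch of the cosine-distance constraint of Eq. (14) and of the combined objective of Eq. (15) is given below. The function names and the default weight values lam, alpha, and beta are arbitrary placeholders, not the tuned values used in the experiments.

# Sketch of the cosine-distance constraint (Eq. 14) between the two language
# centroids, and of the combined multi-task objective (Eq. 15).
import torch
import torch.nn.functional as F

def cd_constraint(emb_l1, emb_l2):
    c1, c2 = emb_l1.mean(0), emb_l2.mean(0)            # centroids C_1 and C_2
    return 1.0 - F.cosine_similarity(c1, c2, dim=0)    # 1 - cos(C_1, C_2)

def combined_loss(loss_ctc, loss_att, loss_jsd, loss_cd, lam=0.3, alpha=0.8, beta=0.5):
    constraint = beta * loss_jsd + (1 - beta) * loss_cd
    return lam * loss_ctc + (1 - lam) * (alpha * loss_att + (1 - alpha) * constraint)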
The combination of the two constraints significantly improves the MER over the strong baseline model, by 3.9% and 4.5% absolute on test_man and test_eng, respectively (row 7). These results suggest that the proposed method of constraining the output token embeddings is effective.

5.3.1. Changing the interpolation weight

We repeat the experiment with different interpolation weights for the CD and JSD constraints (hyperparameter α in Eq. (13)) to investigate its effect on the MER performance. Figure 2 shows that the proposed method consistently improves the MER over the strong baseline model with SP and BPE. The best results are achieved for interpolation weights in the range . - . .

Figure 2: The impact of the CD and JSD constraint interpolation weight (α) on the MER (%) performance for the test_eng (red / top) and test_man (blue / bottom) sets.

5.3.2. Visualization of shared output token embedding space

To gain insights into the effects of the proposed method on the shared output embedding space, we visualize it using a dimensionality reduction technique based on principal component analysis (PCA). Figure 3 shows the shared output embedding space without (3a) and with (3b, 3c, 3d) the proposed constraints. Note that the learned output token embeddings of the monolingual languages strongly diverge from each other when the proposed constraints are not employed. The visualization of the shared output embedding space confirms that our method is effective at binding together the output token embeddings of the monolingual languages.

Figure 3: PCA visualization of the shared output token embedding space (English vs. Mandarin tokens) with panels (a) no constraint, (b) CD constraint, (c) JSD constraint, and (d) CD & JSD constraints.

The state-of-the-art results in ASR are usually obtained by employing a language model (LM). To examine whether the proposed constraints are complementary with an LM, we employed an LM during the decoding stage. In this experiment, we tried different LM interpolation weights, changed with a step size of . , and report the best results (see Table 3). The LM was trained on the entire SEAME train set, including the CS utterances, as a single-layer LSTM with units, and was integrated using the shallow fusion technique [24]. The obtained MER improvements show that the proposed constraints and the LM complement each other. Moreover, the proposed method benefits from the LM more than the strong baseline model.

Table 3: The MER (%) performance after applying the language model during the decoding stage.

Model                 Decoder LM    test_man    test_eng
Baseline              No            49.5        58.9
Baseline              Yes           49.0        58.6
Baseline + CD & JSD   No            45.6        54.4
Baseline + CD & JSD   Yes

6. Conclusions

In this work, we proposed a method to train an improved E2E-CS-ASR model using only monolingual data. Specifically, our method constrains the output token embeddings of the monolingual languages to force them to be similar, and hence enables the E2E-CS-ASR to easily switch between languages. We examined Jensen-Shannon divergence and cosine distance based constraints, which are incorporated into the objective function of the E2E-CS-ASR. We evaluated the proposed method on the Mandarin-English CS language pair from the SEAME corpus, where the CS utterances were removed from the train set. The proposed method outperforms the strong baseline model by a large margin, i.e. absolute MER improvements of 3.9% and 4.5% on test_man and test_eng, respectively.
The visualization of the shared output embedding space confirms the effectiveness of the proposed method. In addition, our method is complementary with the language model, where further MER improvement is achieved. Importantly, all these improvements are achieved without using any additional linguistic resources such as a word-aligned parallel corpus or a language identification system. We believe that the proposed method can be easily adapted to other scenarios and benefit other CS language pairs.

For future work, we plan to test the proposed method on scenarios with a larger amount of monolingual data and examine its effectiveness on E2E-CS-ASR models trained using CS data. We also plan to study the effects of the proposed method in a transfer learning approach, where it will be used to pre-train the model with external monolingual data.

7. Acknowledgements

This work is supported by the project of Alibaba-NTU Singapore Joint Research Institute.

8. References

[1] Z. Zeng, Y. Khassanov, V. T. Pham, H. Xu, E. S. Chng, and H. Li, "On the end-to-end solution to Mandarin-English code-switching speech recognition," 2019, in press.
[2] N. Luo, D. Jiang, S. Zhao, C. Gong, W. Zou, and X. Li, "Towards end-to-end code-switching speech recognition," arXiv preprint arXiv:1810.13091, 2018.
[3] K. Li, J. Li, G. Ye, R. Zhao, and Y. Gong, "Towards code-switching ASR for end-to-end CTC models," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2019, pp. 6076–6080.
[4] C. Shan, C. Weng, G. Wang, D. Su, M. Luo, D. Yu, and L. Xie, "Investigating end-to-end speech recognition for Mandarin-English code-switching," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2019, pp. 6056–6060.
[5] D. Lyu, T. P. Tan, E. Chng, and H. Li, "SEAME: a Mandarin-English code-switching speech corpus in South-East Asia," in INTERSPEECH, 2010, pp. 1986–1989.
[6] E. Yilmaz, M. McLaren, H. van den Heuvel, and D. A. van Leeuwen, "Semi-supervised acoustic model training for speech with code-switching," Speech Communication, vol. 105, pp. 12–22, 2018.
[7] P. Guo, H. Xu, L. Xie, and E. S. Chng, "Study of semi-supervised approaches to improving English-Mandarin code-switching speech recognition," in INTERSPEECH, 2018, pp. 1928–1932.
[8] S. Karita, S. Watanabe, T. Iwata, A. Ogawa, and M. Delcroix, "Semi-supervised end-to-end speech recognition," in INTERSPEECH, 2018, pp. 2–6.
[9] J. Drexler and J. Glass, "Combining end-to-end and adversarial training for low-resource speech recognition," in IEEE Spoken Language Technology Workshop, SLT, 2018, pp. 361–368.
[10] D. Lyu, R. Lyu, Y. Chiang, and C. Hsu, "Speech recognition on code-switching among the Chinese dialects," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2006, pp. 1105–1108.
[11] A. Pratapa, G. Bhat, M. Choudhury, S. Sitaram, S. Dandapat, and K. Bali, "Language modeling for code-mixing: The role of linguistic theory based synthetic data," in ACL, 2018, pp. 1543–1553.
[12] C. Chang, S. Chuang, and H. Lee, "Code-switching sentence generation by generative adversarial networks and its application to data augmentation," CoRR, vol. abs/1811.02356, 2018.
[13] G. I. Winata, A. Madotto, C. Wu, and P. Fung, "Learn to code-switch: Data augmentation using copy mechanism on language modeling," CoRR, vol. abs/1810.10254, 2018.
[14] E. Yilmaz, H. van den Heuvel, and D. A. van Leeuwen, "Acoustic and textual data augmentation for improved ASR of code-switching speech," in INTERSPEECH, 2018, pp. 1933–1937.
[15] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, "Advances in joint CTC-Attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM," in INTERSPEECH, 2017, pp. 949–953.
[16] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in ICML, 2006, pp. 369–376.
[17] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2016, pp. 4945–4949.
[18] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.
[19] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," in INTERSPEECH, 2018, pp. 2207–2211.
[20] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, NIPS, 2015, pp. 577–585.
[21] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," IEEE Signal Processing Society, Tech. Rep., 2011.
[22] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," in INTERSPEECH, 2015, pp. 3586–3589.
[23] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," in ACL, 2016.
[24] S. Toshniwal, A. Kannan, C. Chiu, Y. Wu, T. N. Sainath, and K. Livescu, "A comparison of techniques for language model integration in encoder-decoder speech recognition," in IEEE Spoken Language Technology Workshop, SLT, 2018.