End-to-End Code-Switching ASR for Low-Resourced Language Pairs
Xianghu Yue, Grandee Lee, Emre Yılmaz, Fang Deng, Haizhou Li
Beijing Institute of Technology, Beijing, China
National University of Singapore, Singapore
{xianghu.yue, grandee.lee}@u.nus.edu, [email protected], {emre, haizhou.li}@nus.edu.sg

ABSTRACT
Despite the significant progress in end-to-end (E2E) automatic speech recognition (ASR), E2E ASR for low-resourced code-switching (CS) speech has not been well studied. In this work, we describe an E2E ASR pipeline for the recognition of CS speech in which a low-resourced language is mixed with a high-resourced language. Low-resourcedness in acoustic data hinders the performance of E2E ASR systems more severely than it does conventional ASR systems. To mitigate this problem in the transcription of archives with code-switching Frisian-Dutch speech, we integrate a designated decoding scheme and perform rescoring with neural network-based language models to enable better utilization of the available textual resources. We first incorporate a multi-graph decoding approach which creates parallel search spaces for the monolingual and mixed recognition tasks to maximize the utilization of the textual resources from each language. Further, language model rescoring is performed using a recurrent neural network pre-trained with cross-lingual embeddings and further adapted with the limited amount of in-domain CS text. The ASR experiments demonstrate the effectiveness of the described techniques in improving the recognition performance of an E2E CS ASR system in a low-resourced scenario.
Index Terms — Code-switching, end-to-end ASR, language modeling, multi-graph, under-resourced languages
1. INTRODUCTION
As multilingualism is becoming more common in today's globalized world [1], there has been increasing interest in code-switching (CS) automatic speech recognition (ASR) [2]. Code-switching refers to the phenomenon where two languages are spoken in contact within one utterance [3]. Code-switching, such as Mandarin-English [4], Spanish-English [5] and Hindi-English [6], is commonly practiced in multilingual societies.

Traditionally, an ASR system consists of several components, including the acoustic model, pronunciation lexicon and language model, that are separately trained and optimized with different objectives; building such an ASR system therefore requires specialized expertise in the field. Various end-to-end (E2E) ASR approaches are emerging quickly because of their simplicity compared to the traditional ASR architecture. An E2E system predicts phones or characters directly from acoustic information without a predefined alignment. Some notable architectures include connectionist temporal classification (CTC) [7], attention-based encoder-decoder networks [8, 9], and recurrent neural network (RNN) transducers [10]. More recently, hybrid E2E systems have been successfully implemented and applied to common ASR benchmarks [11]. These E2E models have been successfully used in monolingual and multilingual ASR systems, achieving promising results on various benchmarks [12-16].

E2E ASR approaches enable lexicon-free recognition, which is a key advantage over traditional hybrid hidden Markov model/deep neural network (HMM/DNN) approaches in low-resourced settings, since many low-resourced languages lack an available pronunciation lexicon. However, there is very limited work on recognizing CS speech using E2E techniques, especially for low-resourced language pairs. This is mainly due to the fact that low-resourcedness in acoustic data hinders the performance of E2E CS ASR more severely than that of conventional ASR systems. Seki et al. [17] built an encoder-decoder based E2E ASR system that can recognize mixed-language speech. However, that work relies on training data generated from monolingual datasets rather than natural code-switching speech. Kim et al. [18] and Toshniwal et al. [19] both used encoder-decoder models to build multilingual E2E ASR, but their systems cannot deal with the CS scenario. Li et al. [20] incorporated a frame-level language identification (LID) model to linearly adjust the posteriors of an E2E CTC model for the high-resourced Mandarin-English language pair.

In this paper, we integrate a designated decoding scheme and a code-switch language model (LM) rescoring scheme to mitigate this problem in our recognition scenario, namely the transcription of archives with CS Frisian-Dutch speech, in which Frisian is a low-resourced language and Dutch is a high-resourced language. The code-switch LM [21] is a recurrent neural network (RNN) that is trained with cross-lingual embeddings and adapted to maximize the use of the available textual resources. The decoding scheme provides a new multi-graph back-end for E2E CS ASR in which parallel search spaces are employed for the monolingual and mixed recognition subtasks. The code-switch RNN LM can preserve the cross-lingual correspondence derived from larger monolingual textual resources while letting the low-resourced language leverage the high-resourced language at the same time.

The rest of this paper is organized as follows. Section 2 introduces the E2E CTC acoustic model. The incorporated multi-graph decoding strategy and CS RNN LM rescoring are described in Sections 3 and 4 respectively. We describe the experimental setup in Section 5 and then present and discuss the results provided by the described E2E ASR pipeline in Section 6.
2. END-TO-END CTC ACOUSTIC MODEL
Unlike the traditional hybrid HMM-DNN system, an E2E CTC acoustic model is not trained using frame-level labels with respect to the cross-entropy (CE) criterion. Instead, a CTC model learns the alignments between speech frames and their label sequences, i.e., phone sequences, automatically by adopting the CTC objective. It predicts the conditional probability of the label sequence by summing over the joint probabilities of the corresponding set of CTC symbol sequences. The CTC framework makes the output-independence assumption that CTC symbols are conditionally independent at each frame, which may be more desirable for dealing with CS speech (though less accurate in general), as the current output does not explicitly depend on previous outputs [20]. The conditional probability of the whole label sequence is

$$P(\mathbf{z}|\mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{z})} P(\pi|\mathbf{x}) = \sum_{\pi \in Z'^T,\, \mathcal{B}(\pi^T)=\mathbf{z}} \prod_{t=1}^{T} y^t_{\pi_t} \quad (1)$$

where z = (z_1, ..., z_u, ..., z_U) denotes a phone label sequence containing U phones, z_u ∈ Z and Z is the phone set; x = (x_1, ..., x_t, ..., x_T) denotes a sequence of T speech frames, with t being the frame index. The length of z is constrained to be no greater than the length of the utterance, i.e., U ≤ T. π^T = (π_1, ..., π_t, ..., π_T) is an output symbol sequence at the frame level, called a CTC path. Each output symbol π_t ∈ Z', where Z' = Z ∪ {blank} and blank is a special label in the CTC framework that allows frames and labels to be mapped to the same length. B is a many-to-one mapping that first removes the repeated labels and then all blank symbols from a path. y^t_{π_t} is the posterior probability of output symbol π_t at time t. Equation (1) can be efficiently evaluated and differentiated using the forward-backward algorithm [22]. Given the training utterances, the acoustic model network is trained to minimize the CTC objective function

$$\mathcal{L} = -\sum_{k=1}^{Q} \ln P(\mathbf{z}^k|\mathbf{x}^k) \quad (2)$$

where k is the index of a training utterance and Q is the total number of training utterances.
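To make the objective concrete, the following minimal sketch (not the paper's code; shapes, the blank index and all hyperparameters are illustrative assumptions) computes the loss of Eq. (2) with PyTorch's built-in CTC implementation, which evaluates Eq. (1) via the forward-backward recursion of [22]:

```python
import torch
import torch.nn as nn

T, B, V = 100, 4, 41   # frames, batch size, |Z'| with blank at index 0
U = 20                 # target length, satisfying U <= T

# Stand-in for the per-frame posteriors y^t over Z' emitted by the BLSTM.
log_probs = torch.randn(T, B, V, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, V, (B, U), dtype=torch.long)    # phone labels z
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), U, dtype=torch.long)

# reduction="sum" matches Eq. (2): summed -ln P(z^k|x^k) over utterances k.
ctc_loss = nn.CTCLoss(blank=0, reduction="sum")
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()        # gradients via the forward-backward algorithm
```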
3. MULTI-GRAPH DECODING STRATEGY
In modern ASR architectures, weighted finite-state transducers (WFST) are used to integrate different knowledge sources and to optimize the search space, achieving the best search efficiency using highly-optimized FST libraries such as OpenFST [23, 24]. In the E2E CTC ASR framework [25], the individual components, namely CTC labels, lexicons and N-gram language models, are encoded into three individual WFSTs and then composed into a comprehensive search graph that encodes the mapping from a CTC symbol sequence emitted for the speech frames to a sequence of words. The search space is represented as T ∘ L ∘ G in the Eesen toolkit [25], where T is a token WFST that maps a sequence of frame-level symbols to a single lexicon unit, L is a lexicon WFST that encodes the mapping from sequences of lexicon units to words, and G is a grammar WFST that encodes the word sequence information of the N-gram language model. Using this WFST-based decoding framework, we can efficiently incorporate different word-level language models, making full use of the available textual resources and overcoming the imbalance in acoustic data between the low-resourced and high-resourced language in our CS scenario.

In our previous work [26], a multi-graph decoding strategy was proposed that creates parallel search spaces for the monolingual and bilingual recognition tasks in a conventional CS ASR system. This strategy can easily be extended to the E2E CTC ASR system to address the above-mentioned data imbalance problem. For multi-graph decoding, we use the union operation to create a larger graph with parallel bilingual and monolingual (Frisian and Dutch) subgraphs, as sketched below. The parallel graphs used during decoding are characterized only by the incorporated language model component, as they share the same token (T) and lexicon (L) components. This approach, which makes effective use of the text resources of the high-resourced language by creating three different search spaces with an identical acoustic model (AM), has been shown to outperform standard LM interpolation [26]. Monolingual and code-mixed utterances are decoded using the best-matching subgraph, yielding improved monolingual recognition performance on the high-resourced language without any accuracy loss on the code-mixed utterances.
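A hedged sketch of this construction using OpenFST's Python wrapper pywrapfst is given below; the file names are assumptions, and the paper builds its graphs with the Eesen recipes rather than this exact code:

```python
import pywrapfst as fst

T = fst.Fst.read("T.fst")  # token FST: frame-level CTC symbols -> units
L = fst.Fst.read("L.fst")  # lexicon FST: lexicon units -> words

def build_search_graph(g_path):
    """Compose T o L o G for one grammar (language model) WFST."""
    G = fst.Fst.read(g_path)
    G.arcsort(sort_type="ilabel")            # required before composition
    LG = fst.determinize(fst.compose(L, G))
    LG.minimize()
    LG.arcsort(sort_type="ilabel")
    return fst.compose(T, LG)

# The three parallel subgraphs share T and L; only G differs.
cs_graph = build_search_graph("G_cs.fst")    # bilingual CS 3-gram
fy_graph = build_search_graph("G_fy.fst")    # monolingual Frisian 3-gram
nl_graph = build_search_graph("G_nl++.fst")  # larger monolingual Dutch 3-gram

# The union yields a single decoding graph with parallel search spaces.
multi_graph = cs_graph.copy()
multi_graph.union(fy_graph)
multi_graph.union(nl_graph)
multi_graph.write("TLG_union.fst")
```

In practice, the subgraph outputs carry graph identification tags so that the rescoring pass (Section 5) knows which subgraph produced each hypothesis.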
4. CS RNN LANGUAGE MODELING
In language modeling, we face data sparsity both in terms of the availability of CS corpora and the scarcity of CS occurrences within the corpus. To address these problems, we propose a two-step approach to language modeling. Firstly, in terms of data augmentation, we boost the size of the CS corpus by synthetically generating CS text using a well-trained long short-term memory (LSTM) language model. Similar techniques have been proposed in [27, 28]. However, in [27] a sentence-level aligned parallel corpus is available, so synthetic CS data can be generated based on word or phrase alignments between the parallel sentences, guided by linguistic rules. Unlike [27], we lack a parallel corpus and thus cannot explicitly establish the word-level cross-lingual correspondence between the two languages. This motivates the second step of our language model, i.e., finding a cross-lingual mapping of the monolingual word embeddings using the unsupervised self-learning method proposed in [29]. The method finds the mapping functions W_M, W_N that maximize the cosine similarity between the monolingual embeddings of source language M and target language N, based on an iteratively learned dictionary D:

$$\arg\max_{W_M, W_N} \sum_{(i,j) \in D} (M_i W_M) \cdot (N_j W_N) \quad (3)$$

where i, j are paired entries in the dictionary that represent a translation pair and M_i, N_j are the respective monolingual embeddings. Since the transformation matrices and embeddings are length-normalized, the cosine similarity is optimized. Thus, the method explicitly aligns the words based on their monolingual distributional properties and projects both monolingual embeddings into the same embedding space. The resulting word embeddings of related words in both languages are grouped together while, at the same time, monolingual syntactic information is preserved [21].

$$y_k = \mathrm{LSTM}(w_k) \quad (4)$$

$$p_k = \frac{e^{y_k}}{\sum_{j=1}^{V} e^{y_j}} \quad (5)$$

$$\mathrm{Loss} = -\sum_{k=1}^{Q-1} Y_{k+1} \ln(p_k) \quad (6)$$

This pre-trained cross-lingual embedding is used to initialize our neural language model, and the embedding layer is kept fixed during training. The output y_k of the LSTM given the current word embedding w_k in Eq. (4) is passed through the softmax function of Eq. (5) to form a distribution p_k over the total vocabulary V, which represents the next-word probability. The loss function is the cross-entropy between the true target Y_{k+1} and p_k in Eq. (6), where Q is the number of words in the corpus. By freezing the embedding layer, we aim to preserve the cross-lingual correspondence derived from the larger monolingual corpora and let the low-resourced language leverage the resource-rich language.
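Under the orthogonality constraint commonly used in [29], each mapping step of Eq. (3) has a closed-form solution via SVD. A minimal sketch, with the dictionary D and all dimensions assumed for illustration (the full method of [29] iterates this step while re-inducing D):

```python
import numpy as np

def map_embeddings(M, N, D):
    """One mapping step of Eq. (3): M, N are length-normalized monolingual
    embedding matrices; D is a list of (i, j) translation-pair indices."""
    M_d = M[[i for i, _ in D]]           # source entries of the pairs
    N_d = N[[j for _, j in D]]           # target entries of the pairs
    U, _, Vt = np.linalg.svd(M_d.T @ N_d)
    return U, Vt.T                       # W_M = U, W_N = V

# Toy usage with random length-normalized embeddings and a 3-pair dictionary.
rng = np.random.default_rng(0)
M = rng.standard_normal((100, 50)); M /= np.linalg.norm(M, axis=1, keepdims=True)
N = rng.standard_normal((100, 50)); N /= np.linalg.norm(N, axis=1, keepdims=True)
W_M, W_N = map_embeddings(M, N, [(0, 3), (5, 5), (9, 1)])
```

The aligned embeddings M W_M and N W_N then initialize the frozen embedding layer of the LSTM LM of Eqs. (4)-(6). A PyTorch sketch with assumed vocabulary size and dimensions:

```python
import torch
import torch.nn as nn

class CSLanguageModel(nn.Module):
    def __init__(self, crosslingual_emb: torch.Tensor, hidden: int = 400):
        super().__init__()
        vocab, dim = crosslingual_emb.shape
        # freeze=True keeps the cross-lingual correspondence intact (Sec. 4)
        self.emb = nn.Embedding.from_pretrained(crosslingual_emb, freeze=True)
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, words):                # words: (batch, seq)
        h, _ = self.lstm(self.emb(words))    # Eq. (4)
        return self.out(h)                   # logits; softmax gives Eq. (5)

emb = torch.randn(5000, 300)                 # stand-in for aligned embeddings
model = CSLanguageModel(emb)
words = torch.randint(0, 5000, (8, 32))      # a batch of word-id sequences
logits = model(words[:, :-1])
# Cross-entropy against the next word implements the loss of Eq. (6).
loss = nn.functional.cross_entropy(logits.reshape(-1, 5000),
                                   words[:, 1:].reshape(-1))
loss.backward()
```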
5. EXPERIMENTAL SETUP

5.1. Datasets
The experiments are conducted on the low-resourced Frisian-Dutch CS corpus from the FAME! project, which aims to develop a spoken document retrieval system for the disclosure of the archives of Omrop Fryslân (Frisian Broadcast), covering a large time span and a wide variety of topics and containing monolingual Dutch and Frisian speech as well as code-mixed Frisian-Dutch speech. Further details can be found in [30]. It is worth mentioning that the proposed approaches can also be applied to other low-resourced language pairs and to scenarios with more than two languages as in [31].

Table 1. Acoustic data composition used for CTC AM training (in hours)

Training data        Annot.   Frisian   Dutch   Total
(1) FAME             Manual     8.5      3.0     11.5
(2) Frisian Broad.   Auto.    125.5       -     125.5
(3) CGN-NL           Manual      -      442.5   442.5

The training data used in the experiments are summarized in Table 1. Both monolingual and CS data are used for acoustic model training, since monolingual acoustic data augmentation has been shown to improve CS ASR on both monolingual and code-mixed test utterances [32]. The manually annotated CS data is from the FAME corpus, containing 8.5 hours and 3 hours of orthographically transcribed speech from Frisian (fy) and Dutch (nl) speakers respectively. The Frisian Broadcast data contains 125.5 hours of automatically transcribed speech extracted from the target broadcast archive. The monolingual Dutch data comprises the 442.5-hour Dutch component of the Spoken Dutch Corpus (CGN) [33], which contains diverse speech materials including conversations, interviews, lectures, debates, read speech and broadcast news. The development and test sets each consist of 1 hour of speech from Frisian speakers and 20 minutes of speech from Dutch speakers. The sampling frequency of all speech data is 16 kHz.
A bilingual text corpus (107M words), consisting of generated CS text (61M words), monolingual Frisian text (37M words) and monolingual Dutch text (9M words), is used for training the baseline CS LM. The transcripts of the FAME training data are the only source of CS text, containing 140k words, and the textual data augmentation techniques described in [32] have been applied to increase the amount of CS text. The Frisian text is extracted from monolingual resources such as Frisian novels, news and Wikipedia articles. The Dutch text is extracted from the transcripts of the CGN speech corpus. We use the larger monolingual subset (300M words) of the NLCOW text corpus (http://corporafromtheweb.org) together with the Dutch text (9M words) used in the baseline CS LM to train a larger Dutch LM and create a larger monolingual Dutch graph.

5.2. Implementation details

All the recognition experiments are performed with the Eesen E2E CTC ASR toolkit [25]. 3-fold data augmentation [34] is applied to the in-domain acoustic training data, i.e., (1) and (2) in Table 1. The acoustic model is a 6-layer bidirectional LSTM with 640 hidden units trained without a predefined alignment. The 40-dimensional filterbank features with their first and second-order derivatives are stacked using 3 contiguous frames to form 360-dimensional spliced features as inputs. The features are normalized via mean subtraction and variance normalization on a per-speaker basis. The learning rate starts at 0.00004 and remains unchanged until the drop of the label error rate on the validation set between two consecutive epochs falls below 0.5%; from then on, the learning rate is halved after each subsequent epoch. The conventional ASR system is trained using the Kaldi ASR toolkit [35]. A context-dependent Gaussian mixture model-hidden Markov model (GMM-HMM) system is first trained on MFCC features, including deltas and delta-deltas, to obtain the alignments. These alignments are then used for training a TDNN-LSTM acoustic model (1 standard, 6 time-delay and 3 LSTM layers) with the LF-MMI [36] criterion, using 40-dimensional MFCCs combined with i-vectors for speaker adaptation.

The language models used in the first-pass ASR decoding are standard bilingual 3-grams with interpolated Kneser-Ney smoothing. The baseline RNN LM with gated recurrent units (GRU) has 400 hidden units and is trained using noise contrastive estimation for lattice rescoring. The CS RNN LM with the same architecture is adapted to the CS transcripts to reduce the mismatch. The adaptation is performed over the last 5 epochs while following the overall learning rate decay of 0.8. In summary, we have 7 LMs: (1) the baseline CS LM (cs) trained on the bilingual text (107M), (2) the baseline monolingual Frisian LM (fy) trained on monolingual Frisian text (37M), (3) the baseline monolingual Dutch LM (nl) trained on monolingual Dutch text (9M), (4) a larger monolingual Dutch LM (nl++) trained on 309M words, (5) an interpolated LM (interp-nl++) combining the cs LM and the nl++ LM with the interpolation weight that yields the lowest perplexity on the development set, (6) the baseline RNN LM trained on the corresponding bilingual text (107M) using a 1-layer LSTM with 400 hidden units, and (7) the CS RNN LM trained using similar parameters. The RNN LM weight for rescoring is 0.75. The first five LMs are used in conventional single-graph E2E ASR systems for comparison with the corresponding multi-graph decoding systems using the same amount of monolingual and bilingual text.
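As a concrete reading of the learning-rate schedule described above, the following sketch (our illustration, not Eesen's code) holds the rate at 4e-5 until the epoch-to-epoch drop in validation label error rate falls below 0.5% absolute, then halves it every epoch thereafter:

```python
def next_learning_rate(lr, prev_err, cur_err, halving, threshold=0.5):
    """Return the learning rate for the next epoch and the halving flag."""
    if halving or (prev_err - cur_err) < threshold:
        return lr / 2.0, True   # once halving starts, it continues
    return lr, False            # improvement still large enough: keep lr

# Example: the improvement drops to 0.4% at epoch 3, so halving starts there.
lr, halving = 4e-5, False
for prev_err, cur_err in [(20.0, 18.0), (18.0, 17.0), (17.0, 16.6), (16.6, 16.3)]:
    lr, halving = next_learning_rate(lr, prev_err, cur_err, halving)
    print(f"next epoch lr = {lr:.1e}")
```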
The perplexities of the baseline CS and Dutch LMs on the monolingual Dutch component of the development and test sets are shown in Table 2. The perplexities of the two RNN LMs on the development and test sets, shown in Table 3, indicate that the CS RNN LM has a lower perplexity than its baseline.

https://github.com/yandex/faster-rnnlm

Table 2. Perplexities obtained on the Dutch component of the development and test set using different LMs
Table 3. Perplexities obtained on the different components of the development and test transcripts using different LMs

                   Dev.               Test
                 fy    nl    cs     fy    nl    cs
3-gram LM       158   191   272    138   189   227
Base. RNN LM    205   187   330    177   177   283
CS RNN LM       183   164   296    159   156   257
Table 4. WER (%) obtained on the monolingual utterances in the development and test set of the FAME Corpus

Four sets of ASR experiments are conducted to evaluate the performance of the proposed method. Firstly, the ASR performance of the baseline single-graph ASR systems using the cs and interp-nl++ LMs is presented. Secondly, the results provided by the bi-graph systems using the cs graph together with one of the monolingual graphs, namely fy, nl and nl++, are presented. Thirdly, tri-graph decoding systems with varying monolingual graphs are evaluated.

After finalizing the multi-graph decoding experiments, we present the RNN LM rescoring experiment performed to evaluate the performance of the CS RNN LM on CS speech compared to a baseline RNN LM. For the rescoring of the multi-graph systems, graph identification tags are used to identify the graph used for the hypothesized ASR output, and the rescoring is then performed with the corresponding RNN LM, as sketched after Table 5. The CS RNN LMs are trained on the same text data as the N-grams used in decoding. The monolingual Frisian and Dutch RNN LMs are trained on the Frisian text corpora (fy, 37M) and the largest Dutch text corpora (nl++, 309M) respectively, using the same parameters as the baseline and CS RNN LMs. The recognition results are reported separately for Frisian only (fy), Dutch only (nl) and code-mixed (fy-nl) utterances. The overall performance is also reported as a general performance indicator. The recognition performance of the ASR systems is quantified using the word error rate (WER).

Table 5. WER (%) obtained on the development and test set of the FAME Corpus

                                              Dev.                      Test
System            Graph(s)       Rescoring   fy    nl    fy-nl  all    fy    nl    fy-nl  all    Total

Single-graph systems:
Base. E2E CS ASR  cs             No          32.9  33.7  42.6   34.9   30.6  29.0  42.4   31.8   33.4
Base. E2E CS ASR  cs             Yes         31.6  32.8  42.1   33.9   29.6
interp-nl++       cs-nl++        No          32.6  32.3  42.3   34.3   30.7  28.7  42.6   31.8   33.1
interp-nl++       cs-nl++        Yes         31.3  32.5  41.5   33.4   29.9  28.2  41.0   31.0   32.2

Multi-graph systems:
union-fy          cs, fy         No          32.7  33.0  42.3   34.5   30.7  28.6  42.7   31.9   33.2
union-nl          cs, nl         No          32.7  32.5  42.8   34.4   30.6  28.0  42.6   31.6   33.0
union-nl++        cs, nl++       No          32.8  30.1  42.5   33.8   30.6  26.7  42.5   31.4   32.6
union-nl++        cs, nl++       Yes         31.9
union-fy-nl       cs, fy, nl     No          32.9  32.4  42.9   34.6   30.8  28.1  42.8   31.8   33.2
union-fy-nl++     cs, fy, nl++   No          32.9  30.1  42.8   33.9   30.8  25.6  43.1   31.3   32.5
union-fy-nl++     cs, fy, nl++   Yes         32.3  28.2  41.7   32.8   30.2  23.1  41.3   30.2   31.6
union-fy-nl++     cs, fy, nl++   CS-RNN      32.3
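The tag-based rescoring routing can be pictured with the following sketch; the scorers are stand-ins, and the linear interpolation of the RNN LM score with the first-pass n-gram score under the 0.75 weight is our assumption about how that weight is applied:

```python
from typing import Callable, Dict

def make_rescorer(rnnlms: Dict[str, Callable[[str], float]],
                  rnn_weight: float = 0.75):
    """Route each hypothesis to the RNN LM matching its graph tag."""
    def rescore(hyp: str, graph_tag: str, ngram_logprob: float) -> float:
        rnn_logprob = rnnlms[graph_tag](hyp)
        return rnn_weight * rnn_logprob + (1.0 - rnn_weight) * ngram_logprob
    return rescore

# Stand-in scorers; a real system would query the trained RNN LMs.
rescore = make_rescorer({"cs": lambda h: -42.0,
                         "fy": lambda h: -40.5,
                         "nl++": lambda h: -38.1})
print(rescore("dit is in foarbyld", "fy", -45.2))
```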
6. RESULTS AND DISCUSSION
The recognition results obtained by using only the monolingual graphs on the corresponding monolingual utterances are presented in Table 4. The ASR system using only the Frisian (fy) graph gives similar recognition performance to the baseline CS system on monolingual Frisian utterances, which indicates that the CS system can recognize monolingual Frisian speech as well as a monolingual Frisian ASR system. For monolingual Dutch utterances, the performance using only the Dutch (nl) graph is slightly better than the baseline CS system on the test set, with a WER of 27.9% compared to 29.0%. Using the largest monolingual Dutch graph nl++ yields a WER of 25.9% on the Dutch utterances, revealing that the performance of the baseline CS graph can be improved by using a larger monolingual Dutch graph in a multi-graph decoding framework.

The ASR results obtained using the multi-graph decoding strategy and CS RNN LM rescoring are presented in Table 5. The number of Frisian and Dutch words in each component of the development and test sets is given in the upper panel. The two baseline results using single-graph systems (cs and interp-nl++) are shown in the middle panel. The results provided by an equivalent Kaldi [35] ASR system with a conventional architecture are also given as a reference. Compared to the baseline E2E CS ASR system, using the interpolated larger Dutch LM brings only marginal improvements from 33.7% (29.0%) to 32.3% (28.7%) on the development (test) set. This indicates that using an interpolated larger LM in a single graph is ineffective in improving the accuracy on monolingual utterances.

Finally, the ASR results provided by the multi-graph E2E ASR systems are presented in the bottom panel. According to these results, using an additional monolingual Frisian graph during multi-graph decoding (union-fy and union-fy-nl) does not improve the ASR performance on the fy utterances, which is consistent with the previous results reported in [26]. Including the largest monolingual Dutch graph in the union-fy-nl++ system improves the ASR accuracy on nl utterances to a WER of 30.1% (25.6%), yielding a 10.7% (11.7%) relative WER reduction.

For RNN LM rescoring, the CS RNN LM provides an absolute overall 0.7% WER reduction, from 32.3% to 31.6%, over the baseline RNN LM in the single-graph systems, and a 1.2% (0.8%) WER reduction on fy-nl utterances for the union-fy-nl++ system, perhaps because the CS RNN LM preserves more cross-lingual information. The Dutch RNN LM (trained on the 309M-word Dutch text corpora) provides the best WER of 28.2% (23.1%) on monolingual Dutch utterances, while the Frisian RNN LM (trained on 37M words of Frisian text) and the baseline RNN LM (trained on 107M words of bilingual text) give limited improvements on the corresponding subsets. Finally, the WER of the E2E CTC ASR system is significantly reduced to 31.4%.
7. CONCLUSION
In this paper, we propose an E2E CTC ASR pipeline for a CS scenario in which a low-resourced language is mixed with a high-resourced language. We first incorporate a multi-graph decoding strategy, creating parallel search spaces for the monolingual and code-switching recognition tasks. Moreover, we perform language model rescoring using a recurrent neural network pre-trained with cross-lingual embeddings and then adapted with the limited amount of in-domain code-switching text. To evaluate the effectiveness of the proposed pipeline, ASR experiments are conducted on Frisian-Dutch CS speech, in which the target Frisian language is low-resourced with limited acoustic and textual resources, while Dutch is high-resourced. The experimental results demonstrate that the multi-graph decoding approach improves the monolingual Dutch recognition performance of an E2E CS ASR system without degradation in the CS performance. The adapted recurrent neural network language model further improves the performance on CS speech. Finally, the proposed pipeline gives a 16.3% (20.3%) relative WER reduction on monolingual Dutch speech and an absolute 2.1% (1.9%) WER reduction on code-switching speech.
8. ACKNOWLEDGEMENTS
This research is supported by the National Research Foundation Singapore under its AI Singapore Programme (Award Number: AISG-100E-2018-006). This research is also supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (Project
9. REFERENCES

[1] Colin Baker, "Foundations of bilingual education and bilingualism," Multilingual Matters, vol. 79, 2011.

[2] Sunayana Sitaram, Khyathi Raghavi Chandu, Sai Krishna Rallabandi, and Alan W Black, "A survey of code-switched speech and language processing," arXiv preprint arXiv:1904.00784, 2019.

[3] Peter Auer, "Code-switching in conversation: Language, interaction and identity," Journal of Linguistics, vol. 37, pp. 627-649, 2001.

[4] Dau-Cheng Lyu, Tien-Ping Tan, Eng-Siong Chng, and Haizhou Li, "SEAME: A Mandarin-English code-switching speech corpus in South-East Asia," in Eleventh Annual Conference of the International Speech Communication Association (INTERSPEECH). ISCA, 2010, pp. 1986-1989.

[5] Alfredo Ardila, "Spanglish: An anglicized Spanish dialect," Hispanic Journal of Behavioral Sciences, vol. 27, pp. 60-81, 2005.

[6] Anik Dey and Pascale Fung, "A Hindi-English code-switching corpus," in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC). ELRA, 2014, pp. 2410-2413.

[7] Alex Graves, "Sequence transduction with recurrent neural networks," arXiv preprint arXiv:1211.3711, 2012.

[8] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS), 2015, pp. 577-585.

[9] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 2014, pp. 1724-1734.

[10] Alex Graves and Navdeep Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proceedings of the 31st International Conference on Machine Learning (ICML), 2014, pp. 1764-1772.

[11] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240-1253, Dec. 2017.

[12] Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Ci Li, Hairong Liu, Sanjeev Satheesh, David Seetapun, Anuroop Sriram, and Zhenyao Zhu, "Exploring neural transducers for end-to-end speech recognition," in Automatic Speech Recognition and Understanding (ASRU), 2017 IEEE Workshop on. IEEE, 2017, pp. 206-213.

[13] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4960-4964.

[14] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, "Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition," in Eighteenth Annual Conference of the International Speech Communication Association (INTERSPEECH). ISCA, 2017, pp. 3707-3711.

[15] Jinyu Li, Guoli Ye, Amit Das, Rui Zhao, and Yifan Gong, "Advancing acoustic-to-word CTC model," in Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. IEEE, 2018, pp. 5794-5798.

[16] Changhao Shan, Chao Weng, Guangsen Wang, Dan Su, Min Luo, Dong Yu, and Lei Xie, "Investigating end-to-end speech recognition for Mandarin-English code-switching," in Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on. IEEE, 2019, pp. 6056-6060.

[17] Hiroshi Seki, Shinji Watanabe, Takaaki Hori, Jonathan Le Roux, and John R. Hershey, "An end-to-end language-tracking speech recognizer for mixed-language speech," in Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. IEEE, 2018, pp. 4919-4923.

[18] Suyoun Kim and Michael L. Seltzer, "Towards language-universal end-to-end speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. IEEE, 2018, pp. 4914-4918.

[19] Shubham Toshniwal, Tara N. Sainath, Ron J. Weiss, Bo Li, Pedro Moreno, Eugene Weinstein, and Kanishka Rao, "Multilingual speech recognition with a single end-to-end model," in Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. IEEE, 2018, pp. 4904-4908.

[20] Ke Li, Jinyu Li, Guoli Ye, Rui Zhao, and Yifan Gong, "Towards code-switching ASR for end-to-end CTC models," in Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on. IEEE, 2019, pp. 6076-6080.

[21] Grandee Lee and Haizhou Li, "Word and class common space embedding for code-switch language modelling," in Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on. IEEE, 2019, pp. 6086-6090.

[22] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006, pp. 369-376.

[23] Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley, "Speech recognition with weighted finite-state transducers," Computer Speech & Language, vol. 16, pp. 69-88, 2002.

[24] Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri, "OpenFst: A general and efficient weighted finite-state transducer library," in Implementation and Application of Automata. Springer, 2007, pp. 11-23.

[25] Yajie Miao, Mohammad Gowayyed, and Florian Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 167-174.

[26] Emre Yılmaz, Samuel Cohen, Xianghu Yue, David van Leeuwen, and Haizhou Li, "Multi-graph decoding for code-switching ASR," in Twentieth Annual Conference of the International Speech Communication Association (INTERSPEECH). ISCA, 2019, pp. 3750-3754.

[27] Grandee Lee, Xianghu Yue, and Haizhou Li, "Linguistically motivated parallel data augmentation for code-switch language modeling," in Twentieth Annual Conference of the International Speech Communication Association (INTERSPEECH). ISCA, 2019, pp. 3730-3734.

[28] Emre Yılmaz, Henk van den Heuvel, and David van Leeuwen, "Code-switching detection with data-augmented acoustic and language models," in Sixth International Workshop on Spoken Language Technology for Under-resourced Languages (SLTU). Procedia Computer Science, 2018, pp. 127-131.

[29] Mikel Artetxe, Gorka Labaka, and Eneko Agirre, "A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. ACL, 2018, pp. 789-798.

[30] Emre Yılmaz, Maaike Andringa, Sigrid Kingma, Jelske Dijkstra, Frits Van der Kuip, Hans Van de Velde, Frederik Kampstra, Jouke Algra, Henk van den Heuvel, and David A. van Leeuwen, "A longitudinal bilingual Frisian-Dutch radio broadcast database designed for code-switching research," in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC). ELRA, 2016, pp. 4666-4669.

[31] Emre Yılmaz, Astik Biswas, Febe de Wet, Ewald van der Westhuizen, and Thomas Niesler, "Building a unified code-switching ASR system for South African languages," in Nineteenth Annual Conference of the International Speech Communication Association (INTERSPEECH). ISCA, 2018, pp. 1923-1927.

[32] Emre Yılmaz, Henk van den Heuvel, and David van Leeuwen, "Acoustic and textual data augmentation for improved ASR of code-switching speech," in Nineteenth Annual Conference of the International Speech Communication Association (INTERSPEECH). ISCA, 2018, pp. 1933-1937.

[33] Nelleke Oostdijk, "The spoken Dutch corpus: Overview and first evaluation," in Proceedings of the Second International Conference on Language Resources and Evaluation (LREC). ELRA, 2000, pp. 886-894.

[34] Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, "Audio augmentation for speech recognition," in Sixteenth Annual Conference of the International Speech Communication Association (INTERSPEECH). ISCA, 2015, pp. 3586-3589.

[35] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Nagendra Goel, Mirko Hannemann, Yanmin Qian, Petr Schwarz, and Georg Stemmer, "The Kaldi speech recognition toolkit," in Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on. IEEE, 2011, pp. 1-4.

[36] Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Seventeenth Annual Conference of the International Speech Communication Association (INTERSPEECH). ISCA, 2016, pp. 2751-2755.