Code-switching Sentence Generation by Generative Adversarial Networks and its Application to Data Augmentation
Ching-Ting Chang, Shun-Po Chuang, Hung-Yi Lee
Graduate Institute of Communication Engineering, National Taiwan University
[email protected], [email protected], [email protected]
Abstract
Code-switching is about dealing with alternative languages in speech or text. It is partially speaker-dependent and domain-related, so completely explaining the phenomenon by linguistic rules is challenging. Compared to most monolingual tasks, insufficient data is an issue for code-switching. To mitigate the issue without expensive human annotation, we propose an unsupervised method for code-switching data augmentation. By utilizing a generative adversarial network, we can generate intra-sentential code-switching sentences from monolingual sentences. We applied the proposed method on two corpora, and the results show that the generated code-switching sentences improve the performance of code-switching language models.
Index Terms: code-switching, generative adversarial networks, data augmentation, language model
1. Introduction
Code-switching (CS) is the practice of using two or more languages within a document or a sentence. It is widely observed in multicultural areas, or in countries whose official language differs from the native language. For example, besides their main language, Mandarin, Taiwanese speakers tend to mix English and Taiwanese Hokkien into their text and speech. Handling CS is crucial to building a general ASR system that can process both monolingual and CS speech [1, 2, 3]. In this paper, we focus on improving the language models for ASR of intra-sentential CS speech. Specifically, we only deal with words and phrases that are code-switched within a sentence.

Computational processing of CS is fundamentally challenging due to the lack of data. Applying linguistic knowledge is one solution [4, 5]. The Equivalence Constraint and the Functional Head Constraint have been used to build better CS language models [6, 7, 8], and CS models with syntactic and semantic features have been built to exploit more information [9, 10]. Because a large amount of monolingual data is available, monolingual language models for the host and guest languages can be learned separately and then combined with a probabilistic model that switches between the two [11].

Because CS is mostly used in spoken language, the most practical way of generating data is to label CS speech. However, manual transcription requires plenty of skilled labor and hours of tedious work. An alternative is to generate CS data from existing monolingual text. Unfortunately, there are no flawless rules for predicting code-switching points within a sentence, since each person tends to code-switch in a different manner. In recent years, researchers have tried to synthesize more code-switching text with models learned from data [12, 13, 14]. Generative models have been used to generate CS sentences [13], but previous work generates the sentences from scratch. Here the generator instead learns to modify monolingual sentences into CS sentences; in this way, the generator can leverage the information in the monolingual sentences.

We propose a novel CS text generation method that uses generative adversarial networks (GAN) [15] with reinforcement learning (RL) [16] to generate CS data from monolingual sentences automatically. With CS data augmented by our method, the problem of sparse training data can be alleviated. Our proposed method has the following benefits:

• We do not use any labeled data to train the generator.
• The model learns CS rules for data generation implicitly with the help of the discriminator, instead of relying on hand-crafted rules.
• We conduct experiments on two Mandarin-English code-switching corpora, LectureSS and SEAME, which have very different characteristics, to show that the proposed approach generalizes well in different cases.

The experimental results show that GAN can generate reasonable code-switching sentences, and that the generated code-switching sentences can be used to improve language modeling.
2. Methodology
The main issue in training a code-switching model is the lack of adequate code-switching training sentences, because code-switching mostly occurs in speech or personal messages rather than in written resources. We believe that generating code-switching sentences from monolingual data can mitigate this issue, since monolingual text is much easier to obtain than code-switching text. In the following examples and discussion, Mandarin is the host language and English is the guest language. The proposed approach is language independent, however, so it can be applied to other host-guest language pairs.
Figure 1: Proposed framework. The generator learns to generate code-switching sentences from monolingual sentences: for each input word it outputs a switch decision (0: remain, 1: translate), e.g., turning "好 我 知道" (Okay / I / know) into "Okay 我 知道", and the discriminator judges whether a sentence ŷ is a real code-switching sentence or a generated one.
To generate intra-sentential code-switching sentences, we could randomly select some of the words in a Mandarin sentence and translate them into English. However, this approach would generate many unreasonable code-switching sentences. For instance, given the monolingual sentence "我要介紹..." (I will introduce ...), replacing the word "我" with "I" does not yield a reasonable sentence, because few speakers would code-switch in this way. Nevertheless, there are no perfect rules for predicting which word or phrase in a sentence should be code-switched. Inspired by GAN, we propose to learn a conditional generator for code-switching sentences, so that it can transform a monolingual sentence into a code-switching sentence.

Table 1: Details for the corpora: LectureSS and SEAME (train/dev/test splits).

Table 2: Comparison between LectureSS and SEAME.

Discriminator. The discriminator D takes a sentence as input and outputs a scalar between 0 and 1 indicating whether the input sentence was generated by the generator G or drawn from the given code-switching training sentences. For a well-trained, perfect discriminator, the output is zero when the sentence is generated by G, and one when the sentence is sampled from the training data set.

Generator. The generator G takes a monolingual (Chinese) word sequence x = {x_1, x_2, ..., x_N} as input, where N is the length of x. (As in a typical conditional GAN, G also takes a noise vector z sampled from a Gaussian as input; we omit z in the following formulation for simplicity.) The output of G is a sequence of values s = {s_1, s_2, ..., s_N}, where each s_n is a scalar between 0 and 1 corresponding to the input word x_n and represents whether it is proper to replace x_n with its translated counterpart. Each s_n is treated as a probability, and a binary value is sampled from it. If the sampled value is 1, the Mandarin word is translated into English; if 0 is sampled, the input word remains unchanged. A code-switching sentence ŷ is thus generated from G. The generator proposed here only learns which words in Mandarin can be replaced with English. It is possible to build a generator that directly generates the code-switching word sequence, but learning which word can be replaced is easier than learning to generate the words in another language directly.

Training of Discriminator. D is learned by minimizing L_D below:

$L_D = -\big(\mathbb{E}_{y \sim D_{cs}}[\log D(y)] + \mathbb{E}_{x \sim D_{zh},\, \hat{y} \sim G(x)}[\log(1 - D(\hat{y}))]\big)$.   (1)

In the first term of (1), the code-switching sentence y is sampled from the training data D_cs, and the discriminator D learns to assign a larger score D(y) to y. In the second term, a monolingual sentence x is sampled from a data set D_zh, and G transforms x into a code-switching sentence ŷ; D learns to assign a smaller score D(ŷ) to ŷ.

Training of Generator. The parameters of G are learned from the following loss function L_G:

$L_G = -\mathbb{E}_{x \sim D_{zh},\, \hat{y} \sim G(x)}[\log D(\hat{y})]$.   (2)

With (2), G learns to generate ŷ that obtains a large D(ŷ). Because ŷ is produced by discrete sampling, the loss cannot be back-propagated to G directly, so G is updated with the REINFORCE algorithm [17, 18, 19]. The discriminator D and the generator G are trained iteratively, as in a typical GAN.
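For concreteness, one adversarial training iteration under losses (1) and (2) can be sketched as follows. This is a minimal PyTorch sketch under our own assumptions, not the authors' released code: `generator`, `discriminator`, and `translate` are assumed callables matching the descriptions above, and the REINFORCE update uses log D(ŷ) as the sentence-level reward for the log-probability of the sampled switch pattern.

```python
import torch

def train_step(generator, discriminator, translate, zh_batch, cs_batch, opt_g, opt_d):
    """One adversarial iteration (Eq. (1) and (2)); all names are assumptions.

    generator(x)      -> per-token switch probabilities s in (0, 1), shape (B, N)
    translate(x, a)   -> code-switched sentences: token n replaced by its English
                         counterpart wherever the sampled action a[:, n] == 1
    discriminator(y)  -> scalar in (0, 1) per sentence, shape (B, 1)
    """
    # --- Discriminator update: minimize Eq. (1) ---
    s = generator(zh_batch)
    actions = torch.bernoulli(s).detach()            # sample binary switch decisions
    fake_cs = translate(zh_batch, actions)
    d_loss = -(torch.log(discriminator(cs_batch)).mean()
               + torch.log(1.0 - discriminator(fake_cs)).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator update: REINFORCE on Eq. (2) ---
    s = generator(zh_batch)
    actions = torch.bernoulli(s).detach()
    reward = torch.log(discriminator(translate(zh_batch, actions))).detach()
    # log-probability of the sampled switch pattern under the generator
    log_prob = (actions * torch.log(s + 1e-8)
                + (1.0 - actions) * torch.log(1.0 - s + 1e-8)).sum(dim=1)
    g_loss = -(reward.squeeze(-1) * log_prob).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

In practice a baseline is often subtracted from the reward to reduce the variance of the policy gradient; the paper does not say whether one was used.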
3. Experimental setup
In this work, we utilized two data sets for the experiments: LectureSS and the SEAME corpus [20]. The detailed statistics of these corpora are listed in Table 1, and a comparison between them is drawn in Table 2. The CS-rate in Table 2 is the rate of code-switched (English) words, defined as below:

$\text{CS-rate} = \frac{\#\,\text{English words}}{\#\,\text{words}}$.   (3)

LectureSS is a lecture speech corpus recorded from one Taiwanese instructor at National Taiwan University in 2006. The content of the recording is a "Signal and System" (SS) course. It is spontaneous speech with highly imbalanced Mandarin-English code-switching characteristics: Mandarin is the host language, English is the guest language, and most English words in this corpus are domain-specific terminologies.

The South East Asia Mandarin-English (SEAME) corpus is a conversational speech corpus recorded from Singaporean and Malaysian speakers, with an almost balanced gender distribution, at Nanyang Technological University and Universiti Sains Malaysia. The speech covers two speaking types, conversational and interview, and the content relates to daily life, school, and so on. It is also Mandarin-English code-switching, but the amounts of Chinese (ZH) words and English (EN) words are about equal. Not only proper nouns but also conjunctions may be used in English in this corpus. Some sentences in SEAME are completely in English, which does not happen in the NTU lectures. Nevertheless, the CS-rate of LectureSS is close to that of SEAME.

Before using these data sets, we cleaned them. LectureSS contains "Zhuyin fuhao" (Mandarin phonetic symbols), mathematical symbols, and English alphabet letters that cannot be translated into English or Chinese words. In addition, SEAME contains non-speech labels, unknown-word labels, incomplete words, and foreign words. We removed these tokens directly when the semantics of the sentence would not be influenced too much; otherwise, we excluded the utterances from the experiments.
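Since the random baseline in Section 4.1 translates each word with probability equal to the CS-rate, we read (3) as the fraction of English (guest-language) words among all words; both this reading and the crude ASCII test for English tokens below are our assumptions. A minimal sketch:

```python
def cs_rate(sentences):
    """CS-rate under our assumed reading of Eq. (3): the fraction of
    English (guest-language) tokens among all tokens in the corpus."""
    en = total = 0
    for tokens in sentences:                      # each sentence: a list of tokens
        for tok in tokens:
            total += 1
            if tok.isascii() and tok.isalpha():   # crude test for an English token
                en += 1
    return en / total if total else 0.0
```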
Table 3: Code-switching point (CSP) prediction on manually labeled sentences.

LectureSS      Precision  Recall  F-measure  BLEU-1  WER(%)  EN WER  ZH WER
ZH             0          0       0          0.76    20.56   100     0
EN             0.21       1       0.35       0.20    102.1   0       128.5
random         0.17       0.16    0.16       0.62    39.20   88.14   26.54
noun           -          -       -          -       -       -       -
proposed       0.52       0.42    0.46       0.78    22.82   54.24   14.69
proposed+pos   0.52       -       -          -       -       -       -

SEAME          Precision  Recall  F-measure  BLEU-1  WER(%)  EN WER  ZH WER
proposed+pos   0.51       -       -          -       -       -       -
The inputs to both the discriminator and the generator are word sequences. There are two ways to represent a word. In the first approach, each word is represented by a one-hot encoding and transformed into an embedding by an embedding layer. We set the vocabulary size to 8200 for LectureSS and 12000 for SEAME, and the word embedding dimension to 150. In the second approach, we additionally consider the part-of-speech (POS) tag of each word. We used Jieba (https://github.com/fxsjy/jieba), an open-source Chinese segmentation application in Python, as our POS tagger. Only Chinese words are tagged; English words are tagged as "eng". Each POS tag corresponds to a 64-dim one-hot encoding, which is transformed into a 20-dim embedding by an embedding layer. The embeddings of words and POS tags are concatenated. The embedding layer is jointly trained with the whole model, and the Chinese and English word embeddings are trained together.

The generator G is composed of the embedding layer, one bidirectional long short-term memory (BLSTM) [21] layer, and one fully connected (FC) layer. At each time step it outputs one value with a sigmoid to determine whether the corresponding word will be translated into English. A 10-dim Gaussian noise vector is concatenated with the output of the BLSTM. The parameters of G are updated by policy gradient, with the output of D as the reward. The translator is merely a mapping table that pairs each Chinese vocabulary item with its English counterpart obtained from Google Translate.

The discriminator D shares the embedding layer and the BLSTM with G; however, these shared layers are updated only when G is training and are fixed when D is training. The output of the BLSTM is passed into an FC layer with dropout rate 0.3, ending in a one-dimensional output with a sigmoid.

The whole optimization process is based on the Adam optimizer [22], and we train for 100 epochs in all experiments. The input data of G is all Chinese training sentences, and D is trained on all code-switching sentences in the training set together with fake code-switching sentences generated by G, in the respective corpora.
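The architecture just described can be sketched as follows (word-embedding path only, without the POS branch), assuming PyTorch. The BLSTM hidden size is not stated in the paper, so `hidden_dim` is a placeholder, and summarizing a sentence by the last BLSTM step in the discriminator is likewise our assumption:

```python
import torch
import torch.nn as nn

class CSGAN(nn.Module):
    """Generator and discriminator sharing the embedding layer and the BLSTM,
    as described above. hidden_dim and the last-step pooling in the
    discriminator are our placeholders; the paper does not specify them."""

    def __init__(self, vocab_size=8200, emb_dim=150, hidden_dim=128, noise_dim=10):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)                 # shared
        self.blstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                             bidirectional=True)                     # shared
        self.noise_dim = noise_dim
        self.g_fc = nn.Linear(2 * hidden_dim + noise_dim, 1)         # generator head
        self.d_fc = nn.Sequential(nn.Dropout(0.3),                   # discriminator head
                                  nn.Linear(2 * hidden_dim, 1))

    def generator(self, word_ids):
        h, _ = self.blstm(self.emb(word_ids))                        # (B, N, 2H)
        z = torch.randn(h.size(0), h.size(1), self.noise_dim,
                        device=h.device)                             # 10-dim Gaussian noise
        return torch.sigmoid(self.g_fc(torch.cat([h, z], dim=-1))).squeeze(-1)

    def discriminator(self, word_ids):
        h, _ = self.blstm(self.emb(word_ids))
        return torch.sigmoid(self.d_fc(h[:, -1, :]))                 # one score per sentence
```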
4. Results
We evaluate our proposed method in three aspects: code-switching point (CSP) prediction, quality of generated text, and performance of language modeling with augmented text.
4.1. Code-switching point prediction

We selected 50 code-switching sentences y from the testing set as ground truth and manually translated them into fully Chinese sentences x. The generator then generates code-switching sentences ŷ conditioned on x. We consider the positions of English words in y as the CSPs to be detected, and use precision, recall, and F-measure to evaluate the accuracy of the detected CSPs in ŷ. Additionally, we apply the BLEU score [23] and the word error rate (WER) in (4) to evaluate ŷ:

$\text{WER} = \frac{\sum_i \text{EditDistance}(y_i, \hat{y}_i)}{\sum_i |y_i|}$,   (4)

where i indexes the selected code-switching sentences.

The proposed approach is compared with four baselines: (1) ZH: fully Chinese sentences, i.e., ŷ = x. (2) EN: fully English sentences. (3) random: words are randomly translated into English, with a per-word translation probability equal to the CS-rate of the corpus considered. (4) noun: translate all words tagged as nouns (common nouns and proper nouns) [24] by the POS tagger into English.

According to Table 3, ZH obtains a good BLEU score and total WER. This is because Chinese words occur more frequently than English words in the code-switching sentences of both corpora. Random performs poorly because people do not code-switch arbitrarily.
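These metrics can be computed as in the sketch below, where a sentence's CSPs are represented as the set of token positions holding English words; normalizing (4) by the total reference length follows the standard WER definition:

```python
def csp_metrics(ref_positions, hyp_positions):
    """Precision / recall / F-measure over predicted code-switching points,
    each given as a set of token indices that are English in the sentence."""
    tp = len(ref_positions & hyp_positions)
    precision = tp / len(hyp_positions) if hyp_positions else 0.0
    recall = tp / len(ref_positions) if ref_positions else 0.0
    f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (single-row dynamic programming)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(refs, hyps):
    """Eq. (4): summed edit distance over summed reference length."""
    return sum(edit_distance(r, h) for r, h in zip(refs, hyps)) / sum(len(r) for r in refs)
```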
Noun has high precision because noun translation is an exact rule, but it cannot predict CSPs other than nouns. Our method obtains better recall, F-measure, BLEU-1, and English WER than noun on both corpora, because it detects not only nouns but also other CSPs such as conjunctions, discourse particles, and filled pauses.
4.2. Quality of generated text

Next, we evaluate the quality of the generated text by scoring it with language models trained on the training text; we report the perplexity (PPL) of the generated text. Two types of language models were used: an n-gram model [25] and a neural language model [26]. The n-gram language model is a word-level tri-gram with Kneser-Ney (KN) smoothing [27], trained with SRILM [28]. The recurrent neural network language model (RNNLM) is a two-layer character-level LSTM [29] language model. Because the two corpora have different scales of training data, we used a 32-dimensional LSTM for LectureSS and a 64-dimensional LSTM for SEAME. We used the Adam optimizer.
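As a reference point for how the PPL numbers in Tables 5 and 6 are obtained, a sketch of character-level PPL computation for the RNNLM is shown below; `model` and the batch format are placeholders, not the authors' implementation:

```python
import math
import torch
import torch.nn.functional as F

def rnnlm_perplexity(model, batches):
    """Character-level PPL: exp of the mean per-character cross-entropy.
    model(inputs) -> logits of shape (B, T, vocab); batches yields
    (inputs, targets) index tensors. All names are placeholders."""
    total_nll, total_chars = 0.0, 0
    model.eval()
    with torch.no_grad():
        for inputs, targets in batches:
            logits = model(inputs)
            nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                  targets.reshape(-1), reduction="sum")
            total_nll += nll.item()
            total_chars += targets.numel()
    return math.exp(total_nll / total_chars)
```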
Table 4: Code-switching examples from different methods.

Ground Truth:  Causality 這個也是讀過的就是指我 output at-any-time 只 depend-on input
Input:         因果性這個也是讀過的就是指我輸出在任意時間只取決於輸入
               (Causality, this is also what you have read, that means what I output at any time only depends on input)
Random:        因果性 this 也是讀過的就是指我 output 在任意時間只取決於輸入
Noun:          Causality 這個也是讀過的就是指我輸出在任意時間只取決於 input
Proposed:      Causality this also 你所 read 過的就是指我 output 在任意時間只取決於輸入
Proposed+pos:  Causality 這個也是讀過的就是指我 output at-any-time 只 depend-on 輸入
Table 5: Quality of generated code-switching text from testing text, evaluated by PPL with the n-gram LM and the neural LM (RNNLM).

                    random    noun      proposed   proposed+pos
LectureSS  n-gram   1022.95   337.713   -          -
SEAME      n-gram   177.039   154.28    159.103    -
           RNNLM    79.338    84.081    78.335     -

We used the random, noun, and proposed approaches to generate sentences for evaluating text quality. These sentences are generated from all the fully Chinese sentences in the testing set (810 sentences for LectureSS and 2211 sentences for SEAME, as shown in Table 1). The results are shown in Table 5. Under both language models, we observe that our method with POS tagging (proposed+pos) performs far better than random on both corpora. This shows that our model is able to transform a Chinese sentence into a code-switching sentence with a pattern similar to the training data.

Table 6: PPL of the neural-based language model (RNNLM) trained on code-switching training text and on data augmented from Chinese training text. The last +pos column indicates considering POS features in the proposed method.

                   train     random    noun      proposed   +pos
LectureSS  dev     110.35    107.28    105.37    109.58     -
           test    73.394    71.779    70.038    71.974     -
SEAME      dev     75.295    75.307    75.307    -          -
4.3. Language modeling with augmented data

To see whether the data augmentation methods help language modeling, we trained the RNNLM introduced in Section 4.2 on the training data and evaluated it on the same development and testing data. We do not show results for the n-gram LM here because its performance is not comparable to the RNNLM, as shown in Table 5. Lower perplexity represents better language modeling performance. We form the augmented training set by combining the generated code-switching sentences with the original training set; the generated code-switching sentences come from the Chinese sentences in the original training set (4643 sentences in LectureSS and 20365 sentences in SEAME, as shown in Table 1).

Table 6 shows our experimental results. The train column in the table represents the perplexity of the language model without the augmented code-switching sentences, which is the baseline of the experiment. As shown in this table, random surpasses the baseline on both the LectureSS and SEAME testing sets. This shows that augmented code-switching text helps language modeling even if the CSPs are randomly selected.
Noun improves the results only on LectureSS. This may be because LectureSS contains many CSPs on domain-specific nouns, while SEAME has more complicated CSPs.
Proposed+pos performs best on LectureSS, while proposed performs best on SEAME. This indicates that POS features help the generator produce more useful code-switching sentences on LectureSS, but not on SEAME. The reason is that in LectureSS, the domain-specific terminologies that tend to be code-switched into English are nouns, whereas SEAME comes from daily-life conversation, where CSPs are not concentrated on nouns, so the model performs better without POS information. Based on Table 6, we demonstrate that the augmented data automatically generated by our method helps language modeling on CS text, and that adding POS features to the generator input further improves the RNNLM on some data domains. We notice that the influence of POS tags differs between Table 5 and Table 6; however, we believe the results in Table 6 are more crucial, because language modeling is the goal of this task. In Table 5, even a model that decides not to code-switch any word in the input sentences may still obtain a reasonable number, but such a model would not be helpful in Table 6.

Some generated code-switching examples are shown in Table 4. The first row is the original code-switching sentence (ground truth), which we translated into fully Chinese (input). We then compare the generated results with random and noun. The rule-based approach is accurate but cannot find all CSPs; the proposed method with POS tagging finds more CSPs. More examples are available at http://goo.gl/KdBYSy. The examples show that the proposed approach usually generates reasonable code-switching sentences. However, it also generates some terrible sentences, most of which stem from bad translations from Chinese to English.
5. Conclusion
In this work, we generate code-switching sentences from monolingual sentences with a GAN. The generator learns to predict CSPs reasonably well without any linguistic knowledge. Moreover, our generated code-switching sentences are better than those from random and rule-based generation. Last but not least, the data augmented by our method improves the RNNLM. For future work, there is still room for improvement in the translator, since wrong translations may lead to terrible generated code-switching sentences. We will also further analyze the generator to better understand the mechanism of code-switching.

6. References

[1] Ö. Çetinoğlu, S. Schulz, and N. T. Vu, "Challenges of computational processing of code-switching," arXiv preprint arXiv:1610.02213, 2016.
[2] Z. Zeng, H. Xu, T. Y. Chong, E.-S. Chng, and H. Li, "Improving n-gram language modeling for code-switching speech recognition," in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2017, pp. 1596–1601.
[3] H. Xu, T. Chen, D. Gao, Y. Wang, K. Li, N. Goel, Y. Carmiel, D. Povey, and S. Khudanpur, "A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5929–5933.
[4] R. M. Bhatt, "Code-switching and the functional head constraint," in J. Fuller et al. (Eds.), Proceedings of the Eleventh Eastern States Conference on Linguistics. Ithaca, NY: Department of Modern Languages and Linguistics, 1995, pp. 1–12.
[5] C. W. Pfaff, "Constraints on language mixing: intrasentential code-switching and borrowing in Spanish/English," Language, pp. 291–318, 1979.
[6] Y. Li and P. Fung, "Code-switch language model with inversion constraints for mixed language speech recognition," in Proceedings of COLING 2012, 2012, pp. 1671–1680.
[7] ——, "Improved mixed language speech recognition using asymmetric acoustic model and language model with code-switch inversion constraints," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 7368–7372.
[8] L. Ying and P. Fung, "Language modeling with functional head constraint for code switching speech recognition," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 907–916.
[9] H. Adel, N. T. Vu, K. Kirchhoff, D. Telaar, and T. Schultz, "Syntactic and semantic features for code-switching factored language models," IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 431–440, 2015.
[10] C.-F. Yeh and L.-S. Lee, "An improved framework for recognizing highly imbalanced bilingual code-switched lectures with cross-language acoustic modeling and frame-level language identification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 7, pp. 1144–1159, 2015.
[11] S. Garg, T. Parekh, and P. Jyothi, "Dual language models for code mixed speech recognition," arXiv preprint arXiv:1711.01048, 2017.
[12] E. Yilmaz, H. van den Heuvel, and D. van Leeuwen, "Acoustic and textual data augmentation for improved ASR of code-switching speech," in Proc. Interspeech. Hyderabad, India: ISCA, 2018, pp. 1933–1937.
[13] S. Garg, T. Parekh, and P. Jyothi, "Code-switched language models using dual RNNs and same-source pretraining," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 3078–3083.
[14] E. Yılmaz, A. Biswas, E. van der Westhuizen, F. de Wet, and T. Niesler, "Building a unified code-switching ASR system for South African languages," arXiv preprint arXiv:1807.10949, 2018.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[16] L. Yu, W. Zhang, J. Wang, and Y. Yu, "SeqGAN: Sequence generative adversarial nets with policy gradient," in AAAI, 2017, pp. 2852–2858.
[17] R. Williams, "A class of gradient-estimation algorithms for reinforcement learning in neural networks," in Proceedings of the International Conference on Neural Networks, 1987, pp. II-601.
[18] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[19] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, 2000, pp. 1057–1063.
[20] D.-C. Lyu, T.-P. Tan, E. S. Chng, and H. Li, "SEAME: a Mandarin-English code-switching speech corpus in South-East Asia," in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[21] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[23] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[24] C. Wei-Yu Chen, "The mixing of English in magazine advertisements in Taiwan," World Englishes, vol. 25, no. 3-4, pp. 467–478, 2006.
[25] P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai, "Class-based n-gram models of natural language," Computational Linguistics, vol. 18, no. 4, pp. 467–479, 1992.
[26] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
[27] R. Kneser and H. Ney, "Improved backing-off for m-gram language modeling," in 1995 International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1. IEEE, 1995, pp. 181–184.
[28] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Seventh International Conference on Spoken Language Processing, 2002.
[29] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.