Code-Mixed to Monolingual Translation Framework
Sainik Kumar Mahata, Soumil Mandal, Dipankar Das, Sivaji Bandyopadhyay
Jadavpur University, Kolkata, India
SRM University, Chennai, India
[email protected], [email protected]@gmail.com, sivaji cse [email protected]
(The first two authors contributed equally.)
Abstract
The use of multi-lingualism in the new generation is widespread in the form of code-mixed data on social media, and therefore a robust translation system is required for catering to monolingual users, as well as for easier comprehension by language processing models. In this work, we present a translation framework that uses a translation-transliteration strategy for translating code-mixed data into their equivalent monolingual instances. For converting the output to a more fluent form, it is reordered using a target language model. The most important advantage of the proposed framework is that it does not require a code-mixed to monolingual parallel corpus at any point. On testing, the framework achieved BLEU and TER scores of 16.47 and 55.45, respectively. Since the proposed framework comprises various sub-modules, we dive deeper into the importance of each of them, analyze the errors and, finally, discuss some improvement strategies.
Introduction

India has a linguistically diverse population due to its long history of foreign acquaintance. English, one of those borrowed languages, became an integral part of the education system and hence gave rise to a population which is very comfortable using bilingualism in communication. This kind of language diversity and dialect variation initiates frequent code-mixing. Further, due to the emergence of social media, the practice has become even more widespread. We found out that only 26% of the Indian population is bilingual (https://en.wikipedia.org/wiki/Multilingualism_in_India). To cater to the rest, who are comfortable using only one native language, and to make them compatible in the age of social media, translating code-mixed data into
its corresponding monolingual instance is an alternative. But translating such data manually requires a lot of effort, and hence availing machines for the same is more desirable. Machine translation itself is a challenging task due to out-of-vocabulary problems, context misunderstanding, grammatical errors, bias, etc. It becomes even more difficult when the input instance is code-mixed, as many new challenges emerge with it. In this work, we present an architecture for code-mixed translation which doesn't require a code-mixed to monolingual parallel corpus for training. This is highly beneficial, as code-mixed data is difficult to scrape, and an enormous amount of data would be required for a model like SMT or NMT to learn the nuances of the language in order to translate properly. We implemented our architecture for Bengali-English (Bn-En) code-mixed data in Roman script to Bengali in its native script. Our architecture is capable of translating monolingual sentences as well: in our case, if the input is monolingual Bengali or English in Roman script, it will still be translated to the target language, which is Bengali in native script. Our contributions also include the preparation of a gold standard Bn-En code-mixed to Bn parallel corpus, which was used for testing purposes only. The shortcomings and errors have been analyzed in detail as well.

Related Work

Several research works have been done in the recent past on code-mixed data, especially involving language tagging. Jhamtani et al. (2014) created an ensemble model by combining two classifiers to create a Hindi-English code-mixed LID. The first classifier used word frequency, modified edit distance, and character n-grams as features. The second classifier used the output from the former classifier for the current word, along with the language and POS tags of neighbouring words, to give the final tag. Rijhwani et al. (2017) proposed a generalized language tagger for an arbitrary set of languages which is fully unsupervised. With respect to back-transliteration, Bilac and Tanaka (2004) proposed a hybrid approach which combines phoneme, grapheme and segmentation based modules. Luo and Lepage (2015) presented an architecture for back-transliteration using the SMT framework described in Koehn et al. (2003). Ravishankar (2017) describes a finite-state based system for back-transliteration of transliterated Marathi words in Roman script. Its major advantage over statistical models is that it is able to model exceptions without being retrained. Sinha and Thakur (2005) took up the challenge of translating Hindi-English code-mixed text to monolingual English from a linguistic point of view by using morphological analyzers, though they did not perform any in-depth analysis or evaluation. In Dhar et al. (2018), the authors created a code-mixed (Hindi-English) to monolingual (English) parallel corpus consisting of 6096 instances. They also developed an augmentation pipeline which can be utilized for augmenting existing MT systems such that their translations can be improved without training the MT system specifically for code-mixed text. On testing the module with Moses, Google NMT and Bing Translator, the BLEU scores improved by 2%, 9.4% and 6.1%, respectively. To the best of our knowledge, ours is the first end-to-end code-mixed translation system.
Data Preparation

In order to build our test data, we randomly collected 1600 code-mixed instances from the En-Bn data prepared in Patra et al. (2018). For creating the parallel corpus, a group of three annotators, fluent in both English and Bengali, was employed. One of the annotators was asked to translate all the instances, while the other two classified the translations into two classes, correct and incorrect. The agreement was then calculated using Fleiss' Kappa (Fleiss and Cohen, 1973), which was found to be ≈
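As an aside, here is a minimal sketch of how such an agreement score can be computed with statsmodels. The tooling choice is our assumption (the paper does not name one), and the judgements array is hypothetical, holding one correct/incorrect label per rater per instance:

```python
# Sketch: inter-annotator agreement via Fleiss' Kappa. The data layout is a
# hypothetical example; the paper does not specify how judgements were stored.
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per translated instance, one column per rater;
# labels: 1 = correct, 0 = incorrect.
judgements = [
    [1, 1],  # both raters marked the translation correct
    [1, 0],  # the raters disagreed
    [0, 0],  # both marked it incorrect
]

# aggregate_raters turns per-rater labels into per-category counts per instance.
table, _categories = aggregate_raters(judgements)
print(fleiss_kappa(table))  # kappa in [-1, 1]
```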
Proposed Framework

Our proposed approach comprises four modules. The first module is the language identification system that helps us segment a code-mixed sentence, i.e. find the boundaries of sub-sequences that are in the same language. The second module translates the English segments to Bengali using a character based neural machine translation system. Bengali segments written in Roman script are back-transliterated to the native script by the third module. On joining the translated and the back-transliterated segments into a monolingual instance, we noticed that the output wasn't always fluent and had grammatical errors. To counter this, we developed the fourth module, which uses language modelling to convert the output to a more natural looking instance with better flow. The architecture is depicted in Figure 1, and all the modules are described in detail below.

Figure 1: Architecture overview.
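To make the data flow of Figure 1 concrete, below is a minimal orchestration sketch. Every helper passed in (tag_language, translate_en, back_transliterate, reorder) is a hypothetical stand-in for one of the four modules described in the following subsections, not the authors' released code:

```python
# Sketch of the end-to-end pipeline from Figure 1, under the assumption that
# each module is exposed as a callable.
from typing import List, Tuple

def translate_code_mixed(sentence: str,
                         tag_language,        # word -> "En" | "Bn"
                         translate_en,        # En segment -> Bn in native script
                         back_transliterate,  # romanized Bn segment -> native script
                         reorder) -> str:     # LM-based token reordering
    # 1. Tag each token, then merge consecutive same-language tokens into segments.
    tokens = sentence.split()
    segments: List[Tuple[str, List[str]]] = []
    for tok in tokens:
        lang = tag_language(tok)
        if segments and segments[-1][0] == lang:
            segments[-1][1].append(tok)
        else:
            segments.append((lang, [tok]))

    # 2. Route each segment: En -> character-level NMT, Bn -> back-transliteration.
    out_parts = []
    for lang, seg_tokens in segments:
        seg = " ".join(seg_tokens)
        out_parts.append(translate_en(seg) if lang == "En" else back_transliterate(seg))

    # 3. Join and smooth the word order with the target language model.
    return reorder(" ".join(out_parts))
```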
Language Tagging

This module partitions the input into segments with respect to language. Bn tagged segments are passed to the transliteration system, while En segments are passed to the translation system. In our case, segments are sub-sequences of the instance written in the same language. In the examples below, strings in brackets denote segments.
E.g. 1: (Movie)_En (ta bhalo chilo)_Bn (but midpoint)_En (e amar khub)_Bn (boring)_En (lagte shuru korlo)_Bn.
E.g. 2: (I had to go)_En (karon o khub)_Bn (urgently)_En (daklo amaye)_Bn.

In order to achieve this goal, a language tagger was used. We used the character based LSTM architecture proposed by Mandal et al. (2018). This is a stacked LSTM model of sizes 15-35-25-1, in order, where 15 is the input dimension and 1 is the output dimension. The data used for training and testing was gathered from the data released in ICON 16 (http://ltrc.iiit.ac.in/icon2016/) and Mandal and Das (2018). The training dataset contains 6,632 words of Bn and En type each, while the test dataset comprises 700 words of Bn and En type each. With respect to our present experiment, the training data was increased by 1,400 sentences for both English and Bengali, collected from the code-mixed data released in Ghosh et al. (2017). The sources of all these instances, as described in their papers, were social media websites like Twitter, Facebook and WhatsApp. The loss function used was binary cross-entropy, and we employed the adam optimizer with a sigmoid activation function. The number of epochs was set to 500 and the batch size to 256. The increase in the size of the training data resulted in an improvement in accuracy from 91.71%, as was shown in Mandal et al. (2018), to 93.2% on identical test data.

Back-Transliteration

To make an accurate back-transliteration system, we used two resources, namely BN_TRANS and PL, which are described in Mandal and Nanmaran (2018). BN_TRANS is essentially a parallel lexicon with two columns, where col 1 has Bn words in native script while col 2 has the respective ITRANS (https://en.wikipedia.org/wiki/ITRANS) transliterations. PL is a parallel lexicon where col 1 has phonetically transliterated Bn words in Roman script, while col 2 has the respective ITRANS form. BN_TRANS has 21,850 entries in each column, while PL has 6,000 entries in each column.

Our back-transliteration system first performs lexical checking, i.e. it checks if the word is present in col 1 of PL. If yes, it takes the respective ITRANS form and queries BN_TRANS, i.e. it checks if it is present in col 2 of BN_TRANS and returns the respective word in native script. As there are several possible cases where words are absent from PL, i.e. out of vocabulary, we decided to build a back-transliteration system using a character based seq2seq model (Ling et al., 2015) to resolve this scenario. We simply used BN_TRANS as a parallel lexicon, where column 2 entries are source sequences and column 1 entries are target sequences, i.e. the goal of our model is essentially to learn the mappings from Bn in Roman script to Bn in native script. For training, the activation function used was softmax, the optimizer was rmsprop, and the loss function was categorical cross-entropy. The size of the latent dimension was set to 128, the batch size was kept at 64, and the number of epochs was set to 100. The training accuracy at the end was 48.2%. The architecture is shown in Figure 2.

Figure 2: Back-transliteration algorithm.
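A minimal sketch of this lookup-with-fallback logic follows. The dict-based lexicons and the seq2seq_transliterate wrapper are hypothetical stand-ins; only the two-step PL to BN_TRANS lookup mirrors the description above:

```python
# Sketch of the back-transliteration algorithm in Figure 2. The lexicons are
# modelled as plain dicts; seq2seq_transliterate is a hypothetical wrapper
# around the trained character-level model.
def back_transliterate(word: str,
                       pl: dict,        # romanized Bn word -> ITRANS form (PL)
                       bn_trans: dict,  # ITRANS form -> Bn native script (BN_TRANS)
                       seq2seq_transliterate) -> str:
    # Step 1: lexical checking in PL (col 1 -> col 2).
    itrans = pl.get(word)
    if itrans is not None:
        # Step 2: query BN_TRANS (col 2 -> col 1) for the native-script word.
        native = bn_trans.get(itrans)
        if native is not None:
            return native
    # Out-of-vocabulary fallback: the character-level seq2seq model trained on
    # BN_TRANS (Roman-script source -> native-script target).
    return seq2seq_transliterate(word)
```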
Machine Translation

For translating the English segments to the corresponding Bengali script, we decided to go for fully character level neural machine translation based on the architecture described in Lee et al. (2017), as it outperforms a statistical model (Mahata et al., 2018). It relies on the sequence-to-sequence (Sutskever et al., 2014) model and uses an attention mechanism (Vaswani et al., 2017) while decoding. We opted for this because of the benefits it provides over word level models, which are very important in our case. The benefits, as stated in Chung et al. (2016), are that it (1) is capable of modelling morphological variants, (2) overcomes the out-of-vocabulary issue, and (3) does not require segmentation.

The seq2seq model takes a sequence X = {x_1, x_2, ..., x_n} as input and tries to generate the target sequence Y = {y_1, y_2, ..., y_m} as output, where x_i and y_i are the input and target symbols respectively. The architecture of the seq2seq model comprises two parts, the encoder and the decoder. In order to build the encoder, we used LSTM cells. The input of the cell was a one-hot tensor of English sentences (embedding at character level). From the encoder, the internal states of each cell were preserved and the outputs were discarded. The purpose of this is to preserve information at the context level. These states were then passed on to the decoder cell as initial states. For building the decoder, again an LSTM cell was used, with the hidden states from the encoder as initial states. It was designed to return both sequences and states. The input to the decoder was a one-hot tensor (embedding at character level) of Bengali sentences, while the target data was identical but with an offset of one time step. The information for generation is gathered from the initial states passed on by the encoder. Thus, the decoder learns to generate target data [t+1, ...] given targets [..., t], conditioned on the input sequence. It essentially predicts the output sequence, one character per time step.

For training and testing, the En-Bn parallel corpus from TDIL (http://tdil.meity.gov.in/) and the corpus in Post et al. (2012) were divided into 180k and 20k instances respectively. For training the model, the batch size was set to 64, the number of epochs was set to 100, the activation function was softmax, the optimizer chosen was rmsprop and the loss function used was categorical cross-entropy. The learning rate was set to 0.001. Post training, the BLEU score of the model was calculated to be 5.06.

Token Reordering

In several cases, we noticed that the result of joining the outputs from the translation and the transliteration modules had grammatical errors, mainly contributed by wrong word ordering, other than errors in word forms. To fix the former problem, we created a simple language model based token reordering system. We used the Bengali corpus in TDIL, with 50k sentences, to create a trigram and bigram based language model with normalized scores in log space. The system first calculates the normalized log probability of the input sentence. A confusion set, if applicable, is made for each trigram in the sentence. A re-scoring is performed on the sentence by substituting candidates from the confusion set. The trigram substitution (essentially a reordering) which results in the best net score is kept. If no alterations are performed by the trigram model, a similar sequence of steps is performed using the bigram model as our final step. This process is inspired by the work in Bryant and Briscoe (2018). An example of bigram and trigram reordering is shown in Figure 3, and a sketch of the scoring loop is given below.
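Here is a minimal sketch of a single trigram re-scoring pass. It assumes, since the paper does not define it, that a trigram's confusion set is the set of permutations of its tokens; lm_logprob is a hypothetical length-normalized trigram LM scorer:

```python
# Sketch of LM-based token reordering. lm_logprob is a hypothetical function
# returning a length-normalized log probability under the trigram LM; the
# confusion set of a trigram is assumed to be the permutations of its tokens.
from itertools import permutations

def reorder_pass(tokens, lm_logprob):
    """One trigram re-scoring pass; if it changes nothing, an analogous
    bigram pass (not shown) would run as the final step."""
    best, best_score = list(tokens), lm_logprob(tokens)
    for i in range(len(tokens) - 2):
        # Try every reordering of the trigram starting at position i.
        for perm in permutations(tokens[i:i + 3]):
            candidate = tokens[:i] + list(perm) + tokens[i + 3:]
            score = lm_logprob(candidate)
            if score > best_score:  # keep the best-scoring substitution
                best, best_score = candidate, score
    return best
```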
Results

The scores achieved by our system and Google NMT (in the En-Bn setup) are given in Table 1. Two variants of our system were tested, one without token reordering (CMT1) and one with it (CMT2). Manual scoring (in the range 1-5, low to high quality) of Adequacy and Fluency (Banchs et al., 2015) was done by a bilingual linguist, fluent in both En and Bn, with Bn as mother tongue. We can clearly see that our pipeline outperforms GNMT by a fair margin (about 13.34 BLEU, 18.34 TER) and that token reordering further improves our system, especially in the case of fluency. For a deeper analysis, we also performed two experiments using CMT2.
Model   BLEU    TER     Adq.   Flu.
GNMT     2.44   75.09   0.90   1.12
CMT1    15.09   58.04   3.18   3.57
CMT2    16.47   55.45   3.19   3.97
Table 1: Evaluation results.
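The automatic metrics in Table 1 can be computed with standard tooling; the sketch below uses sacrebleu, which is an assumption on our part, since the paper does not name the implementation it used:

```python
# Sketch: corpus-level BLEU and TER with sacrebleu (tooling choice is an
# assumption; the paper does not state which implementation it used).
from sacrebleu.metrics import BLEU, TER

hypotheses = ["...model outputs, one sentence per entry..."]
references = [["...gold translations, aligned with the hypotheses..."]]

print(BLEU().corpus_score(hypotheses, references))  # e.g. "BLEU = 16.47 ..."
print(TER().corpus_score(hypotheses, references))   # lower TER is better
```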
Exp 1.
We randomly took 100 instances where the BLEU score achieved was less than 15. We then fed these back to our pipeline and collected the outputs from each of the modules. We manually associated each of the errors with the module causing it, considering that the input to it was correct. The results are shown in Table 2. The language tagger, being the first module in our pipeline, requires the most improvement for better results, followed by the machine translation system and the back-transliteration module. All of these are supervised models and can be improved with more training data.
Module                 Contribution
Language Tagger        36
Back Transliteration   12
Machine Translation    25
Table 2: Error contribution.
Exp 2.
A linguist proficient in both English and Bengali manually divided our test data into two sets, one where the matrix language was Bengali (M_Bn) and the other where the matrix language was English (M_En). The size of M_Bn was 1205 and that of M_En was 395. When feeding the sets separately to CMT2, the BLEU and TER scores achieved on M_Bn were 16.98 and 55.02, while on M_En they were 9.3 and 65.11, respectively. This is mainly due to the fact that, in our pipeline, the Bn segments are transliterated while the En segments are translated, and translation has a higher error potential than transliteration (as shown in Table 2). This problem can be solved if the matrix and embedded languages are identified first, and the instance is then passed on to a different system accordingly, i.e. one for the M_Bn type and one for the M_En type.

Figure 3: Translation with and without token reordering on short snippets.

Conclusion

In this article, we have presented results from our ongoing work on translating code-mixed instances to monolingual ones. Our system achieves a BLEU score of 16.47 on our test data, which is a good starting point. On error analysis, we found that the language identification and translation systems contribute most to the reduction of the BLEU score. In the future, we would like to add new modules to our pipeline, such as a matrix-embedded language classifier and an accurate normalization module, and to replace token reordering with a grammar correction module, similar to Yuan and Briscoe (2016). Our current goals include improving the language tagger and incorporating context information while translating, rather than just segments. Experimenting on chat data, which has more noise potential, will be interesting as well.
References
Rafael E. Banchs, Luis F. D'Haro, and Haizhou Li. 2015. Adequacy-fluency metrics: Evaluating MT in the continuous space model framework. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):472-482.

Slaven Bilac and Hozumi Tanaka. 2004. A hybrid back-transliteration system for Japanese. In Proceedings of the 20th International Conference on Computational Linguistics, page 597. Association for Computational Linguistics.

Christopher Bryant and Ted Briscoe. 2018. Language model based grammatical error correction without annotated training data. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 247-253.

Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. arXiv preprint arXiv:1603.06147.

Mrinal Dhar, Vaibhav Kumar, and Manish Shrivastava. 2018. Enabling code-mixed translation: Parallel corpus creation and MT augmentation approach. In Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pages 131-140.

Joseph L. Fleiss and Jacob Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33(3):613-619.

Souvick Ghosh, Satanu Ghosh, and Dipankar Das. 2017. Sentiment identification in code-mixed social media text. arXiv preprint arXiv:1707.01184.

Harsh Jhamtani, Suleep Kumar Bhogi, and Vaskar Raychoudhury. 2014. Word-level language identification in bi-lingual code-switched texts. In Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. Pages 127-133.

Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5:365-378.

Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W. Black. 2015. Character-based neural machine translation. arXiv preprint arXiv:1511.04586.

Juan Luo and Yves Lepage. 2015. Handling of out-of-vocabulary words in Japanese-English machine translation by exploiting parallel corpus. International Journal of Asian Language Processing, 23(1):1-20.

Sainik Kumar Mahata, Soumil Mandal, Dipankar Das, and Sivaji Bandyopadhyay. 2018. SMT vs NMT: A comparison over Hindi & Bengali simple sentences. arXiv preprint arXiv:1812.04898.

Soumil Mandal and Dipankar Das. 2018. Analyzing roles of classifiers and code-mixed factors for sentiment identification. arXiv preprint arXiv:1801.02581.

Soumil Mandal, Sourya Dipta Das, and Dipankar Das. 2018. Language identification of Bengali-English code-mixed data using character & phonetic based LSTM models. arXiv preprint arXiv:1803.03859.

Soumil Mandal and Karthick Nanmaran. 2018. Normalization of transliterated words in code-mixed data using seq2seq model & Levenshtein distance. arXiv preprint arXiv:1805.08701.

Braja Gopal Patra, Dipankar Das, and Amitava Das. 2018. Sentiment analysis of code-mixed Indian languages: An overview of SAIL code-mixed shared task @ ICON-2017. arXiv preprint arXiv:1803.06745.

Matt Post, Chris Callison-Burch, and Miles Osborne. 2012. Constructing parallel corpora for six Indian languages via crowdsourcing. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 401-409. Association for Computational Linguistics.

Vinit Ravishankar. 2017. Finite-state back-transliteration for Marathi. The Prague Bulletin of Mathematical Linguistics, 108(1):319-329.

Shruti Rijhwani, Royal Sequiera, Monojit Choudhury, Kalika Bali, and Chandra Shekhar Maddila. 2017. Estimating code-switching on Twitter with a novel generalized word-level language detection technique. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1971-1982.

R. Mahesh K. Sinha and Anil Thakur. 2005. Machine translation of bi-lingual Hindi-English (Hinglish) text. Pages 149-156.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010.

Zheng Yuan and Ted Briscoe. 2016. Grammatical error correction using neural machine translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.