Task-Specific Pre-Training and Cross Lingual Transfer for Code-Switched Data
Akshat Gupta, Sai Krishna Rallabandi, Alan Black
Carnegie Mellon University
[email protected], {srallaba, awb}@cs.cmu.edu

Abstract
Using task-specific pre-training and leveraging cross-lingual transfer are two of the most popular ways to handle code-switched data. In this paper, we aim to compare the effects of both for the task of sentiment analysis. We work with two Dravidian code-switched languages, Tamil-English and Malayalam-English, and four different BERT-based models. We compare the effects of task-specific pre-training and cross-lingual transfer and find that task-specific pre-training results in superior zero-shot and supervised performance when compared to the performance achieved by leveraging cross-lingual transfer from multilingual BERT models.
Introduction

Code-switching is a common phenomenon which occurs in many bilingual and multilingual communities around the world. It is characterized by the usage of more than one language in a single utterance (Sitaram et al., 2019). India is one such place with many communities using different code-switched languages, with Hinglish (code-switched Hindi-English) being the most popular one. Dravidian languages like Tamil, Kannada, Malayalam and Telugu are also usually code-mixed with English. These code-mixed languages are commonly used to interact with social media, which is why it is essential to build systems that are able to handle code-switched data.

Sentiment analysis poses the task of inferring the opinion and emotions of a text query (usually social media comments or tweets) as a classification problem, where each query is classified as having a positive, negative or neutral sentiment (Nanli et al., 2012). It has various applications, like understanding the sentiment of tweets, Facebook and YouTube comments, understanding product reviews, etc. In times of the pandemic, when the entire world is living online, it has become an even more important tool. Robust systems for sentiment analysis already exist for high-resource languages like English (Barbieri et al., 2020), yet progress needs to be made for lower-resource and code-switched languages.

One of the major bottlenecks in dealing with code-switched languages is the lack of annotated datasets in the code-mixed languages. To alleviate this problem for Dravidian languages, datasets have been released for sentiment analysis in Tamil-English (Chakravarthi et al., 2020b) and Malayalam-English (Chakravarthi et al., 2020a). Various shared tasks (Patwa et al., 2020) (Chakravarthi et al., 2020c) have also been accompanied by the release of these datasets to advance research in this domain.

Models built on top of contextualized word embeddings have recently seen a huge amount of success and are used in most state-of-the-art systems. Systems built on top of BERT (Devlin et al., 2018) and its multilingual variants mBERT and XLM-RoBERTa (Conneau et al., 2019) have been the top ranking systems in both the above competitions.

In this paper, we train four different BERT-based models for sentiment analysis on two different code-switched Dravidian languages. The main contributions of this paper are:

• Comparing the effects of task-specific pre-training and cross-lingual transfer for code-switched sentiment analysis. In our experiments, we find that the performance with task-specific pre-training on English BERT models is consistently superior to that of multilingual BERT models.

• We present baseline results for the Malayalam-English (Chakravarthi et al., 2020a) and Tamil-English (Chakravarthi et al., 2020b) datasets for a three-class sentiment classification problem, classifying each sentence into positive, negative and neutral sentiments. Our results can be used as baselines for future work. Previous work (Chakravarthi et al., 2020c) on these datasets treated the problem as a five-class classification problem.
Related Work

Various datasets for code-switched sentiment analysis have been released in the last few years, some of which have also been accompanied by shared tasks (Patra et al., 2018) (Patwa et al., 2020) (Chakravarthi et al., 2020c) in the respective languages. The shared task released with (Chakravarthi et al., 2020c) focused on sentiment analysis for Tamil-English and Malayalam-English datasets. The best performing systems for both these tasks were built on top of BERT variants. (Chakravarthi et al., 2020b) and (Chakravarthi et al., 2020a) have provided baseline results for sentiment analysis on datasets of YouTube comments in Tamil-English and Malayalam-English respectively, using various classification algorithms including Support Vector Machines, Decision Trees, K-Nearest Neighbours, BERT-based models, etc. In our paper, we use BERT, mBERT (Devlin et al., 2018), XLM-RoBERTa (Conneau et al., 2019) and a RoBERTa-based sentiment classification model (Barbieri et al., 2020) for sentiment classification.

Sentiment classification is usually done by classifying a query into one of three sentiments: positive, negative and neutral. In this work, we perform a three-class classification to be able to leverage the power of the TweetEval sentiment classifier (Barbieri et al., 2020), which was trained on a dataset of English tweets. The TweetEval model is a monolingual model trained on an out-of-domain dataset for our task (the Tamil-English and Malayalam-English datasets are made by scraping YouTube comments). We also use mBERT and XLM-RoBERTa (Conneau et al., 2019) models for classification, which are trained on more than 100 languages and are thus able to leverage the power of cross-lingual transfer for sentiment classification. Previous work (Jayanthi and Gupta, 2021) has also shown that mBERT and XLM-RoBERTa based models achieve state-of-the-art performance when dealing with code-switched Dravidian languages.
Language   Positive  Negative  Neutral
Tam-Eng    10,559    2,037     850
Mal-Eng    2,811     738       1,903
Hinglish   6,616     5,892     7,492

Table 1: Dataset statistics for the Tamil-English (Chakravarthi et al., 2020b), Malayalam-English (Chakravarthi et al., 2020a) and Hinglish (Sentimix) (Patwa et al., 2020) datasets. The numbers depict the entire dataset, which is then divided into train, development and test sets by the respective authors.
Dataset

In this paper, we primarily test our models on Tamil-English (Chakravarthi et al., 2020b) and Malayalam-English (Chakravarthi et al., 2020a). The datasets were collected by scraping YouTube comments on Tamil and Malayalam movies. All the sentences in the datasets are in the Latin script. We also use the Sentimix Hinglish dataset (Patwa et al., 2020) to leverage cross-lingual transfer from Hinglish. The dataset statistics are summarized in Table 1. The numbers shown are for the entire dataset, which was then split into train, development and test sets by the respective authors.
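As a rough illustration of the preprocessing implied above, the sketch below loads a sentiment-labelled split and maps the three sentiment classes to integer labels. The file name, column names and label strings are hypothetical placeholders and may differ from the released datasets.

```python
import pandas as pd

# Hypothetical label strings; the released files may use different names.
LABEL2ID = {"positive": 0, "negative": 1, "neutral": 2}

def load_split(path):
    """Load a tab-separated split with 'text' and 'label' columns."""
    df = pd.read_csv(path, sep="\t")
    # Keep only the three classes used in this paper.
    df = df[df["label"].isin(LABEL2ID.keys())]
    df["label_id"] = df["label"].map(LABEL2ID)
    return df["text"].tolist(), df["label_id"].tolist()

# Hypothetical file name for the Tamil-English training split.
train_texts, train_labels = load_split("tamil_english_train.tsv")
```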
Models

We train sentiment analysis models on top of four BERT variants:
• BERT (Devlin et al., 2018): The original BERT model was trained using Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) objectives on English Wikipedia and BookCorpus (Zhu et al., 2015) and has approximately 110M parameters. We use the uncased-base implementation from the Hugging Face library for our work.

• mBERT (Devlin et al., 2018): This is a multilingual BERT model trained on 104 languages and has approximately 179M parameters. We again use the uncased-base model for our work.
• XLM-RoBERTa (Conneau et al., 2019): This is a multilingual RoBERTa model trained on 100 languages and has approximately 270M parameters. The RoBERTa models were an incremental improvement over BERT models with optimized hyperparameters; the most significant change was the removal of the NSP objective used while training BERT. The XLM-RoBERTa model is trained on a large multilingual corpus of 2.5TB of web-crawled data. We use the base XLM-RoBERTa model.
• TweetEval, a RoBERTa-based sentiment classifier: (Barbieri et al., 2020) present a benchmark of tweet classification models trained on top of RoBERTa. We use its sentiment classification model, which is referred to as the TweetEval model in this paper. The sentiment classification model was trained on a dataset of 60M English tweets. The underlying RoBERTa model was trained on English data and has 160M parameters.

We use the Hugging Face library implementation of these models. We expect the mBERT and XLM-RoBERTa based models to leverage cross-lingual transfer from the large pool of languages they are trained on. The TweetEval model was trained on a dataset of English tweets for the task of sentiment analysis, but is still out-of-domain for the YouTube comments datasets. It is important to note that all our chosen models either have a different domain, language or task on which they were initially trained and hence are not directly suitable for the task of code-switched sentiment analysis of YouTube comments.
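For concreteness, the following sketch shows one way to instantiate the four classifiers as three-class sequence classification models with the Hugging Face transformers library. The checkpoint names are the standard Hub identifiers we assume correspond to the models described above; the exact checkpoints used in our experiments may differ.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed Hugging Face Hub checkpoints for the four BERT variants.
CHECKPOINTS = {
    "bert": "bert-base-uncased",
    "mbert": "bert-base-multilingual-uncased",
    "xlm-roberta": "xlm-roberta-base",
    # RoBERTa model fine-tuned for sentiment in the TweetEval benchmark.
    "tweeteval": "cardiffnlp/twitter-roberta-base-sentiment",
}

def build_model(name, num_labels=3):
    """Load a tokenizer and a three-way sentiment classification head."""
    ckpt = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSequenceClassification.from_pretrained(
        ckpt,
        num_labels=num_labels,
        # The TweetEval checkpoint already has a 3-class head; for the
        # other checkpoints a fresh classification head is initialized.
        ignore_mismatched_sizes=True,
    )
    return tokenizer, model

tokenizer, model = build_model("tweeteval")
```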
Experiments

We evaluate our results based on weighted average scores of precision, recall and F1. When calculating the weighted average, the precision, recall and F1 scores are calculated for each class and a weighted average is taken based on the number of samples in each class. This metric is apt as the datasets used are unbalanced. We use the same sklearn implementation of the weighted average metric (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) as used in (Chakravarthi et al., 2020b) and (Chakravarthi et al., 2020a). All the numbers shown in the paper are weighted average scores.

In this paper, we aim to understand the effects of task-specific pre-training and cross-lingual transfer in improving the performance of BERT-based models on code-switched datasets.
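The snippet below is a minimal illustration of the weighted-average metric described above, using scikit-learn's classification_report with made-up gold labels and predictions; the "weighted avg" entry is the score we report throughout the paper.

```python
from sklearn.metrics import classification_report

# Toy gold labels and predictions for the three sentiment classes.
y_true = ["positive", "positive", "negative", "neutral", "positive", "neutral"]
y_pred = ["positive", "negative", "negative", "neutral", "positive", "positive"]

# output_dict=True exposes the weighted average precision/recall/F1,
# where each class is weighted by its number of gold samples (support).
report = classification_report(y_true, y_pred, output_dict=True)
weighted = report["weighted avg"]
print(weighted["precision"], weighted["recall"], weighted["f1-score"])
```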
Model         Precision  Recall  F1
Baseline                         0.76
BERT          0.74       0.78    0.75
mBERT         0.75       0.75    0.75
XLM-RoBERTa   0.76       0.76    0.76
TweetEval     0.77       0.77    0.77

Table 2: Monolingual results for Malayalam-English. All scores are weighted average scores.
Model         Precision  Recall  F1
Baseline      0.66       0.79    0.66
BERT          0.74       0.74    0.74
mBERT         0.75       0.77    0.76
XLM-RoBERTa   0.75       0.78    0.76
TweetEval     0.76       0.79    0.76

Table 3: Monolingual results for Tamil-English. All scores are weighted average scores.
Task-specific pre-training refers to pre-training a model on the same task for which a larger dataset is available and then fine-tuning the so-obtained model on the target dataset, which is usually much smaller. In this case, we use the TweetEval model, trained for the task of sentiment analysis on a large English corpus, which we fine-tune on the Dravidian code-mixed datasets.
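A minimal sketch of this setup with the Hugging Face Trainer is shown below: we start from the TweetEval sentiment checkpoint and fine-tune its existing three-class head on a target code-switched dataset. The checkpoint name, hyperparameters and the load_split helper (from the dataset sketch above) are illustrative assumptions, not the exact configuration used in our experiments.

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

ckpt = "cardiffnlp/twitter-roberta-base-sentiment"  # assumed TweetEval checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=3)

class SentimentDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and integer labels for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# load_split is the hypothetical loader from the dataset section.
train_ds = SentimentDataset(*load_split("tamil_english_train.tsv"))

args = TrainingArguments(output_dir="te-tamil", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```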
Cross-lingual transfer commonly refers to the phenomenon of leveraging features and contextual information learnt from a different set of languages for a task with a new target language. Leveraging cross-lingual transfer is one of the main reasons behind training multilingual BERT models, where we expect multilingual BERT models to perform better on an unknown language when compared to an English BERT model.
Results

We first present monolingual sentiment classification results for the Malayalam-English and Tamil-English datasets, as shown in Table 2 and Table 3 respectively. The baseline model for Malayalam-English is based on mBERT, while the baseline model for Tamil-English was a Random Forest classifier, as presented in (Chakravarthi et al., 2020c). The baseline models were trained for a five-way classification problem, hence their weighted average scores have been re-weighted so that they correspond to a three-way classification problem. We also train an English BERT model to act as a baseline for understanding the effects of task-specific pre-training and cross-lingual transfer.
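The re-weighting applied to the five-class baseline scores can be illustrated as follows: given per-class F1 scores and supports, we recompute the weighted average over only the positive, negative and neutral classes. The class names and numbers below are placeholders; the actual per-class baseline scores come from (Chakravarthi et al., 2020c).

```python
# Hypothetical per-class (F1, support) pairs from a five-way classifier.
per_class = {
    "positive": (0.70, 2075), "negative": (0.45, 424), "neutral": (0.30, 173),
    "other_1": (0.25, 377), "other_2": (0.60, 100),
}

# Re-weight the average over the three classes we evaluate on.
keep = ["positive", "negative", "neutral"]
total = sum(per_class[c][1] for c in keep)
f1_three_way = sum(per_class[c][0] * per_class[c][1] for c in keep) / total
print(round(f1_three_way, 2))
```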
Train Language   Test Language: Tamil      Test Language: Malayalam   Test Language: Hinglish
                 mBERT  xlm-r  TE          mBERT  xlm-r  TE           mBERT  xlm-r  TE
English          -      -      0.197       -      -      0.345        -      -      0.506
Hinglish                                                              -      -      -
Tamil            -      -      -           0.389  0.523  0.427        0.379  0.321  0.432
Malayalam                                  -      -      -            0.376  0.321

Table 4: Zero-shot prediction results for the different models trained on the Train Language. Here we only report the weighted average F1 scores. We use '-' to represent cells that are not applicable to cross-lingual transfer. TE stands for the model built on top of the TweetEval sentiment classification model; xlm-r refers to the XLM-RoBERTa model.

We can see that the TweetEval model improves on the baseline results. The improvement is quite significant for Tamil-English. The unusually high improvement on the Tamil-English dataset can be due to the fact that the two models are trained for two different problems: the baseline results are for a five-way classification problem whose F1 scores are re-weighted to include only three classes (positive, negative and neutral), while our models are trained specifically for a three-class classification problem. Our results can provide baselines for future work on sentiment analysis for the Dravidian datasets, which is usually studied as a three-class classification problem. The TweetEval model also performs the best out of all the models tested. We see that the multilingual BERT models perform better than the English BERT model, although the improvement is not very drastic. XLM-RoBERTa consistently performs better than mBERT.

The pre-trained mBERT and XLM-RoBERTa models are trained on multiple languages and in multiple scripts. We expect these models to leverage cross-lingual transfer and perform better than the English BERT model in the code-switched domain. Although we do see an improvement in performance of the multilingual BERT models over the English BERT model, the improvements are not very drastic. On the other hand, we see a consistently larger improvement due to task-specific pre-training when fine-tuning the TweetEval model on the Dravidian language datasets. We hypothesize two possible reasons for cross-lingual transfer being less effective for the Dravidian datasets. The first is that even though the multilingual BERT models were trained on multiple languages, they were trained on languages in their original scripts. The datasets we are considering contain Malayalam and Tamil in Romanized form. Due to this, the multilingual BERT models do not have representations and contexts for Malayalam and Tamil tokens in Romanized form, and they have to learn the representations of these new tokens just as an English BERT model would. The second reason is that although the multilingual BERT models were trained on multiple languages, they were not trained on code-switched data.
Zero-Shot Transfer

In this section, we look at the zero-shot transfer between the different sets of languages for the models used above for sentiment analysis. The results are shown in Table 4. The first column of Table 4 refers to the language the given models were trained on. For Hindi-English (Hinglish), we use the Hinglish Sentimix dataset (Patwa et al., 2020).

We first look at the zero-shot transfer from English to the Tamil-English, Malayalam-English and Hinglish datasets (the entire first row of Table 4). This is the same as looking at the zero-shot performance of the TweetEval model, since it was trained on an English corpus. We see that an English sentiment analysis model has the best zero-shot performance for the Hinglish dataset. We hypothesize that this is because the TweetEval model is an in-domain model for the Hinglish dataset (Patwa et al., 2020), as the Hinglish dataset is also a Twitter corpus. The Malayalam and Tamil datasets are out of domain since they are made from YouTube comments.

We can also see that the best zero-shot transfer results are obtained when the TweetEval model is fine-tuned on a language linguistically closer than English to each of the three test languages. For example, the best zero-shot results on the Tamil dataset are achieved when we fine-tune the TweetEval model on Malayalam. Similarly, the best zero-shot results for Malayalam are achieved when we fine-tune the TweetEval model on Hinglish. Finally, the best zero-shot results for the Hinglish dataset are achieved when we fine-tune the TweetEval model on Malayalam. These results show that task-specific pre-training is more effective for zero-shot performance and hint at the superiority of task-specific pre-training over cross-lingual transfer.
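The zero-shot evaluation itself is straightforward: a model fine-tuned on one code-switched language is applied, without any further training, to the test set of another, and we report the weighted F1. The sketch below assumes a locally saved fine-tuned checkpoint and reuses the hypothetical load_split helper from earlier; all paths and names are illustrative.

```python
import torch
from sklearn.metrics import f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A TweetEval model fine-tuned on Malayalam-English, saved locally (assumed path).
ckpt = "te-malayalam/checkpoint-final"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()

# Zero-shot: evaluate on the Tamil-English test set without further training.
texts, labels = load_split("tamil_english_test.tsv")
preds = []
with torch.no_grad():
    for i in range(0, len(texts), 32):
        batch = tokenizer(texts[i:i + 32], truncation=True, padding=True,
                          max_length=128, return_tensors="pt")
        logits = model(**batch).logits
        preds.extend(logits.argmax(dim=-1).tolist())

print("weighted F1:", f1_score(labels, preds, average="weighted"))
```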
Model                 Precision  Recall  F1
XLM-RoBERTa           0.76       0.76    0.76
TweetEval             0.77       0.77    0.77
TweetEval + Hinglish

Table 5: Comparison between the performance of XLM-RoBERTa, TweetEval and the TweetEval model pre-trained on Hinglish data for the Malayalam-English dataset.
Model                 Precision  Recall  F1
XLM-RoBERTa           0.75       0.78    0.76
TweetEval             0.76       0.79    0.76
TweetEval + Hinglish

Table 6: Comparison between the performance of XLM-RoBERTa, TweetEval and the TweetEval model pre-trained on Hinglish data for the Tamil-English dataset.

In the next experiment, we try to leverage cross-lingual transfer along with task-specific pre-training. To do this, we first fine-tune the TweetEval model on the Hinglish dataset. We expect this fine-tuned model to begin to learn code-switching and to recognize new tokens in Hinglish which are not a part of its vocabulary. To make sure that the model does not overfit on the Hinglish dataset, we only fine-tune it on the Hinglish dataset for one epoch. We then fine-tune this model on the target datasets, Tamil-English and Malayalam-English. The results for this experiment are shown in Table 5 and Table 6. Though the combined models show improvements for both datasets, the improvements are not statistically significant. Rigorous experiments to explore this idea will be part of our future work.
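A compressed sketch of this two-stage procedure is shown below, reusing the SentimentDataset wrapper and hypothetical file names from the earlier sketches; the single Hinglish epoch follows the description above, while the remaining hyperparameters are illustrative assumptions.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

ckpt = "cardiffnlp/twitter-roberta-base-sentiment"  # assumed TweetEval checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=3)

# Stage 1: one epoch on Hinglish to expose the model to code-switching.
hinglish_ds = SentimentDataset(*load_split("hinglish_train.tsv"))
stage1_args = TrainingArguments(output_dir="te-hinglish", num_train_epochs=1,
                                per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=stage1_args, train_dataset=hinglish_ds).train()

# Stage 2: fine-tune the same model on the target Dravidian dataset.
target_ds = SentimentDataset(*load_split("malayalam_english_train.tsv"))
stage2_args = TrainingArguments(output_dir="te-hinglish-malayalam", num_train_epochs=3,
                                per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=stage2_args, train_dataset=target_ds).train()
```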
Conclusion

In this paper, we present various experiments to compare the effects of task-specific pre-training and cross-lingual transfer on the performance of sentiment classification models for code-switched data. To do so, we check the performance of four BERT models on two code-switched languages, Malayalam-English and Tamil-English. We find that task-specific pre-training is superior to cross-lingual transfer for our chosen code-switched datasets.

The results presented in this paper for four different BERT models can be used as baselines for future work on sentiment analysis for the chosen datasets. We also present the first results for the TweetEval (Barbieri et al., 2020) sentiment classification model on code-switched data.
References
Francesco Barbieri, Jose Camacho-Collados, Leonardo Neves, and Luis Espinosa-Anke. 2020. TweetEval: Unified benchmark and comparative evaluation for tweet classification. arXiv preprint arXiv:2010.12421.

Bharathi Raja Chakravarthi, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly, and John P. McCrae. 2020a. A sentiment analysis dataset for code-mixed Malayalam-English. arXiv preprint arXiv:2006.00210.

Bharathi Raja Chakravarthi, Vigneshwaran Muralidaran, Ruba Priyadharshini, and John P. McCrae. 2020b. Corpus creation for sentiment analysis in code-mixed Tamil-English text. arXiv preprint arXiv:2006.00206.

B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, and J. P. McCrae. 2020c. Overview of the track on sentiment analysis for Dravidian languages in code-mixed text. In Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), CEUR Workshop Proceedings, Hyderabad, India.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Sai Muralidhar Jayanthi and Akshat Gupta. 2021. SJ_AJ@DravidianLangTech-EACL2021: Task-adaptive pre-training of multilingual BERT models for offensive language identification. arXiv preprint arXiv:2102.01051.

Zhu Nanli, Zou Ping, Li Weiguo, and Cheng Meng. 2012. Sentiment analysis: A literature review. Pages 572–576. IEEE.

Braja Gopal Patra, Dipankar Das, and Amitava Das. 2018. Sentiment analysis of code-mixed Indian languages: An overview of SAIL code-mixed shared task @ ICON-2017. arXiv preprint arXiv:1803.06745.

Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, Srinivas PYKL, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, and Amitava Das. 2020. SemEval-2020 Task 9: Overview of sentiment analysis of code-mixed tweets. arXiv e-prints, arXiv–2008.

Sunayana Sitaram, Khyathi Raghavi Chandu, Sai Krishna Rallabandi, and Alan W. Black. 2019. A survey of code-switched speech and language processing. arXiv preprint arXiv:1904.00784.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision.