Task-Specific Pre-Training and Cross Lingual Transfer for Code-Switched Data
Akshat Gupta, Sai Krishna Rallabandi, Alan Black
Carnegie Mellon University
[email protected], {srallaba, awb}@cs.cmu.edu

Abstract
Using task-specific pre-training and leveraging cross-lingual transfer are two of the most popular ways to handle code-switched data. In this paper, we aim to compare the effects of both for the task of sentiment analysis. We work with two Dravidian code-switched languages, Tamil-English and Malayalam-English, and four different BERT-based models. We compare the effects of task-specific pre-training and cross-lingual transfer and find that task-specific pre-training results in superior zero-shot and supervised performance when compared to the performance achieved by leveraging cross-lingual transfer from multilingual BERT models.
Introduction

Code-switching is a common phenomenon which occurs in many bilingual and multilingual communities around the world. It is characterized by the usage of more than one language in a single utterance (Sitaram et al., 2019). India is one such place with many communities using different code-switched languages, with Hinglish (code-switched Hindi-English) being the most popular one. Dravidian languages like Tamil, Kannada, Malayalam and Telugu are also usually code-mixed with English. These code-mixed languages are commonly used to interact with social media, which is why it is essential to build systems that are able to handle code-switched data.

Sentiment analysis poses the task of inferring the opinion and emotions of a text query (usually social media comments or tweets) as a classification problem, where each query is classified as having a positive, negative or neutral sentiment (Nanli et al., 2012). It has various applications, like understanding the sentiment of tweets, Facebook and YouTube comments, understanding product reviews, etc. In times of the pandemic, when the entire world is living online, it has become an even more important tool. Robust systems for sentiment analysis already exist for high-resource languages like English (Barbieri et al., 2020), yet progress needs to be made for lower-resource and code-switched languages.

One of the major bottlenecks in dealing with code-switched languages is the lack of annotated datasets in the code-mixed languages. To alleviate this problem for Dravidian languages, datasets have been released for sentiment analysis in Tamil-English (Chakravarthi et al., 2020b) and Malayalam-English (Chakravarthi et al., 2020a). Various shared tasks (Patwa et al., 2020) (Chakravarthi et al., 2020c) have also been accompanied by the release of these datasets to advance research in this domain.

Models built on top of contextualized word embeddings have recently seen a huge amount of success and are used in most state-of-the-art systems. Systems built on top of BERT (Devlin et al., 2018) and its multilingual variants mBERT and XLM-RoBERTa (Conneau et al., 2019) have been the top ranking systems in both the above competitions.

In this paper, we train four different BERT-based models for sentiment analysis on two different code-switched Dravidian languages. The main contributions of this paper are:

• Comparing the effects of task-specific pre-training and cross-lingual transfer for code-switched sentiment analysis. In our experiments, we find that the performance with task-specific pre-training on English BERT models is consistently superior to that of multilingual BERT models.

• We present baseline results for the Malayalam-English (Chakravarthi et al., 2020a) and Tamil-English (Chakravarthi et al., 2020b) datasets for a three-class sentiment classification problem, classifying each sentence into positive, negative and neutral sentiments. Our results can be used as baselines for future work. Previous work (Chakravarthi et al., 2020c) on these datasets treated the problem as a five-class classification problem.
Related Work

Various datasets for code-switched sentiment analysis have been released in the last few years, some of which have also been accompanied by shared tasks (Patra et al., 2018) (Patwa et al., 2020) (Chakravarthi et al., 2020c) in the respective languages. The shared task released with (Chakravarthi et al., 2020c) focused on sentiment analysis for Tamil-English and Malayalam-English datasets. The best performing systems for both these tasks were built on top of BERT variants. (Chakravarthi et al., 2020b) and (Chakravarthi et al., 2020a) have provided baseline results for sentiment analysis on datasets of YouTube comments in Tamil-English and Malayalam-English respectively, using various classification algorithms including Support Vector Machines, Decision Trees, K-Nearest Neighbours, BERT-based models, etc. In our paper, we use BERT, mBERT (Devlin et al., 2018), XLM-RoBERTa (Conneau et al., 2019) and a RoBERTa-based sentiment classification model (Barbieri et al., 2020) for sentiment classification.

Sentiment classification is usually done by classifying a query into one of three sentiments: positive, negative and neutral. In this work, we perform a three-class classification to be able to leverage the power of the TweetEval sentiment classifier (Barbieri et al., 2020), which was trained on a dataset of English tweets. The TweetEval model is a monolingual model trained on an out-of-domain dataset for our task (the Tamil-English and Malayalam-English datasets are made by scraping YouTube comments). We also use mBERT and XLM-RoBERTa (Conneau et al., 2019) models for classification, which are trained on more than 100 languages and are thus able to leverage the power of cross-lingual transfer for sentiment classification. Previous work (Jayanthi and Gupta, 2021) has also shown that mBERT and XLM-RoBERTa based models achieve state-of-the-art performance when dealing with code-switched Dravidian languages.
Language   Positive  Negative  Neutral
Tam-Eng    10,559    2,037     850
Mal-Eng    2,811     738       1,903
Hinglish   6,616     5,892     7,492

Table 1: Dataset statistics for the Tamil-English (Chakravarthi et al., 2020b), Malayalam-English (Chakravarthi et al., 2020a) and Hinglish (Sentimix) (Patwa et al., 2020) datasets. The numbers depict the entire dataset, which is then divided into train, development and test sets by the respective authors.
Dataset

In this paper, we primarily test our models on Tamil-English (Chakravarthi et al., 2020b) and Malayalam-English (Chakravarthi et al., 2020a). The datasets were collected by scraping YouTube comments on Tamil and Malayalam movies. All the sentences in the datasets are in the Latin script. We also use the Sentimix Hinglish dataset (Patwa et al., 2020) to leverage cross-lingual transfer from Hinglish. The dataset statistics are summarized in Table 1. The numbers shown are for the entire dataset, which was then split into train, development and test sets by the respective authors.
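As a rough illustration of the preprocessing implied above, the sketch below loads a sentiment-labelled split and maps the three sentiment classes to integer labels. The file name, column names and label strings are hypothetical placeholders and may differ from the released datasets.

```python
import pandas as pd

# Hypothetical label strings; the released files may use different names.
LABEL2ID = {"positive": 0, "negative": 1, "neutral": 2}

def load_split(path):
    """Load a tab-separated split with 'text' and 'label' columns."""
    df = pd.read_csv(path, sep="\t")
    # Keep only the three classes used in this paper.
    df = df[df["label"].isin(LABEL2ID.keys())]
    df["label_id"] = df["label"].map(LABEL2ID)
    return df["text"].tolist(), df["label_id"].tolist()

# Hypothetical file name for the Tamil-English training split.
train_texts, train_labels = load_split("tamil_english_train.tsv")
```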
Models

We train sentiment analysis models on top of four BERT variants:
• BERT (Devlin et al., 2018): The original BERT model was trained using Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) objectives on English Wikipedia and BookCorpus (Zhu et al., 2015) and has approximately 110M parameters. We use the uncased-base implementation from the Hugging Face library for our work.

• mBERT (Devlin et al., 2018): This is a multilingual BERT model trained on 104 languages and has approximately 179M parameters. We again use the uncased-base model for our work.
• XLM-RoBERTa (Conneau et al., 2019): This is a multilingual RoBERTa model trained on 100 languages and has approximately 270M parameters. The RoBERTa models were an incremental improvement over BERT models with optimized hyperparameters; the most significant change was the removal of the NSP objective used while training BERT. The XLM-RoBERTa model is trained on a large multilingual corpus of 2.5TB of web-crawled data. We use the base XLM-RoBERTa model.
• TweetEval, a RoBERTa-based sentiment classifier: (Barbieri et al., 2020) present a benchmark of tweet classification models trained on top of RoBERTa. We use its sentiment classification model, which is referred to as the TweetEval model in this paper. The sentiment classification model was trained on a dataset of 60M English tweets. The underlying RoBERTa model was trained on English data and has 160M parameters.

We use the Hugging Face library implementation of these models. We expect the mBERT and XLM-RoBERTa based models to leverage cross-lingual transfer from the large pool of languages they are trained on. The TweetEval model was trained on a dataset of English tweets for the task of sentiment analysis, but is still out-of-domain for the YouTube comments datasets. It is important to note that all our chosen models either have a different domain, language or task on which they were initially trained and hence are not directly suitable for the task of code-switched sentiment analysis of YouTube comments.
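For concreteness, the following sketch shows one way to instantiate the four classifiers as three-class sequence classification models with the Hugging Face transformers library. The checkpoint names are the standard Hub identifiers we assume correspond to the models described above; the exact checkpoints used in our experiments may differ.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed Hugging Face Hub checkpoints for the four BERT variants.
CHECKPOINTS = {
    "bert": "bert-base-uncased",
    "mbert": "bert-base-multilingual-uncased",
    "xlm-roberta": "xlm-roberta-base",
    # RoBERTa model fine-tuned for sentiment in the TweetEval benchmark.
    "tweeteval": "cardiffnlp/twitter-roberta-base-sentiment",
}

def build_model(name, num_labels=3):
    """Load a tokenizer and a three-way sentiment classification head."""
    ckpt = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSequenceClassification.from_pretrained(
        ckpt,
        num_labels=num_labels,
        # The TweetEval checkpoint already has a 3-class head; for the
        # other checkpoints a fresh classification head is initialized.
        ignore_mismatched_sizes=True,
    )
    return tokenizer, model

tokenizer, model = build_model("tweeteval")
```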
Experiments

We evaluate our results based on weighted average scores of precision, recall and F1. When calculating the weighted average, the precision, recall and F1 scores are calculated for each class and a weighted average is taken based on the number of samples in each class. This metric is apt as the datasets used are unbalanced. We use the same sklearn implementation of the weighted average metric (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) as used in (Chakravarthi et al., 2020b) and (Chakravarthi et al., 2020a). All the numbers shown in the paper are weighted average scores.

In this paper, we aim to understand the effects of task-specific pre-training and cross-lingual transfer in improving the performance of BERT-based models on code-switched datasets.
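The snippet below is a minimal illustration of the weighted-average metric described above, using scikit-learn's classification_report with made-up gold labels and predictions; the "weighted avg" entry is the score we report throughout the paper.

```python
from sklearn.metrics import classification_report

# Toy gold labels and predictions for the three sentiment classes.
y_true = ["positive", "positive", "negative", "neutral", "positive", "neutral"]
y_pred = ["positive", "negative", "negative", "neutral", "positive", "positive"]

# output_dict=True exposes the weighted average precision/recall/F1,
# where each class is weighted by its number of gold samples (support).
report = classification_report(y_true, y_pred, output_dict=True)
weighted = report["weighted avg"]
print(weighted["precision"], weighted["recall"], weighted["f1-score"])
```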
Model         Precision  Recall  F1
Baseline                         0.76
BERT          0.74       0.78    0.75
mBERT         0.75       0.75    0.75
XLM-RoBERTa   0.76       0.76    0.76
TweetEval     0.77       0.77    0.77

Table 2: Monolingual results for Malayalam-English. All scores are weighted average scores.
Model         Precision  Recall  F1
Baseline      0.66       0.79    0.66
BERT          0.74       0.74    0.74
mBERT         0.75       0.77    0.76
XLM-RoBERTa   0.75       0.78    0.76
TweetEval     0.76       0.79    0.76

Table 3: Monolingual results for Tamil-English. All scores are weighted average scores.
Task-specific pre-training refers to pre-training a model on the same task for which a larger dataset is available and then fine-tuning the so-obtained model on the target dataset, which is usually much smaller. In this case, we use the TweetEval model, trained for the task of sentiment analysis on a large English corpus, which we fine-tune on the Dravidian code-mixed datasets.
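A minimal sketch of this setup with the Hugging Face Trainer is shown below: we start from the TweetEval sentiment checkpoint and fine-tune its existing three-class head on a target code-switched dataset. The checkpoint name, hyperparameters and the load_split helper (from the dataset sketch above) are illustrative assumptions, not the exact configuration used in our experiments.

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

ckpt = "cardiffnlp/twitter-roberta-base-sentiment"  # assumed TweetEval checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=3)

class SentimentDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and integer labels for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# load_split is the hypothetical loader from the dataset section.
train_ds = SentimentDataset(*load_split("tamil_english_train.tsv"))

args = TrainingArguments(output_dir="te-tamil", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```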
Cross-lingual transfer commonly refers to the phenomenon of leveraging features and contextual information learnt from a different set of languages for a task with a new target language. Leveraging cross-lingual transfer is one of the main reasons behind training multilingual BERT models, where we expect multilingual BERT models to perform better on an unknown language when compared to an English BERT model.
Results

We first present monolingual sentiment classification results for the Malayalam-English and Tamil-English datasets, as shown in Table 2 and Table 3 respectively. The baseline model for Malayalam-English is based on mBERT, while the baseline model for Tamil-English was a Random Forest classifier, as presented in (Chakravarthi et al., 2020c). The baseline models were trained for a five-way classification problem, hence their weighted average scores have been re-weighted so that they correspond to a three-way classification problem. We also train an English BERT model to act as a baseline for understanding the effects of task-specific pre-training and cross-lingual transfer.
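The re-weighting applied to the five-class baseline scores can be illustrated as follows: given per-class F1 scores and supports, we recompute the weighted average over only the positive, negative and neutral classes. The class names and numbers below are placeholders; the actual per-class baseline scores come from (Chakravarthi et al., 2020c).

```python
# Hypothetical per-class (F1, support) pairs from a five-way classifier.
per_class = {
    "positive": (0.70, 2075), "negative": (0.45, 424), "neutral": (0.30, 173),
    "other_1": (0.25, 377), "other_2": (0.60, 100),
}

# Re-weight the average over the three classes we evaluate on.
keep = ["positive", "negative", "neutral"]
total = sum(per_class[c][1] for c in keep)
f1_three_way = sum(per_class[c][0] * per_class[c][1] for c in keep) / total
print(round(f1_three_way, 2))
```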
Train Language   Test Language: Tamil      Test Language: Malayalam   Test Language: Hinglish
                 mBERT  xlm-r  TE          mBERT  xlm-r  TE           mBERT  xlm-r  TE
English          -      -      0.197       -      -      0.345        -      -      0.506
Hinglish                                                              -      -      -
Tamil            -      -      -           0.389  0.523  0.427        0.379  0.321  0.432
Malayalam                                  -      -      -            0.376  0.321

Table 4: Zero-shot prediction results for the different models trained on the Train Language. Here we only report the weighted average F1 scores. We use '-' to represent cells that are not applicable to cross-lingual transfer. TE stands for the model built on top of the TweetEval sentiment classification model; xlm-r refers to the XLM-RoBERTa model.

We can see that the TweetEval model improves on the baseline results. The improvement is quite significant for Tamil-English. The unusually high improvement on the Tamil-English dataset can be due to the fact that the two models are trained for two different problems: the baseline results are for a five-way classification problem whose F1 scores are re-weighted to include only three classes (positive, negative and neutral), while our models are trained specifically for a three-class classification problem. Our results can provide baselines for future work on sentiment analysis for the Dravidian datasets, which is usually studied as a three-class classification problem. The TweetEval model also performs the best out of all the models tested. We see that the multilingual BERT models perform better than the English BERT model, although the improvement is not very drastic. XLM-RoBERTa consistently performs better than mBERT.

The pre-trained mBERT and XLM-RoBERTa models are trained on multiple languages and in multiple scripts. We expect these models to leverage cross-lingual transfer and perform better than the English BERT model in the code-switched domain. Although we do see an improvement in performance of the multilingual BERT models over the English BERT model, the improvements are not very drastic. On the other hand, we see a consistently larger improvement due to task-specific pre-training when fine-tuning the TweetEval model on the Dravidian language datasets. We hypothesize two possible reasons for cross-lingual transfer being less effective for the Dravidian datasets. The first is that even though the multilingual BERT models were trained on multiple languages, they were trained on languages in their original scripts. The datasets we are considering contain Malayalam and Tamil in Romanized form. Due to this, the multilingual BERT models do not have representations and contexts for Malayalam and Tamil tokens in Romanized form, and they have to learn the representations of these new tokens just as an English BERT model would. The second reason is that although the multilingual BERT models were trained on multiple languages, they were not trained on code-switched data.
Zero-Shot Transfer

In this section, we look at the zero-shot transfer between the different sets of languages for the models used above for sentiment analysis. The results are shown in Table 4. The first column of Table 4 refers to the language the given models were trained on. For Hindi-English (Hinglish), we use the Hinglish Sentimix dataset (Patwa et al., 2020).

We first look at the zero-shot transfer from English to the Tamil-English, Malayalam-English and Hinglish datasets (the entire first row of Table 4). This is the same as looking at the zero-shot performance of the TweetEval model, since it was trained on an English corpus. We see that an English sentiment analysis model has the best zero-shot performance for the Hinglish dataset. We hypothesize that this is because the TweetEval model is an in-domain model for the Hinglish dataset (Patwa et al., 2020), as the Hinglish dataset is also a Twitter corpus. The Malayalam and Tamil datasets are out of domain since they are made from YouTube comments.

We can also see that the best zero-shot transfer results are obtained when the TweetEval model is fine-tuned on a language linguistically closer than English to each of the three test languages. For example, the best zero-shot results on the Tamil dataset are achieved when we fine-tune the TweetEval model on Malayalam. Similarly, the best zero-shot results for Malayalam are achieved when we fine-tune the TweetEval model on Hinglish. Finally, the best zero-shot results for the Hinglish dataset are achieved when we fine-tune the TweetEval model on Malayalam. These results show that task-specific pre-training is more effective for zero-shot performance and hint at the superiority of task-specific pre-training over cross-lingual transfer.
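The zero-shot evaluation itself is straightforward: a model fine-tuned on one code-switched language is applied, without any further training, to the test set of another, and we report the weighted F1. The sketch below assumes a locally saved fine-tuned checkpoint and reuses the hypothetical load_split helper from earlier; all paths and names are illustrative.

```python
import torch
from sklearn.metrics import f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A TweetEval model fine-tuned on Malayalam-English, saved locally (assumed path).
ckpt = "te-malayalam/checkpoint-final"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()

# Zero-shot: evaluate on the Tamil-English test set without further training.
texts, labels = load_split("tamil_english_test.tsv")
preds = []
with torch.no_grad():
    for i in range(0, len(texts), 32):
        batch = tokenizer(texts[i:i + 32], truncation=True, padding=True,
                          max_length=128, return_tensors="pt")
        logits = model(**batch).logits
        preds.extend(logits.argmax(dim=-1).tolist())

print("weighted F1:", f1_score(labels, preds, average="weighted"))
```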
Model                 Precision  Recall  F1
XLM-RoBERTa           0.76       0.76    0.76
TweetEval             0.77       0.77    0.77
TweetEval + Hinglish

Table 5: Comparison between the performance of XLM-RoBERTa, TweetEval and the TweetEval model pre-trained on Hinglish data for the Malayalam-English dataset.
Model                 Precision  Recall  F1
XLM-RoBERTa           0.75       0.78    0.76
TweetEval             0.76       0.79    0.76
TweetEval + Hinglish

Table 6: Comparison between the performance of XLM-RoBERTa, TweetEval and the TweetEval model pre-trained on Hinglish data for the Tamil-English dataset.

In the next experiment, we try to leverage cross-lingual transfer along with task-specific pre-training. To do this, we first fine-tune the TweetEval model on the Hinglish dataset. We expect this fine-tuned model to begin to learn code-switching and to recognize new tokens in Hinglish which are not a part of its vocabulary. To make sure that the model does not overfit on the Hinglish dataset, we only fine-tune it on the Hinglish dataset for one epoch. We then fine-tune this model on the target datasets, Tamil-English and Malayalam-English. The results for this experiment are shown in Table 5 and Table 6. Though the combined models show improvements for both datasets, the improvements are not statistically significant. Rigorous experiments to explore this idea will be part of our future work.
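A compressed sketch of this two-stage procedure is shown below, reusing the SentimentDataset wrapper and hypothetical file names from the earlier sketches; the single Hinglish epoch follows the description above, while the remaining hyperparameters are illustrative assumptions.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

ckpt = "cardiffnlp/twitter-roberta-base-sentiment"  # assumed TweetEval checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=3)

# Stage 1: one epoch on Hinglish to expose the model to code-switching.
hinglish_ds = SentimentDataset(*load_split("hinglish_train.tsv"))
stage1_args = TrainingArguments(output_dir="te-hinglish", num_train_epochs=1,
                                per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=stage1_args, train_dataset=hinglish_ds).train()

# Stage 2: fine-tune the same model on the target Dravidian dataset.
target_ds = SentimentDataset(*load_split("malayalam_english_train.tsv"))
stage2_args = TrainingArguments(output_dir="te-hinglish-malayalam", num_train_epochs=3,
                                per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=stage2_args, train_dataset=target_ds).train()
```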
Conclusion

In this paper, we present various experiments to compare the effects of task-specific pre-training and cross-lingual transfer on the performance of sentiment classification models for code-switched data. To do so, we check the performance of four BERT models on two code-switched languages, Malayalam-English and Tamil-English. We find that task-specific pre-training is superior to cross-lingual transfer for our chosen code-switched datasets.

The results presented in this paper for four different BERT models can be used as baselines for future work on sentiment analysis for the chosen datasets. We also present the first results for the TweetEval (Barbieri et al., 2020) sentiment classification model on code-switched data.
References
Francesco Barbieri, Jose Camacho-Collados, Leonardo Neves, and Luis Espinosa-Anke. 2020. TweetEval: Unified benchmark and comparative evaluation for tweet classification. arXiv preprint arXiv:2010.12421.

Bharathi Raja Chakravarthi, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly, and John P. McCrae. 2020a. A sentiment analysis dataset for code-mixed Malayalam-English. arXiv preprint arXiv:2006.00210.

Bharathi Raja Chakravarthi, Vigneshwaran Muralidaran, Ruba Priyadharshini, and John P. McCrae. 2020b. Corpus creation for sentiment analysis in code-mixed Tamil-English text. arXiv preprint arXiv:2006.00206.

B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, and J. P. McCrae. 2020c. Overview of the track on sentiment analysis for Dravidian languages in code-mixed text. In Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), CEUR Workshop Proceedings, Hyderabad, India.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Sai Muralidhar Jayanthi and Akshat Gupta. 2021. SJ_AJ@DravidianLangTech-EACL2021: Task-adaptive pre-training of multilingual BERT models for offensive language identification. arXiv preprint arXiv:2102.01051.

Zhu Nanli, Zou Ping, Li Weiguo, and Cheng Meng. 2012. Sentiment analysis: A literature review. Pages 572–576. IEEE.

Braja Gopal Patra, Dipankar Das, and Amitava Das. 2018. Sentiment analysis of code-mixed Indian languages: An overview of SAIL code-mixed shared task @ ICON-2017. arXiv preprint arXiv:1803.06745.

Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, Srinivas PYKL, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, and Amitava Das. 2020. SemEval-2020 Task 9: Overview of sentiment analysis of code-mixed tweets. arXiv e-prints, arXiv–2008.

Sunayana Sitaram, Khyathi Raghavi Chandu, Sai Krishna Rallabandi, and Alan W. Black. 2019. A survey of code-switched speech and language processing. arXiv preprint arXiv:1904.00784.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision.