Sentiment Analysis of Persian-English Code-mixed Texts
1st Nazanin Sabri
Electrical and Computer Engineering, University of Tehran
Tehran, [email protected]

2nd Ali Edalat
Electrical and Computer Engineering, University of Tehran
Tehran, [email protected]

3rd Behnam Bahrak
Electrical and Computer Engineering, University of Tehran
Tehran, [email protected]
Abstract—The rapid production of data on the internet and the need to understand how users are feeling, from both a business and a research perspective, have prompted the creation of numerous automatic monolingual sentiment detection systems. More recently, however, due to the unstructured nature of data on social media, we are observing more instances of multilingual and code-mixed texts. This development in content type has created a new demand for code-mixed sentiment analysis systems. In this study we collect, label, and thus create a dataset of Persian-English code-mixed tweets. We then proceed to introduce a model that uses BERT pretrained embeddings as well as translation models to automatically learn the polarity scores of these tweets. Our model outperforms baseline models that use Naïve Bayes and Random Forest methods.
Index Terms—code-mixed language, sentiment analysis, Persian-English text
I. INTRODUCTION
Online social networking platforms place very few constraints on the type and structure of the textual content posted online. The unstructured nature of this content results in text that can be far from the original grammatical, syntactic, or semantic rules of the language it is written in [1], [2]. One deviation that has been observed quite often is the use of words from more than one language in a text [3], commonly known as "code-mixed" text. These texts are written in language A but include one or more words of language B (either written in the official alphabet of language B or transliterated to language A).
In this study, we investigate code-mixed Persian-English data and perform sentiment analysis on these texts. The reason why sentiment analysis of these texts differs from that of text written purely in the Persian language is that the words carrying the emotional content of the text could be written in English, which would make Persian-only sentiment analysis models unable to produce the correct outputs. Grammatical differences could also render such monolingual models useless [4]. The difficulties of this task have been shown in other language pairs [5], [6]. Our reasoning behind choosing English as the second language is the prominence of this language overall, but more importantly among Persian speakers.
Since Persian is a low-resource language, in order to perform the aforementioned task, we first needed to create a dataset of texts and label those texts with the correct sentiment scores. Thus, we begin by using the Twitter API and searching for a list of "Finglish" words (English words written using the Persian/Farsi alphabet). After the data was collected, two annotators labeled all 3,640 tweets.
A third annotator was then added to the project to label the tweets on which the previous two annotators did not agree.
After the dataset collection and creation was completed, an ensemble model was created and used to detect the sentiment scores of the texts.
The rest of this paper is structured as follows: Section II provides a brief overview of related work. Next, we look at our dataset in detail in Section III. Our models, as well as our text cleaning and preparation steps, are described in Section IV. We then report our results in Section V, and the study is concluded in Section VI.

II. RELATED WORK
The prevalence of code-mixed textual data, due to the unstructured and uncontrolled nature of the web as well as social networks, has resulted in a focus on the topic throughout recent years, including multiple shared tasks being defined on the subject [7]–[10].
One of the language pairs that has been the center of attention in code-mixed text analysis is Hindi-English [11]–[16]. With the large population of multilingual individuals in India, such forms of text have become quite common. Other language pairs (such as Bengali-English, Bambara-French, and Tamil-English), however, have been studied as well [5], [17]–[19].
Some studies attempt to solve the issue by hand-engineering features that help in the task. In [20], various features (e.g., the number of code switches in the text) were introduced and employed in a multi-layer perceptron model. The number of sentiment and opinion lexicon entries, the number of uppercase words, and POS tags are among the features that have been utilized.
Other studies try to find methods with less need for feature selection. For instance, cross-lingual word embeddings [21] and subword embeddings [13] have been used to accomplish the task. In the SemEval-2020 task on Spanglish and Hinglish [10], it is reported that BERT and ensemble methods are the most common and successful approaches. In [22], an approach is introduced to help deal with different variations of the same word by substituting words with consideration of their context words. In [23], a benchmark for linguistic code-switching is presented, the aim of which is to enable evaluation of models.
To the best of our knowledge, the dataset annotated and presented as part of this study is the first dataset for the code-mixed Persian-English sentiment analysis task. We also believe that there are no other studies on this specific subset of the topic available.

III. DATA
In this section, we describe the distributions and characteristics of our data. As described in Section I, our dataset consists of 3,640 tweets labeled with polarity values. Our dataset fields include the terms that were searched via the API that resulted in the tweets' retrieval, the text content of the tweets, the three labels assigned to each tweet, and the final label, which is calculated through majority voting.
We selected a list of 44 unique Finglish (English words transliterated to Persian) terms in order to collect this data. Some of the words in the list include: پرفکت (perfect), هپی (happy), بورینگ (boring), and سینگل (single).
In our dataset, 69.2% of the instances received unanimous labels from the first two annotators. A third annotator was then asked to label the rest of the data. In the resulting dataset, 15.7% of the tweets were labeled as positive, 59.7% as negative, and the remaining 21.5% as neutral. Table I displays two examples of annotated data from our dataset. The majority of the data being negative can be explained by two facts. One is that, due to the access restrictions on Twitter in Iran, only 9.24% of Iranians use Twitter [24], and as a result the subset of users on the platform are mostly from a particular belief system [25], which could result in the observed negative opinions. Another reason is that the data collection process was conducted in the last months of 2019, which included the beginning stages of the spread of the Coronavirus in Iran. Even though our keywords did not relate to the Coronavirus in any way, the shift in spirit was observed in our data. We believe, however, that this issue reflects the current state of our society, and thus the dataset can still be used for the task of code-mixed sentiment detection, as the tweets do include the characteristics of code-mixed language and there are enough examples in the dataset to allow the model to learn attributes of polarities other than negative.
To preserve the privacy of the users, all user mentions in the texts have been replaced by @USERMENTION. However, since all tweets were public (at least at the time of collection) and were collected using the official Twitter API, tweet IDs have been provided to allow use in future research should the need arise. This dataset is publicly available on GitHub: https://github.com/nazaninsbr/Persian-English-Code-mixed-Sentiment-Analysis

TABLE I: EXAMPLE ANNOTATED TWEETS FROM OUR DATASET
Tweet                                      Label
(Persian-script code-mixed tweet)          Negative
(Persian-script code-mixed tweet)          Positive
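The two-stage annotation process described above (two annotators label everything; a third breaks disagreements, with the final label set by majority vote) can be sketched as follows. The function names and label strings are illustrative, not taken from the paper's code.

```python
from collections import Counter


def majority_vote(annotations):
    """Return the most frequent label among the given annotations.

    If all labels differ, Counter.most_common returns one of them
    arbitrarily; with two matching annotators this cannot happen here.
    """
    label, _count = Counter(annotations).most_common(1)[0]
    return label


def final_label(ann1, ann2, ann3=None):
    """Two annotators label every tweet; a third is consulted only
    when the first two disagree, as in the annotation process above."""
    if ann1 == ann2:
        return ann1
    return majority_vote([ann1, ann2, ann3])


# Example: the first two annotators disagree, so the third decides.
print(final_label("positive", "negative", "negative"))  # → negative
```

This mirrors why 69.2% of instances needed no third annotator: whenever the first two labels match, the final label is already determined.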
IV. METHOD
In this section we go over our text processing, feature extraction, and data representation methods, as well as our model.
In the text processing step, we aim to create a vectorized representation of the textual input in order to be able to fit the data into our machine learning model. To do so, we take the following steps:
1) Finding the non-Persian words in the sentence: By the definition of our task, we know that the text we are processing includes non-Persian words; however, we do not know which words in the sentence they are. This knowledge is useful because knowing which words they are allows us to use methods such as translation to convert them to their original language. In order to find these words, we use a dataset of Persian words collected from Wikipedia (https://github.com/behnam/persian-words-frequency/blob/master/persian-wikipedia.txt). The huge collection of articles available on Wikipedia ensures that the most frequent words of the language are in the list. We then check every word of the tweet against the list, and if it is not in the list, we add it to our non-Persian word candidates.
2) Translation: Next, we translate the non-Persian words. First, we use an automatic tool, Yandex (https://translate.yandex.com). This tool, however, faces difficulties when asked to translate Twitter-specific slang or expressions. Thus, we use common Twitter expression lists (bit.ly/3h2qyNm) and create a dictionary of our own.
3) Embedding creation: To create an embedding for our textual data, we used the pretrained multilingual BERT [26] model that Google has provided on their GitHub page (https://github.com/google-research/bert). Since the model is multilingual, it allows for the creation of embeddings for words from more than one language, which fits our problem nicely, since code-mixed data includes instances of more than one language. Further, since the model uses subword tokenization and embedding, it allows for a better understanding of slang or other uncommon words that may be made up of better-known subwords.
After our data passes through these steps, it is fed into an ensemble model consisting of three Bidirectional Long Short-Term Memory (Bi-LSTM) networks.
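Steps 1 and 2 above (wordlist lookup followed by dictionary-first translation) can be sketched as below. The wordlist and slang-dictionary contents are placeholder assumptions for illustration; the actual pipeline loads the Wikipedia-derived frequency list and falls back to the Yandex translator for words the dictionary does not cover.

```python
# Placeholder stand-ins: the real wordlist has the most frequent Persian
# words from Wikipedia, and the real dictionary covers common Twitter slang.
persian_words = {"و", "از", "این", "که"}
slang_dict = {"lol": "خنده", "btw": "ضمنا"}


def non_persian_candidates(tokens):
    """Step 1: any token absent from the Persian wordlist is treated
    as a candidate non-Persian (code-mixed) word."""
    return [t for t in tokens if t not in persian_words]


def translate(token):
    """Step 2: try the hand-built slang dictionary first; this is where
    an automatic translator (Yandex in the paper) would otherwise be
    called for the remaining words."""
    return slang_dict.get(token.lower(), token)


# Example: "lol" is flagged as non-Persian and translated via the dictionary.
candidates = non_persian_candidates(["این", "lol", "از"])
translated = [translate(t) for t in candidates]
```

Note the trade-off in step 1: a frequency-based wordlist will mark rare but genuinely Persian words as candidates too, which is why the output is treated as a candidate set rather than a definitive language tag.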
TABLE II: 10-FOLD CROSS-VALIDATION MODEL PERFORMANCE RESULTS. EACH REPORTED METRIC IS CALCULATED BY AVERAGING THE VALUES ACROSS ALL FOLDS.

Model Name     Accuracy  Precision  Recall  F1
Naive Bayes    60.88     69.00      60.88   47.17
Random Forest  62.12     60.37      62.12   57.67
Our Model      66.17     64.16      65.99   63.66

• Our first network is a stacked Bi-LSTM. The model is quite simple and aims to encode the general information available in the sentences.
• Our second model adds an attention mechanism to the Bi-LSTM network, which enables the model to pay more attention to the most important words in the sentence. The attention mechanism also helps with the encoding of long-distance dependencies and information.
• Another method we use to make sure long-distance dependencies are accounted for, and that information is not lost, is the use of pooling layers in our final model.
The final model takes the outputs of all three models and uses a weighted average to produce the output. To find the best weight assigned to the output of each model, we use the optimization algorithms offered by SciPy [27].
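Fitting the ensemble weights with SciPy, as described above, can be framed as a constrained minimization of a validation loss. The predictions, labels, and squared-error objective below are illustrative assumptions, not the paper's actual data or loss; only the overall shape (non-negative weights summing to one, optimized with `scipy.optimize.minimize`) reflects the described approach.

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in validation outputs of the three Bi-LSTM networks
# (rows = models, columns = samples) and stand-in gold scores.
preds = np.array([[0.9, 0.2, 0.7],
                  [0.8, 0.1, 0.6],
                  [0.7, 0.3, 0.9]])
y_true = np.array([1.0, 0.0, 1.0])


def loss(w):
    """Mean squared error of the weighted average of the model outputs."""
    blended = w @ preds
    return np.mean((blended - y_true) ** 2)


w0 = np.full(3, 1 / 3)  # start from a uniform average
res = minimize(loss, w0,
               bounds=[(0.0, 1.0)] * 3,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
weights = res.x  # optimized ensemble weights
```

With bounds and an equality constraint supplied, `minimize` uses SLSQP by default, so the resulting weights stay in [0, 1] and sum to one while doing at least as well as the uniform average on the validation objective.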
V. RESULTS
We used 10-fold cross-validation and averaged the metrics in order to present more reliable values. Additionally, to be able to compare our results, naive Bayes and random forest models were also applied to the data. The results are presented in Table II.
We can see that our ensemble model outperforms the baseline models with regard to all metrics. Through our experiments, we find that the attention and pooling mechanisms both help the performance of the models. We further find that the combination of all three models offers better performance than each individual model, as each model appears to make up for the shortcomings of the others in the ensemble.
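The evaluation protocol above (10-fold cross-validation with each metric averaged across folds) can be sketched as follows. The classifier and the synthetic data are stand-ins, since the paper's dataset and Bi-LSTM ensemble are not reproduced here; only the folding-and-averaging protocol is what the text describes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

# Synthetic 3-class data standing in for the labeled tweets
# (positive / negative / neutral).
X, y = make_classification(n_samples=200, n_classes=3,
                           n_informative=4, random_state=0)

accs, f1s = [], []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    # Train on 9 folds, evaluate on the held-out fold.
    clf = GaussianNB().fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    accs.append(accuracy_score(y[test_idx], pred))
    f1s.append(f1_score(y[test_idx], pred, average="weighted"))

# Each reported metric is the average over all 10 folds.
mean_acc, mean_f1 = np.mean(accs), np.mean(f1s)
```

Stratified folds keep the class ratios of each fold close to the full dataset's, which matters here given the strong skew toward negative labels.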
VI. CONCLUSION
In this study we presented a Persian-English code-mixed dataset. The dataset consists of 3,640 tweets collected through the Twitter API. Each tweet was subsequently labeled with its corresponding polarity score. We then used neural classification models to learn the polarity scores of these data. Our models employed Yandex and dictionary-based translation techniques to translate the code-mixed words in our texts. We further used pretrained BERT embeddings to represent our data. Our model reached an accuracy of 66.17% and an F1 of 63.66 on the data.
Future work could focus on other methods of dealing with the code-mixed words, or on ways to derive word-level polarity scores for code-mixed words from sentence-level scores.
REFERENCES
[1] Monika Arora and Vineet Kansal. Character level embedding with deep convolutional neural network for text normalization of unstructured data for twitter sentiment analysis. Social Network Analysis and Mining, 9(1):12, 2019.
[2] Geetika Gautam and Divakar Yadav. Sentiment analysis of twitter data using machine learning approaches and semantic analysis. Pages 437–442. IEEE, 2014.
[3] Anab Maulana Barik, Rahmad Mahendra, and Mirna Adriani. Normalization of indonesian-english code-mixed twitter data. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 417–424, 2019.
[4] Wei Xu, Bo Han, and Alan Ritter. Proceedings of the workshop on noisy user-generated text. In Proceedings of the Workshop on Noisy User-generated Text, 2015.
[5] Soumil Mandal, Sainik Kumar Mahata, and Dipankar Das. Preparing bengali-english code-mixed corpus for sentiment analysis of indian languages. arXiv preprint arXiv:1803.04000, 2018.
[6] Somnath Banerjee, Sahar Ghannay, Sophie Rosset, Anne Vilnat, and Paolo Rosso. LIMSI UPV at semeval-2020 task 9: Recurrent convolutional neural network for code-mixed sentiment analysis. arXiv preprint arXiv:2008.13173, 2020.
[7] Braja Gopal Patra, Dipankar Das, and Amitava Das. Sentiment analysis of code-mixed indian languages: An overview of SAIL code-mixed shared task @ICON-2017. arXiv preprint arXiv:1803.06745, 2018.
[8] Thamar Solorio, Elizabeth Blair, Suraj Maharjan, Steven Bethard, Mona Diab, Mahmoud Ghoneim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirschberg, Alison Chang, et al. Overview for the first shared task on language identification in code-switched data. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 62–72, 2014.
[9] Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, Srinivas PYKL, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, and Amitava Das. SemEval-2020 task 9: Overview of sentiment analysis of code-mixed tweets. In Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval-2020), Barcelona, Spain, December 2020. Association for Computational Linguistics.
[10] Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, Srinivas PYKL, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, and Amitava Das. SemEval-2020 task 9: Overview of sentiment analysis of code-mixed tweets. arXiv e-prints, pages arXiv–2008, 2020.
[11] Vivek Srivastava and Mayank Singh. IIT Gandhinagar at semeval-2020 task 9: Code-mixed sentiment classification using candidate sentence generation and selection. arXiv preprint arXiv:2006.14465, 2020.
[12] Aditya Joshi, Ameya Prabhu, Manish Shrivastava, and Vasudeva Varma. Towards sub-word level compositions for sentiment analysis of hindi-english code mixed text. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2482–2491, 2016.
[13] Ameya Prabhu, Aditya Joshi, Manish Shrivastava, and Vasudeva Varma. Towards sub-word level compositions for sentiment analysis of hindi-english code mixed text. arXiv preprint arXiv:1611.00472, 2016.
[14] Dinkar Sitaram, Savitha Murthy, Debraj Ray, Devansh Sharma, and Kashyap Dhar. Sentiment analysis of mixed language employing hindi-english code switching. Volume 1, pages 271–276. IEEE, 2015.
[15] Koustav Rudra, Shruti Rijhwani, Rafiya Begum, Kalika Bali, Monojit Choudhury, and Niloy Ganguly. Understanding language preference for expression of opinion and sentiment: What do hindi-english speakers do on twitter? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1131–1141, 2016.
[16] Madan Gopal Jhanwar and Arpita Das. An ensemble model for sentiment analysis of hindi-english code-mixed data. arXiv preprint arXiv:1806.04450, 2018.
[17] Arouna Konate and Ruiying Du. Sentiment analysis of code-mixed bambara-french social media text using deep learning techniques. Wuhan University Journal of Natural Sciences, 23(3):237–243, 2018.
[18] Bharathi Raja Chakravarthi, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly, and John P McCrae. A sentiment analysis dataset for code-mixed malayalam-english. arXiv preprint arXiv:2006.00210, 2020.
[19] Bharathi Raja Chakravarthi, Vigneshwaran Muralidaran, Ruba Priyadharshini, and John P McCrae. Corpus creation for sentiment analysis in code-mixed tamil-english text. arXiv preprint arXiv:2006.00206, 2020.
[20] Souvick Ghosh, Satanu Ghosh, and Dipankar Das. Sentiment identification in code-mixed social media text. arXiv preprint arXiv:1707.01184, 2017.
[21] Pranaydeep Singh and Els Lefever. Sentiment analysis for hinglish code-mixed tweets by means of cross-lingual word embeddings. In Proceedings of the 4th Workshop on Computational Approaches to Code Switching, pages 45–51, 2020.
[22] Rajat Singh, Nurendra Choudhary, and Manish Shrivastava. Automatic normalization of word variations in code-mixed social media text. arXiv preprint arXiv:1804.00804, 2018.
[23] Gustavo Aguilar, Sudipta Kar, and Thamar Solorio. LinCE: A centralized benchmark for linguistic code-switching evaluation. arXiv preprint arXiv:2005.04322, 2020.
[24] Iran's Social Media Statistics (accessed October 25, 2020). https://gs.statcounter.com/social-media-stats/all/iran.
[25] Emad Khazraee. Mapping the political landscape of persian twitter: The case of 2013 presidential election. Big Data & Society, 6(1):2053951719835232, 2019.
[26] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[27] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental algorithms for scientific computing in Python.