Dialect Identification in Nuanced Arabic Tweets Using Farasa Segmentation and AraBERT
Anshul Wadhawan
Flipkart Private Limited [email protected]
Abstract
This paper presents our approach to addressing the EACL WANLP-2021 Shared Task 1: Nuanced Arabic Dialect Identification (NADI). The task is aimed at developing a system that identifies the geographical location (country/province) from which an Arabic tweet, written in Modern Standard Arabic or dialect, originates. We solve the task in two parts. The first part involves pre-processing the provided dataset by cleaning, adding and segmenting various parts of the text. This is followed by carrying out experiments with different versions of two Transformer-based models, AraBERT and AraELECTRA. Our final approach achieved macro F1-scores of 0.216, 0.235, 0.054, and 0.043 in the four subtasks; we ranked second in the MSA identification subtasks and fourth in the DA identification subtasks.
Introduction

Spoken by about 500 million people around the world, Arabic is the largest member of the Semitic language family. The official language of some 22 countries in the Middle East and North Africa (MENA) region, it is not only one of the six official UN languages but also the fourth most used language on the Internet (Guellil et al., 2019). The Middle East contributes 164 million internet users and North Africa 121 million. Compared with other languages, Arabic has received little attention in modern computational linguistics, despite its religious, political and cultural significance. However, with the rapid development of tools and techniques delivering state-of-the-art performance in many language processing tasks, this neglect is being addressed.

The presence of various dialects and a complex morphology are among the distinguishing features of the Arabic language. The informal nature of conversations on social media and the differences between Modern Standard Arabic (MSA) and Dialectal Arabic (DA) both significantly increase this complexity. While DA is used for informal daily communication, MSA is used for formal writing. Social media hosts both forms, the former being the more common. Lack of data is the primary reason why many Arabic dialects remain understudied; with the availability of diverse data from 21 Arab countries, this bottleneck can be diminished. With this goal, the Nuanced Arabic Dialect Identification (NADI) task addresses the automatic detection of the source variety of a given text or speech segment.

Previous work on Arabic dialect identification has focused on coarse-grained regional varieties such as Levantine or Gulf (Elaraby and Abdul-Mageed, 2018; Zaidan and Callison-Burch, 2014; Elfardy and Diab, 2013) or country-level varieties (Bouamor et al., 2019; Zhang and Abdul-Mageed, 2019).
There have also been tasks involving city-level classification on human-translated data (Salameh et al., 2018), and some have addressed country- and province-level classification simultaneously (Abdul-Mageed et al., 2020).

In this paper, we present our process for tackling the WANLP-2021 Shared Task 1. The paper is organised as follows: Section 2 presents the problem statement and details of the provided dataset. Section 3 describes the modularised process that forms our methodology. Section 4 describes the experiments that were conducted, with detailed statistics about the dataset, system settings and results. A brief conclusion, along with the potential prospects of our study, is presented in Section 5.

Task Definition
The WANLP-2021 Shared Task 1 (Abdul-Mageed et al., 2021) is a multi-class classification problem whose aim is to recognize which country or province an Arabic tweet, written in Modern Standard Arabic or dialect, belongs to. The task targets dialects at the province level and focuses on naturally-occurring fine-grained dialects at the sub-country level. The NADI 2021 task promotes efforts towards distinguishing both Modern Standard Arabic (MSA) and dialects (DA) according to their geographical origin, focusing on fine-grained dialects with new datasets. The provided data comes from the domain of Twitter and covers 100 provinces from 21 Arab countries. The task is divided into 4 subtasks as described below:

Subtask 1.1: Country-level MSA identification
Subtask 1.2: Country-level DA identification
Subtask 2.1: Province-level MSA identification
Subtask 2.2: Province-level DA identification

The training dataset has a total of 21,000 tweets; the validation and test datasets have 5,000 tweets each. Every example belongs to one of 100 provinces across 21 Arab countries. An additional 10M unlabeled tweets are provided that can be used in developing systems for either or both of the tasks. F-score, accuracy, precision and recall are the evaluation metrics. However, the official metric of evaluation is the macro-averaged F-score.
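Since the official metric averages per-class F1 with equal weight over every country (or province), a minimal pure-Python sketch of it may help; the function name and example labels are illustrative, not taken from the shared task's evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight,
    so rare classes count as much as frequent ones."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because every class contributes equally, a model that ignores low-resource countries is penalised even if its overall accuracy is high.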
Methodology

We present our methodology in two parts. The first part is data preprocessing. This is followed by experimenting with different Transformer-based models for the task at hand. Both parts are described in detail in the following subsections.
Data Preprocessing

The Transformer-based models that we plan to fine-tune on our dataset are pre-trained on processed rather than raw data. Owing to the variations in how users from different parts of the world express opinions, the tweets fetched from the website clearly reflect these variations, which we observe on randomly checking the given examples. It is common for users to use slang on Twitter and to post non-ASCII characters such as emojis. Spelling errors, user mentions and URLs are also prominent in most users' tweets. These parts of the tweets are uninformative for deciding a tweet's geographical origin; they correspond to noise. Thus, the given dataset is cleaned in the following ways, so that the data used for fine-tuning has a distribution similar to that used for pre-training:

1. Perform Farasa segmentation (for select models only) (Abdelali et al., 2016).
2. Replace all URLs, email addresses and user mentions with dedicated Arabic placeholder tokens.
3. Remove HTML line breaks and markup, unwanted characters such as emoticons, characters repeated more than twice, and extra spaces.
4. Insert whitespace before and after all non-Arabic digits, English digits and alphabet characters, and the two brackets, as well as between words and numbers or numbers and words.
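Steps 2 through 4 above can be sketched with regular expressions. This is an illustrative approximation rather than the exact pipeline, and the Arabic placeholder tokens for links, emails and mentions are assumptions (the original tokens were garbled in extraction):

```python
import re

# Assumed placeholder tokens (Arabic for "link", "mail", "user").
URL_TOKEN, EMAIL_TOKEN, USER_TOKEN = "[رابط]", "[بريد]", "[مستخدم]"

def clean_tweet(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", URL_TOKEN, text)  # step 2: URLs
    text = re.sub(r"\S+@\S+\.\S+", EMAIL_TOKEN, text)         # step 2: emails
    text = re.sub(r"@\w+", USER_TOKEN, text)                  # step 2: mentions
    text = re.sub(r"<[^>]+>", " ", text)                      # step 3: HTML markup, <br>
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)                # step 3: chars repeated > 2
    text = re.sub(r"([0-9]+)", r" \1 ", text)                 # step 4: space around digit runs
    text = re.sub(r"([()])", r" \1 ", text)                   # step 4: space around brackets
    return re.sub(r"\s+", " ", text).strip()                  # step 3: extra spaces
```

Note the ordering: emails are replaced before bare mentions, so that the `@` inside an address is not mistaken for a user mention.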
Transformer-based Models

The domains of speech recognition (Graves et al., 2013) and computer vision (Krizhevsky et al., 2012) have made heavy use of deep learning techniques, producing significant improvements over traditional machine learning. In natural language processing, most deep learning techniques until recently relied on word vector representations (Bengio et al., 2003; Yih et al., 2011; Mikolov et al., 2013) for classification tasks. Lately, Transformer-based approaches have shown significant progress on many NLP benchmarks (Vaswani et al., 2017), including text classification (Chang et al., 2020), owing to their ability to build proficient language models. The pre-training process produces embeddings that are then utilised for downstream tasks.
AraBERT is an Arabic pretrained language model based on Google's BERT architecture (Antoun et al., 2020). There are six versions of the model: AraBERTv0.1-base, AraBERTv0.2-base, AraBERTv0.2-large, AraBERTv1-base, AraBERTv2-base and AraBERTv2-large. For these variants, the pre-training parameters are shown in Table 1.

Model | Size | Params | Pre-Segmentation | Dataset (Sentences / Size / Words)
AraBERTv0.2-base | 543MB | 136M | No | 200M / 77GB / 8.6B
AraBERTv0.2-large | 1.38G | 371M | No | 200M / 77GB / 8.6B
AraBERTv2-base | 543MB | 136M | Yes | 200M / 77GB / 8.6B
AraBERTv2-large | 1.38G | 371M | Yes | 200M / 77GB / 8.6B
AraBERTv0.1-base | 543MB | 136M | No | 77M / 23GB / 2.7B
AraBERTv1-base | 543MB | 136M | Yes | 77M / 23GB / 2.7B
Table 1: Model Pre-training Parameters
AraELECTRA

ELECTRA is a method for self-supervised language representation learning that requires less computation for pre-training Transformers (Antoun et al., 2020). Similar to the discriminator objective of a Generative Adversarial Network, ELECTRA models are trained to distinguish fake input tokens from real ones. AraELECTRA achieves state-of-the-art results on the Arabic QA dataset.

All new AraBERT and AraELECTRA models use the same pretraining data. Before the application of Farasa segmentation, this dataset totals 82,232,988,358 characters, or 8,655,948,860 words, or 200,095,961 lines, with a size of 77GB. Initially, several sources (the unshuffled and filtered OSCAR corpus, Assafir news articles, the Arabic Wikipedia dump from 2020/09/01, the OSIAN corpus and the 1.5B-words Arabic corpus) were crawled to create the pre-training dataset. Later, the thoroughly filtered unshuffled OSCAR corpus was added to the dataset previously used for AraBERTv1, without including the data from the above-mentioned crawled websites, to create the new dataset.
Experiments

We experiment with eight Transformer-based models using the given training and validation sets. To obtain the final test predictions, we fine-tune the most efficient model, chosen on the basis of the validation scores, on the concatenated labeled training and validation splits, and then evaluate the test set with this fine-tuned model. This section presents the country-level dataset distribution, system settings and the results of our research, followed by a descriptive analysis of our system.
Country | DA Train | DA Dev | MSA Train | MSA Dev
Algeria | 1809 | 430 | 1899 | 427
Bahrain | 215 | 52 | 211 | 51
Djibouti | 215 | 27 | 211 | 52
Egypt | 4283 | 1041 | 4220 | 1032
Iraq | 2729 | 664 | 2719 | 671
Jordan | 429 | 104 | 422 | 103
Kuwait | 429 | 105 | 422 | 103
Lebanon | 644 | 157 | 633 | 155
Libya | 1286 | 314 | 1266 | 310
Mauritania | 215 | 53 | 211 | 52
Morocco | 858 | 207 | 844 | 207
Oman | 1501 | 355 | 1477 | 341
Palestine | 428 | 104 | 422 | 102
Qatar | 215 | 52 | 211 | 52
Saudi Arabia | 2140 | 520 | 2110 | 510
Somalia | 172 | 49 | 346 | 63
Sudan | 215 | 53 | 211 | 48
Syria | 1287 | 278 | 1266 | 309
Tunisia | 859 | 173 | 844 | 170
UAE | 642 | 157 | 633 | 154
Yemen | 429 | 105 | 422 | 88
Table 2: Country Level Data Distribution
Parameter | Value
Learning Rate | 1e-5
Epsilon (Adam optimizer) | 1e-8
Maximum Sequence Length | 256
Batch Size (base models) | 40
Batch Size (large models) | 4
Table 3: Parameter Values

Model | Subtask 1.1 (F1 / A) | Subtask 1.2 (F1 / A) | Subtask 2.1 (F1 / A) | Subtask 2.2 (F1 / A)
AraBERTv0.1-base | 0.283 / 0.324 | 0.338 / 0.390 | 0.024 / 0.028 | 0.025 / 0.037
AraBERTv0.2-base | 0.300 / 0.344 | 0.382 / 0.427 | 0.038 / 0.042 | 0.035 / 0.051
AraBERTv0.2-large | 0.304 / 0.343 | 0.362 / 0.413 | 0.022 / 0.030 | 0.029 / 0.041
AraBERTv1-base | 0.281 / 0.318 | 0.306 / 0.377 | 0.032 / 0.040 | 0.019 / 0.033
AraBERTv2-base | 0.309 / 0.347 | 0.389 / 0.432 | 0.029 / 0.038 | 0.034 / 0.048
AraBERTv2-large | 0.315 / 0.346 | 0.416 / 0.450 | 0.001 / 0.010 | 0.001 / 0.010
AraELECTRA-base-generator | 0.106 / 0.231 | 0.165 / 0.285 | 0.005 / 0.018 | 0.006 / 0.022
AraELECTRA-base-discriminator | 0.192 / 0.281 | 0.280 / 0.375 | 0.007 / 0.020 | 0.006 / 0.026
Table 4: Validation Set Results
Subtask | M-F1 | A | P | R
Subtask 1.1 | 0.216 | 0.317 | 0.321 | 0.189
Subtask 1.2 | 0.235 | 0.433 | 0.280 | 0.233
Subtask 2.1 | 0.054 | 0.060 | 0.061 | 0.060
Subtask 2.2 | 0.043 | 0.053 | 0.044 | 0.051
Table 5: Test Set Results
The country-wise distribution of the provided training and validation splits, for both MSA and DA, is shown in Table 2.
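The counts in Table 2 are heavily skewed across countries, which is one reason the official metric is macro-averaged F1 rather than accuracy. A quick check of the imbalance in the DA training split (counts taken from Table 2; only a few countries shown):

```python
# DA training counts for a few countries, from Table 2.
da_train = {"Egypt": 4283, "Iraq": 2729, "Saudi Arabia": 2140,
            "Bahrain": 215, "Somalia": 172}

# Ratio of the largest class (Egypt) to the smallest (Somalia): roughly 25x.
imbalance = max(da_train.values()) / min(da_train.values())
```

Under accuracy alone, a classifier could largely ignore the smallest classes; macro F1 weights Somalia as heavily as Egypt.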
We make use of pre-trained AraBERT and AraELECTRA models, namely bert-base-arabert, bert-base-arabertv01, bert-large-arabertv2, bert-base-arabertv2, bert-large-arabertv02, bert-base-arabertv02, araelectra-base-generator and araelectra-base-discriminator. We use the huggingface API (https://huggingface.co/transformers/) to fetch the pre-trained Transformer-based models and then fine-tune them on our dataset. The hyperparameters used for fine-tuning these models are specified in Table 3.
For all subtasks, the performance of the proposed models on the provided validation set, in terms of accuracy (A) and weighted F1 score (F1), is shown in Table 4. From Table 4, we conclude that:

1. For most subtasks, one of the base models performs almost as well as the best-performing large model.
2. The AraELECTRA models perform worse than all AraBERT models, possibly because their pre-training objective of discriminating replaced tokens, akin to a GAN discriminator, differs from classification-style fine-tuning.
3. AraBERTv2-large outperforms all other models for subtasks 1.1 and 1.2. For subtasks 2.1 and 2.2, AraBERTv0.2-base produces the best results on the validation set.

From the above results, we choose AraBERTv2-large for subtasks 1.1 and 1.2, and AraBERTv0.2-base for subtasks 2.1 and 2.2, as the primary models to fine-tune on the concatenated training and validation set and to carry out inference on the unseen dataset. The final test set results in terms of macro F1 score (M-F1), recall (R), accuracy (A) and precision (P) are specified in Table 5.
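The model-selection step amounts to an argmax over the validation F1 scores per subtask. A sketch using a subset of the numbers from Table 4 (the helper name is illustrative):

```python
# Validation macro F1 per model and subtask, from Table 4 (top contenders only).
val_f1 = {
    "AraBERTv2-large":  {"1.1": 0.315, "1.2": 0.416, "2.1": 0.001, "2.2": 0.001},
    "AraBERTv2-base":   {"1.1": 0.309, "1.2": 0.389, "2.1": 0.029, "2.2": 0.034},
    "AraBERTv0.2-base": {"1.1": 0.300, "1.2": 0.382, "2.1": 0.038, "2.2": 0.035},
}

def best_model(subtask: str) -> str:
    """Pick the model with the highest validation F1 for the given subtask."""
    return max(val_f1, key=lambda m: val_f1[m][subtask])
```

This reproduces the choice described above: the large model wins the country-level subtasks, while a base model wins the province-level ones (where the large model collapsed to near-zero F1).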
Conclusion

In this paper, we present a comprehensive overview of the approach we employed to solve the EACL WANLP-2021 Shared Task 1. We tackle the given problem in two parts. The first part involves pre-processing the given data by modifying various parts of the text. The second part involves experimenting with different versions of two Transformer-based networks, AraBERT and AraELECTRA, both pre-trained on Arabic text. Our final submissions for the four subtasks are based on the best-performing version of the AraBERT model. With macro-averaged F1 score as the final evaluation criterion, our approach fetches a private leaderboard rank of 2 for MSA identification and 4 for DA identification. In the future, we aim to utilise other features relevant to classification tasks, such as URLs and emoticons, and to experiment with ensembles of Transformer-based and word-vector-based input representations.

References
Ahmed Abdelali, Kareem Darwish, Nadir Durrani, and Hamdy Mubarak. 2016. Farasa: A fast and furious segmenter for Arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 11–16, San Diego, California. Association for Computational Linguistics.

Muhammad Abdul-Mageed, Chiyu Zhang, Houda Bouamor, and Nizar Habash. 2020. NADI 2020: The First Nuanced Arabic Dialect Identification Shared Task. In Proceedings of the Fifth Arabic Natural Language Processing Workshop (WANLP 2020), Barcelona, Spain.

Muhammad Abdul-Mageed, Chiyu Zhang, AbdelRahim Elmadany, Houda Bouamor, and Nizar Habash. 2021. NADI 2021: The Second Nuanced Arabic Dialect Identification Shared Task. In Proceedings of the Sixth Arabic Natural Language Processing Workshop (WANLP 2021).

Wissam Antoun, Fady Baly, and Hazem Hajj. 2020. AraBERT: Transformer-based model for Arabic language understanding. In LREC 2020 Workshop Language Resources and Evaluation Conference 11–16 May 2020, page 9.

Wissam Antoun, Fady Baly, and Hazem Hajj. 2020. AraELECTRA: Pre-training text discriminators for Arabic language understanding.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.

Houda Bouamor, Sabit Hassan, and Nizar Habash. 2019. The MADAR shared task on Arabic fine-grained dialect identification. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 199–207.

Wei-Cheng Chang, Hsiang-Fu Yu, Kai Zhong, Yiming Yang, and Inderjit Dhillon. 2020. Taming pretrained transformers for extreme multi-label text classification.

Mohamed Elaraby and Muhammad Abdul-Mageed. 2018. Deep models for Arabic dialect identification on benchmarked data. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pages 263–274, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Heba Elfardy and Mona Diab. 2013. Sentence level dialect identification in Arabic. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 456–461, Sofia, Bulgaria. Association for Computational Linguistics.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks.

Imane Guellil, Houda Saâdane, Faical Azouaou, Billel Gueni, and Damien Nouvel. 2019. Arabic natural language processing: An overview. Journal of King Saud University - Computer and Information Sciences.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, pages 1097–1105, Red Hook, NY, USA. Curran Associates Inc.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111–3119, Red Hook, NY, USA. Curran Associates Inc.

Mohammad Salameh, Houda Bouamor, and Nizar Habash. 2018. Fine-grained Arabic dialect identification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1332–1344.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.

Wen-tau Yih, Kristina Toutanova, John C. Platt, and Christopher Meek. 2011. Learning discriminative projections for text similarity measures. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 247–256, Portland, Oregon, USA. Association for Computational Linguistics.

Omar F. Zaidan and Chris Callison-Burch. 2014. Arabic dialect identification. Computational Linguistics, 40(1):171–202.

Chiyu Zhang and Muhammad Abdul-Mageed. 2019. No army, no navy: BERT semi-supervised learning of Arabic dialects. In Proceedings of the Fourth Arabic Natural Language Processing Workshop.