An open access NLP dataset for Arabic dialects: data collection, labeling, and model construction
ElMehdi Boujou, Hamza Chataoui, Abdellah El Mekki, Saad Benjelloun, Ikram Chairi, Ismail Berrada
Mohamed VI Polytechnic University (UM6P), Lot 660, Hay Moulay Rachid, Ben Guerir 43150, Morocco.
Abstract.
Natural Language Processing (NLP) is today a very active field of research and innovation. Many applications, however, need large data sets for supervised learning, suitably labelled for the training purpose. This includes applications for the Arabic language and its national dialects. However, such open access labeled data sets in Arabic and its dialects are lacking in the Data Science ecosystem, and this lack can be a burden to innovation and research in this field. In this work, we present an open data set of social media content in several Arabic dialects. This data was collected from the Twitter social network and consists of more than 50K tweets in five (5) national dialects. Furthermore, this data was labeled for several applications, namely dialect detection, topic detection and sentiment analysis. We publish this data set openly to encourage innovation and further work in the field of NLP for Arabic dialects and social media. A selection of models built using this data set is presented in this paper, along with their performances.
Keywords:
NLP · Open data · Supervised learning · Arab dialects.
Introduction.
In the last decades, many efforts have been made to enhance Modern Standard Arabic (MSA) Natural Language Processing (NLP). These efforts have led to systems that can serve more than 400 million people, across Africa and Asia, in many tasks such as machine translation, sentiment classification, diacritization, etc. However, in most cases, MSA is only used in formal settings, such as newspapers and professional or academic communication, while Arabic dialects are used in everyday communication.
In terms of resource availability, the majority of Dialectal Arabic (DA) variants are considered low-resource languages and suffer from the scarcity of labeled data needed to build NLP systems. Furthermore, previous works have mainly focused on MSA and some dialects (mainly Egyptian and Middle East region dialects) [9,18,14]. Even though MSA and the DA variants are etymologically close, the use of MSA NLP systems on DA data has shown shallow performance compared to the performance on MSA data [16]. Thus, in order to enhance the performance of MSA NLP systems and make them generalize better on DA input texts, models should be trained on data containing samples from DA. In other words, there is a real need for open data sets with good labeling quality for DA.
Social media (Twitter, Facebook, ...) might be the most convenient source for collecting DA data, as it provides diverse content that reflects the feelings of users across several topics and is written in the users' native and informal Arabic dialect. However, data collected on social media should not be used in its raw form, as it suffers from several issues [6].
We can cite, for example, the problem of code-switching [11], where users tend to borrow words or phrases from other languages (English or French), which introduces noise into the collected data, especially in the case of automatic annotation.
The creation of open access social media data sets, such as the one presented in our paper, aims to enhance innovation and practical applications of NLP for DA, such as social media content analysis for marketing studies, public opinion assessment, or the social sciences.
In this work, we present the first multi-topic and multi-dialect data set, manually annotated for five (5) DA variants and five (5) topics. The data set was collected from Twitter, is publicly available, and is designed to serve several Arabic NLP tasks: sentiment classification, topic classification, and Arabic dialect identification. In order to evaluate the usability of the collected data set, several studies have been performed on it, using different machine learning algorithms such as SVM, the Naive Bayes classifier, etc. The main contributions of this paper are:
• The introduction of an open source multi-topic, multi-dialect corpus for dialectal Arabic.
• The proof of the usability of our data set through the performance evaluation of Arabic dialect identification, Arabic sentiment classification, and Arabic topic categorization systems under different configurations with different machine learning models.
As far as we know, the proposed cross-topic and cross-dialect Arabic data set is the first one that covers these three tasks at the same time. The rest of the paper is organized as follows. Section 2 discusses some related works. Section 3 presents the collected data and the way it was gathered. The labeling of the data is described in section 4 for the three applications. Finally, section 7 concludes this paper and gives some outlooks on future work.
Related work.
With the explosion in the number of social media users in the Arab world in recent years, sentiment analysis in Arabic has gained more attention. However, the publicly available data sets are still limited in terms of coverage, size and number of dialects. Moreover, most of the work on Arabic sentiment analysis focuses on Modern Standard Arabic, although some authors cover the Egyptian and Gulf dialects. On the other hand, analysing the sentiment of an Arab user on social media relies on many factors (e.g. dialect, topic, ...).
Dialectal Arabic sentiment classification.
Recently, several efforts have been made to cover more DA variants. The authors of [2] published an open-access data set for sentiment analysis of the Arabic Algerian dialect. The data consists mainly of 10K comments collected from Facebook pages and manually annotated. The authors ran different experiments using machine learning (e.g. SVM, Naive Bayes) and deep learning methods. Similarly, the authors of [13] published a publicly available data set of 17K comments collected from Facebook that covers the sentiments and opinions expressed in the Tunisian dialect. Finally, the authors of [15] proposed a sentiment classification data set containing 10K comments collected from Twitter from users in multiple Arab countries.
Dialectal Arabic topic classification.
Another direction in research on the analysis of user behaviour is topic classification. The authors of [17] proposed a publicly available data set containing more than 50K articles from Arabic newspaper websites, distributed over 8 categories (e.g. culture, sport, religion, ...). They also proposed data pre-processing pipelines and an evaluation over several machine learning models (e.g. Decision Tree, Naive Bayes, ...). Following the same direction, the authors of [12] introduced SANAD, a freely available data set with more than 190K articles for MSA text categorization; seven categories were collected from news websites. Despite this interest, no one, to the best of our knowledge, has covered topic classification for dialectal Arabic.
Arabic dialect identification.
Due to the variety of dialectal Arabic, studying and analysing the behaviour of Arab users and building NLP systems rely heavily on the dialect of the input text. As a result, considerable work has been conducted recently on the identification of Arabic dialects [7,3], and several approaches have been proposed to perform country-level and province-level dialect identification. Several models have been proposed in this context: some were based on classical machine learning models [4], while others rely on advanced deep learning architectures such as BERT pre-trained models [10,19].
Data collection.
The data was gathered by randomly scraping tweets from active users located in a predefined set of Arab countries, namely: Algeria, Egypt, Lebanon, Tunisia and Morocco. No limits were set on the date of the tweets nor on the exact location within each country. We used the Selenium Python library to automate the web navigation and BeautifulSoup to parse the tweets. The total number of tweets in the data set is 49,306. The distribution of tweets per country is given in table 1.
Table 1: Number of tweets per country
Arabic dialect      Algerian  Lebanese  Moroccan  Tunisian  Egyptian
Number of tweets    13393     14482     9965      8044      7519
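The parsing stage of this collection pipeline can be sketched as follows. This is a minimal, self-contained illustration using only the standard library's html.parser (the paper itself used Selenium together with BeautifulSoup); the class name "tweet-text" is a hypothetical placeholder, not Twitter's actual markup.

```python
# Extract the text of every element carrying a given CSS class from a
# scraped page. Simplified: assumes the target elements are not nested.
from html.parser import HTMLParser

class TweetExtractor(HTMLParser):
    """Collect the text of every element whose class is 'tweet-text'."""
    def __init__(self):
        super().__init__()
        self.in_tweet = False
        self.tweets = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == "tweet-text":
            self.in_tweet = True
            self.tweets.append("")

    def handle_data(self, data):
        if self.in_tweet:
            self.tweets[-1] += data

    def handle_endtag(self, tag):
        self.in_tweet = False

# A static snippet standing in for a page fetched by Selenium.
html_page = """
<div class="tweet-text">first scraped tweet</div>
<div class="other">ignore me</div>
<div class="tweet-text">second scraped tweet</div>
"""
parser = TweetExtractor()
parser.feed(html_page)
print(parser.tweets)  # -> ['first scraped tweet', 'second scraped tweet']
```

In the real pipeline, Selenium drives the browser (scrolling, pagination) and hands each rendered page to the parser.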
Data cleaning.
In this first step, we aimed to remove noise from our data and transform it into a form that is predictable and analyzable by machine learning algorithms. We mostly used regular expressions for the following tasks:
• Remove user accounts (@users) to anonymize the data.
• Remove Twitter keywords and symbols, such as hashtags.
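A minimal sketch of this regex-based cleaning step. The exact patterns used by the authors are not given in the text; the ones below (including URL removal) are illustrative assumptions.

```python
# Regex-based tweet cleaning: drop mentions, hashtag symbols and URLs,
# then normalize whitespace.
import re

def clean_tweet(text):
    text = re.sub(r"@\w+", "", text)          # drop user mentions (anonymization)
    text = re.sub(r"#", "", text)             # drop the hashtag symbol, keep the word
    text = re.sub(r"https?://\S+", "", text)  # drop URLs (assumed extra step)
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(clean_tweet("@user hello #Morocco https://t.co/xyz"))  # -> "hello Morocco"
```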
Stemming and lemmatization.
Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes). For example, removing the suffix "ing" from "changing" yields the stem "chang" (without the "e").
Lemmatization consists of representing words by their root form (lemma): for a verb, its infinitive; for a noun, its singular masculine form. The idea is, once again, to keep only the meaning of the words used in the corpus. For example, "gone" and "went" have the root "go".
We used stemming for words whose root is unknown in Arabic (as happens for some words in the Tunisian dialect).
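A toy illustration of the two notions above. Real systems use proper stemmers and lemmatizers (for English or Arabic-specific tools); the naive suffix-stripping rule and the small lemma lookup below only reproduce the English examples from the text.

```python
# Naive suffix-stripping stemmer: illustrative only, not a real stemmer.
def naive_stem(word, suffixes=("ing", "ed", "s")):
    for s in suffixes:
        if word.endswith(s) and len(word) > len(s) + 2:
            return word[: -len(s)]
    return word

# Lemmatization maps irregular surface forms to a dictionary lemma;
# this tiny lookup is an assumed stand-in for a full lemmatizer.
LEMMAS = {"gone": "go", "went": "go"}

print(naive_stem("changing"))      # -> "chang" (the stem, without the "e")
print(LEMMAS.get("went", "went"))  # -> "go" (the lemma)
```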
Vectorization.
Machine learning algorithms operate on numeric data. Hence, to transform text data into numbers, we used the TF-IDF (Term Frequency-Inverse Document Frequency) and Bag-of-Words (BOW) techniques. The latter is the simplest representation: it turns arbitrary text into fixed-length vectors by counting how many times each word appears. A common BOW vectorizer is CountVectorizer from the Scikit-learn Python package.
TF-IDF converts a collection of raw documents into a matrix of TF-IDF features. It reflects how important a word is to a document in a collection, in our case a collection of tweets. The scores are computed as follows:

W_{x,y} = tf_{x,y} * log(N / df_x)

where tf_{x,y} is the frequency of word x in document y, df_x is the number of documents containing word x, and N is the total number of documents. An example of the resulting vectors is given in table 2.

Table 2: Example of TF-IDF/BOW vectorisation (the vocabulary includes: great, greatest, lasagna, life, love, loved, the, thing, times, ...)

sentence1: 0 0 1 0 1 1 0 1 1 0
sentence2: 0 2 0 0 0 1 1 0 0 0
sentence3: 0 0 1 0 0 1 0 1 1 0
sentence4: 1 0 0 1 0 1 0 0 0 1

Data labeling.
In order to label the tweets, we first used the MonkeyLearn tool to manually label one thousand (1000) tweets for each application (dialect, topic and sentiment detection). We then built multiple models based on these labeled tweets. In the next step, we predicted the labels of 200 non-labeled tweets using the built models, and manually corrected the ones that were labeled incorrectly. We then rebuilt the models with the new tweets added (1,200 tweets in the second iteration) and repeated the process, increasing the number of non-labeled tweets to predict, until the whole data set was labeled and checked.
In the following subsections, we present the main statistics of the data after labeling.
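The TF-IDF weight defined above can be implemented directly. Note that scikit-learn's TfidfVectorizer (used in the paper) applies a smoothed variant of this formula; the sketch below follows the raw definition W_{x,y} = tf_{x,y} * log(N / df_x) on a tiny, made-up corpus.

```python
# Direct implementation of the TF-IDF weighting formula from the text.
import math

def tfidf(docs):
    """docs: list of tokenized documents. Returns one {word: weight} dict per doc."""
    n = len(docs)
    df = {}  # document frequency of each word
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for word in doc:
            tf = doc.count(word)               # term frequency in this document
            w[word] = tf * math.log(n / df[word])
        weights.append(w)
    return weights

docs = [["love", "lasagna"], ["love", "life"], ["great", "life"]]
w = tfidf(docs)
print(w[0]["lasagna"])  # tf=1, df=1, N=3 -> log(3) ~ 1.099 (word unique to doc 1)
print(w[0]["love"])     # tf=1, df=2, N=3 -> log(1.5) ~ 0.405 (shared by two docs)
```

Words that occur in every document get weight log(1) = 0, which is exactly the intended behaviour: ubiquitous words carry no discriminative information.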
Topic labeling.
We chose to attach a topic label to each tweet from among the following topics: Politics, Health, Social, Sport and Economics. The tweets that were not relevant to one of these labels were labeled as 'Other', as shown in table 3. Only about 24% of the tweets (11,993) were labeled with one of the topic categories, while about 76% (37,313) were labeled with the 'Other' topic. However, we believe that keeping the 'Other' label category is important for the topic detection application, so we publish the labeled data set as such. Given the unbalanced character of the labeled data (see fig. 2), we recommend the use of under-sampling (i.e. using only a part of the data) for the 'Other' topic category when building machine learning models.
Table 3: Number of tweets per topic
Topic               Other  Politics  Health  Social  Sport  Economics
Number of tweets    37313  5355      4574    1564    94     406
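The under-sampling recommended above can be sketched in a few lines: keep every tweet of the minority topics and only a random subset of the over-represented 'Other' class. The `keep` value below is an arbitrary illustrative choice, not a figure from the paper.

```python
# Random under-sampling of the majority class for a labeled corpus.
import random

def undersample(samples, majority_label="Other", keep=5000, seed=42):
    """samples: list of (text, label) pairs; returns a rebalanced list."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    majority = [s for s in samples if s[1] == majority_label]
    minority = [s for s in samples if s[1] != majority_label]
    return minority + rng.sample(majority, min(keep, len(majority)))

# Synthetic example: 100 'Other' tweets, 10 'Politics' tweets.
data = [("t%d" % i, "Other") for i in range(100)] + [("p", "Politics")] * 10
balanced = undersample(data, keep=20)
print(len(balanced))  # -> 30 (10 minority + 20 sampled 'Other')
```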
Dialect labeling.
The number of tweets per dialect is given in table 1. In figure 1, we present the same data as a histogram. The number of tweets per dialect ranges between approximately eight thousand (8,000) and fourteen thousand five hundred (14,500).
Fig. 1: Distribution of tweets for dialect detection.
Sentiment labeling.
The total number of tweets labeled for sentiment analysis is 52,210. The majority of tweets were labeled with a neutral sentiment, as shown in table 4.

Table 4: Number of tweets per label for sentiment analysis

Sentiment           Positive  Negative  Neutral
Number of tweets    6792      15385     30033
Fig. 2: Distribution of tweets for sentiment analysis (left) and topic detection (right).
Models and evaluation.
Using our labeled data, we evaluated the performance of Arabic dialect identification, Arabic sentiment classification, and Arabic topic categorization systems with the following machine learning models: Logistic Regression, SGD Classifier, Linear SVC and Naive Bayes. We used grid search and pipelines to find the best hyper-parameters.
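A sketch of this pipeline-plus-grid-search setup, assuming scikit-learn is available. The tiny synthetic corpus and the parameter grid below are illustrative stand-ins, not the paper's actual data or search space.

```python
# Pipeline (TF-IDF vectorizer -> classifier) tuned with grid search.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy two-class corpus standing in for the labeled tweets.
texts = ["great match and great game", "parliament passed the new law",
         "the team won the final game", "elections and government policy",
         "the player scored a goal", "the minister announced a law"]
labels = ["sport", "politics", "sport", "politics", "sport", "politics"]

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Each candidate combination is cross-validated; best_params_ keeps the winner.
grid = GridSearchCV(pipe,
                    {"tfidf__ngram_range": [(1, 1), (1, 2)],
                     "clf__C": [0.1, 1.0, 10.0]},
                    cv=2)
grid.fit(texts, labels)
print(grid.best_params_)  # hyper-parameters with the best cross-validation score
```

The same pattern applies to the other models (SGD Classifier, Linear SVC, Naive Bayes) by swapping the `clf` step and its parameter grid.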
Dialect detection.
As can be seen in table 5, the Naive Bayes algorithm performs better than the other models on our test set.

Table 5: Performance of the tested models for dialect detection

Metric              SGD Classifier  Logistic Regression  Naive Bayes  Linear SVC
F1-score (micro)    0.73            0.72                 0.75         0.75
Precision           0.75            0.72                 0.80         0.76
Recall              0.72            0.71                 0.71         0.74
Accuracy            0.72            0.72                 0.75         0.76
Balanced accuracy   0.71            0.71                 0.79         0.75
Topic detection.
As shown in table 6, Logistic Regression and the SGD Classifier both perform well; in our case (imbalanced data) we chose Logistic Regression because of its high f1-score. We recall that for topic detection we recommend under-sampling the 'Other' category, as it is over-represented in our data set.

Table 6: Performance of the tested models for topic detection

Metric              Logistic Regression  SGD Classifier  Linear SVC
F1-score (micro)    0.82                 0.72            0.84
Precision           0.59                 0.65            0.65
Recall              0.59                 0.45            0.45
Accuracy            0.82                 0.84            0.84
Balanced accuracy   0.70                 0.78            0.78
Sentiment analysis.
According to the results in table 7, the four models give similar results for sentiment analysis, except Naive Bayes, which is less efficient. The SGD Classifier slightly outperforms the others in terms of f1-score and accuracy.

Table 7: Performance of the tested models for sentiment analysis

Metric              SGD Classifier  Logistic Regression  Naive Bayes  Linear SVC
F1-score (micro)    0.74            0.73                 0.67         0.73
Precision           0.75            0.75                 0.75         0.75
Recall              0.76            0.74                 0.64         0.75
Accuracy            0.77            0.76                 0.73         0.76
Balanced accuracy   0.74            0.74                 0.63         0.74
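The scores reported in tables 5-7 can be computed as follows. This is a small pure-Python illustration of micro-averaged F1 and balanced accuracy on hypothetical predictions, not the paper's actual model outputs.

```python
# Micro-F1 and balanced accuracy for single-label multi-class predictions.
from collections import Counter

def micro_f1(y_true, y_pred):
    # With single-label multi-class data, micro-F1 equals accuracy: every
    # error is simultaneously one false positive and one false negative.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def balanced_accuracy(y_true, y_pred):
    # Mean of per-class recalls, so minority classes weigh as much as the
    # majority class -- informative on imbalanced data like ours.
    per_class = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / n for c, n in per_class.items()) / len(per_class)

y_true = ["pos", "pos", "pos", "neg"]
y_pred = ["pos", "pos", "neg", "neg"]
print(micro_f1(y_true, y_pred))           # 3/4 correct -> 0.75
print(balanced_accuracy(y_true, y_pred))  # (2/3 + 1/1) / 2 ~ 0.833
```

The gap between the two metrics on this toy example mirrors the tables above, where balanced accuracy and plain accuracy diverge for the imbalanced tasks.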
Conclusion.
In this work, we presented a labeled data set of 50K tweets in five (5) Arabic dialects. We presented the process of labeling this data for dialect detection, topic detection and sentiment analysis. We make this labeled data openly available to the research, startup and industrial communities to build models for applications related to NLP for dialectal Arabic. We believe that initiatives such as ours can catalyse innovation and the technological development of AI solutions in Arab countries such as Morocco, by removing the burden linked to the non-availability of labeled data and to the time-consuming tasks of collecting and manually labeling data. We also presented a set of machine learning models that can be used as baselines, which future users of this data set can compare against and aim to outperform by innovating in terms of computational methods and algorithms. The labeled data set can be downloaded at [1], and all the implemented algorithms are available at [8].
References
1. MSDA open data sets. Mohammed VI Polytechnic University (UM6P), https://msda.um6p.ma/msda_datasets
2. Abdelli, A., Guerrouf, F., Tibermacine, O., Abdelli, B.: Sentiment analysis of Arabic Algerian dialect using a supervised method. In: 2019 International Conference on Intelligent Systems and Advanced Computing Sciences (ISACS). pp. 1–6 (2019). https://doi.org/10.1109/ISACS48493.2019.9068897
3. Abdul-Mageed, M., Zhang, C., Bouamor, H., Habash, N.: NADI 2020: The first nuanced Arabic dialect identification shared task (2020)
4. Abu Kwaik, K., Saad, M.: ArbDialectID at MADAR shared task 1: Language modelling and ensemble learning for fine grained Arabic dialect identification. In: Proceedings of the Fourth Arabic Natural Language Processing Workshop. pp. 254–258. Association for Computational Linguistics, Florence, Italy (Aug 2019). https://doi.org/10.18653/v1/W19-4632
5. Alrefaie, M.T.: Arabic stop words (May 2016), https://github.com/mohataher/arabic-stop-words/blob/master/list.txt
6. Baly, R., Badaro, G., El-Khoury, G., Moukalled, R., Aoun, R., Hajj, H., El-Hajj, W., Habash, N., Shaban, K.: A characterization study of Arabic Twitter data with a benchmarking for state-of-the-art opinion mining models. In: Proceedings of the Third Arabic Natural Language Processing Workshop. pp. 110–118. Association for Computational Linguistics, Valencia, Spain (Apr 2017). https://doi.org/10.18653/v1/W17-1314
7. Bouamor, H., Hassan, S., Habash, N.: The MADAR shared task on Arabic fine-grained dialect identification. In: Proceedings of the Fourth Arabic Natural Language Processing Workshop. pp. 199–207. Association for Computational Linguistics, Florence, Italy (Aug 2019). https://doi.org/10.18653/v1/W19-4622
8. Boujou, E., Chataoui, H.: NLP Python code for Arabic dialects (Nov 2020), https://github.com/Elmehdidebug/NLP-for-dialectDetection-TopicDetection-SentimentAnalysis
9. Dahou, A., Xiong, S., Zhou, J., Haddoud, M.H., Duan, P.: Word embeddings and convolutional neural network for Arabic sentiment classification. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. pp. 2418–2427. The COLING 2016 Organizing Committee, Osaka, Japan (Dec 2016)
10. El Mekki, A., Alami, A., Alami, H., Khoumsi, A., Berrada, I.: Weighted combination of BERT and N-GRAM features for nuanced Arabic dialect identification. In: Proceedings of the Fifth Arabic Natural Language Processing Workshop (WANLP 2020). Barcelona, Spain (2020)
11. Elfardy, H., Al-Badrashiny, M., Diab, M.: AIDA: Identifying code switching in informal Arabic text. In: Proceedings of the First Workshop on Computational Approaches to Code Switching. pp. 94–101. Association for Computational Linguistics, Doha, Qatar (Oct 2014). https://doi.org/10.3115/v1/W14-3911
12. Elnagar, A., Einea, O., Al-Debsi, R.: Automatic text tagging of Arabic news articles using ensemble deep learning models. In: Proceedings of the 3rd International Conference on Natural Language and Speech Processing. pp. 59–66. Association for Computational Linguistics, Trento, Italy (Sep 2019)
13. Medhaffar, S., Bougares, F., Estève, Y., Hadrich-Belguith, L.: Sentiment analysis of Tunisian dialects: Linguistic resources and experiments. In: Proceedings of the Third Arabic Natural Language Processing Workshop. pp. 55–61. Association for Computational Linguistics, Valencia, Spain (Apr 2017). https://doi.org/10.18653/v1/W17-1307
14. Mulki, H., Haddad, H., Gridach, M., Babaoglu, I.: Tw-StAR at SemEval-2017 task 4: Sentiment classification of Arabic tweets. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). pp. 664–669. Association for Computational Linguistics, Vancouver, Canada (Aug 2017). https://doi.org/10.18653/v1/S17-2110
15. Nabil, M., Aly, M., Atiya, A.: ASTD: Arabic sentiment tweets dataset. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pp. 2515–2519. Association for Computational Linguistics, Lisbon, Portugal (Sep 2015). https://doi.org/10.18653/v1/D15-1299
16. Qwaider, C., Chatzikyriakidis, S., Dobnik, S.: Can Modern Standard Arabic approaches be used for Arabic dialects? Sentiment analysis as a case study. In: Proceedings of the 3rd Workshop on Arabic Corpus Linguistics. pp. 40–50. Association for Computational Linguistics, Cardiff, United Kingdom (Jul 2019)
17. Selab, E., Guessoum, A.: Building TALAA, a free general and categorized Arabic corpus. In: Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1. pp. 284–291. ICAART 2015, SCITEPRESS - Science and Technology Publications, Lda, Setubal, PRT (2015). https://doi.org/10.5220/0005352102840291