Arabic Multi-Dialect Segmentation: bi-LSTM-CRF vs. SVM
Mohamed Eldesouki, Younes Samih, Ahmed Abdelali, Mohammed Attia, Hamdy Mubarak, Kareem Darwish, and Laura Kallmeyer

Qatar Computing Research Institute, HBKU, Doha, Qatar
Dept. of Computational Linguistics, University of Düsseldorf, Düsseldorf, Germany
Google Inc., New York City, USA
{mohamohamed,hmubarak,aabdelali,kdarwish}@hbku.edu.qa
{samih,kallmeyer}@phil.hhu.de
[email protected]

Abstract
Arabic word segmentation is essential for a variety of NLP applications such as machine translation and information retrieval. Segmentation entails breaking words into their constituent stems, affixes, and clitics. In this paper, we compare two approaches for segmenting four major Arabic dialects using only several thousand training examples for each dialect. The two approaches involve posing the problem as a ranking problem, where an SVM ranker picks the best segmentation, and as a sequence labeling problem, where a bi-LSTM RNN coupled with a CRF determines where best to segment words. We are able to achieve solid segmentation results for all dialects using rather limited training data. We also show that employing Modern Standard Arabic data for domain adaptation and assuming context independence improve overall results.
Introduction

Arabic has both complex morphology and orthography, where stems are typically derived from a closed set of roots to which affixes such as coordinating conjunctions, determiners, and pronouns are attached to form words. Segmenting Arabic words into their constituent parts is important for a variety of natural language processing applications. For example, segmentation has been shown to improve the effectiveness of information retrieval (Darwish et al., 2014a) and machine translation (Habash and Sadat, 2006). Most previous work has focused on segmenting Modern Standard Arabic (MSA), achieving segmentation accuracies of nearly 99% (Abdelali et al., 2016; Pasha et al., 2014). MSA is the lingua franca of the Arab world, and it is typically used in written and formal communications. Dialectal Arabic (DA) segmentation, on the other hand, has received limited attention, with most of the work focusing on the Egyptian dialect (Habash et al., 2013; Samih et al., 2017). Arabic dialects are typically spoken and are used in informal communications. The advent of social media and the ubiquity of smartphones have led to a greater need for dialectal processing such as dialect identification (Eldesouki et al., 2016; Khurana et al., 2016), morphological analysis (Habash et al., 2013), and machine translation (Sennrich et al., 2016; Sajjad et al., 2013). Yet, dialectal training corpora for a variety of NLP modules, including segmentation, continue to be limited and often nonexistent.

In this work, we focus on the segmentation of four major Arabic dialects, namely Egyptian, Levantine, Gulf, and Maghrebi. We particularly focus on DA text from Twitter, a popular social media platform, from which we can obtain large amounts of text in different dialects written by ordinary social media users and exhibiting nonstandard orthography.
We employ two machine learning approaches for building robust segmentation modules using limited training data (350 tweets containing several thousand words per dialect). In one approach, we pose segmentation as a ranking problem where all possible segmentations of a word are ranked using a Support Vector Machine (SVM) based ranker. In the second, we use a bidirectional Long Short-Term Memory (bi-LSTM) Recurrent Neural Network (RNN) with Conditional Random Fields (CRF) to perform sequence labeling over the characters in words. For both, we adopt the simplifying assumption that word segmentation can be reliably performed independent of context. Though the assumption is not always correct, it has been shown to be fairly robust for more than 99% of word occurrences in Arabic text (Abdelali et al., 2016). Lastly, given the large overlap between MSA and DA, we employ segmented MSA data to further improve dialectal segmentation.

The contributions of this paper are as follows:
• We present robust DA segmenters for four major Arabic dialects. We plan to open-source all of them.
• We provide an exposition of challenges associated with performing in situ DA segmentation, including segmentation guidelines and the effect of orthographic standardization.
• We compare two machine learning approaches that can generalize well even when limited training data is available.

Related Work

Work on dialectal Arabic is fairly new compared to MSA. A number of research projects were devoted to dialect identification (Biadsy et al., 2009; Zbib et al., 2012; Zaidan and Callison-Burch, 2014; Eldesouki et al., 2016). There are five major dialect groups: Egyptian, Gulf, Iraqi, Levantine, and Maghrebi. Few resources for these dialects are available, such as the CALLHOME Egyptian Arabic Transcripts (LDC97T19), which was made available for research as early as 1997. Newly developed resources include the corpus developed by Bouamor et al.
(2014), which contains 2,000 parallel sentences in multiple dialects and MSA as well as English translations.

For segmentation, Mohamed et al. (2012) built a segmenter based on memory-based learning. The segmenter was trained on a small corpus of Egyptian Arabic comprising 320 comments containing 20,022 words that were segmented and annotated by two native speakers. They reported a 91.90% accuracy on the segmentation task. MADA-ARZ (Habash et al., 2013) is an Egyptian Arabic extension of the Morphological Analysis and Disambiguation of Arabic (MADA). They trained and evaluated their system on both the Penn Arabic Treebank (PATB) (parts 1-3) and the Egyptian Arabic Treebank (parts 1-5) (Maamouri et al., 2014), achieving 97.5% accuracy. MADAMIRA (Pasha et al., 2014) is a newer version of MADA and includes functionality for analyzing dialectal Egyptian. Monroe et al. (2014) used a single dialect-independent model for segmenting all Arabic dialects including MSA. They argue that their segmenter is better than other segmenters that use sophisticated linguistic analysis. They evaluated their model on three corpora, namely parts 1-3 of the Penn Arabic Treebank (PATB), the Broadcast News Arabic Treebank (BN), and parts 1-8 of the BOLT Phase 1 Egyptian Arabic Treebank (ARZ), reporting a 95.13% F1 score.

DA shares many MSA challenges, such as having complex templatic derivational morphology and concatenative orthography. Most nouns and verbs are typically derived from a closed set of roots, which are fitted into templates to generate stems. Templates may indicate morphological features such as POS tag, gender, and number. Stems may accept prefixes, such as coordinating conjunctions and prepositions, or suffixes, such as pronouns, to form words.
While dialects mostly comply with the templatic nature of morphology [Footnote: Minor exceptions exist, such as the Egyptian template "AtfEl", which occasionally replaces the MSA template "AnfEl", as in "Atksr" (broke).], they diverge from MSA in other aspects such as:
• Lack of standard orthography, particularly for strictly dialectal words such as "E$An" (because), which may also appear as "El$An" or "m$An" (Habash et al., 2012).
• Word borrowing from other languages (Ibrahim, 2006), such as "blAStk" (your place) in Maghrebi, or code switching with other languages (Samih et al., 2016).
• Fusing multiple words together by concatenating tokens and dropping letters, as in the word "yqwlk" (he says to you), where "yqwl lk" are concatenated and one "l" is dropped.

[Footnote: MADAMIRA release 20160516 2.1.]
• Additional affixes. Dialect-specific affixes may arise because of: the alteration of pronouns, such as the feminine second person pronoun from "k" to "ky" or the plural pronoun "tm" to "tw"; the introduction of the negation prefix-suffix combination "mA-$", which behaves like the French "ne-pas" negation construct; the placement of present tense markers, such as "b" in Egyptian and Levantine; the use of different future markers such as "H", "h", and "g" instead of "s" for MSA; and the shortening of prepositions and fusing them with the words they precede, such as the transformation of "ElY" (on) to "E".
• Letter substitution, where some letters are commonly substituted for others, such as "v", which is replaced with "t" in Egyptian (as in "ktyr" (much)), or "q", which is replaced with "j" in Gulf (as in "Sdj" (really)).
• Syntactic differences, such as the use of masculine plural or singular noun forms instead of dual and feminine plural, the dropping of some articles and prepositions in some syntactic constructs, and the abandonment of some suffixes such as "wn" in favor of "wA" for verbs and "yn" for nouns.

Using raw text from social media introduces additional phenomena such as word elongation, as in ">KyyyyrrA" (finally) instead of ">KyrA", and the use of non-Arabic characters such as Urdu characters (Darwish et al., 2012).
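Elongation removal of the kind described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it collapses any run of repeated characters to a single character, which also collapses legitimately doubled letters (written gemination), so real systems may use a more conservative threshold.

```python
import re

def remove_elongation(word: str) -> str:
    """Collapse runs of repeated characters to a single character.

    Illustrative sketch only: the single-character threshold is an
    assumption; it also collapses legitimate doubled letters.
    """
    return re.sub(r'(.)\1+', r'\1', word)
```

For example, `remove_elongation(">KyyyyrrA")` yields `">KyrA"`, matching the elongation example above.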
Dataset

We constructed our dataset by obtaining 350 tweets that were authored in each of the following four dialects: Egyptian, Levantine, Gulf, and Maghrebi. For dialectal Egyptian tweets, we obtained the dataset described in (Darwish et al., 2014b), and we used the same methodology to construct the datasets for the remaining dialects. Initially, we obtained 175 million Arabic tweets by querying the Twitter API using the query "lang:ar" during March 2014. Then, we identified tweets whose authors identified their location in countries where the dialects of interest are spoken (e.g., Morocco, Algeria, and Tunisia for Maghrebi) using a large location gazetteer (Mubarak and Darwish, 2014). Then we filtered the tweets using a list containing 10 strong dialectal words per dialect, such as the Maghrebi word "kymA" (like/as in) and the Levantine word "hyk" (like this). Given the filtered tweets, we randomly selected 2,000 unique tweets for each dialect, and we asked a native speaker of each dialect to manually select 350 tweets that are heavily dialectal. Table 2 lists the number of tweets that we obtained for each dialect and the number of words they contain.
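The seed-word filtering step can be sketched as follows. The seed sets here are hypothetical one-word stand-ins (in Buckwalter transliteration) for the 10-word lists actually used:

```python
# Hypothetical seed lists; the paper used 10 strong dialectal words per dialect.
SEEDS = {
    "Maghrebi": {"kymA"},     # "like/as in"
    "Levantine": {"hyk"},     # "like this"
}

def matching_dialects(tweet: str, seeds=SEEDS) -> set:
    """Return the dialects whose seed words occur as whole tokens in the tweet."""
    tokens = set(tweet.split())
    return {dialect for dialect, words in seeds.items() if tokens & words}
```

Tweets matching at least one dialect's seed list would then be passed to a native speaker for manual selection.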
Field           Annotation
Orig. word      "byqwlk"
Meaning         he is saying to you
In situ Segm.   "b+yqwl+k"
CODA            "byqwl lk"
CODA Segm.      "b+yqwl l+k"

Table 1: Egyptian annotation example

Dialect     No. of Tweets   No. of Tokens
Egyptian    350             6,721
Levantine   350             6,648
Gulf        350             6,844
Maghrebi    350             5,495

Table 2: Dataset size for the different dialects

Segmentation of DA can be applied to the original raw text, or to the cleaned text after correcting spelling mistakes and applying conventional orthography rules, such as CODA (Habash et al., 2012). In this work, we decided to segment the original raw text. Though Egyptian CODA is a reasonably stable standard, CODA for other dialects is either immature or nonexistent. Also, CODA conversion tools are lacking for most dialects [Footnote: except for the Egyptian CODA tool that is embedded in MADAMIRA.]. Building such tools requires the establishment of clear guidelines, is laborious, and may require large annotated corpora (Eskander et al., 2013), such as the LDC Egyptian Treebank.

To prepare the ground truth data for a dialect, we enlisted an annotator who is either a native speaker of the dialect or well versed in it and has a background in natural language processing. The authors, along with another native speaker of the dialect, made multiple review rounds on the work of the annotator to ensure consistency and quality. The annotation guidelines were fairly straightforward. Basically, we asked annotators to:
• segment words in a way that would maintain the correct number of part-of-speech tags;
• favor stems when repeated letters are dropped, as in Table 1;
• segment multiple concatenated words with pluses, as in the "merged words" example in Table 3;
• attach injected long vowels that trail prepositions or pronouns to the preposition or pronoun respectively (e.g., "lyky" (to you, feminine) → "ly+ky");
• treat dialectal words that originated as multiple fused words as single tokens (e.g., "ElA$" (why), originally "ElY >y $y'");
• not segment name mentions and hashtags.

In what follows, we discuss the advantages and disadvantages of segmenting raw text versus CODA'fied text, with some statistics obtained for the Egyptian tweets for which we have a CODA'fied version, as exemplified in Table 1. The main advantage of segmenting raw text is that it does not need any preprocessing tool to generate CODA orthography, and the main advantage of CODA is that it regularizes text, making it more uniform and easier to process. We manually compared the CODA version to the raw version of 2,000 words in our Egyptian dataset. We found that for 75.4% of the words, the segmentation of the original raw words is exactly the same as that of their CODA'fied equivalents (e.g., "w+mn" (and from) and "nEml+hA" (we do it)).
Further, if we normalize some characters, namely ta marbuta to ha, alef maqsura to ya, and the hamza and madda alef forms to bare alef, map look-alike non-Arabic (e.g., Farsi) characters to their Arabic counterparts, and remove diacritics, the percentage of matching increases to 90.3%. Table 3 showcases the remaining differences between raw and CODA segmentations and how often they appear. The differences are divided into two groups. In the first group (accounting for 6.8% of the cases), the number of word segments remains the same and both the raw and CODA'fied segments would have the same POS tags.

In this group, the "variable spellings" class contains dialectal words that may have different common spellings with one "standard" spelling in CODA. The "dropped letters" and "shortened particles" classes typically involve the omission of letters, such as the first person imperfect prefix ">" when preceded by the present tense marker "b" or the future tense marker "h", the "A" in the negation particle "mA", which is often written as "m" and attached to the following word, and the trailing letters in prepositions. "Merged words" and "word elongations" are common in social media, where users try to keep within the character limit by dropping the spaces between letters that do not connect, or to stress words, respectively.

Diff.                 %      Examples
Same no. of segments and same POS tags:
variable spellings    2.4%   "E$An, El$An"
dropped letters       2.3%   "b+Htrm, b+AHtrm"
merged words          1.4%   "yA+Em, yA Em"
shortened particles   0.4%   "E, ElA; f, fy"
elongations           0.3%   "lyyyyh, lyh"
Different no. of segments or POS tags:
spelling errors       2.2%   "An, Ana; wlA, wAlA"
fused letters         0.8%   "qAl+y, qAl l+y"

Table 3: Original vs. CODA segmentations

Though some processing, such as splitting of words or removing elongations, is required to overcome the phenomena in this group, in situ segmentation of raw words would yield identical segments with the same POS tags as their CODA counterparts. Thus, the segmentation of raw words could be sufficient for 97% of words.

In the second group (accounting for 3% of the cases), both may have a different number of segments or POS tags, which would complicate downstream processing such as POS tagging. These cases involve spelling errors and the fusion of two identical consecutive letters (gemination). Correcting such errors may require a spell checker. We opted to segment raw input without correction in our reference, and we kept stems, such as verbs and nouns, complete at the expense of other segments such as prepositions, as in the example in Table 1.

Segmentation Approaches
We present here two different systems for word segmentation. The first uses SVM-based ranking (SVM-Rank) to rank different possible segmentations of a word using a variety of features. The second uses a bi-LSTM-CRF, which performs sequence-to-sequence mapping to guess word segmentation.

SVM-Rank Approach
This approach is inspired by the work of Abdelali et al. (2016), in which they used SVM-based ranking to ascertain the best segmentation for Modern Standard Arabic (MSA), which they show to be fast and accurate. The approach involves generating all possible segmentations of a word and then ranking them.

In training, we generate all possible segmentations of a word based on a closed set of prefixes and suffixes; the correct segmentation is assigned rank 1 and all other incorrect segmentations are assigned rank 2. Our valid affixes include MSA prefixes and suffixes that we extracted from Farasa (Abdelali et al., 2016) and additional dialectal prefixes and suffixes that we observed during training. Since we are not mapping words into a standard spelling, such as CODA, prefixes and suffixes may have multiple different representations. For example, the dialectal Egyptian word for "I do not play" could be spelled as "m+b+lEb+$", "mA+b+lEb+$", "mA+b+AlEb+$", "m+b+lEb+$y", "mA+b+AlEb+$y", etc.
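Candidate generation over closed affix inventories can be sketched as follows. The prefix and suffix sets here are toy assumptions for illustration; the actual inventories come from Farasa's MSA affixes plus dialectal affixes observed in training:

```python
# Toy affix inventories (Buckwalter transliteration); illustrative only.
PREFIXES = {"w", "b", "E", "Al", "m", "mA"}
SUFFIXES = {"k", "ky", "$", "$y", "w$", "hA"}

def candidates(word, prefixes=PREFIXES, suffixes=SUFFIXES, max_affixes=2):
    """Enumerate segmentations of `word` as prefix* + stem + suffix*,
    keeping the stem non-empty. Returns '+'-joined candidate strings."""
    out = set()

    def peel_suffixes(stem, pre, suf, depth):
        if stem:
            out.add("+".join(pre + [stem] + suf))
        if depth == 0:
            return
        for s in suffixes:
            if stem.endswith(s) and len(stem) > len(s):
                peel_suffixes(stem[: -len(s)], pre, [s] + suf, depth - 1)

    def peel_prefixes(rest, pre, depth):
        peel_suffixes(rest, pre, [], max_affixes)
        if depth == 0:
            return
        for p in prefixes:
            if rest.startswith(p) and len(rest) > len(p):
                peel_prefixes(rest[len(p):], pre + [p], depth - 1)

    peel_prefixes(word, [], max_affixes)
    return out
```

In training, the gold segmentation among the enumerated candidates would be labeled rank 1 and all others rank 2 before feature extraction.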
In this example, the first prefix could be "m" or "mA" and the suffix could be "$" or "$y". Here are two example dialectal Egyptian words to demonstrate segmentation:
• Given the input word "EAlw$" (on the face), possible segmentations are: {E+Al+w$} (correct segmentation), {E+Alw$}, {E+Al+w+$}, {E+Alw+$}, {EAlw+$}, and {EAlw$}.
• Given the input word "bAdyky" (I give you (feminine)), possible segmentations are: {b+Ady+ky} (correct segmentation), {b+Adyky}, {bAdy+ky}, and {bAdyky}.

We use the following features in training the classifier:
• Conditional probability that a leading character sequence is a prefix.
• Conditional probability that a trailing character sequence is a suffix.
• Probability of the prefix given the suffix.
• Probability of the suffix given the prefix.
• Unigram probability of the stem (more details about calculating this are given below).
• Unigram probability of the stem with the first suffix.
• Whether a valid stem template can be obtained from the stem.
• Whether the stem that has no trailing suffixes appears in a gazetteer of person and location names (Abdelali et al., 2016).
• Whether the stem is a function word, such as "ElY" (on), "mn" (from), and "m$" (not).
• Whether the stem appears in the AraComLex Arabic lexicon (Attia et al., 2011) [Footnote: http://sourceforge.net/projects/aracomlex/] or in the Buckwalter lexicon (Buckwalter, 2002). This is sensible considering the large overlap between MSA and DA.
• Length difference from the average stem length.

The segmentations with their corresponding features are then passed to the SVM ranker (Joachims, 2006) for training. Our SVM-Rank setup uses a linear kernel and a trade-off parameter between training error and margin of 100. Before training the classifier, features needed to be calculated in advance. As training data, we used the aforementioned sets of 350 dialectal tweets for each dialect, containing typically several thousand words each. We also use three parts of the Penn Arabic Treebank (ATB), part 1 (version 4.1), part 2 (version 3.1), and part 3 (version 2), which have a combined size of 628,870 tokens, to look up MSA segmentations. The intuition behind using such segmented MSA data for lookup is that MSA and the dialects share a fair amount of vocabulary. Thus, using the ATB corpus has the effect of increasing coverage.

We also adopted the simplifying assumption that any given word has only one possible correct segmentation regardless of context. Though this assumption is not always true, previous work on MSA has shown that it holds for 99% of the cases (Abdelali et al., 2016). Invoking this assumption has multiple positive implications, namely: we can directly use the segmentations that we observed during training, which typically cover the most common function words, or segmentations that we observed in the ATB, which cover most MSA words that may be prevalent in dialectal text; and we can cache word segmentations, leading to a significant speedup. Thus, we experimented with three different lookup schemes for every word, namely: 1) we output the ranker's guess directly (None); 2) if it exists, we use the segmentation seen in the dialectal training set, and the output of the ranker otherwise (DA); 3) if it exists, we use the segmentation seen in the dialectal training set, else the segmentation observed in the ATB, and lastly the output of the ranker (DA+MSA).

bi-LSTM-CRF Approach

A Recurrent Neural Network (RNN) belongs to a family of neural networks suited for modeling sequential data.
Given an input sequence x = (x_1, ..., x_n), an RNN computes the output vector y_t of each word x_t by iterating the following equations from t = 1 to n:

h_t = f(W_xh x_t + W_hh h_{t-1} + b_h)
y_t = W_hy h_t + b_y

where h_t is the hidden state vector, W denotes a weight matrix, b denotes a bias vector, and f is the activation function of the hidden layer. Theoretically, RNNs can learn long-distance dependencies, but in practice they fail to do so due to vanishing/exploding gradients (Bengio et al., 1994). To solve this problem, Hochreiter and Schmidhuber (1997) introduced the LSTM RNN. The idea consists of augmenting an RNN with memory cells to overcome difficulties with training and efficiently cope with long-distance dependencies. The output of the LSTM hidden layer h_t given input x_t is computed via the following intermediate calculations (Graves, 2013):

i_t = σ(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_xc x_t + W_hc h_{t-1} + b_c)
o_t = σ(W_xo x_t + W_ho h_{t-1} + W_co c_t + b_o)
h_t = o_t ⊙ tanh(c_t)

where σ is the logistic sigmoid function, ⊙ denotes elementwise multiplication, and i, f, o, and c are respectively the input gate, forget gate, output gate, and cell activation vectors. More interpretation of this architecture can be found in (Lipton et al., 2015).

Bi-LSTM networks (Schuster and Paliwal, 1997) are extensions of single LSTM networks. They are capable of learning long-term dependencies and maintain contextual features from both past and future states. As shown in Figure 1, they comprise two separate hidden layers that feed forward to the same output layer.
A bi-LSTM calculates the forward hidden sequence →h, the backward hidden sequence ←h, and the output sequence y by iterating over the following equations:

→h_t = σ(W_{x→h} x_t + W_{→h→h} →h_{t-1} + b_{→h})
←h_t = σ(W_{x←h} x_t + W_{←h←h} ←h_{t+1} + b_{←h})
y_t = W_{→h y} →h_t + W_{←h y} ←h_t + b_y

More interpretation of these formulas can be found in Graves et al. (2013a).
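The gate equations and the bidirectional combination can be sketched in plain numpy. This is a toy sketch with random parameters, not the trained model; the peephole weights (ci, cf, co) are taken as diagonal, hence the elementwise products, and the dimensions are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, p):
    """One step of the LSTM gate equations; p holds the weights."""
    i = sigmoid(p["xi"] @ x + p["hi"] @ h + p["ci"] * c + p["bi"])
    f = sigmoid(p["xf"] @ x + p["hf"] @ h + p["cf"] * c + p["bf"])
    c = f * c + i * np.tanh(p["xc"] @ x + p["hc"] @ h + p["bc"])
    o = sigmoid(p["xo"] @ x + p["ho"] @ h + p["co"] * c + p["bo"])
    return o * np.tanh(c), c

def bilstm(X, p_fwd, p_bwd, Wfy, Wby, by, d):
    """Run one LSTM left-to-right and one right-to-left, then combine
    the two hidden states at each position into the output y_t."""
    def run(seq, p):
        h, c, hs = np.zeros(d), np.zeros(d), []
        for x in seq:
            h, c = lstm_step(x, h, c, p)
            hs.append(h)
        return hs
    hf = run(X, p_fwd)
    hb = run(X[::-1], p_bwd)[::-1]   # backward pass, realigned
    return [Wfy @ f + Wby @ b + by for f, b in zip(hf, hb)]

# Toy dimensions: input 5, hidden 4, output 3.
rng = np.random.default_rng(0)
def params(nin, d):
    p = {k: rng.standard_normal((d, nin)) for k in ("xi", "xf", "xc", "xo")}
    p.update({k: rng.standard_normal((d, d)) for k in ("hi", "hf", "hc", "ho")})
    p.update({k: rng.standard_normal(d)
              for k in ("ci", "cf", "co", "bi", "bf", "bc", "bo")})
    return p

X = [rng.standard_normal(5) for _ in range(6)]
Y = bilstm(X, params(5, 4), params(5, 4),
           rng.standard_normal((3, 4)), rng.standard_normal((3, 4)),
           rng.standard_normal(3), d=4)
```

In the full model the per-position outputs feed a CRF layer rather than being used directly, as described next.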
Over the past few years, bi-LSTMs have achieved ground-breaking results in many NLP tasks because of their ability to cope with long-distance dependencies and exploit contextual features from past and future states. Still, when they are used for certain sequence classification tasks (such as segmentation and named entity detection) where there are strict dependencies between output labels, they fail to generalize perfectly. During the training phase of bi-LSTM networks, the resulting probability distributions for different time steps are independent of each other. To overcome the independence assumptions imposed by the bi-LSTM and to exploit this kind of labeling constraint in our Arabic segmentation system, we model label sequence logic jointly using Conditional Random Fields (CRF) (Lafferty et al., 2001).
In this model, we consider Arabic segmentation as a sequence labeling problem at the character level. Each character is labeled with one of five labels, B, M, E, S, and WB, that designate the segmentation decision boundaries: Beginning, Middle, and End of a multi-character segment, Single character segment, and Word Boundary, respectively. Figure 1 illustrates our segmentation model and how the model takes the word "qlbh" (his heart) as its current input and predicts its correct segmentation.

Figure 1: Architecture of our proposed neural network Arabic segmentation model, applied to the word "qlbh" with output "qlb+h".

The model is comprised of the following three layers:
• Input layer: contains character embeddings.
• Hidden layer: a bi-LSTM maps character representations to hidden sequences.
• Output layer: a CRF computes the probability distribution over all labels.

At the input layer, a look-up table is initialized with uniformly sampled random embeddings mapping each character in the input to a d-dimensional vector. At the hidden layer, the output from the character embeddings is used as the input to the bi-LSTM layer to obtain fixed-dimensional representations for each character. At the output layer, a CRF is applied over the hidden representation of the bi-LSTM to obtain the probability distribution over all the labels. Training is performed using stochastic gradient descent (SGD) with momentum and mini-batch updates, optimizing the cross-entropy objective function.

Due to the relatively small size of the training and development sets, overfitting poses a considerable challenge for our dialectal Arabic segmentation system. To make sure that our model learns significant representations, we resort to dropout (Hinton et al., 2012) to mitigate overfitting.
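Returning to the label scheme above, the mapping from a '+'-segmented word to per-character labels can be sketched as follows. The exact placement of WB is an assumption here: we emit it as a separate boundary symbol between words.

```python
def word_to_labels(segmented: str) -> list:
    """Map a '+'-segmented word to per-character labels:
    B/M/E for multi-character segments, S for single-character segments."""
    labels = []
    for seg in segmented.split("+"):
        if len(seg) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(seg) - 2) + ["E"])
    return labels

def sentence_to_labels(words: list) -> list:
    """Join the label sequences of consecutive words with WB (word boundary)."""
    out = []
    for i, w in enumerate(words):
        if i:
            out.append("WB")
        out.extend(word_to_labels(w))
    return out
```

For the running example, `word_to_labels("qlb+h")` yields `["B", "M", "E", "S"]`, i.e., the stem "qlb" spans B-M-E and the pronoun "h" is a single-character segment.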
[Footnote: We did not use pre-trained character embeddings, because we conducted side experiments with and without pre-trained embeddings and the results were mixed.]

The basic idea behind dropout involves randomly omitting a certain percentage of the neurons in each hidden layer for each presentation of the samples during training. This encourages each neuron to depend less on the other neurons to learn the right segmentation decision boundaries. We apply dropout masks to the character embedding layer before inputting to the bi-LSTM and to its output vector. In our experiments, we find that dropout with a fixed rate decreases overfitting and improves the overall performance of our system. We also employ early stopping (Caruana et al., 2000; Graves et al., 2013b) to mitigate overfitting by monitoring the model's performance on the development set.

Experiments and Results

As described earlier, we perform several experiments for each dialect. These involve training using dialectal data while using different lookup schemes, namely: no lookup (None); lookup from the dialectal training data only (DA); and a cascaded lookup from the dialectal training data and then MSA (DA+MSA). For all our experiments, we use 5-fold cross validation with 70/10/20 train/dev/test splits. We use the Farasa MSA segmenter as a baseline. Table 4 reports the results for both segmentation approaches in combination with the different lookup schemes. As the results clearly show, using an MSA segmenter yields suboptimal results for dialects. Also, when no lookup is used, the bi-LSTM-CRF sequence labeler performs better than the SVM ranker for all dialects. However, using lookup leads to greater improvements for the SVM approach, leading to the best results for Levantine, Gulf, and Maghrebi, and slightly lower results for Egyptian. Further, SVM-Rank seemed to have benefited more from the DA lookup, while bi-LSTM-CRF benefited more from the MSA lookup.

System        Look-up   Egyptian  Levantine  Gulf   Maghrebi
SVM-Rank      None      91.0      87.8       87.7   84.7
SVM-Rank      DA        94.5      92.9       92.8   90.5
SVM-Rank      DA+MSA    94.6      –          –      –
bi-LSTM-CRF   None      93.8      91.0       89.4   87.1
bi-LSTM-CRF   DA        94.2      91.8       90.8   88.5
bi-LSTM-CRF   DA+MSA    –         –          –      –

Table 4: SVM-Rank and bi-LSTM-CRF results with and without lookup

As for Egyptian segmentation, we suspect that both approaches performed better on it than on the other dialects because the percentage of test words that appear in the training set was greater for Egyptian. The percentages for all the dialects are:

Egyptian   Levantine   Gulf    Maghrebi
64.7%      54.7%       56.7%   55.2%

As for the lower results for Maghrebi, we noticed that Maghrebi has many more affixes than MSA and the other dialects. These affixes contribute to the data sparsity and the complexity of the segmentation task. For example, we enumerated 24 prefixes for Maghrebi compared to 8 for MSA, 17 for Levantine and Gulf, and 12 for Egyptian. Similarly, Maghrebi had more suffixes than MSA and the other dialects. To ascertain the effect of CODA'fication, we ran an extra experiment where we trained our best SVM-Rank system using the CODA'fied version of the Egyptian data, and the segmentation accuracy increased from 94.6% to 96.8%. Thus, having stable CODA standards and reliable conversion tools may positively impact dialectal processing. Next, we elaborate on typical errors of both approaches.
SVM-Rank Errors:
We examined the errors that the SVM ranker produced for the different dialects, and the most common types involved:
• Erroneous splitting of leading or trailing characters when they were not prefixes or suffixes respectively, or failure to split actual prefixes and suffixes. For example, "H+ykwn" (will be) was segmented as "Hyk+wn".
• The use of non-Arabic letters, the wrong form of alef, or "h" instead of "p". For example, "jAy+l+k" (I am coming to you), where "A" and "k" were replaced with "|" and a Farsi character respectively, was not segmented.
• Long words with multiple segments, such as "m+tqlq+y+nA+$" (don't make us angry), where the ranker chose to segment it as "m+tqlq+yn+A$".

bi-LSTM-CRF Errors: The errors in this system fall broadly into three categories:
• Ambiguous token boundaries caused by character sharing in cases of gemination or elision. For example, the word "EnA" (about us) is actually two tokens, "En" and "nA", whose two "n" letters are merged into one. In the gold data the disputed letter belongs to the first token, while in the system output it belongs to the second.
• Like the SVM, the system often fails due to unconventional spelling. For example, the word "lAxwyA" (to my brother) is a misspelling of "l>xwyA" (with alef-hamza).
• The majority of the remaining errors are simply mis-tokenizations due to the system's inability to decide whether a substring (which out of context can be a valid token) is an independent token or part of a word, e.g. "mstqbl+k" (your future), which the system predicts as "m+stqbl+k": it correctly recognizes the genitive pronoun at the end but mistakenly tags the first radical as a separate segment.
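Both systems ultimately decide, character by character, where segment boundaries fall. As an illustration of the sequence-labeling view used by the bi-LSTM-CRF, the sketch below converts a segmented word into per-character tags. The B/M/E/S tag set assumed here (Begin/Middle/End of a multi-character segment, Single-character segment) is a common convention for this formulation, not a quotation of the authors' exact label set:

```python
# Cast word segmentation as character-level sequence labeling.
# Assumed tag set (a common convention, not necessarily the authors' own):
#   B = first char of a multi-char segment, M = middle char, E = last char,
#   S = a single-character segment (e.g., a one-letter clitic).

def word_to_labels(segmented_word):
    """Map a '+'-delimited segmentation to one tag per character."""
    labels = []
    for segment in segmented_word.split("+"):
        if len(segment) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(segment) - 2) + ["E"])
    return labels

# "H+ykwn" (will be): a one-letter proclitic "H" plus the stem "ykwn".
print(word_to_labels("H+ykwn"))    # ['S', 'B', 'M', 'M', 'E']
print(word_to_labels("mstqbl+k"))  # ['B', 'M', 'M', 'M', 'M', 'E', 'S']
```

Under this encoding, the tagger never emits segment strings directly; the CRF layer only has to choose a well-formed tag per character, and the segmentation is read off from the B/S positions.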
In this paper we presented two approaches, involving SVM-based ranking and bi-LSTM-CRF sequence labeling, for segmenting the Egyptian, Levantine, Gulf, and Maghrebi dialects. Both approaches yield strong, comparable results that range between 91% and 95% accuracy for the different dialects. To perform the work, we created training corpora containing naturally occurring text from social media for the aforementioned dialects. We plan to release the data and the resulting segmenters to the research community. For future work, we want to perform domain adaptation using large MSA data, such as the ATB, to improve segmentation results. Further, we plan to investigate building a joint model capable of segmenting all the dialects with minimal loss in accuracy.
References
Ahmed Abdelali, Kareem Darwish, Nadir Durrani, and Hamdy Mubarak. 2016. Farasa: A fast and furious segmenter for Arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. Association for Computational Linguistics, San Diego, California, pages 11–16.
Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi, and Josef van Genabith. 2011. An open-source finite state morphological transducer for Modern Standard Arabic. In Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing. Association for Computational Linguistics, pages 125–133.
Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2):157–166.
Fadi Biadsy, Julia Hirschberg, and Nizar Habash. 2009. Spoken Arabic dialect identification using phonotactic modeling. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages. Association for Computational Linguistics, Stroudsburg, PA, USA, Semitic '09, pages 53–61.
Houda Bouamor, Nizar Habash, and Kemal Oflazer. 2014. A multidialectal parallel corpus of Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). European Language Resources Association (ELRA), Reykjavik, Iceland.
Tim Buckwalter. 2002. Buckwalter Arabic Morphological Analyzer Version 1.0.
Rich Caruana, Steve Lawrence, and Lee Giles. 2000. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In NIPS, pages 402–408.
Kareem Darwish, Walid Magdy, and Ahmed Mourad. 2012. Language processing for Arabic microblog retrieval. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, pages 2427–2430.
Kareem Darwish, Walid Magdy, et al. 2014a. Arabic information retrieval. Foundations and Trends® in Information Retrieval.
Kareem Darwish, Hassan Sajjad, and Hamdy Mubarak. 2014b. Verifiably effective Arabic dialect identification. In EMNLP, pages 1465–1468.
Mohamed Eldesouki, Fahim Dalvi, Hassan Sajjad, and Kareem Darwish. 2016. QCRI @ DSL 2016: Spoken Arabic dialect identification using textual features. VarDial 3, page 221.
Ramy Eskander, Nizar Habash, Owen Rambow, and Nadi Tomeh. 2013. Processing spontaneous orthography. In Proceedings of NAACL-HLT 2013.
Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013a. Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, pages 273–278.
Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013b. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, pages 6645–6649.
Nizar Habash, Mona T. Diab, and Owen Rambow. 2012. Conventional orthography for dialectal Arabic. In LREC, pages 711–718.
Nizar Habash, Ryan Roth, Owen Rambow, Ramy Eskander, and Nadi Tomeh. 2013. Morphological analysis and disambiguation for dialectal Arabic. In HLT-NAACL, pages 426–432.
Nizar Habash and Fatiha Sadat. 2006. Arabic preprocessing schemes for statistical machine translation. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers. Association for Computational Linguistics, pages 49–52.
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
Innovation and Continuity in Language and Communication of Different Language Cultures 9, edited by Rudolf Muhr, pages 235–260.
Thorsten Joachims. 2006. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pages 217–226.
Sameer Khurana, Ahmed Ali, and Steve Renals. 2016. Multi-view dimensionality reduction for dialect identification of Arabic broadcast speech. arXiv preprint arXiv:1609.05650.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML.
Zachary C. Lipton, David C. Kale, Charles Elkan, and Randall Wetzell. 2015. A critical review of recurrent neural networks for sequence learning. CoRR abs/1506.00019.
Mohamed Maamouri, Ann Bies, Seth Kulick, Michael Ciul, Nizar Habash, and Ramy Eskander. 2014. Developing an Egyptian Arabic treebank: Impact of dialectal morphology on annotation and tool development. In LREC, pages 2348–2354.
Emad Mohamed, Behrang Mohit, and Kemal Oflazer. 2012. Annotating and learning morphological segmentation of Egyptian colloquial Arabic. In LREC, pages 873–877.
Will Monroe, Spence Green, and Christopher D. Manning. 2014. Word segmentation of informal Arabic with domain adaptation. In ACL (2), pages 206–211.
Hamdy Mubarak and Kareem Darwish. 2014. Using Twitter to collect a multi-dialectal corpus of Arabic. In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pages 1–7.
Arfath Pasha, Mohamed Al-Badrashiny, Mona Diab, Ahmed El Kholy, Ramy Eskander, Nizar Habash, Manoj Pooleery, Owen Rambow, and Ryan M. Roth. 2014. MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. In Proc. LREC.
Hassan Sajjad, Kareem Darwish, and Yonatan Belinkov. 2013. Translating dialectal Arabic to English. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Sofia, Bulgaria, ACL '13, pages 1–6.
Younes Samih, Mohammed Attia, Mohamed Eldesouki, Hamdy Mubarak, Ahmed Abdelali, Laura Kallmeyer, and Kareem Darwish. 2017. A neural architecture for dialectal Arabic segmentation. WANLP 2017 (co-located with EACL 2017), page 46.
Younes Samih, Suraj Maharjan, Mohammed Attia, Laura Kallmeyer, and Thamar Solorio. 2016. Multilingual code-switching identification via LSTM recurrent neural networks. In Proceedings of the Second Workshop on Computational Approaches to Code Switching.
Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
Computational Linguistics