A reproduction of Apple's bi-directional LSTM models for language identification in short strings
Mads Toftrup∗†, Søren Asger Sørensen∗†, Manuel R. Ciosici‡, and Ira Assent†
†Computer Science Department, Aarhus University
‡Information Sciences Institute
[email protected]
∗Equal contribution
Abstract
Language Identification is the task of identifying a document's language. For applications like automatic spell checker selection, language identification must use very short strings such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model's performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes are due to confusion between related languages.
Introduction

Automatic Language Identification is the task of identifying a document's language, an essential task for document classification and machine translation (Ling et al., 2013). General-purpose, open-source Language Identification tools like langid.py (Lui and Baldwin, 2012) and FastText (Grave, 2017) are the de facto standards for Language Identification in large documents.

During the last two decades, text messaging and social media have generated large amounts of short plain-text documents. Language identification on partial and complete short texts presents unique challenges (Jauhiainen et al., 2019). Successful Language Identification can support marketing, political, and socioeconomic analyses on large corpora of short texts such as tweets. Such analyses can, for example, study hate speech towards immigrants and women (Basile et al., 2019) or seek to understand support groups for smoking cessation (Prochaska et al., 2012).

On a smartphone, Language Identification on short texts can support several features. Language identification of incoming text messages can help virtual assistants read incoming text messages, which can be an essential tool for minorities such as visually impaired multilingual speakers. Language identification can also help when typing short texts. Identifying language from the first few characters typed (a very short string) can allow a smartphone to select the correct spelling and grammar checker automatically. Such features motivated a team at Apple to study character-level Language Identification using bi-directional LSTMs (Apple, 2019).

This paper reproduces the architecture presented in an industry blog post (Apple, 2019) on Language Identification on extremely short strings (10 characters or less). The blog post briefly sketches the language identification system used by Apple's smartphones and computers. However, due to the use of internal, proprietary corpora, the architecture's performance cannot be compared with the current de facto standards for Language Identification: the open-source tools langid.py (Lui and Baldwin, 2012) and
FastText (Joulin et al., 2017, 2016; Grave, 2017).

Our reproduction confirms the performance described in the original blog post (Apple, 2019). We go beyond mere reproduction and (1) compare the bi-LSTM model with the current de facto standards for Language Identification and (2) analyze performance on related languages. We find that the bi-LSTM is more accurate than out-of-the-box FastText and langid.py, even outperforming the re-trained FastText. Our results suggest that the bi-LSTM architecture could be an alternative to FastText and langid.py for Language Identification on short strings.

Our source code and models are available at https://github.com/AU-DIS/LSTM_langid. End-users can download our code as a library from the Python Package Index (PyPI) via https://pypi.org/project/LanguageIdentifier/.

Related work
The simplest Language Identification methods discriminate using elementary distinguishing traits like unique character combinations, frequent or unique words, diacritics, or common n-grams (Dunning, 1994; Souter et al., 1994; Truică et al., 2015). Increasing model complexity, some Language Identification methods model sequences of words, characters, or bytes. Some methods focus on modeling the frequency of n-grams, e.g., frequency of character n-grams (Ahmed et al., 2004; Souter et al., 1994). Such methods outperform techniques based on unique words. Markov model-based approaches estimate the probability of a string based on n-grams of characters or bytes (Dunning, 1994), as is the case of langid.py (Lui and Baldwin, 2012, 2011). Due to its availability as an open-source library, langid.py is one of the most popular language identifiers.

Recent language identifiers increasingly use word representations. For example, in a blog post, Grave (2017) shows how to identify languages using FastText vectors (Bojanowski et al., 2016; Joulin et al., 2017, 2016), which model character n-grams. Language identification with FastText vectors is as performant as langid.py (Grave, 2017). Similar to langid.py, FastText language identification models are open-source and, therefore, popular.

LanideNN (Kocmi and Bojar, 2017) identifies languages in multilingual documents using a recurrent neural network with a single layer of gated recurrent units (GRU). Unlike Markov-based methods, recurrent neural network architectures do not model character sequences with a fixed window of context. The language identifier that Apple briefly sketched in a blog post (Apple, 2019) uses a recurrent neural network with a two-layer bi-directional LSTM to model character sequences. Apple's method differs from LanideNN in architecture complexity (two layers, LSTM cells instead of the simpler GRU cells) and in its focus. LanideNN works with long multilingual documents, whereas Apple classify extremely short monolingual strings.

In a survey, Jauhiainen et al. (2019) present more techniques than those above, discuss challenges, and identify remaining research questions. Among the remaining research questions are very short texts (the problem motivating Apple) and discrimination of related languages. In this paper, we go beyond reproducing Apple's work by analyzing the effect
of related languages.

Figure 1: The bi-LSTM architecture. Figure reproduced from Apple (2019).
Figure 1 gives an overview of the two-layer bi-directional LSTM architecture powering Apple's products, as briefly sketched in a blog post (Apple, 2019). The model takes as input strings of characters. In the following, we describe the left-to-right direction of the bi-directional LSTM. The right-to-left direction is identical but mirrored. In the first step, vector embeddings replace all characters in the input string. The network uses a single embedding for all languages since the language is unknown at this point. At each time step, the LSTM ingests a character's embedding and the hidden layer representation from the previous step. The per-character output from the left-to-right LSTM layer is concatenated with that of the right-to-left layer. The concatenated vectors pass to a second LSTM layer that is identical to the first but does not share parameters. After the second layer, the concatenated vectors go through a single linear layer, producing a distribution over all supported languages. The linear layer provides character-level language identification. In other words, for each input character, the network generates a probability distribution over the possible languages.

With the outputs from the linear layer, Apple (2019) state that
"a max pooling style majority voting decides the dominant language of the string." However, max pooling and majority voting are different techniques. A combination of the two is impossible, as one cannot perform majority voting over outputs that have been max pooled, and vice versa. Instead, we sum over the linear layer's output values at each time step and softmax the summed output to obtain a prediction. We expect this approach to be what the original authors intended. The similarity between our reproduction's performance and what Apple report in the original blog post confirms our approach.
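To make the architecture and our aggregation step concrete, the following PyTorch sketch shows one way to implement it. The class name, variable names, and default dimensions (taken from our training setup described below) are ours, so this is a minimal illustration of our reading of the blog post rather than Apple's exact implementation.

```python
import torch
import torch.nn as nn


class CharBiLSTMLanguageID(nn.Module):
    """Minimal sketch of a two-layer bi-directional character LSTM identifier.

    Names and defaults are ours; the embedding and hidden sizes of 150 follow
    the experimental setup described later in the paper.
    """

    def __init__(self, num_chars: int, num_languages: int,
                 embedding_dim: int = 150, hidden_dim: int = 150):
        super().__init__()
        # A single character embedding shared across all languages.
        self.embedding = nn.Embedding(num_chars, embedding_dim)
        # Two stacked bi-directional LSTM layers; the layers do not share parameters.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)
        # Linear layer mapping each concatenated (forward + backward) state
        # to one score per supported language.
        self.classifier = nn.Linear(2 * hidden_dim, num_languages)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, sequence_length) integer-encoded characters.
        embedded = self.embedding(char_ids)
        states, _ = self.lstm(embedded)            # (batch, seq_len, 2 * hidden_dim)
        per_char_logits = self.classifier(states)  # (batch, seq_len, num_languages)
        # Our interpretation of the blog post's aggregation: sum the per-character
        # scores over time and softmax the result into one distribution per string.
        summed = per_char_logits.sum(dim=1)        # (batch, num_languages)
        return torch.softmax(summed, dim=-1)
```

With `num_layers=2` and `bidirectional=True`, PyTorch stacks two independent bi-directional layers and feeds the concatenated forward and backward states of the first layer into the second, matching the description above.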
Apple (2019) only mention the kind of data used in their experiments. Therefore, we use two large and openly available data sets of the same kind as Apple: a subset of OpenSubtitles (Lison and Tiedemann, 2016) to study performance on dialog, and Universal Dependencies (UD, Zeman et al., 2019) for prose. Following Apple, we trim strings to 50 characters per sample, with all samples starting at the beginning of a word, and remove special characters.

Apple test on 20 languages that use the Latin alphabet, but only show results on 9 of the 20 and do not specify the remaining 11 languages. Besides the 9 languages in the original blog post, we select 11 languages, some of which are closely related. The languages we use are: Catalan (ca), Czech (cs), Danish (da), French (fr), German (de), English (en), Spanish (es), Estonian (et), Finnish (fi), Croatian (hr), Hungarian (hu), Italian (it), Lithuanian (lt), Dutch (nl), Norwegian (no), Portuguese (pt), Polish (pl), Romanian (ro), Swedish (sv), and Turkish (tr). Thus, our experimental setup is similar to Apple's. Including closely related languages increases our data sets' difficulty but supports more interesting and more representative experiments. Specifically, it supports performance analysis on related languages, an open research question (Jauhiainen et al., 2019). We use five-fold cross-validation in all experiments.

Following Apple (2019), we evaluate on strings of 10 characters. We test all models on the same strings. We use the AdamW optimizer with default parameters in PyTorch; we set the character embedding dimension to 150 and the bi-LSTM's hidden dimension to 150; we train for 25 epochs using batches of 64 examples and use weighted cross-entropy for the loss function.
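As a concrete illustration of this preprocessing, the sketch below trims a sample to 50 characters and strips special characters. The exact character filter (keeping letters, digits, spaces, and apostrophes) is an assumption on our part, since the description above only states that special characters are removed.

```python
import re


def preprocess(text: str, max_chars: int = 50) -> str:
    """Trim a sample to at most ``max_chars`` characters, starting at a word."""
    # Assumed filter for this sketch: keep word characters (including accented
    # letters), whitespace, and apostrophes; replace everything else with spaces.
    cleaned = re.sub(r"[^\w\s']", " ", text)
    # Collapse whitespace so the sample starts at the beginning of a word.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned[:max_chars]


def evaluation_string(text: str, eval_chars: int = 10) -> str:
    """Evaluation uses strings of 10 characters, as described above."""
    return preprocess(text)[:eval_chars]
```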
Figure 2: Apple (2019)’s original results.
Out-of-the-box, FastText and langid.py can identify more than our set of 20 languages. For fair evaluation, we limit the set of languages that the models output. For langid.py, we use a built-in method that limits the number of languages under consideration. For FastText, we take the probability distribution over all language predictions, extracting only the relevant 20. We use the large pre-trained FastText model, available at https://fasttext.cc/docs/en/language-identification.html. When re-training FastText, we use 15 epochs, with a minimum n-gram length of one character and a maximum of six characters. We leave all other parameters at their default (see the code sketch after Figures 3 and 4 below).

Figure 3 contains the results of our reproduction of the experiment in Figure (b) from Apple (2019), a confusion matrix of the bi-LSTM model trained and evaluated on the UD data set. Since Apple do not include averaged results, we use the confusion matrices for comparison. Figure 2 includes a copy of Figure (b) from Apple (2019) for easier comparison. We find that performance per language is similar between the two implementations. While in one case accuracy is almost identical (Turkish, tr), for most languages, our implementation is either a few points of accuracy below the original model (e.g., French, fr, and Italian, it) or above it (e.g., Dutch, nl, by about one point). For some languages, our implementation considerably underperforms the original (e.g., English, en, and Spanish, es). Our implementation considerably outperforms the original on German (de, about +6 points) and Swedish (sv, about +7 points).

Figure 3: Confusion matrix for bi-LSTM on UD.

Figure 4: Confusion matrix for re-trained FastText on UD.
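The sketch below illustrates this baseline setup: restricting langid.py and the pre-trained FastText model to our 20 languages and re-training FastText with the settings above. File paths are placeholders, and the snippet relies on the publicly documented APIs of the two libraries (langid.set_languages, fasttext.load_model, model.predict, fasttext.train_supervised); it is a sketch of the procedure, not our exact evaluation code.

```python
import fasttext
import langid

LANGS = ["ca", "cs", "da", "de", "en", "es", "et", "fi", "fr", "hr",
         "hu", "it", "lt", "nl", "no", "pl", "pt", "ro", "sv", "tr"]

# langid.py: a built-in method restricts predictions to a subset of languages.
langid.set_languages(LANGS)
language, score = langid.classify("det var en gang")

# Pre-trained FastText: request the full label distribution, keep only our 20.
# "lid.176.bin" is the large pre-trained identification model (placeholder path).
ft = fasttext.load_model("lid.176.bin")
labels, probs = ft.predict("det var en gang", k=len(ft.get_labels()))
relevant = {label.replace("__label__", ""): prob
            for label, prob in zip(labels, probs)
            if label.replace("__label__", "") in LANGS}
prediction = max(relevant, key=relevant.get)

# Re-trained FastText: 15 epochs, character n-grams of 1 to 6 characters, and
# all other parameters at their defaults. "train.txt" (placeholder path) holds
# one "__label__xx <text>" example per line.
retrained = fasttext.train_supervised(input="train.txt", epoch=15, minn=1, maxn=6)
```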
We attribute the difference in performance to randomness during training and differences in training data. The original blog post states neither the size nor the language composition of its data set.

In Figure 3, we follow Apple and threshold small values in the confusion matrix. Thus, we can effortlessly compare error patterns. Interestingly, the patterns are almost identical. Both matrices show issues distinguishing between Italian (it) and Portuguese (pt), German (de) and Dutch (nl), French (fr) and English (en), and Italian (it) or Portuguese (pt) vs. Spanish (es) or French (fr). Unsurprisingly, most confusions appear for languages from the same families, Romance (es, fr, it, pt) and Germanic (de, nl).

In Tables 1 and 2, we include the comparative analysis results with the current de facto standards for Language Identification: FastText and langid.py. We use two weighing strategies for F1 to provide different insights. Macro-F1 averages the per-language results and considers languages equally important. Weighted-F1 takes into account the popularity of the different languages in the data sets. Weighted-F1 measures the performance on the data set, while macro-F1 illustrates language coverage as it is not affected by label frequency. In multi-class classification, micro-F1 equals accuracy. We, therefore, include only accuracy, denoted acc@1.
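To make these metric definitions concrete, the sketch below computes weighted and macro F1, top-k accuracy (acc@k), and a row-normalized confusion matrix with scikit-learn. The toy labels, random scores, and the cut-off used to suppress small confusion-matrix entries are illustrative only and not our actual evaluation data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, top_k_accuracy_score

num_langs = 5                                          # toy setup; we use 20 languages
rng = np.random.default_rng(0)
y_true = rng.integers(0, num_langs, size=200)          # gold language index per sample
scores = rng.dirichlet(np.ones(num_langs), size=200)   # per-language probabilities
y_pred = scores.argmax(axis=1)

weighted_f1 = f1_score(y_true, y_pred, average="weighted")
macro_f1 = f1_score(y_true, y_pred, average="macro")

# acc@k: the gold language is among the k highest-scoring predictions.
labels = np.arange(num_langs)
acc_at_1 = top_k_accuracy_score(y_true, scores, k=1, labels=labels)
acc_at_3 = top_k_accuracy_score(y_true, scores, k=3, labels=labels)

# Row-normalized confusion matrix (rows: actual language, columns: predicted);
# suppressing small entries makes the error patterns easier to compare.
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
cm[cm < 0.01] = 0.0  # illustrative cut-off, not the exact one used in the figures
```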
Table 1: Results on UD. pFT = pre-trained FastText; rFT = re-trained FastText. Metrics: weighted F1 (wF1), macro F1, acc@1, acc@3, and acc@5 for the bi-LSTM, pFT, rFT, and langid.py.

Table 2: Results on OpenSubtitles. pFT = pre-trained FastText; rFT = re-trained FastText. Metrics and systems as in Table 1.
On both data sets, the bi-LSTM exceeds the weighted- and macro-F1 of langid.py, pre-trained FastText, and re-trained FastText. The performance difference between the bi-LSTM and the next best model (the re-trained FastText) also appears in the confusion matrix. Figure 4 shows that even the re-trained FastText exhibits confusion across all pairs. It also shows a strong bias towards some languages like English (en), French (fr), or Dutch (nl), regardless of the input language. All columns in Figure 4 that correspond to these languages exhibit confusion errors.

The OpenSubtitles data is more challenging than UD for out-of-the-box langid.py and FastText, but easier for the bi-LSTM and re-trained FastText. Also, there is a considerable improvement from the pre-trained FastText to the re-trained FastText on both data sets. These observations suggest (1) that domain adaptation has a considerable impact on FastText, and (2) that dialog is more difficult for the out-of-the-box models. OpenSubtitles contains subtitles of movies predominantly produced in English. Consequently, character names are also English-centered, e.g., Jane. Character names can appear in dialog, which might confuse the pre-trained models to assign such dialog lines to English, despite their translation.

Figure 5: Confusion matrix for bi-LSTM on UD (all 20 languages).
Tables 1 and 2 show a jump from accuracy at the top of the list of prioritized predicted languages (acc@1) to accuracy at the top three (acc@3). For most models, a smaller jump follows to accuracy at the top five (acc@5). The sizeable jump indicates that, even when the models are wrong, the correct answer is usually among the top three. For example, the bi-LSTM gains several points from acc@1 to acc@3 on both UD and OpenSubtitles, but far less from acc@3 to acc@5. The gap from acc@1 to acc@3 is much larger for langid.py and FastText, illustrating a higher confusion. Recent work in language identification suggests that the accuracy gap might be a symptom of confusion of related languages (Haas and Derczynski, 2020).

To understand the bi-LSTM's jump in accuracy, we turn to the complete confusion matrix. In Figure 5, we show the confusion matrix of the bi-LSTM on all 20 languages in our experiments. There is intense confusion between highly similar languages. We observe three large clusters of confused languages: Romance (ca, es, fr, it, pt, ro), West Germanic (de, en, nl), and languages of Northern Europe (da, no, sv). More closely related languages are more confusing, for example, Catalan (ca) vs. Spanish (es) and Danish (da) vs. Norwegian (no). The clusters of confusion between related languages indicate that, despite the bi-LSTM's improved performance, highly similar languages still pose a challenge.

Apple (2019) also consider storage requirements. Our bi-LSTM requires only a few megabytes of storage, in line with the claims in the original blog post. The re-trained FastText model requires on the order of a gigabyte of storage, but that could be reduced considerably by following the compression approach of Joulin et al. (2016). langid.py's model also occupies only a few megabytes. Given its language identification performance and model size, the bi-LSTM is a great value proposition, especially on storage-constrained mobile devices, confirming Apple's use case scenario.
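As a rough, back-of-the-envelope illustration of why the bi-LSTM is so compact, the snippet below estimates its storage from the parameter count of the architecture sketched earlier, assuming 32-bit weights and a hypothetical character vocabulary of 300 symbols. It is an order-of-magnitude estimate, not a measured model size.

```python
def lstm_direction_params(input_dim: int, hidden_dim: int) -> int:
    """Parameters of one LSTM direction: 4 gates, each with input, recurrent, and bias weights."""
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)


vocab_size, num_languages = 300, 20        # vocabulary size is a guess for illustration
embedding_dim = hidden_dim = 150           # dimensions from our training setup

params = vocab_size * embedding_dim                                # character embeddings
params += 2 * lstm_direction_params(embedding_dim, hidden_dim)     # layer 1, both directions
params += 2 * lstm_direction_params(2 * hidden_dim, hidden_dim)    # layer 2, both directions
params += 2 * hidden_dim * num_languages + num_languages           # linear classifier

print(f"roughly {params * 4 / 1e6:.1f} MB at 32-bit precision")    # a few megabytes
```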
Conclusion

We have reproduced the bi-LSTM language identification architecture described in a blog post by Apple (2019). Our reproduction experiments confirm the performance claims in the original blog post. We evaluated the bi-LSTM against the de facto open-source language identifiers in experiments on two openly available data sets. Our evaluation considered dialog and prose, and targeted twenty languages, including some highly similar languages such as Danish (da) and Norwegian (no) or Catalan (ca) and Spanish (es). Our experiments illustrate the difficulty of identifying the language in very short strings. The reproduced bi-LSTM outperformed FastText and langid.py on all measures, even when training FastText on the same data. However, we went beyond a straightforward reproduction and considered related languages. Our analysis shows that the bi-LSTM can easily confuse languages from the same family (e.g., Romance, West Germanic, or Scandinavian) and highly similar languages such as Catalan (ca) and Spanish (es). We publish our implementation's source code and make a trained model available as a library. In the future, we would like to consider avenues for improving the bi-LSTM architecture. For example, we would like to replace the majority voting mechanism in the bi-LSTM with a more robust alternative.
References
Bashir Elhaj Ahmed, Sung-Hyuk Cha, and Charles C. Tappert. 2004. Language identification from text using n-gram based cumulative frequency addition. In Proceedings of Student/Faculty Research Day, CSIS, Pace University.

Apple. 2019. Language identification from very short strings. Online: https://machinelearning.apple.com/research/language-identification-from-very-short-strings. Accessed: 2021-02-10.

Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. 2019. SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606.

Ted Dunning. 1994. Statistical identification of language. Computing Research Laboratory, New Mexico State University, Las Cruces, NM, USA.

Edouard Grave. 2017. Language identification. Online: https://fasttext.cc/blog/2017/10/02/blog-post.html. Accessed: 2020-09-24.

René Haas and Leon Derczynski. 2020. Discriminating Between Similar Nordic Languages. arXiv preprint arXiv:2012.06431.

Tommi Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, and Krister Lindén. 2019. Automatic language identification in texts: A survey. Journal of Artificial Intelligence Research, 65:675–782.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics.

Tom Kocmi and Ondřej Bojar. 2017. LanideNN: Multilingual language identification on character window. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 927–936, Valencia, Spain. Association for Computational Linguistics.

Wang Ling, Guang Xiang, Chris Dyer, Alan Black, and Isabel Trancoso. 2013. Microblogs as parallel corpora. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 176–186, Sofia, Bulgaria. Association for Computational Linguistics.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 923–929, Portorož, Slovenia. European Language Resources Association (ELRA).

Marco Lui and Timothy Baldwin. 2011. Cross-domain Feature Selection for Language Identification. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 553–561, Chiang Mai, Thailand. Asian Federation of Natural Language Processing.

Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25–30, Jeju Island, Korea. Association for Computational Linguistics.

Judith J. Prochaska, Cornelia Pechmann, Romina Kim, and James M. Leonhardt. 2012. Twitter=quitter? An analysis of Twitter quit smoking social networks. Tobacco Control, 21(4):447–449.

Clive Souter, Gavin Churcher, Judith Hayes, John Hughes, and Stephen Johnson. 1994. Natural language identification using corpus-based models. HERMES-Journal of Language and Communication in Business, 13:183–203.

Ciprian-Octavian Truică, Julien Velcin, and Alexandru Boicea. 2015. Automatic Language Identification for Romance Languages using Stop Words and Diacritics. In 17th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC).