One Size Does Not Fit All: Finding the Optimal N-gram Sizes for FastText Models across Languages
Vít Novotný, Eniafe Festus Ayetiran, Dávid Lupták, Michal Štefánik, Petr Sojka
Faculty of Informatics, Masaryk University, Brno, Czechia
{witiko,ayetiran,dluptak,stefanik.m,sojka}@mail.muni.cz

Abstract
Unsupervised word representation learning from large corpora is badly needed for downstream tasks such as text classification, information retrieval, and machine translation. The representation precision of the fastText language models is mostly due to their use of subword information. In previous work, the optimization of fastText subword sizes has been largely neglected, and non-English fastText language models were trained using subword sizes optimized for English and German.

In our work, we train English, German, Czech, and Italian fastText language models on Wikipedia, and we optimize the subword sizes on the English, German, Czech, and Italian word analogy tasks. We show that the optimization of subword sizes results in a 5% improvement on the Czech word analogy task. We also show that computationally expensive hyperparameter optimization can be replaced with cheap n-gram frequency analysis: subword sizes that are the closest to covering 3.76% of all unique subwords in a language are shown to be the optimal fastText hyperparameters on the English, German, Czech, and Italian word analogy tasks.

∗ First author's work was graciously funded by the South Moravian Centre for International Mobility as a part of the Brno Ph.D. Talent project. Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.

1 Introduction

Bojanowski et al. (2017) have shown that taking word morphology into account is important for accurate continuous word representations. However, they only show the optimal n-gram sizes for German and English (Bojanowski et al., 2017, Section 5.5). We further their experiment by finding the optimal parameters for Czech and Italian using the Czech (Svoboda and Brychcín, 2016) and
Italian (Berardi et al., 2015) word analogy tasks. We show that while the optimal subword sizes are similar for English and German, they vary wildly for Czech and Italian. We further investigate a cheaper alternative optimization technique using the analysis of n-gram frequencies.

The rest of the paper is organized as follows: Section 2 discusses the related work. In Sections 3 and 4, we discuss our methods and results. Section 5 concludes the paper.

2 Related Work

Mikolov et al. (2013) described the
Word2vec language model, which uses a shallow log-linear neural network to learn continuous representations of words (word embeddings). They also produced the
English word analogy task, which tests the ability to find pairs of words with analogical relations (man is to woman as king is to queen), and they reported state-of-the-art results on their task with the word embeddings produced by the Word2vec language model.
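To illustrate how such analogy questions are typically answered with word embeddings (a vector offset followed by a nearest-neighbour search, often called 3CosAdd), here is a minimal Python sketch using Gensim. It is only an illustration: the vector and question file names are placeholders, and this is not the evaluation code used in the papers cited here.

```python
from gensim.models import KeyedVectors

# Load pretrained word vectors in the word2vec text format (placeholder file name).
vectors = KeyedVectors.load_word2vec_format("vectors.vec", binary=False)

# "man is to woman as king is to ?" -- answered by searching for the word
# whose vector is closest to king - man + woman.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Accuracy over a whole analogy task file in the Mikolov et al. (2013) format.
accuracy, sections = vectors.evaluate_word_analogies("questions-words.txt")
print(f"overall analogy accuracy: {accuracy:.2%}")
```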
Berardi et al. (2015), Köper et al. (2015), and Svoboda and Brychcín (2016) explored the behaviour of word embeddings on Italian, German, and Czech, respectively, and produced the Italian, German, and Czech word analogy tasks for evaluating the performance of non-English word embeddings. Their findings revealed that, despite the morphological complexity of these languages, the Word2vec language models were able to generate semantically and syntactically meaningful word embeddings.

In order to take word morphology into account, which had been ignored by previous work, Bojanowski et al. (2017) developed the fastText language model based on the skipgram variant of the Word2vec language model. Their improvement consisted of representing each word as a sequence of character n-grams (i.e. subwords). They trained their models on the English, German, Czech, and Italian Wikipedia corpora, and they reported state-of-the-art performance on the English, German, Czech, and Italian word analogy tasks. However, they neglected to optimize the n-gram size hyperparameter for the Czech and Italian fastText models, and used hand-picked hyperparameter values for English and German in their experiment.

Grave et al. (2018) trained fastText language models for 157 languages, but, like Bojanowski et al. (2017), they neglected to optimize the n-gram size, noting only that "using character n-grams of size 5, instead of using the default range of 3–6, does not significantly decrease the accuracy [on word analogy tasks] (except for Czech)."

3 Methods

We use the skipgram variant of the fastText language model, reproducing the experimental setup of Bojanowski et al. (2017, Section 4): a hash table bucket size of 2 · 10⁶, 300 vector dimensions, a negative sampling loss with 5 negative samples, an initial learning rate of 0.05 with a linear decay to zero, a sampling threshold of 10⁻⁴, and a window size of 5.

We train the language models on the English, German, Czech, and Italian Wikipedia corpora for 5 epochs. We use character n-grams with n in {i, . . . , j} and report the performance on the English (Mikolov et al., 2013), German (Köper et al., 2015), Czech (Svoboda and Brychcín, 2016), and Italian (Berardi et al., 2015) word analogy tasks for all considered values of i and j, where i ≤ j. Like Bojanowski et al. (2017), we only use the most frequent words when solving the word analogies.

Additionally, we compute the ratio between the frequencies of unique character n-grams (i.e. unique subwords of size n) with n in {i, . . . , j} and the frequencies of all unique character n-grams on the English, German, Czech, and Italian Wikipedia corpora for all considered values of i and j, where i ≤ j. In the following text, we call this ratio the n-gram coverage. Table 1 shows how the n-gram coverage is computed.

(a) The 27 unique character n-grams in a corpus with two words, Hello and world:
    n = 1: H, e, l, o, w, r, d
    n = 2: He, el, ll, lo, wo, or, rl, ld
    n = 3: Hel, ell, llo, wor, orl, rld
    n = 4: Hell, ello, worl, orld
    n = 5: Hello, world

(b) Frequencies of unique character n-grams for one size n:
    n = 1: 7/27 ≈ 0.26, n = 2: 8/27 ≈ 0.30, n = 3: 6/27 ≈ 0.22, n = 4: 4/27 ≈ 0.15, n = 5: 2/27 ≈ 0.07

(c) N-gram coverages with n in {i, . . . , j} for all values of i and j, where 1 ≤ i ≤ j ≤ 5: each coverage is the sum of the frequencies in Subtable (b) for i ≤ n ≤ j.

Table 1: An example of computing character n-gram coverages. We start in Subtable (a) by producing all unique character n-grams from the corpus. In Subtable (b), we compute the frequencies of the n-gram sizes. In Subtable (c), we compute the n-gram coverages.
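To make the coverage computation concrete, the following minimal Python sketch reproduces the Table 1 example. It is our illustration of the procedure described above, not the authors' released tool, and the function names are ours.

```python
from itertools import combinations_with_replacement

def char_ngrams(word, n):
    """All character n-grams of a single word."""
    return [word[k:k + n] for k in range(len(word) - n + 1)]

def ngram_size_frequencies(words, max_n):
    """Frequency of unique character n-grams of each size n = 1, ..., max_n."""
    unique = {n: set() for n in range(1, max_n + 1)}
    for word in set(words):
        for n in unique:
            unique[n].update(char_ngrams(word, n))
    total = sum(len(ngrams) for ngrams in unique.values())
    return {n: len(ngrams) / total for n, ngrams in unique.items()}

def ngram_coverage(frequencies, i, j):
    """Coverage of n-gram sizes {i, ..., j}: their share of all unique n-grams."""
    return sum(frequencies.get(n, 0.0) for n in range(i, j + 1))

corpus = ["Hello", "world"]  # the toy corpus of Table 1
frequencies = ngram_size_frequencies(corpus, max_n=5)
print(frequencies)  # {1: 7/27, 2: 8/27, 3: 6/27, 4: 4/27, 5: 2/27}
for i, j in combinations_with_replacement(range(1, 6), 2):
    print(f"coverage of sizes {{{i}, ..., {j}}}: {ngram_coverage(frequencies, i, j):.2%}")
```

On a real corpus, the same functions would be applied to the vocabulary of a tokenized Wikipedia dump rather than to two words.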
We take the n-gram coverages for the optimal values of i and j on the word analogy tasks, assume that they are observations of a normal distribution, and construct a Student's t-distribution 95% confidence interval and a point estimate for the mean optimal n-gram coverage.

4 Results

In subtables 2b–2d and 2f–2h, we show the syntactic, semantic, and total accuracies of the English and German fastText models on the English and German word analogy tasks. The optimal n-gram sizes for English and German confirm the experiment of Bojanowski et al. (2017, Section 5.5). While the optimal n-gram sizes for English are within 1% of the total accuracy for the default n-gram sizes {3, 4, 5, 6} proposed by Bojanowski et al. (2017), the optimal n-gram sizes for German achieve a 3% improvement in the total accuracy compared to the default n-gram sizes. As we discuss below, this is due to a higher proportion of long character n-grams in German.

In subtables 3b–3d and 3f–3h, we show the syntactic, semantic, and total accuracies of the Czech and Italian fastText models on the Czech and Italian word analogy tasks. While the optimal n-gram sizes for Italian are within 1% of the total accuracy for the default n-gram sizes proposed by Bojanowski et al. (2017), the optimal n-gram sizes for Czech achieve a 5% improvement in the total accuracy compared to the default n-gram sizes. As we discuss below, this is due to a higher proportion of short character n-grams in Czech.
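The n-gram size ranges compared above are controlled by fastText's minn and maxn hyperparameters. The following minimal sketch shows the skipgram setup of Section 3 with an explicit size range, using the official fasttext Python bindings; the corpus path, the output path, and the particular sizes are placeholders, not values prescribed by the paper.

```python
import fasttext  # the official fastText Python bindings (pip install fasttext)

# Skipgram training with the hyperparameters listed in Section 3.
# The input corpus, the chosen subword sizes, and the output path are placeholders.
model = fasttext.train_unsupervised(
    "cs.wikipedia.txt",   # a preprocessed Wikipedia dump, one sentence per line
    model="skipgram",
    dim=300,              # 300-dimensional word vectors
    neg=5,                # negative sampling loss with 5 negative samples
    lr=0.05,              # initial learning rate, linearly decayed to zero
    ws=5,                 # context window size
    epoch=5,              # 5 training epochs
    bucket=2_000_000,     # hash table bucket size
    minn=2,               # smallest character n-gram size i (placeholder)
    maxn=5,               # largest character n-gram size j (placeholder)
)
model.save_model("cs-skipgram.bin")
```

Varying minn and maxn over a grid and evaluating each trained model on a word analogy task reproduces the expensive search that the n-gram coverage heuristic is meant to replace.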
[Table 2, subtables (a)–(h): the English and German n-gram coverages for each n-gram size range, with optima at 3.76% (EN) and 4.19% (DE), and the corresponding syntactic, semantic, and total word analogy accuracies.]

Table 2: The coverages of various character n-gram sizes for the English and German Wikipedia (subtables (a, e)), and the effect of the sizes of character n-grams on the English and German analogy task accuracies (subtables (b–d, f–h)). We consider n-grams with n in {i, . . . , j} and report coverages and accuracies for various values of i and j. Italicized are the coverages and total accuracies for the default n-gram sizes of fastText, bold underlined are the optimal n-gram sizes, and bold are the n-gram sizes within the 95% confidence interval of the optimal n-gram coverage.
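The per-language optimal coverages (3.76% for English, 4.19% for German, 3.28% for Czech, and 3.81% for Italian) are aggregated in Section 4 into a point estimate and a Student's t 95% confidence interval. A minimal sketch of that aggregation with NumPy and SciPy follows; it is our reconstruction of the described procedure, and the resulting bounds are whatever the formula yields, not figures quoted from the paper.

```python
import numpy as np
from scipy import stats

# Optimal n-gram coverages (%) on the EN, DE, CS, and IT word analogy tasks.
coverages = np.array([3.76, 4.19, 3.28, 3.81])

point_estimate = coverages.mean()        # mean optimal coverage
standard_error = stats.sem(coverages)    # standard error of the mean
low, high = stats.t.interval(0.95, len(coverages) - 1,
                             loc=point_estimate, scale=standard_error)
print(f"point estimate {point_estimate:.2f}%, 95% CI [{low:.2f}%, {high:.2f}%]")
```

The paper uses the resulting point estimate of 3.76% as the target coverage when suggesting subword sizes for new languages.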
[Table 3, subtables (a)–(h): the Czech and Italian n-gram coverages for each n-gram size range, with optima at 3.28% (CS) and 3.81% (IT), and the corresponding syntactic, semantic, and total word analogy accuracies.]

Table 3: The coverages of various character n-gram sizes for the Czech and Italian Wikipedia (subtables (a, e)), and the effect of the sizes of character n-grams on the Czech and Italian analogy task accuracies (subtables (b–d, f–h)). We consider n-grams with n in {i, . . . , j} and report coverages and accuracies for various values of i and j. Italicized are the coverages and total accuracies for the default n-gram sizes of fastText, bold underlined are the optimal n-gram sizes, and bold are the n-gram sizes within the 95% confidence interval of the optimal n-gram coverage.

[Table 4: one row per language, listing the number of active users, the corpus size, and the suggested character n-gram sizes; English (EN) heads the list with 142,765 active users and a 22 GiB corpus.]

Table 4: Suggested optimal character n-gram sizes for Wikipedia languages with over 500 active users. Active users are registered users who have made at least one edit in the last thirty days. We omit languages that use writing systems without word boundaries. Suggestions are based on the n-gram coverages in the Wikipedia corpora.

Word analogy categories | Example analogies | Accuracies (%): EN, DE, CS, IT (listed in this order, skipping languages in which the category does not exist)
Nouns, plural | bird, birds (EN) | 86.79, 68.89, 68.81, 50.95
Nouns, singular, masculine–feminine | conte, contessa (IT) | 92.21, 65.79
Nouns, plural, masculine–feminine | conti, contesse (IT) | 30.13
Verbs, singular, 3rd person | play, plays (EN) | 79.77, 76.85
Verbs, plural, 1st person | cerco, cerchiamo (IT) | 23.64
Verbs, plural, 3rd person | cerca, cercano (IT) | 95.45
Verbs, present participle | write, writing (EN) | 70.36, 32.50, 76.92
Verbs, past tense | writing, wrote (EN) | 64.23, 29.20, 80.20, 50.00
Verbs, remote past tense, 1st person | corro, corsi (IT) | 2.78
Adjectives, adverbs | calm, calmly (EN) | 43.75, 31.72
Adjectives, antonyms | aware, unaware (EN) | 53.57, 39.13, 34.63
Adjectives, comparative | fast, faster (EN) | 87.84, 59.61, 71.43, 0.00
Adjectives, superlative | fast, fastest (EN) | 76.61, 25.72, 52.27
Pronouns, plural | mého, mých (CS) | 7.71
Nations, nationalities | Chile, Chilean (EN) | 90.93, 73.81, 33.03, 89.93
Syntactic | | 74.44, 55.33, 74.07, 62.66
Nouns, antonyms | dobro, zlo (CS) | 16.03
Nouns, singular, masculine–feminine | king, queen (EN) | 81.62, 55.14, 31.96, 73.39
Verbs, antonyms | sečíst, odečíst (CS) | 4.78
Adjectives, antonyms | světlý, tmavý (CS) | 21.03
Capital cities, common countries | Rome, Italy (EN) | 94.27, 89.83, 35.33, 84.58
Capital cities, all countries | Apia, Samoa (IT) | 91.98, 87.15, 59.24
Capital cities, regions | Piemonte, Torino (IT) | 43.57
Biggest cities, U.S. states | Austin, Texas (EN) | 72.60, 39.09, 20.72
Countries, currencies | Armenia, dram (EN) | 12.13, 10.22, 2.85
Semantic | | 78.77, 66.02, 21.48, 45.44
Total | | 76.40, 60.82, 60.46, 53.98

Table 5: Detailed results of the English, German, Czech, and Italian fastText models with the optimal n-gram sizes on the English, German, Czech, and Italian word analogy tasks (see subtables 2d, 2h, 3d, and 3h). Example analogies for each category are given in English, where available. Syntactic accuracies are a weighted average of the results on the syntactic word analogy categories (top), and semantic accuracies are a weighted average of the results on the semantic word analogy categories (bottom). Total accuracies are a weighted average of the syntactic and semantic accuracies.

In subtables 2a, 2e, 3a, and 3e, we show the coverages of character n-grams on the English, German, Czech, and Italian corpora. Notice how short character n-grams achieve a higher coverage on Czech compared to English, Italian, and especially German, where short character n-grams cover only a small portion of all character n-grams.

The coverages for the optimal n-gram sizes on the English, German, Czech, and Italian word analogy tasks are 3.76%, 4.19%, 3.28%, and 3.81%, respectively. This gives us a 95% confidence interval and a point estimate of 3.76% for the optimal mean n-gram coverage. The results show that the optimal n-gram sizes are the closest to the 3.76% point estimate for all four languages, and that the n-gram sizes whose n-gram coverages fall within the confidence interval are within 1% of the total accuracy for the optimal n-gram sizes. Therefore, we propose that the n-gram coverage should be used to find the optimal n-gram sizes for new languages instead of expensive language model training and hyperparameter optimization. For this purpose, we have developed an application (https://github.com/MIR-MU/fastText-subword-size-optimizer) that suggests optimal n-gram sizes for any of the 309 Wikipedia languages. Table 4 shows the suggested optimal n-gram sizes for the Wikipedia languages with over 500 active users. We omit Chinese, Japanese, and Thai, whose writing systems lack word boundaries, and Vietnamese, which uses spaces to separate syllables rather than words.

Furthermore, it is noteworthy that while the fastText language models achieve a semantic accuracy that is higher than their syntactic accuracy on the English and German word analogy tasks, they achieve a syntactic accuracy that is higher than their semantic accuracy on the Czech and Italian word analogy tasks (see Table 5). We theorize that this is because of the richer morphology and the smaller training corpora for Czech (1.2 GiB) and Italian (4.2 GiB) compared to English (22 GiB) and German (8.3 GiB), which make it more difficult for the fastText models to learn the semantic relations between words.

5 Conclusion

Sizes of character n-grams have a large impact on the accuracy of fastText language models and their word embeddings. However, they are computationally expensive to optimize on large corpora. In this work, we showed the optimal n-gram sizes for the Czech and Italian fastText language models, and we confirmed the prior optimal n-gram sizes reported for English and German. Furthermore, we discovered that the optimization of n-gram sizes significantly improves the performance of the Czech fastText model on the word analogy task compared to the default n-gram sizes.
Finally, we showed that slow n-gram size optimization can be replaced with fast n-gram frequency analysis.

It remains to be confirmed in future work that modeling only the n-grams that align with the morphological units of a language leads to even faster training, smaller models, and stronger results. Also, the example of the Czech vs. German performance on the word analogy category "verbs, past tense" in Table 5 suggests that static embeddings are not enough for word representation, and that we need more complex word representations in context (Peters et al., 2018; Devlin et al., 2019).

References
Giacomo Berardi, Andrea Esuli, and Diego Marcheggiani. 2015. Word Embeddings Go to Italy: A Comparison of Models and Training Datasets. In CEUR Workshop Proceedings of the 6th Italian Information Retrieval Workshop, IIR 2015, volume 1404, page 8.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomáš Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning Word Vectors for 157 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Maximilian Köper, Christian Scheible, and Sabine Schulte im Walde. 2015. Multilingual reliability and "semantic" structure of continuous word spaces. In Proceedings of the 11th International Conference on Computational Semantics, pages 40–45.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In 1st International Conference on Learning Representations (ICLR 2013), Workshop Track Proceedings.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Lukáš Svoboda and Tomáš Brychcín. 2016. New Word Analogy Corpus for Exploring Embeddings of Czech Words. In 19th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016, Konya, Turkey, April 3–9, 2016.