Language Detection Engine for Multilingual Texting on Mobile Devices
Sourabh Vasant Gothe, Sourav Ghosh, Sharmila Mani, Guggilla Bhanodai, Ankur Agarwal, Chandramouli Sanchi
Samsung R&D Institute Bangalore, Karnataka, India 560037
Email: {sourab.gothe, sourav.ghosh, sharmila.m, g.bhanodai, ankur.a, cm.sanchi}@samsung.com

Abstract—More than 2 billion mobile users worldwide type in multiple languages in the soft keyboard. On a monolingual keyboard, 38% of falsely auto-corrected words are valid in another language. This can be easily avoided by detecting the language of typed words and then validating it in its respective language. Language detection is a well-known problem in natural language processing. In this paper, we present a fast, light-weight and accurate Language Detection Engine (LDE) for multilingual typing that dynamically adapts to the user-intended language in real time. We propose a novel approach where the fusion of a character N-gram model [1] and a logistic regression [2] based selector model is used to identify the language. Additionally, we present a unique method of reducing the inference time significantly by a parameter reduction technique. We also discuss various optimizations fabricated across LDE to resolve ambiguity in input text among languages with the same character pattern. Our method demonstrates an average accuracy of 94.5% for Indian languages in Latin script and of 98% for European languages on code-switched data. This model outperforms fastText [3] by 60.39% and ML-Kit by 23.67% in F1 score [4] for European languages. LDE is faster on mobile device, with an average inference time of 25.91 µs.

Index Terms—Language detection, multilingual, character N-gram, logistic regression, parameter reduction, mobile device, Indian macaronic languages, European languages, soft-keyboard

I. INTRODUCTION
In the current era of social media, language detection is a much-required intelligence in mobile devices for many applications, viz. translation, transliteration, recommendations, etc. Language detection algorithms work almost accurately when the language scripts are distinct, using a simple script detection method. In India, there are 22 official languages, and almost every language has its own script, but in general users prefer to type in Latin script. As per our statistical analysis, 39.78% of words typed in the QWERTY layout are from Indian languages. Hindi is a popular Indian language, and 22.8% of Hindi language users use the QWERTY keyboard for typing, which implies the need to support languages written in Latin script.

Standard languages written in Latin script, i.e., typed on a QWERTY keyboard, are referred to as macaronic languages. These languages can share the same character pattern with other languages, unlike standard ones. For example, when the Hindi language is written in Latin script (Hinglish), the word "somwar", which means Monday, shares the same text pattern with the English word "Ransomware" (Ran-somwar-e). In such cases, character-based probabilistic models alone fail to identify the exact language, as the probability will be high for multiple languages. Also, the user may type based on the phonetic sound of the word, which leads to variations like "somwaar", "somvar", "somvaar", etc., which are completely user dependent.

The soft keyboard provides next-word predictions, word completions, auto-correction, etc. while typing. The Language Models (LMs) responsible for those are built using Long Short-Term Memory Recurrent Neural Network (LSTM RNN) [5] based Deep Neural Network (DNN) models [6] with a character-aware CNN embedding [7]. We use the knowledge distilling method proposed by Hinton et al. [8] to train the LM [9].

(ML-Kit language identification: https://firebase.google.com/docs/ml-kit/android/identify-languages)
Along with the LMs, adding another DNN-based model for detecting the language, executing on every character typed, would increase the inference time and memory usage and lead to lag on a mobile device. Additionally, extensibility is a major concern in soft keyboards: adding one or more languages to the keyboard based on locality, or discontinuing the support of a language, should be effortless.

Taking the above constraints into account, we present the Language Detection Engine (LDE), an amalgamation of character N-gram models and a logistic regression based selector model. The engine is fast at inference on mobile devices, light-weight in model size, and accurate for both code-switched (switching between languages) and monolingual text. This paper discusses various optimizations performed to increase engine accuracy compared to DNN-based solutions in ambiguous cases of code-switched input text.

We also discuss how LDE performs on five Indian macaronic languages, Hinglish (Hindi in English), Marathinglish (Marathi in English), Tenglish (Telugu in English), Tanglish (Tamil in English), and Benglish (Bengali in English), and four European languages: Spanish, French, Italian, and German. Since the typing layout is of the English language (Latin script), we term English the primary language and all other languages secondary languages.

II. RELATED WORK
In this section, we discuss work related to language detection, covering both N-gram based and deep learning based approaches.

Fig. 1: Char N-gram accuracy over corpus size (accuracy (%) vs. corpus size in lines, for Hinglish, Tenglish, Spanish, and German)

A. N-gram based models

Ahmed et al. [10] detail language identification using N-gram based cumulative frequency addition to increase the classification speed for a document. This is achieved by reducing counting and sorting operations through the cumulative frequency addition method. In our problem, we detect the language based on user-typed text rather than a document with large information content.

Vatanen et al. [1] compared the naive Bayes classifier based character N-gram and ranking methods for the language detection task. Their paper focuses on detecting short segments of length 5 to 21 characters, and all the language models are constructed independently of each other without considering the final classification in the process. We have adopted a similar methodology for building char N-gram models in our approach.

Erik Tromp et al. [11] discuss Graph-Based N-gram Language Identification (LIGA) for short and ill-written texts. LIGA outperforms other N-gram based models by capturing elements of language grammar as a graph. However, LIGA does not handle the case of code-switched text.

All the above referred models do not prioritize recently typed words. For seamless multilingual texting, which involves continuous code-switching, more priority must be given to the recent words so that suggestions from the currently detected language model can be fetched by the soft keyboard and shown to the user.

B. Deep learning based models
Lopez et al. propose a DNN-based language detection for spoken utterances [12], motivated by the success of DNNs in acoustic modeling. They train a fully connected feed-forward neural network along with logistic regression calibration to identify the exact language. Javier Gonzalez-Dominguez et al. [13] further extend this to utilize LSTM RNNs for the same task.

Zhang et al. [14] have recently presented CMX, a fast, compact model for detection of code-mixed data. They address the same problem as ours but in a different environment. They train a basic feed-forward network that predicts the language for every token passed, where multiple features are used to obtain accurate results. However, such models require huge training data and are not feasible in terms of extensibility and model size for mobile devices.

This paper presents a novel method to resolve the ambiguity in input text and detect the language accurately in a multilingual soft keyboard for five Indian macaronic languages and four European languages.

III. PROPOSED METHOD
We propose the Language Detection Engine (LDE), which enhances the user experience in multilingual typing by accurately deducing the language of input text in real time. LDE is a union of (a) a character N-gram model, which gives the emission probability of input text originating from a particular language, and (b) a selector model, which uses the emission probabilities to identify the most probable language for a given text using logistic regression [15]. This unique architecture of independent character N-gram models with a selector model is able to detect the code-mixed multilingual context accurately.

A. Emission probability estimation using Character N-gram

Character N-gram is a statistical model which estimates a probability distribution over the character set of a particular language given its corpus.
1) Train data:
The training corpus is generated by crawling online data from various websites. For Indian macaronic languages, we crawled the native script data of the languages (for example, Devanagari script for Hindi) and reverse-transliterated it to Latin script. This data is validated by language experts for quality.

We experimented with various corpus sizes to train the character N-gram model and found that k sentences show the best accuracy from the model on a sample test set, as detailed in Fig. 1.
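As a concrete sketch of the scoring this section builds up to, the snippet below trains a character tri-gram model with whitespace padding and applies the recency weighting of Eqs. (1)–(2) to a word sequence. The class and function names, the value of r, and the add-one smoothing are our assumptions for illustration; the paper does not specify its smoothing scheme.

```python
import math
from collections import defaultdict

class CharTrigramModel:
    """Per-language character tri-gram model; words are padded with two
    leading spaces so word-initial patterns are captured."""

    def __init__(self, corpus):
        counts = defaultdict(int)    # (c_{k-2}, c_{k-1}, c_k) -> count
        context = defaultdict(int)   # (c_{k-2}, c_{k-1}) -> count
        charset = set()
        for word in corpus:
            charset.update(word)
            seq = "  " + word
            for k in range(2, len(seq)):
                counts[(seq[k - 2], seq[k - 1], seq[k])] += 1
                context[(seq[k - 2], seq[k - 1])] += 1
        self.counts, self.context = counts, context
        self.vocab = len(charset) + 1  # character set plus whitespace

    def word_log_prob(self, word):
        """Sum of log P(c_k | c_{k-2} c_{k-1}) with add-one smoothing."""
        seq = "  " + word
        logp = 0.0
        for k in range(2, len(seq)):
            num = self.counts[(seq[k - 2], seq[k - 1], seq[k])] + 1
            den = self.context[(seq[k - 2], seq[k - 1])] + self.vocab
            logp += math.log(num / den)
        return logp

def sequence_log_prob(model, words, r=0.7):
    """Eq. (2): recent words get weight r^(n-k); r=0.7 is an assumed value."""
    n = len(words)
    return sum((r ** (n - 1 - k)) * model.word_log_prob(w)
               for k, w in enumerate(words))
```

Note that the trailing word always receives weight 1, so a code-switch on the most recent word dominates the sequence score.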
2) Model training:
For every supported language l_i, we train a character N-gram model C_li independently of the other languages, as shown in Fig. 2. The probability of a sequence of words t_{1..n} in language l_i is given by

    P_li(t_{1..n}) = ∏_{k=1}^{n} P_li(t_k)^(r^(n−k)), where r ∈ (0, 1)    (1)

We prioritize the probability of the most recent word over the previous words using a variable r, with value ranging between 0 and 1, which effectively reduces the impact of the leading words' probability by converging their weights r^(n−k) closer to 0. To prevent underflow of values we use logarithmic probabilities. Mathematically,

    log P_li(t_{1..n}) = ∑_{k=1}^{n} r^(n−k) · log P_li(t_k)    (2)

The probability of the sequence of characters in a word t, represented as c_{0..m} where c_0 is the space character, is given by

    P_li(c_{0..m}) = P_li(c_1 | c_0) · ∏_{k=2}^{m} P_li(c_k | c_{k−2} c_{k−1})    (3)

These trained models C_li are used to estimate the emission probability of the character sequence during inference for language l_i. We have chosen n to be 3 in the N-gram model, i.e., a character tri-gram model is trained on the corpus.

Fig. 2: Model Training (words w_1..w_n are scored by the per-language char N-gram models C_L1..C_Lm, producing emission probabilities EP(w_n)_Lm; these feed the logistic regression selector model, which yields weight and bias vectors (w_L1, b_L1)..(w_Lm, b_Lm); parameter reduction converts them into final threshold values T_L1..T_Lm, where m is the number of supported languages)

B. Selector Model
Here we briefly discuss the motivation behind the additional selector model. Firstly, input text originating from one language may also have significant emission probability in another language belonging to the same family. This is because of words sharing similar roots and frequent usage of loan words. For example, the Spanish word "vocabulario" shares a linguistic root with its English counterpart "vocabulary". Again, "jungle", which is a frequently used word in English, is actually a loan word via Hindi from Sanskrit. The presence of such words in the training corpus increases the perplexity of the model, i.e., emission probabilities will be higher for multiple languages, which makes it difficult to deduce the ultimate language.

Secondly, as the character N-gram model is trained based on the frequency of characters in the corpus, it is dependent on the size of the character set. The emission probability values become incomparable, as languages with a smaller character set will statistically get higher values. Hence, these probabilities from the character N-gram alone are not sufficient to determine the source language accurately. To this end, we present a logistic regression based selector model which addresses the illustrated problems.

The selector model S comprises weight and bias vectors w and b respectively, of size m, where m is the number of supported languages. This model transforms the emission probability provided by the char N-gram such that the new probability value of word t_n, P′_li(t_n), is given by

    log P′_li(t_n) = w_li · log P_li(t_n) + b_li    (4)

where l_i is deemed to be the origin language of word t_n if

    P′_li(t_n) ≥ 0.5    (5)
1) Train data:
The training data for the selector model is the vector of emission probabilities of word t_n for every language l_i. A batch of 200k labeled words is used for training the parameters of a particular language l_i. These 200k words comprise 100k vocabulary words belonging to language l_i and another 100k words that are equally distributed among the other languages.

The trained character N-gram models C_{l_1..m} provide the required emission probabilities for every word from the m different languages.
2) Model Training:
The weight and bias vectors of the selector model are trained such that, for every input word, the probability for the labeled language is greater than 0.5, as given in Equation (5). These trained weight and bias vectors are used to obtain the new probability values given by Equation (4), which are now comparable among languages. The newly estimated probabilities resolve the ambiguity in input text among languages which have the same patterns, with a clearly dominant probability for the final detected language.

As shown in Fig. 2, the selector model takes the emission probabilities EP(w_n)_Lm for every word w_n from each language L_m as input from the pre-trained character N-gram models C_lm and yields the weight and bias vectors w_li and b_li respectively.
3) Parameter Reduction:
LDE performs a set of computations to detect the language on the mobile device, which we term on-device inference. For every character typed by the user on the soft keyboard, on-device inference happens, followed by inference of the DNN language model to provide next-word predictions, word completions, auto-correction, etc. based on the context.

To optimize on-device inference time, we propose a novel method of parameter reduction which reduces multiple computations during inference to a single arithmetic operation. Equation (4) is simplified to combine the weight and bias parameters into a single threshold value τ_li, given by Equation (8), which effectively reduces the computation to constant time. From Equations (4) and (5),

    log P′_li(t_n) ≥ log 0.5
    w_li · log P_li(t_n) + b_li ≥ log 0.5    (6)

This can be further reduced to

    log P_li(t_n) ≥ (log 0.5 − b_li) / w_li
    ⇒ log P_li(t_n) − (log 0.5 − b_li) / w_li + log 0.5 ≥ log 0.5
    ⇒ log P_li(t_n) − ((w_li − 1) · log 2 − b_li) / w_li ≥ log 0.5
    ∴ log P_li(t_n) − τ_li ≥ log 0.5    (7)

where τ_li is a parameter given by

    τ_li = ((w_li − 1) · log 2 − b_li) / w_li    (8)
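The reduction in Eqs. (6)–(8) is exactly equivalent to the original decision rule whenever w_li > 0 (expected here, since a higher emission probability should make the language more likely). A quick numeric sanity check, with arbitrary illustrative values:

```python
import math

LOG_HALF = math.log(0.5)

def decide_full(log_p, w, b):
    # Eq. (6): w * log P + b >= log 0.5
    return w * log_p + b >= LOG_HALF

def decide_reduced(log_p, w, b):
    # Eqs. (7)-(8): log P - tau >= log 0.5,
    # with tau = ((w - 1) * log 2 - b) / w precomputed offline
    tau = ((w - 1.0) * math.log(2.0) - b) / w
    return log_p - tau >= LOG_HALF
```

The identity follows by dividing Eq. (6) by w and using log 0.5 = −log 2, so at inference time only the subtraction in `decide_reduced` remains.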
4) On-device inference:
From Equations (6) and (7), it is evident that we can obtain the ultimate probability log P′_li(t_n) just by subtracting the threshold value τ_li from the logarithmic emission probability log P_li(t_n), as given below:

    log P′_li(t_n) = log P_li(t_n) − τ_li    (9)

where l_i is the language and t_n is the character sequence. This makes probabilities from different languages comparable.

IV. ENGINE ARCHITECTURE
The Language Detection Engine consists of multiple components across phases: pre-processor, optimizers, char N-gram, and selector. The input text is first pre-processed and passed to the optimization phase, where multiple heuristics are applied to address the enigmatic cases; it then proceeds to char N-gram inference and finally to the language selector phase to obtain the detected language. The end-to-end architecture is represented in Fig. 3. In this section, we explain each phase of the engine in detail.

Fig. 3: Engine Architecture (input "Lingua deteccion" flows through the Pre-processor (Tokenizer, Caching, Special Symbol Handler), Optimizers (Short-text Handler, Typo Handler, Pronoun Exclusion), Char N-gram (model loading, recent word priority, probability estimation), and Selector Model (model loading, threshold computing, inference), yielding "Spanish")

A. Pre-processor
In this phase, the input text is preprocessed to obtain the required information from the larger context.
1) Special Symbol Handler:
In the soft keyboard, the input may not be only text but can also include various ideograms like emojis, stickers, etc. This handler trims the input and provides only the data that is necessary to detect the language.
2) Tokenizer:
The engine tokenizes the input context into tokens with whitespace as a delimiter. The last two tokens are concatenated and processed for language detection, which is observed to be most efficient in terms of processing time and accuracy compared to considering more than two tokens. For short words with character length ≤ 2, tokenizing is left to the short-text handler.
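A minimal sketch of this tokenization rule (the function name is ours):

```python
def detection_context(text):
    """Split on whitespace and keep the last two tokens as the
    language-detection context; shorter inputs pass through whole."""
    tokens = text.split()
    return " ".join(tokens[-2:])
```

Longer contexts are deliberately discarded here; the short-text handler described below re-extends the window only when the trailing tokens are too short to be informative.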
3) Caching:
Based on the currently detected language, multiple algorithms like auto-correction, auto-capitalization, touch-area correction [16] [17], etc. tune the word suggestions accordingly in real time. This leads to multiple calls to LDE for the same input text; hence LDE caches the language of the previously typed text, to avoid the redundant task of detecting the language again.

B. Optimizers
LDE addresses enigmatic cases in multilingual typing by applying additional optimizations that are discussed below.
1) Short-text Handler:
The context is the entire input that the user has typed, and the engine uses the previous two tokens of the context to detect the language. In the case of short words with character length less than or equal to two, it becomes ambiguous to detect the language. For example, "to me" is a valid context in English as well as in Hinglish; in such cases, the words before this context help to deduce the exact source language. So extending the context to prior words when the context word length is less than or equal to two resolves the ambiguity for the engine to decide upon short words. We observed ∼ improvement in the accuracy of Indian macaronic languages with this change.
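One way to implement this heuristic is to keep extending the context backwards while every token taken so far is short (≤ 2 characters). The sketch below uses names and defaults of our own choosing:

```python
def extended_context(tokens, base=2, max_short=2):
    """Return the trailing tokens used for detection, extending past the
    usual two whenever all tokens taken so far are short (<= 2 chars)."""
    take = base
    while take < len(tokens) and all(len(t) <= max_short for t in tokens[-take:]):
        take += 1
    return tokens[-take:]
```

So a trailing "to me" pulls in the preceding longer word, which carries the discriminative character patterns.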
2) Typo Handler:
When a user makes a typo, it is often harder to decide from which language the suggestions or corrections should be provided. To address this issue, LDE obtains a correction candidate word from the non-current language LM [9] with an edit distance of one, effectively avoiding wrong auto-correction of valid words from the other language. The example below illustrates the need for this heuristic:

    "[Hello]_E [bhai suno]_H [can we meet]_E [ksl]_*"

where the subscript []_E indicates English text, []_H Hinglish, and []_* a typo. When this context is typed on an English and Hinglish bilingual keyboard, the engine fetches one auto-correction candidate, [kal] (meaning tomorrow), with an edit distance of one. Though the previous two words are from English, LDE manages to auto-correct the typo into a valid word from the non-current language, Hinglish.

The typo handler automatically adapts to the user's behavior while typing and provides valid corrections from the LM. For Indian macaronic languages, ∼ improvement, and for European languages ∼ improvement, was observed in the F1 score of auto-correction on a linguist-written bilingual test set. In a closed beta trial with soft keyboard users over a period of two months, 38% of the falsely auto-corrected words were valid in another language. LDE is able to suppress these false auto-corrections and improve the auto-correction performance by 43.71% on a monolingual keyboard.
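A sketch of the candidate lookup: `edit_distance_one` is a plain Levenshtein check, and the list passed in stands in for the non-current language's vocabulary (both function names and the toy vocabulary are illustrative, not the paper's implementation):

```python
def edit_distance_one(a, b):
    """True iff the Levenshtein distance between a and b is exactly 1."""
    if abs(len(a) - len(b)) > 1:
        return False
    # standard dynamic-programming edit distance, small inputs only
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1] == 1

def cross_language_candidate(typo, other_vocab):
    """Return a word from the other language's vocabulary at edit
    distance one from the typo, if any."""
    for word in other_vocab:
        if edit_distance_one(typo, word):
            return word
    return None
```

In the example above, "ksl" matches Hinglish "kal" at distance one even though the recent context is English.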
3) Pronoun Exclusion:
Practically, there is no particular language associated with a proper noun alone; it follows the language of the entire context. To address this, the engine stores a linguist-validated proper noun list in a TRIE data structure [18] for efficient look-up. If the typed word is found in this list, the cached language for the input excluding the proper noun is considered as the detected language.

C. Char N-gram
1) Model loading:
As explained in Section III, a tri-gram model is used to obtain the emission probabilities of the character sequence. So there are (n+1)^3 possible character tri-grams in a language with a character set of length n and one additional character, i.e., whitespace ' '. These probabilities are pre-computed for every language and stored separately in a binary data file, which is further compressed using zlib [19] compression to reduce the ROM size on the mobile device. Considering whitespace [10] as an extra character for training the char N-gram model makes an impact when the character pattern is the same among multiple languages. Fig. 4 depicts an average gain of 10.13% achieved for European languages when whitespace is considered.

Fig. 4: Pictorial representation of Table III

For loading the model on device, the data file is uncompressed and the probabilities are loaded into an array for every language. Due to the modularity of the model files, we can upgrade a model, add a new language, or remove an existing one just by training the required language's model. Such provision addresses the extensibility issue for soft keyboards effectively.
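A sketch of the per-language model file: precomputed log probabilities serialized and zlib-compressed, one file per language, so a language can be added or removed by shipping or deleting a single file. The JSON payload is our assumption; the paper only specifies a zlib-compressed binary data file.

```python
import json
import zlib

def pack_model(trigram_log_probs):
    """Serialize a language's precomputed tri-gram log probabilities and
    compress with zlib to cut the on-disk (ROM) size."""
    payload = json.dumps(trigram_log_probs, sort_keys=True).encode("utf-8")
    return zlib.compress(payload, 9)

def load_model(blob):
    """Decompress a model file back into the probability table."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

Keeping each language in its own blob is what makes the round trip per-language: upgrading one model never touches the others.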
2) Recent word prioritization:
In multilingual typing, code-switching happens continuously: input begins in one language and eventually switches to another. To detect the current language in real time, priority should be given to the recently typed character sequence, as mentioned before. In the example below,

    "[Our company is]_E [intentando]_ES"

the first three words are English and the next is Spanish. Ideally, the currently detected language should be Spanish, as there is a code-switch. But if all characters are treated equally, the most probable language will be English. From Equation (1) it can be observed that our char N-gram prioritizes trailing words over leading words, resulting in detecting the language accurately.

D. Selector Model

1) Threshold computing:
As explained in Section III, the logistic regression model is trained using the library provided by sklearn [15] in Python. For every language, the model is trained to obtain the weight and bias, which are further reduced to determine the threshold value as explained in Equation (8). The complete processing is done offline on a 64-bit Linux machine, and the threshold values are loaded into the model.
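With the thresholds precomputed offline, the on-device step is the single subtraction of Eq. (9) per language, followed by picking the best-scoring language. The sketch below uses illustrative threshold and probability values, not numbers from the paper:

```python
import math

def detect_language(emission_log_probs, thresholds):
    """Apply Eq. (9): adjusted score = log P - tau. Return the language
    with the highest adjusted score if it clears log 0.5 (Eq. (7)),
    else None."""
    best_lang, best_score = None, float("-inf")
    for lang, log_p in emission_log_probs.items():
        score = log_p - thresholds[lang]
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang if best_score >= math.log(0.5) else None
```

Because the weight and bias are folded into tau offline, the per-character cost on device stays constant regardless of how the selector was trained.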
2) Model loading:
After parameter reduction, every individual language has a corresponding threshold value. The threshold values are stored in the respective language's char N-gram data file itself and loaded into an array of thresholds corresponding to the supported languages on device. In this way, we curtail the effort of re-training all the models for any modification and update only the threshold values in the respective data file. LDE does not require any large infrastructure to train and build the model; all our experiments were conducted on a Linux machine with 4GB RAM.

V. EXPERIMENTAL RESULTS
We compare the Language Detection Engine performance with various baseline solutions, like the fastText library [3], langid.py [20], and Equilid, a DNN model [21], and also with Google's ML-Kit. In this section, we briefly explain the experimental set-up configured for all of the above-mentioned models and discuss the test set that we prepared for the evaluation. The performance of LDE is compared with monolingual models like fastText, langid.py, and ML-Kit, and a multilingual model such as Equilid.

A. fastText

Joulin et al. [3] have distributed a model which can identify 176 languages. We used this model to compare the performance on European languages with LDE. However, the fastText pre-trained model does not support Indian macaronic languages.

Custom fastText model for Indian macaronic languages: We trained a custom fastText supervised model. We used the reverse-transliterated corpus of all the Indian languages, which is validated by linguists. The same corpus is used to train LDE so that the evaluation is comparable. 2.5GB of corpus was used to train the fastText model for five Indian languages, each of size 500MB. The custom trained model size after quantization is 900KB.

B. ML-Kit
ML-Kit supports a total of 103 languages, including one Indian macaronic language, Hinglish. ML-Kit does not have a provision to train custom models for other Indian macaronic languages. For the experiment, a sample Android application was developed that uses the API exposed by ML-Kit to identify the language and calculate the F1 score for a given test set. The complete evaluation was performed on a Samsung Galaxy A50 device.

C. Langid.py
Langid.py is a standalone Python tool by Lui and Baldwin [20] [22] that can identify 97 languages. Langid.py is a monolingual model, i.e., it cannot identify code-switched text. Therefore we compare only on inter-sentential sentences, where no code-switching is involved within a sentence.

D. Equilid: Socially-Equitable Language Identification
Jurgens et al. propose a sequence-to-sequence DNN model [21] for detecting the language. Equilid identifies code-switched multilingual text and tags every word with the detected language. An experiment was conducted on a GPU to evaluate the metric by loading the pre-trained models. The pre-trained model is of size MB and can identify 70 languages, but none of the Indian macaronic languages are supported by Equilid.

Footnote URLs: https://developers.google.com/ml-kit ; https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz ; http://cs.stanford.edu/ jurgens/data/70lang.tar.gz

E. Performance Evaluation
We evaluate the performance on two types of test sets, based on code-switching style: (a) intra-sentential and (b) inter-sentential. These test sets are hand-written by language experts and involve natural code-switching. We evaluate the above-described methodologies and compare them with LDE.

TABLE I: Description of the Intra-sentential test set
Language | Words | Characters | Code-switch (%)
French | – | – | –
Italian | – | – | –
German | – | – | –
Spanish | – | – | –
Hinglish | – | – | –
Benglish | – | – | –
Marathinglish | – | – | –
Tanglish | – | – | –
Tenglish | – | – | –

TABLE II: Comparison on Intra-sentential test set
Language | F1 score: fastText | LDE | ML-Kit | Equilid
French | – | – | – | –
Italian | – | – | – | –
German | – | – | – | –
Spanish | – | – | – | –
Hinglish | – | – | – | −
Benglish | – | – | − | −
Marathinglish | – | – | − | −
Tanglish | – | – | − | −
Tenglish | – | – | − | −

Intra-sentential test set: In this type, the code-switching can occur anywhere in the sentence, with two possibilities. i) Test set 1: context written mainly in the primary language, English, and partly in secondary languages, for example, "Can you believe midterms comienza next week", where a Spanish word is used while typing in English. ii) Test set 2: context written mainly in a secondary language and partly in the primary language, for example, "Justo thinking en ti", where an English word is used while typing in Spanish.

A uniformly distributed test set of these two types was taken by picking sentences from each one. Every word in a test sentence is manually tagged with its source language. As there are multiple code-switches involved, context-level language detection is performed, i.e., the current language is identified based on the previous two words, which is exactly the way LDE identifies the language for the soft keyboard.

Statistics for these test sets, like the percentage of code-switching involved, characters, and words, are shown in Table I.

Fig. 5: Pictorial representation of Table II
F1 score: Table II shows the comparison of F1 score between fastText [3], ML-Kit, Equilid [21], and LDE for European and Indian macaronic languages. For European languages, LDE outperforms fastText by 60.39%, exceeds Google's ML-Kit by 23.67%, and surpasses the Equilid DNN model by 1.55%. For Indian macaronic languages, LDE is 44.29% better than fastText and exceeds ML-Kit by 7.6% for Hinglish. It can be observed that LDE performs better than the DNN-based models, which are huge in model size. Fig. 5 visualizes the performance of the various language detection models on the intra-sentential test set, where LDE is on par with Equilid and significantly dominates fastText and ML-Kit.

TABLE III: Comparison on Inter-sentential test set
Language | F1 score: fastText | LDE | ML-Kit | LangId.py
French | – | – | – | –
Italian | – | – | – | –
German | – | – | – | –
Spanish | – | – | – | –
Hinglish | – | – | – | −
Benglish | – | – | − | −
Marathinglish | – | – | − | −
Tanglish | – | – | − | −
Tenglish | – | – | − | −

Inter-sentential test set: In this type of data, code-switching occurs only after a sentence in the first language is completely typed. Test sentences from every language combination are used to obtain the metric. Unlike in the previous case, here we evaluate sentence-level accuracy for each model, as there is no code-switching involved within a sentence. Additionally, we evaluated the same test set on LangID.py [20], which is a popular off-the-shelf model for this type of data.

Fig. 6: Pictorial representation of Table III
F1 score: On the inter-sentential test set, all the models perform accurately, as there is a long context to identify. Table III shows the F1 score for fastText, ML-Kit, and LangId for European and Indian languages. It is observed that LDE is on par with ML-Kit and fastText and better than LangId.py for European languages. However, for Indian languages, LDE dominates fastText by 10% and ML-Kit by 22.95% for Hinglish, which shows that LDE performs as well as the DNN models. Figure 6 shows the comparison of the various models' performance on the inter-sentential test set.
Inference time: Table IV shows the inference time and model size for all 10 supported languages on a uniformly distributed intra-sentential and inter-sentential test set. The average inference time is 25.91 µs, and the model size of LDE for all languages combined is – KB.

TABLE IV: Average Inference time and Model size

Language | Inference Time (µs) | Model size (KB)
French | – | –
English | – | –
Italian | – | –
German | – | –
Spanish | – | –
Hinglish | – | –
Benglish | – | –
Marathinglish | – | –
Tanglish | – | –
Tenglish | – | –

VI. CONCLUSION
We have proposed LDE, a fast, light-weight, accurate engine for multilingual typing, with a novel approach that unites char N-gram and logistic regression models for improved accuracy. The LDE model size is 5X smaller than that of the custom-trained fastText model and ∼ better in accuracy. LDE, being a shallow learning model, either surpasses or is on par with state-of-the-art DNN models in performance. Though the char N-grams are trained on monolingual data, LDE accurately detects code-switching in multilingual text with the help of the uniquely designed selector model. LDE also improved the performance of auto-correction by 43.71% by suppressing correction of valid foreign words. Furthermore, LDE is suitably designed to support extensibility of languages.

REFERENCES

[1] T. Vatanen, J. J. Väyrynen, and S. Virpioja, "Language identification of short text segments with n-gram models."
[2] H.-F. Yu, F.-L. Huang, and C.-J. Lin, "Dual coordinate descent methods for logistic regression and maximum entropy models," Machine Learning, vol. 85, no. 1-2, pp. 41–75, 2011.
[3] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, "Bag of tricks for efficient text classification," arXiv preprint arXiv:1607.01759, 2016.
[4] C. Goutte and E. Gaussier, "A probabilistic interpretation of precision, recall and F-score, with implication for evaluation," in European Conference on Information Retrieval. Springer, 2005, pp. 345–359.
[5] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[6] T. Mikolov, S. Kombrink, L. Burget, J. Černocký, and S. Khudanpur, "Extensions of recurrent neural network language model," IEEE, 2011, pp. 5528–5531.
[7] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[8] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[9] W. Chen, D. Grangier, and M. Auli, "Strategies for training large vocabulary neural language models," arXiv preprint arXiv:1512.04906, 2015.
[10] B. Ahmed, S.-H. Cha, and C. Tappert, "Language identification from text using n-gram based cumulative frequency addition."
[11] E. Tromp and M. Pechenizkiy, "Graph-based n-gram language identification on short texts," in Proc. 20th Machine Learning Conference of Belgium and The Netherlands, 2011, pp. 27–34.
[12] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez, and P. Moreno, "Automatic language identification using deep neural networks," IEEE, 2014, pp. 5337–5341.
[13] J. Gonzalez-Dominguez, I. Lopez-Moreno, H. Sak, J. Gonzalez-Rodriguez, and P. J. Moreno, "Automatic language identification using long short-term memory recurrent neural networks," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[14] Y. Zhang, J. Riesa, D. Gillick, A. Bakalov, J. Baldridge, and D. Weiss, "A fast, compact, accurate model for language identification of code-mixed text," arXiv preprint arXiv:1810.04142, 2018.
[15] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[16] S. Azenkot and S. Zhai, "Touch behavior with different postures on soft smartphone keyboards," in Proceedings of the 14th International Conference on Human-Computer Interaction with Mobile Devices and Services. ACM, 2012, pp. 251–260.
[17] C. Thomas and B. Jennings, "Hand posture's effect on touch screen text input behaviors: A touch area based study," arXiv preprint arXiv:1504.02134, 2015.
[18] S. Mani, S. V. Gothe, S. Ghosh, A. K. Mishra, P. Kulshreshtha, M. Bhargavi, and M. Kumaran, "Real-time optimized n-gram for mobile devices," IEEE, 2019, pp. 87–92.
[19] J.-l. Gailly and M. Adler, "Zlib home site," 2008.
[20] M. Lui and T. Baldwin, "langid.py: An off-the-shelf language identification tool," in Proceedings of the ACL 2012 System Demonstrations, ser. ACL '12. Stroudsburg, PA, USA: Association for Computational Linguistics, 2012, pp. 25–30. [Online]. Available: http://dl.acm.org/citation.cfm?id=2390470.2390475
[21] D. Jurgens, Y. Tsvetkov, and D. Jurafsky, "Incorporating dialectal variability for socially equitable language identification," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Vancouver, Canada: Association for Computational Linguistics, Jul. 2017, pp. 51–57.
[22] M. Lui and T. Baldwin, "Cross-domain feature selection for language identification."