Co-occurrences using Fasttext embeddings for word similarity tasks in Urdu
Usama Khalid, Aizaz Hussain, Muhammad Umair Arshad, Waseem Shahzad, Mirza Omer Beg
Usama Khalid
Department of Computer Science, AIM Lab, NUCES (FAST)
Islamabad, [email protected]
Aizaz Hussain
Department of Computer Science, AIM Lab, NUCES (FAST)
Islamabad, [email protected]
Muhammad Umair Arshad
Department of Computer Science, AIM Lab, NUCES (FAST)
Islamabad, [email protected]
Waseem Shahzad
Department of Computer Science, AIM Lab, NUCES (FAST)
Islamabad, [email protected]
Mirza Omer Beg
Department of Computer Science, AIM Lab, NUCES (FAST)
Islamabad, [email protected]
Abstract—Urdu is a widely spoken language in South Asia. Though a considerable amount of literature exists for the Urdu language, the available data is still not enough to process the language naturally with NLP techniques. Very efficient language models exist for the English language, a high-resource language, but Urdu and other under-resourced languages have been neglected for a long time. To create efficient language models for these languages we must have good word embedding models. For Urdu, we can only find word embeddings trained and developed using the skip-gram model. In this paper, we build a corpus for Urdu by scraping and integrating data from various sources and compile a vocabulary for the Urdu language. We also modify Fasttext embeddings and N-Grams models to enable training them on our built corpus. We use these trained embeddings for a word similarity task and compare the results with existing techniques. The datasets and code are made freely available on GitHub (https://github.com/usamakh20/wordEmbeddingsUrdu).

Index Terms—Word Embeddings, Ngrams, Fasttext, Urdu, Word2Vec, Skip-Gram, Low Resource
I. INTRODUCTION
The Urdu language originated back in the 12th century with an Indo-Aryan vocabulary base [1] and is a mixture of Arabic and Persian. Urdu is widely spoken and written in the South Asian region, with more than 170 million speakers, specifically in Pakistan and India. Despite this, Urdu [2] is considered a low-resourced language because of insufficient data [3] as compared to English and other widely spoken languages. In recent times the paradigm has shifted towards the development of efficient models for low-resource languages [4]. Many deep learning and machine learning techniques are used to train language models for the derivation of semantics from given textual data [5]. To derive meaningful information from text it is useful to find out the relation between words. For example, as shown in Fig. 2, words are clustered together based on their similarity. Language models store this information, which can then be used for many downstream tasks.

Fig. 1. N dimensional visualization of word vectors in 3D space.

Machines cannot understand language [6] the way we do, so data cannot be passed into the network as it is; instead, each word is converted into an N dimensional vector. These representations are known as word embeddings. An example representation of these embeddings is shown in Fig. 1. The words are projected from an n dimension to 3D [7], [8]. Words with related meanings tend to appear close together. Word embeddings are the baseline for any natural language processing task, e.g., transliteration, natural language generation, understanding user inputs, etc. [9].

Fig. 2. An overview of the Fasttext architecture.

All these vectors combined show how much a word is similar to others in a given vector space [10].
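Similarity between such word vectors is typically measured with cosine similarity. A minimal sketch, using toy three-dimensional vectors rather than the actual 100-dimensional embeddings, which are not reproduced here:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings" for illustration only.
king = [0.9, 0.1, 0.3]
queen = [0.85, 0.15, 0.35]
table = [0.1, 0.9, 0.2]

# Related words should score higher than unrelated ones.
assert cosine_similarity(king, queen) > cosine_similarity(king, table)
```

Words with related meanings then receive scores close to 1, which is what the clustering in Fig. 2 visualizes.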
For Urdu, a lot of work has been done on semantic analysis and sentence classification; however, there are no studies that show a performance analysis of word embedding models on the Urdu language [11]. Unlike the studies conducted for widely spoken languages, in this paper we use different word embedding models to compute similarity scores for words in the Urdu language.

In this paper, we use the freely available Urdu news text corpus COUNTER [12], [13], which contains data from 1200 documents collected from different news agencies of Pakistan. We then modify existing Fasttext and n-grams approaches to be applied to Urdu data, and we train and provide embeddings. Additionally, we compare our trained model and embeddings to the previously available skip-gram technique [14], [15] on a word similarity task [15], [16].

This paper is organized into multiple sections as follows: In Section 2, we look into the related research. In Section 3, we discuss the methodology and the experimental hypothesis. In Section 4, we look at the experimentation results. Finally, in Section 5 we summarize our work and discuss the possible contributions and future directions.

II. LITERATURE REVIEW
A lot of work has been done in the Urdu language in terms of POS tagging, sentiment analysis, NER, stemming, MT, and topic modeling [1], [6], [11], [17]–[26], but not much work has been done on word embeddings for Urdu. These word embeddings play a major role in natural language understanding. Multiple embedding training architectures have been introduced, e.g. BERT [27] and Word2Vec.

There are many ways in which word vectors can be represented; among them is the one-hot encoding vector representation [28]. In one-hot encoding the words are represented as long binary vectors [24]. To formulate one-hot vectors for a corpus, they can be aggregated to form the BoW (Bag of Words) representation [29], [30]. The bag of words maintains a dictionary of all possible words in the language and keeps track of the frequency of each word encountered in the particular corpus.

The problem with BoW is that it doesn't keep track of word similarity and contextual meaning. To solve the word similarity problem, TF-IDF (Term Frequency - Inverse Document Frequency) [31], [32] was introduced. It associates each word in a document with a number which is a measure of how relevant that word is. Based on this similarity of words, one can compare the similarity of multiple documents.

Word2Vec is a fusion of two architectures, CBOW and Skip-Gram [33]. These architectures are designed to be mirror images of one another [34]: the CBOW model tries to predict a word from its surrounding context words, while the Skip-Gram model tries to predict the context words closest to an input word.

Word embeddings help us considerably improve NLP techniques for low-resourced languages. In the context of Urdu, the only word embeddings present in the literature are those of Skip-Gram [14]. To create large-sample word embeddings for Urdu, 140 million sentences in Urdu were used.
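The BoW representation described above can be sketched in a few lines. The documents here are hypothetical English toys for readability; the idea is identical for Urdu tokens:

```python
from collections import Counter

def bag_of_words(documents):
    """Build a sorted vocabulary and per-document term-frequency vectors."""
    vocab = sorted({word for doc in documents for word in doc.split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())
        # One count per vocabulary entry, in vocabulary order.
        vectors.append([counts.get(word, 0) for word in vocab])
    return vocab, vectors

docs = ["the cat sat", "the cat ate the fish"]
vocab, vecs = bag_of_words(docs)
# vocab → ['ate', 'cat', 'fish', 'sat', 'the']
```

Note how the vectors record only frequencies: two documents with the same words in a different order are indistinguishable, which is exactly the similarity/context problem that TF-IDF and embeddings address.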
To check the accuracy of the learned embeddings, the closest neighbouring words were analyzed with respect to different words in the vector space [35], context window sizes, and their performance on Urdu transliteration.

The basic idea behind the N-Gram language model is to assign a probability to each word in a given sequence of words [36], [37]. In word embeddings the words are dissected into multiple chunks of N characters and then these chunks are assigned probabilities. Using these probabilities the closest context of a word can be calculated in the vector space. By the calculation of probabilities, this model is very helpful in natural language generation, sentence completion [38], sentence correction, etc. The main issue with the N-Gram model is that it is sensitive to the training corpus. Many models have been introduced which combine N-Grams and neural networks to overcome the problems of N-Grams and generate more accurate results.

Fig. 3. An overview of the N-Gram model used in this research.

Fasttext is primarily an architecture developed by Facebook for text classification [39]. Fasttext works on the principle of the Word2Vec and n-grams techniques. In Word2Vec the words of a text are fed into the neural network individually. In Fasttext, however, each word is divided into several subwords which are then fed into the neural network. Consider the word apple: if we dissect this word into tri-grams, the resultant output would be app, ppl, and ple [40]. The word vector for apple will be the sum of all these tri-grams. After training the neural network on the training data, we get a word vector for each n-gram, and later these n-grams can be used to relate other words. Even rare words can be mapped, as there will be many overlapping n-grams that appeared in other words.

III. METHODOLOGY
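The subword decomposition that Fasttext relies on, illustrated above with the word apple, can be sketched as follows. Note that the real Fasttext implementation also wraps each word in boundary markers (< and >) before extracting n-grams; that detail is omitted here:

```python
def char_ngrams(word, n=3):
    """All contiguous character n-grams of a word (Fasttext-style subwords)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# The paper's example: tri-grams of "apple".
print(char_ngrams("apple"))  # → ['app', 'ppl', 'ple']
```

The word vector is then the sum of the vectors of these subwords, which is why an out-of-vocabulary word can still be embedded: its n-grams overlap with n-grams seen during training.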
We used two methods to train our word embedding models: Fasttext and N-Grams [41]. Fig. 3 shows the working of the modified N-Gram model used in this research. The n-gram model converts the document into tokens and stores these tokens in a dictionary based on the co-occurrences of words. That is, the number of times a token t_i appears next to a token t_j is stored in a co-occurrence dictionary. Against each key there are multiple word vectors with the probability score of their occurrence. In Fasttext a document is tokenized and passed through a network. The network learns weights which can be extracted as word embeddings. Fig. 2 shows how words are propagated through the network to extract embeddings for Urdu. In the next sections we discuss the dataset, experimentation, and results in detail.

A. Corpus
We used the Urdu Monolingual corpus [42], containing 54 million sentences, 90 million tokens, and 129K unique tokens. In the preprocessing step we removed all special characters such as brackets, single/double quotes, and spaces [43]; all these special characters are replaced by spaces. As a second step, consecutive runs of two or more spaces are matched and replaced by a single space character.
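A minimal sketch of the two preprocessing steps just described. The set of special characters here is a hypothetical placeholder, since the paper does not list the exact set:

```python
import re

# Hypothetical special-character class: brackets, braces, and quotes.
SPECIAL = r"[\[\]\(\)\{\}'\"]"

def preprocess(text):
    """Step 1: replace special characters with spaces.
    Step 2: collapse two or more consecutive spaces into one."""
    text = re.sub(SPECIAL, " ", text)
    text = re.sub(r" {2,}", " ", text)
    return text.strip()
```

The order matters: replacing special characters with spaces first means the collapse step also absorbs any spacing artifacts the replacement introduces.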
B. Techniques
We used two techniques for training [44], namely ngrams and Fasttext [45]. The ngram technique requires data to be separated sentence by sentence, where each sentence is broken down into a list of words [46]. After separating into lists of words we remove common stop words. Similar preprocessing is applied for Fasttext; however, the fasttext Python package has the tokenizer and stop-word removal tool built in. The complete architecture is given in Figures 2 and 3.
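The tokenize-filter-count pipeline behind the co-occurrence dictionary can be sketched as follows. The stop-word set is a placeholder, not the list actually used for Urdu:

```python
from collections import defaultdict

STOP_WORDS = {"aur", "ka", "ki"}  # hypothetical placeholder stop words

def cooccurrences(sentences, window=1):
    """Count how often each token appears within `window` positions
    of another token, after stop-word removal."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = [t for t in sentence.split() if t not in STOP_WORDS]
        for i, t in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if i != j:
                    counts[t][tokens[j]] += 1
    return counts
```

Normalizing each inner dictionary by its total then yields the per-key probability scores described in the methodology.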
C. Hyper Parameters
The Fasttext technique has four main hyperparameters [47] that we can tune. Vector dimension represents the length of the vector used to represent a single word; larger vectors capture more information [48] but are harder to train and need more data [49]. Epochs is the number of times the model trains on a batch of data; the larger the corpus, the fewer times it may have to be iterated. Learning rate is a measure of how quickly the model should converge to a solution [50]. Subword length specifies the length of the substrings [51] to consider for different processing tasks, like resolving out-of-vocabulary words.

For the current study we used the default parameters for Fasttext, which are:
• Vector dimension: 100
• Epochs: 5
• Learning rate: 0.05
• Subword length: min=3 & max=6

The ngrams technique only has a single hyper-parameter, namely the number of consecutive words, or grams, to train.

IV. RESULTS AND DISCUSSION
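The default configuration listed in the hyperparameters section corresponds roughly to the following call to the fasttext Python package. The corpus filename is a placeholder, and training only runs if such a file actually exists:

```python
import os

CORPUS = "urdu_corpus.txt"  # hypothetical path to the preprocessed corpus
PARAMS = dict(model="skipgram", dim=100, epoch=5, lr=0.05, minn=3, maxn=6)

if os.path.exists(CORPUS):
    import fasttext
    # Train unsupervised embeddings with the defaults described above.
    model = fasttext.train_unsupervised(CORPUS, **PARAMS)
    model.save_model("urdu_embeddings.bin")
```

This is a sketch of the setup, not the authors' exact script; in particular, whether Fasttext was run in skipgram or cbow mode is not stated in the text.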
For evaluating the similarity of the learned word representations we use Urdu-translated versions of the corpora SimLex-999 [52] and WordSim-353 [53]. SimLex-999 is a gold standard dataset for evaluating word embeddings. It contains 999 noun, adjective, and verb triplets in concrete [54] and abstract forms. The dataset is designed to evaluate similarity of words rather than relatedness and contains similarity scores for word pairs. The WordSim-353 dataset [55] contains relatedness scores for 353 word pairs.

Fig. 4. Some word embeddings produced by Fasttext mapped to a 2D space, showing how related words appear in close proximity to each other.
These datasets have been translated to Urdu using ijunoon's translation service (https://translate.ijunoon.com/) and made available. For calculating the similarity and relatedness of words we use the Spearman correlation coefficient [56], where d_i is the difference between the predicted rank and the actual rank of example i, and n is the number of examples:

r_s = 1 − (6 Σ_i d_i²) / (n(n² − 1))    (1)

Fig. 5. Comparison of our results with the skip-gram [14] embeddings evaluated on the same dataset. The Fasttext embeddings are trained as 100-dimensional vectors, so the skip-gram results are also for 100-dimensional vectors for a fair comparison.

                          WordSim-353    SimLex-999
Fasttext                  0.462
bigrams                   0.188          0.156
skip-gram [14]            0.492          0.293
Fasttext English [57]     0.84           0.76
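Equation (1) can be computed directly from two rank lists. This sketch assumes untied ranks, which is what the plain formula requires:

```python
def spearman_from_ranks(predicted_ranks, actual_ranks):
    """Spearman correlation via r_s = 1 - 6*sum(d_i^2) / (n(n^2 - 1)).

    Both inputs are rankings (no ties) of the same n items.
    """
    n = len(predicted_ranks)
    d_sq = sum((p - a) ** 2 for p, a in zip(predicted_ranks, actual_ranks))
    return 1 - (6 * d_sq) / (n * (n ** 2 - 1))

# Perfectly agreeing rankings give 1.0; fully reversed rankings give -1.0.
print(spearman_from_ranks([1, 2, 3, 4], [4, 3, 2, 1]))  # → -1.0
```

In practice, libraries such as scipy compute the same coefficient (with tie handling) from raw similarity scores rather than precomputed ranks.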
The bigrams similarity measure, as expected, produces a low correlation score. This is also because correlation is only computed for exact word matches from the corpus, which are comparatively far fewer than for Fasttext on WordSim-353 and SimLex-999. The Fasttext technique outperforms the skip-gram based technique [14] on the SimLex-999 task but slightly under-performs on WordSim-353.

V. CONCLUSION
The advent of word embedding techniques [58] was no less than a revolution in the field of NLP. It enabled the representation of words in a digital form (vectors) that computers can understand and perform mathematical calculations on, like the famous example King − Man + Woman = Queen. It also established the groundwork for modern deep attention-based models and Transformers in the field of NLP.

Urdu has long remained an under-resourced language, which has caused many proposed state-of-the-art techniques to under-perform when applied to Urdu corpora. It can also be seen in Fig. 5 that the performance of Fasttext on Urdu corpora is nowhere near that on English. In this research we have proposed word co-occurrences using bigrams and Fasttext word embeddings trained on the COUNTER corpus, evaluated our approach on the WordSim-353 and SimLex-999 similarity tasks, and compared it to the previously proposed skip-gram technique.

In future, work can be done on training these techniques on larger Urdu corpora and evaluating them on various tasks like POS tagging, NER, machine translation, sentiment analysis, and dependency parsing. In addition, larger corpora have to be proposed for Urdu if we want to at least match the performance of techniques that have been proposed for high-resource languages such as English. We hope that this work will help researchers produce better techniques in the area of Urdu NLP.

REFERENCES

[1] Mirza Beg and Mike Dahlin. A memory accounting interface for the Java programming language.
[2] Bilal Naeem, Aymen Khan, Mirza Omer Beg, and Hasan Mujtaba. A deep learning framework for clickbait detection on social area network using natural language cues. Journal of Computational Social Science, pages 1–13, 2020.
[3] Abdul Rehman Javed, Muhammad Usman Sarwar, Mirza Omer Beg, Muhammad Asim, Thar Baker, and Hissam Tawfik. A collaborative healthcare framework for shared healthcare plan with ambient intelligence. Human-centric Computing and Information Sciences, 10(1):1–21, 2020.
[4] Mirza Beg and Peter Van Beek. A graph theoretic approach to cache-conscious placement of data for direct mapped caches. In Proceedings of the 2010 International Symposium on Memory Management, pages 113–120, 2010.
[5] Hafiz Tayyeb Javed, Mirza Omer Beg, Hasan Mujtaba, Hammad Majeed, and Muhammad Asim. Fairness in real-time energy pricing for smart grid using unsupervised learning. The Computer Journal, 62(3):414–429, 2019.
[6] Rabail Zahid, Muhammad Owais Idrees, Hasan Mujtaba, and Mirza Omer Beg. Roman Urdu reviews dataset for aspect based opinion mining. In , pages 138–143. IEEE, 2020.
[7] Mirza Beg, Laurent Charlin, and Joel So. Maxsm: A multi-heuristic approach to XML schema matching. 2006.
[8] Aaditeshwar Seth and Mirza Beg. Achieving privacy and security in radio frequency identification. In Proceedings of the 2006 International Conference on Privacy, Security and Trust: Bridge the Gap Between PST Technologies and Business Services, pages 1–1, 2006.
[9] Abdul Ali Bangash, Hareem Sahar, and Mirza Omer Beg. A methodology for relating software structure with energy consumption. In , pages 111–120. IEEE, 2017.
[10] Mirza O Beg, Mubashar Nazar Awan, and Syed Shahzaib Ali. Algorithmic machine learning for prediction of stock prices. In FinTech as a Disruptive Technology for Financial Institutions, pages 142–169. IGI Global, 2019.
[11] Hussain S Khawaja, Mirza O Beg, and Saira Qamar. Domain specific emotion lexicon expansion. In , pages 1–5. IEEE, 2018.
[12] Muhammad Sharjeel, Rao Muhammad Adeel Nawab, and Paul Rayson. COUNTER: corpus of Urdu news text reuse. Language Resources and Evaluation, 51(3):777–803, 2017.
[13] Adeel Zafar, Hasan Mujtaba, Sohrab Ashiq, and Mirza Omer Beg. A constructive approach for general video game level generation. In , pages 102–107. IEEE, 2019.
[14] Samar Haider. Urdu word embeddings. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[15] Saira Qamar, Hasan Mujtaba, Hammad Majeed, and Mirza Omer Beg. Relationship identification between conversational agents using emotion analysis. Cognitive Computation, pages 1–15.
[16] Talha Imtiaz Baig, Nazish Banaras, Ebad Banissi, Rafia Bashir, Mirza Omer Beg, Junaid Bilal, Ahmad Hassan Butt, Waseem Chishti, Christos Chrysoulas, Anum Dastgir, et al.
[17] Wahab Khan, Ali Daud, Khairullah Khan, Jamal Abdul Nasir, Mohammed Basheri, Naif Aljohani, and Fahd Saleh Alotaibi. Part of speech tagging in Urdu: Comparison of machine and deep learning approaches. IEEE Access, 7:38918–38936, 2019.
[18] Neelam Mukhtar and Mohammad Abid Khan. Urdu sentiment analysis using supervised machine learning approach. International Journal of Pattern Recognition and Artificial Intelligence, 32(02):1851001, 2018.
[19] Muhammad Kamran Malik. Urdu named entity recognition and classification system using artificial neural network. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 17(1):1–13, 2017.
[20] Vaishali Gupta, Nisheeth Joshi, and Iti Mathur. Design & development of rule based inflectional and derivational Urdu stemmer 'Usal'. In , pages 7–12. IEEE, 2015.
[21] Khadija Shakeel, Ghulam Rasool Tahir, Irsha Tehseen, and Mubashir Ali. A framework of Urdu topic modeling using latent dirichlet allocation (LDA). In , pages 117–123. IEEE, 2018.
[22] Muhammad Umair Arshad, Muhammad Farrukh Bashir, Adil Majeed, Waseem Shahzad, and Mirza Omer Beg. Corpus for emotion detection on Roman Urdu. In , pages 1–6. IEEE, 2019.
[23] Saad Nacem, Majid Iqbal, Muhammad Saqib, Muhammad Saad, Muhammad Soban Raza, Zaid Ali, Naveed Akhtar, Mirza Omer Beg, Waseem Shahzad, and Muhhamad Umair Arshad. Subspace Gaussian mixture model for continuous Urdu speech recognition using Kaldi. In , pages 1–7. IEEE, 2020.
[24] Adil Majeed, Hasan Mujtaba, and Mirza Omer Beg. Emotion detection in Roman Urdu text using machine learning. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering Workshops, pages 125–130, 2020.
[25] Uzma Rani, Aamer Imdad, and Mirza Beg. Case 2: Recurrent anemia in a 10-year-old girl. Pediatrics in Review, 36(12):548–550, 2015.
[26] Zubair Baig, Mirza Omer Beg, Baber Majid Bhatti, Farzana Ahamed Bhuiyan, Tegawendé F Bissyandé, Shizhan Chen, Mohan Baruwal Chhetri, Marco Couto, João de Macedo, Randy de Vries, et al.
[27] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[28] John T Hancock and Taghi M Khoshgoftaar. Survey on categorical data for neural networks. Journal of Big Data, 7:1–41, 2020.
[29] Yin Zhang, Rong Jin, and Zhi-Hua Zhou. Understanding bag-of-words model: a statistical framework. International Journal of Machine Learning and Cybernetics, 1(1-4):43–52, 2010.
[30] Mirza Omer Beg. Performance analysis of packet forwarding on IXP2400 network processor. 2006.
[31] Bijoyan Das and Sarit Chakraborty. An improved text sentiment classification model using TF-IDF and next word negation. arXiv preprint arXiv:1806.06407, 2018.
[32] Muhammad Umer Farooq, Mirza Omer Beg, et al. Bigdata analysis of Stack Overflow for energy consumption of Android framework. In , pages 1–9. IEEE, 2019.
[33] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[34] Adeel Zafar, Hasan Mujtaba, and Mirza Omer Beg. Search-based procedural content generation for GVG-LG. Applied Soft Computing, 86:105909, 2020.
[35] M Beg. Critical path heuristic for automatic parallelization. 2008.
[36] Adam Pauls and Dan Klein. Faster and smaller n-gram language models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 258–267, 2011.
[37] Mirza Omer Beg. Flecs: A data-driven framework for rapid protocol prototyping. Master's thesis, University of Waterloo, 2007.
[38] Adeel Zafar, Hasan Mujtaba, Mirza Tauseef Baig, and Mirza Omer Beg. Using patterns as objectives for general video game level generation. ICGA Journal, 41(2):66–77, 2019.
[39] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
[40] Muhammad Umer Farooq, Saif Ur Rehman Khan, and Mirza Omer Beg. MELTA: A method level energy estimation technique for Android development. In , pages 1–10. IEEE, 2019.
[41] Hamza M Alvi, Hareem Sahar, Abdul A Bangash, and Mirza O Beg. EnSights: A tool for energy aware software development. In , pages 1–6. IEEE, 2017.
[42] Bushra Jawaid, Amir Kamran, and Ondrej Bojar. A tagged corpus and a tagger for Urdu. In LREC, pages 2938–2943, 2014.
[43] Danyal Thaver and Mirza Beg. Pulmonary Crohn's disease in Down syndrome: A link or linkage problem. Case Reports in Gastroenterology, 10(2):206–211, 2016.
[44] Ahmed Uzair, Mirza O Beg, Hasan Mujtaba, and Hammad Majeed. WEEC: Web energy efficient computing: A machine learning approach. Sustainable Computing: Informatics and Systems, 22:230–243, 2019.
[45] Mirza Beg and Peter van Beek. A constraint programming approach for integrated spatial and temporal scheduling for clustered architectures. ACM Transactions on Embedded Computing Systems (TECS), 13(1):1–23, 2013.
[46] Muhammad Tariq, Hammad Majeed, Mirza Omer Beg, Farrukh Aslam Khan, and Abdelouahid Derhab. Accurate detection of sitting posture activities in a secure IoT based assisted living environment. Future Generation Computer Systems, 92:745–757, 2019.
[47] Adeel Zafar, Hasan Mujtaba, Mirza Omer Beg, and Sajid Ali. Deceptive level generator. 2018.
[48] Hareem Sahar, Abdul A Bangash, and Mirza O Beg. Towards energy aware object-oriented development of Android applications. Sustainable Computing: Informatics and Systems, 21:28–46, 2019.
[49] Walid Koleilat, Joel So, and Mirza Beg. WatAgent: A fresh look at TAC-SCM agent design. 2006.
[50] Mirza Beg. Flecs: A framework for rapidly implementing forwarding protocols. In International Conference on Complex Sciences, pages 1761–1773. Springer, 2009.
[51] Muhammad Asad, Muhammad Asim, Talha Javed, Mirza O Beg, Hasan Mujtaba, and Sohail Abbas. DeepDetect: detection of distributed denial of service attacks using deep learning. The Computer Journal, 63(7):983–994, 2020.
[52] Felix Hill, Roi Reichart, and Anna Korhonen. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695, 2015.
[53] Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. A study on similarity and relatedness using distributional and WordNet-based approaches. 2009.
[54] Noman Dilawar, Hammad Majeed, Mirza Omer Beg, Naveed Ejaz, Khan Muhammad, Irfan Mehmood, and Yunyoung Nam. Understanding citizen issues through reviews: A step towards data informed planning in smart cities. Applied Sciences, 8(9):1589, 2018.
[55] Abdul Rehman Javed, Mirza Omer Beg, Muhammad Asim, Thar Baker, and Ali Hilal Al-Bayatti. AlphaLogger: Detecting motion-based side-channel attack using smartphone keystrokes. Journal of Ambient Intelligence and Humanized Computing, pages 1–14, 2020.
[56] Ch Spearman. The proof and measurement of association between two things. International Journal of Epidemiology, 39(5):1137–1150, 2010.
[57] Vitalii Zhelezniak, Aleksandar Savkov, April Shen, and Nils Y Hammerla. Correlation coefficients and semantic textual similarity. arXiv preprint arXiv:1905.07790, 2019.
[58] Martin Karsten, Srinivasan Keshav, Sanjiva Prasad, and Mirza Beg. An axiomatic basis for communication.