BNLP: Natural language processing toolkit for Bengali language

Abstract

BNLP is an open source language processing toolkit for the Bengali language that provides tokenization, word embedding, POS tagging, and NER tagging facilities. BNLP provides pre-trained models for model-based tokenization, embedding, POS tagging, and NER tagging of Bengali text; the pre-trained CRF taggers reach F1 scores of 80.75 for POS tagging and 66.88 for NER tagging. BNLP is widely used in the Bengali research community, with 16K downloads, 119 stars and 31 forks. BNLP is available at https://github.com/sagorbrur/bnlp

Sagor Sarker

Begum Rokeya University, Rangpur, Bangladesh

Natural language processing is one of the most important fields in computational linguistics. Tokenization, embedding, POS tagging, NER tagging, text classification, and language modeling are some of the subtasks of NLP. Any computational linguistics researcher or developer needs hands-on tools to do these subtasks efficiently. Thanks to recent advances in NLP, there are many tools and methods for word tokenization, word embedding, POS tagging, and NER tagging in English. NLTK [1], CoreNLP [2], spaCy [3], AllenNLP [4], Flair [5], and Stanza [6] are a few of these tools. They provide a variety of methods for tokenization, embedding, POS tagging, NER tagging, and language modeling for English. Support for low-resource languages like Bengali is limited or absent. A recent tool, iNLTK [7], is an initial approach for different Indic languages.
But because it groups Bengali with other Indic languages, dedicated monolingual support for Bengali is missing.

BNLP is an open source language processing toolkit for Bengali built to address this problem. It lowers the barrier to Bengali NLP tasks by:

• Providing several tokenization methods to tokenize Bengali text efficiently
• Providing several embedding methods to embed Bengali words using pretrained models, with an option to train an embedding model from scratch
• Providing a quick-start option for POS tagging or NER tagging of Bengali sentences, with an option to train a CRF-based POS tagger or NER tagger from scratch

BNLP also provides utility methods, for example to remove stopwords from Bengali text or to get the list of Bengali letters or punctuation marks. The BNLP GitHub repository hosts the package source code, the pretrained models, and the documentation. BNLP has a permissive MIT license. It is easy to install via pip or by cloning the repository, and easy to plug into any Python project.

iNLTK: https://github.com/goru001/inltk
BNLP: https://github.com/sagorbrur/bnlp
Documentation: https://bnlp.readthedocs.io/

Figure 1: Overview of BNLP's pipeline. BNLP takes raw text as input and produces trained sentencepiece, word2vec and fasttext models. Using these trained models, the BNLP prediction API performs different prediction tasks.

Figure 2: BNLP Basic Tokenization API

The BNLP tool is simple to use. A researcher or developer can integrate it by installing a single Python package. In this section we describe how to perform different NLP tasks on Bengali text using the BNLP toolkit.

BNLP provides three tokenization options for Bengali text. Under rule-based tokenization, BNLP provides the Basic Tokenizer, a punctuation-splitting tokenizer, and the NLTK tokenizer. Because the NLTK tokenizer is designed for English, we modified its output for Bengali, taking into account the differences between English and Bengali punctuation. Under model-based tokenization, BNLP provides a sentencepiece tokenizer for Bengali text called Bengali Sentencepiece. The Bengali Sentencepiece API offers two options: using a pretrained sentencepiece model, or training a new one. Anyone can tokenize Bengali text with the pretrained model or train their own Bengali sentencepiece model by calling the train API.

BNLP provides two embedding options for Bengali words: Bengali word2vec and Bengali fasttext. Each has two modes: embedding Bengali words with a pretrained model, or training a Bengali word2vec/fasttext model from scratch. For both embedding models we used the gensim embedding API and trained on Bengali corpora.

NLTK: https://github.com/nltk/nltk
sentencepiece: https://github.com/google/sentencepiece
gensim: https://github.com/RaRe-Technologies/gensim

BNLP provides a quick-start option for Bengali POS tagging: a method that tags the parts of speech of a given sentence using a pretrained CRF model, and a method to train a CRF model on custom data.
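To make the rule-based option from the tokenization paragraph above concrete, the following is a minimal sketch of a punctuation-splitting tokenizer. It is not BNLP's implementation: the function name `basic_tokenize` and the `PUNCTUATION` set (ASCII punctuation plus the Bengali danda and double danda) are assumptions chosen for illustration.

```python
import re

# Hypothetical punctuation set: ASCII punctuation plus the Bengali
# danda (।) and double danda (॥). BNLP's own list may differ.
PUNCTUATION = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~।॥"

# A token is either a run of characters that are neither whitespace nor
# punctuation, or a single punctuation mark.
_TOKEN_RE = re.compile(r"[^\s{p}]+|[{p}]".format(p=re.escape(PUNCTUATION)))


def basic_tokenize(text):
    """Split text on whitespace and emit each punctuation mark as its own token."""
    return _TOKEN_RE.findall(text)


print(basic_tokenize("আমি ভাত খাই। তুমি কি খাও?"))
# ['আমি', 'ভাত', 'খাই', '।', 'তুমি', 'কি', 'খাও', '?']
```

Treating the danda as punctuation is the kind of English-versus-Bengali difference the text mentions: an English-oriented tokenizer would not split it from the preceding word.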

Similar to POS tagging, BNLP provides a quick-start option for NER tagging of Bengali sentences, and also an option to train a CRF-based NER model on custom data.

Apart from this, BNLP provides extra utility methods, such as getting Bengali stopwords, letters, and punctuation from the Corpus class.
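The paper does not list the features its CRF taggers use. As a rough illustration of how a CRF-based POS or NER tagger is typically fed, the sketch below builds one feature dictionary per token. The feature set and the names `token_features` and `sentence_features` are hypothetical and are not taken from BNLP.

```python
def token_features(tokens, i):
    """Return an illustrative feature dict for tokens[i] (not BNLP's feature set)."""
    word = tokens[i]
    features = {
        "bias": 1.0,
        "word": word,
        "prefix2": word[:2],    # first two characters
        "suffix2": word[-2:],   # last two characters
        "is_digit": word.isdigit(),
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
    # Context features from the neighbouring tokens, when they exist.
    if i > 0:
        features["prev_word"] = tokens[i - 1]
    if i < len(tokens) - 1:
        features["next_word"] = tokens[i + 1]
    return features


def sentence_features(tokens):
    """Build the feature sequence for one tokenized sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


for feats in sentence_features(["আমি", "ভাত", "খাই"]):
    print(feats)
```

A list of such sequences, paired with the matching tag sequences, is the input shape expected by CRF libraries such as sklearn-crfsuite.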

In this section we describe the datasets used to train the BNLP models, the training procedure, and the evaluation procedure.

For training sentencepiece, word2vec, and fasttext we used raw Bengali text from two sources: the Wikipedia dump dataset and news articles crawled from different news portal sites. As shown in Table 1, our raw data contains a total of 99139 Bengali Wikipedia articles and 127867 news articles. The Wikipedia corpus contains 1818523 sentences with 32908419 tokens. The news article corpus contains 4017940 sentences with 60526710 tokens.

Wikipedia dump: https://dumps.wikimedia.org/bnwiki/latest/

Corpus          Articles   Sentences   Tokens
Wikipedia       99139      1818523     32908419
News Articles   127867     4017940     60526710
Total           227006     5836463     93435129

Table 1: Statistics of datasets used for training sentencepiece, word2vec, fasttext models

       Sentences   Train   Test
POS    2997        2247    750
NER    67719       64155   3564

Table 2: Statistics of POS and NER datasets

       Precision   Recall   F1
POS    81.74       79.78    80.75
NER    74.15       60.91    66.88

Table 3: Evaluation results of POS and NER models

For POS tagging we used the nltr dataset, which contains a total of 2997 sentences. We split it into 2247 training and 750 test sentences and trained our POS tagging model. For NER we used NER-Bangla-Datasets [8], which contains a total of 67719 sentences, with 64155 for training and 3564 for testing. Table 2 provides detailed statistics of the POS and NER datasets.

We trained the sentencepiece model on our raw text data with a vocabulary size of 50000. Because sentencepiece provides an end-to-end system for training and tokenizing, we did not preprocess the raw text.

We trained our word2vec model with embedding dimension 300, window size 5, minimum word occurrence count 1, and 8 workers, for a total of 50000 iterations.

For training fasttext we set the embedding dimension to 300, the window size to 5, the minimum word occurrence count to 1, the model type to skipgram, and the learning rate to 0.05. We trained for a total of 50 epochs and reached a loss of 0.318668.

Our CRF-based POS tagging and NER tagging models follow a similar training approach. We split the data into 75% train and 25% test. Our evaluation result is an F1 score of 80.75 for the POS tagging model and 66.88 for the NER model. Table 3 gives details of the evaluation results.
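As a quick consistency check on Table 3, the short sketch below recomputes F1 from the reported precision and recall, assuming F1 is their harmonic mean. The helper name `f1_score` is chosen for illustration and is not part of BNLP.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values as reported in Table 3 (percentages).
pos_f1 = f1_score(81.74, 79.78)
ner_f1 = f1_score(74.15, 60.91)

print(round(pos_f1, 2), round(ner_f1, 2))  # 80.75 66.88
```

Both recomputed values match the F1 column of Table 3.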

There are a significant number of open-source NLP tools for English. NLTK [1], CoreNLP [2], spaCy [3], AllenNLP [4], Flair [5], and Stanza [6] are a few of them. These tools are mostly built for English and have limited or no support for other languages. For a low-resource language like Bengali in particular, there is a severe scarcity of processing tools. iNLTK [7] is an initial approach to processing Bengali, with tokenization and language model support. But because it groups Bengali with other Indic languages, dedicated monolingual attention to Bengali is missing. With that concern in mind, we built BNLP specifically for Bengali, with support for tokenization, embedding, POS tagging, and NER tagging.

The BNLP language processing toolkit provides tokenization, embedding, POS tagging, NER tagging, and language modeling facilities for Bengali. The BNLP pretrained models achieve significant results in Bengali text tokenization, word embedding, POS tagging, and NER tagging. BNLP is widely used in the Bengali language research community and is appreciated by it.

We are working on extending BNLP with tools such as stemming, lemmatization, and corpus support. We are also working on adding language model based support, such as a BERT-based LM, so that researchers can use it efficiently for different downstream tasks. While these tasks are under development, we hope that BNLP will accelerate Bengali NLP research and development.

nltr dataset: https://github.com/abhishekgupta92/bangla_pos_tagger

References

[1] Edward Loper and Steven Bird. NLTK: The natural language toolkit. CoRR, cs.CL/0205028, 07 2002.

[2] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland, June 2014. Association for Computational Linguistics.

[3] Matthew Honnibal and Ines Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017.

[4] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew E. Peters, Michael Schmitz, and Luke Zettlemoyer. AllenNLP: A deep semantic natural language processing platform. CoRR, abs/1803.07640, 2018.

[5] Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

[6] Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D Manning. Stanza: A Python natural language processing toolkit for many human languages. arXiv preprint arXiv:2003.07082, 2020.

[7] Gaurav Arora. iNLTK: Natural language toolkit for Indic languages. In Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 66–71, Online, November 2020. Association for Computational Linguistics.

[8] Redwanul Karim, M. A. Islam, Sazid Simanto, Saif Chowdhury, Kalyan Roy, Adnan Neon, Md Hasan, Adnan Firoze, and Mohammad Rahman. A step towards information extraction: Named entity recognition in Bangla using deep learning.