Bangla Text Classification using Transformers
Tanvirul Alam
BJIT Limited
Dhaka, [email protected]
Akib Khan
BJIT Limited
Dhaka, [email protected]
Firoj Alam
Qatar Computing Research Institute, HBKU
Doha, Qatar
[email protected]
Abstract—Text classification has been one of the earliest problems in NLP. Over time the scope of application areas has broadened, and the difficulty of dealing with new areas (e.g., noisy social media content) has increased. The problem-solving strategy has switched from classical machine learning to deep learning algorithms. One of the recent deep neural network architectures is the Transformer. Models designed with this type of network and its variants have recently shown success in many downstream natural language processing tasks, especially for resource-rich languages, e.g., English. However, these models have not been fully explored for Bangla text classification tasks. In this work, we fine-tune multilingual transformer models for Bangla text classification tasks in different domains, including sentiment analysis, emotion detection, news categorization, and authorship attribution. We obtain state-of-the-art results on six benchmark datasets, improving upon the previous results by 5-29% accuracy across different tasks.

Index Terms—Text classification, Bangla language, Deep learning, Transformers
I. INTRODUCTION
Text classification is a classic topic in natural language processing (NLP) with many real-world applications. It refers to the task of classifying textual units such as sentences, queries, paragraphs, and documents into predefined labels or tags. Applications of text classification include sentiment analysis, news categorization, user intent classification, and content moderation. Common sources of data for text classification are web pages, social media, online news portals, emails, online shops, user reviews, and questions and answers from customer services. Even though there is an abundance of textual data, preparing such data for the classification task is not only challenging but also time-consuming due to its unstructured and noisy nature.

Earlier works on text classification were based on classical machine learning algorithms. This required manually selecting features such as bag of words or n-grams, which were then used as inputs to classification algorithms such as Naive Bayes (NB), Support Vector Machines (SVM), Hidden Markov Models (HMM), and random forests [1]–[3]. However, as large-scale datasets [4] have become available in recent years, the focus has shifted to deep learning algorithms. As deep learning models can learn representations from the data itself, they reduce the need for feature engineering and make the models transferable across different tasks. Distributed word representations learned using neural networks have been widely used, as they are capable of learning rich semantics from large unlabeled data [5], [6].
These word embeddings are then used to classify texts using different neural networks such as Multi-Layer Perceptrons (MLP), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN) [7]–[10]. More recently, transformer-based [11] pre-trained language models have been successfully used for learning language representations by utilizing large amounts of unlabeled data. Some of these models include OpenAI GPT [12], BERT [13], and RoBERTa [14]. These models have proven to be immensely successful when fine-tuned on different downstream tasks, including text classification [15], question answering [16], and natural language inference [17]. However, these models are usually trained on large monolingual English corpora or on multilingual corpora that can include over a hundred languages.

Recent work has shown that fine-tuning multilingual models can achieve performance comparable to monolingual models [18] for low-resource languages. Motivated by this, and by the fact that there has been no prior work on Bangla text classification based on this approach, we explore the efficacy of different multilingual models for Bangla text classification tasks. We experiment on six publicly available datasets covering diverse topics, including sentiment analysis, emotion detection, authorship attribution, and news categorization. We show the effectiveness of the proposed approach by obtaining state-of-the-art results on them.

We organize the rest of the paper as follows. In Section II, we discuss work related to text classification in the Bangla language. We briefly describe the datasets used in our study in Section III. Our proposed approach and training details are described in Section IV. We compare our results with previous work and draw meaningful insights into our approach in Section V. Finally, in Section VI, we conclude the paper.

II. RELATED WORKS
Compared to English, work on Bangla text classification is limited, despite Bangla being one of the most spoken and culturally rich languages, with nearly 265 million native speakers. The main reason is the scarcity of labeled data for training machine learning models. In this section, we highlight the works that are relevant to ours.

a) Sentiment/Emotion Classification:
For sentiment and emotion classification, the current state of the art for Bangla includes resource development and addressing model development challenges. Earlier work includes rule-based and classical machine learning algorithms. In [19], the authors propose a computational technique for generating an equivalent SentiWordNet (Bangla) from publicly available English sentiment lexicons and an English-Bangla bilingual dictionary, with a few easily adaptable noise reduction techniques. The classical algorithms used in different studies include Bernoulli Naive Bayes (BNB), Decision Trees, SVM, Maximum Entropy (ME), and Multinomial Naive Bayes (MNB) [20]–[22]. In [23], the authors developed a polarity detection system for textual movie reviews in Bangla using two popular machine learning algorithms, NB and SVM, and provided comparative results. In another study, the authors used NB with rules for detecting sentiment in Bengali Facebook statuses [23]. In [24], the authors developed a dataset using semi-supervised approaches and designed models using SVM and Maximum Entropy.

Work on the use of deep learning algorithms for sentiment analysis includes [25]–[29]. In [27], the authors used Long Short-Term Memory (LSTM) and CNN models with an embedding layer for both sentiment and emotion identification in YouTube comments. The study in [28] provides a comparative analysis using both classical (SVM) and deep learning (LSTM and CNN) algorithms for sentiment classification of Bangla news comments. The study in [29] integrated word embeddings into a Multichannel Convolutional-LSTM (MConv-LSTM) network for predicting different types of hate speech, document classification, and sentiment analysis for Bangla. Due to the availability of romanized Bangla texts in social media, the studies in [25], [26] use LSTM to design and evaluate models for sentiment analysis. In [30], the authors used CNNs for sentiment classification of Bangla comments. The studies in [31] and [32] analyze user sentiment in cricket comments from online news forums.

b) Authorship Identification:
Authorship attribution is another interesting research problem in which the task is to identify the original author of a text. Research work in this area is comparatively scarce. In [33], the authors developed a dataset and experimented with character-level embeddings for authorship attribution.

c) News Classification:
A large dataset of Bangla articles from different news portals, containing around 376,226 articles, was provided in [34]. The study conducted experiments using Logistic Regression, Neural Networks, NB, Random Forest, and AdaBoost, utilizing textual features such as word2vec, TF-IDF (3000-word vector), and TF-IDF (300-word vector). In [35], the authors extracted TF-IDF features and performed Bangla content classification using Random Forest, SVM with linear and radial basis kernels, K-Nearest Neighbors, Gaussian Naive Bayes, and Logistic Regression. They created a large Bangla content dataset and made it publicly available.

TABLE I: Statistics of the datasets used in the experiments. C: number of classes, L: average text length (words), Train/Dev/Test: number of samples in the respective splits (if provided officially).
Dataset                        C   Train    Dev   Test    L
YouTube Sentiment-3 [27]       3    8910      -      -   11
YouTube Sentiment-5 [27]       5    3886      -      -   11
YouTube Emotion [27]           5    2890      -      -   10
News Comment Sentiment [28]    5   13802      -      -   20
Authorship Attribution [33]   14   14047      -   3511  750
News Classification [36]       6   11284   1411   1411  231
III. DATASETS
We experiment with multiple publicly available datasets. These datasets are collected from four sources and cover diverse topics: sentiment classification, emotion detection, authorship attribution, and news categorization. Statistics of the datasets are shown in Table I.
A. YouTube comment datasets
Three datasets were collected from YouTube user comments in [27]: two for sentiment analysis and one for emotion detection. One sentiment analysis dataset consists of 5 classes (strongly positive, positive, neutral, negative, and strongly negative), while the other has 3 classes (positive, neutral, and negative). The emotion detection dataset has 5 types of emotion: anger/disgust, joy, sadness, fear/surprise, and none. One interesting aspect of these datasets is that they contain texts in Bangla, English, and romanized Bangla.
B. News comment sentiment dataset
This sentiment analysis dataset was developed in [28] from Bangla news portals. It has five categories of sentiment: slightly positive, definitely positive, neutral, slightly negative, and definitely negative. For training, slightly positive and definitely positive comments were considered positive, slightly negative and definitely negative comments were considered negative, and neutral comments were dropped in the original paper.
C. Authorship Attribution dataset
The dataset in [33] contains the writings of 14 different authors from an online Bangla e-library (novels, stories, series, etc.). Each document in the dataset has a fixed length of 750 words.
D. News Classification dataset
This dataset was prepared for the news classification task in [36]. It contains 6 different classes of interest. The authors provide train, validation, and test splits for this dataset.

IV. EXPERIMENTS
A. Models and Architecture
We fine-tune multilingual transformer models that are trained on large corpora. We have not used a monolingual Bangla model, as to the best of our knowledge no such model is publicly available. A general architecture of our model is shown in Figure 1.

Fig. 1: Model Architecture

We use model-specific tokenizers to split the input text into a list of tokens. As these models use byte-pair encoding [37], a single word may be split into multiple tokens. For example, the input sentence সুন্দর সবসময় আনন্দময় (Beauty is joy forever) is split into 6 tokens by the XLM-RoBERTa-large tokenizer.
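The splitting of a word into multiple subword pieces can be illustrated with a small sketch. This is not the actual XLM-RoBERTa tokenizer: real models learn their vocabularies with byte-pair encoding from large corpora, whereas the tiny vocabulary and the greedy longest-match strategy below are assumptions made purely for demonstration.

```python
# Illustrative sketch of subword tokenization: greedy longest-match against a
# vocabulary. The hand-picked vocabulary below is made up for demonstration;
# real tokenizers learn theirs with byte-pair encoding.

def subword_tokenize(word, vocab):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:          # no vocabulary piece matches: unknown token
            return ["[UNK]"]
        tokens.append(word[start:end])
        start = end
    return tokens

vocab = {"play", "ing", "un", "break", "able"}
subword_tokenize("playing", vocab)      # -> ['play', 'ing']
subword_tokenize("unbreakable", vocab)  # -> ['un', 'break', 'able']
subword_tokenize("xyz", vocab)          # -> ['[UNK]']
```

A word fully covered by vocabulary pieces is split into several tokens, while a word with no matching piece falls back to the unknown token, mirroring the [UNK] behavior discussed later for the BERT tokenizer.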
1) Pretrained Language Models: We use transformer language models in our experiments, which are publicly available in HuggingFace's transformers library [38]. To fine-tune a transformer model for the classification task, we insert a special start-of-sequence (SOS) token at the start and a special end-of-sequence (EOS) token at the end. As we use fixed-size inputs to the models, we add padding at the end if necessary, or remove extra tokens from the end if the number of tokens exceeds the fixed sequence length. Padding tokens are masked out so they are not attended to during training. These tokens form the input to the transformer models, where they are passed through multiple self-attention layers, producing a final hidden embedding corresponding to each input token. For the classification task, we only consider the hidden state of the first token, corresponding to the start-of-sequence token. This is then used to produce the final output probability distribution over the categories. We experiment with three models from two model classes: multilingual BERT and XLM-RoBERTa.

a) Multilingual BERT:
BERT [13] is trained to learn distributed representations from unlabeled text by jointly conditioning on the left and right contexts of a token. It uses the encoder part of the Transformer architecture introduced in [11]. Two objective functions are used during the language model pretraining step. The first is the masked language model (MLM), which randomly masks some fraction of the input tokens with a special [MASK] token; the objective is to predict the vocabulary ID of the original token in that position. The bidirectional nature ensures that the model can effectively make use of both past and future tokens for this. (For BERT, the SOS and EOS tokens are [CLS] and [SEP], respectively; for XLM-RoBERTa they are <s> and </s>.) The second objective is the next sentence prediction (NSP) task. This is a binary classification task for predicting whether two sentences are subsequent in the original text. Positive pairs are created by taking consecutive sentences from the text, and negative pairs are created by taking sentences from two different documents.

The multilingual variant of BERT (mBERT) is trained on the largest Wikipedia corpora from more than 100 languages. Since different languages have different amounts of Wikipedia entries, data is sampled using exponentially smoothed weighting (with a factor of 0.7). This ensures that high-resource languages like English are under-sampled relative to low-resource languages. Word counts are sampled in a similar manner so that low-resource languages have sufficient words in the vocabulary.

b) XLM-RoBERTa:
RoBERTa [14] improves upon BERT by training on larger datasets, using a larger vocabulary, and training on longer sequences with larger batches. The NSP task is removed, and only the MLM loss is used for pretraining. XLM-RoBERTa [18] is the multilingual variant of RoBERTa, trained with a multilingual MLM. It is trained on one hundred languages, with more than two terabytes of filtered Common Crawl data. XLM-RoBERTa has shown impressive performance on several multilingual NLP tasks and can perform comparably to monolingual language models.
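The fixed-length input construction described above (SOS/EOS insertion, truncation from the end, padding, and the attention mask that hides padding) can be sketched as a small helper. The integer IDs and the function name are placeholders chosen for illustration (101/102/0 happen to be BERT's [CLS]/[SEP]/[PAD] IDs), not the authors' code.

```python
# Sketch of fixed-length input construction for fine-tuning: insert SOS/EOS
# markers, truncate overly long inputs from the end, pad short ones, and build
# the attention mask so padding is not attended to. ID values are placeholders.
SOS, EOS, PAD = 101, 102, 0

def build_input(token_ids, max_len):
    ids = [SOS] + token_ids[: max_len - 2] + [EOS]  # truncate from the end
    mask = [1] * len(ids)                           # real tokens are attended to
    n_pad = max_len - len(ids)
    return ids + [PAD] * n_pad, mask + [0] * n_pad  # padding is masked out

ids, mask = build_input([7, 8, 9], max_len=6)
# ids  -> [101, 7, 8, 9, 102, 0]
# mask -> [1, 1, 1, 1, 1, 0]
```

HuggingFace tokenizers perform this same bookkeeping internally; the sketch only makes the paper's description of padding, truncation, and masking concrete.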
2) Classification Head: We add a task-specific classification head to fine-tune the model for a specific task. The hidden representation obtained from the start-of-sequence token can be treated as the sentence embedding and used for classification [13], [14]. This gives us an H-dimensional embedding vector. For BERT, we add a linear layer with as many neurons as there are classes for the task. For RoBERTa, a hidden layer with H neurons and tanh non-linearity is used, followed by the final classification layer.

B. Training Procedure
We trained the models using the cross-entropy loss criterion and the Adam optimization algorithm [39]. All models were trained for 10 epochs with a learning rate of 1e-5. We use 32 samples in each mini-batch, except when this does not fit in memory (e.g., the XLM-RoBERTa-large model trained on the authorship attribution dataset); in such cases we use the maximum batch size that fits in GPU memory. We used fixed sequence lengths during training, applying padding or truncation when necessary. The sequence lengths are 30, 100, 300, and 300 tokens for the YouTube comments, News Comments, Authorship, and News datasets, respectively. All model parameters are fine-tuned during training, i.e., no layer is kept frozen. The model with the best validation set performance is evaluated on the test dataset. The models were trained on a single 16 GB NVIDIA Tesla P100 GPU.
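The two classification heads from Section IV-A2 and the cross-entropy criterion can be sketched in plain Python. This is a minimal illustration with toy dimensions and arbitrary weights, not the actual PyTorch implementation; the function names are ours.

```python
import math

def linear(x, W, b):
    """y_i = sum_j W[i][j] * x[j] + b[i]"""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(z):
    m = max(z)                          # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def bert_head(h, W, b):
    """BERT head: one linear layer over the SOS-token embedding, then softmax."""
    return softmax(linear(h, W, b))

def roberta_head(h, W1, b1, W2, b2):
    """RoBERTa head: tanh hidden layer of size H, then the classification layer."""
    hidden = [math.tanh(v) for v in linear(h, W1, b1)]
    return softmax(linear(hidden, W2, b2))

def cross_entropy(probs, target):
    """Negative log-likelihood of the true class, as used for training."""
    return -math.log(probs[target])

# Toy sizes: H = 4 hidden dimensions, C = 3 classes; weights are arbitrary.
h = [0.5, -0.2, 0.1, 0.9]
W = [[0.1] * 4, [0.2] * 4, [-0.1] * 4]
b = [0.0, 0.0, 0.0]
probs = bert_head(h, W, b)              # a distribution summing to 1.0
loss = cross_entropy(probs, target=1)   # positive scalar loss
```

In the real setup, `h` is the transformer's hidden state at the start-of-sequence position and the weights are learned jointly with the rest of the network via Adam.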
TABLE II: Results on the YouTube sentiment (3-class) dataset

        Model                Accuracy   F1 Score
Others  LSTM [27]
        CNN [27]                 60.9       60.5
        NB [27]                  60.8       59.5
        SVM [27]                 59.2       58.9
Ours    BERT-base                71.7       71.5
        XLM-RoBERTa-base         74.4       74.2
        XLM-RoBERTa-large
TABLE III: Results on the YouTube sentiment (5-class) dataset

        Model                Accuracy   F1 Score
Others  LSTM [27]
        CNN [27]                 52.1       52.1
        NB [27]                  46.9       48.0
        SVM [27]                 44.9       46.5
Ours    BERT-base                53.5       52.8
        XLM-RoBERTa-base         57.4       56.7
        XLM-RoBERTa-large
TABLE IV: Results on the YouTube emotion detection dataset

        Model                Accuracy   F1 Score
Others  LSTM [27]
        CNN [27]                 54.0       53.5
        NB [27]                  52.5       52.5
        SVM [27]                 49.3       49.8
Ours    BERT-base                60.4       59.1
        XLM-RoBERTa-base         69.8       66.6
        XLM-RoBERTa-large
V. RESULTS AND DISCUSSIONS
A. Results on different Datasets
In this section, we describe the evaluation procedures used for each dataset and compare the results obtained using transformer models to those previously reported. We evaluate the models using accuracy and weighted F1 score.
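The weighted F1 score averages per-class F1 values, weighting each class by its support (number of true samples). A minimal reference implementation, equivalent in spirit to scikit-learn's `f1_score(..., average='weighted')`, is sketched below.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total, score = len(y_true), 0.0
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += support[c] / total * f1
    return score

weighted_f1([0, 0, 1, 1], [0, 0, 1, 0])  # -> 0.7333... (0.5*0.8 + 0.5*0.6667)
```

Unlike macro-averaging, this weighting keeps the metric representative of class-imbalanced datasets such as those used here.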
1) YouTube comment datasets: The results on the YouTube sentiment (3- and 5-class) and emotion detection datasets are reported in Tables II, III, and IV. The baseline SVM and Naive Bayes (NB) models in [27] were trained using TF-IDF features with n-gram tokens, while the CNN and LSTM models were trained using word embeddings learned with word2vec [5]. We follow a similar procedure for reporting results as in their work. Specifically, we set aside 10% of each dataset for testing. The rest was further divided into 80% for training and 20% for validation. We repeated the experiments 10 times and report the average result for each model.

It is evident that transformer models consistently yield better performance on these datasets compared to the benchmark models, the only exception being BERT on the 5-class sentiment dataset, where it performs slightly worse. The most significant improvement is observed on the emotion detection dataset, where XLM-RoBERTa-large achieves a 22.8% relative increase in accuracy compared to the LSTM model.

TABLE V: Results on the news comment sentiment dataset
        Model                Accuracy   F1 Score
Others  SVM [28]                61.34      63.97
        LSTM [28]
        CNN [28]                60.49      66.24
Ours    BERT-base               74.68      72.32
        XLM-RoBERTa-base        76.61      74.69
        XLM-RoBERTa-large
TABLE VI: Results on the authorship attribution dataset

        Model                Accuracy   F1 Score
Others  Char-CNN [33]            69.0          -
        W2V (CBOW) [33]          71.8          -
        fastText (CBOW) [33]     40.3          -
        W2V (Skip) [33]          78.6          -
        fastText (Skip) [33]                   -
Ours    BERT-base                82.6       82.8
        XLM-RoBERTa-base         87.2       87.2
        XLM-RoBERTa-large
TABLE VII: Results on the news classification dataset

        Model                Accuracy   F1 Score
Others  FT-W [36]               62.79          -
        FT-WC [36]              64.78          -
        INLP [36]                              -
Ours    BERT-base               91.28      91.28
        XLM-RoBERTa-base        92.70      92.84
        XLM-RoBERTa-large
2) News Comment Sentiment: We omitted the neutral class and consider only binary sentiment classification for this dataset, as was done in [28]. We split the dataset into train, validation, and test splits containing 80%, 10%, and 10% of the samples, respectively. The accuracy and F1 scores are reported in Table V. The authors in [28] reported results for SVM, CNN, and LSTM models. Transformer models perform better than the SVM and CNN models and comparably to the LSTM model on this dataset.
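The 80/10/10 split described above can be sketched as a small deterministic helper. The function name and the fixed seed are our own illustrative choices; the paper does not specify how its splits were seeded.

```python
import random

def split_80_10_10(samples, seed=42):
    """Shuffle deterministically, then split 80% train / 10% val / 10% test."""
    data = list(samples)
    random.Random(seed).shuffle(data)   # seeded shuffle for reproducibility
    n = len(data)
    n_test, n_val = n // 10, n // 10
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, val, test

train, val, test = split_80_10_10(range(100))
# (len(train), len(val), len(test)) -> (80, 10, 10)
```

Fixing the random seed makes such splits reproducible across runs, which matters when comparing models trained on the same partition.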
3) Authorship Attribution: We balanced the dataset prior to training, taking the minimum number of samples (469) per class, similar to [33]. We used the train and test splits provided in [33] for the experiments, and further split 20% of the training data into a validation set. We limit the sequence length to 300 on this dataset, even though each sample has 750 words; this was done to meet GPU memory constraints.

The authors in [33] reported results for character- and word-level CNN models trained with fastText and word2vec embeddings, with the best results obtained using the skip-gram variant of fastText embeddings. All three transformer models perform better than this model, with the XLM-RoBERTa-large model significantly outperforming it with a 15.5% relative increase in accuracy, as reported in Table VI.
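Balancing a dataset by downsampling every class to the size of the smallest class, as done above, can be sketched as follows. The helper name and seed are illustrative assumptions, not the authors' code.

```python
import random
from collections import defaultdict

def balance_by_downsampling(samples, labels, seed=42):
    """Randomly downsample every class to the size of the smallest class."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    k = min(len(items) for items in by_class.values())  # minority-class count
    rng = random.Random(seed)
    balanced = []
    for label, items in by_class.items():
        # rng.sample draws k items without replacement
        balanced.extend((sample, label) for sample in rng.sample(items, k))
    return balanced

pairs = balance_by_downsampling(["a", "b", "c", "d", "e"], [0, 0, 0, 1, 1])
# two samples per class -> 4 (sample, label) pairs in total
```

Downsampling trades data volume for a uniform class prior, which keeps accuracy comparable across classes on a 14-author task like this one.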
4) News Classification: We used the train, validation, and test splits provided in [36] for news classification. Results obtained on this dataset are reported in Table VII. The authors in [36] used fastText embeddings trained on different corpora to train a KNN classifier, which means further improvement could be gained by training a CNN or LSTM model on those embeddings. Regardless, we see significant improvement using transformer models on this dataset. All three variants of our models achieve greater than a 25% relative increase in accuracy compared to the best benchmark model [36].

Fig. 2: Relative accuracy compared to the best published results on different datasets
B. Discussions

We can conclude from the results that transformer-based models are better suited for Bangla text classification tasks than classical machine learning approaches using manually engineered features, or CNN/LSTM models trained on distributed word representations. We present the relative accuracy of the three models compared to the best published prior results on each dataset in Figure 2. There is a clear trend across the datasets: XLM-RoBERTa-base performs better than BERT, and XLM-RoBERTa-large performs better than XLM-RoBERTa-base. Both XLM-RoBERTa models improve upon previously reported results across all datasets. BERT performs slightly worse than the previous best result on the YouTube sentiment (5-class) and news comment sentiment datasets.

Various factors contribute to the performance gain obtained by the XLM-RoBERTa models. They are trained on a larger corpus than BERT and have a larger vocabulary, so fewer unknown tokens are introduced after tokenization. For example, for the example sentence from before, the XLM-RoBERTa tokenizer produces the following tokens: '▁সুন্দর', '▁সব', 'স', 'ময়', '▁আনন্দ', 'ময়'.
However, with the BERT tokenizer, the first word is split into several pieces, and the last two words are replaced with the unknown token '[UNK]', even though they are not rare words in Bangla. XLM-RoBERTa performs better in this regard due to the availability of a larger Bangla vocabulary in the model. As XLM-RoBERTa-large is trained with more layers and parameters, it also has better generalization ability than the XLM-RoBERTa-base model, which results in better performance on the downstream tasks.

One advantage of subword embeddings is that they can perform better on noisy, user-generated text like that found in online media. As such texts tend to contain misspellings and shortened words, subword embeddings (and also character embeddings) can still capture some meaningful information from them. People also often comment in both Bangla and English on those platforms, as is the case in the YouTube datasets. Since we fine-tune multilingual models that were also trained on large amounts of English data, the models are expected to perform better in such cases. This can explain why we obtained significantly more improvement from transformer models on the YouTube datasets than on the Bangla news sentiment dataset, which does not contain any English text.

Using transformer models for fine-tuning also reduces the need for feature engineering and data preprocessing. In our experiments, we did not use any separate text preprocessing steps (e.g., stop word and punctuation removal) other than the model-specific tokenizers.

In this work, we have explored multilingual language models as they are more readily available, especially for a low-resource language like Bangla. If we fine-tune from a monolingual Bangla transformer model pretrained on a large corpus, we believe further improvements can be made on Bangla-only texts such as the authorship attribution dataset.

For better interpretation of the model, we explore and show word importance for three example sentences. These were obtained for the XLM-RoBERTa-large model trained on the YouTube sentiment-3 dataset, and the visualizations were produced using Captum [40].

Fig. 3: Example sentences highlighting the importance of words that the network learns. (a) 'কেন জানি না! গানটা এতো ভালো লাগে। সবসময় শুনতে ভালো লাগে।' (I don't know why! I like the song so much. Always good to hear.) Detected as positive with 0.9972 probability. (b) 'শেষের সিনটা খুব খারাপ লেগেছিল' (The last scene was very bad!) Detected as negative with 0.9627 probability. (c) 'লাইফে এমন গান খুব কম শুনেছি, হাজার বার শুনেও খারাপ লাগেনি' (I have rarely heard such songs in my life; I have not felt bad even after listening to them a thousand times.) Detected as positive with 0.9807 probability.

In Figure 3a, the model puts high emphasis on the positive word ভালো (good). The first sentence, কেন জানি না! (I don't know why!), can be considered slightly negative, and this is reflected in the word importance. In Figure 3b, the negative word খারাপ (bad) is properly highlighted to arrive at the overall negative sentiment.
Interestingly, খুব (very) is highlighted with the opposite polarity, i.e., positive, meaning the model was not able to combine the two words and emphasize the phrase খুব খারাপ (very bad) as negative. Finally, the sentence in Figure 3c is more challenging; however, the model is able to identify the overall positive tone despite the presence of negative words like কম, খারাপ (rarely, bad).

VI. CONCLUSIONS
We have explored different transformer models for a variety of Bangla text classification tasks. Our work shows that fine-tuning transformer models can yield better performance than traditional methods that use hand-crafted features, as well as deep learning models like CNN and LSTM trained on distributed word representations. We obtained state-of-the-art results on six benchmark datasets from different domains. We hope this will encourage researchers to make use of these models for other tasks in the Bangla language.
REFERENCES

[1] L. M. Manevitz and M. Yousef, "One-class SVMs for document classification," Journal of Machine Learning Research, vol. 2, pp. 139–154, 2001.
[2] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up? Sentiment classification using machine learning techniques," in Proc. of EMNLP, 2002.
[3] G. Forman, "An extensive empirical study of feature selection metrics for text classification," Journal of Machine Learning Research, vol. 3, pp. 1289–1305, 2003.
[4] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649–657.
[5] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013.
[6] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in EMNLP, 2014.
[7] Y. Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882, 2014.
[8] K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term memory networks," arXiv preprint arXiv:1503.00075, 2015.
[9] S. Wan, Y. Lan, J. Guo, J. Xu, L. Pang, and X. Cheng, "A deep architecture for semantic matching with multiple positional sentence representations," arXiv preprint arXiv:1511.08277, 2015.
[10] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, "Bag of tricks for efficient text classification," arXiv preprint arXiv:1607.01759, 2016.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017, pp. 5998–6008. [Online]. Available: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
[12] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," OpenAI, 2018.
[13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[14] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," CoRR, vol. abs/1907.11692, 2019.
[15] C. Sun, X. Qiu, Y. Xu, and X. Huang, "How to fine-tune BERT for text classification?" in China National Conference on Chinese Computational Linguistics. Springer, 2019, pp. 194–206.
[16] S. Garg, T. Vu, and A. Moschitti, "TANDA: Transfer and adapt pre-trained transformer models for answer sentence selection," arXiv preprint arXiv:1911.04118, 2019.
[17] Z. Zhang, Y. Wu, H. Zhao, Z. Li, S. Zhang, X. Zhou, and X. Zhou, "Semantics-aware BERT for language understanding," arXiv preprint arXiv:1909.02209, 2019.
[18] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, "Unsupervised cross-lingual representation learning at scale," arXiv preprint arXiv:1911.02116, 2019.
[19] A. Das and S. Bandyopadhyay, "SentiWordNet for Bangla," Knowledge Sharing Event-4: Task, vol. 2, pp. 1–8, 2010.
[20] A. Rahman and M. S. Hossen, "Sentiment analysis on movie review data using machine learning approach," 2019, pp. 1–4.
[21] N. Banik and M. H. H. Rahman, "Evaluation of naïve bayes and support vector machines on Bangla textual movie reviews," in Proc. of ICBSLP. IEEE, 2018, pp. 1–6.
[22] R. R. Chowdhury, M. S. Hossain, S. Hossain, and K. Andersson, "Analyzing sentiment of movie reviews in Bangla by applying machine learning techniques," in Proc. of ICBSLP. IEEE, 2019, pp. 1–6.
[23] M. S. Islam, M. A. Islam, M. A. Hossain, and J. J. Dey, "Supervised approach of sentimentality extraction from Bengali facebook status," in Proc. of ICCIT. IEEE, 2016, pp. 383–387.
[24] S. Chowdhury and W. Chowdhury, "Performing sentiment analysis in Bangla microblog posts," 2014, pp. 1–6.
[25] A. Hassan, M. R. Amin, A. K. Al Azad, and N. Mohammed, "Sentiment analysis on Bangla and romanized Bangla text using deep recurrent models," IEEE, 2016, pp. 51–56.
[26] A. A. Sharfuddin, M. N. Tihami, and M. S. Islam, "A deep recurrent neural network with BiLSTM model for sentiment classification," in Proc. of ICBSLP. IEEE, 2018, pp. 1–4.
[27] N. I. Tripto and M. E. Ali, "Detecting multilabel sentiment and emotions from Bangla youtube comments," IEEE, 2018, pp. 1–6.
[28] M. A.-U.-Z. Ashik, S. Shovon, and S. Haque, "Data set for sentiment analysis on Bengali news comments and its baseline evaluation," in Proc. of ICBSLP. IEEE, 2019, pp. 1–5.
[29] M. R. Karim, B. R. Chakravarthi, J. P. McCrae, and M. Cochez, "Classification benchmarks for under-resourced Bengali language based on multichannel convolutional-LSTM network," CoRR, vol. abs/2004.07807, 2020.
[30] M. H. Alam, M.-M. Rahoman, and M. A. K. Azad, "Sentiment analysis for Bangla sentences using convolutional neural network," in Proc. of ICCIT. IEEE, 2017, pp. 1–6.
[31] M. Rahman, E. Kumar Dey et al., "Datasets for aspect-based sentiment analysis in Bangla and its baseline evaluation," Data, vol. 3, no. 2, p. 15, 2018.
[32] S. A. Mahtab, N. Islam, and M. M. Rahaman, "Sentiment analysis on Bangladesh cricket with support vector machine," IEEE, 2018, pp. 1–4.
[33] A. Khatun, A. Rahman, M. S. Islam et al., "Authorship attribution in Bangla literature using character-level CNN," in Proc. of ICCIT. IEEE, 2019, pp. 1–5.
[34] M. Tanvir Alam and M. Mofijul Islam, "BARD: Bangla article classification using a new comprehensive dataset," in Proc. of ICBSLP, 2018, pp. 1–5.
[35] S. Al Mostakim, F. Ehsan, S. Mahdiea Hasan, S. Islam, and S. Shatabda, "Bangla content categorization using text based supervised learning methods," 2018, pp. 1–6.
[36] A. Kunchukuttan, D. Kakwani, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, and P. Kumar, "AI4Bharat-IndicNLP corpus: Monolingual corpora and word embeddings for Indic languages," arXiv preprint arXiv:2005.00085, 2020.
[37] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," arXiv preprint arXiv:1508.07909, 2015.
[38] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, "HuggingFace's transformers: State-of-the-art natural language processing," ArXiv, vol. abs/1910.03771, 2019.
[39] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations, 2015.
[40] N. Kokhlikyan et al., "Captum: A unified and generic model interpretability library for PyTorch," arXiv preprint arXiv:2009.07896, 2020.