An Attention Ensemble Approach for Efficient Text Classification of Indian Languages
Atharva Kulkarni, Amey Hengle, and Rutuja Udyawar
Department of Computer Engineering, PVG's COET, Savitribai Phule Pune University, India
Optimum Data Analytics, India
{atharva.j.kulkarni1998, ameyhengle22}@gmail.com, [email protected]

Abstract
The recent surge of complex attention-based deep learning architectures has led to extraordinary results in various downstream NLP tasks in the English language. However, such research for resource-constrained and morphologically rich Indian vernacular languages has been relatively limited. This paper proffers team SPPU AKAH's solution for the TechDOfication 2020 subtask-1f, which focuses on the coarse-grained technical domain identification of short text documents in Marathi, a Devanagari script-based Indian language. Availing the large dataset at hand, a hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification. Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models in the given task, giving the best validation accuracy of 89.57% and F1-score of 0.8875. Furthermore, the solution resulted in the best system submission for this subtask, giving a test accuracy of 64.26% and F1-score of 0.6157, transcending the performances of other teams as well as the baseline system given by the organizers of the shared task.
Introduction

The advent of attention-based neural networks and the availability of large labelled datasets have resulted in great success and state-of-the-art performance for English text classification (Yang et al., 2016; Zhou et al., 2016; Wang et al., 2016; Gao et al., 2018). Such results, however, for Indian language text classification tasks are few and far between, as most of the research employs traditional machine learning and deep learning models (Joshi et al., 2019; Tummalapalli et al., 2018; Bolaj and Govilkar, 2016a,b; Dhar et al., 2018). Apart from being heavily consumed in the print format, the growth in the Indian-language internet user base is monumental, scaling from 234 million users in 2016 to an estimated 536 million by 2021 (https://home.kpmg/in/en/home/insights/2017/04/indian-language-internet-users.html). Even so, just like most other Indian languages, the progress in NLP for Marathi has been relatively constrained, due to factors such as the unavailability of large-scale training resources, structural dissimilarity with the English language, and a profusion of morphological variations, thus making the generalization of deep learning architectures to languages like Marathi difficult.

This work posits a solution for the TechDOfication 2020 subtask-1f: coarse-grained domain classification for short Marathi texts. The task provides a large corpus of Marathi text documents spanning four domains: Biochemistry, Communication Technology, Computer Science, and Physics. Efficient domain identification can potentially impact and improve research in downstream NLP applications such as question answering, transliteration, machine translation, and text summarization, to name a few. Inspired by the works of (Er et al., 2016; Guo et al., 2018; Zheng and Zheng, 2019), a hybrid CNN-BiLSTM attention ensemble model is proposed in this work. In recent years, Convolutional Neural Networks (Kim, 2014; Conneau et al., 2016; Zhang et al., 2015; Liu et al., 2020; Le et al., 2017) and Recurrent Neural Networks (Liu et al., 2016; Sundermeyer et al., 2015) have been used quite frequently for text classification tasks. Quite different from one another, CNNs and RNNs show different capabilities in generating intermediate text representations. A CNN models an input sentence by utilizing convolutional filters to identify the most influential n-grams of different semantic aspects (Conneau et al., 2016). An RNN can handle variable-length input sentences and is particularly well suited for modeling sequential data, learning important temporal features and long-term dependencies for a robust text representation (Hochreiter and Schmidhuber, 1997). However, whilst a CNN can only capture local patterns and fails to incorporate long-term dependencies and sequential features, an RNN cannot distinguish keywords that contribute more context to the classification task from common stopwords. Thus, the proposed model hypothesizes a potent way to subsume the advantages of both the CNN and the RNN using the attention mechanism. The model employs a parallel structure where both the CNN and the BiLSTM model the input sentences independently. The intermediate representations thus generated are combined using the attention mechanism. Therefore, the generated vector retains useful temporal features from the sequences generated by the RNN, weighted according to the context generated by the CNN.
Results attest that the proposed model outperforms various baseline machine learning and deep learning models in the given task, giving the best validation accuracy and F1-score.

Related Work

Over the past decade, research in NLP has shifted from a traditional statistical standpoint to complex neural network architectures. CNN- and RNN-based architectures have proved greatly successful for the text classification task. Yoon Kim was the first to apply a CNN model to English text classification. In that work, a series of experiments was conducted with single- as well as multi-channel convolutional neural networks, built on top of randomly generated, pretrained, and fine-tuned word vectors (Kim, 2014). This success of CNNs for text classification led to the emergence of deeper CNN models (Conneau et al., 2016) as well as CNN models with character-level inputs (Zhang et al., 2015). RNNs are capable of generating effective text representations by learning temporal features and long-term dependencies between the words (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005). However, these methods treat each word in the sentence equally and thus cannot distinguish between the keywords that contribute more to the classification and the common words. Hybrid models proposed by (Xiao and Cho, 2016) and (Hassan and Mahmood, 2017) succeed in exploiting the advantages of both CNN and RNN by using them in combination for text classification.

Since the introduction of the attention mechanism (Vaswani et al., 2017), it has become an effective strategy for dynamically learning the contribution of different features to specific tasks. Needless to say, the attention mechanism has expeditiously found its way into the NLP literature, with many works effectively leveraging it to improve text classification. (Guo et al., 2018) proposed a CNN-RNN attention-based neural network (CRAN) for text classification. This work illustrates the effectiveness of using the CNN layer as the context of the attention model. Results show that this mechanism enables the proposed model to pick the important words from the sequences generated by the RNN layer, thus helping it to outperform many baselines as well as hybrid attention-based models in the text classification task. (Er et al., 2016) proposed an attention pooling strategy, which focuses on making a model learn better sentence representations for improved text classification. The authors use the intermediate sentence representations produced by a BiLSTM layer in reference to the local representations produced by a CNN layer to obtain the attention weights. Experimental results demonstrate that the proposed model outperforms state-of-the-art approaches on a number of benchmark datasets for text classification. (Zheng and Zheng, 2019) combine the BiLSTM and CNN with the attention mechanism for fine-grained text classification tasks. The authors employ a method in which the intermediate sentence representations generated by a BiLSTM are passed to a CNN layer, which is then max-pooled to capture the local features of a sentence. The local feature representations are further combined using an attention layer to calculate the attention weights. In this way, the attention layer can assign different weights to features according to their importance to the text classification task.

The literature in NLP focusing on the resource-constrained Indian languages has been fairly restrained.
(Tummalapalli et al., 2018) evaluated the performance of vanilla CNN, LSTM, and multi-input CNN models for the text classification of Hindi and Telugu texts. The results indicate that CNN-based models perform surprisingly better than LSTM and SVM using n-gram features. (Joshi et al., 2019) compared different deep learning approaches for Hindi sentence classification. The authors evaluated the effect of using pretrained fasttext Hindi embeddings on the sentence classification task. The finest classification performance is achieved by the vanilla CNN model when initialized with fasttext word embeddings fine-tuned on the specific dataset.

Dataset

Label      Training Data   Validation Data
bioche     5,002           420
com tech   17,995          1,505
cse        9,344           885
phy        9,656           970
Total      41,997          3,780

Table 1: Data distribution.
The TechDOfication-2020 subtask-1f dataset consists of labelled Marathi text documents, each belonging to one of four classes, namely: Biochemistry (bioche), Communication Technology (com tech), Computer Science (cse), and Physics (phy). The training documents have a mean length of 26.89 words with a standard deviation of 25.12 words.

Table 1 provides an overview of the distribution of the corpus across the four labels for the training and validation data. From the table, it is evident that the dataset is imbalanced, with the Communication Technology and Biochemistry classes having the most and the least documents, respectively. It is, therefore, reasonable to postulate that this data imbalance may lead to the overfitting of the model on some classes. This is further articulated in the Results section.
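For reference, the class distribution and length statistics above can be reproduced with a minimal sketch along the following lines (the file name "train.tsv" and the tab-separated "text, label" format are hypothetical placeholders for the released data files):

from collections import Counter

def load_split(path):
    # Assumed format: one document per line, "text<TAB>label".
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            text, label = line.rstrip("\n").split("\t")
            texts.append(text)
            labels.append(label)
    return texts, labels

train_texts, train_labels = load_split("train.tsv")

# Class distribution (cf. Table 1) and word-length statistics of the training split.
print(Counter(train_labels))
lengths = [len(t.split()) for t in train_texts]
mean = sum(lengths) / len(lengths)
std = (sum((l - mean) ** 2 for l in lengths) / len(lengths)) ** 0.5
print(f"mean length: {mean:.2f} words, std: {std:.2f}")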
Proposed Model

This section describes the proposed multi-input attention-based parallel CNN-BiLSTM model. Figure 1 depicts the model architecture. Each component is explained in detail as follows.
Figure 1: Model Architecture.

Embedding Layer

The proposed model uses fasttext word embeddings trained with the unsupervised skip-gram model to map each word from the corpus vocabulary to a corresponding dense vector. Fasttext embeddings are preferred over the word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014) variants, as fasttext represents each word as a sequence of character n-grams, which in turn helps to capture the morphological richness of languages like Marathi. The embedding layer converts each word w_i in the document T = {w_1, w_2, ..., w_n} of n words into a real-valued dense vector e_i using the following matrix-vector product:

e_i = W v_i    (1)

where W ∈ R^{d×|V|} is the embedding matrix, |V| is the fixed-size vocabulary of the corpus, and d is the word embedding dimension. v_i is the one-hot encoded vector of word w_i, with the corresponding element set to 1 and all other elements set to 0. Thus, the document can be represented as the sequence of real-valued vectors e = {e_1, e_2, ..., e_n}.

BiLSTM Layer

The word embeddings generated by the embedding layer are fed to the BiLSTM unit step by step. A Bidirectional Long Short-Term Memory (BiLSTM) (Graves and Schmidhuber, 2005) layer is simply a combination of two LSTMs (Hochreiter and Schmidhuber, 1997) running in opposite directions. This allows the network to have both forward and backward information about the sequence at every time step, resulting in better understanding and preservation of the context. It is also able to counter the problem of vanishing gradients to a certain extent by utilizing the input, output, and forget gates. The intermediate sentence representation generated by the BiLSTM is denoted as h.

CNN Layer

The discrete convolutions performed by the CNN layer on the input word embeddings help to extract the most influential n-grams in the sentence. Three parallel convolutional layers with three different window sizes are used so that the model can learn multiple types of embeddings of local regions, which complement one another. Finally, the sentence representations of all the different convolutions are concatenated and max-pooled to get the most dominant features. The output is denoted as c.

Attention Layer

The linchpin of the model is the attention block that effectively combines the intermediate sentence feature representation generated by the BiLSTM with the local feature representation generated by the CNN. At each time step t, taking the output h_t of the BiLSTM and c_t of the CNN, the attention weights α_t are calculated as:

u_t = tanh(W_h h_t + W_c c_t + b)    (2)
α_t = softmax(u_t)    (3)

where W_h and W_c are the attention weights and b is the attention bias, learned via backpropagation. The final sentence representation s is calculated as the weighted arithmetic mean of the BiLSTM outputs h = {h_1, h_2, ..., h_n}, based on the weights α = {α_1, α_2, ..., α_n}:

s = Σ_{t=1}^{n} α_t · h_t    (4)

Thus, the model is able to retain the merits of both the BiLSTM and the CNN, leading to a more robust sentence representation. This representation is then fed to a fully connected layer for dimensionality reduction.

Output Layer

The output of the fully connected attention layer is passed to a dense layer with softmax activation to predict a discrete label out of the four labels in the given task.
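To make the parallel structure concrete, the following is a minimal sketch of the architecture. The paper does not name a framework, so TensorFlow/Keras, the vocabulary-size placeholder, the size of the final fully connected layer, and the use of 'same'-padded convolutions to obtain a per-time-step context c_t are all assumptions of this sketch; the max-pooling of the concatenated convolutional maps is simplified away here.

import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, EMB_DIM, NUM_CLASSES = 100, 300, 4
VOCAB = 50000  # placeholder; in practice, the corpus vocabulary size

tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
# Embedding layer (Eq. 1); the weights are later set from the fasttext skip-gram vectors.
emb = layers.Embedding(VOCAB, EMB_DIM, name="embedding")(tokens)

# BiLSTM branch: 128 units per direction, giving a 256-dimensional h_t at every time step.
h = layers.Bidirectional(layers.LSTM(128, return_sequences=True, dropout=0.3))(emb)

# CNN branch: three parallel convolutions (window sizes 3, 4, 5) with 256 feature maps each;
# 'same' padding keeps one local context vector c_t per time step.
convs = [layers.Dropout(0.3)(layers.Conv1D(256, k, padding="same", activation="relu")(emb))
         for k in (3, 4, 5)]
c = layers.Concatenate()(convs)

# Attention (Eqs. 2-4): u_t = tanh(W_h h_t + W_c c_t + b), alpha = softmax over t, s = sum_t alpha_t h_t.
u = layers.Dense(1, activation="tanh")(layers.Concatenate()([h, c]))
alpha = layers.Softmax(axis=1)(u)
s = layers.Lambda(lambda x: tf.reduce_sum(x[0] * x[1], axis=1))([alpha, h])

# Fully connected reduction (size assumed) followed by the four-way softmax output.
x = layers.Dense(128, activation="relu")(s)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = Model(tokens, out)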
Experimental Setup

Each text document is tokenized and padded to a maximum length of 100 tokens; longer documents are truncated. The work of (Kakwani et al., 2020) is referred to for selecting the optimal set of hyperparameters for training the fasttext skip-gram model. The 300-dimensional fasttext word embeddings are trained on the given corpus for 50 epochs, with a minimum token count of 1 and 10 negative examples sampled for each instance. The rest of the hyperparameter values were kept at their defaults (Bojanowski et al., 2017). After training, an average loss of 0.6338 was obtained over the 50 epochs. The validation dataset is used to tune the hyperparameters. The LSTM layer dimension is set to 128 neurons with a dropout rate of 0.3; thus, the BiLSTM gives an intermediate representation of 256 dimensions. For the CNN block, we employ three parallel convolutional layers with filter sizes 3, 4, and 5, each having 256 feature maps. A dropout rate of 0.3 is applied to each layer. The local representations thus generated by the parallel CNNs are then concatenated and max-pooled. All other parameters in the model are initialized randomly. The model is trained end-to-end for 15 epochs with the Adam optimizer (Kingma and Ba, 2014), sparse categorical cross-entropy loss, a learning rate of 0.001, and a minibatch size of 128. The best model is stored, and the learning rate is reduced by a factor of 0.1 if the validation loss does not decline after two successive epochs.
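A sketch of how this setup could be reproduced, assuming the tokenized training texts from the earlier loader, a hypothetical word_index token-to-id mapping produced by the tokenizer, padded integer matrices x_train/x_val, and integer label arrays y_train/y_val; gensim is used here only as one possible toolkit for the fasttext skip-gram training, since the paper does not name an implementation.

import numpy as np
from gensim.models import FastText
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

# 300-dimensional skip-gram fasttext vectors: 50 epochs, minimum count 1, 10 negative samples.
tokenized = [t.split() for t in train_texts]
ft = FastText(sentences=tokenized, vector_size=300, sg=1,
              epochs=50, min_count=1, negative=10)

# Copy the fasttext vectors into the model's embedding matrix.
emb_matrix = np.zeros((VOCAB, 300))
for word, idx in word_index.items():          # word_index: hypothetical token-to-id map
    if idx < VOCAB:
        emb_matrix[idx] = ft.wv[word]         # the subword model also covers rare words
model.get_layer("embedding").set_weights([emb_matrix])

# Adam with learning rate 0.001, sparse categorical cross-entropy, batch size 128, 15 epochs;
# the learning rate is reduced by 0.1 on a two-epoch validation-loss plateau and the best model is stored.
model.compile(optimizer=Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
callbacks = [ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=2),
             ModelCheckpoint("best_model.keras", save_best_only=True)]
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=128, epochs=15, callbacks=callbacks)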
Baseline Models

The performance of the proposed model is compared with a host of machine learning and deep learning models, and the results are reported in Table 3. They are as follows:
Feature-Based Models:
Multinomial Naive Bayes with bag-of-words input (MNB + BoW), Multinomial Naive Bayes with tf-idf input (MNB + TF-IDF), Linear SVC with bag-of-words input (LSVC + BoW), and Linear SVC with tf-idf input (LSVC + TF-IDF).
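These feature-based baselines could look as follows with scikit-learn (a sketch only: vectorizer and classifier settings are left at their defaults since the paper does not list them, and the text/label variables come from the hypothetical loader above):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, f1_score

baselines = {
    "MNB + BoW":     make_pipeline(CountVectorizer(), MultinomialNB()),
    "MNB + TF-IDF":  make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "LSVC + BoW":    make_pipeline(CountVectorizer(), LinearSVC()),
    "LSVC + TF-IDF": make_pipeline(TfidfVectorizer(), LinearSVC()),
}
for name, clf in baselines.items():
    clf.fit(train_texts, train_labels)
    preds = clf.predict(val_texts)
    print(name, accuracy_score(val_labels, preds),
          f1_score(val_labels, preds, average="weighted"))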
Basic Neural Networks:
Feed-forward neural network with max-pooling (FFNN), CNN with max-pooling (CNN), and BiLSTM with max-pooling (BiLSTM).
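For illustration, two of these baselines could be written as follows in the same Keras setting (layer sizes are assumptions, since the paper does not report the baselines' hyperparameters):

from tensorflow.keras import layers, models

def cnn_baseline():
    # Single convolution over the embeddings, reduced by max-pooling over time.
    return models.Sequential([
        layers.Input(shape=(MAX_LEN,)),
        layers.Embedding(VOCAB, EMB_DIM),
        layers.Conv1D(256, 3, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def bilstm_baseline():
    # BiLSTM returning the full sequence, reduced by max-pooling over time.
    return models.Sequential([
        layers.Input(shape=(MAX_LEN,)),
        layers.Embedding(VOCAB, EMB_DIM),
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.GlobalMaxPooling1D(),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])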
Complex Neural Networks:
BiLSTM + attention (Zhou et al., 2016), serial BiLSTM-CNN (Chen et al., 2017), and serial BiLSTM-CNN + attention.

Metrics     bioche    com tech   cse       phy
Precision   0.9128    0.8831     0.9145    0.8931
Recall      0.7976    0.9342     0.8949    0.8793
F1-Score    0.8513    0.9079     0.9046    0.8862

Table 2: Detailed performance of the proposed model on the validation data.
Results and Discussion

The performance of all the models is listed in Table 3. The proposed model outperforms all other models in validation accuracy and weighted F1-score. It achieves better results than the standalone CNN and BiLSTM, thus reasserting the importance of combining both structures. The BiLSTM with attention model is similar to the proposed model, but the context is ignored. As the proposed model outperforms the BiLSTM with attention model, this demonstrates the effectiveness of the CNN layer in providing context. Stacking a convolutional layer over a BiLSTM unit results in lower performance than the standalone BiLSTM. It can thus be inferred that combining the CNN and BiLSTM in parallel is much more effective than simply stacking them serially. Thus, the proposed attention mechanism is able to successfully unify the CNN and the BiLSTM, providing meaningful context to the temporal representation generated by the BiLSTM. Table 2 reports the detailed performance of the proposed model on the validation data. The precision and recall for the Communication Technology (com tech), Computer Science (cse), and Physics (phy) labels are quite consistent. The Biochemistry (bioche) label suffers from a large difference between precision and recall. This can be traced back to the fact that less training data is available for this label, leading to the model overfitting on it.
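The per-class figures in Table 2 and the weighted averages in Table 3 correspond to a standard classification report; a minimal sketch, assuming x_val/y_val from the earlier setup and that the integer label encoding follows the order below:

from sklearn.metrics import classification_report

val_probs = model.predict(x_val)
preds = val_probs.argmax(axis=1)
# Per-class precision/recall/F1 plus the weighted averages.
print(classification_report(y_val, preds,
                            target_names=["bioche", "com tech", "cse", "phy"],
                            digits=4))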
Model                              Validation Accuracy   Validation F1-Score
MNB + BoW                          86.74                 0.8352
MNB + TF-IDF                       77.16                 0.8010
Linear SVC + BoW                   85.76                 0.8432
Linear SVC + TF-IDF                88.17                 0.8681
FFNN                               76.11                 0.7454
CNN                                86.67                 0.8532
BiLSTM                             89.31                 0.8842
BiLSTM + Attention                 88.14                 0.8697
Serial BiLSTM-CNN                  88.99                 0.8807
Serial BiLSTM-CNN + Attention      88.23                 0.8727
Ensemble CNN-BiLSTM + Attention    89.57                 0.8875

Table 3: Performance comparison of different models on the validation data.

Conclusion

While NLP research in English is achieving new heights, the progress in low-resource languages is still in its nascent stage. The TechDOfication task paves the way for research in this field through the task of technical domain identification for texts in Indian languages. This paper proposes a CNN-BiLSTM based attention ensemble model for subtask-1f of Marathi text classification. The parallel CNN-BiLSTM attention-based model successfully unifies the intermediate representations generated by both models using the attention mechanism. It provides a way for further research in adapting attention-based models to low-resource and morphologically rich languages. The performance of the model can be enhanced by giving additional inputs such as character n-grams and document-topic distributions. More efficient attention mechanisms can be applied to further consolidate the amalgamation of the CNN and the RNN.
References
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Pooja Bolaj and Sharvari Govilkar. 2016a. A survey on text categorization techniques for Indian regional languages. International Journal of Computer Science and Information Technologies, 7(2):480–483.

Pooja Bolaj and Sharvari Govilkar. 2016b. Text classification for Marathi documents using supervised learning methods. International Journal of Computer Applications, 155(8):6–10.

Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. 2017. Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Systems with Applications, 72:221–230.

Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2016. Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781.

Ankita Dhar, Himadri Mukherjee, Niladri Sekhar Dash, and Kaushik Roy. 2018. Performance of classifiers in Bangla text categorization. Pages 168–173. IEEE.

Meng Joo Er, Yong Zhang, Ning Wang, and Mahardhika Pratama. 2016. Attention pooling-based convolutional neural network for sentence modelling. Information Sciences, 373:388–403.

Shang Gao, Arvind Ramanathan, and Georgia Tourassi. 2018. Hierarchical convolutional attention networks for text classification. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 11–23.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610.

Long Guo, Dongxiang Zhang, Lei Wang, Han Wang, and Bin Cui. 2018. CRAN: A hybrid CNN-RNN attention-based model for text classification. In International Conference on Conceptual Modeling, pages 571–585. Springer.

A. Hassan and A. Mahmood. 2017. Efficient deep learning model for text classification based on recurrent and convolutional layers. Pages 1108–1113.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Ramchandra Joshi, Purvi Goel, and Raviraj Joshi. 2019. Deep learning for Hindi text classification: A comparison. In International Conference on Intelligent Human Computer Interaction, pages 94–101. Springer.

Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, NC Gokul, Avik Bhattacharyya, Mitesh M Khapra, and Pratyush Kumar. 2020. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4948–4961.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Hoa T Le, Christophe Cerisara, and Alexandre Denis. 2017. Do convolutional networks need to be deep for text classification? arXiv preprint arXiv:1707.04108.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101.

Zhenyu Liu, Haiwei Huang, Chaohong Lu, and Shengfei Lyu. 2020. Multichannel CNN with attention for text classification. arXiv preprint arXiv:2006.16174.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2015. From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):517–529.

Madhuri Tummalapalli, Manoj Chinnakotla, and Radhika Mamidi. 2018. Towards better sentence classification for morphologically rich languages. In Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615.

Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.

Jin Zheng and Limin Zheng. 2019. A hybrid bidirectional recurrent convolutional neural network attention-based model for text classification. IEEE Access, 7:106673–106685.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification.