2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI) | 2019

A Sentiment Classification in Bengali and Machine Translated English Corpus

 
 

Abstract


Resource constraints in many languages have made multilingual sentiment analysis a viable alternative for sentiment classification. Although a good amount of research has used a multilingual approach in languages such as Chinese, Italian, and Romanian, very little has been done in Bengali. This paper presents a bilingual approach to sentiment analysis that compares a machine-translated Bengali corpus with its original form. We apply multiple machine learning algorithms, namely Logistic Regression (LR), Ridge Regression (RR), Support Vector Machine (SVM), Random Forest (RF), Extra Randomized Trees (ET), and Long Short-Term Memory (LSTM), to a Bengali corpus and its machine-translated English version. The results suggest that using machine translation improves classifier performance in both datasets. Moreover, the unigram model performs better than higher-order n-gram models in both datasets. This is due to linguistic variation and misspelled words resulting from the complex typing system of the Bengali language, to sparseness and noise in the machine-translated data, and to the small size of the datasets.

Pages 107-114
DOI 10.1109/IRI.2019.00029
Language English
Journal 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI)
