Bangla Text Dataset and Exploratory Analysis for Online Harassment Detection
Md Faisal Ahmed, Zalish Mahmud, Zarin Tasnim Biash, Ahmed Ann Noor Ryen, Arman Hossain, Faisal Bin Ashraf
Department of Computer Science and Engineering, Brac University, Bangladesh
Abstract: Bangla is the seventh most spoken language in the world, and its use online has grown in recent years. Analyzing Bangla text data has therefore become important for keeping online spaces safe and free of harassment. The data presented in this article were collected from comments on public Facebook posts by celebrities, government officials, and athletes, and were then labeled. The dataset contains 44,001 comments. It was compiled to help machines, through Natural Language Processing, determine whether a comment is a bullying expression and, if so, to what extent it is inappropriate. Each comment is labeled with a category of harassment. An exploratory analysis from several perspectives is included to give a detailed overview. Because categorized Bengali-language comment data are scarce, this dataset can play a significant role in research on detecting bullying words, identifying inappropriate comments, and distinguishing categories of bullying in Bengali. The dataset is publicly available at https://data.mendeley.com/datasets/9xjx8twk8p.
Keywords: Bangla Text, Sentiment analysis, Natural language processing (NLP), Cyberbullying, Social Media Bullying, Online Harassment.
I. INTRODUCTION
Cyberbullying, or online harassment, is the use of electronic communication, such as online social networking platforms, to menace an individual, typically by sending messages of an intimidating or threatening nature. With the widespread use of the internet and social networking sites, cyberbullying has become a major concern. Understanding the huge volume of text on social sites (posts, comments, and messages) well enough to take preventive measures in advance is a challenging research problem. Natural Language Processing (NLP) is a branch of artificial intelligence that deals with the interaction between computers and humans through natural language. The ultimate goal of NLP is to read, decode, understand, and make sense of human languages in a meaningful way. Bangla is the seventh most spoken language in the world, with more than 265 million speakers [1], and a large number of them use Bangla on online platforms. Analyzing Bangla text from social sites has therefore become a pressing need, yet properly labeled Bangla text data are scarce. In this work, we have collected a large number of Bangla comments covering different kinds of online harassment. The dataset records the type of bullying together with the victim's gender and profession, which lets researchers analyze in depth the relationship between online harassment and this metadata. It also includes the number of likes or reactions on each comment, which makes it possible to study trends in the social acceptance or normalization of both appreciative and harassing comments.
These data will be helpful for detecting cyberbullying in Bengali, as there are about 30 million daily active Facebook users in Bangladesh and a large proportion of them have to deal with online harassment and bullying [2]. The dataset can also be used for identifying bullying words, detecting different categories of harassment, and other work in this field. It can further be used to train and evaluate computational models and automated techniques for reducing cyber harassment.
II. EXPERIMENTAL DESIGN
The data were obtained from the comment sections of public posts on popular Facebook pages and anonymized in compliance with the Facebook Platform Policy for Developers [3]. We scraped the comments from specific posts and then separated the comments written in Bengali from those in English or mixed language. Duplicate comments were removed. The remaining comments were labelled into five categories: not-bully and four subcategories of bully, namely sexual, threat, troll, and religious. A comment found to be harassing was assigned to the subcategory to which it logically belongs, and a comment that met no criterion of harassment was labelled not-bully. All team members repeated the process multiple times, and the consensus decision was taken to ensure that the labels were correct.
III. DATA DESCRIPTION
The dataset [4], attached to this article as an Excel file, contains comments from the comment sections of public posts by celebrities, politicians, and sportsmen on Facebook. It contains 44,001 comments in total. Table 1 lists each variable of the dataset that is available at the given link, along with its type and a description. The variable types are categorical, numeric, and text.
Table 1: Variables of Dataset
Variable          Type         Description
Comment           Text         Collected comment of the users
Category          Categorical  Occupation of the celebrities, for example actor, singer, politician
Gender            Categorical  Gender of the celebrities
Number of Reacts  Integer      Number of likes and reactions on that comment
Label             Categorical  Type of bully / non-bully
IV. EXPLORATORY ANALYSIS
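As a starting point for the analysis, it can help to see the variables of Table 1 as a concrete record. The following Python sketch models one row and re-applies two of the cleaning steps from Section II: keeping only Bengali-script comments and dropping exact duplicates. The `Row` class, its field names, the helper functions, and the sample rows are all hypothetical. The Unicode-range test is only one plausible way to separate Bengali comments from English or mixed ones and is not necessarily the procedure the authors used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One dataset row; field names mirror Table 1 but are assumptions."""
    comment: str
    category: str
    gender: str
    reacts: int
    label: str


def is_bengali_only(text: str) -> bool:
    """True if text has Bengali-block characters and no other letters.

    The Bengali Unicode block is U+0980 to U+09FF. Spaces, punctuation
    and other non-letter characters outside the block are tolerated.
    """
    has_bengali = False
    for ch in text:
        if "\u0980" <= ch <= "\u09ff":
            has_bengali = True
        elif ch.isalpha():
            return False  # a letter from another script: English or mixed
    return has_bengali


def clean(rows):
    """Keep Bengali-only comments and the first copy of each duplicate."""
    seen = set()
    kept = []
    for row in rows:
        key = row.comment.strip()
        if not is_bengali_only(key) or key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


# Hypothetical sample rows, not taken from the dataset.
sample = [
    Row("খুব ভালো গান", "Singer", "Female", 3, "not bully"),
    Row("খুব ভালো গান", "Singer", "Female", 1, "not bully"),  # duplicate
    Row("nice গান", "Singer", "Female", 0, "not bully"),      # mixed script
    Row("great match", "Sports", "Male", 2, "not bully"),     # English only
    Row("দারুণ খেলা!", "Sports", "Male", 7, "not bully"),
]

cleaned = clean(sample)
print(len(sample), "->", len(cleaned))  # 5 -> 2
```

Matching on the stripped comment text treats two comments as duplicates only when they are character-for-character identical, which is the simplest reading of the duplicate check described above.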
Fig. 1 Male to Female percentage ratio of the total number of data collected

Fig. 1 shows the share of collected comments directed at male and female victims. In our dataset, 14,051 comments (31.9% of the total) are targeted towards male victims and 29,950 comments (68.1%) are aimed towards female victims (Table 2).
Table 2: Number of Male and Female Victims
        Quantity  Percentage
Male    14,051    31.9%
Female  29,950    68.1%
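The percentages reported for male and female victims follow directly from the two counts. A minimal sketch of the arithmetic, where the function name `percentage_split` is an illustrative choice and not part of the dataset or its tooling:

```python
def percentage_split(counts):
    """Return each count as a percentage of the total, to one decimal."""
    total = sum(counts.values())
    return {key: round(100 * value / total, 1) for key, value in counts.items()}


# Counts as reported in the text above.
victim_counts = {"Male": 14051, "Female": 29950}

total_comments = sum(victim_counts.values())
shares = percentage_split(victim_counts)

print(total_comments)  # 44001
print(shares)          # {'Male': 31.9, 'Female': 68.1}
```

The two counts sum to the 44,001 comments stated for the whole dataset, so the split covers every comment.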
            Quantity  Percentage
Actor
Social
Politician
Sports
Singer
            Quantity  Percentage
Sexual
Not bully
Troll
Religious
Threat
CONCLUSION
This dataset was built with care, and the data are fully anonymized. It can help researchers identify online harassment in Bangla text. The number of comments and their distribution across categories make it a good resource for training machine learning models to detect different kinds of online harassment in the Bangla language.
REFERENCES
[1]
[2] FE Online Report. (n.d.). Social media users 30 million in Bangladesh: Report. Retrieved January 29, 2021, from https://thefinancialexpress.com.bd/sci-tech/social-media-users-30-million-in-bangladesh-report-1521797895
[3] Facebook Platform Terms. (n.d.). Retrieved January 29, 2021, from https://developers.facebook.com/policy/
[4] Ahmed, Md Faisal; Mahmud, Zalish; Biash, Zarin Tasnim; Ryen, Ahmed Ann Noor; Hossain, Arman; Ashraf, Faisal Bin (2021), “Bangla Online Comments Dataset”, Mendeley Data, V1, doi: 10.17632/9xjx8twk8p.1
            Male          Female
Sexual
Not bully
Troll
Religious
            491 (3.49%)   7086 (23.66%)
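The two values that survive in this table, 491 (3.49%) and 7086 (23.66%), are consistent with percentages taken within each gender, that is, relative to the 14,051 male-targeted and 29,950 female-targeted comments rather than to all 44,001. This is an inference from the numbers and is not stated in the text. The short check below uses a hypothetical helper, `share_within_gender`, to illustrate it:

```python
# Gender totals as reported in the exploratory analysis.
GENDER_TOTALS = {"Male": 14051, "Female": 29950}


def share_within_gender(count, gender):
    """Percentage of one gender's comments, rounded to two decimals."""
    return round(100 * count / GENDER_TOTALS[gender], 2)


male_share = share_within_gender(491, "Male")
female_share = share_within_gender(7086, "Female")

print(male_share, female_share)  # 3.49 23.66
```

Using the overall total of 44,001 as the denominator instead would give about 1.12% and 16.10%, which does not match the reported figures.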