From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection
Quang Huu Pham, Viet Anh Nguyen, Linh Bao Doan, Ngoc N. Tran, Ta Minh Thanh
Quang Huu Pham∗, Viet Anh Nguyen∗, Linh Bao Doan, Ngoc N. Tran
R&D Lab, Sun Asterisk Inc
{pham.huu.quang, nguyen.viet.anhd, doan.bao.linh, tran.ngo.quang.ngoc}@sun-asterisk.com
Ta Minh Thanh†
Le Quy Don Technical University, 236 Hoang Quoc Viet, Bac Tu Liem, Ha Noi
[email protected]
∗ Equal contribution   † Corresponding author
Abstract—Natural language processing (NLP) is a fast-growing field of artificial intelligence. Since the Transformer [32] was introduced by Google in 2017, a large number of language models such as BERT, GPT, and ELMo have been inspired by this architecture. These models were trained on huge datasets and achieved state-of-the-art results on natural language understanding. However, fine-tuning a pre-trained language model on much smaller datasets for downstream tasks requires a carefully designed pipeline to mitigate problems of the datasets such as lack of training data and imbalanced data. In this paper, we propose a pipeline to adapt the general-purpose RoBERTa language model to a specific text classification task: Vietnamese Hate Speech Detection. We first tune PhoBERT [9] on our dataset by re-training the model on the Masked Language Model (MLM) task; then, we employ its encoder for text classification. In order to preserve pre-trained weights while learning new feature representations, we further utilize different training techniques: layer freezing, block-wise learning rate, and label smoothing. Our experiments prove that our proposed pipeline boosts performance significantly, achieving a new state-of-the-art on the Vietnamese Hate Speech Detection (HSD) campaign with a 0.7221 F1 score.

Index Terms—Hate Speech Detection (HSD), RoBERTa, Text Classification, Natural Language Processing, Text Mining.
I. INTRODUCTION
A. Overview
The rapid growth of the Internet, social media, and community forums has allowed people across the world to connect instantaneously and has revolutionized communication as well as content issues. However, the increase of hate speech on these platforms has drawn significant expenditure from governments, organizations, companies, and researchers. The term "hate speech" can be understood as any kind of communication that uses pejorative or discriminatory language with reference to a person or a group based on their religion, gender, ethnicity, nationality, race, colour, descent, or other identity factors. Multiple statistics reports [20] and books [28] also show that hate speech and crime are highly correlated and on the rise together. Since the Internet gives people some degree of anonymity, some take this for granted and abuse it to harass others. Calling names, making distasteful comments about one's origin, or simply shaming someone in any way are all everyday examples of hate speech that anyone will encounter occasionally. To combat this, a vast number of methods have been studied and developed for automated HSD, which aims to classify textual content into hate and non-hate speech. By the end of 2019, the number of social network users in Vietnam had reached 48 million. Still, there has been limited available research on Vietnamese HSD; building appropriate countermeasures for hate speech requires detecting and tracing through content. In the case of the Vietnamese language, this task becomes difficult due to the diverse vocabulary and complex grammar.

(PhoBERT is a pre-trained RoBERTa model, known as a state-of-the-art language model for Vietnamese, provided by VinAI Research: https://github.com/VinAIResearch/PhoBERT. The HSD campaign: https://vlsp.org.vn/vlsp2019/eval/hsd)
For example, the same subword can have multiple meanings, which is different from Latin word roots. Moreover, the fact that the same base (sub)word can carry multiple possible intonations creates many hindrances on the path to language understanding: not only does each subword have multiple vastly different meanings, but this also causes an infinite number of combinatorial possibilities of either misspellings or intentional shorthand expressions. The variety of semantics and grammar in the Vietnamese language has posed a big challenge for automatic hate speech detection. Previous research on text classification based on traditional machine learning algorithms or on training deep learning models from scratch is often inefficient. It also requires a lot of effort for pre-processing and assimilating the semantics of words. In addition, recently proposed pre-trained language models have accomplished success in multiple natural language processing tasks through fine-tuning when integrated with the model of a downstream task. The emergence of pre-trained language models has contributed new ideas for solving significant problems. Pre-trained language models are large-scale neural network models based on the deep Transformer structure. Their initial parameters are learned through immense self-supervised training, then combined with multiple downstream models to fit specific tasks by fine-tuning. Empirical results show that the downstream-task performance of these kinds of models is usually better than that of conventional models.

B. Our contributions
In this paper, we investigate many experiments in fine-tuning the pre-trained RoBERTa model for text classification tasks, specifically Vietnamese HSD. We propose a general pipeline and model architectures to adapt a universal language model such as RoBERTa for downstream tasks such as text classification. With our technique, we achieve new state-of-the-art results on the Vietnamese Hate Speech campaign, organized by VLSP 2019. The main contributions of our paper are as follows:
• We propose a general pipeline to adopt a universal language model such as a pre-trained RoBERTa for text classification. It includes two steps: (1) re-training the masked language model task on the training data of the classification task; (2) fine-tuning the model with a new classification head for the target task.
• We conduct multiple methods to design model architectures for the text categorization task using a pre-trained RoBERTa model such as PhoBERT [9].
• A number of training techniques are suggested that can improve the efficiency of the fine-tuning phase in solving data problems. These techniques are practical and can empirically help the model prevent overfitting in the absence of training data or under data imbalance.
• From PhoBERT to Vietnamese HSD: we achieve new state-of-the-art results on the Vietnamese HSD task by utilizing PhoBERT and our proposed method.
C. Roadmap
The rest of the paper is organized as follows. Section 2 provides a brief survey of related work. Next, in Section 3, we introduce our proposed method, the model architecture, and our fine-tuning strategies. Experiments are described in Section 4, including dataset information, the data processing method, and other experimental settings. Section 5 shows our experimental results. Finally, Section 6 presents our conclusions.

II. RELATED WORK
A. Language Models
Language modeling has become an essential part of the modern NLP field. It helps computers understand human language by digitalizing qualitative information. Early studies on word embeddings tried to construct static representations for words, that is, context-independent embeddings. Mikolov et al. [22] introduced novel architectures with an open-source package called Word2Vec. The architectures consist of two models, Continuous Bag of Words (CBOW) and Skip-gram, which were trained on a huge dataset of 1.6 billion words. GloVe [27] is an unsupervised learning algorithm for context-independent word embedding extraction. It first creates a co-occurrence matrix of words and then factorizes it to extract dense word vectors. However, GloVe and Word2Vec both fail to represent rare or out-of-vocabulary words. To mitigate this problem, fastText [23] decomposes each word as a sum of character n-grams. This handles unseen words very well because their character n-grams still occur in other words. In contrast to context-independent embeddings, contextualized word embeddings aim to encode word semantics within contexts. Marking a new era of NLP, Bidirectional Encoder Representations from Transformers (BERT) [11] achieved new state-of-the-art performance on eleven NLP tasks, outperforming the previous best results (with a GLUE score of 80.4%, a 7.6% improvement) and humans on SQuAD 1.1 (with 93.2% accuracy, 2% higher). The large model consists of 24 Transformer blocks for a total of 340M parameters and was trained on a 3.3-billion-word corpus. GPT-2 [1] by OpenAI is an even larger model with 1.5 billion parameters and 48 layers. Being trained on a huge, diverse, and well-processed dataset, GPT-2 achieves state-of-the-art results on 7 out of 8 datasets. The Facebook research team proposed an improved training procedure for BERT, called RoBERTa [17]. The improvements include a ten-times larger dataset, longer training, increased batch size, byte-level encoding with a bigger vocabulary, excluding the next-sentence-prediction task, and dynamic masking. For Vietnamese language modeling, PhoBERT [9] is the current state of the art. The model is based on RoBERTa with two versions, "base" and "large". Both versions outperform the previous best on different downstream tasks.

(VLSP: the sixth international workshop on Vietnamese Language and Speech Processing.)
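The subword decomposition used by fastText can be illustrated with a minimal Python sketch. The boundary markers '<' and '>' and the idea of keeping the whole word as an extra feature follow the fastText paper; real fastText extracts n-grams of several sizes (3 to 6 by default), while this sketch uses a single size for brevity:

```python
def char_ngrams(word, n=3):
    """Decompose a word into character n-grams, fastText-style.

    The word is wrapped in boundary markers '<' and '>' so that
    prefixes and suffixes are distinguishable from word-internal
    n-grams, then a window of size n slides over the result. The
    full wrapped word is kept as one extra feature.
    """
    wrapped = f"<{word}>"
    grams = [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    return grams + [wrapped]

# An unseen or misspelled word still shares n-grams with known words:
print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```

Because a rare word shares most of these n-grams with frequent words, its embedding can be composed as the sum of the n-gram vectors, which is what lets fastText handle out-of-vocabulary tokens.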
B. Hate Speech Detection (HSD)
HSD can be understood as a type of text classification. Before the deep learning era, traditional methods had been widely studied for this notorious problem. These approaches require manual feature engineering to encode text sequences in vector form, which is then fed into classifiers such as Random Forests [35], Naive Bayes [5], [21], and Support Vector Machines [3], [21], [35].
Bag-of-words (BoW) is a way to represent documents by modeling the occurrence of every word. This has been reported to be a discriminative feature for HSD [3], [4], [5], [14], [30], [31], [34], [36]. However, modeling words suffers from the problem of typing errors, which add noisy words to the corpus.
Character n-grams divide a word into sub-words in order to suppress the mistyped parts of the word. In [21], the authors systematically proved that character-level n-gram representations outperform BoW in abusive language detection. Based on the idea that hate speech usually comes along with negative sentiment, the methods in [16], [25], [30] utilize sentiment analysis as evidence for HSD.
Linguistic features inject additional useful knowledge into the raw text. Xu et al. [36] combine Stanford CoreNLP [19] POS information with n-gram features. However, the POS tokens did not improve the performance of the classifiers. In contrast, [3], [4], [5], [25], [26] successfully employed dependency relationships in their feature sets and report significant performance improvements. Because hate speech usually comes from social media, it is feasible to extract meta-information such as how likely a user is to post hateful speech in the future [35], the amount of profanity in a user's activity history [7], or user gender [6], [34]. Recent studies are paying great attention to deep learning approaches. Deep learning not only boosts performance considerably but also requires less effort on feature engineering. The input of a deep learning model can simply be a one-hot encoding of text sequences [2], [10], [13], [38]; meaningful features are then learned by Convolutional Neural Networks (CNN) [13], Long Short-Term Memory (LSTM) [2], [10], or even a combination of CNN and LSTM [38]. However, one-hot representations suffer from high dimensionality, as their length equals the vocabulary size. Therefore, a more convenient way is to embed the input into a low-dimensional space. This can be character embeddings [21], comment embeddings [12], or text embeddings from a two-phase deep learning model [37].
The HSD Shared Task in the VLSP Campaign 2019 [33] is a challenge for detecting Vietnamese hate speech on social networks. Our team, the winning team, using logistic regression with ensemble learning on n-grams, achieved F1 scores of 0.67756 and 0.61971 on the public test and private test, respectively. The second-best team also utilized logistic regression with ensemble learning; however, their input includes more feature types: n-grams of words, part-of-speech tags, and numeric features. They scored 0.58883 on the private test, which is 3% lower than the winner. The other teams employed deep learning architectures from CNN, LSTM, and RNN to Bi-LSTM, LSTM-CNN, and Bi-GRU-LSTM-CNN, achieving scores from 0.51705 to 0.58455 on the private test.

III. PROPOSED METHOD
A. RoBERTa-based Network for HSD Task
Our architecture utilizes PhoBERT as a backbone network with some modifications. The output of each Transformer block represents a different semantic level of the input. In our experiments, we use different combinations of the outputs of those Transformer blocks. The general architecture of the model is shown in Figure 1. The features, combined across the outputs of multiple transformer blocks by concatenating or adding, are fed to the classification head. A simple classification head is a fully connected network without hidden layers.
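The block-combination step above can be sketched in a few lines. The sketch below is a NumPy stand-in, not the paper's implementation: random vectors play the role of the sentence-level representations produced by the twelve RoBERTa-base transformer blocks, and the linear head with no hidden layers matches the simple classification head described above:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, NUM_BLOCKS, NUM_CLASSES = 768, 12, 3

# Stand-ins for the 768-D sentence vector each transformer block emits.
block_outputs = [rng.standard_normal(HIDDEN) for _ in range(NUM_BLOCKS)]

def classify(block_outputs, block_ids, W, b):
    # Concatenate the selected block outputs into one long feature
    # vector, then apply a linear head (no hidden layers).
    features = np.concatenate([block_outputs[i] for i in block_ids])
    return features @ W + b   # raw logits over the three HSD classes

block_ids = (8, 9, 10, 11)   # e.g. the last four blocks (layers 9-12)
W = 0.01 * rng.standard_normal((HIDDEN * len(block_ids), NUM_CLASSES))
b = np.zeros(NUM_CLASSES)
logits = classify(block_outputs, block_ids, W, b)
print(logits.shape)  # (3,)
```

Choosing `block_ids` is exactly the design variable explored in the experiments: single blocks, middle blocks, last blocks, or all twelve.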
B. Proposed Fine-Tuning Strategies

1) Fine-Tuning Masked Language Model:
The pre-trained model was fit to a much larger dataset of a completely different domain. Therefore, while it might work very well in general, the vanilla pre-trained model will underperform on our one specific task. This induces the need for us to tune the model to our needs. Therefore, with the existing weights of PhoBERT as a starting point, we train our model with our domain-specific training data on the Masked Language Modeling (MLM) task.
Figure 1. An illustration of our architecture. The input is a tokenized sentence, fed in parallel to RoBERTa-base. Each of the twelve transformer blocks of the RoBERTa produces a 768-D vector. These vectors are then concatenated to form a long vector for the classification head. We investigate the performance of different combinations of these vectors.
Figure 2. The proposed fine-tuning strategies for hate speech detection.
Moreover, for such a large model to be trained successfully without either forgetting its good initialization (by gradient descending too far from it) or failing to converge (due to deep models being bad at propagating through further layers), we use a warm-up learning rate scheduling scheme. Originating from paper [15] under the name of "slanted triangular learning rates", the main purpose of this method is to make the model converge more quickly to a suitable region of the parameter space at the beginning of training.

Figure 3. Warm-up learning rate.
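The slanted triangular schedule can be sketched as a plain function of the step index. The shape (a short linear warm-up followed by a long linear decay, between lr_max/ratio and lr_max) follows [15]; the concrete values of lr_max, warmup_frac, and ratio below are illustrative, not the paper's exact settings:

```python
def triangular_lr(step, total_steps, lr_max=2e-5, warmup_frac=0.1, ratio=32):
    """Slanted triangular learning rate (Howard & Ruder, 2018).

    The rate rises linearly from lr_max/ratio to lr_max over the first
    warmup_frac of training, then decays linearly back down.
    """
    cut = int(total_steps * warmup_frac)
    if step < cut:
        p = step / cut                        # warm-up phase
    else:
        p = 1 - (step - cut) / (total_steps - cut)  # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio

schedule = [triangular_lr(t, 1000) for t in range(1000)]
assert max(schedule) == triangular_lr(100, 1000)  # peak at end of warm-up
```

The low initial rate keeps early gradient steps from destroying the pre-trained weights, while the brief peak lets the model move quickly into a good region before the decay stabilizes training.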
2) Optimization Strategy: Freeze or not Freeze:
Each layer in the RoBERTa network captures a different level of context. Specifically, lower layers produce global embeddings for words while the embeddings from upper layers are more context-specific. We would like to preserve the global knowledge while adjusting the context representations for our classification model. For the first few epochs, we only train the fully connected layers that are responsible for classifying text sequences, and the RoBERTa part is frozen. This allows the model to learn a decent decision boundary for the task. After these epochs, we unfreeze all layers and set different learning rates for different layers: the deeper (closer to the output) the layer, the larger its learning rate. This prevents the model from forgetting the useful global meaning of words while forcing it to learn the context of the domain.
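The freeze-then-unfreeze scheme with block-wise rates can be sketched as a simple assignment of one learning rate per encoder layer. The geometric decay factor and the head learning rate below are illustrative assumptions, not the paper's exact hyperparameters:

```python
def layer_learning_rates(num_layers=12, head_lr=1e-3, decay=0.9, frozen=False):
    """Assign a learning rate to each encoder layer.

    During the first epochs (frozen=True) the encoder gets rate 0 and
    only the classification head trains. Afterwards, lower layers,
    which hold general word knowledge, get geometrically smaller rates
    than the upper, more context-specific layers.
    """
    if frozen:
        return [0.0] * num_layers
    # layer 1 (bottom) gets the smallest rate, layer 12 the largest
    return [head_lr * decay ** (num_layers - k) for k in range(1, num_layers + 1)]

lrs = layer_learning_rates()
assert lrs[0] < lrs[-1]                           # bottom layer learns slowest
assert layer_learning_rates(frozen=True) == [0.0] * 12
```

In a PyTorch implementation these per-layer rates would typically be passed to the optimizer as separate parameter groups; the function above only computes the schedule itself.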
3) Label Smoothing:
For such a large model to be fit on a relatively small set of data, the model tends to become overconfident in its performance, going to the dark side of overfitting. To avoid this, we employ label smoothing, which softens our one-hot ground-truth labels. Specifically, for a model outputting probabilities y_k over K classes, instead of labelling our ground truth with one-hot encoding:

y_k = 1 if k = j for the true class j, and 0 otherwise, (1)

we slightly rebalance the target distribution so that it becomes less "binarized" by smoothing the probabilities with

y'_k = y_k (1 − α) + α/K (2)

for some smoothing parameter α. As a result, this technique teaches the model to have some uncertainty in its guesses and reduces the severity of overfitting. Moreover, since we are fine-tuning a pretrained model, the original output probability vector of the model is close to a one-hot vector. This introduces numerical instabilities if the new true label is also one-hot, since with cross-entropy as the loss function the loss blows up to infinity. Thus, used here, label smoothing acts as a small perturbation in our numerical method, making the training process more stable and helping our model converge better.

IV. EXPERIMENTS
A. Dataset
To perform experiments on the hate speech detection task, we use the HSD dataset from the VLSP workshop campaign in 2019 [29]. The dataset includes 25,431 samples; all were crawled from posts and comments on Facebook and annotated by many annotators into one of three classes: hate speech (HATE), offensive but not hate speech (OFFENSIVE), and neither offensive nor hate speech (CLEAN). The training data consists of 20,345 items whose label distribution is shown in Table I. In general, this is an imbalanced dataset and contains a lot of noisy data. There is an extremely enormous gap between the number of CLEAN data points and those from the other two categories, which leads to bad results for some methods, especially deep learning approaches. Because the dataset was crawled directly from users' posts and comments on social networks, it has some notable properties such as abbreviations, emoji, special characters, foreign languages, teen code, typing errors, etc. This has led to a significant challenge because PhoBERT and other pre-trained language models are often trained on normal clean data such as Wikipedia data or Vietnamese news.

Table I. TRAINING DATA DISTRIBUTION

                     HATE    OFFENSIVE    CLEAN
Number of samples    709     1,022        18,614
B. Data preprocessing
We have built a basic preprocessing module to process the raw text before feeding it into the model. It includes Unicode normalization, lowercase conversion, and the replacement of all emoji characters with an EMOJI label. Private personal information such as phone numbers and email addresses is also masked for privacy purposes. Finally, the raw text is segmented by a word segmenter. As PhoBERT employed the RDRSegmenter [8] from VnCoreNLP (https://github.com/vncorenlp/VnCoreNLP) to pre-process the training data of its language model task, it is recommended to use the same word segmenter for PhoBERT-based downstream applications.

C. Experimental Settings

1) Data augmentation:
To partially counter the data imbalance and the scarcity of data for the HATE and OFFENSIVE labels, we used the PhoBERT Masked Language Model as a method to augment data. From the original text, we randomly selected one token, replaced it with the mask token, and let the masked language model generate a replacement, producing a new training sample.
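The mask-and-fill augmentation can be sketched as follows. This is a toy stand-in: `predict_masked` here is an arbitrary callback playing the role of PhoBERT's MLM head, which in the real pipeline would score vocabulary candidates for the masked position:

```python
import random

def augment(tokens, predict_masked, rng=random):
    """Generate a new sample by masking one token and letting a
    masked-language model propose a replacement.

    `predict_masked` stands in for PhoBERT's MLM head: it receives the
    token list with one position set to '<mask>' and returns a token
    for that position.
    """
    tokens = list(tokens)
    i = rng.randrange(len(tokens))
    original = tokens[i]
    tokens[i] = "<mask>"
    tokens[i] = predict_masked(tokens, i)
    return tokens, original

# A trivial stand-in predictor that always proposes the same token:
new_sample, replaced = augment(["toi", "rat", "vui"], lambda toks, i: "kha")
print(new_sample)
```

Because the replacement is sampled from a language model rather than chosen at random, the augmented sentence tends to stay grammatical and label-preserving, which matters for the small HATE and OFFENSIVE classes.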
2) Model Training:
We fine-tuned PhoBERT with the Masked Language Model on the HSD dataset with a ratio of 15% masked tokens. As in the BERT default settings, the batch size is fixed to 32, with 10,000 steps. We chose Adam for the optimization and trained the model on a single GPU. Figure 2 illustrates the overall pipeline. After the PhoBERT tuning process completed, we retrained the HSD model with one or a combination of multiple output features from the transformer blocks. These features were then fed into a classification head, designed straightforwardly with fully connected and dropout layers. AdamW [18] was selected as the optimizer for the BERT blocks and the classification head, with a larger learning rate for the classification head than for the BERT blocks. Except for biases and LayerNorm parameters, which were excluded from weight decay, the weight decay was set to 0.01. The number of warm-up steps in our experiments equals a fraction of the total steps in one epoch. We utilized Stratified K-fold with k = 10; the ratio of samples in the train and validation sets was preserved as in the original dataset. Each and every fold was trained for 10 epochs.
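The stratified splitting used above can be sketched without any library: group the sample indices by label and deal each group round-robin into the folds, so every fold mirrors the overall label distribution. The label proportions below are illustrative, chosen to echo the CLEAN-heavy imbalance of the HSD dataset:

```python
from collections import defaultdict

def stratified_kfold(labels, k=10):
    """Split sample indices into k folds whose label distribution
    mirrors that of the full dataset."""
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        for j, idx in enumerate(indices):   # deal each class round-robin
            folds[j % k].append(idx)
    return folds

labels = ["CLEAN"] * 80 + ["OFFENSIVE"] * 15 + ["HATE"] * 5
folds = stratified_kfold(labels, k=5)
# every fold keeps the 80/15/5 ratio: 16 CLEAN, 3 OFFENSIVE, 1 HATE
print([len(f) for f in folds])  # [20, 20, 20, 20, 20]
```

In practice one would shuffle within each class before dealing (e.g. with scikit-learn's StratifiedKFold); the round-robin version above only shows why every fold inherits the global class ratios.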
3) Loss function:
We use the Label Smoothing loss, which is the combination of the cross-entropy loss and label smoothing. The smoothing rate is set to 0.2, as this shows prominent results after the first few epochs. We slightly lower the target value for the true class while raising the target values that were 0. These so-called soft targets give a lower loss when there is an incorrect prediction; consequently, our model penalizes low-entropy predictions, the "confidence penalty" mentioned in [24] and in the section above. The cross-entropy loss function is calculated as follows:

L(x', x) = − Σ_i x'_i log x_i, (3)

where x' is the true distribution of a particular data point (one-hot encoded, possibly with label smoothing applied), and x is the model's predicted distribution, with 0 ≤ x_i ≤ 1 for each class i. Our experiments are conducted on a computer with an Intel Core i7 9700K Turbo 4.9 GHz, 32 GB of RAM, a GeForce GTX 2080Ti GPU, and a 1 TB SSD hard disk.

V. EXPERIMENTAL RESULTS

A. Evaluation metrics

Macro F1-score is a common evaluation metric for classification tasks. The F1-score is the harmonic mean of Precision and Recall:

F1 = 2 / (Recall⁻¹ + Precision⁻¹), (4)

where Precision is the number of correct positive results divided by the number of all positive results, and Recall is the number of correct positive results divided by the number of all samples that should have been identified as positive.
F1-macro, or the macro-averaged F1 score, is computed as the mean of the F1 scores of the classes:

Macro F1 = (F1_HATE + F1_OFFENSIVE + F1_CLEAN) / 3. (5)

Table II. MEAN OF MACRO F1 SCORE ON STRATIFIED K-FOLD WITH K = 10 FOR DIFFERENT FEATURE BLOCKS

Feature blocks                    Mean of F1 score
Layer 6 (only single block)       0.6854
Layer 12 (only single block)      0.6978
Layer 3-6 (4 middle blocks)       0.6855
Layer 9-12 (4 last blocks)        0.6989
Layer 1-6 (6 first blocks)        0.6905
Layer 7-12 (6 last blocks)        0.6989
Layer 1-12 (all blocks)           0.6979

Table III. SAMPLES THAT OUR SYSTEM CLASSIFIED CORRECTLY AND INCORRECTLY

Text                                         Model prediction    Truth label
"học kỳ cuối như đồ thị hình sóng thần"      CLEAN               CLEAN
"lan anh đm ám ảnh vl"                       OFFENSIVE           OFFENSIVE
"nguyễn hoàng bách dm a mồm lol à"           HATE                HATE
"hay vat đúng lắm ạ"                         OFFENSIVE           CLEAN
"có deo đâu mà xem vl"                       CLEAN               OFFENSIVE
"hường lily mặt ngu vl"                      OFFENSIVE           HATE

B. Our results

Our experimental results are shown in Table II and Table IV. Table II indicates the mean of the Macro F1 score when using features from different BERT blocks. We test multiple selected blocks, including single blocks (layers 6, 12), middle blocks (layers 3-6), last blocks (layers 6-12, 7-12, 9-12), and all blocks. The results retrieved with the last blocks are by far the top scores. The last 4 blocks (layers 9-12) and the last 6 blocks (layers 7-12) both achieve the highest F1 score at 0.6989. Table IV demonstrates the effectiveness of each fine-tuning method. Each individual technique boosts the performance of the model by 0.5-1.5% in terms of F1 score, while the combination of these methods significantly enhances the score up to 0.7211, outperforming the winner of this challenge (0.67756 F1 score) and the current best result on the public leaderboard (0.71432 F1 score with a single model). Some samples that our system classified correctly and incorrectly are shown in Table III. This shows that our proposed method is efficient for the HSD task.
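The label-smoothed cross-entropy described by Eqs. (1)-(3) can be sketched in plain Python. The smoothing rate 0.2 matches the setting reported above; the predicted distribution in the example is an arbitrary illustrative vector:

```python
import math

def smoothed_targets(true_class, num_classes, alpha=0.2):
    """Eqs. (1)-(2): soften a one-hot label with smoothing rate alpha."""
    one_hot = [1.0 if k == true_class else 0.0 for k in range(num_classes)]
    return [y * (1 - alpha) + alpha / num_classes for y in one_hot]

def cross_entropy(target, predicted):
    """Eq. (3): L = -sum_i x'_i log x_i."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

# Three HSD classes; the true class gets 0.8 + 0.2/3, the rest 0.2/3.
target = smoothed_targets(true_class=0, num_classes=3, alpha=0.2)
print(target)  # ≈ [0.867, 0.067, 0.067]
loss = cross_entropy(target, [0.7, 0.2, 0.1])
```

Because no target probability is exactly 0 or 1, the log terms stay bounded even when the model's output is nearly one-hot, which is the numerical-stability benefit discussed in Section III.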
Table IV. MEAN OF MACRO F1 SCORE ON STRATIFIED K-FOLD WITH K = 10 WITH CONCATENATION OF LAYERS AND OUR TRAINING APPROACH

Proposed training approach       Mean of F1 score
Cross entropy loss               0.6922
Label Smoothing loss             0.7005
Non warm-up learning rate        0.6989
Warm-up learning rate            0.7062
Non fine-tune MLM                0.6989
Fine-tune MLM                    0.7162
Non block-wise learning rate     0.7051
Block-wise learning rate         0.7079
Combine all the methods          0.7211

VI. CONCLUSION

In this paper, we have explored and proposed our pipeline to solve the Vietnamese Hate Speech Detection task by using a pre-trained universal language model. By intensively conducting experiments using PhoBERT, we have demonstrated that RoBERTa and our fine-tuning strategy are highly effective in text classification tasks. With our proposed methods, we have achieved new state-of-the-art results on the Vietnamese HSD campaign. For future work, we would like to explore different classification head architectures. Instead of only using the fully connected layer, we will experiment with other architectures, for instance LSTM, RNN, and CNN-LSTM, on top.

ACKNOWLEDGMENT

This work is partially supported by Sun-Asterisk Inc. We would like to thank our colleagues at Sun-Asterisk Inc for their advice and expertise. Without their support, this experiment would not have been accomplished. We also express our gratitude to Tiep Vu, founder of AIVIVN, for supporting us in evaluating results on aivivn.com.

REFERENCES

[1] R. Alec, J. Wu, C. Rewon, L. David, A. Dario, and S. Ilya. Language models are unsupervised multitask learners. 2019.
[2] Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. Deep learning for hate speech detection in tweets. CoRR, abs/1706.00188, 2017.
[3] Pete Burnap and Matthew Williams. Cyber hate speech on twitter: An application of machine classification and statistical modeling for policy and decision making: Machine classification of cyber hate speech. Policy & Internet, 7, 04 2015.
[4] Pete Burnap and Matthew L. Williams.
Us and them: identifying cyber hate on twitter across multiple protected characteristics. EPJ Data Science, 5, 2016.
[5] Y. Chen, Y. Zhou, S. Zhu, and H. Xu. Detecting offensive language in social media to protect adolescent online safety. Pages 71–80, 2012.
[6] M. Dadvar, F. Jong, R. Ordelman, and D. Trieschnigg. Improved cyberbullying detection using gender information. 01 2012.
[7] M. Dadvar, D. Trieschnigg, R. Ordelman, and F. Jong. Improving cyberbullying detection with user context. Pages 693–696, 03 2013.
[8] N. Q. Dat, N. Q. Dai, V. Thanh, M. Dras, and M. Johnson. A fast and accurate Vietnamese word segmenter. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pages 2582–2587, 2018.
[9] N. Q. Dat and N. A. Tuan. PhoBERT: Pre-trained language models for Vietnamese, 2020.
[10] Fabio Del Vigna, Andrea Cimino, Felice Dell'Orletta, Marinella Petrocchi, and Maurizio Tesconi. Hate me, hate me not: Hate speech detection on Facebook. 01 2017.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2018.
[12] Nemanja Djuric, Jing Zhou, Robin Morris, Mihajlo Grbovic, Vladan Radosavljevic, and Narayan Bhamidipati. Hate speech detection with comment embeddings. In Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, pages 29–30, New York, NY, USA, 2015. Association for Computing Machinery.
[13] Björn Gambäck and Utpal Kumar Sikdar. Using convolutional neural networks to classify hate-speech. In Proceedings of the First Workshop on Abusive Language Online, pages 85–90, Vancouver, BC, Canada, August 2017. Association for Computational Linguistics.
[14] Homa Hosseinmardi, Sabrina Arredondo Mattson, Rahat Ibn Rafiq, Richard Han, Qin Lv, and Shivakant Mishra. Detection of cyberbullying incidents on the Instagram social network, 2015.
[15] Jeremy Howard and Sebastian Ruder.
Universal language model fine-tuning for text classification, 2018.
[16] ACM Trans. Interact. Intell. Syst., 2(3), September 2012.
[17] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach, 2019.
[18] I. Loshchilov and F. Hutter. Decoupled weight decay regularization, 2017.
[19] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60. Association for Computational Linguistics, June 2014.
[20] W. L. Matthew, B. Pete, J. Amir, L. Han, and O. Sefa. Corrigendum to: Hate in the machine: Anti-Black and anti-Muslim social media posts as predictors of offline racially and religiously aggravated crime. The British Journal of Criminology, 60(1):242–242, 2019.
[21] Yashar Mehdad and Joel Tetreault. Do characters abuse more than words? Pages 299–303, 01 2016.
[22] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013.
[23] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[24] Rafael Müller, Simon Kornblith, and Geoffrey Hinton. When does label smoothing help?, 2019.
[25] Dennis Njagi, Z. Zuping, Damien Hanyurwimfura, and Jun Long. A lexicon-based approach for hate speech detection. International Journal of Multimedia and Ubiquitous Engineering, 10:215–230, 04 2015.
[26] Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. Abusive language detection in online user content.
In Proceedings of the 25th International Conference on World Wide Web, WWW '16, pages 145–153, Republic and Canton of Geneva, CHE, 2016. International World Wide Web Conferences Steering Committee.
[27] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proc. of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics.
[28] Timothy C. Shiell. Campus hate speech on trial. Univ Pr of Kansas, 2009.
[29] V. X. Son, V. Thanh, T. M. Vu, L. C. Thanh, and N. T. M. Huyen. HSD shared task in VLSP campaign 2019: Hate speech detection for social good. Proceedings of VLSP 2019, 2019.
[30] Sara Owsley Sood, Elizabeth F. Churchill, and Judd Antin. Automatic identification of personal insults on social news sites. J. Am. Soc. Inf. Sci. Technol., 63(2):270–285, February 2012.
[31] Cynthia Van Hee, Els Lefever, Ben Verhoeven, Julie Mennes, Bart Desmet, Guy Pauw, Walter Daelemans, and Veronique Hoste. Detection and fine-grained classification of cyberbullying events. 09 2015.
[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need, 2017.
[33] Xuan-Son Vu, Thanh Vu, Mai-Vu Tran, Thanh Le-Cong, and Huyen T. M. Nguyen. HSD shared task in VLSP campaign 2019: Hate speech detection for social good, 2020.
[34] Zeerak Waseem and Dirk Hovy. Hateful symbols or hateful people? Predictive features for hate speech detection on twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93, San Diego, California, June 2016. Association for Computational Linguistics.
[35] Guang Xiang, Bin Fan, Ling Wang, Jason Hong, and Carolyn Rose. Detecting offensive tweets via topical feature discovery over a large scale twitter corpus. Pages 1980–1984, 10 2012.
[36] J. M. Xu, K. S. Jun, X. Zhu, and A. Bellmore. Learning from bullying traces in social media.
In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT '12, pages 656–666, USA, 2012. Association for Computational Linguistics.
[37] Shuhan Yuan, Xintao Wu, and Yang Xiang. A two phase deep learning model for identifying discrimination from tweets. In