Exploring Text-transformers in AAAI 2021 Shared Task: COVID-19 Fake News Detection in English
Xiangyang Li∗ ¹, Yu Xia∗ ¹, Xiang Long ¹, Zheng Li ², and Sujian Li ¹

¹ Key Laboratory of Computational Linguistics (MOE), Department of Computer Science, Peking University, China
{xiangyangli, yuxia, 1800017744, lisujian}@pku.edu.cn
² Beijing University of Posts and Telecommunications, China
[email protected]

∗ Equal contribution.
Abstract.
In this paper, we describe our system for the AAAI 2021 shared task of COVID-19 Fake News Detection in English, where we achieved third place with a weighted F1 score of 0.9859 on the test set. Specifically, we propose an ensemble of different pre-trained language models, such as BERT, RoBERTa, and ERNIE, combined with various training strategies including warm-up, learning rate scheduling, and k-fold cross-validation. We also conduct an extensive analysis of the samples that are not correctly classified. The code is available at: https://github.com/archersama/3rd-solution-COVID19-Fake-News-Detection-in-English.
Keywords: Natural language processing · Pre-trained language model · COVID-19 · Fake news detection · BERT.
1 Introduction

Due to the COVID-19 pandemic, offline communication has decreased, and tens of millions of people have expressed their opinions and published news on the Internet. However, some users may publish unverified news. If such news is fake, it can lead to irreparable losses, such as "drinking bleach to kill the new coronavirus". Manual detection of this fake news is not feasible because of the huge volume of online communication. In addition, the individuals responsible for checking such content may suffer from depression and burnout. For these reasons, it is desirable to build a system that can automatically detect online fake news about COVID-19.

The Constraint@AAAI 2021 shared task of COVID-19 Fake News Detection in English was organized by the First Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation. The data come from various social media platforms, such as Twitter, Facebook, and Instagram. Given a piece of social media news, the goal of the shared task is to classify it as fake or real.

The rest of the paper is organized as follows: Section 2 introduces the dataset of this task. Section 3 details the architecture of our system (features, models, and ensembles).
Section 4 offers an analysis of the performance of our models. Section 5 describes the related work. Finally, Section 6 presents our conclusions for this task.
2 Dataset

In this section, we first introduce the datasets we use and then perform some exploratory analyses on them.
We use the officially provided dataset [9] and an external dataset we collected from the Internet as our training data. The distribution of the data is shown in Table 1.
Table 1. Statistics of the datasets.

Dataset    Train   Val    Test
Official   6420    2140   2140
External   699     233    233
Fig. 1. The distribution of positive and negative samples in the training and validation sets: (a) data distribution in the training set; (b) data distribution in the validation set.
In order to gain a better understanding of the dataset, we first perform some exploratory analyses, which help us see the hidden patterns in the data at a glance and find the model most suitable for the data.
We first explore the distribution of positive and negative samples in the training set and validation set, as shown in Fig. 1. From Fig. 1, we can see that in both the training and validation sets the number of real news items exceeds the number of fake news items, which shows that our dataset is imbalanced, so a balanced data sampling method can be considered when preprocessing the data (a minimal sketch is given below).
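As one possible illustration of such balanced sampling in PyTorch (not taken from the paper's released code), a WeightedRandomSampler can draw fake and real news with equal probability; the labels tensor below is a toy stand-in:

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-in labels: 0 = fake, 1 = real (imbalanced, as in the shared task).
labels = torch.tensor([0, 1, 1, 1, 0, 1, 1, 1])
features = torch.randn(len(labels), 16)  # placeholder feature vectors

# Weight every sample by the inverse frequency of its class so that both
# classes are drawn with equal probability during training.
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels),
                                replacement=True)
loader = DataLoader(TensorDataset(features, labels), batch_size=4,
                    sampler=sampler)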
Fig. 2. Word cloud diagrams of the training set and the validation set, with word size determined by word frequency: (a) training-set word cloud; (b) validation-set word cloud.
In order to analyze the characteristics of the words in the sentences, we calculate the word frequencies of the training and validation sets respectively, remove the stop words, and draw the corresponding word cloud diagrams shown in Fig. 2. From Fig. 2, we can see that 'COVID', 'https', and 'co' are the most frequent words in the dataset. 'COVID' and 'co' appear more frequently than in ordinary text, while the high frequency of 'https' is at first a strange phenomenon. After further inspection, we found that these tokens come from the URLs attached to each piece of news. Therefore, in the data preprocessing step, we can consider removing URLs from the sentences, as sketched below.
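A minimal sketch of this preprocessing step (the exact URL pattern the authors used is not given in the paper, so the regex below is an assumption):

import re

# Assumed pattern: match http(s) links and bare 'www.' links.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_urls(text: str) -> str:
    """Strip URLs, the source of the frequent 'https'/'co' tokens."""
    return URL_PATTERN.sub(" ", text).strip()

print(remove_urls("Drinking bleach does NOT cure COVID-19 https://t.co/abc"))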
3 Methodology

We propose two fake news detection models: one is the Text-RNN model based on a bidirectional LSTM, and the other is Text-Transformers based on transformers. The two models are described below.
Although LSTM-based deep neural networks have proven their effectiveness, one disadvantage is that a unidirectional LSTM only conditions on the preceding text. Our first model therefore uses a bidirectional LSTM to overcome this shortcoming. The architecture of the model is shown in Fig. 3. In the Text-RNN model, we use GloVe [10] word vectors of dimension 200 as our embedding layer. After the encoded word vectors pass through the bidirectional LSTM, we take the hidden state of the last layer and obtain the final result through a fully connected layer (see the sketch after Fig. 3).
Fig. 3. The Text-RNN model based on a bidirectional LSTM.
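A hedged PyTorch sketch of this architecture; only the 200-dimensional embedding size comes from the paper, while the hidden size and other hyperparameters are illustrative assumptions:

import torch
import torch.nn as nn

class TextRNN(nn.Module):
    """BiLSTM classifier sketch: embedding -> BiLSTM -> fully connected layer."""

    def __init__(self, vocab_size: int, embed_dim: int = 200,
                 hidden_dim: int = 128, num_classes: int = 2):
        super().__init__()
        # In the paper the embedding layer is initialized with 200-d GloVe
        # vectors; here it is randomly initialized for self-containedness.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)      # (batch, seq, embed)
        _, (h_n, _) = self.lstm(embedded)         # h_n: (2, batch, hidden)
        # Concatenate the final forward and backward hidden states.
        final = torch.cat([h_n[-2], h_n[-1]], dim=-1)
        return self.fc(final)

logits = TextRNN(vocab_size=30000)(torch.randint(0, 30000, (4, 140)))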
Contextualized language models such as ELMo and BERT, trained on large corpora, have recently demonstrated remarkable performance gains across many NLP tasks. In our experiments, we use various architectures of language models as the backbone of our second model. As shown in Fig. 4, we use five different language models, BERT, ERNIE, RoBERTa, XLNet, and Electra, trained with five-fold cross-validation. We designed three training methods for this model architecture (a sketch of the pseudo-label step follows this list):

– Five-fold Single-model Ensemble: For each fold of the five-fold cross-validation, we use the same model for fine-tuning.
– Five-fold Five-model Ensemble: For each fold of the five-fold cross-validation, we use a different model for fine-tuning.
– Pseudo Label Algorithm: Because the amount of data is small, we use a pseudo-label algorithm for data augmentation. If a test sample is predicted with a probability greater than 0.95, we assume the prediction is correct with relatively high confidence and add the sample to the training set.
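A minimal sketch of this pseudo-label step (a scikit-learn classifier stands in for the fine-tuned transformer; only the 0.95 threshold comes from the paper):

import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label(model, X_train, y_train, X_test, threshold=0.95):
    """Add confidently predicted test samples to the training set."""
    proba = model.predict_proba(X_test)      # (n_samples, n_classes)
    confident = proba.max(axis=1) > threshold  # paper's threshold: 0.95
    X_aug = np.vstack([X_train, X_test[confident]])
    y_aug = np.concatenate([y_train, proba[confident].argmax(axis=1)])
    return X_aug, y_aug

# Toy usage with stand-in data.
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(100, 8)), rng.integers(0, 2, 100)
X_te = rng.normal(size=(40, 8))
clf = LogisticRegression().fit(X_tr, y_tr)
X_aug, y_aug = pseudo_label(clf, X_tr, y_tr, X_te)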
Fig. 4. Five-fold five-model cross-validation framework based on pre-trained language models.

– Weight Ensemble: We adopt soft voting as the integration strategy, i.e., we average the class probabilities predicted by all models and take the class with the highest average probability as the final prediction. In our method, we use the best F1 score that each fold's model achieves on the validation set as its ensemble weight (a sketch follows the hyperparameter list below).

The hyperparameter settings are as follows:

– TextRNN: The number of epochs is set to 120, the learning rate to 0.01, the batch size to 128, the text length to 140, and the dropout rate to 0.2. The learning rate is multiplied by an attenuation coefficient of 0.1 every 30 epochs.
– Text-Transformers: The number of epochs for each fold is set to 12, the batch size to 256, and the maximum text length to 140. Due to the complexity of the transformer model, we adopt the training strategies described below.
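A hedged sketch of this F1-weighted soft voting (array shapes and names are our assumptions):

import numpy as np

def weighted_soft_vote(fold_probs, fold_f1):
    """Average class probabilities across fold models, weighting each model
    by its best validation F1 score, then take the argmax class."""
    weights = np.asarray(fold_f1, dtype=float)
    weights /= weights.sum()
    # fold_probs: (n_folds, n_samples, n_classes)
    avg_probs = np.tensordot(weights, np.asarray(fold_probs), axes=1)
    return avg_probs.argmax(axis=1)

# Toy usage: 5 fold models, 3 samples, 2 classes.
rng = np.random.default_rng(0)
probs = rng.random((5, 3, 2))
probs /= probs.sum(axis=2, keepdims=True)
print(weighted_soft_vote(probs, [0.97, 0.98, 0.96, 0.99, 0.98]))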
– Label Smoothing: Label smoothing [17] is a regularization technique that introduces noise into the labels. It assumes that, for a small constant ε, the training set label y is correct with probability 1 − ε and incorrect otherwise. Label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of ε/(k − 1) and 1 − ε, respectively. In our strategy, we set ε to 0.01.
– Learning Rate Warm-up: A learning rate that is too large may cause numerical instability, especially at the very beginning of training, when the parameters are randomly initialized. The warm-up [4] strategy increases the learning rate linearly from 0 to the initial learning rate during the first N epochs or m batches. In our strategy, we start from a learning rate of 1e-6, which increases gradually to 5e-5 over the first 6 epochs.
– Learning Rate Cosine Decay: After the warm-up stage described above, we steadily decrease the learning rate from its initial value. Compared with widely used strategies such as exponential decay and step decay, cosine decay [7] decreases the learning rate slowly at the beginning, almost linearly in the middle, and slowly again at the end, which potentially improves training. In our strategy, after reaching the maximum value of 5e-5, the learning rate decreases back to 1e-6 through a cosine decay over 6 epochs (a sketch of the combined schedule follows this list).
– Domain Pretraining: Sun et al. [14] demonstrated that further domain pretraining of models such as BERT on the target corpus can yield performance gains. Therefore, we adopt COVID-Twitter-BERT, which is pretrained on a large corpus of Twitter messages about COVID-19.
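A minimal sketch of the combined warm-up and cosine-decay schedule using the values stated above (whether the authors stepped the schedule per batch or per epoch is our assumption):

import math

def lr_at_epoch(epoch: int, warmup_epochs: int = 6, decay_epochs: int = 6,
                lr_min: float = 1e-6, lr_max: float = 5e-5) -> float:
    """Linear warm-up from lr_min to lr_max, then cosine decay back to lr_min,
    using the 6-epoch / 1e-6 / 5e-5 values stated in the paper."""
    if epoch < warmup_epochs:
        return lr_min + (lr_max - lr_min) * epoch / warmup_epochs
    t = min(epoch - warmup_epochs, decay_epochs) / decay_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

for e in range(12):
    print(e, f"{lr_at_epoch(e):.2e}")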
4 Results

In Table 2, we present our results. We evaluate our models using the official competition metric, the weighted F1 score, i.e., the per-class F1 scores averaged with weights proportional to the number of instances in each class.
Table 2. Results of different models.

Method                                                          Accuracy  Precision  Recall  Weighted F1
TextRNN                                                         0.924     0.935      0.924   0.926
Text-Transformers + five-fold single-model cross-validation    0.976     0.974      0.974   0.976
Text-Transformers + five-fold five-model cross-validation      0.980     0.982      0.980   0.981
Text-Transformers + five-fold five-model CV + pseudo-labels    0.985     0.986      0.985   0.985
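For reference, the weighted F1 metric corresponds to scikit-learn's average='weighted' option (the labels below are toy values, not competition data):

from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 1, 1]  # toy labels: 0 = fake, 1 = real
y_pred = [0, 1, 1, 1, 1, 0]
# Per-class F1 scores averaged with weights proportional to class support.
print(f1_score(y_true, y_pred, average="weighted"))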
Fig. 5. Results of the five-fold five-model ensemble. The blue and orange lines represent the validation and training F1 scores, and the red and green lines represent the validation and training losses: (a) fold 1, BERT; (b) fold 2, ERNIE; (c) fold 3, RoBERTa; (d) fold 4, XLNet; (e) fold 5, Electra.
Fig. 6. Confusion matrix of the predicted results against the true labels.
In order to make full use of the data, we merged the training set and the validation set. For TextRNN, we re-divided the merged data into a training set and a validation set at a ratio of 8:2 and trained on this single split, obtaining a weighted F1 score of 0.926. For the Text-Transformers model, we used five-fold cross-validation and compared five-fold single-model cross-validation with five-fold five-model cross-validation, achieving weighted F1 scores of 0.976 and 0.981, respectively. After adding pseudo-labels, a weighted F1 score of 0.985 was obtained on the test set, earning third place in a competition that attracted 421 participating teams. Fig. 5 shows the performance of our model on each fold.
In order to further understand the results on the test set, we investigate the predictions made by our models through simple visualizations of the confusion matrices produced by our best models. From Fig. 6, we can see that our model has high precision, which is also evident from Table 2 above. Fig. 6 also shows that our model produces slightly more false negatives than false positives. In other words, the chance of our model mislabeling fake news as real news is slightly higher than that of predicting real news as fake.
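The confusion matrix visualized in Fig. 6 can be computed directly from the predictions; the labels below are toy values, not the competition data:

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 1, 0]  # toy labels: 0 = fake, 1 = real
y_pred = [1, 1, 0, 1, 1, 1]
# Rows are true classes, columns are predicted classes; with class 0 (fake)
# listed first, cm[0, 1] counts fake news mislabeled as real.
print(confusion_matrix(y_true, y_pred))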
5 Related Work

Pre-training followed by fine-tuning has become a new paradigm in natural language processing. Through self-supervised learning on a large corpus, a language model can learn general knowledge and then transfer it to downstream tasks by fine-tuning on task-specific data. ELMo [11] uses a bidirectional LSTM [5] to extract context-sensitive word vectors. GPT [12] enhances context-sensitive embeddings with the transformer [18]. The bidirectional language model BERT [2] applies cloze-style masking and next sentence prediction as self-supervised objectives to strengthen word embeddings. Liu et al. [6] removed next sentence prediction from self-supervised training and trained more thoroughly, obtaining a better language model named RoBERTa. Sun et al. [16] strengthened the pre-trained language model by masking whole spans in ERNIE, and Sun et al. [15] further proposed continual multi-task pre-training and several new pre-training tasks in ERNIE 2.0. In our system, we fine-tuned the above models using k-fold cross-validation, which achieved excellent performance.

K-fold cross-validation [8] divides the training set into k sub-samples: a single sub-sample is reserved as validation data, and the other k − 1 sub-samples are used for training. The procedure is repeated k times so that each sub-sample is used for validation exactly once, and the results are averaged (or otherwise combined) to obtain a single estimate. The advantage of this method is that the randomly generated sub-samples are repeatedly used for both training and validation, yielding less biased estimates.

Traditional k-fold cross-validation trains each fold with the same model and retains only the best result. In our system, we instead use a different model for each fold and keep every fold's model so that their results can be fused. Our experiments show that this method outperforms common k-fold cross-validation (a minimal sketch is given below).
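A hedged, classifier-agnostic sketch of this one-model-per-fold scheme (scikit-learn classifiers stand in for the five fine-tuned language models):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-ins for the five pre-trained language models; one model per fold.
fold_models = [LogisticRegression(), RandomForestClassifier(),
               GradientBoostingClassifier(), SVC(probability=True),
               DecisionTreeClassifier()]

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
X_test = rng.normal(size=(20, 10))

test_probs = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for (train_idx, _), model in zip(kfold.split(X), fold_models):
    model.fit(X[train_idx], y[train_idx])    # a different model per fold
    test_probs.append(model.predict_proba(X_test))

# Soft-vote across the five fold models.
final_pred = np.mean(test_probs, axis=0).argmax(axis=1)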
In the past few years, there have been several studies applying computational methods to fake news detection. Ceron et al. [1] used topic models to identify fake news, Hamid et al. [3] proposed using Bag of Words (BoW) and BERT embeddings, and Yuan et al. [19] explicitly exploited the credibility of publishers and users for early fake news detection. During the COVID-19 pandemic, however, a reliable automated detection system for COVID-19 fake news is needed, yet the work above rarely studies how to detect COVID-19 fake news and ignores ensemble strategies for pre-trained language models.
6 Conclusion

In this paper, we presented our approach to COVID-19 fake news detection in English. We built two types of models, one based on a bidirectional LSTM and one based on transformers, and the transformer-based model achieved better results in this competition. We showed that five-fold five-model cross-validation outperforms five-fold single-model cross-validation, and that the pseudo-label algorithm can effectively improve performance. In the future, we plan to use generative models such as T5 [13] to generate labels directly, further enhancing the predicted results.
Acknowledgements
This work was partially supported by the National Key Research and Development Project (2019YFB1704002) and the National Natural Science Foundation of China (61876009).
References
1. Ceron, W., de Lima-Santos, M.F., Quiles, M.G.: Fake news agenda in the era of COVID-19: Identifying trends through fact-checking content. Online Social Networks and Media, p. 100116 (2020)
2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
3. Hamid, A., Shiekh, N., Said, N., Ahmad, K., Gul, A., Hassan, L., Al-Fuqaha, A.: Fake news detection in social media using graph neural networks and NLP techniques: A COVID-19 use-case (2020)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
5. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
6. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
7. Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
8. Mosteller, F., Tukey, J.W.: Data analysis, including statistics. In: Handbook of Social Psychology, vol. 2