g2tmn at Constraint@AAAI2021: Exploiting CT-BERT and Ensembling Learning for COVID-19 Fake News Detection
Anna Glazkova, Maksim Glazkov, and Timofey Trifonov

University of Tyumen, ul. Volodarskogo 6, 625003 Tyumen, Russia
[email protected], [email protected]
"Organization of cognitive associative systems" LLC, ul. Gertsena 64, 625000 Tyumen, Russia
[email protected]

Abstract.
The COVID-19 pandemic has had a huge impact on various areas of human life. Hence, the coronavirus pandemic and its consequences are being actively discussed on social media. However, not all social media posts are truthful. Many of them spread fake news that causes panic among readers, misinforms people, and thus exacerbates the effect of the pandemic. In this paper, we present our results at the Constraint@AAAI2021 Shared Task: COVID-19 Fake News Detection in English. In particular, we propose our approach using a transformer-based ensemble of COVID-Twitter-BERT (CT-BERT) models. We describe the models used, the ways of text preprocessing, and adding extra data. As a result, our best model achieved the weighted F1-score of 98.69 on the test set (the first place in the leaderboard) of this shared task, which attracted 166 submitted teams in total.
Keywords: COVID-Twitter-BERT, social media, fake news, ensembling learning, coronavirus, infodemic, text classification
1 Introduction

Social media is a unique source of information. On the one hand, its low cost, easy access, and distribution speed make it possible to quickly share the news. On the other hand, the quality and reliability of social media news are difficult to verify [38]. This is the source of a lot of false information that has a negative impact on society.

Over the past year, the world has been watching the situation developing around the novel coronavirus pandemic. The COVID-19 pandemic has become a significant newsworthy event of 2020. Therefore, news related to COVID-19 is actively discussed on social media, and this topic generates a lot of misinformation. Fake news related to the pandemic has large-scale negative social consequences: it provokes huge public rumor spreading and misunderstanding about COVID-19 and aggravates the effects of the pandemic. Moreover, recent studies [22] show an increase in symptoms such as anxiety and depression
in connection with the pandemic. This is closely related to the spread of misinformation, because fake news can be more successful when the population is experiencing a stressful psychological situation [25]. The popularity of fake news on social media can rapidly increase, because the rebuttal is always published too late. In this regard, there is evidence that the development of tools for automatic COVID-19 fake news detection plays a crucial role in the regulation of information flows.

In this paper, we present our approach for the Constraint@AAAI2021 Shared Task: COVID-19 Fake News Detection in English [29], which attracted 433 participants on CodaLab. This approach achieved the weighted F1-score of 98.69 (the first place in the leaderboard) on the test set among 166 submitted teams in total.

The rest of the paper is organized as follows. A brief review of related work is given in Section 2. The definition of the task is summarized in Section 3, followed by a brief description of the data used in Section 4. The proposed methods and experimental settings are elaborated in Section 5. Section 6 contains the results and error analysis. Section 7 concludes the paper.
2 Related Work

In recent years, the task of detecting fake news and rumors has become extremely relevant. False information spreading involves various research tasks, including fact checking [4,40], topic credibility [15,41], fake news spreaders profiling [34], and manipulation techniques detection [8]. Various technologies and approaches in this field range from traditional machine learning methods [5,23,33] to state-of-the-art transformers [24,47].

An overview of fake news detection approaches and challenges on social media is given in [38,50]. Many scholars have proposed their solutions to this problem in different subject areas (in particular, [6,35]). Up to now, a large number of studies in fake news detection have used supervised methods, including models based on transformer architectures [13,17,49].

Some recent works have focused on detecting fake news about COVID-19. For example, the predictors of the sharing of false information about the pandemic are discussed in [3]. In [44], a novel COVID-19 fact checking algorithm is proposed that retrieves the facts most relevant to user claims. A number of studies have begun to examine COVID-19 fake news detection methods for non-English languages [10,14,48]. In addition, several competitions related to the analysis of posts about COVID-19 on social media have been announced over the past year [1,27,36].
3 Task Definition

The task focused on the detection of COVID-19-related fake news in English. The sources of data were various social media platforms, such as Twitter, Facebook, Instagram, etc. Formally, the task is described as follows.
– Input. Given a social media post.
– Output. One of two labels: "fake" or "real".

The official competition metric was the F1-score averaged across the classes (the weighted F1-score). The participants were allowed five submissions per team throughout the test phase.
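For reference, the official metric can be computed with scikit-learn [31]; the snippet below is a minimal illustration with placeholder labels, not code from the shared task.

from sklearn.metrics import f1_score

# Weighted F1: the F1-score of each class, averaged with weights
# proportional to the number of true instances of that class.
y_true = ["real", "fake", "fake", "real", "real"]  # placeholder gold labels
y_pred = ["real", "fake", "real", "real", "real"]  # placeholder predictions

print(f1_score(y_true, y_pred, average="weighted"))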
4 Data

The dataset [28] provided to the participants of the shared task contains 10,700 manually annotated social media posts divided into training (6420), validation (2140), and test (2140) sets. The vocabulary size (i.e., unique words) of the dataset is 37,505, with 5141 common words in both fake and real news. The dataset contains the post ID, the post, and the corresponding label, which is "fake" or "real" (see Table 1).

Table 1: Some examples of fake and real posts

Label   Post
real    The CDC currently reports 99031 deaths. In general the discrepancies in death counts between different sources are small and explicable. The death toll stands at roughly 100000 people today
5 Methods

In this section, we describe the approaches that we evaluated on the validation data during the validation phase. We used transformer-based models, as they demonstrate state-of-the-art results in most text classification tasks. We also evaluated the empirical effectiveness of a Linear Support Vector Classifier baseline (Linear SVC), different text preprocessing techniques, and adding extra data. The results are shown in the next section.
Our approaches to text preprocessing for transformer-based models are various combinations of the following steps, most of which have been inspired by [18,42]:
– removing or tokenizing hashtags, URLs, emoji, and mentions using a preprocessing library for tweet data written in Python [43]. Tokenizing means replacing URLs, mentions, and hashtags with special tokens, such as $URL$, $MENTION$, and $HASHTAG$ respectively (for example, "HHS to distribute ... hot spots; $340 million already paid out. https://t.co/uAj29XA1Y5" (original) → "HHS to distribute $HASHTAG$ hot spots; $340 million already paid out. $URL$" (tokenizing); "HHS to distribute ... hot spots; $340 million already paid out." (removing));
– using the Python emoji library to replace emoji with short textual descriptions [11], for example :red_heart:, :thumbs_up:, etc.;
– converting hashtags to words ("#COVID" → "COVID");
– converting the text to lowercase.

A small code sketch of such a pipeline is given below, after the model descriptions. In the case of the baseline, we converted the text to lowercase, removed punctuation and special characters, and then lemmatized the words. Further, we converted the texts into a matrix of token counts (a bag-of-words model).

We experimented with the following transformer-based models:

– BERT [9]. BERT is a language representation model presented by Google, which stands for Bidirectional Encoder Representations from Transformers. BERT-based models show great results in many natural language processing tasks. In our work, we used BERT-base-uncased, which is pretrained on texts from Wikipedia.
– RoBERTa [19]. RoBERTa is a robustly optimized BERT approach introduced at Facebook. Unlike BERT, RoBERTa removes the Next Sentence Prediction task from the pretraining process. RoBERTa also uses larger batch sizes and dynamic masking, so that the masked token changes during training instead of the static masking pattern used in BERT. We experimented with RoBERTa-large.
– COVID-Twitter-BERT [26]. CT-BERT is a transformer-based model pretrained on a large corpus of Twitter messages on the topic of COVID-19 collected during the period from January 12 to April 16, 2020. CT-BERT is optimised to be used on COVID-19 content, in particular social media posts from Twitter. This model showed a 10-30% marginal improvement compared to its base model, BERT-large, on five different specialised datasets. Moreover, it was successfully used for a variety of natural language tasks, such as identification of informative COVID-19 tweets [18], sentiment analysis [16], and claims verification [2,45].
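The sketch below illustrates one preprocessing combination from the list above (tokenizing URLs, converting emoji to words, lowercasing) using the tweet-preprocessor [43] and emoji [11] libraries; the function name and example post are illustrative, and this is not the authors' released code.

import emoji
import preprocessor as p  # the tweet-preprocessor library [43]

def preprocess(text: str) -> str:
    # Replace only URLs with the $URL$ token; hashtags and mentions are kept.
    p.set_options(p.OPT.URL)
    text = p.tokenize(text)          # "... https://t.co/xyz" -> "... $URL$"
    text = emoji.demojize(text)      # e.g. a red-heart emoji -> ":red_heart:"
    return text.lower()              # lowercase, matching the uncased models

print(preprocess("HHS to distribute aid to hot spots https://t.co/uAj29XA1Y5"))
# hhs to distribute aid to hot spots $url$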
To improve the quality of our approach, we attempted to add extra data to the model. For this purpose, we used two datasets related to the topic of COVID-19 fake news:
– CoAID: COVID-19 Healthcare Misinformation Dataset [7]. The dataset includes 4251 items of health-related fake news posted on websites and social platforms.
– FakeCovid: A Multilingual Cross-domain Fact Check News Dataset for COVID-19 [37]. The dataset contains 5182 fact-checked news articles on COVID-19 collected from January to May 2020.

In our experiments, we added the news headlines from these datasets to the training set.
We conducted our experiments on Google Colab Pro (CPU: Intel(R) Xeon(R) CPU @ 2.20GHz; RAM: 25.51 GB; GPU: Tesla P100-PCIE-16GB with CUDA 10.1). Each model was trained on the training set for 3 epochs and evaluated on the validation set. As our resources are constrained, we used random seeds to fine-tune pre-trained language models and made attempts to combine them with other parameters. The models are optimised using AdamW [21] with a learning rate of 2e-5 and an epsilon of 1e-8, a maximum sequence length of 128 tokens, and a batch size of 8. We implemented our models using the PyTorch [30] and Hugging Face Transformers [46] libraries.

The Linear SVC was implemented with Scikit-learn [31]. For text preprocessing, we used NLTK [20] and Scikit-learn's CountVectorizer with a built-in list of English stop words and a maximum feature count of 10,000.
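As a concrete illustration of these settings, the following sketch fine-tunes a single CT-BERT classifier with the hyperparameters listed above. The Hugging Face checkpoint name, the toy training data, and the output directory are assumptions for illustration; this is not the authors' released implementation [12].

import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "digitalepidemiologylab/covid-twitter-bert"  # CT-BERT [26], assumed checkpoint name

class PostDataset(torch.utils.data.Dataset):
    # Wraps tokenized posts and integer labels (0 = real, 1 = fake).
    def __init__(self, texts, labels, tokenizer):
        self.encodings = tokenizer(texts, truncation=True,
                                   padding="max_length", max_length=128)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy data for illustration only; the shared-task posts would be loaded here.
train_dataset = PostDataset(["a real example post", "a fake example post"],
                            [0, 1], tokenizer)

training_args = TrainingArguments(
    output_dir="ct-bert-fake-news",    # illustrative output directory
    num_train_epochs=3,                # 3 epochs, as in our setup
    per_device_train_batch_size=8,     # batch size of 8
    learning_rate=2e-5,                # AdamW learning rate
    adam_epsilon=1e-8,                 # AdamW epsilon
    seed=42,                           # one of several seeds used for the ensemble
)

Trainer(model=model, args=training_args, train_dataset=train_dataset).train()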
6 Results

In Table 2, we present the results of our experiments in a step-by-step manner. We started with a Linear SVC baseline and then evaluated BERT-based models using a variety of text preprocessing and extra data techniques. Note that we evaluated our models using the F1-score for the fake class, while the official competition metric was the weighted F1-score.

Table 2: Evaluation results

Model       Data preprocessing                                                   Additional data   F1-score (%, fake class)
Linear SVC  converting into a bag of words                                       no                88.39
BERT        lowercase                                                            no                96.75
RoBERTa     lowercase                                                            no                97.62
RoBERTa     removing hashtags, URLs, emoji + lowercase                           no                95.79
RoBERTa     removing URLs and emoji + converting hashtags to words + lowercase   no                95.68
CT-BERT     lowercase                                                            no                98.07
CT-BERT     tokenizing URLs and mentions + converting emoji to words + lowercase no                97.87
CT-BERT     converting emoji to words + lowercase                                no                98.32
CT-BERT     tokenizing URLs + converting emoji to words + lowercase              no                98.42
CT-BERT     tokenizing URLs + converting emoji to words + lowercase              FakeCovid         98.23
CT-BERT     tokenizing URLs + converting emoji to words + lowercase              CoAID             98.37

As can be seen from the table above, CT-BERT models showed clearly better results than the BERT- and RoBERTa-based classifiers. Our work does not contain a detailed comparative analysis of text preprocessing techniques for this task. Still, we can see that text preprocessing can affect the quality of fake news detection. For example, there was no evidence that removing emoji and mentions improves the model results. A clear benefit of converting hashtags into words could not be identified during this evaluation. Also, as a result of our experiments, we decided not to tokenize links to other users' accounts (mentions): mentions of major news channels like CNN tend to indicate that the post is real. The next stage of the model evaluation was concerned with using additional datasets. We noticed that adding extra data did not show any benefits in our experiments.
As mentioned above, the participants of the shared task were allowed five submissions per team throughout the test phase. Our best model based on the experimental results (see Table 2) included the following preprocessing steps: tokenizing URLs, converting emoji to words, and lowercasing. As final submissions, we used the results of hard-voting ensembles of three such models with different random seed values and different splits of the data into training and validation samples. The final architecture of our solution is shown in Figure 1, and a small sketch of the hard-voting step is given after the figure.

Four of our five submitted models were trained entirely on the dataset provided by the competition organizers [28]. The last model was trained on the competition data and on additional datasets [7,37]. The training details and the results of our final submissions are summarised in Table 3.
Fig. 1. Our approach to COVID-19 fake news detection.
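The following is a minimal sketch of the hard-voting step, under the assumption that each of the three fine-tuned CT-BERT models has already produced a label prediction for every test post; it is not the authors' released code.

from collections import Counter

def hard_vote(predictions_per_model):
    # predictions_per_model: a list of equal-length prediction lists,
    # one per model, with labels such as "fake" / "real".
    ensembled = []
    for labels in zip(*predictions_per_model):
        ensembled.append(Counter(labels).most_common(1)[0][0])  # majority label
    return ensembled

# Three models vote on three posts; the majority label wins for each post.
print(hard_vote([["fake", "real", "fake"],
                 ["fake", "fake", "fake"],
                 ["fake", "real", "real"]]))
# ['fake', 'real', 'fake']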
Table 3: Final submissions

Place among all submissions   Submission name   Weighted F1-score (%)   Training details
1                             g2tmn 2.csv       98.69                   All models were trained on both the training and validation sets with no hold-out validation. We trained 3 models and then used hard voting to ensemble their predictions.
6                             g2tmn 4.csv       98.51                   1000 random posts were used for hold-out validation. Models were trained on all other data. We trained 5 models with random seed values and chose the 3 models with the best F1-scores to ensemble them.
11                            g2tmn 1.csv       98.37                   Models were trained on the official training set. The validation set was used for hold-out validation. We trained 5 models with random seed values and chose the 3 best-performing models to ensemble their predictions.
15                            g2tmn 3.csv       98.32                   This submission was similar to g2tmn 1.csv but with different seed values.
25                            g2tmn 5.csv       98.18                   1000 random posts were used for hold-out validation. Models were trained on all other official data plus the CoAID and FakeCovid data. We trained 5 models with random seed values and used the 3 best-performing models for the ensemble.

It can be seen from the data in Table 3 that our best model achieves a weighted F1-score of 98.69% (with random seeds 23, 30, and 42), which ranked first on the leaderboard of this task.
Error analysis allows us to further evaluate the quality of the machine learning model and conduct a quantitative analysis of errors. Figure 2 provides the confusion matrix for our best solution when detecting fake news about COVID-19 on the test set. As can be seen from the figure, the precision of our system is slightly higher than its recall; in other words, the number of false negatives is greater than the number of false positives. Table 4 shows examples of false negative and false positive errors.

We noticed that the type of error is frequently related to the topic of the post. For example, the model often misclassifies false reports about the number of people infected. At the same time, true posts related to the coronavirus vaccine or to political topics can be identified as false.
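The per-class figures discussed here can be reproduced with scikit-learn [31]; the snippet below is a small illustration with placeholder labels rather than the actual test-set predictions.

from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = ["real", "fake", "fake", "real"]   # placeholder gold labels
y_pred = ["real", "fake", "real", "real"]   # placeholder predictions

# Rows are true labels and columns are predictions, with "fake" listed first.
print(confusion_matrix(y_true, y_pred, labels=["fake", "real"]))

# Precision, recall, and F1 for the fake class only.
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                              labels=["fake"], average=None)
print(f"fake class: precision={p[0]:.2f} recall={r[0]:.2f} F1={f1[0]:.2f}")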
Fig. 2. Confusion matrix of our best-performing ensemble for COVID-19 fake news detection (for the fake class).
Table 4: Some examples of misclassified posts

True label   Prediction   Post
real         fake         Scientists ask: Without trial data how can we trust Russia's
7 Conclusion

In this work, we propose a simple but effective approach to COVID-19 fake news detection based on CT-BERT and ensembling learning. Our experiments confirmed that BERT-based models specialized in the subject area successfully cope with such tasks and perform high-quality binary classification. The experimental results showed that our solution achieved 98.69% of the weighted F1-score on the test data and ranked first in the Constraint@AAAI2021 shared task. For future work, we can experiment with different training and data augmentation techniques. We can also apply and evaluate hybrid models combining BERT-based architectures with other methods of natural language processing [32,39].
References
1. Alam F. et al.: Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society. arXiv preprint arXiv:2005.00033 (2020)
2. Alkhalifa R. et al.: QMUL-SDS at CheckThat! 2020: determining COVID-19 tweet check-worthiness using an enhanced CT-BERT with numeric expressions. arXiv preprint arXiv:2008.13160 (2020)
3. Apuke O. D., Omar B.: Fake news and COVID-19: modelling the predictors of fake news sharing among social media users. Telematics and Informatics, 101475 (2020)
4. Atanasova P. et al.: Overview of the CLEF-2019 CheckThat! Lab: Automatic Identification and Verification of Claims. Task 1: Check-Worthiness. In: CLEF (Working Notes) (2019)
5. Buda J., Bolonyai F.: An Ensemble Model Using N-grams and Statistical Features to Identify Fake News Spreaders on Twitter. In: CLEF (2020)
6. Chernyaev A. et al.: A Rumor Detection in Russian Tweets. In: Speech and Computer, 22nd International Conference, SPECOM 2020, pp. 108-118. Springer, Cham (2020)
7. Cui L., Lee D.: CoAID: COVID-19 Healthcare Misinformation Dataset. arXiv preprint arXiv:2006.00885 (2020)
8. Da San Martino G. et al.: SemEval-2020 task 11: Detection of propaganda techniques in news articles. In: Proceedings of the Fourteenth Workshop on Semantic Evaluation, pp. 1377-1414 (2020)
9. Devlin J. et al.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
10. Elhadad M. K., Li K. F., Gebali F.: COVID-19-FAKES: A Twitter (Arabic/English) dataset for detecting misleading information on COVID-19. In: International Conference on Intelligent Networking and Collaborative Systems, pp. 256-268 (2020)
11. emoji 0.6.0, https://pypi.org/project/emoji/. Last accessed 14 Dec 2020
12. g2tmn at Constraint@AAAI2021 - COVID19 Fake News Detection in English, https://github.com/oldaandozerskaya/covid_news. Last accessed 14 Dec 2020
13. Jwa H. et al.: exBAKE: Automatic Fake News Detection Model Based on Bidirectional Encoder Representations from Transformers (BERT). Applied Sciences 2, 331-351 (2019)
16. Kruspe A. et al.: Cross-language sentiment analysis of European Twitter messages during the COVID-19 pandemic. arXiv preprint arXiv:2008.12172 (2020)
17. Kula S., Choraś M., Kozik R.: Application of the BERT-Based Architecture in Fake News Detection. In: Conference on Complex, Intelligent, and Software Intensive Systems, pp. 239-249. Springer, Cham (2020)
18. Kumar P., Singh A.: NutCracker at WNUT-2020 Task 2: Robustly Identifying Informative COVID-19 Tweets using Ensembling and Adversarial Training. In: Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pp. 404-408 (2020)
19. Liu Y. et al.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
20. Loper E., Bird S.: NLTK: The Natural Language Toolkit. In: Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pp. 63-70 (2002)
21. Loshchilov I., Hutter F.: Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101 (2017)
22. Mazza C. et al.: A nationwide survey of psychological distress among Italian people during the COVID-19 pandemic: Immediate psychological responses and associated factors. International Journal of Environmental Research and Public Health 16, 5850 (2020)
26. Müller M., Salathé M., Kummervold P. E.: COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter. arXiv preprint arXiv:2005.07503 (2020)
27. Nguyen D. Q. et al.: WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets. In: Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pp. 314-318 (2020)
28. Patwa P. et al.: Fighting an Infodemic: COVID-19 Fake News Dataset. arXiv preprint arXiv:2011.03327 (2020)
29. Patwa P. et al.: Overview of CONSTRAINT 2021 Shared Tasks: Detecting English COVID-19 Fake News and Hindi Hostile Posts. In: Proceedings of the First Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation (CONSTRAINT), Springer (2021)
30. Paszke A. et al.: PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 8026-8037 (2019)
31. Pedregosa F. et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825-2830 (2011)
32. Peinelt N., Nguyen D., Liakata M.: tBERT: Topic models and BERT joining forces for semantic similarity detection. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7047-7055 (2020)
33. Pizarro J.: Using N-grams to detect Fake News Spreaders on Twitter. In: CLEF (2020)
34. Rangel F. et al.: Overview of the 8th Author Profiling Task at PAN 2020: Profiling Fake News Spreaders on Twitter. In: CLEF (2020)
35. Reis J. C. S. et al.: Supervised learning for fake news detection. IEEE Intelligent Systems 34, 76-81 (2019)
36. Shaar S. et al.: Overview of CheckThat! 2020 English: Automatic identification and verification of claims in social media. arXiv preprint arXiv:2007.07997 (2020)
37. Shahi G. K., Nandini D.: FakeCovid - A Multilingual Cross-domain Fact Check News Dataset for COVID-19. arXiv preprint arXiv:2006.11343 (2020)
38. Shu K. et al.: Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter 19, 22-36 (2017)
39. Tang L.: UZH at SemEval-2020 Task 3: Combining BERT with WordNet sense embeddings to predict graded word similarity changes. In: Proceedings of the Fourteenth Workshop on Semantic Evaluation, pp. 166-170 (2020)
40. Thorne J. et al.: FEVER: a Large-scale Dataset for Fact Extraction and VERification. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 809-819 (2018)
41. Thorne J. et al.: The FEVER2.0 shared task. In: Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), pp. 1-6 (2019)
42. Tran K. V. et al.: UIT-HSE at WNUT-2020 Task 2: Exploiting CT-BERT for Identifying COVID-19 Information on the Twitter Social Network. In: Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pp. 383-387 (2020)
43. tweet-preprocessor 0.6.0, https://pypi.org/project/tweet-preprocessor/