Mining Coronavirus (COVID-19) Posts in Social Media
Negin Karisani
Purdue University [email protected]
Payam Karisani
Emory University [email protected]
Abstract
The World Health Organization (WHO) characterized the novel coronavirus (COVID-19) as a global pandemic on March 11th, 2020. Before this, in late January, more specifically on January 27th, while the majority of the infection cases were still being reported in China and on a few cruise ships, we began crawling social media user postings using the Twitter search API. Our goal was to leverage machine learning and linguistic tools to better understand the impact of the outbreak in China. Contrary to our initial expectation of monitoring a local outbreak, COVID-19 rapidly spread across the globe. In this short article we report the preliminary results of our study on automatically detecting positive reports of COVID-19 in social media user postings using state-of-the-art machine learning models.

Introduction

According to a tally by Johns Hopkins University, 566,269 people have tested positive and 25,423 people have died around the globe as of today, March 27th. Approximately a third of the world's population is impacted by COVID-19. The United States became the epicenter of the pandemic on March 26th (yesterday), and New York City, with 23,112 confirmed cases, is the epicenter of the US outbreak. The US House passed a $2 trillion stimulus bill to combat the negative impact of COVID-19 on the country's economy. Despite the devastating global impact of COVID-19, the WHO has announced that the current pandemic could be the first pandemic in human history to be controlled.

The impact of COVID-19 on societies is unprecedented. Numerous countries in Asia and the EU, including Iran, Italy, and Spain, are under lockdown. In the US, states such as California and New York are experiencing the same situation. People are ordered to stay home and are encouraged to practice social distancing. Psychologists advise the residents of the affected areas to practice certain routines to maintain their mental well-being.
With people staying at home more often, the role of the internet as a means of communication has become even more critical. For instance, NextDoor, a hyperlocal social network, recently announced that the daily rate of its active users has increased by 80%.

It has long been known that social networks are effective media for public health monitoring. Despite the well-understood limitations and biases present in conclusions drawn from social media data (Olteanu et al., 2018), they have proven to be invaluable resources (Paul and Dredze, 2017). In this article, we report the preliminary results of our study on automatically mining user postings related to COVID-19 on Twitter. Our goal is to find the extent to which machine learning models can distill user-generated data. As pointed out by previous studies (Karisani and Agichtein, 2018), this can facilitate the related institutions' responses to the outbreak. In the next section, we focus on automatically detecting positive reports of COVID-19 infections in the data that we have been collecting since January 27th, 2020.
Data

We started collecting Twitter data on January 27th, 2020. As of March 26th, we have collected 5,621,048 tweets. We used the Twitter search API to crawl the data, and our search keywords initially included "coronavirus". (This paper is a short version of a longer study.)

Table 1: Statistics of the training and test sets (Count, Negative, and Positive for each).
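As an illustration of the collection process, a keyword search loop with incremental paging and de-duplication by tweet id might look like the sketch below. Here `fetch` is a stand-in for the real Twitter search API call, and all names in this snippet are hypothetical; this is not the crawler used in the study.

```python
from typing import Callable, Dict, List

def crawl(fetch: Callable[[str, int], List[Dict]], keywords: List[str]) -> List[Dict]:
    """Query each keyword, page forward using the largest tweet id seen so
    far (since_id-style paging), and de-duplicate tweets by their id."""
    seen = set()
    collected = []
    for kw in keywords:
        since_id = 0
        while True:
            batch = fetch(kw, since_id)   # stand-in for a search API request
            if not batch:
                break
            for tweet in batch:
                if tweet["id"] not in seen:
                    seen.add(tweet["id"])
                    collected.append(tweet)
            since_id = max(t["id"] for t in batch)  # advance the cursor
    return collected
```

Because different keywords can return the same tweet, the `seen` set is what keeps the collected counts honest.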
Experiments

We begin this section by describing the methods that we implemented; we then briefly discuss the training procedure, and finally report the results.
We included seven methods in our experiments: one classic generative model (Naive Bayes), one classic discriminative model (Logistic Regression), one widely used neural network model (Fasttext), and four models based on the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT). Below we briefly describe each one.

NB. We included the Naive Bayes classifier. We incorporated the MALLET implementation (McCallum, 2002) of this classifier, and used the tweet unigrams and bigrams as features.

LR. We included Logistic Regression as the discriminative counterpart of Naive Bayes. We incorporated the MALLET implementation, and again used the tweet unigrams and bigrams as features.
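To make the unigram-plus-bigram feature setup concrete, the NB baseline can be sketched in plain Python. The whitespace tokenizer and add-one smoothing are our illustrative assumptions; the actual experiments used the MALLET implementation.

```python
from collections import Counter
import math

def featurize(text):
    """Unigram and bigram features, as used for the NB and LR baselines."""
    tokens = text.lower().split()
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

class NaiveBayes:
    """A minimal multinomial Naive Bayes with add-one smoothing."""
    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = featurize(text)
            self.counts[label].update(feats)
            self.vocab.update(feats)
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, text):
        v = len(self.vocab)  # smoothing denominator uses the vocabulary size
        scores = {}
        for c in self.classes:
            score = self.prior[c]
            for f in featurize(text):
                score += math.log((self.counts[c][f] + 1) / (self.totals[c] + v))
            scores[c] = score
        return max(scores, key=scores.get)
```

The bigram features let the model pick up short health-mention cues such as "tested_positive" that unigrams alone would blur.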
Fasttext. We included the neural model introduced in (Joulin et al., 2016). This model is a shallow wide network, capable of updating the input word embeddings during training. We used the pre-trained word2vec vectors (Mikolov et al., 2013) as input features. The learning rate was empirically set to 0.5, and the window size was set to 2.
BERT-BASE. We included the state-of-the-art model introduced in (Devlin et al., 2019). This model is based on a multi-layer transformer encoder (Vaswani et al., 2017). We used the pre-trained base variant, followed by a one-layer fully connected network. We applied the default model settings recommended in (Devlin et al., 2019). We used the PyTorch implementation of BERT introduced in (Wolf et al., 2019).
BERT-Twitter. Since our classification problem is defined over social media posts, we can expect that a model specifically exposed to the language of social media (through the masked language model task) would perform better than regularly pre-trained ones. Thus, we used a corpus of 35 million tweets, collected between 2018 and 2019 through the Twitter streaming API, to further pre-train BERT-BASE. We set the maximum window size to 160 and the batch size to 32 (the rest of the settings were left at the default values, as suggested in (Devlin et al., 2019)), and pre-trained this model for 4.5 million steps, about 5 epochs.
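The further pre-training step relies on BERT's masked language model objective. As a minimal sketch, the standard BERT masking scheme (select roughly 15% of positions; of those, replace 80% with [MASK], 10% with a random token, and leave 10% unchanged) can be written as follows. The function name and the toy vocabulary are illustrative, not taken from our pipeline.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Standard BERT masking: pick ~mask_prob of positions as prediction
    targets; of those, 80% become [MASK], 10% become a random vocabulary
    token, and 10% keep the original token (but are still predicted).
    Returns (masked_tokens, target_positions)."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return masked, targets
```

During pre-training, the model is trained to recover the original tokens at `target_positions` from the corrupted sequence.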
BERT-Corona. We hypothesized that a model that is already familiar with the contexts in which the outbreak is discussed would perform better; thus, we used the COVID-19 related tweets that we collected to further pre-train BERT-BASE. We pre-trained this model for 400K steps, approximately 5 epochs. The pre-training settings were identical to those of BERT-Twitter.

BERT-Corona-BiLSTM. Even though BERT already utilizes a sophisticated attention mechanism, we still experimented with sequence encoding models. Thus, we used a Bidirectional Long Short-Term Memory network (Hochreiter and Schmidhuber, 1997) on top of BERT-Corona, followed by a one-layer fully connected network. We empirically observed that setting the size of the hidden dimensions of the BiLSTM to half of BERT's hidden dimension of 768 (i.e., 384) yields the best performance.

Table 2: F1, Precision, and Recall of the models (NB, LR, Fasttext, BERT-BASE, BERT-Twitter, BERT-Corona, and BERT-Corona-BiLSTM) in the positive class.
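A minimal sketch of the BiLSTM head described above, assuming PyTorch and a 768-dimensional encoder. The class name and the pooling choice (taking the last time step) are our assumptions for illustration; the exact head configuration is not specified here.

```python
import torch
import torch.nn as nn

class BiLSTMHead(nn.Module):
    """BiLSTM + linear classification head over per-token encoder outputs
    (e.g., BERT's 768-dim hidden states), emitting two logits
    (negative/positive report)."""
    def __init__(self, encoder_dim=768, num_classes=2):
        super().__init__()
        # Hidden size is half the encoder dimension (384 for BERT-base);
        # the bidirectional concatenation brings the output back to 768.
        self.bilstm = nn.LSTM(encoder_dim, encoder_dim // 2,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(encoder_dim, num_classes)

    def forward(self, hidden_states):
        out, _ = self.bilstm(hidden_states)   # (batch, seq, encoder_dim)
        return self.classifier(out[:, -1, :])  # logits from the last step
```

In practice this module would sit on top of the frozen-or-fine-tuned encoder and be trained jointly with it.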
We trained Fasttext for 100 iterations. The models based on BERT, i.e., BERT-BASE, BERT-Twitter, and BERT-Corona, were trained for 2 iterations, as these models build on an already pre-trained model. We trained BERT-Corona-BiLSTM for 3 iterations, since it has more parameters than the other BERT-based baselines. For training the models, we used the default optimizers proposed in the references. In none of the experiments did we do any text pre-processing. Since there is randomness in model initialization and dropout regularization, we carried out all of the neural network experiments five times; the results reported in the next section are the averages over these runs.

The task that we defined is a binary classification problem: detecting positive reports of COVID-19 in Twitter data. Since the class distribution is highly skewed, following previous studies (McCreadie et al., 2019), we report the F1, Precision, and Recall of the models in the positive class.
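The positive-class metrics we report can be computed as follows. This is a generic sketch of the standard definitions, not the evaluation code used in the study.

```python
def positive_class_metrics(gold, pred, positive=1):
    """Precision, Recall, and F1 restricted to the positive class, the
    appropriate view when the class distribution is highly skewed."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Note that accuracy is deliberately omitted: with a skewed distribution, a classifier that always predicts the negative class would still score high accuracy while being useless for surveillance.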
Results

Table 2 summarizes the performance results. We see that the baselines based on the pre-trained models show the best results. The experiments validate our hypothesis about the effectiveness of pre-training BERT on domain-specific data: BERT-Corona achieves the best F1 value. By comparing BERT-Twitter and BERT-Corona-BiLSTM, we can also hypothesize that model initialization, through pre-training, can potentially be more effective than increasing model complexity, even though validating this hypothesis requires more comprehensive experiments.

We believe a robust social media surveillance system can be immensely helpful. Although the results are encouraging, there are still many challenges to be addressed in building such a system for COVID-19. Automatically detecting positive reports, or even following up on the mental well-being of patients through their social media posts, can greatly enhance the concerned institutions' efforts to monitor public health and respond in a timely manner.
In this short article we reported the preliminary results of our study on the capability of machine learning models to distill social media posts related to COVID-19; namely, we focused on automatically detecting positive reports of this illness. We constructed a manually annotated dataset, and showed that state-of-the-art classifiers achieve encouraging results. Our pre-trained model and unlabeled data can be accessed through our Github. We will also release our labeled data, along with a more comprehensive analysis, soon.

References
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171-4186. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Payam Karisani and Eugene Agichtein. 2018. Did you really just have a heart attack? Towards robust detection of personal health mentions in social media. In Proceedings of the 2018 World Wide Web Conference, WWW '18, pages 137-146.

Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit.

Richard McCreadie, Cody Buntain, and Ian Soboroff. 2019. TREC incident streams: Finding actionable information on social media. In Proceedings of the 16th International Conference on Information Systems for Crisis Response and Management, Valencia, Spain.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc.

Alexandra Olteanu, Emre Kıcıman, and Carlos Castillo. 2018. A critical review of online social data: Biases, methodological pitfalls, and ethical boundaries. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM '18, pages 785-786, New York, NY, USA. Association for Computing Machinery.

Michael J. Paul and Mark Dredze. 2017. Social monitoring for public health. Synthesis Lectures on Information Concepts, Retrieval, and Services, 9(5):1-183.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998-6008. Curran Associates, Inc.

A. J. Viera and J. M. Garrett. 2005. Understanding interobserver agreement: The kappa statistic. Family Medicine, 37(5):360-363, May.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing.