Pulse of the Pandemic: Iterative Topic Filtering for Clinical Information Extraction from Social Media
Julia Wu, Venkatesh Sivaraman, Dheekshita Kumar, Juan M. Banda, David Sontag
Julia Wu∗, Venkatesh Sivaraman, Dheekshita Kumar, Juan M. Banda, and David Sontag
Dept. of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
Human-Computer Interaction Institute, Carnegie Mellon University
Department of Computer Science, Georgia State University
Abstract
The rapid evolution of the COVID-19 pandemic has underscored the need to quickly disseminate the latest clinical knowledge during a public-health emergency. One surprisingly effective platform for healthcare professionals (HCPs) to share knowledge and experiences from the front lines has been social media (for example, the “#medtwitter” community).
Keywords: data mining, social media, information retrieval, topic modeling, clinical concept extraction, public health surveillance
In May 2020, a retrospective study of over 3,000 patients in a major New York healthcare system found that in around 1% of cases, the disease caused by the novel coronavirus (COVID-19) was associated with ischemic stroke [27]. The result, which was initially described in early April in China and Europe [16, 15] and corroborated by other New York studies [13, 21], quickly became part of a larger story on thrombotic complications of COVID-19 [12]. In the weeks leading up to these articles, however, physicians were already discussing a possible association between cerebrovascular accidents and COVID-19 on Twitter, starting with a small number of users affiliated with Boston-area hospitals on March 17. If conversations like these could be surfaced to physicians around the world as they emerge, they could serve as a focal point for new evidence, suggest directions for clinical research, and accelerate progress toward understanding the disease.

∗ First three authors contributed equally to this research.

In the face of a developing medical situation such as COVID-19, healthcare professionals (HCPs) interact with a range of knowledge sources to provide and share up-to-date, accurate clinical information. Guidance released by public health organizations such as the World Health Organization (WHO) and the US Centers for Disease Control and Prevention (CDC) is considered the most reliable, but is relatively slow to change due to its wide impact. These guidelines are backed by published research and case reports, which also take time to be disseminated due to the need for formalization and peer review. To obtain more up-to-the-minute information, HCPs share insights amongst each other through webinars, hospital-specific channels, and chat groups.
For example, many initial accounts of chilblain-like skin lesions on the toes, a now-famous symptom of COVID-19, were circulated on a WhatsApp group from France [17].

Social media platforms have emerged as important venues for sharing clinical information publicly. In particular, Twitter is a popular option because it is already home to a sizeable physician community [22, 19]. HCPs can opt to discuss medical topics under hashtags such as “#medtwitter.” Finding clinical information on social media, however, can be challenging. For HCPs looking for advice or anecdotes, it may be difficult to find the most relevant authors unless they are already in the right social circles. Furthermore, because of the high volume of non-expert conversation, the terms that one would expect to find in clinically-meaningful information can also be found in mundane and non-expert posts as well as in myths and misinformation [24]. In the case of COVID-19, even within the posts that mention the pandemic, the global impact on everyday life has essentially put “popularity” at odds with medical usefulness.

Framed as an information retrieval problem, our task is to extract and cluster clinically relevant social media posts by reputable authors, who form a tiny minority of the general population. Our method is therefore designed to implicitly derive a set of distinguishing characteristics between relevant and irrelevant text, building upon easily-available indicators such as medical vocabularies and user metadata. While we exclusively consider COVID-19 and Twitter data in this study, the problem formulation and approach can be applied to other healthcare topics or social media platforms.

For the purposes of this study, we define information as clinically relevant if it is technical in nature and intended to help characterize, prevent, diagnose, or treat the disease under consideration.
While some past approaches designate relevance through manual labeling [14, 10], this is difficult to scale due to the time sensitivity of a pandemic and the rapid shifts in conversation topics over time. Therefore, we turn to topic modeling with Latent Dirichlet Allocation (LDA) [5], an unsupervised approach that has also been studied extensively in the literature [11, 8, 29]. However, the level of granularity that has typically been achieved in these approaches is limited by the user’s ability to scan through and interpret topics. Hierarchical topic models may facilitate interpretation by connecting related topics [4, 25], but due to the sheer volume of irrelevant data in social media, we propose that filtering out the undesirable items altogether may be a more robust strategy.

We take an iterative approach to finding the most clinically relevant documents within a dataset. Documents are automatically annotated for clinical concepts using MetaMap [1], which provides an initial approximation of clinical relevance (though prone to false positives [9]). Our method uses topic modeling to associate documents with similar content without supervision, then scores topics based on the relevance of the clinical concepts they contain. The documents are then filtered by their degree of association with the most relevant topics, and the process is repeated. The concept relevance estimates are refined in each iteration, thereby overcoming the noise in concept annotations as the filtering quality improves.

We demonstrate the utility of our method by retrospectively analyzing 1 million automatically extracted COVID-related tweets by HCPs, resulting in a detailed picture of clinical discourse about the disease. We explore the behavior of the proposed pipeline by comparing it to a version that ablates the concept extraction step, producing relevance estimates that highlight different characteristics of the data.
Finally, we simulate the use of our technique during the early stages of a pandemic by analyzing time-limited subsets of the tweet dataset. Our results suggest this method’s potential to efficiently surface useful information to a clinical audience without significant manual analysis, potentially before such information is announced in more formal channels.
In order to surface clinically relevant information in a highly noisy corpus, we develop a method based on two fundamental subroutines: topic modeling using LDA, and relevance filtering based on clinical concepts. Given an initial corpus of documents (tweets), denoted D^(0), we first apply an author-based heuristic (Section 2.1) to obtain a dataset D^(1) that has a considerably higher prior likelihood for clinical relevance. Then, we use D^(1) to produce a series of filtered subsets D^(i) by finding topics, computing a relevance score for each topic, and removing documents containing clinically irrelevant topics. The result of this iterative process (Section 2.3) is a set of highly relevant tweets as well as an interpretable yet granular topic model. A flowchart of this process is shown in Figure 1.

To generate D^(1), we opt to only consider documents by authors who self-identify as HCPs. Social media norms suggest that it should be relatively straightforward to design a highly sensitive classifier for HCPs: audiences typically rely on author information to determine credibility on Twitter [20], so users posting about medical topics are incentivized to display their credentials. Thus, we filter D^(0) for users whose name, handle, or bio contains any of 27 medical titles, professions, or keywords (for example, “MD”, “Dr”, “epidemiolog*”, “public health”). Note that some authors use credentials falsely or in jest, and many credentialed authors post irrelevant content; these users’ documents intentionally pass the heuristic selection, and will be removed in the subsequent relevance filtering process.

For each document in the HCP-authored set D^(1), we preprocess the text using standard natural language processing (NLP) routines such as removing contractions, punctuation, HTML tags, and emoji.
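The author-based heuristic above can be sketched as a simple keyword match over profile fields. Note that the paper's full list of 27 keywords is not reproduced here; only the four examples given in the text are confirmed, and the remaining entries below are illustrative.

```python
import re

# Sketch of the HCP author heuristic. Only "md", "dr", "epidemiolog", and
# "public health" are confirmed by the text; the rest are assumed examples.
HCP_KEYWORDS = ["md", "dr", "epidemiolog", "public health", "physician", "nurse"]
HCP_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in HCP_KEYWORDS) + r")", re.IGNORECASE
)

def is_probable_hcp(name: str, handle: str, bio: str) -> bool:
    """Return True if any profile field matches a medical title or keyword."""
    return any(HCP_PATTERN.search(field or "") for field in (name, handle, bio))
```

As the text notes, this classifier is deliberately high-sensitivity: false positives (joke credentials, credentialed users posting irrelevant content) are tolerated here and removed later by relevance filtering.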
We lemmatize each word using NLTK’s WordNet lemmatizer [3], and remove stopwords and any query terms that were used to generate the original dataset (in our case study data [2], these were words explicitly referring to the coronavirus).

Figure 1: Flow chart of the iterative relevance filtering process.

In addition, we extract clinical concepts from each document using MetaMap [1], a tool that uses symbolic NLP to identify UMLS Metathesaurus medical concepts. For a given piece of clinical text, MetaMap outputs a series of “mentions,” or occurrences of a concept defined by a unique identifier (CUI), a preferred name from UMLS, one or more semantic types (e.g. disease/syndrome or clinical finding), and trigger words (a set of words that triggered the concept match). MetaMap also outputs a relevance score, but this was not used in our protocol because it correlated poorly with our desired criteria on Twitter data. In this study, we reduced computational overhead by only applying MetaMap to tweets that contain at least 4 words; shorter tweets often contained reactions and links, which are not useful for our text analysis purposes.

We perform the following iterative procedure to produce corpus D^(i+1) given D^(i) and D^(i−1) (for i ≥ 1).
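The text preprocessing steps described above can be sketched as follows. The paper uses NLTK's WordNet lemmatizer and full stopword list; to keep this sketch dependency-free, a crude plural-stripping rule stands in for the lemmatizer, and the word lists are small illustrative samples.

```python
import re

# Dependency-free sketch of the preprocessing pipeline. STOPWORDS and
# QUERY_TERMS are illustrative; the real pipeline uses NLTK resources.
STOPWORDS = {"the", "a", "an", "in", "of", "and", "are", "is", "to", "with", "for"}
QUERY_TERMS = {"coronavirus", "covid", "covid19"}  # assumed dataset query terms

def naive_lemmatize(token: str) -> str:
    # crude stand-in for nltk.stem.WordNetLemmatizer().lemmatize(token)
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def preprocess(tweet: str) -> list:
    text = re.sub(r"<[^>]+>", " ", tweet)             # strip HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # drop punctuation/emoji
    tokens = (naive_lemmatize(t) for t in text.split())
    return [t for t in tokens if t not in STOPWORDS and t not in QUERY_TERMS]
```

For example, `preprocess("Strokes in <b>COVID</b> patients!")` yields `["stroke", "patient"]`: the HTML tag, stopword, and query term are removed and the remaining tokens are lemmatized.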
Using the MALLET implementation of LDA [18], we generate a topic model over the M^(i) documents in D^(i), with k topics (we found that k = … provides a good balance of detail and summarizability). This results in a k × M^(i) matrix θ^(i) that encodes topic probabilities: in particular, the value at θ^(i)_{t,m} is the probability that a word in document m was sampled from topic t. As described above, each document is also annotated with a certain number of concepts, which we will denote C(m).

Concepts are scored by estimating the clinical relevance of each concept given its trigger word. More formally, we define relevance as a relationship between two corpora A and B where A ⊆ B:

    Rel(c; A, B) = ( f_A(c)/|A| + ε ) / ( (f_B(c) − f_A(c))/(|B| − |A|) + ε )    (1)

where f_A(c) and f_B(c) are the number of occurrences of trigger word c in corpora A and B respectively, |A| and |B| represent the number of documents in each corpus, and ε is a small number to prevent division by zero.

Intuitively, Eqn. 1 measures how frequently concepts appear in preserved documents (A) as compared to discarded documents (B \ A). This formulation follows naturally from the need for an appropriate reference set, especially in the absence of labeled data. A simpler approach might be to rank concepts by inverse English word frequency, but this tends to artificially elevate alphanumeric codes (often falsely annotated by MetaMap as genes) and suppress everyday words with medical definitions (e.g. “mask,” “vent”). The iterative nature of our protocol affords us the most direct comparison possible, i.e. documents from the same data source that are known to be irrelevant.

The choice of reference set B is flexible: it could be held constant, or it could be set to the previously-generated corpus D^(i−1).
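Eqn. 1 can be transcribed directly as a small function. The representation here is our own: corpora are summarized by trigger-word count dictionaries plus document counts.

```python
# Sketch of Eqn. 1. freq_A / freq_B map trigger words to occurrence counts
# in corpora A and B (with A a subset of B); size_A / size_B are |A| and |B|.
def relevance(c, freq_A, freq_B, size_A, size_B, eps=1e-9):
    """Rel(c; A, B): enrichment of trigger word c in kept corpus A vs. B \\ A."""
    f_a, f_b = freq_A.get(c, 0), freq_B.get(c, 0)
    numer = f_a / size_A + eps                      # rate of c within A
    denom = (f_b - f_a) / (size_B - size_A) + eps   # rate of c within B \ A
    return numer / denom
```

A concept appearing at the same rate inside and outside A scores about 1; scores well above 1 indicate enrichment in the preserved set, which is what the topic-scoring step rewards.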
Empirically, we found that a hybrid of these two approaches results in the most meaningful scores:

    Rel^(i)(c) = { Rel(c; D^(1), D^(0))   if i = 1
                 { Rel(c; D^(i), D^(1))   if i > 1    (2)

In other words, the relevance scores are initialized using the unfiltered set D^(0) as reference, comparing HCP to non-HCP texts. In subsequent iterations, D^(1) serves as a better baseline because of its drastically improved signal-to-noise ratio compared to D^(0). We then hold D^(1) as a fixed baseline, which helps to stabilize the relevance scores over multiple iterations, and avoids honing in on a particular topic at the expense of others.

We also apply this relevance metric as a pre-filter on the MetaMap concepts: any concepts with Rel(c; D^(1), D^(0)) < 1 (the concept occurred less frequently in doctor tweets than non-doctor tweets) are removed. This improves the signal-to-noise ratio from the very beginning, which clarifies the results of the next step.

Each topic is given a score based on the relevance of the concepts in tweets that are associated with it. Given the document-topic probability matrix θ^(i), the score of topic t is

    Score^(i)(t) = ( Σ_{m=1}^{M^(i)} θ^(i)_{t,m} Σ_{c ∈ C(m)} Rel^(i)(c) ) / ( Σ_{m=1}^{M^(i)} θ^(i)_{t,m} )    (3)

This favors topics for which the documents drawn most heavily from the topic also contain highly relevant concepts. Note that we do not directly test for concept relevance among the topic words β^(i); this allows clinically-relevant words that are not annotated by MetaMap (e.g. too new to appear in UMLS) to weigh heavily in topics without penalty.

The topic scores often follow an elbow-shaped curve when plotted in sorted order (see Fig. 1). We designate topics as “relevant” by choosing a threshold τ ∈ [0, 1] and retaining topics that satisfy

    Score^(i)(t) ≥ (S^(i)_max − S^(i)_min) · τ + S^(i)_min    (4)

where S^(i)_max and S^(i)_min are the maximal and minimal topic scores, respectively.
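The topic-scoring and thresholding steps (Eqns. 3 and 4) amount to a relevance-weighted average followed by a range-based cutoff; a minimal sketch, with variable names of our choosing:

```python
import numpy as np

# Sketch of Eqns. 3-4: score topics by the concept relevance of their
# documents, then threshold on the range of scores.
def topic_scores(theta, doc_rel):
    """theta: (k, M) topic-document probability matrix; doc_rel: length-M
    array holding sum_{c in C(m)} Rel(c) for each document m (Eqn. 3)."""
    return (theta @ doc_rel) / theta.sum(axis=1)

def relevant_topics(scores, tau):
    """Indices of topics whose score clears the Eqn. 4 threshold."""
    cutoff = (scores.max() - scores.min()) * tau + scores.min()
    return np.flatnonzero(scores >= cutoff)
```

For instance, with two topics where the first is strongly associated with a high-relevance document, `topic_scores` assigns it a proportionally higher score, and `relevant_topics` with a mid-range `tau` retains only that topic.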
We denote this set of relevant topics R^(i), of size r^(i). We chose a threshold of τ = … throughout; for new datasets, τ could easily be set through an initial topic exploration and inspection of the resulting topics.

Notably, this thresholding scheme allows for variation in how many topics are selected: as the algorithm progresses, the number of topics retained increases with the prevalence of relevant content. The number of topics preserved in each iteration also indicates when the filtering is roughly complete.

Finally, we select the documents that are highly associated with relevant topics. We generate the next corpus D^(i+1) by simply choosing tweets in which the probability of sampling from a relevant topic is greater than a uniform probability over topics, i.e. if Σ_{t ∈ R^(i)} θ^(i)_{t,m} ≥ r^(i)/k. At this point, the filtering process can be terminated if r^(i) is sufficiently high, or the newly-filtered corpus can be passed on to another iteration of topic modeling and relevance filtering.

First, we describe our COVID-19 tweet dataset, which forms a case study for the use of our method. Then we present the results of the method on this data (Sec. 3.2), a validation of the use of concept extraction (Sec. 3.3), and a proof-of-concept for analysis on time-limited datasets (Sec. 3.4).

Table 1: Paraphrased example tweets from the HCP-authored subset of the dataset. (a) GI symptoms (anorexia, diarrhea, vomiting, or abdominal pain) may be earlier signs of novel …
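The document-selection rule above (keep document m if Σ_{t ∈ R^(i)} θ^(i)_{t,m} ≥ r^(i)/k) can be sketched as:

```python
import numpy as np

# Sketch of the document-selection step: keep documents whose total
# probability mass on relevant topics exceeds the uniform level r/k.
def select_documents(theta, relevant, k):
    """theta: (k, M) topic-document probability matrix; relevant: indices of
    the relevant topic set R; returns indices of documents kept for D^(i+1)."""
    mass = theta[relevant].sum(axis=0)   # sum over t in R of theta[t, m]
    return np.flatnonzero(mass >= len(relevant) / k)
```

The uniform baseline r/k makes the rule self-calibrating: as more topics are judged relevant in later iterations, the bar for an individual document rises accordingly.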
We illustrate the use of our method on a publicly-available COVID-19 Twitter dataset [2], comprising over 420 million tweets (as of June 21, 2020) that contain coronavirus-related keywords such as coronavirus, …, and covid19. Using a pre-filtered set with non-English tweets and retweets removed, we retrieved 52.9 million tweets posted between January 8 and June 21, 2020 using the twarc command line tool. Notably, many HCP-authored tweets were part of longer “threads,” or sequences of tweets by the same author, that were not fully covered by D^(0). We decided to expand the dataset in D^(1) by including the complete threads, because the missing tweets often appeared to contain useful clinical information despite not explicitly mentioning COVID keywords. The final HCP-authored tweet set contained 990,756 threads, with 1,078,830 total tweets.

Even after this initial filtering step, the tweets varied dramatically in relevance, as illustrated in Table 1. Some tweets introduce useful information, such as tweet (a) in the table. Many others contain no medical insights (tweet (b)). More subtly, however, tweet (c) contains clinical terms, but does not introduce novel information. Because of the preponderance of tweets like (b) and (c), when we attempted a large topic model (k = …) without relevance filtering, the clinically-relevant topics were vastly outnumbered by irrelevant ones. This validated the need for filtering out tweets with spurious relevance and preferentially surfacing tweets like (a).

We performed three rounds of filtering on the HCP-authored dataset, resulting in a dataset of 107,794 tweets. Roughly 38–40 topics were selected as relevant in each iteration, but 85 topics would have been selected to proceed to the fourth iteration, indicating that most of the irrelevant data had been filtered out by this point. The topics extracted from the third level of filtering (D^(3)), shown in Fig. 2, demonstrate clear clinical relevance.
For example, several topics describe clinical presentations of COVID-19, ranging from the most common symptoms (topic …) to …

Figure 2: Topics extracted from the third level of filtering (D^(3)). The upper section shows the top 40 highest-scoring topics, while the lower section shows the 10 lowest-scoring topics. Both sections are sorted vertically in order of the date of maximum intensity. The heat map colors indicate the popularity of each topic per day, with yellow representing the peak of popularity for the topic.

Table 2: Words and phrases that were designated most (upper half) and least (lower half) relevant according to Eqn. 2. The first column indicates the initial relevance estimates computed after heuristic author filtering; the right two columns list the relevances after three rounds of filtering with and without concept annotations.

         D^(1)                D^(3) with concepts     D^(3) without concepts
    1    medtwitter           cells                   cells
    2    publichealth         lung                    mortality
    3    physicians           hcq                     rate
    4    patients with        trial                   asymptomatic
    5    pts                  patients with           rate of
    6    clinical             severe                  fatality
    7    physician            respiratory             mild
    8    icu                  blood                   growth rate
    9    surgery              antibodies              viral
    10   md                   ards                    ace2

    1    f***                 business                trump
    2    s***                 leadership              business
    3    f***ing              businesses              bbc news
    4    petition             students                bbc
    5    the petition         government              president
    6    rt                   trump                   news coronavirus
    7    sign the petition    crisis                  bbc news coronavirus
    8    democrats            county                  the latest
    9    sign the             pm                      businesses
    10   the petition via     economy                 amid

… of clinical tweets, such as …

To validate our use of MetaMap concept annotations when calculating relevance, we ran a version of our iterative filtering routine that entirely omitted clinical concepts. In other words, the relevance of a topic (Eqn.
3) was computed not as the sum of relevances over concepts, but instead over all words in each tweet (excluding stopwords and words explicitly mentioning the coronavirus). This comparison therefore reflects the marginal benefit of directing the relevance filtering toward words already known to be clinically relevant (subject to MetaMap error).

First, we compared the words that were predicted to be most and least relevant by each method, shown in Table 2. Relevance is calculated using Eqn. 2 for all unigrams, bigrams, and trigrams; this therefore also serves as a measure of which phrases are most “enriched” in the relevant tweet set compared to the irrelevant tweet set. The first column, which shows the enrichment of D^(1) relative to D^(0), suggests that heuristic author filtering alone establishes a fairly strong baseline, independent of the use of clinical concepts. Still, after three rounds of relevance filtering, the top phrases in D^(3) show a marked improvement over D^(1). Using concepts in the filtering process led to an emphasis on clinical content (“lung,” “respiratory,” “ards”), while using all words resulted in a greater proportion of epidemiological content (“mortality,” “asymptomatic,” “growth rate”). This could be because of a natural bias in UMLS toward concepts used in a hospital setting, or because epidemiological topics received the most sustained attention while areas of medical interest shifted. Therefore, while using concepts better suited our objective of extracting clinically-relevant information in this case, omitting the MetaMap step in our pipeline could still result in useful filtering for a different application.

Next, to ensure our method was accurately discriminating relevant and irrelevant tweets, we examined the effect of filtering on several topics known to be either relevant or irrelevant to the pandemic.
With the guidance of a clinician, 12 categories and associated keywords were chosen such that if any two of the keywords were present in a tweet, its relevance could be gauged with relative certainty. For example, any tweet containing both “anosmia” and “dysgeusia” should most likely never be eliminated. Similarly, any tweet containing the names of political leaders is most likely irrelevant. A total of 26,851 tweets were used for this analysis.

Figure 3 shows the proportion of tweets in each category that were retained by the filtering process in each iteration, with the darkest bar representing the proportion kept after three rounds. Based on the aggregation of categories on the right of the figure, using concept extraction leads to an increase in relevant tweets preserved (72.5% with concepts, 55.8% without), and a decrease in irrelevant tweets retained (2.5% with concepts, 5.9% without). Concept filtering performed especially well on the “Drugs” category, likely because the drug mentions were consistently annotated and scored well for relevance. On the other hand, concept-based filtering performed poorly by this test on “Chilblains” and “Anosmia,” both of which are referenced using common words (“toe,” “smell”) that were ranked lower by the concept relevance metric than other clinical terms. While the categories used here are imperfect representations of relevance and irrelevance, concept extraction does provide an advantage in preserving the correct sets of tweets.

If this method were applied in the early stages of a pandemic, it would be important to identify emerging topics within a matter of days. To see if our method could recover nascent topics in this time-limited manner, we split our data into 2-week subsets overlapping by 1 week, then applied two rounds of relevance filtering to each subset. Figure 4 shows two case studies of concepts that surfaced in these time-limited topic models: anosmia (loss of smell) and thrombosis (intravascular blood clots).
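The keyword-pair validation described above (a tweet counts toward a category when at least two of its keywords co-occur) can be sketched as follows. The category keyword sets here are illustrative, built from the examples in the text rather than the clinician's full 12-category list.

```python
# Sketch of the two-keyword category check used for validation. The keyword
# sets are assumptions drawn from examples in the text, not the full list.
CATEGORIES = {
    "anosmia": {"anosmia", "dysgeusia", "smell", "taste"},
    "chilblains": {"chilblains", "toe", "lesions"},
    "politics": {"trump", "president", "election"},
}

def categorize(tokens):
    """Return categories for which at least two keywords appear in tokens."""
    present = set(tokens)
    return [name for name, kws in CATEGORIES.items() if len(present & kws) >= 2]
```

Requiring two co-occurring keywords is what lets category membership serve as a near-certain relevance label: a single common word like “smell” or “toe” is ambiguous, but a pair is rarely coincidental.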
For a comparison to the academic literature, the upper plot shows when these clinical concepts first appeared in CORD-19, a dataset of COVID-related publications [26].

Figure 3: Fraction of tweets preserved by two different models (with and without using clinical concepts for relevance filtering) for each of 12 pre-defined tweet categories, including clinically-relevant subjects (first seven bars) and irrelevant ones (next five). The progressively darker bars represent successive stages of filtering. The parenthesized number indicates the number of tweets that fell into each category. The total fractions for the relevant and irrelevant categories are shown in the right panel.

For topics related to anosmia, the week-to-week microtrends are clearly visible in the time-limited topic models. While anosmia is now known to be a symptom of COVID-19, in March 2020 the connection between anosmia and COVID-19 was not yet well established. In topics that contain anosmia-related keywords from three consecutive time windows (Fig. 4, lower left), words describing loss of smell grow progressively more significant, which tracks with the increasing number of tweets about anosmia during this time period.

Because some emerging conditions have very low tweet counts before they enter the literature, they may not always be isolated as individual topics. For example, words related to thrombosis (Fig. 4, lower right) never become a topic in their own right in the time-limited models, although they were present alongside related conditions. This underscores the importance of looking at the tweets associated with topics of interest, which do contain early insights about thrombosis.

We also found that preliminary clinical information was surfaced in our topic models before its mention in academic publications. For example, the topic model from March 15–28 surfaced several tweets mentioning links between anosmia and COVID (e.g.
“Too many reports of loss of smell, taste or both in people with coronavirus for this to be a coincidence.”, March 23, 2020). Meanwhile, the first academic article describing this phenomenon that appears in CORD-19 was only published at the end of this interval (March 27). As seen in Figure 4 (top), clinical conversations about anosmia and thrombosis were transpiring on Twitter at sufficient levels to manifest in the filtered topic models before academic publications appeared. This suggests our method can be useful in rapidly-changing situations, as it can surface clinically relevant information while it is still being discovered and discussed.

Figure 4: Analysis of two surfaced clinical concepts: “anosmia” and “thrombosis.” The histograms are labeled with examples of mentions in tweets and publications. Topics containing the concept are shown as word clouds; the size of the word correlates to its weight in the topic. A tweet containing the topic is also shown.
This study provides a new look at social media for rapidly-evolving public health situations, and develops a strategy for extracting granular clinical topics without manual labeling. As with all social media applications, a key challenge in information extraction from Twitter is the signal-to-noise ratio: our filtering process reduced the initial COVID dataset to about 0.2% of its original size. Furthermore, because of the pandemic’s impact on everyday life, many tweets can contain medical terminology without necessarily imparting clinically relevant information. Our proposed technique resolves these issues by applying topic modeling and concept extraction in tandem, resulting in progressively better estimates of clinical relevance.

Traditional topic modeling is a valuable first step in understanding the contents of any textual dataset, and we observed that our topic models often naturally grouped together relevant and irrelevant information. However, the larger and more granular our models became, the harder it was to find relevant topics without a ranking strategy (which led to the development of our method). On the other hand, MetaMap concept extraction can identify relevant phrases regardless of their frequency, a useful complement to statistical NLP methods. Using these annotations to rank the topic models allows one to quickly discard irrelevant categories, improving interpretability considerably for larger models. Concept extraction therefore approximates a physician’s prior belief of which topics should be considered relevant, while the topic-document mapping approximates which documents should be considered related.

An iterative approach is key to enabling topic modeling and concept extraction to refine and validate each other’s results.
For example, in early iterations, we observed many tweets that were associated with relevant topics because they contained clinical terminology, but that were not themselves written in a medical “frame.” As these tweets were carried forward into subsequent levels, however, topics became more relevant overall, and tweets that used clinical terminology in a non-medical context became sequestered into their own topics.

Analyzing topic models from HCP-authored tweets opens opportunities for medical professionals to learn from each other and initiate new lines of research. When clinical indicators are highlighted by our method as they emerge, as suggested by the two-week topic models, HCPs in regions with rising cases could browse topics and tweets from previous epicenters to better understand what to expect. Clinicians who were not part of the original conversation but who have observed the emerging symptom could then contribute knowledge to ongoing research, enabling wider information sharing and even expanding access to clinical trials. Although the analysis relies to some degree on the topics already being relatively prevalent on social media, it can still accelerate awareness of these findings outside of the circles in which they are shared.

As with any study involving Twitter data, our analysis is subject to the biases and caveats inherent in social media. The dataset collected here only consisted of English tweets, but it could be more helpful to apply the technique across languages and capture a truly global conversation, perhaps by integrating a multilingual topic model [6]. Even among English-speaking HCPs, this dataset is hardly a representative sample of all medical opinion on the pandemic, since the demographics of physicians on Twitter are inevitably skewed by the demographics of Twitter itself.
Finally, the rapid oscillations of interest in the generated topics suggest that HCPs, like other Twitter users, gravitate toward the most exciting stories at any given time. The degree of intensity of a topic, therefore, will always be an imperfect proxy for true clinical importance.
We present a method that combines topic modeling and clinical concept extraction to surface clinically relevant information from a noisy text dataset. Both topic modeling and concept annotation have their limitations on this type of data; nevertheless, we have shown that iterating between the two techniques can overcome these weaknesses. This is especially important in emergent situations such as COVID-19, where unconventional data sources and unsupervised information extraction are often the only option for rapid analysis. Our pipeline can be applied without modification to social media data on other medical topics, and the iterative filtering technique could be a helpful augmentation to topic models used for preliminary explorations of text data.

In seeking to extract clinical information from social media data, this study tackles a problem space that, to our knowledge, has not previously been explored. Indeed, in normal circumstances the rapid fluctuations of social media trends are antithetical to robust, reliable official guidance. As demonstrated by our analysis, however, HCP-authored tweets about COVID-19 can present a surprisingly bountiful window into the epicenters of a pandemic, and we hope that our method enables future research to gain deeper insights from the physician population on social media. By automatically bringing clinical tweets out of the limited audiences among whom they are shared, newly-emerging clinical observations can be disseminated quickly and clinicians everywhere can contribute to shared knowledge. Just as Twitter data reflects the rapidly-changing world, this method could enable the flexible, real-time analysis that a pandemic demands.
Acknowledgements
The authors would like to thank Dr. Jonathan Bickel (Boston Children’s Hospital, Harvard Medical School) for providing invaluable support and guidance throughout the project. The authors would also like to thank Matthew McDermott for his mentorship and encouragement. This work received funding from a grant from the National Institute on Aging (3P30AG059307-02S1).
References

[1] A. R. Aronson and F.-M. Lang. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3):229–236, 2010.
[2] J. M. Banda, R. Tekumalla, G. Wang, J. Yu, T. Liu, Y. Ding, and G. Chowell. A Twitter dataset of 100+ million tweets related to COVID-19, Mar. 2020. The dataset is updated regularly with additional tweets; see the accompanying GitHub repository.
[3] S. Bird, E. Loper, and E. Klein. Natural Language Processing with Python. O'Reilly Media Inc., 2009.
[4] D. M. Blei, M. I. Jordan, T. L. Griffiths, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS'03, pages 17–24, Cambridge, MA, USA, 2003. MIT Press.
[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, Mar. 2003.
[6] J. Boyd-Graber and D. Blei. Multilingual topic models for unaligned text, 2012.
[7] A. Brynolf, S. Johansson, E. Appelgren, N. Lynoe, and A.-K. Edstedt Bonamy. Virtual colleagues, virtually colleagues—physicians' use of Twitter: a population-based observational study. BMJ Open, 3(7), 2013.
[8] E. Diaz-Aviles, A. Stewart, E. Velasco, K. Denecke, and W. Nejdl. Epidemic intelligence for the crowd, by the crowd (full version), 2012.
[9] D. A. Hanauer, M. Saeed, K. Zheng, Q. Mei, K. Shedden, A. R. Aronson, and N. Ramakrishnan. Applying MetaMap to Medline for identifying novel associations in a large clinical dataset: a feasibility analysis. Journal of the American Medical Informatics Association: JAMIA, 21(5):925–937, 2014.
[10] M. Imran, S. Elbassuoni, C. Castillo, F. Diaz, and P. Meier. Practical extraction of disaster-relevant information from social media. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13 Companion, pages 1021–1024, New York, NY, USA, 2013. Association for Computing Machinery.
[11] I. Kagashe, Z. Yan, and I. Suheryani. Enhancing seasonal influenza surveillance: Topic analysis of widely used medicinal drugs using Twitter data. Journal of Medical Internet Research, 19(9):e315, Sept. 2017.
[12] F. Klok, M. Kruip, N. van der Meer, M. Arbous, D. Gommers, K. Kant, F. Kaptein, J. van Paassen, M. Stals, M. Huisman, et al. Incidence of thrombotic complications in critically ill ICU patients with COVID-19. Thrombosis Research, 191:145–147, 2020.
[13] Y. Li, M. Li, M. Wang, Y. Zhou, J. Chang, Y. Xian, D. Wang, L. Mao, H. Jin, and B. Hu. Acute cerebrovascular disease following COVID-19: a single center, retrospective, observational study. Stroke and Vascular Neurology, 2020.
[14] X. Liu and H. Chen. AZDrugMiner: An information extraction system for mining patient-reported adverse drug events in online patient forums. In Smart Health - International Conference, ICSH 2013, Proceedings, Lecture Notes in Computer Science, pages 134–150, Aug. 2013.
[15] C. Lodigiani, G. Iapichino, L. Carenzo, M. Cecconi, P. Ferrazzi, T. Sebastian, N. Kucher, J.-D. Studt, C. Sacco, B. Alexia, et al. Venous and arterial thromboembolic complications in COVID-19 patients admitted to an academic hospital in Milan, Italy. Thrombosis Research, 191:9–14, 2020.
[16] L. Mao, H. Jin, M. Wang, Y. Hu, S. Chen, Q. He, J. Chang, C. Hong, Y. Zhou, D. Wang, X. Miao, Y. Li, and B. Hu. Neurologic manifestations of hospitalized patients with coronavirus disease 2019 in Wuhan, China. JAMA Neurology, 77(6):683–690, June 2020.
[17] P. R. Massey and K. M. Jones. Going viral: A brief history of chilblain-like skin lesions ("COVID toes") amidst the COVID-19 pandemic. Seminars in Oncology, 2020.
[18] A. K. McCallum. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
[19] R. Mishori, L. O. Singh, B. Levy, and C. Newport. Mapping physician Twitter networks: Describing how they work as a first step in understanding connectivity, information flow, and message diffusion. J Med Internet Res, 16(4):e107, Apr. 2014.
[20] M. R. Morris, S. Counts, A. Roseway, A. Hoff, and J. Schwarz. Tweeting is believing? Understanding microblog credibility perceptions. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, CSCW '12, pages 441–450, New York, NY, USA, 2012. Association for Computing Machinery.
[21] T. J. Oxley, J. Mocco, S. Majidi, C. P. Kellner, H. Shoirah, I. P. Singh, R. A. De Leacy, T. Shigematsu, T. R. Ladner, K. A. Yaeger, M. Skliut, J. Weinberger, N. S. Dangayach, J. B. Bederson, S. Tuhrim, and J. T. Fifi. Large-vessel stroke as a presenting feature of COVID-19 in the young. New England Journal of Medicine, 382(20):e60, 2020. PMID: 32343504.
[22] S. Panahi, J. Watson, and H. Partridge. Social media and physicians: Exploring the benefits and challenges. Health Informatics Journal, 22(2):99–112, 2014.
[23] H. W. Park, S. Park, and M. Chong. Conversations and medical news frames on Twitter: Infodemiological study on COVID-19 in South Korea. J Med Internet Res, 22(5):e18897, May 2020.
[24] L. Singh, S. Bansal, L. Bode, C. Budak, G. Chi, K. Kawintiranon, C. Padden, R. Vanarsdall, E. Vraga, and Y. Wang. A first look at COVID-19 information and misinformation sharing on Twitter, 2020.
[25] A. Smith, T. Hawes, and M. Myers. Hiearchie: Visualization for hierarchical topic models. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pages 71–78, 2014.
[26] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. M. Kinney, Z. Liu, W. Merrill, P. Mooney, D. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. Wade, K. Wang, C. Wilhelm, B. Xie, D. Raymond, D. S. Weld, O. Etzioni, and S. Kohlmeier. CORD-19: The COVID-19 open research dataset. ArXiv, 2020.
[27] S. Yaghi, K. Ishida, J. Torres, B. M. Grory, E. Raz, K. Humbert, N. Henninger, T. Trivedi, K. Lillemoe, S. Alam, M. Sanger, S. Kim, E. Scher, S. Dehkharghani, M. Wachs, O. Tanweer, F. Volpicelli, B. Bosworth, A. Lord, and J. Frontera. SARS-CoV-2 and stroke in a New York healthcare system. Stroke, 51(7):2002–2011, 2020.
[28] Y. T. Yang, M. Horneffer, and N. DiLisio. Mining social media and web searches for disease detection. Journal of Public Health Research, 2(1):e4, May 2013.
[29] S. Yousefinaghani, R. Dara, Z. Poljak, T. M. Bernardo, and S. Sharif. The assessment of Twitter's potential for outbreak detection: Avian influenza case study. Scientific Reports, 9(1), 2019.