Tracking Short-Term Temporal Linguistic Dynamics to Characterize Candidate Therapeutics for COVID-19 in the CORD-19 Corpus
James Powell [email protected] Los Alamos National Laboratory, Los Alamos, New Mexico, USA
Kari Sentz [email protected] Los Alamos National Laboratory, Los Alamos, New Mexico, USA
ABSTRACT
Scientific literature tends to grow as a function of funding and interest in a given field. Mining such literature can reveal trends that may not be immediately apparent. The CORD-19 corpus represents a growing corpus of scientific literature associated with COVID-19. We examined the intersection of a set of candidate therapeutics identified in a drug-repurposing study with temporal instances of the CORD-19 corpus to determine if it was possible to find and measure changes associated with them over time. We propose that the techniques we used could form the basis of a tool to pre-screen new candidate therapeutics early in the research process.
ACM Reference Format:
James Powell and Kari Sentz. 2021. Tracking Short-Term Temporal Linguistic Dynamics to Characterize Candidate Therapeutics for COVID-19 in the CORD-19 Corpus. In Proceedings of SenSys 2020: 18th ACM Conference on Embedded Networked Sensor Systems (SenSys 2020). ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
Diachronic word analysis is a Natural Language Processing (NLP) technique for characterizing the evolution of words over time. Often used for historical linguistic studies, it can also be applied to scientific literature [Tshitoyan et al. 2019] and can reveal early evidence of scientific discoveries before they become widely known.
Drug-repurposing studies aim to identify existing drugs that might be useful in treating other diseases. The availability of large amounts of data about drugs and infectious agents such as viruses has enabled such studies to be performed in silico. In early 2020, a number of repurposing studies were undertaken to identify potential treatments for COVID-19.
The CORD-19 corpus [Wang et al. 2020] was established in March 2020 as a repository for research related to SARS-CoV-2 and other coronaviruses. It aggregates content from PubMed, bioRxiv, medRxiv, and other sources, and it is updated with new publications on a regular basis. Figure 1 illustrates the growth of CORD-19 through mid-2020.
SenSys 2020, November 16-19, 2020, Yokohama, Japan. © 2021 Association for Computing Machinery. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00 https://doi.org/10.1145/nnnnnnn.nnnnnnn
Figure 1: Weekly growth of the CORD-19 corpus
Using CORD-19, we conducted a diachronic survey of candidate therapeutics identified in one of the more exhaustive drug-repurposing studies conducted to date for COVID-19, undertaken in February 2020 at Oak Ridge National Laboratory. This study, detailed in [Smith and Smith 2020], analyzed drugs in the SWEETLEAD database for potential antiviral properties. The study produced a dataset identifying over 9,000 existing approved drugs and supplements as potential candidate therapeutics for COVID-19. Most importantly for our purposes, this dataset included commonly used drug or supplement names for each candidate.
Our survey considered the following questions:
• How many candidate therapeutics appear in CORD-19?
• Do references to the candidates change over time?
Using temporal snapshots of CORD-19 spanning March 13 to June 30, we computed frequency and semantic representations for each candidate therapeutic found in the corpus. For each temporal instance, we computed a TF/IDF (Term Frequency/Inverse Document Frequency) score for each candidate, a common metric for evaluating the relative importance of terms in a corpus.
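The per-snapshot frequency step can be sketched as follows. This is a minimal illustration and not the implementation used in the study: the whitespace tokenizer, the idf = log(N / df) weighting, the averaging of per-document scores into one value per candidate, and the toy documents are all assumptions made for the example.

```python
import math


def tfidf_scores(documents, candidates):
    """Mean TF/IDF of each candidate over one corpus snapshot.

    Assumptions (not from the paper): whitespace tokenization, relative
    term frequency, idf = log(N / df), and averaging over all documents.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    scores = {}
    for term in candidates:
        # Document frequency: number of documents that mention the term.
        df = sum(1 for tokens in tokenized if term in tokens)
        if df == 0:
            continue  # candidate not detected in this snapshot
        idf = math.log(n_docs / df)
        tf_sum = sum(tokens.count(term) / len(tokens)
                     for tokens in tokenized if tokens)
        scores[term] = tf_sum * idf / n_docs
    return scores


# Hypothetical toy snapshots standing in for two temporal instances.
march_docs = [
    "ivermectin inhibits viral replication",
    "spike protein binds ace2 receptor",
    "patients received supportive care",
]
june_docs = march_docs + [
    "ivermectin trial reports reduced viral load",
    "acetazolamide used for altitude sickness",
]
candidates = ["ivermectin", "acetazolamide"]

march_scores = tfidf_scores(march_docs, candidates)
june_scores = tfidf_scores(june_docs, candidates)
```

A candidate that is absent from a snapshot simply has no entry in the result, so comparing the key sets of `march_scores` and `june_scores` also shows which candidates were newly detected.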
Figure 2: TF/IDF and cosine embedding distances
To perform semantic analysis, we first computed diachronic word embeddings for each temporal instance of the corpus. These embeddings were aligned with one another to ensure that terms from each temporal instance were comparable. The technique we used is based on TWEC [3]. It uses a negative sampling optimization of softmax to maximize the probability that the set of context words $\gamma(w_k)$ surrounding word $w_k$ is representative of its context in time ($C^t$), when multiplied by the mean of atemporal word embedding vectors from $u$ (the compass) for the same set of context words around $w_k$.
$$\max_{C^t} \log P(w_k \mid \gamma(w_k)) = \sigma\left(\vec{u}_k \cdot \vec{c}^{\,t}_{\gamma(w_k)}\right)$$
Since the TWEC embedding model did not account for phrases, we incorporated an additional step to identify them. Phrases (including drug names) were then specially encoded to allow them to be treated like words. Figure 2 shows TF/IDF versus the mean embedding distance to the compass for candidate therapeutics.
Because diachronic embedding instances were aligned with one another, we were able to isolate a given candidate and visualize its semantic trajectory over time (Figure 3). As the trajectory is based on nearest neighbors at a given time, subtle changes in semantic associations become apparent [Stewart et al. 2017].
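Once the temporal embeddings share a vector space with the compass, both quantities discussed above reduce to cosine comparisons. The sketch below is illustrative only: the two-dimensional vectors, the term list, and the function names are hypothetical, and no embedding model is trained here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mean_distance_to_compass(term, slices, compass):
    """Average cosine distance between a term's temporal vectors and its compass vector."""
    distances = [1.0 - cosine_similarity(vectors[term], compass[term])
                 for vectors in slices.values() if term in vectors]
    return sum(distances) / len(distances)


def nearest_neighbors(term, vectors, k=2):
    """The k terms most similar to `term` within one temporal slice."""
    others = [t for t in vectors if t != term]
    others.sort(key=lambda t: cosine_similarity(vectors[term], vectors[t]),
                reverse=True)
    return others[:k]


# Hypothetical aligned embeddings: an atemporal compass plus two time slices.
compass = {"acetazolamide": [1.0, 0.0], "diuretic": [0.9, 0.1],
           "altitude": [0.8, 0.3], "covid": [0.0, 1.0]}
slices = {
    "2020-03-13": dict(compass),
    "2020-06-30": {**compass, "acetazolamide": [0.0, 1.0]},
}

mean_dist = mean_distance_to_compass("acetazolamide", slices, compass)
trajectory = {date: nearest_neighbors("acetazolamide", vectors)
              for date, vectors in slices.items()}
```

In this toy data the candidate's two closest terms change between the slices, which is the kind of movement a semantic trajectory plot is meant to expose.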
Figure 3: Semantic trajectory of 'acetazolamide' (rank -2.4). Nodes along the path represent the candidate embedding vector and its two closest terms at time $t$
We detected 14% (1267) of the candidate therapeutics in CORD-19 on March 13, increasing to 26% (2361) by June 30. For candidates detected in multiple adjacent temporal instances of the corpus, we were able to measure their changes over time. We found that many candidates exhibited increases in frequency, and stable or strengthening semantic associations. However, given the nature of this corpus, we suspected some would exhibit other kinds of change over time.
We found that some candidate therapeutics exhibited different patterns of semantic associations. Using heatmap visualizations as described in [Xu and Crestani 2017], we can illustrate two additional recurring patterns of behavior. Some candidates exhibited weakening semantic associations over time (Figure 4), while others exhibited an abrupt persistent shift to a different pattern (Figure 5). Additionally, we found that these changes were not strongly correlated with changes to a target's frequency scores.
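One plausible way to build the matrix behind such a heatmap is to fix a set of reference terms and record the target's cosine similarity to each of them in every temporal slice. The sketch below assumes that construction, which is not necessarily the one used in [Xu and Crestani 2017] or in this study; the vectors, term names, and helper functions are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def similarity_matrix(target, reference_terms, slices):
    """Rows are reference terms, columns are time slices in date order."""
    dates = sorted(slices)
    return {
        term: [round(cosine_similarity(slices[d][target], slices[d][term]), 3)
               for d in dates]
        for term in reference_terms
    }


def net_change(row):
    """Last value minus first value: negative means a weakening association."""
    return row[-1] - row[0]


# Hypothetical aligned embeddings in which the target drifts away from
# "treatment" and toward "trial" across three snapshots.
slices = {
    "2020-03-13": {"drug": [1.0, 0.0], "treatment": [1.0, 0.0], "trial": [0.0, 1.0]},
    "2020-05-01": {"drug": [1.0, 1.0], "treatment": [1.0, 0.0], "trial": [0.0, 1.0]},
    "2020-06-30": {"drug": [0.0, 1.0], "treatment": [1.0, 0.0], "trial": [0.0, 1.0]},
}

matrix = similarity_matrix("drug", ["treatment", "trial"], slices)
changes = {term: net_change(row) for term, row in matrix.items()}
```

Each row of `matrix` can be rendered as one band of a heatmap; a gradual decline along a row corresponds to a weakening association, while a sudden jump that persists in later columns corresponds to an abrupt shift.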
Figure 4: Example of weakening semantic associations for the candidate therapeutic ivermectin
Figure 5: Example of disrupted semantic associations for the candidate therapeutic acetazolamide
Our diachronic survey of candidate therapeutics for COVID-19 in the CORD-19 corpus found that some exhibited weakening or abrupt changes of semantic associations. We speculate that this could be related to the publication of new research that positively or negatively affected consideration of a candidate therapeutic as a treatment for COVID-19. Future work will investigate how to detect and quantify these patterns, and determine whether there are any correlations between a target's rank and magnitude of change.
REFERENCES
Micholas Smith and Jeremy C. Smith. 2020. Repurposing Therapeutics for Covid-19. https://doi.org/10.26434/chemrxiv.11871402.v3
Ian Stewart, Dustin Arendt, Eric Bell, and Svitlana Volkova. 2017. Measuring, predicting and visualizing short-term change in word representation and usage in VKontakte social network. arXiv preprint arXiv:1703.07012 (2017).
Vahe Tshitoyan, John Dagdelen, Leigh Weston, Alexander Dunn, Ziqin Rong, Olga Kononova, Kristin A. Persson, Gerbrand Ceder, and Anubhav Jain. 2019. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571 (July 2019), 95–98. https://doi.org/10.1038/s41586-019-1335-8
Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, et al. 2020. CORD-19: The Covid-19 Open Research Dataset. arXiv:2004.10706v2
Zaikun Xu and Fabio Crestani. 2017. Temporal Semantic Analysis and Visualization of Words. In