Accelerating COVID-19 research with graph mining and transformer-based learning
Ilya Tyagin, Ankit Kulshrestha, Justin Sybrandt, Krish Matta, Michael Shtutman, Ilya Safro
Ilya Tyagin
Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE
[email protected]

Ankit Kulshrestha
Computer and Information Sciences, University of Delaware, Newark, DE
[email protected]

Justin Sybrandt ∗
School of Computing, Clemson University, Clemson, SC
[email protected]

Krish Matta
Charter School of Wilmington, Wilmington, DE
[email protected]

Michael Shtutman
Drug Discovery and Biomedical Sciences, University of South Carolina, Columbia, SC
[email protected]

Ilya Safro
Computer and Information Sciences, University of Delaware, Newark, DE
[email protected]
ABSTRACT
In 2020, the White House released the "Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset," wherein artificial intelligence experts are asked to collect data and develop text mining techniques that can help the science community answer high-priority scientific questions related to COVID-19. The Allen Institute for AI and collaborators announced the availability of a rapidly growing open dataset of publications, the COVID-19 Open Research Dataset (CORD-19). As the pace of research accelerates, biomedical scientists struggle to stay current. To expedite their investigations, scientists leverage hypothesis generation systems, which can automatically inspect published papers to discover novel implicit connections. We present the automated general-purpose hypothesis generation systems AGATHA-C and AGATHA-GP for COVID-19 research. The systems are based on graph mining and the transformer model. The systems are massively validated using retrospective information rediscovery and proactive analysis involving human-in-the-loop expert analysis. Both systems achieve high-quality predictions across domains (in some domains up to 0.97 ROC AUC) in fast computational time and are released to the broad scientific community to accelerate biomedical research. In addition, by performing a domain-expert curated study, we show that the systems are able to discover ongoing research findings such as the relationship between COVID-19 and the hormone oxytocin.
Reproducibility:
All code, details, and pre-trained models areavailable at https://github.com/IlyaTyagin/AGATHA-C-GP
CCS CONCEPTS
• Applied computing → Bioinformatics; Document management and text processing; • Computing methodologies → Learning latent representations; Neural networks; Information extraction; Semantic networks.

∗ Now with Google Brain. Contact: [email protected].
KEYWORDS
Hypothesis Generation, Literature-Based Discovery, Transformer Models, Semantic Networks, Biomedical Recommendation
INTRODUCTION

Development of vaccines for COVID-19 is a major triumph of modern medicine and humankind's ability to accelerate scientific research. While we are all hoping to see large-scale positive changes from fast mass adoption of the existing vaccines, there remain significant open research questions around COVID-19. The scientific community has a responsibility to do everything possible to block the ongoing transmission of the dangerous virus and accelerate research to mitigate its consequences. We present the following automated knowledge discovery systems in order to propose new tools that could complement the existing arsenal of techniques to accelerate biomedical and drug discovery research for events like COVID-19.

The COVID-19 pandemic has become one of the most important events in the information space since the end of 2019. The pace of published scientific information is unprecedented and spans all resolutions, from news and pop-science articles to drug design at the molecular level. The pace of scientific research has already been a significant problem in science for years [29], and under the current circumstances this factor becomes even more pronounced. Several thousand papers are added weekly to CORD-19 [39] (the dataset of publications related to COVID-19) and even more to MEDLINE [1]. As a result, groups working on similar problems may not be immediately aware of each other's findings, which can lead to inefficient investments and production delays.

Under normal circumstances, the MEDLINE database of biomedical citations receives approximately 950,000 new papers per year. Currently this database indexes 31 million total citations. This pace challenges traditional research methods, which often rely on human intuition when searching for relevant information. As a result, the demand for modern AI solutions to help with the automated analysis of scientific information is incredibly high.
For instance, the field of drug discovery has explored a range of AI analytical tools to expedite new treatments [12]. Designing lab experiments and finding candidate chemical compounds is a costly and long-lasting procedure, often taking years. To accelerate scientific discovery, researchers came up with a family of strategies to utilize public knowledge from databases like MEDLINE that are available through the National Institutes of Health (NIH), which facilitate automated hypothesis generation (HG), also known as literature-based discovery. Undiscovered public knowledge, information that is implicitly present within the available literature but is not yet explicitly known by an individual who can act on it, represents the target of our work.

Figure 1: Number of new citations per week in the CORD-19 dataset.

Although there are quite a few automated HG systems [12], including those we have previously proposed [35, 37], none of them is currently customized and available in the open domain to massively process COVID-19 related queries. In addition to the traditional general requirements for HG systems, such as high-quality hypotheses, interpretability, and availability to the broad scientific community, COVID-19 data analysis specifically requires: (1) customization of the vocabulary and other logical units such as subject-verb-object predicates; (2) customization of the training data, which in the reality of urgent research contains a lot of controversial and incorrect information; (3) models for different information resolutions; and (4) validation on ongoing domain-specific discovery. Our contribution:
In this work we bridge this gap by releasing AGATHA-C and AGATHA-GP, reliable and easy-to-use HG systems that demonstrate state-of-the-art performance, and validate their inference capabilities on both COVID-19 related and general biomedical data. To make them closely related to different goals of COVID-19 research, they correspond to the micro- (AGATHA-C, for COVID-19) and macroscopic (AGATHA-GP, for general purpose) scales of knowledge discovery. Both systems are able to process any queries connecting biomedical concepts, but AGATHA-C exhibits better results on molecular-scale queries, e.g., those relevant to drug design, and AGATHA-GP works better for general queries, e.g., establishing connections between a certain profession and COVID-19 transmission.

Both systems are the next generation of the AGATHA knowledge network mining transformer model [37]. They substantially improve the quality of the previous AGATHA by introducing a new information layer into the multi-layered semantic knowledge network pipeline and expanding new information retrieval techniques that facilitate inference. We deploy the deep learning transformer model trained with up-to-date datasets and provide an easy-to-use interface to the broad scientific community to conduct COVID-19 research. We validate the system via candidate ranking [36, 37] using very recent scientific publications containing findings absent in the training set. While the original AGATHA demonstrated state-of-the-art performance at the time of its release, AGATHA and other systems were found to perform with notably lower quality on the extremely rapidly changing COVID-19 research. We demonstrate a remarkable improvement in the range of approximately 20-30% (in ROC AUC) on average across different types of queries, with a very fast query process that allows massive validation.
In addition, we demonstrate that the proposed system can identify recently uncovered gene (BST2) and hormone (oxytocin and melatonin) relationships to COVID-19, using only papers published before these connections were discovered.
The CORD-19 dataset [39] was released in response to the world's COVID-19 pandemic to help data science experts and researchers tackle the challenge of answering high-priority scientific questions. It is updated daily and was created by the Allen Institute for AI in collaboration with Microsoft Research, NLM, IBM, and other organizations. At the time of this publication it contains over 400,000 scientific abstracts and over 150,000 full-text papers about coronaviruses, primarily COVID-19.
MEDLINE is an NIH database that includes almost 31 million citations (as of 2021) of scientific papers in biomedicine and related fields. Some of the citations are provided with MeSH (Medical Subject Headings) terms and other metadata. MEDLINE is one of the largest and best-known resources for biomedical text mining.
Hypothesis Generation Systems.
The HG field has been present in the information sciences for several decades. The first notable approach was proposed by Swanson et al. in 1986 [33] and is called the A-B-C model. The concept of the A-B-C model is to discover intermediate (B) terms which occur in titles of publications for both term A (source) and term C (target). In their experiments, Swanson et al. discovered an implicit connection between Raynaud's syndrome (term A) and fish oil (term C) through blood viscosity (term B), which was mentioned in both sets. The hypothesis that fish oil can be used for patients with Raynaud's disease was experimentally confirmed several years later [10]. The key idea of the proposed method is that all fragmented bits of information are explicitly known, but their implicit relationships are what HG systems aim to uncover.

We note the difference between HG and traditional information retrieval. The information retrieval techniques, which represent the vast majority of biomedical literature-based discovery systems, are trained and (what is even more important) validated to retrieve existing information, whereas HG techniques predict undiscovered knowledge and thus must be massively validated on it. HG validation requires training the system strictly on historical data rather than sampling it over the entire time span.

The advances in machine and deep learning have transformed the algorithmics of HG systems (see Sec. 9), which are now able to process much larger information volumes while demonstrating much higher-quality predictions.
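Swanson's A-B-C model can be sketched in a few lines; the titles and the stopword list below are illustrative stand-ins echoing the Raynaud's/fish-oil example, not data from the 1986 study:

```python
# A-B-C literature-based discovery: find B-terms shared by the titles of
# two disjoint literatures (the A-literature and the C-literature).
STOPWORDS = {"and", "in", "the", "of"}  # toy list for the example

def abc_candidates(titles_with_a, titles_with_c):
    """Return intermediate B-terms appearing in titles of both literatures."""
    def term_set(titles):
        return {w.lower() for t in titles for w in t.split()} - STOPWORDS
    return term_set(titles_with_a) & term_set(titles_with_c)

# Hypothetical titles standing in for the two literatures:
a_titles = ["Raynaud syndrome and blood viscosity",
            "Vascular reactivity in Raynaud syndrome"]
c_titles = ["Fish oil reduces blood viscosity",
            "Dietary fish oil and platelet aggregation"]
bridges = abc_candidates(a_titles, c_titles)  # {'blood', 'viscosity'}
```

Real systems replace raw title words with normalized biomedical concepts, but the set-intersection core of the A-B-C idea is exactly this.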
However, the lack of broader applicability of HG systems during the COVID-19 pandemic demonstrates that several major issues exist and require immediate attention:
(1) Most existing HG systems are domain-specific (e.g., gene-disease interactions), which usually means limiting the processed information (e.g., significantly filtering the vocabulary and papers to a specific domain in probabilistic topic modeling [38]);
(2) Proper validation of an HG system remains a technical problem, because multiple large-scale models have to be trained with all heterogeneous data carefully eliminated several years back;
(3) A large number of HG systems are not massively validated at all, except for the rediscovery of very old findings [28] or the demonstration of just a few proactive examples in a humanly curated investigation; and
(4) Interpretability and explainability of generated hypotheses remain a major issue.
The UMLS Metathesaurus [7] is an NIH database containing information about millions of concepts (both medical and general) and their synonyms. The Metathesaurus accumulates information about its entries from more than 200 different vocabularies, allowing concepts from different terminologies to be mapped and connected. It also keeps metadata about the concepts, such as semantic types and their hierarchy. The core unit of information in UMLS is the concept unique identifier, or CUI. A CUI is a codified representation of a specific term, which includes its different atoms (spelling variants or translations of the term into other languages), vocabulary entries, definitions, and other metadata.
SemRep [4] is a software kit developed by NIH for the extraction of semantic predicates (subject-verb-object triples) from a provided corpus. It can also extract entities not involved in any semantic predicate, if the corresponding option is selected. The official example of possible SemRep output is: INPUT = "We used hemofiltration to treat a patient with digoxin overdose that was complicated by refractory hyperkalemia.", OUTPUT = "Hemofiltration-TREATS-Patients; Digoxin overdose-PROCESS_OF-Patients; hyperkalemia-COMPLICATES-Digoxin overdose; Hemofiltration-TREATS(INFER)-Digoxin overdose". SemRep handles word sense disambiguation and maps terms to the corresponding CUIs from the UMLS Metathesaurus.
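The predicate strings in the example above can be illustrated with a minimal parser for the simplified "Subject-RELATION-Object" form (real SemRep full-fielded output is field-delimited and much richer):

```python
def parse_semrep_predicates(text):
    """Split simplified SemRep-style 'Subject-RELATION-Object' predicates,
    assuming semicolon-separated chunks whose parts contain no hyphens."""
    triples = []
    for chunk in text.split(";"):
        parts = chunk.strip().split("-")
        if len(parts) == 3:                 # skip malformed chunks
            triples.append(tuple(parts))
    return triples

output = ("Hemofiltration-TREATS-Patients; "
          "Digoxin overdose-PROCESS_OF-Patients; "
          "hyperkalemia-COMPLICATES-Digoxin overdose")
triples = parse_semrep_predicates(output)
# [('Hemofiltration', 'TREATS', 'Patients'), ...]
```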
ScispaCy [24] is a special version of spaCy maintained by AllenAI, containing spaCy models for processing scientific and biomedical texts. ScispaCy models are trained on different sources, such as PMC-pretrained word2vec representations and the MedMentions entity linking dataset. ScispaCy can handle various NLP tasks, such as NER, dependency parsing, and POS tagging, where it achieves state-of-the-art performance.
SciBERT [6] is a BERT-like pretrained transformer language model trained on full-text scientific papers. Embeddings are learned in a word-piece fashion, which lets them capture relationships not only between words in a sentence, but also between word parts within each word.
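Word-piece tokenization can be sketched with the greedy longest-match-first algorithm used by BERT-family models; the vocabulary here is a toy stand-in, not SciBERT's real ~30k-entry vocabulary:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece: split a word into the longest
    vocabulary pieces, marking non-initial pieces with the '##' prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            cand = word[start:end] if start == 0 else "##" + word[start:end]
            if cand in vocab:
                piece = cand
                break
            end -= 1                     # shrink the candidate and retry
        if piece is None:
            return ["[UNK]"]             # no piece matched at this position
        tokens.append(piece)
        start = end
    return tokens

vocab = {"corona", "##virus", "bio", "##medical"}   # toy vocabulary
pieces = wordpiece_tokenize("coronavirus", vocab)   # ['corona', '##virus']
```

This is why word-piece models can relate "coronavirus" to other "##virus" words even when the full word is rare in training data.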
FAISS [15] is a library for fast approximate clustering and similarity search between dense vectors. It scales to huge datasets that do not fit in RAM and can be used in a distributed fashion. FAISS is used in our pipeline to perform 𝑘-means clustering of PQ-quantized sentence vectors to generate 𝑘-nearest-neighbor edges between similar sentences (nodes) in the knowledge network.

Figure 2: AGATHA multi-layered graph schema.

PTBG [21] (PyTorch BigGraph) is a high-performance graph embedding system allowing distributed training. It was designed to handle large heterogeneous networks containing hundreds of millions of nodes of different types and billions of typed edges. Distributed training is achieved by computing embeddings on disjoint node sets.
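The nearest-neighbor sentence-edge construction can be sketched with brute-force cosine 𝑘-NN; in the actual pipeline FAISS replaces this with PQ-quantized approximate search to scale to hundreds of millions of sentence vectors:

```python
import math

def knn_edges(vectors, k=2):
    """Brute-force k-NN over sentence vectors, returning undirected edges
    as (min_index, max_index) pairs. Illustrative only: quadratic cost."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return num / den
    edges = set()
    for i, v in enumerate(vectors):
        sims = sorted(((cos(v, w), j) for j, w in enumerate(vectors) if j != i),
                      reverse=True)
        for _, j in sims[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

# Two tight clusters of 2D sentence vectors (toy data):
vecs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
edges = knn_edges(vecs, k=1)   # {(0, 1), (2, 3)}
```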
AllenNLP Open Information Extraction. AllenNLP [11] is a powerful library developed by AllenAI that uses a PyTorch backend to provide deep-learning models for various natural language processing tasks. Specifically, AllenNLP Open Information Extraction provides a trained deep bi-LSTM model for extracting predicates from unstructured text. An API is provided for running inference in both single-sentence and batch modes.
We briefly summarize the AGATHA semantic graph construction pipeline; it is described in greater detail in the original paper [37].
Text pre-processing. The input for our system is a corpus of scientific citations from the MEDLINE and CORD-19 datasets. These files contain titles and abstracts for millions of biomedical papers. We filter non-English documents, using the FastText Language Identification model [16] if the language is not provided. After that we split all abstracts into sentences and process all sentences with the ScispaCy library. From each sentence we extract POS-annotated lemmas and entities, and perform 𝑛-gram mining, where 𝑛-grams are composed of frequently co-occurring lemmas. Additionally, we associate all sentences with any relevant metadata, such as the MeSH/UMLS keywords provided along with the citation.

Semantic Graph Construction. We construct a semantic graph containing different types of nodes, namely sentences, entities, coded terms (from UMLS and MeSH), 𝑛-grams, lemmas, and predicates, following the schema depicted in Figure 2. Edges between sentences are induced from the nearest-neighbors network of sentence embeddings. We also include an edge between two sentences that appear sequentially within the same abstract, counting the title as the first sentence. Other edges can be inferred directly from the recorded metadata. For instance, the node representing the entity "COVID-19" is connected to every sentence and predicate that discusses COVID-19.

NLM UMLS implementation. The prior AGATHA semantic network only includes UMLS terms that appear in SemMedDB predicates [18], which is a major limitation. In this work we enrich the "Coded Term" layer by introducing an additional preprocessing phase wherein we run the SemRep tool with the full-fielded output option ourselves on the entire input corpora. This phase is necessary because CORD-19 and the most recent MEDLINE citations are not represented within the slowly updated SemMedDB.
However, we find that we can substantially increase the quality of recovered terms by applying these tools ourselves. By doing that we not only enrich the "Coded Term" semantic network layer, but also introduce a significant number of previously uncovered semantic predicates. This happens because SemMedDB is a cumulative database: its citations were processed over many years with various versions of SemRep and the UMLS releases available at different time periods. To illustrate, consider the following example (PMID: 20109154): "The results showed that V. cholerae O395 and also other related enteric pathogens have the essential CASS components (CRISPR and cas genes) to mediate a RNAi-like pathway."
The current SemRep version extracts the following predicate:
CRISPR-AFFECTS-RNAi, while SemMedDB does not contain any predicates for this sentence. The corresponding paper was published in 2009, but the CRISPR term (C3658200) did not exist in the UMLS Metathesaurus on or before 2012, which is why the CRISPR-involved relation could not be identified at the time this citation was added to SemMedDB.
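The multi-layered graph schema can be pictured as a typed adjacency structure; the node identifiers below (including the CUI) are illustrative assumptions, not values taken from the real AGATHA graph:

```python
from collections import defaultdict

# Typed nodes use a prefix convention: "s:" sentence, "e:" entity,
# "c:" coded term (UMLS/MeSH), "p:" predicate. All names are hypothetical.
graph = defaultdict(set)

def add_edge(u, v):
    """Undirected edge between two typed nodes."""
    graph[u].add(v)
    graph[v].add(u)

# A sentence mentioning COVID-19 is linked to the entity it mentions,
# the UMLS coded term, and a predicate extracted from it; the predicate
# is also linked to the coded term, mirroring the metadata-derived edges.
add_edge("s:pmid123:1", "e:covid_19")
add_edge("s:pmid123:1", "c:C5203670")             # assumed UMLS CUI for COVID-19
add_edge("s:pmid123:1", "p:covid_19:AFFECTS:lung")
add_edge("p:covid_19:AFFECTS:lung", "c:C5203670")
```

Queries then amount to walks over this structure, e.g., sentence → entity → other sentences discussing the same concept.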
Graph Embedding.
We embed our large semantic graph using a heterogeneous technique that captures node similarity through a biased transformed dot product. By explicitly including a bias term for each node, we capture a concept's overall affinity within the network, which is critical for such general terms as "coronavirus." By learning transformations between each pair of node types (e.g., between sentences and lemmas), we enable each type to occupy embedding spaces with differing characteristics. Specifically, we fit an embedding model that optimizes the following similarity measure:

S(u, v) = b_u + b_v + Σ_{i=1}^{d} û_i (v̂_i T_{uv,i}),        (1)

where u, v are nodes in the semantic graph with embeddings û, v̂, the terms b_u, b_v are the learned node biases, and T_uv is the directional transformation vector from nodes of u's type to nodes of v's type.

We use the PTBG heterogeneous graph embedding library to learn d-dimensional embeddings for each node of our large semantic graph. While fitting embeddings (û) and transformation vectors (T_uv), we represent each edge of the semantic graph as two directed edges. These learned values are optimized using softmax loss, where the similarity for one edge is compared against the similarities of 100 negative samples.

Ranking Semantic Predicates (Transformer model). After we obtain embeddings for each node in the semantic graph, we train the AGATHA ranking model. This model is trained to rank published subject-object pairs above randomly composed pairs of UMLS concepts (negative samples). Two coded terms, along with a fixed-size random subsample of predicates containing each term, are input to this model. Graph embeddings for each term and predicate are fed into stacked transformer encoder layers, which apply multi-headed self-attention across the embedding set. The last set of encodings is averaged, and the result is projected to the unit interval, forming a scalar prediction of the input's "plausibility."
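The ranking model above is trained with a margin ranking objective over these scalar plausibility scores; a stdlib sketch of that loss follows (the margin value is a hypothetical choice, not the one used in AGATHA):

```python
def margin_ranking_loss(pos_scores, neg_scores, margin=0.1):
    """Average hinge penalty over all (positive, negative) score pairs:
    each published predicate should outscore each negative sample by at
    least `margin`. Scores are the model's plausibility values in [0, 1]."""
    total, n = 0.0, 0
    for p in pos_scores:
        for q in neg_scores:
            total += max(0.0, margin - p + q)   # zero when p >= q + margin
            n += 1
    return total / n

# Well-separated scores incur no loss:
loss = margin_ranking_loss([0.9, 0.8], [0.2, 0.3])   # 0.0
```

In training, the scores come from the transformer encoder, and gradients of this loss push published pairs above their negative samples.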
Figure 3: Predicate extraction pipeline with a deep-learning-based Open IE system (MEDLINE and CORD-19 → Process Abstracts → AllenNLP Predictor → UMLS Concept Tagging → Semnet Filter → Final Predicates).
Formally, the model to evaluate term pairs is defined as:

f(x, y) = g([x̂, ŷ, x̂'_1, ..., x̂'_k, ŷ'_1, ..., ŷ'_k])
g(X) = sigmoid(M Θ)
M = (1 / |X|) ColSum(E_N(FeedForward(X)))
E_0(X) = X
E_{i+1}(X) = LayerNorm(FeedForward(A(X)) + A(X))
A(X) = LayerNorm(MultiHeadAttention(X) + X),        (2)

where each x' and y' is randomly sampled from the neighborhoods of x and y respectively, and each ·̂ denotes the graph embedding of the given node. Furthermore, Θ represents a free parameter, which is fit along with the parameters internal to each FeedForward and MultiHeadAttention layer, following the standard conventions for each.

The above model is fit using margin ranking loss, where predicates from the training set are compared against a large set of negative samples. Additional details pertaining to the specific optimization choices surrounding this model are present in the work originally proposing this model [37]. We used the
SemRep predicate extraction system in the first system, AGATHA-C, to extract predicates from the abstracts. However, SemRep relies on expert-coded rules and heuristics to extract biomedical relations, leading to significantly fewer predicates for training. Thus, in order to augment the predicates (for the second system, AGATHA-GP) we decided to use a deep-learning-based information extraction system by Stanovsky et al. [31]. Figure 3 shows our overall predicate extraction pipeline.
Abstract Pre-processing. The input for the proposed semantic predicate extraction system is the set of output files generated by the SemRep tool with the full-fielded output option enabled, obtained from the preprocessing stage described in Sec. 3. As mentioned previously, SemRep not only extracts semantic triples, but also maps entities found in the input corpus to their corresponding UMLS concept IDs; this is the data used by the following method. The initial set of records includes the raw sentence texts and the UMLS terms extracted from them, and is augmented throughout the pipeline, making it easier to extract final predicates for downstream training.
Raw Predicate Extraction. We use a pre-trained instance of RnnOIE [31], provided as an API by AllenNLP. The model was trained on the OIE2016 corpus. At a high level, the model aims to learn a joint embedding of individual words and their corresponding Beginning-Inside-Outside (BIO) tags. The output of the model is a probability distribution over the BIO tags. During inference the model selects specific phrases and groups them into ARG0, V, and ARG1 tags. By convention, we treat ARG0 as the subject and ARG1 as the object in a subject-verb-object tuple. To speed up processing and scale it to thousands of abstracts, we leverage model parallelism across different machines and run batch-mode inference on chunks of abstracts. Once the model predictions have been extracted, we gather the phrases with relevant tags into raw predicates and augment them in the record. Subsequent filtering is performed by matching the extracted terms against the UMLS concepts previously detected in the sentence.
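The BIO-to-tuple grouping described above can be sketched as follows; the tokens and tags are a made-up example in the RnnOIE output convention:

```python
def bio_to_predicate(tokens, tags):
    """Group BIO-tagged tokens into ARG0 (subject), V (verb), ARG1 (object).
    Returns None when any of the three roles is missing."""
    spans = {}
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            continue
        role = tag.split("-", 1)[1]          # "B-ARG0" -> "ARG0"
        spans.setdefault(role, []).append(tok)
    if not {"ARG0", "V", "ARG1"} <= spans.keys():
        return None
    return (" ".join(spans["ARG0"]),
            " ".join(spans["V"]),
            " ".join(spans["ARG1"]))

tokens = ["CRISPR", "components", "mediate", "an", "RNAi-like", "pathway"]
tags   = ["B-ARG0", "I-ARG0", "B-V", "B-ARG1", "I-ARG1", "I-ARG1"]
triple = bio_to_predicate(tokens, tags)
# ('CRISPR components', 'mediate', 'an RNAi-like pathway')
```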
Semnet Filtering. Using a general-purpose RnnOIE model has its own challenges. During processing we noted that many raw predicates were either too general or carried too little meaning to be useful for training a prediction model. To overcome this challenge we designed a corrective filter to reduce noise and retain the most useful predicates. We call this filter the semnet filter.

Each UMLS concept has an associated semantic type (e.g., COVID-19 has the associated semantic type dsyn (disease)). This is useful for summarizing a large set of diverse text concepts into a smaller number of categories. We used the semantic-type metadata to construct two networks: a semantic network and a hierarchical network. The semantic network consists of semantic types as nodes, and its edges imply a corresponding direct relation between them. The hierarchical network connects each semantic type to its more general semantic types. For example, the semantic type dsyn (disease) is more generally associated with a biof (biological function) or a pathf (pathological function). To filter a predicate, all edges emanating from the subject's semantic types are computed on a per-predicate basis. These edges also include any specific-general concept relationships. If the object's semantic type is found in the candidate edge set, then we deem the predicate valid. In our experiments, we found that this filtering method significantly eliminates predicates that do not directly pertain to the biomedical domain.
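A minimal sketch of the semnet filter, assuming a toy subset of semantic-network edges and hierarchy (the real filter uses the full UMLS Semantic Network metadata):

```python
# Toy semantic network: direct (subject type, object type) relations,
# plus a specific -> general hierarchy used to expand the candidates.
SEMANTIC_EDGES = {("phsu", "dsyn"),      # pharmacologic substance -> disease
                  ("gngm", "biof")}      # gene or genome -> biologic function
HIERARCHY = {"dsyn": {"biof", "pathf"}}  # disease IS-A biologic/pathologic fn.

def semnet_valid(subj_type, obj_type):
    """Keep a predicate only when the object's semantic type is reachable
    from the subject's type, directly or via specific->general expansion."""
    reachable = {t for s, t in SEMANTIC_EDGES if s == subj_type}
    reachable |= {g for t in list(reachable) for g in HIERARCHY.get(t, set())}
    return obj_type in reachable
```

Under these toy edges, a drug-disease predicate passes while the reversed disease-drug pair is rejected.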
Processing Abstracts at Scale. Building a pipeline that scales to thousands of abstracts is not a trivial task. To extract predicates from the RnnOIE model and extract quality terms of interest, we not only have to contend with running inference on a deep neural network but also with aligning the extracted terms with the entities recognized by SemRep.

Deployment details: The RnnOIE model by Stanovsky et al. uses a deep Bi-LSTM [27] model to learn the joint word embedding and predict the resulting semantic position tags. Since LSTMs are inherently sequential models, the inference time per sentence is considerable. We first tried processing an entire collection of abstracts at once on a cluster of 10 machines, each with 24 CPUs, using the Dask [26] library. The entire process took more than 8 hours. Considering that we had about 100 such collections, this inference time was prohibitively high. To speed up inference, we read each collection once and distributed chunks of abstracts over the machines. This change cut the processing time from over a week to just over 4 days for the MEDLINE corpus. For the CORD-19 corpus the processing time was even faster, at 2 days. The next step was to align the extracted predicates with the
SemRep-recognized biomedical concepts. We achieved this alignment by first building an index of the files containing each specific abstract ID and then processing the RnnOIE predicates with that index. We further optimized the indexing phase by updating the existing index each time we processed more than 𝜏 abstracts. The semnet filter does not introduce additional computational overhead and can process a thousand abstracts in under 1 second. Hence, by parallelizing over "checkpoints" (each containing 30k abstracts), we were able to obtain the most relevant set of predicates in an hour.

Fair validation of HG systems is extremely challenging, as these models are designed to predict novel connections that are unknown even to those who evaluate the system [34]. In addition, even when validated by rediscovering findings using historical data, the process is computationally expensive because multiple models must be trained to understand how many months (or years) back the HG system can predict the findings, which requires careful filtering of the used papers, vocabulary, and other types of data. To present our results in terms of their usefulness for urgent CORD-19-related HG, we use a historical benchmark, conceptually described in [37]. This technique is fully automated and does not require any domain-expert intervention.
Positive samples collection.
We use SemRep and the approach proposed in Sec. 4 to process the most recent CORD-19 citations, which were published after the specific cut date, making sure that the citations are not included in the training set. After that we extract all subject-object pairs from the obtained results and explicitly check that none of these pairs is present in the training set. Pairs mentioned in CORD-19 fewer than twice are filtered out of the validation set; almost all of them are either noisy or represent information that already appears in other pairs (e.g., because of differences in grammar).

We also use the strategy of subdomain recommendation. This strategy works as follows. For each UMLS term we collect its semantic type (part of the metadata provided in the UMLS Metathesaurus) and group all extracted SemRep pairs by the term-pair criterion (the combination of subject and object types). Then we identify the top-20 most common term-pair subdomains and construct the validation set from pairs belonging to these 20 subdomains.
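The positive-sample collection step reduces to counting and set membership; the pair data below is synthetic:

```python
from collections import Counter

def build_positive_set(pairs, training_pairs, min_count=2):
    """Keep subject-object pairs seen at least `min_count` times in the
    held-out CORD-19 slice and absent from the training set."""
    counts = Counter(pairs)
    return {p for p, c in counts.items()
            if c >= min_count and p not in training_pairs}

# Synthetic extracted pairs (subject, object):
pairs = [("oxytocin", "covid_19"), ("oxytocin", "covid_19"),
         ("aspirin", "headache"), ("bst2", "covid_19")]
training = {("aspirin", "headache")}
positives = build_positive_set(pairs, training)
# {('oxytocin', 'covid_19')}: the only pair seen twice and unseen in training
```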
Negative samples generation.
To generate negative samples per domain, random sampling is used: for each positive sample we keep its subject and randomly sample an object belonging to the same semantic type as the object of the source pair. We do this 10 times, thus obtaining 10 negative domain-specific samples for each positive sample. When the validation set is generated, we apply our ranking criterion to it, obtaining a numerical score s ∈ [0, 1] for each sample.

Evaluation metrics.
We propose our approach as a recommendation system, and to report our results we use a combination of the following classification and recommendation metrics.
• Classification metrics: (1) area under the receiver operating characteristic curve (ROC AUC); (2) area under the precision-recall curve (PR AUC).
• Recommendation metrics: (1) top-k precision (P.@k); (2) average precision (AP.@k); and (3) overall reciprocal rank (RR).

We report these numbers per subdomain to better understand how the system performs with respect to specific tasks (e.g., drug repurposing).
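The recommendation metrics have compact definitions; a sketch over a hypothetical ranked list, where 1 marks a positive pair and 0 a negative sample, ordered by model score:

```python
def precision_at_k(ranked_labels, k):
    """P.@k: fraction of true positives among the top-k recommendations."""
    top = ranked_labels[:k]
    return sum(top) / len(top)

def reciprocal_rank(ranked_labels):
    """RR: 1/rank of the first relevant result, 0 if none is relevant."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0

ranked = [0, 1, 1, 0, 1]          # hypothetical model ranking
p_at_2 = precision_at_k(ranked, 2)   # 0.5
rr = reciprocal_rank(ranked)         # 0.5 (first positive at rank 2)
```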
To report results, we provide the performance measures for three AGATHA models trained on the same input data (the MEDLINE corpus and the CORD-19 abstracts dataset):
(1) AGATHA-O: the baseline AGATHA model [37];
(2) AGATHA-C: AGATHA-O with the new UMLS layer and SemRep enrichment;
(3) AGATHA-GP: AGATHA-C with additional deep-learning-extracted and further filtered predicates.

We compare in this particular manner because learning the proposed ranking criterion depends heavily on the quality and number of extracted semantic predicates, as they form the training set for the AGATHA ranking module. At the time of writing, no other general-purpose, publicly available HG system compliant with the three validation criteria, namely (a) the ability to run thousands of queries in a reasonable time, (b) the ability to process COVID-19 related vocabulary, and (c) the ability to operate in multiple domains, was available for comparison.

The performance of both AGATHA-C and AGATHA-GP allows thousands of queries to run in a very short time (on the order of minutes), making validation on a large number of samples possible. Unfortunately, given the current circumstances, large-scale validation for the specific scientific subdomain (COVID-19 related hypotheses) is hard to implement, because a well-established and reliable factual base is still being actively developed and a big historic gap for the vocabulary simply does not exist (e.g., the COVID-19 term is only approximately one year old). We do, however, provide a validation set including 2,736 positive connections extracted from CORD-19 citations added within the time frame from October 28, 2020 to January 21, 2021, comprising 77 thousand abstracts.
Table 1: Graph metrics (M = millions, B = billions).
Node Type     AGATHA-O   AGATHA-C   AGATHA-GP
Sentence      190.6 M    190.6 M    190.6 M
Predicate     24.2 M     36.3 M     38.7 M
Lemma         16.8 M     16.1 M     16.1 M
Entity        41.7 M     43.2 M     43.2 M
Coded Term    538,588    855,351    855,351
n-Grams       212,922    326,864    333,575
Total Nodes   274.1 M    287.4 M    289.8 M
Total Edges   13.52 B    13.5 B     13.53 B

In Table 1, we share basic graph metrics for the AGATHA-O, AGATHA-C, and AGATHA-GP models. The most significant change is observed in the number of semantic predicates and coded terms, which clearly reflects the purpose of the additional preprocessing steps.

In Table 2, we compare the aforementioned models using the metrics described in Sec. 5. We present predicate types with NLM semantic type codes [23] due to space restrictions. Both the AGATHA-C and AGATHA-GP models show significant gains over the AGATHA-O baseline. The areas most problematic for the baseline model (e.g., (Gene) → (Gene), denoted (gngm,gngm)) illustrate this best, showing up to almost a 30 percent advantage in ROC AUC. All of the most popular biomedical subdomains are now covered by the proposed models and show ROC AUC results of at least 0.87. The average ROC AUC value is increased by 0.09.

Our validation strategy involves a large number of many-to-many queries, making the area under the precision-recall curve another very illustrative metric. This is where the newly proposed models show even more drastic improvements over the baseline AGATHA-O. For some subdomains, like (Gene or Genome) → (Gene or Genome) (gngm,gngm) or (Amino Acid, Peptide, or Protein) → (Gene or Genome) (aapp,gngm), we observe that the new models take recommendation performance to a new quality level. The average PR AUC value is increased by 0.16.

The approximate running time with the corresponding types of hardware used is presented in Table 3. Each row corresponds to a stage in the AGATHA-C/AGATHA-GP pipelines.
The columns “M” (machines) and “CPU” show the number of machines and required CPUs, respectively. The column “GPU” indicates whether a GPU was required or optional. For AGATHA training we used two NVIDIA V100 GPUs per machine. The minimal RAM requirements per machine are given in the column “RAM”. The running time of queries is negligible.

The proactive discovery of ongoing research findings is an important component in the validation of hypothesis generation systems [36]. In particular, in the current uncertain situation, when many unintentionally incorrect discoveries are published, validation must include a human-in-the-loop part, even in a limited capacity, as in [2, 30]. To demonstrate the predictive potential of AGATHA-C, we perform a case study on three COVID-19-related novel connections manually selected by the domain expert. These connections were published after the cut-off date, before which all data used in training had been available to download at NIH.

At a low level, all AGATHA models use entity subsampling to calculate the pairwise ranking criteria, which means that the absolute numbers may fluctuate slightly. Thus, to present the numeric scores, each experiment was repeated 100 times to compute the averages and standard deviations that we present in Table 4.

We tested whether AGATHA-C is able to predict compounds potentially applicable to the treatment of COVID-19 and the genes involved in SARS-CoV-2 pathogenesis. Data confirming the cardiovascular protective effects of the hormone oxytocin were published recently [9, 40]. The protective effect is linked to

Table 2: Classification and recommendation quality metrics across recently popular COVID-19-related biomedical subdomains. Labels O, C and GP stand for the AGATHA-O, AGATHA-C and AGATHA-GP models, respectively.
Pair       |    ROC AUC     |     PR AUC     |       RR       |     P.@10      |     P.@100     |     AP.@10     |    AP.@100
           |  O    C    GP  |  O    C    GP  |  O    C    GP  |  O    C    GP  |  O    C    GP  |  O    C    GP  |  O    C    GP
orch:dsyn  | 0.91 0.93 0.92 | 0.47 0.57 0.55 | 1.00 1.00 0.50 | 0.60 0.90 0.70 | 0.48 0.59 0.61 | 0.79 0.88 0.64 | 0.64 0.73 0.71
aapp:dsyn  | 0.90 0.95 0.95 | 0.45 0.58 0.63 | 1.00 0.50 1.00 | 0.60 0.70 0.90 | 0.52 0.56 0.65 | 0.79 0.73 0.98 | 0.57 0.66 0.74
phsu:dsyn  | 0.89 0.93 0.94 | 0.40 0.48 0.57 | 0.50 0.12 1.00 | 0.40 0.20 0.80 | 0.50 0.56 0.69 | 0.56 0.17 0.98 | 0.43 0.49 0.76
orch:orch  | 0.85 0.92 0.91 | 0.47 0.60 0.57 | 1.00 1.00 1.00 | 0.90 0.80 0.70 | 0.51 0.60 0.57 | 1.00 0.99 0.79 | 0.66 0.76 0.71
phsu:phsu  | 0.85 0.90 0.91 | 0.35 0.41 0.47 | 0.33 0.20 1.00 | 0.30 0.50 0.50 | 0.39 0.42 0.47 | 0.40 0.38 0.78 | 0.44 0.49 0.56
orch:phsu  | 0.87 0.93 0.93 | 0.51 0.60 0.57 | 1.00 1.00 1.00 | 0.90 0.80 0.80 | 0.49 0.56 0.52 | 0.91 0.91 0.86 | 0.68 0.72 0.67
fndg:dsyn  | 0.89 0.95 0.94 | 0.46 0.60 0.60 | 1.00 1.00 1.00 | 0.60 0.80 0.80 | 0.56 0.69 0.69 | 0.88 0.80 0.75 | 0.65 0.68 0.72
orch:aapp  | 0.87 0.93 0.93 | 0.57 0.66 0.73 | 1.00 1.00 1.00 | 0.90 0.90 0.90 | 0.48 0.55 0.60 | 0.88 0.98 1.00 | 0.77 0.79 0.84
geoa:spco  | 0.79 0.77 0.93 | 0.32 0.23 0.52 | 1.00 0.50 1.00 | 0.60 0.30 0.60 | 0.39 0.26 0.56 | 0.91 0.51 0.84 | 0.54 0.35 0.64
geoa:idcn  | 0.65 0.81 0.88 | 0.10 0.11 0.28 | 0.05 0.03 0.50 | 0.00 0.00 0.70 | 0.17 0.09 0.25 | 0.00 0.00 0.69 | 0.14 0.06 0.45
topp:dsyn  | 0.90 0.95 0.95 | 0.53 0.66 0.66 | 1.00 1.00 1.00 | 0.90 0.90 0.90 | 0.60 0.77 0.72 | 0.96 0.88 0.95 | 0.72 0.82 0.86
hlca:dsyn  | 0.89 0.96 0.96 | 0.58 0.72 0.72 | 1.00 1.00 1.00 | 0.90 1.00 0.80 | 0.46 0.54 0.56 | 0.88 1.00 0.79 | 0.75 0.79 0.78
gngm:dsyn  | 0.93 0.97 0.96 | 0.47 0.72 0.74 | 0.50 1.00 1.00 | 0.60 0.80 0.90 | 0.48 0.65 0.66 | 0.62 0.82 1.00 | 0.50 0.79 0.82
fndg:humn  | 0.83 0.92 0.91 | 0.38 0.53 0.54 | 1.00 0.50 0.50 | 0.60 0.70 0.80 | 0.45 0.64 0.63 | 0.65 0.69 0.73 | 0.62 0.69 0.77
gngm:gngm  | 0.66 0.88 0.89 | 0.14 0.40 0.41 | 0.10 0.50 1.00 | 0.10 0.60 0.30 | 0.15 0.45 0.44 | 0.10 0.51 0.61 | 0.17 0.49 0.52
dsyn:fndg  | 0.81 0.91 0.92 | 0.31 0.44 0.43 | 0.25 0.50 0.33 | 0.20 0.60 0.60 | 0.42 0.49 0.46 | 0.32 0.55 0.49 | 0.45 0.53 0.51
phsu:fndg  | 0.78 0.91 0.90 | 0.28 0.51 0.47 | 0.50 1.00 1.00 | 0.50 0.50 0.50 | 0.30 0.49 0.46 | 0.54 0.76 0.68 | 0.41 0.62 0.58
dsyn:humn  | 0.80 0.87 0.88 | 0.30 0.40 0.42 | 1.00 0.50 0.20 | 0.70 0.50 0.50 | 0.35 0.49 0.54 | 0.81 0.45 0.40 | 0.56 0.58 0.56
dsyn:dsyn  | 0.86 0.92 0.92 | 0.40 0.50 0.53 | 0.50 1.00 1.00 | 0.60 0.80 0.70 | 0.55 0.65 0.66 | 0.54 1.00 0.86 | 0.55 0.67 0.73
aapp:gngm  | 0.70 0.88 0.87 | 0.19 0.36 0.37 | 0.14 0.33 0.20 | 0.10 0.30 0.30 | 0.24 0.42 0.42 | 0.14 0.29 0.32 | 0.27 0.43 0.47
Mean       | 0.83 0.91
Table 3: Approximate running times and hardware requirements per pipeline stage.

Stage                Time   M      CPU   GPU   RAM
SemRep Processing    2 d    10-28  20+   Opt   N/A
AllenNLP Predicates  3 d    28-40  20+   Opt   N/A
Graph Construction   10 d   30+    20+   Opt   120GB+
Graph Conversion     7 h    1      40+   Opt   1TB+
Graph Embedding      1 d    20     24+   Opt   120GB+
AGATHA Training      22 h   5+     2+    Yes   300GB+
Network Adjacency    1 d    1      40+   Opt   1.5TB+
Table 4: Scores for valid, recently published connections obtained by different AGATHA models. Averages and standard deviations over 100 runs are reported.
                     AGATHA-O      AGATHA-C      AGATHA-GP
COVID-19:Melatonin   0.63 ± 0.03   0.91 ± 0.03   0.78 ± 0.03
COVID-19:Oxytocin    0.75 ± 0.03   0.98 ± 0.02   0.81 ± 0.02
COVID-19:BST2 gene   0.41 ± 0.01   0.88 ± 0.03   0.74 ± 0.03

anti-inflammatory activity of the hormone. For this connection AGATHA-C generated a score of 0.98.

Similarly, we tested the prediction of the effects of another hormone, melatonin. Several publications, starting from November 2020 [3, 8, 13, 43], show the protective effects of melatonin, specifically against COVID-19 neurological complications. The activity was linked to the anti-oxidative effects of melatonin. For this connection AGATHA-C generated a score of 0.91.

Our system also accurately predicted, with a score of 0.88, the involvement of tetherin (BST2). Results published in 2021 [32] show that tetherin restricts the secretion of SARS-CoV-2 viral particles and is downregulated by SARS-CoV-2. Therefore, pharmacological activation of tetherin expression, or inhibition of its degradation, could be a promising direction for the development of SARS-CoV-2 treatments.
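The aggregation protocol behind Table 4 can be sketched as follows. Because entity subsampling makes each evaluation of the ranking criterion slightly stochastic, a query is repeated many times and reported as mean ± standard deviation. The `score_query` function here is an invented stand-in for the real AGATHA ranking call, not the system's API.

```python
import random
import statistics

def score_query(rng):
    # Stand-in for one stochastic evaluation of a query's ranking criterion;
    # modeled (for illustration only) as a clipped Gaussian around 0.98,
    # roughly matching the COVID-19:Oxytocin row of Table 4.
    return min(1.0, max(0.0, rng.gauss(0.98, 0.02)))

def repeated_score(n_runs=100, seed=0):
    """Repeat the query n_runs times; report (mean, sample std)."""
    rng = random.Random(seed)
    runs = [score_query(rng) for _ in range(n_runs)]
    return statistics.mean(runs), statistics.stdev(runs)

mean, std = repeated_score()
print(f"{mean:.2f} +/- {std:.2f}")
```

Reporting the spread alongside the mean makes it clear when two query scores are separated by more than subsampling noise.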
Quality of the information retrieval pipelines. Information retrieval is an important part of any HG pipeline. In order to uncover implicit connections, the system should capture existing explicit connections with as much quality as possible. Given that human knowledge is usually stored in an unstructured manner (e.g., scientific texts), the quality of the systems that process raw textual data, such as those solving the named entity recognition or word sense disambiguation problems, is crucial. We observed that the SemRep system performs better concept and relation recognition when full abstracts, rather than single sentences, are used as input. SemRep also supports optional sortal anaphora resolution, which extracts co-references to entities from neighbouring sentences; this was shown to be useful in [17] and is used in this work.

"Positive" research bias. The absence of published negative research results is a big problem for the HG field. With mostly positive results available, we often have to generate negative examples through some kind of random sampling. These negative samples likely do not adequately represent the real nature of negatively confirmed scientific findings. One of the most important future work directions in the area of HG is to accurately distinguish and leverage positive and negative proposed results.
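The random-sampling workaround described above can be sketched as follows. This is a generic negative-sampling sketch with invented entity names, not the exact sampling scheme used to train the AGATHA ranking module: negatives are drawn uniformly from pairs never observed as positives.

```python
import random

def sample_negatives(positives, entities, k, seed=0):
    """Draw k entity pairs that never appear (in either order) as positives."""
    rng = random.Random(seed)
    observed = set(positives)
    negatives = set()
    while len(negatives) < k:
        pair = tuple(rng.sample(entities, 2))
        if pair not in observed and (pair[1], pair[0]) not in observed:
            negatives.add(pair)
    return negatives

# Invented toy vocabulary and known (positive) connections.
entities = ["oxytocin", "melatonin", "BST2", "ACE2", "tetherin"]
positives = [("oxytocin", "ACE2"), ("BST2", "tetherin")]
print(sample_negatives(positives, entities, 3))
```

The bias noted above is visible even in this sketch: an unobserved pair is treated as negative, although in reality it may simply be an undiscovered positive.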
Domain expert involvement. When any hypothesis generation system is built, one of the first questions its designer should address is the extent to which domain experts are expected to participate in the pipeline. Modern decision-making systems allow a fully automated discovery process (like the AGATHA system), but this may not be sufficient. A domain expert who interfaces with an HG system as a black box may not trust the generated results or know how best to interpret them. The challenge of interpretable hypothesis generation remains a significant barrier to the widespread adoption of these kinds of research tools. For this we advocate using our "structural" learning HG system MOLIERE [35], in which topic modeling and network analytic measures are used to interpret and explain the results.
The nature of input corpora. The question of what should be used as input to a topic-modeling based hypothesis generation system is raised in [34]. Using full-text papers shows an improvement, but the trade-off between run time and output quality was barely justifiable. However, deep learning models have a greater potential for extracting useful information from large input sources and, as demonstrated in our previous work [37], show significant performance advancements. Thus, the question of using full-text papers in deep learning-based hypothesis generation systems should be addressed. Unfortunately, it is currently too computationally expensive for our resources, as the numbers of sentences, and thus of predicates and edges, would be significantly larger.
Knowledge resolution. Our newly proposed systems show that knowledge resolution plays a major role in subdomain recommendation. To increase the scope of model expertise (and the scope of potential applications beyond the biomedical fields), we deliberately incorporate a general-purpose open information extraction system, RnnOIE, into AGATHA-GP. This additional information results in significant gains in broad subdomains like (Geographic Area) → (Idea or Concept) (geoa, idcn). At the same time, we observe that AGATHA-C performs better in "microscopic" biomedical areas, e.g., (Organic Chemical) → (Organic Chemical) (orch, orch), which raises the question of choosing the appropriate model for each specific use case. Although both systems process all types of queries, the general-purpose predicates included in training significantly improve "macroscopic" types of queries.

A number of works have been proposed to organize the CORD-19 literature into a structured knowledge graph for different purposes. For instance, Basu et al. [5] propose ERLKG, a knowledge graph built on CORD-19 with entities corresponding to gene/chemical/disease names and edges forming relations between the concepts. They use a fine-tuned SciBERT model for both entity and relation extraction. The main purpose of their knowledge graph is to predict a link between a given chemical-disease or chemical-protein pair using a trained GCN autoencoder [19]. In similar work, Oniani et al. [25] build a co-occurrence network on a subset of CORD-19, with edges of gene-disease, gene-mutation, or chemical-disease type. The network is then embedded into a latent space using node2vec walks, and link prediction is performed on the nodes by training different classical machine learning algorithms.
A major shortcoming of these approaches is that they limit themselves to specific kinds of entities or relations (or both); as a result, not only is the scope of possible new literature narrowed, but much additional useful knowledge is filtered out of the system. In contrast, our system does not limit itself to specific entity or relation types and is able to capture much more information from the same corpus.

A major motivation for constructing knowledge graphs is to allow medical researchers to re-purpose existing drugs for treating COVID-19. Zhang et al. [42] develop a system that uses combined semantic predications from SemMedDB and CORD-19 (extracted using SemRep) to recommend drugs for COVID-19 treatment. To improve the predications from CORD-19, the authors fine-tune various transformer-based models on a manually annotated internal dataset. Their resulting knowledge graph consists of 131,555 nodes and 2,558,935 edges. Our work, on the other hand, utilizes similar technologies and produces a much bigger graph, with 287,356,836 nodes and 13,500,291,256 edges. Moreover, we do not post-process the relations extracted from SemRep and still achieve a higher ROC metric. Another system, proposed by Martinc et al. [22], uses a fine-tuned SciBERT model to generate contextualized embeddings of CORD-19 articles and, given an initial seed set of targets, proposes possible therapy targets. However, this system is very different from ours, as it treats an entire article as a bag of words and directly trains a word embedding model on CORD-19. It was noted earlier that KinderMiner [20] provides a web-based literature discovery tool and supports COVID-19 queries. Its underlying algorithm is based on a simple keyword co-count between source and target words in a given corpus. While co-counting is a fast and scalable approach, it suffers from a lack of "discrimination", i.e.,
two keywords occurring together more frequently do not always imply a high degree of correlation.

The vastness of the COVID-19 literature also spurred the need for systems that allow researchers and base users alike to get their COVID-19 queries answered. Systems like CKG (Wise et al.) [41] and SciSight (Hope et al.) [14] currently provide this functionality. While we do aim to provide an easy-to-use web framework for medical researchers, the scope of the aforementioned systems is beyond that of our work. Unfortunately, none of the existing systems trained to accept terms related to COVID-19 or SARS-CoV-2 provides open access for massive validation and a fair comparison, or can be tested in multiple domains like AGATHA-C.
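The co-count "discrimination" problem can be made concrete with pointwise mutual information (PMI), one common correction that normalizes co-occurrence by marginal term frequencies. PMI is used here purely for illustration (it is not KinderMiner's actual scoring), and all counts are invented.

```python
import math

def pmi(n_ab, n_a, n_b, n_docs):
    """Pointwise mutual information of two terms over a document collection.

    n_ab: documents containing both terms; n_a, n_b: documents containing
    each term alone; n_docs: total documents.
    """
    return math.log2((n_ab * n_docs) / (n_a * n_b))

N = 10_000
# A very frequent term co-occurs with the query term 400 times, yet less
# often than chance would predict, while a rare term co-occurs only 30
# times but more often than chance: the raw co-count ranks them backwards.
print(pmi(400, 5000, 4000, N))  # ~ -2.32: frequent pair, weak association
print(pmi(30, 5000, 40, N))     # ~ 0.58: rare pair, stronger association
```

A raw co-count would rank the first pair far above the second, even though only the second co-occurs more often than its marginal frequencies predict.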
10 CONCLUSIONS
We present two graph-mining, transformer-based models, AGATHA-C and AGATHA-GP, for micro- and macroscopic scales of queries, respectively, which are designed to help domain experts solve high-priority research problems and accelerate scientific discovery. We perform per-subdomain validation of these new models on a rapidly changing, COVID-19-focused dataset composed of recently published concept pairs and demonstrate that the proposed models achieve state-of-the-art prediction quality. Both models significantly outperform the existing baseline system AGATHA-O. We deploy the proposed models to the broad scientific community and believe that our contribution can raise more interest in prospective hypothesis generation applications.
REFERENCES
Journal of Neuroimmune Pharmacology (2019), 1–15.
[3] Lise Alschuler, Ann Marie Chiasson, Randy Horwitz, Esther Sternberg, Robert Crocker, Andrew Weil, and Victoria Maizes. 2020. Integrative medicine considerations for convalescence from mild-to-moderate COVID-19 disease. Explore (2020).
[4] Patrick Arnold and Erhard Rahm. 2015. SemRep: A repository for semantic mapping. Datenbanksysteme für Business, Technologie und Web (BTW 2015) (2015).
[5] Sayantan Basu, Sinchani Chakraborty, Atif Hassan, Sana Siddique, and Ashish Anand. 2020. ERLKG: Entity Representation Learning and Knowledge Graph based association analysis of COVID-19 through mining of unstructured biomedical corpora. In Proceedings of the First Workshop on Scholarly Document Processing. Association for Computational Linguistics, Online, 127–137. https://doi.org/10.18653/v1/2020.sdp-1.15
[6] Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. SciBERT: Pretrained contextualized embeddings for scientific text. arXiv preprint arXiv:1903.10676 (2019).
[7] Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): Integrating Biomedical Terminology.
[8] Daniel P Cardinali, Gregory M Brown, and Seithikurippu R Pandi-Perumal. 2020. Can Melatonin Be a Potential "Silver Bullet" in Treating COVID-19 Patients? Diseases 8, 4 (2020), 44.
[9] Phuoc-Tan Diep. 2021. Is there an underlying link between COVID-19, ACE2, oxytocin and vitamin D? Medical Hypotheses 146 (2021), 110360.
[10] R. A. DiGiacomo, J. M. Kremer, and D. M. Shah. 1989. Fish-oil dietary supplementation in patients with Raynaud's phenomenon: a double-blind, controlled, prospective study. Am J Med 86, 2 (Feb 1989), 158–164.
[11] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A Deep Semantic Natural Language Processing Platform. arXiv:1803.07640
[12] Vishrawas Gopalakrishnan, Kishlay Jha, Wei Jin, and Aidong Zhang. 2019. A survey on literature based discovery approaches in biomedical domain. Journal of biomedical informatics 93 (2019), 103141.
[13] Ping Ho, Jing-Quan Zheng, Chia-Chao Wu, Yi-Chou Hou, Wen-Chih Liu, Chien-Lin Lu, Cai-Mei Zheng, Kuo-Cheng Lu, and You-Chen Chao. 2021. Perspective Adjunctive Therapies for COVID-19: Beyond Antiviral Therapy. International Journal of Medical Sciences 18, 2 (2021), 314.
[14] Tom Hope, Jason Portenoy, Kishore Vasan, Jonathan Borchardt, Eric Horvitz, Daniel S. Weld, Marti A. Hearst, and Jevin West. 2020. SciSight: Combining faceted navigation and research group detection for COVID-19 exploratory scientific search. arXiv:2005.12668 [cs.IR]
[15] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017).
[16] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651 (2016).
[17] H. Kilicoglu, G. Rosemblat, M. Fiszman, and T. C. Rindflesch. 2016. Sortal anaphora resolution to enhance relation extraction from biomedical literature. BMC Bioinformatics 17 (Apr 2016), 163.
[18] Halil Kilicoglu, Dongwook Shin, Marcelo Fiszman, Graciela Rosemblat, and Thomas C. Rindflesch. 2012. SemMedDB: a PubMed-scale repository of biomedical semantic predications. Bioinform. 28, 23 (2012), 3158–3160. http://dblp.uni-trier.de/db/journals/bioinformatics/bioinformatics28.html
[19] International Conference on Learning Representations (ICLR).
[20] F. Kuusisto, J. Steill, Z. Kuang, J. Thomson, D. Page, and R. Stewart. 2017. A Simple Text Mining Approach for Ranking Pairwise Associations in Biomedical Applications. AMIA Jt Summits Transl Sci Proc
[21] Proceedings of the 2nd SysML Conference. Palo Alto, CA, USA.
[22] Matej Martinc, Blaž Škrlj, Sergej Pirkmajer, Nada Lavrač, Bojan Cestnik, Martin Marzidovšek, and Senja Pollak. 2020. COVID-19 Therapy Target Discovery with Context-Aware Literature Mining. In Discovery Science, Annalisa Appice, Grigorios Tsoumakas, Yannis Manolopoulos, and Stan Matwin (Eds.). Springer International Publishing, Cham, 109–123.
[23] A. T. McCray, A. Burgun, and O. Bodenreider. 2001. Aggregating UMLS semantic types for reducing conceptual complexity. Stud Health Technol Inform 84, Pt 1 (2001), 216–220.
[24] Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019).
[25] David Oniani, Guoqian Jiang, Hongfang Liu, and Feichen Shen. 2020. Constructing co-occurrence network embeddings to assist association extraction for COVID-19 and other coronavirus infectious diseases. Journal of the American Medical Informatics Association 27, 8 (05 2020), 1259–1267.
[26] Matthew Rocklin. 2015. Dask: Parallel Computation with Blocked algorithms and Task Scheduling. In Proceedings of the 14th Python in Science Conference, Kathryn Huff and James Bergstra (Eds.). 130–136.
[27] M. Schuster and K. K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681. https://doi.org/10.1109/78.650093
[28] Neil R Smalheiser. 2017. Rediscovering Don Swanson: The past, present and future of literature-based discovery. Journal of Data and Information Science 2, 4 (2017), 43–64.
[29] Scott Spangler. 2015. Accelerating Discovery: Mining Unstructured Information for Hypothesis Generation. Chapman and Hall/CRC.
[30] Scott Spangler, Angela D Wilkins, Benjamin J Bachman, Meena Nagarajan, Tajhal Dayaram, Peter Haas, Sam Regenbogen, Curtis R Pickering, Austin Comer, Jeffrey N Myers, et al. 2014. Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 1877–1886.
[31] Gabriel Stanovsky, Julian Michael, Luke Zettlemoyer, and Ido Dagan. 2018. Supervised Open Information Extraction. In Proceedings of The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). Association for Computational Linguistics, New Orleans, Louisiana.
[32] Hazel Stewart, Kristoffer H Johansen, Naomi McGovern, Roberta Palmulli, George W Carnell, Jonathan Luke Heeney, Klaus Okkenhaug, Andrew Firth, Andrew A Peden, and James R Edgar. 2021. SARS-CoV-2 spike downregulates tetherin to enhance viral spread. bioRxiv (2021), 2021–01.
[33] Don R Swanson. 1986. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in biology and medicine 30, 1 (1986), 7–18.
[34] Justin Sybrandt, Angelo Carrabba, Alexander Herzog, and Ilya Safro. 2018. Are Abstracts Enough for Hypothesis Generation?. In . 1504–1513. https://doi.org/10.1109/bigdata.2018.8621974
[35] Justin Sybrandt, Michael Shtutman, and Ilya Safro. 2017. MOLIERE: Automatic Biomedical Hypothesis Generation System. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Halifax, NS, Canada) (KDD '17). ACM, New York, NY, USA, 1633–1642. https://doi.org/10.1145/3097983.3098057
[36] Justin Sybrandt, Micheal Shtutman, and Ilya Safro. 2018. Large-Scale Validation of Hypothesis Generation Systems via Candidate Ranking. In . 1494–1503. https://doi.org/10.1109/bigdata.2018.8622637
[37] Justin Sybrandt, Ilya Tyagin, Michael Shtutman, and Ilya Safro. 2020. AGATHA: Automatic Graph Mining And Transformer Based Hypothesis Generation Approach. Association for Computing Machinery, New York, NY, USA, 2757–2764. https://doi.org/10.1145/3340531.3412684
[38] Huijun Wang, Ying Ding, Jie Tang, Xiao Dong, Bing He, Judy Qiu, and David J Wild. 2011. Finding complex biological relationships in recent PubMed articles using Bio-LDA. PloS one 6, 3 (2011), e17243.
[39] Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, K. Funk, Rodney Michael Kinney, Ziyang Liu, W. Merrill, P. Mooney, D. Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D Wade, Kuansan Wang, Christopher Wilhelm, Boya Xie, Douglas M. Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 Open Research Dataset. ArXiv (2020).
[40] Stephani C Wang and Yu-Feng Wang. 2021. Cardiovascular protective properties of oxytocin against COVID-19. Life Sciences (2021), 119130.
[41] Colby Wise, Vassilis N. Ioannidis, Miguel Romero Calvo, Xiang Song, George Price, Ninad Kulkarni, Ryan Brand, Parminder Bhatia, and George Karypis. 2020. COVID-19 Knowledge Graph: Accelerating Information Retrieval and Discovery for Scientific Literature. arXiv:2007.12731 [cs.IR]
[42] Rui Zhang, Dimitar Hristovski, Dalton Schutte, Andrej Kastrin, Marcelo Fiszman, and Halil Kilicoglu. 2020. Drug Repurposing for COVID-19 via Knowledge Graph Completion. arXiv:2010.09600 [cs.CL]
[43] Petra Zimmermann and Nigel Curtis. 2020. Why is COVID-19 less severe in children? A review of the proposed mechanisms underlying the age-related difference in severity of SARS-CoV-2 infections.
Life Sciences (2021), 119130.[41] Colby Wise, Vassilis N. Ioannidis, Miguel Romero Calvo, Xiang Song, GeorgePrice, Ninad Kulkarni, Ryan Brand, Parminder Bhatia, and George Karypis. 2020.COVID-19 Knowledge Graph: Accelerating Information Retrieval and Discoveryfor Scientific Literature. arXiv:2007.12731 [cs.IR][42] Rui Zhang, Dimitar Hristovski, Dalton Schutte, Andrej Kastrin, Marcelo Fiszman,and Halil Kilicoglu. 2020. Drug Repurposing for COVID-19 via Knowledge GraphCompletion. arXiv:2010.09600 [cs.CL][43] Petra Zimmermann and Nigel Curtis. 2020. Why is COVID-19 less severe inchildren? A review of the proposed mechanisms underlying the age-relateddifference in severity of SARS-CoV-2 infections.