A Review of Biomedical Datasets Relating to Drug Discovery: A Knowledge Graph Perspective


Stephen Bonner, Ian P Barrett, Cheng Ye, Rowan Swiers, Ola Engkvist, Andreas Bender, William Hamilton

Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
School of Computer Science, McGill University, Montreal, Canada
Mila - Quebec AI Institute, Montreal, Canada

Abstract

Drug discovery and development is an extremely complex process, with high attrition contributing to the costs of delivering new medicines to patients. Recently, various machine learning approaches have been proposed and investigated to help improve the effectiveness and speed of multiple stages of the drug discovery pipeline. Among these techniques, it is especially those using Knowledge Graphs that are proving to have considerable promise across a range of tasks, including drug repurposing, drug toxicity prediction and target gene-disease prioritisation. In such a knowledge graph-based representation of drug discovery domains, crucial elements including genes, diseases and drugs are represented as entities or vertices, whilst relationships or edges between them indicate some level of interaction. For example, an edge between a disease and drug entity might represent a successful clinical trial, or an edge between two drug entities could indicate a potentially harmful interaction.

In order to construct high-quality and ultimately informative knowledge graphs, however, suitable data and information are of course required. In this review, we detail publicly available primary data sources containing information suitable for use in constructing various drug discovery focused knowledge graphs. We aim to help guide machine learning and knowledge graph practitioners who are interested in applying new techniques to the drug discovery field, but who may be unfamiliar with the relevant data sources. The chosen datasets are selected via strict criteria, categorised according to the primary area of biological information contained within, and are considered based upon what type of information could be extracted from them in order to help build a knowledge graph. To help motivate the study, a series of case studies of successful applications of knowledge graphs in drug discovery is presented. We also detail the existing pre-constructed knowledge graphs that have been made available for public access, which could serve as potential machine learning benchmarks, as well as starting points for further task-specific graph composition enrichments. Additionally, throughout the review, we raise the numerous and unique challenges and issues associated with the domain and its datasets – for example, the inherent uncertainty within the data, its constantly evolving nature and the various forms of bias in many sources. Overall we hope this review will help motivate more machine learning researchers to explore combining knowledge graphs and machine learning to help solve key and emerging questions in the drug discovery domain.

Preprint. Under review. arXiv [cs.AI], Feb.

Introduction

The process of discovering new drugs is an exceptionally complex one, requiring knowledge from numerous biological and chemical domains; however, it is a vital task in saving, extending or improving the quality of human life. The overall drug discovery pipeline requires key understanding in various subtasks. For example, drugs are primarily developed in response to some disease or condition negatively affecting patients. This implicitly requires that the mechanisms in the body by which the disease is caused are well understood so that a drug can be used to treat it – a process known as target discovery [78]. However, due to the complexities involved, the process of developing a new drug and bringing it to market is expensive and has a high chance of failure [96].

Hence, researchers are increasingly looking for new ways in which the drug discovery process can be undertaken, both in a more cost-effective manner and with a higher probability of success. Graphs have long been used in the life sciences as they are well suited to the complex interconnected systems often studied in the domain [6]. Homogeneous graphs have, for example, been used extensively to study protein-protein interaction networks [100], where each vertex in the graph represents a protein, and edges capture interactions between them.

However, recently Knowledge Graphs (KGs) have begun to be utilised to model various aspects of the drug discovery domain. Knowledge graphs are heterogeneous data representations (discussed in more detail in Section 2.2), where both the vertices and edges can be of multiple different types, allowing for more complex and nuanced relationships to be captured [54]. In the context of drug discovery, the vertices (commonly known as entities) in a knowledge graph could represent key elements such as genes, diseases or drugs – with the edge types capturing different categories of interaction between them. As an example of where having distinct edge types could be crucial, an edge between a drug and disease entity could indicate that the drug has been proven clinically successful in treating the disease. Conversely, an edge between the same two entities could mean the drug was assessed but ultimately proved unsuccessful in alleviating the disease, thus failing the clinical trial. This crucial distinction in the precise meaning of the relationship between the two entities would not truly be captured in the simple binary option offered by homogeneous graphs, whereas a knowledge graph representation would preserve this important difference and enable that knowledge to be used to inform better predictions. As a topical concrete application, knowledge graphs have been utilised to address various tasks in helping to combat the COVID-19 pandemic [35, 60, 145, 112, 142, 32, 55, 21, 9]. Additionally, considering the domain as a knowledge graph has the potential to enable recent advances in graph-specific machine learning models to be used to address some key tasks [41].

However, constructing a suitable and informative knowledge graph requires that the correct primary data is captured in the process. An interesting aspect of the drug discovery domain, and perhaps in contrast to others, is that there is a wealth of well curated, publicly available data sources, many of which can be represented as, or used to construct elements of, knowledge graphs [113]. Many of these are maintained by government and international level agencies and are regularly updated with new results [113]. Indeed, one could argue that there is sometimes too much data available, rather than too little, and researchers working in drug discovery must instead consider other issues when looking to use these data resources, particularly for graph-based machine learning tasks. Such issues include assessing how reliable the underlying information is, how best to integrate disparate and heterogeneous resources, how to deal with the uncertainty inherent in the domain, how best to translate key drug discovery objectives into machine learning training objectives, and how to model and express data that is often quantitative and contextual in nature. Despite these complications, an increasing level of interest in the area suggests that knowledge graphs could play a crucial role in enabling machine learning based approaches for drug discovery [41, 53, 156].

In this work, we present a review of the publicly available data sources which contain information pertinent to key areas within the drug discovery domain. The primary aim of the review is to serve as a guide for knowledge graph and machine learning practitioners who are new to the field of drug discovery and who therefore may be unfamiliar with major relevant data resources suitable

Footnote: Graphs are also commonly known as networks within the biological domain. In this review we use the term graph interchangeably with network and without loss of generality.

For the purpose of this review, we use the following criteria when choosing datasets for inclusion:

• Publicly Accessible - The dataset should be available for use within the public domain. Whilst many high quality commercial datasets exist, we choose to focus on only those datasets which are publicly accessible to some degree.

• High Quality - The dataset should contain information of the highest biological quality. This will primarily be assessed through its popularity within the drug discovery literature.

• Actively Maintained - The dataset, and the means to access it, should still be actively maintained.

• Non-Replicated Data - It is common for databases in the drug discovery field to contain partial, or even complete, copies of other datasets. We aim to detail mostly primary dataset sources.

The remainder of the paper is structured as follows: in Section 2 we introduce the required background knowledge, Section 3 details competing reviews, Section 4 introduces the relevant ontologies, in Section 5 the primary datasets are reviewed, in Section 6 existing drug discovery knowledge graphs are detailed, Section 7 presents some application case studies and the final conclusions are presented in Section 8.

In this section, we introduce some key background concepts including knowledge graphs, graph-based machine learning and the field of drug discovery.

Drug discovery and development is a complex and highly multi-disciplinary process [129] and is driven by the need to address a disease or other medical condition affecting patients for which no suitable treatment is currently in production, or where current treatments are insufficient [58]. Whilst a full review of the area is beyond the scope of this work, interested readers are referred to relevant reviews [139, 29, 96] and here we instead give a high-level overview of key concepts for machine learning practitioners, giving context for subsequent discussion of relevant knowledge graph resources. This section will make use of many of the biological terms and concepts defined in Table 1.

Drug discovery involves searching for causally implicated molecular functions, biological and physiological processes underlying disease, and designing drugs that can modify, halt or revert them. There are currently three main routes to drug discovery – selecting a molecular target(s) to design a drug against (targeted drug discovery), designing a high-throughput experiment to act as a surrogate for a disease process and then screening molecules to find ones that affect the outcome (phenotypic drug discovery), or using an existing drug developed for another disease (drug re-positioning). In targeted drug discovery, once a drug target has been identified, the process of finding suitable drug compounds can begin via an iterative drug screening process. Selected possible candidate drugs are then tested through a series of experiments in preclinical models (both in-vitro – study in cells or artificial systems outside the body, and in-vivo – study in a whole organism), and then clinical trials (drug development) to measure efficacy (beneficial modification of disease process) and toxicity (undesirable biological effects).

| Term | Definition |
|---|---|
| Cells | Important for preclinical research and data generation, e.g. studying drug responses, gene and protein expression, morphological responses via imaging. |
| Genes | Functional units of DNA, encoding RNA and ultimately proteins. Variants of a gene's DNA sequence may be associated with disease(s). |
| Gene Sequence | Gene sequences are transcribed into an intermediate molecule called RNA, which is in turn "read" to produce protein molecules. |
| Proteins | Key functional units of a cell that can play structural or signalling roles, or catalyse reactions, and interact with other proteins. |
| Biological Processes & Pathways | Molecules and cells function together to perform biological processes which can be conceptualised at different scales, from intracellular (e.g. signalling transmission via "pathways" between molecules) to intercellular processes, and ultimately physiological processes at the tissue and body scale. |
| Diseases | A condition resulting from aberrant biological/physiological processes. Different diseases may share symptoms and underlying aberrant processes, and can often be categorised into subtypes based on clinical and/or molecular features. |
| Targets | A drug target is a molecule(s) whose modulation (by a drug) we hypothesise will alter the course of the disease. |
| Compounds | Small molecules generated and studied as part of drug discovery are sometimes termed "compounds", with an accompanying chemical structure representation. |
| In-Vitro | Studies and experiments taking place outside the body, either in cells or in cell-free, highly defined systems. |
| In-Vivo | Studies and experiments taking place in a physiological context (e.g. animal or human study). |

Table 1: Definitions of key terms used within the scope of drug discovery.

Drugs can be different types of molecules, some of which are more established than others. Many are small chemicals (sometimes termed compounds) or antibodies (a type of protein) [10]. Various newer types of drugs, often collectively termed "drug modalities", are also being explored [10]. These different types of drugs have particular advantages and disadvantages, but together open up a wider set of potential drug targets compared to historical approaches. Biomedical science has researched such processes at different scales, using technologies that probe the abundance and sequence variation in DNA, RNA and proteins, and for studying specific biological functions via experimentation. For example, studies in genetic variation associated with disease are used to provide support to hypotheses for new drug targets [99, 68]. Databases have been constructed to collate and disseminate such data and information [113]. Ontologies and dictionaries have also been developed to model relevant concepts, such as disease and biological process descriptions, which are discussed more in Section 4.

The field is increasingly looking towards computational [129] and machine learning approaches to help in various tasks within the drug discovery process [136]. It can be helpful to consider partitioning the drug discovery process up into smaller subtasks which can be modelled through the use of machine learning. Some of the most common subtasks are:

• Disease Target Identification - What molecular entities (genes and proteins) are implicated in causing or maintaining disease, which we could develop new drugs to target? Also known as Target Identification and Gene-Disease Prioritisation.

• Drug Target Interaction - Given a drug with unknown interactions, what proteins may it interact with in a cell? Also known as Target Binding and Target Activity.

• Drug Combinations - What are the beneficial, or toxic, consequences of more than one drug being present and interacting with the biological system?

• Drug Toxicity Predictions - What toxicities may be produced by a drug, and in turn which of those are elicited by modulating the intended target of the drug, and which are from other properties of the drug? Also known as Toxicity Prediction.

There is currently not a strict and commonly agreed upon definition of a knowledge graph in the literature [54]. Whilst we do not aim to give a definitive definition here, we instead define knowledge graphs as they will be used through the remainder of this work. We first start by defining homogeneous graphs, before expanding the definition for heterogeneous graphs.

A homogeneous graph can be defined as $G = (V, E)$ where $V$ is a set of vertices and $E$ is a set of edges. The elements in $E$ are pairs $(u, v)$ of unique vertices $u, v \in V$. An example graph is illustrated in Figure 1a which demonstrates that homogeneous graphs can contain a mix of directed and undirected edges. It is common for these graphs to have a set of features associated with the vertices, typically represented as a matrix $X \in \mathbb{R}^{|V| \times f}$, where $f$ is the number of features for a certain vertex. The graphs frequently used as benchmarks in the graph-based machine learning field, Cora, Citeseer and PubMed, fall into this definition of homogeneous graph [56].

Heterogeneous graphs, or knowledge graphs as they are often referred to, are graphs which contain distinct different types of both vertices and edges, which can be defined as $G = (V, E, R, \Psi)$ [151]. Such graphs now have a set of relations $R$, and each edge is now defined by its relation type $r \in R$ – meaning that edges are now represented as triplet values $(u, r, v) \in E$ [73]. The vertices in knowledge graphs are often known as entities, with the first entity in the triple called the head entity, connected via a relation to the tail entity. In a drug discovery context, multiple relations are crucial as an edge could indicate whether a drug up or down regulates a certain gene, for example. Two vertices can also now be linked by more than one edge type, or even multiples of the same type. Again this is important in the drug discovery domain, as multiple edges of the same relation can indicate evidence from multiple sources. Additionally, each vertex in a heterogeneous graph also belongs to a certain type from the set $\Psi$, meaning that our original set of vertices can be divided into subsets $V_i \subset V$, where $i \in \Psi$ and $V_i \cap V_j = \emptyset$, $\forall\, i \neq j \in \Psi$ [46]. Given the drug discovery focus, these types could indicate if a vertex represents a gene, protein or drug. Further, these vertex types can limit the type of relations placed between them, $(u, r, v) \in E \rightarrow u \in V_i, v \in V_j$ where $i, j \in \Psi$ [46]. An edge relation type of 'expressed-as' makes sense between genes and proteins but not genes and drugs, for example. Finally, heterogeneous graphs also commonly contain vertex-level features, with each type often having its own set of features.

Figure 1: A Homogeneous and Heterogeneous Graph. Panels: (a) A Homogeneous Graph; (b) A Heterogeneous Graph.

A heterogeneous graph is presented in Figure 1b and contains some key differences with its homogeneous counterpart: there are three types of vertex (v1, v2 and v3) and these are linked through a mix of directed and undirected edges of three relation types (e1, e2 and e3).
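To make the heterogeneous graph definition above more concrete, the following is a minimal Python sketch of a triple store in which every entity has a type and every relation restricts the types of its head and tail. The class name, the schema dictionary and the entity names (`drug_a`, `gene_x`, `protein_x`) are illustrative assumptions and are not taken from any dataset discussed in this review.

```python
# Hypothetical schema: relation type -> (allowed head type, allowed tail type).
SCHEMA = {
    "expressed-as": ("gene", "protein"),
    "down-regulates": ("drug", "gene"),
    "up-regulates": ("drug", "gene"),
}


class KnowledgeGraph:
    """Toy store of (head, relation, tail) triples over typed entities."""

    def __init__(self, schema):
        self.schema = schema
        self.entity_type = {}  # entity name -> vertex type
        self.triples = []      # a list, so repeated triples are kept

    def add_entity(self, name, etype):
        self.entity_type[name] = etype

    def add_triple(self, head, relation, tail):
        # Enforce that the relation only links the vertex types it allows.
        head_type, tail_type = self.schema[relation]
        if (self.entity_type[head], self.entity_type[tail]) != (head_type, tail_type):
            raise ValueError(f"'{relation}' cannot link {head} to {tail}")
        self.triples.append((head, relation, tail))

    def relations_between(self, head, tail):
        return [r for h, r, t in self.triples if h == head and t == tail]


kg = KnowledgeGraph(SCHEMA)
kg.add_entity("drug_a", "drug")
kg.add_entity("gene_x", "gene")
kg.add_entity("protein_x", "protein")

kg.add_triple("gene_x", "expressed-as", "protein_x")
# Two edges of the same relation, e.g. evidence from two separate sources.
kg.add_triple("drug_a", "down-regulates", "gene_x")
kg.add_triple("drug_a", "down-regulates", "gene_x")
```

Storing the triples in a list rather than a set is a deliberate choice in this sketch, so that repeated edges of the same relation are preserved as separate pieces of evidence.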

A growing number of methods have been presented in the literature combining graphs and machine learning [148, 47]. These range from approaches which attempt to learn low-dimensional representations of vertices within the graph for use with various down-stream predictive tasks (many of which combine random walks on graphs with models from Natural Language Processing (NLP) [107, 44]), to custom graph-specific neural-based models for end-to-end learning using the raw graph as input [69, 48, 137]. Until recently, the majority of these graph-specific neural models were focused upon homogeneous graphs, however some methods have been created to process graphs with multiple edge types [116] and even fully heterogeneous graphs [57].

A somewhat parallel stream of work has focused on learning embeddings on knowledge graphs specifically [143]. Often these approaches are not graph structure specific, instead learning embeddings by optimising the distance between entities after translation via relations [11] or by exploiting various notions of similarity [102]. Primarily these approaches are trained to perform knowledge graph completion – the task of ranking true triples within the graph above negative ones [120]. The resulting models can then be used to propose likely tail entities given a certain head and relation combination. For example, in the context of drug discovery, given a certain drug entity and the relation type of down-regulates, a model is trained to propose the most likely gene entity – thus completing the triple.

Footnote: Knowledge graph completion is conceptually very similar to link-prediction in homogeneous graphs [86].

The study of knowledge graphs in the biomedical sciences and particularly drug discovery brings challenges and opportunities. Opportunities arise because biomedical information inherently contains many relationships which can be exploited for new knowledge. Unfortunately, there are many challenges that arise when constructing a knowledge graph suitable for use in drug discovery tasks. Some of the most interesting challenges are detailed below:

• Graph Composition - Strategies are needed to define how to convert data into information for modelling in a graph (e.g., instantiating a node or edge versus a feature on those entities).

• Heterogeneous & Uncertain - In biomedical graphs the data types are heterogeneous and have differing levels of confidence (e.g. well characterised and curated findings versus NLP-derived assertions), and much of the data will be dependent on both time and the dose of drug used as well as the genetic background in the study. Overall, this means edges are much less certain, and thus less trustworthy, than in other domains.

• Evolving Data - The underlying data sources integrated and used in suitable knowledge graphs are also often changing over time as the field develops, requiring attention to versioning and other reproducible research practices. As an example of this, the evolution of the frequently used STRING dataset is demonstrated in Figure 2.

• Bias - There are various biases evident in different data sources, for example negative data remain under-represented in some sources, including the primary scientific literature, and some areas have been studied more than others, introducing ascertainment bias in the graphs [103].

• Fair Evaluation - Several works have shown promise in applying machine learning techniques on a knowledge graph of drug discovery data. However, ensuring a fair data split is used for evaluation is perhaps more complicated than in other domains, as it is easy for biologically meaningful data to leak across splits. Thus, care should be taken to construct more biologically meaningful data splits, as well as considering if replicated knowledge has been incorporated in the graph.

Ultimately though, we feel there is now an interesting opportunity to experiment at the intersection of various research fields spanning graph theoretic and other network analysis approaches for molecule networks [5, 26], machine learning approaches [156], and quantitative systems pharmacology [123].
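As an illustration of the knowledge graph completion task described earlier in this section, the sketch below scores candidate tail entities for a (drug, down-regulates, ?) query using a translation-style distance, in the spirit of the translation-based embedding approaches mentioned above. The two-dimensional embeddings are hand-picked toy values rather than learned ones, and the entity names are hypothetical; a real model would learn these vectors by ranking true triples above negative ones.

```python
import math


def translation_score(head, relation, tail):
    """Negative Euclidean distance ||h + r - t||; higher means more plausible."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rank_tails(head, relation, candidates):
    """Return (name, score) pairs for candidate tails, most plausible first."""
    scored = [
        (name, translation_score(head, relation, vector))
        for name, vector in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy, hand-picked embeddings (assumptions for illustration only).
drug = [1.0, 0.0]
down_regulates = [0.0, 1.0]
genes = {
    "gene_a": [3.0, 3.0],
    "gene_b": [1.0, 1.0],   # equals drug + down_regulates, so distance is zero
    "gene_c": [0.0, -1.0],
}

ranking = rank_tails(drug, down_regulates, genes)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setting `gene_b` is ranked first because its embedding coincides with the translated head entity, followed by `gene_c` and then `gene_a`.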

Figure 2: The evolution of the STRING database over major versions showing the increase in Organisms, Interactions and Proteins. This data has been collected from https://string-db.org/cgi/access.pl?sessionId=dbw44gRWU7Xo&footer_active_subpage=archive (Plot axes: STRING Database Version against Count; series: Proteins, Interactions, Organisms.)

Figure 3: A simplified hierarchical view of a drug discovery knowledge graph schema. (Node labels: Diseases, Symptoms, Anatomy, Processes, Pathways, Cellular, Biological Processes, Proteins, Genes, MolFunc, Genes/Proteins, Drugs, Side Effect, PharmC, Pharmacological Agents.)

There have thus far been several other studies undertaken in the literature that address some of the same topics as this present work. It is interesting to note that many of these studies were published within the last few years, perhaps highlighting the growing interest in the field. This section will give an overview of these related studies and highlight how our own work both complements and differs from these.

As a topic of great recent interest within the community, the area of drug repurposing has been directly addressed in several reviews – some of which detail suitable available datasets [127, 83, 154, 87]. Recent research has detailed over 100 relevant drug repurposing databases, as well as appropriate methods [127]. The authors group the datasets by the primary domain that they detail: Chemical, Biomolecular, Drug–target interactions and Disease – with many of these categories being further subdivided into more specific areas. One interesting aspect of this study is that a set of recommended datasets is provided covering each domain. In [83], a review of drug repurposing from the view of machine learning has been presented, covering many of the available methods and a more focused selection of over 20 datasets. The datasets covered in this review must be in the public domain and are split into two primary categories: drug-centric and disease-centric. Relevantly for this review, in [154], knowledge graph specific approaches for drug repurposing are explored. A brief overview of the available datasets is presented, with the authors then choosing 6 to form the basis of the knowledge graph used in their experimental evaluation. In [87], the authors review and then partition the available drug database resources into four major categories based upon the type of information contained within: raw data, target-based, area specific and drug design. The various datasets are then further classified based on whether the information is curated or whether the dataset is an integration of existing resources. Compared to our own work, many of these reviews are confined to considering just a limited area of the domain and all but one do not consider how the datasets could be integrated into a knowledge graph.

With a broader focus than the drug repurposing previously covered, the area of drug-target interactions has been detailed [4]. The review primarily focuses upon the various machine learning methods for predicting drug-target interactions, however over 20 potential data sources are also presented. The datasets are classified as containing information regarding drug-target interactions, drug or targets alone and binding affinity. A prior review conducted with a similar focus also covered many of the same methods and databases [23]. Conversely, machine learning based methodologies for predicting drug-drug interaction have been detailed, as well as a comparative experimental evaluation conducted [19]. As part of this process, the authors construct a drug-drug interaction knowledge graph from a subset of the Bio2RDF [7] resource, consisting of three of the constituent datasets. A more general review of 13 drug related databases has also been presented [155], covering a broad range of databases detailing drugs, drug-target interactions and other drug related information. One interesting aspect of this survey is that the authors categorise the studied datasets based upon the tasks in the literature they have been used for, as well as detailing any studies which made use of them.

Perhaps one of the most closely aligned studies to our own work is presented in [94], where both datasets and approaches for biological knowledge graph embeddings are reviewed. Although the review focuses primarily upon the evaluation of different methodologies, 16 relevant databases are also discussed, identifying which topic they primarily contain information on. However, as the work is experimentally driven, only a limited dataset discussion is undertaken and the work is not geared towards graph-specific neural models. A comprehensive survey of the wider biomedical area and knowledge graph-based applications within it has been presented in [17]. Within the study, 13 datasets which meet the authors' criteria to be defined as knowledge graphs are identified and detailed – although due to the wider scope of the study, not all are directly related to the target discovery domain. Finally, a recent study presents a detailed overview of the application of graph-based machine learning in the drug discovery domain, focusing upon relevant methods and approaches [41]. The review is wide ranging, covering more than just knowledge graph based applications and, unlike our work, makes no mention of suitable public datasets. We do however feel that it strongly complements our own review and serves as a method-focused counterpart to our dataset overview.

Within the available reviews in the area, although much high-quality work has been performed, there are some definite gaps which our own review will aim to address. One clear issue is that many of the reviews are focused upon a specific area, with drug repurposing being well represented, thus they do not give a clear view of the target discovery landscape. Another trend in the current reviews is for their primary focus to be upon the experimental evaluation of methodologies, with datasets given comparatively less attention. Additionally, most reviews consider the resources solely as databases, rather than focusing upon how they could relate to, or indeed form part of, a knowledge graph. This lack of knowledge graph focus also means that many studies do not consider what type of information the databases could be used for, such as structural relations or entity features. Finally, many of the reviews have been written from a biological point of view, which may perhaps make them less accessible for machine learning practitioners who may be new to the domain, but who are interested in experimenting with relevant datasets.

An ontology is a set of controlled terms that defines and categorises objects in a specific subject area, and also the properties and relationships between the ontological terms. Modern biomedical ontologies are usually human constructed representations of a domain, capturing the key entities and relationships and distilling the knowledge into a concise machine readable format [38]. There is a need for consistency when discussing concepts like diseases and protein functions which can be interpreted in multiple ways. Therefore many biomedical ontologies have been created to categorise and classify biomedical concepts such as genes, proteins, biological processes and diseases. Most ontologies have a Directed Acyclic Graph (DAG) structure with the nodes representing the ontological terms and the edges representing the relations between them. This induces a hierarchical structure on the terms. The terms may also have properties that provide descriptions or cross references to terms in other ontologies.

4.2 Ontology Representations

Most biomedical ontologies are expressed in a knowledge representation language such as the OBO language created by the Open Biological and Biomedical Ontologies Foundry (OBO), Resource Description Framework Schema (RDFS) or the Web Ontology Language (OWL) [2]. OBO is a biologically oriented ontology language and is expressive enough to define the required terms, relationships and properties of an ontology. There exist free browser based tools to create, view and manipulate ontologies defined in OBO. There are also tools to check their completeness and logical consistency. It is possible to have a lossless transition from the OBO language to OWL, which is a family of knowledge representation languages for creating ontologies. OWL was designed for the web but is also used for creating biomedical ontologies. RDFS is another ontology language designed for the web and used for biomedical ontologies.
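As a rough illustration of how an ontology expressed in the OBO flat-file format maps onto a DAG, the sketch below reads `[Term]` stanzas with `id`, `name` and `is_a` tags and then collects all ancestors of a term. It handles only this small subset of the format; the `EX:` identifiers and term names are made up for the example, and in practice a dedicated ontology library would be preferable to hand-written parsing.

```python
OBO_TEXT = """\
[Term]
id: EX:0001
name: disease

[Term]
id: EX:0002
name: cardiovascular disease
is_a: EX:0001 ! disease

[Term]
id: EX:0003
name: heart disease
is_a: EX:0002 ! cardiovascular disease
"""


def parse_obo_terms(text):
    """Parse [Term] stanzas into {term_id: {"name": str, "is_a": [parent ids]}}."""
    terms = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == "[Term]":
            current = {"name": None, "is_a": []}
            continue
        if line.startswith("["):  # any other stanza type is ignored in this sketch
            current = None
            continue
        if not line or current is None or ": " not in line:
            continue
        tag, value = line.split(": ", 1)
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            # Drop the trailing "! comment" that repeats the parent's name.
            current["is_a"].append(value.split(" ! ")[0].strip())
    return terms


def ancestors(terms, term_id):
    """Collect every term reachable by following is_a edges upwards."""
    seen = set()
    stack = list(terms[term_id]["is_a"])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(terms.get(parent, {"is_a": []})["is_a"])
    return seen


terms = parse_obo_terms(OBO_TEXT)
heart_disease_ancestors = ancestors(terms, "EX:0003")
```

Following `is_a` edges in this way is what gives the hierarchical structure mentioned in the text: an annotation made against a specific term can be propagated to all of its more general ancestors.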

Ontologies provide such value in providing interoperability for biological data that there has been a proliferation of ontologies. However this in itself causes an issue, especially if a different ontology is used for the same biomedical entity. If database A labels diseases using ontology X and database B labels diseases using ontology Y, it can be hard to know the relation of two different disease entities in the databases. This type of scenario often occurs during the creation of biomedical knowledge graphs, which are databases of relationships between different biological entities and often use data from multiple sources. Some resources exist to match together ontological terms, e.g. OXO [63] and DODO [40]. However the mappings provided by these resources are different and, between any two distinct ontologies, there is no guarantee of a direct map or any map at all between their ontological terms even if the subject matter is the same.

Merging two different ontologies to become one ontology is an active area of research. There is demand for ontology merging as more and more databases are integrated together. However just mapping ontological terms to other ontological terms can create logical inconsistencies in the newly created ontology, violating DAG structures. Therefore merging ontologies often involves lots of manual intervention and is a time consuming and error prone process. The Open Biological and Biomedical Ontologies Foundry (OBO) [130] was set up to provide rules and advice for ontologies to make them easier to merge and match. Their recommendations include a standard set of relations between ontological terms.
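The way naive term mapping can break the DAG structure can be illustrated with a small sketch. It assumes two toy ontologies X and Y whose hierarchies disagree, and a hypothetical equivalence mapping between their terms; all identifiers are invented. Rewriting the edges of Y into the identifiers of X and checking for a directed cycle exposes the inconsistency.

```python
# (child, parent) is_a edges; every identifier here is hypothetical.
EDGES_X = {("X:flu", "X:viral_infection")}
EDGES_Y = {("Y:viral_inf", "Y:influenza_like")}

# Assumed equivalences between the two ontologies.
Y_TO_X = {"Y:viral_inf": "X:viral_infection", "Y:influenza_like": "X:flu"}


def merge_edges(edges_x, edges_y, y_to_x):
    """Rewrite ontology Y's is_a edges into X identifiers where a mapping exists."""
    merged = set(edges_x)
    for child, parent in edges_y:
        merged.add((y_to_x.get(child, child), y_to_x.get(parent, parent)))
    return merged


def has_cycle(edges):
    """Depth-first search for a directed cycle in a set of (child, parent) edges."""
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set())
    state = {}  # node -> "visiting" or "done"

    def visit(node):
        if state.get(node) == "done":
            return False
        if state.get(node) == "visiting":
            return True  # reached a node that is still on the current path
        state[node] = "visiting"
        if any(visit(nxt) for nxt in graph[node]):
            return True
        state[node] = "done"
        return False

    return any(visit(node) for node in graph)


merged = merge_edges(EDGES_X, EDGES_Y, Y_TO_X)
print(has_cycle(EDGES_X), has_cycle(merged))
```

Each ontology is acyclic on its own, yet the merged edge set contains a loop, which is the kind of inconsistency that typically has to be resolved by manual curation.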

In the remainder of this section we will detail the major ontologies which are relevant for use in drug discovery tasks. These ontologies are detailed in Table 2.

Ontology Name | Entities Covered | Classes | Average
Monarch Disease Ontology (MonDO) | Diseases | 24K | 5 | 8K | 25 | 16 | Creative Commons
Experimental Factor Ontology (EFO) | Diseases | 28K | 6 | 7K | 66 | 20 | Apache 2.0
Orphanet Rare Disease Ontology (ORDO) | Rare Diseases | 15K | 17 | 8.5K | 24 | 11 | Creative Commons
Medical Subject Headings (MeSH) | Medical Terms | 300K | 4 | 270K | 38 | 15 | UMLS License
Human Phenotype Ontology (HPO) | Disease Phenotype | 19K | 3 | 6.5K | 0 | 16 | HPO License
Disease Ontology (DO) | Diseases | 19K | 4 | 8K | 89 | 33 | Creative Commons
Drug Target Ontology (DTO) | Drug Targets | 10K | 4 | 3K | 43 | 11 | Creative Commons
Gene Ontology (GO) | Genes | 44K | - | - | 11 | - | Creative Commons

Table 2: An overview of Ontologies suitable for use in drug discovery.

EFO

MeSH

One of the most commonly used and largest biomedical ontologies is the Medical Subject Headings Thesaurus or MeSH [79]. It was designed for indexing articles in the MEDLINE/PubMed database. Each article in PubMed has MeSH terms attached that specify which biological entities the article is describing. The ontology has been translated into many different languages and there are useful tools provided by MeSH. It is possible to generate MeSH terms from text automatically using APIs, although these are not guaranteed to be correct. MeSH has around 300,000 different classes, although these are not all disease specific. The relationships between the ontological terms are not very complex and there is little cross-referencing or interlinking of terms. This can cause difficulties when working with other ontologies.

Disease Ontology

The Human Disease Ontology (DO) is an ontology designed to link different datasets through disease concepts [118]. The resource is community driven and aims to have a rich hierarchy allowing the study of different diseases from the ontological connections. DO has terms linked to well established ontologies such as MeSH, SNOMED and UMLS. It is a member of the OBO community of interoperable ontologies.

Mondo Disease Ontology

The Mondo Disease Ontology was semi-automatically created [97]. It aims to harmonise disease definitions between generalised ontologies such as MeSH and specific deep ontologies such as Orphanet (an ontology focussing on rare diseases). Mondo merges these ontologies using algorithms such as Bayesian OWL ontology merging together with curated equivalence relations and user feedback. One of the advantages is that the axioms linking to other resources are checked algorithmically. This prevents inconsistencies and logical loops created from linking terms to different resources, which can otherwise easily occur. It has around 20K different disease classes.

HPO

The Human Phenotype Ontology (HPO) provides an ontology to describe the phenotypes (the observable traits) of disease [114]. The terms do not contain the actual disease entities such as “Flu”; instead there are symptoms such as a “runny nose” and “sore throat”. The ontology was originally developed for rare and typically genetic diseases but is now also used for common diseases. HPO includes terms to describe the speed of onset of the disease, how often symptoms occur, as well as inheritance characteristics. HPO can be used to diagnose diseases and is used by clinicians as well as researchers. HPO is developed by Monarch, who also develop Mondo, so these two ontologies work well together.

GO

DTO

As we have previously mentioned, a drug target is a protein or molecule within the body that is associated with disease and would be suitable to develop a drug against. The Drug Target Ontology (DTO) has terms to describe information about drug targets [77]. The ontology has terms to describe the type of protein, how well studied the protein is and what types of drugs may be used against the target. It is also used to describe the level of development of drugs for a protein. It uses the Human Disease Ontology to link diseases to the proteins and is designed to be modular and easily extensible.

As has been highlighted throughout this review, unlike some other domains, the drug-discovery area actually has a wealth of publicly available information, much of which has dedicated teams tasked with maintaining and updating the resources. Many of these are national or international level bodies, for example the US based National Center for Biotechnology Information (NCBI) or the European Bioinformatics Institute (EBI). Additionally the pan-European ELIXIR body, an organisation dedicated to detailing best practices for life science datasets and enabling stable funding for them, maintains a list of core data resources, which includes many of the resources covered in this review [37].

In this section we introduce some of the key, primary resources covering the crucial entities of interest in drug discovery: genes, disease and drugs, as well as sources capturing the relationships between them via interactions, pathways and processes. The list of resources covered here is not designed to be exhaustive; instead we signpost some of the most popular and trusted ones, suggest how they could be integrated into a knowledge graph and discuss the origins of the data.

This section outlines data sources which are tailored specifically for the drug discovery field. Typically these resources combine two or more entity specific data sources and add additional knowledge useful for the domain. These resources can also be useful as a reference point for some best practices with regards to data handling and integration for the field.

Open Targets:

Open Targets is a resource which collects various disparate data sources together, covering the key entities for target discovery including genes, drugs and diseases [71, 18]. Open Targets was established in 2016 as a collaboration between academia and industry with the goal of enabling better science through the integration of public resources, primarily relating to genes and diseases, for the task of target prediction [71]. Data for Open Targets has been taken from 20 resources including Uniprot [3], Reactome [62] and ChEMBL [89]. As of January 2021, Open Targets contains data on 14K diseases and 27K targets. The resource is updated multiple times a year, with five main releases in both 2020 and 2019. Each release also contains detailed version information for the constituent datasets. Access to the data is enabled via a web-based REST API, a Python client, as well as directly for download via JSON and CSV files. Recently Open Targets has been expanded with the addition of a Genetics portal [42], for studying genetic variants and their relation to disease.

As Open Targets is specifically designed to integrate data around linking potential target genes/proteins to diseases, each potential link is provided with annotated associative evidence scores for a variety of evidence classes including genetic, drug and text mining. This information is aggregated into a final association score, indicating how associated a certain target-disease pair is overall. Thus far, Open Targets does not provide any of its information in a format amenable for use in knowledge graphs, nor has its information been integrated into any of the existing drug discovery suitable graph resources. However it is a prime resource, with clear scope to enrich a knowledge graph with pertinent target discovery information. For example, it could be utilised to provide links between target gene entities in a knowledge graph and the relevant disease entities. Further, the various associative scores contained within Open Targets could be used to weight these edges and provide some notion of trust based on the type of association.
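As a rough sketch of how such scores might become edge weights, the example below turns target-disease association records into weighted knowledge graph edges, one per evidence class. The record layout, the identifiers, the relation names and the 0.2 cut-off are all assumptions made for illustration; they do not reflect the actual Open Targets data schema or its scoring method.

```python
from typing import NamedTuple


class Association(NamedTuple):
    target: str        # gene/protein identifier (hypothetical values below)
    disease: str       # disease identifier (hypothetical values below)
    overall: float     # aggregated association score, assumed to lie in [0, 1]
    by_evidence: dict  # evidence class -> score, assumed to lie in [0, 1]


ASSOCIATIONS = [
    Association("GENE_A", "DISEASE_1", 0.82, {"genetic": 0.9, "text_mining": 0.4}),
    Association("GENE_B", "DISEASE_1", 0.15, {"text_mining": 0.15}),
    Association("GENE_A", "DISEASE_2", 0.55, {"drug": 0.7}),
]


def to_weighted_edges(associations, min_score=0.2):
    """Yield (head, relation, tail, weight) tuples, one per evidence class."""
    for assoc in associations:
        if assoc.overall < min_score:
            continue  # drop weakly supported target-disease pairs entirely
        for evidence, score in sorted(assoc.by_evidence.items()):
            yield (assoc.target, f"associated_via_{evidence}", assoc.disease, score)


edges = list(to_weighted_edges(ASSOCIATIONS))
for edge in edges:
    print(edge)
```

Keeping one relation type per evidence class, rather than a single merged edge, preserves the notion of trust attached to each type of association.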

Pharos:

In a similar vein to Open Targets, the Pharos resource provides data integrations around the drug discovery domain, with a particular focus on the druggable genome [101]. Pharos is actually the front-end access point, with the underlying data resource being the Target Central Resource Database (TCRD), which was launched in 2014 as part of a National Institutes of Health (NIH) program. TCRD contains data on the key entities in the drug discovery domain, including genes and diseases, with the data being integrated from a large number of other public data sources such as ChEMBL [89], STRING [125], DisGeNET [109] and Uniprot [3]. TCRD also implements a number of ontologies including the Disease Ontology [119] and the Gene Ontology [28]. Access to the data through Pharos is provided programmatically through a GraphQL API endpoint, GraphQL being a graph-like query language for data retrieval [50]. Both Pharos and TCRD are open-sourced and are updated regularly, with multiple yearly releases keeping up to date with changes in the constituent dataset sources, whilst also keeping version information.

As with Open Targets, the information contained within Pharos could be used to provide links between proteins and diseases. However, it also contains detailed protein-protein interaction information which could be added as relationships between protein entities; these could be weighted by the various confidence metrics with which the relationships are annotated. Additionally, Pharos contains various information types which could be used to add features onto entities. For example, for a given protein entity, Pharos contains structural and expression information which could be transformed into a generic and task agnostic set of features.

Genes and Proteins are the key entities related to target discovery and as such there are numerous rich public resources related to them. We briefly review the most pertinent of these here, however interested readers are encouraged to refer to more detailed protein-specific dataset reviews [22]. The datasets are summarised in Table 3.

Dataset | First Released | Update Frequency | Updated <
UniprotKB | 2003 | 8 Weeks | ✓ | Expert & Automated | Proteins | Primary protein resource used in the domain. Can be mined for protein-protein interactions and potentially protein features.
Ensembl | 1999 | 3 Months | ✓ | Automated | Genes | One of the primary sources for gene data. Gene-gene and gene-disease relationships can be extracted, as well as many gene-based features.
Entrez Gene | 2003 | Daily | ✓ | Expert & Automated | Genes | Another primary gene data resource. Used in existing KGs for gene entity annotations.

Table 3: Primary data sources relating to Genes and Proteins.

Uniprot

UniProt is a collection of protein sequence and functional information which was started in its current form in 2003 and provides three core databases: UniProtKB, UniParc and UniRef [3]. UniProtKB is the primary protein resource and thus will be focused on here. Overall, UniProt is classified as an ELIXIR core resource and is frequently integrated into other datasets. The whole of UniProt is updated on an eight week cycle and access is provided via REST, Python, Java and SPARQL API endpoints. UniProt cross-references with many resources in the domain including the Gene Ontology [28], IntAct [52] and STRING [125]. The UniProtKB database itself comprises two different resources: Swiss-Prot and TrEMBL [3]. Swiss-Prot contains the expert annotated and curated protein information, whilst TrEMBL stores the automatically extracted information. Thus, as of UniProt version 2020_06, TrEMBL contains a greater volume of entities at 195M versus the 563K entities in Swiss-Prot. The information contained within UniProtKB could be used to add protein-protein interaction relations into a knowledge graph. Additionally, numerous protein structural and sequence based features could be extracted to enrich the relevant entities.
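One simple example of a sequence based feature is amino acid composition: the fraction of each of the 20 standard residues in a protein sequence. The sketch below computes this as a fixed-length vector that could be attached to a protein entity. The sequence used is a toy string rather than a real UniProtKB entry, and the choice of composition as the feature is an illustrative assumption, not a recommendation from the resource.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def composition_features(sequence):
    """Return the fraction of each standard amino acid, in AMINO_ACIDS order."""
    sequence = sequence.upper()
    # Non-standard or ambiguous symbols are ignored in this sketch.
    counts = Counter(ch for ch in sequence if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


features = composition_features("MKTAYIAKQR")  # toy sequence, not a real entry
print(len(features), round(sum(features), 6))
```

Because every protein maps to a vector of the same length, such features can be used directly as node attributes regardless of sequence length.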

Ensembl

Ensembl is primarily a data resource for genetics from the EBI covering many different species; it was founded in 1999 and is considered an ELIXIR Core resource [150]. It provides detailed information on gene variants, transcripts and position in the overall genome. This data is extracted via an automated annotation process which considers only experimental evidence. The information in Ensembl is updated on approximately a three monthly schedule, with over 100 versions having been released to date. Access is provided via a REST endpoint as well as MySQL data dumps. The data in Ensembl could be used to provide gene-protein links, as well as gene-disease links through the integration with HPO.

Entrez Gene

Entrez Gene is the NCBI database which provides gene-specific information and was initially launched in 2003 [84]. Gene can be viewed as an integrated resource of gene information, incorporating information from numerous relevant resources. As such, it contains a mix of curated and automatically extracted information, which is updated as frequently as daily and made available for direct download [14]. Owing to its status as an integrator of relevant resources, Gene provides the GeneID system, a unique integer associated to each catalogued gene. The GeneID can be useful as a translation service between other resources and is used by the Hetionet knowledge graph [53] as the primary ID for its gene entities. Entrez Gene could potentially be used to enrich a knowledge graph with gene level features, for example it contains detailed tissue expression data.
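The use of a GeneID as a translation layer can be sketched as follows. Two edge lists that identify genes in different namespaces are rewritten to a shared integer identifier so that the same gene becomes a single node. The lookup tables, identifiers and relation names here are hypothetical; in practice the mappings would be taken from the resources themselves.

```python
# Hypothetical identifier tables mapping two namespaces onto integer gene IDs.
SYMBOL_TO_GENEID = {"GENE_A": 101, "GENE_B": 102}
ENSEMBL_TO_GENEID = {"ENSG_HYPOTHETICAL_1": 101, "ENSG_HYPOTHETICAL_2": 103}

EDGES_BY_SYMBOL = [("GENE_A", "interacts_with", "GENE_B")]
EDGES_BY_ENSEMBL = [
    ("ENSG_HYPOTHETICAL_1", "associated_with", "DISEASE_1"),
    ("ENSG_HYPOTHETICAL_2", "associated_with", "DISEASE_2"),
]


def normalise(edges, table):
    """Rewrite gene identifiers to gene IDs; skip edges whose head gene is unmapped."""
    out = []
    for head, relation, tail in edges:
        if head not in table:
            continue
        # Tails that are not genes (e.g. diseases) are passed through unchanged.
        out.append((table[head], relation, table.get(tail, tail)))
    return out


graph_edges = normalise(EDGES_BY_SYMBOL, SYMBOL_TO_GENEID) + normalise(
    EDGES_BY_ENSEMBL, ENSEMBL_TO_GENEID
)
print(graph_edges)
```

After normalisation, gene 101 carries both its interaction edge and its disease edge, which would have been attached to two separate nodes had the original identifiers been kept.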

In this section, we detail the resources specialising in the linking of the entities discussed thus far through interactions, processes and pathways. The interaction resources are presented in Table 4, whilst the processes and pathways resources are detailed in Table 5.

Dataset | First Released | Update Frequency | Updated <
STRING | 2003 | Monthly | ✓ | Expert & Automated | Protein/Gene Interactions | One of the most commonly used sources for physical and functional protein-protein interactions in existing KGs.
BioGRID | 2003 | Monthly | ✓ | Expert | Biological Interactions | Contains interactions between gene, protein and chemical entities which could be included directly in a KG.
IntAct | 2003 | Monthly | ✓ | Expert | Molecular Interactions | Contains molecular reactions between gene, protein and chemical entities. Uses UniProt for identifiers.

Table 4: Primary data sources relating to interactions. A more focused review specifically detailing protein-protein interactions can be found here [88].

Interaction Resources

STRING

The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) is another ELIXIR core resource containing detailed protein information, with a particular focus on capturing protein-protein interaction networks [125]. The resource was started prior to 2003 and has been updated typically bi-annually, with STRING version 11.0b containing over 24M proteins from over 5K different organisms. STRING integrates with other resources such as UniProt [3], Ensembl [150] and the Gene Ontology [28] to allow for easier joining of entities. Access is provided via a REST API as well as a series of flat files, with the primary protein-protein interaction networks being provided as an edge-list, making the data extremely amenable for inclusion in a knowledge graph. The interactions in STRING are taken from a range of sources, including curated ones taken directly from experimental data and those which are mined from the literature using NLP techniques.
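To illustrate why an edge-list is convenient for graph construction, the sketch below reads a small whitespace-separated edge list and keeps only the higher-confidence interactions. The column names, the integer score scale and the threshold of 700 are assumptions made for this toy example, and the protein identifiers are invented; the layout of the real download files should be checked against the STRING documentation.

```python
import io
from collections import defaultdict

# Toy data in a whitespace-separated edge-list layout; identifiers are made up.
EDGE_LIST = io.StringIO(
    "protein1 protein2 combined_score\n"
    "P_A P_B 950\n"
    "P_A P_C 300\n"
    "P_B P_C 720\n"
)


def load_edges(handle, min_score=700):
    """Read an edge list and keep interactions scoring at or above `min_score`."""
    adjacency = defaultdict(dict)
    next(handle)  # skip the header row
    for line in handle:
        if not line.strip():
            continue
        a, b, score = line.split()
        score = int(score)
        if score >= min_score:
            # Interactions are treated as undirected, so store both directions.
            adjacency[a][b] = score
            adjacency[b][a] = score
    return adjacency


graph = load_edges(EDGE_LIST)
print({node: dict(neighbours) for node, neighbours in graph.items()})
```

The retained scores can be carried forward as edge weights, so the threshold only controls which interactions enter the graph at all.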

BioGRID

The Biological General Repository for Interaction Datasets (BioGRID) is a resource maintained by a range of international universities which specialises in collecting information regarding the interactions between biological entities including proteins, genes and chemicals [124]. The resource was started in 2003 and is updated monthly with curated information from the literature, with version 4.2 containing over 1.9M interactions. The data is provided for download directly as CSV files, via a REST API and as a Cytoscape plugin. The data available via BioGRID could be directly used in a graph as edges between entities, with protein-protein and gene-protein interactions being clear candidates. This process is simplified as BioGRID uses Entrez Gene IDs to represent the entities.

IntAct & MINT

IntAct is a database of molecular interactions maintained by the EBI and considered a core resource by ELIXIR [52]. IntAct was first launched in 2003 and has been updated on a monthly cycle, with the version released in January 2021 containing over 1.1M interactions. The data is made available for download via flat file formats from the IntAct website. Like the other interaction datasets highlighted in this section, IntAct data can be interpreted as edges between entities describing protein-protein interactions, where the Protein entities are represented using UniProt IDs allowing for cross-resource linking. IntAct is closely linked with the Molecular Interaction database (MINT), another core resource providing various interaction types between proteins [76].

Dataset | First Released | Update Frequency | Updated <
Reactome | 2003 | > Annually | ✓ | Expert | Pathways | A core resource for pathways and reactions. Amenable to graph representation and already included in several KGs.
Omnipath | 2016 | > Annually | ✓ | Expert | Pathways | An integrator of pathway resources that could be included in a KG via its RDF version.
Wikipathways | 2008 | Monthly | ✓ | Expert | Pathways | A crowdsourced collection of pathway resources. Also provided in graph amenable formats.
Pathways Commons | 2010 | Biannually | ✓ | Expert & Automated | Pathways | A collection of many resources, including the others discussed in this table.

Table 5: Primary data sources relating to pathways and processes.

Reactome

Reactome is a large and detailed resource comprising biological reactions and pathways collected across multiple species including those from several model organisms and humans [62]. The project was started in 2003 and has been updated multiple times a year since then, with release version 75 containing over 2.4K human pathways, 13K reactions and 10K proteins. Reactome has been granted ELIXIR core resource status and integrates closely with other databases like IntAct, UniProt and ChEMBL. The data contained within Reactome is curated and peer reviewed by domain experts, with links provided to the originating literature. The resource is very amenable to graph based representation and is provided for download in a variety of formats, including a Neo4J database dump. Resources from Reactome have also already been included in existing knowledge graph resources like Hetionet.

Omnipath

Omnipath is a comparatively new resource for biological signalling pathway information with a focus on humans and rodents [133]. The Omnipath web-resource was first made publicly available in 2016 and has been updated regularly since, although on an ad hoc schedule. Omnipath integrates over 100 literature curated data resources containing information on signalling pathways, including many datasets covered in this present review such as IntAct, Reactome and ChEMBL. Access to Omnipath is provided via a web-based REST API, a Cytoscape plugin, as flat files, as well as through libraries for the R and Python programming languages. Owing to Omnipath’s status as an integrator of other resources, it provides a wealth of different interaction types which could be utilised as edges in a knowledge graph.

Wikipathways

The Wikipathways project explores the use of crowdsourcing for community curation of pathway and interaction resources [121]. The project began in 2008 and has been updated on a monthly schedule since then. Owing to its crowdsourced nature, domain scientists can add new and edit existing information to ensure better overall quality, and it has been designed to complement existing resources such as Reactome. Access is provided via a REST endpoint, as well as clients in various programming languages such as R, Python and Java. Additionally, Wikipathways has a semantic web portal, enabling access through a SPARQL endpoint to an RDF version of the resource for knowledge graph compatibility [138].

Pathways Commons

The Pathways Commons project was started in 2010 with the aim of collecting and allowing easy access to biological pathway and interaction databases [20]. Pathways Commons aims to complement, rather than compete with, other pathway resources and offers no additional curation on top of the constituent resources. As of version 12, 20 different resources have been included in Pathways Commons, which include Wikipathways, BioGRID and Reactome. The main resource is updated with data from the source datasets on an approximately biannual schedule [115]. Access is provided via a REST API, as well as a SPARQL endpoint which, through the returned RDF triples, means that the data could easily be incorporated into a knowledge graph.
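The step from RDF triples to knowledge graph edges can be sketched with a deliberately simplified parser. The example assumes N-Triples-style lines in which subject, predicate and object are all IRIs, and uses invented example.org IRIs; literals, blank nodes and prefixes are not handled, so a dedicated RDF library would be preferable for real data.

```python
import re

# Toy N-Triples-style lines using made-up example.org IRIs.
TRIPLES = """\
<http://example.org/protein/P_A> <http://example.org/rel/participatesIn> <http://example.org/pathway/PW_1> .
<http://example.org/protein/P_B> <http://example.org/rel/participatesIn> <http://example.org/pathway/PW_1> .
<http://example.org/protein/P_A> <http://example.org/rel/interactsWith> <http://example.org/protein/P_B> .
"""

# Matches only triples whose three terms are IRIs in angle brackets.
IRI_TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.$")


def local_name(iri):
    """Keep only the last path segment of an IRI as a short label."""
    return iri.rstrip("/").rsplit("/", 1)[-1]


def triples_to_edges(text):
    """Convert IRI-only triples to (head, relation, tail) tuples; skip other lines."""
    edges = []
    for line in text.splitlines():
        match = IRI_TRIPLE.match(line.strip())
        if match:
            edges.append(tuple(local_name(part) for part in match.groups()))
    return edges


kg_edges = triples_to_edges(TRIPLES)
for edge in kg_edges:
    print(edge)
```

Since a triple is already a (subject, predicate, object) statement, the conversion amounts to choosing how to shorten or normalise the identifiers.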

We now detail the resources whose primary focus is providing information about diseases. These resources are detailed in Table 6.

Dataset | First Released | Update Frequency | Updated <
KEGG DISEASE | 2008 | Monthly | ✓ | Expert | Disease | A comprehensive disease resource for viewing disease as part of a biological system. Access is restricted for industrial use.
DISEASES | 2015 | Daily | ✓ | Expert & Automated | Disease | A resource detailing links between genes and diseases. Already commonly used in drug discovery KGs.
DisGeNET | 2010 | Annually | ✓ | Expert & Automated | Disease | One of the most frequently used disease sources in existing KGs. Contains a mix of resources including experimental and text-mined data.
OMIM | 1987 | Daily | ✓ | Expert & Automated | Disease | One of the oldest disease databases, focusing upon mendelian disorders. Can provide gene-disease relationships.
GWAS Catalog | 2008 | Biweekly | ✓ | Expert | Disease | Contains the results from GWAS studies which could be used to provide less studied links between genes and diseases into a KG.

Table 6: Primary data sources relating to disease.

KEGG DISEASE

The KEGG DISEASE database is part of the larger KEGG resource, in which diseases are modelled as perturbed states of the molecular network system [64]. Each disease entry contains information on the perturbants to this system including genetic and environmental factors of diseases, as well as drugs. Different types of diseases, including single-gene (monogenic) diseases, multifactorial diseases and infectious diseases, are all treated in a unified manner by accumulating such perturbants and their interactions. The KEGG DISEASE resource was originally launched in 2008 and is updated with the rest of KEGG on a monthly schedule. Access to the data is provided via an FTP site, however users must register to gain access and industry users are charged for use.

DISEASES

DISEASES is a dataset designed to integrate evidence on disease-gene associations from automatic text mining, manually curated literature, cancer mutation data and genome-wide association studies [110]. The dataset was originally launched in 2015 and has been updated daily as new relations are extracted. The data is available for download directly as flat files, with clear indication given as to whether the data is curated or extracted from text. As of January 2021, there are in total over 17K genes, 4K diseases and 543K associations available in DISEASES. Owing to its relational nature, the data in DISEASES can be represented as edges in a knowledge graph, and it has already been incorporated in Hetionet [53].

DisGeNET

The DisGeNET resource integrates a variety of data sources from expert curated repositories, GWAS catalogues, animal models and the scientific literature [108]. DisGeNET was started in 2010 and the underlying database has recently been updated on an annual basis. The latest version, DisGeNET v7.0 released in 2020, contains 1.1M gene-disease associations between 21K genes and 30K diseases as well as many other variant-disease relationships. The data is integrated from four primary source types including expert curated databases and inferred associations. Access is provided via a REST API as well as being made available as RDF triples, providing disease-gene, disease-disease and disease-variant relations for use in a knowledge graph. DisGeNET has been included in existing KG resources like Hetionet [53].

OMIM

Online Mendelian Inheritance in Man (OMIM) is a comprehensive, authoritative compendium of human genes and genetic phenotypes, with particular focus on the molecular relationship between genetic variation and phenotypic expression [49]. The OMIM resource has its origins in a resource collected in the 1950s and has been made available digitally in some form since 1987, with updates being provided on a daily basis. As of January 2021, the OMIM Morbid Map Scorecard contains over 6K phenotypes for which the molecular basis is known and 4K genes with phenotype-causing mutation. The data is provided via a REST API, however access is controlled and users must register with OMIM in order to be able to download it.

GWAS Catalog

The Genome-Wide Association Studies (GWAS) performed in the literature provide an unprecedented opportunity to investigate the impact of common variants on complex diseases. The GWAS Catalog provides a consistent, searchable and freely available database of Single Nucleotide Polymorphism (SNP) to trait associations which are extracted from both published and unpublished GWA studies [15]. The data inside the GWAS Catalog is taken from studies found in the literature which are curated by experts before being added. As of January 2021, the GWAS Catalog contains data from over 4K publications and 220K SNP-trait associations. The resource started in 2008 and is currently updated on a biweekly basis. Access is provided via a REST API, as well as flat files, with links being provided to Ensembl and EFO for integration purposes. The GWAS Catalog could be used to provide relations in a knowledge graph between genes and diseases which may not already be captured in resources like OMIM.

We will now detail datasets containing information relating to drugs and compounds. This includes information on the relationships between the drugs and the targets or diseases, as well as other information such as potential adverse side effects or drug-drug interactions. These resources are detailed in Table 7.

Dataset | First Released | Update Frequency | Updated <
ChEMBL | 2009 | > Annually | ✓ | Expert | Drugs | One of the primary resources for drug-like molecules. Could provide relational information between gene and drugs.
PubChem | 2004 | As Sources Are | ✓ | Expert & Automated | Chemical | A comprehensive integrator of other chemical resources provided in RDF format, enabling easy incorporation into a KG.
DrugBank | 2006 | > Annually | ✓ | Expert | Drugs | A rich source of drug, disease and gene information. Free use is limited to academic work only.
Drug Central | 2016 | Annually | ✓ | Expert & Automated | Drugs | A collection of drug information extracted from literature and other sources. A potential source of drug features.
BindingDB | 1995 | Weekly | ✓ | Expert & Automated | Drugs | A data resource of target protein and compound information. Already incorporated in existing KGs.
RepoDB | 2017 | No Set Schedule | ✗ | Expert | Drugs | A resource of drug to disease links containing both successful and failed examples. A rare source of negative information.

Table 7: Primary data sources relating to drugs.

ChEMBL

The ChEMBL dataset is one of the primary resources containing information on drug-like molecules and compounds [89]. The database is hosted by the EBI and it is recognised as an ELIXIR core resource. ChEMBL was initially released in 2009 and has been updated at least annually since then. The records captured in the database are taken from the literature and curated before being added. As of version 27, ChEMBL contains information on over 1.9M compounds and 13.3K targets, taken from over 76K publications and 65 deposited datasets. The data is provided via an SQL dump, as well as via RDF through a SPARQL endpoint. ChEMBL has been included in many other integrated resources such as OpenTargets and Pharos. The data could also be included in a knowledge graph, giving relationships between drug and gene entities.

PubChem

PubChem is a resource collecting information on chemical molecules maintained by the NCBI [67]. PubChem can be considered an integrator resource, aggregating over 700 disparate resources including UniProt, ChEMBL and Reactome. As such, as of January 2021, PubChem contains information on over 11M Compounds, 287M Substances and 273M Bioactivities. The resource was initially launched in 2004 and is updated regularly as the underlying sources are. Access to the data is provided via an FTP site and a REST API, as well as via an RDF endpoint.

DrugBank

The DrugBank database can be considered both a bioinformatics and a cheminformatics resource, thus containing information on drugs and potential targets [147]. DrugBank was initially launched in 2006 and has been updated multiple times per year since then. As of DrugBank version 5.1.7, over 11K drug-like small molecules, 27K targets and 48K biological pathways (including disease) are detailed. Users must register before being allowed access to the data and use is only allowed for free in an academic setting. DrugBank has been added to existing knowledge graph resources like Hetionet to provide links between drugs and target entities.

Drug Central

Drug Central is a resource containing information on drugs and other pharmaceuticals which was initially made publicly available in 2016 and has been updated annually since [135]. The 2020 release of Drug Central contains information on over 4K drugs and 110K pharmaceuticals. The database focuses upon collecting information on FDA and EMA approved drugs, with the information being collected and curated from the literature and drug-labels, as well as from external sources. Drug Central is available for download in both the form of an SQL dump and as TSV files.

BindingDB

BindingDB is a resource containing information primarily on the interactions between potential drug-target proteins and drug-like molecules [24]. The database has its origins in a project from 1995 and has been updated approximately weekly since then. As of January 2021, BindingDB contains information on over 5.5K proteins and 520K drug-like molecules, with this data being taken from a range of sources including information curated from the literature, as well as from ChEMBL. Access to the data is provided via a range of web-based REST APIs; additionally the complete database is provided for download in TSV file format. BindingDB contains information which could be used to add relations between compound, protein and pathway entities, as well as potentially some compound-based feature information. It has also already been included in knowledge graph resources such as Hetionet.

RepoDB

RepoDB is a smaller and more focused resource containing information more suitable for drug repositioning than many of the ones highlighted thus far. It focuses upon providing drug to disease links; however, it provides information not only about approved drugs, but also about drugs which failed at various stages of clinical trials or that have been withdrawn [13]. This is interesting when considering knowledge graphs, as the data could be used to provide negative edges between drugs and diseases – perhaps creating a richer overall resource. However, since its introduction in 2017, RepoDB has seen only two updates and none within the last few years.
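As a minimal sketch of how such records could yield both positive and negative drug-disease edges, the Python snippet below maps a trial status to an edge label. The record layout, the status strings and the relation names `treats` and `failed_for` are hypothetical and are not taken from the RepoDB schema.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    head: str
    relation: str
    tail: str


# Hypothetical status vocabulary; the real RepoDB values may differ.
POSITIVE_STATUSES = {"Approved"}
NEGATIVE_STATUSES = {"Terminated", "Withdrawn", "Suspended"}


def records_to_edges(records):
    """Split (drug, disease, status) records into positive and negative edges."""
    positives, negatives = [], []
    for drug, disease, status in records:
        if status in POSITIVE_STATUSES:
            positives.append(Edge(drug, "treats", disease))
        elif status in NEGATIVE_STATUSES:
            negatives.append(Edge(drug, "failed_for", disease))
        # Records with any other status are left out rather than guessed.
    return positives, negatives


# Toy records for illustration only (not real RepoDB rows).
toy_records = [
    ("drug_a", "disease_x", "Approved"),
    ("drug_b", "disease_x", "Terminated"),
    ("drug_c", "disease_y", "Unknown"),
]
positives, negatives = records_to_edges(toy_records)
```

Keeping the failed or withdrawn pairs as their own relation type, rather than discarding them, is one way such data might later serve as explicit negatives during model training.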

This section has introduced some of the primary sources for information about the key entities in a drug discovery knowledge graph, namely proteins & genes (Section 5.2), drugs (Section 5.5) and diseases (Section 5.4). Additionally, the resources governing relationships and interactions between these entities have also been covered. We have also highlighted how many of these resources hold great potential for inclusion in drug discovery knowledge graphs.

There are many primary data sources which capture more information about key entities within drug discovery than just relational interactions. UniProtKB, for example, details numerous sequence and functional properties of proteins which may not be captured by relations alone. However, thus far, this wealth of information is relatively untapped and could be used to greatly enrich a graph with more domain knowledge. Of course this would come at the potential cost of some level of manual feature engineering being required – an often complicated, domain specific and iterative process by itself, and one that much of the research into representation learning is attempting to avoid [8, 91, 43]. It is also interesting to note that many of the resources detailed here are already provided in some form that is amenable for ingestion into a knowledge graph – either as edgelists or by providing RDF versions. This reduces the complexity of incorporating the resources, as any issues arising from the parsing and formatting process are avoided. There are also various integrator resources available, like Reactome and PubChem, which aggregate other primary datasets into one location. Whilst some caution is perhaps needed around the potential for replicated knowledge, they offer a way for machine learning researchers to incorporate a diverse set of resources. Finally, there are resources specific to drug discovery, such as OpenTargets and Pharos, which have thus far not been incorporated into any public knowledge graph. However, they are not currently provided in a format enabling easy incorporation into a knowledge graph, meaning that some manual conversion process is required.

This section highlights the few existing knowledge graph datasets covering various aspects of the drug discovery process. These datasets often comprise graphs extracted from resources covering more primary information on the various relevant entities and relations. These datasets are interesting as they could form a good initial starting point for ML practitioners looking to test algorithms on suitable knowledge graphs. A selection of the most relevant resources is summarised in Table 8.

KG Dataset         Link                                          Entities  Relations  Entity Types  Relation Types
Hetionet [53]      https://het.io/                               47K       2.2M       11            24
DRKG [60]          https://github.com/gnn4dr/DRKG                97K       5.7M       13            107
BioKG [140]        https://github.com/dsi-bdi/biokg              -         -          -             -
PharmKG [153]      https://github.com/MindRank-Biotech/PharmKG   -         -          -             -
OpenBioLink [12]   https://zenodo.org/record/3834052             -         -          -             -

Remaining columns (values not recoverable): Contains Features, Constituent Datasets, Version Info, Last Update.

Table 8: Pre-existing knowledge graphs suitable for use in various drug discovery applications.

This section details graphs which we feel meet the criteria to be considered full knowledge graphs.

Hetionet V1.0

One of the first attempts to create a holistic knowledge graph suitable for various tasks within drug discovery was Hetionet [53]. Hetionet was developed as part of project Rephetio, a study looking at drug repurposing through the use of knowledge graph-based approaches. Hetionet integrated data from 29 public databases and contains over 47K vertices, representing 11 entity types including genes, drugs and diseases. These are interlinked with over 2.2M edges of 24 different relation types. The graph is publicly available (https://het.io/) and is provided as a Neo4j [51] dump, as well as in other formats including JSON and edge list. Additionally, Hetionet has started to be used as a benchmark dataset within the machine learning community [81, 12].

Hetionet contains over 18K gene, 1.5K drug and 120 disease entities, with other entities including drug side effects, biological processes and pathways. These are taken from popular data sources including Entrez Gene [84], DrugBank [146], DisGeNET [109], Reactome [62] and Gene Ontology [28], among others. The thresholds or quality scores for the various edges are not included in the graph; instead, the preselected values are detailed in the accompanying paper [53].

At the time of writing, Hetionet has not been updated since 2017, although many of its constituent datasets have continued to evolve. A project called the Scalable Precision Medicine Oriented Knowledge Engine (SPOKE) [98] looks to update Hetionet with extra data sources held by the University of California, San Francisco (UCSF). However, to date, this updated resource has not been made publicly available, thus it has been excluded from our review. Additionally, Hetionet does not contain vertex or edge level features within the graph, perhaps limiting its suitability for input into certain model types without a further enrichment process.

Drug Repurposing Knowledge Graph (DRKG)

The Drug Repurposing Knowledge Graph (DRKG) [60] is a resource which builds upon Hetionet by integrating several additional data resources and was originally developed as part of a project for drug repurposing to target suitable treatments for COVID-19 [61]. The dataset is closely aligned with the Deep Graph Library (DGL) package for graph-based machine learning [141], with pre-trained embeddings from the package being provided with the dataset. The data is publicly available, provided in edgelist format: https://github.com/gnn4dr/DRKG.

Being based upon Hetionet, all of its data sources are also included within DRKG, with the graph being enriched primarily with recent COVID-19 related data from STRING [125], DrugBank [146] and GNBR [106], among others. These extra datasets result in DRKG having more than 97K vertices of 13 entity types, connected via 5.7M edges of 107 relation types. This covers over 39K unique gene, 24K drug and 5K disease entities, as well as others. However, DRKG shares Hetionet's lack of entity or relation level features.

BioKG

BioKG is a project for integrating various biomedical resources and creating a knowledge graph from them [140]. As part of the project, various tools are provided to enable a simplified knowledge graph construction process. BioKG contains over 105K entities of 10 different types in the primary graph; this includes over 13K drug, 122K protein (many of which are isolated vertices) and 4K disease entities. These are connected via over 2M relations of 17 unique types. A public pre-made version of the graph, as well as the code for building it, can be found here - https://github.com/dsi-bdi/biokg.

The data which makes up BioKG is taken from 13 different data sources, including UniProt [3], Reactome [62], OMIM [49] and Gene Ontology [28]. One interesting aspect of BioKG is that a small number of categorical features are provided with some of the entities. For example, drug entities are enriched with information pertaining to any associated negative side effects.

PharmKG

The PharmKG project had the goal of designing a high quality general purpose knowledge graph and associated graph neural network based model for use within the drug discovery domain [153]. Compared to others highlighted in this section, the PharmKG graph is fairly compact, containing 7.6K entities of 3 types: chemical, gene and disease. These are connected via 500K relations of 29 different types. The constructed graph is compared against a filtered version of Hetionet and is shown to produce better predictions for both drug repositioning and gene prioritisation tasks [153].

The data contained within PharmKG is initially integrated from 7 sources: OMIM [49], DrugBank [147], PharmGKB [144], Therapeutic Target Database (TTD) [25], SIDER [72], HumanNet [59] and GNBR [106]. A filtering process is then applied to ensure that only high quality knowledge is kept, for example by only including well studied genes. One of the most distinctive aspects of the PharmKG graph is that numerical features are provided with all the entities, thus enabling the use of more complex graph-specific models. Such features include chemical connectivity and other physicochemical features for the chemical entities, the use of BioBERT [74] to create features for the disease entities and a reduced expression matrix to create a feature vector for gene entities. The unfiltered PharmKG graph, as well as model code, is available to download here - https://github.com/MindRank-Biotech/PharmKG. However, at the time of writing, neither the filtered graph nor the entity feature vectors have been released.

OpenBioLink

OpenBioLink is a project to allow for easier and fairer comparison of knowledge graph completion approaches for the biomedical domain [12]. As part of the project, a benchmark knowledge graph has been created covering aspects of the drug discovery landscape. The OpenBioLink graph contains more than 184K vertices of 7 different entity types, including 19K gene, 77K drug, 9K disease and 2K pathway entities. Over 4.7M edges of 30 different relation types connect these entities. The dataset is publicly available and is provided in edgelist and RDF formats - https://zenodo.org/record/3834052.

The data is taken from 17 datasets including STRING [125], DisGeNET [109], Gene Ontology [28], CTD [33], Human Phenotype Ontology [70], SIDER [72] and KEGG [65], among other resources. Of interest is that OpenBioLink contains additional true negatives for a selection of relation types, meaning that these relations were explicitly reported not to exist. This can be used to avoid the issues inherent with the choice of negative sampling strategy when training KG embedding models [152].

This section details graphs which do not meet the full criteria for being classified as knowledge graphs.

Stanford Biomedical Network Dataset Collection (BioSNAP)

The Stanford Biomedical Network Dataset Collection (BioSNAP), unlike the resources discussed thus far, is not a single graph; instead, it is a collection of graphs, similar to the more general Stanford Network Analysis Project (SNAP) dataset repository [75]. Compared to the resources discussed thus far, many of the graphs in BioSNAP are large, with the interspecies protein-protein interaction network alone containing over 1.8B edges. Each individual graph in BioSNAP only contains edges between two entity types, for example gene-gene or gene-disease; however, these graphs could be joined to create a more complete resource. Overall, 10 unique entity types are represented including genes, proteins, diseases and drugs. These entities and corresponding relations are extracted from biological data sources including Disease Ontology [119], CTD [33], SIDER [72], STRING [125] and Ensembl [30], among others. All of the BioSNAP graphs are provided here - https://snap.stanford.edu/biodata/.

One interesting aspect is the inclusion of a small number of features for a selection of disease, side-effect and gene entities. This includes disease class information, pre-extracted network motifs [92], text-based synopses and structural pathway measures. Despite this appealing aspect, it is arguable whether BioSNAP should be treated as a true knowledge graph, primarily because each of its graphs lacks multiple relation types.

BioGrakn

BioGrakn is a biomedical knowledge project released as part of the GRAKN.AI graph analytics platform [90]. BioGrakn is a collection of different graphs, with the two most relevant being the precision medicine and disease focused graphs. BioGrakn is available for download as a pre-populated Grakn instance here - https://github.com/graknlabs/biograkn.

For example, the current precision medicine knowledge graph contains 845K vertices of six entity types including gene, drug and disease, which are connected via 14 different relation types. Due to the way relations are modelled in Grakn, an exact edge count is not directly comparable with other approaches. The data for the various graphs is collected from 17 different resources including UniProt [3], DisGeNET [109], Reactome [62] and the Human Protein Atlas [134], among others. One thing to note is that a user must use the Grakn package to interact with the graph, as no other file formats are provided.

Bio2RDF

Bio2RDF was an earlier project based on integrating disparate biological datastores [7], using technologies from the Semantic Web stack such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL) [2]. Bio2RDF is open-source and available for download here - https://bio2rdf.org/. It incorporates data from 35 separate sources including KEGG [65], Reactome [62] and CTD [33], among others. Due to its disparate nature, obtaining a complete breakdown of the various entity and relation types is challenging; however, the total Bio2RDF graph is over 10B triples. It should also be noted that, as of the time of writing, the latest release of Bio2RDF was version 3, which dates from 2014 [36], with many of the constituent datasets having been updated since. It is also not explicitly focused towards the drug discovery domain and does not curate the resources in any way.

NDEx

In a similar vein to BioSNAP, the Network Data Exchange (NDEx) is a collection of different biological graphs, many of which are pertinent for drug discovery, hosted in a single location [111]. Users are able to submit their graphs to be made available for the community through a common framework and access protocols – with a REST endpoint provided, as well as a Cytoscape plugin [122]. Example datasets already made available through NDEx include STRING [125] and DisGeNET [109]. NDEx could be a way to help improve the reproducibility of the field by allowing authors to share the graphs used in their publications in an accessible manner.

Knowledge Base Of Biomedicine (KaBOB)

The Knowledge Base Of Biomedicine (KaBOB) was a project to develop a common ontology, using similar Semantic Web technologies to Bio2RDF, with a goal of closer integration to allow for easier querying across various biomedical datasets [82]. KaBOB is an open source toolkit for converting biomedical data into RDF format using a common ontology and is available here: https://github.com/drlivingston/kabob. As such, KaBOB does not provide the dataset as a whole; instead, a user must build it using the appropriate tool, which provides parsers for resources such as UniProt [3], DrugBank [146] and Reactome [62]. One advantage of this approach is that it would allow the latest data from the various resources to be included with ease.

Phenotype Knowledge Translator (PheKnowLator)

Like KaBOB, the Phenotype Knowledge Translator (PheKnowLator) aims to build a framework for easier biomedical knowledge graph construction [16]. The project is available for download here - https://github.com/callahantiff/PheKnowLator. At the time of writing, the project is still under active development and awaiting the first true public release.

When looking at these existing knowledge graph resources as a whole, we can identify the following potential shortcomings:

• Repeated Dataset Use - Many of the existing knowledge graphs reuse the same underlying datasets - perhaps highlighting these as crucial and trusted resources for that particular entity type.

• Lack of Features - Almost all of these graphs (with the exception of the limited selection available in BioSNAP) do not provide any feature information on the entities or relations.

• No Dataset Version Information - Many of the resources do not detail from which version or year of a certain dataset the information has been collected. Additionally, resources are not kept up to date as the primary data continues to evolve.

• Lack of Detailed Creation Process - Many of the resources could provide more details on the graph creation process. For example, DRKG has included additional DrugBank information on top of what was present in Hetionet; however, it is not clear if this was to update the information or add additional drugs.

• Threshold Information - Much of the underlying information in the primary resources is uncertain and contains various metrics upon which the presence of an edge between two entities could be decided or not. Many of the graphs do not include this information or have already been thresholded.
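To illustrate the threshold issue, the following Python sketch keeps the confidence score as an edge attribute and applies a cut-off only at the point where a binary graph is needed. The entity names, the scores and the cut-off values are made up for illustration and do not come from any of the resources discussed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredEdge:
    head: str
    relation: str
    tail: str
    score: float  # confidence from the primary resource (assumed to lie in [0, 1])


def apply_threshold(edges, min_score):
    """Return only the edges whose score is at least min_score."""
    return [edge for edge in edges if edge.score >= min_score]


# Toy edges for illustration only.
edges = [
    ScoredEdge("gene_1", "associated_with", "disease_1", 0.92),
    ScoredEdge("gene_2", "associated_with", "disease_1", 0.41),
    ScoredEdge("gene_3", "associated_with", "disease_2", 0.70),
]

strict = apply_threshold(edges, 0.7)   # keeps gene_1 and gene_3
lenient = apply_threshold(edges, 0.4)  # keeps all three edges
```

Because the scores stay attached to the edges, the same graph can be re-thresholded for different tasks, which is not possible once a resource has been distributed in an already thresholded form.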

Thus far, we have introduced the concept of a biomedical knowledge graph, highlighted how such a resource could be utilised to improve various key tasks within the drug discovery field and detailed some of the primary suitable data sources. In this section we highlight three exemplar case studies from the literature, where relational graph data, combined with graph-specific neural models, have already been exploited in the drug discovery domain. These approaches are detailed in Table 9. This application focus is in contrast to work like Hetionet [53], where data integration and knowledge graph construction is the primary goal.

It should be noted that the number of published works which meet these criteria is currently limited. Indeed, one of the primary motivations of this work is to help practitioners become more familiar with the domain and to increase interest from the community.

Approach        Domain                   Model                                              Prediction Task   Entities  Relations  Entity Types  Relation Types  Num Datasets in Graph
Decagon [156]   Drug-Drug Interactions   Relational GCN with tensor factorisation decoder   Link Prediction   19.6K     5.3M       2             964             -

Table 9: An overview of drug discovery related approaches in the literature employing the use of knowledge graphs.

Polypharmacy Prediction

The problem of adverse side effects that arises through the use of polypharmacy (the use of more than one drug simultaneously to treat one or multiple conditions) has been modelled through the use of a knowledge graph and a novel GNN-based model [156]. The model, entitled Decagon, encodes a heterogeneous graph comprising drug and protein vertices linked through edges detailing adverse effects between drug pairs and drug-protein interactions. The model encoder, similar to a Relational Graph Convolutional Network (RGCN) [116], uses a separate parameter matrix for each edge type in the graph to learn relation-aware vertex level embeddings. These embeddings are then input into a tensor factorisation-based decoder to directly predict potential negative drug-drug interactions via the task of link prediction. The presented results show that, compared with non-graph specific and homogeneous models, Decagon is better able to predict existing, and even propose novel, drug-drug interactions.

The knowledge graph constructed for the research contained only two vertex types: drug (over 900 unique entities) and protein (over 19K unique entities) [156]. These are linked through various edge types, with 964 unique edge types between drug-drug vertex pairs used to represent the various types of adverse side effects and a single edge type used to represent drug-protein and protein-protein interactions. The data for the graph was taken from protein centric databases like BioGRID [104], STRING [125] and STITCH [126], as well as drug centric resources like SIDER [72], OFFSIDES and TWOSIDES [128], with much of the processed data used in this work being available as part of the BioSNAP project. Additionally, the authors enrich the graph with features on only the drug vertices, containing descriptive single drug side-effect information, which can be incorporated through the encoder model. It is interesting to note that the authors chose to use a relatively focused graph, containing just two vertex types, but were still able to exploit the multitude of additional edge types to increase performance. Additionally, a new method was developed, designed specifically around the structure of the data, highlighting how tightly integrated graphs and the models that operate on them are.
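The idea of a separate parameter matrix per edge type can be sketched in a few lines of plain Python. This is a simplified RGCN-style message passing step written for illustration, not the Decagon implementation; the node names, relation names, weight matrices and the mean normalisation are all assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rgcn_layer(features, edges, rel_weights, self_weight):
    """One relation-aware message passing step.

    features:    dict mapping node -> feature vector
    edges:       list of (source, relation, target) triples
    rel_weights: dict mapping relation -> weight matrix (one per edge type)
    self_weight: weight matrix applied to each node's own features
    """
    # Group incoming neighbours by (target, relation).
    incoming = {}
    for src, rel, dst in edges:
        incoming.setdefault((dst, rel), []).append(src)

    out = {}
    for node, feat in features.items():
        total = matvec(self_weight, feat)  # self-connection
        for rel, weight in rel_weights.items():
            neighbours = incoming.get((node, rel), [])
            for src in neighbours:
                message = matvec(weight, features[src])
                # Average the messages received under this relation.
                total = [t + m / len(neighbours) for t, m in zip(total, message)]
        out[node] = [max(0.0, t) for t in total]  # ReLU non-linearity
    return out


# Toy graph: two drugs and one protein, with hand-picked weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
features = {"drug_a": [1.0, 0.0], "drug_b": [0.0, 1.0], "protein_p": [1.0, 1.0]}
edges = [
    ("drug_b", "side_effect_1", "drug_a"),
    ("protein_p", "interacts_with", "drug_a"),
]
rel_weights = {
    "side_effect_1": [[2.0, 0.0], [0.0, 2.0]],     # doubles incoming messages
    "interacts_with": [[-1.0, 0.0], [0.0, -1.0]],  # negates incoming messages
}
embeddings = rgcn_layer(features, edges, rel_weights, identity)
```

In a trained model the weight matrices would be learned rather than fixed, and with 964 side-effect edge types the number of parameters grows quickly, which is one reason relational models often share or factorise these matrices.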

Recent work has explored the task of Drug-Target Interaction (DTI) prediction using data represented as a knowledge graph containing existing protein and drug compound interactions [95]. The authors propose that DTI prediction can be formulated as a link-prediction task on this graph and introduce a model entitled TriModel in order to accomplish this. Similar to other Knowledge Graph Embedding (KGE) approaches [143], TriModel learns an embedding for all entities and relations in the graph by optimising the parameters such that true triplets are scored more highly than randomly sampled negatives. TriModel demonstrates superior performance over various traditional and non-relational graph methods, perhaps highlighting the importance of complex multi-relational information in generating accurate predictions.

A variety of different data sources were included in the knowledge graphs, which have been made publicly available. This includes two existing benchmark graphs: Yamanishi08 [149] (containing known DTIs from KEGG BRITE [66], BRENDA [117], SuperTarget [45] and DrugBank [147]) and DrugBankFDA [147]. The authors further enrich these graphs with additional relations from other databases including KEGG MEDICUS [65], InterPro [93] and UniProt [3]. This enrichment process highlights that knowledge graphs can be easily expanded with additional information, but also that, conversely, researchers are still required to do this process manually as no universal resource exists.

The crucial task of gene prioritisation, otherwise known as disease target identification (detailed in Section 2.1), has been addressed via the use of a knowledge graph [105]. The overall approach, entitled Rosalind, details the construction of a knowledge graph and the choosing of a suitable model with which to make predictions. The work proposes that the disease target identification problem can be modelled as a link prediction task, where the prediction of an edge between a disease and a gene entity would indicate a possible association between the two. The model chosen for the work is the ComplEx tensor factorisation approach [132]. The evaluation of the approach demonstrates that it outperforms competing methods, including OpenTargets [18], by as much as an extra 20% of recall when predicting potential gene-disease relationships over 198 diseases.

The Rosalind knowledge graph is constructed from many of the datasets detailed in this review. For example, the graph incorporates disease information (explored in Section 5.4) from resources like DisGeNET, OMIM and the GWAS Catalog, interaction information (detailed in Section 5.3.1) from BioGRID, pathway information (explored in Section 5.3.2) from Reactome and compound information (Section 5.5) from ChEMBL. Of note in this work is that it captures some of the subtlety around disease-gene prioritisation, as ideally the model would predict which genes have some causal effect on the disease, not just an association. In Rosalind, two different types of edge are used between disease and gene entities – one indicating biological association and the other therapeutic association (meaning that a drug exists targeting the gene to help alleviate the disease). Model performance is evaluated only on this therapeutic edge type, ensuring the model is making useful predictions. Additionally, results are presented on a time-sliced graph, where the model is trained on historical data and predictions are made on future edges. This is attempting to replicate the task we would ideally want performed: using the currently available knowledge to predict currently unknown information, in this case unknown relationships between genes and diseases. However, to date, the authors have not released the graph used in their work, making reproducibility challenging.
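For readers unfamiliar with ComplEx, its scoring function is the real part of a three-way product between complex-valued embeddings of the head, the relation and the conjugated tail. The Python sketch below ranks candidate genes for a disease with that function. The embedding values and entity names are toy assumptions chosen by hand, not learned parameters or anything taken from Rosalind.

```python
def complex_score(head, relation, tail):
    """ComplEx score: real part of sum_k head[k] * relation[k] * conj(tail[k])."""
    return sum(h * r * t.conjugate() for h, r, t in zip(head, relation, tail)).real


# Toy two-dimensional complex embeddings (illustrative values only).
disease = [1 + 1j, 0 + 2j]
therapeutic = [1 + 0j, 1 + 0j]  # hypothetical "therapeutic association" relation
candidates = {
    "gene_a": [1 + 1j, 0 + 2j],
    "gene_b": [-1 - 1j, 0 - 2j],
}

# Rank candidate genes by their score for (disease, therapeutic, gene).
ranking = sorted(
    candidates,
    key=lambda name: complex_score(disease, therapeutic, candidates[name]),
    reverse=True,
)
```

Here `gene_a` is ranked first because its embedding aligns with the disease embedding under the relation. Because of the conjugate on the tail, the score is not symmetric in head and tail in general, which lets the model represent directed relations.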

The use of complex knowledge graphs, combined with machine learning techniques, has the potential to help solve key challenges in the field of drug discovery, with promising early applications already being demonstrated in the tasks of drug repositioning, drug-drug interactions and gene prioritisation. In this review we have presented an overview of the various key related datasets which could provide some of the fundamental building blocks for a hypothetical drug discovery knowledge graph. The review has also detailed and evaluated the range of pre-existing knowledge graphs in the domain, which could already be used directly as input into graph-based models to begin to solve drug discovery problems. Additionally, we have highlighted the many pitfalls and challenges of working with drug discovery-based data and signposted key issues which machine learning practitioners should consider when choosing suitable sources.

Our hope is that this review of suitable data sources, combined with recent works evaluating graph-specific machine learning models in the context of drug discovery [41], can help guide researchers from across the knowledge graph mining and machine learning fields in applying state-of-the-art techniques in the field. Overall, we hope this review can serve as a catalyst in making the drug discovery domain more accessible, sparking new thought and innovation, whilst allowing researchers to more easily address key tasks within the domain, ultimately helping to improve and extend human life through new medicines.

Whilst there has been significant progress made in the field, there are still numerous open challenges and issues that could be addressed. In this section, we detail major areas still in need of improvement, which could help produce better drug discovery knowledge graphs. We build upon many of the challenges of working with drug discovery data we established in Section 2.4 and the issues with pre-existing knowledge graphs in Section 6.3.

Graph Composition.

Constructing a useful knowledge graph for use in the drug discovery domain is still a challenging problem, especially when being performed by non-domain experts. Many choices must be made when transforming a data source into a graph, especially if it is not relational by nature. Here, however, there is great scope for interdisciplinary collaborations between domain scientists and knowledge graph and machine learning researchers. Additionally, we would like to see more high quality pre-constructed knowledge graphs, designed and validated by domain scientists, be made available for use by researchers.

Data Value.

The availability of massive datasets has been partially credited with enabling the success of recent neural network models in areas such as computer vision [34]. It might be tempting then to incorporate as much knowledge and data as possible into a drug discovery knowledge graph. However, much work still needs to be done in assessing the benefit of incorporating different data modalities. The consideration of value can also be extended to a financial viewpoint: data collection, storage and processing can be expensive, especially if larger datasets do not improve performance in the task of interest. Another question is whether a single super graph should be created, which attempts to capture all knowledge around drug discovery, or whether smaller, more task specific, representations enable better predictions overall.

Better Metadata.

As highlighted throughout the review, many of the core data resources are typically updated and refined at frequent intervals. However, many of the pre-existing knowledge graphs do not capture exactly which version of each resource was used during their construction. Storing this information might allow for better reproducibility, as well as for measuring any change in predictive performance as the underlying knowledge is updated over time. Improved metadata could also capture whether a relationship was taken from an expert curated source or from an automated pipeline. Additionally, graphs could provide common alternative identifiers (for example including both Entrez GeneID and Ensembl IDs for gene entities) as properties to enable easier incorporation of additional resources into the graph.

Incorporation of Features.

Typically many existing knowledge graphs are provided as little more than edge lists, with models trying to make predictions using this relational information alone. Throughout this review, we have attempted to highlight where data resources may be used to add additional features for entities and relations. However, it is easier to imagine suitable features for certain entities (proteins and chemicals for example, where structural information could be incorporated) than others. Additionally, any potential benefits of incorporating these extra features would need to be assessed fairly. Nevertheless, we feel that there is scope for the incorporation of features to enable graph-specific neural models to be better exploited in the domain, with some recent promising work being demonstrated in the literature [153].

Addressing Bias.

Many biases will be present in a drug discovery knowledge graph, and any model trained upon it may have its predictive performance skewed away from under-represented but potentially crucial relationships. Even manually curated resources may incur bias from the person performing the curation. Practitioners should be aware of these issues, and steps could be taken to mitigate them by, for example, reweighting the model training process. Additionally, users could consider removing over-represented entities if they are confident that they are not required in the area of study. The lack of true negative samples in many graphs also means that the negative sampling strategy employed can bias the results. Recent inclusion of true negative samples in a benchmark graph [12] is encouraging; however, where they are not possible to collect, more domain aware sampling strategies should be investigated.
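One simple form of reweighting, shown here only as a sketch, is to weight each training triple inversely to the frequency of its relation type so that rare relations contribute more to the loss. The relation names and counts are toy values, and inverse-frequency weighting is an illustrative choice rather than a method proposed in the works reviewed.

```python
from collections import Counter


def inverse_frequency_weights(triples):
    """Weight each relation type by total / (num_types * count)."""
    counts = Counter(rel for _, rel, _ in triples)
    total = sum(counts.values())
    return {rel: total / (len(counts) * n) for rel, n in counts.items()}


# Toy triples: a common relation type and a rare one.
triples = [
    ("protein_1", "interacts", "protein_2"),
    ("protein_2", "interacts", "protein_3"),
    ("protein_1", "interacts", "protein_3"),
    ("drug_a", "treats", "disease_x"),
]
weights = inverse_frequency_weights(triples)

# Per-triple weights that could multiply each triple's loss term.
sample_weights = [weights[rel] for _, rel, _ in triples]
```

With this scheme the weights average to one over the training set, so the overall scale of the loss is unchanged while the rare `treats` relation is weighted three times as heavily as `interacts`.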

Fair Evaluation.

Due to the combinatorial ingestion process used to construct knowledge graphs, it is common for edges to be duplicated if the relationship is captured in more than one underlying source. If not accounted for, this can leak test edges into the training set when train/test splits are created for evaluation. Further, the presence of trivial inverse relationships, many of which may be present, can also skew performance metrics [131]. It may also be more useful to assess model performance on more biologically meaningful data splits, for example by splitting on disease or protein family. It could help move the field forward if meaningful splits for key tasks within drug discovery could be created by experts and made available for public use.
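A minimal Python sketch of these two checks is given below: duplicated triples are removed before splitting, and test triples that are recoverable from the training set through a known inverse relation are filtered out. The triples and the `inverse_of` mapping are hypothetical examples.

```python
def deduplicate(triples):
    """Drop repeated (head, relation, tail) triples, keeping first-seen order."""
    return list(dict.fromkeys(triples))


def remove_leaky_test_triples(train, test, inverse_of):
    """Remove test triples present in train directly or via an inverse relation.

    inverse_of maps a relation name to the name of its inverse (assumed known).
    """
    train_set = set(train)
    clean = []
    for head, rel, tail in test:
        if (head, rel, tail) in train_set:
            continue  # exact duplicate of a training edge
        inv = inverse_of.get(rel)
        if inv is not None and (tail, inv, head) in train_set:
            continue  # trivially recoverable from the inverse training edge
        clean.append((head, rel, tail))
    return clean


# Toy triples for illustration only.
raw = [
    ("drug_a", "treats", "disease_x"),
    ("drug_a", "treats", "disease_x"),      # same fact from a second source
    ("disease_x", "treated_by", "drug_a"),  # inverse of the first triple
    ("drug_b", "treats", "disease_y"),
]
triples = deduplicate(raw)
train, test = triples[:1], triples[1:]
inverse_of = {"treats": "treated_by", "treated_by": "treats"}
clean_test = remove_leaky_test_triples(train, test, inverse_of)
```

In practice the split would be random or, as suggested above, grouped by disease or protein family; the fixed slice here only keeps the example deterministic.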

Uncertainty.

So much of the data represented in a biological knowledge graph is uncertain, either due to the nature of the experiment that generated it, or because it has been automatically mined from the literature. Yet this uncertainty is rarely represented in the graph itself, perhaps leading to a false sense of trust being created by the presence of certain relationships. We feel that more should be done to incorporate any uncertainty directly inside the knowledge graph. This could allow methods to directly learn from this information, thus creating better and more robust predictions.

Reproducibility.

As in many areas of machine learning [80, 31, 39], reproducibility of results is still a major issue in the knowledge graph field [1]. It is common for many papers to publish results without also providing the exact graph constructed to generate them. We believe further improvements in this area are essential for continued development in the field.

Acknowledgement

The authors would like to thank Manasa Ramakrishna, Ufuk Kirik, Natalie Kurbatova, Elizaveta Semenova and Claus Bendtsen for help and feedback throughout the preparation of this manuscript. Stephen Bonner is a fellow of the AstraZeneca postdoctoral program.

References

[1] Mehdi Ali, Max Berrendorf, Charles Tapley Hoyt, Laurent Vermue, Mikhail Galkin, Sahand Sharifzadeh, Asja Fischer, Volker Tresp, and Jens Lehmann. Bringing light into the dark: A large-scale evaluation of knowledge graph embedding models under a unified framework. arXiv preprint arXiv:2006.13365, 2020.
[2] Grigoris Antoniou and Frank Van Harmelen. A semantic web primer. MIT press, 2004.
[3] Rolf Apweiler, Amos Bairoch, Cathy H Wu, Winona C Barker, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang, Rodrigo Lopez, Michele Magrane, et al. Uniprot: the universal protein knowledgebase. Nucleic acids research, 32(suppl 1):D115–D119, 2004.
[4] Maryam Bagherian, Elyas Sabeti, Kai Wang, Maureen A Sartor, Zaneta Nikolovska-Coleska, and Kayvan Najarian. Machine learning approaches and databases for prediction of drug–target interaction: a survey paper. Briefings in bioinformatics, 2020.
[5] Albert-László Barabási, Natali Gulbahce, and Joseph Loscalzo. Network medicine: a network-based approach to human disease. Nature reviews genetics, 12(1):56–68, 2011.
[6] Albert-Laszlo Barabasi and Zoltan N Oltvai. Network biology: understanding the cell’s functional organization. Nature reviews genetics, 5(2):101–113, 2004.
[7] François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Jean Morissette. Bio2rdf: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics, 41(5):706–716, 2008.
[8] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
[9] Joao H Bettencourt-Silva, Natasha Mulligan, Charles Jochim, Nagesh Yadav, Walter Sedlazek, Vanessa Lopez, and Martin Gleize. Exploring the social drivers of health during a pandemic: Leveraging knowledge graphs and population trends in covid-19. Studies in Health Technology and Informatics, 275:6–11, 2020.
[10] Maria-Jesus Blanco and Kevin M Gardinier. New chemical modalities and strategic thinking in early drug discovery, 2020.
[11] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795, 2013.
[12] Anna Breit, Simon Ott, Asan Agibetov, and Matthias Samwald. Openbiolink: A benchmarking framework for large-scale biomedical link prediction. Bioinformatics, 2020.
[13] Adam S Brown and Chirag J Patel. A standard database for drug repositioning. Scientific data, 4(1):1–7, 2017.
[14] Garth R Brown, Vichet Hem, Kenneth S Katz, Michael Ovetsky, Craig Wallin, Olga Ermolaeva, Igor Tolstoy, Tatiana Tatusova, Kim D Pruitt, Donna R Maglott, et al. Gene: a gene-centered information resource at ncbi. Nucleic acids research, 43(D1):D36–D42, 2015.
[15] Annalisa Buniello, Jacqueline A L MacArthur, Maria Cerezo, Laura W Harris, James Hayhurst, Cinzia Malangone, Aoife McMahon, Joannella Morales, Edward Mountjoy, Elliot Sollis, et al. The nhgri-ebi gwas catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic acids research, 47(D1):D1005–D1012, 2019.
[16] Tiffany J Callahan, Ignacio J Tripodi, Lawrence E Hunter, and William A Baumgartner. A framework for automated construction of heterogeneous large-scale biomedical knowledge graphs. bioRxiv, 2020.
[17] Tiffany J Callahan, Ignacio J Tripodi, Harrison Pielke-Lombardo, and Lawrence E Hunter. Knowledge-based biomedical data science. Annual Review of Biomedical Data Science, 3, 2020.
[18] Denise Carvalho-Silva, Andrea Pierleoni, Miguel Pignatelli, ChuangKee Ong, Luca Fumis, Nikiforos Karamanis, Miguel Carmona, Adam Faulconbridge, Andrew Hercules, Elaine McAuley, et al. Open targets platform: new developments and updates two years on. Nucleic acids research, 47(D1):D1056–D1065, 2019.
[19] Remzi Celebi, Huseyin Uyar, Erkan Yasar, Ozgur Gumus, Oguz Dikenelli, and Michel Dumontier. Evaluation of knowledge graph embedding approaches for drug-drug interaction prediction in realistic settings. BMC bioinformatics, 20(1):1–14, 2019.
[20] Ethan G Cerami, Benjamin E Gross, Emek Demir, Igor Rodchenkov, Özgün Babur, Nadia Anwar, Nikolaus Schultz, Gary D Bader, and Chris Sander. Pathway commons, a web resource for biological pathway data. Nucleic acids research, 39(suppl 1):D685–D690, 2010.
[21] George Cernile, Trevor Heritage, Neil J Sebire, Ben Gordon, Taralyn Schwering, Shana Kazemlou, and Yulia Borecki. Network graph representation of covid-19 scientific publications to aid knowledge discovery. BMJ Health & Care Informatics, 28(1), 2020.
[22] Chuming Chen, Hongzhan Huang, and Cathy H Wu. Protein bioinformatics databases and resources. In Protein Bioinformatics, pages 3–39. Springer, 2017.
[23] Ruolan Chen, Xiangrong Liu, Shuting Jin, Jiawei Lin, and Juan Liu. Machine learning for drug-target interaction prediction. Molecules, 23(9):2208, 2018.
[24] Xi Chen, Ming Liu, and Michael K Gilson. Bindingdb: a web-accessible molecular recognition database. Combinatorial chemistry & high throughput screening, 4(8):719–725, 2001.
[25] Xin Chen, Zhi Liang Ji, and Yu Zong Chen. Ttd: therapeutic target database. Nucleic acids research, 30(1):412–415, 2002.
[26] Sarvenaz Choobdar, Mehmet E Ahsen, Jake Crawford, Mattia Tomasoni, Tao Fang, David Lamparter, Junyuan Lin, Benjamin Hescott, Xiaozhe Hu, Johnathan Mercer, et al. Assessment of network module identification across complex diseases. Nature methods, 16(9):843–852, 2019.
[27] Gene Ontology Consortium. The gene ontology (go) database and informatics resource. Nucleic acids research, 32(suppl 1):D258–D261, 2004.
[28] Gene Ontology Consortium. The gene ontology project in 2008. Nucleic acids research, 36(suppl 1):D440–D444, 2008.
[29] David Cook, Dearg Brown, Robert Alexander, Ruth March, Paul Morgan, Gemma Satterthwaite, and Menelas N Pangalos. Lessons learned from the fate of astrazeneca’s drug pipeline: a five-dimensional framework. Nature reviews Drug discovery, 13(6):419–431, 2014.
[30] Fiona Cunningham, M Ridwan Amode, Daniel Barrell, Kathryn Beal, Konstantinos Billis, Simon Brent, Denise Carvalho-Silva, Peter Clapham, Guy Coates, Stephen Fitzgerald, et al. Ensembl 2015. Nucleic acids research, 43(D1):D662–D669, 2015.
[31] Maurizio Ferrari Dacrema, Simone Boglio, Paolo Cremonesi, and Dietmar Jannach. A troubling analysis of reproducibility and progress in recommender systems research. ACM Transactions on Information Systems (TOIS), 39(2):1–49, 2021.
[32] Debasmita Das, Yatin Katyal, Janu Verma, Shashank Dubey, AakashDeep Singh, Kushagra Agarwal, Sourojit Bhaduri, and RajeshKumar Ranjan. Information retrieval and extraction on covid-19 clinical articles using graph community detection and bio-bert embeddings. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, 2020.
[33] Allan Peter Davis, Cynthia J Grondin, Robin J Johnson, Daniela Sciaky, Roy McMorran, Jolene Wiegers, Thomas C Wiegers, and Carolyn J Mattingly. The comparative toxicogenomics database: update 2019. Nucleic acids research, 47(D1):D948–D954, 2019.
[34] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[35] Daniel Domingo-Fernandez, Shounak Baksi, Bruce Schultz, Yojana Gadiya, Reagon Karki, Tamara Raschka, Christian Ebeling, Martin Hofmann-Apitius, et al. Covid-19 knowledge graph: a computable, multi-modal, cause-and-effect knowledge model of covid-19 pathophysiology. Bioinformatics, 09 2020.
[36] Michel Dumontier, Alison Callahan, Jose Cruz-Toledo, Peter Ansell, Vincent Emonet, François Belleau, and Arnaud Droit. Bio2rdf release 3: a larger connected network of linked data for the life sciences. In Proceedings of the 2014 International Conference on Posters & Demonstrations Track, volume 1272, pages 401–404. Citeseer, 2014.
[37] Christine Durinx, Jo McEntyre, Ron Appel, Rolf Apweiler, Mary Barlow, Niklas Blomberg, Chuck Cook, Elisabeth Gasteiger, Jee-Hyub Kim, Rodrigo Lopez, et al. Identifying elixir core data resources. F1000Research, 5, 2016.
[38] Steffen Schulze-Kremer. Ontologies for molecular biology. Computer and Information Science, 6(21), 2001.
[39] Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. A fair comparison of graph neural networks for graph classification. arXiv preprint arXiv:1912.09893, 2019.
[40] Liesbeth François, Jonathan van Eyll, and Patrice Godard. Dictionary of disease ontologies (dodo): a graph database to facilitate access and interaction with disease and phenotype ontologies. F1000Research, 9(942):942, 2020.
[41] Thomas Gaudelet, Ben Day, Arian R Jamasb, Jyothish Soman, Cristian Regep, Gertrude Liu, Jeremy BR Hayter, Richard Vickers, Charles Roberts, Jian Tang, et al. Utilising graph machine learning within drug discovery and development. arXiv preprint arXiv:2012.05716, 2020.
[42] Maya Ghoussaini, Edward Mountjoy, Miguel Carmona, Gareth Peat, Ellen M Schmidt, Andrew Hercules, Luca Fumis, Alfredo Miranda, Denise Carvalho-Silva, Annalisa Buniello, et al. Open targets genetics: systematic identification of trait-associated genes using large-scale genetics and functional genomics. Nucleic Acids Research, 2020.
[43] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[44] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864, 2016.
[45] Stefan Günther, Michael Kuhn, Mathias Dunkel, Monica Campillos, Christian Senger, Evangelia Petsalaki, Jessica Ahmed, Eduardo Garcia Urdiales, Andreas Gewiess, Lars Juhl Jensen, et al. Supertarget and matador: resources for exploring drug-target relationships. Nucleic acids research, 36(suppl 1):D919–D922, 2007.
[46] William L Hamilton. Graph representation learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 14(3):1–159.
[47] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 2017.
[48] William L Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in neural information processing systems, pages 1024–1034, 2017.
[49] Ada Hamosh, Alan F Scott, Joanna Amberger, David Valle, and Victor A McKusick. Online mendelian inheritance in man (omim). Human mutation, 15(1):57–61, 2000.
[50] Olaf Hartig and Jorge Pérez. Semantics and complexity of graphql. In Proceedings of the 2018 World Wide Web Conference, pages 1155–1164, 2018.
[51] Christian Theil Have and Lars Juhl Jensen. Are graph databases ready for bioinformatics? Bioinformatics, 29(24):3107, 2013.
[52] Henning Hermjakob, Luisa Montecchi-Palazzi, Chris Lewington, Sugath Mudali, Samuel Kerrien, Sandra Orchard, Martin Vingron, Bernd Roechert, Peter Roepstorff, Alfonso Valencia, et al. Intact: an open source molecular interaction database. Nucleic acids research, 32(suppl 1):D452–D455, 2004.
[53] Daniel Scott Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, and Sergio E Baranzini. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife, 6:e26726, 2017.
[54] Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard de Melo, Claudio Gutierrez, José Emilio Labra Gayo, Sabrina Kirrane, Sebastian Neumaier, Axel Polleres, et al. Knowledge graphs. arXiv preprint arXiv:2003.02320, 2020.
[55] Kanglin Hsieh, Yinyin Wang, Luyao Chen, Zhongming Zhao, Sean Savitz, Xiaoqian Jiang, Jing Tang, and Yejin Kim. Drug repurposing for covid-19 using graph neural network with genetic, mechanistic, and epidemiological validation. arXiv preprint arXiv:2009.10931, 2020.
[56] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
[57] Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, pages 2704–2710, 2020.
[58] James P Hughes, Stephen Rees, S Barrett Kalindjian, and Karen L Philpott. Principles of early drug discovery. British journal of pharmacology, 162(6):1239–1249, 2011.
[59] Sohyun Hwang, Chan Yeong Kim, Sunmo Yang, Eiru Kim, Traver Hart, Edward M Marcotte, and Insuk Lee. Humannet v2: human gene networks for disease research. Nucleic acids research, 47(D1):D573–D580, 2019.
[60] Vassilis N. Ioannidis, Xiang Song, Saurav Manchanda, Mufei Li, Xiaoqin Pan, Da Zheng, Xia Ning, Xiangxiang Zeng, and George Karypis. Drkg - drug repurposing knowledge graph for covid-19. https://github.com/gnn4dr/DRKG/, 2020.
[61] Vassilis N Ioannidis, Da Zheng, and George Karypis. Few-shot link prediction via graph neural networks for covid-19 drug-repurposing. arXiv preprint arXiv:2007.10261, 2020.
[62] Bijay Jassal, Lisa Matthews, Guilherme Viteri, Chuqiao Gong, Pascual Lorente, Antonio Fabregat, Konstantinos Sidiropoulos, Justin Cook, Marc Gillespie, Robin Haw, et al. The reactome pathway knowledgebase. Nucleic acids research, 48(D1):D498–D503, 2020.
[63] Simon Jupp, Thomas Liener, Sirarat Sarntivijai, Olga Vrousgou, Tony Burdett, and Helen E Parkinson. Oxo-a gravy of ontology mapping extracts. In ICBO, 2017.
[64] Minoru Kanehisa, Michihiro Araki, Susumu Goto, Masahiro Hattori, Mika Hirakawa, Masumi Itoh, Toshiaki Katayama, Shuichi Kawashima, Shujiro Okuda, Toshiaki Tokimatsu, et al. Kegg for linking genomes to life and the environment. Nucleic acids research, 36(suppl 1):D480–D484, 2007.
[65] Minoru Kanehisa, Susumu Goto, Miho Furumichi, Mao Tanabe, and Mika Hirakawa. Kegg for representation and analysis of molecular networks involving diseases and drugs. Nucleic acids research, 38(suppl 1):D355–D360, 2010.
[66] Minoru Kanehisa, Susumu Goto, Masahiro Hattori, Kiyoko F Aoki-Kinoshita, Masumi Itoh, Shuichi Kawashima, Toshiaki Katayama, Michihiro Araki, and Mika Hirakawa. From genomics to chemical genomics: new developments in kegg. Nucleic acids research, 34(suppl 1):D354–D357, 2006.
[67] Sunghwan Kim, Paul A Thiessen, Evan E Bolton, Jie Chen, Gang Fu, Asta Gindulyte, Lianyi Han, Jane He, Siqian He, Benjamin A Shoemaker, et al. Pubchem substance and compound databases. Nucleic acids research, 44(D1):D1202–D1213, 2016.
[68] Emily A King, J Wade Davis, and Jacob F Degner. Are drug targets with genetic support twice as likely to be approved? revised estimates of the impact of genetic support for drug mechanisms on the probability of drug approval. PLoS genetics, 15(12):e1008489, 2019.
[69] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
[70] Sebastian Köhler, Leigh Carmody, Nicole Vasilevsky, Julius O B Jacobsen, Daniel Danis, Jean-Philippe Gourdine, Michael Gargano, Nomi L Harris, Nicolas Matentzoglu, Julie A McMurry, et al. Expansion of the human phenotype ontology (hpo) knowledge base and resources. Nucleic acids research, 47(D1):D1018–D1027, 2019.
[71] Gautier Koscielny, Peter An, Denise Carvalho-Silva, Jennifer A Cham, Luca Fumis, Rippa Gasparyan, Samiul Hasan, Nikiforos Karamanis, Michael Maguire, Eliseo Papa, et al. Open targets: a platform for therapeutic target identification and validation. Nucleic acids research, 45(D1):D985–D994, 2017.
[72] Michael Kuhn, Ivica Letunic, Lars Juhl Jensen, and Peer Bork. The sider database of drugs and side effects. Nucleic acids research, 44(D1):D1075–D1079, 2016.
[73] Bohyun Lee, Shuo Zhang, Aleksandar Poleksic, and Lei Xie. Heterogeneous multi-layered network model for omics data integration and analysis. Frontiers in Genetics, 10:1381, 2020.
[74] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020.
[75] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
[76] Luana Licata, Leonardo Briganti, Daniele Peluso, Livia Perfetto, Marta Iannuccelli, Eugenia Galeota, Francesca Sacco, Anita Palma, Aurelio Pio Nardozza, Elena Santonico, et al. Mint, the molecular interaction database: 2012 update. Nucleic acids research, 40(D1):D857–D861, 2012.
[77] Yu Lin, Saurabh Mehta, Hande Küçük-McGinty, John Paul Turner, Dusica Vidovic, Michele Forlin, Amar Koleti, Dac-Trung Nguyen, Lars Juhl Jensen, Rajarshi Guha, et al. Drug target ontology to classify and integrate drug discovery data. Journal of biomedical semantics, 8(1):50, 2017.
[78] Mark A Lindsay. Target discovery. Nature Reviews Drug Discovery, 2(10):831–838, 2003.
[79] Carolyn E Lipscomb. Medical subject headings (mesh). Bulletin of the Medical Library Association, 88(3):265, 2000.
[80] Zachary C Lipton and Jacob Steinhardt. Troubling trends in machine learning scholarship. arXiv preprint arXiv:1807.03341, 2018.
[81] Yushan Liu, Marcel Hildebrandt, Mitchell Joblin, Martin Ringsquandl, and Volker Tresp. Integrating logical rules into neural multi-hop reasoning for drug repurposing. arXiv preprint arXiv:2007.05292, 2020.
[82] Kevin M Livingston, Michael Bada, William A Baumgartner, and Lawrence E Hunter. Kabob: ontology-based semantic integration of biomedical databases. BMC bioinformatics, 16(1):1–21, 2015.
[83] H Luo, M Li, M Yang, FX Wu, Y Li, and J Wang. Biomedical data and computational models for drug repositioning: a comprehensive review. Briefings in Bioinformatics, 2020.
[84] Donna Maglott, Jim Ostell, Kim D Pruitt, and Tatiana Tatusova. Entrez gene: gene-centered information at ncbi. Nucleic acids research, 33(suppl 1):D54–D58, 2005.
[85] James Malone, Ele Holloway, Tomasz Adamusiak, Misha Kapushesky, Jie Zheng, Nikolay Kolesnikov, Anna Zhukova, Alvis Brazma, and Helen Parkinson. Modeling sample variables with an experimental factor ontology. Bioinformatics, 26(8):1112–1118, 2010.
[86] Víctor Martínez, Fernando Berzal, and Juan-Carlos Cubero. A survey of link prediction in complex networks. ACM computing surveys (CSUR), 49(4):1–33, 2016.
[87] Yosef Masoudi-Sobhanzadeh, Yadollah Omidi, Massoud Amanlou, and Ali Masoudi-Nejad. Drug databases and their contributions to drug repurposing. Genomics, 112(2):1087–1095, 2020.
[88] Suresh Mathivanan, Balamurugan Periaswamy, TKB Gandhi, Kumaran Kandasamy, Shubha Suresh, Riaz Mohmood, YL Ramachandra, and Akhilesh Pandey. An evaluation of human protein-protein interaction data in the public domain. In BMC bioinformatics, volume 7, page S19. Springer, 2006.
[89] David Mendez, Anna Gaulton, A Patrícia Bento, Jon Chambers, Marleen De Veij, Eloy Félix, María Paula Magariños, Juan F Mosquera, Prudence Mutowo, Michał Nowotka, et al. Chembl: towards direct deposition of bioassay data. Nucleic acids research, 47(D1):D930–D940, 2019.
[90] Antonio Messina, Haikal Pribadi, Jo Stichbury, Michelangelo Bucci, Szymon Klarman, and Alfonso Urso. Biograkn: A knowledge graph-based semantic database for biomedical sciences. In Conference on Complex, Intelligent, and Software Intensive Systems, pages 299–309. Springer, 2017.
[91] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26:3111–3119, 2013.
[92] Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824–827, 2002.
[93] Alex L Mitchell, Teresa K Attwood, Patricia C Babbitt, Matthias Blum, Peer Bork, Alan Bridge, Shoshana D Brown, Hsin-Yu Chang, Sara El-Gebali, Matthew I Fraser, et al. Interpro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic acids research, 47(D1):D351–D360, 2019.
[94] Sameh K Mohamed, Aayah Nounu, and Vít Nováček. Biological applications of knowledge graph embedding models. Briefings in Bioinformatics, 2020.
[95] Sameh K Mohamed, Vít Nováček, and Aayah Nounu. Discovering protein drug targets using knowledge graph embeddings. Bioinformatics, 36(2):603–610, 2020.
[96] Paul Morgan, Dean G Brown, Simon Lennard, Mark J Anderton, J Carl Barrett, Ulf Eriksson, Mark Fidock, Bengt Hamren, Anthony Johnson, Ruth E March, et al. Impact of a five-dimensional framework on r&d productivity at astrazeneca. Nature reviews Drug discovery, 17(3):167, 2018.
[97] Christopher J Mungall, Julie A McMurry, Sebastian Köhler, James P Balhoff, Charles Borromeo, Matthew Brush, Seth Carbon, Tom Conlin, Nathan Dunn, Mark Engelstad, et al. The monarch initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic acids research, 45(D1):D712–D722, 2017.
[98] Charlotte A Nelson, Atul J Butte, and Sergio E Baranzini. Integrating biomedical research and electronic health records to create knowledge-based biologically meaningful machine-readable embeddings. Nature communications, 10(1):1–10, 2019.
[99] Matthew R Nelson, Hannah Tipney, Jeffery L Painter, Judong Shen, Paola Nicoletti, Yufeng Shen, Aris Floratos, Pak Chung Sham, Mulin Jun Li, Junwen Wang, et al. The support of human genetic evidence for approved drug indications. Nature genetics, 47(8):856–860, 2015.
[100] Behnam Neyshabur, Ahmadreza Khadem, Somaye Hashemifar, and Seyed Shahriar Arab. Netal: a new graph-based method for global alignment of protein–protein interaction networks. Bioinformatics, 29(13):1654–1662, 2013.
[101] Dac-Trung Nguyen, Stephen Mathias, Cristian Bologa, Soren Brunak, Nicolas Fernandez, Anna Gaulton, Anne Hersey, Jayme Holmes, Lars Juhl Jensen, Anneli Karlsson, et al. Pharos: Collating protein information to shed light on the druggable genome. Nucleic acids research, 45(D1):D995–D1002, 2017.
[102] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Icml, volume 11, pages 809–816, 2011.
[103] Tudor I Oprea, Cristian G Bologa, Søren Brunak, Allen Campbell, Gregory N Gan, Anna Gaulton, Shawn M Gomez, Rajarshi Guha, Anne Hersey, Jayme Holmes, et al. Unexplored therapeutic opportunities in the human genome. Nature reviews Drug discovery, 17(5):317, 2018.
[104] Rose Oughtred, Chris Stark, Bobby-Joe Breitkreutz, Jennifer Rust, Lorrie Boucher, Christie Chang, Nadine Kolas, Lara O’Donnell, Genie Leung, Rochelle McAdam, et al. The biogrid interaction database: 2019 update. Nucleic acids research, 47(D1):D529–D541, 2019.
[105] Saee Paliwal, Alex de Giorgio, Daniel Neil, Jean-Baptiste Michel, and Alix MB Lacoste. Preclinical validation of therapeutic targets predicted by tensor factorization on heterogeneous graphs. Scientific reports, 10(1):1–19, 2020.
[106] Bethany Percha and Russ B Altman. A global network of biomedical relationships derived from text. Bioinformatics, 34(15):2614–2624, 2018.
[107] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710, 2014.
[108] Janet Piñero, Núria Queralt-Rosinach, Alex Bravo, Jordi Deu-Pons, Anna Bauer-Mehren, Martin Baron, Ferran Sanz, and Laura I Furlong. Disgenet: a discovery platform for the dynamical exploration of human diseases and their genes. Database, 2015, 2015.
[109] Janet Piñero, Juan Manuel Ramírez-Anguita, Josep Saüch-Pitarch, Francesco Ronzano, Emilio Centeno, Ferran Sanz, and Laura I Furlong. The disgenet knowledge platform for disease genomics: 2019 update. Nucleic acids research, 48(D1):D845–D855, 2020.
[110] Sune Pletscher-Frankild, Albert Pallejà, Kalliopi Tsafou, Janos X Binder, and Lars Juhl Jensen. Diseases: Text mining and data integration of disease–gene associations. Methods, 74:83–89, 2015.
[111] Dexter Pratt, Jing Chen, David Welker, Ricardo Rivas, Rudolf Pillich, Vladimir Rynkov, Keiichiro Ono, Carol Miello, Lyndon Hicks, Sandor Szalma, et al. Ndex, the network data exchange. Cell systems, 1(4):302–305, 2015.
[112] Justin T Reese, Deepak Unni, Tiffany J Callahan, Luca Cappelletti, Vida Ravanmehr, Seth Carbon, Kent A Shefchek, Benjamin M Good, James P Balhoff, Tommaso Fontana, et al. Kg-covid-19: a framework to produce customized knowledge graphs for covid-19 response. Patterns, page 100155, 2020.
[113] Daniel J Rigden and Xosé M Fernández. The 27th annual nucleic acids research database issue and molecular biology database collection. Nucleic Acids Research, 48(D1):D1–D8, 2020.
[114] Peter N Robinson, Sebastian Köhler, Sebastian Bauer, Dominik Seelow, Denise Horn, and Stefan Mundlos. The human phenotype ontology: a tool for annotating and analyzing human hereditary disease. The American Journal of Human Genetics, 83(5):610–615, 2008.
[115] Igor Rodchenkov, Ozgun Babur, Augustin Luna, Bulent Arman Aksoy, Jeffrey V Wong, Dylan Fong, Max Franz, Metin Can Siper, Manfred Cheung, Michael Wrana, et al. Pathway commons 2019 update: integration, analysis and exploration of pathway data. Nucleic acids research, 48(D1):D489–D497, 2020.
[116] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018.
[117] Ida Schomburg, Antje Chang, Christian Ebeling, Marion Gremse, Christian Heldt, Gregor Huhn, and Dietmar Schomburg. Brenda, the enzyme database: updates and major new developments. Nucleic acids research, 32(suppl 1):D431–D433, 2004.
[118] Lynn M Schriml, Elvira Mitraka, James Munro, Becky Tauber, Mike Schor, Lance Nickle, Victor Felix, Linda Jeng, Cynthia Bearer, Richard Lichenstein, et al. Human disease ontology 2018 update: classification, content and workflow expansion. Nucleic acids research, 47(D1):D955–D962, 2019.
[119] Lynn Marie Schriml, Cesar Arze, Suvarna Nadendla, Yu-Wei Wayne Chang, Mark Mazaitis, Victor Felix, Gang Feng, and Warren Alden Kibbe. Disease ontology: a backbone for disease semantic integration. Nucleic acids research, 40(D1):D940–D946, 2012.
[120] Baoxu Shi and Tim Weninger. Open-world knowledge graph completion. Association for the Advancement of Artificial Intelligence, 2018.
[121] Denise N Slenter, Martina Kutmon, Kristina Hanspers, Anders Riutta, Jacob Windsor, Nuno Nunes, Jonathan Mélius, Elisa Cirillo, Susan L Coort, Daniela Digles, et al. Wikipathways: a multifaceted pathway database bridging metabolomics to other omics research. Nucleic acids research, 46(D1):D661–D667, 2018.
[122] Michael E Smoot, Keiichiro Ono, Johannes Ruscheinski, Peng-Liang Wang, and Trey Ideker. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics, 27(3):431–432, 2011.
[123] Peter K Sorger, Sandra RB Allerheiligen, Darrell R Abernethy, Russ B Altman, Kim LR Brouwer, Andrea Califano, David Z D’Argenio, Ravi Iyengar, William J Jusko, Richard Lalonde, et al. Quantitative and systems pharmacology in the post-genomic era: new approaches to discovering drugs and understanding therapeutic mechanisms. In An NIH white paper by the QSP workshop group, volume 48. NIH Bethesda Bethesda, MD, 2011.
[124] Chris Stark, Bobby-Joe Breitkreutz, Teresa Reguly, Lorrie Boucher, Ashton Breitkreutz, and Mike Tyers. Biogrid: a general repository for interaction datasets. Nucleic acids research, 34(suppl 1):D535–D539, 2006.
[125] Damian Szklarczyk, Annika L Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta-Cepas, Milan Simonovic, Nadezhda T Doncheva, John H Morris, Peer Bork, et al. String v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic acids research, 47(D1):D607–D613, 2019.
[126] Damian Szklarczyk, Alberto Santos, Christian von Mering, Lars Juhl Jensen, Peer Bork, and Michael Kuhn. Stitch 5: augmenting protein–chemical interaction networks with tissue and affinity data. Nucleic acids research, 44(D1):D380–D384, 2016.
[127] Ziaurrehman Tanoli, Umair Seemab, Andreas Scherer, Krister Wennerberg, Jing Tang, and Markus Vähä-Koskela. Exploration of databases and methods supporting drug repurposing: a comprehensive survey. Briefings in Bioinformatics, 2020.
[128] Nicholas P Tatonetti, P Ye Patrick, Roxana Daneshjou, and Russ B Altman. Data-driven prediction of drug effects and interactions. Science translational medicine, 4(125):125ra31–125ra31, 2012.
[129] Georg C Terstappen and Angelo Reggiani. In silico research in drug discovery. Trends in pharmacological sciences, 22(1):23–26, 2001.
[130] Syed Hamid Tirmizi, Stuart Aitken, Dilvan A Moreira, Chris Mungall, Juan Sequeda, Nigam H Shah, and Daniel P Miranker. Mapping between the obo and owl ontology languages. Journal of biomedical semantics, 2(S1):S3, 2011.
[131] Kristina Toutanova and Danqi Chen. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd workshop on continuous vector space models and their compositionality, pages 57–66, 2015.
[132] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. International Conference on Machine Learning (ICML), 2016.
[133] Dénes Türei, Tamás Korcsmáros, and Julio Saez-Rodriguez. Omnipath: guidelines and gateway for literature-curated signaling pathway resources. Nature methods, 13(12):966–967, 2016.
[134] Mathias Uhlén, Linn Fagerberg, Björn M Hallström, Cecilia Lindskog, Per Oksvold, Adil Mardinoglu, Åsa Sivertsson, Caroline Kampf, Evelina Sjöstedt, Anna Asplund, et al. Tissue-based map of the human proteome. Science, 347(6220), 2015.
[135] Oleg Ursu, Jayme Holmes, Jeffrey Knockel, Cristian G Bologa, Jeremy J Yang, Stephen L Mathias, Stuart J Nelson, and Tudor I Oprea. Drugcentral: online drug compendium. Nucleic acids research, page gkw993, 2016.
[136] Jessica Vamathevan, Dominic Clark, Paul Czodrowski, Ian Dunham, Edgardo Ferran, George Lee, Bin Li, Anant Madabhushi, Parantu Shah, Michaela Spitzer, et al. Applications of machine learning in drug discovery and development. Nature Reviews Drug Discovery, 18(6):463–477, 2019.
[137] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
[138] Andra Waagmeester, Martina Kutmon, Anders Riutta, Ryan Miller, Egon L Willighagen, Chris T Evelo, and Alexander R Pico. Using the semantic web for rapid integration of wikipathways with other biological online data resources. PLoS computational biology, 12(6):e1004989, 2016.
[139] John Wagner, Andrew M Dahlem, Lynn D Hudson, Sharon F Terry, Russ B Altman, C Taylor Gilliland, Christopher DeFeo, and Christopher P Austin. A dynamic map for learning, communicating, navigating and improving therapeutic development. Nature Reviews Drug Discovery, 17(2):150–150, 2018.
[140] Brian Walsh, Sameh K Mohamed, and Vít Nováček. Biokg: A knowledge graph for relational learning on biological data. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 3173–3180, 2020.
[141] Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315, 2019.
[142] Qingyun Wang, Manling Li, Xuan Wang, Nikolaus Parulian, Guangxing Han, Jiawei Ma, Jingxuan Tu, Ying Lin, Haoran Zhang, Weili Liu, et al. Covid-19 literature knowledge graph construction and drug repurposing report generation. arXiv preprint arXiv:2007.00576, 2020.
[143] Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29(12):2724–2743, 2017.
[144] Michelle Whirl-Carrillo, Ellen M McDonagh, JM Hebert, Li Gong, K Sangkuhl, CF Thorn, Russ B Altman, and Teri E Klein. Pharmacogenomics knowledge for personalized medicine. Clinical Pharmacology & Therapeutics, 92(4):414–417, 2012.
[145] Colby Wise, Miguel Romero Calvo, Pariminder Bhatia, Vassilis Ioannidis, George Karypus, George Price, Xiang Song, Ryan Brand, and Ninad Kulkani. Covid-19 knowledge graph: Accelerating information retrieval and discovery for scientific literature. In Proceedings of Knowledgeable NLP: the First Workshop on Integrating Structured Knowledge and Neural Networks for NLP, pages 1–10, 2020.
[146] David S Wishart, Yannick D Feunang, An C Guo, Elvis J Lo, Ana Marcu, Jason R Grant, Tanvir Sajed, Daniel Johnson, Carin Li, Zinat Sayeeda, et al. Drugbank 5.0: a major update to the drugbank database for 2018. Nucleic acids research, 46(D1):D1074–D1082, 2018.
[147] David S Wishart, Craig Knox, An Chi Guo, Dean Cheng, Savita Shrivastava, Dan Tzur, Bijaya Gautam, and Murtaza Hassanali. Drugbank: a knowledgebase for drugs, drug actions and drug targets. Nucleic acids research, 36(suppl 1):D901–D906, 2008.
[148] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.
[149] Yoshihiro Yamanishi, Michihiro Araki, Alex Gutteridge, Wataru Honda, and Minoru Kanehisa. Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics, 24(13):i232–i240, 2008.
[150] Andrew D Yates, Premanand Achuthan, Wasiu Akanni, James Allen, Jamie Allen, Jorge Alvarez-Jarreta, M Ridwan Amode, Irina M Armean, Andrey G Azov, Ruth Bennett, et al. Ensembl 2020. Nucleic acids research, 48(D1):D682–D688, 2020.
[151] Chuxu Zhang, Dongjin Song, Chao Huang, Ananthram Swami, and Nitesh V Chawla. Heterogeneous graph neural network. In

Proceedings of the 25th ACM SIGKDD InternationalConference on Knowledge Discovery & Data Mining , pages 793–803, 2019.[152] Yongqi Zhang, Quanming Yao, Yingxia Shao, and Lei Chen. Nscaching: simple and efﬁ-cient negative sampling for knowledge graph embedding. In , pages 614–625. IEEE, 2019.[153] Shuangjia Zheng, Jiahua Rao, Ying Song, Jixian Zhang, Xianglu Xiao, Evandro Fei Fang,Yuedong Yang, and Zhangming Niu. Pharmkg: a dedicated knowledge graph benchmark forbomedical data mining.

Brieﬁngs in Bioinformatics , 2020.[154] Yongjun Zhu, Chao Che, Bo Jin, Ningrui Zhang, Chang Su, and Fei Wang. Knowledge-drivendrug repurposing using a comprehensive drug knowledge graph.

Health Informatics Journal ,page 1460458220937101, 2020.[155] Yongjun Zhu, Olivier Elemento, Jyotishman Pathak, and Fei Wang. Drug knowledgebases and their applications in biomedical informatics research.

Brieﬁngs in bioinformatics ,20(4):1308–1321, 2019.[156] Marinka Zitnik, Monica Agrawal, and Jure Leskovec. Modeling polypharmacy side effectswith graph convolutional networks.