# HINT: Hierarchical Interaction Network for Trial Outcome Prediction Leveraging Web Data

Tianfan Fu, Kexin Huang, Cao Xiao, Lucas M. Glass, Jimeng Sun


Georgia Institute of Technology · Harvard University · Analytics Center of Excellence, IQVIA · Temple University · University of Illinois at Urbana-Champaign

February 2, 2021

## Abstract

Clinical trials are crucial for drug development but are time-consuming, expensive, and often burdensome on patients. More importantly, clinical trials face uncertain outcomes due to issues with efficacy, safety, or problems with patient recruitment. If we were better at predicting the results of clinical trials, we could avoid having to run trials that will inevitably fail, and more resources could be devoted to trials that are likely to succeed. In this paper, we propose the Hierarchical Interaction Network (HINT) for general clinical trial outcome prediction across all diseases, based on a comprehensive and diverse set of web data including molecule information of the drugs, target disease information, trial protocols, and biomedical knowledge. HINT first encodes these multi-modal data into latent embeddings, where an imputation module is designed to handle missing data. Next, these embeddings are fed into the knowledge embedding module to generate knowledge embeddings that are pretrained using external web knowledge on pharmaco-kinetic properties and trial risk. Then the interaction graph module connects all the embeddings via domain knowledge to fully capture the various trial components, their complex relations, and their influences on trial outcomes. Finally, HINT learns a dynamic attentive graph neural network to predict the trial outcome. Comprehensive experimental results show that HINT achieves strong predictive performance, obtaining 0.772, 0.607, 0.623, and 0.703 PR-AUC for Phase I, II, III, and indication outcome prediction, respectively. It also consistently outperforms the best baseline method by up to 12.4% on PR-AUC.

## 1 Introduction

A clinical trial is an indispensable step in developing a new drug, in which human participants are tested for their response to a treatment (e.g., a drug or drug combination) for the target diseases. Conducting clinical trials is extremely expensive (up to hundreds of millions of dollars [30]), the time to run a trial is very long, and the success probability is low [31, 27]. Many factors, such as drug inefficacy, drug safety issues, and poor trial protocol design, can cause a clinical trial to fail [14]. If we were better at predicting the results of clinical trials, we could avoid running trials that will inevitably fail, and more resources could be devoted to trials that are more likely to succeed. Fortunately, the online availability of historical clinical trial data and massive knowledge bases about approved and failed drugs brings a new opportunity for using machine learning models to tackle the key question: *can one predict the success probability of a trial before the trial starts?*

Various data sources on the web provide important knowledge for predicting trial outcomes. For example, the clinicaltrials.gov database (publicly available at https://clinicaltrials.gov/) has 279K clinical trial records that describe many important aspects of clinical trials. In addition, we can find the standard medical codes for most diseases described in natural language through the National Institutes of Health website (publicly available at https://clinicaltables.nlm.nih.gov/). The DrugBank database [41] contains biochemical descriptions of many drugs, which enables computational modeling of drug molecules.

Over the years, there have been early attempts to predict individual components of clinical trials to improve trial results, including using electroencephalographic (EEG) measures to predict the effect of antidepressant treatment on depressive symptoms [34], optimizing drug toxicity based on drug-property and target-property features [20], and leveraging phase 2 results to predict phase 3 trial results [33]. Recently, there has been interest in developing a general method for trial outcome prediction. As an initial attempt, [28] expanded beyond optimizing individual components to predict trial outcomes for 15 disease groups based on disease-only features using statistical modeling. Despite these initial efforts, several limitations impede the utility of existing trial outcome prediction models:

- **Limited task definition and study scope.** Existing works either focus on predicting individual components of trials [20, 17, 4, 12], such as patient-trial matching, or only cover disease groups for which disease-specific features are available [28]. Although these works are potentially useful for a limited part of trial design, they do not answer the fundamental question: will this trial succeed? To the best of our knowledge, no work attempts to solve the general trial outcome prediction problem across different trial phases for many different diseases.
- **Limited features used for prediction.** Due to their limited task definition and study scope, existing works often leverage only restricted disease-specific features, which cannot be generalized to other diseases. These works also ignore the fact that trial outcomes are determined by various trial risks, including drug safety, treatment efficacy, and trial recruitment, for which abundant information exists on the web. For example, the biomedical knowledge that provides explicit biochemical structures of drug molecules, and the history of previous trials for a given disease, are both ignored by existing studies, yet are very useful for trial outcome prediction.
- **Failure to explicitly capture the complex relations among trial components and trial outcomes.** Due to their limited data and task scope, existing methods often simplify their predictions to limited input features and use simple computational methods that are not explicitly designed for trial outcome prediction [20, 17, 4, 33, 38]. This significantly impedes their ability to model the complicated relations among the various trial components.

**Our approach.** To provide accurate trial outcome predictions for all trials, we propose the Hierarchical Interaction Network (HINT). HINT has an input embedding module to encode data from various web sources, including drug molecules, disease information, and trial protocols, with an imputation module designed to handle missing data. Next, these embeddings are fed into the knowledge embedding module to generate knowledge embeddings that are pretrained using external web knowledge on pharmaco-kinetic properties and trial risk. Then the interaction graph module connects all the embeddings via domain knowledge to fully capture the various trial components, their complex relations, and their influences on trial outcomes. Based on that, HINT learns a dynamic attentive graph neural network to predict the trial outcome.

**Contributions.** Our main contributions are as follows:

- **Problem.** We formally define a model framework for the general clinical trial outcome prediction task, which not only models various trial risks, including drug safety, treatment efficacy, and trial recruitment, but also covers a wide range of drugs and indications (i.e., diseases). Our model framework generalizes to new trials given the drug, disease, and protocol information (Section 3.1).
- **Benchmark.** To enable general clinical trial outcome prediction, we leverage a comprehensive set of datasets from various public web sources, including DrugBank, standard disease codes, and clinical trial records, and curate a publicly available clinical trial outcome prediction dataset, TOP. This benchmark dataset will be released and open-sourced to the community to advance machine-learning-aided clinical trial design soon after the review process (Section 4).
- **Method.** We design a machine learning method that explicitly simulates each clinical trial component and the complicated relations among them (Sections 3.2-3.6).

We evaluated HINT against state-of-the-art baselines using real-world data. HINT achieved PR-AUC of 0.772, 0.607, and 0.623 on Phase I, II, and III level prediction, respectively, and 0.703 on indication-level prediction. These high absolute scores suggest the practical usability of HINT for predicting clinical trial outcomes at various stages of a trial. In addition, HINT yields up to 12.4% relative improvement in PR-AUC compared with the best baseline method (COMPOSE) [16]. We also conduct an ablation study to evaluate the importance of key clinical trial components to the predictive power, as well as the effectiveness of the hierarchical formulation of the trial interaction graph. Finally, we present a case study showing the potential real-world impact of HINT by successfully predicting the failure of some recent large trial efforts.

## 2 Related Work

Existing works often focus on predicting an individual patient's outcome in a trial rather than making a general prediction about overall trial success, and they usually leverage expert-crafted features. For example, [42] leveraged SVM to predict the status of genetic lesions based on cancer clinical trial documents. [34] used Gradient-Boosted Decision Trees (GBDT [43]) to predict the improvement in symptom scores based on treatment symptom scores and electroencephalographic (EEG) measures for depressive symptoms under antidepressant treatment. [20] focused on predicting clinical drug toxicity from drug-property and target-property features using an ensemble classifier of weighted least-squares support vector regression. Note that these models do not tackle the same task as ours: they predict at the patient level, whereas HINT focuses on the trial level. More relevant to us, [33] designed a Residual Semi-Recurrent Neural Network (RS-RNN) to predict phase 3 trial results based on phase 2 results; in contrast, HINT predicts outcomes for all clinical trial phases. [28] explored various imputation techniques and a series of conventional machine learning models (including logistic regression, random forest, and SVM) to predict clinical trial outcomes within 15 disease groups. However, they do not consider drug molecule features or trial protocol information, and thus can neither differentiate outcomes for different drugs targeting the same disease nor capture trial failures due to poor enrollment, whereas HINT takes all of this information into account.

Recently, deep learning has been leveraged to learn representations from clinical trial data to support downstream tasks such as patient retrieval [45, 16] and enrollment [5]. For example, Doctor2Vec [5] learns hierarchical clinical trial embeddings in which the unstructured trial descriptions are embedded using BERT [11]. DeepEnroll [45] also leveraged a Bidirectional Encoder Representations from Transformers (BERT [11]) model to encode clinical trial information. More recently, COMPOSE [16] used pretrained BERT to generate a contextualized word embedding for each word of the trial protocol, and then applied multiple one-dimensional convolutional layers with different kernel sizes to generate a trial embedding that captures semantics at different granularity levels. While these works optimize representation learning for a single component of a trial, HINT is the first to model a diverse set of trial components, including molecules, diseases, protocols, pharmaco-kinetics, and disease risk information, and to fuse all of them through a unique hierarchical interaction graph.

## 3 Method

### 3.1 Problem Formulation

A clinical trial is designed to validate the safety and efficacy of a *treatment set* towards a *target disease set* on a patient group defined by the *trial protocol*.

**Definition 1 (Treatment Set).** The treatment set includes one or multiple drug candidates, denoted by

M = {m_1, ..., m_{N_m}},   (1)

where m_1, ..., m_{N_m} are the N_m drug molecules involved in this trial. Note that we focus on clinical trials that aim at discovering new indications of drug candidates. Trials that do not involve molecules, such as surgery and device trials, are out of scope and can be considered as future work.

**Definition 2 (Target Disease Set).** Each trial targets one or more diseases. Suppose there are N_d ≥ 1 diseases in a trial; we represent the target disease set as

D = {d_1, ..., d_{N_d}},   (2)

where d_1, ..., d_{N_d} are the N_d diseases. We use d_i to represent the raw information associated with the disease, including the disease name, description (text data), and its corresponding diagnosis code (e.g., International Classification of Diseases, ICD codes [3]).

Each trial has a trial protocol (in unstructured natural language) that describes the eligibility criteria for enrolling patients, including participant characteristics such as age, gender, medical history, target disease conditions, and current health status.

**Definition 3 (Trial Protocol).** The patient group is specified by the trial protocol. Formally, each protocol consists of a set of inclusion and exclusion criteria for recruiting patients, which describe what is desired and what is unwanted in the targeted patients, respectively:

C = [c_1^I, ..., c_M^I, c_1^E, ..., c_N^E],   (3)

where M (N) is the number of inclusion (exclusion) criteria in the trial, and c_i^I (c_i^E) denotes the i-th inclusion (exclusion) criterion. Each criterion c is a sentence in unstructured natural language.

Table 1: Mathematical notations.

| Category | Notation | Explanation |
|---|---|---|
| Web data & output (Section 3.1) | M = {m_1, ..., m_{N_m}} | N_m drug molecules, Eq. (1) |
| | D = {d_1, ..., d_{N_d}} | N_d diseases, Eq. (2) |
| | C = [c_1^I, ..., c_M^I, c_1^E, ..., c_N^E] | Protocol criteria, Eq. (3) |
| | ŷ ∈ [0, 1] | Predicted trial success probability |
| Input embedding (Section 3.3) | h_m ∈ R^d, f_m(·) | Drug embedding, drug encoder |
| | h_d ∈ R^d, GRAM(·) | Disease embedding, disease encoder |
| | α_{ji}, g(·), e_{i/j/k} | GRAM parameters [8], Eq. (7), (8) |
| | h_p ∈ R^d, f_p(·) | Trial protocol embedding, protocol encoder |
| Knowledge embeddings and their functions (Sections 3.4, 3.5) | h_A, h_D, h_M, h_E, h_T ∈ R^d and X_A, X_D, X_M, X_E, X_T | ADMET embeddings and their embedding functions, Eq. (13) |
| | h_R ∈ R^d and R(·) | Disease risk, Eq. (17) |
| | h_PK ∈ R^d and K(·) | Pharmaco-kinetics, Eq. (18) |
| | h_I ∈ R^d and I(·) | Interaction, Eq. (19) |
| | h_V ∈ R^d and V(·) | Augmented interaction, Eq. (20) |
| | h_pred ∈ R^d and P(·) | Prediction, Eq. (21) |
| Hierarchical interaction graph (Section 3.5) | G, \|G\| = K | Interaction graph with K nodes |
| | A ∈ {0, 1}^{K×K} | Adjacency matrix of G, Eq. (23) |
| | V ∈ R_+^{K×K}, g(·) | Attentive matrix/function, Eq. (23), (24) |
| | B ∈ R^{K×d} | Bias parameter of the GNN |
| | W^(l) ∈ R^{d×d}, l = 0, ..., L | Weights of the l-th layer |
| | H^(l) ∈ R^{K×d}, l = 0, ..., L | Node embeddings at the l-th layer |
| Imputation (Section 3.6) | ĥ_m ∈ R^d | Recovered drug embedding |
| | L_recovery | Recovery loss, Eq. (28) |

**Problem 1 (Trial Outcome Prediction).** The trial outcome is a binary label y ∈ {0, 1}, where y = 1 indicates trial success and y = 0 indicates trial failure. The predicted success probability is ŷ ∈ [0, 1]. The goal of HINT is to learn a deep neural network model f_θ for predicting the trial outcome y:

ŷ = f_θ(M, D, C),   (4)

where M, D, and C are the treatment set, target disease set, and trial protocol, respectively. We predict trial outcomes under two settings:

- **Phase-level prediction** focuses on predicting the outcome of a particular phase of the trial. In general, there are three trial phases: phase I tests the toxicity and side effects of the drug; phase II determines the efficacy of the drug (i.e., whether the drug works); phase III focuses on the effectiveness of the drug (i.e., whether the drug is better than the current standard practice). Phase-level prediction determines whether a specific clinical trial study will successfully complete that phase.
- **Indication-level prediction** aims at predicting whether the indication of a drug will be approved (i.e., whether the treatment for a disease will pass all three phases).

For ease of exposition, we list the mathematical notations in Table 1.
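To make the inputs and output of Problem 1 concrete, the following is a minimal sketch of one trial record under this formulation; all field names are illustrative and not from the HINT codebase:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrialRecord:
    """One clinical trial, following the setup of Section 3.1 (hypothetical schema)."""
    smiles: List[str]       # treatment set M: one SMILES string per drug molecule
    icd_codes: List[str]    # target disease set D: ICD-10 diagnosis codes
    inclusion: List[str]    # inclusion criteria sentences c^I
    exclusion: List[str]    # exclusion criteria sentences c^E
    phase: str = "phase 1"  # which phase-level outcome is being predicted
    outcome: int = 0        # y in {0, 1}: 1 = success, 0 = failure

trial = TrialRecord(
    smiles=["CC1=C(SC=[N+]1CC2=CN=C(N=C2N)C)CCO"],
    icd_codes=["C34.91"],
    inclusion=["Histologically confirmed non-small cell lung cancer"],
    exclusion=["Prior systemic chemotherapy"],
    outcome=1,
)
```

A model f_θ then maps `(trial.smiles, trial.icd_codes, trial.inclusion, trial.exclusion)` to a probability ŷ ∈ [0, 1].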

### 3.2 Overview of HINT

As illustrated in Figure 1, HINT is an end-to-end framework for predicting the success probability of a trial before the trial starts. First, HINT has an input embedding module that encodes multi-modal data from various sources, including drug molecules, disease information, and trial protocols, into input embeddings (Section 3.3). Next, these embeddings are fed into the knowledge embedding module to generate knowledge embeddings that are pretrained using external web knowledge on pharmaco-kinetic properties and trial risk (Section 3.4). Then the interaction graph module connects all the embeddings via domain knowledge to fully capture the various trial components, their complex relations, and their influences on trial outcomes. Based on that, HINT learns a dynamic attentive graph neural network to predict the trial outcome (Section 3.5). An imputation module is designed to handle missing data (Section 3.6).
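The four-stage pipeline above can be sketched end to end as follows; every function here is a random stand-in for the corresponding HINT module (hypothetical names, toy dimensions), shown only to make the data flow and node layout explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # shared embedding dimension (illustrative)

def input_embedding(molecules, diseases, criteria):
    """Stand-in for Section 3.3: encode each modality into R^d."""
    return {name: rng.normal(size=d) for name in ("h_m", "h_d", "h_p")}

def knowledge_embedding(inputs):
    """Stand-in for Section 3.4: pretrained ADMET + disease-risk embeddings."""
    return {name: rng.normal(size=d)
            for name in ("h_A", "h_D", "h_M", "h_E", "h_T", "h_R")}

def interaction_graph(inputs, knowledge):
    """Stand-in for Section 3.5: stack all K = 13 node embeddings (Eq. 22)."""
    agg = {name: rng.normal(size=d) for name in ("h_PK", "h_I", "h_V", "h_pred")}
    return np.stack(list(inputs.values()) + list(knowledge.values()) + list(agg.values()))

def predict(H0):
    """Stand-in for the GNN + sigmoid head (Eqs. 23-25)."""
    return 1.0 / (1.0 + np.exp(-H0[-1].mean()))

H0 = interaction_graph(input_embedding([], [], []), knowledge_embedding({}))
y_hat = predict(H0)  # predicted success probability in (0, 1)
```

The 13 rows of `H0` mirror the 13 nodes of the interaction graph used in the paper.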

Figure 1: The HINT framework. HINT takes features that describe the following trial components: drug molecule embedding h_m, disease embedding h_d, and trial protocol embedding h_p (Section 3.3). Before constructing an interaction graph from these components, HINT pretrains some embeddings (blue nodes) using web-based knowledge about drug properties and disease risks (Section 3.4). Next, we construct an interaction graph to characterize the interactions between the various trial components (Section 3.5). Trial embeddings are learned on the interaction graph to capture both the trial components and their interactions. Based on the learned representation and a dynamic attentive graph neural network (Eq. 23), we make the trial outcome prediction.

### 3.3 Input Representation

In this section, we describe the raw data and input representation learning. The raw data used for trial outcome prediction come from three different sources: (1) drug molecules, (2) disease information, and (3) trial protocols.

**1. Drug molecules** are important for predicting trial outcomes. Drug molecules are represented by SMILES strings or molecule graphs. There are many existing approaches for embedding drug molecules into latent vectors, including knowledge-based approaches such as the Morgan fingerprint and its variants [6], and representation learning methods for SMILES strings [40] and molecule graphs [10, 9, 21]. HINT supports all three types of molecule embeddings. Formally, we represent all molecules M = {m_1, ..., m_{N_m}} as the drug (molecule) embedding

h_m = (1/N_m) Σ_{j=1}^{N_m} f_m(m_j),   h_m ∈ R^d,   (5)

where f_m(·) is the molecule embedding function, and we average all molecule embeddings of a trial to obtain the drug embedding vector. Our experiments show that averaging is a better aggregation function than summation for the drug embedding. The molecule embedding function f_m can be any of the above drug embedding methods; HINT supports the Morgan fingerprint [6], a SMILES encoder [40], and a graph message passing neural network (MPNN) [10, 23] for creating the drug embedding.
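A minimal sketch of the averaging in Eq. (5), with a toy character-count featurizer standing in for the real encoder f_m (Morgan fingerprint, SMILES encoder, or MPNN in the paper):

```python
import numpy as np

def embed_molecule(smiles, d=8):
    """Toy stand-in for the molecule encoder f_m: character-count features.
    Not a chemically meaningful embedding, only a placeholder."""
    v = np.zeros(d)
    for ch in smiles:
        v[ord(ch) % d] += 1.0
    return v

def drug_embedding(molecules):
    """Eq. (5): average the per-molecule embeddings of the treatment set M."""
    return np.mean([embed_molecule(m) for m in molecules], axis=0)

h_m = drug_embedding(["CCO", "CC1=CC=CC=C1"])  # a two-drug treatment set
```

Averaging (rather than summing) keeps the scale of h_m independent of the number of drugs N_m, which matches the paper's observation that averaging aggregates better.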

**2. Disease information** can also affect the trial outcome. For example, drugs in oncology have a much lower approval rate than those for infectious diseases [19]. The disease information comes from the disease description and its corresponding ontology, such as disease hierarchies like the International Classification of Diseases (ICD) [3]. Formally, we represent all diseases D = {d_1, ..., d_{N_d}} (Eq. 2) as the disease embedding

h_d = (1/N_d) Σ_{i=1}^{N_d} GRAM(d_i),   h_d ∈ R^d,   (6)

where GRAM(d_i) is an embedding of disease d_i produced by GRAM (graph-based attention model) [8]. GRAM is designed to leverage the hierarchical information inherent to medical ontologies. Specifically, the representation of the current disease d_i is a convex combination of the basic embeddings (e ∈ R^d) of itself and its ancestors, i.e.,

GRAM(d_i) = Σ_{j ∈ Ancestors(i) ∪ {i}} α_{ji} e_j,   (7)

where α_{ji} ∈ R_+ is an attention weight given by

α_{ji} = exp(g([e_j^T, e_i^T]^T)) / Σ_{k ∈ Ancestors(i) ∪ {i}} exp(g([e_k^T, e_i^T]^T)),   with   Σ_{j ∈ Ancestors(i) ∪ {i}} α_{ji} = 1,   (8)

where g(·) is a feed-forward network with a single hidden layer, following [8]. Ancestors(i) is the set of all ancestors of code i; an ancestor of a code represents a higher-level category of that code. For example, in ICD-10 [3], “D41” (urinary organs neoplasm) and “D41.2” (ureter neoplasm) are the ancestors of “D41.20” (right ureter neoplasm), and “C34” (malignant neoplasm of bronchus and lung) and “C34.9” (malignant neoplasm of unspecified part of bronchus or lung) are the ancestors of “C34.91” (malignant neoplasm of unspecified part of right bronchus or lung).
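The GRAM attention of Eqs. (7)-(8) can be sketched as follows, with toy basic embeddings and a random linear map standing in for the learned scoring network g:

```python
import numpy as np

def gram_embedding(i, ancestors, E, g):
    """Eqs. (7)-(8): embed code i as a convex combination of the basic
    embeddings of itself and its ontology ancestors, weighted by a
    softmax over the scores produced by g(.)."""
    codes = ancestors + [i]
    scores = np.array([g(np.concatenate([E[j], E[i]])) for j in codes])
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()            # attention weights sum to 1
    return sum(a * E[j] for a, j in zip(alpha, codes))

# Toy ontology from the ICD-10 example: "C34.91" has ancestors "C34", "C34.9".
rng = np.random.default_rng(0)
E = {c: rng.normal(size=4) for c in ("C34", "C34.9", "C34.91")}
w = rng.normal(size=8)                     # random stand-in for the scorer g
h_disease = gram_embedding("C34.91", ["C34", "C34.9"], E, lambda u: float(w @ u))
```

Because the weights α are a softmax, the result always lies in the convex hull of the basic embeddings, so rare codes borrow strength from their ancestors.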

**3. Trial protocol** is a document that describes how a clinical trial will be conducted, and includes the eligibility criteria, which describe the patient recruitment requirements. The eligibility criteria contain both inclusion and exclusion criteria, denoted by Eq. (3): C = [c_1^I, ..., c_M^I, c_1^E, ..., c_N^E], where c_i^I and c_j^E denote the i-th inclusion and j-th exclusion criterion sentence, respectively. To convert the criteria sentences into embedding vectors, we apply Clinical-BERT [2, 22], a domain-specific version of BERT [11]:

s_i^I = Clinical-BERT(c_i^I),   S^I = [s_1^I, ..., s_M^I],
s_j^E = Clinical-BERT(c_j^E),   S^E = [s_1^E, ..., s_N^E].   (9)

Then we use four one-dimensional convolutional layers [44, 16] (denoted Conv1d) with different kernel sizes (k_1, ..., k_4) to capture semantics at four different granularity levels. Specifically, S^I and S^E are fed into the four one-dimensional convolutional layers and the outputs are concatenated:

p^I = CONCAT(Conv1d(S^I, k_1), ..., Conv1d(S^I, k_4)),
p^E = CONCAT(Conv1d(S^E, k_1), ..., Conv1d(S^E, k_4)).   (10)

Then a one-layer fully connected neural network (denoted FC(·)) builds the protocol embedding h_p from the concatenation of p^I and p^E:

h_p = FC(CONCAT(p^I, p^E)).   (11)

Thus, the protocol embedding is written as

h_p = f_p(C),   h_p ∈ R^d.   (12)

### 3.4 Knowledge Embedding

HINT utilizes a vast amount of web data to further enhance these input embeddings. In particular, HINT leverages DrugBank data [41] and trial data from clinicaltrials.gov; for disease risk, we use the historical trial success statistics for different diseases available on clinicaltrials.gov.

**Pharmaco-kinetics knowledge.** We pretrain embeddings using pharmaco-kinetics (PK) knowledge, which describes how the body reacts to an administered drug; trial success depends heavily on factors such as a drug's pharmaco-kinetic properties and the disease risk. Specifically, we leverage various PK experimental scores generated in wet labs and stored in various sources on the web. Using this information, we pretrain prediction models for the Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) properties, which are used together in drug discovery to provide insight into how a drug interacts with the body as a whole [18]:

- **Absorption:** the absorption model describes how well a drug can be absorbed into the human body to reach its site of action. Poor absorption is usually less desirable.
- **Distribution:** the distribution model measures the ability of the molecule to move through the bloodstream to various parts of the body. Stronger distribution is advisable.
- **Metabolism:** the drug metabolism rate determines the duration of a drug's efficacy.
- **Excretion:** the drug excretion rate measures how well a drug's toxic components can be removed from the body.
- **Toxicity:** drug toxicity measures the damage a drug can cause to the human body.

We build a predictive model for each of these property scores and its latent embedding: absorption score y_A and embedding h_A, distribution score y_D and embedding h_D, metabolism score y_M and embedding h_M, excretion score y_E and embedding h_E, and toxicity score y_T and embedding h_T. The input is a molecule and the label is binary, indicating whether the molecule has the desired property:

h_* = X_*(h_m),   h_* ∈ R^d,   ŷ_* = Sigmoid(FC(h_*)),   ŷ_* ∈ [0, 1],
min  −y_* log ŷ_* − (1 − y_*) log(1 − ŷ_*),   y_* ∈ {0, 1},   (13)

where h_m ∈ R^d is the input drug embedding defined in Eq. (5), * can be A, D, M, E, or T, and FC is a one-layer fully connected neural network. X_* can be any neural network; we use a multi-layer highway neural network [36], chosen because it alleviates the vanishing gradient issue during training. A single highway layer is

z = T_1(u, W_{T_1}) ⊙ T_2(u, W_{T_2}) + u ⊙ (1 − T_2(u, W_{T_2})),   (14)

where u, z, T_1(u, W_{T_1}), and T_2(u, W_{T_2}) all have the same dimension; u and z are the input and output of the layer, respectively. Here ⊙ is element-wise multiplication, T_1 is an affine transform with ReLU activation, and T_2 is the transform gate with sigmoid activation; T_1 and T_2 are parametrized by W_{T_1} and W_{T_2}, respectively. We have

z = u  if T_2(u, W_{T_2}) = 0,   and   z = T_1(u, W_{T_1})  if T_2(u, W_{T_2}) = 1.   (15)

Multiple highway layers are stacked; for simplicity, in this paper this is denoted

z = Highway(u).   (16)

Note that the input and output dimensions of the highway network are the same. If the input embedding is larger than the desired dimension (e.g., a concatenation of multiple embeddings), we first apply an FC layer to reduce the input dimension and then apply the highway network to maintain the desired dimension.

**Disease risk embedding and trial risk prediction.**

In addition to drug properties, we also consider knowledge distilled from historical trials of the target diseases. We use multiple sources of disease information: (1) the disease description and disease ontology, and (2) the historical trial success rate for the disease. As detailed statistics on trial success rates for each disease at different trial phases are widely available [19], we use them as the supervision signal to train the trial risk prediction model. More specifically, given the diseases in a trial, we leverage the historical trial data (also available at ClinicalTrials.gov) to predict their success rate. The predicted trial risk ŷ_R and embedding h_R ∈ R^d are generated via a two-layer highway neural network (Eq. 16) R(·):

h_R = R(h_d),   h_R ∈ R^d,   ŷ_R = Sigmoid(FC(h_R)),   ŷ_R ∈ [0, 1],
min  −y_R log ŷ_R − (1 − y_R) log(1 − ŷ_R),   y_R ∈ {0, 1},   (17)

where h_d ∈ R^d is the input disease embedding from Eq. (6), ŷ_R ∈ [0, 1] is the predicted trial risk (0 being most likely to fail and 1 most likely to succeed), y_R ∈ {0, 1} is the binary label indicating the success or failure of the trial as a function of the disease only, and FC is a one-layer fully connected layer. The binary cross-entropy loss between y_R and ŷ_R guides the training.

### 3.5 Hierarchical Interaction Graph

In this section, we describe (1) the construction of the trial interaction graph and (2) how to predict trial outcomes using a dynamic attentive graph neural network on this interaction graph.

**(1) Trial interaction graph construction.**

We construct a hierarchical interaction graph G to connect all input data sources and the important factors affecting clinical trial outcomes. The interaction graph G is constructed to reflect the real-world trial development process and consists of four tiers of nodes that are connected between tiers:

- **Input nodes** include drugs, diseases, and protocols, with node features given by the input embeddings h_m, h_d, h_p ∈ R^d (colored green in Figure 1; Section 3.3).
- **External knowledge nodes** include the ADMET embeddings h_A, h_D, h_M, h_E, h_T ∈ R^d and the disease risk embedding h_R. These representations are initialized by pretraining on external knowledge (colored blue in Figure 1; Section 3.4).
- **Aggregation nodes** include (1) the interaction node h_I, connecting the disease h_d, drug molecules h_m, and protocols h_p; (2) the pharmaco-kinetics node h_PK, connecting the ADMET embeddings h_A, h_D, h_M, h_E, h_T ∈ R^d; and (3) the augmented interaction node h_V, which augments the interaction node h_I with the disease risk node h_R. Aggregation nodes are colored yellow in Figure 1.
- **Prediction node:** h_pred connects the pharmaco-kinetics node h_PK and the augmented interaction node h_V to make the prediction (colored grey in Figure 1).

The connections between the different nodes are shown in Figure 1. The input nodes and external knowledge nodes have been described above, and their representations serve as node embeddings for the interaction graph. Next, we describe the aggregation nodes and the prediction node.

**Aggregation nodes:**

The pharmaco-kinetics node gathers all information from the five ADMET properties (Eq. 13). We obtain the PK embedding by

h_PK = K(h_A, h_D, h_M, h_E, h_T),   h_PK ∈ R^d,   (18)

where K(·) is a one-layer fully connected layer (its input is the concatenation of h_A, h_D, h_M, h_E, h_T, with input dimension 5d and output dimension d) followed by a d-dimensional two-layer highway neural network (Eq. 16) [36].

Then, we model the interaction among the input drug molecules, disease, and protocol by an interaction node, whose embedding is obtained by

h_I = I(h_m, h_d, h_p),   h_I ∈ R^d,   (19)

where h_m, h_d, h_p are the input embeddings defined in Eqs. (5), (6), and (12), respectively. The neural architecture of I(·) is a one-layer fully connected network (input dimension 3d, output dimension d) followed by a d-dimensional two-layer highway network (Eq. 16) [36].

Second, the augmented interaction node combines (i) the trial risk of the target disease, h_R (Eq. 17), and (ii) the interaction among disease, molecule, and protocol, h_I (Eq. 19):

h_V = V(h_R, h_I),   h_V ∈ R^d.   (20)

The neural architecture of V(·) is a one-layer fully connected network (input dimension 2d, output dimension d) followed by a d-dimensional two-layer highway network (Eq. 16) [36].

The **prediction node** summarizes the pharmaco-kinetics and augmented interaction to obtain the final prediction:

h_pred = P(h_PK, h_V),   h_pred ∈ R^d.   (21)

Like I(·) and V(·), the architecture of P(·) is a one-layer fully connected network (input dimension 2d, output dimension d) followed by a d-dimensional two-layer highway network (Eq. 16) [36].

**(2) Dynamic attentive graph neural network.** The trial embeddings provide initial representations of the different trial components and their interactions via a graph.
To further enhance predictions, we design a dynamic attentive graph neural network that leverages this interaction graph to model the influential trial components and improve predictions. Mathematically, the interaction graph G is the input graph, where nodes are trial components and edges are the relations among them. We denote by A ∈ {0, 1}^{K×K} the adjacency matrix of G. The node embeddings H^(0) ∈ R^{K×d} are initialized as

H^(0) = [h_d, h_m, h_p, h_A, h_D, h_M, h_E, h_T, h_R, h_PK, h_I, h_V, h_pred]^T ∈ R^{K×d},   (22)

where K = |G| is the number of nodes in graph G; K = 13 in this paper. We further enhance the node embeddings using a graph convolutional network (GCN) [26]. Eq. (23) gives the GCN update rule for the l-th layer:

H^(l) = ReLU(B^(l) + (V ⊙ A)(H^(l−1) W^(l))),   l = 1, ..., L,   H^(l) ∈ R^{K×d},   (23)

where B ∈ R^{K×d} is a bias parameter, W^(l) ∈ R^{d×d} is the weight matrix of the l-th layer, L is the depth of the GCN, and ⊙ is element-wise multiplication.

Unlike a conventional GCN [26], we introduce a learnable, layer-independent attentive matrix V ∈ R_+^{K×K}. Its (i, j)-th entry V_{i,j} measures the importance of the edge connecting the i-th and j-th nodes of G. We compute V_{i,j} from the i-th and j-th node embeddings in H^(0), denoted h_i, h_j ∈ R^d (h_i is the transpose of the i-th row of H^(0) in Eq. 22):

V_{i,j} = g(CONCAT(h_i, h_j)),   i, j ∈ {1, ..., K},   V_{i,j} ∈ R_+,   (24)

where g(·) is a two-layer fully connected neural network with ReLU activation in the hidden layer and sigmoid activation in the output layer. The attentive matrix V is element-wise multiplied with the adjacency matrix A (Eq. 23), so that messages along edges with higher importance scores are propagated with higher weight.

**Training.**

The target is a binary label y ∈ {0, 1}, where y = 1 indicates the trial succeeded and y = 0 means it failed. After GNN message passing, we obtain an updated representation for each trial component. We then use the last-layer (L-th layer) representation of the trial prediction node to generate the trial success prediction, as shown in Eq. (25):

ŷ = Sigmoid( FC( h_pred^(L) ) ), ŷ ∈ [0, 1], (25)

where L is the depth of the GCN and FC(·) is a one-layer fully-connected network with a sigmoid activation. Then the binary cross-entropy loss (Eq. (26)) is used to guide model training:

L_classify = −y log ŷ − (1 − y) log(1 − ŷ). (26)

Note that HINT is trained in an end-to-end manner: the loss back-propagates to the GNN and to all the neural networks in Section 3.4 and Section 3.3 that generate the input node embeddings.
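Eqs. (25)-(26) amount to a single sigmoid unit on the prediction node's embedding plus the standard binary cross-entropy; a minimal sketch with a toy dimension and random weights (not the trained model):

```python
import numpy as np

def predict(h_pred, w, b):
    """Eq. (25): one fully-connected layer with sigmoid on the trial
    prediction node's last-layer embedding."""
    return 1.0 / (1.0 + np.exp(-(w @ h_pred + b)))

def bce_loss(y, y_hat, eps=1e-12):
    """Eq. (26): binary cross-entropy between label y and prediction y_hat."""
    return -y * np.log(y_hat + eps) - (1 - y) * np.log(1 - y_hat + eps)

rng = np.random.default_rng(2)
h_pred = rng.normal(size=4)               # toy d = 4 (the paper uses d = 100)
y_hat = predict(h_pred, rng.normal(size=4), 0.0)
loss = bce_loss(1.0, y_hat)               # label 1 = trial succeeded
```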

Algorithm 1: HINT Framework (with Missing Data Imputation)

1. Pretrain the basic modules: (i) ADMET models (A, D, M, E, T); (ii) the disease risk (DR) model.
2. Construct the interaction graph G.
3. Train HINT:
   if the data are complete, (M, D, P, y):
      Fix IMP(·) (Eq. 27), minimize L_classify (Eq. 26), and update the remaining part of the model.
      Minimize L_recovery (Eq. 28), updating only IMP(·).
   else (molecule missing; only (D, P, y)):
      Fix IMP(·), replace h_m with ĥ_m (Eq. 27), minimize L_classify (Eq. 26), and update the remaining part of the model.
4. Given new data (M, D, P), predict the success probability ŷ.

One special challenge associated with trial data is that the molecular information M can be missing due to proprietary restrictions. For complete data we have (M, D, P, y), whereas for data with missing molecules we only have (D, P, y). This poses a problem since many node representations depend on the molecular information. We observe that there exist high correlations among the drug molecule, disease and protocol features. Thus, we design a missing data imputation module based on learning embeddings that capture inter-modal correlations and intra-modal distributions. In particular, in this study the imputation module IMP(·) uses the disease and protocol embeddings (h_d, h_p) to recover the molecular embedding h_m, as given by Eq. (27):

ĥ_m = IMP(h_d, h_p). (27)

Here, we adopt the MSE (mean squared error) objective to minimize the distance between the ground-truth molecule embedding h_m and the predicted one, ĥ_m:

L_recovery = ‖ĥ_m − h_m‖_2, (28)

where ‖·‖_2 is the ℓ2-norm of a vector. When learning from complete data, we update IMP(·) by minimizing L_recovery. When learning from data with missing molecules, we fix IMP(·), use ĥ_m to replace h_m, and update the remaining part of the model. The entire framework is summarized in Algorithm 1.
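A minimal sketch of the imputation step: IMP(·) is reduced here to a single linear map for illustration (the paper learns a neural module), paired with the ℓ2 recovery loss of Eq. (28); the dimension and weights are toy stand-ins.

```python
import numpy as np

def impute_molecule(h_d, h_p, W, b):
    """IMP(.) of Eq. (27): recover the molecule embedding from the disease
    and protocol embeddings. A linear map is an illustrative simplification."""
    return W @ np.concatenate([h_d, h_p]) + b

def recovery_loss(h_m_hat, h_m):
    """Eq. (28): l2 distance between the imputed and true molecule embeddings."""
    return float(np.linalg.norm(h_m_hat - h_m))

d = 4                                    # toy dimension (the paper uses d = 100)
rng = np.random.default_rng(3)
W, b = rng.normal(size=(d, 2 * d)), np.zeros(d)
h_d, h_p, h_m = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
h_m_hat = impute_molecule(h_d, h_p, W, b)  # stands in for h_m when M is missing
loss = recovery_loss(h_m_hat, h_m)         # minimized only on complete data
```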

Figure 2: Clinical trial outcome benchmark dataset TOP. (The figure summarizes the web resources — clinicaltrials.gov, clinicaltables.nlm.nih.gov, drugbank.ca, moleculenet.ai, chemoinfo.ipmc.cnrs.fr/MOLDB — and the XML parsing, data linking and filtering steps used to construct the trial outcome data and auxiliary wet-lab data.)

As there is no public trial outcome prediction dataset available, we create a benchmark dataset for Trial Outcome Prediction named TOP, which will be released after the double-blind review. We first describe the data components and then report the processing steps used to construct this benchmark dataset (Figure 2).

For each clinical trial, we produce: 1) drug molecule information, including SMILES strings and molecular graphs for the drug candidates used in the trial; 2) disease information, including ICD-10 codes, disease descriptions and the disease hierarchy in terms of CCS codes; 3) trial protocol information, including the eligibility criteria, study description and outcome measures; and 4) trial outcome information, which comprises a binary indicator of trial success (1) or failure (0), trial phase, start and end dates, sponsor, and trial size (i.e., number of participants). In addition to the main clinical trial outcome data, we also provide two auxiliary datasets. One is the pharmaco-kinetics data, which consists of wet-lab experiment results for five important PK tasks, along with the drug SMILES strings. The other is the disease risk data, which contains the past trial success rate and description for each disease.

Task. Many prediction tasks can be studied using TOP. In this paper, we focus on trial primary outcome success prediction as a binary classification. Future work can target more granular predictions on different types of outcomes, such as patient enrollment and expected duration.

TOP statistics. TOP consists of 7,062 clinical trials, with 6,483 drugs, 3,820 diseases, and 5,582 ICD-10 codes. Out of these trials, 3,448 (48.8%) succeeded and 3,714 (51.2%) failed. For the pharmaco-kinetics auxiliary dataset, we have 640 drugs.

We create the TOP benchmark for trial outcome prediction from multiple web sources, including a drug knowledge base, disease codes (ICD-10) and historical clinical trials [3].

Trial selection. We apply a series of selection filters to ensure the selected trials have high-quality outcome labels. First, we select trials that have associated publications (results or background information) in medical journals, as these trials have credible sources. Second, we focus on small-molecule drug trials; thus we remove trials with other treatment types such as biologics and behavioral interventions. Third, we remove observational trials and focus on interventional trials (the study type of each record is labelled in ClinicalTrials.gov). Fourth, we select trials that have statistical analysis results for the primary outcome. Each trial in ClinicalTrials.gov is an XML file, which we parse to obtain the variables of interest. For each trial, we obtain the NCT ID (i.e., the identifier of the clinical study), disease names, drugs, brief title and summary, phase, criteria, and statistical analysis results.
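The parsing step can be sketched with the standard library; the tag names below mirror the legacy ClinicalTrials.gov XML layout but should be treated as illustrative, and the record itself is a toy example.

```python
import xml.etree.ElementTree as ET

# Toy record mimicking the legacy ClinicalTrials.gov XML layout
# (tag names follow the old public schema; treat them as illustrative).
record = """
<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <brief_title>Example Small-Molecule Trial</brief_title>
  <phase>Phase 2</phase>
  <study_type>Interventional</study_type>
  <condition>Example Disease</condition>
  <intervention><intervention_name>ExampleDrug</intervention_name></intervention>
</clinical_study>
"""

def parse_trial(xml_text):
    """Extract the fields used by TOP from one trial record."""
    root = ET.fromstring(xml_text)
    get = lambda path: (root.findtext(path) or "").strip()
    return {
        "nct_id": get("id_info/nct_id"),
        "title": get("brief_title"),
        "phase": get("phase"),
        "study_type": get("study_type"),   # used to drop observational trials
        "diseases": [c.text for c in root.findall("condition")],
        "drugs": [i.text for i in root.findall("intervention/intervention_name")],
    }

trial = parse_trial(record)
```

Records whose `study_type` is not "Interventional" would be filtered out at this stage.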

Data Processing and Linking. Next, we describe how we process and link the parsed trial data into machine-learning-ready input and output formats:
• Drug molecule data are extracted from ClinicalTrials.gov and linked to molecule structures (SMILES strings and molecular graph structures) using the DrugBank database [41] (drugbank.com).
• Disease data are extracted from ClinicalTrials.gov and linked to ICD-10 codes and disease descriptions using clinicaltables.nlm.nih.gov, and then to CCS codes via hcup-us.ahrq.gov/toolssoftware/ccs10/ccs10.jsp.
• Trial protocol data are extracted from ClinicalTrials.gov, in particular the study description, outcome and eligibility criteria sections.
• Trial outcome data are determined by parsing the study result section on ClinicalTrials.gov. We use the p-value in the statistical analysis results to induce the primary outcome: the trial is labelled a success if the p-value is less than 0.05 and a failure otherwise.
• Auxiliary drug pharmaco-kinetics data include five datasets across the main categories of PK. For absorption, we use the bioavailability dataset provided in the supplementary material of Ma et al. [29]. For distribution, we use the blood-brain-barrier experimental results from Adenot et al. [1]. For metabolism, we use the CYP2C19 experiment from Veith et al. [39], hosted in the PubChem BioAssay portal under AID 1851. For excretion, we use the clearance dataset from the eDrug3D database [32]. For toxicity, we use the ToxCast dataset [35], provided by MoleculeNet (http://moleculenet.ai). We consider drugs that are non-toxic across all toxicology assays as non-toxic, and toxic otherwise.
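The outcome-labelling rule in the bullets above is a simple threshold on the reported primary-endpoint p-value; a sketch:

```python
def label_trial(p_value, alpha=0.05):
    """Label a trial from the primary-endpoint p-value:
    success (1) if p < alpha, failure (0) otherwise."""
    return 1 if p_value < alpha else 0

labels = [label_trial(p) for p in (0.01, 0.049, 0.05, 0.3)]
```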

Implementation Details

We use our TOP benchmark for model evaluation. The implementation details are described below.
• Molecule Embedding (Section 3.3). For the molecule embedding function f_m(·) in Eq. (5), HINT supports Morgan fingerprints [6], a SMILES encoder [40], and graph message passing neural networks (MPNN) [10, 24, 15]; we choose the MPNN because it usually works best in our experiments. The depth of the MPNN is 3, with hidden dimension 100. The input to the MPNN is the molecular graph with atom and bond features. Following [24, 15], the atom feature is a 38-dimensional vector comprising the atom type (23-dim one-hot vector: 22 frequent atoms and 1 unknown indicator), degree (6-dim one-hot vector over {0, 1, 2, 3, 4, 5}), formal charge (5-dim one-hot vector over {−2, −1, 0, 1, 2}) and chiral configuration (4-dim one-hot vector over {0, 1, 2, 3}). The bond feature is an 11-dimensional vector: the concatenation of the bond type (4-dim one-hot vector over {single, double, triple, aromatic}), whether the bond is in a ring (1-dim), and the cis-trans configuration (6-dim one-hot vector over {0, 1, 2, 3, 4, 5}).
• Disease Embedding (Section 3.3). The disease embedding is obtained by GRAM [8], as defined in Eq. (6). The embedding sizes of h_d (Eq. 6) and e (Eq. 7) are both 100. Following [8], the attention model defined in Eq. (8) is a feed-forward network with a single hidden layer of dimension 100.
• Protocol Embedding (Section 3.3). s_i^I and s_i^E in Eq. (9) are both 768-dimensional vectors. Following [16], the kernel sizes k_1, k_2, k_3, k_4 in Eq. (10) are set to 1, 3, 5 and 7, respectively.
• Interaction Graph (Section 3.5). All node embeddings in the interaction graph have the same dimension (100) to support graph neural network reasoning. The connection between nodes is a one-layer fully-connected neural network followed by a two-layer highway network. The input dimension of the fully-connected network is determined by the number of input nodes, while the output dimension is 100, equal to the hidden dimension of the highway network. For example, to obtain the pharmaco-kinetics node embedding (Eq. 18), the input dimension is 500 = 5 × 100 (the concatenation of h_A, h_D, h_M, h_E, h_T) and the output dimension is 100; to obtain the interaction node embedding (Eq. 19), the input dimension of the one-layer fully-connected network is 300 = 3 × 100 (the concatenation of h_d, h_m, h_p).
• Graph Neural Architecture (Section 3.5). The hidden size of the GCN in Eq. (23) is set to 200 and the dropout rate to 0.6. The feature dimension of the GCN is 100, equal to the node embedding size. The graph attention model defined in Eq. (24) is a two-layer fully-connected neural network: the input size is 200 = 2 × 100 (the concatenation of two nodes' embeddings), the hidden size is 50, and the output is a scalar produced with a sigmoid activation, V_{i,j} ∈ R_+. The depth L of the GNN is set to 3.
• Learning. During both pre-training and training, we use Adam as the optimizer [25]. The learning rate is selected from a small candidate grid and tuned on the validation set, separately for pretraining the ADMET models, pretraining the disease risk model, and training HINT. The maximal number of epochs is 10, and we observe that all models converge within this budget. We save the model after every epoch and choose the checkpoint that performs best on the validation set. All hyperparameters are tuned on the validation set.
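The 38-dimensional atom featurization described above can be sketched as follows; the particular list of 22 frequent atoms is an assumption for illustration (the paper does not enumerate it here), as are the integer chirality codes.

```python
import numpy as np

# Illustrative list of 22 "frequent" atoms (an assumption, not the paper's list).
FREQUENT_ATOMS = ["C", "N", "O", "S", "F", "Cl", "Br", "I", "P", "B", "Si",
                  "Se", "Zn", "Na", "K", "Ca", "Mg", "Fe", "Cu", "Mn", "Al", "Sn"]

def one_hot(value, choices):
    """One-hot encode `value` over a fixed list of choices."""
    vec = np.zeros(len(choices))
    vec[choices.index(value)] = 1.0
    return vec

def atom_features(symbol, degree, formal_charge, chirality):
    """38-dim atom feature: type (22 frequent + 1 unknown), degree {0..5},
    formal charge {-2..2}, chiral configuration {0..3}."""
    atom_type = np.zeros(23)
    atom_type[FREQUENT_ATOMS.index(symbol) if symbol in FREQUENT_ATOMS else 22] = 1.0
    return np.concatenate([
        atom_type,                              # 23 dims
        one_hot(degree, [0, 1, 2, 3, 4, 5]),    # 6 dims
        one_hot(formal_charge, [-2, -1, 0, 1, 2]),  # 5 dims
        one_hot(chirality, [0, 1, 2, 3]),       # 4 dims
    ])

feat = atom_features("C", 3, 0, 0)
```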

Evaluation Settings

We consider two realistic evaluation setups. The first is phase-level evaluation, where we predict the outcome of a single-phase study. Since each phase has a different goal (e.g., Phase I targets safety whereas Phase II and III target efficacy), we evaluate Phase I, II and III individually. We create the test datasets following the FDA guideline on the success-failure ratio for each phase: a 70% success rate for Phase I, 33% for Phase II and 30% for Phase III. Second, we consider indication-level evaluation, where we test whether the drug can pass all three phases for final market approval. To imitate this, we assemble all phase studies related to the drug and disease of the study and use the latest phase protocol as the input to our model. Drugs that succeed in Phase III are labelled positive, and drugs that fail in any of the three phases are labelled negative. Data statistics are shown in Table 2.

Table 2: Data statistics. During training, we randomly select 15% of the training samples for model validation. The earlier trials are used for learning while the later trials are used for inference.

| Settings   | Train Success | Train Failure | Test Success | Test Failure | Split Date     |
|------------|---------------|---------------|--------------|--------------|----------------|
| Phase I    | 702           | 386           | 199          | 113          | Aug 13, 2014   |
| Phase II   | 956           | 1,655         | 302          | 487          | March 20, 2014 |
| Phase III  | 1,820         | 2,493         | 457          | 684          | April 7, 2014  |
| Indication | 1,864         | 2,922         | 473          | 674          | May 21, 2014   |
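The chronological split described in the caption can be sketched as follows, using the Phase I split date; the trial records and their fields are toy examples.

```python
from datetime import date

def split_by_date(trials, split_date):
    """Chronological split: trials registered before split_date train the model;
    trials on or after it are held out for inference."""
    train = [t for t in trials if t["date"] < split_date]
    test = [t for t in trials if t["date"] >= split_date]
    return train, test

trials = [
    {"nct_id": "A", "date": date(2012, 5, 1)},
    {"nct_id": "B", "date": date(2014, 9, 30)},
    {"nct_id": "C", "date": date(2013, 1, 15)},
]
train, test = split_by_date(trials, date(2014, 8, 13))  # Phase I split date
```

Splitting by date (rather than randomly) avoids leaking information from future trials into training.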

Evaluation Metrics

We use the following metrics to measure the performance of all methods.
• PR-AUC (Precision-Recall Area Under Curve).
• F1. The F1 score is the harmonic mean of precision and recall.
• ROC-AUC (Area Under the Receiver Operating Characteristic curve).
• p-value. We report hypothesis-testing results in terms of the p-value to show the statistical significance of our method over the best baseline. If the p-value is smaller than 0.05, we claim our method significantly outperforms the best baseline method.
For PR-AUC, F1, and ROC-AUC, higher values represent better performance. We split the dataset by registration date: the earlier trials are used for learning and the later trials for inference. For example, for the Phase I dataset, we train the model on trials before Aug 13, 2014 and make inferences on trials after that date, as shown in Table 2. We use the bootstrap [7] to estimate the mean and standard deviation of accuracy on the test set.

Baselines. We compare

HINT with several baselines, including both conventional machine learning models and deep learning methods.
• LR (Logistic Regression). It was used in [28] for trial prediction with disease features only. For fair comparison, we adapt it so that the input features include (i) the 1024-dimensional Morgan fingerprint [6], (ii) the GRAM embedding (Eq. 7; GRAM is pretrained using the disease risk module, Eq. 17) and (iii) the BERT embedding of the eligibility criteria. These three features are concatenated as the input of the LR model.
• RF (Random Forest). Similar to LR, it was used in [28] for trial prediction; we adapt it to use the same feature set.
• XGBoost. An implementation of gradient boosted decision trees designed for speed and performance. It was used for individual patient trial outcome prediction in [34]. We adapt it to use the same feature set for general trial outcome prediction.
• AdaBoost (Adaptive Boosting). It was used in [13] for individual Alzheimer's patients' trial result prediction. We adapt it to use the same feature set.
• kNN+RF. [28] leverages statistical imputation techniques to handle missing data and finds that using (1) kNN (k-Nearest Neighbor) for imputation and (2) Random Forest as the classifier achieves the best performance. We adapt this method to use the same feature set.
• FFNN (Feed-Forward Neural Network) [38]. It uses the same features as LR, fed into a three-layer feedforward neural network.
• DeepEnroll [45]. DeepEnroll was originally designed for patient-trial matching. It uses (1) a pre-trained BERT model [11] to encode eligibility criteria into sentence embeddings, (2) a hierarchical embedding model for disease information and (3) an alignment model to capture the protocol-disease interaction. To adapt it to our scenario, the molecule embedding (h_m) is concatenated to the output of the alignment model to make the prediction.
• COMPOSE (cross-modal pseudo-siamese network) [16]. COMPOSE was also originally designed for patient-trial matching. It uses a convolutional highway network and a memory network to encode eligibility criteria and diseases respectively, and an alignment model to capture their interaction. COMPOSE incorporates the molecule information in the same way as DeepEnroll.
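All of the conventional baselines above consume the same concatenated feature vector; a sketch of its assembly, with random toy stand-ins for the three encoders (the 1024/100/768 dimensions follow the implementation details, but the values are not real encodings):

```python
import numpy as np

# Shared baseline input: Morgan fingerprint (1024-d) + GRAM disease embedding
# (100-d, Eq. 7) + BERT embedding of the eligibility criteria (768-d).
fp = np.zeros(1024); fp[[3, 17, 911]] = 1.0        # toy Morgan fingerprint bits
gram = np.random.default_rng(0).normal(size=100)   # toy GRAM embedding
bert = np.random.default_rng(1).normal(size=768)   # toy BERT criteria embedding

x = np.concatenate([fp, gram, bert])               # input to LR/RF/XGBoost/etc.
```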

Exp 1. Phase Level Trial Outcome Prediction

First, we conduct phase-level outcome prediction. For each phase, we train a separate model to make the prediction. We compare HINT with several baseline approaches, covering both conventional machine learning models and deep learning based models. We keep the ratio of learning to inference data at about 4:1; during learning, we hold out 15% of the training data as the validation set. The means and standard deviations of the results are reported. We present the prediction performance in Table 3 and make the following observations:
(1) Deep learning based approaches, including FFNN, DeepEnroll, COMPOSE and HINT, significantly outperform conventional machine learning approaches (LR, RF, XGBoost, AdaBoost, kNN+RF) in outcome prediction for all three phases, validating the benefit of deep learning methods for clinical trial outcome prediction.
(2) Among all the deep learning methods, HINT performs best, with PR-AUC of 0.772 for Phase I, 0.607 for Phase II and 0.623 for Phase III. Compared with the strongest baseline (COMPOSE), a deep learning approach that also uses all the features, HINT achieves 12.4%, 3.5% and 3.0% relative improvement in PR-AUC and 8.2%, 2.7% and 6.5% relative improvement in F1 score. The reason is that HINT incorporates insightful multimodal data embeddings and finer-grained interactions between the multimodal data and the trial components (i.e., the nodes in the interaction graph).
(3) The full HINT performs better than the variant without the pre-trained models (HINT - Pretrain) and the variant without the GNN (HINT - GNN) in both the Phase I and Phase II scenarios. This observation confirms the importance of all modules in HINT.
(4) Comparing prediction performance across Phases I, II and III, we find that Phase I achieves the highest accuracy for almost all methods, while Phase II is the most challenging with the lowest accuracy. This result is consistent with historical trial statistics [37] and the reported accuracy of machine learning models on these tasks [28].

Exp 2. Indication Level Outcome Prediction

The indication-level outcome prediction focuses on predicting whether a trial will pass all three phases. We build a separate model using a combined dataset where the successful trials are the ones that passed Phase III (plus the ones that reached Phase IV), while the failed trials are the ones that failed in any phase from I to III. The results are presented in Table 4. Similar trends are observed. In particular, HINT performs best with 0.703 PR-AUC, 0.765 F1 and 0.793 ROC-AUC, which are 8.2%, 5.7% and 1.2% relative improvements over the strongest baseline (COMPOSE).

In this paper, we propose a general clinical trial outcome prediction task. We design a Hierarchical Interaction Network (HINT) to leverage multi-sourced data and incorporate multiple factors in a hierarchical interaction graph; HINT also handles missing data via an imputation module. Empirical studies indicate that HINT outperforms baseline methods on several metrics, obtaining state-of-the-art predictive performance on indication-level and phase-level outcome prediction.

Table 3: Empirical results of various approaches for phase-level outcome prediction. The mean and standard deviation are reported; entries that are not legible in this copy are marked "–".

Phase I Trials
| Method | PR-AUC | F1 | ROC-AUC |
|---|---|---|---|
| LR | 0.575 | – | – |
| RF | 0.640 | – | – |
| XGBoost | 0.653 | – | – |
| AdaBoost | 0.589 | – | – |
| kNN+RF [28] | 0.616 | – | – |
| FFNN [38] | 0.643 | – | – |
| DeepEnroll [45] | 0.654 | – | – |
| COMPOSE [16] | 0.681 | – | – |
| HINT - Pretrain | 0.701 | – | – |
| HINT - GNN | 0.753 | – | – |
| HINT | 0.772 | – | – |
| p-value | 0.00003 | 0.0001 | 0.0004 |

Phase II Trials
| Method | PR-AUC | F1 | ROC-AUC |
|---|---|---|---|
| LR | 0.489 | – | – |
| RF | 0.578 | – | – |
| XGBoost | 0.571 | – | – |
| AdaBoost | 0.457 | – | – |
| kNN+RF [28] | 0.544 | – | – |
| FFNN [38] | 0.555 | – | – |
| DeepEnroll [45] | 0.560 | – | – |
| COMPOSE [16] | 0.570 | – | – |
| HINT - Pretrain | 0.583 | – | – |
| HINT - GNN | 0.595 | – | – |
| HINT | 0.607 | – | – |
| p-value | 0.008 | 0.012 | 0.048 |

Phase III Trials
| Method | PR-AUC | F1 | ROC-AUC |
|---|---|---|---|
| LR | 0.533 | – | – |
| RF | 0.554 | – | – |
| XGBoost | 0.588 | – | – |
| AdaBoost | 0.560 | – | – |
| kNN+RF [28] | 0.560 | – | – |
| FFNN [38] | 0.576 | – | – |
| DeepEnroll [45] | 0.581 | – | – |
| COMPOSE [16] | 0.589 | – | – |
| HINT - Pretrain | 0.599 | – | – |
| HINT - GNN | – | – | – |
| HINT | 0.623 | – | – |
| p-value | 0.002 | 0.009 | 0.14 |

Table 4: Performance on indication-level prediction. The mean and standard deviation are reported; entries that are not legible in this copy are marked "–".

| Method | PR-AUC | F1 | ROC-AUC |
|---|---|---|---|
| LR | 0.579 | – | – |
| RF | 0.594 | – | – |
| XGBoost | 0.603 | – | – |
| AdaBoost | 0.565 | – | – |
| kNN+RF [28] | 0.594 | – | – |
| FFNN [38] | 0.602 | – | – |
| DeepEnroll [45] | 0.616 | – | – |
| COMPOSE [16] | 0.629 | – | – |
| HINT - Pretrain | 0.642 | – | – |
| HINT - GNN | 0.653 | – | – |
| HINT | 0.703 | 0.765 | 0.793 |
| p-value | 0.002 | 0.001 | 0.0004 |

References

[1] Marc Adenot and Roger Lahana. Blood-brain barrier permeation models: discriminating between potential CNS and non-CNS drugs including P-glycoprotein substrates. Journal of Chemical Information and Computer Sciences, 44(1):239–248, 2004.
[2] Emily Alsentzer, John R Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323, 2019.
[3] Stefan D Anker et al. Welcome to the ICD-10 code for sarcopenia. Journal of Cachexia, Sarcopenia and Muscle, 2016.
[4] Artem V Artemov et al. Integrated deep learned transcriptomic and structure-based predictor of clinical trials outcomes.

bioRxiv, 2016.
[5] S. Biswal et al. Doctor2Vec: dynamic doctor representation learning for clinical trial recruitment. In AAAI, 2020.
[6] Adrià Cereto-Massagué et al. Molecular fingerprint similarity search in virtual screening. Methods, 2015.
[7] Michael R Chernick et al. Bootstrap methods, 2011.
[8] Edward Choi et al. GRAM: graph-based attention model for healthcare representation learning. In KDD, 2017.
[9] Connor W Coley, Regina Barzilay, William H Green, Tommi S Jaakkola, and Klavs F Jensen. Convolutional embedding of attributed molecular graphs for physical property prediction. Journal of Chemical Information and Modeling, 57(8):1757–1772, 2017.
[10] Hanjun Dai et al. Discriminative embeddings of latent variable models for structured data. In ICML, 2016.
[11] Jacob Devlin et al. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[12] Jie Dong et al. ADMETlab: a platform for systematic ADMET evaluation based on a comprehensively collected ADMET database. Journal of Cheminformatics, 2018.
[13] Zhao Fan, Fanyu Xu, Cai Li, and Lili Yao. Application of KPCA and AdaBoost algorithm in classification of functional magnetic resonance imaging of Alzheimer's disease. Neural Computing and Applications, pages 1–10, 2020.
[14] Lawrence M Friedman, Curt D Furberg, David L DeMets, David M Reboussin, and Christopher B Granger. Fundamentals of Clinical Trials. Springer, 2015.
[15] Tianfan Fu, Cao Xiao, and Jimeng Sun. CORE: automatic molecule optimization using copy & refine strategy. In AAAI, 2020.
[16] Junyi Gao, Cao Xiao, Lucas M Glass, and Jimeng Sun. COMPOSE: cross-modal pseudo-siamese network for patient trial matching. In KDD, 2020.
[17] Kaitlyn M Gayvert, Neel S Madhukar, and Olivier Elemento. A data-driven approach to predicting successes and failures of clinical trials. Cell Chemical Biology, 23(10):1294–1301, 2016.
[18] Jayeeta Ghosh et al. Modeling ADMET. In In Silico Methods for Predicting Drug Toxicity, pages 63–83, 2016.
[19] Michael Hay et al. Clinical development success rates for investigational drugs. Nature Biotechnology, 2014.
[20] Zhen Yu Hong, Jooyong Shim, Woo Chan Son, and Changha Hwang. Predicting successes and failures of clinical trials with an ensemble LS-SVR. medRxiv, 2020.
[21] Weihua Hu et al. Strategies for pre-training graph neural networks. In ICLR, 2019.
[22] Kexin Huang et al. ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv:1904.05342, 2019.
[23] Kexin Huang et al. DeepPurpose: a deep learning library for drug–target interaction prediction. Bioinformatics, 2020.
[24] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. ICML, 2018.
[25] Diederik P Kingma and Jimmy Ba. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[26] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. ICLR, 2017.
[27] Heidi Ledford. 4 ways to fix the clinical trial: clinical trials are crumbling under modern economic and scientific pressures. Nature looks at ways they might be saved. Nature, 477(7366):526–529, 2011.
[28] Andrew W. Lo, Kien Wei Siah, and Chi Heem Wong. Machine learning with statistical imputation for predicting drug approvals. Harvard Data Science Review, 1(1), 2019.
[29] Chang-Ying Ma et al. Prediction models of human plasma protein binding rate and oral bioavailability derived by using GA–CG–SVM method. Journal of Pharmaceutical and Biomedical Analysis, 2008.
[30] Linda Martin et al. How much do clinical trials cost? Nature Reviews Drug Discovery, 16(6):381–382, June 2017.
[31] Richard Peto. Clinical trial methodology. Nature, 272(5648):15–16, 1978.
[32] Emilie Pihan et al. e-Drug3D: 3D structure collections dedicated to drug repurposing and fragment-based drug design. Bioinformatics, 2012.
[33] Youran Qi and Qi Tang. Predicting phase 3 clinical trial results by modeling phase 2 clinical trial subject level data using deep learning. In Proceedings of Machine Learning Research, volume 106, pages 288–303, 2019.
[34] Pranav Rajpurkar et al. Evaluation of a machine learning model based on pretreatment symptoms and electroencephalographic features to predict outcomes of antidepressant treatment in adults with depression: a prespecified secondary analysis of a randomized clinical trial. JAMA Network Open, 2020.
[35] Ann M Richard et al. ToxCast chemical landscape: paving the road to 21st century toxicology. Chemical Research in Toxicology, 2016.
[36] Rupesh Kumar Srivastava et al. Training very deep networks. In NIPS, 2015.
[37] David W Thomas, Justin Burns, John Audette, Adam Carroll, Corey Dow-Hygelund, and Michael Hay. Clinical development success rates 2006–2015. BIO Industry Analysis, 1:16, 2016.
[38] Léon-Charles Tranchevent, Francisco Azuaje, and Jagath C. Rajapakse. A deep neural network approach to predicting clinical outcomes of neuroblastoma patients. bioRxiv, 2019.
[39] Henrike Veith, Noel Southall, Ruili Huang, Tim James, Darren Fayne, Natalia Artemenko, Min Shen, James Inglese, Christopher P Austin, David G Lloyd, et al. Comprehensive characterization of cytochrome P450 isozyme selectivity across chemical libraries. Nature Biotechnology, 27(11):1050–1055, 2009.
[40] Sheng Wang et al. SMILES-BERT: large scale unsupervised pre-training for molecular property prediction. In ACM BCB, 2019.
[41] David S Wishart et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Research, 2018.
[42] Yonghui Wu et al. Identifying the status of genetic lesions in cancer clinical trial documents using machine learning. BMC Genomics, 2012.
[43] Jerry Ye, Jyh-Herng Chow, Jiang Chen, and Zhaohui Zheng. Stochastic gradient boosted distributed decision trees. In CIKM, 2009.
[44] Quanzeng You, Zhengyou Zhang, and Jiebo Luo. End-to-end convolutional semantic embeddings. In CVPR, 2018.
[45] Xingyao Zhang et al. Patient-trial matching with deep embedding and entailment prediction. In