Network-principled deep generative models for designing drug combinations as graph sets

Abstract

Combination therapy has been shown to improve therapeutic efficacy while reducing side effects. Importantly, it has become an indispensable strategy to overcome resistance in antibiotics, anti-microbials, and anti-cancer drugs. Facing enormous chemical space and unclear design principles for small-molecule combinations, computational drug-combination design has not seen generative models to meet its potential to accelerate resistance-overcoming drug combination discovery. We have developed the first deep generative model for drug combination design, by jointly embedding graph-structured domain knowledge and iteratively training a reinforcement learning-based chemical graph-set designer. First, we have developed Hierarchical Variational Graph Auto-Encoders (HVGAE) trained end-to-end to jointly embed gene-gene, gene-disease, and disease-disease networks. Novel attentional pooling is introduced here for learning disease representations from associated genes' representations. Second, targeting diseases in learned representations, we have recast the drug-combination design problem as graph-set generation and developed a deep learning-based model with novel rewards. Specifically, besides chemical validity rewards, we have introduced a novel adversarial reward, based on the generalized sliced Wasserstein distance, for chemically diverse molecules with distributions similar to known drugs. We have also designed a network principle-based reward for drug combinations. Numerical results indicate that, compared to graph embedding methods, HVGAE learns more informative and generalizable disease representations. Case studies on four diseases show that network-principled drug combinations tend to have low toxicity. The generated drug combinations collectively cover the disease module similarly to FDA-approved drug combinations and could potentially suggest novel systems-pharmacology strategies.

Full PDF

arXiv [q-bio.MN]; Bioinformatics, doi.10.1093/bioinformatics/xxxxxx

Network-principled deep generative models for designing drug combinations as graph sets

Mostafa Karimi, Arman Hasanzadeh and Yang Shen
Department of Electrical and Computer Engineering and TEES–AgriLife Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, 77843, USA.
= Co-first authors. ∗ To whom correspondence should be addressed.


Abstract

Motivation:

Combination therapy has been shown to improve therapeutic efficacy while reducing side effects. Importantly, it has become an indispensable strategy to overcome resistance in antibiotics, anti-microbials, and anti-cancer drugs. Facing enormous chemical space and unclear design principles for small-molecule combinations, computational drug-combination design has not seen generative models to meet its potential to accelerate resistance-overcoming drug combination discovery.

Results:

We have developed the first deep generative model for drug combination design, by jointly embedding graph-structured domain knowledge and iteratively training a reinforcement learning-based chemical graph-set designer. First, we have developed Hierarchical Variational Graph Auto-Encoders (HVGAE) trained end-to-end to jointly embed gene-gene, gene-disease, and disease-disease networks. Novel attentional pooling is introduced here for learning disease representations from associated genes' representations. Second, targeting diseases in learned representations, we have recast the drug-combination design problem as graph-set generation and developed a deep learning-based model with novel rewards. Specifically, besides chemical validity rewards, we have introduced a novel adversarial reward, based on the generalized sliced Wasserstein distance, for chemically diverse molecules with distributions similar to known drugs. We have also designed a network principle-based reward for drug combinations. Numerical results indicate that, compared to state-of-the-art graph embedding methods, HVGAE learns more informative and generalizable disease representations. Results also show that the deep generative models generate drug combinations following the principle across diseases. Case studies on four diseases show that network-principled drug combinations tend to have low toxicity. The generated drug combinations collectively cover the disease module similarly to FDA-approved drug combinations and could potentially suggest novel systems-pharmacology strategies. Our method allows for examining and following a network-based principle or hypothesis to efficiently generate disease-specific drug combinations in a vast chemical combinatorial space.

Availability: https://github.com/Shen-Lab/Drug-Combo-Generator

Contact: [email protected]

Supplementary information: Supplementary data are available at https://github.com/Shen-Lab/Drug-Combo-Generator/blob/master/SI_drugcomb_RL.pdf

Drug resistance is a fundamental barrier to developing robust antimicrobial and anticancer therapies (Taubes, 2008; Housman et al., 2014). Its first sign was observed in the 1940s soon after the discovery of penicillin (Abraham and Chain, 1940), the first modern antibiotic. Since then, drug resistance has surfaced and progressed in infectious diseases such as HIV (Clavel and Hance, 2004), tuberculosis (TB) (Dooley et al., 1992) and hepatitis (Ghany and Liang, 2007) as well as cancers (Holohan et al., 2013). Mechanistically, it can emerge through drug efflux (Chang and Roth, 2001), activation of alternative pathways (Lovly and Shaw, 2014) and protein mutations (Toy et al., 2013; Balbas et al., 2013) while decreasing the efficacy of drugs.

Combination therapy is a resistance-overcoming strategy that has found success in combating HIV (Shafer and Vuitton, 1999), TB (Ramón-García et al., 2011), cancers (Sharma and Allison, 2015; Bozic et al., 2013) and so on. Considering that most diseases and their resistances are multifactorial (Kaplan and Junien, 2000; Keith et al., 2005), multiple drugs targeting multiple components simultaneously could confer less resistance than individual drugs targeting components separately. Examples include targeting both MEK and BRAF in patients with BRAF V600-mutant melanoma rather than targeting MEK or BRAF alone (Madani Tonekaboni et al., 2018; Flaherty et al., 2012). The effect of a drug combination is usually categorized as synergistic, additive, or antagonistic depending on whether it is greater than, equal to or less than the sum of individual drug effects (Chou, 2006). Synergistic combinations are effective at delaying the onset of resistance, whereas antagonistic combinations are effective at suppressing the expansion of resistance (Saputra et al., 2018; Singh and Yeh, 2017), representing offensive and defensive strategies to overcome drug resistance. In particular, offensive strategies cause huge early casualties whereas defensive ones anticipate and develop protection against future threats (Saputra et al., 2018).

Discovering a drug combination to overcome resistance is however extremely challenging, even more so than discovering a single drug, which is already a costly (~billions of USD) (DiMasi et al., 2016) and lengthy (~12 years) (Van Norman, 2016) process with low success rates (3.4% of phase-1 oncology compounds make it to approval and market) (Wong et al., 2019). An apparent challenge, a combinatorial one, is in the scale of chemical space, which is estimated to be $10^{60}$ for single compounds (Bohacek et al., 1996) and can "explode" to $10^{60K}$ for $K$-compound combinations. Even if the space is restricted to FDA-approved human drugs, the number of pairwise combinations remains very large. Another challenge, a conceptual one, is in the complexity of systems biology. On top of on-target efficacy and off-target side effects or even toxicity that need to be considered for individual drugs, network-based design principles are much needed for drug combinations that effectively target multiple proteins in a disease module and have low toxicity or even resistance profiles (Martínez-Jiménez and Marti-Renom, 2016; Billur Engin et al., 2014).

Current computational models in drug discovery, especially those for predicting pharmacokinetic and pharmacodynamic properties of individual drugs/compounds, can be categorized into discriminative and generative models. Discriminative models predict the distribution of a property for a given molecule whereas generative models would learn the joint distribution on the property and molecules. For instance, discriminative models have been developed for predicting single compounds' toxicities, based on support vector machines (Darnag et al., 2010), random forest (Svetnik et al., 2003) and deep learning (Mayr et al., 2016). Whereas discriminative models are useful for evaluating given compounds or even searching compound libraries, generative models can effectively design compounds of desired properties in chemical space. Recent advances in inverse molecular design have seen deep generative models such as SMILES representation-based reinforcement learning (Popova et al., 2018) or recurrent neural networks (RNNs) as well as graph representation-based generative adversarial networks (GANs), reinforcement learning (You et al., 2018), and generative tensorial reinforcement learning (GENTRL) (Zhavoronkov et al., 2019).

Unlike single drug design, current computational efforts for drug combinations are exclusively focused on discriminative models and lack generative models. The main focus for drug combinations is to use discriminative models to identify synergistic or antagonistic drugs for a given disease. Examples include the Chou-Talalay method (Chou, 2010), integer linear programming (Pang et al., 2014), and deep learning (Preuer et al., 2017). However, it is daunting if not infeasible to enumerate all cases in the enormous chemical combinatorial space and evaluate their combination effects using a discriminative model. Moreover, such methods often lack explainability.

Directly addressing the aforementioned combinatorial and conceptual challenges and filling the void of generative models for drug combinations, in this study we develop network-based representation learning for diseases and deep generative models for accelerated and principled drug combination design (the general case of $K$ drugs). Recently, by analyzing the network-based relationships between disease proteins and drug targets in the human protein–protein interactome, Cheng et al. proposed an elegant principle for FDA-approved drug combinations: the targets of the two drugs both hit the disease module but cover different neighborhoods. Our methods allow for examining and following the proposed network-based principle (Cheng et al., 2019) to efficiently generate disease-specific drug combinations in a vast chemical combinatorial space.
They will also help meet a critical need for computational tools in the battle against quickly evolving bacterial, viral and tumor populations with accumulating resistance.

To tackle the problem, we have developed a network principle-based deep generative model for faster, broader and deeper exploration of drug combination space by following the principle underlying FDA-approved drug combinations. First, we have developed Hierarchical Variational Graph Auto-Encoders (HVGAE) for jointly embedding the disease-disease network and the gene-gene network. Through end-to-end training, we embed genes in a way that they can represent the human interactome. Then, we utilize their embeddings with novel attentional pooling to create features for each disease so that we can embed diseases more accurately. Second, we have also developed a reinforcement learning-based graph-set generator for drug combination design by utilizing both gene/disease embeddings and network principles. Besides those for chemical validity and properties, our rewards also include 1) a novel adversarial reward, generalized sliced Wasserstein distance, that fosters generated molecules to be diverse yet similar in distribution to known compounds (ZINC database and FDA-approved drugs) and 2) a network principle-based reward for drug combinations that is feasible for online calculations. The overall schematics are shown in Fig. 1 and details in Sec. 3.

Fig. 1: Overall schematics of the proposed approach for generating disease-specific drug combinations.

We used the human interactome data (a gene-gene network) from (Menche et al., 2015) that feature 13,460 proteins interconnected by 141,296 interactions. We introduced edge features for the human interactome based on the biological nature of edges (interactions). The interactome was compiled by combining experimental support from various sources/databases including 1) regulatory interactions from TRANSFAC (Matys et al., 2003); 2) binary interactions from high-throughput (including (Rolland et al., 2014)) and literature-curated datasets (including IntAct (Aranda et al., 2010) and MINT (Ceol et al., 2010)) as well as literature-curated interactions from low-throughput experiments (IntAct, MINT, BioGRID (Stark et al., 2010), and HPRD (Keshava Prasad et al., 2009)); 3) metabolic enzyme-coupled interactions from (Lee et al., 2008); 4) protein complexes from CORUM (Ruepp et al., 2010); 5) kinase-substrate pairs from PhosphositePlus (Hornbeck et al., 2012); and 6) signaling interactions. In summary, an edge could correspond to one or multiple physical interaction types. We therefore used a 6-hot encoding for edge features, based on whether an edge corresponds to regulatory, binary, metabolic, complex, kinase and signaling interactions.

We also introduced features for nodes (genes) in the human interactome based on 1) KEGG pathways (Kanehisa et al., 2002) (336 features) queried through Biopython (Cock et al., 2009); 2) Gene Ontology (GO) terms (Ashburner et al., 2000) including biological process (30,769 features), molecular function (12,183 features), and cellular component (4,451 features), mapped using the NCBI Gene2Go dataset; 3) disease-gene associations from the database OMIM (Mendelian Inheritance in Man) (Hamosh et al., 2005) and the results from Genome-Wide Association Studies (GWAS) (Mottaz et al., 2008; Ramos et al., 2014) (299 features). The last 299 features correspond to 299 diseases represented by the Medical Subject Headings (MeSH) vocabulary (Rogers, 1963). After removing those genes without KEGG pathway information, the human interactome used in this study has 13,119 genes and 352,464 physical interactions.
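The 6-hot edge encoding described above can be illustrated with a minimal sketch. The ordering of the six interaction types, the function names, and the toy three-gene interactome are assumptions made for illustration only; they are not taken from the released code.

```python
# Interaction types as listed in the text; the index order is an assumption.
EDGE_TYPES = ["regulatory", "binary", "metabolic", "complex", "kinase", "signaling"]


def encode_edge(types):
    """Return a 6-dimensional multi-hot vector for one edge's interaction types."""
    unknown = set(types) - set(EDGE_TYPES)
    if unknown:
        raise ValueError(f"unknown interaction types: {sorted(unknown)}")
    return [1 if t in types else 0 for t in EDGE_TYPES]


def build_adjacency_tensor(n_genes, edges):
    """Build one symmetric 0/1 adjacency matrix per interaction type.

    edges maps an undirected gene pair (i, j) to a set of type names.
    """
    tensor = [[[0] * n_genes for _ in range(n_genes)] for _ in EDGE_TYPES]
    for (i, j), types in edges.items():
        for t, bit in enumerate(encode_edge(types)):
            if bit:
                tensor[t][i][j] = 1
                tensor[t][j][i] = 1
    return tensor


# Hypothetical toy interactome with three genes.
toy_edges = {(0, 1): {"binary", "kinase"}, (1, 2): {"regulatory"}}
A = build_adjacency_tensor(3, toy_edges)
```

An edge supported by several sources simply sets several bits, which is why the encoding is multi-hot rather than one-hot.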

We used a disease-disease network from (Menche et al., 2015) with 299 nodes (diseases), created based on human interactome data (as detailed earlier), gene expression data (Su et al., 2004), disease-gene associations (Mottaz et al., 2008; Ramos et al., 2014; Hamosh et al., 2005), Gene Ontology (Ashburner et al., 2000), symptom similarity (Zhou et al., 2014) and comorbidity (Hidalgo et al., 2009). The original disease-disease network is a complete graph with real-valued edges. The edge value between two diseases shows how much they are topologically separated from each other. A positive/negative edge weight indicates that the two disease modules are topologically separated/overlapped. Therefore, we used zero weight as the threshold and pruned positive-valued edges, which results in a disease-disease network of 299 nodes and 5,986 edges (without weights).
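The pruning step can be sketched as follows. The function name and the toy separation values are hypothetical, and keeping pairs whose value is exactly zero is an assumption, since the text only states that positive-valued edges are pruned.

```python
def prune_disease_network(separation):
    """Turn a complete, real-valued disease-disease graph into an unweighted one.

    separation maps a disease pair (a, b) to its topological separation.
    Positive values (separated modules) are pruned; the remaining pairs
    become unweighted edges. Keeping exact zeros is an assumption.
    """
    return {frozenset(pair) for pair, s in separation.items() if s <= 0}


# Hypothetical separation values for three diseases.
toy_separation = {("a", "b"): -0.5, ("a", "c"): 0.3, ("b", "c"): -0.1}
toy_network = prune_disease_network(toy_separation)
```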

We used disease-gene associations from the database OMIM (Hamosh et al., 2005). These associations bridge the aforementioned gene-gene and disease-disease networks into a hierarchical graph of genes and diseases, based on which gene and disease representations will be learned.

For the purpose of assessment, we used the Comparative Toxicogenomics Database (CTD) (Davis et al., 2019) to classify diseases into 8 classes based on their Disease Ontology (DO) terms (Schriml et al., 2012), where diseases are represented in the MeSH vocabulary (Rogers, 1963). In the CTD database only 201 of the 299 diseases have a corresponding DO term. Therefore, for the 98 diseases with missing DO terms we considered the majority of their parents' DO terms, if applicable, as their DO terms. With this approach, we assigned DO terms to 66 such diseases and classified 267 of the 299 diseases. The 32 diseases with DO terms still missing are usually at the top layers of the MeSH tree.
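The majority-of-parents rule can be sketched as below. The data structures, the function name, and the tie-breaking behavior (the first-seen class wins a tie) are assumptions for illustration; the text does not specify how ties are resolved.

```python
from collections import Counter


def assign_missing_classes(known, parents):
    """Fill in classes for diseases lacking a DO term by majority vote.

    known   maps a disease to its class (from its DO term).
    parents maps a disease to its parent diseases in the MeSH tree.
    A disease stays unclassified when none of its parents has a class.
    """
    assigned = dict(known)
    for disease, parent_list in parents.items():
        if disease in known:
            continue
        votes = Counter(known[p] for p in parent_list if p in known)
        if votes:
            # most_common(1) returns the first-seen class on ties (assumption).
            assigned[disease] = votes.most_common(1)[0][0]
    return assigned


# Hypothetical toy data: D4 inherits "cancer"; D5 has no classified parent.
toy_known = {"D1": "cancer", "D2": "cancer", "D3": "metabolic"}
toy_parents = {"D4": ["D1", "D2", "D3"], "D5": ["D9"]}
toy_result = assign_missing_classes(toy_known, toy_parents)
```

The counts in the text are consistent with this procedure: 201 diseases with DO terms plus 66 assigned by vote give 267 classified diseases, leaving 299 − 267 = 32 unclassified.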

To assess our deep generative model for drug combination design (to be detailed in Sec. 3.2), we consider a comprehensive list of US FDA-approved combination drugs (1940–2018.9) (Das et al., 2018). The dataset contains 419 drug combinations consisting of 328 unique drugs, including 341 (81%) double, 67 (16%) triple and 11 (3%) quadruple drug combinations. We also utilized the curated drug-disease associations from the CTD database (Davis et al., 2019).

We have developed a network-based drug combination generator which can be utilized in overcoming drug resistance. Representing drugs through their molecular graphs, we recast the problem of drug combination generation into network-principled, graph-set generation by incorporating prior knowledge such as human interactome (gene-gene), disease-gene, disease-disease, gene pathway, and gene-GO relationships. Furthermore, we formulate the graph-set generation problem as learning a Reinforcement Learning (RL) agent that iteratively adds substructures and edges to each molecular graph in a chemistry- and system-aware environment. To that end, the RL model is trained to maximize a desired property $Q$ (for example, therapeutic efficacy for drug combinations) while following the valency (chemical validity) rules and being similar in distribution to the prior set of graphs.

As shown in Fig. 1, the proposed approach consists of: 1) embedding prior knowledge (different network relationships) through Hierarchical Variational Graph Auto-Encoders (HVGAE); and 2) generating drug combinations as graph sets through a reinforcement learning algorithm, which will be detailed next.

Notations:

As both gene-gene and disease-disease networks can be represented as graphs, notations are differentiated by superscripts 'g' and 'd' to indicate gene-gene and disease-disease networks, respectively. Drugs (compounds) are also represented as graphs, and notations with '$k$' in the superscript indicate the $k$-th drug (graph) in the drug combination (graph set).

Suppose that a gene-gene network is represented as a graph $G^{(g)} = (A^{(g)}, \{F^{(g,m)}\}_{m=1}^{M})$, where $A^{(g)} = [A^{(g,1)}, \cdots, A^{(g,n_e)}] \in \{0,1\}^{n_g \times n_g \times n_e}$ is the adjacency tensor of the gene-gene network with $n_g$ nodes and $n_e$ edge types ($k$-hot encoding of the 6 types of aforementioned physical interactions: regulatory, binary, metabolic, complex, kinase and signaling interactions). We also define $\tilde{A}^{(g)} \in \{0,1\}^{n_g \times n_g}$ to be the elementwise OR of $\{A^{(g,1)}, \cdots, A^{(g,n_e)}\}$. Furthermore, $F^{(g,m)}$ denotes the $m$th set of node features for the gene-gene network, where $M$ (5 in the study) represents different types of node features such as pathways, 3 GO terms and gene-disease relationships. We also suppose the disease-disease network is represented as a graph $G^{(d)} = (A^{(d)}, F^{(d)})$, where $A^{(d)} \in \{0,1\}^{n_d \times n_d}$ is the adjacency matrix of the disease-disease network with $n_d$ nodes, and $F^{(d)}$ represents the set of node features for the disease-disease network.

We have developed a hierarchical embedding with 2 levels. In the first level, we embed the gene-gene network to get the features related to each disease, and then we incorporate the disease features within the disease-disease network to embed their relationship. We infer the embedding for each gene and disease jointly through end-to-end training. The proposed HVGAE performs probabilistic auto-encoding to capture the uncertainty of representations, which is in the same spirit as the variational graph auto-encoder models introduced in (Kipf and Welling, 2016; Hasanzadeh et al., 2019; Hajiramezanali et al., 2019).
The inference model for variational embedding of the gene-gene network is formulated as follows. We first use $M$ graph neural networks (GNNs) to transform individual nodes' features in $M$ types and then concatenate the $M$ sets of results $\hat{F}^{(g,m)}$ ($m = 1, \ldots, M$) into $\hat{F}^{(g)}$:

$$
\begin{aligned}
\hat{F}^{(g,m)} &= \mathrm{AGG}\big(\{\mathrm{GNN}_j(A^{(g,j)}, F^{(g,m)})\},\ j = 1, \cdots, n_e\big), \\
\hat{F}^{(g,m)} &\in \mathbb{R}^{n_g \times L_g}, \quad m = 1, \cdots, M, \\
\hat{F}^{(g)} &= \mathrm{CONCAT}\big(\{\hat{F}^{(g,m)}\}_{m=1}^{M}\big) \in \mathbb{R}^{n_g \times M L_g},
\end{aligned}
\tag{1}
$$

where AGG is an aggregation function combining output features of the $\mathrm{GNN}_j$'s for each node. We used a two-layer fully connected neural network with ReLU activation functions followed by a single linear layer in our implementation. We then approximate the posterior distribution of stochastic latent variables $Z^{(g)}$ (containing $z^{(g)}_i \in \mathbb{R}^{L_g}$ for $i = 1, \cdots, n_g$, where $L_g$ (32 in this study) is the latent space dimensionality for the $i$th gene) with a multivariate Gaussian distribution $q(\cdot)$ given the gene-gene network's aggregated node features $\hat{F}^{(g)}$ and adjacency tensor $A^{(g)}$:

$$
\begin{aligned}
q(Z^{(g)} \mid \hat{F}^{(g)}, A^{(g)}) &= \prod_{i=1}^{n_g} q(z^{(g)}_i \mid \hat{F}^{(g)}, A^{(g)}), \\
\text{where } q(z^{(g)}_i \mid \hat{F}^{(g)}, A^{(g)}) &= \mathcal{N}\big(\mu^{(g)}_i, \mathrm{diag}((\sigma^{(g)}_i)^2)\big), \\
\mu^{(g)} &= \mathrm{AGG}\big(\{\mathrm{GNN}_{\mu,g,j}(A^{(g,j)}, \hat{F}^{(g)})\},\ j = 1, \cdots, n_e\big), \\
\log(\sigma^{(g)}) &= \mathrm{AGG}\big(\{\mathrm{GNN}_{\sigma,g,j}(A^{(g,j)}, \hat{F}^{(g)})\},\ j = 1, \cdots, n_e\big), \\
\mu^{(g)} &\in \mathbb{R}^{n_g \times L_g}, \quad \log(\sigma^{(g)}) \in \mathbb{R}^{n_g \times L_g},
\end{aligned}
\tag{2}
$$

where $Z^{(g)} \in \mathbb{R}^{n_g \times L_g}$; $\mu^{(g)}$ is the matrix of mean vectors $\mu^{(g)}_i$; and $\sigma^{(g)}$ the matrix of standard deviation vectors $\sigma^{(g)}_i$ ($i = 1, \ldots, n_g$).

The generative model for the gene-gene network is formulated as:

$$
p(\tilde{A}^{(g)} \mid Z^{(g)}) = \prod_{i=1}^{n_g} \prod_{j=1}^{n_g} p(\tilde{A}^{(g)}_{ij} \mid z^{(g)}_i, z^{(g)}_j), \quad \text{where } p(\tilde{A}^{(g)}_{ij} \mid z^{(g)}_i, z^{(g)}_j) = \sigma\big(z^{(g)}_i {z^{(g)}_j}^{T}\big),
\tag{3}
$$

and $\sigma(\cdot)$ is the logistic sigmoid function. The loss for gene-gene variational embedding is represented as a variational lower bound (ELBO):

$$
\mathcal{L}^{(g)} = \mathbb{E}_{q(Z^{(g)} \mid \hat{F}^{(g)}, A^{(g)})}\big[\log p(\tilde{A}^{(g)} \mid Z^{(g)})\big] - \mathrm{KL}\big(q(Z^{(g)} \mid \hat{F}^{(g)}, A^{(g)}) \,\|\, p(Z^{(g)})\big),
\tag{4}
$$

where $\mathrm{KL}\big(q(\cdot) \,\|\, p(\cdot)\big)$ is the Kullback-Leibler divergence between $q(\cdot)$ and $p(\cdot)$. We take the Gaussian prior for $p(Z^{(g)})$ and make use of the reparameterization trick (Kipf and Welling, 2016) for training.
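To make Eqs. (2)-(4) concrete, the following is a minimal sketch of a single-sample ELBO estimate for a tiny graph. It assumes a standard normal prior $p(Z) = \mathcal{N}(0, I)$ and takes the means and log standard deviations as given inputs rather than GNN outputs; the function names are hypothetical and this is not the authors' implementation.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def reparameterize(mu, log_sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    return [[m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(m_row, ls_row)]
            for m_row, ls_row in zip(mu, log_sigma)]


def kl_to_standard_normal(mu, log_sigma):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over nodes and dimensions."""
    return sum(0.5 * (math.exp(2.0 * ls) + m * m - 1.0 - 2.0 * ls)
               for m_row, ls_row in zip(mu, log_sigma)
               for m, ls in zip(m_row, ls_row))


def log_likelihood(adj, z):
    """Bernoulli log-likelihood of a 0/1 adjacency under the inner-product decoder."""
    total = 0.0
    for i, zi in enumerate(z):
        for j, zj in enumerate(z):
            p = sigmoid(sum(a * b for a, b in zip(zi, zj)))
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
            total += math.log(p) if adj[i][j] else math.log(1.0 - p)
    return total


def elbo(adj, mu, log_sigma, rng):
    """Single-sample Monte Carlo estimate of the lower bound in Eq. (4)."""
    z = reparameterize(mu, log_sigma, rng)
    return log_likelihood(adj, z) - kl_to_standard_normal(mu, log_sigma)


# Hypothetical two-node graph with 2-dimensional latent variables.
toy_adj = [[0, 1], [1, 0]]
toy_mu = [[0.5, -0.2], [0.4, 0.1]]
toy_log_sigma = [[-1.0, -1.0], [-1.0, -1.0]]
toy_elbo = elbo(toy_adj, toy_mu, toy_log_sigma, random.Random(0))
```

Training would maximize this quantity (or minimize its negative) with respect to the encoder parameters; here the encoder is omitted to keep the sketch self-contained.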
The inference model for variational embedding of the disease-disease network is similar to that of the gene-gene network except that the disease-disease network's aggregated node features, $\hat{F}^{(d)}$, are derived through parameterized attentional pooling of $\hat{Z}^{(g)}_r$, the latent variables of genes associated with the $r$th disease (a subset of $Z^{(g)}$):

$$
\begin{aligned}
e_r &= v \tanh\big(\hat{Z}^{(g)}_r W + b\big), & r &= 1, \cdots, n_d, \\
\alpha_r &= \mathrm{softmax}(e_r), & r &= 1, \cdots, n_d, \\
\hat{F}^{(d)}_r &= \sum_i \alpha_{r,i} \hat{Z}^{(g)}_{r,i}, & r &= 1, \cdots, n_d, \\
\hat{F}^{(d)} &= \mathrm{CONCAT}\big(\{\hat{F}^{(d)}_r\}_{r=1}^{n_d}\big) \in \mathbb{R}^{n_d \times L_d},
\end{aligned}
\tag{5}
$$

where $\alpha_r$ captures the importance of genes related to the $r$th disease for calculating its latent representation and $L_d$ is the latent space dimensionality of a disease.

Once $\hat{F}^{(d)}$, the disease-disease network's aggregated node features for all diseases, are derived, we again define $q(Z^{(d)} \mid \hat{F}^{(d)}, A^{(d)})$ for the posterior distribution of stochastic latent variables $Z^{(d)}$ similarly to what we did in Eq. (2), except that AGG functions are removed since the disease-disease network has one binary adjacency matrix; give the generative decoder $p(A^{(d)} \mid Z^{(d)})$ for embedding the disease-disease network similarly to what we did in Eq. (3); and calculate the variational lower bound (ELBO) loss $\mathcal{L}^{(d)}$ for the disease-disease network similarly to what we did in Eq. (4). Details can be found in Supplemental Sec. 1.1.

Both levels of our proposed HVGAE, i.e. gene-gene and disease-disease variational graph representation learning, are jointly trained in an end-to-end fashion using the following overall loss:

$$
\mathcal{L}_{\mathrm{HVGAE}} = \mathcal{L}^{(d)} + \mathcal{L}^{(g)}.
\tag{6}
$$

In this section, we introduce the reinforcement learning-based drug combination generator.
We will detail 1) the state space of graph sets ($K$ compounds) and the action space of graph-set growth; 2) multi-objective rewards including chemical validity and our generalized sliced Wasserstein reward for individual drugs as well as our newly designed network principle-based reward for drug combinations; and 3) the policy network that learns to take actions in the rewarding environment.

We represent a graph set (drug combination) with $K$ graphs as $G = \{G^{(k)}\}_{k=1}^{K}$. Each graph is $G^{(k)} = (A^{(k)}, E^{(k)}, F^{(k)})$, where $A^{(k)} \in \{0,1\}^{n_k \times n_k}$ is the adjacency matrix, $F^{(k)} \in \mathbb{R}^{n_k \times \phi}$ the node feature matrix, $E^{(k)} \in \{0,1\}^{\epsilon \times n_k \times n_k}$ the edge-conditioned adjacency tensor, and $n_k$ the number of vertices for the $k$th graph, respectively; $\phi$ is the number of features per node and $\epsilon$ the number of edge types.

The state space $\mathcal{G}$ is the set of all $K$-graph sets with different numbers and types of nodes or edges. Specifically, the state of the environment $s_t$ at iteration $t$ is defined as the intermediate graph set $G_t = \{G^{(k)}_t\}_{k=1}^{K}$ generated so far, which is fully observable by the RL agent.

The action space is the set of edges that can be added to the graph set. An action $a_t$ at iteration $t$ is analogous to link prediction in each graph in the set. More specifically, a link can either connect a new subgraph (a single node/atom or a subgraph/drug-substructure) to a node in $G^{(k)}_t$ or connect existing nodes within graph $G^{(k)}_t$. The actions can be interpreted as connecting the current graph with a member of the scaffold subgraph set $C$. Mathematically, for $G^{(k)}_t$, graph $k$ at step $t$, the action $a^{(k)}_t$ is the quadruple $a^{(k)}_t = \mathrm{concat}\big(a^{(k)}_{\mathrm{first},t}, a^{(k)}_{\mathrm{second},t}, a^{(k)}_{\mathrm{edge},t}, a^{(k)}_{\mathrm{stop},t}\big)$.

We have defined a multi-objective reward $R_t$ to satisfy certain requirements in drug combination therapy. First, a chemical validity reward maintains that individual compounds are chemically valid.
Second, a novel adversarial reward, generalized sliced Wasserstein GAN (GS-WGAN), encourages generated compounds to be synthesizable and "drug-like" by following the distribution of synthesizable compounds in the ZINC database (Irwin and Shoichet, 2005) or FDA-approved drugs. Third, a network principle-based reward encourages individual drugs to target the desired disease module but not to overlap in their target sets. Toxicity due to drug-drug interactions can also be included as a reward. It is intentionally left out in this study so that toxicity can be evaluated for drug combinations designed to follow the network principle.

When training the RL agent, we use different reward combinations in different stages. We first use only the weighted combination of chemical validity and GS-WGAN rewards, learning over drug combinations for all diseases; then we remove the penalized logP (Pen-logP) portion of chemical validity and add the adversarial loss again while learning over drug combinations for all diseases; and finally we use the combination of the three rewards as in the second stage but focusing on a target disease and possibly on restricted actions/scaffolds (in a spirit similar to transfer learning). The three types of rewards are detailed as follows.

Chemical validity reward for individual drugs.

A small positive reward is assigned if the action does not violate valency rules. Otherwise a small negative reward is assigned. This is an intermediate reward added at each step. Another reward is on penalized logP (lipophilicity, where $P$ is the octanol-water partition coefficient) or Pen-logP values. The design and the parameters of this reward are adopted from (You et al., 2018) without optimization.
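The intermediate valency reward can be sketched as follows. The valence table is a simplified assumption, and the reward magnitudes (`bonus`, `penalty`) are hypothetical placeholders rather than the values adopted from You et al. (2018); a real implementation would typically rely on a cheminformatics toolkit for the valency check.

```python
# Simplified maximum valences for a few elements (an illustrative assumption).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}
BOND_ORDER = {"single": 1, "double": 2, "triple": 3}


def step_reward(atoms, bonds, new_bond, bonus=0.1, penalty=-0.1):
    """Score one proposed bond and return (reward, bonds_after_step).

    atoms    is a list of element symbols.
    bonds    is a list of (i, j, bond_type) already in the molecule.
    new_bond is the proposed (i, j, bond_type).
    A proposal that exceeds an atom's maximum valence earns the penalty
    and leaves the bond list unchanged.
    """
    used = [0] * len(atoms)
    for i, j, bond_type in bonds + [new_bond]:
        used[i] += BOND_ORDER[bond_type]
        used[j] += BOND_ORDER[bond_type]
    i, j, _ = new_bond
    valid = all(used[a] <= MAX_VALENCE[atoms[a]] for a in (i, j))
    if valid:
        return bonus, bonds + [new_bond]
    return penalty, bonds


# Hypothetical example: growing CO2 from a C=O fragment.
toy_atoms = ["C", "O", "O"]
toy_bonds = [(0, 1, "double")]
ok_reward, ok_bonds = step_reward(toy_atoms, toy_bonds, (0, 2, "double"))
bad_reward, bad_bonds = step_reward(toy_atoms, toy_bonds, (0, 2, "triple"))
```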

Adversarial reward using generalized sliced Wasserstein distance (GSWD).

To ensure that the generated molecules resemble a given set of molecules (such as those in ZINC or FDA-approved drugs), we deploy Generative Adversarial Networks (GANs). GANs are very successful at modeling high-dimensional distributions from given samples. However, they are known to suffer from training instability and can fail to generate diverse samples (a phenomenon known as mode collapse).

Wasserstein GANs (WGANs) have been shown to improve stability and mitigate mode collapse by replacing the Jensen-Shannon divergence in the original GAN formulation with the Wasserstein distance (WD) (Arjovsky et al., 2017). More specifically, the objective function in WGAN with gradient penalty (Gulrajani et al., 2017) is defined as follows:

$$
\min_{\theta} \max_{\phi} V_W(\pi_\theta, D_\phi) + \lambda R(D_\phi),
\tag{7}
$$

with

$$
V_W(\pi_\theta, D_\phi) = \mathbb{E}_{x \sim p_r}[\log D_\phi(x)] - \mathbb{E}_{y \sim \pi_\theta}[\log D_\phi(y)],
$$

where $p_r$ is the data distribution, $\lambda$ is a hyper-parameter, $R$ is the Lipschitz continuity regularization term, $D_\phi$ is the critic with parameters $\phi$, and $\pi_\theta$ is the policy (generator) with parameters $\theta$.

Despite the theoretical advantages of WGANs, solving Eq. (7) is computationally expensive and intractable for high-dimensional data. To overcome this problem, we propose and formulate a novel Generalized Sliced WGAN (GS-WGAN) which deploys the Generalized Sliced Wasserstein Distance (GSWD) (Kolouri et al., 2019). GSWD first factorizes high-dimensional probabilities into multiple marginal 1D distributions with the generalized Radon transform, whose operator we denote by $\mathcal{R}$. Then, by taking advantage of the closed-form solution of the Wasserstein distance in 1D, the distance between two distributions is approximated by the sum of Wasserstein distances of the marginal 1D distributions.
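The slicing idea can be illustrated with a minimal sketch for the linear-projection case. Two simplifications are assumed here and are not taken from the paper: random unit directions with a Monte Carlo average stand in for the integral over the hypersphere, and the 1D distance is the order-1 Wasserstein distance between equal-size empirical samples. Function names are hypothetical.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a direction uniformly from the unit hypersphere in d dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def wasserstein_1d(a, b):
    """Closed-form 1D Wasserstein-1 distance between two equal-size samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Average the 1D distances of linear projections onto random directions."""
    rng = random.Random(seed)
    d = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        psi = random_unit_vector(d, rng)
        proj_x = [sum(a * b for a, b in zip(x, psi)) for x in xs]
        proj_y = [sum(a * b for a, b in zip(y, psi)) for y in ys]
        total += wasserstein_1d(proj_x, proj_y)
    return total / n_projections


# Hypothetical 3-dimensional embeddings of two small molecule sets.
toy_x = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_y = [[p[0] + 2.0, p[1], p[2]] for p in toy_x]
toy_distance = sliced_wasserstein(toy_x, toy_y)
```

Sorting makes each 1D distance cost $O(n \log n)$, which is the computational advantage of slicing over solving a high-dimensional transport problem directly.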
The generalized Radon transform (GRT) of a probability distribution $P(\cdot)$, denoted $\mathcal{R}P$, is defined as follows:

$$
\mathcal{R}P(t, \psi) = \int_{\mathbb{R}^d} P(x)\, \delta\big(t - f(x, \psi)\big)\, dx,
\tag{8}
$$

where $\delta(\cdot)$ is the one-dimensional Dirac delta function, $t \in \mathbb{R}$ is a scalar, $\psi$ is a unit vector in the unit hyper-sphere in a $d$-dimensional space ($\mathbb{S}^{d-1}$), and $f$ is a projection function whose parameters will be learned in training. Injectivity of the GRT (Beylkin, 1984) is the requirement for the GSWD to be a valid distance. We use a linear projection $f(x, \psi)$ here and can easily extend to two nonlinear cases that maintain the GRT injectivity (circular nonlinear projections or homogeneous polynomials with an odd degree).

GSWD between two $d$-dimensional distributions $P_X$ and $P_Y$ is therefore defined as:

$$
\mathrm{GSWD}(P_X, P_Y) = \int_{\mathbb{S}^{d-1}} \mathrm{WD}\big(\mathcal{R}P_X(\cdot, \psi), \mathcal{R}P_Y(\cdot, \psi)\big)\, d\psi.
\tag{9}
$$

The integral in the above equation can be approximated with a Riemann sum. Knowing the definition of GSWD, we define the objective function of GS-WGAN as follows:

$$
\min_{\theta} \max_{\phi} V_{\mathrm{GSW}}(\pi_\theta, D_\phi) + \lambda R(D_\phi),
\tag{10}
$$

$$
\text{s.t. } V_{\mathrm{GSW}}(\pi_\theta, D_\phi) = \int_{\psi \in \mathbb{S}^{d-1}} \mathbb{E}_{x \sim p_r}[\log D_\phi(x)] - \mathbb{E}_{y \sim \pi_\theta}[\log D_\phi(y)]\, d\psi,
\tag{11}
$$

where the parameters and notations are the same as defined in Eq. (7).

We note that $x$ and $y$ in Eq. (10) are random variables in $\mathbb{R}^d$, which is not a reasonable assumption for graphs. To that end, we use an embedding function $g$ that maps each graph to a vector in $\mathbb{R}^d$. We use graph convolutional layers followed by fully connected layers to implement $g$. We deploy the same type of neural network architecture for $D_\phi$. We use $R_{\mathrm{advers}} = -V_{\mathrm{GSW}}(\pi_\theta, D_\phi)$ as the adversarial reward used together with other rewards, and optimize the total rewards with a policy gradient method (Sec. 3.2.3).

Network principle-based reward for drug combinations.

Proteins or genes associated with a disease tend to form a localized neighborhood, the disease module, rather than scattering randomly in the interactome (Cheng et al., 2019). A network-based score has been introduced (Menche et al., 2015) to efficiently capture the network proximity of a drug ($X$) and disease ($Y$) based on the shortest-path length $d(x, y)$ between a drug target ($x$) and a disease protein ($y$):

$$
Z = \frac{d(X, Y) - \bar{d}}{\sigma_d}, \qquad d(X, Y) = \frac{1}{\|Y\|} \sum_{y \in Y} \min_{x \in X} d(x, y),
\tag{12}
$$

where $d(\cdot, \cdot)$ is the shortest path distance; $\bar{d}$ and $\sigma_d$ are the mean and standard deviation of the reference distribution, which corresponds to the expected network topological distance between two randomly selected groups of proteins matched in size and degree (connectivity) distribution to the original disease proteins and drug targets in the human interactome. A negative Z-score ($Z < 0$) implies network proximity of the disease module and drug targets, which is desirable. From the drug combination perspective, it has been shown that the complementary exposed drug-drug relationship has the fewest side effects and the highest drug combination efficacy (Cheng et al., 2019). The complementary exposed drug-drug ($X_1$ and $X_2$) relationship means that the targets ($x_1$) of the first drug and the targets ($x_2$) of the second drug are not in the same neighborhood and have the least overlap. Therefore, Cheng et al. have proposed a network-separation score which is formulated as follows:

$$
s_{X_1, X_2} = d(X_1, X_2) - \frac{d(X_1, X_1) + d(X_2, X_2)}{2},
\tag{13}
$$

where $d(X_1, X_2)$ is the mean shortest path distance between drugs $X_1$ and $X_2$; $d(X_1, X_1)$ and $d(X_2, X_2)$ are the mean shortest path distances within drug targets $X_1$ and $X_2$, respectively (Cheng et al., 2019). A positive separation score ($s > 0$) implies that the two drugs' target sets are separated from each other in the network, which is desirable.
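The two scores above can be sketched on a toy network. The sketch assumes an unweighted, connected graph stored as an adjacency dictionary, and it follows the wording here by using plain means of pairwise shortest-path distances in Eq. (13); the cited works may define these distances differently (for example via closest nodes). The Z-score normalization is omitted because it requires a degree-matched reference distribution. All names and the toy graph are hypothetical.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path lengths from source to every reachable node (unweighted)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closest_distance(graph, targets, module):
    """d(X, Y) of Eq. (12): mean over disease proteins of the nearest drug target."""
    total = 0
    for y in module:
        dist = bfs_distances(graph, y)
        total += min(dist[x] for x in targets)
    return total / len(module)


def mean_pairwise(graph, a, b, skip_identical=False):
    """Mean shortest-path distance over node pairs drawn from sets a and b."""
    pairs = [(x, y) for x in a for y in b if not (skip_identical and x == y)]
    if not pairs:
        return 0.0
    cache = {}
    total = 0
    for x, y in pairs:
        if x not in cache:
            cache[x] = bfs_distances(graph, x)
        total += cache[x][y]
    return total / len(pairs)


def separation(graph, x1, x2):
    """s of Eq. (13): between-set distance minus the mean within-set distance."""
    between = mean_pairwise(graph, x1, x2)
    within = (mean_pairwise(graph, x1, x1, skip_identical=True)
              + mean_pairwise(graph, x2, x2, skip_identical=True)) / 2
    return between - within


# Hypothetical path graph 0-1-2-3-4 standing in for the interactome.
toy_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
toy_d = closest_distance(toy_graph, {0, 1}, {1, 2, 3})
toy_s = separation(toy_graph, {0, 1}, {3, 4})
```

In the toy example the two target sets sit at opposite ends of the path, so the separation score is positive, which is the desirable case described above.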
We have extended and combined these scores for general drug combination therapy, where we have a set of $k$ drugs $\{X_1, \cdots, X_k\}$ and a disease $Y$:

$$R_{\mathrm{network}} = \lambda_1 \sum_{i=1}^{k} \sum_{j>i} s(X_i, X_j) - \lambda_2 \sum_{i=1}^{k} Z(X_i, Y). \tag{14}$$

However, the exact online calculation of the reward $R_{\mathrm{network}}$ is infeasible while training across all the diseases and the whole human interactome, which has more than 13K nodes and 352K edges. Therefore, we have developed a relaxed version of the reward which is feasible for online calculation and correlates with the actual reward. Specifically, we consider the normalized exclusive or (XOR) of intersections of disease modules with drug targets:

$$\hat{R}_{\mathrm{network}} = \frac{|Y \cap (X_1 \oplus \cdots \oplus X_k)|}{|Y|} = \frac{|(X_1 \cap Y) \oplus \cdots \oplus (X_k \cap Y)|}{|Y|}. \tag{15}$$

The relaxed network principle-based reward penalizes a drug combination if the overlap between drug targets in the disease module is high, and is thereby intended to prevent adverse drug-drug interactions. We scaled the network score by a constant (equal to 10) such that the score would be in the same range as Pen-logP and could use the same weight in the total reward as Pen-logP did in (You et al., 2018).

For a generated compound, we predict its protein targets by DeepAffinity (Karimi et al., 2019), judging by whether the predicted IC50 is below 1 µM.

Having explained the graph generation environment (various rewards), we outline the architecture of our proposed policy network. Our method takes the intermediate graph set $G_t$ and the collection of scaffold subgraphs $C$ as inputs, and outputs the action $a_t$, which predicts a new link for each of the graphs in $G_t$ (You et al., 2018).

Since the input to our policy network is a set of $K$ compounds or graphs $\{G_t^{(k)} \cup C\}_{k=1}^{K}$, we first deploy some layers of graph neural network to process each of the graphs. More specifically,

$$X^{(k)} = \mathrm{GNN}^{(k)}\big(G_t^{(k)} \cup C\big), \quad \text{for } k = 1, \ldots, K, \tag{16}$$

where $\mathrm{GNN}^{(k)}$ is a multilayer graph neural network.
The link prediction-based action at iteration $t$ is a concatenation of four components for each of the $K$ graphs: selection of two nodes, prediction of edge type, and prediction of termination. Each component is sampled according to a predicted distribution (You et al., 2018). Details are included in the Supplemental Sec. 1.2. We note that the first node is always chosen from $G_t$ while the next node is chosen from $\{G_t^{(k)} \cup C\}_{k=1}^{K}$. We also note that infeasible actions (i.e., actions that do not pass the valency check) proposed by the policy network are rejected and the state remains unchanged. We adopt Proximal Policy Optimization (PPO) (Schulman et al., 2017), one of the state-of-the-art policy gradient methods, to train the model.

To assess the performance of our proposed model, we have designed a series of experiments. In Section 4.1, we first compare HVGAE to state-of-the-art graph embedding methods in disease-disease network representation learning and further include several variants of HVGAE for ablation studies. We then assess the performance of the proposed reinforcement learning method in two aspects. In a landscape assessment in Section 4.2, we examine designed pairwise compound combinations for 299 diseases in quantitative scores of following a network-based principle (Cheng et al., 2019). In Section 4.3, we focus on four case studies involving multiple diseases of various systems-pharmacology strategies. Our method is capable of generating higher-order combinations of $K$ drugs. As FDA-approved drug combinations are often pairs, here we design compound pairs from the scaffolds of FDA-approved drug pairs. We further delve into designed compound pairs to understand the benefit of following network principles in lowering toxicity from drug-drug interactions. We also do so to understand their systems pharmacology strategies in comparison to the FDA-approved drug combinations.
To assess the performance of our proposed embedding method HVGAE, we compare its performance in (disease-disease) network reconstruction with Node2Vec (Grover and Leskovec, 2016), DeepWalk (Perozzi et al., 2014), and VGAE (Kipf and Welling, 2016), as well as some variants of our own model for ablation study. Node2Vec and DeepWalk are random-walk-based models that do not capture node attributes; hence we only used the disease-disease graph structure. For VGAE, we used the identity matrix as node attributes, as suggested by the authors.

For our HVGAE described in Sec. 3.1, we also considered two variants for ablation study: HVGAE-disjoint does not jointly embed gene-gene and disease-disease networks and does not use attentional pooling for disease embedding, whereas HVGAE-noAtt only omits attentional pooling. Specifically, in HVGAE-disjoint, we first learned an embedding for the gene-gene network, then used the sum of mean of the node representations of genes affected by a disease as its node attributes. In HVGAE-noAtt, we jointly learned the representations while using the sum of mean of the node representations of genes as node attributes for the disease-disease network.

In Node2Vec and DeepWalk, the walk length was set to 80, the number of walks starting at each node was set to 10, and the nodes were embedded in a 16-dimensional space. The window size was set to 10 for both Node2Vec and DeepWalk. All models were trained using the Adam optimizer. In VGAE, a 32-dimensional graph convolutional (GC) layer followed by two 16-dimensional layers was used for mean and variance inference. The learning rate was set to 0.01.

For HVGAE and its variants (for ablation study), we embed gene networks in a 32-dimensional space using a single GC layer with 32 filters for each of the 5 types of input, followed by a 64-dimensional GC layer and two 32-dimensional GC layers to infer the mean and variance of the representation. We used a single 32-dimensional fully connected (FC) layer for the attention layer.
For disease-disease network embedding, we deployed a single 32-dimensional GC layer followed by two 16-dimensional layers for mean and variance inference, resulting in a 16-dimensional embedding for the disease-disease network. Learning rates were set to 0.001. The models were trained for 1,000 epochs, choosing the best representation based on their reconstruction performance at each epoch.

Table 1 summarizes the reconstruction performance of the aforementioned methods. Compared to all baselines, our HVGAE showed the best performance in all metrics considered. Node2Vec and DeepWalk showed the worst performance as they only use the graph structure. The performance of VGAE was very close to DeepWalk. This is due to the fact that no attributes have been provided to VGAE despite its capability of capturing attributes.

Table 1. Graph reconstruction performances (unit: %) in the disease-disease network using our proposed HVGAE and baselines. F1 scores are based on a 50% threshold.

Method            AUC-ROC   AP      F1-Macro   F1-Micro
Node2Vec          79.01     72.82   35.73      51.10
DeepWalk          79.32     73.77   40.28      53.30
VGAE              88.12     85.71   60.19      64.98
HVGAE-disjoint    [values missing in the source]
HVGAE-noAtt       [values missing in the source]
HVGAE             [values garbled in the source: ". 11 95 . 89 79 . 77 80 ."]

Compared to VGAE, HVGAE-disjoint, without joint embedding or attentional pooling, still saw better performance, which suggests that the attributes generated by the gene-gene network contain meaningful features about the disease-disease network. The slight performance gain from HVGAE-disjoint to HVGAE-noAtt shows that joint learning of both networks hierarchically helps to render more informative features for the disease-disease network. Finally, HVGAE had another performance boost compared to HVGAE-noAtt and outperformed all competing methods, which shows the benefit of attentional pooling. Specifically, the attention layer of HVGAE allows the model to produce features that are specifically informative for disease-disease network representation learning.
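To make the thresholded metrics of Table 1 concrete, the sketch below computes F1-Macro and F1-Micro for a binary edge / non-edge reconstruction task at a 0.5 threshold. It assumes that reconstruction is scored as a binary classification over candidate node pairs, which the text does not spell out; the labels and scores are made up, and `f1_macro_micro` is a hypothetical helper name.

```python
def f1_macro_micro(y_true, y_score, threshold=0.5):
    """F1-Macro and F1-Micro for binary labels, thresholding scores at `threshold`."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    f1_per_class = []
    tp_all = fp_all = fn_all = 0
    for c in (0, 1):  # treat "non-edge" and "edge" as the two classes
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_per_class.append(2 * tp / denom if denom else 0.0)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    macro = sum(f1_per_class) / len(f1_per_class)          # unweighted mean of per-class F1
    micro = 2 * tp_all / (2 * tp_all + fp_all + fn_all)    # F1 over pooled counts
    return macro, micro


# Made-up example: 3 true edges and 5 non-edges with predicted edge probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1, 0.3, 0.05]
macro_f1, micro_f1 = f1_macro_micro(labels, scores)
print(macro_f1, micro_f1)
```

With pooled counts over both classes, F1-Micro coincides with plain accuracy, whereas F1-Macro is pulled down by the harder class; this is one reason the two columns can differ noticeably on imbalanced graphs.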

We have trained the proposed reinforcement model in 3 stages using different rewards, disease sets, and action spaces to increasingly focus on a target disease while exploiting all diseases whose representations already jointly embed gene-gene, disease-disease, and gene-disease networks.

In the first stage, we train the model to only generate drug-like small molecules which follow the chemistry valency reward, the lipophilicity reward (logP, where P is the octanol-water partition coefficient) (You et al., 2018), and our novel adversarial reward for individual compounds. In this study, we trained the model for 3 days (4,800 iterations) to learn to follow the valency conditions and promote high logP for generated compounds.

In the second stage, we start from the trained model at the end of the first stage ("warm-start" or "pre-training") and continue to train the model to generate good drug combinations across all diseases. We do so by adding the network principle-based reward for compound combinations and sequentially generating drug combinations for each disease one by one. Then, we calculate the network-based score for the generated drug combinations at the last epoch across disease ontologies and compare them with the network-based scores of FDA-approved melanoma drug combinations. In this study, we trained the model for 1,500 iterations to generate drug combinations across all 299 diseases. In each iteration, we generated 8 drug combinations for a given disease. We adopted PPO (Schulman et al., 2017) with a learning rate of 0.001 to train the proposed RL for both stages.

The last stage is disease-specific and will be detailed in Sec. 4.3.

Across disease ontologies we quantify the performance of the proposed RL (stage 2 model first) using quantitative scores of compound combinations following a network-based principle (Cheng et al., 2019). We consider the generated combinations in the last epoch (the last 299 iterations) and calculate the network score $\hat{R}_{\mathrm{network}}$ based on disease ontologies. We assess our model based on two versions of disease classification, the original disease ontology and its extension, explained in Sec. 2.4. Table 2 summarizes the network-based scores for our model. Specifically, suppose that the sets of targets for drugs 1 and 2 are represented by $A$ and $B$, whereas the disease module is the universal set $\Omega$. We report the portion exclusively covered by drug 1 ($\eta_{A-B}$), exclusively covered by drug 2 ($\eta_{B-A}$), overlapped by both ($\eta_{A \cap B}$), and collectively covered by both ($\eta_{A \cup B}$). As a reference, we calculated the corresponding network scores for 3 FDA-approved drug combinations for melanoma.

Based on the results shown in Table 2, we note that across all disease classes, the designed compound combinations learned in an environment where the network principle (Cheng et al., 2019) was rewarded did achieve the desired performances. Specifically, their overlaps in disease modules were low, as $\eta_{A \cap B}$ fractions were around 0.1, whereas their joint coverage in disease modules was high, as $\eta_{A \cup B}$ fractions were in the range of 0.4–0.5 for all diseases.

Table 2. Network-based score for the generated drug combinations based on disease ontology classifications.

                                     Disease Ontology                                                  Disease Ontology extended
Disease class                        $\eta_{A-B}$  $\eta_{B-A}$  $\eta_{A \cap B}$  $\eta_{A \cup B}$    $\eta_{A-B}$  $\eta_{B-A}$  $\eta_{A \cap B}$  $\eta_{A \cup B}$
infectious disease                   0.25          0.10          0.06               0.41                 0.20          0.07          0.05               0.33
disease of anatomical entity         0.27          0.12          0.10               0.49                 0.26          0.11          0.09               0.48
disease of cellular proliferation    0.25          0.09          0.07               0.42                 0.25          0.10          0.08               0.44
disease of mental health             0.22          0.11          0.10               0.43                 0.22          0.11          0.10               0.43
disease of metabolism                0.22          0.13          0.10               0.46                 0.23          0.14          0.11               0.48
genetic disease                      0.23          0.15          0.11               0.4                  0.23          0.15          0.11               0.49
syndrome                             0.22          0.11          0.11               0.44                 0.22          0.11          0.11               0.44

Compared to a few FDA-approved drug combinations for melanoma in Table 3, we notice that the designed compound combinations had similar exclusive coverage ($\eta_{A-B}$ and $\eta_{B-A}$) as the drug combinations. However, the overlapping and overall coverage ($\eta_{A \cap B}$ and $\eta_{A \cup B}$) were both much higher in FDA-approved drug combinations than in the designed ones. Improvements could be made by training the RL agent longer, as these scores had already been improving during the limited training process under computational restrictions. Further improvement could be made by adjusting the network-based reward as well.

Table 3. Network-based scores for FDA-approved melanoma drug combinations.

Drug combination             $\eta_{A-B}$  $\eta_{B-A}$  $\eta_{A \cap B}$  $\eta_{A \cup B}$
Dabrafenib + Trametinib      0.05          0.21          0.55               0.81
Encorafenib + Binimetinib    0.21          0.05          0.53               0.86
Vemurafenib + Cobimetinib    0.05          0.27          0.36               0.68
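The four coverage fractions reported in Tables 2 and 3 and the relaxed reward of Eq. (15) are plain set computations. The sketch below illustrates them with made-up gene identifiers; the function names are hypothetical, and the constant scaling of the reward by 10 mentioned in the text is left out.

```python
def coverage_fractions(targets_a, targets_b, module):
    """Fractions of the disease module covered exclusively, jointly, and collectively."""
    a_in, b_in = targets_a & module, targets_b & module
    n = len(module)
    return {
        "A-B": len(a_in - b_in) / n,  # covered by drug 1 only
        "B-A": len(b_in - a_in) / n,  # covered by drug 2 only
        "A&B": len(a_in & b_in) / n,  # covered by both drugs
        "A|B": len(a_in | b_in) / n,  # covered by at least one drug
    }


def relaxed_network_reward(target_sets, module):
    """Eq. (15): size of the XOR of the in-module target sets, normalized by module size."""
    xor = set()
    for targets in target_sets:
        xor ^= targets & module  # symmetric difference, accumulated over drugs
    return len(xor) / len(module)


# Made-up disease module of 10 genes and two predicted target sets.
module = {f"g{i}" for i in range(1, 11)}
drug_a = {"g1", "g2", "g3", "g4", "off_target_1"}
drug_b = {"g4", "g5", "off_target_2"}

eta = coverage_fractions(drug_a, drug_b, module)
reward = relaxed_network_reward([drug_a, drug_b], module)
print(eta, reward)
```

For a pair of drugs the relaxed reward equals $\eta_{A-B} + \eta_{B-A}$, so it grows with exclusive coverage and is unaffected by off-module targets, while any gene hit by both drugs is removed from the count.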

In the third and last stage of RL model training, we start from the stage 2 model and generate drug combinations for a fixed target disease, and can choose scaffold libraries specific to the disease. In parallel, we trained the model for 500 iterations (roughly 1 day) to generate 4,000 drug combinations specifically for each of 4 diseases featuring various drug-combination strategies: melanoma, lung cancer, ovarian cancer, and breast cancer. In all cases, we started with the Murcko scaffolds of specific FDA-approved drug combinations, to be detailed next.

Melanoma: Different targets in the same pathway.

Resistance to BRAF kinase inhibitors is associated with reactivation of the mitogen-activated protein kinase (MAPK) pathway. There is, thus, a phase 1 and 2 trial of combined treatment with Dabrafenib, a selective BRAF inhibitor, and Trametinib, a selective MAPK kinase (MEK) inhibitor. As melanoma is not one of the 299 diseases, we chose the broader neoplasm as an alternative. To compensate for the loss of focus on the target disease, we design compound pairs from Murcko scaffolds of Dabrafenib + Trametinib.

Lung and ovarian cancers: Targeting parallel pathways.

MAPK and PI3K signaling pathways are parallel pathways important for treating many cancers, including lung and ovarian cancers (Day and Siu, 2016; Bedard et al., 2015). Clinical data suggest that dual blockade of these parallel pathways has synergistic effects. Buparlisib (BKM120) and Trametinib (GSK1120212; Mekinist) are used as a drug combination therapy for this purpose. Specifically, Buparlisib is a potent and highly specific PI3K inhibitor, whereas Trametinib is a highly selective, allosteric inhibitor of MEK1/MEK2 activation and kinase activity (Bedard et al., 2015).

Breast cancer: Reverse resistance.

Endocrine therapies, including Fulvestrant, are the main treatment for hormone receptor–positive breast cancers (80% of breast cancers) (Turner et al., 2015). However, patients could develop resistance during or after the treatment. A phase 3 study is using Fulvestrant and Palbociclib as a combination therapy to reverse the resistance. Fulvestrant and Palbociclib target different genes in different pathways. Specifically, Fulvestrant targets estrogen receptor (ER) α in the estrogen signaling pathway and Palbociclib targets cyclin-dependent kinases 4 and 6 (CDK4 and CDK6) in the cell cycle pathway (Turner et al., 2015).

Since our proposed method is the first to generate drug combinations for specific diseases, we consider the following baseline methods for comparison: 1) random selection of 1,000 pairs from 8,724 small-molecule drugs in DrugBank (Wishart et al., 2018); 2) 628 FDA-approved drug combinations curated by Cheng et al. (2019) for hypertension and cancers (our case studies are on 4 types of cancers); 3) random selection of 1,000 pairs of FDA-approved drugs for the given disease, based on the drug-disease dataset "SCMFDD-L" (Zhang et al., 2018).

We first compare the compound combinations designed by our model and those from the baselines using the network score that reflects the network-based principle. Fig. 2(a)–(d) shows that our designed combinations in all 4 cases, with higher network scores in distribution, respected the network principle more than the baselines (including the FDA-approved pairs not necessarily specific to the target disease). The observation is statistically significant, with P-values ranging from 6E-74 to 7E-7 (one-sided Kolmogorov-Smirnov [KS] test; see more details in the Supplemental Tables S2 and S3). We attribute this result to the network-principled reward we introduced.

We also examine whether drug combinations designed to follow the network principle could reduce toxicity from drug-drug interactions (DDIs). DDIs are crucial when using drug combinations since they may trigger unexpected pharmacological effects, including adverse drug events (ADEs). We used a deep-learning model, DeepDDI (Ryu et al., 2018), with a mean accuracy of 92.4%, to predict for each combination the probabilities of 86 types of DDIs (we manually split them into 16 positive and 70 negative types; see details in the Supplemental Sec. 1.3). To summarize over the DDIs, we considered both maximum and mean probabilities of positive or negative ones, and we compared those distributions between our designed pairs and baselines in each disease.

Fig. 2(e)–(h), using the mean probability among negative DDIs, shows that our compound pairs designed for all 4 diseases were predicted to have lower chances of toxicity compared to the baselines. One-sided KS tests attested to the statistical significance of the observation, as P-values ranged between 2E-166 and 2E-53. More analyses can be found in the Supplemental Sec. 3.

Taken together, Fig. 2 suggests that following the network principle in designing drug combinations would help reduce toxicity due to DDIs.
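The toxicity comparison above combines two steps: summarizing each pair by its mean predicted probability over negative DDI types, and comparing the resulting distributions with a one-sided two-sample KS test. The sketch below uses only the standard library and made-up numbers; the DDI type names are placeholders, and the p-value uses the large-sample Smirnov approximation $\exp(-2 D_+^2 \, nm/(n+m))$, which is crude for samples this small and is not necessarily the computation used in the paper.

```python
import math
from bisect import bisect_right


def mean_negative_ddi(probabilities, negative_types):
    """Average predicted probability over the DDI types labeled as negative."""
    return sum(probabilities[t] for t in negative_types) / len(negative_types)


def ks_one_sided(sample_a, sample_b):
    """D+ = max_x [F_a(x) - F_b(x)] and its asymptotic one-sided p-value.

    A large D+ supports the alternative that values in `sample_a` tend to be
    smaller than values in `sample_b`.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    d_plus = 0.0
    for x in a + b:  # the empirical CDFs only change at observed values
        gap = bisect_right(a, x) / n - bisect_right(b, x) / m
        d_plus = max(d_plus, gap)
    p_value = math.exp(-2.0 * d_plus ** 2 * n * m / (n + m))
    return d_plus, p_value


# Made-up per-type probabilities for one compound pair.
pair_probs = {"type_a": 0.2, "type_b": 0.4, "type_c": 0.9}
pair_score = mean_negative_ddi(pair_probs, ["type_a", "type_b"])

# Made-up mean negative-DDI probabilities for designed pairs and a baseline.
designed = [0.09, 0.10, 0.11, 0.12, 0.15]
baseline = [0.25, 0.28, 0.30, 0.35, 0.40]
d_plus, p_value = ks_one_sided(designed, baseline)
print(pair_score, d_plus, p_value)
```

In practice a library routine such as SciPy's two-sample KS test with a one-sided alternative would normally be preferred, since it handles small samples more carefully than this approximation.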

We next examine the DeepAffinity-predicted target genes of our designed pairs and compare them to the polypharmacology strategies outlined in Sec. 4.3.1 for each disease. Since improved network scores have been shown to correlate with lower toxicity, we used the scores to filter the 4,000 combinations designed for each disease. Specifically, we retained combinations with network scores above 0.5 and $\eta_{A \cap B}$ below 0.1. These designs are shared along with the code.

For melanoma, out of 69 combination designs retained, 26% were predicted to jointly cover BRAF and MEK genes in a complementary way. In other words, one molecule only targets BRAF and the other only targets MEK, according to our DeepAffinity (Karimi et al., 2019)-predicted IC50, echoing the systems pharmacology strategy of the drug combination of Dabrafenib and Trametinib. There were also other designs which demand further examination and potentially contain novel strategies. All retained designs were predicted to target the MAPK pathway, to which BRAF and MEK belong.

For lung and ovarian cancers, the same filtering criteria retained 204 (896) compound combinations designed for lung (ovarian) cancer. As disease modules can be limited, MEK1/2 does not exist in the modules used for lung (ovarian) cancer, and a gene-level analysis cannot be performed as in the melanoma case. Instead, we performed a pathway-level analysis and found that 50.9% (45.2%) of combination designs for ovarian (lung) cancer were predicted to jointly and complementarily cover the MAPK and PI3K signaling pathways, which echoes the combination of Buparlisib and Trametinib. Moreover, 99.5% (100%) of these retained designs were predicted to jointly target both pathways for ovarian (lung) cancer.

Fig. 2: Comparison of network score and toxicity of RL-generated pairs of compounds (our proposed method) with three baselines, i.e., random pairs of DrugBank compounds, FDA-approved drug pairs, and random pairs of FDA-approved drugs, for four case-study diseases. [Figure not reproduced; panels (a)–(h); panel titles: Melanoma, Lung cancer, Ovarian cancer, Breast cancer; axis labels: Network score, Toxicity.]

For breast cancer, 77 designed compound combinations passed the filters. As CDK4/6 does not belong to the breast-cancer module due to the limitation of the disease modules used, we again only performed a pathway-level analysis. 9% of the combinations were predicted to jointly and complementarily cover ER-signaling and cell-cycle pathways, as Fulvestrant and Palbociclib do. Also, 74% of the retained combinations jointly cover these pathways. These two portions suggest that many designed combinations were predicted to simultaneously target both pathways (with possible overlapping genes). If we consider PI3K signaling rather than the cell cycle pathway for CDK4/6, 15.5% of retained drug combinations were predicted to jointly and complementarily cover estrogen and PI3K signaling pathways, and all of them were predicted to jointly cover both.

Besides HVGAE for network and disease embedding, two of our novel contributions in RL-based drug set generation were the network-principled reward and the adversarial reward through GS-WGAN. To assess the effects of these contributions to our model, we performed an ablation study for stage 3 using the case of melanoma. We ablated the originally proposed model in two ways: removing the network-principled reward, or replacing the GS-WGAN adversarial reward with the previously used GAN reward based on Jensen-Shannon (JS) divergence. Results in Fig. 3 suggest that both rewards led to faster initial growth and higher saturation values in network-based scores.
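For readers unfamiliar with the distance behind the GS-WGAN reward ablated here, the sketch below estimates the sliced Wasserstein distance of Eq. (9) with linear projections $f(x, \psi) = \langle x, \psi \rangle$. It is an illustration under stated assumptions rather than the GS-WGAN objective itself: the integral over the unit sphere is replaced by an average over random unit vectors, the one-dimensional distance is taken to be the 1-Wasserstein distance between equal-size samples, there is no discriminator or learned graph embedding, and the data are random vectors.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a direction uniformly on the unit sphere by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def wasserstein_1d(u, v):
    """1-Wasserstein distance between two equal-size 1-D samples (sorted matching)."""
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)


def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Average 1-D Wasserstein distance over random linear projections."""
    rng = random.Random(seed)
    d = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        psi = random_unit_vector(d, rng)
        px = [sum(a * b for a, b in zip(x, psi)) for x in xs]  # f(x, psi) = <x, psi>
        py = [sum(a * b for a, b in zip(y, psi)) for y in ys]
        total += wasserstein_1d(px, py)
    return total / n_projections


# Made-up "embeddings": 50 points in R^4 and a copy shifted by 1 along the first axis.
data_rng = random.Random(1)
xs = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
shifted = [[x[0] + 1.0] + x[1:] for x in xs]

same = sliced_wasserstein(xs, xs)
moved = sliced_wasserstein(xs, shifted)
print(same, moved)
```

Each projection reduces the comparison to a one-dimensional problem that is solved by sorting, which is what makes sliced distances cheap enough to evaluate repeatedly during training.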

Fig. 3: Ablation study for RL: Best network scores achieved by three variants of the proposed method over training iterations. [Figure not reproduced; y-axis: Best network score; curves: GAN w/ network reward, GSWGAN w/o network reward, GSWGAN w/ network reward (ours).]

In response to the need for accelerated and principled drug-combination design, we have recast the problem as graph set generation in a chemically and net-biologically valid environment and developed the first deep generative model with a novel adversarial reward and drug-combination reward in reinforcement learning for the purpose. We have also designed hierarchical variational graph auto-encoders (HVGAE) to jointly embed domain knowledge such as gene-gene, disease-disease, and gene-disease networks and learn disease representations to be conditioned on in the generative model for disease-specific drug combination. Our results indicate that HVGAE learns integrative gene and disease representations that are much more generalizable and informative than state-of-the-art graph unsupervised-learning methods. The results also indicate that the reinforcement learning model learns to generate drug combinations following a network-based principle thanks to our adversarial and drug-combination rewards. Case studies involving four diseases indicate that drug combinations designed to follow network principles tend to have low toxicity from drug-drug interactions. These designs also encode systems pharmacology strategies echoing FDA-approved drug combinations as well as other potentially promising strategies. As the first generative model for disease-specific drug combination design, our study allows for assessing and following network-based mechanistic hypotheses in efficiently searching the chemical combinatorial space and effectively designing drug combinations.

Acknowledgements

Part of the computing time is provided by the Texas A&M High Performance Research Computing.

Funding

This project is in part supported by the National Institute of General Medical Sciences of the National Institutes of Health (R35GM124952 to YS).

References

Abraham, E. P. and Chain, E. (1940). An enzyme from bacteria able to destroy penicillin. Nature, (3713), 837–837.
Aranda, B. et al. (2010). The intact molecular interaction database in 2010. Nucleic acids research, (suppl_1), D525–D531.
Arjovsky, M. et al. (2017). Wasserstein gan. arXiv preprint arXiv:1701.07875.
Ashburner, M. et al. (2000). Gene ontology: tool for the unification of biology. Nature genetics, (1), 25–29.
Balbas, M. D. et al. (2013). Overcoming mutation-based resistance to antiandrogens with rational drug design. Elife, , e00499.
Bedard, P. L. et al. (2015). A phase ib dose-escalation study of the oral pan-pi3k inhibitor buparlisib (bkm120) in combination with the oral mek1/2 inhibitor trametinib (gsk1120212) in patients with selected advanced solid tumors. Clinical Cancer Research, (4), 730–738.
Beylkin, G. (1984). The inversion problem and applications of the generalized radon transform. Communications on pure and applied mathematics, (5), 579–599.
Billur Engin, H. et al. (2014). Network-based strategies can help mono- and poly-pharmacology drug discovery: a systems biology view. Current pharmaceutical design, (8), 1201–1207.
Bohacek, R. S. et al. (1996). The art and practice of structure-based drug design: A molecular modeling perspective. Medicinal Research Reviews, (1), 3–50.
Bozic, I. et al. (2013). Evolutionary dynamics of cancer in response to targeted combination therapy. elife, , e00747.
Ceol, A. et al. (2010). Mint, the molecular interaction database: 2009 update. Nucleic acids research, (suppl_1), D532–D539.
Chang, G. and Roth, C. B. (2001). Structure of msba from e. coli: a homolog of the multidrug resistance atp binding cassette (abc) transporters. Science, (5536), 1793–1800.
Cheng, F. et al. (2019). Network-based prediction of drug combinations. Nature communications, (1), 1197.
Chou, T.-C. (2006). Theoretical basis, experimental design, and computerized simulation of synergism and antagonism in drug combination studies. Pharmacological reviews, (3), 621–681.
Chou, T.-C. (2010). Drug combination studies and their synergy quantification using the chou-talalay method. Cancer research, (2), 440–446.
Clavel, F. and Hance, A. J. (2004). Hiv drug resistance. New England Journal of Medicine, (10), 1023–1035.
Cock, P. J. et al. (2009). Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics, (11), 1422–1423.
Darnag, R. et al. (2010). Support vector machines: development of qsar models for predicting anti-hiv-1 activity of tibo derivatives. European journal of medicinal chemistry, (4), 1590–1597.
Das, P. et al. (2018). A survey of the structures of us fda approved combination drugs. Journal of medicinal chemistry, (9), 4265–4311.
Davis, A. P. et al. (2019). The comparative toxicogenomics database: update 2019. Nucleic acids research, (D1), D948–D954.
Day, D. and Siu, L. L. (2016). Approaches to modernize the combination drug development paradigm. Genome medicine, (1), 115.
DiMasi, J. A. et al. (2016). Innovation in the pharmaceutical industry: new estimates of r&d costs. Journal of health economics, , 20–33.
Dooley, S. W. et al. (1992). Multidrug-resistant tuberculosis. Annals of internal medicine, (3), 257–259.
Flaherty, K. T. et al. (2012). Combined braf and mek inhibition in melanoma with braf v600 mutations. New England Journal of Medicine, (18), 1694–1703.
Ghany, M. and Liang, T. J. (2007). Drug targets and molecular mechanisms of drug resistance in chronic hepatitis b. Gastroenterology, (4), 1574–1585.
Grover, A. and Leskovec, J. (2016). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864.
Gulrajani, I. et al. (2017). Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777.
Hajiramezanali, E. et al. (2019). Variational graph recurrent neural networks. In Advances in Neural Information Processing Systems, pages 10700–10710.
Hamosh, A. et al. (2005). Online mendelian inheritance in man (omim), a knowledgebase of human genes and genetic disorders. Nucleic acids research, (suppl_1), D514–D517.
Hasanzadeh, A. et al. (2019). Semi-implicit graph variational auto-encoders. In Advances in Neural Information Processing Systems, pages 10711–10722.
Hidalgo, C. A. et al. (2009). A dynamic network approach for the study of human phenotypes. PLoS computational biology, (4).
Holohan, C. et al. (2013). Cancer drug resistance: an evolving paradigm. Nature Reviews Cancer, (10), 714–726.
Hornbeck, P. V. et al. (2012). Phosphositeplus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic acids research, (D1), D261–D270.
Housman, G. et al. (2014). Drug resistance in cancer: an overview. Cancers, (3), 1769–1792.
Irwin, J. J. and Shoichet, B. K. (2005). Zinc- a free database of commercially available compounds for virtual screening. Journal of chemical information and modeling, (1), 177–182.
Kanehisa, M. et al. (2002). The kegg database. In Novartis Foundation Symposium, pages 91–100. Wiley Online Library.
Kaplan, J.-C. and Junien, C. (2000). Genomics and medicine: an anticipation. Comptes Rendus de l'Académie des Sciences-Series III-Sciences de la Vie, (12), 1167–1174.
Karimi, M. et al. (2019). Deepaffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks. Bioinformatics, (18), 3329–3338.
Keith, C. T. et al. (2005). Multicomponent therapeutics for networked systems. Nature reviews Drug discovery, (1), 71–78.
Keshava Prasad, T. et al. (2009). Human protein reference database—2009 update. Nucleic acids research, (suppl_1), D767–D772.
Kipf, T. N. and Welling, M. (2016). Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
Kolouri, S. et al. (2019). Generalized sliced wasserstein distances. arXiv preprint arXiv:1902.00434.
Lee, D.-S. et al. (2008). The implications of human metabolic network topology for disease comorbidity. Proceedings of the National Academy of Sciences, (29), 9880–9885.
Lovly, C. M. and Shaw, A. T. (2014). Molecular pathways: resistance to kinase inhibitors and implications for therapeutic strategies. Clinical Cancer Research, (9), 2249–2256.
Madani Tonekaboni, S. A. et al. (2018). Predictive approaches for drug combination discovery in cancer. Briefings in bioinformatics, (2), 263–276.
Martínez-Jiménez, F. and Marti-Renom, M. A. (2016). Should network biology be used for drug discovery?
Matys, V. et al. (2003). Transfac®: transcriptional regulation, from patterns to profiles. Nucleic acids research, (1), 374–378.
Mayr, A. et al. (2016). Deeptox: toxicity prediction using deep learning. Frontiers in Environmental Science, , 80.
Menche, J. et al. (2015). Uncovering disease-disease relationships through the incomplete interactome. Science, (6224), 1257601.
Mottaz, A. et al. (2008). Mapping proteins to disease terminologies: from uniprot to mesh. In BMC bioinformatics, volume 9, page S3. BioMed Central.
Pang, K. et al. (2014). Combinatorial therapy discovery using mixed integer linear programming. Bioinformatics, (10), 1456–1463.
Perozzi, B. et al. (2014). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710.
Popova, M. et al. (2018). Deep reinforcement learning for de novo drug design. Science advances, (7), eaap7885.
Preuer, K. et al. (2017). Deepsynergy: predicting anti-cancer drug synergy with deep learning. Bioinformatics, (9), 1538–1546.
Ramón-García, S. et al. (2011). Synergistic drug combinations for tuberculosis therapy identified by a novel high-throughput screen. Antimicrobial agents and chemotherapy, (8), 3861–3869.
Ramos, E. M. et al. (2014). Phenotype–genotype integrator (phegeni): synthesizing genome-wide association study (gwas) data with existing genomic resources. European Journal of Human Genetics, (1), 144–147.
Rogers, F. B. (1963). Medical subject headings. Bulletin of the Medical Library Association, (1), 114–116.
Rolland, T. et al. (2014). A proteome-scale map of the human interactome network. Cell, (5), 1212–1226.
Ruepp, A. et al. (2010). Corum: the comprehensive resource of mammalian protein complexes—2009. Nucleic acids research, (suppl_1), D497–D501.
Ryu, J. Y. et al. (2018). Deep learning improves prediction of drug–drug and drug–food interactions. Proceedings of the National Academy of Sciences, (18), E4304–E4311.
Saputra, E. C. et al. (2018). Combination therapy and the evolution of resistance: the theoretical merits of synergism and antagonism in cancer. Cancer research, (9), 2419–2431.
Schriml, L. M. et al. (2012). Disease ontology: a backbone for disease semantic integration. Nucleic acids research, (D1), D940–D946.
Schulman, J. et al. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Shafer, R. and Vuitton, D. (1999). Highly active antiretroviral therapy (haart) for the treatment of infection with human immunodeficiency virus type 1. Biomedicine & pharmacotherapy, (2), 73–86.
Sharma, P. and Allison, J. P. (2015). Immune checkpoint targeting in cancer therapy: toward combination strategies with curative potential. Cell, (2), 205–214.
Singh, N. and Yeh, P. J. (2017). Suppressive drug combinations and their potential to combat antibiotic resistance. The Journal of antibiotics, (11), 1033.
Stark, C. et al. (2010). The biogrid interaction database: 2011 update. Nucleic acids research, (suppl_1), D698–D704.
Su, A. I. et al. (2004). A gene atlas of the mouse and human protein-encoding transcriptomes. Proceedings of the National Academy of Sciences, (16), 6062–6067.
Svetnik, V. et al. (2003). Random forest: a classification and regression tool for compound classification and qsar modeling. Journal of chemical information and computer sciences, (6), 1947–1958.
Taubes, G. (2008). The bacteria fight back.
Toy, W. et al. (2013). Esr1 ligand-binding domain mutations in hormone-resistant breast cancer. Nature genetics, (12), 1439.
Turner, N. C. et al. (2015). Palbociclib in hormone-receptor–positive advanced breast cancer. New England Journal of Medicine, (3), 209–219.
Van Norman, G. A. (2016). Drugs, devices, and the fda: Part 1: an overview of approval processes for drugs. JACC: Basic to Translational Science, (3), 170–179.
Wishart, D. S. et al. (2018). Drugbank 5.0: a major update to the drugbank database for 2018. Nucleic acids research, (D1), D1074–D1082.
Wong, C. H. et al. (2019). Estimation of clinical trial success rates and related parameters. Biostatistics, (2), 273–286.
You, J. et al. (2018). Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems, pages 6410–6421.
Zhang, W. et al. (2018). Predicting drug-disease associations by using similarity constrained matrix factorization. BMC bioinformatics, (1), 1–12.
Zhavoronkov, A. et al. (2019). Deep learning enables rapid identification of potent ddr1 kinase inhibitors. Nature biotechnology, (9), 1038–1040.
Zhou, X. et al. (2014). Human symptoms–disease network. Nature communications, 5