Causal Network Models of SARS-CoV-2 Expression and Aging to Identify Candidates for Drug Repurposing
Anastasiya Belyaeva, Louis Cammarata, Adityanarayanan Radhakrishnan, Chandler Squires, Karren Dai Yang, G.V. Shivashankar, Caroline Uhler
CCausal Network Models of SARS-CoV-2 Expression and Aging toIdentify Candidates for Drug Repurposing
Anastasiya Belyaeva , Louis Cammarata , Adityanarayanan Radhakrishnan ,Chandler Squires , Karren Dai Yang , G.V. Shivashankar , , Caroline Uhler , ∗ Massachusetts Institute of Technology, U.S.A. Harvard University, U.S.A. ETH Zurich, Switzerland Paul Scherrer Institute, Switzerland
Equal contribution. ∗ To whom correspondence should be addressed; E-mail: [email protected].
Abstract
Given the severity of the SARS-CoV-2 pandemic, a major challenge is to rapidly repurposeexisting approved drugs for clinical interventions. While a number of data-driven and experimentalapproaches have been suggested in the context of drug repurposing, a platform that systematicallyintegrates available transcriptomic, proteomic and structural data is missing. More importantly, giventhat SARS-CoV-2 pathogenicity is highly age-dependent, it is critical to integrate aging signatures intodrug discovery platforms. We here take advantage of large-scale transcriptional drug screens combinedwith RNA-seq data of the lung epithelium with SARS-CoV-2 infection as well as the aging lung. Toidentify robust druggable protein targets, we propose a principled causal framework that makes use ofmultiple data modalities. Our analysis highlights the importance of serine/threonine and tyrosinekinases as potential targets that intersect the SARS-CoV-2 and aging pathways. By integratingtranscriptomic, proteomic and structural data that is available for many diseases, our drug discoveryplatform is broadly applicable. Rigorous in vitro experiments as well as clinical trials are needed tovalidate the identified candidate drugs.Candidates for drug repurposing have mainly been identified based on an understanding of theirpharmacology or based on retrospective analyses of their clinical effects. Recently, also more systematiccomputational methods combined with large-scale experimental screens have been employed [1]. TheConnectivity Map (CMap) containing gene expression profiles generated by dosing thousands of smallmolecules, including many FDA approved compounds, in a number of human cell lines has been particu-larly valuable in this regard [2]. Common computational approaches include signature matching, wherethe signature of a drug is determined for example using CMap and compared to the reverse signature ofa disease to identify drugs with high correlation [3]. In addition, approaches to identify drug or diseasenetworks based on known pathways, protein-protein interactions, gene expression or genome-wide asso-ciation studies have also been employed [4, 5, 6]. To capitalize on the abundance of data, it is critical todevelop computational platforms that can integrate different data modalities including gene expression,drug targets and signatures, as well as protein-protein interactions. In addition, a drug represents anintervention in the system and only a causal framework allows predicting the effect of an intervention.It is therefore critical to capitalize on recent advances in causal inference [7, 8] in particular with respectto the use of interventional data [9, 10, 11, 12].Given the current coronavirus disease 2019 (COVID-19) crisis, there is an urgent need for the de-velopment of robust drug repurposing methods. Coronaviruses belong to the family of positive-strandRNA-viruses. While most coronaviruses infect the upper respiratory tract and cause mild illness, theycan have serious effects as exemplified by the severe acute respiratory syndrome coronavirus (SARS-CoV)epidemic and now the SARS-CoV-2 pandemic [13]. Recent studies have shown that coronaviruses usecanonical inflammatory pathways (e.g. NF- κ B) of the host cell for their replication, while simultaneouslydampening their outward inflammatory signaling [14, 15]. This delicate partial up and down-regulationof inflammatory pathways by coronaviruses has represented major challenges for therapeutic interven-1 a r X i v : . [ q - b i o . M N ] J un ions [16]. While the infection rates for these viruses are similar among different age groups, the morbidityand fatality rates are significantly higher in the aging population [17, 18]. The respiratory system ofaging individuals is characterized by alterations of tissue stiffness [19]. Notably, recent micropatterningexperiments have shown that cells subjected to substrates of different stiffness stimulated with the samecytokine (TNF- α ) exhibit different downstream NF- κ B signaling [20]. In a recent commentary, we out-lined that the cross-talk between coronavirus infection and cellular aging could play a critical role in thereplication of the virus in host cells by differentially intersecting with NF- κ B signaling [21]. This suggeststhat efforts for drug repurposing should analyze SARS-CoV-2 infected host cell expression programs inconjunction with aging-dependent programs. While a number of studies are underway that investigateviral integration/replication and interactions with the host cell [6, 22], to our knowledge the interplayof SARS-CoV-2 host response and aging has not been explored in the context of drug development andrepurposing.In this paper, we propose a novel computational platform for drug repurposing, which integratestranscriptomic, proteomic and structural data with a principled causal framework, and we apply it inthe context of SARS-CoV-2 (Fig. 1). Given the age-dependent pathogenicity of SARS-CoV-2, we firstidentify genes that are differentially regulated by SARS-CoV-2 infection and aging based on bulk RNA-seq data from [23, 24]. We then use an autoencoder, a type of artificial neural network used to learndata representations in an unsupervised manner [25, 26], to embed the CMap data together with theSARS-CoV-2 expression data for signature matching to obtain an ordered list of FDA approved drugs. Inparticular, we show that overparameterized autoencoders align drug signatures from different cell typesand thus allow constructing synthetic interventions [27, 28] by translating the effect of a drug from one celltype to another. We then construct a combined SARS-CoV-2 and aging interactome using a Steiner treeanalysis to connect the differentially expressed genes within a protein-protein interaction network [29, 30].By intersecting the resulting combined SARS-CoV-2 and aging interactome with the targets of the 300top ranked FDA approved drugs from the previous analysis, we identify serine/threonine and tyrosinekinases as potential drug targets for therapeutic interventions. Causal structure discovery methodsapplied to the combined SARS-CoV-2 and aging interactome show that the identified protein kinaseinhibitors such as axitinib, dasatinib, pazopanib and sunitinib target proteins that are upstream fromgenes that are differentially expressed in SARS-CoV-2 infection and aging, thereby validating these drugsas being of particular interest for the repurposing against COVID-19. While we apply our computationalplatform in the context of SARS-CoV-2, our algorithms integrate data modalities that are available formany diseases, thereby making them broadly applicable.
Results
Differential expression analysis identifies genes that intersect the SARS-CoV-2host response and aging pathways
Since age is strongly associated with severe outcomes in patients with COVID-19, we sought to analyzegenes differentially expressed in normal versus SARS-CoV-2 infected cells as well as genes differentiallyexpressed in young versus old individuals. Used as model system for lung epithelial cells and the effect ofSARS-CoV-2 infection, we obtained from [23] RNA-seq samples from normal and SARS-CoV-2 infectedA549 lung alveolar cells as well as A549 cells supplemented with ACE2 (A549-ACE2), a receptor that hasbeen shown to be critical for SARS-CoV-2 cell entry [31]. Fig. 2a shows the expression of A549-ACE2cells infected with SARS-CoV-2 in comparison to normal A549-ACE2 cells, with many genes upregulatedas a result of the infection, as expected. Given the availability of A549 data with/without ACE2 andwith/without SARS-CoV-2 infection, we removed genes from this initial list of differentially expressedgenes that were just ACE2-specific or just SARS-CoV-2 infection-specific to extract a more refinedexpression pattern of ACE2-mediated SARS-CoV-2 infection (Methods, Fig. 2b). The rationale was toremove genes linked to the response of the ACE2 receptor to signals other than SARS-CoV-2 infection orgenes involved in the entry of SARS-CoV-2 into the cell through means other than the ACE2 receptor,which has been shown to be the critical mode of entry in humans [31]. Gene ontology (GO) enrichmentanalysis revealed enrichment in mitotic cell cycle as the top term, further supporting removal of thesegenes (Supplementary Fig. S1). The remaining 1926 genes are denoted in red in Fig. 2a,b and usedfor the subsequent analysis. GO enrichment analysis of these genes revealed that they are significantlyenriched in the type I interferon signaling pathway and defense response to virus in addition to otherGO terms (Fig. 2c). Next, in order to analyze the link between SARS-CoV-2 infection and aging, weanalyzed RNA-seq samples from the lung of different aged individuals collected as part of the Genome2issue Expression (GTEx) study [24]. Given the stark increase in case fatality rates of COVID-19 afterage 70 [17, 18], we performed a differential expression analysis comparing the youngest group (20-29years old) and oldest group (70-79 years old), thereby identifying 1923 genes differentially regulated inaging (Fig. 2d, Supplementary Fig. S2). As shown in Fig. 2e, these genes show a significant overlapwith the 1926 genes found to be differentially regulated by SARS-CoV-2 ( p -value= 0 . -fold changes in Fig. 2f and SupplementaryFig. S3a. The association in the directionality of regulation between SARS-CoV-2 infection and aging isstatistically significant ( p -value < . × − , Fisher’s exact test), thereby providing further evidencefor the interplay of SARS-CoV-2 host response and aging as hypothesized in [21]. Fig. 2g shows thelog -fold changes of the 10 most differentially expressed genes across aging and SARS-CoV-2 infection(based on the sum of their ranks with Supplementary Fig. S3b showing the distribution of the ranks). Identification of SARS-CoV-2 infection signature in reduced L1000 gene ex-pression space
Next, we focused our analysis on identifying the SARS-CoV-2 transcriptional signature, which we thencorrelated with the transcriptional signatures of FDA approved drugs in CMap to identify drugs thatcould revert the effect of SARS-CoV-2 infection. While this analysis resulting in an initial ranking ofFDA approved drugs did not take the transcriptional signature of aging into account, aging was a criticalcomponent in the final selection of FDA approved drugs described below.Since gene expression in CMap was quantified using L1000 reduced representation expression profil-ing [2], which measures gene expression of 1000 representative genes, we first sought to analyze whetherthese genes sufficiently capture the transcriptional signature of SARS-CoV-2 infection. For this, we in-tersected the genes measured both by Blanco et al. [23] and CMap [2], resulting in 911 genes. We founda statistically significant overlap between the genes identified as differentially expressed by SARS-CoV-2infection in Fig. 2 and the L1000 genes ( p -value=7 . × − , Fisher’s exact test), thereby providing arational for using the CMap database for drug identification in this disease context (Fig. 3a). We thusproceeded to obtain the signature of SARS-CoV-2 infection in the reduced L1000 gene expression spaceby projecting the RNA-seq data of A549 cells with and without ACE2-receptor and SARS-CoV-2 infec-tion onto the shared 911 genes. The resulting signatures of SARS-CoV-2 infection and ACE2-receptorare visualized using the first two principal components in Fig. 3b. Interestingly, the signature of SARS-CoV-2 infection (indicated by arrows) was aligned across both A549 and A549-ACE2 cells as well asacross different levels of infection (MOI of 0.2 and 2), suggesting that the SARS-CoV-2 transcriptionalsignature was captured robustly by the L1000 genes, thus providing further rational for using CMap toidentify drugs that could reverse the effect of SARS-CoV-2 infection. Combined autoencoder and synthetic interventions framework to identifydrug signatures and rank FDA approved drugs for SARS-CoV-2
Next, we sought to determine transcriptional drug signatures using the CMap database, which includesamong other cell lines A549. The data was visualized using Uniform Manifold Approximation andProjection (UMAP) [33] in Supplementary Fig. S4a, showing that the perturbations clustered by cell typeand hence the drug signatures were small relative to the differences between cell types. We intersectedthe perturbations from CMap with a list of FDA approved drugs using Slinky [34], resulting in 759drugs of which 605 were available for A549. After removing batch effects using k-means clustering (seeMethods and Supplementary Fig. S4b), we computed initial signatures of these drugs based on the meanbefore and after drug perturbation in A549 cells. Fig. 3c shows a selection of drug signatures in relationto the signature of SARS-CoV-2 infection visualized using the top two principal components.Since the effect of a drug can be cell-type specific [35], this standard approach to computing drugsignatures may not allow extrapolating the obtained signatures beyond A549 cells. In order to determinerobust drug signatures and consider also FDA approved drugs that have been dosed on cell lines otherthan A549 in CMap, we employed an autoencoder framework. Autoencoders, a particular class ofneural networks where an input is mapped through a latent space to itself, have been widely used forrepresentation learning [25, 26, 36] and more recently also in genomics and single-cell biology [37, 38, 39].We trained an autoencoder (architecture described in Supplementary Fig. S5) to minimize reconstruction3rror on CMap data and data from Blanco et al. [23] in the L1000 gene expression space. We thencomputed the disease and drug signatures based on the embedding of the data in the latent space.Interestingly, by comparing the correlations between drug signatures obtained from A549 cells and MCF7cells (Fig. 3d) as well as HCC515 cells (Supplementary Fig. S7), cell lines with many perturbations inCMap, it is apparent that the autoencoder aligned the drug signatures across different cell types. Whileautoencoders and other generative models have been used for computing signatures of perturbationsalso in other works [39, 40], these works have used autoencoders in the standard way to obtain a lower -dimensional embedding of the data. Motivated by our recent work which, quite counter-intuitively,described various benefits of using autoencoders to learn a latent representation of the data that is higher -dimensional than the original space [41], we found that overparameterized autoencoders not onlyled to better reconstruction of the data than standardly used autoencoders (Supplementary Fig. S6 andarchitectures described in Supplementary Fig. S5), but also to a better alignment of drug signaturesbetween different cell types (Supplementary Fig. S7). Interestingly, overparameterized autoencodersprovided about the same alignment of drug signatures as using the top three principal components (Fig.3e), while at the same time allowing a near perfect reconstruction of the original gene expression vectorsfrom the embedding. Providing additional validation of the latent space embedding obtained by theoverparameterized autoencoder, we found that known ACE2 inhibitors were generally aligned with thereverse signature of ACE2 in A549 cells (e.g. correlations for moexipril, quinapril and perindopril are all0.88; see Supplementary Dataset 1). We thus used this latent space embedding to rank the drugs basedon their correlation with the reverse disease signature in A549 cells (Supplementary Dataset 1). Sinceoverparameterized autoencoders aligned drug signatures across cell types, this embedding also allowedconstructing synthetic interventions [27, 28], i.e., to predict the effect of a drug on A549 cells withoutmeasuring it, by linearly transferring the corresponding drug signature in the latent space from a celltype where it has been measured. In this way, we obtained an enlarged list of drug signatures, whichwe correlated in the latent space with the reverse disease signature to obtain further candidates of FDAapproved drugs for SARS-CoV-2 (Supplementary Dataset 1). To compare the correlations obtained withthe different embeddings, a list of the top ranked drugs is shown in Fig. 3f. Interestingly, it containsvarious drugs that were identified also in [6] using a different analysis (clemastine, haloperidol, ribavirin)or are currently in clinical trials (ribavirin, quinapril).
Steiner tree analysis identifies candidate drug targets by constructing com-bined SARS-CoV-2 and aging interactome
Our differential expression analysis revealed relevant genes to investigate in the context of SARS-Cov-2infection and aging, while the combined autoencoder and synthetic interventions analysis provided can-didate FDA approved drugs for reverting the effect of SARS-CoV-2 infection. Next, we integratedthese two separate analyses to obtain a final list of FDA approved drugs by constructing a combinedSARS-CoV-2 infection and aging protein-protein interactome and intersecting it with the targets of thecandidate drugs (Fig. 4a). For this, we selected the differentially expressed genes identified in Fig. 2fthat showed concordant regulation between aging and SARS-CoV-2 infection and intersected them withthe nodes of the human protein-protein interaction (PPI) network (IRefIndex Version 14 [42]), whichcontains 182,002 interactions between 15,759 human proteins along with a confidence measure for eachinteraction. This resulted in 162 protein-coding genes, which we call terminals (Supplementary Fig. S8and Methods). To gain a better understanding of the molecular pathways connecting these terminalgenes, we used a Steiner tree algorithm [30, 43] to determine a “minimal” subnetwork or interactome within the PPI network that connects these genes (see Methods). A Steiner tree is minimal in thatit is a minimum weight subnetwork that connects the terminals. As edge weights in the PPI networkwe used 1 minus the confidence in the corresponding interactions so as to favor high-confidence edges.After a careful sensitivity analysis to select the various tuning parameters (Methods and SupplementaryFig. S9), this resulted in an interactome containing 252 nodes and 1,003 edges (Fig. 4b and Supplemen-tary Fig. S10). Interestingly, the interactome contained five genes whose corresponding proteins havebeen found in [6] to interact with SARS-Cov-2 proteins (EXOSC5, FOXRED2, LOX, RBX1, RIPK1).The 2-nearest-neighborhoods of these proteins are shown in Fig. 4c. Another Steiner tree analysis re-vealed that two additional SARS-Cov-2 interaction partners (CUL2 and HDAC2) were connected to theidentified interactome via few high-confidence edges (Supplementary Fig. S11 - S13).Next, we intersected the interactome with the targets of the candidate drugs identified in the previousanalysis. A compound was considered if its signature matched the reverse SARS-CoV-2 signature withat least a correlation of 0 .
86, resulting in about 300 FDA approved drugs (see Methods). The targets of4hese drugs were determined using DrugCentral [44, 45] and filtered for high affinity (activity constantslower than 10 µM , a common threshold used in the field for K i , K d , IC
50 or EC Causal structure discovery methods validate serine/threonine and tyrosinekinases as critical targets in SARS-CoV-2 infection in the elderly.
Finally, in order to suggest putative causal drug mechanisms and validate the predicted drugs for COVID-19, we supplemented the PPI analysis with causal structure discovery. Since the edges in the PPI networkand hence in the SARS-CoV-2 and aging interactome are undirected, it is a-priori not clear whether a drugthat targets a node in the interactome has any effect on the differentially expressed terminal nodes, sincethe target may be downstream of these nodes (Fig. 5a). To understand which genes can be modulatedby a drug, it is therefore critical to obtain a causal (directed) network. We obtained single-cell RNA-seqdata for A549 cells from [46] and intersected it with the genes present in the combined SARS-CoV-2 andaging interactome. To learn the (causal) regulatory network among these genes, we took advantage ofrecently developed causal structure discovery algorithms, in particular the greedy sparsest permutation(GSP) algorithm: it performs a greedy search over orderings of the genes to find the sparsest causalnetwork that best fits the data ,and it has been successfully applied to single-cell gene expression databefore [11, 12, 47]. To validate the obtained causal model and benchmark the performance of GSP toother prominent causal structure discovery algorithms including PC and GES [48], we took advantage ofthe gene knockout and overexpression data available from CMap. A causal model should allow predictingthe effect of such interventions. Thus, for each such gene knockout and overexpression experiment inCMap that targeted a gene in the interactome, we inferred the genes whose expression changed as aresult of the intervention, when compared to control samples (Methods and Supplementary Fig. S15a).We then constructed receiver operating characteristic (ROC) curves to evaluate GSP, PC and GES byvarying their tuning parameters and counting an edge i → j as a true positive if intervening on gene i resulted in a change in the expression of gene j and a false positive otherwise, thereby showing that GSPexceeded random guessing based on the PPI network ( p -value=0.0177, see Methods) and outperformedthe other methods (Supplementary Fig. S15b).Having established that the causal network obtained by GSP can be used to predict the effect of anintervention, we turned to analyzing the regulatory effects of the identified candidate drugs on the SARS-CoV-2 and aging interactome in A549 cells. The main connected component of the corresponding causalgraph is shown in Fig. 5a (see also Supplementary Fig. S16a) highlighting the drug targets and the genesthat were found to be differentially expressed by SARS-CoV-2 infection and aging. We then traced thepossible downstream effects for each identified drug, thereby finding that the protein kinase inhibitors andHDAC inhibitors could target the majority of differentially expressed genes in this connected component(Table S1). Similarly, we traced the downstream effects for each gene in the interactome that canbe targeted by one of the identified drugs, thereby finding that EGFR, FGFR3, HDAC1, HSP90AA1,IRAK1, PAK1, RIPK1, RIPK2, STK3 all have downstream nodes in the interactome with RIPK1 havingthe largest number of them (127). To validate these results in a broader context, we obtained single-cellRNA-seq data from [49] and repeated the analysis in AT2 cells, which have been shown to be criticallyaffected by SARS-CoV-2 in humans [31]. The resulting causal network for AT2 cells (SupplementaryFig. S16b) is similar to the one for A549 cells, intersecting it in 55 .
3% of the edges, with EGFR, HDAC1,HSP90AA1, IRAK1, RIPK1 and RIPK2 all having descendants in the interactome, and targets of proteinkinase inhibitors and HDAC inhibitors being particularly central (Table S1). To analyze the most criticaltargets for the crosstalk between SARS-CoV-2 and aging, we repeated the analysis in the interactomeobtained without taking aging into account (Supplementary Fig. S16c). Interestingly, while HDAC1 andHSP90AA1 continued to have widespread effect, the number of genes downstream of RIPK1 changed5rastically to just 1, suggesting that RIPK1 plays a critical role in the SARS-CoV-2 and aging cross-talk.In line with this, while the effect of HDAC inhibitors remained similar in the analysis without ageing,the effect of protein kinase inhibitors changed drastically (Table S1). Collectively, our combined analysispoints to protein kinase inhibitors, and it in particular highlights RIPK1, a serine/threonine-proteinkinase, as one of the main targets against SARS-CoV-2 infections with a highly age-dependent roleand the largest number of downstream differentially expressed genes in the combined SARS-CoV-2 andageing interactome.
Discussion
The repurposing of drugs for SARS-CoV-2 has been a major challenge given the many pathways involvedin host-pathogen interactions and the intricate interplay of SARS-CoV-2 with inflammatory pathways [14,13, 15, 16]. Interestingly, while both young and old individuals are susceptible to SARS-CoV-2 infection,the virus’ pathogenicity is significantly more pronounced in the elderly [17, 18]. Since the mechanicalproperties of the lung tissue change with aging [19], this led us to hypothesize an interplay between viralinfection/replication and tissue aging [21], suggesting that this could play an important role in drugdiscovery programs. While ongoing drug repurposing efforts have analyzed host-pathogen interactionsand the associated gene expression programs [6, 23], they have lacked an integration with aging. Moregenerally, while a number of data-driven and experimental approaches have been proposed for drugidentification and repurposing [1], a platform that systematically integrates different data modalitiesincluding transcriptomic, proteomic and structural data into a principled causal framework to predictthe effect of different drugs has been missing.By combining bulk RNA-seq data from GTEx [24] and Blanco et al. [23], we identified a critical groupof genes that were differentially expressed by aging and by SARS-CoV-2 infection. While previous anal-ysis relied primarily on contrasting the expression in cells with and without SARS-CoV-2 infection [32],we made an attempt to separate the effect of the ACE2 receptor alone and the effect of SARS-CoV-2 incells without ACE2 receptor to extract a more refined differential expression pattern of ACE2-mediatedSARS-CoV-2 infection. While previous computational efforts to repurpose drugs have mainly consideredtwo approaches: (1) identifying drug targets by analyzing disease networks based for example on PPI ortranscriptomic data [5, 4, 6], and (2) identifying drugs by matching their signature (for example obtainedfrom the CMap project [2]) to the reverse disease signature [3], we developed a principled causal frame-work that encompasses these two approaches. First, in order to ensure that the CMap database, whichmeasures expression using 1000 representative genes, would be useful in the context of SARS-CoV-2, wevalidated that the intersection of these genes with the SARS-CoV-2 differentially expressed genes wassignificant. Second, to establish drug signatures based on the CMap database, we employed a particularautoencoder framework [41]. Rather unintuitively, we showed that using an overparameterized autoen-coder, i.e. by using an autoencoder not to perform dimension reduction as usual but to instead embedthe data into a higher-dimensional space, aligned the drug signatures across different cell types. Thisallowed constructing synthetic interventions, i.e., to predict the effect of a drug on a cell type withoutmeasuring it by using other cell types to infer it. Third, to identify drug targets in the pathways in-tersecting SARS-CoV-2 and aging, we connected the differentially expressed genes in the PPI networkusing a Steiner tree analysis [30] and intersected the resulting interactome with high-affinity targets ofthe drugs obtained using the overparameterized autoencoder framework. Finally, while computationaldrug discovery programs have been largely correlative [1], we made use of recent causal structure discov-ery algorithms [11, 47, 48] to validate the identified drug targets and their downstream effects, therebyidentifying protein kinase inhibitors such as axitinib, dasatinib, pazopanib, and sunitinib as drugs ofparticular interest for the repurposing against COVID-19.Among the various protein kinases, in particular from the family of serine/threonine-protein kinases,identified by our drug repurposing pipeline, RIPK1 was singled out by our causal analysis as beingupstream of the largest number of genes that were differentially expressed by SARS-CoV-2 infectionand aging, while losing its central role in the corresponding gene regulatory network without takingaging into account. Notably, RIPK1 has been shown to bind to SARS-CoV-2 proteins [6] and has alsobeen found to be in an age-dependent module [32]. RIPK1 belongs to an interesting family of proteinscomprising of a kinase domain on the N terminus and a death domain on the C terminus; activation of thekinase domain has been associated with epithelial cell homeostasis, while activation of the death domainleads to triggering necroptotic or apoptotic pathways [50, 51], the death pathways potentially triggeringtissue fibrosis [52]. Interestingly, our differential expression analysis found RIPK1 to be upregulated withSARS-COV-2 infection. We hypothesize that upon SARS-CoV-2 infection in older individuals the death6athways may be favored, thereby leading to fibrosis and increased blood clotting. Consistent with this,recent post-mortem lung tissue biopsies of SARS-CoV-2 human patients revealed a fibrotic epitheliumand increased blood clotting [53, 54].Collectively, these results highlight the importance of RIPK1 in the interplay between SARS-CoV-2infection and aging as a potential target for drug repurposing programs. There are various drugs cur-rently approved that non-specifically target RIPK1 (such as pazopanib and sunitinib) as well as underinvestigation that are highly specific to RIPK1 [55, 56]. Given the distinct pathways elicited by RIPK1,there is a need to develop appropriate cell culture models that can differentiate between young and agingtissues to validate our findings experimentally and allow for highly specific and targeted drug discoveryprograms. While our work identified particular drugs and drug targets in the context of COVID-19,our computational platform is applicable well beyond SARS-CoV-2, and we believe that the integrationof transcriptional, proteomic and structural data with network models into a causal framework is animportant addition to current drug discovery pipelines.
Methods
Bulk gene expression data
The RNA-seq gene expression data related to SARS-CoV-2 infection in A549 and A549-ACE2 cells wasobtained from [23] under accession code GSE147507. The RNA-seq data of lung tissues for the aginganalysis was downloaded from the GTEx Portal (https://gtexportal.org/home/index.html) along withmetadata containing the age of the individual from whom the RNA-seq sample was obtained. TheRNA-seq raw read counts were transformed into quantile normalized, log ( x + 1) scaled RPKM values,following the normalization performed in [2]. Differential expression analysis
For differential expression analysis, we focused on genes that were highly expressed, filtering out anygenes with log (RPKM +1) < -fold changes based on the data from [23]. Namely,we defined as ACE2-mediated SARS-CoV-2 genes all genes that had an absolute log -fold change betweenA549-ACE2 cells infected with SARS-CoV-2 and A549-ACE2 cells above threshold, excluding genes thathad an absolute log -fold change above the same threshold in A549-ACE2 cells versus A549 cells and alsoexcluding genes that had an absolute log -fold change above the same threshold in A549 cells infectedwith SARS-CoV-2 versus normal A549 cells. In other words, the ACE2-mediated SARS-CoV-2 geneswere defined as the genes denoted in red in the Venn diagram in Fig. 2b (with pink, brown and yellowsubsets removed). The absolute log -fold change threshold was determined such that the number ofACE2-mediated SARS-CoV-2 genes was 10% of the protein coding genes.In order to determine the age associated genes, we analyzed lung tissue samples obtained from theGTEx portal (https://gtexportal.org/home/index.html) from individuals of varying ages. We computedthe absolute log -fold change between samples of the lung tissue from older (70-79 years old) and younger(20-29 years old) individuals, defining the age associated genes as the top 10% of protein coding geneswith highest absolute log -fold change. We also considered defining age-associated genes based on theabsolute log -fold change comparing individuals who are 20-29 years old versus 60-79 years old, whichyielded similar age-associated genes, with 1339 out of the 1923 genes in common between the two setsas shown in Supplementary Fig. S2b. Gene ontology enrichment analysis
Gene ontology analysis was performed on a given gene set using GSEApy, keeping the top 10 geneontology biological process terms with lowest p -values. All reported terms had p -values ≤ .
05, afteradjusting for multiple hypothesis testing using the Benjamini–Hochberg procedure.
L1000 gene expression data from CMap
The CMap data measured via L1000 high-throughput reduced representation expression profiling, whichquantifies the expression of 1000 landmark genes, was obtained from [2] under accession code GSE92742.7e chose level 2 data, truncated to only the genes that were also measured by [23], and then performedlog ( x + 1) scaling and min-max scaling on each of the resulting 911-dimensional expression vectors. Combined autoencoder and synthetic interventions framework
We first describe our training procedures for the autoencoder framework. CMap contains a total of1,269,922 gene expression vectors and we performed a 90-10 training-test split resulting in 1,142,929training examples and 126,993 test examples. We selected the best model by applying early stoppingwith an upper bound on the number of total epochs being 150. Note that this is well past the usualearly stopping method of applying a patience strategy with a patience of at most 10 epochs [57]. Allhyperparameter settings, optimizer details, and architecture details are presented in SupplementaryFig. S5c. To summarize, we considered a range of fully connected autoencoders with varying depth,width, and nonlinearity, and we used Adam with a learning rate of 10 − for optimization. To computethe drug signatures via the trained autoencoder, we used as embeddings the output of the first hiddenlayer prior to application of the activation function.Drug signatures for the A549 cells (and similarly for the MCF7 and HCC515 cells) in CMap werecomputed by taking the difference between the mean embedding for the A549 samples with drug and themean embedding for the A549 control (DMSO) samples. To remove batch effects, we performed k -meansclustering of the control samples in the embedding space and removed all points falling in the smaller ofthe two clusters (see Supplementary Fig. S4b).Next, we briefly describe the synthetic interventions framework and how the embedding from ourtrained overparameterized autoencoder is used for this. The traditional application of synthetic inter-ventions [27, 28] in the context of drug repurposing would proceed as follows: when a drug signature isunavailable on a given cell type but is available on other cell types, we would express the cell type as alinear combination of the other cell types and use this linear combination to predict the signature on thecell type for which data is unavailable. Since we demonstrated that over-parameterized autoencodersalign drug signatures between different cell types (Supplementary Fig. S7), instead of using a linear com-bination of drug signatures across cell types, we can simply use one of the available drug signatures as thesynthetic intervention. In particular, in this work, we used drug signatures on MCF7 cells to constructsynthetic interventions for A549 cells. We also considered drug signatures on HCC515 cells; however,there was only one FDA approved drug that was applied to HCC515 cells which was not also appliedto A549 cells in CMap. While this analysis did not help to increase the number of considered drugs, weused the data on HCC515 cells in conjunction with the data on A549 and MCF7 cells to validate thatthe overparameterized autoencoder aligns the signatures of drugs between different cell types (Fig. 3dand Supplementary Fig. S7). Cosine similarity between perturbations
For each cell type and perturbation, we computed a cell-type specific “perturbation signature”, which isdefined as the difference between the average gene expression of a cell type under that perturbation andunder the control perturbation, DMSO. Then, for each perturbation, we computed the cosine similarity a · b | a || b | between the perturbation vectors for all pairs of cell types which received that perturbation inCMap. For example, daunorubicin was applied to 14 cell types in cMap, resulting in (cid:0) (cid:1) = 91 cosinesimilarities associated with daunorubicin. All cosine similarities were plotted (Fig. 3e). Steiner tree analysis
Human protein-protein interaction (PPI) network
A weighted version of the publicly available IRefIndex v14 human PPI network [42] was retrieved fromthe OmicsIntegrator2 GitHub repository (http://github.com/fraenkel-lab/OmicsIntegrator2). The in-teractome contains 182,002 interactions between 15,759 proteins. Each interaction e has an associatedcost c ( e ) = 1 − m ( e ) where the score m ( e ) is obtained using the MIScore algorithm [58], which quantifiesconfidence in the interaction e based on several evidence criteria (e.g. number of publications reportingthe interaction and corresponding detection methods).8 uman-SARS-Cov-2 PPI network Drug-target interaction data
Data on the targets of drugs was obtained from DrugCentral (http://drugcentral.org/download), an on-line drug information resource, which includes drug-target interaction data extracted from the literaturealong with metrics (such as inhibition constant K i , dissociation constant K d , effective concentration EC
50, and inhibitory concentration IC
50) measuring the affinity of the drug for its target [44, 45].Drugs in the database are approved by the FDA and may also be approved by other regulatory agencies(such as the EMA). From this database, we filtered out compounds targeting non-human proteins. Wealso discarded drug-target pairs with affinity metrics ( K i , K d , EC
50 or IC
50) higher than 10 µM , a com-monly used threshold in the field. Based on this filtering we obtained a data set containing 12,949 highaffinity drug-target pairs involving 1,457 unique human protein targets and 2,095 unique compounds.This dataset was further restricted to drugs predicted to reverse the SARS-Cov-2 signature (correlationgreater than 0 .
86 in the overparameterized autoencoder embedding). As a result, the final drug-targetdata set included information on 2,296 drug-target pairs involving 652 unique human gene targets and117 unique FDA approved drugs.
Prize-collecting Steiner forest algorithm
The Prize-Collecting Steiner Forest (PCSF) problem is an extension of the classical Steiner tree problem:Given a connected undirected network with non-negative edge weights (costs) and a subset of nodes, the terminals , find a subnetwork of minimum weight that contains all terminals. The resulting subnetworkis always a tree, which in general contains more nodes than the terminals; these are known as
Steinernodes . In the special case when there are only 2 terminals, this boils down to finding the shortest pathbetween these nodes. The Steiner tree problem in general is known to be NP-complete, but variousapproximations are available. The PCSF problem generalizes this problem by introducing prices forthe terminals (in addition to the edge costs already present in the Steiner tree problem) and a dummynode connected to all terminals. The problem is then to find a connected subnetwork that minimizes anobjective function involving the cost of selected edges and the prizes of terminals that are missing fromthe subnetwork as detailed below; we used OmicsIntegrator2 to solve this optimization problem [30].To formally introduce the objective function, let G = ( V, E, c ( · ) , p ( · )) denote the undirected PPInetwork with protein set V (containing N proteins), interaction set E , edge cost function c ( · ), set ofterminals S ⊂ V (containing N proteins) and attributed prizes p ( · ). The version of the PCSF problemsolved by OmicsIntegrator2 [30] and used in this article consists of finding a connected subnetwork T = ( V T , E T ) of the modified graph G ∗ = ( V ∪ { r } , E ∪ {{ r, s } : s ∈ S } ) that minimizes the objectivefunction ψ ( T ) = b (cid:88) v / ∈ V T p ( v ) + (cid:88) e ∈ E T c ∗ ( e )The node r is a dummy root node connecting all terminals in the network. The parameter b ∈ R + linearly scales the node prizes (which are non-zero for terminal nodes exclusively), and the modified edgecost function c ∗ ( · ) can be expressed as follows. For any edge e = { x, y } c ∗ ( e ) = (cid:40) c ( e ) + d x d y d x d y +( N − d x − N − d y − g if e ∈ Ew if e ∈ {{ r, s } : s ∈ S } (1)where d x denotes the degree of node x in G and g, w ∈ R + are tuning parameters. If the resulting treecontains the root node r , r is removed from the tree, and the output is an ensemble of trees, a forest .The final output, the interactome , is the subnetwork in the PPI network induced by the nodes of thisforest. Selection of terminal nodes
Results from the differential expression analysis yielded 219 protein-coding genes that were associatedwith both aging and SARS-Cov-2 infection. Of particular interest among these genes were 181 genes9hat showed concordant regulation, i.e. they were either upregulated in both SARS-Cov-2 infection andaging or downregulated in both SARS-Cov-2 infection and aging. Intersecting the proteins correspondingto these 181 genes with proteins in the IREF interactome resulted in 162 proteins. These 162 proteinswere selected as terminal nodes for the PCSF algorithm and prized according to their absolute log -foldchange between SARS-Cov-2-infected A549-ACE2 cells and normal A549-ACE2 cells (SupplementaryFig. S8) Parameter sensitivity analysis
Running the PCSF algorithm in the OmicsIntegrator2 required specifying three tuning parameters: g , w and b . In order to guarantee the robustness of the resulting network with respect to moderate changesin these parameters, we selected the parameters based on a sensitivity analysis.The parameter g modifies the background PPI network by imposing an additive penalty on eachedge based on the degrees of the corresponding vertices. It reduces the propensity of the algorithm toselect hub nodes connecting many proteins in the interactome. While this feature may be relevant incertain biological applications, it was not necessarily the case in our work since high degree nodes maybe of interest for the purpose of drug target identification. In the cost function in Equation (1), theabsence of penalty corresponds to g = −∞ . However the OmicsIntegrator2 implementation only allowsfor g ∈ R + . In Supplementary Fig. S9a1, we reported boxplots of penalized edge costs in the IREFinteractome for different values of g . These boxplots suggest that the hub penalty parameter g = 0yields similar edge costs to the desired setting where g = −∞ . For this reason we chose the value g = 0in all OmicsIntegrator2 runs in this work.The parameter w corresponds to the cost of edges connecting terminal nodes to the dummy root r .This parameter influences the number of trees in the Steiner forest. If w is chosen too low compared tothe typical shortest path cost between two terminals, a trivial solution will connect all terminal nodesvia r , leading to fully isolated terminals in the final forest. For high values of w the PCSF algorithmwill not include the root r and output a connected network. Based on the histogram of the cost of theshortest path between any two terminals in the IREF interactome reported in Supplementary Fig. S9a2,we ran a sensitivity analysis for w in the range [0 . , b linearly inflates the prizes of terminal nodes in the objective function. Higher valuesof b result in more terminal nodes in the final PCSF. We analyzed edge costs in the network to determinea suitable range for b so as to include many terminal nodes in the resulting interactome. SupplementaryFig. S9a1 shows that the maximum edge cost in the network for g = 0 was lower than 1, which meantthat making b of order greater than 1 was necessary to ensure that trading off cost of edges added andprizes collected in the solution would rarely require discarding a terminal node. For this reason we rana sensitivity analysis for b in the range [5 , g = 0 and ran a sensitivity analysis as described inSupplementary Fig. S9b with w ∈ { . , . , . , . , , . , . , . , . , } and b ∈ { , , , , , , , , , } . We obtained 100 PCSFs, each corresponding to a particular choice of ( w, b ). All of themincluded the entire terminal set S , a desired property resulting from the chosen range of the values of b . To analyze the robustness of the resulting networks to changes in the parameters, we analyzed thematrix M ∈ [0 , × defined by M ij = (cid:12)(cid:12) { nodes innetwork i } ∩ (cid:8) nodes innetwork j (cid:9) ∩ C (cid:12)(cid:12)(cid:12)(cid:12)(cid:0) { nodes innetwork i } ∪ (cid:8) nodes innetwork j (cid:9)(cid:1) ∩ C (cid:12)(cid:12) for every pair of PCSFs i and j corresponding to parameters ( w i , b i ) and ( w j , b j ), respectively. Supple-mentary Fig. S9c displays heatmaps of this matrix. We considered three different node sets C , namelythe set of all nodes in the input PPI network (Supplementary Fig. S9c1), the subset of terminal nodes( C = S , Supplementary Fig. S9c2) and the subset of SARS-Cov-2 interaction partners (SupplementaryFig. S9c3). Supplementary Fig. S9c1, S9c2, S9c3 illustrate that choosing any ( w, b ) ∈ [1 . , × [5 , w and b . Collectively, this sensitivity analysis motivated the choice of g = 0, w = 1 . b = 40 used to obtain the interactome in Fig. 4b, where nodes are grouped by generalfunction. The same interactome is presented in Supplementary Fig. S10 with nodes grouped by generalprocess. Note that since this interactome included all terminals and did not include the root node, it isequivalent to the solution of the classical Steiner tree problem.10 eighborhood analysis For the interactomes obtained in this work, we reported 2-nearest-neighborhoods of genes of interest inFig. 4c for the interactome of Fig. 4b, in Supplementary Fig. S13 for the interactome of SupplementaryFig. S12, and in Supplementary Fig. S14d for the interactome in Supplementary Fig. S14c. Depending onthe interactome, genes of interest include SARS-Cov-2 interaction partners (e.g. EXOSC5, FOXRED2,LOX, RBX1, RIPK1) as well as genes of potential therapeutic interest (e.g. HDAC1, EGFR). Neighbor-hood plots were enriched with information such as SARS-Cov-2 interaction partners and FDA approved,high affinity (based on data from DrugCentral) drugs with high correlation to the reverse SARS-Cov-2 in-fection signature. To improve legibility of the neighborhood networks, we discarded the highly connectedhub node UBC (connected to 62% of proteins in the IREF network). To further improve legibility, weapplied an upper threshold on edge cost (i.e., only visualizing high confidence edges) when the neighbor-hood networks were too densely connected. We generally chose this threshold at 0.53, with the exceptionof the LOX neighborhood (0.58) and the FOXRED2, ETFA and GNB1 neighborhoods (no thresholding).For each edge e in a given neighborhood, we defined the min-max scaled edge confidence C ( e ) as C ( e ) = max e (cid:48) ∈E c ( e (cid:48) ) − c ( e )max e (cid:48) ∈E c ( e (cid:48) ) − min e (cid:48) ∈E c ( e (cid:48) ) ∈ [0 , E denotes the edge set of the corresponding interactome and c ( e ) denotes the cost of edge e in thePPI network. This confidence metric was used to color edges in the neighborhood plots. Addition of SARS-Cov-2 interaction partners to the terminal node list
In order to understand which other SARS-CoV-2 protein interaction partners were in the neighborhoodof the identified interactome, we also ran the PCSF algorithm on the IREF PPI network using theSARS-Cov-2 and aging terminal list augmented with all known SARS-Cov-2 interaction partners. AllSARS-Cov-2 interaction partners (with the exception of EXOSC5, FOXRED2 and LOX which werealready present in the original terminal gene list) were given a small prize p . This prize was chosenby sensitivity analysis over a range of possible values from p = 0 (5 SARS-Cov-2 interaction partnersinitially selected by the method: EXOSC5, FOXRED2, LOX, RBXL1, RIPK1) to p = 0 .
02, beyondwhich all 332 known SARS-Cov-2 interaction partners belonged to the computed interactome. Fine-grained analysis revealed that choosing p ∈ [4 × − , − ] leads to interactomes which include a stableset of 7 SARS-Cov-2 interaction partners, the 5 present initially plus CUL2 and HDAC2 (SupplementaryFig. S11a). Supplementary Fig. S11b-S11c display heatmaps of the matrix M ∈ [0 , × defined as M ij = (cid:12)(cid:12)(cid:0) { nodes innetwork i } \ (cid:8) nodes innetwork j (cid:9)(cid:1) ∩ C (cid:12)(cid:12) |{ nodes innetwork i } ∩ C| for every pair of PCSFs i and j corresponding to parameters p i and p j , respectively. For the sensitivityanalysis, we considered two different node sets C , namely the set of all nodes in the input PPI network(Supplementary Fig. S11b) as well as the subset of SARS-Cov-2 interaction partners (SupplementaryFig. S11c). Supplementary Fig. S11b shows that the obtained interactome was stable over the range p ∈ [7 × − , − ]. Supplementary Fig. S11c shows that all SARS-Cov-2 interaction partners collectedin the interactome when p ∈ [7 × − , − ] were also collected for higher values of p , which is aconsequence of the observation from Supplementary Fig. S11b. We used the value p = 8 × − for allsubsequent analyses and figures, including Supplementary Fig. S12 and Supplementary Fig. S13. Single-cell RNA-seq analysis
Single-cell RNA-seq for A549 cells was obtained from GSE81861 [46], where each entry in the ma-trix represents the gene expression (FPKM) of gene i in cell j . We preprocessed the data, keep-ing only genes that had a nonzero gene expression value in more than 10% of the cells, followed bylog ( x ( x + 1)transformation of the data as for A549 cells. 11 valuation of causal structure discovery algorithms Prior to reporting the results of learning gene regulatory networks on A549 and AT2 cells, we bench-marked several causal structure discovery methods on the task of predicting the effects of interventionsusing gene knockout and overexpression data collected on A549 cells as part of the CMap project [2],similar to prior evaluations of causal methods [11, 12]. We estimated the gene regulatory network un-derlying the identified interactome in A549 cells using the prominent causal structure discovery methodsPC, GES and GSP [8, 48, 47]. Since not all edge directions are identifiable from purely observationaldata, these methods output a causal graph containing both directed and undirected edges. Since theadvantage of causal networks is their ability to predict the effects of interventions on downstream genes,we evaluated these methods using interventions collected in CMap. In the following, we first describehow we estimated the effects of interventions based on the CMap data to use as ground truth for eval-uating causal structure discovery methods. We focused our evaluation on genes and interventions thatare shared between the combined SARS-CoV-2 and aging interactome and CMap knockout and overex-pression experiments, resulting in 32 genes and 41 interventions (note that the number of interventionsis larger than the number of genes, since in CMap interventions have been performed on genes thatare not part of the L1000 landmark genes, but are contained in the interactome). We formed a ma-trix of genes by interventions, where each ( i, j )-entry in the matrix represents the log -fold change inexpression of gene i when gene j was intervened on in comparison to the expression of gene i withoutintervention. We denoted by Q the binary matrix of intervention effects with Q ij = 1 if the sign of thelog -fold change for the ( i, j ) entry was opposite for knockout and overexpression interventions to filterout unsuccessful interventions, the rational being that knockout and overexpression should have oppositedownstream effects. Thus Q ij = 1 denotes that perturbing gene j effects gene i and hence that gene i isdownstream of gene j (Supplementary Fig. S15a). Taking this matrix of interventional effects, Q , as theground truth, we estimated the causal graph using the PC, GES and GSP algorithms and determinedthe corresponding ROC curve, counting and edge from j → i as a true positive if Q ij = 1 and a falsepositive otherwise (Supplementary Fig. S15b). In order to statistically evaluate whether the differentalgorithms performed better than random guessing, we sampled causal graphs (from an Erd¨os-Renyimodel, where the edges were directed based on a uniformly sampled permutation) with different edgeprobabilities from the PPI network and calculated the corresponding number of true and false positives.For each false positive level, we created a distribution over true positives based on the sampled randomcausal graphs and calculated the p -value for the number of true positives obtained from the PC, GESand GSP algorithms. We combined the p -values across different numbers of false positives using Fisher’smethod and used this combined p -value for evaluating whether the PC, GES and GSP algorithms weresignificantly different from random guessing. Causal structure discovery for learning gene regulatory networks
In order to learn the gene regulatory networks governing A549 and AT2 cells, we used the recent structurediscovery method GSP [47, 11, 12] on single-cell RNA-seq data from A549 cells as well as AT2 cellswith the PPI network on 252 nodes as a prior. We used GSP since based on the previous analysis itoutperformed the PC and GES algorithms in terms of ROC analysis on predicting the effect of geneknockout and overexpression experiments in A549 cells ( p -value = 0 . p -value = 0 . p -value = 0 . α -level for conditional independence testing), edges with high selectionprobability (0.3 for A549 cells and 0.4 for AT2 cells) were retained. The threshold for AT2 cells waschosen so as to approximately match the number of edges in the A549 network. Data and code availability
All data used in this work is publicly available from the sources mentioned. We relied on open sourcecode for the analysis (including OmicsIntegrator2, the R package pcalg, as well as the python packagescausaldag, GSEApy, networkx, numpy, pandas, PyTorch, scikit-learn, scipy). A comprehensive GitHubrepository containing all data and code will be made available upon publication of the manuscript.12 eferences [1] Pushpakom, S. et al.
Drug repurposing: progress, challenges and recommendations.
Nature ReviewsDrug Discovery , 41–58 (2019).[2] Subramanian, A., Narayan, R., Corsello, S. M. et al. A next generation connectivity map: L1000platform and the first 1,000,000 profiles.
Cell , 1437—1452.e1 (2017).[3] Dudley, J. T., Deshpande, T. & Butte, A. T. Exploiting drug–disease relationships for computationaldrug repositioning.
Briefings in Bioinformatics , 303—311 (2011).[4] Greene, C. S. & Voight, B. F. Pathway and network-based strategies to translate genetic discoveriesinto effective therapies. Human Molecular Genetics , R94–R98 (2016).[5] Smith, S. B., Dampier, W., Tozeren, A., Brown, J. R. & Magid-Slav, M. Identification of commonbiological pathways and drug targets across multiple respiratory viruses based on human host geneexpression analysis. PLoS One , e331741 (2012).[6] Gordon, D. E. et al. A sars-cov-2 protein interaction map reveals targets for drug repurposing.
Nature (2020).[7] Pearl, J.
Causality (Cambridge University Press, Cambridge, UK, 2009), 2 edn.[8] Spirtes, P., Glymour, C. & Scheines, R.
Causation, Prediction, and Search (MIT press, 2000).[9] Eberhardt, F.
Causation and Intervention (PhD thesis, Department of Philosophy, Carnegie MellonUniversity, 2007).[10] Meinshausen, N. et al.
Methods for causal inference from gene perturbation experiments and vali-dation.
Proceedings of the National Academy of Sciences, U.S.A. , 7361–7368 (2016).[11] Wang, Y., Solus, L., Yang, K. D. & Uhler, C. Permutation-based causal inference algorithms withinterventions.
Advances in Neural Information Processing Systems (2017).[12] Yang, K. D., Katcoff, A. & Uhler, C. Characterizing and learning equivalence classes of causal dagsunder interventions. Proceedings of Machine Learning Research , 5537–5546 (2017).[13] de Wit, E., van Doremalen, N., Falzarano, D. & Munster, V. J. Sars and mers: recent insights intoemerging coronaviruses. Nature Reviews Microbiology , 523 (2016).[14] Fung, T. S. & Liu, D. X. Human coronavirus: Host-pathogen interaction. Annual Review ofMicrobiology , 529–557 (2019).[15] Poppe, M. et al. The nf- κ b-dependent and-independent transcriptome and chromatin landscapes ofhuman coronavirus 229e-infected cells. PLoS Pathogens , e1006286 (2017).[16] Yang, C. W. et al. Targeting coronaviral replication and cellular jak2 mediated dominant nf- κ bactivation for comprehensive and ultimate inhibition of coronaviral activity. Scientific Reports ,4105 (2017).[17] Wu, J. T. et al. Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan,China.
Nature Medicine , 506–510 (2020).[18] Onder, G., Rezza, G. & Brusaferro, S. Case-fatality rate and characteristics of patients dying inrelation to COVID-19 in Italy. JAMA , 1775–1776 (2020).[19] Sicard, D. et al.
Aging and anatomical variations in lung tissue stiffness.
American Journal ofPhysiology-Lung Cellular and Molecular Physiology , L946–L955 (2018).[20] Mitra, A. et al.
Cell geometry dictates TNF α -induced genome response. Proceedings of the NationalAcademy of Sciences, U.S.A. , E3882–E3891 (2017).[21] Uhler, C. & Shivashankar, G. V. Mechano-genomic regulation of coronaviruses and its interplaywith ageing.
Nature Reviews Molecular Cell Biology , 247–248 (2020).1322] Zhou, Y. et al. Network-based drug repurposing for novel coronavirus 2019-ncov/sars-cov-2.
CellDiscovery , 14 (2020).[23] Blanco-Melo, D. et al. Imbalanced host response to sars-cov-2 drives development of covid-19.
Cell (2020).[24] Carithers, L. J., Ardlie, K., Barcus, M., Branton, P. A. et al.
A novel approach to high-qualitypostmortem tissue procurement: the gtex project.
Biopreservation and Biobanking , 311–319(2015).[25] Baldi, P. Autoencoders, unsupervised learning, and deep architectures. In Guyon, I., Dror, G.,Lemaire, V., Taylor, G. & Silver, D. (eds.) Proceedings of ICML Workshop on Unsupervised andTransfer Learning , vol. 27 of
Proceedings of Machine Learning Research , 37–49 (PMLR, Bellevue,Washington, USA, 2012).[26] LeCun, Y., Bengio, Y. & Hinton, G. Deep learning.
Nature , 436–444 (2015).[27] Agarwal, A., Cosson, R., Shah, D. & Shen, D. Synthetic interventions. In
Proceedings of CausalMLNeurIPS Workshop (2019).[28] Abadie, A., Diamond, A. & Hainmueller, J. Synthetic control methods for comparative case studies:Estimating the effect of california’s tobacco control program.
Journal of the American StatisticalAssociation , 493–505 (2010).[29] De Las Rivas, J. & Fontanillo, C. Protein–protein interactions essentials: Key concepts to buildingand analyzing interactome networks.
PLoS Computational Biology , e1000807 (2010).[30] Huang, S. S. & Fraenkel, E. Integrating proteomic, transcriptional, and interactome data revealshidden components of signaling and regulatory networks. Science Signaling , ra40 (2009).[31] Hoffmann, M. et al. SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by aclinically proven protease inhibitor.
Cell , 271–280.e8 (2020).[32] Chow, R. D. & Chen, S. The aging transcriptome and cellular landscape of the human lung inrelation to SARS-CoV-2. https://doi.org/10.1101/2020.04.07.030684 (2020).[33] McInnes, L., Healy, J. & Melville, J. Umap: Uniform manifold approximation and projection fordimension reduction. arXiv preprint arXiv:1802.03426 (2018).[34] Kort, E. J. & Jovinge, S. Streamlined analysis of lincs l1000 data with the slinky package for r.
Bioinformatics , 3176–3177 (2019).[35] Niepel, M. et al. Common and cell-type specific responses to anti-cancer drugs revealed by highthroughput transcript profiling.
Nature Communications , 1186 (2017).[36] Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science , 504–507 (2006).[37] Yang, K. D. et al.
Autoencoder and optimal transport to infer single-cell trajectories of biologicalprocesses.
PLoS Computational Biology , e1007828 (2020).[38] Yang, K. D. et al. Multi-domain translation between single-cell imaging and sequencing data usingautoencoders. bioRxiv, https://doi.org/10.1101/2019.12.13.875922 (2019).[39] Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses.
NatureMethods , 715–721 (2019).[40] Ghahramani, A., Watt, F. M. & Luscombe, N. M. Generative adversarial networks simulate geneexpression and predict perturbations in single cells. bioRxiv, https://doi.org/10.1101/262501 (2018).[41] Radhakrishnan, A., Belkin, M. & Uhler, C. Overparameterized neural networks can implementassociative memory. arXiv:1909.12362 (2019).[42] Razick, S., Magklaras, G. & Donaldson, I. M. irefindex: a consolidated protein interaction databasewith provenance. BMC Bioinformatics , 405 (2008).1443] Tuncbag, N., McCallum, S., Huang, S. & Fraenkel, E. Steinernet: A web server for integrating‘omic’ data to discover hidden components of response pathways. Nucleic Acids Research (2012).[44] Ursu, O. et al.
Drugcentral: online drug compendium.
Nucleic Acids Research gkw993 (2016).[45] Ursu, O. et al.
Drugcentral 2018: an update.
Nucleic acids research , D963–D970 (2019).[46] Li, H. et al. Reference component analysis of single-cell transcriptomes elucidates cellular hetero-geneity in human colorectal tumors.
Nature Genetics , 708 (2017).[47] Solus, L., Wang, Y. & Uhler, C. Consistency guarantees for greedy permutation-based causalinference algorithms. arXiv:1702.03530 (2017).[48] Glymour, C., Zhang, K. & Spirtes, P. Review of causal discovery methods based on graphicalmodels. Frontiers in Genetics , 524 (2019).[49] Reyfman, P. A. et al. Single-cell transcriptomic analysis of human lung provides insights into thepathobiology of pulmonary fibrosis.
American Journal of Respiratory and Critical Care Medicine , 1517–1536 (2019).[50] Festjens, N., Berghe, T. V., Cornelis, S. & Vandenabeele, P. RIP1, a kinase on the crossroads of acell’s decision to live or die.
Cell Death & Differentiation , 400–410 (2007).[51] Dannappel, M. et al. RIPK1 maintains epithelial homeostasis by inhibiting apoptosis and necrop-tosis.
Nature , 90–94 (2014).[52] Sauler, M., Bazan, I. S. & Lee, P. J. Cell death in the lung: the apoptosis–necroptosis axis.
AnnualReview of Physiology , 375–402 (2019).[53] Jose, R. J. & Manuel, A. COVID-19 cytokine storm: the interplay between inflammation andcoagulation. The Lancet Respiratory Medicine (2020).[54] Spagnolo, P. et al.
Pulmonary fibrosis secondary to COVID-19: a call to arms?
The LancetRespiratory Medicine (2020).[55] Martens, S., Hofmans, S., Declercq, W., Augustyns, K. & Vandenabeele, P. Inhibitors targetingRIPK1/RIPK3: Old and new drugs.
Trends in Pharmacological Sciences , 209–224 (2020).[56] Degterev, A., Ofengeim, D. & Yuan, J. Targeting RIPK1 for the treatment of human diseases. Proceedings of the National Academy of Sciences, U.S.A , 9714–9722 (2019).[57] Goodfellow, I., Bengio, Y. & Courville, A.
Deep Learning , vol. 1 (MIT Press, 2016).[58] Kedaigle, A. J.
Integrating Omics data: a new software tool and its use in implicating therapeutictargets in Huntington’s disease . Ph.D. thesis, Massachusetts Institute of Technology (2018).[59] Meinshausen, N. & B¨uhlmann, P. Stability selection.
Journal of the Royal Statistical Society: SeriesB (Statistical Methodology) , 417–473 (2010). Acknowledgments
A.B. was supported by J-WAFS and J-Clinic for Machine Learning and Health at MIT. A.R. wassupported by the National Science Foundation (DMS-1651995) and IBM. C.S. and K.D.Y. were supportedby the National Science Foundation (NSF) Graduate Research Fellowships and ONR (N00014-18-1-2765and N00014-18-1-2765). G.V.S. was supported by ETH funding. C.U. was partially supported byNSF (DMS-1651995), ONR (N00014-17-1-2147 and N00014-18-1-2765), IBM, and a Simons InvestigatorAward. The Titan Xp used for this research was donated by the NVIDIA Corporation.
Author Contributions
All authors designed the research. A.B., L.C., A.R., C.S. and K.D.Y. developed and implemented thealgorithms and performed model and data analysis. A.B., L.C., A.R., G.V.S. and C.U. wrote the paper.15 ompeting Interests
The authors declare no competing interests. 16igure 1: Overview of computational drug repurposing platform for COVID-19. (a) COVID-19 is associ-ated with more severe outcomes in older individuals, suggesting that gene expression programs associatedwith SARS-CoV-2 and aging must be analyzed in tandem. A potential hypothesis regarding the cross-talk between SARS-CoV-2 and aging relies on changes in tissue stiffness in older individuals, outlinedin [21]. (b) In order to identify potential drug candidates for COVID-19, we integrated RNA-seq datafrom SARS-CoV-2 infected cells (obtained from [23]) and RNA-seq data from the lung tissue of youngand old individuals (collected as part of the GTEx project [24]) with protein-protein interaction data(from [42]), drug-target data (from DrugCentral [45]) and the large-scale transcriptional drug screenCMap [2]. (c) Based on this data, we develop a novel drug repurposing pipeline, which consists of first,mining relevant drugs by matching their signatures with the disease signature in the latent embeddingobtained by an overparameterized autoencoder and sharing data across cell types to obtain missing drugsignatures via synthetic interventions. Second, we identify a disease interactome within the protein-protein interaction network by identifying a minimal subnetwork that connects the genes differentiallyexpressed by SARS-CoV-2 infection and aging using a Steiner tree analysis. Third, we validate thedrugs identified in the first step that have targets in the interactome by identifying the potential drugmechanism using causal structure discovery. 17igure 2: Identification of differentially regulated genes in SARS-CoV-2 infection and aging. (a) Geneexpression (log RPKM + 1) of A549-ACE2 cells infected with SARS-CoV-2 versus normal A549-ACE2cells. Genes associated with ACE2-mediated SARS-CoV-2 infection after removing just ACE2-specificor just SARS-CoV-2 infection-specific genes are shown in red. (b) Venn diagram, showing the number ofgenes in sets considered for obtaining the 1926 genes in the red subset and shown in red in (a) associatedwith ACE-2 mediated SARS-Cov-2 infection. (c) Top 10 gene ontology terms associated with SARS-CoV-2 infection (adjusted p -value < . RPKM + 1) of cells collectedfrom lung tissue of older (70-79 years old) versus younger (20-29 years old) individuals. Differentiallyexpressed genes associated with aging are shown in blue and genes that are associated with both aging andSARS-CoV-2 are shown in orange. (e) Venn diagram of genes associated with SARS-CoV-2 and aging;intersection is significant ( p -value = 0 . -fold changes ofdifferentially expressed genes shared by SARS-CoV-2 and aging; most genes show concordant expression,i.e., they are both upregulated or both downregulated with SARS-CoV-2 infection and ageing. (g) Tableof the top 10 most differentially expressed genes across aging and SARS-CoV-2, based on the sum oftheir ranks with log -fold changes for each gene. 18igure 3: Mining FDA approved drugs by correlating disease and drug signatures using an overparam-eterized autoencoder embedding. (a) Gene expression (log RPKM +1) of A549-ACE2 cells infectedwith SARS-CoV-2 versus normal A549-ACE2 cells with genes collected as part of the CMap study usingL1000 reduced representation expression profiling method highlighted as stars, showing that L1000 genessignificantly overlap with SARS-CoV-2 associated genes ( p -value=7 . × − , Fisher’s exact test). (b)Signature of SARS-CoV-2 infection on A549 and A549-ACE2 cells visualized using the first two princi-pal components based on RNA-seq data from [23]. Signature of SARS-CoV-2 infection is aligned acrossnormal A549 and A549-ACE2 cells as well as across different levels of infection. (c) Comparison of thesignatures of a selection of 13 representative FDA approved drugs as compared to the signature of SARS-CoV-2 infection based on A549-ACE2 cells visualized using the first two principal components. Drugswhose signatures maximally align with the direction from SARS-CoV-2 infection to normal are consid-ered candidates for treatment. As expected, drugs have varying signatures of varying magnitudes. (d)Correlation between drug signatures in A549 and MCF7 cells when using the original L1000 expressionspace versus the embedding obtained from an overparameterized autoencoder. The overparameterizedautoencoder aligns the drug signatures in A549 and MCF7 cells by shifting the correlations towards -1 or1 while maintaining the sign of the correlation in the original space. (e) Histogram of correlations betweencell types for a given drug using original L1000 gene expression vectors, overparameterized autoencoderembedding, top 100 principal components, and top 3 principal components. The overparameterized au-toencoder achieves about the same alignment of drug signatures as using the top 3 principal components,while at the same time faithfully reconstructing the data (10 − training error). (f) A list of drugs whosesignatures maximally align with the direction from SARS-CoV-2 infection to normal in A549-ACE2 cells(MOI 2) with respect to correlations using the overparameterized autoencoder embedding, the originalL1000 gene expression space, and the top 100 principal components.19igure 4: Drug target discovery via Steiner tree analysis to identify putative molecular pathways linkingdifferentially expressed genes in SARS-Cov-2 infection and aging. (a) The general procedure takes asinput a list of genes of interest (terminal nodes) with prizes indicating their respective importance, aprotein-protein interaction (PPI) network with edge cost/confidence information (e.g. from IRefIndexv14 [42]), and a list of drugs of interest along with their protein targets and available affinity constants(e.g. from DrugCentral [44, 45]). In this study we consider 181 terminal nodes (of which 162 are presentin the IREF network) corresponding to genes differentially expressed in SARS-CoV-2 infection and agingfrom Figure 2 that are either up-regulated in both SARS-Cov-2 infection and aging or down-regulatedin both SARS-Cov-2 infection and aging. The prize of a terminal node equals the absolute value ofits log -fold change in SARS-Cov-2-infected A549-ACE2 cells versus normal A549-ACE2 cells based onthe data from [23]. Terminals and PPI data are processed using OmicsIntegrator2 [30] to output thedisease interactome , i.e., the subnetwork induced by a Steiner tree, with drug targets indicated by greendiamonds and terminal nodes colored according to their prizes. (b) Interactome obtained using thisprocedure. Genes are grouped by general function and marked with a cross if known to interact withSARS-Cov-2 proteins based on data from [6]. (c) 2-Nearest-Neighborhoods of nodes of interest (denotedby a red hexagon) in the interactome. A threshold was applied on the edge confidence to improvereadability. Proteins known to interact with SARS-Cov-2 are denoted by blue squares, drug targets aredenoted as green diamonds, terminal nodes are colored according to their log -fold change in SARS-Cov-2-infected A549-ACE2 cells versus normal A549-ACE2 cells, Steiner nodes appear in grey. (d) Tableof drug targets and corresponding drugs in the interactome. Selected drugs are FDA approved, highaffinity (at least one of the activity constants K i , K d , IC
50 or EC
50 is below 10 µM ), and match theSARS-Cov-2 signature well (correlation > . − log (activity). Proteinname corresponding to each gene is included. 20 b � Drug targetTerminal nodeSteiner nodeDrug targetTerminal nodeSteiner node � c A549-ACE2 with SARS-CoV-2vs. A549-ACE2Older vs. younger l og2 f o l d c hange d RBM33RIPK1RIPK2TRIM38CD40 CBLUBE2D2BTRC CUL1 UBCHIVEP2IRAK1CRY1 RPL23ACAT1MYH9EBNA1BP2 PSMA3QDPR FARSA PSMD3 PARP1HIF1AUBE2N SMAD9SHFM1SUMO1 ATREGFR ATIC MLH1IRF2 RAD21 MYCHECW2HSP90AA1 PLSCR1MSH2UIMC1SUMO2 METTL13TP53 RAC1BTN2A2 TERF2IPVHLPCCB DAXXRANBP9 MRPL27 CTNNB1FAHSLC3A2 XRN1MIPEPDHX15 SMAD3ALDH1B1SEC31BUBE2D3 ZSWIM6 PRMT5 FGFR1TERF2 RNF2 MSH6PHC3 DPP3 PPA2HIST1H1EPYCR1 SP100SRPRBMRPS16YY1AP1 ACVR2AHMBS SLC7A7 ZNF8 EP300 AIFM1CLN6 KDM1AFBXO6AGA HDAC1 MBD3GATAD2B SMAD4 CREBBPREL ZNF217 PHF21AMDM2TCTA HOXB3CDH1 EXOSC5NUDT14STX1AFH VAMP2SNF8RELAKAT6A SAV1XIAPXAF1 CCSWDR48USP12 PEX19TRAF6 APPNLN LARP6 SNAP25GTPBP1PAK1 IPO4 OCIAD2BRSK1 COMMD3-BMI1SLC25A11FN1 PAICSPOP1 VKORC1RBX1 PMLTCF4 NIPSNAP1NFKBIANCK2 STK3 l og2 f o l d c hange l og2 f o l d c hange SLC3A2PEX19EGFRATRCTNNB1 CBLRPL23 ATIC FAHFARSAUBE2D2 TP53MYH9 SUMO1RIPK2 RBM33 RAD21ACAT1CD40RANBP9IRAK1 HIVEP2MRPL27 DAXXSMAD3PSMA3UBE2D3TRIM38 PAK1 POP1IRF2RNF2 CUL1 EBNA1BP2 QDPRGTPBP1 IPO4 OCIAD2 LARP6 APPXAF1 NLN SNAP25TRAF6 UBCRIPK1
Figure 5: Causal mechanism discovery of potential drug targets. (a) In an undirected PPI network(left), edge directions for a particular drug target (green diamond) are unknown. Establishing causaldirections is important since it is of interest to avoid drug targets that do not have many downstreamnodes in the disease interactome (middle) and instead choose drug targets that have a causal effect onmany downstream nodes in the disease interactome (right). (b) Causal network underlying the combinedSARS-CoV-2 and aging interactome in A549 cells with gene targets of selected drugs in boxes (largestconnected component shown). (c) Causal subnetwork of A549 cells corresponding to nodes within 5nearest neighbors of RIPK1. The node color corresponds to the log -fold change of A549-ACE2 withversus without SARS-CoV-2. (d) Heatmap of log -fold change of genes that are downstream of RIPK1.21 upplementary Materials:Causal Network Models of SARS-CoV-2 Expression and AgingIdentify Drugs for Repurposing Anastasiya Belyaeva , Louis Cammarata , Adityanarayanan Radhakrishnan ,Chandler Squires , Karren Dai Yang , G.V. Shivashankar , , Caroline Uhler , ∗ Massachusetts Institute of Technology, U.S.A. Harvard University, U.S.A. ETH Zurich, Switzerland Paul Scherrer Institute, Switzerland
Equal contribution. ∗ To whom correspondence should be addressed; E-mail: [email protected].
This PDF file includes:
Supplementary Figures S1 to S16Supplementary Table S1Supplementary Dataset 1 captionReferences 22 upplementary Figures
Supplementary Fig. S1: (a) Gene expression of A549-ACE2 cells with and without SARS-CoV-2 in-fection, with differentially expressed genes in red. (b) Gene expression of A549 cells with and withoutSARS-CoV-2 infection, with differentially expressed genes in purple. (c) Gene expression of A549 cellswith and without ACE2 receptor, with differentially expressed genes in green. (d) Top 10 gene ontologyterms associated with differentially expressed genes between A549-ACE2 cells with and without SARS-CoV-2 infection . (e) Top 10 gene ontology terms associated with differentially expressed genes betweenA549 cells with and without SARS-CoV-2 infection. (f) Top 10 gene ontology terms associated withdifferentially expressed genes between A549 cells with and without ACE2 receptor. All gene ontologyterms have adjusted p -value < .
05. 23 sfd
584 5841339
Aging 20-29 vs. 70-79 Aging 20-29 vs. 60-79 a b carboxylic acid catabolic process (GO:0046395)organic acid catabolic process (GO:0016054)mesonephric epithelium development (GO:0072163)positive regulation of muscle organ development (GO:0048636)regulation of organ morphogenesis (GO:2000027)regionalization (GO:0003002)cellular amino acid metabolic process (GO:0006520)positive regulation of striated muscle tissue development (GO:0045844)positive regulation of muscle tissue development (GO:1901863)kidney epithelium development (GO:0072073) Aging associated genes
Supplementary Fig. S2: (a) Top 10 gene ontology terms associated with aging. (b) Venn diagramshowing significant overlap between aging associated genes considering different definitions of older,specifically just individuals in the oldest category (70-79) or individuals that are 60-79.24 ab F O X C G NRH S N AP N BP H I VEP T C T E R AP G E F I L20 R B CR EB PA R G RH L1 O AS L H I S T H D O V G P H I S T H ES U M O C O L E C H AS CH I C ZF C H S M A D Z N F E LL H I S T H EA R L4 D L O X I F I C r f C r f Z S W I M P M L CR Y YY AP I NH BASP E T V SP R E D I R F Z S W I M H I S T H BBSE T D XA F D LL1 M T M R K R T AP - A R P KA T AP H F AP L S CR R I PK T R I M W H A MM G A T A D BP D E BX RN I S G W DR C EP Z N F M C T P G T PBP CD K T SPY L4 S C G R B M U SP T H AP H O XB P R O X A UR K CCCDC B T N A B R SK M K L N T S C D CHN R ASA T E R F I P C TT N BP N L T R A F P HC R B M SAV S F M B T RN F SE C B H E C AS M P D L3 BPY CR GO T T L CD T M S F N S DH L H A DH T RU B I P O P L S D PP S L C A G P D F O X R E D EX O S C MM P M P DU Q D P R T M E M D O L K F A H P CC B DCDC F A R SAAPE H S R P R B G L B G BAA R S D A D K C YB D M E TT L13 T M E M NUD T CDH A I F M M E G F A L DH B PX M P A G AA C A T MM P C r f M S H N L R P M I PEP M S R B F A D S P T P N G S RC L N T N F S F GG T N I PS N AP P O P F E CH T M E M PA R P A T I C M R P L37 S L C A M Y O CD T D SA C D CCDC F H S D S L F G F BP F G F R S M I M PA I C S T I MM EB N A BP M R PS A L G S L C A C L N M A N S C N L N M R P L27 H M BS O S G I N O R M D L2 S L C A G S TZ MM AB M R PS S L C A EPS O C I A D PPA H S D B T HN S L1 T C T AP D G F D T M S F HR AS L S H B Q K L F T B F O XA Z N F N E DD U T P P I G AE G R F A HDN M R BP F T RN P T U BB B G N T U AP S NC G M A D BPP R SS A T P B S L C A CN T N AP P I T P N M T V P I G W EX O C G A L S L M E D A DC Y C O P Z P D E DC S TF C A CN A D F A M B A549-ACE2 with SARS-CoV-2 vs. A549-ACE2Older vs. younger l og2 f o l d c hange Supplementary Fig. S3: (a) Heatmap of log -fold changes of differentially expressed genes shared bySARS-CoV-2 and aging with gene names. (b) 2D histogram of the number of genes having a certainrank in aging and SARS-CoV-2 datasets. 25
549 Control Batch 1A549 Control Batch 2 a b
UMAP 1 U M AP PC 1 P C Supplementary Fig. S4: (a) UMAP of control and perturbations across all cell types in CMap. Theeffect of a perturbation on a given cell type is small relative to the differences between cell types. (b)Principal component analysis highlighting batch effects for the control samples of the A549 cell linefrom CMap. K-means clustering by gene expression vector is used to identify and remove batch effects(represented as red and blue clusters). 26 c Num.
Hidden
Units Num.
Hidden
Layers Nonlinearity Optimizer, LR Initialization Seed Used Training Loss Test Loss1024 1 Leaky ReLU Adam, 1e-4 PyTorchDefault 17 7.3 x 10^-7 1.1 x 10^-6100 1 Leaky ReLU Adam, 1e-4 PyTorchDefault 17 2.8 x 10^-3 2.8 x 10^-3
Default
17 6.4 x 10^-6 6.5 x 10^-6Over-parameterized Autoencoder Under-parameterized Autoencoder b Supplementary Fig. S5: Overview of autoencoder architectures, optimization methods and hyperpa-rameter settings considered. (a) Diagram representing an overparameterized autoencoder. While thisautoencoder is capable of learning the identity function, training leads to a solution that better alignsdrug signatures across cell types in the latent space. (b) Diagram representing an underparameterizedautoencoder. While this architecture is most commonly used in practice, it does not align drug signaturesas well in the latent space as its overparameterized counterpart; see Supplementary Fig. S7. (c) Detailson the width, depth, nonlinearity, optimization method, learning rate, random seed, training loss andtest loss for all architectures considered in this work.27 b Under-parameterized Autoencoder Over-parameterized AutoencoderPCA (100 PCs)PCA (2 PCs) c d
Supplementary Fig. S6: Receiver operating characteristic (ROC) curves for the agreement in classi-fication between gene expression vectors and reconstructed gene expression vectors obtained using anembedding given by the first 2 principle components in (a), the first 100 principle components in (b), anunderparameterized autoencoder in (c), and an overparameterized autoencoder in (d). While a logisticregression model trained to classify between 831 A549 control samples and 32893 A549 perturbationsamples shows differences in predictions on original gene expression vectors versus underparameterizedautoencoder reconstructions and reconstructions from the top 2 or 100 principal component, the overpa-rameterized embedding allows near perfect reconstruction of the original gene expression vectors with nodifference in predictions between using overparameterized embeddings for gene expression vectors andoriginal gene expression vectors. 28 riginal Space P C E m bedd i ng 100 P C E m bedd i ng 100 D i m Lea ky R eL U A u t oen c ode r E m bedd i ng 1024 D i m Lea ky R eL U A u t oen c ode r E m bedd i ng MCF7, A549 Drug Signature Correlation D i m C o s I D A u t oen c ode r E m bedd i ng HCC515, A549 Drug Signature Correlation P C E m bedd i ng a b Original Space
Original Space Original Space Original Space P C E m bedd i ng Original Space D i m Lea ky R eL U A u t oen c ode r E m bedd i ng 1024 D i m C o s I D A u t oen c ode r E m bedd i ng 1024 D i m Lea ky R eL U A u t oen c ode r E m bedd i ng Original Space Original Space Original Space Original Space
Supplementary Fig. S7: Comparison of drug signature alignment between A549 and MCF7 (top) andA549 and HCC515 (bottom) cell types upon using an embedding verus the original space. Embeddingsprovided include (from left to right) top 2 PCs, top 100 PCs, underparameterized leaky ReLU autoen-coder, overparameterized cosid autoencoder, overparameterized leaky ReLU autoencoder. Embeddingsfrom the overparameterized autoencoder with leaky ReLU activation better align drug signatures be-tween these two pairs of cell types than any other embedding considered while still providing near perfectreconstruction of the original data. 29 in = 0.98max = 6.22mean = 1.46sd = 0.56
Supp_figure_4_1 ab gene prize log2FC virus log2FC age OASL 6.22 6.22 0.50SUMO4 3.64 3.64 0.48GRHL1 3.01 3.01 0.51FOXC2 3.00 3.00 0.81XAF1 2.84 2.84 0.39IL20RB 2.62 2.62 0.53CREB5 2.59 2.59 0.53PLSCR1 2.36 2.36 0.36CHIC2 2.22 2.22 0.49HIVEP2 2.21 2.21 0.69SNAP25 2.18 2.18 0.82C19ORF66 2.01 2.01 0.46GNRH1 1.99 1.99 0.99TRIM38 1.96 1.96 0.37HIST3H2BB 1.91 1.91 0.42PDE4B 1.90 1.90 0.37CRY1 1.87 1.87 0.46INHBA 1.85 1.85 0.45ZNF8 1.80 1.80 0.36SP100 1.79 1.79 0.44N4BP3 1.77 1.77 1.04ELL 1.74 1.74 0.51RAPGEF4 1.73 1.73 0.67XRN1 1.72 1.72 0.38HIST1H1E 1.71 1.71 0.56WDR26 1.66 1.66 0.38ZFC3H1 1.66 1.66 0.55KAT6A 1.60 1.60 0.41CEP85L 1.57 1.57 0.39YY1AP1 1.52 1.52 0.48ZNF217 1.51 1.51 0.59OVGP1 1.48 1.48 0.77SETD5 1.48 1.48 0.46ARL4D 1.47 1.47 0.57IRF2 1.45 1.45 0.47SMAD3 1.45 1.45 0.67HIST1H1D 1.44 1.44 0.84TSPYL4 1.44 1.44 0.40MKLN1 1.43 1.43 0.35HAS2 1.40 1.40 0.82RBM33 1.40 1.40 0.41ISG20 1.36 1.36 0.45USP12 1.35 1.35 0.42LOX 1.34 1.34 0.78PHF21A 1.33 1.33 0.48RIPK2 1.33 1.33 0.47GTPBP1 1.30 1.30 0.46PML 1.30 1.30 0.67RASA2 1.30 1.30 0.37THAP2 1.28 1.28 0.44PHC3 1.26 1.26 0.36BRSK1 1.25 1.25 0.42DLL1 1.23 1.23 0.57ETV6 1.23 1.23 0.62TERF2IP 1.22 1.22 0.38AURKC 1.21 1.21 0.45SPRED3 1.21 1.21 0.66GATAD2B 1.18 1.18 0.52MTMR11 1.18 1.18 0.60ZSWIM6 1.18 1.18 0.66PROX1 1.16 1.16 0.47CTTNBP2NL 1.15 1.15 0.41CCDC71L 1.14 1.14 0.46SAV1 1.13 1.13 0.36CDK17 1.10 1.10 0.53RBM5 1.09 1.09 0.38BTN2A2 1.08 1.08 0.47TSC22D2 1.07 1.07 0.46LARP6 1.06 1.06 0.79RNF24 1.06 1.06 0.37SEC31B 1.05 1.05 0.36CHN1 1.04 1.04 0.46HOXB3 1.01 1.01 0.55SCG2 1.01 1.01 0.58TRAF6 1.00 1.00 0.46 gene prize log2FC virus log2FC age
SLC25A11 2.00 -2.00 -0.46HADH 1.77 -1.77 -0.52PYCR1 1.74 -1.74 -0.63TM7SF2 1.74 -1.74 -0.53EXOSC5 1.71 -1.71 -0.47MSRB1 1.70 -1.70 -0.37MSH2 1.65 -1.65 -0.39NSDHL 1.63 -1.63 -0.56SMPDL3B 1.60 -1.60 -0.77AGA 1.59 -1.59 -0.40FECH 1.59 -1.59 -0.36GOT1 1.58 -1.58 -0.71GBA 1.57 -1.57 -0.44GSR 1.57 -1.57 -0.38FOXRED2 1.56 -1.56 -0.49GLB1 1.54 -1.54 -0.44TRUB2 1.54 -1.54 -0.54DCDC2 1.53 -1.53 -0.46FADS1 1.53 -1.53 -0.40IPO4 1.53 -1.53 -0.54DPP3 1.52 -1.52 -0.53MPDU1 1.51 -1.51 -0.49MIPEP 1.49 -1.49 -0.41ALDH1B1 1.48 -1.48 -0.44QDPR 1.48 -1.48 -0.49NUDT14 1.46 -1.46 -0.45GPD1L 1.44 -1.44 -0.55SRPRB 1.44 -1.44 -0.47AIFM1 1.42 -1.42 -0.45TLCD1 1.41 -1.41 -0.83APEH 1.40 -1.40 -0.48PCCB 1.39 -1.39 -0.51SLC39A11 1.39 -1.39 -0.38DOLK 1.37 -1.37 -0.53FGFBP1 1.37 -1.37 -0.37PLS1 1.37 -1.37 -0.68EBNA1BP2 1.34 -1.34 -0.36TNFSF15 1.34 -1.34 -0.44POP1 1.33 -1.33 -0.43FARSA 1.32 -1.32 -0.52ALG1 1.30 -1.30 -0.37ATIC 1.30 -1.30 -0.42MRPL27 1.29 -1.29 -0.36SAC3D1 1.28 -1.28 -0.41METTL13 1.27 -1.27 -0.51PARP1 1.27 -1.27 -0.44SLC7A7 1.26 -1.26 -0.36MMP15 1.25 -1.25 -0.73TIMM13 1.25 -1.25 -0.40CCDC71 1.24 -1.24 -0.42FAH 1.23 -1.23 -0.65MRPL37 1.23 -1.23 -0.44MMP24 1.21 -1.21 -0.50NLN 1.21 -1.21 -0.37ACAT1 1.20 -1.20 -0.51ADK 1.19 -1.19 -0.57TMEM164 1.18 -1.18 -0.81CYB561D2 1.17 -1.17 -0.58DTD2 1.17 -1.17 -0.45FH 1.17 -1.17 -0.45GSTZ1 1.14 -1.14 -0.39NIPSNAP1 1.14 -1.14 -0.50ORMDL2 1.14 -1.14 -0.40OCIAD2 1.13 -1.13 -0.37CLN3 1.12 -1.12 -0.43MRPS16 1.12 -1.12 -0.43GGT1 1.11 -1.11 -0.51PAICS 1.11 -1.11 -0.45SLC27A4 1.10 -1.10 -0.44OSGIN1 1.09 -1.09 -0.42SDSL 1.09 -1.09 -0.47MYO5C 1.08 -1.08 -0.48MMAB 1.07 -1.07 -0.41NLRP2 1.05 -1.05 -0.63PTPN13 1.05 -1.05 -0.58CDH1 1.02 -1.02 -1.01CLN6 1.02 -1.02 -0.62HMBS 1.02 -1.02 -0.45SLC31A1 1.02 -1.02 -0.43PPA2 1.01 -1.01 -0.40THNSL1 1.01 -1.01 -0.37FGFR3 1.00 -1.00 -0.50HSD17B8 0.99 -0.99 -0.40PXMP4 0.99 -0.99 -1.08TCTA 0.99 -0.99 -0.36EPS8L2 0.98 -0.98 -0.44MRPS33 0.98 -0.98 -0.45
75 upregulated terminals 87 downregulated terminals
Supplementary Fig. S8: Terminal node selection for prize-collecting Steiner forest analysis. Terminalgenes include 162 genes present in the IREF interactome that are either upregulated in both SARS-Cov-2infection and aging or downregulated in both SARS-Cov-2 infection and aging. Each terminal gene isprized with its absolute log -fold change between SARS-Cov-2 infected A549-ACE2 cells and normalA549-ACE2 cells. (a) Histogram of prizes for terminal genes along with descriptive statistics. (b) Tableof 75 terminal genes upregulated in both SARS-Cov-2 infection and aging (left) and table of 87 terminalgenes downregulated in both SARS-Cov-2 infection and aging, along with prize and log fold changeinformation. 30upplementary Fig. S9: Parameter selection via sensitivity analysis for prize-collecting Steiner forestanalysis. (a1) Boxplot of penalized edge costs in the IREF interactome for different values of g . Thedistribution of penalized edge costs are very similar for g = −∞ and g = 0. For these values of g , themaximum penalized edge cost is upper bounded by 1. (a2) Histogram of shortest path cost betweenany two terminals in the IREF interactome for g = 0, along with descriptive statistics. (b) Range ofparameters g , w and b used in sensitivity analysis. Red values indicate a stable range for the interactomeobtained with the prize-collecting Steiner forest algorithm. We retain g = 0, w = 1 . b = 40for our subsequent analysis. (c1-3) Heatmaps of the matrix M indexed for different types of selectednodes: all nodes (c1), terminal nodes (c2) and SARS-Cov-2 interaction partners (c3). Each row/columncorresponds to a prize-collecting Steiner forest obtained from a given set of parameters ( g = 0 , w, b ). Astability region for the prize-collection Steiner forest solution appears for g = 0, w ≥ . b ∈ [5 , g = 0, w = 1 . b = 40) using the terminal gene list from Supplementary Fig. S8. Theinteractome contains 1,003 edges between 252 genes, five of which are known SARS-Cov-2 interactionpartners (EXOSC5, FOXRED2, LOX, RBX1, RIPK1). Genes in the interactome are grouped by generalprocess. 32 upp_figure_4_4 ab c M ij = |{ all nodes in network i } \ { all nodes in network j }||{ all nodes in network i }| M ij = SARS-Cov-2 partnersin network i \ n SARS-Cov-2 partnersin network j o SARS-Cov-2 partnersin network i Supplementary Fig. S11: Selection of the prize p for non-terminal SARS-Cov-2 interaction partners (allbut EXOSC5, FOXRED2 and LOX) via sensitivity analysis. (a) Number of SARS-Cov-2 interactionpartners collected in the interactome obtained from the prize-collecting Steiner forest algorithm fordifferent values of p ranging from 0 to 0 .
02. For p > .
02, all known SARS-Cov-2 interaction partnerspresent in the IREF network are collected in the final interactome. A stability region appears for p ∈ [4 · − , − ] with 7 SARS-Cov-2 interaction partners collected. (b-c) Heatmaps of the matrix M indexed for different types of selected nodes: all nodes (b), and SARS-Cov-2 interaction partners (c).Each row/column corresponds to a prize-collecting Steiner forest obtained from a given set of parameters( g = 0 , w = 1 . , b = 40 , p ). A stability region for the prize-collection Steiner forest solution appears for g = 0, w = 1 . b = 40 and p ∈ [7 · − , − ]. We retain g = 0, w = 1 . b = 40 and p = 8 · − for our subsequent analysis. 33upplementary Fig. S12: Interactome obtained from the prize-collecting Steiner forest algorithm (withparameters g = 0, w = 1 . b = 40) using the terminal gene list from Supplementary Fig. S8 augmentedwith all other SARS-Cov-2 interaction partners prized with p = 8 · − . The interactome contains1,090 edges between 254 genes, seven of which being known SARS-Cov-2 interaction partners (EXOSC5,FOXRED2, LOX, RBX1, RIPK1, CUL2, HDAC2). Genes in the interactome are grouped by generalfunction. 34upplementary Fig. S13: 2-Nearest-Neighborhoods of nodes of interest (denoted by a red hexagon) inthe interactome of Supplementary Fig. S12 (parameters g = 0, w = 1 . b = 40, p = 8 · − ). A thresholdwas applied on the edge confidence to improve legibility. Proteins known to interact with SARS-Cov-2are denoted as blue squares, drug targets are denoted as green diamonds, terminal nodes are coloredaccording to log -fold change in SARS-Cov-2-infected A549-ACE2 cells versus normal A549-ACE2 cells,Steiner nodes appear in grey. 35upplementary Fig. S14: Drug target discovery via prize-collecting Steiner forest analysis to identifyputative molecular pathways linking differentially expressed genes in SARS-Cov-2 infection without tak-ing into account age-related differential expression. (a) The general procedure to obtain the interactomeis identical to the one described in Fig. 4a, with a different terminal gene list. (a) Terminal nodes andhistogram of prize distribution. We consider 169 terminal nodes corresponding to genes differentiallyexpressed in SARS-CoV-2 infection after removing the effect of the ACE2 receptor. Only 11 of these 169genes belong to the terminal list used in Fig. 4. The prize of a terminal node equals the absolute value ofits log -fold change in SARS-Cov-2-infected A549-ACE2 cells versus normal A549-ACE2 cells based ondata from [23]. (b) Sensitivity analysis to choose the parameters w and b for the prize-collecting Steinerforest algorithm. We select g = 0, w = 1 . b = 40 corresponding to a robust solution for moderatechanges in the parameters. (c) Interactome obtained using the prize-collecting Steiner forest algorithm.Genes are grouped by general function and marked with a cross if known to interact with SARS-Cov-2proteins based on data from [6]. (d) 2-Nearest-Neighborhoods of nodes of interest (denoted by a redhexagon) in the interactome. A threshold was applied on the edge confidence to improve legibility. Pro-teins known to interact with SARS-Cov-2 are denoted as blue squares, drug targets are denoted as greendiamonds, terminal nodes are colored according to log -fold change in SARS-Cov-2-infected A549-ACE2cells versus normal A549-ACE2 cells, Steiner nodes appear in grey. (e) Table of drug targets in theinteractome with the corresponding drugs. Selected drugs are FDA approved, high affinity (at least oneof the activity constants K i , K d , IC
50 or EC
50 is below 10 µM ), and match the SARS-Cov-2 signaturewell (correlation > . − log (activity). Protein name corresponding toeach gene is included. 36 a b Supplementary Fig. S15: (a) Matrix Q of estimated effects of interventions (columns) on measured genes(rows) in A549 cells from CMap gene knockout and overexpression data with Q ij = 1 representing thatperturbing gene j effects gene i and hence that gene i is downstream of gene j . (b) ROC curve evaluatingcausal structure discovery methods GSP, PC and GES for predicting the effects of interventions in A549cells. The performance of each algorithm is measured by sampling random causal graphs and measuringnumber of true positives and false positives. GSP performs significantly above random guessing with p -value of 0.0177, while PC achieves p -value of 0.0694 and GES a p -value of 0.5867. The grey linerepresents a random guessing baseline (not used for computation of p -value) based on the number ofground truth positives and negatives, calculated from Q and scaled to extend from (0 ,
0) to span theentirety of the plot. 37
RBM33RIPK1RIPK2TRIM38CD40 CBLUBE2D2BTRC CUL1 UBCHIVEP2IRAK1CRY1 RPL23ACAT1MYH9EBNA1BP2 PSMA3QDPR FARSA PSMD3 PARP1HIF1AUBE2N SMAD9SHFM1SUMO1 ATREGFR ATIC MLH1IRF2 RAD21 MYCHECW2HSP90AA1 PLSCR1MSH2UIMC1SUMO2 METTL13TP53 RAC1BTN2A2 TERF2IPVHLPCCB DAXXRANBP9 MRPL27 CTNNB1FAHSLC3A2 XRN1MIPEPDHX15 SMAD3ALDH1B1SEC31BUBE2D3 ZSWIM6 PRMT5 FGFR1TERF2 RNF2 MSH6PHC3 DPP3 PPA2HIST1H1EPYCR1 SP100SRPRBMRPS16YY1AP1 ACVR2AHMBS SLC7A7 ZNF8 EP300 AIFM1CLN6 KDM1A FGF2FGFR3FBXO6AGA HDAC1 MBD3GATAD2B SMAD4 CREBBPREL ZNF217 PHF21AMDM2TCTA HOXB3CDH1 EXOSC5NUDT14STX1AFH VAMP2SNF8RELAKAT6A SAV1XIAPXAF1 CCSWDR48USP12 DCDC2MRFAP1PEX19 SUGT1NLRP2TRAF6 APPNLN LARP6 SNAP25GTPBP1PAK1 IPO4 OCIAD2BRSK1 COMMD3-BMI1SLC25A11FN1 PAICSPOP1 VKORC1RBX1 PMLTCF4 NIPSNAP1NFKBIA CHIC2NCK2 ITSN2STK3 b RBM33SHFM1 MMABRIPK1UBE2N RPL23HSP90AA1HIF1A CDH1SUGT1PRMT5 ATRNLRP2PSMA3 FGFR1MYH9 KAT5 RIPK2IRF2 NIPSNAP1PSMD3RAC1 CD40BTRCUBE2D2UBE2D3 GBAMIPEP FBXO6ACAT1ALG1ALDH1B1 CUL1UBC FECHGLB1RAD21CRY1QDPR FARSANAMPT FAHXAF1 TSC22D2 OASLRBX1 VKORC1SRPRB MRPS16 PXMP4OSGIN1YY1AP1 GSTZ1 RASSF1DHX15 NLNIPO4 POP1 ATICSLC3A2 CDK17 PDE4B ZFC3H1WDR26 HIVEP2 MRPL37 ADKMYC MYO5C GOT1 PAICSHMBS TM7SF2 ISG20PPA2 TLCD1 SLC25A11TIMM13 ACVR2A GTPBP1 HADHELL ETV6 RNF24MRFAP1PAK1 MRPS33CHIC2 TRUB2 RELA PYCR1 VAMP2SP100 STK3FADS1 MKLN1 BTN2A2 FHGGT1 SAV1RBM5 SLC7A7RANBP9 KDM1AHDAC1VHL MBD3 IRAK1 CLN3RELCTTNBP2NLZNF217SMAD9 GATAD2BPHF21A NCK2TRAF6 TRAF2 FADDEP300TERF2TERF2IPMTMR11 DPP3CTNNB1CBLSMAD3SOD1 C19ORF66TCF4CREB5 CYB561D2XIAP CCS PHC3RNF2HECW2WDR48UIMC1 APPTRIM38PLSCR1 MDM2OCIAD2TP53MSH6 SUMO1 PARP1 SUMO4SMAD4 GSR SUMO2CREBBP METTL13KAT6A MSRB1SMPDL3BCCDC8AIFM1 PEX19FN1 EGFRITSN2 DAXXPML MLH1STX1ANFKBIA PTPN13 MSH2 c NFKB1RIPK1 PPP4CNCOR1NFKBIZ TNFRSF1ATP53 DHX30 PMLPER2RBM39 NFKBIAARTNK2 HDAC1 RBX1 PLSCR1ATR RELARELHSH2D JUNABL2CRKNOTCH2NL WDR5HES1 EP300HUS1 BRCA1 SNIP1 PPARGBTRC OGTCUL1 UBCHIVEP2SKP2 THAP1RNMT PRMT5 CBX5MORF4L1AHCY XIAP CCNT1SUMO2 PSEN1 PRPF8HCFC1 CREB3CCNL1HSP90AA1 EEF1G IFNAR1CDK11APAFAH1B1 CBL PTPN11CDK2CDKN1B NUP62 CHD2 XAF1UBE2D3ETFA PCF11 TANKMBIPCOPG1 GNB1 IRF1PSMB5 STAT2NDEL1 LIFRDDX58EIF1B RB1NFE2L2CD200 RSRC2JAK2MED1 PMS2CRTC2CDK9 TRIM28UBQLN1 HBP1GPBP1PNMA1EPC1RPA1RPA2 PARP9 APP MARCKSPPP1R10PGRMC1 SERTAD3ETFBRBPJ SNAP25 NEDD8YWHAH CYLD CHIC2 CDKN1APSMD3ZBTB43 SUMO3 ZNF12 USPL1PTPN6SGTBLAMP3 STX1AVAMP2CASP8HIF1A GRB2
Supplementary Fig. S16: (a) Causal network corresponding to A549 cells. (b) Causal network corre-sponding to AT2 cells. (c) Causal network corresponding to A549 cells learned using PPI interactomeobtained without considering age-associated genes as a prior. All non-singleton nodes are shown, genetargets of drugs selected via our computational drug repurposing pipeline are in boxes and the nodecolor corresponds to the log -fold change of A549-ACE2 with versus without SARS-CoV-2.38 upplementary Tables Drug name % differentially expressednodes downstream (A549) % nodes downstream(A549 no age) % nodes downstream(AT2)afatinib 98.51 0.00 83.93axitinib 98.51 0.85 83.93bosutinib 98.51 0.00 83.93dasatinib 98.51 0.00 83.33erlotinib 98.51 0.00 83.33imatinib 98.51 0.00 83.93pazopanib 98.51 0.85 83.93ruxolitinib 98.51 0.00 83.33sorafenib 97.01 0.00 0.60sunitinib 98.51 0.85 83.93tofacitinib 1.49 0.00 0.00belinostat 98.51 94.92 83.33vorinostat 98.51 94.92 83.33formoterol 98.51 94.92 83.33primaquine 98.51 94.92 83.33vardenafil 0.00 0.00 0.00milrinone 0.00 0.00 0.00docetaxel 98.51 0.00 83.33Table S1: Percentage of nodes in the largest connected component of the corresponding causal graphthat are targeted by each drug. For A549 cells, only genes that are associated with SARS-CoV-2 andaging are considered. 39 upplementary Datasetsupplementary Datasets