[PDF] Identifying Human Interactors of SARS-CoV-2 Proteins and Drug Targets for COVID-19 using Network-Based Label Propagation

Abstract

Motivated by the critical need to identify new treatments for COVID-19, we present a genome-scale, systems-level computational approach to prioritize drug targets based on their potential to regulate host-virus interactions or their downstream signaling targets. We adapt and specialize network label propagation methods to this end. We demonstrate that these techniques can predict human-SARS-CoV-2 protein interactors with high accuracy. The top-ranked proteins that we identify are enriched in host biological processes that are potentially coopted by the virus. We present cases where our methodology generates promising insights such as the potential role of HSPA5 in viral entry. We highlight the connection between endoplasmic reticulum stress, HSPA5, and anti-clotting agents. We identify tubulin proteins involved in ciliary assembly that are targeted by anti-mitotic drugs. Drugs that we discuss are already undergoing clinical trials to test their efficacy against COVID-19. Our prioritized list of human proteins and drug targets is available as a general resource for biological and clinical researchers who are repositioning existing and approved drugs or developing novel therapeutics as anti-COVID-19 agents.

Full PDF

IIdentifying Human Interactors of SARS-CoV-2Proteins and Drug Targets for COVID-19 usingNetwork-Based Label Propagation

Jeﬀrey N. Law , Kyle Akers , Nure Tasnina , Catherine M. DellaSantina , Meghana Kshirsagar , Judith Klein-Seetharaman , MarkCrovella , Padmavathy Rajagopalan , Simon Kasif , and T. M. Murali Interdisciplinary Ph.D. Program in Genetics, Bioinformatics, andComputational Biology, Blacksburg, VA, USA Department of Computer Science, Virginia Tech, Blacksburg, VA, USA Department of Biomedical Engineering, Boston University, Boston, MA,USA AI for Good Lab, Microsoft, Redmond, WA, USA Department of Chemistry, Colorado School of Mines, Golden, CO USA Department of Computer Science, Boston University, Boston, MA, USA Department of Chemical Engineering, Virginia Tech, Blacksburg, VA,USA * Corresponding author. Email: [email protected]

Motivated by the critical need to identify new treatments for COVID-19, we present a genome-scale, systems-level computational approachto prioritize drug targets based on their potential to regulate host-virus interactions or their downstream signaling targets. We adaptand specialize network label propagation methods to this end. Wedemonstrate that these techniques can predict human-SARS-CoV-2 protein interactors with high accuracy. The top-ranked proteins a r X i v : . [ q - b i o . M N ] J un hat we identify are enriched in host biological processes that are po-tentially coopted by the virus. We present cases where our method-ology generates promising insights such as the potential role ofHSPA5 in viral entry. We highlight the connection between en-doplasmic reticulum stress, HSPA5, and anti-clotting agents. Weidentify tubulin proteins involved in ciliary assembly that are tar-geted by anti-mitotic drugs. Drugs that we discuss are already un-dergoing clinical trials to test their eﬃcacy against COVID-19. Ourprioritized list of human proteins and drug targets is available asa general resource for biological and clinical researchers who arerepositioning existing and approved drugs or developing novel ther-apeutics as anti-COVID-19 agents. Introduction

The COVID-19 pandemic has created many clinical, economic, and societal challengesworld-wide. It has galvanized scientists to develop vaccines and drugs for the disease ( ).Existing antiviral agents such as remdesivir to treat COVID-19 are already in clinicaltrials ( ). These drugs interfere with diﬀerent aspects of the viral life cycle, includingfusion with the cell membrane, proteolysis, translation, and RNA replication ( ). Since avirus must necessarily co-opt host cellular processes in order to replicate, an alternativeattractive approach is to develop or repurpose drugs that target human proteins that thevirus requires. To this end, a global, whole-genome view of host-pathogen interactionsis likely to be valuable ( ), especially in the case of SARS-CoV-2, given the size of itsgenome and the complexity of the observed clinical and epidemiological manifestations ofCOVID-19 ( ). 2 TRING Network26 SARS-CoV-2 proteins333 Human interactors Supervised Learning N e t w o r k p r opaga t i on sc o r e H i ghLo w Network Propagation Functional Enrichment,Drug TargetsDi ﬀ usion AnalysisValidationDrug-Target Network(a) (b) (c) Figure 1: Our analysis framework comprising of (a) data integration: experimentallydetermined host-pathogen network, genome-scale functional linkage network, functionontologies, and drug-target databases; (b) computational analysis: network propagation,validation and statistical analysis, and network diﬀusion tracking; and (c) functionalenrichment and literature-based examination of promising protein targets and drugs.3n this work, we present a strategy based on network-based functional label predictionto prioritize human proteins as drug targets and to highlight existing, approved drugs asanti-COVID-19 agents (Figure 1). We base our computational approach on the hypothe-sis that human proteins belonging to complexes or signaling pathways that are proximalto human interactors of viral proteins are potentially good targets for inhibition. Accord-ingly, we take advantage of a recently published dataset of human proteins that physicallyinteract with SARS-CoV-2 ( ). These SARS-CoV-2 interactors are entry points to hostcellular processes that may be hijacked by viral infection.Although these viral interactors are crucial starting points, we note that the proteomicspipeline used to discover them ( ) may not capture in vivo conditions and tissue-speciﬁcinteractions, leading to false negatives. Therefore, to prioritize additional human pro-teins, we formulate identiﬁcation of downstream interactors of SARS-CoV-2 as a networklabeling problem capable of tolerating uncertainty in its inputs. Speciﬁcally, given theknown human protein interactors of SARS-CoV-2 proteins and a whole-genome proteininteraction network, we use network propagation algorithms to predict other potentialSARS-CoV-2 interactors. We ﬁrst demonstrate with that this approach predicts knownSARS-CoV-2 interactors with high accuracy. We further analyze highly-ranking proteinscomputed by these methods to identify statistically-enriched biological cellular processesand pathways that may be impacted by SARS-CoV-2. Additionally, we integrate drug-protein interactions into this framework to propose drugs that may be repositioned totreat COVID-19. We present several case studies that illustrate how drugs that targethighly-ranked proteins may inhibit the virus in diﬀerent stages of its life cycle. Some ofthese drugs are already in clinical trials for COVID-19. We emphasize that the drugs dis-cussed in this work are in silico predictions that require further experimental and clinicalvalidation before they can be used as treatments for COVID-19.4 esults We took inspiration from the success of network propagation in diverse applications insystems biology ( ). Our underlying hypothesis was that network propagation was a rea-sonable mechanism for predicting SARS-CoV-2 interactors. To test this hypothesis, weconducted the following validation. For each interactor of a SARS-CoV-2 protein ( ), wecomputed the propagation score (also called “diﬀusion score” or “heat” in the literature)treating all the other interactors of SARS-CoV-2 as sources. To this end, we appliedthe Regularized Laplacian (RL) ( ), a widely-used technique for semi-supervised learning(“Methods”). These scores were signiﬁcantly larger than the scores of randomly-selectednodes (Figure 2(a); p -value of 2 . × − for the Kolmogorov-Smirnov test, Table S1).This trend held true for other viruses as well, e.g., SARS-CoV, HIV-1, and HSV-1 (Fig-ure 2(a); p -values of 3 . × − , 2 . × − , and 1 . × − , respectively), indicatingthat network proximity may be a general property of human proteins that bind to a virus. Prioritization of Potential SARS-CoV-2 Interactors

These results encouraged us to prioritize putative human protein interactors of SARS-CoV-2. In addition to RL, we used GeneMania (GM) ( ), a variant of RL that has beenused for ﬁnding associations between GO terms and proteins, and SinkSource (SS) ( ), arelated approach previously used to prioritize human proteins that are dependency factorsfor HIV. We also used two oﬀ-the-shelf classiﬁers: a Support Vector Machine with a linearkernel (SVM), and Logistic Regression (LR). For these classiﬁers we used the adjacencyvector of each protein in the human interaction network as its feature vector. Finally, wetested Local, a method which sets each node’s score to be the weighted average of thescores of its neighbors.We evaluated the performance of these algorithms using 5-fold cross validation of the5 b)(c) RL GM SS SVM LogReg Local0.50.60.70.80.91.0 A U P R C RL GM SS SVM LogReg Local0.50.60.70.80.91.0 E a r l y P r e c i s i o n @ . R e c a ll RL GM SS SVM LogReg Local0.50.60.70.80.91.0 A U R O C RL Propagation Scores P r o b a b ili t y D e n s i t y SARS-CoV-2 InteractorsRandom Proteins 0.004 0.008 0.012

RL Propagation Scores P r o b a b ili t y D e n s i t y SARS-CoV InteractorsRandom Proteins 0.015 0.030 0.045 0.060

RL Propagation Scores P r o b a b ili t y D e n s i t y HIV-1 InteractorsRandom Proteins 0.005 0.010 0.015 0.020

RL Propagation Scores P r o b a b ili t y D e n s i t y HSV-1 InteractorsRandom Proteins (a) cilium assemblymicrotubule nucleationGolgi organizationrRNA processingtRNA export from nucleusviral transcriptionribosomal large subunit biogenesisNADH dehydrogenase complex assembly ribonucleoprotein complex subunit organization - l o g ( p v a l ) Transcription/TranslationOrganelleOrganization cotranslational protein targeting to membrane protein import into mitochondrial matrixubiquitin-dependent ERAD pathwayprotein folding in ERprotein foldingGPI anchor biosynthetic processRab protein signal transductionzinc ion transport mitochondrial ATP synthesis coupled proton transport R L S V M K r b i c _ b i c _ b i c _ b i c _ b i c _ b i c _ b i c _ b i c _ protein N-linked glycosylation via asparagine protein demannosylation RespirationER StressPTMsOther oligosaccharide-lipid intermediate biosynthetic process (d)

Figure 2: Network propagation results. (a) Distribution of network propagation scoresamong human interactors of a virus (solid curves) compared to distribution of scoresof random proteins (dotted curves). The four plots are for SARS-CoV-2, SARS-CoV,HIV-1, and HSV-1, respectively. Each curve is a Gaussian kernel density estimate of thedistribution. (b) Comparison of AUPRC, AUROC, and precision at 0.1 recall across sixalgorithms. The positive:negative ratio is one. The dashed line indicates the score fora random predictor. ‘LogReg’ is an abbreviation for logistic regression. (c) Heat mapdisplaying contribution of each SARS-CoV-2 interactor to top-ranking proteins from RL.Ellipses highlight manually selected biclusters. Each bicluster includes a set of SARS-CoV-2 interactors that contribute high scores to a set of top-ranking proteins. We usedMorpheus https://software.broadinstitute.org/morpheus, to generate this ﬁgure. (d) Heatmap summarizing GO biological process terms enriched in top ranking proteins fromRL and SVM, the biclusters highlighted in (c), and human interactors of SARS-CoV-2proteins (indicated as ‘Kr’). We manually grouped the terms into broader categoriesshown in bold text. The color of a group matches the colors in Figure 3. A gray cellindicates a p -value larger than 0.01. 6ositive examples (human interactors of SARS-CoV-2 proteins). We used three measuresof performance: the area under the receiver-operator characteristic curve (AUROC), thearea under the precision-recall curve (AUPRC), and the precision at a recall of 0.1 (earlyprecision). The third measure permitted us to estimate the accuracy of the methods forhigh-conﬁdence proteins, which are likely to be the basis for experimental validation.RL achieved a median AUROC of 0.76, a median AUPRC of 0.75, and an earlyprecision of 0.89, values which were approximately 1.5–1.8 times superior to those of arandom predictor. SVM and logistic regression achieved somewhat lower scores than RL.We obtained these results when we sampled as many negative examples as positive (see“Methods”). When we increased the sample size to ﬁve times or ten times the numberof positives, RL continued to achieve the highest measures of performance (Figure S1).Local was consistently the worst-performing method, conﬁrming the relative superiorityof network propagation for this problem.The strength of this cross-validation performance encouraged us to apply these meth-ods to the full set of positive examples to rank the remaining proteins in the STRINGnetwork. We then used a stratiﬁed sampling approach to estimate the statistical signiﬁ-cance of the resulting node scores (see “Methods”). For subsequent analyses, we selectedone network propagation method (RL) and one supervised classiﬁer (SVM). We choseRL because it performed the best in our evaluations, particularly with respect to earlyprecision (Figure 2(b)). We chose SVM since it also had very good performance in cross-validation. Moreover, the protein rankings computed by the two methods were fairlydissimilar (Spearman’s correlation of 0.4 and Jaccard index of 0.3 for the top-300 ranks;Section S2.1). We considered the top 332 predictions of RL and SVM that were statisti-cally signiﬁcant at p ≤ .

05 (Table S2), i.e., we selected as many top-ranking proteins asthe number of experimentally-determined human proteins that interact with SARS-CoV-27roteins ( ). We selected this many proteins reasoning that that each of the SARS-CoV-2interactors would themselves interact with or otherwise aﬀect at least one other humanprotein.We refer to these as “top-ranking proteins” below. We reasoned that due to theirlow overlap, using the two sets of proteins identiﬁed by RL and by SVM would resultin broader coverage of human cellular processes and provide additional insights into thepotential eﬀects of SARS-CoV-2 as compared to using the results of just one method. Enriched Biological Processes

We tested for enrichment of Gene Ontology (GO) biological processes (Benjamini-Hochbergcorrected p -value ≤ .

01) among the top-ranking proteins from RL and from SVM, aswell as in the interactors of SARS-CoV-2 (“Methods”). Since parent-child relationshipsin the GO cause many closely related terms to be enriched, we used a heuristic to selecta non-redundant set of terms that were enriched in at least one of these sets of proteins(“Methods” and Section S2.4). While some terms were common to all three sets of pro-teins, there were many that were enriched only in our predictions (Figure 2(d), Table S3),indicating that network propagation successfully identiﬁed speciﬁc cellular processes in-volving proteins proximal to, but not directly interacting with, viral proteins. We observedthat GO terms related to transcription and translation (second group in Figure 2(d)) wereenriched in the SVM’s top-ranking proteins but not in RL’s list.We sought to gain a deeper understanding of the trace of evidence linking a top-rankedprotein to an experimentally-determined SARS-CoV-2 interactor. To this end, we com-puted the contribution of each SARS-CoV-2 interactor to every top-ranking predictioncomputed by RL (“Methods”). Visual examination of this matrix (Figure 2(c), Table S4)revealed the presence of numerous biclusters (numbered ellipses in Figure 2(c)). Each8icluster contained a subset of SARS-CoV-2 interactors that made high contributions tothe scores of a subset of top-ranking proteins. The GO biological processes enriched inthe top-ranking proteins in each bicluster had a high correspondence with the terms en-riched in the full set of top-ranking proteins (Figure 2(d)). Thus, these biclusters provideuseful information on which human proteins the virus may use as “gateways” to targethost cellular processes. We further study these biclusters and the largest contributors toindividual top-ranking proteins in Section S2.3.Our top-ranking proteins were enriched in ﬁve broad categories of GO biological pro-cesses: organelle organization, transcription and translation, respiration, ER stress, andpost-translational modiﬁcations (Figure 2(d)). In addition, two processes did not belongto any of these categories. Figure 3 provides a high-level view of these GO terms andtheir connections in the STRING network to SARS-CoV-2 interactors and SARS-CoV-2proteins. In the rest of this section, we examine the relevance of some of these processesto the viral cell cycle. In subsequent sections, we consider other enriched processes fromthe viewpoint of identifying promising drug targets.One group of enriched terms are related to protein translation (green nodes in Figure 3)including “rRNA processing” ( p -value 3 . × − for SVM and 0.05 for SARS-CoV-2interactors), “ribonucleoprotein complex subunit organization” ( p -value 1 . × − forSVM and 0.48 for SARS-CoV-2 interactors), and “ribosomal large subunit biogenesis” ( p -value 1 . × − for SVM and 0.69 for SARS-CoV-2 interactors). RNA viruses appearto reduce protein synthesis in host cells, including in the case of SARS-CoV ( ). Arecent proteomics study that infected Caco-2 cells with SARS-CoV-2 ( ) revealed thatwhile global translation rates in host cells exhibited only minor changes, the levels ofa signiﬁcant number of human proteins involved in translation were highly positivelycorrelated with levels of key viral proteins. This study further showed that translation9 sp14nsp12 nsp4 nsp10 nsp6orf10nsp8 Figure 3: Overview of connections between SARS-CoV-2 interactors and the GO biologicalprocess terms enriched in the top-ranking proteins. Each circle is a GO term shown inFigure 2(d); the color of the GO term corresponds to its group in Figure 2(d) and itssize corresponds to the number of annotated top-ranking proteins. Each box with a redborder is a SARS-CoV-2 protein (the box labels). The box contains one or more humanproteins that interact with that viral protein. A viral protein may label more than onebox. A gray edge connects a red box to a GO term. The width of the edge is proportionalto the fraction of the human interactors of the viral protein that are connected in theSTRING network to the top-ranking proteins annotated to the term (considering onlySTRING edge weights ≥ < p -value3 . × − ) but not in SARS-CoV-2 interactors ( p -value 0 . ). It involves the attachment of an oligosaccharide toan asparagine residue of a protein ( ). The process takes place in two major stages.The ﬁrst step involves the biosynthesis of lipid-linked oligosaccharides ( ) correspondingto the GO term “oligosaccharide-lipid intermediate biosynthetic process” (Figure 4(a)),which is enriched in top-ranking RL proteins ( p -value 1 . × − ) but not in SARS-CoV-2 interactors ( p -value 0 . p -value 2 . × − ) but not in SARS-CoV-2 interactors ( p -value0.2).Several viruses can exploit the host glycosylation pathway to modify viral proteins inorder to enter host cells and evade the immune response ( ). An extensive glycan shieldthat covers the S protein in human coronavirus NL63 (HCoV-NL63) masks the protein11 g )(e) (f)(a) Top-ranking human proteinDrug Human interactor of SARS-CoV-2 SARS-CoV-2 proteinDrug under COVID-19 trialHuman PPI Human PPI withinteractor of SARS-CoV-2SARS-CoV-2 -Human PPIDrug target (b)(d)(c)

Figure 4: Networks of the top-ranking proteins for RL (green nodes) that are annotatedto the enriched terms (a) “oligosaccharide-lipid intermediate biological process” and “pro-tein N-linked glycosylation via asparagine,” (b) “GPI anchor biosynthetic process,” (c)“protein import into mitochondrial matrix,” (d) “protein demannosylation,” (e) “proteinfolding in ER,” (f) “cilium assembly,” or (g) “NADH dehydrogenase complex assembly.”The number below the name of a green protein is its rank in RL. A rectangle encompassesproteins (top-ranking or SARS-CoV-2 interactors) annotated to the respective term. Pro-teins discussed in the text are highlighted with a red border. For each top-ranking protein,we also display its connections with neighboring SARS-CoV-2 interactors. We removedSTRING edges with weight <

700 to simplify the visualization. In (d), we limited proteinsto those that are the target of an approved or investigational drug. In (f), we removeddrugs that promote clotting. 12urface, thus limiting access to neutralizing antibodies ( ). The receptor-binding andfusion-peptide domains of the S protein (fusion protein) in SARS-CoV-2, SARS-CoV, andMERS-CoV contain N-glycans ( ). It appears that the N-linked glycan modiﬁcationsof SARS and MERS CoV S proteins are extensive but do not form as eﬀective a shieldas in viruses such as HIV-1 ( ). Drugs that target host N-linked glycosylation pathwayshave been proposed as treatments for COVID-19 ( ).The term “(GPI)-anchor biosynthetic process” is signiﬁcantly enriched in the top-ranking proteins from RL ( p -value 4 × − ) but not in the human interactors of SARS-CoV-2 ( p -value 0.21). The proteins prioritized by our methods (Figure 4(b)) are eithercomponents of the Gycosylphosphatidylinositol (GPI) transamidase complex or transferGPI to proteins during the synthesis of GPI anchors; these anchors tether proteins tolipid bilayers ( ). GPI-anchored proteins are often associated with lipid rafts, whichare microdomains in plasma membranes that are enriched with cholesterol and sphin-golipids ( ). Lipid rafts play a major role in viral entry, assembly, replication, andbudding. They are known to be involved in the entry of SARS-CoV into host cells ( ).Thus, the proteins involved in (GPI)-anchor biosynthesis that we prioritize may providea deeper understanding on SARS-CoV-2 entry into host cells.Several GO terms related to cellular respiration (blue nodes in Figure 3) were enrichedin the top-ranking proteins, e.g., “protein import into mitochondrial matrix” (Figure 4(c); p -value of 1 . × − for RL and 0 . p -value of 2 . × − for RL in contrast toa p -value of 0.035 in SARS-CoV-2 interactors. NADH dehydrogenase is the ﬁrst complex13n the mitochondrial respiratory chain. Viral infections have been proposed to cause theWarburg eﬀect that is frequently associated with cancer ( ). Due to oxygen deprivation,cells switch from respiration to glycolysis. We note that a recent genome-wide CRISPRscreen for human genes that regulate SARS-CoV-2 infection ( ) identiﬁed several mem-bers of the respiratory chain as “anti-viral”, i.e., the knockout of the gene sensitizes thehost cell to virus-induced cell death. Thus, the GO terms enriched in our results maysuggest mechanisms by which SARS-CoV-2 modulates cellular respiration in the host topromote infection. The Role of Endoplasmic Reticulum Stress, HSPA5, and Anti-Clotting Drugs

We now describe the connection unearthed by our analysis between interactors of SARS-CoV-2, proteins involved in endoplasmic reticulum (ER) stress (orange nodes in Figure 3,and anti-clotting drugs . The GO biological process “protein demannosylation”, theremoval of a mannose group from a protein, is enriched in top-ranking RL proteins ( p -value 4 . × − ) but not in SARS-CoV-2 interactors ( p -value 0.08). Key top-rankingproteins in this network (Figure 4(e)) include MAN1B1 and UGGT1, which are localizedto the ER. Enveloped viruses have been shown to hijack the host cell’s ER for survivaland replication, thereby causing ER-associated degradation (ERAD) ( ). MAN1B1 playsa role in protein quality control as well as in their degradation ( ). The utilization ofUGGT1 by enteroviruses leads to increased viral replication and higher pathogenicity ( ).Viral protein Orf8 interacts with human OS9 (an ER degradation protein), whichin turn interacts with top-ranking protein EDEM1 (ER degradation-enhancing alpha-mannosidase-like protein 1). EDEM1 controls the degradation of mis-folded glycoproteinsand is a key player in ERAD ( ) When EDEM1 is cleared from the ER, it is transported14ut of the organelle in nanoscale vesicles that are called EDEMosmomes. Coronavirusesssuch as the mouse hepatitis virus subsequently hijack these organelles and remodel themto form double membrane vesicles (DMVs) ( ). DMVs are widely used by viruses,including coronaviruses, to promote viral replication ( ).The GO biological process “protein folding in endoplasmic reticulum” was also en-riched in the top-ranking proteins ( p -value 2 . × − for RL and 0.28 for interactorsof SARS-CoV-2). Viral protein nsp9 interacts with several human nucleoporins (NUP54,NUP62, NUP88 and NUP214) ( ), which in turn interact with heat shock protein HSPA5in the STRING network (Figure 4(f)). HSPA5, also referred to as glucose regulated pro-tein (GRP78) or immunoglobulin binding protein (BiP) in the literature, is evolutionarilyconserved from prokaryotes to humans ( ). It has a repertoire of functions associatedwith ER stress response. HSPA5 is usually localized in the ER. When the ER is stressed,HSPA5 can translocate to the cell surface, the nucleus and mitochondria (

31, 32 ). Onthe cell surface, HSPA5 plays a multi-functional role in cell proliferation, cell viability,apoptosis, and regulation of innate and adaptive immunity (

32, 33 ).HSPA5 has been proposed as a universal target for human diseases ( ). It has in-creasingly well-documented essential interactions and activities during viral infections. Inparticular, the role of HSPA5 in viral entry and pathogenesis has been widely investi-gated. As a cell surface protein, HSPA5 has been reported to play an important rolein viral entry (

35, 36 ). SARS-CoV-2 infection has been shown to lead to ER stress andthe up-regulation of HSPA5 (

35, 36 ). The S protein of SARS-CoV-2 can induce tran-scriptional activation of HSPA5 ( ). This protein can serve as a point of attachmentfor both MERS-COV and bat coronavirus (bCOV HKU9) ( ). Both Zika virus andJapanese encephalitis virus use HSPA5 to prevent apoptosis and to help in viral replica-tion ( ). A recent molecular docking study has predicted HSPA5 as a potential receptor15or the SARS-CoV S protein ( ). Based on its observed expression in vitro in airwayepithelial cells, it has been suggested that HSPA5 may serve as an additional receptorfor SARS-CoV-2 in these cells ( ). Based on our network-based analysis and supportin the literature, we hypothesize that HSPA5 may serve as a co-receptor, a point of viralattachment, or aid in viral entry of SARS-CoV-2.Blood hypercoagulability is reported to be common among COVID-19 patients ( ).We now present the linkage between the “protein folding in ER” network (Figure 4(f)) andcoagulation. We also suggest how anti-coagulant drugs may function within the context ofCOVID-19. SARS-CoV-2 protein Orf8 interacts with human hypoxia up-regulated protein1 (HYOU1). HYOU1 (also known as GRP170) plays a cytoprotective role in response tooxygen deprivation ( ). In addition to HSPA5, HYOU1 interacts with calnexin (CANX)and calreticulin (CALR), both of which are chaperone proteins found in the ER ( ).HSPA5, CANX and CALR act as protein chaperones for pro-coagulant proteins such asFactor V and Factor VIII. Once Factor VIII is secreted, it binds to another pro-coagulantprotein von Willebrand factor (vWF) to prevent degradation of clots ( ). AlthoughFactor V, Factor VIII, and vWF are not among the top-ranking proteins and thus do notappear in Figure 4(f), this network is suggestive of mechanisms that SARS-CoV-2 mayuse to cause abnormal blood coagulation.Anti-coagulant drugs that interact with HSPA5, CANX or CALR include Tenecteplase,a third generation plasminogen activating enzyme and the investigational drug Lan-oteplase, which is a serine protease that binds to ﬁbrin leading to the formation of plas-min ( ), an enzyme that breaks clots. Lanoteplase is a second-generation derivative ofalteplase, and a third generation derivative of recombinant plasminogen. It is notable thatthere is a clinical trial for Alteplase (ClinicalTrials.gov, NCT04357730) to test its eﬀective-ness in treating COVID-19. Calcium citrate, a known anti-coagulant agent also interacts16ith CANX and CALR ( ). Aspirin, also present in (Figure 4(f)), binds to and inhibitsthe ATPase activity of HSPA5 ( ). A clinical trial (ClinicalTrials.gov, NCT04363840)is testing whether early treatment of COVID-19 patients with aspirin and vitamin D caninhibit the production of blood clots and decrease rates of hospitalization. Cilium Assembly and Tubulin-Modulating Drugs

GO biological processes related to cilia were signiﬁcantly enriched in the top-rankingpredictions RL and SVM predictions (yellow nodes in Figure 4(g), e.g., “cilium assem-bly,” p -value 9 . × − for RL) but not in the human interactors of SARS-CoV-2( p -value 0 . γ -tubulins(TUBGCP2 and TUBGCP3), which interact with several α - and β -tubulins among thetop 332 predictions (Figure 4(g)). Microtubules are polymers that provide shape andstructure to eukaryotic cells and are necessary in cell transport and cell division, amongother functions ( ). α - and β -tubulins compose microtubule ﬁlaments, while γ -tubulinsconnect them to the microtubule organizing center.Viruses commonly utilize microtubules for cellular entry, intra-cellular traﬃcking, andexit from cells ( ). For instance, the S protein of human α -coronavirus interacts withtubulin α and β chains ( ), suggesting that tubulin may be involved in the transport andlocalization of the S protein and its assembly into virions ( ). Relevant to SARS-CoV-2,microtubules are the primary structural component of cilia, which line epithelial cells inthe respiratory tract and are responsible for the transport of mucus out of cells ( ). TheACE2 receptor that SARS-CoV-2 uses to enter cells appears to be expressed primarilyon the cilia of respiratory tract epithelial cells (

51, 52 ), further implicating microtubulesin viral infection. The combination of high expression levels of ACE2 and the presence of17ilia may also explain the detection of the virus in multiple organs ( ) and the deleteriouseﬀect of COVID-19 on the renal, gastroinstestinal, and olfactory systems ( ).Next, we turned our attention to drugs that targeted Tubulin proteins (Figure 4(g)).Our methods highlighted the drug Colchicine since it had the smallest p -value (0.001)among the approved or investigational drugs that target proteins annotated to “ciliaryassembly.” Colchicine is FDA-approved for treating gout. Its eﬀectiveness againstCOVID-19 is being tested in thirteen ongoing clinical trials (clinicaltrials.gov).The drugs that target proteins involved in ciliary assembly (Figure 4(g)) are mostlyanti-mitotic agents, which are also being investigated as anti-cancer therapeutics. Thesedrugs fall into three broad classes depending on how they aﬀect the dynamic equilibriumbetween unpolymerized and polymerized tubulin ( ). Colchicine, Podoﬁlox, Vinchris-tine, Vinﬂunine, and Vinblastine destabilize microtubules by binding to speciﬁc sitesof α - and β -tubulins, thereby preventing their assembly. Albendazole, Mebendazole, andOxibendazole, Cabazitaxel, Milataxel, and CYT99 (a synthetic drug) inhibit tubulin poly-merization thereby aﬀecting the assembly and dynamics of microtubules and slowing cellgrowth. Two other drugs in this network, Patupilone and Epithilone-D, also cause cell-cycle arrest but by a diﬀerent mechanism: they stabilize the tubulin network. In the caseof COVID-19, we hypothesize that drugs that destabilize the microtubule network mightprove to be more eﬃcacious since they may prevent viral proteins from using microtubulesfor intra-cellular traﬃcking. In addition, as in the case of Colchicine, they may act asanti-inﬂammatory agents and provide additional beneﬁts to COVID-19 patients. Discussion

Drug development is widely acknowledged to be one of the most challenging industrialprocesses requiring billions of dollars and high levels of diverse expertise to overcome pro-18ound scientiﬁc and logistic barriers. There remains a high failure rate in the steps that liebetween discovering a drug target and manufacturing a drug for market ( ). The COVID-19 pandemic and its medical and economic impact created a new and urgent challenge forresearch and pharmaceutical communities to develop a therapeutic response ( ). As onemanifestation of this community response, a recent proteomics eﬀort by a consortium ofscientists generated a ﬁrst-of-a-kind interactome associated with the SARS-CoV-2-humaninterface ( ). This interactome inspired a large-scale drug repositioning eﬀort targetingthe human proteins that directly interact with the virus.This important and timely advance signals a very promising direction in drug devel-opment for COVID-19. The set of human proteins reported to interact with SARS-CoV-2is likely to have both false positives and false negatives due to the properties of the pro-teomic screening pipeline used. Thus, we sought to further extend the results of thisstudy and signiﬁcantly expand the resources available to the drug development and repo-sitioning community through the use of state-of-the-art network prediction algorithms.We also aimed to demonstrate that the area of function prediction using network propa-gation has achieved signiﬁcant technological maturity since its introduction over 15 yearsago (

57, 58, 59, 60 ). In particular, network propagation-based approaches to implementlabel propagation have evolved to be relatively easy to implement and validate statisti-cally ( ), thereby making them amenable for translational research.In this work, we take full advantage of these mature predictive platforms to signiﬁ-cantly expand the resources available to the COVID-19 community by producing a largerset of putative SARS-CoV-2 interactors. Many of these proteins could serve as drug tar-gets and several are already under development or in clinical trials. Our methods predictinteractors with high accuracy. We expect this performance to be further improved bythe computational biology community building on our and other resources.19ur initial ﬁndings suggest existing as well as new drug targets. Several targets areassociated with ER stress, a well-documented initial response to viral infections withlethal or severe clinical outcomes. These ER stress response proteins include a number ofwell-studied heat shock proteins (e.g., HSPA9, rank 9, p -value 0.031 and HSPA5, rank 18, p -value 0.035 for RL, Table S1). We have discussed one of them, HSPA5, as a promisingdrug target. Other authors have also noted the potential for HSPA5 in the context ofCOVID-19 ( ). Other HSPs in our prediction list are also natural targets for drugrepositioning. More broadly, targeting ER stress is a promising direction to reduce celldeath ( ).In addition to the primary drug targets we have identiﬁed in “Results”, we pointout VDAC proteins, which are voltage-dependent anion channels (e.g., VDAC2, rank 99, p -value 0.035 and VDAC1, rank 182, p -value 0.037 for RL) that have been associatedwith mitochondria-triggered pro-apoptotic processes. We speculate that several recentlyidentiﬁed inhibitors of VDAC1 may be considered as therapeutic interventions to inhibitanion transport channels. An example is DIDS, which leads to a potential decrease inROS-associated cell death ( ).We have noted that anumber of dehydrogenases appear among our top ranking pro-teins. This ﬁnding naturally promotes NADH as a promising drug target to treat SARS-CoV-2 infection. NADH is also associated with metabolic health and has been studiedas an energy currency, suggesting a number of promising biological follow-ups. The War-burg eﬀect mentioned earlier is targeted by a number of metabolic drugs, which oﬀer newdirections for viral treatments. We also note the classical connection between elevated in-ﬂammation and metabolic slowdown associated with both aging and viral infections ( ).Thus, a combined immuno-metabolic combinatorial drug regimen might be worth consid-ering for COVID-19 patients to prevent severe outcomes.20e have already begun to integrate our current techniques both with other omicsdata and with orthogonal methods to predict more biologically meaningful networks andprocesses impacted by the virus. In particular, single-cell RNAseq data oﬀer many op-portunities to examine cellular heterogeneity and context-speciﬁc interactions. In thiscontext, we note complementary eﬀorts to repurpose drugs for SARS-CoV-2 that arebased on protein structures ( ), observational studies of treatments being administeredto patients ( ), and shortest paths in protein networks ( ).In summary, this paper provides a signiﬁcant new resource for drug repurposing anddevelopment to the COVID-19 community. The relatively quick turn-around of thisproject and the host-virus protein interaction resource on which we have built ( ), demon-strates that biological network science and network propagation ( ) have achieved signiﬁ-cant maturity, as further evidenced by their strong presence in cancer and chronic diseases.These eﬃcient computational methods and rapidly generated data have allowed us to de-velop high accuracy predictions of proteins involved in speciﬁc functional activities as anexpeditious response to this global pandemic. References

1. J. M. Sanders, M. L. Monogue, T. Z. Jodlowski, J. B. Cutrell,

JAMA - Journal ofthe American Medical Association (2020).2. M. P. Lythgoe, P. Middleton,

Trends in Pharmacological Sciences (2020).3. G. Li, E. De Clercq,

Nat Rev Drug Discov , 149 (2020).4. R. K. Guy, R. S. DiPaola, F. Romanelli, R. E. Dutch, Science , 829 (2020).5. C. V. Forst,

Infectious Disease Informatics (Springer, 2010), pp. 123–147.21. M. Z. Tay, C. M. Poh, L. R´enia, P. A. MacAry, L. F. Ng,

Nature Reviews Immunology pp. 1–12 (2020).7. D. E. Gordon, et al. , Nature (2020).8. L. Cowen, T. Ideker, B. J. Raphael, R. Sharan,

Nature Reviews Genetics , 551(2017).9. D. Zhou, B. Sch¨olkopf, ICML Workshop on Statistical Relational Learning and ItsConnections to Other Fields (2004), pp. 132–137.10. S. Mostafavi, D. Ray, D. Warde-Farley, C. Grouios, Q. Morris,

Genome Biology , S4(2008).11. T. M. Murali, M. D. Dyer, D. Badger, B. M. Tyler, M. G. Katze, PLoS computationalbiology , e1002164+ (2011).12. W. Kamitani, C. Huang, K. Narayanan, K. G. Lokugamage, S. Makino, Nat. Struct.Mol. Biol. , 1134 (2009).13. D. Bojkova, et al. , Nature (2020).14. J. Breitling, M. Aebi,

Cold Spring Harb Perspect Biol , a013359 (2013).15. D. J. Vigerust, V. L. Shepherd, Trends Microbiol. , 211 (2007).16. A. C. Walls, et al. , Nat. Struct. Mol. Biol. , 899 (2016).17. Y. Watanabe, et al. , Nat Commun , 2688 (2020).18. Y. Watanabe, J. D. Allen, D. Wrapp, J. S. McLellan, M. Crispin, bioRxiv (2020).19. S. J. Williams, E. D. Goddard-Borger, Biochem. Soc. Trans. (2020).220. C. Metzner, B. Salmons, W. H. G¨unzburg, J. A. Dangerﬁeld,

Virology , 125(2008).21. Y. Lu, D. X. Liu, J. P. Tam,

Biochem. Biophys. Res. Commun. , 344 (2008).22. E. L. Sanchez, M. Lagunoﬀ,

Virology , 609 (2015).23. J. Wei, et al. , bioRxiv (2020).24. S. E. Bettigole, L. H. Glimcher, Annu. Rev. Immunol. , 107 (2015).25. Y. Wu, M. T. Swulius, K. W. Moremen, R. N. Sifers, Proc. Natl. Acad. Sci. U.S.A. , 8229 (2003).26. P. N. Huang, et al. , PLoS Pathog. , e1006375 (2017).27. Y. Yoshida, K. Tanaka, Biochim. Biophys. Acta , 172 (2010).28. F. Reggiori, et al. , Cell Host Microbe , 500 (2010).29. G. Wolﬀ, C. E. Melia, E. J. Snijder, M. B´arcena, Trends Microbiol. (2020).30. A. S. Lee,

Nat. Rev. Cancer , 263 (2014).31. Y. Zhang, R. Liu, M. Ni, P. Gill, A. S. Lee, J. Biol. Chem. , 15065 (2010).32. Y. L. Tsai, et al. , Proc. Natl. Acad. Sci. U.S.A. , E4245 (2018).33. M. Ni, Y. Zhang, A. S. Lee,

Biochem. J. , 181 (2011).34. L. Booth, et al. , J. Cell. Physiol. , 1661 (2015).35. M. L. DeDiego, et al. , PLoS Pathog. , e1002315 (2011).36. C. P. Chan, et al. , J. Virol. , 9279 (2006).237. H. Chu, et al. , J. Biol. Chem. , 11709 (2018).38. H. R. Lyoo, S. Y. Park, J. Y. Kim, Y. S. Jeong,

Virol. J. , 32 (2015).39. I. M. Ibrahim, D. H. Abdelmalek, M. E. Elshahat, A. A. Elﬁky, J. Infect. , 554(2020).40. J. A. Aguiar, et al. , bioRxiv (2020).41. E. Terpos, et al. , Am. J. Hematol. (2020).42. D. B. Williams,

J. Cell. Sci. , 615 (2006).43. R. J. Kaufman, S. W. Pipe, L. Tagliavacca, M. Swaroop, M. Moussalli,

Blood Coagul.Fibrinolysis , 3 (1997).44. M. Flemmig, M. F. Melzig,

J. Pharm. Pharmacol. , 1025 (2012).45. K. G. Mann, M. F. Whelihan, S. Butenas, T. Orfeo, J. Thromb. Haemost. , 2055(2007).46. W. G. Deng, K. H. Ruan, M. Du, M. A. Saunders, K. K. Wu, FASEB J. , 2463(2001).47. E. Nogales, Annu. Rev. Biochem. , 277 (2000).48. U. F. Greber, M. Way, Cell , 741 (2006).49. A. T. R¨udiger, et al. , Virology , 185 (2016).50. P. Satir, S. T. Christensen,

Annu. Rev. Physiol. , 377 (2007).51. I. T. Lee, et al. , medRxiv (2020). 242. W. Sungnak, et al. , Nat. Med. , 681 (2020).53. V. G. Puelles, et al. , N. Engl. J. Med. (2020).54. C. Huang, et al. , Lancet , 497 (2020).55. Y. Lu, J. Chen, M. Xiao, W. Li, D. D. Miller,

Pharmaceutical Research (2012).56. S. Pushpakom, et al. , Nat Rev Drug Discov , 41 (2019).57. M. Deng, K. Zhang, S. Mehta, T. Chen, F. Sun, J Comput Biol , 947 (2003).58. A. Vazquez, A. Flammini, A. Maritan, A. Vespignani, Nat Biotechnol , 697 (2003).59. S. Letovsky, S. Kasif, Bioinformatics

19 Suppl 1 , i197 (2003).60. U. Karaoz, et al. , Proceedings of the National Academy of Sciences of the UnitedStates of America , 2888 (2004).61. D. P. Ha, R. Van Krieken, A. Carlos, A. S. Lee,

J. Infect. (2020).62. J. D. Malhotra, et al. , Proceedings of the National Academy of Sciences , 18525(2008).63. D. Ben-Hail, V. Shoshan-Barmatz,

Biochim. Biophys. Acta , 1612 (2016).64. G. Pawelec, D. Goldeck, E. Derhovanessian,

Curr. Opin. Immunol. , 23 (2014).65. C. Wu, et al. , Acta Pharmaceutica Sinica B (2020).66. M. Vaduganathan, et al. , New England Journal of Medicine , 1653 (2020).67. Y. Zhou, et al. , Cell discovery , 1 (2020).25 ata and Software Availability Our software is available under the GNU Public License version 3 at https://github.com/Murali-group/SARS-CoV-2-network-analysis . The networks in Figures 3 and 4are available for visualization and download on GraphSpace at http://graphspace.org/graphs/?query=tags:2020-sarscov2-network-analysis .We used publicly available datasets for our analysis.

Acknowledgments

TMM acknowledges support from NSF grants DBI-1759858 and MCB-1817736. KA ac-knowledges support from the Genetics, Bioinformatics, and Computational Biology pro-gram at Virginia Tech. JK acknowledges support from NSF grant CCF-2029543. MCacknowledges support from NSF grant CNS-1618207. CMDS acknowledges support fromthe Hariri Institute and the Department of Biomedical Engineering at Boston University.PR acknowledges support from NSF grant CBET-1510920 and USDA-NIFA grant 2018-07578. PR and TMM acknowledge support from the Computational Tissue EngineeringGraduate Education Program at Virginia Tech.The authors wish to thank S. Alabdullatif, S. Alshuaib, M. Iennaco, M. Kouzminov, S.Murthy, S.Makwana, N. Naguib, C. Tagliettii, and M. Zanna for exploratory research onthis data and insightful and thought-provoking analysis. We also thank Roded Sharan,Noga Alon, Dan Lancour and Rich Roberts for discussions that helped formulate thetechniques and ideas we used in this paper.26 upplementary materialsMethods

Algorithms

We describe each of the algorithms we use for label propagation and prediction. We aregiven a weighted, undirected network G = ( V, E, w ) and a set P of positive examples.Each node in G is a human protein and each edge represents a physical or functionalinteraction between two proteins. We describe the network we use in more detail in“Datasets”. The set P comprises of human proteins that interact with SARS-CoV-2proteins ( ). We seek to compute a score s ( v ) ≥ G that indicatesour conﬁdence that v either physically interacts with or is functionally linked to a SARS-CoV-2 protein. Regularized Laplacian (RL) ( ). We ﬁrst provide some intuition behind this method.We seek to compute a diﬀusion score that represents the probability that a given nodein the network will be reached by a random walk of a given length from a set of seednodes (the human interactors of SARS-CoV-2); the parameter α introduced below governsthe length of the random walk. Biologically, this model captures the stochasticity inprotein interactions in signaling networks or complexes that depends on many factorssuch as concentrations, conditions, co-factors, chaperones, ATP levels, etc. Thus, anygiven signaling cascade may be modelled as a random walk that will reach its target withsome probability. In our context, the higher the diﬀusion score our algorithms computefor a node, the more likely it is that the node may be involved in the host response to thevirus or be used by the virus in its life cycle.Formally, this method deﬁnes a label vector (cid:126)y over the nodes in G where y ( u ) = 1 if27ode u is a SARS-CoV-2 interactor and y ( u ) = 0, otherwise. It computes a score s ( u )between 0 and 1 for each protein u in G . Let W ∈ R n × n denote the adjacency matrix of G and ˜ W = D − / W D − / denote the normalized network, where D is a diagonal matrixwith D uu = (cid:80) v a uv , for every node u in G . Then, to compute the scores for each node,we minimize the following sum, where α > (cid:88) u ∈ V (cid:0) s ( u ) − y ( u ) (cid:1) + α (cid:88) ( u,v ) ∈ E ˜ w uv (cid:0) s ( u ) − s ( v ) (cid:1) , where n is the number of nodes in the graph and the minimization ranges over all vectors s ∈ R n . We note that ˜ L = I − ˜ W is the normalized Laplacian of G . To minimize the aboveexpression, we solve the system of linear equations ( I + α ˜ L ) (cid:126)s = (cid:126)y . It can be shown thatfor any connected graph G this system always has a unique solution, and that ( I + α ˜ L )is always invertible. The matrix ( I + α ˜ L ) − can be interpreted as the amount of diﬀusionthat ﬂows in the network between any two node pairs, and is termed the RegularizedLaplacian ( ). GeneMANIA ( ). This method is a variation of RL that can take negative examplesinto consideration. Starting with a label vector (cid:126)y where y ( u ) represents the prior evidencefor protein u being a SARS-CoV-2 interactor, this algorithm computes a score s ( u ) be-tween − u in G . The value of y ( u ) is 1 or -1 if u is a positiveor negative example, respectively. Let the number of positive examples be n + and thenumber of negative example be n − . If u is an unknown example, then y ( u ) = n + − n − n + + n − , themean of the labels of the labeled nodes. Apart from this deﬁnition of (cid:126)y , this method isidentical to RL. Note that the original version of GeneMANIA implicitly chose α = 1.We introduce the parameter α to allow a tradeoﬀ in the importance given to the inputnode labels (cid:126)y vis-a-vis the similarity of adjacent labels in the output (cid:126)s .28 inkSource ( ). This method ﬁxes the score s ( u ) = 1 for every positive example and s ( u ) = − G byminimizing the following function: (cid:88) ( u,v ) ∈ E w uv (cid:0) s ( u ) − s ( v ) (cid:1) . Let U denote the set of nodes in G that are unlabeled, i.e., are not positive or negativeexamples. For every node u ∈ U , we deﬁne the set of its neighbours as N ( u ) and use f ( u ) = (cid:88) v ∈ N ( u ) v is positive w uv − (cid:88) v ∈ N ( u ) v is negative w uv to denote the sum of the weighted scores of its neighbors that are positive or negativeexamples. Let (cid:126)f denote the vector containing these values and let W denote the adjacencymatrix of the subgraph of G induced by U . Deﬁning D as a diagonal matrix of W as inthe case of GeneMANIA but only for the nodes in U , we compute (cid:126)s as the solution of thelinear system of equations ( I + W D − ) s = f .When negative examples are not available, as is the case here, SinkSource adds anartiﬁcial negative example to the network, and connects each node to the artiﬁcial negativewith an edge of weight λ , where λ > Local.

We set s ( v ) = 1 for every node in P . For every other node u , we initialise s ( u ) = 0 and then compute s ( u ) as the weighted average of the scores of its neighbors. Support Vector Machine (SVM).

We set each node’s feature vector to be its adja-cency vector in the normalized network ˜ W . We trained a linear kernel using the LinearSVC function in the scikit-learn

Python package with default parameters.29 ogistic Regression.

We set each node’s feature vector as in the case of SVM. We usedthe

LogisticRegression function in the scikit-learn

Python package with defaultparameters.

Network Proximity Among Interactors of SARS-CoV-2

We sought to develop a statistical justiﬁcation for the RL method that we use to prioritizeadditional viral interactors using their network propagation distance to the set P of humanproteins that interact with SARS-CoV-2 proteins. we used the following procedure to thisend. We applied the Kolmogorov-Smirnov test to evaluate whether the distribution ofscores of any given protein in P from the other proteins in P is the same as the nulldistribution of scores of randomly selected proteins in the network from P .In more detail, for every protein p in P , we computed the contribution of the otherproteins in P to the score of p after running RL, akin to leave-one-out cross validation.Speciﬁcally, we set y ( p ) = 0 and executed RL to compute the score s ( p ) of p . We alsocomputed the scores in this manner for 1,000 proteins selected uniformly at random fromthe other nodes in G . These distributions appear in Figure 2(a). Statistical Signiﬁcance

To estimate the statistical signiﬁcance of each node’s scores, we adopted a null hypothesiscorresponding to the distribution of scores obtained from a randomly chosen positive set P (cid:48) where | P (cid:48) | = | P | . (As an alternative to randomizing the positive set, randomizing thenetwork (e.g., via a degree-preserving edge swap process) would destroy the correlationsbetween adjacent nodes (homophily) that are important contributions to pathway andneighborhood structure in the network.) We note that the degrees of the nodes in P mayhave a strong eﬀect on the resulting distribution of scores. For example, if many nodes30n P have high degree, then scores may tend to be larger overall than if there are fewnodes in P with high degree. Thus if the degree distribution of P does not approximatelymatch that of P (cid:48) , the resulting p -values will be biased.Had we selected each random sample uniformly at random from all nodes in G , thenthe degree distribution of the chosen nodes would not be ensured to match that of thenodes in P . Therefore, we implemented a stratiﬁed sampling approach, as follows: Givena number of bins b , we partitioned the nodes in G into b sets as follows:1. We sorted the nodes by weighted degree, i.e., we computed the degree sequence of G .2. We executed k -means clustering on the degree sequence of G to compute b clusters(i.e., we set k = b in the k -means algorithm).This approach emphasizes nearly-equal-degree groups. Then, to generate a random sam-ple P (cid:48) having | P | nodes, for every positive example v in P , we determined the subsetwhose range endpoints contained v and sampled a node from that subset uniformly atrandom. After evaluating various values of b , we selected the second approach and b = 10for use in our results (see Section S2.2).For each P (cid:48) we designated these nodes to be the set of positive examples and executedeach of the prediction algorithms, ensuring that the negative samples we selected for each P (cid:48) did not intersect with the original set of positive examples P . Repeating this procedure1,000 times, we constructed a distribution of scores for each node in P . We then estimatedthe p -value of a node’s score as the fraction of values in this distribution that were at leastas large as the score. We did not correct these scores for multiple hypothesis testing.31 dentiﬁcation of Biclusters with High Diﬀusion from SARS-CoV-2 Interactors to Top-Ranking Proteins Let K denote the regularized Laplacian matrix ( I + α ˜ L ) − . We remind the reader thatthe RL algorithm ranks proteins based on diﬀusion scores that associate a node u in thenetwork with a diﬀusion score s ( u ), where s ( u ) = (cid:80) v ∈ P K uv , where v ranges over the set P of all SARS-CoV-2 interactors. We sought to identify the experimentally determinedinteractors that contributed the most to the predicted diﬀusion scores. This analysis isimportant for tracing the provenance of computational predictions to their experimentalsources (

68, 69 ).We considered the sub-matrix of K where every column represented one of the 332SARS-CoV-2 interactors and every row corresponded to one of the top 332 proteins iden-tiﬁed by the RL algorithm. In this submatrix, each value K uv quantiﬁes the diﬀusionbetween a given SARS-CoV-2 interactor in column v and a top-ranking protein in row u . We replaced each element in row u and column v by the value K uv /s ( u ), i.e., thefractional contribution of a SARS-CoV-2 interactor to the overall score for top-rankingprotein u . We performed hierarchical clustering on both the rows and columns of thismatrix using Pearson’s correlation as the similarity measure and average linkage. Wemanually identiﬁed eight biclusters and validated this selection with tests for enrichmentof GO biological processes. Datasets

SARS-CoV-2–Human PPIs.

We obtained 332 human proteins that interact withSARS-CoV-2 ( ) and treated them as positive examples for our analysis. We added theACE2 receptor to this set. 32 irus–Human PPIs for network proximity computations. We obtained 104, 283,and 296 human proteins that interact with SARS-CoV, HIV-1, and HSV-1, respectively,from the VirHostNet database ( ). We used these sets of proteins for the analysis inFigure 2(a). Functional and protein interaction networks.

We started with the human func-tional interaction network in the STRING database (version 11) ( ), comprising of 18,886nodes and 977,789 edges after applying a “medium” score cutoﬀ of 400. We used the in-teraction reliabilities provided by STRING as edge weights; we divided each value inSTRING by 1 ,

000 to scale them between 0 and 1. An edge in this network may bederived from experimental data or computational analysis. Thus, an edge may representeither direct physical binding or indirect functional interaction. Of the 332 viral inter-actors, 328 were present in this network; REEP6 (Q96HR9), PPIL3 (Q9H2H8), RAB18(Q9NP72), and FKBP7 (Q9Y680) were missing.

Negative Examples.

To evaluate the precision of our predictions, we needed negativeexamples. Since datasets of human proteins that are certain not to interact with SARS-CoV-2 proteins are not available, we took the simple expedient of sampling them uniformlyat random from the STRING network. We considered three diﬀerent numbers of negativeexamples: as many as, ﬁve times, and ten times the number of positive examples. Whenwe applied the methods to the full set of positive examples, we averaged the results over100 random samples, with a positive:negative ratio of 1:5.

Drug-protein interactions.

We downloaded interactions among drugs and proteinsfrom the DrugBank database (version 5.1.6) ( ). This dataset contained 16,503 drug-protein target pairs among 5,665 drugs and 2,891 target proteins. Limiting the targets33o those in the STRING network reduced the number of drugs and targets to 5,589 and2,769, respectively. Evaluation

We evaluated the prediction algorithms using 100 runs of ﬁve-fold cross validation. Ineach run we ensured that all algorithms saw identical partitions of the examples into folds.We computed three measures of performance: (a) the area under the receiver operatorcharacteristic (ROC) curve. (b) the area under the precision-recall curve, and (c) precisionat a recall of 0.1 ( early precision ). Functional Enrichment

We used the clusterProfiler package in R ( ) to compute Gene Ontology terms,KEGG pathways, and Reactome pathways enriched in our predictions or in the humaninteractors of SARS-CoV-2 proteins. This package uses Fisher’s exact test to estimate theenrichment of an individual term or pathway and the method of Benjamini and Hochbergto correct for testing multiple hypotheses. We applied this correction for each database(GO, KEGG, Reactome) separately. We used a threshold of 0 .

01 to decide if a GO termor KEGG/Reactome pathway was signiﬁcantly enriched.The enrichment analysis yields many highly similar statistically signiﬁcant GO termsand KEGG and Reactome pathways; by “similar”, we mean that two diﬀerent termsor pathways may annotate many proteins in common. This problem is well-known withseveral approaches that have been proposed to mitigate it either by grouping similar termsand selecting a small subset of dissimilar terms (

74, 75 ) or by directly computing a setof non-redundant GO terms (

76, 77 ). As far as we can tell, these methods have beendeveloped to consider the enriched GO terms for one set of terms. When we apply them34ndependently to diﬀerent sets of proteins (e.g., predictions from RL and predictions fromSVM), they may select one term for one set of proteins but a similar but not identicalterm for another set, making the distinctions in enrichment hard to discern.Therefore, taking inspiration from previously developed methods (cited above), wedeveloped a simple heuristic based on the weighted set cover algorithm that simultaneouslysimpliﬁes multiple sets of enriched terms or pathways. For every term that is enrichedin at least one protein set, we deﬁned its composite odds ratio to be the product ofthe odds ratios for that term across the protein sets. We iteratively selected the termwith the largest composite odds ratio, deleted the proteins annotated to this term fromthe annotations of all other enriched term, adjusted the odds ratio for every term, andrecomputed the composite odds ratios. We stopped when the maximum composite oddsratio of the remaining terms became less than one. We demonstrate how this algorithmwas successful in reducing the redundancy among enriched GO terms in Section S2.4.In the case of two terms automatically selected by this algorithm, there was a dif-ferent, highly overlapping term that we felt would be more interpretable in the contextof SARS-CoV-2 and COVID-19. Therefore, we manually replaced the selected terms bytheir alternative choices. Speciﬁcally, we replaced “ciliary basal body-plasma membranedocking” and “mRNA transport” with “cilium assembly” and “viral transcription”, re-spectively. To substantiate our choices, we computed the Jaccard index between theoriginal and replaced term. Here we considered the proteins annotated by a term acrossall the protein sets (i.e., top-ranking proteins from RL and SVM and human interactorsof SARS-CoV-2 proteins). The Jaccard index value was 0.95 between “ciliary basal body-plasma membrane docking” and “cilium assembly” with all 88 proteins annotated by theﬁrst term being annotated by the second, which had 93 annotations. The Jaccard indexbetween “mRNA transport” and “viral transcription” was 0.46. The terms annotated 4035nd 43 proteins, respectively, with 26 proteins in common. In addition, we removed manyterms with a small number of annotations ( ≤

7) that were fairly similar to other termsin the list. We used the ﬁnal list of terms for further analysis.

Supplementary Files

These two supplementary ﬁles are available at the GitHub site mentioned above.

Table S1:

Network propagation score of each SARS-CoV-2 interactor using the remain-ing SARS-CoV-2 interactors as seeds. Each row contains a SARS-CoV-2 interactor,its score, the average score of 1,000 randomly-selected proteins, and a p -value. This p -value is the fraction of randomly-selected proteins whose score was as least aslarge as that of the protein in that row. Table S2:

The prediction rank and p -value computed by RL and SVM for each humanprotein on the STRING network, the list of drugs that target the protein (when thisinformation is available in DrugBank), and the closest SARS-CoV-2 interactor andSARS-CoV-2 protein. For the last piece of information, we computed the shortestweighted path, where we deﬁned the weight of a path to be the sum of the absolutevalue of the base-10 logarithm of the weights of the edges in the path. Table S3:

Enrichment results for RL, SVM and the viral interactors on GO biologicalprocesses.

Table S4:

Matrix of contributions to the network propagation score from each SARS-CoV-2 interactor to every top-ranking protein.

Table S5:

Sorted contributions from SARS-CoV-2 interactors to speciﬁc proteins dis-cussed in the paper. The table is divided into pairs of columns: the ﬁrst column36peciﬁes the rank and the name of a top-ranked protein followed by the SARS-CoV-2 interactors. The second column shows the fractional contribution from eachinteractor to the top-ranked protein. 37 upplementary TextS1 Supplementary Figures

RL GM SS SVM LogReg Local0.10.20.30.40.50.60.70.8 E a r l y P r e c i s i o n @ . R e c a ll A U R O C A U P R C RL GM SS SVM LogReg Local

P:N 1:5 P:N 1:10

Figure S1: Cross validation results for positive:negative ratios of 1:5 and 1:10.38 L R L S V M K r - l o g ( p v a l ) Transcription/TranslationRespirationER StressPTMsOrganelleOrganizationOther cilium assemblymicrotubule nucleationGolgi organizationrRNA processingtRNA export from nucleusviral transcriptionribosomal large subunit biogenesisNADH dehydrogenase complex assemblycellular respirationprotein import into mitochondrial matrixubiquitin-dependent ERAD pathwayprotein folding in ERprotein foldingprotein alpha-1,2-demannosylationGPI anchor biosynthetic processRab protein signal transductionzinc ion transport mitochondrial ATP synthesis coupled proton transportprotein N-linked glycosylation via asparagineribonucleoprotein complex subunit organizationcotranslational protein targeting to membraneoligosaccharide-lipid intermediate biosynthetic process

Figure S2: Comparison of GO terms enriched in two sets of top-ranking proteins inRL (top-332 and top-600), top-332 proteins from SVM, and SARS-CoV-2 interactors(abbreviated as ‘Kr’). Terms related to transcription and translation are enriched in thetop-600 RL proteins and the top-332 proteins in the SVM set.39

S2.1 Overlap Among Algorithms

Local RL SS SVM LogReg GM

Heatmap of Spearman Correlation Heatmap of Jaccard Index

LocalRLSSSVMLogRegGM a b

LocalRLSSSVMLogRegGM

75 21 31 7 775 84 45 11 921 84 40 10 731 45 40 13 47 11 10 13 27 9 7 4 2 L o c a l R L SS S V M L o g R e g G M

12 0 0 0 212 13 18 4 00 13 32 4 00 18 32 5 00 4 4 5 02 0 0 0 0 k = 100 L o c a l R L SS S V M L o g R e g G M

11 1 1 1 211 19 30 4 01 19 29 6 01 30 29 7 01 4 6 7 02 0 0 0 0 k = 200 L o c a l R L SS S V M L o g R e g G M

11 1 2 1 111 2329 5 11 23 34 4 02 2934 6 01 5 4 6 01 1 0 0 0 k = 300

Figure S3: Similarity of predictions between every pair of methods. ( a ) Spearman corre-lations of node prediction scores. ( b ) Overlap of the top k predictions of each method,measured using the Jaccard index. The number in each cell is the value of the corre-sponding correlation or Jaccard index multiplied by 100.To understand the diversity of the predictions across the prediction methods, betweeneach pair of algorithms, we computed the Spearman’s correlation of all scores. We alsocompared the top-ranking predictions between algorithms using the Jaccard index. RLhad a high correlation with Local (0.75), which was surprising given the poor performanceof Local. However, the Jaccard index for the two methods was around 0.1 (Figure S3(b)),suggesting that they shared very few top-ranking predictions and that the high correlationmay be caused by lower-scoring proteins. SVM had moderate values of correlation andJaccard’s index with RL and SS (around 0.4 and 0.3, respectively). Logistic regressionand GM both had low correlations and Jaccard’s indices with all other methods ( < . Algorithms.

To tune the methods, we varied the parameter α for RL, the weight ofthe edges λ connecting each node to the artiﬁcial sink for SS, and the parameter C controlling the inverse of the regularization strength for SVM and LR. For each settingof these parameters, we repeated 5-fold cross-validation with all three positive:negativeratios. We show the results for the ratios 1:1, 1:5, and 1:10 in Figure S4 and observedthe results were fairly consistent across ratios. We focused on optimizing early precisionvalues since we were interested in the analysis of top-ranking predictions.For RL, GM, and SS, we found that in general, constraining the propagation locallyaround positive examples (i.e., small values of α , large λ ) achieved higher early precisionthan more global propagation (i.e., large α , small λ ). We chose the parameter values α = 0 . , α = 0 .

1, and lambda = 100 for RL, GM, and SS, respectively.For the supervised classiﬁers SVM and LR, we found that decreasing the regularizationparameter (i.e., trying large values of C ) resulted in a slight increase in median earlyprecision (about 0.05 for SVM, and 0.07 for LR) over the default C = 1. However, toavoid overﬁtting, we chose to use C = 1 for both methods. Stratiﬁed sampling.

The number b of bins is a parameter. We tested b = 10 , , p -value < .

05 withthe exception of the top 15 predictions for SVM, many of which had p -values slightlyhigher than the cutoﬀ (Figure S5). In general, the p -values for SVM were slightly higherthan those for RL. Since we did not observe much diﬀerence when varying b , we selected b = 10. 41 .20.30.40.50.60.70.80.9 A U R O C / A U P R C / E P @ . .

01 0 . . . . . . α A U R O C / A U P R C / E P @ . .

01 0 . . . . . . α .

01 0 . . . . . . λ .

001 0 .

01 0 . . . . . C .

001 0 .

01 0 . . . . . C A U R O C / A U P R C / E P @ . RL GM SS SVM LogReg

Early Precision @0.1 RecallAUPRCAUROC P : N : P : N : P : N : Figure S4: Parameter search results for each method, evaluated using AUROC, AUPRC,and early precision (at recall equal to 0.1) of 5-fold CV with a positive:negative ratio of1:1, 1:5, and 1:10 on the STRING network. Each point shows the median value of 100repetitions 42 − − − − − b = 10 − − − − − − b = − − − − − − b = 10 − − − − − − b = RL SVM

Figure S5: Base-10 logarithms of the p -values of node scores of the top 500 ranked proteinsfor RL and SVM for three values of b . The red dashed lines show the signiﬁcance cutoﬀof 0 .

05, while the diagonal line shows x = y .43 A T P M G A C A D E C S I T NDU F A F NDU F A F NDU F B TIMMDC1NDUFV1NDUFS1NDUFA11NDUFS2NDUFB5NDUFS5NDUFA9NDUFB8NDUFS3NDUFV2NDUFAF5NDUFAF3 G F E R B C S DN A J C G R PE L1 P M P C AP M P C B T I MM B T I MM T I MM T I MM B T O MM TIMM21TIMM44TIMM50TIMM17BTIMM17APAM16TIMM23SLC25A4TIMM22TOMM40CHCHD4TIMM13TOMM22TIMM8ATAZATP5F1AATP5MC1ATP5F1BHSPD1CYC1IDH3GYME1L1HSPA9MTX2COQ2FXNGRPEL2HSCBLDHDPHB2UQCRC1 E RC G CC G CC GO L G A GO L G A GO R ASP GO L G B R AB A R AB A TRIP11GOLGA5GORASP2GOLGA4USO1GOLPH3BLZF1RAB11ARAB6ARAB6BRAB30RAB36RAB42EEA1SEC24ASEC24B A R F R AB R AB R A L A R AB CR AB R AE R AB A RAB11ARAB6ARAB6BRAB30RAB36RAB42EEA1 A L G A L G A L G M OG S U GG T ALG3ALG12ALG9ALG10ALG10BALG6RFT1ALG2STT3ADPAGT1GANABCANXUGGT1STT3BRPN1DDOSTMAN1A1CALRMAN1B1ALG9-210 F O X R E D E R O B H Y O U E D E M E R L E C O S E M C F A M A N G L Y SE L E N O S U BX N LMAN1ATL3DELE1DNAJC10EDEM1TXNDC16FAF2VCPSEL1LSYVN1AMFRDERL2DERL1DERL3ERLIN2ERLIN1EMC3EMC7EMC8UBAC2DNAJC3SEC63DNAJB11ALG3ALG12ALG9ALG10ALG10BALG6RFT1ALG2STT3ADPAGT1GANABCANXUGGT1STT3BRPN1DDOSTMAN1A1CALRMAN1B1ALG9-210 NU P NU P NU P NU P NU P SEC13NCOA5NXF1NUP62CLXPOTRGPD2NUP93NUP155NUP205NUP50NEK7SNUPNGEMIN5SNRPFSNRPGHNRNPCNCBP2EIF4EHSPA5HSPA1LHSPA1AHSPA6HSPA12AHSPA13HSPA4MCCC1 row min row max C E N P F C EP N I N AKAP CN T R L C EP CD K R AP C EP P CN T C EP P R KA R B N I N L R AB A T U B G C P T U B G C P CEP170CNTLNSASS6OPTNYWHAGYWHAENME7TUBB4BTUBBTUBB4ATUBA4ATUBA1ANUMA1HAUS6CETN2TUBGCP6NEDD1CEP290MZT1PCM1CLASP1MZT2BTUBG1CEP192CEP164CCP110CEP152CEP63CENPJODF2CEP76PLK4SDCCAG8OFD1SSNA1CEP131CEP57TUBG2HAUS2TUBGCP4MZT2ATUBGCP5HAUS7CEP72HAUS1HAUS5SFI1CEP78HAUS8CEP70HAUS4HAUS3CEP41ALMS1MARK4NDE1CDK11ACDK11BCEP89SCLT1FBF1CEP83CEP97TTBK2KIF24NPHP4CC2D2AIQCB1TCTN1TCTN2TMEM216TCTN3MAPRE1CKAP5PAFAH1B1CSNK1DDYNC1H1DYNC1I2DCTN2DCTN3ACTR1ADYNLL1SEPTIN2PLK1

HSPA5

Bicluster Enriched terms p - value Bicluster Enriched terms p - value T op -r an k i ng p r o t e i n s E -1024.66 E -183.87 E -315.60 E -175.52 E -173.50 E -09 4.02 E -091.12 E -155.20 E -091.04 E -166.05 E -113.61 E -10 Figure S6: Figure 2(c) reproduced with individual biclusters visualized and GOterms denoted in additional tables. We generated this ﬁgure using MORPHEUShttps://software.broadinstitute.org/morpheus.Figure S6 shows a number of biclusters that represent groups of experimentally iden-44iﬁed viral interactors (columns) that contribute substantially to diﬀusion scores of pre-dicted high-ranking proteins in the network as described in “Methods”. High diﬀusionbiclusters can naturally emerge either in long signaling chains or in dense subnetworkssuch as complexes. We hypothesize that these biclusters may arise from densely-connectedsubnetworks that are exploited by the virus to perturb host processes to promote viralinfection.One limitation of this approach arises when a protein matches well with multiplebiclusters, in which case proteins with closely related functions may not appear in closeproximity after we perform hierarchical clustering of the matrix of diﬀusion scores. Inthe case of HSPA5, we ﬁnd that despite it having well-documented connections to CANXand CALR, it does not appear alongside them in bicluster 7 and instead is included inbicluster 8. This behavior could arise from the many additional functions HSPA5 hason the cell surface in addition to its role in blood coagulation along with CANX andCALR as discussed in “Results”. Despite not being included in bicluster 7, the HSPA5score contributions that are attributed to the viral interactors of bicluster 7 are indicativeof relatively high diﬀusion similar to that of the other rows in bicluster 7. AlthoughHSPA5, CANX, and CALR are not directly adjacent in Figure S6, further observationreveals that there is notable overlap in the viral interactors that signiﬁcantly contributeto their respective diﬀusion scores. We illustrate this point in Figure S6 in the form ofthe additional row shown in the breakout for bicluster 7.To verify that these biclusters are associated with functionally coherent subnetworksimplicated in viral pathogenesis, we performed gene set enrichment on the genes in eachidentiﬁed bicluster (“Methods”). The results indicated that each bicluster can be eﬀec-tively mapped to one or more enriched GO terms (Figure S6). This type of biclusteringanalysis also serves as a nuanced visual representation of how the RL diﬀusion algorithm45orks and its contrast to other network properties such as shortest path lengths or cen-trality. The biclustering analysis reveals that in several cases the majority of diﬀusionfor a given protein can be attributed to a single or a small group of seed nodes. Forexample, 100% of the diﬀusion score for RTF1 originates from four viral interactors: (1)ALG11: 37%; (2) ALG5: 26%; (3) ALG8: 21%; and (4) MOGS: 16%. This analysisalso reveals that in several cases the highest contributor to a given protein’s score maynot be the nearest seed node by path length. For instance, 8.91% of the score for proteinTUBB4B can be attributed to TBCA despite UPF1 being the nearest viral interactor andonly contributing 7 . × − %. To further explore the trends among top contributors, wesorted the components contributing to each protein score in descending order (Table S5)and plotted the results. Figure S7 shows this analysis for a group of four tubulin proteinsthat all appear in bicluster 1 and in the network related to cilium assembly (Figure 4(g)). T B C AAKAP C YB R P T G ES N P C T U B G C P N I N L C EP T U B G C P CN T R L G L A CD K R AP C EP C EP C L I P P CN T GG H P R KA R BAKAP R AB A R AB A N I NC EP N EK C E N P F P D E D I PP R KA R AE R P GO L G A T I MM R AB APS M D R AB M A R K GO R ASP TUBB4B T B C A C YB R P T G ES N P C T U B G C P N I N L C EP T U B G C P CN T R L G L A CD K R AP C EP C EP C L I P P CN T GG H P R KA R BAKAP R AB A R AB AAP M N I NC EP N EK C E N P F P D E D I P R AB M A R K GO L G A R AB A R AB AE R P M A R K R AB P R KA R A TUBB T B C A T U B G C P N I N L T U B G C P C EP CN T R L CD K R AP C EP C EP C L I P P CN T P R KA R BAKAP R AB A C EP R T N R AB A N I N M AP D N EK P D E D I P C E N P FF Y C O R AB A M I B PS M D P O L A CH M P AP R KA R A R AB P R I M P O L A GO L G A P D Z D VPS TUBA4A T B C A T U B G C P N I N L T U B G C P C EP CN T R L CD K R AP C EP C EP C L I P P CN T P R KA R BAKAP R AB A C EP R A L A R AB A N I NN EK C E N P F P D E D I P M AP D R AB AP R KA R AE M C P O L A R AB PS M D M A R K GO L G A R AB P R I M RH O A DC A F R AE TUBA1A F r ac t i o n a l C o n t r i b u t i o n t o d i ff u s i o n s c o r e ( o u t o f ) F r ac t i o n a l C o n t r i b u t i o n t o d i ff u s i o n s c o r e ( o u t o f ) RL Rank: 81RL Rank: 47 RL Rank: 158RL Rank: 148

SARS-CoV-2 interactors SARS-CoV-2 interactors

Figure S7: Top 35 SARS-CoV-2 interactors contributing to the diﬀusion score of 4 notableproteins: TUBB4B, TUBB, TUBA4A, and TUBA1A with RL ranks of 47, 81, 148, and158 respectively. Each observation represents the fractional contribution by the viralinteractor on the x-axis to the total predicted score of the protein in each title.The four proteins depicted share the same highest contributor, TBCA, and all demon-46trate a sharp decrease from the ﬁrst to second contributing node. TBCA is the genethat codes for the tubulin-speciﬁc chaperone A protein which is involved in beta-tubulinbinding to facilitate tubulin complex assembly ( ). Within the ﬁrst 25 highest rankedcontributors in each plot, we observe another sharp decrease to near zero relative contri-bution from the remaining seed nodes. We view this steep drop-oﬀ as an indication thatseed nodes could be clustered such that diﬀusion from one group of seed nodes to a givenprotein far outweighs that of another set of seed nodes. These lower contribution seednodes could be conﬁgured in either a distant cluster or such that many individual nodesare distributed widely. Further analysis using diﬀusion mapping techniques could assistin distinguishing between these two proposed conﬁgurations.

S2.4 Evaluation of the Method for Simplifying Functional En-richment Results

The main goal of simplifying the result from enrichment analysis was to ﬁlter out theredundant terms, i.e., terms that have highly overlapping annotated proteins. To evaluatehow our simpliﬁcation method (“Methods”) performs in terms of reducing overlap , weanalyzed the enrichment results from each protein set (i.e., top-ranking proteins from RL,top-ranking proteins from SVM, and human interactors of SARS-CoV-2 proteins) in twoways. We performed each analysis for the set of terms before simpliﬁcation and for theset of terms after simpliﬁcation. We show the results RL as an exemplar.First, for each enriched term t , we computed the largest Jaccard index between t andevery other enriched terms. We plotted the distributions of these quantities. We expectedthe Jaccard indices to be large in the ’before’ set and small in the ’after’ set of terms. In-deed, we observed that our results exhibited these trends for each protein set. We presentthe result from top-ranking RL proteins in Figure S8(a). For the top-ranking RL proteins,47here were 181 and 23 enriched terms before and after simpliﬁcation respectively. Themaximum Jaccard index was ≥ . < .

3, which indicates a consider-able improvement in overlap. An exception was the pair of terms “NADH dehydrogenasecomplex assembly” and “cellular respiration”, which were both present after simpliﬁcationyet had a Jaccard index of 0 .

51 and an overlap of 29 annotated proteins. Of these twoterms, our method selected “NADH dehydrogenase complex assembly” ﬁrst since it hadthe higher composite odds ratio. The second term still had 14 annotated proteins thatwere not covered by any other previously selected term. Its resulting composite odds ratioof 1 . > , , , , and 4% of proteins were covered by1 , , , , and 5 terms, respectively. In contrast, these values were 54% , , , , and3% after simpliﬁcation. In addition, 18% of proteins were not covered by any terms after48 .0 0.2 0.4 0.6 0.8 1.0 Maximum Jaccard index for a term F r a c t i o n o f t e r m s BeforeAfter (a)

Number of terms covering a protein F r a c t i o n o f p r o t e i n s (b) Figure S8: Results of simpliﬁcation of enriched terms. (a) Distribution of the maximumJaccard index values for the enriched GO biological processes in the top-ranking RLproteins, both before and after simpliﬁcation. The x -axis corresponds to the maximumJaccard index and the y -axis to the fraction of terms that have a maximum Jaccard indexin a particular range. (b) Distribution of the number of enriched enriched GO biologicalprocesses containing each protein, considering the top-ranking RL proteins, both beforeand after simpliﬁcation. The thin bar at 0 indicates the fraction of proteins in the proteinuniverse left uncovered after simpliﬁcation.49impliﬁcation. However, no GO terms were enriched in this set of proteins at the 0.01 level.Overall, these results conﬁrmed the eﬃcacy of our approach in computing non-redundantenriched terms. References

68. B. P. Anton, et al. , PLoS Biol. , e1001638 (2013).69. S. Kasif, R. J. Roberts, To appear (2020).70. V. Navratil, et al. , Nucleic Acids Research , D661 (2009).71. D. Szklarczyk, et al. , Nucleic Acids Research , D362 (2016).72. D. S. Wishart, et al. , Nucleic Acids Research (2018).73. G. Yu, L. G. Wang, Y. Han, Q. Y. He,

OMICS A Journal of Integrative Biology ,284 (2012).74. Supek, F. and Bo˘snjak, M. and ˘Skunca, N. and Tomislav, ˘S. , PLoS One , e21800(2011).75. D. Merico, R. Isserlin, O. Stueker, A. Emili, G. D. Bader, PLoS ONE , e13984(2010).76. Y. Lu, R. Rosenfeld, I. Simon, G. J. Nau, Z. Bar-Joseph, Nucl. Acids Res. , e109+(2008).77. S. Bauer, J. Gagneur, P. N. Robinson, Nucleic Acids Research , 3523 (2010).78. The UniProt Consortium, Nucleic Acids Research , D506 (2018).509. P. Gaudet, M. S. Livstone, S. E. Lewis, P. D. Thomas, Brieﬁngs in bioinformatics12