A guided network propagation approach to identify disease genes that combines prior and new information

Abstract

A major challenge in biomedical data science is to identify the causal genes underlying complex genetic diseases. Despite the massive influx of genome sequencing data, identifying disease-relevant genes remains difficult as individuals with the same disease may share very few, if any, genetic variants. Protein-protein interaction networks provide a means to tackle this heterogeneity, as genes causing the same disease tend to be proximal within networks. Previously, network propagation approaches have spread signal across the network from either known disease genes or genes that are newly putatively implicated in the disease (e.g., found to be mutated in exome studies or linked via genome-wide association studies). Here we introduce a general framework that considers both sources of data within a network context. Specifically, we use prior knowledge of disease-associated genes to guide random walks initiated from genes that are newly identified as perhaps disease-relevant. In large-scale testing across 24 cancer types, we demonstrate that our approach for integrating both prior and new information not only better identifies cancer driver genes than using either source of information alone but also readily outperforms other state-of-the-art network-based approaches. To demonstrate the versatility of our approach, we also apply it to genome-wide association data to identify genes functionally relevant for several complex diseases. Overall, our work suggests that guided network propagation approaches that utilize both prior and new data are a powerful means to identify disease genes.


Borislav H. Hristov, Bernard Chazelle and Mona Singh ∗†

∗ Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University
† Email [email protected]

arXiv [q-bio.GN]
Introduction

Large-scale efforts such as the 1000 Genomes Project [1], The Cancer Genome Atlas (TCGA) [2], and the Genome Aggregation Database [3], among others, have catalogued millions of variants occurring in tens of thousands of healthy and disease genomes. Despite this abundance of genomic data, however, understanding the genetic basis underlying complex human diseases remains challenging [4]. In contrast to simple Mendelian diseases, for which a small set of commonly shared genetic variants are responsible for disease phenotypes, complex heterogeneous diseases are driven by a myriad of combinations of different alterations. Individuals exhibiting the same phenotypic outcome (a particular disease) may share very few, if any, genetic variants, thereby making it difficult to discover which of numerous variants are associated with heterogeneous diseases, even when focusing just on changes that occur within genes.

Biological networks provide a powerful, unifying framework for identifying disease genes [5–8]. Genes relevant for a given disease typically target a relatively small number of biological pathways, and since genes that take part in the same pathway or process tend to be close to each other in networks [9, 10], disease genes cluster within networks [11, 12]. Consequently, if genes known to be causal for a particular disease are mapped onto a network, other disease-relevant genes are likely to be found in their vicinity [13]. Thus, the signal from known disease genes can be “propagated” across a network to prioritize either all genes within the network or just candidate genes within a genomic locus where single nucleotide polymorphisms have been correlated with an increased susceptibility to disease [14–19].

While initial network approaches to identify disease genes focused on propagating knowledge from a set of known “gold standard” disease genes, with the widespread availability of cancer sequencing data and genome-wide association studies (GWAS), the source of where information is propagated from has shifted to genes that are newly identified as perhaps playing a role in disease [20–26]. For example, in the cancer context, diffusing a signal from genes that are somatically mutated across tumors is highly effective for identifying cancer-relevant genes and pathways [21, 25]; notably, while frequency-based approaches identify genes that “drive” cancer by searching for those that are recurrently mutated across tumor samples beyond some background rate [27], such a network propagation approach can even pinpoint rarely mutated driver genes if they are within subnetworks whose component genes, when considered together, are frequently mutated.

Thus there are two dominant network propagation paradigms for uncovering disease genes: spreading signal either from well-established, annotated disease genes or from genes that have some new evidence of being disease-relevant. While both have been successful independently, we argue that both sources of information should be utilized together, and that existing knowledge of disease genes should inform the way new data is examined within networks. That is, while our prior knowledge of causal genes for a given disease may be incomplete, it nevertheless is a valuable source of information about the biological processes underlying the disease; furthermore, in many cases, there is substantial prior knowledge and there is no reason disease gene discovery should proceed de novo from newly observed alterations.

In this paper, we introduce a guided network propagation framework to uncover disease genes, where signal is propagated from new data so as to tend to move towards genes that are closer to known disease genes. Our core method of propagating information within a network is via either diffusion [28] or random walks with restarts (RWRs) [14], as these are mathematically sound, well-established approaches, where numerical solutions are easily obtained. In particular, our approach first diffuses a signal from known disease genes, and then performs either guided random walks or guided diffusion from the new data so as to preferentially move towards genes that have received higher amounts of signal from the initial set of known disease genes. In contrast, previous network propagation methods for disease gene discovery have performed diffusion or random walks uniformly from each node (i.e., in an “unguided” manner, as in e.g., [21, 24]), or where the diffusion is scaled by weights on network edges that reflect their estimated reliabilities (e.g., [23]). Alternatively, several approaches have attempted to uncover disease genes by explicitly connecting in the network genes that have genetic alterations with genes that have expression changes [29–34]; while well-suited for finding genes causal for observed expression changes, such approaches are less appropriate as a means to link prior and new information, and our approach instead uses prior knowledge to simply influence information propagation within the network.

We demonstrate the efficacy of our method uKIN (using Knowledge In Networks) by first applying it to discover genes causal for cancer. Here, new information consists of genes that are found to be somatically mutated in tumors (only a small number of which are thought to play a functional role in cancer), and prior information is comprised of subsets of “driver” genes known to be cancer-relevant [35]. In rigorous large-scale, cross-validation style testing across 24 cancer types, we demonstrate that propagating signal by integrating both these sources of information performs substantially better in uncovering known cancer genes than propagating signal from either source alone. Notably, even using just a small number of known cancer genes (5–20) to guide the network propagation from the set of mutated genes results in substantial improvements over the unguided approach. Next, we compare uKIN to four state-of-the-art network-based methods that use somatic mutation data for cancer gene discovery and find that uKIN readily outperforms them, thereby demonstrating the advantage of additionally incorporating prior knowledge. We also show that by using cancer-type specific prior knowledge, uKIN can better uncover causal genes for specific cancer types. Finally, to showcase uKIN’s versatility, we show its effectiveness in identifying causal genes for three other complex diseases, where the genes known to be associated with the disease come from the Online Mendelian Inheritance in Man (OMIM) [36] and genes comprising the new information arise from genome-wide association studies (GWAS).

Methods

Overview.

At a high level, our approach uKIN propagates new information across a network, while using prior information to guide this propagation (Figure 1). While our approach is generally applicable, here we focus on the case of propagating information across biological networks in order to find disease genes. We assume that prior knowledge about a disease consists of a set of genes already implicated as causal for that disease, and new information consists of genes that are potentially disease-relevant. In the scenario of uncovering cancer genes, prior information comes from the set of known cancer genes, and new information corresponds to those genes that are found to be somatically mutated across patient tumors. For other complex diseases, new information may arise from (say) genes weakly associated with a disease via GWAS studies or found to have de novo or rare mutations in a patient population of interest.

The first step of our approach is to compute for each gene a measure that captures how close it is in the network to the prior knowledge set of genes K (Figure 1a). To accomplish this, we spread the signal from the genes in K using a diffusion kernel [28]. Next, we consider new information consisting of genes M that have been identified as potentially being associated with the disease. As we expect those that are actually disease-relevant to be proximal to each other and to the previously known set of disease genes, we spread the signal from these newly implicated genes M, biasing the signal to move towards genes that are closer to the known disease genes K (Figure 1b). We accomplish this by performing RWRs, where with probability α, the walk jumps back to one of the genes in M. That is, α controls the extent to which we use new versus prior information, where higher values of α weigh the new information more heavily. With probability 1 − α, the walk moves to a neighboring node, but instead of moving from one gene to one of its neighbors uniformly at random as is typically done, the probability instead is higher for neighbors that are closer to the prior knowledge set of genes K. Genes that are visited more frequently in these random walks are more likely to be relevant for the disease because they are more likely to be part of important pathways around K that are also close to M. We thus numerically compute the probability with which each gene is visited in these random walks, and then use these probabilities to rank the genes. As an alternative to a RWR, we also experiment with implementing the guided propagation via a diffusion kernel [28]. Each step of our procedure is described in more detail below.

Notation.

The biological network is modeled as an undirected graph $G = (V, E)$ where each vertex represents a gene, and there is an edge between two vertices if an interaction has been found between the corresponding protein products. We require G to be connected, restricting ourselves to the largest connected component if necessary. We explain our formulation with respect to cancer, but note that it is applicable in other settings (both disease and otherwise). The set of genes already known to be cancer associated is denoted by $K = \{k_1, k_2, \ldots, k_l\}$. The set of genes that have been found to be somatically mutated in a cohort of individuals with cancer is denoted by $M = \{m_1, m_2, \ldots, m_p\}$, with $F = \{f_{m_1}, f_{m_2}, \ldots, f_{m_p}\}$ corresponding to the rate with which each of these genes is mutated. We refer to K as the prior knowledge and M as the new information. We assume that $K \subset V$ and $M \subset V$; in practice, we remove genes not present in the network. The genes within K and M may overlap (i.e., it is not required that $K \cap M = \emptyset$).

Figure 1: Overview. (a) Known disease-relevant genes (prior knowledge) are mapped onto an interaction network (shown in red, top). Signal from this prior knowledge is propagated through the network via a diffusion approach [28], resulting in each gene in the network being associated with a score such that higher scores (visualized in darker shades of red, bottom) correspond to genes closer to the set of known disease genes. These scores are used to set transition probabilities between genes such that a neighboring gene that is closer to the set of prior knowledge genes is more likely to be chosen. (b) Genes putatively associated with the disease, corresponding to the new information, are mapped onto the network (shown in green, top). To integrate both sources of information, RWRs are initiated from the set of putatively associated genes, and at each step, the walk either restarts or moves to a neighboring gene according to the transition probabilities (i.e., walks tend to move towards genes outlined in darker shades of red). These prior-knowledge “guided” RWRs have a stationary distribution corresponding to how frequently each gene is visited, and this distribution is used to order the genes. Higher scores correspond to more frequently visited genes (depicted in darker greens, bottom).

Guided RWR Algorithm.

For each gene $i \in V$, assume that we have a measure $q_i$ that represents how close i is to the set of genes K. We will use the nonnegative vector q, which we describe in the next section, to guide a random walk starting at the nodes in M and walking towards the nodes in K. Each walk starts from a gene i in M, chosen with probability proportional to its mutational rate $f_i$. At each step, with probability α the walk can restart from a gene in M, and with probability 1 − α the walk moves to a neighboring gene picked probabilistically based upon q. Specifically, if $N(i)$ are the neighbors of node i, the walk goes from node i to node $j \in N(i)$ with probability proportional to $q_j$, namely $q_j / \sum_{k \in N(i)} q_k$. That is, if at time t the walk is at node i, the probability that it transitions to node j at time t + 1 is

$$p_{ij} = (1 - \alpha)\, \delta_{ij} \cdot \frac{q_j}{\sum_{k \in N(i)} q_k} + \alpha \cdot \frac{f_j}{\sum_{k \in M} f_k},$$

where $\delta_{ij} = 1$ if $j \in N(i)$ and 0 otherwise, and $f_j$ is taken to be 0 for genes $j \notin M$. Hence, the guided random walk is fully described by a stochastic transition matrix P with entries $p_{ij}$. By the Perron-Frobenius theorem, the corresponding random walk has a stationary distribution π (a left eigenvector of P associated with the eigenvalue 1). If the graph G is connected, then the back edges to M easily ensure that π is unique and can be approximated by a long enough random walk. For each gene i, its score is given by the i-th element of π. The genes whose nodes have high scores are most frequently visited and, therefore, are more likely relevant to cancer as they are close to both the mutated starting nodes as well as to known cancer genes.

Incorporating prior knowledge.
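Before turning to how q is obtained, the guided walk defined above can be sketched with q taken as given. The following is a minimal pure-Python illustration, not the authors' implementation: the function name `guided_rwr`, the dictionary-based graph representation, and the power-iteration stopping rule are all assumptions made for this example, and it presumes that every node has at least one neighbor with positive q.

```python
def guided_rwr(adj, q, freq, alpha=0.5, tol=1e-12, max_iter=10_000):
    """Stationary distribution of a guided random walk with restarts.

    adj   : dict node -> list of neighbouring nodes (undirected, connected)
    q     : dict node -> positive closeness score to the prior knowledge K
    freq  : dict gene in M -> mutation rate f (genes outside M are absent)
    alpha : restart probability
    (All names and defaults here are illustrative assumptions.)
    """
    nodes = list(adj)
    total_f = sum(freq.values())
    # Restart distribution: f_j / sum_{k in M} f_k, and 0 outside M.
    restart = {v: freq.get(v, 0.0) / total_f for v in nodes}
    # Guided move from i to neighbour j: q_j / sum_{k in N(i)} q_k.
    move = {}
    for i in nodes:
        z = sum(q[k] for k in adj[i])
        move[i] = {j: q[j] / z for j in adj[i]}

    pi = dict(restart)  # start the iteration at the restart distribution
    for _ in range(max_iter):
        # One step of pi <- pi P, with p_ij = (1 - alpha) * move + alpha * restart.
        new_pi = {v: alpha * restart[v] for v in nodes}
        for i in nodes:
            for j, p in move[i].items():
                new_pi[j] += (1.0 - alpha) * pi[i] * p
        delta = max(abs(new_pi[v] - pi[v]) for v in nodes)
        pi = new_pi
        if delta < tol:
            break
    return pi


# Toy example (made-up graph and scores): the walk restarts at "m" and
# prefers neighbour "a" because q["a"] is twice q["b"].
toy_adj = {"m": ["a", "b"], "a": ["m"], "b": ["m"]}
toy_q = {"m": 0.1, "a": 0.6, "b": 0.3}
toy_pi = guided_rwr(toy_adj, toy_q, {"m": 1.0}, alpha=0.5)
```

On this toy graph the walk spends twice as much time at `a` as at `b`, mirroring the 2:1 ratio of their q scores.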

For each gene in the network, we wish to compute how close it is to the set of cancer-associated genes K. While many approaches have been proposed to compute “distances” in networks, we use a network flow/diffusion technique where each node $k \in K$ introduces a continuous unitary flow which diffuses uniformly across the edges of the graph and is lost from each node $v \in V$ in the graph at a constant first-order rate λ [28]. Briefly, let $A = (a_{ij})$ denote the adjacency matrix of G (i.e., $a_{ij} = 1$ if $(i, j) \in E$ and 0 otherwise) and let S be the diagonal matrix where $s_{ii}$ is the degree of node $i \in V$. Then, the Laplacian of the graph G shifted by λ is defined as $L = S + \lambda I - A$. The equilibrium distribution of fluid density on the graph is computed as $q = L^{-1} b$ [28], where b is the vector with 1 for the nodes introducing the flow and 0 for the rest (i.e., $b_i = 1$ if $v_i \in K$ and $b_i = 0$ if $v_i \notin K$, for all $v_i \in V$). Note that L is strictly diagonally dominant, hence nonsingular, for any λ > 0. We set λ = 1 in our applications. The vector q can be efficiently computed numerically. Thus, at equilibrium, each node i in the graph is associated with a score $q_i$ which reflects how close it is to the nodes already marked as causal for cancer.

Guided diffusion.
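As a reference point for the guided variant, the prior-knowledge diffusion just outlined, $q = L^{-1} b$ with $L = S + \lambda I - A$, can be sketched as follows. This is a hedged illustration rather than the authors' code: the name `diffuse_prior`, the adjacency-dictionary input, and the choice of Jacobi iteration (which converges here because L is strictly diagonally dominant for λ > 0) are assumptions. A sparse direct solver would serve equally well.

```python
def diffuse_prior(adj, known, lam=1.0, tol=1e-12, max_iter=10_000):
    """Approximate q = L^{-1} b for L = S + lam*I - A by Jacobi iteration.

    adj   : dict node -> list of neighbouring nodes (undirected graph)
    known : set of prior-knowledge nodes K (b_i = 1 for these, else 0)
    lam   : first-order loss rate lambda, assumed > 0
    (Function and argument names are illustrative assumptions.)
    """
    nodes = list(adj)
    b = {v: 1.0 if v in known else 0.0 for v in nodes}
    q = {v: 0.0 for v in nodes}
    for _ in range(max_iter):
        # Row i of L q = b reads (deg_i + lam) * q_i - sum_{j in N(i)} q_j = b_i.
        new_q = {
            v: (b[v] + sum(q[u] for u in adj[v])) / (len(adj[v]) + lam)
            for v in nodes
        }
        delta = max(abs(new_q[v] - q[v]) for v in nodes)
        q = new_q
        if delta < tol:
            break
    return q


# Toy example (made-up graph): a three-node path k - x - y with K = {k}.
toy_path = {"k": ["x"], "x": ["k", "y"], "y": ["x"]}
toy_scores = diffuse_prior(toy_path, known={"k"}, lam=1.0)
```

On this three-node path with λ = 1, the scores come out to 5/8, 1/4 and 1/8, decreasing with distance from the known gene.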

Instead of performing RWRs to propagate knowledge in a guided manner, it is also possible to adapt the diffusion approach just outlined by letting $A = (a_{ij})$ be defined such that $a_{ij} = q_j / \sum_{k \in N(i)} q_k$, and using A to compute L and the equilibrium density as above.

Data sources and pre-processing.

We test uKIN on two protein-protein interaction networks: HPRD (Release 9 041310) [37] and BioGrid (Release 3.2.99, physical interactions only) [38]. We pre-process the networks as in [39]. Briefly, we remove all proteins with an unusually high number of interactions (> 900 interactions, > 10 standard deviations away from the mean number of interactions). Additionally, to remove spurious interactions, we remove those whose Z-score normalized diffusion state distance exceeds a cutoff [40]. This leaves HPRD with 9,379 proteins and 36,638 interactions and BioGrid with 14,326 proteins and 102,552 interactions.

We use level 3 cancer somatic mutation data from TCGA [2] for 24 cancer types (Supplemental Table 1). For each cancer type, we process the data as previously described and exclude samples that are obvious outliers with respect to their total number of mutated genes [39]. Our set of prior knowledge is constructed from the 719 Cancer Gene Census (CGC) genes that are labeled by COSMIC (version August 2018) as being causally implicated in cancer [35]. For each cancer type, our new information consists of genes that have somatic missense or nonsense mutations, and we compute the mutational frequency of a gene as the number of observed somatic missense and nonsense mutations across tumors, divided by the number of amino acids in the encoded protein.

We obtain 24, 28, and 63 genes associated with three complex diseases, age-related macular degeneration (AMD), amyotrophic lateral sclerosis (ALS) and epilepsy, respectively, from OMIM [36]. These genes are used to construct the set of prior knowledge. For each disease, we form the set M by querying from the GWAS database [41] the genes implicated for the disease and using the corresponding p-values to compute the starting frequencies f. Specifically, for each disease, for each GWAS study i, if a gene j’s p-value is $p_{i,j}$, we set its frequency to $\log(p_{i,j}) / \sum_k \log(p_{i,k})$ and then for each gene average these frequencies over the studies.

Performance evaluation.
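One detail of the data preparation above is worth making concrete before discussing evaluation: the conversion of GWAS p-values into starting frequencies f. The sketch below is illustrative only. The function name `gwas_start_frequencies` and the input layout are assumptions, p-values are assumed to lie strictly between 0 and 1, and averaging over only those studies that report a gene is one possible reading of the text.

```python
import math
from collections import defaultdict


def gwas_start_frequencies(studies):
    """Turn per-study GWAS p-values into per-gene starting frequencies.

    studies : list of dicts, one per GWAS study, mapping gene -> p-value
              (assumed 0 < p < 1, so every log is negative and nonzero)
    Returns a dict gene -> frequency averaged over the studies reporting it.
    (Names and the averaging convention are illustrative assumptions.)
    """
    per_gene = defaultdict(list)
    for pvals in studies:
        # Within one study: log(p_ij) / sum_k log(p_ik); the log base cancels.
        total = sum(math.log(p) for p in pvals.values())
        for gene, p in pvals.items():
            per_gene[gene].append(math.log(p) / total)
    return {gene: sum(vals) / len(vals) for gene, vals in per_gene.items()}


# Toy example with made-up p-values for hypothetical genes "A" and "B".
toy_studies = [{"A": 1e-8, "B": 1e-4}, {"A": 1e-6}]
toy_freq = gwas_start_frequencies(toy_studies)
```

Because both numerator and denominator are logarithms of values below 1, the ratio is positive and smaller p-values receive larger starting frequencies.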

To evaluate our method in the context of cancer, we subdivide the CGC genes that appear in our network into two subsets. We randomly draw from the CGCs 400 genes to form a set H of positives that we aim to uncover. From the remaining 199 CGCs present in the network, we randomly draw a fixed number l to represent the prior knowledge K and run our framework. As we consider an increasing number of most highly ranked genes, we compute the fraction that are in the set H of positives. All CGC genes not in H are ignored in these calculations. Importantly, the genes in K which are used to guide the network propagation are never used to evaluate the performance of uKIN. Note that this testing setup, which measures performance on H, allows us to compare performance of uKIN when choosing prior knowledge sets of different size l from the CGC genes not in H.

We also compute area under the precision-recall curves (AUPRCs). In this case, all CGC genes in H are considered positives, all CGC genes not in H are neutral (ignored), and all other genes are negatives. Though we expect that there are genes other than those already in the CGC that play a role in cancer, this is a standard approach to judge performance (e.g., see [24]) as cancer genes should be highly ranked. To focus on performance with respect to the top predictions, we compute AUPRCs using the top 100 predicted genes. To better estimate AUPRCs and account for the randomness in sampling, we repeatedly draw (10 times) the set H and for each draw we sample the genes comprising the prior knowledge K 10 times. The final AUPRC results from averaging the AUPRCs across all 100 runs.

We compare uKIN on the cancer datasets to the frequency-based method MutSigCV 2.0 [42] and four network-based methods, DriverNet [30], Muffinn [43], nCOP [39] and HotNet2 [25]. All methods are run on each of the 24 cancer types with their default parameters. Muffinn, nCOP and HotNet2 are run on the same network as uKIN, whereas MutSigCV does not use a network and DriverNet instead uses an influence (i.e., functional interaction) graph and transcriptomic data (we use their default influence graph and provide as input TCGA normalized expression data). Since uKIN uses a subset of CGCs as prior knowledge, we ensure that all methods are evaluated with respect to the hidden sets H (i.e., of CGCs not used by uKIN). Though we could just consider performance with respect to one hidden set, considering multiple sets enables a better estimate of overall performance. For these comparisons, uKIN with α = 0.5 is run 100 times, as described above, with 20 randomly sampled genes comprising the prior knowledge, and evaluation is performed with respect to the genes in the hidden sets. All methods’ AUPRCs are computed using the same randomly sampled test sets H and averaged at the end. Since HotNet2 outputs a set of predicted cancer-relevant genes and does not rank them, we cannot compute AUPRCs for it; instead we compute precision and recall for its output with respect to the test sets H and compare to uKIN’s when considering the same number of top scoring genes.

To evaluate our method in the context of the three complex diseases, we subdivide evenly the set of OMIM genes associated with each disease into the prior knowledge set K and the set of positives H. As with the cancer data, we do this repeatedly (100 times) and average AUPRCs at the end.

Results
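To make the evaluation protocol from Methods concrete before presenting results, the sketch below draws a hidden set H and a disjoint prior knowledge set K from a list of CGC genes, and computes the fraction of top-ranked genes that fall in H while skipping CGC genes outside H. It is a simplified illustration: the function names, the use of Python's `random.Random`, and the toy gene labels are assumptions, and the AUPRC computation itself is not reproduced here.

```python
import random


def split_cgc(cgc_genes, n_hidden, n_prior, rng):
    """Draw a hidden set H, then a disjoint prior-knowledge set K.

    rng is a random.Random instance so that draws are reproducible.
    (Function name and signature are illustrative assumptions.)
    """
    shuffled = list(cgc_genes)
    rng.shuffle(shuffled)
    hidden = set(shuffled[:n_hidden])
    prior = set(rng.sample(shuffled[n_hidden:], n_prior))
    return hidden, prior


def hidden_fraction_curve(ranked_genes, hidden, cgc_genes, max_k):
    """Fraction of the top-k ranked genes that are in H, for k = 1..max_k.

    CGC genes that are not in H (including those in K) are treated as
    neutral and dropped from the ranking before counting.
    """
    neutral = set(cgc_genes) - set(hidden)
    kept = [g for g in ranked_genes if g not in neutral]
    hits, curve = 0, []
    for k, gene in enumerate(kept[:max_k], start=1):
        hits += 1 if gene in hidden else 0
        curve.append(hits / k)
    return curve


# Toy example with made-up gene labels: "g2" is a CGC gene outside H,
# so it is ignored and "g3" moves up to rank 2.
toy_curve = hidden_fraction_curve(
    ranked_genes=["g1", "g2", "g3", "g4", "g5"],
    hidden={"g1", "g4"},
    cgc_genes=["g1", "g2", "g4"],
    max_k=3,
)
toy_hidden, toy_prior = split_cgc(
    [f"c{i}" for i in range(10)], n_hidden=4, n_prior=2, rng=random.Random(0)
)
```

Repeating `split_cgc` with different seeds and averaging the resulting curves corresponds to the repeated-sampling scheme described in Methods.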

We first apply our method uKIN to uncover cancer genes. Genes that have missense and nonsense somatic mutations comprise the new information, and random walks start from these genes with probability proportional to their mutation rates. We apply our approach to data from 24 cancer types, but showcase results for glioblastoma multiforme (GBM). All results in the main paper use the HPRD protein-protein interaction network [37], with results shown for BioGrid [38] in the Supplement.

uKIN successfully integrates prior knowledge and new information.

We compare uKIN’s performance when using both prior and new knowledge (RWRs with α = 0.5), to versions of uKIN using either only new information (α = 1) or only prior information (α = 0). Briefly, we use randomly drawn CGCs to represent the prior knowledge K and another set of randomly drawn CGCs to be the hidden set H of unknown cancer-relevant genes that we aim to uncover (see Performance evaluation for details). We repeat this process 100 times, each time spreading signal using the diffusion approach [28] before performing RWRs from the genes observed to be somatically mutated. For each run, we analyze the ranked list of genes output by uKIN as we consider an increasing number of output genes, and average across runs the fraction that are members of the hidden set H consisting of cancer driver genes.

[Figure 2 plots. (a) x-axis: number of genes considered; y-axis: fraction of genes that are CGCs; legend (alpha): 0 (prior), 0.5 (uKIN), 0.5 (unguided), 1 (new). (b) Panels: α = 0 (prior), α = 0.5 (unguided), α = 1 (new); axis: log fold change in AUPRC.]

Figure 2: uKIN successfully integrates new information and prior knowledge. (a) We illustrate the effectiveness of our approach uKIN on the GBM data set and the HPRD protein-protein interaction network using randomly drawn CGCs to represent the prior knowledge. We combine prior and new knowledge using a restart probability of α = 0.5 (blue line). As we consider an increasing number of high scoring genes, we plot the fraction of these that are part of the hidden set of CGCs. As baseline comparisons, we also consider versions of our approach where we use only the new information (α = 1) and order genes by their mutational frequency (green line); where we use new information to perform unguided random walks with α = 0.5 and order genes by their probabilities in the stationary distribution of the walk (which uses new information but not prior information, purple line); and where we use only prior information (α = 0) and order genes based on information propagated from the set of genes comprising our prior knowledge (orange line). Integrating both prior and new sources of information results in better performance. (b) The performance of uKIN when integrating information at α = 0.5 is compared to the three baseline cases where either only prior information is used (α = 0, left) or only new information is used (α = 1, right; unguided RWRs with α = 0.5, middle). In all three panels, for each cancer type, we plot the log ratio of the AUPRC of uKIN with guided RWRs with α = 0.5 to the AUPRC of the other approach. Across all 24 cancer types, using both sources of information outperforms using just one source of information.

For α = 0.5, we observe that a large fraction of the top predicted genes using the GBM dataset are part of the hidden set of known cancer genes (Figure 2a). At α = 1, our method completely ignores both the network and the prior information K and is equivalent to ordering the genes by their mutational frequencies. The very top of the list output by uKIN when α = 1 consists of the most frequently mutated genes (in the case of GBM, this includes TP53 and PTEN). As we consider an increasing number of genes, ordering them by mutational frequency is clearly outperformed by uKIN with α = 0.5. At the other extreme with α = 0, the starting locations and their mutational frequencies are ignored as the random walk is memoryless and the stationary distribution depends only upon the propagated prior information q. As expected, performance is considerably worse than when running uKIN with α = 0.5. Nevertheless, we observe that several CGCs are found for α = 0; this is due to the fact that known cancer genes tend to cluster together in the network [20] and our propagation technique ranks highly the genes close to the genes in K.

We also consider uKIN’s performance as compared to an unguided walk with the same restart probability α = 0.5. In this case, the walk selects a neighboring node to move to uniformly at random. The stationary distribution that the walk converges to depends upon the starting locations and the network topology but is independent of the prior information. Such a walk provides a good baseline to judge the impact the propagated prior information q has on the performance of our algorithm, and is an approach that has been widely applied [14]. As evident in Figure 2a, an unguided walk (purple line) performs considerably worse than uKIN with α = 0.5, highlighting the importance of q in guiding the walk.

Notably, the trends we observe on GBM hold across all 24 cancers (Figure 2b). For each cancer type, we consider the log ratio of the AUPRC of the version of uKIN that uses both prior and new information with α = 0.5 to the AUPRC for each of the other variants. For all 24 cancers, when uKIN uses both prior and new information with α = 0.5, it outperforms the cases when using only prior information (Figure 2b, left) or using only new information (Figure 2b, middle and right). Further, we observe this improvement when using both prior and new information across all cancers for a wide range of α (data not shown), clearly demonstrating that using both sources of information is beneficial.

uKIN is effective in uncovering cancer-relevant genes.

We next evaluate uKIN’s performance in uncovering cancer-relevant genes as compared to several previous methods. These methods do not use any prior knowledge of cancer genes, and any performance differences between uKIN and them may be due either to the use of this important additional source of information or to specific algorithmic differences between the methods. Nevertheless, such comparisons are necessary to get an idea of how well uKIN performs as compared to the current state-of-the-art. All methods are run and AUPRCs computed as described in Methods. First, we compare uKIN with α = 0.5 to MutSigCV 2.0 [42], perhaps the most widely used frequency-based approach to identify cancer driver genes. We find that uKIN outperforms MutSigCV 2.0 on 22 of 24 cancer types (Figure 3a). Next, we compare uKIN to three network-based approaches (Figure 3b): Muffinn [43], which considers mutations found in interacting genes; DriverNet [30], which finds driver genes by uncovering sets of somatically mutated genes that are linked to dysregulated genes; and nCOP [39], which examines the per-individual mutational profiles of cancer patients in a biological network. uKIN exhibits superior performance across all cancer types when compared to DriverNet, and outperforms Muffinn in 23 out of 24 cancer types and nCOP in 17 of the 24 cancer types. In many cases, the performance improvements of uKIN are substantial (e.g., more than a 2-fold improvement for 12, 10, 3 and 4 cancer types for MutSigCV, DriverNet, Muffinn and nCOP, respectively). We also compare to HotNet2 [25], whose core algorithmic component is diffusion [28], and as such uKIN is more similar to it than other methods. HotNet2 does not output a ranked list of genes, so we instead examine the list of genes highlighted by both methods. We find that uKIN exhibits higher precision and recall than HotNet2 for all cancer types (Suppl. Figure S1); since both uKIN and HotNet2 are network propagation approaches, these performance improvements illustrate the benefit of using prior information in identifying cancer-relevant genes.
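The per-cancer comparisons above are expressed as log ratios of AUPRCs. A small sketch of that bookkeeping follows. The base-2 logarithm, the function names, and the toy AUPRC values are assumptions for illustration and are not results from this study.

```python
import math


def log_fold_changes(ukin_auprc, other_auprc):
    """Per-cancer-type log2 ratio of uKIN's AUPRC to another method's AUPRC.

    Both arguments map cancer type -> AUPRC (assumed strictly positive).
    A positive value means uKIN has the higher AUPRC for that cancer type.
    """
    return {c: math.log2(ukin_auprc[c] / other_auprc[c]) for c in ukin_auprc}


def count_above_fold(log_changes, fold=2.0):
    """Number of cancer types where the improvement exceeds `fold`-fold."""
    threshold = math.log2(fold)
    return sum(1 for value in log_changes.values() if value > threshold)


# Toy values for two hypothetical cancer types (not taken from the paper).
toy_ukin = {"type_A": 0.4, "type_B": 0.1}
toy_other = {"type_A": 0.1, "type_B": 0.2}
toy_lfc = log_fold_changes(toy_ukin, toy_other)
toy_count = count_above_fold(toy_lfc, fold=2.0)
```

With a log ratio, a method that is twice as good and one that is half as good sit symmetrically around zero, which is why the figures plot log fold change rather than the raw ratio.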

Robustness tests.

The overall results shown hold when we use different lists of known cancer genes as a gold standard (Suppl. Figure S2a), different numbers of predictions considered when computing AUPRCs (Suppl. Figure S2b), and different networks (Suppl. Figure S2c). Further, we confirm the importance of network structure to uKIN by running uKIN on two types of randomized networks, degree-preserving and label shuffling, and show that, as expected, overall performance deteriorates across the cancer types (Suppl. Figure S2d); we note that while network structure is destroyed by these randomizations, per-gene mutational information is preserved, and thus highly mutated genes are still output.

We also determine the effect of the amount of prior knowledge for uKIN, and find that while performance increases with larger numbers of genes comprising our prior knowledge, even as few as five prior knowledge genes lead to an improvement over ranking genes by mutational frequency (Suppl. Figure S3a). Finally, we investigate the effect of some incorrect prior knowledge, and find that while uKIN’s performance decreases with more incorrect knowledge, uKIN with α = 0.5 performs reasonably as long as the amount of incorrect annotations stays below a threshold (Suppl. Figure S3b).

Alternate formulations.

We also tested guided diffusion from the somatically mutated genes instead of RWRs (see Methods). We empirically find that, for α = 0 . , diffusion with λ = 1 yields nearly identical per-gene scores on the cancer datasets we tested (GBM and kidney renal cell carcinoma). Similarly, for other α, we were able to find values of λ such that the RWRs and diffusion have highly similar results. On the other hand, replacing the initial diffusion from the prior knowledge with an RWR (with α = 0.5) results in somewhat worse performance (e.g., ∼ drop in AUPRC for GBM).

[Figure 3 panels: (a) MutSigCV; (b) DriverNet, Muffinn, nCOP. Axis: Log Fold Change in AUPRC]

Figure 3: uKIN is more effective than other methods in identifying known cancer genes. For each method, for each cancer type, we plot the log ratio of uKIN's AUPRC to its AUPRC. (a) Comparison of uKIN to MutSigCV 2.0, a state-of-the-art frequency-based approach. uKIN outperforms MutSigCV 2.0 on 22 of the 24 cancer types. (b) Comparison of uKIN to DriverNet (left), Muffinn (middle), and nCOP (right). Our approach uKIN outperforms DriverNet on all cancer types, Muffinn on all but one cancer type and nCOP on 17 out of 24 cancer types.

uKIN highlights infrequently mutated cancer-relevant genes.

A major advantage of network-based methods is that they are able to identify cancer-relevant genes that are not necessarily mutated in large numbers of patients [25]. We next analyze the mutation frequency of genes output by uKIN with α = 0 . . In particular, for each cancer type, for each gene, we obtain a final score by averaging scores across the 100 runs of uKIN; to prevent “leakage” from the prior knowledge set, if a gene is in the set of prior knowledge genes K for a run, this run is not used when determining its final score. We confirm that, for all cancer types, the top scoring genes exhibit diverse mutational rates, and include both frequently and infrequently mutated genes (Suppl. Figure S4).

We next highlight some infrequently mutated genes in GBM that are given high final scores by uKIN (i.e., are predicted as cancer-relevant). For example, LAND1A and

SMAD4 are two well-known cancer players that are highly ranked by uKIN, and that have mutational rates in GBM that are in the bottom 70% of all genes and are therefore hard to detect with frequency-based approaches. Of uKIN's top 100 scoring genes, 23 are in the bottom half with respect to mutational rates, and 5 of these are CGCs ( p < − , hypergeometric test). When considering the top scoring 100 genes by uKIN for each cancer type, those that have mutational ranks in the bottom half of all genes are each found to have a statistically significant enrichment of CGC genes. Thus, uKIN provides a means for pulling out cancer genes from the “long tail” [44] of infrequently mutated genes.

In addition to highlighting known cancer genes, uKIN also ranks highly several non-CGC genes that may or may not play a functional role in cancer, as our knowledge of cancer-related genes is incomplete. Among these novel predictions for GBM are ATXN1, SMURF1, and CCR3, all of which have been recently suggested to play a role in cancers [45–47] and are each mutated in less than 5% of the samples. ATXN1 is a chromatin-binding factor that plays a critical role in the development of spinocerebellar ataxia, a neurodegenerative disorder [48], and mutants of ATXN1 have been found to stimulate the proliferation of cerebellar stem cells in mice [49]. This is a promising gene for further investigation because glioblastoma is a cancer that usually starts in the cerebrum and the potential role of ATXN1 in tumorigenesis has only recently been suggested [45]. SMURF1 and its network interactor SMAD1, which is also highly ranked by uKIN, have already been implicated in the development of several cancers [50]. SMURF1 also interacts with the nuclear receptor

TLX whose inhibitory role in glioblastoma has been revealed [51]. Overall, we also find that the top scoring genes by uKIN for GBM are enriched in many KEGG pathways and GO terms relevant for cancer, including microRNAs in cancer, cell proliferation, choline metabolism in cancer and apoptosis (Bonferroni-corrected p < ).

[Figure 4 panels: (a) prior knowledge (x-axis) versus predict (y-axis) for BRCA, GBM, SKCM and THCA; (b) "Spread OMIM" and "Spread GWAS" for ALS, AMD and epilepsy. Axis: Log Fold change in AUPRC]

Figure 4: (a) Use of cancer-type specific knowledge improves performance. For four cancer types, BRCA, GBM, SKCM, and THCA, we consider the performance of uKIN with α = 0 . when using TCGA mutational data for that cancer type with prior knowledge consisting of genes known to be driver in that cancer type, as compared to performance when the prior knowledge set consists of genes that are annotated as driver only for one of the other three cancer types. For each cancer, performance is measured by the average ranking by uKIN of genes known to be driver for that cancer. For all combinations of possible prior knowledge sets (x-axis) and specific cancer gene sets that we wish to recover (y-axis), using prior knowledge from another cancer (off-diagonal entries) leads to a decrease in performance as compared to the corresponding pairs (diagonal entries), as measured by the increase in uKIN's average ranking of genes we aimed to uncover. (b) uKIN is effective in identifying complex disease genes. We demonstrate the versatility of the uKIN framework by integrating OMIM and GWAS data for three complex diseases, ALS, epilepsy and AMD. For each disease, we compare uKIN's performance when using OMIM annotated genes as prior information and GWAS hits as new information with α = 0 . , to baseline versions that propagate only information via diffusion from OMIM (left) or GWAS studies (right). In all cases, we plot the log ratio of the AUPRC obtained by uKIN using both prior and new information to the baseline methods.

Cancer-type specific prior knowledge yields better performance.

In several cases, CGC genes are annotated with the specific cancers they play driver roles in. We next test how uKIN's performance changes when using such highly specific prior knowledge. We consider four cancer types, GBM, breast invasive carcinoma (BRCA), skin cutaneous melanoma (SKCM), and thyroid carcinoma (THCA), with 33, 32, 42 and 29 CGC genes annotated to them, respectively. We repeatedly split each of these sets of genes in half, and use half as the set K of prior knowledge, and the other half as the set H to test performance.

We first use knowledge consisting of genes specific to a cancer type of interest together with the TCGA data for that cancer to uncover that cancer's specific drivers. Given the small number of genes annotated to each cancer, we assess performance by, for each of these genes, computing the rank of its score by uKIN over the splits where these genes are in H. Next, for the same cancer type, we use a set K corresponding to a different cancer type as prior knowledge (excluding any genes that are annotated to the original cancer type) while still trying to uncover the genes in the original cancer of interest (i.e., using TCGA mutational data and H belonging to the original cancer type). That is, we are testing the performance of uKIN when using knowledge corresponding to a different cancer type. For all four cancer types, we find that performance is best when uKIN uses prior knowledge for the same cancer type (Figure 4a), as genes in H appear higher in the list of genes output by uKIN.
This suggests that uKIN can utilize cancer-type specific knowledge and highlights the benefits of having accurate prior information.

Application to identify disease genes for complex inherited disorders.

A major advantage of our method is that it can be easily applied in diverse settings. As proof of concept, we apply uKIN to detect disease genes for three complex diseases: AMD, ALS and epilepsy. For each disease, we randomly split in half the OMIM database's [36] list of genes associated with the disease 100 times to form the set of prior knowledge K and the hidden set H. We use the GWAS catalogue list of genes with their corresponding p-values to form the set M. For all three diseases, uKIN combining both GWAS and OMIM sources of information ( α = 0 . ) performs better than diffusing the signal with λ = 1 using only knowledge from OMIM (Figure 4b, left panel). For each of these diseases, there is virtually no overlap between the GWAS hits M and a set of OMIM genes H; simply sorting genes by their significance in GWAS studies (i.e., uKIN with α = 1) results in AUPRC of 0. Instead, we spread information from the set of GWAS genes M in the same fashion as from OMIM and observe again that using this single source of information alone does not work as well as uKIN's using both GWAS and OMIM information together (Figure 4b, right panel).

Discussion

In this paper, we have shown that uKIN, a network propagation method that incorporates both existing knowledge and new information, is a highly effective and versatile approach for uncovering disease genes. Our method is based upon the intuition that prior knowledge of disease-relevant genes can be used to guide the way information from new data is spread and interpreted in the context of biological networks. Because uKIN uses prior knowledge, it has higher precision than other state-of-the-art methods in detecting known cancer genes. Further, it excels at highlighting infrequently mutated genes that are nevertheless relevant for cancer. Additionally, we have shown that uKIN can be applied to discover genes relevant for other complex diseases as well.

The framework presented here can be extended in a number of natural ways. First, in addition to positive knowledge of known disease genes, we may also have “negative” knowledge of genes that are not involved in the development of a given disease. These genes can propagate their “negative” information, thereby biasing the random walk to move away from their respective modules and perhaps further enhancing the performance of our method. Second, uKIN is likely to benefit from incorporating edge weights that reflect the reliability of interactions between proteins; these weights will have an impact on both the propagation of prior knowledge and the guided random walks. Third, since a recent study [52] has shown that contrasting cancer mutation data with natural germline variation data helps boost the true disease signal by downgrading genes that vary frequently in nature, uKIN's performance may benefit from scaling the starting probabilities of the new putatively implicated genes to account for their variation in healthy populations. Fourth, while here we have demonstrated how uKIN can use cancer-type specific knowledge, cancers of the same type can often be grouped into distinct subtypes, and such highly detailed knowledge may improve uKIN's performance even further. Finally, we note that network propagation approaches have been applied to non-disease settings as well, including biological process prediction [53, 54]. We conjecture that our guided network propagation approach will additionally be useful in other scenarios in computational biology, including where new data (e.g., arising from functional genomics screens) need to be interpreted in the context of what is already known about a biological process of interest.

In conclusion, uKIN is a flexible and effective method that handles diverse types of new information. As our knowledge of disease-associated genes continues to grow and be refined, and as new experimental data becomes more abundant, we expect that the power of uKIN for accurately prioritizing disease genes will continue to increase.

References

[1] 1000 Genomes Project Consortium and others. A global reference for human genetic variation.

Nature, 526(7571):68, 2015.
[2] TCGA Research Network. http://cancergenome.nih.gov/.
[3] Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. bioRxiv, 2019.
[4] Kim YA, Przytycka TM. Bridging the gap between genotype and phenotype via network approaches. Frontiers in genetics, 3:227, 2013.
[5] Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabási AL. The human disease network. Proceedings of the National Academy of Sciences, 104(21):8685–8690, 2007.
[6] Barabási AL, Gulbahce N, Loscalzo J. Network medicine: a network-based approach to human disease. Nature reviews genetics, 12(1):56, 2011.
[7] Cowen L, Ideker T, Raphael BJ, Sharan R. Network propagation: a universal amplifier of genetic associations. Nature Reviews Genetics, 18(9):551, 2017.
[8] Ozturk K, Dow M, Carlin DE, Bejar R, Carter H. The emerging potential for network analysis to inform precision cancer medicine. Journal of molecular biology, 430(18):2875–2899, 2018.
[9] Hartwell L, Hopfield J, Leibler S, Murray A. From molecular to modular cell biology. Nature, 402:C47–52, 1999.
[10] Spirin V, Mirny LA. Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA, 100:12123–12128, 2003.
[11] Oti M, Brunner HG. The modular nature of genetic diseases. Clinical genetics, 71(1):1–11, 2007.
[12] Gandhi T, Zhong J, Mathivanan S, Karthick L, Chandrika K, Mohan SS, Sharma S, Pinkert S, Nagaraju S, Periaswamy B, et al. Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nature genetics, 38(3):285, 2006.
[13] Krauthammer M, Kaufmann CA, Gilliam TC, Rzhetsky A. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's disease. Proceedings of the National Academy of Sciences, 101(42):15148–15153, 2004.
[14] Köhler S, Bauer S, Horn D, Robinson PN. Walking the interactome for prioritization of candidate disease genes. The American Journal of Human Genetics, 82(4):949–958, 2008.
[15] Chen J, Aronow B, Jegga A. Disease candidate gene identification and prioritization using protein interaction networks. BMC Bioinformatics, 10, 2009.
[16] Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R. Associating genes and protein complexes with disease via network propagation. PLoS computational biology, 6(1):e1000641, 2010.
[17] Navlakha S, Kingsford C. The power of protein interaction networks for associating genes with diseases. Bioinformatics, 26(8):1057–1063, 2010.
[18] Erten S, Bebek G, Ewing RM, Koyuturk M. DADA: Degree-aware algorithms for network-based disease gene prioritization. BioData Min, 4:19, 2011.
[19] Smedley D, Köhler S, Czeschik JC, Amberger J, Bocchini C, Hamosh A, Veldboer J, Zemojtel T, Robinson P. Walking the interactome for candidate prioritization in exome sequencing studies of Mendelian diseases. Bioinformatics, 30:3215–3222, 2014.

[20] Cerami E, Demir E, Schultz N, Taylor BS, Sander C. Automated network analysis identifies core pathways in glioblastoma. PLoS ONE, 5(2):e8918, 2010.
[21] Vandin F, Upfal E, Raphael BJ. Algorithms for detecting significantly mutated pathways in cancer. Journal of Computational Biology, 18(3):507–522, 2011.
[22] Lee I, Blom UM, Wang PI, Shim JE, Marcotte EM. Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome research, 21(7):1109–1121, 2011.
[23] Babaei S, Hulsman M, Reinders M, de Ridder J. Detecting recurrent gene mutation in interaction network context using multi-scale graph diffusion. BMC Bioinformatics, 14:29, 2013.
[24] Jia P, Zhao Z. Varwalker: personalized mutation network analysis of putative cancer genes from next-generation sequencing data. PLoS Comput Biol, 10(2):e1003460, 2014.
[25] Leiserson MDM, Vandin F, Wu HT, Dobson JR, Eldridge JV, Thomas JL, Papoutsaki A, Kim Y, Niu B, McLellan M, Lawrence MS, Gonzalez-Perez A, Tamborero D, Cheng Y, Ryslik GA, Lopez-Bigas N, Getz G, Ding L, Raphael BJ. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nature Genetics, 47:106–114, 2015.
[26] Carlin D, Fong S, Qin Y, Jia T, Huang J, Bao B, Zhang C, Ideker T. A fast and flexible framework for network-assisted genomic association. iScience, 16:155–161, 2019.
[27] Lawrence M, Stojanov P, Polak P, Kryukov G, Cibulskis K, Sivachenko A, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature, 499:214–218, 2013.
[28] Qi Y, Suhail Y, Lin Yy, Boeke JD, Bader JS. Finding friends and enemies in an enemies-only network: a graph diffusion kernel for predicting novel genetic interactions and co-complex membership from yeast genetic interactions. Genome research, 18:1991–2004, 2008.
[29] Kim Y, Wuchty S, Przytycka T. Identifying causal genes and dysregulated pathways in complex diseases. PLoS Comput Biol, 7:e1001095, 2011.
[30] Bashashati A, Haffari G, Ding J, Ha G, Lui K, Rosner J, Huntsman DG, Caldas C, Aparicio SA, Shah SP. DriverNet: uncovering the impact of somatic driver mutations on transcriptional networks in cancer. Genome biology, 13(12):R124, 2012.
[31] Paull EO, Carlin DE, Niepel M, Sorger PK, Haussler D, Stuart JM. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics, 29(21):2757–2764, 2013.
[32] Shrestha R, Hodzic E, Yeung J, Wang K, Sauerwald T, Dao P, Anderson S, Beltran H, Rubin MA, Collins CC, et al. Hitndrive: multi-driver gene prioritization based on hitting time. In International Conference on Research in Computational Molecular Biology, pages 293–306. Springer, 2014.
[33] Ruffalo M, Koyutürk M, Sharan R. Network-based integration of disparate omic data to identify "silent players" in cancer. PLOS Computational Biology, 11:e1004595, 2015.
[34] Shi K, Gao L, Wang B. Discovering potential cancer driver genes by an integrated network-based approach. Molecular Biosystems, 12:2921–2931, 2016.
[35] Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR. A census of human cancer genes. Nat Rev Cancer, 4(3):177–83, 2004.
[36] Online Mendelian Inheritance in Man, OMIM®, 2000.
[37] Prasad TK, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al. Human protein reference database–2009 update. Nucleic acids research, 37(suppl 1):D767–D772, 2009.

[38] Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic acids research, 34(suppl 1):D535–D539, 2006.
[39] Hristov BH, Singh M. Network-based coverage of mutational profiles reveals cancer genes. Cell systems, 5(3):221–229, 2017.
[40] Cao M, Zhang H, Park J, Daniels NM, Crovella ME, Cowen LJ, Hescott B. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PloS one, 8(10):e76339, 2013.
[41] Buniello A, MacArthur JAL, Cerezo M, Harris LW, Hayhurst J, Malangone C, McMahon A, Morales J, Mountjoy E, Sollis E, et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic acids research, 47(D1):D1005–D1012, 2018.
[42] Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, Carter SL, Stewart C, Mermel CH, Roberts SA, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature, 499(7457):214–218, 2013.
[43] Cho A, Shim JE, Kim E, Supek F, Lehner B, Lee I. Muffinn: cancer gene discovery via network analysis of somatic mutation data. Genome Biology, 17(1):129, 2016.
[44] Garraway LA, Lander ES. Lessons from the cancer genome. Cell, 153(1):17–37, 2013.
[45] Kang AR, An HT, Ko J, Choi EJ, Kang S. Ataxin-1 is involved in tumorigenesis of cervical cancer cells via the EGFR–RAS–MAPK signaling pathway. Oncotarget, 8(55):94606, 2017.
[46] Li H, Xiao N, Wang Y, Wang R, Chen Y, Pan W, Liu D, Li S, Sun J, Zhang K, et al. Smurf1 regulates lung cancer cell growth and migration through interaction with and ubiquitination of PIPKIγ. Oncogene, 36(41):5668, 2017.
[47] Lee YS, Kim SY, Song SJ, Hong HK, Lee Y, Oh BY, Lee WY, Cho YB. Crosstalk between CCL7 and CCR3 promotes metastasis of colon cancer cells via erk-jnk signaling pathways. Oncotarget, 7(24):36842, 2016.
[48] Rousseaux MW, Tschumperlin T, Lu HC, Lackey EP, Bondar VV, Wan YW, Tan Q, Adamski CJ, Friedrich J, Twaroski K, et al. ATXN1-CIC complex is the primary driver of cerebellar pathology in spinocerebellar ataxia type 1 through a gain-of-function mechanism. Neuron, 97(6):1235–1243, 2018.
[49] Edamakanti CR, Do J, Didonna A, Martina M, Opal P. Mutant ataxin1 disrupts cerebellar development in spinocerebellar ataxia type 1. The Journal of clinical investigation, 128(6):2252–2265, 2018.
[50] Yang D, Hou T, Li L, Chu Y, Zhou F, Xu Y, Hou X, Song H, Zhu K, Hou Z, et al. Smad1 promotes colorectal cancer cell migration through Ajuba transactivation. Oncotarget, 8(66):110415, 2017.
[51] Johansson E, Zhai Q, Zeng Zj, Yoshida T, Funa K. Nuclear receptor TLX inhibits TGF-β signaling in glioblastoma. Experimental cell research, 343(2):118–125, 2016.
[52] Przytycki PF, Singh M. Differential analysis between somatic mutation and germline variation profiles reveals cancer-related genes. Genome medicine, 9(1):79, 2017.
[53] Wang P, Marcotte E. It's the machine that matters: Predicting gene function and phenotype from protein networks. J Proteomics, 73:2277–2289, 2011.
[54] Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 21 Suppl. 1:i302–i310, 2005.
[55] Hofree M, Carter H, Kreisberg JF, Bandyopadhyay S, Mischel PS, Friend S, Ideker T. Challenges in identifying cancer genes by analysis of exome sequencing data. Nature Communications, 7, 2016.
[56] Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz J L A, Kinzler KW. Cancer genome landscapes.

Science, 339(6127):1546–58, 2013.

Supplementary Figures and Tables

The following pages contain 1 table and 4 supplementary figures that support the findings of the main paper.

                                                                                    Mutated genes
Symbol  Cancer Type                                                       Patients  Total   Average  Cut off
ACC     Adrenocortical carcinoma                                          76        2068    32.1     80
BLCA    Bladder Urothelial Carcinoma                                      196       11407   135.7    300
BRCA    Breast invasive carcinoma                                         882       10813   27       80
CESC    Cervical squamous cell carcinoma and endocervical adenocarcinoma  173       6907    63       200
COAD    Colon adenocarcinoma                                              153       6521    74.4     150
GBM     Glioblastoma multiforme                                           278       7250    46.8     80
HNSC    Head and Neck squamous cell carcinoma                             435       13048   87.9     200
KICH    Kidney Chromophobe                                                64        661     11       50
KIRC    Kidney renal clear cell carcinoma                                 416       9212    40.9     100
KIRP    Kidney renal papillary cell carcinoma                             166       5687    47.7     100
LGG     Brain Lower Grade Glioma                                          451       7130    28.8     60
LIHC    Liver hepatocellular carcinoma                                    196       7705    67.3     200
LUAD    Lung adenocarcinoma                                               487       15481   172.8    500
LUSC    Lung squamous cell carcinoma                                      167       12264   212      500
OV      Ovarian serous cystadenocarcinoma                                 138       3390    30.7     80
PAAD    Pancreatic adenocarcinoma                                         124       3228    36.8     100
PCPG    Pheochromocytoma and Paraganglioma                                183       1819    11.7     30
PRAD    Prostate adenocarcinoma                                           238       4792    28.1     50
READ    Rectum adenocarcinoma                                             34        1214    40.7     150
SKCM    Skin Cutaneous Melanoma                                           329       14748   240.1    1000
STAD    Stomach adenocarcinoma                                            242       10595   103.5    500
THCA    Thyroid carcinoma                                                 401       2268    7.4      30
UCEC    Uterine Corpus Endometrial Carcinoma                              155       4282    38.8     100
UCS     Uterine Carcinosarcoma                                            54        1787    38.9     80

Table S1: TCGA dataset and statistics. We list the 24 cancer types studied along with their abbreviations. For each cancer type, we give the total number of patient samples considered after highly mutated samples are filtered out, the total number of mutated genes across these samples, the average number of mutated genes across all samples, and the cutoff on the number of mutated genes within a sample that was used to filter samples.
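The per-cohort statistics described in the Table S1 caption can be sketched as follows. This is an illustrative reading only: the function name and toy data are hypothetical, and treating the cutoff as inclusive (samples with more mutated genes than the cutoff are dropped) is an assumption, since the caption does not say whether the boundary is inclusive.

```python
def summarize_cohort(sample_to_genes, cutoff):
    """Filter highly mutated samples, then compute Table S1-style statistics.

    sample_to_genes maps a sample id to an iterable of mutated gene names.
    Assumption: a sample is kept if it has at most `cutoff` mutated genes.
    """
    kept = {
        sample: set(genes)
        for sample, genes in sample_to_genes.items()
        if len(set(genes)) <= cutoff
    }
    # Union of mutated genes over the retained samples.
    all_genes = set().union(*kept.values()) if kept else set()
    # Mean number of mutated genes per retained sample.
    average = sum(len(g) for g in kept.values()) / len(kept) if kept else 0.0
    return {
        "patients": len(kept),
        "total_mutated_genes": len(all_genes),
        "average_per_sample": average,
    }


# Hypothetical toy cohort: sample s3 exceeds the cutoff and is filtered out.
toy_cohort = {
    "s1": ["A", "B"],
    "s2": ["B", "C", "D"],
    "s3": ["A", "B", "C", "D", "E"],
}
toy_summary = summarize_cohort(toy_cohort, cutoff=3)
```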

[Figure S1 plot: precision and recall per cancer type for Hotnet2 and uKIN]

Figure S1: Comparison between uKIN and Hotnet2. For each cancer type, we compute the precision and recall of the genes returned by uKIN with α = 0.5 and Hotnet2. Hotnet2 is run with default parameters (100 permuted networks, and β = 0.2 for the restart probability for the insulated heat diffusion process). Hotnet2 outputs a set of genes predicted to be cancer-relevant, and these genes are not ranked. Thus, for uKIN, we consider the same number of top scoring genes as output by Hotnet2. uKIN exhibits both higher precision and higher recall than Hotnet2 across all 24 cancer types.

[Figure S2 panels: (a), (b), (c) comparisons to alpha = 0, alpha = 1, Muffinn and nCOP for each cancer type; (d) randomized networks (label swap, edge swap). Axis: Log Fold Change in AUPRC]

Figure S2: Robustness of uKIN. (a) To make sure that the results reported for uKIN in Figures 2 and 3 are robust with respect to the set of labelled cancer genes H, instead of randomly sampling 400 genes from the Cancer Gene Census (CGC) list, we form H using genes from other sources. Specifically, we aggregate the cancer genes provided by Hofree et al. in [55] (which they obtained by querying the UniprotKB database for the keyword-terms ‘proto-oncogene,’ ‘oncogene’ and ‘tumour suppressor’ gene) and Vogelstein et al. [56], excluding any genes present in the set of prior knowledge K. Log-fold AUPRCs are computed as described in the main text. The results are consistent with those shown in Figures 2 and 3 based on the CGC list and show the superior performance of uKIN as compared to the other methods in recapitulating known cancer genes. (b) To make sure that the results reported for uKIN in Figures 2 and 3 are robust with respect to the number of genes used in evaluation, we compute AUPRCs using the top 50 predicted genes. The results are consistent with those shown in Figures 2 and 3, which use the top 100 predicted genes, and show the superior performance of uKIN as compared to the baselines and other methods in recapitulating known cancer genes. The results are also consistent when computing AUPRCs using 150 genes (data not shown). (c) To make sure that our method is robust with respect to the specific network utilized, we repeat our entire analysis procedure for uKIN with α = 0 . using the BioGRID network. The results are consistent with those shown in Figures 2 and 3, based on the HPRD network. (d) To make sure our method utilizes network structure appropriately, we also consider performance of uKIN on the real HPRD network as compared to randomized HPRD networks. In the left panel, we use a node label shuffling randomization where the network structure is maintained but gene names are swapped (thereby genes can have very different numbers of interactions in the randomizations). In the right panel, we use a classic degree-preserving randomization (edge swapping). For each of the 24 cancers, we compute the log ratio of the area under the precision recall curve using uKIN with α = 0 . on the real network and on the randomized network and show the average over 10 different randomizations. Performance, as expected, is worse for both randomizations across all cancers. We note that significant cancer-relevant information is retained in these randomized networks. In particular, in both types of network randomizations, we maintain the relationships between genes and the samples that they are found to be somatically mutated in. Thus, some highly mutated CGC genes may still be output by uKIN when running on randomized networks.
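The two randomizations described in panel (d) can be sketched in plain Python as below. This is a generic illustration under stated assumptions, not the procedure used to produce the figure: the function names, the number of swaps, the retry limit and the toy network are all hypothetical, and the network is assumed to be simple and undirected (no self-loops, no duplicate edges).

```python
import random


def shuffle_labels(edges, rng):
    """Node label shuffling: keep the wiring, permute which gene sits at each node."""
    nodes = sorted({node for edge in edges for node in edge})
    permuted = nodes[:]
    rng.shuffle(permuted)
    mapping = dict(zip(nodes, permuted))
    return {frozenset((mapping[u], mapping[v])) for u, v in edges}


def degree_preserving_swap(edges, n_swaps, rng, max_tries=10000):
    """Edge swapping: rewire (a-b, c-d) to (a-d, c-b), keeping every node's degree."""
    edge_set = {frozenset(edge) for edge in edges}
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        e1, e2 = rng.sample(sorted(edge_set, key=sorted), 2)
        a, b = sorted(e1)
        c, d = sorted(e2)
        if rng.random() < 0.5:  # pick one of the two possible rewirings
            c, d = d, c
        if len({a, b, c, d}) < 4:  # shared endpoint would create a self-loop
            continue
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:  # would duplicate an edge
            continue
        edge_set -= {e1, e2}
        edge_set |= {new1, new2}
        done += 1
    return edge_set


def degrees(edges):
    """Number of interactions per node."""
    counts = {}
    for edge in edges:
        for node in edge:
            counts[node] = counts.get(node, 0) + 1
    return counts


# Hypothetical toy network: an 8-node ring with two chords.
toy_edges = [(i, (i + 1) % 8) for i in range(8)] + [(0, 4), (2, 6)]
toy_rng = random.Random(0)
toy_swapped = degree_preserving_swap(toy_edges, n_swaps=5, rng=toy_rng)
toy_shuffled = shuffle_labels(toy_edges, toy_rng)
```

Edge swapping leaves each gene with exactly its original number of interactions, whereas label shuffling preserves only the overall degree sequence, so an individual gene may end up with a very different degree, as the caption notes.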

[Figure S3 panels: (a) x-axis: number of genes used as prior knowledge; (b) x-axis: fraction of incorrect prior knowledge. y-axis in both panels: Log Fold change in AUPRC]

Figure S3: (a) uKIN benefits from more knowledge. As we consider larger numbers of genes comprising the set of prior knowledge ( |K| = 5 , , , , . . . , ), we examine the ability of uKIN to uncover CGC genes in the same fixed set H when using α = 0 . (blue triangles), α = 0 . (pink circles) or α = 0 (orange squares). uKIN is run on the HPRD network with the kidney renal clear cell carcinoma (KIRC) dataset. We show the log ratio, averaged over 100 runs, of the AUPRC of each version of uKIN to the AUPRC for α = 1, which is constant across all possible K (and corresponds to the case where genes are ranked by mutational frequency). For small K, α = 0 performs poorly, as is expected; as the prior knowledge available increases so does the performance. For both α = 0 . and α = 0 . , an increase in the size of K leads to an initial increase in the performance but eventually performance plateaus. When limited prior knowledge is available ( |K| < ), α = 0 . , which uses more of the new information, does better than α = 0 . , which relies more on using prior knowledge. When prior knowledge is abundant ( |K| > ), uKIN with α = 0 . outperforms α = 0 . . As the number of genes comprising the set of prior knowledge increases, spreading information just from those genes (α = 0) improves in performance. This is consistent with the observed clustering of CGC genes within biological networks [20]. However, even when propagating information from 100 known cancer genes, the performance is worse than that when integrating it with new information (with either α = 0 . or α = 0 . , Figure 3a). (b) uKIN is robust to small amounts of erroneous knowledge. We replace a fraction of the CGCs in the set of prior knowledge genes K with non-cancerous genes chosen uniformly at random from the set of non-CGC genes in the network. We consider the performance for uKIN with α = 0 and α = 0 . when 0%, 10%, 20% and 30% of the prior knowledge genes are replaced with non-cancer genes. 100 randomizations are performed at each level of incorrect knowledge. For each run, performance is measured as the log ratio of the AUPRC of uKIN (with either α = 0 or α = 0 . ) to the AUPRC for the case where uKIN is run with α = 1 (which is constant). uKIN is run on the HPRD network with the KIRC dataset with 20 CGC genes comprising the prior knowledge. Violin plots of this measure are shown for α = 0 (orange) and α = 0 . (blue), jittered around the 0%, 10%, 20% and 30% tick marks. At α = 0 . , while performance steadily decreases, uKIN remains robust to some incorrect knowledge ( ≤ ). As expected, for α = 0, the decrease is more notable even when 10% of the prior knowledge is incorrect because in that case uKIN uses only prior knowledge.
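The corruption step in panel (b) can be sketched as follows. This is a minimal illustration, assuming the number of replaced genes is the requested fraction of |K| rounded to the nearest integer; the function name, the gene names and the rounding rule are hypothetical and not taken from the paper.

```python
import random


def corrupt_prior(prior, non_cancer_pool, fraction, rng):
    """Replace a fraction of prior-knowledge genes with random non-cancer genes.

    Assumption: the number of replaced genes is round(fraction * len(prior)).
    Replacement genes are drawn uniformly without replacement from the pool.
    """
    prior_sorted = sorted(prior)
    n_replace = round(fraction * len(prior_sorted))
    dropped = set(rng.sample(prior_sorted, n_replace))
    candidates = sorted(set(non_cancer_pool) - set(prior_sorted))
    added = set(rng.sample(candidates, n_replace))
    return (set(prior_sorted) - dropped) | added


# Hypothetical gene names: 20 prior-knowledge genes and 100 non-cancer genes.
toy_prior = {f"CGC{i}" for i in range(20)}
toy_pool = {f"NON{i}" for i in range(100)}
toy_noisy = corrupt_prior(toy_prior, toy_pool, 0.10, random.Random(1))
```

Because each dropped gene is matched by one added gene, the size of the prior knowledge set stays fixed across corruption levels, so changes in performance can be attributed to the incorrect annotations rather than to a smaller set.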
[Figure S4 plot. Panels, one per cancer type: ACC, BLCA, BRCA, CESC, COAD, GBM, HNSC, KICH, KIRC, KIRP, LGG, LIHC, LUAD, LUSC, OV, PAAD, PCPG, PRAD, READ, SKCM, STAD, THCA, UCEC, UCS. y-axis: mutational rank. Legend: CGC, novel.]

Figure S4: uKIN identifies rarely mutated genes.

To illustrate uKIN's ability to predict genes as cancer-relevant even if they are mutated in fewer individuals, we consider the mutation rates of uKIN's top-scoring genes. For each cancer type, we run uKIN 100 times with α = 0 . and 20 genes as prior knowledge (see Methods). For each gene, its final score is obtained by averaging its scores (arising from the stationary distributions) across the runs; if a gene is in the set of prior knowledge genes K for a run, this run is not considered for its final score. For each of the genes with highest final scores, we consider the rank of its mutation rate (y-axis). The mutation rate of a gene is computed as the number of observed somatic missense and nonsense mutations across tumors of that cancer type, divided by the number of amino acids in the encoded protein. Then, for each cancer type, genes are ranked by mutation rate, where the gene with the highest mutation rate is given the lowest rank. Known CGC genes are in red and novel predictions in blue. The top predictions consist of many heavily mutated genes (i.e., those with low ranks), but uKIN