Biological Random Walks: integrating heterogeneous data in disease gene prioritization
Michele Gentili, Leonardo Martini, Manuela Petti, Lorenzo Farina, Luca Becchetti
Department of Computer, Control, and Management Engineering "A. Ruberti", Sapienza University of Rome, Italy
{surname}@diag.uniroma1.it
Abstract: This work proposes a unified framework to leverage biological information in network propagation-based gene prioritization algorithms. Preliminary results on breast cancer data show significant improvements over state-of-the-art baselines, such as the prioritization of genes that are not identified as potential candidates by interactome-based algorithms, but that appear to be involved in, or potentially related to, breast cancer, according to a functional analysis based on recent literature.
Index Terms: Gene prioritization, interactome, PPI network, flow propagation algorithms, Gene Ontology.
I. INTRODUCTION AND RELATED WORK
Today, the integration of big data, genomics, and quantitative in silico methodologies has the potential to push forward the frontiers of medicine in an unprecedented way [8], [17]. Clinicians, diagnosticians and therapists have long striven to determine single molecular traits that lead to diseases. What they had in mind was the idea that a single "golden bullet" drug might provide a cure. Unfortunately, individual diseases rarely share the same mutations. This reductionist approach largely ignores the essential complexity of human biology. Indeed, a large body of evidence now emerging from new genomic technologies points directly to perturbations within the interactome, i.e. the comprehensive network map of molecular components and their interactions, as the cause of disease [8].
As a matter of fact, a growing body of knowledge reveals the association between groups of interacting proteins and disease within the so-called human interactome, representing the cellular network of all physical molecular interactions [6]. Precisely, the human interactome is composed of direct physical, regulatory (transcription factor binding), binary, metabolic enzyme-coupled, protein complex and kinase/substrate interactions. This network is largely incomplete, as are the known connections between genes and disease. Currently, more than 140,000 interactions between more than 13,000 proteins are known (see e.g. [17], [21]). The interactome-based network medicine approach [6] has proved to be very effective in the study of many diseases, e.g. by identifying putative biomarkers and subtypes to provide a rational approach to drug targeting [6], [26].
"Disease proteins" are the product of genes whose mutations have a causal effect on the respective phenotype. In other words, such proteins work together in a network that gives rise to a cellular function, and its disruption ends up in a specific disease phenotype.
Disease proteins may provide targets for cancer therapy: for example, imatinib targets the BCR-ABL fusion and gefitinib binds and inhibits EGFR [26]. However, the big picture is far more complicated, since a large variety of factors affect the effectiveness of a given drug for a specific patient. For example, targeted inhibition of BRAF V600E in patients harboring this mutation is very effective in melanoma, but not in colorectal cancer [26]. Improving precision therapy requires new approaches able to capture information about molecular mechanisms by characterizing the disease proteins that cause the disruption of tumor-driving pathways.
A key property of the underlying molecular network of interactions is that disease proteins are not found to be uniformly scattered across the interactome, but tend to interact with one another, confined in one or several subgraphs called disease modules [23]. In fact, disease proteins are prone to participate in common biological activities such as, for example, genome maintenance, cell differentiation or growth signaling, which are the most relevant pathways in carcinogenesis [26]. Consequently, the module property also reflects the biological feature that disease proteins are often localized in specific biological compartments (pathway, cellular space, or tissue).
These considerations directly point towards the possibility that, whenever a disease module sub-network is found, other disease-related parts are likely to be identified in its topological neighborhood [6]. However, notwithstanding a strong community commitment to find new protein interactions and relevant mutations for disease characterization, the list is largely incomplete.
Moreover, identification of specific disease genes is often impaired by gene pleiotropy, by the multi-genic nature of many diseases, by the influence of a plethora of environmental agents, and by genome variability [7].
The need for new disease genes (or disease proteins) as putative candidates for diagnosis, treatment or drug targeting motivated the development of a number of algorithms for predicting disease genes and modules [16]. The key question is whether it is possible to find a way to fully characterize such genes (with respect to non-disease genes) and find an algorithm able to capture such features. From a network perspective, the goal is to find correlations between disease gene location on the interactome and the network topology. In other words, one hypothesizes that disease genes are embedded within modules in ways that are amenable to some topological feature descriptor. The recent evidence-based biological observation [23] that disease genes are not randomly positioned in the interactome has opened new possibilities for developing algorithms for disease gene prediction.
Two groups of methodologies have emerged in the last decade as the most promising ones: network propagation [11] and module-based [6], [16] algorithms. Network propagation (or diffusion-based) algorithms rely on the assumption that the information contained in the initial (known) set of disease genes flows across the network through nearby proteins. By contrast, module-based algorithms rely on the hypothesis that all cellular components that belong to the same topological, functional or disease module have a high likelihood of being involved in the same disease.
From the above discussion, it is clear that prioritizing candidate disease genes using the interactome, i.e. the network of physical protein interactions, and mutational data (known disease genes or seeds), is still a largely open problem.
A reliable prioritization (or ranking) of newly predicted disease genes is very important from a biological viewpoint, since it provides valuable information on the putative specific activity of a gene in the development of a disease. Simply put, the smaller its rank position, the more likely a gene is to be a "true" disease one. This allows providing experimenters/clinicians with an ordered list of potentially interesting genes for further scrutiny, possibly speeding up the complex and costly task of identifying the most promising candidates.
Our contribution. In this work, we provide a unified framework to leverage biological information in network propagation-based gene prioritization algorithms. This leads to significant improvements over state-of-the-art algorithms. In more detail, we modify a well-known random walk-based flow propagation algorithm [20], changing the dynamics of flow propagation according to the functional relevance of nodes for the disease under consideration. We considered the same diseases as [16] and multiple biological data sources. In the remainder however, for the sake of space and clarity of exposition, we focus on breast cancer as a use case and we adopt Gene Ontology Annotations - biological process [10] (GO in the remainder) as added biological information.
In the output ranking of the algorithm we almost double the number of known disease genes in the first 50 positions with respect to state-of-the-art baselines, in particular DIAMOnD [16] and Random Walk with Restart [20]. Moreover, some very promising candidates are prioritized by our algorithm but not by the baselines. Of these, some were only recently associated with breast cancer, while others appear to be potentially related to the disease according to a functional analysis based on recent literature, as discussed in Section II.
Roadmap. The rest of this paper is organized as follows. Section II gives an overview of our findings and their potential biological relevance. Section III provides background about the baselines we considered, a more detailed account of our approach and of the experimental setting. Due to space limitations, it was only possible to include the most significant results and a concise report of the experimental evidence supporting the design choices we made.
II. RESULTS AND DISCUSSION
In this section, we present the main findings of our work. In particular, we discuss the benefits of leveraging both biological and interactome information within gene prioritization algorithms. To this purpose, we compared state-of-the-art prioritization algorithms that only rely on analysis of the interactome, namely DIAMOnD [16] and Random Walk with Restart [20], with two heuristics we propose: i) Biological Node Relevance (BNR) only leverages biological information (e.g., annotations) to prioritize genes; ii) Biological Random Walk (BRW) is a random walk-based heuristic that, unlike [20], also leverages biological information to bias the random walk toward genes that are functionally closer to known disease genes according to current literature.
BRW builds on the hypothesis that integrating different biological information sources may better reflect the complexity of protein interactions in a cell's processes. In light of this insight, our algorithm integrates information on pairwise protein interactions, reflected in the Protein-Protein Interaction (PPI) network [6], with other biological data in a unified framework. Our approach is to some extent agnostic to the particular biological data source, as long as it affords a principled notion of similarity between proteins. So for example, while we focus on gene annotation data in this paper, the same approach can be adopted to integrate different sources of biological information, e.g. miRNA targets or pathway annotation data.
A. The role of biological information
We began by investigating the potential role of high-quality biological information (gene annotations in this case) in prioritizing new candidate genes. To this purpose, we designed a very simple heuristic that ranks genes of the PPI network according to the degree of their co-occurrence in biological processes, completely disregarding mutual interaction properties encoded by the PPI itself. Our Biological Node Relevance heuristic (BNR in the remainder) prioritizes genes only on the basis of their functional similarity with a seed set of known disease genes, with similarity measured on the basis of annotation data from the Gene Ontology database [10] according to well-established similarity indices adopted in Data Science. While details are provided in Section III, a high-level description of BNR is given in the paragraphs that follow.
Given a seed set S of known disease genes, BNR ranks new candidate genes according to the following steps:
1) We first compute the set of statistically significant annotations for genes belonging to the seed set S. We call this the enriched set of annotations for the disease.
2) BNR then ranks each node i according to its biological relevance BNR(i), namely, the extent of the overlap between the enriched set and the set of the gene's annotations.
Whilst it is reasonable to expect that curated, high-quality annotations are likely to contain information that can be leveraged for the purpose of gene prioritization, we observed that BNR outperforms state-of-the-art topology and flow-based prioritization heuristics, in particular DIAMOnD [16] and the well-established diffusion method based on random walks with restart [20]. In particular, as shown in Figure 5, BNR consistently recovers a larger fraction of known disease genes among its top-k ranking candidates. This result suggests that biological annotations (and, hopefully, other curated data) contain rich information, which is not implicit in PPI networks and thus cannot be leveraged by standard topology or flow-based methods.
Fig. 1. PPI single experiment result. The figure shows: the split of the known disease genes into train (seed) and test (validation) nodes, the nodes retrieved by the different algorithms, and the names of the nodes that have been highly ranked by BRW. The size of train and test nodes is proportional to their BNR score; indeed, node FGF10 has a very high score, which induces a high ranking despite it having only one connection. Interestingly, BRW finds nodes retrieved by the RWR and DIAMOnD algorithms. Furthermore, the names of other nodes highly ranked by BRW that are not present in the test set are shown; they seem to be promising candidates as breast cancer related genes, such as RAD50 [18] and XRCC2 [27].
B. A unified framework
The Biological Random Walk (BRW in the remainder) heuristic provides a framework to integrate heterogeneous biological data sources within diffusion-based prioritization methods that are based on the well-known Random Walk with Restart algorithm (RWR). For the sake of exposition, in the remainder we refer to the biological information associated with a gene i (e.g., the set of its annotations) as the set of its labels, denoted by labels(i). In this study, we only used annotations from the Gene Ontology (GO in the remainder) database to define labels, since at the moment it is one of the most complete and best curated available datasets. We remark however, that in principle any reliable information source on gene biology can be integrated. BRW ranks genes according to the following steps:
1) We compute the set of statistically significant annotations of known disease genes, as for the BNR heuristic, i.e., the enriched set.
2) Rather than using the standard method (see the footnote after this list), we compute individual teleporting probabilities for all nodes of the PPI. In particular, the Biological Teleporting Probability (BTP) of a node increases with the similarity between its labels and the enriched set (details in Section III).
3) In a similar fashion, we weigh PPI network interactions using node annotations and the enriched set. This results in a modified random walk, namely the Biological Random Walk (BRW), in which flow propagation is biased toward genes that are functionally closer to those forming the seed set.
4) Finally, we rank genes according to their Biological Random Walk (BRW) score.
Footnote: Whilst details are given in Section III, we recall here that in the standard RWR approach [20], the probability of restarting the random walk from a given seed node (disease gene) is the same for all seed nodes, while it is 0 for the other nodes of the PPI.
As the example in Figure 5 highlights, BRW not only propagates flow to and from known disease genes, but also involves a broader set of genes that are functionally related to disease ones, though themselves not directly related to the disease, at least to the best of our knowledge.
The results of Figure 2 suggest that BRW leverages both heterogeneous sources of biological information. In particular, it significantly outperforms RWR and DIAMOnD, but it also achieves better recall than the BNR baseline across the entire spectrum of the values k that we considered. We also note that best results are achieved using a value . for the restart probability. This intuitively means that the best candidates are mostly found in the vicinity of disease genes or of genes that are functionally related to them.
Beyond this internal validation of a more quantitative nature, the paragraphs that follow report and discuss anecdotal evidence as to the potential biological interest of candidate genes that are prioritized by our algorithm, but are not part of the pool of known disease ones.
Fig. 2. Recall@k scores show how BRW performs better than other state-of-the-art techniques. The recall is computed by splitting the known disease genes into two groups, the seed nodes and the test nodes. The former are used to run the algorithm, the latter are used as validation for the output of the algorithms. Recall@k is the percentage of nodes of the test set discovered in the first k positions of the ranked lists that are the output of the algorithm. The splitting of the genes has been repeated 100 times, and the score is the average of the experiment results. BRW on average is able to find more than of the test nodes in the first 200 ranked genes.
C. Functional analysis
Table I reports genes prioritized by our BRW algorithm only. We therefore briefly discuss the relevance of some of them to breast cancer, which is the most common malignancy in women [5] and has the second highest incidence among all types of cancer worldwide. Notably, the list in Table I contains two members of the erbB family, which is composed of closely related genes: erbB (her), erbB-2 (her-2, neu), erbB-3 (her-3), and erbB-4 (her-4). These genes encode members of the epidermal growth factor (EGF) receptor family of receptor tyrosine kinases. In particular, the erbB-2 gene is a proto-oncogene. In fact, overexpression of ErbB-2 leads to transformation, tumorigenicity, and metastasis. These findings support the implication of ErbB-2 as a major player in breast cancer initiation and/or progression. Moreover, targeting of ErbB-2 has proved to be effective for drug development [31]. Overexpression of human epidermal growth factor receptor-2 (ErbB-2) has been found in 20-30% of breast cancer patients and is widely recognized as a reliable marker for metastasis formation, drug resistance and high aggressiveness. Among all of the drugs that target ErbB-2, trastuzumab, pertuzumab, trastuzumab emtansine and lapatinib have been proven to be effective in several clinical trials [32]. Another important gene in our list is vegfa, a member of the Vascular Endothelial Growth Factor (VEGF) family, which plays an important role in multiple physiologic and pathologic processes involving endothelial cells. Several lines of preclinical and clinical evidence support its relevance in breast cancer and, consequently, numerous anti-VEGF drugs are now under clinical evaluation [30]. Interestingly, gene fgf9 of our list plays a role in many tumours, like breast cancer, that contain different populations of cells which may show increased resistance to anticancer drugs. There is now evidence of "cancer stem-like cells", which are important for survival and expansion of normal stem cells.
It has been reported that, in analogy to embryonic mammary epithelial biology, estrogen signaling expands the pool of functional breast cancer stem-like cells through a paracrine Fgf/Fgfr/Tbx3 signaling pathway [15]. Moreover, the bmp4 gene in our list encodes the bone morphogenetic protein 4, which is a key regulator of cell proliferation and differentiation. In breast cancer cells, bmp4 is able to reduce proliferation and induce migration, invasion and metastasis formation in vitro [3]. Last (but not least), we found in our list gene p63, which is a transcription factor of the p53 gene family, widely known to play a fundamental role in the development of all the stratified squamous epithelia, including breast [12].
D. Robustness
We briefly mention here the robustness of our results to the presence of possible noise in both interactome and annotation data: our framework is resilient to degree-preserving random shuffling of the graph [24], while its performance partially decreases when noise is added to the annotations.
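A degree-preserving shuffle of the kind mentioned above can be sketched with repeated double edge swaps. This is only an illustrative assumption about how such a randomization may be carried out; the exact procedure of [24] is not described here, and the function names `degree_preserving_shuffle` and `degrees` are hypothetical.

```python
import random


def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Randomize an undirected simple graph with double edge swaps.

    An accepted swap replaces edges (a, b) and (c, d) with (a, d) and (c, b),
    so every node keeps exactly the degree it had in the input graph.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    done, attempts = 0, 0
    max_attempts = 100 * n_swaps  # give up eventually on very constrained graphs
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        # Reject swaps touching fewer than four distinct nodes (self-loop risk).
        if len({a, b, c, d}) < 4:
            continue
        # Reject swaps that would duplicate an existing edge.
        if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
            continue
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


def degrees(edges):
    """Return a dict mapping each node to its degree."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


# Toy graph (hypothetical node ids): a 6-cycle plus one chord.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
shuffled = degree_preserving_shuffle(toy_edges, n_swaps=10)
```

The sketch rejects any swap that would create a self-loop or a duplicate edge, so the output remains a simple graph with the same degree sequence as the input.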
Fig. 3. PPI robustness analysis: Recall@k is measured after shuffling the PPI interactions while keeping each node's degree (continuous lines), and after shuffling the nodes' annotations (dashed lines). As we can see, all algorithms that use only topological information fail when their input is noisy. BRW decreases by 2/3 when shuffling the annotations: relying on two different sources, BRW is more resilient to noise.
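The Recall@k measure used in Figs. 2 to 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the training fraction is left as a parameter, training genes are removed from the ranking before scoring, and the names `recall_at_k`, `mean_recall_at_k` and `rank_fn` are hypothetical rather than taken from the authors' code.

```python
import random


def recall_at_k(ranking, test_genes, k):
    """Fraction of test genes that appear in the first k positions of ranking."""
    test = set(test_genes)
    if not test:
        raise ValueError("test set must not be empty")
    hits = sum(1 for gene in ranking[:k] if gene in test)
    return hits / len(test)


def mean_recall_at_k(rank_fn, disease_genes, k, runs, train_fraction, seed=0):
    """Average Recall@k over repeated random train/test splits.

    rank_fn(train_genes) must return a list of genes, best candidate first.
    Training genes are dropped from the ranking (an assumption of this sketch).
    """
    genes = sorted(disease_genes)
    if len(genes) < 2:
        raise ValueError("need at least two disease genes to split")
    rng = random.Random(seed)
    n_train = max(1, min(len(genes) - 1, round(train_fraction * len(genes))))
    total = 0.0
    for _ in range(runs):
        rng.shuffle(genes)
        train, test = genes[:n_train], genes[n_train:]
        train_set = set(train)
        ranking = [g for g in rank_fn(train) if g not in train_set]
        total += recall_at_k(ranking, test, k)
    return total / runs


# Toy usage with made-up gene names: one of three test genes is in the top 3.
example = recall_at_k(["g3", "g7", "g1", "g9", "g2"], {"g1", "g2", "g8"}, 3)
```

Any of the prioritization algorithms discussed in this paper could play the role of `rank_fn`, as long as it returns candidates ordered from most to least promising.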
III. MATERIALS AND METHODS
A. Datasets
1) PPI Network and Gene-Disease Associations: The experiments discussed in Section II were conducted on the same PPI network as [16] for the sake of comparison. In [16], the authors only considered direct physical protein interactions with reported experimental evidence. Several data sources were used to derive this PPI network:
• TRANSFAC [19]: this database lists regulatory interactions derived from the presence of a transcription factor binding site in the promoter region of a certain gene;
• IntAct [4], MINT [9], BioGRID [14] and HPRD [13]: these databases list physical PPI interactions, typically identified by low throughput experiments and manually curated by experts;
• KEGG and BIGG [29]: sources used to find metabolic enzyme-coupled interactions;
• CORUM [28]: this database lists mammalian protein complexes as single molecular units that integrate multiple gene products.
In addition, we considered the main connected component of the network and we removed self-loops (i.e., edges describing proteins' self-interactions). The resulting graph consists of 13396 nodes and 138405 edges.
Disease-gene associations are the same as in [16]. Out of a corpus of 70 diseases for which gene-disease associations were retrieved from OMIM (Online Mendelian Inheritance in Man [2]), in this work we focus on the Breast Cancer phenotype, which involves 40 genes; refer to [16] for the complete list. Experiments concerning other diseases will be described and discussed in the journal version of the paper. However, repeating the same experiment on a different PPI, HIPPIE [1], we obtain coherent results, see Figure 4.
TABLE I
PRIORITIZED GENES FOUND ONLY BY THE BRW ALGORITHM AND NOT BY THE OTHERS, USING AS INITIAL DISEASE GENES (seed nodes) ALL THE KNOWN GENES. DETAILS ON PPI AND ALGORITHMS IN SECTION III. THE LEFT COLUMNS INDICATE THE PARAMETERS USED TO DEFINE THE ENRICHED SET.
Benjamini-Hochberg Correction | P-value | Annotations in Enriched set | BRW prioritized genes
False | 0.01 | 213 | ERBB4, ERBB2, CDC42, TGFB1, FGF9, SIRT1
False | 0.05 | 485 | ERBB4, VEGFA, BMP4, TGFB1, BMP2, FGF9
True | 0.01 | 80 | ERBB4, TGFB1, BMP4, CDC42
True | 0.05 | 318 | ERBB2, PAK1, ERBB4, RAD50, XRCC2
Fig. 4. Recall@k scores on HIPPIE network [1].
2) Gene annotations: We retrieved gene biological information from the Gene Ontology Consortium: in this case, we extracted annotations describing genes' biological processes. We downloaded the database in November 2018.
B. Algorithms
In the remainder, we use bold lowercase to denote vectors and capital, non-bold letters to denote matrices. Given a vector $\mathbf{x}$, $x_i$ denotes its $i$-th entry. We use $S$ to denote the subset of the PPI's nodes associated with known disease genes, i.e., what we call the seed set.
1) Random Walk with Restart: Random Walk with Restart (RWR) [20] is a diffusion-based method, whose purpose is identifying pathways that are topologically "close" to known disease genes in the interactome. It was shown to outperform other prioritization algorithms in many cases [25].
In a nutshell, this algorithm can be seen as performing multiple random walks over the PPI network, each starting from a seed node associated with a known disease gene and iteratively moving from one node to a random neighbour, thus simulating the diffusion of the disease phenotype across the interactome. More formally, the random walk with restart is defined as:
$$\mathbf{p}^{(t+1)} = (1 - r)\, W \mathbf{p}^{(t)} + r\, \mathbf{q}. \qquad (1)$$
Here, $W$ is the column-normalized adjacency matrix of the graph and $\mathbf{p}^{(t)}$ is a vector, whose $i$-th entry $p^{(t)}_i$ is the probability of the random walk being at node $i$ at the end of the $t$-th step. $r \in (0, 1)$ is the restart probability, i.e. the probability that the random walk is restarted from one of the (disease-associated) seed nodes in the next step. Upon a restart, the probability of restarting the random walk from some seed node $j$ is $q_j$. This random walk corresponds to an ergodic Markov chain [22] that admits a stationary distribution (i.e., a fixed point) $\mathbf{p}$. Nodes of the PPI are simply ranked by considering the corresponding entries of $\mathbf{p}$ in descending order of magnitude.
Following [20], in our implementation, the initial probability vector $\mathbf{q}$ was uniform over the subset of seed nodes, i.e., $q_j = 1/|S|$ if $j \in S$, $q_j = 0$ otherwise. We considered the following values for the restart probability: r ∈ { . , . , . } .
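Equation (1) can be iterated directly until the probability vector stops changing. The following is a minimal sketch, not the authors' implementation: it assumes an undirected graph stored as a symmetric adjacency dictionary in which every node has at least one neighbour, and the function name, the tolerance and the restart value used in the toy call are arbitrary illustrative choices.

```python
def random_walk_with_restart(adj, seeds, r, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * q until the L1 change drops below tol.

    adj   : dict node -> list of neighbours (symmetric, no isolated nodes)
    seeds : set of seed nodes; q is uniform over the seeds and 0 elsewhere
    r     : restart probability, strictly between 0 and 1
    """
    if not seeds:
        raise ValueError("seed set must not be empty")
    nodes = sorted(adj)
    q = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(q)
    for _ in range(max_iter):
        new_p = {v: r * q[v] for v in nodes}
        for u in nodes:
            # Column normalization: node u splits its mass evenly among neighbours.
            share = (1.0 - r) * p[u] / len(adj[u])
            for v in adj[u]:
                new_p[v] += share
        change = sum(abs(new_p[v] - p[v]) for v in nodes)
        p = new_p
        if change < tol:
            break
    return p


# Toy path graph a - b - c - d with a single seed (hypothetical node names).
toy_adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = random_walk_with_restart(toy_adj, {"a"}, r=0.5)
ranking = sorted(scores, key=scores.get, reverse=True)
```

On this toy graph the stationary probability decays with the distance from the seed, which is the behaviour the ranking step relies on.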
Footnote to the restart probability values above: note that [20] only considered the value . .
2) DIAMOnD Algorithm: The DIAMOnD (Disease Module Detection) algorithm [16] relies on the hypothesis that disease-associated proteins do not necessarily reside within locally dense communities. Instead, this algorithm identifies connectivity significance (see the paragraphs that follow for a definition) as the most predictive quantity. DIAMOnD exploits this quantity to identify the full disease module starting from a seed set of known disease proteins.
Consider a PPI network of $N$ nodes, out of which a subset $S$ of seed proteins are associated with a particular disease. Now, consider a protein with $k$ links in the PPI network, out of which $k_s$ connect to seed nodes. If seed proteins were distributed uniformly at random in the network (null hypothesis), the probability $p(k, k_s)$ that a protein with a total of $k$ links has exactly $k_s$ links to seed proteins (connectivity significance) would be given by the hypergeometric distribution:
$$p(k, k_s) = \frac{\binom{|S|}{k_s} \binom{N - |S|}{k - k_s}}{\binom{N}{k}}.$$
To evaluate whether a certain protein has more connections to seed proteins than expected under this null hypothesis, the DIAMOnD algorithm computes its connectivity p-value.
We followed the implementation of DIAMOnD. Note that the set of p-values has to be recomputed in each iteration, which makes the algorithm computationally demanding for moderately large values of k. In our experiments, we considered values of k up to .
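The hypergeometric probability above can be evaluated exactly with integer binomial coefficients. The sketch below is illustrative only: the function names are hypothetical, and the p-value is assumed to be the upper tail of the distribution (the probability of observing at least $k_s$ links to seeds), which the text above does not spell out.

```python
from math import comb


def connectivity_pmf(n_nodes, n_seeds, k, k_s):
    """p(k, k_s): probability that a node with k links has exactly k_s seed links."""
    return comb(n_seeds, k_s) * comb(n_nodes - n_seeds, k - k_s) / comb(n_nodes, k)


def connectivity_pvalue(n_nodes, n_seeds, k, k_s):
    """Upper tail: probability of k_s or more seed links under the null hypothesis.

    Using the tail sum as the p-value is an assumption of this sketch.
    """
    upper = min(k, n_seeds)
    return sum(connectivity_pmf(n_nodes, n_seeds, k, j) for j in range(k_s, upper + 1))


# Toy numbers (not from the paper): 10 nodes, 3 seeds, a node with 4 links,
# 2 of which go to seeds.
toy_pmf = connectivity_pmf(10, 3, 4, 2)
toy_pvalue = connectivity_pvalue(10, 3, 4, 2)
```

Because `math.comb` works on exact integers, the same functions remain usable at the scale of the PPI network described in Section III-A, at the cost of some large-integer arithmetic.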
3) Biological Node Relevance: Building on the hypothesis that genes involved in the same disease tend to be functionally related and thus share similar biological information, we came up with a very simple heuristic that ranks genes in the PPI network according to the extent of their co-occurrence within biological processes, which we call Biological Node Relevance (BNR) henceforth. BNR is a simple (yet effective, as we shall see) baseline, which completely disregards information implicit in the link structure of the PPI network.
Given a set $S$ of seed nodes (known disease genes), the algorithm first computes the set of annotations (see Section III-A2) that are statistically significant for seed proteins, i.e. the enriched set, using Fisher's exact test with a P-value threshold of 0.05 and the Benjamini-Hochberg correction.
Next, given a list of $N$ proteins (nodes of the PPI network), BNR computes the score of each protein $i$, defined as the size of the intersection between the set of annotations in the enriched set and the biological information of $i$, i.e.:
$$\mathrm{score}(i) = |\,\text{enriched set} \cap \mathrm{labels}(i)\,|,$$
where $\mathrm{labels}(i)$ is the set of annotations of protein $i$.
Finally, BNR sorts proteins in descending order with respect to their scores.
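The scoring and sorting steps of BNR reduce to a set intersection per gene. A minimal sketch follows, assuming the enriched set has already been computed by the enrichment test; the gene names and annotation identifiers are made up, and excluding seeds from the output as well as breaking ties alphabetically are choices of this sketch rather than details stated above.

```python
def bnr_scores(labels, enriched_set):
    """score(i) = number of annotations of gene i that belong to the enriched set."""
    return {gene: len(enriched_set & annotations) for gene, annotations in labels.items()}


def bnr_ranking(labels, enriched_set, seeds=frozenset()):
    """Genes sorted by descending score; seeds are left out of the candidate list."""
    scores = bnr_scores(labels, enriched_set)
    candidates = [gene for gene in scores if gene not in seeds]
    # Ties are broken alphabetically to keep the output deterministic.
    return sorted(candidates, key=lambda gene: (-scores[gene], gene))


# Toy data: hypothetical genes and annotation identifiers.
toy_labels = {
    "seed1": {"t1", "t2"},
    "geneA": {"t1", "t2", "t9"},
    "geneB": {"t2"},
    "geneC": {"t7"},
}
toy_enriched = {"t1", "t2", "t3"}
toy_ranking = bnr_ranking(toy_labels, toy_enriched, seeds={"seed1"})
```

Here `geneA` shares two enriched annotations, `geneB` one and `geneC` none, so they are ranked in that order.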
4) Biological Random Walk: The Biological Random Walk (BRW) is a framework that exploits both biological (GO annotations, KEGG pathways and miRNA) and topological information (PPI network) to uncover potentially new disease genes.
BRW is essentially a random walk with restart algorithm. While the form of the governing equation is still (1), the key differences are that both the transition matrix $W$ and the restart vector $\mathbf{q}$ now depend on available genes' biological information. For this reason, we call $W$ and $\mathbf{q}$ respectively the Biological Transition Matrix and the Biological Teleporting Probability vector in the remainder of this section.
Note that, differently from RWR, $\mathbf{q}$ and $W$ also depend on available biological information, so that the stationary distributions (and thus the rankings) produced by RWR and BRW generally differ. (Footnote: We again remind that, while we consider gene annotations here, the same approach can be adapted to different biological data.)
Since the biological relevance of a node cannot be used directly as a probability, we next describe how we generate $\mathbf{q}$ and $W$. We explored several possibilities for integrating and factoring in available biological information. This implies the setting of several parameters. We used a grid search to select the parameter configuration used in the final round of experiments. For the sake of space, we refer to the case of breast cancer as a disease and GO annotations as complementary (with respect to the PPI) biological information. The approach applies seamlessly to other data sources, such as miRNA or KEGG (results will appear in the journal version of the paper). In the remainder, $\mathrm{labels}(i)$ denotes the set of annotations associated with a node $i$ of the PPI network. As usual, $S$ denotes the seed set of known disease nodes.
a) Biological Teleporting Probability (BTP) vector: The $i$-th entry $q_i$ of the BTP vector is defined as follows:
1) A measure of the overlap between $\mathrm{labels}(i)$ and the enriched set is computed. We call this the Node Relevance $NR(i)$. In this work we considered the following definitions:
• $NR(i) = \frac{|\,\text{enriched set} \cap \mathrm{labels}(i)\,|}{|\,\text{enriched set}\,|}$,
• $NR(i) = 1$, whenever $i \in S$.
2) We let $w_i = \min\{t, f(NR(i))\}$, with $f : \mathbb{R} \to [0, 1]$ a suitable monotonically increasing function (Node Relevance Function), and $t \in [0, 1]$ used as a parameter to weight the importance of the NR score overall. A number of possible choices for $f$ are presented in the paragraphs that follow.
3) $q_i = w_i / \left(\sum_j w_j\right)$, for every node $i$ of the PPI (normalization).
In the experiments, we tested different choices for the Node Scoring Function $f$. The first set consists of functions that directly depend on $NR(i)$:
• The power scoring function: $f(NR(i), \alpha) = NR(i)^{\alpha}$, with $\alpha > 0$. When $\alpha = 1$ we are directly using $NR(i)$ (default scoring function). (Footnote: We considered α ∈ { . , , . , }.)
• The sigmoid scoring function outputs a value that is smooth and bounded, based on two parameters, the steepness $s$ and the translation $\theta$: $f(NR(i), s, \theta) = \frac{1}{1 + e^{-((NR(i) - \theta) \cdot s)}}$.
We further considered node scoring functions that depend on the rank of PPI nodes in descending order of their values of $NR(i)$ (i.e., the higher $NR(i)$, the lower the corresponding rank). In more detail, let $r_i$ denote the rank of node $i$. We considered the following rank-dependent definitions for $f$:
• Linear: Given the rank $r_i$ of node $i$ and the total number $N$ of nodes/proteins in the PPI, the linear ranking function is defined as $f(r_i) = \frac{N - r_i + 1}{N}$.
• Inverse Sigmoid: In this case, for protein $i$ we have: $f(r_i) = 1 - \frac{1}{1 + e^{-s \cdot (r_i - t)}}$.
Fig. 5. Biological Random Walk flow propagation: given the seed nodes (star nodes), the flow propagates to their neighbors. BRW not only propagates the flow around them but also teleports the flow to the targets of the BTP (blue arrows). It thus discovers nodes that are biologically correlated to the seed nodes (reached just through the BTP, left-lower test node) and nodes that are not reached directly by the BTP but are close to many related nodes, the BRW node (left-upper test node).
b) Biological Transition Matrix (BTM): Though other choices are possible, for breast cancer, entry $W_{ij}$ of the random walk's transition matrix depends on the extent to which nodes $i$ and $j$ of the PPI share common annotations (i.e., they are involved in common biological processes) that are also significant for the disease. For breast cancer, we considered the following Disease Specific Interaction function:
$$DSI(i, j) = \frac{|\,\text{enriched set} \cap \mathrm{labels}(i) \cap \mathrm{labels}(j)\,|}{|\,\text{enriched set}\,|}.$$
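The quantities defined so far (Node Relevance, the BTP vector and DSI) can be sketched with plain set operations. This is a hedged illustration, not the authors' code: it reads the two NR definitions as a single piecewise rule in which seed nodes receive relevance 1, it uses the identity as the default Node Relevance Function, and all function names and toy identifiers are hypothetical.

```python
def node_relevance(labels, enriched_set, seeds):
    """NR(i): share of the enriched set covered by labels(i); 1 for seed nodes."""
    relevance = {}
    for gene, annotations in labels.items():
        if gene in seeds:
            relevance[gene] = 1.0
        else:
            relevance[gene] = len(enriched_set & annotations) / len(enriched_set)
    return relevance


def teleporting_vector(relevance, t=1.0, f=lambda x: x):
    """q_i = w_i / sum_j w_j, with w_i = min(t, f(NR(i)))."""
    weights = {gene: min(t, f(value)) for gene, value in relevance.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all weights are zero, cannot normalize")
    return {gene: weight / total for gene, weight in weights.items()}


def dsi(labels, enriched_set, i, j):
    """DSI(i, j): share of the enriched set annotated to both i and j."""
    return len(enriched_set & labels[i] & labels[j]) / len(enriched_set)


# Toy data with hypothetical genes and annotation identifiers.
toy_labels = {
    "seed1": {"x", "y"},
    "geneA": {"x", "y", "z"},
    "geneB": {"w"},
    "geneC": {"other"},
}
toy_enriched = {"x", "y", "z", "w"}
toy_nr = node_relevance(toy_labels, toy_enriched, seeds={"seed1"})
toy_q = teleporting_vector(toy_nr)
toy_dsi = dsi(toy_labels, toy_enriched, "seed1", "geneA")
```

With the identity scoring function, non-seed genes receive restart mass in proportion to their overlap with the enriched set, which is what distinguishes this restart vector from the uniform-over-seeds vector of standard RWR.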
DSI ( i, j ) will be higher, the more i and j shareannotations that are also statistically significant for the diseaseunder consideration. W ij then depends on DSI ( i, j ) according to a scoringfunction as follows: W ij = (cid:40) f ( DSI ( i, j )) if edge (i,j) belongs to PPI otherwiseIn the experiments tested different choices for the scoringfunction f : • Power scoring function : for each edge ( i, j ) we consider f ( DSI ( i, j ) , α ) = DSI ( i, j ) α , with α > . • Summation scoring function : for each edge ( i, j ) , welet f ( DSI ( i, j ) , c ) = DSI ( i, j ) + c . • Sigmoid scoring function : for each edge ( i, j ) , we have f ( DSI ( i, j ) , s, t ) = e − (( DSI ( i,j ) − θ ) · s ) , with θ and s respectively the translation and steepness parameters. C. Internal validationa) Experimental setup:
For each algorithm and for each set of parameter values, we considered the average value of Recall@k (defined below) over independent runs. In each run, the seed set of known disease genes was randomly split into a training set, accounting for a fixed fraction of the original seed set, and a test set, including the remaining genes.

b) Performance measure: Intuitively, we are interested in algorithms that identify new candidate genes that are more likely to be of interest for further biological scrutiny. Accordingly, we measured performance using
Recall. This is the fraction of relevant items (in our case, known disease genes in the test set) that are successfully retrieved by the algorithm. Formally, in our scenario recall is defined as:

$$Recall = \frac{|\text{disease genes} \cap \text{retrieved genes}|}{|\text{disease genes}|},$$

where disease genes are the known genes involved in the phenotype and retrieved genes are the genes prioritized by the algorithm under consideration. Moreover, in order to compare our approach with other baselines, we considered Recall@k. In our framework, this is the value of recall when retrieved genes is the set of top-k genes in the algorithm's ranking. We considered several values for $k$, namely $k \in \{ , , , , , , , , , , \}$. For breast cancer, this amounts to and genes respectively.

c) Parameter setting: DIAMOnD is a parameter-free algorithm. For RWR, we adopted the parameter setting suggested in [20]. For BRW, as the previous paragraphs highlight, we used a grid search to select the parameter configuration used in the final round of experiments. In more detail, for each considered parameter configuration, we took the average value of
Recall@k over 1000 independent runs of BRW. The final configuration was the one achieving the best (average) Recall@200 score, namely the inverse sigmoid ranking function with steepness 0.01 and translation 250 for the BTP construction, and the summation scoring function for the DSI, with $c = 1$.

ACKNOWLEDGMENT
This work was partially supported by "Progetti di Ricerca Medi 2018: Network medicine based machine learning and graph theory algorithms for precision oncology, id n. RM1181642AFA34C2", by ERC Advanced Grant 788893 AMDROMA "Algorithmic and Mechanism Design Research in Online Markets", and by MIUR PRIN project ALGADIMAR "Algorithms, Games, and Digital Markets".

REFERENCES

[1] Gregorio Alanis-Lobato, Miguel A Andrade-Navarro, and Martin H Schaefer. HIPPIE v2.0: enhancing meaningfulness and reliability of protein–protein interaction networks. Nucleic Acids Research, page gkw985, 2016.
[2] Joanna Amberger, Carol A Bocchini, Alan F Scott, and Ada Hamosh. McKusick's Online Mendelian Inheritance in Man (OMIM®). Nucleic Acids Research, 37(1):D793–D796, 2008.
[3] M Ampuja, EL Alarmo, P Owens, R Havunen, AE Gorska, HL Moses, and A Kallioniemi. The impact of bone morphogenetic protein 4 (BMP4) on breast cancer metastasis in a mouse xenograft model. Cancer Letters, 375(2):238–244, 2016.
[4] I. Armean, A. Bridge, A. T. Ghanbarian, B. Aranda, B. Roechert, C. Derow, C. Leroy, H. Hermjakob, J. Kerssemakers, J. Khadake, K. van Eijk, L. Montecchi-Palazzi, M. Feuermann, M. Menden, M. Michaut, P. Achuthan, S. Kerrien, S. Orchard, S. N. Neuhauser, V. Perreau, and Y. Alam-Faruque. The IntAct molecular interaction database in 2010. Nucleic Acids Research, 38(1):D525–D531, 2009.
[5] Hussein A Assi, Katia E Khoury, Haifa Dbouk, Lana E Khalil, Tarek H Mouhieddine, and Nagi S El Saghir. Epidemiology and prognosis of breast cancer in young women. Journal of Thoracic Disease, 5(Suppl 1):S2, 2013.
[6] Albert-László Barabási, Natali Gulbahce, and Joseph Loscalzo. Network medicine: a network-based approach to human disease. Nature Reviews Genetics, 12(1):56, 2011.
[7] Yana Bromberg. Disease gene prioritization. PLoS Computational Biology, 9(4):e1002902, 2013.
[8] Stephen Y Chan and Joseph Loscalzo. The emerging paradigm of network medicine in the study of human disease. Circulation Research, 111(3):359–374, 2012.
[9] Andrew Chatr Aryamontri, Arnaud Ceol, Daniele Peluso, Gianni Cesareni, Leonardo Briganti, Livia Perfetto, Luana Licata, and Luisa Castagnoli. MINT, the molecular interaction database: 2009 update. Nucleic Acids Research, 38(1):D532–D539, 2009.
[10] The Gene Ontology Consortium. The Gene Ontology resource: 20 years and still going strong. Nucleic Acids Research, 47(Database-Issue):D330–D338, 2019.
[11] Lenore Cowen, Trey Ideker, Benjamin J Raphael, and Roded Sharan. Network propagation: a universal amplifier of genetic associations. Nature Reviews Genetics, 18(9):551, 2017.
[12] Simone Di Franco, Gianluca Sala, and Matilde Todaro. p63 role in breast cancer. Aging (Albany NY), 8(10):2256, 2016.
[13] Abhilash et al. Human protein reference database – 2009 update. Nucleic Acids Research, 37(1):D767–D772, 2008.
[14] Chatraryamontri et al. The BioGRID interaction database: 2011 update. Nucleic Acids Research, 39(1):D698–D704, 2010.
[15] Christine M Fillmore, Piyush B Gupta, Jenny A Rudnick, Silvia Caballero, Patricia J Keller, Eric S Lander, and Charlotte Kuperwasser. Estrogen expands breast cancer stem-like cells through paracrine FGF/Tbx3 signaling. Proceedings of the National Academy of Sciences, 107(50):21737–21742, 2010.
[16] Susan Dina Ghiassian, Jörg Menche, and Albert-László Barabási. A disease module detection (DIAMOnD) algorithm derived from a systematic analysis of connectivity patterns of disease proteins in the human interactome. PLoS Computational Biology, 11(4):e1004120, 2015.
[17] Mika Gustafsson, Colm E Nestor, Huan Zhang, Albert-László Barabási, Sergio Baranzini, Sören Brunak, Kian Fan Chung, Howard J Federoff, Anne-Claude Gavin, Richard R Meehan, et al. Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome Medicine, 6(10):82, 2014.
[18] Katri Heikkinen, Katrin Rapakko, Sanna-Maria Karppinen, Hannele Erkko, Sakari Knuutila, Tuija Lundán, Arto Mannermaa, Anne-Lise Børresen-Dale, Åke Borg, Rosa B Barkardottir, et al. RAD50 and NBS1 are breast cancer susceptibility genes associated with genomic instability. Carcinogenesis, 27(8):1593–1599, 2006.
[19] A. E. Kel, B. Lewicki-Potapov, D. Karas, D.-U. Kloos, E. Fricke, E. Gößling, E. Wingender, H. Michael, H. Saxel, I. Reuter, K. Hornischer, M. Haubrock, M. Scheer, O. V. Kel-Margoulis, R. Geffers, R. Hehl, R. Münch, S. Land, S. Rotert, S. Thiele, and V. Matys. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Research, 31(1):374–378, 2003.
[20] Sebastian Köhler, Sebastian Bauer, Denise Horn, and Peter N Robinson. Walking the interactome for prioritization of candidate disease genes. The American Journal of Human Genetics, 82(4):949–958, 2008.
[21] Tamas Korcsmaros, Maria Victoria Schneider, and Giulio Superti-Furga. Next generation of network medicine: interdisciplinary signaling approaches. Integrative Biology, 9(2):97–108, 2017.
[22] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
[23] Jörg Menche, Amitabh Sharma, Maksim Kitsak, Susan Dina Ghiassian, Marc Vidal, Joseph Loscalzo, and Albert-László Barabási. Uncovering disease-disease relationships through the incomplete interactome. Science, 347(6224):1257601, 2015.
[24] Ron Milo, Nadav Kashtan, Shalev Itzkovitz, Mark EJ Newman, and Uri Alon. On the uniform generation of random graphs with prescribed degree sequences. arXiv preprint cond-mat/0312028, 2003.
[25] Saket Navlakha and Carl Kingsford. The power of protein interaction networks for associating genes with diseases. Bioinformatics, 26(8):1057–1063, 2010.
[26] Kivilcim Ozturk, Michelle Dow, Daniel E Carlin, Rafael Bejar, and Hannah Carter. The emerging potential for network analysis to inform precision cancer medicine. Journal of Molecular Biology, 2018.
[27] DJ Park, F Lesueur, T Nguyen-Dumont, M Pertesi, F Odefrey, F Hammet, SL Neuhausen, EM John, IL Andrulis, MB Terry, et al. Rare mutations in XRCC2 increase the risk of breast cancer. The American Journal of Human Genetics, 90(4):734–739, 2012.
[28] Andreas Ruepp, Barbara Brauner, Brigitte Waegele, Corinna Montrone, Gisela Fobo, Goar Frishman, H.-Werner Mewes, Irmtraud Dunger-Kaltenbach, and Martin Lechner. CORUM: the comprehensive resource of mammalian protein complexes – 2009. Nucleic Acids Research, 38(1):D497–D501, 2009.
[29] Jan Schellenberger, Junyoung O. Park, Tom M. Conrad, and Bernhard Palsson. BiGG: a biochemical genetic and genomic knowledgebase of large scale metabolic reconstructions. BMC Bioinformatics, 11(1):213, 2010.
[30] George W Sledge. VEGF-targeting therapy for breast cancer. Journal of Mammary Gland Biology and Neoplasia, 10(4):319–323, 2005.
[31] David F Stern. Tyrosine kinase signalling in breast cancer: ErbB family receptor tyrosine kinases. Breast Cancer Research, 2(3):176, 2000.
[32] Sunil Verma, David Miles, Luca Gianni, Ian E Krop, Manfred Welslau, José Baselga, Mark Pegram, Do-Youn Oh, Véronique Diéras, Ellie Guardino, et al. Trastuzumab emtansine for HER2-positive advanced breast cancer.