Cell cycle and protein complex dynamics in discovering signaling pathways
Daniel Inostroza, Cecilia Hernández, Diego Seco, Gonzalo Navarro, Alvaro Olivera-Nappa
AApril 7, 2020 1:9 WSPC/INSTRUCTION FILE main
Cell Cycle and Protein Complex Dynamics in Discovering SignalingPathways
Daniel Inostroza
Computer Science Department, University of Concepci´on,Edmundo Larenas, Concepci´on, 4030000, [email protected]
Cecilia Hern´andez ∗ Computer Science Department, University of Concepci´on,Edmundo Larenas, Concepci´on, 4030000, ChileCenter for Biotechmology and Bioengineering (CeBiB), Santiago, [email protected]
Diego Seco
Computer Science Department, University of Concepci´on,Edmundo Larenas, Concepci´on, 4030000, [email protected]
Gonzalo Navarro
Center for Biotechnology and Bioengineering(CeBiB),Santiago, [email protected]
Alvaro Olivera-Nappa
Center for Biotechnology and Bioengineering(CeBiB),Department of Chemical Engineering and BiotechnologyUniversity of Chile, Santiago, [email protected]
Received (Day Month Year)Revised (Day Month Year)Accepted (Day Month Year)Signaling pathways are responsible for the regulation of cell processes, such as monitor-ing the external environment, transmitting information across membranes, and makingcell fate decisions. Given the increasing amount of biological data available and therecent discoveries showing that many diseases are related to the disruption of cellularsignal transduction cascades, in silico discovery of signaling pathways in cell biologyhas become an active research topic in past years. However, reconstruction of signaling ∗ corresponding author 1 a r X i v : . [ q - b i o . M N ] A p r pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names pathways remains a challenge mainly because of the need for systematic approaches forpredicting causal relationships, like edge direction and activation/inhibition among in-teracting proteins in the signal flow. We propose an approach for predicting signalingpathways that integrates protein interactions, gene expression, phenotypes, and proteincomplex information. Our method first finds candidate pathways using a directed-edge-based algorithm and then defines a graph model to include causal activation relationshipsamong proteins, in candidate pathways using cell cycle gene expression and phenotypesto infer consistent pathways in yeast. Then, we incorporate protein complex coverageinformation for deciding on the final predicted signaling pathways. We show that ourapproach improves the predictive results of the state of the art using different rankingmetrics.
Keywords : Signaling pathways; cell cycle; protein complexes
1. Introduction
Proteins are molecules formed by sequences of amino acids. They usually interactwith each other to perform specific functions in organisms. Discovering the pro-tein roles in different functions is an important research area in the biological andbiochemical fields, because of the impact that such information may have in thecreation of new treatments of several diseases and in the comprehension of func-tions in living systems . A kind of cell activity where several proteins work togetherin a temporal and spatial sequence is called “signaling pathway”. A signaling path-way can be seen as a linear path in cascade, where multiple proteins associate ormodify each other to perform a specific function. In general, a signaling pathwayhas a set of proteins whose sequence interaction from a source to a target producesthe activation of transcription factors, which regulate gene expression or inhibition .Another kind of biological function, where multiple proteins work together, is calleda “protein complex”, where there is a high level of interaction among the involvedproteins, but there is not a temporal or sequential dependency of their interactions .Diverse technologies have made possible the compilation of Protein-Protein In-teraction (PPI) networks, which contain pairs of proteins that interact in a deter-mined experimental context. Several methods aim to discover interactions betweenproteins, such as yeast-two-hybrid (Y2H), affinity purification-mass spectrometry(AP-MS) approaches, or interaction reports inferred from mining information inscientific publications .Despite the progress made to date, many proteins have different and multiplefunctions in every organism and, since biological systems are complex, there arestill many knowledge gaps concerning their interactions, behavior, and functions.For instance, the identification of signaling pathways is a critical point to understandbiological processes, as well as pathological alterations of these functions that maytrigger diseases , , , , . However, prediction of biological signaling pathways fromPPI networks is a complex task, mainly because PPI networks are modeled asundirected graphs and signaling pathways are intrinsically directed. Thus, there isa high number of possible signaling pathways to consider from a PPI network, whichcan produce high rates of false positives in the results.pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways To date, this problem has been approached from many points of view. Gitter etal. find pathways using a weighted PPI (to represent the confidence of the inter-action) and predict alternative pathways combining random orientation and localsearch algorithms. Their approach first construct a weighted PPI, where weightcomputation is based on both the confidence in the experimental system used todetect the interaction and the number of separate research articles that report theinteraction. Their proposal aims to maximize all interaction weights in every path-way, since a pathway is more reliable if the multiplication of its weights is larger.Cao et al. propose a pathway prediction tool that uses a distance-based metric(DSD: Diffusion State Distance), measuring the topological similarity of proteinsin a network, adding information from databases to make it more specific (numberof researches that prove the interactions and reference pathways). Shin et al. usea shortest path algorithm based on Dijkstra , choosing the best pathways as theones that minimize the pathway length. Jaromerska et al. discover signaling path-ways by defining a distance measure integrating PPI network topology with GeneOntology data and semantic similarity. However, this method improves Gitter’s results by a small percentage mainly in the first top-20 pathways. Vinayagamet al. proposed a computational model that predicts activation/inhibition per-forming phenotype correlation among proteins and build a signed PPI network for Dropsophila melanogaster , where the sign is positive for activation and negative forinhibition relationships. However, this approach does not predict signaling path-ways. P-Finder reconstructs signaling pathways from PPI networks using geneontology (GO) annotations and a semantic similarity metric. This method uses dif-ferent algorithms based on path frequencies, network motifs, and information prop-agation. PathLinker reconstructs human signaling pathways from PPI networksfrom receptors to transcription regulators by computing the shortest paths using afast algorithm based on the A* heuristic. Their experimental evaluation shows thatthe performance of PathLinker is similar to an approach based on random walkswith restarts proposed in RWR . Even though some approaches integrate some bi-ological knowledge for signaling pathway predictions, genome-scale reconstructionof signaling pathways is still challenging, mainly because causal relationships aredifficult to infer .In this work, we propose a ranking algorithm that combines PPI network topol-ogy with the biological knowledge available in public databases to provide a biolog-ical context for every pathway and its interactions. Our approach is based on twosteps. The first step applies an edge orientation and local search algorithm in theinput PPI network to find candidate signaling pathways. The second step defines agraph model and decision algorithms that include temporal biological data, basedon cell cycle dynamics, to determine which candidate pathways are biologically con-sistent, and then applies protein complex coverage to obtain predicted pathways.We evaluated our method using well-known ranking metrics and found that relatingbiological information with PPI networks significantly improves the prediction ofsignaling pathways.pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names
2. Biological knowledge
There exists a wide range of biological knowledge describing biological processes,components, or structures in which individual genes and proteins are known to beinvolved, such as protein complexes and signaling pathways. In this section, wedescribe what it is relevant for this work.
Cell cycle information
The cell cycle is a set of events where the cell grows and develops processes thatlead to the duplication of its DNA and, subsequently, cell division. The cell cycleconsists of a sequence of four phases: G1, S, G2, and M, where the cell componentsfulfill specific functions. The G1 and G2 phases are gaps, the S phase representsthe synthesis (replication of its DNA), and the M phase represents the mitosis andcytokinesis. Also, there are two checkpoints, where the cell verifies if it is ready tocontinue with the next phase. These checkpoints are G1/S (at the end of G1) andG2/M (at the end of G2) .There are many available datasets related to the mitotic cell cycle. Thesedatasets include microarray-based time courses of mRNA expression, mass-spectrometry-based proteomics on protein expression during the cell cycle, sys-tematic screens for cyclin-dependent kinase (CDK) substrates, and high-contentscreening for knockdown phenotypes. All these datasets provide detailed informa-tion about the mitotic cell cycle and its many regulatory layers. As combining andanalyzing all this information is a complex task, Cyclebase aims to address thisproblem by processing different experimental data, mapping common gene identi-fiers and normalizing experiments onto a common timescale, facilitating direct com-parison of expression profiles between different experiments related to an organism.Version 3.0 of Cyclebase includes new mRNA and protein expression data, andintegrates cell cycle phenotype information from high-content screens and model-organism databases. Cyclebase also provides an easy way of obtaining informationabout cell cycle peak gene expression and phenotypes of individual genes. Figure1 shows the information available for a gene (YBR088C) in yeast ( Saccharomycescerevisiae ), where the gene expression in different phases of the cell cycle and thetime course experiments are displayed, and the periodic behavior of gene expressioncan be observed ( in red arrow in Figure 1-left and red dot in Figure 1-right). Inaddition we show a sample of the information displayed for several genes related topeak expression and phenotype for a set of yeast genes, including YBR088C, firstrow in Table 1.
Protein complex information
Proteins are known to participate in several biological processes such as transport,signaling, and metabolic and enzymatic catalysis. Most proteins interact with othersforming functional units, called protein complexes, which allows them to performpril 7, 2020 1:9 WSPC/INSTRUCTION FILE main
Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways Rank: 21Expression G1SG2M
Cyclebase 3.0 - POL30 https://cyclebase.org/CyclebasePage?type=4932...2 of 4 12/26/2017 03:54 PM
PTMsPhenotypesComplexesOrthologs and Paralogs
No evidence of this type.
Name Source Evidence
Meiotic recombination: abnormal SGD Classical genetics, reduction of functionVegetative growth: decreased rate SGD Large-scale survey, overexpressionInviable SGD Systematic mutation setChromosome/plasmid maintenance:abnormal SGD Large-scale survey, conditionalChromosome/plasmid maintenance:decreased SGD Large-scale survey,classical genetics, repressibleMitotic recombination: increased SGD Classical genetics, reduction of functionCell cycle progression in S phase:abnormal SGD Large-scale survey, repressible
No evidences in this channel
Time course experiments -6-5-4-3-2-1012345 E x p r e ss i on Cho-cdc28Granovskaia-alphaGranovskaia-cdc28Pramilla-alpha30Pramilla-alpha38Spellman-alphaSpellman-cdc15de Lichtenberg-cdc15G1S G2 M G1 S G2 M G1 S G2 M G1 S G2M
Cyclebase 3.0 - POL30 https://cyclebase.org/CyclebasePage?type=4932...3 of 4 12/26/2017 03:55 PM
Fig. 1. Capture from Cyclebase 3.0. Information available for protein YBR088C with includingtemporal expression data from eight experiments.
Primary name Type Identifier Peaktime PhenotypesPOL30 Saccharomyces cerevisiae YBR088C G1CLN2 Saccharomyces cerevisiae YPL256C G1 G1MRC1 Saccharomyces cerevisiae YCL061C G1 G1/S,S
Table 1. Set of genes with corresponding peak expression phase and phenotype biological functions collaboratively. Many proteins participate in different proteincomplexes according to the functional needs of the organism. Understanding thefunctions of proteins is important for many diseases since some research studieshave shown that the deletion of some proteins in a network may have lethal ef-fects on organisms . This has been an important motivation for the research com-munity to propose different prediction methods for protein functions and proteincomplexes , , . Moreover, there are gold standards with curated protein complexesfor different organisms. In the case of yeast, one of the most used gold standards isCYC2008 , which contains 408 manually curated heteromeric protein complexesand has been used as a reference by many protein complex prediction tools .
3. Method
In this section, we formally define the signaling pathway discovery problem andpropose a method that incorporates biological data to infer causal or temporalrelationship in the signal flow.pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names
Problem formulation
Let G ( V, E, w ) be a weighted undirected graph, which models a PPI network withproteins as vertices in V , protein interactions as edges in E , and weights, w , as theconfidence of a real protein interaction. We denote w ( e ) as the weight (confidence)of an edge e in E . Let also assume a set of pairs ( s i , t i ) of source , target proteinsin V and a maximum path length h in G for all pairs ( s i , t i ).We formulate the problem of discovering signaling pathways in two steps. Thefirst step has the goal of defining an edge orientation. For the first step, we use Gitteret al. to find candidate pathways . Gitter et al. formulate the problem for length-bounded edge orientation pathways in weighted interaction networks and show thatit is NP-hard. The authors consider this problem as Maximun Edge Orientation(MEO) and propose an heuristic based on random edge orientation with a localsearch algorithm, which objective function is to maximize all the weights in thepathway, where the weights represent the protein interaction confidences.The second problem is defined as a graph model and decision rules, where eachcandidate pathway is evaluated using biological data that include temporal dynam-ics using cell cycle and protein complex information. As a result, we obtain a set ofsignaling pathways, which we call consistent pathways . Afterwards, we verify eachconsistent pathway for protein complex coverage to obtain predicted pathways . Definition 1.
Candidate pathway.A candidate pathway is a path P = < v , v , . . . , v m > , where each pair of consecu-tive vertices in a path forms an edge ( v j , v j +1 ) ∈ E , in which the first vertex is asource, s i , and the last one is its corresponding target, t i . Definition 2.
Consistent pathway.A consistent pathway is a candidate pathway where all of its edges, when interpretedas interactions between proteins, satisfy temporal cell cycle dynamics or proteincomplex rules (Algorithm 1).
Definition 3.
Protein Complex.A protein complex is a set pc = { p , p , . . . , p n } , where p i is a protein. Definition 4.
Predicted pathway.A predicted pathway is a consistent pathway where all of its nodes, when interpretedas proteins, satisfy the given ratio of protein complex coverage. Protein complexcoverage is defined as the ratio of the maximum number of proteins in a pathwaythat belong to a single protein complex with respect to the total number of proteinsin a pathway.
Approach
We define a cell cycle graph CG = ( A, B ), where A represents the cell cycle phasesand B transitions between them. Further, N ( x ) returns the transitions from onepril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways phase to others in CG . A = { G , G /S, S, G , G /M, M } (1) B = { ( x, y ) ∈ A × A } (2) N ( x ) = { y ∈ A/ ( x, y ) ∈ B } (3)The idea is based on the fact that, if two consecutive proteins in a candidatepathway are expressed in the same cell cycle phase, which is a self-loop edge in thegraph, or if the second protein is expressed in a phase that follows the one of the firstprotein, then the interaction is more likely to be real. As the information extractedfrom Cyclebase includes two checkpoints, G1/S and G2/M, we use them as a partof the cycle, so that the sequence remains in order. We based our method on theinformation reported by Cyclebase, which already estimates peak gene expressionby a probabilistic approach using several gene expression experiments. SG M G
Fig. 2. Graph representation of a cell cycle CG ( A, B ). We incorporate temporality in the form of cell cycle phases to check if twoconsecutive proteins in a candidate pathway are likely to participate in a pathway.If all proteins forming edges in a candidate pathway satisfy the expected transitionsin cell cycle dynamics or protein complex involvement, the candidate pathway is a consistent pathway .We model the cell cycle transition as the function T that maps each protein(seen as a node in V ) to the set of cell cycle phases (seen as nodes in A ) where itis expressed. To define T , we use information from Cyclebase 3.0 a , from where wemay use the information on cell cycle peak expression (in which case the resultingfunction is called T pk ), or the information on cell cycle phenotype (in which case a pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names we call T ph the resulting function). In addition, we consider the protein complexinvolvement function L to map proteins to protein complexes in which they par-ticipate, where P C are predicted or gold standard complexes. The procedure thatinvolves functions T and L can be used in any order as they are independent. T : V (cid:55)→ C ⊆ A L : V (cid:55)→ R ⊆ P C
Then, given CG ( A, B ) and P = { P , P , . . . , P n } , we decide whether a candidatepathway P i is TRUE or FALSE as shown in Algorithm 1. Consistent pathways arecandidate pathways that are TRUE after applying Algorithm 1. Algorithm 1
Signaling consistent pathway decision using cell-cycle and proteincomplex rules.
Require:
Candidate pathway P ∈ P , functions T pk , T ph , L . Ensure:
Returns TRUE (consistent pathway) if a candidate pathway satisfies thecell-cycle dynamics or protein complex rules. for v i , v i +1 ∈ P do if ( | T pk ( v i +1 ) ∩ N ( T pk ( v i )) | > ∨ ( | T ph ( v i +1 ) ∩ N ( T pk ( v i )) | > ∨ ( | T pk ( v i +1 ) ∩ N ( T ph ( v i )) | > ∨ ( | T ph ( v i +1 ) ∩ N ( T ph ( v i )) | > then continue else if ( | L ( v i ) ∩ L ( v i +1 ) | > then continue else return FALSE end if end for return TRUEThe first part of the Algorithm 1 (Cell Cycle rule), lines 2-3, evaluates whethera protein-protein interaction, ( v i , v i +1 ) is supported by:(1) Peaks of expression: where v i +1 peak must be in the next phase of v i peak.(2) Peaks and phenotypes: where v i +1 phenotype must be in the next phase of v i .(3) Phenotypes and peaks: where v i +1 peak must be in the next phase of v i phe-notype.(4) Phenotypes: where v i +1 phenotype must be in the next phase of v i phenotype.The next phase in the graph of a protein can be only as indicated in Figure2, that is, staying in the same one or going to other phases in the Cyclebase cellcycle. In case the cell cycle is not satisfied we verify the complex rule, line 4, thatis, our approach verifies if either the cell cycle rule or the complex rule is satisfied.If none of these two rules is satisfied, we discard such interaction (lines 6-7) and,consequently, the candidate pathway.pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways The pathways that pass the filter of Algorithm 1 are furthered filtered as fol-lows. Taking advantage of the high relationship between the proteins when theycollaborate in a complex, we evaluate all the proteins of each consistent pathway tocalculate the highest coverage that the complexes have on the pathways. This canbe seen in Algorithm 2, where we compute the number of common protein com-plexes where proteins in a consistent pathway participate in (lines 1-3 in Algorithm2). Then, we compute the protein complex coverage r (line 4 in Algorithm 2) andchoose the pathways that are over the threshold given by the coverage ratio cv .This filter is related with the level of cohesion within the pathway, appreciatingthat certain proteins in it carry out functions together, beyond temporality. Algorithm 2
Signaling consistent pathway with protein complex coverage.
Require:
Consistent pathway P c and coverage ratio cv . Ensure:
Returns TRUE (Predicted pathway) if a consistent pathway satisfies the cv . for v i ∈ P c do F req [ L ( v i )] + + end for r = max ( F req ) / | P c | if r > = cv then return TRUE else return FALSE end if3.3. Ranking algorithms
A post-processing step of our method ranks consistent pathways according to dif-ferent criteria. In this section, we describe the main metrics used in such process,which are the edge and path metrics shown in Equation 4. These metrics compute avalue for each pathway P p , which is used to rank the pathways in a priority queue. Edge W eight = min / max /Avg ∀ e ∈ P p w ( e ) (4) Edge U se = min / max /Avg ∀ e ∈ P p F req ( e ) (5) P ath W eigh = (cid:89) e ∈ P p w ( e ) (6)The Edge Weight returns the minimum, average or maximum weight of aninteraction in a pathway. The
Edge Use is the number of times a certain edge ispresent in all predicted pathways. It is also considered as in its minimum, maximum,and average values. The
Path Weight is the multiplication of all the edge weightpril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names values, w ( e ), from a pathway. Path weight is used to break ties when ranking byother metrics.In preliminary experiments, we also considered vertex centrality metrics suchas degree , betweenness and closeness . However, such results are not reported heresince edge and path metrics always outperformed vertex metrics in our evaluation.
4. Evaluation
This section describes the experimental evaluation performed to compare the pro-posed method with state of the art methods for discovering signaling pathways. Wefirst describe the experimental setup, which consists of input datasets, gold stan-dards, evaluation metrics and alternative methods used for comparison, and thenthe results and a discussion about them.
Experimental setup
We replicate the experimental setup used by Gitter et al. . Hence, we use theirPPI network, Gold Standard and reference pathways. Next, we briefly explain eachof them. The PPI network consists of highly plausible interactions driven by theunion and analysis of different PPI databases such as MINT, BioGrid, and IntAct.Gitter et al. builds this PPI using both confidence of experimental system usedto detect the interactions and research articles supporting the interactions betweenproteins. This PPI contains 3,446 proteins and 10,944 interactions.The set of reference pathways consists of real pathways extracted from KEGGand Science Signaling Database of Cell Signaling. As they stated, signaling pathwaysfrom KEGG (MAPK signaling pathway) and the Science Signaling Database ofCell Signaling (Pheromone pathway and High Osmolarity Glycerol, HOG pathway)contain an average of 5 edges between a source and its closest target. We haveincluded the KEGG Cell wall stress pathway. Gitter et al. made this observationto define h = 5 as the length of the pathway in their experiments, and then we usethe same value for h . This value was calculated by getting the shortest path fromany source to each target, with a PPI network only with interactions from each ofthe four pathways in evaluation (MAPK, Pheromone, Cell wall stress, and HOG).This study is useful to bound the length of the reference pathways and the predictedpathways, since the longer the pathway, the more computational resources (memoryand time) are needed.In all gold standard pathways, we discarded inhibition interactions and onlyconsidered activation interactions, because inhibition interactions lead to stoppingin the cascade of interactions. In the reference PPI from KEGG and Science Signal-ing Database of Cell Signaling (where the gold standard is made from), there wereonly four inhibition interactions.Reference pathways were generated from the list of 16 sources and 16 targetschosen by Gitter et al. , as a list of vertices without a parent vertex (in the case ofsources) and a list of vertices without children (in the case of targets). In addition,pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways
Source TargetStandard name Systematic name Standard name Systematic nameSLN1 YIL147C CDC42 YLR229CYCK1 YHR135C HOG1 YLR113WYCK2 YNL154C STE7 YDL159WSHO1 YER118C STE20 YHL007CMF(ALPHA)2 YGL089C DIG2 YDR480WMID2 YLR332W DIG1 YPL049CRAS2 YNL098C PBS2 YJL128CGPR1 YDL035C FUS3 YBL016WBCY1 YIL033C STE5 YDR103WSTE50 YCL032W GPA1 YHR005CMSB2 YGR014W MSN1 YOL116WSIN3 YOL004W FKS2 YGR032WRGA1 YOR127W FUS1 YCL027WRGA2 YDR379W STE12 YHR084WARR4 YDL100C SWI4 YER111CMF(ALPHA)1 YPL187W FLO11 YIR019CPKH1 YDR490C RLM1 YPL089CPKH2 YOL100W SWI4 YER111CPKH3 YDR466W SWI6 YLR182Wwe added the Cell wall stress pathway adding three sources and three targets, havinga total of 19 sources and 19 targets. All sources and targets are displayed in Table2. These sources and targets were used to generate the candidate pathways in theinitial step.Also, we included biological datasets that register the peak expression dataand phenotypes available in Cyclebase . We consider predicted yeast protein com-plexes obtained by Hernandez et al. and yeast protein complex gold standardCYC2008 . We used a ranking scheme choosing the top- k predicted pathways thatconsiders the different pathway metrics described in Section 3.3.We used as parameters: h , the length of the pathways to consider; cv , in-tended to consider different ratios of protein complex coverage; and k , thesize of the top- k rankings. We compared our method using different configu-rations, as shown in Table 3, which includes Gitter’s method that is used asa baseline. We also compare our results with PathLinker and RWR . Bothimplementations are available at http://bioinformatics.cs.vt.edu/~murali/supplements/2016-sys-bio-applications-pathlinker/ .pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names
Table 3. Method configurations.
Config DescriptionG Gitter’s method, using the best values from 10 tests (as it is arandom model, results may vary).G-cv Gitter’s method plus protein complex coverage (i.e. using can-didate pathways as input and protein complex coverage in Al-gorithm 2). We use protein complex coverage ratio cv in per-centage.GC Gitter for candidate pathways plus applying Cell cycle rule(defined in Algorithm 1).GCC Gitter for candidate pathways, applying Cell cycle rule andprotein complex rule (defined in Algorithm 1).GCC-cv Gitter for candidate pathways, applying Cell cycle rule, proteincomplex rule (defined in Algorithm 1), and protein complexcoverage (defined in Algorithm 2).PathLinker Pathways predicted by PathLinker.PathLinker-RWR Pathways predicted by PathLinker RWR implementation. Evaluation metrics
We follow the definitions of complete and partial match as given in Gitter et al. todefine true positives. Hence, a predicted pathway is considered as a true positive ifthe pathway has at least three of the five interactions consecutively and it is foundin at least one reference pathway.Taking into account this definition, we considered the typical measures usedto evaluate effectiveness of ranked retrieval , such as Precision@k , Recall@k , F Score @ k , and MAP. Precision@k evaluates the effectiveness of a top-k predic-tion list as the ratio of the number of true positives the list contains and the listsize. In other words, it gives an idea about the portion of positive results in theretrieved list. An important property of this measure is that it tends to decreaseas k increases. On the other hand, Recall@k is computed as the ratio of the num-ber of true positives the list contains and the number of positive samples in theground truth. It gives an idea about the coverage of truth samples obtained fromthe predicted ranking. By definition, it monotonically increases with k . Values arecomputed as described in Equations 7. In these equations, I k ( u ) represents the listof pathways returned by a method and I + u represents positive samples in the groundtruth. Hence, I + u ∩ I k ( u ) are the true positives. F Score @ k is the harmonic aver-age of Precision@k and
Recall@k . In the next section, we will show recall-precisioncurves, which provide a graphical way of examining the trade-off between these twopril 7, 2020 1:9 WSPC/INSTRUCTION FILE main
Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways measures. P recision @ k = | I + u ∩ I k ( u ) || I k ( u ) | (7) Recall @ k = | I + u ∩ I k ( u ) || I + u | (8) F Score @ k = 2 × P recision @ k × Recall @ kP recision @ k + Recall @ k (9)The last measure we consider, MAP, is based on the average precision , whichmay be interpreted as the area under the recall-precision curves. Hence, it aggregatesall the recall values in a single measure, which may simplify the comparison betweendifferent methods. The Mean Average Precision or MAP is just the arithmetic meanover a set of predictions of the average precision values.In order to correctly interpret the results, it is important to notice that, unlikeother domains in which these measures have been used, in this domain the groundtruth is incomplete. In plain words, true positives represent pathways that havebeen shown to exist by biological experiments, but false positives do not necessarilyrepresent pathways that do not actually exist. Hence, low precision values are notas important as in other domains where the ground truth is complete. Results
As mentioned in the previous section, we used a high confidence PPI network,signaling pathway gold standards, path ranking metrics, and path length of 5 inter-actions as defined in Gitter approach . We considered all configurations describedin Table 3, using cv = 10 , , , ,
50% for all path ranking metrics defined inEquations 4.We first evaluate different configurations of our method and compare them withthe baseline . To do that we measure the number of true positives in the top-100results obtained by sorting the pathway predictions using the defined path rankingmetrics. For protein complex coverage we use predicted protein complexes fromthe DAPG method and from the gold standard CYC2008 . Table 4 shows thenumber of true positives out of the top-100 results obtained using predicted proteincomplexes and Table 5 using gold standard protein complexes. As observed, bothtables show similar results. Furthermore, we observe that using cell cycle and proteinrules, as well as protein complex coverage, are better alternatives than using onlythe random edge orientation with local search proposed by Gitter et al. , since thenumber of true positives increases about 27% (ratio between our method GCC-30%and G-30%).We promote our best configurations and compare them with state-of-the-artmethods using standard measures: precision, recall, F Score @ k and MAP (MeanAverage Precision). We summarize the results in Figure 3. The top left graphshows the best results including the original results of G, GCC, PathLinker andpril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names
Table 4. Ranking results (path and edge measures) from the top-100 consistent signaling pathwaysusing predicted protein complexes. Values in bold show best results.
Method PathWeight MaxEdgeWeight AvgEdgeWeight MinEdgeWeight MaxEdgeUse AvgEdgeUse MinEdgeUseG 31 7 31 34 0 0 0G-10 31 7 31 34 0 0 0G-20 31 7 31 32 0 0 0G-30 31 7 31 31 0 0 8G-40 26 3 26 25 0 0 0G-50 26 3 26 25 0 0 0GC 41 12 43 36 7 4 3GCC
17 44 42 6 2 0GCC-10
17 44 42 6 2 0GCC-20
44 6 2 0GCC-30
44 6 2 0GCC-40 40 10 43 41 0 1 0GCC-50 42 10 43 41 0 1 0
Table 5. Ranking results (path and edge measures) from the top-100 consistent signaling pathwaysusing protein complexes in gold standard CYC2008. Values in bold show best results.
Method PathWeight MaxEdgeWeight AvgEdgeWeight MinEdgeWeight MaxEdgeUse AvgEdgeUse MinEdgeUseG 31 7 31 34 0 0 0G-10 31 7 31 34 0 0 0G-20 32 7 32 37 0 1 8G-30 32 7 32 37 0 0 8G-40 28 3 28 26 0 0 0G-50 28 3 28 26 0 0 0GC 41 12 43 36 7 4 3GCC
17 44 42 6 2 0GCC-10
17 44 42 6 2 0GCC-20
45 46
45 46
PathLinker-RWR. Figure 3-left shows that our approach provides the best perfor-mance regarding precision and recall @ k , achieving better precision for the wholerecall spectrum. The other graphs in Figure 3 show the precision @ k , recall @ k and F Score @ k . We observed that our method provides the best results for all met-rics. We also observed that the second best method is PathLinker-RWR, whereasPathLinker does not behave well. We tested PathLinker increasing k , but results dopril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways not improve. Also, we compare the overall performance using MAP with the bestconfigurations for G, PathLinker-RWR and GCC. We find that, with GCC, MAPis 0.66 whereas with G it is 0.45 and with PathLinker-RWR it is 0.47, which givesus an improvement of 46% in both cases. P r e c i s i on RecallGCC-10-PathWeightGPathLinkerPathLinker-RWR 0 0.2 0.4 0.6 0.8 1 0 500 1000 1500 2000 2500 3000 P r e c i s i on K GCC-10-PathWeightGPathLinkerPathLinker-RWR 0 0.2 0.4 0.6 0.8 1 0 500 1000 1500 2000 2500 3000 R e c a ll KGCC-10-PathWeightGPathLinkerPathLinker-RWR 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0 500 1000 1500 2000 2500 3000 F - sc o r e K GCC-10-PathWeightGPathLinkerPathLinker-RWR
Fig. 3. Precision/Recall @ k , Precision, Recall and F Score for different k values. Comparisonamong Gitter method (G) using best path metrics and G with best protein complex coverage,our method using best cell cycle and protein complex coverage (GCC) settings, PathLinker andPathLinker-RWR best results. Best configuration obtained with metric PathWeight using proteincoverage of 10% (GCC-10). In addition, we include a biological metric that measures the biological signifi-cance of predicted signaling pathways using enrichment analysis. We defined biolog-ical significance as the ratio of significant pathways with respect the top-k predictedpathways. This computation is similar to the one described for protein complexes . We use gprofiler , a recent tool that automatically considers the latest geneontology and annotations. We considered its python client application using thecomplete predicted pathway protein gene list, excluding electronic GO annotationspril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names
Table 6. Biological significance for predicted pathways with the best methods, GCC-10 andPathLinker-RWR. kMethod 10 20 30 40 50 60 70 80 90 100GCC-10 1.00 1.00 0.94 0.83 0.84 0.85 0.88 0.88 0.84 0.81PathLinker-RWR 0.90 0.88 0.81 0.78 0.82 0.73 0.74 0.76 0.77 0.71and using p-value smaller than 0.01 with minimum number of genes as the gene listsize. This tool is publicly available at http://biit.cs.ut.ee/gprofiler/ . Table6 shows these results for the top-100 ranked pathways for the best methods.We also analyze the effect of the path ranking metrics as well as the protein com-plex coverage ratio ( cv ) using precision and recall at k for both methods. Figure 4shows how precision and recall at k varies when increasing k using Gitter (G-cv%)with protein complex coverage and different path metrics and using our approach,Gitter with cell cycle and protein complex coverage (GCC-cv%). Best path rankingmetrics are Path Weight and Min Edge Weight for both methods. P r e c i s i on RecallG-0-MinEdgeweightG-10-MinEdgeweightG-30-MinEdgeweightG-50-MinEdgeweight 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 P r e c i s i on RecallGCC-0-PathWeightGCC-10-PathWeightGCC-30-PathWeightGCC-50-PathWeight
Fig. 4. Precision/Recall at k using different values for protein complex coverage using best pathmetrics with Gitter method (G) and our method with cell cycle and protein complex coverage(GCC).
5. Visualization
In order to help the analysis of predicted pathways, we developed a visualizationtool that processes the predicted pathways and displays them as a directed graph.This tool was developed using visNetwork package in R, and displays the pathwaysin HTML format so that they can be seen through any browser.pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main
Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways
For example, Figure 5 displays the top-20 predicted pathways. There are 4 com-pletely matched predicted pathways in the top-20 (Figure 5-left), and there are13 partially matched pathways in the top-20 (Figure 5-middle). All of them sharethe same source protein (YCL032W) but they have different targets (YLR362W,YDR103W). We observe that the interactions from YLR362W to YDR103W andfrom YDR103W to YDL159W, are the most repeated in the 13 predicted pathwaysfrom Figure 5-middle, which can be an indicator of a high level of certainty.As it can be seen in Figure 5-right, it can distinguish the reference from thepredicted pathways, displaying true interactions in continue lines, and interactionsthat are not in the gold standard in dashed lines. Also, users can select for high-lighting individual proteins, or a group of proteins having a common attribute.Other allowed actions are zooming in and out, moving around proteins (vertices)and showing information about each protein.
6. Discussion
Analyzing Figure 5, Ste50, the protein product of YCL032W, is an adaptorprotein that mediates cell signaling between G protein-associated kinases andmitogen-activated protein kinases (MAPK) . In particular, it has been docu-mented to participate in at least three different pathways controlling pheromoneresponse/mating , , high osmolarity response (HOG pathway) , and starva-tion/filamentation response . The pheromone response pathway induces matingbehavior, the cell wall stress pathway produces cell wall remodeling and the starva-tion pathway causes filamentation. Additionally, the pheromone response pathwayis indirectly related to the cell wall stress signaling pathway through Cdc42. Boththese pathways and the HOG pathway can induce cell cycle arrest. Ste50 operatesas an adaptor protein that binds to Cdc42, active phosphorylated Ste20 and Ste11in order to bring them together on the cell membrane and cause the phosphoryla-tion of Ste11 by Ste20 . In this function, Ste50 also binds to the Opy2 membranepril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names anchor, depending on Opy2 and Ste50 phosphorylation states, which modulates thetransmission of signals to the HOG or the pheromone/mating pathway .All four MAPK pathways in yeast are interconnected in a complex way andshare common elements, including Ste50 , . Pathways are also highly intercon-nected with shared elements, and heavy crosstalk is established between them.The final response triggered by the common Cdc42 (YLR229C), Ste50 (YCL032W)and Ste11 (YLR362W) proteins is determined by the exact phosphorylated sitesand proteins in the cascades , and depends on the interaction of many differ-ent proteins and adaptors that form membrane-bound and cytoplasmic complexes.This complexity can be seen in the KEGG map of the yeast MAPK pathways( ).All together, these signaling events include many different protein-protein in-teractions that transmit signals with different intensities to pathway mediatorssuch as Cdc42 (YLR229C), Ste11 (YLR362W), Ste7 (YDL159W), and Ste 5(YDR103W), and pathway effectors such as Dig1 (YPL049C), Dig2 (YDR480W),Fus3 (YBL016W) and Ste12 (YHR084W) in the pheromone/mating pathway, andKss1 (YGR040W) in the starvation/filamentation pathway.In reference to Figure 5 as an example, our approach to discover signaling path-ways completely predicted the pathways that originate in Ste50 (YCL032W) andend in one of the effectors for the pheromone/mating and starvation/filamentationpathways that were present in the reference (Figure 5-left). This approach also al-lowed to discover pathways of signaling that include the Ste5 protein (YDR103W),which are not included in the reference and were therefore partially matched (Fig-ure 5-middle). These predicted pathways have not been reported before as signalingpathway by themselves. However, this is not a spurious artifact, since Ste5 has beenreported to simultaneously interact as a scaffolding protein with Ste11, Ste7 andFus3, tethering them to the membrane and forcing their correct interaction, thusincreasing the signaling through the pheromone/mating pathway and preventingerroneous signaling of the upstream kinases . Also, Ste5 is necessary for Ste7phosphorylation of Fus3 but not for phosphorylation of Kss1, which allows up-stream regulation of the pheromone / mating pathway and directs the signalingcascade to mating rather than to filamentation , , . Therefore, the “new” path-ways discovered by our approach in this case, although not in the reference, canbe seen as alternative signaling pathways through which crosstalk between differentpathways occurs and the final signaling response is determined. Hence, includingthese automatically discovered pathways in the analysis of MAPK signaling in yeastgives more detail about the information flow through pathways and are consistentwith experimental results , , , , which further demonstrates the utility of ourapproach.pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways
7. Conclusions and future work
We have described a method that uses biological knowledge based on cell cyclepeak expression, phenotypes and protein complexes to improve prediction of sig-naling pathways. This approach first estimates candidate pathways using Gitter’sprediction method . Then, it incorporates cell cycle rules by defining a graph forrelations between the gene cycle stages peak expressions and phenotypes to obtainconsistent pathways. Finally, it adds protein complex coverage to obtain predictedpathways.We evaluate our approach using path metrics to define top- k rankings usingdifferent approaches. We show that our method improves Gitter’s performance re-garding the number of true positives in the top-100 results obtaining an improve-ment of 27%. In addition, our method provides better precision @ k , recall @ k and F Score @ k compared with Gitter’s, PathLinker and PathLinker-RWR methods.The performance gain can be observed with the overall MAP achieved by ourmethod, which is 0.66, compared to 0.45 obtained by Gitter’s and 0.47 obtainedby PathLinker-RWR.As a future work, we plan to use gene expression clustering algorithms to in-corporate a more detailed gene interaction for estimating the predicted consistentpathways as well as studying other organisms. An idea to explore include the use ofthe probabilistic markov chain model in combination with gene expression clusteringalgorithms in order to infer protein interactions for predicting signaling pathways.Combining markov models with gene clustering have been used to infer pairwisefeatures . Acknowledgement
This research has received funding from the European Union’s Horizon 2020 researchand innovation programme under the Marie Sk(cid:32)lodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941
References
1. Jimenez-Sanchez G., Childs B., Valle D. (2001). Human disease genes. Nature,409(6822).2. Cho Y., Xin Y., Speegle, G. (2015). P-Finder: Reconstruction of signaling networksfrom protein-protein interactions and GO annotations. IEEE/ACM Transactions onComputational Biology and Bioinformatics, 12(2), 309-321.3. Ritz A., Poirel C., Tegge A., Sharp N., Simmons K., Powell A., Kale S., Murali TM.(2016). Pathways on demand: automated reconstruction of human signaling networks.NPJ Systems Biology and Applications. Nature Publishing Group.4. Haveliwala, Taher H. (2003). Topic-sensitive pagerank: A context-sensitive ranking al-gorithm for web search. IEEE Transactions on Knowledge and Data Engineering, 4(15).5. Bading, H., Ginty, D. D., Greenberg, M. E. (1993). Regulation of gene expression inhippocampal neurons by distinct calcium signaling pathways. Science, 260(5105), 181-186. pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names
6. Srihari S., Leong H. W. (2012). Temporal dynamics of protein complexes in PPI net-works: a case study using yeast cell cycle dynamics. BMC bioinformatics, 13(17), S16.7. Zhang F., Wu M-, Li X., Li X.L., Kwoh C., Chee K., Zheng J. Predicting essentialgenes and synthetic lethality via influence propagation in signaling pathways of cancercell fates. Journal of Bioinformatics and Computational Biology, 13(3), 2015.8. Junzhong Ji., et al. (2014). Survey: Functional module detection from protein-proteininteraction networks. IEEE Transactions on Knowledge and Data Engineering 26(2),261-277.9. Cao M., et al. (2014). New directions for diffusion-based network prediction of proteinfunction: incorporating pathways with confidence. Bioinformatics 30(12), i219-i227.10. Gitter A., et al. (2011). Discovering pathways by orienting edges in protein interactionnetworks. Nucleic Acids Research 39(4).11. Barton, M., et al. (2015). Sensitivity of PPI analysis to differences in noise reductionstrategies. Journal of Neuroscience Methods 253. 218-232.12. Kirouac D., et al. 2012. Creating and analyzing pathway and protein interaction com-pendia for modelling signal transduction networks. BMC Systems Biology.6(1).13. Shih Y., Srinivasan P. 2012. A single source k-shortest paths algorithm to infer regu-latory pathways in a gene network. Bioinformatics 28(12).14. Jaromerska S., Praus P., Cho Y. Distance-wise pathway discovery from protein-proteininteraction networks weighted by semantic similarity. Journal of Bioinformatics andComputational Biology, 12(1), 2014.15. Dijkstra E. 1959, A note on Two Problems in Connetion with Graphs. NumerischeMathematik, 1, 269-271.16. Vinayagam A., et al. 2014. Integrating protein-protein interaction networks with phe-notypes reveals signs of interactions. Nature Methods. 11(1) p. 94–99.17. Harper J. V. (2005). Synchronization of Cell Populations in G1/S and G 2/M Phasesof the Cell Cycle. Cell Cycle Control: Mechanisms and Protocols, 157-166.18. Santos A., Wernersson R., Jensen L. J. (2014). Cyclebase 3.0: a multi-organismdatabase on cell-cycle regulation and phenotypes. Nucleic Acids Research, 43(D1),D1140-D1144.19. Jeong H., Mason S., Barab´asi A., Oltvai Z. (2001). Lethality and centrality in proteinnetworks. Nature. 411(6833), 41-42.20. Hernandez C., Mella C., Navarro G., Olivera-Nappa A., Araya J. (2017) Protein com-plex prediction via dense subgraphs and false positive analysis. PLOS ONE 12(9).21. Pellegrini M., Baglioni M., Geraci F. (2016) Protein complex prediction for large pro-tein protein interaction networks with the Core and Peel method. BMC Bioinformatics.17(12).22. Pu, S., Wong, J., Turner, B., Cho, E., Wodak, S. J. (2009). Up-to-date catalogues ofyeast protein complexes. Nucleic Acids Research, 37(3).23. Gauthier N. P., Larsen M. E., Wernersson R., De Lichtenberg U., Jensen L. J.,Brunak S., Jensen, T. S. (2007). Cyclebase. org—a comprehensive multi-organism on-line database of cell-cycle experiments. Nucleic Acids Research, 36(suppl), D854-D859.24. Pi˜nero J. et al. (2017). DisGeNET: a comprehensive platform integrating informationon human disease-associated genes and variants. Nucleic Acids Research, 45(D1), D833-D839.25. B¨uttcher S. et al. (2010). Information Retrieval: Implementing and Evaluating SearchEngines. The MIT Press.26. Srihari S., et al. (2015). Methods for protein complex prediction and their contribu-tions towards understanding the organisation, function and dynamics of complexes.FEBS letters 589.19. pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main
Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways21