[PDF] Cell cycle and protein complex dynamics in discovering signaling pathways

Abstract

Signaling pathways are responsible for the regulation of cell processes, such as monitoring the external environment, transmitting information across membranes, and making cell fate decisions. Given the increasing amount of biological data available and the recent discoveries showing that many diseases are related to the disruption of cellular signal transduction cascades, in silico discovery of signaling pathways in cell biology has become an active research topic in past years. However, reconstruction of signaling pathways remains a challenge mainly because of the need for systematic approaches for predicting causal relationships, like edge direction and activation/inhibition among interacting proteins in the signal flow. We propose an approach for predicting signaling pathways that integrates protein interactions, gene expression, phenotypes, and protein complex information. Our method first finds candidate pathways using a directed-edge-based algorithm and then defines a graph model to include causal activation relationships among proteins, in candidate pathways using cell cycle gene expression and phenotypes to infer consistent pathways in yeast. Then, we incorporate protein complex coverage information for deciding on the final predicted signaling pathways. We show that our approach improves the predictive results of the state of the art using different ranking metrics.

Full PDF

AApril 7, 2020 1:9 WSPC/INSTRUCTION FILE main

Cell Cycle and Protein Complex Dynamics in Discovering SignalingPathways

Daniel Inostroza

Computer Science Department, University of Concepci´on,Edmundo Larenas, Concepci´on, 4030000, [email protected]

Cecilia Hern´andez ∗ Computer Science Department, University of Concepci´on,Edmundo Larenas, Concepci´on, 4030000, ChileCenter for Biotechmology and Bioengineering (CeBiB), Santiago, [email protected]

Diego Seco

Computer Science Department, University of Concepci´on,Edmundo Larenas, Concepci´on, 4030000, [email protected]

Gonzalo Navarro

Center for Biotechnology and Bioengineering(CeBiB),Santiago, [email protected]

Alvaro Olivera-Nappa

Center for Biotechnology and Bioengineering(CeBiB),Department of Chemical Engineering and BiotechnologyUniversity of Chile, Santiago, [email protected]

Received (Day Month Year)Revised (Day Month Year)Accepted (Day Month Year)Signaling pathways are responsible for the regulation of cell processes, such as monitor-ing the external environment, transmitting information across membranes, and makingcell fate decisions. Given the increasing amount of biological data available and therecent discoveries showing that many diseases are related to the disruption of cellularsignal transduction cascades, in silico discovery of signaling pathways in cell biologyhas become an active research topic in past years. However, reconstruction of signaling ∗ corresponding author 1 a r X i v : . [ q - b i o . M N ] A p r pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names pathways remains a challenge mainly because of the need for systematic approaches forpredicting causal relationships, like edge direction and activation/inhibition among in-teracting proteins in the signal ﬂow. We propose an approach for predicting signalingpathways that integrates protein interactions, gene expression, phenotypes, and proteincomplex information. Our method ﬁrst ﬁnds candidate pathways using a directed-edge-based algorithm and then deﬁnes a graph model to include causal activation relationshipsamong proteins, in candidate pathways using cell cycle gene expression and phenotypesto infer consistent pathways in yeast. Then, we incorporate protein complex coverageinformation for deciding on the ﬁnal predicted signaling pathways. We show that ourapproach improves the predictive results of the state of the art using diﬀerent rankingmetrics.

Keywords : Signaling pathways; cell cycle; protein complexes

1. Introduction

Proteins are molecules formed by sequences of amino acids. They usually interactwith each other to perform speciﬁc functions in organisms. Discovering the pro-tein roles in diﬀerent functions is an important research area in the biological andbiochemical ﬁelds, because of the impact that such information may have in thecreation of new treatments of several diseases and in the comprehension of func-tions in living systems . A kind of cell activity where several proteins work togetherin a temporal and spatial sequence is called “signaling pathway”. A signaling path-way can be seen as a linear path in cascade, where multiple proteins associate ormodify each other to perform a speciﬁc function. In general, a signaling pathwayhas a set of proteins whose sequence interaction from a source to a target producesthe activation of transcription factors, which regulate gene expression or inhibition .Another kind of biological function, where multiple proteins work together, is calleda “protein complex”, where there is a high level of interaction among the involvedproteins, but there is not a temporal or sequential dependency of their interactions .Diverse technologies have made possible the compilation of Protein-Protein In-teraction (PPI) networks, which contain pairs of proteins that interact in a deter-mined experimental context. Several methods aim to discover interactions betweenproteins, such as yeast-two-hybrid (Y2H), aﬃnity puriﬁcation-mass spectrometry(AP-MS) approaches, or interaction reports inferred from mining information inscientiﬁc publications .Despite the progress made to date, many proteins have diﬀerent and multiplefunctions in every organism and, since biological systems are complex, there arestill many knowledge gaps concerning their interactions, behavior, and functions.For instance, the identiﬁcation of signaling pathways is a critical point to understandbiological processes, as well as pathological alterations of these functions that maytrigger diseases , , , , . However, prediction of biological signaling pathways fromPPI networks is a complex task, mainly because PPI networks are modeled asundirected graphs and signaling pathways are intrinsically directed. Thus, there isa high number of possible signaling pathways to consider from a PPI network, whichcan produce high rates of false positives in the results.pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways To date, this problem has been approached from many points of view. Gitter etal. ﬁnd pathways using a weighted PPI (to represent the conﬁdence of the inter-action) and predict alternative pathways combining random orientation and localsearch algorithms. Their approach ﬁrst construct a weighted PPI, where weightcomputation is based on both the conﬁdence in the experimental system used todetect the interaction and the number of separate research articles that report theinteraction. Their proposal aims to maximize all interaction weights in every path-way, since a pathway is more reliable if the multiplication of its weights is larger.Cao et al. propose a pathway prediction tool that uses a distance-based metric(DSD: Diﬀusion State Distance), measuring the topological similarity of proteinsin a network, adding information from databases to make it more speciﬁc (numberof researches that prove the interactions and reference pathways). Shin et al. usea shortest path algorithm based on Dijkstra , choosing the best pathways as theones that minimize the pathway length. Jaromerska et al. discover signaling path-ways by deﬁning a distance measure integrating PPI network topology with GeneOntology data and semantic similarity. However, this method improves Gitter’s results by a small percentage mainly in the ﬁrst top-20 pathways. Vinayagamet al. proposed a computational model that predicts activation/inhibition per-forming phenotype correlation among proteins and build a signed PPI network for Dropsophila melanogaster , where the sign is positive for activation and negative forinhibition relationships. However, this approach does not predict signaling path-ways. P-Finder reconstructs signaling pathways from PPI networks using geneontology (GO) annotations and a semantic similarity metric. This method uses dif-ferent algorithms based on path frequencies, network motifs, and information prop-agation. PathLinker reconstructs human signaling pathways from PPI networksfrom receptors to transcription regulators by computing the shortest paths using afast algorithm based on the A* heuristic. Their experimental evaluation shows thatthe performance of PathLinker is similar to an approach based on random walkswith restarts proposed in RWR . Even though some approaches integrate some bi-ological knowledge for signaling pathway predictions, genome-scale reconstructionof signaling pathways is still challenging, mainly because causal relationships arediﬃcult to infer .In this work, we propose a ranking algorithm that combines PPI network topol-ogy with the biological knowledge available in public databases to provide a biolog-ical context for every pathway and its interactions. Our approach is based on twosteps. The ﬁrst step applies an edge orientation and local search algorithm in theinput PPI network to ﬁnd candidate signaling pathways. The second step deﬁnes agraph model and decision algorithms that include temporal biological data, basedon cell cycle dynamics, to determine which candidate pathways are biologically con-sistent, and then applies protein complex coverage to obtain predicted pathways.We evaluated our method using well-known ranking metrics and found that relatingbiological information with PPI networks signiﬁcantly improves the prediction ofsignaling pathways.pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names

2. Biological knowledge

There exists a wide range of biological knowledge describing biological processes,components, or structures in which individual genes and proteins are known to beinvolved, such as protein complexes and signaling pathways. In this section, wedescribe what it is relevant for this work.

Cell cycle information

The cell cycle is a set of events where the cell grows and develops processes thatlead to the duplication of its DNA and, subsequently, cell division. The cell cycleconsists of a sequence of four phases: G1, S, G2, and M, where the cell componentsfulﬁll speciﬁc functions. The G1 and G2 phases are gaps, the S phase representsthe synthesis (replication of its DNA), and the M phase represents the mitosis andcytokinesis. Also, there are two checkpoints, where the cell veriﬁes if it is ready tocontinue with the next phase. These checkpoints are G1/S (at the end of G1) andG2/M (at the end of G2) .There are many available datasets related to the mitotic cell cycle. Thesedatasets include microarray-based time courses of mRNA expression, mass-spectrometry-based proteomics on protein expression during the cell cycle, sys-tematic screens for cyclin-dependent kinase (CDK) substrates, and high-contentscreening for knockdown phenotypes. All these datasets provide detailed informa-tion about the mitotic cell cycle and its many regulatory layers. As combining andanalyzing all this information is a complex task, Cyclebase aims to address thisproblem by processing diﬀerent experimental data, mapping common gene identi-ﬁers and normalizing experiments onto a common timescale, facilitating direct com-parison of expression proﬁles between diﬀerent experiments related to an organism.Version 3.0 of Cyclebase includes new mRNA and protein expression data, andintegrates cell cycle phenotype information from high-content screens and model-organism databases. Cyclebase also provides an easy way of obtaining informationabout cell cycle peak gene expression and phenotypes of individual genes. Figure1 shows the information available for a gene (YBR088C) in yeast ( Saccharomycescerevisiae ), where the gene expression in diﬀerent phases of the cell cycle and thetime course experiments are displayed, and the periodic behavior of gene expressioncan be observed ( in red arrow in Figure 1-left and red dot in Figure 1-right). Inaddition we show a sample of the information displayed for several genes related topeak expression and phenotype for a set of yeast genes, including YBR088C, ﬁrstrow in Table 1.

Protein complex information

Proteins are known to participate in several biological processes such as transport,signaling, and metabolic and enzymatic catalysis. Most proteins interact with othersforming functional units, called protein complexes, which allows them to performpril 7, 2020 1:9 WSPC/INSTRUCTION FILE main

Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways Rank: 21Expression G1SG2M

Cyclebase 3.0 - POL30 https://cyclebase.org/CyclebasePage?type=4932...2 of 4 12/26/2017 03:54 PM

PTMsPhenotypesComplexesOrthologs and Paralogs

No evidence of this type.

Name Source Evidence

Meiotic recombination: abnormal SGD Classical genetics, reduction of functionVegetative growth: decreased rate SGD Large-scale survey, overexpressionInviable SGD Systematic mutation setChromosome/plasmid maintenance:abnormal SGD Large-scale survey, conditionalChromosome/plasmid maintenance:decreased SGD Large-scale survey,classical genetics, repressibleMitotic recombination: increased SGD Classical genetics, reduction of functionCell cycle progression in S phase:abnormal SGD Large-scale survey, repressible

No evidences in this channel

Time course experiments -6-5-4-3-2-1012345 E x p r e ss i on Cho-cdc28Granovskaia-alphaGranovskaia-cdc28Pramilla-alpha30Pramilla-alpha38Spellman-alphaSpellman-cdc15de Lichtenberg-cdc15G1S G2 M G1 S G2 M G1 S G2 M G1 S G2M

Cyclebase 3.0 - POL30 https://cyclebase.org/CyclebasePage?type=4932...3 of 4 12/26/2017 03:55 PM

Fig. 1. Capture from Cyclebase 3.0. Information available for protein YBR088C with includingtemporal expression data from eight experiments.

Primary name Type Identiﬁer Peaktime PhenotypesPOL30 Saccharomyces cerevisiae YBR088C G1CLN2 Saccharomyces cerevisiae YPL256C G1 G1MRC1 Saccharomyces cerevisiae YCL061C G1 G1/S,S

Table 1. Set of genes with corresponding peak expression phase and phenotype biological functions collaboratively. Many proteins participate in diﬀerent proteincomplexes according to the functional needs of the organism. Understanding thefunctions of proteins is important for many diseases since some research studieshave shown that the deletion of some proteins in a network may have lethal ef-fects on organisms . This has been an important motivation for the research com-munity to propose diﬀerent prediction methods for protein functions and proteincomplexes , , . Moreover, there are gold standards with curated protein complexesfor diﬀerent organisms. In the case of yeast, one of the most used gold standards isCYC2008 , which contains 408 manually curated heteromeric protein complexesand has been used as a reference by many protein complex prediction tools .

3. Method

In this section, we formally deﬁne the signaling pathway discovery problem andpropose a method that incorporates biological data to infer causal or temporalrelationship in the signal ﬂow.pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names

Problem formulation

Let G ( V, E, w ) be a weighted undirected graph, which models a PPI network withproteins as vertices in V , protein interactions as edges in E , and weights, w , as theconﬁdence of a real protein interaction. We denote w ( e ) as the weight (conﬁdence)of an edge e in E . Let also assume a set of pairs ( s i , t i ) of source , target proteinsin V and a maximum path length h in G for all pairs ( s i , t i ).We formulate the problem of discovering signaling pathways in two steps. Theﬁrst step has the goal of deﬁning an edge orientation. For the ﬁrst step, we use Gitteret al. to ﬁnd candidate pathways . Gitter et al. formulate the problem for length-bounded edge orientation pathways in weighted interaction networks and show thatit is NP-hard. The authors consider this problem as Maximun Edge Orientation(MEO) and propose an heuristic based on random edge orientation with a localsearch algorithm, which objective function is to maximize all the weights in thepathway, where the weights represent the protein interaction conﬁdences.The second problem is deﬁned as a graph model and decision rules, where eachcandidate pathway is evaluated using biological data that include temporal dynam-ics using cell cycle and protein complex information. As a result, we obtain a set ofsignaling pathways, which we call consistent pathways . Afterwards, we verify eachconsistent pathway for protein complex coverage to obtain predicted pathways . Deﬁnition 1.

Candidate pathway.A candidate pathway is a path P = < v , v , . . . , v m > , where each pair of consecu-tive vertices in a path forms an edge ( v j , v j +1 ) ∈ E , in which the ﬁrst vertex is asource, s i , and the last one is its corresponding target, t i . Deﬁnition 2.

Consistent pathway.A consistent pathway is a candidate pathway where all of its edges, when interpretedas interactions between proteins, satisfy temporal cell cycle dynamics or proteincomplex rules (Algorithm 1).

Deﬁnition 3.

Protein Complex.A protein complex is a set pc = { p , p , . . . , p n } , where p i is a protein. Deﬁnition 4.

Predicted pathway.A predicted pathway is a consistent pathway where all of its nodes, when interpretedas proteins, satisfy the given ratio of protein complex coverage. Protein complexcoverage is deﬁned as the ratio of the maximum number of proteins in a pathwaythat belong to a single protein complex with respect to the total number of proteinsin a pathway.

Approach

We deﬁne a cell cycle graph CG = ( A, B ), where A represents the cell cycle phasesand B transitions between them. Further, N ( x ) returns the transitions from onepril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways phase to others in CG . A = { G , G /S, S, G , G /M, M } (1) B = { ( x, y ) ∈ A × A } (2) N ( x ) = { y ∈ A/ ( x, y ) ∈ B } (3)The idea is based on the fact that, if two consecutive proteins in a candidatepathway are expressed in the same cell cycle phase, which is a self-loop edge in thegraph, or if the second protein is expressed in a phase that follows the one of the ﬁrstprotein, then the interaction is more likely to be real. As the information extractedfrom Cyclebase includes two checkpoints, G1/S and G2/M, we use them as a partof the cycle, so that the sequence remains in order. We based our method on theinformation reported by Cyclebase, which already estimates peak gene expressionby a probabilistic approach using several gene expression experiments. SG M G

Fig. 2. Graph representation of a cell cycle CG ( A, B ). We incorporate temporality in the form of cell cycle phases to check if twoconsecutive proteins in a candidate pathway are likely to participate in a pathway.If all proteins forming edges in a candidate pathway satisfy the expected transitionsin cell cycle dynamics or protein complex involvement, the candidate pathway is a consistent pathway .We model the cell cycle transition as the function T that maps each protein(seen as a node in V ) to the set of cell cycle phases (seen as nodes in A ) where itis expressed. To deﬁne T , we use information from Cyclebase 3.0 a , from where wemay use the information on cell cycle peak expression (in which case the resultingfunction is called T pk ), or the information on cell cycle phenotype (in which case a pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names we call T ph the resulting function). In addition, we consider the protein complexinvolvement function L to map proteins to protein complexes in which they par-ticipate, where P C are predicted or gold standard complexes. The procedure thatinvolves functions T and L can be used in any order as they are independent. T : V (cid:55)→ C ⊆ A L : V (cid:55)→ R ⊆ P C

Then, given CG ( A, B ) and P = { P , P , . . . , P n } , we decide whether a candidatepathway P i is TRUE or FALSE as shown in Algorithm 1. Consistent pathways arecandidate pathways that are TRUE after applying Algorithm 1. Algorithm 1

Signaling consistent pathway decision using cell-cycle and proteincomplex rules.

Require:

Candidate pathway P ∈ P , functions T pk , T ph , L . Ensure:

Returns TRUE (consistent pathway) if a candidate pathway satisﬁes thecell-cycle dynamics or protein complex rules. for v i , v i +1 ∈ P do if ( | T pk ( v i +1 ) ∩ N ( T pk ( v i )) | > ∨ ( | T ph ( v i +1 ) ∩ N ( T pk ( v i )) | > ∨ ( | T pk ( v i +1 ) ∩ N ( T ph ( v i )) | > ∨ ( | T ph ( v i +1 ) ∩ N ( T ph ( v i )) | > then continue else if ( | L ( v i ) ∩ L ( v i +1 ) | > then continue else return FALSE end if end for return TRUEThe ﬁrst part of the Algorithm 1 (Cell Cycle rule), lines 2-3, evaluates whethera protein-protein interaction, ( v i , v i +1 ) is supported by:(1) Peaks of expression: where v i +1 peak must be in the next phase of v i peak.(2) Peaks and phenotypes: where v i +1 phenotype must be in the next phase of v i .(3) Phenotypes and peaks: where v i +1 peak must be in the next phase of v i phe-notype.(4) Phenotypes: where v i +1 phenotype must be in the next phase of v i phenotype.The next phase in the graph of a protein can be only as indicated in Figure2, that is, staying in the same one or going to other phases in the Cyclebase cellcycle. In case the cell cycle is not satisﬁed we verify the complex rule, line 4, thatis, our approach veriﬁes if either the cell cycle rule or the complex rule is satisﬁed.If none of these two rules is satisﬁed, we discard such interaction (lines 6-7) and,consequently, the candidate pathway.pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways The pathways that pass the ﬁlter of Algorithm 1 are furthered ﬁltered as fol-lows. Taking advantage of the high relationship between the proteins when theycollaborate in a complex, we evaluate all the proteins of each consistent pathway tocalculate the highest coverage that the complexes have on the pathways. This canbe seen in Algorithm 2, where we compute the number of common protein com-plexes where proteins in a consistent pathway participate in (lines 1-3 in Algorithm2). Then, we compute the protein complex coverage r (line 4 in Algorithm 2) andchoose the pathways that are over the threshold given by the coverage ratio cv .This ﬁlter is related with the level of cohesion within the pathway, appreciatingthat certain proteins in it carry out functions together, beyond temporality. Algorithm 2

Signaling consistent pathway with protein complex coverage.

Require:

Consistent pathway P c and coverage ratio cv . Ensure:

Returns TRUE (Predicted pathway) if a consistent pathway satisﬁes the cv . for v i ∈ P c do F req [ L ( v i )] + + end for r = max ( F req ) / | P c | if r > = cv then return TRUE else return FALSE end if3.3. Ranking algorithms

A post-processing step of our method ranks consistent pathways according to dif-ferent criteria. In this section, we describe the main metrics used in such process,which are the edge and path metrics shown in Equation 4. These metrics compute avalue for each pathway P p , which is used to rank the pathways in a priority queue. Edge W eight = min / max /Avg ∀ e ∈ P p w ( e ) (4) Edge U se = min / max /Avg ∀ e ∈ P p F req ( e ) (5) P ath W eigh = (cid:89) e ∈ P p w ( e ) (6)The Edge Weight returns the minimum, average or maximum weight of aninteraction in a pathway. The

Edge Use is the number of times a certain edge ispresent in all predicted pathways. It is also considered as in its minimum, maximum,and average values. The

Path Weight is the multiplication of all the edge weightpril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names values, w ( e ), from a pathway. Path weight is used to break ties when ranking byother metrics.In preliminary experiments, we also considered vertex centrality metrics suchas degree , betweenness and closeness . However, such results are not reported heresince edge and path metrics always outperformed vertex metrics in our evaluation.

4. Evaluation

This section describes the experimental evaluation performed to compare the pro-posed method with state of the art methods for discovering signaling pathways. Weﬁrst describe the experimental setup, which consists of input datasets, gold stan-dards, evaluation metrics and alternative methods used for comparison, and thenthe results and a discussion about them.

Experimental setup

We replicate the experimental setup used by Gitter et al. . Hence, we use theirPPI network, Gold Standard and reference pathways. Next, we brieﬂy explain eachof them. The PPI network consists of highly plausible interactions driven by theunion and analysis of diﬀerent PPI databases such as MINT, BioGrid, and IntAct.Gitter et al. builds this PPI using both conﬁdence of experimental system usedto detect the interactions and research articles supporting the interactions betweenproteins. This PPI contains 3,446 proteins and 10,944 interactions.The set of reference pathways consists of real pathways extracted from KEGGand Science Signaling Database of Cell Signaling. As they stated, signaling pathwaysfrom KEGG (MAPK signaling pathway) and the Science Signaling Database ofCell Signaling (Pheromone pathway and High Osmolarity Glycerol, HOG pathway)contain an average of 5 edges between a source and its closest target. We haveincluded the KEGG Cell wall stress pathway. Gitter et al. made this observationto deﬁne h = 5 as the length of the pathway in their experiments, and then we usethe same value for h . This value was calculated by getting the shortest path fromany source to each target, with a PPI network only with interactions from each ofthe four pathways in evaluation (MAPK, Pheromone, Cell wall stress, and HOG).This study is useful to bound the length of the reference pathways and the predictedpathways, since the longer the pathway, the more computational resources (memoryand time) are needed.In all gold standard pathways, we discarded inhibition interactions and onlyconsidered activation interactions, because inhibition interactions lead to stoppingin the cascade of interactions. In the reference PPI from KEGG and Science Signal-ing Database of Cell Signaling (where the gold standard is made from), there wereonly four inhibition interactions.Reference pathways were generated from the list of 16 sources and 16 targetschosen by Gitter et al. , as a list of vertices without a parent vertex (in the case ofsources) and a list of vertices without children (in the case of targets). In addition,pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways

Source TargetStandard name Systematic name Standard name Systematic nameSLN1 YIL147C CDC42 YLR229CYCK1 YHR135C HOG1 YLR113WYCK2 YNL154C STE7 YDL159WSHO1 YER118C STE20 YHL007CMF(ALPHA)2 YGL089C DIG2 YDR480WMID2 YLR332W DIG1 YPL049CRAS2 YNL098C PBS2 YJL128CGPR1 YDL035C FUS3 YBL016WBCY1 YIL033C STE5 YDR103WSTE50 YCL032W GPA1 YHR005CMSB2 YGR014W MSN1 YOL116WSIN3 YOL004W FKS2 YGR032WRGA1 YOR127W FUS1 YCL027WRGA2 YDR379W STE12 YHR084WARR4 YDL100C SWI4 YER111CMF(ALPHA)1 YPL187W FLO11 YIR019CPKH1 YDR490C RLM1 YPL089CPKH2 YOL100W SWI4 YER111CPKH3 YDR466W SWI6 YLR182Wwe added the Cell wall stress pathway adding three sources and three targets, havinga total of 19 sources and 19 targets. All sources and targets are displayed in Table2. These sources and targets were used to generate the candidate pathways in theinitial step.Also, we included biological datasets that register the peak expression dataand phenotypes available in Cyclebase . We consider predicted yeast protein com-plexes obtained by Hernandez et al. and yeast protein complex gold standardCYC2008 . We used a ranking scheme choosing the top- k predicted pathways thatconsiders the diﬀerent pathway metrics described in Section 3.3.We used as parameters: h , the length of the pathways to consider; cv , in-tended to consider diﬀerent ratios of protein complex coverage; and k , thesize of the top- k rankings. We compared our method using diﬀerent conﬁgu-rations, as shown in Table 3, which includes Gitter’s method that is used asa baseline. We also compare our results with PathLinker and RWR . Bothimplementations are available at http://bioinformatics.cs.vt.edu/~murali/supplements/2016-sys-bio-applications-pathlinker/ .pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names

Table 3. Method conﬁgurations.

Conﬁg DescriptionG Gitter’s method, using the best values from 10 tests (as it is arandom model, results may vary).G-cv Gitter’s method plus protein complex coverage (i.e. using can-didate pathways as input and protein complex coverage in Al-gorithm 2). We use protein complex coverage ratio cv in per-centage.GC Gitter for candidate pathways plus applying Cell cycle rule(deﬁned in Algorithm 1).GCC Gitter for candidate pathways, applying Cell cycle rule andprotein complex rule (deﬁned in Algorithm 1).GCC-cv Gitter for candidate pathways, applying Cell cycle rule, proteincomplex rule (deﬁned in Algorithm 1), and protein complexcoverage (deﬁned in Algorithm 2).PathLinker Pathways predicted by PathLinker.PathLinker-RWR Pathways predicted by PathLinker RWR implementation. Evaluation metrics

We follow the deﬁnitions of complete and partial match as given in Gitter et al. todeﬁne true positives. Hence, a predicted pathway is considered as a true positive ifthe pathway has at least three of the ﬁve interactions consecutively and it is foundin at least one reference pathway.Taking into account this deﬁnition, we considered the typical measures usedto evaluate eﬀectiveness of ranked retrieval , such as Precision@k , Recall@k , F Score @ k , and MAP. Precision@k evaluates the eﬀectiveness of a top-k predic-tion list as the ratio of the number of true positives the list contains and the listsize. In other words, it gives an idea about the portion of positive results in theretrieved list. An important property of this measure is that it tends to decreaseas k increases. On the other hand, Recall@k is computed as the ratio of the num-ber of true positives the list contains and the number of positive samples in theground truth. It gives an idea about the coverage of truth samples obtained fromthe predicted ranking. By deﬁnition, it monotonically increases with k . Values arecomputed as described in Equations 7. In these equations, I k ( u ) represents the listof pathways returned by a method and I + u represents positive samples in the groundtruth. Hence, I + u ∩ I k ( u ) are the true positives. F Score @ k is the harmonic aver-age of Precision@k and

Recall@k . In the next section, we will show recall-precisioncurves, which provide a graphical way of examining the trade-oﬀ between these twopril 7, 2020 1:9 WSPC/INSTRUCTION FILE main

Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways measures. P recision @ k = | I + u ∩ I k ( u ) || I k ( u ) | (7) Recall @ k = | I + u ∩ I k ( u ) || I + u | (8) F Score @ k = 2 × P recision @ k × Recall @ kP recision @ k + Recall @ k (9)The last measure we consider, MAP, is based on the average precision , whichmay be interpreted as the area under the recall-precision curves. Hence, it aggregatesall the recall values in a single measure, which may simplify the comparison betweendiﬀerent methods. The Mean Average Precision or MAP is just the arithmetic meanover a set of predictions of the average precision values.In order to correctly interpret the results, it is important to notice that, unlikeother domains in which these measures have been used, in this domain the groundtruth is incomplete. In plain words, true positives represent pathways that havebeen shown to exist by biological experiments, but false positives do not necessarilyrepresent pathways that do not actually exist. Hence, low precision values are notas important as in other domains where the ground truth is complete. Results

As mentioned in the previous section, we used a high conﬁdence PPI network,signaling pathway gold standards, path ranking metrics, and path length of 5 inter-actions as deﬁned in Gitter approach . We considered all conﬁgurations describedin Table 3, using cv = 10 , , , ,

50% for all path ranking metrics deﬁned inEquations 4.We ﬁrst evaluate diﬀerent conﬁgurations of our method and compare them withthe baseline . To do that we measure the number of true positives in the top-100results obtained by sorting the pathway predictions using the deﬁned path rankingmetrics. For protein complex coverage we use predicted protein complexes fromthe DAPG method and from the gold standard CYC2008 . Table 4 shows thenumber of true positives out of the top-100 results obtained using predicted proteincomplexes and Table 5 using gold standard protein complexes. As observed, bothtables show similar results. Furthermore, we observe that using cell cycle and proteinrules, as well as protein complex coverage, are better alternatives than using onlythe random edge orientation with local search proposed by Gitter et al. , since thenumber of true positives increases about 27% (ratio between our method GCC-30%and G-30%).We promote our best conﬁgurations and compare them with state-of-the-artmethods using standard measures: precision, recall, F Score @ k and MAP (MeanAverage Precision). We summarize the results in Figure 3. The top left graphshows the best results including the original results of G, GCC, PathLinker andpril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names

Table 4. Ranking results (path and edge measures) from the top-100 consistent signaling pathwaysusing predicted protein complexes. Values in bold show best results.

Method PathWeight MaxEdgeWeight AvgEdgeWeight MinEdgeWeight MaxEdgeUse AvgEdgeUse MinEdgeUseG 31 7 31 34 0 0 0G-10 31 7 31 34 0 0 0G-20 31 7 31 32 0 0 0G-30 31 7 31 31 0 0 8G-40 26 3 26 25 0 0 0G-50 26 3 26 25 0 0 0GC 41 12 43 36 7 4 3GCC

17 44 42 6 2 0GCC-10

17 44 42 6 2 0GCC-20

44 6 2 0GCC-30

44 6 2 0GCC-40 40 10 43 41 0 1 0GCC-50 42 10 43 41 0 1 0

Table 5. Ranking results (path and edge measures) from the top-100 consistent signaling pathwaysusing protein complexes in gold standard CYC2008. Values in bold show best results.

Method PathWeight MaxEdgeWeight AvgEdgeWeight MinEdgeWeight MaxEdgeUse AvgEdgeUse MinEdgeUseG 31 7 31 34 0 0 0G-10 31 7 31 34 0 0 0G-20 32 7 32 37 0 1 8G-30 32 7 32 37 0 0 8G-40 28 3 28 26 0 0 0G-50 28 3 28 26 0 0 0GC 41 12 43 36 7 4 3GCC

17 44 42 6 2 0GCC-10

17 44 42 6 2 0GCC-20

45 46

PathLinker-RWR. Figure 3-left shows that our approach provides the best perfor-mance regarding precision and recall @ k , achieving better precision for the wholerecall spectrum. The other graphs in Figure 3 show the precision @ k , recall @ k and F Score @ k . We observed that our method provides the best results for all met-rics. We also observed that the second best method is PathLinker-RWR, whereasPathLinker does not behave well. We tested PathLinker increasing k , but results dopril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways not improve. Also, we compare the overall performance using MAP with the bestconﬁgurations for G, PathLinker-RWR and GCC. We ﬁnd that, with GCC, MAPis 0.66 whereas with G it is 0.45 and with PathLinker-RWR it is 0.47, which givesus an improvement of 46% in both cases. P r e c i s i on RecallGCC-10-PathWeightGPathLinkerPathLinker-RWR 0 0.2 0.4 0.6 0.8 1 0 500 1000 1500 2000 2500 3000 P r e c i s i on K GCC-10-PathWeightGPathLinkerPathLinker-RWR 0 0.2 0.4 0.6 0.8 1 0 500 1000 1500 2000 2500 3000 R e c a ll KGCC-10-PathWeightGPathLinkerPathLinker-RWR 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0 500 1000 1500 2000 2500 3000 F - sc o r e K GCC-10-PathWeightGPathLinkerPathLinker-RWR

Fig. 3. Precision/Recall @ k , Precision, Recall and F Score for diﬀerent k values. Comparisonamong Gitter method (G) using best path metrics and G with best protein complex coverage,our method using best cell cycle and protein complex coverage (GCC) settings, PathLinker andPathLinker-RWR best results. Best conﬁguration obtained with metric PathWeight using proteincoverage of 10% (GCC-10). In addition, we include a biological metric that measures the biological signiﬁ-cance of predicted signaling pathways using enrichment analysis. We deﬁned biolog-ical signiﬁcance as the ratio of signiﬁcant pathways with respect the top-k predictedpathways. This computation is similar to the one described for protein complexes . We use gproﬁler , a recent tool that automatically considers the latest geneontology and annotations. We considered its python client application using thecomplete predicted pathway protein gene list, excluding electronic GO annotationspril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names

Table 6. Biological signiﬁcance for predicted pathways with the best methods, GCC-10 andPathLinker-RWR. kMethod 10 20 30 40 50 60 70 80 90 100GCC-10 1.00 1.00 0.94 0.83 0.84 0.85 0.88 0.88 0.84 0.81PathLinker-RWR 0.90 0.88 0.81 0.78 0.82 0.73 0.74 0.76 0.77 0.71and using p-value smaller than 0.01 with minimum number of genes as the gene listsize. This tool is publicly available at http://biit.cs.ut.ee/gprofiler/ . Table6 shows these results for the top-100 ranked pathways for the best methods.We also analyze the eﬀect of the path ranking metrics as well as the protein com-plex coverage ratio ( cv ) using precision and recall at k for both methods. Figure 4shows how precision and recall at k varies when increasing k using Gitter (G-cv%)with protein complex coverage and diﬀerent path metrics and using our approach,Gitter with cell cycle and protein complex coverage (GCC-cv%). Best path rankingmetrics are Path Weight and Min Edge Weight for both methods. P r e c i s i on RecallG-0-MinEdgeweightG-10-MinEdgeweightG-30-MinEdgeweightG-50-MinEdgeweight 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 P r e c i s i on RecallGCC-0-PathWeightGCC-10-PathWeightGCC-30-PathWeightGCC-50-PathWeight

Fig. 4. Precision/Recall at k using diﬀerent values for protein complex coverage using best pathmetrics with Gitter method (G) and our method with cell cycle and protein complex coverage(GCC).

5. Visualization

In order to help the analysis of predicted pathways, we developed a visualizationtool that processes the predicted pathways and displays them as a directed graph.This tool was developed using visNetwork package in R, and displays the pathwaysin HTML format so that they can be seen through any browser.pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main

Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways

For example, Figure 5 displays the top-20 predicted pathways. There are 4 com-pletely matched predicted pathways in the top-20 (Figure 5-left), and there are13 partially matched pathways in the top-20 (Figure 5-middle). All of them sharethe same source protein (YCL032W) but they have diﬀerent targets (YLR362W,YDR103W). We observe that the interactions from YLR362W to YDR103W andfrom YDR103W to YDL159W, are the most repeated in the 13 predicted pathwaysfrom Figure 5-middle, which can be an indicator of a high level of certainty.As it can be seen in Figure 5-right, it can distinguish the reference from thepredicted pathways, displaying true interactions in continue lines, and interactionsthat are not in the gold standard in dashed lines. Also, users can select for high-lighting individual proteins, or a group of proteins having a common attribute.Other allowed actions are zooming in and out, moving around proteins (vertices)and showing information about each protein.

6. Discussion

Analyzing Figure 5, Ste50, the protein product of YCL032W, is an adaptorprotein that mediates cell signaling between G protein-associated kinases andmitogen-activated protein kinases (MAPK) . In particular, it has been docu-mented to participate in at least three diﬀerent pathways controlling pheromoneresponse/mating , , high osmolarity response (HOG pathway) , and starva-tion/ﬁlamentation response . The pheromone response pathway induces matingbehavior, the cell wall stress pathway produces cell wall remodeling and the starva-tion pathway causes ﬁlamentation. Additionally, the pheromone response pathwayis indirectly related to the cell wall stress signaling pathway through Cdc42. Boththese pathways and the HOG pathway can induce cell cycle arrest. Ste50 operatesas an adaptor protein that binds to Cdc42, active phosphorylated Ste20 and Ste11in order to bring them together on the cell membrane and cause the phosphoryla-tion of Ste11 by Ste20 . In this function, Ste50 also binds to the Opy2 membranepril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names anchor, depending on Opy2 and Ste50 phosphorylation states, which modulates thetransmission of signals to the HOG or the pheromone/mating pathway .All four MAPK pathways in yeast are interconnected in a complex way andshare common elements, including Ste50 , . Pathways are also highly intercon-nected with shared elements, and heavy crosstalk is established between them.The ﬁnal response triggered by the common Cdc42 (YLR229C), Ste50 (YCL032W)and Ste11 (YLR362W) proteins is determined by the exact phosphorylated sitesand proteins in the cascades , and depends on the interaction of many diﬀer-ent proteins and adaptors that form membrane-bound and cytoplasmic complexes.This complexity can be seen in the KEGG map of the yeast MAPK pathways( ).All together, these signaling events include many diﬀerent protein-protein in-teractions that transmit signals with diﬀerent intensities to pathway mediatorssuch as Cdc42 (YLR229C), Ste11 (YLR362W), Ste7 (YDL159W), and Ste 5(YDR103W), and pathway eﬀectors such as Dig1 (YPL049C), Dig2 (YDR480W),Fus3 (YBL016W) and Ste12 (YHR084W) in the pheromone/mating pathway, andKss1 (YGR040W) in the starvation/ﬁlamentation pathway.In reference to Figure 5 as an example, our approach to discover signaling path-ways completely predicted the pathways that originate in Ste50 (YCL032W) andend in one of the eﬀectors for the pheromone/mating and starvation/ﬁlamentationpathways that were present in the reference (Figure 5-left). This approach also al-lowed to discover pathways of signaling that include the Ste5 protein (YDR103W),which are not included in the reference and were therefore partially matched (Fig-ure 5-middle). These predicted pathways have not been reported before as signalingpathway by themselves. However, this is not a spurious artifact, since Ste5 has beenreported to simultaneously interact as a scaﬀolding protein with Ste11, Ste7 andFus3, tethering them to the membrane and forcing their correct interaction, thusincreasing the signaling through the pheromone/mating pathway and preventingerroneous signaling of the upstream kinases . Also, Ste5 is necessary for Ste7phosphorylation of Fus3 but not for phosphorylation of Kss1, which allows up-stream regulation of the pheromone / mating pathway and directs the signalingcascade to mating rather than to ﬁlamentation , , . Therefore, the “new” path-ways discovered by our approach in this case, although not in the reference, canbe seen as alternative signaling pathways through which crosstalk between diﬀerentpathways occurs and the ﬁnal signaling response is determined. Hence, includingthese automatically discovered pathways in the analysis of MAPK signaling in yeastgives more detail about the information ﬂow through pathways and are consistentwith experimental results , , , , which further demonstrates the utility of ourapproach.pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways

7. Conclusions and future work

We have described a method that uses biological knowledge based on cell cyclepeak expression, phenotypes and protein complexes to improve prediction of sig-naling pathways. This approach ﬁrst estimates candidate pathways using Gitter’sprediction method . Then, it incorporates cell cycle rules by deﬁning a graph forrelations between the gene cycle stages peak expressions and phenotypes to obtainconsistent pathways. Finally, it adds protein complex coverage to obtain predictedpathways.We evaluate our approach using path metrics to deﬁne top- k rankings usingdiﬀerent approaches. We show that our method improves Gitter’s performance re-garding the number of true positives in the top-100 results obtaining an improve-ment of 27%. In addition, our method provides better precision @ k , recall @ k and F Score @ k compared with Gitter’s, PathLinker and PathLinker-RWR methods.The performance gain can be observed with the overall MAP achieved by ourmethod, which is 0.66, compared to 0.45 obtained by Gitter’s and 0.47 obtainedby PathLinker-RWR.As a future work, we plan to use gene expression clustering algorithms to in-corporate a more detailed gene interaction for estimating the predicted consistentpathways as well as studying other organisms. An idea to explore include the use ofthe probabilistic markov chain model in combination with gene expression clusteringalgorithms in order to infer protein interactions for predicting signaling pathways.Combining markov models with gene clustering have been used to infer pairwisefeatures . Acknowledgement

This research has received funding from the European Union’s Horizon 2020 researchand innovation programme under the Marie Sk(cid:32)lodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941

References

1. Jimenez-Sanchez G., Childs B., Valle D. (2001). Human disease genes. Nature,409(6822).2. Cho Y., Xin Y., Speegle, G. (2015). P-Finder: Reconstruction of signaling networksfrom protein-protein interactions and GO annotations. IEEE/ACM Transactions onComputational Biology and Bioinformatics, 12(2), 309-321.3. Ritz A., Poirel C., Tegge A., Sharp N., Simmons K., Powell A., Kale S., Murali TM.(2016). Pathways on demand: automated reconstruction of human signaling networks.NPJ Systems Biology and Applications. Nature Publishing Group.4. Haveliwala, Taher H. (2003). Topic-sensitive pagerank: A context-sensitive ranking al-gorithm for web search. IEEE Transactions on Knowledge and Data Engineering, 4(15).5. Bading, H., Ginty, D. D., Greenberg, M. E. (1993). Regulation of gene expression inhippocampal neurons by distinct calcium signaling pathways. Science, 260(5105), 181-186. pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main Authors’ Names

6. Srihari S., Leong H. W. (2012). Temporal dynamics of protein complexes in PPI net-works: a case study using yeast cell cycle dynamics. BMC bioinformatics, 13(17), S16.7. Zhang F., Wu M-, Li X., Li X.L., Kwoh C., Chee K., Zheng J. Predicting essentialgenes and synthetic lethality via inﬂuence propagation in signaling pathways of cancercell fates. Journal of Bioinformatics and Computational Biology, 13(3), 2015.8. Junzhong Ji., et al. (2014). Survey: Functional module detection from protein-proteininteraction networks. IEEE Transactions on Knowledge and Data Engineering 26(2),261-277.9. Cao M., et al. (2014). New directions for diﬀusion-based network prediction of proteinfunction: incorporating pathways with conﬁdence. Bioinformatics 30(12), i219-i227.10. Gitter A., et al. (2011). Discovering pathways by orienting edges in protein interactionnetworks. Nucleic Acids Research 39(4).11. Barton, M., et al. (2015). Sensitivity of PPI analysis to diﬀerences in noise reductionstrategies. Journal of Neuroscience Methods 253. 218-232.12. Kirouac D., et al. 2012. Creating and analyzing pathway and protein interaction com-pendia for modelling signal transduction networks. BMC Systems Biology.6(1).13. Shih Y., Srinivasan P. 2012. A single source k-shortest paths algorithm to infer regu-latory pathways in a gene network. Bioinformatics 28(12).14. Jaromerska S., Praus P., Cho Y. Distance-wise pathway discovery from protein-proteininteraction networks weighted by semantic similarity. Journal of Bioinformatics andComputational Biology, 12(1), 2014.15. Dijkstra E. 1959, A note on Two Problems in Connetion with Graphs. NumerischeMathematik, 1, 269-271.16. Vinayagam A., et al. 2014. Integrating protein-protein interaction networks with phe-notypes reveals signs of interactions. Nature Methods. 11(1) p. 94–99.17. Harper J. V. (2005). Synchronization of Cell Populations in G1/S and G 2/M Phasesof the Cell Cycle. Cell Cycle Control: Mechanisms and Protocols, 157-166.18. Santos A., Wernersson R., Jensen L. J. (2014). Cyclebase 3.0: a multi-organismdatabase on cell-cycle regulation and phenotypes. Nucleic Acids Research, 43(D1),D1140-D1144.19. Jeong H., Mason S., Barab´asi A., Oltvai Z. (2001). Lethality and centrality in proteinnetworks. Nature. 411(6833), 41-42.20. Hernandez C., Mella C., Navarro G., Olivera-Nappa A., Araya J. (2017) Protein com-plex prediction via dense subgraphs and false positive analysis. PLOS ONE 12(9).21. Pellegrini M., Baglioni M., Geraci F. (2016) Protein complex prediction for large pro-tein protein interaction networks with the Core and Peel method. BMC Bioinformatics.17(12).22. Pu, S., Wong, J., Turner, B., Cho, E., Wodak, S. J. (2009). Up-to-date catalogues ofyeast protein complexes. Nucleic Acids Research, 37(3).23. Gauthier N. P., Larsen M. E., Wernersson R., De Lichtenberg U., Jensen L. J.,Brunak S., Jensen, T. S. (2007). Cyclebase. org—a comprehensive multi-organism on-line database of cell-cycle experiments. Nucleic Acids Research, 36(suppl), D854-D859.24. Pi˜nero J. et al. (2017). DisGeNET: a comprehensive platform integrating informationon human disease-associated genes and variants. Nucleic Acids Research, 45(D1), D833-D839.25. B¨uttcher S. et al. (2010). Information Retrieval: Implementing and Evaluating SearchEngines. The MIT Press.26. Srihari S., et al. (2015). Methods for protein complex prediction and their contribu-tions towards understanding the organisation, function and dynamics of complexes.FEBS letters 589.19. pril 7, 2020 1:9 WSPC/INSTRUCTION FILE main

Cell Cycle and Protein Complex Dynamics in Discovering Signaling Pathways21