Building Models for Biopathway Dynamics Using Intrinsic Dimensionality Analysis
Emilia M. Wysocka, Valery Dzutsati, Tirthankar Bandyopadhyay, Laura Condon, Sahil Garg
BBuilding models for biopathwaydynamics using intrinsicdimensionality analysis
Emilia M. Wysocka , Valery Dzutsati , Tirthankar Bandyopadhyay , LauraCondon , and Sahil Garg University of Edinburgh, UK Arizona State University, USA CSIRO, Australia Colorado School of Mines, USA University of Southern California, USASeptember 25, 2015 C OMPLEX S YSTEMS S UMMER S CHOOL
ANTA F E I NSTITUTE
NM USA a r X i v : . [ q - b i o . M N ] N ov ontents Introduction and project motivations
Extensive development of technologies and methods related to data acquisition,sharing and storage have made analysis and knowledge discovery unprecedentedlychallenging. For instance, big data has become increasingly common in socialsciences and requires new techniques of analysis, including non linear time seriesapproaches. One of such recent examples of a challenging dataset is the data onrebel violence in the volatile Russian North Caucasus region [26]. The dataset hasrecordings of incidents of rebel violence on weekly basis for every town and villageof the region. Overall, this resulted in over 1 million observations with nearly 200variables, approximately 200 million data points.An important task for many if not all the scientific domains is efficientknowledge integration, testing and codification. It is often solved with modelconstruction in a controllable computational environment. In spite of that, thethroughput of in-silico simulation-based observations become similarly intractablefor thorough analysis. This is especially the case in molecular biology, whichserved as a subject for this study.In this project, we aimed to test some approaches developed to deal with thecurse of dimensionality. Among these we found dimension reduction techniquesespecially appealing. They can be used to identify irrelevant variability and help tounderstand critical processes underlying high-dimensional datasets. Additionally,we subjected our data-sets to nonlinear time series analysis, as those are wellestablished methods for results comparison.To investigate the usefulness of dimension reduction methods, we decidedto base our study on a concrete sample set. The example was taken from thedomain of systems biology concerning dynamic evolution of subcellular signalling.Particularly, the dataset relates to the yeast pheromone pathway and is studied in-silico with a stochastic model. The model reconstructs signal propagationstimulated by a mating pheromone.In the following sections we will elaborate on the reason of multidimensionalanalysis problem in the context of molecular signalling. Next, we will introducethe model of choice, simulation details and obtained time series dynamics. Adescription of used methods followed by a discussion of results and their biologicalinterpretation will finalise this report. This study is a preliminary analysis of thedataset, future work will expand on the results presented here.
As with all signal processing systems, cell signalling is characterised bysignal related functionalities, such as input fidelity, output specificity, signalamplification, the sensitivity and diversity of response or the flexibility of3eaction [1]. These highly sophisticated functions produce complex systemsembodied by the combinatorial explosion of molecular interactions and states [12,2]. On the lower level, cell signalling depends on formation and interactionsof multi-subunit complexes, mainly formed by interacting proteins. They arecomposed from often numerous and autonomously folding blocks called domains,acting as protein functional interfaces. Importantly, protein activity is determinedby multiple post-translational modification sites (phosphorylation, acetylation,ubiquitination), transitionally changing their states. For example, lets consideran ubiquitously present Epidermal Growth Factor Receptor (EGFR), which has9 sites resulting in 512 possible states ( = 512 , on- and off-state). Furthermore,each site has at least one binding partner rising the value of single receptor proteinstates to 19,683 possibilities ( = 19 , ). The large number of possiblestates, even within this relatively simple system is one of the key challenges formechanistic modelling of signalling networks. Traditional equation-based modelsare capable of representing only extensively studied and limited size signallingcircuits. Any larger integrative models become intractable, impossible to reuse oreven proofread [19]. These problems have been addressed by rule-based modellingmethods embodied by flexible languages such as Kappa [3] and BioNetGen [6],facilitating the creation of large and complex dynamical models. In contrast to theother modelling techniques, in rule-based models the system emerges with time,often showing unpredictable behaviour arising from elementary reaction rules.However, their construction and analysis often limit their potential application.For instance, even though provided with visualization tools for static and causalanalysis, a modeller has to resort to a self-assembled battery of tests trying to unfoldthe complexity of results [24]. In the domain of molecular biology the yeast pheromone cell cycle is anextensively studied example, both in-vivo and as a computational model. It’s oftenused to test hypotheses and investigate details related to mechanisms of signallingprocesses, such as dynamical pathway adaptation to demanding environmentalconditions [20], evolutionary preserved functional units (G-protein coupledreceptor signalling [4], mitogene-activated protein kinase [17]), signal-noisedecoupling [4] and information transmission [29].
Saccharomyces cerevisiae yeast, is a model species, capable of sexuallyreproducing in pairs of opposite sexes (type a and α ). The mating signal iscommunicated by either of the cell type through pheromone release ( a-factor ) [20].The model used in this study relates to a subcellular signalling activated in the othercell in response to the stimulus [24].The pathway represents canonical mechanisms of the subcelluar signalpropagation, such as G-protein activation via a GPCR, which is stimulated bypheromone ligands. The scaffold protein (Ste5) is recruited to the cell surface.4igure 1: A: Usual scheme of hierarchically structured molecular machines B:Potential combinations of complexes appearing over the simulation. The red-arrowpath represents the possible way of construction of decamer complex. Source: [24]Its major role is to insulate the kinase phosphorylation cascade from activatingother related pathways. Ste5 dimerizes and aggregates five more proteins thatphosphorylates each other forming an activation cascade. The last one is doublyactivated mitogene-activated protein kinase (MAPK, Fus3) that travels to thenucleus and releases the transcription factor (TF) from its inhibitors. In this wayTF transcribes genes regulating yeast mating behaviour.The study is examining the established hypothesis that signals in cells arepropagated via well defined complexes of molecular machines rather than looselyassembled and polymorphic ensembles.As it was shown, even though a conserved structure of decameric complexwas hardly present in the ensemble model over repeated simulations, the signalwas uninterrupted leading to St4 synthesis. Furthermore, contrary to the machinemodel, the ensemble model was able to replicate the experimental observation ofcombinatorial inhibition of phosphorylated Fuss3 (Fus3pp), when a copy-numberof St5 was increased 60 fold. Models were built with the rule-based formalismthat allows us to sample the sets of possible protein complexes the model canproduce, without explicitly imposing the set of species that are formed [24]. Moredetails about the formalism are in the next section. The code with the models’implementation is in the public domain and can be found as one of the attachedfiles to this paper. The subject of signalling pathways and networks has already been addressedby many modelling formalisms. However, one significant advantage of the5ule-based (RB) modelling is that it is able to express an infinite number ofreactions with a small and finite number of rules, i.e. a single reaction rule and itsparameters generalize a class of multiple interactions. In all of the other modellingmethods every chemical species has to be specified in advance which is highlyproblematic for species with dozens of phosphorylation sites and many possiblestates. This limiting factor makes these methods inappropriate for modellinglarge-scale complex dynamical systems.RB modelling is a method for the formal representation of combinatoriallycomplex signalling systems in both a qualitative and quantitative way. The majoridea is to replace equations with interaction rules. A rule representation is agraph-rewriting, where a graph specified on the left-hand-side is a pattern to bematched to instances in the current “mixture” of graphs and transformed intographs specified on the right-hand-side. Matching should satisfy embedding ,i.e. injections on agents (graphs) with the preservation of names, sites, internalstates and edges [3]. In the rule-based language nomenclature, “agents” are mostelementary molecules and “species” are agent complexes having particular states.A model can be translated into a system of ODE equations or simulated with astochastic algorithm. In the latter case, system trajectories are created by ruleselection at each time step, which is applied probabilistically, based on reactionrates and the initial/current copy number of agents [12]. Immediate consequencesof this formulation are different levels of rule contextualisation (“don’t care don’twrite”), without obligation of ad hoc assumptions about the system, modularity,reusability and extensibility of the modelling process [19]. Furthermore, the abilityto capture a protein as a graph with (binding) sites (e.g. domains) that have internalstate(s) (e.g. phosphorylated) gives a sufficiently expressive system to capture allof the principal mechanisms of signalling processes (e.g. dissociation, synthesis,degradation, binding, complex formation [18]) as well as insight into site-specificdetails of molecular interactions such as affinities, dynamics of post-translationalmodifications, domain availability, competitive binding, causality and the intrinsicstructure of interactions.RB modelling originates from concurrency system representations and assuch has the ability to capture dependencies, causality and conflicts in biologicalinteractions (overlooked by concentration-based ODEs). In other words,precedences occurring along trajectories (stories) reveal competing events leadingto a final state [3]. In the Kappa language, this feature is supported by thesyntax for graphical analysis provided in the simulation tool. Among these arediagrams with causal flows, flux and influence maps as well as contact maps[Figure 2] that facilitate the process of modelling. The causal flow diagramshows dependencies and conflicts in tracking indicated species and the flux maps[Figure 13], negative/positive activity transfers between rules with the quantitativecontributions on edge weights, both generated on the fly during a simulation [7].At any time of a simulation, a snapshot can be taken to record the collection ofspecies existing at that time. 6igure 2: Contact map defined without running the simulation with KaSa softwareaccompanying KaSim4.0 simulation tool. Yellow circles denote agents sites, greencircles agent states, and edges all potential connections between species.Figure 3: Flux map for pheromone pathway model in steady state simulated withKaSim4.0 7 .4 Datasets and simulations
Our time series datasets report changes of indicated molecular species’copy-number over 13,800 time points. Stochastic simulations were run for 4,600sec with 3 time points recorded per second. The system was first simulated over1,000 sec to reach a steady state and the initial mixture of protein complexes.Afterwords, a pheromone stimulus was introduced and the system was simulatedfor another 3,600 sec.Variables in the rule-based syntax are called “observables” ( %obs: ) and arespecified in a separate code block. A single observable can be mapped to oneor more rules conditioned on the level of its particularity. Hence, all the typesof created species not indicated as observables, become intractable. For instance,the observable %obs: Fus3PPFus3(T180~p,Y182~p) , which is a doublephosphorylated MAPK kinase Fus3, is associated with 14 rules of the followingform: • Fus3(dock!1,T180~p,Y182~p),Sst2(S539,mapk!1)-> Fus3(dock,T180~p,Y182~p),Sst2(S539,mapk) • Ste7(ste5!2,S359_T363~pp,mapk!1),Ste5(ste7!2),Fus3(dock!1,T180~p,Y182~p)-> Ste7(ste5,S359_T363~pp,mapk),Ste5(ste7),Fus3(dock,T180~p,Y182~p) • Fus3(dock!1,T180~p,Y182~p),Ste11(mapk!1)-> Fus3(dock,T180~p,Y182~p),Ste11(mapk) • Ste5(ste7!1),Ste7(mapk!2,ste5!1,S359_T363~pp),Fus3(dock!2,T180~p,Y182~u)-> Ste5(ste7!1),Ste7(mapk!2,ste5!1,S359_T363~pp),Fus3(dock!2,T180~p,Y182~p) • ...etc. However, as it is with the model specification, as it is infeasible to observe allpotentially important variations of species, we have to resort to what we know wewant to observe.Therefore, the considered dataset consists of standard 31 variables, patternedafter the original paper. There is also an extended 977 variable set but it has yet tobe explored with parallel computations. This number was dictated by the snapshotof all existing species at the pick of the simulation (~1,000 sec after the stimulusappearance) used then as a list of observables in the simulation.
To compare the outcome of applied methods, the model was simulated in twostates, which are called here “perturbed” and “unperturbed”. By the unperturbed8odel we call the standard “wild type” pathway dynamics. The perturbed onerelates to an experimentally observed phenomenon of combinatorial inhibition.It occurs when the copy-number of protein scaffold is largely increased andimpossible to fully assemble the complex that doubly activates Fus3 because allavailable members of the complex are used up on too many scaffold proteins.
Models written in Kappa language are supported by KaSim simulator. Bydefault, reaction rates are computed applying the law of mass action [2] but caneasily be adjusted to follow any kinetic law (e.g. Michaelis-Menten, Hill’s Law).What can be found under the hood is a direct particle-based variant of Gillespie’smethod. A general version of Gillespie’s method, also called exact stochasticsimulation algorithm (SSA) or kinetic Monte Carlo is a common simulationmethod for modelling time-evolution of stochastic chemical reaction systems.Numerical stochastic simulations are known to be computationally intensive anda lot of efforts have been made to improve their efficiency [10]. The most popularand effective solution, implemented in KaSim, is called “network-free” becauserules transforming reactant into products are applied directly at runtime to advancethe state of a system. As a result, it does not have to generate the full reactionnetwork beforehand and is therefore independent of its size [13].
Since a rule representation may vary in generalization, it can be applied tomore then one reaction that satisfies it. In other words rules serve as the reactionand species generator. It results in the unpredictability of species types and theirimportance emerging over time. Especially, if the model is of a large magnitude.On the other hand, the intrinsic modularity of Kappa syntax opens the pathto large integrative models, gradually assembled from the collections of reusablerule-based syntactic modules.However, models are currently built in a fully controllable and stringentfashion. It leaves the notion of modularity and its experimental aspect risky andunexplored. Thinking ahead, the rule notation can be understood as an updatable,machine readable and executable knowledge representation and storage, replacingthe usually manual revision of papers required in the model construction [15]. Wecould allow for uncontrollable, collective model growth in a form of rule stacksand then verify inner links and hierarchies in the system. That could guide anautomatic trimming of the model size. Hence, the question is whether and how wecould restrict a model to only these rules which are most informative.9igure 4: Model dynamics in unperturbed and perturbed states for characteristicprotein species. The perturbed ensemble model showed a decrease in Fus3activation (Fus3PP) being a key observation of the combinatorial inhibition. As wecan see, the synthesis of St4, which happens in the nucleus, was inhibited underperturbed state (plot with a flat line). 10ikewise the question lies what exactly does it mean to say a species is“important” or “informative” and what do groups of biologically important speciesshare with each other? Are they strongly interlinked modules of the system?Could they guide the rule-based modularity idea? The last question is especiallyimportant, since the scope of elementary parts of RB model is not yet clear. Arethese three, four, five reaction rules? Is there any other measure of mechanismsgranularity?Facing these kind of questions, we opted to test one of recently realisedmethods Correlation Explanations (CorEx) that applies information theoreticobjective to learn a hierarchy of more abstract representations of the data.Having the hierarchy of latent variables, we can pose more precise questions,such as: • What subset of species has common underlying dynamics? • How strong is the correlation between species grouped under the same latentvariable? • How many underlying hidden causes can be identified out of the observedhigh-dimensional species dynamics?and finally: • What could be the meaning of these “hidden causes” for molecularsignalling?The algorithm was previously applied in a biological context for identificationof targets for a cancer therapy [23]. Furthermore, a similar method mentionedin CorEx paper, called the information bottleneck, was previously applied fortrimming of gene ontology (GO) [14] to compress the data into a smallerrepresentation. In contrast, in CorEx the redundant information is preservedignoring uncorrelated random variables [27].
In this section, we discuss an information theoretic approach for building amodel on dynamics of the species concentrations. This method, proposed recentlyfor a general domain [27, 28], learns a hierarchy of latent variables that maximallyinform correlation between the observed species dynamics. Herein, correlationrefers to mutual information between a set of variables, and not just a linearcorrelation.Before we delve into the details of the method for our specific settings, it shouldbe noted that we disregard the time series nature of species copy-number dynamicsin this method application.Let G be a set of random variables representing copy-numbers for all thespecies. Then, X G is a joint random variable on G . For a species i , all the11opy-number values of the time series are assumed to be independent samples ofa random variable X i . As such, we can see that there is a contradiction sinceconsecutive samples in the time series would have a correlation (not i.i.d.). Forobtaining uncorrelated samples, one can take sub-samples of the time series, eitherat uniform interval or using any other relevant technique.Following the notations in [27], total correlation T C ( . ) between a set ofvariables X G is defined as below. T C ( X G ) = (cid:88) i ∈ G H ( X i ) − H ( X G ) (1) T C ( X G ) = I ( X ; · · · ; X g ) (2)Here H ( X i ) is entropy on a random variable X i ; and H ( X G ) is a joint entropyon X G . Another interpretation of T C ( X G ) is that it is mutual information, I ( . ) ,between all the variables in the set G . Typically mutual information is expressedbetween a pair where each element of the pair can be a set of random variables.Here, we are instead expressing mutual information between a g dimensional tripletof random variables, where g is a number of random variables in the set G .In our problem of learning a model of species dynamics, evaluating mutualinformation (or total correlation) between all random variables would not be ofmuch value. We are instead interested in evaluating mutual information betweensome subsets of species. However there are two problems along these lines:i) we do not know for which subsets of species we should evaluate mutualinformation and there can be a large number of permutations to explore (dependingon the size of a subset and the G set); ii) non-parametric estimation of mutualinformation between random variables is a hard problem [16, 25, 22, 9]. To tacklethese, we formulate our algorithm such that; i) we assume the individual speciescopy-number variables X i to be Gaussian; however, we do not assume that theset of variables has to be Gaussian (the later is a stronger assumption); ii) weare interested in only those subsets where variables have low mutual informationconditioning on a latent variable (or high mutual information between variablesexplained by a latent variable).Along these lines, let us define a new information theoretic quantity T C ( X G ; Y F ) . T C ( X G ; Y F ) = T C ( X G ) − T C ( X G | Y F ) (3) T C ( X G ; Y F ) is a total correlation (or mutual information) in the set of randomvariables X G explained by a set of latent variables Y F . T C ( X G | Y F ) is a totalcorrelation between the random variables X G that can not be explained by Y F ,i.e. conditional total correlation (conditional mutual information). If the latentvariables Y F can explain the total correlation in X G perfectly, then T C ( X G | Y F ) =0 . Ideally, we would like to learn Y F if exists. Thus intuitively, optimal Y would12orrespond to minimum of T C ( X G | Y F ) . In our formulation, we can expressoptimization of Y F as optimizing conditional distributions P Y | X .Let us first consider optimization of a single latent variable Y , and thengeneralize it later. arg max Y : p ( y | x G ) T C ( X G ; Y ) s.t. | Y | = k (4)Here Y is a discrete random variable; x G is a sample of the random variable X G and y is sample of Y . We optimize Y by learning the conditional distribution p ( y | x G ) . Now, we further extend it for multiple latent variables, where each latentvariable explains total correlation in a subset of the species concentration variables. arg max G j ,p ( y j | x Gj ) m (cid:88) j =1 T C ( X G j ; Y j ) (5)We have introduced m latent variables here with Y j explaining correlationbetween random variables in the corresponding subset G j ⊂ G . Here these subsetscan have an overlap.Optimizing the above objective function seems hard. However, as explainedin detail in [27, 28], it can be solved very efficiently for practical purposes. Weomit these optimization details and refer readers to the original papers introducingthis algorithm for the first time [27, 28]. Computational complexity of the methodis linear with respect to the number of samples and number of variables in the set G . Furthermore, as an unsupervised method, it requires no assumption about thelearned model. The code implementation for this algorithm is publicly availablefrom the original authors . The CorEx algorithm was applied both to perturbed and unperturbed datasetsand yielded two results with a single layer of hidden variables. In both cases,presented results were the maximal values the data sets could be divided to. Furtherincrease of the number of hidden units did not change their values.For both sets [Figures 6 & 5] CorEx found 6 latent variables, where 8-9 out of31 biologically plausible variables were expected.Biologically plausible variableswere thought to be all these observables that contained Ste5 scaffold protein,known to be a nucleation point of the system [24]. However, the preliminaryintuitions did not align perfectly with the algorithm results.In the unperturbed data tree we can distinguish two important groups, ‘0’ and‘1’. They are recognisable in the perturbed data tree as they preserved half of theirmembers from the former set. However, contrary to the unperturbed data tree, https://github.com/gregversteeg/CorEx The interpretation of the results was conducted on two levels. The first one isbased on the biological knowledge about the process. The second one is supportedby the dynamic analytical tools provided by the KaSim simulator.Generally speaking, the CorEx algorithm successively subsets data into adefined number of latent variables guided by species dynamics. Results appearsto be consistent with the differences between perturbed [Figures 9 & 10] andunperturbed models [Figures 7 & 8]. The group ordering, referring to the strengthof inter-correlations, shows which event takes the lead in two cases. Earliest eventsupstream to the formation of signalling cascade appeared to be the leading onesin the perturbed simulation. This is consistent with the fact that phosphorylationof Fus3 kinase distinctively drops when the amount of Ste5 protein scaffoldscompeting for binding kinases increases [Figure 4 in Section 2.4.2]. As the Fus3phosphorylation was not entirely blocked, the second latent variable relates toevents leading to Fus3 phosphorylation. Thus it is more consistent with the group‘0’ apart from transcription in the nucleus, which was inhibited in the perturbeddata.Owing to the static causal analysis provided by the simulation software,we can ask whether important observables relate to frequently executed rules.The most powerful visualization output is a flux map, which tracks the overallinfluence of rule applications on each other [7]. It is a directed and weightedgraph with rules as vertices and edges annotated with positive or negative weights[Figure 13]. Dependent on simulation parameters (selected time or number ofevents), a flux map might vary in structure (for details of our simulation parameterssee section 2.4).Both untrimmed graphs for unperturbed and perturbed models had 233 verticesbut they differ from each other in the edge number (unperturbed- 2,753, perturbed-2,422). Weights range from 0 to 407,172,203. An important note is that verticesare rules. Hence, to compare them with the output of CorEx, thus subsets ofobservables, first observables had to be mapped to rules they referred to [Figure 11& 12]. The weight cutoff varies with inverse proportion to the number ofobservables in flux map subgraphs. Therefore, we compared different subgraphsby gradually removing less and less vertices given a set of thresholds for weightvalues. The aggregated results are presented in the Figure 13. As we canobserve, the frequency of rule application relates to subsets obtained with theCorEx algorithm but cannot explain them fully.We stated some questions in the Section 3.1.1, which we would like tocomment on or even answer to in the following part. We have learned that thealgorithm used on time series datasets divided the species into most important ones14 i gu r e : R e s u lt o f - v a r i a b l e d a t a s e t w i t h o u t p e r t u r b a ti on . I n t r i n s i c d i m e n s i on a lit y w a s f ound t ob e . V a r i a b l e nu m b e r s a r e s ho w n i n t h e m i dd l e nod e o f eac hg r oup . E dg e w e i gh t s l ea d i ng fr o m a g r oup ce n t r e t o it s m e m b e r a r e d i c t a t e dby it s e xp l a n a t o r y c on t r i bu ti on t o r e m a i n i ngg r oup m e m b e r s i gu r e : R e s u lt o f - v a r i a b l e d a t a s e t w i t h p e r t u r b a ti on . I n t r i n s i c d i m e n s i on a lit y w a s f ound t ob e . V a r i a b l e nu m b e r s a r e s ho w n i n t h e m i dd l e nod e o f eac hg r oup . E dg e w e i gh t s l ea d i ng fr o m a g r oup ce n t r e t o it s m e m b e r a r e d i c t a t e dby it s e xp l a n a t o r y c on t r i bu ti on t o r e m a i n i ngg r oup m e m b e r s To compare results with the outcome of CorEx algorithm and discover otheraspects hidden in our data, we applied methods of nonlinear time series analysis[11]. Similarly, we used both the perturbed and unperturbed datasets (for moredetails about used datasets see Section 2.4). To bypass an obvious division intoa pre- and post-pheromonal stimulation, we cut the beginning 1,000 sec and used17igure 8: On the left, a fragment of unperturbed data-tree with Group ‘1’. Onthe right, the process diagram for a comparison. The second highly scored groupindicates less vital events, related to dimerization, and the impact of Kss1 kinaseon the Ste4 activation. 18igure 9: On the left, a fragment of perturbed data-tree with Group ‘0’. Onthe right, the process diagram for a comparison. The strongest group indicatesthe earliest events located upstream to the Fus3 poshorylation (observable calledFus3PP), preceding the complexation step19igure 10: On the left, a fragment of perturbed data-tree with Group ‘1’.On the right, the process diagram for a comparison. The second strongestgroup of perturbed data tree reflects weak correlation between members and theunsuccessful activation of transcription factor St4.20 i gu r e : A n e x a m p l e o f t w o fl ux m a p s ubg r a ph s m a pp i ngob s e r v a b l e s t o r e l a t e d r u l e s . T h e p e r t u r b e dd a t a s e t w it h w e i gh t s > , , , b l u e nod e s d e no t e ob s e r v a b l e s , r e dnod e s r u l e n a m e s . i gu r e : A n e x a m p l e o f t w o fl ux m a p s ubg r a ph s m a pp i ngob s e r v a b l e s t o r e l a t e d r u l e s . T h e p e r t u r b e dd a t a s e t w it h w e i gh t s > , , , b l u e nod e s d e no t e ob s e r v a b l e s , r e dnod e s r u l e n a m e s . N xN square, where a dot isplaced at ( i, j ) whenever x ( j ) is sufficiently close to x ( i ) . For the purposes ofthis study we selected an embedding dimension of 10 and time delay 5 to keep thecomputational time within reasonable limits.In general, the recurrent plot shows the times at which a phase space trajectoryvisits roughly the same area in the phase space [21]. The authors [5] definedsmall and large scale patters, textures and topologies respectively, to ease theirinterpretation.The resulting figures [16 & 15] are densely grey without distinctive texturesor patterns. However, the unperturbed set is seemingly brighter away from thediagonal and distinctively darker along it. This gradient is interpreted as theoccurrence of a progressive decorrelation at large time intervals involving a lineartrend or drift. The perturbed model presents dynamics pushed a bit more towardsrandomness.Next, we created plots, showing the average mutual information index (AMI)of a given time series for a specified number of lags [8].23 i gu r e : T h e l a r g e s ti n t e r s ec ti onb e t w ee n t h e n e t w o r ko f m o s t fr e qu e n tl y e x ec u t e d r u l e s a nd t h e l a t e n t g r oup s i s a pp a r e n tl y m o r e v i s i b l e i n t h e p e r t u r b e d m od e l , w h i c h c on fi r m s a l ac ko f c oh e r e n ce i n s p ec i e s b e h a v i ou r . S = − (cid:88) ij p ij ( τ ) ln p ij ( τ ) p ij (6)Next, we created a sample correlation integral plot [11]. The correlationintegral can be approximated by the correlation sum. The correlation sum countsthe number of pairs ( −→ x ( i ) , −→ x ( j )) in a given set of vectors that are at most (cid:15) apart. C ( (cid:15) ) = 1 N ( N − N (cid:88) i =1 N (cid:88) j = i +1 Θ( (cid:15) − (cid:107)−→ x ( i ) − −→ x ( j ) (cid:107) ) , −→ x ( i ) ∈ R m (7)25igure 19: Unperturbed data Figure 20: Perturbed dataAs the perturbation involved a single parameter and demonstrated naturallyoccurring phenomenon (not randomised), differences between these two plots werenot expected to be extreme. Nonetheless, the results are coherent both with theunderstanding of process and the CorEx algorithm. However, for our purposes,these methods present a more distanced view on the system dynamics, missing adecoupling problem of individual species relations. Overall, this project offered a fruitful chance for an exploration of multivariatetime series analysis. We have learned that the approach offered by the CorExmethod might be very promising in analysis of rule-based models. However, itrequires further testing with models that incorporate multiple randomly modifiedparameters and represent larger advanced processes. Further, we applied somenonlinear time series methods to our dataset. Though powerful, they offered abird’s-eye view understanding of system dynamics missing species-related details.However, both methods correctly interpreted the process offering a useful insightotherwise inaccessible. 26 eferences [1] Lee Bardwell, Xiufen Zou, Qing Nie, and Natalia L Komarova. Mathematicalmodels of specificity in cell signaling.
Biophysical journal , 92(10):3425–41,May 2007.[2] William S. Chylek, Lily A. and Harris, Leonard A. and Tung, Chang-Shungand Faeder, James R. and Lopez, Carlos F. and Hlavacek. Rule-basedmodeling: a computational approach for studying biomolecular site dynamicsin cell signaling systems.
Wiley interdisciplinary reviews. Systems biologyand medicine , 6(1):13–36, September 2014.[3] Vincent Danos.
Rule-Based Modelling of Cellular Signalling , volume 4703.Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.[4] Gauri Dixit, Joshua B Kelley, John R Houser, Timothy C Elston, andHenrik G Dohlman. Cellular noise suppression by the regulator of G proteinsignaling Sst2.
Molecular cell , 55(1):85–96, 2014.[5] Jean-Pierre Eckmann, S Oliffson Kamphorst, and David Ruelle. Recurrenceplots of dynamical systems.
Europhys. Lett , 4(9):973–977, 1987.[6] James R Faeder, Michael L Blinov, and William S Hlavacek. Rule-basedmodeling of biochemical systems with BioNetGen.
Methods in molecularbiology (Clifton, N.J.) , 500:113–67, January 2009.[7] Jérôme Feret and Jean Krivine.
KaSim3 reference manual , 2012.[8] Andrew M Fraser and Harry L Swinney. Independent coordinates for strangeattractors from mutual information.
Physical review A , 33(2):1134, 1986.[9] Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. Efficient estimation ofmutual information for strongly dependent variables. In
AISTATS’15 , 2015.[10] Daniel T Gillespie. Stochastic simulation of chemical kinetics.
Annual reviewof physical chemistry , 58:35–55, January 2007.[11] Rainer Hegger, Holger Kantz, and Thomas Schreiber. Practicalimplementation of nonlinear time series methods: The tisean package.
Chaos: An Interdisciplinary Journal of Nonlinear Science , 9(2):413–435,1999.[12] William S Hlavacek, James R Faeder, Michael L Blinov, Richard G Posner,Michael Hucka, and Walter Fontana. Rules for modeling signal-transductionsystems.
Science’s STKE : signal transduction knowledge environment ,2006(344):re6, July 2006. 2713] Justin S. Hogg, Leonard A. Harris, Lori J. Stover, Niketh S. Nair,and James R. Faeder. Exact Hybrid Particle/Population Simulation ofRule-Based Models of Biochemical Systems.
PLoS Computational Biology ,10(4):e1003544, April 2014.[14] Bo Jin and Xinghua Lu. Identifying informative subsets of the Gene Ontologywith information bottleneck methods.
Bioinformatics (Oxford, England) ,26(19):2445–51, October 2010.[15] Agnes Kohler, Jean Krivine, and Jakob Vidmar. A Rule-Based Model of BaseExcision Repair. 2014.[16] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimatingmutual information.
Physical review E , 69(6):066138, 2004.[17] A Levchenko, J Bruck, and P W Sternberg. Scaffold proteins maybiphasically affect the levels of mitogen-activated protein kinase signalingand reduce its threshold properties.
Proceedings of the National Academy ofSciences of the United States of America , 97(11):5818–5823, 2000.[18] Bing Liu and P S Thiagarajan. Modeling and analysis of biopathwaysdynamics.
Journal of bioinformatics and computational biology , 10(4), 2012.[19] Carlos F Lopez, Jeremy L Muhlich, John A Bachman, and Peter K Sorger.Programming biological models in Python using PySB.
Molecular systemsbiology , 9:646, January 2013.[20] Abhishek Majumdar, Stephen D Scott, Jitender S Deogun, and Steven Harris.Yeast pheromone pathway modeling using Petri nets.
BMC bioinformatics ,15 Suppl 7(Suppl 7):S13, January 2014.[21] N. Marwan. A historical review of recurrence plots.
The European PhysicalJournal Special Topics , 164(1):3–12, 2008.[22] Dávid Pál, Barnabás Póczos, and Csaba Szepesvári. Estimation of rényientropy and mutual information based on generalized nearest-neighborgraphs. In
Advances in Neural Information Processing Systems , pages1849–1857, 2010.[23] Shirley Pepke, Greg Ver Steeg, and Aram Galstyan. Using Total CorrelationExplanation to Identify Cancer Therapeutic Targets. 2015.[24] Ryan Suderman and Eric J. Deeds. Machines vs. Ensembles: EffectiveMAPK Signaling through Heterogeneous Sets of Protein Complexes.
PLoSComputational Biology , 9(10):1–35, 2013.[25] Taiji Suzuki, Masashi Sugiyama, Jun Sese, and Takafumi Kanamori.Approximating mutual information by maximum likelihood density ratioestimation.
FSDM , 4:5–20, 2008. 2826] Monica Duffy Toft and Yuri M Zhukov. Islamists and nationalists: Rebelmotivation and counterinsurgency in russia’s north caucasus.
AmericanPolitical Science Review , 109(02):222–238, 2015.[27] Greg Ver Steeg and Aram Galstyan. Discovering structure inhigh-dimensional data through correlation explanation. In Z. Ghahramani,M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors,
Advances in Neural Information Processing Systems 27 , pages 577–585.Curran Associates, Inc., 2014.[28] Greg Ver Steeg and Aram Galstyan. Maximally informative hierarchicalrepresentations of high-dimensional data. In
AISTATS’15 , 2015.[29] Richard C Yu, C Gustavo Pesce, Alejandro Colman-Lerner, Larry Lok, DavidPincus, Eduard Serra, Mark Holl, Kirsten Benjamin, Andrew Gordon, andRoger Brent. Negative feedback that improves information transmission inyeast signalling.