[PDF] Percolation on the gene regulatory network

Abstract

We consider a simplified model for gene regulation, where gene expression is regulated by transcription factors (TFs), which are single proteins or protein complexes. Proteins are in turn synthesised from expressed genes, creating a feedback loop of regulation. This leads to a directed bipartite network in which a link from a gene to a TF exists if the gene codes for a protein contributing to the TF, and a link from a TF to a gene exists if the TF regulates the expression of the gene. Both genes and TFs are modelled as binary variables, which indicate, respectively, whether a gene is expressed or not, and a TF is synthesised or not. We consider the scenario where for a TF to be synthesised, all of its contributing genes must be expressed. This results in an ``AND'' gate logic for the dynamics of TFs. By adapting percolation theory to directed bipartite graphs, evolving according to the AND logic dynamics, we are able to determine the necessary conditions, in the network parameter space, under which bipartite networks can support a multiplicity of stable gene expression patterns, under noisy conditions, as required in stable cell types. In particular, the analysis reveals the possibility of a bi-stability region, where the extensive percolating cluster is or is not resilient to perturbations. This is remarkably different from the transition observed in standard percolation theory. Finally, we consider perturbations involving single node removal that mimic gene knockout experiments. Results reveal the strong dependence of the gene knockout cascade on the logic implemented in the underlying network dynamics, highlighting in particular that avalanche sizes cannot be easily related to gene-gene interaction networks.

Full PDF

PPercolation on the gene regulatory network

Giuseppe Torrisi, Reimer Kühn, and Alessia Annibale

King’s College London, UKE-mail: [email protected]

15 July 2020

Abstract.

We consider a simpliﬁed model for gene regulation, where gene expressionis regulated by transcription factors (TFs), which are single proteins or proteincomplexes. Proteins are in turn synthesised from expressed genes, creating a feedbackloop of regulation. This leads to a directed bipartite network in which a link from agene to a TF exists if the gene codes for a protein contributing to the TF, and a linkfrom a TF to a gene exists if the TF regulates the expression of the gene. Both genesand TFs are modelled as binary variables, which indicate, respectively, whether a geneis expressed or not, and a TF is synthesised or not. We consider the scenario where fora TF to be synthesised, all of its contributing genes must be expressed. This resultsin an “AND” gate logic for the dynamics of TFs. By adapting percolation theory todirected bipartite graphs, evolving according to the AND logic dynamics, we are ableto determine the necessary conditions, in the network parameter space, under whichbipartite networks can support a multiplicity of stable gene expression patterns, undernoisy conditions, as required in stable cell types.In particular, the analysis reveals the possibility of a bi-stability region, wherethe extensive percolating cluster is or is not resilient to perturbations. This isremarkably diﬀerent from the transition observed in standard percolation theory.Finally, we consider perturbations involving single node removal that mimic geneknockout experiments. Results reveal the strong dependence of the gene knockoutcascade on the logic implemented in the underlying network dynamics, highlightingin particular that avalanche sizes cannot be easily related to gene-gene interactionnetworks.

1. Introduction

The attempt to understand the complexity of life in terms of ﬁrst principles represents afascinating challenge of our times. The biochemical basis of life relies on how informationis encoded in the genome and how it is expressed in living cells.Gene regulatory networks (GRNs) are a popular way to conceptualize the basicmechanisms involved in encoding and expression of genes at a ‘mesoscopic’ level, withouthaving to consider the full underlying microscopic biochemical detail. In this framework,genes are represented as the nodes of a network, where edges aim to incorporatediﬀerent biochemical processes linked to regulation. A state variable is associated witheach node of the network, which describes the expression level of the gene. GRNs a r X i v : . [ q - b i o . M N ] J u l ercolation on the gene regulatory network i.e. from geneexpression [5]. A popular way to study regulatory processes is to perform gene knockoutexperiments, which consist of the removal of nodes from the GRN [6]. Gene expressionpatterns of the perturbed and unperturbed networks are compared and used to informon the regulatory processes and derive eﬀective gene perturbation networks [7]. Inorder to fully capture all collective eﬀects in gene regulation, models used to interpretthe outcome of these experimental protocols would have to take many-body interactionsinto account.Kauﬀman’s pioneering model of random Boolean networks (RBNs) accounts forcooperative regulation in GRNs [8]. In RBNs, gene states are modelled by Booleanvariables, which are updated at regular time intervals according to a random Booleanfunction of the states of their neighbouring genes. It is one of the characteristic featuresof RBNs to exhibit attractors in the gene expression dynamics, which are interpretedas stable cell types. RBNs aim to capture universal, i.e. statistical, properties ofthe dynamics of synthetic random networks that are compatible with those observedin biological systems and have been shown to reproduce statistical features of geneknockout experiments [9]. While models of small Boolean networks constructed throughknown regulatory circuits have provided several biological insights [10], large scale RBNlack biological interpretation and are hard to calibrate. Having been formulated longbefore the advent of the genomics era, they dot not incorporate much of the biologicalinformation which has become available in recent years.Nowadays, we can leverage on our improved understanding of the biologicalprocesses underpinning gene regulation, to make a more informed choice of themathematical model. Gene regulation consists of several distinct biological processesthat regulate gene expression. Gene expression starts with the transcription of a geneDNA sequence into RNA. This is generally followed by translation, i.e. the synthesis of aspeciﬁc protein from the associated RNA sequence, although other regulatory processesare involved [11, 12]. The transcription of genetic information from DNA to RNAis controlled by transcription factors (TFs), proteins that selectively bind to speciﬁclocations of the DNA sequence, called promoter regions, and modulate the expression ofspeciﬁc genes. TFs are themselves synthesised from expressed genes, via transcriptionand translation. The regulatory role of TFs in gene expression has been little investigatedin the context of GRNs and RBN models [13]. The regulatory eﬀect of an individual TFbinding to a given site of the promoter region of a gene is not fully known, let alone thecooperative eﬀects of several TFs binding to several sites of the promoter region [14]. Inaddition, cooperation of TFs has been observed to play an important role to describetheir binding aﬃnity to the genome [15]. ercolation on the gene regulatory network Left: Diagram of the regulatory processes incorporated into the model. In thisexample, µ and ρ are two single-protein TFs synthesised from gene j and gene k , respectively.The TF ν is a TF complex, whose synthesis requires the simultaneous expression of j and k .Each of the TFs µ , ρ , and ν regulate the expression of gene i . Right: the same process iswritten in terms of a bipartite directed graph, with two sets of nodes, genes and TFs, whichcan be either single proteins or protein complexes. In this work, we consider a random Boolean bipartite network model which aimsto incorporate some key features of the biological processes summarised above, inparticular, the role of TFs in gene expression and cooperativity eﬀects. We assumethat diﬀerent (single-protein) TFs bind to diﬀerent sites of the promoter region of agene, see Fig. 1 for an illustration. Moreover, (single-protein) TFs can cooperativelybind to the promoter region with a potentially new regulatory power. Thus, thesecombinations of single-protein TFs can be regarded as novel TFs, which are ‘complexes’of single-protein TFs ; (see left panel of Fig. 1). This model can be translated into abipartite network, with two sets of nodes, i.e. genes and TFs, where TFs can be eithersingle proteins or protein complexes. There is a directed edge from a TF to a gene ifthe former regulates the expression of the latter. Conversely, there is a directed edgefrom a gene to a TF if the former codes for a protein that is a constituent of the latter(see the right panel in Fig. 1). The number of diﬀerent TFs that can regulate a gene ischaracterised probabilistically so that the number of promoter sites associated with eachgene and other microscopic details can be left open. Both genes and TFs are modelledas binary variables, which indicate, respectively, whether a gene is expressed or not, andwhether a TF is synthesised or not.A key feature of the model is that a TF complex can only exist if all of the genescoding for proteins that constitute it are expressed at the same time . This fundamentallevel of cooperativity is exempliﬁed by the choice of a speciﬁc Boolean function, to modelthe dynamics of TFs synthesis, namely an AND logic gate of the input gene expressionlevels. Following Kauﬀman’s reasoning, we assume that in a given organism, all the cellshave the same bipartite GRN. Hence, diﬀerent cells are interpreted as distinct stableattractors of the gene expression dynamics running on the bipartite GRN, which is,in our model, coupled to the dynamics of TFs synthesis. Given that all ﬁnite systems ; The term ‘complex’ is used here and elsewhere in a loose way, to mean a combination of single-proteinTFs that act together: the model is agnostic of whether they form a physical complex or not. ercolation on the gene regulatory network § § https://zenodo.org/record/3798332 ercolation on the gene regulatory network Genes i j ... m μ i r iv νμ c λ out d k out k λ TFs

Figure 2: Schematic representation of the bipartite graph. The link m µi describes themembership of gene i to the production of complex µ . The link r iν indicates that TF ν regulates gene i .

2. Model and deﬁnitions

We construct a directed bipartite network where genes and TFs are the two sets of nodes.A link from a gene to a TF exists if the protein coded by the gene contributes to (i.e.is a member of) the TF. The link from a TF to a gene represents the regulation of thegene expression via promotion or inhibition of the gene expression. From a biologicalperspective, the latter connections aim to describe the regulatory processes related toTFs binding to the promoter region of a gene, see Fig. 1.Let N be the number of genes and αN the number of TFs. The bipartite graphis uniquely determined by the two bi-adjacency matrices, which are the membershipmatrix m “ p m µi q , m P R αN,N and the regulatory matrix r “ p r iµ q , r P R N,αN , seeFig. 2. Their entries, regarded as random variables, have the following interpretation: m µi “ , if gene i contributes to TF µ , else (1) r iµ $’’&’’% ą , if TF µ promotes transcription of gene i ă , if TF µ inhibits transcription of gene i “ , else. (2)For any realization of the bi-adjacency matrices r and m , we can deﬁne d in i p r q as thenumber of TFs that regulate gene i and d out i p m q as the number of TFs to which gene i contributes: d in i p r q “ αN ÿ µ “ Θ p| r iµ |q d out i p m q “ αN ÿ µ “ m µi (3) ercolation on the gene regulatory network c in µ p m q as the number of genes that contribute to TF µ (or sizeof the TF complex µ ) and c out µ p r q as the number of genes that are regulated by TF µ : c in µ p m q “ N ÿ i “ m µi c out µ p r q “ N ÿ i “ Θ p| r iµ |q . (4)Since promoter regions have ﬁnite length, it is plausible to assume that the sizes of TF“complexes” that can contribute to regulating a gene are ﬁnite, i.e. they are composedof a ﬁnite number of proteins. We also assume that TFs can only bind to a ﬁnite numberof diﬀerent promoter sites in the genome. } For these reasons, we are led to consider thecase where the bi-adjacency matrices m “ p m µi q and r “ p r iµ q are sparse.In the present paper, we will consider synthetic GRNs in the conﬁguration modelclass, i.e. we will study ensembles of maximally random directed bipartite graphs r , m with constrained in-degree and out-degree sequences of genes d in i , d out i and TFs c in µ , c out µ , drawn from the distributions P in , out D p d q “ N ´ ř i δ d,d in , out i and P in , out C p c q “p αN q ´ ř µ δ c,c in , out µ respectively. We use x c in y , x d in y to denote the mean in-degree of TFsand genes respectively, and x c out y , x d out y to denote the mean out-degrees. Conservationof links implies x d out y “ α x c in y and x d in y “ α x c out y . Without loss of generality, we willmostly set α “ , corresponding to having the same number of genes and TFs.It is important to emphasize that any bona ﬁde model of a GRN must respect thefollowing fundamental facts about gene regulation: the regulatory part of the genomeis deﬁned by the property that genes code for at least one TF, or a component of it.Therefore, the out-degree of genes in a GRN must be larger than or equal to . Thesame is true for the in-degree of TFs, as each TF is comprised of at least one protein.Besides, as all genes are regulated, they must, in a GRN, have an in-degree larger thanor equal to . However, TFs can have out-degree zero in the regulatory sector of thegenome, as they may regulate other genes that are not themselves regulatory. Thisleads to the following biological constraints on the degree distributions: P in D p q “ , P in C p q “ , and P out D p q “ .In the following, we will derive a general theory for arbitrary degree distributions,satisfying these constraints. Our analysis will show that diﬀerent degree distributionscan lead to diﬀerent behaviours. To exemplify results, we will consider two families ofdegree distributions which display diﬀerent behaviour. In the ﬁrst family, the in-degreesof genes and TFs are random variables obtained as d in i “ ` ˜ d in i and c in µ “ ` ˜ c in µ respectively, where ˜ d in i is a random variable drawn from a Poisson distribution withaverage x d in y ´ and ˜ c in µ is a Poisson random variable with average x c in y ´ [21]. Theout-degrees of genes and TFs are given by d out i “ ` ˜ d out i and c out µ respectively, where ˜ d out i , c out µ are also drawn from Poisson distributions, with average x d out y ´ and x c out y , } We are aware that this is still debated in the community, see [20] for the TF speciﬁcity paradox. ercolation on the gene regulatory network P γ p k q with a power-law tail P γ p k q „ γk ´ γ ´ and γ ą .Speciﬁcally, we will consider the discrete fat-tailed distribution P γ p k q “ k ´ γ ´ p k ` q ´ γ ,which is deﬁned for positive integers ď k P N and has the desired power-law behaviourfor large k , but other choices could be made. We characterise the dynamic state of a cell in terms of time-dependent gene expressionlevels. We denote by n i p t q P t , u for i “ , . . . N the expression level of gene i and by σ µ p t q P t , u for µ P , . . . αN the expression level of TF µ , both at time t . TFs aresingle proteins or protein complexes. A TF exists only if all the genes producing thecomponents of the complex are expressed at the same time. TFs can act as enhancersor inhibitors of gene activation. A gene i is expressed if the overall regulatory eﬀectof TFs that are enhancers exceeds the total regulatory eﬀect of inhibitors by a suitablemargin. Assuming a linear model for the combination of regulatory eﬀects of the TFscontributing to the regulation of a gene, we thus have a dynamics of the form n i p τ ` q “ Θ « αN ÿ µ r iµ σ µ p τ q ´ θ i ´ T z i p τ q ﬀ , (5) σ µ p τ q “ ź i PB in µ n i p τ q . (6)Here Θ is the Heaviside theta function, for which Θ p x q “ , if x ą , and Θ p x q “ ,if x ď , and θ i is the activation threshold. The set of predecessors of TF µ , i.e. , theset of proteins contributing to TF µ , is denoted by B in µ , and for the present purposes wecan identify proteins with the genes coding for them. Gene expression is a noisy processthat we model by introducing random variables z i p t q , which we take to be i.i.d. zero-mean, unit-variance random variables; the parameter T is introduced to parametrise thenoise level in the dynamics. Depending on the choice of the regulatory and membershipmatrices r and m and of the threshold θ , the dynamics described by Eqs. (5), (6) is ableto capture a rich spectrum of Boolean functions in regulation [22]. The gene expression dynamics, as formulated above, must be compatible with theexistence of multi-cellular organisms. Associating cell types with stable attractors ofthe dynamics in the noiseless limit, and assuming that diﬀerent cell types share thesame GRN, we thus have to require that the dynamical system allows for the existenceof a multiplicity of stable attractors in the noiseless limit, which need to survive as stablephases, under (realistic) noisy conditions. ercolation on the gene regulatory network i are statistically independent of one another, once the state of the node i is known. Thisapproximation holds if the graph is at least locally tree-like [27], and become exact inthe large N limit.The existence of a giant component under the logical rules imposed in the presentmodel can be interpreted as a heterogeneous k-core percolation problem in directednetworks. Homogeneous directed k-core percolation has a game theory interpretation[28]: a node requires a cost k to remain engaged, and it receives proﬁt from in-comingengaged neighbours. Nodes remain engaged if the payoﬀ is non-negative. In the presentcase, the situation is more involved, as there are two sets of nodes: genes which have cost , and TFs which have a cost equal to their in-degree. In comparison to their in-degree,these two sets of nodes thus have the highest and lowest possible cost; it is, therefore,non-trivial to predict the conﬁguration that will be stable given the above rules.For (conventional) directed graphs there are several diﬀerent types of giantcomponent one can consider. They are the strongly connected component (SCC), thein-component, and the out-component, which are formally deﬁned as follows, see Fig. 3. Deﬁnition 1 (Strongly connected component (SCC))

The SCC is the set ofnodes that are mutually connected by a directed path (which is itself contained in theSCC).ercolation on the gene regulatory network (a) TFs i j ... νμ k λ aSCCin Genes

SCC (b)

Figure 3: ( a ) Representation of in-component (green dotted line), “AND” stronglyconnected component (red solid line), and “AND” out-component (blue dashed line).( b ) Diﬀerence between SCC and aSCC. The TF µ belongs to SCC but does not belongto aSCC. Deﬁnition 2 (In-component)

The in-component is the set of nodes of the graph fromwhich the SCC is reachable via a directed path, i.e. the set of ancestors of the SCC.

Deﬁnition 3 (Out-component (OC))

The out-component is the set of nodes of thegraph that can be reached via a directed path when starting the path in the SCC, i.e. theset of descendants of a node in the SCC.

These deﬁnitions need to be adapted for the case discussed in the present paper, inwhich the “AND” logic must be imposed to ensure the existence of TFs, which in turnaﬀects the deﬁnition of the giant components. The revised deﬁnitions of the variousgiant components all rest on the deﬁnition of a a valid path through a set of nodes : avalid path through a set of nodes requires that for each TF µ on the path, all of µ ’spredecessors must also belong to that set. This means that for a TF to belong to agiant component, all of the genes contributing constituent proteins of the TF also needto belong to the giant component (see Fig. 3) for an illustration). We thus have: Deﬁnition 4 (“AND” SCC (aSCC) )

The aSCC is a subset of a SCC formed ofnodes that are mutually connected through valid paths within the aSCC, i.e. a TF µ belongs to the aSCC only if all of µ ’s predecessors belong to the aSCC as well. Deﬁnition 5 (“AND” out-component (aOC) )

The aOC is a subset of an out-component formed by nodes that can be reached through valid paths originating in theaSCC.

Note that the “AND” condition is trivially satisﬁed for the in-component, as foreach node in the in-component, its predecessors are by deﬁnition also part of the in-component, as shown in Fig. 3. In what follows, we study the existence of a giant aOC,which was previously investigated in Ref. [19]. However, in [19] a family of bipartite ercolation on the gene regulatory network

3. Existence of the giant out-component

We follow [19] to analyse conditions for the existence of a giant aOC, and determine itssize in terms of both the fraction of genes and the fraction of TFs in the giant aOC. Tothis end, we consider a bipartite graph of genes and TFs as described above. Indicatorvariables for genes and TFs are introduced, which signify whether or not they belongto the giant aOC. Thus, let n i “ t , u be the indicator variable for the gene i , whichis 1 if i belongs to the giant aOC, and 0 otherwise. Analogously, let σ µ “ t , u be theindicator variable for TF µ . Gene i is in the giant aOC, if at least one of its predecessors is also in the giant aOC (this is equivalent to an “OR” logic gate). Conversely, TF µ isin the giant aOC, if all of its predecessors are also in the giant aOC (this is equivalentto an “AND” logic gate). Formally, n i “ ´ ź µ PB in i p ´ σ p i q µ q ,σ µ “ ź i PB in µ n p µ q i ; (7)here B in i and B in µ denote the sets of TFs which are predecessors of gene i and the sets ofgenes that are predecessors of TF µ , respectively. The variable σ p i q µ indicates whetherthe TF µ does p σ p i q µ “ q or does not p σ p i q µ “ q belong to the giant aOC in the cavitygraph from which the gene i is removed; in a similar fashion, the variable n p µ q i indicateswhether gene i does or does not belong to the giant aOC in the cavity graph from whichthe TF µ is removed. By analogous reasoning, the indicator variables for the cavity ercolation on the gene regulatory network n p µ q i “ ´ ź ν PB in i z µ p ´ σ p i q ν q ,σ p i q µ “ ź j PB in µ z i n p µ q j . (8)Eqs. (8) constitute a set of self-consistency equations for the sets t n p µ q i u and t σ p i q µ u ofcavity indicator variables, which can be solved iteratively for large single instances ofGRNs. We are, however, mainly interested in the average fraction g of genes and thefraction t of TFs in the giant aOC, deﬁned as g “ N ÿ i n i , t “ αN ÿ µ σ µ . (9)These can be obtained by inserting the deﬁnitions Eq. (7) into the sums in Eq. (9),and in the large system limit separately evaluating contributions coming from diﬀerentin-degrees of genes and TFs respectively, giving g “ ÿ d “ P in D p d q “ ´ p ´ ˜ t q d ‰ ,t “ ÿ c “ P in C p c q ˜ g c . (10)Here ˜ t “ x σ p i q µ y is the probability that a TF regulating a gene belongs to the giantaOC, while ˜ g “ x n p µ q i y is the probability that a constituent gene of a TF belongs tothe giant aOC. In deriving Eqs. (10), the usual approximation has been made that inlarge sparse, thus locally tree-like systems, averages over products of random variablesinvolving diﬀerent neighbours of a node factor. Also, it is assumed that the probabilityof a TF regulating a gene belonging to the giant aOC is independent of the in-degree ofthe gene, and in an analogous fashion, that the probability that a constituent gene of aTF is on the giant aOC is independent of the in-degree of the TF.In a similar fashion Eqs. (8) for the cavity indicator variables can be translatedinto a pair of coupled self-consistency equations for the cavity probabilities ˜ g and ˜ t . Itwill turn out, however, that, for the family of networks we are considering, with sparseand directed interactions and uncorrelated in- and out-degrees, the equations for thecavity probabilities will be identical to the equations for the probabilities themselves,Eqs. (10). Due to the sparsity and directedness of the links, the probability that a geneis both a predecessor and a successor of a given TF (and similarly the probability thata TF is both a predecessor and a successor of a gene) will be inversely proportional tothe system size and thus negligible in the large N limit. As a consequence, the ‘cavity-exclusions’ embodied in the product terms of Eqs. (8) will almost surely be inactive .With these remarks in mind, we obtain ˜ g “ ÿ d d x d out y P out D p d q ÿ d “ P in | out D p d | d q ” ´ ` ´ ˜ t ˘ d ı , ˜ t “ ÿ c c x c out y P out C p c q ÿ c “ P in | out C p c | c q ˜ g c , (11) ercolation on the gene regulatory network P in | out D p d | d q is the conditional probability that a gene has in-degree d given thatits out-degree is d , similarly P in | out C is the conditional probability of the in-degree of a TFgiven its out-degree. Here, the distributions of out-degrees of neighbours of genes andTFs, respectively, appear due to the fact that the probability that a gene is a predecessorof a TF is actually proportional to the out-degree of the gene (and similarly for the TFs).Assuming absence of correlations between in- and out-degrees, i.e. P in | out D p d | d q “ P in D p d q and P in | out C p c | c q “ P in C p c q , one ﬁnally has ˜ g “ ÿ d “ P in D p d q “ ´ p ´ ˜ t q d ‰ ˜ t “ ÿ c “ P in C p c q ˜ g c (12)i.e. — in view of Eqs. (10) — that g ” ˜ g and t ” ˜ t . We thus ﬁnally have g “ ÿ d “ P in D p d q “ ´ p ´ t q d ‰ “ ´ G in D p ´ t q ” f p t q ,t “ ÿ c “ P in C p c q g c “ G in C p g q ” f p g q , (13)where we have also shown expressions of right hand sides in terms of the generatingfunctions G in D and G in C of the in-degree distributions of genes and TFs, respectively.Note that the system of equations (13) always has the trivial solution p ¯ g, ¯ t q “ p , q .However, due to the biological constraints P in C p c “ q “ P in D p d “ q “ on thedegree distributions, these equations also always have the solution p ¯ g, ¯ t q “ p , q . Thesesolutions correspond to the extreme cases where the system has no giant aOC, or thegiant aOC consists of the entire graph, respectively. A common strategy to ﬁnd solutionsof Eq. (13) is to obtain them as ﬁxed points under forward iteration using the map g p τ ` q “ ÿ d “ P in D p d q “ ´ p ´ t p τ qq d ‰ t p τ q “ ÿ c “ P in C p c q g c p τ q . (14)Solutions of Eqs. (13) can alternatively be obtained through a graphical method. InEqs. (13) one can substitute one relation into the other, to obtain the two independentequations g “ f p f p g qq ” F p g q and t “ f p f p t qq ” F p t q . In Fig. 4 we show the graphsof F p g q and F p t q for three diﬀerent choices of the connectivity in Type I networks,corresponding to (i) a situation where the giant aOC solution is unstable (panel a),(ii) coexistence of a solution representing a stable giant aOC and a stable solutionrepresenting the absence of a giant aOC (panel b), and (iii) existence of a stable giantaOC (panel c).To assess the stability of the solutions p g, t q “ p , q and p g, t q “ p , q under forwarditeration, we consider the Jacobian for the functions f p t q and f p g q . The condition forstability under forward iteration is that the maximum eigenvalue of the Jacobian at aﬁxed point is smaller than 1, which becomes x d in y P in C p c “ q ă (15) ercolation on the gene regulatory network . . . . . . x . . . . . . F , ( x ) genesTFs (a) . . . . . . x . . . . . . F , ( x ) genesTFs (b) . . . . . . x . . . . . . F , ( x ) genesTFs (c) Figure 4:

Solution of Eq. (13), obtained by intersecting the curves F p x q (dotted line) and F p x q (solid line), respectively, with the x axis (dashed line). Results shown for Type I networks.The intersections give the fraction of genes and TFs belonging to aOC, respectively. Panelsshows three diﬀerent regimes (from left to right): stable solution only in p , q , bi-stability in p , q and p , q , and stable solution only in p , q , as resulting from the stability criteria inEqs. (15), (16). Results are shown for Type I networks with average connectivities x c in y “ , x d in y “ (a), x c in y “ , x d in y “ (b), x c in y “ , x d in y “ (c). for p g, t q “ p , q , and x c in y P in D p d “ q ă (16)for p g, t q “ p , q . For some network distributions, the stability conditions of Eqs. (15)and (16) may be both satisﬁed at the same time. In that case, both p , q and p , q arestable ﬁxed points, and the dynamical system described by the map Eq. (14) exhibitsbi-stability.Type I and type II networks as speciﬁed in 2.1 are examples of networksthat can and cannot realise bi-stability, respectively. Indeed, for networks of typeI with connectivities deﬁned in terms of shifted Poisson degree distributions, thecondition Eq. (16) for the stability of the p g, t q “ p , q solution can be rewritten as x c in y ă exp “ x d in y ´ ‰ , and the complementary condition Eq. (15) for the stabiltiy ofthe non-percolating p g, t q “ p , q solution as x c in y ą ` log “ x d in y ‰ . It is relativelystraightforward to demonstrate that there is for any choice of x d in y a range values of x c in y that satisfy both conditions. For type II networks on the other hand, Eq. (16) ercolation on the gene regulatory network (a) . . . . . . x . . . . . . F , ( x ) geneT F . . . . . . x − . − . . . F , ( x ) − x (b) . . . . . . x . . . . . . F , ( x ) geneTF (c) Figure 5: Solution of Eq. (13), obtained by intersecting the curves F p x q (dotted line)and F p x q (solid line), respectively, with the x axis (dashed line). Same quantity as inFig. 4 for Type II networks. Panels shows three diﬀerent regimes (from left to right):stable solution only in p , q , un-stable solution in p , q and p , q , and stable solutionin p , q , as resulting from the stability criteria in Eqs. (15), (16). Inset in (a) and (b)shows the plot of F , p x q ´ x , to magnify the nature of transition. Parameters of thenetworks: x c in y “ . , γ “ . (a), x c in y “ . , γ “ . (b), x c in y “ . , γ “ . (c).becomes ζ p γ q “ x c in y ă exp “ x d in y ´ ‰ , where ζ p γ q is the Riemann Zeta-function, andEq. (15) gives γ ă log ` x d in y{ ` x d in y ´ ˘˘ . Evaluating these conditions, one ﬁnds thatthey are never satisﬁed at the same time for any γ ą .For networks that allow a range of connectivities where there is coexistence of twostable solutions, g “ and g “ , a discontinuous transition from a percolating to a non-percolating regime is expected, when connectivity parameters are varied. Conversely, fornetworks where there is always only a single stable solution, as connectivity is varied, thetransition from a percolating to a non-percolating situation is expected to be continuous.

4. Stability of the giant aOC and the non-percolating solution

Equations (13) describe the fractions of genes and TFs that belong to the giant aOC. Wehave identiﬁed two solutions which postulate extreme situations, where either the entire ercolation on the gene regulatory network

We investigate the stability of the solution p g, t q “ p , q against a perturbation thatdeletes a certain fraction of genes from the network, or more precisely declares them asnot being part of the original giant aOC. Similar perturbation protocols have been usedrecently in the context of satisﬁability problems [29][30]. Let us consider a setting wherea fraction ´ p of genes are thus removed from the aOC. Introducing a binary random ercolation on the gene regulatory network χ i i.i.d., taking the value χ i “ if gene i is kept in the aOC, and χ i “ if it isremoved, we can analyse the existence of the giant aOC as a site percolation problemin terms of a set of indicator variables n i for genes and σ µ for TFs, as in Sect. 3 above.Investing the almost sure equality of cavity and non-cavity indicator variables that wehave argued for there, the modiﬁcation of Eqs. (7) needed to capture the removal of apre-deﬁned set of genes is n i “ χ i ¨˝ ´ ź µ PB in i p ´ σ µ q ˛‚ σ µ “ ź i PB in µ n i . (17)Averaging these over the realizations t χ i u of the percolation experiment, and using thenotation g i “ x n i y χ and t µ “ x σ µ y χ to denote averages of local indicator variables overrealizations, we get g i “ p ¨˝ ´ ź µ PB in i p ´ t µ q ˛‚ ,t µ “ ź i PB in µ g i , (18)where we have set p “ x χ i y χ . Taking the average over the sites of a graph, deﬁning g “ N ´ ř i g i and t “ p αN q ´ ř µ t µ as in Sect. 3, one obtains g “ p ÿ d “ P in D p d q “ ´ p ´ t q d ‰ ” h p t q t “ ÿ c “ P in C p c q g c ” h p g q (19)The point p g, t q “ p , q is now no longer a solution. However, for certain degreedistributions, there are still two stable solutions, namely p g, t q “ p , q and p g, t q “p g ‹ , t ‹ q . Let us denote the function composition H p g q “ h p h p g qq , such that the fractionof genes g ‹ that is a solution of (19) is given by g ‹ “ H p g ‹ q . The composition of functionsfor the perturbed setting can be written in terms of the unperturbed composition as H p g q “ p F p g q , with p P r , s . Hence, the curve H p g q in the perturbed case, is alwaysupper bounded by that of the unperturbed case, F p g q , see Fig. 6a. This implies that anetwork where the solution p g, t q “ p , q is unstable in the unperturbed case (as in Fig.4a), will no longer exhibit a solution p g, t q “ p g ‹ , t ‹ q in the vicinity of p , q , when asmall perturbation is applied i.e. Eq. (16) gives the necessary condition for the stabilityof the giant aOC which encompasses the entire system. Conversely, in networks wherethe solution p g, t q “ p , q is the only stable solution at p “ (as in Fig. 4c), this isexpected to be continuously deformed when p is decreased to values below p “ . Thenature (continuous or discontinuous) of the phase transition that we have discussedfor the unperturbed system in Sect. 3 is expected to be indicative of the behaviour at ercolation on the gene regulatory network . . . . . . g . . . . . . p F ( g ) p =1 p =0.9 p = p ∗ (a) . . . . . . . h d in i . . . . . . . h c i n i . . g ∗ (b) Figure 6: (a)

Graphical solution of Eq. (19), obtained as intersection of the curve p F p g q with g . The solid lines show the perturbed cases ( p “ . , p “ p ˚ ) and the dotted line shows theunperturbed curve ( p “ ). Results are shown for type I networks with connectivity x c in y “ , x d in y “ (b) Heatmap showing the value g ‹ of the largest solution of Eq. (19), for diﬀerentvalues of the mean connectivities in type I networks. The dilution parameter is p =0.95. Thedashed line indicates the transition curve at p “ . . . . . . h c in i . . . . . P e r c o l a t i n g f r a c t i o n geneTFthr. genethr. TF (a) . . . . . h c in i . . . . . P e r c o l a t i n g f r a c t i o n geneTFthr. genethr. TF (b) Figure 7:

Fraction of genes and TFs (squares and circle respectively) in aOC, consequentto removal of 5% of genes, versus the average in-degree x c in y of TFs. Results are shown fornetworks of (a) type I with x d in y “ and N “ and (b) type II with x d in y “ . and N “ . Lines describe the theoretical prediction obtained from the largest solution ofEq. (19). For the continuous transition shown in panel (b) the approach to the critical value x c in y c is linear g „ px c in y c ´ x c in yq . p below but close to . This behaviour remains qualitatively similar as p is loweredfurther, away from , as long as p ą p ˚ , with p ˚ “ { F p q “ ` x d in y P in C p c “ q ˘ ´ .In Fig. 6b, we plot the largest solution g ‹ of Eq. (19), as a function of the meanconnectivities, for bipartite graphs of type I. As expected, a discontinuous transition of g ‹ is exhibited as connectivities are changed. Results are shown for p “ . , and thelocation of the transition curve at p “ (predicted by Eq. (16)) is shown as a dashedline. This indicates that the condition for the emergence of a non-zero ﬁxed point inthe perturbed case is more stringent than for the unperturbed case. ercolation on the gene regulatory network after a random fraction p of genes has been removed. We removea TF if any of its predecessors is missing. We then re-evaluate the out-component afterremoval of any TFs and iterate this procedure until convergence. This pruning protocolultimately gives the set of nodes in the giant aOC. We show results of this procedure,for the two families of networks of type I and type II, in Fig. 7. We ﬁx the average valueof the in-degree of genes x d in y and we tune the parameter x c in y in type I networks, andthe exponent γ of the fat tailed distribution in type II networks, for which x c in y “ ζ p γ q .Theoretical curves are given by the solution of Eq. (19) with the largest values of g ‹ and t ‹ . The fate of the giant aOC (if it exists) against deletion of genes is of interest for theinterpretation of gene knock-out experiments in the limit where the number of genesdeleted is actually ﬁnite. We will deal with this limit later on in Sect. 4.4 below. We now investigate the stability of the solution p g, t q “ p , q using a perturbation ofthis solution which consists of ﬁnding a giant aOC which contains predeﬁned randomlychosen set of genes that by deﬁnition forms its seed (or nucleus). We consider a settingwhere the fraction p of genes is thus deﬁned to belong to the aOC, and ask under whichconditions a giant aOC exists that contains this very set of genes. We assign a binaryrandom variable χ i taking the value χ i “ if a gene is deﬁned to belong to the seed ofthe aOC, and χ i “ if it does not belong to that seed. We can analyse the existence ofthe giant aOC in terms of a set of indicator variables n i for genes and σ µ for TFs as inSect. 3 above. Following a line of reasoning as in the previous section, we obtain themodiﬁcation of Eqs. (7) needed to capture the existence of a set of genes pre-deﬁned tobe the seed (or nucleus) of a giant aOC as n i “ p ´ χ i q ¨˝ ´ ź µ PB in i p ´ σ µ q ˛‚ ` χ i ,σ µ “ ź i PB ηµ n i , (20)Averaging these over the realisations t χ i u of the percolation (or seeding) experiment, weget g i “ p ´ p q ¨˝ ´ ź µ PB ξi p ´ t µ q ˛‚ ` p ,t µ “ ź i PB ηµ g i , (21) ercolation on the gene regulatory network . . . . . . g . . . . . . L ( g ) p =0 p =0.1 (a) . . . . . . . h d in i . . . . . . . h c i n i . . g ∗ (b) Figure 8: (a)

Graphical solution of Eq. (22). Comparison between the functional composition L p g q describing the perturbed scenario at p “ . (solid line) and F p g q describing theunperturbed scenario (dotted line). Results are shown for type I networks with connectivity x c in y “ x d in y “ (b) Heatmap of the largest solution g ‹ of Eq. (22) for a range of meanconnectivities in type I networks. The fraction of genes in the giant aOC seed is p “ . . Thedashed line indicates the transition curve for p “ . for the probability of gene i and TF µ to belong to a giant aOC thus seeded. Takingthe average over the sites of a graph, as in Sect. 3, one obtains g “p ´ p q ÿ d “ P in D p d q “ ´ p ´ t q d ‰ ` p “ l p g q ,t “ ÿ c “ P in C p c q g c “ l p t q . (22)In the present case, p g, t q “ p , q is still always a solution, but p g, t q “ p , q is notfor any p ‰ . Depending on the degree distributions, there may now for p ‰ be two stable solutions: namely p g, t q “ p g ‹ , t ‹ q and p g, t q “ p , q . The functionalcomposition L p g q “ l p l p g qq , that gives the fraction of genes g ‹ solving (22) as g ‹ “ L p g ‹ q , can now be written in terms of the unperturbed functional composition F p g q , as L p g q “ F p g q ` p ´ F p g qq p . One has L p g q ě F p g q for p P p , q . It isimplied that when p g, t q “ p , q is the only stable solution in the unperturbed case,this will be continuously deformed when a perturbation is introduced, see Figure 8a.Conversely, when the solution p g, t q “ p , q is unstable in the unperturbed case, asolution p g, t q “ p g ‹ , t ‹ q in the vicinity of p , q will no longer exist in the presence ofperturbation. Hence, Eq. (15) does indeed give the necessary condition for the stabilityof the non-percolating solution. In Fig. 8b, the smallest solution g ‹ of Eq. (19)) isplotted as a function of the mean connectivities, for bipartite graphs of type I. Again,a discontinuous transition of g ‹ , from a percolating to a non-percolating regime, isexhibited when connectivities are varied, as in Fig. 4. ercolation on the gene regulatory network h d in i h c i n i cavity 0 . . g (a) h d in i h c i n i cavity 0 . . g (b) h d in i h c i n i cavity 01 (c) h c in i . . . . . . g (a)(b) (d) Figure 9:

Comparison between cavity predictions (Eqs. (19), (22)) and the stationary stateof the corresponding microscopic dynamics (Eqs. (23), (24) respectively), averaged overthe network sites. (a)

Heat-map of the average gene expression level g “ N ´ ř i n i for aconﬁguration where a fraction ´ p of genes is clamped to be in inactive state, versus themean connectivities of type I networks. (b) Heat-map of the average gene expression level fora conﬁguration where a fraction p of genes is clamped to be in active state. (c) Diﬀerencebetween the two protocols. (d)

Plot of the average gene activation for a type I network withconnectivity x d in y “ , as indicated by the dotted lines in ﬁgures (a) and (b). The results of theprotocols used in panels (a) and (b) are represented in terms of circles and stars, respectively.The dashed lines represent the cavity predictions for the instabilities. Simulation parametersare N “ , p “ . . In order to study the stability of the percolating phase for a single instance of a GRN, weinvestigate the stability of the solutions of the microscopic Eqs. (17). To this purpose,we consider the dynamical processes n i p τ ` q “ χ i ¨˝ ´ ź µ PB in i p ´ σ µ p τ qq ˛‚ σ µ p τ q “ ź i PB in µ n i p τ q . (23) ercolation on the gene regulatory network θ i “ ´ χ i and positive regulatory couplings. This version of the dynamicsdescribes a setup in which a certain set of genes is forced to remain silent, e.g. as aresult of a gene knockout experiment.We now perform an analogous single instance analysis to the stability of the non-percolating phase by resorting to a solution of Eqs. (20) through forward iteration: n i p τ ` q “ p ´ χ i q ¨˝ ´ ź µ PB in i p ´ σ µ p τ q ˛‚ ` χ i ,σ µ p τ q “ ź i PB in µ n i p τ q . (24)Figure 9b shows the site average of the stationary solution of the microscopic dynamics(24). Just as for the percolating phase, the location of the transition in the stationarystate of the microscopic dynamics is correctly predicted by the cavity analysis (22)carried out at the macroscopic level. As expected from a discontinuous transition, thereis a region with coexisting solutions (Fig. 9c) which manifests itself as hysteresis in thedynamics: knocking out a small fraction of genes in a system where genes are initiallyon, will switch oﬀ the entire network, at a value of connectivity which is larger than theone required for a small fraction of initially active genes (in a system of inactive genes)to trigger an activation cascade (Fig. 9d).The iterative dynamics in Eqs. (24) is a special case of the gene regulatory dynamicsin Eq. (6) with θ i “ ´ χ i and positive regulatory couplings. This dynamics describes asetup where a certain set of genes is clamped to be active, e.g. as a result of an abundanceof TFs passed from a mother cell to a daughter cell in the ﬁrst moments after division.It is worth noting that there is a second level of single instance cavity dynamicsthat arises from solving Eqs. (18) and (21) through forward iteration. Solutions agreewith those of microscopic and macroscopic dynamics, unless the number of perturbedgenes is O p q , creating ﬁnite sample eﬀects which are particularly pronounced in theproximity of any phase transition. We discuss this behaviour in the next section. In section 4.1 above we studied the stability of the giant aOC by removing a fraction ´ p of genes and computing the fraction of nodes that do not belong to giant aOCas a result of that removal. We refer to this quantity as the avalanche size. Here weinvestigate the avalanche size in the limit p Ñ , in particular with scaling p “ ´ ρ { N .In this situation, a ﬁnite number of genes are initially removed from the network, as ercolation on the gene regulatory network N .This regime is relevant to the analysis of gene knockout experiments. These areperformed in vitro as a tool to investigate the underlying (unknown) network. In oursimulation, a synthetic network is generated and a single gene i is removed. Thedynamics of single instance cavity (23), is then run with χ j “ ´ δ ij and initialconditions n i p q “ @ i P t , . . . N u , until stationarity. In a realistic setting, we expectthe GRN to consist of a robust giant aOC. We therefore consider networks that satisfythe stability condition of Eq. (16). For each gene in the original giant aOC, we computethe avalanche size associated with the removal of that gene. A histogram of avalanchesizes for networks of type I is shown in Fig. 10.Here, we compare results for dynamics with diﬀerent logic gates, the AND logicdeﬁned in Eqs. (5), (6) and an OR logic dynamics, obtained by replacing Eq. (6)) with σ µ p τ q “ c in µ ÿ i PB in µ n i p τ q (25)The distribution of avalanche sizes is very diﬀerent for the two types of dynamics,emphasizing the crucial role of the underlying dynamics in interpreting data from geneknock-out experiments. We also provide comparisons of these quantities with the out-degree distribution of the so-called “projected graph" of eﬀective gene-gene interactions,deﬁned by the adjacency matrix A ij “ Θ ` ř αNµ | r iµ | m µj ˘ [19], where TFs have beenintegrated out (i.e. summed over). The out-degree distribution for this graph is deﬁnedby P out D,P p k q “ N ÿ j @ δ ř i A ij ,k D , (26)in which the angled bracket denotes an average over the realisations of the membershipand regulatory matrices m and r , respectively. This fully characterises directinteractions in a network with the dynamics governed by the OR logic Eq. (25) [19].However, the projected graph is not suﬃcient to fully specify the interactions in the casewith the dynamics governed by the AND logic Eq. (6), where it can only give interactionpathways.Unsurprisingly, the avalanche size distribution and the out-degree of the projectedgraph have diﬀerent behaviour, even when the dynamics is governed by the OR logic.This is due to the fact that the avalanche process depends on network structure beyondthe degree distribution. Hence, using the distribution of avalanche sizes in gene knock-out experiments, as proxies for the degree distribution of the gene-regulatory network,may systematically and signiﬁcantly bias the tail of the distribution, in a way thatdepends on the underlying logic.Let us now brieﬂy also consider the small perturbation limit in a network for whichthe giant aOC is not robust, i.e. for which the stability condition of Eq. (16) is violated.In Sect. 4.1 we found that Eq. (16) gives the right stability criterion for the survivalof the giant aOC, against removal of a ﬁnite fraction ´ p of nodes, for any p , and ercolation on the gene regulatory network Avalanche size − − − − F r e q u e n c y AND av.OR av. degree

Figure 10: Histogram of the avalanche size (only genes) i.e. the number of genes thatare removed from the network as a result of a single-gene knockout experiment. Sumof the heights is normalised to 1. The histogram for a single graph is computed. Foreach bin, the height is averaged over 5000 network realisations, and the average is shown(squares). Error bars, computed as the standard deviation of the mean, are smaller thanpoint markers. Squares show results for the AND logic dynamics, while stars indicateresults when OR gate logic is used in the same network. Circles indicate the out-degreedistribution of the projected gene-to-gene graph P out D,P . Parameters: x d in y “ x c in y “ N “ p Ñ , to the unperturbedEqs. (13), however, here, we are interested in the question of whether a single noderemoval from a network will cause the fragmentation of the giant component, in theregime where the latter is unstable. The outcome of such an experiment, will dependon the local topological properties of the node removed (see Ref. [25]), and cannot bedirectly extrapolated from the p Ñ limit of the macroscopic equations derived in Sect.4.1. If the number of nodes removed is small and does not grow with the size of thesystem, then the ﬁnal state of the microscopic Eq. (23) will depend on the random t χ i u realisation. The eﬀects of these ﬂuctuations are particularly pronounced in the vicinityof the instability line, and will lead to a rounding of the phase transition over and abovethe one illustrated in Fig. 9a, where the number of removed nodes grew linearly withthe size of the system. Results demonstrating this rounding from ﬁnite-size eﬀects arepresented and discussed in Appendix A.

5. Existence of a giant “AND” SCC

In this section we investigate the existence and stability of the “AND” strongly connectedcomponent (aSCC) using the cavity method, in a similar fashion as done above. Wealso present an algorithm that computes the giant aSCC for a given graph realisation,through a pruning procedure. We then verify the agreement between theory and ercolation on the gene regulatory network

To compute the fraction of nodes in the in-component, we introduce an indicator variable n i P t , u for each gene i , which is 1 if i belongs to the giant in-component , and otherwise. Analogously, let σ µ P t , u denote the variable indicating whether or notthe TF µ belongs to the giant in-component. A node (either gene or TF) is in the giant in-component if at least one of its successors is in the giant in-component , too; theseconditions are captured by n i “ ´ ź ν PB out i p ´ σ p i q ν q ,σ µ “ ´ ź j PB out µ p ´ n p µ q j q ,n p µ q i “ ´ ź ν PB out i z µ p ´ σ p i q ν q ,σ p i q µ “ ´ ź j PB out µ z i p ´ n p µ q j q (27)where B out i denotes the set of TFs ν which are successors of gene i . Thanks to the almostsure equality of cavity and non-cavity indicator variables, as previously discussed in Sect.4, Eqs. (27) will produce a pair of self-consistency equations for the average fractionsof nodes in the giant in-component which do not involve cavity variables, g “ ÿ d “ P out D p d q ” ´ p ´ t q d ı ,t “ ÿ c “ P out C p c q r ´ p ´ g q c s . (28) As explained above, for a node to belong to the strongly connected component (SCC),it must belong to the intersection of the in- and out- components, i.e. gene i is in theSCC only if i ’s predecessors belong to the out-component and its successors belong tothe in-component. To compute the fraction of nodes in the giant SCC, we introduce p n i , q n i , ¯ n i as the indicator variables for gene i to belong to the giant in-, out-, and strongly ercolation on the gene regulatory network ¯ n i “ ˆ n i , q n i , and the probability that a nodebelongs to the giant SCC is ¯ g “ N ´ ř i q n i p n i . If distributions of in- and out- degreee ofnodes are mutually independent, we simply get ¯ g “ p g ¨ q g , ¯ t “ p t ¨ q t . where p g and q g are the probabilities of genes to be in the giant in- and out- componentsrespectively, and p t and q t denote the same quantities for TFs.Here, we are interested in the existence of a giant strongly connected component,in the presence of the “AND" logic, i.e. in the giant aSCC, rather than SCC. Since thein-component and the “AND" in-component coincide, the diﬀerence between the giantSCC and aSCC resides in the diﬀerence between the giant OC and aOC. We recall thatthe solution of the macroscopic equations for the aOC describe situations where eitherthe giant aOC is the entire network, or there is no giant aOC at all. In the former case,the aOC must also coincide with the OC, as the set of nodes in the OC must includethose in the aOC. This means that the fraction of nodes belonging to the giant aSCCis either the same as that for the giant SCC, or zero (see Fig. 11). In the next section,we will have a closer look at the topological properties of the aSCC. The giant SCC of a network can be computed by identifying all the sites that can bereached from individual nodes, using e.g.

Tarjan’s algorithm [31]. However, in thepresence of the “AND" logic, the collection of paths emanating from individual nodesis no longer suﬃcient to identify the SCC. To circumvent this diﬃculty, we identify theaSCC through the following pruning procedure. We ﬁrst identify the giant SCC usingTarajan’s algorithm and denote with A the set of nodes belonging to the original graphbut not to the SCC subgraph. We then prune all the nodes in A from the originalgraph. This results in some TFs no longer satisfying the “AND” constraint. These arethen also removed and these steps are repeated until no new nodes are removed.We have shown in Fig. 11 that the fraction of nodes belonging to the giant aSCC iseither the same as the fraction belonging to the giant SCC, or it is zero. We now discusswhy this is the case from an algorithmic prospective. The (giant) SCC diﬀers from the(giant) aSCC if and only if there exists a gene which is not itself in the (giant) SCCand is a predecessor of at least one TF in the (giant) SCC; see Fig. 3b for an example.This would imply that none of the predecessors of that gene (called k in Fig. 3b) belongsto the giant SCC either. Since regulatory constraints on the network imply that eachgenes has always at least one predecessor and at least one successor, by iterating thisargument, one would exhaust the entire network, contradicting the initial hypothesis ofthe existence of a giant SCC, unless the set of predecessors form a loop. Therefore, a TF µ belonging to the (giant) SCC, will not satisfy the “AND” constraint if, and only if, itspredecessors form a loop. In this case, TF µ does not belong to aOC. It is known thatthe probability of having a ﬁnite cluster scales as N ´ with system size [16]. Thus, the ercolation on the gene regulatory network . . . . . . . h c in i . . . . . . F r a c t i o nn o d e s i n a S CC geneTF thr. genethr. TF Figure 11:

Fraction of genes and TFs (squares and circle respectively) in aSCC versus themean in-degree of TFs x c in y . The average in-degree of genes is held constant x d in y “ . .Points are obtained from simulations for a single network with size N “ . Continuouslines represent the theoretical prediction for the fraction of nodes in SCC. eﬀect of removal of a ﬁnite number of nodes from the aOC is observed to lead to eitherthe collapse of the giant aOC, or to the removal of a ﬁnite number of nodes, as shownin Fig. A1. ¶ Therefore the pruning procedure starting from the SCC can produce theelimination of (statistically) all nodes in the network, or none.

6. Summary and discussion

We re-investigate the problem of percolation in a bipartite directed network model ofgene regulation, that is based on the coupled dynamics of genes and transcription factors.The percolation problem plays a central role in assessing the viability of a GRN to sustainmulti-cellular life under the realistic assumption that a GRN’s dynamics is governed bynoise. As shown in [19] the relevant percolation problem is that of heterogeneous k -corepercolation on a directed bipartite graph. A re-investigation of the problem posed in [19]is needed, as fundamental constraints required for the degree distribution of GRNs werenot properly taken into account in that investigation. Interestingly, we found that takingthese constraints into account in the analysis leads to the most favourable situationregarding the size of the giant or percolating component, i.e. , there is always a solutionfor which all nodes of the network belong to the giant out component. The implicationof this observation, however, depends very much on the stability of the various solutionsfound in the problem: investigating the size of the giant “AND” out-component when ¶ The probability that a node elimination triggers an avalanche is diﬀerent from that computed inFig. A1. In ﬁgure A1a nodes are sampled uniformly from the network. ercolation on the gene regulatory network avalanche of node removals (corresponding to gene-silencings) witha size distribution shown in Fig. 10d. That size distribution does clearly not coincidewith the degree distribution of the underlying network. It is worth adding that thedynamics adopted in the present system involves only up -regulating interactions, andthat the statistics of avalanche sizes in systems with a mixture of up-regulating anddown-regulating TFs will be diﬀerent. Thus our results at this point merely serve as awarning that any inference of GRNs on the basis of results of gene knock-out experimentsshould be treated with suﬃcient levels of caution.One of the results of our stability analyses is a minimal condition for the full networkto form a stable giant out component. The condition formulated in Eq. (16) requiresthat x c in y P in D p d “ q ă . This condition suggests that GRNs are characterised by aredundancy of TF binding aﬃnities, i.e. the fraction of genes that are regulated byonly a single TF must be suﬃciently small. Note, however, that the condition Eq. (16) ercolation on the gene regulatory network small on average . Experimental evidences on knownregulating complexes support the idea that the number of proteins constituting a TF isindeed generally small [20]. We expect the condition of Eq. (16) to be satisﬁed for any gene regulatory network.The condition for the existence of a stable giant component represents a minimumrequirement for the existence of stable attractors of the dynamics of a GRN. In thepresence of frustration (created through a combination of up-regulating and down-regulating couplings), the condition for the existence of a multiplicity of stable phasesunder noisy conditions is expected to be more stringent. To address this problem, weplan to study the (stochastic) dynamics in systems which include the eﬀect of negativecouplings.An open challenge of our model is its calibration with available data. Severaldatabases, such as JASPAR [32] or TRANSFAC [33], provide information aboutpotential binding sites of TFs on the DNA that are based on motif-matching of bindingdomains on TFs and sites on the genome. However, it is observed that motif-matchinghas a fairly low speciﬁcity for predicting TF binding sites [34]. To compensate forthis, observed gene activity is often taken into account to increase the speciﬁcity of linkprediction. Our model is based on a clear distinction between gene activation states andthe topology of the underlying network. Hence, to specify the gene expression dynamics,it is imperative to combine information related to gene dynamics with information aboutpurely topological properties of the GRN. More speciﬁcally in this context, it should benoted that TF motif-matching does by itself not inform whether a TF is enhancing orinhibiting the expression of a given gene, let alone giving information about the strengthof the regulatory eﬀect. Information about the composition of protein complexes can beextracted from databases such as CORUM [35] or the review [14]. However, the bindingpreferences of TFs that are protein complexes are not well known, which representsanother challenge for the calibration of the model, and it is currently unclear how todesign an experimental protocol that would allow providing the necessary information.Elucidating the mechanisms of TF cooperation in gene regulation is an active ﬁeld ofresearch; we envisage that progress in the ﬁeld can potentially give more comprehensiveinformation to facilitate sensible network calibration in the near future.Finally, while we have explored gene regulation using the framework of autonomousdynamical systems, future directions may include the role of external signalling in generegulation. Moreover, it is worth emphasising that in the present paper we consider amodel of gene-regulation entirely based on protein-coding genes. Yet, it is estimated thatprotein-coding genes only make up about 2% of the entire genome [36]. It is believedthat the non-coding genome, too, is involved in regulatory mechanisms at diﬀerent stagesof protein synthesis [37]. An investigation of their role will constitute an interestingpathway to future work. We are conﬁdent that the formulation of our model is ﬂexible ercolation on the gene regulatory network Acknowledgments

GT supported by the EPSRC Centre for Doctoral Training in Cross-DisciplinaryApproaches to Non-Equilibrium Systems (CANES EP/L015854/1). All authors thankFranca Fraternali and Joseph CF Ng for illuminating discussions.

References [1] Takahashi K and Yamanaka S 2006

Cell

Nucleic Acids Res. e34–e34[3] Grada A and Weinbrecht K 2013 J. Invest. Dermatol. e11[4] Zhang Y, Liu T, Meyer C A et al.

Genome Biol. R137[5] Sanguinetti G et al.

GeneRegulatory Networks (Springer) pp 1–23[6] Zhang J H, Pandey M, Kahler J F et al.

J. Biotechnol. et al.

Cell

J. Theor. Biol.

437 – 467[9] Serra R, Villani M and Semeria A 2004

J. Theor. Biol

149 – 157 ISSN 0022-5193[10] Davidich M I and Bornholdt S 2008

PloS one [11] Schwanhäusser B, Busse D, Li N et al. Nature

FEBS Lett.

WIVACE (Springer) pp 142–152[14] Morgunova E and Taipale J 2017

Curr. Opin. Struct. Biol. Cell

Phys. Rev. E et al. Phys. Rev. Lett. Phys. Rev. E J. Phys. A Math. Theor. Annu. Rev. Cell Dev. Biol. BI Phys. Rev. E Proc. Natl. Acad. Sci. U.S.A.

Networks (Oxford university press)[24] Annibale A, Coolen A and Bianconi G 2010

J. Phys. A Math. Theor. EPL

Phys. Rev. E et al. Ann. Appl. Probab. et al. SIAM J DISCRETE MATH Phys. Rev. Lett. Phys. Rev. Lett. SIAM J COMPUT et al. Nucleic Acids Res. D91–D94[33] Wingender E 2008

Brief. Bioinform. ercolation on the gene regulatory network [34] Marbach D, Lamparter D, Quon G et al. Nat. Methods et al. Nucleic Acids Res. D497–D501[36] Di Iulio J, Bartha I, Wong E H et al.

Nat. Genet. Annu. Rev, Bioc. Phase transitions and critical phenomena vol 8 (Academicpress)[39] Stiefvater T, Müller K R and Kühn R 1996

Physica A ercolation on the gene regulatory network Appendix A. Single gene removal

In this appendix we investigate the avalanche size resulting from removing a singlegene in a type I network that does not have a robust giant aOC. Our analysis shows . . . . . . ¯ g . . . . . F r e q u e n c y . . . . . . . ¯ g . . . . . F r e q u e n c y (a) .

70 0 .

75 0 .

80 0 .

85 0 .

90 0 . λ F r e q u e n c y (b) . . . . h d in i h c i n i theory 1 λ (c) Figure A1:

Eﬀects of single-gene knock-out experiments on the aOC. (a)

Histogram of thefraction ¯ g of genes remaining in the aOC as a result of a single-gene knock-out experimentin a type I network with x d in y “ and x c in y “

3. The values on the y-axis are re-scaled bythe total number of knockout experiments, that is N . The inset magniﬁes the region ¯ g » . (b) Histogram of the fraction λ (see deﬁnition in Eq. (A.1)) of knockout experiments that donot trigger an extensive avalanche. The histogram is computed over 1000 random networkrealisations with the same connectivity as in (a). For each graph one value of λ is evaluated. (c) Heat-map of λ versus the mean degree of the two network layers. Each pixel represents agraph. The dashed line represents the boundary of the inequality Eq. (16). that, even in the range of connectivities where the macroscopic cavity analysis (16)predicts the aOC to be unstable, the microscopic single instance dynamics (23) showsthat a giant aOC may still be resilient to random removal of a ﬁnite number of genes,at least for connectivities not too far away from the instability line. Here we look atthe limiting case, where only a single gene is removed from the network. For a graphwith N genes, we performed N elimination experiments. Each experiment consists inremoving one gene, say gene j , and running the dynamics (23) until stationarity, for χ i “ ´ δ ij , and initial conditions n i p q “ @ i P t , . . . N u . For each experiment we ercolation on the gene regulatory network ¯ g “ N ´ ř Ni n i p8q , where n i p8q is the stationary state of the microscopicdynamics (23) for node i . The results of such experiments are shown in Fig. A1. Evenin the region where the macroscopic cavity analysis predicts only the solution ¯ g “ ,simulations occasionally exhibit a solution at ¯ g » . The histogram of ¯ g , resultingfrom the N single node elimination experiments, is shown in Fig. A1a, and presents twopeaks, around and . Magnifying the histogram in the region ¯ g » (see inset), asubstructure is observed which indicates that some percolation experiments only lead toa node removal cascade of ﬁnite size. We count the number of outcomes correspondingto the two peaks, and compute the fraction of genes whose removal does not lead to anextensive elimination avalanche λ “ count r ¯ g » s count r ¯ g » s ` count r ¯ g » s . (A.1)Fluctuations in the value of λ for diﬀerent network instances are displayed in Fig. A1b,which shows the distribution of λ over diﬀerent network realisations of a family ofrandom graphs with the same degree distribution. The connectivity parameters arechosen here such that the giant aOC is unstable but connectivities are still close to theinstability line. The dependence of λ on the network connectivities is shown in Fig. A1cas a heat-map, where each pixel corresponds to a network with a given combination ofdegrees. Results can be rationalised by noting that the parameter λ can be interpretedas a stability measure, given that λ “ characterises the condition of an unstable giantaOC, and λ “ is indicative of a stable giant aOC. Fig. A1c shows that, in contrastto the results predicted from Eq.(19) and presented in Fig. 9a, valid when a small but extensive perturbation of the giant aOC is applied, the parameter λ does not exhibit adiscontinuous transition between the regimes where the non-percolating solution and thepercolating solution encompassing (almost) the entire network are stable, respectively.As anticipated in Sect. 4.4, this is not unexpected: it is well known that in ﬁnitesystems there will always be some ﬁnite-size rounding of phase transitions [38], withadditional subtleties expected for discontinuous phase transitions in disordered systems[39]. The analysis presented here demonstrates that ﬂuctuations and rounding eﬀects aremagniﬁed in the case of non-extensive perturbations, where self-averaging mechanismsare not expected to occur in individual perturbation experiments. Results also showthat the ﬁnite-size rounding of the discontinuous transition is not symmetric relativeto the transition line. Nonetheless, the instability line predicted from the macroscopicanalysis does stillstill