Cost and Capacity of Signaling in the Escherichia coli Protein Reaction Network

Abstract

In systems biology new ways are required to analyze the large amount of existing data on regulation of cellular processes. Recent work can be roughly classified into either dynamical models of well-described subsystems, or coarse-grained descriptions of the topology of the molecular networks at the scale of the whole organism. In order to bridge these two disparate approaches one needs to develop simplified descriptions of dynamics and topological measures which address the propagation of signals in molecular networks. Here, we consider the directed network of protein regulation in E. coli, characterizing its modularity in terms of its potential to transmit signals. We demonstrate that the simplest measure based on identifying sub-networks of strong components, within which each node could send a signal to every other node, indeed partitions the network into functional modules. We then suggest measures to quantify the cost and spread associated with sending a signal between any particular pair of proteins. Thereby, we address the signalling specificity within and between modules, and show that in the regulation of E.coli there is a systematic reduction of the cost and spread for signals traveling over more than two intermediate reactions.


arXiv [q-bio.MN], Nov

Jacob Bock Axelsen†,∗, Sandeep Krishna∗ and Kim Sneppen∗,‡

† Centro de Astrobiología, Instituto Nacional de Técnica Aeroespacial, Ctra de Ajalvir km 4, 28850 Torrejón de Ardoz, Madrid, Spain

∗ Center for Models of Life, Niels Bohr Institute, Blegdamsvej 17, 2100 Copenhagen Ø, Denmark

Abstract.

In systems biology new ways are required to analyze the large amount of existing data on regulation of cellular processes. Recent work can be roughly classified into either dynamical models of well-described subsystems, or coarse-grained descriptions of the topology of the molecular networks at the scale of the whole organism. In order to bridge these two disparate approaches one needs to develop simplified descriptions of dynamics and topological measures which address the propagation of signals in molecular networks. Transmission of a signal across a reaction node depends on the presence of other reactants. It will typically be more demanding to transmit a signal across a reaction node with more input links. Sending signals along a path with several subsequent reaction nodes also increases the constraints on the presence of other proteins in the overall network. Therefore counting in and out links along reactions of a potential pathway can give insight into the signaling properties of a particular molecular network.

Here, we consider the directed network of protein regulation in E. coli, characterizing its modularity in terms of its potential to transmit signals. We demonstrate that the simplest measure based on identifying sub-networks of strong components, within which each node could send a signal to every other node, indeed partitions the network into functional modules. We suggest that the total number of reactants needed to send a signal between two nodes in the network can be considered as the cost associated with transmitting this signal. Similarly we define spread as the number of reaction products that could be influenced by transmission of a successful signal. Our considerations open up a new class of network measures that implicitly utilize the constrained repertoire of chemical modifications of any biological molecule. The counting of cost and spread connects the topology of networks to the specificity of signaling across the network. Thereby, we address the signalling specificity within and between modules, and show that in the regulation of E. coli there is a systematic reduction of the cost and spread for signals traveling over more than two intermediate reactions.

‡ Corresponding author: [email protected]

Background

Many functions of a living cell involve sending signals from one protein to another. Signals need to be sent in response to environmental conditions in order to trigger the appropriate functional proteins needed at that time. For example, the presence of food metabolites in the surroundings triggers signals from membrane receptors to proteins involved in chemotaxis and metabolism required to make the cell move toward and utilize the food; or a sudden change in the temperature triggers signals to proteins which buffer the cell against the shock. Many signalling pathways found in living cells have been studied and modeled in great detail: the PTS sugar uptake [24], chemotaxis [7, 4], heat shock [5], unfolded protein response [6], the p53 network [25], NF-κB signalling [2, 20] and the SOS response to DNA damage [3, 1], just to name a few. All the computations done by the regulatory system of a cell are used to make sure the right signals get sent at the right times to the right places.

Not much is known about the large-scale organization of protein networks in the cell and the connection between their architectural principles and the propagation of signals within them. This is the subject of investigation in this paper.

The different overall types of reactions we have in the network are:

• transcription, where activated/inhibited polymerase complexes interact with a promoter and regulate the transcription of downstream open reading frames.

• complex-formation, where a complex is created from either monomers or other complexes (RNA-polymerases and filaments).

• activation/inhibition, where a protein (e.g. enzyme) is modified by another enzyme by the addition of an organic compound (e.g. phosphate and methyl).

• metabolic/enzymatic, where a protein reacts with one or more small molecule(s) (e.g. transport and cleavage).

The EcoCyc database contains all this information to the level of water, ions, sugars, fatty acids, phosphate groups etc. Whereas we include enzymatic reactions with metabolic output, we prune the network by removing all metabolic nodes.

Our approach is to study a simplified dynamics of signal propagation on an organism-wide network of proteins and reactions. By comparing with appropriate randomized versions of the network we pinpoint features of the design of the real network that influence signal propagation.

We chose to study Escherichia coli because it is the most studied prokaryote and, hence, its network of interactions and reactions is most complete; several databases exist for the regulatory and metabolic interactions in E. coli [18, 17, 23, 11]. There are many ways to represent the full known molecular network of E. coli. The standard method, used in a number of studies of biological and social networks [12, 14], has been to use an undirected graph. Although easily tractable, such a representation does lose a great deal of information about the interactions.

A graph representation which, for the regulatory network of a living organism, adds most of this missing information is one where the network is described by a directed, bipartite graph. Such a graph has two types of nodes: protein nodes and reaction nodes (including reversible and irreversible metabolic and complex-formation reactions, as well as transcription reactions). In our representation a modified, e.g. phosphorylated, protein is assigned a different node from the original protein. In addition, complexes of proteins are also assigned their own nodes. Further, the links have direction. Fig. 1A shows such a representation of the protein network of E. coli.

Even more information is contained in a representation of the network as a list of reactions. The list adds to the bipartite graph information about which neighbours of a reaction node are reactants and which are products. This reaction list and the directed bipartite graph are the representations we focus on in this paper. To study the signalling in these networks we introduce two quantities which measure different aspects of signal propagation. These measures are built on the fact that transmission of a signal across a reaction node depends on the presence of other reactants. In particular we will assume that transmission of a signal across reaction nodes with more input links puts more constraints on the status of other molecules in the network. A simple measure for the complications associated with sending a signal along a given pathway is to count the total number of in links or the total number of out links of reaction nodes along the pathway.

Given a signal pathway from protein A to protein B, we can ask how many other types of proteins are required to be present to allow the signal to propagate all the way. This we call the "cost" of the path. Another quantity is the number of alternate branches, along the path from A to B, that the signal could be broadcast on. This we call the "spread" of that path. Quantifying such measures is useful only if there is an appropriate null-model to compare with the real E. coli network. For this null-model we choose a randomized version of the real network which has the same number of nodes and links, and which preserves bipartiteness as well as all local point properties by keeping the in and out degree of each node fixed.

Results

Modular Design of the E. coli Network

The directed, bipartite graph representation of E. coli consists of 2846 protein nodes and 2774 reactions. The types of reactions are transcription reactions, complex formations, protein modifications and metabolic reactions. The dataset counts 848 transcription reactions out of the 980 irreversible reactions, with the remaining 1794 reactions being reversible. In Fig. 1A we show the giant weak component consisting of 1938 reactions (of which 812 are transcription reactions, cyan squares) and 1897 proteins (orange circles). With such a network representation, one can identify four different types of degree distributions: the in- and out-degree distributions for protein and reaction nodes, shown in Fig. 1C,D.

Of the four different degree distributions only the out-degree distribution of protein nodes is sufficiently broad to be fitted to a power law, with exponent γ = 2.[…] […] ≈ 4. In contrast, the average length of paths starting from arbitrary reaction nodes is ≈ 7. This observation is a rough approximation to what is captured by the betweenness centrality measure [13].

The alternation of reaction and protein nodes as one moves away from the core of the network in Fig. 1A is in part due to the bipartiteness and in part due to the higher interconnectedness of the core of the network, consisting mostly of transcription factors. The average degree of transcription factors is ≈ 11, while it is ≈ […]

[…] E. coli graph is composed of a large number of relatively small strong components (a strong component is a subgraph where there is a path between every pair of nodes, see Methods section). The largest of these contains 150 nodes. We will here refer to a graph where every node has access to every other node through a path in the network as being above percolation threshold or super-critical. Then, although the full network shown in Fig. 1A looks supercritical, the representation in terms of strong components shows that it is substantially below the percolation threshold (as confirmed by the exponential size distribution of strong components, not shown). Fig. 1B shows a corresponding condensed graph of the randomized network, in which the degree of each node is conserved. The existence of a giant strong component with ≈ […] E. coli reaction network indeed shows a highly modular design, even when compared to a random bipartite network that has exactly the same number of nodes, each with the same in- and out-degree.
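The four degree distributions mentioned in this section can be tallied directly from a reaction list. The following is a minimal sketch and not the authors' code: the function name `degree_distributions`, the dictionary layout of the reaction list, the toy reactions, and the convention of counting each reaction only in the direction in which it is written are all assumptions made for illustration.

```python
from collections import Counter

def degree_distributions(reactions):
    """Return four histograms of the form {degree: number of nodes}.

    `reactions` maps a reaction name to (reactants, products). This layout
    is a hypothetical stand-in for the reaction list described in the text.
    """
    proteins = set()
    protein_out = Counter()   # links protein -> reaction
    protein_in = Counter()    # links reaction -> protein
    reaction_in, reaction_out = Counter(), Counter()
    for name, (reactants, products) in reactions.items():
        proteins.update(reactants, products)
        reaction_in[name] = len(set(reactants))
        reaction_out[name] = len(set(products))
        for p in set(reactants):
            protein_out[p] += 1
        for p in set(products):
            protein_in[p] += 1
    return {
        # Proteins with no link of a given kind are counted with degree 0.
        "protein_in": Counter(protein_in[p] for p in proteins),
        "protein_out": Counter(protein_out[p] for p in proteins),
        "reaction_in": Counter(reaction_in.values()),
        "reaction_out": Counter(reaction_out.values()),
    }

# Toy reaction list (illustrative only, metabolites omitted).
toy = {
    "R1": (("ArcB",), ("ArcBp",)),
    "R2": (("ArcBp", "ArcA"), ("ArcAp", "ArcB")),
    "R3": (("ArcAp", "IHF", "Fnr", "RNAPsigma"), ("SucABCD",)),
}
dists = degree_distributions(toy)
```

How reversible reactions should contribute to in- and out-degrees is a modelling choice that the text does not spell out, so the histograms from this sketch should not be compared directly with Fig. 1C,D.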

Downstream Targets and Restrictions on Allowed Paths

The simplest aspect of the structure of the network that influences signalling is the number of nodes that are downstream of any given starting node. Note that this is a quantity that can be sensibly studied only with a directed graph representation of the network; in any connected undirected graph all nodes are downstream of each other. The possible signals emanating from the starting node are obviously limited to reach only these nodes. The strong component structures in Fig. 1A,B already indicate that the real E. coli network differs substantially from its randomized counterpart. In the random network most nodes can reach almost all other nodes, whereas each protein in the real network has a much smaller number of downstream targets. Thus, the real network is relatively optimized for specific signalling; a percolating structure is not conducive to specific signalling because every node has almost the entire network downstream of it. This expectation is confirmed in Fig. 2A which shows the distribution of the number of downstream targets for the real and randomized E. coli networks.

The fact that the E. coli network has a few nodes with a downstream sphere of influence of over 1000 indicates a topology governed partly by a hierarchical subnetwork consisting of about 1/4 of the original network, as also noted by ref. [21]. In contrast, the randomized network examined in Fig. 2A lacks such a hierarchical organization, rather placing ≈ […] E. coli network.

Fig. 2B illustrates the kind of restrictions placed on allowable signalling paths in a reversible reaction A + B ↔ C. The graph representation does not have information about these restrictions because all neighbors of a reaction node are equivalent. Including this restriction limits the downstream targets from any node as compared to the simpler graph representation. This is illustrated in Fig. 2C which shows the distribution of the number of downstream nodes reachable from every node of the network in Fig. 1 with the restrictions, as compared to Fig. 2A where the restrictions are not applied. Intriguingly, the distribution with the signalling restrictions resembles a scale free distribution, 1/n^[…], with a substantially better scaling than the unrestricted signalling. Irrespective of restrictions the real E. coli network has far fewer downstream targets than its randomized version, a fact that is important for specific signalling.
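To make the effect of the restriction concrete, the sketch below computes downstream targets for the single reversible reaction A + B ↔ C, once with all neighbours of the reaction node treated as equivalent and once with the restriction applied. It rests on an assumed reading of Fig. 2B, namely that a signal entering a reaction on one side can only leave on the other side, and that a path does not reuse a node. The names `exits`, `downstream` and `max_reactions`, and the default cutoff, are hypothetical.

```python
def exits(rxn, entered_from, reactions, restricted):
    """Proteins towards which a signal can leave reaction `rxn`."""
    reactants, products, reversible = reactions[rxn]
    if restricted:
        # Reaction-list view: a signal crosses from one side to the other.
        if entered_from in reactants:
            return set(products)
        if reversible and entered_from in products:
            return set(reactants)
        return set()
    # Plain graph view: every neighbour reached by an out link is equivalent.
    out = set(products) | (set(reactants) if reversible else set())
    return out - {entered_from}

def downstream(start, reactions, restricted=True, max_reactions=7):
    """Proteins reachable from `start` along paths that reuse no node."""
    reached = set()

    def walk(protein, used, visited, depth):
        if depth == max_reactions:  # cutoff on the number of reactions crossed
            return
        for rxn, (reactants, products, reversible) in reactions.items():
            if rxn in used:
                continue
            can_enter = protein in reactants or (reversible and protein in products)
            if not can_enter:
                continue
            for nxt in exits(rxn, protein, reactions, restricted):
                if nxt not in visited:
                    reached.add(nxt)
                    walk(nxt, used | {rxn}, visited | {nxt}, depth + 1)

    walk(start, frozenset(), frozenset({start}), 0)
    return reached

toy = {"R": (("A", "B"), ("C",), True)}          # A + B <-> C
unrestricted = downstream("A", toy, restricted=False)  # contains B and C
restricted = downstream("A", toy, restricted=True)     # contains only C
```

In this toy case the plain graph view lets a signal travel from A to B through the reaction node, while the restricted view only lets it reach C. Enumerating paths exhaustively, as done here, is only practical for small networks or short cutoffs.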

Cost and Spread of a Path

Signalling is not just about reaching a downstream target. As a signal propagates it needs other molecules to help it pass the message across consecutive reactions. Consider for example a signal initiated by an increase in the concentration of a given transcription factor. The promoter it influences may depend on other transcription factors, for example in an or-gate construction. If that is the case, and the other transcription factor is already abundant, the promoter activity will not be influenced and thus the signal will not be transmitted. More generally, for each additional reactant along a reaction pathway, signal propagation gets increasingly coupled to the overall state of the molecules in the cell. The more reactions in the path, and the more reactants in each reaction, the more conditions need to be met for propagation of the signal.

A concrete example of a signalling pathway is the Arc two component regulatory system illustrated in Fig. 3A. A receptor protein (ArcB) receives an external stimulus (here, lack of oxygen), gets phosphorylated, and then undergoes a series of two reactions where the phosphate group is shifted between residues in ArcB, such that finally ArcBp can transfer the phosphate group to ArcA. Subsequently, phosphorylated ArcA acts as a transcription factor for a large number of genes including the sucA gene emphasized in the figure. In terms of signal propagation, we follow the signal from a phosphorylation reaction: signal + ATP + ArcB ↔ ArcBp, through the reaction ArcBp + ArcA ↔ ArcAp + ArcB, ending in the reaction ArcAp + IHF + Fnr + RNAPσ → SucABCD + ...

The external signal propagates under the condition that all reactions can take place. This means that (1) ArcB is present, (2) ArcA is present, and that (3) the three additional transcription factors (IHF, Fnr, and RNAP-σ) are present/absent in a combination that allows a change in the concentration of ArcAp to influence the activity of the sucABCD operon. Thus, the propagation of the input stimulus to SucA puts constraints on the concentration levels of ArcA, ArcB, IHF, Fnr and the RNAPσ complex, and can be assigned a cost C = 5 which counts the number of proteins or protein complexes involved in propagating the signal. In addition there could be some cost associated with the absence/presence of small molecules or metabolites, for example ATP in the first reaction of Fig. 3A. We disregard this metabolic part of signalling in the present paper.

We quantify this cost C = C(path) for an arbitrary path from a starting protein to a target protein by simply counting the number of reactants along the entire path (not counting the protein nodes which are part of the path), as described schematically in Fig. 3B. If the same reactant is used several times, it is only counted once, as illustrated in Fig. 3C. Notice that the propagation of a signal does not necessarily mean an increased level of the proteins involved. The key point is that a change in input state should be transmitted to a changed output state of the end product. Our cost function is a simple measure of the complexity of handling such a signal and it could, in principle, be calculated between any pair of proteins where a path exists in the directed network.

Another issue which is important for specific signalling is the possibility of signals branching, or spreading into the network. Thus, a signal propagating from a starting protein to a target protein would pass by some reactions where it could branch out into alternate paths to different targets. Similar to the cost, we quantify this spread S = S(path) for a given path from start to target by counting the number of by-products along the entire path (Fig. 3B). S does not count the sequence of products needed to generate our final target, but only counts side-branches along the path.

We stress that we here limit our spread counting to reaction products (proteins) along the path, whereas we disregard out links from proteins on the path that feed into reactions. In principle these neighbor reactions to the path in turn feed into changes of other proteins. Our minimal spread for example disregards out degrees of highly connected transcription factors along the path. This may sometimes be too restrictive, but reflects the conjecture that specific disturbances typically diminish across a reaction node. To be more specific on this last point, consider the case of a transcription reaction where the product p = 1/(1 + r) as function of reactant r. Here p is only sensitive to r when this is close to the characteristic binding (here set to 1). Thus for most values of r the output response δp will be smaller than input changes Δr across a reaction node. For a related discussion on propagation of disturbances in chemical reactions, see [10].

Fig. 4B shows the average cost of signals propagating from one protein to another along the shortest path connecting them, as a function of the length l of that path. Each data point is the average over all pairs which are at the given distance. Except for paths of length two, the average cost for signals is significantly smaller for the real E. coli network than for a randomized version which preserves degrees. Fig. 4C shows the average spread of signals propagating from one protein to another along the shortest path connecting them, as a function of the length of that path. Each data point is the average over all pairs which are at the given distance. As shown in Fig. 4A the number of pairs at a given distance is quite high (∼ […]) for the real network and much higher for the random. The standard error is therefore negligible and not shown in Fig. 4B,C. Just as with the cost, except for paths of length two, the average spread for signals is always significantly smaller for the real E. coli network than for a randomized version.

Notice that in the spread S vs. distance plot the slope, for the random network, is ΔS/Δl > […] ΔS/Δl < […] E. coli network. In this connection keep in mind that a random directed network is critical when the average out degree ⟨k_out⟩ = 2. Considering a random path, a node on this path should then on average have one more output than the one along the path, corresponding to S = 1. The values of ΔS/Δl then indicate that the geometry of the random network is super-critical, with an initial signal on average being amplified for each step along the path. In contrast the real network is sub-critical with signals that tend to disappear with distance even under optimal conditions. Therefore, Fig. 1A,B can be regarded as a visual illustration of the sub-criticality of the real network versus the super-criticality of the randomized network.

In sum, the real E. coli network reduces both the cost and spread of signals along all shortest paths connecting pairs of proteins. Fig. 5 adds even more evidence to this conclusion by showing that a scatter plot of spread vs. cost for all pairs of nodes in the real E. coli network covers a smaller area than a corresponding plot for a randomized network. Note that this plot contains the full distribution from which the distance dependent averages in Fig. 4 were calculated.

Fig. 6 repeats this analysis for each of the six largest strong components in the network. These strong components capture distinct functional units being associated, respectively, to (a) predominantly fatty acid metabolism, (b) the transcription network around σ factors, (c) PTS-sugar transport, (d) ABC transporters, (e) the FeII and FeIII transport system and finally, (f) the chemotaxis module. Fig. 6 also shows the cost and spread for the constrained reaction paths within each of these subgraphs compared to the expected cost and spread for randomized versions of the subgraph. Overall, we see that cost and spread within each module is fairly similar to the random expectation. The only network which has a substantially lower cost and spread is that of the ABC transporters, the network where signalling is most seriously limited by the constraints.

Discussion and Conclusion

We have shown that the molecular network of E. coli is designed in a way which optimizes signalling by minimizing its requirements on the presence of other molecules, as well as focusing signalling on a limited set of distant proteins with relatively small spreading of signals to other proteins along the paths. This overall design feature is in accordance with the general belief that molecular networks are somewhat modular [16]. Also this design of the network consisting of relatively separated domains provides far fewer alternate paths when compared to the random expectation. Thus, the network is designed to favor specificity of signalling, rather than provide robustness to deletion in the form of multiple paths. We take this as a hint that robustness is, presumably, a design feature of the local dynamics in the network. For example, the well known robustness of chemotactic behavior is associated with changes of reaction rates and protein concentrations [4], but not actual deletion of proteins.

We stress that our available network is based on literature study, and therefore is vulnerable to systematic errors in collecting data. In particular, the overall data set probably covers only a fraction of the real interactions in E. coli. Further, certain types of interactions are not available including, in particular, degradation by proteases, RNA regulation and small molecule interactions. Thus, the observed sub-critical breakup of the network into separated strong components in Fig. 1A may partly be due to limited data sampling. The complete network of all interactions actually taking place in E. coli might well be above percolation. This is especially likely to be true if we also integrate the metabolism with the regulatory network because much of the feedback in regulation goes through small molecules involved in metabolic processes [19].

In regard to limitations of our approach to the incomplete E. coli network, it is important to emphasize that our measures of cost and spread along a given path will be robust to improvement of the E. coli network. The reason for this is that any reaction present in the current network is well characterized, i.e., its set of reactants and products is likely to be complete, and therefore its activity should be fairly independent of presently unknown proteins. Thus, improvement of the E. coli network will likely involve addition of new reaction pathways and will not, to a first approximation, change the connections of the existing reaction nodes. Therefore, for any existing path in the current network the cost and spread will remain unaffected. Adding further links to the network will increase cost and spread for the random network, and thus tend to increase the observed difference between signalling in the real and the randomized network.

Looking at cost and spread within the strong components we found that signalling within these modules was approximately as in their randomized counterparts. Thus, the cost and spread measures indeed indicate a fair degree of robustness within a module, while still showing a systematic absence of alternate path options on large scales. However, examining these modules against deletion of individual nodes we found that, for all the six largest strong components, the robustness of the size of the module was less than for a comparable module with randomized structure. Thus, even within modules, percolation robustness of signals is not a strong trait.

It is clear that our definition of cost in terms of simply counting independent inputs is a simplified approach. Thus, one could easily imagine constructing more complicated cost functions, taking into account, in particular, the logic of transcription regulation [15, 8] and epigenetic switches [9]. Also the cost may be modified according to universally abundant proteins (housekeeping genes), for example by not counting input from all essential genes. To some extent our counting already excludes core enzymes such as ribosomes and tRNAs but, obviously, this list of essential ingredients of cell functionality may be extended. Finally, the real usage of a given pathway may be restricted by the time to process the signal along the path, where, in particular, protein production events take a sizable time compared to a cell generation.

A final intriguing point is that the large modules have such widely different design features, as seen from Fig. 6. Indeed, some modules C,F are dominated by complex formation reactions, D,E by linear pathways, while A,B are densely interconnected. Thus, whereas signalling within each of the sub-networks is similar to random, in terms of cost and spread, the way these networks deal with the signalling is still widely different. We could not detect motifs common to all of these macromolecular networks [26].

As an overall summary, our geometrical considerations capture a modularity of the E. coli protein networks which favors signaling on fairly short distances: a topology which speaks to fruitful modular approaches to systems biology on the whole-cell scale, as propagation of signals through many intermediate reactions seems to be nearly impossible. In addition, one expects limitations in signal propagation from simple mass-action kinetics, as shown by Sergei Maslov, Kim Sneppen and Iaroslav Ispolatov, "Propagation of fluctuations in interaction networks governed by the law of mass action" (q-bio.MN/0611026). As the macromolecular network in E. coli indeed has modular features, and signals are difficult to transmit, substantial parts of E. coli may be consistently understood by summing up separate studies of nearly independent modules.

Methods

Network construction

The basic flat files of the EcoCyc database [18] were downloaded from Ecocyc.org. EcoCyc is a scientific database for the bacterium Escherichia coli K-12 MG1655. The EcoCyc project performs literature-based curation of the entire genome, and of transcriptional regulation, transport and metabolic pathways.

Despite being incomplete in places, when compared to more specialized databases, EcoCyc is still the most comprehensive database of reactions in Escherichia coli.

The files proteins.dat and genes.dat contain the list of all protein and gene names in EcoCyc. From the files bindrxns.dat and promoters.dat all protein-promoter interactions were extracted. The file transunits.dat contains a list of specific transcriptional units which was used to link proteins to their downstream gene products. These reactions were labelled according to the name of the actual promoter involved in the process. There is at least one promoter for each transcription reaction in the database.

The file reactions.dat contains a general list of all biochemical reactions in EcoCyc, and the file enzrxns.dat specifies which of these are enzymatic reactions and which enzyme is involved. From these files all other reactions were extracted in which at least one protein is a reactant or product.

From the total set of irreversible reactions (including all transcription reactions) we removed proteins from the product side which also occur as a reactant in the same reaction. The reason is that information is not transmitted from reactants to catalysts, therefore we do not want such links in our final network.

The resulting reaction list is represented as two stoichiometric lists (matrices), one for reactants and one for products (proteins involved in reversible reactions are also partitioned into two sets with one being arbitrarily picked for the "reactant" matrix) of 2774 reactions and 2846 proteins.
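The catalyst-pruning step described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the function name `strip_catalysts`, the tuple layout (reactants, products, reversible) and the two toy reactions are assumptions, and parsing of the EcoCyc flat files is not shown.

```python
def strip_catalysts(reactions):
    """Drop from the product side of each irreversible reaction any protein
    that also appears among its reactants, so no reactant -> catalyst link
    remains. Reversible reactions are left unchanged.
    """
    cleaned = {}
    for name, (reactants, products, reversible) in reactions.items():
        if not reversible:
            products = tuple(p for p in products if p not in reactants)
        cleaned[name] = (tuple(reactants), tuple(products), reversible)
    return cleaned

# Hypothetical toy reactions, not taken from EcoCyc.
raw = {
    "enzymatic": (("Enzyme", "Substrate"), ("Enzyme", "Product"), False),
    "binding": (("A", "B"), ("AB",), True),
}
cleaned = strip_catalysts(raw)
```

After this step the enzyme remains a required input of the enzymatic reaction but is no longer treated as one of its outputs.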

Randomization

We constructed randomized versions of the E. coli network by repeatedly swapping the targets of randomly selected pairs of links [22]. This automatically preserves the in- and out-degree of each node. Further, by restricting the set of pairs of links for which swapping was allowed we could preserve both the bipartiteness and the character of the links. For instance, links to irreversible reactions were only swapped with links to other irreversible reactions, etc. In this way each (ir)reversible reaction remains (ir)reversible in the randomized version.
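A degree-preserving target swap of this kind can be sketched as below. The sketch is written under stated assumptions and is not the procedure of ref. [22] verbatim: each link carries a hypothetical `kind` label (for example "P->R" for protein to reaction) and only links of the same kind are swapped, swaps that would duplicate an existing link are rejected, and the attempt cap of 100 per requested swap is arbitrary.

```python
import random
from collections import Counter

def swap_randomize(links, n_swaps, rng):
    """Randomize `links`, a list of (source, target, kind) triples, by
    exchanging the targets of randomly chosen pairs of same-kind links."""
    links = list(links)
    present = set(links)
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(links)), 2)
        (s1, t1, k1), (s2, t2, k2) = links[i], links[j]
        if k1 != k2:
            continue  # keep bipartiteness and the character of the links
        new1, new2 = (s1, t2, k1), (s2, t1, k2)
        if new1 in present or new2 in present:
            continue  # would create a duplicate link (or change nothing)
        present.difference_update((links[i], links[j]))
        present.update((new1, new2))
        links[i], links[j] = new1, new2
        done += 1
    return links

def degrees(links):
    """Out-degree of every source and in-degree of every target."""
    out_deg = Counter(s for s, _, _ in links)
    in_deg = Counter(t for _, t, _ in links)
    return out_deg, in_deg

# Hypothetical toy network.
original = [
    ("A", "R1", "P->R"), ("B", "R1", "P->R"), ("C", "R2", "P->R"),
    ("D", "R3", "P->R"), ("R1", "C", "R->P"), ("R2", "D", "R->P"),
    ("R3", "A", "R->P"),
]
shuffled = swap_randomize(original, n_swaps=20, rng=random.Random(0))
```

Because a swap only exchanges targets between two links of the same kind, every source keeps its out-degree and every target keeps its in-degree, whatever the random seed.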

Strong components

It is possible to uniquely partition the nodes of any directed graph into a set of strong components, see Fig. 1A, bottom left. Within each component, there is a path from every node of that component to every other node in the component. We generate the strong components by selecting an arbitrary node and finding the intersection between the set of nodes lying upstream and downstream to the selected node. This intersection plus the selected node forms one strong component. This process is repeated until all nodes are placed in a strong component. If there is no overlap between downstream and upstream sets for a given node, then, by definition, that node is the sole member of its strong component. The partitioning produced by this method is, for a given graph, unique and independent of the order in which the nodes are chosen.

The condensed graph corresponding to a given directed graph is one where each node represents one strong component of the original graph. There is a directed link from one node to another if, in the original graph, there is a link from any node of the first strong component to any node of the second. The condensed graph, by definition, cannot have any loops.

Notice that this partitioning into strong components is only possible if there is transitivity of paths, i.e., if there exists a path from node A to B, and from node B to C, then this implies there is a path from A to C. Transitivity is essential to construct non-overlapping strong components. If we restrict the allowed paths as described in Fig. 2 then this is no longer true and therefore non-overlapping strong components, as defined, cannot be constructed.
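The upstream/downstream intersection procedure described above can be sketched in a few lines. The function names `reachable` and `strong_components` and the toy graph are illustrative assumptions; the sketch applies to the unrestricted directed graph, where paths are transitive.

```python
def reachable(start, adjacency):
    """All nodes reachable from `start` (including `start` itself)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def strong_components(nodes, edges):
    """Partition `nodes` into strong components by intersecting the
    downstream and upstream sets of each not yet assigned node."""
    forward, backward = {}, {}
    for u, v in edges:
        forward.setdefault(u, []).append(v)
        backward.setdefault(v, []).append(u)
    unassigned, components = set(nodes), []
    for node in nodes:
        if node not in unassigned:
            continue
        component = reachable(node, forward) & reachable(node, backward)
        components.append(component)
        unassigned -= component
    return components

# Toy directed graph: a 3-cycle, a 2-cycle downstream of it, and a lone node.
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E"), ("E", "D")]
components = strong_components(nodes, edges)
```

Each node costs two graph searches here, which is simple but slower than linear-time alternatives such as Tarjan's algorithm. The result is the same partition in either case.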

Cost and spread

When calculating the downstream distribution in Fig. 2(A) & (C) we use a standard depth-first search: we keep track of visited nodes so that if we reach a node again by a longer path then it need not be searched for by alternative paths further downstream. This method does not take into account the bipartiteness of the graph.

We calculated cost and spread using a modified depth-first search of paths in the graph. When restrictions of the type discussed in Fig. 2 are added the standard method is no longer sufficient (because of the graph-theoretical non-transitivity of paths in bipartite graphs) and the only way to enumerate all the shortest distance paths is to actually go over all paths, of all lengths. In general, this is too computationally expensive and therefore we put an arbitrary upper cutoff on the length of allowed paths. This restricts us to looking at only those pairs which are within this cutoff distance. However, in practice, we are able to use a large cutoff of 14 (which covers over 90% of the pairs in the real network, see Fig. 4A) therefore this does not affect our conclusions.
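Once a path has been found, its cost and spread follow from simple counting. The sketch below applies the definitions from the Results section to the Arc pathway of Fig. 3A. It is a minimal illustration, not the authors' implementation: the function name `cost_and_spread`, the encoding of a path as an alternating list of proteins and reactions, and the decision to count each by-product only once (stated in the text for reactants, assumed here for products) are assumptions. ATP is left out because metabolites are disregarded.

```python
def cost_and_spread(path, reactions):
    """Cost and spread of `path`, an alternating list
    [protein, reaction, protein, ..., protein].

    `reactions` maps a reaction name to (reactants, products). A reversible
    reaction traversed backwards is handled by swapping its two sides.
    """
    on_path = set(path[0::2])          # protein nodes that carry the signal
    extra_inputs, by_products = set(), set()
    for i in range(1, len(path), 2):
        source, rxn = path[i - 1], path[i]
        reactants, products = reactions[rxn]
        if source in reactants:
            inputs, outputs = reactants, products
        else:
            inputs, outputs = products, reactants
        extra_inputs |= set(inputs) - on_path   # each reactant counted once
        by_products |= set(outputs) - on_path   # side branches only
    return len(extra_inputs), len(by_products)

# The Arc pathway from the text, without small molecules.
arc = {
    "R1": (("signal", "ArcB"), ("ArcBp",)),
    "R2": (("ArcBp", "ArcA"), ("ArcAp", "ArcB")),
    "R3": (("ArcAp", "IHF", "Fnr", "RNAPsigma"), ("SucABCD",)),
}
arc_path = ["signal", "R1", "ArcBp", "R2", "ArcAp", "R3", "SucABCD"]
cost, spread = cost_and_spread(arc_path, arc)
```

With this encoding the cost is 5 (ArcB, ArcA, IHF, Fnr and the RNAPσ complex), matching the value quoted in the text. The spread comes out as 1 because ArcB is released as a by-product of the second reaction; that number follows from the encoding chosen here and is not quoted in the paper.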

Authors' contributions

All authors contributed equally to the work reported in this paper.

Acknowledgements

The authors wish to thank the Danish National Research Foundation for funding through the Center for Models of Life at the NBI. KS and JBA wish to thank The Lundbeck Foundation. JBA wishes to thank The Fraenkel Foundation.

References

[1] Krishna, S., S. Maslov, and K. Sneppen (2007). UV-induced mutagenesis in the Escherichia coli SOS response: A quantitative model. PLoS Comput. Biol. 3, e41.
[2] Hoffmann, A., A. Levchenko, M. L. Scott, and D. Baltimore (2002). The IκB-NF-κB signaling module: Temporal control and selective gene activation. Science 298, 1241–1245.
[3] Aksenov, S. V. (1999). Dynamics of the inducing signal for the SOS regulatory system in Escherichia coli after ultraviolet irradiation. Math. Biosci. 157(1-2), 269–86.
[4] Alon, U., M. G. Surette, N. Barkai, and S. Leibler (1999). Robustness in bacterial chemotaxis. Nature 397(6715), 168–71.
[5] Arnvig, K. B., S. Pedersen, and K. Sneppen (2000). Thermodynamics of heat-shock response. Phys. Rev. Lett. 84(13), 3005–8.
[6] Axelsen, J. B. and K. Sneppen (2004). Quantifying the benefits of translation regulation in the unfolded protein response. Phys. Biol. 1, 159–65.
[7] Bray, D., R. B. Bourret, and M. I. Simon (1993). Computer simulation of the phosphorylation cascade controlling bacterial chemotaxis. Mol. Biol. Cell 4(5), 469–82.
[8] Covert, M. W., C. H. Schilling, and B. Palsson (2001). Regulation of gene expression in flux balance models of metabolism. J. Theor. Biol. 213(1), 73–88.
[9] Dodd, I. B., M. A. Micheelsen, K. Sneppen, and G. Thon (2007). Theoretical analysis of epigenetic cell memory by nucleosome modification. Cell 129, 813–822.
[10] Maslov, S., K. Sneppen, and I. Ispolatov (2007). Spreading out of perturbations in reversible reaction networks. New J. Phys. 9.
[11] […] Proc. Natl. Acad. Sci. U. S. A. 97(10), 5528–33.
[12] Farkas, I., H. Jeong, T. Vicsek, A.-L. Barabasi, and Z. N. Oltvai (2003). The topology of the transcription regulatory network in the yeast S. cerevisiae. Physica A 318, 601–612.
[13] Freeman, L. (1977). Set of measures of centrality based on betweenness. Sociometry 40, 35–41.
[14] Girvan, M. and M. E. Newman (2002). Community structure in social and biological networks. Proc. Natl. Acad. Sci. U. S. A. 99(12), 7821–6.
[15] Harris, S. E., B. K. Sawhill, A. Wuensche, and S. Kauffman (2002). A model of transcriptional regulatory networks based on biases in the observed regulation rules. Complexity 7, 23–40.
[16] Hartwell, L. H., J. J. Hopfield, S. Leibler, and A. W. Murray (1999). From molecular to modular cell biology. Nature 402(6761), C47–52.
[17] Kanehisa, M., S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa (2006). From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 34, D354–7.
[18] Karp, P. D., M. Riley, M. Saier, I. T. Paulsen, J. Collado-Vides, S. M. Paley, A. Pellegrini-Toole, C. Bonavides, and S. Gama-Castro (2002). The EcoCyc Database. Nucleic Acids Res. 30(1), 56–8.
[19] Krishna, S., A. M. Andersson, S. Semsey, and K. Sneppen (2006). Structure and function of negative feedback loops at the interface of genetic and metabolic networks. Nucleic Acids Res. 34(8), 2455–62.
[20] Krishna, S., M. H. Jensen, and K. Sneppen (2006). Minimal model of spiky oscillations in NF-kappaB signaling. Proc. Natl. Acad. Sci. U. S. A. 103(29), 10840–5.
[21] Ma, H. W., J. Buer, and A. P. Zeng (2004). Hierarchical structure and modules in the Escherichia coli transcriptional regulatory network revealed by a new top-down approach. BMC Bioinformatics 5, 199.
[22] Maslov, S. and K. Sneppen (2002). Specificity and stability in topology of protein networks. Science 296(5569), 910–3.
[23] Salgado, H., A. Santos-Zavaleta, S. Gama-Castro, M. Peralta-Gil, M. I. Penaloza-Spinola, A. Martinez-Antonio, P. D. Karp, and J. Collado-Vides (2006). The comprehensive updated regulatory network of Escherichia coli K-12. BMC Bioinformatics 7(1), 5.
[24] Thattai, M. and B. I. Shraiman (2003). Metabolic switching in the sugar phosphotransferase system of Escherichia coli. Biophys. J. 85(2), 744–54.
[25] Tiana, G., K. Sneppen, and M. H. Jensen (2002). Time delay as a key to apoptosis induction in the p53 network. Eur. Phys. J. B 29, 135–140.
[26] Shen-Orr, S. S., R. Milo, S. Mangan, and U. Alon (2002). Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics 31, 64–68.

Figures

Figure 1 - E. coli protein reaction network

(A, Left) The graph is the largest weak component of a bipartite network, consisting of proteins (orange circles) and reaction nodes (promoters (cyan squares), complex formations & modifications (black squares)). The two largest hubs, σ and CRP, and their links, have been removed for ease of visualisation. (A, bottom left) Illustration of the procedure of condensing a directed graph (see Methods). An arrow indicates that there is a path connecting the two strong components in the original graph; nodes correspond to strong components of minimum size two. (A, Right) The resulting condensed graph of the E. coli network. (B) The similarly condensed graph for a randomized version of the E. coli network. (C) The cumulative degree distribution of reaction nodes for the full graph in (A). (D) The cumulative degree distribution of protein nodes.

Figure 2 - Domains of influence

(A) The cumulative distribution of number of downstream targets s without restrictions on allowed paths. Green is the randomized network (null hypothesis) and blue is the real network, the latter yielding a power-law distribution. (B) Schematic showing the restrictions on allowed paths for graphs constructed from a reaction list. The graph shown corresponds to a single reversible reaction: A + B ↔ C. In the graph there is a path from e.g. B to A, but in the real biochemical reaction this path does not exist. In contrast, paths from A to C, and B to C, are allowed. (C) Distribution of downstream targets with restrictions on the allowed paths. Notice how the distribution is now better resolved on nodes with high influence, i.e. high s.

Figure 3 - Cost and spread of a path

(A) The Arc two-component regulatory pathway. (B) Schematic showing how the "cost" and "spread" of a signalling path, A ↔ F, is measured. In this case proteins B and D are necessary, giving a cost C = 2. The proteins E, G and H are produced as a side effect, hence the spread is S = 3. (C) Schematic illustrating the concept that if a protein is necessary for more than one reaction along the path, we count it only once. Thus, the cost is reduced to C = 1, as compared to (B).

Figure 4 - Measurements of cost and spread

(A) Number of pairs at a given (shortest) distance for the E. coli network (solid line) and its randomized version (dashed line). (B) Cost of a signalling path as a function of its length for the real (solid) and randomized (dashed) E. coli networks. (C) Spread of a signalling path as a function of its length for the real (solid) and randomized (dashed) E. coli networks. The shaded region illustrates which values lead to the strong components breaking up (if the network was infinitely large).

Figure 5 - Scatter of cost vs. spread

Scatter plot of spread vs. cost for each pair of nodes lying within a distance of 14 to each other for the real (solid circles) and randomized (open circles) E. coli networks.

Figure 6 - The largest strong components

The six largest strong components of the E. coli network, along with plots of the average cost, C(l), and average spread, S(l), as functions of signalling distance. The yellow areas show the range spanned by C(l) and S(l) for 100 randomized versions of the subgraphs.
