[PDF] A thermodynamic model for agglomeration of DNA-looping proteins

Abstract

In this paper, we propose a thermodynamic mechanism for the formation of transcriptional foci via the joint agglomeration of DNA-looping proteins and protein-binding domains on DNA: The competition between the gain in protein-DNA binding free energy and the entropy loss due to DNA looping is argued to result in an effective attraction between loops. A mean-field approximation can be described analytically via a mapping to a restricted random-graph ensemble having local degree constraints and global constraints on the number of connected components. It shows the emergence of protein clusters containing a finite fraction of all looping proteins. If the entropy loss due to a single DNA loop is high enough, this transition is found to be of first order.

Full PDF

aa r X i v : . [ q - b i o . S C ] O c t A thermodynamic model for agglomeration of DNA-looping proteins

Sumedha and Martin Weigt

Institute for Scientiﬁc Interchange, Viale Settimio Severo 65, Villa Gualino, I-10133 Torino, Italy

In this paper, we propose a thermodynamic mechanism for the formation of transcriptional focivia the joint agglomeration of DNA-looping proteins and protein-binding domains on DNA: Thecompetition between the gain in protein-DNA binding free energy and the entropy loss due to DNAlooping is argued to result in an eﬀective attraction between loops. A mean-ﬁeld approximation canbe described analytically via a mapping to a restricted random-graph ensemble having local degreeconstraints and global constraints on the number of connected components. It shows the emergenceof protein clusters containing a ﬁnite fraction of all looping proteins. If the entropy loss due to asingle DNA loop is high enough, this transition is found to be of ﬁrst order.

I. INTRODUCTION

Understanding the spatial organization of DNA in the cell / the cellular nucleus and its relation to transcription isone of the big challenges in cell biology [1, 2, 3, 4, 5]. In this context, the experimental observation of transcription fociis of great interest: The transcriptional activity is not evenly distributed inside the cell, but it is concentrated in focalpoints around so-called transcription factories [3]. These factories contain multiple copies of RNA polymerasis, tran-scription factors and parts of the machinery for post-transcriptional RNA modiﬁcations. In order to be transcribed,DNA has to loop back to these transcription factories, it is expected that one factory is surrounded by about 10-20DNA loops. In this and related phenomenological pictures [4, 5] the formation of transcription factories and DNAlooping are considered to be of fundamental importance for the large-scale spatial organization of the transcriptionalactivity. A sound theoretical understanding grounded on simple physical mechanisms is, however, missing.The major reason for the increased eﬃciency of transcription by transcription factories is the following: A locallyincreased concentration of transcription factors close to target genes enhances recognition of transcription factorbinding sites, the volume a transcription factor has to search before ﬁnding a target gene is substantially decreased.In bacteria, local concentration eﬀects are partially achieved by the co-localization of genes coding for a transcriptionfactor and their target genes along the one-dimensional chromosome itself [6]. By looping, target genes which are faralong the genome can be brought together in cellular / nuclear space. Very recently, it was shown experimantally[7] that compact DNA conformations actually enhance target localization compared to stretched conformations.This observation supplies strong support for the importance of coupling transcriptional activity to the spatial DNAorganization.The formation of single DNA loops and its consequences for gene regulation have recently been in the centerof interest of many bio-physical research works. These range from precise numerical descriptions of the loopingproperties of DNA resp. chromatin ﬁbers [8, 9] up to the thermodynamic modeling of mechanisms for transcriptionalgene-regulation. Both direct looping by bivalent transcription factors (as e.g. the lac repressor) [10, 11] and loopingvia attractive protein-protein interactions between DNA-bound proteins have been studied [12, 13]. The latter processis important in particular in distal gene regulation in eukaryotic cells [1].In this paper, we assume a more global point of view: May DNA loops and looping proteins agglomerate collectivelyto give rise to transcriptional foci? What are the thermodynamic ingredients leading to such an agglomeration? Inthis context, we model the DNA as a string containing many protein binding domains (BD), each one composed of K binding sites (BS). In this work we consider only bivalent DNA-binding proteins which are able to bind simultaneouslyto two diﬀerent BDs, introducing thus a DNA loop [24]. Fig. 1 resumes the basic model ingredients. We ﬁnd that thissimple model leads to an eﬀective attraction between DNA loops and thus to the formation of protein agglomerates.The role of multiple binding sites for a single loop has already been studied by Vilar et al. [10, 13], whereas multipleloops have been considered by Hanke and Metzler [14] but for BDs with only K = 1 BS. We will show that only thecombination of both is able to introduce the desired emergence of protein agglomerates. This raises an interestingquestion: given a concentration of looping proteins and an entropy cost of bringing two binding domains in closevicinity, is it possible to get an agglomerate of binding domains? Besides being of interest to transcription factoriesthe question is actually very general and interesting by itself when reformulated diﬀerently: If we consider each BDto be a monomer, then the problem is equivalent to understanding the eﬀect of introducing L non nearest neighbourlinks between N monomers, with global constraints resulting from the entropy loss due to DNA looping, and localconstraints due to the structure assumed for the BD.In the following Sec. II, we ﬁrst discuss the basic mechanism for protein agglomeration resulting from the combi-nation of these ingredients. Later in this paper, in Sec. III, we introduce a mean-ﬁeld model which can be mappedto a restricted random-graph ensemble. In Sec. IV, we solve it approximately generalizing a microscopic mean-ﬁeld sf b FIG. 1: Schematic representation of a single DNA loop with one looping protein: The looping protein binds to single bindingsites in two binding domains (each binding domain has K binding sites), leading to a binding free-energy gain of f b . A DNAloop leads to an entropy loss s . approach developed by Engel et al. [15]. In Sec. V, we discuss the results of our mean-ﬁeld model in the context offactories and compare them with the known results for collapse of randomly linked polymers. II. THE BASIC MECHANISM

As shown in Fig. 1, there are two competing eﬀects related to DNA looping: First, the binding of a linking proteinintroduces some free-energy diﬀerence − f b (for example in case of lac operon f b is of order 10-15 kcal/mol [16]). Thesecond contribution comes from the fact that each loop reduces the conformational entropy of the DNA, thus a linkleads to a total free-energy diﬀerence of ∆ F = − f b + T s , with T being the temperature and s being the entropy loss.In principle s depends on the length of the loop and on the DNA stiﬀness, cf. [14]. For this qualitative argument wedo not take care of this dependence and use the entropy loss of a typical-length loop.Now, as shown in Fig. 2, we introduce a second loop, and the total free-energy diﬀerence to the unlooped conﬁg-uration becomes ∆ F = − f b + 2 T s . There are two possible cases for the relative positions of the two loops: First,the loops are distant, and the binding of another linker protein has to introduce a new loop. Second, loops shareone BD. Then also the unconnected BDs of the two loops may be linked, cf. cases (c) and (d) in the ﬁgure. In thiscase, binding free energy is gained, but no new loop is introduced, i.e., no further entropy is lost. We thus have a freeenergy ∆ F = − f b + 2 T s which is lower than the one achievable by distant loops. This mechanism introduces an eﬀective attraction between binding domains of loops : A cluster of n loops might be connected by n ( n + 1) / III. A MEAN-FIELD DESCRIPTION VIA RANDOM PROTEIN-CONNECTION GRAPHS

To gain a ﬁrst understanding of the action of this eﬀective attraction, we set up a mean-ﬁeld model. The entropyloss s due to the introduction of a loop is assumed to be independent of the one-dimensional distance between twoBDs measured along the DNA chain. Note that this approximation would be exact for monomers in a box.On this level, BDs can be seen as vertices of a protein-connection graph , and each bound protein between two suchvertices forms an edge . We assume L proteins to be bound. The entropy loss due to this linking depends on thecomponent structure of the graph: A connected component (CC) of n vertices contains n − n vertices by C ( n ), and the total vertex number by N , we ﬁnd that the free-energy diﬀerence withrespect to the loop-free system is ∆ F = − Lf b + T s N X n =1 ( n − C ( n ) FIG. 2: Basic agglomeration mechanism: (a) DNA is represented by a string, binding proteins by linkers. (b) Binding a proteinto two BDs leads to a gain of binding free energy, but causes a loss in DNA conformational entropy. (c) The same happens, ifa second loop is introduced. (d) Now, if one BD is common to the loops, a next protein may bind to the still unlinked BDswithout major entropy losses. = − Lf b − T Cs + const. (1)with C = P n C ( n ) being the total number of CCs. This free energy has two competing negative contributions. Theﬁrst term favors large L by binding more proteins, and its ground state would be the fully connected graph whichhas only one CC. The second contribution in (1) favors many components for positive s . Its ground state is thus theempty graph with each of the N isolated vertices as a CC. The global behavior of the model is given by the balanceof these two terms, and can be characterized by the partition sum running over all graphs, Z = X graphs exp { Lf b /T + Cs } . (2)We note that this partition function describes a modiﬁed random-graph ensemble which depends only on the numberof links and the number of CCs. In fact, in usual diluted random graphs [17] each pair of vertices is connected withsome probability 0 < p <

1, and left unconnected with 1 − p . The probability of a speciﬁc graph with L edges is thenproportional to [ p/ (1 − p )] L , so it is exponential in the number of edges. If, further on, we reweight all graphs by somefactor q C , we ﬁnd that the graphs have a probability corresponding to Eq. (2) by identifying p/ (1 − p ) := e f b /T and q := e s . Further more, the sum over all graphs is restricted by the connectivity constraint: At most K proteins can bebound to one BD, for d bound proteins the distinguishable nature of the BS inside the BD results in a combinatorialfactor K ! / ( K − d )!. In the next section we give an analytical description of this problem for any K .The main question of this work, i.e. the question if an agglomerate exists or not, translates to the problem of graphpercolation: Agglomeration is equivalent to the existence of an extensively large connected component in the graph,i.e. to the existence of a giant component . IV. ANALYTICAL DESCRIPTION OF THE GRAPH ENSEMBLE

Before coming to the full description of the problem, i.e. to random graphs which have restricted degrees and arereweighted according to their number of CCs, we concentrate a moment on the case q = 1, i.e. without consideringthe number of CCs. The basic idea is that the number of CCs can be introduced in a later moment considering largedeviations from typical q = 1-graphs. The approach generalizes the cavity-type calculation of [15], which is the specialcase K = ∞ . We will resort to this limit in order to check correctness of our results. A. Graphs at q = 1 First, we describe the graphs without any constraint coming from the number of CCs, i.e. without entropy lossesin the DNA-looping model. In this case we have s = 0, i.e. q = 1. The graphs have N vertices (or BDs), each ofthem containing K distinguishable BS (called stubs in the following) which allow vertices to have up to degree K . Wedescribe graphs at the level of vertices by their symmetric adjacency matrix { J ij } with entries 1 whenever to verticesare connected via any two of their BS, and 0 else. The distinguishable nature of the binding sites is taken into accountby a c ombinatorial factor K ! / ( K − d )! for any vertex of degree d . This factor counts the number of non-equivalentway the d edges can be attached to the K BS.The statistical properties of a graph G with adjacency matrix { J ij } can be characterized by its number of links L ( G ) = X i

0) + γK N δ ( J ij , i N Y i =1  K !( K − P j J ij )! Θ( K − X j J ij )  = 1 Z ( γ, K, N ) (cid:16) γK N (cid:17) L ( G ) (cid:16) − γK N (cid:17) ( N ) − L ( G ) K Y d =0 (cid:20) K !( K − d )! (cid:21) N d ( G ) , (5)where the last line already takes into account that degrees beyond K are forbidden. In this notation, γ acts as achemical potential for links, and corresponds to the binding free energy in the protein case. Note that the combination γ/ ( K N ) is chosen such that, for K → ∞ , we recover normal Erd¨os-Renyi random graphs with average degree γ . Inthis limit, all our results here have to coincide with the ones of [15].From this microscopic description by the adjacency matrix of a graph, we can go directly to a coarser descriptiongiving the probability of a graph to have some degree distribution N d = p d N . We ﬁnd P ( N , ..., N K | γ, K, N ) = 1 Z ( γ, K, N ) (cid:16) γK N (cid:17) L (cid:16) − γK N (cid:17) ( N ) − L K Y d =0 (cid:20) K !( K − d )! (cid:21) N d × N ! Q d N d ! (2 L − K Y d =0 (cid:20) d ! (cid:21) N d . (6)This equation is obtained by multiplying the probability of a single of these graphs by the number of possiblerealizations of the degree sequence. The factors are, in the order of appearance, the number of ways to assigndegrees to the N vertices, the number of possibilities to wire a set of vertices with given degrees, and a correction ofovercounting due to the (at this point) indistinguishable occupied BSs. Note that the factor (2 L − L )! / L /L !can be easily understood in terms of generating the graph: Once degrees are assigned, each vertex of degree d isassigned d stubs (or half-edges). These are brought into a random permutation, and the ﬁrst stub becomes connectedto the second one, the third to the fourth etc. This procedure overcounts graphs: A factor 2 − L accounts for possiblepermutation of the two stubs inside each of the L links, the factor 1 /L ! for permutations of entire links. This proceduredoes not forbid double links or self-loops, but these are rare and therefore do not inﬂuence the global statistical featuresof the graph ensemble. Using Stirlings formula ln N ! = N ln N − N + o ( N ) we can rewrite this as P ( N , ..., N K | γ, K, N ) = 1 Z ( γ, K, N ) × exp ( N " ℓ ln (cid:16) γK N (cid:17) − γ K + X d p d ln (cid:18) Kd (cid:19) − X d p d ln p d + ℓ (ln[2 ℓN ] − + o ( N ) ) = 1 Z ( γ, K, N ) exp ( N " − X d p d ln (cid:18) Kd (cid:19) − p d ! + ℓ ln (cid:18) γℓK (cid:19) − ℓ + o ( N ) ) , (7)where we have used ℓ = L/N = P d dp d . The typical degree distribution, which is realized in this ensemble withprobability tending to one in the thermodynamic limit, can be evaluated by the maximum of this expression. Derivingthe exponent by p d , and using a Lagrange multiplier to ensure normalization of the degree dsitribution, we arrivedirectly at (overbars always denote typical values) p d = (cid:18) Kd (cid:19) z d (1 + z ) K , z = s γℓK . (8)With the average degree determined from this distribution, d = 2 ℓ = K z z , (9)we arrive at two self-consistent equations for z and d which are solved by (a second non-physical solution is not shown) z = 12 (cid:18)r γK − (cid:19) dK = p γK − p γK + 1 . (10)Note that, for K → ∞ , the degree distribution (8) tends as expected to the Poissonian law e − γ γ d /d !. These resultsallow us to express also the dominant contribution to the partition function Z ( γ, K, N ) by the exponent in Eq. (7)evaluated at the saddle point:1 KN ln Z ( γ, K, N ) = − dK ln dK − (cid:18) − dK (cid:19) ln (cid:18) − dK (cid:19) + d K ln (cid:18) γK dK (cid:19) − d K + o (1) . (11) B. The case q = 1 Now we have all results being important for analyzing the full model including entropy losses. In diﬀerence to thecase discussed so far, the full graph ensemble includes a weight q C ( G ) depending on the number C ( G ) of connectedcomponents of graph G . We therefore consider the modiﬁed ensemble P ( G | γ, q, K, N ) = 1 Z ( γ, q, K, N ) P ( G | γ, q, N ) q C ( G ) , (12)where the normalizing partition function is given as Z ( γ, q, K, N ) = X G P ( G | γ, q, N ) q C ( G ) . (13)Note that this partition function fulﬁlls Z ( γ, q = 1 , K, N ) = 1 due to the use of the normalized distribution P ( G | γ, q, N )in Eq. (12).Compared to the previous section, this graph ensemble is hard to handle: The distribution depends on the globalquantity C ( G ) which cannot calculated as easily as the degree distribution, which was suﬃcient to characterize thegraph ensemble at q = 1. To get information, we modify the cavity-type approach of [15]. There, information onthe graph ensemble with C -dependent weight (but without degree constraint, i.e. the case q = 1 and K → ∞ ) wasobtained via a simple and intuitive idea: If we add a new vertex to a typical graph of size N , and the degree of thisvertex is randomly selected according to the typical degree distribution of size- N graphs, the new graph is basicallyequivalent to a typical graph of size N + 1. Unfortunately, this argument holds only approximately in our case. Theadded vertex does not become a typical one of the enlarged graph. However, since it holds for K → ∞ and for K = 1,so we expect it to be rather precise also for intermediate-large K .Assume therefore that we add a vertex to a graph G of N vertices. The degree d of this vertex is drawn randomlyfrom p d = (cid:0) Kd (cid:1) z d / (1 + z ) K with z given by Eq. (10) (graph ensemble at q = 1). Now, the number of componentschanges according to P ( C | γ, K, N + 1) = X ∆ C ˜ D (∆ C ) P ( C + ∆ C | γ, K, N ) . (14)The kernel ˜ D (∆ C ) can be decomposed into various contributions according to the degree d of the new vertex, andthe number d of links this new vertex makes with the giant component of G . The change ∆ C of the number of CCsalso depends on these two numbers. For positive d >

0, we add d − d small components to the giant one, i.e. wehave ∆ C = d − d . For d = 0, we unify d small components to a single one, and we have ∆ C = d −

1. The kerneltherefore results in ˜ D (∆ C ) = X K ≤ d ≤ d ≤ ˜ D (∆ C, d, d ) , ˜ D (∆ C, d, d ) = αp d (cid:18) dd (cid:19) π d (1 − π ) d − d δ (∆ C, d − d − δ ( d , , (15)with π denoting the probability of selecting an end-vertex inside the giant component. Due to the special deﬁnitionof our ensemble, where we do not select directly vertices but free BS associated to a vertex, the number π equalstherefore the fraction of all free BS being inside the giant component. It has a simple relation to the fraction ν of vertices belonging to the giant component: π = K − d in K − d ν . (16)Here d in denotes the average degree inside the giant component, d the one of the full graph.The reweighting factor α can be calculated exactly in the case K → ∞ , cf. [15]. Its precise value is not of interestin our discussion. If we multiply Eq. (14) by q C and sum over C , we obtain Z ( γ, q, K, N + 1) = X δC ˜ D (∆ C ) q − δC Z ( γ, q, K, N )= ζ ( γ, q, K ) Z ( γ, q, K, N ) . (17)The logarithm of ζ ( γ, q, K ) can be interpreted as a free-energy shift in the graph ensemble due to adding a new vertex.Using Eq. (15) it can be calculated right away, resulting in ζ ( γ, q, K ) = α (cid:20) q + z (1 − π ) q (1 + z ) (cid:21) K " q − (cid:26) qzπq + z (1 − π ) (cid:27) K . (18)Note that due to the concentration of intensive quantities to their typical values in the summation over all graphs,we have replaced π by its saddle-point value π .The decomposition of ˜ D (∆ C ) helps to get more detailled insight into the graph structure. In fact, the quantity ζ ( d, d | γ, q, K ) = X ∆ C ˜ D (∆ C, d, d ) q − δC = αp d (cid:18) dd (cid:19) π d (1 − π ) d − d q − d + d + δ ( d , (19)describes an eﬀective single-vertex Boltzmann factor, and single-vertex quantities as the probability of belonging tothe giant component or the degree distribution can be derived from it.To start with the giant component size, we remind that for all d > N + 1)-vertex graph. Therefore the fraction of vertices not belonging to the giant component can be written as1 − ν = 1 ζ ( γ, q, K ) K X d =0 ζ ( d, d = 0 | γ, q, K )= qq − n qzπq + z (1 − π ) o K . (20)The degree distribution of the graph results in P ( d | γ, q, K ) = 1 ζ ( γ, q, K ) d X d =0 ζ ( d, d | γ, q, K )= (cid:18) Kd (cid:19) h z (1 − π ) q i d (cid:20) q − n q π − π o d (cid:21)h q + z (1 − π ) q i K (cid:20) q − n qzπq + z (1 − π ) o K (cid:21) . (21)For q = 1, this distribution deviates from a simple binomial distribution. The average vertex degree follows immedi-ately, dK = z (1 − π )( q −

1) + [1 + π ( q − n qzπq + z (1 − π ) o K − [ q + z (1 − π )] (cid:20) q − n qzπq + z (1 − π ) o K (cid:21) = z (1 − ν )(1 − π ) q [ q + z (1 − π )] " q − (cid:26) qπ − π (cid:27) (cid:26) qν − ν (cid:27) K − K , (22)where we have used Eq. (20) to simplify the expression in the second line. The degree distribution inside (resp.outside) the giant component can be obtained by restricting sums to d > d = 0), P in ( d | γ, q, K ) = P dd =1 ζ ( d, d | γ, q, K ) P ∞ d =1 P dd =1 ζ ( d, d | γ, q, K ) ,P out ( d | γ, q, K ) = ζ ( d, d = 0 | γ, q, K ) P ∞ d =0 ζ ( d, d = 0 | γ, q, K ) . (23)For the outside average degree we obtain the particularily simple result d out = K z (1 − π ) q + z (1 − π ) . (24)For the inside average degree, we use the fact that the total average degree is the weighted average of the inside andoutside degrees, and we ﬁnd d in = d − d out (1 − ν ) ν . (25) C. The phase diagram

We have now enough equations to determine self-consistently the phase diagram. Putting together Eqs. (16,20,22,24)and (25), and eliminating directly the expression for d in , we ﬁnd a closed set of three equations π = 1 K − ℓ (cid:20) Kν − ℓ + K z (1 − π )(1 − ν ) q + z (1 − π ) (cid:21) ℓ = K z (1 − ν )(1 − π ) q [ q + z (1 − π )] " q − (cid:26) qπ − π (cid:27) (cid:26) qν − ν (cid:27) K − K (26) ν = 1 − qq − n qzπq + z (1 − π ) o K . We had introduced the model in function of the parameter γ which can be understood as a chemical potential coupledto the number of edges in the graph. Since this parameter has no very obvious interpretation due to its interactionwith the degree-constraints, we prefer to use its conjugate quantity, the link density ℓ as a control parameter. In thissense, for given K , the phase diagram is spanned by q and ℓ , and Eqs. (26) allow to determine the unknown quantities ν, π and γ .Eqs. (26) always have the solution { ν, π, γ } = { , , ℓqK/ ( K − ℓ ) } . It corresponds to a phase without any extensiveCC, i.e. to a non-agglomerated phase. For large enough ℓ and q = e s , also other solutions exist. To see this, weexpand Eqs. (26) up to second order in ν and π , and ﬁnd in particular π = − K [2 ℓ ( K − − K ]2 ℓ ( K − K ( q − ℓ c = K K − , ∀ q < q c = 2 − K (28)Note that, for K → ∞ and q = 1, this result reproduces the known percolation result in Erd¨os-R´enyi random graphs.For q < q c , for all K , Eq. (27) implies that π ∼ ( ℓ − ℓ c ) near the transition point. At q = q c , we can expand Eqs. (26)up to third order in ν and π . We ﬁnd that there is a percolating point at same value of ℓ c as for q < q c , but with π ∼ ( ℓ − ℓ c ) / . Note that this transition exists for all K >

2, at K = 2 itself the transition point would be ℓ c = 1which equals the highest possible degree in this graph (due to 2 ℓ ≤ K ).For q > q c , Eq. (27) does not make sense. We ﬁnd a discontinuous transition at smaller ℓ c ( q, K ) which has tobe determined from Eqs. (26) via the spinodal point; the transition point can be obtained with good precision usingsymbolic manipulation software like Mathematica [18]. In this case, the largest component jumps from a non-extensivesize to a ﬁnite fraction of the full system. In Fig. 3, the phase diagram for various values of K is given. It is found tobe qualitatively similar for all K ≥

3, but agglomeration is favored for higher-order BDs at same number of links. Inthe inset of the ﬁgure we show also the discontinuous nature of the transition: At given number of links the parameter q is changed, and the size of the largest CC is recorded. We ﬁnd an excellent agreement between MC simulationsof random graphs using Metropolis type rewiring steps, and the analytical results obtained from Eqs. (26). Thisillustrates the quality of approximation done in the analytical approach under vertex addition.It is very interesting that the phase transition appears at smaller ℓ for higher entropy losses s = ln q . The reason isthat an increased s leads to a compaction of the giant component, where links can be added without loosing furtherCCs. Therefore, even if the transition appears at lower global average degree, the average degree inside the largestCC always exceeds two. Again, this fact illustrates why K > q = 10, andfor diﬀerent values of K . For all K ≥

3, both d in and d out show clearly a discontinous jump at a critical K -dependentvalue of ℓ . On one hand, d in always jumps to a value slightly above two being necessary for the largest component tobe connected. The degree outsite the giant component, on the other hand, jumps to values which become smaller andsmaller with increasing K . In the right panel of Fig. 4, we also show the fraction of vertices resp. edges being in thegiant component. For higher K values, the fraction of vertices becomes smaller at the transition point, whereas thefraction of links becomes larger. This illustrates that, for larger K , the agglomerate at the threshold becomes smallerbut more dense.Similarly, it is illustrative to look at the properties of the giant component for various values of of q . Fig. 5 showsthe plots for K = 5 at q = 5 ,

10 and 20. We see that, as q increases, the fraction of vertices inside the giant componentgoes down (right panel), but the fraction of edges goes up, implying that the agglomerate becomes more and morecompact with increasing q . Also the diﬀerence between the indegree and outdegree goes up as q increases (left panel). V. DISCUSSION

The aim of the present paper is to present a minimal model which, on the basis of a thermodynamic approach toDNA-protein interactions, is able to show protein agglomeration. In this sense, it can serve as a minimal model for q d = l K = 3K = 5K = 10K = 100

10 20 30 q ν l l l FIG. 3: Phase diagram for protein agglomeration in mean-ﬁeld description for K = 3 , , , q = e s , for ℓ = 0 . K = 20. The full line is the analytical result of Eqs. (26), the symbols show results of MC simulationsfor N = 5000, each symbol is an average over 900 independent equilibrium conﬁgurations. the mechanism behind the formation of transcription factories, which are observed in transcriptionally active cells. Inour paper we show that two ingredients are suﬃcient: DNA-looping proteins which are able to bind simultaneously totwo – also distant – protein binding sites on the DNA, and binding domains on the DNA which contain, on average,more than two binding sites each. In this case, the competition between free-energy gain by protein binding andentropy loss by DNA looping is found to lead to an eﬀective attraction between DNA loops. As a consequence,binding domains and proteins agglomerate collectively.In its minimal character, the model might miss some important properties of the biological system. As an example,we consider the number of doubly-bound proteins as one important control parameter, whereas the relevant parametershould be the total number of proteins - which would also include free and singly bound molecules. Using biologicallyreasonable parameters for the binding aﬃnities (ca. 5-15 kcal/mol), we ﬁnd in simulations that basically none of theproteins stay free and a large majority is doubly bound in the phase-transition region. It would be interesting if wecould extend our analytical model accordingly.Further more, our model did not consider the speciﬁcity of interaction between DNA-binding proteins and theirbinding sites. This speciﬁcty may result in the simultaneous agglomeration of various speciﬁc transcription factories,which actually is an important ingredient to [3, 4]. Including this possibility into our model would be anotherinteresting generalization, but it would not eﬀect the very basic agglomeration mechanism.Our model of agglomeration is similar to models of randomly linked polymers studied in the literature. Previousstudies mostly considered a Gaussian chain with randomly placed links using variational [19] and numerical methods[20]. Bryngelson et al [19] in their study based on variational approach and scaling arguments tried to argue thatfor a Gaussian chain when the links are soft, there is always a transition. They also argued that it is a continuoustransition that occurs above a threshold which is a product of the density of links and logarithm of average lengthof the loop. This result implied that for some arbitrary polymer, it is possible for transition to occur at vanishingdensity of links ( ∼ / ln N ). This result was countered by Kantor et al [20]. Based on scaling arguments they arguedthat number of links M necessary for a percolating collapsed phase to exist scale as N φ , with φ = 1 − /dν , where ν is the exponent which describes the shape of the polymer[25] (radius of gyration R g ∼ N ν ). Also, for the probabilityof looping p ( l ) ∼ A/l α , with 1 < α <

2, it was shown by Schulman and Newman [21] that for

M < N/ M ≥ N/

2, percolation may or may not occur depending on the value of A and α .Our solution correspond to the case where we have ignored length dependence of the entropy cost of the loop ( α = 0)and we ﬁnd that there would always be a transition from extended phase to collapsed phase, though the nature oftransition depends on the entropy cost ( s = − ln A ) of the loop. Most surprisingly, larger the cost of looping, smalleris the concentration at which the transition happens. Also it is no longer a second order percolation like transition,0 d-d =2(l-l ) d / d d-d =2(l-l ) ou t c c cc i n ν , π K=3K=5K=10 K=3K=5K=10

FIG. 4: Left panel: Average degrees d in inside (upper curves) and d out outside the giant component (lower curves), as a functionof the total link density ℓ , for values K = 3 , ,

10 and ﬁxed q = 10. Right panel: Fraction ν of vertices (lower curves) andfraction of links (upper curve) for the same parameters as in the left panel. but a ﬁrst order transition for low concentration of links.We did simulations for a Gaussian chain in three dimensions: In our mean-ﬁeld model we considered entropy lossesto be independent of distance between the BDs measured along the DNA. The distance dependence of entropy in vivois complicated. If we assume that the unlinked DNA behaves on long scales like a Gaussian chain, the entropy loss ismonotonously increasing in the the loop length and scales as q ( l ) = e s ( l ) ∼ ( l/l ) / , where l is the minimum distancebetween two ends of a loop. If we now look to a connected component of the n vertices { i , ...i n } with i m < i m +1 forall 1 ≤ m < n , the entropy loss is given by s ( i , ...i n ) = P n − m =1 s ( i m +1 − i m ) . In our simulations, we ﬁnd that thereis still a discontinuous transition which depends on the choice of l (see Fig. 6). Since longer loops are suppressedcompared to shorter ones, one could expect CC to be more localized in one-dimensional distance along the DNA. Thiswould correspond to Cook’s picture where DNA loops around a factory form a kind of rosetta, before DNA goes tothe next factory. The logarithmic entropy dependence taken into account in our simulations is not suﬃcient for sucha localization.Based on these simulations, we suggest that even when we take length dependence into account, there is a possibilityof ﬁrst order transition to the agglomerate, at small density of links for high entropy cost. The reason this couldbe possible would be because of the larger (exponential) contribution of the distribution of links to the entropy incomparision to the lograthimic dependence of entropy on length.We have ignored the interaction between loops. It is not clear how important that could be. For example, in thecase of DNA denaturation [22], exact results which ignore interaction between loops predict a continuous transition inall dimensions less than four. Whereas using scaling arguments and taking interaction of loop with rest of the chaininto account, Kafri et al [23] showed that the transition becomes ﬁrst order in d = 2.The present work can be extended into various directions. First, from biological point of view it would be interestingto go to more realistic modeling schemes (like worm-like chains for the DNA molecule) and to check the proposedpicture. Such a simulation would also allow to introduce biologically realistic parameters for protein binding aﬃnitiesand entropy losses, and to locate such a realistic setting in the simpliﬁed mean-ﬁeld phase diagram. However, currentsimulations are concentrated to a single loop [8, 9], so this task seems to pose a considerable numerical challenge. Itwould be interesting to see if self exclusion, depletion eﬀects due to macromolecular crowding or restricted volumewould lead to a spatial localization of the agglomerate. A second direction could be the inclusion of diverse loopingproteins with speciﬁc binding sites on the DNA to see whether equilibrium thermodynamics can drive the creation oftranscription factor speciﬁc spatial foci. Further, it would be interesting to see perhaps with the help of simulationsfor Gaussian and self-avoiding chains if one can see a regime of discontinuous transition from extended to collapsed1 -0.05 0 0.05 0.1 d-d =2(l-l ) d / d d-d = 2(l-l ) q=10q=5q=20 q=5q=20q=10 π , ν c c cc ou ti n FIG. 5: Left panel: Average degrees d in inside (upper curves) and d out outside the giant component (lower curves), as a functionof the shifted link density ℓ − ℓ c , for ﬁxed K = 5 and diverse values of q = 5 , ,

20. Right panel: Fraction ν of vertices (lowercurves) and fraction of links (upper curve) for the same parameters as in the left panel. phase. Acknowledgment — We warmly thank F. K´ep`es and O. Martin for many helpful discussions. We are also thankfulto H. Orland for pointing out the reference [19] to us. This work was support by the EC via the STREP GENNETEC(“Genetic networks: emergence and complexity”). [1] B. Alberts, A. Johnson, J. Lewis, M. Raﬀ, K. Roberts, and P. Walter,

Molecular Biology of the Cell (Garland, New York2002).[2] T. Cremer, C. Cremer, Nat. Rev. Gen. , 292 (2001); C. Lanctot, T. Cheutin, M. Cremer, G. Cavalli, T. Cremer, Nat.Rev. Gen. , 104 (2007).[3] P.R. Cook, Science , 1790 (1999); P.R. Cook, Nat. Gen. , 347 (2002).[4] F. K´ep`es, J. Mol. Biol. , 859 (2003); F. K´ep`es, C. Vaillant, Complexus ,171 (2003).[5] M.A. Wright, P. Kharchenko, G.M. Church, and D. Segre, Proc. Nat. Acad. Sci. , 10559 (2007).[6] G. Kolesov, Z. Wunderlich, O.N. Laikova, M.S. Gelfand, and L.A. Mirny, Proc. Nat. Acad. Sci. , 13948 (2007).[7] B. van den Broek, M.A. Lomholta, S.M.J. Kalisch, R. Metzler, and G.J.L. Wuite, Proc. Nat. Acad. Sci. , 15738 (2008).[8] M. Bon, D. Marenduzzo, P.R. Cook, Structure , 197 (2006).[9] N.M. Toan, D. Marenduzzo, P.R. Cook, C. Micheletti, Phys. Rev. Lett. , 178302 (2006).[10] J.M.G. Vilar and S. Leibler, J. Mol. Biol. , 981 (2003).[11] L. Bintu, N.E. Buchler, H.G. Garcia, U. Gerland, T. Hwa, J. Kondev, T. Kuhlman, and R. Phillips, Current Opinion inGenetics & Development , 125 (2005).[12] N.E. Buchler, U. Gerland, and T. Hwa, Proc. Nat. Acad. Sci. , 5136 (2003).[13] J.M.G Vilar and L. Saiz, Phys. Rev. Lett. , 167 (2003).[15] A. Engel, R. Monasson, A.K. Hartmann, J. Stat. Phys. , 387 (2004).[16] J.M.G Vilar and L. Saiz, Current Op. in Gen. and Dev., , 136(2005).[17] P. Erd¨os, A. R´enyi, Publicationes Mathematicae Internal Constraints Induce Localization in an Isolated Polymer Molecule , 1996, PRL Conformations of randomly linked polymers , 1996, PRE Long range percolation in one dimension , 1983, J. Phys. A , L 639; C.M.Newman and Schulman L.S. .... s ν FIG. 6: We consider the contribution of entropy loss to the change in free energy for a loop between two BDs to be equal to s ( − ln l + ln l ). The ﬁgure plots the fraction of sites in the giant component for K → ∞ as a function of s for a chain with N = 5000 binding domains and 1000 transcription factors.[23] Kafri Y, Mukamel D and Peliti L, Why is the DNA Denaturation Transition First Order? ,[24] Note that real DNA-binding proteins recognize speciﬁc binding sites along the DNA. This speciﬁcity allows for the simul-taneous agglomeration of transcription factories containing speciﬁc transcription factors, and thus the attraction of speciﬁcbinding sites. The basic mechanism described here is, however, not aﬀected.[25] this is diﬀerent from the deﬁnition of νν