[PDF] Modelling Protein Target-Search in Human Chromosomes

Abstract

Several processes in the cell, such as gene regulation, start when key proteins recognise and bind to short DNA sequences. However, as these sequences can be hundreds of million times shorter than the genome, they are hard to find by simple diffusion: diffusion-limited association rates may underestimate in vitro measurements up to several orders of magnitude. Moreover, the rates increase if the DNA is coiled rather than straight. Here we model how this works in vivo in mammalian cells. We use chromatin-chromatin contact data from state-of-the-art Hi-C experiments to map the protein target-search onto a network problem. The nodes represent a DNA segment and the weight of the links is proportional to measured contact probabilities. We then put forward a master equation for the density of searching protein that allows us to calculate the association rates across the genome analytically. For segments where the rates are high, we find that they are enriched with active genes and have high RNA expression levels. This paper suggests that the DNA's 3D conformation is important for protein search times in vivo and offers a method to interpret protein-binding profiles in eukaryotes that cannot be explained by the DNA sequence itself.

Full PDF

MModelling Protein Target-Search in Human Chromosomes

Markus Nyberg, Tobias Ambj¨ornsson, Per Stenberg, and Ludvig Lizana ∗ Integrated Science Lab, Department of Physics, Ume˚a University, SE-901 87 Ume˚a, Sweden Department of Astronomy and Theoretical Physics, Lund University, SE-223 62 Lund, Sweden Department of Ecology and Environmental Science, Ume˚a University, SE-901 87 Ume˚a, Sweden (Dated: September 23, 2019)Several processes in the cell, such as gene regulation, start when key proteins recognise and bindto short DNA sequences. However, as these sequences can be hundreds of million times shorterthan the genome, they are hard to ﬁnd by simple diﬀusion: diﬀusion-limited association rates mayunderestimate in vitro measurements up to several orders of magnitude. Moreover, the rates increaseif the DNA is coiled rather than straight. Here we model how this works in vivo in mammalian cells.We use chromatin-chromatin contact data from state-of-the-art Hi-C experiments to map the proteintarget-search onto a network problem. The nodes represent a DNA segment and the weight of thelinks is proportional to measured contact probabilities. We then put forward a master equation forthe density of searching protein that allows us to calculate the association rates across the genomeanalytically. For segments where the rates are high, we ﬁnd that they are enriched with active genesand have high RNA expression levels. This paper suggests that the DNA’s 3D conformation isimportant for protein search times in vivo and oﬀers a method to interpret protein-binding proﬁlesin eukaryotes that cannot be explained by the DNA sequence itself.

Several processes in the cell nucleus start when pro-teins bind to speciﬁc DNA sequences. For exam-ple, transcription factors that regulate genes and theCRISPR/CAS9 complex that edits DNA [1, 2]. Becausetarget sequences are much shorter than the genome – afew base pairs compared to billions in humans – theseproteins face a needle-in-a-haystack problem.Despite the large number of potential targets, mea-sured search times are shorter than theoretical estimates.The Lac repressor in

E. coli needs 1-5 min to ﬁnd itsdesignated site [3] which is twice as fast as a three-dimensional (3D) search by diﬀusion inside the bac-terium’s volume ( ≈ in vitro measurements by one to two orders ofmagnitude [4]. These examples suggest that some pro-teins search by other mechanisms than simple diﬀusion.One mechanism that speeds up diﬀusive search is of-fered by the Facilitated-diﬀusion model [5]. In thismodel, the proteins alternate between 3D diﬀusion and1D diﬀusion along the DNA. This decreases the searchtime because the proteins may take shortcuts to a linearlydistant DNA segment through the surrounding bulk. Al-though criticized [6, 7], the model is widely accepted afterexperiments in bacteria [3, 8] and in vitro [9].Another important aspect in target ﬁnding is rebind-ing. This is because proteins likely bind to a DNA seg-ment that is close by in 3D rather than far away. Sev-eral modelling studies examined this aspect and foundthat search times change with DNA conformation [7, 9–14]. However, because these studies treat the DNA as asimple polymer the results cannot be generalized beyondbacteria to eukaryotes that have longer DNA and morecomplex 3D organization.The most widely used experimental method to study 3D genome organization is Hi-C [15, 16]. The Hi-Cmethod cross-links close by DNA fragments inside the nu-cleus and gives a genome-wide map of the number of con-tacts between fragments pairs (Fig. 1a) [17]. MamalianHi-C maps have several interesting features – some thatare evolutionary conserved [18]. For example, the block–like structure along the diagonal represents densely con-nected 3D domains. The locations of these domains cor-relate with protein binding sites, active genes, and chro-matin states [19–21]. And, beyond the domains, the av-erage contact probability decays as a power-law [34].Hi-C is the state-of-the-art Chromosome ConformationCapture method that estimates the chromatin contactprobabilities across the genome. But it does not providechromatin’s 3D structure. Going from the contact mapto a computer-generated 3D structure is diﬃcult [22, 23].Because the chromatin’s spatial organization is so com-plex, there are few attempts to model protein search ineukaryotes. One exception [24], represents chromatin asa crumpled polymer globule that reproduces the averagelooping probabilities in the human genome. However, thecrumpled globule lacks the Hi-C maps’ domain structure.Here we model target search in eukaryotes without re-lying on chromatin’s explicit 3D structure. Instead, werepresent the DNA as a network in which the nodes areDNA segments and the link weights are proportional tothe contact probabilities measured in Hi-C. Then we putforward a master equation for the density of proteins onthe network over time that allows us to calculate the as-sociation rate – the inverse mean-ﬁrst passage time – toall nodes analytically. Correlating these rates with RNAexpression data in humans, we ﬁnd that easy-to-ﬁnd ge-nomic regions are enriched with active genes and havehigh RNA expression. a r X i v : . [ q - b i o . S C ] S e p k on k o↵ i j ! ij ab c FIG. 1: Modelling protein search on DNA as search on a weighted network. (a) Hi-C map: the number of pairwise physicalcontacts between 10 kilo-basepair (kb) DNA fragments in a part of human chromosome 21. Dark pixels indicate many contacts.(b) Schematic representation of the model. The key parameters are: jump rates between nodes i and j ( ω ij ), unbinding rateto the bulk ( k oﬀ ), and rebinding rate ( k on ). Red circles represent searching proteins. (c) Coarsed network representation of theHi-C map in (a) where each node represents a 160 kb fragment. The link weights v ij are proportional to the number of Hi-Ccontacts. We assume that ω ij ∝ v ij . Node numbering refers to positions along the DNA. THE MODEL

We model the proteins’ search on chromatin as non-interacting particles that move between nodes in aweighted network that represent physically connectedchromatin segments (Fig. 1).The model has three important parameters. First, thejump rate ω ij between nodes i and j ( i, j = 1 , . . . , N ).We assume that ω ij is proportional to the number ofcontacts between segments i and j , v ij , measured in Hi-C: ω ij = f coll v ij ; f coll – a free parameter in the model –is the collision frequency that leads to a successful jump.Second, the unbinding rate k oﬀ to the surrounding bulk.And third, the rebinding rate ¯ k on to a randomly chosennode. We assume that the protein bulk concentration n bulk is constant and therefore use k on = ¯ k on n bulk ; k on and k oﬀ therefore have the same unit: time − .In terms of these parameters, we formulate a masterequation for the protein number in node i at time t , n i ( t ): dn i ( t ) dt = N (cid:88) j =1 ω ij n j ( t ) − k oﬀ n i ( t ) + k on (1)The ﬁrst term represents diﬀusion on the network – weput ω jj = − (cid:80) i (cid:54) = j ω ij – and the two remaining termsdescribe the exchange with the surrounding bulk.As in [14], we assume that the protein density is ini-tially uniform ρ = k on /k oﬀ except for the target i = a which always is zero n a ( t ) = 0, that is n i (0) = ρ (1 − δ ia ). In terms of the eigenvalues λ j and eigenvectors V ij of ω ij , the solution to Eq. (1) is n i ( t ) = (cid:80) Nj =1 V ij (cid:26) k j on k off − λ j (cid:2) − e − ( k off − λ j ) t (cid:3) + e − ( k off − λ j ) t ρ (cid:80) l (cid:54) = a V − jl (cid:27) . (2)where k i on = k on (cid:80) Nj =1 V − ij . This equation is key to cal-culate the association rate to the target node i = a . PROTEIN ASSOCIATION RATES

To calculate the association rate K a , we use that itis one over the mean ﬁrst arrival time: K a = τ − a . Toobtain τ a , we proceed as in [14]. First we calculate thetotal number of particles that arrived to the target up totime t : J a ( t ) = (cid:82) t j a ( t (cid:48) ) dt (cid:48) where j a ( t ) = (cid:88) i ω ai n i ( t ) + k on (3)is the particle ﬂux. Second, if N p is the initial proteinnumber on the network, and if k on = k oﬀ = 0, then J a ( t ) /N p is the probability that a single protein havereached the target up to time t . The probability thatthe target has not been reached – the target’s survivalprobability – is therefore S a ( t ) = (1 − J a ( t ) /N p ) N p and τ a = (cid:82) ∞ S a ( t ) dt . Generalizing this argument for k on > FIG. 2: Predicted genome-wide association rates K a for chromosomes 1-21. K a vary by several orders of magnitude along thechromosomes but the 95% conﬁdence interval is a few percent of the mean (cid:104) K a (cid:105) = 1 ( K a = 1 . ± . K a from Eq. (6) ( k oﬀ (cid:29)

1) using the parameters k oﬀ = 0 . k on = 0 .

001 and ρ = 0 . the number of proteins N p → ∞ and therefore S a ( t ) (cid:39) exp ( − J a ( t )) [25]. Note that if k on = 0 and k oﬀ > τ a to diverge.Finally, using n i ( t ) from Eq. (2), we may calculate K a exactly as: K a = 1 (cid:82) ∞ exp ( − J a ( t )) dt , (4) J a ( t ) = k on t + (cid:88) i (cid:54) = a ω ai (cid:90) t n i ( t (cid:48) ) dt (cid:48) , (5) Limiting cases for K a Depending on the unbinding rate k oﬀ , K a has threeregimes. (i) Small k oﬀ . In this regime most particles ﬁnd thetarget before they unbind. This leaves the initial den-sity ρ approximately unchanged, n i ( t ) (cid:39) ρ . Using thisapproximation in Eq. (5) leads to J a ( t ) (cid:39) ¯ J a t , where¯ J a = k on + ρ (cid:80) Ni (cid:54) = a ω ai ≡ k on + ρ W a , and thus K a (cid:39) ¯ J a . (ii) Large k oﬀ . Here, the particles unbind and re-bind many times before ﬁnding the target. The pro-tein density is therefore approximately in steady-state¯ ρ a = k on × [ k oﬀ + W a / ( N − − [26]. Using n i ( t ) (cid:39) ¯ ρ a and proceeding as in (i) leads to K a (cid:39) k on + ¯ ρ a W a .These regimes simplify to k oﬀ (cid:29) k oﬀ (cid:28) f coll so that the genome-wide averaged K a , (cid:104) K a (cid:105) , is unity [26]; We deﬁne the average as (cid:104) X (cid:105) =(1 /N G ) (cid:80) N G i =1 X i , where N G (cid:29) N is the number of nodesfor all chromosomesAfter rescaling, we have K a (cid:39) k on + γ a V a , (6)in which V a = (cid:80) Ni (cid:54) = a v ai is the node strength and γ a = 1 − k on (cid:104) V a (cid:105) = γ , k oﬀ (cid:28) γ a = k on γ k on + V a γ / ( N − , k oﬀ (cid:29) k oﬀ [26], it is easy toimplement, and computationally cheaper than Eqs. (4)and (5). (iii) Intermediate k oﬀ . When k oﬀ ∼

1, we cannot useEq. (6). Instead we must use the exact expressions (4)and (5). In [26] we also treat the case k on = k oﬀ = 0. Protein association rates depend on chromatin’s 3Dorganization

Equation (6) suggests that the association rates changewith chromatin’s 3D structure because K a depends onthe node strength V a . To quantify by how much, weused Hi-C data (40 kb resolution) and calculated K a forchromosomes 1-21 (Fig. 2). We found that K a varies byseveral orders of magnitude relative to the genome-wideaverage (cid:104) K a (cid:105) = 1. Most K a values, however, only deviateby a few percent from the mean: K a = 1 ± . k oﬀ (cid:28) k oﬀ (cid:29) k on grows – for example by increasing the bulk concentra-tion of particles as k on ∝ n bulk . We see this in the datafor small V a where K a (cid:39) k on [26]. We interpret this asif the particles reach the target mostly from the bulk. Inthe other limit, where k on is small, we see that K a ∝ V a .This means that most particles ﬁnd the target via jumpson the network and that the 3D structure is important. Chromatin regions with high association rates areenriched with active genes

Figure 2 shows that the association rate varies acrossthe genome. This is important for regulatory proteins,such as transcription factors, that look for promoters tocontrol transcription. We therefore ask: are promoterregions easier to ﬁnd than non-promoter regions?

FIG. 3: Nodes with many gene starts have higher predictedassociation rates – and thus easier to ﬁnd – than nodes withfew gene starts (in humans). We deﬁne the gene starts asthe Transcription Start Sites (TSSs). The curves representpredicted association rates to nodes with active TSSs (gray),inactive TSSs (green), and any TSS type (pink). The activeTSSs have higher association rates than the genome wide av-erage ( (cid:104) K a (cid:105) = 1, dashed), whereas nodes with inactive TSSs(green) are below (apart from one data point). The symbolsrepresent the average association rate [Eq. (6)] and the col-ored areas show the 95% conﬁdence interval. Parameters (di-mensionless, see [26]): k oﬀ = 0 . k on = 0 .

001 and ρ = 0 . ≤

50 nodes).

To answer this, we downloaded gene annotation datafor human cells [27] to extract the gene starts – deﬁnedas the Transcription Start Sites (TSSs) – and correlatedthem with the predicted association rates from Fig. 2.We found that the rates grow with the number of genestarts per node (Fig. 3, pink). The data points representthe average association rate to all nodes with the samenumber of gene starts and the shaded area shows the 95%conﬁdence interval. In other words: gene-dense regionsare easy to ﬁnd.Then we asked: because these regions harbour bothactive and inactive genes, are active gene-dense regionseasier to ﬁnd than inactive ones? To see this, we groupedthe gene starts into ’active’ and ’inactive’ based on theRNA expression level surrounding each TSS – 1kb up-stream and downstream – and calculated the associationrates to the nodes with these TSSs. We found that nodeswith many active TSSs have even higher association ratesthan if we do not separate active and inactive TSSs: grayis above pink in Fig. 3.For nodes with inactive gene starts we ﬁnd the inverserelationship: green is below pink in Fig. 3. This is un-derscored when comparing the green area to the genome-wide average (cid:104) K a (cid:105) = 1 represented by dashed line: mostinactive TSSs are in regions with association rates that are smaller than unity. This suggests that inactive genestarts are hard to ﬁnd.Figure 3 also shows that the association rate growsslowly beyond one or two TSSs per node: adding a fewextra TSSs does not make the node easier to ﬁnd. How-ever, there is still a positive correlation between the num-ber of gene starts per node and high association rates. Chromatin regions with high RNA expression levelshave high association rates

Figure 3 suggests that transcription factors quicklyﬁnd highly transcribed genes because the association rateis larger for active than inactive TSSs. But what aboutany transcribed region? Are the association rates highalso for them?To study this, we summed the RNA expression in allnodes across the genome and ranked them based on theirRNA expression level. Then we partitioned the nodesinto 20 equally-sized groups and calculated the associa-tion rate in each group. Shown as a violin plot (Fig. 4a),we ﬁnd that our predicted rates vary widely but that themedian (white circles) increases with high RNA expres-sion levels (Spearman’s correlation coeﬃcient = 0.5449[28]). This suggests that nodes with high RNA expres-sion levels – with or without active gene starts – are rel-atively easy to ﬁnd.To see by how much this correlation is caused by ac-tive gene starts, we made two new groups: nodes with atleast one active TSS and the rest – nodes with inactiveor no TSSs. Then, as before, we ranked the nodes inthese large groups based on the RNA expression levels,divided them into 20 equally-sized subgroups, and cal-culated the average association rate for each subgroup.Plotting the predicted average association rate for thetwo large groups versus the average RNA expression levelas well as the average for all nodes (Fig. 4a), we cannotsee any signiﬁcant diﬀerence: all curves nearly lie on topof each other (Fig. 4b). This is a more general resultthan before. It is not only the highly transcribed genestarts that are relatively easy to ﬁnd, it is any region withhigh RNA expression.In addition, as a simple measure of DNA accessibil-ity, we checked how the association rate change with thefraction of base pairs that are transcribed per node [26].Just like for the RNA expression level, we ﬁnd a posi-tive correlation with high association rates (Spearman’scorrelation coeﬃcient = 0.5632).

DISCUSSION AND SUMMARY

Protein-binding experiments show that associationrates change if the binding sites are embedded in a shortor a long piece of DNA [4]. This is partly explained by ab FIG. 4: Nodes that are highly transcribed are easy to ﬁnd.(a) The distribution of K a values for all nodes across thegnome. The nodes are ranked by their RNA expression leveland divided into 20 groups. White circles represent the me-dian and horizontal bars represents the mean. 10 groups lieabove the genome wide average ( (cid:104) K a (cid:105) = 1, dashed line). (b)Predicted association rate as function of RNA expression levelfor nodes with at least one active TSS (grey) and no activeTSSs (purple). The RNA expression is divided into 20 binswith equally many points in each bin (same as in (a)). Thenodes with active TSSs tend to be above the genome wideaverage (18 points is above (cid:104) K a (cid:105) = 1), while most nodes withno active TSSs are below (6 points above (cid:104) K a (cid:105) = 1). Theshaded areas show the 95% conﬁdence interval. Parametersused in both plots: k oﬀ = 0 . k on = 0 .

001 and ρ = 0 . the facilitated-diﬀusion model in which proteins switchbetween 1D search along the DNA, and 3D search – bydiﬀusion – in the surrounding volume. The associationrates also change if the DNA is straight or coiled [9]. Thiscan be understood if the facilitated diﬀusion model in-cludes inter-segmental transfer between loop anchors [13].However, current studies use standard polymer modelsthat do not capture the chromatin’s complex 3D orga- nization in eukaryotes. To remedy this, we used Hi-Cdata as proxy for the 3D proximity between chromatinsegments in vivo . This allowed us to to map the pro-tein search problem onto a network problem with nodesand links representing chromatin segments and how theyare physically connected to each other. Then we for-mulated a master equation for the number of searchingproteins per node, from which we calculated analyticallythe genome-wide association rates in terms of the eigen-values and eigenvectors of the Hi-C matrix. Using humanHi-C data, we compared the predicted association rateswith RNA expression data and positions of gene starts.We found that regions which are easy to ﬁnd – measuredby high association rates – are enriched with active genesand have a generally high level of RNA expression.We assume that the protein ﬁnds the target, for exam-ple a promoter site, as soon as it arrives at the target node– here, 40 kb. This means that we model protein bindingas diﬀusion-limited. However, some transcription fac-tors, such as TetR in mammals [29], are suggested tobe reaction-limited. To include imperfect protein-DNAbinding our model we may follow [30]. Denoting theprotein-DNA binding rate as k DNA , and reinterpretingthe on rate k on as an eﬀective on rate k eﬀ . on , we may write1 /k eﬀ . on = 1 /k on + 1 /k DNA where K a = k eﬀ . on + γ a V a .We calculated the association rates in each chromo-some without considering inter-chromosome connections.This is an assumption as chromosomes do come in physi-cal contact. From Hi-C data, however, it seems like thesecontacts are less frequent than within chromosomes. Thisis a limitation of the data rather than our model that canhandle any genome-wide Hi-C map.Overall, this study provides a framework to predictprotein-binding positions dictated by chromatin contactmaps in the cell nucleus. As such, it opens new waysto interpret binding proﬁles of transcription factors thatcannot be explained by the DNA sequence [1, 31]. Mech-anistic understanding of these proﬁles is important toreach a molecular understanding of gene regulation. ACKNOWLEDGEMENTS

TA and LL acknowledges support from Swedish Re-search Council (grant no: 2014-4305 and 2017-03848).PS acknowledges support from the Knut and Alice Wal-lenberg foundation (grant no. 20140018, to EpiCoN, co-PI: PS). We thank Dr. Rajendra Kumar for his help withgcMapExplorer [32] to analyze the Hi-C data. ∗ Electronic address: [email protected][1] Peggy J Farnham. Insights from genomic proﬁlingof transcription factors.

Nature Reviews Genetics ,10(9):605, 2009. [2] Viktorija Globyte, Seung Hwan Lee, Taegeun Bae, Jin-Soo Kim, and Chirlmin Joo. Crispr/cas9 searches fora protospacer adjacent motif by lateral diﬀusion.

TheEMBO journal , 38(4):e99466, 2019.[3] Petter Hammar, Prune Leroy, Anel Mahmutovic, Erik GMarklund, Otto G Berg, and Johan Elf. The lac repres-sor displays facilitated diﬀusion in living cells.

Science ,336(6088):1595–1598, 2012.[4] Arthur D Riggs, Suzanne Bourgeois, and Melvin Cohn.The lac represser-operator interaction: Iii. kinetic stud-ies.

Journal of molecular biology , 53(3):401–417, 1970.[5] Peter H von Hippel and OG Berg. Facilitated target lo-cation in biological systems.

Journal of Biological Chem-istry , 264(2):675–678, 1989.[6] A B Kolomeisky. Physics of protein–dna interactions:mechanisms of facilitated target search.

Physical Chem-istry Chemical Physics , 13(6):2088–2095, 2011.[7] Leonid Mirny, Michael Slutsky, Zeba Wunderlich,Anahita Tafvizi, Jason Leith, and Andrej Kosmrlj. Howa protein searches for its site on dna: the mechanism offacilitated diﬀusion.

Journal of Physics A: Mathematicaland Theoretical , 42(43):434013, 2009.[8] Johan Elf, Gene-Wei Li, and X Sunney Xie. Probingtranscription factor dynamics at the single-molecule levelin a living cell.

Science , 316(5828):1191–1194, 2007.[9] Bram van den Broek, Michael Andersen Lomholt, S-MJKalisch, Ralf Metzler, and Gijs JL Wuite. How dna coil-ing enhances target localization by proteins.

Proceed-ings of the National Academy of Sciences , 105(41):15738–15742, 2008.[10] Assaf Amitai. Chromatin conﬁguration aﬀects the dy-namics and distribution of a transiently interacting pro-tein.

Biophysical journal , 114(4):766–771, 2018.[11] Maximilian Bauer and Ralf Metzler. In vivo facilitateddiﬀusion model.

PloS one , 8(1):e53956, 2013.[12] Tao Hu, A Yu Grosberg, and BI Shklovskii. How pro-teins search for their speciﬁc sites on dna: the role ofdna conformation.

Biophysical journal , 90(8):2731–2744,2006.[13] Michael A Lomholt, Bram van den Broek, Svenja-Marei JKalisch, Gijs JL Wuite, and Ralf Metzler. Facilitateddiﬀusion with dna coiling.

Proceedings of the NationalAcademy of Sciences , 106(20):8204–8208, 2009.[14] Michael A Lomholt, Tobias Ambj¨ornsson, and Ralf Met-zler. Optimal target search on a fast-folding polymerchain with volume exchange.

Physical review letters ,95(26):260603, 2005.[15] Erez Lieberman-Aiden, Nynke L van Berkum, LouiseWilliams, Maxim Imakaev, Tobias Ragoczy, AgnesTelling, Ido Amit, Bryan R Lajoie, Peter J Sabo,Michael O Dorschner, et al. Comprehensive mapping oflong-range interactions reveals folding principles of thehuman genome. science , 326(5950):289–293, 2009.[16] Siyuan Kong and Yubo Zhang. Deciphering hi-c: from 3dgenome to function.

Cell biology and toxicology , 35(1):15–32, 2019.[17] Suhas SP Rao, Miriam H Huntley, Neva C Durand,Elena K Stamenova, Ivan D Bochkov, James T Robinson,Adrian L Sanborn, Ido Machol, Arina D Omer, Eric SLander, et al. A 3d map of the human genome at kilo-base resolution reveals principles of chromatin looping.

Cell , 159(7):1665–1680, 2014.[18] Jan Krefting, Miguel A Andrade-Navarro, and Jonas Ibn-Salem. Evolutionary stability of topologically associat- ing domains is associated with conserved gene regulation.

BMC biology , 16(1):87, 2018.[19] Jesse R Dixon, Siddarth Selvaraj, Feng Yue, AudreyKim, Yan Li, Yin Shen, Ming Hu, Jun S Liu, and BingRen. Topological domains in mammalian genomes iden-tiﬁed by analysis of chromatin interactions.

Nature ,485(7398):376, 2012.[20] Daniel Jost, Pascal Carrivain, Giacomo Cavalli, andC´edric Vaillant. Modeling epigenome folding: formationand dynamics of topologically associated chromatin do-mains.

Nucleic acids research , 42(15):9553–9561, 2014.[21] Sang Hoon Lee, Yeonghoon Kim, Sungmin Lee, XavierDurang, Per Stenberg, Jae-Hyung Jeon, and LudvigLizana. Mapping the spectrum of 3d communities in hu-man chromosome conformation capture data.

Scientiﬁcreports , 9(1):6859, 2019.[22] Michele Di Pierro, Ryan R Cheng, Erez LiebermanAiden, Peter G Wolynes, and Jos´e N Onuchic. De novoprediction of human chromosome structures: Epigeneticmarking patterns encode genome architecture.

Proceed-ings of the National Academy of Sciences , 114(46):12126–12131, 2017.[23] Fran¸cois Serra, Marco Di Stefano, Yannick G Spill,Yasmina Cuartero, Michael Goodstadt, Davide Ba`u,and Marc A Marti-Renom. Restraint-based three-dimensional modeling of genomes and genomic domains.

FEBS letters , 589(20PartA):2987–2995, 2015.[24] Jan Smrek and Alexander Y Grosberg. Facilitated diﬀu-sion of proteins through crumpled fractal dna globules.

Physical Review E , 92(1):012702, 2015.[25] Igor M Sokolov, Ralf Metzler, Kiran Pant, and Mark CWilliams. First passage time of n excluded-volume par-ticles on a line.

Physical Review E , page 041102, 2005.[26] Seee supplementary material

American journal of Psy-chology , 15(1):72–101, 1904.[29] Davide Normanno, Lydia Boudarene, Claire Dugast-Darzacq, Jiji Chen, Christian Richter, Florence Proux,Olivier B´enichou, Rapha¨el Voituriez, Xavier Darzacq,and Maxime Dahan. Probing the target search of dna-binding proteins in mammalian cells using tetr as modelsearcher.

Nature communications , 6:7357, 2015.[30] Otto G Berg, Anel Mahmutovic, Emil Marklund, andJohan Elf. The helical structure of dna facilitates bind-ing.

Journal of Physics A: Mathematical and Theoretical ,49(36):364002, 2016.[31] Dominic Schmidt, Michael D Wilson, Benoit Ballester,Petra C Schwalie, Gordon D Brown, Aileen Mar-shall, Claudia Kutter, Stephen Watt, Celia P Martinez-Jimenez, Sarah Mackay, et al. Five-vertebrate chip-seqreveals the evolutionary dynamics of transcription factorbinding.

Science , 328(5981):1036–1040, 2010.[32] Rajendra Kumar, Haitham Sobhy, Per Stenberg, andLudvig Lizana. Genome contact map explorer: a plat-form for the comparison, interactive visualization andanalysis of genome contact maps.

Nucleic acids research ,45(17):e152–e152, 2017.[33] Diﬀusion-limited search time, τ = 4 πaD/V where D =0 . − . µ m /s, a = 5 bp (=1.3 nm), and V = 1 µ m . These values give τ = 2 . − . > × base pairs [15] SUPPLEMENTARY MATERIALParticle ﬂux through the target

The number of proteins that reached the target up totime t is J a ( t ). For non-zero k on and k oﬀ it reads J a ( t ) = (cid:88) j ω aj (cid:88) i V ji (cid:26) k i on k oﬀ − λ i (cid:20) t − − e − t ( k off − λ i ) k oﬀ − λ i (cid:21) + 1 − e − t ( k off − λ i ) k oﬀ − λ i ρ (cid:88) l (cid:54) = a V − il  + k on t. (9)where λ j and V ij are the eigenvalues and of eigenvec-tors ω ij . Because λ = 0 is the largest eigenvalue, wecan approximate Eq. (9) at times t (cid:29) k − with termsproportional to tJ a ( t ) (cid:39) (cid:0) k on + T − a (cid:1) t, T − a = (cid:88) j ω aj ¯ n j , (10)where the steady-state distribution is¯ n j = (cid:88) i V ji k i on k oﬀ − λ i , (11)The relation J a ( t ) (cid:39) ( k on + T − a ) t coincides withthe continuum approach in [14] for proteins that com-bines bulk excursions with 1D sliding (jumping to near-est neighbours in our model) and L´evy relocation’s withjump lengths x distributed like (cid:39) | x | − − α (0 < α < ω ij (cid:39) | i − j | − − α with 0 < α < Particle ﬂux through the target without bulkexchange

Here we investigate the case when proteins do not un-bind from the DNA. As k on , k oﬀ →

0, Eq. (9) becomes J a ( t ) = N (cid:88) k =1 ω ak N (cid:88) i =2 V ki | λ i | (1 − e − t | λ i | ) (cid:88) j (cid:54) = a ρ V − ij = N p − N (cid:88) k =1 ω ak N (cid:88) i =2 V ki | λ i | e − t | λ i | (cid:88) j (cid:54) = a ρ V − ij , (12)with N p = ρ ( N − J a ( t → ∞ ) = N p since by then all proteins have arrivedto the target. This leads to the simpliﬁcation in the 2ndrow. For large times t (cid:28) | λ N | − – λ N is the largesteigenvalue (in magnitude) – where J a ( t ) (cid:28) N p , we ﬁndthe same behaviour as before, J a ( t ) ∝ t . This is seen byexpanding Eq. (12) around t = 0. Derivation of Eq. (6) in the manuscriptFast target ﬁnding ( k oﬀ (cid:28) ) When the unbinding rate k oﬀ is small compared to theassociation rate K a , the number of proteins per node isclose to its initial value ρ by the time of the ﬁrst arrivalto the target, and we have the approximation K a = k on + ρ W a , (13)where W a = (cid:80) i (cid:54) = a ω ai . We may ﬁnd this approximationby expanding Eq. (9) around t = 0 and using the inversetransformation (cid:80) j V ij q j (0) = n i (0) = ρ (1 − δ ia ).Furthermore, by demanding that the genome wide av-erage of K a in Eq. (13) is unity, (cid:104) K a (cid:105) = 1, and usingthat ω ai = f coll v ai we ﬁnd f coll to be f coll = 1 − k on ρ (cid:104) V a (cid:105) . (14)Using this in Eq. (13) leads to K a = k on + (1 − k on ) V a (cid:104) V a (cid:105) , k oﬀ (cid:28) . (15)With this deﬁnition of f coll , the binding rate k on is boundby [0 , k on = 1 corresponds to target-ﬁndingdirectly from the bulk. Target ﬁnding in steady-state ( k oﬀ (cid:29) ) When the unbinding rate k oﬀ is large compared to theassociation rate K a , few proteins will ﬁnd the target be-fore leaving on a bulk excursion. In this limit, the systemreaches its steady-state – with ¯ ρ a the number of proteinsper node – before the ﬁrst arrival to the target. Thisleads to the approximation K a = k on + ¯ ρ a W a . (16)To arrive at this equation we identify in Eq. (10) that J ( t ) (cid:39) K a t . Then we replace the ¯ n j by the approximatedensity ¯ ρ a that we ﬁnd ¯ ρ a by the following argument. Insteady state, proteins bind to the DNA with rate k on .Except for the absorbing target, there are N − ρ a ( N −

1) numberof proteins that unbind from the DNA with rate k oﬀ .Last, proteins are absorbed at the target with rate T − a =¯ ρ a W a . These three terms sum to zero, and therefore¯ ρ a = k on k oﬀ + W a / ( N − . (17)Using that W a = f coll V a with f coll from Eq. (14) gives K a = k on + k on (1 − k on ) k on + − k on N − V a (cid:104) V a (cid:105) V a (cid:104) V a (cid:105) , k oﬀ (cid:29) . (18) Validation of approximations

To better understand the validity of Eqs. (15) and(18), we compare them to the exact association rate K exact a = (cid:18)(cid:90) ∞ exp( − J a ( t )) dt (cid:19) − . (19)Figure 5 a shows how the association rate changes fora speciﬁc target node – we choose a = N/ k oﬀ while keeping theon-rate ﬁxed, k on = 0 . ρ = k on /k oﬀ . The solid grey line shows K exact a and thehorizontal lines represent the approximations for smalland large k oﬀ – Eqs. (15) and (18).The blue area in Fig. 5 a shows the large- K a regime( K a > k oﬀ ). Here, Eq. (15) deviates only a few per-cent from K exact a : the deviation is 2 . ≈ − K a /K exact a )at the encircled green dot. To get this number, we used k oﬀ = 0 . k on = 0 .

001 and ρ = 0 . K a ( K a < k oﬀ / K exact a . At the red dot ( k oﬀ = 2)the relative error is 5.7%.In the intermediate region (white area), we cannot usethe simple expressions because the ﬂux J ( t ) has a com-plicated time-dependence. To get the association rate inthis regime, we have to evaluate Eq. (19) directly.In Figs. 5 b and 5 c , we calculate the association ratefor all nodes in chromosome 21 using Eqs. (19), (15) and(18) with ﬁxed parameters (shown in the ﬁgures). Theﬁgure shows the limiting k oﬀ cases. In 5 b the unbindingrate is small ( k oﬀ = 0 . K exact a whereas Eq. (18)does not. Equation (18) that matches better in 5 c wherethe unbinding rate is large ( k oﬀ = 2). Genome-wide association rates when k oﬀ (cid:29) Similar to Fig. 2 in the main text, we show the asso-ciation rate for all targets in every chromosome at large k oﬀ . We calculate K a from Eq. (18), see Fig. 6. Thesecurves for k oﬀ (cid:29) k oﬀ (cid:28) y -axis. Genome-wide association rate as a function of nodestrength

In Fig. 7 a we show K a – calculated from Eq. (15)( k oﬀ (cid:28)

1) – varies with node strength V a for our dif-ferent values of k on with ﬁxed ρ = 0 . abc FIG. 5: Comparison of the approximations to the exact, time-integrated association rate, using chromosome 21 at the res-olution 160 kb. a In the blue area where k oﬀ (cid:28) k oﬀ (cid:29)

1, the sys-tem is close to its steady state and target ﬁnding is slower.Target at N/

2, in the middle of the system. b Associationrates to all targets when k oﬀ (cid:28) c Association rates to alltargets when k oﬀ (cid:29)

1. Note the green and red circled dots in b and c , they correspond to the same parameter values as in a , respectively. FIG. 6: Genome-wide association rates K a at 40 kb resolution evaluated using Equation (18). The association rates varyby several orders of magnitude along the chromosomes but the 95% conﬁdence interval of K a varies only by a few percent(0 . ± . k oﬀ = 2, k on = 0 .

001 and ρ = 0 . plot the analytical prediction Eq. (15). We ﬁnd thatthe search times are dominated by k on for weakly con-nected nodes. For strongly connected ndoes, we ﬁnd theuniversal behavior K a ∝ V a .In Fig. 7 b we show K a for all nodes in the other limit k oﬀ (cid:29)

1. Here K a depends on the number of nodes N – via ρ a in Eq. (18) – and therefore we do not expect auniversal large- V a behaviour Fraction of transcribed DNA

We use RNA expression data (downloaded from EN-CODE) to calculate the fraction of transcribed DNA.This is calculated in the following way. For every 40kb region i across the genome, we count the number ofbase pairs n i that has an RNA expression level abovezero. The fraction of transcribed DNA in region i is thus n i / K a and fraction of transcribed DNAis slightly stronger than for RNA expression level (Spear-man correlation coeﬃcient = 0.5632). We point out thatthe fraction of transcribed DNA and the RNA expressioncorrelate strongly: Spearman’s correlation coeﬃcient is0.9693. Contact probability

The average contact probability decays with distancefrom the diagonal. To see this, we calculated p ij = (cid:104) (cid:80) N − j =1 ω j,j +1 / ( N − j ) (cid:105) , where (cid:104) ... (cid:105) denotes genome-wideaverage. We ﬁnd that p ij ∼ | i − j | − α where there arethree regimes with diﬀerent α (Fig. 9). Since the chro-mosomes has diﬀerent sizes, the regimes appear at diﬀer-ent length scales, but the cross-overs are roughly at 500kb and 5,000 kb. We used Hi-C matrices ω ij = f coll V ai ,where f coll is given in equation (14) with k on = 0 .

001 and ρ = 0 . ab FIG. 7: The association rate vs the nodes’ strength V a (sumof all link weights). a All data points follow the universallaw K a = k on + (1 − k on ) V a / (cid:104) V a (cid:105) , see Eq. (15). The density ρ = 0 . k on . b Association rate in during steady state ( k oﬀ (cid:29)

1) calculatedfrom Eq.(18). The behavior is not universal as it dependson the size N of each chromosome (number of nodes). Thedotted line evaluated analytically – as N we picked the meanchromosome size. FIG. 8: Association rate increases with the fraction of tran-scribed DNA. This is similar to the RNA expression levelwhich also increases with the association rate.FIG. 9: The average contact probability (genome wide) de-cays as d − α with distance d from the diagonal. At d = 500kb and d = 5000 kb, α changes value. The black line is is theaveraged data from ω ij , and green, pink and purple are ﬁts. Transcription start sites (TSSs)

To distinguish between active and inactive TSSs weuse RNA expression data (downloaded from ENCODE)and calculated the average number of RNA reads perbase pair, ¯ n RNA , 1kb upstream and 1kb downstream ofeach TSS. We deﬁned an active TSS as when ¯ n RNA ≥ K a changes with the number of TSSs per node (Fig.10). The plots show three TSS groups: active, inactive, and all. The genome-wide average of all these plots isFig. 3 in the main text. ∗ Electronic address: [email protected][1] Peggy J Farnham. Insights from genomic proﬁlingof transcription factors.

Nature Reviews Genetics ,10(9):605, 2009.[2] Viktorija Globyte, Seung Hwan Lee, Taegeun Bae, Jin-Soo Kim, and Chirlmin Joo. Crispr/cas9 searches fora protospacer adjacent motif by lateral diﬀusion.