Modelling Protein Target-Search in Human Chromosomes
Markus Nyberg, Tobias Ambjörnsson, Per Stenberg, and Ludvig Lizana
MModelling Protein Target-Search in Human Chromosomes
Markus Nyberg, Tobias Ambj¨ornsson, Per Stenberg, and Ludvig Lizana ∗ Integrated Science Lab, Department of Physics, Ume˚a University, SE-901 87 Ume˚a, Sweden Department of Astronomy and Theoretical Physics, Lund University, SE-223 62 Lund, Sweden Department of Ecology and Environmental Science, Ume˚a University, SE-901 87 Ume˚a, Sweden (Dated: September 23, 2019)Several processes in the cell, such as gene regulation, start when key proteins recognise and bindto short DNA sequences. However, as these sequences can be hundreds of million times shorterthan the genome, they are hard to find by simple diffusion: diffusion-limited association rates mayunderestimate in vitro measurements up to several orders of magnitude. Moreover, the rates increaseif the DNA is coiled rather than straight. Here we model how this works in vivo in mammalian cells.We use chromatin-chromatin contact data from state-of-the-art Hi-C experiments to map the proteintarget-search onto a network problem. The nodes represent a DNA segment and the weight of thelinks is proportional to measured contact probabilities. We then put forward a master equation forthe density of searching protein that allows us to calculate the association rates across the genomeanalytically. For segments where the rates are high, we find that they are enriched with active genesand have high RNA expression levels. This paper suggests that the DNA’s 3D conformation isimportant for protein search times in vivo and offers a method to interpret protein-binding profilesin eukaryotes that cannot be explained by the DNA sequence itself.
Several processes in the cell nucleus start when pro-teins bind to specific DNA sequences. For exam-ple, transcription factors that regulate genes and theCRISPR/CAS9 complex that edits DNA [1, 2]. Becausetarget sequences are much shorter than the genome – afew base pairs compared to billions in humans – theseproteins face a needle-in-a-haystack problem.Despite the large number of potential targets, mea-sured search times are shorter than theoretical estimates.The Lac repressor in
E. coli needs 1-5 min to find itsdesignated site [3] which is twice as fast as a three-dimensional (3D) search by diffusion inside the bac-terium’s volume ( ≈ in vitro measurements by one to two orders ofmagnitude [4]. These examples suggest that some pro-teins search by other mechanisms than simple diffusion.One mechanism that speeds up diffusive search is of-fered by the Facilitated-diffusion model [5]. In thismodel, the proteins alternate between 3D diffusion and1D diffusion along the DNA. This decreases the searchtime because the proteins may take shortcuts to a linearlydistant DNA segment through the surrounding bulk. Al-though criticized [6, 7], the model is widely accepted afterexperiments in bacteria [3, 8] and in vitro [9].Another important aspect in target finding is rebind-ing. This is because proteins likely bind to a DNA seg-ment that is close by in 3D rather than far away. Sev-eral modelling studies examined this aspect and foundthat search times change with DNA conformation [7, 9–14]. However, because these studies treat the DNA as asimple polymer the results cannot be generalized beyondbacteria to eukaryotes that have longer DNA and morecomplex 3D organization.The most widely used experimental method to study 3D genome organization is Hi-C [15, 16]. The Hi-Cmethod cross-links close by DNA fragments inside the nu-cleus and gives a genome-wide map of the number of con-tacts between fragments pairs (Fig. 1a) [17]. MamalianHi-C maps have several interesting features – some thatare evolutionary conserved [18]. For example, the block–like structure along the diagonal represents densely con-nected 3D domains. The locations of these domains cor-relate with protein binding sites, active genes, and chro-matin states [19–21]. And, beyond the domains, the av-erage contact probability decays as a power-law [34].Hi-C is the state-of-the-art Chromosome ConformationCapture method that estimates the chromatin contactprobabilities across the genome. But it does not providechromatin’s 3D structure. Going from the contact mapto a computer-generated 3D structure is difficult [22, 23].Because the chromatin’s spatial organization is so com-plex, there are few attempts to model protein search ineukaryotes. One exception [24], represents chromatin asa crumpled polymer globule that reproduces the averagelooping probabilities in the human genome. However, thecrumpled globule lacks the Hi-C maps’ domain structure.Here we model target search in eukaryotes without re-lying on chromatin’s explicit 3D structure. Instead, werepresent the DNA as a network in which the nodes areDNA segments and the link weights are proportional tothe contact probabilities measured in Hi-C. Then we putforward a master equation for the density of proteins onthe network over time that allows us to calculate the as-sociation rate – the inverse mean-first passage time – toall nodes analytically. Correlating these rates with RNAexpression data in humans, we find that easy-to-find ge-nomic regions are enriched with active genes and havehigh RNA expression. a r X i v : . [ q - b i o . S C ] S e p k on k o↵ i j ! ij ab c FIG. 1: Modelling protein search on DNA as search on a weighted network. (a) Hi-C map: the number of pairwise physicalcontacts between 10 kilo-basepair (kb) DNA fragments in a part of human chromosome 21. Dark pixels indicate many contacts.(b) Schematic representation of the model. The key parameters are: jump rates between nodes i and j ( ω ij ), unbinding rateto the bulk ( k off ), and rebinding rate ( k on ). Red circles represent searching proteins. (c) Coarsed network representation of theHi-C map in (a) where each node represents a 160 kb fragment. The link weights v ij are proportional to the number of Hi-Ccontacts. We assume that ω ij ∝ v ij . Node numbering refers to positions along the DNA. THE MODEL
We model the proteins’ search on chromatin as non-interacting particles that move between nodes in aweighted network that represent physically connectedchromatin segments (Fig. 1).The model has three important parameters. First, thejump rate ω ij between nodes i and j ( i, j = 1 , . . . , N ).We assume that ω ij is proportional to the number ofcontacts between segments i and j , v ij , measured in Hi-C: ω ij = f coll v ij ; f coll – a free parameter in the model –is the collision frequency that leads to a successful jump.Second, the unbinding rate k off to the surrounding bulk.And third, the rebinding rate ¯ k on to a randomly chosennode. We assume that the protein bulk concentration n bulk is constant and therefore use k on = ¯ k on n bulk ; k on and k off therefore have the same unit: time − .In terms of these parameters, we formulate a masterequation for the protein number in node i at time t , n i ( t ): dn i ( t ) dt = N (cid:88) j =1 ω ij n j ( t ) − k off n i ( t ) + k on (1)The first term represents diffusion on the network – weput ω jj = − (cid:80) i (cid:54) = j ω ij – and the two remaining termsdescribe the exchange with the surrounding bulk.As in [14], we assume that the protein density is ini-tially uniform ρ = k on /k off except for the target i = a which always is zero n a ( t ) = 0, that is n i (0) = ρ (1 − δ ia ). In terms of the eigenvalues λ j and eigenvectors V ij of ω ij , the solution to Eq. (1) is n i ( t ) = (cid:80) Nj =1 V ij (cid:26) k j on k off − λ j (cid:2) − e − ( k off − λ j ) t (cid:3) + e − ( k off − λ j ) t ρ (cid:80) l (cid:54) = a V − jl (cid:27) . (2)where k i on = k on (cid:80) Nj =1 V − ij . This equation is key to cal-culate the association rate to the target node i = a . PROTEIN ASSOCIATION RATES
To calculate the association rate K a , we use that itis one over the mean first arrival time: K a = τ − a . Toobtain τ a , we proceed as in [14]. First we calculate thetotal number of particles that arrived to the target up totime t : J a ( t ) = (cid:82) t j a ( t (cid:48) ) dt (cid:48) where j a ( t ) = (cid:88) i ω ai n i ( t ) + k on (3)is the particle flux. Second, if N p is the initial proteinnumber on the network, and if k on = k off = 0, then J a ( t ) /N p is the probability that a single protein havereached the target up to time t . The probability thatthe target has not been reached – the target’s survivalprobability – is therefore S a ( t ) = (1 − J a ( t ) /N p ) N p and τ a = (cid:82) ∞ S a ( t ) dt . Generalizing this argument for k on > FIG. 2: Predicted genome-wide association rates K a for chromosomes 1-21. K a vary by several orders of magnitude along thechromosomes but the 95% confidence interval is a few percent of the mean (cid:104) K a (cid:105) = 1 ( K a = 1 . ± . K a from Eq. (6) ( k off (cid:29)
1) using the parameters k off = 0 . k on = 0 .
001 and ρ = 0 . the number of proteins N p → ∞ and therefore S a ( t ) (cid:39) exp ( − J a ( t )) [25]. Note that if k on = 0 and k off > τ a to diverge.Finally, using n i ( t ) from Eq. (2), we may calculate K a exactly as: K a = 1 (cid:82) ∞ exp ( − J a ( t )) dt , (4) J a ( t ) = k on t + (cid:88) i (cid:54) = a ω ai (cid:90) t n i ( t (cid:48) ) dt (cid:48) , (5) Limiting cases for K a Depending on the unbinding rate k off , K a has threeregimes. (i) Small k off . In this regime most particles find thetarget before they unbind. This leaves the initial den-sity ρ approximately unchanged, n i ( t ) (cid:39) ρ . Using thisapproximation in Eq. (5) leads to J a ( t ) (cid:39) ¯ J a t , where¯ J a = k on + ρ (cid:80) Ni (cid:54) = a ω ai ≡ k on + ρ W a , and thus K a (cid:39) ¯ J a . (ii) Large k off . Here, the particles unbind and re-bind many times before finding the target. The pro-tein density is therefore approximately in steady-state¯ ρ a = k on × [ k off + W a / ( N − − [26]. Using n i ( t ) (cid:39) ¯ ρ a and proceeding as in (i) leads to K a (cid:39) k on + ¯ ρ a W a .These regimes simplify to k off (cid:29) k off (cid:28) f coll so that the genome-wide averaged K a , (cid:104) K a (cid:105) , is unity [26]; We define the average as (cid:104) X (cid:105) =(1 /N G ) (cid:80) N G i =1 X i , where N G (cid:29) N is the number of nodesfor all chromosomesAfter rescaling, we have K a (cid:39) k on + γ a V a , (6)in which V a = (cid:80) Ni (cid:54) = a v ai is the node strength and γ a = 1 − k on (cid:104) V a (cid:105) = γ , k off (cid:28) γ a = k on γ k on + V a γ / ( N − , k off (cid:29) k off [26], it is easy toimplement, and computationally cheaper than Eqs. (4)and (5). (iii) Intermediate k off . When k off ∼
1, we cannot useEq. (6). Instead we must use the exact expressions (4)and (5). In [26] we also treat the case k on = k off = 0. Protein association rates depend on chromatin’s 3Dorganization
Equation (6) suggests that the association rates changewith chromatin’s 3D structure because K a depends onthe node strength V a . To quantify by how much, weused Hi-C data (40 kb resolution) and calculated K a forchromosomes 1-21 (Fig. 2). We found that K a varies byseveral orders of magnitude relative to the genome-wideaverage (cid:104) K a (cid:105) = 1. Most K a values, however, only deviateby a few percent from the mean: K a = 1 ± . k off (cid:28) k off (cid:29) k on grows – for example by increasing the bulk concentra-tion of particles as k on ∝ n bulk . We see this in the datafor small V a where K a (cid:39) k on [26]. We interpret this asif the particles reach the target mostly from the bulk. Inthe other limit, where k on is small, we see that K a ∝ V a .This means that most particles find the target via jumpson the network and that the 3D structure is important. Chromatin regions with high association rates areenriched with active genes
Figure 2 shows that the association rate varies acrossthe genome. This is important for regulatory proteins,such as transcription factors, that look for promoters tocontrol transcription. We therefore ask: are promoterregions easier to find than non-promoter regions?
FIG. 3: Nodes with many gene starts have higher predictedassociation rates – and thus easier to find – than nodes withfew gene starts (in humans). We define the gene starts asthe Transcription Start Sites (TSSs). The curves representpredicted association rates to nodes with active TSSs (gray),inactive TSSs (green), and any TSS type (pink). The activeTSSs have higher association rates than the genome wide av-erage ( (cid:104) K a (cid:105) = 1, dashed), whereas nodes with inactive TSSs(green) are below (apart from one data point). The symbolsrepresent the average association rate [Eq. (6)] and the col-ored areas show the 95% confidence interval. Parameters (di-mensionless, see [26]): k off = 0 . k on = 0 .
001 and ρ = 0 . ≤
50 nodes).
To answer this, we downloaded gene annotation datafor human cells [27] to extract the gene starts – definedas the Transcription Start Sites (TSSs) – and correlatedthem with the predicted association rates from Fig. 2.We found that the rates grow with the number of genestarts per node (Fig. 3, pink). The data points representthe average association rate to all nodes with the samenumber of gene starts and the shaded area shows the 95%confidence interval. In other words: gene-dense regionsare easy to find.Then we asked: because these regions harbour bothactive and inactive genes, are active gene-dense regionseasier to find than inactive ones? To see this, we groupedthe gene starts into ’active’ and ’inactive’ based on theRNA expression level surrounding each TSS – 1kb up-stream and downstream – and calculated the associationrates to the nodes with these TSSs. We found that nodeswith many active TSSs have even higher association ratesthan if we do not separate active and inactive TSSs: grayis above pink in Fig. 3.For nodes with inactive gene starts we find the inverserelationship: green is below pink in Fig. 3. This is un-derscored when comparing the green area to the genome-wide average (cid:104) K a (cid:105) = 1 represented by dashed line: mostinactive TSSs are in regions with association rates that are smaller than unity. This suggests that inactive genestarts are hard to find.Figure 3 also shows that the association rate growsslowly beyond one or two TSSs per node: adding a fewextra TSSs does not make the node easier to find. How-ever, there is still a positive correlation between the num-ber of gene starts per node and high association rates. Chromatin regions with high RNA expression levelshave high association rates
Figure 3 suggests that transcription factors quicklyfind highly transcribed genes because the association rateis larger for active than inactive TSSs. But what aboutany transcribed region? Are the association rates highalso for them?To study this, we summed the RNA expression in allnodes across the genome and ranked them based on theirRNA expression level. Then we partitioned the nodesinto 20 equally-sized groups and calculated the associa-tion rate in each group. Shown as a violin plot (Fig. 4a),we find that our predicted rates vary widely but that themedian (white circles) increases with high RNA expres-sion levels (Spearman’s correlation coefficient = 0.5449[28]). This suggests that nodes with high RNA expres-sion levels – with or without active gene starts – are rel-atively easy to find.To see by how much this correlation is caused by ac-tive gene starts, we made two new groups: nodes with atleast one active TSS and the rest – nodes with inactiveor no TSSs. Then, as before, we ranked the nodes inthese large groups based on the RNA expression levels,divided them into 20 equally-sized subgroups, and cal-culated the average association rate for each subgroup.Plotting the predicted average association rate for thetwo large groups versus the average RNA expression levelas well as the average for all nodes (Fig. 4a), we cannotsee any significant difference: all curves nearly lie on topof each other (Fig. 4b). This is a more general resultthan before. It is not only the highly transcribed genestarts that are relatively easy to find, it is any region withhigh RNA expression.In addition, as a simple measure of DNA accessibil-ity, we checked how the association rate change with thefraction of base pairs that are transcribed per node [26].Just like for the RNA expression level, we find a posi-tive correlation with high association rates (Spearman’scorrelation coefficient = 0.5632).
DISCUSSION AND SUMMARY
Protein-binding experiments show that associationrates change if the binding sites are embedded in a shortor a long piece of DNA [4]. This is partly explained by ab FIG. 4: Nodes that are highly transcribed are easy to find.(a) The distribution of K a values for all nodes across thegnome. The nodes are ranked by their RNA expression leveland divided into 20 groups. White circles represent the me-dian and horizontal bars represents the mean. 10 groups lieabove the genome wide average ( (cid:104) K a (cid:105) = 1, dashed line). (b)Predicted association rate as function of RNA expression levelfor nodes with at least one active TSS (grey) and no activeTSSs (purple). The RNA expression is divided into 20 binswith equally many points in each bin (same as in (a)). Thenodes with active TSSs tend to be above the genome wideaverage (18 points is above (cid:104) K a (cid:105) = 1), while most nodes withno active TSSs are below (6 points above (cid:104) K a (cid:105) = 1). Theshaded areas show the 95% confidence interval. Parametersused in both plots: k off = 0 . k on = 0 .
001 and ρ = 0 . the facilitated-diffusion model in which proteins switchbetween 1D search along the DNA, and 3D search – bydiffusion – in the surrounding volume. The associationrates also change if the DNA is straight or coiled [9]. Thiscan be understood if the facilitated diffusion model in-cludes inter-segmental transfer between loop anchors [13].However, current studies use standard polymer modelsthat do not capture the chromatin’s complex 3D orga- nization in eukaryotes. To remedy this, we used Hi-Cdata as proxy for the 3D proximity between chromatinsegments in vivo . This allowed us to to map the pro-tein search problem onto a network problem with nodesand links representing chromatin segments and how theyare physically connected to each other. Then we for-mulated a master equation for the number of searchingproteins per node, from which we calculated analyticallythe genome-wide association rates in terms of the eigen-values and eigenvectors of the Hi-C matrix. Using humanHi-C data, we compared the predicted association rateswith RNA expression data and positions of gene starts.We found that regions which are easy to find – measuredby high association rates – are enriched with active genesand have a generally high level of RNA expression.We assume that the protein finds the target, for exam-ple a promoter site, as soon as it arrives at the target node– here, 40 kb. This means that we model protein bindingas diffusion-limited. However, some transcription fac-tors, such as TetR in mammals [29], are suggested tobe reaction-limited. To include imperfect protein-DNAbinding our model we may follow [30]. Denoting theprotein-DNA binding rate as k DNA , and reinterpretingthe on rate k on as an effective on rate k eff . on , we may write1 /k eff . on = 1 /k on + 1 /k DNA where K a = k eff . on + γ a V a .We calculated the association rates in each chromo-some without considering inter-chromosome connections.This is an assumption as chromosomes do come in physi-cal contact. From Hi-C data, however, it seems like thesecontacts are less frequent than within chromosomes. Thisis a limitation of the data rather than our model that canhandle any genome-wide Hi-C map.Overall, this study provides a framework to predictprotein-binding positions dictated by chromatin contactmaps in the cell nucleus. As such, it opens new waysto interpret binding profiles of transcription factors thatcannot be explained by the DNA sequence [1, 31]. Mech-anistic understanding of these profiles is important toreach a molecular understanding of gene regulation. ACKNOWLEDGEMENTS
TA and LL acknowledges support from Swedish Re-search Council (grant no: 2014-4305 and 2017-03848).PS acknowledges support from the Knut and Alice Wal-lenberg foundation (grant no. 20140018, to EpiCoN, co-PI: PS). We thank Dr. Rajendra Kumar for his help withgcMapExplorer [32] to analyze the Hi-C data. ∗ Electronic address: [email protected][1] Peggy J Farnham. Insights from genomic profilingof transcription factors.
Nature Reviews Genetics ,10(9):605, 2009. [2] Viktorija Globyte, Seung Hwan Lee, Taegeun Bae, Jin-Soo Kim, and Chirlmin Joo. Crispr/cas9 searches fora protospacer adjacent motif by lateral diffusion.
TheEMBO journal , 38(4):e99466, 2019.[3] Petter Hammar, Prune Leroy, Anel Mahmutovic, Erik GMarklund, Otto G Berg, and Johan Elf. The lac repres-sor displays facilitated diffusion in living cells.
Science ,336(6088):1595–1598, 2012.[4] Arthur D Riggs, Suzanne Bourgeois, and Melvin Cohn.The lac represser-operator interaction: Iii. kinetic stud-ies.
Journal of molecular biology , 53(3):401–417, 1970.[5] Peter H von Hippel and OG Berg. Facilitated target lo-cation in biological systems.
Journal of Biological Chem-istry , 264(2):675–678, 1989.[6] A B Kolomeisky. Physics of protein–dna interactions:mechanisms of facilitated target search.
Physical Chem-istry Chemical Physics , 13(6):2088–2095, 2011.[7] Leonid Mirny, Michael Slutsky, Zeba Wunderlich,Anahita Tafvizi, Jason Leith, and Andrej Kosmrlj. Howa protein searches for its site on dna: the mechanism offacilitated diffusion.
Journal of Physics A: Mathematicaland Theoretical , 42(43):434013, 2009.[8] Johan Elf, Gene-Wei Li, and X Sunney Xie. Probingtranscription factor dynamics at the single-molecule levelin a living cell.
Science , 316(5828):1191–1194, 2007.[9] Bram van den Broek, Michael Andersen Lomholt, S-MJKalisch, Ralf Metzler, and Gijs JL Wuite. How dna coil-ing enhances target localization by proteins.
Proceed-ings of the National Academy of Sciences , 105(41):15738–15742, 2008.[10] Assaf Amitai. Chromatin configuration affects the dy-namics and distribution of a transiently interacting pro-tein.
Biophysical journal , 114(4):766–771, 2018.[11] Maximilian Bauer and Ralf Metzler. In vivo facilitateddiffusion model.
PloS one , 8(1):e53956, 2013.[12] Tao Hu, A Yu Grosberg, and BI Shklovskii. How pro-teins search for their specific sites on dna: the role ofdna conformation.
Biophysical journal , 90(8):2731–2744,2006.[13] Michael A Lomholt, Bram van den Broek, Svenja-Marei JKalisch, Gijs JL Wuite, and Ralf Metzler. Facilitateddiffusion with dna coiling.
Proceedings of the NationalAcademy of Sciences , 106(20):8204–8208, 2009.[14] Michael A Lomholt, Tobias Ambj¨ornsson, and Ralf Met-zler. Optimal target search on a fast-folding polymerchain with volume exchange.
Physical review letters ,95(26):260603, 2005.[15] Erez Lieberman-Aiden, Nynke L van Berkum, LouiseWilliams, Maxim Imakaev, Tobias Ragoczy, AgnesTelling, Ido Amit, Bryan R Lajoie, Peter J Sabo,Michael O Dorschner, et al. Comprehensive mapping oflong-range interactions reveals folding principles of thehuman genome. science , 326(5950):289–293, 2009.[16] Siyuan Kong and Yubo Zhang. Deciphering hi-c: from 3dgenome to function.
Cell biology and toxicology , 35(1):15–32, 2019.[17] Suhas SP Rao, Miriam H Huntley, Neva C Durand,Elena K Stamenova, Ivan D Bochkov, James T Robinson,Adrian L Sanborn, Ido Machol, Arina D Omer, Eric SLander, et al. A 3d map of the human genome at kilo-base resolution reveals principles of chromatin looping.
Cell , 159(7):1665–1680, 2014.[18] Jan Krefting, Miguel A Andrade-Navarro, and Jonas Ibn-Salem. Evolutionary stability of topologically associat- ing domains is associated with conserved gene regulation.
BMC biology , 16(1):87, 2018.[19] Jesse R Dixon, Siddarth Selvaraj, Feng Yue, AudreyKim, Yan Li, Yin Shen, Ming Hu, Jun S Liu, and BingRen. Topological domains in mammalian genomes iden-tified by analysis of chromatin interactions.
Nature ,485(7398):376, 2012.[20] Daniel Jost, Pascal Carrivain, Giacomo Cavalli, andC´edric Vaillant. Modeling epigenome folding: formationand dynamics of topologically associated chromatin do-mains.
Nucleic acids research , 42(15):9553–9561, 2014.[21] Sang Hoon Lee, Yeonghoon Kim, Sungmin Lee, XavierDurang, Per Stenberg, Jae-Hyung Jeon, and LudvigLizana. Mapping the spectrum of 3d communities in hu-man chromosome conformation capture data.
Scientificreports , 9(1):6859, 2019.[22] Michele Di Pierro, Ryan R Cheng, Erez LiebermanAiden, Peter G Wolynes, and Jos´e N Onuchic. De novoprediction of human chromosome structures: Epigeneticmarking patterns encode genome architecture.
Proceed-ings of the National Academy of Sciences , 114(46):12126–12131, 2017.[23] Fran¸cois Serra, Marco Di Stefano, Yannick G Spill,Yasmina Cuartero, Michael Goodstadt, Davide Ba`u,and Marc A Marti-Renom. Restraint-based three-dimensional modeling of genomes and genomic domains.
FEBS letters , 589(20PartA):2987–2995, 2015.[24] Jan Smrek and Alexander Y Grosberg. Facilitated diffu-sion of proteins through crumpled fractal dna globules.
Physical Review E , 92(1):012702, 2015.[25] Igor M Sokolov, Ralf Metzler, Kiran Pant, and Mark CWilliams. First passage time of n excluded-volume par-ticles on a line.
Physical Review E , page 041102, 2005.[26] Seee supplementary material
American journal of Psy-chology , 15(1):72–101, 1904.[29] Davide Normanno, Lydia Boudarene, Claire Dugast-Darzacq, Jiji Chen, Christian Richter, Florence Proux,Olivier B´enichou, Rapha¨el Voituriez, Xavier Darzacq,and Maxime Dahan. Probing the target search of dna-binding proteins in mammalian cells using tetr as modelsearcher.
Nature communications , 6:7357, 2015.[30] Otto G Berg, Anel Mahmutovic, Emil Marklund, andJohan Elf. The helical structure of dna facilitates bind-ing.
Journal of Physics A: Mathematical and Theoretical ,49(36):364002, 2016.[31] Dominic Schmidt, Michael D Wilson, Benoit Ballester,Petra C Schwalie, Gordon D Brown, Aileen Mar-shall, Claudia Kutter, Stephen Watt, Celia P Martinez-Jimenez, Sarah Mackay, et al. Five-vertebrate chip-seqreveals the evolutionary dynamics of transcription factorbinding.
Science , 328(5981):1036–1040, 2010.[32] Rajendra Kumar, Haitham Sobhy, Per Stenberg, andLudvig Lizana. Genome contact map explorer: a plat-form for the comparison, interactive visualization andanalysis of genome contact maps.
Nucleic acids research ,45(17):e152–e152, 2017.[33] Diffusion-limited search time, τ = 4 πaD/V where D =0 . − . µ m /s, a = 5 bp (=1.3 nm), and V = 1 µ m . These values give τ = 2 . − . > × base pairs [15] SUPPLEMENTARY MATERIALParticle flux through the target
The number of proteins that reached the target up totime t is J a ( t ). For non-zero k on and k off it reads J a ( t ) = (cid:88) j ω aj (cid:88) i V ji (cid:26) k i on k off − λ i (cid:20) t − − e − t ( k off − λ i ) k off − λ i (cid:21) + 1 − e − t ( k off − λ i ) k off − λ i ρ (cid:88) l (cid:54) = a V − il + k on t. (9)where λ j and V ij are the eigenvalues and of eigenvec-tors ω ij . Because λ = 0 is the largest eigenvalue, wecan approximate Eq. (9) at times t (cid:29) k − with termsproportional to tJ a ( t ) (cid:39) (cid:0) k on + T − a (cid:1) t, T − a = (cid:88) j ω aj ¯ n j , (10)where the steady-state distribution is¯ n j = (cid:88) i V ji k i on k off − λ i , (11)The relation J a ( t ) (cid:39) ( k on + T − a ) t coincides withthe continuum approach in [14] for proteins that com-bines bulk excursions with 1D sliding (jumping to near-est neighbours in our model) and L´evy relocation’s withjump lengths x distributed like (cid:39) | x | − − α (0 < α < ω ij (cid:39) | i − j | − − α with 0 < α < Particle flux through the target without bulkexchange
Here we investigate the case when proteins do not un-bind from the DNA. As k on , k off →
0, Eq. (9) becomes J a ( t ) = N (cid:88) k =1 ω ak N (cid:88) i =2 V ki | λ i | (1 − e − t | λ i | ) (cid:88) j (cid:54) = a ρ V − ij = N p − N (cid:88) k =1 ω ak N (cid:88) i =2 V ki | λ i | e − t | λ i | (cid:88) j (cid:54) = a ρ V − ij , (12)with N p = ρ ( N − J a ( t → ∞ ) = N p since by then all proteins have arrivedto the target. This leads to the simplification in the 2ndrow. For large times t (cid:28) | λ N | − – λ N is the largesteigenvalue (in magnitude) – where J a ( t ) (cid:28) N p , we findthe same behaviour as before, J a ( t ) ∝ t . This is seen byexpanding Eq. (12) around t = 0. Derivation of Eq. (6) in the manuscriptFast target finding ( k off (cid:28) ) When the unbinding rate k off is small compared to theassociation rate K a , the number of proteins per node isclose to its initial value ρ by the time of the first arrivalto the target, and we have the approximation K a = k on + ρ W a , (13)where W a = (cid:80) i (cid:54) = a ω ai . We may find this approximationby expanding Eq. (9) around t = 0 and using the inversetransformation (cid:80) j V ij q j (0) = n i (0) = ρ (1 − δ ia ).Furthermore, by demanding that the genome wide av-erage of K a in Eq. (13) is unity, (cid:104) K a (cid:105) = 1, and usingthat ω ai = f coll v ai we find f coll to be f coll = 1 − k on ρ (cid:104) V a (cid:105) . (14)Using this in Eq. (13) leads to K a = k on + (1 − k on ) V a (cid:104) V a (cid:105) , k off (cid:28) . (15)With this definition of f coll , the binding rate k on is boundby [0 , k on = 1 corresponds to target-findingdirectly from the bulk. Target finding in steady-state ( k off (cid:29) ) When the unbinding rate k off is large compared to theassociation rate K a , few proteins will find the target be-fore leaving on a bulk excursion. In this limit, the systemreaches its steady-state – with ¯ ρ a the number of proteinsper node – before the first arrival to the target. Thisleads to the approximation K a = k on + ¯ ρ a W a . (16)To arrive at this equation we identify in Eq. (10) that J ( t ) (cid:39) K a t . Then we replace the ¯ n j by the approximatedensity ¯ ρ a that we find ¯ ρ a by the following argument. Insteady state, proteins bind to the DNA with rate k on .Except for the absorbing target, there are N − ρ a ( N −
1) numberof proteins that unbind from the DNA with rate k off .Last, proteins are absorbed at the target with rate T − a =¯ ρ a W a . These three terms sum to zero, and therefore¯ ρ a = k on k off + W a / ( N − . (17)Using that W a = f coll V a with f coll from Eq. (14) gives K a = k on + k on (1 − k on ) k on + − k on N − V a (cid:104) V a (cid:105) V a (cid:104) V a (cid:105) , k off (cid:29) . (18) Validation of approximations
To better understand the validity of Eqs. (15) and(18), we compare them to the exact association rate K exact a = (cid:18)(cid:90) ∞ exp( − J a ( t )) dt (cid:19) − . (19)Figure 5 a shows how the association rate changes fora specific target node – we choose a = N/ k off while keeping theon-rate fixed, k on = 0 . ρ = k on /k off . The solid grey line shows K exact a and thehorizontal lines represent the approximations for smalland large k off – Eqs. (15) and (18).The blue area in Fig. 5 a shows the large- K a regime( K a > k off ). Here, Eq. (15) deviates only a few per-cent from K exact a : the deviation is 2 . ≈ − K a /K exact a )at the encircled green dot. To get this number, we used k off = 0 . k on = 0 .
001 and ρ = 0 . K a ( K a < k off / K exact a . At the red dot ( k off = 2)the relative error is 5.7%.In the intermediate region (white area), we cannot usethe simple expressions because the flux J ( t ) has a com-plicated time-dependence. To get the association rate inthis regime, we have to evaluate Eq. (19) directly.In Figs. 5 b and 5 c , we calculate the association ratefor all nodes in chromosome 21 using Eqs. (19), (15) and(18) with fixed parameters (shown in the figures). Thefigure shows the limiting k off cases. In 5 b the unbindingrate is small ( k off = 0 . K exact a whereas Eq. (18)does not. Equation (18) that matches better in 5 c wherethe unbinding rate is large ( k off = 2). Genome-wide association rates when k off (cid:29) Similar to Fig. 2 in the main text, we show the asso-ciation rate for all targets in every chromosome at large k off . We calculate K a from Eq. (18), see Fig. 6. Thesecurves for k off (cid:29) k off (cid:28) y -axis. Genome-wide association rate as a function of nodestrength
In Fig. 7 a we show K a – calculated from Eq. (15)( k off (cid:28)
1) – varies with node strength V a for our dif-ferent values of k on with fixed ρ = 0 . abc FIG. 5: Comparison of the approximations to the exact, time-integrated association rate, using chromosome 21 at the res-olution 160 kb. a In the blue area where k off (cid:28) k off (cid:29)
1, the sys-tem is close to its steady state and target finding is slower.Target at N/
2, in the middle of the system. b Associationrates to all targets when k off (cid:28) c Association rates to alltargets when k off (cid:29)
1. Note the green and red circled dots in b and c , they correspond to the same parameter values as in a , respectively. FIG. 6: Genome-wide association rates K a at 40 kb resolution evaluated using Equation (18). The association rates varyby several orders of magnitude along the chromosomes but the 95% confidence interval of K a varies only by a few percent(0 . ± . k off = 2, k on = 0 .
001 and ρ = 0 . plot the analytical prediction Eq. (15). We find thatthe search times are dominated by k on for weakly con-nected nodes. For strongly connected ndoes, we find theuniversal behavior K a ∝ V a .In Fig. 7 b we show K a for all nodes in the other limit k off (cid:29)
1. Here K a depends on the number of nodes N – via ρ a in Eq. (18) – and therefore we do not expect auniversal large- V a behaviour Fraction of transcribed DNA
We use RNA expression data (downloaded from EN-CODE) to calculate the fraction of transcribed DNA.This is calculated in the following way. For every 40kb region i across the genome, we count the number ofbase pairs n i that has an RNA expression level abovezero. The fraction of transcribed DNA in region i is thus n i / K a and fraction of transcribed DNAis slightly stronger than for RNA expression level (Spear-man correlation coefficient = 0.5632). We point out thatthe fraction of transcribed DNA and the RNA expressioncorrelate strongly: Spearman’s correlation coefficient is0.9693. Contact probability
The average contact probability decays with distancefrom the diagonal. To see this, we calculated p ij = (cid:104) (cid:80) N − j =1 ω j,j +1 / ( N − j ) (cid:105) , where (cid:104) ... (cid:105) denotes genome-wideaverage. We find that p ij ∼ | i − j | − α where there arethree regimes with different α (Fig. 9). Since the chro-mosomes has different sizes, the regimes appear at differ-ent length scales, but the cross-overs are roughly at 500kb and 5,000 kb. We used Hi-C matrices ω ij = f coll V ai ,where f coll is given in equation (14) with k on = 0 .
001 and ρ = 0 . ab FIG. 7: The association rate vs the nodes’ strength V a (sumof all link weights). a All data points follow the universallaw K a = k on + (1 − k on ) V a / (cid:104) V a (cid:105) , see Eq. (15). The density ρ = 0 . k on . b Association rate in during steady state ( k off (cid:29)
1) calculatedfrom Eq.(18). The behavior is not universal as it dependson the size N of each chromosome (number of nodes). Thedotted line evaluated analytically – as N we picked the meanchromosome size. FIG. 8: Association rate increases with the fraction of tran-scribed DNA. This is similar to the RNA expression levelwhich also increases with the association rate.FIG. 9: The average contact probability (genome wide) de-cays as d − α with distance d from the diagonal. At d = 500kb and d = 5000 kb, α changes value. The black line is is theaveraged data from ω ij , and green, pink and purple are fits. Transcription start sites (TSSs)
To distinguish between active and inactive TSSs weuse RNA expression data (downloaded from ENCODE)and calculated the average number of RNA reads perbase pair, ¯ n RNA , 1kb upstream and 1kb downstream ofeach TSS. We defined an active TSS as when ¯ n RNA ≥ K a changes with the number of TSSs per node (Fig.10). The plots show three TSS groups: active, inactive, and all. The genome-wide average of all these plots isFig. 3 in the main text. ∗ Electronic address: [email protected][1] Peggy J Farnham. Insights from genomic profilingof transcription factors.
Nature Reviews Genetics ,10(9):605, 2009.[2] Viktorija Globyte, Seung Hwan Lee, Taegeun Bae, Jin-Soo Kim, and Chirlmin Joo. Crispr/cas9 searches fora protospacer adjacent motif by lateral diffusion.
TheEMBO journal , 38(4):e99466, 2019.[3] Petter Hammar, Prune Leroy, Anel Mahmutovic, Erik GMarklund, Otto G Berg, and Johan Elf. The lac repres-sor displays facilitated diffusion in living cells.
Science ,336(6088):1595–1598, 2012.[4] Arthur D Riggs, Suzanne Bourgeois, and Melvin Cohn.The lac represser-operator interaction: Iii. kinetic stud-ies.
Journal of molecular biology , 53(3):401–417, 1970.[5] Peter H von Hippel and OG Berg. Facilitated target lo-cation in biological systems.
Journal of Biological Chem-istry , 264(2):675–678, 1989.[6] A B Kolomeisky. Physics of protein–dna interactions:mechanisms of facilitated target search.
Physical Chem-istry Chemical Physics , 13(6):2088–2095, 2011.[7] Leonid Mirny, Michael Slutsky, Zeba Wunderlich,Anahita Tafvizi, Jason Leith, and Andrej Kosmrlj. Howa protein searches for its site on dna: the mechanism offacilitated diffusion.
Journal of Physics A: Mathematicaland Theoretical , 42(43):434013, 2009.[8] Johan Elf, Gene-Wei Li, and X Sunney Xie. Probingtranscription factor dynamics at the single-molecule levelin a living cell.
Science , 316(5828):1191–1194, 2007.[9] Bram van den Broek, Michael Andersen Lomholt, S-MJKalisch, Ralf Metzler, and Gijs JL Wuite. How dna coil-ing enhances target localization by proteins.
Proceed-ings of the National Academy of Sciences , 105(41):15738–15742, 2008.[10] Assaf Amitai. Chromatin configuration affects the dy-namics and distribution of a transiently interacting pro-tein.
Biophysical journal , 114(4):766–771, 2018.[11] Maximilian Bauer and Ralf Metzler. In vivo facilitateddiffusion model.
PloS one , 8(1):e53956, 2013.[12] Tao Hu, A Yu Grosberg, and BI Shklovskii. How pro-teins search for their specific sites on dna: the role ofdna conformation.
Biophysical journal , 90(8):2731–2744,2006.[13] Michael A Lomholt, Bram van den Broek, Svenja-Marei JKalisch, Gijs JL Wuite, and Ralf Metzler. Facilitateddiffusion with dna coiling.
Proceedings of the NationalAcademy of Sciences , 106(20):8204–8208, 2009.[14] Michael A Lomholt, Tobias Ambj¨ornsson, and Ralf Met-zler. Optimal target search on a fast-folding polymerchain with volume exchange.
Physical review letters ,95(26):260603, 2005.[15] Erez Lieberman-Aiden, Nynke L van Berkum, LouiseWilliams, Maxim Imakaev, Tobias Ragoczy, AgnesTelling, Ido Amit, Bryan R Lajoie, Peter J Sabo,Michael O Dorschner, et al. Comprehensive mapping oflong-range interactions reveals folding principles of thehuman genome. science , 326(5950):289–293, 2009.[16] Siyuan Kong and Yubo Zhang. Deciphering hi-c: from 3dgenome to function.
Cell biology and toxicology , 35(1):15–32, 2019.[17] Suhas SP Rao, Miriam H Huntley, Neva C Durand, FIG. 10: The association rate K a changes with locations of gene starts and RNA transcription levels for each chromosome. Wetruncated the plots where it was less than 7 K a values to calculate the average association rate. Parameter values: k off = 0 . k on = 0 .
001 and ρ = 0 . K a evaluated using Eq. (15) . Elena K Stamenova, Ivan D Bochkov, James T Robinson,Adrian L Sanborn, Ido Machol, Arina D Omer, Eric SLander, et al. A 3d map of the human genome at kilo-base resolution reveals principles of chromatin looping.
Cell , 159(7):1665–1680, 2014.[18] Jan Krefting, Miguel A Andrade-Navarro, and Jonas Ibn-Salem. Evolutionary stability of topologically associat-ing domains is associated with conserved gene regulation.
BMC biology , 16(1):87, 2018.[19] Jesse R Dixon, Siddarth Selvaraj, Feng Yue, AudreyKim, Yan Li, Yin Shen, Ming Hu, Jun S Liu, and BingRen. Topological domains in mammalian genomes iden-tified by analysis of chromatin interactions.
Nature ,485(7398):376, 2012.[20] Daniel Jost, Pascal Carrivain, Giacomo Cavalli, andC´edric Vaillant. Modeling epigenome folding: formationand dynamics of topologically associated chromatin do-mains.
Nucleic acids research , 42(15):9553–9561, 2014.[21] Sang Hoon Lee, Yeonghoon Kim, Sungmin Lee, XavierDurang, Per Stenberg, Jae-Hyung Jeon, and LudvigLizana. Mapping the spectrum of 3d communities in hu-man chromosome conformation capture data.
Scientificreports , 9(1):6859, 2019.[22] Michele Di Pierro, Ryan R Cheng, Erez LiebermanAiden, Peter G Wolynes, and Jos´e N Onuchic. De novoprediction of human chromosome structures: Epigeneticmarking patterns encode genome architecture.
Proceed-ings of the National Academy of Sciences , 114(46):12126–12131, 2017.[23] Fran¸cois Serra, Marco Di Stefano, Yannick G Spill,Yasmina Cuartero, Michael Goodstadt, Davide Ba`u,and Marc A Marti-Renom. Restraint-based three-dimensional modeling of genomes and genomic domains.
FEBS letters , 589(20PartA):2987–2995, 2015.[24] Jan Smrek and Alexander Y Grosberg. Facilitated diffu- sion of proteins through crumpled fractal dna globules.
Physical Review E , 92(1):012702, 2015.[25] Igor M Sokolov, Ralf Metzler, Kiran Pant, and Mark CWilliams. First passage time of n excluded-volume par-ticles on a line.
Physical Review E , page 041102, 2005.[26] Seee supplementary material
American journal of Psy-chology , 15(1):72–101, 1904.[29] Davide Normanno, Lydia Boudarene, Claire Dugast-Darzacq, Jiji Chen, Christian Richter, Florence Proux,Olivier B´enichou, Rapha¨el Voituriez, Xavier Darzacq,and Maxime Dahan. Probing the target search of dna-binding proteins in mammalian cells using tetr as modelsearcher.
Nature communications , 6:7357, 2015.[30] Otto G Berg, Anel Mahmutovic, Emil Marklund, andJohan Elf. The helical structure of dna facilitates bind-ing.
Journal of Physics A: Mathematical and Theoretical ,49(36):364002, 2016.[31] Dominic Schmidt, Michael D Wilson, Benoit Ballester,Petra C Schwalie, Gordon D Brown, Aileen Mar-shall, Claudia Kutter, Stephen Watt, Celia P Martinez-Jimenez, Sarah Mackay, et al. Five-vertebrate chip-seqreveals the evolutionary dynamics of transcription factorbinding.
Science , 328(5981):1036–1040, 2010.[32] Rajendra Kumar, Haitham Sobhy, Per Stenberg, andLudvig Lizana. Genome contact map explorer: a plat-form for the comparison, interactive visualization andanalysis of genome contact maps.