Mechanisms of Protein Search for Targets on DNA: Theoretical Insights

Abstract

Protein-DNA interactions are critical for the successful functioning of all natural systems. The key role in these interactions is played by processes of protein search for specific sites on DNA. Although these processes have been studied for many years, their microscopic aspects have become clearer only recently. In this work, we review the current theoretical understanding of the molecular mechanisms of the protein target search. A comprehensive discrete-state stochastic method to explain the dynamics of the protein search phenomena is introduced and explained. Our theoretical approach utilizes a first-passage analysis and takes into account the most relevant physical-chemical processes. It is able to describe many fascinating features of the protein search, including unusually high effective association rates, high selectivity and specificity, and robustness in the presence of crowders and sequence heterogeneity.


arXiv [q-bio.SC]

Alexey A. Shvets,† Maria P. Kochugaeva,‡ and Anatoly B. Kolomeisky∗,¶

† Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
‡ Department of Biomedical Engineering and System Biology Institute, Yale University, West Haven, CT 06516, USA
¶ Department of Chemistry, Department of Chemical and Biomolecular Engineering, and Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, USA

E-mail: [email protected]

Introduction

The dynamical nature of the underlying processes is what distinguishes living systems from other systems.

Biological processes constantly involve time-dependent fluxes of energy and materials, which keeps them far from equilibrium as long as organisms are alive. This implies that the concepts of equilibrium thermodynamics have limited applications for biological systems, while the role of methods that study dynamical transformations is much more important. In this review, we present our theoretical views on dynamic aspects of the protein-DNA interactions, which dominate in biological systems. Our approach is based on explicit calculations of dynamic properties via a first-passage probabilities analysis. First-passage ideas have already been widely utilized in studies of various complex processes in chemistry, physics and biology.

We employ these ideas in developing a discrete-state stochastic framework for analyzing the dynamics of protein search for specific targets on DNA.

It is known that the beginning of most biological processes is associated with specific protein molecules binding to specific target sequences on DNA, because these events initiate the cascades of corresponding biochemical and biophysical processes. For example, to activate or to repress a gene, the corresponding transcription factor proteins must first bind to the gene promoter region. This fundamental aspect of protein-DNA interactions has been studied extensively by various experimental and theoretical methods. Special attention has been devoted to understanding the dynamics of the protein search for specific targets on DNA. Many ideas have been proposed and critically discussed, but only recently has a clear molecular picture of the underlying processes started to emerge.

A large number of experimental observations on protein search phenomena, which mostly come from single-molecule measurements, suggest that it is a complex dynamic phenomenon that combines three-dimensional (in the bulk solution) and one-dimensional (on the DNA chain) motions. But the most paradoxical observation is that, although the protein molecules spend most of the search time ($\geq$ […]) non-specifically bound to DNA, where they move slowly, they can find their targets very quickly. For example, the measured association rate for the lac repressor was $\sim$ […] M$^{-1}$ s$^{-1}$ (two orders of magnitude faster than the diffusion limit!), and many other experimentally determined protein-DNA association rates were also astonishingly high in comparison to typical biological binding rates. This is known as facilitated diffusion. Several theoretical ideas on the origin of the facilitated diffusion, including lowering of dimensionality, electrostatic effects, correlations between 3D and 1D motions, conformational transitions, bending fluctuations, and hydrodynamic effects, have been explored and discussed. However, theoretical analysis shows that none of these mechanisms can fully explain the facilitated diffusion in the protein search. To understand the dynamic aspects of protein-DNA interactions, we developed a discrete-state stochastic framework that takes into account the most relevant physical-chemical processes in the system. The application of the first-passage probabilities method also allows us to explicitly evaluate the dynamic properties and to clarify dynamic aspects of the protein-DNA interactions.

It is important to note that although there are still different opinions on the theoretical foundations of the protein search phenomena, in this work we mostly present our views on these problems, which, of course, are subjective. In addition, there are many theoretical advances in our understanding of the protein search dynamics, but we will concentrate on only a few of them in order to better explain the underlying molecular processes. Furthermore, there is a huge number of investigations of the protein target search phenomena. Our goal is not to cover all studies and all existing views but to present a clear theoretical picture of these processes as we understand it now.

Simplest Discrete-State Stochastic Model of the Protein Target Search

Experiments clearly indicate that during the search the protein molecule alternates between freely diffusing in the solution around the DNA chain and non-specific associations to DNA, which also include scanning the DNA chain. The process is completed when the protein molecule reaches the specific target sequence on DNA for the first time. Stimulated by these observations, we start with the simplest minimal model of the protein search, as presented in Figure 1. It is important to note that, in contrast to other theoretical approaches, this method is based on a discrete-state stochastic description of the system. This is a more realistic view of the early stages of protein-DNA interactions because of the intrinsically discrete nature of molecular interactions in these systems.

Figure 1: A schematic view of a minimal discrete-state stochastic model of the protein search for targets on DNA. The DNA chain has $L-1$ non-specific sites and one target site. A DNA-bound protein can slide along the chain with a rate $u$ in both directions. It can also associate to DNA from the bulk solution (labeled as state 0) with a rate $k_{on}$, or it can dissociate back to the solution with a rate $k_{off}$. The search is finished when the protein binds to the target site at the position $m$ for the first time.

In this simple model, we consider a single protein molecule and a single DNA molecule with a single target site (see Figure 1). The DNA chain is viewed as having $L$ discrete binding sites, and one of them, at the position $m$, is the target for the protein molecule. Because the diffusion of the proteins in the bulk is usually fast, all solution states of the protein are combined into one state that we label as state 0 (Figure 1). It is assumed that from the bulk solution the protein molecule can bind with equal probability to any site on DNA, and the total association rate to DNA is equal to $k_{on}$, while the dissociation rate from DNA is $k_{off}$. The non-specifically bound proteins can diffuse without bias along the DNA contour in either direction with a rate $u$ (see Figure 1).

Since the search process ends as soon as the protein molecule arrives at the specific site for the first time, we introduce a function $F_n(t)$, defined as the probability density of reaching the site $m$ (the target site) for the first time at time $t$ if at $t=0$ the protein started in the state $n$ ($n=0$ is the bulk solution, and $n=1,\ldots,L$ are the protein-DNA bound states). This function is also known as a first-passage probability density function. To compute these first-passage probabilities, we utilize backward master equations that describe the temporal evolution of these quantities,

$$\frac{dF_n(t)}{dt} = u\left[F_{n+1}(t) + F_{n-1}(t)\right] + k_{off} F_0(t) - (2u + k_{off}) F_n(t), \tag{1}$$

for $2 \le n \le L-1$, while at the boundaries ($n=1$ or $n=L$) we have

$$\frac{dF_1(t)}{dt} = u F_2(t) + k_{off} F_0(t) - (u + k_{off}) F_1(t), \tag{2}$$

and

$$\frac{dF_L(t)}{dt} = u F_{L-1}(t) + k_{off} F_0(t) - (u + k_{off}) F_L(t). \tag{3}$$

For the state $n=0$, the backward master equation is different,

$$\frac{dF_0(t)}{dt} = \frac{k_{on}}{L} \sum_{n=1}^{L} F_n(t) - k_{on} F_0(t). \tag{4}$$

Here we used the fact that the rate to bind to any site on DNA is $k_{on}/L$, so that the total association rate is equal to $k_{on}$. In addition, the initial conditions require that $F_m(t) = \delta(t)$ and $F_{n \neq m}(t=0) = 0$. This means that if the protein molecule starts at the target site $m$, the search is immediately accomplished.

It is important to explain the physical meaning of the backward master equations because they differ from the classical forward master equations widely employed in chemical kinetics. It can be easily seen that all trajectories that start at the state $n$ and finish at the target site $m$ can be divided into several groups. For example, for $2 \le n \le L-1$ the trajectories starting at the site $n$ can be divided into three groups: 1) passing via the state $n-1$, 2) passing via the state $n+1$, or 3) passing via the state 0 in the next time step. The fractions of those trajectories are given by $u/(2u+k_{off})$, $u/(2u+k_{off})$ and $k_{off}/(2u+k_{off})$, respectively. Equation (1) describes this partition of the trajectories in a time-dependent manner because the first-passage probability flux to the target is determined by these trajectories. Thus, the backward master equations reflect the temporal evolution of the first-passage probabilities.

The most convenient way to analyze the dynamics in the system is to use Laplace representations of the first-passage probability functions, $\widetilde{F}_n(s) \equiv \int_0^{\infty} e^{-st} F_n(t)\,dt$. Then Equations (1), (2), (3) and (4) can be written as simpler algebraic expressions:

$$(s + 2u + k_{off})\,\widetilde{F}_n(s) = u\left[\widetilde{F}_{n+1}(s) + \widetilde{F}_{n-1}(s)\right] + k_{off}\,\widetilde{F}_0(s); \tag{5}$$

$$(s + u + k_{off})\,\widetilde{F}_1(s) = u\,\widetilde{F}_2(s) + k_{off}\,\widetilde{F}_0(s); \tag{6}$$

$$(s + u + k_{off})\,\widetilde{F}_L(s) = u\,\widetilde{F}_{L-1}(s) + k_{off}\,\widetilde{F}_0(s); \tag{7}$$

$$(s + k_{on})\,\widetilde{F}_0(s) = \frac{k_{on}}{L} \sum_{n=1}^{L} \widetilde{F}_n(s). \tag{8}$$

In addition, from the initial conditions we have $\widetilde{F}_m(s) = 1$. These equations are solved assuming that the general form of the solution is $\widetilde{F}_n(s) = A y^n + B$, where the unknown coefficients $A$, $y$ and $B$ are determined from the initial and boundary conditions. One could argue that the target site $m$ divides the DNA molecule into two homogeneous segments ($1 \le n \le m$ and $m \le n \le L$), which can be considered separately. It was shown that this approach leads to explicit expressions for the first-passage probability functions. Specifically, one obtains

$$\widetilde{F}_0(s) = \frac{k_{on}(k_{off}+s)\,S(s)}{L s (k_{off} + k_{on} + s) + k_{off} k_{on} S(s)}, \tag{9}$$

with an auxiliary function $S(s)$ defined as

$$S(s) = \frac{y(1+y)\left(y^{-L} - y^{L}\right)}{(1-y)\left(y^{1-m} + y^{m}\right)\left(y^{m-L} + y^{1+L-m}\right)}; \tag{10}$$

and with the parameters $y$ and $B$ given by

$$y = \frac{s + 2u + k_{off} - \sqrt{(s + 2u + k_{off})^2 - 4u^2}}{2u}; \tag{11}$$

$$B = \frac{k_{off}\,\widetilde{F}_0(s)}{k_{off} + s}. \tag{12}$$

Explicit expressions for the first-passage probabilities provide a full dynamic description of the protein search processes, and any relevant quantity can be easily computed. For example, the mean search time from the bulk solution, which is inversely proportional to the chemical association rate for the specific target site, can be found from

$$T_0 \equiv -\left.\frac{\partial \widetilde{F}_0(s)}{\partial s}\right|_{s=0} = \frac{1}{k_{on}} \frac{L}{S(0)} + \frac{1}{k_{off}} \frac{L - S(0)}{S(0)}. \tag{13}$$

This result has a very clear physical meaning. Here the parameter $S(0)$ describes the average number of distinct sites that the protein molecule scans during each visit to DNA while searching for the single specific site. Then, on average, to find the target the protein must make $L/S(0)$ visits to DNA because during every association $S(0)$ DNA sites are checked. Each visit is preceded, on average, by a time $1/k_{on}$ that the protein spends in the bulk solution before binding to DNA, which gives the first term in Equation (13). The protein also makes $L/S(0) - 1$ dissociations, each after an average time $1/k_{off}$ spent on DNA, because the last visit ends at the target; this gives the second term.

The resulting search dynamics (Figure 2) is conveniently discussed in terms of the scanning length $\lambda = \sqrt{u/k_{off}}$, which gives the average distance that the protein molecule travels on DNA during each search cycle. This quantity is related to the parameter $S(0)$, but it is not the same because the protein might visit the same sites several times. If the protein molecule has a strong affinity to bind non-specifically to the DNA molecule (small $k_{off}$, $\lambda > L$), then there will be only one searching cycle. After binding to DNA the protein will not dissociate until it finds the target. In this case, the mean search time scales as $\sim L^2$ because the DNA-bound protein does a simple unbiased random walk. We call this dynamic phase a random-walk regime. Because of the redundancy of the random walk, the search in this regime should be generally slow: many sites are repeatedly visited. In the opposite limit of weak attractions between DNA and protein molecules (large $k_{off}$, $\lambda < 1$), the protein checks on average only one site per visit, so that about $L$ searching cycles are needed ($T_0 \sim L$). This dynamic regime is called a jumping regime. The search in this regime is generally fast as long as the associations are also fast.
The most interesting behavior is observed for intermediate interactions, which we label as a sliding regime. Here the scanning length $\lambda$ is larger than one but smaller than the length of DNA $L$, and the number of searching cycles is also proportional to $L$. But in this regime the system can reach the optimal dynamic behavior with the smallest search times. This search facilitation is achieved because the fluxes to the target now come from both the bulk solution and the DNA chain. This is one of the main mechanisms of the facilitated diffusion of proteins during the target search, but other processes, such as inter-segment transfer, might also contribute significantly to the facilitated diffusion.
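The closed-form result for the mean search time can be evaluated and cross-checked numerically. The following Python sketch is an illustration only, not the authors' code: the function names and parameter values are assumptions. It computes $S(0)$ and $T_0$ from Equations (10), (11) and (13), and compares them with a direct solution of the linear equations for the mean first-passage times that follow from Equations (5)-(8) in the limit $s \to 0$.

```python
import numpy as np

def y_root(u, k_off):
    """Parameter y of Eq. (11) at s = 0, written in the algebraically
    equivalent form 2u / (c + sqrt(c^2 - 4u^2)) to avoid cancellation."""
    c = 2.0 * u + k_off
    return 2.0 * u / (c + np.sqrt(k_off * (4.0 * u + k_off)))

def scanned_sites(L, m, u, k_off):
    """S(0) of Eq. (10); numerator and denominator are multiplied by
    y**(L - 1) so that only non-negative powers of y appear."""
    y = y_root(u, k_off)
    num = (1.0 + y) * (1.0 - y ** (2 * L))
    den = (1.0 - y) * (1.0 + y ** (2 * m - 1)) * (1.0 + y ** (2 * (L - m) + 1))
    return num / den

def mean_search_time(L, m, u, k_on, k_off):
    """Mean search time from the bulk solution, Eq. (13)."""
    S0 = scanned_sites(L, m, u, k_off)
    return L / (k_on * S0) + (L - S0) / (k_off * S0)

def mean_search_time_numeric(L, m, u, k_on, k_off):
    """Independent check: solve the linear system for the mean
    first-passage times T_n (index 0 = bulk, 1..L = DNA sites)."""
    A = np.zeros((L + 1, L + 1))
    b = np.ones(L + 1)
    A[0, 0] = k_on                 # bulk state: binds to any site
    A[0, 1:] = -k_on / L
    for n in range(1, L + 1):
        neighbours = [j for j in (n - 1, n + 1) if 1 <= j <= L]
        A[n, n] = u * len(neighbours) + k_off
        A[n, neighbours] = -u      # sliding to adjacent sites
        A[n, 0] = -k_off           # dissociation to the bulk
    A[m, :] = 0.0                  # target site is absorbing: T_m = 0
    A[m, m] = 1.0
    b[m] = 0.0
    return np.linalg.solve(A, b)[0]

if __name__ == "__main__":
    # Illustrative parameter values only (not taken from the figures).
    L, m, u, k_on = 200, 100, 1.0e5, 1.0e5
    for k_off in (1.0e8, 1.0e5, 1.0e2, 1.0e-1):
        lam = np.sqrt(u / k_off)
        t_formula = mean_search_time(L, m, u, k_on, k_off)
        t_direct = mean_search_time_numeric(L, m, u, k_on, k_off)
        print(f"lambda = {lam:9.3g}  Eq.(13): {t_formula:.6g} s  "
              f"linear solve: {t_direct:.6g} s")
```

Varying `k_off` at fixed `u` changes the scanning length $\lambda$, so the same functions can be used to trace how the mean search time changes between the jumping, sliding and random-walk regimes.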

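Figure 2: Mean search times (in seconds) as a function of the scanning length parameter $\lambda = \sqrt{u/k_{off}}$; the jumping, sliding and random-walk regimes are indicated. The parameters utilized in calculations are: $L = 10^{[\ldots]}$ bp, $u = k_{on} = 10^{[\ldots]}$ s$^{-1}$, and $m = L/[\ldots]$; $k_{off}$ is varied to change $\lambda$.

The Effect of Multiple Targets and Traps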

The advantage of the discrete-state stochastic framework with the first-passage analysis presented above is that it can be extended and generalized to more realistic biological situations. This allows us to investigate important questions related to the mechanisms of the protein target search on DNA. Let us present several specific examples, although many more results have been obtained. We start with the problem of how the presence of multiple target sites or multiple semi-specific trap sites affects the dynamics of the protein search.

It is known that in eukaryotic cells multiple target sites are available on the accessible DNA fragments.

The protein search is accomplished in these systems when the protein molecule finds any of the target sites for the first time. It has been argued that the mean search time in this system might not decrease proportionally to the number of targets, as one would naively expect from simple-minded applications of chemical kinetics. This is due to the complex mechanism of the protein search that involves both 3D and 1D motions. Applying our discrete-state stochastic framework to this problem, we consider a model with multiple targets at arbitrary locations, as presented in Figure 3. To describe the search dynamics in this system, we again introduce the first-passage probability function $F_n(t)$ of finding any of the targets at time $t$ if the process started at $t=0$ at the site $n$. Targets divide the DNA chain into several homogeneous segments, and this allows us to solve the corresponding backward master equations as explained in Section 2. This leads to the following explicit expression for the mean search time for any number of targets,

$$T_0 = \frac{1}{k_{on}} \frac{L}{S_i(0)} + \frac{1}{k_{off}} \frac{L - S_i(0)}{S_i(0)}, \tag{14}$$

with a function $S_i(0)$ describing the average number of distinct sites scanned by the protein on DNA with $i$ targets. This formula generalizes Equation (13), which is recovered when there is only one target ($i=1$). Specific expressions for $S_i(0)$ for various numbers of randomly distributed targets have been obtained. For example, for $i=2$ with targets at the sites $m_1 < m_2$ it was shown that

$$S_2(s) = \frac{(1+y)\left[2\left(1 - y^{2L+m_1-m_2}\right) + \left(1 - y^{m_2-m_1}\right)\left(y^{2m_1-1} + y^{1+2(L-m_2)}\right)\right]}{(1-y)\left(1 + y^{2m_1-1}\right)\left(1 + y^{1+2(L-m_2)}\right)\left(1 + y^{m_2-m_1}\right)}, \tag{15}$$

where the parameter $y$ is given in Equation (11).

Figure 3: A schematic view of the discrete-state stochastic model of the protein search with multiple specific sites. Targets are located at the sites $m_1$ and $m_2$.

To understand the effect of multiple targets on the protein search dynamics, we analyze the results of explicit calculations for mean search times as presented in Figure 4. It is found that the presence of multiple targets does not affect the overall dynamic phase diagram as compared with the single-target case: three search regimes are again observed depending on the size of the scanning length, the target size and the size of the DNA segment. Generally, the search is faster in the multiple-target systems. However, surprisingly, increasing the number of specific sites might not always accelerate the search. To quantify this effect, we introduced an acceleration parameter, $a_n = T_0(1)/T_0(n)$, where $T_0(n)$ is the mean search time for the system with $n$ targets. This ratio gives a numerical value of how much faster the search is in the presence of $n$ targets in comparison with the single-target system. It is illustrated in Figure 5. One can see that there is a range of parameters where the search dynamics in the system with two targets can be slower than the dynamics in the system with one target. This happens in the effectively 1D search regime (random-walk dynamic phase) when the single target is located in the middle of the DNA chain, while the two targets are close to each other and located near one of the ends of the DNA segment. In this case, the protein molecule effectively sees the two targets as a single target site (with the size equal to two target sites) because they are so close to each other. But it is faster to find a target located in the middle of the chain than a target positioned near the ends. This is the main reason why having multiple targets does not always lead to a decrease in the search times.
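The acceleration parameter can be explored with a short numerical sketch. The Python code below is illustrative only: the function names and the parameter values are assumptions, and $S(0)$ and $S_2(0)$ are evaluated at $s=0$ in forms that are algebraically equivalent to Equations (10) and (15). It computes $a_2 = T_0(1)/T_0(2)$ from Equation (14) and includes a direct linear solve of the mean first-passage equations for an arbitrary set of targets as a consistency check.

```python
import numpy as np

def y_root(u, k_off):
    """Eq. (11) at s = 0 in a cancellation-free, algebraically equal form."""
    c = 2.0 * u + k_off
    return 2.0 * u / (c + np.sqrt(k_off * (4.0 * u + k_off)))

def S_one(L, m, u, k_off):
    """S(0) for a single target at site m (Eq. (10) rescaled by y**(L-1))."""
    y = y_root(u, k_off)
    return ((1 + y) * (1 - y ** (2 * L))
            / ((1 - y) * (1 + y ** (2 * m - 1)) * (1 + y ** (2 * (L - m) + 1))))

def S_two(L, m1, m2, u, k_off):
    """S_2(0) for two targets at sites m1 < m2, Eq. (15)."""
    y = y_root(u, k_off)
    a = y ** (2 * m1 - 1)
    b = y ** (2 * (L - m2) + 1)
    e = y ** (m2 - m1)
    num = (1 + y) * (2 * (1 - y ** (2 * L + m1 - m2)) + (1 - e) * (a + b))
    return num / ((1 - y) * (1 + a) * (1 + b) * (1 + e))

def mean_time(L, S0, k_on, k_off):
    """Mean search time from the bulk, Eq. (14)."""
    return L / (k_on * S0) + (L - S0) / (k_off * S0)

def acceleration(L, m, m1, m2, u, k_on, k_off):
    """a_2 = T_0(1) / T_0(2): one target at m versus targets at m1 < m2."""
    t1 = mean_time(L, S_one(L, m, u, k_off), k_on, k_off)
    t2 = mean_time(L, S_two(L, m1, m2, u, k_off), k_on, k_off)
    return t1 / t2

def mean_time_numeric(L, targets, u, k_on, k_off):
    """Check by linear solve; index 0 = bulk, 1..L = DNA sites."""
    A = np.zeros((L + 1, L + 1))
    b = np.ones(L + 1)
    A[0, 0] = k_on
    A[0, 1:] = -k_on / L
    for n in range(1, L + 1):
        neighbours = [j for j in (n - 1, n + 1) if 1 <= j <= L]
        A[n, n] = u * len(neighbours) + k_off
        A[n, neighbours] = -u
        A[n, 0] = -k_off
    for t in targets:              # every target is absorbing
        A[t, :] = 0.0
        A[t, t] = 1.0
        b[t] = 0.0
    return np.linalg.solve(A, b)[0]

if __name__ == "__main__":
    # Illustrative values: small k_off puts the system in the random-walk regime.
    L, u, k_on, k_off = 200, 1.0, 1.0, 1.0e-6
    print("targets near one end :", acceleration(L, L // 2, 1, 2, u, k_on, k_off))
    print("well-separated targets:", acceleration(L, L // 2, 50, 150, u, k_on, k_off))
```

In this sketch the single target is placed in the middle of the chain, so the two calls compare it with two adjacent targets at one end and with two targets at one quarter and three quarters of the chain.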
Thus, our theoretical analysis predicts that the degree of acceleration due to the presence of multiple targets depends on the nature of the dynamic search phase and on the location of the specific sites with respect to each other and with respect to the middle point of DNA.

Another important factor that might affect the protein search dynamics is the existence of so-called semi-specific sites, or decoys, on DNA. These sites have a chemical composition very similar to the specific targets, with differences in only one or a few nucleotides. The protein molecule can be trapped in these sites, and this should influence the search for real targets. To analyze this effect, we can extend the simplest model to include the possibility of traps, assuming that associations to these semi-specific sites are effectively irreversible. This assumption is reasonable because the search times in many systems are relatively short and the experimental observations are also limited in time. Thus the bindings to decoys can be viewed as effectively irreversible. The first-passage analysis can be applied here, but we have to notice that only a fraction of trajectories will reach the correct target site. Then the main quantity of our calculations, the first-passage probability function $F_n(t)$, is now a conditional probability for the protein molecules not captured by the trap to find the target site.

Figure 4: Dynamic phase diagrams (mean search time $T_0$, in s, versus scanning length $\lambda$, in bp) for the protein search on DNA with one target at the position $m$, with two targets at the positions $m_1$ and $m_2$, and with the target and the trap at the positions $m_1$ and $m_2$, respectively. Parameters used for calculations are: $k_{on} = u = 10^{[\ldots]}$ s$^{-1}$ and $L = 10000$. a) $m = L/[\ldots]$, $m_1 = L/[\ldots]$, $m_2 = 3L/4$; b) $m = L/[\ldots]$, $m_1 = L/[\ldots]$, $m_2 = L/2$; and c) $m = L/[\ldots]$, $m_1 = L/[\ldots]$, $m_2 = L$. Adapted with permission from Ref. […].

Figure 5: Ratio of the mean search times, $a$, as a function of the normalized distance between the targets, $l/L$, for single-target and two-target systems ($l$ is the distance between the targets, $L$ is the DNA length). The single target is in the middle of the chain. In the two-target system, one of the specific sites is fixed at the end and the position of the second one is varied. The parameters used in calculations are: $u = k_{on} = 10^{[\ldots]}$ s$^{-1}$; $k_{off} = 10^{-[\ldots]}$ s$^{-1}$; and $L = 10000$. Adapted with permission from Ref. […].

Let us consider a system consisting of a single target at the site $m_1$ and a single trap at the site $m_2$ on the DNA molecule with $L$ sites. The scheme presented in Figure 3 also represents this system, with the correction that instead of the second target there is a trap at the site $m_2$, and the successful search corresponds to the protein molecule finding the specific site $m_1$. Following our theoretical method, the corresponding backward master equations can be solved, and they yield the Laplace transform of the first-passage probability function to find the target if the protein starts from the bulk solution,

$$\widetilde{F}_0(s) = \frac{k_{on}(k_{off}+s)\,S_1(s)}{L s (k_{off} + k_{on} + s) + k_{off} k_{on} S_2(s)}, \tag{16}$$

with

$$S_1(s) = \frac{(1+y)\left(1 - y^{m_1+m_2-1}\right)}{(1-y)\left(1 + y^{2m_1-1}\right)\left(1 + y^{m_2-m_1}\right)}, \tag{17}$$

and with the parameter $y$ and the function $S_2(s)$ given in Equations (11) and (15), respectively. This allows us to evaluate all dynamic properties in the system and to test the effect of traps.

The probability to reach the target (i.e., the fraction of the successful trajectories) is now given by a so-called splitting probability function,

$$\Pi \equiv \widetilde{F}_0(s=0) = \frac{S_1(0)}{S_2(0)}. \tag{18}$$

The mean search time, which is the conditional mean first-passage time to reach the target, can be estimated by averaging over the successful trajectories, producing

$$T_0 \equiv -\frac{1}{\Pi}\left.\frac{\partial \widetilde{F}_0(s)}{\partial s}\right|_{s=0} = \frac{1}{k_{on}} \frac{L}{S_2(0)} + \frac{1}{k_{off}} \frac{L - S_2(0)}{S_2(0)} + \Pi \left.\frac{\partial}{\partial s}\left[\frac{S_2(s)}{S_1(s)}\right]\right|_{s=0}. \tag{19}$$

Let us analyze this expression. On the left side, the division by the splitting probability emphasizes the fact that this is the conditional mean search time. It is also interesting to note that the first two terms on the right side of the equation are exactly the mean search time for the system with two targets and no traps (at the sites $m_1$ and $m_2$), as discussed above, while the third term is a correction that accounts for the fact that the site at $m_2$ is actually the trap. The main reason for this is the observation that the sites $m_1$ and $m_2$ are special locations where all trajectories end up in both systems, with two targets and with the target and the trap. For the two-target case the mean search times are averaged over all trajectories to both sites, while for the target and the trap system the mean search times are obtained only by considering the trajectories finishing at the target.

The results of calculations for the dynamic properties of the protein search in the presence of traps are presented in Figures 4 and 6. Again, three dynamic search phases are observed, but adding the trap generally facilitates the search dynamics, which is a counter-intuitive result (see Figure 4). However, this acceleration (in comparison with the single-target system) is always associated with a lowering of the probability of reaching the specific target, as shown in Figure 6. This means that the protein molecules might reach the target faster in the presence of the traps, but the fraction of such events decreases. In addition, the search dynamics is sensitive to the nature of the dynamic phase.
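The splitting probability and the conditional mean search time can also be obtained numerically, which gives a simple check of Equations (15), (17) and (18). The Python sketch below is an illustration under stated assumptions: the function names and parameter values are hypothetical, the target is assumed to sit to the left of the trap ($m_1 < m_2$), and the conditional time is computed by a direct linear solve rather than from Equation (19).

```python
import numpy as np

def y_root(u, k_off):
    """Eq. (11) at s = 0 in a cancellation-free, algebraically equal form."""
    c = 2.0 * u + k_off
    return 2.0 * u / (c + np.sqrt(k_off * (4.0 * u + k_off)))

def S_two(L, m1, m2, u, k_off):
    """S_2(0) of Eq. (15) for special sites at m1 < m2."""
    y = y_root(u, k_off)
    a, b, e = y ** (2 * m1 - 1), y ** (2 * (L - m2) + 1), y ** (m2 - m1)
    num = (1 + y) * (2 * (1 - y ** (2 * L + m1 - m2)) + (1 - e) * (a + b))
    return num / ((1 - y) * (1 + a) * (1 + b) * (1 + e))

def S_target(m1, m2, u, k_off):
    """S_1(0) of Eq. (17): target at m1, trap at m2 > m1."""
    y = y_root(u, k_off)
    return ((1 + y) * (1 - y ** (m1 + m2 - 1))
            / ((1 - y) * (1 + y ** (2 * m1 - 1)) * (1 + y ** (m2 - m1))))

def splitting_probability(L, m1, m2, u, k_off):
    """Probability to reach the target before the trap, Eq. (18)."""
    return S_target(m1, m2, u, k_off) / S_two(L, m1, m2, u, k_off)

def search_with_trap(L, target, trap, u, k_on, k_off):
    """Linear-solve check. Returns (Pi, T_cond) for a start in the bulk:
    Pi_n solves the backward equations with Pi = 1 at the target and 0 at
    the trap; W_n solves the same equations with source term Pi_n, and
    T_cond = W_0 / Pi_0 is the conditional mean first-passage time."""
    M = np.zeros((L + 1, L + 1))
    M[0, 0] = k_on
    M[0, 1:] = -k_on / L
    for n in range(1, L + 1):
        neighbours = [j for j in (n - 1, n + 1) if 1 <= j <= L]
        M[n, n] = u * len(neighbours) + k_off
        M[n, neighbours] = -u
        M[n, 0] = -k_off
    for site in (target, trap):    # both special sites are absorbing
        M[site, :] = 0.0
        M[site, site] = 1.0
    rhs = np.zeros(L + 1)
    rhs[target] = 1.0
    Pi = np.linalg.solve(M, rhs)
    source = Pi.copy()
    source[[target, trap]] = 0.0
    W = np.linalg.solve(M, source)
    return Pi[0], W[0] / Pi[0]

if __name__ == "__main__":
    L, u, k_on, k_off = 100, 1.0, 1.0, 0.01   # illustrative values only
    for m1, m2 in ((25, 76), (25, 50), (50, 100)):
        pi_formula = splitting_probability(L, m1, m2, u, k_off)
        pi_direct, t_cond = search_with_trap(L, m1, m2, u, k_on, k_off)
        print(m1, m2, pi_formula, pi_direct, t_cond)
```

For a mirror-symmetric placement of the target and the trap, the correction term in Equation (19) vanishes because $S_2(s)/S_1(s)$ is then constant, so the conditional search time should coincide with the two-target result of Equation (14).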
The strongest effect due to the presence of the trap is observed in the effective 1D random-walk regime (because it has only one searching cycle), where the locations of the target and the trap strongly influence the search. In other dynamic regimes, the effect is smaller.

Figure 6: Probability $\Pi$ to reach the target as a function of the scanning length $\lambda$ (in bp) for different distributions of the target and trap sites: $m_1 = L/4$, $m_2 = 3L/4$; $m_1 = L/4$, $m_2 = L/2$; and $m_1 = L/2$, $m_2 = L$. Parameters used for calculations are: $k_{on} = u = 10^{[\ldots]}$ s$^{-1}$, $L = 10000$, and $k_{off}$ is varied. Symbols are from Monte Carlo computer simulations. Adapted with permission from Ref. […].

Sequence heterogeneity

Real DNA molecules are heterogeneous polymers consisting of several types of subunits. This means that the interactions between protein and DNA molecules depend on the DNA sequence at the location where they meet. It is reasonable to expect that this sequence dependence of the interaction strength should affect the protein search dynamics because the diffusion rate for the non-specifically bound proteins will be position-dependent. Similarly, association and dissociation rates should also depend on the location of the protein molecule on DNA. In addition, recent theoretical investigations suggested that different DNA sequence symmetries might lead to additional effective interactions between protein and DNA molecules.

The discrete-state stochastic framework with the first-passage analysis is a convenient tool to investigate the effect of DNA sequence heterogeneity and symmetry on the protein search dynamics.

Our goal here is to clarify the molecular origin of how the sequence heterogeneity influences the protein target search. We assume here a simplified picture of DNA, in which each monomer can be one of two chemical species, $A$ or $B$, as presented in Figure 7. When the protein is bound to the subunit $A$ ($B$), it interacts with energy $\varepsilon_A$ ($\varepsilon_B$), and the difference between interaction energies is given by a parameter $\varepsilon = \varepsilon_A - \varepsilon_B \ge 0$. This means that the protein is attracted more strongly to the $B$ sites than to the $A$ sites. The protein molecule can diffuse along DNA with a rate $u_A \equiv u$ or $u_B = u e^{-\varepsilon}$, where $\varepsilon$ is measured in $k_B T$ units. This reflects the assumption that if the protein interacts more strongly with the DNA at a given location, then it will move out of this site more slowly. In addition, we assume that, independently of the chemical nature of the neighboring sites, sliding out of the sites $A$ is characterized by the rate $u_A$, while the diffusion out of the sites $B$ is given by $u_B$. From the bulk solution the protein might associate to any site $A$ or $B$ on DNA with the corresponding rates $k_{on}^{A} = k_{on}$ or $k_{on}^{B} = k_{on} e^{\theta\varepsilon}$. Note that, for convenience, the on-rates are defined here as the rates per unit site, in contrast to our definitions in the previous sections. Similarly, the dissociations from the DNA chain are described by the rates $k_{off}^{A} = k_{off}$ and $k_{off}^{B} = k_{off} e^{(\theta-1)\varepsilon}$. Here, the parameter $0 \le \theta \le 1$ specifies how the energy difference $\varepsilon$ is split between the association and dissociation rates. The physical meaning of this parameter is that the protein molecule tends to bind faster to and to dissociate more slowly from the more strongly attracting sites $B$, as compared with the more weakly attracting $A$ sites. The parameter $\theta$ accounts for these effects.

To quantify the role of sequence heterogeneity, we consider the DNA molecule with a fixed chemical composition (the fractions of $A$ and $B$ monomers are the same), but with different arrangements of subunits. Two limiting cases are specifically analyzed. One of them views the DNA molecule as two homogeneous segments of only $A$ and only $B$ subunits separated by the target in the middle of the chain (Figure 7). Another one is the DNA chain with alternating $A$ and $B$ sites. The block copolymer has two homogeneous sequence segments, while the alternating polymers are more heterogeneous. It is important to note that in both cases the overall interaction between the protein and DNA is the same (because the overall chemical composition in both cases is identical), and thus our analysis probes only the effect of the heterogeneity and symmetry in the subunit positions, in contrast to other theoretical treatments.

Figure 7: A simplified view of the protein search on DNA with two different types of subunits, $A$ and $B$. a) A general scheme; b) DNA is viewed as a symmetric block copolymer with the target in the middle of the chain; c) DNA is viewed as an alternating copolymer with different compositions of the subunits flanking the target in the middle of the chain. Adapted with permission from Ref. […].
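The two-species model described above can be explored numerically. The following Python sketch is illustrative only: the function names, the parameter values, and the choice that the target site itself has the per-site on-rate $k_{on}$ are assumptions. It builds the block and alternating sequences with the target $T$ in the middle and computes the mean search time from the bulk by solving the linear equations for the mean first-passage times with site-dependent rates.

```python
import numpy as np

def site_rates(kind, u, k_on, k_off, eps, theta):
    """(sliding, association, dissociation) rates for a site of type
    'A' or 'B'; eps is in k_B T units and B is the stronger-binding site."""
    if kind == "A":
        return u, k_on, k_off
    return (u * np.exp(-eps),
            k_on * np.exp(theta * eps),
            k_off * np.exp((theta - 1.0) * eps))

def block_sequence(L):
    """L/2 sites A, the target T, then L/2 sites B."""
    return "A" * (L // 2) + "T" + "B" * (L // 2)

def alternating_sequence(L, flanks="ATB"):
    """Alternating arms of L/2 sites each; flanks[0] and flanks[2] are
    the site types directly to the left and right of the target."""
    def arm(first):
        other = "B" if first == "A" else "A"
        return [first if i % 2 == 0 else other for i in range(L // 2)]
    return "".join(arm(flanks[0])[::-1]) + "T" + "".join(arm(flanks[2]))

def mean_search_time(seq, u, k_on, k_off, eps, theta):
    """Mean first-passage time from the bulk (index 0) to the target."""
    N = len(seq)
    t = seq.index("T") + 1
    A = np.zeros((N + 1, N + 1))
    b = np.ones(N + 1)
    for n, kind in enumerate(seq, start=1):
        if kind == "T":
            A[0, n] = -k_on        # assumption: target on-rate equals k_on
            continue
        hop, on, off = site_rates(kind, u, k_on, k_off, eps, theta)
        neighbours = [j for j in (n - 1, n + 1) if 1 <= j <= N]
        A[0, n] = -on              # per-site association from the bulk
        A[n, n] = hop * len(neighbours) + off
        A[n, neighbours] = -hop    # sliding out of site n
        A[n, 0] = -off             # dissociation to the bulk
    A[0, 0] = -A[0, 1:].sum()      # total association rate
    A[t, :] = 0.0                  # target is absorbing
    A[t, t] = 1.0
    b[t] = 0.0
    return np.linalg.solve(A, b)[0]

if __name__ == "__main__":
    # Illustrative parameter values only.
    pars = dict(u=1.0e3, k_on=0.1, k_off=10.0, eps=5.0, theta=0.5)
    L = 40
    t_block = mean_search_time(block_sequence(L), **pars)
    for flanks in ("ATA", "ATB", "BTB"):
        t_alt = mean_search_time(alternating_sequence(L, flanks), **pars)
        print(flanks, "T_alt / T_block =", t_alt / t_block)
```

Choosing `L // 2` even keeps the numbers of $A$ and $B$ sites equal in every arrangement, so that only the positions of the subunits differ between the block and alternating sequences.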

Figure 8: The ratio of the mean search times for the alternating DNA sequences and for the block copolymer DNA sequences, $T_{0,alt}/T_{0,block}$, as a function of the scanning length $\lambda = \sqrt{u/k_{off}}$ (in bp). Three different chemical compositions near the target ($T$) are distinguished, namely $ATA$, $ATB$ and $BTB$. The transition rates are $u = 10^{[\ldots]}$ s$^{-1}$ and $k_{on} = 0.[\ldots]$ s$^{-1}$. The DNA length is $L = 1000$, the loading parameter is $\theta = 0.5$, and the energy difference of interactions for the protein with $A$ and $B$ sites is $\varepsilon = 5\,k_B T$. Adapted with permission from Ref. […].

Applying again the first-passage approach and solving the corresponding equations leads to explicit expressions for the mean search times for all situations shown in Figure 7. For example, for the block copolymer DNA sequences, we obtain

$$T_0 = \frac{k_{off} + k_{on}\left[(L/2 - P_A) + e^{\varepsilon}(L/2 - P_B)\right]}{k_{on} k_{off}\left(1 + P_A + e^{\theta\varepsilon} P_B\right)}, \tag{20}$$

where

$$P_i = \frac{x_i^{-L/2} - x_i^{L/2}}{(1 - x_i)\left(x_i^{L/2} + x_i^{-1-L/2}\right)}, \tag{21}$$

$$x_i = \frac{2u_i + k_{off}^{(i)} - \sqrt{\left(2u_i + k_{off}^{(i)}\right)^2 - 4u_i^2}}{2u_i}, \tag{22}$$

for $i = A$ or $B$. The expressions for the mean search time for alternating sequences are quite bulky and can be found in Ref. […].

The results of our calculations are presented in Figure 8, where the ratio of the mean search times for the alternating and block copolymer sequences is plotted. The analysis of this figure produces several interesting observations. First, we see that three dynamic search regimes are also found in this system, and the effect of sequence heterogeneity on protein search dynamics depends on the nature of the dynamic phase. In the jumping regime, when the protein does not slide along the DNA contour ($\lambda < 1$), the arrangement of the subunits does not matter because the target can be reached only directly from the solution. The effect appears in the sliding regime ($1 < \lambda < L$), where in most cases the search on alternating sequences is faster. This can be explained by noticing that the search time in this dynamic phase is proportional to $L/\lambda$, which gives the average number of cycles before the protein can find the target. In the block copolymer sequence, the protein mostly comes to the target from the $B$ segment because of stronger interactions with these sites, i.e., it comes from one side of the DNA molecule. In the alternating sequences, the protein can reach the target from both sides of DNA, and this lowers the overall search time. It can be shown analytically that the scanning length on the alternating segment is larger than the scanning length for the $B$ segment, i.e., $\lambda_{AB} > \lambda_B$. Then the search is faster for the alternating sequences because $L/\lambda_{AB} < L/\lambda_B$, i.e., the number of searching cycles is lower for the alternating sequences, which helps to find the target faster. The only deviation from this picture is found for $ATA$ sequences, which correspond to having two $A$ sites around the target site, where for a small range of parameters the search is slower than in the block copolymer sequence. This effect can be explained by the fact that the protein does not stay at $A$ sites for a long time and moves away quickly, effectively increasing the barrier to enter the target via DNA. Thus, our theory predicts that the composition of the DNA flanking sites around the target sequences might affect the dynamics of reaching them. It is interesting to note that recent experiments are consistent with our theoretical predictions.

In the random-walk regime (1D search, $\lambda > L$), the effect of the sequence heterogeneity is even stronger. The protein molecule finds the specific binding site up to 2 times faster for the more heterogeneous alternating DNA sequences. To understand this behavior, we note that in this case the mean first-passage time to reach the target is a sum of residence times on the DNA sites, since the protein will not dissociate until the target is located, so that all trajectories to the target are one-dimensional. Because the target is in the middle of the chain, the mean time to reach the target for the block copolymer sequence can be approximated as $T_0 \simeq (L/[\ldots])\,[\ldots]\,\tau_B$, where $\tau_B$ is the average residence time on any site $B$. The protein prefers to start the search at any position on the $B$ segment with equal probability, i.e., the distance to the target varies from 0 to $L/2$. Then, the average starting position of the protein is $L/4$ […] $T_0 \simeq (L/[\ldots])\,[\ldots]\,\tau_A + (L/[\ldots])\,[\ldots]\,\tau_B$ ($\tau_A$ is the residence time on $A$ sites). The protein spends much less time on $A$ subunits, and this leads to a faster search for the alternating DNA sequences. For $\tau_A \ll \tau_B$, this also explains the factor of 2 in the search speed. In this case, the $B$ subunits can be viewed as effective traps that slow down the search dynamics. Thus, our theoretical calculations make the surprising prediction that the sequence heterogeneity almost always leads to a faster protein search for targets on DNA, despite the fact that it lowers the effective protein-DNA binding affinity. And the stronger the contribution of the 1D search modes, the more relevant the effect of sequence heterogeneity will be.

The Effect of Crowding on DNA in the Protein Target Search

Living cells are typically crowded with a large number of molecules, and many of them are attached to the DNA chains. This should hinder the fast protein search for targets on DNA, and earlier theoretical studies supported this expectation. Surprisingly, however, experiments show that crowding on DNA does not much affect the effectiveness of the protein target search, and this was also found in MD simulations. By applying the discrete-state stochastic approach, we were able to clarify the role of crowding on DNA in the protein target search.

To analyze this problem, the model illustrated in Figure 9 is considered. There is a single DNA molecule with $L + 1$ binding sites, and one of them is the target (at the site $m$). On the DNA chain there is also a crowding particle that can diffuse with a rate $u_{ob}$, but it cannot leave DNA. A single protein molecule starts from the solution (state 0), and it can bind to any site on DNA that is not occupied by the crowder with a rate $k_{on}$ (rate per site). The bound protein molecule can diffuse with a rate $u$, and there is an exclusion interaction between the protein and the crowder. Finally, the protein molecule can dissociate from DNA to the bulk solution with a rate $k_{off}$: see Figure 9.
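The model described above can be simulated directly with a continuous-time (Gillespie-type) Monte Carlo scheme. The sketch below is only an illustration of that idea, not the simulation code behind the results discussed in this review. The function names, the reflecting chain ends, the uniformly random initial crowder position (excluding the target), and the fact that the crowder may temporarily occupy the target site are all assumptions made for this example.

```python
import random


def search_time(L, m, k_on, k_off, u, u_ob, rng):
    """Return one first-passage time of the protein to the target site m.

    Sites are numbered 1..L+1 with reflecting ends (assumption of this sketch).
    pos == 0 means the protein is in the bulk solution (state 0).
    """
    n_sites = L + 1
    sites = range(1, n_sites + 1)
    # Assumed initial condition: crowder on a random non-target site.
    ob = rng.choice([s for s in sites if s != m])
    pos = 0
    t = 0.0
    while pos != m:
        events = []  # (rate, kind, new_site)
        if pos == 0:
            # Binding to any of the L sites not occupied by the crowder.
            events.append((k_on * (n_sites - 1), "bind", 0))
        else:
            events.append((k_off, "unbind", 0))
            for new in (pos - 1, pos + 1):
                if 1 <= new <= n_sites and new != ob:  # exclusion
                    events.append((u, "slide", new))
        for new in (ob - 1, ob + 1):
            if 1 <= new <= n_sites and new != pos:  # exclusion
                events.append((u_ob, "crowder", new))
        events = [e for e in events if e[0] > 0.0]
        total = sum(rate for rate, _, _ in events)
        t += rng.expovariate(total)  # waiting time to the next event
        r = rng.random() * total
        for rate, kind, new in events:  # pick an event with probability rate/total
            r -= rate
            if r < 0.0:
                break
        if kind == "bind":
            pos = rng.choice([s for s in sites if s != ob])
        elif kind == "unbind":
            pos = 0
        elif kind == "slide":
            pos = new
        else:
            ob = new
    return t


def mean_search_time(n_runs, seed, **params):
    """Average search_time over n_runs independent trajectories."""
    rng = random.Random(seed)
    return sum(search_time(rng=rng, **params) for _ in range(n_runs)) / n_runs
```

A call such as `mean_search_time(2000, 1, L=100, m=50, k_on=0.1, k_off=1.0, u=100.0, u_ob=1.0)` gives an estimate for one illustrative parameter set; the values are placeholders and are not taken from the figures.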

Figure 9: A schematic view of the protein target search in the presence of a moving obstacle on DNA. The crowding particle cannot dissociate from DNA, while the protein molecule can dissociate into the solution, labeled as state 0, and return back to the DNA chain.

Investigating the model with the mobile crowding particle on DNA first using Monte Carlo computer simulations, it is found that there are three search regimes depending on the main length scales in the system. This is shown in Figure 10 for the mean search times to find the target as a function of the scanning length $\lambda$. We can understand the complex dynamics in this system using the following arguments. If the diffusion rate of the crowder is much smaller than the other rates ($u_{ob} \ll u$, $k_{on}$ and $k_{off}$), then the protein molecule will find the target before the crowding particle can move away from its original location. But we already explicitly solved the problem of the protein target search with static obstacles using the same discrete-state stochastic approach with the first-passage analysis. Then the mean search time in the system with a movable crowder can be approximated as the average over all possible static locations of the crowding particle, yielding

$$\langle T_0 \rangle \simeq \frac{1}{L}\left(\sum_{l_{ob}=1}^{m-1} T_{ob}(l_{ob}) + \sum_{l_{ob}=1}^{L-m} T_{ob}(l_{ob})\right), \tag{23}$$

where

$$T_{ob} = \frac{k_{off} + k_{on}\left(L - S_{ob}(0)\right)}{k_{on} k_{off}\, S_{ob}(0)} \tag{24}$$

is the mean search time with the static obstacle located at a distance $l_{ob}$ from the target. The auxiliary function $S_{ob}$ is given by

$$S_{ob}(s) = \frac{y\left(y^{-m} - y^{m}\right)}{(1 - y)\left(y^{1-m} + y^{m}\right)} + \frac{y\left(1 - y^{2l_{ob}-2}\right)}{(1 - y)\left(1 + y^{2l_{ob}-1}\right)} \tag{25}$$

with the parameter $y$ specified in Equation 11.

This simple approximate theory works quite well in the dynamic regimes where 3D pathways are important for the search ($\lambda < L$). However, these theoretical arguments fail in the random-walk regime where 1D dynamics dominate the search.
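The slow-crowder approximation of Equations 23-25 is easy to evaluate numerically. The following sketch is a hedged illustration under several assumptions. Equation 11 is not reproduced in this section, so `y_param` assumes that $y$ at $s = 0$ has the same form as $x_i$ in Equation 22. The mirrored value `L - m + 1` used when the obstacle sits on the other side of the target is also an assumption. The average is divided by the number of positions actually summed, which is close to the $1/L$ of Equation 23 for long chains. All function names are hypothetical.

```python
import math


def y_param(u, k_off):
    """Assumed s = 0 form of y: [a - sqrt(a^2 - 4u^2)] / (2u) with a = 2u + k_off.

    Rewritten as 2u / (a + sqrt(a^2 - 4u^2)) to avoid cancellation errors.
    """
    a = 2.0 * u + k_off
    return 2.0 * u / (a + math.sqrt(a * a - 4.0 * u * u))


def s_ob(y, n_open, l_ob):
    """Eq. 25 at s = 0; n_open plays the role of m on the obstacle-free side.

    Both terms are multiplied through so that only non-negative powers of y
    appear, which is algebraically equivalent and avoids overflow for small y.
    """
    open_side = (1.0 - y ** (2 * n_open)) / ((1.0 - y) * (1.0 + y ** (2 * n_open - 1)))
    blocked_side = y * (1.0 - y ** (2 * l_ob - 2)) / ((1.0 - y) * (1.0 + y ** (2 * l_ob - 1)))
    return open_side + blocked_side


def t_ob(L, n_open, l_ob, k_on, k_off, u):
    """Eq. 24: mean search time with a static obstacle at distance l_ob."""
    S = s_ob(y_param(u, k_off), n_open, l_ob)
    return (k_off + k_on * (L - S)) / (k_on * k_off * S)


def mean_time_slow_crowder(L, m, k_on, k_off, u):
    """Eq. 23: average of T_ob over the static crowder positions."""
    # Obstacle at distance 1..L-m on one side: the open side is described by m.
    one_side = [t_ob(L, m, l, k_on, k_off, u) for l in range(1, L - m + 1)]
    # Obstacle at distance 1..m-1 on the other side: mirrored m (assumption).
    other_side = [t_ob(L, L - m + 1, l, k_on, k_off, u) for l in range(1, m)]
    times = one_side + other_side
    return sum(times) / len(times)
```

For instance, `mean_time_slow_crowder(1000, 500, 0.1, 1.0, 1.0e5)` evaluates the approximation for one illustrative set of rates. In the limit of no sliding ($u \to 0$) the function `t_ob` reduces to $[k_{off} + k_{on}(L - 1)]/(k_{on} k_{off})$, as expected when only direct binding to the target is productive.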
These results are expected. The protein molecule that collides with the crowding particle on DNA in dynamic regimes with 3D pathways will have the opportunity to dissociate into the bulk solution and thus avoid the blocking effect. But in the random-walk regime (1D search) there is no such opportunity, and the search times will definitely increase. Computer simulations also indicate that the search times in this regime depend on the diffusivity of the crowding particle. The search is faster for more mobile crowders: see Figure 10.

[Figure 10 plot: mean search time (s) versus scanning length $\lambda$ (bp); curves for no obstacles, a static obstacle, and several values of $u_{ob}$.]

Figure 10: Mean search times to find the target in the system with a mobile crowder on DNA. The DNA chain has $L = 1000$ sites, and the target is in the middle of the chain, $m = L/2$. Parameters used for calculations are $k_{on} = 0.\ldots$ s$^{-1}$, $u = 10^{\ldots}$ s$^{-1}$, and variable $u_{ob}$. Solid curves correspond to analytical results for DNA without obstacles and for DNA with a static obstacle, which are averaged over all initial positions of the crowder. Symbols correspond to Monte Carlo computer simulations. Dashed lines describe the approximate theory, as explained in the text. Adapted with permission from Ref.

The dynamics in the random-walk regime can be explained using the following arguments. The overall search time can be viewed as consisting of two terms,

$$\langle T_{ob} \rangle \simeq T_0 + \langle T_{bl} \rangle, \tag{26}$$

where $T_0$ is the search time in the random-walk regime without any crowders, given in Equation 13. The second term is the average time it takes for the crowder to diffuse away and clear the path for the protein to reach the target without interference. It was shown that this blocking time $T_{bl}$ depends on the location of the target and the diffusion rate of the crowding particle $u_{ob}$,

$$\langle T_{bl} \rangle = \frac{m^{\ldots} + (L - m)^{\ldots}}{u_{ob}\left(L^{\ldots} + m^{\ldots} - mL\right)}. \tag{27}$$

These simple theoretical arguments show excellent agreement with Monte Carlo computer simulations: see dashed lines in Figure 10. But more importantly, they provide a clear molecular picture of the role of crowding on DNA in the protein target search. If the protein search is dominated by 1D pathways and the mobility of the crowder is low, the search dynamics will be significantly slowed down. But if the search involves mostly 3D pathways and the crowder is mobile, the mean search times will not be affected much. It seems that real biological systems operate in the 3D+1D regime, and crowding particles diffuse with rates comparable to those of the searching proteins ($u \sim u_{ob}$). Then one might conclude that the effect of the crowders on DNA should be minimal. This fully agrees with experimental observations and with results from MD simulations.

Conclusions and Future Directions

Although protein search for targets on DNA is a very complex phenomenon that involves multiple biochemical and biophysical processes, significant advances in our understanding of the underlying molecular mechanisms have been achieved in recent years. A major part of this success is due to the analysis of these systems using the discrete-state stochastic framework supplemented by explicit calculations via the first-passage probabilities method. In this review, we presented and explained this theoretical approach by considering the protein target search in various systems. It is important to emphasize that the main advantage of our theoretical approach is the ability to obtain analytical results that clarify the physics of the underlying processes. In addition, the method can be easily extended in many directions, as shown in this work, as well as to other cases that we did not discuss here, such as the role of conformational transitions and the effect of DNA loop formation in the protein target search. Furthermore, calculations within this theoretical framework were successful in explaining the experimental observations on homology search by RecA protein filaments and the dynamics of CRISPR genome interrogation.

Acknowledgement

A.B.K. acknowledges the support from the Welch Foundation (C-1559), from the NSF (CHE-1664218), and from the Center for Theoretical Biological Physics sponsored by the NSF (PHY-1427654).

References

(1) Alberts, B. et al., Molecular Biology of the Cell, 6th ed., Garland Science, New York, 2014.
(2) Lodish, H. et al., Molecular Cell Biology, 6th ed., W. H. Freeman, New York, 2007.
(3) Phillips, R.; Kondev, J.; Theriot, J. Physical Biology of the Cell, 2nd ed., Garland Science, New York, 2012.
(4) Van Kampen, N.G. Stochastic Processes in Physics and Chemistry, 3rd ed., North Holland, Amsterdam, 2007.
(5) Redner, S. A Guide to First-Passage Processes, Cambridge University Press, Cambridge, 2001.
(6) Riggs, A.D.; Bourgeois, S.; Cohn, M. The lac-represser-operator interaction: III. Kinetic studies. J. Mol. Biol. , , 401-417.
(7) Berg, O.G.; Winter, R.B.; von Hippel, P.H. Diffusion-driven mechanisms of protein translocation on nucleic acids: I. Models and theory. Biochemistry , , 6929-6948.
(8) Berg, O.G.; von Hippel, P.H. Diffusion-controlled macromolecular interactions. Annu. Rev. Biophys. Biophys. Chem. , , 131-160.
(9) Gowers, D.M.; Wilson, G.G.; Halford, S.E. Measurements of the contributions of 1D and 3D pathways to the translocation of protein along DNA. Proc. Natl. Acad. Sci. USA , , 15883-15888.
(10) Halford, S.E.; Marko, J.F. How do site-specific DNA-binding proteins find their targets? Nucl. Acids Res. , , 3040-3052.
(11) Mirny, L.; Slutsky, M.; Wunderlich, Z.; Tafvizi, A.; Leith, J.; Kosmrlj, A. How a protein searches for its site on DNA: the mechanism of facilitated diffusion. J. Phys. A: Math. Theor. , , 434019.
(12) Kolomeisky, A.B. Physics of protein-DNA interactions: Mechanisms of facilitated target search. Phys. Chem. Chem. Phys. , , 2088-2095.
(13) Hu, T.; Grosberg, A.Y.; Shklovskii, B.I. How proteins search for their specific sites on DNA: the Role of DNA conformations. Biophys. J. , , 2731-2744.
(14) Hu, L.; Grosberg, A.Y.; Bruinsma, R. Are DNA transcription factor proteins maxwellian demons? Biophys. J. , , 1151-1156.
(15) Bauer, M.; Metzler, R. Generalized facilitated diffusion model for DNA-binding proteins with search and recognition states. Biophys. J. , , 2321-2330.
(16) Sheinman, M.; Benichou, O.; Kafri, Y.; Voituriez, R. Classes of fast and specific search mechanisms for proteins on DNA. Rep. Progr. Phys. , , 026601.
(17) Veksler, A.; Kolomeisky, A.B. Speed-selectivity paradox in the protein search for targets on DNA: Is it real or not? J. Phys. Chem. B , , 12695-12701.
(18) Lange, M.; Kochugaeva, M.; Kolomeisky, A.B. Protein search for multiple targets on DNA. J. Chem. Phys. , , 105102.
(19) Lange, M.; Kochugaeva, M.; Kolomeisky, A.B. Dynamics of the protein search for targets on DNA in the presence of traps. J. Phys. Chem. B , , 12410-12416.
(20) Shvets, A.A.; Kolomeisky, A.B. Sequence heterogeneity accelerates protein search for targets on DNA. J. Chem. Phys. , 245101.
(21) Shvets, A.A.; Kolomeisky, A.B. Crowding on DNA in protein search for targets. J. Phys. Chem. Lett. , , 2502-2506.
(22) Shvets, A.A.; Kochugaeva, M.; Kolomeisky, A.B. The role of static and dynamic obstacles in the protein search for targets on DNA. J. Phys. Chem. B , , 5802-5809.
(23) Shvets, A.A.; Kolomeisky, A.B. The role of DNA looping in the search for specific targets on DNA by multisite proteins. J. Phys. Chem. Lett. , , 5022-5027.
(24) Kochugaeva, M.P.; Shvets, A.A.; Kolomeisky, A.B. How conformational dynamics influences the protein search for targets on DNA. J. Phys. A: Math. Theor. , , 444004.
(25) Kochugaeva, M.P.; Berezhkovskii, A.A.; Kolomeisky, A.B. Optimal length of conformational transitions region in the protein search for targets on DNA. J. Phys. Chem. Lett. , , 4049-4054.
(26) Shin, J.; Kolomeisky, A.B. Surface-assisted dynamic search processes. J. Phys. Chem. B , , 2243-2250.
(27) Esadze, A.; Kemme, C.A.; Kolomeisky, A.B.; Iwahara, J. Positive and negative impacts of nonspecific sites during target location by a sequence-specific DNA-binding protein: Origin of the optimal search at physiological ionic strength. Nucl. Acids Res. , , 7039-7046.
(28) Kochugaeva, M.P.; Shvets, A.A.; Kolomeisky, A.B. On the mechanism of homology search by RecA protein filaments. Biophys. J. , 859-867.
(29) Shvets, A.A.; Kolomeisky, A.B. Mechanism of genome interrogation: How CRISPR RNA-guided Cas9 proteins locate specific targets on DNA. Biophys. J. , , 1416-1424.
(30) Tafvizi, A.; Huang, F.; Fersht, A.R.; Mirny, L.A.; van Oijen, A.M. A single-molecule characterization of p53 search on DNA. Proc. Natl. Acad. Sci. USA , , 563-568.
(31) Slutsky, M.; Mirny, L.A. Kinetics of protein-DNA interaction: facilitated target location in sequence-dependent potential. Biophys. J. , , 4021-4035.
(32) Benichou, O.; Kafri, Y.; Sheinman, M.; Voituriez, R. Searching fast for a target on DNA without falling to traps. Phys. Rev. Lett. , , 138102-138104.
(33) Hammar, P.; Leroy, P.; Mahmutovic, A.; Marklund, E.G.; Berg, O.G.; Elf, J. The lac Repressor displays facilitated Diffusion in Living Cells. Science , , 1595-1598.
(34) Mahmutovic, A.; Berg, O.G.; Elf, J. What matters for Lac repressor search in vivo - sliding, hopping, intersegment transfer, crowding on DNA or recognition? Nucl. Acids Res. , , 3454-3464.
(35) Cuculis, L.; Abil, Z.; Zhao, H.; Schroeder, C.M. Direct observation of TALE protein dynamics reveals a two-state search mechanism. Nat. Commun. , .
(36) Zandarashvili, L.; Esadze, A.; Vuzman, D.; Kemme, C.A.; Levy, Y.; Iwahara, J. Balancing between affinity and speed in target DNA search by zinc-finger proteins via modulation of dynamic conformational ensemble. Nucl. Acids Res. , , E5142-E5149.
(37) Reingruber, J.; Holcman, D. Transcription factor search for a DNA promoter in a three-state model. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. , , 020901-020904.
(38) Koslover, E.F.; de la Rosa, M.A.D.; Spakowitz, A.J. Theoretical and computational modeling of target-site search kinetics in vitro and in vivo. Biophys. J. , , 856-865.
(39) Chu, X.; Liu, F.; Maxwell, B.A.; Wang, Y.; Suo, Z.; Wang, H.; Han, W.; Wang, J. Dynamic conformational change regulates the protein-DNA recognition: an investigation on binding of a Y-family polymerase to its target DNA. PLoS Comput. Biol. , , e1003804.
(40) Townson, S.A.; Samuelson, J.C.; Bao, Y.; Xu, S.-Y.; Aggarwal, A.K. BstYI Bound to Noncognate DNA Reveals a "Hemispecific" Complex: Implications for DNA Scanning. Structure , , 449-459.
(41) Bauer, M.; Rasmussen, E.S.; Lomholt, M.A.; Metzler, R. Real sequence effects on the search dynamics of transcription factors on DNA. Sci. Rep. , , 10072.
(42) Brackley, A.A.; Cates, M.A.; Marenduzzo, D. Facilitated diffusion on mobile DNA: Configurational traps and sequence heterogeneity. Phys. Rev. Lett. , , 168103.
(43) Afek, A.; Sela, I.; Musa-Lempel, N.; Lukatsky, D.B. Nonspecific transcription-factor-DNA binding influences nucleosome occupancy in yeast. Biophys. J. , , 2465-2475.
(44) Afek, A.; Lukatsky, D.B. Nonspecific protein-DNA binding is widespread in the yeast genome. Biophys. J. , , 1881-1888.
(45) Afek, A.; Lukatsky, D.B. Positive and negative design for nonconsensus protein-DNA binding affinity in the vicinity of functional binding sites. Biophys. J. , , 1653-1660.
(46) Afek, A.; Schipper, J.L.; Horton, J.; Gordan, R.; Lukatsky, D.B. Protein-DNA binding in the absence of specific base-pair recognition. Proc. Natl. Acad. Sci. USA , , 17140-17145.
(47) Le, D.D.; Shimko, T.C.; Aditham, A.K.; Keys, A.M.; Longwell, S.A.; Orenstein, Y.; Fordyce, P.M. Comprehensive, high-resolution binding energy landscapes reveal context dependencies of transcription factor binding. Proc. Natl. Acad. Sci. USA , , E3702-E3711.
(48) Marcovitz, A.; Levy, Y. Obstacles may facilitate and direct DNA search by proteins. Biophys. J. , , 2042-20152.
(49) Gomez, D.; Klumpp, S. Facilitated diffusion in the presence of obstacles on the DNA. Phys. Chem. Chem. Phys. ,18