[PDF] Dynamics of transcription factor binding site evolution

Abstract

Evolution of gene regulation is crucial for our understanding of the phenotypic differences between species, populations and individuals. Sequence-specific binding of transcription factors to the regulatory regions on the DNA is a key regulatory mechanism that determines gene expression and hence heritable phenotypic variation. We use a biophysical model for directional selection on gene expression to estimate the rates of gain and loss of transcription factor binding sites (TFBS) in finite populations under both point and insertion/deletion mutations. Our results show that these rates are typically slow for a single TFBS in an isolated DNA region, unless the selection is extremely strong. These rates decrease drastically with increasing TFBS length or increasingly specific protein-DNA interactions, making the evolution of sites longer than ~10 bp unlikely on typical eukaryotic speciation timescales. Similarly, evolution converges to the stationary distribution of binding sequences very slowly, making the equilibrium assumption questionable. The availability of longer regulatory sequences in which multiple binding sites can evolve simultaneously, the presence of "pre-sites" or partially decayed old sites in the initial sequence, and biophysical cooperativity between transcription factors, can all facilitate gain of TFBS and reconcile theoretical calculations with timescales inferred from comparative genetics.

Full PDF

DDynamics of transcription factor binding site evolution

Murat Tu˘grul ∗ , Tiago Paix˜ao, Nicholas H. Barton, Gaˇsper Tkaˇcik Institute of Science and Technology Austria, Am Campus 1, A-3400 Klosterneuburg, Austria (Dated: October 18, 2018)Evolution of gene regulation is crucial for our understanding of the phenotypic diﬀerences be-tween species, populations and individuals. Sequence-speciﬁc binding of transcription factors tothe regulatory regions on the DNA is a key regulatory mechanism that determines gene expressionand hence heritable phenotypic variation. We use a biophysical model for directional selection ongene expression to estimate the rates of gain and loss of transcription factor binding sites (TFBS)in ﬁnite populations under both point and insertion/deletion mutations. Our results show thatthese rates are typically slow for a single TFBS in an isolated DNA region, unless the selection isextremely strong. These rates decrease drastically with increasing TFBS length or increasingly spe-ciﬁc protein-DNA interactions, making the evolution of sites longer than ∼

10 bp unlikely on typicaleukaryotic speciation timescales. Similarly, evolution converges to the stationary distribution ofbinding sequences very slowly, making the equilibrium assumption questionable. The availability oflonger regulatory sequences in which multiple binding sites can evolve simultaneously, the presenceof “pre-sites” or partially decayed old sites in the initial sequence, and biophysical cooperativitybetween transcription factors, can all facilitate gain of TFBS and reconcile theoretical calculationswith timescales inferred from comparative genomics.

Author Summary

Evolution has produced a remarkable diversity of living forms that manifests in qualitative diﬀerences as wellas quantitative traits. An essential factor that underlies this variability is transcription factor binding sites, shortpieces of DNA that control gene expression levels. Nevertheless, we lack a thorough theoretical understanding ofthe evolutionary times required for the appearance and disappearance of these sites. By combining a biophysicallyrealistic model for how cells read out information in transcription factor binding sites with model for DNA sequenceevolution, we explore these timescales and ask what factors crucially aﬀect them. We ﬁnd that the emergence ofbinding sites from a random sequence is generically slow under point and insertion/deletion mutational mechanisms.Strong selection, suﬃcient genomic sequence in which the sites can evolve, the existence of partially decayed oldbinding sites in the sequence, as well as certain biophysical mechanisms such as cooperativity, can accelerate thebinding site gain times and make them consistent with the timescales suggested by comparative analyses of genomicdata.

INTRODUCTION

Evolution produces heritable phenotypic variation within and between populations and species on relatively shorttimescales. Part of this variation is due to diﬀerences in gene regulation, which determines how much of eachgene product exists in every cell. These gene expression levels are heritable quantitative traits subject to naturalselection [1–3]. While the importance of their variability for the observed phenotypic variation is still debated [4], it isbelieved to be crucial within closely related species or in populations whose proteins are functionally or structurallysimilar [5]. The genetic basis for gene expression diﬀerences is thought to be non-coding regulatory DNA, but ourunderstanding of its evolution is still immature; this is due, in part, to the lack of precise knowledge about the mappingbetween the regulatory sequence and the resulting expression levels.Transcriptional regulation is the most extensively studied mechanism of gene regulation. Transcription factorproteins (TFs) recognize and bind speciﬁc DNA sequences called binding sites, thereby aﬀecting the expression oftarget genes. Eukaryotic regulatory sequences, i.e., enhancers and promoters, are typically between a hundred andseveral thousand base pairs (bp) in length [6], and can harbor many transcription factor binding sites (TFBSs), eachtypically consisting of 6 −

12 bp. The situation is diﬀerent in prokaryotes: they lack enhancer regions and have one ora few TFBSs which are typically longer, between 10 to 20 bp in length [7, 8]. Diﬀerences in TF binding are thought toarise primarily due to changes in the regulatory sequence at the TF binding sites rather than changes in the cellular ∗ [email protected] a r X i v : . [ q - b i o . P E ] N ov environment or the TF proteins themselves [10]. Nevertheless, a theoretical understanding of the relationship betweenthe evolution of the regulatory sequence and the evolution of gene expression levels remains elusive, mostly becauseof the complex interaction of evolutionary forces and biophysical processes [11].From the evolutionary perspective, the crucial question is whether and when these regulatory sequences can evolverapidly enough so that new phenotypic variants can arise and ﬁx in the population over typical speciation timescales.Comparative genomic studies in eukaryotes provide evidence for the evolutionary dynamics of TF binding, highlightingthe possibility for rapid and ﬂexible TFBS gain and loss between closely related species on timescales of as little as afew million years [12, 13]. Examples include quick gain and loss events that cause divergent gene expression [14], or thecompensation of such events by turn-over at other genome locations [15]; gain and loss events sometimes occur evenin the presence of strong constraints on expression levels [16, 17]. Furthermore, such events enabled new binding siteson sex chromosomes that arose as recently as 1 − . − .

000 years [2, 20–22]. Onthe other hand, strict conservation has also been observed at orthologous regulatory locations even in distant species(e.g., [23]). Taken together, these facts suggest that the rates of TFBS evolution can extend over many orders ofmagnitude and diﬀer greatly from the point mutation rate at a neutral site. To study the evolutionary dynamics ofregulatory sequences and understand the relevant timescales, we set up a theoretical framework with a special focuson the interplay of both population genetic and biophysical factors, brieﬂy outlined below.Sequence innovations originate from diverse mutational mechanisms in the genome. While tandem repeats [24] ortransposable elements [25] may be important in evolution, the better studied and more widespread mutation types stillneed to be better understood in the context of TFBS evolution. Speciﬁcally, we ask how the evolutionary dynamicsare aﬀected by single nucleotide (point) mutations, as well as by insertions and deletions (indels). New mutations inthe population are selected or eliminated by the combined eﬀects of selection and random genetic drift. Although theimportance of selection [26–28] and mutational closeness of the initial sequences [29, 30] for TF binding site evolutionhas already been reported, the belief in fast evolution via point mutations without selection (i.e., neutral evolution)persists in the literature (e.g.,[5, 13]), mainly due to Stone & Wray’s (2001) misinterpretation of their own simulationresults [31] (see Macarthur & Brookﬁeld (2004) [29]). This likely reﬂects the current lack of theoretical understandingof TFBS evolution in the literature, even under the simplest case of directional selection. Basic population geneticsshows that directional selection is expected to cause a change, e.g., yield a functional binding site, over times onthe order of 1 / ( N sU b ), where N is the population size, s is the selection advantage of a binding site, and U b is thebeneﬁcial mutation rate [32]. This process can be extremely slow, especially under neutrality, if several mutationalsteps are needed to reach a sequence with suﬃcient binding energy to confer a selective advantage. As already pointedout by Berg et al. (2004) [32], this places strong constraints on the length of the binding sites, if they were to evolvefrom random sequences.Several biophysical factors, such as TF concentration and the energetics of TF-DNA and TF-TF interactions, mightplay an important role in TFBS evolution. Quantitative models for TF sequence speciﬁcity [33–38] and for thermo-dynamic (TD) equilibrium of TF occupancy on DNA [34, 39–43] were developed in recent decades and, in parallelwith developments in sequencing, have contributed to our understanding of TF-DNA interaction biophysics. Thesebiophysical factors can shape the characteristics of the TFBS ﬁtness landscape over genotype space in evolution-ary models [8, 29, 32, 44–47]. There are also intensive eﬀorts to understand the mapping from promoter/enhancersequences to gene expression [42, 48–50]. Despite this recent attention, there have been relatively few attempts tounderstand the evolutionary dynamics of TFBS in full promoter/enhancer regions [29, 43, 51–53], especially using bio-physically realistic but still mathematically tractable models. Such models are necessary to gain a thorough theoreticalunderstanding of binding site evolution.Our aim in this study is to investigate the dynamics of TFBS evolution by focusing on the typical evolutionaryrates for individual TFBS gain and loss events. We consider both a single binding site at an isolated DNA regionand a full enhancer/promoter region, able to harbor multiple binding sites. In the following section, we lay out ourmodeling framework, which covers both population genetic and biophysical considerations, as outlined above. Usingthis framework, we try to understand i) what typical gain and loss rates are for a single TFBS site; ii) how quicklypopulations converge to a stationary distribution for a single TFBS; iii) how multiple TFBS evolve in enhancersand promoters; iv) how early history of the evolving sequences can change the evolutionary rates of TFBS; and v) how cooperativity between TFs aﬀects the evolution of gene expression. We ﬁnd that, under realistic parameterranges, both gain and loss of a single binding site is slow, slower than the typical divergence time between species.Importantly, fast emergence of an isolated TFBS requires strong selection and favorable initial sequences in themutational neighborhood of a strong TFBS. The evolutionary process approaches the equilibrium distribution veryslowly, raising concerns about the use of equilibrium assumptions in theoretical work. We proceed to show that thedynamics of TFBS evolution in larger sequences can be understood approximately from the dynamics of single bindingsites; the TFBS gain times are again slow if evolution starts from random sequence in the absence of strong selectionor large regulatory sequence “real estate.” Finally, we identify two factors that can speed up the emergence of TFBS:the existence of an initial sequence distribution biased towards the mutational neighborhood of strongly bindingsequences, which suggests that ancient evolutionary history can play a major role in the emergence of “novelties” [54];and the biophysical cooperativity between transcription factors, which can partially account for the lack of observedcorrelation between identiﬁable binding sequences and transcriptional activity [11]. MODELS & METHODSPopulation genetics

We consider a ﬁnite population of N diploid individuals whose genetic content consists of an evolvable L base pair(bp) contiguous regulatory sequence σ to which TFs can bind. Given that σ i ∈ { A, C, G, T } where i = 1 , , ..., L indexes the position in regulatory sequence, there are 4 L diﬀerent regulatory sequences in the genotype space. EachTF is assumed to bind to a contiguous sequence of n bp within our focal region of L bp (Fig. 1A,B). Regulatorysequences evolve under mutation, selection, and sampling drift. The rest of the genome is assumed to be identicalfor all individuals and is kept constant. In the ﬁrst part of our study we consider the regulatory sequence comprisedof a single TFBS (i.e. L = n ). Later, we consider the evolution of a longer sequence (i.e. L (cid:29) n ) in which morethan one TFBS can evolve. For simulations, we use a Wright-Fisher model where N diploid individuals are sampledfrom the previous generation after mutation and selection. Our analytical treatment is general and corresponds tosetups where a diﬀusion approximation to allele frequency evolution is valid. We neglect recombination since typicalregulatory sequences are short, L ≤ N individuals.Evolutionary dynamics simplify in the low mutation limit where the population consists of a single genotype duringmost of its evolutionary history (the ﬁxed state population model). Desai & Fisher [55] have shown that the condition log 4 N ∆ f ∆ f (cid:28) NU b ∆ f needs to hold for a ﬁxed state population assumption to be accurate. The term on the left isthe establishment time of a mutant allele with a selective advantage ∆ f relative to the wild type; the term on theright-hand side is the waiting time for such an allele to appear, where U b is the beneﬁcial mutation rate per individualper generation. Note that, in binding site context, U b refers to the rate of mutations which increase the ﬁtness, forinstance, by increasing binding strength. Its exact value depends on the current state of the genotype; nevertheless,typical value estimates help model the evolutionary dynamics. In multicellular eukaryotes, where most evidence forthe evolution of TFBSs has been collected and which provide the motivation for this manuscript, the number ofmutations per nucleotide site is typically low, e.g. 4 N u ∼ .

01 in

Drosophila and 4

N u ∼ .

001 in humans [56], where u is the point mutation rate per generation per base pair. For a single binding site of typical length n ∼ −

15, onetherefore expects the ﬁxed state population model to be accurate. For longer regulatory sequences, one expects thatbeneﬁcial mutations are rare among all possible mutations, so that the ﬁxed state population model can be assumedto hold as well.Evolution under the ﬁxed state assumption can be treated as a simple Markovian jump process. The transitionrate from a regulatory sequence σ to another regulatory sequence σ (cid:48) in a diploid population is R σ (cid:48) ,σ = 2 N U σ (cid:48) ,σ P ﬁx ( N, ∆ f σ (cid:48) ,σ ) (1)where ∆ f σ (cid:48) ,σ = f ( σ (cid:48) ) − f ( σ ) is the ﬁtness diﬀerence and U σ (cid:48) ,σ is the mutation rate from σ to σ (cid:48) . The ﬁxationprobability P ﬁx of a mutation with ﬁtness diﬀerence ∆ f in a diploid population of N individuals is P ﬁx ( N, ∆ f ) = 1 − e − f − e − N ∆ f ≈ f − e − N ∆ f , (2)which is based on the diﬀusion approximation [57]. Note that the ﬁxation probability scaled with 1 /N approximatesto 2 N ∆ f when N ∆ f (cid:29)

1. Evolutionary dynamics therefore depend essentially on how regulatory sequences aremutationally connected in genotype space, and how ﬁtnesses diﬀer between neighboring genotypes, i.e., on the ﬁtnesslandscape. genepromoter mRNARNA-polTF .... enhancer

ABC

L bp ... C G T A T C ... G C G T

G A A G T C Cn bppoint mutation, e.g. G C ... C G T A T C ... G C G T

G A A C T C C ... C G T A T C ... G C G T

G A A G T C C ... C G T A T C ... G C G T

G A A G T C T C Cindel mutation, e.g. TC insertion D x x | || | || |||| || | | | ||||||| || ||||||| ||||| | ||| | | ||| | FIG. 1:

Biophysics of transcription regulation. A)

TFs bind to regulatory DNA regions (promoters and enhancers)in a sequence-speciﬁc manner to regulate transcriptional gene expression (mRNA production) level via diﬀerent mechanisms,such as recruiting RNA polymerase (RNA-pol). B) A schematic of two types of mutational processes that we model: pointmutations (left) and indel mutations (right). C) The mismatch binding model results in redundancy of genotype classes, witha binomial distribution (red) of genotypes in each mismatch class (some examples of degenerate sequences shown) D) Themapping from the TFBS regulatory sequence to gene expression level is determined by the thermodynamic occupancy (bindingprobability) of the binding site. If each of the k mismatches from the consensus sequence decreases the binding energy by (cid:15) , theoccupancy of the binding site is π TD ( k ) = (1+ e β ( (cid:15)k − µ ) ) − , where µ is the chemical potential (related to free TF concentration).A typical occupancy curve is shown in black ( (cid:15) = 2 k B T and µ = 4 k B T ); the gray curves show the eﬀect of perturbation tothese parameters ( (cid:15) = 1 k B T , (cid:15) = 3 k B T and µ = 6 k B T ); the orange curve illustrates the case of two cooperatively bindingTFs ( k c = 0 and E c = − k B T , see text for details). We pick two thresholds, shown in dashed lines, to deﬁne discrete bindingclasses: strong S ( π TD > /

3) and weak W ( π TD < / Directional selection on biophysically motivated ﬁtness landscapes

In this study, we focus on directional selection by assuming that ﬁtness f is proportional to gene expression level g which depends on regulatory sequence, i.e. f ( σ ) = s g ( σ ) (3)where s is the selection strength. It is important to note that this choice does not imply that directional selection isthe only natural selection mechanism. It simply aims at obtaining the theoretical upper limits for the rates of gainingand losing binding sites.To analyze a realistic but tractable mapping from the regulatory sequence to ﬁtness, we primarily assume that theproxy for gene expression is the binding occupancy (binding probability) π at a single TF binding site, or the sumof the binding occupancies within an enhancer/promoter region (based on limited experimental support [84]). Thiscorresponds to f ( σ ) = s (cid:88) i π ( i ) ( σ ) (4)where π ( i ) is the binding occupancy of a site starting at the nucleotide i in sequence σ , and s can be interpretedas the selective advantage of a strongest binding to a weakest binding at a site. We assume all binding sites haveequal strength and direction in their contribution towards total gene activation. Sites acting as repressors in oursimple model would enter into Eq (4) with a negative selection strength, s . Future studies developing mathematicallytractable models should consider more realistic case of unequal contribution with combined activator and repressorsites responding diﬀerentially to various regulatory inputs [53]. Although one can postulate diﬀerent scenarios thatmap TF occupancies in a long ( L (cid:29) n ) promoter to gene expression, we chose the simplest case which allows us tomake analytical calculations. Later we relax our assumption on noninteracting binding sites and consider the eﬀectsof several kinds of interactions on gene expression and thus on evolutionary dynamics.The occupancy of the TF on its binding site is assumed to be in thermodynamic (TD) equilibrium [34, 39–43]. Whilethis might not always be realistic [58, 59], there is empirical support for this assumption (particularly in prokaryotes)[48, 60, 61], and more importantly, it is suﬃcient to capture the essential nonlinearity in this genotype-phenotype-ﬁtness mapping [62]. In thermodynamic equilibrium, the binding occupancy at the site starting with the i -th positionin regulatory sequence is given by π ( i )TD ( E i ) = (cid:16) e β ( E i − µ ) (cid:17) − . (5)Here, µ is the chemical potential of the TF (related to its free concentration) [44, 64]; E i is the sequence speciﬁcbinding energy, where lower energy corresponds to tighter binding, and β = ( k B T ) − . We compute the binding energy E i by adopting an additive energy model which is considered to be valid at least up to a few mismatches from theconsensus sequence [37, 38, 65, 66], i.e. E i ( σ ) = i + n − (cid:88) j = i ξ σ j ,j (6)where ξ stands for the energy matrix whose ξ σ j ,j element gives the energetic contribution of the nucleotide σ j appearingat the j -th position within TFBS. With this, Eq (4) can be rewritten more formally as f ( σ ) = s (cid:88) i π ( i )TD ( E i ( σ )) (7)To allow analytical progress, we make the “mismatch assumption,” i.e., the energy matrices contain identical (cid:15) > k mismatches therefore has the binding energy E = k(cid:15) . We will refer to (cid:15) as “speciﬁcity.”Speciﬁcity is provided by diverse interactions between DNA and TF, including speciﬁc hydrogen bonds, van der Waalsforces, steric exclusions, unpaired polar atoms, etc. [63]. (cid:15) is expected to be in the range 1 − k B T , which is consistentwith theoretical arguments [44] as well as direct measurements [65–67]. Note that we explicitly check the validityof the analytical results based on the mismatch assumption by comparing them against simulations using realisticenergy matrices. The redundancy (i.e., normalized number of distinct sequences) of a mismatch class k at a single sitein a random genome can be described by a binomial distribution φ (Fig. 1C) where the probability of encounteringa mismatch class k is φ k ( n, α ) = (cid:18) nk (cid:19) α k (cid:0) − α (cid:1) n − k (8)where α = 3 / µ = 4 k B T , which maps changes in thesequence (mismatch class k ) to a full range of gene expression levels, as shown in Fig. 1D. We subsequently vary µ systematically and report how its value aﬀects the results.After these preliminaries, the equilibrium binding probability of Eq (5) reduces to π TD ( k ) = (cid:16) e β ( (cid:15) k − µ ) (cid:17) − . (9)This function has a sigmoid shape whose steepness depends on speciﬁcity (cid:15) and whose midpoint depends on theratio of chemical potential to speciﬁcity, µ/(cid:15) (Fig. 1D). To simplify discussion, we introduce two classes of sequences:genotypes are associated with “strong binding” S and “weak binding” W if π TD > / π TD < /

3, respectively.The thresholds that we pick are arbitrary, while still placing the midpoint of the sigmoid between the two classes;our results do not change qualitatively for other choices of thresholds. In the mismatch approximation, the genotypeclasses k = { , , ..., k S } ∈ S and k = { k W , k W + 1 , ..., n } ∈ W correspond to strong and weak binding, respectively. k S and k W are deﬁned as the closest integers to the thresholds deﬁned above; these values depend on (cid:15) and µ . Wealso deﬁne a “presite” as the mismatch class that is 1 bp away from the threshold for strong binding, i.e., a class with k S + 1 mismatches. Note that binding length n extends the tail of the ﬁtness landscape for a single site and shifts thecenter of redundancy rich mismatch classes (Fig. 1C).The formulation in Eq (7) reduces to f ( k ) = s π TD ( k ) (10)in a mismatch approximation at a single site, which we will investigate extensively for N s scaling of TFBS gain andloss rates. We consider a wide range of

N s values:

N s <

N s = 0 for neutral evolution,

N s ∼ N s (cid:29) n log(2) / π coop ( k, k c ) = e − β ( (cid:15)k − µ ) + e − β ( (cid:15) ( k + k c ) − µ − E c ) e − β ( (cid:15)k − µ ) + e − β ( (cid:15)k c − µ ) + e − β ( (cid:15) ( k + k c ) − µ − E c ) , (11)where k c stands for the mismatch class at the co-binding site and E c for cooperativity. In this study we considerthat cooperative energy ranges from an intermediate strength ( E c = − k B T ) to a high strength ( E c = − k B T ) [42].Fig 1D shows an example of the binding probability when a strong co-binding site exists. As a function of k alone, atﬁxed k c , this formulation of cooperativity is consistent with the zero-cooperativity ( E c = 0) case but with a changedeﬀective chemical potential. We take cooperative interactions into account if the two TFs are binding within 3 bp ofeach other, and we only consider the strongest binding of the cooperative partner (i.e., the proximal location with thelowest k c ).In Supporting Information, we discuss the other two models of interacting TFBS. In one model, gene expressionis determined only by the binding probability of the strongest site in the regulatory sequence. In the other model,gene expression is determined by the probability of the joint occupancy of 2 strongest binding sites, anywhere inthe regulatory sequence; this model is a toy version of synergistic “non-physical” interaction of TFs which competewith nucleosomal binding for the occupancy of regulatory regions in eukaryotes (see Mirny (2010) [68] for a detailedmodel). Point and indel mutations

Point mutations and indels are the only mutational processes in our framework. Point mutations with a rate u convert the nucleotide at one position into one of the 3 other nucleotide types. For a single binding site, the probabilitythat a point mutation changes the mismatch class from k to k (cid:48) is P (point) k (cid:48) ,k = (cid:0) − k/n (cid:1) δ k (cid:48) ,k +1 + (cid:0) k/ n (cid:1) δ k (cid:48) ,k − + (cid:0) k/ n (cid:1) δ k (cid:48) ,k (12)where δ a,b = 1 if a = b and 0 otherwise.We deﬁne the indel mutation rate per base pair such that it occurs with rate θ u at a position where a randomnucleotide sequence is either inserted, or an existing nucleotide sequence is deleted. For mathematical simplicity,we assume that insertions and deletions are equally likely; in fact, a slight bias towards deletions is reported in theliterature with a ratio of deletion to insertion ∼ . − . θ is the ratio of indel mutation rateto point mutation rate, and is reported to be in the range 0 . − . θ = 0 for no indel mutations, and θ = 0 .

15 for the combined eﬀect of indel and point mutations. Since we ﬁxthe length of the regulatory sequence, indels shift existing positions away from or inwards to some reference position(e.g., transcription start site). For consistency, we ﬁx the regulatory sequence at its ﬁnal position and assume thatsequences before the initial position are random. Indel lengths vary, with reports suggesting a sharply decreasingbut fat-tail frequency distribution [75]. For simulations we consider only very short indels of size 1 − .

45 and 0 .

18, respectively. We do not need to assume any particularindel length for analytical calculations (below). While suﬃcient for our purposes, this setup would need to be modiﬁedwhen working with real sequence alignments of orthologous regions.For a single binding site (i.e. L = n ) one can exactly calculate the probability of an indel mutation changing themismatch class from k to k (cid:48) as P (indel) k (cid:48) ,k = n (cid:88) i =1 (1 /n ) k (cid:48) (cid:88) x =0 p ( X i = x | k ) p ( Y i = k (cid:48) − x ) . (13)Here, i is the index for the position of an indel mutation within the binding site. The distribution over possiblepositions is uniform (hence 1 /n ). The indel mutation deﬁnes two distinct parts in the binding site in terms ofmismatches: nucleotides behind the indel mutation preserve their mismatch information, yet the nucleotides withinand after indel mutation completely lose it. The new mismatches at these distinct parts X i and Y i are binomialrandom variables, p ( X i = x | k ) = φ x ( i − , α = k/n ) p ( Y i = y ) = φ y ( n − i + 1 , α = 3 /

4) (14)where φ k ( n, α ) is deﬁned in Eq (8). Fig 6 shows that Monte Carlo sampling of indel mutations at a single bindingsite matches the analytical expression in Eq (13).The two types of mutations can be combined into the mutation rate matrix as follows: U k (cid:48) ,k = (cid:40) n u (cid:16) P (point) k (cid:48) ,k + θ P (indel) k (cid:48) ,k (cid:17) k (cid:48) (cid:54) = k − (cid:80) k (cid:48) (cid:54) = k U k (cid:48) ,k k (cid:48) = k . (15) Evolutionary dynamics of single TF binding sites

For a sequence that consists of an isolated TFBS (i.e., L = n ), analytical treatment is possible under the ﬁxed stateassumption. Let ψ ( t ) be a distribution over an ensemble of populations, whose k -th component, ψ k ( t ), denotes theprobability of detecting a genotype with k mismatches at time t . In the continuous time limit, the evolution of ψ ( t )is described by ddt ψ ( t ) = R · ψ (16)which accepts the following solution: ψ ( t ) = e R t · ψ (0) . (17)Here, R is the transition rate matrix deﬁned as R k (cid:48) ,k = (cid:40) N U k (cid:48) ,k P ﬁx ( N, ∆ f k (cid:48) ,k ) k (cid:48) (cid:54) = k − (cid:80) k (cid:48) (cid:54) = k R k (cid:48) ,k k (cid:48) = k . (18)This dynamical system is a continous-time Markov chain and there exists a unique stationary distribution ˆ ψ corresponding the genotype distribution over an ensemble of populations at large time points. It can be calculatedby decomposing the transition rate matrix R into its eigenvalues and eigenvectors. The normalised left eigenvectorwith zero eigenvalue corresponds to the stationary distribution. This can also be expressed analytically asˆ ψ k ∝ e F ( k,N )+ H ( k | n,α ) , (19)where F ( k, N ) = 4 N f ( k ) captures the relative importance of selection to genetic drift, and H ( k | n, α ) is themutational entropy, describing how a particular mismatch class k is favored due to redundancy and connectivity ofthe genotype space. For point mutations alone ( θ = 0), H = log φ k ( n, α ), with the binomial distribution φ k ( n, α )as deﬁned in Eq (8). Obtaining a closed form expression for H is diﬃcult when considering indel mutations ( θ > θ in the range of interest. The form of the stationarydistribution was known for a long time in population genetics literature for a single locus or many loci with linkageequilibrium [76]. It has recently been generalised to arbitrary sequence space under the ﬁxed state assumption [32, 77],resulting in the form of Eq (19) with a close analogy in the energy-entropy balance of statistical physics [80], andbecome a subject of theoretical interest [62, 78, 79, 81].Under weak directional selection for high expression (and thus high binding site occupancy), the stationary dis-tribution shows a bimodal shape, with one peak located around the ﬁttest class, k ∼

0, and another at the core ofmutational entropy, k ∼ α n (recall that α = 3 / N s distinguishing strongand weak selection regimes primarily depends on the TFBS binding length, n . In a sigmoidal ﬁtness landscape andapproximating the binomial distribution by a normal distribution as appropriate, the sizes of these two peaks areroughly proportional to exp (4 N s − n log 4) and (cid:112) πα (1 − α ) n , respectively. Therefore, we expect the threshold N s to scale as (cid:16) n log 4 − log 2 πα (1 − α ) n (cid:17) . For typical n , the linear term is dominant, suggesting that N s ∼ n log(2) / n scaling diﬀers fromthe log( n ) scaling which is expected in simple ﬁtness landscapes [82]. Our argument assumes that the system is atevolutionary equilibrium, which, as we will see, is not necessarily the case even under strong selection, providingfurther motivation for focusing on dynamical aspects of evolution.We deﬁne the time needed to gain (or lose) a TFBS as the time it takes for a strong binding site to emerge from aweak one (and vice versa), as schematized in Fig. 1D. For an isolated TFBS, these times can be computed from theMarkovian properties of the evolutionary dynamics, by calculating the average ﬁrst hitting times [83]. We will usethe notations (cid:104) t (cid:105) S← k and (cid:104) t (cid:105) W← k , respectively, for average gain and loss times when evolution starts from mismatchclass k . Obviously, (cid:104) t (cid:105) S← k = 0 if k is among the strong binding classes ( k ∈ S ) and (cid:104) t (cid:105) W← k = 0 if k is among theweak binding classes ( k ∈ W ). The average gain times from other mismatch classes can be found by considering therelation (cid:104) t (cid:105) S← k = 1 + (cid:80) k (cid:48) / ∈S P k,k (cid:48) (cid:104) t (cid:105) S← k (cid:48) , where P k,k (cid:48) is the probability of transition from k (cid:48) to k in one generation.One can compute the average gain times by writing it in terms of linear algebraic equation: T S← = ( R / ∈S ) − T · ( − ) (21)where T S← is a column vector listing non-trivial gain times, i.e. {(cid:104) t (cid:105) S ← k } for k = k S + 1 , ..., n . R / ∈S is the R matrixwith all rows and columns corresponding to k ∈ S deleted and − T is the matrix operator for the transpose after aninverse operation. is a vector of ones. Similarly one can ﬁnd the loss times, T W← = ( R / ∈W ) − T · ( − ) (22)where T W← is a column vector listing non-trivial loss times, i.e. {(cid:104) t (cid:105) W← k } for k = 1 , , ... k W − R / ∈W is the R matrix with all rows and columns corresponding to k ∈ W deleted.In the case of point mutations alone ( θ = 0), the R matrix is tri-diagonal and one can deduce simpler formulae forgain and loss times: (cid:104) t (cid:105) (point) S← k = (cid:80) ki = k S +1 1 R i − , i − ˆ Ψ i − ˆ ψ i (cid:104) t (cid:105) (point) W← k = (cid:80) k W i = k +1 1 R i − , i ˆ Ψ i − ˆ ψ i (23)where we use ˆ Ψ i = (cid:80) ij =0 ˆ ψ j to denote the cumulative stationary distribution. For very strong selection, the secondterm in the sums approaches unity, resulting in even simpler formulae [32], called the “shortest path” (sp) solution: (cid:104) t (cid:105) (sp) S← k = (cid:80) ki = k S +1 1 R i − , i (cid:104) t (cid:105) (sp) W← k = (cid:80) k W i = k +1 1 R i − , i . (24)These equations can be used to quickly estimate gain and loss rates of interest. For example, the gain rate frompresites under strong selection is approximately 2 N s u k S +13 ( f ( k S ) − f ( k S + 1)). Although the exact value depends onthe binding speciﬁcity and chemical potential, one can see that it is about N s u for the parameter range of interest.Similarly, one can see that the rate of loss from strong sites is about 2 n | N s | u when there is strong negative selection. RESULTSSingle TF binding site gain and loss rates under mutation-selection-drift are typically slow

We ﬁrst studied the evolutionary rates for a single TF binding site at an isolated DNA sequence of the same lengthunder mutation, genetic drift, and directional selection for high gene expression level (i.e., tighter binding). As detailedin the Models & Methods section, we combined a thermodynamically motivated ﬁtness landscape with the mismatchapproximation, and assumed that the mutation rate is low enough for the ﬁxed state population approximation to bevalid. Under these assumptions, we could calculate the inverse of the average TFBS gain and loss times as a functionof the starting genotype, using either an exact method or Wright-Fisher simulations. We considered point mutationsalone, or point mutations combined with short indel mutations, in order to understand under which conditions therates of gaining and losing binding sites can reach or exceed the rates 2 − N s .For parameters typical of eukaryotic binding sites (length n = 7 bp, speciﬁcity (cid:15) = 2 k B T ), the TFBS gain ratesare extremely slow (practically no evolution) when there is negligible selection pressure ( N s ∼ N s >

N s u for the strong

N s regime (as estimated by Berg et al. [32]), they decrease dramatically if more mutational steps are needed to evolvea functionally strong binding site. This is illustrated in the inset to Fig 2A, showing an exponential-like decay inthe gain rates as a function of the number of mismatches, even for a TFBS of a modest length of 7 bp. As arguedin the Models & Methods section (see Eq (20)), we conﬁrmed that the threshold for the strong

N s regime scales as n log(2) / n ) which is the case for simple ﬁtness landscapes [82].The availability of a realistic fraction of indel mutations (here, θ = 0 .

15) can speed up evolution when starting fromdistant genotypes (cf. solid and dashed red line in Fig 2A). This is because indels connect the genotype space suchthat paths from many to few mismatches are possible within a single mutational step. Nevertheless, the improvementdue to indel mutations does not alleviate the need for very strong selective pressure and the proximity of the initialto strongly-binding sequence, in order to evolve a functional site.Biophysical parameters—the binding site length n , the chemical potential µ , and the speciﬁcity (cid:15) —inﬂuence theshape of the ﬁtness landscape and thus the TFBS gain rates. This is especially evident when we consider de novo evolution starting from random sequence. As shown in Figs 2B, C, increases in speciﬁcity or length cause a sharpdrop in the gain rates from initial sequences in the most redundancy rich class, which can be only partially mitigatedby the availability of indel mutations. This especially suggests that adaptation of TFBS from random sequences forTF with very large binding lengths and very strong speciﬁcities is unlikely with point and indel mutations which canconstrain the evolution of TF lengths and TF speciﬁcity, which is consistent with Berg et al. (2004) [32]’s earliernumeric observation. Importantly, the binding speciﬁcity and length show an inverse relation with the logarithm ofthe gain rates. This is due to the fact that a decrease in speciﬁcity allows more genotypes to generate appreciablebinding and therefore ﬁtness (see Fig. 1D), which partially compensates the increase in mutational entropy at largerbinding site lengths. Variation of the chemical potential µ corresponding to an order-of-magnitude change in the freeTF concentration does not qualitatively aﬀect the results.0 ●●● ● ● ●        ◆◆◆ ◆ ◆ ◆        ● ● ● ● ● ●            ◆ ◆ ◆ ◆ ◆ ◆            ● k = θ = ( point ) ● k = θ = ( point ) ◆ k = θ = ( point & indel ) ◆ k = θ = ( point & indel ) r a t eo f ga i n i ng TF BS / 〈 t 〉  ← k [ ⨯ u ] --- approx., θ = ( point ) Ns = = - Nsu10 - Nsu10 - Nsu10 - Nsu10 - Nsu10 Nsu binding length - n s pe c i f i c i t y - ϵ / 〈 t 〉 ← k ∼ n / , θ = ⊕ - Nsu10 - Nsu10 - Nsu10 - Nsu10 Nsu binding length - n s pe c i f i c i t y - ϵ / 〈 t 〉 ← k ∼ n / , θ = ⊕ A BC ⟵ point mutation rate FIG. 2:

Single TF binding site gain rates at an isolated DNA region. A)

The dependence of the gain rate, 1 / (cid:104) t (cid:105) S← k shown in units of point mutation rate, from sequences in diﬀerent initial mismatch classes k (blue: k = 2, red: k = 5), as afunction of selection strength. Results with point mutations only ( θ = 0) are shown by dashed line; with admixture of indelmutations ( θ = 0 .

15) by a solid line. For strong selection, Ns (cid:29) n log(2) /

2, the rates scale with Ns , which is captured well bythe “shortest path” approximation (black dashed lines in the main ﬁgure) of Eq (24). The biophysical parameters are: site length n = 7 bp; binding speciﬁcity (cid:15) = 2 k B T ; chemical potential µ = 4 k B T . Points correspond to Wright-Fisher simulations with Nu = 0 .

01 where error bars cover ± k for Ns = 0 and Ns = 100. B, C)

Gain rates from redundancy rich classes ( k ∼ n/

4, typical ofevolution from random “virgin” sequence) under strong selection, without (B) and with (C) indel mutations supplementing thepoint mutations. Red crosshairs denote the cases depicted in panel A. Contour lines show constant gain rates in units of

Ns u as a function of biophysical parameters n and (cid:15) . Wiggles in the contour lines are not a numerical artefact but a consequenceof discrete mismatch classes. Typically slow TFBS evolution is a consequence of the sigmoidal shape of the thermodynamically motivated ﬁtnesslandscape, where adaptive evolution in the redundant but weakly binding classes W must proceed very slowly dueto the absence of a selection gradient. To illustrate this point, we generated alternative ﬁtness landscapes that agreeexactly with the thermodynamically motivated one from the ﬁttest class to the threshold class for strong binding, k S , but after that decay as power laws, π pl , with a tunable exponent (see SI text). As seen in Fig 8, this exponentis a major determinant of the gain rates, suggesting that a biophysically realistic ﬁtness landscape is crucial for thequantitative understanding of TFBS evolution.To check that the assumption of the ﬁxed state population is valid at N u = 0 .

01, the value used here that isalso relevant for multicellular eukaryotes [56], we performed Wright-Fisher simulations as described in the Models &Methods section. Fig 2A shows excellent agreement between the analytical results and the simulation. We furtherincreased the mutation rate to

N u = 0 .

1, a regime more relevant for prokaryotes where polymorphisms in thepopulation are no longer negligible, to ﬁnd that the analytical ﬁxed state assumption systematically overestimatesthe gain rates, as shown in Fig 9. In the presence of polymorphism, therefore, evolution at best proceeds as quicklyas in monomorphic populations, and generally proceeds slower, so that our results provide a theoretical bound onthe speed of adaptive evolution under directional selection. This is expected since the eﬀects of clonal interference1kick in after a certain

N u , where two diﬀerent beneﬁcial mutants start competing with each other, and eventuallydecrease the ﬁxation probability in comparison to one beneﬁcial mutant sweeping to ﬁxation as in the monomorphicpopulation case.To check that the mismatch assumption does not strongly aﬀect the reported results, we analyzed evolutionarydynamics with more realistic models of TF-DNA interaction. Diﬀerent positions within the binding site can havediﬀerent speciﬁcities, and one could suspect that this can signiﬁcantly lower the evolutionary times. First, somepositions within the TFBS may show almost no speciﬁcity for any nucleotide, most likely due to the geometry ofTF-DNA interactions (e.g, when the TF can contact the nucleic acid residues only in the major groove); we havenot simulated such cases explicitly, but simply take the binding site length n to be the eﬀective sequence lengthwhere TF does make speciﬁc contacts with the DNA. Second, the positions that do exhibit speciﬁcity might do soin a manner that is more inhomogeneous than our mismatch assumption, which assigns zero energy to the consensusand a constant (cid:15) to any possible mismatch. We thus generated energy matrices where (cid:15) was drawn from a Gaussiandistribution with the same mean (cid:104) (cid:15) (cid:105) = 2 k B T as in our baseline case of Fig 2A, but with a standard deviation 0 . k B T .Fig 10 shows that both equal and unequal energy contributions produce statistically similar behaviors, indicating thatinhomogeneous binding interactions cannot substantially enhance the evolutionary rates.We further investigated the rate of TFBS loss (Fig 11). Here too strong (negative) selection is needed to lose asite on reasonable timescales, and it is highly unlikely that a site would be lost in the presence of positive selection.In contrast to the TFBS gain case, however, negative selection and mutational entropy act in the same direction forTFBS loss, reducing the importance of the initial genotype and making selection more eﬀective at larger n and (cid:15) .Taken together, these results suggest that the emergence of an isolated TFBS under weak or no selection is typicallyslow relative to the species’ divergence times, and gets rapidly slower for sites that are either longer or whose TFsare more speciﬁc than the baseline case considered here. This suggests that biophysical parameters themselves maybe under evolutionary constraints; in particular, if point mutations and indels were the only mutational mechanisms,the evolution of long sites, e.g. n (cid:29) −

12, would seem extremely unlikely, as has been pointed out previously [32].Absent any mechanisms that could lead to faster evolution and which we consider below, isolated TFBS are generallyonly likely to emerge in the presence of strong directional selection and a favorable distribution of initial sequencesthat is enriched in presites.

Convergence to the stationary distribution is slow and depends strongly on initial conditions

A number of previous studies (e.g., [62, 78, 79]) assumed that a stationary distribution of mismatch classes isreached in the evolution of isolated TFBS and thus an equilibrium solution, Eq (19), is informative for bindingsequence distributions. In contrast, our results for average gain and loss times suggest that the evolution of anisolated TFBS is typically slow. To analyze this problem in a way that does not depend on arbitrary thresholdsdeﬁning “strong” and “weak” binding classes S and W , we ﬁrst examined the evolution of the distribution ψ ( k ) overthe mismatch classes as a function of time in Fig 3A. For typical parameter values it takes on the order of the inversepoint mutation rate to reach the stationary distribution for populations that start oﬀ far away from it, even withstrong selection.A systematic study of the convergence rates can be performed by computing the (absolute value of the) secondeigenvalue, | λ | , of the transition rate matrix R from Eq (18), and exploring how this depends on the biophysicalparameters n and (cid:15) . Consistent with previous results, we observe large increases in convergence times as n and (cid:15) increase. For example, an increase in the binding site length from n = 7 to n = 11 at baseline speciﬁcity of (cid:15) = 2 k B T would result in a ten-fold increase in the convergence time.The intuitive reason behind the slow convergence rates is in the bimodal nature of the distribution ψ ( k ) on thethermodynamically motivated ﬁtness landscape, similar to that reported by Lynch & Hagner [9]. One “attractor” islocated around the ﬁttest class ( k ∼

0, due to directional selection), while the other is located around the redundancy-rich mismatch classes ( k ∼ / n ). These two attractors are separated by a typically sharp ﬁtness landscape, and theredundancy-rich attractor lacks selection gradients needed to support fast adaptation. The temporal evolution of thedistribution ψ ( k ) from, e.g., a maladapted state, can thus be best understood as the probability weight “switching”from resting approximately within one attractor to the other one, while maintaining the bimodal shape throughout,rather than a gradual shift of a unimodal distribution from a maladapted initial value of k to the value favored byselection. This is especially true when n gets larger: although adaptation within the functional sites can still happen,adaptation from the most random mismatch classes becomes extremely slow, even under strong selection (see Fig 15).These results suggest that stationary distributions of isolated TFBS sequences may not be realizable on thetimescales of speciation, which should be a cause of concern when stationarity is assumed without prior critical2 f r equen cy ψ k Ns = [ ⨯ u - ] D K L f r equen cy ψ k Ns = [ ⨯ u - ] D K L - Nsu10 - Nsu10 - Nsu10 - Nsu s pe c i f i c i t y ϵ convergence rate | λ | ⊕ A B initiallywell adapted ( k = ) — initiallymaladapted ( k = ) — stationarydistribution — FIG. 3:

Convergence to the stationary distribution of TFBS sequences. A)

Evolutionary dynamics of the mismatchclasses distribution ψ ( k ) for an isolated TFBS under point and indel mutations ( θ = 0 . k = 0, blue) and badly ( k = 5, red) adapted populations. At left, noselection ( Ns = 0); at right, strong selection ( Ns = 100). Diﬀerent curves show the distribution of genotype classes at diﬀerenttime points ( t = 0 u − , . u − , . u − as decreasing opacity); stationary distribution is shown in green. Insets show thetime evolution to convergence for initially well ( k = 0, blue) and badly ( k = 5, red) adapted populations, measured by theKullback-Leibler divergence D KL [ ψ ( t ) || ψ ( t = ∞ )]. The biophysical parameters are: n = 7 bp, (cid:15) = 2 k B T , µ = 4 k B T . B) Rate of convergence to the stationary distribution for diﬀerent (cid:15) and n values under strong selection ( Ns (cid:29) n log(2) /

2; herespeciﬁcally Ns = 100) and for θ = 0 .

15. Crosshairs represent the parameters used in a). assessment. For example, applications assuming the stationary distribution might wrongly infer selection on regula-tory DNA.

Evolution of TF binding sites in longer sequences

So far we have shown that the evolution of isolated TFBS is typically slow. How do the results change if we considerTFBS evolution in a stretch of sequence L bp in length, where L (cid:29) n , e.g., within a promoter or enhancer? Here wefocus on de novo evolution under strong directional selection for high gene expression, by simulating the process inthe ﬁxed state population framework. Compared to the isolated TFBS case, we need to make one further assumption:that the expression level of the selected gene is proportional to the summed TF occupancy on all sites within theregulatory region of length L (see Models & Methods for details). While this is the simplest choice, it is neither uniquenor perhaps the most biologically plausible one, although limited experimental support exists for such additivity [84];it does, however, represent a tractable starting point when the interactions between individual TF binding sites arenot strong and the contribution of each site is equal and of the same sign. To address the interactions, we look atthe cooperative binding case in the following section. In Supporting Information, we also discuss the competition ofTFBSs for the strongest binding, and the “nonphysical” synergetic interaction by two strongest TFBSs.We propose a simple analytical model for the time evolution of the number of strongly binding sites, z ( t ), in thepromoter, derived from isolated TFBS gain and loss rates, λ gain and λ loss . Assuming constant rates, one can write ddt z ( t ) = λ gain (cid:16) z max − z ( t ) (cid:17) − λ loss z ( t ) (25)where z max is the maximum number of TFBS that can ﬁt into the regulatory sequence of length L bp. If the sitescan overlap, z max = L − n + 1, otherwise z max ≈ L/n . The solution for Eq (25) is z ( t ) = (cid:16) z o − BA (cid:17) e − At + BA (26)where A = (cid:0) λ gain + λ loss (cid:1) , B = z max λ gain and z o = z ( t = 0). Under strong positive selection, i.e. N s (cid:29) n log(2) / λ loss can be ignored. If the distribution of the initial mismatch classes in the promoter is ψ k , one canapproximate z max − z o = z max (cid:80) nk = k S +1 ψ k to obtain: z ( t ) − z o = (cid:0) − e − λ gain t (cid:1) z max n (cid:88) k = k S +1 ψ k . (27)3There are two limiting regimes in which we can examine the behavior of Eq (27). Over a short timescale, evolutionarydynamics will search over all possible positions, z max = L − n + 1, to pull out the presites, since they are fastest toevolve into the strong binding class S , i.e.: λ gain ≈ λ presitegain = (cid:16) (cid:88) k / ∈S ψ k (cid:17) − ψ k S +1 / (cid:104) t (cid:105) S← k S +1 (28)As the process unfolds and new sites are established, new TFBS will only be able to emerge at a smaller set ofpositions due to possible overlaps, so that z max ≈ L/n . On the other hand, evolution from higher mismatch classeswill also start to contribute towards new sites: λ gain ≈ λ allgain = (cid:16) (cid:88) k / ∈S ψ k (cid:17) − (cid:88) k / ∈S ψ k / (cid:104) t (cid:105) S← k (29)Fig 4 shows how new TFBSs with length n = 7 bp emerge over time in a promoter of L = 30 bp in length.Consistent with the predictions of our simpliﬁed model, we can distinguish the early, intermediate, and late epochs.In the early epoch, t < . u − , presites are localized among all possible locations and are established as bindingsites. During this period, the growth in the expected number of new TFBSs is linear with time. The importance andpredictive power of presites at early epoch remain even under diﬀerent models of gene expression, including interactionbetween TFBSs (see Fig 14). In the intermediate epoch, new binding sites accumulate at the rate that is slightlyabove that expected by establishment from presites alone, as the mutational neighborhood is explored further. In thelate epoch, t > . u − , initial sites in the immediate mutational vicinity have been exhausted, and established siteshave constrained the number of positions where new sites can evolve from more distant initial sequences, leading tothe saturation in the number of evolved TFBS.Using the simple analytical model, we explored in Fig 4B,C how the binding length n and speciﬁcity (cid:15) aﬀect thenumber of newly evolved TFBS. Increasing n leads to a steep decrease in the number of expected sites, with a somewhatweaker dependence on (cid:15) , especially at early times. Simulations at other values of biophysical and evolutionaryparameters conﬁrm the qualitative agreement between the analytical model and the simulation (Fig 12); given thatthe model is a simple heuristic, it cannot be expected to match the simulations in detail, yet it nevertheless seems tocapture the gross features of evolutionary dynamics. Together, these results show that at early times under strongselection, the number of newly evolved sites will grow linearly with time and proportional to L , before evolution fromhigher mismatch classes can contribute and ultimately before the sites start interacting, with a consequent slowdownin their evolution. Thus, evolution in longer regulatory regions ( L = 10 − bp) could feasibly give rise to tens ofbinding sites at N s = 10 − within a realistic time frame t ∼ . u − , if the sites are suﬃciently short ( n ∼ n > −

12 bp, especially within short promoters found in prokaryotes,would likely necessitate invoking new mechanisms.

Ancient sites and cooperativity between TFs can accelerate binding site emergence

Finally, we brieﬂy examine two mechanisms that can further speed up the evolution of TF binding sites in longersequences.The ﬁrst possibility is that the sequence from which new TFBS evolve is not truly random; as discussed previously,presites have a strong inﬂuence on the early accumulation of new binding sites. There are a number of mechanisms thatcould bias the initial sequence distribution towards presites: examples include transposable elements, DNA repeats,or CG content bias. Here we consider an alternative mechanism that we refer to as the “ancient TFBS scenario,” inwhich a strong TFBS existed in the sequence in the ancient past, after which it decayed into a weak binding site,possibly due to the relaxation of selection (i.e.,

N s ∼ Ψ in a sequence of length L with a single ancient site can be captured bywriting: Ψ = 1 L − n + 1 ψ ( t (cid:48) ) + L − nL − n + 1 φ (30)where φ is the binomial distribution, Eq (8), characteristic of the random background, and ψ ( t (cid:48) ) is the distribution ofmismatches due to the presence of the ancient site. Time t (cid:48) refers to the interval in which the isolated ancient TFBShas been decaying under relaxed selection, and the corresponding ψ ( t (cid:48) ) can be solved for using Eq (17).4 - - - - - [ ⨯ u - ] ne w TF BS z - z o [ ⨯ N s L ] - NsL10 - NsL10 - NsL10 - NsL s pe c i f i c i t y ϵ t = / u ⊕ - NsL10 - NsL10 - NsL10 - NsL s pe c i f i c i t y ϵ t = / u ⊕ — simulations ( mean ± ⨯ SEM ) --- prediction ( single TFBS ) FIG. 4:

TF binding site evolution in a longer sequence of L = 30 base pairs. The expected number of newly evolvedTF binding sites with length n = 7 bp, under strong directional selection ( Ns = 100) and both point and indel mutations( θ = 0 . ± (cid:15) = 2 k B T , µ = 4 k B T . Insets:

Expected number of newly evolved sites from a random sequence of length L at t = 0 . u − (left) and t = 0 . u − (right) for diﬀerent binding length and speciﬁcity values, computed using the analytical predictions.Crosshairs denote the values used in the main panel. Fig 5A shows that the ancient site scenario can enhance the number of newly evolved sites by resurrecting theancient site, even after it has decayed for t (cid:48) = 0 . u − . Simulation results agree well with the simple analytical modelusing the biased initial sequence distribution of Eq (30). Importantly, such a mechanism is particularly eﬀective forlonger binding sites of high speciﬁcity, indicating that regulatory sequence reuse could be evolutionarily beneﬁcial inthis biophysical regime (see Fig 13).Fig 5A and Fig 13 also show the emergence of new sites when the ancient site was not a full consensus (preferred)sequence but diﬀered from it by a certain number of mismatches. The results qualitatively agree with the case ofperfect consensus. Importantly, this shows that the applicability of the ancient site scenario extends to cases wherethe ancient site belonged to a diﬀerent TF (albeit with a preferred sequence similar to the studied TF), which hasrecently been reported to be a frequent phenomenon by Payne & Wagner (2014) [47], possibly due to evolution ofTFs by duplication and divergence [85].The second mechanism that we consider is the physical cooperativity between TFs: when one site is occupied,it is favorable for the nearby site to be occupied as well. We extended the thermodynamic model to incorporate5 - - - - - - [ ⨯ u - ] ne w TF BS z - z o [ ⨯ N s L ] ● ● ● ● ● - - - - - Ancient site kt = / u n =

10 bp ϵ = k B T - - - - - - [ ⨯ u - ] ne w TF BS z - z o [ ⨯ N s L ] n = ϵ = k B T — without cooperativity — with cooperativity (- k B T ) — with cooperativity (- k B T ) — with cooperativity (- k B T ) --- prediction ( single TFBS ) — without ancient site — with ancient site ( k = ) — with ancient site ( k = ) --- predictions ( single TFBS ) A B

FIG. 5:

Ancient sites and cooperativity can accelerate the emergence of TF binding sites in longer regulatorysequences. A)

The expected number of newly evolved TFBS in the presence (red and brown) or absence (black) of an ancientsite, for binding site length n = 10 bp, and speciﬁcity, (cid:15) = 3 k B T . In this example, the ancient site was a consensus site ( k = 0)or two mismatches away from it ( k = 2) that evolved under neutrality for t (cid:48) = 0 . /u prior to starting this simulation. Dashedlines show the predictions of a simple analytical model, Eq (30). The inset shows how the number of newly evolved TFBS at t = 0 . /u scales with the mismatch of the ancient site k (plot markers: simulation means; error bars: two standard errors ofthe mean; dashed curve: prediction). B) The expected number of newly evolved TFBS without (black) and with cooperativeinteractions (for diﬀerent cooperativity strengths, magenta: E c = − k B T , yellow: E c = − k B T , cyan: E c = − k B T , seeEq (11) in the Models & Methods and text) for binding site length n = 7 bp, and speciﬁcity, (cid:15) = 2 k B T . Both panels use µ = 4 k B T , strong selection ( Ns = 100) and a combination of point and indel mutations ( θ = 0 . L = 30 bp. Thick solid lines show an average over 1000 simulation replicates, shading denotes ± cooperativity (see the Models & Methods, Eq (11) and Fig 1D). The genotype of a nearby site will then inﬂuencewhether a given site acts as a strongly or weakly binding site. The presence of a cooperative site acts as a local shiftin the chemical potential, which changes the weak/strong threshold, so that an individually weak site can becomea strongly binding site. Simulations using cooperative binding presented in Fig 5B illustrate how cooperativity canincrease the speed of evolution. This is speciﬁcally eﬀective for short binding sites of intermediate or low speciﬁcity,where a cooperative energy contribution can strongly inﬂuence the number of sites in the strong binding class (seeFig 13). DISCUSSION

In this study, we aimed at a better theoretical understanding of which biophysical and population genetic factorsinﬂuence the fast evolution of TFBSs in gene regulatory DNA, making sequence speciﬁc TF binding a plausible mech-anism for the evolution of gene regulation and for generating phenotypic diversity. Following Berg et al. (2004) [32],we combined a biophysical model for TF binding with a simple population genetic model for the rate of sequenceevolution. The key assumptions are that binding probability is determined by a thermodynamic equilibrium; thatﬁtness depends linearly on binding probability; and that populations are typically homogeneous in genotype, and soevolve by substitution of single point and short insertion/deletion (indel) mutations. Remarkably, the biophysical andthe evolutionary models take the same mathematical form: in the biophysical model, binding probability depends onthe binding energy, relative to thermal ﬂuctuations, βE , whilst in the evolutionary model, the chance that a mutation6ﬁxes depends on its selective advantage, relative to random sampling drift, N s .For single TFBS evolution, we calculated the average transition time between genotypes, the inverse being ameasure for the speed of the evolution. Our results indicate that TFBS evolution is typically slow unless selection isvery strong. It is important to emphasize that gaining a TFBS by point mutations under neutral evolution is veryunlikely, contrasting with the belief in the current literature (e.g., [5, 13]). This is mainly due to Stone & Wray’sargument that functional sites could readily be found by a random walk [31]; however, their argument assumed thatindividuals follow independent random walks, which grossly overestimates the rate of evolution (see MacArthur &Brookﬁeld [29]). Indeed, fast rates of gaining a single TFBS require not only strong selection but also initial sequencesin the mutational neighborhood of the functional sites. Especially, “presites,” i.e. sequences 1 bp away from thresholdsequences, can be crucial since they can evolve to functional sites by single mutations. Indel mutations can increasethe rate of gaining a single TFBS from distant sequences, since they connect the genotype space extensively, but theireﬀect is limited under realistic indel mutation rates [72, 73]. Future studies should consider the updates in estimatesof indel mutation rates, since they are currently not as precise as point mutation rates, although we do not expectbig qualitative departures from our results.Considering the evolution of a single TFBS from random sequence, we showed that biophysical parameters, bindinglength and speciﬁcity, are constrained for realistic evolutionary gain rates from the most redundant mismatch classes.The rates drop exponentially with binding length, making TF whose binding length exceeds 10 −

12 bp diﬃcult toevolve from random sites, at least under the point and indel mutation mechanisms considered here. As a consequenceof the biophysical ﬁtness landscape, binding speciﬁcity and length show an inverse relation for the same magnitudeof the gain rate from the most redundant mismatch class. Such an inverse relation is observed in position weightmatrices of TFs collected from diﬀerent databases for both eukaryotic and prokaryotic organisms, by Stewart &Plotkin (2012) [8]. In the same study, they reproduce this observation using a simple model which assumes thata trade-oﬀ between the selective advantage of binding to target sites, versus the selective disadvantage of bindingto non-target sequence. Their model assumes a stationary distribution, and that sites are functional if they aremismatched at no more than one base. It would be interesting to explore a broader range of models that accountfor the dynamical coevolution between transcription factor binding speciﬁcity, its length, and its binding sites [9].One idea can be to combine the evolutionary dynamical constraints (against large binding length and high speciﬁcity,which we show here) with simple physical constraints of TF dilution in non-target DNA (against short binding lengthand low speciﬁcity, again in an inverse relation [44]).For a single TF binding site, the stationary distribution for the mismatch with the consensus binding sequencedepends on the binding energy, but also on the sequence entropy – that is, the number of sequences at diﬀerentdistances from the consensus. Typically, the distribution is bimodal: either the site is functional, and is maintainedby selection, or it is non-functional, and evolves almost neutrally. We show that it may take an extremely longtime for the stationary distribution to be reached. Functional sites are unlikely to be lost if selection is strong (i.e.,

N s (cid:29) etal. (2015) [54] provide evidence that enhancer DNA sequence structure is older than other DNA portions, suggesting7the reuse of such regions in evolution, plausibly by gaining and losing TFBSs in repetitive manner. Nourmohammad& Lassig (2011) [30] showed evidence suggesting that local duplication of sequences followed by point mutationsplayed important role in binding site evolution in Drosophila species (but surprisingly, not in yeast species). Anotherinteresting option would be the existence of “mobile” presites or their fragments, e.g., as sequences embedded intotransposable elements that could be inserted before the gene under selection for high expression [25]. Presites canbe considered as concrete examples of cryptic sequences [86], potential source of future diversity and evolvability.We believe that understanding the eﬀects of presites would contribute to the predictability of genetic adaptationsregarding gene regulation, especially in important medical applications such as antibiotic resistance or virus evolution.We also showed that the evolution of a functional binding site in longer DNA can be accelerated by cooperativitybetween adjacent transcription factors. When a TF occupies a co-binding site, suﬃcient transcriptional activity canbe achieved from sequences of larger mismatch classes, an eﬀect similar to a local increase in TF concentration.This mechanism permits faster evolution towards strongly binding sequences, and seems most eﬀective for shortTFBS where it creates a selection gradient already in the redundancy rich mismatch classes. Cooperative physicalinteractions might allow the evolution of binding occupancy and thus expression without large underlying sequencechanges, which might be a reason for the observed weak correlation between sequence and binding evolution at certainregulatory regions. Importantly, TFBS clustering in eukaryotic enhancers can be a consequence of the fast evolutionwith cooperativity, as also supported by a recent empirical study [11].Our theoretical framework is relevant more broadly for understanding the evolution of gene regulatory architecture.Since the speed of TFBS evolution from random sequences is proportional to

N sL , our results suggest that populationsize N and the length of regulatory sequences L can compensate for each other in terms of the rate of adaptation. Thisis exactly what is observed: eukaryotes typically have longer regulatory DNA regions but small population sizes, whileprokaryotes evolve TFBS within shorter regulatory sequence fragments but have large population sizes. Similarly,prokaryotes might have achieved longer TF binding lengths n , as large population size allowed them to overcomethe exponential decrease in the gain rates with increasing n . If relevant, these observations would suggest that animportant innovation in eukaryotic gene regulation must have been the ability of the transcriptional machinery tointegrate the simultaneous occupancy of many low-speciﬁcity transcription factors bound over hundreds of basepairsof regulatory sequence, a process for which we currently have no good biophysical model. ACKNOWLEDGMENTS

We thank Magdalena Steinr¨uck, Georg Rieckh, Ferran Palero and Ziya Kalay for helpful comments. NB acknowl-edges support from ERC Advanced Grant 250152 “Selection and Information.”

REFERENCES [1] Fay JC, Wittkopp PJ. Evaluating the role of natural selection in the evolution of gene regulation. Heredity. 2007;100:191–199.[2] Zheng W, Gianoulis TA, Karczewski KJ, Zhao H, Snyder M. Regulatory Variation Within and Between Species. AnnualReview of Genomics and Human Genetics. 2011;12(1):327–346.[3] Romero IG, Ruvinsky I, Gilad Y. Comparative studies of gene expression and the evolution of gene regulation. NatureReviews Genetics. 2012 Jul;13(7):505–516.[4] Hoekstra HE, Coyne JA. The locus of evolution: evo devo and the genetics of adaptation. Evolution; International Journalof Organic Evolution. 2007 May;61(5):995–1016.[5] Wittkopp PJ. Evolution of Gene Expression. In: The Princeton Guide to Evolution. Princeton University Press; 2013. p.413–419.[6] Yao P, Lin P, Gokoolparsadh A, Assareh A, Thang MWC, Voineagu I. Coexpression networks identify brain region-speciﬁcenhancer RNAs in the human brain. Nature Neuroscience. 2015 Aug;18(8):1168–1174.[7] Wunderlich Z, Mirny LA. Diﬀerent gene regulation strategies revealed by analysis of binding motifs. Trends in genetics.2009 Oct;25(10):434–440.[8] Stewart AJ, Plotkin JB. Why transcription factor binding sites are ten nucleotides long. Genetics. 2012 Nov;192(3):973–985.[9] Lynch M, Hagner K. Evolutionary meandering of intermolecular interactions along the drift barrier. Proceedings of theNational Academy of Sciences of the United States of America. 2015. 112:E30-E38. [10] Schmidt D, Wilson MD, Ballester B, Schwalie PC, Brown GD, Marshall A, et al. Five-Vertebrate ChIP-seq Reveals theEvolutionary Dynamics of Transcription Factor Binding. Science. 2010 May;328(5981):1036–1040.[11] Steﬄova K, Thybert D, Wilson M, Streeter I, Aleksic J, Karagianni P, et al. Cooperativity and Rapid Evolution ofCobound Transcription Factors in Closely Related Mammals. Cell. 2013 Aug;154(3):530–540.[12] Dowell RD. Transcription factor binding variation in the evolution of gene regulation. Trends in Genetics. 2010Nov;26(11):468–475.[13] Villar D, Flicek P, Odom DT. Evolution of transcription factor binding in metazoans - mechanisms and functionalimplications. Nature Reviews Genetics. 2014 Apr;15(4):221–233.[14] Doniger SW, Fay JC. Frequent Gain and Loss of Functional Transcription Factor Binding Sites. PLoS Comput Biol. 2007May;3(5):e99.[15] Moses AM, Pollard DA, Nix DA, Iyer VN, Li XY, Biggin MD, et al. Large-Scale Turnover of Functional TranscriptionFactor Binding Sites in Drosophila. PLoS Comput Biol. 2006 Oct;2(10):e130.[16] Ludwig MZ, Patel NH, Kreitman M. Functional analysis of eve stripe 2 enhancer evolution in Drosophila: rules governingconservation and change. Development. 1998;p. 949–958.[17] Paris M, Kaplan T, Li XY, Villalta JE, Lott SE, Eisen MB. Extensive Divergence of Transcription Factor Binding inDrosophila Embryos with Highly Conserved Gene Expression. PLoS Genet. 2013 Sep;9(9):e1003748.[18] Ellison CE, Bachtrog D. Dosage Compensation via Transposable Element Mediated Rewiring of a Regulatory Network.Science. 2013 Nov;342(6160):846–850.[19] Alekseyenko AA, Ellison CE, Gorchakov AA, Zhou Q, Kaiser VB, Toda N, et al. Conservation and de novo acquisition ofdosage compensation on newly evolved sex chromosomes in Drosophila. Genes & Development. 2013 Apr;27(8):853–858.[20] Contente A, Dittmer A, Koch MC, Roth J, Dobbelstein M. A polymorphic microsatellite that mediates induction of PIG3by p53. Nature Genetics. 2002 Mar;30(3):315–320.[21] Kasowski M, Grubert F, Heﬀelﬁnger C, Hariharan M, Asabere A, Waszak SM, et al. Variation in Transcription FactorBinding Among Humans. Science. 2010 Apr;328(5975):232–235.[22] Chan YF, Marks ME, Jones FC, Villarreal G, Shapiro MD, Brady SD, et al. Adaptive Evolution of Pelvic Reduction inSticklebacks by Recurrent Deletion of a Pitx1 Enhancer. Science. 2010 Jan;327(5963):302–305.[23] Vierstra J, Rynes E, Sandstrom R, Zhang M, Canﬁeld T, Hansen RS, et al. Mouse regulatory DNA landscapes revealglobal principles of cis-regulatory evolution. Science. 2014 Nov;346(6212):1007–1012.[24] Gemayel R, Vinces MD, Legendre M, Verstrepen KJ. Variable Tandem Repeats Accelerate Evolution of Coding andRegulatory Sequences. Annual Review of Genetics. 2010;44(1):445–477.[25] Feschotte C. Transposable elements and the evolution of regulatory networks. Nature Reviews Genetics. 2008 May;9(5):397–405.[26] Hahn MW, Stajich JE, Wray GA. The Eﬀects of Selection Against Spurious Transcription Factor Binding Sites. MolecularBiology and Evolution. 2003 Jun;20(6):901–906.[27] He BZ, Holloway AK, Maerkl SJ, Kreitman M. Does Positive Selection Drive Transcription Factor Binding Site Turnover?A Test with Drosophila Cis-Regulatory Modules. PLoS Genet. 2011 Apr;7(4):e1002053.[28] Arnold CD, Gerlach D, Spies D, Matts JA, Sytnikova YA, Pagani M, et al. Quantitative genome-wide enhancer activitymaps for ﬁve Drosophila species show functional enhancer conservation and turnover during cis-regulatory evolution.Nature Genetics. 2014 Jul;46(7):685–692.[29] MacArthur S, Brookﬁeld JFY. Expected Rates and Modes of Evolution of Enhancer Sequences. Molecular Biology andEvolution. 2004 Jun;21(6):1064–1073.[30] Nourmohammad A, L¨assig M. Formation of Regulatory Modules by Local Sequence Duplication. PLoS Comput Biol. 2011Oct;7(10):e1002167.[31] Stone JR, Wray GA. Rapid evolution of cis-regulatory sequences via local point mutations. Molecular Biology andEvolution. 2001 Sep;18(9):1764–1770.[32] Berg J, Willmann S, L¨assig M. Adaptive evolution of transcription factor binding sites. BMC Evolutionary Biology. 2004Oct;4(1):42.[33] von Hippel PH, Berg OG. On the speciﬁcity of DNA-protein interactions. Proceedings of the National Academy of Sciencesof the United States of America. 1986 Mar;83(6):1608–1612.[34] Berg OG, von Hippel PH. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory andapplication to operators and promoters. Journal of molecular biology. 1987 Feb;193(4):723–750.[35] Stormo GD, Fields DS. Speciﬁcity, free energy and information content in protein-DNA interactions. Trends in biochemicalsciences. 1998 Mar;23(3):109–113.[36] Stormo GD, Hartzell GW. Identifying protein-binding sites from unaligned DNA fragments. Proceedings of the NationalAcademy of Sciences. 1989 Feb;86(4):1183–1187.[37] Stormo GD, Zhao Y. Determining the speciﬁcity of protein-DNA interactions. Nature Reviews Genetics. 2010Nov;11(11):751–760.[38] Zhao Y, Granas D, Stormo GD. Inferring Binding Energies from Selected Binding Sites. PLoS Comput Biol. 2009Dec;5(12):e1000590.[39] Shea MA, Ackers GK. The OR Control system of bacteriophage lambda: A physical-chemical model for gene regulation.Journal of Molecular Biology. 1984;p. 211–230.[40] Bintu L, Buchler NE, Garcia HG, Gerland U, Hwa T, Kondev J, et al. Transcriptional regulation by the numbers:applications. Current Opinion in Genetics & Development. 2005;15:125–135.[41] Bintu L, Buchler NE, Garcia HG, Gerland U, Hwa T, Kondev J, et al. Transcriptional regulation by the numbers: models. Current Opinion in Genetics & Development. 2005;15:116–124.[42] Hermsen R, Tans S, ten Wolde PR. Transcriptional Regulation by Competing Transcription Factor Modules. PLoS ComputBiol. 2006 Dec;2(12):e164.[43] Hermsen R, Ursem B, ten Wolde PR. Combinatorial Gene Regulation Using Auto-Regulation. PLoS Comput Biol. 2010Jun;6(6):e1000813.[44] Gerland U, Moroz JD, Hwa T. Physical constraints and functional characteristics of transcription factor-DNA interaction.Proceedings of the National Academy of Sciences of the United States of America. 2002 Sep;99(19):12015–12020.[45] Gerland U, Hwa T. On the selection and evolution of regulatory DNA motifs. Journal of Molecular Evolution. 2002Oct;55(4):386–400.[46] Stewart AJ, Plotkin JB. The evolution of complex gene regulation by low-speciﬁcity binding sites. Proceedings of theRoyal Society B: Biological Sciences. 2013 Oct;280(1768).[47] Payne JL, Wagner A. The Robustness and Evolvability of Transcription Factor Binding Sites. Science. 2014Feb;343(6173):875–877.[48] Segal E, Raveh-Sadka T, Schroeder M, Unnerstall U, Gaul U. Predicting expression patterns from regulatory sequence inDrosophila segmentation. Nature. 2008 Jan;451(7178):535–540.[49] Samee MAH, Sinha S. Quantitative Modeling of a Gene’s Expression from Its Intergenic Sequence. PLoS Comput Biol.2014 Mar;10(3):e1003467.[50] He X, Samee AH, Blatti C, Sinha S. Thermodynamics-Based Models of Transcriptional Regulation by Enhancers: TheRoles of Synergistic Activation, Cooperative Binding and Short-Range Repression. PLOS Computational Biology. 2010.[51] He X, Duque TSPC, Sinha S. Evolutionary Origins of Transcription Factor Binding Site Clusters. Molecular Biology andEvolution. 2012 Mar;29(3):1059–1070.[52] Duque T, Samee MAH, Kazemian M, Pham HN, Brodsky MH, Sinha S. Simulations of Enhancer Evolution ProvideMechanistic Insights into Gene Regulation. Molecular Biology and Evolution. 2013 Oct;31(1):184–200.[53] Duque T, Sinha S. What Does It Take to Evolve an Enhancer? A Simulation-Based Study of Factors Inﬂuencing theEmergence of Combinatorial Regulation. Genome Biology and Evolution. 2015 Jun;7(6):1415–1431.[54] Villar D, Berthelot C, Aldridge S, Rayner T, Lukk M, Pignatelli M, et al. Enhancer Evolution across 20 MammalianSpecies. Cell. 2015 Jan;160(3):554–566.[55] Desai MM, Fisher DS. Beneﬁcial Mutation-Selection Balance and the Eﬀect of Linkage on Positive Selection. Genetics.2007 Jul;176(3):1759–1798.[56] Lynch M, Conery JS. The Origins of Genome Complexity. Science. 2003 Nov;302(5649):1401–1404.[57] Kimura M. On the Probability of Fixation of Mutant Genes in a Population. Genetics. 1962 Jun;47(6):713–719.[58] Hammar P, Wallden M, Fange D, Persson F, Baltekin ¨O, Ullman G, et al. Direct measurement of transcription factordissociation excludes a simple operator occupancy model for gene regulation. Nature Genetics. 2014 Apr;46(4):405–408.[59] Cepeda-Humerez SA, Rieckh G, Tkaˇcik G. Stochastic proofreading mechanism alleviates crosstalk in transcriptionalregulation. arXiv:150405716 [q-bio]. 2015 Apr;ArXiv: 1504.05716. Available from: http://arxiv.org/abs/1504.05716 .[60] Brewster RC, Jones DL, Phillips R. Tuning Promoter Strength through RNA Polymerase Binding Site Design in Escherichiacoli. PLoS Computational Biology. 2012 Dec;8(12).[61] Razo-Mejia M, Boedicker JQ, Jones D, DeLuna A, Kinney JB, Phillips R. Comparison of the theoretical and real-worldevolutionary potential of a genetic circuit. Physical Biology. 2014 Apr;11(2):026005.[62] Haldane A, Manhart M, Morozov AV. Biophysical Fitness Landscapes for Transcription Factor Binding Sites. PLoSComput Biol. 2014 Jul;10(7):e1003683.[63] McKeown AN, Bridgham JT, Anderson DW, Murphy MN, Ortlund EA, Thornton JW. Evolution of DNA Speciﬁcity ina Transcription Factor Family Produced a New Gene Regulatory Module. Cell. 2014 Sep;159(1):58–68.[64] Weinert FM, Brewster RC, Rydenfelt M, Phillips R, Kegel WK. Scaling of Gene Expression with Transcription-FactorFugacity. Physical Review Letters. 2014 Dec;113(25):258101.[65] Maerkl SJ, Quake SR. A Systems Approach to Measuring the Binding Energy Landscapes of Transcription Factors.Science. 2007 Jan;315(5809):233–237.[66] Kinney JB, Murugan A, Callan CG, Cox EC. Using deep sequencing to characterize the biophysical mechanism of atranscriptional regulatory sequence. Proceedings of the National Academy of Sciences. 2010 May;107(20):9158–9163.[67] Fields DS, He Yy, Al-Uzri AY, Stormo GD. Quantitative speciﬁcity of the Mnt repressor 1. Journal of Molecular Biology.1997 Aug;271(2):178–194.[68] Mirny LA. Nucleosome-mediated cooperativity between transcription factors. Proceedings of the National Academy ofSciences. 2010 Dec;107(52):22534–22539.[69] Taylor MS, Ponting CP, Copley RR. Occurrence and Consequences of Coding Sequence Insertions and Deletions inMammalian Genomes. Genome Research. 2004 Apr;14(4):555–566.[70] Brandstr¨om M, Ellegren H. The Genomic Landscape of Short Insertion and Deletion Polymorphisms in the Chicken (Gallusgallus) Genome: A High Frequency of Deletions in Tandem Duplicates. Genetics. 2007 Jul;176(3):1691–1701.[71] Park L. Ancestral Alleles in the Human Genome Based on Population Sequencing Data. PLoS ONE. 2015May;10(5):e0128186.[72] Cartwright RA. Problems and Solutions for Estimating Indel Rates and Length Distributions. Molecular Biology andEvolution. 2009 Feb;26(2):473–480.[73] Chen JQ, Wu Y, Yang H, Bergelson J, Kreitman M, Tian D. Variation in the Ratio of Nucleotide Substitution and IndelRates across Genomes in Mammals and Bacteria. Molecular Biology and Evolution. 2009 Jul;26(7):1523–1531. [74] Lee H, Popodi E, Tang H, Foster PL. Rate and molecular spectrum of spontaneous mutations in the bacterium Escherichiacoli as determined by whole-genome sequencing. Proceedings of the National Academy of Sciences. 2012 Oct;109(41):E2774–E2783.[75] Keightley PD, Johnson T. MCALIGN: Stochastic Alignment of Noncoding DNA Sequences Based on an EvolutionaryModel of Sequence Evolution. Genome Research. 2004 Mar;14(3):442–450.[76] Wright S. Evolution in Mendelian Populations. Genetics. 1931 Mar;16(2):97–159.[77] Sella G, Hirsh AE. The application of statistical physics to evolutionary biology. Proceedings of the National Academy ofSciences of the United States of America. 2005;102:9541–9546.[78] Mustonen V, L¨assig M. Evolutionary population genetics of promoters: Predicting binding sites and functional phylogenies.Proceedings of the National Academy of Sciences of the United States of America. 2005 Nov;102(44):15936–15941.[79] Mustonen V, Kinney J, Callan CG, L¨assig M. Energy-dependent ﬁtness: A quantitative model for the evolution of yeasttranscription factor binding sites. Proceedings of the National Academy of Sciences of the United States of America. 2008Aug;105(34):12376–12381.[80] Barton NH, Coe JB. On the application of statistical physics to evolutionary biology. Journal of Theoretical Biology. 2009Jul;259(2):317–324.[81] Manhart M, Haldane A, Morozov AV. A universal scaling law determines time reversibility and steady state of substitutionsunder selection. Theoretical Population Biology. 2012 Aug;82(1):66–76.[82] Paix˜ao T, Heredia JP, Sudholt D, Trubenova B. First Steps Towards a Runtime Comparison of Natural and ArtiﬁcialEvolution. In: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2015, Madrid, Spain, July11-15, 2015. ACM; 2015. p. 1455–1462.[83] Otto SP, Day T. A Biologist’s Guide to Mathematical Modeling in Ecology and Evolution. Princeton University Press;2007.[84] Giorgetti L, Siggers T, Tiana G, Caprara G, Notarbartolo S, Corona T, et al. Noncooperative Interactions betweenTranscription Factors and Clustered DNA Binding Sites Enable Graded Transcriptional Responses to EnvironmentalInputs. Molecular Cell. 2010 Feb;37(3):418–428.[85] Weirauch MT, Yang A, Albu M, Cote AG, Montenegro-Montero A, Drewe P, et al. Determination and inference ofeukaryotic transcription factor sequence speciﬁcity. Cell. 2014 Sep;158(6):1431–1443.[86] Rajon E, Masel J. Compensatory Evolution and the Origins of Innovations. Genetics. 2013 Jan;193(4):1209–1220. SUPPORTING INFORMATIONOther ﬁtness models for comparison & for interacting TFBSs

Power-law decaying ﬁtness models for comparison:

In order to understand the importance of the thermodynamically-motivated sigmoid shape for the binding prob-ability, we compare our results to those obtained with power-law functions that decay with exponent γ (note that γ = ∞ corresponds to a step-like ﬁtness landscape), formally deﬁned as π pl ( k ) = (cid:40) π TD ( k ) k ≤ k S (cid:0) k S /k (cid:1) γ π TD ( k S ) k > k S . (31)Fig 8 shows that the power-law exponent is a major determinant of the gain rates, suggesting that a biophysicallyrealistic ﬁtness landscape is crucial for the quantitative understanding of TFBS evolution. Fitness models of interacting TFBSs in larger regulatory sequence:

In addition to physical cooperativity between nearby TFs on promoter/enhancers (see the Models & Methods, Fig 5and Fig 13), here we also consider two other models. The ﬁrst additional model assumes that the binding occupancyof the strongest binding site in the regulatory sequence is the proxy for the gene expression level and the ﬁtness, i.e. f ( σ ) = s MAX { π (i) ( σ ) } . (32)Note that diﬀerent TFBSs interact with each other to compete for the strongest binding within a promoter or anenhancer.The second additional model addresses synergistic interaction between the two strongest-binding TFBS, locatedanywhere in the regulatory sequence. This example is a simpliﬁed version of a biophysical model where TFs, bindinganywhere in a regulatory region, compete for the occupancy of that region with a nucleosome (for a more elaborativemodeling framework, see Mirny (2010) [68]). We call this type of interaction between two TFs “non-physical” becauseTFs don’t interact directly; their interaction is eﬀectively mediated by some other biophysical process. The probabilityof the joint occupancy of the two TFs at promoter or enhancer can be used as the proxy for gene expression level andthe ﬁtness, i.e. f ( σ ) = s e − β ( (cid:15) ( k + k ) − µ ) e − β ( (cid:15)k − µ ) + e − β ( (cid:15)k − µ ) + e − β ( (cid:15) ( k + k ) − µ ) , (33)where k and k correspond to the genotypes of two TFBSs with the smallest mismatches in the regulatory sequence.Do these models yield diﬀerent result for the emergence of strong binding sites from random sequences at earlyevolutionary times ( ∼ speciation time scales), in comparison to our main model, where the sum of binding occupanciesis used as a proxy for gene expression level [Eq(7) in the main text]? For typical biophysical parameters (bindinglenght: n = 7 bp, binding speciﬁcity: (cid:15) = 2 k B T and chemical potential: µ = 4 k B T ), we show in Fig 14 that thesemodiﬁed models do not diﬀer extensively from results of our main model.2 ● ● ● ● ● ● ● ● — — — — — — — —— — — — — — — — P k ' , k ( i nde l ) k = ● ● ● ● ● ● ● ● — — — — — — — —— — — — — — — — = ● ● ● ● ● ● ● ● — — — — — — — —— — — — — — — — = ● ● ● ● ● ● ● ● — — — — — — — —— — — — — — — — = ● ● ● ● ● ● ● ● — — — — — — — —— — — — — — — — P k ' , k ( i nde l ) k = ● ● ● ● ● ● ● ● — — — — — — — —— — — — — — — — = ● ● ● ● ● ● ● ● — — — — — — — —— — — — — — — — = ● ● ● ● ● ● ● ● — — — — — — — —— — — — — — — — = ● ● ● ● ● ● ● ● — — — — — — — —— — — — — — — — P k ' , k ( i nde l ) k = ● ● ● ● ● ● ● ● — — — — — — — —— — — — — — — — = ● ● ● ● ● ● ● ● — — — — — — — —— — — — — — — — = ● ● ● ● ● ● ● ● — — — — — — — —— — — — — — — — = ● ● ● ● ● ● ● ● — — — — — — — —— — — — — — — — P k ' , k ( i nde l ) k = ● ● ● ● ● ● ● ● — — — — — — — —— — — — — — — — = ● ● ● ● ● ● ● ● — — — — — — — —— — — — — — — — = ● ● ● ● ● ● ● ● — — — — — — — —— — — — — — — — = ab FIG. 6:

Indel mutations connect the mismatch genotype space diﬀerently from point mutations. a)

Probabilitythat a binding site with k mismatches mutates to k (cid:48) mismatches, for a single binding site of length n = 7 bp, according to ourindel mutation model in a ﬁxed genomic window (see the Models & Methods section). Dashed curve = analytical predictionaccording to Eq (13). Red points = mean ± replicate realizations of the frequency distribution (for each replicate,1 consensus sequence is created and 10 mutations are simulated for each k ). b) The same analysis as in a), but allowing fora ﬂexible genomic window for alignment after insertion mutations. We pick the minimal mismatch case to asses the quality ofour approximation. As expected, this creates a bias towards smaller mismatch classes, but suggests that our approximation isstill reasonable. ϵ = k B T, μ = k B T ϵ = k B T, μ = k B T ϵ = k B T, μ = k B T ϵ = k B T, μ = k B Tn log ( )/ - n t h r e s ho l d N s Ns = = n log ( )/ = f r equen cy ψ k n = ϵ = k B T, μ = k B T Ns = = n log ( )/ = f r equen cy ψ k n = ϵ = k B T, μ = k B T FIG. 7:

Threshold value of Ns for bimodality (i.e., threshold between strong and weak selection regimes) .The value of Ns at which 5% of the probability weight in the stationary distribution is in non-strong mismatch classes, i.e. k > k S . For selection stronger than this threshold, the stationary distribution is concentrated at low k (high ﬁtness) classesand is practically unimodal. Diﬀerent colors correspond to diﬀerent biophysical parameters (see legend), analytical prediction n log(2) / Ns values for short and long binding sites. n = ϵ = k B T; μ = k B T Strong π > / π < / π TD ( k ) π pl ( k ) γ = π pl ( k ) γ = π pl ( k ) γ = ∞ b i nd i ngp r ob . π - Nsu10 - Nsu10 - Nsu10 - Nsu10 Nsu binding length - n s pe c i f i c i t y - ϵ / 〈 t 〉 ← k ∼ n / , θ = ⊕ - Nsu10 Nsu binding length - n s pe c i f i c i t y - ϵ / 〈 t 〉 ← k ∼ n / , θ = ⊕ - Nsu10 - Nsu10 Nsu10 Nsu binding length - n s pe c i f i c i t y - ϵ / 〈 t 〉 ← k ∼ n / , θ = ⊕ - Nsu10 - Nsu10 - Nsu10 - Nsu10 Nsu 10 Nsu10 Nsu10 Nsu binding length - n s pe c i f i c i t y - ϵ / 〈 t 〉 ← k ∼ n / , θ = ⊕ a bc , γ = γ = γ = ∞ FIG. 8:

Single TFBS gain rates in modiﬁed ﬁtness landscapes with a power-law tail.

The thermodynamic ﬁtnesslandscape has been modiﬁed to have a power-law decaying tail of exponent γ for k > k S , as in Eq (31) in SI text. We tested γ = 1, 2 and ∞ corresponding to smooth, intermediate and step-like decay. Plot conventions are the same as in Fig 2C. b) Isolated TFBS gain rate from the most redundant mismatch class for the thermodynamic model, replotted from Fig 2C forreference. c) Plots analogous to b) using modiﬁed ﬁtness landscapes deﬁned by the power-law exponent γ . Gain rates arehigher for small γ = 1 and lower for the step landscape ( γ = ∞ ), relative to the reference. ● ● ● ● ● ●            ◆ ◆ ◆ ◆ ◆ ◆            ● ● ● ● ● ●            ◆ ◆ ◆ ◆ ◆ ◆            ● k = θ = ( point ) ● k = θ = ( point ) ◆ k = θ = ( point & indel ) ◆ k = θ = ( point & indel ) r a t eo f ga i n i ng TF BS / 〈 t 〉  ← k [ ⨯ u ] ⟵ point mutation rate --- approx., θ = ( point ) FIG. 9:

The eﬀect of polymorphisms on the single TFBS gain rate at higher mutation rates.

Wright-Fishersimulation results (point markers, error bars = 2 standard errors of the mean) at 4 Nu = 0 .

1, in comparison to the ﬁxed statemodel (continuous curves). Plot conventions are the same as in Fig 2. Biophysical parameters used: n = 7, (cid:15) = 2 k B T , µ = 4 k B T . Polymorphisms generally decrease TFBS gain rates. ● ● ● ● ● ●            ◆ ◆ ◆ ◆ ◆ ◆            ● ● ● ● ● ●            ◆ ◆ ◆ ◆ ◆ ◆            ● k = θ = ( point ) ● k = θ = ( point ) ◆ k = θ = ( point & indel ) ◆ k = θ = ( point & indel ) r a t eo f ga i n i ng TF BS / 〈 t 〉  ← k [ ⨯ u ] ⟵ point mutation rate --- approx., θ = ( point ) FIG. 10:

Relaxing the mismatch assumption.

Fig 2, but using energy matrices whose nonzero entries are gaussianrandom variables ε i , such that (cid:104) ε i (cid:105) = (cid:15) = 2 k B T and σ ε = 0 . k B T ; n = 7, µ = 4 k B T . The analytical results under the equalmismatch assumption are shown in continuous lines. k = θ = ( point & indel ) k = θ = ( point & indel ) k = θ = ( point ) k = θ = ( point ) (- Ns ) r a t eo f l o s i nga TF BS / 〈 t 〉  ← k [ ⨯ u ] =- = = + + Ns --- approx., θ = ( point ) - Nsu - Nsu - n s pe c i f i c i t y - ϵ / 〈 t 〉 ← k = , θ = |- ⊕ -| - - Nsu -

10 Nsu - n s pe c i f i c i t y - ϵ / 〈 t 〉 ← k = , θ = |- ⊕ -| - a bc FIG. 11:

Single TF binding site loss rates at an isolated DNA region.

The dependence of the loss rate, 1 / (cid:104) t (cid:105) W← k shown in units of point mutation rate, from sequences in diﬀerent initial mismatch classes k (blue: k = 2, red: k = 0), as afunction of negative selection strength. Results with point mutations only ( θ = 0) are shown by dashed line; with admixture ofindel mutations ( θ = 0 .

15) by a solid line. For strong selection, | Ns | (cid:29)

1, the rates scale with 2 | Ns | nu , which is captured wellby the “shortest path” approximation (black dashed lines in the main ﬁgure) of Eq (24). The biophysical parameters are: sitelength n = 7 bp; binding speciﬁcity (cid:15) = 2 k B T ; chemical potential µ = 4 k B T . Left inset: Ns -scaling with positive selection.Right inset: gain rates as a function of the initial mismatch class k for diﬀerent Ns . b, c) Loss rates from the consensussequence ( k = 0) under strong negative selection, without (b) and with (c) indel mutations supplementing point mutations.Red crosshairs denote the cases depicted in panel a). Contour lines show constant loss rates in units of Ns u as a function ofbiophysical parameters n and (cid:15) . - - - - - ne w TF BS z - z o [ ⨯ N s L ] n = ϵ = k B T - - - - - n =

10 bp ϵ = k B T - - - - - n =

12 bp ϵ = k B T - - - - - ne w TF BS z - z o [ ⨯ N s L ] n = ϵ = k B T - - - - - n =

10 bp ϵ = k B T - - - - - n =

12 bp ϵ = k B T - - - - - [ ⨯ u - ] ne w TF BS z - z o [ ⨯ N s L ] n = ϵ = k B T - - - - - [ ⨯ u - ] n =

10 bp ϵ = k B T - - - - - [ ⨯ u - ] n =

12 bp ϵ = k B T — similations ( mean ± ⨯ SEM ) --- prediction ( single TFBS ) FIG. 12:

TFBS evolution in longer sequences . Example simulations (black solid line) and analytic predictions based onsingle TFBS gain/loss rates (black dashed line), for diﬀerent binding length n and speciﬁcity (cid:15) . Details are identical to Fig. 4. - - - - - - ne w TF BS z - z o [ ⨯ N s L ] ● ● ● ● ● - - - - - Ancient site kt = / u n = ϵ = k B T - - - - - - ● ● ● ● ● - - - - - Ancient site kt = / u n =

10 bp ϵ = k B T - - - - - - ne w TF BS z - z o [ ⨯ N s L ] n = ϵ = k B T - - - - - - n =

10 bp ϵ = k B T - - - - - - ne w TF BS z - z o [ ⨯ N s L ] ● ● ● ● ● - - - - - Ancient site kt = / u n = ϵ = k B T - - - - - - ● ● ● ● ● - - - - - Ancient site kt = / u n =

10 bp ϵ = k B T - - - - - - ne w TF BS z - z o [ ⨯ N s L ] n = ϵ = k B T - - - - - - n =

10 bp ϵ = k B T - - - - - - [ ⨯ u - ] ne w TF BS z - z o [ ⨯ N s L ] ● ● ● ● ● - - - - - Ancient site kt = / u n = ϵ = k B T - - - - - - [ ⨯ u - ] ● ● ● ● ● - - - - - Ancient site kt = / u n =

10 bp ϵ = k B T - - - - - - [ ⨯ u - ] ne w TF BS z - z o [ ⨯ N s L ] n = ϵ = k B T - - - - - - [ ⨯ u - ] n =

10 bp ϵ = k B T — w / o cooperativity — w / cooperativity (- k B T ) — w / cooperativity (- k B T ) — w / cooperativity (- k B T ) --- prediction ( single TFBS ) — without ancient site — with ancient site ( k = ) — with ancient site ( k = ) --- predictions ( single TFBS ) a b FIG. 13:

The eﬀect of ancient sites (a) and cooperativity (b) for diﬀerent binding lengths and speciﬁcities .Simulations of TFBS evolution in longer sequences (colored lines) and analytic predictions based on single TFBS gain and lossrates (dashed black lines), analogous to Fig. 5. Diﬀerent panels show diﬀerent choices of TFBS binding length n and speciﬁcity (cid:15) .Ancient sites speciﬁcally facilitate the emergence of longer sites of high speciﬁcity, whereas cooperativity speciﬁcally facilitatesthe emergence of shorter sites of intermediate or low speciﬁcity. - - - - - - [ ⨯ u - ] ne w TF BS z - z o [ ⨯ N s L ] n = ϵ = k B T - - - - - - [ ⨯ u - ] ne w TF BS z - z o [ ⨯ N s L ] n = ϵ = k B T — Main model — — Synergy of 2 TFBSs --- prediction ( single TFBS ) L =

30 bp L =

50 bp

FIG. 14:

Fitness models of interacting TFBSs.

The expected number of newly evolved TFBS for binding site length n = 7 bp, speciﬁcity (cid:15) = 2 k B T , and chemical potential µ = 4 k B T are shown for diﬀerent ﬁtness models. The solid black curveis the non-interacting model used in the main text (dashed curve: theoretical prediction). The green curve stands for the modelof Eq (32) in SI text, where only the strongest binding site in the regulatory sequence determines gene expression. The purplecurve stands for the model of Eq (33) in SI text, where two strongest TFBS synergistically determine the gene expression level.Shading denotes ± L = 30 bp (left) and L = 50 bp (right). ---- θ = θ = / 〈 t 〉 ← k = / 〈 t 〉 ← k ∼ n / / 〈 t 〉 k = ← k = / 〈 t 〉 k = ← k = r a t e s [ ⨯ u - ] ---- θ = θ = / 〈 t 〉 ← k = / 〈 t 〉 ← k ∼ n / / 〈 t 〉 k = ← k = / 〈 t 〉 k = ← k = r a t e s [ ⨯ u - ] n = ϵ = k B T μ = k B T n =

10 bp ϵ = k B T μ = k B T FIG. 15:

Comparison rates of TFBS gain rates and sequence turnover rates within functional TFBSs

Averageﬁrst hitting times to particular mismatch k j state can be calculated with a minor modiﬁcation to Eq (21) by replacing S with k j . The ﬁgures compare the rates of evolution of TFBS within the functional sites (i.e. 1 / (cid:104) t (cid:105) k =0 ← k =1 and 1 / (cid:104) t (cid:105) k =1 ← k =0 ). Plotconventions are the same as in Fig 2-A. Biophysical parameters used: n = 7 bp (left), n = 10 bp (right) (cid:15) = 2 k B T , µ = 4 k B T .It shows that for weak selection, the rates to evolve from k = 0 to k = 1 can be relatively faster. Also, although adaptationfrom random sites slows down with increasing n , we see that the adaptation rate to evolve from k = 1 to kk