aa r X i v : . [ m a t h - ph ] O c t Fragmentation dynamics of DNA sequence duplications
M.V. Koroteev and J. Miller ∗ Physics and Biology Unit, Okinawa Institute of Science and Technology(Graduate University) Kunigami 1919-1, Onna-son, Okinawa 904-2234, Japan
Motivated by empirical observations of algebraic duplicated sequence length distributions in abroad range of natural genomes, we analytically formulate and solve a class of simple discrete dupli-cation/substitution models that generate steady-states sharing this property. Continuum equationsare derived for arbitrary time-independent duplication length source distribution, a limit that weshow can be mapped directly onto certain fragmentation models that have been intensively studiedby physicists in recent years. Quantitative agreement with simulation is demonstrated. These mod-els account for the algebraic form and exponent of naturally occuring duplication length distributionswithout the need for fine-tuning of parameters.
PACS numbers:
A century has elapsed since the earliest reports of theevolutionary impact of gene duplication[1]. At the timethere existed only a macroscopic and phenomenologicalconception of genetic material, but within the last decadestatic characterization of the finest details of the latterhas become routine. Its dynamics, on the other hand, re-mains for the most part only indirectly accessible; ‘snap-shots’of complete individual genomes at short time inter-vals are not yet practical, and dynamics must be inferredfrom their cumulative effect on representative genome se-quences.This dynamics is important because to a good approx-imation genome sequence determines, via natural selec-tion, the fates of individuals and of species - but our un-derstanding of how this happens is primitive. Contempo-rary lineages are our primary source of genome sequence,making it difficult to associate the presence or absenceof most genomic features with their effects, if any, on anindividual. Indeed, a primary goal of modern genomicsis to determine if, when, and on what time scales thesequence evolution reflects selection.
Neutral models of sequence evolution - sequence dy-namics that, on the time scales of interest, are indepen-dent of selection - underlie all methods that we know ofto achieve this goal [2]. When sequences common to twodifferent organisms, or that appear multiply within thesame genome, exhibit identity exceeding (falling short of)that expected on given model of neutral evolution, it istaken as evidence either that negative (positive) selectionis acting on these sequences, or that the neutral model isflawed. As a given sequence fragment has some chance ofexhibiting any excess or shortfall of identity within a neu-tral model, selection is inferred probablistically by study-ing frequencies of the levels of sequence identity withinor between genomes [2]. Thus, length distributions ofsimilar or identical sequences have traditionally played ∗ Electronic address: [email protected] a fundamental role in genomics and molecular genetics,and our interpretation of genomic sequence relies uponour understanding of these distributions.The topic of this manuscript originates in an empir-ical observation of algebraic duplicated-sequence lengthdistributions in a broad range of natural genomes [3–6].In [7] an empirical model of duplication was proposedthat accounted for the observed algebraic distribution ofduplicated sequence lengths in natural genomes, but re-lied on tuning the length distribution of the duplicationsource. Here we analytically derive and solve an alterna-tive model for which no such tuning is required.The action of duplication is to copy a sequence frag-ment and subsequently to insert it, or to substitute itfor a same-sized sequence fragment, elsewhere in a chro-mosome [8]. Standard models of sequence evolution alsoincorporate random, uncorrelated base substitution.A chromosome consists of a string of L bases chosenfrom a finite alphabet; in natural genomes the alphabetis typically represented by four bases A, G, C, and T; forsimplicity and without loss of generality we use here atwo base alphabet. The fundamental sequence elementthat we study is the the set of repeated sequences withinthe chromosome, counted in an algorithm-independentway. Specifically, we study ‘supermaximal repeats’ (or‘super maxmers’): sequence duplications neither copy ofwhich is contained in any longer sequence duplicationwithin the chromosome [5, 7]. From now on, we refer toa supermaxmer of length m simply as an m -mer.Within our models duplications occur with the rate β per unit time: namely, a subsequence of the length m is chosen randomly within the chromosome according toa predetermined source distribution P ( m ) and is susbti-tuted for a sequence of length m at another randomlychosen position in the chromosome. It was numericallydemonstrated [7] that for monoscale sources and certainpower-law source distributions, the duplication lengthdistribution attained a stationary state at long times.We denote the ensemble-averaged number of m -mersat time t as f ( t, m ). For a monoscale source P ( m ) = δ c ( D − m ) ( δ c , Kronecker delta), we expect at station-arity that f ( t, m ) will decay rapidly for m > D . Thereare two processes altering the number of m -mers: a newduplication of fixed length D can fragment an existing m -mer or generate a new m -mers by fragmenting a longer m -mer; processes of higher order in D/L are ignored,where
D, L >>
D << L .The time dependence of f can be represented by termsof the form uf ( t, m ), u being a coefficient describing therate of creation or destruction of corresponding m -mers.An m -mer is annihilated when a newly created duplica-tion of length D overlaps with one of the sequences com-posing the m -mer; the rate of this event is 2( D + m − a ) /L ,where a is the length of a single base. Alternatively an m -mer may be created when a newly created duplicationoverlaps with an m + k -mer, k >
0. The probability thata m + k -mer produces an m -mer is 4 a/L .Supermaxmers may also be annihilated by base sub-stitution. Substitution occurs with rate µ per time unitper unit length (in bases); duplication with rate β , mea-sured in 1 / time unit. Then the balance equation takesthe form f ( t +1 , m ) − f ( t, m ) = − (cid:20) m + D − aL β + µm (cid:21) f ( t, m )++ (cid:20) L aβ + 4 aµ (cid:21) D X k = m +1 f ( t, k ) + 2 βδ c ( D − m ) . (1)The dimensions of (1) are correct as we take ∆ t = 1,as it is seen from lhs of the equation. The solution of (1)converges to a stationary one. To see this, take µ = 0, β = 1, a = 1 [base], and note that (1) may be representedin matrix form as ~f ( t + 1) = A ~f ( t )+ ~δ c , where the matrix A is such that A ii = 2(1 − ( D +1) /L ), A ij = 4 /L for i < j ,and A ij = 0, for i > j , i, j = 1 , , . . . D ; the vector ~δ c is δ D = 2, δ i = 0, i < D . The matrix is upper triangularand its eigenvalues are readily computed yielding λ i =1 − β ( D + ( i − /L . It is evident that 0 < | λ i | < i , as we assumed D << L and i = 1 , , . . . D , andconsequently, the iteration is guaranteed to converge. For µ = 0 the eigenvalues have the form λ i = 1 − β i + D − L − µi ,thus the requirement of the convergence to a stationarystate | λ | < D << L , µ ∆ τ < /D , ∆ τ being a time step.If some initial state ~f (0) is given and if we denote by ~f s the limiting stationary state of the system, we cancalculate ~f s as follows ~f s = lim t →∞ ~f ( t ) = lim t →∞ " T t − X k =0 Λ k T − ~δ + T Λ t T − ~f (0) = T lim t →∞ t − X k =0 Λ k T − ~δ, (2) log(length of maxmers) l og ( o f m ax m er s ) simulation. µ=10 −4 simulation. µ=10 −5 simulation. µ=10 −6 simulation. µ=10 −8 analytic solution effect of point substitutions for supermaxmers FIG. 1: Curves represent stationary states of the system de-scribed in the paper for various base substitution rates µ andcorresponding analytic solutions of the system (1) for β = 1.The chromosome length L = 10 ; source length D = 1024.Increasing base substitution rates exhibits a power-law tailwith the exponent −
3. Note also the match of amplitudesbetween simulations and solution. where A = T Λ T − and T diagonalizes A and consistsof eigen vectors of A . Further estimates show that ~f s = LT ˜Λ T − ~δ , where, e.g., for the case µ = 0 we have ˜Λ ji =1 / ( D + i − j = i and ˜Λ ji = 0 when j = i . We maywrite down the exact stationary solution of the equation(1) in scalar form f ( m, D, L, µ ) = " D − ( m − a ) β D +( m − a ) L + mµ − D − mβ D + mL + ( m + a ) µ + D − ( m + a ) β D +( m + a ) L + ( m + 2 a ) µ , m < D. (3)Obvious scaling wrt. L is observed when µ = 0. Com-parisons to the empirical model [7] with (3) for µ = 0are presented in fig. 1. It is evident that with increas-ing base substitution rate both simulations and the solu-tion demonstrate power-law behavior with the exponent −
3. To obtain this exponent from the solution (3) ob-serve that f in (3) is approximately represented as g ′′ ( x ),where g ( x ) = ( D − x ) / ( β D + xL + µx ). Then one obtains f ( m, D, L, µ ) ∼ / ( β D + mL + mµ ) . The peak observedin the left-hand side of the length distributions reflectsthat of a random sequence of the length L . High muta-tions conserve this random part of the distribution whileduplications tend to distort the statistic. The maxi-mum of the peak is located in the point m = log L for binary sequence. For the maximum length of super-maxmers in a random binary sequence there is an esti-mate M L ∼ L [10] which thus corresponds to thewidth of the peak.For dynamics described by a power-law source p ( m ) ∼ /m γ the characteristic scale D is replaced by the first log (length of maxmers) l og ( o f m ax m er s ) E=20E=67E=143E=467analytic solution slope γ out = -3.4 µ=0 FIG. 2: Curves represent stationary states of the system de-scribed in [7] for various first moments M of the power-lawsource of the form p (¯ x ) ∼ / ( ¯ m + ¯ x γ ) compared to the solu-tion of (4). L = 10 , µ = 0, γ = − .
4. Deviation of solutionfrom simulations is observed for small M , when the condi-tion M >> a is violated. For large lengths it is observed theregime with the exponent γ + 1 described in [7]. Values ¯ m corresponding to M on the plot are: 10, 35, 75, 250.. moment of the source. We also make all lengths dimen-sionless, dividing them by the length of 1 base a or bythe first moment of the source M . Then, we have forprobabilities of fragmentation p ( m ) = 1 / ( m γ φ ( γ, N )),where φ ( γ, N ) = ζ ( γ ) − ζ ( γ, N + 1), L = N a , and ζ ( γ )is the Riemann zeta-function, ζ ( γ, N + 1) is the general-ized zeta-function. The equation for a finite-size systemwith a power-law source can be obtained similarly to thatfor the monoscale source and has the form∆ f ( t, m ) = − " N N X r =1 m + r − φ ( γ, N ) r γ β + maµ f ( t, m )++ (cid:20) N β + 4 aµ (cid:21) N X k = m +1 f ( t, k ) + 2 β φ ( γ, N ) m γ , (4)and m, r = 1 , , . . . , N .The structure of the source makes an analytic repre-sentation unwieldy; the solution of (4) as t → ∞ wasobtained numerically. The comparison of the solutionwith simulations is presented in fig. 2. Evidently equa-tion (4) reproduces both exponent and amplitude andthus the dynamics involved encompasses both small andlarge mutation rates.Finally, some continuum limits following from thesediscrete dynamics are obtained. As all lengths are mea-sured in bases we dimensionalize them as follows: ¯ a = a/M , ¯ m = m/M , ¯ L = L/M , ¯ t = tβ , ¯ µ = M µ/β .Then (4) with an arbitrary source P ( ¯ m ) takes the form∆ f ( t, m )∆ t = − (cid:20) m − ¯ a ¯ L + ¯ µ ¯ m (cid:21) f + log (length of maxmers) -1 l og ( av er ag e o f m ax m er s ) µ=10 −7 , E=25 µ=10 −5 , E=100 µ=10 −5 , E=500 µ=10 −3 , E=1000analytic solution slope -3 slope -4
FIG. 3: Curves represent stationary states of the system de-scribed in the Letter for L = 10 and varying M and µ com-pared to the solution of (4). The distributions are obtained byaveraging over 100 realizations. These are shown the regime µ − ≈ M corresponding to the continuum equation and theregime with small mutation rate µ studied in [7], γ = − . + (cid:20) a ¯ L + 4¯ a ¯ µ (cid:21) N X k = m +1 f + 2 P ( ¯ m ) . (5)Taking ¯ a → f = ¯ a ˆ f , P = ¯ a ˆ p with ¯ L = N ¯ a , ¯ m = n ¯ a , the equation becomes∆ ˆ f ( t, m )∆ t = − (cid:20) n ¯ a − ¯ aN ¯ a + ¯ µ ¯ an (cid:21) ˆ f ++ (cid:20) N ¯ a + 4¯ µ (cid:21) N X k = m +1 ˆ f ¯ a + 2ˆ p ( ¯ m ) . (6)Additionally, we introduce a duplication rate λ measuredper base taking β = λL .As ¯ a → a << M , and as µ and λ dimin-ish with a , we need to evaluate the orders of correspond-ing terms. We keep the product n ¯ a finite and denote it by x ; this implies n ∼ ¯ a − and corresponds to an intermedi-ate regime[12] for (5). The source term has the order ∼ ∼ ¯ a α − ,and mutation terms the order ¯ µ ∼ ( µ/λ )¯ a α − as ¯ a → µ ∼
1, ¯ a α − = o (1) , ¯ a →
0, otherregimes being described elsewhere. Taking into accountthat in this regime ¯ L → ∞ and L >> M , the equationtakes the form ∂ ˆ f∂ ¯ t = − x ¯ µ ˆ f + 4¯ µ Z ∞ x ˆ f (¯ t, y ) dy + 2ˆ p ( x ) + O (¯ a α − ) . (7)The main order regime corresponds to fragmentationwith input studied in [9].For the numerical simulations in fig. 1 we set L = 10 , M = D ≈ and vary µ ; thus, L >> M . As µ ap-proaches 10 the output distribution approaches an alge-braic form with exponent −
3, corresponding to the so-lution of the fragmentation equation for a monodispersesource[9, 11]. Fig. 2 corresponds to the regime with van-ishing µ and is not described by (7); fig. 3 demonstratesvarious regimes and exponent − M >> a , ¯ a α − = o (¯ µ ) and m << M .In fig. 4 we provide comparison of natural data withour simulations. Both chromosomes, one from humanand the other from grapevine, demonstrate good fit to − m -mer evolves indepen-dently of other m -mers. Therefore, following Ben-Naim[9] we can estimate the fragment length distribution byfollowing a typical duplication of length M . Substitu-tions break the duplication into fragments whose num-ber M varies in time as M = M µt so that the averagefragment length at time t is h m i = M / M . Consecu-tive mutations are independent, hence the distributionof fragment lengths is close to Poisson for ¯ a <<
1, i.e., p ( m ) ∼ m − exp( − m/ h m i ), neglecting contributions ex-ponentially in M . Then the number S ( t, m ) of fragmentsof length m is M p ( m ). New duplications occur contin-uously, so to obtain the fragment length distribution as t → ∞ we integrate over time to obtain S ( m ) = Z ∞ M βp ( m ) dt ∼ Z ∞ M βm e − m/ h m i dt = M βµm . (8)The similar result can be also obtained from the exactsolution of (7) to give S ( m ) = 2 M β/ ( µm ); the addi-tional prefactor 2 appears as in the equations for eachsequences we count another one, which is identical to theformer. The consistency of dimensions follows from thepresence of additional prefactor a , which is equal to 1 forthe discrete case.The estimate allows to calculate M from the empir-ical distributions, e.g., for fig. 1 β = 1, µ = 10 − inthe algebraic regime. Then we can estimate S ( m ) for m = 1 from the plot [13](supplemental figure 1)); we have S ( m ) ≈ × and from (8) we find M ≈ whichpretty well corresponds to the real value M = 1024 forthis simulation. Similar estimates can be obtained fordifferent sources.For eukaryotes, gene duplicatons are conventionally es-timated to arise at around ∼ − per gene per 10 y (years)[14]; assuming ∼ genes per eukaryotic genomeyields a genome-wide gene duplication rate[14] β =10 / y . Thus for the human genome with around4 × genes, β ≈ / y [14]. Then, for duplicationrate per base we have λ = β /L , where L the numberof bases belonging to genes; accepting the estimate L log (length of maxmer) l og ( o f s up er m ax m er s ) homo sapiens, chr 3, 4 base. rmvitis vinifera, chr 6, 4 base, non-rmsimulation. 4 base. D=[900, 1100], µ=10 −3 simulation.4 base. m =500, E=923, µ=10 −3 slope -3 FIG. 4: Human chr. 3 (brown) was repeat-masked, grapevine(orange) was not. Source length distributions were chosen aspower-law with exponent − . , x axisto the right by the factor of 2 for clarity. Straight lines withthe slope − M β/µ in accordance with(8) and the exact solution of (7). For the human chr. 3 wehave 2 M β/µ ≈ × , and for vitis vinifera chr. 6, ≈ . to be 2% of L we can estimate λ . Let us assume thatduplications, as well as point substitutions, are uniformover genome, i.e., duplication rate per base λ = λ andfor β we have β ≈ β . The average time between du-plications, 1 /β = 200 y corresponds to a single time unitin the parametrization of our models. The point substi-tution rate is µ ∼ − per base per 10 y [14] or ∼ − per base per time unit. The algebraic regime with expo-nent − µ = 10 − in fig. 1 and µ/λ = 100 >> λ ≈ . × − per base, per 10 y and consequently β ≈ . × (wetake into account here that the length of repeat-maskedchromosome differs significantly from that of the wholesequence). Then, we find M ≈ M β ≈ . × bases related to supermaxmerswere duplicated in human genome per 10 years. Theregion of the tail in supp. fig. 2 corresponds to lengths m >
30; the tail contains ≈ bases, hence, assumingsimilar processes in different chromosomes, the observedlong identical duplicates occurred in the human genomelast 6 − − −
3. Thus the state of manycurrently studied genomes mapped to this regime of ourdynamics, may be understood as a result of continuousinteraction of point substitutions and (segmental) dupli-cations generated by some source. These results also pro-vide some evidence for the neutral nature of long segmen-tal duplications. On the other hand, the assumptions for(6) may be altered to obtain different regimes to includegenomes, whose state deviates from − M (or D ),the first moment of the source, appears in calculations todetermine the regime in which we observe genomes withthe exponent − M as there is only one specific regime with the ex-ponent − M turns out to beless important; 2) our dynamics allows to consider vari-ous asymptotic orders of µ/λ , not necessarily µ >> λ .Thus, − x → m << M for discrete equations (1),(4), i.e., the algebraic regime with the exponent − << M . If the source produces short duplicates, whichare in the same time are not hit by strong mutations(e.g., µ ∼ λ ) then the tail occurs for m >> M or inasymptotic regime x → ∞ for (7) and we may observeregimes with different exponents (fig. 3) or even non-algebraic regimes. The latter ones are also observed inreal genomes [13](supplemental figure 3) and can not betreated in terms of specific −3 exponent but are repro-duced by our dynamics. Thus we can map various asymp-totic regions of parameters to natural sequences to fit our observations in genomes, demonstrating variety of lengthdistributions for duplicates.We acknowledge Kun Gao, Eddy Taillefer and SatishVenkatesan for helpful duscussion of this work. We thankQuoc-Viet Ha for the help with computations. We alsothank Peter Arndt, Florian Massip, Nick Barton, DanielWeissman, and Tiago Paixao for discussions of this workand the manuscript.