Correcting the apparent mutation rate acceleration at shorter time scales under a Jukes-Cantor model
Christopher Tuffley, Timothy White, Michael D. Hendy, David Penny
aa r X i v : . [ q - b i o . P E ] A p r CORRECTING THE APPARENT MUTATION RATE ACCELERATIONAT SHORTER TIME SCALES UNDER A JUKES-CANTOR MODEL
CHRISTOPHER TUFFLEY, W TIMOTHY J WHITE, MICHAEL D HENDY AND DAVID PENNY
Abstract.
At macroevolutionary time scales, and for a constant mutation rate, thereis an expected linear relationship between time and the number of inferred neutral muta-tions (the “molecular clock”). However, at shorter time scales a number of recent studieshave observed an apparent acceleration in the rate of molecular evolution.We study this apparent acceleration under a Jukes-Cantor model applied to a ran-domly mating population, and show that, under the model, it arises as a consequenceof ignoring short term effects due to existing diversity within the population. The ac-celeration can be accounted for by adding the correction term h e − µt/ to the usualJukes-Cantor formula p ( t ) = (1 − e − µt/ ), where h is the expected heterozygosity inthe population at time t = 0. The true mutation rate µ may then be recovered, even if h is not known, by estimating µ and h simultaneously using least squares.Rate estimates made without the correction term (that is, incorrectly assuming thepopulation to be homogeneous) will result in a divergent rate curve of the form µ div = µ + C/t , so that the mutation rate appears to approach infinity as the time scale ap-proaches zero. While our quantitative results apply only to the Jukes-Cantor model,it is reasonable to suppose that the qualitative picture that emerges also applies tomore complex models. Our study therefore demonstrates the importance of properlyaccounting for any ancestral diversity, as it may otherwise play a dominant role in rateoverestimation. Introduction
It is well-known since Kimura (1968; 1983) that over time scales of tens and hundredsof millions of years the longer term rate of molecular evolution, k , is expected to be vir-tually the same as the short term rate of neutral mutations µ , that is, k ≈ µ . However,at shorter time scales a number of recent studies (for example Garc´ıa-Moreno (2004);Ho et al. (2005); Penny (2005); Millar et al. (2008); Henn et al. (2009); although therewere some earlier indications (Fitch and Atchley, 1985)) have observed an apparent ac-celeration in the rate of molecular evolution. This has led to debate as to the underlyingcauses of the apparent acceleration, and raised important questions as to how long itpersists and how to correct for it.A number of factors that may contribute to this apparent rate acceleration have beenproposed, and we refer the reader to Ho et al. (2011) for a recent review. One such factoris the effect of ancestral polymorphism (Peterson and Masel, 2009; Charlesworth, 2010).Indeed, it is clear that, if ancestral polymorphism is present and not properly accountedfor, then an apparent rate acceleration will inevitably be seen. Consider a comparisonbetween two sequences drawn from a population at time t = 0 which is incorrectly assumedto be homogeneous. Any differences between the sequences due to polymorphism will Date : 11th April 2012.
Key words and phrases.
Multiscale, short-term rate acceleration, ancestral polymorphism, divergentrate curve, Jukes-Cantor model. appear to be mutations that have occurred in zero time, leading to an apparent infinitesubstitution rate at time zero. Moreover, on sufficiently short time scales the averagedifference due to polymorphism will dominate any changes due to substitution, and tofirst order the expected difference between sequences at times 0 and t will be given by p ( t ) = C + µ ′ t . (Here µ ′ will differ slightly from the mutation rate µ , as some mutations willact to decrease polymorphism.) Naively dividing by t to recover µ , under the assumptionthat the first order approximation is given by p ( t ) = µt , then gives an apparent divergentrate curve µ ′ + C/t , suggesting that unaccounted ancestral polymorphism will contributea term of the form
C/t to rate estimates. This a priori estimate is in sharp contrast withthe use of exponential decay curves of the form k + µe − λt (for example, by Ho et al. (2005))to fit estimated rates.The effect of ancestral polymorphism on sequence divergence has been studied quan-titatively by Peterson and Masel (2009), who calculate the expected divergence betweentwo sequences as a function of time under the Moran model with selection. Their modelprovides a reasonable fit to data from Genner et al. (2007) and Henn et al. (2009), confirm-ing that the observed rate acceleration is consistent with an underlying constant mutationrate. However, their model does not admit a simple closed form analytic expression, andthis makes it difficult to correct for the apparent rate acceleration or quantify its duration.The purpose of this paper is to quantify the effect of ancestral polymorphism undera Jukes-Cantor type model applied to a randomly mating population. This is a morerestrictive setting than that considered by Peterson and Masel, but the benefit of workingin this simpler setting is that we are able to obtain simple and explicit analytic expressionsfor quantities of interest. In particular, we show that, under the model, ancestral polymor-phism contributes a term h e − µt/ to the usual Jukes-Cantor formula p ( t ) = (1 − e − µt/ ),where h is the expected heterozygosity at time t = 0, and confirm that comparisons be-tween sequences made under the incorrect assumption h = 0 lead to a divergent ratecurve of the form µ div = µ + C/t . This has consequences at long time scales as well asshort, as
C/t tends to zero comparatively slowly as t tends to infinity (in particular, moreslowly than any exponential decay), and so the effects of ancestral polymorphism may berelatively long lived. We show however that the true mutation rate µ may still be recov-ered from several observations even if the level of heterozygosity at time 0 is unknown,by estimating h simultaneously with µ using least squares. We show further that ourcorrection term also applies to other population structures (for example, island models)under an appropriate assumption on h .These results show that ancestral polymorphism, where present and unaccounted for,will be a significant contributing factor to mutation rate overestimation, with the mag-nitude of the effect approaching infinity as the time scale shrinks to zero. Our principalfinding then is that an apparent rate acceleration at short time scales is consistent with — and indeed to be expected under — a constant mutation rate, in the presence of ancestralpolymorphism that is not properly taken into account. This finding is in agreement withthat of Peterson and Masel (2009).Ancestral polymorphism is just one of several processes that are thought to contributeto the apparent rate acceleration, and is unlikely to be the sole contributing factor. How-ever, it is clearly vital that its effect be quantified and accounted for if we are to obtainmeaningful and accurate rate estimates on short time scales, and fully resolve the questionof the apparent acceleration. ORRECTING THE APPARENT RATE ACCELERATION UNDER A JUKES-CANTOR MODEL 3 The model, and our main result
We consider a population of N individuals evolving under a random mating process,with discrete generations, where at each generation the allele of each individual is re-placed with a copy of the allele from the previous generation, chosen uniformly at ran-dom (Wright, 1931). For simplicity we restrict our attention to haploid populations.We assume that alleles are r -state characters evolving under the r -state Jukes-Cantormodel (Jukes and Cantor, 1969), with an instantaneous mutation rate of µ mutations perindividual per generation per site. All mutations are assumed to be neutral. We orienttime in the forwards direction, so that populations at times t > t = 0. We refer to the population at time 0 as the reference population, andthe population at a given time t > P ( t ) be the probability that they havedifferent character states at a fixed site. Suppose that the distribution of character statesat that site in the reference population is given by π = ( π , . . . , π r ). (Note that π as usedhere is the distribution of states within the population at a given site, rather than theequilibrium distribution of the Jukes-Cantor model, which is the distribution of statesacross all sites within an individual.) Then Theorem.
The probability that uniformly randomly chosen members of the reference andcontemporary populations have different states at a given site is given by (1) P ( t ) = h e − rr − µt + r − r (1 − e − rr − µt ) , where h = P (0) = r X i =1 π i (1 − π i ) = X i X j = i π i π j is the expected heterozygosity at t = 0 . In particular, for r = 4 (as for nucleotides) we get P ( t ) = h e − µt + (1 − e − µt ) . The proof of the theorem is given in the Appendix. Note that the second term inequation (1) is the standard probability under the r -state Jukes-Cantor model that a netchange takes place at the given site, when the contemporary sequence directly descendsfrom the reference sequence — this second term is the usual form of the equation forlonger time periods. Thus the theorem adds the correction term h e − rµt/ ( r − when thesequences being compared are not necessarily direct descendants. This accounts for thevariation that is present in the reference population from past mutations that have notyet been either fixed or lost. The two expressions P ( t ) and the Jukes-Cantor mutationprobability are asymptotic, in the sense that P ( t ) r − r (1 − e − rr − µt ) = 1 + h e − rr − µtr − r (1 − e − rr − µt ) → t → ∞ .Note that the theorem requires that the population at the earlier time be used as thereference population. In particular, if populations at several different times are to becompared, then this should be done by comparing each population against the oldestpopulation, to ensure that all pairwise comparisons made involve the same value of the TUFFLEY, WHITE, HENDY AND PENNY initial heterozygosity h . This requirement may be relaxed if there is reason to believethat the heterozygosity h is constant, or nearly so. In general, we expect π (and hence h )to depend on t , but as t → ∞ we also expect π to approach an equilibrium distribution π ∞ , where the rate at which new mutations are introduced balances the rate at whichmutations are fixed or lost by the mating process. If this equilibrium is assumed to haveoccurred then equation (1) may be written in the form(2) P ( t ) = h ∞ e − rr − µt + r − r (1 − e − rr − µt ) , where h ∞ is the expected heterozygosity at equilibrium. Under this assumption any twopopulations at times t and t may be compared, using t = | t − t | .2.1. Recovering the mutation rate from observed data.
We now demonstrate howthe mutation rate µ may be estimated from observations of P ( t ). If the expected het-erozygosity is assumed to have reached equilibrium, then h ∞ may be substituted for h throughout.Equation (1) may be rearranged to the form(3) P ( t ) = r − r (cid:16) − (1 − rr − h ) e − rr − µt (cid:17) . If h is known then we may estimate µ as(4) µ = − r − rt log 1 − rr − P ( t )1 − rr − h . On the other hand, if h is not known then it will need to be estimated simultaneouslywith µ . This may be done using a least squares fit tolog (cid:0) − rr − P ( t ) (cid:1) = log(1 − rr − h ) − rr − µt ;if the least squares line is y = mt + b then we recover µ and h as µ = − r − r m, h = r − r (1 − e b ) . The divergent rate curve.
A divergent rate curve arises when we either omit oruse an incorrect value for h to estimate µ . In particular, if we ignore the existing diversityand thus use h = 0 we get the estimate µ div = − ( r −
1) log(1 − rr − P ( t )) rt (5) = µ − ( r −
1) log (cid:0) − rr − h (cid:1) rt , which adds the divergent term (1 − r ) log (cid:16) − rr − h (cid:17) rt to the estimate above. Thus, if thevariation present within the reference population is ignored we obtain a divergent ratecurve of the form µ div = µ + C/t , where C = − rr log (cid:0) − rr − h (cid:1) is a positive constantindependent of t . This rate estimate tends to infinity as t →
0, and thus becomesincreasingly inaccurate on shorter and shorter times scales.
ORRECTING THE APPARENT RATE ACCELERATION UNDER A JUKES-CANTOR MODEL 5 Results of simulations
Figure 1A shows results of simulations over microevolutionary time scales, and exhibitsthe phenomena described above. Most importantly, if we do not correct for existing geneticdiversity there is the apparent acceleration at shorter times, even though the basic neutralmutation rate µ is kept constant. This first main point then is that the apparent increasein “rate” (the divergent rate curve) is obtained at shorter periods. However, it is thenimportant that, using equation (4) and the value of h , the value of µ can still be estimatedaccurately at these shorter intervals. This is certainly as expected; it has always beenassumed that the mutation rate µ was basically constant, that it was independent of time.Thus the conclusion from Figure 1A is that by explicitly considering genetic variabilityin the ancestral (reference) population — either by using an a priori estimate of it (bluecurve), or by estimating it simultaneously using least squares (dashed black line) — it ispossible to recover the mutation rate µ . In practice, it may be that knowledge of µ (fromlonger term studies) may also be important in understanding population structure andits change over time.3.1. The simulations.
The simulations were run using the statistical package R (2011).Each simulation begins with a haploid population containing 70 individuals having alleleA and 30 individuals having allele B, giving a true h of 0.42. There are four allele typesin total, and mutations from any given allele to any of the other three take place atequal rates, as in the Jukes-Cantor model for DNA substitutions. This population is thenevolved for 1000 generations. In each generation, reproduction is simulated accordingto the Wright-Fisher process, whereby each of the 100 individuals making up the newpopulation is chosen randomly with replacement from the previous generation; each ofthese new individuals is then subjected to mutation at the rate of µ = 0 .
001 mutations perindividual per generation. For the final step in each generation, a pair of individuals —one from the current population and one from the initial population — is picked randomlyand compared; if they have different alleles, a counter for that generation is increased by1. The entire simulation was repeated 2000 times, with counters accumulating acrossruns: consequently the final counter value for generation i , after dividing by 2000, is anestimate of the probability that an individual picked randomly from the initial populationdiffers from an individual picked randomly from generation i . These relative frequenciesare used to estimate the mutation rate parameter µ in several different ways, as we nowdescribe.3.2. Estimating µ . Three different approaches to estimating µ are shown in Figure 1A.The dashed pink curve shows the result of estimating µ while incorrectly assuming h = 0(the assumption made by most previous studies); the solid blue curve shows the resultof estimating µ assuming h to be its true value, 0.42. The difference is plain, particu-larly at early times when the probability of observing a difference is dominated by theheterozygosity of the initial population. To demonstrate the limited accuracy achievableusing the incorrect assumption that h = 0, the µ curve that would be estimated froman infinite number of observations (instead of 2000) is shown as a dashed green line — itis scarcely better, indicating that sampling error is not the problem. The fact that thepink curve tracks the dashed green curve also indicates that, as expected, the simulationfits our theoretical divergent curve µ div of equation (5). Note that the erratic behaviour TUFFLEY, WHITE, HENDY AND PENNY of the blue curve near 0 is due to numerical instability, as the calculations there involvelogs and ratios of numbers close to zero.Both of the above estimation procedures can estimate µ using a probability estimatefrom a single time point, but assume h to be known a priori . Where this is not the case, h can be estimated simultaneously with µ using least squares, which requires probabilityestimates from multiple time points. The result of estimating a single value for µ usingall 1001 time points is shown as a horizontal dashed black line in Figure 1A.3.3. Other population structures.
Our corrected formula extends to cases where thepopulation consists of multiple subpopulations that interact via arbitrary migration rates,provided that the initial distribution of states is the same for each subpopulation. Fig-ure 1B shows the accurate recovery of µ and h for a simulated population consistingof two islands, each containing 70 individuals having allele A and 30 individuals havingallele B, that mix sporadically and asymmetrically: in each generation, the probabilitythat an individual in subpopulation 1 comes from subpopulation 2 is just 1%, while theprobability that an individual in subpopulation 2 comes from subpopulation 1 is 10%.Adding this limited kind of population structure does not alter the fact that the proba-bility distribution of states for the ancestor of an individual chosen randomly at time t remains equal to that for an individual chosen randomly from the initial population, sothe probability p diff that the ancestral and reference states differ is still given by h .If the initial state distributions differ across subpopulations, then the probability thatthe ancestor of a randomly-chosen contemporary individual has a particular state is nolonger constant, so h in equation (1) must be replaced with a function of time, p diff ( t ).Figure 1C shows results from a simulation of one such scenario. There are two islands,each containing 100 individuals: the first contains 70 individuals having allele A and 30individuals having allele B, the second contains 25 individuals with each of the four alleles.Migration rates are as for Figure 1B. The blue curve, which shows an attempt to estimate µ using equation (4) and assuming that h = 0 . p diff is constant and equal to h .Despite the fact that p diff is nonconstant when the subpopulations have different initialstate distributions, our approach is still useful here. Under reasonable conditions on mi-gration probabilities, the probability that the ancestor of a randomly-chosen individualhas a given state converges towards an equilibrium value as t → ∞ , and so p diff will ap-proach an equilibrium value also. In particular, when estimating µ and h simultaneouslyvia least squares, only the estimate of h is affected by different initial state distributionsacross subpopulations: as expected, the dashed black line depicting the least squares esti-mate of µ in Figure 1C is close to the true value of µ . Since µ is usually the parameter ofinterest, this means that our model can still be used for inference with these more generalpopulation structures. Note that any edge-weighted graph describing mating probabili-ties within a population can be represented as a multi-island model in which each islandcontains a single individual. 4. Conclusions
The Jukes-Cantor model is one of the oldest and simplest substitution models, andthe benefit of studying ancestral polymorphism in this simple setting is that we are able
ORRECTING THE APPARENT RATE ACCELERATION UNDER A JUKES-CANTOR MODEL 7 − . − . . . . Generation m ( m u t a t i on s pe r i nd i v i dua l pe r s i t e pe r gene r a t i on ) m estimated using Eqn 4 (correctly assuming h = m = m estimated as 0.001047 using least squares (h simultaneously estimated as 0.4179) m estimated using Eqn 5 (incorrectly assuming h = m estimated from perfect data using Eqn 5 Figure 1A: Estimated mutation rateN = , m = , h = , , Figure 1.
Estimated mutation rates against time for three differentpopulation structures. (1A): A single population initially containing70 individuals having allele A and 30 individuals having allele B; (1B):A 2-island model, both of whose subpopulations are initially the sameas for (1A), with 1% migration per generation from island 1 to island2, and 10% migration per generation in the reverse direction; (1C): A2-island model, whose first subpopulation is the same as for (1A), andwhose second subpopulation contains 25 individuals of each of the fourstates. The graphs show the results of simulations of four state data.
Dashed red line: the value of the mutation rate µ used. Blue curve: µ estimated at each time point from simulated data trans-formed using equation (4) and the correct value for the ancestral het-erozygosity h . Pink curve: simulated data transformed using the incorrect transform(equation (5)) to estimate µ , giving a divergent rate curve. Dashed green curve (1A, 1B): theoretical divergent rate curve, ob-tained by transforming perfect data (equation (1)) according to equa-tion (5). This assumes an incorrect value of h = 0. Dashed black line: the least squares estimate of µ from the simulateddata. This uses no knowledge of h or µ (unlike the blue curve).Note that the blue curve behaves poorly near 0, where the calculations arenumerically unstable as they involve logs and ratios of numbers close tozero. TUFFLEY, WHITE, HENDY AND PENNY − . − . . . . Generation m ( m u t a t i on s pe r i nd i v i dua l pe r s i t e pe r gene r a t i on ) m estimated using Eqn 4 (correctly assuming h = m = m estimated as 0.001015 using least squares (h simultaneously estimated as 0.4251) m estimated using Eqn 5 (incorrectly assuming h = m estimated from perfect data using Eqn 5 Figure 1B: Estimated mutation rate for 2−island model,equal initial distributionsN = , m = , h = , , to obtain explicit analytic expressions for quantities of interest, including the expecteddivergence between two sequences, and the uncorrected rate curve µ div . Moreover, manymore complex and realistic models include the Jukes-Cantor model as a special case,so while our precise quantitative findings are only supported under the limited modelconsidered here, there are clear qualitative implications for these more general models.Given the rapid advance in DNA sequencing technology we expect a large increasein sequences that have diverged in recent, intermediate and longer times. It is there-fore essential to be able to generalise results to a full range of divergence times — it isno longer appropriate to consider separately shorter term microevolutionary and longerterm macroevolutionary studies. Our results show that a significant contributing fac-tor to the apparent acceleration at shorter times is the pre-existing diversity within apopulation, and this can be estimated for a wide range of population sizes and struc-tures (Charlesworth and Charlesworth, 2010). By considering the expected diversity ina population we show how the real rate of neutral mutation can be estimated at shortertimes, using either an appropriate estimate of genetic diversity or data from multiple timepoints. As the timescales of traditional phylogenetics and population genetics draw closertogether, we anticipate seeing an increasing emphasis on the careful handling of geneticdiversity in phylogenetic analyses; the work presented here can provide a starting pointfor the exploration of more general models involving more complex substitution processes,time-varying population sizes, and other effects. ORRECTING THE APPARENT RATE ACCELERATION UNDER A JUKES-CANTOR MODEL 9 − . − . . . . Generation m ( m u t a t i on s pe r i nd i v i dua l pe r s i t e pe r gene r a t i on ) m estimated using Eqn 4 (correctly assuming h = m = m estimated as 0.0009843 using least squares (h simultaneously estimated as 0.609) m estimated using Eqn 5 (incorrectly assuming h = Figure 1C: Estimated mutation rate for 2−island model,unequal initial distributionsN = , m = , h = , , Appendix A. Proof of the main result
The individual chosen from the contemporary population descends from a unique mem-ber of the reference population, and this ancestor is equally likely to be any member ofthe reference population. We may therefore calculate the probability that the contempo-rary and reference states differ by considering two cases: the ancestral state agrees withthe reference state, but evolves to disagree; and the ancestral state disagrees with thereference state, and evolves so as to remain in disagreement at time t .Under the r -state Jukes-Cantor model the probability that there is a net transitionfrom state i to state j = i in time t is given by p ( t ) = r (1 − e − rr − µt ) . The probability that the ancestral state agrees with the reference state but evolves todisagree is therefore given by(6) X i π i ( r − p ( t ) = ( r − p ( t ) X i π i , while the probability that both the ancestral and contemporary states differ from thereference is given by X i X j = i π i π j (1 − p ( t )) = (1 − p ( t )) h . (7) Observe that 1 = X i π i ! = X i π i + X i X j = i π i π j = h + X i π i , so P i π i = 1 − h . Hence summing equations (6) and (7) we get P ( t ) = (1 − h )( r − p ( t ) + h (1 − p ( t ))= ( r − p ( t ) + h (1 − p ( t ) − ( r − p ( t ))= ( r − p ( t ) + h (1 − rp ( t ))= r − r (1 − e − rr − µt ) + h e − rr − µt , as claimed. References
B. Charlesworth and D. Charlesworth.
Elements of evolutionary genetics . Roberts andCo., Greenwood Village, Colo., 2010.D. Charlesworth. Don’t forget the ancestral polymorphisms.
Heredity , 105:509–510, 2010.W. M. Fitch and W. R. Atchley. Evolution in inbred strains of mice appears rapid.
Science , 228(4704):1169–1175, 1985.J. Garc´ıa-Moreno. Is there a universal mtDNA clock for birds?
Journal of Avian Biology ,35(6):465–468, 2004.Martin J. Genner, Ole Seehausen, David H. Lunt, Domino A. Joyce, Paul W. Shaw,Gary R. Carvalho, and George F. Turner. Age of cichlids: New dates for ancient lakefish radiations.
Mol. Biol. Evol. , 24(5):1269–1282, 2007.B. M. Henn, C. R. Gignoux, M. W. Feldman, and J. L. Mountain. Characterizing thetime dependency of human mitochondrial DNA mutation rate estimates.
MolecularBiology and Evolution , 26(1):217–230, 2009.S. Y. W. Ho, M. J. Phillips, A. Cooper, and A. J. Drummond. Time dependency of molec-ular rate estimates and systematic overestimation of recent divergence times.
MolecularBiology and Evolution , 22(7):1561–1568, 2005.S. Y. W. Ho, R. Lanfear, L. Bromham, M. J. Phillips, J. Soubrier, A. G. Rodrigo, andA. Cooper. Time-dependent rates of molecular evolution.
Molecular Ecology , 20(15):3087–3101, 2011.T. H. Jukes and C. R. Cantor. Evolution of protein molecules. In H. N. Munro and J. B.Allison, editors,
Mammalian protein metabolism , pages 21–123. Academic Press, NewYork, 1969.M. Kimura. Evolutionary rate at the molecular level.
Nature , 217(5129):624–626, 1968.M. Kimura.
The neutral theory of molecular evolution . Cambridge University Press, 1983.C. D. Millar, A. Dodd, J. Anderson, G. C. Gibb, P. A. Ritchie, C. Baroni, M. D. Wood-hams, M. D. Hendy, and D. M. Lambert. Mutation and evolutionary rates in ad´eliepenguins from the antarctic.
PLoS Genetics , 4(10), 2008.D. Penny. Evolutionary biology: Relativity for molecular clocks.
Nature , 436(7048):183–184, 2005.G. I. Peterson and J. Masel. Quantitative prediction of molecular clock and K a /K s atshort timescales. Molecular Biology and Evolution , 26(11):2595–2603, 2009.
ORRECTING THE APPARENT RATE ACCELERATION UNDER A JUKES-CANTOR MODEL 11
R Development Core Team.
R: A Language and Environment for Statistical Computing .R Foundation for Statistical Computing, Vienna, Austria, 2011.S. Wright. Evolution in Mendelian populations.
Genetics , 16(2):97–159, 1931.
Institute of Fundamental Sciences, Massey University, Private Bag 11222, PalmerstonNorth 4442, New Zealand
E-mail address : [email protected]@massey.ac.nz